Draft Writeup of April 22, 1996

REGULAR EXPRESSIONS

In the AMPL string functions match, sub and gsub, the second argument is taken to represent a regular expression. If it contains certain special characters, it is interpreted as a pattern that may match many sub-strings.

The third argument of sub and gsub serves as a replacement pattern that may use some specially interpreted characters to stand for all or part of the matched string.

Detailed rules for forming and interpreting regular expressions and replacement patterns are given below. Like many such constructions, however, they are often better learned by example; tables of examples follow the formal rule statements.


Regular expression rules

A regular expression is a character string that serves to specify a collection of strings. A member of this collection of strings is said to be matched by the regular expression.

A regular expression may contain any character except "newline". The specially interpreted metacharacters are:

	. * + ? [ ] ( ) | \ ^ $

In formal terms, the syntax for a regular expression e0 is given by:

	e3:
	    literal
	    charclass
	    ^
	    $
	    .
	    ( e0 )

	literal:
	    non-metacharacter
	    \ metacharacter

	charclass:
	    [ class-string ]

	e2:
	    e3
	    e2 repeater

	repeater:
	    *
	    +
	    ?

	e1:
	    e2
	    e1 e2

	e0:
	    e1
	    e0 | e1

A literal matches one character, either itself (if not a metacharacter) or the metacharacter that follows \.
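
The literal rule can be tried out with a small sketch. It uses Python's re module as a stand-in, not AMPL itself; the two dialects are assumed, not guaranteed, to agree on these simple cases. The first pattern is taken from the examples table in this writeup, and the variable names are illustrative only.

```python
import re

# Pattern from the examples table: each metacharacter is preceded by \.
escaped = r'book \(\$62\.50\)'

# With every metacharacter escaped, only the literal text is matched.
literal_hit = re.fullmatch(escaped, 'book ($62.50)') is not None

# An unescaped . is a metacharacter, so it accepts other characters too;
# the escaped form \. accepts only a period.
loose_hit = re.fullmatch(r'62.50', '62x50') is not None
strict_hit = re.fullmatch(r'62\.50', '62x50') is not None

print(literal_hit, loose_hit, strict_hit)
```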

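A charclass matches any character in class-string, with exceptions for two characters that are specially interpreted:

	-    between two characters, denotes the range from the first through the second
	^    at the beginning of class-string, makes the charclass match any character not in class-string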

For example, '[a-z]' matches all lower-case letters, and '[^a-zA-Z_]' matches all characters except letters and underscore. (It is optional to put a \ before a metacharacter in a class-string, except before - and ] and before ^ at the beginning of the string.)
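
As a hedged illustration of character classes, the following sketch tests single characters against the two classes just mentioned, plus one class with an escaped -. Python's re module is used here as an assumed approximation of AMPL's behavior.

```python
import re

lower = re.compile(r'[a-z]')            # any one lower-case letter
not_ident = re.compile(r'[^a-zA-Z_]')   # anything except letters and underscore
dash = re.compile(r'[a\-z]')            # escaped -, so just a, - or z

lower_hits = [c for c in 'aZ_9' if lower.fullmatch(c)]
other_hits = [c for c in 'aZ_9-' if not_ident.fullmatch(c)]
dash_hits = [c for c in 'a-mz' if dash.fullmatch(c)]

print(lower_hits, other_hits, dash_hits)
```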

The following characters match in special ways:

	^    matches the beginning of a string
	$    matches the end of a string
	.    matches any single character
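
The behavior of ^, $ and . can be sketched with the salsa and b..k rows of the examples table. This is again Python's re module standing in for AMPL, on the assumption that the two agree for strings without newlines.

```python
import re

start = re.search(r'^sa', 'salsa')   # sa at the start of the string
end = re.search(r'sa$', 'salsa')     # sa at the end of the string
print(start.span(), end.span())

# Each . stands for exactly one character, whatever it is.
candidates = ['book', 'b23k', 'bk', 'books']
dot_hits = [s for s in candidates if re.fullmatch(r'b..k', s)]
print(dot_hits)
```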

The repeater operators match some number of instances of the preceding regular expression e2:

	*    zero or more
	+    one or more
	?    zero or one

A concatenated regular expression, e1 e2, matches a match to e1 followed by a match to e2.

An alternative regular expression, e0 | e1, matches either a match to e0 or a match to e1.
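
The grammar implies that a repeater binds more tightly than concatenation, which in turn binds more tightly than |. The sketch below illustrates that precedence and the effect of parentheses. It relies on Python's re module, which is assumed to group these operators the same way; the helper matches is hypothetical and exists only for this example.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ['ab', 'abb', 'abab', 'a', 'cd', 'abd', 'acd']

star_on_b = matches(r'ab*', words)       # repeater applies to b alone
star_on_ab = matches(r'(ab)*', words)    # parentheses make ab the repeated unit
either = matches(r'ab|cd', words)        # | separates whole concatenations
grouped = matches(r'a(b|c)d', words)     # parentheses confine the alternatives

print(star_on_b, star_on_ab, either, grouped)
```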

Within a given string, a regular expression may match more than one substring. In such a case the longest match, roughly speaking, is taken. More precisely, a match to any part of a regular expression extends as far as possible without preventing a match to the remainder of the regular expression.
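
The "as far as possible" rule for repeaters can be seen in the sketch below, which uses Python's re module as an approximation. One caveat is marked in the code: for alternatives, Python takes the first alternative that succeeds rather than the longest, so that last line shows Python's behavior only and should not be read as a statement about AMPL.

```python
import re

text = 'book and back'

# .* extends as far as it can while still leaving a final k to match.
greedy = re.search(r'b.*k', text).group()

# [a-z]* cannot cross the space, so the match stops at the first word.
bounded = re.search(r'b[a-z]*k', text).group()

# The first group takes everything except the last k, which the rest needs.
first, rest = re.search(r'(b.*)(k.*)', text).groups()

# Caveat (Python only): the first successful alternative wins.
python_alt = re.match(r'a|ab', 'ab').group()

print(repr(greedy), repr(bounded), repr(first), repr(rest), repr(python_alt))
```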


Examples of regular expressions

The following table lists representative regular expressions and the strings they match.

Regular expressions are shown quoted, as they would appear when given as arguments to sub, gsub or match. As AMPL string constants, they may be delimited by a pair of single quotes (') or double quotes ("). Within a string delimited by single quotes, a single quote is represented by two single quotes, and similarly for double quotes.
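
The quoting rule is mechanical enough to sketch in a few lines. The function ampl_quote below is a hypothetical Python helper, not part of AMPL, that writes a string as an AMPL string constant by doubling any embedded delimiter, as described above.

```python
def ampl_quote(text, delim="'"):
    """Write text as an AMPL string constant, doubling embedded delimiters."""
    if delim not in ("'", '"'):
        raise ValueError('delimiter must be a single or double quote')
    return delim + text.replace(delim, delim * 2) + delim

single = ampl_quote("AMPL's syntax")         # single-quoted form
double = ampl_quote("AMPL's syntax", '"')    # double-quoted form
print(single)
print(double)
```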

Regular expression      Matches

'AMPL book'             AMPL book
'AMPL''s syntax'        AMPL's syntax
"AMPL's syntax"         AMPL's syntax
'book \(\$62\.50\)'     book ($62.50)
'[ampl]'                a or m or p or l
'^sa'                   sa at start of salsa
'sa$'                   sa at end of salsa
'b..k'                  book or back or beak or b23k etc.
'b.*k'                  bk or b3k or book or break etc.
'b[a-z]*k'              bk or book or break etc.
'b.+k'                  b3k or book or break etc.
'b[aeiou]+k'            buk or book or beak etc.
'b.?k'                  bk or bak or b3k or b_k etc.
'b.k|b..k'              bak or b3k or back or b32k etc.
'a(b.k|b..k)'           ab3k or aback etc.
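
Several rows of the table can be checked mechanically. The sketch below copies the b...k rows into a Python dictionary and tests each listed string with re.fullmatch; it assumes Python's re module agrees with AMPL on these patterns. The near_misses entries are additional strings chosen for this example, not taken from the table.

```python
import re

# Rows copied from the table: pattern -> strings listed as matches.
table = {
    r'b..k': ['book', 'back', 'beak', 'b23k'],
    r'b.*k': ['bk', 'b3k', 'book', 'break'],
    r'b[a-z]*k': ['bk', 'book', 'break'],
    r'b.+k': ['b3k', 'book', 'break'],
    r'b[aeiou]+k': ['buk', 'book', 'beak'],
    r'b.?k': ['bk', 'bak', 'b3k', 'b_k'],
    r'b.k|b..k': ['bak', 'b3k', 'back', 'b32k'],
    r'a(b.k|b..k)': ['ab3k', 'aback'],
}

# Any (pattern, string) pair that fails to match would be collected here.
failures = [(p, s) for p, strings in table.items()
            for s in strings if not re.fullmatch(p, s)]

# Strings that each pattern should reject (illustrative additions).
near_misses = {r'b.+k': 'bk', r'b[a-z]*k': 'b3k', r'b[aeiou]+k': 'back'}
rejected = [p for p, s in near_misses.items() if re.fullmatch(p, s) is None]

print(failures, len(rejected))
```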


Replacement pattern rules

In the replacement pattern (third argument) for sub and gsub, & or \0 stands for the whole string matched by the regular expression (second argument). The substrings \1, \2, . . . , \9 stand for the strings matched by the first, second, . . . , ninth parenthesized sub-expression within the regular expression.
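
To make the roles of &, \0 and \1 through \9 concrete, the sketch below expands a replacement pattern against a single match. The function expand is hypothetical and written in Python for illustration. It assumes that every referenced sub-expression exists in the regular expression, and it does not attempt to handle a literal &, which the rules above do not cover.

```python
import re

def expand(template, m):
    """Expand &, \\0 and \\1..\\9 in template using the match object m."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == '&':
            out.append(m.group(0))                 # whole matched string
        elif ch == '\\' and i + 1 < len(template) and template[i + 1] in '0123456789':
            # \0 is the whole match, \n the n-th parenthesized sub-expression
            out.append(m.group(int(template[i + 1])) or '')
            i += 1                                 # skip the digit
        else:
            out.append(ch)                         # ordinary character
        i += 1
    return ''.join(out)

m = re.search(r'e([a-z]*)e([a-z]*)e', 'replacement')
swapped = expand(r'e{\2}e{\1}e', m)
whole = expand('{&}', re.search(r'([ae])', 'replacement'))
print(swapped, whole)
```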


Examples of replacement patterns

The following table presents some representative examples of replacement patterns employed in the sub and gsub functions.

Substitution                                    Result

sub('replacement','e','X')                      'rXplacement'
gsub('replacement','e','X')                     'rXplacXmXnt'
gsub('replacement','e','')                      'rplacmnt'
gsub('replacement','([ae])','{&}')              'r{e}pl{a}c{e}m{e}nt'
sub('replacement',
    'e([a-z]*)e([a-z]*)e','e{\2}e{\1}e')        're{m}e{plac}ent'
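
For readers who want to experiment, the table's results can be reproduced with Python's re.sub, offered here only as an analogy. Three differences are assumptions of this sketch rather than facts about AMPL: Python takes the pattern and replacement before the subject string, count=1 is used to imitate sub (the default replaces every match, like gsub), and Python writes \g<0> where AMPL writes &.

```python
import re

word = 'replacement'

first_only = re.sub(r'e', 'X', word, count=1)    # like sub: first match only
every = re.sub(r'e', 'X', word)                  # like gsub: every match
deleted = re.sub(r'e', '', word)                 # empty replacement deletes
wrapped = re.sub(r'([ae])', r'{\g<0>}', word)    # \g<0> plays the role of &
swapped = re.sub(r'e([a-z]*)e([a-z]*)e', r'e{\2}e{\1}e', word, count=1)

print(first_only, every, deleted, wrapped, swapped)
```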





LAST MODIFIED 22 APRIL 1996 BY 4er.