Practical Programming in Tcl & Tk, Third Edition
By Brent B. Welch

Chapter 11.  Regular Expressions


Regular Expression Syntax

This section describes the basics of regular expression patterns, which are found in all versions of Tcl. There are occasional references to features added by advanced regular expressions, but they are covered in more detail starting on page 138. Regular expressions have enough syntax that it takes five tables to summarize all the options. These tables appear together starting on page 145.

A regular expression is a sequence of the following items:

  • A literal character.

  • A matching character, character set, or character class.

  • A repetition quantifier.

  • An alternation clause.

  • A subpattern grouped with parentheses.

Matching Characters

Most characters simply match themselves. The following pattern matches an a followed by a b:

ab

The general wild-card character is the period, ".". It matches any single character. The following pattern matches an a followed by any character:

a.

Remember that matches can occur anywhere within a string; a pattern does not have to match the whole string. You can change that by using anchors, which are described on page 137.
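As a minimal sketch of these two patterns in use, the following calls the regexp command, which returns 1 if the pattern matches anywhere in the string and 0 otherwise. The sample strings and variable names are illustrative only.

```tcl
# A literal pattern can match in the middle of a string.
set r1 [regexp {ab} "cabbage"]     ;# 1: "ab" occurs inside "cabbage"
set r2 [regexp {ab} "a-b"]         ;# 0: the a is not immediately followed by b

# The period matches any single character; here it matches the y.
set r3 [regexp {a.} "xay" match]   ;# 1, and match is set to "ay"
puts "$r1 $r2 $r3 $match"
```

Putting the pattern in curly braces keeps the Tcl parser from interpreting any of its characters before the regular expression engine sees them.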

Character Sets

The matching character can be restricted to a set of characters with the [xyz] syntax. Any of the characters between the two brackets is allowed to match. For example, the following matches either Hello or hello:

[Hh]ello

The matching set can be specified as a range over the character set with the [x-y] syntax. The following matches any digit:

[0-9]

You can also specify the complement of a set, so that the matching character can be anything except what is in the set. This is achieved with the [^xyz] syntax. Ranges and complements can be combined. The following matches anything except the uppercase and lowercase letters:

[^a-zA-Z]
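The three kinds of sets can be tried out with regexp, as in the sketch below. The input strings are made up for illustration, and the third argument names a variable that receives the matching text.

```tcl
# A simple set: either H or h, followed by ello.
set found [regexp {[Hh]ello} "Say hello"]   ;# 1

# A range: the first digit in the string is matched.
regexp {[0-9]} "room 42" digit              ;# digit is "4"

# A complemented range: the first character that is not a letter.
regexp {[^a-zA-Z]} "abc_def" other          ;# other is "_"
puts "$found $digit $other"
```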

Using special characters in character sets.


If you want a ] in your character set, put it immediately after the initial opening bracket. You do not need to do anything special to include [ in your character set. The following matches any square bracket or curly brace:

[][{}]

Most regular expression syntax characters are no longer special inside character sets. This means you do not need to backslash anything inside a bracketed character set except for backslash itself. The following pattern matches several of the syntax characters used in regular expressions:

[][+*?()|\\]
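One way to check these two sets is to count their matches with the -all option of regexp, which returns the number of matches found. This is only a sketch; the sample strings are invented, and they are written in curly braces so that Tcl does not treat the square brackets or the backslash specially.

```tcl
# Count the brackets and braces in a string: [ ] { } gives 4.
set nbrackets [regexp -all {[][{}]} {a[1] {b}}]

# Count the syntax characters: ( + ) * ? | \ gives 7.
set nsyntax [regexp -all {[][+*?()|\\]} {(a+b)*c?|\d}]
puts "$nbrackets $nsyntax"
```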

Advanced regular expressions add names and backslash escapes as shorthand for common sets of characters like white space, alpha, alphanumeric, and more. These are described on page 139 and listed in Table 11-3 on page 146.

Quantifiers

Repetition is specified with *, for zero or more, +, for one or more, and ?, for zero or one. These quantifiers apply to the previous item, which is either a matching character, a character set, or a subpattern grouped with parentheses. The following matches a string that contains b followed by zero or more a's:

ba*

You can group part of the pattern with parentheses and then apply a quantifier to that part of the pattern. The following matches a string that has one or more sequences of ab:

(ab)+

The pattern that matches anything, even the empty string, is:

.*

These quantifiers have a greedy matching behavior: They match as many characters as possible. Advanced regular expressions add nongreedy matching, which is described on page 140. For example, a pattern to match a single line might look like this:

.*\n

However, as a greedy match, this will match all the lines in the input, ending with the last newline in the input string. The following pattern matches up through the first newline:

[^\n]*\n

We will shorten this pattern even further on page 140 by using nongreedy quantifiers. There are also special newline sensitive modes you can turn on with some options described on page 143.
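The difference between the two line-matching patterns can be seen in a short sketch like the following. It assumes Tcl 8.1 or later, where \n in a pattern means newline, and uses a made-up three-line input.

```tcl
set text "line one\nline two\nline three\n"

# Greedy: .* runs to the last newline, so the match is the whole input.
regexp {.*\n} $text greedy

# Excluding newline from the repeated set stops at the first newline.
regexp {[^\n]*\n} $text first

puts [string equal $greedy $text]        ;# 1
puts [string equal $first "line one\n"]  ;# 1
```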

Alternation

Alternation lets you test more than one pattern at the same time. The matching engine is designed to be able to test multiple patterns in parallel, so alternation is efficient. Alternation is specified with |, the pipe symbol. Another way to match either Hello or hello is:

hello|Hello

You can also write this pattern as:

(h|H)ello

or as:

[hH]ello
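As a quick check that the three forms accept the same strings, the sketch below runs each pattern against the same sample inputs. The loop and variable names are illustrative.

```tcl
set results {}
foreach pattern {{hello|Hello} {(h|H)ello} {[hH]ello}} {
    # Each pattern should accept Hello and hello, and reject jello.
    lappend results \
        [regexp $pattern "Hello, world"] \
        [regexp $pattern "hello again"] \
        [regexp $pattern "jello"]
}
puts $results   ;# 1 1 0 1 1 0 1 1 0
```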

Anchoring a Match

By default a pattern does not have to match the whole string. There can be unmatched characters before and after the match. You can anchor the match to the beginning of the string by starting the pattern with ^, or to the end of the string by ending the pattern with $. You can force the pattern to match the whole string by using both. All strings that begin with spaces or tabs are matched with:

^[ \t]+

If you have many text lines in your input, you may be tempted to think of ^ as meaning "beginning of line" instead of "beginning of string." By default, the ^ and $ anchors are relative to the whole input, and embedded newlines are not treated specially. Advanced regular expressions support options that make the ^ and $ anchors line-oriented. They also add the \A and \Z anchors that always match the beginning and end of the string, respectively.
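The following sketch exercises the anchors with default options. It assumes Tcl 8.1 or later so that \t in the pattern means tab, and the digit pattern and test strings are examples chosen for illustration.

```tcl
# Anchored at the start: leading white space is required.
set a1 [regexp {^[ \t]+} "   indented"]    ;# 1
set a2 [regexp {^[ \t]+} "not indented"]   ;# 0: the space is not at the start

# Anchored at both ends: the whole string must be digits.
set a3 [regexp {^[0-9]+$} "12345"]         ;# 1
set a4 [regexp {^[0-9]+$} "123 45"]        ;# 0

# By default ^ means start of string, not start of line.
set a5 [regexp {^two} "one\ntwo"]          ;# 0
puts "$a1 $a2 $a3 $a4 $a5"
```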

Backslash Quoting

Use the backslash character to turn off these special characters:

. * ? + [ ] ( ) ^ $ | \

For example, to match the plus character, you will need:

\+

Remember that this quoting is not necessary inside a bracketed expression (i.e., a character set definition). For example, to match either plus or question mark, either of these patterns will work:

(\+|\?)
[+?]

To match a single backslash, you need two. You must do this everywhere, even inside a bracketed expression. Alternatively, you can use \B, which was added as part of advanced regular expressions. Both of these match a single backslash:

\\
\B
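Here is a small sketch of these quoting rules. It assumes Tcl 8.1 or later for \B, and the path value is a made-up example written in curly braces so that Tcl leaves its backslash alone.

```tcl
# A quoted plus matches a literal plus sign.
set q1 [regexp {\+} "1+1"]            ;# 1

# Two ways to match either a plus or a question mark.
set q2 [regexp {^(\+|\?)$} "?"]       ;# 1
set q3 [regexp {^[+?]$} "+"]          ;# 1

# Two ways to match a single backslash; each finds one in the path.
set path {C:\temp}
set q4 [regexp -all {\\} $path]       ;# 1
set q5 [regexp -all {\B} $path]       ;# 1
puts "$q1 $q2 $q3 $q4 $q5"
```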

Unknown backslash sequences are an error.


Versions of Tcl before 8.1 ignored unknown backslash sequences in regular expressions. For example, \= was just =, and \w was just w. Even \n was just n, which was probably frustrating to many beginners trying to get a newline into their pattern. Advanced regular expressions add backslash sequences for tab, newline, character classes, and more. This is a convenient improvement, but in rare cases it may change the semantics of a pattern. Usually these cases are where an unneeded backslash suddenly takes on meaning, or causes an error because it is unknown.

Matching Precedence

If a pattern can match several parts of a string, the matcher takes the match that occurs earliest in the input string. Then, if there is more than one match from that same point because of alternation in the pattern, the matcher takes the longest possible match. The rule of thumb is: first, then longest. This rule gets changed by nongreedy quantifiers that prefer a shorter match.

Watch out for *, which means zero or more, because zero of anything is pretty easy to match. Suppose your pattern is:

[a-z]*

This pattern will match against 123abc, but not in the way you might expect. Instead of matching on the letters in the string, the pattern will match on the zero-length substring at the very beginning of the input string! This behavior can be seen by using the -indices option of the regexp command described on page 148. This option tells you the location of the matching string instead of the value of the matching string.
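The sketch below shows both halves of the precedence rule. The alternation pattern and the variable names are illustrative additions. With -indices, a zero-length match is reported with an end index that is one less than its start index.

```tcl
# First: the earliest match wins. Then longest: among the
# alternatives that match at that point, the longest is taken.
regexp {a|ab|abc} "xabcd" longest          ;# longest is "abc"

# Zero or more letters matches the empty string at index 0.
regexp -indices {[a-z]*} "123abc" empty    ;# empty is "0 -1"

# One or more letters forces the match onto the letters.
regexp -indices {[a-z]+} "123abc" letters  ;# letters is "3 5"
puts "$longest / $empty / $letters"
```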

Capturing Subpatterns

Use parentheses to capture a subpattern. The string that matches the pattern within parentheses is remembered in a matching variable, which is a Tcl variable that gets assigned the string that matches the pattern. Suppose you want to get everything between the <td> and </td> tags in some HTML. You can use this pattern:

<td>([^<]*)</td>

The matching variable gets assigned the part of the input string that matches the pattern inside the parentheses. You can capture many subpatterns in one match, which makes it a very efficient way to pick apart your data. Matching variables are explained in more detail on page 148 in the context of the regexp command.
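As a sketch of how matching variables are filled in, the following applies the pattern to a made-up table row. The first variable after the string receives the whole match, and each later variable receives one parenthesized subpattern; the names whole, cell, name, and version are illustrative.

```tcl
set html {<tr><td>Tcl</td><td>8.0</td></tr>}

# One subpattern: capture the contents of the first cell.
regexp {<td>([^<]*)</td>} $html whole cell
puts $whole   ;# <td>Tcl</td>
puts $cell    ;# Tcl

# Two subpatterns in one match pick apart both cells at once.
regexp {<td>([^<]*)</td><td>([^<]*)</td>} $html whole name version
puts "$name $version"   ;# Tcl 8.0
```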

Sometimes you need to introduce parentheses but you do not care about the match that occurs inside them. The pattern is slightly more efficient if the matcher does not need to remember the match. Advanced regular expressions add noncapturing parentheses with this syntax:

(?:pattern)
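A brief sketch, assuming Tcl 8.1 or later, compares the two kinds of parentheses. The pattern and variable names are invented for illustration. With noncapturing parentheses, the group is still repeated by the quantifier but does not use up a matching variable.

```tcl
# Capturing group: rep receives the last repetition of ab.
regexp {(ab)+(c)} "ababc" whole rep last
puts "$whole $rep $last"    ;# ababc ab c

# Noncapturing group: the first matching variable is now the c.
regexp {(?:ab)+(c)} "ababc" whole2 last2
puts "$whole2 $last2"       ;# ababc c
```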
