www.gibmonks.com

  Previous section   Next section

Practical Programming in Tcl & Tk, Third Edition
By Brent B. Welch

Table of Contents
Chapter 11.  Regular Expressions


Transforming Data to Program with regsub

One of the most powerful combinations of Tcl commands is regsub and subst. This section describes a few examples that use regsub to transform data into Tcl commands, and then use subst to replace those commands with a new version of the data. This technique is very efficient because it relies on two subsystems that are written in highly optimized C code: the regular expression engine and the Tcl parser. These examples are primarily written by Stephen Uhler.

URL Decoding

When a URL is transmitted over the network, it is encoded by replacing special characters with a %xx sequence, where xx is the hexadecimal code for the character. In addition, spaces are replaced with a plus (+). It would be tedious and very inefficient to scan a URL one character at a time with Tcl statements to undo this encoding. It would be more efficient to do this with a custom C program, but still very tedious. Instead, a combination of regsub and subst can efficiently decode the URL in just a few Tcl commands.

Replacing the + with spaces requires quoting the + because it is the one-or-more special character in regular expressions:

regsub -all {\+}$url {} url

The %xx are replaced with a format command that will generate the right character:

regsub -all {%([0-9a-hA-H][0-9a-hA-H])} $url \
    {[format %c 0x\1]} url

The %c directive to format tells it to generate the character from a character code number. We force a hexadecimal interpretation with a leading 0x. Advanced regular expressions let us write the "2 hex digits" pattern a bit more cleanly:

regsub -all {%([[:xdigit:]]{2})} $url \
    {[format %c 0x\1]} url

The resulting string is passed to subst to get the format commands substituted:

set url [subst $url]

For example, if the input is %7ewelch, the result of the regsub is:

[format %c 0x7e]welch

And then subst generates:

~welch

Example 11-5 encapsulates this trick in the Url_Decode procedure.

Example 11-5 The Url_Decode procedure.
proc Url_Decode {url} {
   regsub -all {\+} $url {} url
   regsub -all {%([:xdigit:]]{2})} $url \
      {[format %c 0x\1]} url
   return [subst $url]
}

CGI Argument Parsing

Example 11-6 builds upon Url_Decode to decode the inputs to a CGI program that processes data from an HTML form. Each form element is identified by a name, and the value is URL encoded. All the names and encoded values are passed to the CGI program in the following format:

name1=value1&name2=value2&name3=value3

Example 11-6 shows Cgi_List and Cgi_Query. Cgi_Query receives the form data from the standard input or the QUERY_STRING environment variable, depending on whether the form data is transmitted with a POST or GET request. These HTTP operations are described in detail in Chapter 17. Cgi_List uses split to get back a list of names and values, and then it decodes them with Url_Decode. It returns a Tcl-friendly name, value list that you can either iterate through with a foreach command, or assign to an array with array set:

Example 11-6 The Cgi_Parse and Cgi_Value procedures.
proc Cgi_List {} {
   set query [Cgi_Query]
   regsub -all {\+}$query {} query
   set result {}
   foreach {x}[split $query &=] {
      lappend result [Url_Decode $x]
   }
   return $result
}
proc Cgi_Query {} {
   global env
   if {![info exists env(QUERY_STRING)] ||
          [string length $env(QUERY_STRING)] == 0} {
      if {[info exists env(CONTENT_LENGTH)] &&
             [string length $env(CONTENT_LENGTH)] != 0} {
         set query [read stdin $env(CONTENT_LENGTH)]
      } else {
         gets stdin query
      }
      set env(QUERY_STRING) $query
      set env(CONTENT_LENGTH) 0
   }
   return $env(QUERY_STRING)
}

An HTML form can have several form elements with the same name, and this can result in more than one value for each name. If you blindly use array set to map the results of Cgi_List into an array, you will lose the repeated values. Example 11-6 shows Cgi_Parse and Cgi_Value that store the query data in a global cgi array. Cgi_Parse adds list structure whenever it finds a repeated form value. The global cgilist array keeps a record of how many times a form value is repeated. The Cgi_Value procedure returns elements of the global cgi array, or the empty string if the requested value is not present.

Example 11-7 Cgi_Parse and Cgi_Value store query data in the cgi array.
proc Cgi_Parse {} {
   global cgi cgilist
   catch {unset cgi cgilist}
   set query [Cgi_Query]
   regsub -all {\+}$query {}query
   foreach {name value}[split $query &=] {
      set name [CgiDecode $name]
      if {[info exists cgilist($name)] &&
             ($cgilist($name) == 1)} {
         # Add second value and create list structure
         set cgi($name) [list $cgi($name) \
            [Url_Decode $value]]
      } elseif {[info exists cgi($name)]} {
         # Add additional list elements
         lappend cgi($name) [CgiDecode $value]
      } else {
         # Add first value without list structure
         set cgi($name) [CgiDecode $value]
         set cgilist($name) 0   ;# May need to listify
      }
      incr cgilist($name)
   }
   return [array names cgi]
}
proc Cgi_Value {key} {
   global cgi
   if {[info exists cgi($key)]} {
      return $cgi($key)
   } else {
      return {}
   }
}
proc Cgi_Length {key} {
   global cgilist
   if {[info exist cgilist($key)]} {
      return $cgilist($key)
   } else {
      return 0
   }
}

Decoding HTML Entities

The next example is a decoder for HTML entities. In HTML, special characters are encoded as entities. If you want a literal < or > in your document, you encode them as the entities &lt; and &gt;, respectively, to avoid conflict with the <tag> syntax used in HTML. HTML syntax is briefly described in Chapter 3 on page 32. Characters with codes above 127 like copyright © and egrave è are also encoded. There are named entities, like &lt; for < and &egrave; for è. You can also use decimal-valued entities such as &#169; for ©. Finally, the trailing semicolon is optional, so &lt or &lt; can both be used to encode <.

The entity decoder is similar to Url_Decode. In this case, however, we need to be more careful with subst. The text passed to the decoder could contain special characters like a square bracket or dollar sign. With Url_Decode we can rely on those special characters being encoded as, for example, %24. Entity encoding is different (do not ask me why URLs and HTML have different encoding standards), and dollar signs and square brackets are not necessarily encoded. This requires an additional pass to quote these characters. This regsub puts a backslash in front of all the brackets, dollar signs, and backslashes.

regsub -all {[][$\\]} $text {\\&} new

The decimal encoding (e.g., &#169;) is also more awkward than the hexadecimal encoding used in URLs. We cannot force a decimal interpretation of a number in Tcl. In particular, if the entity has a leading zero (e.g., &#010;) then Tcl interprets the value (e.g., 010) as octal. The scan command is used to do a decimal interpretation. It scans into a temporary variable, and set is used to get that value:

regsub -all {&#([0-9][0-9]?[0-9]?);?} $new \
    {[format %c [scan \1 %d tmp; set tmp]]} new

With advanced regular expressions, this could be written as follows using bound quantifiers to specify one to three digits:

regsub -all {&#(\d{1,3});?} $new \
    {[format %c [scan \1 %d tmp;set tmp]]} new

The named entities are converted with an array that maps from the entity names to the special character. The only detail is that unknown entity names (e.g., &foobar;) are not converted. This mapping is done inside HtmlMapEntity, which guards against invalid entities.

regsub -all {&([a-zA-Z]+)(;?)} $new \
    {[HtmlMapEntity \1 \\\2 ]} new

If the input text contained:

[x &lt; y]

then the regsub would transform this into:

\[x [HtmlMapEntity lt \; ] y\]

Finally, subst will result in:

[x < y]
Example 11-8 Html_DecodeEntity.
proc Html_DecodeEntity {text} {
   if {![regexp & $text]} {return $text}
   regsub -all {[][$\\]}$text {\\&} new
   regsub -all {&#([0-9][0-9]?[0-9]?);?} $new {\
      [format %c [scan \1 %d tmp;set tmp]]} new
   regsub -all {&([a-zA-Z]+)(;?)} $new \
      {[HtmlMapEntity \1 \\\2 ]} new
   return [subst $new]
}
proc HtmlMapEntity {text {semi {}}} {
   global htmlEntityMap
   if {[info exist htmlEntityMap($text)]} {
      return $htmlEntityMap($text)
   } else {
      return $text$semi
   }
}
# Some of the htmlEntityMap
array set htmlEntityMap {
   lt  <  gt >   amp &
   aring  \xe5   atilde \xe3
   copy   \xa9   ecirc  \xea  egrave \xe8
}

A Simple HTML Parser

The following example is the brainchild of Stephen Uhler. It uses regsub to transform HTML into a Tcl script. When it is evaluated the script calls a procedure to handle each tag in an HTML document. This provides a general framework for processing HTML. Different callback procedures can be applied to the tags to achieve different effects. For example, the html_library-0.3 package on the CD-ROM uses Html_Parse to display HTML in a Tk text widget.

Example 11-9 Html_Parse.
proc Html_Parse {html cmd {start {}}} {

   # Map braces and backslashes into HTML entities
   regsub -all \{ $html {\&ob;} html
   regsub -all \} $html {\&cb;} html
   regsub -all {\\} $html &bsl;} html

   # This pattern matches the parts of an HTML tag
   set s" \t\r\n"  ;# white space
   set exp <(/?)(\[^$s>]+)\[$s]*(\[^>]*)>

   # This generates a call to cmd with HTML tag parts
   # \1 is the leading /, if any
   # \2 is the HTML tag name
   # \3 is the parameters to the tag, if any
   # The curly braces at either end group of all the text
   # after the HTML tag, which becomes the last arg to $cmd.
   set sub "\}\n {\\2} {\\1} {\\3} \{"
   regsub -all $exp $html $sub html

   # This balances the curly braces,
   # and calls $cmd with $start as a pseudo-tag
   # at the beginning and end of the script.
   eval "$cmd {$start} {} {} {$html}"
   eval "$cmd {$start} / {} {}"
}

The main regsub pattern can be written more simply with advanced regular expressions:

set exp {<(/?)(\S+?)\s*(.*?)>}

An example will help visualize the transformation. Given this HTML:

<Title>My Home Page</Title>
<Body bgcolor=white text=black>
<H1>My Home</H1>
This is my <b>home</b> page.

and a call to Html_Parse that looks like this:

Html_Parse $html {Render .text}hmstart

then the generated program is this:

Render .text {hmstart} {} {} {}
Render .text {Title} {} {} {My Home Page}
Render .text {Title} {/} {} {
}
Render .text {Body} {} {bgcolor=white text=black} {
}
Render .text {H1} {} {} {My Home}
Render .text {H1} {/} {} {
This is my }
Render .text {b} {} {} {home}
Render .text {b} {/} {} {page.
}
Render .text {hmstart}/ {} {}

One overall point to make about this example is the difference between using eval and subst with the generated script. The decoders shown in Examples 11-5 and 11-8 use subst to selectively replace encoded characters while ignoring the rest of the text. In Html_Parse we must process all the text. The main trick is to replace the matching text (e.g., the HTML tag) with some Tcl code that ends in an open curly brace and starts with a close curly brace. This effectively groups all the unmatched text.

When eval is used this way you must do something with any braces and backslashes in the unmatched text. Otherwise, the resulting script does not parse correctly. In this case, these special characters are encoded as HTML entities. We can afford to do this because the cmd that is called must deal with encoded entities already. It is not possible to quote these special characters with backslashes because all this text is inside curly braces, so no backslash substitution is performed. If you try that the backslashes will be seen by the cmd callback.

Finally, I must admit that I am always surprised that this works:

eval "$cmd {$start} {} {} {$html}"

I always forget that $start and $html are substituted in spite of the braces. This is because double quotes are being used to group the argument, so the quoting effect of braces is turned off. Try this:

set x hmstart
set y "foo {$x}bar"
=> foo {hmstart}bar

Stripping HTML Comments

The Html_Parse procedure does not correctly handle HTML comments. The problem is that the syntax for HTML commands allows tags inside comments, so there can be > characters inside the comment. HTML comments are also used to hide Javascript inside pages, which can also contain >. We can fix this with a pass that eliminates the comments.

The comment syntax is this:

<!-- HTML comment, could contain <markup> -->

Using nongreedy quantifiers, we can strip comments with a single regsub:

regsub -all <!--.*?--> $html {}html

Using only greedy quantifiers, it is awkward to match the closing --> without getting stuck on embedded > characters, or without matching too much and going all the way to the end of the last comment. Time for another trick:

regsub -all --> $html \x81 html

This replaces all the end comment sequences with a single character that is not allowed in HTML. Now you can delete the comments like this:

regsub -all "<!--\[^\x81\]*\x81" $html {}html

      Previous section   Next section
    Top