
Practical Programming in Tcl & Tk, Third Edition
By Brent B. Welch

Chapter 44.  C Programming and Tcl

Strings and Internationalization

There are two important topics related to string handling: creating strings dynamically and translating strings between character set encodings. These issues do not show up in the simple examples we have seen so far, but they will arise in more serious applications.

The DString Interface

You often have to build up a string from pieces. The Tcl_DString data type and its related API are designed to make this efficient. The DString interface hides the memory management issues, and the Tcl_DString data type starts out with a small static buffer, so you can often avoid allocating memory if you put a Tcl_DString on the stack (i.e., as a local variable). The standard code sequence goes something like this:

Tcl_DString ds;
Tcl_DStringInit(&ds);
Tcl_DStringAppend(&ds, "some value", -1);
Tcl_DStringAppend(&ds, "something else", -1);
Tcl_DStringResult(interp, &ds);

The Tcl_DStringInit call initializes a string pointer inside the structure to point to a static buffer that is also inside the structure. The Tcl_DStringAppend call grows the string. If it would exceed the static buffer, then a new buffer is allocated dynamically and the string is copied into it. The last argument to Tcl_DStringAppend is a length, which can be minus 1 if you want to copy until the trailing NULL byte in your string. You can use the string value as the result of your Tcl command with Tcl_DStringResult. This passes ownership of the string to the interpreter and automatically cleans up the Tcl_DString structure.
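
To make the static-buffer behavior concrete, here is a minimal stand-alone sketch of the same idea in plain C. It is not Tcl's implementation: MiniDString, mini_init, mini_append, mini_free, and the 16-byte MINI_STATIC_SIZE are hypothetical names and values chosen for illustration, and error handling is reduced to abort().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MINI_STATIC_SIZE 16    /* hypothetical size; Tcl's own value differs */

typedef struct {
    char  *string;             /* points at staticBuf or at heap memory */
    size_t length;             /* bytes used, not counting the terminator */
    size_t space;              /* bytes available at *string */
    char   staticBuf[MINI_STATIC_SIZE];
} MiniDString;

/* Point the string at the embedded buffer and make it empty. */
static void mini_init(MiniDString *ds) {
    ds->string = ds->staticBuf;
    ds->length = 0;
    ds->space = MINI_STATIC_SIZE;
    ds->staticBuf[0] = '\0';
}

/* Append len bytes of src; a negative len means "up to the terminator". */
static char *mini_append(MiniDString *ds, const char *src, int len) {
    size_t n, needed;

    assert(ds != NULL && src != NULL);
    n = (len < 0) ? strlen(src) : (size_t)len;
    needed = ds->length + n + 1;
    if (needed > ds->space) {
        size_t newSpace = needed * 2;
        char *p;
        if (ds->string == ds->staticBuf) {
            /* First overflow: move the contents from the stack to the heap. */
            p = malloc(newSpace);
            if (p == NULL) abort();
            memcpy(p, ds->staticBuf, ds->length);
        } else {
            p = realloc(ds->string, newSpace);
            if (p == NULL) abort();
        }
        ds->string = p;
        ds->space = newSpace;
    }
    memcpy(ds->string + ds->length, src, n);
    ds->length += n;
    ds->string[ds->length] = '\0';
    return ds->string;
}

/* Release heap memory, if any, and return to the empty state. */
static void mini_free(MiniDString *ds) {
    if (ds->string != ds->staticBuf) free(ds->string);
    mini_init(ds);
}
```

In this sketch, appending "some value" (10 bytes) stays inside the embedded buffer, so no allocation happens. A further 16-byte append no longer fits and moves the contents to the heap, which is the situation where skipping the free call would leak memory.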

If you do not use the string as the interpreter result, then you must call Tcl_DStringFree to ensure that any dynamically allocated memory is released:

Tcl_DStringFree(&ds);

You can get a direct pointer to the string you have created with Tcl_DStringValue:

name = Tcl_DStringValue(&ds);

There are a handful of additional procedures in the DString API that you can read about in the reference material. There are some that create lists, but this is better done with the Tcl_Obj interface (e.g., Tcl_NewListObj and friends).

To some degree, a Tcl_Obj can replace the use of a Tcl_DString. For example, Tcl_NewStringObj allocates a Tcl_Obj, and Tcl_AppendToObj appends strings to it. However, there are a number of Tcl API procedures that take Tcl_DString types as arguments instead of the Tcl_Obj type. Also, for small strings, the DString interface is still more efficient because it can often avoid dynamic memory allocation altogether.

Character Set Conversions

As described in Chapter 15, Tcl uses UTF-8 strings internally. Tcl's UTF-8 representation of Unicode never contains an embedded NULL byte (the null character itself is stored as the two-byte sequence 0xC0 0x80), so a NULL byte can always mark the end of a string. It also represents 7-bit ASCII characters in one byte, so if you have old C code that only manipulates ASCII strings, it can coexist with Tcl without modification.

However, in more general cases, you may need to convert the UTF-8 strings you get from Tcl_Obj values into strings of a particular encoding. For example, when you pass strings to the operating system, it expects them in its native encoding, which might be 16-bit Unicode, ISO-Latin-1 (i.e., iso-8859-1), or something else.

Tcl provides an encoding API that does translations for you. The simplest calls use a Tcl_DString to store the results because it is not possible to predict the size of the result in advance. For example, to convert from a UTF-8 string to a Tcl_DString in the system encoding, you use this call:

Tcl_UtfToExternalDString(NULL, string, -1, &ds);

You can then pass Tcl_DStringValue(&ds) to your system call that expects a native string. Afterwards you need to call Tcl_DStringFree(&ds) to free up any memory allocated by Tcl_UtfToExternalDString.

To translate strings the other way, use Tcl_ExternalToUtfDString:

Tcl_ExternalToUtfDString(NULL, string, -1, &ds);

The third argument to these procedures is the length of string in bytes (not characters), and minus 1 means that Tcl should calculate it by looking for a NULL byte. Tcl stores its UTF-8 strings with a NULL byte at the end so it can do this.
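
The reason these calls write into a Tcl_DString is easier to see with one encoding worked out by hand. The following stand-alone sketch converts ISO-8859-1 bytes to UTF-8 in plain C, following the same convention that a negative length means "measure up to the terminator." The function name latin1_to_utf8 is hypothetical and is not part of the Tcl API, and the sketch ignores the special handling Tcl gives to embedded null characters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert srcLen bytes of ISO-8859-1 text to UTF-8. A negative srcLen means
 * "measure up to the terminating zero byte". Returns a malloc'ed, terminated
 * string that the caller must free, and stores its byte length in *dstLen.
 */
static char *latin1_to_utf8(const char *src, int srcLen, size_t *dstLen) {
    size_t n, i, out = 0;
    char *dst;

    assert(src != NULL && dstLen != NULL);
    n = (srcLen < 0) ? strlen(src) : (size_t)srcLen;
    dst = malloc(2 * n + 1);          /* worst case: every byte doubles */
    if (dst == NULL) abort();
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c < 0x80) {
            dst[out++] = (char)c;                    /* ASCII: one byte */
        } else {
            dst[out++] = (char)(0xC0 | (c >> 6));    /* lead byte */
            dst[out++] = (char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    dst[out] = '\0';
    *dstLen = out;
    return dst;
}
```

A four-byte Latin-1 string such as "caf\xE9" becomes five bytes of UTF-8, while a pure ASCII string keeps its length. The output size depends on the content and not just on the input length, so a growable buffer is a natural fit.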

The first argument to these procedures is the encoding to translate to or from. NULL means the system encoding. If you have data in nonstandard encodings, or need to translate into something other than the system encoding, you need to get a handle on the encoding with Tcl_GetEncoding, and free that handle later with Tcl_FreeEncoding:

encoding = Tcl_GetEncoding(interp, name);

The names of the encodings are returned by the encoding names Tcl command, and you can query them from C with Tcl_GetEncodingNames.

Windows has a quirky string data type called TCHAR, which is an 8-bit byte on Windows 95/98, and a 16-bit Unicode character on Windows NT. If you use a C API that takes an array of TCHAR, then you have to know what kind of system you are running on to use it properly. Tcl provides two procedures that deal with this automatically. Tcl_WinTCharToUtf works like Tcl_ExternalToUtfDString, and Tcl_WinUtfToTChar works like Tcl_UtfToExternalDString:

Tcl_WinUtfToTChar(string, -1, &ds);
Tcl_WinTCharToUtf(string, -1, &ds);

Finally, Tcl has several procedures to work with Unicode characters, which are type Tcl_UniChar, and UTF-8 encoded characters. Examples include Tcl_UniCharToUtf, Tcl_NumUtfChars, and Tcl_UtfToUniCharDString. Consult the reference materials for details about these procedures.
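
The distinction between bytes and characters that these procedures deal with can be illustrated without Tcl. The sketch below counts characters in a UTF-8 string by skipping continuation bytes. The name utf8_char_count is hypothetical, the input is assumed to be well-formed UTF-8, and the real Tcl_NumUtfChars may treat malformed input differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the characters in the first `length` bytes of UTF-8 text.
 * A negative length means "measure up to the terminating zero byte".
 */
static size_t utf8_char_count(const char *src, int length) {
    size_t n, i, count = 0;

    assert(src != NULL);
    n = (length < 0) ? strlen(src) : (size_t)length;
    for (i = 0; i < n; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte starts a character. */
        if (((unsigned char)src[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the five-byte string "caf\xC3\xA9" this reports four characters, which is why APIs in this area have to say explicitly whether a length argument is measured in bytes or in characters.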
