Chapter 6. Matching Unicode and Other Characters

You will have occasion to match characters or ranges of characters that are outside the scope of ASCII. ASCII, or the American Standard Code for Information Interchange, defines an English character set—the letters A through Z in upper- and lowercase, plus control and other characters. It’s been around for a long time: The 128-character Latin-based set was standardized in 1968. That was back before there was such a thing as a personal computer, before VisiCalc, before the mouse, before the Web, but I still look up ASCII charts online regularly.

I remember when I started my career many years ago, I worked with an engineer who kept an ASCII code chart in his wallet. Just in case. The ASCII Code Chart: Don’t leave home without it.

So I won’t gainsay the importance of ASCII, but now it is dated, especially in light of the Unicode standard (http://www.unicode.org), which currently represents over 100,000 characters. Unicode, however, does not leave ASCII in the dust; it incorporates ASCII into its Basic Latin code table (see http://www.unicode.org/charts/PDF/U0000.pdf).

In this chapter, you will step out of the province of ASCII into the not-so-new world of Unicode.

The first text is voltaire.txt from the code archive, a quote from Voltaire (1694–1778), the French Enlightenment philosopher.

Qu’est-ce que la tolérance? c’est l’apanage de l’humanité. Nous sommes tous pétris de faiblesses et d’erreurs; pardonnons-nous réciproquement nos sottises, c’est la première loi de la nature.

Here is an English translation:

What is tolerance? It is the consequence of humanity. We are all formed of frailty and error; let us pardon reciprocally each other’s folly—that is the first law of nature.

There are a variety of ways you can specify a Unicode character, also known as a code point. (For the purposes of this book, a Unicode character is one that is outside of the range of ASCII, though that is not strictly accurate.)

Start out by placing the Voltaire quote in Regexpal (http://www.regexpal.com), and then entering this regular expression:

\u00e9

The \u is followed by a hexadecimal value 00e9 (this is case insensitive—that is, 00E9 works, too). The value 00e9 is equivalent to the decimal value 233, well out of the ASCII range (0–127).

Notice that the letter é (small letter e with an acute accent) is highlighted in Regexpal (see Figure 6-1). That’s because é is the code point U+00E9 in Unicode, which was matched by \u00e9.
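If you would rather check the arithmetic outside a browser, here is a minimal sketch in Python. It assumes Python 3 and its built-in re module, which is a different engine from the JavaScript one behind Regexpal but accepts the same \u escape. The variable names are only for illustration.

```python
import re

# The Voltaire quote from voltaire.txt, inlined so the sketch is self-contained.
quote = (
    "Qu’est-ce que la tolérance? c’est l’apanage de l’humanité. "
    "Nous sommes tous pétris de faiblesses et d’erreurs; "
    "pardonnons-nous réciproquement nos sottises, "
    "c’est la première loi de la nature."
)

# Hexadecimal 00e9 is decimal 233, the code point U+00E9.
code_point = int("00e9", 16)
letter = chr(code_point)

# Python's re module understands \uXXXX escapes in string patterns.
matches = re.findall(r"\u00e9", quote)

print(code_point, letter, len(matches))
```

Note that première is not counted, because its accented letter is è (U+00E8), a different code point.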

Regexpal uses the JavaScript implementation of regular expressions. JavaScript also allows you to use this syntax:

\xe9

Try this in Regexpal and see how it matches the same character as \u00e9.
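The same equivalence can be sketched in Python, whose re module also happens to accept both escapes. This is an illustration on a short made-up string, not the Regexpal behavior itself.

```python
import re

# A short sample containing two occurrences of U+00E9.
text = "la tolérance, l’humanité"

# Record where each form of the escape matches.
by_u = [m.start() for m in re.finditer(r"\u00e9", text)]
by_x = [m.start() for m in re.finditer(r"\xe9", text)]

# Both escapes name the same code point, so the positions agree.
print(by_u == by_x, len(by_u))
```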

Let’s try it with a different regex engine. Open http://regexhero.net/tester/ in a browser. Regex Hero is written in .NET and has a slightly different syntax. Drop the contents of the file basho.txt into the text area labeled Target String. This contains a famous haiku written by the Japanese poet Matsuo Basho (who, coincidentally, died just one week after Voltaire was born).

Here is the poem in Japanese:

古池
蛙飛び込む
水の音
        —芭蕉 (1644–1694)

And here is a translation in English:

At the ancient pond
a frog plunges into
the sound of water.
        —Basho (1644–1694)

To match part of the Japanese text, in the text area marked Regular Expression, type the following:

\u6c60

This is the code point for the Japanese (Chinese) character for pond. It will be highlighted below (see Figure 6-2).

While you are here, try matching the em dash (—) with:

\u2014

Or the en dash (–) with:

\u2013
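If you don’t have Regex Hero handy, the three searches can be approximated with a short Python sketch. It assumes Python’s re module rather than .NET, and it inlines the haiku instead of reading basho.txt.

```python
import re

# The haiku and attribution as they appear in basho.txt, inlined here.
haiku = "古池\n蛙飛び込む\n水の音\n        —芭蕉 (1644–1694)"

pond = re.search(r"\u6c60", haiku)     # the character for pond
em_dash = re.search(r"\u2014", haiku)  # dash before the poet's name
en_dash = re.search(r"\u2013", haiku)  # dash between the years

# Show each matched character and the offset where it was found.
for match in (pond, em_dash, en_dash):
    print(repr(match.group()), match.start())
```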

Now look at these characters in an editor.

If you have vim on your system, you can open basho.txt with it, as shown:

vim basho.txt

Now, starting with a slash (/), enter a search with this line:

/\%u6c60

followed by Enter or Return. The cursor moves to the beginning of the match, as you can see in Figure 6-3. Table 6-1 shows you your options. You can use x or X following the \% to match values in the range 0–255 (0–FF), u to match up to four hexadecimal digits in the range 256–65,535 (100–FFFF), or U to match up to eight hexadecimal digits in the range 65,536–2,147,483,647 (10000–7FFFFFFF). That takes in a lot of code points, far more than currently exist in Unicode.
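To see how the three ranges divide up, here is a small Python sketch. The function vim_search_escape is a hypothetical helper written for this illustration, not part of vim or any library. It simply picks the shortest form that can hold a character’s code point.

```python
def vim_search_escape(ch: str) -> str:
    """Return a vim search atom for a single character (hypothetical helper)."""
    cp = ord(ch)
    if cp <= 0xFF:
        # 0-255 fits in two hexadecimal digits.
        return f"\\%x{cp:02x}"
    if cp <= 0xFFFF:
        # 256-65,535 fits in four hexadecimal digits.
        return f"\\%u{cp:04x}"
    # Anything larger takes the eight-digit form.
    return f"\\%U{cp:08x}"


for ch in ("é", "池", "\U0001F600"):
    print(vim_search_escape(ch))
```

You would type the result after the slash in vim, for example /\%u6c60.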

You can also match characters using an octal (base 8) number, which uses the digits 0 to 7. In regex, this is done with three digits, preceded by a backslash (\).

For example, the following octal number:

\351

is the same as:

\u00e9

Experiment with it in Regexpal with the Voltaire text. \351 matches é, with a little less typing.
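The conversion is easy to confirm with a Python sketch. It assumes Python’s re module, which treats a backslash followed by three octal digits as a character code, much as the JavaScript engine in Regexpal does.

```python
import re

# Octal 351 is 3*64 + 5*8 + 1 = 233, the same value as hexadecimal e9.
octal_value = int("351", 8)
hex_value = int("e9", 16)

# Three octal digits after a backslash are read as a character code.
found = re.findall(r"\351", "Nous sommes tous pétris de faiblesses")

print(octal_value, hex_value, found)
```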

In some implementations, such as Perl, you can match on Unicode character properties. The properties include characteristics like whether the character is a letter, number, or punctuation mark.

I’ll now introduce you to ack, a command-line tool written in Perl that acts a lot like grep (see http://betterthangrep.com). It probably isn’t preinstalled on your system; you have to download and install it yourself (see Technical Notes).

We’ll use ack on an excerpt from Friedrich Schiller’s “An die Freude,” composed in 1785 (German, if you can’t tell):

An die Freude.

Freude, schöner Götterfunken,
Tochter aus Elisium,
Wir betreten feuertrunken
Himmlische, dein Heiligthum.
Deine Zauber binden wieder,
was der Mode Schwerd getheilt;
Bettler werden Fürstenbrüder,
wo dein sanfter Flügel weilt.

Seid umschlungen, Millionen!
Diesen Kuß der ganzen Welt!
Brüder, überm Sternenzelt
muß ein lieber Vater wohnen.

There are a few interesting characters in this excerpt, beyond ASCII’s small realm. We’ll look at the text of this poem through properties. (If you would like a translation of this poem fragment, you can drop it into Google Translate.)

Using ack on a command line, you can specify that you want to see all the characters whose property is Letter (L):

ack '\pL' schiller.txt

This will show you all the letters highlighted. For lowercase letters, use Ll, surrounded by braces:

ack '\p{Ll}' schiller.txt

You must add the braces. For uppercase, it’s Lu:

ack '\p{Lu}' schiller.txt

To specify characters that do not match a property, we use uppercase P:

ack '\PL' schiller.txt

This highlights characters that are not letters.

The following finds those that are not lowercase letters:

ack '\P{Ll}' schiller.txt

And this highlights the ones that are not uppercase:

ack '\P{Lu}' schiller.txt
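Python’s built-in re module does not support \p or \P, so the following sketch swaps in a different technique: it looks up each character’s Unicode general category with the standard unicodedata module. The function chars_with_property is a hypothetical helper, and it lists characters rather than highlighting lines the way ack does.

```python
import unicodedata


def chars_with_property(text, prop, negate=False):
    # Keep characters whose general category starts with prop ("L", "Ll", "Lu").
    # negate=True plays the role of an uppercase P, keeping everything else.
    return [
        ch for ch in text
        if unicodedata.category(ch).startswith(prop) != negate
    ]


line = "Brüder, überm Sternenzelt"

letters = chars_with_property(line, "L")                   # like \pL
lowercase = chars_with_property(line, "Ll")                # like \p{Ll}
uppercase = chars_with_property(line, "Lu")                # like \p{Lu}
non_letters = chars_with_property(line, "L", negate=True)  # like \PL

print(uppercase, non_letters, len(letters), len(lowercase))
```

The letter ü lands in the lowercase group even though it is outside ASCII, which is the point of matching on properties rather than on a range like a-z.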

You can also do this in yet another browser-based regex tester, http://regex.larsolavtorvik.com. Figure 6-4 shows the Schiller text with its lowercase letters highlighted using the lowercase property (\p{Ll}).

Table 6-2 lists character property names for use with \p{property} or \P{property} (see pcresyntax(3) at http://www.pcre.org/pcre.txt). You can also match human languages with properties; see Table A-8.

How do you match control characters? You won’t often search for control characters in text, but it’s a good thing to know how. In the example repository or archive, you’ll find the file ascii.txt, which is a 128-line file that contains all the ASCII characters in it, each on a separate line (hence the 128 lines). When you perform a search on the file, it will usually return a single line if it finds a match. This file is good for testing and general fun.

In regular expressions, you can specify a control character like this:

\cx

where x is the control character you want to match.

Let’s say, for example, you wanted to find a null character in a file. You can use Perl to do that with the following command:

perl -n -e 'print if /\c@/' ascii.txt

Provided that you’ve got Perl on your system and it’s running properly, you will get this result:

0. Null

That’s because there is a null character on that line, even though you can’t see the character in the result.

You can also use \0 to find a null character. Try this, too:

perl -n -e 'print if /\0/' ascii.txt

Pressing on, you can find the bell (BEL) character using:

perl -n -e 'print if /\cG/' ascii.txt

It will return the line:

7. Bell

Or you can use the shorthand:

perl -n -e 'print if /\a/' ascii.txt

To find the escape character, use:

perl -n -e 'print if /\c[/' ascii.txt

which gives you:

27. Escape

Or do it with a shorthand:

perl -n -e 'print if /\e/' ascii.txt

How about a backspace character? Try:

perl -n -e 'print if /\cH/' ascii.txt

which spits back:

8. Backspace
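The pattern behind \c@, \cG, \c[, and \cH is that the control character’s code is the code of the visible character with 64 flipped off (an exclusive OR with 64). Python’s re module has no \c escape, so this sketch computes the character instead. The helper control_char and the four sample lines are assumptions made for illustration; the real ascii.txt may be laid out differently.

```python
import re


def control_char(x: str) -> str:
    # Hypothetical helper: the character that the escape \cx stands for.
    return chr(ord(x.upper()) ^ 0x40)


# Stand-ins for lines of ascii.txt, each carrying its (invisible) character.
lines = ["0. Null \x00", "7. Bell \x07", "8. Backspace \x08", "27. Escape \x1b"]


def lines_with_control(x: str) -> list:
    # Escape the computed character so it is always matched literally.
    pattern = re.compile(re.escape(control_char(x)))
    return [line for line in lines if pattern.search(line)]


for x in "@G[H":
    print(x, ord(control_char(x)), lines_with_control(x))
```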

You can also find a backspace using a bracketed expression:

perl -n -e 'print if /[\b]/' ascii.txt

Without the brackets, how would \b be interpreted? That’s right, as a word boundary, as you learned in Chapter 2. The brackets change the way the \b is understood by the processor. In this case, Perl sees it as a backspace character.
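Python’s re module draws the same distinction, so you can watch the two readings of \b side by side in this short sketch. The sample string is made up for the purpose.

```python
import re

# One backspace character sits between the two words.
text = "undo\x08 redo"

# Inside brackets, \b is the backspace character.
backspaces = re.findall(r"[\b]", text)

# Outside brackets, \b is a zero-width word boundary.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

print(backspaces, boundaries)
```

The bracketed form consumes one character, while the bare form matches only positions: the start and end of each word.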

Table 6-3 lists the ways we matched characters in this chapter.

That wraps things up for this chapter. In the next, you’ll learn more about quantifiers.