Chapter 6. Matching Unicode and Other Characters

You will have occasion to match characters or ranges of characters that are outside the scope of ASCII. ASCII, or the American Standard Code for Information Interchange, defines an English character set—the letters A through Z in upper- and lowercase, plus control and other characters. It’s been around for a long time: The 128-character Latin-based set was standardized in 1968. That was back before there was such a thing as a personal computer, before VisiCalc, before the mouse, before the Web, but I still look up ASCII charts online regularly.

I remember when I started my career many years ago, I worked with an engineer who kept an ASCII code chart in his wallet. Just in case. The ASCII Code Chart: Don’t leave home without it.

So I won’t gainsay the importance of ASCII, but now it is dated, especially in light of the Unicode standard (http://www.unicode.org), which currently represents over 100,000 characters. Unicode, however, does not leave ASCII in the dust; it incorporates ASCII into its Basic Latin code table (see http://www.unicode.org/charts/PDF/U0000.pdf).

In this chapter, you will step out of the province of ASCII into the not-so-new world of Unicode.

The first text is voltaire.txt from the code archive, a quote from Voltaire (1694–1778), the French Enlightenment philosopher.

Qu’est-ce que la tolérance? c’est l’apanage de l’humanité. Nous sommes tous pétris de faiblesses et d’erreurs; pardonnons-nous réciproquement nos sottises, c’est la première loi de la nature.

Here is an English translation:

What is tolerance? It is the consequence of humanity. We are all formed of frailty and error; let us pardon reciprocally each other’s folly—that is the first law of nature.

Matching a Unicode Character

There are a variety of ways you can specify a Unicode character, also known as a code point. (For the purposes of this book, a Unicode character is one that is outside of the range of ASCII, though that is not strictly accurate.)

Start out by placing the Voltaire quote in Regexpal (http://www.regexpal.com), and then entering this regular expression:

\u00e9

The \u is followed by a hexadecimal value 00e9 (this is case insensitive—that is, 00E9 works, too). The value 00e9 is equivalent to the decimal value 233, well out of the ASCII range (0–127).

Notice that the letter é (small letter e with an acute accent) is highlighted in Regexpal (see Figure 6-1). That’s because é is the code point U+00E9 in Unicode, which was matched by \u00e9.

Figure 6-1. Matching U+00E9 in Regexpal

Regexpal uses the JavaScript implementation of regular expressions. JavaScript also allows you to use this syntax:

\xe9

Try this in Regexpal and see how it matches the same character as \u00e9.
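If you would rather try the two escapes outside a browser, here is a minimal sketch in Python rather than JavaScript. That choice is my assumption, not something Regexpal uses; Python's re module happens to accept both the \uxxxx and \xhh forms in a pattern. The start of the Voltaire quote is pasted in as a string instead of being read from voltaire.txt.

```python
import re

# Start of the Voltaire quote, pasted in rather than read from voltaire.txt.
quote = "Qu’est-ce que la tolérance? c’est l’apanage de l’humanité."

# Raw strings hand the backslash escapes to the regex engine untouched.
by_u = re.findall(r"\u00e9", quote)  # four hexadecimal digits
by_x = re.findall(r"\xe9", quote)    # two hexadecimal digits

print(by_u)
print(by_u == by_x)
```

Both patterns should find the same characters, because both name the code point U+00E9.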

Let’s try it with a different regex engine. Open http://regexhero.net/tester/ in a browser. Regex Hero is written in .NET and has a slightly different syntax. Drop the contents of the file basho.txt into the text area labeled Target String. This contains a famous haiku written by the Japanese poet Matsuo Basho (who, coincidentally, died just one week after Voltaire was born).

Here is the poem in Japanese:

古池
蛙飛び込む
水の音
        —芭蕉 (1644–1694)

And here is a translation in English:

At the ancient pond
a frog plunges into
the sound of water.
        —Basho (1644–1694)

To match part of the Japanese text, in the text area marked Regular Expression, type the following:

\u6c60

This is the code point for the Japanese (Chinese) character for pond. It will be highlighted below (see Figure 6-2).

Figure 6-2. Matching U+6c60 in Regex Hero

While you are here, try matching the em dash (—) with:

\u2014

Or the en dash (–) with:

\u2013
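The same three code points can be checked in a script. This is a small sketch in Python (my choice of language, not what Regex Hero runs on), with the haiku pasted in as a string rather than read from basho.txt.

```python
import re

# The haiku and its attribution line, pasted in rather than read from basho.txt.
haiku = "古池\n蛙飛び込む\n水の音\n—芭蕉 (1644–1694)"

for pattern in (r"\u6c60", r"\u2014", r"\u2013"):
    match = re.search(pattern, haiku)
    # match.start() is the character offset of the first hit in the string.
    print(pattern, repr(match.group()), match.start())
```

Each pattern matches exactly one character: the pond character, the em dash, and the en dash.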

Now look at these characters in an editor.

Using vim

If you have vim on your system, you can open basho.txt with it, as shown:

vim basho.txt

Now, starting with a forward slash (/), enter a search with this line:

/\%u6c60

followed by Enter or Return. The cursor moves to the beginning of the match, as you can see in Figure 6-3. Table 6-1 shows you your options. You can use x or X following the \% to match values in the range 0–255 (0–FF), u to match up to four hexadecimal digits in the range 256–65,535 (100–FFFF), or U to match up to eight hexadecimal digits in the range 65,536–2,147,483,647 (10000–7FFFFFFF). That takes in a lot of code points, far more than currently exist in Unicode.

Table 6-1. Matching Unicode in Vim

First Character	Maximum Hex Digits	Maximum Value
x or X	2	255 (FF)
u	4	65,535 (FFFF)
U	8	2,147,483,647 (7FFFFFFF)

Figure 6-3. Matching U+6c60 in Vim
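To see how the ranges in Table 6-1 play out, here is a rough sketch in Python. The function vim_escape is a hypothetical helper of my own, not part of vim; it simply picks the shortest of the three forms that can hold a character's code point.

```python
def vim_escape(char: str) -> str:
    """Build a vim search atom for one character (hypothetical helper)."""
    cp = ord(char)
    if cp <= 0xFF:
        return f"\\%x{cp:02x}"   # up to two hexadecimal digits
    if cp <= 0xFFFF:
        return f"\\%u{cp:04x}"   # up to four hexadecimal digits
    return f"\\%U{cp:08x}"       # up to eight hexadecimal digits

for ch in ("é", "池", "😀"):
    # Prefix with / to get something you can type as a vim search.
    print(ch, "/" + vim_escape(ch))
```

The third character lies above U+FFFF, so only the \%U form can reach it.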

Matching Characters with Octal Numbers

You can also match characters using an octal (base 8) number, which uses the digits 0 to 7. In regex, this is done with three digits, preceded by a backslash (\).

For example, the following octal number:

\351

is the same as:

\u00e9

Experiment with it in Regexpal with the Voltaire text. \351 matches é, with a little less typing.
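If the arithmetic is not obvious, this short Python sketch (Python is my assumption here; Regexpal itself runs JavaScript) confirms that octal 351, hexadecimal e9, and decimal 233 are the same number, and that Python's re module also accepts the three-digit octal escape.

```python
import re

# 351 (octal), e9 (hexadecimal), and 233 (decimal) name the same code point.
print(0o351, 0xE9, oct(233), hex(233))

text = "la tolérance"
octal_hit = re.search(r"\351", text).group()
hex_hit = re.search(r"\u00e9", text).group()
print(octal_hit == hex_hit)
```

Working it by hand: 3 × 64 + 5 × 8 + 1 = 233.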

Matching Unicode Character Properties

In some implementations, such as Perl, you can match on Unicode character properties. The properties include characteristics like whether the character is a letter, number, or punctuation mark.

I’ll now introduce you to ack, a command-line tool written in Perl that acts a lot like grep (see http://betterthangrep.com). It probably isn’t preinstalled on your system; you have to download and install it yourself (see Technical Notes).

We’ll use ack on an excerpt from Friedrich Schiller’s “An die Freude,” composed in 1785 (German, if you can’t tell):

An die Freude.

Freude, schöner Götterfunken,
Tochter aus Elisium,
Wir betreten feuertrunken
Himmlische, dein Heiligthum.
Deine Zauber binden wieder,
was der Mode Schwerd getheilt;
Bettler werden Fürstenbrüder,
wo dein sanfter Flügel weilt.

Seid umschlungen, Millionen!
Diesen Kuß der ganzen Welt!
Brüder, überm Sternenzelt
muß ein lieber Vater wohnen.

There are a few interesting characters in this excerpt, beyond ASCII’s small realm. We’ll look at the text of this poem through properties. (If you would like a translation of this poem fragment, you can drop it into Google Translate.)

Using ack on a command line, you can specify that you want to see all the characters whose property is Letter (L):

ack '\pL' schiller.txt

This will show you all the letters highlighted. For lowercase letters, use Ll, surrounded by braces:

ack '\p{Ll}' schiller.txt

The braces are required for two-letter property names. For uppercase, it’s Lu:

ack '\p{Lu}' schiller.txt

To specify characters that do not match a property, we use uppercase P:

ack '\PL' schiller.txt

This highlights characters that are not letters.

The following finds those that are not lowercase letters:

ack '\P{Ll}' schiller.txt

And this highlights the ones that are not uppercase:

ack '\P{Lu}' schiller.txt
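Not every engine understands \p{...}. Python's built-in re module, for one, does not. As a rough sketch of what the property tests above are doing, the following uses Python's unicodedata module to look up each character's general category. The function with_property is a hypothetical helper of mine, not something from ack or Perl.

```python
import unicodedata

def with_property(text: str, prop: str, negate: bool = False) -> list[str]:
    r"""Return the characters whose general category starts with prop.

    prop="L" behaves like \pL and prop="Ll" like \p{Ll};
    negate=True mimics the uppercase \P form.
    """
    hits = []
    for ch in text:
        matches = unicodedata.category(ch).startswith(prop)
        if matches != negate:
            hits.append(ch)
    return hits

line = "Freude, schöner Götterfunken,"
print("".join(with_property(line, "Lu")))      # uppercase letters
print("".join(with_property(line, "Ll")))      # lowercase letters
print(with_property(line, "L", negate=True))   # everything that is not a letter
```

Matching on the first letter of the category is what lets a one-letter property such as L cover Ll, Lu, and the other letter subcategories.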

You can also do this in yet another browser-based regex tester, http://regex.larsolavtorvik.com. Figure 6-4 shows the Schiller text with its lowercase letters highlighted using the lowercase property (\p{Ll}).

Figure 6-4. Characters with the lowercase letter property

Table 6-2 lists character property names for use with \p{property} or \P{property} (see pcresyntax(3) at http://www.pcre.org/pcre.txt). You can also match human languages with properties; see Table A-8.

Table 6-2. Character properties

Property	Description
C	Other
Cc	Control
Cf	Format
Cn	Unassigned
Co	Private use
Cs	Surrogate
L	Letter
Ll	Lowercase letter
Lm	Modifier letter
Lo	Other letter
Lt	Title case letter
Lu	Uppercase letter
L&	Ll, Lu, or Lt
M	Mark
Mc	Spacing mark
Me	Enclosing mark
Mn	Non-spacing mark
N	Number
Nd	Decimal number
Nl	Letter number
No	Other number
P	Punctuation
Pc	Connector punctuation
Pd	Dash punctuation
Pe	Close punctuation
Pf	Final punctuation
Pi	Initial punctuation
Po	Other punctuation
Ps	Open punctuation
S	Symbol
Sc	Currency symbol
Sk	Modifier symbol
Sm	Mathematical symbol
So	Other symbol
Z	Separator
Zl	Line separator
Zp	Paragraph separator
Zs	Space separator
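To see which property from Table 6-2 a given character carries, you can ask a Unicode database directly. This is a small sketch using Python's unicodedata module (my choice of tool, not one the chapter otherwise relies on), applied to a few characters from the Schiller excerpt plus a newline.

```python
import unicodedata

# A few characters from the Schiller excerpt, plus a newline.
samples = ("F", "ö", "ß", ",", "!", " ", "\n")

# Map each character to its two-letter general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

The two-letter codes printed are the same ones listed in the table, so you can read the description straight off it.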

Matching Control Characters

How do you match control characters? It’s not all that common that you will search for control characters in text, but it’s a good thing to know. In the example repository or archive, you’ll find the file ascii.txt, which is a 128-line file that contains all the ASCII characters, each on a separate line (hence the 128 lines). When you perform a search on the file, it will usually return a single line if it finds a match. This file is good for testing and general fun.

Note

If you search for strings or control characters in ascii.txt with grep or ack, they may interpret the file as a binary file. If so, when you run a search on it, either tool may simply report “Binary file ascii.txt matches” when it finds a match. That’s all.

In regular expressions, you can specify a control character like this:

\cx

where x is the control character you want to match.
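The letter after \c is not arbitrary. In Perl, the control character is the one whose code differs from the letter's code by 64, which amounts to flipping a single bit. Here is a minimal sketch of that mapping in Python; control_char is a hypothetical helper of mine, and Python's own re module does not support \cx.

```python
def control_char(x: str) -> str:
    r"""Return the character that \cx stands for (hypothetical helper)."""
    # Flipping bit 0x40 of the uppercased character gives the control code.
    return chr(ord(x.upper()) ^ 0x40)

for x in "@GH[":
    print(f"\\c{x} -> code {ord(control_char(x))}")
```

That is why @ leads to the null character (code 0) and [ leads to escape (code 27).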

Let’s say, for example, you wanted to find a null character in a file. You can use Perl to do that with the following command:

perl -n -e 'print if /\c@/' ascii.txt

Provided that you’ve got Perl on your system and it’s running properly, you will get this result:

0. Null

That is because there is a null character on that line, even though you can’t see the character in the result.

Note

If you open ascii.txt with an editor other than vim, it will likely remove the control characters from the file, so I suggest you don’t do it.

You can also use \0 to find a null character. Try this, too:

perl -n -e 'print if /\0/' ascii.txt

Pressing on, you can find the bell (BEL) character using:

perl -n -e 'print if /\cG/' ascii.txt

It will return the line:

7. Bell

Or you can use the shorthand:

perl -n -e 'print if /\a/' ascii.txt

To find the escape character, use:

perl -n -e 'print if /\c[/' ascii.txt

which gives you:

27. Escape

Or do it with a shorthand:

perl -n -e 'print if /\e/' ascii.txt

How about a backspace character? Try:

perl -n -e 'print if /\cH/' ascii.txt

which spits back:

8. Backspace

You can also find a backspace using a bracketed expression:

perl -n -e 'print if /[\b]/' ascii.txt

Without the brackets, how would \b be interpreted? That’s right, as a word boundary, as you learned in Chapter 2. The brackets change the way the \b is understood by the processor. In this case, Perl sees it as a backspace character.
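For comparison, here is how the same searches might look in Python, as a rough sketch. Two things are assumptions on my part: the four strings stand in for lines of ascii.txt, whose exact layout I am guessing at, and grep is a hypothetical helper that mimics the Perl one-liners. Python's re module understands \0, \a, and [\b], but it has no \e or \cx, so escape is written as \x1b.

```python
import re

# Stand-ins for four lines of ascii.txt, each ending in its control character.
lines = ["0. Null \x00", "7. Bell \x07", "8. Backspace \x08", "27. Escape \x1b"]

def grep(pattern: str) -> list[str]:
    """Return the lines containing a match, like Perl's print if /.../."""
    return [line for line in lines if re.search(pattern, line)]

print(grep(r"\0"))        # null
print(grep(r"\a"))        # bell
print(grep(r"\x1b"))      # escape
print(grep(r"[\b]"))      # backspace, because of the brackets
print(len(grep(r"\b")))   # word boundary, found on every line
```

The last two calls show the difference the brackets make: [\b] picks out only the backspace line, while a bare \b matches a word boundary on all four.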

Table 6-3 lists the ways we matched characters in this chapter.

Table 6-3. Matching Unicode and other characters

Code	Description
`\uxxxx`	Unicode (four places)
`\xxx`	Unicode (two places)
`\x{xxxx}`	Unicode (four places)
`\x{xx}`	Unicode (two places)
`\000`	Octal (base 8)
`\cx`	Control character
`\0`	Null
`\a`	Bell
`\e`	Escape
`[\b]`	Backspace

That wraps things up for this chapter. In the next, you’ll learn more about quantifiers.

What You Learned in Chapter 6

How to match any Unicode character with \uxxxx or \xxx
How to match any Unicode character inside of vim using \%xxx, \%Xxx, \%uxxxx, or \%Uxxxxxxxx
How to match characters in the range 0–255 using octal format with \000
How to use Unicode character properties with \p{x}
How to match control characters with \e or \cH
More on how to use Perl on the command line (more Perl one-liners)

Technical Notes

I entered control characters in ascii.txt using vim (http://www.vim.org). In vim, you can use Ctrl+V followed by the appropriate control sequence for the character, such as Ctrl+C for the end-of-text character. I also used Ctrl+V followed by x and the two-digit hexadecimal code for the character. You can also use digraphs to enter control codes; in vim enter :digraph to see the possible codes. To enter a digraph, use Ctrl+K while in Insert mode, followed by a two-character digraph (for example, NU for null).
RegexHero (http://regexhero.net/tester) is a .NET regex implementation in a browser written by Steve Wortham. This one is for pay, but you can test it out for free, and if you like it, the prices are reasonable (you can buy it at a standard or a professional level).
vim (http://www.vim.org) is an evolution of the vi editor that was created by Bill Joy in 1976. The vim editor was developed primarily by Bram Moolenaar. It seems archaic to the uninitiated, but as I’ve mentioned, it is incredibly powerful.
The ack tool (http://betterthangrep.com) is written in Perl. It acts like grep and has many of its command line options, but it outperforms grep in many ways. For example, it uses Perl regular expressions instead of basic regular expressions like grep (without -E). For installation instructions, see http://betterthangrep.com/install/. I used the specific instructions under “Install the ack executable.” I didn’t use curl but just downloaded ack with the link provided and then copied the script into /usr/bin on both my Mac and a PC running Cygwin (http://www.cygwin.com) in Windows 7.