You will have occasion to match characters or ranges of characters that are outside the scope of ASCII. ASCII, or the American Standard Code for Information Interchange, defines an English character set—the letters A through Z in upper- and lowercase, plus control and other characters. It’s been around for a long time: The 128-character Latin-based set was standardized in 1968. That was back before there was such a thing as a personal computer, before VisiCalc, before the mouse, before the Web, but I still look up ASCII charts online regularly.
I remember when I started my career many years ago, I worked with an engineer who kept an ASCII code chart in his wallet. Just in case. The ASCII Code Chart: Don’t leave home without it.
So I won’t gainsay the importance of ASCII, but now it is dated, especially in light of the Unicode standard (http://www.unicode.org), which currently represents over 100,000 characters. Unicode, however, does not leave ASCII in the dust; it incorporates ASCII into its Basic Latin code table (see http://www.unicode.org/charts/PDF/U0000.pdf).
In this chapter, you will step out of the province of ASCII into the not-so-new world of Unicode.
The first text is voltaire.txt from the code archive, a quote from Voltaire (1694–1778), the French Enlightenment philosopher.
Qu’est-ce que la tolérance? c’est l’apanage de l’humanité. Nous sommes tous pétris de faiblesses et d’erreurs; pardonnons-nous réciproquement nos sottises, c’est la première loi de la nature.
Here is an English translation:
What is tolerance? It is the consequence of humanity. We are all formed of frailty and error; let us pardon reciprocally each other’s folly—that is the first law of nature.
There are a variety of ways you can specify a Unicode character, also known as a code point. (For the purposes of this book, a Unicode character is one that is outside of the range of ASCII, though that is not strictly accurate.)
Start out by placing the Voltaire quote in Regexpal (http://www.regexpal.com), and then entering this regular expression:
\u00e9
The \u is followed by a hexadecimal value 00e9 (this is case insensitive; that is, 00E9 works, too). The value 00e9 is equivalent to the decimal value 233, well out of the ASCII range (0–127).
Notice that the letter é (small letter e with an acute accent) is highlighted in Regexpal (see Figure 6-1). That’s because é is the code point U+00E9 in Unicode, which was matched by \u00e9.
Regexpal uses the JavaScript implementation of regular expressions. JavaScript also allows you to use this syntax:
\xe9
Try this in Regexpal and see how it matches the same character as \u00e9.
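The same match can be reproduced outside the browser. What follows is a minimal sketch in JavaScript (the engine Regexpal relies on), assuming a runtime such as Node.js. The variable names are illustrative, the string is only the opening of the Voltaire quote, and é is written as an escape with straight apostrophes so the example does not depend on file encoding.

```javascript
// Opening words of the Voltaire quote (U+00E9 is the letter e with an acute accent).
const voltaire = "Qu'est-ce que la tol\u00e9rance? c'est l'apanage de l'humanit\u00e9.";

// Two spellings of the same code point: four hex digits or two.
const fourDigit = /\u00e9/g;
const twoDigit = /\xe9/g;

// match() with the g flag returns every match, or null when there is none.
const fourDigitMatches = voltaire.match(fourDigit) || [];
const twoDigitMatches = voltaire.match(twoDigit) || [];

console.log(fourDigitMatches.length, twoDigitMatches.length);
console.log("\u00e9".codePointAt(0)); // decimal value of U+00E9
```

Both patterns should report the same number of matches, because they name the same code point.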
Let’s try it with a different regex engine. Open http://regexhero.net/tester/ in a browser. Regex Hero is written in .NET and has a slightly different syntax. Drop the contents of the file basho.txt into the text area labeled Target String. This contains a famous haiku written by the Japanese poet Matsuo Basho (who, coincidentally, died just one week after Voltaire was born).
Here is the poem in Japanese:
古池 蛙飛び込む 水の音 —芭蕉 (1644–1694)
And here is a translation in English:
At the ancient pond a frog plunges into the sound of water. —Basho (1644–1694)
To match part of the Japanese text, in the text area marked Regular Expression, type the following:
\u6c60
This is the code point for the Japanese (Chinese) character for pond. It will be highlighted below (see Figure 6-2).
While you are here, try matching the em dash (—) with:
\u2014
Or the en dash (–) with:
\u2013
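Regex Hero runs on .NET, but the \u notation used here is also accepted by JavaScript. The following is a sketch in JavaScript rather than .NET, on that assumption. The variable names are illustrative, and the two dashes are written as escapes inside the string so they cannot be confused with each other.

```javascript
// The haiku from basho.txt; \u2014 is the em dash and \u2013 the en dash.
const basho = "古池 蛙飛び込む 水の音 \u2014芭蕉 (1644\u20131694)";

const pond = basho.match(/\u6c60/);   // the character for "pond"
const emDash = basho.match(/\u2014/); // em dash
const enDash = basho.match(/\u2013/); // en dash

// Without the g flag, match() also reports where the first match starts.
console.log(pond[0], pond.index);
console.log(emDash.index, enDash.index);
```

The index values are zero-based positions in the string, which is roughly what a tester shows when it highlights a match.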
Now look at these characters in an editor.
If you have vim on your system, you can open basho.txt with it, as shown:
vim basho.txt
Now, starting with a slash (/), enter a search with this line:
/\%u6c60
followed by Enter or Return. The cursor moves to the beginning of the match, as you can see in Figure 6-3. Table 6-1 shows you your options. You can use x or X following the \% to match values in the range 0–255 (0–FF), u to match up to four hexadecimal digits in the range 256–65,535 (100–FFFF), or U to match up to eight hexadecimal digits in the range 65,536–2,147,483,647 (10000–7FFFFFFF). That takes in a lot of code points, far more than currently exist in Unicode.
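To see how the three ranges map onto the three prefixes, here is a small sketch in JavaScript. The helper vimCodePointPattern is hypothetical (it is not part of vim or of any library). It picks the shortest form described above for a given character and pads the hexadecimal value to two, four, or eight digits.

```javascript
// Hypothetical helper: build a vim search atom (\%x, \%u, or \%U) for one character.
function vimCodePointPattern(ch) {
  const cp = ch.codePointAt(0);  // numeric code point
  const hex = cp.toString(16);   // lowercase hexadecimal digits
  if (cp <= 0xff) return "\\%x" + hex.padStart(2, "0");   // 0-255
  if (cp <= 0xffff) return "\\%u" + hex.padStart(4, "0"); // 256-65,535
  return "\\%U" + hex.padStart(8, "0");                   // 65,536 and up
}

console.log(vimCodePointPattern("\u6c60")); // the "pond" character
console.log(vimCodePointPattern("\u00e9")); // e with an acute accent
console.log(0xff, 0xffff, 0x7fffffff);      // upper bounds of the three ranges
```

Type the returned text after the leading / in vim to run the search.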
You can also match characters using an octal (base 8) number, which uses the digits 0 to 7. In regex, this is done with three digits, preceded by a backslash (\).
For example, the following octal number:
\351
is the same as:
\u00e9
Experiment with it in Regexpal with the Voltaire text. \351 matches é, with a little less typing.
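The equivalence between the octal, hexadecimal, and decimal forms can be checked with a few lines of JavaScript. This is a sketch with illustrative variable names. Note one assumption about the engine: JavaScript accepts an octal escape in a pattern only when the u flag is not set.

```javascript
// 351 in base 8, e9 in base 16, and 233 in base 10 are the same number.
const fromOctal = parseInt("351", 8);
const fromHex = parseInt("e9", 16);

// Without the u flag, JavaScript reads \351 in a pattern as an octal escape.
const octalPattern = /\351/;
const matchesEAcute = octalPattern.test("tol\u00e9rance");

console.log(fromOctal, fromHex, matchesEAcute);
```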
In some implementations, such as Perl, you can match on Unicode character properties. The properties include characteristics like whether the character is a letter, number, or punctuation mark.
I’ll now introduce you to ack, a command-line tool written in Perl that acts a lot like grep (see http://betterthangrep.com). It won’t come preinstalled on your system; you have to download and install it yourself (see Technical Notes).
We’ll use ack on an excerpt from Friedrich Schiller’s “An die Freude,” composed in 1785 (German, if you can’t tell):
An die Freude. Freude, schöner Götterfunken, Tochter aus Elisium, Wir betreten feuertrunken Himmlische, dein Heiligthum. Deine Zauber binden wieder, was der Mode Schwerd getheilt; Bettler werden Fürstenbrüder, wo dein sanfter Flügel weilt. Seid umschlungen, Millionen! Diesen Kuß der ganzen Welt! Brüder, überm Sternenzelt muß ein lieber Vater wohnen.
There are a few interesting characters in this excerpt, beyond ASCII’s small realm. We’ll look at the text of this poem through properties. (If you would like a translation of this poem fragment, you can drop it into Google Translate.)
Using ack on a command line, you can specify that you want to see all the characters whose property is Letter (L):
ack '\pL' schiller.txt
This will show you all the letters highlighted. For lowercase letters, use Ll, surrounded by braces:
ack '\p{Ll}' schiller.txt
You must add the braces. For uppercase, it’s Lu:
ack '\p{Lu}' schiller.txt
To specify characters that do not match a property, we use uppercase P:
ack '\PL' schiller.txt
This highlights characters that are not letters.
The following finds those that are not lowercase letters:
ack '\P{Ll}' schiller.txt
And this highlights the ones that are not uppercase:
ack '\P{Lu}' schiller.txt
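Recent JavaScript engines support the same property names, so the ack searches above can be approximated in a short script. This is a sketch, assuming an engine with Unicode property escapes (for example, a current Node.js). Unlike Perl, JavaScript requires both the u flag and the braces, so a form like \pL is not available. The variable names are illustrative.

```javascript
// First line of the excerpt after the title; \u00f6 is o with an umlaut.
const line = "Freude, sch\u00f6ner G\u00f6tterfunken";

// Property escapes need the u flag and braces in JavaScript.
const letters = line.match(/\p{L}/gu) || [];    // any letter
const lower = line.match(/\p{Ll}/gu) || [];     // lowercase letters
const upper = line.match(/\p{Lu}/gu) || [];     // uppercase letters
const nonLetters = line.match(/\P{L}/gu) || []; // everything that is not a letter

console.log(letters.length, lower.length, upper.length, nonLetters.length);
console.log(upper.join(""));
```

The letter and non-letter counts add up to the length of the string, which is a quick way to confirm that \p and \P are complements.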
You can also do this in yet another browser-based regex tester, http://regex.larsolavtorvik.com. Figure 6-4 shows the Schiller text with its lowercase letters highlighted using the lowercase property (\p{Ll}).
Table 6-2 lists character property names for use with \p{property} or \P{property} (see pcresyntax(3) at http://www.pcre.org/pcre.txt). You can also match human languages with properties; see Table A-8.
Table 6-2. Character properties
Property | Description |
---|---|
C | Other |
Cc | Control |
Cf | Format |
Cn | Unassigned |
Co | Private use |
Cs | Surrogate |
L | Letter |
Ll | Lowercase letter |
Lm | Modifier letter |
Lo | Other letter |
Lt | Title case letter |
Lu | Uppercase letter |
L& | Ll, Lu, or Lt |
M | Mark |
Mc | Spacing mark |
Me | Enclosing mark |
Mn | Non-spacing mark |
N | Number |
Nd | Decimal number |
Nl | Letter number |
No | Other number |
P | Punctuation |
Pc | Connector punctuation |
Pd | Dash punctuation |
Pe | Close punctuation |
Pf | Final punctuation |
Pi | Initial punctuation |
Po | Other punctuation |
Ps | Open punctuation |
S | Symbol |
Sc | Currency symbol |
Sk | Modifier symbol |
Sm | Mathematical symbol |
So | Other symbol |
Z | Separator |
Zl | Line separator |
Zp | Paragraph separator |
Zs | Space separator |
How do you match control characters? It’s not all that common that you will search for control characters in text, but it’s a good thing to know. In the example repository or archive, you’ll find the file ascii.txt, a 128-line file that contains all the ASCII characters, each on a separate line (hence the 128 lines). When you perform a search on the file, it will usually return a single line if it finds a match. This file is good for testing and general fun.
If you search for strings or control characters in ascii.txt with grep or ack, they may interpret the file as a binary file. If so, when you run a search on it, either tool may simply report “Binary file ascii.txt matches” when it finds a match. That’s all.
In regular expressions, you can specify a control character like this:
\cx
where x is the control character you want to match.
Let’s say, for example, you wanted to find a null character in a file. You can use Perl to do that with the following command:
perl -n -e 'print if /\c@/' ascii.txt
Provided that you’ve got Perl on your system and it’s running properly, you will get this result:
0. Null
The reason is that there is a null character on that line, even though you can’t see the character in the result.
If you open ascii.txt with an editor other than vim, it will likely remove the control characters from the file, so I suggest you don’t do it.
You can also use \0 to find a null character. Try this, too:
perl -n -e 'print if /\0/' ascii.txt
Pressing on, you can find the bell (BEL) character using:
perl -n -e 'print if /\cG/' ascii.txt
It will return the line:
7. Bell
Or you can use the shorthand:
perl -n -e 'print if /\a/' ascii.txt
To find the escape character, use:
perl -n -e 'print if /\c[/' ascii.txt
which gives you:
27. Escape
Or do it with a shorthand:
perl -n -e 'print if /\e/' ascii.txt
How about a backspace character? Try:
perl -n -e 'print if /\cH/' ascii.txt
which spits back:
8. Backspace
You can also find a backspace using a bracketed expression:
perl -n -e 'print if /[\b]/' ascii.txt
Without the brackets, how would \b be interpreted? That’s right, as a word boundary, as you learned in Chapter 2. The brackets change the way the \b is understood by the processor. In this case, Perl sees it as a backspace character.
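Several of these escapes carry over to other engines. The sketch below uses JavaScript rather than Perl, with a hypothetical stand-in for a few lines of ascii.txt (the real file's layout is assumed here: a label followed by the character itself). JavaScript understands \0, \cG, \cH, and [\b], but it has no \a or \e shorthand and does not accept \c@ or \c[ as control escapes, so the escape character is written as \x1b instead.

```javascript
// Hypothetical stand-in for a few lines of ascii.txt: a label plus the character itself.
const lines = [
  "0. Null\x00",
  "7. Bell\x07",
  "8. Backspace\x08",
  "27. Escape\x1b",
  "65. A",
];

// Like perl -n -e 'print if /.../': keep only the lines that match.
const grep = (pattern) => lines.filter((l) => pattern.test(l));

// Strip the invisible control character so the label can be printed safely.
const label = (l) => l.replace(/[\x00-\x1f]/g, "");

console.log(grep(/\0/).map(label));                            // null
console.log(grep(/\cG/).map(label));                           // bell
console.log(grep(/\cH/).map(label), grep(/[\b]/).map(label));  // backspace, two ways
console.log(grep(/\x1b/).map(label));                          // escape
```

Outside brackets, \b still means a word boundary in JavaScript, just as it does in Perl.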
Table 6-3 lists the ways we matched characters in this chapter.
Table 6-3. Matching Unicode and other characters
Code | Description |
---|---|
\uxxxx | Unicode (four places) |
\xxx | Unicode (two places) |
| Unicode (four places) |
| Unicode (two places) |
\000 | Octal (base 8) |
\cx | Control character |
\0 | Null |
\a | Bell |
\e | Escape |
[\b] | Backspace |
That wraps things up for this chapter. In the next, you’ll learn more about quantifiers.
How to match any Unicode character with \uxxxx or \xxx
How to match any Unicode character inside of vim using \%xxx, \%Xxx, \%uxxxx, or \%Uxxxx
How to match characters in the range 0–255 using octal format with \000
How to use Unicode character properties with \p{x}
How to match control characters with \e or \cH
More on how to use Perl on the command line (more Perl one-liners)
I entered control characters in ascii.txt using vim (http://www.vim.org). In vim, you can use Ctrl+V followed by the appropriate control sequence for the character, such as Ctrl+C for the end-of-text character. I also used Ctrl+V followed by x and the two-digit hexadecimal code for the character. You can also use digraphs to enter control codes; in vim enter :digraph to see the possible codes. To enter a digraph, use Ctrl+K while in Insert mode, followed by a two-character digraph (for example, NU for null).
RegexHero (http://regexhero.net/tester) is a .NET regex implementation in a browser written by Steve Wortham. This one is for pay, but you can test it out for free, and if you like it, the prices are reasonable (you can buy it at a standard or a professional level).
vim (http://www.vim.org) is an evolution of the vi editor that was created by Bill Joy in 1976. The vim editor was developed primarily by Bram Moolenaar. It seems archaic to the uninitiated, but as I’ve mentioned, it is incredibly powerful.
The ack tool (http://betterthangrep.com) is written in Perl. It acts like grep and has many of its command line options, but it outperforms grep in many ways. For example, it uses Perl regular expressions instead of basic regular expressions like grep (without -E). For installation instructions, see http://betterthangrep.com/install/. I used the specific instructions under “Install the ack executable.” I didn’t use curl but just downloaded ack with the link provided and then copied the script into /usr/bin on both my Mac and a PC running Cygwin (http://www.cygwin.com) in Windows 7.