Chapter 2. Simple Pattern Matching

Regular expressions are all about matching and finding patterns in text, from simple patterns to the very complex. This chapter takes you on a tour of some of the simpler ways to match patterns: string literals, digits, word characters, whitespace, and characters of any kind.

In the first chapter, we used Steven Levithan’s RegexPal to demonstrate regular expressions. In this chapter, we’ll use Grant Skinner’s RegExr site, found at http://gskinner.com/regexr (see Figure 2-1).

Note

Each page of this book will take you deeper into the regular expression jungle. Feel free, however, to stop and smell the syntax. What I mean is, start trying out new things as soon as you discover them. Try. Fail fast. Get a grip. Move on. Nothing makes learning sink in like doing something with it.

Figure 2-1. Grant Skinner’s RegExr in Firefox

Before we go any further, I want to point out the help that RegExr provides. Over on the right side of RegExr, you’ll see three tabs. Take note of the Samples and Community tabs. The Samples tab provides help for a lot of regular expression syntax, and the Community tab shows you a large number of contributed regular expressions that have been rated. You’ll find a lot of good information in these tabs that may be useful to you. In addition, pop-ups appear when you hover over the regular expression or target text in RegExr, giving you helpful information. These resources are one of the reasons why RegExr is among my favorite online regex checkers.

This chapter introduces you to our main text, “The Rime of the Ancient Mariner,” by Samuel Taylor Coleridge, first published in Lyrical Ballads (London, J. & A. Arch, 1798). We’ll work with this poem in chapters that follow, starting with a plain-text version of the original and winding up with a version marked up in HTML5. The text for the whole poem is stored in a file called rime.txt; this chapter uses the file rime-intro.txt that contains only the first few lines.

The following lines are from rime-intro.txt:

THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.

ARGUMENT.

How a Ship having passed the Line was driven by Storms to the cold
Country towards the South Pole; and how from thence she made her course
to the tropical Latitude of the Great Pacific Ocean; and of the strange
things that befell; and in what manner the Ancyent Marinere came back to
his own Country.

I.

1      It is an ancyent Marinere,
2        And he stoppeth one of three:
3      "By thy long grey beard and thy glittering eye
4        "Now wherefore stoppest me?

Copy and paste the lines shown here into the lower text box in RegExr. You’ll find the file rime-intro.txt on GitHub at https://github.com/michaeljamesfitzgerald/Introducing-Regular-Expressions. You’ll also find the same file in the download archive found at http://examples.oreilly.com/9781449392680/examples.zip. The text is also available online at Project Gutenberg, but without the numbered lines (see http://www.gutenberg.org/ebooks/9622).

The most outright, obvious feature of regular expressions is matching strings with one or more literal characters, called string literals or just literals.

The way to match literal strings is with normal, literal characters. Sounds familiar, doesn’t it? This is similar to the way you might do a search in a word processing program or when submitting a keyword to a search engine. When you search for a string of text, character for character, you are searching with a string literal.

If you want to match the word Ship, for example, which is a word (string of characters) you’ll find early in the poem, just type the word Ship in the box at the top of RegExr, and then the word will be highlighted in the lower text box. (Be sure to capitalize the word.)

Did light blue highlighting show up below? You should be able to see the highlighting in the lower box. If you can’t see it, check what you typed again.
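If you would rather try this outside the browser, here is a minimal sketch in Python, which is my choice for illustration and not a tool this chapter otherwise relies on. It assumes Python 3 and its standard re module, and it searches two lines of rime-intro.txt for the literal Ship.

```python
import re

# Two lines from rime-intro.txt, as quoted earlier in the chapter.
text = (
    "How a Ship having passed the Line was driven by Storms to the cold\n"
    "Country towards the South Pole; and how from thence she made her course\n"
)

# A string literal matches itself, character for character.
match = re.search(r"Ship", text)
print(match.group(), match.start())

# Matching is case-sensitive by default, so a lowercase "ship" finds nothing.
print(re.search(r"ship", text))
```

The second search returns None, which is the programmatic version of seeing no highlighting when you forget to capitalize the word.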

In the top-left text box in RegExr, enter this character shorthand to match the digits:

\d

This matches all the Arabic digits in the text area below because the global checkbox is selected. Uncheck that checkbox, and \d will match only the first occurrence of a digit. (See Figure 2-2.)

Now in place of \d use a character class that matches the same thing. Enter the following range of digits in the top text box of RegExr:

[0-9]

As you can see in Figure 2-3, though the syntax is different, using \d does the same thing as [0-9].

The character class [0-9] is a range, meaning that it will match the range of digits 0 through 9. You could also match digits 0 through 9 by listing all the digits:

[0123456789]

If you want to match only the binary digits 0 and 1, you would use this character class:

[01]

Try [12] in RegExr and look at the result. With a character class, you can pick the exact digits you want to match. The character shorthand for digits (\d) is shorter and simpler, but it doesn’t have the power or flexibility of the character class. I use character classes when I can’t use \d (it’s not always supported) and when I need to get very specific about what digits I need to match; otherwise, I use \d because it’s a simpler, more convenient syntax.
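The same comparison can be sketched in Python, assuming Python 3 and the standard re module. Here re.findall plays the role of the global checkbox and re.search stops at the first match. The re.ASCII flag is my addition: without it, Python’s \d also matches digits from other scripts.

```python
import re

# The numbered lines of the poem from rime-intro.txt.
text = (
    "1      It is an ancyent Marinere,\n"
    "2        And he stoppeth one of three:\n"
    '3      "By thy long grey beard and thy glittering eye\n'
    '4        "Now wherefore stoppest me?\n'
)

shorthand = re.findall(r"\d", text, flags=re.ASCII)  # every digit, like global on
ranged = re.findall(r"[0-9]", text)                  # the equivalent range
listed = re.findall(r"[0123456789]", text)           # every digit spelled out
picked = re.findall(r"[12]", text)                   # only the digits 1 and 2

first = re.search(r"\d", text)                       # first digit only, like global off

print(shorthand)
print(picked)
print(first.group())
```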

As is often the case with shorthands, you can flip-flop—that is, you can go the other way. For example, if you want to match characters that are not digits, use this shorthand with an uppercase D:

\D

Try this shorthand in RegExr now. An uppercase D, rather than a lowercase, matches non-digit characters (check Figure 2-4). This shorthand is the same as the following character class, a negated class (a negated class says in essence, “don’t match these” or “match all but these”):

[^0-9]

which is the same as:

[^\d]
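As a quick check, this Python sketch (assuming Python 3 and the standard re module) runs all three forms against line 2 of the poem and compares what they match. On plain ASCII text like this, they agree.

```python
import re

# Line 2 of the poem from rime-intro.txt.
text = "2        And he stoppeth one of three:"

non_digits = re.findall(r"\D", text)            # the shorthand
negated_range = re.findall(r"[^0-9]", text)     # the negated range
negated_shorthand = re.findall(r"[^\d]", text)  # the negated shorthand

# All three skip the leading 2 and match everything else.
print(non_digits == negated_range == negated_shorthand)
print("".join(non_digits))
```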

In RegExr, now swap \D with:

\w

This shorthand will match all word characters (if the global option is still checked). The difference between \D and \w is that \D matches whitespace, punctuation, quotation marks, hyphens, forward slashes, square brackets, and other similar characters, while \w does not; it matches letters, digits, and the underscore.

In English, \w matches essentially the same thing as the character class:

[a-zA-Z0-9_]

Now to match a non-word character, use an uppercase W:

\W

This shorthand matches whitespace, punctuation, and other kinds of characters that aren’t used in words in this example. It is the same as using the following character class:

[^a-zA-Z0-9_]

Character classes, granted, allow you more control over what you match, but sometimes you don’t want or need to type out all those characters. This is known as the “fewest keystrokes win” principle. But sometimes you must type all that stuff out to get precisely what you want. It is your choice.

Just for fun, in RegExr try both:

[^\w]

and

[^\W]

Do you see the differences in what they match?
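If you want to confirm your answer, here is a small Python sketch, assuming Python 3 and the standard re module. It shows that a negated class built from a shorthand behaves like the opposite shorthand. The string a_1 is a made-up sample of mine, included because in Python the underscore counts as a word character.

```python
import re

# Line 3 of the poem, shortened.
text = '3 "By thy long grey beard'

words = re.findall(r"\w+", text)    # runs of word characters
non_words = re.findall(r"\W", text) # single non-word characters

# A double negative cancels out: [^\w] acts like \W, and [^\W] acts like \w.
print(re.findall(r"[^\w]", text) == non_words)
print(re.findall(r"[^\W]+", text) == words)

# The underscore is a word character, but it is not in [a-zA-Z0-9].
with_shorthand = re.findall(r"\w", "a_1")
with_class = re.findall(r"[a-zA-Z0-9]", "a_1")
print(with_shorthand, with_class)
```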

Table 2-1 provides an extended list of character shorthands. Not all of these work in every regex processor.

To match whitespace, you can use this shorthand:

\s

Try this in RegExr and see what lights up (see Figure 2-5). For the text in this chapter, the following character class matches the same thing as \s:

[ \t\n\r]

In other words, it matches:

  • Spaces

  • Tabs (\t)

  • Line feeds (\n)

  • Carriage returns (\r)

Note

Spaces and tabs are highlighted in RegExr, but not line feeds or carriage returns.

As you can imagine, \s has its compañero. To match a non-whitespace character, use:

\S

This matches everything except whitespace. It matches the same thing as the negated character class:

[^ \t\n\r]

Or:

[^\s]

Test these out in RegExr to see what happens.
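Because RegExr does not highlight line endings, it can help to count the matches instead. This Python sketch (assuming Python 3 and the standard re module) does that for line 1 of the poem. Keep in mind that Python’s \s also matches a few other characters, such as form feeds and vertical tabs, which do not occur in this text.

```python
import re

# Line 1 of the poem: the number, six spaces, then the verse and a line feed.
text = "1" + " " * 6 + "It is an ancyent Marinere,\n"

whitespace = re.findall(r"\s", text)            # spaces plus the line feed
tokens = re.findall(r"\S+", text)               # runs of non-whitespace
same_tokens = re.findall(r"[^ \t\n\r]+", text)  # the equivalent negated class

print(len(whitespace))
print(tokens)
print(tokens == same_tokens)
```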

In addition to those characters matched by \s, there are other, less common whitespace characters. Table 2-2 lists character shorthands for common whitespace characters and a few that are more rare.

There is a way to match any character with regular expressions and that is with the dot, also known as a period or a full stop (U+002E). The dot matches all characters but line ending characters, except under certain circumstances.

In RegExr, turn off the global setting by clicking the checkbox next to it. Now any regular expression will match on the first match it finds in the target.

Now to match a single character, any character, just enter a single dot in the top text box of RegExr.

In Figure 2-6, you see that the dot matches the first character in the target, namely, the letter T.

If you wanted to match the entire phrase THE RIME, you could use eight dots:

........

But this isn’t very practical, so I don’t recommend using a series of dots like this often, if ever. Instead of eight dots, use a quantifier:

.{8}

and it would match the first two words and the space in between, but crudely so. To see what I mean by crudely, click the checkbox next to global and see how useless this really is. It matches sequences of eight characters, end on end, all but the last few characters of the target.
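To see the crude chunking without squinting at highlights, here is a sketch in Python, assuming Python 3 and the standard re module. It applies the dot and the .{8} quantifier to the title line of the poem.

```python
import re

# The title line from rime-intro.txt (49 characters plus a line feed).
text = "THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n"

first_char = re.search(r".", text).group()       # a single dot: one character
first_eight = re.search(r".{8}", text).group()   # eight characters, first match only
chunks = re.findall(r".{8}", text)               # every match, like global on

print(first_char)
print(first_eight)
print(chunks)
```

The list holds six eight-character chunks that pay no attention to word breaks, and the final period is left over because only one character remains before the line feed.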

Let’s try a different tack with word boundaries and starting and ending letters. Type the following in the upper text box of RegExr to see a slight difference:

\bA.{5}T\b

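This expression has a bit more specificity. (Try saying specificity three times, out loud.) It matches the word ANCYENT, an archaic spelling of ancient. How?

  • The shorthand \b matches a word boundary, without consuming any characters.

  • The characters A and T delimit the sequence of characters.

  • .{5} matches any five characters.

  • The final \b matches another word boundary.

This regular expression would actually match either ANCYENT or ANCIENT.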

Now try it with a shorthand:

\b\w{7}\b
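The two expressions are not interchangeable, and a short Python sketch (assuming Python 3 and the standard re module) shows why. The phrases THE ANCIENT MARINER and AND NOT are made-up test strings of mine, not lines from the poem.

```python
import re

title = "THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS."

# A, any five characters, then T, with a word boundary on each side.
archaic = re.search(r"\bA.{5}T\b", title).group()
modern = re.search(r"\bA.{5}T\b", "THE ANCIENT MARINER").group()

# The dot also matches a space, so this pattern can straddle two words.
straddle = re.search(r"\bA.{5}T\b", "AND NOT").group()

# \w{7} insists on seven word characters, so it matches whole words only.
seven_letter_words = re.findall(r"\b\w{7}\b", title)
no_straddle = re.search(r"\b\w{7}\b", "AND NOT")

print(archaic, modern, repr(straddle))
print(seven_letter_words, no_straddle)
```

The shorthand version is less specific about which letters appear, but stricter about staying inside a single word.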

Finally, I’ll talk about matching zero or more characters:

.*

which is the same as:

[^\n]*

or

[^\n\r]*

Similar to this is the dot used with the one or more quantifier (+):

.+

Try these in RegExr and they will, either of them, match the first line (uncheck global). The reason why is that, normally, the dot does not match newline characters, such as a line feed (U+000A) or a carriage return (U+000D). Click the checkbox next to dotall in RegExr, and then .* or .+ will match all the text in the lower box. (dotall means a dot will match all characters, including newlines.)

It does this because these quantifiers are greedy; in other words, they match all the characters they can. But don’t worry about that quite yet. Chapter 7 explains quantifiers and greediness in more detail.
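Here is roughly the same experiment in Python, assuming Python 3 and the standard re module, where the re.DOTALL flag plays the part of RegExr’s dotall checkbox.

```python
import re

# The first three lines of rime-intro.txt.
text = "THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT.\n"

# By default the dot stops at a line feed, so the match ends with line 1.
first_line = re.search(r".*", text).group()
also_first_line = re.search(r".+", text).group()
negated_class = re.search(r"[^\n]*", text).group()

# With DOTALL the dot matches line feeds too, so the greedy .* takes everything.
everything = re.search(r".*", text, flags=re.DOTALL).group()

print(first_line)
print(first_line == also_first_line == negated_class)
print(everything == text)
```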

“The Rime of the Ancient Mariner” is just plain text. What if you wanted to display it on the Web? What if you wanted to mark it up as HTML5 using regular expressions, rather than by hand? How would you do that?

In some of the following chapters, I’ll show you ways to do this. I’ll start out small in this chapter and then add more and more markup as we go along.

In RegExr, click the Replace tab, check multiline, and then, in the first text box, enter:

(^T.*$)

Beginning at the top of the file, this will match the first line of the poem and then capture that text in a group using parentheses. In the next box, enter:

<h1>$1</h1>

The replacement string surrounds the captured group, represented by $1, in an h1 element. You can see the result in the lowest text area. The $1 is a backreference, in Perl style. In most implementations, including Perl, you use this style: \1; but RegExr supports only $1, $2, $3 and so forth. You’ll learn more about groups and backreferences in Chapter 4.
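For comparison, here is a sketch of the same replacement in Python, assuming Python 3 and the standard re module. Python writes the backreference as \1 in the replacement string rather than $1, and the re.MULTILINE flag corresponds to the multiline checkbox. The count=1 argument is my addition, so that only the first matching line is changed.

```python
import re

# The first three lines of rime-intro.txt.
text = "THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT.\n"

# Capture the first line that starts with T, then wrap the capture in an h1 element.
marked_up = re.sub(
    r"(^T.*$)",        # group 1: a whole line beginning with T
    r"<h1>\1</h1>",    # \1 refers back to group 1
    text,
    count=1,
    flags=re.MULTILINE,
)

print(marked_up)
```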

On a command line, you could also do this with sed. sed is a Unix stream editor that accepts regular expressions and allows you to transform text. It was first developed in the early 1970s by Lee McMahon at Bell Labs. If you are on a Mac or have a Linux box, you already have it.

Test out sed at a shell prompt (such as in a Terminal window on a Mac) with this line:

echo Hello | sed s/Hello/Goodbye/

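This is what should have happened:

  • The echo command prints the word Hello to standard output (usually your screen), but the vertical bar (|) pipes it to the sed command that follows.

  • This pipe directs the output of echo to the standard input of sed.

  • The s (substitute) command of sed then changes the word Hello to Goodbye, and Goodbye is displayed on your screen.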

If you don’t have sed on your platform already, at the end of this chapter you’ll find some technical notes with some pointers to installation information. You’ll find discussed there two versions of sed: BSD and GNU.

Now try this: At a command or shell prompt, enter:

sed -n 's/^/<h1>/;s/$/<\/h1>/p;q' rime.txt

And the output will be:

<h1>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.</h1>

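Here is what the command did, broken down into parts:

  • The line starts by invoking the sed program.

  • The -n option suppresses sed’s default behavior of echoing each line of input to the output, because you want to see only the line affected by the regex, that is, line 1.

  • s/^/<h1>/ places an h1 start-tag at the beginning (^) of the line.

  • The semicolon (;) separates commands.

  • s/$/<\/h1>/ places an h1 end-tag at the end ($) of the line. The backslash escapes the forward slash so that sed does not treat it as a delimiter.

  • The p command prints the affected line (line 1).

  • Lastly, the q command quits the program so that sed processes only the first line.

  • All these operations are performed against the file rime.txt.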

Another way of writing this line is with the -e option. The -e option appends the editing commands, one after another. I prefer the method with semicolons, of course, because it’s shorter.

sed -ne 's/^/<h1>/' -e 's/$/<\/h1>/p' -e 'q' rime.txt

You could also collect these commands in a file, as with h1.sed shown here (this file is in the code repository mentioned earlier):

#!/usr/bin/sed

s/^/<h1>/
s/$/<\/h1>/
q

To run it, type:

sed -f h1.sed rime.txt

at a prompt in the same directory or folder as rime.txt.
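If it helps to see the logic of that script spelled out step by step, here is a rough Python equivalent, assuming Python 3 and its standard io and re modules. The function name wrap_first_line is hypothetical, and io.StringIO stands in for the file rime.txt so that the sketch runs on its own.

```python
import io
import re

def wrap_first_line(stream, tag="h1"):
    """Wrap the first line of a text stream in <tag>...</tag> (hypothetical helper)."""
    for line in stream:
        line = line.rstrip("\n")                # work on the line without its line feed
        line = re.sub(r"^", f"<{tag}>", line)   # like s/^/<h1>/
        line = re.sub(r"$", f"</{tag}>", line)  # like s/$/<\/h1>/
        return line                             # like q: stop after the first line
    return ""                                   # the stream was empty

# io.StringIO stands in for the file rime.txt here.
rime = io.StringIO("THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\n\nARGUMENT.\n")
print(wrap_first_line(rime))
```

The line feed is stripped first because, in Python, $ can match both just before a trailing line feed and at the very end of the string, which would insert the end-tag twice.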

Finally, I’ll show you how to do a similar process with Perl. Perl is a general purpose programming language created by Larry Wall back in 1987. It’s known for its strong support of regular expressions and its text processing capabilities.

Find out if Perl is already on your system by typing this at a command prompt, followed by Return or Enter:

perl -v

This should return the version of Perl on your system or an error (see Technical Notes).

To accomplish the same output as shown in the sed example, enter this line at a prompt:

perl -ne 'if ($. == 1) { s/^/<h1>/; s/$/<\/h1>/m; print; }' rime.txt

and, as with the sed example, you will get this result:

<h1>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.</h1>

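Here is what happened in the Perl command, broken down again into pieces:

  • perl invokes the Perl program.

  • The -n option loops through the input (the file rime.txt) one line at a time, without printing each line automatically.

  • The -e option lets you submit program code on the command line rather than from a file.

  • The if statement checks whether you are on line 1. $. is a special variable in Perl that holds the current input line number.

  • The first substitute command (s) finds the beginning of the first line (^) and inserts an h1 start-tag there.

  • The second substitute command finds the end of the line ($) and inserts an h1 end-tag there.

  • The m (multiline) modifier at the end of the second substitute command lets $ match at the end of a line rather than only at the end of the string.

  • Finally, print writes the result to standard output.

  • All these operations are performed against the file rime.txt.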

You could also hold all these commands in a program file, such as this file, h1.pl, found in the example archive.

#!/usr/bin/perl -n

if ($. == 1) {
  s/^/<h1>/;
  s/$/<\/h1>/m;
  print;
}

And then, in the same directory as rime.txt, run the program like this:

perl h1.pl rime.txt

There are a lot of ways you can do things in Perl. I am not saying this is the most efficient way to add these tags. It is simply one way. Chances are, by the time this book is in print, I’ll think of other, more efficient ways to do things with Perl (and other tools). I hope you will, too.

In the next chapter, we’ll talk about boundaries and what are known as zero-width assertions.