Chapter 4. Alternation, Groups, and Backreferences

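You have already seen groups in action. Groups surround text with parentheses to help perform some operation, such as alternation (a choice between two or more patterns), defining subpatterns, capturing text to reuse later with a backreference, or grouping text without capturing it.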

We’ll be using a few contrived examples, in addition to the text from “The Rime of the Ancyent Mariner” again, in rime.txt. This time, I’ll use the desktop version of RegExr, as well as other tools like sed. You can download the desktop version of RegExr from http://www.regexr.com, for Windows, Mac, or Linux (it was written with Adobe AIR). Click the Desktop Version link on the RegExr web page (lower-right corner) for more information.

Simply put, alternation gives you a choice of alternate patterns to match. For example, let’s say you want to find out how many times the article the occurs in “The Rime of the Ancyent Mariner.” The problem is that the word appears as THE, The, and the in the poem. You can use alternation to deal with this peculiarity.

Open the RegExr desktop application by double-clicking on its icon. It looks very much like the online version but has the advantage of being local on your machine, so you won’t suffer the network issues that sometimes occur when using web applications. I’ve copied and pasted the entire poem in RegExr desktop for the next exercise. I’m using it on a Mac running OS X Lion.

In the top text box, enter the pattern:

(the|The|THE)

and you’ll see all occurrences of the in the poem highlighted in the lower box (see Figure 4-1). Use the scroll bar to view more of the result.

We can make this group shorter by applying an option. Options let you specify the way you would like to search for a pattern. For example, the option:

(?i)

makes your pattern case-insensitive, so instead of using the original pattern with alternation, you can do this instead:

(?i)the

Try this in RegExr to see how it works. You can also specify case-insensitivity by checking ignoreCase in RegExr; either approach works. This and other options or modifiers are listed in Table 4-1.
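As a rough cross-check outside RegExr, the two patterns can be compared in Python’s re module. Python is an assumption here, since the chapter itself works with RegExr, grep, sed, and Perl, and the sample string is made up for illustration. The sketch also shows that (?i)the is slightly broader than the alternation: it accepts mixed-case spellings such as tHe.

```python
import re

# Hypothetical sample text, not taken from rime.txt.
sample = "THE RIME. The ship was cheered, the harbour cleared. tHe end."

# Alternation: only the three spellings listed in the group are matched.
alternation = re.findall(r"(the|The|THE)", sample)

# Inline option: any mix of upper- and lowercase letters is matched.
insensitive = re.findall(r"(?i)the", sample)

print(alternation)   # ['THE', 'The', 'the']
print(insensitive)   # ['THE', 'The', 'the', 'tHe']
```

If the text can contain odd capitalization, the two patterns are not strictly interchangeable, so pick the one that matches your intent.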

Let’s now use alternation with grep. The options in Table 4-1, by the way, don’t work with grep, so you are going to use the original alternation pattern. To count the number of lines where the word the occurs, regardless of case, one or more times, use:

grep -Ec "(the|The|THE)" rime.txt

and get this answer:

327

This result does not tell the whole story. Stay tuned.

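Here is an analysis of the grep command: the -E option tells grep to use extended regular expressions (EREs), so you don’t have to escape the parentheses or the vertical bar; the -c option returns a count of matching lines rather than the lines themselves; the quoted pattern (the|The|THE) matches any of the three forms of the word; and rime.txt is the file to search.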

To get a count of actual words used, this approach will return each occurrence of the word, one per line:

grep -Eo "(the|The|THE)" rime.txt | wc -l

This returns:

412

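And here is a bit more analysis: the -o option prints only the part of each line that matches the pattern, one match per output line; the vertical bar, in this context a shell pipe, sends that output to wc; and wc -l counts the lines it receives.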

Why the big difference between 327 and 412? Because -c gives you a count of matching lines, but there can be more than one match on each line. If you use -o with wc -l, then each occurrence of the various forms of the word will appear on a separate line and be counted, giving the higher number.
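The difference between counting lines and counting matches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python’s re module stands in for grep, and the three-line sample is hypothetical rather than the real rime.txt, so the totals differ from 327 and 412.

```python
import re

# Hypothetical three-line sample standing in for rime.txt.
lines = [
    "The ship was cheered, the harbour cleared,",
    "Merrily did we drop",
    "Below the kirk, below the hill,",
]

pattern = re.compile(r"(the|The|THE)")

# Like grep -c: count the lines that contain at least one match.
matching_lines = sum(1 for line in lines if pattern.search(line))

# Like grep -o piped to wc -l: count every individual match.
total_matches = sum(len(pattern.findall(line)) for line in lines)

print(matching_lines, total_matches)   # 2 4
```

Two lines contain the word, but each contains it twice, so the per-match count is double the per-line count.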

To perform this same match with Perl, write your command this way:

perl -ne 'print if /(the|The|THE)/' rime.txt

Or better yet, you can do it with the (?i) option mentioned earlier, but without alternation:

perl -ne 'print if /(?i)the/' rime.txt

Or even better yet, append the i modifier after the last pattern delimiter:

perl -ne 'print if /the/i' rime.txt

and you will get the same outcome. The simpler the better. For a list of additional modifiers (also called flags), see Table 4-2. Also, compare options (similar but with a different syntax) in Table 4-1.

Most often, when you refer to subpatterns in regular expressions, you are referring to a group or groups within groups. A subpattern is a pattern within a pattern. Often, a subpattern gets a chance to match only after a preceding pattern has matched, but not always. Subpatterns can be designed in a variety of ways, but we’re concerned primarily with those defined within parentheses here.

In one sense, the pattern you saw earlier:

(the|The|THE)

has three subpatterns: the is the first subpattern, The is the second, and THE the third, but matching the second subpattern, in this instance, is not dependent on matching the first. (The alternatives are tried from left to right, so the leftmost is tried first.)

Now here is one where the subpattern(s) depend on the previous pattern:

(t|T)h(e|eir)

In plain language, this will match the literal character t or T, followed by an h, followed by either an e or the letters eir. Accordingly, this pattern will match any of the, The, their, or Their.

In this case, the second subpattern (e|eir) is dependent on the first (t|T).
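To see which strings the dependent subpatterns accept, here is a minimal sketch using Python’s re module (an assumption; the chapter does not use Python). The candidate words are made up for the test.

```python
import re

pattern = re.compile(r"(t|T)h(e|eir)")

# fullmatch requires the whole string to fit the pattern.
candidates = ["the", "The", "their", "Their", "this", "THE"]
accepted = [word for word in candidates if pattern.fullmatch(word)]
print(accepted)    # ['the', 'The', 'their', 'Their']

# search takes the first alternative that succeeds, so on "their"
# the e branch wins and only "the" is reported.
first_hit = pattern.search("their").group(0)
print(first_hit)   # the
```

The second result is worth noting: because e is listed before eir, an unanchored search never reaches the longer alternative. Writing the group as (eir|e) would prefer the longer form.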

Subpatterns don’t require parentheses. Here is an example of subpatterns done with character classes:

\b[tT]h[ceinry]*\b

This pattern can match, in addition to the or The, words such as thee, thy and thence. The two word boundaries (\b) mean the pattern will match whole words, not letters embedded in other words.
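The effect of the word boundaries can be checked with a short Python sketch. Python’s re module is an assumption here, and the sample string is a made-up list of words rather than a line from the poem.

```python
import re

pattern = re.compile(r"\b[tT]h[ceinry]*\b")

# Hypothetical word list: the last four words should not match.
sample = "The thy thence thee then this that other hath"
words = pattern.findall(sample)

print(words)   # ['The', 'thy', 'thence', 'thee', 'then']
```

The words this and that fail because s and a are not in the character class, while other and hath fail because the th is not at the start of a word.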

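Here is a complete analysis of this pattern: the first \b matches a word boundary at the start of the word; [tT] is a character class that matches either t or T; h matches a literal h; [ceinry]* matches zero or more of the letters c, e, i, n, r, or y; and the final \b matches a word boundary at the end of the word.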

When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. You can reuse that content if you wish by using a backreference, in the form:

\1

or:

$1

where \1 or $1 reference the first captured group, \2 or $2 reference the second captured group, and so on. sed will only accept the \1 form, but Perl accepts both.

You have already seen this in action, but I’ll demonstrate it here again. We’ll use it to rearrange the wording of a line of the poem, with apologies to Samuel Taylor Coleridge. In the top text box in RegExr, after clicking the Replace tab, enter this pattern:

(It is) (an ancyent Marinere)

Scroll the subject text (third text area) down until you can see the highlighted line, and then in the second box, enter:

$2 $1

and you’ll see in the lowest box the line rearranged as:

an ancyent Marinere It is,

(See Figure 4-2.)

Here is how to accomplish the same result with sed:

sed -En 's/(It is) (an ancyent Marinere)/\2 \1/p' rime.txt

and the output will be:

an ancyent Marinere It is,

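just as in RegExr. Let’s analyze the sed command to help you understand everything that is going on: the -E option invokes extended regular expressions, so the parentheses don’t have to be escaped; the -n option suppresses sed’s default printing of every line; the s command performs the substitution; the pattern captures It is as the first group and an ancyent Marinere as the second; the replacement \2 \1 writes the two captured groups back in reverse order; and the p flag prints only the line where the substitution took place.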

A similar command in Perl will do the same thing:

perl -ne 'print if s/(It is) (an ancyent Marinere)/\2 \1/' rime.txt

Notice that this uses the \1 style syntax. You can, of course, use the $1 syntax, too:

perl -ne 'print if s/(It is) (an ancyent Marinere)/$2 $1/' rime.txt

I like how Perl lets you print a selected line without jumping through hoops.
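For comparison, the same rearrangement can be sketched with Python’s re.sub. Python is an assumption here, not one of the chapter’s tools, and the input is a single line rather than the whole file. Python’s replacement strings accept the \1 form (or \g<1>), not $1.

```python
import re

line = "It is an ancyent Marinere,"

# \1 and \2 in the replacement refer to the first and second captured groups.
swapped = re.sub(r"(It is) (an ancyent Marinere)", r"\2 \1", line)

print(swapped)   # an ancyent Marinere It is,
```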

I’d like to point out something about the output:

an ancyent Marinere It is,

The capitalization got mixed up in the transformation. Perl can fix that with \u and \l. Here’s how:

perl -ne 'print if s/(It is) (an ancyent Marinere)/\u$2 \l$1/' rime.txt

Now the result looks much better:

An ancyent Marinere it is,

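And here is why: \u makes the character that follows it uppercase, and \l makes the character that follows it lowercase. These escapes are part of Perl’s string syntax, not part of the regular expression.

Each of these affects only the next character. The related directives \U and \L change the case of everything that follows and remain in effect until another is found (like \L or \E, the end of a quoted string). Experiment with these to see how they work.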
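Not every tool has case-changing escapes like these. As a hedged illustration, Python’s re module (an assumption; it is not used elsewhere in this chapter) has no \u or \l in replacement strings, but a replacement function can produce the same result. The function name swap_and_recase is hypothetical.

```python
import re

line = "It is an ancyent Marinere,"

def swap_and_recase(match):
    first, second = match.group(1), match.group(2)
    # Uppercase only the first character of group 2 (the effect of \u$2)
    # and lowercase only the first character of group 1 (the effect of \l$1).
    return second[0].upper() + second[1:] + " " + first[0].lower() + first[1:]

fixed = re.sub(r"(It is) (an ancyent Marinere)", swap_and_recase, line)

print(fixed)   # An ancyent Marinere it is,
```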

There are also non-capturing groups, that is, groups that don’t store their content in memory. Sometimes this is an advantage, especially if you never intend to reference the group. Because a non-capturing group doesn’t store its content, it may yield better performance, though the difference is hardly perceptible when running the simple examples in this book.

Remember the first group discussed in this chapter? Here it is again:

(the|The|THE)

You don’t need to backreference anything, so you could write a non-capturing group this way:

(?:the|The|THE)

Going back to the beginning of this chapter, you could add an option to make the pattern case-insensitive, like this (though the option obviates the need for a group):

(?i)(?:the)

Or you could do it this way:

(?:(?i)the)

Or, better yet, the pièce de résistance:

(?i:the)

The option letter i can be inserted between the question mark and the colon.
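The difference between capturing and non-capturing groups can be observed directly in Python’s re module, used here as an assumption since the chapter does not name a language for these patterns. The scoped (?i:the) form needs Python 3.6 or later. Recent Python versions reject an inline (?i) that is not at the start of the pattern, so the (?:(?i)the) form is left out of this sketch.

```python
import re

capturing = re.compile(r"(the|The|THE)")
non_capturing = re.compile(r"(?:the|The|THE)")
scoped_option = re.compile(r"(?i:the)")

# .groups reports how many capturing groups a compiled pattern defines.
group_counts = (capturing.groups, non_capturing.groups, scoped_option.groups)
print(group_counts)   # (1, 0, 0)

# With no capturing group, findall returns each whole match.
found = scoped_option.findall("THE Rime of the Ancyent Marinere")
print(found)          # ['THE', 'the']
```

All three patterns find the word, but only the first stores it for a backreference.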