Learning Patterns Through Examples

Really good regular expressions, the kind used in production software, can get very long and complicated. Since you’re learning, we’ll keep our regular expressions simple for now. This chapter is not meant to be a complete tutorial on regular expressions, but it should be sufficient to help you grasp the concept so you can experiment and continue learning on your own. Later, you’ll learn how to use your new skills to do something more applicable to the real world.

To get started, you can use regular expressions to easily parse all the numbers from a string, as shown in Example 5-5.

In Example 5-5, the expression \d matches any single digit. Each time there is a match, an array element containing that digit is added to $matches_array. Also, notice that the d is escaped with the \ character. Without the backslash, the pattern would match a literal lowercase d.
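The behavior described above can be sketched in PHP as follows. This is a minimal illustration, not the original listing: the subject string is invented, and the sketch assumes preg_match_all() with its default flags, which places the full matches in $matches_array[0].

```php
<?php
// Hypothetical subject string, chosen only for illustration.
$subject = "Call 555 or 3129 today";

// \d matches one digit; preg_match_all() collects every match.
preg_match_all('/\d/', $subject, $matches_array);

// $matches_array[0] holds the full-pattern matches, one digit per element.
print_r($matches_array[0]);

// Without the backslash, the pattern matches the literal letter d instead.
preg_match_all('/d/', $subject, $letter_matches);
print_r($letter_matches[0]);
```

With this subject string, the first call yields seven one-digit elements and the second yields a single d (from the word today).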

In Example 5-5, each digit was returned in a separate array element, which typically isn’t very useful. If, instead, you wanted to return every occurrence of a three-digit number, you could write a regular expression like the one shown in Example 5-6.

Notice that the regular expression in Example 5-6 looks for exactly three consecutive digits and nothing more. So the second numeric value in the subject string was returned as 312, not the actual value of 3129. If you want to parse all numbers, regardless of length, you might consider using a regular expression like the one in Example 5-7.
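A sketch of the three-digit case, using an invented subject string whose second number has four digits. Writing the pattern as three \d tokens in a row is one assumed way to express it.

```php
<?php
// Hypothetical subject string: one three-digit and one four-digit number.
$subject = "Call 555 or 3129 today";

// Three \d tokens in a row match exactly three consecutive digits.
preg_match_all('/\d\d\d/', $subject, $matches_array);

// The four-digit number is cut short: only its first three digits match,
// and the leftover 9 is too short to form another match.
print_r($matches_array[0]); // 555 and 312
```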

Notice that the + character used in Example 5-7 means “one or more of the preceding element,” so the regular expression engine keeps gathering digits until a nonnumeric character is found.
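The effect of the + quantifier can be sketched like this. The subject string is again invented, and the final cast to integers is an optional extra step, not something the text requires.

```php
<?php
// Hypothetical subject string with numbers of different lengths.
$subject = "Call 555 or 3129 today";

// \d+ matches one or more consecutive digits, as many as are available.
preg_match_all('/\d+/', $subject, $matches_array);
print_r($matches_array[0]); // 555 and 3129

// The matches are strings; cast them if integers are needed.
$numbers = array_map('intval', $matches_array[0]);
print_r($numbers);
```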

So far, we’ve looked at regular expressions that capture digits. Substituting the lowercase d with an uppercase D inverts the match: \D matches any character that is not a digit, which includes alpha characters but also spaces and punctuation. Example 5-8 uses it to pick out letters.

Just like the script in Example 5-6, which matched any three-digit number, the script in Example 5-8 uses the \D pattern to match three-letter words. The other addition to this regular expression is the word boundary pattern \b. Without it, the returned array would also have included partial words, like The from the first word There. Alternatively, you can specify the number of repetitions with a number in curly braces, as shown in Example 5-9.
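The role of the word boundary, and the equivalent repetition count, can be sketched as follows. The subject string is invented and deliberately contains only short words, because \D also matches spaces and a differently shaped sentence could produce matches that span a space.

```php
<?php
// Hypothetical subject string containing no digits.
$subject = "There was one red fox";

// Three nondigits bounded by word boundaries: whole three-letter words only.
preg_match_all('/\b\D\D\D\b/', $subject, $with_boundary);

// Same pattern without \b: matches any run of three nondigits,
// including fragments such as "The" (from "There") and "re ".
preg_match_all('/\D\D\D/', $subject, $without_boundary);

// {3} is shorthand for repeating the preceding token three times.
preg_match_all('/\b\D{3}\b/', $subject, $with_count);

print_r($with_boundary[0]);    // was, one, red, fox
print_r($without_boundary[0]); // starts with "The"
print_r($with_count[0]);       // same as $with_boundary[0]
```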

A wildcard, which matches any single character, is expressed with the period (.), commonly just called a dot. For example, look at the script in Example 5-10, which will match on either Tim or Tom because the wildcard takes the place of any character between a T and an m.
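A minimal sketch of the wildcard, assuming an invented subject string that contains both names:

```php
<?php
// Hypothetical subject string.
$subject = "Tim called Tom";

// The dot stands in for any one character between T and m.
preg_match_all('/T.m/', $subject, $matches_array);
print_r($matches_array[0]); // Tim and Tom
```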

The wildcard matches any single character, with one notable exception: by default, the dot will not match the character (or characters) that indicate a new line. For example, in the Unix world, the wildcard will not match \n, and in Windows environments a line ends with the sequence \r\n. Exactly which characters the engine treats as a newline depends on how it is configured.
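The newline exception can be sketched as follows, with an invented subject string. The second call uses the s pattern modifier, which is not covered in the text above; in PHP's PCRE functions it makes the dot match newlines as well, and is included only for contrast.

```php
<?php
// Double quotes so that PHP turns \n into a real newline character.
$subject = "T\nm and Tim";

// By default the dot does not match the newline between T and m.
preg_match_all('/T.m/', $subject, $default_matches);
print_r($default_matches[0]); // only Tim

// With the s modifier, the dot also matches the newline.
preg_match_all('/T.m/s', $subject, $dotall_matches);
print_r($dotall_matches[0]); // "T\nm" and Tim
```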

The downside to the pattern in Example 5-10 is that the wildcard will match any character, not just a letter. So in addition to matching on the words “Tim” and “Tom” (notice the whitespace around the words), the pattern will also match “T5m”, “T m”, or even “T?m”. If you specifically wanted to match on either “Tim” or “Tom”, you should use the OR (or alternation) pattern | used in Example 5-11.
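The difference between the wildcard and alternation can be sketched side by side. The subject string is invented to contain each of the unwanted variants mentioned above, and the alternation is written without surrounding whitespace for brevity.

```php
<?php
// Hypothetical subject string with two names and three look-alikes.
$subject = "Tim, Tom, T5m, T m and T?m";

// The wildcard accepts any character in the middle position.
preg_match_all('/T.m/', $subject, $wildcard_matches);
print_r($wildcard_matches[0]); // Tim, Tom, T5m, "T m", T?m

// Alternation accepts only the alternatives that are spelled out.
preg_match_all('/Tim|Tom/', $subject, $either_matches);
print_r($either_matches[0]); // Tim and Tom
```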

Since the intent of the previous example is to match on “Tim” or “Tom”, repeating the T and the m in both alternatives is redundant. A more compact (though harder to read) pattern is shown in Example 5-12.
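One assumed way to factor out the shared letters is to place the alternation in parentheses. A side effect worth knowing is that the parentheses also form a capture group, so preg_match_all() returns the captured middle letters in a second array. The subject string is invented.

```php
<?php
// Hypothetical subject string with two names and three look-alikes.
$subject = "Tim, Tom, T5m, T m and T?m";

// Only the middle letter varies, so only it goes inside the alternation.
preg_match_all('/T(i|o)m/', $subject, $matches_array);

print_r($matches_array[0]); // full matches: Tim and Tom
print_r($matches_array[1]); // captured middle letters: i and o
```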

The final example in this set shows how to match grouped patterns. In the pattern in Example 5-13, a match will happen when the first character is an uppercase letter from A to Z, followed by any lowercase vowel and then any lowercase alpha character.

Notice that in this example, the three-letter words “Tim” and “Tom” matched, but the words “are” and “and” did not match because those words have a different pattern of case, consonants, and vowels.
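A sketch of such a pattern, assuming it is written with square-bracket character sets and ranges, and using an invented subject string that contains all four words:

```php
<?php
// Hypothetical subject string.
$subject = "Tim and Tom are friends";

// [A-Z]   one uppercase letter
// [aeiou] one lowercase vowel
// [a-z]   one lowercase letter
preg_match_all('/[A-Z][aeiou][a-z]/', $subject, $matches_array);

// "and" and "are" begin with a lowercase letter, so they are skipped.
print_r($matches_array[0]); // Tim and Tom
```

This pattern has no word boundaries, so it would also match the first three letters of a longer capitalized word; wrapping it in \b on both sides would restrict it to whole words.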