Let’s look a little more closely at the pattern matching we have been doing. This has been achieved using regular expressions, which are supported by both JavaScript and PHP. They make it possible to construct the most powerful of pattern-matching algorithms within a single expression.
Every regular expression must be enclosed in slashes (/). Within these slashes, certain characters have special meanings; they are called metacharacters. For instance, an asterisk (*) has a meaning similar to what you have seen if you use a shell or Windows Command prompt (but not quite the same). An asterisk means, “The text you’re trying to match may have any number of the preceding character, or none at all.”
Let’s say you’re looking for the name “Le Guin” and know that someone might spell it with or without a space. Because the text is laid out strangely (for instance, someone may have inserted extra spaces to right-justify lines), you might have to search for a line such as:
The difficulty of classifying Le Guin's works
So you need to match “LeGuin,” as well as “Le” and “Guin” separated by any number of spaces. The solution is to follow a space with an asterisk:
/Le *Guin/
There’s a lot more than the name “Le Guin” in the line, but that’s OK. As long as the regular expression matches some part of the line, the test function returns a true value. What if it’s important to make sure the line contains nothing but “Le Guin”? I’ll show how to ensure that later.
Suppose that you know there is always at least one space. In that case, you could use the plus sign (+), because it requires at least one of the preceding characters to be present:
/Le +Guin/
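To see the difference between the two quantifiers, here is a minimal JavaScript sketch that runs both patterns through the test method. Only the first sample line comes from the text; the other two lines and all variable names are assumptions made up for illustration.

```javascript
// Sample lines: the first is from the text, the other two are hypothetical.
const samples = [
  "The difficulty of classifying Le Guin's works",
  "The difficulty of classifying LeGuin's works",
  "The difficulty of classifying Le     Guin's works",
];

const zeroOrMore = /Le *Guin/; // the space may be absent or repeated
const oneOrMore = /Le +Guin/;  // at least one space is required

// test() returns true if the pattern matches anywhere in the string.
const starResults = samples.map((s) => zeroOrMore.test(s));
const plusResults = samples.map((s) => oneOrMore.test(s));

console.log(starResults); // [ true, true, true ]
console.log(plusResults); // [ true, false, true ]
```

Only the spelling without a space separates the two patterns: the asterisk accepts it and the plus sign rejects it.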
The dot (.) is particularly useful, because it can match anything except a newline. Suppose that you are looking for HTML tags, which start with < and end with >. A simple way to do so is:
/<.*>/
The dot matches any character and the * expands it to match zero or more characters, so this is saying, “Match anything that lies between < and >, even if there’s nothing.” It will match <>, <em>, <br /> and so on. But if you don’t want to match the empty case, <>, you should use the + sign instead of *, like this:
/<.+>/
The plus sign expands the dot to match one or more characters, saying, “Match anything that lies between < and > as long as there’s at least one character between them.” This will match <em> and </em>, <h1> and </h1>, and tags with attributes such as:
<a href="www.mozilla.org">
Unfortunately, the plus sign keeps on matching up to the last > on the line, so you might end up with:
<h1><b>Introduction</b></h1>
A lot more than one tag! I’ll show a better solution later in this section.
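The following sketch shows this greedy behavior in JavaScript. It relies on the standard match method of strings, which this section does not otherwise cover, and the variable names are illustrative assumptions.

```javascript
// Sample line taken from the text; variable names are illustrative.
const heading = "<h1><b>Introduction</b></h1>";

// Without the g modifier, match() puts the first match in element 0.
const greedyMatch = heading.match(/<.+>/)[0];
console.log(greedyMatch); // <h1><b>Introduction</b></h1>

// The empty tag matches only when * (zero or more) is used.
const starMatchesEmpty = /<.*>/.test("<>"); // true
const plusMatchesEmpty = /<.+>/.test("<>"); // false
console.log(starMatchesEmpty, plusMatchesEmpty);
```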
If you use the dot on its own between the angle brackets, without following it with either a + or *, it matches a single character; this will match <b> and <i> but not <em> or <textarea>.
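As a quick check of the single-dot case, this small sketch filters a list of tags with the pattern /<.>/. The list and the variable names are assumptions chosen to mirror the examples in the text.

```javascript
// A dot with no quantifier stands for exactly one character.
const singleCharTag = /<.>/;

// Hypothetical list of tags to try.
const tags = ["<b>", "<i>", "<em>", "<textarea>"];
const matched = tags.filter((t) => singleCharTag.test(t));

console.log(matched); // [ '<b>', '<i>' ]
```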
If you want to match the dot character itself (.), you have to escape it by placing a backslash (\) before it, because otherwise it’s a metacharacter and matches anything. As an example, suppose you want to match the floating-point number 5.0. The regular expression is:
/5\.0/
The backslash can escape any metacharacter, including another backslash (in case you’re trying to match a backslash in text). However, to make things a bit confusing, you’ll see later how backslashes sometimes give the following character a special meaning.
We just matched a floating-point number. But perhaps you want to match 5. as well as 5.0, because both mean the same thing as a floating-point number. You also want to match 5.00, 5.000, and so forth; any number of zeros is allowed. You can do this by adding an asterisk, as you’ve seen:
/5\.0*/
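To illustrate why the backslash matters, the sketch below compares an unescaped dot with an escaped one and then tries the pattern with the trailing asterisk. The test strings and variable names are assumptions picked for illustration.

```javascript
const unescaped = /5.0/;  // the dot matches any character
const escaped = /5\.0/;   // the dot must be a literal period

const unescapedHits500 = unescaped.test("500"); // true: "0" stands in for the dot
const escapedHits500 = escaped.test("500");     // false: no period present

// With the asterisk, any number of trailing zeros (including none) is fine.
const trailingZeros = /5\.0*/;
const allMatch = ["5.", "5.0", "5.00", "5.000"].every((s) => trailingZeros.test(s));
const bareFive = trailingZeros.test("5"); // false: the period is still required

console.log(unescapedHits500, escapedHits500, allMatch, bareFive);
```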
Suppose you want to match numbers that grow by factors of a thousand, such as those named by the prefixes kilo, mega, giga, and tera. In other words, you want all the following to match:
1,000 1,000,000 1,000,000,000 1,000,000,000,000 ...
The plus sign works here, too, but you need to group the string “,000” so the plus sign matches the whole thing. The regular expression is:
/1(,000)+ /
The parentheses mean “treat this as a group when you apply something such as a plus sign.” 1,00,000 and 1,000,00 won’t match, because the text must have a one followed by one or more complete groups of a comma followed by three zeros.
The space after the + character indicates that the match must end when a space is encountered. Without it, 1,000,00 would incorrectly match because only the first 1,000 would be taken into account, and the remaining ,00 would be ignored. Requiring a space afterwards ensures matching will continue right through to the end of a number.
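Here is a small sketch of the grouped pattern at work, including the effect of leaving off the trailing space. Each candidate string ends in a space so the pattern has something to stop on; the candidates and variable names are assumptions for illustration.

```javascript
const thousands = /1(,000)+ /; // note the trailing space inside the pattern

// Hypothetical candidates, each followed by a space.
const candidates = ["1,000 ", "1,000,000 ", "1,00,000 ", "1,000,00 "];
const results = candidates.map((c) => thousands.test(c));
console.log(results); // [ true, true, false, false ]

// Without the trailing space, the malformed number slips through,
// because "1,000" alone satisfies the pattern.
const looseMatch = /1(,000)+/.test("1,000,00");
console.log(looseMatch); // true
```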
Sometimes you want to match something fuzzily, but not so broadly that you want to use a dot. Fuzziness is the great strength of regular expressions: you can be as precise or vague as you want.
One of the key features supporting fuzzy matching is the pair of square brackets, []. It matches a single character, like a dot, but inside the brackets you put a list of things that can match. If any of those characters appears, the text matches. For instance, if you wanted to match both the American spelling “gray” and the British spelling “grey,” you could specify:
/gr[ae]y/
After the gr in the text you’re matching, there can be either an a or an e, but there must be only one of them: whatever you put inside the brackets matches exactly one character. The group of characters inside the brackets is called a character class.
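The “exactly one character” rule is easy to verify with a short sketch. The misspellings in the list are assumptions added to show what the class rejects.

```javascript
const grayOrGrey = /gr[ae]y/;

// The last two spellings are hypothetical, added to show rejections.
const spellings = ["gray", "grey", "graey", "groy"];
const accepted = spellings.map((w) => grayOrGrey.test(w));

console.log(accepted); // [ true, true, false, false ]
```

“graey” fails because the class consumes only the a, leaving an e where the pattern expects y.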
Inside the brackets, you can use a hyphen (-) to indicate a range. One very common task is matching a single digit, which you can do with a range as follows:
/[0-9]/
Digits are such a common item in regular expressions that a single character is provided to represent them: \d. You can use it in the place of the bracketed regular expression to match a digit:
/\d/
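In JavaScript the two forms behave identically, as this sketch suggests. It uses the g modifier (described later in this section) together with the standard match method so that every digit is collected; the sample text and variable names are assumptions.

```javascript
// Hypothetical sample text.
const text = "Room 101, floor 3";

// With the g modifier, match() returns an array of every match.
const byRange = text.match(/[0-9]/g);
const byShorthand = text.match(/\d/g);

console.log(byRange);     // [ '1', '0', '1', '3' ]
console.log(byShorthand); // [ '1', '0', '1', '3' ]
```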
One other important feature of the square brackets is negation of a character class. You can turn the whole character class on its head by placing a caret (^) after the opening bracket. Here it means, “Match any character except the following.”
Let’s say you want to find instances of “Yahoo” that lack the following exclamation point. (The name of the company officially contains an exclamation point!) You could do this as follows:
/Yahoo[^!]/
The character class consists of a single character (an exclamation point), but it is inverted by the preceding ^. This is actually not a great solution to the problem. For instance, it fails if “Yahoo” is at the end of the line, because then it’s not followed by anything, whereas the brackets must match a character. A better solution involves negative look-ahead (matching something only when it is not followed by a given pattern), but that’s beyond the scope of this book.
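For readers who want to see the end-of-line problem in action, here is a hedged sketch. The negative look-ahead form (?!!) is standard JavaScript syntax but is not otherwise covered in this section, so treat it as a pointer rather than a full explanation; the sample sentences and variable names are assumptions.

```javascript
const naive = /Yahoo[^!]/;      // needs some character after "Yahoo"
const lookahead = /Yahoo(?!!)/; // "Yahoo" not followed by "!"

// Hypothetical sample sentences.
const inputs = ["I use Yahoo mail", "I use Yahoo! mail", "I use Yahoo"];

const naiveResults = inputs.map((s) => naive.test(s));
const lookaheadResults = inputs.map((s) => lookahead.test(s));

console.log(naiveResults);     // [ true, false, false ]
console.log(lookaheadResults); // [ true, false, true ]
```

The two patterns disagree only on the last sentence, where “Yahoo” ends the string and the character class has nothing to match.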
With an understanding of character classes and negation, you’re ready now to see a better solution to the problem of matching an HTML tag. This solution avoids going past the end of a single tag, but still matches tags such as <em> and </em>, as well as tags with attributes such as:
<a href="www.mozilla.org">
One solution is:
/<[^>]+>/
That regular expression may look like I just dropped my teacup on the keyboard, but it is perfectly valid and very useful. Let’s break it apart. Figure 16-3 shows the various elements, which I’ll describe one by one.
The elements are:
/
Opening slash that indicates this is a regular expression.
<
Opening bracket of an HTML tag. This is matched exactly; it is not a metacharacter.
[^>]
Character class. The embedded ^> means “match anything except a closing angle bracket.”
+
Allows any number of characters to match the previous [^>], as long as there is at least one of them.
>
Closing bracket of an HTML tag. This is matched exactly.
/
Closing slash that indicates the end of the regular expression.
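To see the pattern pick out individual tags, the sketch below applies it with the g modifier (covered later in this section) and the standard match method. The HTML string combines examples from the text; the variable names are assumptions.

```javascript
// Sample markup assembled from the examples in the text.
const html = '<h1><b>Introduction</b></h1> <a href="www.mozilla.org">';

// [^>]+ stops at the first closing bracket, so each tag is matched separately.
const tagsFound = html.match(/<[^>]+>/g);

console.log(tagsFound);
// [ '<h1>', '<b>', '</b>', '</h1>', '<a href="www.mozilla.org">' ]
```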
Another solution to the problem of matching HTML tags is to use a nongreedy operation. By default, pattern matching is greedy, returning the longest match possible. Nongreedy matching finds the shortest possible match; its use is beyond the scope of this book, but there are more details at http://tinyurl.com/aboutregex.
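As a brief taste of the nongreedy approach, JavaScript lets you add a question mark after + or * to request the shortest match. This is only a sketch using the heading line from earlier; the variable names are assumptions.

```javascript
const headingLine = "<h1><b>Introduction</b></h1>";

// Greedy: .+ runs to the last > on the line.
const longest = headingLine.match(/<.+>/)[0];

// Nongreedy: .+? stops at the first > it can.
const shortest = headingLine.match(/<.+?>/)[0];

console.log(longest);  // <h1><b>Introduction</b></h1>
console.log(shortest); // <h1>
```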
We are going to look now at one of the expressions from Example 16-1, which the validateUsername function used:
/[^a-zA-Z0-9_]/
Figure 16-4 shows the various elements.
Let’s look at these elements in detail:
/
Opening slash that indicates this is a regular expression.
[
Opening bracket that starts a character class.
^
Negation character: inverts everything else between the brackets.
a-z
Represents any lowercase letter.
A-Z
Represents any uppercase letter.
0-9
Represents any digit.
_
An underscore.
]
Closing bracket that ends a character class.
/
Closing slash that indicates the end of the regular expression.
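The breakdown above can be tried out with a few lines of JavaScript. The function below is a hypothetical stand-in written for this sketch; it is not the validateUsername function from Example 16-1, and the usernames are made up.

```javascript
// The pattern is from the text; the function name is a hypothetical stand-in.
const disallowed = /[^a-zA-Z0-9_]/;

function hasDisallowedCharacter(username) {
  // test() returns true as soon as one character outside the class is found.
  return disallowed.test(username);
}

const okName = hasDisallowedCharacter("ursula_k_1929"); // false
const badName = hasDisallowedCharacter("ursula k!");    // true
console.log(okName, badName);
```

Because the class is negated, a true result signals a problem: the string contains at least one character that is not a letter, digit, or underscore.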
There are two other important metacharacters. They “anchor” a regular expression by requiring that it appear in a particular place. If a caret (^) appears at the beginning of the regular expression, the expression has to appear at the beginning of a line of text; otherwise, it doesn’t match. Similarly, if a dollar sign ($) appears at the end of the regular expression, the expression has to appear at the end of a line of text.
It may be somewhat confusing that ^ can mean “negate the character class” inside square brackets and “match the beginning of the line” if it’s at the beginning of the regular expression. Unfortunately, the same character is used for two different things, so take care when using it.
We’ll finish our exploration of regular expression basics by answering a question raised earlier: suppose you want to make sure there is nothing extra on a line besides the regular expression? What if you want a line that has “Le Guin” on it and nothing else? We can do that by amending the earlier regular expression to anchor the two ends:
/^Le *Guin$/
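A short sketch makes the effect of the anchors visible by comparing the anchored and unanchored patterns. The test strings and variable names are assumptions for illustration.

```javascript
const anywhere = /Le *Guin/;    // may match in the middle of a line
const wholeLine = /^Le *Guin$/; // the name must be the entire string

// Hypothetical test strings.
const lines = ["Le Guin", "LeGuin", "Le Guin's works"];

const anywhereResults = lines.map((s) => anywhere.test(s));
const wholeLineResults = lines.map((s) => wholeLine.test(s));

console.log(anywhereResults);  // [ true, true, true ]
console.log(wholeLineResults); // [ true, true, false ]
```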
Table 16-1 shows the metacharacters available in regular expressions.
Metacharacter | Description
/ | Begins and ends the regular expression
. | Matches any single character except the newline
element* | Matches element zero or more times
element+ | Matches element one or more times
element? | Matches element zero or one time
[characters] | Matches a character out of those contained within the brackets
[^characters] | Matches a single character that is not contained within the brackets
(regex) | Treats the regex as a group for counting or for a following *, +, or ?
left|right | Matches either left or right
[l-r] | Matches a range of characters between l and r
^ | Requires match to be at the string’s start
$ | Requires match to be at the string’s end
\b | Matches a word boundary
\B | Matches where there is not a word boundary
\d | Matches a single digit
\D | Matches a single nondigit
\n | Matches a newline character
\s | Matches a whitespace character
\S | Matches a nonwhitespace character
\t | Matches a tab character
\w | Matches a word character (a-z, A-Z, 0-9, and _)
\W | Matches a nonword character (anything but a-z, A-Z, 0-9, and _)
\x | Matches x literally (useful when x is a metacharacter)
{n} | Matches exactly n times
{n,} | Matches n times or more
{min,max} | Matches at least min and at most max times
Provided with this table, and looking again at the expression /[^a-zA-Z0-9_]/, you can see that it could easily be shortened to /[^\w]/ because the single metacharacter \w (with a lowercase w) specifies the characters a-z, A-Z, 0-9, and _.
In fact, we can be cleverer than that, because the metacharacter \W (with an uppercase W) specifies all characters except for a-z, A-Z, 0-9, and _. Therefore, we could also drop the ^ metacharacter and simply use /[\W]/ for the expression (or just /\W/, because the brackets are no longer needed).
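To check that these shorthand forms agree with the longer expression, the sketch below runs each of them over the same inputs. The input strings and variable names are assumptions for illustration.

```javascript
const explicit = /[^a-zA-Z0-9_]/;
const negatedShorthand = /[^\w]/;
const bracketed = /[\W]/;
const bare = /\W/;

// Hypothetical inputs: one clean, two containing a nonword character.
const names = ["user_name1", "user name", "user-name"];

const verdicts = [explicit, negatedShorthand, bracketed, bare].map((re) =>
  names.map((n) => re.test(n))
);

console.log(verdicts);
// Each of the four rows is [ false, true, true ]
```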
To give you more ideas of how this all works, Table 16-2 shows a range of expressions and the patterns they match.
Example | Matches
/r/ | The first r in The quick brown
/rec[ei][ei]ve/ | Either of receive or recieve (but also receeve or reciive)
/rec[ei]{2}ve/ | Either of receive or recieve (but also receeve or reciive)
/rec(ei|ie)ve/ | Either of receive or recieve (but not receeve or reciive)
/cat/ | The word cat in I like cats and dogs
/cat|dog/ | Either of the words cat or dog in I like cats and dogs
/\./ | A literal dot (the backslash is needed because . is a metacharacter)
/5\.0*/ | 5., 5.0, 5.00, 5.000, etc.
/[a-f]/ | Any of the characters a, b, c, d, e, or f
/cats$/ | Only the final cats in My cats are friendly cats
/^my/ | Only the first my in my cats are my pets
/\d{2,3}/ | Any two- or three-digit number (00 through 999)
/7(,000)+/ | 7,000, 7,000,000, 7,000,000,000, 7,000,000,000,000, etc.
/\w+/ | Any word of one or more characters
/\w{5}/ | Any five word characters in a row (for example, a five-letter word)
Some additional modifiers are available for regular expressions:
/g enables “global” matching. In JavaScript, when using the replace method, specify this modifier to replace all matches, rather than only the first one. (PHP’s preg_replace replaces all matches by default, so no such modifier is needed there.)
/i makes the regular expression case-insensitive. As a result, instead of /[a-zA-Z]/, you could specify /[a-z]/i or /[A-Z]/i.
/m enables multiline mode, in which the caret (^) and dollar sign ($) match before and after any newlines in the subject string. Normally, in a multiline string, ^ matches only at the start of the string and $ matches only at the end of the string.
For example, the expression /cats/g will match both occurrences of the word “cats” in the sentence “I like cats and cats like me.” Similarly /dogs/gi will match both occurrences of the word “dogs” (“Dogs” and “dogs”) in the sentence “Dogs like other dogs,” because you can use these specifiers together.
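The modifiers can be explored with the standard match method, which returns every match when g is present. The two sentences come from the text; the two-line string used for the m modifier and all variable names are assumptions.

```javascript
// g: find every occurrence, not just the first.
const catCount = "I like cats and cats like me.".match(/cats/g).length;

// g and i together: every occurrence, ignoring case.
const dogMatches = "Dogs like other dogs".match(/dogs/gi);

// m: ^ also matches just after each newline (hypothetical two-line string).
const twoLines = "my cats\nmy pets";
const withoutM = twoLines.match(/^my/g).length;
const withM = twoLines.match(/^my/gm).length;

console.log(catCount);        // 2
console.log(dogMatches);      // [ 'Dogs', 'dogs' ]
console.log(withoutM, withM); // 1 2
```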
In JavaScript you will use regular expressions mostly in two methods: test (which you have already seen) and replace. Whereas test just tells you whether its argument matches the regular expression, replace takes a second parameter: the string to replace the text that matches. Like most functions, replace generates a new string as a return value; it does not change the input.
To compare the two methods, the following statement just returns true to let us know that the word “cats” appears at least once somewhere within the string:
document.write(/cats/i.test("Cats are fun. I like cats."))
But the following statement replaces both occurrences of the word “cats” with the word “dogs,” printing the result. The search has to be global (/g) to find all occurrences, and case-insensitive (/i) to find the capitalized “Cats”:
document.write("Cats are fun. I like cats.".replace(/cats/gi,"dogs"))
If you try out the statement, you’ll see a limitation of replace: because it replaces text with exactly the string you tell it to use, the first word “Cats” is replaced by “dogs” instead of “Dogs.”
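One possible way around this limitation, offered only as a sketch, is to pass a function as the second argument to replace; JavaScript calls it for each match and uses its return value as the replacement. The capitalization rule below is an assumption made for this example, not something the text prescribes.

```javascript
const sentence = "Cats are fun. I like cats.";

// A plain string replacement loses the capital letter.
const plain = sentence.replace(/cats/gi, "dogs");

// A replacer function receives each matched string and can inspect it.
const caseAware = sentence.replace(/cats/gi, (found) =>
  found[0] === found[0].toUpperCase() ? "Dogs" : "dogs"
);

console.log(plain);     // dogs are fun. I like dogs.
console.log(caseAware); // Dogs are fun. I like dogs.
console.log(sentence);  // Cats are fun. I like cats. (the input is unchanged)
```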
The most common regular expression functions that you are likely to use in PHP are preg_match, preg_match_all, and preg_replace.
To test whether the word “cats” appears anywhere within a string, in any combination of upper- and lowercase, you could use preg_match like this:
$n = preg_match("/cats/i", "Cats are fun. I like cats.");
Because preg_match returns 1 when the pattern matches and 0 when it does not (values that PHP treats as TRUE and FALSE), the preceding statement sets $n to 1. The first argument is the regular expression and the second is the text to match. But preg_match is actually a good deal more powerful and complicated, because it takes a third argument that shows what text matched:
$n = preg_match("/cats/i", "Cats are fun. I like cats.", $match);
echo "$n Matches: $match[0]";
The third argument is an array (here given the name $match). The function puts the text that matches into the first element, so if the match is successful you can find the text that matched in $match[0]. In this example, the output lets us know that the matched text was capitalized:
1 Matches: Cats
If you wish to locate all matches, you use the preg_match_all function, like this:
$n = preg_match_all("/cats/i", "Cats are fun. I like cats.", $match);
echo "$n Matches: ";
for ($j=0 ; $j < $n ; ++$j) echo $match[0][$j]." ";
As before, $match is passed to the function and the element $match[0] is assigned the matches made, but this time as a subarray. To display the subarray, this example iterates through it with a for loop.
When you want to replace part of a string, you can use preg_replace as shown here. This example replaces all occurrences of the word “cats” with the word “dogs,” regardless of case:
echo preg_replace("/cats/i", "dogs", "Cats are fun. I like cats.");
The subject of regular expressions is a large one, and entire books have been written about it. If you would like further information, I suggest the Wikipedia entry at http://tinyurl.com/wikiregex, or Jeffrey Friedl’s excellent book Mastering Regular Expressions (O’Reilly, 2006).