Chapter 2
IN THIS CHAPTER
Introducing regular expressions
Trying out regular expressions with a helpful program
Creating simple expressions that match patterns of characters
Using regular expression features such as custom classes, quantifiers, and groups
Using regular expressions with the String class
Using the Pattern and Matcher classes for more-extensive regular expressions
Regular expressions are not expressions that have a lot of fiber in their diet. Instead, a regular expression is a special type of pattern-matching string that can be very useful for programs that do string manipulation. Regular expression strings contain special pattern-matching characters that can be matched against another string to see whether the other string fits the pattern. You’ll find that regular expressions are very handy for doing complex data validation — for making sure that users enter properly formatted phone numbers, email addresses, or Social Security numbers, for example.
Regular expressions are also useful for many other purposes, including searching text files to see whether they contain certain patterns (can you say, Google?), filtering email based on its contents, or performing complicated search-and-replace functions.
In this chapter, you find out the basics of using regular expressions. I emphasize validation and focus on comparing strings entered by users against patterns specified by regular expressions to see whether they match up. This chapter covers only a portion of what you can do with regular expressions; for more complicated patterns, turn to a more extensive regular expression reference. You can find several in-depth tutorials by searching the web for regular expression tutorial.
Before I get into the details of putting together regular expressions, let me direct your attention to Listing 2-1, which presents a short program that can be very useful while you’re learning how to create regular expressions. First, this program lets you enter a regular expression. Next, you can enter a string, and the program tests it against the regular expression and lets you know whether the string matches the regex. Then the program prompts you for another string to compare. You can keep entering strings to compare with the regex you’ve already entered. When you’re done, just press the Enter key without entering a string. The program asks whether you want to enter another regular expression. If you answer yes (y), the whole process repeats. If you answer no (n), the program ends.
LISTING 2-1 The Regular Expression Test Program
import java.util.regex.*;
import java.util.Scanner;
public final class Reg {
    static String r, s;
    static Pattern pattern;
    static Matcher matcher;
    static boolean match, validRegex, doneMatching;
    private static Scanner sc =
        new Scanner(System.in);
    public static void main(String[] args)
    {
        System.out.println("Welcome to the "
            + "Regex Tester\n");
        do
        {
            do
            {
                System.out.print("\nEnter regex: ");
                r = sc.nextLine();
                validRegex = true;
                try
                {
                    pattern = Pattern.compile(r);
                }
                catch (Exception e)
                {
                    System.out.println(e.getMessage());
                    validRegex = false;
                }
            } while (!validRegex);
            doneMatching = false;
            while (!doneMatching)
            {
                System.out.print("Enter string: ");
                s = sc.nextLine();
                if (s.length() == 0)
                    doneMatching = true;
                else
                {
                    matcher = pattern.matcher(s);
                    if (matcher.matches())
                        System.out.println("Match.");
                    else
                        System.out.println(
                            "Does not match.");
                }
            }
        } while (askAgain());
    }
    private static boolean askAgain()
    {
        System.out.print("Another? (Y or N) ");
        String reply = sc.nextLine();
        if (reply.equalsIgnoreCase("Y"))
            return true;
        return false;
    }
}
Here’s a sample run of this program. For now, don’t worry about the details of the regular expression string. Just note that it should match any three-letter word that begins with f, ends with r, and has a, i, or o in the middle.
Welcome to the Regex Tester
Enter regex: f[aio]r
Enter string: for
Match.
Enter string: fir
Match.
Enter string: fur
Does not match.
Enter string: fod
Does not match.
Enter string:
Another? (Y or N) n
In this test, I entered the regular expression f[aio]r. Then I entered the string for. The program indicated that this string matched the expression and asked for another string. So I entered fir, which also matched. Then I entered fur and fod, which didn’t match. Next, I entered a blank string, so the program asked whether I wanted to test another regex. I entered n, so the program ended.
This program uses the Pattern and Matcher classes, which I don’t explain until the end of the chapter. I suggest that you use this program alongside this chapter, however. Regular expressions make a lot more sense if you actually try them out to see them in action. Also, you can learn a lot by trying simple variations as you go. (You can always download the source code for this program from this book’s website at www.dummies.com/go/javaaiofd5e if you don’t want to enter it yourself.)
In fact, I use portions of console output from this program throughout the rest of this chapter to illustrate regular expressions. There’s no better way to see how regular expressions work than to see an expression and some samples of strings that match and don’t match the expression.
Most regular expressions simply match characters to see whether a string complies with a simple pattern. You can check a string to see whether it matches the format for Social Security numbers (xxx-xx-xxxx), phone numbers [(xxx) xxx-xxxx], or more complicated patterns such as email addresses. (Well, actually, Social Security and phone numbers are more complicated than you may think; more on that in the section “Using predefined character classes,” later in this chapter.) In the following sections, you find out how to create regex patterns for basic character matching.
The simplest regex patterns match a string literal exactly, as in this example:
Enter regex: abc
Enter string: abc
Match.
Enter string: abcd
Does not match.
Here the pattern abc matches the string abc but not abcd, because the entire string must fit the pattern.
A character class represents a particular type of character rather than a specific character. A regex pattern lets you use two types of character classes: predefined classes and custom classes. The predefined character classes are shown in Table 2-1.
TABLE 2-1 Character Classes
Regex | Matches
. | Any character
\d | Any digit (0–9)
\D | Any nondigit (anything other than 0–9)
\s | Any white-space character (space, tab, new line, carriage return, form feed, or vertical tab)
\S | Any character other than a white-space character
\w | Any word character (a–z, A–Z, 0–9, or an underscore)
\W | Any character other than a word character
The period is like a wildcard that matches any character, as in this example:
Enter regex: c.t
Enter string: cat
Match.
Enter string: cot
Match.
Enter string: cart
Does not match.
Here c.t matches any three-character string that starts with c and ends with t. In this example, the first two strings (cat and cot) match, but the third string (cart) doesn’t because it’s more than three characters.
The \d class represents a digit and is often used in regex patterns to validate input data. Here’s a simple regex pattern that validates a U.S. Social Security number, which must be entered in the form xxx-xx-xxxx:
Enter regex: \d\d\d-\d\d-\d\d\d\d
Enter string: 779-54-3994
Match.
Enter string: 550-403-004
Does not match.
Here the regex pattern specifies that the string must contain three digits, a hyphen, two digits, another hyphen, and four digits.
Note that the \d class has a counterpart: \D. The \D class matches any character that is not a digit. Here’s a first attempt at a regex for validating droid names:
Enter regex: \D\d-\D\d
Enter string: R2-D2
Match.
Enter string: C3-P0
Match.
Enter string: C-3PO
Does not match.
Here the pattern matches strings that begin with a character that isn’t a digit, followed by a character that is a digit, followed by a hyphen, followed by another nondigit character, and ending with a digit. Thus, R2-D2 and C3-P0 match. Unfortunately, this regex is far from perfect, as any Star Wars fan can tell you, because the proper spelling of the shiny gold protocol droid’s name is C-3PO, not C3-P0. Typical.
The \s class matches white-space characters, including spaces, tabs, newlines, carriage returns, form feeds, and vertical tabs. This class is useful when you want to allow the user to separate parts of a string in various ways, as in this example. (Note that in the fourth line, I use the Tab key to separate abc from def.)
Enter regex: ...\s...
Enter string: abc def
Match.
Enter string: abc def
Match.
Here the pattern specifies that the string can be two groups of any three characters separated by one white-space character. In the first string that’s entered, the groups are separated by a space; in the second string, they’re separated by a tab. The \s class also has a counterpart: \S. It matches any character that isn’t a white-space character.
If you want to accept only an actual space as the separator, use a literal space in the pattern instead:
Enter regex: ... ...
Enter string: abc def
Match.
Enter string: abc def
Does not match.
Here the regex specifies two groups of any three characters separated by a space. The first input string matches this pattern, but the second does not because the groups are separated by a tab.
The last set of predefined classes is \w and \W. The \w class identifies any character that’s normally used in words, including uppercase and lowercase letters, digits, and underscores. The \W class matches any character that isn’t a word character. An example shows how all that looks:
Enter regex: \w\w\w\W\w\w\w
Enter string: abc def
Match.
Enter string: 123 456
Match.
Enter string: 123_456
Does not match.
Here the pattern calls for two groups of three word characters separated by a nonword character. The third string doesn’t match because the underscore is itself a word character.
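If you'd rather try the predefined classes from Java code than from the tester program, here is a minimal sketch. The class name PredefinedClassDemo and the helper fullMatch are made up for illustration; Pattern.matches is the standard library call doing the work. Notice that every backslash is doubled inside the Java string literals, a quirk explained later in this chapter.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample strings are illustrative only.
class PredefinedClassDemo {
    // Returns true only if the entire input fits the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // . matches any single character
        System.out.println(fullMatch("c.t", "cat"));    // true
        System.out.println(fullMatch("c.t", "cart"));   // false: four characters
        // \d matches a digit, \D a nondigit
        System.out.println(fullMatch("\\D\\d-\\D\\d", "R2-D2"));  // true
        // \s matches white space, including a tab
        System.out.println(fullMatch("...\\s...", "abc\tdef"));   // true
        // \W rejects the underscore because it is a word character
        System.out.println(fullMatch("\\w\\w\\w\\W\\w\\w\\w", "123_456")); // false
    }
}
```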
To create a custom character class, you simply list all the characters that you want to include in the class within a set of brackets. Here’s an example:
Enter regex: b[aeiou]t
Enter string: bat
Match.
Enter string: bet
Match.
Enter string: bit
Match.
Enter string: bot
Match.
Enter string: but
Match.
Enter string: bmt
Does not match.
Here the pattern specifies that the string must start with the letter b, followed by a class that can include a, e, i, o, or u, followed by t. In other words, it accepts three-letter words that begin with b, end with t, and have a vowel in the middle.
If you want to let the pattern include uppercase letters as well as lowercase letters, you have to list them both:
Enter regex: b[aAeEiIoOuU]t
Enter string: bat
Match.
Enter string: BAT
Does not match.
Enter string: bAt
Match.
You can use as many custom classes in a pattern as you want. Here’s an example that defines classes for the first and last characters so that they too can be uppercase or lowercase:
Enter regex: [bB][aAeEiIoOuU][tT]
Enter string: bat
Match.
Enter string: BAT
Match.
This pattern specifies three character classes. The first can be b or B, the second can be any uppercase or lowercase vowel, and the third can be t or T.
Custom character classes can also specify ranges of letters and numbers, like this:
Enter regex: [a-z][0-5]
Enter string: r2
Match.
Enter string: b9
Does not match.
Here the string must be exactly two characters long. The first must be a letter from a–z, and the second must be a digit from 0–5.
You can also use more than one range in a class, like this:
Enter regex: [a-zA-Z][0-5]
Enter string: r2
Match.
Enter string: R2
Match.
Here the first character can be lowercase a–z or uppercase A–Z.
A single class can also combine letter ranges and digit ranges:
Enter regex: [a-zA-Z0-9]
Enter string: a
Match.
Enter string: N
Match.
Enter string: 9
Match.
Regular expressions can include classes that match any character but the ones listed for the class. To do that, you start the class with a caret, like this:
Enter regex: [^cf]at
Enter string: bat
Match.
Enter string: cat
Does not match.
Enter string: fat
Does not match.
Here the string must be three characters long and end in at, but it can’t be fat or cat.
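Here is a small sketch of ranges and negated classes in Java code. The class and method names are hypothetical, and it leans on the matches method of the String class, which is covered later in this chapter. The last line shows something easy to overlook: a negated class accepts any character that isn't listed, not just letters.

```java
// Hypothetical demo class; names are illustrative, not from the chapter.
class CustomClassDemo {
    // A letter in either case followed by a digit from 0 through 5.
    static boolean letterThenLowDigit(String s) {
        return s.matches("[a-zA-Z][0-5]");
    }

    // Any single character except c or f, followed by "at".
    static boolean notCatOrFat(String s) {
        return s.matches("[^cf]at");
    }

    public static void main(String[] args) {
        System.out.println(letterThenLowDigit("R2")); // true
        System.out.println(letterThenLowDigit("b9")); // false: 9 is outside 0-5
        System.out.println(notCatOrFat("bat"));       // true
        System.out.println(notCatOrFat("cat"));       // false
        System.out.println(notCatOrFat("9at"));       // true: a digit isn't c or f either
    }
}
```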
The regex patterns described so far in this chapter require that each position in the input string match a specific character class. The pattern \d\W[a-z], for example, requires a digit in the first position, a nonword character in the second position, and one of the letters a–z in the third position. These requirements are pretty rigid.
To create more flexible patterns, you can use any of the quantifiers listed in Table 2-2. Quantifiers let you create patterns that match a variable number of characters at a certain position in the string.
TABLE 2-2 Quantifiers
Regex | Matches the Preceding Element
? | Zero times or one time
* | Zero or more times
+ | One or more times
{n} | Exactly n times
{n,} | At least n times
{n,m} | At least n times but no more than m times
To use a quantifier, you code it immediately after the element you want it to apply to. Here’s a version of the Social Security number pattern that uses quantifiers:
Enter regex: \d{3}-\d{2}-\d{4}
Enter string: 779-48-9955
Match.
Enter string: 483-488-9944
Does not match.
The pattern matches three digits, followed by a hyphen, followed by two digits, followed by another hyphen, followed by four digits.
The ? quantifier lets you create an optional element that may or may not be present in the string. Suppose you want to allow the user to enter Social Security numbers without the hyphens. You could use this pattern:
Enter regex: \d{3}-?\d{2}-?\d{4}
Enter string: 779-48-9955
Match.
Enter string: 779489955
Match.
Enter string: 779-489955
Match.
Enter string: 77948995
Does not match.
The question marks indicate that the hyphens are optional. Notice that this pattern lets you include or omit either hyphen. The last string entered doesn’t match because it has only eight digits, and the pattern requires nine.
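To see the quantifiers at work in code, here is a minimal sketch. QuantifierDemo and looksLikeSsn are names I made up for the example; the Pattern and Matcher calls are the standard ones described at the end of this chapter, and the backslashes are doubled because the patterns sit in Java string literals.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class QuantifierDemo {
    // The ? after each hyphen makes that hyphen optional.
    static final Pattern SSN = Pattern.compile("\\d{3}-?\\d{2}-?\\d{4}");

    static boolean looksLikeSsn(String s) {
        return SSN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeSsn("779-48-9955")); // true
        System.out.println(looksLikeSsn("779489955"));   // true
        System.out.println(looksLikeSsn("77948995"));    // false: only eight digits
        // {n,m}: at least two digits but no more than four
        System.out.println("123".matches("\\d{2,4}"));   // true
        System.out.println("12345".matches("\\d{2,4}")); // false
        // *: zero or more, so an empty string fits
        System.out.println("".matches("\\d*"));          // true
        // +: one or more, so an empty string doesn't fit
        System.out.println("".matches("\\d+"));          // false
    }
}
```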
In regular expressions, certain characters have special meaning. What if you want to search for one of those special characters? In that case, you escape the character by preceding it with a backslash. Here’s an example:
Enter regex: \(\d{3}\) \d{3}-\d{4}
Enter string: (559) 555-1234
Match.
Enter string: 559 555-1234
Does not match.
Here \( represents a left parenthesis, and \) represents a right parenthesis. Without the backslashes, the regular expression treats the parentheses as grouping elements.
Here are a few additional points to ponder about escapes:
Strictly speaking, you need to use the backslash escape only for characters that have special meanings in regular expressions. I recommend, however, that you escape any punctuation character or symbol, just to be sure.
To match a literal backslash, escape it with another backslash. The pattern \d\d\\\d\d, for example, accepts strings made up of two digits followed by a backslash and two more digits, such as 23\88 and 95\55.
You can use parentheses to create groups of characters to apply other regex elements to, as in this example:
Enter regex: (bla)+
Enter string: bla
Match.
Enter string: blabla
Match.
Enter string: blablabla
Match.
Enter string: bla bla bla
Does not match.
Here the parentheses treat bla as a group, so the + quantifier applies to the entire sequence. Thus, this pattern looks for one or more occurrences of the sequence bla.
Here’s an example that finds U.S. phone numbers that can have an optional area code:
Enter regex: (\(\d{3}\)\s?)?\d{3}-\d{4}
Enter string: 555-1234
Match.
Enter string: (559) 555-1234
Match.
Enter string: (559)555-1239
Match.
This regex pattern is a little complicated, but if you examine it element by element, you should be able to figure it out. It starts with a group that indicates the optional area code: (\(\d{3}\)\s?)?. This group begins with the left parenthesis, which marks the start of the group. The characters in the group consist of an escaped left parenthesis, three digits, an escaped right parenthesis, and an optional white-space character. Then a right parenthesis closes the group, and the question mark indicates that the entire group is optional. The rest of the regex pattern looks for three digits followed by a hyphen and four more digits.
When you mark a group of characters with parentheses, the text that matches that group is captured so that you can use it later in the pattern. The groups that are captured are called capture groups and are numbered beginning with 1. Then you can use a backslash followed by the capture-group number to indicate that the text must match the text that was captured for the specified capture group.
Suppose that droids named following the pattern \w\d-\w\d must have the same digit in the second and fifth characters. In other words, r2-d2 and b9-k9 are valid droid names, but r2-d4 and d3-r4 are not.
Here’s an example that can validate that type of name:
Enter regex: \w(\d)-\w\1
Enter string: r2-d2
Match.
Enter string: d3-r4
Does not match.
Enter string: b9-k9
Match.
Here \1 refers to the first capture group. Thus the last character in the string must be the same as the second character, which must be a digit.
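The same backreference can be tried from Java with a sketch like the following. BackreferenceDemo and matchingDigits are hypothetical names; in the string literal, the backreference \1 has to be written as \\1 for the reason described later in this chapter.

```java
// Hypothetical demo class; names are illustrative only.
class BackreferenceDemo {
    // (\d) captures the first digit; \1 demands that same digit at the end.
    static boolean matchingDigits(String name) {
        return name.matches("\\w(\\d)-\\w\\1");
    }

    public static void main(String[] args) {
        System.out.println(matchingDigits("r2-d2")); // true
        System.out.println(matchingDigits("b9-k9")); // true
        System.out.println(matchingDigits("d3-r4")); // false: 4 is not the captured 3
    }
}
```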
The vertical bar (|) symbol defines an or operation, which lets you create patterns that accept any of two or more variations. Here’s an improvement of the pattern for validating droid names:
Enter regex: (\w\d-\w\d)|(\w-\d\w\w)
Enter string: r2-d2
Match.
Enter string: c-3po
Match.
The | character indicates that either the group on the left or the group on the right can be used to match the string. The group on the left matches a word character, a digit, a hyphen, a word character, and another digit. The group on the right matches a word character, a hyphen, a digit, and two word characters.
You may want to use an additional set of parentheses around the entire part of the pattern that the | applies to. Then you can add pattern elements before or after the | groups. What if you want to let a user enter the area code for a phone number with or without parentheses? Here’s a regex pattern that does the trick:
Enter regex: ((\d{3} )|(\(\d{3}\) ))?\d{3}-\d{4}
Enter string: (559) 555-1234
Match.
Enter string: 559 555-1234
Match.
Enter string: 555-1234
Match.
The first part of this pattern is a group that consists of two smaller groups separated by a | character. The first of these groups matches an area code without parentheses followed by a space, and the second matches an area code with parentheses followed by a space. So the outer group matches an area code with or without parentheses. This entire group is marked with a question mark as optional; then the pattern continues with three digits, a hyphen, and four digits.
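Here is a rough sketch of how that phone-number pattern might be used in Java. The class name PhoneAlternationDemo and the method validPhone are my own inventions; the regex is the one from the sample run, with each backslash doubled for the string literal. The last two calls show inputs that fall between the two alternatives and so are rejected.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class PhoneAlternationDemo {
    // Optional area code, either "ddd " or "(ddd) ", then ddd-dddd.
    static final Pattern PHONE =
        Pattern.compile("((\\d{3} )|(\\(\\d{3}\\) ))?\\d{3}-\\d{4}");

    static boolean validPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(validPhone("(559) 555-1234")); // true
        System.out.println(validPhone("559 555-1234"));   // true
        System.out.println(validPhone("555-1234"));       // true
        System.out.println(validPhone("(559 555-1234"));  // false: unbalanced parenthesis
        System.out.println(validPhone("(559)555-1234"));  // false: this pattern requires the space
    }
}
```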
So far, this chapter has shown you the basics of creating regular expressions. The following sections show you how to put them to use in Java programs.
Before getting into the classes for working with regular expressions, I want to clue you in about a problem that Java has in dealing with strings that contain regular expressions. As you’ve seen throughout this chapter, regex patterns rely on the backslash character to mark different elements of a pattern. The bad news is that Java treats the backslash character in a string literal as an escape character. Thus, you can’t just quote regular expressions in string literals, because Java steals the backslash characters before they get to the regular expression classes.
In most cases, the compiler simply complains that the string literal is not correct. The following line won’t compile:
String regex = "\w\d-\w\d"; // error: won't compile
The compiler sees the backslashes in the string and expects to find a valid Java escape sequence, not a regular expression.
Unfortunately, the solution to this problem is ugly: You have to double the backslashes wherever they occur. Java treats two backslashes in a row as an escaped backslash and places a single backslash in the string. Thus you have to code the statement shown in the preceding example like this:
String regex = "\\w\\d-\\w\\d"; // now it will
// compile
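Here each backslash I want in the regular expression is coded as a pair of backslashes in the string literal. If you’re in doubt about whether you coded a literal correctly, print the string with System.out.println. Printing this one displays the following on the console:
\w\d-\w\d
Thus I know that I coded the string literal for the regular expression correctly.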
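One quick way to convince yourself that the doubled backslashes collapse to single ones is a throwaway program like this sketch. The class name BackslashDemo and the method droidRegex are hypothetical. The length check is a second sanity test: the literal contains 13 characters between the quotes, but the resulting string holds only 9.

```java
// Hypothetical demo class; names are illustrative only.
class BackslashDemo {
    // The literal uses doubled backslashes; the string itself holds single ones.
    static String droidRegex() {
        return "\\w\\d-\\w\\d";
    }

    public static void main(String[] args) {
        System.out.println(droidRegex());          // prints \w\d-\w\d
        System.out.println(droidRegex().length()); // prints 9
    }
}
```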
If all you want to do with a regular expression is check whether a string matches a pattern, you can use the matches method of the String class. This method accepts a regular expression as a parameter and returns a boolean that indicates whether the string matches the pattern.
Here’s a static method that validates droid names:
private static boolean validDroidName(String droid)
{
String regex = "(\\w\\d-\\w\\d)|(\\w-\\d\\w\\w)";
return droid.matches(regex);
}
Here the name of the droid is passed via a parameter, and the method returns a boolean that indicates whether the droid’s name is valid. The method simply creates a regular expression from a string literal and then uses the matches method of the droid string to match the pattern.
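To run that method, you need a class around it. Here is one possible wrapper, offered as a sketch: the class name DroidNameCheck and the sample names in main are mine, and I dropped the private modifier so the method can be called from elsewhere. Keep in mind that matches succeeds only when the entire string fits the pattern, which is why the name with a trailing space fails.

```java
// Hypothetical wrapper class; the name and sample inputs are illustrative only.
class DroidNameCheck {
    static boolean validDroidName(String droid) {
        String regex = "(\\w\\d-\\w\\d)|(\\w-\\d\\w\\w)";
        return droid.matches(regex);
    }

    public static void main(String[] args) {
        String[] names = { "r2-d2", "c-3po", "bb8", "r2-d2 " };
        for (String name : names) {
            // Quotes make the trailing space in the last name visible.
            System.out.println("\"" + name + "\" valid? " + validDroidName(name));
        }
    }
}
```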
You can also use the split method to split a string into an array of String objects based on delimiters that match a regular expression. One common way to do that is to simply create a custom class of characters that can be used for delimiters, as in this example:
String s = "One:Two;Three|Four\tFive";
String regex = "[:;|\\t]";
String[] strings = s.split(regex);
for (String word : strings)
System.out.println(word);
Here a string is split into words marked by colons, semicolons, vertical bars, or tab characters. When you run this program, the following text is displayed on the console:
One
Two
Three
Four
Five
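One wrinkle worth knowing: with a single-character delimiter class, two delimiters in a row produce an empty string between them. A possible fix, sketched here under my own names (SplitDemo, splitWords), is to add the + quantifier so that a run of delimiters counts as one separator.

```java
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class SplitDemo {
    // The + treats a run of delimiters as a single separator.
    static String[] splitWords(String s) {
        return s.split("[:;|\\t]+");
    }

    public static void main(String[] args) {
        String s = "One::Two;|Three\tFour";
        // Without +, the doubled delimiters leave empty strings behind.
        System.out.println(Arrays.toString(s.split("[:;|\\t]")));
        // prints [One, , Two, , Three, Four]
        System.out.println(Arrays.toString(splitWords(s)));
        // prints [One, Two, Three, Four]
    }
}
```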
The matches method is fine for occasional use of regular expressions, but if you want your program to do a lot of pattern matching, you should use the Pattern and Matcher classes instead. The Pattern class represents a regular expression that has been compiled into executable form. (Remember that regular expressions are like little programs.) Then you can use the compiled Pattern object to create a Matcher object, which you can use to match strings.
The Pattern class itself is pretty simple. Although it has about ten methods, you usually use just these two:
static Pattern compile(String pattern): Compiles the specified pattern. This static method returns a Pattern object. It throws PatternSyntaxException if the pattern contains an error.
Matcher matcher(CharSequence input): Creates a Matcher object to match this pattern against the specified input, such as a String.
First, you use the compile method to create a Pattern object. (Pattern is one of those weird classes that doesn’t have public constructors. Instead, it relies on the static compile method to create instances.) PatternSyntaxException is an unchecked exception, so the compiler doesn’t force you to catch it. Use a try/catch statement when the pattern comes from user input, as in Listing 2-1, so that a malformed regex doesn’t crash the program.
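Here is a minimal sketch of catching the exception when you compile a pattern. The class name SafeCompileDemo and the helper tryCompile are hypothetical, and returning null for a bad regex is just one convenient choice for a demo. The getDescription and getIndex methods belong to PatternSyntaxException; I print whatever they report rather than promising exact wording.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; names are illustrative only.
class SafeCompileDemo {
    // Returns the compiled pattern, or null if the regex is malformed.
    static Pattern tryCompile(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            System.out.println("Bad regex: " + e.getDescription()
                    + " near index " + e.getIndex());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCompile("f[aio]r") != null); // true: valid pattern
        System.out.println(tryCompile("f[aio") != null);   // false: bracket never closed
    }
}
```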
After you have a Pattern instance, you use the matcher method to create an instance of the Matcher class. This class has more than 30 methods that let you do all sorts of things with regular expressions that aren’t covered in this chapter, such as finding multiple occurrences of a pattern in an input string or replacing text that matches a pattern with a replacement string. For purposes of this book, I’m concerned only with the matches method:
boolean matches(): Returns a boolean that indicates whether the entire string matches the pattern. (This is an instance method, so you call it on a Matcher object.)
To illustrate how to use these methods, here’s an enhanced version of the validDroidName method that creates a pattern for the droid-validation regex and saves it in a static class field:
private static Pattern droidPattern;
private static boolean validDroidName(String droid)
{
if (droidPattern == null)
{
String regex = "(\\w\\d-\\w\\d)|"
+ "(\\w-\\d\\w\\w)";
droidPattern = Pattern.compile(regex);
}
Matcher m = droidPattern.matcher(droid);
return m.matches();
}
Here the private class field droidPattern saves the compiled pattern for validating droids. The if statement in the validDroidName method checks whether the pattern has already been created. If not, the pattern is created by calling the static compile method of the Pattern class. Then the matcher method is used to create a Matcher object for the string passed as a parameter, and the string is validated by calling the matches method of the Matcher object.
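A common variation on this idea, which is my suggestion rather than the approach shown above, is to compile the pattern in a static final field initializer and skip the null check altogether. The sketch below assumes a class named DroidValidator and a constant named DROID, both made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical class; the names are illustrative only.
class DroidValidator {
    // Compiled once, when the class is initialized.
    private static final Pattern DROID =
        Pattern.compile("(\\w\\d-\\w\\d)|(\\w-\\d\\w\\w)");

    static boolean validDroidName(String droid) {
        // A new Matcher is created for each call.
        return DROID.matcher(droid).matches();
    }

    public static void main(String[] args) {
        System.out.println(validDroidName("r2-d2")); // true
        System.out.println(validDroidName("c-3po")); // true
        System.out.println(validDroidName("r2d2"));  // false: no hyphen
    }
}
```

The trade-off is small. The null-check version delays compilation until the first call, whereas this one compiles as soon as the class is initialized. Because Pattern objects are immutable and safe to share between threads (Matcher objects are not), a shared constant with a fresh Matcher per call is a reasonable default.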