Chapter 11. Regular Expressions: Rules for replacement

String functions are kind of lovable. But at the same time, they’re limited. Sure, they can tell the length of your string, truncate it, and change certain characters to other certain characters. But sometimes you need to break free and tackle more complex text manipulations. This is where regular expressions can help. They can precisely modify strings based on a set of rules rather than a single criterion.

Riskyjobs.biz has grown. The company now lets job seekers enter their resumes and contact information into a web form so that our Risky Jobs employers can find them more easily. Here’s what the form looks like:

[Figure: the Risky Jobs registration form]

Our job seeker information is stored in a table that can be searched by employers, recruiters, and headhunters to identify potential new employees. But there’s a problem... the data entered into the form apparently can’t be trusted!

You can fix some data with string functions, but they don’t help much when data must fit a very specific pattern.

String functions are well suited to simple find-and-replace operations. For example, if users submitted their phone numbers using dots instead of hyphens to separate the blocks of digits, we could easily write some str_replace() code to substitute hyphens in their place.

But for anything that we can’t possibly know, like the area code of Jimmy Swift’s phone number, we need to ask the person who submitted the form to clarify. And the only way we can know that he’s missing an area code is to understand the exact pattern of a phone number. What we really need is more advanced validation to ensure that things like phone numbers and email addresses are entered exactly right.

String functions really aren’t useful for more than the most primitive of data validation.

Think about how you might attempt to validate an email address using string functions. PHP has a function called strlen() that will tell you how many characters are in a string. But there’s no predefined character length for data like email addresses. Sure, this could potentially help with phone numbers because they usually contain a consistent number of digits, but you still have the potential dots, dashes, and parentheses to deal with.

Getting back to email addresses, their format is just too complex for string functions to be of much use. We’re really looking for specific patterns of data here, which requires a validation strategy that can check user data against a pattern to see if it’s legit. Modeling patterns for your form data is at the heart of this kind of validation.

Our challenge is to clearly specify exactly what a given piece of form data should look like, right down to every character. Consider Jimmy’s phone number. It’s pretty obvious to a human observer that his number is missing an area code. But form validation isn’t carried out by humans; it’s carried out by PHP code. This means we need to “teach” our code how to look at a string of data entered by the user and determine if it matches the pattern for a phone number.

Coming up with such a pattern can be a challenge, and it involves really thinking about the range of possibilities for a type of data. Phone numbers are fairly straightforward since they involve 10 digits with optional delimiters. Email addresses are a different story, but we’ll worry about them a bit later in the chapter.

There are some things we know for sure about phone numbers, and we can use those things to make rules.

First, they can’t begin with 1 (long distance) or 0 (operator). Second, there should be 10 digits. And even though some people might have clever ways to represent their phone numbers with letters, phone numbers are essentially numbers: 10 digits when we include an area code.

To go beyond basic validation, such as empty() and isset(), we need to decide on a pattern that we want our data to match. In the case of a phone number, this means we need to commit to a single format that we expect to receive from the phone field in our form. Once we decide on a phone number format/pattern, we can validate against it.

What is likely the most common phone number format in use today, at least for domestic U.S. phone numbers, is three digits, a hyphen, three more digits, another hyphen, and four digits, as in 555-636-4652. Committing to this format means that if the phone number data users submit doesn’t match it, the PHP script will reject the form and display an error message.

PHP offers a powerful way to create and match patterns in text. You can create rules that let you look for patterns in strings of text. These rules are referred to as regular expressions, or regex for short. A regular expression represents a pattern of characters to match. With the help of regular expressions, you can describe in your code the rules you want your strings to conform to in order for a match to occur.

As an example, here’s a regular expression that looks for 10 digits in a row. This pattern will only match a string that consists of a 10-digit number. If the string is longer or shorter than that, it won’t match. If the string contains anything but digits, it won’t match. Let’s break it down.

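Here’s a rough sketch of how that 10-digit pattern could be written and tried out in PHP. The pattern string and the sample values are assumptions based on the description above, and preg_match() (covered later in this chapter) is used here only to show which strings match.

```php
<?php
// Assumed pattern: ^ anchors the start of the string, each \d matches a
// single digit, and $ anchors the end, so exactly ten digits are required.
$ten_digits = '/^\d\d\d\d\d\d\d\d\d\d$/';

// Made-up sample values: one good number and three that break a rule.
$samples = array('5556364652', '6364652', '55563646520', '555-636-46');

foreach ($samples as $sample) {
    $verdict = preg_match($ten_digits, $sample) ? 'match' : 'no match';
    echo $sample . ': ' . $verdict . "\n";
}
```

Only the first sample should be reported as a match. The others are too short, too long, or contain hyphens.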

There’s also a more concise way of writing this same regular expression, which makes use of curly braces. Curly braces are used to indicate repetition:

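As a quick sanity check, the sketch below compares the long form of the 10-digit pattern with a curly-brace version. Both pattern strings are assumptions reconstructed from the text, and the sample values are made up.

```php
<?php
// Assumed patterns: ten \d in a row, and the shorter form that uses {10}
// to say "repeat the preceding \d exactly ten times".
$long_form  = '/^\d\d\d\d\d\d\d\d\d\d$/';
$short_form = '/^\d{10}$/';

foreach (array('5556364652', '555636465', '555 636 4652') as $sample) {
    // The two patterns should always agree with each other.
    $agree = preg_match($long_form, $sample) === preg_match($short_form, $sample);
    echo $sample . ': ' . ($agree ? 'same result' : 'different result') . "\n";
}
```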

It’s true, regular expressions are cryptic and often difficult to read... but they are very powerful.

Power often comes at a cost, and in the case of regular expressions, that cost is learning the cryptic syntax that goes into them. You won’t become a master of regular expressions overnight, but the good news is you don’t have to. You can do some amazingly powerful and useful things with regular expressions, especially when it comes to form field validation, with a very basic knowledge of regular expressions. Besides, the more you work with them and get practice breaking them down and parsing them, the easier they’ll be to understand.

Being able to match digits in a text string using \d is pretty cool, but if that’s all the functionality that regular expressions provided, their use would be sorely limited. Just matching digits isn’t even going to be enough for Risky Jobs phone number validation functionality, as we’re going to want to be able to match characters like spaces, hyphens, and even letters.

Luckily, PHP’s regex functionality lets you use a bunch more special expressions like \d to match these things. These expressions are called metacharacters. Let’s take a look at some of the most frequently used regex metacharacters.

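To get a feel for metacharacters, here’s a small sketch that tries a handful of commonly used ones against sample strings. The selection of metacharacters and the sample strings are assumptions chosen for illustration, not a complete list.

```php
<?php
// Each entry: a pattern, a sample string it should match, and a short note.
$metachar_demos = array(
    array('/\d/',     'Room 7',     '\d matches any digit'),
    array('/\w/',     'ninja_42',   '\w matches a letter, digit, or underscore'),
    array('/\s/',     'Risky Jobs', '\s matches whitespace such as a space or tab'),
    array('/^Risky/', 'Risky Jobs', '^ matches the start of the string'),
    array('/Jobs$/',  'Risky Jobs', '$ matches the end of the string'),
    array('/^J.bs$/', 'Jobs',       '. matches any single character except a newline'),
);

foreach ($metachar_demos as $demo) {
    list($pattern, $subject, $note) = $demo;
    $verdict = preg_match($pattern, $subject) ? 'match' : 'no match';
    echo $note . ' (' . $pattern . ' on "' . $subject . '"): ' . $verdict . "\n";
}
```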

These metacharacters are cool, but what if you really want a specific character in your regex? Just use that character in the expression. For example, if you wanted to match the exact phone number “707-827-7000”, you would use the regex /707-827-7000/.

Yes, but the key is to specify such a pattern as optional in your regular expression.

If we changed our regex to /^\d{3}-\d{3}-\d{4}-\d{4}$/, we’d be requiring our string to have a four-digit extension at the end, and we’d no longer match phone numbers like “555-636-4652”. But we can use regular expressions to indicate that parts of the string are optional. Regexes support a feature called quantifiers that let you specify how many times characters or metacharacters should appear in a pattern. You’ve actually already seen quantifiers in action in regexes like this:

/^\d{3}-\d{3}-\d{4}$/

Here, curly braces act as a quantifier to say how many times the preceding digit should appear. Let’s take a look at some other frequently used quantifiers.

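The sketch below tries out four widely used quantifiers: {min,max}, +, *, and ?. The patterns, sample strings, and expected results are assumptions picked for illustration.

```php
<?php
// Each entry: a pattern, a sample string, and whether a match is expected.
$quantifier_demos = array(
    array('/^\d{2,4}$/', '123',   true),   // {2,4}: between two and four digits
    array('/^\d{2,4}$/', '12345', false),  // five digits is one too many
    array('/^\d+$/',     '',      false),  // +: one or more, so empty fails
    array('/^\d*$/',     '',      true),   // *: zero or more, so empty passes
    array('/^-?\d{4}$/', '-4652', true),   // ?: the hyphen may appear once...
    array('/^-?\d{4}$/', '4652',  true),   // ...or not at all
);

foreach ($quantifier_demos as $demo) {
    list($pattern, $subject, $expected) = $demo;
    $matched = preg_match($pattern, $subject) === 1;
    echo $pattern . ' on "' . $subject . '": ' . ($matched ? 'match' : 'no match') . "\n";
}
```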

So, if we wanted to match those optional digits at the end of our phone number, we could use the following pattern:

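One way to write such a pattern, sketched here as an assumption rather than the definitive Risky Jobs regex, is to wrap the extension in parentheses and follow the group with a question mark so the whole group becomes optional.

```php
<?php
// Assumed pattern: the group (-\d{4}) is a hyphen plus a four-digit
// extension, and the ? after it lets the group appear once or not at all.
$phone_with_optional_ext = '/^\d{3}-\d{3}-\d{4}(-\d{4})?$/';

$samples = array('555-636-4652', '555-636-4652-1234', '555-636-4652-12');

foreach ($samples as $sample) {
    $verdict = preg_match($phone_with_optional_ext, $sample) ? 'match' : 'no match';
    echo $sample . ': ' . $verdict . "\n";
}
```

The first two samples should match. The last one shouldn’t, because a two-digit extension doesn’t satisfy \d{4}.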

You’re absolutely right. 0 connects you to an operator, and 1 dials long distance.

We simply want the area code and number. We need to make sure the first digit is not 1 or 0. And to do that, we need a character class.

Character classes let you match characters from a specific set of values. You can look for a range of digits with a character class. You can also look for a set of values. And you can add a caret to look for everything that isn’t in the set.

To indicate that a bunch of characters or metacharacters belongs to a character class, all you need to do is surround them by square brackets, []. Let’s take a look at a few examples of character classes in action:
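
Here are a few character classes sketched in PHP. The specific classes and sample strings are assumptions chosen to illustrate ranges, sets, and the negating caret described above.

```php
<?php
// Each entry: a pattern, a sample string, and whether a match is expected.
$class_demos = array(
    array('/^[0-9]$/',   '7',      true),   // a range: any single digit
    array('/^[2-9]$/',   '1',      false),  // a range that leaves out 0 and 1
    array('/^[-\s]$/',   '-',      true),   // a set: a hyphen or whitespace
    array('/^[^0-9]+$/', 'Ninja',  true),   // ^ inside []: anything but digits
    array('/^[^0-9]+$/', 'Ninja7', false),  // the 7 is in the excluded set
);

foreach ($class_demos as $demo) {
    list($pattern, $subject, $expected) = $demo;
    $matched = preg_match($pattern, $subject) === 1;
    echo $pattern . ' on "' . $subject . '": ' . ($matched ? 'match' : 'no match') . "\n";
}
```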

Write a regular expression that matches international phone numbers:

__________________________________________

__________________________________________

__________________________________________

With the help of character classes, we can refine our regular expression for phone numbers so that it won’t match invalid digit combinations. That way, if someone accidentally enters an area code that starts with 0 or 1, we can throw an error message. Here’s what our new-and-improved regex looks like:

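A pattern along these lines could look like the sketch below. The exact regex is an assumption built from the rules above: a first digit from 2 through 9, then the rest of the number in the hyphenated format.

```php
<?php
// Assumed pattern: [2-9] rejects area codes starting with 0 or 1, and the
// remaining digits follow the ###-###-#### layout.
$phone_pattern = '/^[2-9]\d{2}-\d{3}-\d{4}$/';

$samples = array('555-636-4652', '155-636-4652', '055-636-4652');

foreach ($samples as $sample) {
    $verdict = preg_match($phone_pattern, $sample) ? 'match' : 'no match';
    echo $sample . ': ' . $verdict . "\n";
}
```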

If you want to use reserved characters in your regular expression, you need to escape them.

In regular expression syntax, there is a small set of characters that are given special meaning, because they are used to signify things like metacharacters, quantifiers, and character classes. These include the period (.), the question mark (?), the plus sign (+), the opening square bracket ([), opening and closing parentheses, the caret (^), the dollar sign ($), the vertical pipe character (|), the backslash (\), the asterisk (*), and the forward slash (/), which is reserved in our patterns because it marks where the regex starts and ends.

If you want to use these characters in your regular expression to signify their literal meaning instead of the metacharacters or quantifiers they usually represent, you need to “escape” them by preceding them with a backslash.

For example, if you wanted to match parentheses in a phone number, you couldn’t just type them into the pattern as they are, because unescaped parentheses are treated as grouping characters rather than literal text. Instead, both the opening and closing parentheses need to be preceded by backslashes to indicate that they should be interpreted as actual parentheses:

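The difference is easy to see by running both versions against the same phone number. The two patterns below are assumptions written for this comparison.

```php
<?php
// Assumed patterns: identical except for the backslashes before ( and ).
$unescaped = '/^(\d{3})\d{3}-\d{4}$/';    // ( ) act as a group, not as text
$escaped   = '/^\(\d{3}\)\d{3}-\d{4}$/';  // \( \) match literal parentheses

$phone = '(555)636-4652';

var_dump(preg_match($unescaped, $phone)); // int(0): no literal ( is expected
var_dump(preg_match($escaped, $phone));   // int(1): parentheses are matched
```

Without the backslashes, the pattern quietly matches a string like 555636-4652 instead, which is not what we were after.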

Risky Jobs needs to put regular expressions to work validating form data!

We haven’t been developing patterns just for the fun of it. You can use these patterns with the PHP function preg_match(). This function takes a regex pattern, just like those we’ve been building, and a text string. It returns 1 if the pattern matches and 0 if it doesn’t (or false if an error occurs), so the result can be used directly as a true-or-false test.

Here’s an example of the preg_match() function in action, using a regex that searches a text string for a four-character pattern of alternating uppercase letters and digits:

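A call like that might look like the following sketch. The pattern and test strings are assumptions, and the pattern is anchored so the whole string has to be exactly four characters.

```php
<?php
// Assumed pattern: uppercase letter, digit, uppercase letter, digit.
$alternating = '/^[A-Z]\d[A-Z]\d$/';

var_dump(preg_match($alternating, 'A1B2')); // int(1): the pattern matches
var_dump(preg_match($alternating, 'a1b2')); // int(0): lowercase letters fail
var_dump(preg_match($alternating, 'A1B'));  // int(0): one character short
```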

We can take advantage of the preg_match() function to enable more sophisticated validation functionality in PHP scripts by building an if statement around the return value.

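Here’s a minimal sketch of that idea. The variable names, the error message, and the $output_form flag are hypothetical, and a plain variable stands in for the value that would normally arrive in $_POST.

```php
<?php
// Stand-in for the submitted form value, e.g. $_POST['phone'].
$phone = '555-636-4652';

// Hypothetical flag: set to true when the form should be shown again.
$output_form = false;

// Assumed phone pattern: first digit 2-9, then the ###-###-#### layout.
if (!preg_match('/^[2-9]\d{2}-\d{3}-\d{4}$/', $phone)) {
    // preg_match() returned 0, so the data doesn't fit the pattern.
    echo '<p class="error">Your phone number is invalid.</p>';
    $output_form = true;
}
```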

First Name: Jimmy

Last Name: Swift

Email:

Phone: (555) 636 4652

Desired Job: Ninja

Just because you’re permitting data to be input in all different formats doesn’t necessarily mean you want your data stored in all those formats.

Luckily, there’s another regex function that’ll let us take the valid phone number data submitted by Risky Jobs’s users and make all of it conform to just one consistent pattern, instead of four.

The preg_replace() function goes one step beyond the preg_match() function in performing pattern matching using regular expressions. In addition to determining whether a given pattern matches a given string of text, it allows you to supply a replacement pattern to substitute into the string in place of the matched text. It’s a lot like the str_replace() function we’ve already used, except that it matches using a regular expression instead of a string.

Here’s an example of the preg_replace() function in action:

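As a simple illustration (the pattern, replacement, and input string are made up for this sketch), the call below swaps every digit in a phone number for a # character. The arguments are the pattern, the replacement, and the string to search.

```php
<?php
// Replace each digit (\d) with a # and leave everything else alone.
$masked = preg_replace('/\d/', '#', '555-636-4652');

echo $masked . "\n"; // prints ###-###-####
```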

Right now, Risky Jobs is using the following regular expression to validate the phone numbers users submit via their registration form:

/^\(?[2-9]\d{2}\)?[-\s]\d{3}-\d{4}$/

This will match phone numbers that fall into these four patterns:
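
The sketch below runs that regex against four sample numbers, one per format. The samples are assumptions worked out from the pattern itself: parentheses around the area code or not, followed by either a hyphen or a space. Jimmy’s number is included as a fifth sample to show a format the pattern rejects.

```php
<?php
// The validation regex from the text.
$phone_pattern = '/^\(?[2-9]\d{2}\)?[-\s]\d{3}-\d{4}$/';

$samples = array(
    '555-636-4652',    // hyphen after the area code
    '555 636-4652',    // space after the area code
    '(555)-636-4652',  // parentheses, then a hyphen
    '(555) 636-4652',  // parentheses, then a space
    '(555) 636 4652',  // Jimmy's format: space where a hyphen is required
);

foreach ($samples as $sample) {
    $verdict = preg_match($phone_pattern, $sample) ? 'match' : 'no match';
    echo $sample . ': ' . $verdict . "\n";
}
```

Because each parenthesis is optional on its own, the pattern would also accept a lopsided entry such as (555-636-4652.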

While these formats are easily interpreted by people, they make it difficult for SQL queries to sort results the way we want. Those parentheses will most likely foil our attempts to group phone numbers by area code, for example, which might be important to Risky Jobs if we want to analyze how many of the site’s users came from a specific geographical location.

To make these kinds of queries possible, we need to standardize phone numbers to one format using preg_replace() before we INSERT data into the database. Our best bet is to get rid of all characters except numeric digits. That way, we simply store 10 digits in our table with no other characters, so a number entered as (555) 636-4652 would be stored as 5556364652.

This leaves us with four characters to find and replace. We want to find and remove opening and closing parentheses, spaces, and dashes. And we want to find these characters no matter where in the string they are, so we don’t need the starting caret (^) or ending dollar sign ($). We know we’re looking for any one of a set, so we can use a character class. The order of the characters in the class doesn’t matter. Here’s the regex we can use:

/[\(\)\-\s]/

Now that we have our pattern that finds those unwanted characters, we can apply it to phone numbers to clean them up before storing them in the database. But how? This is where the preg_replace() function really pays off. The twist here is that we don’t want to replace the unwanted characters, we just want them gone. So we simply pass an empty string into preg_replace() as the replacement value. Here’s an example that finds unwanted phone number characters and replaces them with empty strings, effectively getting rid of them:

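A call along those lines might look like this sketch. The pattern is an assumption built from the four unwanted characters named above, and the variable names are hypothetical.

```php
<?php
// Stand-in for a phone number that has already passed validation.
$phone = '(555) 636-4652';

// Assumed pattern: a character class holding (, ), hyphen, and whitespace.
// The empty string as the replacement simply deletes every match.
$clean_phone = preg_replace('/[\(\)\-\s]/', '', $phone);

echo $clean_phone . "\n"; // prints 5556364652
```

Since preg_replace() replaces every match it finds, one call is enough to strip all four kinds of characters.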

Sure, but it would end up causing problems later since phone number queries won’t work as expected.

Most users are accustomed to entering phone numbers with some combination of dashes (hyphens), parentheses, and spaces, so attempting to enforce pure numeric phone numbers may not work as expected. It’s much better to try and meet users halfway, giving them reasonably flexible input options, while at the same time making sure the data you store is as consistent as possible.

Besides, we’re only talking about one call to preg_replace() to solve the problem, which just isn’t a big deal. If we were talking about writing some kind of custom function with lots of code, it might be a different story. But improving the usability and data integrity with a single line of code is a no-brainer!

Similar to phone numbers, email addresses have enough of a format to them that we should be validating for more than just being empty.

Just like with validating phone numbers earlier, we first need to determine the rules that valid email addresses must follow. Then we can formalize them as a regular expression, and implement them in our PHP script. So let’s first take a look at what exactly makes up an email address:

[Figure: an email address broken into its LocalName, the @ sign, the domain prefix, and the domain suffix]

It seems like it should be pretty simple to match email addresses, because at first glance, there don’t appear to be as many restrictions on the characters you can use as there are with phone numbers.

For example, it doesn’t seem like too big of a deal to match the LocalName portion of an email address (everything before the @ sign). Since that’s just made up of alphanumeric characters, we should be able to use the following pattern:

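One way to express an alphanumeric-only LocalName is sketched below. The exact pattern is an assumption, and the sample addresses are made up.

```php
<?php
// Assumed pattern: one or more letters or digits at the start of the
// string, followed by the @ sign.
$simple_local_name = '/^[a-zA-Z0-9]+@/';

var_dump(preg_match($simple_local_name, 'jimmy@riskyjobs.biz'));       // int(1)
var_dump(preg_match($simple_local_name, 'jimmy.swift@riskyjobs.biz')); // int(0)
```

The second address fails because the dot isn’t part of the character class, even though dots are perfectly legal in a LocalName.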

This would allow any alphanumeric character in the local name, but unfortunately, it doesn’t include characters that are also legal in email addresses.

Believe it or not, valid email addresses can contain any of these characters in the LocalName portion, although some of them can’t be used to start an email address:

! # $ % & ' * + - / = ? ^ _ ` { | } ~ .

If we want to allow users whose email addresses contain these characters to register, we really need a regex that looks something more like this:

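A broader LocalName pattern might be sketched like this. The particular set of special characters is an assumption that covers some common cases, not every character the email standard allows.

```php
<?php
// Assumed pattern: the first character must be a letter or digit, and the
// rest may also include a handful of special characters, up to the @ sign.
$local_name = '/^[a-zA-Z0-9][a-zA-Z0-9\._\-&!?=#]*@/';

$samples = array(
    'jimmy.swift@riskyjobs.biz',   // dot in the LocalName
    'j!mmy_swift-1@riskyjobs.biz', // several special characters
    '.jimmy@riskyjobs.biz',        // starts with a dot
);

foreach ($samples as $sample) {
    $verdict = preg_match($local_name, $sample) ? 'match' : 'no match';
    echo $sample . ': ' . $verdict . "\n";
}
```

The first two samples should match, while the third is rejected because it starts with a dot.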

This won’t match every single valid LocalName, as we’re still skipping some of the really obscure characters, but it’s very practical to work with and should still match the email addresses of most Risky Jobs users.

That would work for part of the domain, the prefix, but it wouldn’t account for the suffix.

While the domain prefix can contain pretty much any combination of alphanumerics and a few special characters, just like the LocalName, the restrictions on domain suffixes are much more stringent.

Most email addresses end in one of a few common domain suffixes: .com, .edu, .org, .gov, and so on. We’ll need to make sure email addresses end in a valid domain suffix, too.

In addition to super-common domain suffixes that you see quite frequently, like .com and .org, there are many, many other domain suffixes that are valid for use in email addresses. Other suffixes recognized as valid by the Domain Name System (DNS) that you may have seen before include .biz and .info. In addition, there’s a list of suffixes that correspond to different countries, like .ca for Canada and .tj for Tajikistan.

The suffixes mentioned here are just a few of the possibilities. The complete list is much longer.

We could do that, and it would work.

But there’s an easier way. Instead of keeping track of all the possible domains and having to change our code if a new one is added, we can check the domain portion of the email address using the PHP function checkdnsrr(). This function connects to the Domain Name System, or DNS, and checks the validity of domains.

PHP provides the checkdnsrr() function for checking whether a domain is valid. This method is even better than using regular expressions to match the pattern of an email address, because instead of just checking if a string of text could possibly be a valid email domain, it actually checks the DNS records and finds out if the domain is actually registered. So, for example, while a regular expression could tell you that lasdjlkdfsalkjaf.com is valid, checkdnsrr() can go one step further and tell you that, in fact, this domain is not registered, and that we should probably reject sdfhfdskl@lasdjlkdfsalkjaf.com if it’s entered on our registration form.

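The syntax for checkdnsrr() is quite simple. Pass it a domain name as a string, and it returns true if DNS records are found for that domain and false if they aren’t. By default it looks for MX (mail exchange) records:

checkdnsrr($domain)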

We now know how to validate both the LocalName portion of an email address using regular expressions, and the domain portion of an email address using checkdnsrr(). Let’s look at the step-by-step of how we can put these two parts together to add full-fledged email address validation to Risky Jobs’s registration form:
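
One possible way to combine the two checks is sketched below: match the LocalName with a regex, strip it off with preg_replace() to leave the domain, then hand the domain to checkdnsrr(). The function name, the pattern, and the stand-in DNS check are all hypothetical. The stand-in is there only so the example runs without a network connection.

```php
<?php
// Assumed pattern for the LocalName plus the @ sign (a sketch, not a full
// implementation of the email address standard).
define('LOCAL_NAME_PATTERN', '/^[a-zA-Z0-9][a-zA-Z0-9\._\-&!?=#]*@/');

// Hypothetical helper. $dns_check is any callable that takes a domain and
// returns true or false; by default it is PHP's own checkdnsrr().
function is_plausible_email($email, $dns_check = 'checkdnsrr') {
    // Step 1: the address must start with a valid-looking LocalName and @.
    if (!preg_match(LOCAL_NAME_PATTERN, $email)) {
        return false;
    }
    // Step 2: strip the LocalName and @ so only the domain is left.
    $domain = preg_replace(LOCAL_NAME_PATTERN, '', $email);
    if ($domain === '') {
        return false;
    }
    // Step 3: ask DNS (or the stand-in) whether the domain is registered.
    return (bool) call_user_func($dns_check, $domain);
}

// Stand-in DNS check that pretends only riskyjobs.biz is registered.
$fake_dns = function ($domain) {
    return $domain === 'riskyjobs.biz';
};

var_dump(is_plausible_email('jimmy.swift@riskyjobs.biz', $fake_dns));      // bool(true)
var_dump(is_plausible_email('sdfhfdskl@lasdjlkdfsalkjaf.com', $fake_dns)); // bool(false)
var_dump(is_plausible_email('no-at-sign.example.com', $fake_dns));         // bool(false)
```

Leaving off the second argument makes the function call the real checkdnsrr(), which needs a working DNS connection.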

Looking for patterns in text can be very handy when it comes to validating data entered by the user into web forms. Here are some of the PHP techniques used to validate data with the help of regular expressions:

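Regular expression: a rule that describes a pattern of characters to match in a string of text.

Metacharacters such as \d, \w, \s, ^, and $: special expressions that stand for digits, alphanumeric characters, whitespace, and the start and end of a string.

Quantifiers and character classes: curly braces and symbols like ? control how many times something appears, and square brackets match any one character from a set.

preg_match(): checks a string against a regular expression and reports whether the pattern matches.

preg_replace(): finds the text that matches a regular expression and substitutes a replacement in its place.

checkdnsrr(): checks the Domain Name System to find out whether a domain is actually registered.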