When Regular Expressions Are (or Aren’t) the Right Parsing Tool

An old adage says, “When the only tool you have is a hammer, all problems will look like nails.” This saying definitely applies to regular expressions. While regular expressions are a very powerful tool, it is important to remember that they are not the only tool at your disposal. This section explores the most likely reasons that you may want to use some of the simpler parsing methods mentioned in Chapter 4.

If you can abstract the content you want to extract, or parse, with an alphanumeric pattern, then you probably should be using regular expressions. Regular expressions are an extraordinarily powerful tool because much of the data we want to scrape from web pages (prices, names, street addresses, phone numbers, URLs, etc.) can be described symbolically through patterns.
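As a minimal sketch of what "described symbolically through patterns" can mean in practice, the following PHP snippet pulls US-style prices out of a fragment of markup. The sample content, the variable names, and the price pattern itself are assumptions made for illustration; real pages will usually need a pattern tuned to their own formatting.

```php
<?php
// Hypothetical sample content; in practice this would be a downloaded page.
$web_page = '<li>Widget: $19.99</li><li>Gadget: $1,249.00</li><li>Call 555-0100</li>';

// A dollar sign, one to three digits, optional groups of ",ddd",
// then a decimal point and exactly two digits.
$price_pattern = '/\$\d{1,3}(?:,\d{3})*\.\d{2}/';

// preg_match_all() stores every full match in $matches[0].
preg_match_all($price_pattern, $web_page, $matches);
$prices = $matches[0];

print_r($prices);
```

With this sample input, $prices holds the two price strings and ignores the phone number, because the phone number has no dollar sign or decimal portion.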

While regular expressions are a powerful tool to have in your web-scraping arsenal, they should never be the primary tool you use to parse and extract information from downloaded content. I believe that regular expressions should be used judiciously and not simply because you know how to use them. I’ve drawn a lot of heat from people for this opinion, and I expect to get more flames thrown in my direction after people read this and the prior chapter. Before you send hate my way, please remember that I acknowledge that mine is the minority opinion, but it’s a belief that I’m quite comfortable with. In 15-plus years of webbot development, I’ve found that it is best to use regular expressions sparingly, where they are most effective. Here are my reasons.

If you need to use a regular expression function, I advocate creating a wrapper function, which repackages difficult-to-read code inside easier-to-read routines. A good example of this is the parse_array() function that appears in the LIB_parse library (described in Chapter 4). You can use parse_array() to extract all image tags into an array with the line of code shown in Example 5-19.

You can accomplish the same thing with preg_match_all(), as in Example 5-20. But I argue that the code is not as easy to debug, maintain, or read.

It does not take a seasoned programmer to recognize that Example 5-19 is much easier to read and understand than Example 5-20. That’s why LIB_parse uses the function parse_array() as a wrapper around the PHP built-in function preg_match_all() to make the code more readable. As the complexity of software increases, so do the odds of errors and the cost of debugging and validation. Using debugged libraries with simplified interfaces will decrease development time and make your scripts easier to maintain.
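To make the wrapper idea concrete, here is a rough sketch of how a function in the spirit of parse_array() could hide a preg_match_all() call. The name parse_array_sketch(), its parameter names, and the sample markup are hypothetical, and the actual LIB_parse implementation may differ; the point is only to contrast a call that states its intent with one that exposes the raw pattern.

```php
<?php
// Hypothetical stand-in for a wrapper such as parse_array();
// the real LIB_parse code may be written differently.
function parse_array_sketch(string $string, string $open_tag, string $close_tag): array
{
    // preg_quote() escapes the delimiters so they are matched literally.
    // ".*?" is a lazy match; "s" lets "." span newlines; "i" ignores case.
    $pattern = '/' . preg_quote($open_tag, '/') . '.*?' . preg_quote($close_tag, '/') . '/si';
    preg_match_all($pattern, $string, $matches);
    return $matches[0];
}

// Hypothetical sample content.
$web_page = '<p>Logo: <img src="logo.png" alt="Logo"></p><IMG SRC="photo.jpg">';

// Wrapper call: the intent is readable at a glance.
$img_tags = parse_array_sketch($web_page, '<img', '>');

// Direct call: same result, but the reader has to decode the pattern.
preg_match_all('/<img.*?>/si', $web_page, $direct_matches);

var_dump($img_tags === $direct_matches[0]); // bool(true)
```

Both calls return the same two image tags. The difference is that any fix to the pattern now happens in one place, inside the wrapper, rather than at every call site.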

One hot-button topic is whether regular expressions in PHP run as quickly as the comparable PHP built-in string functions. I’ve done some benchmarking, and while sparing you the details, I’ve found the PHP built-in functions are only marginally more efficient than their regular expression counterparts. This was a bigger concern with older versions of PHP but is no longer a consideration with modern versions of the language. In reality, this is a moot point anyway, because if you are truly concerned about speed and efficiency, you’d probably be better off developing in C than in PHP.
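If you would rather measure this on your own system than take anyone's word for it, a rough timing harness along the following lines is one way to start. This is only a sketch under stated assumptions: the sample string, the iteration count, and the choice of strpos() versus preg_match() are illustrative, hrtime() requires PHP 7.3 or later, and the numbers will vary with PHP version, hardware, and input.

```php
<?php
// Hypothetical sample data: a long string with the target near the end.
$haystack = str_repeat('lorem ipsum dolor sit amet ', 2000) . 'price: $19.99';
$iterations = 1000;

// Time a PHP built-in string function.
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $found_builtin = strpos($haystack, 'price:') !== false;
}
$builtin_ms = (hrtime(true) - $start) / 1e6; // nanoseconds to milliseconds

// Time the equivalent regular expression.
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $found_regex = preg_match('/price:/', $haystack) === 1;
}
$regex_ms = (hrtime(true) - $start) / 1e6;

printf("strpos:     %.2f ms\n", $builtin_ms);
printf("preg_match: %.2f ms\n", $regex_ms);
```

A single run like this is noisy, so repeat it several times and try inputs that resemble the pages you actually parse before drawing conclusions.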