An old adage says, “When the only tool you have is a hammer, all problems will look like nails.” This saying definitely applies to regular expressions. While regular expressions are a very powerful tool, it is important to remember that they are not the only tool at your disposal. This section explores the most likely reasons that you may want to use some of the simpler parsing methods mentioned in Chapter 4.
If you can abstract the content you want to extract, or parse, with an alphanumeric pattern, then you probably should be using regular expressions. Regular expressions are an extraordinarily powerful tool because much of the data we want to scrape from web pages (prices, names, street addresses, phone numbers, URLs, etc.) can be described symbolically through patterns.
While regular expressions are a powerful tool to have in your webscraping arsenal, they should never be the primary tool you use to parse and extract information from downloaded content. I believe that regular expressions should be used judiciously and not simply because you know how to use them. I’ve drawn a lot of heat from people for this opinion, and I expect to get more flames thrown in my direction after people read this and the prior chapter. Before you send hate my way, please remember that I acknowledge that mine is the minority opinion, but it’s a belief that I’m quite comfortable with. In 15-plus years of webbot development, I’ve found that it is best to use regular expressions sparingly, where they are most effective. Here are my reasons why.
Regular expressions excel at extracting data like prices, phone numbers, or other character strings that can be described with patterns. Unfortunately, without context, the data you extract may not actually mean very much. In many cases, using regular expressions is like listening to someone speak but only hearing the nouns—and without verbs, nouns have little meaning. For example, you can extract a phone number with regular expressions, but matching a pattern for a phone number will not tell you whose phone number it is. Alternatively, it is not very useful to extract a price without also knowing what is sold at that price. When you are aware of the context in which those prices appear, you can determine which part numbers or item descriptions are associated with the prices you extracted.
My experience is that context matching is much more valuable to webscraping than pattern matching. In probably fewer than 5 percent of my projects do I want to extract data without regard to the surrounding context. Parsing tabled data by examining what’s between table tags is a more realistic example of the context-sensitive matching you’ll do in the real world. In that case, you identify text as a price not because of its pattern but because of its context within the surrounding page content.
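To make the difference concrete, here is a minimal sketch that contrasts the two approaches on a small table. The sample markup, the item names, and the variable names are all invented for illustration, and the sketch assumes a simple, well-formed table; it uses only PHP built-in functions rather than LIB_parse.

```php
<?php
// Hypothetical page fragment; the markup and values are made up for this sketch.
$page = '<table>'
      . '<tr><td>Widget</td><td>$4.50</td></tr>'
      . '<tr><td>Gadget</td><td>$12.00</td></tr>'
      . '</table>';

// Pattern matching alone: we find the prices, but not what they belong to.
preg_match_all('/\$\d+\.\d{2}/', $page, $matches);
$prices_only = $matches[0];

// Context matching: split the table into rows, then each row into cells,
// and pair the first cell (item name) with the second cell (price).
$items = [];
foreach (explode('</tr>', $page) as $row) {
    $cells = explode('</td>', $row);
    if (count($cells) < 3) {
        continue;   // Not a complete two-cell row (e.g., the trailing </table>)
    }
    $name  = trim(strip_tags($cells[0]));
    $price = trim(strip_tags($cells[1]));
    $items[$name] = $price;
}
```

The first result is a bare list of prices. The second keeps each price attached to the item it describes, which is usually what a webbot actually needs.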
It can be difficult for a beginner to focus on a single parsing solution if there are too many options. This is one of the reasons I developed LIB_parse (described in the previous chapter). This library, when combined with a few built-in PHP functions like stristr(), trim(), and strip_tags(), contains the handful of techniques required to parse the majority of your webscraping projects. Even if your parse takes additional steps, you will vastly simplify parsing tasks if your first approach considers the LIB_parse functions:
return_between()
split_string()
parse_array()
get_attribute()
remove()
In contrast, regular expressions and pattern matching provide a nearly infinite number of ways to solve a parsing problem. Without some constraints, it can be difficult for a beginner to know where to start. If you limit yourself to considering only those few parsing techniques that work in nearly every case, you’ll save a lot of time. While this may sound counterintuitive, remember again that parsing web pages is fairly task specific and you don’t need every parsing technique under the sun to complete your task.
Regular expressions are extraordinarily powerful, and you can do an amazing amount of complex parsing with a single line of code. One-liners are impressive, but they also complicate debugging and testing. It may be advantageous instead to use a series of functions in LIB_parse to perform the same task, so that each parsing step may be commented and tested separately.
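As an illustration of parsing in separately testable steps, the sketch below pulls a page title out of a document using only the PHP built-in functions stristr(), strip_tags(), and trim(). The function name extract_title() and the sample page are hypothetical and are not part of LIB_parse.

```php
<?php
// Hypothetical helper for this sketch; each step can be inspected on its own.
function extract_title(string $page): string
{
    // Step 1: discard everything before the opening title tag (case-insensitive).
    $from_title = stristr($page, '<title>');
    if ($from_title === false) {
        return '';   // No title tag found
    }

    // Step 2: keep only the text that precedes the closing title tag.
    $title_block = stristr($from_title, '</title>', true);
    if ($title_block === false) {
        return '';   // Opening tag without a closing tag
    }

    // Step 3: remove any remaining markup and surrounding whitespace.
    return trim(strip_tags($title_block));
}

// Invented sample page; a real webbot would download this content.
$page  = '<html><head><TITLE>  Spring <b>Sale</b> </TITLE></head><body></body></html>';
$title = extract_title($page);
```

If the result looks wrong, you can print the intermediate value after any step to see exactly where the parse went astray, which is much harder to do inside a single regular expression.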
If you need to use a regular expression function, I advocate creating a wrapper function, which repackages difficult-to-read code inside of easier-to-read routines. A good example of this is the parse_array() function that appears in the LIB_parse library (described in Chapter 4). For example, you can use parse_array() to extract all image tags into an array with the line of code shown in Example 5-19.
Example 5-19. Parsing all image tags with the parse_array() function in LIB_parse.php
$image_tag_array = parse_array($downloaded_web_page, "<img", ">");
You can accomplish the same thing with preg_match_all(), as in Example 5-20. But I argue that the code is not as easy to debug, maintain, or read.
Example 5-20. Parsing all image tags with the PHP built-in function preg_match_all()
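// Modifiers: s lets . match newlines, i ignores case, U makes .* ungreedy
preg_match_all("/<img(.*)>/siU", $downloaded_web_page, $matching_data);
$image_tag_array = $matching_data[0];   // Element 0 holds the complete matches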
It does not take a seasoned programmer to recognize that Example 5-19 is much easier to read and understand than Example 5-20. That’s why LIB_parse uses the function parse_array() as a wrapper around the PHP built-in function preg_match_all() to make the code more readable. As the complexity of software increases, so do the odds of errors and the cost of debugging and validation. Using debugged libraries with simplified interfaces will decrease development time and make your scripts easier to maintain.
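To show what such a wrapper might look like, here is a minimal sketch. It is not the actual LIB_parse source: the function name extract_tags() is hypothetical, and the use of preg_quote() to escape the delimiters is my own assumption about a reasonable way to build the pattern.

```php
<?php
// Hypothetical wrapper for this sketch; NOT the real parse_array() implementation.
function extract_tags(string $page, string $open, string $close): array
{
    // preg_quote() escapes any characters in the delimiters that the
    // regular expression engine would otherwise treat as syntax.
    $pattern = '/' . preg_quote($open, '/') . '.*' . preg_quote($close, '/') . '/siU';

    // s: dot matches newlines, i: case-insensitive, U: ungreedy matching
    if (preg_match_all($pattern, $page, $matching_data) === false) {
        return [];   // The pattern failed to execute
    }
    return $matching_data[0];   // Element 0 holds the complete matches
}

// Invented sample content; a real webbot would download this page.
$page       = '<p><IMG src="a.png" alt="A"><br><img src="b.png"></p>';
$image_tags = extract_tags($page, '<img', '>');
```

The calling code never sees the pattern or its modifiers, so the regular expression only has to be debugged once, inside the wrapper.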
One hot-button topic is whether regular expressions run as quickly in PHP as the comparable PHP built-in functions. I’ve done some benchmarking, and while sparing you the details, I’ve found the PHP built-in functions are only marginally more efficient than their regular expression counterparts. This was a bigger concern with older versions of PHP but is no longer a consideration with modern versions of the language. In reality, this is a moot point anyway, because if you are truly concerned about speed and efficiency, you’d probably be better off developing in C than in PHP.
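If you want to check this on your own system, a rough timing comparison might look like the following sketch. This is not the benchmark I ran; the sample page, the iteration count, and the variable names are assumptions chosen for illustration, and hrtime() requires PHP 7.3 or later.

```php
<?php
// Invented test input: a block of filler markup followed by a title tag.
$page       = str_repeat('<p>filler</p>', 1000) . '<title>Benchmark</title>';
$iterations = 1000;

// Approach 1: PHP built-in string functions.
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $begin   = stripos($page, '<title>') + strlen('<title>');
    $end     = stripos($page, '</title>', $begin);
    $builtin = substr($page, $begin, $end - $begin);
}
$builtin_ns = hrtime(true) - $start;

// Approach 2: the equivalent regular expression.
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    preg_match('/<title>(.*)<\/title>/siU', $page, $match);
    $regex = $match[1];
}
$regex_ns = hrtime(true) - $start;

// Report elapsed time in milliseconds for each approach.
printf("built-in: %.3f ms, regex: %.3f ms\n", $builtin_ns / 1e6, $regex_ns / 1e6);
```

The absolute numbers depend on the PHP version, the hardware, and the input, so treat any single run as a rough indication rather than a verdict.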