Chapter 5. Advanced Parsing with Regular Expressions

Now that you’ve mastered the parsing techniques of the previous chapter, it’s time to look at advanced parsing with regular expressions, also known as regex. Regular expressions are an extraordinarily powerful and flexible tool. At first glance, they may seem like the only tool you’ll ever need to parse web pages. On further examination, though, you’ll discover that regular expressions shine in some situations and are either overkill or simply not appropriate in others.

Regular expressions are not the easiest thing to learn, because a fair amount of parallel information is required to get even the simplest examples working. You first need to understand the concept and have some idea of how patterns are used, and you also need to know how your programming language implements regular expressions. Because of this, we’re going to start with a short discussion of how PHP implements regular expressions. Then we’ll explore the true power of patterns, followed by a practical example of regular expressions in action. Finally, the chapter concludes with an honest discussion of the strengths and weaknesses of regular expressions within the context of webbot development.

Pattern Matching, the Key to Regular Expressions

All regular expression functions are based on patterns: abstractions and groupings that symbolically define the text you want to identify and manipulate within a larger body of text. Patterns are so central to the process that the term regular expressions technically refers only to the patterns themselves, although it is commonly used for anything dealing with the subject. All of the functions that operate on regular expressions use the same pattern-matching rules, so you may use the same patterns to parse, split, or make string substitutions.
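
To make the idea of one pattern serving several purposes concrete, here is a minimal sketch using PHP’s PCRE functions preg_match_all(), preg_split(), and preg_replace(). The sample text, the pattern, and the variable names are assumptions invented for illustration; they do not come from a real web page.

```php
<?php
// Hypothetical sample text and pattern, chosen only for illustration.
$text    = "apples 12, pears 7, plums 30.";
$pattern = '/\d+/';   // one or more consecutive digits

// Parse: collect every piece of text that matches the pattern.
preg_match_all($pattern, $text, $matches);
$numbers = $matches[0];

// Split: break the text apart wherever the pattern matches.
$pieces = preg_split($pattern, $text);

// Substitute: replace every match with a placeholder character.
$masked = preg_replace($pattern, '#', $text);

print_r($numbers);   // the three numbers, as strings
print_r($pieces);    // the text fragments between the numbers
echo $masked, PHP_EOL;
```

The same $pattern string is handed to all three functions unchanged. Only the function decides whether the matches are extracted, used as delimiters, or replaced.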