As demonstrated, a wide variety of parsing tasks can be performed with the standardized parsing routines in LIB_parse, along with a few of PHP’s built-in functions. Here are a few more suggestions that may help you in your parsing projects.
You’ll get plenty of parsing experience as you explore the projects in this book. The projects also introduce a few advanced parsing techniques. In Chapter 8, we’ll cover advanced methods for parsing data in tables. In Chapter 11, you’ll learn about the insertion parse, which makes it easier to parse and debug difficult-to-parse web pages.
While the scripts in LIB_parse attempt to handle most situations, there is no guarantee that you will be able to parse poorly coded or nonsensical web pages. Even the use of Tidy will not always provide proper results. For example, code like <img src="width='523'" alt > may drive your parsing routines crazy. If you’re having trouble debugging a parsing routine, check to see if the page has errors. If you don’t check for errors, you may waste many hours trying to parse unparseable web pages.
When you’re writing a script that depends on several levels of parsing, avoid the temptation to write your parsing script in one pass. Since succeeding sections of your code will depend on earlier parses, write and debug your scripts one parse at a time.
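The following is a minimal sketch of parsing in stages, with each stage verified before the next one is written. It uses only PHP built-ins; the helper text_between() and the sample page are hypothetical stand-ins for illustration and are not part of LIB_parse.

```php
<?php
// Hypothetical helper: returns the text between the first $start and the
// next $stop, or an empty string if either delimiter is missing.
function text_between(string $source, string $start, string $stop): string
{
    $begin = strpos($source, $start);
    if ($begin === false) {
        return '';
    }
    $begin += strlen($start);
    $end = strpos($source, $stop, $begin);
    if ($end === false) {
        return '';
    }
    return substr($source, $begin, $end - $begin);
}

// Made-up sample page for illustration.
$page = '<html><body><table><tr><td>Widget</td></tr>'
      . '<tr><td>Gadget</td></tr></table></body></html>';

// Parse 1: isolate the table. Inspect this result before moving on.
$table = text_between($page, '<table>', '</table>');

// Parse 2: split the table into rows. This depends on parse 1 being correct.
$rows = array_values(array_filter(explode('</tr>', $table)));

// Parse 3: reduce each row to its text. This depends on parse 2.
$names = array_map('strip_tags', $rows);

print_r($names);
```

If parse 1 returned an empty string, every later stage would silently produce nothing, which is why each stage is worth checking on its own before the next is added.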
If you’re viewing the results of your parse in a browser, remember that the browser will attempt to render your output as a web page. If the results of your parse contain tags, display your parses within <xmp> and </xmp> tags. These tags will tell the browser not to render the results of your parse as HTML. Failure to analyze the unformatted results of your parse may cause you to miss things that are inside tags.[16]
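As a minimal sketch of the same idea, the snippet below shows a parse result literally instead of letting the browser render it. Because <xmp> is marked obsolete in the current HTML standard, this sketch swaps in PHP’s built-in htmlspecialchars() inside <pre> tags. That substitution is an assumption made here, not the technique described above, and show_raw_parse() and the sample data are hypothetical.

```php
<?php
// Hypothetical helper: returns markup that displays a parse result literally.
function show_raw_parse(string $parse_result): string
{
    // htmlspecialchars() converts <, >, &, and quotes to entities, so the
    // browser shows the tags as text instead of rendering them.
    return '<pre>' . htmlspecialchars($parse_result, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Made-up parse result that contains a tag.
$parse_result = '<a href="http://www.example.com">Example</a>';

echo show_raw_parse($parse_result);
```

Viewed in a browser, the anchor tag appears as text, including its href attribute, which would be hidden if the link were rendered.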
Regular expressions are a parsing language, and most modern programming languages support aspects of regular expressions. In the right hands, regular expressions are extraordinarily powerful tools for parsing and substituting text. However, they are also famous for steep learning curves and cryptic syntax. Additionally, regular expressions are excellent for extracting patterns of characters but are less effective at providing context for those patterns. For example, regular expressions are useful for extracting prices from a website, but they are typically less capable of associating those prices with products.
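To make the price example concrete, here is a small sketch using PHP’s built-in preg_match_all(). The page fragment and the price pattern are made up for illustration, and the pattern assumes prices are written as a dollar sign, digits, and two decimal places.

```php
<?php
// Made-up page fragment for illustration.
$page = '<tr><td>Widget</td><td>$19.99</td></tr>'
      . '<tr><td>Gadget</td><td>$5.00</td></tr>';

// Match a dollar sign, one or more digits, a period, and two digits.
preg_match_all('/\$\d+\.\d{2}/', $page, $matches);

// $matches[0] holds every full-pattern match, in page order.
$prices = $matches[0];

print_r($prices);
```

The result contains the two prices but nothing that ties either one to Widget or Gadget. Recovering that association requires knowledge of the surrounding row structure, which the pattern alone does not capture.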
If there is anything controversial in this book, it may be my opinion on the benefits of regular expressions to webbot developers. While I’d never suggest that regular expressions don’t have a place, I do feel that for the purposes of webbot development, they should be used sparingly. I avoid regular expressions whenever possible, limiting their use to instances where there are few alternatives. In those cases, I use wrapper functions to take advantage of the functionality of regular expressions while shielding the developer from their complexities.
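One way such a wrapper might look is sketched below. The function collapse_whitespace() is a hypothetical example written for this illustration and is not one of the LIB_parse routines; it hides a single preg_replace() call behind a descriptive name, so that calling code never sees the pattern.

```php
<?php
// Hypothetical wrapper: replaces every run of whitespace (spaces, tabs,
// newlines) with a single space and trims both ends of the string.
function collapse_whitespace(string $text): string
{
    $result = preg_replace('/\s+/', ' ', $text);

    // preg_replace() returns null on failure; fall back to the input.
    return trim($result ?? $text);
}

// Made-up parsed text with irregular spacing.
$raw = "  Price:\n\t \$19.99   each  ";

echo collapse_whitespace($raw);
```

The calling script only needs to know what the function does, and the regular expression can be debugged or replaced in one place.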
My reasons for suggesting a limited use of regular expressions are more fully explained in the next chapter.