Chapter 10. Link-Verification Webbots

This webbot project solves a problem shared by all web developers—detecting broken links on web pages. Verifying links on a web page isn’t difficult, and the associated script is short. Figure 10-1 shows the simplicity of this webbot.

For clarity, I’ll break down the creation of the link-verification webbot into manageable sections, which I’ll explain along the way. The code and libraries used in this chapter are available for download at this book’s website.

You can easily parse all the links and place them into an array with the script in Example 10-2.

The code in Example 10-2 uses parse_array() to put everything between every occurrence of <a and > into an array.[33] The function parse_array() is not case sensitive, so it doesn’t matter if the target web page uses <a>, <A>, or a mixture of the two to define links.
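If you don’t have the book’s LIB_parse library at hand, the same effect can be sketched with PHP’s built-in regular expression support. The function name parse_anchor_tags() below is my own; it is a simplified stand-in for parse_array($page, "<a", ">"), not the book’s implementation.

```php
<?php
// Simplified, case-insensitive stand-in for parse_array($page, "<a", ">"):
// collect every substring that starts with "<a" and ends at the next ">".
// Requiring whitespace after "a" (or an immediate ">") avoids matching
// unrelated tags like <abbr>.
function parse_anchor_tags($html)
{
    preg_match_all('/<a(\s[^>]*)?>/i', $html, $matches);
    return $matches[0];   // element 0 holds the full matched tags
}

$html = '<p><a href="page1.html">One</a> and <A HREF="page2.html">Two</A></p>';
print_r(parse_anchor_tags($html));
```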

Since the contents of the $link_array elements are actually complete anchor tags, we need to parse the value of the href attribute out of the tags before we can download and test the pages they reference.

The value of the href attribute is extracted from the anchor tag with the function get_attribute(), as shown in Example 10-4.
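For readers without the LIB_parse library, the idea behind get_attribute() can be sketched as follows. The function get_attribute_value() is a hypothetical substitute, not the book’s routine, and it handles only the common quoting styles.

```php
<?php
// Hypothetical stand-in for the book's get_attribute(): extract the value of
// a named attribute from a tag, tolerating double quotes, single quotes, or
// no quotes at all. (A production parser would also guard against the
// attribute name appearing inside another attribute's value.)
function get_attribute_value($tag, $attribute)
{
    $pattern = '/' . preg_quote($attribute, '/') .
               '\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i';
    if (!preg_match($pattern, $tag, $m)) {
        return '';
    }
    return trim($m[1], "\"'");   // strip surrounding quotes, if any
}

echo get_attribute_value('<a href="../index.html" class="nav">', 'href');
```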

Once you have the href address, you need to combine the previously defined $page_base with the relative address to create a fully resolved URL, which your webbot can use to download pages. A fully resolved URL is any URL that describes not only the file to download but also the server and directory where that file is located and the protocol required to access it. Table 10-2 shows the fully resolved addresses for the links in Table 10-1, assuming the links are on a page at http://www.WebbotsSpidersScreenScrapers.com.

Fully resolved URLs are made with the resolve_address() function (see Example 10-5), which is in the LIB_resolve_addresses library. This library is a set of routines that converts all possible methods of referencing web pages in HTML into fully resolved URLs.
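To give a feel for what such a routine does, here is a deliberately simplified sketch. It is not the book’s resolve_address() — the real LIB_resolve_addresses handles many more cases (../ paths, page bases that end in a filename, and so on) — but it covers the three situations in Table 10-1.

```php
<?php
// Simplified sketch of address resolution (NOT the book's resolve_address()):
// combine a possibly relative link with the page base to form a fully
// resolved URL.
function resolve_address_sketch($link, $page_base)
{
    // Case 1: the link is already fully resolved.
    if (preg_match('/^https?:\/\//i', $link)) {
        return $link;
    }
    $parts = parse_url($page_base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    // Case 2: an absolute path on the same server.
    if (substr($link, 0, 1) === '/') {
        return $root . $link;
    }
    // Case 3: a path relative to the page base.
    return rtrim($page_base, '/') . '/' . $link;
}

echo resolve_address_sketch('contact.html', 'http://www.WebbotsSpidersScreenScrapers.com');
```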

Once the linked page is downloaded, the webbot relies on the STATUS element of the downloaded array to analyze the HTTP code, which is provided by PHP/CURL. (For your future projects, keep in mind that PHP/CURL also provides total download time and other diagnostics that we’re not using in this project.)

HTTP status codes are standardized, three-digit numbers that indicate the status of a page fetch.[34] This webbot uses these codes to determine if a link is broken or valid. These codes are divided into ranges that define the type of errors or status, as shown in Table 10-3.
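The range-based logic can be sketched with plain PHP/CURL, outside the book’s LIB_http wrapper. The function names below are my own; the check_link() helper requires a network connection, while classify_http_code() is a pure function.

```php
<?php
// Judge a link by the range its HTTP status code falls in (ranges per
// Table 10-3 and RFC 2616).
function classify_http_code($code)
{
    if ($code >= 200 && $code < 300) return 'good';          // success
    if ($code >= 300 && $code < 400) return 'redirect';      // redirection
    if ($code >= 400 && $code < 500) return 'broken';        // client error, e.g., 404
    if ($code >= 500)                return 'server error';  // server error
    return 'no response';                                    // code 0: connection failed
}

// Fetch a page with plain PHP/CURL and classify it (hypothetical helper,
// not the book's LIB_http interface).
function check_link($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // capture the page instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // don't hang on dead servers
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array('code' => $code, 'verdict' => classify_http_code($code));
}
```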

The $status_code_array was created when the LIB_http_codes library was imported. When you use the HTTP code as an index into $status_code_array, it returns a human-readable status message, as shown in Example 10-7. (PHP script is in bold.)
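The lookup itself is just array indexing. The miniature array below is hypothetical — the real LIB_http_codes defines far more entries, and its internal structure may differ — but it shows the idea.

```php
<?php
// Hypothetical miniature of LIB_http_codes' $status_code_array: the HTTP
// code is the index; the value is a human-readable status message.
$status_code_array = array(
    200 => 'OK',
    301 => 'MOVED PERMANENTLY',
    404 => 'NOT FOUND',
    500 => 'INTERNAL SERVER ERROR',
);

$http_code = 404;
echo "Link returned " . $http_code . " " . $status_code_array[$http_code] . "\n";
```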

As an added feature, the webbot displays the time (in seconds) required to download each page referenced by the links on the target web page. PHP/CURL measures and records this download time automatically; it is available in the array element $downloaded_link['STATUS']['total_time'].
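If you are using plain PHP/CURL rather than the book’s LIB_http wrapper, the same figure is reported by curl_getinfo() after the fetch. The URL below is a placeholder; substitute a link parsed from the target page.

```php
<?php
// With plain PHP/CURL, the download time that LIB_http exposes as
// ['STATUS']['total_time'] is available via CURLINFO_TOTAL_TIME.
$ch = curl_init('http://www.example.com/');      // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // capture the page instead of printing it
curl_exec($ch);
$seconds = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
curl_close($ch);
printf("Downloaded in %01.2f seconds\n", $seconds);
```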



[33] Parsing functions are explained in Chapter 4 and Chapter 5.

[34] The official reference for HTTP codes is available on the World Wide Web Consortium’s website (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).