Using LIB_parse

One thing you may notice about LIB_parse is its lack of regular expressions, even though they are a mainstay of text parsing. Regular expressions can be difficult to read and understand, especially for beginners, and the built-in PHP string-manipulation functions are easier to follow. That doesn’t mean we won’t discuss regular expressions: Chapter 5 covers them and their utility in webbot development.
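To make the readability trade-off concrete, here is a minimal sketch that extracts the same page title both ways. The sample string and the variable names are assumptions made for illustration, and neither approach is taken from the LIB_parse source.

```php
<?php
// Hypothetical sample input; any HTML string would do.
$html = "<html><head><title>No Starch Press</title></head></html>";

// Approach 1: a regular expression. Compact, but the pattern takes practice to read.
preg_match('/<title>(.*?)<\/title>/is', $html, $matches);
$title_regex = $matches[1];

// Approach 2: built-in string functions. Longer, but every step is explicit.
$open  = "<title>";
$close = "</title>";
$start = stripos($html, $open) + strlen($open);   // first character after <title>
$end   = stripos($html, $close, $start);          // position of </title>
$title_string = substr($html, $start, $end - $start);

echo $title_regex . "\n";    // No Starch Press
echo $title_string . "\n";   // No Starch Press
```

Both approaches return the same text here. The second is wordier, but each line can be read and debugged on its own.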

What follows is a description of the functions in LIB_parse and the parsing problems they solve. These functions are also described completely within the comments of LIB_parse.

Splitting a String at a Delimiter: split_string()

The simplest parsing function returns a string that contains everything before or after a delimiter term. This simple function can also be used to return the text between two terms. The function provided for that task is split_string(), shown in Example 4-1.

Example 4-1. Using split_string()

/*
string split_string (string unparsed, string delimiter, BEFORE/AFTER, INCL/EXCL)
Where
    unparsed is the string to parse
    delimiter defines boundary between substring you want and substring you don't want
    BEFORE indicates that you want what is before the delimiter
    AFTER indicates that you want what is after the delimiter
    INCL indicates that you want to include the delimiter in the parsed text
    EXCL indicates that you don't want to include the delimiter in the parsed text
*/

Simply pass split_string() the string you want to split, the delimiter where you want the split to occur, whether you want the portion of the string that is before or after the delimiter, and whether or not you want the delimiter to be included in the returned string. Example 4-2 shows examples of split_string() in action.

Example 4-2. Examples of split_string() usage

include("LIB_parse.php");
$string = "The quick brown fox";

# Parse what's before the delimiter, including the delimiter
$parsed_text = split_string($string, "quick", BEFORE, INCL);
// $parsed_text = "The quick"

# Parse what's after the delimiter, but don't include the delimiter
$parsed_text = split_string($string, "quick", AFTER, EXCL);
// $parsed_text = " brown fox" (the space that followed the delimiter is kept)
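The claim that a before/after split can also return the text between two terms may be easier to see in code. The sketch below uses a hypothetical stand-in, sketch_split_string(), written for this illustration with boolean flags instead of the BEFORE/AFTER and INCL/EXCL constants. It is not the LIB_parse source, and the case-insensitive search and the empty-string result for a missing delimiter are assumptions.

```php
<?php
// Hypothetical stand-in for split_string(); NOT the LIB_parse source.
// $want_before: true returns the text before the delimiter, false the text after.
// $include:     true keeps the delimiter in the result, false drops it.
function sketch_split_string($string, $delimiter, $want_before, $include)
{
    $pos = stripos($string, $delimiter);      // case-insensitive search (an assumption)
    if ($pos === false) {
        return "";                            // assumed behavior when the delimiter is absent
    }
    $len = strlen($delimiter);
    if ($want_before) {
        return substr($string, 0, $include ? $pos + $len : $pos);
    }
    return substr($string, $include ? $pos : $pos + $len);
}

$string = "The quick brown fox";

// Text between two terms: take what follows "The", then what precedes "fox".
$after_the = sketch_split_string($string, "The", false, false);    // " quick brown fox"
$between   = sketch_split_string($after_the, "fox", true, false);  // " quick brown "
echo trim($between) . "\n";                                        // quick brown
```

Chaining two splits works, but it takes two calls and an intermediate variable, which is one reason a dedicated between-delimiters function is convenient.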

Parsing Text Between Delimiters: return_between()

Sometimes it is useful to parse text between two delimiters. For example, to parse a web page’s title, you’d want to parse the text between the <title> and </title> tags. Your webbots can use the return_between() function in LIB_parse to do this.

The return_between() function uses a start delimiter and an end delimiter to define a particular part of a string, as shown in Example 4-3.

Example 4-3. Using return_between()

/*
string return_between (string unparsed, string start, string end, INCL/EXCL)
Where
    unparsed is the string to parse
    start identifies the starting delimiter
    end identifies the ending delimiter
    INCL indicates that you want to include the delimiters in the parsed text
    EXCL indicates that you don't want to include delimiters in the parsed text
*/

The script in Example 4-4 uses return_between() to parse the HTML title of a web page.

Example 4-4. Using return_between() to find the title of a web page

# Include libraries
include("LIB_parse.php");
include("LIB_http.php");

# Download a web page
$web_page = http_get($target="http://www.nostarch.com", $referer="");

# Parse the title of the web page, inclusive of the title tags
$title_incl = return_between($web_page['FILE'], "<title>", "</title>", INCL);

# Parse the title of the web page, exclusive of the title tags
$title_excl = return_between($web_page['FILE'], "<title>", "</title>", EXCL);

# Display the parsed text
echo "title_incl = ".$title_incl;
echo "\n";
echo "title_excl = ".$title_excl;

When Example 4-4 is run in a shell, the results should look like Example 4-5.

Example 4-5. Examples of using return_between(), with and without returned delimiters

title_incl = <title>No Starch Press</title>
title_excl = No Starch Press
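For readers curious how a between-delimiters parse might work with only built-in string functions, here is a minimal sketch. The function sketch_return_between() is a hypothetical stand-in, not the LIB_parse source. It takes a boolean in place of INCL/EXCL, and it assumes a case-insensitive match and an empty string when either delimiter is missing. The sample page is hard-coded so the sketch runs without a network connection.

```php
<?php
// Hypothetical stand-in for return_between(); NOT the LIB_parse source.
function sketch_return_between($string, $start, $end, $include)
{
    $start_pos = stripos($string, $start);
    if ($start_pos === false) {
        return "";                                   // assumed behavior: no start delimiter
    }
    $content_pos = $start_pos + strlen($start);
    $end_pos = stripos($string, $end, $content_pos); // search only after the start delimiter
    if ($end_pos === false) {
        return "";                                   // assumed behavior: no end delimiter
    }
    if ($include) {
        return substr($string, $start_pos, $end_pos + strlen($end) - $start_pos);
    }
    return substr($string, $content_pos, $end_pos - $content_pos);
}

// Hypothetical sample page.
$page = "<html><head><title>No Starch Press</title></head><body></body></html>";

$title_incl = sketch_return_between($page, "<title>", "</title>", true);
$title_excl = sketch_return_between($page, "<title>", "</title>", false);

echo "title_incl = " . $title_incl . "\n";   // title_incl = <title>No Starch Press</title>
echo "title_excl = " . $title_excl . "\n";   // title_excl = No Starch Press
```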

Parsing a Data Set into an Array: parse_array()

Sometimes the things your webbot needs to parse, like links, appear more than once in a web page. In these cases, a single parsed result isn’t as useful as an array of results. Such a parsed array could contain all the links, meta tags, or references to images in a web page. The parse_array() function does essentially the same thing as return_between(), but instead of a single result it returns an array containing every occurrence of data found between two delimiting strings. This makes it easy, for example, to extract all the links or images from a web page.

The parse_array() function, shown in Example 4-6, is most useful when your webbots need to parse the content of reoccurring tags. For example, returning an array of everything between every occurrence of <img and > returns information about all the images in a web page. Alternatively, returning an array of everything between <script and </script> parses all inline JavaScript. Notice that in each of these cases, the opening tag is not completely defined. This is because <img and <script are enough to identify the tag, regardless of any additional attributes (which we don’t need to define in the parse) that may be present in the downloaded page.

Example 4-6. Using parse_array()

/*
array parse_array (string unparsed, string beg, string end)
Where
    unparsed is the string to parse
    beg is a reoccurring beginning delimiter
    end is a reoccurring ending delimiter
    array contains every occurrence of what's found between beginning and end.
*/

This simple parse is also useful for parsing tables, meta tags, formatted text, video, or any other parts of web pages defined between reoccurring HTML tags.

The script in Example 4-7 uses the parse_array() function to parse and display all the meta tags on the FBI website. Meta tags are primarily used to define a web page’s content to a search engine.

This code could be incorporated with the project in Chapter 11 to determine how adjustments in your meta tags affect your ranking in search engines. To parse all the meta tags, the function must be told to return all instances that occur between <meta and >. Again, notice that the script only uses enough of each delimiter to uniquely identify where a meta tag starts and ends. Remember that the start and end delimiters you define must fit every data set you want to parse.

Example 4-7. Using parse_array() to parse all the meta tags from http://www.fbi.gov

include("LIB_parse.php");    # Include parse library
include("LIB_http.php");     # Include PHP/CURL library
$web_page = http_get($target="http://www.fbi.gov", $referer="");
$meta_tag_array = parse_array($web_page['FILE'], "<meta", ">");

for($xx=0; $xx<count($meta_tag_array); $xx++)
    echo $meta_tag_array[$xx]."\n";

When the script in Example 4-7 runs, the result should look like Example 4-8.

Example 4-8. Using parse_array() to parse the meta tags from the FBI website

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="generator" content="Plone - http://plone.org" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="modificationDate" content="2010/10/07" />
<meta name="creationDate" content="2007/04/02" />
<meta name="publicationDate" content="2010/02/01" />
<meta name="expirationDate" />
<meta name="portalType" content="Document" />
<meta content="" name="location" />
<meta content="text/html" name="contentType" />
<meta property="og:title" content="Homepage" />
<meta property="og:type" content="website" />
<meta property="og:url" content="http://www.fbi.gov/main-page" />
<meta property="og:image" content="http://www.fbi.gov/fbi_seal_mini.png" />
<meta property="og:site_name" content="FBI" />
<meta property="og:description" content="" />
<meta http-equiv="imagetoolbar" content="no" />
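The repeated parse can be pictured as a loop that finds a beginning delimiter, finds the next ending delimiter, saves what lies between (delimiters included, as in Example 4-8), and resumes after the match. The following is a minimal sketch under those assumptions. The function sketch_parse_array() and the sample markup are hypothetical, and this is not the LIB_parse source.

```php
<?php
// Hypothetical stand-in for parse_array(); NOT the LIB_parse source.
// Returns every substring that starts with $beg and ends with the next $end,
// delimiters included.
function sketch_parse_array($string, $beg, $end)
{
    $results = array();
    $offset = 0;
    while (($start = stripos($string, $beg, $offset)) !== false) {
        $stop = stripos($string, $end, $start + strlen($beg));
        if ($stop === false) {
            break;                               // unterminated tag: stop scanning
        }
        $stop += strlen($end);                   // include the ending delimiter
        $results[] = substr($string, $start, $stop - $start);
        $offset = $stop;                         // resume after this match
    }
    return $results;
}

// Hypothetical sample markup.
$page = '<head><meta name="generator" content="Plone" />'
      . '<META property="og:type" content="website" /></head>';

$meta_tag_array = sketch_parse_array($page, "<meta", ">");
foreach ($meta_tag_array as $tag) {
    echo $tag . "\n";
}
```

Because the sketch matches delimiters without regard to case, the uppercase <META tag is returned along with the lowercase one.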

Parsing Attribute Values: get_attribute()

Once your webbot has parsed tags from a web page, it is often important to parse attribute values from those tags. For example, if you’re writing a spider that harvests links from web pages, you will need to parse all the link tags, but you will also need to parse the specific href attribute of the link tag. For these reasons, LIB_parse includes the get_attribute() function.

The get_attribute() function provides an interface that allows webbot developers to parse specific attribute values from HTML tags. Its usage is shown in Example 4-9.

Example 4-9. Using get_attribute()

/*
string get_attribute (string tag, string attribute)
Where
    tag is the HTML tag that contains the attribute you want to parse
    attribute is the name of the specific attribute in the HTML tag
*/

This parse is particularly useful when you need to get a specific attribute from a previously parsed array of tags. Example 4-10, for instance, shows how to parse all the images from http://www.schrenk.com, using get_attribute() to get the src attribute from an array of <img> tags.

Example 4-10. Parsing the src attributes from image tags

include("LIB_parse.php");    # Include parse library
include("LIB_http.php");     # Include PHP/CURL library

// Download the web page
$web_page = http_get($target="http://www.schrenk.com", $referer="");

// Parse the image tags
$image_tag_array = parse_array($web_page['FILE'], "<img", ">");

// Echo the image source attribute from each image tag
for($xx=0; $xx<count($image_tag_array); $xx++)
    {
    $name = get_attribute($image_tag_array[$xx], $attribute="src");
    echo $name ."\n";
    }

Example 4-11 shows the output of Example 4-10.

Example 4-11. Results of running Example 4-10, showing parsed image names

f_img/spacer.gif
f_img/spacer.gif
f_img/php_arch.jpg
f_img/schrenk_defcon_15.jpg
f_img/italian_bot.gif
f_img/spacer.gif
f_img/webbots_spiders_and_screen_scrapers.jpg
f_img/strat.gif
f_img/webbots.jpg
f_img/contact.jpg
f_img/journalist.jpg
f_img/brx2008.png
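To illustrate what pulling one attribute out of a tag involves, here is a minimal sketch. Attribute syntax varies (single or double quotes, optional whitespace around the equal sign), so this sketch departs from the string-function approach and uses a short regular expression. The function sketch_get_attribute() and the sample tags are hypothetical, it handles only quoted values, and it is not the LIB_parse source.

```php
<?php
// Hypothetical stand-in for get_attribute(); NOT the LIB_parse source.
// Handles only quoted values (single or double quotes); returns "" if not found.
function sketch_get_attribute($tag, $attribute)
{
    // \s requires whitespace before the name, so "src" does not match "data-src".
    $pattern = '/\s' . preg_quote($attribute, '/') . '\s*=\s*(["\'])(.*?)\1/is';
    if (preg_match($pattern, $tag, $matches)) {
        return $matches[2];
    }
    return "";
}

// Hypothetical image tags, as a tag-parsing step might return them.
$image_tag_array = array(
    '<img src="f_img/spacer.gif" width="1">',
    "<IMG data-src='ignored.gif' SRC = 'f_img/strat.gif'>",
);

foreach ($image_tag_array as $tag) {
    echo sketch_get_attribute($tag, "src") . "\n";
}
// Prints:
// f_img/spacer.gif
// f_img/strat.gif
```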

Removing Unwanted Text: remove()

Up to this point, parsing meant extracting desired text from a larger string. Sometimes, however, parsing means manipulating text. For example, since webbots usually lack JavaScript interpreters, it’s often best to delete JavaScript from downloaded files. In other cases, your webbots may need to remove all images or email addresses from a web page. For these reasons, LIB_parse includes the remove() function. The remove() function is an easy-to-use interface for removing unwanted text from a web page. Its usage is shown in Example 4-12.

Example 4-12. Using remove()

/*
string remove (string web_page, string open_tag, string close_tag)
Where
    web_page is the contents of the web page you want to affect
    open_tag defines the beginning of the text that you want to remove
    close_tag defines the end of the text you want to remove
*/

By adjusting the input parameters, the remove() function can remove a variety of text from web pages, as shown in Example 4-13.

Example 4-13. Examples of remove() usage

$uncommented_page   = remove($web_page, "<!--", "-->");
$links_removed      = remove($web_page, "<a", "</a>");
$images_removed     = remove($web_page, "<img", ">");
$javascript_removed = remove($web_page, "<script", "</script>");
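Removal can be thought of as the inverse of the repeated parse: find each span that starts with the opening delimiter and ends with the closing delimiter, and cut it out of the string. The sketch below shows one way to do that with built-in string functions. The function sketch_remove() and the sample page are hypothetical, the case-insensitive matching is an assumption, and this is not the LIB_parse source.

```php
<?php
// Hypothetical stand-in for remove(); NOT the LIB_parse source.
function sketch_remove($string, $open_tag, $close_tag)
{
    $offset = 0;
    while (($start = stripos($string, $open_tag, $offset)) !== false) {
        $stop = stripos($string, $close_tag, $start + strlen($open_tag));
        if ($stop === false) {
            break;                              // no closing delimiter: leave the rest alone
        }
        $stop += strlen($close_tag);
        // Keep everything before the span and everything after it.
        $string = substr($string, 0, $start) . substr($string, $stop);
        $offset = $start;                       // continue from where the cut was made
    }
    return $string;
}

// Hypothetical sample page.
$web_page = '<p>Hello<!-- note --> <script type="text/javascript">alert(1);</script>world</p>';

$uncommented_page   = sketch_remove($web_page, "<!--", "-->");
$javascript_removed = sketch_remove($uncommented_page, "<script", "</script>");

echo $javascript_removed . "\n";   // <p>Hello world</p>
```

In this sketch, a very short opening delimiter matches more than intended. For example, "<a" also matches the start of an <abbr> tag, so it pays to choose delimiters that are as specific as the page allows.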