Adding Filtering to Your Aggregation Webbot

Your webbots can also modify or filter data received from RSS (or any other source). In this chapter’s news aggregator, you could filter out (i.e., not use) any stories that don’t contain specific keywords or key phrases. For example, if you only want news stories that contain the words webbots, web spiders, or spiders, you could create a filter array like the one shown in Example 12-7.

Example 12-7. Creating a filter array

$filter_array[]="webbots";
$filter_array[]="web spiders";
$filter_array[]="spiders";

We can use $filter_array to select articles for viewing by modifying the download_parse_rss() function used in Example 12-4. This modification is shown in Example 12-8.

Example 12-8. Adding filtering to the download_parse_rss() function

function download_parse_rss($target, $filter_array)
    {
    # Download the RSS page
    $news = http_get($target, "");

    # Parse title and copyright notice
    $rss_array['TITLE'] = return_between($news['FILE'],
                          "<title>", "</title>", EXCL);
    $rss_array['COPYRIGHT'] = return_between($news['FILE'],
                           "<copyright>", "</copyright>", EXCL);

    # Parse the items
    $item_array = parse_array($news['FILE'], "<item>", "</item>");
    $kept = 0;    # Index of the next story that passes the filter
    for($xx=0; $xx<count($item_array); $xx++)
        {
        # Filter stories for relevance
        for($keyword=0; $keyword<count($filter_array); $keyword++)
            {
            # stristr() returns FALSE when the keyword is not found
            if(stristr($item_array[$xx], $filter_array[$keyword]) !== false)
                {
                # Store matches at consecutive indexes so the result has no gaps
                $rss_array['ITITLE'][$kept] = return_between($item_array[$xx],
                          "<title>", "</title>", EXCL);
                $rss_array['ILINK'][$kept] = return_between($item_array[$xx],
                          "<link>", "</link>", EXCL);
                $rss_array['IDESCRIPTION'][$kept] = return_between($item_array[$xx],
                          "<description>", "</description>", EXCL);
                $rss_array['IPUBDATE'][$kept] = return_between($item_array[$xx],
                          "<pubDate>", "</pubDate>", EXCL);
                $kept++;
                break;    # One matching keyword is enough; skip the rest
                }
            }
        }
    return $rss_array;
    }
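If you want to try the filtering step on its own, without downloading a feed or loading the parsing library, the following minimal sketch may help. It assumes a hypothetical helper named matches_any_keyword() (not part of the book's libraries) and uses made-up sample items in place of the output of parse_array().

```php
<?php
# Hypothetical helper (an assumption, not part of the book's libraries):
# returns true when $text contains at least one keyword, ignoring case.
function matches_any_keyword($text, $filter_array)
    {
    foreach($filter_array as $keyword)
        {
        if(stristr($text, $keyword) !== false)
            return true;    # Stop at the first matching keyword
        }
    return false;
    }

# Made-up sample items standing in for the output of parse_array()
$item_array = array(
    "<item><title>Building Webbots in PHP</title></item>",
    "<item><title>Gardening tips for spring</title></item>",
    "<item><title>How web spiders index pages</title></item>"
    );

$filter_array = array("webbots", "web spiders", "spiders");

# Keep only the items that contain at least one keyword
$kept = array();
foreach($item_array as $item)
    {
    if(matches_any_keyword($item, $filter_array))
        $kept[] = $item;
    }

echo count($kept) . " of " . count($item_array) . " items kept\n";
```

With these sample items, the first and third stories pass the filter and the gardening story is dropped. Returning as soon as one keyword matches avoids testing the remaining keywords for a story that has already qualified.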

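Example 12-8 is identical to Example 12-4, with the following exceptions: the function accepts $filter_array as a second parameter, each news item is compared with every keyword in that array, and only the items that contain a keyword are parsed and placed in $rss_array.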

The end result of the script in Example 12-8 is an aggregator that lists only the stories that contain at least one of the keywords in $filter_array. Because each keyword is compared with the entire item, a keyword that appears only in a story's link or markup also counts as a match. As configured, the comparison of stories and keywords is not case sensitive. If case sensitivity is required, replace stristr() with strstr(). Remember, however, that the number of stories returned depends directly on the number of keywords and on how often those keywords appear in stories.
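The difference between the two comparison functions is easy to check in isolation. The short sketch below uses a made-up headline (an assumption for illustration only) and tests it against the lowercase keyword spiders with both functions.

```php
<?php
# Made-up sample text for illustration
$story = "Web Spiders Crawl the Internet";

# stristr() ignores case; strstr() requires an exact-case match.
# Both return FALSE when the keyword is not found.
$insensitive = (stristr($story, "spiders") !== false);
$sensitive   = (strstr($story, "spiders") !== false);

var_dump($insensitive);    # bool(true)
var_dump($sensitive);      # bool(false)
```

Comparing the result with !== false is the safer test, since both functions return the matched portion of the string on success rather than a Boolean.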