In addition to the parsing functions included in LIB_parse, described earlier, PHP contains a multitude of built-in parsing functions. The following is a brief sample of the most valuable built-in PHP parsing functions, along with examples of how they are used.
You can use the stristr() function to tell your webbot whether or not a string contains another string. The PHP community commonly uses the term haystack to refer to the entire unparsed text and the term needle to refer to the substring within the larger string. The function stristr() looks for the first occurrence of needle in haystack. If found, stristr() returns the substring of haystack that runs from that occurrence of needle to the end of the larger string; if not found, it returns false. In normal use, you’re rarely concerned about the actual returned text. Generally, the fact that something was returned simply tells you that needle exists in haystack.
The stristr() function is probably most handy if you want to detect whether a specific word is mentioned in a web page. For example, if you want to know whether a web page mentions dogs, you can execute the script shown in Example 4-14.
Example 4-14. Using stristr() to see if a string contains another string
if(stristr($web_page, "dogs"))
    echo "This is a web page that mentions dogs.";
else
    echo "This web page does not mention dogs.";
In Example 4-14, we’re not interested so much in what the stristr() function returns but whether it returns anything at all. If something is returned, we know that the web page contained the word dogs.
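One edge case is worth keeping in mind when testing the return value this way. The sketch below uses a made-up page string (an assumption for illustration) to show that a match consisting of the text "0" is treated as false by a bare truth test, while an explicit comparison against false still detects it.

```php
<?php
// Stand-in for a downloaded page; the text is made up for illustration.
$web_page = "Visitors so far: 0";

// stristr() returns the text from the match to the end of the haystack: "0".
$match = stristr($web_page, "0");

// PHP treats the string "0" as false, so a bare truth test misses the match.
$loose_test  = $match ? true : false;   // false, even though the needle exists
$strict_test = ($match !== false);      // true

echo $strict_test ? "Found it." : "Not found.";
```

For ordinary words such as dogs the two tests agree, so the simpler form in Example 4-14 is usually sufficient.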
The stristr() function is not case sensitive. If you need a case-sensitive version of stristr(), use strstr().
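The difference between the two functions can be seen side by side in a short sketch. The sample page text is an assumption made up for illustration.

```php
<?php
// Made-up sample text; a real webbot would use a downloaded page here.
$web_page = "<p>Dogs are loyal companions.</p>";

// Case-insensitive search: "dogs" matches "Dogs".
$insensitive = stristr($web_page, "dogs");   // "Dogs are loyal companions.</p>"

// Case-sensitive search: there is no lowercase "dogs", so nothing is found.
$sensitive = strstr($web_page, "dogs");      // false

var_dump($insensitive);
var_dump($sensitive);
```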
The PHP built-in function str_replace() puts a new string in place of all occurrences of a substring within a string, as shown in Example 4-15.
Example 4-15. Using str_replace() to replace all occurrences of Cat with Dog
$org_string = "I wish I had a Cat.";
$result_string = str_replace("Cat", "Dog", $org_string);
# $result_string contains "I wish I had a Dog."
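Two related details may be useful here: str_replace() is case sensitive, and it accepts an optional fourth argument that receives the number of replacements made. The sketch below, which uses a made-up sentence, contrasts it with the case-insensitive str_ireplace().

```php
<?php
// Made-up sentence containing both "Cat" and "cat".
$org_string = "My Cat chased another cat.";

// Case-sensitive: only the capitalized "Cat" is replaced.
$result = str_replace("Cat", "Dog", $org_string, $count);
// $result is "My Dog chased another cat." and $count is 1

// Case-insensitive: both occurrences are replaced.
$result_i = str_ireplace("cat", "Dog", $org_string, $count_i);
// $result_i is "My Dog chased another Dog." and $count_i is 2

echo "$result ($count)\n$result_i ($count_i)\n";
```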
The str_replace() function is also useful when a webbot needs to remove a character or set of characters from a string. You do this by instructing str_replace() to replace text with a null string, as shown in Example 4-16.
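A minimal sketch of this removal technique follows. The price string and variable names are assumptions chosen for illustration. Passing an array as the first argument removes several different characters in one call.

```php
<?php
// Single quotes keep the dollar sign literal.
$price = '$1,299.00';

// Remove one character by replacing it with an empty string.
$no_commas = str_replace(",", "", $price);               // "$1299.00"

// Remove a set of characters by passing an array of needles.
$number = str_replace(array('$', ','), "", $price);     // "1299.00"

echo $number;
```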
The script in Example 4-17 uses a variety of built-in functions, along with a few functions from LIB_http and LIB_parse, to create a string that contains unformatted text from a website. The result is the contents of the web page without any HTML formatting.
Example 4-17. Parsing the content from the HTML used on http://www.cnn.com
include("LIB_parse.php");    # Include parse library
include("LIB_http.php");     # Include PHP/CURL library

// Download the page
$web_page = http_get($target="http://www.cnn.com", $referer="");

// Remove all JavaScript
$noformat = remove($web_page['FILE'], "<script", "</script>");

// Strip out all HTML formatting
$noformat = strip_tags($noformat);

// Remove unwanted white space
$noformat = str_replace("\t", "", $noformat);       // Remove tabs
$noformat = str_replace("&nbsp;", "", $noformat);   // Remove non-breaking spaces
$noformat = str_replace("\n", "", $noformat);       // Remove line feeds
echo $noformat;
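The same steps can be tried without a network connection or the libraries. The sketch below is an approximation under stated assumptions: html_to_text() is a hypothetical helper, preg_replace() stands in for the remove() function from LIB_parse, and the sample HTML is made up. It also replaces white space with single spaces rather than deleting it, so that adjacent words do not run together.

```php
<?php
// Hypothetical helper, not part of LIB_parse or LIB_http.
function html_to_text($html)
{
    // Drop <script>...</script> blocks, including their contents
    // (a stand-in for LIB_parse's remove()).
    $text = preg_replace('/<script\b.*?<\/script>/is', '', $html);

    // Strip the remaining HTML tags.
    $text = strip_tags($text);

    // Turn non-breaking space entities into ordinary spaces.
    $text = str_replace("&nbsp;", " ", $text);

    // Collapse tabs, line feeds, and repeated spaces into single spaces.
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

// Made-up sample page.
$sample = "<html><body><script>var x = 1;</script>\n"
        . "<h1>Top\tStories</h1>\n<p>Hello&nbsp;world</p></body></html>";

echo html_to_text($sample);   // prints "Top Stories Hello world"
```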
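Sometimes it is convenient to calculate the similarity of two strings without necessarily parsing them. PHP’s similar_text() function returns the number of matching characters in two strings. If you pass a variable as the optional third argument, the function also stores the percentage of similarity between the two strings in that variable. Example 4-18 shows the syntax of similar_text().
Example 4-18. Example of using PHP’s similar_text() function
$matching_chars = similar_text($string1, $string2, $similarity_percentage);
# $similarity_percentage now contains the percentage of similarity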
You may use similar_text() to determine if a new version of a web page is significantly different from a cached version.
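One way such a check might look is sketched below. The page_changed() helper and the 90 percent threshold are assumptions for illustration, not values from LIB_parse. The percentage comes from the third argument of similar_text(), which PHP fills in by reference.

```php
<?php
// Hypothetical helper: true if the fresh page differs noticeably from the cache.
// The default threshold of 90 percent is an arbitrary illustrative choice.
function page_changed($cached, $fresh, $threshold = 90.0)
{
    // similar_text() stores the percentage of similarity in $percent.
    similar_text($cached, $fresh, $percent);
    return $percent < $threshold;
}

// Made-up cached and fresh versions of a page.
$cached_page = "Today's headline: markets steady.";
$fresh_page  = "Breaking news: storm warning issued for the coast.";

if (page_changed($cached_page, $fresh_page))
    echo "The page has changed significantly.";
else
    echo "The page is essentially unchanged.";
```

Because similar_text() compares the strings character by character, it can be slow on long pages, so a webbot may prefer to compare the parsed text rather than the raw HTML.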