The secret to developing webbots that harvest data from difficult websites is to emulate exactly the functionality and behavior of browsers. And from the webbot developer’s perspective, the easiest way to emulate a browser is to control a browser directly through the use of a browser macro.
A browser macro is a program or plug-in that uses a script to control the actions of a browser. The advantage of using a browser macro is that it can leverage the browser’s rendering engines for JavaScript and Flash as well as any other plug-ins or extensions available to the browser. The ability to programmatically control a browser vastly improves your ability to scrape or automate even the most difficult websites.
No matter how hard one tries, it is very difficult to write a script-based webbot that looks exactly like someone who is using a browser. The main reason for this is that script-based webbots tend to load files differently (skipping image, CSS, and JavaScript files). Additionally, sometimes these files exhibit strange cookie behaviors that are simply impossible to duplicate outside of a browser.
However, a browser macro of particular interest to webbot developers, iMacros, runs inside a browser. Therefore, it not only looks like a regular browser to any of the websites it visits, it is a regular browser. When you know how to completely control a browser through dynamic macros, it is impossible to tell it apart from an actual browser user.[61]
Since iMacros acts like any other browser, you can use it to download and process even the most difficult websites, including those that make heavy use of JavaScript (primarily AJAX) to flow content to the browser after the initial HTTP connection is closed. Since I started using the technique described in this chapter, I have not encountered any website that I cannot download and scrape.
iMacros is produced by iOpus (http://www.iopus.com) and is available for Internet Explorer, Firefox, and iOpus’s own custom browser. While it is not open source, iMacros is available as a free download at the iOpus website and is installed like any other browser plug-in.
While the iMacros plug-in is available for a variety of browsers, this book focuses on the Firefox version due to its history of stability and its cross-platform availability. While the Firefox version is my personal preference, you should feel free to use the version that best meets your needs.
Figure 23-2 shows what the iMacros plug-in looks like when installed in Firefox. At the top of the interface you’ll find the iMacros control (circled in Figure 23-2), which enables and disables the iMacros panel, shown at the left of the pane.
Let’s assume that you need to capture the first page of image search results for specific keywords on Google and Bing. This example task is a good candidate for using iMacros because both of these websites make heavy use of JavaScript and may be poor targets for traditional script-based webbots.
You create iMacros macros by putting the plug-in in record mode. This records your actions in a script, which is later “played” to control the browser, as shown in Figure 23-3. To record a macro, open a browser with the iMacros plug-in installed. Select the Rec tab in the iMacros panel and click the Record button. Once you start recording, every mouse and keyboard action is recorded in this macro script. The longer you use the browser in record mode, the longer your macro script becomes.
The steps to perform our task are shown in Figure 23-4.
Type www.google.com into the location bar.
Perform one search, in this case on webbots.
Select Google’s image search option and save the page.
Type www.bing.com into the location bar.
Again, perform a search on our search term webbots.
Select Bing’s image search option.
Save the page.
Click Stop in the iMacros pane to end the recording session.
While in record mode, iMacros automatically records every key click, URL change, form entry, button press, and saved screen. All of these actions are stored in the macro file called #Current.iim.[62] The #Current.iim file for the macro we just recorded is shown in Example 23-1.
Example 23-1. The macro that was created in Figure 23-4
01. VERSION BUILD=6700624 RECORDER=FX
02. TAB T=1
03. URL GOTO=http://www.yahoo.com/
04. URL GOTO=www.google.com
05. CLICK X=321 Y=222 CONTENT=webbots
06. CLICK X=76 Y=14
07. SAVEAS TYPE=CPL FOLDER=* FILE=*
08. URL GOTO=www.bing.com
09. CLICK X=267 Y=168
10. CLICK X=401 Y=165 CONTENT=webbots
11. CLICK X=609 Y=163
12. CLICK X=75 Y=11
13. SAVEAS TYPE=CPL FOLDER=* FILE=*
The first line of the macro describes the version of iMacros that was used to record the macro. On the second macro line, iMacros indicates that it is setting focus to the first browser tab. This is a very important feature because, as you’ll read in the next chapter, iMacros can perform very advanced features by running your custom PHP scripts in alternate browser tabs. The third line of the macro simply indicates the web page that was in the location bar when the Record button was pressed.
The first user-entered information is recorded on line 4, where iMacros tells the browser to go to the Google home page. iMacros then recorded that the web page was clicked at the coordinates 321, 222 and that our search term “webbots” was typed at that location on the web page.
It is important to note that iMacros allows two methods for indicating where you click. In this case, the x-y position was recorded, as shown in Figure 23-5.
The other option would have been to use the HTML tag of the web page to identify the location of the click. For example, if the “Use complete HTML tag” option had been selected in the example, line 5 of the macro would have looked like Example 23-2.
Example 23-2. Example of using the complete HTML tag to locate page clicks and form entry
TAG POS=1 TYPE=INPUT:TEXT FORM=NAME:f ATTR=NAME:q CONTENT=webbots
Both click modes are useful at various times. The x-y coordinate method is most useful when web pages change infrequently; the target web page does not use well-defined HTML tags to describe form elements; the HTML tags change frequently, as is the case with Craigslist.org; or the web content is in a non-HTML format like Flash. While either method will work in most cases, you need to experiment to find the best option for your specific application.
Macro lines 6 and 7 in Example 23-1 show the commands for clicking on the Google image option and saving the results of the image search. The SAVEAS command on line 7 indicates that the complete web page will be saved (TYPE=CPL[63]). Using this option saves not only the HTML but also all images that appeared on the saved web page. Line 7 also indicates that the web page will be saved in the default folder with the default file name. You may edit the macro if you want to save the file in a different location or under a different name.
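As a rough sketch of such an edit, the SAVEAS line might be changed as shown below. The folder and file name are hypothetical placeholders, and the folder is assumed to exist already; the TYPE, FOLDER, and FILE parameters are the ones that appear on line 7 of Example 23-1.

```text
' Save the complete page (HTML plus images) under an explicit name.
' The folder and file name below are hypothetical placeholders.
SAVEAS TYPE=CPL FOLDER=C:\webbots\pages FILE=google_images

' Save only the page's HTML, keeping the default folder and file name.
SAVEAS TYPE=HTM FOLDER=* FILE=*
```

Giving each saved page its own name keeps a later run of the macro from being confused with an earlier one, which matters when your scripts parse the saved files afterward.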
The final six lines of the macro essentially perform the same function as the earlier commands, except this time the image search is performed on Bing.
You’ll find a wide variety of commands available in iMacros macros. These commands, however, vary depending on the iMacros version and the browser plug-in you’re using. A complete set of iMacros commands is available at the iOpus website.[64]
While the command set supplied by iMacros is fairly complete, there are ways to supplement the existing commands with scripts that will do nearly anything you like. How to expand the functionality of iMacros with your own PHP scripts is explained in the next chapter.
In most cases, there are a few commands you’ll want to include in every macro. Example 23-3 lists these commands and explains why you may want to include them at the start of every macro.
Example 23-3. Suggested iMacros macro initialization
01. '#####################################################
02. '# HEADER (defaults, etc.)
03. '#####################################################
04. SET !TIMEOUT 240
05. SET !ERRORIGNORE YES
06. SET !EXTRACT_TEST_POPUP NO
07. FILTER TYPE=IMAGES STATUS=ON
08. CLEAR
09. TAB T=1
10. TAB CLOSEALLOTHERS
Lines 1 through 3 show how to write comments in an iMacros macro. Any line preceded by a ' character is considered a comment and is not executed when the macro is played.
Line 4 tells iMacros not to time out unless 240 seconds have passed. In other words, this command tells iMacros to wait up to four minutes for a web page to load. While this seems like a long time, sometimes this is required if you are using a slow proxy[65] and are downloading a media-heavy web page.
On line 5, iMacros is instructed to ignore errors. I suggest turning off error reporting in a production environment; during development, however, it is preferable to see the errors and adjust your macro as needed. Turning off error reporting in production is important because when iMacros encounters an error or warning, it notifies the user with a pop-up window, which suspends the macro until the window is closed. These interruptions are bothersome for two reasons: in many cases the warnings are little more than nuisances that are not critical to the performance of the macro, and when the macro runs unattended, it hangs while waiting for someone to respond to the message.
While they are useful during development, iMacros warnings and errors are better ignored in production environments, as counterintuitive as that sounds. I have found that the most common cause of an iMacros error in previously debugged macros is network timeouts. In most cases, these are not recoverable and may be dealt with using some of the advanced techniques described in the next chapter. From my experience, it is better to write fault-tolerant emulations than to have an entire 50,000-line macro hang because of one fragile network fetch.
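A minimal sketch of the development-time counterpart to lines 4 and 5 of Example 23-3 might look like the following. The 60-second timeout is an assumed value chosen only for illustration; the commands themselves are the ones used in Example 23-3.

```text
' Development settings: let errors interrupt the macro so they can be fixed.
' The 60-second timeout is an illustrative assumption, not a recommendation.
SET !TIMEOUT 60
SET !ERRORIGNORE NO
```

Once the macro has been debugged, switching !ERRORIGNORE back to YES restores the unattended behavior described above.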
Line 6 is important only if your macro is importing data from websites.[66] You should include this command even if not importing data because including it is a good habit that will save you from other annoying iMacros pop-up windows if or when you do decide to import data directly with iMacros.
If you’re not concerned with the images on the web pages you download, I suggest telling the browser not to download them, as shown on line 7. Ignoring images will make your macro run faster and save bandwidth. And again, if you’re using a slow proxy, this may make your macro run more smoothly.
If you want to clear your cookies, you’ll want to use the command on line 8. Clearing cookies will aid in the stealthiness of your webbot. The only reason not to clear your cookies is if they contain user login information and are needed for authenticating into the website you are visiting.
The final two lines of the script (lines 9 and 10) tell the browser that the macro will run in browser tab 1 and to close all other browser tabs. As you’ll see in the next chapter, the ability to run scripts in multiple browser tabs is an extremely powerful technique you can use to great advantage. Closing all other browser tabs is always a good idea because macros may open new tabs every time the macro is run. If the macro is run repeatedly, the browser will eventually crash because it has too many open tabs.
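To show how the pieces fit together, here is a sketch that places the header from Example 23-3 in front of the recorded Google steps from Example 23-1. Leaving out the FILTER line is my own choice for this particular task, on the assumption that filtered images would be missing from a page saved with TYPE=CPL. The click coordinates are the recorded ones and will only match a page laid out the same way.

```text
' Header (from Example 23-3, without the image filter)
SET !TIMEOUT 240
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
CLEAR
TAB T=1
TAB CLOSEALLOTHERS

' Recorded steps (lines 4 through 7 of Example 23-1)
URL GOTO=www.google.com
CLICK X=321 Y=222 CONTENT=webbots
CLICK X=76 Y=14
SAVEAS TYPE=CPL FOLDER=* FILE=*
```

The recorded VERSION line and the starting-page URL (line 3 of Example 23-1) are omitted here for brevity; keeping the VERSION line from your own recording is harmless.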
To play your macro, simply select the macro script you want to run, like our newly created #Current.iim as shown in Figure 23-6, and click the Play button.
Once you click Play, the selected macro will run just as you recorded it. If you modify your macro (according to the iMacros rules), the browser will execute those commands as well. You don’t have to provide delays that wait for web pages to download because iMacros automatically waits for pages to load before proceeding to the next instruction. In Chapter 24, you’ll learn how to programmatically start macros within shell and PHP scripts.
[61] This assumes that your macro exhibits human-like behavior. Hitting the same website 24 hours a day, 7 days a week, does not constitute human-like behavior.
[62] Regardless of the operating system you’re using, all iMacros macro files have an .iim file extension.
[63] The other option is HTM, which saves only the web page’s HTML.
[64] A current list of iMacros commands is available at http://wiki.imacros.net/Command_Reference.
[65] Proxies are discussed in Chapter 27.
[66] iMacros has the ability to import data from web pages and export that data in a CSV file. This functionality is not covered here in favor of the more flexible techniques described in Chapter 24.