In the previous chapter, you learned that iMacros is a useful tool for controlling a browser. However, the previous discussion was limited to recording browser behavior in a macro file and replaying that macro later. In this way, iMacros is more like an extension of a browser bookmark and bears little resemblance to the dynamic webbots described in this book.
Fortunately, you’re not limited to the iMacros everyone else uses. Since iMacros is browser based and relies on web protocols, we can exploit those protocols—and their capabilities—to do things that the original iMacros developers didn’t consider. In this chapter, you’ll learn a few tricks that will enable you to download and scrape nearly any website, including those that make heavy use of JavaScript, AJAX, or Flash. You will also learn how to control iMacros with PHP, parse iMacros (with a browser tab hack), and integrate external data sources into your macros.
Here are a couple of hacks that facilitate better control and parsing capability with iMacros. The first hack requires that you learn how to write scripts that create the browser macros dynamically. This allows you to add random delays (to aid stealthy behavior) and to use data from external sources like databases and scraped websites.
iMacros can load web pages from any valid URL. The natural conclusion is that people will use iMacros to access the regular websites that make up the Internet. What’s less obvious is that iMacros can also load and execute web pages that are on local web servers. This means that you can program iMacros to open a secondary browser tab that runs a web page on a local web server. Since you can program local web pages to access anything on your computer, this technique opens up enormous potential. This little hack frees your code from almost any restriction normally involved in using a macro and allows your macros to interface with other resources and to make programmatic decisions. Most typically, this technique is used to parse and act on web pages saved by a macro.
Before we explore my process for controlling iMacros, it’s important to note that the commercial version of iMacros (not the free version) has its own scripting interface. You are certainly encouraged to explore this option—especially if you have a background using Visual Basic—but that option is not explored here for the following reasons.
The scripting interface is COM based and will work with any Windows programming language that supports COM objects. While this is useful, it does not serve many developers’ needs because many people develop solely for Linux-based environments or for mixed Windows/Linux production environments. For this reason, most developers focus on web technologies and are not adept at using Windows desktop technologies like COM, which is really not an Internet technology. It is also more efficient to do what needs to be done directly in a web-programming language like PHP than it is to use PHP (or another COM-aware language) to interface to iMacros through COM.
Another reason to shy away from the built-in scripting interface is that, although the iMacros scripting engine accepts input from local files, it does not interface well with external data sources like databases or other websites.
The final reason for not using the iMacros scripting interface is that there is absolutely no reason to do so. As you’ll soon learn, you can get iMacros to do whatever you want it to do, in your preferred web language, by employing a few simple tricks. Also, using the built-in iMacros scripting engine would require that you buy software, which undermines this book’s vision of employing only free or open source solutions.
To do any of the advanced iMacros scripting, you will also need to purchase a scripting edition of iMacros, which may be out of reach for some developers. I do encourage people, however, to stop using the free version of iMacros if your project has commercial purposes.
If you are one of the many happy developers successfully using the iMacros scripting interface, please don’t take this the wrong way. It is simply not a path we are going to follow here. If you want to learn more about using the iMacros scripting language, you should visit its website.
Suppose you have an online store that sells products whose prices fluctuate wildly with variations in customer demand and supply.[67] You need to know the current market value of all of your products at any given time. If your competition lowers a price and you don’t, you will lose sales. Furthermore, if online competitors raise prices on items similar to those you sell for lower prices, you will also lose out on money you could make. To complicate the situation, let’s assume that your competitor’s website makes heavy use of AJAX and effectively renders traditional script-based webbots ineffective. To solve this dilemma, you decide to write a webbot to track your competitor’s prices. Furthermore, since you have hundreds of thousands of products to consider and you don’t want to risk a trespass-to-chattels lawsuit,[68] you only track competing prices for your 100 best sellers on a daily basis. In the end, you decide to do the following:
Write a webbot that emulates an actual browser user through the use of iMacros and a dynamic macro.
Develop a PHP script that reads your internal database to learn which items are your 100 best sellers and then writes the appropriate macro to iteratively read your competitor’s prices.
Write a PHP script to parse the prices from your competitor’s downloaded web pages.
The following section describes pseudo-code that could solve these tasks as well as be the basis for similar projects you might develop.
In addition to giving you the ability to write very specific functionality into macros, writing macros dynamically allows you to add a degree of standardization and maintainability. For example, in the previous chapter, you learned that some iMacros commands should be at the beginning of any macro you write. You can build these commands into a string, which is later written as the macro file. Example 24-1 shows how this is done.
Example 24-1. Initializing a dynamic macro
$macro = "";
$macro = $macro . "'#####################################################\n";
$macro = $macro . "'# HEADER (defaults, etc.)\n";
$macro = $macro . "'#####################################################\n";
$macro = $macro . "SET !TIMEOUT 240\n";
$macro = $macro . "SET !ERRORIGNORE YES\n";
$macro = $macro . "SET !EXTRACT_TEST_POPUP NO\n";
$macro = $macro . "FILTER TYPE=IMAGES STATUS=ON\n";
$macro = $macro . "CLEAR\n";
$macro = $macro . "TAB T=1\n";
$macro = $macro . "TAB CLOSEALLOTHERS\n";
$macro = $macro . "'#####################################################\n";
Notice that in Example 24-1, all iMacros comments are prefixed with the single-quote character. Also, consider that if the lines are not terminated with the escaped \n character, the entire macro will appear as a single line when written to the macro file. Therefore, you will want to terminate each line in the macro with \n.
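As a minimal sketch of the same idea, the header can also be assembled from an array of lines, so that the newline handling lives in one place. The function name build_macro_header() is hypothetical and is not part of iMacros or of this book’s libraries; the settings passed in are taken from Example 24-1.

```php
<?php
// Hypothetical helper (not part of iMacros): builds a macro header
// from an array of iMacros command lines.
function build_macro_header(array $settings): string
{
    // iMacros comments start with a single quote
    $lines = [
        "'#####################################################",
        "'# HEADER (defaults, etc.)",
        "'#####################################################",
    ];
    foreach ($settings as $line) {
        $lines[] = $line;
    }
    // Join with \n and terminate the last line too, so that
    // whatever is appended next starts on its own line
    return implode("\n", $lines) . "\n";
}

$macro = build_macro_header([
    "SET !TIMEOUT 240",
    "SET !ERRORIGNORE YES",
    "TAB T=1",
]);
```

Keeping the header in one function means every generated macro starts with the same defaults, and a forgotten \n cannot merge two commands into one line.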
You can use any resource as an input to your dynamically generated macro. But in most cases, the integration of external data sources into your macro will resemble Figure 24-1.
Figure 24-1 suggests only a few places for finding external data that may dynamically affect the actions your macro takes. As you develop projects of your own, you will no doubt add to this list. In our online store example, the macro will use a SQL command to query a local database in an effort to identify the 100 top-selling products. To simplify this example, let’s assume that each product can be identified by an ASIN, or an Amazon Standard Identification Number,[69] and that the ASINs for the top 100 products are returned in an array named $product_array and used as shown in Example 24-2.
Example 24-2. Dynamically writing the macro to download and parse product information
//
// Your database query here, which queries a fictitious product database and creates an array called
// $product_array, containing ASINs of the 100 best-selling products in your inventory
//

// (1) Loop through each of the first 100 products
for ($xx = 0; $xx < count($product_array); $xx++)
    {
    // (2) Get the URL of the competitor's product page
    $macro = $macro . "' Get URL of competitor's product page\n";
    $competitor_product_information = ""; // Fully resolved URL to competitor's product page goes here
    $macro = $macro . "'\n";

    // (3) Add a random delay; rand() runs now, so the chosen number is written into the macro
    $macro = $macro . "' Add random delay\n";
    $macro = $macro . "WAIT SECONDS=" . rand(5, 15) . "\n";
    $macro = $macro . "'\n";

    // (4) Capture the competitor's web page
    $macro = $macro . "' Capture the competitor's web page with product information\n";
    $macro = $macro . "URL GOTO=$competitor_product_information\n";
    $macro = $macro . "SAVEAS TYPE=HTM FOLDER=* FILE=search_results\n";
    $macro = $macro . "'\n";

    // (5) Run the local parsing script in a second tab
    $macro = $macro . "' Run the parsing software in secondary browser tab\n";
    $macro = $macro . "TAB T=2\n";
    $macro = $macro . "URL GOTO=http://localhost/parser.php?id=" . $product_array[$xx]['ASIN'] . "\n";
    $macro = $macro . "'\n";

    // (6) Return to the first tab
    $macro = $macro . "' Resume in original browser tab\n";
    $macro = $macro . "TAB T=1\n";
    }

// (7) Write the finished macro to a file
file_put_contents("//PATH/MACRO_NAME.iim", $macro);
First, each of the 100 products is accounted for by the macro at 1. Then, the URL for the competitor’s product page is found at 2. The contents of this URL, of course, vary depending on the demands of the target website. In most cases, some identifier (like an ASIN) may be used to identify items. For example, the URL may be a simple web address with a query string that identifies the desired product. In other cases, it may be the result of an online search that is conducted from a form. Before the target website is accessed, it is a good idea to insert a random delay to simulate actual (human-like) web use, as at 3. In the example, a delay with a random length of 5 to 15 seconds is inserted. Once the product identifier is combined with the target web page to find pricing information for the item, that web page is downloaded and saved in a known file location at 4.
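The random delay can be sketched as a small helper that returns one finished macro line. This is only an illustration: random_wait_line() is a hypothetical name, and the 5-to-15-second range comes from the example above. The random number is chosen when the PHP script generates the macro, so each pass through the loop writes a different fixed value into the macro text.

```php
<?php
// Hypothetical helper: returns one iMacros WAIT line whose delay
// is picked at macro-generation time.
function random_wait_line(int $min_seconds, int $max_seconds): string
{
    $delay = rand($min_seconds, $max_seconds);
    return "WAIT SECONDS=" . $delay . "\n";
}

// Three products, three independently chosen delays
$macro = "";
for ($i = 0; $i < 3; $i++) {
    $macro .= random_wait_line(5, 15);
}
```

The waiting itself is done by iMacros when the macro runs. Calling PHP’s sleep() in the generator would only pause the script that writes the macro, not the browser session.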
Now we have a true departure from traditional iMacros scripting. So far, everything has occurred in the first (and only) browser tab, and every instruction has been a traditional iMacros instruction. But at 5, the macro opens a second browser tab and loads a PHP file, which runs on a local web server. This local web page is a parsing script that loads the previously stored file, parses the pricing information, and stores that price in a local database; it uses the product’s ASIN to identify this product in your pricing database. Once the downloaded page is processed, the macro focuses again on the first browser tab at 6.
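What the local parsing script does depends entirely on the target page, but its core might look like the following sketch. Everything specific here is an assumption made for illustration: the parse_price() function, the class="price" markup, and the placeholder ASIN are all hypothetical, and a real script would read the HTML from the file saved by the macro and then write the result to your pricing database.

```php
<?php
// Hypothetical parser: returns the first price found in an element
// whose class is "price", or null if none is found.
// The class name and markup are assumptions about the target page.
function parse_price(string $html): ?float
{
    $pattern = '/class="price"[^>]*>\s*\$?([0-9][0-9,]*(?:\.[0-9]+)?)/';
    if (preg_match($pattern, $html, $match)) {
        // Remove thousands separators before converting to a number
        return (float) str_replace(',', '', $match[1]);
    }
    return null;
}

// The macro passes the ASIN on the query string (?id=...)
$asin = isset($_GET['id']) ? $_GET['id'] : 'B000000000'; // placeholder ASIN

// Stand-in for the page saved by the macro
$html = '<span class="price">$1,249.99</span>';
$price = parse_price($html);
```

Returning null when no price is found lets the calling script skip the database update instead of storing a bogus zero.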
The final part of macro generation is to write the file to a path where iMacros can find it, as at 7. This is simply a matter of writing the string in which the macro was developed to a file with an .iim extension in iMacros’s default macros directory, as shown in Example 24-3.
Example 24-3. Writing the macro to the file
file_put_contents($MACROS_PATH . "/test.iim", $macro); // where $MACROS_PATH is the default macros path
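A slightly more defensive version of this step is sketched below. The write_macro() function is a hypothetical helper, and the system temp directory merely stands in for the iMacros macros directory, whose location depends on your installation.

```php
<?php
// Hypothetical helper: writes the macro text to <directory>/<name>.iim
// and returns the full path, or throws if the write fails.
function write_macro(string $macros_dir, string $name, string $macro): string
{
    // Tolerate a trailing slash or backslash on the directory
    $path = rtrim($macros_dir, "/\\") . DIRECTORY_SEPARATOR . $name . ".iim";
    if (file_put_contents($path, $macro) === false) {
        throw new RuntimeException("Could not write macro to " . $path);
    }
    return $path;
}

// The temp directory is only a stand-in for the macros directory
$path = write_macro(sys_get_temp_dir(), "test", "TAB T=1\nURL GOTO=http://localhost/\n");
```

Checking the return value of file_put_contents() matters here because a macro that silently fails to write leaves iMacros running a stale file from an earlier session.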
This trick, opening an alternative browser tab to run arbitrary scripts, gives you the opportunity to do anything you want with web pages downloaded by iMacros. A lot of things happen in the pseudo-code shown in Example 24-2 (and the actual code will vary depending on your specific situation), but the most important concept to pull out of all this is what happens at 5. This line reveals the true trick in this hack: it allows you to execute arbitrary code within an otherwise standard iMacros macro. Since iMacros controls browsers, it is also bound to the limitations that are imposed on browsers. These limitations are necessary and protect you from rogue websites that want to harm you or your computer. Unfortunately, many of these controls that are intended to protect you also prevent your webbots from doing many of the things you might find useful. However, once you master the ability to combine a browser macro with any local script, you break out of the browser sandbox and are no longer restricted to doing what browsers allow. Not only can you use iMacros to download web pages that cannot be downloaded with traditional webbots, but you can also write PHP scripts to perform any function that your local web server can run.
For example, here are some things you can do with local scripts running in an alternate browser tab:
Parse previously downloaded web pages and store indexed information in a local database (as depicted in the example).
Automatically upload files to remote servers.
Change system configurations.
Modify cookies from any domain.
Execute any script that can run on a local web server.
Depending on your application, you can use as many alternative tabs as needed. You will, however, probably not find reasons to have more than two browser tabs open at any given time. In addition to running a local web page in an alternative browser tab, you can also use the techniques shown in the next section to dynamically choose to load new macros into iMacros. This can get extremely complex (and extremely powerful) if these new macros are also written on the fly. Once the parsing is conducted in the second browser tab, the macro returns focus to the browser tab it was on before it started executing the local script.
Up to this point, all iMacros sessions have required that you open the browser with the iMacros plug-in, select a macro from the list of available macros, and click the Run button to load and execute the macro. To get the most from iMacros, you’ll want to execute macros automatically, which ultimately means executing your iMacros sessions from a command line. Once you learn how to launch iMacros from a command line, you can use the information in Chapter 22 to make iMacros sessions launch and execute unattended and whenever you desire.
These instructions may seem odd at first, but remember that iMacros was never designed to work like this. This is another hack, and what you’re about to read was learned through trial and error over a substantial period of time.
The process for executing iMacros macros directly from a command line differs slightly depending on whether you’re using Windows or Linux. In either case, the basic scenario for launching iMacros macros to execute automatically looks like that in Figure 24-2.
Most of your iMacros sessions will start with the creation of the macro, which is specifically tailored for this particular iMacros session. That macro is stored in the default macro directory and directly executed by iMacros with one of the following techniques.
The script shown in Example 24-4 may be used to launch an iMacros session from a Windows command prompt.
Example 24-4. Executing an iMacros macro from a Windows batch file
:: (1) Run the script that creates the macro (test.iim)
php create_macro.php

:: (2) Start Firefox (the empty "" is the window title that start expects before a quoted path)
start "" /B "C:\Program Files\Mozilla Firefox\firefox.exe" http://127.0.0.1

:: (3) Waste some time
ping 127.1.1.1

:: (4) Run the macro
start "" /B "C:\Program Files\Mozilla Firefox\firefox.exe" http://run.imacros.net/?m=test.iim
At 1, a macro is created and written to the default directory, as described earlier in this chapter. For demonstration purposes, assume this macro is named test.iim. Firefox is executed directly from the command line at 2. Even though the /B flag is used to start the application without creating a new window, Firefox is launched in its own window. In this case, Firefox is directed to load the URL http://127.0.0.1, the local default web page. It is a good habit to load a local web page when the browser launches instead of immediately loading your macro, because doing so avoids potential timeouts caused by public websites and allows time for large macros to load.
Then, Windows is told to ping the local web server to give Firefox extra time to load the local web page 3. At 4, Firefox is executed again, but this time it is instructed to load the macro file instead of the default local web page.
As you can see, there is no direct way to get Firefox (or Internet Explorer) to launch and load a macro. The web page that is actually loaded is http://run.imacros.net, which is on the iMacros website. That web page instructs the iMacros plug-in to load and execute the macro described on the query string.
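Since the macro name travels on the query string, the launch URL can be built with a small helper when macro names are generated on the fly. This is a sketch only: imacros_run_url() is a hypothetical name, and the single m parameter is the only one shown in this chapter.

```php
<?php
// Hypothetical helper: builds the run.imacros.net URL that asks the
// iMacros plug-in to load and execute the named macro.
function imacros_run_url(string $macro_file): string
{
    // http_build_query() URL-encodes the value for the query string
    return "http://run.imacros.net/?" . http_build_query(["m" => $macro_file]);
}

$url = imacros_run_url("test.iim");
```

For a simple name like test.iim the result is identical to the URL used in Example 24-4; the encoding only matters if a macro name contains characters that are not safe in a URL.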
My personal experience is that it’s easier to reliably launch and execute an iMacros macro from Linux than it is from a Windows platform. On Windows, if the precautions taken in 2, 3, and 4 in Example 24-4 are not taken, Firefox may not load correctly. However, in Linux environments, I have found no need to preload Firefox or to ping the local server to ensure that the browser has loaded. In fact, you can simply launch Firefox and the macro at the same time from PHP, as shown by the scriptlet in Example 24-5.
Example 24-5. Launching Firefox and iMacros from Linux
<?php
//
// Load Firefox and launch macro.
//
system("firefox http://run.imacros.net/?m=test.iim");
?>
If you’re on a Linux platform, you could run the line of PHP (from Example 24-5) within the program that creates the macro without performing all the steps described previously for doing the same in Windows.
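If the macro name is not a fixed string, it is safer to quote the URL before handing it to the shell. The sketch below assumes a hypothetical helper named firefox_macro_command() and that firefox is on the PATH; the actual launch is left commented out so the snippet can be run on a machine without Firefox or iMacros.

```php
<?php
// Hypothetical helper: builds the shell command that launches Firefox
// with the iMacros run URL for the named macro.
function firefox_macro_command(string $macro_file): string
{
    $url = "http://run.imacros.net/?m=" . rawurlencode($macro_file);
    // escapeshellarg() quotes the URL so the shell does not
    // interpret characters such as ? or &
    return "firefox " . escapeshellarg($url);
}

$command = firefox_macro_command("test.iim");

// Uncomment on a machine with Firefox and iMacros installed:
// system($command);
```

Building the command separately from running it also makes it easy to log exactly what was launched when a scheduled session misbehaves.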
[67] Examples of products like these may include computer memory, agricultural commodities, petroleum products, or other consumer goods.
[68] See Chapter 31.
[69] The use of an ASIN in the example is entirely arbitrary. Any unique identifier, such as a manufacturer product number, could have been used.