Most webmasters will be passionately anxious that their creations are properly indexed by the search engines on the Web, so that the teeming millions may share the delights they offer. At the time of writing, the search engines were coming under a good deal of criticism for being slow, inaccurate, arbitrary, and often plain wrong. One of the more serious criticisms alleged that sites that offered large numbers of separate pages produced by scripts from databases (in other words, most of the serious e-commerce sites) were not being properly indexed. According to one estimate, only 1 page in 500 would actually be found. This invisible material is often called “The Dark Web.”
The Netcraft survey of June 2000 visited about 16 million web sites. At the same time Google claimed to be the most comprehensive search engine, with 2 million sites indexed. This meant that, at best, only one site in eight could then be found via the best search engine. Perhaps wisely, Google now does not claim a number of sites. Instead it claims (as of August 2001) to index 1,387,529,000 web pages. Since the Netcraft survey for July 2001 showed 31 million sites (http://www.netcraft.com/Survey/Reports/200107/graphs.html), the implication is that the average site has only about 44 pages, which seems far too few and suggests that a lot of sites are not being indexed at all.
The reason seems to be that the search engines spend most of their time and energy fighting off "spam": attempts to win pages better ratings than they deserve. The spammers used CGI scripts long before databases became prevalent on the Web, so the search engines developed ways of detecting scripts, and if their suspicions were triggered, suspect sites would not be indexed. No one outside the search-engine programming departments really knows the truth of the matter (and they aren't telling), but the mythology is that they don't like URLs that contain characters such as "!" or "?", or words such as "cgi-bin."
Several commercial development systems betray themselves like this, but if you write your own scripts and serve them up with Apache, you can produce pages that cannot be distinguished from static HTML. Working with script2_html and the corresponding Config file shown earlier, the trick is this:
Remove cgi-bin/ from HREF or ACTION statements. We now have, for instance:

<A HREF="/script2_html/whole_database">Click here to see whole database</A>

Add the line:

ScriptAliasMatch /script(.*) /usr/www/APACHE3/APACHE3/cgi-bin/script$1

to your Config file. The effect is that any URL that begins with /script is caught. The odd-looking (.*) is a Perl construct, borrowed by Apache, and means "remember all the characters that follow the word script." They reappear in the variable $1 and are tacked onto /usr/www/APACHE3/APACHE3/cgi-bin/script.
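To see where it might go, here is a minimal sketch of a Config file with the directive in place; apart from the ScriptAliasMatch line itself, the directives and paths are illustrative stand-ins for whatever your own Config file already contains:

User webuser
Group webgroup
ServerName www.butterthlies.com
DocumentRoot /usr/www/APACHE3/APACHE3/site.cgi/htdocs
# catch URLs beginning /script and run the matching CGI script,
# without cgi-bin ever appearing in the visible URL
ScriptAliasMatch /script(.*) /usr/www/APACHE3/APACHE3/cgi-bin/script$1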
As a result, when you click the link, the URL that gets executed, and which the search engines see, is http://www.butterthlies.com/script2_html/whole_database. The fatal words cgi-bin have disappeared, and there is nothing to show that the page returned is not static HTML. Well, apart from the perhaps equally fatal words script or database, which might give the game away... but you get the idea.
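Since ScriptAliasMatch will map any pattern you like, you could push the idea further and choose a blander prefix. This variant is our own illustration rather than part of the Config file above:

ScriptAliasMatch /catalog(.*) /usr/www/APACHE3/APACHE3/cgi-bin/script$1

A link written as <A HREF="/catalog2_html/whole_database">...</A> would still end up running script2_html (the captured 2_html/whole_database is tacked onto .../cgi-bin/script as before), but the visible URL now talks about a catalog rather than a script. The whole_database part still hints at a database, of course; you could rename the links along the same lines.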
Another search-engine problem is that most of them cannot make their way through HTML frames. Since many web pages use them, this is a worry and makes one wonder whether the search engines are living in the same time frame as the rest of us. The answer is to provide a cruder home page, with links to all the pages you want indexed, in a <NOFRAMES> area. See your HTML reference book. A useful tool is a really old browser that also does not understand frames, so you can see your pages the way the search engines do. We use a Win 3.x copy of NCSA's Mosaic (download it from http://www.ncsa.uiuc.edu).
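As a reminder of what the markup looks like, here is a skeleton; the frame filenames are invented for the example, and the link is the one used earlier:

<FRAMESET COLS="25%,75%">
  <FRAME SRC="menu.html">
  <FRAME SRC="main.html">
  <NOFRAMES>
    <BODY>
      <!-- crude home page for spiders and frame-blind browsers -->
      <A HREF="/script2_html/whole_database">Click here to see whole database</A>
    </BODY>
  </NOFRAMES>
</FRAMESET>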
The <NOFRAMES> tag will tend to pick out the search engines, but it is not infallible. A more positive way to detect their presence is to watch to see whether the client tries to open the file robots.txt. This is a standard filename that contains instructions to spiders to keep them to the parts of the site you want. See the tutorial at http://www.searchengineworld.com/robots/robots_tutorial.htm; the RFC is at http://www.robotstxt.org/wc/norobots-rfc.html.
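The file itself is plain text, served from the root of the site as /robots.txt; a minimal sketch, with invented paths, looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /scratch/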
If the visitor goes for robots.txt, you can safely assume that it is a spider and serve up a simple dish.
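One simple way of doing the watching is to scan the access log for clients that have asked for /robots.txt and keep their addresses to one side, so that your scripts can serve those visitors the plainer page. The sketch below assumes a log in Common Log Format at an invented path; adjust both to suit:

#!/usr/bin/perl -w
# list the addresses of clients that have requested /robots.txt,
# on the theory that anything fetching robots.txt is a spider
use strict;

my $log = '/usr/www/APACHE3/APACHE3/logs/access_log';   # illustrative path
my %spider;

open LOG, $log or die "cannot open $log: $!";
while (<LOG>)
    {
    # Common Log Format: host ident user [date zone] "METHOD /path HTTP/x.x" status bytes
    my ($host, $path) = (split /\s+/)[0, 6];
    $spider{$host} = 1 if defined $path && $path eq '/robots.txt';
    }
close LOG;

print "$_\n" for sort keys %spider;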
The search engines all have their own quirks. Google, for instance, ranks a site by the number of other pages that link to it — which is democratic but tends to hide the quirky bit of information that just interests you. The engines come and go with dazzling rapidity, so if you are in for the long haul, it is probably best to register your site with the big ones and forget about the whole problem. One of us (PL) has a medical encyclopedia (http://www.medic-planet.com). It logs the visits of search engines. After a heart-stopping initial delay of about three months when nothing happened, it now gets visits from several spiders every day and gets a steady flow of visitors that is remarkably constant from month to month.
If you want to make serious efforts to seduce the search engines, look for further information at http://searchengineforms.com and http://searchenginewatch.com.