1.4 Worms, Spiders, and Knowbots

Web worms, spiders, robots, and knowbots are automated tools that crawl around the Web looking for information and reporting their findings. Many of the so-called Internet Starting Points use robots to scour the Web looking for new information. These automatons can be used either to search for information about a particular topic of interest or to build up databases for subsequent searching by others. (See Section 1.9, Internet Starting Points, for information on searching the net.)

A worm is a program that moves from one site to another. The generic term "worm" has nothing to do with the Web; it simply refers to a program that seeks to replicate itself on multiple hosts. Worms are not necessarily good. The "Internet Worm" of 1988 caused a massive breakdown of thousands of systems on the Internet. But that's another story.(8)

A knowbot is a program or agent that, like a worm, travels from site to site. However, it has a flavor of artificial intelligence in that it usually follows knowledge-based rules. Another term for a knowbot might be an autonomous agent; clear distinctions between these terms are not currently meaningful.(9) In the context of this section, finding information, we'll look below at one particular knowbot and one worm. First, however, we'll mention spiders.

Spiders, as their name implies, crawl around the Web, doing things. They can find information to build large textual databases; the WebCrawler does this. They can also maintain large Webs or collections of Webs; this is the function of the MOMspider.

Following is Brian Pinkerton's description of one Web worm, the WebCrawler:

The WebCrawler is a web robot, and is the first product of an experiment in information discovery on the Web. I wrote it because I could never find information when I wanted it, and because I don't have time to follow endless links.

The WebCrawler has three different functions:

1. It builds indices for documents it finds on the Web. The broad, content-based index is available for searching.

2. It acts as an agent, searching for documents of particular interest to the user. In doing so, it draws upon the knowledge accumulated in its index, and some simple strategies to bias the search toward interesting material. In this sense, it is a lot like the Fish search, although it operates network-wide.

3. It is a testbed for experimenting with Web search strategies. It's easy to plug in a new search strategy, or ask queries from afar, using a special protocol.

In addition, the WebCrawler can answer some fun queries. Because it models the world using a flexible, OO (Ed.: object-oriented) approach, the actual graph structure of the Web is available for queries. This allows you, for instance, to find out which sites reference a particular page. It also lets me construct the Web Top 25 List, the list of the most frequently referenced documents that the WebCrawler has found.

How it Works

The WebCrawler works by starting with a known set of documents (even if it is just one), identifying new places to explore by looking at the outbound links from that document, and then visiting those links.

It is composed of three essential pieces:

1. The search engine directs the search. In a breadth-first search, it is responsible for identifying new places to visit by looking at the oldest unvisited links from documents in the database. In the directed, find-me-what-I-want strategy, the search engine directs the search by finding the most relevant places to visit next.

2. The database contains a list of all documents, both visited and unvisited, and an index on the content of visited documents. Each document points to a particular host and, if visited, contains a list of pointers to other documents (links).

3. "Agents" retrieve documents. They use CERN's WWW library to retrieve a specific URL, then return that document to the database for indexing and storage. The WebCrawler typically runs with 5-10 agents at once.

Being a Good Citizen

The WebCrawler tries hard to be a good citizen. Its main approach involves the order in which it searches the Web. Some web robots have been known to operate in a depth-first fashion, retrieving file after file from a single site. This kind of traversal is bad. The WebCrawler searches the Web in a breadth-first fashion. When building its index of the Web, the WebCrawler will access a site at most a few times a day.

When the WebCrawler is searching for something more specific, its search may narrow to a relevant set of documents at a particular site. When this happens, the WebCrawler limits its search speed to one document per minute and sets a ceiling on the number of documents that can be retrieved from the host before query results are reported to the user. The WebCrawler also adopts several of the techniques mentioned in the Guidelines for Robot Writers.

Implementation Status

The WebCrawler is written in C and Objective-C for NEXTSTEP. It uses the WWW library from CERN, with several changes to make automation easier. Whenever I feel comfortable about unleashing the WebCrawler, I'll make the source code available!

bp@cs.washington.edu

Brian Pinkerton
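Pinkerton's description boils down to a short loop: keep a frontier of unvisited links, always expand the oldest one first, and pace the requests made to any single host. The sketch below is our own illustration of that idea in Python, not Pinkerton's Objective-C code; the starting URL, the page limit, and the per-host delay (echoing the one-document-per-minute figure above) are placeholder values.

# A toy breadth-first crawler illustrating the WebCrawler's traversal idea:
# expand the oldest unvisited link first, and be polite to each host.
# (Editorial sketch in Python; start URL, limits, and delay are placeholders.)
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=50, per_host_delay=60.0):
    frontier = deque([start_url])          # oldest unvisited links come out first
    visited = set()
    last_fetch = {}                        # host -> time of most recent request

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()           # breadth-first: take the oldest link
        if url in visited:
            continue

        host = urlparse(url).netloc
        wait = per_host_delay - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:                       # crude politeness: pace requests per host
            time.sleep(wait)

        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                       # skip unreachable documents

        visited.add(url)
        last_fetch[host] = time.time()

        parser = LinkCollector()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)  # new places to explore later

    return visited

A real robot would do much more, such as honoring the robots exclusion conventions referred to in the Guidelines for Robot Writers; the point here is only the breadth-first ordering and the per-host pacing.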

MOMspider, available for free from the University of California, Irvine, is used to help maintain Webs. It is written in Perl and runs on most UNIX systems. MOMspider was written by Roy T. Fielding, and a paper titled "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web"(10) was presented at the WWW94 conference in Geneva. From Fielding's paper:

MOMspider gets its instructions by reading a text file that contains a list of options and tasks to be performed (an example instruction file is provided in Appendix A). Each task is intended to describe a specific infostructure so that it can be encompassed by the traversal process. A task instruction includes the traversal type, an infostructure name (for later reference), the "Top" URL at which to start traversing, the location for placing the indexed output, an e-mail address that corresponds to the owner of that infostructure, and a set of options that determine what identified maintenance issues justify sending an email message.

Appendix A

# MOMspider-0.1a Instruction File

SystemAvoid  /usr/local/httpd/admin/avoid.mom
SystemSites  /usr/local/httpd/admin/sites.mom
AvoidFile    /usr/grads/fielding/test/.momspider-avoid
SitesFile    /usr/grads/fielding/test/.momspider-sites
SitesCheck   7

<Site
    Name         ICS
    TopURL       http://www.ics.uci.edu/ICShome.html
    IndexURL     http://www.ics.uci.edu/Admin/ICS.html
    IndexFile    /usr/local/httpd/documentroot/MOM/ICS.html
    IndexTitle   MOMspider Index for All of ICS
    EmailAddress www@ics.uci.edu
    EmailBroken
    EmailExpired 2
>

<Tree
    Name         MOMspider-WWW94
    TopURL       http://www.ics.uci.edu/WebSoft/MOMspider/WWW94/paper.html
    IndexURL     http://www.ics.uci.edu/Admin/MOMspider-WWW94.html
    IndexFile    /usr/local/httpd/documentroot/Admin/MOMspider-WWW94.html
    IndexTitle   MOMspider Index for Roy's WWW94 Paper
    EmailAddress fielding@ics.uci.edu
    EmailBroken
>

<Owner
    Name         RTF
    TopURL       http://www.ics.uci.edu/~fielding/hotlist.html
    IndexURL     http://www.ics.uci.edu/~fielding/MOM/RTF.html
    IndexFile    /usr/grads/fielding/public_html/MOM/RTF.html
    EmailAddress fielding@ics.uci.edu
    EmailBroken
    EmailChanged 3
    EmailExpired 7
>

Finally, rest assured that not all bots and spiders must be run from expensive workstations. A really cool product called Surfbot(11) from Surflogic LLC runs just fine on Win95 PCs. Surfbot lets you configure your own private "agents" to traverse either known Internet Starting Points or your own set of bookmarks. Its wizard-style setup configures the agents, can produce a variety of reports, and is simple to use.

Surfbot control screen for configuring one particular agent.

One particularly nice feature is its ability to schedule the times for searching. It makes the modem connection for you and hangs up when done. You can make your agents fire up in the middle of the night so the results will be waiting for you in the morning!


1.5 Authoring

"Man invented language to satisfy his deep need to complain." Lily Tomlin

Using a Web browser is easy. Authoring documents for Web publishing is relatively easy. However, sometimes it is a little tricky.

The variety and capability of Web authoring tools are exploding. Web documents must be in the HTML format. HTML is a specific application of SGML, so you can purchase commercial off-the-shelf SGML products to help in the writing and analysis of Web documents. The SGML products are not necessarily the best HTML authoring tools, however.

In fact, the traditional SGML vendor community is jumping on the Web as a new market. Vendors are releasing SGML products specifically to support and aid the writing of HTML documents. The decision by the original Web developers to use SGML as the foundation for Web documents is one of the Web's great strengths.

SoftQuad’s HoTMetaL HTML editor

HoTMetaL(12) runs on PCs, Macs, and UNIX workstations. In addition, several freeware packages are available for PCs and the Macintosh. SimpleHTML(13) is a HyperCard-based editor for the Mac.

On UNIX workstations there is also tkWWW, an authoring tool based on the Tk toolkit for X Windows.

One of the most common HTML authoring methods is simply to use a text editor to write HTML directly. This "Iron Man HTML" technique can be aided by some editors, for example Emacs, which provides an HTML mode that takes some of the tedium away. HTML documents are readable and editable; they just require a little learning.

An HTML fragment

An interesting authoring issue concerns the conformance of HTML files with the HTML DTD. Some authoring packages help force conformance, while others let you get away with sloppy HTML. Most browsers do not bother to check for compliant HTML and simply do the best they can with the display. If you are concerned with interoperability and the longevity of your document, it will pay in the long run to be in conformance.
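Short of validating against the full HTML DTD, even a crude check of tag balance will catch much of the sloppy markup that browsers silently forgive. The sketch below is our own rough illustration in Python, not a real SGML validator; the set of tags treated as needing no end tag is deliberately abbreviated.

# A crude well-formedness check: report end tags that do not match the most
# recently opened tag. This is an editorial illustration only, not SGML/DTD
# validation, and the VOID_TAGS list is deliberately incomplete.
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "p", "li", "input", "meta", "link"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack or self.stack[-1] != tag:
            self.problems.append("unexpected </%s> at line %d" % (tag, self.getpos()[0]))
        else:
            self.stack.pop()

    def report(self):
        return self.problems + ["<%s> never closed" % t for t in self.stack]


checker = TagBalanceChecker()
checker.feed("<html><head><title>Test</title></head><body><b>bold text</body></html>")
print("\n".join(checker.report()) or "no obvious problems")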

An alternative to native HTML authoring is to write documents using your favorite text editor, word processor, or publishing system and then convert to HTML. A large number of conversion programs exist for just this purpose. For example, the rtf2html program on UNIX platforms will convert documents in the RTF format into HTML. Similarly, there are converters for LaTeX and FrameMaker(14). All of these conversion mechanisms depend on a properly written original document. For example, documents written in WordPerfect must use the style feature to convert successfully. Conversion mechanisms often require some hand editing to touch up conversion errors.

In the same convert-to-HTML vein, Microsoft is introducing an add-on module for MS Word that will output HTML. Interleaf is producing Cyberleaf, a higher-end HTML authoring tool that can read the WordPerfect, Word RTF, FrameMaker MIF, Interleaf, and ASCII formats.

Actually, Microsoft's Internet Assistant is much more than a simple converter. It is a very credible attempt to take an existing, well-known product, MS Word, and extend it to full-fledged Web authoring.


Some control buttons from Netscape’s Navigator Gold and HTML authoring package

Netscape's Navigator Gold WYSIWYG HTML editor has several dialogs for adjusting various tag attributes, such as the image alignment and manipulation options shown in this figure.

One recent development in Web authoring is the effort to codify Style Sheets. The W3O is leading an effort called Cascading Style Sheets. The idea is to specify a template document in which attributes will cascade down through a hierarchy of styles, to be "inherited" by other style sheets. For example, a business letter style sheet would inherit most of its attributes from a generic letter style sheet. As the Feb. 20, 1996 draft proposal on Cascading HTML style sheets by Håkon W. Lie and Bert Bos puts it:

This document specifies level 1 of the Cascading Style Sheet mechanism (CSS1). CSS1 is a simple style sheet mechanism that allows authors and readers to attach style (e.g., fonts, colors and spacing) to HTML documents. The CSS1 language is human readable and writable, and expresses style in common desktop publishing terminology.

One of the fundamental features of CSS is that style sheets cascade; authors can attach a preferred style sheet, while the reader may have a personal style sheet to adjust for human or technological limitations. The specification defines rules for resolving conflicts between different style sheets.

Cascading Style Sheet Editor from W3O

Again from the Cascading Style Sheet level 1 document:

Designing simple style sheets is easy. One only needs to know a little HTML and some basic desktop publishing terminology. For example, to set the text color of "H1" elements to blue, one can say:

H1 {color: blue}

The example consists of two main parts: selector ('H1') and declaration ('color: blue'). The declaration has two parts: property ('color') and value ('blue'). While the example above only tries to influence one of the properties needed for rendering an HTML document, it qualifies as a style sheet on its own. Combined with other style sheets (one of the fundamental features of CSS is that style sheets are combined) it will determine the final presentation of the document.

The selector is the link between the HTML document and the style, and all HTML tags are possible selectors. HTML tags are defined in the HTML specification [2], and the CSS1 specification defines a syntax for how to address them.

The 'color' property is one of around 40 properties that determine the presentation of an HTML document.
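To keep the terminology straight, the toy Python function below (our illustration only; real CSS parsing is considerably more involved) splits a one-declaration rule such as the H1 example into the pieces the specification names.

# Split a minimal one-declaration CSS rule such as "H1 {color: blue}" into the
# pieces the CSS1 draft names: selector, declaration, property, and value.
# Editorial illustration only; a real CSS parser handles far more than this.
def split_rule(rule):
    selector, _, rest = rule.partition("{")
    declaration = rest.rstrip("} \n")
    prop, _, value = declaration.partition(":")
    return {
        "selector": selector.strip(),
        "declaration": declaration.strip(),
        "property": prop.strip(),
        "value": value.strip(),
    }


print(split_rule("H1 {color: blue}"))
# {'selector': 'H1', 'declaration': 'color: blue', 'property': 'color', 'value': 'blue'}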

It's always important to keep the user in mind. Different users require different tools. WYSIWYG tools are generally easiest to use and best for beginners. Converters and language-oriented authoring tools will help the moderately experienced author cope with conversion headaches and large quantities of text. You might want to keep in mind the following types of Web authors as characterized by Michael Haynes and reprinted with his permission:

The 9 Types of Web Page Creators

Joe/Jane Average College Student

Traits: Owner of a new university-supplied computer account with httpd access. Complete lack of originality. Multiple references to beer/Disney movies. Several photos of Student with college buddies (high school, if freshman Student).

The Good News: They don't know how to get their page linked to the outside world, so only they and their friends download their 16.7-million- color pictures from the last party.

The Bad News: They, their friends and their 16.7-million-color pictures might be on your server.

Mr. "Enhanced For Netscape"

Traits: The second thing you see on his page is a Netscape logo and a link to an ftp site where you can download Netscape <BLINK>NOW!</BLINK>. The first thing you see is about 80 different <TITLE>s scrolling back and forth across your screen.

The Good News: You won't have to look at their pages for long, because there won't be much there to see.

The Bad News: Half of the rest of the people who look at their pages are going to think "Hey, that's cool!" and copy the source.

The Old-Timer

Traits: Pages compatible with HTML 1.0, no graphics and very few attribute tags. Normal-text-size message at top says "This page not enhanced for Netscape. Cope, whipper-snapper."

The Good News: He's likely there because he has something of importance to say.

The Bad News: Whatever it is will likely be boring or far too technical for you.

The 5-Year-Old

Traits: Pictures of their parents, the family pet, etc. More data about the daily life of a kindergartner than you thought possible. Cute "kiddy-talk" dialect to the text. <ADDRESS> contains the note "such-and-such's mother helped her build this page."

The Good News: The first few of these you see give you a warm, fuzzy feeling.

The Bad News: The last few dozen of these you see all look the same.

The Computer Science Major

Traits: Links to the linux FAQ, the Geek Code, Star Wars theme music and DOOM .wad files. Cautious use of Netscape enhancements. Picture of Darth Vader instead of personal pictures. HTML 3.0 (Beta) compliant seal-of-approval at bottom of her page.

The Good News: If you're a geek, you'll find what you're looking for here. Even if you're not, you'll like the page design.

The Bad News: Complete lack of socially redeeming qualities. Unfortunate tendency to upload specs of their home PC.

The Businessman

Traits: Pages without fancy backgrounds and with only one nice, clean, imagemap. Unfortunately, there are no text-links for those using Lynx.

The Good News: You won't go blind staring at his pages.

The Bad News: You might wish you had once you see the prices of the goods/services he's offering.

The Newbie

Traits: Very little created text on their pages, it's almost all links to other people's pages. Missing right brackets in <A HREF>s kill whole lines of information. Several image files are not able to be loaded. <CENTER>.

The Good News: They'll almost have to get better.

The Bad News: They just might not.

The Egotist

Traits: Large image of themself greets you when page is loading. 1/2 Meg .au file of him chatting with his dog. Access counts shown for every page. Several lengthy pages devoted to his compact disk/Magic card/beer bottle collection. More personal details than you'd ever want to know.

The Good News: There isn't any.

The Bad News: Frequently friendly with Mr. "Enhanced for Netscape."

The Maniac

Traits: Last counted 1267 .html files in his public_html directory and 100+ CGI scripts in his cgi-bin directory. Is known as a "Close Personal Friend of Bob [Allison]." Thinks the people at Yahoo! "don't keep up with the Web fast enough." Will be the first on his block to have an ethernet cable hardwired into his brain.

The Good News: You could go through all his pages and never find an error.

The Bad News: You'd never make it through all his pages.

mhaynes@pizza.bgsu.edu

For more information on Web authoring see Section 5.4 HTML in Chapter 5, Document Standards.

Once you have authored your documents and placed them on a Web site, your attention will often turn to security. If you are trying to sell a product or service, security quickly becomes the major concern.


1.6 Web Security

"Relying on the government to protect your privacy is like asking a peeping tom to install your window blinds." John Perry Barlow

Web servers and browsers present a whole range of security problems. Two of the key security issues are the authentication of requests and privacy. These issues boil down to using mechanisms that ensure you are who you say you are. These mechanisms are more important in some Web interactions than in others.

The more complex Web interactions, such as database transactions and shopping, require the execution of programs on the server. These programs are most commonly used when you enter and submit data via a form. The data you enter in the form are sent to the server, where a program does something with them and sends a result, if any, back to you.

Security is a major issue here. In fact, it is the main reason the Common Gateway Interface (CGI) protocol was created. This protocol controls how programs communicate with the Web server.

Typical Web Client/Server Interaction using CGI Script.

CGI gateway programs can be written in any language that can execute on the server machine. Typically, UNIX(15) scripts or other scripting languages are used instead of compiled code, because they are easier to debug and maintain. A terrific book by Ian Graham called "The HTML Sourcebook," published by John Wiley, describes three mechanisms by which data can be passed to the gateway program:

1. Command-Line Arguments: The server launches the gateway program with command-line arguments.

2. Standard Input: The server passes data to the gateway program such that it is read as input (from standard input) by the gateway program.

3. Environment Variables: The server puts information in special environment variables before starting the gateway program. The gateway program can then access these variables and obtain their contents.

These three mechanisms specify data transfer from the Web server to the gateway program. In addition, a CGI program can pass data back to the Web server by either of two mechanisms. As The HTML Sourcebook explains, these are:

1. Write to standard output: The gateway program passes data back to the server by writing data to standard output. This is the only way that gateway programs can return data to a client.

2. The name of the gateway program: Gateway programs with names beginning with the string nph- are called nonparsed header programs and are treated specially by the server.

In general, the server parses the output of a gateway program looking for headers that it can use to create the HTTP response headers it will send to the client with the returned data. If a gateway program name begins with nph-, the server sends the gateway program's output directly to the client and does not add any header information.
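Putting the pieces together: a gateway program reads the environment variables and standard input that the server hands it, and writes a header plus a document to standard output. The following is a minimal sketch of our own in Python, not an example from The HTML Sourcebook; the form handling and the echoed output are deliberately simplistic.

#!/usr/bin/env python3
# Minimal CGI gateway program: read data from the environment and standard
# input, write a header and an HTML document to standard output.
# (Editorial sketch; the form field handling is intentionally bare-bones.)
import os
import sys
from urllib.parse import parse_qs

# 1. Environment variables set by the server before the program starts.
method = os.environ.get("REQUEST_METHOD", "GET")
query = os.environ.get("QUERY_STRING", "")

# 2. For POST requests, the form data arrives on standard input;
#    CONTENT_LENGTH says how many bytes to read.
if method == "POST":
    length = int(os.environ.get("CONTENT_LENGTH", 0) or 0)
    query = sys.stdin.read(length)

fields = parse_qs(query)

# 3. Results go back to the server on standard output: header lines first,
#    a blank line, then the document itself.
sys.stdout.write("Content-Type: text/html\r\n\r\n")
sys.stdout.write("<html><head><title>Echo</title></head><body>\n")
sys.stdout.write("<h1>Form data received</h1>\n<ul>\n")
for name, values in fields.items():
    sys.stdout.write("<li>%s = %s</li>\n" % (name, ", ".join(values)))
sys.stdout.write("</ul></body></html>\n")

The server would take the Content-Type header from this output, build the full HTTP response headers, and relay the HTML back to the browser that submitted the form.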

The behavior and assumptions used by one browser may be different from those used by other browsers, resulting in documents that look different. Similarly, the behavior of secure interfaces must also be scrutinized. Often the implementation of a security algorithm, not the algorithm itself, creates problems. The way a browser interacts with and implements security protections is important. Currently, vendors vary widely in their approaches.

Perhaps the most far-reaching development toward using secure transactions for true electronic commerce is the recent agreement between Visa and MasterCard. MasterCard issued a press release on February 1, 1996, stating, in part:

Addressing consumer concerns about making purchases on the Internet, MasterCard International and Visa International joined together today to announce a technical standard for safeguarding payment-card purchases made over open networks such as the Internet. Prior to this effort, Visa and MasterCard were pursuing separate specifications. The new specification, called Secure Electronic Transactions (SET), represents the successful convergence of those individual efforts. A single standard means that consumers and merchants will be able to conduct bankcard transactions in cyberspace as securely and easily as they do in retail stores today.

The associations expect to publish SET on their World Wide Web sites in mid-February. Following a comment period, the joint specification is scheduled to be ready for testing in the second quarter 1996. Visa and MasterCard expect that banks will be able to offer secure bankcard services via the Internet to their cardholders in the fourth quarter 1996.

Using the Web to make purchases is currently a little risky. Credit card numbers and other types of confidential information were never intended to be sent through the Internet. The widely distributed, unregulated, open nature of the Internet is the antithesis of a secure system. Of course, thanks to the mathematically obscure field of cryptography, all hope is not lost. The issue is how to make usable the various types and forms of encryption.

1.6.1 Digital Signatures and Public Key Cryptography

In the real world, we sign all sorts of legal documents, contracts, checks, time slips, and other items. A signature is your unique identification; it is your seal of approval, showing that you have read, approved, and agreed with the document. In the electronic world, we must create the equivalent: a digital signature. Unfortunately, it is easy to fake an electronic name. What is necessary is some magic way to ensure that a signature is legitimate, not a forgery. That magic is what's known as Public Key Cryptography.

Think about what a digital signature really is. When you look at a "signed" document, you want to be positive that the signature is authentic, that the person (your boss, for example) really signed the message (especially if he's terminating your employment). You want to be able to pass a magic wand over the signed document that lets you read the document and know that your boss actually signed it.

Encrypted messages are unreadable unless you have the secret decoder ring, the key to decrypting the message. Most encryption schemes currently use a single encryption key. For example, the password you use to log on to a computer system is encrypted with a single key. You type the password and the computer lets you in. It's simple and still quite secure, but it does not provide a way for the computer system to ensure (authenticate) that you are who you say you are. If someone steals your password, they effectively steal your identity.

The Data Encryption Standard (DES) has been used for many years and is the basis of the UNIX password system.(16) The U.S. government has kept the design rationale behind parts of the DES algorithm classified, and many variations have been developed because people assume that a secret trapdoor exists for government eavesdropping. Whether or not this is true, recent advances in a statistical technique called differential cryptanalysis can be used to attack DES. Protections against this attack have been created. The current "state of the art" is triple DES: three passes of the algorithm using 112- or 168-bit keys.

Public Key Cryptography involves the use of two keys. Each person in a transaction has a public key and a private key. Everyone can see a person's public key, but individuals keep their private keys private. The two keys are intimately related to each other and are generated at the same time by the cryptography program, such as PGP, that you are using.

General public key signing method: the secret key is used for signing and the public key is used for verification.(17)

A message encrypted by one key can be decrypted only by the other. In practice, this means that if my boss wants to send a message that only I can read, he encrypts it using my public key. When I receive the message, only I can decode it, using my private key. If my boss wants to send a signed message to a lot of people in the company, he encrypts it with his private key, and everyone can decrypt it with the boss's public key, ensuring that he originated the document.
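To make the relationship between the two keys concrete, here is a toy worked example using textbook-sized RSA numbers. This is our illustration only; real programs such as PGP use keys hundreds of digits long and sign a cryptographic digest of the message rather than the message itself.

# Toy RSA key pair with tiny textbook primes (p=61, q=53), far too small to be
# secure, used only to show that what one key locks, the other unlocks.
n = 61 * 53                 # public modulus (3233)
e = 17                      # public exponent: (n, e) is the public key
d = 2753                    # private exponent: (n, d) is the private key

message = 1234              # a "message" small enough to fit under the modulus

# Signing: the boss transforms the message with his PRIVATE key.
signature = pow(message, d, n)

# Verification: anyone can undo the transformation with his PUBLIC key.
recovered = pow(signature, e, n)
print(signature, recovered, recovered == message)    # ... 1234 True

# Confidentiality runs the other way: encrypt with the recipient's PUBLIC key,
# and only the holder of the matching PRIVATE key can decrypt.
ciphertext = pow(message, e, n)
print(pow(ciphertext, d, n) == message)              # True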

Pretty Good Privacy (PGP)(18) is a public-domain implementation of public key cryptography written by Phil Zimmermann and distributed by MIT. The program has generated controversy, pitting law enforcement agencies against privacy advocates. No matter what side of the battle you are on, the genie is out of the bottle and is never going back.

In the construction of a secure transaction system, one golden rule is to never, ever send clear text through the net. The information must be encrypted on the local client and transmitted in encrypted form. Netscape browsers have a nice user interface feature, a blue bar, which lights up when you are in a secure transaction mode. In addition, when you fill out forms, Netscape and other browsers will warn you if the information you are about to transmit is insecure. This is a configurable option.

1.6.2 Firewalls and Proxies

Many organizations are understandably reluctant to give outsiders access to their internal computer systems. Press stories about computer break-ins and hackers are a staple. The principal technical solution is to create a "firewall." The idea is to leave all of the organization's computing network infrastructure alone, but to have a single point through which all traffic to and from the Internet must pass. One particular system is designated as the firewall machine, and additional security measures can be taken on it. Restricted access based, for example, on the domain name can be implemented on the one firewall machine, which checks each request before passing the information on to the destination machine.

Once a firewall is set up, "proxy" servers must also be put into place for the users inside, behind the firewall. Proxy servers invisibly look at requests and pass them to the outside world. For example, if I am behind a firewall and I make a request to ftp (File Transfer Protocol) a file from a machine outside, the ftp proxy first looks at my request, then passes it on and makes the connection. Proxies must be set up on a per-service basis: HTTP, FTP, Gopher, and other services would each be given a designated proxy through which the information passes. Typically, these proxies are specified in a configuration section of the Web browser.
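Browsers are not the only programs that need to know about the proxy; anything fetching documents from behind the firewall does. As a small sketch of our own (the proxy host name is hypothetical), this is how a Python script would be pointed at per-service proxies, much as a browser's configuration screen is:

# Point an HTTP client at per-service proxies, the programmatic equivalent of
# filling in a browser's proxy configuration screen.
# (Editorial sketch; "proxy.example.com" is a hypothetical firewall host.)
from urllib.request import ProxyHandler, build_opener

proxy_support = ProxyHandler({
    "http": "http://proxy.example.com:8080",   # proxy for HTTP requests
    "ftp": "http://proxy.example.com:8080",    # proxy for FTP requests
})
opener = build_opener(proxy_support)

# Every request made through this opener is relayed via the proxy machine,
# which is the only host allowed to talk through the firewall.
page = opener.open("http://www.ics.uci.edu/").read()
print(len(page), "bytes retrieved via the proxy")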

From a technological point of view, security issues can be addressed in many ways. According to Nicholas Baran in an article "The Greatest Show on Earth,"(19)

Today there are two basic approaches to secure electronic commerce. The first one focuses on protecting resources by securing individual servers and network sites. This access security is generally addressed by firewalls or other means of 'perimeter' security. The second approach focuses on transaction security. Transaction security addresses unauthorized listening in or eavesdropping on buyer/seller communications; authentication, so both parties are confident they know who they're talking to; message integrity, so the message contents can't be changed or tampered with; and a nonrepudiable record of the transaction in the form of a receipt or signature.

Secure transactions are the critical piece of technology, just beginning to be deployed, that will enable meaningful electronic commerce. As a result, confidential transactions and the use of anonymous digital cash are beginning to appear as realistic purchasing options.(20)

DigiCash has created Ecash, electronic cash with many of the advantages of real cash.

An Ecash withdrawal from a bank

In addition to anonymous cash transactions, secure credit card purchases will probably become even more widespread. The infrastructure for both the client browser and the merchant is rapidly coming into place. One company trying to put all the pieces together is CyberCash.

CyberCash's secure financial transaction technology is used by Virtual Vineyards to sell wines on-line. Unbeknownst to the buyer, the transaction goes something like the following:

1. The customer clicks on the CyberCash icon to establish a link between the customer, Virtual Vineyards, and the participating bank, Wells Fargo.

2. The customer fills out credit card information, which is encrypted (using 768-bit encryption) and sent to the CyberCash server.

3. The CyberCash server initiates a standard credit card authorization request to the bank. Once processed, CyberCash sends an electronic receipt and credit card authorization to Virtual Vineyards.

The whole process takes several seconds.(21)