Now that you have a general idea of how this spider works, go to the book’s website and download the required scripts. Play with the initialization settings, use different seed URLs, and see what happens.
Consider these three warnings before you start:
Use a respectful $FETCH_DELAY of at least a second or two so you don’t create a denial-of-service (DoS) attack by consuming so much bandwidth that others cannot use the web pages you target. Better yet, read Chapter 31 before you begin.
Keep the maximum penetration level set to a low value like 1 or 2. This spider is designed for simplicity, not scalability; the number of harvested links grows rapidly with each level, so if you penetrate too deeply into your seed URL, your computer can run out of memory.
For best results, run spider scripts within a command shell, not through a browser.
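The first two warnings can be illustrated with a minimal sketch. It is written in Python for brevity and is not the book's downloadable script. The names crawl, get_links, max_penetration, and fetch_delay are hypothetical, chosen only to mirror the settings discussed above. The sketch assumes the caller supplies a get_links function that downloads one page and returns the URLs found on it. Pages at the deepest level are recorded but not fetched.

```python
import time
from typing import Callable, Dict, Iterable, List

FETCH_DELAY = 2.0       # seconds to pause between requests (assumed default)
MAX_PENETRATION = 1     # link levels to follow below the seed (assumed default)


def crawl(seed: str,
          get_links: Callable[[str], Iterable[str]],
          max_penetration: int = MAX_PENETRATION,
          fetch_delay: float = FETCH_DELAY,
          sleep: Callable[[float], None] = time.sleep) -> Dict[str, int]:
    """Harvest links level by level and return {url: level}.

    get_links is a caller-supplied (hypothetical) function that fetches
    one page and returns the URLs it links to.
    """
    seen: Dict[str, int] = {seed: 0}    # every URL found so far, with its level
    current_level: List[str] = [seed]
    fetched = 0

    for level in range(max_penetration):
        next_level: List[str] = []
        for url in current_level:
            if fetched:
                sleep(fetch_delay)       # be polite: pause before each new request
            fetched += 1
            for link in get_links(url):
                if link not in seen:     # skip pages already harvested
                    seen[link] = level + 1
                    next_level.append(link)
        current_level = next_level       # memory use grows with each level

    return seen
```

Because every discovered URL is kept in memory, the seen table grows with each additional level, which is why a low penetration level matters for a simple spider like this one. Passing a replacement for sleep, as the last parameter allows, makes it possible to try the logic against an in-memory table of links without waiting or touching a real website.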