Index
Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Old-School Client-Server Technology
The Problem with Browsers
What to Expect from This Book
Learn from My Mistakes
Master Webbot Techniques
Leverage Existing Scripts
About the Website
About the Code
Requirements
Hardware
Software
Internet Access
A Disclaimer (This Is Important)
I. Fundamental Concepts and Techniques
1. What’s in It for You?
Uncovering the Internet’s True Potential
What’s in It for Developers?
Webbot Developers Are in Demand
Webbots Are Fun to Write
Webbots Facilitate “Constructive Hacking”
What’s in It for Business Leaders?
Customize the Internet for Your Business
Capitalize on the Public’s Inexperience with Webbots
Accomplish a Lot with a Small Investment
Final Thoughts
2. Ideas for Webbot Projects
Inspiration from Browser Limitations
Webbots That Aggregate and Filter Information for Relevance
Webbots That Interpret What They Find Online
Webbots That Act on Your Behalf
A Few Crazy Ideas to Get You Started
Help Out a Busy Executive
Save Money by Automating Tasks
Protect Intellectual Property
Monitor Opportunities
Verify Access Rights on a Website
Create an Online Clipping Service
Plot Unauthorized Wi-Fi Networks
Track Web Technologies
Allow Incompatible Systems to Communicate
Final Thoughts
3. Downloading Web Pages
Think About Files, Not Web Pages
Downloading Files with PHP’s Built-in Functions
Downloading Files with fopen() and fgets()
Creating Your First Webbot Script
Executing Webbots in Command Shells
Executing Webbots in Browsers
Downloading Files with file()
Introducing PHP/CURL
Multiple Transfer Protocols
Form Submission
Basic Authentication
Cookies
Redirection
Agent Name Spoofing
Referer Management
Socket Management
Installing PHP/CURL
LIB_http
Familiarizing Yourself with the Default Values
Using LIB_http
http_get()
http_get_withheader()
Learning More About HTTP Headers
Examining LIB_http’s Source Code
LIB_http Defaults
LIB_http Functions
Final Thoughts
4. Basic Parsing Techniques
Content Is Mixed with Markup
Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Splitting a String at a Delimiter: split_string()
Parsing Text Between Delimiters: return_between()
Parsing a Data Set into an Array: parse_array()
Parsing Attribute Values: get_attribute()
Removing Unwanted Text: remove()
Useful PHP Functions
Detecting Whether a String Is Within Another String
Replacing a Portion of a String with Another String
Parsing Unformatted Text
Measuring the Similarity of Strings
Final Thoughts
Don’t Trust a Poorly Coded Web Page
Parse in Small Steps
Don’t Render Parsed Text While Debugging
Use Regular Expressions Sparingly
5. Advanced Parsing with Regular Expressions
Pattern Matching, the Key to Regular Expressions
PHP Regular Expression Types
PHP Regular Expressions Functions
preg_replace(pattern, replacement, subject)
preg_match(pattern, subject)
preg_match_all(pattern, subject, result_array)
preg_split(pattern, subject)
Resemblance to PHP Built-In Functions
Learning Patterns Through Examples
Parsing Numbers
Detecting a Series of Characters
Matching Alpha Characters
Matching on Wildcards
Specifying Alternate Matches
Regular Expressions Groupings and Ranges
Regular Expressions of Particular Interest to Webbot Developers
Parsing Phone Numbers
Where to Go from Here
When Regular Expressions Are (or Aren’t) the Right Parsing Tool
Strengths of Regular Expressions
Disadvantages of Pattern Matching While Parsing Web Pages
Regular Expressions Provide Little (If Any) Context
Regular Expressions Provide Too Many Choices
Regular Expressions Are Harder to Debug
Regular Expressions Complicate Your Code
Which Are Faster: Regular Expressions or PHP’s Built-In Functions?
Final Thoughts
6. Automating Form Submission
Reverse Engineering Form Interfaces
Form Handlers, Data Fields, Methods, and Event Triggers
Form Handlers
Data Fields
Methods
The GET Method
The POST Method
Multipart Encoding
Event Triggers
Unpredictable Forms
JavaScript Can Change a Form Just Before Submission
Form HTML Is Often Unreadable by Humans
Cookies Aren’t Included in the Form, but Can Affect Operation
Analyzing a Form
Final Thoughts
Don’t Blow Your Cover
Correctly Emulate Browsers
Avoid Form Errors
7. Managing Large Amounts of Data
Organizing Data
Naming Conventions
Storing Data in Structured Files
Storing Text in a Database
LIB_mysql
The insert() Function
The update() Function
The exe_sql() Function
Storing Images in a Database
Database or File?
Making Data Smaller
Storing References to Image Files
Compressing Data
Compressing Inbound Files
Compressing Files on Your Hard Drive
Removing Formatting
Thumbnailing Images
Final Thoughts
II. Projects
8. Price-Monitoring Webbots
The Target
Designing the Parsing Script
Initialization and Downloading the Target
Further Exploration
9. Image-Capturing Webbots
Example Image-Capturing Webbot
Creating the Image-Capturing Webbot
Binary-Safe Download Routine
Directory Structure
The Main Script
Initialization and Target Validation
Defining the Page Base
Creating a Root Directory for Imported File Structure
Parsing Image Tags from the Downloaded Web Page
The Image-Processing Loop
Creating the Local Directory Structure
Downloading and Saving the File
Further Exploration
Final Thoughts
10. Link-Verification Webbots
Creating the Link-Verification Webbot
Initializing the Webbot and Downloading the Target
Setting the Page Base
Parsing the Links
Running a Verification Loop
Generating Fully Resolved URLs
Downloading the Linked Page
Displaying the Page Status
Running the Webbot
LIB_http_codes
LIB_resolve_addresses
Further Exploration
11. Search-Ranking Webbots
Description of a Search Result Page
What the Search-Ranking Webbot Does
Running the Search-Ranking Webbot
How the Search-Ranking Webbot Works
The Search-Ranking Webbot Script
Initializing Variables
Starting the Loop
Fetching the Search Results
Parsing the Search Results
Final Thoughts
Be Kind to Your Sources
Search Sites May Treat Webbots Differently Than Browsers
Spidering Search Engines Is a Bad Idea
Familiarize Yourself with the Google API
Further Exploration
12. Aggregation Webbots
Choosing Data Sources for Webbots
Example Aggregation Webbot
Familiarizing Yourself with RSS Feeds
Writing the Aggregation Webbot
Downloading and Parsing the Target
Dealing with CDATA
Adding Filtering to Your Aggregation Webbot
Further Exploration
13. FTP Webbots
Example FTP Webbot
PHP and FTP
Further Exploration
14. Webbots That Read Email
The POP3 Protocol
Logging into a POP3 Mail Server
Reading Mail from a POP3 Mail Server
The POP3 LIST Command
The POP3 RETR Command
Other Useful POP3 Commands
Executing POP3 Commands with a Webbot
Further Exploration
Email-Controlled Webbots
Email Interfaces
15. Webbots That Send Email
Email, Webbots, and Spam
Sending Mail with SMTP and PHP
Configuring PHP to Send Mail
Sending an Email with mail()
Writing a Webbot That Sends Email Notifications
Keeping Legitimate Mail out of Spam Filters
Sending HTML-Formatted Email
Further Exploration
Using Returned Emails to Prune Access Lists
Using Email as Notification That Your Webbot Ran
Leveraging Wireless Technologies
Writing Webbots That Send Text Messages
16. Converting a Website into a Function
Writing a Function Interface
Defining the Interface
Analyzing the Target Web Page
Using describe_zipcode()
Getting the Session Value
Submitting the Form
Parsing and Returning the Result
Final Thoughts
Distributing Resources
Using Standard Interfaces
Designing a Custom Lightweight “Web Service”
III. Advanced Technical Considerations
17. Spiders
How Spiders Work
Example Spider
LIB_simple_spider
harvest_links()
archive_links()
get_domain()
exclude_link()
Experimenting with the Spider
Adding the Payload
Further Exploration
Save Links in a Database
Separate the Harvest and Payload
Distribute Tasks Across Multiple Computers
Regulate Page Requests
18. Procurement Webbots and Snipers
Procurement Webbot Theory
Get Purchase Criteria
Authenticate Buyer
Verify Item
Evaluate Purchase Triggers
Make Purchase
Evaluate Results
Sniper Theory
Get Purchase Criteria
Authenticate Buyer
Verify Item
Synchronize Clocks
Time to Bid?
Submit Bid
Evaluate Results
Testing Your Own Webbots and Snipers
Further Exploration
Final Thoughts
19. Webbots and Cryptography
Designing Webbots That Use Encryption
SSL and PHP Built-in Functions
Encryption and PHP/CURL
A Quick Overview of Web Encryption
Final Thoughts
20. Authentication
What Is Authentication?
Types of Online Authentication
Strengthening Authentication by Combining Techniques
Authentication and Webbots
Example Scripts and Practice Pages
Basic Authentication
Session Authentication
Authentication with Cookie Sessions
How Cookies Work
Cookie Session Example
Authentication with Query Sessions
Final Thoughts
21. Advanced Cookie Management
How Cookies Work
PHP/CURL and Cookies
How Cookies Challenge Webbot Design
Purging Temporary Cookies
Managing Multiple Users’ Cookies
Further Exploration
22. Scheduling Webbots and Spiders
Preparing Your Webbots to Run as Scheduled Tasks
The Windows XP Task Scheduler
Scheduling a Webbot to Run Daily
Complex Schedules
The Windows 7 Task Scheduler
Non-calendar-based Triggers
Final Thoughts
Determine the Webbot’s Best Periodicity
Avoid Single Points of Failure
Add Variety to Your Schedule
23. Scraping Difficult Websites with Browser Macros
Barriers to Effective Web Scraping
AJAX
Bizarre JavaScript and Cookie Behavior
Flash
Overcoming Webscraping Barriers with Browser Macros
What Is a Browser Macro?
The Ultimate Browser-Like Webbot
Installing and Using iMacros
Creating Your First Macro
Macro Initialization
Recording the Google Session
iMacros Commands
Instructions You’ll Want in Every Macro
Running a Macro
Final Thoughts
Are Macros Really Necessary?
Other Uses
24. Hacking iMacros
Hacking iMacros for Added Functionality
Reasons for Not Using the iMacros Scripting Engine
Creating a Dynamic Macro
Writing a Script That Creates a Dynamic Macro
Integrating External Data into Dynamically Created Macros
Launching iMacros Automatically
Launching iMacros from Windows
Launching iMacros from Linux
Further Exploration
25. Deployment and Scaling
One-to-Many Environment
One-to-One Environment
Many-to-Many Environment
Many-to-One Environment
Scaling and Denial-of-Service Attacks
Even Simple Webbots Can Generate a Lot of Traffic
Inefficiencies at the Target
The Problems with Scaling Too Well
Creating Multiple Instances of a Webbot
Forking Processes
Leveraging the Operating System
Distributing the Task over Multiple Computers
Managing a Botnet
Botnet Communication Methods
Polling the Botnet Server
Determining If There Is a Task for the Harvester to Perform
The Checkout Process
Assigning Tasks
Performing Tasks
Uploading Harvested Data
Processing the Harvested Data
Further Exploration
IV. Larger Considerations
26. Designing Stealthy Webbots and Spiders
Why Design a Stealthy Webbot?
Log Files
Access Logs
Error Logs
Custom Logs
Log-Monitoring Software
Stealth Means Simulating Human Patterns
Be Kind to Your Resources
Run Your Webbot During Busy Hours
Don’t Run Your Webbot at the Same Time Each Day
Don’t Run Your Webbot on Holidays and Weekends
Use Random, Intra-fetch Delays
Final Thoughts
27. Proxies
What Is a Proxy?
Proxies in the Virtual World
Why Webbot Developers Use Proxies
Using Proxies to Become Anonymous
Using a Proxy to Be Somewhere Else
Using a Proxy Server
Using a Proxy in a Browser
Using a Proxy with PHP/CURL
Types of Proxy Servers
Open Proxies
Types of Open Proxies
The Dark Side of Open Proxies
More About Open Proxy Listing Services
Tor
Using Tor
Configuring PHP/CURL to Use Tor
Disadvantages of Tor
Commercial Proxies
Final Thoughts
Anonymity Is a Process, Not a Feature
Creating Your Own Proxy Service
28. Writing Fault-Tolerant Webbots
Types of Webbot Fault Tolerance
Adapting to Changes in URLs
Avoid Making Requests for Pages That Don’t Exist
Follow Page Redirections
Maintain the Accuracy of Referer Values
Adapting to Changes in Page Content
Avoid Position Parsing
Use Relative Parsing
Look for Landmarks That Are Least Likely to Change
Adapting to Changes in Forms
Adapting to Changes in Cookie Management
Adapting to Network Outages and Network Congestion
Error Handlers
Further Exploration
29. Designing Webbot-Friendly Websites
Optimizing Web Pages for Search Engine Spiders
Well-Defined Links
Google Bombs and Spam Indexing
Title Tags
Meta Tags
Header Tags
Image alt Attributes
Web Design Techniques That Hinder Search Engine Spiders
JavaScript
Non-ASCII Content
Designing Data-Only Interfaces
XML
Lightweight Data Exchange
How Not to Design a Lightweight Interface
A Safer Method of Passing Variables to Webbots
SOAP
Advantages of SOAP
Disadvantages of SOAP
REST
Final Thoughts
30. Killing Spiders
Asking Nicely
Create a Terms of Service Agreement
Use the robots.txt File
Use the Robots Meta Tag
Building Speed Bumps
Selectively Allow Access to Specific Web Agents
Use Obfuscation
Use Cookies, Encryption, JavaScript, and Redirection
Authenticate Users
Update Your Site Often
Embed Text in Other Media
Setting Traps
Create a Spider Trap
Fun Things to Do with Unwanted Spiders
Final Thoughts
31. Keeping Webbots out of Trouble
It’s All About Respect
Copyright
Do Consult Resources
Don’t Be an Armchair Lawyer
Copyrights Do Not Have to Be Registered
Assume “All Rights Reserved”
You Cannot Copyright a Fact
You Can Copyright a Collection of Facts if Presented Creatively
You Can Use Some Material Under Fair Use Laws
Trespass to Chattels
Internet Law
Final Thoughts
A. PHP/CURL Reference
Creating a Minimal PHP/CURL Session
Initiating PHP/CURL Sessions
Setting PHP/CURL Options
CURLOPT_URL
CURLOPT_RETURNTRANSFER
CURLOPT_REFERER
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
CURLOPT_USERAGENT
CURLOPT_NOBODY and CURLOPT_HEADER
CURLOPT_TIMEOUT
CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
CURLOPT_HTTPHEADER
CURLOPT_SSL_VERIFYPEER
CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH
CURLOPT_POST and CURLOPT_POSTFIELDS
CURLOPT_VERBOSE
CURLOPT_PORT
Executing the PHP/CURL Command
Retrieving PHP/CURL Session Information
Viewing PHP/CURL Errors
Closing PHP/CURL Sessions
B. Status Codes
HTTP Codes
NNTP Codes
C. SMS Gateways
Sending Text Messages
Reading Text Messages
A Sampling of Text Message Email Addresses
Index
About the Author
Colophon