Imperial Library

Index
Title Page | Copyright | Credits | About the Authors | About the Reviewers | www.PacktPub.com | Customer Feedback | Preface
What this book covers | What you need for this book | Who this book is for | Conventions | Reader feedback | Customer support
Downloading the example code | Errata | Piracy | Questions
Introduction to Web Scraping
When is web scraping useful? | Is web scraping legal? | Python 3 | Background research
Checking robots.txt | Examining the Sitemap | Estimating the size of a website | Identifying the technology used by a website | Finding the owner of a website
Crawling your first website
Scraping versus crawling | Downloading a web page
Retrying downloads | Setting a user agent
Sitemap crawler | ID iteration crawler | Link crawlers
Advanced features
Parsing robots.txt | Supporting proxies | Throttling downloads | Avoiding spider traps | Final version
Using the requests library
Summary
Scraping the Data
Analyzing a web page | Three approaches to scrape a web page
Regular expressions | Beautiful Soup | Lxml
CSS selectors and your Browser Console | XPath Selectors | LXML and Family Trees | Comparing performance | Scraping results
Overview of Scraping | Adding a scrape callback to the link crawler
Summary
Caching Downloads
When to use caching? | Adding cache support to the link crawler | Disk Cache
Implementing DiskCache | Testing the cache | Saving disk space | Expiring stale data | Drawbacks of DiskCache
Key-value storage cache
What is key-value storage? | Installing Redis | Overview of Redis | Redis cache implementation | Compression | Testing the cache | Exploring requests-cache
Summary
Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler | Threaded crawler | How threads and processes work
Implementing a multithreaded crawler | Multiprocessing crawler
Performance | Summary
Dynamic Content
An example dynamic web page | Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Debugging with Qt
Executing JavaScript | Website interaction with WebKit
Waiting for results
The Render class
Selenium
Selenium and Headless Browsers
Summary
Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content | Automating forms with Selenium
"Humanizing" methods for Web Scraping
Summary
Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical character recognition
Further improvements
Solving complex CAPTCHAs | Using a CAPTCHA solving service
Getting started with 9kw
The 9kw CAPTCHA API
Reporting errors | Integrating with registration
CAPTCHAs and machine learning | Summary
Scrapy
Installing Scrapy | Starting a project
Defining a model | Creating a spider
Tuning settings | Testing the spider
Different Spider Types | Scraping with the shell command
Checking results | Interrupting and resuming a crawl
Scrapy Performance Tuning
Visual scraping with Portia
Installation | Annotation | Running the Spider | Checking results
Automated scraping with Scrapely | Summary
Putting It All Together
Google search engine | Facebook
The website | Facebook API
Gap | BMW | Summary
