Python Web Scraping by Jarmul, Katharine

Index

Title Page
Copyright
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface

What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support

Downloading the example code
Errata
Piracy
Questions

Introduction to Web Scraping

When is web scraping useful?
Is web scraping legal?
Python 3
Background research

Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website

Crawling your first website

Scraping versus crawling
Downloading a web page

Retrying downloads
Setting a user agent

Sitemap crawler
ID iteration crawler
Link crawlers

Advanced features

Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version

Using the requests library

Summary

Scraping the Data

Analyzing a web page
Three approaches to scrape a web page

Regular expressions
Beautiful Soup
Lxml

CSS selectors and your Browser Console
XPath Selectors
LXML and Family Trees
Comparing performance
Scraping results

Overview of Scraping
Adding a scrape callback to the link crawler

Summary

Caching Downloads

When to use caching?
Adding cache support to the link crawler
Disk Cache

Implementing DiskCache
Testing the cache
Saving disk space
Expiring stale data
Drawbacks of DiskCache

Key-value storage cache

What is key-value storage?
Installing Redis
Overview of Redis
Redis cache implementation
Compression
Testing the cache
Exploring requests-cache

Summary

Concurrent Downloading

One million web pages

Parsing the Alexa list

Sequential crawler
Threaded crawler
How threads and processes work

Implementing a multithreaded crawler
Multiprocessing crawler

Performance
Summary

Dynamic Content

An example dynamic web page
Reverse engineering a dynamic web page

Edge cases

Rendering a dynamic web page

PyQt or PySide

Debugging with Qt

Executing JavaScript
Website interaction with WebKit

Waiting for results

The Render class

Selenium

Selenium and Headless Browsers

Summary

Interacting with Forms

The Login form

Loading cookies from the web browser

Extending the login script to update content
Automating forms with Selenium

"Humanizing" methods for Web Scraping

Summary

Solving CAPTCHA

Registering an account

Loading the CAPTCHA image

Optical character recognition

Further improvements

Solving complex CAPTCHAs
Using a CAPTCHA solving service

Getting started with 9kw

The 9kw CAPTCHA API

Reporting errors
Integrating with registration

CAPTCHAs and machine learning
Summary

Scrapy

Installing Scrapy
Starting a project

Defining a model
Creating a spider

Tuning settings
Testing the spider

Different Spider Types
Scraping with the shell command

Checking results
Interrupting and resuming a crawl

Scrapy Performance Tuning

Visual scraping with Portia

Installation
Annotation
Running the Spider
Checking results

Automated scraping with Scrapely
Summary

Putting It All Together

Google search engine
Facebook

The website
Facebook API

Gap
BMW
Summary
