Python Web Scraping, Second Edition, by Richard Lawson


Python Web Scraping, Second Edition
    Title Page
    Copyright
    Credits
    About the Authors
    About the Reviewers
    www.PacktPub.com
    Customer Feedback
    Table of Contents

Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
    Downloading the example code
    Errata
    Piracy
    Questions

Introduction to Web Scraping
    When is web scraping useful?
    Is web scraping legal?
    Python 3
    Background research
    Checking robots.txt
    Examining the Sitemap
    Estimating the size of a website
    Identifying the technology used by a website
    Finding the owner of a website
    Crawling your first website
    Scraping versus crawling
    Downloading a web page
    Retrying downloads
    Setting a user agent
    Sitemap crawler
    ID iteration crawler
    Link crawlers
    Advanced features
    Parsing robots.txt
    Supporting proxies
    Throttling downloads
    Avoiding spider traps
    Final version
    Using the requests library
    Summary

Scraping the Data
    Analyzing a web page
    Three approaches to scrape a web page
    Regular expressions
    Beautiful Soup
    Lxml
    CSS selectors and your Browser Console
    XPath Selectors
    LXML and Family Trees
    Comparing performance
    Scraping results
    Overview of Scraping
    Adding a scrape callback to the link crawler
    Summary

Caching Downloads
    When to use caching?
    Adding cache support to the link crawler
    Disk Cache
    Implementing DiskCache
    Testing the cache
    Saving disk space
    Expiring stale data
    Drawbacks of DiskCache
    Key-value storage cache
    What is key-value storage?
    Installing Redis
    Overview of Redis
    Redis cache implementation
    Compression
    Testing the cache
    Exploring requests-cache
    Summary

Concurrent Downloading
    One million web pages
    Parsing the Alexa list
    Sequential crawler
    Threaded crawler
    How threads and processes work
    Implementing a multithreaded crawler
    Multiprocessing crawler
    Performance
    Summary

Dynamic Content
    An example dynamic web page
    Reverse engineering a dynamic web page
    Edge cases
    Rendering a dynamic web page
    PyQt or PySide
    Debugging with Qt
    Executing JavaScript
    Website interaction with WebKit
    Waiting for results
    The Render class
    Selenium
    Selenium and Headless Browsers
    Summary

Interacting with Forms
    The Login form
    Loading cookies from the web browser
    Extending the login script to update content
    Automating forms with Selenium
    "Humanizing" methods for Web Scraping
    Summary

Solving CAPTCHA
    Registering an account
    Loading the CAPTCHA image
    Optical character recognition
    Further improvements
    Solving complex CAPTCHAs
    Using a CAPTCHA solving service
    Getting started with 9kw
    The 9kw CAPTCHA API
    Reporting errors
    Integrating with registration
    CAPTCHAs and machine learning
    Summary

Scrapy
    Installing Scrapy
    Starting a project
    Defining a model
    Creating a spider
    Tuning settings
    Testing the spider
    Different Spider Types
    Scraping with the shell command
    Checking results
    Interrupting and resuming a crawl
    Scrapy Performance Tuning
    Visual scraping with Portia
    Installation
    Annotation
    Running the Spider
    Checking results
    Automated scraping with Scrapely
    Summary

Putting It All Together
    Google search engine
    Facebook
    The website
    Facebook API
    Gap
    BMW
    Summary
