Python Web Scraping Cookbook by Heydt, Michael

Index

Title Page Copyright and Credits

Python Web Scraping Cookbook

Contributors

About the author About the reviewers Packt is searching for authors like you

Packt Upsell

Why subscribe? PacktPub.com

Preface

Who this book is for What this book covers To get the most out of this book

Download the example code files Conventions used

Get in touch

Reviews

Getting Started with Scraping

Introduction Setting up a Python development environment

Getting ready How to do it...

Scraping Python.org with Requests and Beautiful Soup

Getting ready... How to do it... How it works...

Scraping Python.org in urllib3 and Beautiful Soup

Getting ready... How to do it... How it works There's more...

Scraping Python.org with Scrapy

Getting ready... How to do it... How it works

Scraping Python.org with Selenium and PhantomJS

Getting ready How to do it... How it works There's more...

Data Acquisition and Extraction

Introduction How to parse websites and navigate the DOM using BeautifulSoup

Getting ready How to do it... How it works There's more...

Searching the DOM with Beautiful Soup's find methods

Getting ready How to do it...

Querying the DOM with XPath and lxml

Getting ready How to do it... How it works There's more...

Querying data with XPath and CSS selectors

Getting ready How to do it... How it works There's more...

Using Scrapy selectors

Getting ready How to do it... How it works There's more...

Loading data in unicode / UTF-8

Getting ready How to do it... How it works There's more...

Processing Data

Introduction Working with CSV and JSON data

Getting ready How to do it How it works There's more...

Storing data using AWS S3

Getting ready How to do it How it works There's more...

Storing data using MySQL

Getting ready How to do it How it works There's more...

Storing data using PostgreSQL

Getting ready How to do it How it works There's more...

Storing data in Elasticsearch

Getting ready How to do it How it works There's more...

How to build robust ETL pipelines with AWS SQS

Getting ready How to do it - posting messages to an AWS queue How it works How to do it - reading and processing messages How it works There's more...

Working with Images, Audio, and other Assets

Introduction Downloading media content from the web

Getting ready How to do it How it works There's more...

Parsing a URL with urllib to get the filename

Getting ready How to do it How it works There's more...

Determining the type of content for a URL

Getting ready How to do it How it works There's more...

Determining the file extension from a content type

Getting ready How to do it How it works There's more...

Downloading and saving images to the local file system

How to do it How it works There's more...

Downloading and saving images to S3

Getting ready How to do it How it works There's more...

Generating thumbnails for images

Getting ready How to do it How it works

Taking a screenshot of a website

Getting ready How to do it How it works

Taking a screenshot of a website with an external service

Getting ready How to do it How it works There's more...

Performing OCR on an image with pytesseract

Getting ready How to do it How it works There's more...

Creating a Video Thumbnail

Getting ready How to do it How it works There's more...

Ripping an MP4 video to an MP3

Getting ready How to do it There's more...

Scraping - Code of Conduct

Introduction Scraping legality and scraping politely

Getting ready How to do it

Respecting robots.txt

Getting ready How to do it How it works There's more...

Crawling using the sitemap

Getting ready How to do it How it works There's more...

Crawling with delays

Getting ready How to do it How it works There's more...

Using identifiable user agents

How to do it How it works There's more...

Setting the number of concurrent requests per domain

How it works

Using auto throttling

How to do it How it works There's more...

Using an HTTP cache for development

How to do it How it works There's more...

Scraping Challenges and Solutions

Introduction Retrying failed page downloads

How to do it How it works

Supporting page redirects

How to do it How it works

Waiting for content to be available in Selenium

How to do it How it works

Limiting crawling to a single domain

How to do it How it works

Processing infinitely scrolling pages

Getting ready How to do it How it works There's more...

Controlling the depth of a crawl

How to do it How it works

Controlling the length of a crawl

How to do it How it works

Handling paginated websites

Getting ready How to do it How it works There's more...

Handling forms and forms-based authorization

Getting ready How to do it How it works There's more...

Handling basic authorization

How to do it How it works There's more...

Preventing bans by scraping via proxies

Getting ready How to do it How it works

Randomizing user agents

How to do it

Caching responses

How to do it There's more...

Text Wrangling and Analysis

Introduction Installing NLTK

How to do it

Performing sentence splitting

How to do it There's more...

Performing tokenization

How to do it

Performing stemming

How to do it

Performing lemmatization

How to do it

Determining and removing stop words

How to do it There's more...

Calculating the frequency distributions of words

How to do it There's more...

Identifying and removing rare words

How to do it

Identifying and removing rare words

How to do it

Removing punctuation marks

How to do it There's more...

Piecing together n-grams

How to do it There's more...

Scraping a job listing from StackOverflow

Getting ready How to do it There's more...

Reading and cleaning the description in the job listing

Getting ready How to do it...

Searching, Mining and Visualizing Data

Introduction Geocoding an IP address

Getting ready How to do it

How to collect IP addresses of Wikipedia edits

Getting ready How to do it How it works There's more...

Visualizing contributor location frequency on Wikipedia

How to do it

Creating a word cloud from a StackOverflow job listing

Getting ready How to do it

Crawling links on Wikipedia

Getting ready How to do it How it works There's more...

Visualizing page relationships on Wikipedia

Getting ready How to do it How it works There's more...

Calculating degrees of separation

How to do it How it works There's more...

Creating a Simple Data API

Introduction Creating a REST API with Flask-RESTful

Getting ready How to do it How it works There's more...

Integrating the REST API with scraping code

Getting ready How to do it

Adding an API to find the skills for a job listing

Getting ready How to do it

Storing data in Elasticsearch as the result of a scraping request

Getting ready How to do it How it works There's more...

Checking Elasticsearch for a listing before scraping

How to do it There's more...

Creating Scraper Microservices with Docker

Introduction Installing Docker

Getting ready How to do it

Installing a RabbitMQ container from Docker Hub

Getting ready How to do it

Running a Docker container (RabbitMQ)

Getting ready How to do it There's more...

Creating and running an Elasticsearch container

How to do it

Stopping/restarting a container and removing the image

How to do it There's more...

Creating a generic microservice with Nameko

Getting ready How to do it How it works There's more...

Creating a scraping microservice

How to do it There's more...

Creating a scraper container

Getting ready How to do it How it works

Creating an API container

Getting ready How to do it There's more...

Composing and running the scraper locally with docker-compose

Getting ready How to do it There's more...

Making the Scraper as a Service Real

Introduction Creating and configuring an Elastic Cloud trial account

How to do it

Accessing the Elastic Cloud cluster with curl

How to do it

Connecting to the Elastic Cloud cluster with Python

Getting ready How to do it There's more...

Performing an Elasticsearch query with the Python API

Getting ready How to do it There's more...

Using Elasticsearch to query for jobs with specific skills

Getting ready How to do it

Modifying the API to search for jobs by skill

How to do it How it works There's more...

Storing configuration in the environment

How to do it

Creating an AWS IAM user and a key pair for ECS

Getting ready How to do it

Configuring Docker to authenticate with ECR

Getting ready How to do it

Pushing containers into ECR

Getting ready How to do it

Creating an ECS cluster

How to do it

Creating a task to run our containers

Getting ready How to do it How it works

Starting and accessing the containers in AWS

Getting ready How to do it There's more...

Other Books You May Enjoy

Leave a review - let other readers know what you think
