Index
Title Page Copyright and Credits
Python Web Scraping Cookbook
Contributors
About the author About the reviewers Packt is searching for authors like you
Packt Upsell
Why subscribe? PacktPub.com
Preface
Who this book is for What this book covers To get the most out of this book
Download the example code files Conventions used
Get in touch
Reviews
Getting Started with Scraping
Introduction Setting up a Python development environment 
Getting ready How to do it...
Scraping Python.org with Requests and Beautiful Soup
Getting ready... How to do it... How it works...
Scraping Python.org in urllib3 and Beautiful Soup
Getting ready... How to do it... How it works There's more...
Scraping Python.org with Scrapy
Getting ready... How to do it... How it works
Scraping Python.org with Selenium and PhantomJS
Getting ready How to do it... How it works There's more...
Data Acquisition and Extraction
Introduction How to parse websites and navigate the DOM using BeautifulSoup
Getting ready How to do it... How it works There's more...
Searching the DOM with Beautiful Soup's find methods
Getting ready How to do it...
Querying the DOM with XPath and lxml
Getting ready How to do it... How it works There's more...
Querying data with XPath and CSS selectors
Getting ready How to do it... How it works There's more...
Using Scrapy selectors
Getting ready How to do it... How it works There's more...
Loading data in unicode / UTF-8
Getting ready How to do it... How it works There's more...
Processing Data
Introduction Working with CSV and JSON data
Getting ready How to do it How it works There's more...
Storing data using AWS S3
Getting ready How to do it How it works There's more...
Storing data using MySQL
Getting ready How to do it How it works There's more...
Storing data using PostgreSQL
Getting ready How to do it How it works There's more...
Storing data in Elasticsearch
Getting ready How to do it How it works There's more...
How to build robust ETL pipelines with AWS SQS
Getting ready How to do it - posting messages to an AWS queue How it works How to do it - reading and processing messages How it works There's more...
Working with Images, Audio, and other Assets
Introduction Downloading media content from the web
Getting ready How to do it How it works There's more...
Parsing a URL with urllib to get the filename
Getting ready How to do it How it works There's more...
Determining the type of content for a URL 
Getting ready How to do it How it works There's more...
Determining the file extension from a content type
Getting ready How to do it How it works There's more...
Downloading and saving images to the local file system
How to do it How it works There's more...
Downloading and saving images to S3
Getting ready How to do it How it works There's more...
Generating thumbnails for images
Getting ready How to do it How it works
Taking a screenshot of a website
Getting ready How to do it How it works
Taking a screenshot of a website with an external service
Getting ready How to do it How it works There's more...
Performing OCR on an image with pytesseract
Getting ready How to do it How it works There's more...
Creating a Video Thumbnail
Getting ready How to do it How it works There's more...
Ripping an MP4 video to an MP3
Getting ready How to do it There's more...
Scraping - Code of Conduct
Introduction Scraping legality and scraping politely
Getting ready How to do it
Respecting robots.txt
Getting ready How to do it How it works There's more...
Crawling using the sitemap
Getting ready How to do it How it works There's more...
Crawling with delays
Getting ready How to do it How it works There's more...
Using identifiable user agents 
How to do it How it works There's more...
Setting the number of concurrent requests per domain
How it works
Using auto throttling
How to do it How it works There's more...
Using an HTTP cache for development
How to do it How it works There's more...
Scraping Challenges and Solutions
Introduction Retrying failed page downloads
How to do it How it works
Supporting page redirects
How to do it How it works
Waiting for content to be available in Selenium
How to do it How it works
Limiting crawling to a single domain
How to do it How it works
Processing infinitely scrolling pages
Getting ready How to do it How it works There's more...
Controlling the depth of a crawl
How to do it How it works
Controlling the length of a crawl
How to do it How it works
Handling paginated websites
Getting ready How to do it How it works There's more...
Handling forms and forms-based authorization
Getting ready How to do it How it works There's more...
Handling basic authorization
How to do it How it works There's more...
Preventing bans by scraping via proxies
Getting ready How to do it How it works
Randomizing user agents
How to do it
Caching responses
How to do it There's more...
Text Wrangling and Analysis
Introduction Installing NLTK
How to do it
Performing sentence splitting
How to do it There's more...
Performing tokenization
How to do it
Performing stemming
How to do it
Performing lemmatization
How to do it
Determining and removing stop words
How to do it There's more...
Calculating the frequency distributions of words
How to do it There's more...
Identifying and removing rare words
How to do it
Identifying and removing short words
How to do it
Removing punctuation marks
How to do it There's more...
Piecing together n-grams
How to do it There's more...
Scraping a job listing from StackOverflow 
Getting ready How to do it There's more...
Reading and cleaning the description in the job listing
Getting ready How to do it...
Searching, Mining and Visualizing Data
Introduction Geocoding an IP address
Getting ready How to do it
How to collect IP addresses of Wikipedia edits
Getting ready How to do it How it works There's more...
Visualizing contributor location frequency on Wikipedia
How to do it
Creating a word cloud from a StackOverflow job listing
Getting ready How to do it
Crawling links on Wikipedia
Getting ready How to do it How it works There's more...
Visualizing page relationships on Wikipedia
Getting ready How to do it How it works There's more...
Calculating degrees of separation
How to do it How it works There's more...
Creating a Simple Data API
Introduction Creating a REST API with Flask-RESTful
Getting ready How to do it How it works There's more...
Integrating the REST API with scraping code
Getting ready How to do it
Adding an API to find the skills for a job listing
Getting ready How to do it
Storing data in Elasticsearch as the result of a scraping request
Getting ready How to do it How it works There's more...
Checking Elasticsearch for a listing before scraping
How to do it There's more...
Creating Scraper Microservices with Docker
Introduction Installing Docker
Getting ready How to do it
Installing a RabbitMQ container from Docker Hub
Getting ready How to do it
Running a Docker container (RabbitMQ)
Getting ready How to do it There's more...
Creating and running an Elasticsearch container
How to do it
Stopping/restarting a container and removing the image
How to do it There's more...
Creating a generic microservice with Nameko
Getting ready How to do it How it works There's more...
Creating a scraping microservice
How to do it There's more...
Creating a scraper container
Getting ready How to do it How it works
Creating an API container
Getting ready How to do it There's more...
Composing and running the scraper locally with docker-compose
Getting ready How to do it There's more...
Making the Scraper as a Service Real
Introduction Creating and configuring an Elastic Cloud trial account
How to do it
Accessing the Elastic Cloud cluster with curl
How to do it
Connecting to the Elastic Cloud cluster with Python
Getting ready How to do it There's more...
Performing an Elasticsearch query with the Python API 
Getting ready How to do it There's more...
Using Elasticsearch to query for jobs with specific skills
Getting ready How to do it
Modifying the API to search for jobs by skill
How to do it How it works There's more...
Storing configuration in the environment 
How to do it
Creating an AWS IAM user and a key pair for ECS
Getting ready How to do it
Configuring Docker to authenticate with ECR
Getting ready How to do it
Pushing containers into ECR
Getting ready How to do it
Creating an ECS cluster
How to do it
Creating a task to run our containers
Getting ready How to do it How it works
Starting and accessing the containers in AWS
Getting ready How to do it There's more...
Other Books You May Enjoy
Leave a review - let other readers know what you think
