Index
Title Page
Copyright and Credits
Go Web Scraping Quick Start Guide
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
Introducing Web Scraping and Go
What is web scraping?
Why do you need a web scraper?
Search engines
Price comparison
Building datasets
What is Go?
Why is Go a good fit for web scraping?
Go is fast
Go is safe
Go is simple
How to set up a Go development environment
Go language and tools
Git
Editor
Summary
The Request/Response Cycle
What do HTTP requests look like?
HTTP request methods
HTTP headers
Query parameters
Request body
What do HTTP responses look like?
Status line
Response headers
Response body
What are HTTP status codes?
100–199 range
200–299 range
300–399 range
400–499 range
500–599 range
What do HTTP requests/responses look like in Go?
A simple request example
Summary
Web Scraping Etiquette
What is a robots.txt file?
What is a User-Agent string?
Example
How to throttle your scraper
How to use caching
Cache-Control
Expires
ETag
Caching content in Go
Summary
Parsing HTML
What is the HTML format?
Syntax
Structure
Searching using the strings package
Example – Counting links
Example – Doctype check
Searching using the regexp package
Example – Finding links
Example – Finding prices
Searching using XPath queries
Example – Daily deals
Example – Collecting products
Searching using Cascading Style Sheets selectors
Example – Daily deals
Example – Collecting products
Summary
Web Scraping Navigation
Following links
Example – Daily deals
Submitting forms
Example – Submitting searches
Example – POST method
Avoiding loops
Breadth-first versus depth-first crawling
Depth-first
Breadth-first
Navigating with JavaScript
Example – Book reviews
Summary
Protecting Your Web Scraper
Virtual private servers
Proxies
Public and shared proxies
Dedicated proxies
Price
Location
Type
Anonymity
Proxies in Go
Virtual private networks
Boundaries
Whitelists
Blacklists
Summary
Scraping with Concurrency
What is concurrency?
Concurrency pitfalls
Race conditions
Deadlocks
The Go concurrency model
Goroutines
Channels
sync package helpers
Conditions
Atomic counters
Summary
Scraping at 100x
Components of a web scraping system
Queue
Cache
Storage
Logs
Scraping HTML pages with colly
Scraping JavaScript pages with chrome-protocol
Example – Amazon Daily Deals
Distributed scraping with dataflowkit
The Fetch service
The Parse service
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think