Go Web Scraping Quick Start Guide by Smith, Vincent

Index

Title Page
Copyright and Credits

Go Web Scraping Quick Start Guide

About Packt

Why subscribe?
Packt.com

Contributors

About the author
About the reviewer
Packt is searching for authors like you

Preface

Who this book is for
What this book covers
To get the most out of this book

Download the example code files
Conventions used

Get in touch

Reviews

Introducing Web Scraping and Go

What is web scraping?
Why do you need a web scraper?

Search engines
Price comparison
Building datasets

What is Go?
Why is Go a good fit for web scraping?

Go is fast
Go is safe
Go is simple

How to set up a Go development environment

Go language and tools
Git
Editor

Summary

The Request/Response Cycle

What do HTTP requests look like?

HTTP request methods
HTTP headers
Query parameters
Request body

What do HTTP responses look like?

Status line
Response headers
Response body

What are HTTP status codes?

100–199 range
200–299 range
300–399 range
400–499 range
500–599 range

What do HTTP requests/responses look like in Go?

A simple request example

Summary

Web Scraping Etiquette

What is a robots.txt file?
What is a User-Agent string?

Example

How to throttle your scraper
How to use caching

Cache-Control
Expires
Etag
Caching content in Go

Summary

Parsing HTML

What is the HTML format?

Syntax
Structure

Searching using the strings package

Example – Counting links
Example – Doctype check

Searching using the regexp package

Example – Finding links
Example – Finding prices

Searching using XPath queries

Example – Daily deals
Example – Collecting products

Searching using Cascading Style Sheets selectors

Example – Daily deals
Example – Collecting products

Summary

Web Scraping Navigation

Following links

Example – Daily deals

Submitting forms

Example – Submitting searches
Example – POST method

Avoiding loops
Breadth-first versus depth-first crawling

Depth-first
Breadth-first

Navigating with JavaScript

Example – Book reviews

Summary

Protecting Your Web Scraper

Virtual private servers
Proxies

Public and shared proxies
Dedicated proxies

Price
Location
Type
Anonymity

Proxies in Go

Virtual private networks
Boundaries

Whitelists
Blacklists

Summary

Scraping with Concurrency

What is concurrency
Concurrency pitfalls

Race conditions
Deadlocks

The Go concurrency model

Goroutines
Channels

sync package helpers

Conditions
Atomic counters

Summary

Scraping at 100x

Components of a web scraping system

Queue
Cache
Storage
Logs

Scraping HTML pages with colly
Scraping JavaScript pages with chrome-protocol

Example – Amazon Daily Deals

Distributed scraping with dataflowkit

The Fetch service
The Parse service

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think
