Index
Preface Or: What Are You Getting Yourself Into Here?
Navigating This Book
Takeaways
Conventions Used in This Book
Online Resources
Figures
Code Snippets
O’Reilly Safari
How to Contact Us
Acknowledgments
I. The Beam Model
1. Streaming 101
Terminology: What Is Streaming?
On the Greatly Exaggerated Limitations of Streaming
Event Time Versus Processing Time
Data Processing Patterns
Bounded Data
Unbounded Data: Batch
Fixed windows
Sessions
Unbounded Data: Streaming
Time-agnostic
Filtering
Inner joins
Approximation algorithms
Windowing
Windowing by processing time
Windowing by event time
Summary
2. The What, Where, When, and How of Data Processing
Roadmap
Batch Foundations: What and Where
What: Transformations
Where: Windowing
Going Streaming: When and How
When: The Wonderful Thing About Triggers Is Triggers Are Wonderful Things!
When: Watermarks
When: Early/On-Time/Late Triggers FTW!
When: Allowed Lateness (i.e., Garbage Collection)
How: Accumulation
Summary
3. Watermarks
Definition
Source Watermark Creation
Perfect Watermark Creation
Heuristic Watermark Creation
Watermark Propagation
Understanding Watermark Propagation
Watermark Propagation and Output Timestamps
The Tricky Case of Overlapping Windows
Percentile Watermarks
Processing-Time Watermarks
Case Studies
Case Study: Watermarks in Google Cloud Dataflow
Case Study: Watermarks in Apache Flink
Case Study: Source Watermarks for Google Cloud Pub/Sub
Summary
4. Advanced Windowing
When/Where: Processing-Time Windows
Event-Time Windowing
Processing-Time Windowing via Triggers
Processing-Time Windowing via Ingress Time
Where: Session Windows
Where: Custom Windowing
Variations on Fixed Windows
Unaligned fixed windows
Per-element/key fixed windows
Variations on Session Windows
Bounded sessions
One Size Does Not Fit All
Summary
5. Exactly-Once and Side Effects
Why Exactly Once Matters
Accuracy Versus Completeness
Side Effects
Problem Definition
Ensuring Exactly Once in Shuffle
Addressing Determinism
Performance
Graph Optimization
Bloom Filters
Garbage Collection
Exactly Once in Sources
Exactly Once in Sinks
Use Cases
Example Source: Cloud Pub/Sub
Example Sink: Files
Example Sink: Google BigQuery
Other Systems
Apache Spark Streaming
Apache Flink
Summary
II. Streams and Tables
6. Streams and Tables
Stream-and-Table Basics Or: a Special Theory of Stream and Table Relativity
Toward a General Theory of Stream and Table Relativity
Batch Processing Versus Streams and Tables
A Streams and Tables Analysis of MapReduce
Map as streams/tables
Reduce as streams/tables
Reconciling with Batch Processing
What, Where, When, and How in a Streams and Tables World
What: Transformations
Where: Windowing
Window merging
When: Triggers
How: Accumulation
A Holistic View of Streams and Tables in the Beam Model
A General Theory of Stream and Table Relativity
Summary
7. The Practicalities of Persistent State
Motivation
The Inevitability of Failure
Correctness and Efficiency
Implicit State
Raw Grouping
Incremental Combining
Generalized State
Case Study: Conversion Attribution
Conversion Attribution with Apache Beam
Summary
8. Streaming SQL
What Is Streaming SQL?
Relational Algebra
Time-Varying Relations
Streams and Tables
Looking Backward: Stream and Table Biases
The Beam Model: A Stream-Biased Approach
The SQL Model: A Table-Biased Approach
Materialized views
Looking Forward: Toward Robust Streaming SQL
Stream and Table Selection
Temporal Operators
Where: windowing
When: triggers
A SQL-ish default: per-record triggers
Watermark triggers
Repeated delay triggers
Data-driven triggers
How: accumulation
Retractions in a SQL world
Discarding mode, or lack thereof
Summary
9. Streaming Joins
All Your Joins Are Belong to Streaming
Unwindowed Joins
FULL OUTER
LEFT OUTER
RIGHT OUTER
INNER
ANTI
SEMI
Windowed Joins
Fixed Windows
Temporal Validity
Temporal validity windows
Temporal validity joins
Watermarks and temporal validity joins
Summary
10. The Evolution of Large-Scale Data Processing
MapReduce
Hadoop
Flume
Storm
Spark
MillWheel
Kafka
Cloud Dataflow
Flink
Beam
Summary
Index