Foreword
Preface
A Note About the Code Examples
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Mark Grover’s Acknowledgments
Ted Malaska’s Acknowledgments
Jonathan Seidman’s Acknowledgments
Gwen Shapira’s Acknowledgments
I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
Data Storage Options
Standard File Formats
Text data
Structured text data
Binary data
Hadoop File Types
File-based data structures
Serialization Formats
Thrift
Protocol Buffers
Avro
Columnar Formats
RCFile
ORC
Parquet
Avro and Parquet
Compression
Snappy
LZO
Gzip
bzip2
Compression recommendations
HDFS Schema Design
Location of HDFS Files
Advanced HDFS Schema Design
Partitioning
Bucketing
Denormalizing
HDFS Schema Design Summary
HBase Schema Design
Row Key
Record retrieval
Distribution
Block cache
Ability to scan
Size
Readability
Uniqueness
Timestamp
Hops
Tables and Regions
Put performance
Compaction time
Using Columns
Using Column Families
Time-to-Live
Managing Metadata
What Is Metadata?
Why Care About Metadata?
Where to Store Metadata?
Examples of Managing Metadata
Limitations of the Hive Metastore and HCatalog
Other Ways of Storing Metadata
Embedding metadata in file paths and names
Storing the metadata in HDFS
Conclusion
2. Data Movement
Data Ingestion Considerations
Timeliness of Data Ingestion
Incremental Updates
Access Patterns
Original Source System and Data Structure
Read speed of the devices on source systems
Original file type
Compression
Relational database management systems
Streaming data
Logfiles
Transformations
Interceptors
Selectors
Network Bottlenecks
Network Security
Push or Pull
Sqoop
Flume
Failure Handling
Level of Complexity
Data Ingestion Options
File Transfers
HDFS client commands
Mountable HDFS
Considerations for File Transfers versus Other Ingest Methods
Sqoop: Batch Transfer Between Hadoop and Relational Databases
Choosing a split-by column
Using database-specific connectors whenever available
Using the Goldilocks method of Sqoop performance tuning
Loading many tables in parallel with fair scheduler throttling
Diagnosing bottlenecks
Keeping Hadoop updated
Flume: Event-Based Data Collection and Processing
Flume architecture
Flume patterns
File formats
Recommendations
Flume sources
Flume sinks
Flume interceptors
Flume memory channels
Flume file channels
Sizing Channels
Finding Flume bottlenecks
Kafka
Kafka fault tolerance
Kafka and Hadoop
Data Extraction
Conclusion
3. Processing Data in Hadoop
MapReduce
MapReduce Overview
Map phase
InputFormat
RecordReader
Mapper.setup()
Mapper.map()
Partitioner
Mapper.cleanup()
Combiner
Reducer
Shuffle
Reducer.setup()
Reducer.reduce()
Reducer.cleanup()
OutputFormat
Example for MapReduce
When to Use MapReduce
Spark
Spark Overview
DAG Model
Overview of Spark Components
Basic Spark Concepts
Resilient Distributed Datasets
Shared variables
SparkContext
Transformations
Action
Benefits of Using Spark
Simplicity
Versatility
Reduced disk I/O
Storage
Multilanguage
Resource manager independence
Interactive shell (REPL)
Spark Example
When to Use Spark
Abstractions
Pig
Pig Example
When to Use Pig
Crunch
Crunch Example
When to Use Crunch
Cascading
Cascading Example
When to Use Cascading
Hive
Hive Overview
Example of Hive Code
When to Use Hive
Impala
Impala Overview
Speed-Oriented Design
Efficient use of memory
Long-running daemons
Efficient execution engine
Use of LLVM
Impala Example
When to Use Impala
Conclusion
4. Common Hadoop Processing Patterns
Pattern: Removing Duplicate Records by Primary Key
Data Generation for Deduplication Example
Code Example: Spark Deduplication in Scala
Code Example: Deduplication in SQL
Pattern: Windowing Analysis
Data Generation for Windowing Analysis Example
Code Example: Peaks and Valleys in Spark
Code Example: Peaks and Valleys in SQL
Pattern: Time Series Modifications
Use HBase and Versioning
Use HBase with a RowKey of RecordKey and StartTime
Use HDFS and Rewrite the Whole Table
Use Partitions on HDFS for Current and Historical Records
Data Generation for Time Series Example
Code Example: Time Series in Spark
Code Example: Time Series in SQL
Conclusion
5. Graph Processing on Hadoop
What Is a Graph?
What Is Graph Processing?
How Do You Process a Graph in a Distributed System?
The Bulk Synchronous Parallel Model
BSP by Example
Giraph
Read and Partition the Data
Batch Process the Graph with BSP
Write the Graph Back to Disk
Putting It All Together
When Should You Use Giraph?
GraphX
Just Another RDD
GraphX Pregel Interface
vprog()
sendMessage()
mergeMessage()
Which Tool to Use?
Conclusion
6. Orchestration
Why We Need Workflow Orchestration
The Limits of Scripting
The Enterprise Job Scheduler and Hadoop
Orchestration Frameworks in the Hadoop Ecosystem
Oozie Terminology
Oozie Overview
Oozie Workflow
Workflow Patterns
Point-to-Point Workflow
Fan-Out Workflow
Capture-and-Decide Workflow
Parameterizing Workflows
Classpath Definition
Scheduling Patterns
Frequency Scheduling
Time and Data Triggers
Executing Workflows
Conclusion
7. Near-Real-Time Processing with Hadoop
Stream Processing
Apache Storm
Storm High-Level Architecture
Storm Topologies
Tuples and Streams
Spouts and Bolts
Stream Groupings
Reliability of Storm Applications
Exactly-Once Processing
Fault Tolerance
Integrating Storm with HDFS
Integrating Storm with HBase
Storm Example: Simple Moving Average
Evaluating Storm
Support for aggregation and windowing
Enrichment and alerting
Lambda Architecture
Trident
Trident Example: Simple Moving Average
Evaluating Trident
Support for counting and windowing
Enrichment and alerting
Lambda Architecture
Spark Streaming
Overview of Spark Streaming
Spark Streaming Example: Simple Count
Spark Streaming Example: Multiple Inputs
Spark Streaming Example: Maintaining State
Spark Streaming Example: Windowing
Spark Streaming Example: Streaming versus ETL Code
Evaluating Spark Streaming
Support for counting and windowing
Enrichment and alerting
Lambda Architecture
Flume Interceptors
Which Tool to Use?
Low-Latency Enrichment, Validation, Alerting, and Ingestion
Solution One: Flume
Solution Two: Kafka and Storm
NRT Counting, Rolling Averages, and Iterative Processing
Complex Data Pipelines
Conclusion
II. Case Studies
8. Clickstream Analysis
Defining the Use Case
Using Hadoop for Clickstream Analysis
Design Overview
Storage
Ingestion
The Client Tier
The Collector Tier
Processing
Data Deduplication
Deduplication in Hive
Deduplication in Pig
Sessionization
Sessionization in Spark
Sessionization in MapReduce
Sessionization in Pig
Sessionization in Hive
Analyzing
Orchestration
Conclusion
9. Fraud Detection
Continuous Improvement
Taking Action
Architectural Requirements of Fraud Detection Systems
Introducing Our Use Case
High-Level Design
Client Architecture
Profile Storage and Retrieval
Caching
Distributed memory caching
HBase with BlockCache
HBase Data Definition
Columns (combined or atomic)
Event counting using HBase increment or put
Event history using HBase put
Delivering Transaction Status: Approved or Denied?
Ingest
Path Between the Client and Flume
Client push
Logfile pull
Message queue or Kafka in the middle
Near-Real-Time and Exploratory Analytics
Near-Real-Time Processing
Exploratory Analytics
What About Other Architectures?
Flume Interceptors
Kafka to Storm or Spark Streaming
External Business Rules Engine
Conclusion
10. Data Warehouse
Using Hadoop for Data Warehousing
Defining the Use Case
OLTP Schema
Data Warehouse: Introduction and Terminology
Data Warehousing with Hadoop
High-Level Design
Data Modeling and Storage
Choosing a storage engine
Denormalizing
Tracking updates in Hadoop
Selecting storage format and compression
Partitioning
Ingestion
Data Processing and Access
Partitioning
Merge/update
Aggregations
Data Export
Orchestration
Conclusion
A. Joins in Impala
Broadcast Joins
Partitioned Hash Join
Index