Foreword
Preface
A Note About the Code Examples
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Mark Grover’s Acknowledgments
Ted Malaska’s Acknowledgments
Jonathan Seidman’s Acknowledgments
Gwen Shapira’s Acknowledgments
I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
Data Storage Options
Standard File Formats
Text data
Structured text data
Binary data
Hadoop File Types
File-based data structures
Serialization Formats
Thrift
Protocol Buffers
Avro
Columnar Formats
RCFile
ORC
Parquet
Avro and Parquet
Compression
Snappy
LZO
Gzip
bzip2
Compression recommendations
HDFS Schema Design
Location of HDFS Files
Advanced HDFS Schema Design
Partitioning
Bucketing
Denormalizing
HDFS Schema Design Summary
HBase Schema Design
Row Key
Record retrieval
Distribution
Block cache
Ability to scan
Size
Readability
Uniqueness
Timestamp
Hops
Tables and Regions
Put performance
Compaction time
Using Columns
Using Column Families
Time-to-Live
Managing Metadata
What Is Metadata?
Why Care About Metadata?
Where to Store Metadata?
Examples of Managing Metadata
Limitations of the Hive Metastore and HCatalog
Other Ways of Storing Metadata
Embedding metadata in file paths and names
Storing the metadata in HDFS
Conclusion
2. Data Movement
Data Ingestion Considerations
Timeliness of Data Ingestion
Incremental Updates
Access Patterns
Original Source System and Data Structure
Read speed of the devices on source systems
Original file type
Compression
Relational database management systems
Streaming data
Logfiles
Transformations
Interceptors
Selectors
Network Bottlenecks
Network Security
Push or Pull
Sqoop
Flume
Failure Handling
Level of Complexity
Data Ingestion Options
File Transfers
HDFS client commands
Mountable HDFS
Considerations for File Transfers versus Other Ingest Methods
Sqoop: Batch Transfer Between Hadoop and Relational Databases
Choosing a split-by column
Using database-specific connectors whenever available
Using the Goldilocks method of Sqoop performance tuning
Loading many tables in parallel with fair scheduler throttling
Diagnosing bottlenecks
Keeping Hadoop updated
Flume: Event-Based Data Collection and Processing
Flume architecture
Flume patterns
File formats
Recommendations
Flume sources
Flume sinks
Flume interceptors
Flume memory channels
Flume file channels
Sizing Channels
Finding Flume bottlenecks
Kafka
Kafka fault tolerance
Kafka and Hadoop
Data Extraction
Conclusion
3. Processing Data in Hadoop
MapReduce
MapReduce Overview
Map phase
InputFormat
RecordReader
Mapper.setup()
Mapper.map()
Partitioner
Mapper.cleanup()
Combiner
Reducer
Shuffle
Reducer.setup()
Reducer.reduce()
Reducer.cleanup()
OutputFormat
Example for MapReduce
When to Use MapReduce
Spark
Spark Overview
DAG Model
Overview of Spark Components
Basic Spark Concepts
Resilient Distributed Datasets
Shared variables
SparkContext
Transformations
Action
Benefits of Using Spark
Simplicity
Versatility
Reduced disk I/O
Storage
Multilanguage
Resource manager independence
Interactive shell (REPL)
Spark Example
When to Use Spark
Abstractions
Pig
Pig Example
When to Use Pig
Crunch
Crunch Example
When to Use Crunch
Cascading
Cascading Example
When to Use Cascading
Hive
Hive Overview
Example of Hive Code
When to Use Hive
Impala
Impala Overview
Speed-Oriented Design
Efficient use of memory
Long-running daemons
Efficient execution engine
Use of LLVM
Impala Example
When to Use Impala
Conclusion
4. Common Hadoop Processing Patterns
Pattern: Removing Duplicate Records by Primary Key
Data Generation for Deduplication Example
Code Example: Spark Deduplication in Scala
Code Example: Deduplication in SQL
Pattern: Windowing Analysis
Data Generation for Windowing Analysis Example
Code Example: Peaks and Valleys in Spark
Code Example: Peaks and Valleys in SQL
Pattern: Time Series Modifications
Use HBase and Versioning
Use HBase with a RowKey of RecordKey and StartTime
Use HDFS and Rewrite the Whole Table
Use Partitions on HDFS for Current and Historical Records
Data Generation for Time Series Example
Code Example: Time Series in Spark
Code Example: Time Series in SQL
Conclusion
5. Graph Processing on Hadoop
What Is a Graph?
What Is Graph Processing?
How Do You Process a Graph in a Distributed System?
The Bulk Synchronous Parallel Model
BSP by Example
Giraph
Read and Partition the Data
Batch Process the Graph with BSP
Write the Graph Back to Disk
Putting It All Together
When Should You Use Giraph?
GraphX
Just Another RDD
GraphX Pregel Interface
vprog()
sendMessage()
mergeMessage()
Which Tool to Use?
Conclusion
6. Orchestration
Why We Need Workflow Orchestration
The Limits of Scripting
The Enterprise Job Scheduler and Hadoop
Orchestration Frameworks in the Hadoop Ecosystem
Oozie Terminology
Oozie Overview
Oozie Workflow
Workflow Patterns
Point-to-Point Workflow
Fan-Out Workflow
Capture-and-Decide Workflow
Parameterizing Workflows
Classpath Definition
Scheduling Patterns
Frequency Scheduling
Time and Data Triggers
Executing Workflows
Conclusion
7. Near-Real-Time Processing with Hadoop
Stream Processing
Apache Storm
Storm High-Level Architecture
Storm Topologies
Tuples and Streams
Spouts and Bolts
Stream Groupings
Reliability of Storm Applications
Exactly-Once Processing
Fault Tolerance
Integrating Storm with HDFS
Integrating Storm with HBase
Storm Example: Simple Moving Average
Evaluating Storm
Support for aggregation and windowing
Enrichment and alerting
Lambda Architecture
Trident
Trident Example: Simple Moving Average
Evaluating Trident
Support for counting and windowing
Enrichment and alerting
Lambda Architecture
Spark Streaming
Overview of Spark Streaming
Spark Streaming Example: Simple Count
Spark Streaming Example: Multiple Inputs
Spark Streaming Example: Maintaining State
Spark Streaming Example: Windowing
Spark Streaming Example: Streaming versus ETL Code
Evaluating Spark Streaming
Support for counting and windowing
Enrichment and alerting
Lambda Architecture
Flume Interceptors
Which Tool to Use?
Low-Latency Enrichment, Validation, Alerting, and Ingestion
Solution One: Flume
Solution Two: Kafka and Storm
NRT Counting, Rolling Averages, and Iterative Processing
Complex Data Pipelines
Conclusion
II. Case Studies
8. Clickstream Analysis
Defining the Use Case
Using Hadoop for Clickstream Analysis
Design Overview
Storage
Ingestion
The Client Tier
The Collector Tier
Processing
Data Deduplication
Deduplication in Hive
Deduplication in Pig
Sessionization
Sessionization in Spark
Sessionization in MapReduce
Sessionization in Pig
Sessionization in Hive
Analyzing
Orchestration
Conclusion
9. Fraud Detection
Continuous Improvement
Taking Action
Architectural Requirements of Fraud Detection Systems
Introducing Our Use Case
High-Level Design
Client Architecture
Profile Storage and Retrieval
Caching
Distributed memory caching
HBase with BlockCache
HBase Data Definition
Columns (combined or atomic)
Event counting using HBase increment or put
Event history using HBase put
Delivering Transaction Status: Approved or Denied?
Ingest
Path Between the Client and Flume
Client push
Logfile pull
Message queue or Kafka in the middle
Near-Real-Time and Exploratory Analytics
Near-Real-Time Processing
Exploratory Analytics
What About Other Architectures?
Flume Interceptors
Kafka to Storm or Spark Streaming
External Business Rules Engine
Conclusion
10. Data Warehouse
Using Hadoop for Data Warehousing
Defining the Use Case
OLTP Schema
Data Warehouse: Introduction and Terminology
Data Warehousing with Hadoop
High-Level Design
Data Modeling and Storage
Choosing a storage engine
Denormalizing
Tracking updates in Hadoop
Selecting storage format and compression
Partitioning
Ingestion
Data Processing and Access
Partitioning
Merge/update
Aggregations
Data Export
Orchestration
Conclusion
A. Joins in Impala
Broadcast Joins
Partitioned Hash Join
Index