Index
Foreword
Preface
A Note About the Code Examples
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Mark Grover’s Acknowledgements
Ted Malaska’s Acknowledgements
Jonathan Seidman’s Acknowledgements
Gwen Shapira’s Acknowledgements
I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
Data Storage Options
Standard File Formats
Text data
Structured text data
Binary data
Hadoop File Types
File-based data structures
Serialization Formats
Thrift
Protocol Buffers
Avro
Columnar Formats
RCFile
ORC
Parquet
Avro and Parquet
Compression
Snappy
LZO
Gzip
bzip2
Compression recommendations
HDFS Schema Design
Location of HDFS Files
Advanced HDFS Schema Design
Partitioning
Bucketing
Denormalizing
HDFS Schema Design Summary
HBase Schema Design
Row Key
Record retrieval
Distribution
Block cache
Ability to scan
Size
Readability
Uniqueness
Timestamp
Hops
Tables and Regions
Put performance
Compaction time
Using Columns
Using Column Families
Time-to-Live
Managing Metadata
What Is Metadata?
Why Care About Metadata?
Where to Store Metadata?
Examples of Managing Metadata
Limitations of the Hive Metastore and HCatalog
Other Ways of Storing Metadata
Embedding metadata in file paths and names
Storing the metadata in HDFS
Conclusion
2. Data Movement
Data Ingestion Considerations
Timeliness of Data Ingestion
Incremental Updates
Access Patterns
Original Source System and Data Structure
Read speed of the devices on source systems
Original file type
Compression
Relational database management systems
Streaming data
Logfiles
Transformations
Interceptors
Selectors
Network Bottlenecks
Network Security
Push or Pull
Sqoop
Flume
Failure Handling
Level of Complexity
Data Ingestion Options
File Transfers
HDFS client commands
Mountable HDFS
Considerations for File Transfers versus Other Ingest Methods
Sqoop: Batch Transfer Between Hadoop and Relational Databases
Choosing a split-by column
Using database-specific connectors whenever available
Using the Goldilocks method of Sqoop performance tuning
Loading many tables in parallel with fair scheduler throttling
Diagnosing bottlenecks
Keeping Hadoop updated
Flume: Event-Based Data Collection and Processing
Flume architecture
Flume patterns
File formats
Recommendations
Flume sources
Flume sinks
Flume interceptors
Flume memory channels
Flume file channels
Sizing Channels
Finding Flume bottlenecks
Kafka
Kafka fault tolerance
Kafka and Hadoop
Data Extraction
Conclusion
3. Processing Data in Hadoop
MapReduce
MapReduce Overview
Map phase
InputFormat
RecordReader
Mapper.setup()
Mapper.map()
Partitioner
Mapper.cleanup()
Combiner
Reducer
Shuffle
Reducer.setup()
Reducer.reduce()
Reducer.cleanup()
OutputFormat
Example for MapReduce
When to Use MapReduce
Spark
Spark Overview
DAG Model
Overview of Spark Components
Basic Spark Concepts
Resilient Distributed Datasets
Shared variables
SparkContext
Transformations
Action
Benefits of Using Spark
Simplicity
Versatility
Reduced disk I/O
Storage
Multilanguage
Resource manager independence
Interactive shell (REPL)
Spark Example
When to Use Spark
Abstractions
Pig
Pig Example
When to Use Pig
Crunch
Crunch Example
When to Use Crunch
Cascading
Cascading Example
When to Use Cascading
Hive
Hive Overview
Example of Hive Code
When to Use Hive
Impala
Impala Overview
Speed-Oriented Design
Efficient use of memory
Long-running daemons
Efficient execution engine
Use of LLVM
Impala Example
When to Use Impala
Conclusion
4. Common Hadoop Processing Patterns
Pattern: Removing Duplicate Records by Primary Key
Data Generation for Deduplication Example
Code Example: Spark Deduplication in Scala
Code Example: Deduplication in SQL
Pattern: Windowing Analysis
Data Generation for Windowing Analysis Example
Code Example: Peaks and Valleys in Spark
Code Example: Peaks and Valleys in SQL
Pattern: Time Series Modifications
Use HBase and Versioning
Use HBase with a RowKey of RecordKey and StartTime
Use HDFS and Rewrite the Whole Table
Use Partitions on HDFS for Current and Historical Records
Data Generation for Time Series Example
Code Example: Time Series in Spark
Code Example: Time Series in SQL
Conclusion
5. Graph Processing on Hadoop
What Is a Graph?
What Is Graph Processing?
How Do You Process a Graph in a Distributed System?
The Bulk Synchronous Parallel Model
BSP by Example
Giraph
Read and Partition the Data
Batch Process the Graph with BSP
Write the Graph Back to Disk
Putting It All Together
When Should You Use Giraph?
GraphX
Just Another RDD
GraphX Pregel Interface
vprog()
sendMessage()
mergeMessage()
Which Tool to Use?
Conclusion
6. Orchestration
Why We Need Workflow Orchestration
The Limits of Scripting
The Enterprise Job Scheduler and Hadoop
Orchestration Frameworks in the Hadoop Ecosystem
Oozie Terminology
Oozie Overview
Oozie Workflow
Workflow Patterns
Point-to-Point Workflow
Fan-Out Workflow
Capture-and-Decide Workflow
Parameterizing Workflows
Classpath Definition
Scheduling Patterns
Frequency Scheduling
Time and Data Triggers
Executing Workflows
Conclusion
7. Near-Real-Time Processing with Hadoop
Stream Processing
Apache Storm
Storm High-Level Architecture
Storm Topologies
Tuples and Streams
Spouts and Bolts
Stream Groupings
Reliability of Storm Applications
Exactly-Once Processing
Fault Tolerance
Integrating Storm with HDFS
Integrating Storm with HBase
Storm Example: Simple Moving Average
Evaluating Storm
Support for aggregation and windowing
Enrichment and alerting
Lambda Architecture
Trident
Trident Example: Simple Moving Average
Evaluating Trident
Support for counting and windowing
Enrichment and alerting
Lambda Architecture
Spark Streaming
Overview of Spark Streaming
Spark Streaming Example: Simple Count
Spark Streaming Example: Multiple Inputs
Spark Streaming Example: Maintaining State
Spark Streaming Example: Windowing
Spark Streaming Example: Streaming versus ETL Code
Evaluating Spark Streaming
Support for counting and windowing
Enrichment and alerting
Lambda Architecture
Flume Interceptors
Which Tool to Use?
Low-Latency Enrichment, Validation, Alerting, and Ingestion
Solution One: Flume
Solution Two: Kafka and Storm
NRT Counting, Rolling Averages, and Iterative Processing
Complex Data Pipelines
Conclusion
II. Case Studies
8. Clickstream Analysis
Defining the Use Case
Using Hadoop for Clickstream Analysis
Design Overview
Storage
Ingestion
The Client Tier
The Collector Tier
Processing
Data Deduplication
Deduplication in Hive
Deduplication in Pig
Sessionization
Sessionization in Spark
Sessionization in MapReduce
Sessionization in Pig
Sessionization in Hive
Analyzing
Orchestration
Conclusion
9. Fraud Detection
Continuous Improvement
Taking Action
Architectural Requirements of Fraud Detection Systems
Introducing Our Use Case
High-Level Design
Client Architecture
Profile Storage and Retrieval
Caching
Distributed memory caching
HBase with BlockCache
HBase Data Definition
Columns (combined or atomic)
Event counting using HBase increment or put
Event history using HBase put
Delivering Transaction Status: Approved or Denied?
Ingest
Path Between the Client and Flume
Client push
Logfile pull
Message queue or Kafka in the middle
Near-Real-Time and Exploratory Analytics
Near-Real-Time Processing
Exploratory Analytics
What About Other Architectures?
Flume Interceptors
Kafka to Storm or Spark Streaming
External Business Rules Engine
Conclusion
10. Data Warehouse
Using Hadoop for Data Warehousing
Defining the Use Case
OLTP Schema
Data Warehouse: Introduction and Terminology
Data Warehousing with Hadoop
High-Level Design
Data Modeling and Storage
Choosing a storage engine
Denormalizing
Tracking updates in Hadoop
Selecting storage format and compression
Partitioning
Ingestion
Data Processing and Access
Partitioning
Merge/update
Aggregations
Data Export
Orchestration
Conclusion
A. Joins in Impala
Broadcast Joins
Partitioned Hash Join
Index