Mastering Hadoop 3 · Big Data Processing at Scale to Unlock Unique Business Insights by Singh, Chanchal

Index

Title Page Copyright and Credits

Mastering Hadoop 3

Dedication About Packt

Why subscribe? Packt.com

Foreword Contributors

About the authors About the reviewer Packt is searching for authors like you

Preface

Who this book is for What this book covers To get the most out of this book

Download the example code files Download the color images Code in action Conventions used

Get in touch

Reviews

Section 1: Introduction to Hadoop 3

Journey to Hadoop 3

Hadoop origins and Timelines

Origins

MapReduce origin

Timelines

Overview of Hadoop 3 and its features Hadoop logical view Hadoop distributions

On-premise distribution Cloud distributions

Points to remember Summary

Deep Dive into the Hadoop Distributed File System

Technical requirements Defining HDFS Deep dive into the HDFS architecture

HDFS logical architecture

Concepts of the data group

Blocks Replication

HDFS communication architecture

NameNode internals

Data locality and rack awareness

DataNode internals Quorum Journal Manager (QJM) HDFS high availability in Hadoop 3.x Data management

Metadata management

Checkpoint using a secondary NameNode

Data integrity HDFS Snapshots Data rebalancing Best practices for using balancer

HDFS reads and writes

Write workflows Read workflows Short circuit reads

Managing disk-skewed data in Hadoop 3.x Lazy persist writes in HDFS Erasure coding in Hadoop 3.x

Advantages of erasure coding Disadvantages of erasure coding

HDFS common interfaces

HDFS read HDFS write

HDFSFileSystemWrite.java

HDFS delete

HDFS command reference

File System commands Distributed copy Admin commands

Points to remember Summary

YARN Resource Management in Hadoop

Architecture

Resource Manager component Node manager core

Introduction to YARN job scheduling FIFO scheduler Capacity scheduler

Configuring capacity scheduler

Fair scheduler

Scheduling queues Configuring fair scheduler

Resource Manager high availability

Architecture of RM high availability Configuring Resource Manager high availability

Node labels

Configuring node labels

YARN Timeline server in Hadoop 3.x

Configuring YARN Timeline server

Opportunistic containers in Hadoop 3.x

Configuring opportunistic containers

Docker containers in YARN

Configuring Docker containers

Running the Docker image Running the container

YARN REST APIs

Resource Manager API Node Manager REST API

YARN command reference

User command

Application commands Logs command

Administration commands

Summary

Internals of MapReduce

Technical requirements Deep dive into the Hadoop MapReduce framework YARN and MapReduce MapReduce workflow in the Hadoop framework Common MapReduce patterns

Summarization patterns

Word count example

Mapper Reducer Combiner

Minimum and maximum

Filtering patterns

Top-k MapReduce implementation

Join pattern

Reduce side join Map side join (replicated join)

Composite join

Sorting and partitioning

MapReduce use case

MovieRatingMapper MovieRatingReducer MovieRatingDriver

Optimizing MapReduce

Hardware configuration Operating system tuning Optimization techniques Runtime configuration File System optimization

Summary

Section 2: Hadoop Ecosystem

SQL on Hadoop

Technical requirements Presto – introduction

Presto architecture Presto installation and basic query execution Functions

Conversion functions Mathematical functions String functions

Presto connectors

Hive connector Kafka connector

Configuration properties

MySQL connector Redshift connector MongoDB connector

Hive

Apache Hive architecture Installing and running Hive Hive queries

Hive table creation Loading data to a table The select query

Choosing file format

Splittable and non-splittable file formats

Query performance Disk usage and compression Schema change

Introduction to HCatalog Introduction to HiveServer2 Hive UDF Understanding ACID in Hive

Example

Partitioning and bucketing

Prerequisite Partitioning Bucketing

Best practices

Impala

Impala architecture Understanding the Impala interface and queries Practicing Impala

Loading Data from CSV files

Best practices

Summary

Real-Time Processing Engines

Technical requirements Spark

Apache Spark internals

Spark driver Spark workers Cluster manager Spark application job flow

Deep dive into resilient distributed datasets

RDD features RDD operations

Installing and running our first Spark job

Spark-shell Spark submit command Maven dependencies

Accumulators and broadcast variables Understanding dataframe and dataset

Dataframes Dataset

Spark cluster managers Best practices

Apache Flink

Flink architecture Apache Flink ecosystem component Dataset and data stream API

Dataset API

Transformation Data sinks

Data streams

Exploring the table API Best practices

Storm/Heron

Deep dive into the Storm/Heron architecture

Concept of a Storm application Introduction to Apache Heron Heron architecture

Understanding Storm Trident Storm integrations Best practices

Summary

Widely Used Hadoop Ecosystem Components

Technical requirements Pig

Apache Pig architecture Installing and running Pig Introducing Pig Latin and Grunt Writing UDF in Pig

Eval function Filter function How to use custom UDF in Pig

Pig with Hive Best practices

HBase

HBase architecture and its concept CAP theorem HBase operations and its examples

Put operation Get operation Delete operation Batch operation

Installation

Local mode installation Distributed mode installation

Master node configuration Slave node configuration

Best practices

Kafka

Apache Kafka architecture Installing and running Apache Kafka

Local mode installation Distributed mode

Internals of producer and consumer

Producer Consumer

Writing producer and consumer application Kafka Connect for ETL Best practices

Flume

Apache Flume architecture Deep dive into source, channel, and sink

Sources

Pollable source Event-driven source

Channels

Memory channel File channel Kafka channel

Sinks

Flume interceptor

Timestamp interceptor Universally Unique Identifier (UUID) interceptor Regex filter interceptor Writing a custom interceptor

Use case – Twitter data Best practices

Summary

Section 3: Hadoop in the Real World

Designing Applications in Hadoop

Technical requirements File formats

Understanding file formats

Row format and column format Schema evolution Splittable versus non-splittable Compression

Text Sequence file Avro Optimized Row Columnar (ORC) Parquet

Data compression

Types of data compression in Hadoop

Gzip BZip2 Lempel-Ziv-Oberhumer Snappy

Compression format consideration

Serialization Data ingestion

Batch ingestion Macro batch ingestion Real-time ingestion

Data processing

Batch processing Micro batch processing Real-time processing

Common batch processing pattern

Slowly changing dimension

Slowly changing dimensions – type 1 Slowly changing dimensions – type 2

Duplicate record and small files Real-time lookup

Airflow for orchestration Data governance

Data governance pillars

Metadata management Data life cycle management Data classification

Summary

Real-Time Stream Processing in Hadoop

Technical requirements What are streaming datasets? Stream data ingestion

Flume event-based data ingestion Kafka

Common stream data processing patterns

Unbounded data batch processing

Streaming design considerations

Latency Data availability, integrity, and security Unbounded data sources Data lookups Data formats Serializing your data Parallel processing Out-of-order events Message delivery semantics

Micro-batch processing case study Real-time processing case study

Main code Executing the code

Summary

Machine Learning in Hadoop

Technical requirements Machine learning steps Common machine learning challenges Spark machine learning

Transformer function Estimator Spark ML pipeline

Hadoop and R Mahout Machine learning case study in Spark

Sentiment analysis using Spark ML

Summary

Hadoop in the Cloud

Technical requirements Logical view of Hadoop in the cloud Network

Regions and availability zone VPC and subnet Security groups/firewall rules Practical example using AWS

Managing resources

CloudWatch

Data pipelines

Amazon Data Pipeline Airflow

Airflow components

Sample data pipeline DAG example

High availability (HA)

Server failure

Server instance high availability Region and zone failure

Cloud storage high availability

Amazon S3 outage case history

Summary

Hadoop Cluster Profiling

Introduction to benchmarking and profiling HDFS

DFSIO

NameNode

NNBench NNThroughputBenchmark Synthetic load generator (SLG)

YARN

Scheduler Load Simulator (SLS)

Hive

TPC-DS TPC-H

Mix-workloads

Rumen Gridmix

Summary

Section 4: Securing Hadoop

Who Can Do What in Hadoop

Hadoop security pillars System security Kerberos authentication

Kerberos advantages Kerberos authentication flows

Service authentication User authentication Communication between the authenticated client and the authenticated Hadoop service Symmetric key-based communication in Hadoop

User authorization

Ranger Sentry

List of security features that have been worked upon in Hadoop 3.0 Summary

Network and Data Security

Securing Hadoop networks

Segregating different types of networks Network firewalls Tools for securing Hadoop services' network perimeter

Encryption

Data in transit encryption Data at rest encryption

Masking Filtering

Row-level filtering Column-level filtering

Summary

Monitoring Hadoop

General monitoring

HDFS metrics

NameNode metrics DataNode metrics

YARN metrics ZooKeeper metrics Apache Ambari

Security monitoring

Security information and event management How does SIEM work? Intrusion detection system Intrusion prevention system

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think
