Index
Title Page
Copyright and Credits
Mastering Hadoop 3
Dedication
About Packt
Why subscribe?
Packt.com
Foreword
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Code in action
Conventions used
Get in touch
Reviews
Section 1: Introduction to Hadoop 3
Journey to Hadoop 3
Hadoop origins and Timelines
Origins
MapReduce origin
Timelines
Overview of Hadoop 3 and its features
Hadoop logical view
Hadoop distributions
On-premise distribution
Cloud distributions
Points to remember
Summary
Deep Dive into the Hadoop Distributed File System
Technical requirements
Defining HDFS
Deep dive into the HDFS architecture
HDFS logical architecture
Concepts of the data group
Blocks
Replication
HDFS communication architecture
NameNode internals
Data locality and rack awareness
DataNode internals
Quorum Journal Manager (QJM)
HDFS high availability in Hadoop 3.x
Data management
Metadata management
Checkpoint using a secondary NameNode
Data integrity
HDFS Snapshots
Data rebalancing
Best practices for using balancer
HDFS reads and writes
Write workflows
Read workflows
Short circuit reads
Managing disk-skewed data in Hadoop 3.x
Lazy persist writes in HDFS
Erasure coding in Hadoop 3.x
Advantages of erasure coding
Disadvantages of erasure coding
HDFS common interfaces
HDFS read
HDFS write
HDFSFileSystemWrite.java
HDFS delete
HDFS command reference
File System commands
Distributed copy
Admin commands
Points to remember
Summary
YARN Resource Management in Hadoop
Architecture
Resource Manager component
Node manager core
Introduction to YARN job scheduling
FIFO scheduler
Capacity scheduler
Configuring capacity scheduler
Fair scheduler
Scheduling queues
Configuring fair scheduler
Resource Manager high availability
Architecture of RM high availability
Configuring Resource Manager high availability
Node labels
Configuring node labels
YARN Timeline server in Hadoop 3.x
Configuring YARN Timeline server
Opportunistic containers in Hadoop 3.x
Configuring opportunistic containers
Docker containers in YARN
Configuring Docker containers
Running the Docker image
Running the container
YARN REST APIs
Resource Manager API
Node Manager REST API
YARN command reference
User command
Application commands
Logs command
Administration commands
Summary
Internals of MapReduce
Technical requirements
Deep dive into the Hadoop MapReduce framework
YARN and MapReduce
MapReduce workflow in the Hadoop framework
Common MapReduce patterns
Summarization patterns
Word count example
Mapper
Reducer
Combiner
Minimum and maximum
Filtering patterns
Top-k MapReduce implementation
Join pattern
Reduce side join
Map side join (replicated join)
Composite join
Sorting and partitioning
MapReduce use case
MovieRatingMapper
MovieRatingReducer
MovieRatingDriver
Optimizing MapReduce
Hardware configuration
Operating system tuning
Optimization techniques
Runtime configuration
File System optimization
Summary
Section 2: Hadoop Ecosystem
SQL on Hadoop
Technical requirements
Presto – introduction
Presto architecture
Presto installation and basic query execution
Functions
Conversion functions
Mathematical functions
String functions
Presto connectors
Hive connector
Kafka connector
Configuration properties
MySQL connector
Redshift connector
MongoDB connector
Hive
Apache Hive architecture
Installing and running Hive
Hive queries
Hive table creation
Loading data to a table
The select query
Choosing file format
Splittable and non-splittable file formats
Query performance
Disk usage and compression
Schema change
Introduction to HCatalog
Introduction to HiveServer2
Hive UDF
Understanding ACID in Hive
Example
Partitioning and bucketing
Prerequisite
Partitioning
Bucketing
Best practices
Impala
Impala architecture
Understanding the Impala interface and queries
Practicing Impala
Loading Data from CSV files
Best practices
Summary
Real-Time Processing Engines
Technical requirements
Spark
Apache Spark internals
Spark driver
Spark workers
Cluster manager
Spark application job flow
Deep dive into resilient distributed datasets
RDD features
RDD operations
Installing and running our first Spark job
Spark-shell
Spark submit command
Maven dependencies
Accumulators and broadcast variables
Understanding dataframe and dataset
Dataframes
Dataset
Spark cluster managers
Best practices
Apache Flink
Flink architecture
Apache Flink ecosystem component
Dataset and data stream API
Dataset API
Transformation
Data sinks
Data streams
Exploring the table API
Best practices
Storm/Heron
Deep dive into the Storm/Heron architecture
Concept of a Storm application
Introduction to Apache Heron
Heron architecture
Understanding Storm Trident
Storm integrations
Best practices
Summary
Widely Used Hadoop Ecosystem Components
Technical requirements
Pig
Apache Pig architecture
Installing and running Pig
Introducing Pig Latin and Grunt
Writing UDF in Pig
Eval function
Filter function
How to use custom UDF in Pig
Pig with Hive
Best practices
HBase
HBase architecture and its concept
CAP theorem
HBase operations and its examples
Put operation
Get operation
Delete operation
Batch operation
Installation
Local mode Installation
Distributed mode installation
Master node configuration
Slave node configuration
Best practices
Kafka
Apache Kafka architecture
Installing and running Apache Kafka
Local mode installation
Distributed mode
Internals of producer and consumer
Producer
Consumer
Writing producer and consumer application
Kafka Connect for ETL
Best practices
Flume
Apache Flume architecture
Deep dive into source, channel, and sink
Sources
Pollable source
Event-driven source
Channels
Memory channel
File channel
Kafka channel
Sinks
Flume interceptor
Timestamp interceptor
Universally Unique Identifier (UUID) interceptor
Regex filter interceptor
Writing a custom interceptor
Use case – Twitter data
Best practices
Summary
Section 3: Hadoop in the Real World
Designing Applications in Hadoop
Technical requirements
File formats
Understanding file formats
Row format and column format
Schema evolution
Splittable versus non-splittable
Compression
Text
Sequence file
Avro
Optimized Row Columnar (ORC)
Parquet
Data compression
Types of data compression in Hadoop
Gzip
BZip2
Lempel-Ziv-Oberhumer
Snappy
Compression format consideration
Serialization
Data ingestion
Batch ingestion
Macro batch ingestion
Real-time ingestion
Data processing
Batch processing
Micro batch processing
Real-time processing
Common batch processing pattern
Slowly changing dimension
Slowly changing dimensions – type 1
Slowly changing dimensions – type 2
Duplicate record and small files
Real-time lookup
Airflow for orchestration
Data governance
Data governance pillars
Metadata management
Data life cycle management
Data classification
Summary
Real-Time Stream Processing in Hadoop
Technical requirements
What are streaming datasets?
Stream data ingestion
Flume event-based data ingestion
Kafka
Common stream data processing patterns
Unbounded data batch processing
Streaming design considerations
Latency
Data availability, integrity, and security
Unbounded data sources
Data lookups
Data formats
Serializing your data
Parallel processing
Out-of-order events
Message delivery semantics
Micro-batch processing case study
Real-time processing case study
Main code
Executing the code
Summary
Machine Learning in Hadoop
Technical requirements
Machine learning steps
Common machine learning challenges
Spark machine learning
Transformer function
Estimator
Spark ML pipeline
Hadoop and R
Mahout
Machine learning case study in Spark
Sentiment analysis using Spark ML
Summary
Hadoop in the Cloud
Technical requirements
Logical view of Hadoop in the cloud
Network
Regions and availability zone
VPC and subnet
Security groups/firewall rules
Practical example using AWS
Managing resources
CloudWatch
Data pipelines
Amazon Data Pipeline
Airflow
Airflow components
Sample data pipeline DAG example
High availability (HA)
Server failure
Server instance high availability
Region and zone failure
Cloud storage high availability
Amazon S3 outage case history
Summary
Hadoop Cluster Profiling
Introduction to benchmarking and profiling
HDFS
DFSIO
NameNode
NNBench
NNThroughputBenchmark
Synthetic load generator (SLG)
YARN
Scheduler Load Simulator (SLS)
Hive
TPC-DS
TPC-H
Mix-workloads
Rumen
Gridmix
Summary
Section 4: Securing Hadoop
Who Can Do What in Hadoop
Hadoop security pillars
System security
Kerberos authentication
Kerberos advantages
Kerberos authentication flows
Service authentication
User authentication
Communication between the authenticated client and the authenticated Hadoop service
Symmetric key-based communication in Hadoop
User authorization
Ranger
Sentry
List of security features that have been worked upon in Hadoop 3.0
Summary
Network and Data Security
Securing Hadoop networks
Segregating different types of networks
Network firewalls
Tools for securing Hadoop services' network perimeter
Encryption
Data in transit encryption
Data at rest encryption
Masking
Filtering
Row-level filtering
Column-level filtering
Summary
Monitoring Hadoop
General monitoring
HDFS metrics
NameNode metrics
DataNode metrics
YARN metrics
ZooKeeper metrics
Apache Ambari
Security monitoring
Security information and event management
How does SIEM work?
Intrusion detection system
Intrusion prevention system
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think