Index
Title Page Copyright and Credits
Mastering Hadoop 3
Dedication About Packt
Why subscribe? Packt.com
Foreword Contributors
About the authors About the reviewer Packt is searching for authors like you
Preface
Who this book is for What this book covers To get the most out of this book
Download the example code files Download the color images Code in action Conventions used
Get in touch
Reviews
Section 1: Introduction to Hadoop 3
Journey to Hadoop 3
Hadoop origins and Timelines
Origins
MapReduce origin
Timelines
Overview of Hadoop 3 and its features Hadoop logical view Hadoop distributions
On-premise distribution Cloud distributions
Points to remember Summary
Deep Dive into the Hadoop Distributed File System
Technical requirements Defining HDFS Deep dive into the HDFS architecture
HDFS logical architecture
Concepts of the data group
Blocks Replication
HDFS communication architecture
NameNode internals
Data locality and rack awareness
DataNode internals Quorum Journal Manager (QJM) HDFS high availability in Hadoop 3.x Data management
Metadata management
Checkpoint using a secondary NameNode
Data integrity HDFS Snapshots Data rebalancing Best practices for using balancer
HDFS reads and writes
Write workflows Read workflows Short circuit reads
Managing disk-skewed data in Hadoop 3.x Lazy persist writes in HDFS Erasure encoding in Hadoop 3.x
Advantages of erasure coding Disadvantages of erasure coding
HDFS common interfaces
HDFS read HDFS write
HDFSFileSystemWrite.java
HDFS delete
HDFS command reference
File System commands Distributed copy Admin commands
Points to remember Summary
YARN Resource Management in Hadoop
Architecture
Resource Manager component Node manager core
Introduction to YARN job scheduling FIFO scheduler Capacity scheduler
Configuring capacity scheduler
Fair scheduler
Scheduling queues Configuring fair scheduler
Resource Manager high availability
Architecture of RM high availability Configuring Resource Manager high availability
Node labels
Configuring node labels
YARN Timeline server in Hadoop 3.x
Configuring YARN Timeline server
Opportunistic containers in Hadoop 3.x
Configuring opportunistic containers
Docker containers in YARN
Configuring Docker containers
Running the Docker image Running the container
YARN REST APIs
Resource Manager API Node Manager REST API
YARN command reference
User command
Application commands Logs command
Administration commands
Summary
Internals of MapReduce
Technical requirements Deep dive into the Hadoop MapReduce framework YARN and MapReduce MapReduce workflow in the Hadoop framework Common MapReduce patterns
Summarization patterns
Word count example
Mapper Reducer Combiner
Minimum and maximum
Filtering patterns
Top-k MapReduce implementation
Join pattern
Reduce side join Map side join (replicated join)
Composite join
Sorting and partitioning
MapReduce use case
MovieRatingMapper MovieRatingReducer MovieRatingDriver
Optimizing MapReduce
Hardware configuration Operating system tuning Optimization techniques Runtime configuration File System optimization
Summary
Section 2: Hadoop Ecosystem
SQL on Hadoop
Technical requirements Presto – introduction
Presto architecture Presto installation and basic query execution Functions
Conversion functions Mathematical functions String functions
Presto connectors
Hive connector Kafka connector
Configuration properties
MySQL connector Redshift connector MongoDB connector
Hive
Apache Hive architecture Installing and running Hive Hive queries
Hive table creation Loading data to a table The select query
Choosing file format
Splittable and non-splittable file formats
Query performance Disk usage and compression Schema change
Introduction to HCatalog Introduction to HiveServer2 Hive UDF Understanding ACID in Hive
Example
Partitioning and bucketing
Prerequisite Partitioning Bucketing
Best practices
Impala
Impala architecture Understanding the Impala interface and queries Practicing Impala
Loading Data from CSV files
Best practices
Summary
Real-Time Processing Engines
Technical requirements Spark
Apache Spark internals
Spark driver Spark workers Cluster manager Spark application job flow
Deep dive into resilient distributed datasets
RDD features RDD operations
Installing and running our first Spark job
Spark-shell Spark submit command Maven dependencies
Accumulators and broadcast variables Understanding dataframe and dataset
Dataframes Dataset
Spark cluster managers Best practices
Apache Flink
Flink architecture Apache Flink ecosystem component Dataset and data stream API
Dataset API
Transformation Data sinks
Data streams
Exploring the table API Best practices
Storm/Heron
Deep dive into the Storm/Heron architecture
Concept of a Storm application Introduction to Apache Heron Heron architecture
Understanding Storm Trident Storm integrations Best practices
Summary
Widely Used Hadoop Ecosystem Components
Technical requirements Pig
Apache Pig architecture Installing and running Pig Introducing Pig Latin and Grunt Writing UDF in Pig
Eval function Filter function How to use custom UDF in Pig
Pig with Hive Best practices
HBase
HBase architecture and its concept CAP theorem HBase operations and its examples
Put operation Get operation Delete operation Batch operation
Installation
Local mode Installation Distributed mode installation
Master node configuration Slave node configuration
Best practices
Kafka
Apache Kafka architecture Installing and running Apache Kafka
Local mode installation Distributed mode
Internals of producer and consumer
Producer Consumer
Writing producer and consumer application Kafka Connect for ETL Best practices
Flume
Apache Flume architecture Deep dive into source, channel, and sink
Sources
Pollable source Event-driven source
Channels
Memory channel File channel Kafka channel
Sinks
Flume interceptor
Timestamp interceptor Universally Unique Identifier (UUID) interceptor Regex filter interceptor Writing a custom interceptor
Use case – Twitter data Best practices
Summary
Section 3: Hadoop in the Real World
Designing Applications in Hadoop
Technical requirements File formats
Understanding file formats
Row format and column format Schema evolution Splittable versus non-splittable Compression
Text Sequence file Avro Optimized Row Columnar (ORC) Parquet
Data compression
Types of data compression in Hadoop
Gzip BZip2 Lempel-Ziv-Oberhumer Snappy
Compression format consideration
Serialization Data ingestion
Batch ingestion Macro batch ingestion Real-time ingestion
Data processing
Batch processing Micro batch processing Real-time processing
Common batch processing pattern
Slowly changing dimension
Slowly changing dimensions – type 1 Slowly changing dimensions - type 2
Duplicate record and small files Real-time lookup
Airflow for orchestration Data governance
Data governance pillars
Metadata management Data life cycle management Data classification
Summary
Real-Time Stream Processing in Hadoop
Technical requirements What are streaming datasets? Stream data ingestion
Flume event-based data ingestion Kafka
Common stream data processing patterns
Unbounded data batch processing
Streaming design considerations
Latency Data availability, integrity, and security Unbounded data sources Data lookups Data formats Serializing your data Parallel processing Out-of-order events Message delivery semantics
Micro-batch processing case study Real-time processing case study
Main code Executing the code
Summary
Machine Learning in Hadoop
Technical requirements Machine learning steps Common machine learning challenges Spark machine learning
Transformer function Estimator Spark ML pipeline
Hadoop and R Mahout Machine learning case study in Spark
Sentiment analysis using Spark ML
Summary
Hadoop in the Cloud
Technical requirements Logical view of Hadoop in the cloud Network
Regions and availability zone VPC and subnet Security groups/firewall rules Practical example using AWS
Managing resources
CloudWatch
Data pipelines
Amazon Data Pipeline Airflow
Airflow components
Sample data pipeline DAG example
High availability (HA)
Server failure
Server instance high availability Region and zone failure
Cloud storage high availability
Amazon S3 outage case history
Summary
Hadoop Cluster Profiling
Introduction to benchmarking and profiling HDFS
DFSIO
NameNode
NNBench NNThroughputBenchmark Synthetic load generator (SLG)
YARN
Scheduler Load Simulator (SLS)
Hive
TPC-DS TPC-H
Mix-workloads
Rumen Gridmix
Summary
Section 4: Securing Hadoop
Who Can Do What in Hadoop
Hadoop security pillars System security Kerberos authentication
Kerberos advantages Kerberos authentication flows
Service authentication User authentication Communication between the authenticated client and the authenticated Hadoop service Symmetric key-based communication in Hadoop
User authorization
Ranger Sentry
List of security features that have been worked upon in Hadoop 3.0 Summary
Network and Data Security
Securing Hadoop networks
Segregating different types of networks Network firewalls Tools for securing Hadoop services' network perimeter
Encryption
Data in transit encryption Data at rest encryption
Masking Filtering
Row-level filtering Column-level filtering
Summary
Monitoring Hadoop
General monitoring
HDFS metrics
NameNode metrics DataNode metrics
YARN metrics ZooKeeper metrics Apache Ambari
Security monitoring
Security information and event management How does SIEM work? Intrusion detection system Intrusion prevention system
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think