Index
Title Page
Copyright and Credits
Modern Big Data Processing with Hadoop
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Enterprise Data Architecture Principles
Data architecture principles
Volume
Velocity
Variety
Veracity
The importance of metadata
Data governance
Fundamentals of data governance
Data security
Application security
Input data
Big data security
RDBMS security
BI security
Physical security
Data encryption
Secure key management
Data as a Service
Evolution of data architecture with Hadoop
Hierarchical database architecture
Network database architecture
Relational database architecture
Employees
Devices
Department
Department and employee mapping table
Hadoop data architecture
Data layer
Data management layer
Job execution layer
Summary
Hadoop Life Cycle Management
Data wrangling
Data acquisition
Data structure analysis
Information extraction
Unwanted data removal
Data transformation
Data standardization
Data masking
Substitution
Static
Dynamic
Encryption
Hashing
Hiding
Erasing
Truncation
Variance
Shuffling
Data security
What is Apache Ranger?
Apache Ranger installation using Ambari
Ambari admin UI
Add service
Service placement
Service client placement
Database creation on master
Ranger database configuration
Configuration changes
Configuration review
Deployment progress
Application restart
Apache Ranger user guide
Login to UI
Access manager
Service details
Policy definition and auditing for HDFS
Summary
Hadoop Design Considerations
Understanding data structure principles
Installing Hadoop cluster
Configuring Hadoop on NameNode
Format NameNode
Start all services
Exploring HDFS architecture
Defining NameNode
Secondary NameNode
NameNode safe mode
DataNode
Data replication
Rack awareness
HDFS WebUI
Introducing YARN
YARN architecture
Resource manager
Node manager
Configuration of YARN
Configuring HDFS high availability
During Hadoop 1.x
During Hadoop 2.x and onwards
HDFS HA cluster using NFS
Important architecture points
Configuration of HA NameNodes with shared storage
HDFS HA cluster using the quorum journal manager
Important architecture points
Configuration of HA NameNodes with QJM
Automatic failover
Important architecture points
Configuring automatic failover
Hadoop cluster composition
Typical Hadoop cluster
Best practices for Hadoop deployment
Hadoop file formats
Text/CSV file
JSON
Sequence file
Avro
Parquet
ORC
Which file format is better?
Summary
Data Movement Techniques
Batch processing versus real-time processing
Batch processing
Real-time processing
Apache Sqoop
Sqoop Import
Import into HDFS
Import a MySQL table into an HBase table
Sqoop export
Flume
Apache Flume architecture
Data flow using Flume
Flume complex data flow architecture
Flume setup
Log aggregation use case
Apache NiFi
Main concepts of Apache NiFi
Apache NiFi architecture
Key features
Real-time log capture dataflow
Kafka Connect
Kafka Connect – a brief history
Why Kafka Connect?
Kafka Connect features
Kafka Connect architecture
Kafka Connect worker modes
Standalone mode
Distributed mode
Kafka Connect cluster distributed architecture
Example 1
Example 2
Summary
Data Modeling in Hadoop
Apache Hive
Apache Hive and RDBMS
Supported datatypes
How Hive works
Hive architecture
Hive data model management
Hive tables
Managed tables
External tables
Hive table partition
Hive static partitions and dynamic partitions
Hive partition bucketing
How Hive bucketing works
Creating buckets in a non-partitioned table
Creating buckets in a partitioned table
Hive views
Syntax of a view
Hive indexes
Compact index
Bitmap index
JSON documents using Hive
Example 1 – Accessing simple JSON documents with Hive (Hive 0.14 and later versions)
Example 2 – Accessing nested JSON documents with Hive (Hive 0.14 and later versions)
Example 3 – Schema evolution with Hive and Avro (Hive 0.14 and later versions)
Apache HBase
Differences between HDFS and HBase
Differences between Hive and HBase
Key features of HBase
HBase data model
Difference between an RDBMS table and a column-oriented data store
HBase architecture
HBase architecture in a nutshell
HBase rowkey design
Example 4 – Loading data from a MySQL table to an HBase table
Example 5 – Incrementally loading data from a MySQL table to an HBase table
Example 6 – Loading the changed MySQL customer data into the HBase table
Example 7 – Hive HBase integration
Summary
Designing Real-Time Streaming Data Pipelines
Real-time streaming concepts
Data stream
Batch processing versus real-time data processing
Complex event processing
Continuous availability
Low latency
Scalable processing frameworks
Horizontal scalability
Storage
Real-time streaming components
Message queue
So what is Kafka?
Kafka features
Kafka architecture
Kafka architecture components
Kafka Connect deep dive
Kafka Connect architecture
Kafka Connect workers standalone versus distributed mode
Install Kafka
Create topics
Generate messages to verify the producer and consumer
Kafka Connect using file Source and Sink
Kafka Connect using JDBC and file Sink Connectors
Apache Storm
Features of Apache Storm
Storm topology
Storm topology components
Installing Storm on a single node cluster
Developing a real-time streaming pipeline with Storm
Streaming a pipeline from Kafka to Storm to MySQL
Streaming a pipeline from Kafka to Storm to HDFS
Other popular real-time data streaming frameworks
Kafka Streams API
Spark Streaming
Apache Flink
Apache Flink versus Spark
Apache Spark versus Storm
Summary
Large-Scale Data Processing Frameworks
MapReduce
Hadoop MapReduce
Streaming MapReduce
Java MapReduce
Summary
Apache Spark 2
Installing Spark using Ambari
Service selection in Ambari Admin
Add Service Wizard
Server placement
Clients and Slaves selection
Service customization
Software deployment
Spark installation progress
Service restarts and cleanup
Apache Spark data structures
RDDs, DataFrames and datasets
Apache Spark programming
Sample data for analysis
Interactive data analysis with pyspark
Standalone application with Spark
Spark streaming application
Spark SQL application
Summary
Building Enterprise Search Platform
The data search concept
The need for an enterprise search engine
Tools for building an enterprise search engine
Elasticsearch
Why Elasticsearch?
Elasticsearch components
Index
Document
Mapping
Cluster
Type
How to index documents in Elasticsearch?
Elasticsearch installation
Installation of Elasticsearch
Create index
Primary shard
Replica shard
Ingest documents into index
Bulk Insert
Document search
Meta fields
Mapping
Static mapping
Dynamic mapping
Elasticsearch-supported data types
Mapping example
Analyzer
Elasticsearch stack components
Beats
Logstash
Kibana
Use case
Summary
Designing Data Visualization Solutions
Data visualization
Bar/column chart
Line/area chart
Pie chart
Radar chart
Scatter/bubble chart
Other charts
Practical data visualization in Hadoop
Apache Druid
Druid components
Other required components
Apache Druid installation
Add service
Select Druid and Superset
Service placement on servers
Choose Slaves and Clients
Service configurations
Service installation
Installation summary
Sample data ingestion into Druid
MySQL database
Sample database
Download the sample dataset
Copy the data to MySQL
Verify integrity of the tables
Single Normalized Table
Apache Superset
Accessing the Superset application
Superset dashboards
Understanding Wikipedia edits data
Create Superset Slices using Wikipedia data
Unique users count
Word Cloud for top US regions
Sunburst chart – top 10 cities
Top 50 channels and namespaces via directed force layout
Top 25 countries/channels distribution
Creating Wikipedia edits dashboard from Slices
Apache Superset with RDBMS
Supported databases
Understanding the employee database
Employees table
Departments table
Department manager table
Department employees table
Titles table
Salaries table
Normalized employees table
Superset Slices for employees database
Register MySQL database/table
Slices and Dashboard creation
Department salary breakup
Salary Diversity
Salary Change Per Role Per Year
Dashboard creation
Summary
Developing Applications Using the Cloud
What is the Cloud?
Available technologies in the Cloud
Planning the Cloud infrastructure
Dedicated servers versus shared servers
Dedicated servers
Shared servers
High availability
Business continuity planning
Infrastructure unavailability
Natural disasters
Business data
BCP design example
The Hot–Hot system
The Hot–Cold system
Security
Server security
Application security
Network security
Single Sign On
The AAA requirement
Building a Hadoop cluster in the Cloud
Google Cloud Dataproc
Getting a Google Cloud account
Activating the Google Cloud Dataproc service
Creating a new Hadoop cluster
Logging in to the cluster
Deleting the cluster
Data access in the Cloud
Block storage
File storage
Encrypted storage
Cold storage
Summary
Production Hadoop Cluster Deployment
Apache Ambari architecture
The Ambari server
Daemon management
Software upgrade
Software setup
LDAP/PAM/Kerberos management
Ambari backup and restore
Miscellaneous options
Ambari Agent
Ambari web interface
Database
Setting up a Hadoop cluster with Ambari
Server configurations
Preparing the server
Installing the Ambari server
Preparing the Hadoop cluster
Creating the Hadoop cluster
Ambari web interface
The Ambari home page
Creating a cluster
Managing users and groups
Deploying views
The cluster install wizard
Naming your cluster
Selecting the Hadoop version
Selecting a server
Setting up the node
Selecting services
Service placement on nodes
Selecting slave and client nodes
Customizing services
Reviewing the services
Installing the services on the nodes
Installation summary
The cluster dashboard
Hadoop clusters
A single cluster for the entire business
Multiple Hadoop clusters
Redundancy
A fully redundant Hadoop cluster
A data redundant Hadoop cluster
Cold backup
High availability
Business continuity
Application environments
Hadoop data copy
HDFS data copy
Summary