Index
Title Page
Copyright and Credits
Modern Big Data Processing with Hadoop
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Enterprise Data Architecture Principles
Data architecture principles
Volume
Velocity
Variety
Veracity
The importance of metadata
Data governance
Fundamentals of data governance
Data security
Application security
Input data
Big data security
RDBMS security
BI security
Physical security
Data encryption
Secure key management
Data as a Service
Evolution of data architecture with Hadoop
Hierarchical database architecture
Network database architecture
Relational database architecture
Employees
Devices
Department
Department and employee mapping table
Hadoop data architecture
Data layer
Data management layer
Job execution layer
Summary
Hadoop Life Cycle Management
Data wrangling
Data acquisition
Data structure analysis
Information extraction
Unwanted data removal
Data transformation
Data standardization
Data masking
Substitution
Static
Dynamic
Encryption
Hashing
Hiding
Erasing
Truncation
Variance
Shuffling
Data security
What is Apache Ranger?
Apache Ranger installation using Ambari
Ambari admin UI
Add service
Service placement
Service client placement
Database creation on master
Ranger database configuration
Configuration changes
Configuration review
Deployment progress
Application restart
Apache Ranger user guide
Login to UI
Access manager
Service details
Policy definition and auditing for HDFS
Summary
Hadoop Design Considerations
Understanding data structure principles
Installing Hadoop cluster
Configuring Hadoop on NameNode
Format NameNode
Start all services
Exploring HDFS architecture
Defining NameNode
Secondary NameNode
NameNode safe mode
DataNode
Data replication
Rack awareness
HDFS WebUI
Introducing YARN
YARN architecture
Resource manager
Node manager
Configuration of YARN
Configuring HDFS high availability
During Hadoop 1.x
During Hadoop 2.x and onwards
HDFS HA cluster using NFS
Important architecture points
Configuration of HA NameNodes with shared storage
HDFS HA cluster using the quorum journal manager
Important architecture points
Configuration of HA NameNodes with QJM
Automatic failover
Important architecture points
Configuring automatic failover
Hadoop cluster composition
Typical Hadoop cluster
Best practices for Hadoop deployment
Hadoop file formats
Text/CSV file
JSON
Sequence file
Avro
Parquet
ORC
Which file format is better?
Summary
Data Movement Techniques
Batch processing versus real-time processing
Batch processing
Real-time processing
Apache Sqoop
Sqoop Import
Import into HDFS
Import a MySQL table into an HBase table
Sqoop export
Flume
Apache Flume architecture
Data flow using Flume
Flume complex data flow architecture
Flume setup
Log aggregation use case
Apache NiFi
Main concepts of Apache NiFi
Apache NiFi architecture
Key features
Real-time log capture dataflow
Kafka Connect
Kafka Connect – a brief history
Why Kafka Connect?
Kafka Connect features
Kafka Connect architecture
Kafka Connect worker modes
Standalone mode
Distributed mode
Kafka Connect cluster distributed architecture
Example 1
Example 2
Summary
Data Modeling in Hadoop
Apache Hive
Apache Hive and RDBMS
Supported datatypes
How Hive works
Hive architecture
Hive data model management
Hive tables
Managed tables
External tables
Hive table partition
Hive static partitions and dynamic partitions
Hive partition bucketing
How Hive bucketing works
Creating buckets in a non-partitioned table
Creating buckets in a partitioned table
Hive views
Syntax of a view
Hive indexes
Compact index
Bitmap index
JSON documents using Hive
Example 1 – Accessing simple JSON documents with Hive (Hive 0.14 and later versions)
Example 2 – Accessing nested JSON documents with Hive (Hive 0.14 and later versions)
Example 3 – Schema evolution with Hive and Avro (Hive 0.14 and later versions)
Apache HBase
Differences between HDFS and HBase
Differences between Hive and HBase
Key features of HBase
HBase data model
Difference between an RDBMS table and a column-oriented data store
HBase architecture
HBase architecture in a nutshell
HBase rowkey design
Example 4 – Loading data from a MySQL table to an HBase table
Example 5 – Incrementally loading data from a MySQL table to an HBase table
Example 6 – Loading the changed MySQL customer data into the HBase table
Example 7 – Hive HBase integration
Summary
Designing Real-Time Streaming Data Pipelines
Real-time streaming concepts
Data stream
Batch processing versus real-time data processing
Complex event processing
Continuous availability
Low latency
Scalable processing frameworks
Horizontal scalability
Storage
Real-time streaming components
Message queue
So what is Kafka?
Kafka features
Kafka architecture
Kafka architecture components
Kafka Connect deep dive
Kafka Connect architecture
Kafka Connect workers standalone versus distributed mode
Install Kafka
Create topics
Generate messages to verify the producer and consumer
Kafka Connect using file Source and Sink
Kafka Connect using JDBC and file Sink Connectors
Apache Storm
Features of Apache Storm
Storm topology
Storm topology components
Installing Storm on a single node cluster
Developing a real-time streaming pipeline with Storm
Streaming a pipeline from Kafka to Storm to MySQL
Streaming a pipeline from Kafka to Storm to HDFS
Other popular real-time data streaming frameworks
Kafka Streams API
Spark Streaming
Apache Flink
Apache Flink versus Spark
Apache Spark versus Storm
Summary
Large-Scale Data Processing Frameworks
MapReduce
Hadoop MapReduce
Streaming MapReduce
Java MapReduce
Summary
Apache Spark 2
Installing Spark using Ambari
Service selection in Ambari Admin
Add Service Wizard
Server placement
Clients and Slaves selection
Service customization
Software deployment
Spark installation progress
Service restarts and cleanup
Apache Spark data structures
RDDs, DataFrames and datasets
Apache Spark programming
Sample data for analysis
Interactive data analysis with pyspark
Standalone application with Spark
Spark streaming application
Spark SQL application
Summary
Building Enterprise Search Platform
The data search concept
The need for an enterprise search engine
Tools for building an enterprise search engine
Elasticsearch
Why Elasticsearch?
Elasticsearch components
Index
Document
Mapping
Cluster
Type
How to index documents in Elasticsearch?
Elasticsearch installation
Installation of Elasticsearch
Create index
Primary shard
Replica shard
Ingest documents into index
Bulk Insert
Document search
Meta fields
Mapping
Static mapping
Dynamic mapping
Elasticsearch-supported data types
Mapping example
Analyzer
Elasticsearch stack components
Beats
Logstash
Kibana
Use case
Summary
Designing Data Visualization Solutions
Data visualization
Bar/column chart
Line/area chart
Pie chart
Radar chart
Scatter/bubble chart
Other charts
Practical data visualization in Hadoop
Apache Druid
Druid components
Other required components
Apache Druid installation
Add service
Select Druid and Superset
Service placement on servers
Choose Slaves and Clients
Service configurations
Service installation
Installation summary
Sample data ingestion into Druid
MySQL database
Sample database
Download the sample dataset
Copy the data to MySQL
Verify integrity of the tables
Single Normalized Table
Apache Superset
Accessing the Superset application
Superset dashboards
Understanding Wikipedia edits data
Create Superset Slices using Wikipedia data
Unique users count
Word Cloud for top US regions
Sunburst chart – top 10 cities
Top 50 channels and namespaces via directed force layout
Top 25 countries/channels distribution
Creating Wikipedia edits dashboard from Slices
Apache Superset with RDBMS
Supported databases
Understanding the employee database
Employees table
Departments table
Department manager table
Department employees table
Titles table
Salaries table
Normalized employees table
Superset Slices for employees database
Register MySQL database/table
Slices and Dashboard creation
Department salary breakup
Salary Diversity
Salary Change Per Role Per Year
Dashboard creation
Summary
Developing Applications Using the Cloud
What is the Cloud?
Available technologies in the Cloud
Planning the Cloud infrastructure
Dedicated servers versus shared servers
Dedicated servers
Shared servers
High availability
Business continuity planning
Infrastructure unavailability
Natural disasters
Business data
BCP design example
The Hot–Hot system
The Hot–Cold system
Security
Server security
Application security
Network security
Single Sign On
The AAA requirement
Building a Hadoop cluster in the Cloud
Google Cloud Dataproc
Getting a Google Cloud account
Activating the Google Cloud Dataproc service
Creating a new Hadoop cluster
Logging in to the cluster
Deleting the cluster
Data access in the Cloud
Block storage
File storage
Encrypted storage
Cold storage
Summary
Production Hadoop Cluster Deployment
Apache Ambari architecture
The Ambari server
Daemon management
Software upgrade
Software setup
LDAP/PAM/Kerberos management
Ambari backup and restore
Miscellaneous options
Ambari Agent
Ambari web interface
Database
Setting up a Hadoop cluster with Ambari
Server configurations
Preparing the server
Installing the Ambari server
Preparing the Hadoop cluster
Creating the Hadoop cluster
Ambari web interface
The Ambari home page
Creating a cluster
Managing users and groups
Deploying views
The cluster install wizard
Naming your cluster
Selecting the Hadoop version
Selecting a server
Setting up the node
Selecting services
Service placement on nodes
Selecting slave and client nodes
Customizing services
Reviewing the services
Installing the services on the nodes
Installation summary
The cluster dashboard
Hadoop clusters
A single cluster for the entire business
Multiple Hadoop clusters
Redundancy
A fully redundant Hadoop cluster
A data redundant Hadoop cluster
Cold backup
High availability
Business continuity
Application environments
Hadoop data copy
HDFS data copy
Summary