Log In
Or create an account ->
Imperial Library
Home
About
News
Upload
Forum
Help
Login/SignUp
Index
Foreword
Preface
Who This Book Is For
What You Should Already Know
What This Book Leaves Out
How This Book Works
Which Software Versions This Book Uses
Conventions Used in This Book
IP Addresses
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
I. Introduction to the Cloud
1. Why Hadoop in the Cloud?
What Is the Cloud?
What Does Hadoop in the Cloud Mean?
Reasons to Run Hadoop in the Cloud
Reasons to Not Run Hadoop in the Cloud
What About Security?
Hybrid Clouds
Hadoop Solutions from Cloud Providers
Elastic MapReduce
Google Cloud Dataproc
HDInsight
Hadoop-Like Services
A Spectrum of Choices
Getting Started
2. Overview and Comparison of Cloud Providers
Amazon Web Services
References
Google Cloud Platform
References
Microsoft Azure
References
Which One Should You Use?
II. Cloud Primer
3. Instances
Instance Types
Regions and Availability Zones
Instance Control
Temporary Instances
Spot Instances
Preemptible Instances
Images
No Instance Is an Island
4. Networking and Security
A Drink of CIDR
Virtual Networks
Private DNS
Public IP Addresses and DNS
Virtual Networks and Regions
Routing
Routing in AWS
Routing in Google Cloud Platform
Routing in Azure
Network Security Rules
Inbound Versus Outbound
Allow Versus Deny
Network Security Rules in AWS
Security groups
Network ACLs
Network Security Rules in Google Cloud Platform
Network Security Rules in Azure
Putting Networking and Security Together
What About the Data?
5. Storage
Block Storage
Block Storage in AWS
Block Storage in Google Cloud Platform
Block Storage in Azure
Object Storage
Buckets
Data Objects
Object Access
Object Storage in AWS
Object Storage in Google Cloud Platform
Object Storage in Azure
Cloud Relational Databases
Cloud Relational Databases in AWS
Cloud Relational Databases in Google Cloud Platform
Cloud Relational Databases in Azure
Cloud NoSQL Databases
Where to Start?
III. A Simple Cluster in the Cloud
6. Setting Up in AWS
Prerequisites
Allocating Instances
Generating a Key Pair
Launching Instances
The manager instance
The worker instances
Securing the Instances
Next Steps
7. Setting Up in Google Cloud Platform
Prerequisites
Creating a Project
Allocating Instances
SSH Keys
Creating Instances
The manager instance
The worker instances
Securing the Instances
Next Steps
8. Setting Up in Azure
Prerequisites
Creating a Resource Group
Creating Resources
SSH Keys
Creating Virtual Machines
The Manager Instance
The Worker Instances
Next Steps
9. Standing Up a Cluster
The JDK
Hadoop Accounts
Passwordless SSH
Hadoop Installation
HDFS and YARN Configuration
The Environment
XML Configuration Files
Finishing Up Configuration
Startup
SSH Tunneling
Running a Test Job
What If the Job Hangs?
Running Basic Data Loading and Analysis
Wikipedia Exports
Analyzing a Small Export
Generating the export
The MapReduce jobs
Go Bigger
IV. Enhancing Your Cluster
10. High Availability
Planning HA in the Cloud
HDFS HA
What about the datanodes?
YARN HA
Installing and Configuring ZooKeeper
Adding New HDFS and YARN Daemons
The Second Manager
HDFS HA Configuration
YARN HA Configuration
Testing HA
Improving the HA Configuration
A Bigger Cluster
Complete HA
A Third Availability Zone?
Benchmarking HA
MRBench
Terasort
Grains of Salt
11. Relational Data with Apache Hive
Planning for Hive in the Cloud
Installing and Configuring Hive
Startup
Running Some Test Hive Queries
Switching to a Remote Metastore
The Remote Metastore and Stopped Clusters
Hive Control Scripts
Hive on S3
Configuring the S3 Filesystem
Adding Data to S3
Configuring S3 Authentication
Configuring the S3 Endpoint
External Table in S3
What About Google Cloud Platform and Azure?
A Step Toward Transient Clusters
A Different Means of Computation
12. Streaming in the Cloud with Apache Spark
Planning for Spark in the Cloud
Installing and Configuring Spark
Startup
Running Some Test Jobs
Configuring Hive on Spark
Add Spark Libraries to Hive
Configure Hive for Spark
Switch YARN to the Fair Scheduler
Try Out Hive on Spark on YARN
Spark Streaming from AWS Kinesis
Creating a Kinesis Stream
Populating the Stream with Data
Streaming Kinesis Data into Spark
Packaging the streaming job
Running the streaming job
Stopping the streaming job
What About Google Cloud Platform and Azure?
Building Clusters Versus Building Clusters Well
V. Care and Feeding of Hadoop in the Cloud
13. Pricing and Performance
Picking Instance Types
The Criteria
General Cluster Instance Roles
Persistent Versus Ephemeral Block Storage
Stopping and Starting Entire Clusters
Using Temporary Instances
Geographic Considerations
Regions
Availability Zones
Performance and Networking
14. Network Topologies
Public and Private Subnets
SSH Tunneling
SOCKS Proxy
VPN Access
Access from Other Subnets
Cluster Topologies
The Public Cluster
The Secured Public Cluster
Gateway Instances
The Private Cluster
Cluster Access to the Internet and Cloud Provider Services
Geographic Considerations
Regions
Availability Zones
Starting Topologies
Higher-Level Planning
15. Patterns for Cluster Usage
Long-Running or Transient?
Single-User or Multitenant?
Self-Service or Managed?
Cloud-Only or Hybrid?
Watching Cost
The Rising Need for Automation
16. Using Images for Cluster Management
The Structure of an Image
EC2 Images
GCE Images
Azure Images
Image Preparation
Wait, I’m Using That!
Image Creation
Image Creation in AWS
Image Creation in Google Cloud Platform
Image Creation in Azure
Image Use
Scripting Hadoop Configuration
Image Maintenance
Image Deletion
Image Deletion in AWS
Image Deletion in Google Cloud Platform
Image Deletion in Azure
Automated Image Creation with Packer
Automated Cloud Cluster Creation
Cloudera Director
Hortonworks Data Cloud
Qubole Data Service
General System Management Tools
Images or Tools?
More Tooling
17. Monitoring and Automation
Monitoring Choices
Cloud Provider Monitoring Services
Rolling Your Own
Cloud Provider Command-Line Interfaces
AWS CLI
Google Cloud Platform CLI
Azure CLI
Data Formatting for CLI Results
What to Monitor
Instance Existence
Instance Reachability
Reachability checks using a provider CLI
Rolling your own reachability checks
Hadoop Daemon Status
Cloud provider custom metrics
Rolling your own Hadoop daemon status checks
System Load
AWS system monitoring
Google Cloud Platform system monitoring
Azure system monitoring
Rolling your own system checks
Putting Scripting to Use
Custom Metrics in CloudWatch
Basic Metrics
Defining a Custom Metric
Feeding Custom Metric Data to CloudWatch
Setting an Alarm on a Custom Metric
Elastic Compute Using a Custom Metric
A Custom Metric for Compute Capacity
Prerequisites for Autoscaling Compute
Triggering Autoscaling with an Alarm Action
What About Shrinking?
Other Things to Watch
Ingesting Logs into CloudWatch
Creating an IAM User for Log Streaming
Installing the CloudWatch Agent
Creating a Metric Filter
Creating an Alarm from a Metric Filter
So Much More to See and Do
18. Backup and Restoration
Patterns to Supplement Backups
Backup via Imaging
HDFS Replication
Cloud Storage Filesystems
HDFS Snapshots
Hive Metastore Replication
Logs
A General Cloud Hadoop Backup Strategy
Not So Different, But Better
To the Cloud
A. Hadoop Component Start and Stop Scripts
Apache ZooKeeper
Apache Hive
B. Hadoop Cluster Configuration Scripts
SSH Key Creation and Distribution
Configuration Update Script
New Worker Configuration Update Script
C. Monitoring Cloud Clusters with Nagios
Where Nagios Should Run
Instance Existence Through Ping
Hosts and Host Groups
Services and Service Groups
Provider CLI Integration
Index
← Prev
Back
Next →
← Prev
Back
Next →