Index
Programming Elastic MapReduce
Preface
What Is AWS?
What’s in This Book?
Sign Up for AWS
Code Samples in This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
1. Introduction to Amazon Elastic MapReduce
Amazon Web Services Used in This Book
Amazon Elastic MapReduce
Amazon EMR and the Hadoop Ecosystem
Amazon Elastic MapReduce Versus Traditional Hadoop Installs
Data Locality
Hardware
Complexity
Application Building Blocks
2. Data Collection and Data Analysis with AWS
Log Analysis Application
Log Messages as a Data Set for Analytics
Understanding MapReduce
Collection Stage
Simulating Syslog Data
Generating Logs with Bash
Moving Data to S3 Storage
All Roads Lead to S3
Developing a MapReduce Application
Custom JAR MapReduce Job
Running an Amazon EMR Cluster
Viewing Our Results
Debugging a Job Flow
Running Our Job Flow with Debugging
Reviewing Job Flow Log Structure
Debug Through the Amazon EMR Console
Our Application and Real-World Uses
3. Data Filtering Design Patterns and Scheduling Work
Extending the Application Example
Understanding Web Server Logs
Finding Errors in the Web Logs Using Data Filtering
Mapper Code
Reducer Code
Driver Code
Running the MapReduce Filter Job
Analyzing the Results
Building Summary Counts in Data Sets
Mapper Code
Reducer Code
Analyzing the Filtered Counts Job
Job Flow Scheduling
Scheduling with the CLI
Scheduling with AWS Data Pipeline
Creating a Pipeline
Adding Data Nodes
Adding Activities
Scheduling Pipelines
Reviewing Pipeline Status
AWS Pipeline Costs
Real-World Uses
4. Data Analysis with Hive and Pig in Amazon EMR
Amazon Job Flow Technologies
What Is Pig?
Utilizing Pig in Amazon EMR
Connecting to the Master Node
Pig Latin Primer
LOAD
STORE
DUMP
ILLUSTRATE
FOREACH
FILTER
GROUP
Exploring Data with Pig Latin
Running Pig Scripts in Amazon EMR
What Is Hive?
Utilizing Hive in Amazon EMR
Hive Primer
SerDe
CREATE TABLE
INSERT
Exploring Data with Hive
Running Hive Scripts in Amazon EMR
Finding the Top 10 with Hive
Our Application with Hive and Pig
5. Machine Learning Using EMR
A Quick Tour of Machine Learning
Python and EMR
Why Python?
The Input Data
The Mapper
The Reducer
Putting It All Together
What About Java?
What’s Next?
6. Planning AWS Projects and Managing Costs
Developing a Project Cost Model
Software Licensing
AWS and Cloud Licensing
Private Data Center and AWS Cost Comparisons
Cost Calculations on an Example Application
Optimizing AWS Resources to Reduce Project Costs
Amazon Regions
Amazon Availability Zones
EC2 and EMR Costs with On Demand, Reserve, and Spot Instances
Reserve Instances
Spot Instances
Reducing AWS Project Costs
EMR and EC2 usage billed by the hour
Cost efficiencies with reserved and spot instances
Project storage costs
Data life cycles
Amazon Tools for Estimating Your Project Costs
A. Amazon Web Services Resources and Tools
Amazon AWS Online Resources
Amazon AWS Cost Estimation Tools
AWS Best Practices and Architecture
Amazon EMR Distributions
B. Cloud Computing, Amazon Web Services, and Their Impacts
AWS Service Delivery Models
Platform as a Service
Infrastructure as a Service
Storage as a Service
Performance
Elasticity and Growth
Fixed Capacity
Variable Capacity
Security
Security Is a Shared Responsibility
Data Security in Elastic MapReduce
Uptime and Availability
C. Installation and Setup
Prerequisites
Installing Hadoop
Building MapReduce Applications
Running MapReduce Applications Locally
Installing Pig
Installing Hive
Index
Colophon
Copyright