Preface
What to Expect from This Book
Who This Book Is For
How to Read This Book
Overview of Chapters
Programming and Code Examples
GitHub Repository
Executing Distributed Jobs
Permissions and Citation
Feedback and How to Contact Us
Safari® Books Online
How to Contact Us
Acknowledgments
I. Introduction to Distributed Computing
1. The Age of the Data Product
What Is a Data Product?
Building Data Products at Scale with Hadoop
Leveraging Large Datasets
Hadoop for Data Products
The Data Science Pipeline and the Hadoop Ecosystem
Big Data Workflows
Conclusion
2. An Operating System for Big Data
Basic Concepts
Hadoop Architecture
A Hadoop Cluster
HDFS
Blocks
Data management
YARN
Working with a Distributed File System
Basic File System Operations
File Permissions in HDFS
Other HDFS Interfaces
Working with Distributed Computation
MapReduce: A Functional Programming Model
MapReduce: Implemented on a Cluster
MapReduce examples
Beyond a Map and Reduce: Job Chaining
Submitting a MapReduce Job to YARN
Conclusion
3. A Framework for Python and Hadoop Streaming
Hadoop Streaming
Computing on CSV Data with Streaming
Executing Streaming Jobs
A Framework for MapReduce with Python
Counting Bigrams
Other Frameworks
Advanced MapReduce
Combiners
Partitioners
Job Chaining
Conclusion
4. In-Memory Computing with Spark
Spark Basics
The Spark Stack
Resilient Distributed Datasets
Programming with RDDs
Interactive Spark Using PySpark
Writing Spark Applications
Visualizing Airline Delays with Spark
Conclusion
5. Distributed Analysis and Patterns
Computing with Keys
Compound Keys
Compound data serialization
Keyspace Patterns
Transforming the keyspace
The explode mapper
The filter mapper
The identity pattern
Pairs versus Stripes
Design Patterns
Summarization
Aggregation
Statistical summarization
Indexing
Inverted index
TF-IDF
Filtering
Top n records
Simple random sample
Bloom filtering
Toward Last-Mile Analytics
Fitting a Model
Validating Models
Conclusion
II. Workflows and Tools for Big Data Science
6. Data Mining and Warehousing
Structured Data Queries with Hive
The Hive Command-Line Interface (CLI)
Hive Query Language (HQL)
Creating a database
Creating tables
Loading data
Data Analysis with Hive
Grouping
Aggregations and joins
HBase
NoSQL and Column-Oriented Databases
Real-Time Analytics with HBase
Generating a schema
Namespaces, tables, and column families
Row keys
Inserting data with put
Get row or cell values
Scan rows
Filters
Further reading on HBase
Conclusion
7. Data Ingestion
Importing Relational Data with Sqoop
Importing from MySQL to HDFS
Importing from MySQL to Hive
Importing from MySQL to HBase
Ingesting Streaming Data with Flume
Flume Data Flows
Ingesting Product Impression Data with Flume
Conclusion
8. Analytics with Higher-Level APIs
Pig
Pig Latin
Relations and tuples
Filtering
Projection
Grouping and joining
Storing and outputting data
Data Types
Relational Operators
User-Defined Functions
Wrapping Up
Spark’s Higher-Level APIs
Spark SQL
DataFrames
Data wrangling DataFrames
Conclusion
9. Machine Learning
Scalable Machine Learning with Spark
Collaborative Filtering
User-based recommender: An example
Classification
Logistic regression classification: An example
Clustering
k-means clustering: An example
Conclusion
10. Summary: Doing Distributed Data Science
Data Product Lifecycle
Data Lakes
Data Ingestion
Computational Data Stores
Relational approaches: Hive
NoSQL approaches: HBase
Machine Learning Lifecycle
Conclusion
A. Creating a Hadoop Pseudo-Distributed Development Environment
Quick Start
Setting Up Linux
Creating a Hadoop User
Configuring SSH
Installing Java
Disabling IPv6
Installing Hadoop
Unpacking
Environment
Hadoop Configuration
Formatting the Namenode
Starting Hadoop
Restarting Hadoop
B. Installing Hadoop Ecosystem Products
Packaged Hadoop Distributions
Self-Installation of Apache Hadoop Ecosystem Products
Basic Installation and Configuration Steps
Sqoop-Specific Configurations
Hive-Specific Configuration
Hive warehouse directory
Hive metastore database
Verifying Hive is running
HBase-Specific Configurations
Starting HBase
Installing Spark
Minimizing the verbosity of Spark
Glossary
Index