Index
Preface
What to Expect from This Book
Who This Book Is For
How to Read This Book
Overview of Chapters
Programming and Code Examples
GitHub Repository
Executing Distributed Jobs
Permissions and Citation
Feedback and How to Contact Us
Safari® Books Online
How to Contact Us
Acknowledgments
I. Introduction to Distributed Computing
1. The Age of the Data Product
What Is a Data Product?
Building Data Products at Scale with Hadoop
Leveraging Large Datasets
Hadoop for Data Products
The Data Science Pipeline and the Hadoop Ecosystem
Big Data Workflows
Conclusion
2. An Operating System for Big Data
Basic Concepts
Hadoop Architecture
A Hadoop Cluster
HDFS
Blocks
Data management
YARN
Working with a Distributed File System
Basic File System Operations
File Permissions in HDFS
Other HDFS Interfaces
Working with Distributed Computation
MapReduce: A Functional Programming Model
MapReduce: Implemented on a Cluster
MapReduce examples
Beyond a Map and Reduce: Job Chaining
Submitting a MapReduce Job to YARN
Conclusion
3. A Framework for Python and Hadoop Streaming
Hadoop Streaming
Computing on CSV Data with Streaming
Executing Streaming Jobs
A Framework for MapReduce with Python
Counting Bigrams
Other Frameworks
Advanced MapReduce
Combiners
Partitioners
Job Chaining
Conclusion
4. In-Memory Computing with Spark
Spark Basics
The Spark Stack
Resilient Distributed Datasets
Programming with RDDs
Interactive Spark Using PySpark
Writing Spark Applications
Visualizing Airline Delays with Spark
Conclusion
5. Distributed Analysis and Patterns
Computing with Keys
Compound Keys
Compound data serialization
Keyspace Patterns
Transforming the keyspace
The explode mapper
The filter mapper
The identity pattern
Pairs versus Stripes
Design Patterns
Summarization
Aggregation
Statistical summarization
Indexing
Inverted index
TF-IDF
Filtering
Top n records
Simple random sample
Bloom filtering
Toward Last-Mile Analytics
Fitting a Model
Validating Models
Conclusion
II. Workflows and Tools for Big Data Science
6. Data Mining and Warehousing
Structured Data Queries with Hive
The Hive Command-Line Interface (CLI)
Hive Query Language (HQL)
Creating a database
Creating tables
Loading data
Data Analysis with Hive
Grouping
Aggregations and joins
HBase
NoSQL and Column-Oriented Databases
Real-Time Analytics with HBase
Generating a schema
Namespaces, tables, and column families
Row keys
Inserting data with put
Get row or cell values
Scan rows
Filters
Further reading on HBase
Conclusion
7. Data Ingestion
Importing Relational Data with Sqoop
Importing from MySQL to HDFS
Importing from MySQL to Hive
Importing from MySQL to HBase
Ingesting Streaming Data with Flume
Flume Data Flows
Ingesting Product Impression Data with Flume
Conclusion
8. Analytics with Higher-Level APIs
Pig
Pig Latin
Relations and tuples
Filtering
Projection
Grouping and joining
Storing and outputting data
Data Types
Relational Operators
User-Defined Functions
Wrapping Up
Spark’s Higher-Level APIs
Spark SQL
DataFrames
Data wrangling with DataFrames
Conclusion
9. Machine Learning
Scalable Machine Learning with Spark
Collaborative Filtering
User-based recommender: An example
Classification
Logistic regression classification: An example
Clustering
k-means clustering: An example
Conclusion
10. Summary: Doing Distributed Data Science
Data Product Lifecycle
Data Lakes
Data Ingestion
Computational Data Stores
Relational approaches: Hive
NoSQL approaches: HBase
Machine Learning Lifecycle
Conclusion
A. Creating a Hadoop Pseudo-Distributed Development Environment
Quick Start
Setting Up Linux
Creating a Hadoop User
Configuring SSH
Installing Java
Disabling IPv6
Installing Hadoop
Unpacking
Environment
Hadoop Configuration
Formatting the Namenode
Starting Hadoop
Restarting Hadoop
B. Installing Hadoop Ecosystem Products
Packaged Hadoop Distributions
Self-Installation of Apache Hadoop Ecosystem Products
Basic Installation and Configuration Steps
Sqoop-Specific Configurations
Hive-Specific Configuration
Hive warehouse directory
Hive metastore database
Verifying Hive is running
HBase-Specific Configurations
Starting HBase
Installing Spark
Minimizing the verbosity of Spark
Glossary
Index