About This eBook
Title Page
Copyright Page
Dedication Page
Contents
Foreword
Preface
Who This Book Is For
Why Now?
The Internet of . . . Everything
A Journey toward Ubiquitous Computing
How This Book Is Organized
Acknowledgments
About the Author
I: Directives in the Big Data Era
1. Four Rules for Data Success
When Data Became a BIG Deal
Data and the Single Server
The Big Data Trade-Off
Anatomy of a Big Data Pipeline
The Ultimate Database
Summary
II: Collecting and Sharing a Lot of Data
2. Hosting and Sharing Terabytes of Raw Data
Suffering from Files
Storage: Infrastructure as a Service
Choosing the Right Data Format
Character Encoding
Data in Motion: Data Serialization Formats
Summary
3. Building a NoSQL-Based Web App to Collect Crowd-Sourced Data
Relational Databases: Command and Control
Relational Databases versus the Internet
Nonrelational Database Models
Leaning toward Write Performance: Redis
Sharding across Many Redis Instances
NewSQL: The Return of Codd
Summary
4. Strategies for Dealing with Data Silos
A Warehouse Full of Jargon
Hadoop: The Elephant in the Warehouse
Data Silos Can Be Good
Convergence: The End of the Data Silo
Summary
III: Asking Questions about Your Data
5. Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets
What Is a Data Warehouse?
Apache Hive: Interactive Querying for Hadoop
Shark: Queries at the Speed of RAM
Data Warehousing in the Cloud
Summary
6. Building a Data Dashboard with Google BigQuery
Analytical Databases
Dremel: Spreading the Wealth
BigQuery: Data Analytics as a Service
Building a Custom Big Data Dashboard
The Future of Analytical Query Engines
Summary
7. Visualization Strategies for Exploring Large Datasets
Cautionary Tales: Translating Data into Narrative
Human Scale versus Machine Scale
Building Applications for Data Interactivity
Summary
IV: Building Data Pipelines
8. Putting It Together: MapReduce Data Pipelines
What Is a Data Pipeline?
Data Pipelines with Hadoop Streaming
A One-Step MapReduce Transformation
Managing Complexity: Python MapReduce Frameworks for Hadoop
Summary
9. Building Data Transformation Workflows with Pig and Cascading
Large-Scale Data Workflows in Practice
It’s Complicated: Multistep MapReduce Transformations
Cascading: Building Robust Data-Workflow Applications
When to Choose Pig versus Cascading
Summary
V: Machine Learning for Large Datasets
10. Building a Data Classification System with Mahout
Can Machines Predict the Future?
Challenges of Machine Learning
Apache Mahout: Scalable Machine Learning
MLBase: Distributed Machine Learning Framework
Summary
VI: Statistical Analysis for Massive Datasets
11. Using R with Large Datasets
Why Statistics Are Sexy
Strategies for Dealing with Large Datasets
Summary
12. Building Analytics Workflows Using Python and Pandas
The Snakes Are Loose in the Data Zoo
Python Libraries for Data Processing
Building More Complex Workflows
iPython: Completing the Scientific Computing Tool Chain
Summary
VII: Looking Ahead
13. When to Build, When to Buy, When to Outsource
Overlapping Solutions
Understanding Your Data Problem
A Playbook for the Build versus Buy Problem
My Own Private Data Center
Understand the Costs of Open-Source
Everything as a Service
Summary
14. The Future: Trends in Data Technology
Hadoop: The Disruptor and the Disrupted
Everything in the Cloud
The Rise and Fall of the Data Scientist
Convergence: The Ultimate Database
Convergence of Cultures
Summary
Index