About This eBook
Title Page
Copyright Page
Dedication Page
Contents
Foreword
Preface
Who This Book Is For
Why Now?
The Internet of . . . Everything
A Journey toward Ubiquitous Computing
How This Book Is Organized
Acknowledgments
About the Author
I: Directives in the Big Data Era
1. Four Rules for Data Success
When Data Became a BIG Deal
Data and the Single Server
The Big Data Trade-Off
Anatomy of a Big Data Pipeline
The Ultimate Database
Summary
II: Collecting and Sharing a Lot of Data
2. Hosting and Sharing Terabytes of Raw Data
Suffering from Files
Storage: Infrastructure as a Service
Choosing the Right Data Format
Character Encoding
Data in Motion: Data Serialization Formats
Summary
3. Building a NoSQL-Based Web App to Collect Crowd-Sourced Data
Relational Databases: Command and Control
Relational Databases versus the Internet
Nonrelational Database Models
Leaning toward Write Performance: Redis
Sharding across Many Redis Instances
NewSQL: The Return of Codd
Summary
4. Strategies for Dealing with Data Silos
A Warehouse Full of Jargon
Hadoop: The Elephant in the Warehouse
Data Silos Can Be Good
Convergence: The End of the Data Silo
Summary
III: Asking Questions about Your Data
5. Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets
What Is a Data Warehouse?
Apache Hive: Interactive Querying for Hadoop
Shark: Queries at the Speed of RAM
Data Warehousing in the Cloud
Summary
6. Building a Data Dashboard with Google BigQuery
Analytical Databases
Dremel: Spreading the Wealth
BigQuery: Data Analytics as a Service
Building a Custom Big Data Dashboard
The Future of Analytical Query Engines
Summary
7. Visualization Strategies for Exploring Large Datasets
Cautionary Tales: Translating Data into Narrative
Human Scale versus Machine Scale
Building Applications for Data Interactivity
Summary
IV: Building Data Pipelines
8. Putting It Together: MapReduce Data Pipelines
What Is a Data Pipeline?
Data Pipelines with Hadoop Streaming
A One-Step MapReduce Transformation
Managing Complexity: Python MapReduce Frameworks for Hadoop
Summary
9. Building Data Transformation Workflows with Pig and Cascading
Large-Scale Data Workflows in Practice
It’s Complicated: Multistep MapReduce Transformations
Cascading: Building Robust Data-Workflow Applications
When to Choose Pig versus Cascading
Summary
V: Machine Learning for Large Datasets
10. Building a Data Classification System with Mahout
Can Machines Predict the Future?
Challenges of Machine Learning
Apache Mahout: Scalable Machine Learning
MLBase: Distributed Machine Learning Framework
Summary
VI: Statistical Analysis for Massive Datasets
11. Using R with Large Datasets
Why Statistics Are Sexy
Strategies for Dealing with Large Datasets
Summary
12. Building Analytics Workflows Using Python and Pandas
The Snakes Are Loose in the Data Zoo
Python Libraries for Data Processing
Building More Complex Workflows
iPython: Completing the Scientific Computing Tool Chain
Summary
VII: Looking Ahead
13. When to Build, When to Buy, When to Outsource
Overlapping Solutions
Understanding Your Data Problem
A Playbook for the Build versus Buy Problem
My Own Private Data Center
Understand the Costs of Open-Source
Everything as a Service
Summary
14. The Future: Trends in Data Technology
Hadoop: The Disruptor and the Disrupted
Everything in the Cloud
The Rise and Fall of the Data Scientist
Convergence: The Ultimate Database
Convergence of Cultures
Summary
Index