Doing Data Science by O'Neil, Cathy

Index

Dedication Preface

Motivation Origins of the Class Origins of the Book What to Expect from This Book How This Book Is Organized How to Read This Book How Code Is Used in This Book Who This Book Is For Prerequisites Supplemental Reading About the Contributors Conventions Used in This Book Using Code Examples Safari® Books Online How to Contact Us Acknowledgments

1. Introduction: What Is Data Science?

Big Data and Data Science Hype Getting Past the Hype Why Now?

Datafication

The Current Landscape (with a Little History)

Data Science Jobs

A Data Science Profile Thought Experiment: Meta-Definition OK, So What Is a Data Scientist, Really?

In Academia In Industry

2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process

Statistical Thinking in the Age of Big Data

Statistical Inference Populations and Samples Populations and Samples of Big Data Big Data Can Mean Big Assumptions

Can N=ALL? Data is not objective

Modeling

What is a model? Statistical modeling But how do you build a model? Probability distributions Fitting a model Overfitting

Exploratory Data Analysis

Philosophy of Exploratory Data Analysis Exercise: EDA

Sample code

The Data Science Process

A Data Scientist’s Role in This Process

Thought Experiment: How Would You Simulate Chaos? Case Study: RealDirect

How Does RealDirect Make Money? Exercise: RealDirect Data Strategy

Sample R code

3. Algorithms

Machine Learning Algorithms Three Basic Algorithms

Linear Regression

Start by writing something down Fitting the model Extending beyond least squares

Adding in modeling assumptions about the errors Adding other predictors Transformations

Review Exercise

k-Nearest Neighbors (k-NN)

Example with credit scores Similarity or distance metrics Training and test sets Pick an evaluation metric Putting it all together Choosing k What are the modeling assumptions?

k-means

2D version

Exercise: Basic Machine Learning Algorithms

Solutions

Sample R code: Linear regression on the housing dataset Sample R code: K-NN on the housing dataset

Summing It All Up Thought Experiment: Automated Statistician

4. Spam Filters, Naive Bayes, and Wrangling

Thought Experiment: Learning by Example

Why Won’t Linear Regression Work for Filtering Spam? How About k-nearest Neighbors?

Naive Bayes

Bayes Law A Spam Filter for Individual Words A Spam Filter That Combines Words: Naive Bayes

Fancy It Up: Laplace Smoothing Comparing Naive Bayes to k-NN Sample Code in bash Scraping the Web: APIs and Other Tools Jake’s Exercise: Naive Bayes for Article Classification

Sample R Code for Dealing with the NYT API

5. Logistic Regression

Thought Experiments Classifiers

Runtime You Interpretability Scalability

M6D Logistic Regression Case Study

Click Models The Underlying Math Estimating α and β Newton’s Method Stochastic Gradient Descent Implementation Evaluation

Media 6 Degrees Exercise

Sample R Code

6. Time Stamps and Financial Modeling

Kyle Teague and GetGlue Timestamps

Exploratory Data Analysis (EDA) Metrics and New Variables or Features What’s Next?

Cathy O’Neil Thought Experiment Financial Modeling

In-Sample, Out-of-Sample, and Causality Preparing Financial Data Log Returns Example: The S&P Index Working out a Volatility Measurement Exponential Downweighting The Financial Modeling Feedback Loop Why Regression? Adding Priors A Baby Model Exercise: GetGlue and Timestamped Event Data Exercise: Financial Data

7. Extracting Meaning from Data

William Cukierski

Background: Data Science Competitions Background: Crowdsourcing

The Kaggle Model

A Single Contestant Their Customers

Thought Experiment: What Are the Ethical Implications of a Robo-Grader? Feature Selection

Example: User Retention Filters Wrappers

Selecting an algorithm Selection criterion In practice

Embedded Methods: Decision Trees Entropy The Decision Tree Algorithm Handling Continuous Variables in Decision Trees Random Forests User Retention: Interpretability Versus Predictive Power

David Huffaker: Google’s Hybrid Approach to Social Research

Moving from Descriptive to Predictive Social at Google Privacy Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?

8. Recommendation Engines: Building a User-Facing Data Product at Scale

A Real-World Recommendation Engine

Nearest Neighbor Algorithm Review Some Problems with Nearest Neighbors Beyond Nearest Neighbor: Machine Learning Classification The Dimensionality Problem Singular Value Decomposition (SVD) Important Properties of SVD Principal Component Analysis (PCA)

Theorem: The resulting latent features will be uncorrelated

Alternating Least Squares

Theorem with no proof: The preceding algorithm will converge if your prior is large enough

Fix V and Update U Last Thoughts on These Algorithms

Thought Experiment: Filter Bubbles Exercise: Build Your Own Recommendation System

Sample Code in Python

9. Data Visualization and Fraud Detection

Data Visualization History

Gabriel Tarde Mark’s Thought Experiment

What Is Data Science, Redux?

Processing Franco Moretti

A Sample of Data Visualization Projects Mark’s Data Visualization Projects

New York Times Lobby: Moveable Type Project Cascade: Lives on a Screen Cronkite Plaza eBay Transactions and Books Public Theater Shakespeare Machine Goals of These Exhibits

Data Science and Risk

About Square The Risk Challenge

Detecting suspicious activity using machine learning

The Trouble with Performance Estimation

Defining the error metric Defining the labels Challenges in features and learning

Model Building Tips

Code readability and reusability Get a pair! Productionizing machine learning models

Data Visualization at Square Ian’s Thought Experiment Data Visualization for the Rest of Us

Data Visualization Exercise

10. Social Networks and Data Journalism

Social Network Analysis at Morningside Analytics

Case-Attribute Data versus Social Network Data

Social Network Analysis Terminology from Social Networks

Centrality Measures The Industry of Centrality Measures

Thought Experiment Morningside Analytics

How Visualizations Help Us Find Schools of Fish

More Background on Social Network Analysis from a Statistical Point of View

Representations of Networks and Eigenvalue Centrality A First Example of Random Graphs: The Erdős-Rényi Model A Second Example of Random Graphs: The Exponential Random Graph Model

Inference for ERGMs Further examples of random graphs: latent space models, small-world networks

Data Journalism

A Bit of History on Data Journalism Writing Technical Journalism: Advice from an Expert

11. Causality

Correlation Doesn’t Imply Causation

Asking Causal Questions Confounders: A Dating Example

OK Cupid’s Attempt The Gold Standard: Randomized Clinical Trials A/B Tests Second Best: Observational Studies

Simpson’s Paradox The Rubin Causal Model Visualizing Causality Definition: The Causal Effect

Three Pieces of Advice

12. Epidemiology

Madigan’s Background Thought Experiment Modern Academic Statistics Medical Literature and Observational Studies Stratification Does Not Solve the Confounder Problem

What Do People Do About Confounding Things in Practice?

Is There a Better Way? Research Experiment (Observational Medical Outcomes Partnership) Closing Thought Experiment

13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation

Claudia’s Data Scientist Profile

The Life of a Chief Data Scientist On Being a Female Data Scientist

Data Mining Competitions How to Be a Good Modeler Data Leakage

Market Predictions Amazon Case Study: Big Spenders A Jewelry Sampling Problem IBM Customer Targeting Breast Cancer Detection Pneumonia Prediction

How to Avoid Leakage Evaluating Models

Accuracy: Meh Probabilities Matter, Not 0s and 1s

Choosing an Algorithm A Final Example Parting Thoughts

14. Data Engineering: MapReduce, Pregel, and Hadoop

About David Crawshaw Thought Experiment MapReduce Word Frequency Problem

Enter MapReduce

Other Examples of MapReduce

What Can’t MapReduce Do?

Pregel About Josh Wills Thought Experiment On Being a Data Scientist

Data Abundance Versus Data Scarcity Designing Models

Mind the gap

Economic Interlude: Hadoop

A Brief Introduction to Hadoop Cloudera

Back to Josh: Workflow So How to Get Started with Hadoop?

15. The Students Speak

Process Thinking Naive No Longer Helping Hands Your Mileage May Vary Bridging Tunnels Some of Our Work

16. Next-Generation Data Scientists, Hubris, and Ethics

What Just Happened? What Is Data Science (Again)? What Are Next-Gen Data Scientists?

Being Problem Solvers Cultivating Soft Skills Being Question Askers

Being an Ethical Data Scientist Career Advice

Index About the Authors Colophon Copyright
