Index
Dedication Preface
Motivation Origins of the Class Origins of the Book What to Expect from This Book How This Book Is Organized How to Read This Book How Code Is Used in This Book Who This Book Is For Prerequisites Supplemental Reading About the Contributors Conventions Used in This Book Using Code Examples Safari® Books Online How to Contact Us Acknowledgments
1. Introduction: What Is Data Science?
Big Data and Data Science Hype Getting Past the Hype Why Now?
Datafication
The Current Landscape (with a Little History)
Data Science Jobs
A Data Science Profile Thought Experiment: Meta-Definition OK, So What Is a Data Scientist, Really?
In Academia In Industry
2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
Statistical Inference Populations and Samples Populations and Samples of Big Data Big Data Can Mean Big Assumptions
Can N=ALL? Data is not objective
Modeling
What is a model? Statistical modeling But how do you build a model? Probability distributions Fitting a model Overfitting
Exploratory Data Analysis
Philosophy of Exploratory Data Analysis Exercise: EDA
Sample code
The Data Science Process
A Data Scientist’s Role in This Process
Thought Experiment: How Would You Simulate Chaos? Case Study: RealDirect
How Does RealDirect Make Money? Exercise: RealDirect Data Strategy
Sample R code
3. Algorithms
Machine Learning Algorithms Three Basic Algorithms
Linear Regression
Start by writing something down Fitting the model Extending beyond least squares
Adding in modeling assumptions about the errors Adding other predictors Transformations
Review Exercise
k-Nearest Neighbors (k-NN)
Example with credit scores Similarity or distance metrics Training and test sets Pick an evaluation metric Putting it all together Choosing k What are the modeling assumptions?
k-means
2D version
Exercise: Basic Machine Learning Algorithms
Solutions
Sample R code: Linear regression on the housing dataset Sample R code: K-NN on the housing dataset
Summing It All Up Thought Experiment: Automated Statistician
4. Spam Filters, Naive Bayes, and Wrangling
Thought Experiment: Learning by Example
Why Won’t Linear Regression Work for Filtering Spam? How About k-nearest Neighbors?
Naive Bayes
Bayes Law A Spam Filter for Individual Words A Spam Filter That Combines Words: Naive Bayes
Fancy It Up: Laplace Smoothing Comparing Naive Bayes to k-NN Sample Code in bash Scraping the Web: APIs and Other Tools Jake’s Exercise: Naive Bayes for Article Classification
Sample R Code for Dealing with the NYT API
5. Logistic Regression
Thought Experiments Classifiers
Runtime You Interpretability Scalability
M6D Logistic Regression Case Study
Click Models The Underlying Math Estimating α and β Newton’s Method Stochastic Gradient Descent Implementation Evaluation
Media 6 Degrees Exercise
Sample R Code
6. Time Stamps and Financial Modeling
Kyle Teague and GetGlue Timestamps
Exploratory Data Analysis (EDA) Metrics and New Variables or Features What’s Next?
Cathy O’Neil Thought Experiment Financial Modeling
In-Sample, Out-of-Sample, and Causality Preparing Financial Data Log Returns Example: The S&P Index Working out a Volatility Measurement Exponential Downweighting The Financial Modeling Feedback Loop Why Regression? Adding Priors A Baby Model Exercise: GetGlue and Timestamped Event Data Exercise: Financial Data
7. Extracting Meaning from Data
William Cukierski
Background: Data Science Competitions Background: Crowdsourcing
The Kaggle Model
A Single Contestant Their Customers
Thought Experiment: What Are the Ethical Implications of a Robo-Grader? Feature Selection
Example: User Retention Filters Wrappers
Selecting an algorithm Selection criterion In practice
Embedded Methods: Decision Trees Entropy The Decision Tree Algorithm Handling Continuous Variables in Decision Trees Random Forests User Retention: Interpretability Versus Predictive Power
David Huffaker: Google’s Hybrid Approach to Social Research
Moving from Descriptive to Predictive Social at Google Privacy Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
8. Recommendation Engines: Building a User-Facing Data Product at Scale
A Real-World Recommendation Engine
Nearest Neighbor Algorithm Review Some Problems with Nearest Neighbors Beyond Nearest Neighbor: Machine Learning Classification The Dimensionality Problem Singular Value Decomposition (SVD) Important Properties of SVD Principal Component Analysis (PCA)
Theorem: The resulting latent features will be uncorrelated
Alternating Least Squares
Theorem with no proof: The preceding algorithm will converge if your prior is large enough
Fix V and Update U Last Thoughts on These Algorithms
Thought Experiment: Filter Bubbles Exercise: Build Your Own Recommendation System
Sample Code in Python
9. Data Visualization and Fraud Detection
Data Visualization History
Gabriel Tarde Mark’s Thought Experiment
What Is Data Science, Redux?
Processing Franco Moretti
A Sample of Data Visualization Projects Mark’s Data Visualization Projects
New York Times Lobby: Moveable Type Project Cascade: Lives on a Screen Cronkite Plaza eBay Transactions and Books Public Theater Shakespeare Machine Goals of These Exhibits
Data Science and Risk
About Square The Risk Challenge
Detecting suspicious activity using machine learning
The Trouble with Performance Estimation
Defining the error metric Defining the labels Challenges in features and learning
Model Building Tips
Code readability and reusability Get a pair! Productionizing machine learning models
Data Visualization at Square Ian’s Thought Experiment Data Visualization for the Rest of Us
Data Visualization Exercise
10. Social Networks and Data Journalism
Social Network Analysis at Morning Analytics
Case-Attribute Data versus Social Network Data
Social Network Analysis Terminology from Social Networks
Centrality Measures The Industry of Centrality Measures
Thought Experiment Morningside Analytics
How Visualizations Help Us Find Schools of Fish
More Background on Social Network Analysis from a Statistical Point of View
Representations of Networks and Eigenvalue Centrality A First Example of Random Graphs: The Erdos-Renyi Model A Second Example of Random Graphs: The Exponential Random Graph Model
Inference for ERGMs Further examples of random graphs: latent space models, small-world networks
Data Journalism
A Bit of History on Data Journalism Writing Technical Journalism: Advice from an Expert
11. Causality
Correlation Doesn’t Imply Causation
Asking Causal Questions Confounders: A Dating Example
OK Cupid’s Attempt The Gold Standard: Randomized Clinical Trials A/B Tests Second Best: Observational Studies
Simpson’s Paradox The Rubin Causal Model Visualizing Causality Definition: The Causal Effect
Three Pieces of Advice
12. Epidemiology
Madigan’s Background Thought Experiment Modern Academic Statistics Medical Literature and Observational Studies Stratification Does Not Solve the Confounder Problem
What Do People Do About Confounding Things in Practice?
Is There a Better Way? Research Experiment (Observational Medical Outcomes Partnership) Closing Thought Experiment
13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
Claudia’s Data Scientist Profile
The Life of a Chief Data Scientist On Being a Female Data Scientist
Data Mining Competitions How to Be a Good Modeler Data Leakage
Market Predictions Amazon Case Study: Big Spenders A Jewelry Sampling Problem IBM Customer Targeting Breast Cancer Detection Pneumonia Prediction
How to Avoid Leakage Evaluating Models
Accuracy: Meh Probabilities Matter, Not 0s and 1s
Choosing an Algorithm A Final Example Parting Thoughts
14. Data Engineering: MapReduce, Pregel, and Hadoop
About David Crawshaw Thought Experiment MapReduce Word Frequency Problem
Enter MapReduce
Other Examples of MapReduce
What Can’t MapReduce Do?
Pregel About Josh Wills Thought Experiment On Being a Data Scientist
Data Abundance Versus Data Scarcity Designing Models
Mind the gap
Economic Interlude: Hadoop
A Brief Introduction to Hadoop Cloudera
Back to Josh: Workflow So How to Get Started with Hadoop?
15. The Students Speak
Process Thinking Naive No Longer Helping Hands Your Mileage May Vary Bridging Tunnels Some of Our Work
16. Next-Generation Data Scientists, Hubris, and Ethics
What Just Happened? What Is Data Science (Again)? What Are Next-Gen Data Scientists?
Being Problem Solvers Cultivating Soft Skills Being Question Askers
Being an Ethical Data Scientist Career Advice
Index About the Authors Colophon Copyright
