Index
Head First Data Analysis
Dedication
A Note Regarding Supplemental Files
Advance Praise for Head First Data Analysis
Praise for other Head First books
Author of Head First Data Analysis
How to Use This Book: Intro
Who is this book for?
Who should probably back away from this book?
We know what you’re thinking
We know what your brain is thinking
Metacognition: thinking about thinking
Here’s what WE did
Here’s what YOU can do to bend your brain into submission
Read Me
The technical review team
Acknowledgments
Safari® Books Online
1. Introduction to Data Analysis: Break it down
Acme Cosmetics needs your help
The CEO wants data analysis to help increase sales
Data analysis is careful thinking about evidence
Define the problem
Your client will help you define your problem
Acme’s CEO has some feedback for you
Break the problem and data into smaller pieces
Divide the problem into smaller problems
Divide the data into smaller chunks
Now take another look at what you know
Evaluate the pieces
Analysis begins when you insert yourself
Make a recommendation
Your report is ready
The CEO likes your work
An article just came across the wire
You let the CEO’s beliefs take you down the wrong path
Your assumptions and beliefs about the world are your mental model
Your statistical model depends on your mental model
Mental models should always include what you don’t know
The CEO tells you what he doesn’t know
Acme just sent you a huge list of raw data
Time to drill further into the data
General American Wholesalers confirms your impression
Here’s what you did
Your analysis led your client to a brilliant decision
2. Experiments: Test your theories
It’s a coffee recession!
The Starbuzz board meeting is in three months
The Starbuzz Survey
Always use the method of comparison
Comparisons are key for observational data
Could value perception be causing the revenue decline?
A typical customer’s thinking
Observational studies are full of confounders
How location might be confounding your results
Manage confounders by breaking the data into chunks
It’s worse than we thought!
You need an experiment to say which strategy will work best
The Starbuzz CEO is in a big hurry
Starbuzz drops its prices
One month later...
Control groups give you a baseline
Not getting fired 101
Let’s experiment for real!
One month later...
Confounders also plague experiments
Avoid confounders by selecting groups carefully
Randomization selects similar groups
Your experiment is ready to go
The results are in
Starbuzz has an empirically tested sales strategy
3. Optimization: Take it to the max
You’re now in the bath toy game
Constraints limit the variables you control
Decision variables are things you can control
You have an optimization problem
Find your objective with the objective function
Your objective function
Show product mixes with your other constraints
Plot multiple constraints on the same chart
Your good options are all in the feasible region
Your new constraint changed the feasible region
Your spreadsheet does optimization
Solver crunched your optimization problem in a snap
Profits fell through the floor
Your model only describes what you put into it
Calibrate your assumptions to your analytical objectives
Watch out for negatively linked variables
Your new plan is working like a charm
Your assumptions are based on an ever-changing reality
4. Data Visualization: Pictures make you smarter
New Army needs to optimize their website
The results are in, but the information designer is out
The last information designer submitted these three infographics
What data is behind the visualizations?
Show the data!
Here’s some unsolicited advice from the last designer
Too much data is never your problem
Making the data pretty isn’t your problem either
Data visualization is all about making the right comparisons
Your visualization is already more useful than the rejected ones
Use scatterplots to explore causes
The best visualizations are highly multivariate
Show more variables by looking at charts together
The visualization is great, but the web guru’s not satisfied yet
Good visual designs help you think about causes
The experiment designers weigh in
The experiment designers have some hypotheses of their own
The client is pleased with your work
Orders are coming in from everywhere!
5. Hypothesis Testing: Say it ain’t so
Gimme some skin...
When do we start making new phone skins?
PodPhone doesn’t want you to predict their next move
Here’s everything we know
ElectroSkinny’s analysis does fit the data
ElectroSkinny obtained this confidential strategy memo
Variables can be negatively or positively linked
Causes in the real world are networked, not linear
Hypothesize PodPhone’s options
You have what you need to run a hypothesis test
Falsification is the heart of hypothesis testing
Diagnosticity helps you find the hypothesis with the least disconfirmation
You can’t rule out all the hypotheses, but you can say which is strongest
You just got a picture message...
It’s a launch!
6. Bayesian Statistics: Get past first base
The doctor has disturbing news
Let’s take the accuracy analysis one claim at a time
How common is lizard flu really?
You’ve been counting false positives
The opposite of a false positive is a true negative
All these terms describe conditional probabilities
You need to count
1 percent of people have lizard flu
Watch out for the base rate fallacy
Your chances of having lizard flu are still pretty low
Do complex probabilistic thinking with simple whole numbers
Bayes’ rule manages your base rates when you get new data
You can use Bayes’ rule over and over
Your second test result is negative
The new test has different accuracy statistics
New information can change your base rate
What a relief!
7. Subjective Probabilities: Numerical belief
Backwater Investments needs your help
Their analysts are at each other’s throats
Subjective probabilities describe expert beliefs
Subjective probabilities might show no real disagreement after all
The analysts responded with their subjective probabilities
The CEO doesn’t see what you’re up to
The CEO loves your work
The standard deviation measures how far points are from the average
You were totally blindsided by this news
Bayes’ rule is great for revising subjective probabilities
The CEO knows exactly what to do with this new information
Russian stock owners rejoice!
8. Heuristics: Analyze like a human
LitterGitters submitted their report to the city council
The LitterGitters have really cleaned up this town
The LitterGitters have been measuring their campaign’s effectiveness
The mandate is to reduce the tonnage of litter
Tonnage is unfeasible to measure
Give people a hard question, and they’ll answer an easier one instead
Littering in Dataville is a complex system
You can’t build and implement a unified litter-measuring model
Heuristics are a middle ground between going with your gut and optimization
Use a fast and frugal tree
Is there a simpler way to assess LitterGitters’ success?
Stereotypes are heuristics
Your analysis is ready to present
Looks like your analysis impressed the city council members
9. Histograms: The shape of numbers
Your annual review is coming up
Going for more cash could play out in a bunch of different ways
Here’s some data on raises
Histograms show frequencies of groups of numbers
Gaps between bars in a histogram mean gaps among the data points
Install and run R
Load data into R
R creates beautiful histograms
Make histograms from subsets of your data
Negotiation pays
What will negotiation mean for you?
10. Regression: Prediction
What are you going to do with all this money?
An analysis that tells people what to ask for could be huge
Behold... the Raise Reckoner!
Inside the algorithm will be a method to predict raises
Scatterplots compare two variables
A line could tell your clients where to aim
Predict values in each strip with the graph of averages
The regression line predicts what raises people will receive
The line is useful if your data shows a linear correlation
You need an equation to make your predictions precise
a represents the y-axis intercept
b represents the slope
Tell R to create a regression object
The regression equation goes hand in hand with your scatterplot
The regression equation is the Raise Reckoner algorithm
Your raise predictor didn’t work out as planned...
11. Error: Err Well
Your clients are pretty ticked off
What did your raise prediction algorithm do?
The segments of customers
The guy who asked for 25% went outside the model
How to handle the client who wants a prediction outside the data range
The guy who got fired because of extrapolation has cooled off
You’ve only solved part of the problem
What does the data for the screwy outcomes look like?
Chance errors are deviations from what your model predicts
Error is good for you and your client
Specify error quantitatively
Quantify your residual distribution with Root Mean Squared error
Your model in R already knows the R.M.S. error
R’s summary of your linear model shows your R.M.S. error
Segmentation is all about managing error
Good regressions balance explanation and prediction
Your segmented models manage error better than the original model
Your clients are returning in droves
12. Relational Databases: Can you relate?
The Dataville Dispatch wants to analyze sales
Here’s the data they keep to track their operations
You need to know how the data tables relate to each other
A database is a collection of data with well-specified relations to each other
Trace a path through the relations to make the comparison you need
Create a spreadsheet that goes across that path
Your summary ties article count and sales together
Looks like your scatterplot is going over really well
Copying and pasting all that data was a pain
Relational databases manage relations for you
Dataville Dispatch built an RDBMS with your relationship diagram
Dataville Dispatch extracted your data using the SQL language
Comparison possibilities are endless if your data is in a RDBMS
You’re on the cover
13. Cleaning Data: Impose order
Just got a client list from a defunct competitor
The dirty secret of data analysis
Head First Head Hunters wants the list for their sales team
Cleaning messy data is all about preparation
Once you’re organized, you can fix the data itself
Use the # sign as a delimiter
Excel split your data into columns using the delimiter
Use SUBSTITUTE to replace the caret character
You cleaned up all the first names
The last name pattern is too complex for SUBSTITUTE
Handle complex patterns with nested text formulas
R can use regular expressions to crunch complex data patterns
The sub command fixed your last names
Now you can ship the data to your client
Maybe you’re not quite done yet...
Sort your data to show duplicate values together
The data is probably from a relational database
Remove duplicate names
You created nice, clean, unique records
Head First Head Hunters is recruiting like gangbusters!
Leaving town...
It’s been great having you here in Dataville!
A. Leftovers: The Top Ten Things (we didn’t cover)
#1: Everything else in statistics
#2: Excel skills
#3: Edward Tufte and his principles of visualization
#4: PivotTables
#5: The R community
#6: Nonlinear and multiple regression
#7: Null-alternative hypothesis testing
#8: Randomness
#9: Google Docs
#10: Your expertise
B. Install R: Start R up!
Get started with R
C. Install Excel Analysis Tools: The ToolPak
Install the data analysis tools in Excel
Index
About the Author
Copyright