Log In
Or create an account ->
Imperial Library
Home
About
News
Upload
Forum
Help
Login/SignUp
Index
Data Analysis with R
Credits
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. RefresheR
Navigating the basics
Arithmetic and assignment
Logicals and characters
Flow of control
Getting help in R
Vectors
Subsetting
Vectorized functions
Advanced subsetting
Recycling
Functions
Matrices
Loading data into R
Working with packages
Exercises
Summary
2. The Shape of Data
Univariate data
Frequency distributions
Central tendency
Spread
Populations, samples, and estimation
Probability distributions
Visualization methods
Exercises
Summary
3. Describing Relationships
Multivariate data
Relationships between a categorical and a continuous variable
Relationships between two categorical variables
The relationship between two continuous variables
Covariance
Correlation coefficients
Comparing multiple correlations
Visualization methods
Categorical and continuous variables
Two categorical variables
Two continuous variables
More than two continuous variables
Exercises
Summary
4. Probability
Basic probability
A tale of two interpretations
Sampling from distributions
Parameters
The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
Exercises
Summary
5. Using Data to Reason About the World
Estimating means
The sampling distribution
Interval estimation
How did we get 1.96?
Smaller samples
Exercises
Summary
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests
When things go wrong
A warning about significance
A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled!
Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions
What if my assumptions are unfounded?
Exercises
Summary
7. Bayesian Methods
The big idea behind Bayesian analysis
Choosing a prior
Who cares about coin flips
Enter MCMC – stage left
Using JAGS and runjags
Fitting distributions the Bayesian way
The Bayesian independent samples t-test
Exercises
Summary
8. Predicting Continuous Variables
Linear models
Simple linear regression
Simple linear regression with a binary predictor
A word of warning
Multiple regression
Regression with a non-binary predictor
Kitchen sink regression
The bias-variance trade-off
Cross-validation
Striking a balance
Linear regression diagnostics
Second Anscombe relationship
Third Anscombe relationship
Fourth Anscombe relationship
Advanced topics
Exercises
Summary
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices
Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees
Random forests
Choosing a classifier
The vertical decision boundary
The diagonal decision boundary
The crescent decision boundary
The circular decision boundary
Exercises
Summary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON
XML
Other data formats
Online repositories
Exercises
Summary
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data
Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis
Pairwise deletion
Mean substitution
Hot deck imputation
Regression imputation
Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data
Checking the data type of a column
Checking for unexpected categories
Checking for outliers, entry errors, or unlikely data points
Chaining assertions
Other messiness
OpenRefine
Regular expressions
tidyr
Exercises
Summary
12. Dealing with Large Data
Wait to optimize
Using a bigger and faster machine
Be smart about your code
Allocation of memory
Vectorization
Using optimized packages
Using another R implementation
Use parallelization
Getting started with parallel R
An example of (some) substance
Using Rcpp
Be smarter about your code
Exercises
Summary
13. Reproducibility and Best Practices
R Scripting
RStudio
Running R scripts
An example script
Scripting and reproducibility
R projects
Version control
Communicating results
Exercises
Summary
Index
← Prev
Back
Next →
← Prev
Back
Next →