Contents

Preface

I    Fundamentals

1    Introduction to Machine Learning

1.1    Supervised learning

1.1.1    Regression problems

1.1.2    Classification problems

1.2    Unsupervised learning

1.3    Roadmap

1.4    The data sets

2    Modeling Process

2.1    Prerequisites

2.2    Data splitting

2.2.1    Simple random sampling

2.2.2    Stratified sampling

2.2.3    Class imbalances

2.3    Creating models in R

2.3.1    Many formula interfaces

2.3.2    Many engines

2.4    Resampling methods

2.4.1    k-fold cross-validation

2.4.2    Bootstrapping

2.4.3    Alternatives

2.5    Bias-variance trade-off

2.5.1    Bias

2.5.2    Variance

2.5.3    Hyperparameter tuning

2.6    Model evaluation

2.6.1    Regression models

2.6.2    Classification models

2.7    Putting the processes together

3    Feature & Target Engineering

3.1    Prerequisites

3.2    Target engineering

3.3    Dealing with missingness

3.3.1    Visualizing missing values

3.3.2    Imputation

3.4    Feature filtering

3.5    Numeric feature engineering

3.5.1    Skewness

3.5.2    Standardization

3.6    Categorical feature engineering

3.6.1    Lumping

3.6.2    One-hot & dummy encoding

3.6.3    Label encoding

3.6.4    Alternatives

3.7    Dimension reduction

3.8    Proper implementation

3.8.1    Sequential steps

3.8.2    Data leakage

3.8.3    Putting the process together

II    Supervised Learning

4    Linear Regression

4.1    Prerequisites

4.2    Simple linear regression

4.2.1    Estimation

4.2.2    Inference

4.3    Multiple linear regression

4.4    Assessing model accuracy

4.5    Model concerns

4.6    Principal component regression

4.7    Partial least squares

4.8    Feature interpretation

4.9    Final thoughts

5    Logistic Regression

5.1    Prerequisites

5.2    Why logistic regression

5.3    Simple logistic regression

5.4    Multiple logistic regression

5.5    Assessing model accuracy

5.6    Model concerns

5.7    Feature interpretation

5.8    Final thoughts

6    Regularized Regression

6.1    Prerequisites

6.2    Why regularize?

6.2.1    Ridge penalty

6.2.2    Lasso penalty

6.2.3    Elastic nets

6.3    Implementation

6.4    Tuning

6.5    Feature interpretation

6.6    Attrition data

6.7    Final thoughts

7    Multivariate Adaptive Regression Splines

7.1    Prerequisites

7.2    The basic idea

7.2.1    Multivariate adaptive regression splines

7.3    Fitting a basic MARS model

7.4    Tuning

7.5    Feature interpretation

7.6    Attrition data

7.7    Final thoughts

8    K-Nearest Neighbors

8.1    Prerequisites

8.2    Measuring similarity

8.2.1    Distance measures

8.2.2    Preprocessing

8.3    Choosing k

8.4    MNIST example

8.5    Final thoughts

9    Decision Trees

9.1    Prerequisites

9.2    Structure

9.3    Partitioning

9.4    How deep?

9.4.1    Early stopping

9.4.2    Pruning

9.5    Ames housing example

9.6    Feature interpretation

9.7    Final thoughts

10  Bagging

10.1  Prerequisites

10.2  Why and when bagging works

10.3  Implementation

10.4  Easily parallelize

10.5  Feature interpretation

10.6  Final thoughts

11  Random Forests

11.1  Prerequisites

11.2  Extending bagging

11.3  Out-of-the-box performance

11.4  Hyperparameters

11.4.1  Number of trees

11.4.2  mtry

11.4.3  Tree complexity

11.4.4  Sampling scheme

11.4.5  Split rule

11.5  Tuning strategies

11.6  Feature interpretation

11.7  Final thoughts

12  Gradient Boosting

12.1  Prerequisites

12.2  How boosting works

12.2.1  A sequential ensemble approach

12.2.2  Gradient descent

12.3  Basic GBM

12.3.1  Hyperparameters

12.3.2  Implementation

12.3.3  General tuning strategy

12.4  Stochastic GBMs

12.4.1  Stochastic hyperparameters

12.4.2  Implementation

12.5  XGBoost

12.5.1  XGBoost hyperparameters

12.5.2  Tuning strategy

12.6  Feature interpretation

12.7  Final thoughts

13  Deep Learning

13.1  Prerequisites

13.2  Why deep learning

13.3  Feedforward DNNs

13.4  Network architecture

13.4.1  Layers and nodes

13.4.2  Activation

13.5  Backpropagation

13.6  Model training

13.7  Model tuning

13.7.1  Model capacity

13.7.2  Batch normalization

13.7.3  Regularization

13.7.4  Adjust learning rate

13.8  Grid search

13.9  Final thoughts

14  Support Vector Machines

14.1  Prerequisites

14.2  Optimal separating hyperplanes

14.2.1  The hard margin classifier

14.2.2  The soft margin classifier

14.3  The support vector machine

14.3.1  More than two classes

14.3.2  Support vector regression

14.4  Job attrition example

14.4.1  Class weights

14.4.2  Class probabilities

14.5  Feature interpretation

14.6  Final thoughts

15  Stacked Models

15.1  Prerequisites

15.2  The idea

15.2.1  Common ensemble methods

15.2.2  Super learner algorithm

15.2.3  Available packages

15.3  Stacking existing models

15.4  Stacking a grid search

15.5  Automated machine learning

16  Interpretable Machine Learning

16.1  Prerequisites

16.2  The idea

16.2.1  Global interpretation

16.2.2  Local interpretation

16.2.3  Model-specific vs. model-agnostic

16.3  Permutation-based feature importance

16.3.1  Concept

16.3.2  Implementation

16.4  Partial dependence

16.4.1  Concept

16.4.2  Implementation

16.4.3  Alternative uses

16.5  Individual conditional expectation

16.5.1  Concept

16.5.2  Implementation

16.6  Feature interactions

16.6.1  Concept

16.6.2  Implementation

16.6.3  Alternatives

16.7  Local interpretable model-agnostic explanations

16.7.1  Concept

16.7.2  Implementation

16.7.3  Tuning

16.7.4  Alternative uses

16.8  Shapley values

16.8.1  Concept

16.8.2  Implementation

16.8.3  XGBoost and built-in Shapley values

16.9  Localized step-wise procedure

16.9.1  Concept

16.9.2  Implementation

16.10  Final thoughts

III  Dimension Reduction

17  Principal Components Analysis

17.1  Prerequisites

17.2  The idea

17.3  Finding principal components

17.4  Performing PCA in R

17.5  Selecting the number of principal components

17.5.1  Eigenvalue criterion

17.5.2  Proportion of variance explained criterion

17.5.3  Scree plot criterion

17.6  Final thoughts

18  Generalized Low Rank Models

18.1  Prerequisites

18.2  The idea

18.3  Finding the lower ranks

18.3.1  Alternating minimization

18.3.2  Loss functions

18.3.3  Regularization

18.3.4  Selecting k

18.4  Fitting GLRMs in R

18.4.1  Basic GLRM model

18.4.2  Tuning to optimize for unseen data

18.5  Final thoughts

19  Autoencoders

19.1  Prerequisites

19.2  Undercomplete autoencoders

19.2.1  Comparing PCA to an autoencoder

19.2.2  Stacked autoencoders

19.2.3  Visualizing the reconstruction

19.3  Sparse autoencoders

19.4  Denoising autoencoders

19.5  Anomaly detection

19.6  Final thoughts

IV  Clustering

20  K-means Clustering

20.1  Prerequisites

20.2  Distance measures

20.3  Defining clusters

20.4  k-means algorithm

20.5  Clustering digits

20.6  How many clusters?

20.7  Clustering with mixed data

20.8  Alternative partitioning methods

20.9  Final thoughts

21  Hierarchical Clustering

21.1  Prerequisites

21.2  Hierarchical clustering algorithms

21.3  Hierarchical clustering in R

21.3.1  Agglomerative hierarchical clustering

21.3.2  Divisive hierarchical clustering

21.4  Determining optimal clusters

21.5  Working with dendrograms

21.6  Final thoughts

22  Model-based Clustering

22.1  Prerequisites

22.2  Measuring probability and uncertainty

22.3  Covariance types

22.4  Model selection

22.5  My basket example

22.6  Final thoughts

Bibliography

Index