Linear regression is used to approximate the (linear) relationship between a continuous response variable and a set of predictor variables. However, when the response variable is binary (i.e., Yes/No), linear regression is not appropriate. Fortunately, analysts can turn to an analogous method, logistic regression, which is similar to linear regression in many ways. This chapter explores the use of logistic regression for binary response variables. Logistic regression can be expanded for multinomial problems (see Faraway (2016a) for discussion of multinomial logistic regression in R); however, that goes beyond our intent here.
For this section we’ll use the following packages:
# Helper packages
library(dplyr) # for data wrangling
library(ggplot2) # for awesome plotting
library(rsample) # for data splitting
# Modeling packages
library(caret) # for logistic regression modeling
# Model interpretability packages
library(vip) # variable importance
To illustrate logistic regression concepts we'll use the employee attrition data, where our intent is to predict the Attrition response variable (coded as "Yes"/"No"). As in the previous chapter, we'll set aside 30% of our data as a test set to assess generalization error.
df <- attrition %>% mutate_if(is.ordered, factor, ordered = FALSE)
# Create training (70%) and test (30%) sets for the
# rsample::attrition data.
set.seed(123) # for reproducibility
churn_split <- initial_split(df, prop = .7, strata = "Attrition")
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
To provide a clear motivation for logistic regression, assume we have credit card default data for customers and we want to understand if the current credit card balance of a customer is an indicator of whether or not they’ll default on their credit card. To classify a customer as a high- vs. low-risk defaulter based on their balance we could use linear regression; however, the left plot in Figure 5.1 illustrates how linear regression would predict the probability of defaulting. Unfortunately, for balances close to zero we predict a negative probability of defaulting; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of defaulting, regardless of credit card balance, must fall between 0 and 1. These inconsistencies only increase as our data become more imbalanced and the number of outliers increase. Contrast this with the logistic regression line (right plot) that is nonlinear (sigmoidal-shaped).
To avoid the inadequacies of the linear model fit on a binary response, we must model the probability of our response using a function that gives outputs between 0 and 1 for all values of X. Many functions meet this description. In logistic regression, we use the logistic function, which is defined in Equation (5.1) and produces the S-shaped curve in the right plot above.
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{5.1}$$
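To see the S-shape directly, here is a minimal sketch in base R; the coefficient values below are arbitrary illustrative choices (not estimates from any data) picked only to produce a visible curve over a balance-like range:

# The logistic function from Equation (5.1); beta0 and beta1 are
# arbitrary illustrative values, not fitted estimates
sigmoid <- function(x, beta0 = -4, beta1 = 0.003) {
  exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
}
curve(sigmoid, from = 0, to = 3000, xlab = "Balance", ylab = "p(X)")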
The βi parameters represent the coefficients as in linear regression and p(X) may be interpreted as the probability that the positive class (default in the above example) is present. The minimum for p(X) is obtained at $\lim_{a \to -\infty} e^a / (1 + e^a) = 0$, and the maximum for p(X) is obtained at $\lim_{a \to \infty} e^a / (1 + e^a) = 1$, which restricts the output probabilities to 0–1. Rearranging Equation (5.1) yields the logit transformation (which is where logistic regression gets its name):
$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \tag{5.2}$$
Applying a logit transformation to p(X) results in a linear equation similar to the mean response in a simple linear regression model. Using the logit transformation also results in an intuitive interpretation for the magnitude of β1: the odds (e.g., of defaulting) increase multiplicatively by exp(β1) for every one-unit increase in X. A similar interpretation exists if X is categorical; see Agresti (2003), Chapter 5, for details.
We will fit two logistic regression models in order to predict the probability of an employee attriting. The first predicts the probability of attrition based on monthly income (MonthlyIncome) and the second based on whether or not the employee works overtime (OverTime). The glm() function fits generalized linear models, a class of models that includes both logistic regression and simple linear regression as special cases. The syntax of the glm() function is similar to that of lm(), except that we must pass the argument family = "binomial" in order to tell R to run a logistic regression rather than some other type of generalized linear model (the default is family = "gaussian", which is equivalent to ordinary linear regression assuming normally distributed errors).
model1 <- glm(Attrition ~ MonthlyIncome, family = "binomial",
              data = churn_train)
model2 <- glm(Attrition ~ OverTime, family = "binomial",
              data = churn_train)
In the background, glm() uses maximum likelihood (ML) estimation to estimate the unknown model parameters. The basic intuition behind using ML estimation to fit a logistic regression model is as follows: we seek estimates for β0 and β1 such that the predicted probability p̂(Xi) of attrition for each employee corresponds as closely as possible to the employee's observed attrition status. In other words, we try to find β̂0 and β̂1 such that plugging these estimates into the model for p(X) (Equation (5.1)) yields a number close to one for all employees who attrited, and a number close to zero for all employees who did not. This intuition can be formalized using a mathematical equation called a likelihood function:
$$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(X_i) \prod_{i': y_{i'} = 0} \left(1 - p(X_{i'})\right) \tag{5.3}$$
The estimates β̂0 and β̂1 are chosen to maximize this likelihood function. What results is the predicted probability of attrition. Figure 5.2 illustrates the predicted probabilities for the two models.
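To make this intuition concrete, here is a minimal sketch (not the book's code) that maximizes the likelihood directly with optim(); glm() itself uses iteratively reweighted least squares, but both approaches should land on essentially the same estimates. MonthlyIncome is rescaled to thousands of dollars purely for numerical stability, so the slope returned below is 1,000 times the per-dollar coefficient that glm() reports for model1.

# Negative log-likelihood for simple logistic regression (a sketch;
# glm() actually uses iteratively reweighted least squares)
neg_loglik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x      # linear predictor
  -sum(y * eta - log1p(exp(eta)))   # negative log of Equation (5.3)
}
x <- churn_train$MonthlyIncome / 1000  # rescale for numerical stability
y <- ifelse(churn_train$Attrition == "Yes", 1, 0)
fit <- optim(par = c(0, 0), fn = neg_loglik, x = x, y = y, method = "BFGS")
fit$par  # compare with coef(model1): the intercept and 1000 * the slope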
The table below shows the coefficient estimates and related information that result from fitting a logistic regression model in order to predict the probability of Attrition = Yes for our two models. Bear in mind that the coefficient estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds (i.e., logit) scale.
For model1, the estimated coefficient for MonthlyIncome is β̂1 = -0.000130, which is negative, indicating that an increase in MonthlyIncome is associated with a decrease in the probability of attrition. Similarly, for model2, employees who work OverTime are associated with an increased probability of attrition compared to those that do not work OverTime.
tidy(model1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.924 0.155 -5.96 2.59e-9
## 2 MonthlyInco~ -0.000130 0.0000264 -4.93 8.36e-7
tidy(model2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.18 0.122 -17.9 6.76e-72
## 2 OverTimeYes 1.41 0.176 8.00 1.20e-15
As discussed earlier, it is easier to interpret the coefficients using an exp() transformation:
exp(coef(model1))
## (Intercept) MonthlyIncome
## 0.397 1.000
exp(coef(model2))
## (Intercept) OverTimeYes
## 0.113 4.081
Thus, the odds of an employee attriting in model1 are multiplied by 1.000 for every one-dollar increase in MonthlyIncome (exp(-0.000130) ≈ 0.99987, so a single dollar barely moves the odds), whereas the odds of attriting in model2 increase multiplicatively by 4.081 for employees that work OverTime compared to those that do not.
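Because a one-dollar change in income is so small, the per-dollar odds multiplier rounds to 1; rescaling gives a more interpretable number. A quick sketch (the $1,000 increment is our choice for illustration):

# Multiplicative change in the odds per $1,000 increase in MonthlyIncome:
# exp(-0.000130 * 1000) = exp(-0.13), roughly 0.88 -- about a 12% decrease
exp(coef(model1)["MonthlyIncome"] * 1000)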
Many aspects of the logistic regression output are similar to those discussed for linear regression. For example, we can use the estimated standard errors to get confidence intervals as we did for linear regression in Chapter 4:
confint(model1) # for the odds scale, use `exp(confint(model1))`
## 2.5 % 97.5 %
## (Intercept) -1.226775 -6.18e-01
## MonthlyIncome -0.000185 -8.11e-05
confint(model2)
## 2.5 % 97.5 %
## (Intercept) -2.43 -1.95
## OverTimeYes 1.06 1.75
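As the comment above hints, exponentiating the endpoints converts these intervals to the odds scale:

# Confidence intervals on the odds (rather than log-odds) scale; e.g.,
# the odds ratio for OverTimeYes lies between exp(1.06) ~ 2.9 and
# exp(1.75) ~ 5.8
exp(confint(model2))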
We can also extend our model as seen in Equation (5.1) so that we can predict a binary response using multiple predictors:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}} \tag{5.4}$$
Let's go ahead and fit a model that predicts the probability of Attrition based on MonthlyIncome and OverTime. Our results show that both features are statistically significant (at the 0.05 level) and Figure 5.3 illustrates common trends between MonthlyIncome and Attrition; however, working OverTime tends to nearly double the probability of attrition.
model3 <- glm(
  Attrition ~ MonthlyIncome + OverTime,
  family = "binomial",
  data = churn_train
)
tidy(model3)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept)   -1.43     0.176         -8.11 5.25e-16
## 2 MonthlyIncome -0.000139 0.0000270     -5.15 2.62e-7
## 3 OverTimeYes    1.47     0.180          8.16 3.43e-16
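To see the overtime effect on the probability scale, we can score two hypothetical employees who differ only in OverTime; the $5,000 monthly income below is an arbitrary value chosen for illustration:

# Predicted attrition probabilities for two otherwise-identical
# hypothetical employees (MonthlyIncome of $5,000 is arbitrary)
new_emp <- data.frame(
  MonthlyIncome = 5000,
  OverTime = factor(c("No", "Yes"), levels = levels(churn_train$OverTime))
)
predict(model3, new_emp, type = "response")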
With a basic understanding of logistic regression under our belt, our concern now shifts, as with linear regression, to how well our models predict. As in the last chapter, we'll use caret::train() and fit three 10-fold cross-validated logistic regression models. Extracting the accuracy measures (in this case, classification accuracy), we see that both cv_model1 and cv_model2 had an average accuracy of 83.88%. However, cv_model3, which uses all predictor variables in our data, achieved an average accuracy rate of 87.58%.
set.seed(123)
cv_model1 <- train(
Attrition ~ MonthlyIncome,
data = churn_train,
  method = "glm",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 10)
)
set.seed(123)
cv_model2 <- train(
Attrition ~ MonthlyIncome + OverTime,
data = churn_train,
  method = "glm",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 10)
)
set.seed(123)
cv_model3 <- train(
Attrition ~ .,
data = churn_train,
  method = "glm",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 10)
)
# extract out of sample performance measures
summary(
resamples(
list(
model1 = cv_model1,
model2 = cv_model2,
model3 = cv_model3
)
)
)$statistics$Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## model1 0.835 0.835 0.837 0.839 0.843 0.845 0
## model2 0.835 0.835 0.837 0.839 0.843 0.845 0
## model3 0.837 0.850 0.879 0.876 0.891 0.931 0
We can get a better understanding of our model's performance by assessing the confusion matrix (see Section 2.6). We can use caret::confusionMatrix() to compute a confusion matrix, supplying our model's predicted classes and the actual values from our training data. The confusion matrix provides a wealth of information. In particular, we can see that although we do well predicting cases of non-attrition (note the high specificity), our model does particularly poorly at predicting actual cases of attrition (note the low sensitivity).
By default the predict() function predicts the response class for a caret model; however, you can change the type argument to predict the probabilities (see ?caret::predict.train).
# predict class
pred_class <- predict(cv_model3, churn_train)
# create confusion matrix
confusionMatrix(
data = relevel(pred_class, ref = "Yes"),
reference = relevel(churn_train$Attrition, ref = "Yes")
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 93 25
## No 73 839
##
## Accuracy : 0.905
## 95% CI : (0.885, 0.922)
## No Information Rate : 0.839
## P-Value [Acc > NIR] : 5.36e-10
##
## Kappa : 0.602
##
## Mcnemar's Test P-Value : 2.06e-06
##
## Sensitivity : 0.5602
## Specificity : 0.9711
## Pos Pred Value : 0.7881
## Neg Pred Value : 0.9200
## Prevalence : 0.1612
## Detection Rate : 0.0903
## Detection Prevalence : 0.1146
## Balanced Accuracy : 0.7657
##
## 'Positive' Class : Yes
##
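These summary statistics follow directly from the four cells of the matrix; for example:

# Recomputing key metrics by hand from the matrix above
93 / (93 + 73)                     # sensitivity: true "Yes" / all actual "Yes"
839 / (839 + 25)                   # specificity: true "No" / all actual "No"
(839 + 25) / (93 + 25 + 73 + 839)  # no information rate: share of "No"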
One thing to point out: in the confusion matrix above you will note the metric No Information Rate: 0.839. This represents the proportion of non-attrition observations in our training data (table(churn_train$Attrition) %>% prop.table()). Consequently, if we simply predicted "No" for every employee we would still achieve an accuracy rate of 83.9%. Therefore, our goal is to maximize our accuracy rate over and above this no-information baseline while also trying to balance sensitivity and specificity. To that end, we plot the ROC curve (Section 2.6), which is displayed in Figure 5.4. If we compare our simple model (cv_model1) to our full model (cv_model3), we see the lift achieved with the more accurate model.
library(ROCR)
# Compute predicted probabilities
m1_prob <- predict(cv_model1, churn_train, type = "prob")$Yes
m3_prob <- predict(cv_model3, churn_train, type = "prob")$Yes
# Compute ROC curves (TPR vs. FPR) for cv_model1 and cv_model3
perf1 <- prediction(m1_prob, churn_train$Attrition) %>%
  performance(measure = "tpr", x.measure = "fpr")
perf2 <- prediction(m3_prob, churn_train$Attrition) %>%
  performance(measure = "tpr", x.measure = "fpr")
# Plot ROC curves for cv_model1 and cv_model3
plot(perf1, col = "black", lty = 2)
plot(perf2, add = TRUE, col = "blue")
legend(0.8, 0.2, legend = c("cv_model1", "cv_model3"),
       col = c("black", "blue"), lty = 2:1, cex = 0.6)
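If you want a single-number summary of each curve, ROCR can also compute the area under the curve (AUC); a short sketch reusing the predicted probabilities from above:

# AUC for each model (closer to 1 = better discrimination)
auc1 <- performance(prediction(m1_prob, churn_train$Attrition), "auc")
auc3 <- performance(prediction(m3_prob, churn_train$Attrition), "auc")
c(model1 = auc1@y.values[[1]], model3 = auc3@y.values[[1]])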
Similar to linear regression, we can perform a PLS logistic regression to assess whether reducing the dimension of our numeric predictors helps improve accuracy. There are 16 numeric features in our data set, so the following code performs a 10-fold cross-validated PLS model while tuning the number of principal components to use from 1–16. The optimal model uses 14 principal components, which does not reduce the dimension by much. Moreover, the resulting mean accuracy of 0.876 is no better than the average CV accuracy of cv_model3 (0.876).
# Perform 10-fold CV on a PLS model tuning the number of PCs to
# use as predictors
set.seed(123)
cv_model_pls <- train(
Attrition ~ .,
data = churn_train,
  method = "pls",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 10),
  preProcess = c("zv", "center", "scale"),
tuneLength = 16
)
# Model with the highest classification accuracy
cv_model_pls$bestTune
## ncomp
## 14 14
# Plot cross-validated accuracy
ggplot(cv_model_pls)
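To pull the CV accuracy for the selected model programmatically (assuming the ncomp, Accuracy, and Kappa column names caret produces for PLS classification):

# CV accuracy (and kappa) at the optimal number of components
cv_model_pls$results %>%
  filter(ncomp == cv_model_pls$bestTune$ncomp)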
As with linear models, it is important to check the adequacy of the logistic regression model (in fact, this should be done for all parametric models). This was discussed for linear models in Section 4.5, where the residuals played an important role. Although not as common, residual analysis and diagnostics are equally important for generalized linear models. The problem is that there is no obvious way to define what a residual is for more general models. For instance, how might we define a residual in logistic regression when the outcome is either 0 or 1? Nonetheless, attempts have been made, and a number of useful diagnostics can be constructed based on the idea of a pseudo residual; see, for example, Harrell (2015), Section 10.4.
More recently, Liu and Zhang (2018) introduced the concept of surrogate residuals that allows for residual-based diagnostic procedures and plots not unlike those in traditional linear regression (e.g., checking for outliers and misspecified link functions). For an overview with examples in R using the sure package, see Greenwell et al. (2018c).
Similar to linear regression, once our preferred logistic regression model is identified, we need to interpret how the features are influencing the results. As with normal linear regression models, variable importance for logistic regression models can be computed using the absolute value of the z-statistic for each coefficient (albeit with the same issues previously discussed). Using vip::vip() we can extract our top 20 influential variables. Figure 5.6 illustrates that OverTime is the most influential, followed by JobSatisfaction and EnvironmentSatisfaction.
vip(cv_model3, num_features = 20)
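Since this importance measure is just the absolute z-statistic of each coefficient, the ranking can be reproduced by hand from the glm object that caret stores inside the train object (a sketch; assumes the broom package is available):

# Variable importance by hand: rank coefficients by |z-statistic| using
# the fitted glm stored in cv_model3$finalModel
broom::tidy(cv_model3$finalModel) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(abs(statistic))) %>%
  head(20)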
Similar to linear regression, logistic regression assumes a monotonic linear relationship between the predictors and the log-odds of the response. However, because that linear relationship occurs on the logit scale, the relationship on the probability scale is nonlinear. This is illustrated by the PDP in Figure 5.7, which displays the functional relationship between the predicted probability of attrition and the number of companies an employee has worked for (NumCompaniesWorked) while taking into account the average effect of all the other predictors in the model. Employees who have experienced more employment changes tend to have a higher probability of making another change in the future.
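A minimal sketch of how such a PDP might be produced with the pdp package; the prediction wrapper below is our own and not necessarily the book's exact approach:

# Sketch of a PDP on the probability scale using the pdp package
library(pdp)
# Wrapper returning the average predicted probability of attrition
pred_fun <- function(object, newdata) {
  mean(predict(object, newdata, type = "prob")$Yes)
}
partial(cv_model3, pred.var = "NumCompaniesWorked", pred.fun = pred_fun,
        train = churn_train) %>%
  autoplot() + ylab("Predicted probability of attrition")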
Furthermore, the PDPs for the top three categorical predictors (OverTime, JobSatisfaction, and EnvironmentSatisfaction) illustrate the change in predicted probability of attrition based on the employee's status for each predictor.
See the online supplemental material for the code to reproduce the plots in Figure 5.7.
Logistic regression provides an alternative to linear regression for binary classification problems. However, similar to linear regression, logistic regression suffers from the many assumptions involved in the algorithm (e.g., a linear relationship on the logit scale, little or no multicollinearity). Moreover, we often have more than two classes to predict, which is commonly referred to as multinomial classification. Although multinomial extensions of logistic regression exist, the assumptions only multiply and, often, the stability of the coefficient estimates (and therefore the accuracy) decreases. Future chapters will discuss more advanced algorithms that provide a more natural and trustworthy approach to binary and multinomial classification prediction.