Deep learning models are known to be difficult to interpret. Some approaches to model interpretability, including LIME, allow us to gain insights into how a model came to its conclusions. Before we demonstrate LIME, I will show how different data distributions and/or data leakage can cause problems when building deep learning models. We will reuse the deep learning churn model from Chapter 4, Training Deep Prediction Models, but we are going to make one change to the data: we will introduce a bad variable that is highly correlated with the y value. We will only include this variable in the data used to train and evaluate the model. A separate test set from the original data will be kept to represent the data the model will see in production; this set will not have the bad variable in it. The creation of this bad variable simulates two possible scenarios we spoke about earlier:
- Different data distributions: The bad variable does exist in the data that the model sees in production, but it has a different distribution, which means the model does not perform as expected.
- Data leakage: Our bad variable is used to train and evaluate the model, but when the model is used in production, this variable is not available, so we assign it a zero value, which also means the model does not perform as expected.
The code for this example is in Chapter6/binary_predict_lime.R. We will not cover the deep learning model in depth again, so go back to Chapter 4, Training Deep Prediction Models, if you need a refresher on how it works. We are going to make two changes to the model code:
- We will split the data into three parts: a train, validate, and test set. The train split is used to train the model, the validate set is used to evaluate the model once it is trained, and the test set represents the data that the model sees in production.
- We will create the bad_var variable, and include it in the train and validate sets, but not in the test set.
Here is the code to split the data and create the bad_var variable:
# add a feature (bad_var) that is highly correlated with the variable to be predicted
dfData$bad_var <- 0
dfData[dfData$Y_categ==1,]$bad_var <- 1
# flip roughly 2% of rows each way so the correlation is very high but not perfect
dfData[sample(nrow(dfData), 0.02*nrow(dfData)),]$bad_var <- 0
dfData[sample(nrow(dfData), 0.02*nrow(dfData)),]$bad_var <- 1
table(dfData$Y_categ,dfData$bad_var)
       0    1
  0 1529   33
  1   46 2325
cor(dfData$Y_categ,dfData$bad_var)
[1] 0.9581345
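# quick sanity check (a sketch, not part of the original script): a rule that
# simply copies bad_var already matches Y_categ on roughly 98% of rows,
# as the table above shows ((1529 + 2325) / 3933)
mean(dfData$bad_var == dfData$Y_categ)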
# split the data into train (80%), validate (10%), and test (10%) sets
nobs <- nrow(dfData)
train <- sample(nobs, 0.8*nobs)
validate <- sample(setdiff(seq_len(nobs), train), 0.1*nobs)
test <- setdiff(setdiff(seq_len(nobs), train),validate)
predictorCols <- colnames(dfData)[!(colnames(dfData) %in% c("CUST_CODE","Y_numeric","Y_categ"))]
# remove columns with zero variance in the train set
predictorCols <- predictorCols[apply(dfData[train, predictorCols], 2, var, na.rm=TRUE) != 0]
# for our test data, set the bad_var to zero
# our test dataset is not from the same distribution
# as the data used to train and evaluate the model
dfData[test,]$bad_var <- 0
# look at all our predictor variables and
# see how they correlate with the y variable
corr <- as.data.frame(cor(dfData[,c(predictorCols,"Y_categ")]))
corr <- corr[order(-corr$Y_categ),]
old.par <- par(mar=c(7,4,3,1))
barplot(corr[2:11,]$Y_categ,names.arg=row.names(corr)[2:11],
main="Feature Correlation to target variable",cex.names=0.8,las=2)
par(old.par)
Our new variable is highly correlated with our y variable, at 0.958. We also created a bar plot of the features that are most highly correlated with the y variable, and we can see that the correlation between this new variable and the y variable is far higher than for any other variable. If a feature is very highly correlated with the y variable, this is usually a sign that something is wrong in the data preparation. It also suggests that a machine learning solution is not required, because a simple mathematical formula can predict the outcome variable; as the quick check above showed, a rule that just copies bad_var matches the target about 98% of the time. For a real project, this variable should not be included in the model. Here is the graph of the features that are most highly correlated with the y variable; the correlation of the bad_var variable is over 0.9:
Before we go ahead and build the model, notice how we set this new feature to zero for the test set. The test set in this example represents the data that the model will see when it is in production, so setting bad_var to zero simulates either a different data distribution or a data leakage problem. Here is the code that compares how the model performs on the validation set and on the test set:
#### Verifying the model using LIME
# compare performance on validation and test set
print(sprintf(" Deep Learning Model accuracy on validate (expected in production) = %1.2f%%",acc_v))
[1] " Deep Learning Model accuracy on validate (expected in production) = 90.08%"
print(sprintf(" Deep Learning Model accuracy in (actual in production) = %1.2f%%",acc_t))
[1] " Deep Learning Model accuracy in (actual in production) = 66.50%"
The validation set here represents the data used to evaluate the model when it is being built, while the test set represents the future production data. The accuracy on the validation set is over 90%, but the accuracy on the test set is less than 70%. This shows how different data distributions and/or data leakage problems can lead us to overestimate model accuracy.
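One simple way to catch this kind of problem before deployment is to compare the distribution of each predictor in the evaluation data against a sample of the data the model will actually score in production. For the bad_var column in this example, the shift is obvious (a quick check using the splits we already have):
# bad_var takes both 0 and 1 values in the validation split,
# but is all zeros in the "production" (test) split
summary(dfData[validate, ]$bad_var)
summary(dfData[test, ]$bad_var)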