Believable Models

It’s inevitable, when making a forecast or a model (meant in the broadest possible terms), that you will be wrong—“all models are wrong,”¹ after all. The obvious challenge is to convince your audience that you are doing something they can use, even if your model is wrong.

An essential tool in these circumstances is the ability to explain your model to its audience or users. Without some form of explanation for how the input variables are influencing the model’s output, you can’t make any kind of hypothesis about what is happening. Without a hypothesis around what the data is telling you, you can’t compare the result to existing knowledge. Yes, you will have some sort of accuracy measure, but it will lack the context of how and why it achieved its accuracy.

That, in turn, immediately hampers your ability to use the knowledge of subject matter experts (SMEs) to improve your model as you can’t compare your model’s view of the problem to the subject matter expert’s view of the problem. Additionally, the inability to use subject matter experts’ knowledge directly in your model is a missed opportunity to win allies for your work and a missed opportunity for improving your model.

At this point I should note that previous chapters haven’t assumed any specific prior knowledge, other than some sort of experience building models for prediction. In this chapter, at least some of the discussion assumes a basic knowledge of generalized linear models (GLM) and regression.

I still believe that if you don’t have this background you will get a lot of this chapter. However, if the idea that ordinary least squares regression, logistic regression, and Poisson regression can all be seen as examples of the same thing isn’t familiar to you, you will get a lot more out of this chapter if you read a basic generalized linear model texts, such as the one by Dobson² or Faraway.³

To provide credibility for a model, the explanation of what it is doing needs to make sense to the user. If you’re presenting the model to users, the best situation is that your model can illustrate at least one relationship they already know about, and another that shows them something new. The first confirms that the model has found relationships which are true. The second shows that the model has found something new that the user didn’t know before.

If you can’t provide any findings that accord with something that the user has observed themselves, it is unlikely that they will accept your model can be believed. At the same time, if you can’t offer them anything new, it is unlikely they will accept that your work is worth paying for.

In fact, it is not enough to simply allow users the ability to understand models. They actually also need the ability to criticize them and ensure that essential expected information is included.⁴ We will see more on this topic toward the end of the chapter.

In Chapter 3, I introduced the idea that to build trust as an individual, you need to ensure you have credibility, reliability, intimacy, and minimize self-orientation, as explained in The Trusted Advisor.⁵ Although the latter two attributes are the domain of humans, the first two can be applied to models. Ensuring your model is both credible and reliable is fundamental to gaining your users’ trust. In this chapter, we will concentrate on credibility.

To be credible, a model will need at least three attributes:

1.
Intelligibility: Users need to be able to understand the link between inputs and outputs. This link, therefore, needs to be visible, rather than hidden in a black box.
2.
Predictability/consistency: When your user has seen results or has explored the model definition, they should be able to be in the ballpark of knowing which of two cases is more or less likely to be a particular case, or will be greater if a continuous model.
3.
Reflect knowledge of the real world: Users will frequently have significant experience of the way what is being modeled behaves in real life.

A fourth point may not be essential but is still useful—the ability to talk about when your model is most likely to be wrong, and to quantify by how much and in what direction it is most likely to be wrong. This is another area, along with the issue of model intelligibility or explainability, that is receiving more attention now than it did in the first rush of machine learning (ML) hype.

If you violate any one of these principles relating to model behavior, your user’s trust in your model will rapidly evaporate. Although, especially in the case of the latter two points, there are not always simple ways to ensure that a model fits these rules from the start, there are ways to make it more probable that the model fits these rules, and ways to check if your model fits these rules.

This chapter will discuss each of these attributes in turn. Intelligibility is the most technical of these issues and so needs an especially thorough treatment. However, this doesn’t mean that intelligibility is the most important just because of this extra word count needed. My strong opinion is that each of these factors is of equal importance, forming a three-legged stool, where if one of the factors does not exist, the stool will not work.

A lot of this chapter will emphasize nonlinear relationships in a number of ways. The idea that an important way that machine learning algorithms can be more accurate than simple regression models is by representing that nonlinear relationships should be well known to ML practitioners. So too should the idea that a drawback of allowing extra nonlinearities is risking overfitting. Less often discussed is that nonlinear relationships are more difficult for users to understand and are therefore a barrier to trust.

Most machine learning (ML) models don’t allow the modeler to select which variables will be modeled as nonlinear effects. An exception is the generalized additive model (GAM), which also allows the nonlinear effects to be visualized in its major implementations. This is obviously more work but it means that nonlinear sections can be confined to the most important variables so that the number of nonlinear variables can be kept to a smaller number that can be explained. We will explore this ability later in the chapter.

As we have seen in earlier chapters, models need to be fit for purpose. Some models will, therefore, need a higher emphasis on how credible users find them than others. Some may need only a very light touch in this area. As discussed in Chapter 3, the point is to consider the users’ needs carefully and match the level of validation accordingly. In particular, consider how close the users will get to the results and the cost of misclassification—users will usually be aware of misclassification costs at some level themselves (albeit often only from their perspective) and adjust their tolerance accordingly.

Model Intelligibility

Intuitively, there are two strategies available to build models that are both accurate and can be interpreted. The first is to build a model that is intrinsically both interpretable and work to ensure its accuracy. The other is to build an accurate model and figure out a way to interpret it after the fact.

In the past, it was often assumed that, on the one hand, models which could be interpreted in terms of the way the input variables affected the output couldn’t produce an accurate enough result for most people. It was also often assumed that if you were able to produce an accurate enough model, it was almost guaranteed to be impossible to understand how individual inputs affected the outcome.

In the next few sections we will see how both assumptions are being shown to be incorrect, and see different methods that disprove both assumptions.

Models That Are Both Explainable and Accurate

The most obvious way to explain your model is to make it explainable from the start. If it is at all possible use linear regression for a continuous dependent variable, or logistic regression (binomial or multinomial as appropriate) or otherwise appropriate GLM (Poisson, negative binomial, for example) for a categorical dependent variable.

A particular reason that machine learning models have come to be preferred when they may not be necessary is that as people have collected larger data sets, they have come to believe that a larger data set means a more accurate result, without stopping to check whether this is the case. It may be the case with some specific situations, but very often it is not the case.

Using machine learning algorithms as a first resort can sometimes also lead to habits that reduce your ability to make accurate and interpretable models.

As an example, because some algorithms need the data presented to them in a categorical format, feature binning is sometimes recommended as a data preprocessing step in machine learning guides. However, as the binning divisions are arbitrary and discrete, they introduce inaccuracies. Therefore, it is important to remember that building a traditional regression method requires a different mindset than building machine learning. Some lessons that applied to the machine learning approach will need to be unlearned when you turn to a regression approach.

Beyond that, books that teach regression principles don’t necessarily teach good modeling practice. In fact, you could say that books that teach good modeling practice in a regression setting are rare. The following principles which have been selected from Harrell give some idea of what is needed to produce a model that is intrinsically both accurate and interpretable:

The first step to a model, which is both a good predictor and interpretable, is to consider the data that has been assembled very carefully. Do your samples of each input cover each input’s range as you will expect to find them when you score the model?
Carefully interview the subject matter experts. Do you have data which corresponds to the inputs they believe will be most influential?
Also, consult subject matter experts on variables most likely to exhibit interactions. For example, Harrell compiled a list of likely interactions relating to human biostatistics.⁶ Most likely workers in the fields most closely associated with good data science results can write similar lists—for example, credit scoring, insurance, marketing, etc.
Consider how you will treat missing values. Is the level of missingness low enough that you can avoid treating the data at all, or will you need to impute the values somehow?
Check the assumptions you made about the distribution of the data. Is your data really a count, so Poisson regression is the most suitable?

Pro Tip

Leveraging your subject matter experts knowledge, where it is available, is an excellent way to create the foundations of a model that is both accurate and interpretable. In Chapter 1 we discussed the right kinds of questions to ask—questions that help identify the most important variables, and the variables most likely to have important interactions are especially useful.

One way to improve the predictive performance of a model is to employ a shrinkage method, such as the lasso or ridge regression. These methods reduce the problem associated with stepwise regression that the variable selection process is discreet, and therefore greedy, which can lead to high variance. Ridge regression, as an example, attempts to reduce this problem by preventing coefficients from becoming too large, hence taking a middle path between discarding variables completely and allowing them to be overly influential.

Linear models can model relationships of greater complexity if the assumption of linear relationships is relaxed. Importantly, the ability to represent nonlinear relationships with little effort is one of the key reasons that neural networks and tree ensembles perform better than linear models, so the ability to relax this assumption is a big step in terms of closing the gap. One of the elements of the strategy for sound predictive models espoused by Harrell is to relax the assumption of linearity for key variables (as determined by knowledge of the subject area).

In essence, by following a careful plan of attack, it is possible to use standard regression or a GLM as required to build a model that is both accurate and interpretable.

The linear assumption can be relaxed by using an additive model to account for influential nonlinear predictors. A generalized additive model (GAM) uses a smooth function such as a spline to represent a nonlinear relationship. The mgcv package in R is one of the most commonly used packages for this model. Its particular advantage is that it provides the ability to visualize the nonlinear areas of the model by plotting the spline relationship. For example, Figure 4-1 shows a plot of a relationship modeled by a generalized additive model.

../images/477772_1_En_4_Chapter/477772_1_En_4_Fig1_HTML.png — Figure 4-1.
A typical plot of a smoothed variable from a generalized additive model prepared by the author from the birthwt data set available from the R package MASS. Note that the y-axis is not the baby’s birthweight but, instead, is how much extra weight compared to the typical baby delivered by a 15-year-old mother is due to the mother’s age.⁷

One of the biggest advantages of these plots is that they show where turning points can be found. In the case of Figure 4-1, which is taken from a model of babies’ birthweight where the x-axis is the mother’s age, there appears to be no significant relationship between birthweight and mother’s age up until approximately 26 or 27 years of age. From this point on, there appears to be a positive correlation between mother’s age and birthweight (although the confidence interval widens as the data thins out, as the number of data points declines with mother’s age).

This is a richer view than we would have had with a purely linear model, where the model would have to represent all this information with a single line, by drawing a line with a slope somewhere between the close to zero gradient of the younger mothers’ section, and the distinct positive relationship for older mothers. This version is shown in Figure 4-2.

../images/477772_1_En_4_Chapter/477772_1_En_4_Fig2_HTML.jpg — Figure 4-2.
A linear representation of the relationship between age and birthweight based on the birthwt dataset in the R MASS package. To match Figure 4-1, the y-axis is the size of the effect of age on the birthweight, not the final birthweight.

As a result, the conclusion drawn can be very different—using the linear-only approach with this data, for example, it is a reasonable conclusion that babies born when their mothers are in their mid-20s can be expected to be heavier than babies born to teenage mothers, but this conclusion is not supported when analyzing the nonlinear output from the GAM. Some texts recommend replacing the curve with a series of straight lines. This can be dangerous as you replace a smooth-curved transition with a sharp transition—you change from a line in one direction to a line in a different direction instantly on a point rather than gradually while curving around a radius. What is more useful is discussing why the turning point is where it is with subject matter experts. I discuss this further in the last section of this chapter.

This is an area that links strongly to the earlier observation that clients need to see something they already know and something they don’t know in their model—showing a GAM which confirmed the customer’s prior thinking that “Age” was an important factor (something they did know), but enlarged their perspective by showing the effect had a peak or tapered off (something they didn’t know) has resulted in excellent customer buy-in for me.

The message here is not to abandon methods such as neural networks or Random Forest, but rather to not default them easily or too early. Even when your project strategy suggests that trading off explainability for accuracy is necessary, be aware that there are a number of ways to see into a black box model. Have them at your fingertips to ensure your customers are engaged by your model and its results—we will look at some of them in the following section.

When Explainable Models Are Unlikely

There are two particular situations where achieving an explainable model is unlikely, with a fair amount of overlap between the two. The first is that the number of variables is too many—at some point over very roughly 20 input variables, the notion of an explainable model fades away as the list becomes too large for humans to reason about. The other situation is where manually developed features don’t lead to sufficient performance—computer vision and image identification are the obvious examples of this situation.

It should be observed that in both cases there is a degree of scope to push back toward explainable model—however, deciding not to go that route may still turn out to be the most reasonable course. There are a number of signs that this might be the case:

There are a large number of possible variables, and none of them is particularly strong.
When you attempt to visualize nonlinear relationships by creating a GAM or some other way, there are several points of inflection in each variable.
There is substantial evidence of multicollinearity and the usual remedies don’t succeed in improving the accuracy of the model.

Assuming your model meets at least a couple of these criteria, you may have cause to use a black box model such as a neural network or Random Forest in place of an intrinsically interpretable model. Alternatively, you may need to build your model quickly and judge that carefully, as crafting the features needed for a good interpretable model will be too time-consuming.

Windows in the Black Box

Having applied the preceding criteria, you decide that only a black box algorithm will give you the performance you need. The other option is to have an opaque model that performs well but provide another modeling to explain it. An extension to this idea is to make an accurate predictive model with an algorithm like Random Forest that isn’t traditionally considered interpretable and use advanced techniques to interpret it.

Surrogate Models

A highly intuitive way to force a black box model to be interpretable is to use its results as the target for a second model using an intrinsically interpretable method, such as decision tree or regression. This approach is called building a surrogate model.

Although the approach has always been available, recently, packages and methods have been built which specifically apply this approach. I will discuss two major types of surrogate models and significant implementations in the sections that follow.

Local Surrogate Models

Using Random Forests to discover quantify relationships as well as make predictions is an active research topic. Recent papers such as “Quantifying Uncertainty in Random Forests”⁸ discuss strategies to estimate the size and direction of the effects of particular predictors on the dependent variables over the whole Random Forest, based on U-Statistics.

Drawing on similar themes is the inTrees package in R, which creates a rule set summary of a tree ensemble, and is a great example of one of the several packages that are available in R today to interpret Random Forests, and other ensembles of trees. The inTrees approach is to extract rules from the trees making up the ensemble, and retain the highest quality rules as an explanation or summary of the ensemble, based on attributes such as frequency and error of the rule.

The previously mentioned methods only work with ensembles of trees, including Random Forests and Gradient Boosting Machines. An option which explains results from any algorithm that has emerged recently is the use of universal model explainers, of which Local Interpretable Model-Agnostic Explanation (LIME) is potentially the most prominent example.

In contrast to a method like generalized additive models, LIME will provide an explanation on a case-by-case basis. That is, for a set of parameters representing a case to be scored, the LIME explanation represents how the different variables impacted on that particular case; if another case is presented to the algorithm, the effect of variables can be quite different.

The explanations are presented as a horizontal bar chart, showing the relative size of the influence of different variables, with bars extending to the right for variables that make the classification more likely, and to the left for variables making the outcome less likely. At a high level, the effect of the variables comes from a sensitivity analysis, which examines the classification results of other cases closely similar to the case of interest.

This is the local aspect of LIME—explanations are given on a case-by-case basis, rather than providing rules or a guide to the model as a whole. This is a notable point of difference to the methods discussed earlier for ensembles of trees. Additionally, LIME is currently only suitable for classifiers, rather than regression models.

Global Surrogate

LIME is an example of a local surrogate—effectively a linear model is built that works in an infinitesimal area around the selected example. The intuitive opposite of a local surrogate is a global surrogate—a surrogate model that is expected to explain how the underlying model behaves across the whole of its domain, rather than at specific points.

Local surrogates and global surrogates have the potential to draw quite different relationships between the input and output variable if there is significant nonlinearity or interactions effects. Therefore, global surrogates may be especially useful when attempting to explain the overall operation of a model to an audience of subject matter experts, as we will discuss in more detail in the following text.

The difficulty with the global surrogate is that you effectively need to build an extra model, with all the pitfalls you encountered building the first one. The surrogate model may itself introduce distortions in the way you understand the underlying model, on top of distortions that were introduced by creating the model, to begin with.

Decisions trees are arguably the most popular algorithms used to build surrogate models. Although in theory it should be straightforward to build a decision tree model to fit outputs, in practice it may still be messy. Christopher Molnar has created⁹ a function, TreeSurrogate, as part of his iml (“interpretable machine learning”) package in R, which simplifies the process of fitting a decision tree from the PartyKit package to your predictions.¹⁰

Whether you decide to use the TreeSurrogate function or not, growing a single decision tree model with the results of a black box model as the target should be one of the arrows in your quiver to explain a model that initially appeared uninterpretable. No doubt, with the growing interest in ensuring that models can be interpreted, more options will begin to appear in the near future.

The Last Mile of Model Intelligibility: Model Presentation

Creating a model that an end user can understand means, on the one hand, ensuring that they understand at a basic level what the input and output variables in the model are, and on the other, that they understand how those variables operate within the model.

In each of these cases the final presentation is crucial to ensuring the end goal of seamless user experience is met. While it is relatively obvious that input variables that are themselves the output of complex models or variables with opaque names such as “Var1” will add to user confusion rather than aiding their comprehension, sometimes it isn’t obvious how much explanation is needed.

Part of the problem is that from the modeler’s perspective what’s important is what goes into the variable—the name of a ratio input to a model might well refer to the variables that make up the ratio. Obviously, this won’t mean that much to the users of the variable. Think of the names given to accounting ratios—the “quick ratio,” the “acid test,” or to dimensionless constants in physics and engineering, such as Reynolds number (which, at a simple level, expresses the degree of turbulence in a fluid).

Reynolds number, while not possessing a perfectly communicative name (from that perspective, “Turbulence number” might be an improvement), does illustrate the notion of divorcing the calculation of a model input from its meaning within the model in a different way—there are multiple Reynolds numbers for use in different engineering models in different contexts, but essentially using the Reynolds number in the same way—to quantify the turbulence of a fluid. There is a Reynolds number for flow in pipes and channels, for a particle falling through a fluid, for impellers in stirred tanks, and for fluid flowing in a packed bed—all with different calculations, all expressing, at least at a basic level, the same concept.

In data science models this plays out in two ways. One is that if you want to reuse your model in as many places as possible it’s handy to label your variables in a way that is independent of their ingredients for portability. In a credit risk model, for example, income or assets could have differing calculations for local taxation or other regulatory reasons. Therefore, you would either need to identify the regulation your definition complies to, or provide enough detail that your user can do their own check.

The second is more to the point—your interface will require labeling if others are to use the model, and labeling that refers to the variable’s meaning within the model does more to explain the input to users than labeling explaining its ingredients.

In the case of Reynolds number the explanation “a constant expressing tendency towards turbulence, where a higher number tends more towards turbulence” is more useful, and does more to explain its purpose in the models it is found in than “the ratio of diameter, velocity and fluid density to fluid viscosity.”

For your models, explaining an engineered variable in a way that users can understand it will seldom simply mean listing the underlying variables or sticking the piece of algebra that is its mathematical definition onto a web page. It means explaining the physical meaning of the new variable in the context of your model, and if you don’t know what the variable means in the context of your model it’s time to talk to your subject matter experts (and if they don’t know, you may need to drop that variable).

The bottom line is that you cannot expect your users to be mind readers, and not only does strict attention paid to labeling lead to users being more likely to use your model, but it also means they will be less likely to make incorrect inferences from your model’s results.

Standards and Style Guides

There are a number of resources that can be used as guides on how to develop standardized systems for labeling variables. Some of the most popular include Hadley Wickham’s R Style Guide and the Google R style guide. However, many of these focus more on the capitalization of words or when to use curly braces vs. square braces than on the usability of variable names.

Another place to look for advice on the best way to name your variables is from Clean Code practitioners. Advice such as “avoiding disinformation in names” is a good start. A particular piece of advice from Robert C. Martin, paraphrased, is that the more often a variable appears, the shorter it should be. Conversely, variables that appear infrequently require longer names that virtually define them.

This advice is still specific to programming, however. We can possibly find other advice that fills in more of the gaps.

Better is advice from the database and data warehousing experts, who are more strongly concerned with the task of ensuring that nonexperts are able to understand the data that they are presenting.

Each variable should have a unique name within the database (or model).¹¹
The name should have a maximum of 30 characters if not limited further by the modeling environment.
Use one abbreviation throughout all variables if needed. For example, if you use “weight” in more than one variable, consistently shorten to “wt.”
Use single case for nouns and present tense for verbs.¹²

Keeping to an agreed set of rules within your team will ensure that users are able to fully understand, both the way that the variables affect the outcome and what those variables mean.

Model Consistency

The second attribute of a model that users find credible is that it is consistent. Most users’ default position is that if getting older increases the risk of death at one stage of life, it ought to increase the risk of death in other stages of life—if getting older by a year suddenly makes you less likely to die, they are left scratching their heads.

Models that take into account interactions and nonlinear effects are likely to provide more accuracy. They are also more likely to be subject to effects that are artifacts of the data, rather than represent the ground truth of the scenario being modeled. This is the familiar story of overfitting.

Overfitting in aggregate is well covered in machine learning texts, with methods such as cross-validation being recommended to detect it, and methods such as regularization being recommended to prevent it. However, these methods are mostly intended to pick up models that overfit on average. They are less good at picking up areas in the model that are different from the overall picture, and therefore likely to undermine a user’s confidence in what the model is saying.

This is an area where it is important to proceed carefully. If you selected a model that allowed nonlinear relationships, you must be working under the assumption that some of the time those relationships accurately represent what you are analyzing. The problem is twofold—on the one hand, there is a risk that some of the nonlinearities aren’t real. On the other hand, you will often find that even if the nonlinear relationship is real, your users will have difficulty accepting it.

A particular problem is relationships that change direction, as highlighted in the opening example. The technical name for relationships that maintain the same direction is monotonic—the opposite is nonmonotonic. Nonlinear models are not necessarily nonmonotonic, and linear models can also be nonmonotonic (think of square terms) but they are more likely to be monotonic, and you have less control over intrinsically nonlinear models compared to adding square terms or similar to a linear model.

Users are likely to expect that relationships hold over the whole range of the model. Hence, even if there is a good reason for a relationship to run in different directions at different points, it may still be an uphill battle convincing your users.

Work through to the explanations. If you say to someone, “getting one year older makes you more or less at risk of death in the following year?” they will almost always reply “More likely.” If you tell them there’s a war in progress, and compulsory frontline service for every aged between 18 and 30, and ask them whether a 29-year-old is more likely to die than 31-year-old under those conditions, they will answer “the 29-year-old.”

The point is that people will believe the nonlinear relationship if it makes sense and they understand it. The task for modelers is to verify with subject matter experts whether or not there is a real cause for the nonlinear behavior, and then to communicate it to the users.

Communicating the cause to users can occur in a range of different ways. We will see some of them in the chapter to follow on communication. For now, it is important to realize that you should try to keep the number of novel nonlinear relationships—those where you are teaching your users about a behavior for the first time—to a manageable number.

In the next section, we will look at workshops to engage subject matter experts, and get their opinion on whether nonlinear relationships are real or spurious.

Tailoring Models to the Story

Many models are subject to the flat maximum effect—the optimum isn’t a point, it’s a region of alternative ways to achieve the same value. Thus, a model reaches a point of optimization in its development where it becomes difficult to make any further improvement on accuracy, even with significant changes to the inputs.¹³ This can be viewed in a pessimistic way as the data science version of the law of diminishing returns found in economics, and when first proposed, this was exactly how it was viewed.

However, there is a glass-half-full version of this idea. Rather than focusing on the flat maximum as a limitation on model performance, you can instead focus on the flat maximum as providing multiple pathways to the best possible result. Therefore, a data scientist is in a better position than, for example, an engineer who often needs to trade in performance for cost or performance against metric A for performance against metric B.

Instead, once a data scientist has optimized a model against accuracy, they have the freedom to look for alternative models that optimize other metrics. Examples in the literature show models which are more complicated and more specific becoming simplified and more general.

However, in the context of this chapter, the flat maximum effect provides a great opportunity to choose a model which matches the user’s view of the world most closely. You will realize that the model is going to be accepted by users far more readily than any others. The other opportunity the flat maximum gives us to make a model that matches your customer’s target market, for example, a risk model that determines that too large a proportion of a credit providers target demographic is an unacceptable risk will hinder rather than help their business, and a model that is more selective within the intended demographic but is just as accurate could be better received.

The opportunity the flat maximum effect provides is the opportunity to challenge the idea that there is one true model that can be found, which often has the side effect that once the target is known, models are created without considering how a typical user understands the relationship being modeled—if there is a “true” or a “best” model, considering the users’ view is unnecessary.

In reality, as there are multiple optimal models, there is room, obviously subject to the correct data being available, to build a model that takes the users’ viewpoint into account, confirming and codifying it wherever possible.

Meet Your Users to Meet Their Expectations

Models are meant to be used, and users need to both believe them and understand them to use them to their best advantage—and may not use them at all if they don’t trust them. One way to gain your users’ trust—and tease out important subject matter expertise to better your own understanding—is to present the model in person and sanity check the relationships.

To get the best out of your subject matter experts, if possible, the last step of your modeling process should be to invite their feedback on both the output of the model and the way the input variables work together to achieve the result.

The LIME visualization’s format is a useful guide to how to visualize the effect of the input data on the output. As LIME visualizes the partial derivatives of all the input variables at a point represented by a particular case (a particular borrower, a particular insured, a particular consumer, etc.) within the scope of your model, this visualization gives you the opportunity to ask questions in a way that make sense to the subject matter expert. Visualize representative cases to show your users how the model works, and ask their opinion of whether the model makes sense to both get their buy-in and to validate the model.

Explaining How Models Perform

One of the advantages of traditional linear models over machine learning models is that a wide variety of tools have been developed to assess their performance.

A particular example is that the confidence intervals of both parameters and predictions of linear models are readily and easily available from almost all statistical packages, and the calculation method is relatively simple and widely covered in standard textbooks.

This is far less the case in the case of machine learning algorithms, including decision trees, Random Forests and other ensembles of trees, and neural networks. In the case of these algorithms, texts and packages are far more likely to discuss outputs as if they were purely deterministic, without error bars or similar. In statistical jargon, models return only a point estimate—of the target value itself, of a regression problem or of the classification probability, of a nominal Boolean target.

There is, however, no reason not to use confidence intervals or prediction intervals to describe the results of machine learning models, and both a growing recognition of the need to do so, and a growing range of tools to allow a practitioner to do it.

Similar to the situation that exists with model interpretation methods, there are methods to calculate confidence intervals for specific kinds of algorithms, and there are universal methods to calculate confidence intervals. I will take a high-level look at a couple of these methods, giving a flavor for how they work and their limitations, rather than attempting to provide an in-depth explanation for another relatively technical topic.

Representative Submodels

In the specific case of Random Forests and Gradient Boosting Machines, both of which are ensembles of decision trees, an intuitive way to construct confidence intervals is to analyze the decision trees making up the overall model.

As these models are intrinsically built up of other models, an intuitive method of estimating the error on the prediction is to derive it from the range of predictions that come from the underlying models.

For this class of models, out of bag sampling is frequently used to quantify the prediction error of the model. The out of bag estimate is derived from trees that didn’t use a specific observation, with the error being computed for each case. The intrinsic availability of the bagged trees, therefore, makes it simpler in the case of Random Forest and its cousins to compute prediction intervals. As a result, many implementations of Random Forest have the out of the bag estimate of prediction error readily available as an out of the box part of the package.

Variable importance is also readily available for Random Forests and tree-based cousins such as Gradient Boosting Machines, by totaling the importance of each split in each tree making up the ensemble of trees. Variable importance is often used by practitioners to identify which variables to leave in or out, but a better use of them is in discussion with subject matter experts to decide whether the model is using the correct logic.

As mentioned previously, this is an area of active research, with new implementations of ideas frequently appearing in R and Python as they emerge—the paper “Quantifying Uncertainty in Random Forests” is but one example.¹⁴

Model Agnostic Assessment

A far more general way to approach the task of creating confidence intervals for machine learning models is to use bootstrap confidence intervals.

The bootstrap is essentially resampling with replacement. To bootstrap your model’s output, you need to create multiple models that are slightly different. Intuitively, the simple way to that is to create multiple models, each using a different random sample of the original data set. Once you have created a sufficient number of models you can then compute the error for each one, and develop a confidence interval.

Being able to discuss the limitations of your model is an important way to ensure that the model users can trust your results. Prediction intervals themselves are a relatively statistical method of quantifying your model’s limitations. You will need to consider your audience carefully when deciding how to communicate what the confidence intervals are saying.

Both cross-validation and bootstrapping can be used to derive prediction intervals for any form of the model. Cross-validation tends to overestimate the prediction error whereas the bootstrap tends to underestimate it.

More recently, there have been developments in using variational calculus as another alternative method to compute prediction intervals for machine learning models. Although the details don’t fit here, this illustrates the point that this is an area attracting increasing interest, and is another example of a method of computing these prediction intervals.

In some ways it is more informative for the user to know how the prediction interval varies across the model’s output rather than knowing an absolute value. Knowing that the prediction is narrowest (a proxy for when the model is at its most confident) at the center of the range of predictions the user is interested in gives the user maximum confidence in the total output of the model.

This may make residual analysis of your model’s output as important as the overall error analysis. Although residual analysis is commonly used in the context of GLMs to validate assumptions such as constant variance (which, if violated, may indicate an incorrect distributional assumption) if the residuals fan out at the ends of the range of the predicted variables or for the range of a particular input variable in a machine learning context, it could indicate that in that region the model isn’t sufficiently accurate, potentially owing to a lack of data in that region.

You can also perform a “soft” kind of residual analysis by comparing the prediction error for particular regions of your data, for example, male vs. female, smoker vs. nonsmoker or child vs. adult, if we imagine medical-related model.

Taken together, being able to quantify the certainty of your model in a way that is meaningful to your users, and being able to explain when your model works at its best are key to being able to maintain your user’s trust. A clear view of your model’s limitations, rather than putting people off, helps them to know when they can use it most effectively and with most confidence.

A Workshop with Subject Matter Experts to Validate Your Model

The final of the three attributes required of models that I listed at the beginning of the chapter was that models need to match your users’ experience of the real world. If users believe that age or gender have a particular effect on an outcome, either the model needs to align with that position, the modeler needs to be able to state that the effect is too weak to be seen within the model, or finally the modeler needs to have enough evidence to get people to change their minds.

Research has shown that being involved in the solution to a problem dramatically increases the probability that people accept the solution. Hence, even if you are coming into the situation with your own favored solution, presenting the problems that this solution is expected to overcome—without mentioning your solution—will help ensure people come around to your view. Even better if your users can be active participants in the conversation rather than people you talk (down?) to, the latter scenario allowing them to cement their own negative views.

While obviously this depends on having built a white box model that can be discussed in this way, the rewards are great. For example, it is the best way to work out the final variable names that the users will see because the users can propose their own meaningful names. It allows you to avoid the “Google Flu trends”¹⁵ mistake of presenting a model that includes obviously spurious relationships. Finally, the mere fact of asking your users’ opinions ensures their buy-in, and listening to their advice and making changes accordingly will seal the deal.

At the same time, as has been mentioned earlier, not engaging with subject matter experts, whether they are part of your targeted user group or not, represents a massive lost opportunity from the perspective of getting your model fully validated.

As subject matter experts think in concrete terms, to facilitate a workshop that allows them to have their say requires that you provide concrete examples of the model’s output with cases that will resonate with them.

The LIME visualization style is a good way to visualize cases for this specific purpose because as mentioned earlier it displays the local effects of movement in each variable. This allows you to frame your questions for the subject matter experts in a way that means they have no need of any statistics knowledge to answer. Here are some possible examples:

Looking at this case, would a change from male to female make a positive or negative difference or no difference?
Would you expect case A or case B to be more likely to lead to a positive result (for a classification problem)/greater value (for regression)?
Do you expect that the variables work the same way across the whole domain of the model? (To put it another way—are there expected interactions, such as between gender and the effects of drug dosage?)

Although examining the localized interpretations gives a clear idea of how real-life cases work, they can lead to not seeing the wood for the trees. In a vanilla linear model with none or few interactions, this isn’t a problem as there will be little or no difference between the way the model operates in different local regions.

However, where there are significant nonlinear and interaction effects, there can be severe differences. If this is the case, the workshop needs to include visualizations of the global relationships to account for different slopes in different areas of the prediction space. For important variables, the GAM plots discussed may be a useful way of visualizing the way that key variables change gradient over their range.

In my experience, when people are given these plots, they are often surprised by how nonlinear the relationships really are—people are wired to have a strong expectation that relationships are linear. Giving an influential group of users the chance to look at these plots is, for this reason, a crucial method for ensuring that users buy into your model properly.

Importantly, their responses to those charts will enable you to understand your model better and to draw useful boundaries around its field of operation.

For example, in a session like this I conducted, the x-axis was loosely speaking age of an asset. There was a noticeable turning point when age was plotted against the output variable, leading one of the subject matter experts to realize that it correlated with a change in regulations that affected the way that assets were maintained.

That is, when the asset reached a particular birthday, regulations demanded extra maintenance take place so that paradoxically the asset became less prone to problems at this point. This eventually meant that the scope of the model was changed so that the regulatory change no longer had an effect on the model.

After this session, in addition to finding a concrete action that improved the model, subject matter experts who had the respect of the wider group of users had a better understanding of the way that the model operated. This shows how engaging with users can lead both to a model which is genuinely better, and also held in better esteem by the people who are expected to use it.

Even where you don’t have the ability to visualize a relationship to the same degree as you can with a GAM, even knowing the importance of different variables within the model gives you the information you can use to generate discussion and challenge your subject matter experts. Few implementations don’t provide variable importance plotting, and there are packages that can plot variable importance for any given model.

Sometimes just knowing that the model is heavily relying on a particular variable that the SMEs consider to be less important is enough to detect a problem.

This has now been demonstrated in a research setting, with Caruana¹⁶ et al., who created GAMs and GA²Ms (GAMS with two-way interactions) for predicting pneumonia case readmission. The focus of the authors was on producing models which are “repairable”—that is the model’s reasoning could be inspected by a Subject Matter Expert, challenged, and where required, repaired.

In many cases, the subject matter experts were able to trace unexpected rules to specific cases or clusters of cases, and determine that the rule was a data artifact, not a true medical reason changing readmission risk. The models made with GAMs or GA²Ms were deemed repairable as a rule that was found to be suspect could be removed from the model without knock-on effects to model bias.

The eyes of subject matter experts can detect inconsistencies that are impossible for a data scientist. Although data scientists are expected to understand the context they work in thoroughly, they can’t think like the subject matter experts in that area, and often won’t notice issues that are obvious to a genuine subject matter expert.

Hence, for example, although a statistician who specializes in medical research will have a thorough knowledge of their area, they will never think like either a doctor or a full-time medical researcher, and a doctor or medical researcher will always notice something different to the statistician when reviewing research results. This is true across the range of applications currently targeted by predictive modeling.

External Validation

In all walks of life, one of the best ways to ensure that the kind of mistakes that become invisible when we have been working on something deeply for a long time are detected and removed is to find a fresh pair of eyes. This is as true of building predictive models as it is of designing a bridge or writing a book.

An example of a specific problem that can be detected by another set of eyes with a statistical mindset is data leaks. It is very easy to create a model with exaggerated performance due to training on data that isn’t available to the model at scoring time. This is colloquially known as a data leak, and at least some of the time, data mining competitions such as those on Kaggle are won by people who exploit these leaks.

If a data leak has occurred during your training process, without you knowing it, you will likely have a model that performs very well during your validation, and very poorly when put into service. In this situation, a fresh pair of eyes can save you a lot of embarrassment.

If your organization is large enough, it may be that there are enough data scientists in a team that they can be divided into smaller groups that can work in parallel without seeing a great deal of each other’s projects while they are being worked on. In that case, you can get those subteams to validate each other’s work, assured that they will have a fresh pair of eyes.

This might be a relatively rare scenario, however. If you aren’t in this position, for models that have a great deal of exposure, you may want to consider retaining an external statistical consultancy for validation. This may be particularly important if your model is directly exposed to customers, or if the results are directly incorporated into the way your organization makes a profit.

Both of the second two factors can be checked through this method. Simply by asking you questions and trying to understand what the model is doing, a competent external reviewer will highlight any areas where either the model is a poor reflection of the way your context operates in the real world, or where the model isn’t consistent with itself.

This last validation, especially when coupled with a workshop involving subject matter experts, is the most effective way to ensure your model is consistent with itself, interpretable by its users, and makes sense within its problem domain.

Summary

We have seen that as machine learning takes on more complex challenges, where there is a greater cost of being wrong, users have begun to demand greater assurance. To gain that assurance, users now expect to gain a greater understanding of the way a model treats the inputs to achieve the outputs.

The most intelligible models are created via linear regression. Traditionally, machine learning practitioners have assumed that simple regression models or even generalized linear models don’t achieve sufficient accuracy, and we reviewed modern techniques which may mean that better accuracy is achievable. In particular, techniques such as generalized additive models mean we can relax the assumption of linearity mean that less well-behaved data sets can be used with regression and still lead to an accurate model.

Regression models are still limited by their assumptions; messy data sets with weak signals may still need a black box machine learning algorithm to achieve the needed accuracy. For these cases, there are now methods of interrogating the black box and developing an intelligible model.

Nonlinear and, especially, nonmonotonic relationships may appear counterintuitive to the user and require additional analysis and explanation. The generalized additive model also provides a good avenue for visualizing such nonlinear relationships to aid the explanation.

Simply making the model intelligible in terms of the relationship between input and output will not always be by itself enough to ensure users trust the models’ results.

One area requiring attention is whether the input and variables themselves are intelligible. Naming conventions are important in this regard, as is ensuring that proper definitions are easily available.

You can help users understand and trust a model by allowing them to play with the model themselves, interrogating it to decide whether its answers make sense. This can be achieved through computer simulations that run the scoring engine with user-selected values or it can be achieved through a detailed analysis of the rules that make up the model, where this is available. Being able to obtain an endorsement of this kind from expert users is becoming increasingly recognized as an important way to ensure model integrity and reliability.

It is also important to involve the model’s future users, not just to improve the model but also to get the buy-in of influential future users. One way to do that is to hold workshops with your local subject matter experts checking that the inferences from your model match their understanding of the subject—you won’t get them to believe your model unless either your model matches their expectations or your data and analysis actually changes their expectations.

Another important way to check your model makes sense is to ensure it is validated by someone who wasn’t involved in building it. In a smaller company, this may mean that you need to retain an external modeling consultant.

In Chapter 5, we will look at how to ensure that your models continue to make good on their initial promise after implementation and beyond, including model maintenance and surveillance.

Model Credibility Checklist

Interpretable Models

Given the type of data you have, and the degree to which it is likely to violate typical assumptions, could you develop a sufficiently accurate model that is intrinsically explainable, such as a GLM or decision tree?
If not, is there an out of the box tool for interpreting the results of the black box algorithm you are using?
Can you give your users access to explanations, both at the local level and at the global level?
Can you explain the level of uncertainty in your model’s predictions to your users?

Model Presentation

Do the variables in your models have meaningful names?
Can your users easily find an explanation of any ratios or other formulas used for derived variables?
If your users access the model through a web-based platform or an app, does the UI/UX design allow users to quickly access definitions and explanations for the target and input variables?

Models Consistent with Themselves and Their Subject Matter

Have subject matter experts had an opportunity to review your findings, both on whether the individual results make sense, and whether how they relate to inputs makes sense?
Have you reviewed the important nonlinear relationships with subject matter experts to ensure they make sense within what is known about the topic?
Have you shown the model to another group of data scientists to review whether the model’s results are robust and repeatable in statistical terms?
Have you compared the subject matter experts’ opinion of what the most important variables are with the variable importance found within the model?
Have you removed or repaired any relationships that appear flawed after being reviewed?

Footnotes

Box GEP and Draper N, “All models are wrong, but some are useful,” Empirical Model Building, 1987.

Annette Dobson, An Introduction to Generalized Linear Models (London: Chapman & Hall, 2002).

Julian Faraway, Linear Models with R (London: Chapman & Hall, 2009).

Been Kim, Rhajiv Khanna, and Oluwasanmi Koyeyo, “Examples Are Not Enough!: Learn to Criticise,” Neural Information Processing Systems (NIPS) Conference (Barcelona, 2016).

David H. Maister, Charles H. Green, and Robert M. Galford, The Trusted Advisor (New York: Free Press, 2000).

Frank E. Harrell, Jr., Regression Modeling Strategies, Second Edition (New York: Springer, 2015).

Brian Ripley, Bill Venables, Douglas M. Bates, Kurt Hornick and Albrecht Gebhardt, MASS: Support Functions and Datasets for Venables and Ripley’s Modern Applied Statistics, 2019, https://CRAN.R-project.org/package=MASS .

Hooker G and Mentch L, “Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests,” Journal of Machine Learning Research, 17, 2016.

https://cran.r-project.org/web/packages/iml/vignettes/intro.html .

https://cran.r-project.org/web/packages/iml/iml.pdf .

“SQL Server Naming Guide and Conventions,” accessed April 4, 2019, at www.cms.gov/research-statistics-data-and-systems/cms-information-technology/dbadmin/downloads/sqlserverstandardsandguildelines.pdf .

“Data Warehouse Naming Standards,” accessed April 4, 2019, at https://kb.iu.edu/d/bctf .

David Hand, “Classifier Technology and the Illusion of Progress,” Statistical Science, Vol. 20, No. 1, 2006, https://arxiv.org/pdf/math/0606441.pdf .

Lucas Mentch and Giles Hooker, “Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests,” Journal of Machine Learning Research, 17, 2016, p. 1–41.

David Lazer and Ryan Kennedy, “What We Can Learn from the Epic Google Flu Trends Failure,” Wired, October 1, 2015, www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/ .

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad, “Intelligible Models for Healthcare: Predicting Pneumonia Risk and Hospital 30-Day Readmission,” Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

4. Believable Models