
5. Reliable Models

Maintaining Performance

In Chapter 3, we first encountered the trust equation, whose parameters include credibility, reliability, intimacy, and self-orientation. Although you can’t be intimate with a model, and no one expects a model to think of other people, a model can be both credible and reliable or it can fail to be either or both of these things. In Chapter 4, we explored how to make a model believable, but it is also important that a model is reliable.

The notion that a model needs to be reliable has received more attention in recent times, at least in part thanks to the fallout from the global financial crisis (GFC), where models gave poor results for a number of reasons, including being used outside the parameters intended by their designers and being assumed to keep working even when the input data changed.

The aftermath of the GFC ushered in a newly skeptical attitude toward the use of models, some of it warranted and some of it not. More recently, partly as a reaction to data science and Big Data hype, authors from within data science such as Cathy O’Neil1 have begun to warn people of the effects of negligent models that harm people as a side effect or even as a direct result of doing their job. Those warnings have also begun to be picked up and repeated in articles meant for a general audience.2

One of O’Neil’s points is that models that don’t receive feedback can go awry in a variety of ways. Moreover, in the examples she uses there were frequently multiple opportunities to set models that had gone awry back on the right path, or to make the decision to discontinue the model. Viewed from that perspective, these examples are highly illustrative of the need to watch and maintain models to ensure reliability.

Unfortunately, users often expect that models should be “set and forget,” and don’t instantly understand the need to maintain and retrain models. It is your role as a data scientist to set their expectations to the right level as early in the process as possible.

Maintaining users’ trust in models with that as a background can be difficult. In a way, it should be difficult—sometimes users who are too eager to believe the good news story coming from the modeler can make it difficult for the modeler to present an appropriately skeptical view of their model’s performance.

A lot of the foregoing relates to the problem of maintaining credibility, while also dealing with users who are too quick to assume your models are credible. The next stage of the trust equation, reliability, also comes into play, as models shouldn’t automatically be relied upon to provide results at the same level of accuracy indefinitely.

Taken together, these factors imply that although some effort is required to ensure that models run at optimum performance over a long period, it will sometimes be difficult to persuade users to allow you to keep them updated. That is absolutely my experience, and other authors have also noted that data science as a discipline lags behind the norms of other disciplines when it comes to activities such as quality assurance,3 which are considered essential in other contexts for ensuring users and customers get what they expect.

Therefore, in this chapter, along with discussing how to keep models in peak condition, I will also discuss how you can persuade your users that this is needed, without making them believe that your model was in some way deficient to begin with.

At the end of this road lies the question: when should a model be retired? The answer harks back to our very first chapters, where we discussed how to determine the problem you should be solving. The risk is always there that the point of your model may no longer exist because the problem it is meant to solve is no longer relevant.

What Is Reliability?

It’s hard to talk about making your models more reliable without deciding what reliability is. Before I move on to a technical definition for models, it might be useful to first think about how reliability works for humans.

In the context of the “Trust Equation,” reliability is effectively defined as being an advisor who is consistent and dependable: someone who does what they say they will do and what is expected of them, and who doesn’t act in unexpected negative ways.

These attributes can be carried over to predictive models with a little adaptation. A model or data product should also do what is expected of it, do what it does consistently, and not behave in an unexpectedly negative manner—effectively, it should not have negative side effects.

Therefore, a model can be considered reliable if it continues to perform at the same level as it did when first developed, and continues to make the same decisions given it is presented with what its users consider to be the same data—although this may not necessarily be identical data from the strict perspective of the model training set.

In this context, the model itself (i.e., the set of rules produced by some sort of machine learning algorithm), the data that feeds it, and their implementation as a complete system together make up what we mean by the model.

Although the model or rules are stable and likely to perform as expected on cases like those presented during the development phase (other cases might present a problem), the data is subject to change. Moreover, the implementation, though not subject to change in the same sense, may provide opportunities for errors to creep into the system. Therefore, generally speaking, these aspects are more likely to be the areas that cause failures in reliability.

Data quality has become a particularly fertile and also fashionable topic of discussion as the use (and possibly abuse) of data in commercial settings has expanded. Certain criteria have been suggested as making up the components of data quality;4 I suggest that they more generally make up the concept of reliability as it applies to models.
  1. Accuracy

  2. Integrity

  3. Consistency

  4. Completeness

  5. Auditability

The article by Cai and Zhu,5 from which these criteria are taken, also presents the idea of relevance. I suggest that relevance can be folded into reliability for our purposes, as model users make an implicit assumption that model results are relevant to their situation; if they turn out not to be, they will experience the result as a loss of reliability.

Bad Things Happen When You Don’t Check Up on Your Models

Some of the worst outcomes can occur when models are implemented and then left running without anyone considering whether what was true yesterday is still true today.

Some of the most notorious examples of this phenomenon have their origin in the GFC. During a crash, previously uncorrelated variables become correlated, as any piece of news on the way down is given a negative interpretation. The tendency of stock prices to become correlated during extreme events was, in fact, a known problem; however, it had been neglected during the period when returns were more benign.

In fact, during the GFC and other extreme events, multiple models of behavior ceased to work correctly, as the relationship between the dependent variable and the independent variables was disrupted. Probability of default models for both retail and corporate credit are an additional important example of this occurrence.

As a consequence, it is common for models that performed well in standard conditions to become very poor during a financial crash. If the modeling team has not developed a model for crash conditions ahead of time, they will have great difficulty developing a new model that functions adequately in time to be useful.

The lesson here, moving beyond the world of finance, is that models working correctly are heavily dependent on their environment remaining sufficiently similar to the environment under which they were developed. Ensuring that models continue to have the correct environment to work in means carefully observing that environment, so that changes requiring new models can be detected early enough that a faulty model is not used.

It was said soon after the events of the GFC that the events were “black swans,” which is by way of saying they were difficult to predict within the framework of existing models because they were outside the modelers’ experience or data sets. That is something of a cop-out—research available prior to the GFC already showed that the asset correlations during previous crashes changed substantially, so at least the idea that the models were likely to fail during a crash was known among researchers and available to anyone who was interested enough to find out.

Like unhappy families in Tolstoy, models all have problems in different ways. Some of the problems may be small; others large. Sometimes the problems can actually mean the model is doing harm somewhere. Other times the problems simply mean that the performance slowly degrades until the model is no better than a dart board.

What these problems do have in common is that if you want to find them before it’s too late, you need to make a conscious effort to find them, and you need a structure and a process to ensure that you find them.

The flow-on effects of whether you get this right or not can be large. The idea that models can’t be trusted has begun to have a life of its own, and even while articles promoting the benefits of data science continue to come out, articles warning of the dangers of data science going wrong proliferate.

In Chapter 3, we discussed the need for data scientists to be brand ambassadors for the profession of data science as a whole. One of the most practical ways to do this is to create models that maintain their performance over their lives and can be trusted both to fulfill their original mission and to do so without causing collateral damage.

Benchmarking Inputs

A model’s performance is determined by the quality of the data it’s fed. Data quality can have many dimensions,6 which to some extent can be determined by the way the data will be used. Some authors have identified their own lists of more than 20 possible dimensions. However, six dimensions are in particularly common use:7
  • Completeness

  • Uniqueness

  • Timeliness

  • Validity

  • Accuracy

  • Consistency

These aspects of data quality may be considered at a few different stages, such as in setting up a data warehouse, or when beginning the modeling effort, in order to select the variables that will be most reliable.

However, when moving from your modeling phase to the implementation and beyond, it’s important to avoid assuming that your initial data assessment is sufficient to benchmark data quality for implementation. When assessing data quality at the modeling stage, your attention will be on a different set of criteria than when you consider the ongoing reliability of your data post-implementation. Therefore, if you don’t make a conscious decision to check that data is suitable for implementation, you can overlook some aspects.

It is very important to know what’s normal for your input variables from the start. Without sampling at the start, you won’t know what you should expect to see, so you won’t know when things have gone askew.
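As a minimal sketch of what such a baseline could look like (the pandas-based approach and every column name here are my assumptions for illustration rather than a prescription), you might capture a snapshot of summary statistics and simple data quality proxies at development time and compare later extracts against it:

import pandas as pd

def baseline_snapshot(df: pd.DataFrame, key_cols, timestamp_col: str) -> dict:
    """Capture what 'normal' looks like for an input table.

    Hypothetical sketch: the measures are rough proxies for completeness,
    uniqueness, and timeliness, plus summary statistics for numeric inputs.
    """
    numeric = df.select_dtypes("number")
    return {
        # Completeness: share of non-null cells across the table
        "completeness": float(df.notna().mean().mean()),
        # Uniqueness: share of rows not duplicated on the key columns
        "uniqueness": 1.0 - float(df.duplicated(subset=key_cols).mean()),
        # Timeliness: age in hours of the most recent record
        "hours_since_last_record": (
            pd.Timestamp.now() - pd.to_datetime(df[timestamp_col]).max()
        ).total_seconds() / 3600.0,
        # Central tendency and spread for each numeric input
        "numeric_summary": numeric.describe().to_dict(),
    }

# For example (hypothetical column names):
# baseline = baseline_snapshot(training_data, key_cols=["customer_id"], timestamp_col="loaded_at")

Recomputing the same snapshot on each new extract gives you something concrete to compare against when deciding whether things have gone askew.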

It is relatively common for machine learning texts to partially ignore exploratory data analysis as it relates to the problem of ensuring that the model trained on yesterday’s data is still suitable for tomorrow’s conditions. Instead, it is more common for texts to emphasize exploratory data analysis as an antecedent to data cleaning and preparation, with discussions of how to deal with missing data and possible approaches to data cleansing leading to advice on the best ways to prepare a data set for modeling.

While this is a necessary step toward building an effective model, this step should also provide a wealth of information.

Once you have implemented your model as a scoring engine fed by a stream of updates to the data set, there will be another set of issues to contend with respect to the reliability of that update process. Even the best maintained and coded systems will have some level of errors, dropouts, and discontinuities in the feed. If they are at a low enough level, they won’t be alarming. However, similar to the overall problems with the data set, if you don’t look, you won’t find out, and what you don’t know can hurt you. Therefore, you need to understand the characteristics of the data feed.

This may be considered by some as the purview of the data engineering team or of the database maintainers. However, while I don’t believe in the unicorn data scientist (and neither should you), and you should accept as much help in maintaining your databases as your organization allows, I believe that data governance is too important for the data science team to let itself be left out of the loop.

Therefore, it is important to have enough of an understanding of data governance best practice to ensure that you contribute to the process. Similarly, you should expect to lead algorithm governance, a discipline in its own right.

An additional factor often associated with model and data governance is modeling bias, especially when the model bias manifests as bias against particular groups of people—ethnic subgroups or socioeconomic subgroups being some of the more distinctive ones. If a model is designed to make decisions that impact on people, these effects are an ever-present risk and one that is difficult to identify without the involvement of humans who can adjudicate on the effect of the model.

The general activity of ensuring your model is still operating within its intended parameters intuitively requires a strong understanding of what those parameters were. This understanding encompasses both the data itself, in terms of statistical measures of central tendency, spread, skewness, and so on, and the data collection process. For example, changes to the rate of data capture could be an indicator that there has been a change to the underlying process.

For example, if the feed for a particular input variable sees a different pattern of updates, or the volume of data suddenly increases or decreases, these could well be a sign that some aspect of the data collection has changed, potentially meaning that the data’s underlying quality or meaning may have changed, even if it is not immediately apparent based on the data’s direct sampling statistics.

Another consideration is what happens if there is a temporary interruption to the data feed or a partial interruption where some variables are interrupted but other variables continue to be updated. Does this mean your model’s scoring output is incorrect? At what point can you identify that there has been an interruption?
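One way to make that kind of question answerable, sketched below under assumed column names and thresholds that you would tune to your own feed, is to track the daily record count against its own recent history, treating days with no rows at all as explicit zeros:

import pandas as pd

def flag_feed_anomalies(records: pd.DataFrame, date_col: str = "loaded_at",
                        window: int = 30, n_sigmas: float = 3.0) -> pd.DataFrame:
    """Flag days where the feed volume looks abnormal, including silent gaps.

    Hypothetical sketch: assumes one row per incoming record with a load
    timestamp; the 30-day window and 3-sigma threshold are placeholders.
    """
    ts = pd.to_datetime(records[date_col])
    # Daily record counts; reindexing over the full date range turns silent
    # gaps (days with no rows at all) into explicit zero counts.
    daily = (records.assign(_day=ts.dt.floor("D"))
                    .groupby("_day").size()
                    .reindex(pd.date_range(ts.min().floor("D"),
                                           ts.max().floor("D"), freq="D"),
                             fill_value=0)
                    .rename("n_records")
                    .to_frame())
    # Rolling baseline built from the preceding days only
    baseline = daily["n_records"].shift(1).rolling(window, min_periods=window // 2)
    daily["mean"], daily["std"] = baseline.mean(), baseline.std()
    daily["anomalous"] = ((daily["n_records"] - daily["mean"]).abs()
                          > n_sigmas * daily["std"]) | (daily["n_records"] == 0)
    return daily

A partial interruption can be caught the same way by computing the count per variable (or the share of nulls per column) rather than the count per row.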

The approach of understanding the situation before you begin modeling can also be used to search for bias in models. For example, it has been observed a number of times that models of crime patterns reflect the racial or other biases of arresting officers, so that a model trained on data where blacks are over-represented in the arrest rate is likely to lead to blacks continuing to be over-represented at the same rate.
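One deliberately simple way to begin that search, sketched here with hypothetical column names, is to compare the rate at which the model flags each subgroup; a disparity surfaced this way is a prompt for human investigation rather than a verdict:

import pandas as pd

def flag_rate_by_group(scored: pd.DataFrame, group_col: str, flag_col: str) -> pd.DataFrame:
    """Compare the model's flag rate across subgroups.

    Hypothetical sketch: scored holds one row per scored case, with the
    subgroup label in group_col and the model's binary decision in flag_col.
    """
    summary = (scored.groupby(group_col)[flag_col]
                     .agg(n_cases="size", flag_rate="mean")
                     .sort_values("flag_rate", ascending=False))
    # Ratio of each group's flag rate to the overall rate; large gaps are a
    # signal to investigate, not proof of bias on their own.
    summary["rate_vs_overall"] = summary["flag_rate"] / scored[flag_col].mean()
    return summary

# For example (hypothetical column names):
# print(flag_rate_by_group(scored_cases, group_col="postcode_band", flag_col="high_risk_flag"))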

Knowing what the data looks like, compared to how it is expected to look or how it ought to look, is crucial for deciding whether the results it produces can be relied upon.

Auditing Models

One of the attributes of a reliable model I mentioned earlier in this chapter was auditability. Auditing is simply a formal process of checking that your models perform as expected, have been implemented as expected, and were originally developed as expected.

Auditing the machine learning system you implement is, therefore, the most systematic way you can ensure that you consciously look for problems, find them, and document potential solutions. This realization is spreading, and as new pressures are applied to machine learning systems to be more transparent and reliable, the idea of formally auditing those systems is becoming more widespread.8

Although the word is often associated with accounting or tax, in fact, audits can occur in many contexts, and an audit of a machine learning system will intuitively have more in common with audits which occur in IT or similar contexts.

A big part of the need for a formal process is to ensure that nothing is left out of the review, as the system that needs auditing consists of several parts, with small gaps between them where communication can be lost. At the same time, although an audit is a defined process with steps that are to some extent predetermined, when an audit is performed by a human there is an opportunity to drill deeper into any areas that appear suspect.

The combination of predetermined steps that ensure the right areas are covered, along with the ability to use intuition to guide the search toward difficult areas, is a strong way to deal with the difficulty of discovering when your model is biased.

Another key advantage of performing an audit, whether it uses an internal or external auditor, is that you can use it to enhance the reputation of your model and its implementation. This is an important concern when your goal is partly to improve users’ trust in your model.

To complete an audit, you need a framework to audit against, usually one that is either formally or informally recognized as a standard. Data science does not have a great deal of choice in this area, but at least one author has suggested auditing data science projects against CRISP-DM.9

The advantage of the CRISP-DM process in this context is that you can use the subheadings within the framework to generate questions and discussion on how the data science implementation you are reviewing behaves against each of those points. We saw CRISP-DM in Chapter 2, but to recap, those subheadings are:
  • Business Understanding

  • Data Understanding

  • Data Preparation

  • Modeling

  • Evaluation

  • Deployment10

Each of these subheadings is a natural topic to review and validate a data science implementation against. In particular, although it can sometimes be tempting to decide that the job is done if a model scores well against its accuracy metric, the inclusion of “Business Understanding” and “Deployment” as topics to audit against should mean that, on the one hand, the model really does provide data that answers the client’s business problem and, on the other hand, the version that is implemented really does provide the client with the results they require.

Of course, the fact that there is no mandatory or even universally accepted standard means that you are not tied to CRISP-DM if it doesn’t meet your needs. You can vary how deep you want to go according to your needs, and you can also add to the list if you think something is missing.

Certainly, you will find that if you approach a large firm to validate a model, they will often have developed their own framework that usually covers a similar but not identical list of concerns to CRISP-DM.

Model Risk Assessment

Another idea coming from quality assurance that you can adapt to the problem of ensuring your model achieves the hoped-for results is performing a risk assessment before implementation. This concept is particularly prevalent in the automotive manufacturing sector, but, at least partly thanks to Six Sigma, it has spread to other contexts.

The formal tool for doing this is called Failure Mode and Effects Analysis, usually shortened to FMEA. Ultimately, this is a process of guided brainstorming for things that can go wrong and their consequences, and you can adopt the level of formality that suits your situation.

It’s possible to conduct an FMEA at variable levels of thoroughness, with different groups of possible stakeholders, and a variety of different prompts. However, a few core activities are common, no matter what the scenario:11
  • Identify possible failure modes—ways that the process could fail

  • Identify the possible outcomes of those failures

  • Quantify the severity of those outcomes

  • Develop a response plan for the most important failures based on the severity of their outcomes

  • Document the outcomes of the process

In an environment which values formal processes and documentation, quantifying the severity is often done by calculating a risk priority number. The risk priority number is calculated by multiplying figures given for the severity of the problem, how likely it is to occur, and the probability of detection.
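As a minimal worked example (the failure modes and ratings below are invented placeholders, with each rating running from 1, best, to 10, worst, and the detection rating conventionally scored higher the harder a failure is to spot), the calculation itself is straightforward:

# Sketch of scoring FMEA entries with a risk priority number (RPN).
# Entries and ratings are invented examples for illustration only.
failure_modes = [
    {"mode": "Vendor feed stops updating", "severity": 8, "occurrence": 3, "detection": 6},
    {"mode": "Categorical field adds a new level", "severity": 5, "occurrence": 6, "detection": 7},
    {"mode": "Scoring job silently skips a day", "severity": 7, "occurrence": 2, "detection": 8},
]

for fm in failure_modes:
    fm["rpn"] = fm["severity"] * fm["occurrence"] * fm["detection"]

# Highest RPN first: these are the failure modes to plan a response for.
for fm in sorted(failure_modes, key=lambda fm: fm["rpn"], reverse=True):
    print(f"{fm['rpn']:>4}  {fm['mode']}")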

However, it is possible to adapt the philosophy of this approach to your situation in a less labor-intensive way by using the group to brainstorm an overall priority score, for example, out of ten. By taking that approach, you can get the assurance that an FMEA can provide in detecting problems before they occur without taking on the whole bureaucracy of the process. That way you won’t let the perfect be the enemy of the good.

The usual outcome of this process is to develop a control plan, which matches particular hazards to a preventative measure. In a manufacturing context, that might mean checking machine settings or analyzing the raw materials when the outputs change, but before being forced to reject the product. In the case of a machine learning model, it is more likely to mean that a sufficient change to an important input variable’s data feed triggers an investigation of the data source, which could be a vendor, a collection point, or a sensor.

The use of FMEA or another tool with a similar philosophy, to look for weak points in a process or implementation, has enjoyed considerable success in its original context in the manufacturing industry. Without a process to systematically look for these weaknesses, product launches in manufacturing would very frequently lead to defective products being shipped to the customer, and manufacturers being constantly surprised by customer reports of defects.

In the current environment, where users may be skeptical about data science models due to reports of poor performance in the media, or prior experiences of their own, adopting those same tools can help to ensure that your final implementation provides a positive experience that causes users to trust data science again.

Model Maintenance

Any statistical or machine learning model will experience a loss of performance over time as relationships alter. Sometimes this happens very suddenly, such as happened to many credit default models during the GFC. Other times the degradation takes place over a longer period, and can almost be predicted by someone watching the trend.

What drives the degradation? First of all, no matter how careful you were, to some degree your model fit on noise or on latent factors, which is to say it was wrong to begin with, and some of your accuracy was due to random chance.

Intuitively, a point that emerges from these examples is that models which are dependent on human behavior may be especially susceptible to degradation, whereas models that relate more closely to physical processes may have some additional stability. It follows that a key ally in understanding how much of a risk this is for your model, and over what timeframe, will be your subject matter expert; in most cases, a regular schedule of model review and retraining will be developed.

At the same time, you will likely want to use what your data is telling you, so you’ll need methods to determine whether the newly arrived input data has changed. This is particularly the case for circumstances which change quickly.
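As one sketch of such a method, and only one of many possible choices, a two-sample Kolmogorov-Smirnov test can compare recent values of a continuous input against the values the model was trained on; the significance threshold here is an assumption to tune:

import numpy as np
from scipy import stats

def input_has_drifted(training_values: np.ndarray,
                      recent_values: np.ndarray,
                      alpha: float = 0.01) -> bool:
    """Crude drift check: has the distribution of an input variable changed?

    With very large samples even tiny, unimportant differences become
    'significant', so treat a positive result as a prompt to investigate
    rather than an automatic retraining trigger.
    """
    result = stats.ks_2samp(training_values, recent_values)
    return result.pvalue < alpha

# Hypothetical usage:
# drifted = input_has_drifted(train_df["income"].to_numpy(), recent_df["income"].to_numpy())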

In the case of input variables where the data points have a high degree of independence, control charts, as used in statistical process control (SPC), could be used to detect changes to the process.

There are many guides to the use of these charts, both in print and online, and they have been successfully used for many years. Their common element is that measurements from a process are plotted sequentially on a chart with a centerline at the mean (or other appropriate process average) and upper and lower lines to represent the usual process range. Accordingly, it is easy to establish when the process has changed either its range or its average result.
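As a minimal sketch, assuming daily rational subgroups of five measurements and the standard tabulated constant for that subgroup size, the centerline and limits for an x-bar chart can be computed like this:

import numpy as np

def xbar_chart_limits(subgroups: np.ndarray) -> dict:
    """Control limits for an x-bar chart built from rational subgroups.

    Sketch only: subgroups is an (n_subgroups, 5) array of measurements,
    and A2 = 0.577 is the standard constant for subgroups of size 5.
    """
    A2 = 0.577  # tabulated x-bar chart constant for subgroup size n = 5
    subgroup_means = subgroups.mean(axis=1)
    subgroup_ranges = subgroups.max(axis=1) - subgroups.min(axis=1)
    xbar_bar, r_bar = subgroup_means.mean(), subgroup_ranges.mean()
    # A subgroup mean falling outside these limits suggests the process has shifted.
    return {"center": xbar_bar,
            "upper": xbar_bar + A2 * r_bar,
            "lower": xbar_bar - A2 * r_bar}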

However, especially for attribute or categorical data, methods developed for use on relatively small data can give problematic results when used on much larger quantities of data.

Some care is still required in setting up the sampling regimen for continuous data. Note that it is not necessary to use all of the data collected every day to check whether the input variable’s process maintains the characteristics it had when the model was implemented; the sample just needs to be large enough to be representative.

Systematic Data Monitoring

As long as you have a clear understanding of normal and abnormal data in your context, data monitoring is an area that offers a lot of opportunities for automation, at least as far as there are clear ways to be systematic.

An intuitive approach is to adapt the principles of quality control monitoring where statistical charts are used to detect when there has been a change to a series of data being collected as a way of alerting a responsible person to check conditions and possibly act.

Control charts originated in the context of manufacturing, where an operator measuring an aspect of the product being made can quickly assess whether the underlying process has changed. Control charts come in a variety of formats to suit the kind of data they are being applied to, but two of the most common forms are the x-bar and R (x-bar being the mean and R being the range) and the c-chart (for countable events).

SPC control charts are interpreted in the context of a predetermined set of rules that identify when the process has changed—these include things like seeing seven or more consecutive points above the mean or seven or more consecutive points rising or decreasing, although there can be a small amount of variation on which rules to use.12
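Rules of that kind are straightforward to automate. The sketch below checks the seven-consecutive-points rule described above, treating the run length as a parameter since references vary on the exact number:

import numpy as np

def run_rule_violations(values: np.ndarray, center: float, run_length: int = 7) -> list:
    """Indexes where run_length consecutive points sit on one side of the centerline.

    Sketch of one common SPC run rule; other rules (points beyond the control
    limits, runs of steadily rising or falling points) follow the same pattern.
    """
    above = values > center
    violations = []
    for i in range(run_length - 1, len(values)):
        window = above[i - run_length + 1: i + 1]
        if window.all() or (~window).all():
            violations.append(i)
    return violations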

The principles of creating and interpreting quality control charts generalize well to a variety of contexts, and workers in data quality have increasingly recognized how they can be used to detect occasions when the quality of data might have unexpectedly been compromised.13 They also lend themselves relatively well to automation, although a human reading the chart after an automated rule has alerted them to an issue is in the best position to decide what has actually occurred.

Specific charts have now been proposed or adapted for use with data streams.14 Researchers have noted that data streams don’t always follow the normal distribution assumed by some basic types of statistical control chart, especially when attempting to measure the six dimensions of data quality listed earlier.

Remember that the more common way to use a control chart in most contexts is to trigger an investigation, even if it’s very brief, to decide the best course of action, rather than to impose an automatic adjustment. This makes sense as these are tools to identify that something unusual has occurred, but have very limited or even zero ability to identify what the unusual thing is.

Specific control charts have been developed to identify when processes stopped operating as expected—the cumulative sum (usually abbreviated to “cusum”) chart is a prominent example.
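A basic one-sided tabular cusum takes only a few lines, as in this sketch; the target, the allowance k, and the decision interval h are placeholders you would set from the process’s own history:

def upper_cusum(values, target: float, k: float, h: float) -> list:
    """Tabular upper CUSUM: flag when cumulative upward drift exceeds h.

    Sketch only: k is the allowance (often half the shift you want to
    detect) and h the decision interval, both in the data's own units.
    A mirrored lower cusum catches downward shifts in the same way.
    """
    s_hi = 0.0
    alarms = []
    for i, x in enumerate(values):
        s_hi = max(0.0, s_hi + (x - target - k))
        if s_hi > h:
            alarms.append(i)  # index where an upward shift is signaled
    return alarms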

Knowing when a process stopped behaving correctly is an important step toward discovering what happened—if you have an external data provider, providing a date when something changed means they are far more likely to be able to suggest a possible change. Likewise, when you capture your own data, being able to go back to a specific date will be very helpful when you need to identify the cause of a change.

At the same time, in an ideal world, you would have identified at least some of the likely causes of disturbances during the planning stage, such as through the FMEA or control plan activities. If you have, you will have a head start on where to look for the cause, and possibly a default corrective action to take (although you shouldn’t expect at the earlier stage to have anticipated everything that can happen).

The processes of auditing your machine learning system, assessing the risk of poor outcomes, and monitoring its inputs and outputs are the best measures you can take to ensure that your project delivers its expected results. The risk of not performing these actions is similar to the risk of building a car that doesn’t turn on when activating the ignition, or, possibly worse, does turn on but fails to steer or brake correctly.

Summary

For users to trust the models you create, they need to be both credible and reliable. This chapter focused on keeping models reliable. Although there are a number of aspects of reliability, there is a common thread among them that if you don’t look for problems, you run the risk of not discovering them until your users do, and therefore after you have lost your users’ trust.

At the moment, there is extra attention on some of the ways that models have problems because of increased awareness that models can be biased in the plain language meaning—they can discriminate against minorities or otherwise work to unfairly disadvantage subgroups within the population being modeled.

An intuitive way to detect problems, in the wide sense of the word, is to audit the model at key milestones in its life cycle. Natural examples include immediately prior to implementation and on its first anniversary.

By conducting an audit, you are able to look beyond some of the purely technical attributes (although they will usually be included) to questions of whether the model achieves its business objectives, whether the infrastructure into which it is implemented ensures it maintains the expected performance, and whether there are unintended consequences.

You can also guard against potential side effects and unexpected results by performing a preimplementation risk analysis, for example via FMEA. This is a common quality assurance tool that attempts to anticipate problems before they occur and put in place measures either to prevent the problem or to mitigate its effects.

Complementary to both of the ideas mentioned earlier is monitoring both the incoming data feeds and your model’s results. To ensure that your model achieves the accuracy you expect, the incoming data and the results should have a distribution close to their distributions when you developed and validated the model. Moreover, spikes or dips in the rate of incoming data could indicate a change to the underlying data itself, which could compromise the model’s results.

Statistical process control charts have been specifically designed to monitor processes in order to detect when they stop behaving “as usual” or “as expected.” By establishing the usual statistical range of the process, they mean a quick visual check of your process can establish whether or not your process is operating as usual, and assist in establishing when it stopped operating as usual.

These actions, which apply quality assurance thinking to data science models, can be expected to increase the extent to which your system behaves as expected and intended. They will allow you and your users to be assured that the claims you make will be fulfilled by the model.

In Chapter 6, we will examine how to communicate those claims to the best advantage.

Reliable Models Checklist

  • Have you conducted a risk assessment prior to implementation?

  • Have you set up monitoring of the incoming data quality and its distribution, and likewise for the outputs?

  • Have you developed a plan for the actions or investigations that will take place if the data turns out to be outside your limits?

  • Do you know which aspects of data quality are relevant to your implementation? How can you detect if their behavior has changed?

  • Have you taken stock of the outcomes of your model, and attempted to discover unexpected and unintended consequences of the way that you have organized your model?

  • Do you have an agreement with the model users on how often the model will be checked and retrained? How is that agreement documented?