RULE SEVEN

Demand transparency when the computer says ‘no’

I know I’ve made some very poor decisions recently but I can give you my complete assurance that my work will be back to normal. I’ve still got the greatest enthusiasm and confidence in the mission. And I want to help you, Dave.

HAL 9000 (2001: A Space Odyssey)

In 2009, a team of researchers from Google announced a remarkable achievement in one of the world’s top scientific journals, Nature.1 Without needing the results of a single medical check-up, they were able to track the spread of influenza across the United States. What’s more, they could do it more quickly than the Centers for Disease Control and Prevention (CDC), which relied on reports from doctors’ surgeries. Google’s algorithm had searched for patterns in the CDC data from 2003 to 2008, identifying a correlation between flu cases and what people had been searching for online in the same area at the same time. Having discovered the pattern, the algorithm could now use today’s searches to estimate today’s flu cases, a week or more before the CDC published its official view.2

Not only was ‘Google Flu Trends’ quick, accurate and cheap, it was theory-free. Google’s engineers didn’t bother to develop a hypothesis about what search terms might be correlated with the spread of the disease. We can reasonably guess that searches such as ‘flu symptoms’ or ‘pharmacies near me’ might better predict flu cases than searches for ‘Beyoncé’, but the Google team didn’t care about that. They just fed in their top 50 million search terms and let the algorithms do the work.

The success of Google Flu Trends became emblematic of the hot new trend in business, technology and science: ‘big data’ and ‘algorithms’. ‘Big data’ can mean many things, but let’s focus on the found data we discussed in the previous chapter, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest cellphone mast, perhaps buttressed by the administrative data generated as organisations organise themselves.

An algorithm, meanwhile, is a step-by-step recipe* for performing a series of actions, and in most cases ‘algorithm’ means simply ‘computer program’. But over the past few years, the word has come to be associated with something quite specific: algorithms have become tools for finding patterns in large sets of data. Google Flu Trends was built on pattern-recognising algorithms churning through those 50 million search terms, looking for ones that seemed to coincide with the CDC reporting more cases of flu.
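Google never published the inner workings of Flu Trends, but the basic pattern-hunting step is easy to sketch. Here is a toy illustration in Python – invented numbers and my own simplification, nothing like Google’s actual code – that ranks candidate search terms by how closely their weekly frequency tracks the official case counts, and keeps the best ones to build a predictor from:

```python
import numpy as np

# Invented data: 52 weeks of 'official' flu counts with a winter peak.
rng = np.random.default_rng(0)
weeks = 52
season = np.cos(np.linspace(0, 2 * np.pi, weeks))           # high in winter, low in summer
cdc_cases = 1000 + 800 * season + rng.normal(0, 50, weeks)

# Invented weekly frequencies for a few candidate search terms.
search_terms = {
    "flu symptoms": 0.01 * cdc_cases + rng.normal(0, 5, weeks),   # genuinely flu-related
    "high school basketball": 100 + 80 * season,                  # tracks winter, not flu
    "beyonce": rng.normal(100, 10, weeks),                        # unrelated
}

# Score every term by how strongly it correlates with the official counts...
scores = {term: np.corrcoef(freq, cdc_cases)[0, 1] for term, freq in search_terms.items()}

# ...and keep the best-correlated terms as the basis of the predictor.
best = sorted(scores, key=scores.get, reverse=True)[:2]
print(best)
```

Even in this toy version, notice that a purely seasonal search term correlates with flu almost as well as the genuinely medical one – a wrinkle that will matter shortly.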

It’s these sorts of data, and these sorts of algorithms, that I’d like to examine in this chapter. ‘Found’ datasets can be huge. They are also often relatively cheap to collect, updated in real time, and messy – a collage of data points collected for disparate purposes. As our communication, leisure and commerce are moving to the internet, and the internet is moving into our phones, our cars and even our spectacles, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago. The business bookshelves and the pages of management magazines are bulging with books and articles on the opportunities such data provide.

Alongside a ‘wise up and get rich’ message, cheerleaders for big data have made three exciting claims, each one reflected in the success of Google Flu Trends. First, that data analysis produces uncannily accurate results. Second, that every single data point can be captured – the ‘N = All’ claim we met in the last chapter – making old statistical sampling techniques obsolete (what that means here is that Flu Trends captured every single search). And finally, that scientific models are obsolete, too: there’s simply no need to develop and test theories about why searches for ‘flu symptoms’ or ‘Beyoncé’ might or might not be correlated with the spread of flu, because, to quote a provocative Wired article in 2008, ‘with enough data, the numbers speak for themselves’.

This is revolutionary stuff. Yet four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim – Google Flu Trends. After reliably providing a swift and accurate account of flu outbreaks for several winters, the theory-free, data-rich model lost its nose for where flu was going. Google’s model pointed to a severe outbreak, but when the slow-and-steady data from the CDC arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated – at one point they were more than double the true figure.3 The Google Flu Trends project was shut down not long after.4

What had gone wrong? Part of the problem was rooted in that third exciting claim: Google did not know – and it could not begin to know – what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data, which is what these algorithms do. In fact, the Google research team had peeked into the patterns and discovered some obviously spurious correlations that they could safely instruct the algorithm to disregard – for example, flu cases turned out to be correlated with searches for ‘high school basketball’. There’s no mystery about why: both flu and high school basketball tend to get going in the middle of November. But it meant that Flu Trends was part flu detector, part winter detector.5 That became a problem when there was an outbreak of summer flu in 2009: Google Flu Trends, eagerly scanning for signs of winter and finding nothing, missed the non-seasonal outbreak, as true cases were four times higher than it was estimating.6

The ‘winter detector’ problem is common in big data analysis. A literal example, via computer scientist Sameer Singh, is the pattern-recognising algorithm that was shown many photos of wolves in the wild, and many photos of pet husky dogs. The algorithm seemed to be really good at distinguishing the two rather similar canines; it turned out that it was simply labelling any picture with snow as containing a wolf. An example with more serious implications was described by Janelle Shane in her book You Look Like a Thing and I Love You: an algorithm that was shown pictures of healthy skin and of skin cancer. The algorithm figured out the pattern: if there was a ruler in the photograph, it was cancer.7 If we don’t know why the algorithm is doing what it’s doing, we’re trusting our lives to a ruler detector.

Figuring out what causes what is hard – impossible, some say. Figuring out what is correlated with what is much cheaper and easier. And some big data enthusiasts – such as Chris Anderson, author of that provocative article in Wired magazine – have argued that it is pointless to look beyond correlations. ‘View data mathematically first and establish a context for it later,’ he wrote; the numbers speak for themselves. Or, to rephrase Anderson’s point unkindly, ‘If searches for high school basketball always pick up at the same time as flu cases, it doesn’t really matter why’.

But it does matter, because a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.

After the summer flu problem of 2009, the accuracy of Flu Trends collapsed completely at the end of 2012. It’s not clear why. One theory is that the news was full of scary stories about flu in December 2012, and these stories might have provoked internet searches from people who were healthy. Another possible explanation is a change in Google’s own search algorithm: it began automatically suggesting diagnoses when people entered medical symptoms, and this will have changed what they typed into Google in a way that might have foxed the Flu Trends model. It’s quite possible that Google could have figured out what the problem was and found a way to make the algorithm work again if they’d wanted to, but they just decided that it wasn’t worth the trouble, expense and risk of failure.

Or maybe not. The truth is, external researchers have been forced to guess at exactly what went wrong, because they don’t have the information to know for sure. Google shares some data with researchers, and indeed makes some data freely available to anyone. But it isn’t going to release all its data to you, or me, or anyone else.

*

Two good books with pride-of-place on my bookshelf tell the story of how our view of big data evolved over just a few short years.

One, published in 2013, is Big Data by Kenn Cukier and Viktor Mayer-Schönberger. It reports many examples of how cheap sensors, huge datasets and pattern-recognising algorithms were, to paraphrase the book’s subtitle, ‘transforming how we live, work and think’. The triumphant example the authors chose to begin their story? Google Flu Trends. The collapse became apparent only after the book had gone to print.

Three years later, in 2016, came Cathy O’Neil’s Weapons of Math Destruction, which – as you might be able to guess – takes a far more pessimistic view. O’Neil’s subtitle tells us that big data ‘increases inequality and threatens democracy’.

The difference is partly one of perspective: Cukier and Mayer-Schönberger tend to adopt the viewpoint of someone who is doing something with a data-driven algorithm; O’Neil tends to see things from the viewpoint of someone to whom a data-driven algorithm is doing something. A hammer looks like a useful tool to a carpenter; the nail has a different impression altogether.

But the change in tone also reflects a change in the zeitgeist between 2013 and 2016. In 2013, the relatively few people who were paying attention to big data often imagined themselves to be the carpenters; by 2016, many of us had realised that we were nails. Big data went from seeming transformative to seeming sinister. Cheerleading gave way to doomsaying, and some breathlessly over-the-top headlines. (Perhaps my favourite was a CNN story: ‘Math is Racist’.) The crisis reached a shrill pitch with the discovery that a political consulting firm, Cambridge Analytica, had exploited Facebook’s lax policies on data to slurp up information about 50 million people, without those people knowing or meaningfully consenting to this, and show them personally targeted advertisements. It was briefly supposed by horrified commentators that these ads were so effective they essentially elected Donald Trump, though more sober analysis later concluded that Cambridge Analytica’s capabilities fell short of mind control.8

Each of us is sweating data, and those data are being mopped up and wrung out into oceans of information. Algorithms and large datasets are being used for everything from finding us love to deciding whether, if we are accused of a crime, we go to prison before the trial or are instead allowed to post bail. We all need to understand what these data are and how they can be exploited. Should big data excite or terrify us? Should we be more inclined to cheer the carpenters or worry about our unwitting role as nails?

The answer is that it depends – and in this chapter I hope to show you what it depends on.

Writing in the New York Times magazine in 2012, when the zeitgeist was still firmly with the carpenters, the journalist Charles Duhigg captured the hype about big data quite brilliantly with an anecdote about the US discount department store Target.

Duhigg explained that Target had collected so much data on its customers, and was so skilled at analysing those data, that its insight into consumers could seem like magic.9 His killer anecdote was of the man who stormed into a Target near Minneapolis and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised profusely, and later called to apologise again – only to be told that the teenager was indeed pregnant. Her father hadn’t realised. Target, after analysing her purchases of unscented wipes and vitamin supplements, had.

Yet is this truly statistical sorcery? I’ve discussed this story with many people and I’m struck by a disparity. Most people are wide-eyed with astonishment. But two of the groups I hang out with a lot take a rather different view. Journalists are often cynical; some suspect Duhigg of inventing, exaggerating or passing on an urban myth. (I suspect them of professional jealousy.) Data scientists and statisticians, on the other hand, yawn. They regard the tale as both unsurprising and uninformative. And I think the statisticians have it right.

First, let’s think for a moment about just how amazing it might be to predict that someone is pregnant based on their shopping habits: not very. Consider the National Health Service advice on the vitamin supplement folic acid:

It’s recommended that all women who could get pregnant should take a daily supplement of 400 micrograms of folic acid before they’re pregnant and during the first 12 weeks of pregnancy . . . If you did not take folic acid supplements before getting pregnant, you should start taking them as soon as you find out you’re pregnant . . . The only way to be sure you’re getting the right amount is by taking a supplement.

OK. With this in mind, what conclusion should I reach if I am told that a woman has started buying folic acid? I don’t need to use a vast dataset or a brilliant analytical process. It’s not magic. Clearly, she might well be pregnant. The Target algorithm hadn’t produced a superhuman leap of logic, but a very human one: it figured out exactly what you or I or anyone else would also have figured out, given the same information.

Admittedly, sometimes we humans can be slow on the uptake. Hannah Fry, the author of another excellent book about algorithms, Hello World, relays the example of a woman shopping online at the UK supermarket Tesco.10 She discovered condoms in the ‘buy this again?’ section of her online shopping cart – implying that the algorithm knew someone in her household had bought them before. But she hadn’t, and her husband had no reason to: they didn’t use condoms together. So she assumed it was a technical error. I mean, what other explanation could there possibly be?

When the woman contacted Tesco to complain, the company representatives concluded that it wasn’t their job to break the bad news that her husband was cheating on her, and went for the tactful white lie. ‘Indeed, madam? A computer error? You’re quite right, that must be the reason. We are so sorry for the inconvenience.’ Fry tells me that this is now the rule of thumb at Tesco: apologise, and blame the computer.

If a customer previously bought condoms, they might want to buy them again. If someone bought a pregnancy test and then starts buying a vitamin supplement designed for pregnant women, it’s a reasonable bet that this person is a woman and that in a few months they might become interested in buying maternity wear and baby clothes. The algorithms aren’t working statistical miracles here. They’re simply seeing something (the condoms, the pregnancy vitamins) that has been concealed from the human (the puzzled wife, the angry dad). We’re awestruck by the algorithm in part because we don’t appreciate the mundanity of what’s happening underneath the magician’s silk handkerchief.

And there’s another way in which Duhigg’s story of Target’s algorithm invites us to overestimate the capabilities of data-fuelled computer analytics.

‘There’s a huge false positive issue,’ says Kaiser Fung, a data scientist who has spent years developing similar approaches for retailers and advertisers. What Mr Fung means is that we don’t get to hear stories about women who receive coupons for baby wear but who aren’t pregnant. Hearing the anecdote, it’s easy to assume that Target’s algorithms are infallible – that everybody receiving coupons for onesies and wet wipes is pregnant. But nobody ever claimed that, and it almost certainly isn’t true; maybe everybody received coupons for onesies. We should not accept the idea that Target’s computer is a mind-reader without first considering how many misses attend each hit.

There may be many of those misses – even for an easy guess such as ‘woman buying folic acid may be pregnant’. Folic acid purchases don’t guarantee pregnancy. The woman might be taking folic acid for some other reason. Or she might be buying the vitamin supplement for someone else. Or – and imagine the distress when the vouchers for baby clothes start arriving – she might have been pregnant but miscarried, or trying to get pregnant without success. It’s possible that Target’s algorithm is so brilliant that it filters out these painful cases. But it is not likely.

In Charles Duhigg’s account, Target mixes in random offers, such as coupons for wine glasses, because pregnant customers would feel spooked if they realised how intimately the company’s computers understood them. But Kaiser Fung has another explanation: Target mixes up its offers not because it would be weird to send an all-baby coupon book to a woman who was pregnant, but because the company knows that many of those coupon books will be sent to women who aren’t pregnant after all.

The manager should simply have said that: don’t worry about it, lots of people get these coupons. Why didn’t he? He probably didn’t understand the Target algorithm any more than the rest of us. Just like Google, Target might be reluctant to open up its algorithm and dataset for researchers – and competitors – to understand what’s going on.

The most likely situation is this: pregnancy is often a pretty easy condition to spot through shopping behaviour, so Target’s big data-driven algorithm surely can predict pregnancy better than random guessing. Doubtless, however, it is well short of infallibility. A random guess might be that any woman between fifteen and forty-five has a roughly 5 per cent chance of being pregnant at any time. If Target can improve to guessing correctly 10 or 15 per cent of the time, that’s already likely to be well worth doing. Even a modest increase in the accuracy of targeted special offers would help the bottom line. But profitability should not be conflated with omniscience.
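To see why ‘well worth doing’ still leaves plenty of misses, the back-of-the-envelope arithmetic – using the rough figures above, which are guesses rather than Target’s real numbers – looks like this:

```python
coupons_sent = 1000
hits_random = coupons_sent * 0.05    # 50 pregnant customers reached by mailing at random
hits_targeted = coupons_sent * 0.15  # 150 reached with the supposed targeted accuracy

print(f"targeted mailing: {hits_targeted:.0f} hits, {coupons_sent - hits_targeted:.0f} misses")
# Three times better than chance - and still 850 of every 1,000 coupon books
# go to women who are not pregnant at all.
```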

So let’s start by toning down the hype a little – both the apocalyptic idea that Cambridge Analytica can read your mind, and the giddy prospect that big data can easily replace more plodding statistical processes such as the CDC’s survey of influenza cases. When I first started grappling with big data, I called Cambridge University’s Professor Sir David Spiegelhalter – one of the country’s leading statisticians, and a brilliant statistical communicator. I summarised the cheerleading claims: of uncanny accuracy, of making sampling irrelevant because every data point was captured, and of consigning scientific models to the junk-heap because ‘the numbers speak for themselves’.

He didn’t feel the need to reach for a technical term. Those claims, he said, are ‘complete bollocks. Absolute nonsense.’

Making big data work is harder than it seems. Statisticians have spent the past two hundred years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days, but we must not pretend that the traps have all been made safe. They have not.

‘There are a lot of small data problems that occur in big data,’ added Spiegelhalter. ‘They don’t disappear because you’ve got lots of the stuff. They get worse.’

It hardly matters if some of Charles Duhigg’s readers are too credulous about the precision with which Target targets onesie coupons. But it does matter when people in power are similarly overawed by algorithms they don’t understand, and use them to make life-changing decisions.

One of Cathy O’Neil’s most powerful examples in Weapons of Math Destruction is the IMPACT algorithm used to assess teachers in Washington DC. As O’Neil describes, much-loved and highly respected teachers in the city’s schools were being abruptly sacked after achieving very poor ratings from the algorithm.

The IMPACT algorithm claimed to measure the quality of teaching, basically by checking whether the children in a teacher’s class had made progress or slipped backwards in their test scores.11 Measuring the true quality of teaching, however, is hard to do for two reasons. The first is that no matter how good or how bad the teacher, individual student achievement will vary a lot. With just thirty students in a class, a lot of what the algorithm measures will be noise; if a couple of kids made some lucky guesses in their start-of-year test and then were unlucky at the end of the year, that’s enough to make a difference to a teacher’s ranking. It shouldn’t be, because it’s pure chance. Another source of variation that’s out of the teacher’s control is when a child has a serious problem outside the classroom – anything from illness to bullying to a family member being imprisoned. This isn’t the same kind of noise as lucky or unlucky guesses in a test, because it tracks something real. A system that tracked and followed up signs of trouble outside class would be valuable. But it would be foolish and unfair to blame the pupil’s struggles on the teacher.
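To get a feel for how much luck is involved, here is a small simulation – my own toy model, not the IMPACT formula – in which every teacher is exactly equally good, and each class of thirty pupils is scored on its average progress:

```python
import numpy as np

rng = np.random.default_rng(1)
n_teachers, class_size = 100, 30

# Every teacher has the same true effect: pupils gain 5 points on average,
# but individual pupils vary with a standard deviation of 15 points.
gains = rng.normal(loc=5, scale=15, size=(n_teachers, class_size))
measured = gains.mean(axis=1)   # the class-average 'progress' each teacher is judged on

print(f"best-looking teacher:  {measured.max():+.1f} points")
print(f"worst-looking teacher: {measured.min():+.1f} points")
# The teaching is identical everywhere, yet the gap between 'best' and 'worst'
# is several points - rank and fire on this number and you are largely
# rewarding and punishing luck.
```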

The second problem is that the algorithm may be fooled by teachers who cheat, and the cheats will damage the prospects of the honest teachers. If the sixth-grade teacher finds a way to unfairly boost the test performance of the children he teaches – such things aren’t unknown – then not only will he be unfairly rewarded, but the seventh-grade teacher is going to be in serious trouble next year. Her incoming class will be geniuses on paper; improvement will be impossible unless she also finds a way to cheat.

O’Neil’s view, which is plausible, is that the data are so noisy that the task of assessing a teacher’s competence is hopeless for any algorithm. Certainly this particular algorithm’s judgements on which teachers were sub-par didn’t always tally with those of the teachers’ colleagues or students. But that didn’t stop the DC school district authorities firing 206 teachers in 2011 for failing to meet the algorithm’s standards.

So far we’ve focused on excessive credulity in the power of the algorithm to extract wisdom from the data it is fed. There’s another, related problem: excessive credulity in the quality or completeness of the dataset.

We explored the completeness problem in the previous chapter. Literary Digest accumulated what might fairly be described as big data. It was certainly an enormous survey by the standards of the day – indeed, even by today’s standards a dataset with 2.4 million people in it is impressive. But you can’t use Literary Digest surveys to predict election results if ‘people who respond to Literary Digest surveys’ differ in some consistent way from ‘people who vote in elections’.

Google Flu Trends captured every Google search, but not everybody who gets flu turns to Google. Its accuracy depended on ‘people with flu who consult Google about it’ not being systematically different from ‘people with flu’. The pothole-detecting app we met in the last chapter fell short because it confused ‘people who hear about and install pothole-detecting apps’ with ‘people who drive around the city’.

How about quality? Here’s an instructive example of big data from an even older vintage than the 1936 US election poll: the astonishing attempt to assess the typical temperature of the human body. Over the course of eighteen years, the nineteenth-century German doctor Carl Wunderlich assembled over a million measurements of body temperature, gathered from more than 25,000 patients. A million measurements! It’s a truly staggering achievement given the pen-and-paper technology of the day. Wunderlich is the man behind the conventional wisdom that normal body temperature is 98.6ºF. Nobody wanted to gainsay his findings, partly because the dataset was large enough to command respect, and partly because the prospect of challenging it with a bigger, better dataset was intimidating. As Dr Philip Mackowiak, an expert on Wunderlich, put it: ‘Nobody was in a position or had the desire to amass a dataset that large.’12

Yet Wunderlich’s numbers were off; we’re normally a little cooler (by about half a Fahrenheit degree).13 So formidable were his data that it took more than a hundred years to establish that the good doctor had been in error.*

So how could so large a dataset be wrong? When Dr Mackowiak discovered one of Carl Wunderlich’s old thermometers in a medical museum, he was able to inspect it. He found that it was miscalibrated by two degrees centigrade, almost four degrees Fahrenheit. This error was partly offset by Dr Wunderlich’s habit of taking the temperature of the armpit rather than carefully inserting the thermometer into one of the bodily orifices conventionally used in modern times. You can take a million temperature readings, but if your thermometer is broken and you’re poking around in armpits, then your results will be a precise estimate of the wrong answer. The old cliché of ‘garbage in, garbage out’ remains true no matter how many scraps of garbage you collect.
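A quick simulation shows what ‘a precise estimate of the wrong answer’ looks like in practice. The numbers here are mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
true_temp_f = 98.1   # roughly the modern estimate of average body temperature
bias_f = 0.5         # a systematic error from, say, a miscalibrated thermometer
readings = true_temp_f + bias_f + rng.normal(0, 1.0, size=1_000_000)

print(f"mean of a million readings: {readings.mean():.3f} F")
# The average is reported to three decimal places - wonderfully precise, and
# still wrong by the full half-degree bias. More data cannot fix a broken
# instrument; it only makes you more confident in the wrong answer.
```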

As we saw in the last chapter, the modern version of this old problem is an algorithm that has been trained on a systematically biased dataset. It’s surprisingly easy for such problems to be overlooked. In 2014, Amazon, one of the most valuable companies in the world, started using a data-driven algorithm to sift résumés, hoping that the computer would find patterns and pick out the very best people, based on their similarity to previous successful applicants. Alas, previous successful applicants were disproportionately men. The algorithm then did what algorithms do: it spotted the pattern and ran with it. Observing that men had in the past been preferred, it concluded that men were preferable. The algorithm penalised the word ‘women’s’ as in ‘Under-21 Women’s Soccer International’ or ‘Women’s Chess Club Captain’; it downgraded certain all-women’s colleges. Amazon abandoned the algorithm in 2018; it’s unclear exactly how much influence it had in making decisions, but Amazon admitted that its recruiters had been looking at the algorithm’s rankings.

Remember the ‘Math is Racist’ headline? I’m fairly confident that maths isn’t racist. Neither is it misogynistic, or homophobic, or biased in other ways. But I’m just as confident that some humans are. And computers trained on our own historical biases will repeat those biases at the very moment we’re trying to leave them behind us.14

I hope I’ve persuaded you that we shouldn’t be too eager to entrust our decisions to algorithms. But I don’t want to overdo the critique, because we don’t have some infallible alternative way of making decisions. The choice is between algorithms and humans. Some humans are prejudiced. Many humans are frequently tired, harassed and overworked. And all humans are, well, human.

In the 1950s, the psychologist Paul Meehl investigated whether the most basic of algorithms, in the form of uncomplicated statistical rules, could ever outperform expert human judgement. For example, a patient arrives at hospital complaining of chest pains. Does she have indigestion, or is she suffering from a heart attack? Meehl compared the verdicts of experienced doctors with the result of working through a brief checklist. Is chest pain the main symptom? Has the patient had heart attacks in the past? Has the patient used nitroglycerin to relieve chest pain in the past? What quantifiable patterns are shown on the cardiogram?15 Disconcertingly, that simple decision-tree got the right diagnosis more often than the doctors. And it was not the only example. Remarkably often, Meehl found, experts fared poorly when compared with simple checklists. Meehl described his Clinical vs. Statistical Prediction as ‘my disturbing little book’.16

So, to be fair, we should compare the fallibility of today’s algorithms with that of the humans who would otherwise be making the decisions. A good place to start is with an example from Hannah Fry’s book Hello World.

The story starts during the London riots of 2011. What began as a protest against police brutality turned into violent rioting as order broke down each evening across the city, and in several other cities around the country. Shops would close in the early afternoon and law-abiding citizens would hurry home, knowing that opportunistic troublemakers would be out as the light faded. In three days of trouble, more than a thousand people were arrested.

Among their number were Nicholas Robinson and Richard Johnson. Robinson was walking through the chaos and helped himself to a pack of bottled water from a shattered London supermarket. Johnson drove to a gaming store, put on a balaclava, and ran in to grab an armful of computer games. Johnson’s theft was of higher value and was premeditated rather than on the spur of the moment. And yet it was Robinson who received a six-month sentence, while Johnson’s conviction earned him no jail time at all. No algorithm could be blamed for the difference; human judges were the ones handing out those sentences, and the disparity seems bizarre.

It’s always possible that each judge made the right decision, based on some subtle detail of the case. But the most plausible answer for the inconsistent treatment of the two men was that Robinson was sentenced just two weeks after the riots, at a time when nerves were jangling and the fabric of civilisation seemed easily ripped. Johnson was sentenced months later, when the memory of the riots was fading and people were asking themselves what all the fuss had been about.17

Would a data-driven computer program have tuned out the mood music and delivered fairer sentences? It’s impossible to know – but quite possibly, yes. There is ample evidence that human judges aren’t terribly consistent. One way to test this is to show hypothetical cases to various judges and see if they reach different conclusions. They do. In one British study from 2001, judges were asked for judgements on a variety of cases; some of the cases (presented a suitable distance apart to disguise the subterfuge) were simply repeats of earlier cases, with names and other irrelevant details changed. The judges didn’t even agree with their own previous judgement on the identical case. That is one error that we can be fairly sure a computer would not make.18

A more recent study was conducted in the United States, by economist Sendhil Mullainathan and four colleagues. They analysed over 750,000 cases in New York City between 2008 and 2013 – cases in which someone had been arrested and the decision had to be taken as to whether to release the defendant, or to detain him or her, or to set a cash bail that had to be posted to secure release. The researchers could then see who had gone on to commit further crimes. They then used a portion of these cases (220,000) to train an algorithm to decide whether to release, detain or set bail. And they used the remaining cases to check whether the algorithm had done a good job or not, relative to human judges.19
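Their actual model and data are far richer than anything I can reproduce here, but the train-on-one-portion, test-on-the-rest discipline they describe is standard, and looks roughly like this sketch (the `cases` table and its columns are stand-ins of my own, not the researchers’ data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_release_model(cases):
    features = cases.drop(columns=["reoffended"])
    labels = cases["reoffended"]

    # Fit the model on one portion of the cases...
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.7, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # ...and judge it only on cases it has never seen, mirroring how the
    # researchers trained on roughly 220,000 cases and checked the result
    # against the remainder.
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```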

The humans did not do well. The researchers’ algorithm could have reduced crime-while-on-release by almost 25 per cent by jailing a better-selected group of defendants. Alternatively, it could have jailed 40 per cent fewer people without any increase in crime. Thousands of crimes could have been prevented, or thousands of people released pending trial, purely as a result of the algorithm outperforming the human judges.

One important error that the judges make is what legal scholar Cass Sunstein calls ‘current offence bias’ – that is, when they make decisions about bail they focus too much on the specific offence the defendant has been accused of. Defendants whose track record suggests they’re a high risk are treated as low risk if they’re accused of a minor crime, and defendants whose track record suggests they’re low risk are treated as high risk if the current offence is serious. There’s valuable information here that the algorithm puts to good use, but the human judges – for all their intelligence, experience and training – tend to overlook.

This seems to be how we humans operate. Consider the way I described the cases of Nicholas Robinson and Richard Johnson: I told you about the offences in question, nothing at all about Robinson and Johnson. It just seemed reasonable to me – and perhaps to you – to tell you all about the short term, about the current offence. An algorithm would have used more information if more information had been available. A human might not.

Many people have strong intuitions about whether they would rather have a vital decision about them made by algorithms or humans. Some people are touchingly impressed by the capabilities of the algorithms; others have far too much faith in human judgement. The truth is that sometimes the algorithms will do better than the humans, and sometimes they won’t. If we want to avoid the problems and unlock the promise of big data, we’re going to need to assess the performance of the algorithms on a case-by-case basis. All too often, this is much harder than it should be.

Consider this scenario. The police, or social services, receive a call from someone – a neighbour, a grandparent, a doctor, a teacher – who is worried about the safety of a child. Sometimes the child will genuinely be in danger; sometimes the caller will be mistaken, or over-anxious, or even malicious. In an ideal world, we’d take no chances and send a blue-lighted vehicle around immediately to check what’s going on. But we don’t have enough resources to do this in every case – we have to prioritise. The stakes could hardly be higher: official figures in the United States show that 1,670 children died in 2015 of abuse or neglect. That’s a horrific number, but a tiny fraction of the 4 million times someone calls to report their concerns about a child.

Which reports need to be followed up, and which can be reasonably ignored? Many police and social services use algorithms to help make that decision. The state of Illinois introduced just such an algorithm, called Rapid Safety Feedback. It analysed data on each report, compared them to the outcomes of previous cases, and produced a percentage prediction of the child’s risk of death or serious harm.

The results were not impressive. The Chicago Tribune reported that the algorithm gave 369 children a 100 per cent chance of serious injury or death. No matter how dire the home environment, that degree of certitude seems unduly pessimistic. It could also have grave implications: a false allegation of child neglect or abuse could have terrible consequences for the accused and the child alike.

But perhaps the algorithm erred on the side of caution, exaggerating the risk of harm because it was designed not to miss a single case? No: in some terrible cases toddlers died after being given a percentage risk too low to justify follow-up. In the end, Illinois decided the technology was useless, or worse, and stopped using it.20

The moral of this story isn’t that algorithms shouldn’t be used to assess reports about vulnerable children. Someone or something must make the decision on which cases to follow up. Mistakes are inevitable, and there’s no reason – in principle – why some other algorithm might not make fewer mistakes than a human call handler.21 The moral is that we know about the limitations of this particular algorithm only because it spat out explicit numbers that were obviously absurd.

‘It’s good they gave numerical probabilities, as this supplies the loud siren that makes us realize that these numbers are bad,’ explains statistician Andrew Gelman. ‘What would be worse is if [the algorithm] had just reported the predictions as “high risk”, “mid risk” and “low risk”.’ The problems might then never have come to light.22

So the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency and debate. And the solution, I’d argue, goes back a very long time.

In the mid-seventeenth century, a distinction began to emerge between alchemy and what we’d regard as modern science. It is a distinction that we need to remember if we are to flourish in a world of big data algorithms.

In 1648, Blaise Pascal’s brother-in-law, at the urging of the great French mathematician, conducted a celebrated experiment. In the garden of a monastery in the little city of Clermont-Ferrand, he took a tube filled with mercury, slid its open end into a bowl full of the liquid metal, and lifted it to a vertical position, jutting above the surface with the end submerged. Some of the mercury immediately drained into the bowl, but some did not. There was a column 711 millimetres high in the tube, and above it a space containing – what? Air? A vacuum? A mysterious ether?23

This was only the first stage of the experiment Pascal had proposed, and it wasn’t unprecedented. Gasparo Berti had done something similar in Rome with water – although with water, the glass tube needs to be more than 10 metres long, and making one was no easy task. Evangelista Torricelli, a student of Galileo, was the man who had the idea of using mercury instead, which requires a much shorter tube.

Pascal’s idea – or perhaps it was his friend René Descartes, since both of them claimed the credit – was to repeat the experiment at altitude. And so it was Pascal’s brother-in-law who had the job of lugging fragile glass tubes and several kilograms of mercury up to the top of Puy de Dôme, a striking dormant volcano in the heart of France, more than a kilometre above Clermont-Ferrand. At the top of the mountain, the mercury rose not 711 millimetres but just 627. Halfway down the mountain, the mercury column was longer than at the summit, but shorter than down in the garden. The next day, the column was measured at the top of Clermont-Ferrand’s cathedral. It was 4 millimetres shorter there than in the monastery garden. Pascal had invented what we now call the barometer – and simultaneously, the altimeter, a device that measured air pressure and, indirectly, altitude. In 1662, just fourteen years later, Robert Boyle formulated his famous gas law, describing the relationship between pressure and the volume of a gas. This was a rapid and rather modern advance in the state of scientific knowledge.

Yet it was taking place alongside the altogether more ancient practice of alchemy, the quest to find a way to turn base metals into gold and to produce an elixir of eternal life. These goals are, as far as we know, as near to impossible as makes no difference* – but if alchemy had been conducted using scientific methods one might still have expected all the alchemical research to produce a rich seam of informative failures, and a gradual evolution into modern chemistry.

That’s not what happened. Alchemy did not evolve into chemistry. It stagnated, and in due course science elbowed it to one side. For a while the two disciplines existed in parallel. So what distinguished them?

Of course, modern science uses the experimental method, so clearly demonstrated by Pascal’s hard-working brother-in-law, by Torricelli, Boyle and others. But so did alchemy. The alchemists were unrelenting experimenters. It’s just that their experiments yielded no information that advanced the field as a whole. The use of experiments does not explain why chemistry flourished and alchemy died.

Perhaps, then, it was down to the characters involved? Perhaps the great early scientists such as Robert Boyle and Isaac Newton were sharper, wiser, more creative men than the alchemists they replaced? This is a spectacularly unpersuasive explanation. Two of the leading alchemists of the 1600s were Robert Boyle and Isaac Newton. They were energetic, even fervent, practitioners of alchemy, which thankfully did not prevent their enormous contributions to modern science.24

No – the alchemists were often the very same people using the same experimental methods to try to understand the world around them. What accounts for the difference, says David Wootton, a historian of science, is that alchemy was pursued in secret while science depended on open debate. In the late 1640s, a small network of experimenters across France, including Pascal, worked simultaneously on vacuum experiments. At least a hundred people are known to have performed these experiments between Torricelli’s in 1643 and the formulation of Boyle’s Law in 1662. ‘These hundred people are the first dispersed community of experimental scientists,’ says Wootton.25

At the centre of the web of knowledge was Marin Mersenne – a monk, a mathematician, and a catalyst for scientific collaboration and open competition. Mersenne was friends with Pascal and Descartes, along with thinkers from Galileo to Thomas Hobbes, and would make copies of the letters he received and circulate them to others whom he thought would be interested. So prolific was his correspondence that he became known as ‘the post-box of Europe’.26

Mersenne died in 1648, less than three weeks before the experiment on Puy de Dôme, but his ideas about scientific collaboration lived on in the form of the Royal Society in London (established 1660) and the French Academy of Sciences (established 1666), both along distinctly Mersennian lines. One of the virtues of the new approach, well understood at the time, was reproducibility – which, as we saw in the fifth chapter, is a vital check on both fraud and error. The Puy de Dôme experiment could be and was repeated anywhere there was a hill or even a tall building. ‘All the curious can test it themselves whenever they like,’ wrote Pascal. And they did.

Yet while the debate over vacuums, gases and those tubes of mercury was being vigorously carried out through letters, publications and meetings at Mersenne’s home in Paris, alchemical experiments were conducted in secret. It isn’t hard to see why: there is no value to turning lead into gold if everyone knows how to do it. No alchemist wanted to share his potentially instructive failures with anyone else.

The secrecy was self-perpetuating. One of the reasons that alchemy lasted so long, and that even brilliant scholars such as Boyle and Newton took it seriously, was the assumption that alchemical problems had been solved by previous generations, but kept secret and then lost. When Newton famously declared ‘if I have seen further it is by standing on the shoulders of giants’, this was true only of his scientific work. As an alchemist, he stood on nobody’s shoulders and saw little.

When Boyle did try to publish some of his findings, and seek out other alchemists, Newton warned him to stop and instead maintain ‘high silence’. And as it became clear that the newly open scientific community was making rapid progress, alchemy itself became discredited within a generation. In short, says Wootton,

What killed alchemy was the insistence that experiments must be openly reported in publications which presented a clear account of what had happened, and they must then be replicated, preferably before independent witnesses. The alchemists had pursued a secret learning . . . some parts of that learning could be taken over by . . . the new chemistry, but much of it had to be abandoned as incomprehensible and unreproducible. Esoteric knowledge was replaced by a new form of knowledge which depended both on publication and on public or semi-public performance.27

Alchemy is not the same as gathering big datasets and developing pattern-recognising algorithms. For one thing, alchemy is impossible, and deriving insights from big data is not. Yet the parallels should also be obvious. The likes of Google and Target are no more keen to share their datasets and algorithms than Newton was to share his alchemical experiments. Sometimes there are legal or ethical reasons – if you’re trying to keep your pregnancy a secret, you don’t want Target publicly disclosing your folic acid purchases – but most obviously the reasons are commercial. There’s gold in the data that Amazon, Apple, Facebook, Google and Microsoft have about us. And that gold will be worth a lot less to them if the knowledge that produces it is shared with everyone.

But just as the most brilliant thinkers of the age failed to make progress while practising in secret, secret algorithms based on secret data are likely to lead to missed opportunities for improvement. Again, it hardly matters much if Target is missing out on a slightly more effective way to target onesie coupons. But when algorithms are firing capable teachers, directing social services to the wrong households, or downgrading job applicants who went to women’s colleges, we need to be able to subject them to scrutiny.

But how?

One approach is that used by a team of investigative journalists at ProPublica, led by Julia Angwin. Angwin’s team wanted to scrutinise a widely used algorithm called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions). COMPAS used the answers to a 137-item questionnaire to assess the risk that a criminal might re-offend. But did it work? And was it fair?

It wasn’t easy to find out. COMPAS is owned by a company, Equivant (formerly Northpointe), which is under no obligation to share the details of how it works. And so Angwin and her team had to judge it by analysing the results, laboriously pulled together from Broward County in Florida, a state that has strong transparency laws.

Here’s an edited account of how the ProPublica team went about their work:

Through a public records request, ProPublica obtained two years’ worth of COMPAS scores from the Broward County Sheriff’s Office in Florida. We received data for all 18,610 people who were scored in 2013 and 2014 . . . Each pretrial defendant received at least three COMPAS scores: ‘Risk of Recidivism,’ ‘Risk of Violence’ and ‘Risk of Failure to Appear.’ COMPAS scores for each defendant ranged from 1 to 10, with ten being the highest risk. Scores 1 to 4 were labeled by COMPAS as ‘Low’; 5 to 7 were labeled ‘Medium’; and 8 to 10 were labeled ‘High.’ Starting with the database of COMPAS scores, we built a profile of each person’s criminal history, both before and after they were scored. We collected public criminal records from the Broward County Clerk’s Office website through April 1, 2016. On average, defendants in our dataset were not incarcerated for 622.87 days (sd: 329.19). We matched the criminal records to the COMPAS records using a person’s first and last names and date of birth . . . We downloaded around 80,000 criminal records from the Broward County Clerk’s Office website.28

And so it continues. This was painstaking work.
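For a flavour of what that painstaking work involves, here is a sketch of the record-linkage step ProPublica describes – attaching court records to each COMPAS score by matching on name and date of birth. The column names are stand-ins of my own, not ProPublica’s actual schema:

```python
import pandas as pd

def link_records(compas_scores: pd.DataFrame, court_records: pd.DataFrame) -> pd.DataFrame:
    keys = ["first_name", "last_name", "date_of_birth"]
    # Normalise the name fields so trivial formatting differences don't block a match.
    for df in (compas_scores, court_records):
        df["first_name"] = df["first_name"].str.strip().str.lower()
        df["last_name"] = df["last_name"].str.strip().str.lower()
    # Keep every scored defendant; attach whatever criminal records were found for them.
    return compas_scores.merge(court_records, on=keys, how="left")
```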

Eventually, ProPublica published their conclusions. Although the COMPAS algorithm did not use an offender’s race as a predictor, it was nevertheless producing racially disparate results. It was more likely to produce false positives for black offenders (predicting that they would re-offend when in fact they did not) and more likely to produce false negatives for white offenders (predicting that they would not re-offend when in fact they did).

That sounds very worrying: racial discrimination is both immoral and illegal when coming from a human; we shouldn’t tolerate it if it emerges from an algorithm.

But then four academic researchers, Sam Corbett-Davies, Emma Pierson, Avi Feller and Sharad Goel, pointed out that the situation wasn’t so clear-cut.29 They used the data laboriously assembled by ProPublica to show that the algorithm was fair by another important metric, which was that if the algorithm gave two criminals – one black, one white – the same risk rating, then the actual risk that they re-offended was the same. In that important respect the algorithm was colour-blind.

What’s more, the researchers showed that it was impossible for the algorithm to be fair in both ways simultaneously. It was possible to craft an algorithm that would give an equal rate of false positives for all races, and it was possible to craft an algorithm where the risk ratings matched the re-offending risk for all races, but it wasn’t possible to do both at the same time: the numbers just couldn’t be made to add up.

The only way in which an algorithm could be constructed to produce equal results for different groups – whether those groups were defined by age, gender, race, hair colour, height or any other criterion – would be if the groups otherwise behaved and were treated identically. If they moved through the world in different ways, the algorithm would, inevitably, violate at least one criterion of fairness when evaluating them. That is true whether or not the algorithm was actually told their age, gender, race, hair colour or height. It would also be true of a human judge; it’s a matter of arithmetic.
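The arithmetic is worth seeing. Here is a toy example, with numbers I have invented: a risk score that is perfectly calibrated – a ‘high risk’ label means the same 60 per cent chance of re-offending in both groups – still produces different false-positive rates once the groups’ underlying re-offending rates differ:

```python
def fpr_given_calibration(flagged_share, risk_if_flagged=0.6, risk_if_not=0.2):
    # Overall re-offending rate implied by calibration plus how many get flagged.
    base_rate = flagged_share * risk_if_flagged + (1 - flagged_share) * risk_if_not
    wrongly_flagged = flagged_share * (1 - risk_if_flagged)   # flagged, but never re-offend
    never_reoffend = 1 - base_rate
    return base_rate, wrongly_flagged / never_reoffend

for group, flagged_share in [("Group A", 0.25), ("Group B", 0.50)]:
    base_rate, fpr = fpr_given_calibration(flagged_share)
    print(f"{group}: re-offending rate {base_rate:.0%}, false-positive rate {fpr:.0%}")
# Group A: re-offending rate 30%, false-positive rate 14%
# Group B: re-offending rate 40%, false-positive rate 33%
# 'High risk' means exactly the same thing in both groups, yet people in the
# group with the higher base rate are far more likely to be wrongly flagged.
```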

Julia Dressel and Hany Farid, also computer scientists, observed this debate over whether COMPAS was producing results with a racial bias. They thought there was something missing. ‘There was this underlying assumption in the conversation that the algorithm’s predictions were inherently better than human ones,’ Dressel told the science writer Ed Yong, ‘but I couldn’t find any research proving that.’30

Thanks to ProPublica’s spade-work, Dressel and Farid could investigate the question for themselves. Even if COMPAS itself was a secret, ProPublica had published enough of the results to allow it to be meaningfully tested against other benchmarks. One was a simple mathematical model with just two variables: the age of the offender and the number of previous offences. Dressel and Farid showed that the two-variable model was just as accurate as the much-vaunted 137-variable COMPAS model. Dressel and Farid also tested COMPAS predictions against the judgement of ordinary, non-expert humans who were shown just seven pieces of information about each offender and asked to predict whether he or she would re-offend within two years. The average of a few of these non-expert predictions outperformed the COMPAS algorithm.

This is striking stuff. As Farid commented, a judge might be impressed if told that a data-driven algorithm had rated a person as high-risk, but would be far less impressed if told, ‘Hey, I asked twenty random people online if this person will recidivate and they said yes.’31

Is it too much to ask COMPAS to beat the judgement of twenty random people from the internet? It doesn’t seem to be a high bar; nevertheless COMPAS could not clear it.32
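A baseline as simple as the one Dressel and Farid tested can be sketched in a few lines – two inputs, age and number of prior offences, and nothing else. The column names below follow the style of ProPublica’s published files, but treat them as my assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def two_variable_baseline(df):
    X = df[["age", "priors_count"]]      # just two pieces of information per offender
    y = df["two_year_recid"]             # did they re-offend within two years?
    model = LogisticRegression()
    # Cross-validated accuracy of the two-variable model; Dressel and Farid
    # found something this simple matched the 137-input COMPAS tool.
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
```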

Demonstrating the limitations of the COMPAS algorithm wasn’t hard once the ProPublica data on COMPAS’s decision-making had been released to allow researchers to analyse and debate them. Keeping the algorithms and the datasets under wraps is the mindset of the alchemist. Sharing them openly so they can be analysed, debated and – hopefully – improved on? That’s the mindset of the scientist.

Listen to the speeches of traditional centre-ground politicians, or read media commentary, and it is common to encounter a view such as ‘levels of trust are declining’, or that ‘we need to rebuild trust’. Baroness Onora O’Neill, who has become an authority on the topic, argues that such hand-wringing reflects sloppy thinking. She argues that we don’t and shouldn’t trust in general: we trust specific people or institutions to do specific things. (For example: I have a friend I’d never trust to post a letter for me – but I’d gladly trust him to take care of my children.) Trust should be discriminating: ideally we should trust the trustworthy, and distrust the incompetent or malign.33

Just like people, algorithms are neither trustworthy nor untrustworthy as a general class. Just as with people, rather than asking ‘Should we trust algorithms?’ we should ask ‘Which algorithms can we trust, and what can we trust them to do?’

Onora O’Neill argues that if we want to demonstrate trustworthiness, we need the basis of our decisions to be ‘intelligently open’. She proposes a checklist of four properties that intelligently open decisions should have. Information should be accessible: that implies it’s not hiding deep in some secret data vault. Decisions should be understandable – capable of being explained clearly and in plain language. Information should be usable – which may mean something as simple as making data available in a standard digital format. And decisions should be assessable – meaning that anyone with the time and expertise has the detail required to rigorously test any claims or decisions if they wish to.

O’Neill’s principles seem like a sensible way to approach algorithms entrusted with life-changing responsibilities, such as whether to release a prisoner, or respond to a report of child abuse. It should be possible for independent experts to get under the hood and see how the computers are making their decisions. When we have legal protections – for example, forbidding discrimination on the grounds of race, sexuality or gender – we need to ensure that the algorithms live up to the same standards we expect from humans. At the very least that means the algorithm needs to be scrutable in court.

Cathy O’Neil, author of Weapons of Math Destruction, argues that data scientists should – like doctors – form a professional society with a professional code of ethics. If nothing else, that would provide an outlet for whistle-blowers, ‘so that we’d have someone to complain to when our employer (Facebook, say) is asking us to do something we suspect is unethical or at least isn’t up to the standards of accountability that we’ve all agreed to’.34

Another parallel with the practice of medicine is that important algorithms should be tested using randomised controlled trials. If an algorithm’s creators claim that it will sack the right teachers, or recommend bail for the right criminal suspects, our response should be, ‘prove it’. The history of medicine teaches us that plausible-sounding ideas can be found wanting when subjected to a fair test. Algorithms aren’t medicines, so simply cloning an organisation such as the US Food and Drug Administration wouldn’t work; we’d need to run the trials over faster timelines and take a different view of what informed consent looked like. (Clinical trials have high standards for ensuring that people consent to participate; it’s not so clear how those standards would apply to an algorithm that rates teachers – or criminal suspects.) Still, anyone who is confident of the effectiveness of their algorithm should be happy to demonstrate that effectiveness in a fair and rigorous test. And vital institutions such as schools and courts shouldn’t be willing to use those algorithms on a large scale unless they’ve proved themselves.

Clearly, not all algorithms raise such weighty concerns. It wouldn’t obviously serve the public interest to force Target to let researchers see how they decide who receives onesie coupons. We need to look on a case-by-case basis. What sort of accountability or transparency we want depends on what problem we are trying to solve.

We might, for example, want to distinguish YouTube’s algorithm for recommending videos from Netflix’s algorithm for recommending movies. There is plenty of disturbing content on YouTube, and its recommendation engine has become notorious for its apparent tendency to suggest ever more fringey and conspiratorial videos. It’s not clear that the evidence supports the idea that YouTube is an engine of radicalisation, but without more transparency it’s hard to be sure.35

Netflix illustrates a different issue: competition. Its recommendation algorithm draws on a huge, secret dataset of which customers have watched which movies. Amazon has a similar, equally secret dataset. Suppose I’m a young entrepreneur with a brilliant idea for a new kind of algorithm to predict which movies people will like based on their previous viewing habits. Without the data to test it on, my brilliant idea can never be realised. There’s no particular reason for us to worry about how the Amazon and Netflix algorithms work, but is there a case for forcing them to make public their movie-viewing datasets, to unleash competition in algorithm design that might ultimately benefit consumers?

One concern is immediately apparent – privacy. You might think that was an easy problem to solve: just remove the names from records and the data are anonymous! Not so fast: with a rich dataset, and by cross-referencing with other datasets, it is often surprisingly easy to figure out who Individual #961860384 actually is. Netflix once released an anonymised dataset to researchers as part of a competition to find a better recommendation algorithm. Unfortunately, it turned out that one of their customers had posted the same review of a family movie on Netflix and, under her real name, on the Internet Movie Database website. Her no-longer-anonymous Netflix reviews revealed that she was attracted to other women – something she preferred to keep secret.36 She sued the company for ‘outing’ her; it settled on undisclosed terms.

Still, there are ways forward. One is to allow secure access to certified researchers. Another is to release ‘fuzzy’ data where all the individual details are a little bit off, but rigorous conclusions can still be drawn about populations as a whole. Companies such as Google and Facebook gain an enormous competitive advantage from their datasets: they can nip small competitors in the bud, or use data from one service (such as Google Search) to promote another (such as Google Maps or Android). If some of that data were made publicly available, other companies would be able to learn from it, produce better services, and challenge the big players. Scientists and social scientists could learn a lot, too; one possible model is to require private ‘big data’ sets to be published after a delay and with suitable protections of anonymity. Three-year-old data are stale for many commercial purposes but may still be of tremendous scientific value.
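The ‘fuzzy data’ idea can itself be sketched in a few lines. This is a toy version of techniques that have been formalised as differential privacy, not any particular company’s scheme:

```python
import numpy as np

rng = np.random.default_rng(3)
true_ages = rng.integers(18, 80, size=100_000)                   # pretend these are private records
released = true_ages + rng.laplace(0, 10, size=true_ages.size)   # add noise before release

print(f"true mean age {true_ages.mean():.2f}, released mean age {released.mean():.2f}")
# Any individual released value is off by about ten years on average, so it
# reveals little about that person - but averages over the whole population
# remain accurate enough for research.
```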

There is a precedent for this: patent holders must publish their ideas in order to receive any intellectual property protection; perhaps a similar bargain could be offered to, or imposed on, private holders of large datasets.

‘Big data’ is revolutionising the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinised; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyse small datasets are usually easy to check; pattern-recognising algorithms can all too easily be mysterious and commercially sensitive black boxes.

I’ve argued that we need to be sceptical of both hype and hysteria. We should ask tough questions on a case-by-case basis whenever we have reason for concern. Are the underlying data accessible? Has the performance of the algorithm been assessed rigorously – for example, by running a randomised trial to see if people make better decisions with or without algorithmic advice? Have independent experts been given a chance to evaluate the algorithm? What have they concluded? We should not simply trust that algorithms are doing a better job than humans, and nor should we assume that if the algorithms are flawed, the humans would be flawless.

But there is one source of statistics that, at least for the citizens of most rich countries, I think we should trust more than we do. And it is to this source that we now turn.

___________

* Although if it is a recipe, it is a recipe written by a particularly pedantic chef. Most recipes leave room for common sense, but if an algorithm is to be interpreted by a computer the steps must be tightly specified.

* The problem was exacerbated by a conversion of units. Wunderlich’s original measurements were made in centigrade, and his results put the typical body temperature at around 37ºC – implicitly, given that degree of precision, a range of up to a degree centigrade, somewhere above 36.5ºC and below 37.5ºC. But when Wunderlich’s articles in German were translated into English, reaching a larger audience, the temperature was converted from centigrade to Fahrenheit and became 98.6ºF – inviting physicians to assume that the temperature had been measured to one tenth of a degree Fahrenheit rather than one degree centigrade. The implied precision was almost twenty times greater – but all that had actually changed was a conversion between two temperature scales.

* A particle accelerator will turn base metals into gold, although not cheaply. In 1980, researchers bombarded the faintly lead-like metal bismuth and created a few atoms of gold. The cost was a less-than-economical rate of one quadrillion dollars an ounce. We are yet to discover an elixir of eternal life.