Justice

IT’S NOT UNUSUAL to find good-natured revellers drinking on a summer Sunday evening in the streets of Brixton, where our next story begins. Brixton, in south London, has a reputation as a good place to go for a night out; on this particular evening, a music festival had just finished and the area was filled with people merrily making their way home, or carrying on the party. But at 11.30 p.m. the mood shifted. A fight broke out on a local council estate, and when police failed to contain the trouble, it quickly spilled into the centre of Brixton, where hundreds of young people joined in.

This was August 2011. The night before, on the other side of the city, an initially peaceful protest over the shooting by police of a young Tottenham man named Mark Duggan had turned violent. Now, for the second night in a row, areas of the city were descending into chaos – and this time, the atmosphere was different. What had begun as a local demonstration was now a widespread breakdown in law and order and a looting free-for-all.

Just as the rioting took hold, Nicholas Robinson, a 23-year-old electrical engineering student who had spent the weekend at his girlfriend’s, headed home, taking his usual short walk through Brixton.1 By now, the familiar streets were practically unrecognizable. Cars had been upturned, windows had been smashed, fires had been started, and all along the street shops had been broken into.2 Police had been desperately trying to calm the situation, but were powerless to stop the cars and scooters pulling up alongside the smashed shop fronts and loading up with stolen clothes, shoes, laptops and TVs. Brixton was completely out of control.

A few streets away from an electrical store that was being thoroughly ransacked, Nicholas Robinson walked past his local supermarket. Like almost every other shop, it was a wreck: the glass windows and doors had been broken, and the shelves inside were strewn with the mess from the looters. Streams of rioters were running past holding on to their brand-new laptops, unchallenged by police officers. Amid the chaos, feeling thirsty, Nicholas walked into the store and helped himself to a £3.50 case of bottled water. Just as he rounded the corner to leave, the police entered the supermarket. Nicholas immediately realized what he had done, dropped the case and tried to run.3

As Monday night rolled in, the country braced itself for further riots. Sure enough, that night looters took to the streets again.4 Among them was 18-year-old Richard Johnson. Intrigued by what he’d seen on the news, he grabbed a (distinctly un-summery) balaclava, jumped into a car and made his way to the local shopping centre. With his face concealed, Richard ran into the town’s gaming store, grabbed a haul of computer games and returned to the car.5 Unfortunately for Richard, he had parked in full view of a CCTV camera. The registration plate made it easy for police to track him down, and the recorded evidence made it easy to bring a case against him.6

Both Richard Johnson and Nicholas Robinson were arrested for their actions in the riots of 2011. Both were charged with burglary. Both stood before judges. Both pleaded guilty. But that’s where the similarities in their cases end.

Nicholas Robinson was first to be called to the dock, appearing before a judge at Camberwell Magistrates’ Court less than a week after the incident. Despite the low value of the bottled water he had stolen, despite his lack of a criminal record, his being in full-time education, and his telling the court he was ashamed of himself, the judge said his actions had contributed to the atmosphere of lawlessness in Brixton that night. And so, to gasps from his family in the public gallery, Nicholas Robinson was sentenced to six months in prison.7

Johnson’s case appeared in front of a judge in January 2012. Although he’d gone out wearing an item of clothing designed to hide his identity with the deliberate intention of looting, and although he too had played a role in aggravating public disorder, Johnson escaped jail entirely. He was given a suspended sentence and ordered to perform two hundred hours of unpaid work.8

The consistency conundrum

The judicial system knows it isn’t perfect, and it doesn’t pretend to be. Judging guilt and assigning punishment isn’t an exact science, and there’s no way a judge can guarantee precision. That’s why phrases such as ‘reasonable doubt’ and ‘substantial grounds’ are so fundamental to the legal vocabulary, and why appeals are such an important part of the process; the system accepts that absolute certainty is unachievable.

Even so, discrepancies in the treatment of some defendants – like Nicholas Robinson and Richard Johnson – do seem unjust. There are too many factors involved ever to say for certain that a difference in sentence is ‘unfair’, but within reason, you would hope that judges were broadly consistent in the way they made decisions. If you and your imaginary twin committed an identical crime, for instance, you would hope that a court would give you both the same sentence. But would it?

In the 1970s, a group of American researchers tried to answer a version of this question.9 Rather than using twin criminals (which is practically difficult and ethically undesirable) they created a series of hypothetical cases and independently asked 47 Virginia State district court judges how they would deal with each. Here’s an example from the study for you to try. How would you handle the following case?

An 18-year-old female defendant was apprehended for possession of marijuana, arrested with her boyfriend and seven other acquaintances. There was evidence of a substantial amount of smoked and un-smoked marijuana found, but no marijuana was discovered directly on her person. She had no previous criminal record, was a good student from a middle-class home and was neither rebellious nor apologetic for her actions.

The differences in judgments were dramatic. Of the 47 judges, 29 decreed the defendant not guilty and 18 declared her guilty. Of those who opted for a guilty verdict, eight recommended probation, four thought a fine was the best way to go, three would issue both probation and a fine, and three judges were in favour of sending the defendant to prison.

So, on the basis of identical evidence in identical cases, a defendant could expect to walk away scot-free or be sent straight to jail, depending entirely on which judge they were lucky (or unlucky) enough to find themselves in front of.

That’s quite a blow for anyone holding on to hopes of courtroom consistency. But it gets worse. Because, not only do the judges disagree with each other, they’re also prone to contradicting their own decisions.

In a more recent study, 81 UK judges were asked whether they’d award bail to a number of imaginary defendants.10 Each hypothetical case had its own imaginary back-story and imaginary criminal history. Just like their counterparts in the Virginia study, the British judges failed to agree unanimously on a single one of the 41 cases presented to them.11 But this time, in among the 41 hypothetical cases given to every judge were seven that appeared twice – with the names of the defendants changed on their second appearance so the judge wouldn’t notice they’d been duplicated. It was a sneaky trick, but a revealing one. Most judges didn’t manage to make the same decision on the same case when seeing it for the second time. Astonishingly, some judges did no better at matching their own answers than if they were, quite literally, awarding bail at random.12

Numerous other studies have come to the same conclusion: whenever judges have the freedom to assess cases for themselves, there will be massive inconsistencies. Allowing judges room for discretion means allowing there to be an element of luck in the system.

There is a simple solution, of course. An easy way to make sure that judges are consistent is to take away their ability to exercise discretion. If every person charged with the same offence were dealt with in exactly the same way, precision could be guaranteed – at least in matters of bail and sentencing. And indeed, some countries have taken this route. There are prescriptive sentencing systems in operation at the federal level in the United States and in parts of Australia.13 But this kind of consistency comes at a price. For all that you gain in precision, you lose in another kind of fairness.

To illustrate, imagine two defendants, both charged with stealing from a supermarket. One is a relatively comfortable career criminal stealing through choice; the other is recently unemployed and struggling to make ends meet, stealing to feed their family and full of remorse for their actions. By removing the freedom to take these mitigating factors into account, strict guidelines that treat all those charged with the same crime in the same way can end up being unnecessarily harsh on some, and mean throwing away the chance to rehabilitate some criminals.

This is quite a conundrum. However you set up the system for your judges, you have to find a tricky balance between offering individualized justice and ensuring consistency. Most countries have tried to solve the dilemma by settling on a system that falls somewhere between the prescriptive extreme of US federal law and one based almost entirely on judges’ discretion – like that used in Scotland.14 Across the Western world, sentencing guidelines tend to lay down a maximum sentence (as in Ireland) or a minimum sentence (as in Canada) or both (as in England and Wales),15 and allow judges latitude to adjust the sentence up or down between those limits.

No system is perfect. There is always a muddle of competing unfairnesses, a chaos of opposing injustices. But in all this conflict and complexity an algorithm has a chance of coming into its own. Because – remarkably – with an algorithm as part of the process, both consistency and individualized justice can be guaranteed. No one needs to choose between them.

The justice equation

Algorithms can’t decide guilt. They can’t weigh up arguments from the defence and prosecution, or analyse evidence, or decide whether a defendant is truly remorseful. So don’t expect them to replace judges any time soon. What an algorithm can do, however, incredible as it might seem, is use data on an individual to calculate their risk of re-offending. And, since many judges’ decisions are based on the likelihood that an offender will return to crime, that turns out to be a rather useful capacity to have.

Data and algorithms have been used in the judicial system for almost a century, the first examples dating back to 1920s America. At the time, under the US system, convicted criminals would be sentenced to a standard maximum term and then become eligible for parole after a period of time had elapsed. Tens of thousands of prisoners were granted early release on this basis. Some were successfully rehabilitated, others were not. But collectively they presented the perfect setting for a natural experiment: could you predict whether an inmate would violate their parole?

Enter Ernest W. Burgess, a Canadian sociologist at the University of Chicago with a thirst for prediction. Burgess was a big proponent of quantifying social phenomena. Over the course of his career he tried to forecast everything from the effects of retirement to marital success, and in 1928 he became the first person to successfully build a tool to predict the risk of criminal behaviour based on measurement rather than intuition.

Using all kinds of data from three thousand inmates in three Illinois prisons, Burgess identified 21 factors he deemed to be ‘possibly significant’ in determining the chances of whether someone would violate the terms of their parole. These included the type of offence, the months served in prison and the inmate’s social type, which – with the delicacy one would expect from an early-twentieth-century social scientist – he split into categories including ‘hobo’, ‘drunkard’, ‘ne’er do-well’, ‘farm boy’ and ‘immigrant’.16

Burgess gave each inmate a score between zero and one on each of the 21 factors. The men who got high scores (between 16 and 21) he deemed least likely to re-offend; those who scored low (four or less) he judged likely to violate their terms of release.
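Burgess’s method amounts to a simple additive checklist, and it can be sketched in a few lines of code. The factor names below are invented for illustration; they are not Burgess’s actual 21 factors, and the only details taken from his study are the 0-or-1 scoring and the 16-and-above / 4-and-below cut-offs.

```python
# A minimal sketch of a Burgess-style additive score. Each factor is worth
# 0 or 1; the total places the inmate in a risk band. The three factors
# below are hypothetical stand-ins for Burgess's 21.

def burgess_score(inmate: dict, factors: list) -> int:
    """Score one point for each favourable factor the inmate satisfies."""
    return sum(1 for factor in factors if factor(inmate))

# Hypothetical favourable indicators (each scored 0 or 1):
FACTORS = [
    lambda i: i["months_served"] < 24,
    lambda i: i["offence"] != "violent",
    lambda i: i["prior_convictions"] == 0,
]

def risk_band(score: int) -> str:
    """Burgess's cut-offs: 16-21 low risk, 4 or fewer high risk."""
    if score >= 16:
        return "least likely to violate parole"
    if score <= 4:
        return "likely to violate parole"
    return "intermediate"
```

The appeal, and the weakness, is the same: the whole prediction is nothing more than counting ticks on a checklist that someone chose by hand.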

When all the inmates were eventually granted their release, and so were free to violate the terms of their parole if they chose to, Burgess had a chance to check how good his predictions were. From such a basic analysis, he managed to be remarkably accurate. Ninety-eight per cent of his low-risk group made a clean pass through their parole, while two-thirds of his high-risk group did not.17 Even crude statistical models, it turned out, could make better forecasts than the experts.

But his work had its critics. Sceptical onlookers questioned how much the factors which reliably predicted parole success in one place at one time could apply elsewhere. (They had a point: I’m not sure the category ‘farm boy’ would be much help in predicting recidivism among modern inner-city criminals.) Other scholars criticized Burgess for just making use of whatever information was on hand, without investigating if it was relevant.18 There were also questions about the way he scored the inmates: after all, his method was little more than opinion written in equations. None the less, its forecasting power was impressive enough that by 1935 the Burgess method had made its way into Illinois prisons, to support parole boards in making their decisions.19 And by the turn of the century mathematical descendants of Burgess’s method were being used all around the world.20

Fast-forward to the modern day, and the state-of-the-art risk-assessment algorithms used by courtrooms are far more sophisticated than the rudimentary tools designed by Burgess. They are not only found assisting parole decisions, but are used to help match intervention programmes to prisoners, to decide who should be awarded bail, and, more recently, to support judges in their sentencing decisions. The fundamental principle is the same as it always was: in go the facts about the defendant – age, criminal history, seriousness of the crime and so on – and out comes a prediction of how risky it would be to let them loose.

So, how do they work? Well, broadly speaking, the best-performing contemporary algorithms use a technique known as random forests, which – at its heart – has a fantastically simple idea. The humble decision tree.

Ask the audience

You might well be familiar with decision trees from your schooldays. They’re popular with maths teachers as a way to structure observations, like coin flips or dice rolls. Once built, a decision tree can be used as a flowchart: taking a set of circumstances and assessing step by step what to do, or, in this case, what will happen.

Imagine you’re trying to decide whether to award bail to a particular individual. As with parole, this decision is based on a straightforward calculation. Guilt is irrelevant. You only need to make a prediction: will the defendant violate the terms of their bail agreement, if granted leave from jail?

To help with your prediction, you have data from a handful of previous offenders, some who fled or went on to re-offend while on bail, some who didn’t. Using the data, you could imagine constructing a simple decision tree by hand, like the one below, using the characteristics of each offender to build a flowchart. Once built, the decision tree can forecast how the new offender might behave. Simply follow the relevant branches according to the characteristics of the offender until you get to a prediction. Just as long as they fit the pattern of everyone who has gone before, the prediction will be right.

[Figure: a simple decision tree built by hand from the characteristics of previous offenders, each path through the branches ending in a prediction of whether bail conditions will be violated]
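A decision tree of this kind can be written out directly as the flowchart it is. The split questions below (prior offences, type of offence, age) are invented for illustration; a real tree would have its branches learned from the historical data rather than chosen by hand.

```python
# A toy decision tree for the bail prediction, written as nested conditions.
# Each "if" is one branch point in the flowchart; each return is a leaf.
# The features and thresholds are hypothetical.

def predict_violation(defendant: dict) -> bool:
    """Follow the branches; return True if we predict a bail violation."""
    if defendant["prior_offences"] > 2:
        if defendant["offence"] == "violent":
            return True           # many priors and a violent offence: predict violation
        return defendant["age"] < 25   # many priors, non-violent: split on age
    return False                  # few priors: predict compliance
```

Following the branches for a new defendant takes a handful of comparisons, which is exactly what makes a single tree so easy to use, and so brittle.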

But this is where decision trees of the kind we made in school start to fall down. Because, of course, not everyone does follow the pattern of those who went before. On its own, this tree is going to get a lot of forecasts wrong. And not just because we’re starting with a simple example. Even with an enormous dataset of previous cases and an enormously complicated flowchart to match, using a single tree may only ever be slightly better than random guessing.

And yet, if you build more than one tree – everything can change. Rather than using all the data at once, there is a way to divide and conquer. In what is known as an ensemble, you first build thousands of smaller trees from random subsections of the data. Then, when presented with a new defendant, you simply ask every tree to vote on whether it thinks awarding bail is a good idea or not. The trees may not all agree, and on their own they might still make weak predictions, but just by taking the average of all their answers, you can dramatically improve the precision of your predictions.
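The arithmetic behind this improvement can be seen in a small simulation. Here each ‘tree’ is a stand-in that predicts correctly with probability 0.6, independently of the others; the numbers are illustrative, but the mechanism, errors cancelling out under a majority vote, is the same one a random forest exploits.

```python
import random

# A sketch of why averaging many weak predictors helps. Each simulated
# "tree" is right 60% of the time, independently; the majority vote of a
# thousand such trees is right almost always.

random.seed(42)

def weak_tree_correct(p: float = 0.6) -> bool:
    """One weak predictor: correct with probability p."""
    return random.random() < p

def majority_vote_correct(n_trees: int = 1001, p: float = 0.6) -> bool:
    """The ensemble is right whenever more than half its trees are."""
    votes = sum(weak_tree_correct(p) for _ in range(n_trees))
    return votes > n_trees // 2

trials = 200
ensemble_accuracy = sum(majority_vote_correct() for _ in range(trials)) / trials
print(ensemble_accuracy)  # close to 1.0, far above the 0.6 of any single tree
```

The caveat, which matters for real forests, is independence: the averaging only washes errors away to the extent that the trees make different mistakes, which is why each tree is grown from a different random slice of the data.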

It’s a bit like asking the audience in Who Wants To Be A Millionaire? A room full of strangers will be right more often than the cleverest person you know. (The ‘ask the audience’ lifeline had a 91 per cent success rate, compared to just 65 per cent for ‘phone a friend’.)21 The errors made by many can cancel each other out and result in a crowd that’s wiser than the individual.

The same applies to the big group of decision trees which, taken together, make up a random forest (pun intended). Because the algorithm’s predictions are based on the patterns it learns from the data, a random forest is described as a machine-learning algorithm, which comes under the broader umbrella of artificial intelligence. (The term ‘machine learning’ first came up in the ‘Power’ chapter, and we’ll meet many more algorithms under this particular canopy later, but for now it’s worth noting how grand that description makes it sound, when the algorithm is essentially the flowcharts you used to draw at school, wrapped up in a bit of mathematical manipulation.) Random forests have proved themselves to be incredibly useful in a whole host of real-world applications. They’re used by Netflix to help predict what you’d like to watch based on past preferences;22 by Airbnb to detect fraudulent accounts;23 and in healthcare for disease diagnosis (more on that in the following chapter).

When used to assess offenders, they can claim two huge advantages over their human counterparts. First, the algorithm will always give exactly the same answer when presented with the same set of circumstances. Consistency comes guaranteed, but not at the price of individualized justice. And there is another key advantage: the algorithm also makes much better predictions.

Human vs machine

In 2017, a group of researchers set out to discover just how well a machine’s predictions stacked up against the decisions of a bunch of human judges.24

To help them in their mission, the team had access to the records of every person arrested in New York City over a five-year period between 2008 and 2013. During that time, three-quarters of a million people were subject to a bail hearing, which meant easily enough data to test an algorithm on a head-to-head basis with a human judge.

An algorithm hadn’t been used by the New York judicial system during these cases, but looking retrospectively, the researchers got to work building lots of decision trees to see how well one could have predicted the defendants’ risk of breaking bail conditions at the time. In went the data on an offender: their rap sheet, the crime they’d just committed and so on. Out came a probability of whether or not that defendant would go on to violate the terms of their bail.

In the real data, 408,283 defendants were released before they faced trial. Any one of those was free to flee or commit other crimes, which means we can use the benefit of hindsight to test how accurate the algorithm’s predictions and the humans’ decisions were. We know exactly who failed to appear in court later (15.2 per cent) and who was re-arrested for another crime while on bail (25.8 per cent).

Unfortunately for the science, any defendant deemed high risk by a judge would have been denied bail at the time – and hence, on those cases, there was no opportunity to prove the judges’ assessment right or wrong. That makes things a little complicated. It means there’s no way to come up with a cold, hard number that captures how accurate the judges were overall. And without a ‘ground truth’ for how those defendants would have behaved, you can’t state an overall accuracy for the algorithm either. Instead, you have to make an educated guess on what the jailed defendants would have done if released,25 and make your comparisons of human versus machine in a bit more of a roundabout way.

One thing is for sure, though: the judges and the machine didn’t agree on their predictions. The researchers showed that many of the defendants flagged by the algorithm as real bad guys were treated by the judges as though they were low risk. In fact, almost half of the defendants the algorithm flagged as the riskiest group were given bail by the judges.

But who was right? The data showed that the group the algorithm was worried about did indeed pose a risk. Just over 56 per cent of them failed to show up for their court appearances, and 62.7 per cent went on to commit new crimes while out on bail – including the worst crimes of all: rape and murder. The algorithm had seen it all coming.

The researchers argued that, whichever way you use it, their algorithm vastly outperforms the human judge. And the numbers back them up. If you’re wanting to incarcerate fewer people awaiting trial, the algorithm could help by consigning 41.8 per cent fewer defendants to jail while keeping the crime rate the same. Or, if you were happy with the current proportion of defendants given bail, then that’s fine too: just by being more accurate at selecting which defendants to release, the algorithm could reduce the rate of skipping bail by 24.7 per cent.

These benefits aren’t just theoretical. Rhode Island, where the courts have been using these kinds of algorithms for the last eight years, has achieved a 17 per cent reduction in prison populations and a 6 per cent drop in recidivism rates. That’s hundreds of low-risk offenders who aren’t unnecessarily stuck in prison, hundreds of crimes that haven’t been committed. Plus, given that it costs over £30,000 a year to incarcerate one prisoner in the UK26 – while in the United States spending a year in a high-security prison can cost about the same as going to Harvard27 – that’s hundreds of thousands of pounds of taxpayers’ money saved. It’s a win–win for everyone.

Or is it?

Finding Darth Vader

Of course, no algorithm can perfectly predict what a person is going to do in the future. Individual humans are too messy, irrational and impulsive for a forecast ever to be certain of what’s going to happen next. They might give better predictions, but they will still make mistakes. The question is, what happens to all the people whose risk scores are wrong?

There are two kinds of mistake that the algorithm can make. Richard Berk, a professor of criminology and statistics at the University of Pennsylvania and a pioneer in the field of predicting recidivism, has a noteworthy way of describing them.

‘There are good guys and bad guys,’ he told me. ‘Your algorithm is effectively asking: “Who are the Darth Vaders? And who are the Luke Skywalkers?”’

Letting a Darth Vader go free is one kind of error, known as a false negative. It happens whenever you fail to identify the risk that an individual poses.

Incarcerating Luke Skywalker, on the other hand, would be a false positive. This is when the algorithm incorrectly identifies someone as a high-risk individual.

These two kinds of error, false positive and false negative, are not unique to recidivism. They’ll crop up repeatedly throughout this book. Any algorithm that aims to classify can be guilty of these mistakes.
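Counting the two kinds of error from a set of outcomes is straightforward, and the sketch below does exactly that. ‘Positive’ here means the algorithm flagged the person as high risk; the data is an invented list of (prediction, outcome) pairs.

```python
# The two error types for any classifier, counted from (prediction, outcome)
# pairs. A false positive is a flagged person who posed no danger (a jailed
# Luke Skywalker); a false negative is an unflagged person who did (a freed
# Darth Vader).

def error_counts(cases):
    """cases: list of (predicted_high_risk, actually_reoffended) booleans."""
    false_positives = sum(1 for pred, actual in cases if pred and not actual)
    false_negatives = sum(1 for pred, actual in cases if not pred and actual)
    return false_positives, false_negatives

cases = [(True, True), (True, False), (False, True), (False, False)]
print(error_counts(cases))  # (1, 1)
```

Any tweak that pushes one of these numbers down tends to push the other up, which is why the trade-off between them, not either error rate alone, is what matters.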

Berk’s algorithms claim to be able to predict whether someone will go on to commit a homicide with 75 per cent accuracy, which makes them some of the most accurate around.28 When you consider how free we believe our will to be, that is a remarkably impressive level of accuracy. But even at 75 per cent, that’s a lot of Luke Skywalkers who will be denied bail because they look like Darth Vaders from the outside.

The consequences of mislabelling a defendant become all the more serious when the algorithms are used in sentencing, rather than just decisions on bail or parole. This is a modern reality: recently, some US states have begun to allow judges to see a convicted offender’s calculated risk score while deciding on their jail term. It’s a development that has sparked a heated debate, and not without cause: it’s one thing calculating whether to let someone out early, quite another to calculate how long they should be locked away in the first place.

Part of the problem is that deciding the length of a sentence involves consideration of a lot more than just the risk of a criminal re-offending – which is all the algorithm can help with. A judge also has to take into account the risk the offender poses to others, the deterrent effect the sentencing decision will have on other criminals, the question of retribution for the victim and the chance of rehabilitation for the defendant. It’s a lot to balance, so it’s little wonder that people raise objections to the algorithm being given too much weight in the decision. Little wonder that people find stories like that of Paul Zilly so deeply troubling.29

Zilly was convicted of stealing a lawnmower. He stood in front of Judge Babler in Barron County, Wisconsin, in February 2013, knowing that his defence team had already agreed to a plea deal with the prosecution. Both sides had agreed that, in his case, a long jail term wasn’t the best course of action. He arrived expecting the judge to simply rubber-stamp the agreement.

Unfortunately for Zilly, Wisconsin judges were using a proprietary risk-assessment algorithm called COMPAS. As with the Idaho budget tool in the ‘Power’ chapter, the inner workings of COMPAS are considered a trade secret. Unlike the budget tool, however, the COMPAS code still isn’t available to the public. What we do know is that the calculations are based on the answers a defendant gives to a questionnaire. This includes questions such as: ‘A hungry person has a right to steal, agree or disagree?’ and: ‘If you lived with both your parents and they separated, how old were you at the time?’30 The algorithm was designed with the sole aim of predicting how likely a defendant would be to re-offend within two years, and in this task had achieved an accuracy rate of around 70 per cent.31 That is, it would be wrong for roughly one in every three defendants. None the less, it was being used by judges during their sentencing decisions.

Zilly’s score wasn’t good. The algorithm had rated him as a high risk for future violent crime and a medium risk for general recidivism. ‘When I look at the risk assessment,’ Judge Babler said in court, ‘it is about as bad as it could be.’

After seeing Zilly’s score, the judge put more faith in the algorithm than in the agreement reached by the defence and the prosecution, rejected the plea bargain and doubled Zilly’s sentence from one year in county jail to two years in a state prison.

It’s impossible to know for sure whether Zilly deserved his high-risk score, although a 70 per cent accuracy rate seems a remarkably low threshold to justify using the algorithm to over-rule other factors.

Zilly’s case was widely publicized, but it’s not the only example. In 2003, Christopher Drew Brooks, a 19-year-old man, had consensual sex with a 14-year-old girl and was convicted of statutory rape by a court in Virginia. Initially, the sentencing guidelines suggested a jail term of 7 to 16 months. But, after the recommendation was adjusted to include his risk score (not built by COMPAS in this case), the upper limit was increased to 24 months. Taking this into account, the judge sentenced him to 18 months in prison.32

Here’s the problem. This particular algorithm used age as a factor in calculating its recidivism score. Being convicted of a sex offence at such a young age counted against Brooks, even though it meant he was closer in age to the victim. In fact, had Brooks been 36 years old (and hence 22 years older than the girl) the algorithm would have recommended that he not be sent to prison at all.33

These are not the first examples of people trusting the output of a computer over their own judgement, and they won’t be the last. The question is, what can you do about it? The Supreme Court of Wisconsin has its own suggestion. Speaking specifically about the danger of judges relying too heavily on the COMPAS algorithm, it stated: ‘We expect that circuit courts will exercise discretion when assessing a COMPAS risk score with respect to each individual defendant.’34 But Richard Berk suggests that might be optimistic: ‘The courts are concerned about not making mistakes – especially the judges who are appointed by the public. The algorithm provides them a way to do less work while not being accountable.’35

There’s another issue here. If an algorithm classifies someone as high risk and the judge denies them their freedom as a result, there is no way of knowing for sure whether the algorithm was seeing their future accurately. Take Zilly. Maybe he would have gone on to be violent. Maybe he wouldn’t have. Maybe, being labelled as a high-risk convict and being sent to a state prison set him on a different path from the one he was on with the agreed plea deal. With no way to verify the algorithm’s predictions, we have no way of knowing whether the judge was right to believe the risk score, no way of verifying whether Zilly was in fact a Vader or a Skywalker.

This is a problem without an easy solution. How do you persuade people to apply a healthy dose of common sense when it comes to using these algorithms? But even if you could, there’s another problem with predicting recidivism. Arguably the most contentious of all.

Machine bias

In 2016, the independent online newsroom ProPublica, which first reported Zilly’s story, looked in detail at the COMPAS algorithm and reverse-engineered the predictions it had made on the futures of over seven thousand real offenders in Florida,36 in cases dating from 2013 or 2014. The researchers wanted to check the accuracy of COMPAS scores by seeing who had, in fact, gone on to re-offend. But they also wanted to see if there was any difference between the predicted risks for black and white defendants.

While the algorithm doesn’t explicitly include race as a factor, the journalists found that not everyone was being treated equally within the calculations. Although the chances of the algorithm making an error were roughly the same for black or white offenders overall, it was making different kinds of mistakes for each racial group.

If you were one of the defendants who didn’t get into trouble again after their initial arrest, a Luke Skywalker, the algorithm was twice as likely to mistakenly label you as high risk if you were black as it was if you were white. The algorithm’s false positives were disproportionately black. Conversely, of all the defendants who did go on to commit another crime within the two years, the Darth Vaders, the white convicts were twice as likely to be mistakenly predicted as low risk by the algorithm as their black counterparts. The algorithm’s false negatives were disproportionately white.

Unsurprisingly, the ProPublica analysis sparked outrage across the US and further afield. Hundreds of articles were written expressing sharp disapproval of the use of faceless calculations in human justice, rebuking the use of imperfect, biased algorithms in decisions that can have such a dramatic impact on someone’s future. Many of the criticisms are difficult to disagree with – everyone deserves fair and equal treatment, regardless of who assesses their case, and the ProPublica study doesn’t make things look good for the algorithm.

But let’s be wary, for a moment, of our tendency to throw ‘imperfect algorithms’ away. Before we dismiss the use of algorithms in the justice system altogether, it’s worth asking: what would you expect an unbiased algorithm to look like?

You’d certainly want it to make equally accurate predictions for black and white defendants. It also seems sensible to demand that what counts as ‘high risk’ should be the same for everyone. The algorithm should be equally good at picking out the defendants who are likely to re-offend, whatever racial (or other) group they belong to. Plus, as ProPublica pointed out, the algorithm should make the same kinds of mistakes at the same rate for everyone – regardless of race.

None of these four statements seems like a particularly grand ambition. But there is, nevertheless, a problem. Unfortunately, some kinds of fairness are mathematically incompatible with others.

Let me explain. Imagine stopping people in the street and using an algorithm to predict whether each person will go on to commit a homicide. Now, since the vast majority of murders are committed by men (in fact, worldwide, 96 per cent of murderers are male)37, if the murderer-finding algorithm is to make accurate predictions, it will necessarily identify more men than women as high risk.

Let’s assume our murderer-detection algorithm has a prediction rate of 75 per cent. That is to say, three-quarters of the people the algorithm labels as high risk are indeed Darth Vaders.

Eventually, after stopping enough strangers, you’ll have 100 people flagged by the algorithm as potential murderers. To match the perpetrator statistics, 96 of those 100 will necessarily be male. Four will be female. There’s a picture below to illustrate. The men are represented by dark circles, the women shown as light grey circles.

Now, since the algorithm predicts correctly for both men and women at the same rate of 75 per cent, one-quarter of the females, and one-quarter of the males, will really be Luke Skywalkers: people who are incorrectly identified as high risk, when they don’t actually pose a danger.

[Figure: the 100 people flagged as high risk – 96 dark circles (men) and 4 light grey circles (women)]
[Figure: the same 100 people, with the quarter of each group who are falsely accused picked out]

Once you run the numbers, as you can see from the second image here, more innocent men than innocent women will be incorrectly accused, just by virtue of the fact that men commit more murders than women.

This has nothing to do with the crime itself, or with the algorithm: it’s just a mathematical certainty. The outcome is biased because reality is biased. More men commit homicides, so more men will be falsely accused of having the potential to murder.fn2
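For readers who like to see the arithmetic laid out, here is a minimal sketch of the calculation, using the illustrative numbers from the example above (96 men and 4 women flagged, and a 75 per cent prediction rate applied equally to both groups):

```python
# 100 people flagged as high risk: 96 men and 4 women, matching the
# statistic that 96 per cent of murderers are male.
flagged_men, flagged_women = 96, 4

# The algorithm is right about 75 per cent of the people it flags,
# and equally so for both sexes.
precision = 0.75

# One-quarter of each group are 'Luke Skywalkers' - flagged as
# dangerous despite being innocent.
innocent_men = flagged_men * (1 - precision)      # 24 innocent men
innocent_women = flagged_women * (1 - precision)  # 1 innocent woman

print(innocent_men, innocent_women)  # 24.0 1.0
```

Swap in any other base rate and the same pattern appears: whichever group is flagged more often collects more false accusations, even though the error *rate* is identical for both groups.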

Unless the fraction of people who commit crimes is the same in every group of defendants, it is mathematically impossible to create a test which is equally accurate at prediction across the board and makes false positive and false negative mistakes at the same rate for every group of defendants.

Of course, African Americans have been subject to centuries of terrible prejudice and inequality. Because of this, they continue to appear disproportionately at lower levels on the socio-economic scale and at higher levels in the crime statistics. There is also some evidence that – at least for some crimes in the US – the black population is disproportionately targeted by police. Marijuana use, for instance, occurs among black and white Americans at the same rate, and yet arrest rates for African Americans can be up to eight times higher.38 Whatever the reason for the disparity, the sad result is that rates of arrest are not the same across racial groups in the United States. Blacks are re-arrested more often than whites. The algorithm is judging them not on the colour of their skin, but on the all-too-predictable consequences of America’s historically deeply unbalanced society. Until all groups are arrested at the same rate, this kind of bias is a mathematical certainty.

That is not to completely dismiss the ProPublica piece. Its analysis highlights how easily algorithms can perpetuate the inequalities of the past. Nor is it to excuse the COMPAS algorithm. Any company that profits from analysing people’s data has a moral responsibility (if not yet a legal one) to come clean about its flaws and pitfalls. Instead, Equivant (formerly Northpointe), the company that makes COMPAS, continues to keep the insides of its algorithm a closely guarded secret, to protect the firm’s intellectual property.39

There are options here. There’s nothing inherent in these algorithms that means they have to repeat the biases of the past. It all comes down to the data you give them. We can choose to be ‘crass empiricists’ (as Richard Berk puts it) and follow the numbers that are already there, or we can decide that the status quo is unfair and tweak the numbers accordingly.

To give you an analogy, try doing a Google Images search for ‘maths professor’. Perhaps unsurprisingly, you’ll find the vast majority of images show white middle-aged men standing in front of chalk-covered blackboards. My search returned only one picture of a female in the top twenty images, which is a depressingly accurate reflection of reality: around 94 per cent of maths professors are male.40 But however accurate the results might be, you could argue that using algorithms as a mirror to reflect the real world isn’t always helpful, especially when the mirror is reflecting a present reality that only exists because of centuries of bias. Now, if it so chose, Google could subtly tweak its algorithm to prioritize images of female or non-white professors over others, to even out the balance a little and reflect the society we’re aiming for, rather than the one we live in.

It’s the same in the justice system. Effectively, using an algorithm lets us ask: what percentage of a particular group do we expect to be high risk in a perfectly fair society? The algorithm gives us the option to jump straight to that figure. Or, if we decide that removing all the bias from the judicial system at once isn’t appropriate, we could instead ask the algorithm to move incrementally towards that goal over time.

There are also options in how you treat defendants with a high-risk score. In bail, where the risk of a defendant failing to appear at a future court date is a key component of the algorithm’s prediction, the standard approach is to deny bail to anyone with a high-risk score. But the algorithm could also present an opportunity to find out why someone might miss a court date. Do they have access to suitable transport to get there? Are there issues with childcare that might prevent them attending? Are there societal imbalances that the algorithm could be programmed to alleviate rather than exacerbate?

The answers to these questions should come from the forums of open public debate and the halls of government rather than the boardrooms of private companies. Thankfully, the calls are getting louder for an algorithmic regulating body to control the industry. Just as the US Food and Drug Administration does for pharmaceuticals, it would test accuracy, consistency and bias behind closed doors and have the authority to approve or deny the use of a product on real people. Until then, though, it’s incredibly important that organizations like ProPublica continue to hold algorithms to account. Just as long as accusations of bias don’t end in calls for these algorithms to be banned altogether. At least, not without thinking carefully about what we’d be left with if they were.

Difficult decisions

This is a vitally important point to pick up on. If we threw the algorithms away, what kind of a justice system would remain? Because inconsistency isn’t the only flaw from which judges have been shown to suffer.

Legally, race, gender and class should not influence a judge’s decision. (Justice is supposed to be blind, after all.) And yet, while the vast majority of judges want to be as unbiased as possible, the evidence has repeatedly shown that they do indeed discriminate. Studies within the US have shown that black defendants, on average, will go to prison for longer,41 are less likely to be awarded bail,42 are more likely to be given the death penalty,43 and once on death row are more likely to be executed.44 Other studies have shown that men are treated more severely than women for the same crime,45 and that defendants with low levels of income and education are given substantially longer sentences.46

Just as with the algorithm, it’s not necessarily explicit prejudices that are causing these biased outcomes, so much as history repeating itself. Societal and cultural biases can simply arise as an automatic consequence of the way humans make decisions.

To explain why, we first need to understand a few simple things about human intuition, so let’s leave the courtroom for a moment while you consider this question:

A bat and a ball cost £1.10 in total.

The bat costs £1 more than the ball.

How much does the ball cost?

This puzzle, which was posed by the Nobel Prize-winning economist and psychologist Daniel Kahneman in his bestselling book Thinking, Fast and Slow,47 illustrates an important trap we all fall into when thinking.

This question has a correct answer that is easy to see on reflection, but also an incorrect answer that immediately springs to mind. The answer you first thought of, of course, was 10p.fn3

Don’t feel bad if you didn’t get the correct answer (5p). Neither did 71.8 per cent of judges when asked.48 Even those who do eventually settle on the correct answer have to resist the urge to go with their first intuition.
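For anyone who wants to see why 5p is right, the algebra is simple, and can even be checked in a couple of lines (working in pence to sidestep any rounding trouble):

```python
# bat + ball = 110p, and bat = ball + 100p.
# Substituting: (ball + 100) + ball = 110, so 2 * ball = 10, ball = 5.
total_p, diff_p = 110, 100   # prices in pence

ball_p = (total_p - diff_p) // 2   # 5p
bat_p = ball_p + diff_p            # 105p

assert ball_p + bat_p == total_p   # the two prices really sum to £1.10
print(ball_p, bat_p)               # 5 105

# The intuitive answer of 10p fails the other condition: a 10p ball and
# a £1 bat do sum to £1.10, but then the bat costs only 90p more than
# the ball, not £1 more.
```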

It’s the competition between intuition and considered thought that is key to our story of judges’ decisions. Psychologists generally agree that we have two ways of thinking. System 1 is automatic, instinctive, but prone to mistakes. (This is the system responsible for the answer of 10p jumping to mind in the puzzle above.) System 2 is slow, analytic, considered, but often quite lazy.49

If you ask a person how they came to a decision, it is System 2 that will articulate an answer – but, in the words of Daniel Kahneman, ‘it often endorses or rationalizes ideas and feelings that were generated by System 1’.50

In this, judges are no different from the rest of us. They are human, after all, and prone to the same whims and weaknesses as we all are. The fact is, our minds just aren’t built for robust, rational assessment of big, complicated problems. We can’t easily weigh up the various factors of a case and combine everything together in a logical manner while blocking the intuitive System 1 from kicking in and taking a few cognitive short cuts.

When it comes to bail, for instance, you might hope that judges were able to look at the whole case together, carefully balancing all the pros and cons before coming to a decision. But unfortunately, the evidence says otherwise. Instead, psychologists have shown that judges are doing nothing more strategic than going through an ordered checklist of warning flags in their heads. If any of those flags – past convictions, community ties, prosecution’s request – are raised by the defendant’s story, the judge will stop and deny bail.51

The problem is that so many of those flags are correlated with race, gender and education level. Judges can’t help relying on intuition more than they should; and in doing so, they are unwittingly perpetuating biases in the system.

And that’s not all. Sadly, we’ve barely scratched the surface of how terrible people are at being fair and unbiased judges.

If you’ve ever convinced yourself that an extremely expensive item of clothing was good value just because it was 50 per cent off (as I regularly do), then you’ll know all about the so-called anchoring effect. We find it difficult to put numerical values on things, and are much more comfortable making comparisons between values than just coming up with a single value out of the blue. Marketers have been using the anchoring effect for years, not only to influence how highly we value certain items, but also to control the quantities of items we buy. Like those signs in supermarkets that say ‘Limit of 12 cans of soup per customer’. They aren’t designed to ward off soup fiends from buying up all the stock, as you might think. They exist to subtly manipulate your perception of how many cans of soup you need. The brain anchors with the number 12 and adjusts downwards. One study back in the 1990s showed that precisely such a sign could increase the average sale per customer from 3.3 tins of soup to 7.52

By now, you won’t be surprised to learn that judges are also susceptible to the anchoring effect. They’re more likely to award higher damages if the prosecution demands a high amount,53 and hand down a longer sentence if the prosecutor requests a harsher punishment.54 One study even showed that you could significantly influence the length of a sentence in a hypothetical case by having a journalist call the judge during a recess and subtly drop a suggested sentence into the conversation. (‘Do you think the sentence in this case should be higher or lower than three years?’)55 Perhaps worst of all, it looks like you can tweak a judge’s decision just by having them throw a dice before reviewing a case.56 Even the most experienced judges were susceptible to this kind of manipulation.57

And there’s another flaw in the way humans make comparisons between numbers that has an impact on the fairness of sentencing. You may have noticed this particular quirk of the mind yourself. The effect of increasing the volume on your stereo by one level diminishes the louder it plays; a price hike from £1.20 to £2.20 feels enormous, but an increase from £67 to £68 doesn’t seem to matter; and time seems to speed up as you get older. It happens because humans’ senses work in relative terms rather than in absolute values. We don’t perceive each year as a fixed period of time; we experience each new year as a smaller and smaller fraction of the life we’ve lived. The size of the chunks of time or money or volume we perceive follows a very simple mathematical expression known as Weber’s Law.

Put simply, Weber’s Law states that the smallest change in a stimulus that can be perceived, the so-called ‘Just Noticeable Difference’, is proportional to the initial stimulus. Unsurprisingly, this discovery has also been exploited by marketers. They know exactly how much they can get away with shrinking a chocolate bar before customers notice, or precisely how much they can nudge up the price of an item before you’ll think it’s worth shopping around.

The problem in the context of justice is that Weber’s Law influences the sentence lengths that judges choose. Gaps between sentences get bigger as the penalties get more severe. If a crime is marginally worse than something deserving a 20-year sentence, an additional 3 months, say, doesn’t seem enough: it doesn’t feel as if there’s enough of a difference between a stretch of 20 years and one of 20 years and 3 months. But of course there is: 3 months in prison is still 3 months in prison, regardless of what came before. And yet, instead of adding a few months on, judges will jump to the next noticeably different sentence length, which in this case is 25 years.58
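A one-line formula captures this. If the ‘Just Noticeable Difference’ is proportional to the current sentence, then the next sentence that feels meaningfully longer is simply the current one multiplied by a constant factor. The sketch below assumes a Weber fraction of 0.25 – a number chosen purely because it reproduces the jump from 20 to 25 years described above, not a figure taken from the sentencing research:

```python
def next_noticeable(years, k=0.25):
    """Smallest sentence that feels noticeably longer, per Weber's Law.

    k is an illustrative Weber fraction (an assumption for this sketch,
    not a value from the study).
    """
    return years * (1 + k)

print(next_noticeable(20))  # 25.0 - the five-year jump described above
print(next_noticeable(1))   # 1.25 - near the bottom, the steps are tiny
```

The same multiplicative rule is why the gaps between sentences stretch out as the numbers grow: steps of months at the bottom of the scale, steps of years at the top.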

We know this is happening because we can compare the sentence lengths actually handed out to those Weber’s Law would predict. One study from 2017 looked at over a hundred thousand sentences in both Britain and Australia and found that up to 99 per cent of defendants deemed guilty were given a sentence that fits the formula.59

‘It doesn’t matter what type of offence you’d committed,’ Mandeep Dhami, the lead author of the study, told me, ‘or what type of defendant you were, or which country you were being sentenced in, or whether you had been given a custodial or community sentence.’ All that mattered was the number that popped into the judge’s mind and felt about right.

Sadly, when it comes to biased judges, I could go on. Judges with daughters are more likely to make decisions more favourable to women.60 Judges are less likely to award bail if the local sports team has lost recently. And one famous study even suggested that the time of day affects your chances of a favourable outcome.61 Although the research has yet to be replicated,62 and there is some debate over the size of the effect, there may be some evidence that taking the stand just before lunch puts you at a disadvantage: judges in the original study were most likely to award bail if they had just come back from a recess, and least likely when they were approaching a food break.

Another study showed that an individual judge will avoid making too many similar judgments in a row. Your chances, therefore, of being awarded bail drop off a cliff if four successful cases were heard immediately before yours.63

Some researchers claim, too, that our perceptions of strangers change depending on the temperature of a drink we’re holding. If you’re handed a warm drink just before meeting a new person, they suggest, you’re more likely to see them as having a warmer, more generous, more caring personality.64

This long list is just the stuff we can measure. There are undoubtedly countless other factors subtly influencing our behaviour which don’t lend themselves to testing in a courtroom.

Summing up

I’ll level with you. When I first heard about algorithms being used in courtrooms I was against the idea. An algorithm will make mistakes, and when a mistake can mean someone loses their right to freedom, I didn’t think it was responsible to put that power in the hands of a machine.

I’m not alone in this. Many (perhaps most) people who find themselves on the wrong side of the criminal justice system feel the same. Mandeep Dhami told me how the offenders she’d worked with felt about how decisions on their future were made. ‘Even knowing that the human judge might make more errors, the offenders still prefer a human to an algorithm. They want that human touch.’

So, for that matter, do the lawyers. One London-based defence lawyer I spoke to told me that his role in the courtroom was to exploit the uncertainty in the system, something that the algorithm would make more difficult. ‘The more predictable the decisions get, the less room there is for the art of advocacy.’

However, when I asked Mandeep Dhami how she would feel herself if she were the one facing jail, her answer was quite the opposite:

‘I don’t want someone to use intuition when they’re making a decision about my future. I want someone to use a reasoned strategy. We want to keep judicial discretion, as though it is something so holy. As though it’s so good. Even though research shows that it’s not. It’s not great at all.’

Like the rest of us, I think that judges’ decisions should be as unbiased as possible. They should be guided by facts about the individual, not the group they happen to belong to. In that respect, the algorithm doesn’t measure up well. But it’s not enough to simply point at what’s wrong with the algorithm. The choice isn’t between a flawed algorithm and some imaginary perfect system. The only fair comparison to make is between the algorithm and what we’d be left with in its absence.

The more I’ve read, the more people I’ve spoken to, the more I’ve come to believe that we’re expecting a bit too much of our human judges. Injustice is built into our human systems. For every Christopher Drew Brooks, treated unfairly by an algorithm, there are countless cases like that of Nicholas Robinson, where a judge errs on their own. Having an algorithm – even an imperfect algorithm – working with judges to support their often faulty cognition is, I think, a step in the right direction. At least a well-designed and properly regulated algorithm can help get rid of systematic bias and random error. You can’t change a whole cohort of judges, especially if they’re not able to tell you how they make their decisions in the first place.

Designing an algorithm for use in the criminal justice system demands that we sit down and think hard about exactly what the justice system is for. Rather than letting us close our eyes and hope for the best, an algorithm requires a clear, unambiguous idea of exactly what we want it to achieve and a solid understanding of the human failings it’s replacing. It forces a difficult debate about precisely how a decision in a courtroom should be made. That’s not going to be simple, but it’s the key to establishing whether the algorithm can ever be good enough.

There are tensions within the justice system that muddy the waters and make these kinds of questions particularly difficult to answer. But there are other areas, slowly being penetrated by algorithms, where decisions are far less fraught with conflict, and the algorithm’s objectives and positive contribution to society are far more clear-cut.