In 2015, a group of pioneering scientists conducted an unusual study on the accuracy of cancer diagnoses.1 They gave 16 testers a touch-screen monitor and tasked them with sorting through images of breast tissue. The pathology samples had been taken from real women, from whom breast tissue had been removed by a biopsy, sliced thinly and stained with chemicals to make the blood vessels and milk ducts stand out in reds, purples and blues. All the tester had to do was decide whether the patterns in the image hinted at cancer lurking among the cells.
After a short period of training, the testers were set to work, with impressive results. Working independently, they correctly assessed 85 per cent of samples.
But then the researchers realized something remarkable. If they started pooling answers – combining votes from the individual testers to give an overall assessment of an image – the accuracy rate shot up to 99 per cent.
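That jump is less mysterious than it sounds. Here, purely as an illustration, is a back-of-the-envelope sketch in Python: it assumes each tester votes independently and is right 85 per cent of the time, and asks how often a simple majority of an odd-sized panel gets the right answer (the real study pooled the testers' responses in a slightly more sophisticated way, so the numbers are indicative only).

```python
from math import comb

def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of independent voters is right,
    when each voter is right with probability p_correct (binomial model)."""
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(n_voters // 2 + 1, n_voters + 1)
    )

# Illustrative numbers only: a panel of 15 independent testers, each 85% accurate.
print(f"{majority_vote_accuracy(15, 0.85):.3f}")  # roughly 0.999
```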
What was truly extraordinary about this study was not the skill of the testers. It was their identity. These plucky lifesavers were not oncologists. They were not pathologists. They were not nurses. They were not even medical students. They were pigeons.
Pathologists’ jobs are safe for a while yet – I don’t think even the scientists who designed the study were suggesting that doctors should be replaced by plain old pigeons. But the experiment did demonstrate an important point: spotting patterns hiding among clusters of cells is not a uniquely human skill. So, if a pigeon can manage it, why not an algorithm?
The entire history and practice of modern medicine is built on the finding of patterns in data. Ever since Hippocrates founded his school of medicine in ancient Greece some 2,500 years ago, observation, experimentation and the analysis of data have been fundamental to the fight to keep us healthy.
Before then, medicine had been – for the most part – barely distinguishable from magic. People believed that you fell ill if you’d displeased some god or other, and that disease was the result of an evil spirit possessing your body. As a result, the work of a physician would involve a lot of chanting and singing and superstition, which sounds like a lot of fun, but probably not for the person who was relying on it all to stop them from dying.
It wasn’t as if Hippocrates single-handedly cured the world of irrationality and superstition for ever (after all, one rumour said he had a hundred-foot dragon for a daughter),2 but he did have a truly revolutionary approach to medicine. He believed that the causes of disease were to be understood through rational investigation, not magic. By placing his emphasis on case reporting and observation, he established medicine as a science, justly earning himself a reputation as ‘the father of modern medicine’.3
While the scientific explanations that Hippocrates and his colleagues came up with don’t exactly stand up to modern scrutiny (they believed that health was a harmonious balance of blood, phlegm, yellow bile and black bile),4 the conclusions they drew from their data certainly do.5 (They were the first to give us insights such as: ‘Patients who are naturally fat are apt to die earlier than those who are slender.’) It’s a theme that is found throughout the ages. Our scientific understanding may have taken many wrong turns along the way, but progress is made through our ability to find patterns, classify symptoms and use these observations to predict what the future holds for a patient.
Medical history is packed with examples. Take fifteenth-century China, when healers first realized they could inoculate people against smallpox. After centuries of experimentation, they found a pattern that they could exploit to reduce the risk of death from this illness by a factor of ten. All they had to do was find an individual with a mild case of the disease, harvest their scabs, dry them, crush them and blow them into the nose of a healthy person.6 Or the medical golden age of the nineteenth century, as medicine adopted increasingly scientific methods, and looking for patterns in data became integral to the role of a physician. One of these physicians was the Hungarian Ignaz Semmelweis, who in the 1840s noticed something startling in the data on deaths on maternity wards. Women who gave birth in wards staffed by doctors were five times more likely to fall ill with sepsis than those in wards run by midwives. The data also pointed towards the reason why: doctors were dissecting dead bodies and then immediately attending to pregnant women without stopping to wash their hands.7
What was true of fifteenth-century China and nineteenth-century Europe is true today of doctors all over the world. Not just when studying diseases in the population, but in the day-to-day role of a primary care giver, too. Is this bone broken or not? Is this headache perfectly normal or a sign of something more sinister? Is it worth prescribing a course of antibiotics to make this boil go away? All are questions of pattern recognition, classification and prediction. Skills that algorithms happen to be very, very good at.
Of course, there are many aspects of being a doctor that an algorithm will probably never be able to replicate. Empathy, for one thing. Or the ability to support patients through social, psychological, even financial difficulties. But there are some areas of medicine where algorithms can offer a helping hand. Especially in the roles where medical pattern recognition is found in its purest form and classification and prediction are prized almost to the exclusion of all else. Especially in an area like pathology.
Pathologists are the doctors a patient rarely meets. Whenever you have a blood or tissue sample sent off to be tested, they’re the ones who, sitting in some distant laboratory, will examine your sample and write the report. Their role sits right at the end of the diagnostic line, where skill, accuracy and reliability are crucially important. They are often the ones who say whether you have cancer or not. So, if the biopsy they’re analysing is the only thing between you and chemotherapy, surgery or worse, you want to be sure they’re getting it right.
And their job isn’t easy. As part of their role, the average pathologist will examine hundreds of slides a day, each containing tens of thousands – sometimes hundreds of thousands – of cells suspended between the small glass plates. It’s the hardest game of Where’s Wally? imaginable. Their job is to meticulously scan each sample, looking for tiny anomalies that could be hiding anywhere in the vast galaxy of cells they see beneath the microscope’s lens.
‘It’s an impossibly hard task,’ says Andy Beck,8 a Harvard pathologist and founder of PathAI, a company set up in 2016 to build algorithms that classify biopsy slides. ‘If each pathologist were to look very carefully at five slides a day, you could imagine they might achieve perfection. But that’s not the real world.’
It certainly isn’t. And in the real world, their job is made all the harder by the frustrating complexities of biology. Let’s return to the example of breast cancer that the pigeons were so good at spotting. Deciding if someone has the disease isn’t a straight yes or no. Breast cancer diagnoses are spread over a spectrum. At one end are the benign samples where normal cells appear exactly as they should be. At the other end are the nastiest kind of tumours – invasive carcinomas, where the cancer cells have left the milk ducts and begun to grow into the surrounding tissue. Cases that are at these extremes are relatively easy to spot. One recent study showed that pathologists manage to correctly diagnose 96 per cent of straightforward malignant specimens, an accuracy roughly equivalent to what the flock of pigeons managed when given a similar task.9
But in between these extremes – between totally normal and obviously horribly malignant – there are several other, more ambiguous categories, as shown in the diagram below. Your sample could have a group of atypical cells that look a bit suspicious, but aren’t necessarily anything to worry about. You could have pre-cancerous growths that may or may not turn out to be serious. Or you could have cancer that has not yet spread outside the milk ducts (so-called ductal carcinoma in situ).
Which particular category your sample happens to be judged as falling into will probably have an enormous impact on your treatment. Depending on where your sample sits on the line, your doctor could suggest anything from a mastectomy to no intervention at all.
The problem is, distinguishing between these ambiguous categories can be extremely tricky. Even expert pathologists can disagree on the correct diagnosis of a single sample. To test how much the doctors’ opinions varied, one 2015 study took 72 biopsies of breast tissue, all of which were deemed to contain cells with benign abnormalities (a category towards the middle of the spectrum), and asked 115 pathologists for their opinion. Worryingly, the pathologists only came to the same diagnosis 48 per cent of the time.10
Once you’re down to 50–50, you might as well be flipping a coin for your diagnosis. Heads and you could end up having an unnecessary mastectomy (costing you hundreds of thousands of dollars if you live in the United States). Tails and you could miss a chance to address your cancer at its earliest stage. Either way, the impact can be devastating.
When the stakes are this high, accuracy is what matters most. So can an algorithm do better?
Until recently, creating an algorithm that could recognize anything at all in an image, let alone cancerous cells, was considered a notoriously tricky challenge. It didn’t matter that understanding pictures comes so easily to humans; explaining precisely how we do it proved an unimaginably difficult task.
To understand why, imagine writing instructions to tell a computer whether or not a photo has a dog in it. You could start off with the obvious stuff: if it has four legs, if it has floppy ears, if it has fur and so on. But what about those photos where the dog is sitting down? Or the ones in which you can’t see all the legs? What about dogs with pointy ears? Or cocked ears? Or the ones not facing the camera? And how does ‘fur’ look different from a fluffy carpet? Or the wool of a sheep? Or grass?
Sure, you could work all of these in as extra instructions, running through every single possible type of dog ear, or dog fur, or sitting position, but your algorithm will soon become so enormous it’ll be entirely unworkable, before you’ve even begun to distinguish dogs from other four-legged furry creatures. You need to find another way. The trick is to shift away from the rule-based paradigm and use something called a ‘neural network’.11
You can imagine a neural network as an enormous mathematical structure that features a great many knobs and dials. You feed your picture in at one end, it flows through the structure, and out at the other end comes a guess as to what that image contains. A probability for each category: Dog; Not dog.
At the beginning, your neural network is a complete pile of junk. It starts with no knowledge – no idea of what is or isn’t a dog. All the dials and knobs are set to random. As a result, the answers it provides are all over the place – it couldn’t accurately recognize an image if its power source depended on it. But with every picture you feed into it, you tweak those knobs and dials. Slowly, you train it.
In goes your image of a dog. After every guess the network makes, a set of mathematical rules works to adjust all the knobs until the prediction gets closer to the right answer. Then you feed in another image, and another, tweaking every time it gets something wrong; reinforcing the paths through the array of knobs that lead to success and fading those that lead to failure. Information about what makes one dog picture similar to another dog picture propagates backwards through the network. This continues until – after hundreds and thousands of photos have been fed through – it gets as few wrong as possible. Eventually you can show it an image that it has never seen before and it will be able to tell you with a high degree of accuracy whether or not a dog is pictured.
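To make the knobs-and-dials picture a little more concrete, here is a deliberately tiny sketch of that training loop. It assumes the simplest possible ‘network’ – a single artificial neuron – and made-up two-number ‘images’ with an invented rule to learn; real image classifiers use deep networks with millions of weights, but the nudge-every-dial-until-the-errors-shrink principle is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for 'dog vs not-dog': each 'image' is just two numbers,
# and the hidden rule the network has to discover is x0 + x1 > 1.
X = rng.random((200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

# The knobs and dials: one tiny neuron whose weights start out random.
w = rng.normal(size=2)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training: after each pass, nudge every dial a little in the direction
# that makes the guesses slightly less wrong (gradient descent).
for _ in range(2000):
    p = sigmoid(X @ w + b)           # current guesses, between 0 and 1
    grad_w = X.T @ (p - y) / len(y)  # how each weight contributed to the error
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print("training accuracy:", np.mean((sigmoid(X @ w + b) > 0.5) == y))
```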
The surprising thing about neural networks is that their operators usually don’t understand how or why the algorithm reaches its conclusions. A neural network classifying dog pictures doesn’t work by picking up on features that you or I might recognize as dog-like. It’s not looking for a measure of ‘chihuahua-ness’ or ‘Great Dane-ishness’ – it’s all a lot more abstract than that: picking up on patterns of edges and light and darkness in the photos that don’t make a lot of sense to a human observer (have a look at the image recognition example in the ‘Power’ chapter to see what I mean). Since the process is difficult for a human to conceptualize, it means the operators only know that they’ve tuned up their algorithm to get the answers right; they don’t necessarily know the precise details of how it gets there.
This is another ‘machine-learning algorithm’, like the random forests we met in the ‘Justice’ chapter. It goes beyond what the operators program it to do and learns for itself from the images it’s given. It’s this ability to learn that endows the algorithm with ‘artificial intelligence’. And the many layers of knobs and dials also give the network a deep structure, hence the term ‘deep learning’.
Neural networks have been around since the middle of the twentieth century, but until quite recently we’ve lacked the widespread access to really powerful computers necessary to get the best out of them. The world was finally forced to sit up and take them seriously in 2012 when computer scientist Geoffrey Hinton and two of his students entered a new kind of neural network into an image recognition competition.12 The challenge was to recognize – among other things – dogs. Their artificially intelligent algorithm blew the best of its competitors out of the water and kicked off a massive renaissance in deep learning.
An algorithm that works without our knowing how it makes its decisions might sound like witchcraft, but it might not be all that dissimilar from how we learn ourselves. Consider this comparison. One team recently trained an algorithm to distinguish between photos of wolves and pet huskies. They then showed how, thanks to the way it had tuned its own dials, the algorithm wasn’t using anything to do with the dogs as clues at all. It was basing its answer on whether the picture had snow in the background. Snow: wolf. No snow: husky.13
Shortly after their paper was published, I was chatting with Frank Kelly, a professor of mathematics at Cambridge University, who told me about a conversation he’d had with his grandson. He was walking the four-year-old to nursery when they passed a husky. His grandson remarked that the dog ‘looked like’ a wolf. When Frank asked how he knew that it wasn’t a wolf, he replied, ‘Because it’s on a lead.’
There are two things you want from a good breast cancer screening algorithm. You want it to be sensitive enough to pick up on the abnormalities present in all the breasts that have tumours, without skipping over the pixels in the image and announcing the sample as clear. But you also want it to be specific enough not to flag perfectly normal breast tissue as suspicious.
We’ve met the principles of sensitivity and specificity before, in the ‘Justice’ chapter. They are close cousins of false negatives and false positives (or Darth Vader and Luke Skywalker – which, if you ask me, is how they should be officially referred to in the scientific literature). In the context we’re talking about here, a false positive occurs whenever a healthy woman is told she has breast cancer, and a false negative when a woman with tumours is given the all-clear. A specific test will have hardly any false positives, while a sensitive one has few false negatives. It doesn’t matter what context your algorithm is working in – predicting recidivism, diagnosing breast cancer or (as we’ll see in the ‘Crime’ chapter) identifying patterns of criminal activity – the story is always the same. You want as few false positives and false negatives as possible.
The problem is that refining an algorithm often means making a choice between sensitivity and specificity. If you focus on improving one, it often means a loss in the other. If, for instance, you decided to prioritize the complete elimination of false negatives, your algorithm could flag every single breast it saw as suspicious. That would score 100 per cent sensitivity, which would certainly satisfy your objective. But it would also mean an awful lot of perfectly healthy people undergoing unnecessary treatment. Or say you decided to prioritize the complete elimination of false positives. Your algorithm would wave everyone through as healthy, thus earning a 100 per cent score on specificity. Wonderful! Unless you’re one of the women with tumours that the algorithm just disregarded.
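Both measures fall out of simple counts, for what it’s worth. A minimal sketch with invented numbers (the function and the figures below are mine, not taken from any of the studies mentioned in this chapter):

```python
def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int):
    """Sensitivity: the share of genuine tumours the screen catches.
    Specificity: the share of healthy samples it correctly clears."""
    return tp / (tp + fn), tn / (tn + fp)

# Invented screening run: 100 samples with tumours, 900 without.
print(sensitivity_specificity(tp=73, fp=20, tn=880, fn=27))  # (0.73, ~0.98)

# 'Flag everything as suspicious' scores perfect sensitivity, zero specificity.
print(sensitivity_specificity(tp=100, fp=900, tn=0, fn=0))   # (1.0, 0.0)
```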
Interestingly, human pathologists don’t tend to have problems with specificity. They almost never mistakenly identify cells as cancerous when they’re not. But people do struggle a little with sensitivity. It’s worryingly easy for us to miss tiny tumours – even obviously malignant ones.
These human weaknesses were highlighted in a recent challenge that was designed to pit human against algorithm. Computer teams from around the world went head-to-head with a pathologist to find all the tumours within four hundred slides, in a competition known as CAMELYON16. To make things easier, all the cases were at the two extremes: perfectly normal tissue or invasive breast cancer. There was no time constraint on the pathologist, either: they could take as long as they liked to wade through the biopsies. As expected, the pathologist generally got the overall diagnosis right (96 per cent accuracy)14 – and without identifying a single false positive in the process. But they also missed a lot of the tiny cancer cells hiding among the tissue, only managing to spot 73 per cent of them in 30 hours of looking.
The sheer number of pixels that needed checking wasn’t necessarily the problem. People can easily miss very obvious anomalies even when looking directly at them. In 2013, Harvard researchers secretly hid an image of a gorilla in a series of chest scans and asked 24 unsuspecting radiologists to check the images for signs of cancer. Eighty-three per cent of them failed to notice the gorilla, despite eye-tracking showing that the majority were literally looking right at it.15 Try it yourself with the picture above.16
Algorithms have the opposite problem. They will eagerly spot anomalous groups of cells, even perfectly healthy ones. During CAMELYON16, for instance, the best neural network entered managed to find an impressive 92.4 per cent of the tumours,17 but in doing so it made eight false-positive mistakes per slide by incorrectly flagging normal groups of cells as suspicious. With such a low specificity, the current state-of-the-art algorithms are definitely leaning towards the ‘everyone has breast cancer!’ approach to diagnosis and are just not good enough to create their own pathology reports yet.
The good news, though, is that we’re not asking them to. Instead, the intention is to combine the strengths of human and machine. The algorithm does the donkey-work of searching the enormous amount of information in the slides, highlighting a few key areas of interest. Then the pathologist takes over. It doesn’t matter if the machine is flagging cells that aren’t cancerous; the human expert can quickly check through and eliminate anything that’s normal. This kind of algorithmic pre-screening partnership not only saves a lot of time, it also bumps up the overall accuracy of diagnosis to a stunning 99.5 per cent.18
Marvellous as all this sounds, the fact is that human pathologists have always been good at diagnosing aggressive cancerous tumours. The difficult cases are those ambiguous ones in the middle, where the distinction between cancer and not cancer is more subtle. Can the algorithms help here too? The answer is (probably) yes. But not by trying to diagnose using the tricky categories that pathologists have always used. Instead, perhaps the algorithm – which is so much better at finding anomalies hidden in tiny fragments of data – could offer a better way to diagnose altogether. By doing something that human doctors can’t.
In 1986, an epidemiologist from the University of Kentucky named David Snowdon managed to persuade 678 nuns to give him their brains. The nuns, who were all members of the School Sisters of Notre Dame, agreed to participate in Snowdon’s extraordinary scientific investigation into the causes of Alzheimer’s disease.
Each of these women, who at the beginning of the study were aged between 75 and 103, would take a series of memory tests every year for the rest of their lives. Then, when they died, their brains would be donated to the project. Regardless of whether they had symptoms of dementia or not, they promised to allow Snowdon’s team to remove their most precious organ and analyse it for signs of the disease.19
The nuns’ generosity led to the creation of a remarkable dataset. Since none of them had children, smoked, or drank very much, the scientists were able to rule out many of the external factors believed to raise the likelihood of contracting Alzheimer’s. And since they all lived a similar lifestyle, with similar access to healthcare and social support, the nuns effectively provided their own experimental control.
All was going well when, a few years into the study, the team discovered this experimental group offered another treasure trove of data they could tap into. As young women, many of the now elderly nuns had been required to submit a handwritten autobiographical essay to the Sisterhood before they were allowed to take their vows. These essays were written when the women were – on average – only 22 years old, decades before any of them would display symptoms of dementia. And yet, astonishingly, the scientists discovered clues in their writing that predicted what would happen to them far in the future.
The researchers analysed the language in each of the essays for its complexity and found a connection between how articulate the nuns were as young women and their chances of developing dementia in old age.
For example, here is a one-sentence extract from a nun who maintained excellent cognitive ability throughout her life:
After I finished the eighth grade in 1921 I desired to become an aspirant at Mankato but I myself did not have the courage to ask the permission of my parents so Sister Agreda did it in my stead and they readily gave their consent.
Compare this to a sentence written by a nun whose memory scores steadily declined in her later years:
After I left school, I worked in the post-office.
The association was so strong that the researchers could predict which nuns might have dementia just by reading their essays. Ninety per cent of the nuns who went on to develop Alzheimer’s had ‘low linguistic ability’ as young women, while only 13 per cent of the nuns who maintained cognitive ability into old age got a ‘low idea density’ score in their writing.20
One of the things this study highlights is the incredible amount we still have to learn about our bodies. Even knowing that this connection might exist doesn’t tell us why. (Is it that good education staves off dementia? Or that people with the propensity to develop Alzheimer’s feel more comfortable with simple language?) But it might suggest that Alzheimer’s can take several decades to develop.
More importantly, for our purposes, it demonstrates that subtle signals about our future health can hide in the tiniest, most unexpected fragments of data – years before we ever show symptoms of an illness. It hints at just how powerful future medical algorithms that can dig into data might be. Perhaps, one day, they’ll even be able to spot the signs of cancer years before doctors can.
In the late 1970s, a group of pathologists in Ribe County, Denmark, started performing double mastectomies on a group of corpses. The deceased women ranged in age from 22 to 89 years old, and 6 of them, out of 83, had died of invasive breast cancer. Sure enough, when the researchers prepared the removed breasts for examination by a pathologist – cutting each into four pieces and then slicing the tissue thinly on to slides – these 6 samples revealed the hallmarks of the disease. But to the researchers’ astonishment, of the remaining 77 women – who had died from completely unrelated causes, including heart disease and car accidents – almost a quarter had the warning signs of breast cancer that pathologists look out for in living patients.
Without ever showing any signs of illness, 14 of the women had in situ cancer cells that had never developed beyond the milk ducts or glands. Cells that would be considered malignant breast cancer if the women were alive. Three had atypical cells that would also be flagged as a matter of concern in a biopsy, and one woman actually had invasive breast cancer without having any idea of it when she died.21
These numbers were surprising, but the study wasn’t a fluke. Other researchers have found similar results. In fact, some estimate that, at any one time, around 9 per cent of women could be unwittingly walking around with tumours in their breasts22 – about ten times the proportion who actually get diagnosed with breast cancer.23
So what’s going on? Do we have a silent epidemic on our hands? According to Dr Jonathan Kanevsky, a medical innovator and resident in surgery at McGill University in Montreal, the answer is no. At least, not really. Because the presence of cancer isn’t necessarily a problem:
If somebody has a cancer cell in their body, the chances are their immune system will identify it as a mutated cell and just attack it and kill it – that cancer will never grow into something scary. But sometimes the immune system messes up, meaning the body supports the growth of the cancer, allowing it to develop. At that point cancer can kill.24
Not all tumours are created equal. Some will be dealt with by your body, some will sit there quite happily until you die, some could develop into full-blown aggressive cancer. The trouble is, we often have no good way of knowing which will turn out to be which.
And that’s why these tricky categories between benign and horribly malignant can be such a problem. They are the only classifications that doctors have to work with, but if a doctor finds a group of cells in your biopsy that seem a bit suspicious, the label they choose can only help describe what lies within your tissue now. That isn’t necessarily much help when it comes to giving you clues to your future. And it’s the future, of course, that worried patients are most concerned about.
The result is that people are often overly cautious in the treatment they choose. Take in situ cancers, for instance. This category sits towards the more worrying end of the spectrum, where a cancerous growth is present, but hasn’t yet spread to the surrounding tissue. Serious as this sounds, only around one in ten ‘in situ’ cancers will turn into something that could kill you. None the less, a quarter of women who receive this diagnosis in the United States will undergo a full mastectomy – a major operation that’s physically and often emotionally life-changing.25
The fact is, the more aggressively you screen for breast cancer, the more you affect women who otherwise would’ve happily got on with their lives, oblivious to their harmless tumours. One independent UK panel concluded that for every 10,000 women who’ll be invited to a mammogram screening in the next 20 years, 43 deaths from breast cancer will be prevented. And a study published in the New England Journal of Medicine concluded that for every 100,000 who attend routine mammogram screenings, tumours that could become life-threatening will be detected in 30 women.26 But – depending which set of statistics you use – three or four times as many women will be over-diagnosed, receiving treatment for tumours that were never going to put their lives in danger.27
This problem of over-diagnosis and over-treatment is hard to solve when you’re good at detecting abnormalities but not good at predicting how they’ll develop. Still, there might be hope. Perhaps – just as with the essay-writing nuns – tiny clues to the way someone’s health will turn out years in the future can be found hiding in their past and present data. If so, winkling out this information would be a perfect job for a neural network.
In an arena where doctors have struggled for decades to discover why one abnormality is more dangerous than another, an algorithm that isn’t taught what to look for could come into its own. Just as long as you can put together a big enough set of biopsy slides (including some samples of tumours that eventually metastasized – spread to other parts of the body – as well as some that didn’t) to train your neural network, it could hunt for hidden clues to your future health blissfully free of any of the prejudice that can come with theory. As Jonathan Kanevsky puts it: ‘It would be up to the algorithm to determine the unique features within each image that correspond to whether the tumour becomes metastatic or not.’28
With an algorithm like that, the category your biopsy fitted into would become far less important. You wouldn’t need to bother with why or how, you could just jump straight to the information that matters: do you need treatment or not?
The good news is, work on such an algorithm has already begun. Andy Beck, the Harvard pathologist and CEO of PathAI we met earlier, recently let his algorithm loose on a series of samples from patients in the Netherlands and found that the best predictors of patient survival were to be found not in the cancer itself but in other abnormalities in the adjacent tissue.29 This is an important development – a concrete example of the algorithms themselves driving the research forwards, proving they can find patterns that improve our powers of prediction.
And, of course, there’s now an incredible wealth of data we can draw on. Thanks to routine mammogram screening around the world, we probably have more images of breast tissue than of any other organ in the body. I’m not a pathologist, but every expert I’ve spoken to has convinced me that being able to confidently predict whether a troublesome sample will become cancerous is well within our sights. There’s a very real chance that by the time this book is published in paperback, someone somewhere will have actually made this world-changing idea a reality.
These ideas apply well beyond breast cancer. The neural networks that Andy Beck and others are building don’t particularly care what they’re looking at. You could ask them to categorize anything: dogs, hats, cheeses. As long as you’re letting them know when they’re getting it right and wrong, they’ll learn. And now that these algorithms are good enough to be usable, they’re having an impact on all sorts of areas of modern medicine.
One big recent success, for instance, comes from the Google Brain team, who have built an algorithm that screens for the world’s biggest cause of preventable blindness – diabetic retinopathy. It’s a disease that affects the blood vessels in the light-sensitive areas of the eye. If you know you have it, you can be given injections to save your sight, but if it’s not caught early it can lead to irreversible blindness. In India, where access to experts capable of diagnosing the condition is limited, 45 per cent of people with diabetic retinopathy will lose some of their sight before they know they have the disease. The Google team’s algorithm, which was built in a collaboration with doctors from India, is now just as good at diagnosing the condition as a human ophthalmologist.
Similarly, there are algorithms that look for cardiovascular diseases in the heart,30 emphysema in the lungs,31 strokes in the brain32 and melanomas on the skin.33 There are even systems that detect polyps in real time during a colonoscopy.
The fact is, if you can take a picture of it and stick a label on it, you can create an algorithm to find it. And you’re likely to end up with a more accurate (and possibly earlier) diagnosis than any human doctor could manage.
But what about the messier forms of medical data? Can the success of these algorithms be taken further, to something beyond these kinds of highly specialized, narrowly focused tasks? Could a machine search for meaning in your doctor’s scribbled notes, for instance? Or pick up on tiny clues in how you describe the pain you’re experiencing?
How about the ultimate healthcare science-fiction fantasy, where a machine in your doctor’s surgery would carefully listen to your symptoms and analyse your medical history? Can we dare to imagine a machine that has a mastery of every scrap of cutting-edge medical research? One that offers an accurate diagnosis and perfectly tailored treatment plan?
In short, how about something a little like IBM’s Watson?
In 2004, Charles Lickel was tucking into a steak dinner with some colleagues in a New York restaurant. Partway through the meal, the dining area began to empty. Intrigued, Charles followed the crowd of diners and found them huddled around a television, eagerly watching the popular game show Jeopardy. The famous Jeopardy champion Ken Jennings had a chance to hold on to his record-breaking six-month winning streak and the diners didn’t want to miss it.34
Charles Lickel was the vice-president of software at IBM. For the past few years, ever since Deep Blue had beaten Garry Kasparov at chess, the IBM bosses had been nagging Charles to find a new challenge worthy of the company’s attention. As he stood in the New York restaurant, watching the diners captivated by this human Jeopardy champion, Lickel began to wonder if a machine could be designed to beat him.
It wouldn’t be easy. The machine known as ‘Watson’ that Charles imagined in that restaurant took seven long years to build. But eventually Watson would challenge Ken Jennings on a special episode of Jeopardy and convincingly defeat him at the game he’d made his own. In the process, IBM would set themselves on the path towards building the world’s first all-singing, all-dancing diagnosis machine. We’ll come back to that in a moment. But first, let me walk you through some of the key ideas behind the Jeopardy-winning machine that formed the basis of the medical diagnosis algorithms.
For those who’ve never heard of it, Jeopardy is a well-known American game show – a kind of reverse general knowledge quiz: the contestants are given clues in the form of answers and have to phrase their responses in the form of questions. For instance, in the category ‘self-contradictory words’ a clue might be:
A fastener to secure something; or it could be to bend, warp & give way suddenly with heat or pressure.
The algorithmic player would have to learn to work through a few layers to get to the correct response: ‘What does “buckle” mean?’ First, Watson would need to understand language well enough to derive meaning from the question and realize that ‘fastener’, ‘secure’, ‘bend’, ‘warp’ and ‘give way suddenly’ are all separate elements of the clue. This, in itself, is an enormous challenge for an algorithm.
But that was only the first step. Next, Watson would need to hunt for potential candidates that fitted each of the clues. ‘Fastener’ might conjure up all manner of potential answers: ‘clasp’, ‘button’, ‘pin’ and ‘tie’, as well as ‘buckle’, for instance. Watson needs to consider each possibility in turn and measure how it fits with the other clues. So while you’re unlikely to find evidence of ‘pin’ being associated with the clues ‘bend’ and ‘warp’, the word ‘buckle’ certainly is, which increases Watson’s confidence in it as a possible answer. Eventually, once all of the evidence has been combined, Watson has to put its imaginary money where its metaphorical mouth is by choosing a single response.
Now, the challenge of playing Jeopardy is rather more trivial than that of diagnosing disease, but it does require some of the same logical mechanisms. Imagine you go to the doctor complaining of unintentional weight loss and stomach aches, plus a bit of heartburn for good measure. By analogy with playing Jeopardy, the challenge is to find potential diagnoses (responses) that might explain the symptoms (clues), look for further evidence on each and update the confidence in a particular answer as more information becomes available. Doctors call this differential diagnosis. Mathematicians call it Bayesian inference.fn1
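As a rough illustration of that bookkeeping, here is a minimal sketch of Bayesian updating in Python. Every condition, prior and likelihood below is invented for the example, and it makes the simplifying ‘naive’ assumption that symptoms are independent clues; real differential diagnosis weighs vastly more evidence than this. The recipe, though, is the one just described: start with a prior belief in each candidate diagnosis, scale each by how well it explains the new symptom, then renormalize.

```python
# All numbers are invented for illustration only.
priors = {"peptic ulcer": 0.05, "acid reflux": 0.60,
          "stomach cancer": 0.01, "something else": 0.34}

# P(symptom | condition) - again, invented figures.
likelihoods = {
    "weight loss": {"peptic ulcer": 0.3, "acid reflux": 0.05,
                    "stomach cancer": 0.8, "something else": 0.1},
    "heartburn":   {"peptic ulcer": 0.5, "acid reflux": 0.9,
                    "stomach cancer": 0.4, "something else": 0.2},
}

def update(beliefs, symptom):
    """One round of Bayes' rule: scale each hypothesis by how well it explains
    the new symptom, then renormalize so the beliefs sum to one."""
    scaled = {d: p * likelihoods[symptom][d] for d, p in beliefs.items()}
    total = sum(scaled.values())
    return {d: v / total for d, v in scaled.items()}

beliefs = dict(priors)
for symptom in ["weight loss", "heartburn"]:
    beliefs = update(beliefs, symptom)

for diagnosis, prob in sorted(beliefs.items(), key=lambda kv: -kv[1]):
    print(f"{diagnosis}: {prob:.2f}")
```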
Even after Watson the quiz champion had been successfully created, building Watson the medical genius was no simple task. None the less, when IBM went public with their plans to move into healthcare, they didn’t hold back from making grand promises. They told the world that Watson’s ultimate mission was to ‘eradicate cancer’,35 and hired the famous actor Jon Hamm to boast that it was ‘one of the most powerful tools our species has created’.
It’s certainly a vision of a medical utopia to inspire us all. Except – as you probably know already – Watson hasn’t quite lived up to the hype.
First, a prestigious contract with the University of Texas M. D. Anderson Cancer Center was terminated in 2016. Rumour had it that even after they had shelled out $62 million on the technology36 and spent four years working with it, Watson still wasn’t able to do anything beyond heavily supervised pilot tests. Then, in late September 2017, an investigation by the health news website STAT reported that Watson was ‘still struggling with the basic step of learning about different forms of cancer’.37
Ouch.
To be fair, it wasn’t all bad news. In Japan, Watson did manage to diagnose a woman with a rare form of leukaemia, when doctors hadn’t.38 And analysis by Watson led to the discovery of five genes linked to motor neurone disease, or ALS.39 But overall, IBM’s programmers haven’t quite managed to deliver on the promises of their excitable marketing department.
It’s hard not to feel sympathy for anyone trying to build this kind of machine. It’s theoretically possible to construct one that can diagnose disease (and even offer patients sensible treatment plans), and that’s an admirable goal to have. But it’s also really difficult. Much more difficult than playing Jeopardy, and much, much more difficult than recognizing cancerous cells in an image.
An all-purpose diagnostic machine might seem only a simple logical step on from those cancer-spotting image-based algorithms we met earlier, but those algorithms have a big advantage: they get to examine the actual cells that might be causing a problem. A diagnostic machine, by contrast, only gets information several steps removed from the underlying issue. Maybe a patient has pins and needles caused by a muscle spasm caused by a trapped nerve caused by too much heavy lifting. Or maybe they have blood in their stool caused by haemorrhoids caused by constipation caused by poor diet. An algorithm (or a doctor) has to take a single symptom and trace the route backwards to an accurate diagnosis. That is what Watson had to do. It is a monumentally difficult task.
And there are other problems too.
Remember that dog/wolf neural network? Training that was easy. All the programmers had to do was find a stack of photos labelled ‘dog’ or ‘wolf’ and feed them in. The dataset was simple and unambiguous. ‘But’, as Thomas Fuchs, a computational pathologist, told MIT Technology Review, ‘in a specialized domain in medicine, you might need experts trained for decades to properly label the information you feed to the computer.’40
That might be a surmountable problem for a really focused question (such as sorting breast cancer pathology slides into ‘totally benign’ and ‘horribly malignant’). But an all-seeing diagnostic machine like Watson would need to understand virtually every possible disease. This would require an army of incredibly highly qualified human handlers prepared to feed it information about different patients and their specific characteristics for a very long time. And generally speaking, those people tend to have other stuff on their plate – like actually saving lives.
And then we come to the final problem. The most difficult of all to overcome.
Tamara Mills was just a baby when her parents first noticed something wasn’t right with her breathing. By the time she was nine months old, doctors had diagnosed her with asthma – a condition that affects 5.4 million people in the UK and 25 million in the US.41 Although Tamara was younger than most sufferers, her symptoms were perfectly manageable in the early years of her life and she grew up much like any other kid with the condition, spending her childhood playing by the sea in the north of England (although always with an inhaler to hand).
When she was 8, Tamara caught a nasty bout of swine flu. It would prove to be a turning point for her health. From that point on, one chest infection followed another, and another. Sometimes, during her asthma attacks, her lips would turn blue. But no matter how often Tamara and her mother went to her doctor and the local hospital, no matter how often her parents complained that she was getting through her supply of inhalers faster than they could be prescribed,42 none of the doctors referred her to a specialist.
Her family and teachers, on the other hand, realized things were getting serious. After two near-fatal attacks that put Tamara in hospital, she was excused from PE lessons at school. When the stairs at home became too much to manage, she went to live with her grandparents in their bungalow.
On 10 April 2014, Tamara succumbed to yet another chest infection. That night her grandfather found her struggling for breath. He called an ambulance and tried his best to help her with two inhalers and an oxygen tank. Her condition continued to deteriorate. Tamara died later that night, aged just 13 years old.
Asthma is not usually a fatal condition, and yet 1,200 people across the UK die from it every year, 26 of whom are children.43 It’s estimated that two-thirds of those deaths are preventable – as was Tamara’s. But that prevention depends entirely on the warning signs being spotted and acted upon.
In the four years leading up to her final and fatal attack, Tamara visited her local doctors and hospital no fewer than 47 times. Her treatment plan clearly wasn’t working, and yet each time she saw a healthcare professional, they only treated the immediate problem. No one looked at the bigger picture. No one spotted the pattern emerging in her visits; no one noticed that her condition was steadily deteriorating; no one suggested it was time to try something new.44
There was a reason for this. Believe it or not (and anyone who lives here probably will), the UK’s National Health Service doesn’t link its healthcare records together as a matter of standard practice. If you find yourself in an NHS hospital, the doctors won’t know anything about any visits you’ve made to your local GP. Many records are still kept on paper, which means the method of sharing them between doctors hasn’t changed for decades. It’s one of the reasons why the NHS holds the dubious title of the world’s biggest purchaser of fax machines.45
Crazy as this sounds, we’re not alone. The United States has a plethora of private doctors and large hospital networks that are not connected to each other; and while other countries, such as Germany, have begun to build electronic patient records, they are a long way from being the norm around the world. For Tamara Mills, the lack of a single, connected medical history meant that it was impossible for any individual doctor to fully understand the severity of her condition. Any solution to this profound shortcoming will sadly come too late for Tamara, but it remains a huge challenge for the future of healthcare. A machine like Watson could help save any number of Tamaras, but it’ll only be able to find patterns in the data if that data is collected, collated and connected.
There is a stark contrast between the rich and detailed datasets owned by data brokers and the sparse and disconnected datasets found in healthcare. For now, medical data is a mess. Even when our detailed medical histories are stored in a single place (which they often aren’t), the data itself can take so many forms that it’s virtually impossible to connect the information in a way that’s useful to an algorithm. There are scans to consider, reports to include, charts, prescriptions, notes: the list goes on. Then you have the problem of how the written data is recorded. You need to be able to understand all the acronyms and abbreviations, decipher the handwriting, identify the possibilities for human error. And that’s before you even get to symptoms. Does this person mean ‘cold’ as in temperature? Or ‘cold’ as in cough? Is this person’s stomach ‘killing them’ literally? Or just hurting a bit? Point is, medicine is really, really complicated, and every single layer of complexity makes the data a little less penetrable for a machine.46
IBM aren’t the only blue-chip big boys to have struggled with the messy, unstructured problem of healthcare data. In 2016, DeepMind, the artificial intelligence arm of Google, signed a contract with the Royal Free NHS Trust in London. DeepMind was granted access to the medical data from three of the city’s hospitals in return for an app that could help doctors identify acute kidney injuries. The initial intention was to use clever learning algorithms to help with healthcare; but the researchers found that they had to rein in their ambitions and opt for something much simpler, because the data just wasn’t good enough for them to reach their original goals.
Beyond these purely practical challenges, DeepMind’s collaboration with the NHS raised a more controversial issue. The researchers only ever promised to alert doctors to kidney injuries, but the Royal Free didn’t have a kidney dataset to give them. So instead DeepMind was granted access to everything on record: medical histories for some 1.6 million patients going back over a full five years.
In theory, having this incredible wealth of information could help to save innumerable lives. Acute kidney injuries kill one thousand people a month, and having data that reached so far back could potentially help DeepMind to identify important historical trends. Plus, since kidney injuries are more common among people with other diseases, a broad dataset would make it much easier to hunt for clues and connections to people’s future health.
Instead of excitement, though, news of the project was met with outrage. And not without justification. Giving DeepMind access to everything on record meant exactly that. The company was told who was admitted to hospital and when. Who came to visit patients during their stay. The results of pathology reports, of radiology exams. Who’d had abortions, who’d had depression, even who had been diagnosed with HIV. And worst of all? The patients themselves were never asked for their consent, never given an opt-out, never even told they were to be part of the study.47
It’s worth adding that Google was forbidden to use the information in any other part of its business. And – in fairness – it does have a much better track record on data security than the NHS, whose hospitals were brought to a standstill by a North Korean ransomware computer virus in 2017 because it was still running Windows XP.48 But even so, there is something rather troubling about an already incredibly powerful, world-leading technology company having access to that kind of information about you as an individual.
Let’s be honest, Google isn’t exactly short of private, even intimate information on each of us. But something feels instinctively different – especially confidential – about our medical records.
For anyone with a clean bill of health, it might not be immediately obvious why that should be: after all, if you had to choose between releasing your medical data and your browser history to the world, which would you prefer? I know I’d pick the former in a heartbeat and suspect that many others would too. It’s not that I have anything particularly interesting to hide. But one is just a bland snapshot of my biological self, while the other is a window straight into my character.
Even if healthcare data might have less potential to be embarrassing, Timandra Harkness, author of Big Data: Does Size Matter? and presenter of the BBC Radio 4 programme Future Proofing, argues that it is still a special case.
‘First off, a lot of people’s healthcare data includes a narrative of their life,’ she told me. ‘For instance, one in three women in Britain have had a termination – there might be people in their own lives who don’t know that.’ She also points out that your medical record isn’t just relevant to you. ‘If someone has your genetic data, they also know something about your parents, your siblings, your children.’ And once it’s out, there’s no getting away from it. ‘You cannot change your biology, or deny it. If somebody samples your DNA you can’t change that. You can have plastic surgery on your face, you can wear gloves to cover your fingerprints, but your DNA is always there. It’s always linked to you.’
Why does this matter? Timandra told me about a focus group she’d chaired in 2013, where ordinary people were asked what concerned them most about their medical data. ‘On the whole, people weren’t that worried about their data being hacked or stolen. They were concerned about having assumptions made about them as a group and then projected on to them as individuals.’
Most of all, they were concerned with how their data might be used against them. ‘Suppose someone had linked their supermarket loyalty card to their medical records. They might go for a hip operation, the doctor would say, “Oh, I’m sorry, we see here that you’ve been buying a lot of pizzas or you’ve been buying a lot of cigarettes and therefore I’m afraid we’re going to have to send you to the back of the waiting list.”’
That’s a very sensible fear in the UK, where some cash-strapped NHS hospitals are already prioritizing non-smokers for knee and hip operations.49 And there are many countries around the world where insurance or treatment can be denied to obese patients.50
There’s something of a conundrum here. Humans, as a species, could stand to benefit enormously from having our medical records opened up to algorithms. Watson doesn’t have to remain a fantasy. But to turn it into reality, we’ll need to hand over our records to companies rich enough to drag us through the slog of the challenges that lie between us and that magical electronic doctor. And in giving up our privacy we’ll always be dancing with the danger that our records could be compromised, stolen or used against us. Are you prepared to take that risk? Do you believe in these algorithms and their benefits enough to sacrifice your privacy?
Or, if it comes to it, will you even particularly care?
Francis Galton was a Victorian statistician, a human geneticist, one of the most remarkable men of his generation – and the half-cousin of Charles Darwin. Many of his ideas had a profound effect on modern science, not least his work that essentially laid the foundations for modern statistics. For that, we owe Galton a sincere debt of gratitude. (Unfortunately, he was also active in the burgeoning eugenics movement, for which we certainly do not.)
Galton wanted to study human characteristics through data, and he knew even then that you needed a lot of it to really learn anything of interest. But he also knew that people had an insatiable curiosity about their own bodies. He realized that – when whetted in the right way – people’s voracious appetite for an expert’s assessment of themselves could over-ride their desire for privacy. What’s more, they were often willing to pay for the pleasure of feeding it.
And so, in 1884, when a huge exhibition was held in London under the patronage of Queen Victoria to celebrate the advances Britain had made in healthcare, Galton saw his opportunity. At his own expense, he set up a stand at the exhibition – he called it an ‘Anthropometric Laboratory’ – in the hopes of finding a few people among the millions of visitors who’d want to pay money to be measured.
He found more than a few. Punters queued up outside and thronged at the door, eager to hand over a threepenny bit each to enter the lab. Once inside, they could pit their skills against a series of specially designed instruments, testing, among other things, their keenness of sight, judgement of eye, strength of pull and squeeze, and swiftness of blow.51 Galton’s lab was so popular he had to take two people through at a time. (He quickly noticed that it was best to keep parents and children apart during the testing to avoid any wasted time on bruised egos. Writing in a journal article after the event, he commented: ‘The old did not like to be outdone by the young and insisted on repeated trials.’)52
However well or badly each person performed, their results were scribbled on to a white card that was theirs to keep as a souvenir. But the real winner was Galton. He left the exhibition with a full copy of everything – a valuable set of biometric measurements for 9,337 people – and a handsome profit to boot.
Fast-forward 130 years and you might just recognize some similarities in the current trend for genetic testing. For the bargain price of £149, you can send off a saliva sample to the genomics and biotechnology company 23andMe in exchange for a report of your genetic traits, including answers to questions such as: What kind of earwax do you have? Do you have the genes for a monobrow?53 Or the gene that makes you sneeze when you look at the sun?54 As well as some more serious ones: Are you prone to breast cancer? Do you have a genetic predisposition for Alzheimer’s disease?55
The company, meanwhile, has cleverly amassed a gigantic database of genetic information, now stretching into the millions of samples. It’s just what the internet giants have been doing, except that in handing over our DNA as part of the trade, we’re offering up the most personal data we have. The result is a database we all stand to benefit from. It constitutes a staggeringly valuable asset in terms of its potential to advance our understanding of the human genome. Academics, pharmaceutical companies and non-profits around the world are queuing up to partner with 23andMe to hunt for patterns in their data – both with and without the help of algorithms – in the hope of answering big questions that affect all of us: What are the hereditary causes of different diseases? Are there new drugs that could be invented to treat people with particular conditions? Is there a better way to treat Parkinson’s?
The dataset is also valuable in a much more literal sense. Although the research being done offers an immense benefit to society, 23andMe isn’t doing this out of the goodness of its heart. If you give it your consent (and 80 per cent of customers do), it will sell on an anonymized version of your genetic data to those aforementioned research partners for a tidy profit.56 The money earned isn’t a happy bonus for the company; it’s actually their business plan. One 23andMe board member told Fast Company: ‘The long game here is not to make money selling kits, although the kits are essential to get the base level data.’ Something worth remembering whenever you send off for a commercial genetic report: you’re not using the product; you are the product.57
Word of warning. I’d also be a bit wary of those promises of anonymity. In 2005 a young man,58 conceived through a completely anonymous sperm donor, managed to track down and identify his birth father by sending off a saliva swab to be analysed and picking up on clues in his own DNA code.59 Then in 2013 a group of academics, in a particularly famous paper, demonstrated that millions of people could potentially be identified by their genes using nothing more than a home computer and a few clever internet searches.60
And there’s another reason why you might not want your DNA in any dataset. While there are laws in place to protect people from the worst kinds of genetic discrimination – so we’re not quite headed for a future where a Beethoven or a Stephen Hawking will be judged by their genetic predispositions rather than their talent – these rules don’t apply to life insurance. No one can make you take a DNA test if you don’t want to, but in the US, insurers can ask you if you’ve already taken a test that calculates your risk of developing a particular disease such as Parkinson’s, Alzheimer’s or breast cancer and deny you life insurance if the answer isn’t to their liking. And in the UK insurers are allowed to take the results of genetic tests for Huntington’s disease into account (if the cover is over £500,000).61 You could try lying, of course, and pretend you’ve never had the test, but doing so would invalidate your policy. The only way to avoid this kind of discrimination is never to have the test in the first place. Sometimes, ignorance really can be bliss.
The fact is, there’s no dataset that would be more valuable in understanding our health than the sequenced genomes of millions of people. But, none the less, I probably won’t be getting a genetic test any time soon. And yet (thankfully for society) millions of people are voluntarily giving up their data. At the last count 23andMe has more than 2 million genotyped customers,62 while MyHeritage, Ancestry.com – even the National Geographic Genographic project – have millions more. So perhaps this is a conundrum that isn’t. After all, the market has spoken: in a straight swap for your privacy, the chance to contribute to a great public good might not be worth it, but finding out you’re 25 per cent Viking certainly is.fn2
OK, I’m being facetious. No one can reasonably be expected to keep the grand challenges of the future of human healthcare in the forefront of their minds when deciding whether to send off for a genetic test. Indeed, it’s perfectly understandable that people don’t – we have different sets of priorities for ourselves as individuals and for humankind as a whole.
But this does bring us to a final important point. If a diagnostic machine capable of recommending treatments can be built, who should it serve? The individual or the population? Because there will be times where it may have to choose.
Imagine, for instance, you go into the doctor with a particularly annoying cough. You’ll probably get better on your own, but if a machine were serving you – the patient – it might want to send you for an X-ray and a blood test, just to be on the safe side. And it would probably give you antibiotics too, if you asked for them. Even if they only shortened your suffering by a few days, if the sole objective was your health and comfort, the algorithm might decide it was worth the prescription.
But if a machine were built to serve the entire population, it would be far more mindful of the issues around antibiotic resistance. As long as you weren’t in immediate danger, your temporary discomfort would pale into insignificance and the algorithm would only dole out the drugs when absolutely necessary. An algorithm like this might also be wary of wasting resources or conscious of long waiting lists, and so not send you for further tests unless you had other symptoms of something more serious. Frankly, it would probably tell you to take some aspirin and stop being such a wimp.
Likewise, a machine working on behalf of everyone might prioritize ‘saving as many lives as possible’ as its core objective when deciding who should get an organ transplant. Which might well produce a different treatment plan from a machine that only had your interests in mind.
A machine working for the NHS or an insurer might try to minimize costs wherever possible, while one designed to serve a pharmaceutical company might aim to promote the use of one particular drug rather than another.
The case of medicine is certainly less fraught with tension than the examples from criminal justice. There is no defence and prosecution here. Everyone in the healthcare system is working towards the same goal – getting the patient better. But even here every party in the process has a subtly different set of objectives.
In whatever facet of life an algorithm is introduced, there will always be some kind of a balance. Between privacy and public good. Between the individual and the population. Between different challenges and priorities. It isn’t easy to find a path through the tangle of incentives, even when the clear prize of better healthcare for all is at the end.
But it’s even harder when the competing incentives are hidden from view. When the benefits of an algorithm are over-stated and the risks are obscured. When you have to ask yourself what you’re being told to believe, and who stands to profit from you believing it.