There are 89 cubic kilometres of water in Lake Geneva. That’s 89 trillion litres (about 23.5 trillion US gallons) of water. I can barely hear the traffic above the roar of lake water through the sluices. As I look across the grey expanse, two thoughts occur to me. The first is that carrying my 1-litre (1¾-pint) bottle of drinking water feels a bit silly now. The other is that, if it keeps raining like this, Lake Geneva will soon contain 90 trillion litres of water.
I’ve walked down to have a look at the lake, and the famous Jet d’Eau fountain, on my way to CERN, which is a tram-ride away from Geneva station. Well, the main buildings are a tram-ride away. I suspect that at least part of the experimental equipment is somewhere beneath my feet.
There’s 27km, nearly 17 miles, of tunnel in a big loop under the Swiss/French border, where the subatomic particles go flying round at mind-boggling speeds1 before colliding: the Large Hadron Collider, LHC. The hadrons are tiny subatomic particles; it’s the whole engineering set-up that is the ‘large’ part. And the amount of data that it chucks out every time they fire it up is big by anyone’s standards.
When they cross the beams,2 the detectors generate 40 million megabytes of data per second, which is more than they could store, let alone analyse. But, as with so many things, it’s not the size of your data that counts, it’s what you can do with it.
Think of my pathetic bottle of water. Rain is dripping off it now, and soaking through my coat. But I’m still holding it, because it’s the volume of water that I can use for my specific purpose. It’s been selected, processed and packaged to keep it free from contamination, so I can drink it. That’s what makes it more useful to me than all those billions of litres rushing past my feet or plopping off my nose.
If I tipped this litre of drinkable water into Lake Geneva, it wouldn’t take long for the currents to mix it with the other 89 trillion litres. Some of it might be squirted through the Jet d’Eau, which puts 500 litres (132 US gallons) per second into the air. It would certainly be diluted with the rainwater.
Imagine what a task it would be to go through all the water in the lake and reassemble the original water molecules back into this bottle. Like one of those fairy tale challenges where the heroine has to separate a mountain of peas and lentils, one by one. And I think: That’s about the scale of the task they’ve set themselves just up the road, at CERN. Only it’s not water they’re sorting through, or lentils, it’s data.
Asking the big (and very tiny) questions
I was expecting something a bit more like a mine. I have my helmet, my pass, and my authorised Underground Tour leader, but when the lift arrives, Piotr gets in and presses the Level -2 button. ‘It’s just an elevator,’ he says as we descend 100m (328ft) to the tunnel.
Piotr Traczyk works on CMS,3 one of the two main experiments here at the LHC. Like its sister, or rival, ATLAS, CMS is essentially a system of detectors that record what happens when the beams of protons collide. By measuring what is thrown off, at what speed and what angle, the scientists can reconstruct events that only lasted for fractions of a second. The brief appearance of a Higgs boson, for example.
The first thing we reach when we leave the lift is a roomful of computers, rack after rack of hardware with blue cables and winking lights in all directions. It looks like a scene either from a 1960s documentary about the new, compact computers that only take up one room, or from a James Bond villain’s underground HQ. Piotr has to raise his voice above the cooling fans. ‘This is where the Level 1 Trigger System lives. The trigger is the system that reduces the amount of data to manageable size, by discarding the collisions that we believe, or actually, the Trigger believes, to be uninteresting.’
If the word ‘trigger’ makes you think of a cull, you’re right. The equipment in this room is programmed to identify potentially interesting results, and throw the rest straight into the virtual bin. This has to happen in millionths of a second, far faster than a regular PC could make those decisions, let alone a human mind. And this is just the first stage. ‘There’s another room directly above this one,’ says Piotr.
‘The first level has to run very fast. The blue cables are fibre optics: the other end is physically on the detector. The data from the detector comes directly into these racks. One collision is one megabyte, so 40 million megabytes per second enter this room, 100,000 megabytes per second leave this room.’
Piotr displays his impressive recall of many large numbers. ‘Then, the 100,000 events per second get sent upstairs to the surface, and there we have a farm of 2,000 normal PCs. What we do up there is reduce the 100,000 per second to 100 per second.’
If you’re losing track of all the zeroes here, imagine taking half the water in Lake Geneva and reducing it to only 40 Olympic swimming pools. Which still seems like a lot of water, so that might not help. OK, think of taking two-thirds of the population of the UK, and reducing it to 100 people, within a second. Which is pretty much what happens to TV sports viewing figures every time a British team gets knocked out of a major sporting tournament.
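If you’d rather check the zeroes yourself, here’s a rough back-of-the-envelope sketch in Python. The event rates are the ones Piotr quotes above; the lake and swimming-pool volumes are my own round numbers, so treat the output as an illustration rather than an official CERN figure.

```python
# Rough arithmetic for the trigger's data cull, using the figures quoted above.
# The lake and pool volumes are round-number assumptions, not CERN data.

collisions_per_second = 40_000_000   # events per second hitting the Level 1 Trigger
after_level_1         = 100_000      # events per second sent up to the PC farm
after_pc_farm         = 100          # events per second finally kept

level_1_cut = collisions_per_second / after_level_1   # about 400 to 1
overall_cut = collisions_per_second / after_pc_farm   # about 400,000 to 1

half_lake_litres    = 89e12 / 2      # roughly half of Lake Geneva, in litres
olympic_pool_litres = 2.5e6          # a 50m Olympic pool holds about 2.5 million litres
pools_left = half_lake_litres / overall_cut / olympic_pool_litres

print(f"Level 1 keeps roughly 1 collision in {level_1_cut:,.0f}")
print(f"Overall, roughly 1 collision in {overall_cut:,.0f} survives")
print(f"Half the lake, culled by the same factor: about {pools_left:.0f} Olympic pools")
```

The answer comes out in the mid-forties, which, given all the rounding, is close enough to the 40 pools in the analogy above.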
At this point, as I’m trying to absorb the scale and speed of all this data processing, my cellphone rings. Considering how often I can’t get a signal in London, where I live, I’m surprised to get one this far under France. Piotr tells me that the whole tunnel has both Wi-Fi and cellphone coverage, for convenience and safety.
For example, each of these blue cables had to be connected by a human being, and at one point Piotr himself was in this room with his spanner, while his colleague attached the other ends of the cables to the detector, communicating by cellphone. And yes, he says they do sometimes find two cables swapped over. It’s nice to be reminded that all this advanced science and technology still relies on human beings tightening things up while shouting, ‘It’s the blue one!’ ‘That doesn’t help, Piotr – they’re all blue!’ into their cellphones.
The next set of doors is marked with radiation warning signs. Piotr seems unworried by this. He also tells me the helmet is mainly to protect me from hitting my head on one of the pipes running along the ceiling. He’s significantly taller than I am. But before we get to the main tunnel he has to get us through an eye-scanner security gate. Maybe it’s more X-Men than James Bond.
‘Beyond this door lies the CMS experiment,’ says Piotr, and with a little theatrical flourish he lets me into the main cavern.
It’s massive.
I’m visiting while the LHC is closed for an upgrade, which means the CMS is opened up like two halves of a bagel, with a hole through the middle where the beam pipe would normally carry the colliding particles. So I can see the five-storey-high ring of electronics: red, silver and shiny copper detectors linked by a dense net of multicoloured cables and copper gas pipes. It’s like a twenty-first-century rose window, in an underground cathedral to physics, all brightly lit and framed by green metal walkways at all levels.
Piotr tells me that ATLAS is even bigger, but CMS is heavier. ‘And prettier,’ he adds. It is very pretty.
Each of the 11 slices that make up CMS has different detectors, to measure the track of different particles thrown off by the collisions. Part of the challenge is to reassemble these traces to create a reconstruction of each individual collision. It’s like Crimewatch for particle physics, or some kind of accident investigation: ‘The first proton came around this corner at one minute after midnight, and collided with the second proton, which was travelling at 1,079 million km per hour (671 million mph), well over the speed limit.’
And, to make that task harder, this requires a very precise measurement of the position of each detector when it recorded the data. ‘You might think that, as we are in Switzerland, this is no big deal,’ says Piotr. But all these precision instruments are set in thousands of tonnes of iron, around a magnet 100,000 times stronger than the Earth’s magnetic field. Things tend to move around.
So as well as detectors, the CMS needs position detectors to measure the positions of the detectors, using lasers and mirrors. And those position detectors are throwing out yet more data that has to be collected and analysed.
So much data comes out of CERN, even after the ruthless culling of the Trigger process, that the work of processing needs to be shared among 170 computing centres in 40 countries. The Worldwide LHC Computing Grid is the biggest computing grid in the world, handling 30 million gigabytes per day. If you’re thinking about how slowly your home broadband downloads cat videos, don’t worry: CERN has its own private network of optical fibres handling 10 gigabits per second.
Over 10,000 physicists from Korea to Illinois, Moscow to Barcelona, have access to that data almost as fast as it pours down the cables from Geneva. The blue optical cables that Piotr and his colleagues attached to the detectors are only the start of a long journey. If their data really were the water in Lake Geneva, it would be pouring along massive pipes at near light speed, and scientists thousands of miles away could be pouring it into a glass within minutes.
Except the lake wouldn’t be getting any emptier. It would be filling up with new data faster than the real lake was with this morning’s rain.
The sheer scale of the data CERN produces is crucial. The Higgs boson, for example, is a rare and elusive subatomic thing, whose unlikelihood means you need billions and billions of bits of information from millions of collisions to be sure you’ve glimpsed one, let alone measure some of its characteristics. Equally important are the speed of processing, faster than an unaided human could do it, and the fact that thousands of scientists can collaborate on making sense of the shared pool of data.
Without big data, CERN would never have found evidence that the elusive Higgs boson particle exists. It’s an obscure corner of physics, but not finding it could have meant having to rethink the fundamental building blocks of our universe, and rewrite the physics textbooks. So physics teachers all over the world, though they might struggle to explain exactly what a Higgs boson is, are grateful that they won’t have to go back and learn an entirely new model of reality before the start of next term.
But the physics teachers shouldn’t relax too much. Over in the astronomy corner, big data is being used to investigate some equally fundamental questions. How did the universe form just after the Big Bang? What is dark energy? What are the limits of Einstein’s general theory of relativity?
Astronomy has always been a very mathematical science. Remember Laplace developing statistics to help model the underlying laws of motion from his observations of the stars and planets? Today, it’s moved so far from just looking through telescopes at visible objects that many astronomers can hardly ever find an excuse to go to observatories in Hawaii, South America and the Canaries, or even to stay up all night gazing up at the stars.
They’re more likely to find themselves writing computer programs to analyse digital data coming in from thousands of miles away, or even from instruments that have been sent out into space. Today’s astronomers are often distinguished not just by the objects they study, like solar astronomers who study the sun, but by the type of data they analyse.
Radio astronomers collect and study radio waves from the universe. They use large dishes like the ones at Jodrell Bank in north-west England, similar to your neighbour’s satellite dish that picks up Italian football games and those foreign movies with all the sighing and groaning. But what the astronomers receive is more like the background noise between stations, which takes a lot of maths to decipher.
No wonder they have to pass the time by thinking up imaginative names for their equipment, like the VLT (Very Large Telescope), the ELT (Extremely Large Telescope) or the OWL (OverWhelmingly Large Telescope). I’m not making these up. I’m a little disappointed, in fact, that the next massive international astronomy project won’t be called the WYLATSOTT (Will You Look At The Size Of That Telescope) but the SKA – the Square Kilometre Array.
In radio astronomy too, size really does matter. Because radio wavelengths are far longer than those of visible light, it takes a much bigger dish to collect enough to make any kind of picture. But with a big enough dish, radio astronomers can find some quite exciting things. Pulsars, the fast-spinning neutron stars that emit a regular pulse of radio waves, were discovered by Jocelyn Bell Burnell in 1967 using a radio array at Cambridge, and Jodrell Bank’s Lovell Telescope, 76m (250ft) across, has been studying them ever since. In the 1950s and 1960s, Jodrell Bank tracked Sputnik and other early space missions.
The largest radio telescope dish on the planet, at Arecibo on the Caribbean island of Puerto Rico, is 305m (1,000ft) across, and is built into a natural hollow in a mountain. Its size makes it so sensitive that it can detect signals that have been travelling across the universe for 100 million years. If that doesn’t impress you, it also featured in the James Bond film GoldenEye.
The SKA won’t literally take up 1km² (0.39 square miles). Instead, an international consortium of scientists plan to use the same kind of fast, accurate, data-heavy computing that CERN gets from its grid.
Instead of a single WYLATSOTT receiving dish, SKA will consist of an array of smaller dishes and receivers spread across two sites in Australia and Africa. These detectors will generate more than 10 times the amount of data carried by the whole internet today. Like CERN, it will combine the data received from the different detectors to build up a picture. To put together signals received thousands of miles apart, it will need to make allowances for differences in time as well as space.
This will require a computer three times as powerful as the most powerful supercomputer in existence in 2013, and thousands of miles of fibre optic cable. And, like CERN, SKA will need to cull the data before sending it out to teams of scientists all over the world.
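The core trick of combining signals received thousands of miles apart can be sketched in a few lines of Python. This is a toy with made-up numbers, nothing like SKA’s real software: two receivers record the same burst, one copy arrives a little later, and cross-correlation tells you how far to shift one recording before adding them together.

```python
import numpy as np

# Toy illustration (not SKA's actual pipeline): the same radio burst reaches
# two receivers, but one copy arrives later because the dishes are far apart.
rng = np.random.default_rng(0)
n = 2000
true_delay = 37                      # delay in samples; an arbitrary made-up value

burst  = np.exp(-0.5 * ((np.arange(n) - 600) / 20.0) ** 2)    # a simple pulse
site_a = burst + 0.05 * rng.standard_normal(n)
site_b = np.roll(burst, true_delay) + 0.05 * rng.standard_normal(n)

# Cross-correlate the two recordings; the position of the peak gives the delay.
xcorr = np.correlate(site_b, site_a, mode="full")
estimated_delay = int(np.argmax(xcorr)) - (n - 1)
print(f"true delay: {true_delay} samples, estimated: {estimated_delay} samples")

# Shift one recording back by that delay, then combine the aligned signals.
combined = (site_a + np.roll(site_b, -estimated_delay)) / 2
```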
Like most large scientific projects, SKA will take many years to complete, and probably won’t be fully operational till around 2025. That should give supercomputers plenty of time to become three times as powerful as they are today. If you’re currently at school, you could get a job analysing some of the first data the SKA spits out, or writing the computer code that will do it for you.
In some ways, physics and astronomy are the easiest targets for big data. I’m not saying that studying elusive particles that exist for fractions of a second, if at all, or finding dark matter whose main characteristic is that it can’t be seen or heard, is easy. But the problems they throw up are the types of problem that big data is best equipped to tackle.
Subatomic particles and galaxies may be very small, or very far away, but they tend to behave according to the laws of physics. These laws, being mathematical, talk the same language as computers, at least as closely as Brits and Americans talk the same language.
As soon as living things are involved, things get more complicated, as we found with Professor Eamonn Keogh’s insects in Chapter 1. And yet, as we also found, big data can help spot patterns in living things as well as inanimate lumps of matter.
Big brains
After he’d done his PhD in neuroscience, John D. Van Horn became a researcher. One of his first jobs was to go out and buy the biggest hard disk his lab could afford, to store all the data their research was going to generate. ‘People would come to our lab to look at the hard disk in awe,’ he says, ‘because it was four gigabytes! Wow! Amazing! At that time, I thought four gigabytes was infinity.’
I don’t like to ask how long ago that was, because the smallest iPod Nano had 4GB of storage back in 2007. My cellphone has twice that much memory. And today, Professor Van Horn’s neuroscience lab at the University of Southern California stores, ‘many, many petabytes of data’.
There’s a two-way relationship between how much detail the technology can deliver and how complex the questions are that the researchers want to ask. Van Horn started out measuring blood flow in the brain, using equipment that could give one image of the whole brain every eight minutes. Now, he can use functional MRI to give him three or four images per second, not quite real time compared to brain activity, but fast enough to see changes as the person in the scanner thinks, or moves, or reacts to what they’re seeing or hearing.
The human brain is notoriously complex and variable. Every brain is unique. Put two different people in a brain scanner and their brains will be different. Recognisably the same organ, but unique in the detail of its folds and wrinkles. Ask them to do the same task, and you won’t see exactly the same pattern of activity in those two brains. Similar, probably, with the visual cortex at the back of the skull engaged in making sense of what they see, or the language areas on the left hemisphere analysing and producing words, but human beings have an impressive ability to find different ways to get the same results, even inside our own skulls.
This is where the pattern-spotting ability of AI can be useful. Show enough brain scans to an AI algorithm and it can learn to sort them according to whatever criteria you gave it in the training set: male or female, musician or non-musician, normal or psychopath.4
Because it has taught itself the best ways to distinguish between categories, we don’t necessarily know what the differences are. But whatever its criteria, a computer can classify human brains by looking at scans. This could have all sorts of uses.
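To make that concrete, here’s a minimal sketch of that kind of classifier in Python, using scikit-learn. Everything in it is invented: random numbers stand in for features extracted from scans, and the labels stand in for whatever categories were in the training set. The point is only that the algorithm works out its own decision rule; nobody tells it which features matter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented data: each row stands in for features extracted from one brain scan
# (regional volumes, say), and each label for a training category (1 or 0).
rng = np.random.default_rng(42)
n_scans, n_features = 400, 50
features = rng.standard_normal((n_scans, n_features))
labels = rng.integers(0, 2, size=n_scans)

# Nudge a handful of features so the two groups genuinely differ a little.
features[labels == 1, :5] += 0.8

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

# The classifier learns its own rule from the training set; we never tell it
# which of the 50 features to look at.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy on scans it has never seen: {model.score(X_test, y_test):.2f}")
```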
Professor Van Horn, for example, is looking for rare genetic variants associated with brain conditions such as autism and Parkinson’s disease. He does this by linking a genetic database to his database of brain images.
At Imperial College in London, Professor Paul Matthews also uses big data methods. But, for a Head of Brain Sciences, he’s surprisingly equivocal about the brain scans in his database.
‘Yes, I’m certainly interested in the brain data,’ he says, ‘but it’s only one piece in a jigsaw. The public perception of it, and even some doctors’ perception, is that it is somehow special. But it’s not really. It’s a piece of data. It’s a particularly rich piece of data, but it’s just a piece.’
For Professor Matthews, it’s the ability to link the scans with other sources of information that marks the move, ‘from large data to big data’.
‘Large data is data in which we have collected lots of a single type, or a limited type, of information, and we are using it within the context that was originally envisioned. Rather than studying the size of the brain in 20 subjects, we’re going to study it in 1,000. So we’re scaling up a simple idea.’
Going beyond sheer scale, says Matthews, ‘Big data is where we’re not just collecting one or a few types of data, like brain scan, age and gender: we’re collecting all the things we can possibly think about and more, and we’re looking to relationships between them that we might not have anticipated. Moreover, we’re starting to link between different datasets.’
This is the kind of thing Van Horn is doing with his genetic database. But Paul Matthews looks far beyond medical records.
For example, ‘because you have their postal code, you can go back to meteorological records and say something about exposure to sun, or to particulates in the air, that these individuals had. So using bits of data from one dataset to link into other datasets, suddenly the picture you have of a person’s life is expanding rapidly.’
Does this ability to bring together many different dimensions of a person’s life, and set a machine to plough through in search of new patterns, mean the researcher’s role is fundamentally changing? Some have claimed that big data represents the end of theory, as computers will produce answers before we even know which question to ask.
Remember your school science lessons, where you had to write down a hypothesis, then test it with the apparatus, note your results, and state your conclusions?5 Is that really an outdated approach? I put it to Professor Matthews.
‘Is it all about trying to find relationships blindly or is it about hypotheses? You can do either,’ he tells me. ‘That’s the beauty of these datasets.
‘I might hypothesise that people on statins are less likely to show progression in multiple sclerosis, which recent trials suggest. I can go into the dataset of people with MS and identify those who have used statins and those who have not, balance them for other factors, and ask the question, which population did better? That would be a hypothesis-led association study. The alternative is to use sophisticated data-mining approaches to allow clusters of features to be identified. It’s always back and forth.’
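As a sketch of what the hypothesis-led version looks like in practice, here are a few lines of Python with entirely invented numbers. It just compares the rate of progression in the statin and non-statin groups; a real study would first balance the groups for age, sex, disease duration and the rest, and then apply a proper statistical test.

```python
# Invented counts for a hypothesis-led comparison: of the people with MS in the
# dataset, how many in each group went on to show progression of the disease?
progressed = {"took statins": 30, "no statins": 55}
group_size = {"took statins": 100, "no statins": 100}

for group in progressed:
    rate = progressed[group] / group_size[group]
    print(f"{group}: {rate:.0%} showed progression")

# A clearly lower rate in the statin group would support the hypothesis;
# similar rates would not. Balancing and significance testing come next.
```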
So scientists, relax, you’re not out of a job yet. Think of your big data AI as a lab assistant that might sometimes notice interesting new results and point them out for further research, but the rest of the time is doing your donkey work.
Genetic codebreakers
In genetics, like neuroscience, the mass of data goes well beyond what the human computers of a century ago, with their pens, paper and mechanical adding machines, could have handled.
The human genome is made up of around 20,000 genes, stored on 46 strands of DNA6 called chromosomes, found in the nucleus of almost every cell in the body. The fact that different types of cell – skin, bone and brain – contain the same genetic blueprint should be a clue that the whole process of making a human being is not as simple as following a recipe book.
What a recipe book it would be: a book with 46 chapters and around 6 billion letters. Though chromosomes come in pairs, so it would really be a book with 23 chapters, each in two alternative versions. Except if you’re cooking a boy, in which case one of the chapters would have an X version and a shorter Y version that is completely different and mainly about how to make testicles and sperm. It would be a book unique to the person being made, unless they’re an identical twin,7 a book to which each parent has contributed 23 chapters. Typos may have crept in. Sometimes the two versions are incompatible. Perhaps Dad uses pints and pounds, while Mum uses litres and kilogrammes.
Somehow, our imaginary cook has to decide which word to take from Dad’s chapter, and which from Mum’s, trying to avoid any misprints that could make the whole dish inedible or even poisonous. And there isn’t even a cook.
There’s still much we don’t know about the cooking process, but scientists published the whole recipe book in 2003. Two rival teams, in the UK and US, competed to be the first to map the human genome at the end of the twentieth century, but the final version was a collaborative effort that also included scientists in Germany, Japan, France and China. It cost several billion dollars and took over 10 years. It’s now freely available to anyone, though it’s not much of a read if you’re not a geneticist, being made up entirely of the letters A, C, G and T.
And, as you may have guessed, it’s not one genome that describes every single person on the planet. It’s a representative version that combines material from a number of volunteers, much as editions of the Bible, or the works of Shakespeare, draw on a number of sources to produce what looks like an authoritative edition.8
Rather like big data, the mapping of the whole human genome was hailed as the answer to all our problems: We would understand all diseases and be able to tailor cures to each individual. Decoding a person’s genetic blueprint would be like looking into their future. Science would have godlike knowledge of all things. And of course, it turned out to be a lot more complicated than that. We can tinker with the DNA of simple life forms, such as yeast or bacteria, quite well. Humans are a bit more tricky.
Nevertheless, by identifying specific genes linked to specific disorders or diseases, we have got closer to preventing and treating them. People whose families have a history of illnesses such as sickle cell anaemia, who are wondering if there’s a danger they’ll pass on the illness to any future children, can take a test and find out. Parents may have the option to bypass natural conception with in vitro fertilization (IVF), what’s sometimes called a test tube baby. This gives doctors the possibility of screening embryos for certain faulty genes with a technique called pre-implantation genetic diagnosis (PGD). Then parents can choose to implant an embryo that is free from a life-limiting or fatal disease.
Given the unfortunate history of eugenics, which brought together the new fields of genetics and statistics only to find itself in bed with genocide, you may feel wary about this. But there’s no inevitable path from the human genome to genocide. They share a linguistic root, but so do generator and genitals, and we don’t want to stop using either of those.
Medically, the science of genetics has immense promise. Thousands of medical conditions are caused by a fault in a single gene, in the way a broken arm can be caused by a single fall from a horse. Cystic fibrosis, Huntington’s disease and sickle cell disease are among them. Having one of these conditions doesn’t make anybody less of a person, but I’m glad I was lucky enough not to draw any of those tickets in the genetic lottery, and I’ll be even more glad if medical science can help future individuals be equally lucky.
Gene therapy, though still in early stages, offers the promise of fixing a faulty gene in a living person.
But these relatively simple genetic faults are just the beginning. More often, many different genes work together to produce different effects or predispositions. This is where harnessing genetics for medical benefit starts to get tricky, and where big data can start to be properly useful.
In many cases, having this or that genetic mutation doesn’t mean a disease is certain, only that levels of risk have changed. Actress Angelina Jolie chose to have a double mastectomy, the surgical removal of both breasts, though there was no obvious sign she had breast cancer. Knowing that she had a BRCA1 genetic mutation, which brings an increased chance of developing the disease, she had decided to take preventative action.
Jolie herself reported that doctors told her she faced an 87 per cent risk of developing breast cancer, though on average those with a BRCA1 mutation face a 65 per cent lifetime risk. Scientists may quote a range of risks from 45 per cent to 90 per cent associated with a faulty BRCA1 gene. Even 45 per cent is significantly higher than the 12 per cent overall risk that a woman picked at random from the general population will develop breast cancer at some point in her life.
Women with a family history of breast cancer, like Jolie, whose own mother died of the disease aged just 56, can take a genetic test that looks specifically at genes like BRCA1 and BRCA2. Knowing whether they have a version of the gene that increases their own risk gives women choices about reducing that risk through surgery or in other ways.
This kind of knowledge is possible because an individual’s genes can be mapped far faster, and more cheaply, than when the first human genome was published a decade ago. In 2004, sequencing a new human genome would have cost around $20 million. Ten years later, that cost had fallen below $5,000, and now a company called Illumina can do it in a few days for under $1,000.
This is not just because all those scientists who worked for 13 years on the very first human genome are getting quicker with their pipettes and microscopes. It’s thanks to improvements in hardware, software and information handling. Biology is moving rapidly from being all about the specific, fiddly, squishy thing you have on your lab bench, or your microscope slide, to being all about masses of data. And this opens up new possibilities.
Nobody could have looked at Angelina Jolie, seen into her future, and known for sure that she would or would not develop breast cancer. And if she did develop it, nobody would have been able to tell that the mutation in her BRCA1 gene was definitely to blame. But by noting a correlation between having that mutation and going on to develop breast cancer in thousands of other women, scientists were able to put a figure on her increased risk. And, knowing that figure, she was able to make her own decision about whether to take preventative action.
Most breast cancers are not linked to faults in this particular gene. BRCA1 is implicated in less than 10 per cent of breast and ovarian cancers. If you do have a faulty version of a BRCA gene, your chances of going on to develop one of those diseases are higher, but it’s not certain. You might be lucky.
Is it more than luck? Is there more we can do to avoid illnesses like cancer? That’s the kind of question that big data can help us address. Illumina, the company that can provide the $1,000 human genome, is contributing to a UK study called the 100,000 Genomes Project.
It will target cancers and rare diseases, sequencing the genomes of patients, their families and, in some cases, of the cancerous cells.
By linking the genetic information with what’s known about the medical history of the individual, scientists hope not only for a better understanding of how genes affect our health, but also of how to treat illnesses more effectively. For example, by knowing the genetic profile of an individual’s cancer, doctors can know which chemotherapy is likely to be most effective.
But just as reading a recipe book is no guarantee that your dinner will taste perfect every time, reading a human genome won’t tell you everything that will happen to that person. Some diseases are affected by environmental factors, and others are down to sheer bad luck.
Big danger?
When doctors, or medical researchers, talk about environmental factors, they don’t just mean whether you’re breathing clean air or getting sunburn because you grew up next to an Australian beach. Any factor that’s not in your genes when you’re born may be called environmental.
So smoking and living in Cornwall would both be described as environmental factors, even though deciding to smoke was a choice you made, unlike being born in Cornwall.
In Cornwall, the levels of background radiation are much higher than in most of the UK, simply because of the type of rock that makes up that part of south-west England. Radon, a radioactive gas emitted naturally by the uranium in rocks, causes over 1,000 deaths from lung cancer every year in the UK. That’s a higher death toll than drink-driving. That said, even quite high doses of radon gas don’t increase your risk of lung cancer anywhere near as much as smoking cigarettes.
In practice, it is hard to make a clear distinction between environmental factors and things that you have chosen to do. You could choose to move away from Cornwall when you reach the age of 18, for example, instantly reducing the dose of radiation that hits you every day. But by moving to London to avoid death by radon gas, you could expose yourself to other environmental risks, from air pollution to loneliness.
You could stay in Cornwall and fit radon protection to your house, as it’s gas accumulating indoors that carries most of the increased risk. You could choose to reduce your exposure even more by working outdoors, though farming and fishing are two of the most dangerous professions. Around one in 10,000 Britons working in agriculture, forestry and fishing will die on the job in any given year, making it the riskiest sector to work in.
So, even knowing the data on risks to your health, it’s not easy to apply that knowledge to making decisions as an individual. However, if you are a public servant looking to reduce the numbers of deaths across the population, it can be helpful to know where to apply your efforts: offering radon gas protection to homes in the highest radiation areas, for example, or trying to cut down the number of people who smoke.
Smoking is one of the big success stories of modern public health. Or rather, stopping smoking is. Today it seems amazing that millions of people smoked cigarettes without any sense that they might be harming their health. But in the mid twentieth century, when doctors and scientists looked at the rapid rise in cases of the previously rare disease, lung cancer – a 15-fold increase in 25 years – smoking was barely in the list of suspects.
If this seems bizarre, remember that the 1950s were a very different age. Turing’s imaginary robot child escaped the everyday household chore of filling the coal scuttle only by virtue of lacking legs. Most homes were heated by burning coal, wood or peat indoors. City air was darkened by smoke from all those chimneys, as well as from industry. Until the 1980s, when a big cleaning programme began, buildings in Britain’s cities were black.
The great smog of December 1952, a blend of smoke and fog with high levels of sulphur dioxide, caused an estimated 4,000 premature deaths in London. The Clean Air Act of 1956 authorised smoke-free zones in London and northern industrial cities, with controls on what industries could emit, and a shift for households from coal to smokeless fuel. The US passed its own Clean Air Act in 1963. And the gradual shift from domestic fires and stoves to gas or electric central heating must have made a difference over the next 40 years or so.
So the visibly polluted air that city-dwellers breathed in the 1950s was an obvious suspect for causing lung cancer. Other potential villains included the rising number of petrol-driven vehicles, dust given off by the new asphalt-coated roads, or even residual effects from having been gassed in the First World War, as men were more likely to develop the disease.
Some people suggested that improved diagnosis and increasing lifespans were at least part of the explanation: people were living longer, and developing the slow-growing disease instead of dying earlier from something else, or deaths previously thought to be caused by tuberculosis or other lung diseases were now being identified as cancer. A study in Germany in 1939 suggested that lung cancer patients included a smaller proportion of non-smokers than the general population did, but Nazi Germany was not the first place most researchers wanted to look for answers.
Today, this would be seen as a classic big data problem. Over a population of millions, we can identify from health records which people develop lung cancer. All we have to do is find the risk factor with the closest link to the disease. But which other information would you include in your search?
With hindsight, armed with your twenty-first-century knowledge that smoking has been tried and convicted, you might suggest medical notes: surely doctors asked whether patients smoked? Perhaps they did, but 80 per cent of middle-aged men in the UK smoked in the 1950s, including most of the patients who did not have lung cancer. Only about one in five of your patients was likely to answer ‘No’, so the answers alone wouldn’t make smoking stand out as a suspect.
How else could you find out? Today, supermarket store cards could help, if smokers bought cigarettes with their groceries. Though you wouldn’t necessarily know which member of the household was going to smoke them, or if they were bought for a housebound older relative, for example. Perhaps you could analyse the photographs people post on social media, using an algorithm to detect the presence of cigarettes. All these are now technically possible, if not always socially acceptable.
Without the hypothesis that smoking was to blame, however, you’d have to cast the net much wider.
To check for the effects of air pollution, you could look at addresses. Today, we have postcodes, which makes it much easier to put people into groups according to where they live. You could use GPS and other location-recording features on people’s mobile telephones to track where they spend their time. Are they walking or cycling along polluted roads? Do they spend eight hours a day working near, or in, a chemical plant? You’d need to get their permission to access these records, of course, but many people are already using their own apps to keep track of where they go, whether they walk or cycle, and so on.
Back in 1947, however, Austin Bradford Hill and Richard Doll had to use statistical techniques to test their hypotheses.
The method they used was a case control study, a way to compare one group of cases, people who do have the disease, with a control group who don’t have the disease. In Doll and Hill’s study, they chose both groups from London hospital patients. The 649 patients with lung cancer were compared with 649 patients admitted with other illnesses.
Not all case control studies manage to recruit equal numbers of cases and controls, but statisticians can still draw useful conclusions from studies with unequal numbers, using the odds ratio. By splitting the two groups according to their exposure to risk factors, researchers can calculate the odds of getting the disease for exposed and non-exposed groups.
Let’s see how that works with some made-up numbers. Suppose that 45 of our overall group say they’ve lived in Cornwall. Of our Cornish selection, 25 are in the lung-cancer (case) group and only 20 in the not-lung-cancer (control) group. So more of our imaginary Cornish patients have lung cancer than not.
Does this prove that living in Cornwall is a risk factor for lung cancer? Not necessarily. We have to compare the odds of getting lung cancer if you have been exposed to Cornwall (25:20, or 5:4) with the odds of getting lung cancer if you haven’t been exposed to Cornwall.9
Of the 1,253 people who have not lived in Cornwall, 624 are in the Lung Cancer group and 629 in the Control group, who didn’t get lung cancer. So the odds of getting the disease if you were not exposed to Cornwall are 624:629, roughly 35:36.
These odds seem high, don’t they? Odds of 35:36 are almost evens, let alone 5:4. Both are almost certainly much higher than you’d estimate.
This is because our sample is not a random sample of the population. We selected 649 lung cancer patients, and 649 people without lung cancer. So we started out with a 50:50 chance that a person picked at random from our group will have lung cancer: evens, in betting terms. This is much higher than in the real world, where even a smoker has about a 15 per cent risk of getting the disease,10 giving odds of 3:17.
In order to make a useful comparison, we need to compare the odds of being in the lung cancer group, if you come from Cornwall, with the odds of being in the lung cancer group if you don’t come from Cornwall. We find the odds ratio by dividing one by the other. If the odds were the same for both Cornish and non-Cornish, the odds ratio would be one.
We divide ‘odds of lung cancer, with Cornwall’ by ‘odds of lung cancer, without Cornwall’, giving us approximately 1.26, or one and a quarter. So your odds of getting the disease if you come from Cornwall are about a quarter higher than if you don’t. Must be all that radon gas.
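Here’s the same made-up Cornwall calculation as a few lines of Python, so you can see exactly where the 1.26 comes from.

```python
# The made-up Cornwall example from above, as a calculation.
cases    = {"cornwall": 25, "not_cornwall": 624}   # the lung-cancer group
controls = {"cornwall": 20, "not_cornwall": 629}   # the comparison group

odds_exposed     = cases["cornwall"] / controls["cornwall"]          # 25:20, i.e. 1.25
odds_not_exposed = cases["not_cornwall"] / controls["not_cornwall"]  # 624:629, about 0.99

odds_ratio = odds_exposed / odds_not_exposed
print(f"odds ratio for exposure to Cornwall: {odds_ratio:.2f}")      # about 1.26
```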
Now, I must remind you at this point that I made up these figures. I’d hate to be responsible for a dozen deaths on the A303 road as thousands of people leave Cornwall in panic. Remember, too, that saying your odds are a quarter higher is a bit meaningless if you don’t know your baseline risk of getting lung cancer.
Among 100,000 non-smokers, around 16 would die of lung cancer. Using my made-up numbers, moving them all to Cornwall would push that up to around 20 in 100,000, four extra deaths. It would also increase the population of Cornwall by almost one-fifth. And even the increased risk, 20 in 100,000, is only one in 5,000, which is still pretty long odds.
So I didn’t mean to cause alarm. I did it to show you how researchers can work backwards from groups of patients to estimate the risks to whole populations of exposure to certain things. In this case, Cornwall.11
This method is what Doll and Hill used to find that smoking cigarettes raised the risk of getting lung cancer. And the odds ratio they found was much more dramatic than our 5:4 ratio from my made-up figures about Cornwall.
Most of their patients smoked. Most British men did in 1947. So their exposed group of smokers was 1,269, much bigger than the non-exposed group of just 29 non-smokers. Of the smokers, 647 were in the cancer group; of the non-smokers, only two.
They calculated the odds ratio for smoking versus not smoking, and got an answer of 14.04. Smokers were 14 times as likely as non-smokers to develop the disease. You can see why both Doll and Hill gave up smoking at once.
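You can check their headline figure with the same arithmetic as the Cornwall sketch above, using the numbers just quoted.

```python
# Doll and Hill's figures: 1,269 smokers (647 with lung cancer)
# and 29 non-smokers (2 with lung cancer).
odds_smokers     = 647 / (1269 - 647)   # 647:622
odds_non_smokers = 2 / (29 - 2)         # 2:27

print(f"odds ratio, smokers vs non-smokers: {odds_smokers / odds_non_smokers:.2f}")
# prints 14.04
```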
Their later work showed an even stronger correlation between heavy smoking and lung cancer, with odds ratios between 20 and 50, compared to people who’d never smoked. They also found the link was dose-dependent: the more you smoke, the higher the risk.
It took some time for the link between smoking and lung cancer to be widely accepted. Prominent scientists, doctors and statisticians were sceptical, suggesting alternative explanations such as genetic predisposition to both lung cancer and smoking. The tobacco industry was clearly reluctant to publicise the fact their product might shorten your life.
Eventually, animal studies and improving knowledge of basic biology pointed to chemicals in cigarette smoke that suppress the body’s defences against cancer. But this was one case where correlation alone was enough to affect both public policy and individual behaviour, long before a causal mechanism was shown.
Lots of tiny dangers
Today’s public health studies, using big data techniques, can find much more subtle patterns. Combining huge sets of data and clever statistical techniques, researchers can find links between behaviour, environment and health outcomes, especially in countries such as the UK that have comprehensive medical records across most of the population.
Their task is much harder than that of Doll and Hill in some ways, as they’re working against a background of ever-lengthening lifespans and falling death rates from most diseases, even the big killers: cancer and cardiovascular disease.
So instead of one big effect for which to seek an explanation, they’re looking for relatively small factors that could contribute to speeding up improvement in the population’s health and longevity. Is sitting down the new smoking? Can taking statins reduce your chance of a heart attack? Should we do more about air pollution?
But beyond the technical difficulties, which big data promises to help overcome, there are wider problems for these public health researchers.
Having access to the medical records of millions of people, let alone linking them to other personal details such as where they’ve been, or their personal habits, risks intrusion on the privacy of individuals.
Finding a link does not always mean you have identified a cause: remember Farr and his meticulous study of how miasma spread cholera via stinking air?
And even if you can show a genuine health benefit from, for example, consuming less salt, individuals may decide that a very small reduction in their odds of a bad outcome, or a very small extension of their predicted lifespan, is not enough to justify giving up one of life’s pleasures.
For example, as I write in 2015, newspaper headlines are proclaiming:
‘Hot dogs, bacon and other processed meats cause cancer …’
‘Red Meat Cancer Risk To Be Revealed By World Health Organization’
‘Processed meats as bad a cancer threat as smoking.’
Turning to Cancer Research UK (CRUK) for clarification, I read that the studies being reported show: ‘those who ate the most processed meat had around a 17 per cent higher risk of developing bowel cancer, compared to those who ate the least.’
At this point, you may put down that half-eaten hot dog or bacon butty. But CRUK also points out that this is a relative risk. The odds of developing bowel cancer at some point in your life, across the whole UK population, are under 6 per cent. So increasing this risk by 17 per cent would raise it to … about 6.5 per cent.
If 100 people all changed from a low-sausage to a high-sausage diet for the whole of their lives, one of them would probably develop bowel cancer who might not otherwise have done. We’d never know which person, and they’re all hypothetical anyway, but you can see that one tasty meat product on one day is a small addition to your personal danger.
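If you want to see the difference between relative and absolute risk laid out, here it is in a few lines of Python. The baseline is my reading of CRUK’s ‘under 6 per cent’, pinned at 5.6 per cent for the sake of the arithmetic, so the exact decimals are illustrative.

```python
# Relative versus absolute risk, using the rough figures above.
baseline_risk   = 0.056   # lifetime risk of bowel cancer, taken as 'under 6 per cent'
relative_change = 0.17    # reported extra risk for the heaviest processed-meat eaters

raised_risk = baseline_risk * (1 + relative_change)
extra_cases_per_100 = (raised_risk - baseline_risk) * 100

print(f"lifetime risk rises from {baseline_risk:.2%} to about {raised_risk:.2%}")
print(f"that is roughly {extra_cases_per_100:.0f} extra case per 100 lifelong high-sausage eaters")
```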
Red meat seems less risky, though it’s still a suspect, especially if barbecued. Science can be cruel.
Across an entire population, if people gave up cooked breakfasts, salami and smoked ham as fast as smoking rates have fallen since 1950, bowel cancer rates would fall. Nowhere near as fast as lung cancer rates are falling, but enough to show in the data. CRUK estimates that, if nobody in the UK ate any processed or red meat, there would be 8,800 fewer cases of bowel cancer every year.12
However, for each individual, given that you are subject to all sorts of other risks in your life, and depending on how much you love bacon, it might not seem worth it. Big data is great at seeing the big picture, across a population. It’s not so good at helping an individual make decisions about how they want to live their life.
Notes
1 A proton at full speed completes 11,245 circuits of the 27km-long (17 mile-long) tunnel every second.
2 ‘Cross the beams’ is my phrase. A real CERN scientist is more likely to say ‘run the experiment’ or ‘induce proton packet collisions’ or something more technical. Though crossing the beams is exactly what they do.
3 Compact Muon Solenoid.
The solenoid is a coil of wire that produces a magnetic field when electricity flows through it. In this case, it has a 7m (23ft) diameter and produces a magnetic field about 100,000 times stronger than that of the Earth.
A muon is a type of subatomic particle, produced when the proton beams collide, which may be an indication that something important has happened. For example, the elusive Higgs boson can decay into four muons, so seeing four muons racing away from the scene of the crime is a strong suggestion that a Higgs boson was briefly there.
The CMS equipment is 15m (49ft) in diameter, 21m (69ft) long, and weighs 12,500 tonnes. So I have no idea in what sense it is compact. Though it is smaller than ATLAS, the experiment on the opposite side of the ring of tunnel.
4 We can’t say that because these category differences are physically recognisable in the brain, they must be innate. Our brains change, physically, in response to how we use them, so becoming a musician, or a man, or a psychopath, might have involved repeated actions that reshaped the brain.
5 Mine always started with, ‘allowing for experimental error …’ as the results were never the ones we should have got. That’s one reason I became a writer instead of a scientist.
6 Deoxyribonucleic acid, a chemical that comes in the form of two strands twisted together, the famous double helix that has become the universal symbol for genetics.
7 Anybody who works in genetics hates the term ‘identical twin’ because they’re not really identical. They prefer ‘monozygotic twin’. The word monozygotic means that both twins came from one fertilised egg, or zygote, and has a maximum Scrabble score of 261.
8 Though Craig Venter, who led one of the original research teams, has since sequenced his own genome in its entirety, and transmitted it into space. So one of the first things the SKA detects could be an army of cloned Craig Venters, invading from a more technologically advanced galaxy that has decoded the recipe book for Craig Venter and worked out how to grow human beings in a factory.
9 ‘Exposed to Cornwall’ isn’t a very flattering phrase for a rather beautiful part of south-west England, but it’s the kind of language medical statisticians use.
10 This is one reason a case-control study can be better for relatively rare diseases. The alternative is a cohort study, where we take a group of people before any disease has developed and watch them move forwards through time, like a marching cohort of soldiers. We could sort them according to whether or not they’ve lived in Cornwall, and see how many of each group go on to develop the disease.
The problem is that only a small proportion of either group are likely to develop lung cancer. So we would need a very large group of people to be confident that any differences were linked to the Cornwall/not Cornwall difference, and not just natural variation. From a random sample of 1,298 people, you would expect fewer than 100 cases of lung cancer over a lifetime, so it might be hard to see a clear difference in risk linked to Cornishness, even after waiting 70 years or so (most lung cancer is diagnosed in people over 65). If you include other factors, such as smoking, you need an even bigger cohort. Using big data techniques makes it much easier to do a cohort study, because you can find a larger sample of people whose past data was already collected, and then find out how they are doing now.
11 I did, however, base my numbers, very roughly, on the estimated risks of radon gas.
12 Compared to an estimated 64,500 fewer cases of lung cancer if nobody in the UK smoked.