Chapter 5

Big data and medicine

Big data analysis is significantly changing the world of healthcare. Its potential has yet to be fully realized, but it already extends to medical diagnosis, epidemic prediction, gauging the public response to government health warnings, and reducing the costs of healthcare systems. Let’s start by looking at what is now termed healthcare informatics.

Healthcare informatics

Medical big data is collected, stored, and analysed using the general techniques described in previous chapters. Broadly speaking, healthcare informatics and its many sub-disciplines, such as clinical informatics and bioinformatics, use big data to provide improved patient care and reduce costs. Consider the defining criteria for big data (discussed in Chapter 2)—volume, variety, velocity, and veracity—and how they apply to medical data. Volume and velocity are satisfied, for example, when public-health-related data is collected through social networking sites for epidemic tracking. Variety is satisfied since patient records are stored in text format, both structured and unstructured, and sensor data such as that provided by MRIs is also collected. Veracity is fundamental to medical applications, and considerable care is taken to eliminate inaccurate data.

Social media is a potentially valuable source of medically related information through data collection from sites such as Facebook, Twitter, various blogs, message boards, and Internet searches. Message boards focused on specific healthcare issues are abundant, providing a wealth of unstructured data. Posts on both Facebook and Twitter have been mined, using classification techniques similar to those described in Chapter 4, to monitor experiences of adverse reactions to medications and supply healthcare professionals with worthwhile information regarding drug interactions and drug abuse. Mining social media data for public-health-related research is now a recognized practice within the academic community.

Designated social networking sites for medical professionals, such as Sermo Intelligence, a worldwide medical network and self-proclaimed ‘largest global healthcare data collection company’, provide healthcare personnel with instant crowdsourcing benefits from interaction with their peers. Online medical advice sites are becoming increasingly popular and generate yet more information. But, although not publicly accessible, perhaps the most important source is the vast collection of Electronic Health Records. These records, usually referred to simply by their initials, EHR, provide an electronic version of a patient’s full medical history, including diagnoses, medications prescribed, medical images such as X-rays, and all other relevant information collected over time, thus constructing a ‘virtual patient’—a concept we will look at later in this chapter. As well as helping to improve patient care and cut costs, pooling the information generated from a variety of online sources makes it possible to think in terms of predicting the course of emerging epidemics.

Google Flu Trends

Every year, like many other countries, the US experiences an influenza (or flu) epidemic that stretches medical resources and causes considerable loss of life. Data from past epidemics supplied by the US Centers for Disease Control and Prevention (CDC), the public health monitoring agency, together with big data analytics, provides the driving force behind researchers’ efforts to predict the spread of the illness in order to focus services and reduce its impact.

The Google Flu Trends team started working on predicting flu epidemics using search engine data. They were interested in whether the course of the annual flu epidemic could be predicted faster than the CDC could process its own data. In a letter published in the prestigious scientific journal Nature in February 2009, the team of six Google software engineers explained what they were doing. If search data could be used to predict the course of the annual US flu epidemic accurately, then the illness could be contained earlier, saving lives and medical resources. The Google team explored the idea that this could be achieved by collecting and analysing search engine queries relevant to concerns about the flu. Previous attempts to use online data to predict the spread of the flu had either failed or met with limited success. By learning from the mistakes made in this pioneering research, Google and the CDC hoped to be successful in using the big data generated by search engine queries to improve epidemic tracking.

The CDC and its European counterpart, the European Influenza Surveillance Scheme (EISS), collect data from various sources, including physicians, who report the number of patients they see with flu-like symptoms. By the time this data has been collated it is typically about two weeks old, and the epidemic has progressed further in the meantime. Using data collected in real time from the Internet, the Google/CDC team aimed to improve the accuracy of epidemic predictions and to deliver results within a single day. To do this, data was collected on flu-related search queries, ranging from individual Internet searches on flu remedies and symptoms to mass data such as phone calls made to medical advice centres. Google was able to tap into the vast amount of search query data it had accumulated between 2003 and 2008, and by using IP addresses it was able to identify the geographic location where each search query had been generated and thus group the data by state. The CDC collects its data from ten regions, each containing the cumulative data from a group of states (e.g. Region 9 includes Arizona, California, Hawaii, and Nevada), and this was integrated into the model.

The Google Flu Trends project hinged on the known result that there is a high correlation between the number of flu-related online searches and visits to the doctor’s surgery. If a lot of people in a particular area are searching for flu-related information online, it might then be possible to predict the spread of flu cases to adjoining areas. Since the interest is in finding trends, the data can be anonymized and hence no consent from individuals is required. Using their five-year accumulation of data, which they limited to the same time-frame as the CDC data and so collected only during the flu season, Google counted the weekly occurrence of each of the fifty million most common search queries covering all subjects. These search query counts were then compared with the CDC flu data, and those with the highest correlation were used in the flu trends model. Google chose to use the top forty-five flu-related search terms and subsequently tracked these in the search queries people were making. The complete list of search terms is secret but includes, for example, ‘influenza complication’, ‘cold/flu remedy’, and ‘general influenza symptoms’. The historical data provided a baseline of flu activity for the chosen search terms; by comparing new real-time data against this baseline, flu activity was classified on a scale from 1 to 5, where 5 signified the most severe.
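
The sketch below illustrates the general correlate-and-select idea just described; it is not Google’s actual pipeline, and all the data (weekly CDC flu activity and search term counts) is randomly generated for illustration.

```python
# Correlate weekly counts of many candidate search terms with reported flu
# activity and keep the most highly correlated terms as predictors.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = 260                                   # roughly five seasons of weekly data
cdc_ili = np.clip(rng.normal(2.5, 1.0, weeks), 0, None)   # invented flu activity series

# Simulate counts for a vocabulary of candidate terms; a few genuinely track
# flu activity, the rest are noise.
terms = {f"term_{i}": rng.poisson(100, weeks).astype(float) for i in range(1000)}
for name in ("flu symptoms", "cold/flu remedy", "influenza complication"):
    terms[name] = 500 * cdc_ili + rng.normal(0, 80, weeks)

counts = pd.DataFrame(terms)
correlations = counts.corrwith(pd.Series(cdc_ili)).sort_values(ascending=False)

top_terms = correlations.head(45).index       # keep the 45 best-correlated terms
print(correlations.head(5))
```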

Used in the 2011–12 and 2012–13 US flu seasons, Google’s big data algorithm famously failed to deliver. After each flu season ended, its predictions were checked against the CDC’s actual data. Although the model was intended to be a good representation of flu trends based on the data available, the Google Flu Trends algorithm over-predicted the number of flu cases by at least 50 per cent during the years it was used. There were several reasons why the model did not work well. Some search terms were intentionally excluded because they did not fit the expectations of the research team. The much-publicized example is high-school basketball: seemingly unrelated to the flu, it was nevertheless highly correlated with the CDC data, yet it was excluded from the model. Variable selection, the process by which the most appropriate predictors are chosen, always presents a challenging problem and so is usually done algorithmically to avoid bias. Google kept the details of their algorithm confidential, noting only that high-school basketball came in the top 100 search terms and justifying its exclusion by pointing out that the flu and basketball both peak at the same time of year.

As we have noted, in constructing their model Google used forty-five search terms as predictors of the flu. Had they used only one, for example ‘influenza’ or ‘flu’, important and relevant information such as all the searches on ‘cold remedy’ would have gone unnoticed and unreported. Accuracy in prediction is improved by having a sufficient number of search terms, but it can also decrease if there are too many. Current data is used as training data to construct a model that predicts future trends; when there are too many predictors, random quirks in the training data end up being modelled, so that although the model fits the training data very well, it does not predict well. This seemingly paradoxical phenomenon, called ‘over-fitting’, was not taken into account sufficiently by the team. Omitting high-school basketball as simply being coincidental to the flu season made sense, but there were fifty million distinct search terms, and with such a big number it is almost inevitable that others would correlate strongly with the CDC data while not being relevant to flu trends.
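
A toy illustration of over-fitting, assuming nothing about Google’s actual model: when a linear model is given many candidate predictors but only a modest amount of training data, it fits the training set very well yet predicts new data noticeably worse. All figures here are synthetic.

```python
# Many predictors, few observations: training fit looks impressive,
# performance on unseen data does not.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_train, n_test, n_predictors = 60, 60, 50

X_train = rng.normal(size=(n_train, n_predictors))
X_test = rng.normal(size=(n_test, n_predictors))
# The outcome depends on only the first two predictors; the rest are pure noise.
y_train = X_train[:, 0] + X_train[:, 1] + rng.normal(0, 1.0, n_train)
y_test = X_test[:, 0] + X_test[:, 1] + rng.normal(0, 1.0, n_test)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on training data:", round(model.score(X_train, y_train), 2))  # fits well
print("R^2 on unseen data:  ", round(model.score(X_test, y_test), 2))    # fits less well
```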

Visits to the doctor with flu-like symptoms often resulted in a diagnosis that was not the flu (e.g. the common cold). The data Google used, collected selectively from its own search engine queries, was also subject to obvious bias, for example by excluding everyone who does not use a computer and everyone who uses other search engines, so the results were not scientifically sound. Another issue that may have led to poor results was that people searching Google on ‘flu symptoms’ would probably have explored a number of flu-related websites, resulting in their being counted several times and thus inflating the numbers. In addition, search behaviour changes over time, especially during an epidemic, and this should be taken into account by updating the model regularly. Once errors in prediction start to occur, they tend to cascade, which is what happened with the Google Flu Trends predictions: one week’s errors were passed along to the next week. Finally, search queries were counted exactly as they had occurred and were not grouped according to spelling or phrasing. Google’s own example was that ‘indications of flu’, ‘flu indications’, and ‘indications of the flu’ were each counted separately.
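
A minimal sketch of the grouping the text says was missing: treating differently phrased queries as a single term by normalizing word order and dropping filler words. The normalization rule here is an assumption for illustration, not Google’s.

```python
# Group query variants such as 'indications of flu' and 'flu indications'.
from collections import Counter

STOP_WORDS = {"of", "the", "a"}

def normalize(query: str) -> str:
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    return " ".join(sorted(words))

queries = ["indications of flu", "flu indications", "indications of the flu"]
print(Counter(normalize(q) for q in queries))   # Counter({'flu indications': 3})
```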

The work, which dates back to 2007–8, has been much criticized, sometimes unfairly, but the criticism has usually related to lack of transparency, for example the refusal to reveal all the chosen search terms and an unwillingness to respond to requests from the academic community for information. Search engine query data is not the product of a designed statistical experiment, and finding a way to meaningfully analyse such data and extract useful knowledge is a new and challenging field that would benefit from collaboration. For the 2012–13 flu season, Google made significant changes to its algorithms and started to use a relatively new statistical technique called the elastic net, which provides a rigorous means of selecting and reducing the number of predictors required. In 2011, Google launched a similar programme for tracking dengue fever, but they are no longer publishing predictions and, in 2015, Google Flu Trends was withdrawn. They are, however, now sharing their data with academic researchers.
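
The sketch below shows how an elastic net can prune a large set of candidate predictors; it uses synthetic data rather than Google’s, and the point is only that the combined penalty can shrink the coefficients of uninformative terms to exactly zero, leaving a smaller model.

```python
# Elastic net regression as a variable-selection device on synthetic data.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
n_weeks, n_terms = 200, 300
X = rng.normal(size=(n_weeks, n_terms))        # weekly counts of candidate search terms
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, n_weeks)   # activity driven by two terms

model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
kept = np.flatnonzero(model.coef_ != 0)        # predictors with non-zero coefficients
print(f"{len(kept)} of {n_terms} candidate predictors retained")
```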

Google Flu Trends, one of the earliest attempts at using big data for epidemic prediction, provided useful insights to the researchers who came after it. Even though the results did not live up to expectations, it seems likely that better techniques will be developed in the future and the full potential of big data in tracking epidemics realized. One such attempt was made by a group of scientists from the Los Alamos National Laboratory in the USA, using data from Wikipedia. The Delphi Research Group at Carnegie Mellon University was named the most accurate forecaster in the CDC’s ‘Predict the Flu’ challenge in both 2014–15 and 2015–16, successfully using data from Google, Twitter, and Wikipedia to monitor flu outbreaks.

The West Africa Ebola outbreak

The world has experienced many pandemics in the past; the Spanish flu of 1918–19 killed somewhere between twenty million and fifty million people and infected about 500 million in total. Very little was known about the virus, there was no effective treatment, and the public health response was limited—necessarily so, due to lack of knowledge. This changed in 1948 with the inauguration of the World Health Organization (WHO), charged with monitoring and improving global health through worldwide cooperation and collaboration. On 8 August 2014, at a teleconference meeting of the International Health Regulations Emergency Committee, the WHO announced that an outbreak of the Ebola virus in West Africa formally constituted a ‘public health emergency of international concern’ (PHEIC). In the WHO’s terms, the outbreak was an ‘extraordinary event’ requiring an international effort of unprecedented proportions in order to contain it and thus avert a pandemic.

The West Africa Ebola outbreak in 2014, primarily confined to Guinea, Sierra Leone, and Liberia, presented a different set of problems from the annual US flu outbreak. Historical data on Ebola was either unavailable or of little use, since an outbreak of these proportions had never been recorded, and so new strategies for dealing with it needed to be developed. Given that knowledge of population movements helps public health professionals monitor the spread of epidemics, it was believed that the information held by mobile phone companies could be used to track travel in the infected areas, and that measures such as travel restrictions could then be put in place to contain the virus, ultimately saving lives. The resulting real-time model of the outbreak would predict where the next cases of the disease were most likely to occur, and resources could be focused accordingly.

The digital information that can be garnered from mobile phones is fairly basic: the phone numbers of the caller and the person being called, and an approximate location of the caller, since each call generates a record of the mast (or tower) that handled it, providing a trail from which the caller’s location can be estimated. Getting access to this data posed a number of problems: privacy was a genuine concern, as individuals who had not given consent for their calls to be tracked could be identified.

In the West African countries affected by Ebola, mobile phone density was not uniform, with the lowest percentages occurring in poor rural areas. For example, in 2013 just over half the households in Liberia and Sierra Leone, two of the countries directly affected by the 2014 outbreak, had a mobile phone; even so, these phones could provide sufficient data to usefully track population movement.

Some historic mobile phone data was given to the Flowminder Foundation, a non-profit organization based in Sweden dedicated to working with big data on public health issues that affect the world’s poorer countries. In 2008, Flowminder had been the first to use mobile operator data to track population movements in a medically challenging environment, as part of an initiative by the WHO to eradicate malaria, so they were an obvious choice to work on the Ebola crisis. A distinguished international team used anonymized historic data to construct maps of population movements in the areas affected by Ebola. This historic data was of limited use since behaviour changes during epidemics, but it did give strong indications of where people would tend to travel in an emergency. Mobile phone mast activity records, in contrast, provide real-time population activity details.
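
The following is a minimal sketch, with invented records and place names, of how anonymized call detail records (a hashed caller identifier, a time period, and the mast region that handled each call) can be aggregated into a simple origin-destination count of population movement between two periods.

```python
# Aggregate anonymized CDRs into rough movement flows between regions.
import pandas as pd

cdrs = pd.DataFrame({
    "caller":      ["a1", "a1", "b2", "b2", "c3", "c3"],   # hashed subscriber ids
    "week":        [1, 2, 1, 2, 1, 2],
    "mast_region": ["Kenema", "Freetown", "Kenema", "Kenema", "Bo", "Freetown"],
})

# Take each caller's most-used mast region per week as their approximate location.
location = (cdrs.groupby(["caller", "week"])["mast_region"]
                .agg(lambda regions: regions.mode().iloc[0])
                .unstack("week"))
location.columns = ["week1_region", "week2_region"]

# Count callers by (week 1 region, week 2 region) to estimate movement flows.
flows = location.groupby(["week1_region", "week2_region"]).size().rename("callers")
print(flows)
```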

However, the Ebola prediction figures published by the WHO turned out to be over 50 per cent higher than the number of cases actually recorded.

The problems with both the Google Flu Trends and Ebola analyses were similar in that the prediction algorithms used were based only on initial data and did not take into account changing conditions. Essentially, each of these models assumed that the number of cases would continue to grow at the same rate as it had before medical intervention began. Clearly, medical and public health measures could be expected to have positive effects, but these had not been integrated into the models.
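
A toy numerical sketch of this point, using entirely invented figures: extrapolating the growth rate observed early in an outbreak ignores the effect of later interventions and so over-predicts.

```python
# Compare a naive constant-growth forecast with a series in which
# interventions slow the growth rate part-way through.
weekly_growth = 1.3          # assumed growth factor observed before intervention
observed_start = 100         # cases in week 0 (invented)

naive_forecast = [round(observed_start * weekly_growth ** w) for w in range(8)]

actual = [observed_start]
for w in range(1, 8):
    factor = weekly_growth if w < 4 else 1.15   # interventions bite from week 4
    actual.append(round(actual[-1] * factor))

print("naive forecast:    ", naive_forecast)
print("with intervention: ", actual)
```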

The Zika virus, transmitted by Aedes mosquitoes, was first recorded in Uganda in 1947 and has since spread as far afield as Asia and the Americas. The Zika outbreak identified in Brazil in 2015 resulted in another PHEIC. Lessons regarding statistical modelling with big data have been learned from the work of Google Flu Trends and from the Ebola outbreak, and it is now generally acknowledged that data should be collected from multiple sources; recall that the Google Flu Trends project collected data only from its own search engine.

The Nepal earthquake

So what is the future for epidemic tracking using big data? The real-time characteristics of mobile phone call detail records (CDRs) have been used to help monitor population movements in disasters as varied as the Nepal earthquake and the swine-flu outbreak in Mexico. For example, following the Nepal earthquake of 25 April 2015, an international Flowminder team, with scientists from the Universities of Southampton and Oxford as well as institutions in the US and China, used CDRs to provide estimates of population movements. A high percentage of the Nepali population has a mobile phone, and by using the anonymized data of twelve million subscribers the Flowminder team was able to track population movements within nine days of the earthquake. This quick response was due in part to an agreement already in place with the main service provider in Nepal, the technical details of which had been completed only a week before the disaster. Having a dedicated server with a 20 TB hard drive in the provider’s data centre enabled the team to start work immediately and to make information available to disaster relief organizations within those nine days.

Big data and smart medicine

Every time a patient visits a doctor’s office or hospital, electronic data is routinely collected. Electronic health records constitute legal documentation of a patient’s healthcare contacts: details such as patient history, medications prescribed, and test results are recorded. Electronic health records may also include sensor data such as Magnetic Resonance Imaging (MRI) scans. The data may be anonymized and pooled for research purposes. It has been estimated that in 2015 an average hospital in the USA stored over 600 TB of data, most of it unstructured. How can this data be mined to give information that will improve patient care and cut costs? In short, we take the data, both structured and unstructured, identify features relevant to a patient or patients, and use statistical techniques such as classification and regression to model outcomes. Patient notes are primarily in the format of unstructured text, and to analyse these effectively requires natural language processing techniques such as those used by IBM’s Watson, which is discussed in the next section.
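
A hedged sketch of the general recipe just described, using invented patient records and an invented risk rule: extract structured features and fit a classifier to predict an outcome such as hospital readmission.

```python
# Fit a simple classifier to synthetic structured EHR features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n_patients = 500
age = rng.integers(20, 90, n_patients)
prior_admissions = rng.poisson(1.5, n_patients)
blood_pressure = rng.normal(130, 15, n_patients)

# Invented rule for illustration: readmission risk rises with age and prior admissions.
risk = 0.03 * age + 0.4 * prior_admissions - 2.5 + rng.normal(0, 1, n_patients)
readmitted = (risk > 0).astype(int)

X = np.column_stack([age, prior_admissions, blood_pressure])
X_train, X_test, y_train, y_test = train_test_split(X, readmitted, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("held-out accuracy:", round(model.score(X_test, y_test), 2))
```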

According to IBM, by 2020 medical data is expected to double every seventy-three days. Wearable devices, increasingly used for monitoring healthy individuals, count the number of steps we take each day, measure and balance our calorie requirements, track our sleep patterns, and give immediate readings of our heart rate and blood pressure. The information gleaned can then be uploaded onto our PCs and kept privately or, as is sometimes the case, shared voluntarily with employers. This veritable cascade of data on individuals will provide healthcare professionals with valuable public health data, as well as a means of recognizing changes in individuals that might, for example, help avoid a heart attack. Data on populations will enable physicians to track, for example, the side-effects of a particular medication in relation to patient characteristics.

Following the completion of the Human Genome Project in 2003, genetic data will increasingly become an important part of our individual medical records, as well as providing a wealth of research data. The aim of the Human Genome Project was to map all the genes of humans; collectively, the genetic information of an organism is called its genome. The human genome contains about 20,000 genes, and mapping such a genome requires about 100 GB of data. This is, of course, a highly complex, specialized, and multi-faceted area of genetic research, but the implications of applying big data analytics are of interest. The genetic information collected is kept in large databases, and there has been concern recently that these might be hacked and patients who contributed DNA identified. It has been suggested that, for security purposes, false information should be added to such databases, though not enough to render them useless for medical research. The interdisciplinary field of bioinformatics has flourished as a consequence of the need to manage and analyse the big data generated by genomics. Gene sequencing has become increasingly rapid and much cheaper in recent years, so that mapping individual genomes is now practical. Taking into account the cost of fifteen years of research, the first human genome sequencing cost nearly US$3 billion; many companies now offer genome sequencing services to individuals at an affordable price.

Growing out of the Human Genome Project, the Virtual Physiological Human (VPH) project aims to build computer representations that will allow clinicians to simulate medical treatments and find the best one for a given patient, drawing on data from a vast bank of actual patients. By comparing patients with similar symptoms and other medically relevant details, the computer model can predict the likely outcome of a treatment for an individual. Data mining techniques can also be merged with the computer simulations to personalize medical treatment, so that, for example, the results of an MRI might be integrated into a simulation. The digital patient of the future would contain all the information about a real patient, updated according to smart device data. However, as is increasingly the case, data security is a significant challenge faced by the project.

Watson in medicine

In 2007, IBM decided to build a computer to challenge the top competitors in the US television game show Jeopardy. Watson, a big data analytics system named after the founder of IBM, Thomas J. Watson, was pitted against two Jeopardy champions: Ken Jennings, who had a winning streak of seventy-four appearances; and Brad Rutter, who had won a staggering total of US$3.25 million. Jeopardy is a quiz show in which the host gives an ‘answer’ and the contestant has to guess the ‘question’. There are three contestants, and the answers or clues come in several categories such as science, sport, and world history, together with less standard, curious categories such as ‘before and after’. For example, given the clue ‘His tombstone in a Hampshire churchyard reads “knight, patriot, physician and man of letters; 22 May 1859–7 July 1930”’, the answer is ‘Who is Sir Arthur Conan Doyle?’. In the less obvious category ‘catch these men’, given the clue ‘Wanted for 19 murders, this Bostonian went on the run in 1995 and was finally nabbed in Santa Monica in 2011’, the answer is ‘Who was Whitey Bulger?’. Clues were delivered to Watson as text; clues involving audio or visual cues were omitted from the competition.

Natural language processing (NLP), as it is known in artificial intelligence (AI), represents a huge challenge to computer science and was crucial to the development of Watson. Information also has to be accessible and retrievable, which is a problem for machine learning. The research team started out by analysing Jeopardy clues according to their lexical answer type (LAT), which classifies the kind of answer specified in the clue. For the second of the examples above, the LAT is ‘this Bostonian’; for the first there is no LAT, since the pronoun ‘His’ does not indicate what kind of answer is required. Analysing 20,000 clues, the IBM team found 2,500 unique LATs, but these covered only about half the clues. Next, a clue is parsed to identify key words and the relationships between them. Relevant documents are then retrieved and searched from the computer’s structured and unstructured data. Hypotheses are generated based on these initial analyses, and by looking for deeper evidence potential answers are found.
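
The snippet below is a much-simplified sketch of the first step described above: pulling out a lexical answer type when the clue signals it with ‘this ...’. The rule is an assumption for illustration only; DeepQA’s clue analysis was far more sophisticated.

```python
# Extract a crude lexical answer type (LAT) from a Jeopardy-style clue.
import re

def lexical_answer_type(clue: str):
    match = re.search(r"\bthis ([a-z]+)\b", clue.lower())
    return match.group(1) if match else None

clue = ("Wanted for 19 murders, this Bostonian went on the run in 1995 "
        "and was finally nabbed in Santa Monica in 2011")
print(lexical_answer_type(clue))                                   # 'bostonian'
print(lexical_answer_type("His tombstone in a Hampshire churchyard reads ..."))  # None
```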

Winning at Jeopardy demanded fast, advanced natural language processing techniques, machine learning, and statistical analysis. Among the other factors to consider were accuracy and choice of category. A baseline for acceptable performance was computed using data from previous winners. After several attempts, the solution was deep question and answer analysis, or ‘DeepQA’, an amalgamation of many AI techniques. The system uses a large bank of computers, working in parallel but not connected to the Internet, and is based on probability and the evidence of experts. As well as generating an answer, Watson uses confidence-scoring algorithms to enable the best result to be found. Only when the confidence threshold is reached does Watson indicate that it is ready to give an answer, the equivalent of a human contestant hitting their buzzer. Watson beat the two Jeopardy champions. Jennings, generous in defeat, is quoted as saying, ‘I, for one, welcome our new computer overlords’.
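
A toy sketch of the confidence-threshold idea just described, not IBM’s DeepQA code: score candidate answers, normalize the scores into confidences, and ‘buzz in’ only if the best candidate clears a threshold. The candidate scores and threshold here are invented.

```python
# Decide whether to answer based on the confidence of the best candidate.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

candidates = {"Whitey Bulger": 4.1, "James Riddle": 1.2, "Boston Strangler": 0.7}
confidences = dict(zip(candidates, softmax(list(candidates.values()))))

best = max(confidences, key=confidences.get)
THRESHOLD = 0.8          # an assumed value; Watson's thresholds were learned
if confidences[best] >= THRESHOLD:
    print(f"Buzz in with: Who is {best}? (confidence {confidences[best]:.2f})")
else:
    print("Stay silent; confidence too low.")
```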

The Watson medical system, based on the original Jeopardy Watson, retrieves and analyses both structured and unstructured data. Since it builds its own knowledge base, it is essentially a system that appears to model human thought processes in a particular domain. Its medical diagnoses are evidence-based, consistent, drawn from all the medical knowledge available to it, and accurate to the extent that the input is accurate and contains all the relevant information. Human doctors have experience but are fallible, and some are better diagnosticians than others. The process is similar to that of the Jeopardy Watson: it takes into account all the relevant information and returns diagnoses, each with a confidence rating. Watson’s built-in AI techniques enable it to process big data, including the vast amounts generated by medical imaging.

The Watson supercomputer is now a multi-application system and a huge commercial success. Watson has also been engaged in humanitarian efforts, for example through a specially developed, openly available analytics system to assist in tracking the spread of Ebola in Sierra Leone.

Medical big data privacy

Big data evidently has potential to predict the spread of disease and to personalize medicine, but what of the other side of the coin—the privacy of the individual’s medical data? Particularly with the growing use of wearable devices and smartphone apps, questions arise as to who owns the data, where it is being stored, who can access and use it, and how secure it is from cyber-attacks. Ethical and legal issues are abundant but not addressed here.

Data from a fitness tracker may become available to an employer and be used favourably, for example to offer bonuses to those who meet certain metrics, or unfavourably, to identify those who fail to reach the required standards, perhaps leading to an unwanted redundancy offer. In September 2016, a collaborative research team of scientists from the Technische Universität Darmstadt in Germany and the University of Padua in Italy published the results of their study of fitness tracker data security. Alarmingly, of the seventeen fitness trackers tested, each from a different manufacturer, none was sufficiently secure to prevent changes being made to its data, and only four took any measures at all to preserve data veracity, all of which were bypassed by the team’s efforts.

In September 2016, following the Rio Olympic Games, from which most Russian athletes had been banned after substantiated reports of a state-run doping programme, the medical records of top athletes, including the Williams sisters, Simone Biles, and Chris Froome, were hacked and publicly disclosed by a group of Russian cyber-hackers on the website FancyBears.net. These medical records, held by the World Anti-Doping Agency (WADA) on its data management system ADAMS, revealed only therapeutic use exemptions and therefore no wrongdoing by the targeted athletes. It is likely that the initial ADAMS hack was the result of spear-phishing of email accounts. In this technique, an email that appears to come from a senior, trusted source within an organization, such as a healthcare provider, is sent to a more junior member of the same organization and used to acquire sensitive information such as passwords and account numbers illegally, through downloaded malware.

Protecting big data medical databases against cyber-attacks, and hence ensuring patient privacy, is a growing concern. Anonymized personal medical data can be sold legally, but even so it is sometimes possible to identify individual patients. In a valuable exercise highlighting the vulnerability of supposedly secure data, Harvard Data Privacy Lab scientists Latanya Sweeney and Ji Su Yoo, using legally available encrypted (i.e. scrambled so that it cannot easily be read; see Chapter 7) medical data originating in South Korea, were able to decrypt unique identifiers within the records and identify individual patients by cross-checking with public records.
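
The following is an illustration of the general linkage idea behind such re-identification studies, not the Harvard team’s actual method: an ‘anonymized’ medical table can sometimes be joined to public records on shared quasi-identifiers such as birth date, postcode, and sex. All records here are invented.

```python
# Re-identify 'anonymized' medical records by joining on quasi-identifiers.
import pandas as pd

medical = pd.DataFrame({
    "birth_date": ["1980-02-11", "1975-07-30"],
    "postcode":   ["12345", "67890"],
    "sex":        ["F", "M"],
    "diagnosis":  ["asthma", "diabetes"],
})
public_register = pd.DataFrame({
    "name":       ["J. Kim", "S. Park"],
    "birth_date": ["1980-02-11", "1975-07-30"],
    "postcode":   ["12345", "67890"],
    "sex":        ["F", "M"],
})

reidentified = medical.merge(public_register, on=["birth_date", "postcode", "sex"])
print(reidentified[["name", "diagnosis"]])
```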

Medical records are extremely valuable to cyber-criminals. In 2015, the health insurer Anthem declared that its databases had been hacked, with over seventy million people affected. Data critical to individual identification, such as names, addresses, and social security numbers, was breached by Deep Panda, a Chinese hacking group, which used a stolen password to access the system and install Trojan-horse malware. Critically, the social security numbers, a unique identifier in the USA, were not encrypted, leaving wide open the possibility of identity theft. Many security breaches start with human error: people are busy and do not notice subtle changes in a Uniform Resource Locator (URL); devices such as flash drives are lost, stolen, or even on occasion deliberately planted, with malware installed the moment an unsuspecting employee plugs the device into a USB port. Both discontented employees and genuine employee mistakes also account for countless data leaks.

New big data initiatives in the management of healthcare are being launched at an increasing rate by world-renowned institutions such as the Mayo Clinic and Johns Hopkins Medicine in the USA, the UK’s National Health Service (NHS), and Clermont-Ferrand University Hospital in France. Cloud-based systems give authorized users access to data anywhere in the world; to take just one example, the NHS plans to make patient records available via smartphone by 2018. These developments will inevitably attract more attacks on the data they employ, and considerable effort will need to be expended on developing effective security methods to ensure the safety of that data.