aggregate statistics Statistics calculated across a set or group of data points, e.g. weekly sales by item.
anonymization Removing any information from a data set that could be used to identify or locate individuals, including names and addresses. True anonymization is difficult to achieve, as many variables such as location may allow individuals to be identified.
AI (artificial intelligence) Often used interchangeably with ‘machine learning’. The process of programming a computer to find patterns or anomalies in large data sets, or to find the mathematical relationship between some input variables and an output. AI algorithms have applications in a range of fields including healthcare, self-driving cars and image recognition.
biometric airport security The use of biometric information, such as facial measurements or fingerprints, in airport security.
Brexit The exit of the United Kingdom from the European Union.
census A regular, systematic survey of members of a population, usually conducted by a government. Data collected during a census may include household size and income, and may be used to plan housing, healthcare and social services.
continuous health data Health data collected at regular, short intervals from individuals, which could include heart rate, activity or blood pressure. Advances in wearable technologies such as activity monitors make continuous health monitoring feasible.
differential privacy Method for sharing summary statistics about a group of people, while protecting the anonymity of individuals in the group.
geospatial data Involves a geographic component, which could include latitude and longitude or a country code.
Go Two-player strategy game, where the aim is to capture the most territory. Google’s DeepMind has developed several algorithms designed to compete against humans.
Jeopardy! Televised American game show. Contestants are given answers, and must provide the correct questions.
machine learning Finding a mathematical relationship between input variables and an output. This ‘learned’ relationship can then be used to output predictions, forecasts or classifications given an input. For example, a machine learning model may be used to predict a patient’s risk of developing diabetes given their weight. This would be done by fitting a function to a ‘training’ set of thousands of historic data points, where each point represents a single patient’s weight and whether they developed diabetes. When a new, previously unseen patient’s weight is run through the model, this ‘learned’ function will be used to predict whether they will develop diabetes. Modern computer hardware has enabled the development of powerful machine learning algorithms.
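The diabetes example above can be sketched in a few lines of code. The following is a purely illustrative toy (the patient data is invented, and a simple logistic model fitted by gradient descent stands in for whatever model a real system would use):

```python
import math

# Hypothetical 'training' set of (weight_kg, developed_diabetes) pairs.
training = [(62, 0), (70, 0), (75, 0), (82, 0), (78, 0),
            (88, 1), (92, 1), (95, 1), (101, 1), (110, 1)]

mean_w = sum(x for x, _ in training) / len(training)  # centre the feature

def predict(w, b, weight_kg):
    """The 'learned' function: estimated probability of developing diabetes."""
    z = w * (weight_kg - mean_w) + b
    return 1 / (1 + math.exp(-z))

# 'Fit a function to the training set': logistic regression via gradient descent.
w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    gw = gb = 0.0
    for x, y in training:
        err = predict(w, b, x) - y   # how wrong the model is on this patient
        gw += err * (x - mean_w)
        gb += err
    w -= lr * gw / len(training)
    b -= lr * gb / len(training)

# Run a new, previously unseen patient through the model.
risk = predict(w, b, 105)
print(f"Predicted diabetes risk at 105 kg: {risk:.2f}")
```

The same two-step shape – fit a function to historic data, then apply it to new inputs – underlies the predictions, forecasts and classifications described in the entry.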
microtargeting Strategy used during political or advertising campaigns in which personalized messaging is delivered to different subsets of customers or voters based on information that has been mined or collected about their views, preferences or behaviours.
profile (voter) Information about an individual voter which may include age, address and party affiliation.
randomized experiments Experimental design in which participants or subjects are randomly allocated to treatment groups. In a randomized drug trial, for example, participants could be allocated at random to receive either a placebo or the drug under study.
sensitive information/data Reveals personal details, such as ethnicity, religious and political beliefs, sexual orientation, trade union membership or health-related data.
sniffers Software that intercepts and analyses the data being sent across a network, to or from a phone, computer or other electronic device.
Yellow Vests movement Protest movement originating in France, focused on issues such as rising fuel prices and the cost of living.
the 30-second data
Data surveillance is all around us, and it continues to grow more sophisticated and all-encompassing. From biometric airport security to grocery shopping, online activity and smartphone usage, we are constantly being surveilled, our actions and choices documented in spreadsheets. Geospatial surveillance data allows marketers to send you tailored ads based upon your physical, real-time location. Not only that, marketers can also use your past location behaviour to predict precisely what kind of ads to send you, sometimes without your permission or knowledge. Data surveillance in itself is neutral; it's the actions taken from analysis of the data that can be both harmful and helpful. Using data surveillance, private and public entities are investigating methods of influencing or 'nudging' individuals to do the 'right' thing, and penalizing us for doing the 'wrong' thing. A health insurance company could raise or lower rates based upon the daily steps a fitness tracker records; a car insurance company could do the same based upon data from a smart car. Data surveillance is not only about the present and the analysis of past actions; it's also about predicting future action. Who will be a criminal, who will be a terrorist, or, more simply, at what time of day are you most likely to buy that pair of shoes you have been eyeing while shopping online?
Eyewitness sketches and background checks might become relics of the past, given the amount of surveillance data we now have the capability to store and analyse.
While data surveillance can feel negative, it has also enabled incredible advances: preventing terrorism, cracking child pornography rings by tracing images as they circulate on the internet, and even aiding responses to the global refugee crisis. The Hive (a data initiative for USA for the UN Refugee Agency) used high-resolution satellite imagery to train a machine-learning algorithm for detecting tents in refugee camps – allowing for better camp planning and field operations.
See also
TIM BERNERS-LEE
1955–
Creator of the World Wide Web, who called the internet the 'world's largest surveillance network'.
Liberty Vittert
When put towards a good cause, such as crime prevention, certain types of surveillance can be well justified.
the 30-second data
Data is opening up new opportunities in intelligence processing, dissemination and analysis, while improving the investigative capacities of security and intelligence organizations at global and community levels. From anomalies (behaviour that doesn't fit a usual pattern) to associations (relationships that the human eye couldn't detect) and links (social networks of connections, such as Al-Qaeda), intelligence organizations compile data from online activity, surveillance, social media and so on, to detect patterns, or the lack thereof, in individual and group activity. Systems called 'sniffers' – designed to monitor a target user's internet traffic – have been transformed from simple surveillance tools into security systems designed to distinguish between communications that may be lawfully intercepted and those that may not. Data can be used to visualize how violence spreads like a virus among communities. The same data can also predict the most likely victims of violence and even, supposedly, the criminals. Police forces are using data both to identify these individuals and to forecast their behaviour. For example, police in Chicago placed over 1,400 men on a 'heat list' generated by an algorithm that rank-orders potential victims and subjects with the greatest risk of violence.
Big Data meets Big Brother in the untapped and untried world of data-driven security opportunities. From community policing to preventing terrorism, the possibilities are endless, and untested.
In the case of Chicago (see 'data' text), a higher score means a greater risk of being a victim or perpetrator of violence. Over the 2016 Mother's Day weekend, 80 per cent of the 51 people shot across two days had been correctly identified on the list. While proponents say the list allows police to prioritize youth violence by intervening in the lives of those most at risk, critics worry that, because what generates the risk score is not disclosed, racial bias and unethical data use may go unchecked.
See also
PATRICK W. KELLEY
fl. 1994
FBI Director of Integrity and Compliance, who put Carnivore into practice.
Liberty Vittert
Carnivore was one of the first systems implemented by the FBI to monitor email and communications from a security perspective.
the 30-second data
The adage 'if you're not paying for the product, you are the product' remains true in the era of big data. Businesses and governments hold detailed information about our likes, health, finances and whereabouts, and can harness this to serve us personalized advertising. Controversies around targeted political campaigning on Facebook, including alleged data breaches during the 2016 US presidential election, have brought data privacy to the forefront of public debate. Health data raises similar concerns: medical records are held securely by healthcare providers, but health apps are not subject to the same privacy regulations as hospitals or doctors. A British Medical Journal study found that nearly four in five such apps routinely share personal data with third parties. Users of menstrual cycle, fitness or mental health tracking apps may be unaware that sensitive information about their health and well-being is up for sale. One strategy for protecting privacy is the removal of identifying variables, such as full names or addresses, from large data sets. But can data ever truly be anonymized? In 2018 the New York Times reviewed a large anonymized phone location data set. Journalists were able to identify and contact two individuals from the data, demonstrating that true anonymization is difficult to achieve.
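A toy example (with invented records) shows why simply dropping names does not guarantee anonymity: a unique combination of the remaining attributes can still single out one individual, which is essentially how location traces betray people.

```python
# Invented data set of individuals with direct and indirect identifiers.
records = [
    {"name": "A. Smith", "home": "Zone 4", "work": "Zone 9"},
    {"name": "B. Jones", "home": "Zone 4", "work": "Zone 2"},
    {"name": "C. Patel", "home": "Zone 1", "work": "Zone 9"},
]

# 'Anonymize' by removing the direct identifier.
anonymized = [{k: v for k, v in r.items() if k != "name"} for r in records]

# An attacker who learns one person's home and work zones (easy to infer
# from location traces) checks how many anonymized records match.
matches = [r for r in anonymized
           if r["home"] == "Zone 4" and r["work"] == "Zone 9"]
print(len(matches))  # a single match means the record is re-identified
```

Because only one record matches that pair of zones, the attacker knows exactly whose record it is, despite the missing name.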
Every day we generate thousands of data points describing our lifestyle and behaviour. Who should have access to this information, and how can they use it responsibly?
Governments have taken steps to safeguard privacy. The UK's Information Commissioner's Office fined Facebook £500,000 for failing to protect user data. In the European Union, organizations must ask for consent when collecting personal data and delete it when asked. The US Census Bureau is introducing 'differential privacy' into the 2020 census, a method that prevents individuals from being identified from aggregate statistics.
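The core idea of differential privacy can be sketched with the textbook Laplace mechanism (this is a generic illustration, not the Census Bureau's exact implementation, and the epsilon value is an arbitrary choice): calibrated random noise is added to an aggregate count before release, so any one person's presence or absence barely changes the published statistic.

```python
import random

def private_count(true_count, epsilon=0.5):
    """Release a count with Laplace noise of scale 1/epsilon.

    For a counting query, one person joining or leaving changes the true
    count by at most 1, so noise of scale 1/epsilon gives epsilon-DP.
    """
    # The difference of two independent exponentials is Laplace-distributed.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Two neighbouring data sets: with and without one particular person.
print(private_count(1200))   # released statistic including the person
print(private_count(1199))   # released statistic excluding the person
```

The two released values are statistically hard to tell apart, so the aggregate remains useful for planning while the individual stays hidden; averaged over many releases, the noise cancels out.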
See also
MITCHELL BAKER
1959–
Founder of the Mozilla Foundation, launched in 2003, which works to protect individuals’ privacy while keeping the internet open and accessible.
Maryam Ahmed
Non-governmental organizations advocate for and support projects relating to greater internet and data privacy.
the 30-second data
Vote Science has been in practice since political outcomes began being decided by votes, dating back to sixth-century BCE Athens. Modern Vote Science evolved rapidly in the US in the 1950s, when campaigns, political parties and special interest groups started keeping large databases of eligible voters, which were later used to build individual voter profiles. Using machine learning and statistical analysis, campaign professionals began using these profiles to make calculated decisions on how to win an election or sway public opinion. Current best practices include maintaining databases of people with hundreds of attributes, from individuals’ credit scores to whether they vote early/in-person or even if they are more likely to vote if reminded via phone, text or email. Using this data, campaigns and political parties work to predict voter behaviour, such as whether voters will turn out, when they will vote, how they will vote and – most recently – what will persuade them to change their opinion. Recent campaigns have adopted randomized field experiments to assess the effectiveness of mobilization and persuasion efforts. Vote Science now determines how a campaign chooses to spend its advertising funds as well as which particular messages are shown to specific, individual voters.
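A randomized field experiment of the kind campaigns now run can be simulated in miniature (all numbers here are invented): voters are randomly allocated to a treatment group that receives a reminder or to a control group, and turnout rates are compared.

```python
import random

random.seed(7)  # fixed seed so the simulation is reproducible

voters = [f"voter_{i}" for i in range(10000)]
random.shuffle(voters)                         # random allocation is the key step
treatment, control = voters[:5000], voters[5000:]

def turns_out(reminded):
    # Hypothetical behaviour: 40% base turnout, +5 points if reminded.
    return random.random() < (0.45 if reminded else 0.40)

turnout_t = sum(turns_out(True) for _ in treatment) / len(treatment)
turnout_c = sum(turns_out(False) for _ in control) / len(control)
effect = turnout_t - turnout_c
print(f"estimated effect of the reminder: {effect:+.1%}")
```

Because allocation is random, the two groups are comparable in every other respect, so the difference in turnout estimates the causal effect of the reminder – which is what tells a campaign whether phone, text or email contact is worth the money.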
Vote Science is the practice of using modern voter registration lists, consumer and social media data, and polling to influence public opinion and win elections.
George Bush's 2004 re-election campaign was the first to use political microtargeting – machine-learning algorithms that classify individual voters by how they might vote, or whether they will vote at all. Barack Obama's campaigns in 2008 and 2012 took Vote Science a step further by incorporating randomized field experiments. After witnessing the success of the American model, campaigns in the UK, France and India began to use Vote Science techniques such as microtargeting and randomized field experiments.
See also
DONALD P. GREEN
1961–
Pioneer of randomized experiments in Vote Science.
SASHA ISSENBERG
fl. 2002–
Chronicler of how data science has been used and evolved in campaigns in the last 20 years.
DAN WAGNER
fl. 2005–
Director of Analytics for ‘Obama for America’ in 2012; led efforts to expand Vote Science in campaigns to message testing and donor models.
Scott Tranter
Modern-day election campaigns are driven by Vote Science, with a vast amount of campaign budget allocated to it.
the 30-second data
Data science develops tools to analyse health information, in order to improve related services and outcomes. An estimated 30 per cent of the world's electronically stored data comes from the healthcare field. A single patient can generate roughly 80 megabytes of data annually (the equivalent of 260 books' worth of data). This health data can come from a variety of sources, including genetic testing, surveys, wearable devices, social media, clinical trials, medical imaging, clinic and pharmacy information, administrative claims databases and national registries. A common data source is electronic medical record (EMR) platforms, which collect, organize and analyse patient data. EMRs enable doctors and healthcare networks to communicate and coordinate care, thereby reducing inefficiencies and costs. EMR data is used to create decision tools for clinicians that incorporate evidence-based recommendations on patient test results and prevention procedures. Healthcare data science combines the fields of predictive analytics, machine learning and information technology to transform unstructured information into knowledge that can change clinical and public health practice. Data science helps to save lives by predicting patients' risk of disease, personalizing treatments and enabling research to cure diseases.
Data science transforms unstructured health information into knowledge that changes medical practice.
Consumer-grade wearable devices coupled with smartphone technology offer innovative ways to capture continuous health data, improving patient outcomes. For example, heart monitors can be used to diagnose and/or predict abnormal and potentially life-threatening heart rhythms. The data can be assessed over varying time windows (days to weeks versus months to years) to develop early-warning health scores. Similarly, hearing aids with motion sensors can detect the cause of a fall (slipping versus heart attack), so doctors can respond effectively.
See also
FLORENCE NIGHTINGALE
1820–1910
Championed the use of healthcare statistics.
BILL & MELINDA GATES
1955– & 1964–
Launched in 2000, the Gates Foundation uses data to solve some of the world’s biggest health data science problems.
JAMES PARK & ERIC FRIEDMAN
fl. 2007
Founders of Fitbit who applied sensors and wireless tech to health and fitness.
Rupa R. Patel
Using data to personalize healthcare helps to save lives.
the 30-second data
When IBM's Watson computer defeated the reigning Jeopardy! champion on a nationally televised game show in 2011, it was a demonstration of how computer-based natural language processing and machine learning had advanced sufficiently to take on the complex wordplay, puns and ambiguity that many viewers might struggle with. Google's DeepMind subsidiary did something similar – its AlphaGo program used machine learning and artificial intelligence to beat the world champion at Go, a very complicated strategy board game played with black and white stones – a feat no other computer had ever accomplished. Picking ambitious targets such as beating humans at well-known games serves several purposes. First, it gives data scientists clear goals and benchmarks to target, like 'Win at Jeopardy!'. In IBM's case, they even announced the goal beforehand, which put pressure on the development team to be creative and think outside the box – after all, who would want their machine to be publicly beaten by a mere human? Second, these sparring matches show the public how far hardware and software are progressing. Go is much more challenging than chess, so if a computer can beat the world champion, we must be making a lot of progress!
IBM’s Watson Jeopardy!-playing computer and Google’s DeepMind Go-playing program introduced the world to machine learning and artificial intelligence in ways that were easy to understand.
Computer companies pursue targets such as playing Jeopardy! and Go because to excel at them they have to develop general-purpose capabilities that can be applied to other commercially important problems. The ability to answer a person’s question in his or her own language on a broad range of topics, or to train for complicated problems such as robot navigation, will help future computers to perform more sophisticated tasks for people, including their creators.
See also
NEURAL NETWORKS & DEEP LEARNING
THOMAS WATSON
1874–1956
Chairman and CEO of IBM, after whom the Jeopardy!-playing computer is named.
DEEPMIND TECHNOLOGIES
2010–
Acquired by Alphabet (parent of Google) in 2014.
Willy Shih
Computers beating humans at ever more complex games is a visible measure of the progress being made in data science.