The biggest unknown facing data science today is how societies will choose to answer a new version of the old question of how best to balance the freedoms and privacy of individuals and minorities against the security and interests of society. In the context of data science, this old question is framed as follows: What do we as a society view as reasonable ways to gather and use data relating to individuals in contexts as diverse as fighting terrorism, improving medicine, supporting public-policy research, fighting crime, detecting fraud, assessing credit risk, providing insurance underwriting, and advertising to targeted groups?
The promise of data science is that it provides a way to understand the world through data. In the current era of big data, this promise is very tantalizing, and, indeed, a number of arguments can be used to support the development and adoption of data-driven infrastructure and technologies. One common argument relates to improving efficiency, effectiveness, and competitiveness—an argument that, at least in the business context, is backed by some academic research. For example, a 2011 study of 179 large publicly traded firms showed that the more data driven a firm’s decision making is, the more productive the firm is: “We find that firms that adopt DDD [data-driven decision making] have output and productivity that is 5–6% higher than what would be expected given their other investments and information technology usage” (Brynjolfsson, Hitt, and Kim 2011, 1).
Another argument for increased adoption of data science technologies and practices relates to securitization. For a long time, governments have argued that surveillance improves security, and since the terrorist attacks in the United States on September 11, 2001, and with each subsequent terrorist attack throughout the world, this argument has gained traction. Indeed, it was frequently used in the public debate caused by Edward Snowden’s revelations about the US National Security Agency’s PRISM surveillance program and the data it routinely gathered on US citizens. A stark example of the power of this argument is the agency’s US$1.7 billion investment in a data center in Bluffdale, Utah, that has the ability to store huge amounts of intercepted communications (Carroll 2013).
At the same time, however, societies, governments, and businesses are struggling to understand the long-term implications of data science in a big-data world. Given the rapid development of technologies for data gathering, data storage, and data analysis, it is not surprising that the legal frameworks in place and the broader ethical discussions around data, in particular the question of individual privacy, are running behind these advances. Notwithstanding this difficulty, basic legal principles around data collection and usage are important to understand and are nearly always applicable. Also, the ethical debate around data usage and privacy has highlighted some worrying trends that we as individuals and citizens should be aware of.
Data science can be framed as making the world a more prosperous and secure place to live. But these same arguments can be used by very different organizations with very distinct agendas. For example, contrast calls by civil liberties groups for government to be more open and transparent in the gathering, use, and availability of data in the hope of empowering citizens to hold these same governments to account with similar calls from business communities who hope to use these data to increase their profits (Kitchin 2014a). In truth, data science is a double-edged sword. It can be used to improve our lives through more efficient government, improved medicine and health care, less-expensive insurance, smarter cities, reduced crime, and in many other ways. At the same time, however, it can also be used to spy on our private lives, to target us with unwanted advertising, and to control our behavior both overtly and covertly (the fear of surveillance can affect us as much as the surveillance itself does).
The contradictory aspects of data science are often apparent in the same application. For example, data science in health-insurance underwriting draws on third-party marketing data sets that contain information such as purchasing habits and web-search history, along with hundreds of other attributes relating to people’s lifestyles (Batty, Tripathi, Kroll, et al. 2010). The use of these third-party data is troublesome because it may trigger self-disciplining, wherein people avoid certain activities, such as visiting extreme-sports websites, for fear of incurring higher insurance premiums (Mayer-Schönberger and Cukier 2014). However, the justification for the use of these data is that they act as a proxy for more invasive and expensive information sources, such as blood tests, and in the long term will reduce costs and premiums and thereby increase the number of people with health insurance (Batty, Tripathi, Kroll, et al. 2010).
The fault lines between the commercial benefits and the ethical considerations of using data science are apparent in the discussions around the use of personal data for targeted marketing. From a business advertising perspective, the incentive to use personal data is that there is a relationship between the personalization of marketing, services, and products, on the one hand, and the effectiveness of the marketing, on the other. It has been shown that the use of personal social network data—such as identifying consumers who are connected to prior customers—increases the effectiveness of a direct-mail marketing campaign for a telecommunications service by three to five times compared to traditional marketing approaches (Hill, Provost, and Volinsky 2006). Similar claims have been made about the effectiveness of data-driven personalization of online marketing. For example, a 2010 study of the cost and effectiveness of online targeted advertising in the United States compared run-of-the-network marketing (when an advertising campaign is pushed out across a range of websites without specific targeting of users or sites) with behavioral targeting (Beales 2010). The study found that behavioral targeting was more expensive (2.68 times more) but also more effective, with a conversion rate more than twice that of run-of-the-network marketing. Another well-known study on the effectiveness of data-driven online advertising was conducted by researchers from the University of Toronto and MIT (Goldfarb and Tucker 2011). They used the enactment of a privacy-protection bill in the European Union (EU) that limited the ability of advertising companies to track users’ online behavior in order to compare the effectiveness of online advertising under the new restrictions (i.e., in the EU) with its effectiveness where the restrictions did not apply (i.e., in the United States and other non-EU countries). The study found that online advertising was significantly less effective under the new restrictions, with a reported drop of 65 percent in study participants’ recorded purchasing intent. The results of this study have been contested (see, for example, Mayer and Mitchell 2012), but the study has been used to support the argument that the more data that are available about an individual, the more effective the advertising directed to that individual will be. Proponents of data-driven targeted marketing frame this argument as a win–win for both the advertiser and the consumer, claiming that advertisers lower marketing costs by reducing wasted advertising and achieve better conversion rates, and consumers get more relevant advertising.
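The advertiser-side arithmetic behind this win–win framing can be made concrete with a back-of-the-envelope calculation that combines the relative figures quoted from the Beales (2010) study with hypothetical baseline numbers (the absolute cost and conversion rates below are illustrative assumptions, not figures from the study):

```python
# Illustrative comparison of run-of-the-network vs. behaviorally targeted advertising.
# The 2.68x cost multiplier and the "more than twice" conversion multiplier are the relative
# figures quoted from Beales (2010); the baseline CPM and conversion rate are hypothetical.

base_cpm = 1.00          # hypothetical cost per 1,000 run-of-the-network impressions ($)
base_conversion = 0.001  # hypothetical run-of-the-network conversion rate

targeted_cpm = base_cpm * 2.68               # behavioral targeting reported as 2.68 times more expensive
targeted_conversion = base_conversion * 2.2  # conversion rate assumed to be a bit more than twice as high

def cost_per_conversion(cpm, conversion_rate):
    """Expected spend to obtain one conversion, given a cost per 1,000 impressions."""
    return (cpm / 1000.0) / conversion_rate

print("run-of-the-network cost per conversion: ", round(cost_per_conversion(base_cpm, base_conversion), 2))
print("behavioral targeting cost per conversion:", round(cost_per_conversion(targeted_cpm, targeted_conversion), 2))
# Under these assumptions the targeted campaign still costs less per conversion, which is the
# "win" advertisers point to; the privacy cost does not appear anywhere in the arithmetic.
```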
This utopian perspective on the use of personal data for targeted marketing is at best based on a selective understanding of the problem. Probably one of the most worrying stories related to targeted advertising was reported in the New York Times in 2012 and involves the American discount retail store Target (Duhigg 2012). It is well known in marketing that one of the times in a person’s life when his or her shopping habits change radically is at the conception and birth of a child. Because of this radical change, marketers see pregnancy as an opportunity to shift a person’s shopping habits and brand loyalties, and many retailers use publicly available birth records to trigger personalized marketing for new parents, sending them offers relating to baby products. In order to get a competitive advantage, Target wanted to identify pregnant customers at an early stage (ideally during the second trimester) without the mother-to-be voluntarily telling Target that she was pregnant. This insight would enable Target to begin its personalized marketing before other retailers knew the baby was on the way. To achieve this goal, Target initiated a data science project with the aim of predicting whether a customer was pregnant based on an analysis of her shopping habits. The starting point for the project was to analyze the shopping habits of women who had signed up for Target’s baby-shower registry. The analysis revealed that expectant mothers tended to purchase larger quantities of unscented lotion at the beginning of the second trimester as well as certain dietary supplements throughout the first 20 weeks of pregnancy. Based on this analysis, Target created a data-driven model that used around 25 products and indicators and assigned each customer a “pregnancy-prediction” score. The success, for want of a better word, of this model became very apparent when a man turned up at a Target store to complain that his high-school-age daughter had been mailed coupons for baby clothes and cribs. He accused Target of trying to encourage his daughter to get pregnant. However, over the subsequent days it transpired that the man’s daughter was in fact pregnant but hadn’t told anyone. Target’s pregnancy-prediction model was able to identify a pregnant high school student and act on this information before she had chosen to tell her family.
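As a rough illustration of what a purchase-based “pregnancy-prediction” score might look like in code, the sketch below trains a simple logistic-regression scorer on synthetic purchase indicators. The product list, signal strengths, and customer data are invented for illustration; Target’s actual model and feature set are not public.

```python
# A minimal, hypothetical sketch of a purchase-based propensity score.
# All products, weights, and data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

products = ["unscented_lotion", "calcium_supplement", "zinc_supplement",
            "large_tote_bag", "cotton_balls"]

# Synthetic customer-by-product purchase indicators and a synthetic outcome label.
X = rng.integers(0, 2, size=(1000, len(products)))
assumed_signal = np.array([1.5, 1.2, 1.0, 0.4, 0.6])      # assumed strength of each indicator
p = 1 / (1 + np.exp(-(X @ assumed_signal - 2.0)))          # synthetic "true" propensity
y = rng.binomial(1, p)                                     # synthetic labels

model = LogisticRegression().fit(X, y)

# The "prediction score" for a new customer's basket of indicator purchases.
new_customer = np.array([[1, 1, 0, 1, 0]])
print(f"propensity score: {model.predict_proba(new_customer)[0, 1]:.2f}")
```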
The story about Target identifying a pregnant high school student without her consent or knowledge highlights how data science can be used for social profiling not only of individuals but also of minority groups in society. In his book The Daily You: How the New Advertising Industry Is Defining Your Identity and Your Worth (2013), Joseph Turow discusses how marketers use digital profiling to categorize people as either targets or waste and then use these categories to personalize the offers and promotions directed to individual consumers: “those considered waste are ignored or shunted to other products that marketers deem more relevant to their tastes or income” (11). This personalization can result in preferential treatment for some and marginalization of others. A clear example of this discrimination is differential pricing on websites, wherein some customers are charged more than other customers for the same product based on their customer profiles (Clifford 2012).
These profiles are constructed by integrating data from a number of different noisy and partial data sources, so they can often be misleading about an individual. What is worse is that these marketing profiles are treated as products and are often sold to other companies, with the result that a negative marketing assessment of an individual can follow that individual across many domains. We have already discussed the use of marketing data sets in insurance underwriting (Batty, Tripathi, Kroll, et al. 2010), but these profiles can also make their way into credit-risk assessments and many other decision processes that affect people’s lives. Two aspects of these marketing profiles make them particularly problematic: they are black boxes, and they are persistent. The black-box nature of these profiles is apparent when one considers how difficult it is for individuals to know what data are recorded about them, where and when the data were recorded, and how the decision processes that use these data work. As a result, if an individual ends up on a no-fly list or a credit blacklist, it is “difficult to determine the grounds for discrimination and to challenge them” (Kitchin 2014a, 177). What is more, in the modern world data are often stored for a long time, so data recorded about an event in an individual’s life persist long after the event has passed. As Turow warns, “Turning individual profiles into individual evaluations is what happens when a profile becomes a reputation” (2013, 6).
Furthermore, unless it is used very carefully, data science can actually perpetuate and reinforce prejudice. An argument is sometimes made that data science is objective: it is based on numbers, so it doesn’t encode the prejudicial views that affect human decisions. The truth is that data science algorithms behave amorally rather than objectively. Data science extracts patterns in data; if the data encode a prejudicial relationship in society, then the algorithm is likely to identify this pattern and base its outputs on it. Indeed, the more consistent a prejudice is in a society, the stronger that prejudicial pattern will appear in the data about that society, and the more likely a data science algorithm is to extract and replicate it. For example, a study carried out by academic researchers on Google’s online advertising system found that the system showed an ad for a high-paying job more frequently to participants whose Google profile identified them as male than to participants whose profile identified them as female (Datta, Tschantz, and Datta 2015).
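The general mechanism can be shown with a small simulation (this is not a reproduction of the Datta, Tschantz, and Datta study; all data and effect sizes below are synthetic): if the historical decisions used as training labels were biased against one group, a model trained on them reproduces the disparity even though no one programmed it to discriminate.

```python
# Hypothetical simulation: a model trained on biased historical decisions replicates the bias.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, n)               # two demographic groups, coded 0 and 1
skill = rng.normal(0, 1, n)                 # the attribute that *should* drive the decision

# Synthetic historical decisions: same skill, but group 1 was systematically penalized.
historical_decision = (skill - 0.8 * group + rng.normal(0, 0.5, n) > 0).astype(int)

model = LogisticRegression().fit(np.column_stack([skill, group]), historical_decision)

# At prediction time, equally skilled members of the two groups receive different scores.
equal_skill = np.array([[0.5, 0], [0.5, 1]])
print(model.predict_proba(equal_skill)[:, 1])   # the group-1 member gets the lower score
```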
The fact that data science algorithms can reinforce prejudice is particularly troublesome when data science is applied to policing. Predictive Policing, or PredPol, is a data science tool designed to predict when and where a crime is most likely to occur. When deployed in a city, PredPol generates a daily report listing a number of hot spots on a map (small areas, 500 feet by 500 feet) where the system believes crimes are likely to occur and tags each hot spot with the police shift during which the system believes the crime will occur. Police departments in both the United States and the United Kingdom have deployed PredPol. The idea behind this type of intelligent-policing system is that policing resources can be deployed efficiently. On the surface, this seems like a sensible application of data science, potentially resulting in efficient targeting of crime and reduced policing costs. However, questions have been raised about the accuracy of PredPol and the effectiveness of similar predictive-policing initiatives (Hunt, Saunders, and Hollywood 2014; Oakland Privacy Working Group 2015; Harkness 2016). The potential for these types of systems to encode racial or class-based profiling in policing has also been noted (Baldridge 2015). The deployment of police resources based on historic data can result in a higher police presence in certain areas—typically economically disadvantaged areas—which in turn results in higher levels of reported crime in these areas. In other words, the prediction of crime becomes a self-fulfilling prophecy. The result of this cycle is that some locations will be disproportionately targeted by police surveillance, causing a breakdown in trust between the people who live in those communities and policing institutions (Harkness 2016).
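This feedback loop can be illustrated with a toy simulation. The sketch below is not PredPol’s proprietary algorithm; it simply ranks grid cells by previously reported incidents, sends extra “patrols” to the top cells, and shows how the additional reports generated by those patrols keep the same cells at the top of the list.

```python
# Toy simulation of a reported-incident feedback loop in grid-based hot-spot policing.
import numpy as np

rng = np.random.default_rng(2)
true_rate = rng.uniform(0.5, 1.5, size=(10, 10))   # underlying incident rate per cell (unknown to the system)
reported = rng.poisson(true_rate)                  # initially reported incidents

def top_cells(counts, k=5):
    """Return the k grid cells with the most reported incidents."""
    flat = np.argsort(counts, axis=None)[-k:]
    rows, cols = np.unravel_index(flat, counts.shape)
    return list(zip(rows.tolist(), cols.tolist()))

for _ in range(30):                                # simulate 30 days
    hot_spots = top_cells(reported)                # "predict" tomorrow's hot spots from past reports
    new_reports = rng.poisson(true_rate)           # what actually happens everywhere
    for r, c in hot_spots:
        new_reports[r, c] += rng.poisson(0.5)      # extra patrols detect and report extra incidents
    reported += new_reports

print("most-reported cells after 30 days:", top_cells(reported))
```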
Another example of data-driven policing is the Strategic Subjects List (SSL) used by the Chicago Police Department in an attempt to reduce gun crime. The list was first created in 2013, and at that time it listed 426 people who were estimated to be at a very high risk of gun violence. In an attempt to proactively prevent gun crime, the Chicago Police Department contacted all the people on the SSL to warn them that they were under surveillance. Some of the people on the list were very surprised to be included on it because although they did have criminal records for minor offenses, they had no violence on their records (Gorner 2013). One question to ask about this type of data gathering to prevent crime is, How accurate is the technology? A recent study found that the people on the SSL for 2013 were “not more or less likely to become a victim of a homicide or shooting than the comparison group” (Saunders, Hunt, and Hollywood 2016). However, this study also found that individuals on the list were more likely to be arrested for a shooting incident, although it did point out that this greater likelihood could have been created by the fact that these individuals were on the list, which increased police officers’ awareness of them (Saunders, Hunt, and Hollywood 2016). Responding to this study, the Chicago Police Department stated that it regularly updated the algorithm used to compile the SSL and that the effectiveness of the SSL had improved since 2013 (Rhee 2016). Another question about data-driven crime-prevention lists is, How does an individual end up on the list? The 2013 version of the SSL appears to have been compiled using, among other attributes of an individual, an analysis of his or her social network, including the arrest and shooting histories of his or her acquaintances (Dokoupil 2013; Gorner 2013). On one level, the idea of using social network analysis makes sense, but it opens up the very real problem of guilt by association. One problem with this type of approach is that it can be difficult to define precisely what an association between two individuals entails. Is living on the same street enough to be an association? Furthermore, in the United States, where the vast majority of inmates in prison are African American and Latino males, allowing predictive-policing algorithms to use the concept of association as an input is likely to result in predictions targeting mainly young men of color (Baldridge 2015).
The anticipatory nature of predictive policing means that individuals may be treated differently not because of what they have done but because of data-driven inferences about what they might do. As a result, these types of systems may reinforce discriminatory practices by replicating the patterns in historic data and may create self-fulfilling prophecies.
If you spend time absorbing some of the commercial boosterism that surrounds data science, you get a sense that any problem can be solved using data science technology given enough of the right data. This marketing of the power of data science feeds into a view that a data-driven approach to governance is the best way to address complex social problems, such as crime, poverty, poor education, and poor public health: all we need to do to solve these problems is to put sensors into our societies to track everything, merge all the data, and run the algorithms to generate the key insights that provide the solution.
When this argument is accepted, two processes are often intensified. The first is that society becomes more technocratic in nature, and aspects of life begin to be regulated by data-driven systems. Examples of this type of technological regulation already exist—for example, in some jurisdictions data science is currently used in parole hearings (Berk and Bleich 2013) and sentencing (Barry-Jester, Casselman, and Goldstein 2015). For an example outside of the judicial system, consider how smart-city technologies regulate traffic flows through cities with algorithms dynamically deciding which traffic flow gets priority at a junction at different times of day (Kitchin 2014b). A by-product of this technocratic regulation is the proliferation of the sensors that support the automated regulating systems. The second process is “control creep,” wherein data gathered for one purpose are repurposed and used to regulate in another way (Innes 2001). For example, road cameras that were installed in London primarily to regulate congestion and implement congestion charges (the London congestion charge is a daily charge for driving a vehicle within London during peak times) have been repurposed for security tasks (Dodge and Kitchin 2007). Other examples of control creep include ShotSpotter, a city-wide network of microphones designed to identify gunshots and report their locations, which also records conversations, some of which have been used to secure criminal convictions (Weissman 2015), and the use of in-car navigation systems to monitor and fine rental-car drivers who drive out of state (Elliott 2004; Kitchin 2014a).
An aspect of control creep is the drive to merge data from different sources so as to provide a more complete picture of a society and thereby potentially unlock deeper insights into the problems in the system. There are often good reasons for the repurposing of data. Indeed, calls are frequently made for data held by different branches of government to be merged for legitimate purposes—for example, to support health research and for the convenience of the state and its citizens. From a civil liberties perspective, however, these trends are very concerning. Heightened surveillance, the integration of data from multiple sources, control creep, and anticipatory governance (such as the predictive-policing programs) may result in a society where an individual is treated with suspicion simply because a sequence of unrelated innocent actions or encounters matches a pattern deemed suspicious by a data-driven regulatory system. Living in this type of society would change each of us from free citizens into inmates in Bentham’s Panopticon, constantly self-disciplining our behaviors for fear of what inferences might be drawn from them. The distinction between individuals who believe and act as though they are free of surveillance and individuals who self-discipline out of fear that they inhabit a Panopticon is the primary difference between a free society and a totalitarian state.
As individuals engage with and move through technologically modern societies, they have no choice but to leave a data trail behind them. In the real world, the proliferation of video surveillance means that location data can be gathered about an individual whenever she appears on a street or in a shop or car park, and the proliferation of cell phones means that many people can be tracked via their phones. Other examples of real-world data gathering include the recording of credit card purchases, the use of loyalty schemes in supermarkets, the tracking of withdrawals from ATMs, and the logging of the cell phone calls people make. In the online world, data are gathered about individuals when they visit or log in to websites; send an email; engage in online shopping; rate a date, restaurant, or store; use an e-book reader; watch a lecture in a massive open online course; or like or post something on a social media site. To put into perspective the amount of data gathered on the average individual in a technologically modern society, a report from the Dutch Data Protection Authority in 2009 estimated that the average Dutch citizen was included in 250 to 500 databases, with this figure rising to 1,000 databases for more socially active people (Koops 2011). Taken together, the data points relating to an individual define that person’s digital footprint.
The data in a digital footprint can be gathered in two contexts that are problematic from a privacy perspective. First, data can be collected about an individual without his knowledge or awareness. Second, in some contexts an individual may choose to share data about himself and his opinions but may have little or no knowledge of or control over how these data are used or how they will be shared with and repurposed by third parties. The terms data shadow and data footprint are used to distinguish these two contexts of data gathering: an individual’s data shadow comprises the data gathered about an individual without her knowledge, consent, or awareness, and an individual’s data footprint consists of the pieces of data that she knowingly makes public (Koops 2011).
The collection of data about an individual without her knowledge or consent is of course worrying. However, the power of modern data science techniques to uncover hidden patterns in data, coupled with the integration and repurposing of data from several sources, means that even data collected with an individual’s knowledge and consent in one context can have negative effects on that individual that are impossible for that individual to predict. Today, with the use of modern data science techniques, very personal information that we may not want to be made public and choose not to share can still be reliably inferred from seemingly unrelated data we willingly post on social media. For example, many people are willing to like something on Facebook because they want to show support for a friend. However, simply by using the items that an individual has liked on Facebook, data-driven models can accurately predict that person’s sexual orientation, political and religious views, intelligence and personality traits, and use of addictive substances such as alcohol, drugs, and cigarettes; they can even determine whether that person’s parents stayed together until he or she was 21 years old (Kosinski, Stillwell, and Graepel 2013). The out-of-context linkages made by these models are demonstrated by the findings that liking a human-rights campaign was predictive of homosexuality (both male and female) and that liking Hondas was predictive of not smoking (Kosinski, Stillwell, and Graepel 2013).
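A minimal sketch in the spirit of the Kosinski, Stillwell, and Graepel study is shown below: a logistic-regression model is trained on a user-by-like indicator matrix to predict a hidden attribute. All users, likes, and labels here are synthetic, so the reported accuracy is purely illustrative.

```python
# Hypothetical sketch: predicting a private attribute from an indicator matrix of "likes".
# All data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_users, n_likes = 2000, 300

likes = rng.integers(0, 2, size=(n_users, n_likes))        # which pages each user liked
informative = rng.choice(n_likes, size=20, replace=False)   # a few likes carry signal about the attribute
signal = likes[:, informative].sum(axis=1)
trait = (signal + rng.normal(0, 2, n_users) > signal.mean()).astype(int)  # hidden attribute

X_train, X_test, y_train, y_test = train_test_split(likes, trait, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on held-out users:", round(model.score(X_test, y_test), 2))
```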
In recent years, there has been a growing interest in computational approaches to preserving individual privacy throughout a data-analysis process. Two of the best-known approaches are differential privacy and federated learning.
Differential privacy is a mathematical approach to the problem of learning useful information about a population while learning nothing about the individuals within that population. Differential privacy uses a particular definition of privacy: the privacy of an individual has not been compromised by the inclusion of his or her data in the data-analysis process if the conclusions reached by the analysis would have been the same whether or not the individual’s data were included. A number of processes can be used to implement differential privacy. At the core of these processes is the idea of injecting noise either into the data-collection process or into the responses to database queries. The noise protects the privacy of individuals but can be removed from the data at an aggregate level so that useful population-level statistics can still be calculated. A useful example of a procedure for injecting noise into data, one that provides an intuitive explanation of how differential-privacy processes can work, is the randomized-response technique. The use case for this technique is a survey that includes a sensitive yes/no question (e.g., relating to law breaking or health conditions). Survey respondents are instructed to answer the sensitive question using the following procedure: flip a fair coin; if the coin comes up tails, answer “Yes” regardless of the truth; if the coin comes up heads, answer the question truthfully.
Half the respondents will get tails and respond “Yes”; the other half will respond truthfully. Therefore, the true number of “No” respondents in the total population is (approximately) twice the number of “No” responses (the coin is fair, so the respondents who answer truthfully are a random half of the population, and every “No” response must come from this truthful half). Given the true count for “No,” we can calculate the true count for “Yes.” However, although we now have an accurate count for the population regarding the sensitive “Yes” condition, it is not possible to identify for which of the “Yes” respondents the sensitive condition actually holds. There is a trade-off between the amount of noise injected into data and the usefulness of the data for analysis. Differential privacy addresses this trade-off by providing estimates of the amount of noise required given factors such as the distribution of data within the database, the type of database query being processed, and the number of queries over which we wish to guarantee an individual’s privacy. Cynthia Dwork and Aaron Roth (2014) provide an introduction to differential privacy and an overview of several approaches to implementing it. Differential-privacy techniques are now being deployed in a number of consumer products. For example, Apple uses differential privacy in iOS 10 to protect the privacy of individual users while at the same time learning usage patterns to improve predictive text in the messaging application and to improve search functionality.
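The randomized-response procedure is easy to simulate, which also makes its privacy guarantee concrete: the population-level “Yes” rate can be recovered accurately, while any individual “Yes” answer remains deniable. The population size and true rate below are illustrative assumptions.

```python
# Simulation of the randomized-response procedure described above: each respondent flips a
# fair coin, answers "Yes" on tails, and answers truthfully on heads.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
true_yes_rate = 0.12                                 # illustrative fraction with the sensitive condition

truth = rng.random(n) < true_yes_rate                # each person's true (private) answer
tails = rng.random(n) < 0.5                          # the coin flip
responses = np.where(tails, True, truth)             # tails -> "Yes", heads -> truthful answer

observed_no = np.count_nonzero(~responses)
estimated_no_rate = 2 * observed_no / n              # "No" answers only come from the truthful half
estimated_yes_rate = 1 - estimated_no_rate

print(f"true yes rate:      {true_yes_rate:.3f}")
print(f"estimated yes rate: {estimated_yes_rate:.3f}")
# Any single "Yes" response is deniable: it may simply mean the respondent's coin came up tails.
```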
In some scenarios, the data used in a data science project come from multiple disparate sources. For example, multiple hospitals may be contributing to a single research project, or a company may be collecting data from a large number of users of a cell phone application. Rather than centralizing these data into a single repository and doing the analysis on the combined data, an alternative approach is to train a separate model on the subset of data at each data source (i.e., at the individual hospitals or on the phone of each individual user) and then to merge the separately trained models. Google uses this federated-learning approach to improve the query suggestions made by the Google keyboard on Android (McMahan and Ramage 2017). In Google’s federated-learning framework, the mobile device initially has a copy of the current model loaded. As the user uses the application, the application data for that user are collected on his phone and used by a learning algorithm that is local to the phone to update the local version of the model. This local update is then uploaded to the cloud, where it is averaged with the model updates uploaded from other users’ phones. The core model is then updated using this average. With this process, the core model can be improved while individual users’ privacy is protected, to the extent that only the model updates are shared—not the users’ usage data.
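A minimal sketch of the federated-averaging idea, under simplifying assumptions, is shown below: each simulated “device” fits a small linear model on data that never leave it, and only the fitted weights are uploaded and averaged into a global model. This is a toy illustration, not Google’s production system.

```python
# Toy federated averaging: local gradient-descent updates are averaged into a global model;
# the raw local data are never shared.
import numpy as np

rng = np.random.default_rng(5)
true_w = np.array([2.0, -1.0])                      # the pattern we want the global model to learn
global_w = np.zeros(2)

for round_idx in range(20):                         # communication rounds
    local_updates = []
    for device in range(10):                        # ten simulated devices
        X = rng.normal(size=(50, 2))                # local data; it stays on the device
        y = X @ true_w + rng.normal(0, 0.1, 50)
        w = global_w.copy()
        for _ in range(5):                          # a few steps of local gradient descent
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.1 * grad
        local_updates.append(w)                     # only the updated weights are uploaded
    global_w = np.mean(local_updates, axis=0)       # the server averages the local models

print("learned global weights:", np.round(global_w, 2))
```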
There is variation across jurisdictions in the laws relating to privacy protection and permissible data usage. However, two core pillars are present across most democratic jurisdictions: antidiscrimination legislation and personal-data-protection legislation.
In most jurisdictions, antidiscrimination legislation forbids discrimination based on any of the following grounds: disability, age, sex, race, ethnicity, nationality, sexual orientation, and religious or political opinion. In the United States, the Civil Rights Act of 1964 prohibits discrimination based on color, race, sex, religion, or nationality. Later legislation has extended this list; for example, the Americans with Disabilities Act of 1990 extended protection against discrimination based on disability. Similar legislation is in place in many other jurisdictions. For example, the Charter of Fundamental Rights of the European Union prohibits discrimination on any grounds, including race, color, ethnic or social origin, genetic features, sex, age, birth, disability, sexual orientation, religion or belief, property, membership in a national minority, and political or any other opinion (Charter 2000).
A similar situation of variation and overlap exists with respect to privacy legislation across different jurisdictions. In the United States, the Fair Information Practice Principles (1973) have provided the basis for much of the subsequent privacy legislation in that jurisdiction. In the EU, the Data Protection Directive (Council of the European Union and European Parliament 1995) is the basis for much of that jurisdiction’s privacy legislation. The General Data Protection Regulations (Council of the European Union and European Parliament 2016) expand on the data-protection principles in the Data Protection Directive and provide consistent and legally enforceable data-protection regulations across all EU member states. However, the most broadly accepted principles relating to personal privacy and data are the Guidelines on the Protection of Privacy and Transborder Flows of Personal Data published by the Organisation for Economic Co-operation and Development (OECD 1980). Within these guidelines, personal data are defined as records relating to an identifiable individual, known as the data subject. The guidelines define eight (overlapping) principles that are designed to protect a data subject’s privacy: collection limitation, data quality, purpose specification, use limitation, security safeguards, openness, individual participation, and accountability.
Many jurisdictions, including the EU and the United States, endorse the OECD guidelines. Indeed, the data-protection principles in the EU General Data Protection Regulations can be broadly traced back to the OECD guidelines. The General Data Protection Regulations apply to the collection, storage, transfer, and processing of personal data relating to EU citizens within the EU and have implications for the flows of these data outside the EU. Currently, several countries are developing data-protection laws similar to and consistent with the General Data Protection Regulations.
It is well known that despite the legal frameworks that are in place, nation-states frequently collect personal data on their citizens and foreign nationals without these people’s knowledge, often in the name of security and intelligence. Examples include the US National Security Agency’s PRISM program; the UK Government Communications Headquarters’ Tempora program (Shubber 2013); and the Russian government’s System for Operative Investigative Activities (Soldatov and Borogan 2012). These programs affect the public’s perception of governments and use of modern communication technologies. The results of the Pew survey “Americans’ Privacy Strategies Post-Snowden” in 2015 indicated that 87 percent of respondents were aware of government surveillance of phone and Internet communications, and among those who were aware of these programs 61 percent stated that they were losing confidence that these programs served the public interest, and 25 percent reported that they had changed how they used technologies in response to learning about these programs (Rainie and Madden 2015). Similar results have been reported in European surveys, with more than half of Europeans aware of large-scale data collection by government agencies and most respondents stating that this type of surveillance had a negative impact on their trust with respect to how their online personal data are used (Eurobarometer 2015).
At the same time, many private companies avoid the regulations around personal data and privacy by claiming to use derived, aggregated, or anonymized data. By repackaging data in these ways, companies claim that the data are no longer personal data, which, they argue, permits them to gather data without an individual’s awareness or consent and without having a clear immediate purpose for the data; to hold the data for long periods of time; and to repurpose the data or sell the data when a commercial opportunity arises. Many advocates of the commercial opportunities of data science and big data argue that the real commercial value of data comes from their reuse or “optional value” (Mayer-Schönberger and Cukier 2014). The advocates of data reuse highlight two technical innovations that make data gathering and storage a sensible business strategy: first, today data can be gathered passively with little or no effort or awareness on the part of the individuals being tracked; and, second, data storage has become relatively inexpensive. In this context, it makes commercial sense to record and store data in case future (potentially unforeseeable) commercial opportunities make it valuable.
The modern commercial practices of hoarding, repurposing, and selling data are completely at odds with the purpose-specification and use-limitation principles of the OECD guidelines. Furthermore, the collection-limitation principle is undermined whenever a company presents a privacy agreement to a consumer that is designed to be unreadable or that reserves the company’s right to modify the agreement without further consultation or notification. Whenever this happens, the process of notification and granting of consent is turned into a meaningless box-ticking exercise. As with public opinion about government surveillance in the name of security, public opinion is quite negative toward commercial websites’ gathering and repurposing of personal data. Again using American and European surveys as our litmus test for wider public opinion, a survey of American Internet users in 2012 found that 62 percent of adults surveyed stated that they did not know how to limit the information collected about them by websites, and 68 percent stated that they did not like the practice of targeted advertising because they did not like having their online behavior tracked and analyzed (Purcell, Brenner, and Rainie 2012). A recent survey of European citizens found similar results: 69 percent of respondents felt that the collection of their data should require their explicit approval, but only 18 percent of respondents actually fully read privacy statements. Furthermore, 67 percent of respondents stated that they do not read privacy statements because they find them too long, and 38 percent stated that they find them unclear or too difficult to understand. The survey also found that 69 percent of respondents were concerned about their information being used for purposes different from the one for which it was collected, and 53 percent of respondents were uncomfortable with Internet companies using their personal information to tailor advertising (Eurobarometer 2015).
So at the moment public opinion is broadly negative toward both government surveillance and Internet companies’ gathering, storing, and analyzing of personal data. Today, most commentators agree that data-privacy legislation needs to be updated, and changes are happening. In 2012, both the EU and the United States published reviews and updates relating to data-protection and privacy policies (European Commission 2012; Federal Trade Commission 2012; Kitchin 2014a, 173). In 2013, the OECD guidelines were extended to include, among other updates, more details in relation to implementing the accountability principle. In particular, the new guidelines define the data controller’s responsibility to have a privacy-management program in place and clarify what such a program entails and how it should be framed in terms of risk management in relation to personal data (OECD 2013). In 2014, a Spanish citizen, Mario Costeja Gonzalez, won a case in the EU Court of Justice against Google (C-131/12 [2014]) asserting his right to be forgotten. The court held that an individual could request, under certain conditions, that an Internet search engine remove links to webpages that resulted from searches on the individual’s name. The grounds for such a request include that the data are inaccurate or out of date or that the data have been kept for longer than is necessary for historical, statistical, or scientific purposes. This ruling has major implications for all Internet search engines and may also have implications for other big-data hoarders. For example, it is not clear at present what the implications are for social media sites such as Facebook and Twitter (Marr 2015). The concept of the right to be forgotten has been asserted in other jurisdictions. For example, the California “eraser” law asserts a minor’s right to have material he or she has posted on an Internet or mobile service removed on request. The law also prohibits Internet, online, or cell phone service companies from compiling personal data relating to a minor for the purposes of targeted advertising or allowing a third party to do so. As a final example of the changes taking place, in 2016 the EU-US Privacy Shield was signed and adopted (European Commission 2016). Its focus is on harmonizing data-privacy obligations across the two jurisdictions, and its purpose is to strengthen data-protection rights for EU citizens whose data have been moved outside the EU. The agreement imposes stronger obligations on commercial companies with regard to transparency of data usage, stronger oversight mechanisms and possible sanctions, as well as limitations and oversight mechanisms for public authorities in recording or accessing personal data. However, at the time of writing, the strength and effectiveness of the EU-US Privacy Shield is being tested in a legal case in the Irish courts. The reason the Irish legal system is at the center of this debate is that many of the large US multinational Internet companies (Google, Facebook, Twitter, etc.) have their European, Middle East, and Africa headquarters in Ireland. As a result, the data-protection commissioner for Ireland is responsible for enforcing EU regulations on transnational data transfers made by these companies. Recent history illustrates that legal cases can result in significant and swift changes in the regulation of how personal data are handled.
In fact, the EU-US Privacy Shield is a direct consequence of a suit filed by Max Schrems, an Austrian lawyer and privacy activist, against Facebook. The outcome of Schrems’s case in 2015 was to invalidate the existing EU-US Safe Harbor agreement with immediate effect, and the EU-US Privacy Shield was developed as an emergency response to this outcome. Compared to the original Safe Harbor agreement, the Privacy Shield has strengthened EU citizens’ data-privacy rights (O’Rourke and Kerr 2017), and it may well be that any new framework would further strengthen these rights. For example, the EU General Data Protection Regulations will provide legally enforceable data protection to EU citizens from May 2018.
From a data science perspective, these examples illustrate that the regulations around data privacy and protection are in flux. Admittedly, the examples listed here are from the US and EU contexts, but they are indicative of broader trends in relation to privacy and data regulation. It is very difficult to predict how these changes will play out in the long term. A range of vested interests exist in this domain: consider the differing agendas of big Internet, advertising, and insurance companies, intelligence agencies, policing authorities, governments, medical and social science researchers, and civil liberties groups. Each of these sectors of society has different goals and needs with regard to data usage and consequently has different views on how data-privacy regulation should be shaped. Furthermore, we as individuals will probably have shifting views depending on the perspective we adopt. For example, we might be quite happy for our personal data to be shared and reused in the context of medical research. However, as the public-opinion surveys in Europe and the United States have reported, many of us have reservations about data gathering, reuse, and sharing in the context of targeted advertising. Broadly speaking, there are two themes in the discourse around the future of data privacy. One view argues for strengthening the regulations relating to the gathering of personal data and, in some cases, for empowering individuals to control how their data are gathered, shared, and used. The other view argues for deregulation in relation to the gathering of data but also for stronger laws to redress the misuse of personal data. With so many different stakeholders and perspectives, there are no easy or obvious answers to the questions posed about privacy and data. It is likely that the eventual solutions will be defined on a sector-by-sector basis and will consist of compromises negotiated between the relevant stakeholders.
In such a fluid context, it is best to act conservatively and ethically. As we work on developing new data science solutions to business problems, we should consider the ethical questions relating to personal data. There are good business reasons to do so. First, acting ethically and transparently with personal data helps a business maintain good relationships with its customers; inappropriate practices around personal data can cause a business severe reputational damage and drive its customers to competitors (Buytendijk and Heiser 2013). Second, there is a risk that as data integration, reuse, profiling, and targeting intensify, public opinion will harden around data privacy in the coming years, leading to more stringent regulations. Consciously acting transparently and ethically is the best way to ensure that the data science solutions we develop do not run afoul of current regulations or of the regulations that may come into existence in the coming years.
Aphra Kerr (2017) reports a case from 2015 that illustrates how not taking ethical considerations into account can have serious consequences for technology developers and vendors. The case resulted in the US Federal Trade Commission fining app game developers and publishers under the Children’s Online Privacy Protection Act. The developers had integrated third-party advertising into their free-to-play games. Integrating third-party advertising is standard practice in the free-to-play business model, but the problem arose because the games were designed for children younger than 13. In sharing their users’ data with advertising networks, the developers were therefore also sharing data relating to children and so violated the Children’s Online Privacy Protection Act. Also, in one instance the developers failed to inform the advertising networks that the apps were for children. As a result, it was possible that inappropriate advertising could be shown to children, and in this instance the Federal Trade Commission ruled that the game publishers were responsible for ensuring that age-appropriate content and advertising were supplied to the game-playing children. There has been an increasing number of these types of cases in recent years, and a number of organizations, including the Federal Trade Commission (2012), have called for businesses to adopt the principles of privacy by design (Cavoukian 2013). These principles were developed in the 1990s and have become a globally recognized framework for the protection of privacy. They advocate that protecting privacy should be the default mode of operation for the design of technology and information systems. To follow these principles requires a designer to consciously and proactively seek to embed privacy considerations into the design of technologies, organizational practices, and networked system architectures.
Although the arguments for ethical data science are clear, it is not always easy to act ethically. One way to make the challenge of ethical data science more concrete is to imagine you are working for a company as a data scientist on a business-critical project. In analyzing the data, you have identified a number of interacting attributes that together are a proxy for race (or some other protected attribute, such as religion or gender). You know that legally you can’t use the race attribute in your model, but you believe that these proxy attributes would enable you to circumvent the antidiscrimination legislation. You also believe that including these attributes in the model will make your model work, although you are naturally concerned that this successful outcome may be because the model will learn to reinforce discrimination that is already present in the system. Ask yourself: “What do I do?”
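One hypothetical sanity check in this situation is to measure, before using the candidate attributes, how well they predict the protected attribute itself: a high score indicates that, taken together, they act as a proxy. The data below are synthetic and the interpretation threshold is illustrative.

```python
# Hypothetical proxy check: how well do the candidate features predict the protected attribute?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n = 3000
protected = rng.integers(0, 2, n)                   # e.g., race; never used as a model input
candidate_features = np.column_stack([
    protected + rng.normal(0, 0.5, n),              # strongly correlated with the protected attribute
    rng.normal(0, 1, n),                            # unrelated feature
])

proxy_score = cross_val_score(LogisticRegression(), candidate_features, protected,
                              cv=5, scoring="roc_auc").mean()
print(f"proxy AUC: {proxy_score:.2f}")              # near 0.5 = little leakage; near 1.0 = strong proxy
```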