The biggest unknown facing data science today is how societies will choose to answer a new version of the old question of how best to balance the freedoms and privacy of individuals and minorities against the security and interests of society. In the context of data science, this old question is framed as follows: What do we as a society view as reasonable ways to gather and use data relating to individuals in contexts as diverse as fighting terrorism, improving medicine, supporting public-policy research, fighting crime, detecting fraud, assessing credit risk, providing insurance underwriting, and advertising to targeted groups?
The promise of data science is that it provides a way to understand the world through data. In the current era of big data, this promise is very tantalizing, and, indeed, a number of arguments can be used to support the development and adoption of data-driven infrastructure and technologies. One common argument relates to improving efficiency, effectiveness, and competitiveness—an argument that, at least in the business context, is backed by some academic research. For example, a 2011 study of 179 large publicly traded firms showed that the more data driven a firm’s decision making is, the more productive the firm is: “We find that firms that adopt DDD [data-driven decision making] have output and productivity that is 5–6% higher than what would be expected given their other investments and information technology usage” (Brynjolfsson, Hitt, and Kim 2011, 1).
Another argument for increased adoption of data science technologies and practices relates to securitization. For a long time, governments have argued that surveillance improves security, and since the terrorist attacks in the United States on September 11, 2001, and with each subsequent terrorist attack throughout the world, this argument has gained traction. Indeed, it was frequently used in the public debate caused by Edward Snowden’s revelations about the US National Security Agency’s PRISM surveillance program and the data it routinely gathered on US citizens. A stark example of the power of this argument is the agency’s US$1.7 billion investment in a data center in Bluffdale, Utah, that has the ability to store huge amounts of intercepted communications (Carroll 2013).
At the same time, however, societies, governments, and businesses are struggling to understand the long-term implications of data science in a big-data world. Given the rapid development of technologies for data gathering, data storage, and data analysis, it is not surprising that the legal frameworks in place and the broader ethical discussions around data, in particular the question of individual privacy, are running behind these advances. Notwithstanding this difficulty, basic legal principles around data collection and usage are important to understand and are nearly always applicable. Also, the ethical debate around data usage and privacy has highlighted some worrying trends that we as individuals and citizens should be aware of.
Data science can be framed as making the world a more prosperous and secure place to live. But these same arguments can be used by very different organizations with very distinct agendas. For example, contrast calls by civil liberties groups for government to be more open and transparent in the gathering, use, and availability of data in the hope of empowering citizens to hold these same governments to account with similar calls from business communities who hope to use these data to increase their profits (Kitchin 2014a). In truth, data science is a double-edged sword. It can be used to improve our lives through more efficient government, improved medicine and health care, less-expensive insurance, smarter cities, reduced crime, and in many other ways. At the same time, however, it can also be used to spy on our private lives, to target us with unwanted advertising, and to control our behavior both overtly and covertly (the fear of surveillance can affect us as much as the surveillance itself does).
The contradictory aspects of data science are often apparent in the same application. For example, data science in health-insurance underwriting draws on third-party marketing data sets that contain information such as purchasing habits and web-search history, along with hundreds of other attributes relating to people’s lifestyles (Batty, Tripathi, Kroll, et al. 2010). The use of these third-party data is troublesome because it may trigger self-disciplining, wherein people avoid certain activities, such as visiting extreme-sports websites, for fear of incurring higher insurance premiums (Mayer-Schönberger and Cukier 2014). However, the justification for the use of these data is that they act as a proxy for more invasive and expensive information sources, such as blood tests, and in the long term will reduce costs and premiums and thereby increase the number of people with health insurance (Batty, Tripathi, Kroll, et al. 2010).
The fault lines between the commercial benefits and the ethical considerations of using data science are apparent in the discussions around the use of personal data for targeted marketing. From a business advertising perspective, the incentive to use personal data is that there is a relationship between the personalization of marketing, services, and products, on the one hand, and the effectiveness of the marketing, on the other. It has been shown that the use of personal social network data—such as identifying consumers who are connected to prior customers—increases the effectiveness of a direct-mail marketing campaign for a telecommunications service by three to five times compared to traditional marketing approaches (Hill, Provost, and Volinsky 2006). Similar claims have been made about the effectiveness of data-driven personalization of online marketing. For example, a 2010 study of the cost and effectiveness of online targeted advertising in the United States compared run-of-the-network marketing (when an advertising campaign is pushed out across a range of websites without specific targeting of users or sites) with behavioral targeting (Beales 2010). The study found that behavioral targeting was more expensive (2.68 times more) but also more effective, with a conversion rate more than twice that of run-of-the-network marketing. Another well-known study on the effectiveness of data-driven online advertising was conducted by researchers from the University of Toronto and MIT (Goldfarb and Tucker 2011). They used the enactment of a privacy-protection bill in the European Union (EU) that limited the ability of advertising companies to track users’ online behavior in order to compare the effectiveness of online advertising under the new restrictions (i.e., in the EU) with its effectiveness where the restrictions did not apply (i.e., in the United States and other non-EU countries). The study found that online advertising was significantly less effective under the new restrictions, with a reported drop of 65 percent in study participants’ recorded purchasing intent. The results of this study have been contested (see, for example, Mayer and Mitchell 2012), but the study has been used to support the argument that the more data that are available about an individual, the more effective the advertising directed to that individual will be. Proponents of data-driven targeted marketing frame this argument as a win–win for both the advertiser and the consumer, claiming that advertisers lower marketing costs by reducing wasted advertising and achieve better conversion rates, and consumers get more relevant advertising.
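The advertiser-side arithmetic behind this win–win framing can be made concrete with a back-of-the-envelope calculation that combines the relative figures quoted from the Beales (2010) study with hypothetical baseline numbers (the absolute cost and conversion rates below are illustrative assumptions, not figures from the study):

```python
# Illustrative comparison of run-of-the-network vs. behaviorally targeted advertising.
# The 2.68x cost multiplier and the "more than twice" conversion multiplier are the relative
# figures quoted from Beales (2010); the baseline CPM and conversion rate are hypothetical.

base_cpm = 1.00          # hypothetical cost per 1,000 run-of-the-network impressions ($)
base_conversion = 0.001  # hypothetical run-of-the-network conversion rate

targeted_cpm = base_cpm * 2.68               # behavioral targeting reported as 2.68 times more expensive
targeted_conversion = base_conversion * 2.2  # conversion rate assumed to be a bit more than twice as high

def cost_per_conversion(cpm, conversion_rate):
    """Expected spend to obtain one conversion, given a cost per 1,000 impressions."""
    return (cpm / 1000.0) / conversion_rate

print("run-of-the-network cost per conversion: ", round(cost_per_conversion(base_cpm, base_conversion), 2))
print("behavioral targeting cost per conversion:", round(cost_per_conversion(targeted_cpm, targeted_conversion), 2))
# Under these assumptions the targeted campaign still costs less per conversion, which is the
# "win" advertisers point to; the privacy cost does not appear anywhere in the arithmetic.
```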
This utopian perspective on the use of personal data for targeted marketing is at best based on a selective understanding of the problem. Probably one of the most worrying stories related to targeted advertising was reported in the New York Times in 2012 and involves the American discount retail store Target (Duhigg 2012). It is well known in marketing that one of the times in a person’s life when his or her shopping habits change radically is at the conception and birth of a child. Because of this radical change, marketers see pregnancy as an opportunity to shift a person’s shopping habits and brand loyalties, and many retailers use publicly available birth records to trigger personalized marketing for new parents, sending them offers relating to baby products. In order to get a competitive advantage, Target wanted to identify pregnant customers at an early stage (ideally during the second trimester) without the mother-to-be voluntarily telling Target that she was pregnant. This insight would enable Target to begin its personalized marketing before other retailers knew the baby was on the way. To achieve this goal, Target initiated a data science project with the aim of predicting whether a customer was pregnant based on an analysis of her shopping habits. The starting point for the project was to analyze the shopping habits of women who had signed up for Target’s baby-shower registry. The analysis revealed that expectant mothers tended to purchase larger quantities of unscented lotion at the beginning of the second trimester as well as certain dietary supplements throughout the first 20 weeks of pregnancy. Based on this analysis, Target created a data-driven model that used around 25 products and indicators and assigned each customer a “pregnancy-prediction” score. The success, for want of a better word, of this model became very apparent when a man turned up at a Target store to complain that his high-school-age daughter had been mailed coupons for baby clothes and cribs. He accused Target of trying to encourage his daughter to get pregnant. However, over the subsequent days it transpired that the man’s daughter was in fact pregnant but hadn’t told anyone. Target’s pregnancy-prediction model was able to identify a pregnant high school student and act on this information before she had chosen to tell her family.
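As a rough illustration of what a purchase-based “pregnancy-prediction” score might look like in code, the sketch below trains a simple logistic-regression scorer on synthetic purchase indicators. The product list, signal strengths, and customer data are invented for illustration; Target’s actual model and feature set are not public.

```python
# A minimal, hypothetical sketch of a purchase-based propensity score.
# All products, weights, and data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

products = ["unscented_lotion", "calcium_supplement", "zinc_supplement",
            "large_tote_bag", "cotton_balls"]

# Synthetic customer-by-product purchase indicators and a synthetic outcome label.
X = rng.integers(0, 2, size=(1000, len(products)))
assumed_signal = np.array([1.5, 1.2, 1.0, 0.4, 0.6])      # assumed strength of each indicator
p = 1 / (1 + np.exp(-(X @ assumed_signal - 2.0)))          # synthetic "true" propensity
y = rng.binomial(1, p)                                     # synthetic labels

model = LogisticRegression().fit(X, y)

# The "prediction score" for a new customer's basket of indicator purchases.
new_customer = np.array([[1, 1, 0, 1, 0]])
print(f"propensity score: {model.predict_proba(new_customer)[0, 1]:.2f}")
```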
The story about Target identifying a pregnant high school student without her consent or knowledge highlights how data science can be used for social profiling not only of individuals but also of minority groups in society. In his book The Daily You: How the New Advertising Industry Is Defining Your Identity and Your Worth (2013), Joseph Turow discusses how marketers use digital profiling to categorize people as either targets or waste and then use these categories to personalize the offers and promotions directed to individual consumers: “those considered waste are ignored or shunted to other products that marketers deem more relevant to their tastes or income” (11). This personalization can result in preferential treatment for some and marginalization of others. A clear example of this discrimination is differential pricing on websites, wherein some customers are charged more than other customers for the same product based on their customer profiles (Clifford 2012).
These profiles are constructed by integrating data from a number of different noisy and partial data sources, so they can often be misleading about an individual. What is worse is that these marketing profiles are treated as products and are often sold to other companies, with the result that a negative marketing assessment of an individual can follow that individual across many domains. We have already discussed the use of marketing data sets in insurance underwriting (Batty, Tripathi, Kroll, et al. 2010), but these profiles can also make their way into credit-risk assessments and many other decision processes that affect people’s lives. Two aspects of these marketing profiles make them particularly problematic: they are black boxes, and they are persistent. The black-box nature of these profiles is apparent when one considers how difficult it is for individuals to know what data are recorded about them, where and when the data were recorded, and how the decision processes that use these data work. As a result, if an individual ends up on a no-fly list or a credit blacklist, it is “difficult to determine the grounds for discrimination and to challenge them” (Kitchin 2014a, 177). What is more, in the modern world data are often stored for a long time, so data recorded about an event in an individual’s life persist long after the event has passed. As Turow warns, “Turning individual profiles into individual evaluations is what happens when a profile becomes a reputation” (2013, 6).
Furthermore, unless it is used very carefully, data science can actually perpetuate and reinforce prejudice. An argument is sometimes made that data science is objective: it is based on numbers, so it doesn’t encode the prejudicial views that affect human decisions. The truth is that data science algorithms behave amorally rather than objectively. Data science extracts patterns in data; if the data encode a prejudicial relationship in society, then the algorithm is likely to identify this pattern and base its outputs on it. Indeed, the more consistent a prejudice is in a society, the stronger that prejudicial pattern will appear in the data about that society, and the more likely a data science algorithm is to extract and replicate it. For example, a study carried out by academic researchers on Google’s online advertising system found that the system showed an ad for a high-paying job more frequently to participants whose Google profile identified them as male than to participants whose profile identified them as female (Datta, Tschantz, and Datta 2015).
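The general mechanism can be shown with a small simulation (this is not a reproduction of the Datta, Tschantz, and Datta study; all data and effect sizes below are synthetic): if the historical decisions used as training labels were biased against one group, a model trained on them reproduces the disparity even though no one programmed it to discriminate.

```python
# Hypothetical simulation: a model trained on biased historical decisions replicates the bias.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, n)               # two demographic groups, coded 0 and 1
skill = rng.normal(0, 1, n)                 # the attribute that *should* drive the decision

# Synthetic historical decisions: same skill, but group 1 was systematically penalized.
historical_decision = (skill - 0.8 * group + rng.normal(0, 0.5, n) > 0).astype(int)

model = LogisticRegression().fit(np.column_stack([skill, group]), historical_decision)

# At prediction time, equally skilled members of the two groups receive different scores.
equal_skill = np.array([[0.5, 0], [0.5, 1]])
print(model.predict_proba(equal_skill)[:, 1])   # the group-1 member gets the lower score
```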
The fact that data science algorithms can reinforce prejudice is particularly troublesome when data science is applied to policing. Predictive Policing, or PredPol, is a data science tool designed to predict when and where a crime is most likely to occur. When deployed in a city, PredPol generates a daily report listing a number of hot spots on a map (small areas, 500 feet by 500 feet) where the system believes crimes are likely to occur and tags each hot spot with the police shift during which the system believes the crime will occur. Police departments in both the United States and the United Kingdom have deployed PredPol. The idea behind this type of intelligent-policing system is that policing resources can be deployed efficiently. On the surface, this seems like a sensible application of data science, potentially resulting in efficient targeting of crime and reduced policing costs. However, questions have been raised about the accuracy of PredPol and the effectiveness of similar predictive-policing initiatives (Hunt, Saunders, and Hollywood 2014; Oakland Privacy Working Group 2015; Harkness 2016). The potential for these types of systems to encode racial or class-based profiling in policing has also been noted (Baldridge 2015). The deployment of police resources based on historic data can result in a higher police presence in certain areas—typically economically disadvantaged areas—which in turn results in higher levels of reported crime in these areas. In other words, the prediction of crime becomes a self-fulfilling prophecy. The result of this cycle is that some locations will be disproportionately targeted by police surveillance, causing a breakdown in trust between the people who live in those communities and policing institutions (Harkness 2016).
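This feedback loop can be illustrated with a toy simulation. The sketch below is not PredPol’s proprietary algorithm; it simply ranks grid cells by previously reported incidents, sends extra “patrols” to the top cells, and shows how the additional reports generated by those patrols keep the same cells at the top of the list.

```python
# Toy simulation of a reported-incident feedback loop in grid-based hot-spot policing.
import numpy as np

rng = np.random.default_rng(2)
true_rate = rng.uniform(0.5, 1.5, size=(10, 10))   # underlying incident rate per cell (unknown to the system)
reported = rng.poisson(true_rate)                  # initially reported incidents

def top_cells(counts, k=5):
    """Return the k grid cells with the most reported incidents."""
    flat = np.argsort(counts, axis=None)[-k:]
    rows, cols = np.unravel_index(flat, counts.shape)
    return list(zip(rows.tolist(), cols.tolist()))

for _ in range(30):                                # simulate 30 days
    hot_spots = top_cells(reported)                # "predict" tomorrow's hot spots from past reports
    new_reports = rng.poisson(true_rate)           # what actually happens everywhere
    for r, c in hot_spots:
        new_reports[r, c] += rng.poisson(0.5)      # extra patrols detect and report extra incidents
    reported += new_reports

print("most-reported cells after 30 days:", top_cells(reported))
```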
Another example of data-driven policing is the Strategic Subjects List (SSL) used by the Chicago Police Department in an attempt to reduce gun crime. The list was first created in 2013, and at that time it listed 426 people who were estimated to be at a very high risk of gun violence. In an attempt to proactively prevent gun crime, the Chicago Police Department contacted all the people on the SSL to warn them that they were under surveillance. Some of the people on the list were very surprised to be included on it because although they did have criminal records for minor offenses, they had no violence on their records (Gorner 2013). One question to ask about this type of data gathering to prevent crime is, How accurate is the technology? A recent study found that the people on the SSL for 2013 were “not more or less likely to become a victim of a homicide or shooting than the comparison group” (Saunders, Hunt, and Hollywood 2016). However, this study also found that individuals on the list were more likely to be arrested for a shooting incident, although it did point out that this greater likelihood could have been created by the fact that these individuals were on the list, which increased police officers’ awareness of them (Saunders, Hunt, and Hollywood 2016). Responding to this study, the Chicago Police Department stated that it regularly updated the algorithm used to compile the SSL and that the effectiveness of the SSL had improved since 2013 (Rhee 2016). Another question about data-driven crime-prevention lists is, How does an individual end up on the list? The 2013 version of the SSL appears to have been compiled using, among other attributes of an individual, an analysis of his or her social network, including the arrest and shooting histories of his or her acquaintances (Dokoupil 2013; Gorner 2013). On one level, the idea of using social network analysis makes sense, but it opens up the very real problem of guilt by association. One problem with this type of approach is that it can be difficult to define precisely what an association between two individuals entails. Is living on the same street enough to be an association? Furthermore, in the United States, where the vast majority of inmates in prison are African American and Latino males, allowing predictive-policing algorithms to use the concept of association as an input is likely to result in predictions targeting mainly young men of color (Baldridge 2015).
The anticipatory nature of predictive policing means that individuals may be treated differently not because of what they have done but because of data-driven inferences about what they might do. As a result, these types of systems may reinforce discriminatory practices by replicating the patterns in historic data and may create self-fulfilling prophecies.
If you spend time absorbing some of the commercial boosterism that surrounds data science, you get a sense that any problem can be solved using data science technology given enough of the right data. This marketing of the power of data science feeds into a view that a data-driven approach to governance is the best way to address complex social problems, such as crime, poverty, poor education, and poor public health: all we need to do to solve these problems is to put sensors into our societies to track everything, merge all the data, and run the algorithms to generate the key insights that provide the solution.
When this argument is accepted, two processes are often intensified. The first is that society becomes more technocratic in nature, and aspects of life begin to be regulated by data-driven systems. Examples of this type of technological regulation already exist—for example, in some jurisdictions data science is currently used in parole hearings (Berk and Bleich 2013) and sentencing (Barry-Jester, Casselman, and Goldstein 2015). For an example outside of the judicial system, consider how smart-city technologies regulate traffic flows through cities with algorithms dynamically deciding which traffic flow gets priority at a junction at different times of day (Kitchin 2014b). A by-product of this technocratic regulation is the proliferation of the sensors that support the automated regulating systems. The second process is “control creep,” wherein data gathered for one purpose are repurposed and used to regulate in another way (Innes 2001). For example, road cameras that were installed in London primarily to regulate congestion and implement congestion charges (the London congestion charge is a daily charge for driving a vehicle within London during peak times) have been repurposed for security tasks (Dodge and Kitchin 2007). Other examples of control creep include ShotSpotter, a city-wide network of microphones designed to identify gunshots and report their locations, which also records conversations, some of which have been used to secure criminal convictions (Weissman 2015), and the use of in-car navigation systems to monitor and fine rental-car drivers who drive out of state (Elliott 2004; Kitchin 2014a).
An aspect of control creep is the drive to merge data from different sources so as to provide a more complete picture of a society and thereby potentially unlock deeper insights into the problems in the system. There are often good reasons for the repurposing of data. Indeed, calls are frequently made for data held by different branches of government to be merged for legitimate purposes—for example, to support health research and for the convenience of the state and its citizens. From a civil liberties perspective, however, these trends are very concerning. Heightened surveillance, the integration of data from multiple sources, control creep, and anticipatory governance (such as the predictive-policing programs) may result in a society where an individual is treated with suspicion simply because a sequence of unrelated innocent actions or encounters matches a pattern deemed suspicious by a data-driven regulatory system. Living in this type of society would change each of us from free citizens into inmates in Bentham’s Panopticon, constantly self-disciplining our behaviors for fear of what inferences might be drawn from them. The distinction between individuals who believe and act as though they are free of surveillance and individuals who self-discipline out of fear that they inhabit a Panopticon is the primary difference between a free society and a totalitarian state.
As individuals engage with and move through technologically modern societies, they have no choice but to leave a data trail behind them. In the real world, the proliferation of video surveillance means that location data can be gathered about an individual whenever she appears on a street or in a shop or car park, and the proliferation of cell phones means that many people can be tracked via their phones. Other examples of real-world data gathering include the recording of credit card purchases, the use of loyalty schemes in supermarkets, the tracking of withdrawals from ATMs, and the logging of the cell phone calls people make. In the online world, data are gathered about individuals when they visit or log in to websites; send an email; engage in online shopping; rate a date, restaurant, or store; use an e-book reader; watch a lecture in a massive open online course; or like or post something on a social media site. To put into perspective the amount of data gathered on the average individual in a technologically modern society, a report from the Dutch Data Protection Authority in 2009 estimated that the average Dutch citizen was included in 250 to 500 databases, with this figure rising to 1,000 databases for more socially active people (Koops 2011). Taken together, the data points relating to an individual define that person’s digital footprint.
The data in a digital footprint can be gathered in two contexts that are problematic from a privacy perspective. First, data can be collected about an individual without his knowledge or awareness. Second, in some contexts an individual may choose to share data about himself and his opinions but may have little or no knowledge of or control over how these data are used or how they will be shared with and repurposed by third parties. The terms data shadow and data footprint are used to distinguish these two contexts of data gathering: an individual’s data shadow comprises the data gathered about an individual without her knowledge, consent, or awareness, and an individual’s data footprint consists of the pieces of data that she knowingly makes public (Koops 2011).
The collection of data about an individual without her knowledge or consent is of course worrying. However, the power of modern data science techniques to uncover hidden patterns in data, coupled with the integration and repurposing of data from several sources, means that even data collected with an individual’s knowledge and consent in one context can have negative effects on that individual that are impossible for that individual to predict. Today, with the use of modern data science techniques, very personal information that we may not want to be made public and choose not to share can still be reliably inferred from seemingly unrelated data we willingly post on social media. For example, many people are willing to like something on Facebook because they want to show support for a friend. However, simply by using the items that an individual has liked on Facebook, data-driven models can accurately predict that person’s sexual orientation, political and religious views, intelligence and personality traits, and use of addictive substances such as alcohol, drugs, and cigarettes; they can even determine whether that person’s parents stayed together until he or she was 21 years old (Kosinski, Stillwell, and Graepel 2013). The out-of-context linkages made by these models are demonstrated by the findings that liking a human-rights campaign was predictive of homosexuality (both male and female) and that liking Hondas was predictive of not smoking (Kosinski, Stillwell, and Graepel 2013).
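A minimal sketch in the spirit of the Kosinski, Stillwell, and Graepel study is shown below: a logistic-regression model is trained on a user-by-like indicator matrix to predict a hidden attribute. All users, likes, and labels here are synthetic, so the reported accuracy is purely illustrative.

```python
# Hypothetical sketch: predicting a private attribute from an indicator matrix of "likes".
# All data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_users, n_likes = 2000, 300

likes = rng.integers(0, 2, size=(n_users, n_likes))        # which pages each user liked
informative = rng.choice(n_likes, size=20, replace=False)   # a few likes carry signal about the attribute
signal = likes[:, informative].sum(axis=1)
trait = (signal + rng.normal(0, 2, n_users) > signal.mean()).astype(int)  # hidden attribute

X_train, X_test, y_train, y_test = train_test_split(likes, trait, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on held-out users:", round(model.score(X_test, y_test), 2))
```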
In recent years, there has been a growing interest in computational approaches to preserving individual privacy throughout a data-analysis process. Two of the best-known approaches are differential privacy and federated learning.
Differential privacy is a mathematical approach to the problem of learning useful information about a population while learning nothing about the individuals within that population. Differential privacy uses a particular definition of privacy: the privacy of an individual has not been compromised by the inclusion of his or her data in the data-analysis process if the conclusions reached by the analysis would have been the same whether or not the individual’s data were included. A number of processes can be used to implement differential privacy. At the core of these processes is the idea of injecting noise either into the data-collection process or into the responses to database queries. The noise protects the privacy of individuals but can be removed from the data at an aggregate level so that useful population-level statistics can still be calculated. A useful example of a procedure for injecting noise into data, one that provides an intuitive explanation of how differential-privacy processes can work, is the randomized-response technique. The use case for this technique is a survey that includes a sensitive yes/no question (e.g., relating to law breaking or health conditions). Survey respondents are instructed to answer the sensitive question using the following procedure: flip a fair coin; if the coin comes up tails, answer “Yes” regardless of the truth; if the coin comes up heads, answer the question truthfully.
Half the respondents will get tails and respond “Yes”; the other half will respond truthfully. Therefore, the true number of “No” respondents in the total population is (approximately) twice the number of “No” responses (the coin is fair, so the respondents who answer truthfully are a random half of the population, and every “No” response must come from this truthful half). Given the true count for “No,” we can calculate the true count for “Yes.” However, although we now have an accurate count for the population regarding the sensitive “Yes” condition, it is not possible to identify for which of the “Yes” respondents the sensitive condition actually holds. There is a trade-off between the amount of noise injected into data and the usefulness of the data for analysis. Differential privacy addresses this trade-off by providing estimates of the amount of noise required given factors such as the distribution of data within the database, the type of database query being processed, and the number of queries over which we wish to guarantee an individual’s privacy. Cynthia Dwork and Aaron Roth (2014) provide an introduction to differential privacy and an overview of several approaches to implementing it. Differential-privacy techniques are now being deployed in a number of consumer products. For example, Apple uses differential privacy in iOS 10 to protect the privacy of individual users while at the same time learning usage patterns to improve predictive text in the messaging application and to improve search functionality.
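The randomized-response procedure is easy to simulate, which also makes its privacy guarantee concrete: the population-level “Yes” rate can be recovered accurately, while any individual “Yes” answer remains deniable. The population size and true rate below are illustrative assumptions.

```python
# Simulation of the randomized-response procedure described above: each respondent flips a
# fair coin, answers "Yes" on tails, and answers truthfully on heads.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
true_yes_rate = 0.12                                 # illustrative fraction with the sensitive condition

truth = rng.random(n) < true_yes_rate                # each person's true (private) answer
tails = rng.random(n) < 0.5                          # the coin flip
responses = np.where(tails, True, truth)             # tails -> "Yes", heads -> truthful answer

observed_no = np.count_nonzero(~responses)
estimated_no_rate = 2 * observed_no / n              # "No" answers only come from the truthful half
estimated_yes_rate = 1 - estimated_no_rate

print(f"true yes rate:      {true_yes_rate:.3f}")
print(f"estimated yes rate: {estimated_yes_rate:.3f}")
# Any single "Yes" response is deniable: it may simply mean the respondent's coin came up tails.
```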
In some scenarios, the data used in a data science project come from multiple disparate sources. For example, multiple hospitals may be contributing to a single research project, or a company may be collecting data from a large number of users of a cell phone application. Rather than centralizing these data into a single repository and doing the analysis on the combined data, an alternative approach is to train a separate model on the subset of data at each data source (i.e., at the individual hospitals or on the phone of each individual user) and then to merge the separately trained models. Google uses this federated-learning approach to improve the query suggestions made by the Google keyboard on Android (McMahan and Ramage 2017). In Google’s federated-learning framework, the mobile device initially has a copy of the current model loaded. As the user uses the application, the application data for that user are collected on his phone and used by a learning algorithm that is local to the phone to update the local version of the model. This local update is then uploaded to the cloud, where it is averaged with the model updates uploaded from other users’ phones. The core model is then updated using this average. With this process, the core model can be improved while individual users’ privacy is protected, to the extent that only the model updates are shared—not the users’ usage data.
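A minimal sketch of the federated-averaging idea, under simplifying assumptions, is shown below: each simulated “device” fits a small linear model on data that never leave it, and only the fitted weights are uploaded and averaged into a global model. This is a toy illustration, not Google’s production system.

```python
# Toy federated averaging: local gradient-descent updates are averaged into a global model;
# the raw local data are never shared.
import numpy as np

rng = np.random.default_rng(5)
true_w = np.array([2.0, -1.0])                      # the pattern we want the global model to learn
global_w = np.zeros(2)

for round_idx in range(20):                         # communication rounds
    local_updates = []
    for device in range(10):                        # ten simulated devices
        X = rng.normal(size=(50, 2))                # local data; it stays on the device
        y = X @ true_w + rng.normal(0, 0.1, 50)
        w = global_w.copy()
        for _ in range(5):                          # a few steps of local gradient descent
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.1 * grad
        local_updates.append(w)                     # only the updated weights are uploaded
    global_w = np.mean(local_updates, axis=0)       # the server averages the local models

print("learned global weights:", np.round(global_w, 2))
```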
There is variation across jurisdictions in the laws relating to privacy protection and permissible data usage. However, two core pillars are present across most democratic jurisdictions: antidiscrimination legislation and personal-data-protection legislation.
In most jurisdictions, antidiscrimination legislation forbids discrimination based on any of the following grounds: disability, age, sex, race, ethnicity, nationality, sexual orientation, and religious or political opinion. In the United States, the Civil Rights Act of 1964 prohibits discrimination based on color, race, sex, religion, or nationality. Later legislation has extended this list; for example, the Americans with Disabilities Act of 1990 extended protection against discrimination based on disability. Similar legislation is in place in many other jurisdictions. For example, the Charter of Fundamental Rights of the European Union prohibits discrimination on any grounds, including race, color, ethnic or social origin, genetic features, sex, age, birth, disability, sexual orientation, religion or belief, property, membership in a national minority, and political or any other opinion (Charter 2000).
A similar situation of variation and overlap exists with respect to privacy legislation across different jurisdictions. In the United States, the Fair Information Practice Principles (1973) have provided the basis for much of the subsequent privacy legislation in that jurisdiction. In the EU, the Data Protection Directive (Council of the European Union and European Parliament 1995) is the basis for much of that jurisdiction’s privacy legislation. The General Data Protection Regulations (Council of the European Union and European Parliament 2016) expand on the data-protection principles in the Data Protection Directive and provide consistent and legally enforceable data-protection regulations across all EU member states. However, the most broadly accepted principles relating to personal privacy and data are the Guidelines on the Protection of Privacy and Transborder Flows of Personal Data published by the Organisation for Economic Co-operation and Development (OECD 1980). Within these guidelines, personal data are defined as records relating to an identifiable individual, known as the data subject. The guidelines define eight (overlapping) principles that are designed to protect a data subject’s privacy: collection limitation, data quality, purpose specification, use limitation, security safeguards, openness, individual participation, and accountability.
Many jurisdictions, including the EU and the United States, endorse the OECD guidelines. Indeed, the data-protection principles in the EU General Data Protection Regulations can be broadly traced back to the OECD guidelines. The General Data Protection Regulations apply to the collection, storage, transfer, and processing of personal data relating to EU citizens within the EU and have implications for the flows of these data outside the EU. Currently, several countries are developing data-protection laws similar to and consistent with the General Data Protection Regulations.
It is well known that despite the legal frameworks that are in place, nation-states frequently collect personal data on their citizens and foreign nationals without these people’s knowledge, often in the name of security and intelligence. Examples include the US National Security Agency’s PRISM program; the UK Government Communications Headquarters’ Tempora program (Shubber 2013); and the Russian government’s System for Operative Investigative Activities (Soldatov and Borogan 2012). These programs affect the public’s perception of governments and use of modern communication technologies. The results of the Pew survey “Americans’ Privacy Strategies Post-Snowden” in 2015 indicated that 87 percent of respondents were aware of government surveillance of phone and Internet communications, and among those who were aware of these programs 61 percent stated that they were losing confidence that these programs served the public interest, and 25 percent reported that they had changed how they used technologies in response to learning about these programs (Rainie and Madden 2015). Similar results have been reported in European surveys, with more than half of Europeans aware of large-scale data collection by government agencies and most respondents stating that this type of surveillance had a negative impact on their trust with respect to how their online personal data are used (Eurobarometer 2015).
At the same time, many private companies avoid the regulations around personal data and privacy by claiming to use derived, aggregated, or anonymized data. By repackaging data in these ways, companies claim that the data are no longer personal data, which, they argue, permits them to gather data without an individual’s awareness or consent and without having a clear immediate purpose for the data; to hold the data for long periods of time; and to repurpose the data or sell the data when a commercial opportunity arises. Many advocates of the commercial opportunities of data science and big data argue that the real commercial value of data comes from their reuse or “optional value” (Mayer-Schönberger and Cukier 2014). The advocates of data reuse highlight two technical innovations that make data gathering and storage a sensible business strategy: first, today data can be gathered passively with little or no effort or awareness on the part of the individuals being tracked; and, second, data storage has become relatively inexpensive. In this context, it makes commercial sense to record and store data in case future (potentially unforeseeable) commercial opportunities make it valuable.
The modern commercial practices of hoarding, repurposing, and selling data are completely at odds with the purpose-specification and use-limitation principles of the OECD guidelines. Furthermore, the collection-limitation principle is undermined whenever a company presents a privacy agreement to a consumer that is designed to be unreadable or that reserves the company’s right to modify the agreement without further consultation or notification. Whenever this happens, the process of notification and granting of consent is turned into a meaningless box-ticking exercise. As with public opinion about government surveillance in the name of security, public opinion is quite negative toward commercial websites’ gathering and repurposing of personal data. Again using American and European surveys as our litmus test for wider public opinion, a survey of American Internet users in 2012 found that 62 percent of adults surveyed stated that they did not know how to limit the information collected about them by websites, and 68 percent stated that they did not like the practice of targeted advertising because they did not like having their online behavior tracked and analyzed (Purcell, Brenner, and Rainie 2012). A recent survey of European citizens found similar results: 69 percent of respondents felt that the collection of their data should require their explicit approval, but only 18 percent of respondents actually fully read privacy statements. Furthermore, 67 percent of respondents stated that they do not read privacy statements because they find them too long, and 38 percent stated that they find them unclear or too difficult to understand. The survey also found that 69 percent of respondents were concerned about their information being used for purposes different from the one for which it was collected, and 53 percent of respondents were uncomfortable with Internet companies using their personal information to tailor advertising (Eurobarometer 2015).
So at the moment public opinion is broadly negative toward both government surveillance and Internet companies’ gathering, storing, and analyzing of personal data. Today, most commentators agree that data-privacy legislation needs to be updated, and changes are happening. In 2012, both the EU and the United States published reviews and updates relating to data-protection and privacy policies (European Commission 2012; Federal Trade Commission 2012; Kitchin 2014a, 173). In 2013, the OECD guidelines were extended to include, among other updates, more details in relation to implementing the accountability principle. In particular, the new guidelines define the data controller’s responsibility to have a privacy-management program in place and clarify what such a program entails and how it should be framed in terms of risk management in relation to personal data (OECD 2013). In 2014, a Spanish citizen, Mario Costeja Gonzalez, won a case in the EU Court of Justice against Google (C-131/12 [2014]) asserting his right to be forgotten. The court held that an individual could request, under certain conditions, that an Internet search engine remove links to webpages that resulted from searches on the individual’s name. The grounds for such a request include that the data are inaccurate or out of date or that the data have been kept for longer than is necessary for historical, statistical, or scientific purposes. This ruling has major implications for all Internet search engines and may also have implications for other big-data hoarders. For example, it is not clear at present what the implications are for social media sites such as Facebook and Twitter (Marr 2015). The concept of the right to be forgotten has been asserted in other jurisdictions. For example, the California “eraser” law asserts a minor’s right to have material he or she has posted on an Internet or mobile service removed on request. The law also prohibits Internet, online, or cell phone service companies from compiling personal data relating to a minor for the purposes of targeted advertising or allowing a third party to do so. As a final example of the changes taking place, in 2016 the EU-US Privacy Shield was signed and adopted (European Commission 2016). Its focus is on harmonizing data-privacy obligations across the two jurisdictions, and its purpose is to strengthen data-protection rights for EU citizens whose data have been moved outside the EU. The agreement imposes stronger obligations on commercial companies with regard to transparency of data usage, stronger oversight mechanisms and possible sanctions, as well as limitations and oversight mechanisms for public authorities in recording or accessing personal data. However, at the time of writing, the strength and effectiveness of the EU-US Privacy Shield is being tested in a legal case in the Irish courts. The reason the Irish legal system is at the center of this debate is that many of the large US multinational Internet companies (Google, Facebook, Twitter, etc.) have their European, Middle East, and Africa headquarters in Ireland. As a result, the data-protection commissioner for Ireland is responsible for enforcing EU regulations on transnational data transfers made by these companies. Recent history illustrates that legal cases can result in significant and swift changes in the regulation of how personal data are handled.
In fact, the EU-US Privacy Shield is a direct consequence of a suit filed by Max Schrems, an Austrian lawyer and privacy activist, against Facebook. The outcome of Schrems’s case in 2015 was to invalidate the existing EU-US Safe Harbor agreement with immediate effect, and the EU-US Privacy Shield was developed as an emergency response to this outcome. Compared to the original Safe Harbor agreement, the Privacy Shield has strengthened EU citizens’ data-privacy rights (O’Rourke and Kerr 2017), and it may well be that any new framework would further strengthen these rights. For example, the EU General Data Protection Regulations will provide legally enforceable data protection to EU citizens from May 2018.
From a data science perspective, these examples illustrate that the regulations around data privacy and protection are in flux. Admittedly, the examples listed here are from the US and EU contexts, but they are indicative of broader trends in relation to privacy and data regulation. It is very difficult to predict how these changes will play out in the long term. A range of vested interests exist in this domain: consider the differing agendas of big Internet, advertising, and insurance companies, intelligence agencies, policing authorities, governments, medical and social science researchers, and civil liberties groups. Each of these sectors of society has different goals and needs with regard to data usage and consequently has different views on how data-privacy regulation should be shaped. Furthermore, we as individuals will probably have shifting views depending on the perspective we adopt. For example, we might be quite happy for our personal data to be shared and reused in the context of medical research. However, as the public-opinion surveys in Europe and the United States have reported, many of us have reservations about data gathering, reuse, and sharing in the context of targeted advertising. Broadly speaking, there are two themes in the discourse around the future of data privacy. One view argues for strengthening the regulations relating to the gathering of personal data and, in some cases, for empowering individuals to control how their data are gathered, shared, and used. The other view argues for deregulation in relation to the gathering of data but also for stronger laws to redress the misuse of personal data. With so many different stakeholders and perspectives, there are no easy or obvious answers to the questions posed about privacy and data. It is likely that the eventual solutions will be defined on a sector-by-sector basis and will consist of compromises negotiated between the relevant stakeholders.
In such a fluid context, it is best to act conservatively and ethically. As we work on developing new data science solutions to business problems, we should consider the ethical questions relating to personal data. There are good business reasons to do so. First, acting ethically and transparently with personal data helps a business maintain good relationships with its customers; inappropriate practices around personal data can cause a business severe reputational damage and drive its customers to competitors (Buytendijk and Heiser 2013). Second, there is a risk that as data integration, reuse, profiling, and targeting intensify, public opinion will harden around data privacy in the coming years, leading to more stringent regulations. Consciously acting transparently and ethically is the best way to ensure that the data science solutions we develop do not run afoul of current regulations or of the regulations that may come into existence in the coming years.
Aphra Kerr (2017) reports a case from 2015 that illustrates how not taking ethical considerations into account can have serious consequences for technology developers and vendors. The case resulted in the US Federal Trade Commission fining app game developers and publishers under the Children’s Online Privacy Protection Act. The developers had integrated third-party advertising into their free-to-play games. Integrating third-party advertising is standard practice in the free-to-play business model, but the problem arose because the games were designed for children younger than 13. In sharing their users’ data with advertising networks, the developers were therefore also sharing data relating to children and so violated the Children’s Online Privacy Protection Act. Also, in one instance the developers failed to inform the advertising networks that the apps were for children. As a result, it was possible that inappropriate advertising could be shown to children, and in this instance the Federal Trade Commission ruled that the game publishers were responsible for ensuring that age-appropriate content and advertising were supplied to the game-playing children. There has been an increasing number of these types of cases in recent years, and a number of organizations, including the Federal Trade Commission (2012), have called for businesses to adopt the principles of privacy by design (Cavoukian 2013). These principles were developed in the 1990s and have become a globally recognized framework for the protection of privacy. They advocate that protecting privacy should be the default mode of operation for the design of technology and information systems. To follow these principles requires a designer to consciously and proactively seek to embed privacy considerations into the design of technologies, organizational practices, and networked system architectures.
Although the arguments for ethical data science are clear, it is not always easy to act ethically. One way to make the challenge of ethical data science more concrete is to imagine you are working for a company as a data scientist on a business-critical project. In analyzing the data, you have identified a number of interacting attributes that together are a proxy for race (or some other protected attribute, such as religion or gender). You know that legally you can’t use the race attribute in your model, but you believe that these proxy attributes would enable you to circumvent the antidiscrimination legislation. You also believe that including these attributes in the model will make your model work, although you are naturally concerned that this successful outcome may be because the model will learn to reinforce discrimination that is already present in the system. Ask yourself: “What do I do?”
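One hypothetical sanity check in this situation is to measure, before using the candidate attributes, how well they predict the protected attribute itself: a high score indicates that, taken together, they act as a proxy. The data below are synthetic and the interpretation threshold is illustrative.

```python
# Hypothetical proxy check: how well do the candidate features predict the protected attribute?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n = 3000
protected = rng.integers(0, 2, n)                   # e.g., race; never used as a model input
candidate_features = np.column_stack([
    protected + rng.normal(0, 0.5, n),              # strongly correlated with the protected attribute
    rng.normal(0, 1, n),                            # unrelated feature
])

proxy_score = cross_val_score(LogisticRegression(), candidate_features, protected,
                              cv=5, scoring="roc_auc").mean()
print(f"proxy AUC: {proxy_score:.2f}")              # near 0.5 = little leakage; near 1.0 = strong proxy
```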