8
FOR ALMOST FORTY YEARS, until the Berlin Wall came down in 1989, the East German state security agency known as the Stasi spied on millions of people. Employing around a hundred thousand full-time staff, the Stasi watched from cars and streets. It opened letters and peeked into bank accounts, bugged apartments and wiretapped phone lines. And it induced lovers and couples, parents and children, to spy on each other, betraying the most basic trust humans have in each other. The resulting files—including at least 39 million index cards and 70 miles of documents—recorded and detailed the most intimate aspects of the lives of ordinary people. East Germany was one of the most comprehensive surveillance states ever seen.
Twenty years after East Germany’s demise, more data is being collected and stored about each one of us than ever before. We’re under constant surveillance: when we use our credit cards to pay, our cellphones to communicate, or our Social Security numbers to identify ourselves. In 2007 the British media relished the irony that there were more than 30 surveillance cameras within 200 yards of the London apartment where George Orwell wrote 1984. Well before the advent of the Internet, specialized companies like Equifax, Experian, and Acxiom collected, tabulated, and provided access to personal information for hundreds of millions of people worldwide. The Internet has made tracking easier, cheaper, and more useful. And clandestine three-letter government agencies are not the only ones spying on us. Amazon monitors our shopping preferences and Google our browsing habits, while Twitter knows what’s on our minds. Facebook seems to catch all that information too, along with our social relationships. Mobile operators know not only whom we talk to, but who is nearby.
With big data promising valuable insights to those who analyze it, all signs seem to point to a further surge in others’ gathering, storing, and reusing our personal data. The size and scale of data collections will increase by leaps and bounds as storage costs continue to plummet and analytic tools become ever more powerful. If the Internet age threatened privacy, does big data endanger it even more? Is that the dark side of big data?
Yes, and it is not the only one. Here, too, the essential point about big data is that a change of scale leads to a change of state. As we’ll explain, this transformation not only makes protecting privacy much harder, but also presents an entirely new menace: penalties based on propensities. That is, the possibility of using big-data predictions about people to judge and punish them even before they’ve acted. Doing this negates ideas of fairness, justice, and free will.
In addition to privacy and propensity, there is a third danger. We risk falling victim to a dictatorship of data, whereby we fetishize the information, the output of our analyses, and end up misusing it. Handled responsibly, big data is a useful tool of rational decision-making. Wielded unwisely, it can become an instrument of the powerful, who may turn it into a source of repression, either by simply frustrating customers and employees or, worse, by harming citizens.
The stakes are higher than is typically acknowledged. The dangers of failing to govern big data with respect to privacy and prediction, or of being deluded about the data’s meaning, go far beyond trifles like targeted online ads. The history of the twentieth century is blood-soaked with situations in which data abetted ugly ends. In 1943 the U.S. Census Bureau handed over block addresses (but not street names and numbers, to maintain the fiction of protecting privacy) of Japanese-Americans to facilitate their internment. The Netherlands’ famously comprehensive civil records were used by the invading Nazis to round up Jews. The five-digit numbers tattooed into the forearms of Nazi concentration-camp prisoners initially corresponded to IBM Hollerith punch-card numbers; data processing facilitated murder on an industrial scale.
Despite its informational prowess, there was much that the Stasi could not do. Without great effort, it could not know where everyone moved at all times or with whom they talked. Today, though, much of this information is collected by mobile phone carriers. The East German state could not predict which people would become dissidents, nor can we—but police forces are starting to use algorithmic models to decide where and when to patrol, which gives a hint of things to come. These trends make the risks inherent in big data as large as the datasets themselves.
Paralyzing privacy
It is tempting to extrapolate the danger to privacy from the growth in digital data and see parallels to Orwell’s surveillance dystopia 1984. And yet the situation is more complex. To start, not all big data contains personal information. Sensor data from refineries does not, nor does machine data from factory floors or data on manhole explosions or airport weather. BP and Con Edison do not need (or want) personal information in order to gain value from the analytics they perform. Big-data analyses of those types of information pose practically no risk to privacy.
Still, much of the data that’s now being generated does include personal information. And companies have a welter of incentives to capture more, keep it longer, and reuse it often. The data may not even explicitly seem like personal information, but with big-data processes it can easily be traced back to the individual it refers to. Or intimate details about a person’s life can be deduced.
For instance, utilities are rolling out “smart” electrical meters in the United States and Europe that collect data throughout the day, perhaps as frequently as every six seconds—far more than the trickle of information on overall energy use that traditional meters gathered. Importantly, the way electrical devices draw power creates a “load signature” that is unique to the appliance. So a hot-water heater is different from a computer, which differs from marijuana grow-lights. Thus a household’s energy use discloses private information, be it the residents’ daily behavior, health conditions, or illegal activities.
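To make the risk concrete, here is a minimal sketch, in Python, of how a window of six-second meter readings might be matched against known appliance load signatures. The appliance templates and the readings are invented for illustration; real non-intrusive load monitoring relies on far richer features, but the underlying idea of comparing observed consumption against stored signatures is the same.

```python
# Illustrative sketch only: matching a household's high-frequency meter
# readings against hypothetical appliance "load signatures". The templates
# and the observed readings below are invented for demonstration.

def correlation(a, b):
    """Pearson correlation between two equal-length sequences of readings."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    norm_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a flat signature carries no shape to correlate with
    return cov / (norm_a * norm_b)

# Hypothetical per-six-second power draw (kW) for three appliances.
SIGNATURES = {
    "water heater": [4.0, 4.0, 4.0, 4.0, 0.0, 0.0],
    "computer":     [0.2, 0.3, 0.2, 0.3, 0.2, 0.3],
    "grow lights":  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}

def best_match(window):
    """Return the appliance whose signature best correlates with the window."""
    return max(SIGNATURES, key=lambda name: correlation(window, SIGNATURES[name]))

# A six-reading window taken from the meter (invented values).
observed = [3.9, 4.1, 4.0, 3.8, 0.1, 0.0]
print(best_match(observed))  # -> "water heater"
```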
The important question, however, is not whether big data increases the risk to privacy (it does), but whether it changes the character of the risk. If the threat is simply larger, then the laws and rules that protect privacy may still work in the big-data age; all we need to do is redouble our existing efforts. On the other hand, if the problem changes, we may need new solutions.
Unfortunately, the problem has been transformed. With big data, the value of information no longer resides solely in its primary purpose. As we’ve argued, it is now in secondary uses.
This change undermines the central role assigned to individuals in current privacy laws. Today they are told at the time of collection which information is being gathered and for what purpose; then they have an opportunity to agree, so that collection can commence. While this concept of “notice and consent” is not the only lawful way to gather and process personal data, according to Fred Cate, a privacy expert at Indiana University, it has been transmogrified into a cornerstone of privacy principles around the world. (In practice, it has led to super-sized privacy notices that are rarely read, let alone understood—but that is another story.)
Strikingly, in a big-data age, most innovative secondary uses haven’t been imagined when the data is first collected. How can companies provide notice for a purpose that has yet to exist? How can individuals give informed consent to an unknown? Yet in the absence of consent, any big-data analysis containing personal information might require going back to every person and asking permission for each reuse. Can you imagine Google trying to contact hundreds of millions of users for approval to use their old search queries to predict the flu? No company would shoulder the cost, even if the task were technically feasible.
The alternative, asking users to agree to any possible future use of their data at the time of collection, isn’t helpful either. Such a wholesale permission emasculates the very notion of informed consent. In the context of big data, the tried and trusted concept of notice and consent is often either too restrictive to unearth data’s latent value or too empty to protect individuals’ privacy.
Other ways of protecting privacy fail as well. If everyone’s information is in a dataset, even choosing to “opt out” may leave a trace. Take Google’s Street View. Its cars collected images of roads and houses in many countries. In Germany, Google faced widespread public and media protests. People feared that pictures of their homes and gardens could aid gangs of burglars in selecting lucrative targets. Under regulatory pressure, Google agreed to let homeowners opt out by blurring their houses in the image. But the opt-out is visible on Street View—you notice the obfuscated houses—and burglars may interpret this as a signal that they are especially good targets.
A technical approach to protecting privacy—anonymization—also doesn’t work effectively in many cases. Anonymization refers to stripping out from datasets any personal identifiers, such as name, address, credit card number, date of birth, or Social Security number. The resulting data can then be analyzed and shared without compromising anyone’s privacy. That works in a world of small data. But big data, with its increase in the quantity and variety of information, facilitates re-identification. Consider the cases of seemingly unidentifiable web searches and movie ratings.
In August 2006 AOL publicly released a mountain of old search queries, under the well-meaning view that researchers could analyze it for interesting insights. The dataset, of 20 million search queries from 657,000 users between March 1 and May 31 of that year, had been carefully anonymized. Personal information like user name and IP address was erased and replaced by unique numeric identifiers. The idea was that researchers could link together search queries from the same person but would have no identifying information.
Still, within days, the New York Times cobbled together searches like “60 single men” and “tea for good health” and “landscapers in Lilburn, Ga” to successfully identify user number 4417749 as Thelma Arnold, a 62-year-old widow from Lilburn, Georgia. “My goodness, it’s my whole personal life,” she told the Times reporter when he came knocking. “I had no idea somebody was looking over my shoulder.” The ensuing public outcry led to the ouster of AOL’s chief technology officer and two other employees.
Yet a mere two months later, in October 2006, the movie rental service Netflix did something similar in launching its “Netflix Prize.” The company released 100 million rental records from nearly half a million users—and offered a bounty of a million dollars to any team that could improve its film recommendation system by at least 10 percent. Again, personal identifiers had been carefully removed from the data. And yet again, a user was re-identified: a mother and a closeted lesbian in America’s conservative Midwest, who as a result later sued Netflix under the pseudonym “Jane Doe.”
Researchers at the University of Texas at Austin compared the Netflix data against other public information. They quickly found that ratings by one anonymized user matched those of a named contributor to the Internet Movie Database (IMDb) website. More generally, the research demonstrated that rating just six obscure movies (out of the top 500) could identify a Netflix customer 84 percent of the time. And if one knew the date on which a person rated movies as well, he or she could be uniquely identified among the nearly half a million customers in the dataset with 99 percent accuracy.
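The re-identification logic behind such findings is easy to sketch. The toy example below, with invented users, titles, ratings, and dates, shows how a handful of overlapping ratings and approximate dates can suffice to link an anonymized record to a named public profile; the actual research used more rigorous statistical matching, but the principle is the same.

```python
# Toy illustration of the re-identification logic described above.
# All records below are invented; the real study compared Netflix's
# anonymized ratings against public IMDb reviews.

from datetime import date

# Anonymized dataset: user ID -> {movie: (rating, date rated)}
anonymized = {
    "user_4871": {
        "Obscure Film A": (5, date(2005, 3, 2)),
        "Obscure Film B": (2, date(2005, 3, 9)),
        "Obscure Film C": (4, date(2005, 4, 1)),
    },
    "user_9032": {
        "Blockbuster X": (3, date(2005, 5, 5)),
        "Obscure Film C": (1, date(2005, 6, 7)),
    },
}

# Public profile scraped from a review site, with a real name attached.
public_profile = {
    "name": "J. Smith",
    "ratings": {
        "Obscure Film A": (5, date(2005, 3, 3)),
        "Obscure Film B": (2, date(2005, 3, 9)),
        "Obscure Film C": (4, date(2005, 4, 2)),
    },
}

def match_score(anon_ratings, public_ratings, day_tolerance=3):
    """Count movies where the rating matches and the dates fall within a few days."""
    score = 0
    for movie, (rating, when) in public_ratings.items():
        if movie in anon_ratings:
            anon_rating, anon_when = anon_ratings[movie]
            if anon_rating == rating and abs((anon_when - when).days) <= day_tolerance:
                score += 1
    return score

best = max(anonymized, key=lambda uid: match_score(anonymized[uid], public_profile["ratings"]))
print(best, "is probably", public_profile["name"])  # -> user_4871 is probably J. Smith
```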
In the AOL case, users’ identities were exposed by the content of their searches. In the Netflix case, the identity was revealed by a comparison of the data with other sources. In both instances, the companies failed to appreciate how big data aids de-anonymization. There are two reasons: we capture more data and we combine more data.
Paul Ohm, a law professor at the University of Colorado in Boulder and an expert on the harm done by de-anonymization, explains that no easy fix is available. Given enough data, perfect anonymization is impossible no matter how hard one tries. Worse, researchers have recently shown that not only conventional data but also the social graph—people’s connections with one another—is vulnerable to de-anonymization.
In the era of big data, the three core strategies long used to ensure privacy—individual notice and consent, opting out, and anonymization—have lost much of their effectiveness. Already today many users feel their privacy is being violated. Just wait until big-data practices become more commonplace.
Compared with East Germany a quarter-century ago, surveillance has only gotten easier, cheaper, and more powerful. The ability to capture personal data is often built deep into the tools we use every day, from websites to smartphone apps. The data recorders installed in most cars, which capture a vehicle’s actions in the few seconds before an airbag deploys, have been known to “testify” against car owners in court disputes over the circumstances of accidents.
Of course, when businesses are collecting data to improve their bottom line, we need not fear that their surveillance will have the same consequences as being bugged by the Stasi. We won’t go to prison if Amazon discovers we like to read Chairman Mao’s “Little Red Book.” Google will not exile us because we searched for “Bing.” Companies may be powerful, but they don’t have the state’s powers to coerce.
So while they are not dragging us away in the middle of the night, firms of all stripes amass mountains of personal information concerning all aspects of our lives, share it with others without our knowledge, and use it in ways we could hardly imagine.
The private sector is not alone in flexing its muscles with big data. Governments are doing this too. For instance, the U.S. National Security Agency (NSA) is said to intercept and store 1.7 billion emails, phone calls, and other communications every day, according to a Washington Post investigation in 2010. William Binney, a former NSA official, estimates that the government has compiled “20 trillion transactions” among U.S. citizens and others—who calls whom, emails whom, wires money to whom, and so on.
To make sense of all the data, the United States is building giant data centers, such as a $1.2 billion NSA facility at Camp Williams near Bluffdale, Utah. And all parts of government are demanding more information than before, not just secretive agencies involved in counterterrorism. When the collection expands to information like financial transactions, health records, and Facebook status updates, the quantity being gleaned is unthinkably large. The government can’t process so much data. So why collect it?
The answer points to the way surveillance has changed in the era of big data. In the past, investigators attached alligator clips to telephone wires to learn as much as they could about a suspect. What mattered was to drill down and get to know that individual. The modern approach is different. In the spirit of Google or Facebook, the new thinking is that people are the sum of their social relationships, online interactions, and connections with content. In order to fully investigate an individual, analysts need to look at the widest possible penumbra of data that surrounds the person—not just whom they know, but whom those people know too, and so on. This was technically very hard to do in the past. Today it’s easier than ever. And because the government never knows whom it will want to scrutinize, it collects, stores, or ensures access to information not necessarily to monitor everyone at all times, but so that when someone falls under suspicion, the authorities can immediately investigate rather than having to start gathering the info from scratch.
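That “penumbra” is, in effect, a multi-hop walk through a social graph. The sketch below, using an invented contact list, illustrates how quickly the circle of people swept into such an inquiry can grow.

```python
# Toy sketch of the "penumbra" idea: expanding outward from one person
# through a social graph. The contact list below is invented.

from collections import deque

contacts = {
    "suspect": ["alice", "bob"],
    "alice": ["carol", "suspect"],
    "bob": ["dave", "suspect"],
    "carol": ["erin"],
    "dave": [],
    "erin": [],
}

def penumbra(graph, start, hops):
    """Everyone reachable from `start` within `hops` steps (breadth-first)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(penumbra(contacts, "suspect", 2))  # direct contacts plus contacts of contacts
```

Two hops already pull in the contacts of contacts; each additional hop widens the net further.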
The United States is not the only government amassing mountains of data on people, nor is it perhaps the most egregious in its practices. However, as troubling as the ability of business and government to know our personal information may be, a newer problem emerges with big data: the use of predictions to judge us.
Probability and punishment
John Anderton is the chief of a special police unit in Washington, D.C. This particular morning, he bursts into a suburban house moments before Howard Marks, in a state of frenzied rage, is about to plunge a pair of scissors into the torso of his wife, whom he found in bed with another man. For Anderton, it is just another day preventing capital crimes. “By mandate of the District of Columbia Precrime Division,” he recites, “I’m placing you under arrest for the future murder of Sarah Marks, that was to take place today. . . .”
Other cops start restraining Marks, who screams, “I did not do anything!”
The opening scene of the film Minority Report depicts a society in which predictions seem so accurate that the police arrest individuals for crimes before they are committed. People are imprisoned not for what they did, but for what they are foreseen to do, even though they never actually commit the crime. The movie attributes this prescient and preemptive law enforcement to the visions of three clairvoyants, not to data analysis. But the unsettling future Minority Report portrays is one that unchecked big-data analysis threatens to bring about, in which judgments of culpability are based on individualized predictions of future behavior.
Already we see the seedlings of this. Parole boards in more than half of all U.S. states use predictions founded on data analysis as a factor in deciding whether to release somebody from prison or to keep him incarcerated. A growing number of places in the United States—from precincts in Los Angeles to cities like Richmond, Virginia—employ “predictive policing”: using big-data analysis to select what streets, groups, and individuals to subject to extra scrutiny, simply because an algorithm pointed to them as more likely to commit crime.
In the city of Memphis, Tennessee, a program called Blue CRUSH (for Crime Reduction Utilizing Statistical History) provides police officers with relatively precise areas of interest in terms of locality (a few blocks) and time (a few hours during a particular day of the week). The system ostensibly helps law enforcement better target its scarce resources. Since its inception in 2006, major property crimes and violent offenses have fallen by a quarter, according to one measure (though of course, this says nothing about causality; there’s nothing to indicate that the decrease is due to Blue CRUSH).
In Richmond, Virginia, police correlate crime data with additional datasets, such as information on when large companies in the city pay their employees or the dates of concerts or sports events. Doing so has confirmed and sometimes refined the cops’ suspicions about crime trends. For example, Richmond police long sensed that there was a jump in violent crime following gun shows; the big-data analysis proved them right but with a wrinkle: the spike happened two weeks afterwards, not immediately following the event.
These systems seek to forestall crimes by predicting, eventually down to the level of individuals, who might commit them. They point toward a novel use of big data: preventing crime before it happens.
A research project under the U.S. Department of Homeland Security called FAST (Future Attribute Screening Technology) tries to identify potential terrorists by monitoring individuals’ vital signs, body language, and other physiological patterns. The idea is that surveilling people’s behavior may detect their intent to do harm. In tests, the system was 70 percent accurate, according to the DHS. (What this means is unclear; were research subjects instructed to pretend to be terrorists to see if their “malintent” was spotted?) Though these systems seem embryonic, the point is that law enforcement takes them very seriously.
Stopping a crime from happening sounds like an enticing prospect. Isn’t preventing infractions before they take place far better than penalizing the perpetrators afterwards? Wouldn’t forestalling crimes benefit not just those who might have been victimized by them, but society as a whole?
But it’s a perilous path to take. If through big data we predict who may commit a future crime, we may not be content with simply preventing the crime from happening; we are likely to want to punish the probable perpetrator as well. That is only logical. If we just step in and intervene to stop the illicit act from taking place, the putative perpetrator may try again with impunity. In contrast, by using big data to hold him responsible for his (future) acts, we may deter him and others.
Such prediction-based punishment seems an improvement over practices we have already come to accept. Preventing unhealthy, dangerous, or risky behavior is a cornerstone of modern society. We have made smoking harder in order to prevent lung cancer; we require seatbelts to avert fatalities in car accidents; we don’t let people board airplanes with guns to avoid hijackings. Such preventive measures constrain our freedom, but many see them as a small price to pay in return for avoiding much graver harm.
In many contexts, data analysis is already employed in the name of prevention. It is used to lump us into cohorts of people like us, and we are often characterized accordingly. Actuarial tables note that men over 50 are prone to prostate cancer, so members of that group may pay more for health insurance even if they never get prostate cancer. High-school students with good grades, as a group, are less likely to get into car accidents—so some of their less-learned peers have to pay higher insurance premiums. Individuals with certain characteristics are subjected to extra screening when they pass through airport security.
That’s the idea behind “profiling” in today’s small-data world. Find a common association in the data, define a group of people to whom it applies, and then place those people under additional scrutiny. It is a generalizable rule that applies to everyone in the group. “Profiling,” of course, is a loaded word, and the method has serious downsides. If misused, it can lead not only to discrimination against certain groups but also to “guilt by association.”
Big-data predictions about people, however, are different. Today’s forecasts of likely behavior—found in things like insurance premiums or credit scores—usually rely on a handful of factors that are based on a mental model of the issue at hand (that is, previous health problems or loan repayment history). With big data’s non-causal analysis, by contrast, we often simply identify the most suitable predictors from the sea of information.
Most important, using big data we hope to identify specific individuals rather than groups; this liberates us from profiling’s shortcoming of making every predicted suspect a case of guilt by association. In a big-data world, somebody with an Arabic name, who has paid in cash for a one-way ticket in first class, may no longer be subjected to secondary screening at an airport if other data specific to him make it very unlikely that he’s a terrorist. With big data we can escape the straitjacket of group identities, and replace them with much more granular predictions for each individual.
The promise of big data is that we do what we’ve been doing all along—profiling—but make it better, less discriminatory, and more individualized. That sounds acceptable if the aim is simply to prevent unwanted actions. But it becomes very dangerous if we use big-data predictions to decide whether somebody is culpable and ought to be punished for behavior that has not yet happened.
The very idea of penalizing based on propensities is nauseating. To accuse a person of some possible future behavior is to negate the very foundation of justice: that one must have done something before we can hold him accountable for it. After all, thinking bad things is not illegal; doing them is. It is a fundamental tenet of our society that individual responsibility is tied to individual choice of action. If one is forced at gunpoint to open the company’s safe, one has no choice and thus isn’t held responsible.
If big-data predictions were perfect, if algorithms could foresee our future with flawless clarity, we would no longer have a choice to act in the future. We would behave exactly as predicted. Were perfect predictions possible, they would deny human volition, our ability to live our lives freely. Also, ironically, by depriving us of choice they would exculpate us from any responsibility.
Of course perfect prediction is impossible. Rather, big-data analysis will predict that for a specific individual, a particular future behavior has a certain probability. Consider, for example, research conducted by Richard Berk, a professor of statistics and criminology at the University of Pennsylvania. He claims his method can predict whether a person released on parole will be involved in a homicide (either kill or be killed). As inputs he uses numerous case-specific variables, including reason for incarceration and date of first offense, but also demographic data like age and gender. Berk suggests that he can forecast a future murder among those on parole with at least a 75 percent probability. That’s not bad. However, it also means that should parole boards rely on Berk’s analysis, they would be wrong as often as one out of four times.
But the core problem with relying on such predictions is not that they expose society to risk. The fundamental trouble is that with such a system we essentially punish people before they do something bad. And by intervening before they act (for instance by denying them parole if predictions show there is a high probability that they will murder), we never know whether or not they would have actually committed the predicted crime. We do not let fate play out, and yet we hold individuals responsible for what our prediction tells us they would have done. Such predictions can never be disproven.
This negates the very idea of the presumption of innocence, the principle upon which our legal system, as well as our sense of fairness, is based. And if we hold people responsible for predicted future acts, ones they may never commit, we also deny that humans have a capacity for moral choice.
The important point here is not simply one of policing. The danger is much broader than criminal justice; it covers all areas of society, all instances of human judgment in which big-data predictions are used to decide whether people are culpable for future acts or not. Those include everything from a company’s decision to dismiss an employee, to a doctor denying a patient surgery, to a spouse filing for divorce.
Perhaps with such a system society would be safer or more efficient, but an essential part of what makes us human—our ability to choose the actions we take and be held accountable for them—would be destroyed. Big data would have become a tool to collectivize human choice and abandon free will in our society.
Of course, big data offers numerous benefits. What turns it into a weapon of dehumanization is a shortcoming, not of big data itself, but of the ways we use its predictions. The crux is that holding people culpable for predicted acts before they can commit them uses big-data predictions based on correlations to make causal decisions about individual responsibility.
Big data is useful to understand present and future risk, and to adjust our actions accordingly. Its predictions help patients and insurers, lenders and consumers. But big data does not tell us anything about causality. In contrast, assigning “guilt”—individual culpability—requires that the people we judge have chosen a particular action, and that their choice caused what followed. Precisely because big data is based on correlations, it is an utterly unsuitable tool to help us judge causality and thus assign individual culpability.
The trouble is that humans are primed to see the world through the lens of cause and effect. Thus big data is under constant threat of being abused for causal purposes, of being tied to rosy visions of how much more effective our judgment, our human decision-making in assigning culpability, could be if only we were armed with big-data predictions.
It is the quintessential slippery slope—leading straight to the society portrayed in Minority Report, a world in which individual choice and free will have been eliminated, in which our individual moral compass has been replaced by predictive algorithms and individuals are exposed to the unencumbered brunt of collective fiat. If so employed, big data threatens to imprison us—perhaps literally—in probabilities.
The dictatorship of data
Big data erodes privacy and threatens freedom. But big data also exacerbates a very old problem: relying on the numbers when they are far more fallible than we think. Nothing underscores the consequences of data analysis gone awry more than the story of Robert McNamara.
McNamara was a numbers guy. Appointed the U.S. secretary of defense when tensions in Vietnam started in the early 1960s, he insisted on getting data on everything he could. Only by applying statistical rigor, he believed, could decision-makers understand a complex situation and make the right choices. The world in his view was a mass of unruly information that if delineated, denoted, demarcated, and quantified could be tamed by human hand and would fall under human will. McNamara sought Truth, and that Truth could be found in data. Among the numbers that came back to him was the “body count.”
McNamara developed his love of numbers as a student at Harvard Business School and then its youngest assistant professor at age 24. He applied this rigor during the Second World War as part of an elite Pentagon team called Statistical Control, which brought data-driven decision-making to one of the world’s largest bureaucracies. Prior to this, the military was blind. It didn’t know, for instance, the type, quantity, or location of spare airplane parts. Data came to the rescue. Just making armament procurement more efficient saved $3.6 billion in 1943. Modern war was about the efficient allocation of resources; the team’s work was a stunning success.
At war’s end, the group decided to stick together and offer their skills to corporate America. The Ford Motor Company was floundering, and a desperate Henry Ford II handed them the reins. Just as they knew nothing about the military when they helped win the war, so too were they clueless about car making. Still, the so-called “Whiz Kids” turned the company around.
McNamara rose swiftly up the ranks, trotting out a data point for every situation. Harried factory managers produced the figures he demanded—whether they were correct or not. When an edict came down that all inventory from one car model must be used before a new model could begin production, exasperated line managers simply dumped excess parts into a nearby river. The brass at headquarters nodded approvingly when the foremen sent back numbers confirming that the order had been obeyed. But the joke at the factory was that a fellow could walk on water—atop rusted pieces of 1950 and 1951 cars.
McNamara epitomized the mid-twentieth-century manager, the hyper-rational executive who relied on numbers rather than sentiments, and who could apply his quantitative skills to whatever industry he turned to. In 1960 he was named president of Ford, a position he held for only a few weeks before President Kennedy appointed him secretary of defense.
As the Vietnam conflict escalated and the United States sent more troops, it became clear that this was a war of wills, not of territory. America’s strategy was to pound the Viet Cong to the negotiating table. The way to measure progress, therefore, was by the number of enemy killed. The body count was published daily in the newspapers. To the war’s supporters it was proof of progress; to critics, evidence of its immorality. The body count was the data point that defined an era.
In 1977, two years after the last helicopter lifted off the rooftop of the U.S. embassy in Saigon, a retired Army general, Douglas Kinnard, published a landmark survey of the generals’ views. Called The War Managers, the book revealed the quagmire of quantification. A mere 2 percent of America’s generals considered the body count a valid way to measure progress. Around two-thirds said it was often inflated. “A fake—totally worthless,” wrote one general in his comments. “Often blatant lies,” wrote another. “They were grossly exaggerated by many units primarily because of the incredible interest shown by people like McNamara,” said a third.
Like the factory men at Ford who dumped engine parts into the river, junior officers sometimes gave their superiors impressive numbers to keep their commands or boost their careers—telling the higher-ups what they wanted to hear. McNamara and the men around him relied on the figures, fetishized them. With his perfectly combed-back hair and his flawlessly knotted tie, McNamara felt he could only comprehend what was happening on the ground by staring at a spreadsheet—at all those orderly rows and columns, calculations and charts, whose mastery seemed to bring him one standard deviation closer to God.
The use, abuse, and misuse of data by the U.S. military during the Vietnam War is a troubling lesson about the limitations of information in an age of small data, a lesson that must be heeded as the world hurtles toward the big-data era. The quality of the underlying data can be poor. It can be biased. It can be mis-analyzed or used misleadingly. And even more damningly, data can fail to capture what it purports to quantify.
We are more susceptible than we may think to the “dictatorship of data”—that is, to letting the data govern us in ways that may do as much harm as good. The threat is that we will let ourselves be mindlessly bound by the output of our analyses even when we have reasonable grounds for suspecting something is amiss. Or that we will become obsessed with collecting facts and figures for data’s sake. Or that we will attribute a degree of truth to the data which it does not deserve.
As more aspects of life become datafied, the solution that policymakers and businesspeople are starting to reach for first is to get more data. “In God we trust—all others bring data,” is the mantra of the modern manager, heard echoing in Silicon Valley cubicles, on factory floors, and along the corridors of government agencies. The sentiment is sound, but one can easily be deluded by data.
Education seems on the skids? Push standardized tests to measure performance and penalize teachers or schools that by this measure aren’t up to snuff. Whether the tests actually capture the abilities of schoolchildren, the quality of teaching, or the needs of a creative, adaptable modern workforce is an open question—but one the data itself cannot answer.
Want to prevent terrorism? Create layers of watch lists and no-fly lists in order to police the skies. But whether such datasets offer the protection they promise is in doubt. In one famous incident, the late Senator Ted Kennedy of Massachusetts was ensnared by the no-fly list, stopped, and questioned, simply for having the same name as a person in the database.
People who work with data have an expression for some of these problems: “garbage in, garbage out.” In certain cases, the reason is the quality of the underlying information. Often, though, it is the misuse of the analysis that is produced. With big data, these problems may arise more frequently or have larger consequences.
Google, as we’ve shown in many examples, runs everything according to data. That strategy has obviously led to much of its success. But it also trips up the company from time to time. Its co-founders, Larry Page and Sergey Brin, long insisted on knowing all job candidates’ SAT scores and their grade point averages when they graduated from college. In their thinking, the first number measured potential and the second measured achievement. Accomplished managers in their forties who were being recruited were hounded for the scores, to their outright bafflement. The company even continued to demand the numbers long after its internal studies showed no correlation between the scores and job performance.
Google ought to know better, to resist being seduced by data’s false charms. The measure leaves little room for change in a person’s life. It counts book-smarts rather than knowledge. And it may not reflect the qualifications of people from the humanities, where know-how may be less quantifiable than in science and engineering. Google’s obsession with such data for HR purposes is especially odd considering that the company’s founders are products of Montessori schools, which emphasize learning, not grades. And it repeats the mistakes of past technology powerhouses that vaunted people’s résumés above their actual abilities. Would Larry and Sergey, as PhD dropouts, have stood a chance of becoming managers at the legendary Bell Labs? By Google’s standards, neither Bill Gates, nor Mark Zuckerberg, nor Steve Jobs would have been hired, since they lack college degrees.
The firm’s reliance on data sometimes seems overblown. Marissa Mayer, when she was one of its top executives, once ordered staff to test 41 gradations of blue to see which one people clicked on more, in order to determine the color of a toolbar on the site. Google’s deference to data has been taken to such extremes that it has even sparked revolt.
In 2009 Google’s top designer, Douglas Bowman, quit in a huff because he couldn’t stand the constant quantification of everything. “I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can’t operate in an environment like that,” he wrote on a blog announcing his resignation. “When a company is filled with engineers, it turns to engineering to solve problems. Reduce each decision to a simple logic problem. That data eventually becomes a crutch for every decision, paralyzing the company.”
Brilliance doesn’t depend on data. Steve Jobs may have continually improved the Mac laptop over years on the basis of field reports, but he used his intuition, not data, to launch the iPod, iPhone, and iPad. He relied on his sixth sense. “It isn’t the consumers’ job to know what they want,” he famously said, when telling a reporter that Apple did no market research before releasing the iPad.
In the book Seeing Like a State, the anthropologist James Scott of Yale University documents the ways in which governments, in their fetish for quantification and data, end up making people’s lives miserable rather than better. They use maps to determine how to reorganize communities rather than learn anything about the people on the ground. They use long tables of data about harvests to decide to collectivize agriculture without knowing a whit about farming. They take all the imperfect, organic ways in which people have interacted over time and bend them to their needs, sometimes just to satisfy a desire for quantifiable order. The use of data, in Scott’s view, often serves to empower the powerful.
This is the dictatorship of data writ large. And it was a similar hubris that led the United States to escalate the Vietnam War partly on the basis of body counts, rather than to base decisions on more meaningful metrics. “It is true enough that not every conceivable complex human situation can be fully reduced to the lines on a graph, or to percentage points on a chart, or to figures on a balance sheet,” said McNamara in a speech in 1967, as domestic protests were growing. “But all reality can be reasoned about. And not to quantify what can be quantified is only to be content with something less than the full range of reason.” If only the right data had been used in the right way, rather than revered for data’s sake.
Robert Strange McNamara went on to run the World Bank throughout the 1970s, then painted himself as a dove in the 1980s. He became an outspoken critic of nuclear weapons and a proponent of environmental protection. Later in life he underwent an intellectual conversion and produced a memoir, In Retrospect, that criticized the thinking behind the war and his own decisions as secretary of defense. “We were wrong, terribly wrong,” he wrote. But he was referring to the war’s broad strategy. On the question of data, and of body counts in particular, he remained unrepentant. He admitted many of the statistics were “misleading or erroneous.” “But things you can count, you ought to count. Loss of life is one. . . .” McNamara died in 2009 at age 93, a man of intelligence but not of wisdom.
Big data may lure us to commit the sin of McNamara: to become so fixated on the data, and so obsessed with the power and promise it offers, that we fail to appreciate its limitations. To catch a glimpse of the big-data equivalent of the body count, we need only look back at Google Flu Trends. Consider a situation, not entirely implausible, in which a deadly strain of influenza rages across the country. Medical professionals would be grateful for the ability to forecast in real time the biggest hotspots by dint of search queries. They’d know where to intervene with help.
But suppose that in a moment of crisis political leaders argue that simply knowing where the disease is likely to get worse and trying to head it off is not enough. So they call for a general quarantine—not for all people in those regions, which would be unnecessary and overbroad. Big data allows us to be more particular. So the quarantine applies only to the individual Internet users whose searches were most highly correlated with having the flu. Here we have the data on whom to pick up. Federal agents, armed with lists of Internet Protocol addresses and mobile GPS information, herd the individual web searchers into quarantine centers.
But as reasonable as this scenario might sound to some, it is just plain wrong. Correlations do not imply causation. These people may or may not have the flu. They’d have to be tested. They’d be prisoners of a prediction, but more important, they’d be victims of a view of data that lacks an appreciation for what the information actually means. The point of the actual Google Flu Trends study is that certain search terms are correlated with the outbreak—but the correlation may exist because of circumstances like healthy co-workers hearing sneezes in the office and going online to learn how to protect themselves, not because the searchers are ill themselves.
As we have seen, big data allows for more surveillance of our lives while it makes some of the legal means for protecting privacy largely obsolete. It also renders ineffective the core technical method of preserving anonymity. Just as unsettling, big-data predictions about individuals may be used to, in effect, punish people for their propensities, not their actions. This denies free will and erodes human dignity.
At the same time, there is a real risk that the benefits of big data will lure people into applying the techniques where they don’t perfectly fit, or into feeling overly confident in the results of the analyses. As big-data predictions improve, using them will only become more appealing, fueling an obsession over data since it can do so much. That was the curse of McNamara and is the lesson his story holds.
We must guard against overreliance on data rather than repeat the error of Icarus, who adored his technical power of flight but used it improperly and tumbled into the sea. In the next chapter, we’ll consider ways that we can control big data, lest we be controlled by it.