‘The first and most basic rule is to consider social facts as things . . . To treat phenomena as things is to treat them as data, and this constitutes the starting point for science’ Émile Durkheim, The Rules of Sociological Method (1895)
In the digital lifeworld, a growing amount of social activity will be captured and recorded as data, then sorted, stored, and processed by digital systems. More and more of our actions, utterances, movements, relationships, emotions, and beliefs will leave a permanent or semi-permanent digital mark. As well as chronicling human life, data will increasingly be gathered on the natural world, the activity of machines, and the built environment. All this data, in turn, will be used for commercial purposes, to train machine learning AI systems, and to predict and control human behaviour. This is the increasingly quantified society.
The twenty-first century has seen an explosion in the amount of data generated and processed by human beings and machines. It’s predicted that by 2020 there will be at least 40 zettabytes of data in the world—the equivalent of more than 3 million books for every living person.1 By then it’s expected that we will generate the same amount of information every couple of hours as humans generated from the dawn of civilization until 2003.2 Already we create as much every ten minutes as the first ten thousand generations of humans combined.3 Like computer processing power, the speed with which we produce information is expected to continue to grow exponentially.4
What is data, and where is it all coming from?
In Big Data (2013), Viktor Mayer-Schönberger and Kenneth Cukier explain that data is ‘a description of something that allows it to be recorded, analyzed, and reorganized’. The process of turning a phenomenon into data has been called datafication.5 We have datafied and digitized (turned into binary code legible by machines) enormous swathes of activity on earth. As late as 2000, only a quarter or so of the world’s stored information was in a digital form. Now, it is more than 98 per cent.6 Four factors have made this possible. First, much more data is gathered, because an increasing amount of social activity is undertaken by and through digital systems and platforms. Second, in the last fifty years the cost of digital storage has halved every two years or so, while increasing in density ‘50-million fold’.7 Third, the explosion in computational power has given us the ability to process what we store. Fourth, digital information has almost no marginal cost of reproduction: it can be replicated millions of times very cheaply. Together, these factors explain why the transition from a print-based information system to a digital one has yielded such an explosion of data.
Mayer-Schönberger and Cukier compare current developments with the last ‘information revolution’: the invention of the printing press by Johannes Gutenberg nearly 600 years ago. In the fifty years following Gutenberg’s innovation, more than 8 million books were printed. This change was described as ‘revolutionary’ by the scholar Elizabeth Eisenstein because it meant, in all likelihood, that more books had been printed in half a century than had been handwritten by ‘all the scribes in Europe’ in the previous 1,200 years.8 Yet if it took fifty or so years for the amount of data in existence to double in Gutenberg’s day, consider that the same feat is now being achieved roughly every two years.9
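The gap between those two doubling times is easier to appreciate with a little arithmetic. The following is only an illustrative sketch: it assumes, for simplicity, that each doubling time stays perfectly constant over a fifty-year window, which real data growth certainly does not, and the function name is mine rather than anything drawn from the sources cited.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth over `years`, assuming a constant doubling time."""
    return 2 ** (years / doubling_time_years)


# Gutenberg's era: one doubling in roughly fifty years.
gutenberg_growth = growth_factor(50, 50)

# Today: one doubling roughly every two years, i.e. 25 doublings in fifty years.
modern_growth = growth_factor(50, 2)

print(f"Fifty years at a 50-year doubling time: x{gutenberg_growth:,.0f}")
print(f"Fifty years at a 2-year doubling time:  x{modern_growth:,.0f}")
```

On these simplified assumptions, fifty years of growth at today's pace would multiply the stock of data more than 33 million times over, against a mere doubling in Gutenberg's half-century.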
Much of the data in the world originates with human beings. Sometimes we deliberately bring it into existence, as when we use our devices to record and communicate. Every day we send around 269 billion emails10 (about thirty-six per person), upload 350 million photographs to Facebook, and fire off 500 million tweets.11 Even when they don’t seem rich in data, these communicative acts can capture the internal life of humans in a way that was previously impracticable. Even something as paltry as a tweet, initially limited to 140 characters, is deceptively rich in information. It includes thirty-three items of metadata (‘information about information’) which can be quite revealing in aggregate:12
an analysis of 509 million tweets over two years from 2.4 million people in 84 countries showed that people’s moods followed similar daily and weekly patterns across cultures around the world—something that had not been possible to spot before. Moods have been datafied.
Away from social media platforms, some people deliberately choose to monitor the data emitted by their bodies—generally for health and wellness reasons, but sometimes for fun or curiosity. For a small group, the phenomenon of sousveillance goes beyond breathing rate and pulse. There are plans for:13
comprehensive life-logs that would create a unified, digital record of an individual’s experiences . . . a continuous, searchable, analysable record of the past that includes every action, every event, every conversation, every location visited, every material expression of an individual’s life, as well as physiological conditions inside the body and external conditions (e.g. orientation, temperature, levels of pollution).
Needless to say, much of this data is shared with the manufacturers of the devices in question. If we so choose, the deepest workings of our bodies are now quite datafiable, right down to the information contained in our DNA. It took ‘a decade of intensive work’ to decode the human genome by 2003. The same task can now be done in a day.14
Even when not consciously creating or hoarding data, we leave a ‘data exhaust’ just by going about our lives.15 The trail of digital breadcrumbs we leave is discreetly hoovered up by the devices on or around us. Some, like our tax and phone records, are fairly mundane. Others are less so, like applications on smartphones, which use GPS to trace and record our location, even when location has nothing to do with the app in question. According to Marc Goodman, 80 per cent of Android apps behave in this way.16 In 2012, researchers were able to use a smartphone-cellular system (another way of tracking location, in addition to GPS) to predict, within 20 metres, where someone would be twenty-four hours later.17 Some 82 per cent of apps track your general online activity.18
We submit more than 60,000 search requests to Google every second, or more than 3.5 billion each day.19 Each one, together with what Google knows about the identity of the searcher, is fed into Google’s growing silo of information. If all the data processed in one day by Google were printed in books, and those books were stacked on top of each other, then the pile would now reach more than halfway from earth to the moon. That’s just each day.20 Facebook, too, holds a remarkable amount of information about each of its users. When Max Schrems, an Austrian privacy activist who used Facebook occasionally over a three-year period, asked to see the personal data about him stored by Facebook, he received a CD-ROM containing a 1,222-page document, including phone numbers and email addresses of his friends and family; the devices he had used to log in; events to which he had been invited; his ‘friends’ and former ‘friends’; and an archive of his private messages, including transcripts of messages he thought he had deleted. Even this cache was probably incomplete: it excluded, for instance, facial recognition data and information about his website usage.21 Mr Schrems was just one of (then nearly, now more than) 2 billion active users, from whom Facebook has built an extraordinarily rich profile of human life.
Finally, data is increasingly generated by machines. Some are juggernauts belching out large amounts of data. When firing, the Large Hadron Collider at CERN generates 40 terabytes of data every second.22 In its first few weeks of operation in 2000, the Sloan Digital Sky Survey telescope harvested more data than had previously been gathered in the history of astronomy.23 In the future, the largest data-contributors will be pervasive devices distributed around the planet. Mid-range cars already contain multiple microprocessors and sensors, allowing them to upload performance data to carmakers when the vehicle is serviced.24 The proportion of the world’s data drawn from machine sensors was 11 per cent in 2005; it is predicted to increase to 42 per cent in 2020.25
Data scientists have always wrestled with the challenge of turning raw data into information (by cleaning, processing, and organizing it), then into knowledge (by analysing and interpreting it).26 The arrival of big data has required some methodological innovation. As Mayer-Schönberger and Cukier explain, the benefit of analysing vast amounts of data about a topic rather than using a small representative sample has depended upon data scientists’ willingness to accept ‘data’s real-world messiness’ rather than seeking precision.27 In the 1990s IBM launched Candide, its effort to automate language translation using ten years’ worth of high-quality transcripts from the Canadian parliament. When Google began developing its translation system in 2006, it took a different approach, harvesting many more documents from across the internet. Google’s scruffy dataset of around 95 billion English sentences, including translations of poor or middling quality, vastly outperformed Candide’s repository of 3 million well-translated English sentences. It’s not that Google’s initial algorithm was superior. What made the difference was that Google’s unfiltered and imperfect dataset was tens of thousands of times larger than Candide’s. The Google approach treated language as ‘messy data with which to judge probabilities’, an approach that proved to be considerably more effective.28
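The scale of that difference can be checked directly from the two sentence counts quoted above. This is a rough sketch only: both figures are approximations as reported, and the constant and function names are my own labels for illustration.

```python
# Approximate sentence counts as quoted in the text.
GOOGLE_SENTENCES = 95_000_000_000   # "around 95 billion English sentences"
CANDIDE_SENTENCES = 3_000_000       # "3 million well-translated English sentences"


def size_ratio(larger: int, smaller: int) -> float:
    """How many times bigger the larger dataset is than the smaller one."""
    return larger / smaller


ratio = size_ratio(GOOGLE_SENTENCES, CANDIDE_SENTENCES)
print(f"Google's dataset was roughly {ratio:,.0f} times larger than Candide's")
```

The result is a ratio a little above 30,000, which is what ‘tens of thousands of times larger’ amounts to in practice.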
Data is valuable, and the more of it is gathered in one place, the more its value increases. When we search the web, for instance, the contents of each search are of infinitesimal value, but when searches are aggregated they offer a profound window into searchers’ thoughts, beliefs, concerns, health, market activity, musical tastes, sexual preferences, and much more besides. We surrender our personal data in exchange for free services, something I call the Data Deal in chapter eighteen. The commercial value of Facebook lies primarily in the data that it harvests from its users, which can be used for a range of purposes, from targeted advertising to building face recognition AI systems. When Facebook went public in 2012, each person’s profile was estimated as being worth $100 to the company.29 Famously, when sales improved after book recommendations were generated by algorithms rather than people, Amazon sacked all its in-house book reviewers. This is why data has been called a ‘raw material of business’, a ‘factor of production’, and ‘the new coal’.30 The ensuing rush has spawned a multi-billion-dollar industry ‘that does nothing except buy and sell the personal data we share online’.31
It’s not just businesses that are interested in big data. Governments are too—from municipal regimes designing smart cities to central governments using it to monitor compliance. The British tax authorities, for instance, use a fraud detection system that holds more data than the British Library (which has copies of every book ever published in the United Kingdom).32 It is also increasingly apparent that governments use personal data for global surveillance. Two US National Security Agency (NSA) internal databases code-named HAPPYFOOT and FASCIA contain comprehensive location information of electronic devices worldwide.33
An increasingly quantified society is one that is more available for examination and analysis by machines and those who control them. As more and more social activity is captured in data, systems endowed with exceptional computational power will be able to build increasingly nuanced digital maps of human life: massive, incredibly detailed, and updated in real time. These schematics, abstracted from the real world but faithfully reflecting it, will be invaluable not just to those who wish to sell us stuff, but also to those who seek to understand and govern our collective life. And when political authorities use data not just to study or influence human behaviour, but to predict what will happen before we even know it (whether a convict will reoffend, whether a patient will die), the implications are profound. As I explained in the Introduction, there has always been a close connection between information and control. In an increasingly quantified society, that connection assumes even greater importance.
The future described in the last three chapters is not inevitable. In theory at least, we could halt the innovation already in progress, so that the digital lifeworld never comes into existence. But this is unlikely. Innovation is driven by powerful individual and shared human desires—for prosperity, security, safety, convenience, comfort, and connectivity—all of which are nurtured by a market system designed to stimulate and satisfy those desires. My view is that politics in the future will largely unfold within the parameters of the lifeworld generated by these new technologies, with debate centred on how they should be used, owned, controlled, and distributed, and not on whether we can force the genie back into the lamp. In chapter four, we consider how to think clearly and critically about what this means for politics.