3
USING ALL AVAILABLE DATA is feasible in an increasing number of contexts. But it comes at a cost. Increasing the volume opens the door to inexactitude. To be sure, erroneous figures and corrupted bits have always crept into datasets. Yet the point has always been to treat them as problems and try to get rid of them, in part because we could. What we never wanted to do was consider them unavoidable and learn to live with them. This is one of the fundamental shifts of going from small data to big data.
In a world of small data, reducing errors and ensuring high quality of data was a natural and essential impulse. Since we only collected a little information, we made sure that the figures we bothered to record were as accurate as possible. Generations of scientists optimized their instruments to make their measurements more and more precise, whether for determining the position of celestial bodies or the size of objects under a microscope. In a world of sampling, the obsession with exactitude was even more critical. Analyzing only a limited number of data points means errors may get amplified, potentially reducing the accuracy of the overall results.
For much of history, humankind’s highest achievements arose from conquering the world by measuring it. The quest for exactitude began in Europe in the middle of the thirteenth century, when astronomers and scholars took on the ever more precise quantification of time and space—“the measure of reality,” in the words of the historian Alfred Crosby.
If one could measure a phenomenon, the implicit belief was, one could understand it. Later, measurement was tied to the scientific method of observation and explanation: the ability to quantify, record, and present reproducible results. “To measure is to know,” pronounced Lord Kelvin. It became a basis of authority. “Knowledge is power,” instructed Francis Bacon. In parallel, mathematicians, and what later became actuaries and accountants, developed methods that made possible the accurate collection, recording, and management of data.
By the nineteenth century France—then the world’s leading scientific nation—had developed a system of precisely defined units of measurement to capture space, time, and more, and had begun to get other nations to adopt the same standards. This went as far as enshrining, in international treaties, accepted prototype units to measure against. It was the apex of the age of measurement. Just half a century later, in the 1920s, the discoveries of quantum mechanics shattered forever the dream of comprehensive and perfect measurement. And yet, outside a relatively small circle of physicists, the mindset of humankind’s drive to flawlessly measure continued among engineers and scientists. In the world of business it even expanded, as the rational sciences of mathematics and statistics began to influence all areas of commerce.
However, in many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming. It is a tradeoff. In return for relaxing the standards of allowable errors, one can get ahold of much more data. It isn’t just that “more trumps some,” but that, in fact, sometimes “more trumps better.”
There are several kinds of messiness to contend with. The term can refer to the simple fact that the likelihood of errors increases as you add more data points. Hence, increasing the stress readings from a bridge by a factor of a thousand boosts the chance that some may be wrong. But you can also increase messiness by combining different types of information from different sources, which don’t always align perfectly. For example, using voice-recognition software to characterize complaints to a call center, and comparing that data with the time it takes operators to handle the calls, may yield an imperfect but useful snapshot of the situation. Messiness can also refer to the inconsistency of formatting, for which the data needs to be “cleaned” before being processed. There are a myriad of ways to refer to IBM, notes the big-data expert DJ Patil, from I.B.M. to T. J. Watson Labs, to International Business Machines. And messiness can arise when we extract or process the data, since in doing so we are transforming it, turning it into something else, such as when we perform sentiment analysis on Twitter messages to predict Hollywood box office receipts. Messiness itself is messy.
Suppose we need to measure the temperature in a vineyard. If we have only one temperature sensor for the whole plot of land, we must make sure it’s accurate and working at all times: no messiness allowed. In contrast, if we have a sensor for every one of the hundreds of vines, we can use cheaper, less sophisticated sensors (as long as they do not introduce a systematic bias). Chances are that at some points a few sensors may report incorrect data, creating a less exact, or “messier,” dataset than the one from a single precise sensor. Any particular reading may be incorrect, but the aggregate of many readings will provide a more comprehensive picture. Because this dataset consists of more data points, it offers far greater value, which likely offsets its messiness.
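To make the arithmetic concrete, here is a minimal simulation of the tradeoff. The temperature, noise levels, and sensor count are invented for illustration; they are not drawn from any real vineyard.

```python
import random

random.seed(42)
TRUE_TEMP = 18.0  # hypothetical "true" vineyard temperature, in Celsius

# One expensive, precise sensor: very little measurement noise.
precise_reading = TRUE_TEMP + random.gauss(0, 0.1)

# Hundreds of cheap sensors: each one noisy, but with no systematic bias.
cheap_readings = [TRUE_TEMP + random.gauss(0, 2.0) for _ in range(500)]
average_cheap = sum(cheap_readings) / len(cheap_readings)
worst_cheap = max(cheap_readings, key=lambda r: abs(r - TRUE_TEMP))

print(f"single precise sensor      : {precise_reading:.2f}")
print(f"worst of 500 cheap sensors : {worst_cheap:.2f}")
print(f"average of 500 cheap       : {average_cheap:.2f}")
```

Any one cheap reading can be far off, yet the average of hundreds of them lands close to the true value, and the dense grid also reveals vine-by-vine differences that a single pampered sensor could never show.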
Now suppose we increase the frequency of the sensor readings. If we take one measurement per minute, we can be fairly sure that the sequence with which the data arrives will be perfectly chronological. But if we change that to ten or a hundred readings per second, the accuracy of the sequence may become less certain. As the information travels across a network, a record may get delayed and arrive out of sequence, or may simply get lost in the flood. The information will be a bit less accurate, but its great volume makes it worthwhile to forgo strict exactitude.
In the first example, we sacrificed the accuracy of each data point for breadth, and in return we received detail that we otherwise could not have seen. In the second case, we gave up exactitude for frequency, and in return we saw change that we otherwise would have missed. Although we may be able to overcome the errors if we throw enough resources at them—after all, 30,000 trades per second take place on the New York Stock Exchange, where the correct sequence matters a lot—in many cases it is more fruitful to tolerate error than it would be to work at preventing it.
For instance, we can accept some messiness in return for scale. As Forrester, a technology consultancy, puts it, “Sometimes two plus two can equal 3.9, and that is good enough.” Of course the data can’t be completely incorrect, but we’re willing to sacrifice a bit of accuracy in return for knowing the general trend. Big data transforms figures into something more probabilistic than precise. This change will take a lot of getting used to, and it comes with problems of its own, which we’ll consider later in the book. But for now it is worth simply noting that we often will need to embrace messiness when we increase scale.
One sees a similar shift in terms of the importance of more data relative to other improvements in computing. Everyone knows how much processing power has increased over the years as predicted by Moore’s Law, which states that the number of transistors on a chip doubles roughly every two years. This continual improvement has made computers faster and memory more plentiful. Fewer of us know that the performance of the algorithms that drive many of our systems has also increased—in many areas by more than processors have improved under Moore’s Law. Many of the gains to society from big data, however, happen not so much because of faster chips or better algorithms but because there is more data.
For example, chess algorithms have changed only slightly in the past few decades, since the rules of chess are fully known and tightly constrained. The reason computer chess programs play far better today than in the past is in part that they are playing their endgame better. And they’re doing that simply because the systems have been fed more data. In fact, endgames when six or fewer pieces are left on the chessboard have been completely analyzed and all possible moves (N=all) have been represented in a massive table that when uncompressed fills more than a terabyte of data. This enables chess computers to play the endgame flawlessly. No human will ever be able to outplay the system.
The degree to which more data trumps better algorithms has been powerfully demonstrated in the area of natural language processing: the way computers learn how to parse words as we use them in everyday speech. Around 2000, Microsoft researchers Michele Banko and Eric Brill were looking for a method to improve the grammar checker that is part of the company’s Word program. They weren’t sure whether it would be more useful to put their effort into improving existing algorithms, finding new techniques, or adding more sophisticated features. Before going down any of these paths, they decided to see what happened when they fed a lot more data into the existing methods. Most machine-learning algorithms relied on corpuses of text that totaled a million words or less. Banko and Brill took four common algorithms and fed in up to three orders of magnitude more data: 10 million words, then 100 million, and finally a billion words.
The results were astounding. As more data went in, the performance of all four types of algorithms improved dramatically. In fact, a simple algorithm that was the worst performer with half a million words performed better than the others when it crunched a billion words. Its accuracy rate went from 75 percent to above 95 percent. Conversely, the algorithm that worked best with a little data performed the least well with larger amounts, though like the others it improved a lot, going from around 86 percent to about 94 percent accuracy. “These results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development,” Banko and Brill wrote in one of their research papers on the topic.
So more trumps less. And sometimes more trumps smarter. What then of messy? A few years after Banko and Brill shoveled in all that data, researchers at rival Google were thinking along similar lines—but at an even larger scale. Instead of testing algorithms with a billion words, they used a trillion. Google did this not to develop a grammar checker but to crack an even more complex nut: language translation.
So-called machine translation has been a vision of computer pioneers since the dawn of computing in the 1940s, when the devices were made of vacuum tubes and filled an entire room. The idea took on a special urgency during the Cold War, when the United States captured vast amounts of written and spoken material in Russian but lacked the manpower to translate it quickly.
At first, computer scientists opted for a combination of grammatical rules and a bilingual dictionary. An IBM computer translated sixty Russian phrases into English in 1954, using 250 word pairs in the computer’s vocabulary and six rules of grammar. The results were very promising. “Mi pyeryedayem mislyi posryedstvom ryechyi,” was entered into the IBM 701 machine via punch cards, and out came “We transmit thoughts by means of speech.” The sixty sentences were “smoothly translated,” according to an IBM press release celebrating the occasion. The director of the research program, Leon Dostert of Georgetown University, predicted that machine translation would be “an accomplished fact” within “five, perhaps three years hence.”
But the initial success turned out to be deeply misleading. By 1966 a committee of machine-translation grandees had to admit failure. The problem was harder than they had realized it would be. Teaching computers to translate is about teaching them not just the rules, but the exceptions too. Translation is not just about memorization and recall; it is about choosing the right words from many alternatives. Is “bonjour” really “good morning”? Or is it “good day,” or “hello,” or “hi”? The answer is, it depends. . . .
In the late 1980s, researchers at IBM had a novel idea. Instead of trying to feed explicit linguistic rules into a computer, together with a dictionary, they decided to let the computer use statistical probability to calculate which word or phrase in one language is the most appropriate one in another. In the 1990s IBM’s Candide project used ten years’ worth of Canadian parliamentary transcripts published in French and English—about three million sentence pairs. Because they were official documents, the translations had been done to an extremely high quality. And by the standards of the day, the amount of data was huge. Statistical machine translation, as the technique became known, cleverly turned the challenge of translation into one big mathematics problem. And it seemed to work. Suddenly, computer translation got a lot better. After the success of that conceptual leap, however, IBM only eked out small improvements despite throwing in lots of money. Eventually IBM pulled the plug.
But less than a decade later, in 2006, Google got into translation, as part of its mission to “organize the world’s information and make it universally accessible and useful.” Instead of nicely translated pages of text in two languages, Google availed itself of a larger but also much messier dataset: the entire global Internet and more. Its system sucked in every translation it could find, in order to train the computer. In went corporate websites in multiple languages, identical translations of official documents, and reports from intergovernmental bodies like the United Nations and the European Union. Even translations of books from Google’s book-scanning project were included. Where Candide had used three million carefully translated sentences, Google’s system harnessed billions of pages of translations of widely varying quality, according to the head of Google Translate, Franz Josef Och, one of the foremost authorities in the field. Its trillion-word corpus amounted to 95 billion English sentences, albeit of dubious quality.
Despite the messiness of the input, Google’s service works the best. Its translations are more accurate than those of other systems (though still highly imperfect). And it is far, far richer. By mid-2012 its dataset covered more than 60 languages. It could even accept voice input in 14 languages for fluid translations. And because it treats language simply as messy data with which to judge probabilities, it can even translate between languages, such as Hindi and Catalan, for which very few direct translations exist to train the system. In those cases it uses English as a bridge. And it is far more flexible than other approaches, since it can add and subtract words as they come in and out of usage.
The reason Google’s translation system works well is not that it has a smarter algorithm. It works well because its creators, like Banko and Brill at Microsoft, fed in more data—and not just of high quality. Google was able to use a dataset tens of thousands of times larger than IBM’s Candide because it accepted messiness. The trillion-word corpus Google released in 2006 was compiled from the flotsam and jetsam of Internet content—“data in the wild,” so to speak. This was the “training set” by which the system could calculate the probability that, for example, one word in English follows another. It was a far cry from the grandfather in the field, the famous Brown Corpus of the 1960s, which totaled one million English words. Using the larger dataset enabled great strides in natural-language processing, upon which systems for tasks like voice recognition and computer translation are based. “Simple models and a lot of data trump more elaborate models based on less data,” wrote Google’s artificial-intelligence guru Peter Norvig and colleagues in a paper entitled “The Unreasonable Effectiveness of Data.”
As Norvig and his co-authors explained, messiness was the key: “In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks.”
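The mechanism at the heart of such a corpus can be sketched in a few lines: count how often each word follows each other word, and turn the counts into probabilities. The toy text below is invented, typos and all, as a stand-in for “data in the wild.”

```python
from collections import Counter, defaultdict

# A toy stand-in for "data in the wild": unfiltered text, typo included.
corpus = """we transmit thoughts by means of speech
we transmit thoughts by email
we tranzmit thoughts by speech""".split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def prob_next(prev, nxt):
    """Estimated probability that `nxt` follows `prev` in the corpus."""
    total = sum(following[prev].values())
    return following[prev][nxt] / total if total else 0.0

print(prob_next("transmit", "thoughts"))  # 1.0: always seen together
print(prob_next("by", "speech"))          # about 0.33: "by" is also followed by "means" and "email"
```

Scale the same counting up to a trillion words and the estimates become remarkably useful, misspellings and incomplete sentences notwithstanding.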
More trumps better
Messiness is difficult for conventional sampling analysts to accept; they have spent their professional lives focused on preventing and eradicating it. They work hard to reduce error rates when collecting samples, and to test the samples for potential biases before announcing their results. They use multiple error-reducing strategies, including ensuring that samples are collected according to an exact protocol and by specially trained experts. Such strategies are costly to implement even for limited numbers of data points, and they are hardly feasible for big data. Not only would they be far too expensive, but exacting standards of collection are unlikely to be achieved consistently at such scale. Even excluding human interaction would not solve the problem.
Moving into a world of big data will require us to change our thinking about the merits of exactitude. To apply the conventional mindset of measurement to the digital, connected world of the twenty-first century is to miss a crucial point. As mentioned earlier, the obsession with exactness is an artifact of the information-deprived analog era. When data was sparse, every data point was critical, and thus great care was taken to avoid letting any point bias the analysis.
Today we don’t live in such an information-starved situation. In dealing with ever more comprehensive datasets, which capture not just a small sliver of the phenomenon at hand but much more or all of it, we no longer need to worry so much about individual data points biasing the overall analysis. Rather than aiming to stamp out every bit of inexactitude at increasingly high cost, we are calculating with messiness in mind.
Take the way sensors are making their way into factories. At BP’s Cherry Point Refinery in Blaine, Washington, wireless sensors are installed throughout the plant, forming an invisible mesh that produces vast amounts of data in real time. The environment of intense heat and electrical machinery might distort the readings, resulting in messy data. But the huge quantity of information generated from both wired and wireless sensors makes up for those hiccups. Just increasing the frequency and number of locations of sensor readings can offer a big payoff. By measuring the stress on pipes at all times rather than at certain intervals, BP learned that some types of crude oil are more corrosive than others—a quality it couldn’t spot, and thus couldn’t counteract, when its dataset was smaller.
When the quantity of data is vastly larger and is of a new type, exactitude in some cases is no longer the goal so long as we can divine the general trend. Moving to a large scale changes not only the expectations of precision but the practical ability to achieve exactitude. Though it may seem counterintuitive at first, treating data as something imperfect and imprecise lets us make superior forecasts, and thus understand our world better.
It bears noting that messiness is not inherent to big data. Instead it is a function of the imperfection of the tools we use to measure, record, and analyze information. If the technology were to somehow become perfect, the problem of inexactitude would disappear. But as long as it is imperfect, messiness is a practical reality we must deal with. And it is likely to be with us for a long time. Painstaking efforts to increase accuracy often won’t make economic sense, since the value of having far greater amounts of data is more compelling. Just as statisticians in an earlier era put aside their interest in larger sample sizes in favor of more randomness, we can live with a bit of imprecision in return for more data.
The Billion Prices Project offers an intriguing case in point. Every month the U.S. Bureau of Labor Statistics publishes the consumer price index, or CPI, which is used to calculate the inflation rate. The figure is crucial for investors and businesses. The Federal Reserve considers it when deciding whether to raise or lower interest rates. Companies base salary increases on inflation. The federal government uses it to index payments like Social Security benefits and the interest it pays on certain bonds.
To get the figure, the Bureau of Labor Statistics employs hundreds of staff to call, fax, and visit stores and offices in 90 cities across the nation and report back about 80,000 prices on everything from tomatoes to taxi fares. Producing it costs around $250 million a year. For that sum, the data is neat, clean, and orderly. But by the time the numbers come out, they’re already a few weeks old. As the 2008 financial crisis showed, a few weeks can be a terribly long lag. Decision-makers need quicker access to inflation numbers in order to react to them better, but they can’t get it with conventional methods focused on sampling and prizing precision.
In response, two economists at the Massachusetts Institute of Technology, Alberto Cavallo and Roberto Rigobon, came up with a big-data alternative by steering a much messier course. Using software to crawl the Web, they collected half a million prices of products sold in the U.S. every single day. The information is messy, and not all the data points collected are easily comparable. But by combining the big-data collection with clever analysis, the project was able to detect a deflationary swing in prices immediately after Lehman Brothers filed for bankruptcy in September 2008, while those who relied on the official CPI data had to wait until November to see it.
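In spirit the method is simple: compare each day’s scraped prices with an earlier day’s prices for the same products and average the changes. The sketch below uses invented products and prices and glosses over the project’s far more careful statistics; it is meant only to show the shape of the calculation.

```python
# Hypothetical daily price scrapes for the same basket of products.
scraped = {
    "2008-09-12": {"milk_1l": 1.05, "tv_32in": 499.0, "jeans": 39.9},
    "2008-09-15": {"milk_1l": 1.04, "tv_32in": 479.0, "jeans": 38.5},
    "2008-09-16": {"milk_1l": 1.03, "tv_32in": 469.0},  # one item missing today
}

def daily_index(scraped, base_day):
    """Average price level relative to a base day, over products seen on both days."""
    base = scraped[base_day]
    index = {}
    for day, prices in scraped.items():
        common = base.keys() & prices.keys()  # tolerate missing or mismatched items
        ratios = [prices[p] / base[p] for p in common]
        index[day] = 100 * sum(ratios) / len(ratios)
    return index

for day, level in sorted(daily_index(scraped, "2008-09-12").items()):
    print(day, round(level, 1))  # a falling index hints at deflation, day by day
```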
The MIT project has spun off a commercial venture called PriceStats that banks and others use to make economic decisions. It compiles millions of products sold by hundreds of retailers in more than 70 countries every day. Of course, the figures require careful interpretation, but they are better than the official statistics at indicating trends in inflation. Because there are more prices and the figures are available in real time, they give decision-makers a significant advantage. (The method also serves as a credible outside check on national statistical bodies. For example, The Economist distrusts Argentina’s method of calculating inflation, so it relies on the PriceStats figures instead.)
Messiness in action
In many areas of technology and society, we are leaning in favor of more and messy over fewer and exact. Consider the case of categorizing content. For centuries humans have developed taxonomies and indexes in order to store and retrieve material. These hierarchical systems have always been imperfect, as everyone familiar with a library card catalogue can painfully recall, but in a small-data universe, they worked well enough. Increase the scale many orders of magnitude, though, and these systems, which presume the perfect placement of everything within them, fall apart. For example, in 2011 the photo-sharing site Flickr held more than six billion photos from more than 75 million users. Trying to label each photo according to preset categories would have been useless. Would there really have been one entitled “Cats that look like Hitler”?
Instead, clean taxonomies are being replaced by mechanisms that are messier but also eminently more flexible and adaptable to a world that evolves and changes. When we upload photos to Flickr, we “tag” them. That is, we assign any number of text labels and use them to organize and search the material. Tags are created and affixed by people in an ad hoc way: there are no standardized, predefined categories, no existing taxonomy to which we must conform. Rather, anyone can add new tags just by typing. Tagging has emerged as the de facto standard for content classification on the Internet, used in social media sites like Twitter, blogs, and so on. It makes the vastness of the Web’s content more navigable—especially for things like images, videos, and music that aren’t text based so word searches don’t work.
Of course, some tags may be misspelled, and such mistakes introduce inaccuracy—not to the data itself but to how it’s organized. That pains the traditional mind trained in exactitude. But in return for messiness in the way we organize our photo collections, we gain a much richer universe of labels, and by extension, a deeper, broader access to our pictures. We can combine search tags to filter photos in ways that weren’t possible before. The imprecision inherent in tagging is about accepting the natural messiness of the world. It is an antidote to more precise systems that try to impose a false sterility upon the hurly-burly of reality, pretending that everything under the sun fits into neat rows and columns. There are more things in heaven and earth than are dreamt of in that philosophy.
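Mechanically, combining tags is nothing more exotic than intersecting sets of whatever labels people happened to type. A minimal sketch, with invented photos and tags:

```python
# Invented photos, each carrying free-form, user-supplied tags.
photos = {
    "img_001.jpg": {"cat", "funny", "hitler-lookalike"},
    "img_002.jpg": {"cat", "sleeping"},
    "img_003.jpg": {"dog", "funny"},
}

def search(photos, *wanted):
    """Return the photos that carry every requested tag."""
    return [name for name, tags in photos.items() if set(wanted) <= tags]

print(search(photos, "cat", "funny"))  # ['img_001.jpg']
```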
Many of the Web’s most popular sites flaunt their admiration for imprecision over the pretense of exactitude. When one sees a Twitter icon or a Facebook “like” button on a web page, it shows the number of other people who clicked on it. When the numbers are small, each click is shown, like “63.” But as the figures get larger, the number displayed is an approximation, like “4K.” It’s not that the system doesn’t know the actual total; it’s that as the scale increases, showing the exact figure is less important. Besides, the amounts may be changing so quickly that a specific figure would be out of date the moment it appeared. Similarly, Google’s Gmail presents the time of recent messages with exactness, such as “11 minutes ago,” but treats longer durations with a nonchalant “2 hours ago,” as do Facebook and some others.
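The rounding rule behind a “4K” display fits in a few lines. This is a generic sketch of the idea, not the code any particular site actually uses.

```python
def approx_count(n):
    """Show small counts exactly; round large ones the way share buttons do."""
    if n < 1_000:
        return str(n)
    if n < 1_000_000:
        return f"{n / 1_000:.1f}K" if n < 10_000 else f"{n / 1_000:.0f}K"
    return f"{n / 1_000_000:.1f}M"

for n in (63, 4_321, 15_800, 2_450_000):
    print(n, "->", approx_count(n))  # 63, 4.3K, 16K, 2.5M
```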
The industry of business intelligence and analytics software was long built on promising clients “a single version of the truth”—a popular buzz phrase of the 2000s among technology vendors in these fields. Executives used the phrase without irony. Some still do. By this, they mean that everyone accessing a company’s information-technology systems can tap into the same data; that the marketing team and the sales team don’t have to fight over who has the correct customer or sales numbers before the meeting even begins. Their interests might be more aligned if the facts were consistent, the thinking goes.
But the idea of “a single version of the truth” is doing an about-face. We are beginning to realize not only that it may be impossible for a single version of the truth to exist, but also that its pursuit is a distraction. To reap the benefits of harnessing data at scale, we have to accept messiness as par for the course, not as something we should try to eliminate.
We are even seeing the ethos of inexactitude invade one of the areas most intolerant of imprecision: database design. Traditional database engines required data to be highly structured and precise. Data wasn’t simply stored; it was broken up into “records” that contained fields. Each field held information of a particular type and length. For example, if a numeric field was seven digits long, an amount of 10 million or more could not be recorded. If one wanted to enter “not available” into a field for phone numbers, it couldn’t be done. The structure of the database would have had to be altered to accommodate these entries. We still battle with such restrictions on our computers and smartphones, when the software won’t accept the data we want to enter.
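A small sketch shows the kind of rigidity involved, using an invented seven-digit amount field; the field either fits a value or rejects it, with no graceful way to record “roughly ten million” or “not available.”

```python
def store_amount(amount, digits=7):
    """A fixed-width numeric field: values that need more digits simply don't fit."""
    if not (0 <= amount < 10 ** digits):
        raise ValueError(f"{amount} does not fit a {digits}-digit field")
    return str(amount).zfill(digits)

print(store_amount(9_999_999))      # fits: '9999999'
try:
    store_amount(10_000_000)        # one digit too many
except ValueError as err:
    print("rejected:", err)
```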
Traditional indexes, too, were predefined, and that limited what one could search for. Add a new index, and it had to be created from scratch, taking time. Conventional, so-called relational, databases are designed for a world in which data is sparse, and thus can be and will be curated carefully. It is a world in which the questions one wants to answer using the data have to be clear at the outset, so that the database is designed to answer them—and only them—efficiently.
Yet this view of storage and analysis is increasingly at odds with reality. We now have large amounts of data of varying types and quality. Rarely does it fit into neatly defined categories that are known at the outset. And the questions we want to ask often emerge only when we collect and work with the data we have.
These realities have led to novel database designs that break with the principles of old—principles of records and preset fields that reflect neatly defined hierarchies of information. The most common language for accessing databases has long been SQL, or “structured query language.” The very name evokes its rigidity. But the big shift in recent years has been toward something called noSQL, which doesn’t require a preset record structure to work. It accepts data of varying type and size and allows it to be searched successfully. In return for permitting structural messiness, these database designs require more processing and storage resources. Yet it is a tradeoff we can afford given the plummeting storage and processing costs.
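A minimal sketch of the document-style alternative, using invented customer records: the records need not share the same fields, and a search simply skips whatever is missing.

```python
# A document-style store: records need not share a schema, and fields may be absent.
documents = [
    {"customer_id": 1, "name": "Ada",   "phone": "+1 555 0100"},
    {"customer_id": 2, "name": "Grace", "complaints": ["late delivery"]},
    {"customer_id": 3, "name": "Alan"},  # no phone on record, and that is fine
]

def find(docs, **criteria):
    """Match on whichever fields are asked for; a missing field simply doesn't match."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

print(find(documents, name="Grace"))
print(find(documents, phone="+1 555 0100"))
```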
Pat Helland, one of the world’s foremost authorities on database design, describes this fundamental shift in a paper entitled “If You Have Too Much Data, Then ‘Good Enough’ Is Good Enough.” After identifying some of the core principles of traditional design that have become eroded by messy data of varying provenance and accuracy, he lays out the consequences: “We can no longer pretend to live in a clean world.” Processing big data entails an inevitable loss of information—Helland calls it “lossy.” But it makes up for that by yielding a quick result. “It’s OK if we have lossy answers—that’s frequently what business needs,” concludes Helland.
Traditional database design promises to deliver consistent results across time. If you ask for your bank account balance, for example, you expect to receive the exact amount. And if you query it a few seconds later, you want the system to provide the same result, assuming nothing has changed. Yet as the quantity of data collected grows and the number of users who access the system increases, this consistency becomes harder to maintain.
Large datasets do not exist in any one place; they tend to be split up across multiple hard drives and computers. To ensure reliability and speed, a record may be stored in two or three separate locations. If you update the record at one location, the data in the other locations is no longer correct until you update it too. Traditional systems would simply wait until every copy had been updated, but that is less practical when data is broadly distributed and the server is pounded with tens of thousands of queries per second. Instead, accepting messiness is a kind of solution.
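A toy illustration of the resulting window of inconsistency, with an invented account balance and three replicas:

```python
# Three replicas of the same record, as if stored on three separate machines.
replicas = [{"balance": 100}, {"balance": 100}, {"balance": 100}]

def write(key, value, reached):
    """Apply an update only to the replicas it has reached so far."""
    for i in reached:
        replicas[i][key] = value

def read(key, replica_index):
    return replicas[replica_index][key]

write("balance", 80, reached=[0])      # the update lands on replica 0 first
print(read("balance", 0))              # 80  (fresh)
print(read("balance", 2))              # 100 (stale until the update propagates)

write("balance", 80, reached=[1, 2])   # propagation completes a moment later
print(read("balance", 2))              # 80  (eventually consistent)
```

Until the update reaches every copy, a read against a lagging replica returns the old value; many big-data systems accept that briefly stale answer rather than make everyone wait.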
The shift is typified by the popularity of Hadoop, an open-source rival to Google’s MapReduce system that is very good at processing large quantities of data. It does this by breaking the data down into smaller chunks and parceling them out to other machines. It expects that hardware will fail, so it builds redundancy in. It presumes that the data is not clean and orderly—in fact, it assumes that the data is too huge to be cleaned before processing. Where typical data analysis requires an operation called “extract, transform, and load,” or ETL, to move the data to where it will be analyzed, Hadoop dispenses with such niceties. Instead, it takes for granted that the quantity of data is so breathtakingly enormous that it can’t be moved and must be analyzed where it is.
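The pattern itself can be sketched in a few lines: run a simple counting function on each chunk where it sits, then merge the partial results. The chunks here are invented snippets of text; in Hadoop they would be blocks of data spread across many machines.

```python
from collections import Counter
from functools import reduce

# Pretend each chunk of text sits on a different machine.
chunks = [
    "to be or not to be",
    "to err is human",
    "data data everywhere",
]

def map_chunk(chunk):
    """Count words within one chunk, locally, without moving the data."""
    return Counter(chunk.split())

def merge_counts(a, b):
    """Combine the partial counts from two chunks."""
    return a + b

partial = [map_chunk(c) for c in chunks]           # the "map" step, run where the data lives
total = reduce(merge_counts, partial, Counter())   # the "reduce" step, merging partial results
print(total.most_common(3))
```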
Hadoop’s output isn’t as precise as that of relational databases: it can’t be trusted to launch a spaceship or to certify bank-account details. But for many less critical tasks, where an ultra-precise answer isn’t needed, it does the trick far faster than the alternatives. Think of tasks like segmenting a list of customers to send some of them a special marketing campaign. Using Hadoop, the credit-card company Visa was able to reduce the processing time for two years’ worth of test records, some 73 billion transactions, from one month to a mere 13 minutes. That sort of acceleration of processing is transformative to businesses.
The experience of ZestFinance, a company founded by the former chief information officer of Google, Douglas Merrill, underscores the point. Its technology helps lenders decide whether or not to offer relatively small, short-term loans to people who seem to have poor credit. Yet where traditional credit scoring is based on just a handful of strong signals like previous late payments, ZestFinance analyzes a huge number of “weaker” variables. In 2012 it boasted a loan default rate that was a third less than the industry average. But the only way to make the system work is to embrace messiness.
“One of the interesting things,” says Merrill, “is that there are no people for whom all fields are filled in—there’s always a large amount of missing data.” The matrix from the information ZestFinance gathers is incredibly sparse, a database file teeming with missing cells. So the company “imputes” the missing data. For instance, about 10 percent of ZestFinance’s customers are listed as dead—but as it turns out, that doesn’t affect repayment. “So, obviously, when preparing for the zombie apocalypse, most people assume no debt will get repaid. But from our data, it looks like zombies pay back their loans,” adds Merrill with a wink.
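Imputation can be as humble as filling each hole with the average of the values that are present. The applicant records below are invented, and ZestFinance’s actual models are, of course, proprietary and far more elaborate; this is only the simplest version of the idea.

```python
# Invented applicant records, riddled with holes (None marks a missing value).
applicants = [
    {"income": 42000, "years_at_address": 3,    "age": 29},
    {"income": None,  "years_at_address": 1,    "age": None},
    {"income": 58000, "years_at_address": None, "age": 47},
]

def impute_mean(rows):
    """Fill every missing field with the average of the values that are present."""
    fields = {key for row in rows for key in row}
    means = {}
    for f in fields:
        present = [row[f] for row in rows if row.get(f) is not None]
        means[f] = sum(present) / len(present)
    return [{f: (row[f] if row.get(f) is not None else means[f]) for f in fields}
            for row in rows]

for row in impute_mean(applicants):
    print(row)
```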
In return for living with messiness, we get tremendously valuable services that would be impossible at their scope and scale with traditional methods and tools. According to some estimates only 5 percent of all digital data is “structured”—that is, in a form that fits neatly into a traditional database. Without accepting messiness, the remaining 95 percent of unstructured data, such as web pages and videos, remain dark. By allowing for imprecision, we open a window into an untapped universe of insights.
Society has made two implicit tradeoffs that have become so ingrained in the way we act that we don’t even see them as tradeoffs anymore, but as the natural state of things. First, we presume that we can’t use far more data, so we don’t. But the constraint is increasingly less relevant, and there is much to be gained by using something approaching N=all.
The second tradeoff is over the quality of information. It was rational to privilege exactitude in an era of small data, when, because we collected only a little information, its accuracy had to be as high as possible. In many cases, that may still matter. But for many other things, rigorous accuracy is less important than getting a quick grasp of their broad outlines or progress over time.
The way we think about using the totality of information compared with smaller slivers of it, and the way we may come to appreciate slackness instead of exactness, will have profound effects on our interaction with the world. As big-data techniques become a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of N=all of the mind. And we may tolerate blurriness and ambiguity in areas where we used to demand clarity and certainty, even if it had been a false clarity and an imperfect certainty. We may accept this provided that in return we get a more complete sense of reality—the equivalent of an impressionist painting, wherein each stroke is messy when examined up close, but by stepping back one can see a majestic picture.
Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy. The appeal of “some” and “certain” is understandable. Our comprehension of the world may have been incomplete and occasionally wrong when we were limited in what we could analyze, but there was a comfortable certainty about it, a reassuring stability. Besides, because we were stunted in the data that we could collect and examine, we didn’t face the same compulsion to get everything, to see everything from every possible angle. And in the narrow confines of small data, we could pride ourselves on our precision—even if by measuring the minutiae to the nth degree, we missed the bigger picture.
Ultimately, big data may require us to change, to become more comfortable with disorder and uncertainty. The structures of exactitude that seem to give us bearings in life—that the round peg goes into the round hole; that there is only one answer to a question—are more malleable than we may admit; and yet admitting, even embracing, this plasticity brings us closer to reality.
As radical a transformation as these shifts in mindset are, they lead to a third change that has the potential to upend an even more fundamental convention on which society is based: the idea of understanding the reasons behind all that happens. Instead, as the next chapter will explain, finding associations in data and acting on them may often be good enough.