BIG DATA, BIG SCHMATA?
WHAT IT CANNOT DO
“Seth, Lawrence Summers would like to meet with you,” the email said, somewhat cryptically. It was from one of my Ph.D. advisers, Lawrence Katz. Katz didn’t tell me why Summers was interested in my work, though I later found out Katz had known all along.
I sat in the waiting room outside Summers’s office. After some delay, the former Treasury secretary of the United States, former president of Harvard, and winner of some of the biggest awards in economics, summoned me inside.
Summers began the meeting by reading my paper on racism’s effect on Obama’s vote, which his secretary had printed for him. Summers is a speed reader. As he reads, he occasionally sticks his tongue out and to the right, while his eyes rapidly shift left and right and down the page. Summers reading a social science paper reminds me of a great pianist performing a sonata. He is so focused he seems to lose track of all else. In fewer than five minutes, he had completed my thirty-page paper.
“You say that Google searches for ‘nigger’ suggest racism,” Summers said. “That seems plausible. They predict where Obama gets less support than Kerry. That is interesting. Can we really think of Obama and Kerry as the same?”
“They were ranked as having similar ideologies by political scientists,” I responded. “Also, there is no correlation between racism and changes in House voting. The result stays strong even when we add controls for demographics, church attendance, and gun ownership.” This is how we economists talk. I had grown animated.
Summers paused and stared at me. He briefly turned to the TV in his office, which was tuned to CNBC, then stared at me again, then looked at the TV, then back at me. “Okay, I like this paper,” Summers said. “What else are you working on?”
The next sixty minutes may have been the most intellectually exhilarating of my life. Summers and I talked about interest rates and inflation, policing and crime, business and charity. There is a reason so many people who meet Summers are enthralled. I have been fortunate to speak with some incredibly smart people in my life; Summers struck me as the smartest. He is obsessed with ideas, more than all else, which seems to be what often gets him in trouble. He had to resign his presidency at Harvard after suggesting the possibility that part of the reason for the shortage of women in the sciences might be that men have more variation in their IQs. If he finds an idea interesting, Summers tends to say it, even if it offends some ears.
It was now a half hour past the scheduled end time for our meeting. The conversation was intoxicating, but I still had no idea why I was there, nor when I was supposed to leave, nor how I would know when I was supposed to leave. I got the feeling, by this point, that Summers himself may have forgotten why he had set up this meeting.
And then he asked the million-dollar—or perhaps billion-dollar—question. “You think you can predict the stock market with this data?”
Aha. Here at last was the reason Summers had summoned me to his office.
Summers is hardly the first person to ask me this particular question. My father has generally been supportive of my unconventional research interests. But one time he did broach the subject. “Racism, child abuse, abortion,” he said. “Can’t you make any money off this expertise of yours?” Friends and other family members have raised the subject, as well. So have coworkers and strangers on the internet. Everyone seems to want to know whether I can use Google searches—or other Big Data—to pick stocks. Now it was the former Treasury secretary of the United States. This was more serious.
So can new Big Data sources successfully predict which ways stocks are headed? The short answer is no.
In the previous chapters we discussed the four powers of Big Data. This chapter is all about Big Data’s limitations—both what we cannot do with it and, on occasion, what we ought not do with it. And one place to start is by telling the story of the failed attempt by Summers and me to beat the markets.
In Chapter 3, we noted that new data is most likely to yield big returns when the existing research in a given field is weak. It is an unfortunate truth about the world that you will have a much easier time getting new insights about racism, child abuse, or abortion than you will getting a new, profitable insight into how a business is performing. That’s because massive resources are already devoted to looking for even the slightest edge in measuring business performance. The competition in finance is fierce. That was already a strike against us.
Summers, who is not someone known for effusing about other people’s intelligence, was certain the hedge funds were already way ahead of us. I was quite taken during our conversation by how much respect he had for them and how many of my suggestions he was convinced they’d beaten us to. I proudly shared with him an algorithm I had devised that allowed me to obtain more complete Google Trends data. He said it was clever. When I asked him if Renaissance, a quantitative hedge fund, would have figured out that algorithm, he chuckled and said, “Yeah, of course they would have figured that out.”
The difficulty of keeping up with the hedge funds wasn’t the only fundamental problem that Summers and I ran up against in using new, big datasets to beat the markets.
Suppose your strategy for predicting the stock market is to find a lucky coin—but one that will be found through careful testing. Here’s your methodology: You label one thousand coins—1 to 1,000. Every morning, for two years, you flip each coin, record whether it came up heads or tails, and then note whether the Standard & Poor’s Index went up or down that day. You pore through all your data. And voilà! You’ve found something. It turns out that 70.3 percent of the time when Coin 391 came up heads, the S&P Index rose. The relationship is statistically significant, highly so. You have found your lucky coin!
Just flip Coin 391 every morning and buy stocks whenever it comes up heads. Your days of Target T-shirts and ramen noodle dinners are over. Coin 391 is your ticket to the good life!
Or not.
You have become another victim of one of the most diabolical aspects of “the curse of dimensionality.” It can strike whenever you have lots of variables (or “dimensions”)—in this case, one thousand coins—chasing not that many observations—in this case, 504 trading days over those two years. One of those dimensions—Coin 391, in this case—is likely to get lucky. Decrease the number of variables—flip only one hundred coins—and it will become much less likely that one of them will get lucky. Increase the number of observations—try to predict the behavior of the S&P Index for twenty years—and coins will struggle to keep up.
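To make the thought experiment concrete, here is a minimal simulation sketch in Python (my own illustration, using made-up random data rather than real market prices). Every coin and every market move is pure noise, yet the best-performing coin still looks statistically significant if you evaluate it as though it were the only coin you had ever tested.

```python
import numpy as np

rng = np.random.default_rng(0)
n_coins, n_days = 1_000, 504          # 1,000 coins, roughly two years of trading days

coins = rng.integers(0, 2, size=(n_coins, n_days))   # 1 = heads, 0 = tails
market_up = rng.integers(0, 2, size=n_days)          # 1 = index up, 0 = down (pure noise)

# For each coin: among the days it came up heads, how often did the market rise?
heads_days = coins.sum(axis=1)
up_given_heads = (coins * market_up).sum(axis=1) / heads_days

best = up_given_heads.argmax()
# Naive z-score, computed as if this coin were the only one we had ever tested
z = (up_given_heads[best] - 0.5) * np.sqrt(heads_days[best]) / 0.5
print(f"Coin {best}: market rose on {up_given_heads[best]:.1%} of its heads days (z = {z:.1f})")
# The "best" coin typically clears conventional significance thresholds, not because
# it predicts anything, but because we searched a thousand coins to find it.
```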
The curse of dimensionality is a major issue with Big Data, since newer datasets frequently give us exponentially more variables than traditional data sources—every search term, every category of tweet, etc. Many people who claim to predict the market utilizing some Big Data source have merely been entrapped by the curse. All they’ve really done is find the equivalent of Coin 391.
Take, for example, a team of computer scientists from Indiana University and Manchester University who claimed they could predict which way the markets would go based on what people were tweeting. They built an algorithm to code the world’s day-to-day moods based on tweets. They used techniques similar to the sentiment analysis discussed in Chapter 3. However, they coded not just one mood but many moods—happiness, anger, kindness, and more. They found that a preponderance of tweets suggesting calmness, such as “I feel calm,” predicts that the Dow Jones Industrial Average is likely to rise six days later. A hedge fund was founded to exploit their findings.
What’s the problem here?
The fundamental problem is that they tested too many things. And if you test enough things, just by random chance, one of them will be statistically significant. They tested many emotions. And they tested each emotion one day before, two days before, three days before, and up to seven days before the stock market behavior that they were trying to predict. And all these variables were used to try to explain just a few months of Dow Jones ups and downs.
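As a rough illustration of how quickly false positives pile up, here is a back-of-the-envelope calculation in Python. The mood and lag counts are assumptions chosen purely for illustration (the text above only tells us there were many moods and up to seven lags), and real tests are correlated rather than independent, so treat the result as a ballpark figure.

```python
# If you run k independent tests, each at the 5 percent significance level, the
# chance that at least one comes back "significant" by pure luck is 1 - 0.95**k.
# Hypothetical counts: say 6 mood categories x 7 lead times = 42 separate tests.
k = 6 * 7
print(f"Chance of at least one false positive: {1 - 0.95**k:.0%}")   # roughly 88%
```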
Calmness six days earlier was not a legitimate predictor of the stock market. Calmness six days earlier was the Big Data equivalent of our hypothetical Coin 391. The tweet-based hedge fund was shut down one month after starting due to lackluster returns.
Hedge funds trying to time the markets with tweets are not the only ones battling the curse of dimensionality. So are the numerous scientists who have tried to find the genetic keys to who we are.
Thanks to the Human Genome Project, it is now possible to collect and analyze the complete DNA of people. The potential of this project seemed enormous.
Maybe we could find the gene that causes schizophrenia. Maybe we could discover the genes that cause Alzheimer’s, Parkinson’s, and ALS. Maybe we could find the gene that causes—gulp—intelligence. Is there one gene that can add a whole bunch of IQ points? Is there one gene that makes a genius?
In 1998, Robert Plomin, a prominent behavioral geneticist, claimed to have found the answer. He received a dataset that included the DNA and IQs of hundreds of students. He compared the DNA of “geniuses”—those with IQs of 160 or higher—to the DNA of those with average IQs.
He found a striking difference in the DNA of these two groups. It was located in one small corner of chromosome 6, in an obscure but powerful gene, IGF2r, involved in the metabolism of the brain. One version of this gene was twice as common in geniuses.
“First Gene to Be Linked with High Intelligence Is Reported Found,” headlined the New York Times.
Consider the many ethical questions Plomin’s finding raised. Should parents be allowed to screen their kids for IGF2r? Should they be allowed to abort a baby with the low-IQ variant? Should we genetically modify people to give them a high IQ? Does IGF2r correlate with race? Do we want to know the answer to that question? Should research on the genetics of IQ continue?
Before bioethicists had to tackle any of these thorny questions, there was a more basic question for geneticists, including Plomin himself. Was the result accurate? Was it really true that IGF2r could predict IQ? Was it really true that geniuses were twice as likely to carry a certain variant of this gene?
Nope. A few years after his original study, Plomin got access to another sample of people that also included their DNA and IQ scores. This time, IGF2r did not correlate with IQ. Plomin—and this is a sign of a good scientist—retracted his claim.
This, in fact, has been a general pattern in the research into genetics and IQ. First, scientists report that they have found a genetic variant that predicts IQ. Then scientists get new data and discover their original assertion was wrong.
For example, in a recent paper, a team of scientists, led by Christopher Chabris, examined twelve prominent claims about genetic variants associated with IQ in data from ten thousand people. They could not reproduce the correlation for any of the twelve.
What’s the issue with all of these claims? The curse of dimensionality. The human genome, scientists now know, differs in millions of ways. There are, quite simply, too many genes to test.
If you test enough tweets to see if they correlate with the stock market, you will find one that correlates just by chance. If you test enough genetic variants to see if they correlate with IQ, you will find one that correlates just by chance.
How can you overcome the curse of dimensionality? You have to have some humility about your work and not fall in love with your results. You have to put these results through additional tests. For example, before you bet your life savings on Coin 391, you would want to see how it does over the next couple of years. Social scientists call this an “out-of-sample” test. And the more variables you try, the more humble you have to be. The more variables you try, the tougher the out-of-sample test has to be. It is also crucial to keep track of every test you attempt. Then you can know exactly how likely it is you are falling victim to the curse and how skeptical you should be of your results.
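Here is a minimal sketch of what an out-of-sample test looks like in practice, again using the hypothetical coins rather than any real market data: search one period for the luckiest coin, then check how that same coin fares on fresh data it was not selected on.

```python
import numpy as np

rng = np.random.default_rng(1)
n_coins, n_days = 1_000, 504    # 1,000 coins, roughly two years of trading days

def best_coin(coins, market):
    """Return the index and in-sample hit rate of the luckiest coin."""
    up_given_heads = (coins * market).sum(axis=1) / coins.sum(axis=1)
    return up_given_heads.argmax(), up_given_heads.max()

# In-sample period: search all 1,000 coins for the one that "predicts" best.
coins_in = rng.integers(0, 2, size=(n_coins, n_days))
market_in = rng.integers(0, 2, size=n_days)
winner, in_rate = best_coin(coins_in, market_in)

# Out-of-sample period: flip only the winning coin over two more years of fresh days.
coin_out = rng.integers(0, 2, size=n_days).astype(bool)
market_out = rng.integers(0, 2, size=n_days)
out_rate = market_out[coin_out].mean()

print(f"Coin {winner}: {in_rate:.1%} in sample vs. {out_rate:.1%} out of sample")
# The edge found by searching a thousand coins falls back toward 50 percent on data
# the coin was not selected on, which is exactly what the out-of-sample test reveals.
```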
Which brings us back to Larry Summers and me. Here’s how we tried to beat the markets.

Summers’s first idea was to use searches to predict future sales of key products, such as iPhones, that might shed light on the future performance of the stock of a company, such as Apple. There was indeed a correlation between searches for “iPhones” and iPhone sales. When people are Googling a lot for “iPhones,” you can bet a lot of phones are being sold. However, this information was already incorporated into the Apple stock price. Clearly, when there were lots of Google searches for “iPhones,” hedge funds had also figured out that the iPhone would be a big seller, regardless of whether they used the search data or some other source.
Summers’s next idea was to predict future investment in developing countries. If a large number of investors were going to be pouring money into countries such as Brazil or Mexico in the near future, then stocks for companies in these countries would surely rise. Perhaps we could predict a rise in investment with key Google searches—such as “invest in Mexico” or “investment opportunities in Brazil.” This proved a dead end. The problem? The searches were too rare. Instead of revealing meaningful patterns, this search data jumped all over the place.
We tried searches for individual stocks. Perhaps if people were searching for “GOOG,” this meant they were about to buy Google stock. These searches seemed to predict that the stocks would be traded a lot. But they did not predict whether the stocks would rise or fall. One major limitation is that these searches did not tell us whether someone was interested in buying or selling the stock.
One day, I excitedly showed Summers a new idea I had: past searches for “buy gold” seemed to correlate with future increases in the price of gold. Summers told me I should test it going forward to see if it remained accurate. It stopped working, perhaps because some hedge fund had found the same relationship.
In the end, over a few months, we didn’t find anything useful in our tests. Undoubtedly, if we had looked for a correlation with market performance in each of the billions of Google search terms, we would have found one that worked, however weakly. But it likely would have just been our own Coin 391.
THE OVEREMPHASIS ON WHAT IS MEASURABLE
In March 2012, Zoë Chance, a marketing professor at Yale, received a small white pedometer in her office mailbox in downtown New Haven, Connecticut. She aimed to study how this device, which measures the steps you take during the day and gives you points as a result, can inspire you to exercise more.
What happened next, as she recounted in a TEDx talk, was a Big Data nightmare. Chance became so obsessed with increasing her numbers that she began walking everywhere: from the kitchen to the living room, to the dining room, to the basement, and around her office. She walked early in the morning, late at night, at nearly all hours of the day—twenty thousand steps in a given twenty-four-hour period. She checked her pedometer hundreds of times per day, and much of what remained of her human interaction was with other pedometer users online, discussing strategies for improving their scores. She even remembers putting the pedometer on her three-year-old daughter while the girl walked around, just to push the number higher.
Chance became so obsessed with maximizing this number that she lost all perspective. She forgot the reason someone would want to get the number higher—exercising, not having her daughter walk a few steps. Nor did she complete any academic research about the pedometer. She finally got rid of the device after falling late one night, exhausted, while trying to get in more steps. Though she is a data-driven researcher by profession, the experience affected her profoundly. “It makes me skeptical of whether having access to additional data is always a good thing,” Chance says.
This is an extreme story. But it points to a potential problem with people using data to make decisions. Numbers can be seductive. We can grow fixated on them, and in so doing we can lose sight of more important considerations. Zoë Chance lost sight, more or less, of the rest of her life.
Even less obsessive infatuations with numbers can have drawbacks. Consider the twenty-first-century emphasis on testing in American schools—and judging teachers based on how their students score. While the desire for more objective measures of what happens in classrooms is legitimate, there are many things that go on there that can’t readily be captured in numbers. Moreover, all of that testing pressured many teachers to teach to the tests—and worse. A small number, as was proven in a paper by Brian Jacob and Steven Levitt, cheated outright in administering those tests.
The problem is this: the things we can measure are often not exactly what we care about. We can measure how students do on multiple-choice questions. We can’t easily measure critical thinking, curiosity, or personal development. Just trying to increase a single, easy-to-measure number—test scores or the number of steps taken in a day—doesn’t always help achieve what we are really trying to accomplish.
In its efforts to improve its site, Facebook runs into this danger as well. The company has tons of data on how people use the site. It’s easy to see whether a particular News Feed story was liked, clicked on, commented on, or shared. But, according to Alex Peysakhovich, a Facebook data scientist with whom I have written about these matters, not one of these is a perfect proxy for more important questions: What was the experience of using the site like? Did the story connect the user with her friends? Did it inform her about the world? Did it make her laugh?
Or consider baseball’s data revolution in the 1990s. Many teams began using increasingly intricate statistics—rather than relying on old-fashioned human scouts—to make decisions. It was easy to measure offense and pitching but not fielding, so some organizations ended up underestimating the importance of defense. In fact, in his book The Signal and the Noise, Nate Silver estimates that the Oakland A’s, a data-driven organization profiled in Moneyball, were giving up eight to ten wins per year in the mid-nineties because of their lousy defense.
The solution is not always more Big Data. A special sauce is often necessary to help Big Data work best: the judgment of humans and small surveys, what we might call small data. In an interview with Silver, Billy Beane, then the A’s general manager and the main character in Moneyball, said that he had actually begun increasing his scouting budget.
To fill in the gaps in its giant data pool, Facebook too has to take an old-fashioned approach: asking people what they think. Every day as they load their News Feed, hundreds of Facebook users are presented with questions about the stories they see there. Facebook’s automatically collected datasets (likes, clicks, comments) are supplemented, in other words, by smaller data (“Do you want to see this post in your News Feed?” “Why?”). Yes, even a spectacularly successful Big Data organization like Facebook sometimes makes use of the source of information much disparaged in this book: a small survey.
Indeed, because of this need for small data as a supplement to its mainstay—massive collections of clicks, likes, and posts—Facebook’s data teams look different than you might guess. Facebook employs social psychologists, anthropologists, and sociologists precisely to find what the numbers miss.
Some educators, too, are becoming more alert to blind spots in Big Data. There is a growing national effort to supplement mass testing with small data. Student surveys have proliferated. So have parent surveys and teacher observations, where other experienced educators watch a teacher during a lesson.
“School districts realize they shouldn’t be focusing solely on test scores,” says Thomas Kane, a professor of education at Harvard. A three-year study by the Bill & Melinda Gates Foundation bears out the value in education of both big and small data. The authors analyzed whether test-score-based models, student surveys, or teacher observations were best at measuring which teachers most improved student learning. When they put the three measures together into a composite score, they got the best results. “Each measure adds something of value,” the report concluded.
In fact, it was just as I was learning that many Big Data operations use small data to fill in the holes that I showed up in Ocala, Florida, to meet Jeff Seder. Remember, he was the Harvard-educated horse guru who used lessons learned from a huge dataset to predict the success of American Pharoah.
After sharing all the computer files and math with me, Seder admitted that he had another weapon: Patty Murray.
Murray, like Seder, has high intelligence and elite credentials—a degree from Bryn Mawr. She also left New York City for rural life. “I like horses more than humans,” Murray admits. But Murray is a bit more traditional in her approaches to evaluating horses. She, like many horse agents, personally examines horses, seeing how they walk, checking for scars and bruises, and interrogating their owners.
Murray then collaborates with Seder as they pick the final horses they want to recommend. Murray sniffs out problems with the horses, problems that Seder’s data, despite being the most innovative and important dataset ever collected on horses, still misses.
I am predicting a revolution based on the revelations of Big Data. But this does not mean we can just throw data at any question. And Big Data does not eliminate the need for all the other ways humans have developed over the millennia to understand the world. They complement each other.