RULE FIVE

Get the Backstory

“In each human coupling, a thousand million sperm vie for a single egg. Multiply those odds by countless generations . . . it was you, only you, that emerged. To distill so specific a form from that chaos of improbability, like turning air to gold . . . that is the crowning unlikelihood . . .”

 “You could say that about anybody in the world!”

 “Yes. Anybody in the world . . . But the world is so full of people, so crowded with these miracles, that they become commonplace and we forget . . .”

- Alan Moore, Watchmen

A couple of decades ago, two respected psychologists, Sheena Iyengar and Mark Lepper, set up a jam-tasting stall in an upmarket store in California. Sometimes they offered six varieties of jam, at other times twenty-four; customers who tasted the jam were then offered a voucher to buy it at a discount. The bigger display with a wider array of jams attracted more customers but very few of them actually bought jam. The display that offered fewer choices inspired more sales.1

The counterintuitive result went viral—it hit a sweet spot. People respond better to fewer choices! It became the stuff of pop-psychology articles, books, and TED Talks. It was unexpected yet seemed plausible. Few people would have predicted it, and yet somehow those who heard about it felt they’d known it all along.

As an economist, I’ve always found this a little strange. Economic theory predicts that people should often value extra choices, and will never be discouraged by them—but economic theory can be wrong, so that’s not what was curious about the jam study.

One puzzle was that according to the study, the measured effect of offering more choice was huge: only 3 percent of jam tasters at the twenty-four-flavor stand used their discount voucher, versus 30 percent at the six-flavor stand. This suggests that by trimming their range, retailers could increase their sales tenfold. Does anybody really believe that? Draeger’s, the supermarket that hosted the experiment, stocked 300 varieties of jam and 250 types of mustard. It seemed to be doing fine. Had it missed a trick? Starbucks boasts of offering literally tens of thousands of combinations of frothy drink; the chain seems to be doing fine, too. So I wondered just how general the finding might be. Still, it was a serious experiment conducted by serious researchers. And one should always be willing to adjust one’s views to fit the evidence, right?

Then I met a researcher at a conference who told me I should get in touch with a young psychologist called Benjamin Scheibehenne. I did. Scheibehenne had no reason to doubt Iyengar and Lepper’s discovery that people might be demotivated when faced with lots of options. But he had observed the same facts about the world that I had—that so many successful businesses offer a cornucopia of choice. How were those facts compatible with the experiment? Scheibehenne had a theory, which was that companies were finding ways to help people navigate complex choices. That seems plausible. Perhaps it was something to do with familiarity: people often go to the supermarket planning to buy whatever they bought last time, rather than some fancy new jam. Perhaps it was the way the aisles were signposted, or the way choices were organized to make them less bewildering. These all seem sensible things to investigate, so Scheibehenne planned to investigate them.2

He began by rerunning the jam experiment to get a baseline from which he could start tweaking and exploring different possibilities. But he didn’t get the same baseline. He didn’t get the same result at all. Iyengar and Lepper had discovered that choice dramatically demotivates. When Scheibehenne tried to repeat their experiment he found no such thing. Another researcher, Rainer Greifeneder, had rerun a similar study by Iyengar and Lepper that focused on choosing between luxury chocolates, and like Scheibehenne had failed to reproduce the original “choice is bad” result. The pair teamed up to pull together every study of the “choice is bad” effect they could find. There were plenty, but many of them had failed to find a journal that would publish them.

When all the studies, published and unpublished, were assembled, the overall result was mixed. Offering more choices sometimes motivates and sometimes demotivates people. Published research papers were more likely to find a large effect, either positive or negative. Unpublished papers were more likely not to find an effect at all. The average effect? Zero.3

This is unnerving. So far we’ve encountered misleading claims in the context of an agenda being pushed—Oxfam drumming up publicity, a media outlet chasing clicks—or a subtle detail being overlooked, like the use of different words to describe the tragic early end of a pregnancy. When it comes to academia, we might reasonably hope that the subtle details will be spotted and the only agenda being pursued is a search for knowledge. It makes sense to tread carefully with campaigning groups or clickbait headlines, but can’t we assume we’re on more solid ground when we pick up an academic journal? Iyengar and Lepper were, as I’ve said, highly respected. Is it possible that they were just flat-out wrong? If so, how? And what should we make of the next counterintuitive finding that sweeps the science pages or the airport bookshelves?

For an answer, let’s take a step sideways and ponder the internet’s most famous potato salad.


Surely there is no easier way to raise some cash than through Kickstarter? The crowdfunding website enjoyed a breakthrough moment in 2012 when the Pebble, an early smartwatch, raised over $10 million. In 2014, a project to make a picnic cooler raised an extraordinary $13 million. Admittedly, the Coolest cooler was the Swiss Army knife of cool boxes. It had a built-in USB charger, cocktail blender, and speakers, attracting a thundering herd of backers. The Pebble smartwatch had its revenge in 2015, as a fresh campaign raised more than $20 million for a new and better watch.

In some ways, though, Zack “Danger” Brown’s Kickstarter achievement was more impressive than any of these. He turned to Kickstarter for $10 to make some potato salad—and he raised $55,492 in what must be one of history’s most lucrative expressions of hipster irony.4

Following Zack Brown’s exploits, I wondered what exciting project I might launch on Kickstarter, looking forward to settling back to count the money as it poured in.

The same thought may have occurred to David McGregor. He was bidding for £3,600 to fund a trip across Scotland, photographing its glorious scenery for a glossy book—a lovely way to fund his art, and his holiday. Jonathan Reiter had bigger ambitions. His BizzFit looked to raise $35,000 to create an algorithmic matching service for employers and employees. Shannon Limeburner was also business-minded, but sought a mere $1,700 to make samples of a new line of swimwear she was designing. Two brothers in Syracuse, New York, even launched a Kickstarter campaign in the hope of being paid $400 to film themselves terrifying their neighbors at Halloween.

These disparate campaigns have one thing in common: they received precisely zero support. Not one of these people was able to persuade strangers, friends, or even their own families to kick in so much as a cent.

My inspiration and source for these tales of Kickstarter failure is Silvio Lorusso, an artist and designer based in Venice. Lorusso’s website, Kickended.com, searched Kickstarter for all the projects that had received absolutely no funding. (There are plenty: about 10 percent of Kickstarter projects go nowhere at all, and fewer than 40 percent raise enough money to hit their funding targets.)

Kickended performs an important service. It reminds us that what we see around us is not representative of the world; it is biased in systematic ways. Normally, when we talk of bias we think of a conscious ideological slant. But many biases emerge from the way the world presents some stories to us while filtering out others.

I have never read a media report or blog post about the attempts of the young and ambitious band Stereotypical Daydream to raise $8,000 on Kickstarter to record an album. (“Our band has tried many different ways of saving money to record a legitimate album in a professional studio. Unfortunately, we still have not saved enough.”) It probably will not surprise you to hear that the Stereotypical Daydream Kickstarter campaign brought them zero dollars closer to their goal.

On the other hand, I’ve heard quite a lot about the Pebble watch, the Coolest cooler, and even that potato salad. If I didn’t know better, I might form unrealistic expectations about what running a Kickstarter campaign might achieve.

This isn’t just about Kickstarter, of course. Such bias is everywhere. Most of the books people read are bestsellers—but most books are not bestsellers, and most book projects never become books at all. There’s a similar tale to tell about music, films, and business ventures.

Even cases of COVID-19 are subject to selective attention: people who feel terrible go to the hospital and are tested for the disease; people who feel fine do not. As a result, the disease looks even more dangerous than it really is. Even though statisticians understand this problem perfectly well, there’s no easy way to solve it without systematic testing. And in the early stages of the pandemic, when the most difficult policy decisions were being made, systematic testing was elusive.

There’s a famous story about the mathematician Abraham Wald, who was asked in 1943 to advise the US military on how to reinforce its planes. The planes were returning from sorties peppered with bullet holes in the fuselage and wings; surely those spots could use some armor plating? Wald’s written response was highly technical, but the key idea is this: We observe damage only in the planes that return. What about the planes that were shot down? We rarely see damage to the engine or fuel tanks in planes that survive. That might be because those areas are rarely hit—or it might be that whenever those areas are hit, the plane is doomed. If we look only at the surviving planes—falling prey to “survivorship bias”—we’ll completely misunderstand where the real vulnerabilities are.5

The rabbit hole goes deeper. Even the story about survivorship bias is an example of survivorship bias; it bears little resemblance to what Abraham Wald actually did, which was to produce a research document full of complex technical analysis. That is largely forgotten. What survives is the tale about a mathematician’s flash of insight, with some vivid details added. What originally existed and what survives will rarely be the same thing.6

Kickended, then, provides an essential counterpoint to the breathless accounts of smash hits on Kickstarter. If successes are celebrated while failures languish out of sight (which is often the situation), then we see a very strange slice of the whole picture.

This starts to give us a clue as to what might have happened with the jam experiment. Like the Coolest cooler, it was a smash hit—but not the full story. Benjamin Scheibehenne’s role was a bit like Silvio Lorusso’s at Kickended: he had gone looking not just for the choice experiment that had gone viral, but for all the other experiments that had produced different results and had vanished into obscurity. When he did, he was able to reach a very different conclusion.


Bear Kickended in mind as you ponder the following story. In May 2010, a surprising paper was submitted to the Journal of Personality and Social Psychology. The author was Daryl Bem, a respected old hand in the field of academic psychology. What made the research paper astonishing was that it provided apparently credible statistical evidence of an utterly incredible proposition: that people could see into the future. There were nine experiments in total. In one, participants would look at a computer screen with an image of two curtains. Behind one curtain was an erotic photograph, they were told. They simply had to intuit which one. The participant would make a choice, and then—after the choice had already been made—the computer would randomly assign the photograph. If the participants’ guesses were appreciably better than chance, then that was evidence of precognition. They were.7

In another of the experiments that Bem’s research paper described, subjects were shown a list of forty-eight words and tested to see how many of the words they would remember. Then some subjects were asked to practice by retyping all the words. Normally it would be no surprise that practice helps you remember, but in this case Bem found that the practice worked even though the memory test came first and the practice came after.

How seriously should we take these results? Bear in mind that the research paper, “Feeling the Future,” was published in a respected academic journal after a process of peer review. The experiments it reported passed the standard statistical tests, which are designed to screen out fluke results. All this gives us some reason to believe that Bem found precognition.

There is a much better reason to believe that he did not, of course, which is that precognition would violate well-established laws of physics. Vigorous skepticism is justified. As the saying goes, extraordinary claims require extraordinary evidence.

Still, how did Bem accumulate all this publishable evidence for precognition? It’s puzzling. Perhaps it’s less puzzling after you connect it to the story of Kickended.

After Bem’s evidence of precognition had been published in the Journal of Personality and Social Psychology, several other studies were produced that followed Bem’s methods. None of them found any evidence for precognition, but the journal refused to publish any of them. (It did publish a critical commentary, but that’s not the same thing as publishing an experiment.) The journal’s grounds for refusal were that it “did not publish replications”—that is, once an experiment had demonstrated an effect there was no space to publish attempts to check on that effect. In theory, that might sound reasonable: Who wants to read papers confirming things they already knew? In practice, it has the absurd effect of ensuring that when something you thought you knew turns out to be wrong, you won’t hear about it. Bem’s striking finding became the last word.8

But it was also the first word. I strongly doubt that before Bem came along, any serious journal would have published research, no matter how rigorous, whose abstract read: “We tested several hundred undergraduates to see if they could see into the future. They couldn’t.”

This, then, is a survivorship bias as strong as press coverage of Kickstarter projects or trying to deduce the vulnerabilities of planes by examining only the ones whose vulnerabilities weren’t fatal. Out of all the possible studies that could have been conducted, it’s reasonable to guess that the journal was interested only in the ones that demonstrated precognition. This wasn’t because of a bias in favor of precognition. It was because of a bias in favor of novel and surprising discoveries. Before Bem, the fact that students didn’t seem to be able to see into the future was trivial and uninteresting. After Bem, the fact that students didn’t seem to be able to see into the future was a not-welcome-in-this-journal replication attempt. In other words, only evidence of precognition was publishable because only evidence of precognition was surprising. Studies showing no evidence of precognition are like bombers that have been shot in the engine: no matter how often such things happen, they’re not going to make it to where we can see them.

The “choice demotivates” finding is far more credible than the “students can see into the future” finding—but still, the jam experiment may have been subject to a similar dynamic. Imagine approaching a psychology journal before Iyengar and Lepper’s breakthrough result with the following study: “We set up stalls offering people different kinds of cheese. Sometimes the stalls had twenty-four types of cheese and sometimes just six. On the days when people were offered more types of cheese, they were a bit more likely to buy cheese.” Yawn! That’s not surprising at all. Who wants to publish that? It was only when Iyengar and Lepper ran an experiment showing the opposite result that the whole thing became not only publishable, but a Coolest-cooler smash hit.

If you read only the experiments published in the Journal of Personality and Social Psychology, you might well conclude that people can indeed see into the future. For obvious reasons, this particular flavor of survivorship bias is called “publication bias.” Interesting findings are published; non-findings, or failures to replicate previous findings, face a higher publication hurdle.

Bem’s finding was the $55,000 potato salad—wildly atypical, and widely reported as a result. The unpublished replications would typically have been like Stereotypical Daydream’s attempts to fund their album: nothing happened and nobody cared.

Except this time, somebody did care.


“The paper is beautiful,” says Brian Nosek of Daryl Bem’s study. “It follows all the rules of what one does, does it in a really beautiful way.”9

But as Nosek, a psychologist at the University of Virginia, understood perfectly well, if Bem followed all the rules of academic psychology and ended up seeming to demonstrate that people can see into the future, something is wrong with the rules of academic psychology.10

Nosek wondered what would happen if you systematically reran some more respected and credible psychological experiments. How many results would come out the same? He sent around an email to like-minded researchers, and with impressive speed managed to get a global network of nearly three hundred psychologists collaborating to check studies that had recently been published in one of three prestigious academic journals. While Benjamin Scheibehenne had been digging into one particular field—the link between motivation and choice—Nosek’s network wanted to cast their net widely. They chose a hundred studies. How many did their replication attempts back up? Shockingly few: only thirty-nine.11 That left Nosek and the rest of academic psychology with one big question on their hands: How on earth did this happen?

Part of the explanation must be publication bias. As with Daryl Bem’s study, there is a systemic bias toward publishing the interesting results, and of course flukes are more likely to seem interesting than genuine discoveries.

But there’s a deeper explanation. It’s the reason Nosek had to reach out to so many colleagues, rather than simply get his graduate student assistants to do all the checks. Since the top journals weren’t very interested in publishing replication attempts, he knew that devoting his research team full-time to a replication effort might be career suicide: they simply wouldn’t be able to accumulate the publications necessary to secure their future in academia. Young researchers must either “publish or perish,” because many universities and other research bodies use publication records as an objective basis for deciding who should get promotions or research grants.

This is another example of the Vietnam body count problem we met in the second chapter. Great researchers do indeed tend to publish lots of research that is widely cited by others. But once researchers are rewarded for the quantity and prominence of their research, they start looking for ways to maximize both. Perverse incentives take over. If you have a result that looks publishable but fragile, the logic of science tells you to try to disprove it. Yet the logic of academic grants and promotions tells you to publish at once, and for goodness’ sake don’t prod it too hard.

So not only are journals predisposed to publish surprising results; researchers facing “publish or perish” incentives are also more likely to submit surprising results that may not stand up to scrutiny.


The illusionist Derren Brown once produced an undoctored film of him tossing a coin into a bowl and getting heads ten times in a row. Brown later explained the trick: the stunning sequence came only at the end of nine excruciating hours of filming, when the string of ten heads finally materialized.12 There is a 1-in-1,024 chance of getting ten heads in a row if you toss a fair coin ten times. Toss it a few thousand times and a run of ten consecutive heads is almost guaranteed. But Brown could send his stunning result off to the Journal of Coin Flipping, perhaps with the delicious title (suggested by the journalists Jacob Goldstein and David Kestenbaum) “Heads Up! Coin-Flipping Bias in American Quarter Dollars Minted in 1977.”13
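The arithmetic behind Brown’s stunt is easy to check. Here is a minimal sketch in Python (my own illustration, nothing to do with Brown or his imaginary journal): any single run of ten tosses comes up all heads about one time in a thousand, but keep the camera rolling for a couple of thousand tosses and such a run becomes close to inevitable.

```python
import random

# Chance that any single run of ten fair-coin tosses is all heads
print(0.5 ** 10)  # 0.0009765625, roughly 1 in 1,024

def tosses_until_ten_heads(seed):
    """Toss a fair coin until ten heads appear in a row; return the toss count."""
    rng = random.Random(seed)
    streak, tosses = 0, 0
    while streak < 10:
        tosses += 1
        streak = streak + 1 if rng.random() < 0.5 else 0
    return tosses

sessions = [tosses_until_ten_heads(seed) for seed in range(1_000)]
print(sum(sessions) / len(sessions))  # typically around 2,000 tosses
```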

To be clear—such a research paper would be fraudulent, and nobody believes that such extreme and premeditated publication bias explains the large number of nonreplicable studies that Nosek and his colleagues unveiled. But there are shades of gray.

What if 1,024 researchers individually researched coin tossing and one of them produced the stunning result of ten heads in a row? That is mathematically the same situation, but from the point of view of the astonished researcher in question, she or he would be blameless. Now, it seems unlikely that so many researchers would have bothered to investigate coin tossing—but we don’t know how many people tried and failed to find precognition before Daryl Bem succeeded.

The shades of gray also apply within an individual researcher’s laboratory. For example, a scientist could do a small exploratory study. If he or she found an impressive result, why not publish? But if the study fell flat, the researcher could chalk it up as a learning experience and try something else. This behavior doesn’t sound especially unreasonable to the layman, and it probably doesn’t feel unreasonable to the researchers doing it—but it is publication bias nonetheless, and it means that flukes are likely to be disproportionately published.

Another possibility is that the researcher does the study, finds some promising results, but those results are not quite statistically solid enough to publish. Why not keep going, recruiting some more participants, gathering more data, and seeing if the results firm up? Again, this doesn’t seem unreasonable. What could be wrong with gathering more data? Wouldn’t that just mean that the study was getting closer and closer to the truth? There’s nothing wrong with doing a large study. In general, more data is better. But if data are gathered bit by bit, testing as we go, then the standard statistical tests aren’t valid. Those tests assume that the data have simply been gathered, then tested—not that scientists have collected some data, tested them, and then maybe collected a bit more.

To see the problem, imagine a game of basketball is about to be played and someone asks you a question: How convincing would a victory have to be before you feel confident saying that the winning team is better than the other team, rather than just luckier on the day? There’s no right answer—after all, sometimes luck can be outrageous. But you might decide that a margin of, say, ten points at the end of the game is enough to be convincing. This is, very roughly, what the standard statistical tests do to decide whether or not an effect is deemed to be “significant” enough to publish.

But now imagine the organizer of the basketball game stands to get a bonus if one of the teams turns out to be better—it doesn’t matter which—so, without telling you, she decides that if either team is ever ahead by ten points, she’ll bring the game to an early halt. And if, at the final whistle, the two teams are separated by seven, eight, or nine points, she’ll play overtime to see if the gap opens up to ten. After all, she’s just a basket or two away from demonstrating the superiority of one of the teams!

It’s obvious that this would be a misuse of the test you set, but misuse of this kind seems to be quite common in practice.14
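Here is a rough simulation of the organizer’s trick, a toy model of my own rather than anything from the studies discussed here. The data contain no real effect at all, yet peeking after every small batch and stopping the moment the usual 5 percent test is passed produces far more than 5 percent false alarms.

```python
import random
import statistics

def peek_and_stop(max_n=200, batch=10, threshold=1.96, seed=0):
    """Collect pure-noise data in batches; after each batch, apply a rough
    5 percent two-sided test and stop as soon as it looks 'significant'."""
    rng = random.Random(seed)
    data = []
    while len(data) < max_n:
        data.extend(rng.gauss(0, 1) for _ in range(batch))  # no true effect
        mean = statistics.mean(data)
        se = statistics.stdev(data) / len(data) ** 0.5
        if abs(mean / se) > threshold:
            return True  # false alarm: noise declared 'significant'
    return False

runs = 2_000
false_alarms = sum(peek_and_stop(seed=s) for s in range(runs))
print(f"False-positive rate with peeking: {false_alarms / runs:.0%}")
# Each individual look uses the standard 5 percent rule, yet the overall
# false-positive rate comes out well above 5 percent.
```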

A third problem is that researchers also have choices as to how they analyze the data. Maybe the study holds up for men, but not women.* Maybe the study holds up if the researcher makes a statistical adjustment for age, or for income. Maybe there are some weird outliers and the study holds up only if they are included, or only if they are excluded.

Or maybe the scientist has a choice of different things she or he could measure. For instance, a study of how screen use affects the well-being of young people could measure both screen use and well-being in different ways. Well-being can be measured by asking people about episodes of anxiety; or it could be measured by asking people about how satisfied they are with their lives; or it could be measured by asking a young person’s parents how they think he or she is doing. Screen time could be measured directly through a tracking app, or indirectly through a survey; or perhaps rather than “screen time” one might want to measure “frequency of social media use.” None of these choices is right or wrong, but—again—the standard statistical tests assume that the researcher made the choice before collecting the data, then collected data, then ran the test. If the researcher ran several tests, then made a choice, flukes are vastly more likely.

Even if the researcher ran only one test, flukes are more likely to slip through if he or she did so after gathering the data and getting a feel for how they looked. This leads to yet another kind of publication bias: if a particular way of analyzing the data produces no result, and a different way produces something more intriguing, then of course the more interesting method is likely to be what is reported and then published.

Scientists sometimes call this practice “HARKing”—HARK is an acronym for Hypothesizing After Results Known. To be clear, there’s nothing wrong with gathering data, poking around to find the patterns, and then constructing a hypothesis. That’s all part of science. But you then have to get new data to test the hypothesis. Testing a hypothesis using the numbers that helped form the hypothesis in the first place is not OK.15

Andrew Gelman, a statistician at Columbia University, favors the term “the garden of forking paths,” named after a short story by Jorge Luis Borges. Each decision about what data to gather and how to analyze them is akin to standing on a pathway as it forks left and right and deciding which way to go. What seems like a few simple choices can quickly multiply into a labyrinth of different possibilities. Make one combination of choices and you’ll reach one conclusion; make another, equally reasonable, and you might find a very different pattern in the data.16
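A toy calculation (mine, not Gelman’s) gives a sense of scale: if a researcher faces, say, twenty defensible forks and the data contain no real effect, then even treating each fork as an independent 5 percent chance of a fluke, the odds that at least one path delivers a “significant” result are roughly two in three.

```python
# Each fork in the path is treated as an independent 5 percent chance of a fluke.
# (Real analysis choices overlap, so this is only a rough illustration.)
n_forks = 20
p_fluke_somewhere = 1 - 0.95 ** n_forks
print(f"{p_fluke_somewhere:.0%}")  # about 64%
```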

A year after Daryl Bem’s result was released, three psychologists published a demonstration of just how seriously researchers could go astray using standard statistical methods combined with these apparently trivial slips and fudges.17 The researchers, Joseph Simmons, Uri Simonsohn, and Leif Nelson, “proved” that listening to “When I’m Sixty-Four” by the Beatles would make you nearly eighteen months younger.18

I know you’re curious: How did they do it? The researchers collected various pieces of information from each participant, including their age, their gender, how old they felt, the age of their fathers, and the age of their mothers—along with various other almost completely irrelevant facts. They analyzed every possible combination of these variables, and they also analyzed the data in sets of ten participants, stopping to check for significant results each time. In the end they found that if they statistically adjusted for the fathers’ ages, but not the mothers’, and if they stopped after twenty participants, and if they discarded the other variables, then they could demonstrate that people who had been randomly assigned to listen to “When I’m Sixty-Four” were substantially younger than a control group who had been randomly assigned to listen to a different song. All utter nonsense, of course—but utter nonsense that bore an eerie resemblance to research that had been published and taken seriously. Would genuine researchers ever push so far over the line from rigorous practice into rigged research? Probably not very often. But those who did would get more attention. And the majority who did not might unwittingly commit subtler versions of the same statistical sins.

The standard statistical methods are designed to exclude most chance results.19 But a combination of publication bias and loose research practices means we can expect that mixed in with the real discoveries will be a large number of statistical accidents.


Darrell Huff’s How to Lie with Statistics describes how publication bias can be used as a weapon by an amoral corporation more interested in money than truth. With his trademark cynicism, he mentions that a toothpaste maker can truthfully advertise that the toothpaste is wonderfully effective simply by running experiments, putting all unwelcome results “well out of sight somewhere” and waiting until a positive result shows up.20 That is certainly a risk—not only in advertising but also in the clinical trials that underpin potentially lucrative pharmaceutical treatments. But might accidental publication bias be an even bigger risk than weaponized publication bias?

In 2005, John Ioannidis caused a minor sensation with an article titled “Why Most Published Research Findings Are False.” Ioannidis is a “meta-researcher”—someone who researches the nature of research itself.* He reckoned that the cumulative effect of various apparently minor biases might mean that false results could easily outnumber the genuine ones. This was five years before the Journal of Personality and Social Psychology published Daryl Bem’s research on precognition, which sparked Brian Nosek’s replication attempt. Precognition might not exist, but Ioannidis clearly saw the crisis coming.21

I confess that when I first heard of Ioannidis’s research, it struck me as an extraordinary piece of hyperbole. Sure, all scientific research is provisional, everyone makes mistakes, and sometimes bad papers get published—but surely it was wrong to suggest that more than half of all the empirical results out there were false? But after interviewing Scheibehenne and learning what he’d discovered about the choice literature, I started to wonder. Then, over the years, it gradually became painfully clear to me and many others who were initially skeptical that Ioannidis was on to something important.

While Bem’s precognition study was understandably famous, many other surprising psychological findings had become well known to non-psychologists through books such as Thinking, Fast and Slow (by Nobel laureate Daniel Kahneman), Presence (by psychologist Amy Cuddy), and Willpower (by psychologist Roy Baumeister and journalist John Tierney). These findings hit the same counterintuitive sweet spot as the jam experiment: strange enough to be memorable, but plausible enough not to dismiss out of hand.

Baumeister is famous in academic psychology for studies showing that self-control seems to be a limited resource. People asked to restrain themselves by munching radishes while delicious freshly baked chocolate cookies lay within easy reach were then quicker to abandon a frustrating task later.22 Cuddy found that asking people to adopt “power poses”—for example, hands-on-hips like Wonder Woman—boosted their levels of testosterone and suppressed their levels of the stress hormone cortisol.23 Kahneman described the “priming” research of John Bargh. Young experimental subjects were asked to solve a word puzzle in which some of them were exposed to words that suggested old age, such as bald, retirement, wrinkle, Florida, and gray. The young subjects who had not seen these particular words then set off briskly down the corridor to participate in another task; the young subjects who had, instead, been “primed” with words suggesting old age shuffled off down the corridor at a measurably slower pace.24

These are extraordinary results, but as Kahneman wrote about priming research, “Disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Now we realize that disbelief is an option. Kahneman does, too. Publication bias, and more generally the garden of forking paths, means that plenty of research that seems rigorous at first sight both to onlookers and often to the researchers themselves may instead be producing spurious conclusions. These studies—of willpower, of power posing, and of priming—have all proved difficult to replicate. In each case, the researchers have defended their original finding, but the prospect that they were all statistical accidents seems increasingly reasonable.

Daniel Kahneman himself dramatically raised the profile of the issue when he wrote an open letter to psychologists in the field warning them of a looming “train wreck” if they could not improve the credibility of their research.25

The entire saga—Ioannidis’s original paper, Bem’s nobody-believes-this finding, the high-profile struggles to replicate Baumeister’s, Cuddy’s, and Bargh’s research, and as the coup de grâce, Nosek’s discovery that (as Ioannidis had said all along) high-profile psychological studies were more likely not to replicate than to stand up—was sometimes described as a “replication crisis” or a “reproducibility crisis.”

In the light of Kickended, perhaps none of this should have been a surprise—but it is shocking nonetheless. The famous psychological results are famous not because they are the most rigorously demonstrated, but because they’re interesting. Fluke results are far more likely to be surprising, and so far more likely to hit that Goldilocks level of counterintuitiveness (not too absurd, but not too predictable) that makes them so fascinating. The “interestingness” filter is enormously powerful.


Little harm is done if publication bias (and survivorship bias) merely produces cute distortions in our view of the world, leading people to prepare for a job interview by finding a secluded spot to strike a Wonder Woman pose. Even if many would-be entrepreneurs are foolishly over-optimistic about their chances of raising money on Kickstarter, we all enjoy the fruits of successful new business ideas that more rational people would not have quit their jobs to pursue. And few scientists were about to embrace Daryl Bem’s apparent discovery of precognition, for reasons well summarized by Ben Goldacre, an expert in evidence-based medicine: “I wasn’t very interested, for the same reasons you weren’t. If humans really could see the future, we’d probably know about it already; and extraordinary claims require extraordinary evidence, rather than one-off findings.”26

But Ben Goldacre thinks the stakes are higher, and so do I. This bias may have serious consequences for both our money and our health.

Money first. Business writing—a field in which I confess to dabbling—is dripping with examples of survivorship bias. In my book Adapt, I had a little chuckle about the Tom Peters and Robert Waterman book In Search of Excellence, a blockbusting business bestseller published in 1982, which offered management lessons gleaned from studying forty-three of the most outstanding corporations of that time. If they really were paragons of brilliant management, then one might have expected their success to last. If instead they were the winners of an invisible lottery, the beneficiaries of largely random strokes of good fortune, then we would expect that the good luck would often fail to last.

Sure enough, within two years almost a third of them were in serious financial trouble. It’s easy to mock Peters and Waterman—and people did—but the truth is that a healthy economy has a lot of churn in it. Corporate stars rise, and burn out. Sometimes they have lasting qualities, sometimes fleeting ones, and sometimes no qualities at all, bar some luck. By all means look at the success stories and try to learn lessons, but be careful. It is easy, in Nassim Taleb’s memorable phrase, to be “fooled by randomness.”

Perhaps all such business writing is harmless: when daily data from the shop floor contradict the business-book wisdom, the shop floor will win. While the jam study became famous among the chattering classes, there is scant sign that many businesses took the “Choice is bad” finding seriously in the decisions they made about stocking their shelves. Still, one can’t help suspecting that where good data are rarer, major decisions are being made on the basis of survivor bias.

In finance, the problem may be worse. A Norwegian TV show illustrated this rather brilliantly in 2016 by organizing a stock-picking competition, in which investors would buy a variety of Norwegian shares to the value of 10,000 Norwegian kroner—about $1,000. The competitors were a diverse bunch: a pair of stockbrokers, who confidently opined, “The more you know, the better you’ll do”; the presenters of the show; an astrologer; two beauty bloggers who confessed to never having heard of any of the companies in question; and a cow named Gullros who would pick stocks by wandering around a field marked out in a grid of company names and expressing her conviction by defecating in the relevant square.

The astrologer fared worst; the professionals did a little better, matching the performance of Gullros the cow (both the cow and the professionals achieved a respectable 7 percent return over the three-month contest); the beauty bloggers did better still—but the standout winners were the TV presenters, with a return of nearly 25 percent over just three months. How had they done so well? Simple: they hadn’t entered their own competition just once. Secretly, they’d done so twenty times by allowing themselves to pick twenty different portfolios. They revealed only the best-performing one to the audience. They appeared to be inspired stock-pickers, until they revealed their own trick. Survivor bias conquers all.27

With that in mind, it is hard to evaluate an investment manager who picks stocks or other financial products. They have everything to gain by persuading us that they are a genius, but have very little to show us except a track record. “My fund beat the market last year, and the year before” is pretty much all we have to go on. The trouble is that we see only the successes, alongside the schadenfreude of the occasional high-profile implosion. Underperforming investment funds tend to be closed down, merged, or rebranded. A major investment house will offer many different funds, and will advertise the ones that have been successful in the past. The Norwegian TV show condensed and exaggerated the process, but be assured that when fund managers advertise their stellar results, those ads do not contain a random sample of the funds on offer.

Survivor bias even distorts some studies of investment performance. These studies often start by looking at “funds that exist today” without fully acknowledging or adjusting for the fact that any fund still in existence is a survivor—and that introduces a survivorship bias. Burton Malkiel, economist and author of A Random Walk down Wall Street, once tried to estimate how much survivorship bias flattered the performance of the surviving funds. His estimate—an astonishing 1.5 percent per year. That might not sound like much, but over a lifetime of investing it’s a factor of two: you expect retirement savings of (say) $100,000 and end up with $50,000 instead. Put another way, if you ignore all the investment funds that quietly disappear, the apparent performance is twice as good as the actual performance.28 The result is to persuade people to invest in actively managed funds, which often charge high fees, when they might be better served by a low-cost, low-drama fund that passively tracks the stock market as a whole. That is a decision worth tens of billions of dollars a year across the US economy; if it’s a mistake, it’s a multibillion-dollar mistake.29
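Malkiel’s 1.5 percent is simply compound interest working against you. A back-of-the-envelope sketch, using a round forty-seven-year saving horizon of my own choosing rather than any figure of Malkiel’s, shows how the gap grows to a factor of two.

```python
# A 1.5-percentage-point annual flattering of returns, compounded over a
# working lifetime of saving, makes performance look about twice as good.
years = 47
factor = 1.015 ** years
print(f"After {years} years the gap compounds to a factor of {factor:.2f}")  # about 2.0
```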

So much for money. What about health? Consider the life-or-death matter of which medical treatments work and which don’t. A randomized controlled trial (RCT) is often described as the gold standard for medical evidence. In an RCT, some people receive the treatment being tested while others, chosen at random, are given either a placebo or the best known treatment. An RCT is indeed the fairest one-shot test of a new medical treatment, but if RCTs are subject to publication bias, we won’t see the full picture of all the tests that have been done, and our conclusions are likely to be badly skewed.30

For example, in 2008 a quick survey of studies of a variety of antidepressant medications would have found forty-eight trials showing a positive effect and three showing no positive effect. This sounds pretty encouraging, until you ponder the risk of publication bias. So the researchers behind that survey looked harder, digging out twenty-three unpublished trials; of these, twenty-two had a negative result in which the drug did not help patients. They also found that eleven of the trials that seemed positive in the articles describing them had in fact produced negative results in the summaries presented to the regulator, the US Food and Drug Administration. The articles had managed to cherry-pick some good news and hand-wave away some bad news, and finish up presenting a positive-seeming picture of a drug that had not, in fact, been effective. The corrected score, then, was not 48–3 in favor of antidepressants working well; it was 38–37. Perhaps the antidepressants do work, at least sometimes or for some people, but it’s fair to say that the published results did not fairly reflect all the experiments that had been conducted.31

This matters. Billions of dollars are misspent and hundreds of thousands of lives lost because of survivorship bias, when we make decisions without seeing the whole story—the investment funds that folded, the Silicon Valley entrepreneurs who never got beyond the “junk in the garage” stage, the academic studies that were never published, and the clinical trials that went missing in action.


So far, this chapter has told a tale of catastrophe. The one bright spot is that these problems are vastly better understood and appreciated than they were even five years ago. So let’s focus on that bright spot for a moment, and ask if there’s hope for improvement.

For researchers, it’s clear what that improvement would look like: They need to come clean about the Kickended side of research. They need to be transparent about the data that were gathered but not published, the statistical tests that were performed but then set to one side, the clinical trials that went missing in action, and the studies that produced humdrum results and were rejected by journals or stuffed in a file drawer while researchers got on with something more fruitful.

Those of us who write about research have a similar responsibility: not just to report on a stunning new result, but to set it in the context of what has been published before—and, preferably, what should have been published but languishes in obscurity.

Ideally, we need to be able to rise out of Andrew Gelman’s “garden of forking paths” and see the maze from above, including the dead ends and the paths less traveled. That view from above comes when we have all the relevant information in the most user-friendly form.

We are a long way from achieving those standards—but there are distinct signs of improvement. It is slow and incomplete, but it is improvement nonetheless. In medicine, for example, in 2005 the International Committee of Medical Journal Editors declared that the top medical journals they edited would no longer publish clinical trials that hadn’t been preregistered. Preregistration means that before conducting a trial, researchers have to explain what they plan to do and how they plan to analyze the results, posting that explanation on a public website. Such preregistration is an important fix for publication bias, because it means that researchers can easily see cases in which a trial was planned but then somehow the results went missing in action. Preregistration should also allow other researchers to read a trial write-up and then go back to check that the plan for analyzing the data was followed, rather than being changed once the data appeared.

Preregistration isn’t a panacea. It poses a particular challenge for field studies in social science, which often require academic researchers to piggyback on some project being conducted by a government or charitable organization. Such projects evolve over time in ways that researchers cannot control or predict. And even when medical journals demand preregistration, they may fail to enforce their own demands.32 Ben Goldacre and his colleagues at Oxford University’s Centre for Evidence-Based Medicine spent a few weeks systematically monitoring the publication of new articles in the top medical journals. They identified fifty-eight articles that fell short of the reporting standards those journals had agreed to uphold—for example, clinical trials that had prespecified that they’d measure certain outcomes for patients, but then later switched to reporting different outcomes. They promptly wrote letters of correction to the journal editors but found that their letters were often rejected rather than published.33

It’s disappointing to realize that standards are patchily enforced, but perhaps not surprising given that the entire system is basically self-regulated by the standards of a professional community, rather than governed by some central Solomonic figure. And it does seem to me that the situation has significantly improved over the past two decades: awareness is improving, bad practice is being called out, and it is better to have patchy standards than no standards at all. We have journals such as Trials, launched in 2006, which will publish the results of any clinical trial, regardless of whether the outcome was positive or negative, fascinating or dull, ensuring that no scientific study languishes unpublished simply because it wasn’t regarded as newsworthy in the world of research. There’s an enormous opportunity to do more with automated tools, such as automatically identifying missing trials (studies that were preregistered but then not published), or spotting when later papers cite earlier research that has since been updated, corrected, or withdrawn.34

In psychology, the kerfuffle over precognition may well have a positive result. Academic psychologists want to get published, of course, but most of them don’t want to produce junk science; they want to find out what’s true. The reproducibility crisis seems to be improving awareness of good research standards, as well as holding out more carrots to reward replication efforts, and more sticks to punish sloppy research.

There are encouraging signs that more researchers are welcoming replication efforts. In 2010, for instance, political scientists Brendan Nyhan and Jason Reifler published a study on what became known as “the backfire effect”—in brief, that people were more likely to believe a false claim if they’d been shown a fact-check that debunked the claim. This caused a moral panic among some journalists, particularly after the rise of Donald Trump. Fact-checking only makes matters worse! It hit that perfect counterintuitive sweet spot. But Nyhan and Reifler encouraged further studies, and those studies suggest that the backfire effect is unusual and fact-checking does help. One summary of the research concluded: “Generally debunking can make people’s beliefs in specific claims more accurate.” Nyhan himself has quoted this summary on Twitter when he sees people relying on his original paper without considering the follow-ups.35

Many statisticians believe the crisis points to the need to rethink the standard statistical tests themselves—that the very concept of statistical significance is deeply flawed. Mathematically, the test is simple enough. You start by assuming that there is no effect (the drug does not work; the coin is fair; precognition does not exist; the twenty-four-jam stall and the six-jam stall are equally appealing), and then you ask how unlikely the observed data are. For example, if you assume that a coin is fair and you toss it ten times, you’d expect to see heads five times, but you wouldn’t be surprised to see six heads or maybe even seven. You’d be astonished to see ten heads in a row—and given that this would happen by chance less than one time in a thousand, you might question your original assumption that the coin was fair. Statistical significance testing relies on the same principle: assuming no effect, are the data you collect surprising? For instance, when testing a drug, your statistical analysis begins with the assumption that the drug does not work; when you observe that lots of the patients taking the drug are doing much better than the patients who are taking a placebo, you revise that assumption. In general, if the chance of randomly observing data at least as extreme as those you collected is less than 5 percent, the results are significant enough to overturn the assumption: we can conclude with a sufficient degree of confidence that the drug works, that large displays of jam discourage people from buying jam, and that precognition exists.
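For anyone who wants the coin-tossing numbers spelled out, here is a minimal sketch of the calculation (mine, using the standard binomial formula): assuming a fair coin, a result at least as extreme as ten heads out of ten has a probability below one in a thousand, comfortably past the usual 5 percent threshold.

```python
from math import comb

n = 10  # tosses of a fair coin
# Probability of at least ten heads in ten tosses, assuming the coin is fair
p_value = sum(comb(n, k) * 0.5 ** n for k in range(10, n + 1))
print(p_value)  # about 0.00098, i.e. less than one time in a thousand
```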

The problems are obvious. Five percent is an arbitrary cutoff point—why not 6 percent, or 4 percent?—and it encourages us to think in black-and-white, pass-or-fail terms, instead of embracing degrees of uncertainty. And if you found that explanation confusing, I don’t blame you. Conceptually, statistical significance is baffling, almost backward: it tells us the chance of observing the data given a particular theory, the theory that there is no effect. Really, we’d like to know the opposite, the probability of a particular theory being true, given the data. My own instinct is that statistical significance is an unhelpful concept and we could do better, but others are more cautious. John Ioannidis—he of the “Why Most Published Research Findings Are False” paper—argues that despite the flaws of the method, it’s “a convenient obstacle to unfounded claims.”

Unfortunately, there is no single clever statistical technique that would make all these problems evaporate. The journey toward more rigorous science requires many steps, and we at least are taking some of them. I recently had the chance to interview Richard Thaler, a Nobel Memorial Prize winner in economics, who has collaborated with Daniel Kahneman and many other psychologists. He struck me as well placed to evaluate psychology as a sympathetic outsider. “I think the replication crisis has been great for psychology,” he told me. “There’s just better hygiene.”36 Brian Nosek, meanwhile, told the BBC: “I think if we do another large reproducibility project five years from now, we are going to see a dramatic improvement in reproducibility in the field.”37


In the early chapters of this book, I cited numerous psychological studies of motivated reasoning and the biased assimilation of information. You may by now be wondering: How do I know that those studies are credible?

The honest answer is that I cannot be certain. Any experimental research I cite has a chance of being the next jam experiment—or, much worse, the next discovery that listening to “When I’m Sixty-Four” will make you younger. But when I read the studies I’ve described, I try to put the advice from the last few pages into practice. I try to get a sense of whether the study fits into the broader picture of what we know, or whether it’s some strange outlier. If there are twenty or thirty studies from different academics using different methods, but all pointing to a similar conclusion—for instance, that our powers of logical reasoning are skewed by our political beliefs—then I am less concerned that an individual experiment might turn out to be a fluke. If an empirical discovery makes sense in theory and in practice as well as in the lab, that’s reassuring.

On most topics, most of us will not be digging through academic papers. We’ll rely on the media to get a digestible take on the state of scientific knowledge. Science journalism is like any other kind of journalism: There is good, and there is bad. You can find superficial, sensationalist retreads of press releases that are themselves superficial and sensationalist. Or you can find science journalism that explains the facts, puts them in a proper context, and when necessary speaks truth to power. If you care enough as a reader you can probably figure out the difference. It’s really not hard. Ask yourself if the journalist reporting on the research has clearly explained what’s being measured. Was this a study done with humans? Or mice? Or in a petri dish? A good reporter will be clear. Then: How large is the effect? Was this a surprise to other researchers? A good journalist will try to make space to explain—and the article will be much more fun to read as a result, satisfying your curiosity and helping you to understand.*

If in doubt, you can easily find second opinions: almost any major research finding in science or social science will quickly be picked up and digested by academics and other specialists, who’ll post their own thoughts and responses online. Science journalists themselves believe that the internet has improved their profession: in a survey of about a hundred European science journalists, two-thirds agreed with that idea, and fewer than 10 percent disagreed.38 That makes sense: the internet has made it easier to read the journal articles, easier to access the systematic reviews, and easier to reach scientists for a second opinion.

If the story you’re reading is about health, there’s one place you should be sure to look for a second opinion: the Cochrane Collaboration. It’s named after Archie Cochrane, a doctor, epidemiologist, and campaigner for better evidence in medicine. In 1941, when Cochrane was captured by the Germans and became a prisoner of war, he improvised a clinical trial. It was an astonishing combination of bravery, determination, and humility. The prison camp was full of sick men—Cochrane was one of them—and he suspected that the illness was caused by a dietary deficiency, but he knew that he didn’t know enough to confidently prescribe a treatment. Rather than slump into despair or follow a hunch, he managed to organize his fellow prisoners to test the effects of different diets, discovered what they were lacking, and provided incontrovertible evidence to the camp commandant. Vitamin supplements were duly procured, and many lives were saved as a result.39

In 1979, Cochrane wrote that “it is surely a great criticism of our profession that we have not organised a critical summary, by specialty or subspecialty, adapted periodically, of all relevant randomised controlled trials.” After Cochrane’s death, this challenge was taken up by Iain Chalmers. In the early 1990s, Chalmers began assembling a collection of systematic reviews, at first just of the randomized trials conducted in the field of perinatal health—the care of pregnant women and their babies. The effort grew into an international network of researchers who review, rate, synthesize, and publish the best available evidence on a huge variety of clinical topics.40 They call themselves the Cochrane Collaboration and they maintain the Cochrane Library, an online database of systematic research reviews. The full database is not freely available in every country, but the accessible research summaries are, providing short descriptions of the state of knowledge based on randomized trials.

I looked at some recent research summaries, pretty much at random, to see what came up. One of the front-page summaries promised to evaluate “yoga for treating urinary incontinence in women.” Well, I don’t practice yoga, don’t suffer from urinary incontinence, and am not a woman, so my evaluation of this report promised to be uncompromised by any actual knowledge about the topic.

Before I looked at what the Cochrane Library had to say, I typed “Can yoga cure incontinence?” into Google. WebMD was one of the top search results.41 It reported that a new trial had shown dramatic improvements for older women, although it noted that the study was quite small. The Daily Mail picked up on the same study and reported it in a similar way: the improvements were big, but the study was small.42 The top search result was from a private health care company:43 it enthused about the spectacular results and did not mention how small the study was, although it did link through to the original research.44

None of this reporting is great, but neither is it terrible. To be honest, I expected worse. Nor is much harm likely to result. People may take up yoga with false hopes, or alternatively may take up yoga, get better, and then credit the yoga when in fact they would have gotten better anyway. But none of this would be disastrous.

Still, the media reports failed to give the backstory. They simply regurgitated the scientific research without any indication of whether it accorded with, or contradicted, anything that had already been discovered.

The Cochrane Library, by contrast, aims to provide an accessible summary of everything we know about yoga and incontinence—if anything. It’s also on the first page of Google search results. Cochrane is not a secret.

The Cochrane review, written in plain and unshowy language, is clear enough. There have only been two studies of the issue. Both of them were small. The evidence is weak, but what evidence there is suggests that for urinary incontinence yoga is better than nothing, and that mindfulness meditation is better than yoga. That’s it—the result of a quick Google search and one minute scanning a page written in plain English. (Translations into many languages are available.) It would be nice, of course, if there was a vast and credible evidence base to lean on, but in this case, there isn’t—and I’d rather know that. Thanks to the Cochrane summary we no longer have to guess if there’s a pile of important evidence that we simply weren’t told about.45

A related network, the Campbell Collaboration, aims to do the same thing for social policy questions in areas such as education and criminal justice. As these efforts gain momentum and resources, it will become easier for us to work out whether a study makes sense and fits into a wider pattern of discoveries—or whether it’s a $55,000 potato salad.