Extraordinary claims require extraordinary evidence.
—Carl Sagan
A heartening exception to the disdain for reason in so much of our online discourse is the rise of a “Rationality Community,” whose members strive to be “less wrong” by compensating for their cognitive biases and embracing standards of critical thinking and epistemic humility.1 The introduction to one of their online tutorials can serve as an introduction to the subject of this chapter:2
Bayes’ rule or Bayes’ theorem is the law of probability governing the strength of evidence—the rule saying how much to revise our probabilities (change our minds) when we learn a new fact or observe new evidence.
You may want to learn about Bayes’ rule if you are:
A professional who uses statistics, such as a scientist or doctor;
A computer programmer working in machine learning;
A human being.
Yes, a human being. Many Rationalistas believe that Bayes’s rule is among the normative models that are most frequently flouted in everyday reasoning and which, if better appreciated, could add the biggest kick to public rationality. In recent decades Bayesian thinking has skyrocketed in prominence in every scientific field. Though few laypeople can name or explain it, they have felt its influence in the trendy term “priors,” which refers to one of the variables in the theorem.
A paradigm case of Bayesian reasoning is medical diagnosis. Suppose that the prevalence of breast cancer in the population of women is 1 percent. Suppose that the sensitivity of a breast cancer test (its true-positive rate) is 90 percent. Suppose that its false-positive rate is 9 percent. A woman tests positive. What is the chance that she has the disease?
The most popular answer from a sample of doctors given these numbers was in the range of 80 to 90 percent.3 Bayes’s rule allows you to calculate the correct answer: 9 percent. That’s right, the professionals whom we entrust with our lives flub the basic task of interpreting a medical test, and not by a little bit. They think there’s almost a 90 percent chance she has cancer, whereas in reality there’s a 90 percent chance she doesn’t. Imagine your emotional reaction upon hearing one figure or the other, and consider how you would weigh your options in response. That’s why you, a human being, want to learn about Bayes’s theorem.
Risky decision making requires both assessing the odds (Do I have cancer?) and weighing the consequences of each choice (If I do nothing and have cancer, I could die; if I undergo surgery and don’t have cancer, I will suffer needless pain and disfigurement). In chapters 6 and 7 we’ll explore how best to make consequential decisions when we know the probabilities, but the starting point must be the probabilities themselves: given the evidence, how likely is it that some state of affairs is true?
For all the scariness of the word “theorem,” Bayes’s rule is rather simple, and as we will see at the end of the chapter, it can be made gut-level intuitive. The great insight of the Reverend Thomas Bayes (1701–1761) was that the degree of belief in a hypothesis may be quantified as a probability. (This is the subjectivist meaning of “probability” that we met in the last chapter.) Call it prob(Hypothesis), the probability of a hypothesis, that is, our degree of credence that it is true. (In the case of medical diagnosis, the hypothesis is that the patient has the disease.) Clearly our credence in any idea should depend on the evidence. In probability-speak, we can say that our credence should be conditional on the evidence. What we are after is the probability of a hypothesis given the data, or prob(Hypothesis | Data). That’s called the posterior probability, our credence in an idea after we’ve examined the evidence.
If you’ve taken this conceptual step, then you are prepared for Bayes’s rule, because it’s just the formula for conditional probability, which we met in the last chapter, applied to credence and evidence. Remember that the probability of A given B is the probability of A and B divided by the probability of B. So the probability of a hypothesis given the data (what we are seeking) is the probability of the hypothesis and the data (say, the patient has the disease and the test result comes out positive) divided by the probability of the data (the total proportion of patients who test positive, healthy and sick). Stated as an equation: prob(Hypothesis | Data) = prob(Hypothesis and Data) / prob(Data). One more reminder from chapter 4: the probability of A and B is the probability of A times the probability of B given A. Make that simple substitution and you get Bayes’s rule:

prob(Hypothesis | Data) = prob(Hypothesis) × prob(Data | Hypothesis) / prob(Data)
What does this mean? Recall that prob(Hypothesis | Data), the expression on the left-hand side, is the posterior probability: our updated credence in the hypothesis after we’ve looked at the evidence. This could be our confidence in a disease diagnosis after we’ve seen the test results.
Prob(Hypothesis) on the right-hand side means the prior probability or “priors,” our credence in the hypothesis before we looked at the data: how plausible or well established it was, what we would be forced to guess if we had no knowledge of the data at hand. In the case of a disease, this could be its prevalence in the population, the base rate.
Prob(Data | Hypothesis) is called the likelihood. In the world of Bayes, “likelihood” is not a synonym for “probability,” but refers to how likely it is that the data would turn up if the hypothesis is true.4 If someone does have the disease, how likely is it that they would show a given symptom or get a positive test result?
And prob(Data) is the probability of the data turning up across the board, whether the hypothesis is true or false. It’s sometimes called the “marginal” probability, not in the sense of “minor” but in the sense of adding up the totals for each row (or each column) along the margin of the table—the probability of getting those data when the hypothesis is true plus the probability of getting those data when the hypothesis is false. A more mnemonic term is the commonness or ordinariness of the data. In the case of medical diagnosis, it refers to the proportion of all the patients who have a symptom or get a positive result, healthy and sick.
Substituting the mnemonics for the algebra, Bayes’s rule becomes:

posterior = prior × likelihood / commonness of the data
Translated into English, it becomes “Our credence in a hypothesis after looking at the evidence should be our prior credence in the hypothesis, multiplied by how likely the evidence would be if the hypothesis is true, scaled by how common that evidence is across the board.”
And translated into common sense, it works like this. Now that you’ve seen the evidence, how much should you believe the idea? First, believe it more if the idea was well supported, credible, or plausible to start with—if it has a high prior, the first term in the numerator. As they say to medical students, if you hear hoofbeats outside the window, it’s probably a horse, not a zebra. If you see a patient with muscle aches, he’s more likely to have the flu than kuru (a rare disease seen among the Fore tribe in New Guinea), even if the symptoms are consistent with both diseases.
Second, believe the idea more if the evidence is especially likely to occur when the idea is true—namely if it has a high likelihood, the second term in the numerator. It’s reasonable to take seriously the possibility of methemoglobinemia, also known as blue skin disorder, if a patient shows up with blue skin, or Rocky Mountain spotted fever if a patient from the Rocky Mountains presents with spots and fever.
And third, believe it less if the evidence is commonplace—if it has a high marginal probability, the denominator of the fraction. That’s why we laugh at Irwin the hypochondriac, convinced of his liver disease because of the characteristic lack of discomfort. True, his symptomlessness has a high likelihood given the disease, edging up the numerator, but it also has a massive marginal probability (since most people have no discomfort most of the time), blowing up the denominator and thus shrinking the posterior, our credence in Irwin’s self-diagnosis.
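For readers who like to see the bookkeeping spelled out, the rule as paraphrased above can be written as a couple of lines of Python. This is only a minimal sketch of the arithmetic; the function and variable names are my own, not anything standard.

```python
def marginal(prior, likelihood_if_true, likelihood_if_false):
    """Commonness of the data: how often it turns up whether the hypothesis is true or false."""
    return prior * likelihood_if_true + (1 - prior) * likelihood_if_false

def bayes_posterior(prior, likelihood, commonness):
    """Bayes's rule: posterior = prior * likelihood / commonness of the data."""
    return prior * likelihood / commonness
```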
How does this work with numbers? Let’s go back to the cancer example. The prevalence of the disease in the population, 1 percent, is how we set our priors: prob(Hypothesis) = .01. The sensitivity of the test is the likelihood of getting a positive result given that the patient has the disease: prob(Data | Hypothesis) = .9. The marginal probability of a positive test result across the board is the sum of the probabilities of a hit for the sick patients (90 percent of the 1 percent, or .009) and of a false alarm for the healthy ones (9 percent of the 99 percent, or .0891), or .0981, which begs to be rounded up to .1. Plug the three numbers into Bayes’s rule, and you get .01 times .9 divided by .1, or .09.
So where do the doctors (and, to be fair, most of us) go wrong? Why do we think the patient almost certainly has the disease, when she almost certainly doesn’t?
Kahneman and Tversky singled out a major ineptitude in our Bayesian reasoning: we neglect the base rate, which is usually the best estimate of the prior probability.5 In the medical diagnosis problem, our heads are turned by the positive test result (the likelihood) and we forget about how rare the disease is in the population (the prior).
The duo went further and suggested we don’t engage in Bayesian reasoning at all. Instead we judge the probability that an instance belongs to a category by how representative it is: how similar it is to the prototype or stereotype of that category, which we mentally represent as a fuzzy family with its crisscrossing resemblances (chapter 3). A cancer patient, typically, gets a positive diagnosis. How common the cancer is, and how common a positive diagnosis is, never enter our minds. (Horses, zebras, who cares?) Like the availability heuristic from the preceding chapter, the representativeness heuristic is a rule of thumb the brain deploys in lieu of doing the math.6
Tversky and Kahneman demonstrated base-rate neglect in the lab by telling people about a hit-and-run accident by a taxi late at night in a city with two cab companies: Green Taxi, which owns 85 percent of the cabs, and Blue Taxi, which owns 15 percent (those are the base rates, and hence the priors). An eyewitness identified the cab as Blue, and tests showed that he correctly identified colors at night 80 percent of the time (that is the likelihood of the data, namely his testimony given the cab’s actual color). What is the probability that the cab involved in the accident was Blue? The correct answer, according to Bayes’s rule, is .41. The median answer was .80, almost twice as high. Respondents took the likelihood too seriously, pretty much at face value, and downplayed the base rate.7
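Nothing more than Bayes’s rule is involved here either; a quick sketch of the calculation, using only the numbers in the vignette:

```python
prior_blue = 0.15            # base rate: 15 percent of the city's cabs are Blue
accuracy = 0.80              # the witness names the correct color 80 percent of the time

# How often a witness would say "Blue": correct IDs of Blue cabs plus mistaken IDs of Green cabs
says_blue = prior_blue * accuracy + (1 - prior_blue) * (1 - accuracy)   # 0.12 + 0.17 = 0.29

posterior_blue = prior_blue * accuracy / says_blue
print(round(posterior_blue, 2))   # 0.41, not the 0.80 that most respondents gave
```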
One of the symptoms of base-rate neglect in the world is hypochondria. Who among us hasn’t worried we have Alzheimer’s after a memory lapse, or an exotic cancer when we have an ache or pain? Another is medical scaremongering. A friend of mine suffered an interlude of panic when a doctor saw her preschool daughter twitch and suggested that the child had Tourette’s syndrome. Once she collected herself, she thought it through like a Bayesian, realized that twitches are common and Tourette’s rare, and calmed back down (while giving the doctor a piece of her mind about his statistical innumeracy).
Base-rate neglect is also a driver of thinking in stereotypes. Consider Penelope, a college student described by her friends as impractical and sensitive.8 She has traveled in Europe and speaks French and Italian fluently. Her career plans are uncertain, but she’s a talented calligrapher and wrote a sonnet for her boyfriend as a birthday present. Which do you think is Penelope’s major, psychology or art history? Art history, of course! Oh, really? Might it be a wee bit relevant that 13 percent of college students major in psychology, but only 0.08 percent major in art history, an imbalance of 150 to 1? No matter where she summers or what she gave her boyfriend, Penelope is unlikely, a priori, to be an art history major. But in our mind’s eye she is representative of an art history major, and the stereotype crowds out the base rates. Kahneman and Tversky confirmed this in experiments in which they asked participants to consider a sample of 70 lawyers and 30 engineers (or vice versa), provided them with a thumbnail sketch that matched a stereotype, such as a dull nerd, and asked them to put a probability on that person’s job. People were swayed by the stereotype; the base rates went in one ear and out the other.9 (This is also why people fall for the conjunction fallacy from chapter 1, in which Linda the social justice warrior is likelier to be a feminist bank teller than a bank teller. She’s representative of a feminist, and people forget about the relative base rates of feminist bank tellers and bank tellers.)
A blindness to base rates also leads to public demands for the impossible. Why can’t we predict who will attempt suicide? Why don’t we have an early-warning system for school shooters? Why can’t we profile terrorists or rampage shooters and detain them preventively? The answer comes out of Bayes’s rule: a less-than-perfect test for a rare trait will mainly turn out false positives. The heart of the problem is that only a tiny proportion of the population are thieves, suicides, terrorists, or rampage shooters (the base rate). Until the day that social scientists can predict misbehavior as accurately as astronomers predict eclipses, their best tests would mostly finger the innocent and harmless.
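A few lines of arithmetic show how lopsided the results become. The rates below are hypothetical, chosen only to illustrate the logic; they are not estimates for any real screening program.

```python
base_rate = 0.0001       # hypothetical: 1 in 10,000 people would actually carry out an attack
sensitivity = 0.99       # hypothetical: the screen flags 99 percent of the truly dangerous
false_alarm = 0.01       # hypothetical: it also flags 1 percent of the harmless

flagged = base_rate * sensitivity + (1 - base_rate) * false_alarm
dangerous_given_flagged = base_rate * sensitivity / flagged
print(round(dangerous_given_flagged, 3))   # about 0.01: roughly 99 percent of those flagged are harmless
```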
Mindfulness of base rates can be a gift of equanimity as we reflect on our lives. Now and again we long for some rare outcome—a job, a prize, admission to an exclusive school, winning the heart of a dreamboat. We ponder our eminent qualifications and may be crushed and resentful when we are not rewarded with our just deserts. But of course other people are in the running, too, and however superior we think we may be, there are more of them. The judges, falling short of omniscience, cannot be guaranteed to appreciate our virtues. Remembering the base rates—the sheer number of competitors—can take some of the sting out of a rejection. However deserving we think we may be, the base rate—one in five? one in ten? one in a hundred?—should ground our expectations, and we can calibrate our hopes to the degree to which our specialness could reasonably be expected to budge the probability upward.
Our neglect of base rates is a special case of our neglect of priors: the vital, albeit more nebulous, concept of how much credence we should give a hypothesis before we look at the evidence. Now, believing in something before you look at the evidence may seem like the epitome of irrationality. Isn’t that what we disdain as prejudice, bias, dogma, orthodoxy, preconceived notions? But prior credence is simply the fallible knowledge accumulated from all our experience in the past. Indeed, the posterior probability from one round of looking at evidence can supply the prior probability for the next round, a cycle called Bayesian updating. It’s simply the mindset of someone who wasn’t born yesterday. For fallible knowers in a chancy world, justified belief cannot be equated with the last fact you came across. As Francis Crick liked to say, “Any theory that can account for all the facts is wrong, because some of the facts are wrong.”10
This is why it is reasonable to be skeptical of claims for miracles, astrology, homeopathy, telepathy, and other paranormal phenomena, even when some eyewitness or laboratory study claims to show one. Why isn’t that dogmatic and pigheaded? The reasons were laid out by that hero of reason, David Hume. Hume and Bayes were contemporaries, and though neither read the other, word of each other’s ideas may have passed between them through a mutual colleague, and Hume’s famous argument against miracles is thoroughly Bayesian:11
Nothing is esteemed a miracle, if it ever happen in the common course of nature. It is no miracle that a man, seemingly in good health, should die on a sudden: because such a kind of death, though more unusual than any other, has yet been frequently observed to happen. But it is a miracle, that a dead man should come to life; because that has never been observed in any age or country.12
In other words, miracles such as resurrection must be given a low prior probability. Here is the zinger:
No testimony is sufficient to establish a miracle, unless the testimony be of such a kind, that its falsehood would be more miraculous, than the fact, which it endeavors to establish.13
In Bayesian terms, we are interested in the posterior probability that miracles exist, given the testimony. Let’s contrast it with the posterior probability that no miracles exist given the testimony. (In Bayesian reasoning, it’s often handy to look at the odds, that is, the ratio of the credence of a hypothesis to the credence of the alternative, because it spares us the tedium of calculating the marginal probability of the data in the denominator, which is the same for both posteriors and conveniently cancels out.) The “fact which it endeavors to establish” is the miracle, with its low prior, dragging down the posterior. The testimony “of such a kind” is the likelihood of the data given the miracle, and its “falsehood” is the likelihood of the data given no miracle: the possibility that the witness lied, misperceived, misremembered, embellished, or passed along a tall tale he heard from someone else. Given everything we know about human behavior, that’s far from miraculous! Which is to say, its likelihood is higher than the prior probability of a miracle. That moderately high likelihood boosts the posterior probability of no miracle, and lowers the overall odds of a miracle compared to no miracle. Another way of putting it is this: Which is more likely—that the laws of the universe as we understand them are false, or that some guy got something wrong?
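The argument is easiest to see in the odds form of Bayes’s rule, where the marginal probability cancels and the posterior odds are just the prior odds multiplied by the likelihood ratio. Here is a sketch with deliberately invented numbers standing in for “a miracle is wildly improbable” and “witnesses err fairly often”:

```python
# All numbers below are made up for illustration; nothing here comes from Hume or Bayes.
prior_miracle = 1e-9              # prior credence that a resurrection occurred
prior_no_miracle = 1 - prior_miracle

p_report_if_miracle = 0.99        # a witness to a real miracle would very likely report it
p_report_if_no_miracle = 0.001    # but lies, errors, and tall tales also produce such reports

posterior_odds = (prior_miracle / prior_no_miracle) * (p_report_if_miracle / p_report_if_no_miracle)
print(posterior_odds)             # roughly 1e-6: the testimony raises the odds a thousandfold, yet "no miracle" still wins by a mile
```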
A pithier version of the Bayesian argument against paranormal claims was stated by the astronomer and science popularizer Carl Sagan (1934–1996) in the slogan that serves as this chapter’s epigraph: “Extraordinary claims require extraordinary evidence.” An extraordinary claim has a low Bayesian prior. For its posterior credence to be higher than the posterior credence in its opposite, the likelihood of the data given that the hypothesis is true must be far higher than the likelihood of the data given that the hypothesis is false. The evidence, in other words, must be extraordinary.
A failure of Bayesian reasoning among scientists themselves is a contributor to the replicability crisis that we met in chapter 4. The issue hit the fan in 2011 when the eminent social psychologist Daryl Bem published the results of nine experiments in the prestigious Journal of Personality and Social Psychology which claimed to show that participants successfully predicted (at a rate above chance) random events before they took place, such as which of two curtains on a computer screen hid an erotic image before the computer had selected where to place it.14 Not surprisingly, the effects failed to replicate, but that was a foregone conclusion given the infinitesimal prior probability that a social psychologist had disproven the laws of physics by showing some undergraduates some porn. When I raised this point to a social psychologist colleague, he shot back, “Maybe Pinker doesn’t understand the laws of physics!” But actual physicists, like Sean Carroll in his book The Big Picture, have explained why the laws of physics really do rule out precognition and other forms of ESP.15
The Bem imbroglio raised an uncomfortable question. If a preposterous claim could get published in a prestigious journal by an eminent psychologist using state-of-the-art methods subjected to rigorous peer review, what does that say about our standards of prestige, eminence, rigor, and the state of the art? One answer, we saw, is the peril of post hoc probability: scientists had underestimated the mischief that could build up from data snooping and other questionable research practices. But another is a defiance of Bayesian reasoning.
Most findings in psychology, as it happens, do replicate. Like many psychology professors, every year I run demos of classic experiments in memory, perception, and judgment to students in my intro and lab courses and get the same results year after year. You haven’t heard of these replicable findings because they are unsurprising: people remember the items at the end of a list better than those in the middle, or they take longer to mentally rotate an upside-down letter than a sideways one. The notorious replication failures come from studies that attracted attention because their findings were so counterintuitive. Holding a warm mug makes you friendlier. (“Warm”—get it?) Seeing fast-food logos makes you impatient. Holding a pen between your teeth makes cartoons seem funnier, because it forces your lips into a little smile. People who are asked to lie in writing have positive feelings about hand soap; people who are asked to lie aloud have positive feelings about mouthwash.16 Any reader of popular science knows of other cute findings which turned out to be suitable for the satirical Journal of Irreproducible Results.
The reason these studies were sitting ducks for the replicability snipers is that they had low Bayesian priors. Not as low as ESP, to be sure, but it would be an extraordinary discovery if mood and behavior could be easily pushed around by trivial manipulations of the environment. After all, entire industries of persuasion and psychotherapy try to do exactly that at great cost with only modest success.17 It was the extraordinariness of the findings that earned them a place in the science sections of newspapers and snazzy ideas festivals, and it’s why on Bayesian grounds we should demand extraordinary evidence before believing them. Indeed, a bias toward oddball findings can turn science journalism into a high-volume error dispenser. Editors know they can goose up readership with cover headlines like these:
WAS DARWIN WRONG?
WAS EINSTEIN WRONG?
YOUNG UPSTART OVERTURNS SCIENTIFIC APPLECART
A SCIENTIFIC REVOLUTION IN X
EVERYTHING YOU KNOW ABOUT Y IS WRONG
The problem is that “surprising” is a synonym for “low prior probability,” assuming that our cumulative scientific understanding is not worthless. This means that even if the quality of evidence is constant, we should have a lower credence in claims that are surprising. But the problem is not just with journalists. The physician John Ioannidis scandalized his colleagues and anticipated the replicability crisis with his 2005 article “Why Most Published Research Findings Are False.” A big problem is that many of the phenomena that biomedical researchers hunt for are interesting precisely because they are a priori unlikely to be true, which means that even well-run studies will turn up a large share of false positives, while many true findings, including successful replication attempts and null results, are considered too boring to publish.
This does not, of course, mean that scientific research is a waste of time. Superstition and folk belief have an even worse track record than less-than-perfect science, and in the long run an understanding emerges from the rough-and-tumble of scientific disputation. As the physicist John Ziman noted in 1978, “The physics of undergraduate text-books is 90% true; the contents of the primary research journals of physics is 90% false.”18 It’s a reminder that Bayesian reasoning recommends against the common practice of using “textbook” as an insult and “scientific revolution” as a compliment.
A healthy respect for the boring would also improve the quality of political commentary. In chapter 1 we saw that the track records of many famous forecasters are risible. A big reason is that their careers depend on attracting attention with riveting predictions, which is to say, those with low priors—and hence, assuming they lack the gift of prophecy, low posteriors. Philip Tetlock has studied “superforecasters,” who really do have good track records at predicting economic and political outcomes. A common thread is that they are Bayesian: they start with a prior and update it from there. Asked to provide the probability of a terrorist attack within the next year, for example, they’ll first estimate a base rate by going to Wikipedia and counting the number of attacks in the region in the years preceding—not a practice you’re likely to see in the next op-ed you read on what is in store for the world.19
Base-rate neglect is not always a symptom of the representativeness heuristic. Sometimes it’s actively prosecuted. The “forbidden base rate” is the third of Tetlock’s secular taboos (chapter 2), together with the heretical counterfactual and the taboo tradeoff.20
The stage for forbidden base rates is set by a law of social science. Measure any socially significant variable: test scores, vocational interests, social trust, income, marriage rates, life habits, rates of different types of violence (street crime, gang crime, domestic violence, organized crime, terrorism). Now break down the results by the standard demographic dividers: age, sex, race, religion, ethnicity. The averages for the different subgroups are never the same, and sometimes the differences are large. Whether the differences arise from nature, culture, discrimination, history, or some combination is beside the point: the differences are there.
This is hardly surprising, but it has a bloodcurdling implication. Say you sought the most accurate possible prediction about the prospects of an individual—how successful the person would be in college or on the job, how good a credit risk, how likely to have committed a crime, or jump bail, or recidivate, or carry out a terrorist attack. If you were a good Bayesian, you’d start with the base rate for that person’s age, sex, class, race, ethnicity, and religion, and adjust by the person’s particulars. In other words, you’d engage in profiling. You would perpetrate prejudice not out of ignorance, hatred, supremacy, or any of the -isms or -phobias, but from an objective effort to make the most accurate prediction.
Most people are, of course, horrified at the thought. Tetlock had participants think about insurance executives who had to set premiums for different neighborhoods based on their history of fires. They had no problem with that. But when the participants learned that the neighborhoods also varied in their racial composition, they had second thoughts and condemned the executive for merely being a good actuary. And if they themselves had been playing his role and learned the terrible truth about the neighborhood statistics, they tried to morally cleanse themselves by volunteering for an antiracist cause.
Is this yet another example of human irrationality? Are racism, sexism, Islamophobia, anti-Semitism, and the other bigotries “rational”? Of course not! The reasons hark back to the definition of rationality from chapter 2: the use of knowledge to attain a goal. If actuarial prediction were our only goal, then perhaps we should use whatever scrap of information could give us the most accurate prior. But of course it’s not our only goal.
A higher goal is fairness. It’s wicked to treat an individual according to that person’s race, sex, or ethnicity—to judge them by the color of their skin or the composition of their chromosomes rather than the content of their character. None of us wants to be prejudged in this way, and by the logic of impartiality (chapter 2) we must extend that right to everyone else.
Moreover, only when a system is perceived to be fair—when people know they will be given a fair shake and not prejudged by features of their biology or history beyond their control—can it earn the trust of its citizens. Why play by the rules when the system is going to ding you because of your race, sex, or religion?
Yet another goal is to avoid self-fulfilling prophecies. If an ethnic group or a sex has been disadvantaged by oppression in the past, its members may be saddled with different average traits in the present. If those base rates are fed into predictive formulas that determine their fate going forward, they would lock in those disadvantages forever. The problem is becoming acute now that the formulas are buried in deep learning networks with their indecipherable hidden layers (chapter 3). A society might rationally want to halt this cycle of injustice even if it took a small hit in predictive accuracy at that moment.
Finally, policies are signals. Forbidding the use of ethnic, sexual, racial, or religious base rates is a public commitment to equality and fairness which reverberates beyond the algorithms permitted in a bureaucracy. It proclaims that prejudice for any reason is unthinkable, casting even greater opprobrium on prejudice rooted in enmity and ignorance.
Forbidding the use of base rates, then, has a solid foundation in rationality. But a theorem is a theorem, and the sacrifice of actuarial accuracy that we happily make in the treatment of individuals by public institutions may be untenable in other spheres. One of those spheres is insurance. Unless a company carefully estimates the overall risks for different groups, the payouts would exceed the premiums and the insurance would collapse. Liberty Mutual discriminates against teenage boys by factoring their higher base rate for car accidents into their premiums, because if they didn’t, adult women would be subsidizing their recklessness. Even here, though, insurance companies are legally prohibited from using certain criteria in calculating rates, particularly race and sometimes gender.
A second sphere in which we cannot rationally forbid base rates is the understanding of social phenomena. If the sex ratio in a professional field is not 50–50, does that prove its gatekeepers are trying to keep women out, or might there be a difference in the base rate of women trying to get in? If mortgage lenders turn down minority applicants at higher rates, are they racist, or might they, like the hypothetical executive in Tetlock’s study, be using base rates for defaulting from different neighborhoods that just happen to correlate with race? Social scientists who probe these questions are often rewarded for their trouble with accusations of racism and sexism. But forbidding social scientists and journalists to peek at base rates would cripple the effort to identify ongoing discrimination and distinguish it from historical legacies of economic, cultural, or legal differences between groups.
Race, sex, ethnicity, religion, and sexual orientation have become war zones in intellectual life, even as overt bigotry of all kinds has dwindled.21 A major reason, I think, is a failure to think clearly about base rates—to lay out when there are good reasons for forbidding them and when there are not.22 But that’s the problem with a taboo. As with the instruction “Don’t think about a polar bear,” discussing when to apply a taboo is itself taboo.
For all our taboos, neglects, and stereotypes, it’s a mistake to write off our kind as hopelessly un-Bayesian. (Recall that the San are Bayesians, requiring that spoor be definitive before inferring it was left by a rarer species.) Gigerenzer has argued that sometimes ordinary people are on solid mathematical ground when they appear to be flouting Bayes’s rule.23 Mathematicians themselves complain that social scientists often use statistical formulas mindlessly: they plug in numbers, turn a crank, and assume that the correct answer pops out. In reality, a statistical formula is only as good as the assumptions behind it. Laypeople can be sensitive to those assumptions, and sometimes when they appear to be blowing off Bayes’s rule, they may just be exercising the caution that a good mathematician would advise.
For starters, a prior probability is not the same thing as a base rate, even though base rates are often held up as the “correct” prior in the paper-and-pencil tests. The problem is: which base rate? Suppose I get a positive result from a prostate-specific antigen test and want to estimate my posterior probability of having prostate cancer. For the prior, should I use the base rate for prostate cancer in the population? Among white Americans? Ashkenazi Jews? Ashkenazi Jews over sixty-five? Ashkenazi Jews over sixty-five who exercise and have no family history? These rates can be very different. The more specific the reference class, of course, the better—but the more specific the reference class, the smaller the sample on which the estimate is based, and the noisier the estimate. The best reference class would consist of people exactly like me, namely me—a class of one that is perfectly accurate and perfectly useless. We have no choice but to use human judgment in trading off specificity against reliability when choosing an appropriate prior, rather than accepting the base rate for an entire population stipulated in the wording of a test.
Another problem with using a base rate as the prior is that base rates can change, and sometimes quickly. Forty years ago around a tenth of veterinary students were women; today it’s closer to nine tenths.24 In recent decades, anyone who was given the historical base rate and plugged it into Bayes’s rule would have been worse off than if they had neglected the base rate altogether. With many hypotheses that interest us, no record-keeping agency has even compiled base rates. (Do we know what proportion of veterinary students are Jewish? Left-handed? Transgender?) And of course a lack of data on base rates was our plight through most of history and prehistory, when our Bayesian intuitions were shaped.
Because there is no “correct” prior in a Bayesian problem, people’s departure from the base rate provided by an experimenter is not necessarily a fallacy. Take the taxicab problem, where the priors were the proportions of Blue and Green taxis in the city. The participants may well have thought that this simple baseline would be swamped by more specific differences, like the companies’ accident rates, the number of their cabs driving during the day and at night, and the neighborhoods they serve. If so, then in their ignorance of these crucial data they may have defaulted to indifference, 50 percent. Follow-up studies showed that participants do become better Bayesians when they are given base rates that are more relevant to being in an accident.25
Also, a base rate may be treated as a prior only when the examples at hand are randomly sampled from that population. If they have been cherry-picked because of an interesting trait—like belonging to a category with a high likelihood of flaunting those data—all bets are off. Take the demos that presented people with a stereotype, like Penelope the sonnet writer, or the nerd in the pool of lawyers and engineers, and asked them to guess their major or profession. Unless the respondents knew that Penelope had been selected from the pool of students by lottery, which would make it a pretty strange question, they could have suspected she was chosen because her traits provided telltale clues, which would make it a natural question. (Indeed, that question was turned into a classic game show, What’s My Line?, where a panel had to guess the occupation of a mystery guest—selected not at random, of course, but because the guest’s job was so distinctive, like bar bouncer, big-game hunter, Harlem Globetrotter, or Colonel Sanders of Kentucky Fried Chicken fame.) When people’s noses are rubbed in the randomness of the sampling (such as by seeing the description pulled out of a jar), their estimates are closer to the correct Bayesian posterior.26
Finally, people are sensitive to the difference between probability in the sense of credence in a single event and probability in the sense of frequency in the long run. Many Bayesian problems pose the vaguely mystical question of the probability of a single event—whether Irwin has kuru, or Penelope is an art history major, or the taxi in an accident was Blue. Faced with such problems, it is true that people don’t readily compute a subjective credence using the numbers they are given. But since even statisticians are divided on how much sense that makes, perhaps they can be forgiven. Gigerenzer, together with Cosmides and Tooby, argues that people don’t connect decimal fractions to single events, because that’s not the way the human mind encounters statistical information in the world. We experience events, not numbers between 0 and 1. We’re perfectly capable of Bayesian reasoning with these “natural frequencies,” and when a problem is reframed in those terms, our intuitions can be leveraged to solve it.
Let’s go back to the medical diagnosis problem from the beginning of the chapter and translate those metaphysical fractions into concrete frequencies. Forget the generic “a woman”; think about a sample of a thousand women. Out of every 1,000 women, 10 have breast cancer (that’s the prevalence, or base rate). Of these 10 women who have breast cancer, 9 will test positive (that’s the test’s sensitivity). Of the 990 women without breast cancer, about 89 will nevertheless test positive (that’s the false-positive rate). A woman tests positive. What is the chance that she actually has breast cancer? It’s not that hard: 98 of the women test positive in all, 9 of them have cancer; 9 divided by 98 is around 9 percent—there’s our answer. When the problem is framed in this way, 87 percent of doctors get it right (compared with about 15 percent for the original wording), as do a majority of ten-year-olds.27
How does this magic work? Gigerenzer notes that the concept of a conditional probability pulls us away from countable things in the world. Those decimal fractions—90 percent true positive, 9 percent false positive, 91 percent true negative, 10 percent false negative—don’t add up to 100 percent, so to reckon the proportion of true positives among all the positives (the challenge at hand), we would have to work through three multiplications. Natural frequencies, in contrast, allow you to focus on the positives and add them up: 9 true positives plus 89 false positives equals 98 positives in all, of which the 9 trues make up 9 percent. (What one should do with this knowledge, given the costs of acting or failing to act on it, will be the topic of the next two chapters.)
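The natural-frequency bookkeeping is simple enough to redo with whole numbers; a brief sketch, using the same counts as above:

```python
women = 1000
sick = 10                                        # 1 percent prevalence
true_positives = 9                               # 90 percent of the 10 sick women test positive
false_positives = round((women - sick) * 0.09)   # about 89 of the 990 healthy women also test positive

all_positives = true_positives + false_positives      # 98 women test positive in all
print(round(true_positives / all_positives, 2))       # 0.09: 9 out of 98
```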
Easier still, we can put our primate visual brains to use and turn the numbers into shapes. This can make Bayesian reasoning eye-poppingly intuitive even with textbook puzzles that are far from our everyday experience, like the classic taxicab problem. Visualize the city’s taxi fleet as an array of 100 squares, one per taxi (left-hand diagram on the next page). To depict the base rate of 15 percent Blue taxis, we color in 15 squares in the top left corner. To show the likelihoods of the four possible identifications by our eyewitness, who was 80 percent reliable (middle diagram), we lighten 3 of the Blue taxi squares (the 20 percent of the 15 Blue taxis that he would misidentify as “Green”), and darken 17 of the Green ones (the 20 percent of the 85 Green taxis he’d misidentify as “Blue”). We know that the witness said “Blue,” so we can throw away all the squares for the “Green” identifications, both true and false, leaving us with the right-hand diagram, which keeps only the “Blue” IDs. Now it’s a cinch to eyeball the shape and espy that the darker portion, the taxis that really are Blue, takes up a bit less than half of the overall area. If we want to be exact, we can count: 12 squares out of 29, or 41 percent. The intuitive key to both the natural frequencies and the visual shapes is that they allow you to zoom in on the data at hand (the positive test result; the “Blue” IDs), and sort them into the ones that are true and the ones that are false.
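The square counting can likewise be redone in a few lines, again with counts rather than decimal fractions; the names are mine, and the rounding mirrors the diagram.

```python
blue, green = 15, 85                          # the 100-taxi fleet
blue_ids_of_blue = round(blue * 0.80)         # 12 dark squares: Blue cabs correctly called "Blue"
blue_ids_of_green = round(green * 0.20)       # 17 darkened squares: Green cabs mistakenly called "Blue"

all_blue_ids = blue_ids_of_blue + blue_ids_of_green   # 29 squares survive after the "Green" IDs are discarded
print(round(blue_ids_of_blue / all_blue_ids, 2))      # 0.41
```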
By tapping preexisting intuitions and translating information into mind-friendly formats, it’s possible to hone people’s statistical reasoning. Hone we must. Risk literacy is essential for doctors, judges, policymakers, and others who hold our lives in their hands. And since we all live in a world in which God plays dice, fluency in Bayesian reasoning and other forms of statistical competence is a public good that should be a priority in education. The principles of cognitive psychology suggest that it’s better to work with the rationality people have and enhance it further than to write off the majority of our species as chronically crippled by fallacies and biases.28 The principles of democracy suggest that, too.