Chapter 4: How to Tell the Truth with Statistics

GENERATIONS OF UNDERGRADUATES have been taught that statistics are something to fear. At least that is the lesson that many have taken away from their statistics classes, though it is hardly what their instructors intended. Ever since it was published in 1954, Darrell Huff’s How to Lie with Statistics, although offered as an introduction to how to understand and use statistics accurately and effectively, has too often been remembered (and misused) for its title alone.1 Some people take the title to mean that the most important thing about statistics is that they can be used to deceive. Actually, the book is about the truthful and honest use of statistics, and Huff analyzes the dishonest uses to give us a better understanding of the honest ones.

One important way to use statistics honestly is as evidence. But we need to narrow the focus. In everyday talk, a statistic is a number, or a collection of numbers. It is a statistic that the population of Charlottesville, Virginia, where I live, was 49,181 in 2019; it is a statistic that Babe Ruth hit sixty home runs in 1927; and it is a statistic that so-and-so’s grade point average at such-and-such university was 3.87 on a 4-point scale. All of these are just numbers, or—as with the gross national product of the United States in the first quarter of 2020 being $17,442.93 billion—a large aggregation of smaller numbers.

Of course, these numbers can be used as evidence, depending on what we want to know. If we want to know whether Babe Ruth was a good hitter (yes) or whether Charlottesville has more people than Jersey City (no), these numbers are evidence for these conclusions. But in a more important sense, the relevance of statistics to evidence lies in statistics not as pure numbers but as the foundations for statistical inference. What can we learn from the numbers, and especially what can we learn from aggregate numbers about particular acts or events? Can we use population-based statistics—statistics about some group of somethings—as evidence about a particular member of that group, or what some particular member might have done?

Suppose you are interested in buying my car, and I am interested in selling it to you. You ask me if it is reliable, and I say that it is. What I have asserted—my testimony, a topic that will occupy Chapters 5–8—is evidence of the car’s reliability, but not very good evidence. I might not know very much about cars. And I undoubtedly have an incentive to exaggerate or flat-out lie. Still, even people who want to sell cars to others occasionally tell the truth, sometimes even when it is not in their interest to do so. As a result, my statement is some evidence, even if skimpy, of my car’s reliability. But it also turns out that my car is a Subaru. Consumer Reports tells you not only that most Subarus are reliable, but also that Subarus have a better reliability record than most other makes of cars. Is this evidence that the particular Subaru that I want to sell you is reliable? Needless to say, Consumer Reports has never even seen this particular car. The question is whether what Consumer Reports says about the class of Subarus—the population of Subarus—is evidence of the reliability of this particular Subaru. The question assumes, obviously, that not every Subaru is reliable, but also that this particular Subaru is a member of the class of Subarus, and that the Consumer Reports conclusion is an accurate conclusion about the same class of which this particular Subaru is a member.2 This question—whether the reliability of the class of Subarus is evidence of the reliability of this particular Subaru—is a question of statistical inference, or, as it is put by the people who study such things, a question of using population-level data as evidence for individual (or sample) characteristics.

The Subaru example illustrates the basic issue for us here—whether what we know about some group can be evidence of what we want to know about some individual member of that group. So, to take a standard example from Statistics 101, suppose we have an urn containing 100 wooden balls. We know that 90 of those balls are solid-colored, and 10 are striped. Someone reaches into the urn and picks out a ball, which I cannot see, but I must guess whether it is solid or striped. The question is whether what we know about the distribution of solid and striped balls in the urn is evidence that the single ball that has been picked is solid. There is, after all, a 90 percent chance that any ball picked at random from the urn will be solid. That being so, there appears also to be a 90 percent chance that any particular unknown ball is solid. And if that is so, then the distribution of balls in the urn, which we do know, is evidence for the characteristics of the single ball that has already been picked, which we do not.3
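The urn inference is easy to check with a simulation. Here is a minimal sketch in Python; the 90 / 10 split comes from the example, and everything else is invented for illustration:

```python
import random

# A toy version of the urn: 90 solid balls, 10 striped.
urn = ["solid"] * 90 + ["striped"] * 10

# Simulate many single draws, each time "hiding" the ball and asking
# how often a guess of "solid" would have been right.
trials = 100_000
solid_draws = sum(random.choice(urn) == "solid" for _ in range(trials))

print(f"Share of draws that were solid: {solid_draws / trials:.3f}")
# Converges on 0.900: the population distribution is evidence about
# any particular, unseen draw.
```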

Recall from Chapter 2 the way in which all evidence involves statistical inference of this sort. The defendant’s running out of the bank wearing a ski mask and carrying a bag is evidence, even if not conclusive, that the defendant has just robbed the bank. But that behavior is evidence of robbery only because of the fact, a fact based on other evidence, that the class of people who wear ski masks and run out of banks with bags in their hands consists predominantly of people who have robbed the bank. Or consider Lyme disease once again. What makes ring-shaped redness evidence of Lyme disease is that prior research has shown that Lyme disease generally produces ring-shaped redness, and that ring-shaped redness is rarely caused by anything else. What the physician knows about some larger category provides the evidence in the particular instance.

All relevant evidence is statistical (and probabilistic) in this sense.4 The focus of this chapter is statistical evidence of a more overt character. When does actual statistical data, especially when presented and understood in explicitly numerical terms, count as evidence of the existence of particular facts? And if the lesson from Chapter 2, as summarized in the previous paragraph, appears to be “always,” it turns out that things are not quite so simple.

Gatecrashers, Prison Yards, Blue Buses, and Other Stories

The problem of using group statistics to prove facts in particular cases has intrigued lawyers and philosophers for more than fifty years. The interest was initially prompted by a 1968 decision of the Supreme Court of California dealing with just this kind of explicitly statistical evidence.5 Malcolm Collins and his wife, Janet Collins, had been charged with robbery, Janet having allegedly assaulted a woman and stolen the victim’s purse, after which she allegedly fled the scene with Malcolm, who was waiting in his car nearby. The victim’s identification of Janet was uncertain, but the victim was confident that she had been robbed by a Caucasian woman with dark blond hair tied back in a ponytail. And although the victim did not see Malcolm, a witness who did not observe the robbery testified that he did see a Caucasian woman with a blond ponytail get into a yellow convertible driven by an African American man with a beard and a mustache shortly after the time of the alleged robbery and less than a block away from the scene of the alleged crime. Malcolm Collins, Janet’s husband, was African American, owned a yellow convertible, and frequently wore a beard and a mustache.

At the trial, the prosecuting attorney, sensing a weakness in the identification evidence, called a statistics instructor from a local college to testify. The prosecutor asked him to assume a bunch of statistics about the percentage of women who had blond ponytails, the percentage of marriages that were interracial, the percentage of cars that were yellow convertibles, the percentage of African American men who had beards, and the percentage of African American men who wore mustaches. And then the prosecutor asked the witness to apply the product rule to those numbers and estimate the probability that the defendants, who possessed those characteristics, were the same people as the ones with those characteristics who had robbed the victim. The product rule, as the statistician witness accurately described it, says that the likelihood of multiple independent events occurring together is the product of the likelihood of each of those events. If the question is about the combined likelihood of the nine of hearts being picked from a deck of cards and a fair coin tossed in the air coming up heads, the product rule says to multiply 1 / 52 times 1 / 2, producing a combined likelihood of 1 / 104.
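The card-and-coin arithmetic is easy to verify. A minimal sketch of the product rule for independent events, using exact fractions:

```python
from fractions import Fraction

# Product rule for independent events: P(A and B) = P(A) * P(B).
p_nine_of_hearts = Fraction(1, 52)  # one specific card from a fair 52-card deck
p_heads = Fraction(1, 2)            # a fair coin landing heads

print(p_nine_of_hearts * p_heads)   # 1/104
```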

The statistician-witness explained this to the jury, and the prosecutor then asked him to apply this to the statistics about the characteristics of the robber and her accomplice as provided by the prosecutor. That the witness did, producing a vanishingly small likelihood that someone other than the blond, pony-tailed defendant and the bearded, mustached, yellow-convertible-owning African American defendant who was her husband had committed the crime. The jury was convinced, and the couple was convicted.

The California Supreme Court easily and correctly reversed the conviction. In the first place, the prosecutor had no factual basis for the individual probabilities he had provided to the witness. And, second, the product rule works only when the factors are independent, as with picks from a deck of cards and tosses of a coin. But if the factors are not independent of each other, multiplying the two probabilities to get a combined probability is fallacious. For example, very few men weigh more than 300 pounds and very few men are sumo wrestlers, but the probability that some man weighs more than 300 pounds and is a sumo wrestler cannot be determined by multiplying the two probabilities; and that is because men who are sumo wrestlers are more likely to weigh more than 300 pounds than are men in general, and men who weigh more than 300 pounds are more likely to be sumo wrestlers than are men in general. Weighing over 300 pounds and being a sumo wrestler are thus correlated, and the two probabilities are not independent.
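Numbers make the fallacy vivid. In the sketch below, the counts are entirely hypothetical, chosen only to show how correlation breaks the product rule:

```python
# Hypothetical, illustrative counts; not real data about sumo wrestlers.
n = 1_000_000        # men in an imagined population
heavy = 1_000        # men weighing more than 300 pounds
sumo = 50            # sumo wrestlers
heavy_and_sumo = 45  # nearly all the sumo wrestlers are heavy

p_heavy, p_sumo = heavy / n, sumo / n

# What the product rule would (wrongly) predict if the traits were independent:
print(f"Naive product: {p_heavy * p_sumo:.1e}")    # 5.0e-08
# The actual joint frequency in this population:
print(f"Actual joint:  {heavy_and_sumo / n:.1e}")  # 4.5e-05, some 900 times larger
```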

And that, in addition to the lack of any foundation (evidence) for the individual probabilities, was the problem here, because there was no indication that the attributes whose probabilities the statistician multiplied (the attributes of Janet, of Malcolm, and of Janet and Malcolm together) were independent of one another. To give one example, the vast majority of men who have beards also have mustaches (Abraham Lincoln and Amish men being notable exceptions), and treating the two as independent was one of the multiple blunders that led the California Supreme Court to reverse the conviction.

Although the result in the Collins case was plainly correct, the case spawned a raft of academic hypothetical cases designed to test whether the use of statistics alone, if actually used properly, can count as evidence sufficient to justify a legal verdict. One of these hypothetical cases, adapted from a real Massachusetts decision in 1945, has come to be known as the Blue Bus problem.6 Suppose a car is forced off the road by a bus, but it is a dark and rainy night and all the victim can see of the offending vehicle is that it was a blue bus. It turns out that of all the blue buses in town, 80 percent are owned and operated by the Metropolitan Transit Company, and 20 percent by the Public Service Company. Assuming that the bus was driven negligently, and assuming that the negligence caused injury to the driver of the car, can that driver recover against the Metropolitan Transit Company in a civil suit, where the burden of proof is only a preponderance of the evidence? There appears to be an 80 percent chance, after all, that it was the Metropolitan’s bus that caused the accident. In a civil suit, with the burden of proof being something like 51 percent, it would seem that the Metropolitan Transit Company ought to be liable. But most people resist that outcome, insisting, as did the Massachusetts Supreme Judicial Court in the real case, that without some “direct” evidence of Metropolitan Transit’s involvement there could be no liability. Direct evidence—presumably something like the victim testifying that she saw the words “Metropolitan Transit” written on the side of the bus—would be necessary. Mere statistics pointing to the same conclusion, whether those statistics be numerically quantified or not, cannot suffice. Or at least that was the conclusion of the real court in the real case, and that is the intuition that most commentators on the real case and on the fictional blue bus case have had as well.7

A slew of other hypothetical examples has tried to make the same point. The philosopher Jonathan Cohen, not long after Collins surfaced in the academic literature, offered what he called the Paradox of the Gatecrasher.8 Suppose that there is an event—a rodeo, in Cohen’s example—for which admission is charged. One thousand spectators are counted in the seats, but there is evidence that only 499 people paid admission. Therefore, 501 people entered fraudulently. And then suppose the rodeo organizer sues—again civilly—one of the 1,000 spectators, alleging that that spectator entered fraudulently. Even with no other evidence, there appears to be a 501 / 1000 probability that this randomly selected spectator entered fraudulently, and thus it would seem, by a (bare) preponderance of the evidence, that this person should be held liable. Cohen, like many others, finds this outcome objectionable, and seeks to explain why what seems statistically impeccable is nevertheless unacceptable.9

Or consider an example offered by Charles Nesson, which in simplified and slightly modified form goes as follows:10 Twenty-five prisoners are in the prison yard. The prisoners crowd around a guard and kill him, but one has run away and hidden before the killing takes place. One of the twenty-five is prosecuted for murder. The prosecutor can prove the events just described but cannot prove that the defendant was not the one who broke away from the group and was thus innocent. The prosecutor can consequently only prove that there is a 24 / 25 chance that the defendant was a murderer. Is this sufficient for a conviction? And, if not, then why not?

These examples have been well massaged in the legal and philosophical literatures for decades.11 It turns out, however, that some of these examples present an issue extraneous to the statistical evidence problem—although not, as we shall see in the following section, to broader questions about evidence. Consider again Cohen’s Paradox of the Gatecrasher and its assumption that a randomly selected spectator can be sued, civilly, for having committed one unspecified instance of the 501 fraudulent entries. But that assumption leads Cohen—and us—astray. There were indeed 501 fraudulent entries, but each is a different act, taking place at a different time, even if distinguishable only by seconds, and taking place at a different location, even if different only by millimeters. These tiny differences among the 501 may seem trivial, but they are not trivial to the law, which operates on the assumption that liability requires specification of the particular act for which someone is to be held liable. “You did one of these, even if we don’t know which one” is unacceptable in law. And if our randomly selected entrant is sued for having committed a precisely specified fraudulent entry, the probability is then no longer 501 / 1000, but 1 / 1000, a different matter entirely. Once the defendant is sued for having committed a particular specified act, the statistics no longer justify liability, and the alleged paradox evaporates.

So too, even more obviously, with the Prison Yard hypothetical. Especially in criminal prosecutions, we do not prosecute people for having committed one of a multiple number of unspecified acts. Suppose that location A and location B are two hundred miles apart. A radar device identifies a car as having traveled from A to B in a time that was possible only by either driving in excess of the speed limit or ignoring at least several of twenty stop signs. Can the driver be prosecuted for the crime of having either exceeded the speed limit or failed to stop at the stop signs, but without specification of which of these?12 And if it seems wrong to prosecute for an unspecified something in this case, then so too in the Prison Yard. A randomly selected prisoner in the prison yard would not be prosecuted for having done just something connected with the death of the guard. It would be necessary to allege a particular act, or a particular role in the guard’s death, and without that there could be no prosecution. And if a particular act were to be specified, the likelihood that a particular prisoner committed it, with no other evidence, would then be 1 / 25, hardly enough for a conviction.

This issue of specification does not arise in the Blue Bus case, making it a cleaner example for examining the use of explicitly statistical evidence. Here there is a precisely specified basis for liability, and the only question is whether the company that operates 80 percent of the blue buses is liable for the injuries resulting from that specified act. Now, even more clearly, we see the divergence between the common intuitions and the result indicated by the statistics.13 The statistics provide evidence—a preponderance of it, if there is no evidence pointing the other way on the question of the bus’s ownership.14 But common intuition wants something allegedly more “individualized,” as it is often put.15 Without some evidence pointing specifically to ownership of this bus, so the argument goes, there can be no liability.

Here it becomes crucial to distinguish whether something is evidence in the first place from whether that evidence is strong enough to justify a legal verdict. Law enforcement authorities know, for example, that by far the largest percentage of married women killed in their own homes have been murdered by their husbands.16 Any good police officer, even lacking individualized evidence that this particular husband killed this particular wife, will accordingly investigate the husband carefully, even at the expense of postponing the investigation of other possibilities. This might turn out to be a mistake, especially if it comes at the expense of not investigating something less probable that turns out to be correct. Still, a detective who allocates scarce time and limited investigative resources to investigating the husband hypothesis and not, say, the stranger hypothesis or the burglar hypothesis is, based on the probabilities, pursuing a wise strategy. Horses and not zebras. Needless to say, we do not imprison husbands solely on the evidence that their wives have been murdered. But being the husband justifies on probabilistic grounds the targeted investigation, even if the lone fact of being the husband is hardly sufficient for conviction. Being the husband is still evidence, based on the probabilities, of the husband’s culpability, despite its not being sufficient evidence to convict or even arrest. But as I have repeatedly stressed, and as the law recognizes, something being evidence is different from that something being enough evidence for some type of consequence. Evidence sufficient to justify investigation is usually insufficient by itself to justify conviction or arrest, even as evidence insufficient for arrest or conviction can still justify pre-arrest investigation.

Using purely probabilistic (or statistical) evidence to justify an investigation is common and widely accepted—think of Captain Renault in Casablanca ordering his officers to “round up the usual suspects.” But if using statistical evidence as the basis for an investigation is acceptable, the common intuitive reaction that there must be individualized evidence for anything beyond investigation, whether in the legal system or more broadly, becomes curious.17 Some of the tension between people’s common positive reaction to probabilistic investigation, on the one hand, and their negative views of probabilistic sanctions, on the other, might be the product of the well-documented difficulty people have in dealing with probabilities generally.18 And academics writing about the Blue Bus and related problems are not necessarily immune to this difficulty. But the strong preference for individualized evidence seems to be even more a function of the widespread and systematic underestimation of the probabilistic nature of allegedly individualized evidence, along with an equally widespread and systematic overvaluation of many forms of allegedly individualized evidence. Although there is no fundamental difference between population-based evidence and other forms of evidence, there is a common resistance to using population-based—or actuarial—evidence. But the resistance is mistaken. Someone who claimed to have seen the word “Metropolitan” on the blue bus, for example, would have offered seemingly individualized evidence, but that individualized evidence might still have been based on an observation that occurred on a dark and rainy night from a hundred yards away by someone with poor vision and a financial interest in the outcome. This would count as individualized evidence for the proponents of requiring individualized evidence, but it would be very weak evidence, no matter how individualized it seemed.

Moreover, to repeat a recurrent theme in this book, the allegedly individualized evidence is not as individualized as is often supposed. The eyewitness who reports having seen a blue bus is basing that report on the nonindividualized fact that what the witness has previously perceived as blue has usually turned out actually to be blue. And so too with buses. But a less philosophically obscure example of the same point would come from a witness who reports that someone else was, say, drunk. Reports of drunkenness are usually based on the observer’s belief that people who are slurring their words, talking too loudly, and losing their balance are likely drunk. Saying that such-and-such a person was drunk is, in effect, saying that this person exhibited a category of behaviors that probabilistically tend to indicate drunkenness. And the identification of someone as drunk is itself a conclusion based on probabilities. Once we see that characteristics that superficially individualize—“drunk,” or “bus,” or “blue”—are also probabilistic generalizations, the idea of truly individualized evidence is increasingly elusive.

Appreciating the probabilistic dimension of all evidence makes the common aversion to explicitly statistical evidence even more tenuous. Proponents of evidence-based medicine argue that individual treatments should be based on statistical probabilities, and doing so necessarily takes into account the individual patient’s group characteristics. Indeed, the entire field of epidemiology is based on statistics, and we rarely see “blue bus” style resistance to basing individual treatments on epidemiological data and conclusions.19 Nor do we resist the mechanic’s diagnosis that the pinging sound coming from the engine is likely the result of using fuel that is too low in octane, even though that diagnosis comes from the mechanic’s multiple prior experiences or published data and possibly not even from looking under the hood.

It turns out, therefore, that the kind of statistical evidence that produces skepticism in a trial in court is widely accepted in other contexts. This suggests that the skepticism is rooted not in intuitions about statistics or about evidence, but in intuitions about what the legal system in particular should do, and when and how it should do it. Even if the demand for individualized evidence in the legal system were sound, and I doubt that it is, that demand emerges from views—or intuitions—about the legal system and not about the idea of evidence itself. It is not about the value of using group-level characteristics—statistics—to justify inferences about individual events.20

Questions about the liability of the Metropolitan Transit Company are thus similar to questions about whether the reliability of Subarus is evidence of the reliability of this Subaru, or whether the effectiveness of the Pfizer Covid-19 vaccine on thirty thousand experimental subjects is evidence of its likely effectiveness on one particular patient. And this similarity lays bare that the intuitive resistance to using statistics in legal contexts is not based on anything about statistics or evidence—it is based on views about what the legal system should do to people. The intuitions seem to be intuitions about the criminal law that bleed over, not necessarily correctly, into intuitions about law generally, including civil litigation.

Even if the widespread aversion to the use of statistical evidence in law is law-specific, that aversion may still be sound in legal contexts. Perhaps the law’s aversion, and the common support for law’s aversion, is based on the idea of excluding statistical evidence as a way of forcing those who would otherwise rely on it to come up with something better.21 But although creating the incentive to produce the best evidence available is an admirable goal, it is in some tension with the goal of countering the common tendency to ignore or underestimate base rates in reaching conclusions from evidence. Let us return to Subarus. If, in assessing a particular car’s reliability, we cannot rely on its Subaru-ness (the characteristics of the category of which this Subaru is a member), we are likely to treat the car’s individual characteristics as more compelling evidence than they actually are, and the characteristics that the car shares with other Subarus or with other cars as less important evidence than they actually are—which is precisely the problem that the research about ignoring base rates makes clear.22 That this car makes a squeaking noise when it goes over bumps is some evidence of its unreliability, but not nearly as strong as this being a Subaru is evidence of its reliability. And if we cannot use its Subaru-ness as evidence, then we will treat the squeaking as stronger evidence than it actually is. If information about an individual Subaru—or about an individual blue bus, an individual rodeo spectator, an individual prisoner in the prison yard—obscures or crowds out what we can learn from the categories of which those individuals are members, the resulting decisions are more likely to be mistaken. Or so the base-rate literature warns us.

Moreover, “no decision” is rarely an option in court, in policy, or in individual decision making. If statistical evidence is unavailable, there will still be an outcome, and we find ourselves back to Blackstone and the consequences of different types of error. If statistics may not be used to support a judgment against the Metropolitan Transit Company, then Metropolitan’s non-liability means non-recovery for the victim of the bus driver’s, and thus the company’s, negligence. And the question again is whether an error of mistaken liability is worse than an error of mistaken non-liability, the latter entailing, in the civil context, non-recovery for the victim of someone else’s negligent actions. Similarly, if no prisoner can be convicted in the Prison Yard example, then twenty-four felonious acts go unpunished. And even if, in the criminal context, we properly follow Blackstone in deeming false acquittals less grave than false convictions, we cannot eliminate false convictions entirely without accepting at least some false acquittals. We are left to wonder whether those who object to the use of statistical evidence fully comprehend that the non-use of statistical evidence will produce mistaken acquittals (which might not be so bad) and mistaken non-compensation of plaintiffs injured physically or financially through no fault of their own (which might be much worse).

One final example should make the tenor of the foregoing even clearer. Let us add some numbers to the Lyme disease example, and imagine that the physician, employing the best techniques of evidence-based medicine, determines from the patient’s indications that there is a 96 percent chance that the patient has Lyme disease. And assume that the approved treatment for Lyme disease is a dose of antibiotics. These antibiotics will kill the microbes that produce Lyme disease, but they will also kill some harmless microbes residing in the human body. As a result, there is a 4 percent chance that the antibiotics will kill only harmless microbes. Under these conditions, should the antibiotics be administered?

Hardly anyone would be troubled by administering the antibiotics in these conditions. Yet statistically this is the Prison Yard case, except that micro-organisms take the place of the prisoners. Yes, we should care more about innocent prisoners than we do about innocent microbes, but the point of the example is only to illustrate that resistance to prosecution in the Prison Yard example cannot be based on any mistakes in the statistics, and cannot be based on any defect in the evidence. If such resistance exists, it must be based on the fact that people, understandably, worry more about innocent defendants than innocent microbes. Although that worry may well be justified, identifying and isolating it brings us back to the major point. As a matter of statistical inference, taking the group-level data as evidence of the likelihood in an individual case is inferentially impeccable. What we do with that inference is important, but it is just as important not to allow resistance to the consequences of the inference to produce an unwarranted resistance to the inference itself.

Evidence of What?

I want to return to the issue glossed over in the previous section—the extent to which the legal system actually should require evidence of a precisely specified wrong before it is willing to impose liability. For this purpose, recent issues and trials regarding sexual assault provide an important illustration of the issue, and of the problem.

These recent issues surround a common phenomenon of modern life—the frequency with which powerful men are accused by multiple women of having engaged in sexual misconduct, whether rape, some other form of sexual assault, or some other variety of unwanted sexual aggression. Bill Clinton, Bill Cosby, Donald Trump, and Harvey Weinstein are only the most prominent names among those who have been multiply accused in this way, and in each case the accused men have denied each and every allegation. The question that emerges is a question of evidence—what (if anything) did they do, to whom (if anyone) did they do it, and what (if anything) should society and the legal system do about it?

The allegations against Clinton, Cosby, Trump, Weinstein, and countless other famous and not-so-famous men turn out to be not only of great moral, policy, and political import, but also of statistical and general evidentiary interest. And although posing the issue in terms of what they did and who they did it to seems straightforward, these allegations and how they are considered present an important twist on what is really the first question we ask when we are thinking about evidence: What is it (the evidence) evidence of?

It will help if we imagine a hypothetical scenario similar to most of the non-hypothetical scenarios that have become front-page news. Suppose that famous politician Henry has been accused of sexual assault by four women. The accusations have come at different times and at different places, and there is no reason to believe that any of the accusers knew each other or knew of the others’ accusations. In other words, and of statistical importance, each of the accusations is independent of the others.23 But in each case Henry vigorously denies the accusation.

Now suppose that each accusation results in a criminal prosecution. A single accuser testifies credibly, but Henry testifies, at least somewhat credibly, that the events never happened or that any sexual activity was entirely consensual. And there is neither physical evidence nor witnesses other than Henry and the accuser. On these facts, it seems plausible to conclude that the prosecution has established an 80 percent chance that Henry has done what he is charged with having done. But this being a criminal case, the 80 percent chance is insufficient to establish guilt beyond a reasonable doubt. Although, as discussed in Chapter 3, there are debates about whether the beyond a reasonable doubt standard can be quantified, a common view is that beyond a reasonable doubt means at least a 90 percent probability of guilt. Thus, if each of the four accusations goes to trial, and if each of the accusations is proved to an 80 percent probability, Henry will be acquitted in each of four separate trials. And properly so.

But now ask a different question. What if we ask not whether, for each individual case, Henry is guilty, but instead whether Henry has committed at least one sexual assault? Now the probabilities look very different. To be more precise, if there are four accusations, and each of those accusations is .80 likely to be true, then the chance that Henry has committed at least one sexual assault is 1 − ((1 − .80) × (1 − .80) × (1 − .80) × (1 − .80)) = .9984. Although there is only an 80 percent likelihood that Henry has committed a particular specified sexual assault, there is a greater than 99 percent chance that Henry has committed at least one sexual assault.
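The computation generalizes to any number of independent accusations, each with its own probability of truth. A minimal sketch (which also covers the four dissimilar 40 percent accusations discussed below):

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent events occurs,
    each with probability p."""
    return 1 - (1 - p) ** n

# Henry: four independent accusations, each .80 likely to be true.
print(f"{prob_at_least_one(0.80, 4):.4f}")  # 0.9984

# Four dissimilar accusations, each .40 likely (see below).
print(f"{prob_at_least_one(0.40, 4):.4f}")  # 0.8704
```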

What a legal system should do in such cases is not strictly a question of evidence. Rather, it is a question whether the law, as traditionally understood and designed, is right to refuse to aggregate in the manner just described, or whether, instead, the legal system ought to allow people to be convicted if, as with Henry, it is established beyond a reasonable doubt that they have committed a punishable crime, but not established beyond a reasonable doubt that they have committed a particular specified punishable crime.24 As here, the evidence often points strongly to the defendant’s guilt of something, and indeed something of a particular type, but less strongly to guilt of one particular specified something.25

A large part of the legal system’s traditional reluctance to punish absent this kind of specification of precisely what the defendant is being punished for seems based on a worry that, taken to the extreme, this approach would permit prosecuting most of us on the theory that we have at some point in our lives committed some punishable crime. Should we permit people to be prosecuted for having committed at least one of some number of quite different crimes? If there is a 40 percent chance that someone has committed an act of shoplifting, a 40 percent chance that the same person has engaged in an act of reckless driving, a 40 percent chance that that person has used an illegal controlled substance, and a 40 percent chance that this person has purchased alcohol for a minor, then there is roughly an 87 percent chance that they have committed some unspecified criminal offense, but few would approve of criminal conviction of some unspecified crime under such circumstances.26

Although the example just given is properly resisted, the resistance may decrease, as with Henry, when the unspecified acts are similar. And the resistance may decrease even further when the context is something other than criminal punishment. If there is a 20 percent chance that someone has committed each of three unspecified acts involving dishonesty, should that person be hired (or retained) as chief of security for a casino? Few managers would hesitate to refuse to hire such a person, or to dismiss one already hired. And if that example produces less resistance, the conclusion, again, is that what may properly be thought impermissible when we are considering criminal conviction and likely imprisonment does not look so impermissible in the large number of other very real circumstances presenting the same statistical conclusions reached by the same decision-theoretic methods. With the question of specification as well as with the use of statistics generally, therefore, it is important not to let intuitions and even reasoned conclusions about what the criminal law should do be generalized into conclusions about what should be done or not done regardless of context or consequences.

A Note on Profiling

Most people reading the preceding pages in this chapter will have thought of the issue of profiling.27 Decades ago, before racial profiling became publicly salient (although not before it existed), profiling did not have the bad odor that it now exudes. In fact, a 1990s television series called The Profiler offered a favorable portrayal of an FBI profiler whose job was to accumulate whatever evidence was available of some crime and then, on the basis of this evidence, construct a profile that narrowed the range of suspects—sometimes to one—so that that small group of prime suspects could be subjected to close scrutiny.

It is unlikely that such a television show would be offered now, twenty-plus years later. And that is because racial profiling—using race as one of the attributes, or even the only attribute, that triggers an investigation—is justifiably widely condemned and widely prohibited within law enforcement, even as unofficially it pervasively persists.

It might seem that there is a disconnect between the condemnation of racial (or ethnic, or national origin, or religious) profiling and the defense of category-based statistical evidence that has emerged from the preceding pages. But the disconnect is illusory. Nothing about the inevitability and desirability of using population-based statistics to reach conclusions about individual acts and individual actors suggests that every statistical inference is statistically valid, or that every valid statistical inference is normatively desirable.

Part of the condemnation of racial and related profiling is based on the argument, often sound, that some inferences from some classes are simply statistically and empirically invalid. There is, for example, no empirical basis at all for the belief that the class of gay men has, as a class, or on average as a group, less physical courage than the class of heterosexual men, although that generalization was long believed, and persists now. Nor is there any empirical basis, John Henry Wigmore’s prejudices (discussed in Chapter 8) notwithstanding, for the belief that women are less honest or accurate than men when they testify in court. It should be obvious, therefore, that nothing I say here about the use of group-level characteristics, whether of people or Lyme-disease-causing micro-organisms, is applicable to descriptions of group-level characteristics that are statistically false.

Even when such descriptions are statistically sound—men over seventy years old have worse hearing and less reliable memory than younger men; women have less upper-body strength than men; African Americans develop high blood pressure more frequently and earlier than others—it does not follow that it is justifiable to use those descriptions as evidence. But here we should distinguish epistemic arguments from normative ones.

Even sound empirical generalizations may be so prone to misuse that it is epistemically better to preclude their use. Suppose it is widely believed that French people are better cooks than Russians, and suppose that this generalization has an empirically sound basis as a generalization. But if the operationalization of this belief is that all French people are good cooks and all Russians are bad, a restaurant that refuses to examine the culinary abilities of individual chefs in favor of never hiring Russians and always hiring French ones may wind up making more mistakes than if it tested individual chefs, assuming that the individual testing was highly reliable, even if not perfect. Sound generalizations can be misused or overused in various ways, but the basic point is that doing so is an epistemic failure. And it is a failure whose consequences can be lessened by prohibiting group-level generalizations, and instead considering only individual characteristics, in circumstances where doing so would produce better aggregate outcomes.

Moreover, a host of moral considerations might in some contexts argue against using even some empirically sound and epistemically useful aggregate characteristics. For example, some reliable statistical indicators are reliable only as a result of previous immoral and often illegal discrimination. Taking the aggregate mathematical ability of women as some (inconclusive) evidence for the mathematical ability of a particular woman might be statistically justified today, but the statistical justification is itself a product of generations (at least) of steering women into certain professions and disciplines (librarian, secretary, nurse) and away from others (scientist, surgeon, mathematician), with the product of that steering being the current differentials. Moreover, it is hardly the case that every statistically sound empirical generalization is appropriately used for every conclusion. The generalization that men as a class have more upper-body strength than women as a class is almost certainly true, but that would hardly justify preferring men over women for the vast range of jobs in which upper-body strength is either irrelevant or of such minor importance that individuals with less strength can be accommodated. Some statistically justifiable generalizations, therefore, are best avoided—not because they are epistemically flawed as evidence, but because they probably reflect or reinforce past injustices.

Thus, even if someone’s race, religion, gender, ethnicity, age, sexual orientation, or nationality is in fact evidence for some conclusion, it does not follow that that evidence ought to be used for other conclusions, or even for that conclusion. We don’t worry about injustices to blue buses or disease-causing microbes or even rodeo spectators, but there are many reasons to believe that statistical inferences that are usable for such categories may wisely be precluded from others, not because those inferences are statistically invalid, but despite the fact that they are not.

Sampling as Evidence

Article I of the Constitution of the United States provides that there shall be an “actual Enumeration” of the “Persons” in each state for the purpose, originally and still principally, of determining how many representatives in the House of Representatives will be allocated to each state. The actual clause in which the enumeration requirement appears is multiply offensive, containing not only the notorious “three-fifths” clause counting enslaved persons as only three-fifths of a person, but also excluding “Indians not taxed”—which at the time denoted more than 90 percent of the indigenous population of the United States.

Even the less offensive parts of the enumeration clause have been controversial. The US Supreme Court recently declined to decide, for now, the question whether undocumented people were to be counted in the census.28 But more relevant to questions about evidence are the statutory and constitutional questions, the subject of several Supreme Court decisions, about whether the requirement of an “actual enumeration” allows statistical sampling.29 If we put aside the statutory and constitutional questions for a moment and consider the issue solely as a question of evidence, we can say that the question we want answered is the question of just how many people are living in a particular state. One form of evidence would be the familiar head counting, in which a census taker goes door to door and actually counts the people living at some residence. And these household-by-household head counts are then aggregated to produce the number of people living in a town, in a state, and in a congressional district. The total number is thus the number of observed residents, which has traditionally been considered good evidence of the number of people actually living in the state. In the modern version, the number of people who answer the census questions online or by mail is considered some evidence, often good evidence, of how many people live in a specified area.

The heads that are counted, however, are not, according to the Constitution, what we really want to know. What we want to know is how many people are living in the state. Traditionally, counting the people available for counting is thought to be reliable evidence of what we want to know. But the result of the counting is not the fact of the matter. The fact of the matter is how many people live somewhere, and the counting is evidence of that fact.

Counting, however, is not perfect evidence, even assuming that there is something we can call perfect evidence. Some people who really do live in the town or state or congressional district refuse to talk to the census taker, and, these days, refuse to respond online and fail to respond by mail. Because such people actually do live in the state, they are part of the ultimate fact. But they do not get counted. The evidence gets it wrong. And in response to this phenomenon, it has been argued that statistical sampling by the Census Bureau would be more reliable than actual counting of actual heads. For now, it appears that, as a matter of statutory interpretation, such sampling is legally impermissible for purposes of allocating congressional seats to the states, but sampling is permissible, used, and also controversial for other purposes.

As is unfortunately the case with most policy issues these days, support for using statistical sampling to count the population has divided along party lines. The political parties don’t differ in their views about the abstract question. They differ regarding who will benefit from one or another answer to a question that, in theory, has no necessary political implications. Republicans tend to oppose sampling in the belief that it will increase the counting of people—disproportionately of lower income and living in urban areas—who are likely to vote Democratic. Democrats oppose exclusive reliance on face-to-face counting for the opposite reason.

The issue, however, is larger than the census. The recent inaccuracies—some would call them spectacular failures—of political polling have not been good for the polling profession generally, but it is worthwhile pointing out that polling, surveying, and all of the other similar techniques are based on the idea of sampling as evidence. If what we are interested in is how many of the country’s approximately 150 million voters would prefer Donald Trump to all other potential candidates in the 2024 presidential election, then the best evidence, hardly conclusive, would come from asking all 150 million who they intend to vote for. That would be financially and logistically impossible, so polls sample small percentages of those whose preferences concern us, and then use those results as evidence of the preferences of the larger group. Like all evidence, the conclusion from such polling would be inductive, risky, and, even with the best of sampling techniques, potentially mistaken. But we nevertheless ought not lose sight of the fact that sampling is one of the most prevalent forms of evidence there is.

Sampling, though not by polling, is also a form of evidence widely practiced by manufacturers concerned with quality control. Consider the question of tires. The manufacturers of tires would like to sell tires that will not fail—typically spectacularly, as anyone who has experienced a blowout knows—prior to being driven, say, fifty thousand normal miles. But tire manufacturers cannot test each tire they manufacture, even assuming, counterfactually, that a tire could be tested without impairing its usability. Accordingly, the manufacturers test—and in the process destroy—a comparatively small number of tires and use the results of that testing as evidence of the durability of the full population of tires similar in manufacture to the tested and destroyed tires.
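Tire testing and polling rest on the same sample-to-population inference. Here is a minimal sketch, with invented numbers, of how a sampled proportion plus a rough margin of error stands in for the population value:

```python
import math

def proportion_estimate(successes: int, sample_size: int, z: float = 1.96):
    """Point estimate and rough 95 percent (normal-approximation)
    interval for a population proportion, from a simple random sample."""
    p_hat = successes / sample_size
    margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, margin

# Hypothetical quality-control run: 3 of 500 destructively tested tires
# failed before fifty thousand simulated miles.
p, m = proportion_estimate(3, 500)
print(f"Estimated failure rate: {p:.2%} +/- {m:.2%}")

# The same inference underlies polling: say 520 of 1,000 sampled voters
# favor a candidate.
p, m = proportion_estimate(520, 1_000)
print(f"Estimated support: {p:.1%} +/- {m:.1%}")
```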

Evidence of the durability of tires is, therefore, just one of the many ways statistical evidence is a central feature of the role of evidence in business, in public policy, and in our daily lives. And statistics derived, accumulated, used, and presented in numerical terms are only the quantitative subset of the larger set of probabilistic, and thus statistical, inferences that are central to the very idea of evidence. Sometimes we are interested in population-level data as evidence with respect to particular members or samples of that population. And at other times, as with census and quality control sampling, we are interested in just the reverse—samples as evidence of population-level or aggregate characteristics. But whether the evidentiary route is from population to sample or from sample to population, the evidentiary inferences are statistical. And so, with or without numbers, is much of the evidence we use, and much of what evidence itself is all about.