One of the first things taught in introductory statistics textbooks is that correlation is not causation. It is also one of the first things forgotten.
—Thomas Sowell1
Rationality embraces all spheres of life, including the personal, the political, and the scientific. It’s not surprising that the Enlightenment-inspired theorists of American democracy were fanboys of science, nor that real and wannabe autocrats latch onto harebrained theories of cause and effect.2 Mao Zedong forced Chinese farmers to crowd their seedlings together to enhance their socialist solidarity, and a recent American leader suggested that Covid-19 could be treated with injections of bleach.
From 1985 to 2006, Turkmenistan was ruled by President for Life Saparmurat Niyazov. Among his accomplishments were making his autobiography required reading for the nation’s driving test and erecting a massive golden statue of himself that rotated to face the sun. In 2004 he issued the following health notice to his adoring public: “I watched young dogs when I was young. They were given bones to gnaw. Those of you whose teeth have fallen out did not gnaw on bones. This is my advice.”3
Since most of us are in no danger of being sent to prison in Ashgabat, we can identify the flaw in His Excellency’s advice. The president made one of the most famous errors in reasoning, confusing correlation with causation. Even if it were true that toothless Turkmens had not chewed bones, the president was not entitled to conclude that gnawing on bones is what strengthens teeth. Perhaps only people with strong teeth can gnaw bones, a case of reverse causation. Or perhaps some third factor, such as being a member of the Communist Party, caused Turkmens both to gnaw bones (to show loyalty to their leader) and to have strong teeth (if dental care was a perquisite of membership), a case of a confound.
The concept of causation, and its contrast with mere correlation, is the lifeblood of science. What causes cancer? Or climate change? Or schizophrenia? It is woven into our everyday language, reasoning, and humor. The semantic contrast between “The ship sank” and “The ship was sunk” is whether the speaker asserts that there was a causal agent behind the event rather than a spontaneous occurrence. We appeal to causality whenever we ponder what to do about a leak, a draft, an ache or a pain. One of my grandfather’s favorite jokes was about the man who gorged himself on cholent (the meat and bean stew simmered for twelve hours during the Sabbath blackout on cooking) with a glass of tea, and then lay in pain moaning that the tea had made him sick. Presumably you had to have been born in Poland in 1900 to find this as uproarious as he did, but if you get the joke at all, you can see how the difference between correlation and causation is part of our common sense.
Nonetheless, Niyazovian confusions are common in our public discourse. This chapter probes the nature of correlation, the nature of causation, and the ways to tell the difference.
A correlation is a dependence of the value of one variable on the value of another: if you know one, you can predict the other, at least approximately. (“Predict” here means “guess,” not “foretell”; you can predict the height of parents from the heights of their children or vice versa.) A correlation is often depicted in a graph called a scatterplot. In this one, every dot represents a country, and the dots are arrayed from left to right by their average income, and up and down by their average self-rated life satisfaction. (The income has been squeezed onto a logarithmic scale to compensate for the diminishing marginal utility of money, for reasons we saw in chapter 6.)4
You can immediately spot the correlation: the dots are splayed along a diagonal axis, which is shown as the gray dashed line lurking behind the swarm. Each dot is impaled by an arrow which summarizes a mini-scatterplot for the people within the country. The macro- and mini-plots show that happiness is correlated with income, both among the people within a country (each arrow) and across the countries (the dots). And I know that you are resisting the temptation, at least for now, to infer “Being rich makes you happy.”
Where do the gray dashed line and the arrows impaling each dot come from? And how might we translate our visual impression that the dots are strung out along the diagonal into something more objective, so that we aren’t fooled into imagining a streak in any old pile of pick-up sticks?
The answer is the mathematical technique called regression, the workhorse of epidemiology and social science. Consider the scatterplot below. Imagine that each data point is a tack, and that we connect it to a rigid rod with a rubber band. Imagine that the bands can stretch only up and down, not diagonally, and that the farther you stretch them, the harder they resist. When all the bands are attached, let go of the rod and let it sproing into place:
The rod settles into a location and an angle that minimizes the square of the distance between each tack and where it’s attached. The rod, thus positioned, is called a regression line, and it captures the linear relationship between the two variables: y, corresponding to the vertical axis, and x, corresponding to the horizontal one. The length of the rubber band connecting each tack to the line is called the residual, and it captures the idiosyncratic portion of that unit’s y-value that refuses to be predicted by its x. Go back to the happiness–income graph. If income predicted happiness perfectly, every dot would fall exactly along the gray regression line, but with real data, that never happens. Some of the dots float above the line (they have large positive residuals), like Jamaica, Venezuela, Costa Rica, and Denmark. Putting aside measurement error and other sources of noise, the discrepancies show that in 2006 (when the data were gathered) the people of these countries were happier than you would expect based on their income, perhaps because of other traits boasted by the country such as its climate or culture. Other dots hang below the line, like Togo, Bulgaria, and Hong Kong, suggesting that something is making the people in those countries a bit glummer than the level their income entitles them to.
The residuals also allow us to quantify how correlated the two variables are: the shorter the bands, as a proportion of how splayed out the entire cluster is left to right and up and down, the closer the dots are to the line, and the higher the correlation. With a bit of algebra this can be converted into a number, r, the correlation coefficient, which ranges from –1 (not shown), where the dots fall in lockstep along a diagonal from northwest to southeast; through a range of negative values where they splatter diagonally along that axis; through 0, when they are an uncorrelated swarm of gnats; through positive values where they splatter southwest to northeast; to 1, where they lie perfectly along the diagonal.
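For readers who like to see the gears turn, the rubber-band picture can be written out in a few lines of code. The sketch below (Python, with invented data standing in for income and happiness, not the real dataset plotted above) fits the least-squares line and computes the correlation coefficient r.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: x might be log income, y self-rated life satisfaction
x = rng.normal(0, 1, 200)
y = 0.7 * x + rng.normal(0, 0.7, 200)        # a positive but imperfect relationship

# Least-squares line: the slope and intercept that minimize the summed squared
# residuals, i.e., the vertical "rubber bands" from each tack to the rod
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
residuals = y - (intercept + slope * x)

# Pearson's r: how tightly the dots hug the line, from -1 through 0 to +1
r = np.corrcoef(x, y)[0, 1]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.2f}")
# Bookkeeping check: the residuals' share of the spread in y is exactly 1 - r^2
print(f"var(residuals) / var(y) = {np.var(residuals) / np.var(y):.2f}  (equals 1 - r^2)")
```

The last line is why shorter rubber bands mean a higher correlation: the tighter the dots hug the line, the smaller the residuals' share of the spread, and the closer r gets to 1 or -1.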
Though the finger-pointing in correlation-versus-causation blunders is usually directed at those who leap from the first to the second, often the problem is more basic: no correlation was established in the first place. Maybe Turkmens who chew more bones don’t even have stronger teeth (r = 0). It’s not just presidents of former Soviet republics who fall short of showing correlation, let alone causation. In 2020 Jeff Bezos bragged, “All of my best decisions in business and in life have been made with heart, intuition, guts . . . not analysis,” implying that heart and guts lead to better decisions than analysis.5 But he did not tell us whether all of his worst decisions in business and life were also made with heart, intuition, and guts, nor whether the good gut decisions and bad analytic ones outnumbered the bad gut decisions and good analytic ones.
Illusory correlation, as this fallacy is called, was first shown in a famous set of experiments by the psychologists Loren and Jean Chapman, who wondered why so many psychotherapists still used the Rorschach inkblot and Draw-a-Person tests even though every study that had ever tried to validate them showed no correlation between responses on the tests and psychological symptoms. The experimenters mischievously paired written descriptions of psychiatric patients with their responses on the Draw-a-Person test, but in fact the descriptions were fake and the pairings were random. They then asked a sample of students to report any patterns they saw across the pairs.6 The students, guided by their stereotypes, incorrectly estimated that more broad-shouldered men were sketched by hyper-masculine patients, more wide-eyed ones came from paranoiacs, and so on—exactly the linkages that professional diagnosticians claim to see in their patients, with as little basis in reality.
Many correlations that have become part of our conventional wisdom, like people pouring into hospital emergency rooms during a full moon, are just as illusory.7 The danger is particularly acute with correlations that use months or years as their units of analysis (the dots in the scatterplot), because many variables rise and fall in tandem with the changing times. A bored law student, Tyler Vigen, wrote a program that scrapes the web for datasets with meaningless correlations just to show how prevalent they are. The number of murders by steam or hot objects, for example, correlates highly with the age of the reigning Miss America. And the divorce rate in Maine closely tracks national consumption of margarine.8
“Regression” has become the standard term for correlational analyses, but the connection is roundabout. The term originally referred to a specific phenomenon that comes along with correlation, regression to the mean. This ubiquitous but counterintuitive phenomenon was discovered by the Victorian polymath Francis Galton (1822–1911), who plotted the heights of children against the average height of their two parents (the “mid-parent” score, halfway between the mother and the father), in both cases adjusting for the average difference between males and females. He found that “when mid-parents are taller than mediocrity, their children tend to be shorter than they. When mid-parents are shorter than mediocrity, their children tend to be taller than they.”9 It’s still true, not just of the heights of parents and their children, but of the IQs of parents and their children, and for that matter of any two variables that are not perfectly correlated. An extreme value in one will tend to be paired with a not-quite-as-extreme value in the other.
This does not mean that tall families are begetting shorter and shorter children and vice versa, so that some day all children will line up against the same mark on the wall and the world will have no jockeys or basketball centers. Nor does it mean that the population is converging on a middlebrow IQ of 100, with geniuses and dullards going extinct. The reason that populations don’t collapse into uniform mediocrity, despite regression to the mean, is that the tails of the distribution are constantly being replenished by the occasional very tall child of taller-than-average parents and very short child of shorter-than-average ones.
Regression to the mean is purely a statistical phenomenon, a consequence of the fact that in bell-shaped distributions, the more extreme a value, the less likely it is to turn up. That implies that when a value is really extreme, any other variable that is paired with it (such as the child of an outsize couple) is unlikely to live up to its weirdness, or duplicate its winning streak, or get dealt the same lucky hand, or suffer from the same run of bad luck, or weather the same perfect storm, yet again, and will backslide toward ordinariness. In the case of height or IQ, the freakish conspiracy would be whatever unusual combination of genes, experiences, and accidents of biology came together in the parents. Many of the components of that combination will be favored in their children, but the combination itself will not be perfectly reproduced. (And vice versa: because regression is a statistical phenomenon, not a causal one, parents regress to their children’s mean, too.)
In a graph, when correlated values from two bell curves are plotted against each other, the scatterplot will usually look like a tilted football. On the following page we have a hypothetical dataset similar to Galton’s showing the heights of parents (the average of each couple) and the heights of their adult children (adjusted so that the sons and daughters can be plotted on the same scale).
The gray 45-degree diagonal shows what we would expect on average if children were exactly as exceptional as their parents. The black regression line is what we find in reality. If you zero in on an extreme value, say, parents with an average height between them of 6 feet, you’ll find that the cluster of points for their children mostly hangs below the 45-degree diagonal, which you can confirm by scanning up along the right dotted arrow to the regression line, turning left, and following the horizontal dotted arrow to the vertical axis, where it points a bit above 5’9”, shorter than the parents. If you zero in on the parents with an average height of 5 feet (left dotted arrow), you’ll see that their children mostly float above the gray diagonal, and the left turn at the regression line takes you to a value of almost 5’3”, taller than the parents.
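Regression to the mean is also easy to see in a simulation. The sketch below (Python, with invented numbers loosely in the spirit of Galton’s heights, not his actual data) generates mid-parent and child heights that correlate at about .5 and then zooms in on the extremes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy model: mid-parent and child heights share a common component,
# so they are correlated, but imperfectly (r = 0.5); numbers are invented
mean, sd, r = 66.0, 2.5, 0.5
shared = rng.normal(0, 1, n)
midparent = mean + sd * (np.sqrt(r) * shared + np.sqrt(1 - r) * rng.normal(0, 1, n))
child     = mean + sd * (np.sqrt(r) * shared + np.sqrt(1 - r) * rng.normal(0, 1, n))

print("correlation:", round(np.corrcoef(midparent, child)[0, 1], 2))

# Children of very tall parents are tall, but not as tall as their parents;
# children of very short parents are short, but not as short
tall, short = midparent > 71, midparent < 61          # roughly 6 feet and 5 feet
print("tall parents average", round(midparent[tall].mean(), 1),
      "inches; their children average", round(child[tall].mean(), 1))
print("short parents average", round(midparent[short].mean(), 1),
      "inches; their children average", round(child[short].mean(), 1))

# Because regression is statistical, not causal, it runs backward too:
tall_kids = child > 71
print("tall children average", round(child[tall_kids].mean(), 1),
      "inches; their parents average", round(midparent[tall_kids].mean(), 1))
```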
Regression to the mean happens whenever two variables are imperfectly correlated, which means that we have a lifetime of experience with it. Nonetheless, Tversky and Kahneman have shown that most people are oblivious to the phenomenon (notwithstanding the groaner in the Frank and Ernest cartoon).10
People’s attention gets drawn to an event because it is unusual, and they fail to anticipate that anything associated with that event will probably not be quite as unusual as that event was. Instead, they come up with fallacious causal explanations for what in fact is a statistical inevitability.
A tragic example is the illusion that criticism works better than praise, and punishment better than reward.11 We criticize students when they perform badly. But whatever bad luck cursed that performance is unlikely to be repeated in the next attempt, so they’re bound to improve, tricking us into thinking that punishment works. We praise them when they do well, but lightning doesn’t strike twice, so they’re unlikely to match that feat the next time, fooling us into thinking that praise is counterproductive.
Unawareness of regression to the mean sets the stage for many other illusions. Sports fans theorize about why a Rookie of the Year is doomed to suffer a sophomore slump, and why the cover subject of a famous magazine will then have to live with the Sports Illustrated jinx. (Overconfidence? Impossible expectations? The distractions of fame?) But if an athlete is singled out for an extraordinary week or year, the stars are unlikely to align that way twice in a row, and he or she has nowhere to go but meanward. (Equally meaninglessly, a slumping team will improve after the coach is fired.) After a spree of horrific crimes is splashed across the papers, politicians intervene with SWAT teams, military equipment, Neighborhood Watch signs, and other gimmicks, and sure enough, the following month they congratulate themselves because the crime rate is not as high. Psychotherapists, too, regardless of their flavor of talking cure, can declare unearned victory after treating a patient who comes in with a bout of severe anxiety or depression.
Once again, scientists are not immune. Yet another cause of replication failures is that experimenters don’t appreciate a version of regression to the mean called the Winner’s Curse. If the results of an experiment seem to show an interesting effect, a lot of things must have gone right, whether the effect is real or not. The gods of chance must have smiled on the experimenters, which they should not count upon a second time, so when they try to replicate the effect, they ought to enlist more participants. But most experimenters think that they’ve already racked up some evidence for the effect, so they can get away with fewer participants, not appreciating that this strategy is a one-way path to the Journal of Irreproducible Results.12 A failure to appreciate how regression to the mean applies to striking discoveries led to a muddled 2010 New Yorker article called “The Truth Wears Off,” which posited a mystical “decline effect,” supposedly casting doubt on the scientific method.13
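A small simulation makes the Winner’s Curse vivid. In the sketch below (Python, all numbers invented), thousands of underpowered studies chase a modest true effect; among the studies that cross the significance threshold, the observed effect is inflated, which is exactly the estimate a hopeful experimenter would use to justify a too-small replication.

```python
import numpy as np

rng = np.random.default_rng(2)

true_effect, n, studies = 0.2, 30, 20_000      # modest effect, smallish samples

# Each "study" compares a treatment group with a control group
treat = rng.normal(true_effect, 1, (studies, n))
ctrl  = rng.normal(0.0,         1, (studies, n))
observed = treat.mean(axis=1) - ctrl.mean(axis=1)
se = np.sqrt(2 / n)                            # standard error of the difference
significant = observed / se > 1.96             # the studies that "worked"

print("true effect:", true_effect)
print("average observed effect, all studies:        ", round(observed.mean(), 2))
print("average observed effect, significant studies:", round(observed[significant].mean(), 2))
# Conditioning on success inflates the apparent effect, so a replication
# planned around the inflated estimate will be underpowered.
```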
The Winner’s Curse applies to any unusually successful human venture, and our failure to compensate for singular moments of good fortune may be one of the reasons that life so often brings disappointment.
Before we lay out the bridge from correlation to causation, let’s spy on the opposite shore, causation itself. It turns out to be a surprisingly elusive concept.14 Hume, once again, set the terms for centuries of analysis by venturing that causation is merely an expectation that a correlation we experienced in the past will hold in the future.15 Once we have watched enough billiards, then whenever we see one ball close in on a second, we anticipate that the second will be launched forward, just like all the times before, propped up by our tacit but unprovable assumption that the laws of nature persist over time.
It doesn’t take long to see what’s wrong with “constant conjunction” as a theory of causality. The rooster always crows just before daybreak, but we don’t credit him with causing the sun to rise. Likewise, thunder often precedes a forest fire, but we don’t say that thunder causes fires. These are epiphenomena, also known as confounds or nuisance variables: they accompany but do not cause the event. Epiphenomena are the bane of epidemiology. For many years coffee was blamed for heart disease, because coffee drinkers had more heart attacks. It turned out that coffee drinkers also tend to smoke and avoid exercise; the coffee was an epiphenomenon.
Hume anticipated the problem and elaborated on his theory: not only does the cause have to regularly precede its effect, but “if the first object had not been, the second never had existed.” The crucial “if it had not been” clause is a counterfactual, a “what if.” It refers to what would happen in a possible world, an alternate universe, a hypothetical experiment. In a parallel universe in which the cause didn’t happen, neither did the effect. This counterfactual definition of causation solves the epiphenomenon problem. The reason we say the rooster doesn’t cause the sunrise is that if the rooster had become the main ingredient in coq au vin the night before, the sun would still have risen. We say that lightning causes forest fires and thunder doesn’t because if there were lightning without thunder, a forest could ignite, but not vice versa.
Causation, then, can be thought of as the difference between outcomes when an event (the cause) takes place and when it does not.16 The “fundamental problem of causal inference,” as statisticians call it, is that we’re stuck in this universe, where either a putative causal event took place or it didn’t. We can’t peer into that other universe and see what the outcome is over there. We can, to be sure, compare the outcomes in this universe on the various occasions when that kind of event does or does not take place. But that runs smack into a problem pointed out by Heraclitus in the sixth century BCE: You can’t step in the same river twice. Between those two occasions, the world may have changed in other ways, and you can’t be sure whether one of those other changes was the cause. We also can compare individual things that underwent that kind of event with similar things that did not. But this, too, runs into a problem, pointed out by Dr. Seuss: “Today you are you, that is truer than true. There is no one alive who is youer than you.” Every individual is unique, so we can’t know whether an outcome experienced by an individual depended on the supposed cause or on that person’s myriad idiosyncrasies. To infer causation from these comparisons, we have to assume, as they say less poetically, “temporal stability” and “unit homogeneity.” The methods discussed in the next two sections try to make those assumptions reasonable.
Even once we have established that some cause makes a difference to an outcome, neither scientists nor laypeople are content to leave it at that. We connect the cause to its effect with a mechanism: the clockwork behind the scenes that pushes things around. People have intuitions that the world is not a video game with patterns of pixels giving way to new patterns. Underneath each happening is a hidden force, power, or oomph. Many of our primitive intuitions of causal powers turn out, in the light of science, to be mistaken, such as the “impetus” that the medievals thought was impressed upon moving objects, and the psi, qi, engrams, energy fields, homeopathic miasms, crystal powers, and other bunkum of alternative medicine. But some intuitive mechanisms, like gravity, survive in scientifically respectable forms. And many new hidden mechanisms have been posited to explain correlations in the world, including genes, pathogens, tectonic plates, and elementary particles. These causal mechanisms are what allow us to predict what would happen in counterfactual scenarios, lifting them from the realm of make-believe: we set up the pretend world and then simulate the mechanisms, which take it from there.
Even with a grasp of causation in terms of alternative outcomes and the mechanisms that produce them, any effort to identify “the” cause of an effect raises a thicket of puzzles. One is the elusive difference between a cause and a condition. We say that striking a match causes a fire, because without the striking there would be no fire. But without oxygen, without the dryness of the paper, without the stillness of the room, there also would be no fire. So why don’t we say “The oxygen caused the fire”?
A second puzzle is preemption. Suppose, for the sake of argument, that Lee Harvey Oswald had a co-conspirator perched on the grassy knoll in Dallas in 1963, and they had conspired that whoever got the first clear shot would take it while the other melted into the crowd. In the counterfactual world in which Oswald did not shoot, JFK would still have died—yet it would be wacky to deny that in the world in which he did take the shot before his accomplice, he caused Kennedy’s death.
A third is overdetermination. A condemned prisoner is shot by a firing squad rather than a single executioner so that no shooter has to live with the dreadful burden of being the one who caused the death: if he had not fired, the prisoner would still have died. But then, by the logic of counterfactuals, no one caused his death.
And then there’s probabilistic causation. Many of us know a nonagenarian who smoked a pack a day all her life. But nowadays few people would say that her ripe old age proves that smoking does not cause cancer, though that was a common “refutation” in the days before the smoking–cancer link became undeniable. Even today, the confusion between less-than-perfect causation and no causation is rampant. A 2020 New York Times op-ed argued for abolishing the police because “the current approach hasn’t ended [rape]. Most rapists never see the inside of a courtroom.”17 The editorialist did not consider whether, if there were no police, even fewer rapists, or none at all, would see the inside of a courtroom.
We can make sense of these paradoxes of causation only by forgetting the billiard balls and recognizing that no event has a single cause. Events are embedded in a network of causes that trigger, enable, inhibit, prevent, and supercharge one another in linked and branching pathways. The four causal puzzlers become less puzzling when we lay out the road maps of causation in each case, shown below.
If you interpret the arrows not as logical implications (“If X smokes, then X gets heart disease”) but as conditional probabilities (“The likelihood of X getting heart disease given that X is a smoker is higher than the likelihood of X getting heart disease given that he is not a smoker”), and the event nodes not as being either on or off but as probabilities, reflecting a base rate or prior, then the diagram is called a causal Bayesian network.18 One can work out what unfolds over time by applying (naturally) Bayes’s rule, node by node through the network. No matter how convoluted the tangle of causes, conditions, and confounds, one can then determine which events are causally dependent on or independent of one another.
The inventor of these networks, the computer scientist Judea Pearl, notes that they are built out of three simple patterns—the chain, the fork, and the collider—each capturing a fundamental (but unintuitive) feature of causation with more than one cause.
The connections reflect the conditional probabilities. In each case, A and C are not directly connected, which means that the probability of A given B can be specified independently of the probability of C given B. And in each case something distinctive may be said about the relation between them.
In a causal chain, the first cause, A, is “screened off” from the ultimate effect, C; its only influence is via B. As far as C is concerned, A might as well not exist. Consider a hotel’s fire alarm, set off by the chain “fire → smoke → alarm.” It’s really not a fire alarm but a smoke alarm, indeed, a haze alarm. The guests may be awakened as readily by someone spray-painting a bookshelf near an intake vent as by an errant crème brûlée torch.
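The arithmetic behind a causal Bayesian network is nothing more exotic than multiplying conditional probabilities node by node and summing over what you don’t know. The sketch below (Python, with made-up probabilities) encodes the fire → smoke → alarm chain and shows the screening-off property: once the smoke is given, learning about the fire changes nothing.

```python
import itertools

# Made-up conditional probabilities for the chain fire -> smoke -> alarm
p_fire = 0.01
p_smoke_given_fire = {True: 0.90, False: 0.05}    # spray paint, burnt toast, ...
p_alarm_given_smoke = {True: 0.95, False: 0.001}  # the alarm only "sees" smoke

# Enumerate the joint distribution over the three events, node by node
joint = {}
for fire, smoke, alarm in itertools.product([True, False], repeat=3):
    p = p_fire if fire else 1 - p_fire
    p *= p_smoke_given_fire[fire] if smoke else 1 - p_smoke_given_fire[fire]
    p *= p_alarm_given_smoke[smoke] if alarm else 1 - p_alarm_given_smoke[smoke]
    joint[(fire, smoke, alarm)] = p

def p_alarm(fire=None, smoke=None):
    """P(alarm | whatever evidence is given), by summing over the joint."""
    keep = lambda k: ((fire is None or k[0] == fire) and
                      (smoke is None or k[1] == smoke))
    evidence = sum(p for k, p in joint.items() if keep(k))
    return sum(p for k, p in joint.items() if keep(k) and k[2]) / evidence

print("P(alarm | fire)    =", round(p_alarm(fire=True), 3))
print("P(alarm | no fire) =", round(p_alarm(fire=False), 3))
# Screening off: given the smoke, the fire adds no further information
print("P(alarm | smoke, fire)    =", round(p_alarm(smoke=True, fire=True), 3))
print("P(alarm | smoke, no fire) =", round(p_alarm(smoke=True, fire=False), 3))
```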
A causal fork is already familiar: it depicts a confound or epiphenomenon, with the attendant danger of misidentifying the real cause. Age (B) affects vocabulary (A) and shoe size (C), since older children have bigger feet and know more words. This means that vocabulary is correlated with shoe size. But Head Start would be ill advised to prepare children for school by fitting them with larger sneakers.
Just as dangerous is the collider, where unrelated causes converge on a single effect. Actually, it’s even more dangerous, because while most people intuitively get the fallacy of a confound (it cracked them up in the shtetl), the “collider stratification selection bias” is almost unknown. The trap in a causal collider is that when you focus on a restricted range of effects, you introduce an artificial negative correlation between the causes, since one cause will compensate for the other. Many veterans of the dating scene wonder why good-looking men are jerks. But this may be a calumny on the handsome, and it’s a waste of time to cook up theories to explain it, such as that good-looking men have been spoiled by a lifetime of people kissing up to them. Many women will date a man (B) only if he is either attractive (A) or nice (C). Even if niceness and looks were uncorrelated in the dating pool, the plainer men had to be nice or the woman would never have dated them in the first place, while the hunks were not sorted by any such filter. A bogus negative correlation was introduced by her disjunctive choosiness.
The collider fallacy also fools critics of standardized testing into thinking that test scores don’t matter, based on the observation that graduate students who were admitted with higher scores are no more likely to complete the program. The problem is that the students who were accepted despite their low scores must have boasted other assets.19 If one is unaware of the bias, one could even conclude that maternal smoking is good for babies, since among babies with low birth weights, the ones with mothers who smoked are healthier. That’s because low birth weight must be caused by something, and the other possible causes, such as alcohol or drug abuse, may be even more harmful to the child.20 The collider fallacy also explains why Jenny Cavilleri unfairly maintained that rich boys are stupid: to get into Harvard (B), you can be either rich (A) or smart (C).
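Collider bias is easy to conjure out of thin air, which is the point. In the sketch below (Python, invented data), looks and niceness are statistically independent in the dating pool, but conditioning on the collider, whether a man got dated, manufactures a negative correlation between them.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# In the whole dating pool, looks and niceness are independent (r near 0)
looks = rng.normal(0, 1, n)
nice  = rng.normal(0, 1, n)

# The collider: a man gets dated if he is either attractive enough OR nice enough
dated = (looks > 1) | (nice > 1)

print("r(looks, nice), whole pool:    ",
      round(np.corrcoef(looks, nice)[0, 1], 2))
print("r(looks, nice), dated men only:",
      round(np.corrcoef(looks[dated], nice[dated])[0, 1], 2))
# Selecting on the collider creates a negative correlation out of nothing:
# among the dated, the plainer men had to be nice to make the cut.
```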
Now that we’ve probed the nature of correlation and the nature of causation, it’s time to see how to get from one to the other. The problem is not that “correlation does not imply causation.” It usually does, because unless the correlation is illusory or a coincidence, something must have caused one variable to align with the other. The problem is that when one thing is correlated with another, it does not necessarily mean that the first caused the second. As the mantra goes: When A is correlated with B, it could mean that A causes B, B causes A, or some third factor, C, causes both A and B.
Reverse causation and confounding, the second and third verses of the mantra, are ubiquitous. The world is a huge causal Bayesian network, with arrows pointing every which way, entangling events into knots where everything is correlated with everything else. These gnarls (called multicollinearity and endogeneity) can arise because of the Matthew Effect, pithily explained by Billie Holiday: “Them that’s got shall get, them that’s not shall lose. So the Bible said, and it still is news.”21 Countries that are richer also tend to be healthier, happier, safer, better educated, less polluted, more peaceful, more democratic, more liberal, more secular, and more gender-egalitarian.22 People who are richer also tend to be healthier, better educated, better connected, likelier to exercise and eat well, and likelier to belong to privileged groups.23
These snarls mean that almost any causal conclusion you draw from correlations across countries or across people is likely to be wrong, or at best unproven. Does democracy make a country more peaceful, because its leader can’t readily turn citizens into cannon fodder? Or do countries facing no threats from their neighbors have the luxury of indulging in democracy? Does going to college equip you with skills that allow you to earn a good living? Or do only smart, disciplined, or privileged people, who can translate their natural assets into financial ones, make it through university?
There is an impeccable way to cut these knots: the randomized experiment, often called a randomized controlled trial or RCT. Take a large sample from the population of interest, randomly divide them into two groups, apply the putative cause to one group and withhold it from the other, and see if the first group changes while the second does not. A randomized experiment is the closest we can come to creating the counterfactual world that is the acid test for causation. In a causal network, it consists of surgically severing the putative cause from all its incoming influences, setting it to different values, and seeing whether the probabilities of the putative effects differ.24
Randomness is the key: if the patients who were given the drug signed up earlier, lived closer to the hospital, or had more interesting symptoms than the patients who were given the placebo, you’ll never know whether the drug worked. As one of my graduate school teachers said (alluding to a line from J. M. Barrie’s play What Every Woman Knows), “Random assignment is like charm. If you have it, you don’t need anything else; if you don’t have it, it doesn’t matter what else you have.”25 It isn’t quite true of charm, and it isn’t quite true of random assignment either, but it is still with me decades later, and I like it better than the cliché that randomized trials are the “gold standard” for showing causation.
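Here is the charm of random assignment in miniature (Python, invented numbers): a drug that does nothing looks like it works wonders when healthier patients select themselves into taking it, and looks like the dud it is when a coin flip decides.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# A confound: healthier patients are likelier both to seek out the drug and to do well
health = rng.normal(0, 1, n)
true_drug_effect = 0.0                       # suppose the drug does nothing at all

# Self-selection: who takes the drug depends on health, so the comparison is biased
chose_drug = health + rng.normal(0, 1, n) > 0
outcome = health + true_drug_effect * chose_drug + rng.normal(0, 1, n)
print("self-selected 'effect':",
      round(outcome[chose_drug].mean() - outcome[~chose_drug].mean(), 2))

# Random assignment: a coin flip severs the treatment from health
assigned = rng.random(n) < 0.5
outcome_rct = health + true_drug_effect * assigned + rng.normal(0, 1, n)
print("randomized estimate:   ",
      round(outcome_rct[assigned].mean() - outcome_rct[~assigned].mean(), 2),
      "(close to the true zero)")
```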
The wisdom of randomized controlled trials is seeping into policy, economics, and education. Increasingly, “randomistas” are urging policymakers to test their nostrums in one set of randomly selected villages, classes, or neighborhoods, and compare the results against a control group which is put on a waitlist or given some meaningless make-work program.26 The knowledge gained is likely to outperform traditional ways of evaluating policies, like dogma, folklore, charisma, conventional wisdom, and HiPPO (highest-paid person’s opinion).
Randomized experiments are no panacea (since nothing is a panacea, which is a good reason to retire that cliché). Laboratory scientists snipe at each other as much as correlational data scientists, because even in an experiment you can’t do just one thing. Experimenters may think that they have administered a treatment and only that treatment to the experimental group, but other variables may be confounded with it, a problem called excludability. According to a joke, a sexually unfulfilled couple consults a rabbi with their problem, since it is written in the Talmud that a husband is responsible for his wife’s sexual pleasure. The rabbi strokes his beard and comes up with a solution: they should hire a handsome, strapping young man to wave a towel over them the next time they make love, and the fantasies will help the woman achieve climax. They follow the great sage’s advice, but it fails to have the desired effect, and they beseech him for guidance once again. He strokes his beard and thinks up a variation. This time, the young man will make love to the woman and the husband will wave the towel. They take his advice, and sure enough, the woman enjoys an ecstatic, earth-moving orgasm. The husband says to the man, “Schmuck! Now that’s how you wave a towel.”
The other problem with experimental manipulations, of course, is that the world is not a laboratory. It’s not as if political scientists can flip a coin, impose democracy on some countries and autocracy on others, and wait five years to see which ones go to war. The same practical and ethical problems apply to studies of individuals, as shown in this cartoon.
Though not everything can be studied in an experimental trial, social scientists have mustered their ingenuity to find instances in which the world does the randomization for them. These experiments of nature can sometimes allow one to wring causal conclusions out of a correlational universe. They’re a recurring feature in Freakonomics, the series of books and other media by the economist Steven Levitt and the journalist Stephen Dubner.27
One example is the “regression discontinuity.” Say you want to decide whether going to college makes people wealthier or wealth-destined teenagers are likelier to get into college. Though you can’t literally randomize a sample of teenagers and force a college to admit one group and reject another, selective colleges effectively do that to students near their cutoff. No one really believes that the student who squeaked in with a test score of 1720 is smarter than the one who fell just short with 1710. The difference is in the noise, and might as well have been random. (The same is true with other qualifications like grades and letters of recommendation.) Suppose one follows both groups for a decade and plots their income against their test scores. If one sees a step or an elbow at the cutoff, with a bigger jump in salary at the reject–admit boundary than for intervals of similar size along the rest of the scale, one may conclude that the magic wand of admission made a difference.
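A toy regression discontinuity (Python; all numbers invented, with the 1720 cutoff borrowed from the example above) shows the logic: later income rises smoothly with test scores, plus a jump for admission, and comparing applicants in a narrow window on either side of the cutoff recovers roughly that jump.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Simulated applicants
score = rng.normal(1700, 150, n)
admitted = score >= 1720                     # the college's hard cutoff
admission_bonus = 8_000                      # the causal effect we hope to recover

# Later income rises smoothly with ability (proxied by score), plus noise,
# plus a jump for those who got in
income = 30_000 + 20 * (score - 1700) + admission_bonus * admitted \
         + rng.normal(0, 10_000, n)

# Compare applicants just below and just above the cutoff, where admission
# was effectively decided by noise
window = 15
just_below = (score >= 1720 - window) & (score < 1720)
just_above = (score >= 1720) & (score < 1720 + window)
estimate = income[just_above].mean() - income[just_below].mean()
print("estimated jump at the cutoff:", round(float(estimate)))
# Close to 8,000, plus a small contribution from the smooth trend inside the window
```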
Another gift to causation-hungry social scientists is fortuitous randomization. Does Fox News make people more conservative, or do conservatives gravitate to Fox News? When Fox News debuted in 1996, different cable companies added it to their lineups haphazardly over the next five years. Economists took advantage of the happenstance during the half decade and found that towns with Fox News in their cable lineups voted 0.4 to 0.7 points more Republican than towns that had to watch something else.28 That’s a big enough difference to swing a close election, and the effect could have accumulated in the subsequent decades when Fox News’ universal penetration into TV markets made the effect harder to prove but no less potent.
Harder, but not impossible. Another stroke of genius goes by the unhelpful name “instrumental variable regression.” Suppose you want to see whether A causes B and are worried about the usual nuisances of reverse causation (B causes A) and confounding (C causes A and B). Now suppose you found some fourth variable, I (the “instrument”), which is correlated with the putative cause, A, but could not possibly be caused by it—say, because it happened earlier in time, the future being incapable of affecting the past. Suppose as well that this pristine variable is also uncorrelated with the confound, C, and that it cannot cause B directly, only through A. Even though A cannot be randomly assigned, we have the next best thing, I. If I, the clean surrogate for A, turns out to be correlated with B, it’s an indication that A causes B.
What does this have to do with Fox News? Another gift to social scientists is American laziness. Americans hate getting out of their cars, adding water to a soup mix, and clicking their way up the cable lineup from the single digits. The lower the channel number, the more people watch it. Now, Fox News is assigned different channel numbers by different cable companies pretty much at random (the numbering depended only on when the network struck the deal with each cable company, and was unrelated to the demographics of the viewers). While a low channel number (I) can cause people to watch Fox News (A), and watching Fox News may or may not cause them to vote Republican (B), neither having conservative views (C) nor voting Republican can cause someone’s favorite television station to skitter down the cable dial. Sure enough, in a comparison across cable markets, the lower the channel number of Fox News relative to other news networks, the larger the Republican vote.29
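The instrumental-variable logic boils down to a single ratio: how much the outcome moves with the instrument, divided by how much the treatment moves with the instrument. The sketch below (Python, with invented numbers, not the actual Fox News study) builds in a confound (conservatism drives both viewing and voting) and shows that the naive regression is badly biased while the instrument recovers the effect that was built into the simulation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000

conservatism = rng.normal(0, 1, n)                  # the confound C
low_channel = (rng.random(n) < 0.5).astype(float)   # the instrument I, assigned haphazardly
true_effect = 0.05                                  # effect of watching (A) on voting (B)

# Watching Fox (A) is driven by the channel position AND by conservatism
watches = 1.0 * low_channel + 0.4 * conservatism + rng.normal(0, 1, n)
# Voting Republican (B) is driven by conservatism AND, a little, by watching
votes_rep = true_effect * watches + 0.8 * conservatism + rng.normal(0, 1, n)

# Naive slope of B on A is contaminated by the confound
naive = np.cov(watches, votes_rep)[0, 1] / np.var(watches, ddof=1)
# Instrumental-variable (Wald) estimate: movement of B with I over movement of A with I
iv = np.cov(low_channel, votes_rep)[0, 1] / np.cov(low_channel, watches)[0, 1]

print("true effect:   ", true_effect)
print("naive estimate:", round(naive, 2))   # far too big
print("IV estimate:   ", round(iv, 2))      # close to the truth
```

The instrument earns its keep only because it satisfies the conditions in the text: it nudges the treatment, it is unrelated to the confound, and it has no path to the outcome except through the treatment.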
When a data scientist finds a regression discontinuity or an instrumental variable, it’s a really good day. But more often they have to squeeze what causation they can out of the usual correlational tangle. All is not lost, though, because there are palliatives for each of the ailments that enfeeble causal inference. They are not as good as the charm of random assignment, but often they are the best we can do in a world that was not created for the benefit of scientists.
Reverse causation is the easier of the two to rule out, thanks to the iron law that hems in the writers of science fiction and other time-travel plots like Back to the Future: the future cannot affect the past. Suppose you want to test the hypothesis that democracy causes peace, not just vice versa. First, one must avoid the fallacy of all-or-none causation, and get beyond the common but false claim that “democracies never fight each other” (there are plenty of exceptions).30 The more realistic hypothesis is that countries that are relatively more democratic are less likely to fall into war.31 Several research organizations give countries democracy scores from –10 for a full autocracy like North Korea to +10 for a full democracy like Norway. Peace is a bit harder, because (fortunately for humanity, but unfortunately for social scientists) shooting wars are uncommon, so most of the entries in the table would be “0.” Instead, one can estimate war-proneness by the number of “militarized disputes” a country was embroiled in over a year: the saber-rattlings, alertings of forces, shots fired across bows, warplanes sent scrambling, bellicose threats, and border skirmishes. One can convert this from a war score to a peace score (so that more-peaceful countries get higher numbers) by subtracting the count from some large number, like the maximum number of disputes ever recorded. Now one can correlate the peace score against the democracy score. By itself, of course, that correlation proves nothing.
But suppose that each variable is recorded twice, say, a decade apart. If democracy causes peace, then the democracy score at Time 1 should be correlated with the peace score at Time 2. This, too, proves little, because over a decade the leopard doesn’t change its spots: a peaceful democracy then may just be a peaceful democracy now. But as a control one can look to the other diagonal: the correlation between Democracy (the democracy score) at Time 2 and Peace (the peace score) at Time 1. This correlation captures any reverse causation, together with the confounds that have stayed put over the decade. If the first correlation (past cause with present effect) is stronger than the second (past effect with present cause), it’s a hint that democracy causes peace rather than vice versa. The technique is called cross-lagged panel correlation, “panel” being argot for a dataset containing measurements at several points in time.
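In code, the cross-lagged comparison is just two correlations taken across the diagonals. The sketch below (Python, invented data, with far more “countries” than the world supplies, purely to keep the noise down) builds in a world where democracy genuinely promotes later peace and not vice versa, and the asymmetry shows up as predicted.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000   # pretend countries; the real sample is far smaller

# Suppose democracy genuinely promotes later peace, not vice versa,
# and both variables are fairly stable over the decade
dem1 = rng.normal(0, 1, n)
peace1 = 0.5 * dem1 + rng.normal(0, 1, n)                     # Time 1
dem2 = 0.8 * dem1 + rng.normal(0, 0.6, n)                     # democracy is stable
peace2 = 0.6 * peace1 + 0.4 * dem1 + rng.normal(0, 0.6, n)    # past democracy -> later peace

r = lambda a, b: round(np.corrcoef(a, b)[0, 1], 2)
print("Democracy(T1) with Peace(T2):", r(dem1, peace2))   # cause-first diagonal
print("Peace(T1) with Democracy(T2):", r(peace1, dem2))   # effect-first diagonal
# The first cross-lagged correlation comes out reliably larger than the second,
# hinting (in this simulated world) that democracy drives peace rather than the reverse.
```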
Confounds, too, may be tamed by clever statistics. You may have read in science news articles of researchers “holding constant” or “statistically controlling for” some confounded or nuisance variable. The simplest way to do that is called matching.32 The democracy–peace relationship is infested with plenty of confounds, such as prosperity, education, trade, and membership in treaty organizations. Let’s consider one of them, prosperity, measured as GDP per capita. Suppose that for every democracy in our sample we found an autocracy that had the same GDP per capita. If we compare the average peace scores of the democracies with their autocracy doppelgangers, we’d have an estimate of the effects of democracy on peace, holding GDP constant. The logic of matching is straightforward, but it requires a large pool of candidates from which to find good matches, and the number explodes as more confounds have to be held constant. That can work for an epidemiological study with tens of thousands of participants to choose from, but not for a political study in a world with just 195 countries.
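Here is a bare-bones version of matching (Python, invented data): pair each democracy with the autocracy closest to it in GDP per capita and average the within-pair differences in peacefulness. The naive comparison mixes the effect of democracy with the effect of wealth; the matched comparison comes much closer to the effect that was built into the simulation.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2_000   # pretend countries, to keep the estimates from being too noisy

# Invented data: richer countries are likelier to be democracies AND more peaceful
gdp = rng.lognormal(9, 1, n)                                      # GDP per capita
democracy = rng.random(n) < 1 / (1 + np.exp(-(np.log(gdp) - 9)))
peace = 2 * np.log(gdp) + 1.0 * democracy + rng.normal(0, 2, n)   # true effect = 1.0

dems, auts = np.where(democracy)[0], np.where(~democracy)[0]

# Naive comparison: credit for wealth gets attributed to democracy
print("naive difference:  ", round(peace[dems].mean() - peace[auts].mean(), 2))

# Matching: for each democracy, find the autocracy nearest in (log) GDP per capita
diffs = []
for d in dems:
    a = auts[np.argmin(np.abs(np.log(gdp[auts]) - np.log(gdp[d])))]
    diffs.append(peace[d] - peace[a])
print("matched difference:", round(np.mean(diffs), 2))   # much closer to the true 1.0
```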
The more general technique is called multiple regression, and it capitalizes on the fact that a confound is never perfectly correlated with a putative cause. The discrepancies between them turn out to be not bothersome noise but telltale information. Here’s how it could work with democracy, peace, and GDP per capita. First, we plot the putative cause, the democracy score, against the nuisance variable (top left graph), one point per country. (The data are fake, made up to illustrate the logic.) We fit the regression line, and turn our attention to the residuals: the vertical distance between each point and the line, corresponding to the discrepancy between how democratic a country would be if income predicted democracy perfectly and how democratic it is in reality. Now we throw away each country’s original democracy score and replace it with the residual: the measure of how democratic it is, controlling for its income.
Now do the same with the putative effect, peace. We plot the peace score against the nuisance variable (top right graph), measure the residuals, throw away the original peace data, and replace them with the residuals, namely, how peaceful each country is above and beyond what you would expect from its income. The final step is obvious: correlate the Peace residuals with the Democracy residuals (bottom graph). If the correlation is significantly different from zero, one may venture that democracy causes peacefulness, holding prosperity constant.
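Here is the residual-swapping recipe spelled out (Python, fake data as in the text). Income drives both democracy and peace, so the simple slope of peace on democracy is inflated; the slope computed from the two sets of residuals recovers the effect of democracy holding income constant.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2_000   # pretend countries

gdp = rng.normal(0, 1, n)
democracy = 0.6 * gdp + rng.normal(0, 1, n)                  # richer -> more democratic
peace = 0.5 * democracy + 0.7 * gdp + rng.normal(0, 1, n)    # true democracy effect = 0.5

def residualize(y, x):
    """Strip out of y whatever a straight line in x can predict."""
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - (y.mean() + slope * (x - x.mean()))

dem_resid = residualize(democracy, gdp)     # democracy, controlling for income
peace_resid = residualize(peace, gdp)       # peace, controlling for income

naive_slope = np.cov(democracy, peace)[0, 1] / np.var(democracy, ddof=1)
partial_slope = np.cov(dem_resid, peace_resid)[0, 1] / np.var(dem_resid, ddof=1)
print("naive slope:  ", round(naive_slope, 2))     # inflated by the confound
print("partial slope:", round(partial_slope, 2))   # close to the true 0.5
```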
What you have just seen is the core of the vast majority of statistics used in epidemiology and social sciences, called the general linear model. The deliverable is an equation that allows you to predict the effect from a weighted sum of the predictors (some of them, presumably, causes). If you’re a good visual thinker, you can imagine the prediction as a tilted plane, rather than a line, floating above the ground defined by the two predictors. Any number of predictors can be thrown in, creating a hyperplane in hyperspace; this quickly overwhelms our feeble powers of visual imagery (which has enough trouble with three dimensions), but in the equation it consists only of adding more terms to the string. In the case of peace, the equation might be: Peace = (a × Democracy) + (b × GDP/capita) + (c × Trade) + (d × Treaty-membership) + (e × Education), assuming that any of these five might be a pusher or a puller of peacefulness. The regression analysis informs us which of the candidate variables pulls its weight in predicting the outcome, holding each of the others constant. It is not a turnkey machine for proving causation—one still has to interpret the variables and how they are plausibly connected, and watch out for myriad traps—but it is the most commonly used tool for unsnarling multiple causes and confounds.
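And here is the whole equation fitted at once, a minimal sketch of the general linear model with five candidate predictors (Python, all data invented; in this simulation only democracy and GDP per capita actually matter). Each estimated weight is that predictor’s pull on peacefulness holding the other four constant.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 2_000   # pretend countries

# Five candidate pushers and pullers of peacefulness (invented, and intercorrelated)
gdp = rng.normal(0, 1, n)
democracy = 0.6 * gdp + rng.normal(0, 1, n)
trade = 0.5 * gdp + rng.normal(0, 1, n)
treaties = 0.3 * democracy + rng.normal(0, 1, n)
education = 0.7 * gdp + rng.normal(0, 1, n)

# Suppose the truth is that only democracy and GDP matter
peace = 0.5 * democracy + 0.7 * gdp + rng.normal(0, 1, n)

# Peace = a*Democracy + b*GDP + c*Trade + d*Treaties + e*Education + intercept
X = np.column_stack([democracy, gdp, trade, treaties, education, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, peace, rcond=None)
for name, c in zip(["democracy", "gdp", "trade", "treaties", "education"], coefs):
    print(f"{name:>10}: {c:+.2f}")
# Democracy and GDP come out near their true weights (0.5 and 0.7);
# trade, treaties, and education, which pull no weight of their own, land near zero.
```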
The algebra of a regression equation is less important than the big idea flaunted by its form: events have more than one cause, all of them statistical. The idea seems elementary, but it’s regularly flouted in public discourse. All too often, people write as if every outcome had a single, unfailing cause: if A has been shown to affect B, it proves that C cannot affect it. Accomplished people spend ten thousand hours practicing their craft; this is said to show that achievement is a matter of practice, not talent. Men today cry twice as often as their fathers did; this shows that the difference in crying between men and women is social rather than biological. The possibility of multiple causes—nature and nurture, talent and practice—is inconceivable.
Even more elusive is the idea of interacting causes: the possibility that the effect of one cause may depend on another. Perhaps everyone benefits from practice, but talented people benefit more. What we need is a vocabulary for talking and thinking about multiple causes. This is yet another area in which a few simple concepts from statistics can make everyone smarter. The revelatory concepts are main effect and interaction.
Let me illustrate them with fake data. Suppose we’re interested in what makes monkeys fearful: heredity, namely the species they belong to (capuchin or marmoset), or the environment in which they were reared (alone with their mothers or in a large enclosure with plenty of other monkey families). Suppose we have a way of measuring fear—say, how closely the monkey approaches a rubber snake. With two possible causes and one effect, six different things can happen. This sounds complicated, but the possibilities jump off the page as soon as we plot them in graphs. Let’s start with the three simplest ones.
The left graph shows a big fat nothing: a monkey is a monkey. Species doesn’t matter (the lines fall on top of each other); Environment doesn’t matter either (each line is flat). The middle graph is what we would see if Species mattered (capuchins are more skittish than marmosets, shown by their line floating higher on the graph), while Environment did not (both species are equally fearful whether they are raised alone or with others, shown by each line being flat). In jargon, we say that there is a main effect of Species, meaning that the effect is seen across the board, regardless of the environment. The right graph shows the opposite outcome, a main effect of Environment but none of Species. Growing up alone makes a monkey more fearful (seen in the slope of the lines), but it does so for capuchins and marmosets alike (seen in the lines falling on top of each other).
Now let’s get even smarter and wrap our minds around multiple causes. Once again we have three possibilities. What would it look like if Species and Environment both mattered: if capuchins were innately more fearful than marmosets, and if being reared alone makes a monkey more fearful? The leftmost graph shows this situation, namely, two main effects. It takes the form of the two lines having parallel slopes, one hovering above the other.
Things get really interesting in the middle graph. Here, both factors matter, but each depends on the other. If you’re a capuchin, being raised alone makes you bolder; if you’re a marmoset, being raised alone makes you meeker. We see an interaction between Species and Environment, which visually consists of the lines being nonparallel. In these data, the lines cross into a perfect X, which means that the main effects are canceled out entirely. Across the board, Species doesn’t matter: the midpoint of the capuchin line sits on top of the midpoint of the marmoset line. Environment doesn’t matter across the board either: the average for Social, corresponding to the point midway between the two leftmost tips, lines up with the average for Solitary, corresponding to the point midway between the rightmost ones. Of course Species and Environment do matter: it’s just that how each cause matters depends on the other one.
Finally, an interaction can coexist with one or more main effects. In the rightmost graph, being reared alone makes capuchins more fearful, but it has no effect on the always-calm marmosets. Since the effect on the marmosets doesn’t perfectly cancel out the effect on the capuchins, we do see a main effect of Species (the capuchin line is higher) and a main effect of Environment (the midpoint of the two left dots is lower than the midpoint of the two right ones). But whenever we interpret a phenomenon with two or more causes, any interaction supersedes the main effects: it provides more insight as to what is going on. An interaction usually implies that the two causes intermingle in a single link in the causal chain, rather than taking place in different links and then just adding up. With these data, the common link might be the amygdala, the part of the brain registering fearful experiences, which may be plastic in capuchins but hardwired in marmosets.
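The bookkeeping behind main effects and interactions is just averaging and differencing. The sketch below (Python) uses made-up fear scores resembling the rightmost graph: solitary rearing frightens capuchins but not marmosets, so an interaction coexists with two main effects.

```python
import numpy as np

# Made-up mean fear scores for the 2-by-2 design
#                 Social  Solitary
fear = np.array([[3.0,    7.0],    # capuchins: rearing alone makes them fearful
                 [2.0,    2.0]])   # marmosets: unflappable either way

species_means = fear.mean(axis=1)        # averaging across environments
environment_means = fear.mean(axis=0)    # averaging across species

main_effect_species = species_means[0] - species_means[1]
main_effect_environment = environment_means[1] - environment_means[0]
# The interaction: does the effect of environment differ between species?
interaction = (fear[0, 1] - fear[0, 0]) - (fear[1, 1] - fear[1, 0])

print("main effect of species:    ", main_effect_species)      # 3.0
print("main effect of environment:", main_effect_environment)  # 2.0
print("interaction:               ", interaction)              # 4.0: the lines aren't parallel
```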
With these cognitive tools, we are now equipped to make sense of multiple causes in the world: we can get beyond “nature versus nurture” and whether geniuses are “born or made.” Let’s turn to some real data.
What causes major depression, a stressful event or a genetic predisposition? This graph plots the likelihood of suffering a major depressive episode in a sample of women with twin sisters.33
The sample includes women who had undergone a severe stressor, like a divorce, an assault, or a death of a close relative (the points on the right), and women who had not (the points on the left). Scanning the lines from top to bottom, the first is for women who may be highly predisposed to depression, because their identical twin, with whom they share all their genes, suffered from it. The next line down is for women who are only somewhat predisposed to depression, because a fraternal twin, with whom they share half their genes, suffered from it. Below it we have a line for women who are not particularly predisposed, because their fraternal twin did not suffer from depression. At the bottom we find a line for women who are at the lowest risk, because their identical twin did not suffer from it.
The pattern in the graph tells us three things. Experience matters: we see a main effect of Stress in the upward slant of the fan of lines, which shows that undergoing a stressful event ups the odds of getting depressed. Overall, genes matter: the four lines float at different heights, showing that the higher one’s genetic predisposition, the greater the chance that one will suffer a depressive episode. But the real takeaway is the interaction: the lines are not parallel. (Another way of putting it is that the points fall on top of one another on the left but are spread out on the right.) If you don’t suffer a stressful event, your genes barely matter: regardless of your genome, the chance of a depressive episode is less than one percent. But if you do suffer a stressful event, your genes matter a lot: a full dose of genes associated with escaping depression keeps the risk of getting depressed at 6 percent (lowest line); a full dose of genes associated with suffering depression more than doubles the risk to 14 percent (highest line). The interaction tells us not only that both genes and environment are important but that they seem to have their effects on the same link in the causal chain. The genes that these twins share to different degrees are not genes for depression per se; they are genes for vulnerability or resilience to stressful experiences.
Let’s turn to whether stars are born or made. The graph on the next page, also from a real study, shows ratings of chess skill in a sample of lifelong players who differ in their measured cognitive ability and in how many games they play per year.34 It shows that practice makes better, if not perfect: we see a main effect of games played per year, visible in the overall upward slope. Talent will tell: we see a main effect of ability, visible in the gap between the two lines. But the real moral of the story is their interaction: the lines are not parallel, showing that smarter players gain more with every additional game of practice. An equivalent way of putting it is that without practice, cognitive ability barely matters (the leftmost tips of the lines almost overlap), but with practice, smarter players show off their talent (the rightmost tips are spread apart). Knowing the difference between main effects and interactions not only protects us from falling for false dichotomies but offers us deeper insight into the nature of the underlying causes.
As a way of understanding the causal richness of the world, a regression equation is pretty simpleminded: it just adds up a bunch of weighted predictors. Interactions can be thrown in as well; they can be represented as additional predictors derived by multiplying together the interacting ones. A regression equation is nowhere near as complex as the deep learning networks we saw in chapter 3, which take in millions of variables and combine them in long, intricate chains of formulas rather than just throwing them into a hopper and adding them up. Yet despite their simplicity, one of the stunning findings of twentieth-century psychology is that a dumb regression equation usually outperforms a human expert. The finding, first noted by the psychologist Paul Meehl, goes by the name “clinical versus actuarial judgment.”35
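Before turning to the contest between the expert and the equation, here is the product-term trick in miniature (Python, invented data loosely echoing the chess example): an interaction enters a regression simply as one more predictor, the product of the two interacting ones.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5_000

# Invented data in which practice helps, ability helps, and they interact:
# abler players gain more from each additional unit of practice
ability = rng.normal(0, 1, n)
practice = rng.normal(0, 1, n)
skill = 1.0 * practice + 1.0 * ability + 0.5 * practice * ability + rng.normal(0, 1, n)

# The interaction is just another column in the design matrix: the product term
X = np.column_stack([np.ones(n), practice, ability, practice * ability])
coefs, *_ = np.linalg.lstsq(X, skill, rcond=None)
print("practice:", round(coefs[1], 2),
      "  ability:", round(coefs[2], 2),
      "  interaction:", round(coefs[3], 2))   # recovers roughly 1.0, 1.0, and 0.5
```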
Suppose you want to predict some quantifiable outcome—how long a cancer patient will survive; whether a psychiatric patient ends up diagnosed with a mild neurosis or a severe psychosis; whether a criminal defendant will skip bail, blow off parole, or recidivate; how well a student will perform in graduate school; whether a business will succeed or go belly-up; how large a return a stock fund will deliver. You have a set of predictors: a symptom checklist, a set of demographic features, a tally of past behavior, a transcript of undergraduate grades or test scores—anything that might be relevant to the prediction challenge. Now you show the data to an expert—a psychiatrist, a judge, an investment analyst, and so on—and at the same time feed them into a standard regression analysis to get the prediction equation. Who is the more accurate prognosticator, the expert or the equation?
The winner, almost every time, is the equation. In fact, an expert who is given the equation and allowed to use it to supplement his or her judgment often does worse than the equation alone. The reason is that experts are too quick to see extenuating circumstances that they think render the formula inapplicable. It’s sometimes called the broken-leg problem, from the idea that a human expert, but not an algorithm, has the sense to know that a guy who has just broken his leg will not go dancing that evening, even if a formula predicts that he does it every week. The problem is that the equation already takes into account the likelihood that extenuating circumstances will change the outcome and factors them into the mix with all the other influences, while the human expert is far too impressed with the eye-catching particulars and too quick to throw the base rates out the window. Indeed, some of the predictors that human experts rely on the most, such as face-to-face interviews, are revealed by regression analyses to be perfectly useless.
It’s not that humans can be taken out of the loop. A person still is indispensable in supplying predictors that require real comprehension, like understanding language and categorizing behavior. It’s just that a human is inept at combining them, whereas that is a regression algorithm’s stock in trade. As Meehl notes, at a supermarket checkout counter you wouldn’t say to the cashier, “It looks to me like the total is around $76; is that OK?” Yet that is what we do when we intuitively combine a set of probabilistic causes.
For all the power of a regression equation, the most humbling discovery about predicting human behavior is how unpredictable it is. It’s easy to say that behavior is caused by a combination of heredity and environment. Yet when we look at a predictor that has to be more powerful than the best regression equation—a person’s identical twin, who shares her genome, family, neighborhood, schooling, and culture—we see that the correlation between the two twins’ traits, while way higher than chance, is way lower than 1, typically around .6.36 That leaves a lot of human differences mysteriously unexplained: despite near-identical causes, the effects are nowhere near identical. One twin may be gay and the other straight, one schizophrenic and the other functioning normally. In the depression graph, we saw that the chance that a woman will suffer depression if she is hit with a stressful event and has a proven genetic disposition to depression is not 100 percent but only 14 percent.
A recent lollapalooza of a study reinforces the cussed unpredictability of the human species.37 One hundred sixty teams of researchers were given a massive dataset on thousands of fragile families, including their income, education, health records, and the results of multiple interviews and in-home assessments. The teams were challenged to predict the families’ outcomes, such as the children’s grades and the parents’ likelihood of being evicted, or employed, or signed up for job training. The competitors were allowed to sic whichever algorithm they wanted on the problem: regression, deep learning, or any other fad or fashion in artificial intelligence. The results? In the understated words of the paper abstract: “The best predictions were not very accurate.” Idiosyncratic traits of each family swamped the generic predictors, no matter how cleverly they were combined. It’s a reassurance to people who worry that artificial intelligence will soon predict our every move. But it’s also a chastening smackdown of our pretensions to fully understand the causal network in which we find ourselves.
And speaking of humility, we have come to the end of seven chapters intended to equip you with what I think are the most important tools of rationality. If I have succeeded, you’ll appreciate this final word from XKCD.