5    Practical Ways in Which Scientists Embrace the Scientific Attitude

For science to work, it has to depend on more than just the honesty of its individual practitioners. Though outright fraud is rare, there are many ways in which scientists can cheat, lie, fudge, make mistakes, or otherwise fall victim to the sorts of cognitive biases we all share that—if left unchallenged—could undermine scientific credibility.

Fortunately, there are protections against this, for science is not just an individual quest for knowledge but a group activity in which widely accepted community standards are used to evaluate scientific claims. Science is conducted in a public forum, and one of its most distinctive features is that there is an ideal set of rules, which is agreed on in advance, to root out error and bias. Thus the scientific attitude is instantiated not just in the hearts and minds of individual scientists, but in the community of scientists as a whole.

It is of course a cliché to say that science is different today than it was two hundred years ago—that scientists these days are much more likely to work in teams, to collaborate, and to seek out the opinion of their peers as they are formulating their theories. This in and of itself is taken by some to mark off a crucial difference with pseudoscience.1 There is a world of difference between seeking validation from those who already agree with you and “bouncing an idea” off a professional colleague who is expected to critique it.2 But beyond this, one also expects these days that any theory will receive scrutiny from the scientific community writ large—beyond one’s collaborators or colleagues—who will have a hand in evaluating and critiquing that theory before it can be shared more widely.

The practices of science—careful statistical method, peer review before publication, and data sharing and replication—are well known and will be discussed later in this chapter. First, however, it is important to identify and explore the types of errors that the scientific attitude is meant to guard against. At the individual level at least, problems can result from several possible sources: intentional errors, lazy or sloppy procedure, or unintentional errors that may result from unconscious cognitive bias.

Three Sources of Scientific Error

The first and most egregious type of error that can occur in science is intentional. One thinks here of the rare but troubling instances of scientific fraud. In chapter 7, I will examine a few of the most well-known recent cases of scientific fraud—Marc Hauser’s work on animal cognition and Andrew Wakefield’s work on vaccines—but for now it is important to point out that the term “fraud” is sometimes used to cover a multitude of sins. Presenting results that are unreproducible is not necessarily indicative of fraud, though it may be the first sign that something is amiss. Data fabrication or lying about evidence is more solidly in the wheelhouse of someone who is seeking to mislead. For all the transparency of scientific standards, however, it is sometimes difficult to tell whether any given example should be classified as fraud or merely sloppy procedure. Though there is no bright line between intentional deception and willful ignorance, the good news is that the standards of science are high enough that it is rare for a confabulated theory—whatever the source of error—to make it through to publication.3 The instantiation of the scientific attitude at the group level is a robust (though not perfect) check against intentional deception.

The second type of error that occurs in science is the result of sloppiness or laziness—though sometimes even this can be motivated by ideological or psychological factors, at either a conscious or an unconscious level. One wants one’s own theory to be true. The rewards of publication are great. Sometimes there is career pressure to favor a particular result or just to find something that is worth reporting. Again, this can cover a number of sins:

(1)  Cherry picking data (choosing the results that are most likely to make one’s case)

(2)  Curve fitting (manipulating a set of variables until they fit a desired curve)

(3)  Keeping an experiment open until the desired result is found

(4)  Excluding data that don’t fit (the flip side of cherry picking)

(5)  Using a small data set

(6)  P-hacking (sifting through a mountain of data until one finds some statistically significant correlation, whether it makes sense or not)4

Each of these techniques is a recognizable offense against good statistical method, but it would be rash to claim that all of them constitute fraud. This does not mean that criticism of these techniques should not be severe, yet one must always consider the question of what might have motivated a scientist in any particular instance. Willful ignorance seems worse than carelessness, but the difference between an intentional and unintentional error may be slim.

In his work on self-deception, evolutionary biologist Robert Trivers has investigated the sorts of excuses that scientists make—to themselves and others—for doing questionable work.5 In an intriguing follow-up to his book, Trivers dives into the question of why so many psychologists fail to share data, in contravention of the mandate of the APA-sponsored journals in which they publish.6 The mere fact that these scientists refuse to share their data is in and of itself shocking. The research that Trivers cites reports that 67 percent of the psychologists who were asked to share their data failed to do so.7 The telling next step came when it was hypothesized that failure to share data might correlate with a higher rate of statistical error in the published papers. When the results were analyzed, it was found not only that there was a higher error rate in papers whose authors had withheld their data sets—compared to papers whose authors had agreed to share them—but that 96 percent of the errors were in the scientists’ favor! (It is important to note here that no allegation of fraud was being made against the authors; indeed, without the original data sets how could one possibly check this? Instead the checks were for statistical errors in the reported studies themselves, where Wicherts et al. merely reran the numbers.)

Trivers reports that in another study investigators found further problems in psychological research, such as spurious correlations and running an experiment until a significant result was reached.8 Here the researchers’ attention focused on the issue of “degrees of freedom” in collecting and analyzing data, consonant with problems (3), (4), and (5) above. “Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?” As the researchers note, “[it] is rare, and sometimes impractical, for researchers to make all these decisions beforehand.”9 But this can lead to the “self-serving interpretation of ambiguity” in the evidence. Simmons et al. demonstrate just this by intentionally adjusting the “degrees of freedom” in two parallel studies to “prove” an obviously false result (that listening to a particular song can change the listener’s birth date).10
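
To make the danger concrete, consider a minimal simulation sketch in Python. This is my own illustration, not the procedure used by Simmons et al.: it assumes two groups drawn from the very same population (so the null hypothesis is true by construction) and exploits just one degree of freedom, namely peeking at the data every ten subjects and stopping as soon as p < .05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_null_study(peek_every=10, max_n=50, alpha=0.05):
    # Both groups come from the same population, so any "significant"
    # difference is a false positive by construction.
    group_a, group_b = [], []
    while len(group_a) < max_n:
        group_a.extend(rng.normal(0, 1, peek_every))
        group_b.extend(rng.normal(0, 1, peek_every))
        if stats.ttest_ind(group_a, group_b).pvalue < alpha:
            return True        # stop early and report the "finding"
    return False               # honest outcome: nothing there

runs = 5000
rate = sum(one_null_study() for _ in range(runs)) / runs
print(f"False-positive rate with optional stopping: {rate:.1%}")
# A single fixed-sample test would yield about 5 percent; peeking at the data
# and stopping when convenient roughly doubles that rate.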

I suppose the attitude one might take here is that one is shocked—shocked—to find gambling in Casablanca. Others might contend that psychology is not really a science. But, as Trivers points out, such findings at least invite the question of whether such dubious statistical and methodological techniques can be found in other sciences as well. If psychologists—who study human nature for a living—are not immune to the siren song of self-serving data interpretation, why should others be?

A third type of error in scientific reasoning occurs as a result of the kind of unintentional cognitive biases we all share as human beings, to which scientists are not immune. Preference for one’s own theory is perhaps the easiest to understand. But there are literally hundreds of others—representativeness bias, anchoring bias, heuristic bias, and the list goes on—that have been identified and explained in Daniel Kahneman’s brilliant book Thinking, Fast and Slow, and in the work of other researchers in the rapidly growing field of behavioral economics.11 We will explore the link between such cognitive bias and the paroxysm of science denialism in recent years in chapter 8. That such bias might also have a foothold among those who actually believe in science is, of course, troubling.

One might expect that these sorts of biases would be washed out by scientists who are professionally trained to spot errors in empirical reasoning. Add to this the fact that no scientist would wish to be publicly embarrassed by making an error that could be caught by others, and one imagines that the incentive would be great to be objective about one’s work and to test one’s theories before offering them for public scrutiny. The truth, however, is that it is sometimes difficult for individuals to recognize these sorts of shortcomings in their own reasoning. The cognitive pathways for bias are wired into all of us, PhD or not.

Perhaps the most virulent example of the destructive power of cognitive bias is confirmation bias. This occurs when we are biased in favor of finding evidence that confirms what we already believe and discounting evidence that does not. One would think that this kind of mistake would be unlikely for scientists to make, since it goes directly against the scientific attitude, where we are supposed to be searching for evidence that can force us to change our beliefs rather than ratify them. Yet again, no matter how much individual scientists might try to inoculate themselves against this type of error, it is sometimes not found until peer review or even post-publication.12 Thus we see that the scientific attitude is at its most powerful when embraced by the entire scientific community. The scientific attitude is not merely a matter of conscience or reasoning ability among individual scientists, but the cornerstone of those practices that make up science as an institution. As such, it is useful as a guard against all manner of errors, whatever their source.

Critical Communities and the Wisdom of Crowds

Ideally, the scientific attitude would be so well ensconced at the individual level that scientists could mitigate all possible sources of error in their own work. Surely some try to do this, and it is a credit to scientific research that it is one of the few areas of human endeavor where the practitioners are motivated to find and correct their own mistakes by comparing them to actual evidence. But it is too much to think that this happens in every instance. For one thing, this would work only if all errors were unintentional and the researcher was motivated (and able) to find them. But in cases of fraud or willful ignorance, we would be foolish to think that scientists can and will correct all of their own mistakes. For these, membership in a larger community is crucial.

Recent research in behavioral economics has shown that groups are often better than individuals at finding errors in reasoning. Most of these experiments have been done on the subject of unconscious cognitive bias, but it goes without saying that groups would also be better motivated than the people who had perpetrated them to find errors created by conscious bias. Some of the best experimental work on this subject has been done with logic puzzles. One of the most important results comes from the Wason selection task.13 In this experiment, subjects were shown four cards lying flat on a table and told that—although they could not touch the cards—each had a number on one side and a letter of the alphabet on the other. Suppose the cards read 4, E, 7, and K. Subjects were then given a rule such as “If there is a vowel on one side of the card, then there is an even number on the other.” Their task was to determine which (and only which) cards they would need to turn over in order to test the rule.

The experimenters reported that subjects found this task to be incredibly difficult. Indeed, just 10 percent got the right answer, which is that only the E and the 7 needed to be turned over. The E needs to be turned over because it might have an odd number on the other side, which would disprove the rule. The 7 also needs to be checked, because if there were a vowel on its other side, the rule would also be falsified. Subjects tended to be flummoxed by the fact that the 4 did not need to be turned over. For those who have studied logic, it might be obvious that the 4 is irrelevant to the truth of the rule, because it does not matter what is on the other side of the card. The rule only says that an even number would appear if there is a vowel on the other side of the card; it does not say that an even number can appear only if there is a vowel on the other side. Even if the other side of the card had a consonant, it would not falsify the rule for a 4 to appear on its obverse. Likewise, one need not check the card with the K. Again, the rule says what must follow if there is a vowel on one side of the card; it says nothing about what must be the case if there is not.
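
For readers who like to see the logic spelled out, here is a short illustrative sketch in Python (my own, not part of Wason’s experiment). It checks each visible card against every hidden face it could have and keeps only those cards capable of falsifying the rule.

VOWELS = set("AEIOU")

def is_vowel(face):
    return face in VOWELS

def is_even(face):
    return face.isdigit() and int(face) % 2 == 0

def violates_rule(letter_side, number_side):
    # The rule "vowel implies even number" fails only when a vowel
    # is paired with an odd number.
    return is_vowel(letter_side) and not is_even(number_side)

def must_turn(visible):
    if visible.isalpha():
        # The hidden side is a number; try a representative odd and even value.
        return any(violates_rule(visible, hidden) for hidden in ["3", "4"])
    # The hidden side is a letter; try a vowel and a consonant.
    return any(violates_rule(hidden, visible) for hidden in ["A", "K"])

print([card for card in ["4", "E", "7", "K"] if must_turn(card)])   # ['E', '7']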

The really exciting part of the experiment, however, is what happens when one asks subjects to solve this problem in a group setting. Here 80 percent get the right answer. Note that this is not because everyone else in the group merely defers to the “smartest person.” The experimenters checked for this and found that even in groups where none of the individual subjects could solve the task, the group often could. What this suggests is that when we are placed in a group, our reasoning skills improve. We scrutinize one another’s hypotheses and criticize one another’s logic. By mixing it up we weed out the wrong answers. In a group we are open to persuasion, but also to persuading others. As individuals, we may lack the motivation to criticize our own arguments. Once we have arrived at an answer that “seems right,” why would we bother to rethink it? In groups, however, there is much more scrutiny and, as a result, we are much more likely to arrive at the right answer. This is not because someone at the back of the room knows it, but because human reasoning skills seem to improve when we are placed within a group.14

In his book Infotopia: How Many Minds Produce Knowledge,15 Cass Sunstein explores the idea—based on empirical evidence such as that cited above—that there are myriad benefits to collective reasoning. This is sometimes referred to as the “wisdom of crowds” effect, but this populist moniker does not do justice to the case of science, where we correctly value expert opinion. As we shall see, scientific practice is nonetheless well described by Sunstein, who acknowledges the power of experts. And there are other effects as well. In his book, Sunstein explores three principal findings, each backed up by experimental evidence. On the whole, in human reasoning:

(1)  groups do better than individuals,

(2)  interactive groups do better than aggregative groups, and

(3)  experts do better than laypeople.16

As we saw with the Wason selection task, the findings in numbers (1) and (2) are on full display. Groups do better than individuals not only because someone in the room might have the right answer (which would presumably be recognized once it was arrived at) but because even in those situations in which no one knows the right answer, the group can interact and ask critical questions, so that they may discover something that none of them individually knew. In this case, the collective outcome can be greater than what would have been reached by any of its individual members (or a mere aggregation of group opinion).

There is of course some nuance to this argument, and one must state the limiting conditions, for there are some circumstances that undermine the general result. For instance, in groups where there is heavy pressure to defer to authority, or where there is strict hierarchy, groups can succumb to “groupthink.” In such cases, the benefits of group reasoning can be lost to the “cascade effect” of whoever speaks first, or to an “authority effect” whereby one subordinates one’s opinion to the highest-ranking member of the group. Indeed, in such cases the aggregative effect of groups can even be a drawback, where everyone expresses the same opinion—whether they believe it or not—and group reasoning converges on falsehood rather than truth. To remedy this, Sunstein proposes a set of principles to keep groups at their most productive. Among them:

(1)  groups should regard dissent as an obligation,

(2)  critical thinking should be prized, and

(3)  devil’s advocates should be encouraged.17

This sounds a good deal like science. Especially in a public setting, scientists are notably competitive with one another. Everyone wants to be right. There is little deference to authority. In fact, the highest praise is often reserved not for reaching consensus but for finding a flaw in someone else’s reasoning. In such a super-charged interactive environment, Sunstein’s three earlier findings may reasonably be combined into one where we would expect that groups of experts who interact with one another would be the most likely to find the right answer to a factual question. Again, this sounds a lot like science.

I think it is important, however, not to put so much emphasis on the group aspect of science as on the fact that science is an open and public process. While it is true that scientists often work in teams and participate in large conferences and meetings, even the lone scientist critiquing another’s paper in the privacy of her office is participating in a community process. It is also important to note that even though these days it is more common for scientists to collaborate on big projects like the Large Hadron Collider, this could hardly be the distinguishing feature of what makes science special, for if it were, what would we say about the value of those scientific theories put forth in the past by sole practitioners like Newton and Einstein? Working in groups is one way of exposing scientific ideas to public scrutiny. But there are others. Even when working alone, one knows that before a scientific theory can be accepted it still must be vetted by the larger scientific community. This is what makes science distinctive: the fact that its values are embraced by a wider community, not necessarily that the practice of science is done in groups.

It is the hallmark of science that individual ideas—even those put forth by experts—are subjected to the highest level of scrutiny by other experts, in order to discover and correct any error or bias, whether intentional or not. If there is a mistake in one’s procedure or reasoning, or variance with the evidence, one can expect that other scientists will be motivated to find it. The scientific attitude is embraced by the community as embodied in a set of practices. It is of course heartening to learn from Sunstein and others that experimental work in psychology has vindicated a method of inquiry that is so obviously consonant with the values and practices of science. But the importance of these ideas has long been recognized by philosophers of science as well.

In a brilliant paper entitled “The Rationality of Science versus the Rationality of Magic,”18 Tom Settle examines what it means to say that scientific belief is rational whereas belief in magic is not. The conclusion he comes to is that one need not disparage the individual rationality of members of those communities that believe in magic any more than one should attempt to explain the distinctiveness of science through the rationality of individual scientists. Instead, he finds the difference between science and magic in the “corporate critical-mindedness” of science, which is to say that there is a tradition of group criticism of individual ideas that is lacking in magic.19 As Hansson explains in his article “Science and Pseudo-Science,”

[According to Settle] it is the rationality and critical attitude built into institutions, rather than the personal intellectual traits of individuals, that distinguishes science from non-scientific practices such as magic. The individual practitioner of magic in a pre-literate society is not necessarily less rational than the individual scientists in modern Western society. What she lacks is an intellectual environment of collective rationality and mutual criticism.20

As Settle explains further:

I want to stress the institutional role of criticism in the scientific tradition. It is too strong a requirement that every individual within the scientific community should be a first rate critic, especially of his own ideas. In science, criticism may be predominantly a communal affair.21

Noretta Koertge too finds much to praise in the “critical communities” that are a part of science. In her article “Belief Buddies versus Critical Communities,” she writes:

I have argued that one characteristic differentiating typical science from typical pseudoscience is the presence of critical communities, institutions that foster communication and criticism through conferences, journals, and peer review. We have a romantic image of the lone scientist working in isolation and, after many years, producing a system that overturns previous misconceptions. We forget that even the most reclusive of scientists these days is surrounded by peer-reviewed journals; and if our would-be genius does make a seemingly brilliant discovery, it is not enough to call a news conference or promote it on the web. Rather, it must survive the scrutiny and proposed amendments of the relevant critical scientific community.22

Finally, in the work of Helen Longino, we see respect for the idea that, even if one views science as an irreducibly social enterprise—where the values of its individual practitioners cannot help but color their scientific work—it is the collective nature of scientific practice as a whole that helps to support its objectivity. In her pathbreaking book Science as Social Knowledge,23 Longino embraces a perspective that one may initially think of as hostile to the claim that scientific reasoning is privileged. Her overall theme seems consonant with the social constructivist response to Kuhn: that scientific inquiry, like all human endeavors, is value laden, and therefore cannot strictly speaking be objective; it is a myth that we choose our beliefs and theories based only on the evidence.

Indeed, in the preface to her book, Longino writes that her initial plan had been to write “a philosophical critique of the idea of value-free science.” She goes on, however, to explain that her account grew to be one that “reconciles the objectivity of science with its social and cultural construction.” One understands that her defense will not draw on the standard empiricist distinction between “facts and values.” Nor is Longino willing to situate the defense of science in the logic of its method. Instead, she argues for two important shifts in our thinking about science: (1) that we should embrace the idea of seeing science as a practice, and (2) that we should recognize science as practiced not primarily by individuals but by social groups.24 But once we have made these shifts a remarkable insight results, for she is able to conclude from this that “the objectivity of scientific inquiry is a consequence of this inquiry’s being a social, and not an individual, enterprise.”

How does this occur? Primarily through recognition that the diversity of interests held by each of the individual scientific practitioners may grow into a system of checks and balances whereby peer review before publication, and other scrutiny of individual ideas, acts as a tonic to individual bias. Thus scientific knowledge becomes a “public possession.”

What is called scientific knowledge, then, is produced by a community (ultimately the community of all scientific practitioners) and transcends the contributions of any individual or even of any subcommunity within the larger community. Once propositions, theses, and hypotheses are developed, what will become scientific knowledge is produced collectively through the clashing and meshing of a variety of points of view.

Objectivity, then, is a characteristic of a community’s practice of science rather than of an individual’s, and the practice of science is understood in a much broader sense than most discussions of the logic of scientific method suggest.25

She concludes, “values are not incompatible with objectivity, but objectivity is analyzed as a function of community practices rather than as an attitude of individual researchers toward their material.”26 One can imagine no better framing of the idea that the scientific attitude must be embraced not just by individual researchers but by the entire scientific community.

It should be clear by now how the conclusions of Settle, Koertge, and Longino tie in neatly with Sunstein’s work. It is not just the honesty or “good faith” of the individual scientist, but fidelity to the scientific attitude as community practice that makes science special as an institution. No matter the biases, beliefs, or petty agendas that may be put forward by individual scientists, science is more objective than the sum of its individual practitioners. As philosopher Kevin deLaplante states, “Science as a social institution is distinctive in its commitment to reducing biases that lead to error.”27 But how precisely does science do this? It is time now to take a closer look at some of the institutional techniques that scientists have developed to keep one another honest.

Methods of Implementing the Scientific Attitude to Mitigate Error

It would be easy to get hung up on the question of the sources of scientific error. When one finds a flaw in scientific work—whether it was intentional or otherwise—it is natural to want to assign blame and examine motives. Why are some studies irreproducible? Surely not all of them can be linked to deception and fraud. As we have seen, there also exists the possibility of unconscious cognitive bias that is a danger to science. It is wrong to think of every case in which a study is flawed as one of corruption.28 Still, the problem has to be fixed; the error needs to be caught. The important point at present is not where the error comes from but that science has a way to deal with it.

We have already seen that groups are better than individuals at finding scientific errors. Even when motivated to do so, an individual cannot normally compete with the “wisdom of crowds” effect among experts. Fortunately, science is set up to bring precisely this sort of group scrutiny to scientific hypotheses. Science has institutionalized a plethora of techniques to do this. Three of them are quantitative methods, peer review, and data sharing and replication.

Quantitative Methods

There are entire books on good scientific technique. It is important to point out that these exist not just for quantitative research, but for qualitative investigation as well.29 There are some time-honored tropes in statistical reasoning that one trusts any scientist would know. Some of these—such as that there is a difference between causation and correlation—are drilled into students in Statistics 101. Others—for instance, that one should use a different data set for creating a hypothesis than for testing it—are a little more subtle. With all of the rigor involved in this kind of reasoning, there is little excuse for scientists to make a quantitative error. And yet they do. As we have seen, the tendency to withhold data sets is associated with mathematical errors in published work. Given that everyone is searching for the holy grail of statistical significance at the conventional 95 percent confidence level, there is also some fudging.30 But publicity over any malfeasance—whether it is due to fraud or mere sloppiness—is one of the ways that the scientific attitude is demonstrated. Scientists check one another’s numbers. They do not wait to find an error; they go out and look for one. If errors are missed at peer review, some are found within hours of publication. I suppose one could take it as a black mark against science that any quantitative or analytical errors ever occur. But it is better, I think, to count it as a virtue of science that it has a culture of finding such errors—where one does not just trust in authority and assume that everything is as stated. Yet there are some methodological errors that are so endemic—and so insidious—that they are just beginning to come to light.31
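
To illustrate the more subtle rule mentioned above, here is a brief hypothetical sketch in Python of the discipline of generating a hypothesis on one half of the data and testing it only on the untouched half; the data and variable names are invented for the purpose.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(size=(200, 20))      # 200 hypothetical subjects, 20 noise variables
outcome = rng.normal(size=200)         # an outcome unrelated to any of them

explore, confirm = slice(0, 100), slice(100, 200)   # hold out half the subjects

# Exploratory half: pick whichever variable happens to look most "promising."
best = min(range(20),
           key=lambda j: stats.pearsonr(data[explore, j], outcome[explore])[1])

# Confirmatory half: test only that pre-chosen variable on untouched data.
r, p = stats.pearsonr(data[confirm, best], outcome[confirm])
print(f"variable {best}: confirmatory r = {r:.2f}, p = {p:.2f}")
# On pure noise the confirmatory test is usually unimpressive, which is the point.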

Statistics provides many different tests one can run for evidential relationships. Of course, none of these will ever amount to proof of causation, but correlation is the coin of the realm in statistics, for where we find correlation we can sometimes infer causation. One of the most popular calculations is to determine the p-value of a result, which is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true (which is to say, if there were no real association between the variables). The null hypothesis is the working assumption against which statistical significance is measured. It is the “devil’s advocate” hypothesis that no actual relationship exists between the two variables, so that any correlation found in the sample is the result of random chance. To claim statistical significance, one must therefore find strong enough statistical evidence to reject the null hypothesis—one must show that the correlation found is greater than what one would plausibly expect from chance alone. By convention, scientists have chosen 0.05 as the cutoff for statistical significance: if data this extreme would occur less than 5 percent of the time under the null hypothesis, the result is taken to suggest a real-world correlation.32 A small p-value thus indicates that the observed correlation would be very surprising if the null hypothesis were true, which gives grounds for rejecting it. This is what scientists are looking for. A large p-value is just the opposite: it indicates weak evidence against the null hypothesis, which survives the test. A p-value under 0.05 is therefore highly sought after, as it is customarily the threshold for scientific publication.
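
One simple way to see what a p-value does and does not say is a permutation sketch like the following (my own toy example, not a prescribed procedure). It generates the null hypothesis directly by shuffling one variable and then asks how often chance alone produces a correlation at least as large as the one observed.

import numpy as np

rng = np.random.default_rng(42)
n = 30
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)              # toy "observed" data
observed = abs(np.corrcoef(x, y)[0, 1])

# Simulate the null hypothesis directly: shuffling y destroys any real association.
null_draws = [abs(np.corrcoef(x, rng.permutation(y))[0, 1]) for _ in range(10_000)]

p_value = np.mean([d >= observed for d in null_draws])
print(f"observed |r| = {observed:.2f}, permutation p-value = {p_value:.4f}")
# A small p-value says only that data this extreme would be rare if there were
# no real association; it is not the probability that the null hypothesis is true.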

P-hacking (also known as data dredging) occurs when researchers gather a large amount of data, then sift through it looking for anything that might be positively correlated.33 Since there is always a nonzero probability that two things will appear correlated by chance, if one’s data set is large enough, and one has enough computing power, one will almost certainly be able to find some positive correlation that meets the 0.05 threshold, whether it reflects a real-world connection or not. This problem is exacerbated by the degrees of freedom we talked about earlier, whereby researchers make decisions about when to stop collecting data, which data to exclude, whether to hold a study open to seek more data, and so on. If one looks at the results midway through a study to decide whether to continue collecting data, that is p-hacking.34 Self-serving exploitation of such degrees of freedom can be used to manipulate almost any data set into some sort of positive correlation. And these days it is much easier than it used to be.35 Now one merely has to run a program and the significant results pop out. This in and of itself raises the likelihood that some discovered results will be spurious. We don’t even need to have a prior hypothesis that two things are related anymore. All we need are raw data and a fast computer. An additional problem is created when some researchers choose to selectively report their results, excluding all of the experiments that fall below statistical significance. If one treats such decisions about what to report and what to leave out as one’s final degree of freedom, trouble can arise. As Simmons et al. put it in their original paper: “it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis.”36
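
A toy example makes the point. The sketch below (again my own illustration) generates pure noise for one hundred hypothetical subjects and forty unrelated variables, then dredges every pairwise correlation; by chance alone, dozens of pairs will clear the p < .05 bar.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_vars = 100, 40
noise = rng.normal(size=(n_subjects, n_vars))   # nothing here is related to anything

hits = []
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(noise[:, i], noise[:, j])
        if p < 0.05:
            hits.append((i, j, round(r, 2), round(p, 3)))

n_pairs = n_vars * (n_vars - 1) // 2
print(f"{len(hits)} 'significant' correlations out of {n_pairs} tested pairs")
# Roughly 5 percent of the 780 pairs -- dozens of publishable-looking "findings" --
# will clear the threshold even though every variable is pure noise.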

Examples surround us. Irreproducible studies on morning TV shows that report correlations between eating grapefruit and having an innie belly button, or between coffee and cancer, rightly make us suspicious. The definitive example, however, was given in Simmons’s original paper, where researchers were able—by judicious manipulation of their degrees of freedom—to prove that listening to the Beatles song “When I’m 64” had an effect on the listener’s age. Note that researchers did not show that listening to the song made one feel younger; they showed that it actually made one younger.37 Demonstrating something that is causally impossible is the ultimate indictment of one’s analysis. Yet p-hacking is not normally considered to be fraud.

In most cases researchers could be honestly making necessary choices about data collection and analysis, and they could really believe they are making the correct choices, or at least reasonable choices. But their bias will influence those choices in ways that researchers may not be aware of. Further, researchers may simply be using the techniques that “work”—meaning they give the results the researcher wants.38

Add to this competition for research funding and the “publish or perish” pressure to win tenure and promotion at most colleges and universities, and one has created an ideal motivational environment to “make the correct choices” about one’s work. As Uri Simonsohn, one of the coauthors of the original study, put it: “everyone has p-hacked a little.”39 Another commentator on this debate, however, sounded a more ominous warning in the title of his paper: “Why Most Published Research Findings Are False.”40

But is all of this really so bad? Of course scientists only want to report positive results. Who would want to read through all of those failed hypotheses or see all the data about things that didn’t work? Yet this kind of selective reporting—some might call it “burying”—is precisely what some pseudoscientists do to make themselves appear more scientific. In his excellent book Understanding Scientific Reasoning, Ron Giere outlines one such example: Jeane Dixon and other “futurists” often brag of being able to predict the future, which they do by making thousands of predictions about the coming year, then reporting only the ones that came true at year’s end.41

To many, all of this will seem to be a bastardization of the scientific attitude because, although one is looking at the evidence, it is being used in an inappropriate way. With no prior hypothesis in mind to test, evidence is gathered merely for the sake of “data” to be mined. If one is simply on a treasure hunt for “p” (for something that looks significant even though it probably isn’t), where is the warrant for one’s beliefs? One can’t help but think here of Popper and his admonition about positive instances. If one is looking for positive instances, they are easy to find! Furthermore, without a hypothesis to test, one has done further damage to the spirit of scientific investigation. We are no longer building a theory—and possibly changing it depending on what the evidence says—but only reporting correlations, even if we don’t know why they occur or may even suspect that they are spurious.

Clearly, p-hacking is a challenge to the thesis that scientists embrace the scientific attitude. But unless one is willing to discard a large proportion of unreproducible studies as unscientific, what other choice do we have? Yet remember my claim that what makes science special is not that it never makes mistakes, but that it responds to them in a rigorous way. And here the scientific community’s response to the p-hacking crisis strongly supports our confidence in the scientific attitude.

First, it is important to understand the scope of the p-value, and the job that science has asked of it. It was never intended to be the sole measure of significance.

[When] Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. For all the P value’s apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. “The P value was never meant to be used the way it’s used today.”42

Another problem with the p-value is that it is commonly misunderstood, even by the people who routinely use it:

The p-value is easily misinterpreted. For example, it is often equated with the strength of a relationship, but a tiny effect size can have very low p-values with a large enough sample size. Similarly, a low p-value does not mean that a finding is of major interest.43

In other words, p-values cannot measure the size of an effect; they measure only how unlikely the observed data would be if there were no real effect. Other misconceptions surround how to interpret the probabilities expressed by a p-value. Does a study with a 0.01 p-value mean that there is just a 1 percent chance of the result being spurious?44 Actually, no. This cannot be determined without prior knowledge of the likelihood of the effect. In fact, according to one calculation, “a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect.”45
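
For the curious, here is one standard way of arriving at a figure of this kind, using the so-called minimum Bayes factor bound together with an assumed 50:50 prior chance that a true effect exists. I offer the Python sketch below as an illustration of the reasoning, not as the exact computation behind the number quoted above.

from math import e, log

def minimum_false_alarm_probability(p, prior_chance_of_true_effect=0.5):
    # Lower bound on the Bayes factor in favor of the null hypothesis
    # (valid for p < 1/e), combined with the assumed prior odds.
    bayes_factor_null = -e * p * log(p)
    prior_odds_null = (1 - prior_chance_of_true_effect) / prior_chance_of_true_effect
    posterior_odds_null = bayes_factor_null * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)

print(f"p = 0.01 -> false-alarm probability of at least {minimum_false_alarm_probability(0.01):.0%}")
print(f"p = 0.05 -> false-alarm probability of at least {minimum_false_alarm_probability(0.05):.0%}")
# Prints roughly 11% and 29% respectively under the stated assumptions.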

All this may lead some to wonder whether the p-value is as important as some have said. Should it be the gold standard of statistical significance and publication? Now that the problems with p-hacking are beginning to get more publicity, some journals have stopped asking for it.46 Other critics have proposed various statistical tests to detect p-hacking and shine a light on it. Megan Head et al. have proposed a method of text mining to measure the extent of p-hacking. The basic idea here is that if someone is p-hacking, then the p-values in their studies will cluster just below the 0.05 level, which produces a telltale shape when the reported p-values are plotted as a distribution (a “p-curve”).47
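
In that spirit, a simplified version of such a check might look like the sketch below (my own simplification, not Head et al.’s code). Among the significant p-values harvested from a literature, it counts how many fall just under .05 versus how many fall slightly further below, and asks whether the excess near the threshold is greater than chance would allow.

from scipy import stats

def evidence_of_p_hacking(p_values, lower=0.04, middle=0.045, upper=0.05):
    # With genuine effects, significant p-values pile up toward zero, so the
    # bin just under .05 should not be the fuller one.
    low_bin = sum(lower < p <= middle for p in p_values)
    high_bin = sum(middle < p < upper for p in p_values)
    test = stats.binomtest(high_bin, low_bin + high_bin, 0.5, alternative="greater")
    return low_bin, high_bin, test.pvalue

# Hypothetical p-values harvested from a literature, bunched just under .05:
reported = [0.012, 0.031, 0.041, 0.044, 0.046, 0.047, 0.048, 0.048, 0.049, 0.049]
print(evidence_of_p_hacking(reported))
# In practice such a test is run over thousands of mined p-values, not ten.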

The next step might be to change the reporting requirements in journals, so that authors are required to compute their own p-curves, which would enable other scientists to tell at a glance whether the results had been p-hacked. Other ideas have included mandatory disclosure of all degrees of freedom taken in producing the result of a paper, along with the size of the effect and any information about prior probabilities.48 Some of this is controversial because researchers cannot agree on the usefulness of what is reported.49 And, as Head points out:

Many researchers have advocated abolishing NHST [Null Hypothesis Significance Testing]. However, others note that many of the problems with publication bias reoccur with other approaches, such as reporting effect sizes and their confidence intervals or Bayesian credible intervals. Publication biases are not a problem with p-values per se. They simply reflect the incentives to report strong (i.e. significant) facts.50

So perhaps the solution is full disclosure and transparency? Maybe researchers should be able to use p-values, but the scientific community that assesses their work—peer reviewers and journal editors—should be on the lookout. In their paper, Simmons et al. propose several guidelines:

Requirements for authors:

  1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
  2. Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification.
  3. Authors must list all variables collected in a study.
  4. Authors must report all experimental conditions, including failed manipulations.
  5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
  6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

Guidelines for reviewers:

  1. Reviewers should ensure that authors follow the requirements.
  2. Reviewers should be more tolerant of imperfection in results.
  3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
  4. If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.51

And then, in a remarkable demonstration of the efficacy of these rules, Simmons et al. went back to their bogus hypothesis—about the Beatles song that “changed” the listener’s age—and redid it according to the above guidelines and the effect disappeared. Can one imagine any pseudoscience taking such a rigorous approach to ferreting out error?

The amazing thing about p-hacking is that even though it reflects a current crisis in science, it still makes a case for the value of the scientific attitude. It would have been easy for me to choose some less controversial or embarrassing example to demonstrate that the community of scientists has embraced the scientific attitude through its insistence on better quantitative methods. But I didn’t. I went for the worst example I could find. Yet it still came out that scientists were paying close attention to the problem and trying to fix it. Even though one of the procedures of science is flawed, the response from the scientific community has been pitch-perfect: We’ve got this. We may not have it solved yet, and we may make more mistakes, but we are investigating the problem and trying to correct it. Although it may be in an individual scientist’s interests to be sloppy or lazy (or worse) and rely on p-hacking, this is an embarrassment to the scientific community at large. This is not who we are and we will correct this.

At the end of their paper, Simmons and his coauthors say this:

Our goal as scientists is not to publish as many articles as we can, but to discover and disseminate truth. Many of us—and this includes the three authors of this article—often lose sight of this goal, yielding to the pressure to do whatever is justifiable to compile a set of studies that we can publish. This is not driven by a willingness to deceive but by the self-serving interpretation of ambiguity, which enables us to convince ourselves that whichever decisions produced the most publishable outcome must have also been the most appropriate. This article advocates a set of disclosure requirements that imposes minimal costs on authors, readers, and reviewers. These solutions will not rid researchers of publication pressures, but they will limit what authors are able to justify as acceptable to others and to themselves. We should embrace these disclosure requirements as if the credibility of our profession depended on them. Because it does.52

This warning is not lost on others. At the end of his own article “P-Hacking and Other Statistical Sins,” Steven Novella lists a set of general principles for scientific conduct, then brings up an odious comparison that should make all philosophers of science sit up and take notice:

Homeopathy, acupuncture, and ESP research are all plagued by deficiencies. They have not produced research results anywhere near the threshold of acceptance. Their studies reek of P-hacking, generally have tiny effect sizes, and there is no consistent pattern of replication. But there is no clean dichotomy between science and pseudoscience. Yes, there are those claims that are far toward the pseudoscience end of the spectrum. All of these problems, however, plague mainstream science as well. The problems are fairly clear, as are the necessary fixes. All that is needed is widespread understanding and the will to change entrenched culture.53

One can imagine no better testament—or call to arms—for the scientific attitude than that it aspires to create such a culture.54

Peer Review

We have already seen some of the advantages of peer review in the previous section. Where individuals may be tempted to take shortcuts in their work, they will be brought up short by representatives of the scientific community: the expert readers who agree to provide an opinion and comments on an anonymous manuscript, receiving for compensation only the knowledge that they are helping the profession and that their identity will be shielded.55 (There is also the advantage of getting to see new work first and having the chance to influence it through one’s comments.) But that is it. It often comes as a shock to outsiders that most scientific review work is done for free, or at most for minimal compensation such as a free journal subscription or book.

The process of peer review is de rigueur at most scientific journals, where editors would shudder to share work that had not been properly vetted. In most instances, by the way, this means that every journal article is reviewed by not one but two experts, to provide an extra level of scrutiny. A check on a check. To the consternation of most authors, publication normally requires that both referees give the nod. And if any changes are proposed, it is common practice to send the paper back to the original reviewers once changes are made. Inferior work does sometimes slip through, but the point is that this system is one in which the scientific attitude is demonstrated through the principles of fair play and objectivity. It is not perfect, but it is at least a mechanism that allows the scientific community as a whole to have a hand in influencing individual research results before they are shared with the entire community.

And there is also another level of scrutiny, for if a mistake is made there is always retraction. If an error is not caught until after a journal article is published, the mistake can be flagged or the whole article can be taken back, with due publicity for future readers. Indeed, in 2010 two researchers, Ivan Oransky and Adam Marcus, founded a website called RetractionWatch.com where one can find an up-to-date list of scientific papers that have been retracted. The website was created because of concerns that retracted papers were not being given enough publicity, which might lead other scientists mistakenly to build upon them. Again, one can bemoan the fact that approximately six hundred scientific papers per year are retracted, or one can applaud the scientific community for making sure that the problem does not become worse. Publicizing retractions might even give researchers an incentive not to end up on such a public “wall of shame.” On their blog, Oransky and Marcus argue that Retraction Watch contributes in part to the self-correcting nature of science.

Although many see retraction as an embarrassment (and it surely is for the parties involved), it is yet another example of the scientific attitude in action. It is not that mistakes never occur in science, but when they do there are well-publicized community standards by which these errors can be rectified. But it is important to note, although it is sometimes lost on the larger nonscientific community, that article retraction does not necessarily indicate fraud. It is enough merely for there to be an error large enough to undermine the original finding. (In some cases this is discovered when other scientists try to reproduce the work, which will be discussed in the next section.) The example of p-hacking alone surely indicates that there is enough to keep the watchdogs busy. But fraud is the ultimate betrayal of the scientific attitude (and as such it will be discussed in its own chapter). For now, it is enough to note that article retraction can occur for many reasons, all of which are a credit to the vigilance of the scientific community.

Peer review is thus an essential part of the practice of science. There is only so far that individual scientists can go in catching their own errors. Maybe they will miss some. Maybe they want to miss some. Having someone else review work before publication is the best possible way to catch errors and keep them from infecting the work of others. Peer review keeps one honest, even if one was honest in the first place. Remember that groups outperform individuals. And groups of experts outperform individual experts. Even before these principles were experimentally proven, they were incorporated into the standard of peer review. What happens when an idea or theory does not go through peer review? Maybe nothing. Then again, on the theme that perhaps we can learn most about science by looking at its failures, I propose that we here consider the details of an example I introduced in chapter 3, which demonstrates the pitfalls of forgoing community scrutiny before publication: cold fusion.

To those who know only the “headline” version of cold fusion, it may be tempting to dismiss the whole episode as one of fraud or at least bad intent. With the previously noted book titles Bad Science and Cold Fusion: The Scientific Fiasco of the Century it is easy to understand why people would want to cast aspersions on the two original researchers, then try to put the whole thing on a shelf and claim it could never happen again. What is fascinating, however, is the extent to which this episode illuminates the process of science. Whatever the ultimate conclusion after it was over—and there was plenty of armchair judgment after upwards of $50 million was spent to refute the original results and a governmental panel was appointed to investigate56—the entire cold fusion fiasco provides a cautionary tale for what can happen when the customary practices of science are circumvented.

The problems in this case are so thick that one might use the example of cold fusion to illustrate any number of issues about science: weakness in analytical methods, the importance of data sharing and reproducibility, the challenges of subjectivity and cognitive bias, the role of empirical evidence, and the dangers of interference by politics and the media. Indeed, one of the books written on cold fusion has a final chapter that details the numerous lessons we can learn about the scientific process based on the failures demonstrated by cold fusion.57 Here I will focus on only one of these: peer review.

As noted, on March 23, 1989, two chemists, B. Stanley Pons and Martin Fleischmann, held a press conference at the University of Utah to announce a startling scientific discovery: they had experimental evidence for the possibility of conducting a room temperature nuclear fusion reaction, using materials that were available in most freshman chemistry labs. The announcement was shocking. If true, it raised the possibility that commercial adaptations could be found to produce a virtually unlimited supply of energy. Political leaders and journalists were mesmerized, and the announcement got front-page coverage in several media outlets, including the Wall Street Journal. But the press conference was shocking for another reason, which was that Pons and Fleischmann had as yet produced no scientific paper detailing their experiment, thus forestalling the possibility of other scientists checking their work.

To say that this is unusual in science is a vast understatement. Peer review is a bedrock principle of science, because it is the best possible way of catching any errors—and asking all of the appropriate critical questions—before a finding is released. One commentator in this debate, John Huizenga, acknowledges that “the rush to publish goes back to the seventeenth century,” when the Royal Society of London began to give priority to the first to publish, rather than the first to make a discovery.58 The competitive pressures in science are great, and priority disputes for important findings are common. Thus, “in the rush to publish, scientists eliminate important experimental checks of their work and editors compromise the peer-review process, leading to incomplete or even incorrect results.”59 And this seems to be what happened with cold fusion.

At the time of Pons and Fleischmann’s press conference, they were racing to beat another researcher from a nearby university, who they claimed had pirated their discovery when he read about it during review of their grant application. That researcher, Steven Jones of Brigham Young University, counterclaimed that he had been working on fusion research long before this, and that Pons and Fleischmann’s grant proposal was merely an “impetus” to get back to work and brook no further delay in publishing his own results. Once the administrations at their respective universities (BYU and Utah) became involved, and the subject turned to patents and glory for the institution, things devolved and few behaved admirably. As noted, scientists are human beings and feel the same competitive pressures that any of us would. But Pons and Fleischmann would live to regret their haste.

As it turned out, their experimental evidence was weak. They claimed to have turned up an enormous amount of excess heat by conducting an electrochemical experiment involving palladium, platinum, a mixture of heavy water and lithium, and some electric current. But, if this was a nuclear reaction and not a chemical one, they had a problem: where was the radiation? In the fusion of two deuterium nuclei, it is expected that not only heat but also neutrons (and gamma rays) will be given off. Yet Pons and Fleischmann repeatedly failed to find any neutrons (or found so few as to suggest that there must be some sort of error) with the detector they had borrowed from the physics department. Could there be some other explanation?

One would think that it would be a fairly straightforward process to sort all of this out, but in the earliest days after the press conference, Pons and Fleischmann refused to share their data with other experimenters, who were reduced to “[getting their] experimental details from The Wall Street Journal and other news publications.”60 Needless to say, that is not how science is supposed to work. Replications are not supposed to be based on guesswork about experimental technique or phone calls begging for more information. Nonetheless, in what Gary Taubes calls a “collective derangement of minds,” a number of people in the scientific community got caught up in the hype, and several partial “confirmations” began to pop up around the world.61 Several research groups reported that they too had found excess heat in the palladium experiment (but still no radiation). At the Georgia Tech Research Institute, researchers claimed to have found an explanation for the missing neutrons in the boron (a neutron absorber) contained in the Pyrex glassware that Pons and Fleischmann had used; when they reran the experiment with boron-free glassware they detected some radiation, which disappeared when the detector was screened behind a boron shield. Another group claimed to have a theoretical explanation for the lack of radiation, with helium and heat as the only products of Pons and Fleischmann’s electrolysis reaction.62

Of course, there were also critics. Some nuclear physicists in particular were deeply skeptical of how a process like fusion—which occurs at enormous temperature and pressure in the center of the Sun—could happen at room temperature. Many dismissed these criticisms, however, as representative of the “old” way of thinking about fusion. A psychologist would have a field day with all of the tribalism, bandwagon jumping, and “emperor has no clothes” responses that happened next. Indeed, some have argued that a good deal of the positive response to cold fusion can be explained by Solomon Asch’s experimental work on conformity, in which he found that a nontrivial number of subjects would deny self-evident facts if that was necessary to put them in agreement with their reference group.63 It is important to point out, however, that although all of these psychological effects may occur in science, they are supposed to be washed out by the sort of critical scrutiny that is performed by the community of scientists. So where was this being done? While it is easy to cast aspersions on how slowly things developed, it is important to note that the critical research was being done elsewhere, as other researchers worked under the handicap of inadequate data and too many unanswered questions. And, as some details about Pons and Fleischmann’s work began to leak out, the problems for cold fusion began to mount.

For one, where was the radiation? As noted, some thought they could explain its absence, yet most remained skeptical. Another problem arose as some questioned whether the excess heat could be explained as a mere artifact of some unknown chemical process. On April 10, 1989, a rushed and woefully incomplete paper by Pons and Fleischmann appeared in the Journal of Electroanalytical Chemistry. It was riddled with errors. Why weren’t these caught at peer review? Because the paper was not peer reviewed!64 It was evaluated only by the editor of the journal (who was a friend of Stanley Pons) and rushed into print, because of the “great interest” of the scientific community.65 At numerous scientific conferences to follow, Pons was reluctant to answer detailed questions from colleagues about problems with his work and refused to share raw data. In his defense, one might point out that by this point numerous partial “confirmations” had been coming in and others were defending his work too, claiming that they had similar results. Even though Pons was well aware of some of the shortcomings of his work, he might still have believed in it.

And yet, as we saw earlier in this chapter, there is a high correlation between failure to share one’s data and analytical error.66 It is unknown whether Pons deliberately tried to keep others from discovering something that he already knew to be a weakness or was merely caught up in the hype surrounding his own theory. Although some have called this fraud—and it may well be—one need not go this far to find an indictment of his behavior. Scientists are expected to participate—or at least cooperate—in the criticism of their own theory, and woe to those who stonewall. Pons is guilty at least of obstructing the work of others, which is an assault against the spirit of the scientific attitude, where we should be willing to compare our ideas against the data, whether that is provided by oneself or others.67

The process did not take long to unravel from there. A longer, more carefully written version of Pons and Fleischmann’s experiment was submitted to the journal Nature. While waiting for the paper to be peer reviewed, the journal editor John Maddox published an editorial in which he said:

It is the rare piece of research indeed that both flies in the face of accepted wisdom and is so compellingly correct that its significance is instantly recognized. Authors with unique or bizarre approaches to problems may feel that they have no true peers who can evaluate their work, and feel more conventional reviewers’ mouths are already shaping the word “no” before they give the paper much attention. But it must also be said that most unbelievable claims turn out to be just that, and reviewers can be forgiven for perceiving this quickly.68

A few weeks later, on April 20, Maddox announced that he would not be publishing Pons and Fleischmann’s paper because all three of its peer reviewers had found serious problems and criticisms of their work. Most serious of these was that Pons and Fleischmann had apparently not done any controls! This in fact was not true. In addition to doing the fusion experiment with heavy water, Pons and Fleischmann had also done an early version of the experiment with light water, and had gotten a similar result.69 Pons kept this fact quiet, however, sharing it only with Chuck Martin of Texas A&M, who had gotten the same result. When they talked, Pons told him that he was now at the “most exciting” part of the research, but he couldn’t discuss it further because of national defense issues.70 Similarly, other researchers who had “confirmed” the original result were finding that it also worked with carbon, tungsten, and gold.71 But if the controls all “worked,” didn’t this mean that the original result was suspicious? Or could it mean that Pons and Fleischmann’s original finding was even broader—and more exciting—than they had thought?

It didn’t take long for the roof to collapse. On May 1—at a meeting of the American Physical Society in Baltimore—Nate Lewis (an electrochemist from Caltech) gave a blistering talk in which he all but accused Pons and Fleischmann of fraud for their shoddy experimental results; he received a standing ovation from the two thousand physicists in attendance. On May 18, 1989 (less than two months after the original Utah press conference), Richard Petrasso et al. (from MIT) published a paper in Nature showing that Pons and Fleischmann’s original results were experimental artifacts and not due to any nuclear fusion reaction. It seems that the full gamma ray spectrum graph that Pons had kept so quiet (even though it purported to be their main piece of evidence) had been misinterpreted. From there the story builds momentum with even more criticism of the original result, criticism of the “confirmation” of their findings, and a US governmental panel that was assigned to investigate. Eventually Pons disappeared and later resigned.

Peer review is one of the most important ways that the scientific community at large exercises oversight of the mistakes and sloppiness of individuals who may be motivated—either consciously or unconsciously—by the sorts of pressures that could cause them to cut corners in their work. Even when it is handicapped or stonewalled, this critical scrutiny is ultimately unstoppable. Science is a public venture and we must be prepared to present our evidence or face the consequences. In any case, the facts will eventually come out. Some may, of course, bemoan the fact that the whole cold fusion debacle happened at all and see it as an indictment of the process of science. Many critics of science will surely be prepared to do so, with perhaps nothing better than pseudoscience to offer in its place. But I think it is important to remind ourselves that—as we saw with p-hacking—the distinctiveness of science does not lie in the claim that it is perfect. Indeed, error and its discovery are an important part of what moves science forward. More than one commentator in this debate has pointed out that science is self-correcting. When an error occurs, it does not customarily fester. If Pons and Fleischmann had been prepared to share their data and embrace the scientific attitude, the whole episode might have been over within a few days, if it ever got started at all. Yet even when egos, money, and other pressures were at stake, the competitive and critical nature of scientific investigation all but guaranteed that any error would eventually be discovered. So again, rather than merely blaming science for the fact that an error happened, perhaps we should celebrate the fact that one of the largest scientific errors of the twentieth century was caught and run to ground less than two months after it was announced, based solely on lack of fit with the empirical evidence. For all the extraneous factors that complicated matters and slowed things down, this was the scientific attitude at its finest. As John Huizenga puts it:

The whole cold fusion fiasco serves to illustrate how the scientific process works. Scientists are real people and errors and mistakes do occur in science. These are usually detected either in early discussions of one’s research with colleagues or in the peer-review process. If mistakes escape notice prior to publication, the published work will come under close scrutiny by other scientists, especially if it disagrees with an established body of data. The scientific process is self-corrective.72

Data Sharing and Replication

As we have just seen, failure to share data is bad. When we sign on as a scientist, we are implicitly agreeing to a level of intellectual honesty whereby we will cooperate with others who are seeking to check our results for accuracy. In some cases (such as the journals sponsored by the American Psychological Association), the publishing agreement actually requires one to sign a pledge to share data with other researchers who request it. Beyond peer review, this introduces a further standard: the expectation that scientific findings should be replicable.

As Trivers indicates in his review of Wicherts’s findings, this does not mean that data sharing always occurs. As human beings, we sometimes shirk what is expected of us. At the level of the community, however, the expectation stands.73 The scientific attitude requires us to respect the refutatory power of evidence; part of this is being willing to cooperate with others who seek to scrutinize or even refute our hard-won theories. This is why it is so important for this standard to be enforced by the scientific community (and to be publicized when it is not).

It should be noted, though, that there is a distinction between refusing to share data and doing irreproducible work. Presumably one of the motivations for refusing to share data is fear that someone else will not be able to confirm our results. While this does not necessarily mean that our work is fraudulent, it is certainly an embarrassment. It indicates that something must be wrong. As we’ve seen, that might fall into the camp of quantitative error, faulty analysis, faulty method, bad data collection, or a host of other missteps that could be described as sloppy or lazy. Or it might reveal fraud. In any case, none of this is good news for the individual researcher, so some researchers hold back their data. Trivers rightly excoriates this practice, saying that such researchers are “a parody of academics.” And, as Wicherts demonstrates, there is a troubling correlation between refusing to share data and a higher likelihood of quantitative error in a published study.74 One imagines that, if the actual data were available, other problems might pop out as well.

Sometimes they do. The first step is when a second researcher uses the same data and follows the same methods as the original researcher, but the results do not hold up. This is when a study is said to be irreproducible. The second step is when it becomes apparent—if it does—why the original work was irreproducible. If it is because of fabricated data, a career is probably over. If it is because of bad technique, a reputation is on the line. Yet it is important to understand here that failure to replicate a study is not necessarily a bad thing for the profession at large. This is how science learns.75 This is probably part of what is meant by saying that science is self-correcting. A scientist makes a mistake, another scientist catches it, and a lesson is learned. If the lesson is about the object of inquiry—rather than the ethical standards of the original researcher—further work is possible. Perhaps even enlightenment. When we think that we know something, but really don’t, it is a barrier to further discovery. But when we find that something we thought was true really wasn’t, a breakthrough may be ahead. We have the opportunity to learn more about how things work. Thus failure can be valuable to science.76

The most important thing about science is that we try to find failure. The real danger to science comes not from mistakes but deception. Mistakes can be corrected and learned from; deception is often used to cover up mistakes. Fabricated data may be built upon by scores of other researchers before it is caught. When individual researchers put their own careers ahead of the production of knowledge, they are not only cheating on the ideals of the scientific attitude, they are cheating their professional colleagues as well. Lying, manipulation, and tampering are the very definition of having the wrong attitude about empirical evidence.

But again, we must make every effort to draw a distinction between irreproducible studies and fraud. It is helpful here to remember our logic. Fraudulent results are almost always irreproducible, but this does not mean that all or even most irreproducible studies are the result of fraud. The crime against the scientific attitude is of course much greater in cases of fraud. Yet we can also learn something about the scientific attitude by focusing on some of the less egregious reasons for irreproducible studies. Even if the reason for a study not being replicated has nothing whatsoever to do with fraud, scrutiny by the larger scientific community will probably find it. Merely by having a standard of data sharing and reproducibility, we are embracing the scientific attitude.77 Whether an irreproducible study results from intentional or unintentional mistakes, we can count on the same mechanism to find them.

One might think, therefore, that most scientific studies are replicated. That at least half of all scientific work must involve efforts to reproduce others’ findings so that we can be sure they are right before the field moves forward. This would be false. Most journals do not welcome publication of studies that merely replicate someone else’s results. It is perhaps more interesting if a study fails to replicate someone else’s result, but here too there is some risk. The greater glory in scientific investigation comes from producing one’s own original findings, not merely checking someone else’s, whether they can be replicated or not. Perhaps this is why researchers even attempt to replicate only a small percentage of scientific work.78 One might argue that this is appropriate because the standards of peer review are so high that one would not expect most studies to need to be repeated. With all that scrutiny, it must be a rare event to find a study that is irreproducible. But this assumption too has been challenged in recent years. In fact, the media in the last few years have been full of stories about the “reproducibility crisis” in science, especially in psychology, where—when someone bothered to look—researchers found that 64 percent of studies were irreproducible!79

Like the p-hacking “crisis,” one might see this as a serious challenge to the claim that science is special, and to the claim that scientists are distinctive in their commitment to a set of values that honor basing one’s beliefs on empirical evidence. But again, much can be learned by examining how the scientific community has responded to this replication “crisis.”

Let’s start with a closer look at the crisis itself. In August 2015, Brian Nosek and colleagues published a paper entitled “Estimating the Reproducibility of Psychological Science” in the prestigious journal Science. It dropped like a bombshell. A few years earlier Nosek had cofounded the Center for Open Science (COS), which had as its mission to increase data sharing and replication of scientific experiments. Their first project was entitled the Reproducibility Project. Nosek recruited 270 other researchers to help him attempt to reproduce one hundred psychological studies. And what they found was shocking. Out of one hundred studies, only 36 percent of the findings could be reproduced.80

It is important to point out here that Nosek et al. were not claiming that the irreproducible studies were fraudulent. In every case they found an effect in the same direction that the original researchers had indicated. It’s just that in most of them the effect size was cut in half. The evidence was not as strong as the original researchers had said it was, so the conclusion of the original papers was unsupported in over half of the cases. Why did this happen? One’s imagination immediately turns to all of the “degrees of freedom” that were discussed earlier in this chapter. From too small a sample size to analytical error, the replicators found a number of different shortcomings.
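To see why a halved effect size so often undoes the original conclusion, consider a rough power calculation. The sketch below is an illustration of mine with assumed numbers (a two-group design with fifty subjects per group), not a reanalysis of any of the actual studies:

```python
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test of effect size d."""
    se = np.sqrt(2.0 / n_per_group)          # standard error of the estimated effect size
    z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    return 1 - stats.norm.cdf(z_crit - d / se)

# If the original study reported d = 0.6, a same-sized replication is well powered...
print(f"power at d = 0.6: {approx_power(0.6, 50):.2f}")   # roughly 0.85
# ...but if the true effect is only half that size, the same design usually fails.
print(f"power at d = 0.3: {approx_power(0.3, 50):.2f}")   # roughly 0.32
```

On this stylized arithmetic, a same-sized replication of a true effect that is only half the published size will miss statistical significance roughly two times out of three, which is one way an honest study can fail to support an original conclusion without anyone having done anything wrong.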

This led to headline announcements throughout the world heralding a replication crisis in science. And make no mistake, the fallout was not just for psychology. John Ioannidis (whom we remember for his earlier paper “Why Most Published Research Findings Are False”) said that the problem could be even worse in other fields, such as cell biology, economics, neuroscience, or clinical medicine.81 “Many of the biases found in psychology are pervasive,” he said.82 Others agreed.83 There were, of course, some skeptics. Norbert Schwarz, a psychologist from the University of Southern California, said “there’s no doubt replication is important, but it’s often just an attack, a vigilante exercise.” Although he was not involved in any of the original studies, he went on to complain that replication studies themselves are virtually never evaluated for errors in their own design or analysis.84

Yet this is exactly what happened next. In March 2016, three researchers from Harvard University (Daniel Gilbert, Gary King, and Stephen Pettigrew) and one of Nosek’s colleagues from the University of Virginia (Timothy Wilson) released their own analysis of the replication studies put forward by the Center for Open Science and found that “the research methods used to reproduce those studies were poorly designed, inappropriately applied and introduced statistical error into the data.”85 When a re-analysis of the original studies was done using better technique, the reproducibility rate was close to 100 percent. As Wilson put it, “[the Nosek study] got so much press, and the wrong conclusions were drawn from it. It’s a mistake to make generalizations from something that was done poorly, and this we think was done poorly.”86

This is a breathtaking moment for those who care about science. And, I would argue, this is one of the best demonstrations of the critical attitude behind science that one could imagine. A study was released that found flawed methodology in some studies, and—less than seven months later—it was revealed that there were flaws in the study that found the flaws. Instead of a “Keystone Cops” conclusion, it is better to say that in science we have a situation where the gatekeepers also have gatekeepers—where there is a system of checking and rechecking in the interest of correcting errors and getting closer to the truth.

What were the errors in Nosek’s studies? They were numerous.

(1)  Nosek et al. took a nonrandom approach in selecting which studies to try to replicate. As Gilbert explains, “what they did is create an idiosyncratic, arbitrary list of sampling rules that excluded the majority of psychology subfields from the sample, that excluded entire classes of studies whose methods are probably among the best in science from the sample. Then they proceeded to violate all of their own rules. So the first thing we realized was that no matter what they found—good news or bad news—they never had any chance of estimating the reproducibility of psychological science, which is what the very title of their paper claims they did.”87

(2)  Some of their studies were not even close to exact replications. In one of the most egregious examples, the COS team were trying to recreate a study on attitudes toward affirmative action that had been done at Stanford. The original study had asked white students at Stanford to watch a video of four other Stanford students—three white and one black—talking about race, during which one of the white students made an offensive comment. It was found that the observers looked significantly longer at the black student when they believed that he could hear the comment than when they believed that he could not. In their attempt at replication, however, the COS team showed the video to students at the University of Amsterdam. Perhaps it is not shocking that Dutch students, watching American students speaking in English, would not have the same reaction. But here is where the real trouble came, for when the COS team realized the problem and reran the experiment at a different American university and got the same result as the original Stanford study, they chose to leave this successful replication out of their report, but included the one from Amsterdam that had failed.88

(3)  There was no agreed-upon protocol for quantitative analysis before the replication. Instead, Nosek and his team used five different measures (including strength of the effect) to look at all of the results combined. It would have been better, Gilbert et al. recommended, to focus on one measure.89

(4)  The COS team failed to account for how many of the replications would have been expected to fail purely by chance. It is statistically indefensible to expect a 100 percent success rate in replications. King explains, “If you are going to replicate 100 studies, some will fail by chance alone. That’s basic sampling theory. So you have to use statistics to estimate how many studies are expected to fail by chance alone because otherwise the number that actually do fail is meaningless.”90 The sketch below makes this point concrete.
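King’s sampling-theory point can be illustrated with a toy simulation. The numbers below (one hundred replications, each given an assumed 92 percent chance of reaching significance) are illustrative assumptions of mine, not figures from either team’s analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

n_studies = 100      # replication attempts
power = 0.92         # assumed probability that each replication reaches significance
n_sims = 10_000      # Monte Carlo repetitions

# Even if every original finding is true, each replication only succeeds with
# probability equal to its power, so some failures are expected by chance alone.
failures = n_studies - rng.binomial(n_studies, power, size=n_sims)

print(f"average failures by chance alone: {failures.mean():.1f}")
print(f"middle 95% of simulations: {np.percentile(failures, 2.5):.0f} "
      f"to {np.percentile(failures, 97.5):.0f} failures")
```

On these assumptions, roughly eight failures are expected even if every original finding is true, so the raw count of failed replications means little until that baseline has been estimated.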

The result of all these errors is that—if you take them into account and undo them—the reproducibility rate for the original one hundred studies was “about what we should expect if every single one of the original findings had been true.”91 It should be noted here that even so, Gilbert, King, Pettigrew, and Wilson are not suggesting fraud or any other type of malfeasance on the part of Nosek and his colleagues. “Let’s be clear,” Gilbert said, “No one involved in this study was trying to deceive anyone. They just made mistakes, as scientists sometimes do.”92

Nosek nonetheless complained that the critique of his replication study was highly biased: “They are making assumptions based on selectively interpreting data and ignoring data that’s antagonistic to their point of view.”93 In short, he accused them almost precisely of what they had accused him of doing—cherry picking—which is what he had accused some of the original researchers of doing too.

Who is right here? Perhaps they all are. Uri Simonsohn (one of the coauthors of the previously discussed p-hacking study by Simmons et al.)—who had absolutely nothing to do with this back-and-forth on the replication crisis and its aftermath—said that both the original replication paper and the critique used statistical techniques that are “predictably imperfect.” One way to think about this, Simonsohn said, is that Nosek’s paper said the glass was 40 percent full, whereas Gilbert et al. said it could be 100 percent full.94 Simonsohn explains, “State-of-the-art techniques designed to evaluate replications say it is 40 percent full, 30 percent empty, and the remaining 30 percent could be full or empty, we can’t tell till we get more data.”95
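One generic way to operationalize the glass metaphor (a simplification of mine based on confidence intervals, not Simonsohn’s actual technique) is to ask, for each replication, whether its estimate clearly supports an effect, clearly rules out anything near the original effect, or is too noisy to say:

```python
from scipy import stats

def classify_replication(rep_effect, rep_se, orig_effect,
                         negligible_fraction=1/3, alpha=0.05):
    """Label a replication 'full', 'empty', or 'cannot tell'.

    Assumes the original effect was positive; negligible_fraction is an
    arbitrary threshold below which an effect is treated as too small to matter.
    """
    z = stats.norm.ppf(1 - alpha / 2)
    lo, hi = rep_effect - z * rep_se, rep_effect + z * rep_se
    if lo > 0:
        return "full"          # the replication clearly detects an effect
    if hi < negligible_fraction * orig_effect:
        return "empty"         # the replication rules out anything near the original
    return "cannot tell"       # the interval is too wide to decide

# Example: original d = 0.60, replication d = 0.25 with a large standard error.
print(classify_replication(rep_effect=0.25, rep_se=0.20, orig_effect=0.60))
# -> 'cannot tell': consistent both with a smaller real effect and with no effect.
```

The third category is the crucial one: for those studies the only honest verdict is that more data are needed, which is just what Simonsohn concludes.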

What are the implications for the scientific attitude? They are manifest in this last statement. When we don’t have an answer to an empirical question, we must investigate further. Researchers used the scientific attitude to critique the study that critiqued other studies. But even here, we have to check the work. It is up to the scientific community as a whole to decide—just as they did in the case of cold fusion—which answer is ultimately right.

Even if all one hundred of the original studies that were examined by the Reproducibility Project turn out to be right, there needs to be more scrutiny on the issues of data sharing and replication in science. This dust-up—which is surely far from over—indicates the need to be much more transparent in our research and reporting methods, and in trying to replicate one another’s work. (Indeed, look how valuable the reproducibility standard was in the cold fusion dispute.) And already perhaps some good has come of it. In 2015, the journal Psychological Science announced that it would ask researchers to preregister their study methods and modes of analysis prior to data collection, so that these can later be reconciled with what they actually found. Other journals have followed suit.96

The increased scrutiny that the “reproducibility crisis” has created is a good thing for science. Although it is surely embarrassing for some of its participants, the expectation that nothing should be swept under the rug marks yet another victory for the scientific attitude. Indeed, what other field can we imagine rooting out its own errors with such diligence? Even in the face of public criticism, science remains completely committed to the highest standards of evidence. While there may be individual transgressions, the “replication crisis” in science has made clear its commitment to the scientific attitude.

Conclusion

In this chapter, I’ve come down pretty hard on some of the mistakes of science. My point, however, was not to show that science is flawed (or that it is perfect), even if I did along the way show that scientists are human. We love our own theories. We want them to be right. We surely make errors of judgment all the time in propping up what we hope to be true, even if we are statisticians. My point has not been to show that scientists do not suffer from some of the same cognitive biases, like confirmation bias and motivated reasoning, as the rest of us. Arguably, these biases may even be the reason behind p-hacking, refusal to share one’s data, taking liberties with degrees of freedom, running out to a press conference rather than subjecting oneself to peer review, and offering work that is irreproducible (even when our subject of inquiry is irreproducible work itself). Where the reader might have been tempted in various cases to conclude that such egregious errors could only be the result of intentional wrongdoing—and must therefore reveal fraud—I ask you to step back and consider the possibility that they may be due merely to unconscious cognitive bias.97

Fortunately, science as an institution is more objective than its practitioners. The rigorous methods of scientific oversight are a check against individual bias. In several examples in this chapter we saw problems that were created by individual ambition and mental foibles that could be rectified through the rigorous application of community scrutiny. It is of course hoped that the scientific attitude would be embraced by all of its individual practitioners. But as we’ve seen, the scientific attitude is more than just an individual mindset; it is a shared ethos that is embraced by the community of scholars who are tasked with judging one another’s theories against publicly available standards. Indeed, this may be the real distinction between science and pseudoscience. It is not that pseudoscientists suffer from more cognitive bias than scientists. It is not even that scientists are more rational (though I hope this is true). It is instead that science has made a community-wide effort to create a set of evidential standards that can be used as a check against our worst instincts and make corrections as we go, so that scientific theories are warranted for belief by those beyond the individual scientists who discovered them. Science is the best way of finding and correcting human error on empirical matters not because of the unusual honesty of scientists, or even their individual commitment to the scientific attitude, but because the mechanisms for doing so (rigorous quantitative method, scrutiny by one’s peers, reliance on the refutatory power of evidence) are backed up by the scientific attitude at the community level.

It is the scientific attitude—not the scientific method—that matters. As such, we need no longer pretend that science is completely “objective” and that “values” are irrelevant. In fact, values are crucial to what is special about science. Kevin deLaplante says, “Science is still very much a ‘value-laden’ enterprise. Good science is just as value-laden as bad science. What distinguishes good from bad science isn’t the absence of value judgments, it’s the kind of value judgments that are in play.”98

There are still a few problems to consider. First, what should we do about those who think that they have the scientific attitude when they don’t? This concern will be dealt with in chapter 8, where we will discuss the problems of pseudoscience and science denialism. I trust, however, that one can already begin to imagine the outlines of an answer: it is not just an individual’s reliance on the scientific attitude—and surely not his or her possibly mistaken belief about fulfilling it—that keeps science honest. The community will judge what is and is not science. Yet now a second problem arises, for is it really so easy to say that the “community decides”? What happens when the scientific community gets it wrong? In chapter 8, I will also deal with this problem, via a consideration of those instances where the individual scientist—like Galileo—was right and the community was wrong.99 As an example I will discuss the delightful, but less well-known, example of Harlen Bretz’s theory that a megaflood was responsible for creating the eastern Washington State scablands. I will also offer some reasons for disputing the belief that today’s climate change deniers may just be tomorrow’s Galileo. And, as promised, I will examine the issue of scientific fraud in chapter 7.

Before all that, however, in the next chapter, we will take a break from consideration of scientific drawbacks and failures to take a closer look at one example of an unabashed success of the scientific attitude and how it happened: modern medicine.

Notes