Nullifying the Dull Hypothesis
Conventional Versus Personalized Science
There are two objectionable types of believers: those who believe the incredible and those who believe that “belief” must be discarded and replaced by “the scientific method.”
Chapters 12 and 13 addressed two significant issues frequently encountered in empirical analysis: model selection and data availability, respectively. I now will turn to an even more fundamental issue: the soundness of the scientific method (SM) itself. Although widely touted as the most effective available guide to human understanding, the SM is imprecisely defined and inconsistently implemented. Not only is there no universally accepted method for all scientific disciplines, but also statements of such methods, where they exist, are usually only idealized descriptions of the actual practice of science. This state of affairs is rather disquieting given that the SM is often proffered as the soul of modern science—the sine qua non of enlightened empirical reasoning, effectively poised to dispel the confusion and darkness of “inferior” modes of understanding (intuition, superstition, prejudice, mythology, religion, etc.).
In the present chapter, I will examine the SM as it is commonly practiced. First, I will address a number of inconsistencies in the way the SM is applied, observing how these inconsistencies may lead researchers to ignore important phenomena, especially when the phenomena are of a subtle and/or transient nature. Subsequently, I will consider issues related to the SM’s philosophical foundations—specifically, the inefficiency and inertia caused by the SM’s collectivist, frequentist orientation—and argue that these problems may be avoided by a more individualist, Bayesian approach.
Conventional Formulation
Formal statements of the SM usually envision scientific progress as occurring over time through a succession of replicable experimental tests of well-specified hypotheses describing relationships between observable quantities.2 This process may be divided into a number of distinct steps, whose characteristics are somewhat sensitive to the particular needs/eccentricities of a given scientific field. A conventional outline of these steps is as follows:
1. The development of a null hypothesis (H0) summarizing the relationship to be tested. This hypothesis may be based upon either derivations from theory or judgmental inferences from previously recorded observations. It may be as simple as stating the range for a parameter of interest (e.g., H0: “The length of the underwriting cycle for U.S. workers compensation insurance is greater than six years”) or it may involve the specification of a cause-and-effect relationship (e.g., H0: “Hybrid automobiles are more likely than conventional automobiles to have collisions with pedestrians and bicyclists”).
2. The identification of an observable phenomenon that is predicted by the underlying hypothesis and not otherwise explicable. For example, to test H0: “The length of the underwriting cycle for U.S. workers compensation insurance is greater than six years,” a researcher may identify the estimated cycle length for U.S. insurance company i (based upon profitability data from the past twenty years and denoted by li) as the observable phenomenon and take the assumption that li is a random variable with expected value greater than 6 as equivalent to H0.
3. The design of an experiment comprising a reasonably large number of individual observations, each of which reflects the presence/absence of the identified phenomenon. This includes such critical issues as identifying the statistical properties of the observations (e.g., whether or not they represent independent and identically distributed outcomes), how many observations will be made, and at what level of significance (α) the null hypothesis will be rejected. In testing the H0 from step 2 above, the researcher may decide, based upon previous work, that the li are approximately multivariate-normal random variables, positively correlated and with potentially different standard deviations. Furthermore, given that twenty years’ data are needed for each insurance company, the study must be restricted to only those companies that have existed for at least that time period. Finally, the researcher may decide, based upon common practice in the insurance economic literature, to set α equal to 0.05.
4. The replication of the experiment by several independent researchers. The primary purpose of experimental replication is to reduce the possibility of systematic error (such as fraud, prejudice, and simple incompetence) associated with individual researchers. A secondary purpose is to enhance the statistical significance of the results by adding more observations. Thus, regardless of the particular outcome of the test of H0, the researcher should encourage the replication of his or her analysis by others, ideally with additional data, and welcome the publication of such studies.
This outline reveals that the SM borrows heavily from the framework of frequentist hypothesis testing. In short, steps 1 through 3 are primarily an application of hypothesis testing to a specific scientific problem, accomplished by writing the scientific principle under investigation in the formal mathematical terms of a null hypothesis. As mentioned before, there are clear disparities in the way the SM is employed in different natural and social scientific disciplines. These occur most conspicuously in steps 3 and 4.
To my mind, the disparity of greatest substance is that researchers in certain disciplines, such as physics and psychology, generally are able to make arbitrarily large numbers of the desired observations in step 3, whereas researchers in other disciplines, such as climatology and insurance economics, often have to work with a limited number of observations (e.g., an insurance economist can no more create additional insurance companies than a climatologist can precipitate additional hurricanes). This distinction is essentially the difference between those fields that rely primarily on randomized controlled studies and those that rely primarily on observational studies. Because observational studies are far more vulnerable to confounding, they rarely can settle matters of causality on their own, and so the type of “science” practiced in the latter fields is of a qualitatively different, and more tentative, nature. In Chapter 15, I will discuss ways in which researchers in these latter disciplines may employ techniques of mathematical modeling and statistical simulation—and in particular, game theory (where applicable)—to elevate the rigor of their conclusions.
Other discipline-based variations involve the selection of α in step 3 and the number of independent researchers in step 4. For example, in biomedical research, scientists and government regulators may choose small values of α to protect patients from exposure to inadequately tested drugs. In certain manufacturing processes, however, in which the speed of innovation outweighs the need for statistical accuracy, larger values of α may be acceptable.
I hasten to point out that the above disparities do not constitute the pejorative inconsistencies mentioned at the beginning of the chapter. However, I do believe that discipline-based variations in the SM should be emphasized more explicitly, so that people clearly understand that the standard of proof for the laws of quantum physics (for example) is quite different from that for the theory of evolution.
Inconsistencies in Application
Where consistency is essential is in the SM’s manner of application over time within a given field. Unfortunately, in virtually every scientific discipline, the SM is degraded by one or more of the following problems: apples-to-oranges replications, censorship of results, double standards in setting levels of significance, and moving targets in selecting hypotheses. Unlike outright fraud, which is universally condemned by the research community, these other problems are routinely tolerated—sometimes even encouraged—by various institutional structures and incentives. Let us now consider each of them in turn.
• Apples-to-oranges replications. From step 4 of the SM outline, it can be seen that the replication of experimental results after their initial publication is crucial to the SM. However, this step is rarely carried out conscientiously because researchers know that scientific funding, whether from government agencies or private grants, tends not to be given to support projects that simply repeat previous work and that scholarly journals will not publish results considered “old hat.” Consequently, what often passes for a replication in the sciences is actually a different experiment from the original—that is, one that is sufficiently distinct to make it “interesting.”
• Censorship of results. In a sense, the apples-to-oranges problem is a problem of a priori censorship, since many corroborative experiments that should be done are not done. Another major form of censorship is the failure of researchers to publish so-called negative results. For example, suppose that a researcher has derived the hypothesis H0: “The length of the underwriting cycle for U.S. workers compensation insurance is greater than six years” from the mathematical theory of risk, and then proceeds to test it as described above. Suppose further that the observed data support rejecting this hypothesis in favor of the alternative H1: “The length of the underwriting cycle for U.S. workers compensation insurance is less than or equal to six years,” but the researcher is unable to develop any theoretical explanation for H1. Under such circumstances, it is fairly likely that the researcher would either engage in self-censorship by not submitting his or her results for publication or would submit the work for publication, but find it deemed unacceptable by scholarly journals.
• Double standards in setting levels of significance. Although levels of significance may differ by field of research, it is crucial that α remain constant over time within a given area of investigation; otherwise, certain hypotheses may be favored over others based upon prejudice and other subjective considerations. An extreme example of this problem arises in the field of parapsychology, where research is subjected to much smaller levels of α than in other subfields of psychology.3 Although some would justify such double standards by stating that extraordinary claims require extraordinary evidence, I would argue that that is an unreasonable interpretation of the SM outlined above. Clearly, an experimental outcome that contradicts a previously demonstrated result (i.e., a result that already has satisfied step 4) should be met with a healthy dose of skepticism—after all, the prior result has passed the test of the SM and therefore is entitled to a certain degree of deference. However, if an experimental outcome is simply novel—albeit unsupported by theory—then it bears no burden other than the requirement of replication (step 4). Harking back to the problem of finding empirical support for H1: “The length of the underwriting cycle for U.S. workers compensation insurance is less than or equal to six years” without any accompanying theory, I would argue that H1 should be considered a perfectly legitimate and publishable result worthy of replication efforts.
• Moving targets in selecting hypotheses. To maintain the integrity of the SM, it is imperative that researchers select the statement of the null hypothesis, as well as the level of significance, prior to collecting and analyzing observations. Otherwise, they are simply fishing for significance by calling attention to a hypothesis only after finding it to be supported by data. Although this practice is perfectly acceptable as an exploratory statistical technique, it can never satisfy step 3 of the SM. Unfortunately, this type of fishing is quite common in those disciplines that rely primarily on observational studies and is easily illustrated by a slight reorientation of our underwriting cycle example. Suppose that instead of studying workers compensation insurance in just the United States, a researcher is interested in whether or not workers compensation insurance possesses a cycle length greater than six years in all markets throughout the world. Thus, the researcher computes estimated cycle lengths from the past twenty years’ data for companies in each of 100 different nations (one of which happens to be the United States). Suppose further that, after carrying out statistical tests for each of the 100 different nations with an α of 0.05, it is found that the United States is the only nation for which the data support a cycle length greater than six years. If the researcher publishes the U.S. result alone, then he or she is abusing the SM because the hypothesis has been identified a posteriori. In other words, the conclusion would be meaningless because “significant” test results would arise by chance alone from approximately 5 percent of all nations tested.
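The chance behavior behind the moving-targets problem is easy to verify by simulation. The sketch below (function name and parameters are my own) exploits the fact that, when the null hypothesis is true, a p-value is uniformly distributed on [0, 1], so each of the 100 national tests "succeeds" with probability α even though no real effect exists anywhere.

```python
import random

def spurious_hits(n_nations=100, alpha=0.05, seed=0):
    """Count nations that appear 'significant' when every null is true.

    Under a true null hypothesis a p-value is Uniform(0, 1), so each
    test rejects by pure chance with probability alpha.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_nations))

# Averaged over many simulated world studies, roughly 5 of the 100
# nations look "significant" despite there being no effect anywhere.
mean_hits = sum(spurious_hits(seed=s) for s in range(1000)) / 1000
```

A researcher who singles out the one or two chance "hits" after the fact, as in the U.S. example, is publishing exactly this noise.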
Although the above inconsistencies are rather serious, my purpose is not to discredit the SM’s basic structure, but rather to encourage researchers to conduct their investigations in a manner designed to avoid the stated flaws. Otherwise, it is quite likely—and perhaps even inevitable—that important scientific phenomena will be ignored, especially if they are subtle (i.e., numerically small) and/or transient (i.e., temporally variable). Researchers in the fields of insurance and finance should be particularly concerned about the latter possibility because many properties of insurance and other financial markets are sensitive to structural variations over time.
Philosophical Foundations
Circular Reasoning?
Turning to the SM’s philosophical underpinnings, one quickly encounters what appears to be a problem of circular logic.4 This is because the SM, as a process of human understanding, is entirely empirical in nature, and at the same time, its intellectual justification seems to arise exclusively from its empirically observed successes. In other words, to place credence in the SM, one must already accept the SM as a means of generating belief.
This apparent shortcoming is a fundamental problem of British empiricism, the philosophy that the SM formalizes. Developed by the seventeenth- and eighteenth-century thinkers John Locke, George Berkeley, and David Hume, empiricism argues that human knowledge derives from the observation of objects in the external world through the body’s senses (induction) and that internal reflection is useful primarily for identifying logical relationships among these observables. In his famous treatise, An Enquiry Concerning Human Understanding, Hume acknowledges the issue of circularity as follows:5
We have said that all arguments concerning existence are founded on the relation of cause and effect; that our knowledge of that relation is derived entirely from experience; and that all our experimental conclusions proceed upon the supposition that the future will be conformable to the past. To endeavor, therefore, the proof of this last supposition by probable arguments, or arguments regarding existence, must be evidently going in a circle, and taking that for granted, which is the very point in question.
Hume ultimately sidesteps this problem by arguing that the intellectual justification for relying on observations is simply human “custom” or “habit.” “All inferences from experience, therefore, are effects of custom, not of reasoning,” he writes; and subsequently: “All belief of matter of fact or real existence is derived merely from some object, present to the memory or senses, and a customary conjunction between that and some other object.”6
The issue of circularity is mentioned not because I believe it to be a fatal shortcoming of the SM, but rather because it shows how easily many of today’s science devotees accept the SM based upon an uncritical enthusiasm akin to religious zeal. Surely, basing a belief in the SM solely upon its successful track record is no more “scientific” than basing a belief in creationism solely upon the Book of Genesis. Tautology is tautology.
So how does one escape the SM’s circularity problem? Chronologically, humanity had to wait only slightly more than thirty years from the publication of Hume’s cited work for an insightful approach provided by the German philosopher Immanuel Kant. Famously stating that he was roused from his “dogmatic slumber” by reading Hume’s work, Kant spent ten years of his life trying to reconcile the arguments of empiricism with those of rationalism, a philosophy developed by René Descartes, Benedict de Spinoza, and Gottfried Wilhelm Leibniz. In contrast to empiricism, rationalism argues that human knowledge can be derived from internal reflection (deduction) and that the experience of the external world is useful primarily for providing practical examples of internal ideas.
The principal fruit of Kant’s ten-year labor was The Critique of Pure Reason, in which he proposed the philosophy of transcendental idealism.7 This philosophy argues that certain types of knowledge, such as mathematical reasoning and the understanding of time and space, are built into the human mind. Subject to these a priori conditions, human beings learn about the world through observations as they are processed by, and conform to, the a priori conditions. Using this type of approach, it is a simple matter to remove the circularity of the SM by arguing that acceptance of the SM is just part of the mind’s intrinsic construction—specifically, that human beings innately believe that a principle that holds true at one particular point in time and space also should hold true at neighboring points (i.e., during later repetitions of the same experiment).8 One then is free to employ the SM to one’s heart’s content.
Although the substitution of “a priori conditions” for “custom” and “habit” may sound like little more than a semantic means of begging the question, it actually provides a logically rigorous foundation for the SM. Essentially, human decision makers have no choice but to accept the a priori conditions built into their brains, whereas customs and habits are discretionary and subject to change. For anyone interested in preserving the type of human free will described in Chapter 11, the only thing that remains is to explain how the SM is formed in the brain. In theory, this could be accomplished by arguing that the SM is implied by an evaluation function, V(t), that “selects itself” by following a procedure of sufficient complexity that its outcome, ex ante, appears random. But what if some individuals (like myself) simply fail to select a V(t) that implies the SM?
Feckless Frequentism Versus Bespoke Bayesianism
My primary concern with the SM’s philosophical framework is its conventional formulation in terms of frequentist hypothesis testing. The decision to employ frequentism over Bayesianism is generally made implicitly, and for the unstated reason that, since the SM is supposed to represent a collective decision-making process for all of humanity, it would be impermissible for different individuals to impose their own prior distributions over any unknown parameters. This idea was expressed clearly by British statistician Ronald Fisher in his 1956 book, Statistical Methods and Scientific Inference:9
As workers in Science we aim, in fact, at methods of inference which shall be equally convincing to all freely reasoning minds, entirely independently of any intentions that might be furthered by utilizing the knowledge inferred.
Although this approach seems quite reasonable in certain situations—for example, the measurement of particle masses in physics and melting points in chemistry—there are other contexts—such as financial planning—in which such a process is both slow and inefficient.
Consider, for example, the two hurricane prediction models discussed in the previous chapter, and suppose that the sophisticated fundamental analysis of Gray and Klotzbach represents the current state of human knowledge, whereas the simple technical model proposed by the author represents a brand-new and purportedly superior alternative. Then, under the SM, the null hypothesis would be given by H0: “The fundamental model is better for predicting the number of Atlantic-Basin hurricanes in a given year,” and the alternative hypothesis would be H1: “The technical model is better for predicting the number of Atlantic-Basin hurricanes in a given year.”
From the perspective of an academic meteorologist who has just embarked on the study of hurricane frequency, it is fairly clear how to proceed. Having learned of the technical model and how it was shown to be a somewhat better predictor of the number of Atlantic-Basin hurricanes using a given collection of historical data, the meteorologist would look for ways to replicate the technical analysis using different sets of data. Perhaps the scientist would seek alternative measurements of the same input values used by the author for the Atlantic Basin, or perhaps he or she would compare the two models in a more limited geographical context (e.g., among hurricanes making landfall in the United States). In either case, after conducting the new study, the meteorologist would publish the results, thereby providing more or less support for the technical approach and helping to move forward the cause of good science.
Now, however, suppose that one is not an academic researcher, but rather the director of a risk management and emergency planning agency for a state or municipality along the U.S. Gulf Coast. Suppose also that the agency director has been in this job for a number of years and has witnessed firsthand the relative inaccuracy of the fundamental forecasts as they were published year after year. Given the agency’s serious responsibilities, the director might be torn between wanting to embrace wholeheartedly the technical model (because of its apparently greater accuracy) and sticking with the fundamental model (because it represents the status quo, and as a public official, he or she cannot be too capricious in making decisions). As a consequence, the director may decide to continue using the fundamental forecasts in planning, but to incorporate some information from the technical forecasts as well (e.g., by using a weighted average of the two forecasts, or employing the fundamental forecast only if it provides a more conservative—i.e., larger—predicted frequency).
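The director's compromise can be expressed as a tiny decision rule. The function below is purely illustrative (the weight, the sample forecasts, and the names are my own assumptions, not taken from either model):

```python
def combined_forecast(fundamental, technical, tech_weight=0.5,
                      conservative=False):
    """Blend two annual hurricane-count forecasts.

    tech_weight: credibility the director places on the technical model.
    conservative=True instead adopts whichever forecast is larger,
    i.e., the director uses the fundamental forecast only when it is
    the more cautious figure for emergency planning.
    """
    if conservative:
        return max(fundamental, technical)
    return tech_weight * technical + (1 - tech_weight) * fundamental

# Hypothetical forecasts: fundamental predicts 9 storms, technical 12.
blended = combined_forecast(9, 12, tech_weight=0.3)     # 0.3*12 + 0.7*9
cautious = combined_forecast(9, 12, conservative=True)  # larger of the two
```

Either rule lets the director lean on the status quo while still extracting some information from the newer model, which is precisely the informal Bayesian behavior the chapter goes on to describe.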
In a third scenario, consider the case of a hedge fund manager contemplating the purchase of weather derivatives for Atlanta, Georgia, that are traded on the Chicago Mercantile Exchange. To understand these instruments better, the manager does a little research into weather prediction and soon comes across both the fundamental and technical hurricane forecasts. Because of the investor’s relative lack of expertise in the area, he or she contacts a number of reputable meteorologists, who (not surprisingly) offer comments such as: “Gray and Klotzbach are the real experts in this area; the technical model was developed by some insurance guy who doesn’t know anything about weather systems,” and “Frankly, I don’t understand why anyone would trust the technical model.” As a result of this additional information and given that the manager’s motive is solely to seek the very best outcomes on behalf of the fund’s investors, he or she decides to ignore the technical model entirely.
From the above examples, it can be seen that Fisher’s ideal of an SM that is “equally convincing to all freely reasoning minds, entirely independently of any intentions” is largely an impossible dream. Although it may work reasonably well for academic researchers who are able to proceed at a leisurely pace and who are uncomfortable relying on personal judgment, it becomes more problematic for those in need of prompt answers to address important time-sensitive issues, as well as for those with personal beliefs who do not see any need to be constrained by published results.
The question arises, therefore, whether or not it is reasonable to embrace, explicitly, an alternative SM—let us call it a “personalized scientific method” (PSM)—in which one is free to: (1) advance the current state of knowledge at one’s own pace (i.e., to select levels of significance with which one is most comfortable, possibly higher than those used by most academic researchers); and (2) incorporate personal beliefs through a Bayesian framework.
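The second element of the PSM has a standard mechanical form. In the sketch below (a beta-binomial model of my own choosing, with invented priors and data), two decision makers apply Bayes' rule to the same track record yet, because of different priors, arrive at different and equally coherent posterior beliefs about the technical model:

```python
def posterior_belief(prior_a, prior_b, successes, trials):
    """Beta-binomial update of the belief that the technical model
    outperforms the fundamental one in a given year.

    prior_a, prior_b: Beta prior parameters encoding personal belief
    (prior mean = prior_a / (prior_a + prior_b)).
    Returns the posterior mean of P(technical wins) after the data.
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    return a / (a + b)

# A skeptical academic (prior mean 0.2) and the confident agency
# director (prior mean 0.7) observe the same record: the technical
# model wins in 8 of 10 years.
academic = posterior_belief(2, 8, 8, 10)   # posterior mean 0.50
director = posterior_belief(7, 3, 8, 10)   # posterior mean 0.75
```

Both posteriors follow from the same data and the same updating rule; only the personal priors, which the PSM makes explicit rather than forbidding, differ.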
Clearly, such a program of acquiring knowledge would generate a tremendous amount of chaos if it were to replace the SM. With every researcher “on a different page” from his or her colleagues, there would be no general standard for distinguishing facts from illusion and truth from wishful thinking. Recalling the inconsistency problems with the SM, listed above, it is easy to see that both the apples-to-oranges and the moving targets problems would be greatly aggravated by a PSM, as researchers abandoned any pretense of consensus in the selection of experiments to conduct and allowed their personal beliefs to be unduly influenced by fishing for statistical significance.
Nevertheless, there would be some benefits of a PSM. For one thing, it likely would counteract the inconsistencies of censorship and double standards mentioned previously, since researchers would be free to incorporate negative results into their personal beliefs and to exercise their own discretion in setting levels of significance. And perhaps the best argument for a PSM already has been intimated: Since large numbers of researchers—like our agency director and hedge fund manager—already use one, why not simply acknowledge it?
Explicitly embracing a PSM does not have to mean abandoning the conventional SM. Researchers could continue to follow the SM as before, but simply inject more opinion, speculation, and subjective judgment into their analyses and conclusions. In fact, this is exactly what professional actuaries do in the field of insurance, where limited data often necessitate intellectual compromise via the averaging of estimates and forecasts from different sources.10 As mentioned previously, insurance actuaries often employ Bayesian methods in their work, and it is undoubtedly this Bayesian influence that has allowed them to recognize formally the usefulness of personal judgment. In a world with an accepted PSM, researchers who found the new approach offensive naturally would be free to express their opinions and to organize scholarly journals adhering to the strictest standards of the SM (which, as we know, is no mean feat).
One of the supreme ironies of modern intellectual life is the great chasm that exists between rigorous science, on the one hand, and established faith, on the other. In today’s world, the two opposing camps have reached an uneasy truce under which they officially profess to respect each other’s domains and even allow safe passage for some of their respective adherents to travel back and forth across the bridge that separates them. However, anyone who tarries on the bridge too long or—God (?) forbid!—jumps off is immediately branded a quack or crackpot. A PSM would permit people to explore this middle ground without fear of ridicule; and it is this appealing characteristic that, I suspect, has led to my own choice of a V(t) that implies a PSM.
ACT 3, SCENE 4
[The gates of Heaven. Deceased stands in front of magistrate-style desk; St. Peter sits behind desk.]
ST. PETER: Hello, Mr. Powers. Welcome to the gates of Heaven.
DECEASED: Why, thank you. May I enter?
ST. PETER: Well, that’s the big question, isn’t it? Unfortunately, it seems there are a few matters that need attending.
DECEASED: I see.
ST. PETER: To begin with, there appears to be an unresolved issue about your parking under a tree, with the expectation that a large branch would fall on your car and destroy it.
DECEASED: Oh, that again? I thought I had explained elsewhere—
ST. PETER: Excuse me, Mr. Powers—this is Heaven. We already know what you’ve explained elsewhere.
DECEASED: Well, then, you must know that my action didn’t constitute insurance fraud. It couldn’t, because there was no moral hazard involved.
ST. PETER: Again, Mr. Powers—this is Heaven. We’ll be the judges of what is and what isn’t moral hazard.
DECEASED: Certainly. But you must recognize that the falling branch was clearly an act of God!
ST. PETER: Yes, of course; and that’s precisely the problem. You see, God doesn’t like being a party to insurance fraud.
DECEASED: But it wasn’t fraud! Even though I believed the branch would fall, no one else would have believed it.
ST. PETER: I see. So you thought the probability of the branch’s falling was very large—above 90 percent, perhaps?
DECEASED: Yes.
ST. PETER: And everyone else, the tree experts included, thought the probability was very small—below 1 percent, perhaps?
DECEASED: Yes, exactly.
ST. PETER: But why should it matter what others believed, if you yourself believed the branch would fall?
DECEASED: Well, on Earth, that’s how we evaluate—I won’t presume to say judge—the decisions of our fellow human beings. Since we all can’t agree on the same subjective probability of an event’s occurring, we just use the observed long-run frequency of the event. And in this case, a tree like the one in my yard would break with a probability of less than 1 percent.
ST. PETER: So you believed that the probability of the branch’s falling was greater than 90 percent, but you were willing to go along with a frequentist analysis when it suited your purposes?
DECEASED: Yes, I suppose that’s true …
ST. PETER: Well, then, we have a problem.
DECEASED: We do?
ST. PETER: Yes. You see, in Heaven we’re all fundamentalist Bayesians.