TESTING THE GOODNESS OF FIT
During the 1980s a new type of mathematical model emerged and caught the public imagination, primarily because of its name: “chaos theory.” The name suggests some form of statistical modeling with a particularly wild type of randomness. The people who coined the name purposely stayed away from using the word random. Chaos theory is actually an attempt to undo the statistical revolution by reviving determinism at a more sophisticated level.
Recall that, before the statistical revolution, the “things” with which science dealt were either the measurements made or the physical events that generated those measurements. With the statistical revolution, the things of science became the parameters that governed the distribution of the measurements.
In the earlier deterministic approach, there was always the belief that more refined measurements would lead to a better definition of the physical reality being examined. In the statistical approach, the parameters of the distribution are sometimes not required to have a physical reality and can only be estimated with error, regardless of how precise the measuring system. For example, in the deterministic approach, there is a fixed number, the gravitational constant, that describes how things fall to the Earth. In the statistical approach, our measurements of the gravitational constant will always differ from one another, and the scatter of their distribution is what we wish to establish in order to “understand” falling bodies.
In 1972, the meteorologist Edward Lorenz presented an often-referenced lecture entitled “Does the Flap of a Butterfly’s Wings in Brazil Set Off a Tornado in Texas?” Lorenz’s main point was that chaotic mathematical functions are very sensitive to initial conditions. Slight differences in initial conditions can lead to dramatically different results after many iterations. Lorenz believed that this sensitivity to slight differences in the beginning made it impossible to determine an answer to his question. Underlying Lorenz’s lecture was the assumption of determinism, that each initial condition can theoretically be traced as a cause of a final effect. This idea, called the “Butterfly Effect,” has been taken by the popularizers of chaos theory as a deep and wise truth.
However, there is no scientific proof that such a cause and effect exists. There are no well-established mathematical models of reality that suggest such an effect. It is a statement of faith. It has as much scientific validity as statements about demons or God. The statistical model that defines the quest of science in terms of parameters of distributions is also based on a statement of faith about the nature of reality. My experience in scientific research has led me to believe that the statistical statement of faith is more likely to be true than the deterministic one.
Chaos theory results from the observation that numbers generated by a fixed deterministic formula can appear to have a random pattern. This was seen when a group of mathematicians took some relatively simple iterative formulas and plotted the output. In chapter 9, I described an iterative formula as one that produces a number, then uses that number in its equations to produce another number. The second number is used to produce a third number, and so on. In the early years of the twentieth century, the French mathematician Henri Poincaré tried to understand complicated families of differential equations by plotting successive pairs of such numbers on a graph. Poincaré found some interesting patterns in those plots, but he did not see how to exploit those patterns and dropped the idea. Chaos theory starts with these Poincaré plots. What happens as you build a Poincaré plot is that the points on the graph paper appear, at first, as if they have no structure to them. They appear in various places in a seemingly haphazard fashion. As the number of points in the plot increases, however, patterns begin to emerge. They are sometimes groups of parallel straight lines. They might also be a set of intersecting lines, or circles, or circles with straight lines across them.
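The idea that a deterministic formula can produce apparently random output, yet reveal its structure in a Poincaré plot, can be sketched in a few lines. The logistic map used here is my own choice of illustration, not an equation named in the text; the point is only that the successive pairs, which look scattered as a raw sequence, all fall on a single curve.

```python
# A minimal sketch of the idea behind a Poincare plot, using the
# logistic map x_{n+1} = r * x_n * (1 - x_n) as a stand-in for a
# "chaotic" iterative formula. (The logistic map is an assumption of
# this sketch; the text does not name a specific equation.)

def logistic_orbit(r, x0, n):
    """Iterate the logistic map n times, returning the full sequence."""
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

xs = logistic_orbit(r=3.99, x0=0.2, n=1000)

# The raw sequence looks haphazard, but the Poincare plot of successive
# pairs (x_n, x_{n+1}) collapses onto one parabola -- the hidden
# deterministic structure the chaos theorists look for.
pairs = list(zip(xs[:-1], xs[1:]))
assert all(abs(y - 3.99 * x * (1 - x)) < 1e-9 for x, y in pairs)
```

Plotting `pairs` on graph paper would reproduce the effect described above: scattered-looking points gradually tracing out a clean curve.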
The proponents of chaos theory suggest that what in real life appear to be purely random measurements are, in fact, generated by some deterministic set of equations, and that these equations can be deduced from the patterns that appear in a Poincaré plot. For instance, some proponents of chaos theory have taken the times between human heartbeats and put them into Poincaré plots. They claim to see patterns in these plots, and they have found deterministic generating equations that appear to produce the same type of pattern.
As of this writing, there is one major weakness to chaos theory applied in this fashion. There is no measure of how good the fit is
between the plot based on data and the plot generated by a specific set of equations. The proof that the proposed generator is correct is based on asking the reader to look at two similar graphs. This eyeball test has proved to be a fallible one in statistical analysis. Those things that seem to the eye to be similar or very close to the same are often drastically different when examined carefully with statistical tools developed for this purpose.
This problem was one that Karl Pearson recognized early in his career. One of Pearson’s great achievements was the creation of the first “goodness of fit test.” By comparing the observed to the predicted values, Pearson was able to produce a statistic that tested the goodness of the fit. He called his test statistic a “chi square goodness of fit test.” He used the Greek letter chi (χ), since the distribution of this test statistic belonged to a group of his skew distributions that he had designated the chi family. Actually, the test statistic behaved like the square of a chi, thus the name “chi squared.” Since this is a statistic in Fisher’s sense, it has a probability distribution. Pearson proved that the chi square goodness of fit test has a distribution that is the same, regardless of the type of data used. That is, he could tabulate the probability distribution of this statistic and use that same set of tables for every test. The chi square goodness of fit test has a single parameter, which Fisher was to call the “degrees of freedom.” In the 1922 paper in which he first criticized Pearson’s work, Fisher showed that, for the case of comparing two proportions, Pearson had gotten the value of that parameter wrong.
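The comparison of observed to predicted values that Pearson turned into a statistic can be sketched directly. The die-roll counts below are invented for illustration; the formula itself, the sum of (observed − expected)² / expected over the cells, is Pearson’s.

```python
# Sketch of Pearson's chi-square goodness-of-fit statistic:
# chi2 = sum over cells of (observed - expected)^2 / expected.
# The die-roll counts are invented for illustration.

observed = [18, 22, 16, 25, 19, 20]   # counts of faces 1..6 in 120 rolls
expected = [sum(observed) / 6] * 6    # 20 per face, if the die is fair

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With 6 cells there are 6 - 1 = 5 degrees of freedom; the 5 percent
# critical value of the chi-square distribution with 5 df is about 11.07,
# so a chi2 of 2.5 gives no reason to reject the "fair die" hypothesis.
print(round(chi2, 2))  # 2.5
```

Because the distribution of this statistic does not depend on the type of data, the same critical-value table serves for dice, heartbeats, or anything else, which is exactly the universality Pearson proved.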
But just because he made a mistake in one small aspect of his theory is no reason to denigrate Pearson’s great achievement. Pearson’s goodness of fit test was the forerunner of a major component of modern statistical analysis. This component is called “hypothesis testing,” or “significance testing.” It allows the analyst to propose
two or more competing mathematical models for reality and use the data to reject one of them. Hypothesis testing is so widely used that many scientists think of it as the only statistical procedure available to them. The use of hypothesis testing, as we shall see in later chapters, involves some serious philosophical problems.
Suppose we wish to test whether the lady can detect the difference between a cup in which the milk has been poured into the tea and one in which the tea has been poured into the milk. We present her with two cups, telling her that one of the cups is tea into milk and the other is milk into tea. She tastes and identifies the cups correctly. She could have done this by guessing; she had a 50:50 chance of guessing correctly. We present her with another pair of the same type. Again, she identifies them correctly. If she were just guessing, the chance of this happening twice in a row is ¼. We present her with a third pair of cups, and again she identifies them correctly. The chance that this has happened as a result of pure guesswork is 1/8. We present her with more pairs, and she keeps identifying the cups correctly. At some point, we have to be convinced that she can tell the difference. Suppose she was wrong with one pair. Suppose further that this was the twenty-fourth pair and she was correct on all the others. Can we still conclude that she is able to detect a difference? Suppose she was wrong in four out of the twenty-four? Five of the twenty-four?
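The guessing calculation behind these questions is a straightforward binomial sum: under pure guessing, each pair is a 50:50 call, so the number of correct identifications in n pairs follows a binomial(n, ½) distribution. A short sketch:

```python
from math import comb

# Under pure guessing, the count of correct identifications in n pairs
# is binomial(n, 1/2). The probability of doing at least as well as
# observed, by guesswork alone, is:

def p_at_least(k, n):
    """Probability of at least k correct out of n pairs by pure guessing."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

print(p_at_least(1, 1))    # one pair right: 0.5
print(p_at_least(2, 2))    # two in a row: 0.25
print(p_at_least(3, 3))    # three in a row: 0.125
# Wrong on only one of twenty-four pairs, right on the rest:
print(p_at_least(23, 24))  # about 1.5 in a million
# Wrong on four, and on five, of the twenty-four:
print(p_at_least(20, 24))  # about 0.0008
print(p_at_least(19, 24))  # about 0.0033
```

Even four or five misses in twenty-four pairs leaves pure guessing as a very improbable explanation, which is why, at some point, we have to be convinced she can tell the difference.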
Hypothesis, or significance, testing is a formal statistical procedure that calculates the probability of what we have observed, assuming that the hypothesis to be tested is true. When the observed probability is very low, we conclude that the hypothesis is not true. One important point is that hypothesis testing provides a tool for rejecting a hypothesis. In the case above, this is the hypothesis
that the lady is only guessing. It does not allow us to accept a hypothesis, even if the probability associated with that hypothesis is very high.
Somewhere early in the development of this general idea, the word significant came to be used to indicate that the probability was low enough for rejection. Data became significant if they could be used to reject a proposed distribution. The word was used in its late-nineteenth-century English meaning, which is simply that the computation signified or showed something. As the English language entered the twentieth century, the word significant began to take on other meanings, until it developed its current meaning, implying something very important. Statistical analysis still uses the word significant to indicate a very low probability computed under the hypothesis being tested. In that context, the word has an exact mathematical meaning. Unfortunately, those who use statistical analysis often treat a significant test statistic as implying something much closer to the modern meaning of the word.
R. A. Fisher developed most of the significance testing methods now in general use. He referred to the probability that allows one to declare significance as the “p-value.” He had no doubts about its meaning or usefulness. Much of Statistical Methods for Research Workers is devoted to showing how to calculate p-values. As I noted earlier, this was a book designed for nonmathematicians who want to use statistical methods. In it, Fisher does not describe how these tests were derived, and he never indicates exactly what p-value one might call significant. Instead, he displays examples of calculations and notes whether the result is significant or not. In one example, he shows that the p-value is less than .01 and states: “Only one value in a hundred will exceed [the calculated test statistic] by chance, so that the difference between the results is clearly significant.”
The closest he came to defining a specific p-value that would be significant in all circumstances occurred in an article printed in the Proceedings of the Society for Psychical Research in 1929. Psychical research refers to attempts to show, via scientific methods, the existence of clairvoyance. Psychical researchers make extensive use of statistical significance tests to show that their results are improbable in terms of the hypothesis that the results are due to purely random guesses by the subjects. In this article, Fisher condemns some writers for failing to use significance tests properly. He then states:
In the investigation of living beings by biological methods, statistical tests of significance are essential. Their function is to prevent us being deceived by accidental occurrences, due not to the causes we wish to study, or are trying to detect, but to a combination of many other circumstances which we cannot control. An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.
Note the expression “knows how to design an experiment … that … will rarely fail to give a significant result.” This lies at the heart of Fisher’s use of significance tests. To Fisher, the significance test makes sense only in the context of a sequence of experiments, all aimed at elucidating the effects of specific treatments. Reading through Fisher’s applied papers, one is led to believe that he used significance tests to come to one of three possible conclusions. If the p-value is very small (usually less than .01), he declares that an effect has been shown. If the p-value is large (usually greater than .20), he declares that, if there is an effect, it is so small that no experiment of this size will be able to detect it. If the p-value lies in between, he discusses how the next experiment should be designed to get a better idea of the effect. Except for the above statement, Fisher was never explicit about how the scientist should interpret a p-value. What seemed to be intuitively clear to Fisher may not be clear to the reader.
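The three-way reading of p-values attributed to Fisher above can be codified as a toy rule. The cutoffs .01 and .20 are the “usually” values given in the text, not thresholds Fisher ever wrote down as rules.

```python
# A toy codification of the three-way reading of p-values described in
# the text. The cutoffs .01 and .20 are the "usually" values reported
# there, not fixed rules of Fisher's.

def fisher_reading(p):
    """Return the conclusion Fisher is described as drawing from a p-value."""
    if p < 0.01:
        return "an effect has been shown"
    if p > 0.20:
        return "any effect is too small for an experiment of this size to detect"
    return "design the next experiment to get a better idea of the effect"

print(fisher_reading(0.005))
print(fisher_reading(0.50))
print(fisher_reading(0.05))
```

Note that the middle case is not a conclusion at all but a direction for the next experiment, consistent with Fisher’s insistence that significance tests make sense only within a sequence of experiments.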
We will come back to reexamine Fisher’s attitude toward significance testing in chapter 18. It lies at the heart of one of Fisher’s great blunders, his insistence that smoking had not been shown to be harmful to health. But let us leave Fisher’s trenchant analysis of the evidence involving smoking and health for later and turn to 35-year-old Jerzy Neyman in the year 1928.
Jerzy Neyman was a promising mathematics student when World War I erupted across his homeland in Eastern Europe. He was driven into Russia, where he studied at the University of Kharkov, a provincial outpost of mathematical activity. Lacking teachers who were up to date in their knowledge, and forced to miss semesters of schooling because of the war, he took the elementary mathematics he was taught at Kharkov and built upon it by seeking out
articles in the mathematics journals available to him. Neyman thus received a formal mathematics education similar to that taught to students of the nineteenth century, and then educated himself into twentieth-century mathematics.
The journal articles available to Neyman were limited to what he could find in the libraries of the University of Kharkov and later at provincial Polish schools. By chance, he came across a series of articles by Henri Lebesgue of France. Lebesgue (1875-1941) had created many of the fundamental ideas of modern mathematical analysis in the early years of the twentieth century, but his papers are difficult to read. Lebesgue integration, the Lebesgue convergence theorem, and other creations of this great mathematician have all been simplified and organized in more understandable forms by later mathematicians. Nowadays, no one reads Lebesgue in the original. Students all learn about his ideas through these later versions.
No one, that is, except Jerzy Neyman, who had only Lebesgue’s original articles, who struggled through them, and who emerged seeing the brilliant light of these great new (to him) creations. For years afterward, Neyman idolized Lebesgue, and, in the late 1930s, finally got to meet him at a mathematics conference in France. According to Neyman, Henri Lebesgue turned out to be a gruff, impolite man, who responded to Neyman’s enthusiasm with a few mutterings and then turned and walked away while Neyman was still talking to him.
Neyman was deeply hurt by this rebuff, and perhaps with this as an object lesson, always went out of his way to be polite and kind to young students, to listen carefully to what they said, and to engage them in their enthusiasms. That was Jerzy Neyman, the man. All who knew him remember him for his kindness and caring manners. He was gracious and thoughtful and dealt with people with genuine pleasure. When I met him, he was in his early eighties, a small, dignified, well-groomed man with a neat white
moustache. His blue eyes sparkled as he listened to others and engaged in intensive conversation, giving the same personal attention to everyone, no matter who they were.
In the early years of his career, Jerzy Neyman managed to find a position as a junior member of the faculty of the University of Warsaw. At that time, the newly independent nation of Poland had little money to support academic research, and positions for mathematicians were scarce. In 1928, he spent a summer at the biometrical laboratory in London and there came to know Egon S. Pearson and his wife, Eileen, and their two daughters. Egon Pearson was Karl Pearson’s son, and a more striking contrast in personalities is hard to find. Where Karl Pearson was driving and dominating, Egon Pearson was shy and self-effacing. Karl Pearson rushed through new ideas, often publishing an article with the mathematics vaguely sketched in or even with some errors. Egon Pearson was extremely careful, worrying over the details of each calculation.
The friendship between Egon Pearson and Jerzy Neyman is preserved in their exchange of letters between 1928 and 1933. These letters provide a wonderful insight into the sociology of science, showing how two original minds grapple with a problem, each one proposing ideas or criticizing the ideas of the other. Pearson’s self-effacing nature comes to the forefront as he hesitantly suggests that perhaps something Neyman had proposed might not work out. Neyman’s great originality comes out as he cuts through complicated problems to find the essential nature of each difficulty. For someone who wants to understand why mathematical research is so often a cooperative venture, I recommend the Neyman-Pearson letters.
What was the problem that Egon Pearson first proposed to Neyman? Recall Karl Pearson’s chi square goodness of fit test. He developed it to test whether observed data fit a theoretical distribution. There really is no such thing as the chi square goodness of fit test. The analyst has available an infinite number of ways to apply
the test to a given set of data. There appeared to be no criterion for how “best” to pick among those many choices. Every time the test is applied, the analyst must make arbitrary choices. Egon Pearson posed the following question to Jerzy Neyman:
If I have applied a chi square goodness of fit test to a set of data versus the normal distribution, and if I have failed to get a significant p-value, how do I know that the data really fit a normal distribution? That is, how do I know that another version of the chi square test or another goodness of fit test as yet undiscovered might not have produced a significant p-value and allowed me to reject the normal distribution as fitting the data?
Neyman took this question back to Warsaw with him, and the exchange of letters began. Both Neyman and young Pearson were impressed with Fisher’s concept of estimation based on the likelihood function. They began their investigation by looking at the likelihood associated with a goodness of fit test. The first of their joint papers describes the results of those investigations. It is the most difficult of the three classic papers they produced, which were to revolutionize the whole idea of significance testing. As they continued looking at the question, Neyman’s great clarity of vision kept distilling the problem down to its essential elements, and their work became clearer and easier to understand.
Although the reader may not believe it, literary style plays an important role in mathematical research. Some mathematical writers seem unable to produce articles that are easy to understand. Others seem to get a perverse pleasure out of generating many lines of symbolic notation so filled with detail that the general idea is lost in the picayune. But some authors have the ability to display complicated ideas with such force and simplicity that the development
appears to be obvious in their exposition. Only upon reviewing what has been learned does the reader realize the great power of the results. Such an author was Jerzy Neyman. It is a pleasure to read his papers. The ideas evolve naturally, the notation is deceptively simple, and the conclusions appear to be so natural that you find it hard to see why no one produced these results long before.
Pfizer Central Research, where I worked for twenty-seven years, sponsors a yearly colloquium at the University of Connecticut. The statistics department of the university invites a major figure in biostatistical research to come for a day, meet with students, and then present a talk in the late afternoon. Since I was involved in setting up the grant for this series, I had the honor of meeting some of the great men of statistics through them. Jerzy Neyman was one such invitee. He asked that his talk have a particular form. He wanted to present a paper and then have a panel of discussants who would criticize his paper. Because this was the renowned Jerzy Neyman, the organizers of the symposium contacted well-known senior statisticians in the New England area to constitute the panel. At the last minute, one of the panelists was unable to come, and I was asked to substitute for him.
Neyman had sent us a copy of the paper he planned to present. It was an exciting development, wherein he applied work he had done in 1939 to a problem in astronomy. I knew that 1939 paper; I had discovered it years before while still a graduate student, and I had been impressed by it. The paper dealt with a new class of distributions Neyman had discovered, which he called the “contagious distributions.” The problem posed in the paper began with trying to model the appearance of insect grubs in soil. The female insect flew about the field, laden with eggs, then chose a spot at random to lay the eggs. Once the eggs were laid, the grubs hatched and crawled outward from that spot. A sample of soil is taken from the field. What is the probability distribution of the number of grubs found in that sample?
The contagious distributions describe such situations. They are derived in the 1939 paper with an apparently simple series of equations.
This derivation seems obvious and natural. It is clear, when the reader gets to the end of the paper, that there is no other way to approach it, but this is clear only after reading Neyman. Since that 1939 paper, Neyman’s contagious distributions have been found to fit a great many situations in medical research, in metallurgy, in meteorology, in toxicology, and (as described by Neyman in his Pfizer Colloquium paper) in dealing with the distribution of galaxies in the universe.
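The grub story can be turned into a simulation sketch: a Poisson number of egg clusters falls within the soil sample, and each cluster contributes a Poisson number of grubs. This two-stage structure is the idea behind Neyman’s Type A contagious distribution; the particular rates below (1.5 clusters per sample, 2.0 grubs per cluster) are invented for illustration.

```python
import math
import random

# Simulation sketch of the grub model behind Neyman's contagious
# distributions: a Poisson number of egg clusters lands in the sample,
# and each cluster yields a Poisson number of grubs. The rates 1.5 and
# 2.0 are invented for illustration.

def poisson(mean, rng):
    """Draw from a Poisson distribution (Knuth's multiplication method)."""
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p < threshold:
            return k
        k += 1

def grubs_in_sample(cluster_rate, grubs_per_cluster, rng):
    """Total grubs in one soil sample under the two-stage model."""
    clusters = poisson(cluster_rate, rng)
    return sum(poisson(grubs_per_cluster, rng) for _ in range(clusters))

rng = random.Random(1)
draws = [grubs_in_sample(1.5, 2.0, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# Clustering makes the counts overdispersed: the variance (near 9 here)
# well exceeds the mean (near 3), unlike a plain Poisson, where the two
# would be equal.
```

That variance-exceeds-mean signature is what makes clustered counts, whether of grubs, of defects, or of galaxies, a poor fit for a simple Poisson distribution and a natural fit for the contagious family.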
After he finished his talk, Neyman sat back to listen to the panel of discussants. All the other members of the panel were prominent statisticians who had been too busy to read his paper in advance. They looked upon the Pfizer Colloquium as a recognition of honor for Neyman. Their “discussions” consisted of comments about Neyman’s career and past accomplishments. I had come onto the panel as a last-minute replacement and could not refer to my (nonexistent) previous experiences with Neyman. My comments were directed to Neyman’s presentation that day, as he had asked. In particular, I told how I had discovered the 1939 paper years before and revisited it in anticipation of this session. I described the paper, as best I could, showing enthusiasm when I came to the clever way Neyman had developed the meaning of the parameters of the distribution.
Neyman was clearly delighted with my comments. Afterward, we had an exciting discussion about the contagious distributions and their uses. A few weeks later, a large package arrived in the mail. It was a copy of A Selection of Early Statistical Papers of J. Neyman, published by the University of California Press. On the inside cover was the inscription: “To Dr. David Salsburg, with hearty thanks for his interesting comments on my presentation of April 30, 1974. J. Neyman.”
I treasure this book both for the inscription and the set of beautiful, well-written papers in it. I have since had the opportunity to talk with many of Neyman’s students and coworkers. The friendly, interesting, and interested man I met in 1974 was the man that they knew and admired.