Chapter 3 Two Naturally Occurring Probability Distributions

In this chapter, you learn to recognize and describe the binomial and Poisson probability distributions in terms of their cumulative distribution functions and their expectations. You’ll learn when to apply these distributions in your future work along with methods for estimating their parameters using R.

3.1 DISTRIBUTION OF VALUES

Life constantly calls upon us to make decisions. Should penicillin or erythromycin be prescribed for an infection? Is one brand of erythromycin preferable to another? Which fertilizer should be used to get larger tomatoes? Which style of dress should our company manufacture (i.e., which style will lead to greater sales)?

My wife often asks me what appears to be a very similar question, “which dress do you think I should wear to the party?” But this question is really quite different from the others as it asks what a specific individual, me, thinks about dresses that are to be worn by another specific individual, my wife. All the other questions reference the behavior of a yet-to-be-determined individual selected at random from a population.

Is Alice taller than Peter? It won’t take us long to find out: We just need to put the two back to back or measure them separately.* The somewhat different question, “are girls taller than boys?” is not answered quite so readily. How old are the boys and girls? Generally, but not always, girls are taller than boys until they enter adolescence. Even then, what may be true in general for boys and girls of a specified age group may not be true for a particular girl and a particular boy.

Put in its most general and abstract form, what we are asking is whether a numerical observation X made on an individual drawn at random from one population will be larger than a similar numerical observation Y made on an individual drawn at random from a second population.

3.1.1 Cumulative Distribution Function

Let F_W[w] denote the probability that the numerical value of an observation from the distribution W will be less than or equal to w. F_W[w] is a monotone nondecreasing function, that is, a graph of F_W either rises or remains level as w increases. In symbols, if ε > 0, then 0 = F_W[–∞] ≤ F_W[w] = Pr{W ≤ w} ≤ Pr{W ≤ w + ε} = F_W[w + ε] ≤ F_W[∞] = 1.

Two such cumulative distribution functions, F_X and G_X, are depicted in Figure 3.1.

Figure 3.1 Two cumulative distributions F and G that differ by a shift in the median value. The fine line denotes F; the thick line corresponds to G.

1. Both distributions start at 0 and end at 1

2. F_X is to the left of G_X. As can be seen by drawing lines perpendicular to the value axis, F_X[x] > G_X[x] for all values of x. As can be seen by drawing lines perpendicular to the percentile axis, all the percentiles of the cumulative distribution G are greater in value than the percentiles of the cumulative distribution F.

3. Most of the time an observation taken at random from the distribution F_X will be smaller than if it is taken from the distribution G_X.

4. Still, there is a nonzero probability that an observation from F_X will be larger than one from G_X.
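The four observations above can be checked numerically. The sketch below assumes, purely for illustration, that F and G are normal distributions with the same spread whose medians differ by one unit; Figure 3.1 does not state the form of the distributions, and the names x, Fx, and Gx are ours.

```r
# Assumed for illustration: F is normal with median 0, G is normal with median 1
x  <- seq(-4, 5, by = 0.1)
Fx <- pnorm(x, mean = 0, sd = 1)
Gx <- pnorm(x, mean = 1, sd = 1)

# Item 1: both curves run from (nearly) 0 to (nearly) 1 over this range
range(Fx)
range(Gx)

# Item 2: F lies to the left of G, so F[x] >= G[x] at every x
all(Fx >= Gx)

# Items 3 and 4: if X comes from F and Y from G, independently, then X - Y
# is normal with mean -1 and variance 2
pr.X.smaller <- pnorm(0, mean = -1, sd = sqrt(2))   # Pr{X < Y}
pr.X.larger  <- 1 - pr.X.smaller                    # Pr{X > Y}
c(pr.X.smaller, pr.X.larger)
```

Under these assumptions, an observation from F is usually, but not always, the smaller of the two.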

Many treatments act by shifting the distribution of values, as shown in Figure 3.1. The balance of this chapter and the next is concerned with the detection of such treatment effects. The possibility exists that the effects of treatment are more complex than is depicted in Figure 3.1. In such cases, the introductory methods described in this text may not be immediately applicable; see, for example, Figure 3.2, in which the population represented by the thick line is far more variable than the population represented by the thin line.

Exercise 3.1: Is it possible that an observation drawn at random from the distribution F_X depicted in Figure 3.1 could have a value larger than an observation drawn at random from the distribution G_X?*

3.1.2 Empirical Distribution Function

Suppose we’ve collected a sample of n observations X = {x₁, x₂, … , x_n}. The empirical (observed) cumulative distribution function F_n[x] is equal to the number of observations that are less than or equal to x divided by the size of our sample n. If we’ve sorted the sample using the R command sort(X) so that x[1] ≤ x[2] ≤ … ≤ x[n], then the empirical distribution function F_n[x] is

F_n[x] = 0 if x < x[1],
F_n[x] = k/n if x[k] ≤ x < x[k+1],
F_n[x] = 1 if x ≥ x[n].

If the observations in the sample all come from the same population distribution F and are independent of one another, then as the sample size n gets larger, F_n will begin to resemble F more and more closely. We illustrate this point in Figure 3.3, with samples of size 10, 100, and 1000 all taken from the same distribution.

Figure 3.3 Three empirical distributions based on samples of size 10 (staircase), 100 (circles), and 1000 independent observations from the same population. The R code needed to generate this figure will be found in Chapter 4.

Figure 3.3 reveals what you will find in your own samples in practice: The fit between the empirical (observed) and the theoretical distribution is best in the middle of the distribution near the median and worst in the tails.
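The convergence of F_n toward F can be explored with R’s ecdf() function. This is a minimal sketch, not the code used for Figure 3.3: it assumes a standard normal population (the figure’s population is not stated), and the seed, the grid, and the name max.dev are arbitrary choices of ours.

```r
set.seed(1)   # arbitrary seed, so the sketch is reproducible
sizes <- c(10, 100, 1000)
grid  <- seq(-3, 3, by = 0.01)

max.dev <- sapply(sizes, function(n) {
  x  <- rnorm(n)    # sample of size n from the assumed population
  Fn <- ecdf(x)     # empirical cumulative distribution function
  max(abs(Fn(grid) - pnorm(grid)))   # largest gap between F_n and F on the grid
})
names(max.dev) <- sizes
max.dev
```

The largest gap between the empirical and theoretical curves tends to shrink as the sample size grows, although any single small sample may happen to fit well or badly.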

3.2 DISCRETE DISTRIBUTIONS

We need to distinguish the discrete random observations made when recording the number of events that occur in a given period from the continuous random observations made when taking measurements.

Discrete random observations usually take only integer values (positive or negative) with nonzero probability. That is, Pr{X = k} = f_k if k is an integer, and Pr{X = x} = 0 for every other value x.

The binomial and the Poisson are discrete distributions that are routinely encountered in practice.

Exercise 3.2: If X and Y are two independent discrete random variables, show that the expected value of X + Y is equal to the sum of the expected values of X and Y. Hint: Use the knowledge that

We can also show that the variance of the sum of two independent discrete random variables is the sum of their variances.
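Both statements about sums can be checked exactly for any pair of independent discrete variables by building their joint distribution. The sketch below uses two hypothetical binomial variables; the parameters and variable names are ours, and a numerical check of this kind is no substitute for the proof requested in Exercise 3.2.

```r
# Hypothetical example: X is B(3, 0.5) and Y is B(4, 0.2), independent
x <- 0:3; fx <- dbinom(x, 3, 0.5)
y <- 0:4; fy <- dbinom(y, 4, 0.2)

# Independence: Pr{X = x and Y = y} = Pr{X = x} * Pr{Y = y}
joint <- outer(fx, fy)
s     <- outer(x, y, "+")      # value of X + Y for each pair (x, y)

mean.sum <- sum(s * joint)
var.sum  <- sum((s - mean.sum)^2 * joint)

mean.x <- sum(x * fx); var.x <- sum((x - mean.x)^2 * fx)
mean.y <- sum(y * fy); var.y <- sum((y - mean.y)^2 * fy)

all.equal(mean.sum, mean.x + mean.y)   # expectation of the sum
all.equal(var.sum,  var.x + var.y)     # variance of the sum
```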

3.3 THE BINOMIAL DISTRIBUTION

The binomial distribution arises whenever we perform a series of independent trials each of which can result in either success or failure, and the probability of success is the same for each trial. Survey questions that offer just two alternatives, male or female, smoker or nonsmoker, like or dislike, give rise to binomial distributions.

Recall from the preceding chapter (Section 2.2.3) that when we perform n trials (flip a coin n times, mark n completed survey questions), each with a probability p of success, the probability of observing k successes is given by the following formula:

Pr{X = k | n, p} = C(n, k) p^k (1 – p)^(n – k),

where

C(n, k) = n!/(k!(n – k)!)

denotes the number of different possible rearrangements of k successes and n – k failures. The cumulative distribution function of a binomial is a step function, equal to zero for all values less than 0 and to one for all values greater than or equal to n.
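The binomial formula can be checked against R’s built-in dbinom() function. A minimal sketch, with hypothetical values of n and p; choose(n, k) returns the number of rearrangements of k successes and n – k failures.

```r
n <- 10; p <- 0.3            # hypothetical values
k <- 0:n

by.formula <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(by.formula, dbinom(k, n, p))

sum(by.formula)              # the probabilities of all outcomes sum to 1
pbinom(c(-1, 0, n), n, p)    # cdf: 0 below 0, Pr{X = 0} at 0, and 1 at n
```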

The expected number of successes in a large number of repetitions of a single binomial trial is p. The expected number of successes in a sufficiently large number of sets of n binomial trials each with a probability p of success is np.

*3.3.1 Expected Number of Successes in n Binomial Trials

Though it seems obvious that the mean of a sufficiently large number of sets of n binomial trials each with a probability p of success will be equal to np, many things that seem obvious in mathematics aren’t. Besides, if it’s that obvious, we should be able to prove it. Given sufficient familiarity with high school algebra, it shouldn’t be at all difficult.

If a variable X takes a discrete set of values {… , 0, 1, … , k, … } with corresponding probabilities {… , f₀, f₁, … , f_k, … }, its mean or expected value, written EX, is given by the summation … + 0f₀ + 1f₁ + … + kf_k + … , which we may also write as Σ_k k f_k. For a binomial variable, this sum is

EX = Σ_{k=0}^{n} k [n!/(k!(n – k)!)] p^k (1 – p)^(n – k).

Note that the term on the right is equal to zero when k = 0, so we can start the summation at k = 1. Factoring n and p outside the summation and using the k in the numerator to reduce k! to (k – 1)!, we have

EX = np Σ_{k=1}^{n} [(n – 1)!/((k – 1)!(n – k)!)] p^(k – 1) (1 – p)^(n – k).

If we change the notation a bit, letting j = k – 1 and m = n – 1, this can be expressed as

EX = np Σ_{j=0}^{m} [m!/(j!(m – j)!)] p^j (1 – p)^(m – j).

The summation on the right side of the equals sign is of the probabilities associated with all possible outcomes for the binomial random variable B(m, p), so it must be and is equal to 1. Thus, EX = np, a result that agrees with our intuitive feeling that in the long run, the number of successes should be proportional to the probability of success.
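The key step of the argument, that k times the probability of k successes in n trials equals np times the probability of k – 1 successes in n – 1 trials, can be verified numerically. A minimal sketch with hypothetical n and p:

```r
n <- 6; p <- 0.3             # hypothetical values; any n and p would do
k <- 1:n

# Term-by-term identity behind the change of notation (j = k - 1, m = n - 1)
all.equal(k * dbinom(k, n, p), n * p * dbinom(k - 1, n - 1, p))

# The remaining sum runs over every outcome of n - 1 trials, so it equals 1
sum(dbinom(0:(n - 1), n - 1, p))

# Hence the expected value equals n * p
expected <- sum(k * dbinom(k, n, p))
all.equal(expected, n * p)
```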

3.3.2 Properties of the Binomial

Suppose we send out several hundred individuals to interview our customers and find out if they are satisfied with our products. Each individual has the responsibility of interviewing exactly 10 customers. Collating the results, we observe several things:

When we reported these results to our boss, she only seemed interested in the first of them. “Results always vary from interviewer to interviewer, and from sample to sample. And the percentages you reported, apart from the 74% satisfaction rate, are immediate consequences of the binomial distribution.”

Clearly, our boss was familiar with the formula for k successes in n trials given in Section 2.2.3. From our initial finding, she knew that p = 0.74. Thus,

To find the median of this distribution, its 50th percentile, we can use R as follows:

The scientific notation used by R in reporting very small or very large values may be difficult to read. Try

for a slightly different view. The 3 asks for three places to the right of the decimal point. Any road, 4.924e-02 is the same as 0.04924.

The elements of vectors are numbered consecutively beginning at [1]. The mode of the distribution, the outcome with the greatest proportion shown above, is located at position 9 in the vector.
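The R listings for this passage are not reproduced here. The sketch below is one way to carry out the computations the text describes. The vector names binom.vals and binom.prop are the ones the text uses later in this section; the particular calls (dbinom, round, qbinom, which.max) are our assumption about how the values were obtained.

```r
binom.vals <- 0:10                           # possible numbers of satisfied customers
binom.prop <- dbinom(binom.vals, 10, 0.74)   # their probabilities when p = 0.74

binom.prop                          # small values print in scientific notation
round(binom.prop, 3)                # three places to the right of the decimal point

qbinom(0.50, 10, 0.74)              # the median (50th percentile)
which.max(binom.prop)               # position of the mode in the vector
binom.vals[which.max(binom.prop)]   # the mode itself
```

Because the vector starts at zero successes, position 9 corresponds to 8 satisfied customers.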

To find the mean or expected value of this binomial distribution, let us first note that the computation of the arithmetic mean can be simplified when there are a large number of ties by multiplying each distinct number k in a sample by the frequency f_k with which it occurs;

We can have only 11 possible outcomes as a result of our interviews: 0, 1, … , or 10 satisfied customers. We know from the binomial distribution the frequency f_i with which each outcome may be expected to occur, so that the population mean is given by the formula

EX = Σ_{i=0}^{10} i f_i.

For our binomial distribution, we have the values of the variable in the vector binom.vals and their proportions in binom.prop. To find the expected value of our distribution using R, type

This result, we notice, is equal to 10*0.74 or 10p. This is not surprising, as we know that the expected value of the binomial distribution is equal to the product of the number of trials and the probability of success at each trial.
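The computation just described can be sketched as follows. The vector names come from the text; the way they are built here with dbinom() is our assumption, since the original listing is not reproduced.

```r
binom.vals <- 0:10
binom.prop <- dbinom(binom.vals, 10, 0.74)

expected.value <- sum(binom.vals * binom.prop)   # each value weighted by its proportion
expected.value                                   # compare with 10 * 0.74
```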

Warning: In the preceding example, we assumed that our sample of 1000 customers was large enough that we could use the proportion of successes in that sample, 740 out of 1000 as if it were the true proportion in the entire distribution of customers. Because of the variation inherent in our observations, the true proportion might have been greater or less than our estimate.

Exercise 3.3: Which is more likely: Observing two or more successes in 8 trials with a probability of one-half of observing a success in each trial, or observing three or more successes in 7 trials with a probability of 0.6 of observing a success? Which set of trials has the greater expected number of successes?

Exercise 3.4: Show without using algebra that if X and Y are independent identically distributed binomial variables B(n,p), then X + Y is distributed as B(2n,p).

Unless we have a large number of samples, the observed or empirical distribution may differ radically from the expected or theoretical distribution.

Exercise 3.5: One can use the R function rbinom to generate the results of binomial sampling. For example, binom = rbinom(100,10,0.6) will store the results of 100 binomial samples of 10 trials, each with probability of 0.6 of success in the vector binom. Use R to generate such a vector of sample results and to produce graphs to be used in comparing the resulting empirical distribution with the corresponding theoretical distribution.

Exercise 3.6: If you use the R code hist(rbinom(200,10,0.65)), the resulting histogram will differ from that depicted in Figure 3.4. Explain why.

3.4 MEASURING POPULATION DISPERSION AND SAMPLE PRECISION

The variance of a sample, a measure of the variation within the sample, is defined as the sum of the squares of the deviations of the individual observations about their mean divided by the sample size minus 1. In symbols, if our observations are labeled X₁, X₂, up to X_n, and the mean of these observations is written as X̄, then the variance of our observations is equal to

Σ_{i=1}^{n} (X_i – X̄)²/(n – 1).

Both the mean and the variance will vary from sample to sample. If the samples are very large, there will be less variation from sample to sample, and the sample mean and variance will be very close to the mean and variance of the population.
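The definition of the sample variance can be compared with R’s var() function, which also divides by n – 1. A minimal sketch using a small hypothetical sample:

```r
x <- c(4, 7, 9, 10, 15)     # hypothetical sample
n <- length(x)

by.hand <- sum((x - mean(x))^2) / (n - 1)   # squared deviations about the mean
by.hand
all.equal(by.hand, var(x))                  # var() uses the same n - 1 divisor
```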

Sidebar: Why n – 1?

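Why do we divide the sum of the squared deviations about the mean by n – 1, rather than n? When our sample consists of a single observation, n = 1, there is no variation. When we have two observations, the sum of the squared deviations is

(X₁ – X̄)² + (X₂ – X̄)², where X̄ = (X₁ + X₂)/2.

But

(X₁ – X̄)² = (X₁ – (X₁ + X₂)/2)² = (X₁ – X₂)²/4 = (X₂ – X̄)².

That is, we really only have one distinct squared deviation.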

Exercise 3.7: What is the sum of the deviations of the observations from their arithmetic mean? That is, what is Σ_{i=1}^{n} (X_i – X̄)?

*Exercise 3.8: If you know calculus, use the result of Exercise 3.7 to show that Σ_{i=1}^{n} (X_i – μ)² is minimized if we choose μ = X̄. In other words, the sum of the squared deviations is a minimum about the mean of a sample.

The problem with using the variance as a measure of dispersion is that if our observations, on temperature, for example, are in degrees Celsius, then the variance would be expressed in square degrees, whatever these are. More often, we report the standard deviation σ, the square root of the variance, as it is in the same units as our observations.

Exercise 3.9: Recall the classroom data of Chapter 1, classdata = c(141, 156.5, 162, 159, 157, 143.5, 154, 158, 140, 142, 150, 148.5, 138.5, 161, 153, 145, 147, 158.5, 160.5, 167.5, 155, 137). Compute the data’s variance and standard deviation, using the R functions var() and sqrt().

The variance of a binomial variable B(n,p) is np(1 – p). Its standard deviation is √(np(1 – p)). This formula yields an important property that we will make use of in a later chapter when we determine how large a sample to take. For if the precision of a sample is proportional to the square root of its sample size, then to make estimates that are twice as precise, we need to take four times as many observations.
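The square-root relationship can be illustrated with the proportion of successes. If the number of successes in n trials has variance np(1 – p), then the proportion of successes (the count divided by n) has standard deviation √(p(1 – p)/n). The sketch below reads "precision" as the reciprocal of that standard deviation, which is our interpretation; the value of p and the function name sd.prop are hypothetical.

```r
p <- 0.74                                      # hypothetical probability of success
sd.prop <- function(n) sqrt(p * (1 - p) / n)   # sd of the proportion of successes

sd.prop(100)
sd.prop(400)
ratio <- sd.prop(100) / sd.prop(400)   # four times the observations
ratio                                  # halves the standard deviation
```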

3.5 POISSON: EVENTS RARE IN TIME AND SPACE

The decay of a radioactive element, an appointment to the U.S. Supreme Court, and a cavalry officer kicked by his horse have in common that they are relatively rare but inevitable events. They are inevitable, that is, if there are enough atoms, enough seconds or years in the observation period, and enough horses and momentarily careless men. Their frequency of occurrence has a Poisson distribution.

The number of events in a given interval has a Poisson distribution if

a. it is the cumulative result of a large number of independent opportunities, each of which has only a small chance of occurring; and

b. events in nonoverlapping intervals are independent.

The intervals can be in space or time. For example, if we seed a small number of cells into a Petri dish that is divided into a large number of squares, the distribution of cells per square follows the Poisson. The same appears to be true in the way a large number of masses in the form of galaxies are distributed across a very large universe.

Like the binomial variable, a Poisson variable only takes non-negative integer values. If the number of events X has a Poisson distribution such that we may expect an average of λ events per unit interval, then Pr{X = k} = λ^k e^(–λ)/k! for k = 0, 1, 2, … . For the purpose of testing hypotheses concerning λ as discussed in the chapter following, we needn’t keep track of the times or locations at which the various events occur; the number of events k is sufficient.
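The Poisson formula can be checked against R’s dpois() function. A minimal sketch; the value of λ is hypothetical and deliberately differs from the one used in the exercises.

```r
lambda <- 3                   # hypothetical average number of events per interval
k <- 0:10

by.formula <- lambda^k * exp(-lambda) / factorial(k)
all.equal(by.formula, dpois(k, lambda))

total <- sum(dpois(0:100, lambda))   # probabilities sum to 1, up to rounding error
total
```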

Exercise 3.10: If we may expect an average of two events per unit interval, what is the probability of no events? Of one event?

Exercise 3.11: Show without using algebra that the sum of a Poisson with the expected value λ₁ and a second independent Poisson with the expected value λ₂ is also a Poisson with the expected value λ₁ + λ₂.

3.5.1 Applying the Poisson

John Ross of the Wistar Institute held there were two approaches to biology: the analog and the digital. The analog was served by the scintillation counter: one ground up millions of cells then measured whatever radioactivity was left behind in the stew after centrifugation; the digital was to be found in cloning experiments where any necessary measurements would be done on a cell-by-cell basis.

John was a cloner, and, later, as his student, so was I. We’d start out with 10 million or more cells in a 10-mL flask and try to dilute them down to one cell per mL. We were usually successful in cutting down the numbers to 10,000 or so. Then came the hard part. We’d dilute the cells down a second time by a factor of 1 : 10 and hope we’d end up with 100 cells in the flask. Sometimes we did. Ninety percent of the time, we’d end up with between 90 and 110 cells, just as the Poisson distribution predicted. But just because you cut a mixture in half (or a dozen, or a 100 parts) doesn’t mean you’re going to get equal numbers in each part. It means the probability of getting a particular cell is the same for all the parts. With large numbers of cells, things seem to even out. With small numbers, chance seems to predominate.

Things got worse when I went to seed the cells into culture dishes. These dishes, made of plastic, had a rectangular grid cut into their bottoms, so they were divided into approximately 100 equal-size squares. Dropping 100 cells into the dish meant an average of 1 cell per square. Unfortunately for cloning purposes, this average didn’t mean much. Sometimes, 40% or more of the squares would contain two or more cells. It didn’t take long to figure out why. Planted at random, the cells obey the Poisson distribution in space. An average of one cell per square means that Pr{X = 0} = e^(–1) ≈ 0.368, Pr{X = 1} = e^(–1) ≈ 0.368, and Pr{X ≥ 2} = 1 – 2e^(–1) ≈ 0.264.

Two cells was one too many. A clone or colony must begin with a single cell (which subsequently divides several times). I had to dilute the mixture a third time to ensure the percentage of squares that included two or more cells was vanishingly small. Alas, the vast majority of squares were now empty; I was forced to spend hundreds of additional hours peering through the microscope looking for the few squares that did include a clone.
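The trade-off described in this account can be reproduced with dpois() and ppois(). A minimal sketch: the first part uses the average of one cell per square from the text, while the diluted average of 0.1 cells per square is a hypothetical value chosen only for illustration.

```r
# Average of one cell per square, as in the text
p.empty    <- dpois(0, 1)        # Pr{empty square}
p.single   <- dpois(1, 1)        # Pr{exactly one cell}
p.two.plus <- 1 - ppois(1, 1)    # Pr{two or more cells}
c(p.empty, p.single, p.two.plus)

# After a further dilution (hypothetical average of 0.1 cells per square)
p.two.plus.diluted <- 1 - ppois(1, 0.1)   # two or more cells: now rare
p.empty.diluted    <- dpois(0, 0.1)       # but most squares are empty
c(p.two.plus.diluted, p.empty.diluted)
```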

3.5.2 Comparing Empirical and Theoretical Poisson Distributions

Table 3.1 compares the spatial distribution of superclusters of galaxies with a Poisson distribution that has the same expectation. The fit is not perfect due both to chance and to ambiguities in the identification of distinct superclusters.

To solve the next two exercises, make use of the R functions for the Poisson, which are similar to those we used in studying the binomial and include dpois, ppois, qpois, and rpois.

Exercise 3.12: Generate the results of 100 samples from a Poisson distribution with an expected number of two events per interval. Compare the graph of the resulting empirical distribution with that of the corresponding theoretical distribution. Determine the 10th, 50th, and 90th percentiles of the theoretical Poisson distribution.

Exercise 3.13: Show that if Pr{X = k} = λ^k e^(–λ)/k! for k = 0, 1, 2, … , that is, if X is a Poisson variable, then the expected value of X, Σ_k k Pr{X = k}, is equal to λ.

Exercise 3.15: In subsequent chapters, we’ll learn how the statistical analysis of trials of a new vaccine is often simplified by assuming that the number of infected individuals follows a Poisson rather than a Binomial distribution. To see how accurate an approximation this might be, compare the cumulative distribution functions of a binomial variable, B(100,0.01) and a Poisson variable, P(1) over the range 0 to 100.

3.5.3 Comparing Two Poisson Processes

Recently, I had the opportunity to participate in the conduct of a very large-scale clinical study of a new vaccine. I’d not been part of the design team, and when I read over the protocol, I was stunned to learn that the design called for inoculating and examining 100,000 patients! 50,000 with the experimental vaccine, and 50,000 controls with a harmless saline solution.

Why so many? The disease at which the vaccine was aimed was relatively rare. In essence, we would be comparing two Poisson distributions. Suppose we could expect 0.8% or 400 of the controls to contract the disease, and 0.7% or 350 of those vaccinated to contract it. Put another way, if the vaccine was effective, we would expect 400/750th of the patients who contracted the disease to be controls. While if the vaccine was ineffective (no better and no worse than no vaccine at all), we would expect 50% of the patients who contracted the disease to be controls. The problem boils down to a single binomial.
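The reduction to a single binomial can be sketched in R. The counts 400 and 350 are the expected values quoted above. Treating the total of 750 cases as fixed, and asking how likely 400 or more controls would be if the vaccine were ineffective, is our illustrative choice and not the study’s actual analysis.

```r
expected.controls   <- 0.008 * 50000   # about 400 cases among the controls
expected.vaccinated <- 0.007 * 50000   # about 350 cases among the vaccinated
total <- expected.controls + expected.vaccinated

# Proportion of cases expected to be controls if the vaccine is effective
prop.controls <- expected.controls / total
prop.controls                          # 400/750

# If the vaccine were ineffective, each case would be a control with
# probability 0.5. Probability of 400 or more controls among 750 cases:
p.value <- 1 - pbinom(399, 750, 0.5)
p.value
```

With smaller groups, the expected counts shrink and the two hypotheses become much harder to tell apart, which is one way to see why so many patients were needed.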

Exercise 3.16: In actuality, only 200 of the controls contracted the disease along with 180 of those who had been vaccinated. What is the probability that such a difference might have occurred by chance alone?

3.6 CONTINUOUS DISTRIBUTIONS

The vast majority of the observations we make are on a continuous scale even if, in practice, we can only make them in discrete increments. For example, a man’s height might actually be 1.835421117 m, but we are not likely to record a value with that degree of accuracy (nor want to). If one’s measuring stick is accurate to the nearest millimeter, then the probability that an individual selected at random will be exactly 2 m tall is really the probability that his or her height will lie between 1.9995 and 2.0005 m. In such a situation, it is more convenient to replace the sum of an arbitrarily large number of quite small probabilities with an integral

∫_{1.9995}^{2.0005} dF[x] = ∫_{1.9995}^{2.0005} f[x] dx,

where F[x] is the cumulative distribution function of the continuous variable representing height, and f[x] is its probability density. Note that F[x] is now defined as

F[x] = ∫_{–∞}^{x} f[y] dy.*

As with discrete variables, the cumulative distribution function is monotone nondecreasing from 0 to 1, the distinction being that it is a smooth curve rather than a step function.

3.6.1 The Exponential Distribution

The simplest way to obtain continuously distributed random observations is via the same process that gave rise to the Poisson. Recall that a Poisson process is such that events in nonoverlapping intervals are independent and identically distributed. The times^† between Poisson events follow an exponential distribution:

F[t|λ] = 1 – exp[–t/λ] for t ≥ 0 and λ > 0.

When t is zero, exp[–t/λ] is 1 and F[t|λ] is 0. As t increases, exp [–t/λ] decreases rapidly toward zero and F[t|λ] increases rapidly to 1. The rate of increase is inversely proportional to the magnitude of the parameter λ. In fact, log(1 – F[t|λ]) = –t/λ. The next exercise allows you to demonstrate this for yourself.
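The behavior just described corresponds to F[t|λ] = 1 – exp[–t/λ], which can be compared with R’s pexp() function. Note that pexp() is parameterized by a rate, which in this notation is 1/λ. A minimal sketch, with a hypothetical value of λ and variable names of our own:

```r
lambda <- 2                          # hypothetical value of the parameter
times  <- seq(0, 10, by = 0.5)

F.t <- 1 - exp(-times / lambda)      # F[t|lambda] as described in the text
all.equal(F.t, pexp(times, rate = 1 / lambda))

# log(1 - F[t|lambda]) is a straight line in t with slope -1/lambda
all.equal(log(1 - F.t), -times / lambda)
```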

Exercise 3.17: Draw the cumulative distribution function of an exponentially distributed observation with parameter λ = 3. What is the median? Use the following R code:

# using R’s seq() command, divide the interval 0, 12 into increments 1/10th of a unit apart

*Exercise 3.18: (requires calculus) What is the expected value of an exponentially distributed observation with parameter λ?

The times between counts on a Geiger counter follow an exponential distribution. So do the times between failures of manufactured items, like light bulbs, that rely on a single crucial component.

Exercise 3.19: When you walk into a room, you discover the light in a lamp is burning. Assuming the life of its bulb is exponentially distributed with an expectation of 1 year, how long do you expect it to be before the bulb burns out? (Many people find they get two contradictory answers to this question. If you are one of them, see W. Feller, An Introduction to Probability Theory and Its Applications, 1966, vol. 1, pp. 11–12.)

Most real-life systems (including that complex system known as a human being) have built-in redundancies. Failure can only occur after a series of n breakdowns. If these breakdowns are independent and exponentially distributed, all with the same parameter λ, the probability of failure of the total system at time t > 0 is
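The formula that belongs here is not reproduced, so the following is only a hedged illustration. If the n breakdowns occur one after another, the time to system failure is the sum of n independent exponential times, which has a gamma distribution. The sketch assumes that λ is the expected time to a single breakdown, as in F[t|λ] = 1 – exp[–t/λ]; the values of λ and n and the variable names are hypothetical.

```r
lambda <- 1.5                       # assumed mean time to a single breakdown
n      <- 2                         # breakdowns needed before the system fails
times  <- seq(0, 10, by = 0.5)

# cdf of the sum of n independent exponential times: gamma with shape n
F.system <- pgamma(times, shape = n, scale = lambda)

# Closed form for n = 2: 1 - exp(-t/lambda) * (1 + t/lambda)
closed.form <- 1 - exp(-times / lambda) * (1 + times / lambda)
all.equal(F.system, closed.form)
```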

3.7 SUMMARY AND REVIEW

In this chapter, we considered the form of three common distributions, two discrete—the binomial and the Poisson, and one continuous—the exponential. You were provided with the R functions necessary to generate random samples, and probability values from the various distributions.

Exercise 3.20: Make a list of all the italicized terms in this chapter. Provide a definition for each one along with an example.

Exercise 3.21: A farmer was scattering seeds in a field so they would be at least a foot apart 90% of the time. On the average, how many seeds should he sow per square foot?

The answer to Exercise 3.1 is yes, of course; an observation or even a sample of observations from one population may be larger than observations from another population even if the vast majority of observations are quite the reverse. This variation from observation to observation is why before a drug is approved for marketing, its effects must be demonstrated in a large number of individuals and not just in one or two. By the same token, when reporting the results of a survey, we should always include the actual frequencies and not just percentages.

* If the answer to this exercise is not immediately obvious—you’ll find the correct answer at the end of this chapter—you should reread Chapters 1 and 2 before proceeding further.

* If it’s been a long while or never since you had calculus, note that the differential dx or dy is a meaningless index, so any letter will do, just as Σ_kf_k means exactly the same thing as Σ_jf_j.

^† Time is almost but not quite continuous. Modern cosmologists now believe that both time and space are discrete, with time determined only to the nearest 10^−23rd of a second, that’s 1/100000000000000000000000th of a second.