8

Models of Probability

“All possible “definitions” of probability fall far short of the actual practice.”

Feller [F, p. 19]

8.1    General Remarks

Most authors of probability books present one version of probability as if it were the true and only one. On the contrary, we have carefully developed a sequence of models based on what seems to be a reasonably scientific approach. However, there are other significantly different models that need to be thought about, and the main purpose of this chapter is to present some of them. It should not surprise the reader that occasionally in the middle of my presentation of a model I defend the ones used in this book. Part of the explanation is that various models have different underlying motivations (often statistical), some of which are quite different from the ones this book uses; we assume that probability should provide a basis for reasonable actions in the real world and not just be admired. See [Ke] for a careful discussion of many models of probability.

“We need flexibility to understand the world around us, and we gain such flexibility by having flexible viewpoints stressing different aspects. This flexibility is not easily reconciled with a rigid axiomatic framework, such as may be found in a traditional course on rational mechanics.”

Gregory Wannier [Wa, p. 12]

We have stressed that the assignment of the initial probabilities is of fundamental importance in the applications of probability. Most texts blithely assume that this is not part of their problem; rather, as in most mathematical texts, where problems are posed in a formal mathematical setting, the probabilities are either actually given or strongly indicated. Rarely do even the “word problems” require much translation from the natural language into mathematical symbols—and they are usually highly artificial anyway. We are concerned with the use of probability. Although the background knowledge of most applications is fundamental to practical work, we are unable to include such material in this book.

Very early we noted the importance of decisions based on probability and developed an approach that seems to bear some relation to actions taken in the real world. Unfortunately, many applications rest on a far less firm foundation, yet the decisions taken in government, law, medicine, etc. constantly affect our lives for better or worse. It is therefore necessary to display clearly some of their assumptions—they are often not the same as those we have adopted and checked with various gambling and related situations; indeed many have significantly subjective elements in them. Science tries to be objective, though it cannot succeed in this as much as most people wish. Still, it is a goal that in the past has seemed highly desirable and highly productive. Perhaps in the future science may embrace a much more subjective approach—but to do so would mean the abandonment of the standard of reproducibility by others! Science has not always had perfect reproducibility—for example, in observational astronomy the locations of the planets are not exactly reproducible since the observatories are not in the same places on earth, nor are the planets in the same places in the sky at the various times of observation (their motions are not perfectly periodic). Still, astronomy, which is the earliest of the sciences, managed to develop under the twin handicaps of no experimental control and no exact reproducibility.

We have also stressed the robustness of our models. If a model is not robust then it is doubtful that it can be useful. It is a topic that by tradition has been greatly neglected in discussions of probability.

8.2    Review

We began with very simple situations, the toss of a coin, the roll of a die, or the draw of a card—all typical gambling situations where there is an underlying symmetry and the equipment has been carefully designed (for independence and a uniform distribution) and well tested by long experience. We assumed exact symmetry and argued that unless we assigned the same measure to each of the interchangeable events then we would be embarrassed by inconsistency. Thus we were forced, by the apparent symmetry, to assign the same probability to each interchangeable event. We also made the total probability exactly equal to 1.

From this simple model we derived the “weak law of large numbers,” which states that the average of a large number of independent binary trials is close to the originally assigned probability—probably! But this assumed the existence of the mean of the distribution (see Appendix 2.A). In Section 8.7 we will look at this point more closely.

This simple model can give only rational numbers for probabilities. If we think of tossing a thumb tack to see if it lands point up or not (Figure 1.10–1), then we are not prepared to compute any underlying symmetry and are forced to estimate this probability, which we believe exists independent of the actual trials—though perhaps you should consider this point carefully before going on (see Section 8.9). This probability involves not only the thumb tack, but also the surface on which it falls as well as the complex tossing mechanism. We must even consider the angular momenta about the various axes. For example, suppose you had a long rod with n symmetrically placed flat (planar) sides, each of the same size. In tossing the rod you would not expect it to come to rest on either end, but rather you would assign equal probability (due to the supposed symmetry) to each side face. But now think of the rod being shortened until it is a thin, flat disc—the shape of a coin. Now you would reject the toss ending on a side and you would assign equal probability to each face. When the rod is long, a large angular momentum about the long axis forces the rod to come to rest on one of the long sides and forbids the ends as final positions. With the coin it is the opposite. Hence there are lengths of the rod where the angular momenta affect the results significantly. In a randomization process the entire mechanism that makes the random selection requires close attention—which it gets in gambling places! In practice people are often careless about this point; it is usually hard to get a genuine random sample!

In passing from games to production lines, we may not know all the details of the process, but we have enough familiarity with such mechanisms to feel that the local interchangeability of the parts coming off the production line is often a reasonable assumption—but that there are also effects that change slowly with time, such as the wear and tear of parts of the machines, the slow changing of adjustments, etc. Indeed, it is exactly this which is the basis of Quality Control. We can expect small fluctuations from item to item, but any trends in the output (revealed by falling outside the control limits) should be examined and corrected before the product becomes unacceptable.

However, the items going through a production line and the patients going through a hospital, for example, are rather different. We have less belief in the interchangeability of the patients, and of the constancy of the hospital processes. The statisticians have a word for it, “stationarity,” which assumes that there are no significant changes in the statistics as a function of time. Now we face a couple of troubles—there is no symmetry to imply an underlying probability, and we can only wonder about the changing effects in time. For a hospital, and such situations, the quality control approach is hard to apply.

Thus we see that the well-founded, well-tested, simple model of probability based on gambling situations loses a lot of its reliability when we pass to other situations. Yet we continue to use probability arguments in medical and other social situations. We are forced to do so since we have nothing else, and to do nothing in the face of uncertainty is as positive an action as any other particular choice. Hence we must look at other probability models; the two quotations in Section 8.1 apply. Whether we like it or not, many different probability models must be examined at least briefly, and this is the purpose of the rest of this Chapter.

Finally, more than a few philosophers have stated that if a probability is less than some preassigned number (Buffon used 1/10,000) then the probability is 0. Whenever very high reliability of a large system is needed then the probability of failure of the individual components must be very small. Moreover, a low probability does not scale in time. Let the expected number of failures be one per day. Then the expected number of failures per microsecond is

$$\frac{1}{24 \times 60 \times 60 \times 10^{6}} = \frac{1}{8.64 \times 10^{10}} \approx 1.157 \times 10^{-11}$$

Clearly, for events in time the probability must depend on the interval of time used, and small probabilities over long times cannot be ignored. Given enough time anything that is possible will (probably) happen.
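To make the scaling concrete, here is a minimal numerical sketch (Python, added for illustration only; the trial counts are arbitrary) of the conversion above and of why an event Buffon would call negligible cannot be ignored over many trials.

```python
# A minimal sketch of the rate-scaling arithmetic above: an expected failure
# rate of one per day, re-expressed per microsecond, and the related question
# of how often a "one in 10,000" event happens over many independent trials.
microseconds_per_day = 24 * 60 * 60 * 10**6
rate_per_microsecond = 1 / microseconds_per_day
print(rate_per_microsecond)            # ~1.157e-11, as in the text

# Buffon's threshold: an event of probability 1/10,000 per trial looks
# negligible for one trial, yet over n independent trials the chance of at
# least one occurrence is 1 - (1 - p)**n, which approaches 1 as n grows.
p = 1 / 10_000
for n in (1, 10_000, 100_000):
    print(n, 1 - (1 - p)**n)
```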

8.3    Maximum Likelihood

“There comes a time in the life of a scientist when he must convince himself either that his subject is so robust from a statistical point of view that the finer points of statistical inference he adopts are irrelevant, or that the precise mode of inference he adopts is satisfactory.”

A. W. F. Edwards, [Ed, p. xi]

A method of assigning initial probabilities, similar to but different from maximum entropy, is that of maximum likelihood, which we will illustrate by some examples. We have used the maximum entropy criterion to determine a whole probability distribution; we will use the maximum likelihood criterion to determine a parameter in a distribution. It also resembles the weak law of large numbers in the sense that it assigns a probability based on past data.

Example 8.3–1    Maximum Likelihood in a Binary Choice

Given a simple binary choice (a Bernoulli trial) which you observe had k successes out of n trials (k ≤ n), what is a reasonable assignment for the probability p?

Evidently p can be any number between 0 and 1, and we must make a choice. Bernoulli himself suggested, based on the formula for k successes in n independent trials each with probability p, see equation (2.4–2),

$$b(k; n, p) = C(n,k)\,p^k q^{n-k}$$

that the value of p chosen should be the one that makes this probability a maximum, the maximum likelihood choice of p. How could one seriously propose to adopt a value of p which makes the actual observations highly unlikely as compared to some other value of p?

To find the maximum we differentiate with respect to the variable p, set the derivative equal to zero, and solve for the value of p. In this process we can neglect the constant coefficient C(n, k) in front. We get for the derivative

$$k p^{k-1} q^{n-k} + (n-k)\,p^k (1-p)^{n-k-1}(-1) = 0$$

or (if pq ≠ 0)

$$k(1-p) - (n-k)\,p = 0$$

Expand and cancel the like terms

$$k - kp - np + kp = k - np = 0$$

From this we get

$$p = k/n$$

This is the ratio of the successes to the total number of trials and is exactly what the normal person using the weak law of large numbers would suggest! In this simple example the maximum likelihood method agrees with common sense.

Notice that this result did not use the weak law of large numbers; it is a result of the method of maximum likelihood. In this case the two methods of assigning probabilities agree.
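As a quick numerical check of this example, the short sketch below (Python; the counts n = 20 and k = 7 are invented illustrative values, not from the text) evaluates the likelihood b(k; n, p) on a grid of p and confirms that the maximum falls at p = k/n.

```python
from math import comb

# Hypothetical data: k successes observed in n Bernoulli trials.
n, k = 20, 7

def likelihood(p):
    """Binomial likelihood b(k; n, p) = C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Evaluate on a fine grid of p in (0, 1) and pick the maximizer.
grid = [i / 10_000 for i in range(1, 10_000)]
p_hat = max(grid, key=likelihood)
print(p_hat, k / n)   # both about 0.35
```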

Example 8.3–2    Least Squares

Suppose we have made n independent selections (i = 1,2,…, n) from a Gaussian distribution

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - x_0)^2 / 2\sigma^2}$$

where we assume that the mean $x_0$ is not known but that the variance $\sigma^2$ is known. The likelihood of seeing this result for the n independent trials is the product of the probabilities of the individual observations, namely

$$L(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2)\cdots p(x_n) = \frac{1}{\sigma^n (2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - x_0)^2}$$

According to the maximum likelihood principle we want to maximize this, namely make the term in the exponent

$$\sum_{i=1}^{n}(x_i - x_0)^2$$

a minimum—and this is exactly the principle of least squares!

Thus the assumption of a Gaussian distribution with a given variance, along with the independence of the events and the principle of maximum likelihood, yields the well respected principle of least squares.
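Here is a small numerical sketch of this connection (Python; the sample values and σ are invented for illustration): the Gaussian log-likelihood with known σ is maximized at the same $x_0$ that minimizes the sum of squares, namely the sample mean.

```python
import math

# Hypothetical observations, assumed drawn from a Gaussian with known sigma.
xs = [4.9, 5.3, 5.1, 4.7, 5.6]
sigma = 0.4

def log_likelihood(x0):
    """Log of the product of Gaussian densities with mean x0."""
    return sum(-0.5 * ((x - x0) / sigma)**2
               - math.log(sigma * math.sqrt(2 * math.pi)) for x in xs)

def sum_of_squares(x0):
    return sum((x - x0)**2 for x in xs)

grid = [i / 1000 for i in range(4000, 6000)]
best_ml = max(grid, key=log_likelihood)     # maximum likelihood estimate
best_ls = min(grid, key=sum_of_squares)     # least squares estimate
print(best_ml, best_ls, sum(xs) / len(xs))  # all agree on the sample mean, 5.12
```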

Here we have a principle for picking a parameter value in a distribution. The principle tacitly assumes that there is a single peak (it is unimodal) in the probability density function, though of course it can be applied mechanically to any distribution. In Example 8.3–1 the principle assigned a single probability, and in Example 8.3–2 it assigned a parameter of a distribution. The maximum likelihood estimate is insensitive to any outliers of a distribution (unlike the mean, which is sensitive to them) and as such is a robust estimator.

Maximum likelihood differs from the earlier maximum entropy. Both criteria, in the places where we have used them—maximum entropy for whole distributions and maximum likelihood for parameter values of a distribution—have given reasonable results. Whether or not they generally give the same results would need much more exploration than we can give here.

Example 8.3–3    Scale Free

If we make a strictly monotone transformation of the independent variable from x to y, we do not affect the maximum likelihood estimate.

This simple fact follows from the standard formula

$$\frac{dF}{dx} = \frac{dF}{dy}\,\frac{dy}{dx}$$

When dF/dx = 0, since by hypothesis (strictly monotone) dy/dx cannot be zero, it follows that dF/dy must be zero.

This is important. Recall that in Chapter 6 (Uniform Probability Assignments) we worried about whether to assign the uniform distribution to the variable x, x², or x^{1/2}, for example. We gave a useful partial answer which applied to random variables that were generated by a modular reduction, but it is not universally applicable. The use of maximum likelihood removes this problem! The independence of scale is one of the attractive features of the maximum likelihood method.
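The following sketch (Python; it reuses the hypothetical Bernoulli counts from the earlier sketch) illustrates the invariance numerically: maximizing the likelihood over the log-odds y = ln(p/(1 − p)), a strictly monotone function of p, returns the same estimate p = k/n.

```python
import math

# Hypothetical data: k successes in n Bernoulli trials (illustrative values).
n, k = 20, 7

def likelihood_in_p(p):
    return p**k * (1 - p)**(n - k)          # constant C(n,k) omitted

def likelihood_in_y(y):
    p = 1 / (1 + math.exp(-y))              # invert the monotone map y = ln(p/(1-p))
    return likelihood_in_p(p)

p_grid = [i / 10_000 for i in range(1, 10_000)]
y_grid = [i / 100 for i in range(-800, 801)]

p_hat = max(p_grid, key=likelihood_in_p)
y_hat = max(y_grid, key=likelihood_in_y)
print(p_hat, 1 / (1 + math.exp(-y_hat)))    # both near k/n = 0.35
```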

8.4    von Mises’ Probability

Richard von Mises proposed a model for probability whose main aim was to provide a sound statistical basis for probability theory. “Thus, a probability theory which does not introduce from the very beginning a connection between probability and relative frequency is not able to contribute anything to the study of reality.” [vM p. 63]. This is the opposite of what we have done; we defined probability using symmetry and then used it, via the weak law of large numbers as a basis, such as it is, to get the usual probabilities used in statistics.

“I have merely introduced a new name, that of collective, for sequences satisfying the criterion of randomness.” [vM, p. 93]. Thus to be a collective means that the statistics of any proper subsequence of the original sequence must have in the limit the statistics you would reasonably expect from a random sequence. Proper subsequences include any that are prespecified, such as “take only those after the occurrence of three heads in a row,” but exclude those like “take only those outcomes which are heads,” that is, selections which depend on the actual outcomes of the events selected. This is a simple realization of the meaning of the words “independent trials.”

His justification for this assumption is that gambling systems have been shown over many, many years to be ineffective, and hence, he says, the collective represents reality.

My objections to this approach are many. First, while it is commendable to use reality to form the postulates, it also apparently limits the applications to gambling-like situations. For applications to other areas, how would we demonstrate this remarkable property that the sequence of events forms a collective? The apparently equivalent assumptions of the independence of the trials and the constancy of the probability seem more reasonable.

Second, as any theoretician will admit, when tossing a coin a run of 1000 heads in a row will occur sooner or later; indeed in any infinite sequence of trials a run of 1000 or more heads will occur infinitely often! Hence a collective representing coin tosses (zeros and ones) remains a collective if I put 1000 heads (ones) before the original collective. Theoretically in the long run this front end will be swamped and will not affect the assumed statistics of the subsequences—but the long run is painfully long! Most people would immediately reject such a sequence as not being random (a collective). What is wanted in practice is a pasteurized, homogenized, certified random sequence which does not start out with an unlikely fluctuation, but rather allows big fluctuations, if at all, very late in the sequence—just enough to get the effects of randomness. The purely mathematical criterion of randomness does not meet practical needs. Thus in general the long run ratio of successes/total trials approaching the probability is dubious in practice (where we are restricted to comparatively few trials).

If, as in the previous paragraph, we faced a long run of, say, 1000 heads, we would probably soon adopt the hypothesis that Pr{head} = 1, and when we later came to the rest of the data we would probably revise our hypothesis for the value of the probability. Thus in practice we generally have in the back of our minds a dynamic theory of probability (see Section 8.10 on Bayesian statistics). As such it falls outside of this book, and belongs properly to statistics, but the two topics are so closely related that each cannot completely avoid the other without serious loss to itself.

My objection to postulating things that you cannot hope to measure in practice is a broad objection, and includes most theorems whose hypotheses cannot be reasonably known or verified by you for your application; it may be nice mathematics but it is of little use in practice.

This does not mean that the book he wrote [vM] is not worth reading. He was a very intelligent person and had thought long and hard about the problem of justifying statistics; more so than most statisticians! While this present book is not on statistics, everyone realizes that any statistics beyond the simple organizing and displaying of data must rest on some model of probability.

8.5    The Mathematical Approach

“Idealists believe that the postulates determine the subject; realists believe that the subject determines the postulates.”

In 1933 Kolmogoroff laid down some axioms for probability. Let E be a set of elements which are the elementary events, and let F be a collection of subsets of E; the elements of the set F are the corresponding events. He assumed

1.  F is a field of sets (closed under the operations of addition and product).

2.  F contains the set E.

3.  To every set A in F there is assigned a nonnegative real number P(A). This number P(A) is called the probability of the event A.

4.  P(E) = 1.

5.  When A and B are disjoint, then

P(A+B)=P(A)+P(B)

For infinite probability fields he further assumes that infinite sums and products of sets are sets.

6.  For every sequence of events in F

$$A_1 \supseteq A_2 \supseteq \cdots \supseteq A_n \supseteq \cdots$$

whose intersection is empty, we have

$$\lim_{n\to\infty} P(A_n) = 0$$

(5.5–3)

All of this amounts to assuming that the sets form a σ-algebra (a Borel field).
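For concreteness, here is a minimal sketch (Python; the die example is mine, not from the text) of axioms 1–5 on a finite sample space, where the infinite-additivity machinery of axiom 6 never enters.

```python
from itertools import combinations

# A finite sketch of axioms 1-5: E is the set of die outcomes, F is the
# collection of all subsets of E, and P(A) = |A|/6. (Axiom 6 only matters
# for infinite fields, so it does not arise here.)
E = frozenset(range(1, 7))

def powerset(s):
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

F = powerset(E)
P = lambda A: len(A) / len(E)

assert E in F                                            # axiom 2
assert all(P(A) >= 0 for A in F)                         # axiom 3
assert P(E) == 1                                         # axiom 4
for A in F:
    for B in F:
        assert (A | B) in F and (A & B) in F             # axiom 1: closure
        if not (A & B):                                  # disjoint events
            assert abs(P(A | B) - (P(A) + P(B))) < 1e-12 # axiom 5: additivity
print("axioms 1-5 hold for this finite field")
```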

As one examines these axioms one wonders how, in a particular application, one would verify whether or not they included the right things and excluded the wrong things. Deductions from aspects that are not in the reality being modelled might distort the conclusions. In all applications of probability there are always the two dangers: (1) of not including enough in the model, and (2) of including too much.

The Kolmogoroff approach is favored by mathematicians and leads immediately to measure theory. It gives an elegant theory with many interesting mathematical results, but one can only wonder to what extent the model represents reality. I have long said, “If whether an airplane will fly or not depends on some function that arose in the design being Lebesgue integrable but not Riemann integrable, then I would not fly in that plane.” Dare one act in this real world on such a model, and if so, when and under what circumstances would one be willing to risk one’s life on the results from the model?

If we assume that reality is what you can hope to describe, and what you can never hope to describe is not reality (as is customary, but not always true, in science), then with respect to the real number system measure theory ignores all of reality and contains only unreality. One can only wonder what can be expected from using it when applied to reality.

These are not facetious remarks—they raise a basic question of whether or not probability theory is to be regarded solely as a mathematical art form or mainly as a useful tool in dealing with the real world. Two eminent probabilists, Kac [K, p. 24] and Feller, have raised strong objections to the viewpoint that probability is a branch of measure theory; “Too many important details are lost” is the objection to the sweeping generality that overlooks the important special features of probability problems.

Some of the most quoted examples of the use of Lebesgue integration in probability theory seem on the face of them to be strictly non-applicable since they deny aspects of reality that we believe are right. Trajectories which have no direction at any place seem not to be realistic modelling of Brownian motion, for example, though perhaps the idealization is safe to use in many situations—how would you decide?

There is a small heretical (?) group of probabilists [Ra] who do not believe in the infinite additivity assumed in the mathematical approach of Kolmogorov. Their argument is that we have experience with the finite additivity of probability, and hence they believe in modelling only what we have some experience with. No one can possibly have direct experience with the assumed infinite additivity. As a result, finite additivity wipes out, among other things, Lebesgue integration, and this annoys professional probabilists who worry more about finding new theorems than about relevance to reality. Thus some purists assert, “Finite additivity is not part of probability theory.”

What are the dangers of assuming infinite additivity? Simply that you may be including effects that are not mirrored, even slightly, in reality. My favorite example of this sort of thing is the well known theorem, widely cited by mathematicians, that with ruler and compass (in classical Euclidean geometry) you cannot trisect an arbitrary angle. The fact is that with two marks on the ruler you can. Thus the truth or falsity of this theorem rests on a very slight difference (indeed, in a practical sense, an almost trivial difference) in the idealizations being made by the corresponding postulates. So too, say those who believe only in finite additivity, there will be theorems proved in the infinite-additivity model that are artifacts of this detail, and no amount of elegant mathematics can compensate for the induced lack of reality in what follows.

It was just such thinking that now drives me, as cited in the opening quotation of this Chapter, to a reconsideration of the probability models I have used in the past (sometimes without having really carefully thought before acting). It now seems to me that no single model of probability would have been appropriate to all the different situations I faced, and that probably no one model will do for the future either. Apparently it is necessary in every application to think through the relevance of the chosen probability model. Even the use of the real number system, with all its richness and known peculiar features, needs to be questioned regularly (see also Sections 5.1 and 8.13).

For “non-Kolmogorovian” probability models, and the role of Bayes’ theorem, see for example [D, pp. 297–330].

8.6    The Statistical Approach

In probability theory, strictly speaking, we go from initial probabilities and probability distributions to other derived ones. In statistics we often go from the observed distributions to other inferred prior distributions and probabilities. The model of probability you adopt is often relevant to what actions you later take, though most statistics books do not make plain and clear just what is being assumed. Unfortunately, many times the statistical decisions affect our lives, from health, medicine, pollution, etc. to reliability of power plants, safety belts vs. air bags, space flights, etc.—too numerous to mention in detail, and in any case the list is constantly growing.

The weak law of large numbers encourages one to think that the average of many repetitions will give an estimate of the underlying probability. But we have seen (as in Example 2.9–2) that the number of trials necessary to give an accurate estimate of the probability may be much more than we can ever hope to carry out. And there are further difficulties. We assumed the existence of the mean; suppose it does not exist! (see the next Section 8.7).

It is not that the author is opposed to statistics—there is often nothing else to try in a given situation—but the stated reliabilities that statisticians announce are often not realized in practice. Unfortunately, the foundations of statistics, and its applicability in many of the situations in which it is used, leave much to be desired, especially as statistics will probably play an increasingly important role in our society.

8.7    When the Mean Does Not Exist

When the mean does not exist then we cannot depend on the weak law of large numbers (see Appendix 2.A), and we clearly cannot depend on the usual statistical process of averaging n samples to get a good guess for the mean of the distribution. Indeed, there is a well known counterexample.

Example 8.7–1    The Cauchy Distribution

The Cauchy distribution

$$p(x) = \frac{1}{\pi(1+x^2)}$$

(8.7–1)

has the property that the distribution of the average of n samples has the same, original distribution.

To show this by elementary (though tedious) calculus we begin with the sum of two similar variables from Cauchy distributions, one from the Cauchy distribution with parameter a

$$p_1(x) = \frac{a}{\pi(x^2+a^2)}$$

and the second from the distribution with parameter b

$$p_2(x) = \frac{b}{\pi(x^2+b^2)}$$

The distribution of the sum of the two samples, one from each distribution, is given by the convolution integral

$$p(x) = \int_{-\infty}^{\infty} p_1(s)\,p_2(x-s)\,ds = \frac{ab}{\pi^2}\int_{-\infty}^{\infty}\frac{ds}{(s^2+a^2)\left[(x-s)^2+b^2\right]}$$

(8.7–2)

The reader who wants to skip the mathematical details of the integration may go directly to equation (8.7–9).

To integrate this we apply the standard method of partial fractions. From a study of the third degree terms in s, and the later integration steps, we are led to assume the convenient partial fraction form

$$\frac{1}{(s^2+a^2)\left[(x-s)^2+b^2\right]} = \frac{As+B}{s^2+a^2} + \frac{A(x-s)+C}{(x-s)^2+b^2}$$

(8.7–3)

Suppose for the moment that we have found the A, B, and C. We then do the integrations

$$\frac{ab}{\pi^2}\int_{-\infty}^{\infty}\left\{ \frac{As}{s^2+a^2} + \frac{B}{s^2+a^2} + \frac{A(x-s)}{(x-s)^2+b^2} + \frac{C}{(x-s)^2+b^2} \right\} ds$$

$$= \frac{ab}{\pi^2}\left\{ \frac{A}{2}\ln\left[\frac{s^2+a^2}{(x-s)^2+b^2}\right] + \frac{B}{a}\arctan\frac{s}{a} - \frac{C}{b}\arctan\frac{x-s}{b} \right\}\Bigg|_{s=-\infty}^{\infty}$$

The ln term drops out since at both ends of the range the argument is 1 and the function is continuous in the whole range. The first Arctan gives π and the second the same result only with a minus sign. Hence we have for the convolution p(x), equation (8.7–2),

$$p(x) = \frac{bB + aC}{\pi}$$

(8.7–4)

We now carry out the determination of the coefficients of the partial fraction expansion (8.7–3). Clearing equation (8.7–3) of fractions we have the identity in s

$$1 = (As+B)\left\{(x-s)^2+b^2\right\} + (Ax - As + C)\left\{s^2+a^2\right\}$$

(8.7–5)

The cubic terms in s cancel (as they should due to the form we assumed). The quadratic terms in s lead to

$$0 = -2Ax + B + Ax + C, \qquad \text{or} \qquad Ax - B - C = 0$$

(8.7–6)

Using s = 0 in equation (8.7–5) we get

$$1 = xa^2 A + (x^2+b^2)B + a^2 C$$

(8.7–7)

Using s = x in equation (8.7–5) we get

$$1 = xb^2 A + b^2 B + (x^2+a^2)C$$

(8.7–8)

Eliminate A (from equation (8.7–4) we do not need A) using equation (8.7–6), and set

$$x^2 + a^2 + b^2 = K.$$

We get for the two equations (8.7–7) and (8.7–8)

$$KB + 2a^2 C = 1$$

$$2b^2 B + KC = 1$$

The determinant of the system of equations

$$K^2 - 4a^2 b^2 = \left\{x^2 + (a-b)^2\right\}\left\{x^2 + (a+b)^2\right\}$$

is not zero, and we get for equation (8.7–4)

$$\frac{bB + aC}{\pi} = \frac{b(K - 2a^2) + a(K - 2b^2)}{\pi(K^2 - 4a^2 b^2)} = \frac{(a+b)(K - 2ab)}{\pi(K + 2ab)(K - 2ab)} = \frac{a+b}{\pi\left\{x^2 + (a+b)^2\right\}}$$

Thus we have, finally, for equation (8.7–2)

$$p(x) = \frac{a+b}{\pi\left\{x^2 + (a+b)^2\right\}}$$

(8.7–9)

This shows that the two parameters merely add in the convolution operation.
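As a numerical sanity check on equation (8.7–9), the following sketch (Python; the parameter values a = 1 and b = 2 are arbitrary) convolves the two Cauchy densities on a grid and compares the result with the closed form with parameter a + b.

```python
import math

# Numerically convolve two Cauchy densities with parameters a and b
# (values chosen arbitrarily) and compare against equation (8.7-9),
# which says the result is a Cauchy density with parameter a + b.
a, b = 1.0, 2.0
cauchy = lambda x, c: c / (math.pi * (x * x + c * c))

ds = 0.01
s_grid = [i * ds for i in range(-200000, 200001)]   # wide grid to cover the tails

for x in (0.0, 1.0, 5.0):
    numeric = sum(cauchy(s, a) * cauchy(x - s, b) for s in s_grid) * ds
    exact = cauchy(x, a + b)
    print(x, round(numeric, 5), round(exact, 5))    # the two columns agree closely
```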

With (8.7–9) as a basis for induction we conclude that the sum of n samples from the original Cauchy distribution equation (8.7–1) will have the parameter n. We want the average so we need to replace the variable x by x/n—and we get the original distribution back! Thus sampling from the Cauchy distribution and averaging gets you nowhere—one sample has the same distribution as the average of 1000 samples!

The reason for this is that the tails of the Cauchy distribution are large—and taking another sample means, among other things, that the new sample is too likely to fall far out and distort the computed mean of the samples.
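The practical consequence can be seen in a simulation sketch (Python; the sample sizes are arbitrary): running averages of Cauchy samples do not settle down the way averages of, say, Gaussian samples do.

```python
import math
import random

random.seed(1)

def cauchy_sample():
    """A standard Cauchy variate via the inverse CDF: tan(pi*(U - 1/2))."""
    return math.tan(math.pi * (random.random() - 0.5))

for n in (10, 100, 10_000, 1_000_000):
    cauchy_avg = sum(cauchy_sample() for _ in range(n)) / n
    normal_avg = sum(random.gauss(0, 1) for _ in range(n)) / n
    print(n, round(cauchy_avg, 3), round(normal_avg, 3))

# The Gaussian running averages shrink toward 0 as n grows; the Cauchy
# averages remain as spread out as a single draw, as (8.7-9) predicts.
```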

If we take a distribution with even larger tails we get even worse results. Suppose we take the distribution

$$p(x) = \frac{C(p)}{1 + |x|^p} \qquad (1 < p < 2), \quad (-\infty < x < \infty)$$

(8.7–10)

where C(p) is a suitable constant, which we now find, that makes the total probability = 1. By symmetry the integral of p(x) is

$$2C(p)\int_0^{\infty}\frac{dx}{1+x^p} = 1$$

Set $x^p = t$, that is, $x = t^{1/p}$. We get

$$\frac{2C(p)}{p}\int_0^{\infty}\frac{t^{1/p - 1}}{1+t}\,dt = \frac{2C(p)}{p}\cdot\frac{\pi}{\sin(\pi/p)} = 1$$

from a well known integral. Hence we have the C(p) of the distribution. But we see, at least intuitively, that for (1 < p < 2) the sampling and averaging process will give results that are worse (in the sense of computing an average as indicated by the weak law of large numbers) than taking one sample and using it alone!
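A quick numerical check of this normalization (Python; the exponent p = 3/2 is simply one illustrative choice in the range 1 < p < 2): compute C(p) = p sin(π/p)/(2π) from the result above and confirm that the density integrates to 1.

```python
import math

# Check the normalization constant C(p) = p*sin(pi/p)/(2*pi) derived above,
# for the illustrative exponent p = 3/2 (any 1 < p < 2 would do).
p = 1.5
C = p * math.sin(math.pi / p) / (2 * math.pi)

# Numerically integrate 1/(1+x^p) over [0, L], then add the analytic
# approximation of the tail, integral from L to infinity of x^(-p) dx.
L, dx = 1000.0, 0.001
body = sum(dx / (1 + (i * dx)**p) for i in range(int(L / dx)))
tail = L**(1 - p) / (p - 1)
print(2 * C * (body + tail))   # close to 1
```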

“In the modern theory variables without expectation play an important role and many waiting and recurrence times in physics turn out to be of this type. This is true even of the simple coin-tossing game.”

Feller [F, vol 1, p. 246]

Evolution seems to have given us the ability to recognize patterns with repetition when the weak law of large numbers applies, but when the source distribution of events has large enough tails we simply cannot see the pattern. This, in turn, suggests that in many social situations, say our own lives, we do not see the underlying patterns because the distribution of the apparently random events is too variable. Hence we need to think a little about what to do in similar situations. If we transform the independent variable (the way we measure the effect) by a nonlinear contraction then we may have an expected value in that variable and hence a weak law of large numbers—we must reduce the variability of the observations if we are to find patterns. For example, in equation (8.7–10) with p = 3/2 we can use

$$x = t^2$$

and work with the distribution density

$$\frac{C(3)}{1 + |t|^3}$$

When we use this probability density function the mean exists and we can invoke the weak law of large numbers. The actual contracting function to use depends on the distribution—too strong a contraction will remove all the details, and too weak a one will barely bring in the weak law of large numbers and the basis of statistics.

Exercises 8.7

8.7–1 Show that for the Cauchy distribution the probability of falling in the interval −1 ≤ x ≤ 1 is 1/2

8.8    Probability as an Extension of Logic

In logic statements are “yes-no,” and classical logic has no room for “degrees of truth.” Many people, beginning in the earliest years of probability theory, have tried using probability to extend logic to cover the degree of truth. Such problems as the lying of witnesses in court, and the number of persons to serve on a jury, for example, have been studied under various hypotheses as to reliability of the individuals.

It is evident that we do have degrees of belief, but it is less evident that they can be successfully combined into a formal system for manipulating the probabilities, let alone that the original quantification of the intuitive beliefs can be made reliable. It is not that the same rules of deduction that we use in gambling situations cannot be used; what is doubted is the degree to which you can expect to act successfully on such computations.

When you consider how difficult it is to get probability values from observed data, as the statistician is forced many times to do, one wonders just what faith people could have placed in social applications such as estimating the probability of lying or the size to use for a jury. What they did was to assume a certain model, and then make rigorous mathematical deductions from these assumptions without regard to the robustness of the model. For example, if a witness was found to have lied nine times out of ten in the past (and how would anyone ever verify this in practice?) then they would assign a probability of 1/10 for the truth of the next statement made by the witness. Most sensible people would ask about the self-interest of the witness in each statement before ever making any such judgement on a particular statement made. Similarly with juries. The idea that the individuals on a jury are totally independent as they make up their minds is foolish; rather there are, from jury to jury depending on local circumstances, correlations between some members. Too much depends on the particular circumstances, the actual jury members, the specific case, and the personalities of the lawyers and witnesses, to try to make any serious statements beyond wild guesses. The robustness of the model is simply being ignored.

But many well respected authors have tried this approach to probability, not only for external applications but also for personal internal reasoning. Again, one can only wonder how reliable the results of all the rigorous deductions can be where the input is so uncertain. It is famous that, as people, logicians are often foolish in their behavior, and that the behavior of the average person based on their intuition is often superior to that of the experts. Indeed, the very jury system of Anglo-Saxon law is a tacit admission that the experts are simply not to be trusted in some of their own areas of competence; that when it comes to innocence vs. guilt the intuition of the jury is preferable to the rigorous knowledge and experience of the judge for delivering justice.

One constantly sees proposals to use probabilistic logic in situations for which it is not suitable. On the other hand, such fields as operations research which try to use some common sense, in the hands of sensible people can get valuable results—but in the hands of idiots you get idiocy.

The reliability of the model of probability based on gambling and rigorous mathematics transmits little support to this area of application, but still users try to claim reliability for their results. One needs to be very suspicious and examine repeatedly the robustness at all levels of the discussion before acting on the deductions. Of course, this may lead to “the paralysis of analysis” and no action—which can also be dangerous!

Example 8.8–1    The Four Liars

There is a famous problem due to Eddington [E, p. 121]. If A, B, C and D each speak the truth once in three times, independently, and A affirms that B denies that C declares that D is a liar, what is the probability that D was speaking the truth?

Eddington explained his solution as follows:

We do not know that B and C made any relevant statements. For example, if B truthfully denied that C contradicted D, there is no reason to suppose that C affirmed D.

It will be found that the only combinations inconsistent with the data are:

(a) A truths, B lies, C truths, D truths

(b) A truths, B lies, C lies, D lies.

For if A is lying, we do not know what B said; and if A and B both truthed, we do not know what C said.

Since (a) and (b) occur respectively twice and eight times out of the 81 occasions, D’s 27 truths and 54 lies are reduced to 25 truths and 46 lies. The probability is therefore 25/71.

It is not the logic of the deduction (there are still disagreements on it, and the consensus may be against Eddington’s answer); it is the likelihood of the original assumptions that one wonders about. It is in fact a toy problem having no possible social implications. It is such applications of probability to logical situations that one has to wonder about.
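To check only the arithmetic of Eddington’s exclusions, not the disputed logic just mentioned, here is a brute-force enumeration sketch (Python). Each person is given three equally likely states, one truthful and two lying, which reproduces Eddington’s 81 occasions.

```python
from itertools import product

# Each of A, B, C, D has 3 equally likely states: one "truth" and two "lies",
# giving Eddington's 81 equally likely occasions. Remove his two excluded
# combinations (a) and (b) and compute Pr{D told the truth}.
def truthful(state):
    return state == 0          # state 0 = truth, states 1 and 2 = lies

d_truth = d_lie = 0
for a, b, c, d in product(range(3), repeat=4):
    excluded_a = truthful(a) and not truthful(b) and truthful(c) and truthful(d)
    excluded_b = truthful(a) and not truthful(b) and not truthful(c) and not truthful(d)
    if excluded_a or excluded_b:
        continue
    if truthful(d):
        d_truth += 1
    else:
        d_lie += 1

print(d_truth, d_lie, d_truth / (d_truth + d_lie))   # 25, 46, 25/71
```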

8.9    de Finetti

In the opening of his interesting two-volume work on probability de Finetti boldly asserts (as already noted in the preface of this book), “PROBABILITY DOES NOT EXIST.” Of course we need to ask what this provocative statement means, especially the “exists.” He also asserts that independence does not exist [KS p. 5] and that his probability is the only approach [KS p. 1]. He introduces the technical word “exchangeability” which he equates to “equivalence” [KS p. 5]. While perhaps useful it is admittedly nothing new; still it is an idea from which new things may be deduced.

Undergraduates late at night will argue about the existence of the real world, as do some philosophers. But if we take the pragmatic view that how people act is more important in revealing what they believe than are their words, then we see that people act as if the world existed. Those who deny the existence of the real world are still observed to eat when they are hungry. The author is what is commonly called “a naive realist.”

Of course all thought exists in the head (so it is widely believed), and we deal with mental models of the world, not with “reality”—whatever reality may mean. But if we assume that there are many kinds of probability, perhaps some exist in the head and some in the “real world.” Take, for example, some radioactive material in the lab connected to a Geiger counter and equipment that records each time a radioactive event occurs. When we later analyse the record (using, of course, mental models) we find that the equipment acted as if there were randomness in the material. We find, for example, that if we double the amount of material then the number of counts per unit time approximately doubles. Similarly, if we were to build a mechanical roulette wheel spinner and launcher of the pea, or a dice throwing machine, we would find, especially in the first case, that the results were as if they were random, and we would in common scientific practice infer that the randomness was “out there in the machine” and not in our head. We can “prove” nothing at all in practice.

To be careful, even Sir Isaac Newton knew that he only gave formulas for how gravity worked, and not why—“I do not make hypotheses” is what he said in this connection. But the extensive use of the idea of gravity, along with its many, many checks with reality, leads us to say, “Gravity is why things fall.” Of course there is no “why” in all of science; we put it there. We see only relationships and we infer the causes; similarly, there are only events and we infer the randomness.

In the above situation of the radioactive material, common sense suggests that we infer that the probability exists in the real world and not just in our heads—else we are arguing like undergraduates!

It was by such reasoning, after years of believing in de Finetti’s remark cited above, that I came to the conclusion that it is not practical science to adopt his view exclusively; that while at times the probability we are talking about exists only in our heads, there are times when it is wise to assume that it exists in the real world. It appears to me to be sophistry to claim otherwise.

On de Finetti’s side, if I toss a coin and ask you for the probability of “heads” you are apt to say 1/2, but if while it is still in mid air I say to you that it is a two-headed coin you will likely change your probability to 1. Nothing in the material world changed, only the state of your information; hence this probability apparently exists in your head. Situations such as horse races are likely to be best seen as having probabilities in your head and not in reality. Thus de Finetti’s viewpoint is useful many times. See Example 1.9–8.

To use for these mental probabilities the same formal apparatus of probability theory that we developed for gambling and similar situations brings some doubt as to the wisdom of acting as if they had anywhere near the same reliability. Yet, again, what else is there to do in such situations, except to be cautious when acting on the computed result?

8.10  Subjective Probabilities

There are a large number of different subjective probability theories. One theory, for example, holds that a probability is simply “betting odds.” The claim is made that asking you a series of questions on how you would bet in a certain situation, revising the odds at each stage, will lead to your probability of the event. Unfortunately, upon self-examination I find that after a few steps I get stubborn and stick to a single ratio. I simply cannot do what they ask of me, to continually refine my betting odds. Moreover, I seem to come to different probabilities under apparently similar conditions. Thus I do not believe much in the reliability of the probabilities that are obtained this way. See [Ka] for example.

Other subjectivists are more frank about things, and consider that any betting on a horse race is a matter of a local, transient opinion of an individual. Again, these are hardly firm probabilities, though they may be the best that can be obtained. But to then apply an elaborate, rigorous mathematical model of probability, using these estimates, and to then expect the results to be highly reliable seems to me to be folly. There may, however, be some situations where reasonable results can be obtained.

There is at present a strong movement called Bayesianism. It is much more connected with statistics than with probability and hence strictly falls outside the field of this book. Still, because of its prominence it is necessary to look briefly at it—though one of its practitioners, I. J. Good, claims that there are 46,656 varieties of Bayesians!

The Bayesians are the most noticeable of the subjective schools at present. They believe that you do not come to a situation with a completely blank mind, but that you have in your mind a prior distribution for the thing being examined. And in a sense we are all Bayesians. Suppose you and I were to do an experiment to measure the velocity of light, and when our first measurements were made we found them to be quite far from the currently accepted value. We would, almost surely, begin by reexamining the equipment, the reasoning, and the data reduction. But as we made more and more careful adjustments and calibrations we might finally decide that we believed the value we obtained even if it still disagreed with the older accepted value. We would pass, via experience, from one prior distribution of the possible values for the velocity of light to another, but except in carefully rigged situations it is not possible to quantify this process with much accuracy.

The Bayesians want us to quantify our prior distribution and use it as a guide for future acceptance (or not) of a result obtained from the experiment. But if the initial distribution of our belief is vague, then one naturally asks,

If the prior distribution, at which I am frankly guessing, has little or no effect on the result, then why bother; and if it has a large effect, then since I do not know what I am doing how would I dare act on the conclusions drawn?

As noted above, in practice we are all Bayesians at heart, but the rigorous accurate quantification of our prior beliefs is usually impossible because they are too vague to handle, so the theory is dangerous to use for important decisions. Furthermore, science has traditionally been “objective” and the Bayesian approach frankly admits “subjectivity” so that different people, doing the same experiments, may well get different results. Yes, the Bayesian approach at times seems reasonable, but not scientific.

Bayesian techniques are often misunderstood by non-professional Bayesians. Their announced posterior probabilities are not to be thought of as frequencies (how often something will or will not happen), but rather as degrees of belief. But when I ask myself how I will measure the effectiveness of their theory, I naturally turn to testing their predictions against what happens in the real world, how often they are right and how often they are wrong. But in doing so I am adopting the frequency approach which they deny is the meaning of the probabilities they give! One can only wonder about the confidence one can put in their approach. Yes, as we just said, we are all Bayesians at heart, but when it comes to actions in this world their approach, which at times may be the only one available, leaves much to be desired. Still we often must act without good information.

When you must take action and do not have the luxury of firm probabilities then there is little else to appeal to but some form of subjective probability. But again, after using elaborate, complex modelling, to assume that the final results are highly reliable (due to the use of elaborate mathematics) is foolish.

8.11  Fuzzy Probability

In 1965 L. Zadeh introduced the concept of fuzzy sets. The concept comes from the simple, but profound, observation that in the real world nothing is exactly what it appears to be, that there is not a sharp dividing line between trees and bushes, that “close enough” is not an exact statement, that a platypus is only partly a mammal, etc. Since then the idea of fuzzy sets has been extended to many fields, including probability, logic and entropy.

The very concept “fuzzy” meets with immediate hostility from the mathematically trained mind that believes that a symbol is a symbol and is exactly that, that each symbol can always be recognized exactly, etc. To be asked to adjust to the idea that nothing is certain, that all things are fuzzy, is rather a lot to expect from them.

Similarly, computer experts like to believe that they live in the classic Laplace universe where if you know exactly the starting conditions and the laws then a sufficiently able mind can calculate the future exactly. The ideal computing machine is a realization of this dream. Although programmers know in their hearts that any large program may have residual “bugs” (more accurately “errors”), still they seem to keep in their minds the concept that a program is exactly right and the machine always does exactly what it is supposed to do. Of course experience has shown again and again that not only are there no exact specifications for most interesting problems, and that large programs are apt to have residual errors in them, but also that the machines themselves are not infallible.

Many people have wanted to believe that fuzziness can be embraced in the concept of classical probability theory. But there is a profound difference between: (1) classical probability theory based on subsets of a sample space, and (2) fuzzy sets where even being in a subset of the sample space is not certain, and knowing which subset you are in is not sure. The first is probability, the second is fuzziness, and they are not equivalent. Quite the contrary, at least some people believe, mathematical logic is merely a part of fuzzy logic!

Once you admit that you cannot know exactly which set you are in, then there are the problems of the laws of noncontradiction and of the excluded middle. Since you can be both in a set and not in the set at the same time, the intersection of the set and its complement is not necessarily 0, nor is their sum necessarily the whole universe.

Is this tremendous mental adjustment worth the effort? It seems that the universe as we know it is indeed fuzzy, and hence it seems that we should give the idea a reasonable hearing—perhaps we might learn something useful! The distaste of accepting that nothing is sure, either in certainty of occurring or in what occurred, is very strong. I think that this accounts for the general rejection (“ignoring” is perhaps a better description) of the ideas, in spite of a small band of eager devotees who have embraced the concept.

8.12  Probability in Science

Scientists use probability theory in many places, and it is natural to wonder which of the many kinds they use. The answer is, of course, different kinds in different places.

Perhaps the classical example is the use of probability in molecular physics, in particular in the theory of perfect gases. Here one of the earliest spectacular results was Maxwell’s derivation of the distribution of the velocities of the molecules in a gas. Rather than discuss in detail what Maxwell said, let us look at the essential steps of the argument. We pass over certain simplifying assumptions such as that the molecules are perfect spheres with no intermolecular forces between them, moving independently of each other and independently in the three coordinates.

The argument begins with the assumption of a vector distribution for the velocities. He then argues that in his model between the collisions the velocities of the molecules are constant. Hence he must examine the changes produced by a typical collision. To simplify things in your mind, consider the relative velocity of one molecule when the other is regarded as stationary. He now assumes a uniform distribution of the axis of the moving molecule with respect to the center of the stationary molecule (there is to be a collision by assumption). He next examines the resulting distribution of velocities after the collision using classical mechanics, and then averages over all intersections and all vector velocities. From equilibrium he assumes that the input distribution of vectors must be the same as the resulting distribution after the collision—a self-consistent condition which is widely used in physics. From the resulting functional equation he deduces the velocity distribution. But so far as I have ever seen, though it seems to be obvious, there is no deduction that the uniform distribution of the axes is also true after the collision. Hence the self-consistency is not quite complete since two probability distributions went in and only one came out!

Now let us analyse a bit of what he was actually assuming. The assumed distribution was certainly not the velocity of a particular molecule at a particular time. Was it the assumed distribution of the many molecules in the volume of gas? That would have given a discrete distribution since there are a finite number of molecules in the assumed volume (which is not really discussed). How does he get to the smooth, analytic distribution he assumed? There are at least two paths. One is to suppose that he is looking at the ensemble of all possible molecules at all possible times, and another is to say that he is merely replacing a discrete distribution by a mathematically more convenient continuous distribution and that the difference must be slight. (Compare Example 5.6–4.)

Must he use arguments that it is uncertainty in initial conditions that produces the probability and that small differences lead almost immediately to large differences after a few collisions? This is one kind of widely used argument in physical applications, but as we have seen he has other arguments he could use which would imply other kinds of probability. In short, we are not sure, unless we examine the matter much more closely, of just what kinds (discrete or continuous density) of probabilities he assumed; but this is not unusual.

Genetics is another field which uses probability widely. The mixing of the genes could be regarded as a purely mechanical system about which we do not have the mechanical details, or it could be regarded as basically probabilistic as in quantum mechanics. Texts are not clear in the details, but then how could one decide which one? Hopefully most of the time the difference would have little effect on the predictions.

Returning to the use of probability in physics, the present Copenhagen interpretation of quantum mechanics claims that the probability is essentially present in Nature and is not due to ignorance. This probability, as we will see in the next Section, is a rather different kind from any we have so far discussed. The history of quantum mechanics shows that to this day there is not a uniform agreement on the meaning of their probability; the probability of the single event and the probability as a limiting ratio are both widely held opinions in the field.

In quantum mechanics “probability” is often called the “propensity,” apparently to emphasize that it is to be associated with the physical equipment—given an experimental setup C then we have the propensity for A to happen

P(A|C)

In quantum mechanics it is necessary to adopt this attitude so that the simultaneous measurement of conjugate variables cannot occur—the act of measurement in quantum mechanics is believed to affect the measured thing, hence using this notation the uncertainty principle cannot be violated since for the other measurement you have

P(A|C′)

where C’ is the corresponding experimental setup. [Ba].

Thus probability in quantum mechanics is different from that in other fields. In quantum mechanics probability is an undefined thing in the sense that it is simply the square of the absolute value of the corresponding wave function and does not have a more elementary description.

In summary, in physics and in science generally it is not always easy to decide just which of the many kinds of probability an author means (because generally speaking the author subconsciously believes there is only one kind of probability). The classical mechanical one, which supposes that probability arises from uncertainty in the initial conditions, is the most popular one, but clearly there are others being used at times.

8.13  Complex Probability

Mathematicians all seem to believe that probability is a real number between 0 and 1. But since 1925 quantum mechanics has used complex numbers as an underlying basis for the representation of probabilities (pre-probabilities if you wish to make the distinction) leading to the probability being the square of the absolute value (modulus) of the complex number and thereby to a real number. Throughout most derivations in quantum mechanics complex numbers are used, and it is only at the last possible moment that the absolute value is taken—indeed, the theory of quantum mechanics was almost completely developed before Born observed that the absolute values squared of the wave functions could be interpreted as probability (propensity) distributions.

Quantum mechanics has been, perhaps, our most successful theory as judged by the number of successful predictions it has made. Typically a wave function is represented in the form

$$\Psi(x) = u(x) + i\,v(x)$$

where x is a real variable. Ultimately |u(x) + iv(x)|² becomes the probability. Yet this kind of probability has been generally ignored in probability textbooks for over 50 years.

It seems unlikely that this is the only application of complex numbers in probability theory (beyond the standard characteristic function representation of distributions which we have not used). We therefore need to examine what is going on. It is useless to examine the history or standard development of quantum mechanics since, as noted above, probability was grafted on (perhaps better, recognized) at a very late stage in the creation of the basic theory.

As a useful analogy we consider Fourier series. It is a linear theory, meaning that the Fourier expansion of a linear combination of two functions is the same linear combination of the separate expansions. However, in applications of the Fourier series the central role is not played by the coefficients of the Fourier series but rather by the power spectrum which is either: (1) the sum of the squares of the two coefficients of the same frequency in the real domain, or (2) the square of the modulus in the case of the complex representation. Thus you cannot add the power spectra of two functions and expect to get the power spectrum of the sum, but you can add the coefficients of the corresponding terms in the Fourier representations, and then take the square of the modulus of the resulting complex numbers to get the power spectrum. Thus the physically important things in many of the applications of the theory are quadratic over a linear field of Fourier expansions.
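To make the analogy concrete, the sketch below (Python, with two arbitrary example signals and a naive transform written out for self-containment) shows that Fourier coefficients add linearly while power spectra do not.

```python
import cmath
import math

def dft(xs):
    """Naive discrete Fourier transform (adequate for a tiny example)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * math.pi * k * m / n) for m, x in enumerate(xs))
            for k in range(n)]

n = 8
f = [math.cos(2 * math.pi * m / n) for m in range(n)]          # one arbitrary signal
g = [math.cos(2 * math.pi * m / n + 1.0) for m in range(n)]    # same frequency, shifted phase

F, G, S = dft(f), dft(g), dft([a + b for a, b in zip(f, g)])

k = 1   # look at the shared frequency bin
print(abs(S[k]) ** 2)                  # power of the sum
print(abs(F[k]) ** 2 + abs(G[k]) ** 2) # sum of the powers: not the same
print(abs(F[k] + G[k]) ** 2)           # adding coefficients first does agree
```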

Similarly in quantum mechanics, which according to Feynman is absolutely linear (not an approximation), the physically important things (probabilities) are quadratic over the linear field of “wave functions,” which can be added and subtracted as required. It is the possible cancellations before the squaring of the absolute values that permit the “interference” effects that we see in optics and other physical phenomena.

What else is accomplished by using complex probability beyond the great convenience of a linear theory? If we look at correlation we see that it is related to the phase angle, the angle in the polar form of representing the complex numbers. To illustrate, suppose we represent heads and tails as complex quantities. In the absence, as yet, of formal operators (corresponding to those in quantum mechanics) we guess at the corresponding “wave functions,” the pre-probability representations in a vector notation using

Ψ=eiϕ=cosϕ+isinϕ

and ΨΨ* = 1 (star means complex conjugate), where φ is an arbitrary phase angle (as in Fourier series it is the choice of the origin that fixes the phase angles). Now consider a second coin, and suppose that the difference in phase angles is θ, with, for convenience only, the original φ = 0.

For a two coin toss we add the appropriate components as shown in the following table. The 1/4 in the last column comes from the normalizing process.

TABLE 8.12–1

Outcome          Ψ                ΨΨ*              Probability
HH               1 + e^{iθ}       2 + 2 cos θ      (1 + cos θ)/4
HT               1 − e^{iθ}       2 − 2 cos θ      (1 − cos θ)/4
TH               −1 + e^{iθ}      2 − 2 cos θ      (1 − cos θ)/4
TT               −1 − e^{iθ}      2 + 2 cos θ      (1 + cos θ)/4
                                  sum = 8

The table shows that for θ = 0 perfect correlation occurs; for θ = π/2 there is no correlation and the coins are independent; for θ = π there is a negative correlation; and finally for θ = 3π/2 there is again no correlation. For other angles we have other correlations.
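A short sketch (Python) of the table’s arithmetic: for a chosen phase difference θ it forms the four amplitudes, squares their moduli, normalizes by their sum (the 8 of the table), and reports the resulting correlation between the two coins, with heads coded +1 and tails −1.

```python
import cmath
import math

def two_coin_probs(theta):
    """Probabilities of HH, HT, TH, TT from the amplitudes in Table 8.12-1."""
    amps = {
        "HH": 1 + cmath.exp(1j * theta),
        "HT": 1 - cmath.exp(1j * theta),
        "TH": -1 + cmath.exp(1j * theta),
        "TT": -1 - cmath.exp(1j * theta),
    }
    weights = {k: abs(v) ** 2 for k, v in amps.items()}
    total = sum(weights.values())                 # this is the 8 in the table
    return {k: w / total for k, w in weights.items()}

def correlation(probs):
    """Correlation of the two +/-1 coin outcomes under the given probabilities."""
    value = {"H": 1, "T": -1}
    e_xy = sum(p * value[k[0]] * value[k[1]] for k, p in probs.items())
    e_x = sum(p * value[k[0]] for k, p in probs.items())
    e_y = sum(p * value[k[1]] for k, p in probs.items())
    return e_xy - e_x * e_y   # each coin is fair for every theta, so variances are 1

for theta in (0, math.pi / 2, math.pi, 3 * math.pi / 2):
    p = two_coin_probs(theta)
    print(round(theta, 3), {k: round(v, 3) for k, v in p.items()},
          round(correlation(p), 3))
# Correlations: +1, 0, -1, 0, matching the discussion in the text.
```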

If we wish to follow the quantum mechanical model, then the problem is, in each field of application, to find the appropriate operators (corresponding to the Hamiltonian, etc.) that will generate the “wave functions,” the distributions. In the above example I had to guess at a suitable representation. Evidently the problem of finding the proper operators is not trivial, but does suggest that the use of the complex number representation in probability may be useful.

Another approach to complex probability and quantum mechanics has been given by Landé [L]. He gave a derivation of quantum mechanics from some postulates about complex probability. The complaint sometimes made that the derivation does not prove the uniqueness is a trivial point since clearly more postulates (or else slightly altered ones) could probably be found to produce the uniqueness (but then again perhaps the fact that there are equivalent formulations of quantum mechanics means that uniqueness is not necessary or even possible in this field).

8.14  Summary

Chapter 1 noted that, “We speak of the probability of a head turning up on the toss of a coin, the probability that it will rain tomorrow, the probability that the next item coming down a production line is faulty, the probability that someone is telling a lie, the probability of dying from some disease, and even the probability that some theory, say evolution, special relativity, or the “big bang theory,” is correct.”

We now see that these various probabilities are determined in very different ways with very different philosophies and reliabilities; we see that “probability” is not a single word but is to be understood only in the context of the probability model being assumed.

Philosophers and scientists often claim that for the sake of clarity they want a single word to have a single meaning, yet we have used the word “probability” in many different senses. This is not, in fact, unusual in many fields; we speak of a “line” in both Euclidean and non-Euclidean geometry and mean different things. We could have, of course, used P1, P2, … for the various uses of probability, but found it cumbersome. In practice it often turns out that an author using probability will imply various meanings at various times without apparently being aware of it.

While it appears at first that for the various models the corresponding mathematical techniques are the same, we have shown that this is not so. In the infinite discrete sample space we found that it could lead to paradoxes like the St. Petersburg and Castle Point paradoxes, but if we required the sample space to be generated by a finite state diagram then these paradoxes could not occur. This again raises the question of when the standard mathematical tools can and cannot be used safely if we intend to take actions in the real world based on the results. It is not only the modelling of the physical situation that is involved; the probability model and the mathematical tools used are also involved (as well as the logic—see fuzzy sets for example). And as we have carefully argued (Section 7.1 on Entropy), just because you use the same mathematical expressions it does not follow that the interpretations are the same. Probability is indeed an interesting, complicated subject!

Why are there so many different probability theories? I believe that one major reason is the desire that the probability theory the author assumes must support the main ideas of statistics—in short how to get from the definition of probability of a single event to the frequency definition to support statistical techniques. The use of technical words

collective        random
exchangeability   equivalence
propensity        probability

does not really solve much, though they tend to eliminate unintended meanings.

Another major reason is that many applications are quite different from the typical gambling situations where probability theory first arose, and hence need other intellectual tools to handle them.

Why have we glossed over the more subjective types of probability? It is not that we do not use them more or less subconsciously (even if we have never heard of the corresponding theory!); rather it is that there are grave doubts that the intuitive elements can be realistically captured into a useful, reliable, reproducible body of knowledge ready for action in this harsh, real world of science and engineering, where important, serious actions involving large sums of money and possibly human lives must be taken based on probability calculations.

The fundamental difficulty with using the subjective types of probability is that success requires mature judgements based on experience, something that cannot be included in a first course in probability. Furthermore, science in the past has highly valued consistency, and subjective probability gives subjective results!