6

Uniform Probability Assignments

Small Causes—Large Effects

6.1    Mechanical Probability

Now that we have seen the central role of the uniform distribution in probability problems, we understand its importance and why we need to examine it more closely.

Based mainly on symmetry, we began by assigning initial uniform probability distributions for events. If some obvious symmetry is lacking, how can we justify assigning a uniform distribution? One answer, given by many people but most notably by Poincaré, is based on the idea that even in a perfect (classical) mechanical situation the initial conditions are rarely perfectly known, and often the result has a uniform distribution. (See also the modern theory of chaos.)

A simple example of randomness in the ideal physical world of classical mechanics is the thought experiment of dropping an ideal ball down a chute as consistently as possible. We imagine the ball falling onto a knife edge and then bouncing to one side or the other. We also imagine the result being tabulated and the ball returned to the device for another trial. Let us consider the results of many trials as a function of the displacement of the knife edge with respect to the center of the chute. When the knife edge is far to one side the ball will always fall on that side, and when it is far to the other side it will fall on the other. We are concerned with what happens when the knife is near the middle. As we imagine, in this ideal experiment, adjusting the equipment to get a 50–50 ratio for one side or the other, we expect to find positions which are close to this and where a slight movement one way or the other will change the ratio. We have, therefore, a conceptual mechanical device which can, apparently, produce random behavior. On which side the ball will fall in any particular trial is not predictable, but the average behavior is reasonably well predicted; randomness in the ideal physical world. This result is based on our feeling of continuity: as we make very small changes in the position of the knife edge the ratio of the number of events on the two sides cannot change suddenly from all on one side to all on the other, but must pass smoothly from one to the other. (In the extreme ideal world of exactness the exact middle position would have the ball bounce repeatedly on the knife edge and finally come to rest on it in a metastable state!)

We next consider the standard roulette wheel that is used in gambling. The wheel was apparently carefully designed to give uniform random results. We first make the assumption that the laws of classical mechanics are a perfect description of reality. The wheel is started rotating at a modest rate in one direction, and then a short time later a small ball is launched on the wheel going rapidly in the opposite direction. Due to the speed of the ball, its centrifugal force makes the ball go promptly out to the outer containing rim of the wheel, where it continues to run around the track. As time goes on friction removes some of the speed from the ball (and a slight amount from the much more massive wheel) and the ball gradually approaches the inner track where there are 38 (or 37) equally sized holes. The moment the ball hits the edge of a hole it bounces out away from the line of holes (the wheel is rotating against the direction of the ball) and then comes down at some other place to hit another edge of a hole, etc., until it finally has acquired approximately the rotational rate of the wheel and comes to rest in one of the 38 identical holes.

Consider, now, the mechanics of the system. There is the speed of the wheel, the phase of the rotation (the relative position) of the wheel at the moment of launching the small ball, the three coordinates of the ball at the point of launch from the hand, the three corresponding velocities, two numbers to describe the axis of rotation of the ball, and finally the rate of rotation of the ball itself at the moment of launch. We have mentioned, in total, 11 different numbers that are needed (there may be more) to describe the trajectory of the ball. Thus for each possible launching of the ball (in some suitable set of possible ranges) there is a point in an imagined 11-dimensional space. Nearby points in this space will have very different trajectories. There is a theorem in (ideal) mechanics which states that if the initial conditions of launch are varied continuously then the trajectory changes continuously; but as we imagine any one of the 11 parameters being changed very slightly we feel that the result will probably end up in another hole. Infinitesimal changes (almost in the mathematician’s sense!) produce significantly different trajectories.

Now let us fix in our mind the terminal hole, and look back at the situation. Where in the 11-dimensional sample space do the balls come from which end up in that hole? The answer is that scattered throughout the whole of the 11-dimensional space of admissible initial conditions there are many very small regions that end there; each of these very small regions will, however, have quite different trajectories from the other small regions.

No matter how we assign any reasonably smooth probability distribution over the launching space, we feel that each hole will have almost the same probability of the ball ending there as any other hole. We do not bother to prove this mathematically (but see the next section); we merely think of the almost trivial differences that will cause the ball to bounce at a slightly different angle, come off the rim at a slightly different place with respect to the position of the holes, etc. Indeed, we feel that a mote of dust could change the outcome, as could a whiff of air, the temperature of the ball, and many other effects that we left out of the discussion. We simply do not expect that we could reproduce the same end result no matter how hard we tried, even in a laboratory, and we cannot see any systematic effect preferring any one hole over another. Thus we come to the ideal uniform assignment of 1/38 for the probability of the ball ending up in any given hole.

There are three distinct things to be noted in this thought experiment. First, we have a situation in which the slightest differences, well below the level at which we can control things, lead to significantly different results. Second, the trajectories are reduced modulo the rotation of the wheel, and the same end results come from all over the space of initial conditions. This type of reduction is a central feature—the trajectories as trajectories do not come together, rather as time goes on they spread out. Thus the initial assignment of a probability distribution in the launching space is still visible in the trajectory space—it is the reduction modulo the rotation of the wheel that brings the uniformity, more or less independently of any reasonable initial probability distribution we might assume. Third, there is a final quantization—the ball must end up in a single hole—that further separates trajectories which started out very close to each other. In the current “chaos theory” it is the constant folding and bifurcation that produces the unpredictable end result.
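These three features are easy to exhibit numerically. The following toy simulation is a minimal sketch in Python; the friction model, the constant K, and all parameter values are invented for illustration and are not taken from any real wheel. The total angle the ball travels grows rapidly with launch speed (feature 1), is reduced modulo one revolution (feature 2), and is quantized into 38 holes (feature 3); even with the launch speed controlled to about a tenth of a percent, the holes come out nearly equally likely.

```python
import math
import random
from collections import Counter

N_HOLES = 38          # hypothetical wheel with 38 holes
K = 5000.0            # invented constant: total angle ~ K * v**2 (radians)

def final_hole(v):
    """Toy model of the three features.
    Step 1: tiny changes in v change the total angle by many full turns.
    Step 2: reduce the total angle modulo one revolution.
    Step 3: quantize the remaining phase into one of N_HOLES holes."""
    total_angle = K * v * v
    phase = total_angle % (2.0 * math.pi)
    return int(phase / (2.0 * math.pi) * N_HOLES) % N_HOLES

# Launch speed controlled only to about 0.1%: v = 5 plus small noise.
counts = Counter(final_hole(5.0 + random.gauss(0.0, 0.005))
                 for _ in range(100_000))
freqs = [counts[h] / 100_000 for h in range(N_HOLES)]
print(min(freqs), max(freqs))   # both close to 1/38, about 0.0263
```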

If we look further and consider what we believe we know about reality, then we are even more convinced of the irreproducibility of results. The world is made out of molecules (so we believe most of the time) and these are in constant, rapid motion—yet it is the collision of the molecules of the ball with those of the edge of a hole that determines the angle of rebound! A slight change of the angle due to chance positions of the molecules in an early collision might well alter the final position of the ball. We mentioned air drafts, motes of dust, and temperature (which could affect the size of the ball and hence its moment of inertia) among other possible effects. If we go further and invoke quantum mechanics, which in its popular Copenhagen interpretation claims that the universe itself has a basic uncertainty, then we are even more convinced that it is reasonable to assign a uniform distribution to the final positions—provided upon examination we can detect no significant differences in the sizes or positions of the holes—no lack of symmetry there.

When we turn to the rolling of dice in gambling, then things are much the same, though perhaps not so safe. The standard requirement is that the dice hit the surface of the table before they bounce off the end wall (which has been made to have many different angles). The dice bounce off the wall and back onto the surface of the table, and finally end with one of the six faces up. Again we have the three features: (1) very slight differences in the initial conditions produce large differences in the trajectories, (2) there is the reduction modulo the rotations of the dice, and (3) there is the final quantization to a single face on the top side. It would take a very accurate machine to produce repeatable results, but this time we are not quite so sure as we were for the roulette wheel that it could not be done—we wonder to what extent humans can control the roll of the dice sufficiently well to significantly affect the outcomes of the dice.

Not that the design of the device itself could not affect the probabilities of the various faces turning up! Loaded dice are a familiar idea. And one can imagine a roulette wheel with a small permanent magnet under, say, the 00 hole, and the core of the ball being of a ferrous metal. And ferrous balls could be interchanged with non-ferrous ones at the whim of the person running the wheel. Or the holes could be carefully shaped to make some more likely than others to catch the terminal end of a trajectory; a single bar between two holes that is unusually high can affect the probabilities of falling into the various holes. We are not concerned here with such designed cheating; we are concerned with the idea that a distribution which is very close to uniform can arise from a situation where: (1) very small differences in the initial conditions produce a very wide range of results, followed by (2) a reduction modulo something relatively small with respect to the range of trajectories that occur (a rotation of the wheel or the die), and then (3) a quantization into a few possible final states.

In shuffling cards the uniformity is much less secure and more easily controlled by practiced hands. Indeed, the common habit of the shuffler letting someone else “cut the deck” before dealing is a tacit admission of the possibility of controlling the dealt cards. There are also such matters as some cards having more ink on their faces than others, hence possibly different coefficients of slipperiness, etc.

It is not the aspect of cheating that we care about; rather it is the way situations can arise that call for the assignment of a uniform final distribution that is of interest. To repeat one more time, the essentials are: (1) small differences in the initial conditions give rise to large differences in the trajectories, (2) the reduction modulo something that is small with respect to the distribution of the endings of the trajectories viewed as total trajectories, and (3) the final quantization to a definite state.

The matter may be summed up by the heading of this chapter:

Small Causes—Large Effects.

For further material see [E].

6.2    Experimental Results

In this section we will examine three simple distributions reduced modulo 1 (which in the first and third cases is their variance). For example, consider a normal distribution with unit variance:

$$p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad (-\infty < x < \infty)$$

The reduction modulo 1 means that we sample the distribution at unit spacing and add all the values together for each possible phase shift. We will evaluate, therefore ($n$ is an integer and $x$ is the phase-shift displacement in a unit interval),

$$p(x) = \frac{1}{\sqrt{2\pi}} \sum_{n=-\infty}^{\infty} e^{-(x+n)^2/2}$$
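Such wrapped sums are easy to evaluate numerically. Here is a minimal sketch (Python; the truncation at |n| ≤ 20 and the printing format are our own choices) that computes the reduction modulo 1 for the three distributions of this section:

```python
import math

def wrapped(p, x, n_max=20):
    """Reduce the density p modulo 1: sum p(x + n) over the integers n.
    Truncating at |n| <= n_max is ample for these rapidly decaying tails."""
    return sum(p(x + n) for n in range(-n_max, n_max + 1))

normal = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
two_sided_exp = lambda t: 0.5 * math.exp(-abs(t))
sech_squared = lambda t: 0.5 / math.cosh(t) ** 2

for x in [i / 10 for i in range(10)]:
    print(f"{x:.1f}  {wrapped(normal, x):.9f}  "
          f"{wrapped(two_sided_exp, x):.3f}  {wrapped(sech_squared, x):.3f}")
```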

The results, along with those for two similar distributions, are given in the following table due to R. Pinkham, where x is the position in the unit interval of the new distribution after the modular reduction. Each of the three columns contains the sum of all the corresponding values of its distribution over the intervals that have been reduced modulo 1.

TABLE 6.2–1

      x        normal        (1/2) exp{−|x|}   (1/2) sech² x
     0.0   1.000 000 005         1.082             1.002
     0.1   1.000 000 004         1.037             1.002
     0.2   1.000 000 002         1.003             1.001
     0.3   0.999 999 998         0.979             0.999
     0.4   0.999 999 996         0.964             0.998
     0.5   0.999 999 995         0.959             0.998
     0.6   0.999 999 996         0.964             0.998
     0.7   0.999 999 998         0.979             0.999
     0.8   1.000 000 002         1.003             1.001
     0.9   1.000 000 004         1.037             1.002

  Sum of deviations:
           0.000 000 000         0.007             0.000

From the table we see that the reduction modulo 1 for the normal distribution is remarkably flat. Indeed, in view of the Gregory integration formula [H, p.310, p.348] and the structure of the deviations from 1, we believe we are looking at roundoff and the detailed structure of the function. The rapid falloff of the higher derivatives as we approach infinity in the normal distribution case shows why it is so close to 1. In the other two cases the flatness is still remarkable, especially for the two-sided exponential with its corner at the origin. The hyperbolic secant squared distribution is similar to the normal, but the flatness is not so spectacular because the higher derivatives do not fall off as rapidly as they do for the Gaussian distribution.

There is a classic paper on this topic written by R. Pinkham [P] which shows in great generality that this reduction modulo something results in very flat distributions.

When we see how flat these three standard distributions become under a comparatively large unit reduction (large, that is, compared to the reductions in the sample space of the roulette wheel’s initial conditions), we can be remarkably confident that the resulting distribution for a roulette wheel will be very, very close to uniform in the theoretical model—in reality the larger faults will probably lie in the physical realization of the wheel and its supposedly uniform holes.

Exercises 6.2

6.2–1 Take your favorite continuous probability distribution and do a similar reduction modulo its variance.

6.2–2 Repeat 6.2–1 but use a spacing of half the variance.

6.2–3 Try the reduction using a spacing of twice the variance, and discuss the results of the three Exercises.

6.2–4 Explain the structure observed in Table 6.2–1.

6.2–5 Compute the corresponding table for the probability distribution $p(x) = e^{-x}$, $(0 \le x < \infty)$.

6.3    Mathematical Probability

There are a number of purely mathematical results that bear on this question of how uniform distributions can arise. The simplest result, and perhaps the most central, is a version of Weyl’s theorem which states that if you: (1) take any irrational number a, (2) form all the multiples na, and (3) take the fractional part, then you will have a uniform distribution. These numbers may be written in the mathematical form

$$na - [na] = \text{fractional part of } na = \operatorname{frac}(na)$$

where $[\cdot]$ denotes the greatest integer in the argument.
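A quick numerical check of the theorem (a Python sketch; the choice of √2, the sample size, and the ten bins are our own):

```python
import math
from collections import Counter

alpha = math.sqrt(2)                          # any irrational number works
fracs = [(n * alpha) % 1.0 for n in range(1, 100_001)]

# Count how many fractional parts land in each of 10 equal bins.
bins = Counter(int(10 * f) for f in fracs)
for b in range(10):
    print(f"[{b / 10:.1f}, {(b + 1) / 10:.1f}): {bins[b] / len(fracs):.4f}")
# Every bin frequency comes out close to 0.1, as the theorem asserts.
```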

Weyl’s theorem has many important consequences. First, if $\log_{10} a$ is an irrational number then it follows that

$$a^n = 10^{n \log a} = 10^{[n \log a]} \cdot 10^{\operatorname{frac}(n \log a)}$$

has the fractional part of the exponent uniformly distributed. The integer part of the exponent determines the placing of the decimal point while the fractional part determines the digits of the result. Hence we have the mantissa part uniformly distributed up in the exponent.

To show that this means that we have the reciprocal distribution (see Section 5.4) for the numbers themselves, we argue as follows. First, for a random variable $X$ whose exponent $\log X$ is uniformly distributed we have

$$\Pr\{\log X < x\} = x, \qquad (0 \le x \le 1)$$

But this is the same statement as

$$\Pr\{X < 10^x\} = x$$

Now write $10^x = y$, $(1 \le y \le 10)$, or $x = \log y$. We have

$$\Pr\{X < y\} = \log y = (\ln y)/(\ln 10)$$

To get the probability density we differentiate with respect to y

$$p(y) = \frac{1}{y \ln 10}$$

But this is exactly the reciprocal distribution! (Sections 5.4 to 5.6). And conversely, the reciprocal distribution implies that the distribution is uniform in the fractional part of the exponent.
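This equivalence is easy to check by simulation (a Python sketch; the sample size and the test points are arbitrary choices of ours): draw the fractional exponent uniformly, form X = 10^U, and compare the empirical distribution with log₁₀ y.

```python
import math
import random

# X = 10^U with U uniform on (0, 1), so X lies in (1, 10).
samples = [10 ** random.random() for _ in range(200_000)]

for y in [2, 3, 5, 8]:
    empirical = sum(x < y for x in samples) / len(samples)
    print(f"Pr{{X < {y}}} ~ {empirical:.4f}   log10({y}) = {math.log10(y):.4f}")
```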

Another example, in a sense, of Weyl’s theorem is what is behind the standard random number generators. If we take an irrational number, the “seed number” a, and multiply it (up in the exponent this is an addition) by a suitable “multiplier,” then the fractional part of the product is a new seed number; and these repeated numbers (resembling the na − [na] above) will be uniformly distributed in the interval (0, 1). Of course for finite precision computation it is necessary to apply some careful number theory arguments to ensure that with appropriate choices we will get a uniform distribution over 1/4 of all the numbers that could possibly occur in the computer [H, p.138]. Thus Weyl’s theorem suggests that this method of random number generation is a reasonable one to try.
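In practice this idea appears as the multiplicative congruential generator, which does the multiply-and-reduce on integers rather than on an irrational number. A minimal sketch follows (the constants are the well-known Park–Miller “minimal standard” choices, used here purely as an illustration, not as the author’s example):

```python
M = 2**31 - 1      # modulus, a prime (Park-Miller "minimal standard")
A = 16807          # multiplier

def random_stream(seed, count):
    """Yield `count` pseudo-random numbers in (0, 1).
    Each step multiplies by A and reduces modulo M; the final
    division by M plays the role of taking the fractional part."""
    x = seed
    for _ in range(count):
        x = (A * x) % M
        yield x / M

print(list(random_stream(12345, 5)))
```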

Not only do the mantissas of the powers of typical numbers have the reciprocal distribution, so do many other sets of numbers. We will not prove it here, but the leading digits of the factorials of the integers, the Bernoulli numbers, the binomial coefficients, and even the coefficients $a_n$ of a Taylor series (provided there is only one singularity on the circle of convergence)

$$f(x) = \sum_{n=0}^{\infty} \frac{a_n (x - x_0)^n}{n!}$$

all satisfy the reciprocal distribution!
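A sketch that tallies leading digits (Python; we picked factorials and powers of 2, and the range n ≤ 500, as an illustration) and compares them with the reciprocal distribution log₁₀(1 + 1/d):

```python
import math
from collections import Counter

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

sequences = [("n!  for n = 1..500", [math.factorial(n) for n in range(1, 501)]),
             ("2^n for n = 1..500", [2 ** n for n in range(1, 501)])]

for name, seq in sequences:
    counts = Counter(leading_digit(v) for v in seq)
    print(name)
    for d in range(1, 10):
        print(f"  digit {d}: observed {counts[d] / len(seq):.3f},"
              f"  reciprocal law {math.log10(1 + 1 / d):.3f}")
```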

Exercises 6.3

6.3–1 Find the distribution of the fractional parts of the first 50 multiples of √2.

6.3–2 Similarly for √3.

6.3–3 Find the distribution of the leading digits of n! for n ≤ 50.

6.3–5 Examine a table of Fibonacci numbers and find the distribution of their leading digits.

6.3–6 Examine a table of Bernoulli numbers for the distribution of the leading digits.

6.3–7 Find the distribution of the leading digits of 2n for n = 1,2,…, 50.

6.3–8 Write the Fibonacci numbers in binary form and find the distribution of the first three leading digits.

6.3–9 Find the distribution of the leading digits of 10n when written in binary for n = 1,2,…, 50.

6.4    Logical Probability

Distribution of birthdays (see 1.9–6 for the original problem; 1.9–7 and 3.9–1 for robust versions; and 3.9–3 for a variant) may depend on the phases of the moon, as some maintain, but the earth and moon periods are irrationally related and the result is a rather uniform distribution when averaged over some years. But winter, spring, summer and fall are rationally related to the year. Only for events in the universe which do not seem to have any connection with the arbitrary rates of rotation of the earth about its axis and about the sun (which determine the length of the day and the year) do we feel that the day of the week on which, say, a supernova occurs is uniform (though, of course, the chance of its observation by humans may be affected by earthly details). Due to the seasons of the year we are not sure of the uniformity of birthdays throughout the year.

Thus we see that while we often argue for a kind of logical uniformity, it is not always safe reasoning. Still, very often we are unable to see any difference logically, and that brings up our earliest argument about assigning a measure of probability to interchangeable events—if we do not assign the same measure then we are logically embarrassed. We do not have to surrender to logic—after all, Emerson said, “A foolish consistency is the hobgoblin of little minds.”—but we are inclined to do so. Thus logical symmetry leads to a uniform distribution. However, later observed data that is not uniform may be an embarrassment!

We are actually discussing the famous principle of indifference, also called the principle of insufficient reason, though it should more accurately be called the principle of consistency. It states that if there is no reason for assuming a difference then we assign a uniform probability.

But does this principle require perfect knowledge of all possible relevant facts (hardly attainable in reality), or does it merely require that we are ignorant of any lack of symmetry and have exercised “due prudence”? In the latter case the idealist is inclined to be unhappy. This dilemma is a real difficulty and should not be ignored!

Philosophers, and some mathematicians, have had great fun mocking this principle of consistency—but almost always without regard to what it says! It says that if you have absolutely no reason for preferring A to B, for example, then it is reasonable (unless you want to be accused of inconsistency) to assign equal probabilities to each event. It is a hard, exacting principle; there must be no reasons that you are aware of for any difference. If there are slight differences, as there clearly are with regard to the six faces of a die, then you have to decide on an intuitive basis whether to go with the idealized die in your predictions or else to try to measure, in some fashion, the effects of six concave holes on one face while the opposite face has only one hole (thus apparently moving the center of gravity of the die slightly from the exact center). Similarly for the other pairs of opposite faces. You have then to carry out the computations with the slight biases. The fact that we play so many games using dice suggests that in practice the effects of these biases are very slight. (See the next Section and Example 7.5–2.)

This principle of indifference (or consistency) is very deep in our mental makeup, and the argument based on consistency is very compelling; if you can see no relevant difference (that is, if there appears to be perfect symmetry in some sense) then any assignment other than equal probabilities will get you into a contradiction. Of course it depends on your seeing the symmetry. If you were fooled then you have made a mistake. If another person does not see your symmetry, but possibly some other symmetry, then your assignment of probability and theirs will differ. We cannot get perfect objectivity, we can only isolate it so that it can be examined closely. It can also be said that if the cost of an error is low we do not look carefully; if the cost is high we should.

6.5    Robustness

Since we have been arguing that it is reasonable to assume that many distributions are uniform, we need to look at the question of how the answer changes when there are perturbations from the flat distribution. This is a natural question to ask in any application of mathematical results to the real world, where the exact hypotheses are seldom realized. We need to examine how “robust” the answers are. We have, in fact, already done a number of such examples (3.2–2, 3.4–2, 3.5–3, Section 3.9, and 4.4–2), and we find that when the biases from a uniform distribution are small then their effects are generally quadratic in the biases.

This is a very general principle. When you are at an extreme, as is the uniform distribution (see the next Chapter), then small deviations in the independent variables produce only quadratic effects in the performance (as one sees from any elementary calculus course). Note that we have not defined “small” since it depends on the particular situation. Normally the extreme is relatively broad in character and hence the result is “robust.” Once in a while there is a very narrow extreme, and in such cases it is seldom practical engineering to use such designs; slight changes in various parameters will greatly affect the performance. In probability theory the sum of the deviations of the probabilities must be zero, hence there will often be compensating effects which tend to make the problem robust with respect to small deviations from the uniform distribution.

In Section 3.9 we examined the problem of robustness in two cases, the robust birthday problem and the robust elevator problem. In the first case we used the Taylor expansion about the symmetric point: the constant term was the symmetric solution; the first order terms, being at a local maximum, always cancelled out; and the second order terms in the deviations from uniform gave a convenient formula for the perturbations about the uniform solution.

In other problems of the same general form we will often find that the second order derivatives with respect to the same variable are not all zero. By symmetry they will all, however, have the same value B for these at the symmetric point of evaluation, and the same value C for the cross derivatives. Again, expanding about the symmetric point 1/n, and using the relations

$$\sum_{k=1}^{n} e(k) = 0$$

$$\left\{ \sum_{k=1}^{n} e(k) \right\}^{2} = 0$$

we get

$$\sum_{k=1}^{n} e^{2}(k) = -\sum_{j \neq k} e(j)\,e(k)$$

Thus we have the useful formula for the expansion of the function $Q\{p(k)\}$:

$$Q\{p(k)\} = Q(1/n) + \tfrac{1}{2}(B - C)\, n\, \mathrm{VAR}\{e(k)\} + \cdots \qquad (6.5\text{–}1)$$

A large class of uniform distribution problems can be solved in this form—we have only to find the common value of all the second order (same) derivatives, and the common value of all the cross derivatives, the numbers B and C, and plug these into this general formula (6.5–1).
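As a worked instance (our own example, chosen for simplicity rather than taken from the text): let Q{p(k)} = Σ p(k)², the probability that two independent draws agree. Here every second derivative with respect to the same variable is B = 2 and every cross derivative is C = 0, so (6.5–1) gives Q = 1/n + n·VAR{e(k)}; for this quadratic Q the expansion is exact. The sketch below confirms it numerically.

```python
import random

n = 38
# Perturb the uniform distribution by small zero-sum errors e(k).
e = [random.uniform(-0.001, 0.001) for _ in range(n)]
mean = sum(e) / n
e = [ek - mean for ek in e]                     # enforce sum of e(k) = 0
p = [1 / n + ek for ek in e]

Q_exact = sum(pk * pk for pk in p)              # probability two draws agree
var_e = sum(ek * ek for ek in e) / n
B, C = 2.0, 0.0
Q_formula = 1 / n + 0.5 * (B - C) * n * var_e   # formula (6.5-1)
print(Q_exact, Q_formula)                       # the two values agree
```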

We have sketched out how to cope with the robustness of the answer in a class of probability problems, but of course not for all probability problems. We will later examine further cases of robustness; we have already examined robustness in Example 3.3–2 where we asked for solutions as a function of p when the original problem assumed a definite p value.

If we cannot handle robustness in probability theory then we will always have questions of safety in applications of the results. Thus robustness is a central, much neglected, topic in probability problems.

6.6    Natural Variables

We have been arguing that in many situations it is natural to expect that the unknown distribution will be uniform. An argument often given against this assumption is that if the variable $x$ is uniform $(0 < x < 1)$ then the variable $x^2$, or the variable $\sqrt{x}$, cannot be uniform. When the uniform distribution arises from a modular reduction this objection loses its relevance; the variable that is so reduced is the “natural variable” to assume is uniform.

It is not always easy to decide just how the reduction modulo some constant is done before you get to see the random variable, but in a sense that is what must be decided before assigning the uniform distribution (see Bertrand’s paradox, Example 5.3–1). Unfortunately statistics will tell you almost nothing about what the “natural variables” of the problem are, although multidimensional scaling and other similar statistical tools exist to help in this matter. We have seen that the mantissas of naturally occurring numbers are uniform in the fractional part of the exponent, and hence from the mathematical identity

$$e^{\ln x} = x$$

it is the log of the mantissas of the numbers that arise in computing that is the natural variable for floating point arithmetic.

In physics, and other fields, experience has shown that the “natural variables” are usually the ones which enter into nice mathematical formulas. For example, Galileo first thought that the velocity of a falling body would be simply related to the distance covered, but found that it was the time elapsed that gave the simple mathematical formula.

There is no easy answer to the question, but the general indictment of the principle of indifference for assigning a uniform distribution is unjustified when one can see that there is some modular reduction which is small with respect to the spread of the underlying variable. (See again Section 6.2.)

This section on “natural variables” is of necessity intuitive, but the author believes that it tends to justify the frequent use of the uniform distribution when little else is known beyond some modular reduction before you see the results.

6.7    Summary

The initial assignment of the probabilities of the events in the sample space is fundamental. As Bertrand’s paradox (and others) show, this is a serious matter since it affects the results obtained, and hence any future action taken. Unfortunately, this point is too often glossed over, and we have perhaps belabored it more than necessary [S].

In this Chapter we gave a number of arguments why the uniform distribution often occurs. The most widely used are: (1) the mechanical uncertainty (“small causes—large effects”), where the results are then reduced modulo something and often further quantized; and (2) the principle of consistency (indifference). Unfortunately the latter depends on the way we look at the situation, not necessarily on reality!

The above two reasons apply quite well to the traditional gambling devices—roulette wheels, dice, and cards for example—but in other applications they are often much less reliable. Most textbooks in probability simply assume a uniform distribution for the problems they pose; Nature sometimes is not so kind!

In the next two Chapters we will further pursue this central problem of the initial assignment of probabilities to the sample space, using somewhat more sophisticated arguments.