8

The main limit theorems

Summary. The law of large numbers and the central limit theorem are two of the principal results of probability theory. The weak law of large numbers is derived from the mean-square law via Chebyshev’s inequality. The central limit theorem is proved using the continuity theorem for moment generating functions. A short account is presented of Cramér’s large deviation theorem for sums of random variables. Convergence in distribution (or ‘weak convergence’) is introduced, and the continuity theorem for characteristic functions stated.

8.1 The law of averages

We aim in this chapter to describe the two main limit theorems of probability theory, namely the ‘law of large numbers’ and the ‘central limit theorem’. We begin with the law of large numbers.

Here is an example of the type of phenomenon which we are thinking about. Before writing this sentence, we threw a fair die one million times (with the aid of a computer, actually) and kept a record of the results. The average of the numbers which we threw was 3.500867. Since the mean outcome of each throw is $3\frac{1}{2}$, this number is not too surprising. If $x_i$ is the result of the $i$th throw, most people would accept that the running average

$$\frac{1}{n}(x_1 + x_2 + \cdots + x_n)$$

(8.1)

approaches the mean value $3\frac{1}{2}$ as $n$ gets larger and larger. Indeed, the foundations of probability theory are based upon our belief that sums of the form (8.1) converge to some limit as $n \to \infty$. It is upon the ideas of ‘repeated experimentation’ and ‘the law of averages’ that many of our notions of chance are founded. Accordingly, we should like to find a theorem of probability theory which says something like ‘if we repeat an experiment many times, then the average of the results approaches the underlying mean value’.
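The following short Python sketch (an illustration added here, not part of the text's experiment) repeats the idea numerically: it simulates one million throws of a fair die and computes the running average (8.1). The random seed is unspecified, so the final value varies from run to run, but it should lie close to $3\frac{1}{2}$.

```python
import numpy as np

rng = np.random.default_rng()                   # unseeded; results vary from run to run
throws = rng.integers(1, 7, size=1_000_000)     # one million fair-die throws

# Running average (8.1): (x_1 + x_2 + ... + x_n) / n for n = 1, 2, ...
running_average = np.cumsum(throws) / np.arange(1, len(throws) + 1)

print(running_average[-1])                      # typically close to 3.5
```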

With the above example about throwing a die in the backs of our minds, we suppose that we have a sequence $X_1, X_2, \ldots$ of independent and identically distributed random variables, each having mean value $\mu$. We should like to prove that the average

$$\frac{1}{n}(X_1 + X_2 + \cdots + X_n)$$

(8.2)

converges as $n \to \infty$ to the underlying mean value $\mu$. There are various ways in which random variables can be said to converge (advanced textbooks generally list four to six such ways). One simple way is as follows.

Definition 8.3

We say that the sequence $Z_1, Z_2, \ldots$ of random variables converges in mean square to the (limit) random variable $Z$ if

$$E\bigl((Z_n - Z)^2\bigr) \to 0 \qquad \text{as } n \to \infty.$$

(8.4)

If this holds, we write ‘$Z_n \to Z$ in mean square as $n \to \infty$’.

Here is a word of motivation for this definition. Remember that if $Y$ is a random variable and $E(Y^2) = 0$, then $Y$ equals 0 with probability 1. If $E\bigl((Z_n - Z)^2\bigr) \to 0$, then it follows that $Z_n - Z$ tends to 0 (in some sense) as $n \to \infty$.

Example 8.5

Let Zn be a discrete random variable with mass function

image

Then $Z_n$ converges to the constant random variable 2 in mean square as $n \to \infty$, since

image

It is often quite simple to show convergence in mean square: just calculate a certain expectation and take the limit as $n \to \infty$. It is this type of convergence which appears in our first law of large numbers.

Theorem 8.6 (Mean-square law of large numbers)

Let $X_1, X_2, \ldots$ be a sequence of independent random variables, each with mean $\mu$ and variance $\sigma^2$. The average of the first $n$ of the $X_i$ satisfies, as $n \to \infty$,

$$\frac{1}{n}\sum_{i=1}^{n} X_i \to \mu \qquad \text{in mean square}.$$

(8.7)

Proof

This is a straightforward calculation. We write

$$S_n = X_1 + X_2 + \cdots + X_n$$

for the nth partial sum of the Xi. Then

$$\operatorname{var}(S_n) = \operatorname{var}(X_1) + \operatorname{var}(X_2) + \cdots + \operatorname{var}(X_n) = n\sigma^2,$$

and so

$$E\Bigl(\Bigl(\tfrac{1}{n}S_n - \mu\Bigr)^{2}\Bigr) = \operatorname{var}\Bigl(\tfrac{1}{n}S_n\Bigr) = \frac{1}{n^{2}}\operatorname{var}(S_n) = \frac{\sigma^{2}}{n} \to 0 \qquad \text{as } n \to \infty.$$

Hence, $\frac{1}{n}S_n \to \mu$ in mean square as $n \to \infty$.

It is customary to assume that the random variables in the law of large numbers are identically distributed as well as independent. We demand here only that the Xi have the same mean and variance.
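As a numerical illustration of the proof (added here, not part of the text), the following sketch estimates $E\bigl((\frac{1}{n}S_n - \mu)^2\bigr)$ by Monte Carlo for fair-die throws, for which $\mu = 3\frac{1}{2}$ and $\sigma^2 = \frac{35}{12}$, and compares it with the value $\sigma^2/n$ obtained in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 3.5, 35 / 12            # mean and variance of a single fair-die throw
reps = 20_000                     # independent repetitions used to estimate the expectation

for n in (10, 100, 1000):
    samples = rng.integers(1, 7, size=(reps, n))
    mse = np.mean((samples.mean(axis=1) - mu) ** 2)   # estimate of E[((1/n)S_n - mu)^2]
    print(n, mse, var / n)        # the last two columns should roughly agree
```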

  

Exercise 8.8

Let Zn be a discrete random variable with mass function

image

Show that Zn converges to 0 in mean square if and only if image.

Exercise 8.9

Let image be a sequence of random variables which converges to the random variable Z in mean square. Show that image in mean square as image, for any real numbers a and b.

Exercise 8.10

Let Nn be the number of occurrences of 5 or 6 in n throws of a fair die. Use Theorem 8.6 to show that, as image,

image

Exercise 8.11

Show that the conclusion of the mean-square law of large numbers, Theorem 8.6, remains valid if the assumption that the Xi are independent is replaced by the weaker assumption that they are uncorrelated.

  

8.2 Chebyshev’s inequality and the weak law

The earliest versions of the law of large numbers were found in the eighteenth century and dealt with a form of convergence different from convergence in mean square. This other mode of convergence also has intuitive appeal and is defined in the following way.

Definition 8.12

We say that the sequence $Z_1, Z_2, \ldots$ of random variables converges in probability to $Z$ as $n \to \infty$ if

$$P(|Z_n - Z| > \epsilon) \to 0 \qquad \text{as } n \to \infty, \text{ for all } \epsilon > 0.$$

(8.13)

If this holds, we write ‘$Z_n \to Z$ in probability as $n \to \infty$’.

Condition (8.13) requires that for all small positive δ and ε and all sufficiently large n, it is the case that $|Z_n - Z| \le \epsilon$ with probability at least $1 - \delta$.

It is not clear at first sight how the two types of convergence (in mean square and in probability) are related to one another. It turns out that convergence in mean square is a more powerful property than convergence in probability, and we make this more precise in the next theorem.

Theorem 8.14

If image is a sequence of random variables and image in mean square as image, then image in probability also.

The proof of this follows immediately from a famous inequality which is usually ascribed to Chebyshev but which was discovered independently by Bienaymé and others, and is closely related to Markov’s inequality, Theorem 7.63. There are many forms of this inequality in the probability literature, and we feel that the following is the simplest.

Theorem 8.15 (Chebyshev’s inequality)

If $Y$ is a random variable and $t > 0$, then

$$P(|Y| \ge t) \le \frac{E(Y^2)}{t^2}.$$

(8.16)

Proof

By Markov’s inequality, Theorem 7.63, applied to the positive random variable $Y^2$,

$$P(|Y| \ge t) = P(Y^2 \ge t^2) \le \frac{E(Y^2)}{t^2},$$

as required.
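Here is a quick numerical check of (8.16), added as an illustration and not part of the text: taking $Y$ to be standard normal (so that $E(Y^2) = 1$), a Monte Carlo estimate of $P(|Y| \ge t)$ is compared with the bound $E(Y^2)/t^2$ for a few values of $t$.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(1_000_000)      # Y taken to be standard normal, so E(Y^2) = 1

for t in (1.0, 2.0, 3.0):
    lhs = np.mean(np.abs(y) >= t)       # Monte Carlo estimate of P(|Y| >= t)
    rhs = np.mean(y ** 2) / t ** 2      # Chebyshev bound E(Y^2) / t^2
    print(t, lhs, rhs)                  # lhs should not exceed rhs, up to sampling error
```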

Proof of Theorem 8.14

We apply Chebyshev’s inequality to the random variable $Z_n - Z$ to find that

$$P(|Z_n - Z| > \epsilon) \le \frac{E\bigl((Z_n - Z)^2\bigr)}{\epsilon^2}.$$

If $Z_n \to Z$ in mean square as $n \to \infty$, the right-hand side tends to 0 as $n \to \infty$, and so the left-hand side tends to 0 for all $\epsilon > 0$, as required.

The converse of Theorem 8.14 is false: there exist sequences of random variables which converge in probability but not in mean square (see Example 8.19).

The mean-square law of large numbers, Theorem 8.6, combines with Theorem 8.14 to produce what is commonly called the ‘weak law of large numbers’.

Theorem 8.17 (Weak law of large numbers)

Let $X_1, X_2, \ldots$ be a sequence of independent random variables, each with mean $\mu$ and variance $\sigma^2$. The average of the first $n$ of the $X_i$ satisfies, as $n \to \infty$,

$$\frac{1}{n}\sum_{i=1}^{n} X_i \to \mu \qquad \text{in probability}.$$

(8.18)

The principal reasons for stating both the mean-square law and the weak law are historical and traditional: the first laws of large numbers to be proved were in terms of convergence in probability. There is also a good mathematical reason for stating the weak law separately: unlike the mean-square law, the conclusion of the weak law is valid without the assumption that the Xi have finite variance, so long as they all have the same distribution. This is harder to prove than the form of the weak law presented above, and we defer its proof until Section 8.5 and the treatment of characteristic functions therein.

There are many forms of the laws of large numbers in the literature, and each has a set of assumptions and a set of conclusions. Some are difficult to prove (with weak assumptions and strong conclusions) and others can be quite easy to prove (such as those above). Our selection is simple but contains a number of the vital ideas. Incidentally, the weak law is called ‘weak’ because it may be formulated in terms of distributions alone. There is a more powerful ‘strong law’ which concerns intrinsically the convergence of random variables themselves.

Example 8.19

Here is an example of a sequence of random variables which converges in probability but not in mean square. Suppose that Zn is a random variable with mass function

image

Then, for $\epsilon > 0$ and all large $n$,

image

giving that $Z_n \to 0$ in probability. On the other hand,

image

so Zn does not converge to 0 in mean square.

  

Exercise 8.20

Prove the following alternative form of Chebyshev’s inequality: if X is a random variable with finite variance and image, then

image

Exercise 8.21

Use Chebyshev’s inequality to show that the probability that in n throws of a fair die the number of sixes lies between image and image is at least image.

Exercise 8.22

Show that if image in probability then, as image,

image

for any real numbers a and b.

  

8.3 The central limit theorem

Our second main result is the central limit theorem. This also concerns sums of independent random variables. Let $X_1, X_2, \ldots$ be independent and identically distributed random variables, each with mean $\mu$ and non-zero variance $\sigma^2$. We know from the law of large numbers that the sum $S_n = X_1 + X_2 + \cdots + X_n$ is about as large as $n\mu$ for large $n$, and the next natural problem is to determine the order of the difference $S_n - n\mu$. It turns out that this difference has order $\sqrt{n}$.

Rather than work with the sum Sn directly, we work with the so-called standardized version of Sn,

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}.$$

(8.23)

This is a linear function $Z_n = a_n S_n + b_n$ of $S_n$, where $a_n$ and $b_n$ have been chosen in such a way that $E(Z_n) = 0$ and $\operatorname{var}(Z_n) = 1$. Note that

$$E(Z_n) = \frac{E(S_n) - n\mu}{\sigma\sqrt{n}} = 0.$$

Also,

$$\operatorname{var}(S_n) = \operatorname{var}(X_1) + \operatorname{var}(X_2) + \cdots + \operatorname{var}(X_n) = n\sigma^2,$$

and so

$$\operatorname{var}(Z_n) = \frac{\operatorname{var}(S_n)}{\sigma^2 n} = 1.$$

(8.24)

It is a remarkable fact that the distribution of Zn settles down to a limit as image. Even more remarkable is the fact that the limiting distribution of Zn is the normal distribution with mean 0 and variance 1, irrespective of the original distribution of the Xi. This theorem is one of the most beautiful in mathematics and is known as the ‘central limit theorem’.

Theorem 8.25 (Central limit theorem)

Let $X_1, X_2, \ldots$ be independent and identically distributed random variables, each with mean $\mu$ and non-zero variance $\sigma^2$. The standardized version

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

of the sum $S_n = X_1 + X_2 + \cdots + X_n$ satisfies, as $n \to \infty$,

$$P(Z_n \le x) \to \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}u^2}\,du \qquad \text{for all } x \in \mathbb{R}.$$

(8.26)

The right-hand side of (8.26) is just the distribution function of the normal distribution with mean 0 and variance 1, and thus (8.26) may be written as

$$P(Z_n \le x) \to P(Y \le x) \qquad \text{for all } x,$$

where Y is a random variable with this standard normal distribution.
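The theorem is easy to observe empirically. The sketch below (an added illustration, not part of the text) takes the $X_i$ to have the exponential distribution with parameter 1, so that $\mu = \sigma = 1$, forms the standardized sums $Z_n$ for $n = 50$, and compares the empirical value of $P(Z_n \le x)$ with $\Phi(x)$ computed from scipy.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0            # mean and standard deviation of the exponential(1) distribution
n, reps = 50, 100_000

samples = rng.exponential(scale=1.0, size=(reps, n))
z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))   # standardized sums Z_n

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, np.mean(z <= x), norm.cdf(x))   # empirical P(Z_n <= x) against Phi(x)
```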

Special cases of the central limit theorem were proved by de Moivre (in about 1733) and Laplace, who considered the case when the Xi have the Bernoulli distribution. Lyapunov proved the first general version in about 1901, but the details of his proof were very complicated. Here we shall give an elegant and short proof based on the method of moment generating functions. As one of our tools, we shall use a special case of a fundamental theorem of analysis, and we present this next without proof. There is therefore a sense in which our ‘short and elegant’ proof does not live up to that description: it is only a partial proof, since some of the analytical details are packaged elsewhere.

Theorem 8.27 (Continuity theorem)

Let $Z_1, Z_2, \ldots$ be a sequence of random variables with moment generating functions $M_1, M_2, \ldots$, and suppose that, as $n \to \infty$,

$$M_n(t) \to e^{\frac{1}{2}t^2} \qquad \text{for all } t \in \mathbb{R}.$$

Then

$$P(Z_n \le x) \to \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}u^2}\,du \qquad \text{for all } x \in \mathbb{R}.$$

In other words, the distribution function of Zn converges to the distribution function of the normal distribution if the moment generating function of Zn converges to the moment generating function of the normal distribution. We shall use this to prove the central limit theorem in the case when the Xi have a common moment generating function

$$M(t) = E\bigl(e^{tX_1}\bigr),$$

although we stress that the central limit theorem is valid even when this expectation does not exist so long as both the mean and the variance of the Xi are finite.

Proof of Theorem 8.25

Let $Y_i = (X_i - \mu)/\sigma$. Then $Y_1, Y_2, \ldots$ are independent and identically distributed random variables with mean and variance given by

$$E(Y_i) = 0, \qquad \operatorname{var}(Y_i) = E(Y_i^2) = 1,$$

(8.28)

and moment generating function

$$M_Y(t) = E\bigl(e^{tY_1}\bigr) = E\bigl(e^{t(X_1 - \mu)/\sigma}\bigr).$$

Now,

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i,$$

giving that Zn has moment generating function

$$M_{Z_n}(t) = E\bigl(e^{tZ_n}\bigr) = \Bigl(M_Y\bigl(t/\sqrt{n}\bigr)\Bigr)^{n}.$$

(8.29)

We need to know the behaviour of $M_Y(t/\sqrt{n})$ for large $n$, and to this end we use Theorem 7.55 to expand $M_Y(t)$ as a power series about $t = 0$:

$$M_Y(t) = 1 + t\,E(Y_1) + \tfrac{1}{2}t^2 E(Y_1^2) + \mathrm{o}(t^2) = 1 + \tfrac{1}{2}t^2 + \mathrm{o}(t^2).$$

Substitute this into (8.29), with $t$ replaced by $t/\sqrt{n}$ and $t$ fixed, to obtain

$$M_{Z_n}(t) = \Bigl(1 + \frac{t^2}{2n} + \mathrm{o}(n^{-1})\Bigr)^{n} \to e^{\frac{1}{2}t^2} \qquad \text{as } n \to \infty,$$

and the result follows from Theorem 8.27. This proof requires the existence of $M(t)$ for values of $t$ near 0 only, and this is consistent with the discussion before Theorem 7.49. We shall see in Example 8.54 how to adapt the proof without this assumption.
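The convergence of moment generating functions used above can also be checked numerically. The sketch below (an added illustration, with the $X_i$ assumed for concreteness to be exponential with parameter 1, so that $Y_i = X_i - 1$ has mean 0 and variance 1) evaluates $\bigl(M_Y(t/\sqrt{n})\bigr)^n$ from (8.29) for increasing $n$ and compares it with $e^{\frac{1}{2}t^2}$.

```python
import math

def mgf_centered(t):
    # Moment generating function of Y = X - 1 with X ~ exponential(1):
    # E(e^{tY}) = e^{-t} / (1 - t), finite for t < 1.
    return math.exp(-t) / (1 - t)

t = 0.7
for n in (10, 100, 10_000, 1_000_000):
    print(n, mgf_centered(t / math.sqrt(n)) ** n)   # (M_Y(t/sqrt(n)))^n, as in (8.29)

print("limit", math.exp(t ** 2 / 2))                # e^{t^2/2}, the mgf of N(0, 1) at t
```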

Example 8.30 (Statistical sampling)

The central limit theorem has many applications in statistics, and here is one such. An unknown fraction p of the population are jedi knights. It is desired to estimate p with error not exceeding image by asking a sample of individuals (it is assumed they answer truthfully). How large a sample is needed?

Solution

Suppose a sample of n individuals is chosen. Let Xi be the indicator function of the event that the ith such person admits to being a jedi knight, and assume the Xi are independent, Bernoulli random variables with parameter p. Write

$$S_n = X_1 + X_2 + \cdots + X_n.$$

(8.31)

We choose to estimate p with the ‘sample mean’ $S_n/n$, which, following statistical notation, we denote as $\overline{X}$.

We wish to choose n sufficiently large that image. This cannot be done, since image is a random variable which may (albeit with only small probability) take a value larger than image for any given n. The accepted approach is to set a maximal level of probability at which an error is permitted to occur. By convention, we take this to be image, and we are thus led to the following problem: find n such that

image

By (8.31), $S_n$ is the sum of independent, identically distributed random variables with mean $p$ and variance $p(1-p)$. The above probability may be written as

image

By the central limit theorem, the standardized version $(S_n - np)/\sqrt{np(1-p)}$ of $S_n$ converges in distribution to the normal distribution, and hence the final probability may be approximated by an integral of the normal density function. Unfortunately, the range of this integral depends on p, which is unknown.

Since $p(1-p) \le \frac{1}{4}$ for $0 \le p \le 1$,

image

and the right-hand side is approximately image, where N is normal with mean 0 and variance 1. Therefore,

image

where Φ is the distribution function of N. On consulting statistical tables, we find this to be greater than image if image, which is to say that image.
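The arithmetic of this example can be packaged as a short routine. The sketch below (an added illustration; the tolerance values shown are for demonstration only and are not those used in the text) uses the worst-case bound $p(1-p) \le \frac{1}{4}$ and the normal quantile from scipy to return the sample size suggested by the normal approximation.

```python
import math
from scipy.stats import norm

def sample_size(tolerance, confidence=0.95):
    """Smallest n for which the normal approximation gives
    P(|S_n / n - p| <= tolerance) >= confidence, using the worst case p(1 - p) <= 1/4."""
    z = norm.ppf(0.5 + confidence / 2)      # e.g. about 1.96 when confidence = 0.95
    return math.ceil((z / (2 * tolerance)) ** 2)

# Illustrative tolerances only.
print(sample_size(0.01))     # roughly 9604
print(sample_size(0.005))    # roughly 38415
```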

  

Exercise 8.32

A fair die is thrown 12,000 times. Use the central limit theorem to find values of a and b such that

image

where S is the total number of sixes thrown.

Exercise 8.33

For image, let Xn be a random variable having the gamma distribution with parameters n and 1. Show that the moment generating function of image is

image

and deduce that, as image,

image

  

8.4 Large deviations and Cramér’s theorem

Let $S_n$ be the sum of $n$ independent, identically distributed random variables with mean $\mu$ and variance $\sigma^2$. The weak law of large numbers asserts that $S_n$ has approximate order $n\mu$. By the central limit theorem, the deviations of $S_n$ are typically of the order $\sqrt{n}$. It is unlikely that $S_n$ will deviate from its mean $n\mu$ by more than order $n^{\alpha}$ with $\alpha > \frac{1}{2}$. The study of such unlikely events has proved extremely fruitful in recent decades. The following theorem, proved in its original form by Cramér in 1938, is of enormous practical use within the modern theory of ‘large deviations’, despite the low probability of the events under study.

Let $X_1, X_2, \ldots$ be independent, identically distributed random variables, and $S_n = X_1 + X_2 + \cdots + X_n$. For simplicity, we assume that the $X_i$ have common mean 0; if this does not hold, we replace $X_i$ by $X_i - E(X_i)$. We shall assume quite a lot of regularity on the distribution of the $X_i$, namely that the common moment generating function $M(t) = E(e^{tX_1})$ satisfies $M(t) < \infty$ for values of $t$ in some neighbourhood $(-\delta, \delta)$ of the origin. Let $a > 0$ and $t > 0$. The function $x \mapsto e^{tx}$ is strictly increasing on $\mathbb{R}$, so that $S_n \ge na$ if and only if $e^{tS_n} \ge e^{nta}$. By Markov’s inequality, Theorem 7.63,

$$P(S_n \ge na) = P\bigl(e^{tS_n} \ge e^{nta}\bigr) \le e^{-nta}\,E\bigl(e^{tS_n}\bigr).$$

By Theorem 7.52, $E\bigl(e^{tS_n}\bigr) = M(t)^n$, and so

$$P(S_n \ge na) \le \bigl(e^{-ta}M(t)\bigr)^{n} \qquad \text{for } t > 0.$$

This provides an upper bound for the chance of a ‘large deviation’ of $S_n$ from its mean 0, in terms of the arbitrary constant $t > 0$. We minimize the right-hand side over $t$ to obtain

$$P(S_n \ge na) \le \Bigl(\inf_{t > 0} e^{-ta}M(t)\Bigr)^{n}.$$

(8.34)

This is an exponentially decaying bound for the probability of a large deviation.

It turns out that, neglecting sub-exponential corrections, the bound (8.34) is an equality, and this is the content of Cramér’s theorem, Theorem 8.36. The precise result is usually stated in logarithmic form. Let $\Lambda(t) = \log M(t)$, and define the so-called Fenchel–Legendre transform of $\Lambda$ by

$$\Lambda^*(a) = \sup_{t \in \mathbb{R}}\bigl\{at - \Lambda(t)\bigr\}.$$

(8.35)

The function $\Lambda^*$ is illustrated in Figure 8.1.

Fig. 8.1 The function $\Lambda(t)$ plotted against the line $at$. The marked point is the value of $t$ at which $at - \Lambda(t)$ is a maximum, and this maximum is denoted $\Lambda^*(a)$.

Theorem 8.36 (Large deviation theorem)

Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with mean 0, whose common moment generating function $M(t) = E(e^{tX_1})$ is finite in some neighbourhood of the origin. Let $a > 0$ be such that $P(X_1 > a) > 0$. Then $\Lambda^*(a) > 0$ and

$$\frac{1}{n}\log P(S_n > na) \to -\Lambda^*(a) \qquad \text{as } n \to \infty.$$

(8.37)

That is to say, $P(S_n > na)$ decays to 0 in the manner of $e^{-n\Lambda^*(a)}$. If $P(X_1 > a) = 0$, then $P(S_n > na) = 0$ for all $n$. Theorem 8.36 accounts for deviations above the mean. For deviations below the mean, the theorem may be applied to the sequence $-X_1, -X_2, \ldots$.

Partial Proof

We begin with some properties of the function image. First,

image

Next,

$$\Lambda''(t) = \frac{M(t)M''(t) - M'(t)^2}{M(t)^2}.$$

By the Cauchy–Schwarz inequality, Theorem 7.30, applied to the random variables $e^{\frac{1}{2}tX_1}$ and $X_1 e^{\frac{1}{2}tX_1}$, the numerator is positive. Therefore, $\Lambda$ is a convex function wherever it is finite (see Exercise 7.73).

We turn now to Figure 8.1. Since $\Lambda$ is convex with $\Lambda'(0) = E(X_1) = 0$, and since $a > 0$, the supremum of $at - \Lambda(t)$ over $t \in \mathbb{R}$ is unchanged by the restriction $t > 0$. That is,

$$\Lambda^*(a) = \sup_{t > 0}\bigl\{at - \Lambda(t)\bigr\}.$$

(8.38)

Next, we show that $\Lambda^*(a) > 0$ under the conditions of the theorem. By Theorem 7.55,

image

for small positive t, where image. This is where we have used the assumption that image on a neighbourhood of the origin. For sufficiently small positive t,

image

whence $\Lambda^*(a) > 0$ by (8.38).

It is immediate from (8.34) and (8.38) that

$$\frac{1}{n}\log P(S_n \ge na) \le -\Lambda^*(a) \qquad \text{for all } n.$$

(8.39)

The proof of the sharpness of the limit in (8.37) is more complicated, and is omitted. A full proof may be found in Grimmett and Stirzaker (2001, Sect. 5.11).

Example 8.40

Let X be a random variable with distribution

image

and moment generating function

image

Let $\Lambda(t) = \log M(t)$. By (8.35), the Fenchel–Legendre transform of $\Lambda$ is obtained by maximizing $at - \Lambda(t)$ over the variable $t$. The function $\Lambda$ is differentiable, and therefore the maximum may be found by calculus. We have that

image

Setting this equal to 0, we find that

image

and hence

image

Let $S_n$ be the sum of $n$ independent copies of $X$. By the large deviation theorem, Theorem 8.36,

image

(8.41)

for image.
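Such calculations can also be carried out numerically by maximizing $at - \Lambda(t)$ directly. The sketch below (an added illustration; it assumes, for concreteness, that $X$ takes the values $+1$ and $-1$ each with probability $\frac{1}{2}$, so that $M(t) = \cosh t$) computes $\Lambda^*(a)$ with scipy's scalar optimizer and compares it with the closed form $\frac{1}{2}\bigl[(1+a)\log(1+a) + (1-a)\log(1-a)\bigr]$ valid for that particular distribution.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lambda_star(a, mgf):
    """Fenchel-Legendre transform (8.35): sup over t of (a*t - log M(t)), found numerically."""
    objective = lambda t: -(a * t - np.log(mgf(t)))
    return -minimize_scalar(objective, bounds=(0.0, 50.0), method='bounded').fun

# Illustration: X = +1 or -1 with probability 1/2 each, so M(t) = cosh(t).
for a in (0.1, 0.5, 0.9):
    numeric = lambda_star(a, np.cosh)
    closed_form = 0.5 * ((1 + a) * np.log(1 + a) + (1 - a) * np.log(1 - a))
    print(a, numeric, closed_form)      # the two columns should agree closely
```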

  

Exercise 8.42

Find the Fenchel–Legendre transform image in the case of the normal distribution with mean 0 and variance 1.

Exercise 8.43

Show that the moment generating function of a random variable X is finite on a neighbourhood of the origin if and only if there exist image such that image for image.

Exercise 8.44

Let image be independent random variables with the Cauchy distribution, and let image. Find image for image.

  

8.5 Convergence in distribution, and characteristic functions

We have now encountered the ideas of convergence in mean square and convergence in probability, and we have seen that the former implies the latter. To these two types of convergence we are about to add a third. We motivate this by recalling the conclusion of the central limit theorem, Theorem 8.25: the distribution function of the standardized sum Zn converges as image to the distribution function of the normal distribution. This notion of the convergence of distribution functions may be set in a more general context as follows.

Definition 8.45

The sequence $Z_1, Z_2, \ldots$ is said to converge in distribution, or to converge weakly, to $Z$ as $n \to \infty$ if

$$P(Z_n \le z) \to P(Z \le z) \qquad \text{for } z \in C,$$

where C is the set of reals at which the distribution function $F_Z(z) = P(Z \le z)$ is continuous. If this holds, we write $Z_n \Rightarrow Z$.

The condition involving points of continuity is an unfortunate complication of the definition, but turns out to be desirable (see Exercise 8.56).

Convergence in distribution is a property of the distributions of random variables rather than a property of the random variables themselves, and for this reason, explicit reference to the limit random variable Z is often omitted. For example, the conclusion of the central limit theorem may be expressed as ‘Zn converges in distribution to the normal distribution with mean 0 and variance 1’.

Theorem 8.14 asserts that convergence in mean square implies convergence in probability. It turns out that convergence in distribution is weaker than both of these.

Theorem 8.46

If $Z_1, Z_2, \ldots$ is a sequence of random variables and $Z_n \to Z$ in probability as $n \to \infty$, then $Z_n \Rightarrow Z$.

The converse assertion is generally false; see the forthcoming Example 8.49 for a sequence of random variables which converges in distribution but not in probability. The next theorem describes a partial converse.

Theorem 8.47

Let $Z_1, Z_2, \ldots$ be a sequence of random variables which converges in distribution to the constant $c$. Then $Z_n$ converges to $c$ in probability also.

Proof of Theorem 8.46

Suppose $Z_n \to Z$ in probability, and write

$$F_n(z) = P(Z_n \le z), \qquad F(z) = P(Z \le z),$$

for the distribution functions of $Z_n$ and $Z$. Let $\epsilon > 0$, and suppose that $F$ is continuous at the point $z$. Then

$$F_n(z) = P(Z_n \le z) \le P(Z \le z + \epsilon) + P(|Z_n - Z| > \epsilon) = F(z + \epsilon) + P(|Z_n - Z| > \epsilon).$$

Similarly,

$$F(z - \epsilon) = P(Z \le z - \epsilon) \le P(Z_n \le z) + P(|Z_n - Z| > \epsilon) = F_n(z) + P(|Z_n - Z| > \epsilon).$$

Thus

$$F(z - \epsilon) - P(|Z_n - Z| > \epsilon) \le F_n(z) \le F(z + \epsilon) + P(|Z_n - Z| > \epsilon).$$

(8.48)

We let $n \to \infty$ and $\epsilon \downarrow 0$ throughout these inequalities. The left-hand side of (8.48) behaves as follows:

$$F(z - \epsilon) - P(|Z_n - Z| > \epsilon) \to F(z - \epsilon) \to F(z),$$

where we have used the facts that $Z_n \to Z$ in probability and that F is continuous at z, respectively. Similarly, the right-hand side of (8.48) satisfies

$$F(z + \epsilon) + P(|Z_n - Z| > \epsilon) \to F(z + \epsilon) \to F(z).$$

Thus, the left- and right-hand sides of (8.48) have the same limit $F(z)$, implying that the central term $F_n(z)$ satisfies $F_n(z) \to F(z)$ as $n \to \infty$. Hence $Z_n \Rightarrow Z$.

Proof of Theorem 8.47

Suppose that $Z_n \Rightarrow c$. It follows that the distribution function $F_n$ of $Z_n$ satisfies

$$F_n(z) \to \begin{cases} 0 & \text{if } z < c, \\ 1 & \text{if } z > c, \end{cases} \qquad \text{as } n \to \infty.$$

Thus, for $\epsilon > 0$,

$$P(|Z_n - c| > \epsilon) \le F_n(c - \epsilon) + 1 - F_n(c + \epsilon) \to 0 \qquad \text{as } n \to \infty.$$

The following is an example of a sequence of random variables which converges in distribution but not in probability.

Example 8.49

Let U be a random variable which takes the values $-1$ and 1, each with probability $\frac{1}{2}$. We define the sequence $Z_1, Z_2, \ldots$ by

image

It is clear that $Z_n \Rightarrow U$, since each $Z_n$ has the same distribution. On the other hand,

image

so that image for all m. Hence, Zn does not converge to U in probability.

Finally, we return to characteristic functions. In proving the central limit theorem we employed a result (Theorem 8.27) linking the convergence of moment generating functions to convergence in distribution. This result is a weak form of the so-called continuity theorem, a more powerful version of which we present next (the proof is omitted).

Theorem 8.50 (Continuity theorem)

Let $Z_1, Z_2, \ldots$ and $Z$ be random variables with characteristic functions $\phi_1, \phi_2, \ldots$ and $\phi$. Then $Z_n \Rightarrow Z$ if and only if

$$\phi_n(t) \to \phi(t) \qquad \text{as } n \to \infty, \text{ for all } t.$$

This is a difficult theorem to prove—see Feller (1971, p. 481). We close the section with several examples of this theorem in action.

Example 8.51

Suppose that $Z_n \Rightarrow Z$ and $a, b \in \mathbb{R}$. Prove that $aZ_n + b \Rightarrow aZ + b$.

Solution

Let $\phi_n$ be the characteristic function of $Z_n$ and $\phi$ the characteristic function of $Z$. By the continuity theorem, Theorem 8.50, $\phi_n(t) \to \phi(t)$ as $n \to \infty$. The characteristic function of $aZ_n + b$ is

$$\phi_{aZ_n + b}(t) = E\bigl(e^{it(aZ_n + b)}\bigr) = e^{itb}\,\phi_n(at) \to e^{itb}\,\phi(at) = \phi_{aZ + b}(t),$$

and the result follows by another appeal to Theorem 8.50. A direct proof of this fact using distribution functions is messy when a is negative.

Example 8.52 (The weak law)

Here is another proof of the weak law of large numbers, Theorem 8.17, for the case of identically distributed random variables. Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with mean $\mu$, and let

$$U_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n).$$

By Theorem 7.87, the characteristic function $\phi_{U_n}$ of $U_n$ is given by

$$\phi_{U_n}(t) = \Bigl(\phi_X\bigl(t/n\bigr)\Bigr)^{n},$$

(8.53)

where $\phi_X$ is the common characteristic function of the $X_i$. By Theorem 7.85,

$$\phi_X(t) = 1 + i\mu t + \mathrm{o}(t) \qquad \text{as } t \to 0.$$

Substitute this into (8.53) to obtain

$$\phi_{U_n}(t) = \Bigl(1 + \frac{i\mu t}{n} + \mathrm{o}(n^{-1})\Bigr)^{n} \to e^{i\mu t} \qquad \text{as } n \to \infty.$$

The limit here is the characteristic function of the constant $\mu$, and thus the continuity theorem, Theorem 8.50, implies that $U_n \Rightarrow \mu$. A glance at Theorem 8.47 confirms that the convergence takes place in probability also, and we have proved a version of the weak law of large numbers. This version differs from the earlier one in two regards: we have assumed that the Xi are identically distributed, but we have made no assumption that they have finite variance.
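The limit in this argument is easy to check numerically. The sketch below (an added illustration; the $X_i$ are assumed, for concreteness, to be exponential with parameter 1, so that $\mu = 1$ and $\phi_X(t) = 1/(1 - it)$) evaluates $\bigl(\phi_X(t/n)\bigr)^n$ from (8.53) for increasing $n$ and compares it with $e^{i\mu t}$.

```python
import numpy as np

def cf_exponential(t):
    # Characteristic function of the exponential(1) distribution: 1 / (1 - i t).
    return 1.0 / (1.0 - 1j * t)

t, mu = 2.0, 1.0
for n in (10, 1000, 100_000):
    print(n, cf_exponential(t / n) ** n)    # (phi_X(t/n))^n, as in (8.53)

print("limit", np.exp(1j * t * mu))         # e^{i*t*mu}, the characteristic function of the constant mu
```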

Example 8.54

Central limit theorem. Our proof of the central limit theorem in Section 8.3 was valid only for random variables which possess finite moment generating functions. Very much the same arguments go through using characteristic functions, and thus Theorem 8.25 is true as it is stated.

  

Exercise 8.55

Let image be independent random variables, each having the Cauchy distribution. Show that image converges in distribution to the Cauchy distribution as image. Compare this with the conclusion of the weak law of large numbers.

Exercise 8.56

Let Xn, Yn, Z be ‘constant’ random variables with distributions

image

Show that

image

but image.

This motivates the condition of continuity in Definition 8.45. Without this condition, it would be the case that image but image.

  

8.6 Problems

1.  Let image be independent random variables, each having the uniform distribution on the interval image, and let image. Show that

(a)  image in probability as image,

(b)  image in probability as image,

(c)  if image and image, then

image

    so that Un converges in distribution to the exponential distribution as image.

2.  By applying the central limit theorem to a sequence of random variables with the Bernoulli distribution, or otherwise, prove the following result in analysis. If image and image, then

image

where the summation is over all values of k satisfying image.

3.  Let Xn be a discrete random variable with the binomial distribution, parameters n and p. Show that image converges to p in probability as image.

4.  Binomial–Poisson limit. Let Zn have the binomial distribution with parameters n and image, where λ is fixed. Use characteristic functions to show that Zn converges in distribution to the Poisson distribution, parameter λ, as image.

5.  By applying the central limit theorem to a sequence of random variables with the Poisson distribution, or otherwise, prove that

image

6. (a)  Let image and

image

By considering the binomial distribution or otherwise, show that

image

(b)  Find the asymptotic behaviour of image, where image and

image

7.  Use the Cauchy–Schwarz inequality to prove that if image in mean square and image in mean square, then image in mean square.

8.  Use the Cauchy–Schwarz inequality to prove that if image in mean square, then image. Give an example of a sequence image such that image in probability but image does not converge to image.

9.  If image in probability and image in probability, show that image in probability.

10.  Let image and image be independent random variables each having mean μ and non-zero variance image. Show that

image

satisfies, as image,

image

11.  Adapt the proof of Chebyshev’s inequality to show that, if X is a random variable and image, then

image

for any function image which satisfies

(a)  image for image,

(b)  image for image,

(c)  g is increasing on image.

12.  Let X be a random variable which takes values in the interval image only. Show that

image

if image.

13.  Show that image in probability if and only if

image

14.  Let image be a sequence of random variables which converges in mean square. Show that image as image.

If image and image for all n, show that the correlation between Xn and Xm converges to 1 as image.

15.  Let Z have the normal distribution with mean 0 and variance 1. Find image and image, and find the probability density function of image.

*16.  Let image be independent random variables each having distribution function F and density function f. The order statistics image of the subsequence image are obtained by rearranging the values of the Xi in non-decreasing order. That is to say, image is set to the smallest observed value of the Xi, image is set to the second smallest value, and so on, so that image. The sample median Yn of the sequence image is the ‘middle value’, so that Yn is defined to be

image

Assume that image is odd, and show that Yn has density function

image

Deduce that, if F has a unique median m, then

image

where image.

17.  The sequence image of independent, identically distributed random variables is such that

image

If f is a continuous function on image, prove that

image

is a polynomial in p of degree at most n. Use Chebyshev’s inequality to prove that for all p with image, and any image,

image

where image. Using this and the fact that f is bounded and uniformly continuous in image, prove the following version of the Weierstrass approximation theorem:

image

(Oxford 1976F)

18.  Let Zn have the geometric distribution with parameter image, where λ is fixed. Show that image converges in distribution as image, and find the limiting distribution.

*19.  Let image and image be two sequences of independent random variables with

image

and

image

for image. Let

image

and let Z denote a normally distributed random variable with mean 0 and variance 1. Prove or disprove the following:

(a)  Sn converges in distribution to Z,

(b)  the mean and variance of Tn converge to the mean and variance of Z,

(c)  Tn converges in distribution to Z.

State carefully any theorems which you use. (Oxford 1980F)

*20.  Let Xj, image, be independent identically distributed random variables with probability density function image, image. Show that the characteristic function of image is image. Consider a sequence of independent trials where the probability of success is p for each trial. Let N be the number of trials required to obtain a fixed number of k successes. Show that, as p tends to zero, the distribution of image tends to the distribution of Y with image. (Oxford 1979F)

21.  Let image be independent and identically distributed random variables such that

image

Derive the moment generating function of the random variable image, where image are constants. In the special case image for image, show that Yn converges in distribution as image to the uniform distribution on the interval image.

22.  X and Y are independent, identically distributed random variables with mean 0, variance 1, and characteristic function image. If image and image are independent, prove that

image

By making the substitution image or otherwise, show that, for any positive integer n,

image

Hence, find the common distribution of X and Y. (Oxford 1976F)

23.  Let image and image be the real and imaginary parts, respectively, of the characteristic function of the random variable X. Prove that

(a)  image,

(b)  image.

Hence, find the variance of image and the covariance of image and image in terms of u and v.

Consider the special case when X is uniformly distributed on image. Are the random variables image (i) uncorrelated, (ii) independent? Justify your answers. (Oxford 1975F)

24.  State the central limit theorem.

The cumulative distribution function F of the random variable X is continuous and strictly increasing. Show that image is uniformly distributed. Find the probability density function of the random variable image, and calculate its mean and variance. Let image be a sequence of independent random variables whose corresponding cumulative distribution functions image are continuous and strictly increasing. Let

image

Show that, as image, image converges in distribution to a normal distribution with mean zero and variance one. (Oxford 2007)