Chapter 4
Estimation and the Normal Distribution
In this chapter, we introduce and make use of a probability distribution that often arises in practice, albeit indirectly: the “normal” or “Gaussian” distribution. An observation that is the sum of a large number of factors, each of which makes only a small contribution to the total, will have a normal distribution. Thus, the mean of a large number of independent observations on the elements of a homogeneous population will have a normal distribution.
You also will learn in this chapter about the desirable properties of both point and interval estimates and how to apply this knowledge to estimate the parameters of a normal distribution. You’ll be provided with the R functions you need to test hypotheses about the parameters.
We are often required to estimate some property of a population based on a random representative sample from that population. For example, we might want to estimate the population mean or the population variance.
A desirable estimator will be consistent, that is, as the sample grows larger, an estimate based on the sample will get closer and closer to the true value of the population parameter. The sample mean is a consistent estimator of the population mean. The sample variance is a consistent estimator of the population variance.
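Consistency is easy to see in a small simulation. The sketch below (the sample sizes and the true mean of 5 are arbitrary choices for illustration) computes the sample mean for ever-larger samples:

#compute sample means for increasing sample sizes from a
#normal population with mean 5
sapply(c(10, 100, 1000, 10000), function(n) mean(rnorm(n, mean = 5)))
#the estimates settle in ever closer to the true value of 5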
When we make decisions based on an estimate h of a population parameter θ, we may be subject to losses based on some function L of the difference between our estimate and the true value, L(h, θ). Suppose this loss function is proportional to the square of the difference, L(h, θ) = k(h – θ)², as in Figure 4.1.
Figure 4.1 Loss as a result of an error in estimation.
Then we would want to make use of an estimate that minimizes the mean squared error, that is, an estimate that minimizes the expected value of (h – θ)², which is equal to the variance of the statistic h plus (Eh − θ)², where Eh denotes the expected value of the estimate when averaged over a large number of samples. The first term of this sum, the variance of the statistic h, is a measure of the precision of our estimate; the second term, the square of the bias Eh − θ, is a measure of its accuracy. An unbiased estimate h of θ is such that Eh = θ.
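This decomposition can be checked numerically. In the sketch below, the estimator h is the mean of 10 normal observations; the true mean of 5 and standard deviation of 2 are arbitrary choices for illustration.

theta = 5 #the true population mean
#compute the mean of 10 observations, 10,000 times over
h = replicate(10000, mean(rnorm(10, theta, 2)))
mean((h - theta)^2) #the mean squared error
var(h) + (mean(h) - theta)^2 #variance plus squared bias: nearly equal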
The sample mean and the sample variance are both consistent, unbiased, minimum variance, minimum squared-error estimators of the population mean and the population variance respectively.
The properties of the normal distribution (its percentiles, for example) are completely determined by its mean and variance. Figures 4.2 and 4.3 depict the cumulative distribution function and the probability density curve of a normally distributed population with zero mean and unit variance.
Figure 4.2 Cumulative distribution function of a normal probability distribution with zero mean and unit variance.
Figure 4.3 Bell-shaped normal probability density. Interpret the density on the y-axis as the relative probability that an observation from a normal distribution with zero mean and unit variance will take the value shown on the x-axis.
To graph a normal probability distribution with some other standard deviation, simply interpret the scale on the x-axis of these figures as reading in units of standard deviations. To graph a normal probability distribution with some other mean, simply shift the scale on the x-axis so the value of the new mean is located where the zero used to be.
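In R itself, no reinterpretation of the axes is needed: the mean and sd arguments of dnorm() do the shifting and rescaling for you. A sketch, using an arbitrary mean of 100 and standard deviation of 15 for illustration:

#plot the normal density with mean 100 and standard deviation 15,
#covering three standard deviations on either side of the mean
curve(dnorm(x, mean = 100, sd = 15), from = 55, to = 145)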
The probability densities depicted in Figure 4.3 should be interpreted as relative rather than absolute probabilities. The normal distribution is that of a continuous random variable, so the probability of observing any specific value, say 1.000000000, is zero. But we can and will talk about the probability that a normally distributed variable will lie inside some fixed interval. We can obtain this value by calculating the area under the density curve. For example, the shaded area in Figure 4.4 reveals that the probability is 5% that a normally distributed random variable will be more than 1.64 standard deviations above the mean. To be honest, I didn’t read this value from the graph, but had R compute the tail probability for me, as shown following Figure 4.4.
Figure 4.4 The ratio of the shaded area to the total area under the curve yields the probability that a normally distributed random variable will be at least 1.64 standard deviations above its mean.
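Since pnorm(1.64) reports the probability of falling at or below 1.64 (the lower tail), the upper tail is obtained by subtraction, or by asking pnorm() for it directly:

1 - pnorm(1.64) #about 0.05
pnorm(1.64, lower.tail = FALSE) #the same value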
Exercise 4.1: The heights of Cambridge students in the early 1900s were found to follow a normal distribution. The heights of today’s students do not. Why do you think this is?
To generate a plot similar to Figure 3.3 of the empirical cumulative probability distributions of samples of various sizes taken from a normal probability distribution with mean zero and variance 1, I used the following R code:
#empirical cdf of 1600 observations from a normal
#distribution with mean 0 and variance 1
F1600 = ecdf(rnorm(1600, 0, 1))
plot(F1600)
#overlay the empirical cdf of a sample of 100
F100 = ecdf(rnorm(100))
lines(F100, pch = ".")
#and the empirical cdf of a sample of 10
F10 = ecdf(rnorm(10))
lines(F10, verticals = TRUE, do.points = FALSE)
Exercise 4.2: Draw the empirical cumulative distribution function of a sample of size 1000 from a normal distribution with mean 2.5 and variance 9.
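One caution before you begin: rnorm() expects the standard deviation, not the variance, as its third argument, so a variance of 9 translates into a standard deviation of 3. The call you need will resemble this sketch:

plot(ecdf(rnorm(1000, 2.5, 3))) #sd = sqrt(9) = 3, not 9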
To see why the normal distribution plays such an important role in statistics, please complete Exercise 4.3, which requires you to compute the distribution of the mean of a number of binomial observations. As you increase the number of observations used to compute the mean from 5 to 12, so that each individual observation makes a smaller relative contribution to the total, you will see that the distribution of the means looks less and less like the binomial distribution from which they were taken and more and more like a normal distribution. This result holds true regardless of the distribution from which the observations used to compute the mean are taken, provided that this distribution has a finite mean and variance.
Exercise 4.3: Generate five binomial observations based on 10 trials with probability of success p = 0.35 per trial. Compute the mean value of these five. Repeat this procedure 512 times, computing the mean value each time. Plot the empirical cumulative distribution function of these means. Compare with the empirical cumulative distribution functions of (1) a sample of 512 normally distributed observations with expected value 3.5 and variance 2.3, and (2) a sample of 512 binomial observations, each consisting of 10 trials with a probability of success of p = 0.35 per trial. Repeat the entire exercise, computing the mean of 12 rather than 5 binomial observations.
Here is the R code you’ll need to do this exercise:
#Program computes the means of N samples of k binomial(10, 0.35) variables
#set number of times to compute a mean
N = 512
#set number of observations in sample
k = 5
#create a vector in which to store the means
stat = numeric(N)
#Set up a loop to generate the N means
for (i in 1:N) stat[i] = mean(rbinom(k, 10, 0.35))
plot(ecdf(stat))
#generate a sample of N normally distributed observations
#rnorm takes the standard deviation, so variance 2.3 becomes sqrt(2.3)
sampN = rnorm(N, 3.5, sqrt(2.3))
plot(ecdf(sampN))
#generate a sample of N binomial observations
sampB = rbinom(N, 10, 0.35)
plot(ecdf(sampB))
The probability is 90% that a normally distributed random variable will lie within 1.65 standard deviations of its mean. If we knew the standard deviation of the population from which our sample was drawn, we could use this knowledge to obtain an interval estimate of the unknown population mean based on the sample mean. But we don’t; we only know the sample standard deviation.
Fortunately, we can make use of the statistic

t = (X̄ − θ)/(s/√n),

which has a Student’s t-distribution with n – 1 degrees of freedom, where X̄ is the sample mean, n is the sample size, θ is the population mean, and s is the standard deviation of the sample.*
The ratio of the standard deviation divided by the square root of the sample size is often referred to as the standard error of the mean.
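A quick sketch confirms the definition: computing the ratio by hand for a hypothetical sample reproduces the t statistic that t.test() reports.

x = rnorm(8) #a hypothetical sample
#the t statistic for testing whether the population mean is 0
(mean(x) - 0)/(sd(x)/sqrt(length(x)))
t.test(x)$statistic #the same value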
Suppose we have the following sample,
> data = rnorm(8)
> data
[1]  0.8011477  0.4213278 -0.4574428  1.8352498 -0.5901639 -0.8767256  1.0373933
[8] -0.4187433
The R function t.test will provide us with a 90% confidence interval for the mean.
> t.test(data, conf.level = 0.9)

        One Sample t-test

data:  data
t = 0.6488, df = 7, p-value = 0.5372
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 -0.4205535  0.8585642
sample estimates:
mean of x
0.2190054
The probability is 90% that the confidence interval (−0.42, 0.86) covers the true unknown value of the mean.
Note that we have discarded the least significant digits when we reported this result.
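What the 90% figure means can itself be checked by simulation. In the sketch below (a true mean of 1, samples of size 8, and 1000 repetitions are arbitrary choices), roughly 90% of the intervals constructed this way cover the true mean:

#record whether each 90% confidence interval covers the true mean of 1
covered = replicate(1000, {
   x = rnorm(8, mean = 1)
   ci = t.test(x, conf.level = 0.9)$conf.int
   ci[1] < 1 & 1 < ci[2]
   })
mean(covered) #close to 0.90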
Exercise 4.4: Find a 90% confidence interval for the mean based on the following sample, 5.1451036 −1.5395632 1.8893192 −2.1733570 −0.9131787 −0.5619511 0.8190254.
Exercise 4.5: Will an 80% confidence interval for the mean be longer or shorter than a 90% confidence interval? Why?
Many real-life distributions strongly resemble mixtures of normal distributions. Figure 4.5 depicts just such an example in the heights of California sixth graders. Though the heights of boys and girls overlap, it is easy to see that the population of sixth graders is composed of a mixture of the two sexes.
Figure 4.5 Distribution of the heights of California sixth graders.
If you suspect you are dealing with a mixture of populations, it is preferable to analyze each subpopulation separately. We’ll have more to say about this approach in Chapters 7 and 8.
Exercise 4.6: Provide an 80% confidence interval for the mean height of sixth-grade students using the data of Chapter 1.
A fast-food restaurant claims that $750 of their daily revenue is from the drive-thru window. They’ve collected 2 weeks of receipts from the restaurant and turned them over to you. Each day’s receipt shows the total revenue and the “drive-thru” revenue for that day.
They do not claim their drive-thru produces $750 of their revenue, day in and day out, only that their overall daily average is $750.
Drive-Thru Sales: 754 708 652 783 682 778 665 799 693 825 828 674 792 723
Exercise 4.7: Derive a 90% confidence interval for the mean daily revenue via drive-thru. If $750 is included in this interval, then accept the hypothesis.
We’ve already made use of the percentile or uncorrected bootstrap on several occasions, first to estimate precision and then to obtain interval estimates for population parameters. Readily computed, the bootstrap seems ideal for use with the drive-thru problem. Still, if something seems too good to be true, it probably is. Unless corrected, bootstrap interval estimates are inaccurate (i.e., they will include the true value of the unknown parameter less often than the stated confidence probability) and imprecise (i.e., they will include more erroneous values of the unknown parameter than is desirable). When the original samples contain fewer than a hundred observations, the confidence bounds based on the primitive bootstrap may vary widely from simulation to simulation.
What this means to us is that even if the figure $750 lies outside a 90% bootstrap confidence interval, we may run the risk of making an error more than 100% − 90% = 10% of the time in rejecting this hypothesis.
The percentile bootstrap is most accurate when the observations are drawn from a normal distribution. The bias-corrected-and-accelerated bootstrap takes advantage of this as follows: Suppose θ is the parameter we are trying to estimate, θ̂ is the estimate, and we are able to come up with a transformation m such that m(θ̂) is normally distributed about m(θ). We could use this normal distribution and the bootstrap to obtain an unbiased confidence interval for m(θ), and then apply a back-transformation to obtain an almost-unbiased confidence interval for θ.
The good news is that we don’t actually need to determine what the function m should be or compute it to obtain the desired interval. To use R for this purpose, all you need do is download and install the “boot” package and make use of its functions boot() and boot.ci().
Make sure you are connected to the Internet and then pull down on the “Packages” menu as shown in Figure 4.6 and click on “Install package(s)…”
Figure 4.6 R’s packages menu.
Select the nearest CRAN mirror as shown in Figure 4.7a and then the Package “boot” as shown in Figure 4.7b.
Figure 4.7 (a) List of mirror sites that can be used for downloading R packages. (b) List of R packages.
The installation, which includes downloading, unzipping, and integrating the new routines, is done automatically. The installation needs to be done once and once only. But each time before you can use any of the boot library routines, you’ll need to load the supporting functions into computer memory by typing
> library(boot)
R imposes this requirement to keep the amount of memory its program uses to the absolute minimum.
We’ll need to employ two functions from the boot library.
The first of these functions, boot(Data, Rfunction, number), has three principal arguments. Data is the name of the data set you want to analyze, number is the number of bootstrap samples you wish to draw, and Rfunction is the name of an R function you must construct separately to wrap an existing R statistics function, such as median() or var(), whose value you want a bootstrap interval estimate of. For example,
> f.median <- function(y, id){
+      median(y[id])
+ }
where R knows id will be a vector. Then
> boot.ci(boot(classdata, f.median, 400), conf = 0.90)
will calculate a 90% confidence interval for the median of the classdata based on 400 simulations.
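Putting the pieces together, here is a self-contained sketch; the classdata values are made up purely for illustration, and type = "bca" asks boot.ci() specifically for the bias-corrected-and-accelerated interval.

library(boot)
classdata = rnorm(50) #hypothetical data, for illustration only
f.median <- function(y, id){
   median(y[id])
   }
b = boot(classdata, f.median, 400)
boot.ci(b, conf = 0.90, type = "bca")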
When observations are drawn from a very skewed (and thus very unlike a normal) distribution, such as the exponential, the parametric bootstrap can prove of value in obtaining confidence intervals for the population parameters.
It’s a two-step method:

1. Estimate the parameter(s) of the assumed distribution from the sample.
2. Draw a series of bootstrap samples, using the assumed distribution with the estimated parameter value(s), and compute the statistic of interest for each.
This parametric approach is of particular value when we are trying to estimate one of the tail percentiles such as P10 or P90, for the sample alone seldom has sufficient information.
For example, if we know this population has an exponential distribution, we would use the sample mean to estimate the population mean. Then, we would draw a series of random samples of the same size as our original sample from an exponential distribution whose mathematical expectation was equal to the sample mean to obtain a confidence interval for the population parameter of interest.
The following R program uses an exponential distribution as the basis of a parametric bootstrap to obtain a 90% confidence interval for the interquartile range (IQR) of the population from which the data set A was taken. If P75 and P25 are the 75th and 25th percentiles of the population, IQR is defined as P75 − P25. This program makes use of the result of Exercise 3.16, which tells us that the expected value of an exponential distribution with parameter λ is 1/λ.
N = 1000 #number of bootstrap samples
n = length(A) #each bootstrap sample is the same size as the original
#create a vector in which to store the IQRs
IQR = numeric(N)
#Set up a loop to generate the IQRs
for (i in 1:N) {
   bA = sample(A, n, replace = TRUE)
   #fit an exponential to the bootstrap sample via its mean,
   #then compute the IQR of the fitted distribution
   IQR[i] = qexp(0.75, 1/mean(bA)) - qexp(0.25, 1/mean(bA))
   }
quantile(IQR, probs = c(0.05, 0.95))
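To try the program out, you need a data vector A. The lines below manufacture one from a known exponential population purely for illustration (the rate of 0.2 and the sample size of 30 are arbitrary), so you can compare the resulting interval with the true IQR:

A = rexp(30, rate = 0.2) #hypothetical data; in practice A is your sample
#the true IQR of this population, for comparison
qexp(0.75, 0.2) - qexp(0.25, 0.2)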
Exercise 4.8: Given the following data in years drawn from a distribution of failure times of an electronic component, find a confidence interval for the data’s median: 1.071 4.397 2.733 0.484 0.323 1.158 1.962 0.660 0.070 0.946 0.212 3.038
As virtually all the statistical procedures in this text require that our observations be independent of one another, a study of the properties of independent observations will prove rewarding. Recall from the previous chapter that if X and Y are independent observations, then

Pr{X = a and Y = b} = Pr{X = a}Pr{Y = b}

and

Pr{X = a given Y = b} = Pr{X = a}.
If X and Y are independent discrete random variables and their expectations exist and are finite,* then E{X + Y} = EX + EY. To see this, suppose that Y = y. The conditional expectation of X + Y given Y = y is

E{X + Y given Y = y} = E{X given Y = y} + y = EX + y,

where the last step follows from the independence of X and Y. Taking the average of this conditional expectation over all possible values of Y yields

E{X + Y} = EX + EY.
A similar result holds if X and Y have continuous distributions, providing that their individual expectations exist and are finite.
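A quick numerical check of the additivity of expectations, using two independent samples drawn from arbitrarily chosen distributions:

x = rpois(100000, 3) #expectation 3
y = rbinom(100000, 10, 0.4) #expectation 10 * 0.4 = 4
mean(x + y) #close to the theoretical 3 + 4 = 7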
Exercise 4.9: Show that for any variable X with a finite expectation, E(aX) = aEX, where a is a constant. Here is one example: The mean of heights measured in inches will be 12 times their mean when measured in feet.
Exercise 4.10: Show that the expectation of the mean of n independent identically distributed random variables with finite expectation θ is also θ.
The variance of the sum of two independent variables X and Y is the sum of their variances, Var(X + Y) = Var X + Var Y, providing each of these variances exists and is finite.
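Here, unlike with expectations, independence is essential. A simulation sketch, with arbitrary standard deviations of 2 and 3:

#independent samples with variances 4 and 9
x = rnorm(100000, sd = 2)
y = rnorm(100000, sd = 3)
var(x + y) #close to 4 + 9 = 13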
Exercise 4.11: Given the preceding result, show that the variance of the mean of n independent identically distributed observations is 1/nth of the variance of just one of them. Does this mean that the arithmetic mean is more precise than any individual observation? Does this mean that the sample mean will be closer to the mean of the population from which it is drawn than any individual observation would be, that is, that it would be more accurate?
*Exercise 4.12: Show that the sum of n independent identically distributed normal random variables with mean θ and variance σ² has a normal distribution with mean nθ and variance nσ².
In this chapter, you learned to distinguish statistics (functions of the observations) and the estimates based upon them, both of which vary from sample to sample, from parameters, the fixed, nonvarying properties of a population.
You considered the form and properties of one of the most common probability distributions—the normal. The distribution of the sample mean of most distributions encountered in practice will take on the form of a normal distribution even with moderate-size samples.
You learned that the sample mean and sample variance are desirable point estimates of the equivalent population parameters because they are both accurate and precise. You learned how the Student’s t-distribution might be used to obtain confidence intervals for the mean of a distribution of near-normal measurements.
You made use of the R functions rnorm(), qexp(), ecdf(), plot(), and lines(). You were shown how to download and use libraries to enlarge R’s capabilities: library(boot), for example, has the function boot.ci() with which to perform the bias-corrected-and-accelerated (BCa) bootstrap. You learned how to use either the BCa or the parametric bootstrap to obtain confidence intervals for properties of a distribution other than the mean.
We expanded on the properties of independent observations, properties we will make frequent use of in future chapters.
Exercise 4.13: Make a list of all the italicized terms in this chapter. Provide a definition for each one along with an example.
Notes
* That missing degree of freedom came about from our need to use an estimate of the variance based on the sample.
* In real life, expectations almost always exist and are finite—the expectations of ratios are a notable exception.