Modeling Measurement and Count Data
We first consider the general situation where there is a hypothetical population of individuals of interest and there is a continuous-valued measurement Y associated with each individual. One represents the collection of measurements from all individuals by means of a continuous probability density f(y). As discussed in Chapter 5, one summarizes this probability density with the mean μ:
$$
\mu = E(Y) = \int_{-\infty}^{\infty} y \, f(y) \, dy. \tag{8.1}
$$
The value μ gives us a sense of the location of a typical value of the continuous measurement Y.
To learn about the population of measurements, a random sample of individuals Y1, …, Yn will be taken. The general inferential problem is to use these measurements together with any prior beliefs to learn about the population mean μ. In other words, the goal is to use the collected measurements to learn about a typical value of the population of measurements.
College applications
How many college applications does a high school senior in the United States complete? Here one imagines a population of all American high school seniors and the measurement is the number of completed college applications. The unknown quantity of interest is the mean number of applications μ completed by these high school seniors. The inferential question may be stated as: on average, how many college applications does an American high school senior complete? The answer to this question gives one a sense of the number of completed applications for a typical high school senior. To learn about the average μ, it would be infeasible to collect this measurement from every high school senior in the U.S. Instead, a survey is typically administered to a sample of high school seniors (ideally a sample representative of all American high school seniors), and based on the measurements from this sample, some inference is performed about the mean number of college applications.
Household spending
How much does a household in San Francisco spend on housing every month? One visualizes the population of households in San Francisco, and the continuous measurement is the amount of money spent on housing (rent for renters or mortgage payments for homeowners) each month. One can ask "on average, how much does a household spend on housing every month in San Francisco?", and the answer to this question gives one a sense of the housing costs for a typical household in San Francisco. To learn about the mean housing cost μ of all San Francisco households, a sample survey is conducted, and the mean housing cost of the sampled households is used to make an inference about μ.
Weights of cats
Suppose you have a domestic shorthair cat weighing 14 pounds and you want to find out if she is overweight. One imagines a population of all domestic shorthair cats and the continuous measurement is the weight in pounds. Suppose you were able to compute the mean weight μ of all shorthair cats. Then by comparing 14 pounds (the weight of our cat) to this mean, you would know whether your cat is overweight, or underweight, or close to the mean. If we were able to find the distribution of the weights of all domestic shorthair cats, then one observes the proportion of weights smaller than 14 pounds in the distribution and learns if the cat is severely overweight. To learn if our cat is overweight, you can ask the vet. How does the vet know? Extensive research has been conducted periodically to record weights of a large sample of domestic shorthair cats, and by using these samples of weights, the vet performs an inference about the mean μ of the weights of all domestic shorthair cats.
Common elements of an inference problem
All three examples have common elements: a hypothetical population of individuals, a measurement Y defined for each individual, an unknown population mean μ representing a typical measurement value, and a sample of measurements used to learn about μ.
Recall the three general steps of Bayesian inference discussed in Chapter 7 in the context of an unknown proportion p.
In this setting, we have a continuous population of measurements that we represent by the random variable Y with density function f(y). It is convenient to assume that this population has a normal shape with mean μ and standard deviation σ. That is, a single measurement Y is assumed to come from the density function
$$
f(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}, \quad -\infty < y < \infty, \tag{8.2}
$$
displayed in Figure 8.1. To simplify the discussion, it is convenient to assume that the standard deviation σ of the measurement distribution is known. The objective is then to learn about the single unknown parameter, the mean μ.
Step 1 in Bayesian inference is to express an opinion about the parameter. In this continuous measurement setting, one constructs a prior for the mean parameter μ that expresses one’s opinion about the location of this mean. In this chapter, we discuss different ways to specify a prior distribution for μ. One attractive discrete approach for expressing this prior opinion, similar to the approach in Chapter 7 for a proportion p, has two steps. First one constructs a list of possible values of μ, and then one assigns probabilities to the possible values to reflect one’s belief. Alternatively, we will describe the use of a continuous prior to represent one’s belief for μ. This is a more realistic approach for constructing a prior since one typically views the mean as a real-valued parameter.
Step 2 of our process is to collect measurements from a random sample to gain more information about the parameter μ. In our first situation, one collects the number of applications from a sample of 100 high school seniors. In the second example, one collects a sample of 2000 housing costs, each from a sampled San Francisco household. The third example collects a sample of 200 different weights of domestic shorthair cats, each from a sampled cat. If these measurements are viewed as independent observations from a normal sampling density with mean μ, then one constructs a likelihood function which is the joint density of the sampled measurements viewed as a function of the unknown parameter.
Once the prior is specified and measurements have been collected, one proceeds to Step 3 to use Bayes’ rule to update one’s prior opinion to obtain a posterior distribution for the mean μ. The algebraic implementation of Bayes’ rule is a bit more tedious when dealing with continuous data with a normal sampling density. But we will see there is a simple procedure for computing the posterior mean and standard deviation.
Throughout this chapter, the entire inferential process is described for learning about a mean μ assuming a normal sampling density for the measurements. This chapter discusses how to construct a prior distribution that matches one’s prior belief, how to extract information from the data by the likelihood function, and how to update one’s opinion in the posterior, combining the prior and data information in a natural way.
Section 8.3 introduces inference with a discrete prior distribution for the mean μ and Section 8.4 introduces the continuous family of normal prior distributions for the mean. The inferential process with a normal prior distribution is described in detail in Section 8.5. Section 8.6 describes some general Bayesian inference methods in this normal data and normal prior setting, such as Bayesian hypothesis testing, Bayesian credible intervals and Bayesian prediction. These sections describe the use of both exact analytical solutions and approximation simulation-based calculations. Section 8.7 introduces the use of the posterior predictive distribution as a general tool for checking if the observed data is consistent with predictions from the Bayesian model.
The chapter concludes in Section 8.8 by introducing a popular one-parameter model for counts, the Poisson distribution, and its conjugate gamma distribution for representing prior opinion. Although this section does not deal with the normal mean situation, the exposure to the important gamma-Poisson conjugacy will enhance our understanding and knowledge of the analytical process of combining the prior and likelihood to obtain the posterior distribution.
8.3 Bayesian Inference with Discrete Priors
8.3.1 Example: Roger Federer’s time-to-serve
Roger Federer is recognized as one of the greatest players in tennis history. One aspect of his play that people enjoy is his businesslike way of serving to start a point in tennis. Federer appears to be efficient in his preparation to serve, and some of his service games are completed very quickly. Service efficiency is measured by the time-to-serve, which is the time in seconds between the end of the previous point and the beginning of the current point.
Since Federer is viewed as an efficient server, this raises the question: how long, on average, is Federer's time-to-serve? We know two things about his time-to-serve measurements. First, since they are time measurements, they are continuous variables. Second, due to a number of other variables, the measurements will vary from serve to serve. Suppose one collects a single time-to-serve measurement in seconds, denoted as Y. It seems reasonable to assume Y is normally distributed with unknown mean μ and standard deviation σ. From previous data, we assume that the standard deviation is known and given by σ = 4 seconds.
Recall the normal probability curve has the general form
$$
f(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}. \tag{8.3}
$$
Since σ = 4 is known, the only parameter in Equation (8.3) is μ. We are interested in learning about the mean time-to-serve μ.
A convenient first method of implementing Bayesian inference is by the use of a discrete prior. One specifies a subjective discrete prior for Federer’s mean time-to-serve by specifying a list of plausible values for μ and assigning a probability to each of these values.
In particular, suppose one thinks that the equally spaced values μ = 15, 16, …, 22 are plausible. In addition, one does not have any good reason to think that any of these values for the mean are more or less likely, so a uniform prior will be assigned where each value of μ is assigned the same probability
$$
\pi(\mu) = \frac{1}{8}, \quad \mu = 15, 16, \ldots, 22. \tag{8.4}
$$
Each value of μ corresponds to a particular normal sampling curve for the time-to-serve measurement. Figure 8.2 displays the eight possible normal sampling curves. Our prior says that each of these eight sampling curves has the same prior probability.
To learn more about the mean μ, one collects a single time-to-serve measurement for Federer, and suppose it is 15.1 seconds, that is, one observes Y = 15.1. The likelihood function is the normal density of the actual observation y viewed as a function of the mean μ (remember that it was assumed that σ = 4 was given). By substituting in the observation y = 15.1 and the known value of σ = 4, one writes the likelihood function as
$$
L(\mu) = \frac{1}{\sqrt{2\pi}\,(4)}\exp\left\{-\frac{(15.1-\mu)^2}{2(4)^2}\right\}.
$$
For each possible value of μ, we substitute the value into the likelihood expression. For example, the likelihood of μ = 15 is equal to
$$
L(15) = \frac{1}{\sqrt{2\pi}\,(4)}\exp\left\{-\frac{(15.1-15)^2}{2(4)^2}\right\} \approx 0.0997.
$$
This calculation is repeated for each of the eight values μ = 15, 16, …, 22, obtaining eight likelihood values.
A discrete prior has been assigned to the list of possible values of μ and one is now able to apply Bayes’ rule to obtain the posterior distribution for μ. The posterior probability of the value μ = μi given the data y for a discrete prior has the form
$$
\pi(\mu_i \mid y) = \frac{\pi(\mu_i)\, L(\mu_i)}{\sum_j \pi(\mu_j)\, L(\mu_j)}, \tag{8.5}
$$
where π(μi) is the prior probability of μ = μi and L(μi) is the likelihood function evaluated at μ = μi.
If a discrete uniform prior distribution for μ is assigned, one has π(μi) = 1/8 for each value, and the posterior probabilities are simply the likelihood values normalized to sum to one. Table 8.1 displays the calculations.
TABLE 8.1
Value, Prior, Likelihood, and Posterior for μ with a single observation.

| μ  | Prior | Likelihood | Posterior |
|----|-------|------------|-----------|
| 15 | 0.125 | 0.0997     | 0.1888    |
| 16 | 0.125 | 0.0972     | 0.1842    |
| 17 | 0.125 | 0.0891     | 0.1688    |
| 18 | 0.125 | 0.0767     | 0.1452    |
| 19 | 0.125 | 0.0620     | 0.1174    |
| 20 | 0.125 | 0.0471     | 0.0892    |
| 21 | 0.125 | 0.0336     | 0.0637    |
| 22 | 0.125 | 0.0225     | 0.0427    |
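The calculations in Table 8.1 are easy to reproduce; here is a minimal sketch in R using the built-in `dnorm()` function (the normalizing constant of the likelihood cancels in Bayes' rule, so using the full normal density is harmless):

mu <- seq(15, 22, 1)
Prior <- rep(1/8, 8)
Likelihood <- dnorm(15.1, mean = mu, sd = 4)
Posterior <- Prior * Likelihood / sum(Prior * Likelihood)
round(data.frame(mu, Prior, Likelihood, Posterior), 4)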
With the single measurement of time-to-serve of y = 15.1, one sees from Table 8.1 that the posterior distribution for μ favors the values μ = 15 and 16. In fact, the posterior probabilities decrease as a function of μ. The Prior column reminds us that the prior distribution is uniform. Bayesian inference uses the collected data to sharpen one's belief about the unknown parameter from the prior distribution to the posterior distribution. For this single observation, the sample mean is y = 15.1, and the μ value closest to the sample mean (μ = 15) is assigned the highest posterior probability.
Typically one collects multiple time-to-serve measurements. Suppose one collects n time-to-serve measurements, denoted as Y1, …, Yn, that are normally distributed with mean μ and fixed standard deviation σ = 4. Each observation follows the same normal density
$$
f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y_i-\mu)^2}{2\sigma^2}\right\}, \quad i = 1, \ldots, n. \tag{8.6}
$$
Again since σ = 4 is known, the only parameter in Equation (8.6) is μ and we are interested in learning about this mean parameter μ. Suppose the same discrete uniform prior is used as in Equation (8.4) and graphed in Figure 8.2. The mean μ takes on the values {15, 16, …, 22}, with each value assigned the same probability of 1/8 = 0.125.
Suppose one collects a sample of 20 times-to-serve for Federer:
15.1 11.8 21.0 22.7 18.6 16.2 11.1 13.2 20.4 19.2 21.2 14.3 18.6 16.8 20.3 19.9 15.0 13.4 19.9 15.3
When multiple time-to-serve measurements are taken, the likelihood function is the joint density of the actual observed values y1, …, yn viewed as a function of the mean μ. After some algebra (detailed derivation in Section 8.3.2), one writes the likelihood function as
$$
L(\mu) \propto \exp\left\{-\frac{n}{2\sigma^2}(\bar{y}-\mu)^2\right\} = \exp\left\{-\frac{20}{2(4)^2}(\bar{y}-\mu)^2\right\}, \tag{8.7}
$$
where we have substituted the known values n = 20 and the standard deviation σ = 4. From our sample, we compute the sample mean $\bar{y} = (15.1 + 11.8 + \cdots + 15.3)/20 = 17.2$ seconds, and for each possible value of μ we substitute $\bar{y} = 17.2$ into the likelihood expression.
This calculation is repeated for each of the eight values μ = 15, 16, …, 22, obtaining eight likelihood values.
One now applies Bayes’ rule to obtain the posterior distribution for μ. The posterior probability of μ = μi given the sequence of recorded times-to-serve y1, …, yn has the form
$$
\pi(\mu_i \mid y_1, \ldots, y_n) = \frac{\pi(\mu_i)\, L(\mu_i)}{\sum_j \pi(\mu_j)\, L(\mu_j)}, \tag{8.8}
$$
where π(μi) is the prior probability of μ = μi and L(μi) is the likelihood function evaluated at μ = μi. We saw in Equation (8.7) that only the sample mean $\bar{y}$ (together with the known values of n and σ) is needed from the data to evaluate the likelihood.
With a discrete uniform prior distribution for μ, again the posterior probabilities are the normalized likelihood values; Table 8.2 displays the calculations.
TABLE 8.2
Value, Prior, Likelihood, and Posterior for μ with n observations.

| μ  | Prior | Likelihood | Posterior |
|----|-------|------------|-----------|
| 15 | 0.125 | 0.0217     | 0.0217    |
| 16 | 0.125 | 0.1813     | 0.1815    |
| 17 | 0.125 | 0.4350     | 0.4353    |
| 18 | 0.125 | 0.2990     | 0.2992    |
| 19 | 0.125 | 0.0589     | 0.0589    |
| 20 | 0.125 | 0.0033     | 0.0033    |
| 21 | 0.125 | 0.0001     | 0.0001    |
| 22 | 0.125 | 0.0000     | 0.0000    |
It is helpful to construct a graph (see Figure 8.3) where one contrasts the prior and posterior probabilities for the mean time-to-serve μ. While the prior distribution is flat, the posterior distribution for μ favors the values μ = 16, 17, and 18 seconds. Bayesian inference uses the observed data to revise one's belief about the unknown parameter from the prior distribution to the posterior distribution. Recall that the sample mean is $\bar{y} = 17.2$ seconds; accordingly, the posterior assigns the highest probability to μ = 17, the plausible value closest to the sample mean.
8.3.2 Simplification of the likelihood
The likelihood function is the joint density of the observations y1, …, yn, viewed as a function of the mean μ (since σ = 4 is given). With n observations independent and identically distributed (i.i.d.) as Normal(μ, 4), the likelihood function is the product of normal density terms. In the algebra work that follows, the likelihood, as a function of μ, is found to be proportional to a normal density with mean $\bar{y}$ and standard deviation $\sigma/\sqrt{n} = 4/\sqrt{20} \approx 0.89$.
The calculation of the posterior probabilities is an application of Bayes' rule illustrated in earlier chapters. One creates a data frame with the values `mu` and the corresponding probabilities `Prior`, computes the likelihood values in the variable `Likelihood`, and finds the posterior probabilities using the `bayesian_crank()` function.
library(ProbBayes)  # provides bayesian_crank()
library(dplyr)      # provides mutate() and the pipe %>%

df <- data.frame(mu = seq(15, 22, 1),
                 Prior = rep(1/8, 8)) %>%
  mutate(Likelihood = dnorm(mu, 17.2, 4 / sqrt(20)))
df <- bayesian_crank(df)
round(df, 4)

  mu Prior Likelihood Product Posterior
1 15 0.125     0.0217  0.0027    0.0217
2 16 0.125     0.1813  0.0227    0.1815
3 17 0.125     0.4350  0.0544    0.4353
4 18 0.125     0.2990  0.0374    0.2992
5 19 0.125     0.0589  0.0074    0.0589
6 20 0.125     0.0033  0.0004    0.0033
7 21 0.125     0.0001  0.0000    0.0001
8 22 0.125     0.0000  0.0000    0.0000
Derivation of the simplified likelihood
In the following, we combine the terms in the exponent, expand all of the summation terms, and complete the square to get the result.
$$
L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y_i-\mu)^2}{2\sigma^2}\right\}
\propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right\}
\propto \exp\left\{-\frac{n}{2\sigma^2}\left(\mu^2 - 2\mu\bar{y}\right)\right\}
\propto \exp\left\{-\frac{n}{2\sigma^2}(\mu-\bar{y})^2\right\}. \tag{8.9}
$$
Sufficient statistic
There are different ways of writing and simplifying the likelihood function. One can choose to keep the product sign and each yi term, and leave the likelihood function as
$$
L(\mu) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y_i-\mu)^2}{2\sigma^2}\right\}. \tag{8.10}
$$
Doing so requires one to calculate the individual likelihood from each time-to-serve measurement yi and multiply these values to obtain the function L(μ) used to obtain the posterior probability.
If one instead simplifies the likelihood to be
$$
L(\mu) \propto \exp\left\{-\frac{n}{2\sigma^2}(\mu-\bar{y})^2\right\}, \tag{8.11}
$$
all the proportionality constants drop out in the calculation of the posterior probabilities for different values of μ. In the application of Bayes' rule, one only needs to know the number of observations n and the mean time-to-serve $\bar{y}$. For this reason, the sample mean $\bar{y}$ is called a sufficient statistic: it carries all the information in the data needed to evaluate the likelihood.
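To illustrate, here is a small sketch, using the 20 observed times from Section 8.3.1, confirming that the product form in Equation (8.10) and the simplified form in Equation (8.11) produce identical posterior probabilities under the uniform prior:

ys <- c(15.1, 11.8, 21.0, 22.7, 18.6, 16.2, 11.1, 13.2, 20.4, 19.2,
        21.2, 14.3, 18.6, 16.8, 20.3, 19.9, 15.0, 13.4, 19.9, 15.3)
mu <- seq(15, 22, 1)
# product of individual normal densities, Equation (8.10)
L_full <- sapply(mu, function(m) prod(dnorm(ys, m, 4)))
# simplified likelihood through the sufficient statistic, Equation (8.11)
L_simple <- exp(-length(ys) * (mu - mean(ys))^2 / (2 * 4^2))
# both give the same normalized posterior probabilities
round(L_full / sum(L_full), 4)
round(L_simple / sum(L_simple), 4)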
8.3.3 Inference: Federer’s time-to-serve
What has one learned about Federer's mean time-to-serve from this Bayesian analysis? Our prior said that any of the eight possible values of μ were equally likely with probability 0.125. After observing the sample of 20 measurements, one believes μ is most likely 16, 17, or 18 seconds, with respective probabilities 0.182, 0.435, and 0.299. In fact, adding up these posterior probabilities, one says that μ is in the set {16, 17, 18} seconds with probability 0.916.
This region of values of μ is called a 91.6% posterior probability region for the mean time-to-serve μ.
8.4.1 The normal prior for mean μ
Returning to our example, one is interested in learning about the time-to-serve for the tennis player Roger Federer. His serving times are believed to be normally distributed with unknown mean μ and known standard deviation σ = 4. The focus is on learning about the mean value μ.
In the prior construction in Section 8.3, we assumed μ was discrete, taking only integer values from 15 to 22. However, the mean time-to-serve μ does not have to be an integer; in fact, it is more realistic to assume μ is continuous-valued. One widely used approach for representing one's belief about a normal mean is based on a normal prior density with mean μ0 and standard deviation σ0, that is, μ ∼ Normal(μ0, σ0).
There are two parameters for this normal prior: the value μ0 represents one’s best guess at the mean time-to-serve μ and σ0 indicates how sure one thinks about the guess.
To illustrate the use of different priors for μ, let's consider the opinion of one tennis fan, Joe, who has strong prior information about the mean. His best guess at Federer's mean time-to-serve is 18 seconds, so he lets μ0 = 18. He is very sure of this guess, and so he chooses the relatively small value σ0 = 0.4. In contrast, a second tennis fan, Kate, also thinks that Federer's mean time-to-serve is 18 seconds, but she does not have a strong belief in this guess and chooses the larger value σ0 = 2. Figure 8.4 shows these two normal priors for the mean time-to-serve μ.
Both curves are symmetric and bell-shaped, centered at μ0 = 18. The main difference is the spread of the two curves: the Normal(18, 0.4) curve is much more concentrated around the mean μ0 = 18 than the Normal(18, 2) curve. Since the height of the probability density at a point reflects the relative plausibility of that value, the Normal(18, 0.4) prior reflects the belief that the mean time-to-serve will most likely be around μ0 = 18 seconds, whereas the Normal(18, 2) prior indicates that the mean μ could be as small as 15 seconds and as large as 20 seconds.
Informative prior
How does one in practice choose a normal prior for μ that reflects prior beliefs about the location of this parameter? One indirect strategy for selecting values of the prior parameters μ0 and σ0 is based on the specification of quantiles. On the basis of one’s prior beliefs, one specifies two quantiles of the normal density. Then the normal parameters are found by matching these two quantiles to a particular normal curve.
Recall the definition of a quantile — in this setting it is a value of the mean μ such that the probability of being smaller than that value is a given probability. To construct one’s prior for Federer’s mean time-to-serve, one thinks first about two quantiles. Suppose one specifies the 0.5 quantile to be 18 seconds — this means that μ is equally likely to be smaller or larger than 18 seconds. Next, one decides that the 0.9 quantile is 20 seconds. This means that one’s probability that μ is smaller than 20 seconds is 90%. Given values of these two quantiles, the unique normal curve is found that matches this information.
The matching is performed by the R function `normal.select()`. One inputs the two quantiles by `list` statements, and the output is the mean and standard deviation of the matching normal prior.
normal.select(list(p = 0.5, x = 18),
              list(p = 0.9, x = 20))
$mu
[1] 18

$sigma
[1] 1.560608
The normal curve with mean μ0 = 18 and σ0 = 1.56, displayed in Figure 8.5, matches the prior information stated by the two quantiles.
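In this particular case one can verify the matching arithmetic by hand. Since the normal density is symmetric, the 0.5 quantile pins the prior mean at μ0 = 18, and the 0.9 quantile then determines σ0 by solving 18 + σ0 × qnorm(0.90) = 20; a quick check:

(20 - 18) / qnorm(0.90)
[1] 1.560608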
Since our measurement skills are limited, this prior is just an approximation to our beliefs about μ. We recommend in practice that one perform several checks to see if this normal prior makes sense. Several functions are available to help in this prior checking.
For example, we find the 0.25 quantile of our prior using the `qnorm()` function.
qnorm(0.25, 18, 1.56)
[1] 16.9478
This prior says that the prior probability that μ is smaller than 16.95 is 25%. If this does not seem reasonable, one would make adjustments in the values of the normal mean and standard deviation until a reasonable normal prior is found.
Weakly informative prior
We have been assuming that we have some information about the mean parameter μ that is represented by a normal prior. What would one do in the situation where little is known about the location of μ? For a normal prior, the standard deviation σ0 represents the sureness of the belief in the guess μ0 at the value of the mean. If one is really unsure about any guess at μ, then one can assign σ0 a large value. In that case the choice of the prior mean is not important, so we suggest using a Normal(0, σ0) prior with a large value for σ0. This prior indicates that μ may plausibly range over a large interval and represents a weakly informative prior belief about the parameter.
As will be seen later in this chapter, when a vague prior is chosen, the posterior inference for μ will largely be driven by the data. This behavior is desirable since we know little about the location of μ a priori in this situation and we want the data to inform about the location of μ with little influence by the prior.
Continuing our discussion on learning about the mean time-to-serve for Roger Federer, the current prior beliefs about Federer’s mean time-to-serve μ are represented by a normal curve with mean 18 seconds and standard deviation 1.56 seconds.
Next some data is collected — Federer's times-to-serve are recorded for 20 serves, and the sample mean is 17.2 seconds. Recall that we are assuming the population standard deviation σ = 4 seconds. The likelihood is given by
$$
L(\mu) \propto \exp\left\{-\frac{n}{2\sigma^2}(\bar{y}-\mu)^2\right\}, \tag{8.12}
$$
and with substitution of the values
$$
L(\mu) \propto \exp\left\{-\frac{20}{2(4)^2}(17.2-\mu)^2\right\}. \tag{8.13}
$$
Viewing the likelihood as a function of the parameter μ as in Equation (8.13), the likelihood is recognized as being proportional to a normal density with mean $\bar{y} = 17.2$ and standard deviation $\sigma/\sqrt{n} = 4/\sqrt{20} \approx 0.89$.
The Bayes’ rule calculation is very familiar to the reader — one obtains the posterior density curve by multiplying the normal prior by the likelihood. If one writes down the product of the normal likelihood and the normal prior density and works through some messy algebra, one will discover that the posterior density also has the normal density form.
The normal prior is said to be conjugate since the prior and posterior densities come from the same distribution family: normal. To be more specific, suppose the observation has a normal sampling density with unknown mean μ and known standard deviation σ. If one specifies a normal prior for the unknown mean μ with mean μ0 and standard deviation σ0, one obtains a normal posterior for μ with updated parameters μn and σn.
In Section 8.5.2, we provide a quick peek at this posterior updating without worrying about the mathematical derivation, and Section 8.5.3 describes the details of the Bayes' rule calculation. Section 8.5.4 looks at the conjugacy more closely and provides some insight on the effects of prior and likelihood on the posterior distribution.
8.5.2 A quick peek at the update procedure
It is convenient to describe the updating procedure by use of a table. In Table 8.3, there are rows corresponding to Prior, Likelihood, and Posterior and columns corresponding to Mean, Precision, and Standard Deviation. The mean and standard deviation of the normal prior are placed in the “Prior” row, and the sample mean and standard error are placed in the “Likelihood” row.
TABLE 8.3
Updating the normal prior: step 1.

| Type       | Mean  | Precision | Stand_Dev |
|------------|-------|-----------|-----------|
| Prior      | 18.00 |           | 1.56      |
| Likelihood | 17.20 |           | 0.89      |
| Posterior  |       |           |           |
We define the precision, ϕ, to be the reciprocal of the square of the standard deviation. We compute the precisions of the prior and the data from the given standard deviations: the prior precision is ϕ0 = 1/1.56² ≈ 0.41, and the likelihood (data) precision is ϕdata = 1/0.89² ≈ 1.26.
We enter the precisions in the corresponding rows of Table 8.4.
TABLE 8.4
Updating the normal prior: step 2.

| Type       | Mean  | Precision | Stand_Dev |
|------------|-------|-----------|-----------|
| Prior      | 18.00 | 0.41      | 1.56      |
| Likelihood | 17.20 | 1.26      | 0.89      |
| Posterior  |       |           |           |
We will shortly see that the Posterior precision is the sum of the Prior and Likelihood precisions, ϕn = 0.41 + 1.26 = 1.67, and that the posterior standard deviation is the reciprocal of the square root of the posterior precision, 1/√1.67 ≈ 0.77.
These precisions and standard deviations are entered into Table 8.5.
TABLE 8.5
Updating the normal prior: step 3.

| Type       | Mean  | Precision | Stand_Dev |
|------------|-------|-----------|-----------|
| Prior      | 18.00 | 0.41      | 1.56      |
| Likelihood | 17.20 | 1.26      | 0.89      |
| Posterior  |       | 1.67      | 0.77      |
The posterior mean is a weighted average of the Prior and Likelihood means where the weights are given by the corresponding precisions. That is, the formula is given by
$$
\textrm{Posterior mean} = \frac{\phi_0 \times \textrm{Prior mean} + \phi_{data} \times \textrm{Likelihood mean}}{\phi_0 + \phi_{data}}. \tag{8.14}
$$
By making appropriate substitutions, we obtain the posterior mean:
$$
\mu_n = \frac{0.41 \times 18.00 + 1.26 \times 17.20}{0.41 + 1.26} \approx 17.40.
$$
The posterior density is normal with mean 17.40 seconds and standard deviation 0.77 seconds. See Table 8.6 for the final update step.
TABLE 8.6
Updating the normal prior: step 4.

| Type       | Mean  | Precision | Stand_Dev |
|------------|-------|-----------|-----------|
| Prior      | 18.00 | 0.41      | 1.56      |
| Likelihood | 17.20 | 1.26      | 0.89      |
| Posterior  | 17.40 | 1.67      | 0.77      |
The normal updating is performed by the R function `normal_update()`. One inputs two vectors: `prior` is a vector of the prior mean and standard deviation, and `data` is a vector of the sample mean and standard error. The output is a vector of the posterior mean and posterior standard deviation.
prior <- c(18, 1.56)
data <- c(17.20, 0.89)
normal_update(prior, data)
[1] 17.3964473  0.7730412
The prior and posterior densities are displayed in Figure 8.6. As is usually the case, the posterior density has a smaller spread, since the posterior has more information than the prior about Federer's mean time-to-serve. More information about a parameter indicates less uncertainty and a smaller spread of the posterior density. In the process from prior to posterior, one sees how the data modifies one's initial belief about the parameter μ.
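One can also use `normal_update()` to illustrate the weakly informative prior of Section 8.4. With a vague prior — here a hypothetical Normal(0, 100) choice — the posterior essentially reproduces the likelihood, so the data drive the inference:

prior_vague <- c(0, 100)   # hypothetical vague prior
normal_update(prior_vague, data)
# approximately 17.20 and 0.89 -- nearly the sample mean and standard error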
8.5.3 Bayes' rule calculation
Section 8.5.2 gave an overview of the updating procedure for a normal prior and normal sampling. In this section we explain (1) why it is preferable to work with the precisions instead of the standard deviations; (2) why the precisions act as the weights in the calculation of the posterior mean; and (3) why the posterior is a normal distribution.
Recall a precision is the reciprocal of the square of the standard deviation. We use ϕ = 1/σ² for the precision of a single observation and ϕ0 = 1/σ0² for the precision of the prior.
We write down the likelihood of μ, combining terms, and writing the expression in terms of the precision ϕ.
$$
L(\mu) = \prod_{i=1}^{n}\frac{\sqrt{\phi}}{\sqrt{2\pi}}\exp\left\{-\frac{\phi}{2}(y_i-\mu)^2\right\} \propto \exp\left\{-\frac{\phi}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right\} \tag{8.15}
$$
$$
\propto \exp\left\{-\frac{n\phi}{2}(\mu-\bar{y})^2\right\}. \tag{8.16}
$$
Note that σ is assumed known, therefore the likelihood function is only in terms of μ, i.e. L(μ).
Similarly, the normal prior density for μ is written in terms of the prior precision ϕ0:
$$
\pi(\mu) = \frac{\sqrt{\phi_0}}{\sqrt{2\pi}}\exp\left\{-\frac{\phi_0}{2}(\mu-\mu_0)^2\right\} \tag{8.17}
$$
$$
\propto \exp\left\{-\frac{\phi_0}{2}(\mu-\mu_0)^2\right\}. \tag{8.18}
$$
Bayes’ rule is applied by multiplying the prior by the likelihood to obtain the posterior. In deriving the posterior of μ, the manipulations require careful consideration regarding what is known. The only unknown variable is μ, so any “constants” or known quantities not depending on μ can be added or dropped with the proportionality sign “∝”.
$$
\pi(\mu \mid y_1, \ldots, y_n) \propto \exp\left\{-\frac{\phi_0}{2}(\mu-\mu_0)^2\right\}\exp\left\{-\frac{n\phi}{2}(\mu-\bar{y})^2\right\}
\propto \exp\left\{-\frac{\phi_0+n\phi}{2}\left(\mu-\frac{\phi_0\mu_0+n\phi\bar{y}}{\phi_0+n\phi}\right)^2\right\}. \tag{8.19}
$$
Looking closely at the final expression, one recognizes that the posterior for μ is a normal density with mean and precision parameters. Specifically, we recognize (ϕ0 + nϕ) as the posterior precision ϕn and identify the posterior mean as the precision-weighted combination of μ0 and $\bar{y}$; that is,
$$
\mu \mid y_1, \ldots, y_n \sim \textrm{Normal}\left(\frac{\phi_0\mu_0+n\phi\bar{y}}{\phi_0+n\phi},\; \frac{1}{\sqrt{\phi_0+n\phi}}\right). \tag{8.20}
$$
In passing, it should be noted that the same result would be attained using the standard deviations, σ and σ0, instead of the precisions, ϕ and ϕ0. It is preferable to work with the precisions due to the relative simplicity of the notation. In particular, one sees in Table 8.5 that the posterior precision is the sum of the prior and likelihood precisions, that is, the posterior precision ϕn = ϕ0 + nϕ.
Let’s summarize our calculations in Section 8.5.3. We collect a sequence of continuous observations that are assumed identically and independently distributed as Normal(μ, σ), and a normal prior is assigned to the mean parameter μ.
$$
Y_1, \ldots, Y_n \overset{i.i.d.}{\sim} \textrm{Normal}(\mu, \sigma), \qquad \mu \sim \textrm{Normal}(\mu_0, \sigma_0). \tag{8.21}
$$
When σ (or equivalently ϕ) is known, the mean μ is the only unknown parameter in the likelihood, and the posterior distribution of μ is again normal:
$$
\mu \mid y_1, \ldots, y_n \sim \textrm{Normal}(\mu_n, \sigma_n), \tag{8.22}
$$
with updated parameters
$$
\mu_n = \frac{\phi_0\mu_0 + n\phi\bar{y}}{\phi_0 + n\phi}, \qquad \sigma_n = \frac{1}{\sqrt{\phi_0 + n\phi}}. \tag{8.23}
$$
In this situation where the sampling standard deviation σ is known, the normal density is a conjugate prior for the mean of a normal distribution, as the posterior distribution for μ is another normal density with updated parameters. Conjugacy is a convenient property, as the posterior distribution for μ has a familiar functional form: it allows one to conduct Bayesian inference through exact analytical solutions as well as simulation, and it provides insight into how the data and prior are combined in the posterior distribution.
The posterior compromises between the prior and the sample
Recall that Bayesian inference is a general approach where one initializes a prior belief for an unknown quantity, collects data expressed through a likelihood function, and combines prior and likelihood to give an updated belief for the unknown quantity. In Chapter 7, we have seen how the posterior mean of a proportion is a compromise between the prior mean and sample proportion (refer to Section 7.4.2 as needed). In the current normal mean case, the posterior mean is similarly viewed as an estimate that compromises between the prior mean and sample mean. One rewrites the posterior mean in Equation (8.23) as follows:
$$
\mu_n = \frac{\phi_0}{\phi_0 + n\phi}\,\mu_0 + \frac{n\phi}{\phi_0 + n\phi}\,\bar{y}. \tag{8.24}
$$
The prior precision is equal to ϕ0, and the precision in the likelihood for any single yi is ϕ. Since there are n observations, the precision in the joint likelihood is nϕ. The posterior mean is thus a weighted average of the prior mean μ0 and the sample mean $\bar{y}$, with weights proportional to the corresponding precisions: the larger the sample size n, the more the posterior mean is pulled toward $\bar{y}$.
The posterior accumulates information in the prior and the sample
In addition, the precision of the posterior mean is the sum of the precisions of the prior and likelihood. That is,
$$
\phi_n = \phi_0 + n\phi. \tag{8.25}
$$
The implication is that the posterior standard deviation will always be smaller than either the prior standard deviation or the sampling standard error of the mean: σn = 1/√(ϕ0 + nϕ) is smaller than both σ0 and σ/√n. In the time-to-serve example, 0.77 is indeed smaller than both 1.56 and 0.89.
8.6 Bayesian Inferences for Continuous Normal Mean
Continuing with the example about Federer’s time-to-serve, our normal prior had mean 18 seconds and standard deviation 1.56 seconds. After collecting 20 time-to-serve measurements with a sample mean of 17.2, the posterior distribution Normal(17.4, 0.77) reflects our opinion about the mean time-to-serve.
Bayesian inferences about the mean μ are based on various summaries of this posterior normal distribution. Because the exact posterior distribution of the mean μ is normal, it is convenient to use R functions such as `pnorm()` and `qnorm()` to conduct Bayesian hypothesis testing and construct Bayesian credible intervals. Simulation-based methods utilizing functions such as `rnorm()` are also useful for providing approximations to those inferences. A sequence of examples is given in Section 8.6.1.
Predictions of future data are also of interest. For example, one might want to predict the next time-to-serve measurement based on the posterior distribution of μ being Normal(17.4, 0.77). In Section 8.6.2, details of the prediction procedure and examples are provided.
8.6.1 Bayesian hypothesis testing and credible interval
A testing problem
In a testing problem, one is interested in checking the validity of a statement about a population quantity. In our tennis example, suppose someone says that Federer takes on average at least 19 seconds to serve. Is this a reasonable statement?
The current beliefs about Federer’s mean time-to-serve are summarized by a normal distribution with mean 17.4 seconds and standard deviation 0.77 seconds. To assess if the statement “μ is 19 seconds or more” is reasonable, one simply computes its posterior probability, Prob(μ ≥ 19|μn = 17.4, σn = 0.77).
1 - pnorm(19, 17.4, 0.77)
[1] 0.01885827
This probability is about 0.019, a small value, so one would conclude that this person’s statement is unlikely to be true.
This is the exact solution using the `pnorm()` function with mean 17.4 and standard deviation 0.77. As seen in Chapter 7, simulation provides an alternative approach to obtaining the probability Prob(μ ≥ 19 | μn = 17.4, σn = 0.77). To implement the simulation approach, recall that one generates a large number of values from the posterior distribution and summarizes this simulated sample. In particular, using the following R script, one generates 1000 values from the Normal(17.4, 0.77) distribution and approximates the probability of "μ is 19 seconds or more" by computing the percentage of simulated values that fall above 19.
S <- 1000
NormalSamples <- rnorm(S, 17.4, 0.77)
sum(NormalSamples >= 19) / S
[1] 0.024
The reader might notice that the approximated value of 0.024 differs from the exact answer of 0.019 obtained with the `pnorm()` function. One way to improve the accuracy of the approximation is to increase the number of simulated values. For example, increasing S from 1000 to 10,000 provides a better approximation to the exact probability 0.019.
S <- 10000
NormalSamples <- rnorm(S, 17.4, 0.77)
sum(NormalSamples >= 19) / S
[1] 0.0175
A Bayesian interval estimate
Bayesian credible intervals for the mean parameter μ can be achieved both by exact calculation and simulation. Recall that a Bayesian credible interval is an interval that contains the unknown parameter with a certain probability content. For example, a 90% Bayesian credible interval for the parameter μ is an interval containing μ with a probability of 0.90.
The exact interval is obtained by using the R function `qnorm()`. For example, with the posterior distribution for μ being Normal(17.4, 0.77), the following R script shows that a 90% central Bayesian credible interval is (16.133, 18.667). That is, the posterior probability that μ falls between 16.133 and 18.667 is exactly 90%.
qnorm(c(0.05, 0.95), 17.4, 0.77)
[1] 16.13346 18.66654
For simulation-based inference, one generates a large number of values from the posterior distribution and then finds the 5th and 95th sample quantiles, the endpoints of the middle 90% of the generated values. Below one sees that a simulation-based 90% credible interval for μ is approximately (16.151, 18.691).
S <- 1000
NormalSamples <- rnorm(S, 17.4, 0.77)
quantile(NormalSamples, c(0.05, 0.95))
      5%      95% 
16.15061 18.69062
Bayesian credible intervals can also be used for testing hypotheses. Suppose one again wants to evaluate the statement "Federer takes on average at least 19 seconds to serve." One answers this question by computing the 90% credible interval and noting that the values of μ of at least 19 are not included in the exact 90% credible interval (16.13, 18.67). The interpretation is that the probability is at least 0.90 that Federer's average time-to-serve is smaller than 19 seconds. One could obtain a wider credible interval, say by computing a central 95% credible interval (see the R output below), and observe that 19 falls outside this interval as well. This indicates we are quite confident that 19 seconds is not the value of Federer's average time-to-serve.
qnorm(c(0.025, 0.975), 17.4, 0.77)
[1] 15.89083 18.90917
On the basis of this credible interval calculation, one concludes that the statement about Federer’s time-to-serve is unlikely to be true. This conclusion is consistent with the typical Bayesian hypothesis testing procedure given at the beginning of this section.
8.6.2 Bayesian prediction
Suppose one is interested in predicting Federer's future time-to-serve. Since the belief about the mean μ has already been updated, the prediction is made based on the posterior predictive distribution.
How does one make a prediction of a single future time-to-serve for Federer? In Chapter 7, we saw two different approaches for predicting a new survey outcome of students' dining preferences: one approach is based on the derivation of the exact posterior predictive distribution, and the other simulates predictions from that distribution. Both approaches are described below.
Exact predictive distribution
We first describe the exact posterior predictive distribution. Consider making a prediction of a single future Federer time-to-serve, denoted $\tilde{y}$. Before any data is observed, the predictive density of $\tilde{y}$ is obtained by averaging the sampling density over the prior:
$$
f(\tilde{y}) = \int f(\tilde{y} \mid \mu)\, \pi(\mu)\, d\mu. \tag{8.26}
$$
After observing the times-to-serve y1, …, yn, the posterior predictive density instead averages over the posterior distribution of μ:
$$
f(\tilde{y} \mid y_1, \ldots, y_n) = \int f(\tilde{y} \mid \mu)\, \pi(\mu \mid y_1, \ldots, y_n)\, d\mu. \tag{8.27}
$$
The computation of the predictive density is possible for this normal sampling model with a normal prior. It is assumed that Y is Normal(μ, σ), where the mean μ is Normal(μ0, σ0) and the standard deviation σ is known. Then one can show that the marginal density of Y is
$$
Y \sim \textrm{Normal}\left(\mu_0, \sqrt{\sigma^2 + \sigma_0^2}\right). \tag{8.28}
$$
This result can be used to derive the posterior predictive distribution of $\tilde{y}$. In our example, the current beliefs about μ are given by the posterior distribution
$$
\mu \mid y_1, \ldots, y_n \sim \textrm{Normal}(17.4, 0.77). \tag{8.29}
$$
Then by applying our general result in Equation (8.28), the posterior predictive density of the single future observation $\tilde{y}$ is
$$
\tilde{y} \sim \textrm{Normal}\left(17.4, \sqrt{4^2 + 0.77^2}\right) = \textrm{Normal}(17.4, 4.07). \tag{8.30}
$$
An important aspect of the predictive distribution for $\tilde{y}$ is its spread. The predictive standard deviation 4.07 is much larger than the posterior standard deviation 0.77 of the mean μ, because a prediction of a single time-to-serve must account both for the remaining uncertainty about the mean μ and for the variability of an individual measurement about that mean.
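A one-line numerical check of this predictive standard deviation:

sqrt(4 ^ 2 + 0.77 ^ 2)   # approximately 4.07, dominated by sigma = 4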
Predictions by simulation
The alternative method of computing the predictive distribution is by simulation. In this setting, there are two unknowns — the mean parameter μ and the future observation $\tilde{y}$ — and one simulates them in two steps. First a value of the mean is drawn from its posterior distribution:
$$
\mu^* \sim \textrm{Normal}(\mu_n, \sigma_n) = \textrm{Normal}(17.4, 0.77). \tag{8.31}
$$
Then the future observation is drawn from the sampling density using the simulated mean:
$$
\tilde{y} \sim \textrm{Normal}(\mu^*, \sigma) = \textrm{Normal}(\mu^*, 4). \tag{8.32}
$$
This two-step procedure is implemented for our time-to-serve example using the following R script.
sigma <- 4
mu_n <- 17.4
sigma_n <- 0.77
pred_mu_sim <- rnorm(1, mu_n, sigma_n)
(pred_y_sim <- rnorm(1, pred_mu_sim, sigma))
[1] 16.04772
The script can easily be updated to create S = 1000 predictions, which is helpful for summarizing the predictive distribution.
S <- 1000
pred_mu_sim <- rnorm(S, mu_n, sigma_n)
pred_y_sim <- rnorm(S, pred_mu_sim, sigma)
The vector `pred_y_sim` contains 1000 predictions of Federer's time-to-serve.
To evaluate the accuracy of the simulation-based predictions, Figure 8.7 displays the exact predictive density together with a density estimate of the simulated predictions for a single time-to-serve measurement. One observes good agreement between the two computation methods in this example.
8.7 Posterior Predictive Checking
In Section 8.6, the use of the posterior predictive distribution for predicting a future time-to-serve measurement was described. As discussed in Chapter 7, this distribution is also helpful for assessing the suitability of the Bayesian model.
In our example, we observed 20 times-to-serve for Federer. The question is whether these observed times are consistent with replicated data from the posterior predictive distribution. In this setting, replicated refers to the same sample size as our original sample. In other words, if one takes samples of 20 from the posterior predictive distribution, do these replicated datasets resemble the observed sample?
Since the population standard deviation is known to be σ = 4 seconds, the sampling distribution of Y is normal with mean μ and standard deviation σ = 4. One simulates replicated data $\tilde{y} = (\tilde{y}_1, \ldots, \tilde{y}_{20})$ in two steps: first a value of the mean is drawn from its posterior distribution,
$$
\mu^* \sim \textrm{Normal}(17.4, 0.77), \tag{8.33}
$$
and then a replicated sample of 20 future times is drawn from the corresponding sampling density,
$$
\tilde{y}_1, \ldots, \tilde{y}_{20} \sim \textrm{Normal}(\mu^*, 4). \tag{8.34}
$$
This method is implemented in the following R script to simulate 1000 replicated samples from the posterior predictive distribution. The vector `pred_mu_sim` contains draws from the posterior distribution, and the matrix `ytilde` contains the simulated predictions, where each row of the matrix is a simulated sample of 20 future times.
sigma <- 4
mu_n <- 17.4
sigma_n <- 0.77
S <- 1000
pred_mu_sim <- rnorm(S, mu_n, sigma_n)
sim_ytilde <- function(j){
  rnorm(20, pred_mu_sim[j], sigma)
}
ytilde <- t(sapply(1:S, sim_ytilde))
To judge goodness of fit, we wish to compare these simulated replicated datasets from the posterior predictive distribution with the observed data. One convenient way to implement this comparison is to compute some "testing function" $T(\tilde{y})$ on each replicated dataset and compare the distribution of these values with the value of the testing function evaluated on the observed sample, T(y).
To implement this procedure, one needs to choose a testing function T. Suppose the sample mean $T(y) = \bar{y}$ is used. One computes the sample mean of each row of the matrix `ytilde` with the `apply()` function.
pred_ybar_sim <- apply(ytilde, 1, mean)
Figure 8.8 displays a density estimate of the simulated values from the posterior predictive distribution of $T = \bar{y}$. Since the observed sample mean $\bar{y} = 17.2$ falls near the center of this predictive distribution, the observed data appears consistent with replicated data predicted from the Bayesian model.
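A simple numerical companion to Figure 8.8 is the proportion of replicated sample means at least as large as the observed value; proportions very close to 0 or 1 would signal a poor fit. A sketch:

mean(pred_ybar_sim >= 17.2)
# a value near 0.5 (not in either tail), indicating the observed
# mean is typical of data predicted by the model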
8.8 Modeling Count Data
To further illustrate the Bayesian approach to inference, consider Poisson sampling, a popular model for count data. One assumes that one observes a random sample from a Poisson distribution with an unknown rate parameter λ. The conjugate prior for the Poisson rate is the gamma distribution. This scenario provides further practice in various Bayesian computations, such as computing the likelihood function and posterior distribution, and obtaining the predictive distribution to learn about future data. In this section, we focus on the main results; the detailed derivations are left as end-of-chapter exercises.
Counts of patients in an emergency room
A hospital wants to determine how many doctors and nurses to assign to its emergency room (ER) team between 10 pm and 11 pm during the week. An important piece of information is the count of patients arriving in the ER in this one-hour period.
For a count measurement variable such as the count of patients, a popular sampling model is the Poisson distribution. This distribution is used to model the number of times an event occurs in an interval of time or space. In the current example, the event is a patient’s arrival to the ER, and the time interval is the period between 10 pm and 11 pm. The hospital wishes to learn about the average count of patients arriving to the ER each hour. Perhaps more importantly, the hospital wants to predict the patient count since that will directly address the scheduling of doctors and nurses question.
Counts of visitors to a website
As a second example, suppose one is interested in monitoring the popularity of a particular blog focusing on baseball analytics. Table 8.7 displays the number of visitors viewing this blog for 28 days during June of 2019. In this setting, the event of interest is a visit to the blog website and the time interval is a single day. The blog author is particularly interested in learning about the average number of visitors during the days Monday through Friday and predicting the number of visits for a future day in the summer of 2019.
TABLE 8.7
Count of visitors to blog during 28 days in June 2019.

|        | Fri | Sat | Sun | Mon | Tue | Wed | Thu |
|--------|-----|-----|-----|-----|-----|-----|-----|
| Week 1 | 95  | 81  | 85  | 100 | 111 | 130 | 113 |
| Week 2 | 92  | 65  | 78  | 96  | 118 | 120 | 104 |
| Week 3 | 91  | 91  | 79  | 106 | 91  | 114 | 110 |
| Week 4 | 98  | 61  | 84  | 96  | 126 | 119 | 90  |
8.8.2 The Poisson distribution
Let the random variable Y denote the number of occurrences of an event in an interval with sample space {0, 1, 2, … }. In contrast to the normally distributed continuous measurement, note that Y only takes integer values from 0 to infinity. The variable Y follows a Poisson distribution with rate parameter λ when the probability mass function (pmf) of observing y events in an interval is given by
$$
f(y \mid \lambda) = P(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \ldots, \tag{8.35}
$$
where λ is the average number of events per interval, e = 2.71828… is Euler’s number, and y! is the factorial of y.
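For example, the pmf in Equation (8.35) can be evaluated directly or with the built-in `dpois()` function; a quick sketch with an arbitrary rate λ = 3:

lambda <- 3
y <- 4
exp(-lambda) * lambda ^ y / factorial(y)   # direct use of Equation (8.35)
dpois(y, lambda)                           # same value, about 0.168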
The Poisson sampling model is based on several assumptions about the sampling process. One assumes that the time interval is fixed, counts of arrivals occurring during different time intervals are independent, and the rate λ at which the arrivals occur is constant over time. To check the suitability of the Poisson distribution for the examples, one needs to check the conditions individually.
In some situations, the second and third conditions will be violated. In our ER example, the occurrence of serious accidents may bring multiple groups of patients to the ER at certain time intervals. In this case, arrival times of patients may not be independent and the arrival rate λ in one subinterval will be higher than the arrival rate of another subinterval. When such situations occur, one needs to decide about the severity of the violation of the conditions and possibly use an alternative sampling model instead of the Poisson.
As evident in Equation (8.35), the Poisson distribution has only one parameter, the rate parameter λ, so the Poisson sampling model belongs to the family of one-parameter sampling models. The binomial data model with success probability p and the normal data model with mean parameter μ (with known standard deviation) are two other examples of one-parameter models. One distinguishes these models by the type of possible sample values, discrete or continuous. The binomial random variable is the number of successes and the Poisson random variable is a count of arrivals, so they both are discrete one-parameter models. In contrast, the normal sampling data model is a continuous one-parameter model.
The reader should be familiar with the typical procedure of Bayesian inference and prediction for one-parameter models. We rewrite this procedure in the context of the Poisson sampling model.
Step 1 One constructs a prior expressing an opinion about the location of the rate λ before any data is collected.
Step 2 One takes the sample of intervals and records the number of arrivals in each interval. From this data, one forms the likelihood, the probability of these observations expressed as a function of λ.
Step 3 One uses Bayes’ rule to compute the posterior – this distribution updates the prior opinion about λ given the information from the data.
In addition, one computes the predictive distribution to learn about the number of arrivals in future intervals. The posterior predictive distribution is also useful in checking the appropriateness of our model.
Gamma prior distribution
One begins by constructing a prior density to express one’s opinion about the rate parameter λ. Since the rate is a positive continuous parameter, one needs to construct a prior density that places its support only on positive values. The convenient choice of prior distributions for Poisson sampling is the gamma distribution which has a density function given by
$$
\pi(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}\exp(-\beta\lambda), \quad \lambda > 0, \tag{8.36}
$$
where Γ(α) is the gamma function evaluated at α. The gamma density is a continuous density where the support is on positive values. It depends on two parameters, a positive shape parameter α and a positive rate parameter β.
The gamma density is a flexible family of distributions that can reflect many different types of prior beliefs about the location of the parameter λ. One chooses values of the shape α and the rate β so that the gamma density matches one's prior information about the location of λ. In R, the function `dgamma()` gives the density, `pgamma()` gives the distribution function, and `qgamma()` gives the quantile function for the gamma distribution. These functions are helpful in graphing the prior and in choosing values of the shape and rate parameters that match prior statements about gamma percentiles and probabilities. We provide an illustration of choosing a subjective gamma prior in the example.
Sampling and the likelihood
Suppose that Y1, …, Yn represent the observed counts in n time intervals where the counts are independent and each Yi follows a Poisson distribution with rate λ. The joint mass function of Y1, …, Yn is obtained by multiplying the Poisson densities.
$$
f(y_1, \ldots, y_n \mid \lambda) = \prod_{i=1}^{n}\frac{\lambda^{y_i}e^{-\lambda}}{y_i!}. \tag{8.37}
$$
Once the counts y1, …, yn are observed, the likelihood of λ is the joint probability of observing this data, viewed as a function of the rate parameter λ.
$$
L(\lambda) \propto \lambda^{\sum_{i=1}^{n} y_i}\exp(-n\lambda). \tag{8.38}
$$
The Gamma posterior
If the rate parameter λ in the Poisson sampling model follows a gamma prior distribution, then it turns out that the posterior distribution for λ will also have a gamma density with updated parameters. This demonstrates that the gamma density is the conjugate prior for Poisson sampling, as the prior and posterior densities both come from the same family of distributions: gamma.
We begin by assuming that the Poisson parameter λ has a gamma distribution with shape and rate parameters α and β, that is, λ ∼ Gamma(α, β). If one multiplies the gamma prior by the likelihood function L(λ), then in an end-of-chapter exercise you will show that the posterior density of λ is Gamma(αn, βn), where the updated parameters αn and βn are given by
$$
\alpha_n = \alpha + \sum_{i=1}^{n} y_i, \qquad \beta_n = \beta + n. \tag{8.39}
$$
Inference about λ
Once the posterior distribution has been derived, all inferences about the Poisson parameter λ are performed by computing particular summaries of the gamma posterior distribution. In particular, one may be interested in testing if λ falls in a particular region by computing a posterior probability, or in constructing an interval estimate for λ. All of these computations are facilitated using the `pgamma()`, `qgamma()`, and `rgamma()` functions. In the end-of-chapter exercises, there are opportunities to perform these inferences using a dataset containing a sample of ER arrival counts.
Prediction of future data
One advantage of using a conjugate prior is that the predictive density for a future observation $\tilde{y}$ can be derived in closed form. If the current beliefs about λ are Gamma(α, β), then integrating the Poisson sampling density over these beliefs gives
$$
f(\tilde{y}) = \int f(\tilde{y} \mid \lambda)\, \pi(\lambda)\, d\lambda = \frac{\Gamma(\alpha + \tilde{y})}{\Gamma(\alpha)\, \tilde{y}!}\left(\frac{\beta}{\beta + 1}\right)^{\alpha}\left(\frac{1}{\beta + 1}\right)^{\tilde{y}}, \quad \tilde{y} = 0, 1, 2, \ldots, \tag{8.40}
$$
a negative binomial distribution.
In addition, the posterior distribution of λ also has the gamma form with updated parameters αn and βn. So Equation (8.40) also provides the posterior predictive distribution for a future count $\tilde{y}$, with α and β replaced by αn and βn.
For prediction purposes, there are several ways of summarizing the predictive distribution. One can use the formula in Equation (8.40) to directly compute predictive probabilities. Alternatively, one can simulate the predictive distribution in two steps: first simulate a rate λ from its gamma distribution, and then simulate the count $\tilde{y}$ from a Poisson distribution with the simulated rate.
8.8.4 Case study: Learning about website counts
Let’s return to the website example where one is interested in learning about the average weekday visits to a baseball analytics blog site. One observes the counts y1, …, y20 displayed in the “Mon”, “Tue”, “Wed”, “Thu”, “Fri” columns of Table 8.7. We assume the {yi} represent a random sample from a Poisson distribution with mean parameter λ.
Suppose one’s prior guess at the value of λ is 80 and one wishes to match this information with a Gamma(α, β) prior. Two helpful facts about the gamma distribution are that the mean and variance are equal to μ = α/β and σ2 = α/β2 = μ/β, respectively. Figure 8.9 displays three gamma curves for values (α, β) = (80, 1), (40, 0.5), and (20, 0.25). Each of these gamma curves has a mean of 80 and the curves become more diffuse as the parameter β moves from 1 to 0.25. After some thought, the user believes that the Gamma(80, 1) matches her prior beliefs. To check, she computes a prior probability interval. Using the qgamma()
function, she finds that her 90% prior probability interval is Prob(65.9 < λ < 95.3) = 0.90 and this appears to be a reasonable approximation to her prior beliefs.
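This prior check is a one-line computation:

qgamma(c(0.05, 0.95), shape = 80, rate = 1)
# approximately 65.9 and 95.3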
From the data, we compute the sum of the n = 20 weekday counts, $\sum_{i=1}^{20} y_i = 2120$. Applying Equation (8.39), the posterior distribution of λ is Gamma(80 + 2120, 1 + 20) = Gamma(2200, 21).
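A sketch of this update in R, entering the weekday counts from Table 8.7:

ys <- c(100, 111, 130, 113, 95,     # week 1: Mon, Tue, Wed, Thu, Fri
        96, 118, 120, 104, 92,      # week 2
        106, 91, 114, 110, 91,      # week 3
        96, 126, 119, 90, 98)       # week 4
alpha_n <- 80 + sum(ys)             # 80 + 2120 = 2200
beta_n <- 1 + length(ys)            # 1 + 20 = 21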
Figure 8.10 displays the gamma posterior curve for λ, together with a 90% probability interval found using the `qgamma()` function to be (101.1, 108.5). The interpretation is that the average number of weekday visits lies between 101.1 and 108.5 with probability 0.90.
Suppose the user is interested in predicting the number of blog visits $\tilde{y}$ on a future weekday. Following the two-step simulation approach, one draws λ from the Gamma(2200, 21) posterior and then draws $\tilde{y}$ from a Poisson distribution with that rate; repeating these steps many times produces a sample from the posterior predictive distribution.
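A minimal sketch of this two-step predictive simulation:

S <- 1000
lambda_sim <- rgamma(S, shape = 2200, rate = 21)
ytilde_sim <- rpois(S, lambda_sim)
quantile(ytilde_sim, c(0.05, 0.95))   # an approximate 90% prediction interval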
8.9 Exercises

1. Another Set of Federer's Time-to-Serve Measurements (Discrete Priors)
Suppose another set of thirty of Federer's time-to-serve measurements is collected, with an observed sample mean of 19 seconds. Assume the same discrete uniform prior on the values μ = 15, 16, …, 22.
(a) Assuming σ = 4, perform the Bayes' rule calculation to find the posterior distribution for μ.
(b) Using the posterior, find a "best" estimate of μ and an interval of values that contains μ with probability 0.5.
2. Temperature in Bismarck
Suppose one is interested in learning about the average January daily temperature (in degrees Fahrenheit) in Bismarck, North Dakota. One assumes that the daily temperature Y is normally distributed with mean μ and known standard deviation σ = 10. Suppose that one's prior is uniformly distributed over the values μ = 5, 10, 15, 20, 25, 30, 35. Suppose one observes the temperature for one January day to be 28 degrees. Find the posterior distribution of μ and compute the posterior probability that the mean is at least as large as 30 degrees.
3. Choosing A Normal Prior
(a) Suppose Sam believes that the 0.25 quantile of Federer's mean time-to-serve μ is 14 seconds and the 0.8 quantile is 21 seconds. Using the `normal.select()` function, construct a normal prior distribution to match this belief.
(b) Suppose Sam also believes that the 0.10 quantile of his prior is equal to 10.5 seconds. Is this statement consistent with the normal prior chosen in part (a)? If not, how could you adjust the prior to reconcile this statement about the 0.10 quantile?
4. Choosing A Normal Prior
Another way of choosing a normal prior for Federer's mean time-to-serve μ is to specify statements about the prior predictive distribution for a future time-to-serve measurement $\tilde{y}$.
(a) Suppose your best guess at
(b) Suppose instead that you are 90% confident that the future time-to-serve is between 18 and 24 seconds. Find the normal prior for μ that matches this prior belief.
5. Bayesian Hypothesis Testing
The posterior distribution for the mean time-to-serve μ for Federer is normal with mean 17.4 seconds and standard deviation 0.77 seconds.
(a) Using this posterior, evaluate the plausibility of the statement "Federer's mean time-to-serve is at least 16.5 seconds."
(b) Is it reasonable to say that Federer's mean time-to-serve falls between 17 and 18 seconds? Explain.
6. Bayesian Credible Interval
The posterior distribution for the mean time-to-serve μ for Federer is normal with mean 17.4 seconds and standard deviation 0.77 seconds.
(a) Construct a central 98% credible interval for μ.
(b) Can you use the credible interval to test the hypothesis "Federer's mean time-to-serve is 16.5 seconds"? Explain.
7. Posterior Predictive Distribution
Write an R script to generate S = 1000 predictions of a single time-to-serve of Federer based on the posterior predictive distribution using the results given in Equation (8.31) and Equation (8.32).
(a) Compare the exact posterior predictive distribution (Equation (8.30)) with the density estimate of the simulated predictions.
(b) Construct a 90% prediction interval for the future time-to-serve.
8. Posterior Predictive Checking
The posterior predictive distribution can be used to check the suitability of the normal sampling and normal prior model for Federer's time-to-serve data. The function `post_pred_check()` below simulates samples of n = 20 from the posterior predictive distribution, and for each sample computes a value of the checking function T.
post_pred_check <- function(test_function){
  mu_n <- 17.4
  sigma_n <- 0.77
  sigma <- 4
  n <- 20
  one_sim <- function(){
    # simulate mu from the posterior, then apply the checking
    # function to a replicated sample of n times
    mu <- rnorm(1, mu_n, sigma_n)
    test_function(rnorm(n, mu, sigma))
  }
  replicate(1000, one_sim())
}
The output of the function is 1000 draws from the posterior predictive distribution of T. If the checking function is T(y) = max(y), then one obtains 1000 draws from the posterior predictive distribution of the maximum by typing
post_pred_check(max)
If the value of the checking function on the observed times-to-serve, T(y), is unusual relative to this posterior predictive distribution of T, this would cast doubt on the model. The observed times-to-serve for Federer are displayed in Section 8.3.1 and repeated below.
15.1 11.8 21.0 22.7 18.6 16.2 11.1 13.2 20.4 19.2 21.2 14.3 18.6 16.8 20.3 19.9 15.0 13.4 19.9 15.3
(a) Use the function `post_pred_check()` with the checking function T(y) = max(y) to check the suitability of the Bayesian model.
(b) Use the function `post_pred_check()` with the checking function T(y) = sd(y) to check the suitability of the Bayesian model.
9. Taxi Cab Fares
Suppose a city manager is interested in learning about the mean fare μ for taxi cabs in New York City.
(a) Suppose the manager believes that μ is smaller than $8 with probability 0.25, and that μ is smaller than $12 with probability 0.75. Find a normal prior that matches this prior information.
(b) The manager reviews 20 fares and observes the values (in dollars): 7.5, 8.5, 9.5, 6.5, 7.0, 6.0, 7.0, 16.0, 8.0, 8.5, 9.5, 13.5, 4.5, 8.5, 7.5, 13.0, 6.5, 9.5, 21.0, 6.5. Assuming these fares are normally distributed with mean μ and standard deviation σ = 4, find the posterior distribution for the mean μ.
(c) Construct a 90% interval estimate for the mean fare μ.
10. Taxi Cab Fares (continued)
Suppose that a visitor to New York City has little knowledge about the mean taxi cab fare.
(a) Construct a weakly informative prior for μ.
(b) Use the data from Exercise 9 to compute the posterior distribution for the mean fare.
(c) Construct a 90% interval estimate for the mean fare and compare your interval with the interval computed in Exercise 9 using an informative prior.
11. Taxi Cab Fares (continued)
(a)In Exercise 9, one finds the posterior distribution for the mean fare μ. Write an R function to simulate a sample of twenty fares from the posterior predictive distribution.
(b)Looking at the observed data, one sees an unusually large fare of $21. To see if this fare is unusual for our model, first revise your function in part (a) to simulate the maximum fare of a sample of twenty fares from the posterior predictive distribution. Then repeat this process 1000 times, collecting the maximum fares for 1000 predictive samples.
(c)Construct a graph of the maximum fares. Is the fare of $21 large relative to the predictive distribution of maximum fares?
(d)Based on your answer to part (c), what can you say about the suitability of our model?
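A sketch of parts (b) through (d), reusing the fares, mu_n, and tau_n objects from the sketch following Exercise 9:
max_fares <- replicate(1000, {
  mu <- rnorm(1, mu_n, tau_n)   # draw mu from its posterior
  max(rnorm(20, mu, 4))         # maximum of a predictive sample of 20 fares
})
hist(max_fares)                 # part (c): graph of the predictive maxima
abline(v = max(fares))          # locate the observed $21 fare
mean(max_fares >= max(fares))   # how often predictions reach the observed maximum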
Student Sleeping Times
How many hours do college students sleep, on average? Recently, some students in an introductory class were asked when they went to bed and when they woke the following morning. The following random sample of 14 sleeping times (in hours) was recorded: 9.0, 7.5, 7.0, 8.0, 5.0, 6.5, 8.5, 7.0, 9.0, 7.0, 5.5, 6.0, 8.5, 7.5. Assume that these measurements follow a normal sampling distribution with mean μ and standard deviation σ, where σ = 1.5 is given.
(a)Suppose that John believes a priori that the mean amount of sleep μ is normal with mean 8 hours and standard deviation 1 hour. Find the posterior distribution of μ.
(b)Construct a 90% interval estimate for the mean μ.
(c)Let
Student Sleeping Times (continued)
Suppose two other people are interested in learning about the mean sleeping time of college students. Mary’s prior is normal with mean 8 hours and standard deviation 0.1 hours: she is quite confident that the mean sleeping time is close to 8 hours. In contrast, Larry is very uncertain about the location of μ and assigns a normal prior with mean 8 hours and standard deviation 3 hours.
(a)Find the posterior distributions of μ using Mary’s prior and using Larry’s prior.
(b)Construct 90% interval estimates for μ using Mary’s and Larry’s priors.
(c)Compare these interval estimates with the interval estimate constructed in Exercise 12(b). Is the location of the interval estimate sensitive to the choice of prior? If so, explain the differences.
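A sketch covering the conjugate updates in Exercises 12 and 13, wrapping the normal-normal formulas in a small helper so the three priors (John’s, Mary’s, and Larry’s) can be compared directly:
sleep <- c(9.0, 7.5, 7.0, 8.0, 5.0, 6.5, 8.5, 7.0, 9.0, 7.0, 5.5, 6.0, 8.5, 7.5)
update_normal <- function(mu0, tau0, y, sigma = 1.5){
  tau_n <- sqrt(1 / (1 / tau0^2 + length(y) / sigma^2))  # posterior sd
  mu_n <- tau_n^2 * (mu0 / tau0^2 + sum(y) / sigma^2)    # posterior mean
  qnorm(c(0.05, 0.95), mu_n, tau_n)                      # 90% interval estimate
}
update_normal(8, 1, sleep)     # John's prior
update_normal(8, 0.1, sleep)   # Mary's prior
update_normal(8, 3, sleep)     # Larry's prior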
Comparing Two Means - IQ Tests on School Children
Do teachers’ expectations impact the academic development of children? To find out, researchers gave an IQ test to a group of 12 elementary school children. They randomly picked six children and told the teachers that the test predicted these students to have high potential for accelerated growth (the accelerated group); for the other six students, the researchers told the teachers that the test predicted no potential for growth (the no growth group). At the end of the school year, they gave IQ tests again to all 12 students, and the change in IQ score of each student was recorded. Table 8.8 shows the IQ score changes of the students in the accelerated group and the no growth group.
TABLE 8.8
Data from IQ score change of 12 students; 6 are in the accelerated group, and 6 are in the no growth group.
Group         IQ score change
Accelerated   20, 10, 19, 15, 9, 18
No growth     3, 2, 6, 10, 11, 5
The sample means of the accelerated group and the no growth group are respectively ȳA = (20 + 10 + 19 + 15 + 9 + 18)/6 ≈ 15.17 and ȳN = (3 + 2 + 6 + 10 + 11 + 5)/6 ≈ 6.17, where nA = nN = 6.
(a)Assuming independent sampling, write down the likelihood function of the means (μA, μN).
(b)Assume that one’s prior beliefs about μA and μN are independent, where μA ∼ Normal(γA, τA) and μN ∼ Normal(γN, τN). Show that the posterior distributions for μA and μN are independent normal and find the mean and standard deviation parameters for each distribution.
Comparing Two Means - IQ Tests on School Children (continued)
In Exercise 14, you should have established that the mean IQ score changes μA and μN have independent normal posterior distributions. Assume that one has vague prior beliefs and μA ∼ Normal(0, 20) and μN ∼ Normal(0, 20).
(a)Is the average improvement for the accelerated group larger than that for the no growth group? Consider the parameter δ = μA − μN, which measures the difference in means. The question then becomes one of finding the posterior probability of δ > 0, i.e. p(μA − μN > 0 | yA, yN), where yA and yN are the vectors of recorded IQ score changes. [Hint: simulate a vector sA of posterior samples of μA and a vector sN of posterior samples of μN of the same length, and subtract sN from sA elementwise; the resulting vector of differences approximates the posterior distribution of δ. A starting sketch follows this exercise.]
(b)What is the probability that a randomly selected child assigned to the accelerated group will have a larger improvement than a randomly selected child assigned to the no growth group? Consider the difference between two posterior predictive draws, one from each group, and approximate the probability that this difference is positive by simulation.
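A sketch of this exercise. The sampling standard deviation is taken as known, as in Exercise 14; the value sigma = 5 below is only a placeholder, and the priors are the vague Normal(0, 20) priors stated above.
yA <- c(20, 10, 19, 15, 9, 18)
yN <- c(3, 2, 6, 10, 11, 5)
sigma <- 5                      # placeholder value for the known sampling sd
post_par <- function(y, mu0 = 0, tau0 = 20){
  tau_n <- sqrt(1 / (1 / tau0^2 + length(y) / sigma^2))
  c(tau_n^2 * (mu0 / tau0^2 + sum(y) / sigma^2), tau_n)  # posterior (mean, sd)
}
pA <- post_par(yA); pN <- post_par(yN)
S <- 10000
sA <- rnorm(S, pA[1], pA[2])    # posterior draws of mu_A
sN <- rnorm(S, pN[1], pN[2])    # posterior draws of mu_N
mean(sA - sN > 0)               # part (a): approximates p(mu_A - mu_N > 0 | yA, yN)
yA_pred <- rnorm(S, sA, sigma)  # predictive draws for a new accelerated child
yN_pred <- rnorm(S, sN, sigma)  # predictive draws for a new no growth child
mean(yA_pred > yN_pred)         # part (b): P(new accelerated > new no growth)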
Comparing Two Means - Prices of Diamonds
Weights of diamonds are measured in carats. The difference between the size of a 0.99 carat diamond and a 1 carat diamond is most likely undetectable to the naked human eye, but the price of a 1 carat diamond tends to be much higher than the price of a 0.99 carat diamond. To find out whether this is truly the case, data on point prices (the prices of 0.99 carat diamonds divided by 99, and the prices of 1 carat diamonds divided by 100) were collected for n99 = 23 diamonds of 0.99 carats and n100 = 25 diamonds of 1 carat, and stored in the files pt99price.csv and pt100price.csv.
(a)Explore the two datasets by making plots and computing summary statistics. What are the findings?
(b)Consider independent normal sampling models for these datasets with a fixed and known value of the standard deviation. From your exploratory work, choose a value for the standard deviation.
(c)Choose appropriate weakly informative prior distributions, and use posterior simulation to answer whether the average point price of the 1 carat diamonds is higher than that of the 0.99 carat diamonds.
(d)Perform posterior predictive checks of the Bayesian inferences obtained in part (c).
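A starting sketch for part (a), assuming (hypothetically) that each CSV file contains a single column of point prices named price:
pt99 <- read.csv("pt99price.csv")      # 0.99 carat point prices
pt100 <- read.csv("pt100price.csv")    # 1 carat point prices
summary(pt99$price); sd(pt99$price)    # "price" is an assumed column name
summary(pt100$price); sd(pt100$price)
boxplot(list("0.99 carat" = pt99$price, "1 carat" = pt100$price))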
Gamma-Poisson Conjugacy Derivation
Section 8.8.3 presents the Bayesian update results for Poisson sampling with the use of the gamma conjugate prior.
(a)Verify the equation for the likelihood in Equation (8.37). [Hint: the likelihood is the joint sampling density of n i.i.d. Poisson distributed random variables.]
(b)Assuming that the Poisson parameter λ has a gamma prior with shape α and rate β, show that the posterior distribution of λ has a gamma functional form and find the parameters of this gamma distribution.
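As a reference for part (b), a sketch of the conjugate update: multiplying the Poisson likelihood by the Gamma(α, β) prior density and dropping all factors that do not involve λ gives
$$
\pi(\lambda \mid y_1,\dots,y_n)\;\propto\;\Bigg(\prod_{i=1}^{n}\lambda^{y_i}e^{-\lambda}\Bigg)\,\lambda^{\alpha-1}e^{-\beta\lambda}\;=\;\lambda^{\alpha+\sum_i y_i-1}\,e^{-(\beta+n)\lambda},
$$
which is the kernel of a Gamma(α + Σi yi, β + n) density; this is the form to verify in part (b).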
The Number of ER Visits: the Prior
Suppose two people, Pedro and Mia, have different prior beliefs about the average number of ER visits during the 10 pm - 11 pm time period. Pedro’s prior information is matched to a gamma distribution with parameters α = 70 and β = 10, and Mia’s beliefs are matched to a gamma distribution with α = 33.3 and β = 3.3. The two gamma priors are displayed in Figure 8.12.
(a)Compare the priors of Pedro and Mia with respect to average value and spread. Which person believes that there will be more ER visits, on average? Which person is more confident of his or her best guess at the average number of ER visits?
(b)Using the qgamma() function, construct 90% interval estimates for λ using Pedro’s prior and Mia’s prior.
(c)After some thought, Pedro decides that his best prior guess at λ is correct, but he is now less confident in this guess. Explain how Pedro can adjust the parameters of his gamma prior to reflect this new prior belief.
(d)Mia also revisits her prior. Her best guess at the average number of ER visits is now 3 larger than her previous best guess, but the degree of confidence in this guess hasn’t changed. Explain how Mia can adjust the parameters of her gamma prior to reflect this new prior belief.
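A minimal sketch of part (b), using qgamma() for the two priors:
qgamma(c(0.05, 0.95), shape = 70, rate = 10)     # Pedro's 90% interval
qgamma(c(0.05, 0.95), shape = 33.3, rate = 3.3)  # Mia's 90% interval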
The Number of ER Visits
A hospital collects the number of patients in the emergency room (ER) admitted between 10 pm and 11 pm for each day of a week. Table 8.9 records the day and the number of ER visits for the given day.
TABLE 8.9
Data for ER visits in a given week.
Day         Number of ER visits
Sunday      8
Monday      6
Tuesday     6
Wednesday   9
Thursday    8
Friday      9
Saturday    7
Suppose one assumes Poisson sampling for the counts, and a conjugate gamma prior with parameters α = 70 and β = 10 for the Poisson rate parameter λ.
(a)Given the sample shown in Table 8.9, obtain the posterior distribution for λ through the gamma-Poisson conjugacy. Obtain a 95% posterior credible interval for λ.
(b)Suppose a hospital administrator states that the average number of ER visits during any evening hour does not exceed 6. By computing a posterior probability, evaluate the validity of the administrator’s statement.
(c)The hospital is interested in predicting the number of ER visits between 10 pm and 11 pm for another week. Use simulations to generate posterior predictions of the number of ER visits for another week (seven days).
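A sketch of all three parts, using the gamma-Poisson conjugacy from Exercise 17 (posterior shape α + Σ yi and posterior rate β + n):
visits <- c(8, 6, 6, 9, 8, 9, 7)
alpha_n <- 70 + sum(visits)                 # posterior shape
beta_n <- 10 + length(visits)               # posterior rate
qgamma(c(0.025, 0.975), alpha_n, beta_n)    # part (a): 95% credible interval
pgamma(6, alpha_n, beta_n)                  # part (b): P(lambda <= 6 | data)
lambda <- rgamma(1, alpha_n, beta_n)        # part (c): one posterior draw of lambda
rpois(7, lambda)                            # predicted ER visits for a new week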
Times Between Traffic Accidents
The exponential distribution is often used as a model to describe the time between events, such as traffic accidents. A random variable Y has an Exponential distribution if its pdf is as follows.
f(y | λ) = λ exp(−λy), y > 0.   (8.41)
Here, the parameter λ > 0 is interpreted as the rate of event occurrences; this is a one-parameter model.
(a)The gamma distribution is a conjugate prior distribution for the rate parameter λ in the Exponential data model. Use the prior distribution λ ∼ Gamma(a, b), and find the posterior distribution π(λ | y1, …, yn), where y1, …, yn are i.i.d. observations from the Exponential(λ) model.
(b)Suppose 10 times between traffic accidents are collected: 1.5, 15, 60.3, 30.5, 2.8, 56.4, 27, 6.4, 110.7, 25.4 (in minutes). With the posterior distribution derived in part (a), use Monte Carlo approximation to calculate the posterior mean, median, and a middle 95% credible interval for the rate λ. [Hint: choose the appropriate R functions from dgamma(), pgamma(), qgamma(), and rgamma().]
(c)Use Monte Carlo approximation to generate another set of 10 predicted times between events. [Hint: rexp() generates random draws from an Exponential distribution.]
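A sketch of parts (b) and (c), using the conjugate result from part (a) (posterior Gamma(a + n, b + Σ yi)); the prior values a = 1 and b = 1 are placeholders for whatever Gamma(a, b) prior is chosen:
times <- c(1.5, 15, 60.3, 30.5, 2.8, 56.4, 27, 6.4, 110.7, 25.4)
a <- 1; b <- 1                                   # placeholder prior parameters
draws <- rgamma(10000, a + length(times), b + sum(times))
mean(draws); median(draws)                       # posterior mean and median
quantile(draws, c(0.025, 0.975))                 # middle 95% credible interval
lam <- rgamma(1, a + length(times), b + sum(times))
rexp(10, lam)                                    # part (c): 10 predicted times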
Modeling Survival Times
The Weibull distribution is often used as a model for survival times in biomedical, demographic, and engineering analyses. A random variable Y has a Weibull distribution if its pdf is as follows.
f(y | α, λ) = αλ y^(α−1) exp(−λy^α), y > 0.   (8.42)
Here, α > 0 and λ > 0 are parameters of the distribution. For this problem, assume that α = α0 is known but λ is not, i.e. a simplified case of a one-parameter model. Also assume that software routines for simulating from Weibull distributions are available (e.g. rweibull()).
(a)Assuming a prior distribution π(λ | α = α0) ∝ 1, find the posterior distribution π(λ | y1, …, yn, α = α0).
(b)Using the posterior distribution derived in part (a), explain step-by-step how you would use Monte Carlo simulation to approximate the posterior median survival time, assuming that α = α0.
(c)What family of distributions forms the conjugate prior for λ, assuming that α = α0?
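A sketch of part (b) under the parameterization in Equation (8.42), for which the flat prior yields λ | y ∼ Gamma(n + 1, Σ yi^α0); the data and the value of α0 below are hypothetical:
alpha0 <- 1.5                                 # assumed known shape
y <- rweibull(50, shape = alpha0, scale = 2)  # hypothetical survival times
lam_draws <- rgamma(10000, length(y) + 1, sum(y^alpha0))
# Under (8.42), the survival function is exp(-lambda * t^alpha0), so the
# median survival time for a given lambda is (log(2) / lambda)^(1 / alpha0).
med_draws <- (log(2) / lam_draws)^(1 / alpha0)
median(med_draws)                             # Monte Carlo summary of the median survival time
quantile(med_draws, c(0.025, 0.975))          # interval summary of the same quantity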