Chapter Four

Estimation and Fitting

Before embarking on a computationally intensive Monte Carlo study, we should be reasonably confident that we are feeding our simulation tool with sensible input. In the previous chapter we considered different families of models that can be used as the key ingredient to build a Monte Carlo simulator. Here we consider the quantitative side of the coin: After selecting a model structure, how should we estimate its parameters? Hence, we step into the domain of inferential statistics, whereby, given a sample of observed data, we engage in increasingly difficult tasks:

Finding point and interval estimates of basic moments like expected value and variance
Estimating the parameters of a possibly complicated probability distribution
Estimating the parameters of a time series model

The first task is typically associated with elementary concepts like confidence intervals, which may be somewhat dangerous and misleading for the newcomer. Computing a confidence interval for the mean of a normal population looks so easy that it is tempting to disregard the pitfalls involved. We will have more to say on this in Chapter 7, where we deal with output analysis. In fact, simulation input and output analysis do share some statistical background.1 If the output of a Monte Carlo run is used to estimate a probability or a quantile, it may be tempting to apply the same recipes that we adopt to estimate means, which may lead to unreasonable results. Moreover, a probability distribution is not really characterized by expected value and variance. This is the case with a normal distribution N(μ, σ2), as its parameters have that obvious interpretation, but it is not, e.g., with a gamma distribution. Hence, we need more refined strategies, like moment matching or maximum-likelihood estimation. Maximum likelihood is a powerful principle, which can also be used to estimate the parameters of complicated models, including time series.

We must also be aware that parameter estimation should not be our only concern. Apart from fitting parameters, we should also try to check the fit of the overall uncertainty model, which involves nonparametric issues. If we assume that a variable of interest is normally distributed, we should check if this is a defensible, even though debatable, assumption or sheer nonsense. Testing for normality is an example of a nonparametric test. More generally, we may need to compare the fit of alternative distributions, measuring the overall goodness of fit. What is relevant or not depends on the kind of application. For instance, it is possible to build two very different probability distributions having the same expected value, variance, skewness, and kurtosis. Thus, the first few moments do not tell the whole story. This may not be quite relevant if we are just dealing with mean–variance portfolio optimization, but it may be extremely relevant in dealing with extreme risk.

Most books on Monte Carlo methods for finance and economics do not include a chapter on these topics, which is fine if one assumes that readers have a suitable background in econometrics. Other books, such as [10], are aimed at industrial engineers and include an extensive chapter on input analysis. Since this is an introductory book also aimed at engineers and mathematicians, I have deliberately chosen to include such a chapter as well.2 To make the following treatment as useful as possible to a wide and diversified audience, we will start with really elementary topics, and then we will move on to possibly less standard ones, such as estimation of ARMA and GARCH models, as well as Bayesian parameter estimation. Furthermore, we will use R throughout, so that readers familiar with the underlying concepts, but unfamiliar with R functionalities, can take advantage of the examples. Since it is my firm conviction that understanding the underlying unity of concepts is fundamental for sound and deep learning, it is also important to stress the links of parameter estimation and model fitting with other topics dealt with in this book:

Some nonstandard optimization methods, which we deal with in Chapter 10, may be needed to estimate parameters or calibrate models, when the resulting optimization model is nonconvex and badly behaved.
The Kolmogorov–Smirnov test is related to discrepancy concepts, which are at the basis of the low-discrepancy sequences that are discussed in Chapter 9.
Bayesian estimation is related to Markov chain Monte Carlo methods, illustrated in Chapter 14.
Linear regression is used in some computational methods for dynamic stochastic optimization, which is in turn fundamental to price American-style options with high dimensionality, as we show in Chapter 11.
A sound understanding of how confidence intervals arise is important to avoid subtle mistakes in output analysis, as we will stress in Chapter 7.

We start in Section 4.1 with a brief refresher on elementary inferential statistics, i.e., confidence intervals and hypothesis testing for the expected value; there, we also outline correlation testing. Then we move on to systematic strategies for parameter estimation in Section 4.2, where we consider moment matching and maximum likelihood. We outline nonparametric tests and checking goodness-of-fit in Section 4.3. We offer a brief refresher on ordinary least-squares methods to estimate linear regression models in Section 4.4, followed by the estimation of time series models in Section 4.5. We close the chapter with Section 4.6, where we compare the orthodox approach to inferential statistics and parameter estimation with the Bayesian view.

The newcomers will certainly realize that model estimation is a huge body of statistical knowledge, which we cannot cover in any satisfactory way in a short chapter. In particular, estimating time series models may require some deep knowledge, as is also the case when estimating a copula. We do not deal with copula estimation here, but later in Section 5.7; there we illustrate how sampling from a copula may be achieved by the R package copula, which also includes a function fitCopula for fitting purposes. We do not go into details, but we stress that the principle of parameter estimation by maximum likelihood, which we do outline in this chapter, is the theoretical underpinning of many of these estimation approaches.

4.1 Basic inferential statistics in R

In this section we deal with two very basic and related topics: confidence intervals for the mean and elementary hypothesis testing. Our aim is to review the foundations of these statistical tools, in order to get acquainted with the issues involved in more complicated cases, such as testing the significance of correlations, etc. Most readers could probably skip this section, or possibly just skim through it to see which functions R offers to accomplish these tasks. Not surprisingly, given their conceptual links, the relevant R functions boil down to one, t.test.

4.1.1 CONFIDENCE INTERVALS

Most of us are introduced to inferential statistics through the calculation of a confidence interval for an expected value μ. This concept is relevant on both ends of a Monte Carlo simulation, as it is used to define the input random variables as well as to analyze the output. Given a sample Xi, i = 1, …, n, of i.i.d. (independent and identically distributed) random variables, the drill is as follows:

1. Compute sample statistics such as sample mean and sample variance:

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n \left( X_i - \bar{X} \right)^2 \]

2. Choose a confidence level (1 − α) and pick the corresponding quantile tn−1, 1−α/2 from a t distribution with n − 1 degrees of freedom.
3. Calculate the confidence interval

(4.1) \[ \left( \bar{X} - t_{n-1,\,1-\alpha/2}\, \frac{S}{\sqrt{n}},\;\; \bar{X} + t_{n-1,\,1-\alpha/2}\, \frac{S}{\sqrt{n}} \right) \]
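As a concrete illustration of the drill, the interval of Eq. (4.1) may be computed step by step in R; the sample below is hypothetical, and the t.test function discussed later wraps the same computation in a single call:

set.seed(1)
X <- rnorm(20, mean = 10, sd = 20)     # hypothetical sample
n <- length(X)
xbar <- mean(X)                        # sample mean
S <- sd(X)                             # sample standard deviation (divides by n-1)
alpha <- 0.05                          # for a 95% confidence level
tq <- qt(1 - alpha/2, df = n - 1)      # quantile t_{n-1,1-alpha/2}
c(xbar - tq*S/sqrt(n), xbar + tq*S/sqrt(n))   # confidence interval of Eq. (4.1)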

This procedure is so easy to perform that one tends to forget that it relies on some important assumptions. The following observations are in order:

Strictly speaking, the above procedure is correct for normal random variables only. In fact, if the variables Xi ~ N(μ, σ²) are independent, then the following standardized statistic is standard normal:

(4.2) \[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]

If we replace σ by its sample counterpart S we find a Student t distribution:

(4.3) \[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \]

which implies

\[ P\left\{ -t_{n-1,\,1-\alpha/2} \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{n-1,\,1-\alpha/2} \right\} = 1 - \alpha \]

By rearranging this relationship we obtain the confidence interval of Eq. (4.1). A large part of inferential statistics relies on similar distributional results. If we apply the procedure to a different distribution, what we find is at best a good approximation for a suitably large sample; with a small sample and a skewed distribution, we should repeat the drill accounting for the specific features of that distribution.
It is also very important to stress the role of independence. It is independence in the sample that allows us to write3

\[ \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} \]

When estimating sample variance S2, we recall that division by n − 1 is required in order to obtain an unbiased estimator, i.e., to make sure that

\[ E[S^2] = \sigma^2 \]

Unbiasedness, as we insist upon later, is a basic requirement of a sensible estimator.
When analyzing the output of a Monte Carlo simulation, the sample size is usually rather large. Hence, we usually replace quantiles of the t distribution with quantiles z1−α/2 from the standard normal. However, this need not apply to input analysis.

Another danger is a common misinterpretation of confidence intervals. After calculating a confidence interval (l, u), we cannot say that

(4.4) \[ P\{ l \le \mu \le u \} = 1 - \alpha \]

This statement does not make any sense, as all of the involved quantities are numbers. In the orthodox framework of statistics, μ is an unknown number; the extreme points of the confidence interval, l and u, are numerical realizations of random variables L and U. Hence, the above probabilistic statement makes no sense. We should rephrase the claim of Eq. (4.4) with reference to the random variables L and U, i.e., the lower and upper bound of the interval before sampling. Once again: We have to realize that parameters are unknown numbers in orthodox statistics, not random variables.4 This is illustrated empirically and clarified below by a simple Monte Carlo experiment.

Confidence intervals in R

Building standard confidence intervals in R is accomplished by the function t.test(x, conf.level = 0.95). The function name may look weird at first, but it is justified by the fact that the calculations needed to compute the confidence interval are related to those needed to run a hypothesis test, as we show later. The function receives a vector containing the sample and an optional parameter specifying the confidence level, which is 95% by default:

> set.seed(55555)
> X = rnorm(20,mean=10,sd=20)
> t.test(X)
One Sample t-test
data: X
t = 1.1709, df = 19, p-value = 0.2561
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-3.960321 14.017022
sample estimates:
mean of x
5.02835

The above output is a bit confusing, as it is cluttered by irrelevant information about a hypothesis test. It might be better to collect the output in a variable and then inspect what we are really interested in:

> ci <- t.test(X,conf.level = 0.99)
> ci$estimate
mean of x
5.02835
> ci$conf.int
[1] -7.258179 17.314880
attr(,"conf.level")
[1] 0.99

As expected, the 99% interval is larger than the 95% interval, and both are rather large due to the small sample size and the large standard deviation with respect to the expected value.

Armed with the above function, we may check the empirical coverage of a confidence interval, i.e., the probability that it contains the true expected value. To this aim, it is instructive to compare the result with a normal and an exponential distribution of similar variability.5 By running the code in Fig. 4.1, we obtain the following results:

FIGURE 4.1 Checking the coverage of confidence intervals empirically.

> coverageExp
[1] 0.9208
> coverageNorm
[1] 0.95

As expected, coverage with a skewed distribution is not exactly as specified.
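A coverage check like the one in Fig. 4.1 may be sketched as follows; the number of replications, the sample size, and the distribution parameters are assumptions:

set.seed(55555)
numRepl <- 10000                 # number of replications
n <- 20                          # sample size for each interval
mu <- 1                          # common expected value of the two distributions
checkNorm <- logical(numRepl)
checkExp <- logical(numRepl)
for (k in 1:numRepl) {
    ciN <- t.test(rnorm(n, mean = mu, sd = 1))$conf.int
    ciE <- t.test(rexp(n, rate = 1/mu))$conf.int   # exponential with mean mu
    checkNorm[k] <- (ciN[1] <= mu) && (mu <= ciN[2])
    checkExp[k] <- (ciE[1] <= mu) && (mu <= ciE[2])
}
coverageNorm <- mean(checkNorm)
coverageExp <- mean(checkExp)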

In this section we have only considered confidence intervals for the expected value of a random variable. Different confidence intervals should be used, e.g., when we are faced with the task of estimating variances or probabilities. Another relevant and quite nontrivial case is the estimation of quantiles. Since this kind of application, as far as simulation is concerned, is more related to output analysis, we defer an extended treatment of confidence intervals to Section 7.3.

4.1.2 HYPOTHESIS TESTING

The basic hypothesis test one may wish to run concerns the expected value:

We test the null hypothesis H0: μ = μ0, for a given μ0,
against the alternative hypothesis Ha: μ ≠ μ0.

In the normal case, we rely on the distributional result of Eq. (4.3), where the unknown expected value μ is replaced by the hypothesized value μ0. This shows that, if the null hypothesis is indeed true, then

\[ P\left\{ -t_{n-1,\,1-\alpha/2} \le \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \le t_{n-1,\,1-\alpha/2} \right\} = 1 - \alpha \]

In other words, the standardized test statistic

\[ T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \]

if the null hypothesis is true, has Student’s t distribution with n − 1 degrees of freedom, and it should fall within bounds corresponding to quantiles. If T falls outside that interval, there are two possible explanations: It may just be bad luck, or maybe the null hypothesis is wrong. We cannot be sure of either, and we may make two types of error: We may reject a true hypothesis, or we may accept a false one. The elementary approach is conservative and keeps the probability of rejecting a true null hypothesis under control. Therefore, we form a rejection region consisting of two tails,

\[ RJ = \left( -\infty,\, -t_{n-1,\,1-\alpha/2} \right) \cup \left( t_{n-1,\,1-\alpha/2},\, +\infty \right) \]

and we reject the null hypothesis if the test statistic T ∈ RJ. Here α plays the role of a significance level or, better said, the probability of rejecting the null hypothesis if it is true.6 The rejection region consists of a single tail if the alternative hypothesis is one-sided:

\[ RJ = \left( t_{n-1,\,1-\alpha},\, +\infty \right) \quad \text{or} \quad RJ = \left( -\infty,\, -t_{n-1,\,1-\alpha} \right) \]

Note that if we increase α we enlarge the rejection region, which implies that, all other factors being equal, it is easier to reject the null hypothesis, possibly making a mistake. Statistical folklore dictates that 5% is a sensible choice for the significance level, but this is at best a rule of thumb.

Hypothesis testing is easily accomplished by the t.test function:

> set.seed(55555)
> X <- rnorm(10,10,20)
> t.test(X,mu=20)
 
        One Sample t-test
 
data: X
t = −1.4468, df = 9, p-value = 0.1819
alternative hypothesis: true mean is not equal to 20
95 percent confidence interval:
-6.881385 25.909259
sample estimates:
mean of x
9.513937

In the above snapshot we sample a normal distribution with expected value 10 and test the hypothesis that the expected value is 20. By sheer luck, the sample mean is 9.513937, rather close to the true value, but the confidence interval is pretty large, reflecting the considerable variability due to the large standard deviation. Now, should we accept the null hypothesis or not? One trick is to observe that 20 is within the confidence interval, so we cannot rule out this value. This way of thinking works here, but it should be discouraged, as it does not work well in general, especially with one-tailed testing.

The fundamental output of the procedure is the p-value, which is 0.1819. This is the probability of observing, by pure randomness, a test statistic at least as extreme (in absolute value) as the one we actually got, if the true expected value were 20. We may compare this p-value against a significance level of 5%: Since the p-value is larger, we do not feel at all confident in rejecting the null hypothesis. In this case, we do know that the null hypothesis is wrong, since the true expected value is 10, but from a practical point of view we do not have strong enough evidence. Also note that we cannot really say that we “accept” the null hypothesis, as the sample statistic is far from 20; what we can say is that we “fail to reject it.”

We may easily change the alternative hypothesis and run one-tailed tests:

> t.test(X,mu=20,alternative='less')
        One Sample t-test
data: X
t = −1.4468, df = 9, p-value = 0.09093
alternative hypothesis: true mean is less than 20
95 percent confidence interval:
    -Inf 22.7997
sample estimates:
mean of x
9.513937
> t.test(X,mu=20,alternative='greater')
        One Sample t-test
data: X
t = −1.4468, df = 9, p-value = 0.9091
alternative hypothesis: true mean is greater than 20
95 percent confidence interval:
-3.771823 Inf
sample estimates:
mean of x
9.513937

The newcomer is strongly invited to compare the two p-values and to interpret their difference.

We may run a host of hypothesis tests in R, which include:

Testing the difference in the expected values of two large independent samples: t.test(X,Y)
Testing the difference in the expected values of two paired samples: t.test(X,Y,paired=TRUE)
Testing the significance of correlation: cor.test(X,Y), discussed later in Section 4.1.3

The second test is relevant, for instance, in evaluating the performance of two systems or two strategies on a set of simulated scenarios. In such a case, it is fair and makes sense to compare the results on the same set of scenarios and to check the differences; hence, the observations in the sample are not independent and should be paired to check if the observed difference is statistically significant.

Unlike trivial tests on the mean of a normal population, other testing procedures may rely on sophisticated distributional results, but they more or less follow the same drill:

We select some suitable statistic.
We derive distributional results about that statistic, assuming the null hypothesis is true. These results may rely on some underlying probability distribution and are often approximations valid for large samples.
Based on these distributional results, the rejection region and the corresponding p-values are determined. Monte Carlo sampling is also used to this purpose, when an approximation is not available, or it is feared that it is a poor one for a limited sample size.

We need not be concerned too much with the theoretical underpinnings of each testing procedure, but we should be aware of the underlying assumptions and the resulting limitations in terms of applicability; then, what we have to look for is the p-value.

4.1.3 CORRELATION TESTING

Testing the significance of correlation is an important task in statistics, and it may also be a useful tool in preparing and analyzing Monte Carlo experiments. A starting point is the following formula for sample covariance:

(4.5) \[ S_{XY} = \frac{1}{n-1} \sum_{i=1}^n \left( X_i - \bar{X} \right)\left( Y_i - \bar{Y} \right) \]

where n is the sample size, i.e., the number of observed pairs (Xi, Yi). The above formula is a straightforward sample counterpart of covariance and can also be rewritten as follows:

\[ S_{XY} = \frac{1}{n-1} \left( \sum_{i=1}^n X_i Y_i - n\, \bar{X}\, \bar{Y} \right) \]

In passing, we note that when estimating a large covariance matrix, plenty of data are needed to obtain reliable estimates. Such a luxury cannot be afforded quite often, and this motivates the factor models discussed in Section 3.8.2.

To estimate the correlation coefficient ρXY between X and Y, we may just plug sample covariance SXY and sample standard deviations SX, SY into its definition, resulting in the sample coefficient of correlation, or sample correlation for short:

(4.6) \[ r_{XY} = \frac{S_{XY}}{S_X\, S_Y} \]

The factors n − 1 in SXY, SX, and SY cancel each other, and it can be proved that −1 ≤ rXY ≤ + 1, just like its probabilistic counterpart ρXY.

We should always keep in mind that we are referring to Pearson’s correlation, whose limitations have been stressed in Section 3.4.1. Nevertheless, correlation analysis plays an important role in Monte Carlo models:

In input analysis, we should check if some variables are correlated in order to model them correctly.
In output analysis, it may be important to check the correlation (or lack thereof) between output estimates, e.g., to find the right batch length in the method of batches. See Section 7.1.
When applying variance reduction by control variates, it may be important to check the strength of the correlation between the crude Monte Carlo estimator and the control variate that we are considering. See Chapter 8.

Whatever the reason why we are estimating a correlation, we should always wonder whether it is statistically significant. To see the point, imagine estimating the correlation when n = 2. Since it is always possible to fit a straight line through two points, the resulting sample correlation is always ±1. It is clear that the magnitude of the correlation should be checked against the sample size, and a simple strategy is to test the null hypothesis

\[ H_0 : \rho_{XY} = 0 \]

against the alternative hypothesis

\[ H_a : \rho_{XY} \ne 0 \]

However, we need a statistic whose distribution under the null hypothesis is fairly manageable. One useful result is that, if the sample is normal, the statistic

\[ T = r_{XY} \sqrt{\frac{n-2}{1 - r_{XY}^2}} \]

is approximately distributed as a t variable with n − 2 degrees of freedom, for a suitably large sample. This can be exploited to come up with correlation tests.

In R, testing correlations is accomplished by the function cor.test. The snapshot in Fig. 4.2 generates two vectors of independent normals, X1 and X2, to which a common factor is added. This has the effect of producing a correlation coefficient of 0.3262. However, the p-value is 0.1605, which makes the correlation not quite significant, due to the small sample size. The reader is urged to repeat the experiment with n = 2000 and check that now the correlation is significant.

FIGURE 4.2 Testing correlations.
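A snapshot along the lines of Fig. 4.2 may be sketched as follows; the sample size, the factor loading, and the seed are assumptions:

set.seed(55555)
n <- 20
Z <- rnorm(n)               # common factor inducing correlation
X1 <- rnorm(n) + 0.5*Z
X2 <- rnorm(n) + 0.5*Z
cor.test(X1, X2)            # reports the sample correlation and its p-value

With these loadings, rerunning the sketch with n <- 2000 should make the correlation clearly significant.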

4.2 Parameter estimation

Introductory treatments of inferential statistics focus on normal populations. In that case, the two parameters characterizing the distribution, μ and σ2, coincide with expected value, the first-order moment, and variance, the second-order central moment. Hence, students might believe that parameter estimation is just about calculating sample means and variances. It is easy to see that this is not the case.

Example 4.1 Parameters of a uniform distribution

Consider a uniform random variable X ~ U(a, b). We recall that

\[ E[X] = \frac{a+b}{2}, \qquad \operatorname{Var}(X) = \frac{(b-a)^2}{12} \]

Clearly, the sample mean and the sample variance S2 do not provide us with direct estimates of the parameters a and b. However, we might consider the following way of transforming the sample statistics into estimates of parameters. If we substitute μ and σ2 with their estimates, we find

\[ \bar{X} = \frac{\hat{a} + \hat{b}}{2}, \qquad S^2 = \frac{(\hat{b} - \hat{a})^2}{12} \]

Note that, in taking the square root of variance, we should only consider the positive root, since standard deviation cannot be negative. Solving this system yields the following estimates:

\[ \hat{a} = \bar{X} - \sqrt{3}\, S, \qquad \hat{b} = \bar{X} + \sqrt{3}\, S \]
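These formulas are easy to check numerically; in the following sketch, the true parameters and the sample size are arbitrary:

set.seed(1)
X <- runif(1000, min = 2, max = 5)    # sample from U(2,5)
aHat <- mean(X) - sqrt(3)*sd(X)       # moment-matching estimate of a
bHat <- mean(X) + sqrt(3)*sd(X)       # moment-matching estimate of b
c(aHat, bHat)                         # should be close to (2, 5)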

This example suggests a general strategy to estimate parameters:

Estimate moments on the basis of a random sample.
Set up a suitable system of equations relating parameters and moments and solve it.

Indeed, this is the starting point of a general parameter estimation approach called method of moments. However, a more careful look at the example should raise an issue. Consider the order statistics of a random sample of n observations of a uniform distribution:

\[ U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)} \]

If we take the above approach, all of these observations play a role in estimating a and b. Yet, this is a bit counterintuitive; in order to characterize a uniform distribution, we need a lower and an upper bound on its realizations. So, it seems that only U(1) and U(n), i.e., the smallest and the largest observations, should play a role. We might suspect that there are alternative strategies for finding point estimators and then, possibly, confidence intervals. In this section we outline two approaches: the aforementioned method of moments and the method of maximum likelihood. But before doing so, since there are alternative ways to build estimators, it is just natural to wonder how we can compare them. Therefore, we should first list the desirable properties that make a good estimator.

4.2.1 FEATURES OF POINT ESTIMATORS

We list here a few desirable properties of a point estimator for a parameter θ. When comparing alternative estimators, we may have to trade off one property for another. An estimator is unbiased if

\[ E[\hat{\theta}] = \theta \]

It is easy to show that the sample mean is an unbiased estimator of the expected value. Bias concerns the expected value of an estimator, but what about its variance? Clearly, ceteris paribus, we would like to have an estimator with low variance.

DEFINITION 4.1 (Efficient unbiased estimator) An unbiased estimator θ̂1 is more efficient than another unbiased estimator θ̂2 if

\[ \operatorname{Var}(\hat{\theta}_1) < \operatorname{Var}(\hat{\theta}_2) \]

Note that we must compare unbiased estimators in assessing efficiency. Otherwise, we could obtain a nice estimator with zero variance by just choosing an arbitrary constant. It can be shown that, under suitable hypotheses, the variance of certain estimators has a lower bound. This bound, known as the Cramér–Rao bound, is definitely beyond the scope of this book, but the message is clear: We cannot go below a minimal variance. If the variance of an estimator attains that lower bound, then it is efficient. Another desirable property is consistency. An estimator θ̂ is consistent if θ̂ → θ, where the limit must be understood in some probabilistic sense, when the sample size increases. For instance, the statistic

\[ \frac{1}{n} \sum_{i=1}^n \left( X_i - \bar{X} \right)^2 \]

is not an unbiased estimator of variance σ2, as we know that we should divide by n − 1 rather than n. However, it is a consistent estimator, since when n → ∞ there is no difference between dividing by n − 1 or n. Incidentally, by dividing by n we lose unbiasedness but we retain consistency while reducing variance. Sometimes, we may prefer a biased estimator, if its variance is low, provided that it is consistent.

4.2.2 THE METHOD OF MOMENTS

We have already sketched an application of this method in Example 4.1. To state the approach in more generality, let us introduce the sample moment of order k:

\[ M_k = \frac{1}{n} \sum_{i=1}^n X_i^k \]

The sample moment is the sample counterpart of moment mk = E[Xk]. Let us assume that we need an estimate of k parameters θ1, θ2, …, θk. The method of moments relies on the solution of the following system of equations:

\[ m_j = m_j(\theta_1, \theta_2, \ldots, \theta_k), \qquad j = 1, 2, \ldots, k \]

In general, this is a system of nonlinear equations that may be difficult to solve, and we cannot be sure that a unique solution, if any, exists. However, assuming that there is in fact a unique solution, we may just replace the moments mk by the sample moments Mk, and solve the system to obtain estimators θ̂1, θ̂2, …, θ̂k.

Example 4.2 Application to the normal distribution

Let (X1, X2, … Xn) be a random sample from a normal distribution with unknown parameters (θ1, θ2) ≡ (μ, σ). The relationship between the parameters and the first two moments is m1 = μ, m2 = σ2 + μ2. Plugging sample moments and solving the system yields

\[ \hat{\mu} = M_1 = \bar{X}, \qquad \hat{\sigma}^2 = M_2 - M_1^2 = \frac{1}{n}\sum_{i=1}^n \left( X_i - \bar{X} \right)^2 \]

These estimators look familiar enough, but note that we do not obtain an unbiased estimator of variance.

4.2.3 THE METHOD OF MAXIMUM LIKELIHOOD

The method of maximum likelihood is an alternative approach to find estimators in a systematic way. Imagine that a random variable X has a PDF characterized by a single parameter θ, denoted by fX(x; θ). If we draw a sample of n i.i.d. variables from this distribution, the joint density is just the product of individual PDFs:

\[ f_X(x_1;\theta)\, f_X(x_2;\theta) \cdots f_X(x_n;\theta) = \prod_{i=1}^n f_X(x_i;\theta) \]

This is a function depending on the parameter θ. Also note that we use lowercase letters xi to point out that we are referring to a specific set of numerical realizations of the random variable X. If we are interested in estimating θ, given a sample Xi = xi, i = 1, …, n, we may swap the role of variables and parameters and build the likelihood function

(4.7) \[ L(\theta) = \prod_{i=1}^n f_X(x_i;\theta) \]

The shorthand notation L(θ) is used to emphasize that this is a function of the unknown parameter θ, for a given sample of observations. On the basis of this framework, intuition suggests that we should select the parameter yielding the largest value of the likelihood function. For a discrete random variable, the interpretation is more natural: We select the parameter maximizing the probability of what we have indeed observed. For a continuous random variable, we cannot really speak of probabilities, but the rationale behind the method of maximum likelihood should be apparent: We are just trying to find the best explanation of what we observe. The acronym MLE, for maximum-likelihood estimator or estimation, is used to refer to the approach.

Example 4.3 MLE for the exponential distribution

Let us consider the PDF of an exponential random variable with parameter λ:

\[ f_X(x;\lambda) = \lambda e^{-\lambda x}, \qquad x \ge 0 \]

Given observed values x1, …, xn, the likelihood function is

\[ L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n \exp\left( -\lambda \sum_{i=1}^n x_i \right) \]

Quite often, rather than attempting direct maximization of the likelihood function, it is convenient to maximize its logarithm, i.e., the log-likelihood function

(4.8) \[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f_X(x_i;\theta) \]

The rationale behind this function is easy to grasp, since by taking the logarithm of the likelihood function, we transform a product of functions into a sum of functions:

\[ \ell(\lambda) = \log L(\lambda) = n \log \lambda - \lambda \sum_{i=1}^n x_i \]

The first-order optimality condition yields

\[ \frac{d\ell(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \quad\Longrightarrow\quad \hat{\lambda} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}} \]

The result is rather intuitive, since E[X] = 1/λ.
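As a quick sketch, we may compare the closed-form MLE with a numerical maximization of the log-likelihood; the data, the seed, and the search interval below are arbitrary:

set.seed(1)
x <- rexp(500, rate = 2)
1/mean(x)                            # closed-form MLE of lambda
negLogLik <- function(lambda) -(length(x)*log(lambda) - lambda*sum(x))
optimize(negLogLik, interval = c(1e-6, 100))$minimum   # numerical MLE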

The idea of MLE is immediately extended to the case requiring the estimation of multiple parameters: We just solve a maximization problem involving multiple variables.

Example 4.4 MLE for the normal distribution

Let us consider maximum-likelihood estimation of the parameters θ1 = μ and θ2 = σ2 of a normal distribution. Straightforward manipulation of the PDF yields the log-likelihood function:

\[ \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 \]

The first-order optimality condition with respect to μ is

\[ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} \]

Hence, the maximum-likelihood estimator of μ is just the sample mean. If we apply the first-order condition with respect to σ², plugging in the estimator μ̂, we obtain

\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left( x_i - \hat{\mu} \right)^2 \]

In the two examples above, maximum likelihood yields the same estimator as the method of moments. Then, one could well wonder whether there is really any difference between the two approaches. The next example provides us with a partial answer.

Example 4.5 MLE for the uniform distribution

Let us consider a uniform distribution on interval [0, θ]. On the basis of a random sample of n observations, we build the likelihood function

\[ L(\theta) = \prod_{i=1}^n \frac{1}{\theta}\,\mathbf{1}_{\{0 \le x_i \le \theta\}} = \begin{cases} 1/\theta^n, & \theta \ge \max_i x_i, \\ 0, & \text{otherwise} \end{cases} \]

This may look a bit weird at first sight, but it makes perfect sense, as the PDF is zero for x > θ. Indeed, θ cannot be smaller than the largest observation:

\[ \theta \ge \max\{x_1, x_2, \ldots, x_n\} = x_{(n)} \]

In this case, since there is a constraint on θ, we cannot just take the derivative of L(θ) and set it to 0 (first-order optimality condition). Nevertheless, it is easy to see that likelihood is maximized by choosing the smallest θ, subject to constraint above. Hence, using the notation of order statistics, we find

\[ \hat{\theta} = X_{(n)} \]

While the application of maximum likelihood to exponential and normal variables looks a bit dull, in the last case we start seeing something worth noting. The estimator is quite different from what we have obtained by using the method of moments. The most striking feature is that this estimator does not use the whole sample, but just a single observation. We leave it as an exercise for the reader to check that, in the case of a uniform distribution on the interval [a, b], maximum-likelihood estimation yields

\[ \hat{a} = X_{(1)}, \qquad \hat{b} = X_{(n)} \]

Indeed, the smallest and largest observations are sufficient statistics for the parameters a and b of a uniform distribution.7 We refrain from stating a formal definition of sufficient statistics, but the idea is rather intuitive. Given a random sample X, a sufficient statistic is a function T(X) that captures all of the information we need from the sample in order to estimate a parameter. As a further example, sample mean is a sufficient statistic for the parameter μ of a normal distribution. This concept has far-reaching implications in the theory of inferential statistics.

A last point concerns unbiasedness. In the first two examples, we have obtained biased estimators for variance, even though they are consistent. It can be shown that an unbiased estimator of θ for the uniform distribution on [0, θ] is

(4.9) \[ \frac{n+1}{n}\, X_{(n)} \]

rather than X(n). Again, we see that maximum likelihood yields a less than ideal estimator, even though for n → ∞ there is no real issue. It turns out that maximum-likelihood estimators (MLEs) do have limitations, but a few significant advantages as well. Subject to some technical conditions, the following properties can be shown for MLEs:

They are consistent.
They are asymptotically normal.
They are asymptotically efficient.
They are invariant, in the sense that, given a function g(·), the MLE of γ = g(θ) is g(θ̂), where θ̂ is the MLE of θ.

As a general rule, finding MLEs requires the solution of an optimization problem by numerical methods, but there is an opportunity here. Whatever constraint we want to enforce on the parameters, depending on domain-specific knowledge, can be easily added, resulting in an optimization problem that can be solved numerically. Plenty of numerical optimization methods are available to this purpose.

4.2.4 DISTRIBUTION FITTING IN R

R offers some tools to fit distributions by maximum likelihood, such as

mle, contained in the stats4 package,
fitdistr, contained in the MASS package.

The former function leaves to the user the task of specifying the negative log-likelihood function and produces an object of an S4 class. The latter is a bit more user-friendly. The use of these functions is best understood by a simple example related to a gamma distribution. The code in Fig. 4.3 produces the following (abridged) output:

FIGURE 4.3 Fitting a gamma distribution.

> fit1@coef
shape rate
2.233724 11.062195
> confint(fit1)
Profiling…
2.5 % 97.5 %
shape 1.985328 2.503327
rate 9.684267 12.559251
> fit2$estimate
shape rate
2.233726 11.062211
> fit2$sd
shape rate
0.1320868 0.7331218

We notice a slight difference in the estimates of shape and rate parameters, due to the optimization algorithms embedded in the two functions. In the case of mle, we obtain an S4 object; this is why we have to access its properties by fit1@coef rather than by fit1$coef. We also select an optimization method, L-BFGS-B, so that we may enforce lower bounds on the parameters (otherwise, warnings may be obtained if they get negative in the search process). Also notice the use of a named list to provide initial values for the numerical optimization algorithm, as usual in nonlinear programming. In both cases we have some information on the reliability of the estimate, either via a confidence interval or standard errors.
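A script consistent with the above output may be sketched as follows; the seed, the sample size, and the starting values used in Fig. 4.3 are assumptions:

library(stats4)    # for mle
library(MASS)      # for fitdistr
set.seed(55555)
X <- rgamma(500, shape = 2, rate = 10)
# minus log-likelihood of the gamma sample, as required by mle
negLL <- function(shape, rate) -sum(dgamma(X, shape, rate, log = TRUE))
fit1 <- mle(negLL, start = list(shape = 1, rate = 1),
            method = "L-BFGS-B", lower = c(0.001, 0.001))
fit2 <- fitdistr(X, "gamma")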

4.3 Checking the fit of hypothetical distributions

So far, we have been concerned with parameters of probability distributions. We never questioned the fit of the distribution itself against empirical data. For instance, we might assume that a population is normally distributed, and we may estimate and test its expected value and variance. However, normality should not be taken for granted, just like any other claim about the underlying distribution. Sometimes, specific knowledge suggests strong reasons that justify the assumption; otherwise, this should be tested in some way. When we test whether experimental data fit a given probability distribution, we are not really testing a hypothesis about a parameter or two; in fact, we are running a nonparametric test.

In this section we illustrate three kinds of approach:

The chi-square test, which is general purpose and, loosely speaking, checks the fit in terms of histograms and densities.
The Kolmogorov–Smirnov test, which is general purpose as well, but checks the fit in terms of the cumulative distribution.
Two examples of ad hoc tests aimed at checking normality, namely, the Shapiro–Wilk and Jarque–Bera tests.

All of the above procedures will be illustrated using R.

Before describing formal approaches, let us comment on intuitive and graphical checks. Clearly, visual inspection of a histogram and a comparison with the theoretical shape of a PDF may help. However, sometimes the judgement is not so easy to make. To see the point, let us generate a normal sample and look at the resulting histograms in Fig. 4.4:

FIGURE 4.4 Histograms of a normal sample.

> set.seed(55555)
> par(mfrow=c(1,2))
> X=rnorm(200,10, 40)
> hist(X)
> hist(X,nclass=50)
> par(mfrow=c(1,1))

In the plot on the left of Fig. 4.4, there are very few bins, and the distribution might look compatible with a normal, but with other densities as well. If we increase the resolution by using more bins, sampling variability makes the histogram too bumpy to be of any use. By the way, these observations also apply to the chi-square test, which formalizes the intuition of comparing histograms. Sometimes, checking the CDF is a better idea. In fact, another graphical tool relies on plotting the sample quantiles against the theoretical ones, obtaining a Q-Q plot. Using the above sample, we produce the plot in Fig. 4.5 as follows:

FIGURE 4.5 Checking the fit of a normal distribution.

> qqnorm(X)
> qqline(X)

In principle, the points should lie on a straight line, since the sample is in fact normally distributed. This is indeed the case, with the exception of tails, which is to be expected after all as they are always critical.

4.3.1 THE CHI-SQUARE TEST

The idea behind the chi-square test is fairly intuitive and basically relies on a relative frequency histogram, although the technicalities do require some care. The first step is to divide the range of possible observed values in J disjoint intervals, corresponding to bins of a frequency histogram. Given a probability distribution, we can compute the probability pj, j = 1, …, J, that a random variable distributed according to that distribution falls in each bin. If we have n observations, the number of observations that should fall in interval j, if the assumed distribution is indeed the true one, should be Ej = npj. This number should be compared against the number Oj of observations that actually fall in interval j; a large discrepancy would suggest that the hypothesis about the underlying distribution should be rejected. Just like any statistical test, the chi-square test relies on a distributional property of a statistic. It can be shown that, for a large sample size, the statistic

\[ \chi^2 = \sum_{j=1}^{J} \frac{\left( O_j - E_j \right)^2}{E_j} \]

has (approximately) a chi-square distribution. We should reject the hypothesis if χ² is too large, i.e., if χ² > χ²1−α,m, where:

χ²1−α,m is a quantile of the chi-square distribution.
α is the significance level of the test.
m is the number of degrees of freedom.

What we are missing here is m, which depends on the number of parameters of the distribution that we have estimated using the data. If no parameter has been estimated, i.e., if we have assumed a specific parameterized distribution prior to observing data, the degrees of freedom are J − 1; if we have estimated p parameters, we should use J − p − 1.

R offers a chisq.test function that, among other things, can be used to check the fit of a continuous distribution. To this aim, we have to bin data and compute observed and theoretical frequencies:

set.seed(55555)
shape <- 2
rate <- 10
X <- rgamma(500, shape, rate)
breakPts <- c(seq(0,0.8,0.1), 2)
probs <- pgamma(breakPts,shape,rate)
binProbs <- diff(probs)
f.obs <- hist(X, breakPts, plot = FALSE)$counts
chisq.test(x=f.obs,p=binProbs/sum(binProbs))

Note that we divide the vector binProbs by its sum to make sure that probabilities add up to 1. We also use hist as a convenient tool to bin data. The above code results in the following output:

        Chi-squared test for given probabilities
data: f.obs
X-squared = 4.4634, df = 8, p-value = 0.8131
Warning message:
In chisq.test(x = f.obs, p = binProbs/sum(binProbs)):
Chi-squared approximation may be incorrect

As expected, we cannot reject the null hypothesis, but we get a warning message. Indeed, the idea of the chi-square test is pretty intuitive; however, it relies on approximated distributional results that may be critical. Another tricky point is that the result of the test may depend on the number and placement of bins. Rules of thumb have been proposed and are typically embedded in good statistical software.

4.3.2 THE KOLMOGOROV–SMIRNOV TEST

Continuing the above example, where the variable X stores a sample from a gamma distribution with given shape and rate parameters, it is very easy to run the Kolmogorov–Smirnov test in R:

> ks.test(X,pgamma, shape, rate)
One-sample Kolmogorov–Smirnov test
data: X
D = 0.0447, p-value = 0.2704
alternative hypothesis: two-sided

The Kolmogorov–Smirnov test, unlike the chi-square test, relies on the CDF. Indeed, as you may see above, we have to provide ks.test with a function computing cumulative probabilities, pgamma in our case, along with all of the relevant parameters.

The idea behind the test relies on a comparison between the assumed CDF FX(x) and an empirical CDF defined as follows:

\[ F_n(x) = \frac{\#\{\, i : X_i \le x \,\}}{n} \]

where #{…} denotes the cardinality of a set. The distance is measured in a worst-case sense:8

\[ D = \sup_{x} \left| F_n(x) - F_X(x) \right| \]

From a computational point of view, this distance can be calculated as

\[ D = \max\{ D^+, D^- \} \]

where

\[ D^+ = \max_{i=1,\ldots,n} \left\{ \frac{i}{n} - F_X\!\left( X_{(i)} \right) \right\}, \qquad D^- = \max_{i=1,\ldots,n} \left\{ F_X\!\left( X_{(i)} \right) - \frac{i-1}{n} \right\} \]

and X(i) is the ith order statistic (obtained by sorting the values Xi in increasing order). It is clear that a large value of D does not support the assumed distribution. What is not so clear is how to devise a rejection region and find the critical values defining its boundary. This depends on the underlying distribution, and Monte Carlo sampling is often used to find such values.
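As a rough check of the above formulas, the distance D may be recomputed by hand on the gamma sample X of the previous snapshot, reusing the shape and rate variables defined there:

n <- length(X)
Fx <- pgamma(sort(X), shape, rate)    # theoretical CDF at the order statistics
Dplus <- max((1:n)/n - Fx)
Dminus <- max(Fx - (0:(n-1))/n)
max(Dplus, Dminus)                    # should match the D reported by ks.test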

4.3.3 TESTING NORMALITY

The R core includes the Shapiro–Wilk test, implemented in the shapiro.test function:

> set.seed(55555)
> X=rnorm(20,10,20)
> shapiro.test(X)
        Shapiro-Wilk normality test
data: X
W = 0.9512, p-value = 0.385

In this case, the function returns a rather large p-value, such that we cannot reject the null hypothesis that the distribution is normal. Given this structure, it is clear that we get a conservative test. Strong evidence is needed to reject normality. Let us try an exponential and a gamma distribution:

> X=rexp(20,10)
> shapiro.test(X)
        Shapiro-Wilk normality test
data: X
W = 0.9006, p-value = 0.04237
> X=rgamma(20,2, 10)
> shapiro.test(X)
Shapiro-Wilk normality test
data: X
W = 0.623, p-value = 5.14e-06

Evidence is reasonably strong in the former case, but not as strong as in the latter. The test statistic takes values between 0 and 1, and it compares two estimates of variance; when the statistic is close to 1 we cannot reject normality. In the Jarque–Bera test we take advantage of third- and fourth-order moments, and of the fact that skewness is zero and kurtosis is 3 for a normal. The test statistic is

\[ JB = \frac{n}{6} \left( \hat{S}^2 + \frac{(\hat{K} - 3)^2}{4} \right) \]

where Ŝ and K̂ are estimates of skewness and kurtosis, respectively. Clearly, the test statistic is non-negative and should be close to zero under the null hypothesis that the underlying distribution is normal. For a large sample, the test statistic is approximately a chi-square variable with two degrees of freedom, which can be used to define a rejection region for the test. However, since this approximation is not quite robust for small and medium-size samples, critical values can be obtained by Monte Carlo sampling.
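The Jarque–Bera statistic is easily computed by hand, as in the following sketch; ready-made implementations are available in contributed packages, e.g., jarque.bera.test in the tseries package:

jbStat <- function(x) {
    n <- length(x)
    z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # standardized sample
    skew <- mean(z^3)                                 # sample skewness
    kurt <- mean(z^4)                                 # sample kurtosis
    n/6 * (skew^2 + (kurt - 3)^2/4)
}
set.seed(55555)
jbStat(rnorm(1000))    # close to zero for a normal sample
jbStat(rexp(1000))     # much larger for a skewed sample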

4.4 Estimation of linear regression models by ordinary least squares

Under suitable assumptions, multiple linear regression models may be estimated by ordinary least squares (OLS), which is accomplished in R by the lm function. This is illustrated by the code in Fig. 4.6, which produces the following output:

FIGURE 4.6 Fitting a multiple linear regression model.

Call:
lm(formula = Y ~ X1 + X2)
Residuals:
    Min     1Q Median     3Q     Max
−12.396 −9.859 −6.075 8.818 29.831
Coefficients:
            Estimate Std.    Error    t value    Pr(>|t|)
(Intercept)    21.6349        11.0946        1.950    0.092162 .
X1            4.7902        1.7912        2.674    0.031802 *
X2            2.7463        0.4732        5.803    0.000661 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16 on 7 degrees of freedom
Multiple R-squared: 0.8727,Adjusted R-squared: 0.8363
F-statistic: 23.98 on 2 and 7 DF, p-value: 0.000737

We obtain parameter estimates, along with an assessment of their significance. The p-value of the coefficient for X1 is significant, despite the high level of noise. The coefficient of determination R2 is 0.8727, but the adjusted R2 is 0.8363. We also obtain an overall assessment of the model in the p-value of the F-statistic. The reader is invited to check the effect of reducing the coefficient multiplying X1 in the data generation process.
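A data generating process and fit consistent with the above output may be sketched as follows; the true coefficients, the noise level, the sample size, and the seed used in Fig. 4.6 are assumptions:

set.seed(55555)
X1 <- runif(10, 0, 10)                       # explanatory variables
X2 <- runif(10, 0, 40)
Y <- 20 + 5*X1 + 3*X2 + rnorm(10, sd = 15)   # noisy linear response
lmod1 <- lm(Y ~ X1 + X2)
summary(lmod1)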

We obtain confidence intervals for the estimated coefficients, and we may also test normality and the absence of autocorrelation of the residuals:

> confint(lmod1)
                2.5 % 97.5 %
(Intercept) −4.5996144 47.869439
X1           0.5546597 9.025779
X2           1.6272928 3.865392
> shapiro.test(lmod1$resid)
Shapiro-Wilk normality test
data: lmod1$resid
W = 0.8294, p-value = 0.03287
> library(lmtest)
> dwtest(lmod1)
Durbin-Watson test
data: lmod1
DW = 2.5181, p-value = 0.5605
alternative hypothesis:
true autocorrelation is greater than 0

Using lm requires the specification of a formula. In the case above, Y~X1+X2 is self-explanatory, but formulas in R are much more powerful than that. Say that we want to fit a polynomial regression model of order 2:

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \]

Of course, we can just evaluate each power of the explanatory variable and run a linear regression (the model is nonlinear in variables, but linear in parameters). However, the code in Fig. 4.7 shows a shortcut: by setting raw=TRUE, we force R to use familiar polynomials rather than orthogonal ones.9 We obtain the following output:

FIGURE 4.7 Fitting a polynomial regression model.

Call:
lm(formula = Y ~ poly(X, 2, raw = TRUE))
Residuals:
Min    1Q    Median    3Q    Max
-13.673    -9.653    -5.712    8.714    29.547
Coefficients:
    Estimate    Std.    Error    t value Pr(>|t|)
(Intercept)    290.82646    7.53560    38.59    2.04e-09 ***
poly(X,2,raw=TRUE)1    4.73459    0.46672    10.14    1.95e-05 ***
poly(X,2,raw=TRUE)2    0.49706    0.04724    10.52    1.53e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.02 on 7 degrees of freedom
Multiple R-squared: 0.9664, Adjusted R-squared: 0.9568
F-statistic: 100.7 on 2 and 7 DF, p-value: 6.951e-06

There are cases in which nonlinearity should be used to capture interaction between variables. Consider the model

\[ Y = \beta_0 + \beta_1 X + \beta_2 \delta + \beta_3 \delta X + \epsilon \]

where δ ∈ {0,1} is a dummy variable modeling a categorical variable. It is worth noting that when δ = 0, the data generating process is

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

whereas, when δ = 1, both intercept and slope change:

\[ Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X + \epsilon \]

The code in Fig. 4.8 illustrates such a case. The input data are depicted in Fig. 4.9, along with the fitted lines resulting from setting δ = 0 and δ = 1, respectively. We can see clearly the difference in intercept and slope between the two groups.

FIGURE 4.8 Fitting a regression model with interactions.

FIGURE 4.9 A regression model with interactions.

The following model summary shows that, when using the formula Y~X*Z, interactions are automatically accounted for:

Call:
lm(formula = Y ~ X * Z)
Residuals:
Min    1Q    Median    3Q    Max
-27.0281    -5.4642    -0.1196    5.5480    22.4519
Coefficients:
    Estimate Std.    Error    t value    Pr(>|t|)
(Intercept)    29.6160    3.5574    8.325    2.62e-12 ***
X    4.8991    0.1135    43.147    < 2e-16 ***
Z    38.6779    4.6058    8.398    1.90e-12 ***
X:Z    -2.8403    0.1584    -17.928    < 2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.04 on 76 degrees of freedom
Multiple R-squared: 0.9732, Adjusted R-squared: 0.9722
F-statistic: 921.3 on 3 and 76 DF, p-value: < 2.2e-16

4.5 Fitting time series models

Fitting time series models requires the use of possibly sophisticated procedures, whose first step is selecting the type and order of the model. This model identification step can be accomplished using the concepts that we have outlined in Section 3.6. Given a model structure, estimating its parameters calls for the application, e.g., of maximum-likelihood methods. In this section we just simulate a couple of models and check whether we are able to recover their underlying structure, on the basis of the resulting sample paths, by using R functions.

To simulate an ARIMA model, we may write our own code, as we did in Section 3.6. For the sake of convenience, here we use the function arima.sim, which simulates an ARIMA model and accepts, among other things, the following input arguments:

model: a named list with vector-valued components, ar and ma, specifying the AR and the MA part of the model
n: the time horizon for the simulation
n.start: the number of time buckets that should be simulated and discarded to avoid issues with the transient phase resulting from the initial conditions
rand.gen: a function to generate shocks; by default, rnorm is used, but we may use, e.g., a Student’s t distribution to allow for slightly fatter tails

Then, we estimate the model using two functions:

arima, which needs a specification of the model order
auto.arima, provided by the forecast package, which automates the search for the model order
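An experiment of this kind may be sketched as follows; the ARMA coefficients, the time horizon, and the seed used in Fig. 4.10 are assumptions:

library(forecast)
set.seed(55555)
armaModel <- list(ar = c(0.65, -0.4), ma = c(-0.1, 0.3))
x1 <- arima.sim(model = armaModel, n = 2000, n.start = 100)
x2 <- arima.sim(model = armaModel, n = 2000, n.start = 100)
arima(x1, order = c(2, 0, 2))
auto.arima(x1)
arima(x2, order = c(2, 0, 2))
auto.arima(x2)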

The code in Fig. 4.10 produces the following (edited) output:

FIGURE 4.10 Simulating and fitting an ARMA(2,2) model.

> arima(x1,order=c(2,0,2))
Series: x1
ARIMA (2, 0, 2) with non-zero mean
Coefficients:
    ar1    ar2    ma1    ma2    intercept
    0.6477    -0.3853    -0.0880    0.3118    -0.0105
s.e.    0.1364    0.0652    0.1362    0.0517    0.0295
sigma^2 estimated as 0.4749: log likelihood=-1570.18
AIC=3152.35 AICc=3152.41 BIC=3184.23
> auto.arima(x1)
Series: x1
ARIMA (2, 0, 2) with zero mean
Coefficients:
    ar1    ar2    ma1    ma2
    0.6474    -0.3849    -0.0875    0.3117
s.e.    0.1363    0.0652    0.1361    0.0517
sigma^2 estimated as 0.475: log likelihood=-1570.24
AIC=3150.48 AICc=3150.52 BIC=3177.05
> arima(x2, order=c(2,0,2))
Series: x2
ARIMA (2, 0, 2) with non-zero mean
Coefficients:
    ar1    ar2    ma1    ma2    intercept
    0.9599    -0.5477    -0.3572    0.2970    -0.0076
s.e.    0.0738    0.0425    0.0772    0.0439    0.0282
sigma^2 estimated as 0.4677: log likelihood=-1558.8
AIC=3129.6 AICc=3129.66 BIC=3161.48
> auto.arima(x2)
Series: x2
ARIMA (2, 0, 2) with zero mean
Coefficients:
    ar1    ar2    ma1    ma2
    0.9599    -0.5477    -0.3573    0.2970
s.e.    0.0738    0.0425    0.0772    0.0439
sigma^2 estimated as 0.4678: log likelihood=-1558.84
AIC=3127.68 AICc=3127.72 BIC=3154.24

Note that arima needs a specification of the model order; in this case we cheat and use the actual order used in the simulation, i.e., (2, 0, 2), where there is no integration component. The alternative function auto.arima is able to find the correct order in this example, but this is not the case in general. We also note that, despite the fairly long simulated horizon, different sample paths may lead to rather different estimates of the coefficients. This illustrates the difficulties in fitting time series models, but we should consider that the main purpose here is to capture some essential features of the original time series, in order to produce realistic scenarios for a Monte Carlo simulation. The code of Fig. 4.10 also uses the function ts.plot to plot the first sample path, along with a zoomed portion of the initial time interval; see the result in Fig. 4.11.

FIGURE 4.11 Sample path produced by arima.sim; the plot on the right zooms over the initial portion of the path.

4.6 Subjective probability: The Bayesian view

So far, we have adopted a rather standard view of parameter estimation, as we have followed the orthodox approach: Parameters are unknown numbers, which we try to estimate by squeezing information out of a random sample, in the form of point estimators and confidence intervals. We also insisted on the fact that, given a computed confidence interval with some confidence level (1 − α), we cannot say that the true parameter is contained there with probability (1 − α). This statement makes no sense, since we are only comparing known numbers (the realized bounds of the confidence interval) and an unknown number (the parameter), but no random variable is involved. So, there is no such thing as “probabilistic knowledge” about parameters, and data are the only source of information; any other knowledge, objective or subjective, is disregarded. The following example illustrates the potential difficulties induced by this view.10

Example 4.6 A pathological case

Let X be a uniformly distributed random variable, and let us assume that we do not know where the support of its distribution is located, but we know that its width is 1. Then, X ~ U(μ − 0.5, μ + 0.5), where μ is the unknown expected value of X, as well as the midpoint of the support. In order to estimate μ, let us take a sample of only n = 2 independent realizations X1 and X2 of the random variable. Now, let us consider the order statistics

\[ X_{(1)} = \min\{X_1, X_2\}, \qquad X_{(2)} = \max\{X_1, X_2\} \]

and the confidence interval

(4.10) \[ I = \left( X_{(1)},\, X_{(2)} \right) \]

Given this way of building a confidence interval, what is the confidence level of I, i.e., the probability P{μ ∈ I}? Once again, note that we are not asking this question by referring to a specific interval after sampling, but to a random interval before sampling. Because of symmetry, the two observations have a probability 0.5 of falling to the left or to the right of μ. The confidence interval will not contain μ if both of them fall on the same half of the support. Then, since X1 and X2 are independent, we have

\[ P\{\mu \notin I\} = P\{X_1 < \mu,\, X_2 < \mu\} + P\{X_1 > \mu,\, X_2 > \mu\} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2} \]

Therefore, the confidence level for I is the complement of this probability, i.e., 50%. Now, let us suppose that we observe X1 = 0 and X2 = 0.6. What is the probability that μ is included in the confidence interval resulting from Eq. (4.10), i.e., P{0 ≤ μ ≤ 0.6}? In general, this question does not make any sense, since μ is a number. But in this specific case, we have some additional knowledge leading to the conclusion that the expected value is certainly included in the interval [0, 0.6]. In fact, if X(1) = 0, we may conclude μ ≤ 0.5; by the same token, if X(2) = 0.6, we may conclude μ ≥ 0.1. Since the confidence interval includes the interval [0.1, 0.5], we would have good reasons to claim something like P{0 ≤ μ ≤ 0.6} = 1. But again, this makes no sense in the orthodox framework. By a similar token, if we get X1 = 0 and X2 = 0.001, we would be tempted to say that such a small interval is quite unlikely to include μ. However, within the framework of orthodox statistics, there is no way in which we can express this intuition properly.

On one hand, the example illustrates the need to make our background knowledge explicit: The Bayesian framework is an alternative to the orthodox approach, where it can be argued that unconditional probabilities do not exist, in the sense that probabilities are always conditional on some background knowledge and assumptions. On the other hand, we see the need for a way to express subjective views, which may be revised after collecting empirical data. Bayesian estimation has been proposed to cope with such issues as well.

4.6.1 BAYESIAN ESTIMATION

Consider the problem of estimating a parameter θ, which characterizes the probability distribution of a random variable X. We have some prior information about θ that we would like to express in a sensible way. We might assume that the unknown parameter lies anywhere in the unit interval [0, 1], or we might assume that it is close to some number μ, but we are somewhat uncertain about it. Such knowledge or subjective view may be expressed by a probability density p(θ), which is called the prior distribution of θ. In the first case, we might associate a uniform distribution with θ; in the second case the prior could be a normal distribution with expected value μ and variance σ². Note that this is the variance that we associate with the parameter, which is itself a random variable, rather than a number, and not the variance of the random variable X itself.

In Bayesian estimation, the prior is merged with experimental evidence by using Bayes’ theorem. Experimental evidence consists of independent observations X1, …, Xn from the unknown distribution. Here and in the following we mostly assume that random variable X is continuous, and we speak of densities; the case of discrete random variables and probability mass functions is similar. We also assume that the values of the parameter θ are not restricted to a discrete set, so that the prior is a density as well. Hence, let us denote the density of X by f(x | θ), to emphasize its dependence on parameter θ. Since a random sample consists of a sequence of independent random variables, their joint distribution, conditional on θ, is

\[ f_n(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) \]

The conditional density fn(x1, …, xn| θ) is also called the likelihood function, as it is related to the likelihood of observing the data values x1, …, xn, given the value of the parameter θ; also notice the similarity with the likelihood function in maximum-likelihood estimation.11 Note that what really matters here is that the observed random variables X1, …, Xn are independent conditionally on θ. Since we are speaking about n + 1 random variables, we could also consider the joint density

\[ f(x_1, \ldots, x_n, \theta) = f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta) \]

but this will not really be necessary for what follows. Given the joint conditional distribution fn(x1, …, xn | θ) and the prior p(θ), we can find the marginal density of X1, …, Xn by applying the total probability theorem:

\[ g_n(x_1, \ldots, x_n) = \int_\Omega f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta)\, d\theta \]

where we integrate over the domain Ω on which θ is defined, i.e., the support of the prior distribution. What we need now is to invert the conditioning, i.e., to find the distribution of θ conditional on the observed values Xi = xi, i = 1, …, n:

\[ p_n(\theta \mid x_1, \ldots, x_n) \]

This posterior density should merge the prior and the density of observed data conditional on the parameter. This is obtained by applying Bayes’ theorem to densities, which yields

(4.11) \[ p_n(\theta \mid x_1, \ldots, x_n) = \frac{f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta)}{g_n(x_1, \ldots, x_n)} \]

Note that the posterior density involves a denominator term gn(x1, …, xn) that does not depend on θ. Its role is only to normalize the posterior distribution, so that its integral is 1. Sometimes, it might be convenient to rewrite Eq. (4.11) as

(4.12) \[ p_n(\theta \mid x_1, \ldots, x_n) \propto f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta) \]

where the symbol ∝ means “proportional to.” The denominator term does not affect the shape of the posterior, which encodes our uncertainty about the parameter θ, and it may be argued that it is not really essential.12 In plain English, Eq. (4.11) states that the posterior is proportional to the product of the likelihood function fn(x1, …, xn| θ) and the prior distribution p(θ):

\[ \text{posterior} \propto \text{likelihood} \times \text{prior} \]

What we are saying is that, given some prior knowledge about the parameter and the distribution of observations conditional on the parameter, we obtain an updated distribution of the parameter, conditional on the actually observed data.
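When θ is a scalar, Eq. (4.12) also suggests a simple numerical recipe: evaluate the product of likelihood and prior on a grid of θ values and normalize it numerically. The following R fragment is only a sketch of this idea; the exponential likelihood and the gamma prior are arbitrary choices made purely for illustration.

# Grid-based approximation of a posterior: posterior proportional to
# likelihood times prior. For illustration, X is exponential with rate theta
# and the prior on theta is a (rather flat) gamma density.
set.seed(55555)
x <- rexp(30, rate = 1.5)                       # observed sample
theta.grid <- seq(0.01, 5, length.out = 1000)
dtheta <- theta.grid[2] - theta.grid[1]         # grid spacing
loglik <- sapply(theta.grid,
                 function(th) sum(dexp(x, rate = th, log = TRUE)))
prior <- dgamma(theta.grid, shape = 2, rate = 1)
post <- exp(loglik - max(loglik)) * prior       # unnormalized posterior
post <- post / sum(post * dtheta)               # numerical normalization
plot(theta.grid, post, type = "l",
     xlab = expression(theta), ylab = "posterior density")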

4.6.2 BAYESIAN LEARNING AND COIN FLIPPING

We tend to take for granted that coins are fair, so that the probability of getting heads is 1/2. Let us consider flipping a possibly unfair coin, with an unknown probability θ of getting heads. In order to learn this unknown value, we flip the coin repeatedly, i.e., we run a sequence of independent Bernoulli trials with unknown parameter θ. If we do not know anything about the coin, we might just assume a uniform prior

\[ p(\theta) = 1, \qquad 0 \le \theta \le 1 \]

If we flip the coin N times, we know that the probability fN(H | θ) of getting H heads, conditional on θ, is related to the binomial probability distribution:

(4.13) \[ f_N(H \mid \theta) \propto \theta^H (1 - \theta)^{N - H} \]

This is our likelihood function. If we regard this expression as the probability of observing H heads, given θ, this should actually be the probability mass function of a binomial variable with parameters θ and N, but we are disregarding the binomial coefficient, which does not depend on θ and just normalizes the distribution. If we multiply this likelihood function by the prior, which is just 1, we obtain the posterior density for θ, given the number of observed heads:

(4.14) \[ p(\theta \mid H) \propto \theta^H (1 - \theta)^{N - H}, \qquad 0 \le \theta \le 1 \]

Equations (4.13) and (4.14) look like the same thing, because we use a uniform prior, but they are very different in nature. Equation (4.14) gives the posterior density of θ, conditional on the fact that we observed H heads and N − H tails. If we look at it this way, we recognize the shape of a beta distribution, which is the density of a continuous random variable, the parameter θ, rather than the mass function of a discrete random variable, the number of heads H out of N flips. To normalize the posterior, we should multiply it by the appropriate value of the beta function.13 Again, this normalization factor does not depend on θ and can be disregarded.

The R code in Fig. 4.12 generates a sequence of 1000 unfair coin flips, with θ = 0.2, and successively updates and plots the posterior density, resulting in the plots of Fig. 4.13.

FIGURE 4.12 Sampling a sequence of coin flips and updating the posterior.

FIGURE 4.13 Updating the posterior density in coin flipping.

There, we display posterior densities normalized in such a way that their maximum is 1, after flipping the coin N times and having observed H heads. In Fig. 4.13(a) we just see the uniform prior before any flip of the coin. In the simulated history, the first flip lands heads. After observing the first heads, we know for sure that θ ≠ 0; indeed, if θ were zero, we could not observe any heads. The posterior is now proportional to a triangle:

\[ p(\theta \mid H = 1) \propto \theta, \qquad 0 \le \theta \le 1 \]

This triangle is shown in Fig. 4.13(b). We observe another heads in the second flip, and so the updated posterior density is a portion of a parabola, as shown in Fig. 4.13(c):

\[ p(\theta \mid H = 2) \propto \theta^2, \qquad 0 \le \theta \le 1 \]

With respect to the triangle, the portion of parabola is more tilted toward θ = 1, because we observed two heads out of two flips. However, when we get tails at the third flip, we rule out θ = 1 as well. Proceeding this way, we get beta distributions concentrating around the true (unknown) value θ = 0.2.
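The code of Fig. 4.12 is not reproduced here, but a minimal sketch along the same lines may help; variable names, the random seed, and the plotting details below are illustrative and not taken from the book's figure.

# A sketch of sequential Bayesian updating in coin flipping (uniform prior).
set.seed(55555)
theta.true <- 0.2
flips <- rbinom(1000, size = 1, prob = theta.true)   # 1 = heads, 0 = tails
theta.grid <- seq(0, 1, length.out = 501)
par(mfrow = c(2, 3))
for (N in c(0, 1, 2, 10, 100, 1000)) {
  H <- sum(flips[seq_len(N)])
  post <- theta.grid^H * (1 - theta.grid)^(N - H)    # Eq. (4.14), unnormalized
  post <- post / max(post)                           # scale maximum to 1
  plot(theta.grid, post, type = "l", xlab = expression(theta),
       ylab = "scaled posterior", main = paste("N =", N, " H =", H))
}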

Armed with the posterior density, how can we build a Bayes' estimator for θ? Figure 4.13 would suggest taking the mode of the posterior, which would spare us the work of normalizing it. However, this need not be the most sensible choice. If we consider the expected value of the posterior distribution, we find

(4.15) \[ \hat{\theta} = \mathrm{E}[\theta \mid x_1, \ldots, x_n] = \int_\Omega \theta\, p_n(\theta \mid x_1, \ldots, x_n)\, d\theta \]

There are different ways of framing the problem, which are a bit beyond the scope of this book,14 but one thing that we can immediately appreciate is the challenge we face. The estimator above involves what looks like an intimidating integral, but our task is even more difficult in practice, because finding the posterior density may be a challenging computational exercise as well. In fact, given a prior, there is no general way of finding a closed-form posterior; things can really get awkward when multiple parameters are involved. Moreover, there is no guarantee that the posterior distribution pn(θ | x1, …, xn) will belong to the same family as the prior p(θ). In general, numerical methods are needed in Bayesian computational statistics and, needless to say, Monte Carlo approaches play a key role here. However, there are some exceptions. A family of distributions is called a conjugate family of priors if, whenever the prior is in the family, the posterior is too. For instance, if we choose a beta density as the prior in coin flipping, the posterior is a beta density as well; a quick numerical check along these lines is sketched below, and Example 4.7 illustrates the idea further.
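With the uniform prior of Section 4.6.2 (a beta density with both parameters equal to 1), the normalized posterior after observing H heads out of N flips is a beta density with parameters H + 1 and N − H + 1, whose expected value is (H + 1)/(N + 2). In this lucky case, the integral of Eq. (4.15) is easy to check in R; the numbers below are made up.

# Posterior mean in coin flipping with a uniform prior: the normalized
# posterior is Beta(H + 1, N - H + 1); compare its known mean (H + 1)/(N + 2)
# with the integral of Eq. (4.15) computed numerically.
H <- 37; N <- 200
post.mean <- integrate(function(theta) theta * dbeta(theta, H + 1, N - H + 1),
                       lower = 0, upper = 1)$value
c(numerical = post.mean,
  closed.form = (H + 1) / (N + 2),
  mode = H / N)

Note that the posterior mode H/N, i.e., the intuitive relative frequency, differs slightly from the posterior mean (H + 1)/(N + 2), which is why taking the mode need not coincide with the estimator of Eq. (4.15).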

Example 4.7 The case of a normal prior

Consider a sample (X1, …, Xn) from a normal distribution with unknown expected value θ and known variance σ₀². Then, given our knowledge about the multivariate normal, and taking advantage of independence among observations, we have the following likelihood function:

\[ f_n(x_1, \ldots, x_n \mid \theta) = \frac{1}{(2\pi\sigma_0^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma_0^2} \sum_{i=1}^n (x_i - \theta)^2 \right\} \]

Let us assume that the prior distribution of θ is normal, too, with expected value μ and variance σ²:

\[ p(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(\theta - \mu)^2}{2\sigma^2} \right\} \]

To find the posterior, we may simplify our work by considering in each function only the part that involves θ, wrapping the rest within a proportionality constant. In more detail, the likelihood function can be written as

(4.16) \[ f_n(x_1, \ldots, x_n \mid \theta) \propto \exp\left\{ -\frac{1}{2\sigma_0^2} \sum_{i=1}^n (x_i - \theta)^2 \right\} \]

Then, we may simplify the expression further, by observing that

\[ \sum_{i=1}^n (x_i - \theta)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2 \]

where x̄ is the average of the observations xi, i = 1, …, n. Then, we may include terms not depending on θ in the proportionality constant and rewrite Eq. (4.16) as

(4.17) \[ f_n(x_1, \ldots, x_n \mid \theta) \propto \exp\left\{ -\frac{n}{2\sigma_0^2} (\theta - \bar{x})^2 \right\} \]

By a similar token, we may rewrite the prior as

(4.18) \[ p(\theta) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\theta - \mu)^2 \right\} \]

By multiplying Eqs. (4.17) and (4.18), we obtain the posterior

(4.19) \[ p_n(\theta \mid x_1, \ldots, x_n) \propto \exp\left\{ -\frac{n}{2\sigma_0^2} (\theta - \bar{x})^2 - \frac{1}{2\sigma^2} (\theta - \mu)^2 \right\} \]

Again, we should try to include θ within one term; to this aim, we use a bit of tedious algebra and rewrite the argument of the exponential as follows:

\[ -\frac{n}{2\sigma_0^2} (\theta - \bar{x})^2 - \frac{1}{2\sigma^2} (\theta - \mu)^2 = -\frac{(\theta - \nu)^2}{2\xi^2} + \text{terms not involving } \theta, \]

where

(4.20) \[ \nu = \frac{n\sigma^2\, \bar{x} + \sigma_0^2\, \mu}{n\sigma^2 + \sigma_0^2} \]

(4.21) \[ \xi^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma^2 + \sigma_0^2} \]

Finally, this leads us to

(4.22) \[ p_n(\theta \mid x_1, \ldots, x_n) \propto \exp\left\{ -\frac{(\theta - \nu)^2}{2\xi^2} \right\} \]

Disregarding the normalization constant, we immediately recognize the familiar shape of a normal density, with expected value ν and variance ξ². Then, given an observed sample mean x̄ and a prior expected value μ, Eq. (4.20) tells us that the Bayes' estimator of θ can be written as

(4.23) \[ \hat{\theta} = \nu = \frac{\dfrac{1}{\sigma_0^2/n}\, \bar{x} + \dfrac{1}{\sigma^2}\, \mu}{\dfrac{1}{\sigma_0^2/n} + \dfrac{1}{\sigma^2}} \]

Equation (4.23) has a particularly nice and intuitive interpretation: The posterior estimate is a weighted average of the sample mean x̄ (the new evidence) and the prior expected value μ, with weights that are inversely proportional to σ₀²/n, the variance of the sample mean, and σ², the variance of the prior. The more reliable a term, the larger its weight in the average.
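As a small sanity check of Eqs. (4.20), (4.21), and (4.23), the following R function computes the posterior mean and variance for a normal sample with known σ₀ and a normal prior; this is only a sketch, and the function name, the seed, and the sample values are mine.

# Posterior mean and variance for a normal sample with known variance sigma0^2
# and a normal prior N(mu, sigma^2); see Eqs. (4.20) and (4.21).
normal.posterior <- function(x, mu, sigma, sigma0) {
  n <- length(x)
  xbar <- mean(x)
  prec <- n / sigma0^2 + 1 / sigma^2                  # posterior precision
  nu <- (n * xbar / sigma0^2 + mu / sigma^2) / prec   # Eq. (4.20)
  xi2 <- 1 / prec                                     # Eq. (4.21)
  c(post.mean = nu, post.var = xi2, sample.mean = xbar)
}
set.seed(55555)
x <- rnorm(20, mean = 3, sd = 2)   # sigma0 = 2, "unknown" theta = 3
normal.posterior(x, mu = 0, sigma = 1, sigma0 = 2)

With a small sample, the posterior mean is pulled away from the sample mean toward the prior expected value μ; as n grows, the weight of the sample mean dominates.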

What may sound a little weird in the example is the assumption that the variance σ₀² of X is known, while its expected value is not. Of course, one can extend the framework to cope with the estimation of multiple parameters. Furthermore, we have considered two lucky cases, coin flipping and the normal prior for a normal distribution, where the posterior is very easy to compute, as it preserves the original shape of the prior. In such cases we speak of conjugate priors. In other cases, we are not so lucky, and we need to compute a possibly tough integral to find the normalization constant making the posterior a density. Indeed, Bayesian approaches were not quite practical until suitable computational tools, namely, Markov chain Monte Carlo methods, were devised to overcome this difficulty. These strategies are described in Chapter 14 and allow us to run a simulation even if we do not know a density exactly, but only a function proportional to it.

For further reading

From the simulation practitioner’s viewpoint, an excellent introduction to input analysis can be found in [10].
If you are interested in a sound treatment of the theoretical background of inferential statistics and parameter estimation, [2] is recommended.
We have given an extremely cursory and superficial look at linear regression models and their estimation. If you would like a treatment of linear regression from the statistical point of view, one possibility is [5].
There are many books on econometrics, covering regression models and more, such as [1] and, at a more advanced level, [7].
The above references on econometrics also cover time series models. If you want an extensive treatment, including estimation of time series models, you may refer to [8].
A quite readable introduction to Bayesian estimation can be found in [9], which is replete with good R code as well.

References

1 C. Brooks. Introductory Econometrics for Finance (2nd ed.). Cambridge University Press, Cambridge, 2008.

2 G. Casella and R.L. Berger. Statistical Inference (2nd ed.). Duxbury, Pacific Grove, CA, 2002.

3 M.H. DeGroot. Probability and Statistics. Addison-Wesley, Reading, MA, 1975.

4 M.H. DeGroot and M.J. Schervish. Probability and Statistics (3rd ed.). Addison-Wesley, Boston, 2002.

5 N.R. Draper and H. Smith. Applied Regression Analysis (3rd ed.). Wiley, New York, 1998.

6 I. Gilboa. Theory of Decision under Uncertainty. Cambridge University Press, New York, 2009.

7 W.H. Greene. Econometric Analysis (7th ed.). Prentice Hall, Upper Saddle River, NJ, 2011.

8 J.D. Hamilton. Time Series Analysis. Princeton University Press, Princeton, NJ, 1994.

9 J.K. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press, Burlington, MA, 2011.

10 A.M. Law and W.D. Kelton. Simulation Modeling and Analysis (3rd ed.). McGraw-Hill, New York, 1999.

1We defer some topics, such as the estimation of probabilities and quantiles, to Chapter 7, as from a practical point of view they are more relevant there, even though, from a conceptual point of view, they are also related to the content of this chapter.

2Similar considerations apply to the previous chapter, of course.

3We illustrate the kind of mistakes that we are led to when forgetting the role of independence in Chapter 7.

4If we want to associate parameters with probabilistic statements, we have to take a Bayesian view; see Section 4.6.1.

5The coefficient of variation is the same in both cases. This is defined as σ/|μ|, i.e., the ratio between standard deviation and the absolute value of the expected value.

6In fact, what is typically written in textbooks is a hybrid of different viewpoints about the nature of hypothesis testing, and this is why the terminology may sometimes sound weird.

7 See [4, p. 375] for a proof.

8 A similar approach is used in Chapter 9 to assess the star discrepancy of a numerical sequence.

9 For an application of polynomial regression to pricing American-style options by Monte Carlo methods, see Section 11.3.

10 This example is taken from [6, p. 45], which in turn refers back to [3].

11 See Section 4.2.3.

12This will play a major role in Chapter 14 on Markov chain Monte Carlo.

13See Section 3.2.4.

14One way to frame the problem is by introducing a loss function that accounts for the cost of a wrong estimate. It can be shown that Eq. (4.15) yields the optimal estimator when the loss function has a certain quadratic form.