Before embarking on a computationally intensive Monte Carlo study, we should be reasonably confident that we are feeding our simulation tool with sensible input. In the previous chapter we considered different families of models that can be used as the key ingredient of a Monte Carlo simulator. Here we consider the quantitative side of the coin: After selecting a model structure, how should we estimate its parameters? Hence, we step into the domain of inferential statistics, whereby, given a sample of observed data, we engage in increasingly difficult tasks:
The first task is typically associated with elementary concepts like confidence intervals, which may be somewhat dangerous and misleading for the newcomer. Computing a confidence interval for the mean of a normal population looks so easy that it is tempting to disregard the pitfalls involved. We will have more to say on this in Chapter 7, where we deal with output analysis. In fact, simulation input and output analysis do share some statistical background.1 If the output of a Monte Carlo run is used to estimate a probability or a quantile, it may be tempting to apply the same recipes that we adopt to estimate means, which may lead to unreasonable results. Moreover, a probability distribution is not really characterized by its expected value and variance. A normal distribution N(μ, σ2) is an exception, as its parameters have that obvious interpretation, but this is not the case, e.g., with a gamma distribution. Hence, we need more refined strategies, like moment matching or maximum-likelihood estimation. Maximum likelihood is a powerful principle, which can also be used to estimate the parameters of complicated models, including time series.
We must also be aware that parameter estimation should not be our only concern. Apart from fitting parameters, we should also try to check the fit of the overall uncertainty model, which involves nonparametric issues. If we assume that a variable of interest is normally distributed, we should check whether this is a defensible, even though debatable, assumption or sheer nonsense. Testing for normality is an example of a nonparametric test. More generally, we may need to compare the fit of alternative distributions, measuring the overall goodness of fit. What is relevant or not depends on the kind of application. For instance, it is possible to build two very different probability distributions having the same expected value, variance, skewness, and kurtosis. Thus, the first few moments do not tell the whole story. This may not be quite relevant if we are just dealing with mean–variance portfolio optimization, but it may be extremely relevant in dealing with extreme risk.
Most books on Monte Carlo methods for finance and economics do not include a chapter on these topics, which is fine if one assumes that readers have a suitable background in econometrics. Other books, such as [10], are aimed at industrial engineers and include an extensive chapter on input analysis. Since this is an introductory book also aimed at engineers and mathematicians, I have deliberately chosen to include such a chapter as well.2 To make the following treatment as useful as possible to a wide and diversified audience, we will start with really elementary topics, and then we will move on to possibly less standard ones, such as estimation of ARMA and GARCH models, as well as Bayesian parameter estimation. Furthermore, we will use R throughout, so that readers familiar with the underlying concepts, but unfamiliar with R functionalities, can take advantage of the examples. Since it is my firm conviction that understanding the underlying unity of concepts is fundamental for sound and deep learning, it is also important to stress the links of parameter estimation and model fitting with other topics dealt with in this book:
We start in Section 4.1 with a brief refresher on elementary inferential statistics, i.e., confidence intervals and hypothesis testing for the expected value; there, we also outline correlation testing. Then we move on to systematic strategies for parameter estimation in Section 4.2, where we consider moment matching and maximum likelihood. We outline nonparametric tests and checking goodness-of-fit in Section 4.3. We offer a brief refresher on ordinary least-squares methods to estimate linear regression models in Section 4.4, followed by the estimation of time series models in Section 4.5. We close the chapter with Section 4.6, where we compare the orthodox approach to inferential statistics and parameter estimation with the Bayesian view.
Newcomers will certainly realize that model estimation is a huge body of statistical knowledge, which we cannot cover in any satisfactory way in a short chapter. In particular, estimating time series models may require some deep knowledge, as is also the case when estimating a copula. We do not deal with copula estimation here, but later in Section 5.7; there we illustrate how sampling from a copula may be achieved by the R package copula, which also includes a function fitCopula for fitting purposes. We do not go into details, but we stress that the principle of parameter estimation by maximum likelihood, which we do outline in this chapter, is the theoretical underpinning of many of these estimation approaches.
In this section we deal with two very basic and related topics: confidence intervals for the mean and elementary hypothesis testing. Our aim is to review the foundations of these statistical tools, in order to get acquainted with the issues involved in more complicated cases, such as testing the significance of correlations, etc. Most readers could probably skip this section, or possibly just skim through it to see which functions R offers to accomplish these tasks. Not surprisingly, given their conceptual links, the relevant R functions boil down to one, t.test.
Most of us are introduced to inferential statistics through the calculation of a confidence interval for an expected value μ. This concept is relevant on both ends of a Monte Carlo simulation, as it is used to define the input random variables as well as to analyze the output. Given a sample Xi, i = 1, …, n of i.i.d. (independent and identically distributed) random variables, the drill is as follows:
This procedure is so easy to perform that one tends to forget that it relies on some important assumptions. The following observations are in order:
(4.2)
Another danger is a common misinterpretation of confidence intervals. After calculating a confidence interval (l, u), we cannot say that

P\{l \le \mu \le u\} = 1 - \alpha.     (4.4)
This statement does not make any sense, as all of the involved quantities are numbers. In the orthodox framework of statistics, μ is an unknown number; the extreme points of the confidence interval, l and u, are numerical realizations of random variables L and U. Hence, the above probabilistic statement makes no sense. We should rephrase the claim of Eq. (4.4) with reference to the random variables L and U, i.e., the lower and upper bound of the interval before sampling. Once again: We have to realize that parameters are unknown numbers in orthodox statistics, not random variables.4 This is illustrated empirically and clarified below by a simple Monte Carlo experiment.
Building standard confidence intervals in R is accomplished by the function t.test(x, conf.level = 0.95). The function name may look weird at first, but it is justified by the fact that the calculations needed to compute the confidence interval are related to those needed to run a hypothesis test, as we show later. The function receives a vector containing the sample and an optional parameter specifying the confidence level, which is 95% by default:
> set.seed(55555)
> X = rnorm(20,mean=10,sd=20)
> t.test(X)

        One Sample t-test

data:  X
t = 1.1709, df = 19, p-value = 0.2561
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -3.960321 14.017022
sample estimates:
mean of x
  5.02835
The above output is a bit confusing, as it is cluttered by irrelevant information about a hypothesis test. It might be better to collect the output in a variable and then inspect what we are really interested in:
> ci <- t.test(X, conf.level = 0.99)
> ci$estimate
mean of x
  5.02835
> ci$conf.int
[1] -7.258179 17.314880
attr(,"conf.level")
[1] 0.99
As expected, the 99% interval is larger than the 95% interval, and both are rather large due to the small sample size and the large standard deviation with respect to the expected value.
Armed with the above function, we may check the empirical coverage of a confidence interval, i.e., the probability that it contains the true expected value. To this aim, it is instructive to compare the result with a normal and an exponential distribution of similar variability.5 By running the code in Fig. 4.1, we obtain the following results:
FIGURE 4.1 Checking the coverage of confidence intervals empirically.
> coverageExp
[1] 0.9208
> coverageNorm
[1] 0.95
As expected, coverage with a skewed distribution is not exactly as specified.
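The script of Fig. 4.1 is not reproduced here; the following is a minimal sketch of the kind of experiment it runs, where the sample size, the number of replications, and the distribution parameters are illustrative assumptions, so the coverage estimates will not match the numbers above exactly.

set.seed(55555)
numRepl <- 10000                     # number of replications (assumed)
n <- 10                              # sample size (assumed)
coverNorm <- 0; coverExp <- 0
for (k in 1:numRepl) {
  Xn <- rnorm(n, mean = 1, sd = 1)   # normal sample, coefficient of variation 1
  Xe <- rexp(n, rate = 1)            # exponential sample, same coefficient of variation
  ciN <- t.test(Xn)$conf.int
  ciE <- t.test(Xe)$conf.int
  if (ciN[1] <= 1 && 1 <= ciN[2]) coverNorm <- coverNorm + 1
  if (ciE[1] <= 1 && 1 <= ciE[2]) coverExp <- coverExp + 1
}
coverageNorm <- coverNorm/numRepl
coverageExp <- coverExp/numRepl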
In this section we have only considered confidence intervals for the expected value of a random variable. Different confidence intervals should be used, e.g., when we are faced with the task of estimating variances or probabilities. Another relevant and quite nontrivial case is the estimation of quantiles. Since this kind of application, as far as simulation is concerned, is more related to output analysis, we defer an extended treatment of confidence intervals to Section 7.3.
The basic hypothesis test one may wish to run concerns the expected value:

H_0: \mu = \mu_0 \qquad \text{against} \qquad H_a: \mu \neq \mu_0.
In the normal case, we rely on the distributional result of Eq. (4.3), where the unknown expected value μ is replaced by the hypothesized value μ0. This shows that, if the null hypothesis is indeed true, then

\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1}.
In other words, the standardized test statistic

T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}},
if the null hypothesis is true, has Student's t distribution with n − 1 degrees of freedom, and it should fall within bounds corresponding to its quantiles. If T falls outside that interval, there are two possible explanations: It may just be bad luck, or maybe the null hypothesis is wrong. We cannot be sure of either, and we may make two types of error: We may reject a true hypothesis, or we may accept a false one. The elementary approach is conservative and keeps the probability of rejecting a true null hypothesis under control. Therefore, we form a rejection region consisting of two tails,

\mathrm{RJ} = \left( -\infty, -t_{1-\alpha/2, n-1} \right) \cup \left( t_{1-\alpha/2, n-1}, +\infty \right),

and we reject the null hypothesis if the test statistic T falls within RJ. Here α plays the role of a significance level or, better said, the probability of rejecting the null hypothesis if it is true.6 The rejection region consists of a single tail if the alternative hypothesis is one-sided:
Note that if we increase α we enlarge the rejection region, which implies that, all other factors being equal, it is easier to reject, possibly making a mistake. Statistical folklore dictates that 5% is a sensible choice for the significance level, but this is at best a rule of thumb.
Hypothesis testing is easily accomplished by the t.test function:
> set.seed(55555)
> X <- rnorm(10,10,20)
> t.test(X,mu=20)

        One Sample t-test

data:  X
t = -1.4468, df = 9, p-value = 0.1819
alternative hypothesis: true mean is not equal to 20
95 percent confidence interval:
 -6.881385 25.909259
sample estimates:
mean of x
 9.513937
In the above snapshot we sample a normal distribution with expected value 10 and test the hypothesis that the expected value is 20. We notice that the confidence interval is rather large, reflecting the considerable variability due to the large standard deviation. By sheer luck, the sample mean is 9.513937, rather close to the true value of 10. Now, should we accept the null hypothesis or not? One trick is to observe that 20 is within the confidence interval, so that we cannot rule out this value. This way of thinking works here, but it should be discouraged, as it does not work well in general, especially with one-tailed testing.
The fundamental output of the procedure is the p-value, which is 0.1819. This is the probability of observing, by pure randomness, a test statistic at least as extreme as the one we actually obtained, assuming that the true expected value is 20. We may compare this p-value against a significance level of 5%: Since the p-value is larger, we do not feel at all confident in rejecting the null hypothesis. In this case, we do know that the null hypothesis is wrong, since the true expected value is 10, but from a practical point of view we do not have strong enough evidence. Also note that we cannot really say that we "accept" the null hypothesis; all we can say is that we "fail to reject" it.
We may easily change the alternative hypothesis and run one-tailed tests:
> t.test(X,mu=20,alternative='less')

        One Sample t-test

data:  X
t = -1.4468, df = 9, p-value = 0.09093
alternative hypothesis: true mean is less than 20
95 percent confidence interval:
     -Inf 22.7997
sample estimates:
mean of x
 9.513937

> t.test(X,mu=20,alternative='greater')

        One Sample t-test

data:  X
t = -1.4468, df = 9, p-value = 0.9091
alternative hypothesis: true mean is greater than 20
95 percent confidence interval:
 -3.771823       Inf
sample estimates:
mean of x
 9.513937
The newcomer is strongly invited to compare the two p-values and to interpret their difference.
We may run a host of hypothesis tests in R, including the two-sample t-test, which compares the means of two independent samples, and the paired t-test, which compares means on the basis of paired observations.
The second test is relevant, for instance, in evaluating the performance of two systems or two strategies on a set of simulated scenarios. In such a case, it is fair and makes sense to compare the results on the same set of scenarios and to check the differences; hence, the observations in the sample are not independent and should be paired to check if the observed difference is statistically significant.
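As a hedged illustration of this last point, the following sketch compares two hypothetical strategies on the same set of simulated scenarios; the variables costA and costB, as well as all the numbers, are made up for the example.

set.seed(55555)
commonShock <- rnorm(30)                    # scenario-specific randomness shared by both strategies
costA <- 10.0 + commonShock + rnorm(30, sd = 0.5)
costB <- 10.4 + commonShock + rnorm(30, sd = 0.5)
t.test(costA, costB, paired = TRUE)         # test on the pairwise differences
# t.test(costA, costB) would ignore the pairing and waste the common-scenario information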
Unlike trivial tests on the mean of a normal population, other testing procedures may rely on sophisticated distributional results, but they more or less follow the same drill:
We need not be concerned too much with the theoretical underpinnings of each testing procedure, but we should be aware of the underlying assumptions and the resulting limitations in terms of applicability; then, what we have to look for is the p-value.
Testing the significance of correlation is an important task in statistics, and it may also be a useful tool in preparing and analyzing Monte Carlo experiments. A starting point is the following formula for sample covariance:
S_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right)     (4.5)
where n is the sample size, i.e., the number of observed pairs (Xi, Yi). The above formula is a straightforward sample counterpart of covariance and can also be rewritten as follows:

S_{XY} = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i Y_i - n \bar{X}\, \bar{Y} \right).
In passing, we note that when estimating a large covariance matrix, plenty of data are needed to obtain reliable estimates. Quite often, such a luxury cannot be afforded, and this motivates the factor models discussed in Section 3.8.2.
To estimate the correlation coefficient ρXY between X and Y, we may just plug sample covariance SXY and sample standard deviations SX, SY into its definition, resulting in the sample coefficient of correlation, or sample correlation for short:
r_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}     (4.6)
The factors n − 1 in SXY, SX, and SY cancel each other, and it can be proved that −1 ≤ rXY ≤ + 1, just like its probabilistic counterpart ρXY.
We should always keep in mind that we are referring to Pearson’s correlation, whose limitations have been stressed in Section 3.4.1. Nevertheless, correlation analysis plays an important role in Monte Carlo models:
Whatever the reason why we are estimating a correlation, we should always wonder whether it is statistically significant. To see the point, imagine estimating the correlation when n = 2. Since a straight line can always be drawn through two points, the resulting sample correlation is always ±1. It is clear that the magnitude of the correlation should be checked against the sample size, and a simple strategy is to test the null hypothesis

H_0: \rho_{XY} = 0

against the alternative hypothesis

H_a: \rho_{XY} \neq 0.
However, we need a statistic whose distribution under the null hypothesis is fairly manageable. One useful result is that, if the sample is normal, the statistic

T = \frac{r_{XY} \sqrt{n-2}}{\sqrt{1 - r_{XY}^2}}
is approximately distributed as a t variable with n − 2 degrees of freedom, for a suitably large sample. This can be exploited to come up with correlation tests.
In R, testing correlations is accomplished by the function cor.test. The snapshot in Fig. 4.2 generates two vectors of independent normals, X1 and X2, to which a common factor is added. This has the effect of producing a correlation coefficient of 0.3262. However, the p-value is 0.1605, which makes the correlation not quite significant, due to the small sample size. The reader is urged to repeat the experiment with n = 2000 and check that now the correlation is significant.
FIGURE 4.2 Testing correlations.
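The script of Fig. 4.2 is not listed here; a minimal sketch of the kind of experiment described above could look as follows, where the sample size and the weight of the common factor are assumptions, so the correlation and p-value will not coincide with the numbers quoted in the text.

set.seed(55555)
n <- 20
common <- rnorm(n)          # common factor inducing positive correlation
X1 <- rnorm(n) + common
X2 <- rnorm(n) + common
cor(X1, X2)                 # sample correlation
cor.test(X1, X2)            # significance test; repeat with n = 2000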
Introductory treatments of inferential statistics focus on normal populations. In that case, the two parameters characterizing the distribution, μ and σ2, coincide with expected value, the first-order moment, and variance, the second-order central moment. Hence, students might believe that parameter estimation is just about calculating sample means and variances. It is easy to see that this is not the case.
Consider a uniform random variable X ~ U(a, b). We recall that

\mathrm{E}[X] = \frac{a+b}{2}, \qquad \mathrm{Var}(X) = \frac{(b-a)^2}{12}.
Clearly, the sample mean and the sample variance S2 do not provide us with direct estimates of the parameters a and b. However, we might consider the following way of transforming the sample statistics into estimates of parameters. If we substitute μ and σ2 with their estimates, we find

\bar{X} = \frac{a+b}{2}, \qquad S^2 = \frac{(b-a)^2}{12}.
Note that, in taking the square root of the variance, we should only consider the positive root, since standard deviation cannot be negative. Solving this system yields the following estimates:

\hat{a} = \bar{X} - \sqrt{3}\, S, \qquad \hat{b} = \bar{X} + \sqrt{3}\, S.
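A minimal R sketch of this moment-matching recipe, under the assumption that the true parameters are a = 2 and b = 5, is the following:

set.seed(55555)
X <- runif(1000, min = 2, max = 5)
m <- mean(X); s <- sd(X)
aHat <- m - sqrt(3)*s       # from solving the two moment equations
bHat <- m + sqrt(3)*s
c(aHat, bHat)               # should be reasonably close to (2, 5)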
This example suggests a general strategy to estimate parameters:
Indeed, this is the starting point of a general parameter estimation approach called method of moments. However, a more careful look at the example should raise an issue. Consider the order statistics of a random sample of n observations of a uniform distribution:

U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)}.
If we take the above approach, all of these observations play a role in estimating a and b. Yet, this is a bit counterintuitive; in order to characterize a uniform distribution, we need a lower and an upper bound on its realizations. So, it seems that only U(1) and U(n), i.e., the smallest and the largest observations, should play a role. We might suspect that there are alternative strategies for finding point estimators and then, possibly, confidence intervals. In this section we outline two approaches: the aforementioned method of moments and the method of maximum likelihood. But before doing so, since there are alternative ways to build estimators, it is just natural to wonder how we can compare them. Therefore, we should first list the desirable properties that make a good estimator.
We list here a few desirable properties of a point estimator for a parameter θ. When comparing alternative estimators, we may have to trade off one property for another. An estimator \hat{\theta} is unbiased if

\mathrm{E}[\hat{\theta}] = \theta.
It is easy to show that the sample mean is an unbiased estimator of the expected value. Biasedness is related to the expected value of an estimator, but what about its variance? Clearly, ceteris paribus, we would like to have an estimator with low variance.
DEFINITION 4.1 (Efficient unbiased estimator) An unbiased estimator \hat{\theta}_1 is more efficient than another unbiased estimator \hat{\theta}_2 if

\mathrm{Var}(\hat{\theta}_1) < \mathrm{Var}(\hat{\theta}_2).
Note that we must compare unbiased estimators in assessing efficiency. Otherwise, we could obtain a nice estimator with zero variance by just choosing an arbitrary constant. It can be shown that, under suitable hypotheses, the variance of certain estimators has a lower bound. This bound, known as the Cramér–Rao bound, is definitely beyond the scope of this book, but the message is clear: We cannot go below a minimal variance. If the variance of an estimator attains that lower bound, then it is efficient. Another desirable property is consistency. An estimator \hat{\theta}_n is consistent if \hat{\theta}_n \to \theta when the sample size increases, where the limit must be understood in some probabilistic sense. For instance, the statistic

\frac{1}{n} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2
is not an unbiased estimator of variance σ2, as we know that we should divide by n − 1 rather than n. However, it is a consistent estimator, since when n → ∞ there is no difference between dividing by n − 1 or n. Incidentally, by dividing by n we lose unbiasedness but we retain consistency while reducing variance. Sometimes, we may prefer a biased estimator, if its variance is low, provided that it is consistent.
We have already sketched an application of this method in Example 4.1. To state the approach in more generality, let us introduce the sample moment of order k:

M_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k.
The sample moment is the sample counterpart of moment mk = E[Xk]. Let us assume that we need an estimate of k parameters θ1, θ2, …, θk. The method of moments relies on the solution of the following system of equations:
In general, this is a system of nonlinear equations that may be difficult to solve, and we cannot be sure that a unique solution, if any, exists. However, assuming that there is in fact a unique solution, we may just replace the moments mk by the sample moments Mk, and solve the system to obtain the estimators \hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k.
Let (X1, X2, …, Xn) be a random sample from a normal distribution with unknown parameters (θ1, θ2) ≡ (μ, σ). The relationship between the parameters and the first two moments is m1 = μ, m2 = σ2 + μ2. Plugging sample moments and solving the system yields

\hat{\mu} = M_1 = \bar{X}, \qquad \hat{\sigma}^2 = M_2 - M_1^2 = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2.
These estimators look familiar enough, but note that we do not obtain an unbiased estimator of variance.
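As a further sketch of the recipe, consider matching the first two moments of a gamma distribution: since the mean is shape/rate and the variance is shape/rate^2, the moment equations can be solved in closed form. The true parameters below are arbitrary assumptions.

set.seed(55555)
X <- rgamma(1000, shape = 2, rate = 10)
m <- mean(X); v <- var(X)
rateHat <- m/v              # rate = mean/variance
shapeHat <- m^2/v           # shape = mean^2/variance
c(shapeHat, rateHat)        # rough estimates of (2, 10)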
The method of maximum likelihood is an alternative approach to find estimators in a systematic way. Imagine that a random variable X has a PDF characterized by a single parameter θ, denoted by fX(x; θ). If we draw a sample of n i.i.d. variables from this distribution, the joint density is just the product of the individual PDFs:

f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_X(x_i; \theta).
This is a function depending on the parameter θ. Also note that we use lowercase letters xi to point out that we are referring to a specific set of numerical realizations of the random variable X. If we are interested in estimating θ, given a sample Xi = xi, i = 1, …, n, we may swap the role of variables and parameters and build the likelihood function
L(\theta) \equiv L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i; \theta)     (4.7)
The shorthand notation L(θ) is used to emphasize that this is a function of the unknown parameter θ, for a given sample of observations. On the basis of this framework, intuition suggests that we should select the parameter yielding the largest value of the likelihood function. For a discrete random variable, the interpretation is more natural: We select the parameter maximizing the probability of what we have indeed observed. For a continuous random variable, we cannot really speak of probabilities, but the rationale behind the method of maximum likelihood should be apparent: We are just trying to find the best explanation of what we observe. The acronym MLE, for maximum-likelihood estimator or estimation, is used to refer to the approach.
Let us consider the PDF of an exponential random variable with parameter λ:

f_X(x; \lambda) = \lambda e^{-\lambda x}, \qquad x \ge 0.
Given the observed values x1, …, xn, the likelihood function is

L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n \exp\left( -\lambda \sum_{i=1}^{n} x_i \right).
Quite often, rather than attempting direct maximization of the likelihood function, it is convenient to maximize its logarithm, i.e., the log-likelihood function
\ell(\theta) \equiv \log L(\theta)     (4.8)
The rationale behind this function is easy to grasp, since by taking the logarithm of the likelihood function, we transform a product of functions into a sum of functions:

\log L(\lambda) = \sum_{i=1}^{n} \log\left( \lambda e^{-\lambda x_i} \right) = n \log\lambda - \lambda \sum_{i=1}^{n} x_i.
The first-order optimality condition yields

\frac{d \log L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \quad\Rightarrow\quad \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{X}}.
The result is rather intuitive, since E[X] = 1/λ.
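The closed-form solution can be double-checked numerically; the following sketch minimizes minus the log-likelihood of an exponential sample with optimize, the general-purpose route that remains available when no closed form exists. The sample size and rate are assumptions.

set.seed(55555)
X <- rexp(500, rate = 10)
1/mean(X)                                               # closed-form MLE
negLogLik <- function(lambda) -sum(dexp(X, rate = lambda, log = TRUE))
optimize(negLogLik, interval = c(0.01, 100))$minimum    # numerical MLE, essentially the same value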
The idea of MLE is immediately extended to the case requiring the estimation of multiple parameters: We just solve a maximization problem involving multiple variables.
Let us consider maximum-likelihood estimation of the parameters θ1 = μ and θ2 = σ2 of a normal distribution. Straightforward manipulation of the PDF yields the log-likelihood function:

\log L(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.
The first-order optimality condition with respect to μ is

\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0.
Hence, the maximum-likelihood estimator of μ is just the sample mean. If we apply the first-order condition with respect to σ2, plugging the estimator \hat{\mu} = \bar{X}, we obtain

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{X} \right)^2.
In the two examples above, maximum likelihood yields the same estimator as the method of moments. Then, one could well wonder whether there is really any difference between the two approaches. The next example provides us with a partial answer.
Let us consider a uniform distribution on the interval [0, θ]. On the basis of a random sample of n observations, we build the likelihood function

L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, \mathbf{1}_{\{x_i \le \theta\}} = \frac{1}{\theta^n}\, \mathbf{1}_{\{\max_i x_i \le \theta\}}.
This may look a bit weird at first sight, but it makes perfect sense, as the PDF is zero for x > θ. Indeed, θ cannot be smaller than the largest observation:

\theta \ge \max\{x_1, \ldots, x_n\} = x_{(n)}.
In this case, since there is a constraint on θ, we cannot just take the derivative of L(θ) and set it to 0 (first-order optimality condition). Nevertheless, it is easy to see that the likelihood is maximized by choosing the smallest θ subject to the constraint above. Hence, using the notation of order statistics, we find

\hat{\theta} = X_{(n)} = \max\{X_1, \ldots, X_n\}.
While the application of maximum likelihood to exponential and normal variables looks a bit dull, in the last case we start seeing something worth noting. The estimator is quite different from what we have obtained by using the method of moments. The most striking feature is that this estimator does not use the whole sample, but just a single observation. We leave it as an exercise for the reader to check that, in the case of a uniform distribution on the interval [a, b], maximum-likelihood estimation yields

\hat{a} = X_{(1)}, \qquad \hat{b} = X_{(n)}.
Indeed, the smallest and largest observations are sufficient statistics for the parameters a and b of a uniform distribution.7 We refrain from stating a formal definition of sufficient statistics, but the idea is rather intuitive. Given a random sample X, a sufficient statistic is a function T(X) that captures all of the information we need from the sample in order to estimate a parameter. As a further example, sample mean is a sufficient statistic for the parameter μ of a normal distribution. This concept has far-reaching implications in the theory of inferential statistics.
A last point concerns unbiasedness. In the first two examples, we have obtained biased estimators for variance, even though they are consistent. It can be shown that an unbiased estimator of θ for the uniform distribution on [0, θ] is
\hat{\theta} = \frac{n+1}{n}\, X_{(n)},     (4.9)
rather than X(n). Again, we see that maximum likelihood yields a less than ideal estimator, even though for n → ∞ there is no real issue. It turns out that maximum-likelihood estimators (MLEs) do have limitations, but a few significant advantages as well. Subject to some technical conditions, the following properties can be shown for MLEs:
As a general rule, finding MLEs requires the solution of an optimization problem by numerical methods, but there is an opportunity here. Whatever constraint we want to enforce on the parameters, depending on domain-specific knowledge, can be easily added, resulting in an optimization problem that can be solved numerically. Plenty of numerical optimization methods are available for this purpose.
R offers some tools to fit distributions by maximum likelihood, such as the function mle, provided by the stats4 package, and the function fitdistr, provided by the MASS package.
The former function leaves the task of specifying minus the log-likelihood function to the user and produces an object of an S4 class. The second one is a bit more user friendly. The use of these functions is best understood by a simple example related to a gamma distribution. The code in Fig. 4.3 produces the following (abridged) output:
FIGURE 4.3 Fitting a gamma distribution.
> fit1@coef
    shape      rate
 2.233724 11.062195
> confint(fit1)
Profiling...
          2.5 %    97.5 %
shape  1.985328  2.503327
rate   9.684267 12.559251
> fit2$estimate
    shape      rate
 2.233726 11.062211
> fit2$sd
    shape      rate
0.1320868 0.7331218
We notice a slight difference in the estimates of shape and rate parameters, due to the optimization algorithms embedded in the two functions. In the case of mle, we obtain an S4 object; this is why we have to access its properties by fit1@coef rather than by fit1$coef. We also select an optimization method, L-BFGS-B, so that we may enforce lower bounds on the parameters (otherwise, warnings may be obtained if they get negative in the search process). Also notice the use of a named list to provide initial values for the numerical optimization algorithm, as usual in nonlinear programming. In both cases we have some information on the reliability of the estimate, either via a confidence interval or standard errors.
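The script of Fig. 4.3 is not reproduced here; the following sketch shows the kind of code involved, where the sample size, the starting values, and the lower bounds are assumptions, so the numbers will differ slightly from the output above.

library(stats4)    # provides mle
library(MASS)      # provides fitdistr
set.seed(55555)
X <- rgamma(500, shape = 2, rate = 10)
# mle requires minus the log-likelihood and named starting values
negLogLik <- function(shape, rate) -sum(dgamma(X, shape, rate, log = TRUE))
fit1 <- mle(negLogLik, start = list(shape = 1, rate = 1),
            method = "L-BFGS-B", lower = c(0.01, 0.01))
fit1@coef
confint(fit1)
# fitdistr only needs the distribution name
fit2 <- fitdistr(X, "gamma")
fit2$estimate
fit2$sd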
So far, we have been concerned with parameters of probability distributions. We never questioned the fit of the distribution itself against empirical data. For instance, we might assume that a population is normally distributed, and we may estimate and test its expected value and variance. However, normality should not be taken for granted, just like any other claim about the underlying distribution. Sometimes, specific knowledge suggests strong reasons that justify the assumption; otherwise, this should be tested in some way. When we test whether experimental data fit a given probability distribution, we are not really testing a hypothesis about a parameter or two; in fact, we are running a nonparametric test.
In this section we illustrate three kinds of approach: graphical analysis, based on histograms and quantile–quantile plots; general-purpose goodness-of-fit tests, namely the chi-square and Kolmogorov–Smirnov tests; and tests specifically aimed at checking normality, such as the Shapiro–Wilk and Jarque–Bera tests.
All of the above procedures will be illustrated using R.
Before describing formal approaches, let us comment on intuitive and graphical checks. Clearly, visual inspection of a histogram and a comparison with the theoretical shape of a PDF may help. However, sometimes the judgement is not so easy to make. To see the point, let us generate a normal sample and look at the resulting histograms in Fig. 4.4:
FIGURE 4.4 Histograms of a normal sample.
> set.seed(55555)
> par(mfrow=c(1,2))
> X=rnorm(200,10,40)
> hist(X)
> hist(X,nclass=50)
> par(mfrow=c(1,1))
In the plot on the left of Fig. 4.4, there are very few bins, and the distribution might look compatible with a normal, but with other densities as well. If we increase the resolution by using more bins, sampling variability makes the histogram too bumpy to be of any use. By the way, these observations also apply to the chi-square test, which formalizes the intuition of comparing histograms. Sometimes, checking the CDF is a better idea. In fact, another graphical tool relies on plotting the sample quantiles against the theoretical ones, obtaining a Q-Q plot. Using the above sample, we produce the plot in Fig. 4.5 as follows:
FIGURE 4.5 Checking the fit of a normal distribution.
> qqnorm(X)
> qqline(X)
In principle, the points should lie on a straight line, since the sample is in fact normally distributed. This is indeed the case, with the exception of tails, which is to be expected after all as they are always critical.
The idea behind the chi-square test is fairly intuitive and basically relies on a relative frequency histogram, although the technicalities do require some care. The first step is to divide the range of possible observed values into J disjoint intervals, corresponding to bins of a frequency histogram. Given a probability distribution, we can compute the probability pj, j = 1, …, J, that a random variable distributed according to that distribution falls in each bin. If we have n observations, the number of observations that should fall in interval j, if the assumed distribution is indeed the true one, is Ej = npj. This number should be compared against the number Oj of observations that actually fall in interval j; a large discrepancy would suggest that the hypothesis about the underlying distribution should be rejected. Just like any statistical test, the chi-square test relies on a distributional property of a statistic. It can be shown that, for a large sample size, the statistic

\chi^2 = \sum_{j=1}^{J} \frac{(O_j - E_j)^2}{E_j}
has (approximately) a chi-square distribution. We should reject the hypothesis if χ2 is too large, i.e., if χ2 > χ21−α,m, the quantile at probability level 1 − α of a chi-square distribution with m degrees of freedom.
What we are missing here is m, which depends on the number of parameters of the distribution that we have estimated using the data. If no parameter has been estimated, i.e., if we have assumed a specific parameterized distribution prior to observing data, the degrees of freedom are J − 1; if we have estimated p parameters, we should use J − p − 1.
R offers a chisq.test function that, among other things, can be used to check the fit of a continuous distribution. To this aim, we have to bin data and compute observed and theoretical frequencies:
set.seed(55555)
shape <- 2
rate <- 10
X <- rgamma(500, shape, rate)
breakPts <- c(seq(0,0.8,0.1), 2)
probs <- pgamma(breakPts,shape,rate)
binProbs <- diff(probs)
f.obs <- hist(X, breakPts, plot = FALSE)$counts
chisq.test(x=f.obs, p=binProbs/sum(binProbs))
Note that we divide the vector binProbs by its sum to make sure that probabilities add up to 1. We also use hist as a convenient tool to bin data. The above code results in the following output:
        Chi-squared test for given probabilities

data:  f.obs
X-squared = 4.4634, df = 8, p-value = 0.8131

Warning message:
In chisq.test(x = f.obs, p = binProbs/sum(binProbs)) :
  Chi-squared approximation may be incorrect
As expected, we cannot reject the null hypothesis, but we get a warning message. Indeed, the idea of the chi-square test is pretty intuitive; however, it relies on approximated distributional results that may be critical. Another tricky point is that the result of the test may depend on the number and placement of bins. Rules of thumb have been proposed and are typically embedded in good statistical software.
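Note that chisq.test assumes that no parameter has been estimated from the data, which is the case above, since shape and rate were given. If p parameters had been estimated, a sketch of the adjustment, reusing f.obs and binProbs from the snippet above, would recompute the p-value with J − p − 1 degrees of freedom:

J <- length(f.obs)                                 # number of bins
Ej <- sum(f.obs)*binProbs/sum(binProbs)            # expected counts
chi2 <- sum((f.obs - Ej)^2/Ej)                     # same statistic computed by chisq.test
pchisq(chi2, df = J - 2 - 1, lower.tail = FALSE)   # p-value assuming p = 2 estimated parameters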
Continuing the above example, where the variable X stores a sample from a gamma distribution with given shape and rate parameters, it is very easy to run the Kolmogorov–Smirnov test in R:
> ks.test(X, pgamma, shape, rate)

        One-sample Kolmogorov-Smirnov test

data:  X
D = 0.0447, p-value = 0.2704
alternative hypothesis: two-sided
The Kolmogorov–Smirnov test, unlike the chi-square test, relies on the CDF. Indeed, as you may see above, we have to provide ks.test with a function computing cumulative probabilities, pgamma in our case, along with all of the relevant parameters.
The idea behind the test relies on a comparison between the assumed CDF FX(x) and an empirical CDF defined as follows:

F_n(x) = \frac{\#\{i : X_i \le x\}}{n},
where #{…} denotes the cardinality of a set. The distance is measured in a worst-case sense:8

D = \sup_{x} \left| F_n(x) - F_X(x) \right|.
From a computational point of view, this distance can be calculated as

D = \max\{ D^+, D^- \},

where

D^+ = \max_{i=1,\ldots,n} \left\{ \frac{i}{n} - F_X\!\left( X_{(i)} \right) \right\}, \qquad D^- = \max_{i=1,\ldots,n} \left\{ F_X\!\left( X_{(i)} \right) - \frac{i-1}{n} \right\},
and X(i) is the ith order statistic (obtained by sorting values Xi in increasing order). It is clear that a large value D does not support the assumed distribution. What is not so clear is how to devise a rejection region and find the critical values defining its boundary. This depends on the underlying distribution, and Monte Carlo sampling is often used to find such values.
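Reusing X, shape, and rate from the gamma example above, the distance can be computed by hand and checked against the D reported by ks.test; this is just an illustrative sketch of the formulas.

Xs <- sort(X)                           # order statistics X_(i)
n <- length(Xs)
Fx <- pgamma(Xs, shape, rate)           # theoretical CDF at the order statistics
Dplus <- max((1:n)/n - Fx)
Dminus <- max(Fx - (0:(n-1))/n)
max(Dplus, Dminus)                      # matches ks.test(X, pgamma, shape, rate)$statistic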
The R core includes the Shapiro–Wilk test, implemented in the shapiro.test function:
> set.seed(55555)
> X=rnorm(20,10,20)
> shapiro.test(X)

        Shapiro-Wilk normality test

data:  X
W = 0.9512, p-value = 0.385
In this case, the function returns a rather large p-value, so that we cannot reject the null hypothesis that the distribution is normal. Since normality plays the role of the null hypothesis, we get a conservative test: Strong evidence is needed to reject normality. Let us try an exponential and a gamma distribution:
> X=rexp(20,10)
> shapiro.test(X)

        Shapiro-Wilk normality test

data:  X
W = 0.9006, p-value = 0.04237

> X=rgamma(20,2,10)
> shapiro.test(X)

        Shapiro-Wilk normality test

data:  X
W = 0.623, p-value = 5.14e-06
Evidence against normality is reasonably strong in the former case, but not as strong as in the latter. The test statistic takes values between 0 and 1, and it compares two estimates of variance; when the statistic is close to 1, we cannot reject normality. In the Jarque–Bera test we take advantage of third- and fourth-order moments, and of the fact that skewness is zero and kurtosis is 3 for a normal distribution. The test statistic is

\mathrm{JB} = \frac{n}{6} \left( \hat{\gamma}^2 + \frac{(\hat{\kappa} - 3)^2}{4} \right),
where \hat{\gamma} and \hat{\kappa} are estimates of skewness and kurtosis, respectively. Clearly, the test statistic is nonnegative and, under the null hypothesis that the underlying distribution is normal, it should be close to zero. For a large sample, the test statistic is approximately a chi-square variable with two degrees of freedom, which can be used to define a rejection region for the test. However, since this approximation is not quite robust for small and medium size samples, critical values can be obtained by Monte Carlo sampling.
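No R snippet for the Jarque–Bera test is given in the text; the following sketch computes the statistic by hand on an assumed normal sample and uses the large-sample chi-square approximation with two degrees of freedom.

set.seed(55555)
X <- rnorm(1000)
n <- length(X)
m <- mean(X)
s <- sqrt(mean((X - m)^2))                 # standard deviation estimate (divide by n)
skew <- mean((X - m)^3)/s^3                # sample skewness
kurt <- mean((X - m)^4)/s^4                # sample kurtosis
JB <- n/6*(skew^2 + (kurt - 3)^2/4)
pchisq(JB, df = 2, lower.tail = FALSE)     # approximate p-value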
Under suitable assumptions, multiple linear regression models may be estimated by ordinary least squares (OLS), which is accomplished in R by the lm function. This is illustrated by the code in Fig. 4.6, which produces the following output:
FIGURE 4.6 Fitting a multiple linear regression model.
Call:
lm(formula = Y ~ X1 + X2)

Residuals:
    Min      1Q  Median      3Q     Max
-12.396  -9.859  -6.075   8.818  29.831

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.6349    11.0946   1.950 0.092162 .
X1            4.7902     1.7912   2.674 0.031802 *
X2            2.7463     0.4732   5.803 0.000661 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16 on 7 degrees of freedom
Multiple R-squared:  0.8727,    Adjusted R-squared:  0.8363
F-statistic: 23.98 on 2 and 7 DF,  p-value: 0.000737
We obtain parameter estimates, along with an assessment of their significance. The coefficient of X1 turns out to be significant, despite the high level of noise. The coefficient of determination R2 is 0.8727, and the adjusted R2 is 0.8363. We also obtain an overall assessment of the model in the p-value of the F-statistic. The reader is invited to check the effect of reducing the coefficient multiplying X1 in the data generation process.
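The script of Fig. 4.6 is not reproduced here; the following sketch shows the kind of data generation and fitting involved, where the true coefficients, the sample size, and the noise level are assumptions, so the output will not match the summary above. The name lmod1 is kept because it is used in the next snippet.

set.seed(55555)
n <- 10
X1 <- rnorm(n, mean = 5, sd = 2)
X2 <- runif(n, 0, 30)
Y <- 20 + 5*X1 + 3*X2 + rnorm(n, sd = 15)   # deliberately large noise
lmod1 <- lm(Y ~ X1 + X2)
summary(lmod1)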
We obtain confidence intervals for the estimated coefficients, and we may also test the normality and the lack of autocorrelation of the residuals:
> confint(lmod1)
                 2.5 %    97.5 %
(Intercept) -4.5996144 47.869439
X1           0.5546597  9.025779
X2           1.6272928  3.865392
> shapiro.test(lmod1$resid)

        Shapiro-Wilk normality test

data:  lmod1$resid
W = 0.8294, p-value = 0.03287

> library(lmtest)
> dwtest(lmod1)

        Durbin-Watson test

data:  lmod1
DW = 2.5181, p-value = 0.5605
alternative hypothesis: true autocorrelation is greater than 0
Using lm requires the specification of a formula. In the case above, Y~X1+X2 is self-explanatory, but formulas in R are much more powerful than that. Say that we want to fit a polynomial regression model of order 2:

Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon.
Of course, we can just evaluate each power of the explanatory variable and run a linear regression (the model is nonlinear in variables, but linear in parameters). However, the code in Fig. 4.7 shows a shortcut: by setting raw=TRUE, we force R to use familiar polynomials rather than orthogonal ones.9 We obtain the following output:
FIGURE 4.7 Fitting a polynomial regression model.
Call:
lm(formula = Y ~ poly(X, 2, raw = TRUE))

Residuals:
    Min      1Q  Median      3Q     Max
-13.673  -9.653  -5.712   8.714  29.547

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              290.82646    7.53560   38.59 2.04e-09 ***
poly(X, 2, raw = TRUE)1    4.73459    0.46672   10.14 1.95e-05 ***
poly(X, 2, raw = TRUE)2    0.49706    0.04724   10.52 1.53e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.02 on 7 degrees of freedom
Multiple R-squared:  0.9664,    Adjusted R-squared:  0.9568
F-statistic: 100.7 on 2 and 7 DF,  p-value: 6.951e-06
There are cases in which nonlinearity should be used to capture the interaction between variables. Consider the model

Y = \beta_0 + \beta_1 X + \beta_2 \delta + \beta_3 \delta X + \epsilon,
where δ ∈ {0, 1} is a dummy variable modeling a categorical variable. It is worth noting that when δ = 0, the data generating process is

Y = \beta_0 + \beta_1 X + \epsilon,
whereas, when δ = 1, both intercept and slope change:

Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X + \epsilon.
The code in Fig. 4.8 illustrates such a case. The input data are depicted in Fig. 4.9, along with the fitted lines resulting from setting δ = 0 and δ = 1, respectively. We can see clearly the difference in intercept and slope between the two groups.
FIGURE 4.8 Fitting a regression model with interactions.
FIGURE 4.9 A regression model with interactions.
The following model summary shows that, when using the formula Y~X*Z, interactions are automatically accounted for:
Call:
lm(formula = Y ~ X * Z)

Residuals:
     Min       1Q   Median       3Q      Max
-27.0281  -5.4642  -0.1196   5.5480  22.4519

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.6160     3.5574   8.325 2.62e-12 ***
X             4.8991     0.1135  43.147  < 2e-16 ***
Z            38.6779     4.6058   8.398 1.90e-12 ***
X:Z          -2.8403     0.1584 -17.928  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.04 on 76 degrees of freedom
Multiple R-squared:  0.9732,    Adjusted R-squared:  0.9722
F-statistic: 921.3 on 3 and 76 DF,  p-value: < 2.2e-16
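The script of Fig. 4.8 is not listed here; a sketch of the kind of code involved follows, where the true coefficients and the sample size are assumptions. The formula Y ~ X*Z automatically expands to an intercept, X, Z, and the interaction X:Z.

set.seed(55555)
n <- 80
X <- runif(n, 0, 40)
Z <- rep(c(0, 1), each = n/2)                    # dummy variable for the two groups
Y <- 30 + 5*X + 40*Z - 3*X*Z + rnorm(n, sd = 10)
lmod2 <- lm(Y ~ X*Z)
summary(lmod2)
plot(X, Y, pch = ifelse(Z == 0, 1, 16))          # the two groups, as in Fig. 4.9
abline(coef(lmod2)[1], coef(lmod2)[2])                          # fitted line for Z = 0
abline(coef(lmod2)[1] + coef(lmod2)[3],
       coef(lmod2)[2] + coef(lmod2)[4], lty = 2)                # fitted line for Z = 1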
Fitting time series models requires the use of possibly sophisticated procedures, whose first step is selecting the type and order of the model. This model identification step can be accomplished using the concepts that we have outlined in Section 3.6. Given a model structure, estimating its parameters calls for the application, e.g., of maximum-likelihood methods. In this section we just simulate a couple of models and check whether we are able to recover their underlying structure, on the basis of the resulting sample paths, by using R functions.
To simulate an ARIMA model, we may write our own code, as we did in Section 3.6. For the sake of convenience, here we use the function arima.sim, which simulates an ARIMA model and accepts, among other things, a list model collecting the AR and MA coefficients, the number n of observations to simulate, and, optionally, the standard deviation of the innovations.
Then, we estimate the model using two functions: arima, included in the standard R distribution, which requires the model order, and auto.arima, provided by the forecast package, which also tries to select the model order automatically.
The code in Fig. 4.10 produces the following (edited) output:
FIGURE 4.10 Simulating and fitting an ARMA(2,2) model.
> arima(x1,order=c(2,0,2))
Series: x1
ARIMA(2,0,2) with non-zero mean

Coefficients:
         ar1      ar2      ma1     ma2  intercept
      0.6477  -0.3853  -0.0880  0.3118    -0.0105
s.e.  0.1364   0.0652   0.1362  0.0517     0.0295

sigma^2 estimated as 0.4749:  log likelihood=-1570.18
AIC=3152.35   AICc=3152.41   BIC=3184.23

> auto.arima(x1)
Series: x1
ARIMA(2,0,2) with zero mean

Coefficients:
         ar1      ar2      ma1     ma2
      0.6474  -0.3849  -0.0875  0.3117
s.e.  0.1363   0.0652   0.1361  0.0517

sigma^2 estimated as 0.475:  log likelihood=-1570.24
AIC=3150.48   AICc=3150.52   BIC=3177.05

> arima(x2, order=c(2,0,2))
Series: x2
ARIMA(2,0,2) with non-zero mean

Coefficients:
         ar1      ar2      ma1     ma2  intercept
      0.9599  -0.5477  -0.3572  0.2970    -0.0076
s.e.  0.0738   0.0425   0.0772  0.0439     0.0282

sigma^2 estimated as 0.4677:  log likelihood=-1558.8
AIC=3129.6   AICc=3129.66   BIC=3161.48

> auto.arima(x2)
Series: x2
ARIMA(2,0,2) with zero mean

Coefficients:
         ar1      ar2      ma1     ma2
      0.9599  -0.5477  -0.3573  0.2970
s.e.  0.0738   0.0425   0.0772  0.0439

sigma^2 estimated as 0.4678:  log likelihood=-1558.84
AIC=3127.68   AICc=3127.72   BIC=3154.24
Note that arima needs a specification of the model order; in this case we cheat and use the actual order used in the simulation, i.e., (2, 0, 2), where there is no integration component. The alternative function auto.arima is able to find the correct order in this example, but this is not the case in general. We also note that, despite the fairly long simulated horizon, different sample paths may lead to rather different estimates of the coefficients. This illustrates the difficulties in fitting time series models, but we should consider that the main purpose here is to capture some essential features of the original time series, in order to produce realistic scenarios for a Monte Carlo simulation. The code of Fig. 4.10 also uses the function ts.plot to plot the first sample path, along with a zoomed portion of the initial time interval; see the result in Fig. 4.11.
FIGURE 4.11 Sample path produced by arima.sim; the plot on the right zooms over the initial portion of the path.
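The script of Fig. 4.10 is not reproduced here; the following sketch simulates and fits two ARMA(2,2) paths in the same spirit, where the coefficients, the innovation variance, and the sample size are assumptions, so the estimates will not coincide with the output above.

library(forecast)       # provides auto.arima
set.seed(55555)
model <- list(ar = c(0.7, -0.4), ma = c(-0.1, 0.3))   # assumed ARMA(2,2) coefficients
x1 <- arima.sim(model, n = 1500)
x2 <- arima.sim(model, n = 1500)
arima(x1, order = c(2, 0, 2))   # order supplied by us
auto.arima(x1)                  # order selected automatically
arima(x2, order = c(2, 0, 2))
auto.arima(x2)
ts.plot(x1)                     # sample path, as in Fig. 4.11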
So far, we have adopted a rather standard view of parameter estimation, as we have followed the orthodox approach: Parameters are unknown numbers, which we try to estimate by squeezing information out of a random sample, in the form of point estimators and confidence intervals. We also insisted on the fact that, given a computed confidence interval with some confidence level (1 − α), we cannot say that the true parameter is contained there with probability (1 − α). This statement makes no sense, since we are only comparing known numbers (the realized bounds of the confidence interval) and an unknown number (the parameter), but no random variable is involved. So, there is no such thing as "probabilistic knowledge" about parameters, and data are the only source of information; any other knowledge, objective or subjective, is disregarded. The following example illustrates the potential difficulties induced by this view.10
Let X be a uniformly distributed random variable, and let us assume that we do not know where the support of its distribution is located, but we know that its width is 1. Then, X ~ U(μ − 0.5, μ + 0.5), where μ is the unknown expected value of X, as well as the midpoint of the support. In order to estimate μ, let us take a sample of only n = 2 independent realizations X1 and X2 of the random variable. Now, let us consider the order statistics

X_{(1)} = \min\{X_1, X_2\}, \qquad X_{(2)} = \max\{X_1, X_2\},
and the confidence interval

I = \left( X_{(1)}, X_{(2)} \right).     (4.10)
Given this way of building a confidence interval, what is the confidence level of I, i.e., the probability P{μ ∈ I}? Once again, note that we are not asking this question by referring to a specific interval after sampling, but to a random interval before sampling. Because of symmetry, the two observations have a probability 0.5 of falling to the left or to the right of μ. The confidence interval will not contain μ if both of them fall on the same half of the support. Then, since X1 and X2 are independent, we have

P\{X_1 < \mu, X_2 < \mu\} + P\{X_1 > \mu, X_2 > \mu\} = 0.25 + 0.25 = 0.5.
Therefore, the confidence level for I is the complement of this probability, i.e., 50%. Now, let us suppose that we observe X1 = 0 and X2 = 0.6. What is the probability that μ is included in the confidence interval

(0, 0.6),
resulting from Eq. (4.10), i.e., P{0 ≤ μ ≤ 0.6}? In general, this question does not make any sense, since μ is a number. But in this specific case, we have some additional knowledge leading to the conclusion that the expected value is certainly included in the interval [0, 0.6]. In fact, if X(1) = 0, we may conclude μ ≤ 0.5; by the same token, if X(2) = 0.6, we may conclude μ ≥ 0.1. Since the confidence interval
includes the interval [0.1, 0.5], we would have good reasons to claim something like P{0 ≤ μ ≤ 0.6} = 1. But again, this makes no sense in the orthodox framework. By a similar token, if we get X1 = 0 and X2 = 0.001, we would be tempted to say that such a small interval is quite unlikely to include μ. However, within the framework of orthodox statistics, there is no way in which we can express this intuition properly.
On one hand, the example illustrates the need to make our background knowledge explicit: The Bayesian framework is an alternative to the orthodox approach, where it can be argued that unconditional probabilities do not exist, in the sense that probabilities are always conditional on some background knowledge and assumptions. On the other hand, we see the need of a way to express subjective views, which may be revised after collecting empirical data. Bayesian estimation has been proposed to cope with such issues as well.
Consider the problem of estimating a parameter θ, which characterizes the probability distribution of a random variable X. We have some prior information about θ that we would like to express in a sensible way. We might assume that the unknown parameter lies anywhere in the unit interval [0, 1], or we might assume that it is close to some number μ, but we are somewhat uncertain about it. Such knowledge, or such a subjective view, may be expressed by a probability density p(θ), which is called the prior distribution of θ. In the first case, we might associate a uniform distribution with θ; in the second case the prior could be a normal distribution with expected value μ and variance σ2. Note that this is the variance that we associate with the parameter, which is itself a random variable, rather than a number, and not the variance of the random variable X itself.
In Bayesian estimation, the prior is merged with experimental evidence by using Bayes' theorem. Experimental evidence consists of independent observations X1, …, Xn from the unknown distribution. Here and in the following we mostly assume that the random variable X is continuous, and we speak of densities; the case of discrete random variables and probability mass functions is similar. We also assume that the values of the parameter θ are not restricted to a discrete set, so that the prior is a density as well. Hence, let us denote the density of X by f(x | θ), to emphasize its dependence on the parameter θ. Since a random sample consists of a sequence of independent random variables, their joint distribution, conditional on θ, is

f_n(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta).
The conditional density fn(x1, …, xn | θ) is also called the likelihood function, as it is related to the likelihood of observing the data values x1, …, xn, given the value of the parameter θ; also notice the similarity with the likelihood function in maximum-likelihood estimation.11 Note that what really matters here is that the observed random variables X1, …, Xn are independent conditionally on θ. Since we are speaking about n + 1 random variables, we could also consider the joint density

f(x_1, \ldots, x_n, \theta) = f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta),
but this will not be really necessary for what follows. Given the joint conditional distribution fn(x1, …, xn | θ) and the prior p(θ), we can find the marginal density of X1, …, Xn by applying the total probability theorem:

g_n(x_1, \ldots, x_n) = \int_{\Omega} f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta)\, d\theta,
where we integrate over the domain Ω on which θ is defined, i.e., the support of the prior distribution. Now what we need is to invert the conditioning, i.e., we would like the distribution of θ conditional on the observed values Xi = xi, i = 1, …, n, i.e.,

p_n(\theta \mid x_1, \ldots, x_n).
This posterior density should merge the prior and the density of observed data conditional on the parameter. This is obtained by applying Bayes' theorem to densities, which yields

p_n(\theta \mid x_1, \ldots, x_n) = \frac{f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta)}{g_n(x_1, \ldots, x_n)}.     (4.11)
Note that the posterior density involves a denominator term gn(x1, …, xn) that does not depend on θ. Its role is only to normalize the posterior distribution, so that its integral is 1. Sometimes, it might be convenient to rewrite Eq. (4.11) as
p_n(\theta \mid x_1, \ldots, x_n) \propto f_n(x_1, \ldots, x_n \mid \theta)\, p(\theta),     (4.12)
where the symbol ∝ means "proportional to." The denominator term does not affect the shape of the posterior, which encodes our uncertainty about the parameter θ, and it may be argued that it is not really essential.12 In plain English, Eq. (4.11) states that the posterior is proportional to the product of the likelihood function fn(x1, …, xn | θ) and the prior distribution p(θ):

posterior ∝ likelihood × prior.
What we are saying is that, given some prior knowledge about the parameter and the distribution of observations conditional on the parameter, we obtain an updated distribution of the parameter, conditional on the actually observed data.
We tend to take for granted that coins are fair, and that the probability of getting heads is 1/2. Let us consider flipping a possibly unfair coin, with an unknown probability θ of getting heads. In order to learn this unknown value, we flip the coin repeatedly, i.e., we run a sequence of independent Bernoulli trials with unknown parameter θ. If we do not know anything about the coin, we might just assume a uniform prior

p(\theta) = 1, \qquad \theta \in [0, 1].
If we flip the coin N times, we know that the probability fN(H | θ) of getting H heads, conditional on θ, is related to the binomial probability distribution:

f_N(H \mid \theta) \propto \theta^H (1 - \theta)^{N - H}.     (4.13)
This is our likelihood function. If we regard this expression as the probability of observing H heads, given θ, this should actually be the probability mass function of a binomial variable with parameters θ and N, but we are disregarding the binomial coefficient, which does not depend on θ and just normalizes the distribution. If we multiply this likelihood function by the prior, which is just 1, we obtain the posterior density for θ, given the number of observed heads:

p_N(\theta \mid H) \propto \theta^H (1 - \theta)^{N - H}.     (4.14)
Equations (4.13) and (4.14) look like the same thing, because we use a uniform prior, but they are very different in nature. Equation (4.14) gives the posterior density of θ, conditional on the fact that we observed H heads and N − H tails. If we look at it this way, we recognize the shape of a beta distribution, which is the density of a continuous random variable, the parameter θ, rather than the mass function of a discrete random variable, the number of heads H out of N flips. To normalize the posterior, we should multiply it by the appropriate value of the beta function.13 Again, this normalization factor does not depend on θ and can be disregarded.
The R code in Fig. 4.12 generates a sequence of 1000 unfair coin flips, with θ = 0.2, and successively updates and plots the posterior density, resulting in the plots of Fig. 4.13. There, we display posterior densities normalized in such a way that their maximum is 1, after flipping the coin N times and having observed H heads. In Fig. 4.13(a) we just see the uniform prior before any flip of the coin. In the simulated history, the first flip lands heads. After observing the first heads, we know for sure that θ ≠ 0; indeed, if θ were zero, we could not observe any heads. The posterior is now proportional to a triangle:

p_1(\theta \mid H = 1) \propto \theta.
FIGURE 4.12 Sampling a sequence of coin flips and updating the posterior.
FIGURE 4.13 Updating the posterior density in coin flipping.
This triangle is shown in Fig. 4.13(b). We observe another heads in the second flip, and so the updated posterior density is a portion of a parabola, as shown in Fig. 4.13(c):

p_2(\theta \mid H = 2) \propto \theta^2.
With respect to the triangle, the portion of parabola is more tilted toward θ = 1, because we observed two heads out of two flips. However, when we get tails at the third flip, we rule out θ = 1 as well. Proceeding this way, we get beta distributions concentrating around the true (unknown) value θ = 0.2.
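The script of Fig. 4.12 is not reproduced here; a sketch of the kind of updating it performs follows, where the grid resolution and the plotted values of N are assumptions, and the simulated history of flips will not reproduce the one described above.

set.seed(55555)
theta <- 0.2
flips <- runif(1000) < theta                       # TRUE = heads
thetaGrid <- seq(0, 1, length.out = 501)
par(mfrow = c(2, 2))
for (N in c(0, 1, 2, 1000)) {
  H <- sum(flips[seq_len(N)])
  post <- thetaGrid^H * (1 - thetaGrid)^(N - H)    # unnormalized posterior
  plot(thetaGrid, post/max(post), type = "l",
       xlab = expression(theta), ylab = "posterior (max = 1)",
       main = paste("N =", N, ", H =", H))
}
par(mfrow = c(1, 1))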
Armed with the posterior density, how can we build a Bayes' estimator for θ? Figure 4.13 would suggest taking the mode of the posterior, which would spare us the work of normalizing it. However, this need not be the most sensible choice. If we consider the expected value of the posterior distribution, we find

\hat{\theta} = \mathrm{E}[\theta \mid x_1, \ldots, x_n] = \int_{\Omega} \theta\, p_n(\theta \mid x_1, \ldots, x_n)\, d\theta.     (4.15)
There are different ways of framing the problem, which are a bit beyond the scope of this book,14 but one thing that we can immediately appreciate is the challenge we face. The estimator above involves what looks like an intimidating integral, but our task is even more difficult in practice, because finding the posterior density may be a challenging computational exercise as well. In fact, given a prior, there is no general way of finding a closed-form posterior; things can really get awkward when multiple parameters are involved. Moreover, there is no guarantee that the posterior distribution pn(θ | x1, …, xn) will belong to the same family as the prior p(θ). In general, numerical methods are needed in Bayesian computational statistics and, needless to say, Monte Carlo approaches play a key role here. However, there are some exceptions. A family of distributions is called a conjugate family of priors if, whenever the prior is in the family, the posterior is too. For instance, if we choose a beta density as the prior in coin flipping, the posterior is a beta density as well. The following example illustrates the idea further.
Consider a sample (X1, …, Xn) from a normal distribution with unknown expected value θ and known variance σ02. Then, given our knowledge about the multivariate normal, and taking advantage of independence among observations, we have the following likelihood function:

f_n(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, \sigma_0} \exp\left\{ -\frac{(x_i - \theta)^2}{2\sigma_0^2} \right\}.
Let us assume that the prior distribution of θ is normal, too, with expected value μ and variance σ2:

p(\theta) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left\{ -\frac{(\theta - \mu)^2}{2\sigma^2} \right\}.
To find the posterior, we may simplify our work by considering in each function only the part that involves θ, wrapping the rest within a proportionality constant. In more detail, the likelihood function can be written as

f_n(x_1, \ldots, x_n \mid \theta) \propto \exp\left\{ -\frac{1}{2\sigma_0^2} \sum_{i=1}^{n} (x_i - \theta)^2 \right\}.     (4.16)
Then, we may simplify the expression further, by observing that

\sum_{i=1}^{n} (x_i - \theta)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n (\bar{x} - \theta)^2,
where \bar{x} is the average of the xi, i = 1, …, n. Then, we may include the terms not depending on θ into the proportionality constant and rewrite Eq. (4.16) as

f_n(x_1, \ldots, x_n \mid \theta) \propto \exp\left\{ -\frac{n (\bar{x} - \theta)^2}{2\sigma_0^2} \right\}.     (4.17)
By a similar token, we may rewrite the prior as

p(\theta) \propto \exp\left\{ -\frac{(\theta - \mu)^2}{2\sigma^2} \right\}.     (4.18)
By multiplying Eqs. (4.17) and (4.18), we obtain the posterior
p_n(\theta \mid x_1, \ldots, x_n) \propto \exp\left\{ -\frac{n (\bar{x} - \theta)^2}{2\sigma_0^2} - \frac{(\theta - \mu)^2}{2\sigma^2} \right\}     (4.19)
Again, we should try to collect θ within one term; to this aim, we use a bit of tedious algebra and rewrite the argument of the exponential as follows:

-\frac{n (\bar{x} - \theta)^2}{2\sigma_0^2} - \frac{(\theta - \mu)^2}{2\sigma^2} = -\frac{(\theta - \nu)^2}{2\xi^2} + \text{terms not involving } \theta,     (4.20)
where
\nu = \frac{n \sigma^2 \bar{x} + \sigma_0^2 \mu}{n \sigma^2 + \sigma_0^2}, \qquad \xi^2 = \frac{\sigma_0^2 \sigma^2}{n \sigma^2 + \sigma_0^2}.     (4.21)
Finally, this leads us to
p_n(\theta \mid x_1, \ldots, x_n) \propto \exp\left\{ -\frac{(\theta - \nu)^2}{2\xi^2} \right\}.     (4.22)
Disregarding the normalization constant, we immediately recognize the familiar shape of a normal density, with expected value ν and variance ξ2. Then, given an observed sample mean \bar{x} and a prior expected value μ, Eq. (4.20) tells us that the Bayes' estimator of θ can be written as

\hat{\theta} = \nu = \frac{ \bar{x} \big/ (\sigma_0^2/n) + \mu \big/ \sigma^2 }{ 1 \big/ (\sigma_0^2/n) + 1 \big/ \sigma^2 }.     (4.23)
Equation (4.23) has a particularly nice and intuitive interpretation: The posterior estimate is a weighted average of the sample mean (the new evidence) and the prior μ, with weights that are inversely proportional to σ02/n, the variance of the sample mean, and σ2, the variance of the prior. The more reliable a term, the larger its weight in the average.
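A tiny numerical illustration of the update of Eqs. (4.21) and (4.23), with made-up data and prior parameters, is the following:

set.seed(55555)
sigma0 <- 2                                  # known standard deviation of X
X <- rnorm(50, mean = 4, sd = sigma0)        # observed sample
mu <- 0; sigma <- 1                          # prior: theta ~ N(mu, sigma^2)
n <- length(X); xbar <- mean(X)
xi2 <- 1/(n/sigma0^2 + 1/sigma^2)            # posterior variance
nu <- xi2*(n*xbar/sigma0^2 + mu/sigma^2)     # posterior mean, a weighted average
c(nu, sqrt(xi2))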
What may sound a little weird in the example is the assumption that the variance σ02 of X is known, but its expected value is not. Of course, one can extend the framework to cope with the estimation of multiple parameters. Furthermore, we have considered two lucky cases, coin flipping and the normal prior for a normal distribution, where the posterior is very easy to compute, as it preserves the original shape of the prior. In such cases we speak of conjugate priors. In other cases, we are not so lucky, and we need to compute a possibly tough integral to find the normalization constant making the posterior a density. Indeed, Bayesian approaches were not quite practical until computational tools were devised to overcome this difficulty by Markov chain Monte Carlo methods. These strategies are described in Chapter 14 and allow us to run a simulation even if we do not know a density exactly, but only a function proportional to it.
[1] C. Brooks. Introductory Econometrics for Finance (2nd ed.). Cambridge University Press, Cambridge, 2008.
[2] G. Casella and R.L. Berger. Statistical Inference (2nd ed.). Duxbury, Pacific Grove, CA, 2002.
[3] M.H. DeGroot. Probability and Statistics. Addison-Wesley, Reading, MA, 1975.
[4] M.H. DeGroot and M.J. Schervish. Probability and Statistics (3rd ed.). Addison-Wesley, Boston, 2002.
[5] N.R. Draper and H. Smith. Applied Regression Analysis (3rd ed.). Wiley, New York, 1998.
[6] I. Gilboa. Theory of Decision under Uncertainty. Cambridge University Press, New York, 2009.
[7] W.H. Greene. Econometric Analysis (7th ed.). Prentice Hall, Upper Saddle River, NJ, 2011.
[8] J.D. Hamilton. Time Series Analysis. Princeton University Press, Princeton, NJ, 1994.
[9] J.K. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press, Burlington, MA, 2011.
[10] A.M. Law and W.D. Kelton. Simulation Modeling and Analysis (3rd ed.). McGraw-Hill, New York, 1999.
1We defer some topics, such as the estimation of probabilities and quantiles to Chapter 7, as from a practical point of view they are more relevant there, even though, from a conceptual point of view, they are also related to the content of this chapter.
2Similar considerations apply to the previous chapter, of course.
3We illustrate the kind of mistakes that we are led to when forgetting the role of independence in Chapter 7.
4If we want to associate parameters with probabilistic statements, we have to take a Bayesian view; see Section 4.6.1.
5The coefficient of variation is the same in both cases. This is defined as σ/ |μ|, i.e., the ratio between standard deviation and the absolute value of the expected value.
6In fact, what is typically written in textbooks is a hybrid of different viewpoints about the nature of hypothesis testing, and this is why the terminology may sometimes sound weird.
7 See [4, p. 375] for a proof.
8 A similar approach is used in Chapter 9 to assess the star discrepancy of a numerical sequence.
9 For an application of polynomial regression to pricing American-style options by Monte Carlo methods, see Section 11.3.
10 This example is taken from [6, p. 45], which in turn refers back to [3].
11 See Section 4.2.3.
12This will play a major role in Chapter 14 on Markov chain Monte Carlo.
13See Section 3.2.4.
14One way to frame the problem is by introducing a loss function that accounts for the cost of a wrong estimate. It can be shown that Eq. (4.15) yields the optimal estimator when the loss function has a certain quadratic form.