Chapter Seven: Output Analysis

There are several reasons why we should carefully check the output of a Monte Carlo simulation:

1. Model validation, i.e., check that the model makes economic and/or financial sense, possibly with respect to empirical evidence.

2. Model verification, i.e., check that the computer code we have written implements the model, right or wrong, that we had in mind.

3. Statistical inference, i.e., check the accuracy of the estimates that we obtain by running the model implemented in the code.

In this chapter we deal with the last point. To motivate the development below, let us refer back to the queueing example that we introduced in Section 1.3.3.1, where Lindley’s recursion is used to estimate the average waiting time in an M/M/1 queue, i.e., a queue where the distributions of interarrival and service times are both memoryless, i.e., exponential.¹ For the sake of convenience, we report the R code in Fig. 7.1; the function has been slightly modified, in order to return the whole vector of recorded waiting times, rather than just its mean.

and 10,000 customers are simulated. Is this sample size enough for practical purposes? The following runs suggest some criticality, as the sample means show a considerable variability:

How can we quantify the accuracy of estimates for a given sample size?

How can we find the sample size to obtain a given degree of accuracy?

Statistical inference provides us with a useful tool to answer the first question, namely, confidence intervals; it is fairly easy to reverse-engineer a confidence interval in order to estimate the sample size, i.e., the number of replications, that we need to achieve a selected degree of accuracy. Confidence intervals are clearly related to parameter estimation, which has already been dealt with in Chapter 4. However, we want to show a few potential pitfalls and mistakes that are incurred if one applies those deceptively simple formulas without due care; we provide a couple of enlightening examples in Section 7.1. Then, in Section 7.2, we illustrate how sample sizes may be related with measures of estimate precision. Following common practice, we have introduced confidence intervals with reference to a mean (better said, expected value). However, what we have to estimate is not always an expected value. For instance, we often need estimates of probabilities or quantiles. It is true that a probability is the expected value of an indicator function, but since its value is constrained in the interval [0, 1], some care may be needed. As to quantiles, when we estimate quantiles for small or large probability levels, as is common in risk management, we may face a difficult task. We address the related issues in Section 7.3. The essential message of this chapter, which we summarize in Section 7.4, is both good and bad news: Monte Carlo methods enjoy considerable flexibility and generality, but they are inherently inefficient. By pointing this out, we lay down the motivation for variance reduction strategies and low-discrepancy sequences, which are dealt with in Chapters 8 and 9, respectively.

7.1 Pitfalls in output analysis

We know from Chapter 4 that, given a sample {X₁, X₂, …, X_n} of n i.i.d. variables, we may estimate sample mean and sample variance:

Given the purpose of this chapter, here we point out the dependence of the above statistics on the sample size n. Then, elementary inferential statistics suggests the confidence interval

at confidence level 1 − α/2, where z_1−α/2 is the quantile of the standard normal distribution corresponding to the selected confidence level. As we have already noted, we use the quantile of the normal distribution, since in Monte Carlo simulation the sample size is typically large enough to warrant its use rather than the quantile of the t distribution.

Then, it should be an easy matter to use R to find a confidence interval for the average waiting time in the M/M/1 queue. Given the variability that we have observed, we expect quite large confidence intervals:

We clearly notice a disturbing pattern: The extreme points of the confidence intervals do jump around, as we expected, but the confidence intervals look overly “confident,” as they are rather narrow and do not overlap with each other. In other words, it seems that we are overstating the precision of the estimate. In fact, what we are doing is dead wrong, and we are actually making three mistakes:

We are using a formula assuming independence of observations, whereas Lindley’s recursion clearly shows that the waiting time W_j depends on waiting time W_j−1, which makes intuitive sense: If we wait in a queue for a long time, the customer behind us will wait a for a long time, too.

The random variables are not independent, but they are not even identically distributed: The first waiting time is clearly zero, as the system is initially empty. There is an initial transient phase, and we should wait for the system to reach a steady state before starting to collect statistics. This is not too relevant for a single queue, but it is for a complex system of interacting queues.

We are using a confidence interval that is meant for normal samples, but there is no reason to believe that waiting times are normally distributed.

By far, the most serious mistake is the first one. Indeed, writing the confidence interval as in Eq. (7.1) is somewhat misleading, as it obscures the fact that it depends on the standard error of estimate, i.e., the standard deviation of the sample mean. The standard deviation of the sample mean can be written as

only under the assumption of independence (actually, lack of correlation is enough). Unfortunately, there is a strong positive correlation between consecutive waiting times, especially when the utilization rate is high; as a consequence, the familiar formula severely understates variance, and this leads to small, but wrong confidence intervals. To clearly see the point, let us write the variance of the average waiting time explicitly:

where σ² is the variance of each waiting time W_i, and ρ_ij is the correlation between the waiting times of customers i and j. We see that, when there is a significant positive correlation, the familiar formula σ²/n may grossly understate the variance of the sample mean. The strength of the correlation may be checked by the following snapshot:

Running this code produces the two autocorrelation functions² for waiting times, displayed in Fig. 7.2. The autorrelation is quite significant for a high utilization rate, whereas it fades away quicker when utilization rate is 50%. This explains why we cannot compute confidence intervals in a standard way. But then, what can we do? In queueing simulations the batch method is often used. The idea is to simulate batches of consecutive customers, collecting sample means. For instance, we may build batches of 10,000 customers and gather the batch means

The batch mean

₁ includes customers from 1 to 10,000, the batch mean

₂ includes customers from 10,001 to 20,000, etc. If batches are long enough, the batch means are practically uncorrelated; this can be checked by plotting the ACF. Furthermore, since they are sums of many waiting times, we can hope that their distribution will be closer to a normal, thus providing further justification for the use of standard confidence intervals. We should also mention that batching observations and calculating confidence intervals on the basis of batch means may also be useful in financial applications, when there are concerns about lack of normality in the observations.

7.1.1 BIAS AND DEPENDENCE ISSUES: A FINANCIAL EXAMPLE

The queueing example of the previous section is a nice and simple example to illustrate pitfalls in using standard confidence intervals without due care but, admittedly, is not quite interesting from a financial point of view. So, let us consider the pricing of an as-you-like-it option (also known as chooser option)³. The option is European-style and has maturity T₂. At time T₁ < T₂ you may choose if the option is actually a call or a put; the strike price K is fixed at time t = 0. Clearly, at time T₁ we should compare the values of the two options, put and call, and choose the more valuable one. This can be done by using the Black–Scholes–Merton (BSM) formula to evaluate the price of call and put options with initial underlying price S(T₁) and time to maturity T₂ − T₁. This means that, conditional on S₁ = S(T₁), we may find the exact expected value of the payoff at time T₂, under the risk-neutral probability. What’s more, it turns out that it is actually possible to price a chooser option exactly, at least in the BSM world. To begin with, let us find the critical price S₁^* for which we are indifferent between a call and a put option; to this aim, we just need to consider put–call parity at time t = T₁:⁴

1. A vanilla call option with maturity T₂ and strike K

2. A vanilla put option with maturity T₁ and strike K^*

If S₁ > K^*, then the put expires worthless and we keep the call. Therefore, at maturity, we will receive the payoff associated with the call, which indeed corresponds to the optimal choice if S₁ > K^*.

If S₁ < K^*, then the put is exercised and at time T₁ we have a position

equation

At time T₂ the position is worth

equation

Therefore, at maturity, we receive the payoff associated with the put, which indeed corresponds to the optimal choice if S₁ < K^*.

The code to price the chooser option exactly is given in Fig. 7.3 and relies on functions implementing the BSM formula for vanilla calls and puts. It is instructive to compare the price of the chooser option against the two vanilla options:

As expected, the possibility of choosing may be quite valuable. Even though it is easy to find the price of a simple chooser option, it is extremely instructive to write a pure sampling-based Monte Carlo code. In the present case, this is not quite trivial, as we are not only estimating things: We must make a decision at time T₁. This decision is similar to the early exercise decision we must make with American-style options. To get a feeling for the issues involved, let us consider the scenario tree in Fig. 7.4, where we should associate a price of the underlying asset with each node. Starting from the initial node, with price S₀, we generate four observations of price S(T₁), and for each of these, we sample three prices S(T₂). We have 4 × 3 = 12 scenarios, but they are tree-structured. We need this structure, because the decision at time T₁ (either we like the put or we like the call) must be the same for all scenarios departing from each node at time T₁. Without this structure, our decisions would be based on perfect foresight about the future price at time T₂. This nonanticipativity concept is fundamental in dynamic stochastic optimization and in pricing American-style options, as we shall see in Sections 10.5.2 and 11.3.

A crude Monte Carlo code to price the option is displayed in Fig. 7.5. Here NRepl1 is the number of scenarios that we branch at time T₁ and NRepl2 is the number of branches at time T₂, for each node at time T₁; hence, the overall number of scenarios is the product of NRepl1 and NRepl2. The vector DiscountedPayoffs has a size corresponding to the overall number of scenarios. For each node at T₁, which is generated as usual with geometric Brownian motion, we generate nodes at time T₂, and we compare the two estimates of expected payoff that we are going to earn if we choose the call or the put, respectively. Then, we select one of the two alternatives and we fill a corresponding block (of size NRepl2) in the vector of discounted payoffs. Finally, we compute sample mean and confidence intervals as usual.

Now, let us run the simulator a few times, comparing the confidence intervals that we obtain against the exact price (9.267982):

Since we are computing a 95% confidence interval, we would expect to miss the correct value 1 time out of 20 on the average. However, it seems that this happens a tad too often and that we are overstating the reliability of Monte Carlo estimates. Indeed, it is not just a matter of sampling errors, as there are a few things that are dead wrong (again!) in the code above:

The replications are not really independent: we use all of the replications branching from a node at time T₁ to decide for a call or a put.

There is an unclear source of bias in the estimates. On the one hand, there is a low bias since we are not really deciding optimally between the call and the put; on the other hand we are relying on a limited set of scenarios, rather than the true continuous distribution; thus, we are possibly facing less uncertainty in the sample than we would do in the true world.

Once again, we see that independence among replications is not to be taken for granted. Furthermore, we also observe that when a Monte Carlo simulation involves some decision making, bias issues may arise. This is quite relevant, e.g., when pricing American-style options, as we will see in Section 11.3.1.

7.2 Setting the number of replications

As we have mentioned, we may try to quantify the quality of our estimator by considering the expected value of the squared error of estimate:

where σ² may be estimated by the sample variance S²(n). Note that the above formula holds if we assume that the sample mean is an unbiased estimator of the parameter μ we are interested in. Clearly, increasing the number n of replications improves the estimate, but how can we reasonably set the value of n?

Suppose that we are interested in controlling the absolute error in such a way that, with probability (1 − α),

where β is the maximum acceptable tolerance. In fact, the confidence interval (7.1) is just built in such a way that

Hence, linking H to β, we should simply run replications until H is less than or equal to the tolerance β, so that the number n must satisfy

Actually, we are chasing our tail a bit here, since we cannot estimate the sample variance S²(n) until the number n has been set. One way out is to run a suitable number k of pilot replications, in order to come up with an estimate S²(k). Then we may apply Eq. (7.2) using S²(k) to determine n. After running the n replications, it is advisable to check that Eq. (7.2) holds with the new estimate S²(n). Alternatively, we may simply add replications, updating the sample variance, until the criterion is met; however, with this approach we do not control the amount of computation we are willing to spend.

holds with probability (1 − α), things are a little more involved. The difficulty is that we may run replications until the half-length H satisfies

but in this inequality we are using the estimate

(n) rather than the unknown parameter μ. Nevertheless, if the inequality above holds, we may write

where inequality (7.3) follows from the triangle inequality and Eq. (7.4) is obtained by a slight rearrangement. Therefore, we conclude that if we proceed without care, the actual relative error that we obtain is bounded by γ/(1 − γ), which is larger than the desired bound γ. Therefore, we should choose n such that the following tighter criterion is met:

Again, we should run a few pilot replications in order to get a preliminary estimate of the sample variance S²(n).

7.3 A world beyond averages

The standard form of a confidence interval for the mean of a normal population is a cornerstone of inferential statistics. The unfortunate side effect is the tendency to use it in settings for which the recipe is not quite appropriate. Consider the problem of estimating the probability

where X is a possibly high-dimensional vector of random variables. It is natural to sample i.i.d. observations X_i, i = 1, …, n, and collect the random variables

These are i.i.d. observations of a Bernoulli random variable with parameter π, and it is natural to build an estimator

We see that there is an obvious link between expected value and variance, and it would make little sense to use the usual sample variance S². In fact, building a confidence interval for π calls for same care. However, if the sample size n is large enough, we may take advantage of the fact that the binomial distribution is a sum of i.i.d. variables and tends to a normal distribution, thanks to the central limit theorem. In other words, the statistic

Hence, at the very least, we should use the standard confidence interval by using a sensible estimate of variance. More refined methods are available in the literature, but R solves the problem by offering an appropriate function, binom.test.

Another delicate estimation problem concerns quantiles. This is relevant, for instance, in the estimation of risk measures like value-at-risk, as we shall see in Chapter 13. In principle, the task is not too hard and can be carried out as follows:

1. Use Monte Carlo to generate the output sample {X₁, …, X_n}.

2. Build the empirical CDF

(7.8) equation

where 1_{·} is the indicator function for an event.

3. Find the generalized inverse (see below) of the empirical CDF and return the estimated quantile at probability level α

Since the empirical CDF is piecewise constant, with jumps of size 1/n, it cannot be inverted as usual, and we resort to the generalized inverse. In this case this is just accomplished by sorting the observations X_i, which yields the order statistics X_(i), and pick the observation in position

Alternative procedures can be devised, possibly based on linear interpolation, but the deep trouble is that if we are estimating extreme quantiles, for small or large values of α, we may get rather unreliable estimates. Hence, we need a way to build a confidence interval. The theory here is far from trivial (see the references at the end of the chapter), but there is a central limit theorem for quantiles, stating that the statistic

and f_X(·) is the PDF corresponding to the true CDF F_X,n(·). This suggests the confidence interval

The problem is that, clearly, f_X(q_α) itself must be estimated, and it can be very small, resulting in large confidence intervals. In Section 13.4 we shall see an example of how the difficulty can be circumvented by recasting the problem a bit and using variance reduction methods.

7.4 Good and bad news

Equation (7.1) shows that the half-length of the confidence interval is determined by the ratio

i.e., by the variability in the output performance measure and the sample size. This is both good and bad news:

On the one hand, we notice that no role is played by the problem dimensionality, i.e., by how many random variables are involved in the simulation or, if you prefer, by the dimension of the space in which we are computing an integral. Actually, the dimension does play a role in terms of computational effort, and it typically creeps into the variability σ itself. Nevertheless, this is certainly good news and it contributes to explain why Monte Carlo methods are widely used.

On the other hand, the rate of improvement in the quality of our estimate, i.e., the rate of decrease of the error, is something like O(1/√n). Unfortunately, the square root is a concave function, and an economist would say that it features diminishing marginal returns. In practice, this means that the larger the sample size, the better, but the rate of improvement is slower and slower as we keep adding observations.

To appreciate the latter point, let us assume that with a sample size n we have a certain precision, and that we want to improve that by one order of magnitude. For instance, we might find a confidence interval for an option price like

in which it seems that we got the first decimal digit right, but we also want to get the cents right. This means that we should divide the interval half-length by 10, which implies that we need 100 × n replications. Thus, a brute-force Monte Carlo simulation may take quite some amount of computation to yield an acceptable estimate. One way to overcome this issue is to adopt a clever sampling strategy in order to reduce the variance σ² of the observations themselves; the other one is to adopt a quasi–Monte Carlo approach based on low-discrepancy sequences. The next two chapters pursue these two strategies, respectively.

For further reading

An excellent, yet readable chapter on simulation output analysis can be found in [5].

For a higher level and more general treatment, you may refer to [1, Chapter 3].

For the estimation of probabilities, a standard reference is [2]; this topic and other things are covered in [3].

See [4] for issues related to quantile estimation.

References

1 S. Asmussen and P.W. Glynn. Stochastic Simulation: Algorithms and Analysis. Springer, New York, 2010.

2 C.J. Clopper and E.S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26:404–413, 1934.

3 W.J. Conover. Practical Nonparametric Statistics (3rd ed.). Wiley, Hoboken, NJ, 1999.

4 P.W. Glynn. Importance sampling for Monte Carlo estimation of quantiles. In Proceedings of the 2nd International Workshop on Mathematical Methods in Stochastic Simulation and Experimental Design, pp. 180–185, 1996. See http://www.Stanford.edu/~glynn/papers/1996/G96.html.

5 A.M. Law and W.D. Kelton. Simulation Modeling and Analysis (3rd ed.). McGraw-Hill, New York, 1999.

¹ The memoryless property of the exponential distribution is discussed in Section 3.2.2.