In Chapter 7 we have discussed output analysis, and we have seen that an obvious way to improve the accuracy of an estimate X̄(n) based on a sample of n replications Xi, i = 1, …, n, is to increase the sample size n, since Var(X̄(n)) = Var(Xi)/n. However, we have also seen that the width of a confidence interval, assuming that replications are genuinely i.i.d., decreases according to a square-root law involving √n
, which is rather bad news. Increasing the number of replications is less and less effective, and this brute force strategy may result in a remarkable computational burden. A less obvious strategy is to reduce Var(Xi). At first sight, this seems like cheating, as we have to change the estimator in some way, possibly introducing bias. The variance reduction strategies that we explore in this chapter aim at improving the efficiency of Monte Carlo methods, sometimes quite dramatically, without introducing any bias. This means that, given a required accuracy, we may reduce the computational burden needed to attain it; or, going the other way around, we may improve accuracy for a given computational budget. In Chapter 9 we also consider another strategy, which consists of using deterministic low-discrepancy sequences to drive sampling, rather than pseudorandom numbers.
Some variance reduction strategies are just technical tricks to improve sampling, and they can even be automated in software tools. Other strategies are based on general principles, but require a high degree of customization to work. In some cases, they replace part of the numerical integration process with a partially analytical integration. As the reader may imagine, the more sophisticated strategies require a lot of domain-specific knowledge, may be rather difficult to conceive, but are much more effective. We begin in Section 8.1 with antithetic sampling, which is arguably the simplest strategy and serves well to introduce the topic of this chapter. Antithetic sampling is related to common random numbers, the subject of Section 8.2, which is a bit different in nature, as it deals with differences of random variables; we will actually appreciate common random numbers in Chapter 12 on estimating sensitivities, but they are best introduced just after antithetic sampling. Antithetic sampling and common random numbers, just like low-discrepancy sequences, work on the input stream of numbers feeding the simulation. Control variates, on the contrary, are based on an augmented estimator, which incorporates some more analytical knowledge that we may have at our disposal. As we shall see in Section 8.3, control variates are quite akin to linear regression models, as they try to take advantage of correlation between random variables. Another helpful tool is the conditional variance formula; in Sections 8.4 and 8.5 we will show how it can be exploited in conditional Monte Carlo and stratified sampling, respectively. Finally, a change of probability measure is the principle behind the importance sampling approach introduced in Section 8.6. Intuitively, the idea is to make rare but significant events more likely.
It is our aim to introduce the above strategies by simple, yet concrete examples. Therefore, in this chapter we will use toy integration problems, estimating π and pricing a vanilla call option, for illustration purposes. In practice, we certainly need no Monte Carlo method for these problems, but it is useful to get a basic understanding of variance reduction strategies by trying them on elementary examples, where we may also afford the luxury of comparing numerical results against the exact ones. Later, we will tackle more interesting applications, mostly in Chapter 11 on option pricing; we will also see an application to risk measurement in Chapter 13. Extremely sophisticated customization may be needed to cope with these more realistic problems.
The antithetic sampling approach is very easy to apply and does not rely on any deep knowledge about the specific problem we are tackling. In crude Monte Carlo, we generate a sample consisting of independent observations. However, inducing some correlation in a clever way may be helpful. Consider a sequence of paired replications (X1(i), X2(i)), i = 1, …, n:
We allow for correlation within each pair, but observations are otherwise “horizontally” independent. This means that if we pick any element of a pair, it is independent of any element of the other pairs. Formally, Xj(i1) and Xk(i2) are independent however we choose j, k = 1, 2, provided i1 ≠ i2. Thus, the pair-averaged observations

X(i) = [X1(i) + X2(i)]/2,  i = 1, …, n,
are independent and can be used to build a confidence interval as usual. Note that we cannot build a confidence interval without averaging within pairs, as we do not require “vertical” independence. Let us consider the sample mean X̄(n) based on the pair-averaged observations X(i); its variance is given by

Var(X̄(n)) = [Var(Xj(i)) + Cov(X1(i), X2(i))]/(2n).   (8.1)
Exploiting independence among pairs and accounting for the covariance within each pair, we arrive at Eq. (8.1), which shows the potential role of correlation. In order to reduce the variance of the sample mean, we should take negatively correlated replications within each pair. Needless to say, it is not fair to claim that variance is reduced because of the factor 2n, since we have effectively doubled the sample size; we must compare Eq. (8.1) against the variance Var(Xi)/(2n) of an independent sample of size 2n. Now the issue is how we can induce negative correlation. Each observation Xj(i), j = 1, 2, is obtained by generating possibly many random variates, and all of them ultimately depend on a stream of uniform pseudorandom numbers. Hence, we may consider the idea of inducing a strong negative correlation between input streams of random numbers. This is easily achieved by using a random number sequence {Uk} for the first replication in each pair, and then {1 − Uk} in the second one. Since the input streams are negatively correlated, we hope that the output observations will be, too.
In Example 2.1 we used plain Monte Carlo integration to estimate

I = ∫0^1 e^x dx = e − 1 ≈ 1.7183.
A small sample consisting of only 100 observations, as expected, does not provide us with a reliable estimate:
> set.seed(55555)
> X <- exp(runif(100))
> T <- t.test(X)
> T$estimate
mean of x 
 1.717044 
> CI <- as.numeric(T$conf.int)
> CI
[1] 1.621329 1.812759
> CI[2]-CI[1]
[1] 0.1914295
> (CI[2]-CI[1])/2/(exp(1)-1)
[1] 0.05570375
Antithetic sampling is easily accomplished here: We must store random numbers and take their complements to 1. In order to have a fair comparison, we consider 50 antithetic pairs, which entails sampling 100 function values as before:
> set.seed(55555)
> U1 <- runif(50)
> U2 <- 1-U1
> X <- 0.5*(exp(U1)+exp(U2))
> T <- t.test(X)
> T$estimate
mean of x 
  1.71449 
> CI <- as.numeric(T$conf.int)
> CI
[1] 1.699722 1.729258
> CI[2]-CI[1]
[1] 0.02953576
> (CI[2]-CI[1])/2/(exp(1)-1)
[1] 0.008594562
Now the confidence interval is much narrower and, despite the limited sample size, the estimate is fairly reliable.
The antithetic sampling method looks quite easy to apply and, in the example above, it works pretty well. May we always expect a similar pattern? Unfortunately, the answer is no. The reason why the approach works so well in this example is twofold:
It is instructive to consider a simple example, showing that antithetic sampling may actually backfire and increase variance.
Consider the function h(x), defined as

h(x) = 2x,  for 0 ≤ x ≤ 1/2,
h(x) = 2(1 − x),  for 1/2 < x ≤ 1,

and suppose that we want to take a Monte Carlo approach to estimate

I = ∫0^1 h(x) dx.
The graph of the function we want to integrate is obviously a triangle with both base and height equal to 1; note that, unlike the exponential function of Example 8.1, this is not a monotone function of x. It is easy to compute the integral as the area of a triangle:

I = (1/2) · 1 · 1 = 1/2.
Now let

X̄I = [h(U1) + h(U2)]/2

be an estimator based on two independent uniform random variables U1 and U2, and let

X̄A = [h(U) + h(1 − U)]/2

be the pair-averaged estimator built by antithetic sampling. We may compare the variances of the two estimators:

Var(X̄I) = Var[h(U)]/2,
Var(X̄A) = Var[h(U)]/2 + Cov[h(U), h(1 − U)]/2.

The difference between the two variances is

Var(X̄A) − Var(X̄I) = Cov[h(U), h(1 − U)]/2.

But in this case, due to the shape of h, we have

h(1 − U) = h(U),

and

Cov[h(U), h(1 − U)] = Cov[h(U), h(U)] = Var[h(U)] > 0.

Therefore, Var(X̄A) > Var(X̄I): antithetic sampling actually increases variance in this case, and there is a trivial explanation. The two antithetic observations within each pair have the same value h(U) = h(1 − U), so each pair contributes a single distinct observation; in this (pathological) case, the variance of the single observation is doubled by applying antithetic sampling.
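The claim is easy to check numerically. In the following sketch (our own, not taken from the book's figures), we compare the estimated variances of the two sample means for the same total budget of 100,000 function evaluations:

```r
# triangle function: h(x) = 1 - |2x - 1|, so that h(U) = h(1 - U)
h <- function(x) 1 - abs(2*x - 1)

set.seed(55555)
# crude Monte Carlo: 100,000 independent observations
U <- runif(100000)
varCrude <- var(h(U)) / 100000
# antithetic sampling: 50,000 pairs, same total budget of function evaluations
U1 <- runif(50000)
X <- 0.5 * (h(U1) + h(1 - U1))
varAnti <- var(X) / 50000
varAnti / varCrude   # close to 2: variance is doubled, not reduced
```

Since the two observations in each pair coincide, the ratio of the two estimated variances is close to 2, in agreement with the argument above.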
What is wrong with Example 8.2? The variance of the antithetic pair is actually increased due to the nonmonotonicity of h(x). In fact, while it is true that the random numbers {Ui} and {1 − Ui} are negatively correlated, there is no guarantee that the same holds for X1(i) and X2(i) in general. To be sure that the negative correlation in the input random numbers yields a negative correlation in the observed output, we must require a monotonic relationship between them. The exponential function is monotonic, but the triangle function of the second example is not.
Another side of the coin is related to the mechanism we use to generate random variates. The inverse transform method is based on the CDF, which is a monotonic function; hence, there is a monotonic relationship between the input random numbers and the random variates generated. This is not necessarily the case with the acceptance–rejection method or the Box–Muller method, which were discussed in Chapter 5. Luckily, when we need normal variates, we may simply generate a sequence Zi ~ N(0, 1) and use the sequence −Zi for the antithetic sample.
In order to illustrate antithetic sampling with normal variates, and to see a more concrete example, let us consider the use of antithetic sampling to price a simple vanilla call option. We have seen how such a simple option can be priced using Monte Carlo sampling. It is easy to modify the code in Fig. 6.4 to introduce antithetic sampling; we just need to store a vector of normal variates to use them twice. The resulting code is shown in Fig. 8.1. The following snapshot checks variance reduction in the case of an at-the-money option:
FIGURE 8.1 Pricing a vanilla call option with antithetic sampling.
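A possible reconstruction of the function of Fig. 8.1 is sketched below; the name BlsMCAV and its argument list match the snapshot that follows, while the body is our own assumption and may differ from the book's:

```r
# our reconstruction of the function of Fig. 8.1; only the name BlsMCAV and
# its argument list are taken from the snapshot below, the body is a guess
BlsMCAV <- function(S0, K, T, r, sigma, numPairs) {
  Z <- rnorm(numPairs)
  nuT <- (r - sigma^2/2) * T
  # terminal prices driven by Z and by its antithetic mirror -Z
  ST1 <- S0 * exp(nuT + sigma * sqrt(T) * Z)
  ST2 <- S0 * exp(nuT - sigma * sqrt(T) * Z)
  # pair-averaged discounted payoffs
  X <- exp(-r*T) * 0.5 * (pmax(ST1 - K, 0) + pmax(ST2 - K, 0))
  aux <- t.test(X)
  list(estimate = as.numeric(aux$estimate), conf.int = as.numeric(aux$conf.int))
}
```

Note that a single vector of normal variates drives both legs of each pair, which is the whole point of the method.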
> S0 <- 50
> K <- 50
> sigma <- 0.4
> r <- 0.05
> T <- 5/12
> numRepl <- 50000
> exact <- EuVanillaCall(S0,K,T,r,sigma); exact
[1] 5.614995
> set.seed(55555)
> out1 <- BlsMCNaive(S0,K,T,r,sigma,numRepl); out1
$estimate
[1] 5.568167
$conf.int
[1] 5.488962 5.647371
> set.seed(55555)
> out2 <- BlsMCAV(S0,K,T,r,sigma,numRepl/2); out2
$estimate
[1] 5.596806
$conf.int
[1] 5.533952 5.659661
Note that, for the sake of a fair comparison, we divide the number of replications by two when using antithetic sampling. Indeed, there is a slight reduction in variance, but it is far from impressive (and, anyway, one run does not mean much). The situation looks a bit better for a deeply in-the-money option:
> K <- 25
> exact <- EuVanillaCall(S0,K,T,r,sigma); exact
[1] 25.52311
> set.seed(55555)
> out1 <- BlsMCNaive(S0,K,T,r,sigma,numRepl); out1
$estimate
[1] 25.4772
$conf.int
[1] 25.36299 25.59142
> set.seed(55555)
> out2 <- BlsMCAV(S0,K,T,r,sigma,numRepl/2); out2
$estimate
[1] 25.51441
$conf.int
[1] 25.48482 25.54401
A possible explanation is that in this second case we are sampling the payoff function where the nonlinearity plays a lesser role. We leave a serious computational study to the reader, but we should not be surprised by the fact that antithetic sampling, which is almost a simple programming trick, does not necessarily produce impressive improvements. Nevertheless, the idea does work well in some cases, and sometimes it can also be fruitfully integrated with the approaches that we describe in the following.
The common random numbers technique is very similar to antithetic sampling, but it is applied in a different situation. Suppose that we use Monte Carlo simulation to estimate a value depending on a parameter α. In formulas, we are trying to estimate something like

h(α) = E[f(α; ω)],
where we have emphasized randomness by ω, which refers to random events underlying random variables. We could also be interested in evaluating the sensitivity of this value with respect to the parameter α:

dh(α)/dα.
This would be of interest when dealing with option sensitivities beyond the Black–Scholes–Merton model. Clearly, we cannot compute the derivative analytically; otherwise, we would not use simulation to estimate the function h(·) in the first place. So, the simplest idea would be using simulation to estimate the value of the finite difference

[h(α + δα) − h(α)]/δα,
for a small value of the increment δα. However, what we can really do is to generate observations of the difference

Dj = f(α + δα; ωj) − f(α; ω′j),
and to estimate its expected value. Unfortunately, when the increment δα is small, it is difficult to tell if the difference we observe is due to random noise or to the variation in the parameter α. A similar problem arises when we want to compare two portfolio management policies on a set of scenarios; in this case, too, what we need is an estimate of the expected value of the difference between two random variables. In this setting, is it sensible to compare the two performances on the basis of different sets of scenarios?
Let us abstract a little and consider the difference of two random variables

Z = X1 − X2,
where, in general, E[X1] ≠ E[X2], since they come from simulating two different systems, possibly differing only in the value of a single parameter. By Monte Carlo simulation we get a sequence of independent random variables

Zj = X1,j − X2,j,  j = 1, …, n,
and use statistical techniques to build a confidence interval for E[X1 − X2]. To improve our estimate, it would be useful to reduce the variance of the sampled Zj. If we do not assume independence between X1,j and X2,j (for the same j), we find something quite similar to antithetic sampling, with a change in sign:

Var(Zj) = Var(X1,j) + Var(X2,j) − 2 Cov(X1,j, X2,j).
Then, in order to reduce variance, we may try inducing a positive correlation between X1,j and X2,j. This can be obtained by using the same stream of random numbers in simulating both X1 and X2. The technique works much like antithetic sampling, and the same monotonicity assumption is required to ensure that the technique does not backfire. We will see an application of these concepts in Section 12.1, where we apply Monte Carlo sampling to estimate option price sensitivities.
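As a toy illustration (ours, not the book's), consider estimating the sensitivity of E[exp(αU)], with U uniform on (0, 1), by finite differences, with and without common random numbers:

```r
# a toy illustration of common random numbers (ours, not the book's):
# sensitivity of E[exp(alpha*U)] with respect to alpha by finite differences
set.seed(55555)
alpha <- 1; dalpha <- 0.01; n <- 100000
# independent streams for the two systems: the difference is mostly noise
U1 <- runif(n); U2 <- runif(n)
Zind <- (exp((alpha + dalpha)*U1) - exp(alpha*U2)) / dalpha
# common random numbers: the same stream drives both systems
U <- runif(n)
Zcrn <- (exp((alpha + dalpha)*U) - exp(alpha*U)) / dalpha
var(Zind) / var(Zcrn)   # a huge variance reduction factor
```

The exact sensitivity at α = 1 is 1; the CRN estimator recovers it with a tiny confidence interval, whereas the independent-streams estimator is swamped by noise.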
Antithetic sampling and common random numbers are two very simple techniques that, provided the monotonicity assumption is valid, do not require much knowledge about the systems we are simulating. Better results might be obtained by taking advantage of deeper, domain-specific knowledge. Suppose that we want to estimate θ = E[X], and that there is another random variable Y, with a known expected value ν, which is somehow correlated with X. Such a case occurs when we use Monte Carlo simulation to price an option for which an analytical formula is not known: θ is the unknown price of the option, and ν could be the price of a corresponding vanilla option. The variable Y is called the control variate and represents additional, domain-specific knowledge. The correlation between X and Y may be exploited by adopting the controlled estimator

Xc = X − c(Y − ν),
where c is a parameter that we must choose. Intuitively, when we run a simulation, we may observe that our estimate of E[Y] turns out larger or smaller than the known value ν. Then, we may argue that the estimate of θ should be increased or reduced accordingly, depending on the sign of the correlation between X and Y. For instance, if correlation is positive and the sampled mean of Y is larger than ν, we should reduce the estimate of θ, because if the sampled values for Y are larger than their expected value, the sampled values for X probably are, too. The contrary applies when correlation is negative. Indeed, it is easy to see that

E[Xc] = E[X] − c(E[Y] − ν) = θ,   (8.2)
Var(Xc) = Var(X) + c^2 Var(Y) − 2c Cov(X, Y).   (8.3)
The first formula says that the controlled estimator is, for any choice of the control parameter c, an unbiased estimator of θ. The second formula suggests that by a suitable choice of c, we could reduce the variance of the estimator. We could even minimize the variance by choosing the optimal value for c:

c* = Cov(X, Y)/Var(Y),   (8.4)
in which case we find

Var(Xc*) = (1 − ρXY^2) Var(X),
where ρXY is the correlation between X and Y. Note that the sign of c depends on the sign of this correlation, in a way that conforms to our intuition.
The form of the optimal control parameter in Eq. (8.4) looks suspiciously like the slope coefficient in a linear regression model.1 To see the connection, let us consider the estimation error

η = X − θ.
The variance of the random variable η may be large for difficult estimation problems. In linear regression we aim at explaining the variability of X by introducing an explanatory variable Y:

X − θ = b(Y − ν) + ε.
With respect to the usual notation, here we are swapping the roles of X and Y. Furthermore, we use only the slope coefficient b in the regression model, since we are centering random variables with respect to their expected values. Apart from these insubstantial changes, we clearly see that the least-squares estimate of b is what we need to maximize explained variability, which turns into Eq. (8.4) by a slight rearrangement.
In practice, the optimal value of c must be estimated, since Cov(X, Y) and possibly Var(Y) are not known. This may be accomplished by running a set of pilot replications to estimate them. It would be tempting to use these replications both to select c* and to estimate θ, in order to save precious CPU time. However, doing so would induce some bias in the estimate of θ, since in this case c* cannot be considered as a given number, but a random variable depending on X itself. This invalidates the arguments leading to Eqs. (8.2) and (8.3). Therefore, unless suitable statistical techniques (e.g., jackknifing) are used, which are beyond the scope of this book, the pilot replications should be discarded.
We can illustrate control variates by referring again to the problem of Example 8.1, i.e., the integration of the exponential function on the unit interval,
A possible control variate in this case is the driving uniform variable U itself, with an expected value of 0.5. We might even afford the luxury of computing c* exactly, since the variance of the control variate is 1/12, and with a little effort we may actually calculate the correlation between U and e^U. Since this is not true in general, let us stick to statistical estimates only. The code in Fig. 8.2 produces the following output:
FIGURE 8.2 Using a control variate in a toy integration problem.
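A script along the following lines is consistent with Fig. 8.2 and with the output below; the variable names match the output, while the sample sizes and other details are our assumptions:

```r
# our reconstruction of the script of Fig. 8.2; the variable names match the
# output below, while sample sizes and other details are assumptions
set.seed(55555)
numPilot <- 200; numRepl <- 200
# pilot replications, used only to estimate the optimal control parameter
Up <- runif(numPilot)
cStar <- cov(exp(Up), Up) / var(Up)
# production replications (the pilot ones are discarded)
U1 <- runif(numRepl)
exp1 <- exp(U1)
out <- t.test(exp1)                        # naive estimator
outCV <- t.test(exp1 - cStar*(U1 - 0.5))   # controlled estimator
```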
> as.numeric(out$estimate)
[1] 1.713364
> as.numeric(out$conf.int)
[1] 1.646735 1.779992
> as.numeric(outCV$estimate)
[1] 1.713169
> as.numeric(outCV$conf.int)
[1] 1.703453 1.722885
> cor(U1,exp1)
[1] 0.9938869
A comparison against the naive estimator shows that the reduction in variance is quite remarkable, and it is easily explained by the strength of the correlation. We may not expect such a result in general, unless a strong control variate is available.
As a reality check, let us try the vanilla call example again. A natural control variate is the price of the underlying asset at maturity, whose expected value under the risk-neutral measure is just S0e^rT. The code is shown in Fig. 8.3. Let us try this approach and compare it against naive Monte Carlo:
FIGURE 8.3 Pricing a vanilla call option by control variates.
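A possible reconstruction of the function of Fig. 8.3 is sketched below; the name BlsMCCV and its argument list match the snapshot that follows, while the body is our own assumption:

```r
# our reconstruction of the function of Fig. 8.3; only the name BlsMCCV and
# its argument list are taken from the snapshot below, the body is a guess
BlsMCCV <- function(S0, K, T, r, sigma, numRepl, numPilot) {
  nu <- S0 * exp(r*T)   # known risk-neutral mean of the control variate S(T)
  sampleXY <- function(n) {
    ST <- S0 * exp((r - sigma^2/2)*T + sigma*sqrt(T)*rnorm(n))
    list(X = exp(-r*T) * pmax(ST - K, 0), Y = ST)
  }
  pilot <- sampleXY(numPilot)   # pilot runs, then discarded
  cStar <- cov(pilot$X, pilot$Y) / var(pilot$Y)
  main <- sampleXY(numRepl)
  Xc <- main$X - cStar * (main$Y - nu)
  aux <- t.test(Xc)
  list(estimate = as.numeric(aux$estimate), conf.int = as.numeric(aux$conf.int))
}
```

Following the discussion above, the pilot replications are used only to estimate c* and are then discarded, to avoid biasing the estimate.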
> S0 <- 50
> K <- 50
> sigma <- 0.4
> r <- 0.05
> T <- 5/12
> numRepl <- 50000
> numPilot <- 10000
> exact <- EuVanillaCall(S0,K,T,r,sigma); exact
[1] 5.614995
> set.seed(55555)
> out1 <- BlsMCNaive(S0,K,T,r,sigma,numRepl+numPilot); out1
$estimate
[1] 5.570385
$conf.int
[1] 5.498177 5.642593
> set.seed(55555)
> out2 <- BlsMCCV(S0,K,T,r,sigma,numRepl,numPilot); out2
$estimate
[1] 5.585363
$conf.int
[1] 5.552163 5.618563
With this problem instance, the control variates approach seems more effective than antithetic sampling. The approach may be generalized to as many control variates as we wish, with a possible improvement in the quality of the estimates. Of course, this requires more domain-specific knowledge and more effort in setting the control parameters. For an impressive result obtained by a clever control variate, see the case of arithmetic and geometric Asian options in Section 11.2.
When faced with the task of calculating an expectation, a common technique in probability theory is the application of conditioning with respect to another random variable. Formally, E[X] can be expressed as

E[X] = E[E[X | Y]].   (8.5)

Equation (8.5) is known as the law of iterated expectations, among other names, and it suggests the use of E[X | Y] as an estimator rather than X. It is important to realize that this conditional expectation is a random variable, and that this makes sense if, given Y, we have a straightforward recipe to calculate the conditional expectation. What we should do is to sample Y and apply this recipe, and Eq. (8.5) makes sure that we have an unbiased estimate. However, what makes sure that by doing so we can reduce variance? The answer lies in the related conditional variance formula:

Var(X) = E[Var(X | Y)] + Var(E[X | Y]).   (8.6)

We observe that all of the terms in Eq. (8.6) involve a variance and are non-negative. Therefore the conditional variance formula implies the following:
Var(E[X | Y]) ≤ Var(X),   (8.7)
E[Var(X | Y)] ≤ Var(X).   (8.8)
The first inequality states that E[X | Y] cannot have a larger variance than the crude estimator X, and it justifies the idea of conditional Monte Carlo. The second inequality is the rationale behind variance reduction by stratification, which is discussed in the next section. Unlike antithetic sampling, variance reduction by conditioning requires some careful thinking and is strongly problem dependent.
To illustrate the approach, let us consider the expected value E[e^{−X^2}], where X is an affine transformation of a Student’s t variable T with n degrees of freedom:2

X = μ + σT.

In practice, the above transformation is similar to the destandardization of a standard normal, which is replaced by a heavier-tailed variable. If X were normal, things would be much easier. Nevertheless, we may recall that a Student’s t variable is related to a normal:

T = Z/√(Y/n),  where Z ~ N(0, 1) and Y ~ χ²(n) are independent.

Hence, we may rewrite X in terms of a standard normal Z and a chi-square variable:

X = μ + σZ/√(Y/n).
Thus, conditional on Y = y, X is normal, with a variance depending on y: X | Y = y ~ N(μ, σ^2 n/y). Now, let us consider the conditional expected value E[e^{−X^2} | Y = y]. If we introduce ξ^2 = σ^2 n/y, the conditional expectation can be written as

E[e^{−X^2} | Y = y] = ∫ e^{−x^2} (1/√(2πξ^2)) e^{−(x − μ)^2/(2ξ^2)} dx.   (8.9)

A typical trick of the trade to deal with this kind of integrand function is completing the square in the argument of the exponential, in order to come up with something related to the PDF of a normal variable. When we take the exponential of the resulting expression, one term does not depend on x and can be taken outside the integral, leading to a multiplicative factor; the remaining term is the integral of a normal density with a modified expected value and variance. The integral of the density is 1, of course, but the change of variance requires some adjustment in the integral of Eq. (8.9). Putting it all together, and substituting back ξ^2 = σ^2 n/y, the conditional expectation turns out to be

E[e^{−X^2} | Y = y] = exp(−μ^2/(1 + 2ξ^2)) / √(1 + 2ξ^2).
Now, what we have to do in order to apply conditional Monte Carlo is to sample a chi-square variable Y with n degrees of freedom and plug the value in the above conditional expectation. This is carried out by the R code of Fig. 8.4, for μ = 3, σ = 0.5, and n = 5 degrees of freedom. We use rchisq to sample the chi-square distribution. In the code we also apply crude Monte Carlo, using rt to sample the Student’s variable, to compare the results. Indeed, the comparison between the two confidence intervals shows the improvement obtained by conditioning:
FIGURE 8.4 Dealing with a function of a Student’s t variable by conditional Monte Carlo.
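Our own reconstruction of the experiment of Fig. 8.4 is sketched below; the sample size is a guess, and we write the conditional variance for the standard Student's t construction as ξ^2 = σ^2 n/y:

```r
# our reconstruction of the experiment of Fig. 8.4; the sample size is a
# guess, and xi^2 = sigma^2 * n / y for a standard Student's t construction
mu <- 3; sigma <- 0.5; n <- 5
numRepl <- 100000
set.seed(55555)
# crude Monte Carlo, sampling the Student's t variable directly
X <- mu + sigma * rt(numRepl, df = n)
out <- t.test(exp(-X^2))
# conditional Monte Carlo: sample the chi-square variable Y and plug it
# into the conditional expectation derived above
Y <- rchisq(numRepl, df = n)
xi2 <- sigma^2 * n / Y
outC <- t.test(exp(-mu^2 / (1 + 2*xi2)) / sqrt(1 + 2*xi2))
```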
> as.numeric(out$estimate)
[1] 0.007806775
> as.numeric(out$conf.int)
[1] 0.007472332 0.008141218
> as.numeric(outC$estimate)
[1] 0.007683399
> as.numeric(outC$conf.int)
[1] 0.007590393 0.007776404
It is also interesting to see how the two sample means evolve when adding replications successively. This may be obtained by cumulating and plotting sums, as is done in the last lines of the R script, which produces the plot in Fig. 8.5. The difference in convergence speed to the correct value is quite evident.
FIGURE 8.5 Showing the effect of conditioning on the convergence of estimates.
Stratified sampling, like conditional Monte Carlo, relies on conditioning arguments, but it has a different perspective. We are interested in estimating θ = E[X], and let us suppose that X depends on another random variable Y, which may take a finite set of values yj, j = 1, …, m, with the known PMF

pj = P{Y = yj}.
Using conditioning, we see that

θ = E[X] = Σj pj E[X | Y = yj].
This suggests the idea of sampling X conditional on Y, and then assembling a stratified estimator:
X̄s = Σj pj · (1/Nj) Σk Xjk,   (8.10)
where Nj is the number of independent observations allocated to stratum j, and Xjk is the kth observation of X in stratum j, conditional on Y = yj. We speak of strata since the values yj induce a partitioning of the sample space into a mutually exclusive and collectively exhaustive collection of subsets. It is a good idea to see a simple application of the idea immediately.
As a simple example of stratification, consider using simulation to compute

I = ∫0^1 h(x) dx.
In crude Monte Carlo simulation we draw N uniform random numbers Uk ~ U(0, 1) and compute the sample mean

X̄ = (1/N) Σk h(Uk).
An improved estimator with respect to crude sampling may be obtained by partitioning the integration interval [0, 1] into a set of m disjoint subintervals

[(j − 1)/m, j/m],  j = 1, …, m.   (8.11)
Strictly speaking, these are not disjoint intervals, but since the probability of singletons is zero, this is not quite relevant. Each event Y = yj corresponds to a random number falling in the jth subinterval; in this case we might choose pj = 1/m. Note that a uniform variable on the unit interval (0, 1) can be mapped into a uniform variable on the intervals of Eq. (8.11) by the affine transformation (U + j − 1)/m. Hence, for each stratum j = 1, …, m we generate Nj random numbers Uk ~ U(0, 1) and estimate

X̄j = (1/Nj) Σk h((Uk + j − 1)/m).
Then we build the overall estimator:

X̄s = Σj pj X̄j = (1/m) Σj X̄j.
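For instance, with h(x) = e^x (our choice of toy integrand) and equiprobable strata, the whole scheme boils down to the following sketch:

```r
# proportional stratification on the unit interval, with h(x) = exp(x)
# (our choice of toy integrand) and equiprobable strata
h <- exp
m <- 100; Nj <- 10    # m strata, Nj observations per stratum, p_j = 1/m
set.seed(55555)
means <- numeric(m); vars <- numeric(m)
for (j in 1:m) {
  U <- (runif(Nj) + j - 1) / m    # uniforms on the j-th subinterval
  means[j] <- mean(h(U))
  vars[j] <- var(h(U)) / Nj       # estimated variance of the stratum mean
}
estimate <- mean(means)           # overall estimator with p_j = 1/m
stdErr <- sqrt(sum(vars / m^2))
c(estimate, exp(1) - 1)           # compare with the exact value
```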
Clearly, we may apply the approach only if we know the probabilities pj for each stratum and are able to sample conditionally on Y. Assuming this, the following questions arise:
We will give an answer to the first question later. For now, let us assume that we follow a simple proportional allocation policy:

Nj = pj · N,
subject to rounding, which we neglect. Let us denote by σj2 the conditional variance

σj^2 = Var(X | Y = yj).
Given the independence among replications, we see that
Var(X̄s) = Σj pj^2 σj^2/Nj.   (8.12)
If we assume proportional allocation of observations to strata, we find

Var(X̄s) = (1/N) Σj pj σj^2 = (1/N) E[Var(X | Y)] ≤ Var(X)/N,
where the last inequality follows from the inequality (8.8), which in turn is a consequence of the conditional variance formula (8.6). Hence, by using proportional allocation we may indeed reduce variance. A further improvement can be obtained, at least in principle, by an optimal allocation of observations to strata. This may be obtained by solving the nonlinear programming problem:
To be precise, we should require that the decision variables Nj take only integer values. In practice, when N is large enough, we may relax integer variables to continuous values, and then round the optimal solution. If we assume an interior optimal solution, i.e., Nj* > 0, we may relax the non-negativity bounds and apply the Lagrange multiplier approach. First, we associate a multiplier λ with the budget constraint and build the Lagrangian function
and then write the stationarity conditions with respect to each decision variable:
This condition implies that Nj should be proportional to pjσj:
Taking the budget constraint into account, we find the optimal budget allocation

Nj* = N · pjσj / Σi piσi,
and the corresponding minimal variance is

(1/N) (Σj pjσj)^2.
An obvious difficulty in this reasoning is that it relies on the knowledge of the conditional variances σj2. Since it is quite unlikely that we know them, in order to apply optimal budget allocation we need some pilot runs to estimate the sample conditional variances:

Sj^2 = [1/(Nj − 1)] Σk (Xjk − X̄j)^2,

where

X̄j = (1/Nj) Σk Xjk

is the sample mean for stratum j. Furthermore, we may also build an approximate confidence interval at confidence level (1 − α) as follows:
X̄s ± z(1−α/2) √(Σj pj^2 Sj^2/Nj).   (8.13)
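The whole procedure, pilot runs included, might be sketched as follows for the toy integral of e^x on the unit interval (our construction, with equiprobable strata):

```r
# optimal budget allocation with pilot runs, for the toy integral of exp(x)
# on the unit interval (our construction, with equiprobable strata)
h <- exp
m <- 20; numPilot <- 50; N <- 10000
set.seed(55555)
# pilot runs: estimate the conditional standard deviation in each stratum
sj <- sapply(1:m, function(j) sd(h((runif(numPilot) + j - 1) / m)))
pj <- rep(1/m, m)
# N_j proportional to p_j * sigma_j, with a floor to keep variances defined
Nj <- pmax(2, round(N * pj*sj / sum(pj*sj)))
# production runs, discarding the pilot ones
stats <- sapply(1:m, function(j) {
  X <- h((runif(Nj[j]) + j - 1) / m)
  c(mean(X), var(X) / Nj[j])
})
estimate <- sum(pj * stats[1, ])
stdErr <- sqrt(sum(pj^2 * stats[2, ]))
c(estimate, estimate - 1.96*stdErr, estimate + 1.96*stdErr)
```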
In Section 1.1 we considered the use of Monte Carlo methods to estimate π, without much success. To improve things, let us consider the integral

I = ∫0^1 √(1 − x^2) dx.
Since x^2 + y^2 = 1 is the well-known equation describing the unit circle, we see that I is the area of a quarter of the unit circle. Therefore, we have

π = 4I.
In Fig. 8.6 we show three R functions to estimate π using the above integral. The first function uses crude Monte Carlo, the second one uses straightforward stratification, and the third one integrates stratification and antithetic sampling. By running the three functions with a sample size of 5000 we notice that stratification improves the quality of the estimate:
FIGURE 8.6 Using stratification to estimate π.
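A possible reconstruction of the stratified estimator of Fig. 8.6 is the following; the name estPiStratified matches the snapshot below, while the body, with one observation per stratum, is our own assumption:

```r
# our reconstruction of one of the functions of Fig. 8.6; the name matches
# the snapshot below, the body (one observation per stratum) is a guess
estPiStratified <- function(numRepl) {
  # one uniform random number in each of numRepl equally spaced strata
  U <- (runif(numRepl) + (1:numRepl) - 1) / numRepl
  X <- 4 * sqrt(1 - U^2)          # integrand of the quarter-circle integral
  aux <- t.test(X)
  list(estimate = as.numeric(aux$estimate), conf.int = as.numeric(aux$conf.int))
}
```

Note that applying t.test to stratified observations overstates variability, since it ignores stratification; this is consistent with the wide confidence intervals in the snapshot, despite the very accurate point estimates.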
> set.seed(55555)
> estPiNaiveMC(5000)
$estimate
[1] 3.138237
$conf.int
[1] 3.113241 3.163233
> estPiStratified(5000)
$estimate
[1] 3.141587
$conf.int
[1] 3.116832 3.166343
> estPiStratifiedAnti(5000)
$estimate
[1] 3.141593
$conf.int
[1] 3.116838 3.166348
> pi
[1] 3.141593
In the functions using stratification, the number of strata is just the number of replications. To be fair, we are cheating a bit when integrating stratification with antithetic sampling, as we are doubling the number of observations. Nevertheless, we see that we get an estimate which is correct to at least five decimals. Actually, the natural objection is that with this approach we are reverting to numerical integration on a grid of points. Indeed, as we explain later, we cannot afford the luxury of stratifying with respect to several random variables; nevertheless, using some tricks we may still apply the approach to practically relevant cases.
Let us apply stratification to option pricing in the simplest case, the vanilla call option. It is natural to stratify with respect to the asset price at maturity, which in turn requires stratification of a standard normal. To accomplish this task, we may stratify a uniform random number as we did in Example 8.4, and then obtain the standard normals by the inverse transform method. This is accomplished by the R code of Fig. 8.7, which produces the following (edited) output:
FIGURE 8.7 Pricing a vanilla call option by stratified sampling.
> EuVanillaCall(S0,K,T,r,sigma)
[1] 5.614995
> BlsMCNaive(S0,K,T,r,sigma,numRepl)
$estimate
[1] 5.74592
$conf.int
[1] 5.562113 5.929728
> BsmStrat(S0,K,T,r,sigma,numRepl,numStrats)
$estimate
[1] 5.636611
$conf.int
[1] 5.614289 5.658932
In all of the above examples, the beneficial effect of stratification is rather evident. One may ask, however, how can we take advantage of stratification in a more relevant situation. After all, by straightforward stratification we are, in a sense, pushing back toward numerical integration on a grid, and it seems rather difficult to stratify with respect to several variables. Indeed, in pricing an option depending on several factors or the prices along a whole sample path, we cannot stratify with respect to all of them. However, we may use some tricks, such as the Brownian bridge of Section 6.2.3, to stratify with respect to some milestone prices, and then fill the gaps in each sample path using naive Monte Carlo.
Unlike other variance reduction methods, importance sampling is based on the idea of “distorting” the underlying probability measure. Consider the problem of estimating

θ = E[h(X)] = ∫ h(x)f(x) dx,
where X is a random vector with joint density f(·). If we know another density g(·) such that f(x) = 0 whenever g(x) = 0, we may write

θ = ∫ [h(x)f(x)/g(x)] g(x) dx = Eg[h(X)f(X)/g(X)],
where the notation Eg[·] is used to stress the fact that the last expected value is taken with respect to another measure. The ratio f(x)/g(x) is used to correct the change in probability measure, and it is typically called a likelihood ratio: When sampling X randomly, this ratio will be a random variable.3 The requirement that f(x) = 0 whenever g(x) = 0 is needed to avoid trouble with division by zero, but it is just a technicality. The real issue is how to select a density g(·) that results in a significant reduction in variance. Actually, there is no guarantee that variance will be reduced, as a poor choice may well increase variance.
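As a toy instance (ours, not the book's), consider estimating the rare-event probability P(Z > 3) for a standard normal Z; sampling from a normal centered on the rare region makes the event common, and the likelihood ratio corrects for the change of measure:

```r
# a toy instance of importance sampling (ours, not the book's): estimating
# the rare-event probability P(Z > 3), Z ~ N(0,1), by sampling from N(3,1)
set.seed(55555)
n <- 10000
exact <- pnorm(3, lower.tail = FALSE)
# crude Monte Carlo: only about 13 hits out of 10,000 are expected
crude <- mean(rnorm(n) > 3)
# importance sampling: g is the N(3,1) density, f the N(0,1) density
Y <- rnorm(n, mean = 3)
lr <- dnorm(Y) / dnorm(Y, mean = 3)    # likelihood ratio f(y)/g(y)
isEst <- mean((Y > 3) * lr)
c(exact, crude, isEst)
```

Under the shifted density, roughly half of the observations hit the event of interest, and the importance sampling estimator attains a relative error that crude Monte Carlo could only match with an enormously larger sample.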
To get a clue, let us introduce the notation
and assume for simplicity that h(x) ≥ 0. As we have pointed out, there are two possible ways of estimating θ:
where h*(x) ≐ h(x)f(x)/g(x). The two estimators have the same expectation, but what about their variance? Using the well-known properties of the variance, we find

Var_f[h(X)] = E_f[h²(X)] − θ²,
Var_g[h*(X)] = E_g[(h*(X))²] − θ² = ∫ h²(x) f²(x)/g(x) dx − θ².
From the second equation, it is easy to see that the choice

g*(x) = h(x) f(x)/θ        (8.15)
leads to the ideal condition Var_g[h*(X)] ≡ 0. We may find an exact estimate with one observation! Clearly, there must be a fly in the ointment somewhere. Indeed, this density is truly ideal, as it requires the knowledge of θ itself. Nevertheless, Eq. (8.15) does provide us with some guidance, as it suggests that we should use a density matching the product h(x)f(x) as far as possible. By the way, you may now see why we have required the condition h(x) ≥ 0: a density cannot be negative.4 The problem is that f(·) is a density, but its product with h(·) is not; we miss a normalization constant, which is precisely what we want to compute. Not all hope is lost, however.
As a further hint, let us consider the difference between the two variances:

ΔVar ≐ Var_f[h(X)] − Var_g[h*(X)] = ∫ h²(x) f(x) [1 − f(x)/g(x)] dx.
We would like to obtain ΔVar > 0. Unfortunately, while the factor h²(x)f(x) is non-negative, the term within brackets has to be negative somewhere, since both f(·) and g(·) are densities and must integrate to one. We should select a new density g(·) such that g(x) > f(x) where the factor h²(x)f(x) is large, and g(x) < f(x) where it is small: in other words, we should sample more heavily in the important regions.
In fact, the name “importance sampling” derives from this observation.
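The zero-variance property of the ideal density of Eq. (8.15) is easy to check on a toy case. The following minimal sketch (in Python, although the book's examples are in R) assumes h(x) = x and f the uniform density on (0, 1), so that θ = 1/2 and g*(x) = 2x:

```python
import numpy as np

# Toy check of the ideal density: with h(x) = x and f(x) = 1 on (0, 1),
# theta = 1/2, so the ideal density is g*(x) = h(x)f(x)/theta = 2x.
rng = np.random.default_rng(0)
u = rng.random(10)
x = np.sqrt(u)                  # inverse-CDF sampling from g*(x) = 2x on (0, 1)
vals = x * 1.0 / (2 * x)        # h(x) f(x) / g*(x): every replication equals theta
print(vals.mean(), vals.std())  # 0.5 0.0: a zero-variance estimator
```

Every replication yields exactly θ = 1/2, illustrating why g*(·) would make a single observation sufficient, if only θ were known.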
Let us consider again the estimation problem of Example 8.5, where we have noted that

π/4 = ∫₀¹ √(1 − x²) dx = E[h(U)],   where h(x) ≐ √(1 − x²) and U ~ U(0, 1).
To improve our estimate, we may apply importance sampling, following an approach described in [1]. A possible idea to approximate the ideal probability distribution is to divide the integration interval [0, 1] into L equally spaced subintervals of width 1/L. The extreme points of the kth subinterval (k = 1, …, L) are (k − 1)/L and k/L, and the midpoint of this subinterval is s_k = (k − 1)/L + 1/(2L). A rough estimate of the integral is obtained by computing

(1/L) Σ_{k=1}^L h(s_k).
Then, to find an approximation of the ideal density g(x), we could use something like

g(x) ≈ h(x) / [(1/L) Σ_{k=1}^L h(s_k)],
where we have used the fact f(x) = 1 (uniform distribution). Unfortunately, this need not be a density integrating to 1 over the unit interval. In order to avoid this difficulty and to simplify sampling, we may define a probability of sampling from a subinterval and use a uniform density within each subinterval. To this end, let us consider the quantities

q_k ≐ h(s_k) / Σ_{j=1}^L h(s_j),   k = 1, …, L.
Clearly, Σ_k q_k = 1 and q_k ≥ 0, since our function h is non-negative; hence, the numbers q_k may be interpreted as probabilities. In our case, they may be used as the probabilities of selecting a sample point from the kth subinterval. To summarize, and to cast the problem within the general framework, we have

g(x) = L·q_k,   for x ∈ ((k − 1)/L, k/L],   k = 1, …, L.
Here, g(x) is a piecewise constant density; the L factor multiplying the qk in g(x) is just needed to obtain the uniform density over an interval of length 1/L.
The resulting code is illustrated in Fig. 8.8, where m is the sample size and L is the number of subintervals. The code is fairly simple, and subintervals are selected as described in the last part of Section 5.3, where we have seen how to sample discrete empirical distributions by the function EmpiricalDrnd of Fig. 5.7:
FIGURE 8.8 Using importance sampling to estimate π.
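The scheme can be sketched as follows, here in Python rather than R; the function name `est_pi_is`, the use of NumPy, and the quarter-circle integrand h(x) = √(1 − x²) are our illustrative assumptions, not a transcription of the book's estpiIS:

```python
import numpy as np

def est_pi_is(m, L, seed=None):
    """Estimate pi by importance sampling over L subintervals of (0, 1)."""
    rng = np.random.default_rng(seed)
    h = lambda x: np.sqrt(1.0 - x**2)   # pi/4 = integral of h over (0, 1)
    s = (np.arange(L) + 0.5) / L        # midpoints s_k of the L subintervals
    q = h(s) / h(s).sum()               # probabilities q_k of each subinterval
    k = rng.choice(L, size=m, p=q)      # select a subinterval for each replication
    x = (k + rng.random(m)) / L         # uniform sampling within the chosen subinterval
    # g(x) = L * q_k on subinterval k; the estimator averages h(x) f(x)/g(x), with f = 1
    return 4.0 * np.mean(h(x) / (L * q[k]))
```

With m = 1000 and L = 10 this reproduces the order of accuracy seen in the output below.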
> pi
[1] 3.141593
> set.seed(55555)
> estpiIS(1000,10)
[1] 3.135357
> estpiIS(1000,10)
[1] 3.142377
> estpiIS(1000,10)
[1] 3.14807
> estpiIS(1000,10)
[1] 3.148633
> estpiIS(1000,10)
[1] 3.140776
We see that the improved code, although not a very sensible way to compute π, yields a remarkable reduction in variance with respect to crude Monte Carlo.
The approach that we have just pursued looks suspiciously like stratified sampling. Actually, there is a subtle difference. In stratified sampling we define a set of strata, which correspond to events of known probability; here we have not used strata with known probability, as we have used sampling to estimate the probabilities qk.
Let us consider the expected value E[h(X)], where X ~ N(0, 1) and
We observe that the function h(·) is a triangle with support on the interval (3, 5) and a fairly large maximum at x = 4. However, since a standard normal has little probability mass outside the interval (−3, 3), we have large function values where the probability is low. Crude Monte Carlo will generate almost all of the observations where the function value is zero, with a corresponding waste of effort; a huge sample size is needed to get an accurate estimate. Therefore, we may try to shift the mode of the density in order to match the maximum of the function. Hence, let us consider the distribution N(θ, 1), with θ = 4. Finding the likelihood ratio is fairly easy:
f(x)/g(x) = exp(−x²/2 + (x − θ)²/2) = exp(θ²/2 − θx).        (8.16)
In Fig. 8.9 we show R code to evaluate the triangle function (please note how vectorization is achieved) and to compare the outcome of crude Monte Carlo against importance sampling, producing the following output:
FIGURE 8.9 Shifting the expected value of a normal distribution.
> as.numeric(out$estimate)
[1] 0.4058854
> as.numeric(out$conf.int)
[1] 0.1120629 0.6997080
> as.numeric(outC$estimate)
[1] 0.3631287
> as.numeric(outC$conf.int)
[1] 0.3527247 0.3735328
We see how a very large confidence interval results from crude Monte Carlo, whereas shifting the mean of the normal produces a much more accurate estimate.
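The mean-shifting strategy can be sketched as follows, in Python rather than R; since the exact triangle function of the example is not fully specified here, we assume, for illustration only, a unit-height triangle h(x) = max(0, 1 − |x − 4|), so the numbers will not match the output above:

```python
import numpy as np

def triangle(x):
    # assumed unit-height triangle with support (3, 5) and peak at x = 4
    return np.maximum(0.0, 1.0 - np.abs(x - 4.0))

def shifted_mean_is(m, theta=4.0, seed=None):
    """Estimate E[h(X)], X ~ N(0,1), by sampling from N(theta, 1) instead."""
    rng = np.random.default_rng(seed)
    y = rng.normal(theta, 1.0, size=m)           # sample under the shifted measure
    lik = np.exp(0.5 * theta**2 - theta * y)     # likelihood ratio f(y)/g(y), Eq. (8.16)
    vals = triangle(y) * lik
    est = vals.mean()
    half = 1.96 * vals.std(ddof=1) / np.sqrt(m)  # 95% confidence half-width
    return est, (est - half, est + half)
```

For this assumed h, the exact value is about 3.7 × 10⁻⁴, and the importance sampling estimator recovers it with a tight confidence interval, whereas crude sampling from N(0, 1) would almost never hit the support of the triangle.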
On the one hand, Example 8.8 shows a possible heuristic to find a distorted density, by matching it with the maximum of the integrand function. On the other hand, we see how rare events may be dealt with by importance sampling. We elaborate on the theme in the next section, using a more practically relevant example.
Importance sampling is often used when small probabilities are involved. Consider, for instance, a random vector X with joint density f(·), and suppose that we want to estimate

θ = E[h(X) | X ∈ A],

where {X ∈ A} is a rare event with a small but unknown probability P{X ∈ A}. Such an event could be the occurrence of a large loss, which is relevant in risk management. The conditional density is

f(x | X ∈ A) = f(x) / P{X ∈ A}
for x ∈ A. By defining the indicator function

1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 otherwise,
we may rewrite θ as

θ = E[h(X)·1_A(X)] / P{X ∈ A} = E[h(X)·1_A(X)] / E[1_A(X)].
If we use crude Monte Carlo simulation, many samples will be wasted, as the event {X ∈ A} will rarely occur. Now, assume that there is a density g(·) such that this event is more likely under the corresponding probability measure. Then, we generate the observations X_i according to g(·) and estimate

θ̂ = [Σ_{i=1}^m h(X_i)·1_A(X_i)·f(X_i)/g(X_i)] / [Σ_{i=1}^m 1_A(X_i)·f(X_i)/g(X_i)].
Let us illustrate the idea with a simple but quite instructive example: pricing a deeply out-of-the-money vanilla call option. If S0 is the initial price of the underlying asset, we know that, according to the geometric Brownian motion model, its expected value at maturity under the risk-neutral measure is S0·e^{rT}. If this expected value is small with respect to the strike price K, it is unlikely that the option will be in-the-money at maturity, and with crude Monte Carlo many replications are wasted, because the payoff is zero in most of them. We should change the drift in order to increase the probability that the payoff is positive. One possible choice is a drift μ such that the expected value of S_T is the strike price:

S0·e^{μT} = K   ⟹   μ = (1/T) log(K/S0).
While under the risk-neutral measure we sample S_T = S0·e^Z by generating normal variates

Z ~ N((r − σ²/2)T, σ²T),

we should now sample S_T = S0·e^Y by generating

Y ~ N((μ − σ²/2)T, σ²T).

This in turn requires sampling standard normal variates ε ~ N(0, 1) and then using

Y = (μ − σ²/2)T + σ√T·ε.
Now the tricky part is to compute the likelihood ratio. For the sake of clarity, assume that we sample Y from a normal distribution N(β, ξ), whereas the original distribution is N(α, ξ), where ξ denotes the common variance. Then, the ratio of the two densities can be found much like we did in Example 8.8:

f(Y)/g(Y) = exp( [(Y − β)² − (Y − α)²] / (2ξ) ).        (8.17)
Now it is easy to extend the naive Monte Carlo code of function BlsMCNaive, see Fig. 6.4, to the function BlsMCIS displayed in Fig. 8.10. We may check the efficiency gain of importance sampling by running the script in Fig. 8.11. In the script we estimate the price with both crude Monte Carlo and importance sampling, in order to compare the percentage errors with respect to the exact price. We price the same option 100 times, in order to collect the average percentage error and the worst-case percentage error. We reset the random variate generator twice in order to use exactly the same stream of standard normal variates. Running the script, we obtain
FIGURE 8.10 Using importance sampling to price a deeply out-of-the-money vanilla call option.
FIGURE 8.11 Comparing the estimation errors obtained by crude Monte Carlo and importance sampling for a deeply out-of-the-money vanilla call option.
Average Percentage Error:     MC = 2.248%   MC+IS = 0.305%
Worst Case Percentage Error:  MC = 6.178%   MC+IS = 1.328%
We notice that, as we hoped, importance sampling is able to reduce the average error. What is probably even more striking is that the worst-case error may be quite large in naive Monte Carlo, whereas it is more reasonable with importance sampling, which indeed seems more robust. We should note, however, that this improvement is not to be expected for at-the-money options.
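The pricing scheme of this section can be sketched in Python; the book's BlsMCIS is written in R, so the function name `call_price_is`, the use of NumPy, and the parameter choices below are illustrative assumptions:

```python
import math
import numpy as np

def call_price_is(S0, K, T, r, sigma, m, seed=None):
    """Price a (deep out-of-the-money) call by sampling the log-return under a
    shifted drift mu such that E[S_T] = K, correcting via the likelihood ratio."""
    rng = np.random.default_rng(seed)
    xi = sigma**2 * T                     # common variance of the log-return
    alpha = (r - 0.5 * sigma**2) * T      # original mean, risk-neutral measure
    mu = math.log(K / S0) / T             # shifted drift: S0 * exp(mu*T) = K
    beta = (mu - 0.5 * sigma**2) * T      # shifted mean of the log-return
    y = rng.normal(beta, math.sqrt(xi), size=m)
    # likelihood ratio f(y)/g(y) of the two normal densities, Eq. (8.17)
    lik = np.exp(((y - beta)**2 - (y - alpha)**2) / (2 * xi))
    payoff = np.maximum(S0 * np.exp(y) - K, 0.0)
    return math.exp(-r * T) * np.mean(payoff * lik)
```

For a deep out-of-the-money option, e.g., S0 = 50 and K = 70, the estimate stays close to the exact Black–Scholes price even with moderate sample sizes, since most replications now produce a positive payoff.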
In the previous section we have used a heuristic argument to come up with an importance sampling measure to improve pricing accuracy for a deeply out-of-the-money option. In the next section we will consider a systematic way to find distorted densities for use within importance sampling. Here we pause a little to introduce a couple of relevant concepts for the sake of the unfamiliar reader. As usual, we will not strive for full generality and mathematical rigor, as our aim is just to point out the practical relevance of some concepts from probability theory.
DEFINITION 8.1 (Moment generating function) Given a continuous random variable X with density f_X(·), its moment generating function is defined as

M_X(θ) ≐ E[e^{θX}] = ∫ e^{θx} f_X(x) dx.        (8.18)
The interest of the moment generating function lies in the fact that, as the name suggests, it can be used to compute the moments of a random variable, since the following fact can be proved:

m_n ≐ E[X^n] = M_X^{(n)}(0),
which means that, to compute the moment of order n of the random variable X, denoted by m_n, we can take the derivative of order n of its moment generating function and evaluate it for θ = 0. To see this, let us consider the series expansion of the exponential:

e^z = Σ_{k=0}^∞ z^k / k!.
If we plug a random variable X into this expression and take an expectation, we find

M_X(θ) = E[e^{θX}] = 1 + θ·m_1 + (θ²/2!)·m_2 + (θ³/3!)·m_3 + ⋯.
By taking the first-order derivative, it is easy to see that

M_X^{(1)}(θ) = m_1 + θ·m_2 + (θ²/2!)·m_3 + ⋯.

By setting θ = 0 we obtain m_1 = M_X^{(1)}(0). By the same token, if we take the second-order derivative we obtain

M_X^{(2)}(θ) = m_2 + θ·m_3 + (θ²/2!)·m_4 + ⋯,

which implies m_2 = M_X^{(2)}(0), and so on. Now the problem is to find the moment generating function for the most common distributions. The next example outlines the case of a normal random variable.
It is not too difficult to prove, by an integration exercise, that the moment generating function of a normal random variable X ~ N(μ, σ²) is

M_X(θ) = exp(μθ + σ²θ²/2).        (8.19)
The proof relies on typical tricks of the trade, like completing the square within an exponential, much like we have seen in Section 8.4. Then, it is a simple, even though a bit tedious, exercise to prove that for a normal variable X we have

E[X] = μ,   E[X²] = μ² + σ²,   E[X³] = μ³ + 3μσ²,   E[X⁴] = μ⁴ + 6μ²σ² + 3σ⁴.
The latter result is a key step to prove that the kurtosis of a normal variable is 3.
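The moment-from-derivative mechanism is easy to check numerically. The following small Python sketch (finite differences are our illustration, not the book's approach) differentiates the normal MGF of Eq. (8.19) at θ = 0:

```python
import math

def mgf_normal(theta, mu, sigma):
    # M_X(theta) = exp(mu*theta + sigma^2*theta^2/2), Eq. (8.19)
    return math.exp(mu * theta + 0.5 * sigma**2 * theta**2)

def first_two_moments(mu, sigma, eps=1e-4):
    # central finite differences of M_X at theta = 0; M_X(0) = 1
    m1 = (mgf_normal(eps, mu, sigma) - mgf_normal(-eps, mu, sigma)) / (2 * eps)
    m2 = (mgf_normal(eps, mu, sigma) - 2.0 + mgf_normal(-eps, mu, sigma)) / eps**2
    return m1, m2
```

With μ = 2 and σ = 3, this returns values close to m₁ = μ = 2 and m₂ = μ² + σ² = 13, in agreement with the moments listed above.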
DEFINITION 8.2 (Cumulant generating function) The cumulant generating function of a random variable X with moment generating function M_X(θ) is defined as

ψ_X(θ) ≐ log M_X(θ).        (8.20)
For instance, given Eq. (8.19), we immediately see that the cumulant generating function of a normal variable X ~ N(μ, σ²) is

ψ_X(θ) = μθ + σ²θ²/2.        (8.21)
We should also mention that the cumulant generating function, if it exists, is convex. The cumulant generating function plays a relevant role in exponential tilting for importance sampling.
Exponential tilting, also known as exponential twisting, is a common strategy to define distributions for use within importance sampling. The idea is to multiply the original density5 f(x) by an exponential function depending on a parameter θ. The result, e^{θx}·f(x), is not a density and needs normalization to define a new density depending on the tilting parameter θ:

f_θ(x) ≐ e^{θx} f(x) / ∫ e^{θy} f(y) dy.        (8.22)
We immediately recognize the denominator of this ratio as the moment generating function M_X(θ). By recalling the definition of the cumulant generating function, we may rewrite Eq. (8.22) as

f_θ(x) = e^{θx − ψ_X(θ)} f(x).        (8.23)
This relationship provides us with the likelihood ratio that we need to apply importance sampling. In fact, from Eq. (8.14) we know that the expected value E[h(X)] under density f(·) can be rewritten as

E[h(X)] = E_θ[h(X)·f(X)/f_θ(X)] = E_θ[h(X)·e^{−θX + ψ_X(θ)}],        (8.24)

where E_θ[·] denotes the expectation under the tilted density f_θ(·).
Thus, in order to apply importance sampling with exponential tilting, we have to:

1. Identify the distribution corresponding to the tilted density f_θ(·), so that we are able to sample from it.
2. Select a suitable value of the tilting parameter θ.
3. Correct each observation h(X_i) by the likelihood ratio e^{−θX_i + ψ_X(θ)}.
In order to accomplish the first step above, we should ask what kind of random variable we obtain if we tilt a distribution exponentially. The following example illustrates yet another nice property of the normal distribution.
Let us find the density of a tilted normal variable X ~ N(μ, σ²):

f_θ(x) = e^{θx − μθ − σ²θ²/2} · (1/(√(2π)·σ)) e^{−(x − μ)²/(2σ²)}.

By putting the exponentials together, completing squares, and rearranging, we end up with

f_θ(x) = (1/(√(2π)·σ)) exp( −(x − μ − θσ²)² / (2σ²) ),

which shows that the tilted variable is again normal, but with a shifted expected value: X_θ ~ N(μ + θσ², σ²).
The next example deals with the case in which, in order to run a Monte Carlo simulation, we have to generate a sequence of normal variables.
Consider a simulation which is driven by a sequence of n i.i.d. variables X_1, …, X_n. Thanks to independence, the likelihood ratio is related to a product of densities, which can be expressed in terms of the cumulant generating function:

∏_{i=1}^n f(X_i)/f_θ(X_i) = ∏_{i=1}^n e^{−θX_i + ψ_X(θ)} = exp( −θ·Σ_{i=1}^n X_i + n·ψ_X(θ) ).        (8.25)
Now the open question is how to select a suitable tilting parameter θ. Typically, problem-dependent arguments are used. We have seen an example when dealing with a deeply out-of-the-money call option, where we argued that, under the importance sampling measure, the expected value of the underlying asset price at maturity should be the option strike price. Hence, let us say that, using such intuitive arguments, we conclude that the expected value of X under the tilted distribution should be Eθ[X] = β. To accomplish this objective, we rely on a useful result. If we take the derivative of the cumulant generating function with respect to θ, we find
ψ′_X(θ) = M′_X(θ)/M_X(θ) = [∫ x e^{θx} f(x) dx] / M_X(θ) = ∫ x f_θ(x) dx = E_θ[X],        (8.26)
where we have used Eq. (8.24) the other way around. Therefore, we may solve the equation

ψ′_X(θ) = β
to find the corresponding tilting parameter. The material of this section will play a fundamental role in Section 13.4, where we use importance sampling to estimate a risk measure, value-at-risk, by Monte Carlo simulation.
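For instance, for X ~ N(μ, σ²), Eq. (8.21) gives ψ′_X(θ) = μ + σ²θ, so that requiring E_θ[X] = β yields θ = (β − μ)/σ². The following minimal Python sketch (the function name and parameters are our illustrative choices) uses this recipe to estimate a small tail probability P{X > c}, tilting so that E_θ[X] = c:

```python
import math
import numpy as np

def tilted_tail_prob(c, m, seed=None):
    """Estimate P{X > c} for X ~ N(0, 1) by exponential tilting with E_theta[X] = c."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = 0.0, 1.0
    theta = (c - mu) / sigma2                   # solves psi'(theta) = mu + sigma2*theta = c
    psi = mu * theta + 0.5 * sigma2 * theta**2  # cumulant generating function, Eq. (8.21)
    # the tilted variable is N(mu + theta*sigma2, sigma2), as shown for the tilted normal
    x = rng.normal(mu + theta * sigma2, math.sqrt(sigma2), size=m)
    lik = np.exp(-theta * x + psi)              # likelihood ratio f(x)/f_theta(x)
    return float(np.mean((x > c) * lik))
```

For c = 4, the exact value 1 − Φ(4) ≈ 3.17 × 10⁻⁵ is recovered with a small relative error, whereas crude Monte Carlo would need an enormous sample just to observe a handful of exceedances.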
1 I. Beichl and F. Sullivan. The importance of importance sampling. Computing in Science and Engineering, 1:71–73, March–April 1999.
2 L. Clewlow and C. Strickland. Implementing Derivatives Models. Wiley, Chichester, West Sussex, England, 1998.
3 P. Glasserman. Monte Carlo Methods in Financial Engineering. Springer, New York, 2004.
4 D.P. Kroese, T. Taimre, and Z.I. Botev. Handbook of Monte Carlo Methods. Wiley, Hoboken, NJ, 2011.
5 C.P. Robert and G. Casella. Introducing Monte Carlo Methods with R. Springer, New York, 2011.
6 R.Y. Rubinstein. Simulation and the Monte Carlo Method. Wiley, Chichester, 1981.
7 H. Wang. Monte Carlo Simulation with Applications in Finance. CRC Press, Boca Raton, FL, 2012.
1See Section 3.5 for a probabilistic view of linear regression.
2This example is borrowed from [5, pp. 107–108].
3Readers with a background in stochastic calculus would probably use the term “Radon–Nikodym derivative.”
4See, e.g., [6, p. 122] to see how to deal with a generic function h.
5To avoid notational clutter, here we denote the density of a random variable X by f(x) rather than fX(x).