13.8 Careful sample placement

A classic and effective family of techniques for variance reduction is based on the careful placement of samples in order to better capture the features of the integrand (or, more accurately, to be less likely to miss important features). These techniques are used extensively in pbrt. Indeed, one of the tasks of Samplers in Chapter 7 was to generate well-distributed samples for use by the integrators for just this reason, although at the time we offered only an intuitive sense of why this was worthwhile. Here we will justify that extra work in the context of Monte Carlo integration.

13.8.1 Stratified sampling

Stratified sampling was first introduced in Section 7.3, and we now have the tools to motivate its use. Stratified sampling works by subdividing the integration domain Λ into n nonoverlapping regions Λ1, Λ2, …, Λn. Each region is called a stratum, and they must completely cover the original domain:

$$\bigcup_{i=1}^{n} \Lambda_i = \Lambda.$$

To draw samples from Λ, we will draw ni samples from each Λi, according to densities pi inside each stratum. A simple example is supersampling a pixel. With stratified sampling, the area around a pixel is divided into a k × k grid, and a sample is drawn from each grid cell. This is better than taking k² random samples, since the sample locations are less likely to clump together. Here we will show why this technique reduces variance.
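For instance, the following standalone C++ sketch (not pbrt code; pbrt's StratifiedSampler in Chapter 7 serves this role) generates one jittered sample per cell of a k × k grid, alongside the purely random alternative for comparison:

    #include <random>
    #include <vector>

    struct Point2 { double x, y; };

    // Jittered (stratified) supersampling: one sample per cell of a k x k grid.
    std::vector<Point2> StratifiedPixelSamples(int k, std::mt19937 &rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<Point2> samples;
        samples.reserve(k * k);
        for (int y = 0; y < k; ++y)
            for (int x = 0; x < k; ++x)
                samples.push_back({ (x + u(rng)) / k, (y + u(rng)) / k });
        return samples;
    }

    // For comparison: k*k independent uniform samples, which may clump together.
    std::vector<Point2> UniformPixelSamples(int k, std::mt19937 &rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<Point2> samples(k * k);
        for (auto &s : samples)
            s = { u(rng), u(rng) };
        return samples;
    }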

Within a single stratum Λi, the Monte Carlo estimate is

$$F_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \frac{f(X_{i,j})}{p_i(X_{i,j})},$$

where Xi,j is the jth sample drawn from density pi. The overall estimate is F = Σi vi Fi, where vi is the fractional volume of stratum i (vi ∈ (0, 1]).

The mean value of the integrand in stratum Λi is

$$\mu_i = E\big[f(X_{i,j})\big] = \frac{1}{v_i} \int_{\Lambda_i} f(x)\,dx,$$

and the variance in this stratum is

$$\sigma_i^2 = \frac{1}{v_i} \int_{\Lambda_i} \left(f(x) - \mu_i\right)^2 dx.$$

Thus, with ni samples in the stratum, the variance of the per-stratum estimator is σi²/ni. This shows that the variance of the overall estimator is

$$V[F] = V\left[\sum_i v_i F_i\right] = \sum_i V[v_i F_i] = \sum_i v_i^2\, V[F_i] = \sum_i \frac{v_i^2 \sigma_i^2}{n_i}.$$

If we make the reasonable assumption that the number of samples ni is proportional to the volume vi, then we have ni = viN, and the variance of the overall estimator is

$$V[F] = \frac{1}{N}\sum_i v_i \sigma_i^2.$$

To compare this result to the variance without stratification, we note that choosing an unstratified sample is equivalent to choosing a random stratum I according to the discrete probability distribution defined by the volumes vi and then choosing a random sample X in ΛI. In this sense, X is chosen conditionally on I, so it can be shown using conditional probability that

$$V[F] = E_I\big[V_X[F]\big] + V_I\big[E_X[F]\big] = \frac{1}{N}\left[\sum_i v_i \sigma_i^2 + \sum_i v_i \left(\mu_i - Q\right)^2\right],$$

where Q is the mean of f over the whole domain Λ. See Veach (1997) for a derivation of this result.
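In outline (a sketch only, assuming as above that Λ has unit volume and that the unstratified sample is placed uniformly), the law of total variance applied to the stratum index I and the conditionally chosen sample X gives, for a single sample,

$$V[f(X)] = E_I\big[V_{X\mid I}[f(X)]\big] + V_I\big[E_{X\mid I}[f(X)]\big] = \sum_i v_i \sigma_i^2 + \sum_i v_i (\mu_i - Q)^2,$$

since the stratum I is chosen with probability vi, the conditional variance within stratum i is σi², and the conditional mean μi averages to Q over the strata. Averaging N independent such samples divides this by N, which yields the expression above.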

There are two things to notice about this expression. First, we know that the right-hand sum must be nonnegative, since variance is always nonnegative. Second, it demonstrates that stratified sampling can never increase variance. Stratification always reduces variance unless the right-hand sum is exactly 0, which happens only when the function f has the same mean over each stratum Λi. For stratified sampling to work best, we would like to maximize the right-hand sum, so it is best to make the strata have means that are as unequal as possible. This explains why compact strata are desirable if one does not know anything about the function f: if the strata are wide, they will contain more variation and will have μi closer to the true mean Q.

Figure 13.17 shows the effect of using stratified sampling versus a uniform random distribution for sampling ray directions for glossy reflection. There is a reasonable reduction in variance at essentially no cost in running time.

Figure 13.17 Variance is higher and the image noisier (a) when random sampling is used to compute the effect of glossy reflection than (b) when a stratified distribution of sample directions is used instead. (Compare the edges of the highlights on the ground, for example.)

The main downside of stratified sampling is that it suffers from the same “curse of dimensionality” as standard numerical quadrature. Full stratification in D dimensions with S strata per dimension requires SD samples, which quickly becomes prohibitive. Fortunately, it is often possible to stratify some of the dimensions independently and then randomly associate samples from different dimensions, as was done in Section 7.3. Choosing which dimensions are stratified should be done in a way that stratifies dimensions that tend to be most highly correlated in their effect on the value of the integrand (Owen 1998). For example, for the direct lighting example in Section 13.7.1, it is far more effective to stratify the (x, y) pixel positions and to stratify the (θ, ϕ) ray direction; stratifying (x, θ) and (y, ϕ) would almost certainly be ineffective.

Another solution to the curse of dimensionality that has many of the same advantages of stratification is to use Latin hypercube sampling (also introduced in Section 7.3), which can generate any number of samples independent of the number of dimensions. Unfortunately, Latin hypercube sampling isn’t as effective as stratified sampling at reducing variance, especially as the number of samples taken becomes large. Nevertheless, Latin hypercube sampling is provably no worse than uniform random sampling and is often much better.
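In code, Latin hypercube sampling is only a few lines. The following standalone sketch (pbrt's own version appears with the Samplers in Chapter 7; the function name here is ours) stratifies each dimension into nSamples intervals and then randomly shuffles the association between dimensions:

    #include <algorithm>
    #include <random>
    #include <vector>

    // Latin hypercube sampling: nSamples points in nDim dimensions. Each
    // dimension is stratified into nSamples intervals, and the per-dimension
    // strata are associated randomly by shuffling, so any number of samples can
    // be generated regardless of the number of dimensions.
    std::vector<std::vector<double>> LatinHypercube(int nSamples, int nDim,
                                                    std::mt19937 &rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<std::vector<double>> samples(nSamples,
                                                 std::vector<double>(nDim));
        for (int d = 0; d < nDim; ++d) {
            // One jittered sample in each of the nSamples strata of dimension d.
            std::vector<double> column(nSamples);
            for (int i = 0; i < nSamples; ++i)
                column[i] = (i + u(rng)) / nSamples;
            // Randomly permute the strata for this dimension.
            std::shuffle(column.begin(), column.end(), rng);
            for (int i = 0; i < nSamples; ++i)
                samples[i][d] = column[i];
        }
        return samples;
    }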

13.8.2 Quasi Monte Carlo

The low-discrepancy sampling techniques introduced in Chapter 7 are the foundation of a branch of Monte Carlo called quasi Monte Carlo. The key component of quasi–Monte Carlo techniques is that they replace the pseudo-random numbers used in standard Monte Carlo with low-discrepancy point sets generated by carefully designed deterministic algorithms.

The advantage of this approach is that for many integration problems, quasi–Monte Carlo techniques have asymptotically faster rates of convergence than methods based on standard Monte Carlo. Many of the techniques used in regular Monte Carlo algorithms can be shown to work equally well with quasi–random sample points, including importance sampling. Some others (e.g., rejection sampling) cannot. While the asymptotic convergence rates are not generally applicable to the discontinuous integrands in graphics because they depend on smoothness properties in the integrand, quasi Monte Carlo nevertheless generally performs better than regular Monte Carlo for these integrals in practice. The “Further Reading” section at the end of this chapter has more information about this topic.

In pbrt, we have generally glossed over the differences between these two approaches and have localized them in the Samplers in Chapter 7. This introduces the possibility of subtle errors if a Sampler generates quasi–random sample points that an Integrator then improperly uses as part of an implementation of an algorithm that is not suitable for quasi Monte Carlo. As long as Integrators only use these sample points for importance sampling or other techniques that are applicable in both approaches, this isn’t a problem.

13.8.3 Warping samples and distortion

When applying stratified sampling or low-discrepancy sampling to problems like choosing points on light sources for integration for area lighting, pbrt generates a set of samples (u1, u2) over the domain [0, 1)2 and then uses algorithms based on the transformation methods introduced in Sections 13.5 and 13.6 to map these samples to points on the light source. Implicit in this process is the expectation that the transformation to points on the light source will generally preserve the stratification properties of the samples from [0, 1)2—in other words, nearby samples should map to nearby positions on the surface of the light, and faraway samples should map to far-apart positions on the light. If the mapping does not preserve this property, then the benefits of stratification are lost.
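For instance, a parallelogram-shaped area light can be sampled with a simple affine warp of (u1, u2). The sketch below (the names are ours, not pbrt's light interface, which is introduced in Chapter 14) maps nearby samples in the unit square to nearby points on the light, stretched by the lengths of the two edges:

    struct Vector3 { double x, y, z; };

    // Warp a [0,1)^2 sample to a point on a parallelogram light source defined
    // by a corner point and two edge vectors. Nearby (u1, u2) samples map to
    // nearby points on the light, so stratification of the input samples is
    // preserved up to the stretching introduced by the edge lengths.
    Vector3 SampleQuadLight(double u1, double u2, const Vector3 &corner,
                            const Vector3 &edge0, const Vector3 &edge1) {
        return { corner.x + u1 * edge0.x + u2 * edge1.x,
                 corner.y + u1 * edge0.y + u2 * edge1.y,
                 corner.z + u1 * edge0.z + u2 * edge1.z };
    }

If the two edge lengths are very different, as with the skinny quadrilateral in Figure 13.18, this stretching is exactly what degrades a stratified pattern while leaving a (0, 2)-sequence well distributed.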

Shirley’s square-to-circle mapping (Figure 13.13) preserves stratification more effectively than the straightforward mapping (Figure 13.12), which has less compact strata away from the center. This issue also explains why low-discrepancy sequences are generally more effective than stratified patterns in practice: they are more robust with respect to preserving their good distribution properties after being transformed to other domains. Figure 13.18 shows what happens when a set of 16 well-distributed sample points are transformed to be points on a skinny quadrilateral by scaling them to cover its surface; samples from a (0, 2)-sequence remain well distributed, but samples from a stratified pattern fare much less well.

Figure 13.18 (a) Transforming a 4 × 4 stratified sampling pattern to points on a long and thin quadrilateral light source effectively gives less than 16 well-distributed samples; stratification in the vertical direction is not helpful. (b) Samples from a (0, 2)-sequence remain well distributed even after this transformation.

13.9 Bias

Another approach to variance reduction is to introduce bias into the computation: sometimes knowingly computing an estimate that doesn’t actually have an expected value equal to the desired quantity can nonetheless lead to lower variance. An estimator is unbiased if its expected value is equal to the correct answer. If not, the difference

$$\beta = E[F] - \int f(x)\,dx$$

is the amount of bias.

Kalos and Whitlock (1986, pp. 36–37) gave the following example of how bias can sometimes be desirable. Consider the problem of computing an estimate of the mean value of a uniform distribution Xi ~ p over the interval from 0 to 1. One could use the estimator

$$\frac{1}{N}\sum_{i=1}^{N} X_i,$$

or one could use the biased estimator

$$\frac{1}{2}\,\max\left(X_1, X_2, \ldots, X_N\right).$$

The first estimator is in fact unbiased but has variance with order O(N⁻¹). The second estimator’s expected value is

$$0.5\,\frac{N}{N+1} \ne 0.5,$$

so it is biased, although its variance is O(N⁻²), which is much better.
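Both estimators are easy to compare numerically. The following standalone sketch (not pbrt code) computes one instance of each; repeating it over many trials shows the much smaller spread of the biased estimator:

    #include <algorithm>
    #include <random>

    // Two estimators of the mean (0.5) of a uniform distribution on [0, 1):
    // the unbiased sample mean, and the biased estimator
    // (1/2) max(X_1, ..., X_N), whose variance falls off as O(N^-2) rather
    // than O(N^-1).
    void EstimateUniformMean(int N, std::mt19937 &rng,
                             double *unbiased, double *biased) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double sum = 0, maxValue = 0;
        for (int i = 0; i < N; ++i) {
            double x = u(rng);
            sum += x;
            maxValue = std::max(maxValue, x);
        }
        *unbiased = sum / N;
        *biased = 0.5 * maxValue;
    }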

The pixel reconstruction method described in Section 7.8 can also be seen as a biased estimator. Considering pixel reconstruction as a Monte Carlo estimation problem, we’d like to compute an estimate of

$$I(x, y) = \int f(x - x', y - y')\, L(x', y')\, dx'\, dy',$$

where I (x, y) is a final pixel value, f (x, y) is the pixel filter function (which we assume here to be normalized to integrate to 1), and L(x, y) is the image radiance function.

Assuming we have chosen image plane samples uniformly, all samples have the same probability density, which we will denote by pc. Thus, the unbiased Monte Carlo estimator of this equation is

$$I(x, y) \approx \frac{1}{N p_c} \sum_{i=1}^{N} f(x - x_i, y - y_i)\, L(x_i, y_i).$$

This gives a different result from that of the pixel filtering equation we used previously, Equation (7.12), which was

$$I(x, y) = \frac{\sum_i f(x - x_i, y - y_i)\, L(x_i, y_i)}{\sum_i f(x - x_i, y - y_i)}.$$

Yet, the biased estimator is preferable in practice because it gives a result with less variance. For example, if all radiance values L(xi, yi) have a value of 1, the biased estimator will always reconstruct an image where all pixel values are exactly 1—clearly a desirable property. However, the unbiased estimator will reconstruct pixel values that are not all 1, since the sum

$$\sum_i f(x - x_i, y - y_i)$$

will generally not be exactly equal to N pc; its value varies with the filter function evaluated at the particular (xi, yi) sample positions used for the pixel. The variance due to this effect leads to an undesirable result in the final image. Even for more complex images, the variance that would be introduced by the unbiased estimator is a more objectionable artifact than the bias from Equation (7.12).
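The difference between the two estimators is easy to see in code. The following standalone sketch (the FilmSample struct and function names are ours, not pbrt's Film interface) evaluates both for a set of film samples:

    #include <vector>

    // A film sample: the filter arguments (x - xi, y - yi) and the radiance.
    struct FilmSample { double dx, dy, L; };

    // Unbiased estimator: divide by the sample count and the constant sample
    // density pc. Its value varies with the sample positions even when L is
    // constant.
    double UnbiasedPixelEstimate(const std::vector<FilmSample> &samples,
                                 double pc, double (*filter)(double, double)) {
        double sum = 0;
        for (const FilmSample &s : samples)
            sum += filter(s.dx, s.dy) * s.L;
        return sum / (samples.size() * pc);
    }

    // Biased estimator of Equation (7.12): normalize by the sum of filter
    // weights. For constant radiance it reconstructs exactly that constant.
    double WeightedPixelEstimate(const std::vector<FilmSample> &samples,
                                 double (*filter)(double, double)) {
        double weightedSum = 0, weightSum = 0;
        for (const FilmSample &s : samples) {
            double w = filter(s.dx, s.dy);
            weightedSum += w * s.L;
            weightSum += w;
        }
        return weightSum != 0 ? weightedSum / weightSum : 0;
    }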

13.10 Importance sampling

Importance sampling is a powerful variance reduction technique that exploits the fact that the Monte Carlo estimator

$$F_N = \frac{1}{N}\sum_{i=1}^{N} \frac{f(X_i)}{p(X_i)}$$

converges more quickly if the samples are taken from a distribution p(x) that is similar to the function f (x) in the integrand. The basic idea is that by concentrating work where the value of the integrand is relatively high, an accurate estimate is computed more efficiently (Figure 13.19).

Figure 13.19 (a) Using a stratified uniform distribution of rays over the hemisphere gives an image with much more variance than (b) applying importance sampling and choosing stratified rays from a distribution based on the BRDF.

For example, suppose we are evaluating the scattering equation, Equation (5.9). Consider what happens when this integral is estimated; if a direction is randomly sampled that is nearly perpendicular to the surface normal, the cosine term will be close to 0. All the expense of evaluating the BSDF and tracing a ray to find the incoming radiance at that sample location will be essentially wasted, as the contribution to the final result will be minuscule. In general, we would be better served if we sampled directions in a way that reduced the likelihood of choosing directions near the horizon. More generally, if directions are sampled from distributions that match other factors of the integrand (the BSDF, the incoming illumination distribution, etc.), efficiency is similarly improved.

As long as the random variables are sampled from a probability distribution that is similar in shape to the integrand, variance is reduced. We will not provide a rigorous proof of this fact but will instead present an informal and intuitive argument. Suppose we’re trying to use Monte Carlo techniques to evaluate some integral ∫ f (x) dx. Since we have freedom in choosing a sampling distribution, consider the effect of using a distribution p(x) ∝ f (x), or p(x) = cf (x). It is trivial to show that normalization forces

$$c = \frac{1}{\int f(x)\,dx}.$$

Finding such a PDF requires that we know the value of the integral, which is what we were trying to estimate in the first place. Nonetheless, for the purposes of this example, if we could sample from this distribution, each estimate would have the value

$$\frac{f(X_i)}{p(X_i)} = \frac{1}{c} = \int f(x)\,dx.$$

Since c is a constant, each estimate has the same value, and the variance is zero! Of course, this is ludicrous since we wouldn’t bother using Monte Carlo if we could integrate f directly. However, if a density p(x) can be found that is similar in shape to f (x), variance decreases.

Importance sampling can increase variance if a poorly chosen distribution is used. To understand the effect of sampling from a PDF that is a poor match for the integrand, consider using the distribution

$$p(x) = \begin{cases} 99.01 & x \in [0, .01) \\ .01 & x \in [.01, 1) \end{cases}$$

to compute the estimate of ∫ f (x) dx where

$$f(x) = \begin{cases} .01 & x \in [0, .01) \\ 1.01 & x \in [.01, 1). \end{cases}$$

By construction, the value of the integral is one, yet using p(x) to draw samples to compute a Monte Carlo estimate will give a terrible result: almost all of the samples will be in the range [0, .01), where the estimator has the value f (x)/p(x) ≈ 0.0001. For any estimate where none of the samples ends up being outside of this range, the result will be very inaccurate, almost 10,000 times smaller than it should be. Even worse is the case where some samples do end up being taken in the range [.01, 1). This will happen rarely, but when it does, we have the combination of a relatively high value of the integrand and a relatively low value of the PDF, f (x)/p(x) = 101. A large number of samples would be necessary to balance out these extremes to reduce variance enough to get a result close to the actual value, 1.
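This behavior is easy to reproduce numerically. The following standalone sketch (not pbrt code) draws samples from p(x) by inverting its CDF and averages f(X)/p(X):

    #include <random>

    // Monte Carlo estimate of the integral of the piecewise-constant f above,
    // drawing samples from the deliberately poor piecewise-constant p by
    // inverting its CDF: a fraction 0.9901 of the samples land in [0, 0.01).
    double EstimateWithBadPdf(int N, std::mt19937 &rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double sum = 0;
        for (int i = 0; i < N; ++i) {
            double xi = u(rng);
            double x, p;
            if (xi < 0.9901) {                       // P{X in [0, .01)} = 99.01 * 0.01
                x = xi / 99.01;                      // inverse CDF on [0, .01)
                p = 99.01;
            } else {
                x = 0.01 + (xi - 0.9901) / 0.01;     // inverse CDF on [.01, 1)
                p = 0.01;
            }
            double f = (x < 0.01) ? 0.01 : 1.01;
            sum += f / p;
        }
        return sum / N;
    }

Working through the integrals shows that the per-sample variance of this estimator is roughly 100, compared to roughly 0.01 for uniform sampling of f, so this poorly chosen PDF needs on the order of 10,000 times as many samples for a comparably accurate estimate.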

Fortunately, it’s not too hard to find good sampling distributions for importance sampling for many integration problems in graphics. As we implement Integrators in the next three chapters, we will derive a variety of sampling distributions for BSDFs, lights, cameras, and participating media. In many cases, the integrand is the product of more than one function. It can be difficult to construct a PDF that is similar to the complete product, but finding one that is similar to one of the multiplicands is still helpful.

In practice, importance sampling is one of the most frequently used variance reduction techniques in rendering, since it is easy to apply and is very effective when good sampling distributions are used.

13.10.1 Multiple importance sampling

Monte Carlo provides tools to estimate integrals of the form ∫ f (x) dx. However, we are frequently faced with integrals that are the product of two or more functions: ∫ f (x)g(x) dx. If we have an importance sampling strategy for f (x) and a strategy for g(x), which should we use? (Assume that we are not able to combine the two sampling strategies to compute a PDF that is proportional to the product f (x)g(x) that can itself be sampled easily.) As discussed above, a bad choice of sampling distribution can be much worse than just using a uniform distribution.

For example, consider the problem of evaluating direct lighting integrals of the form

$$L_o(p, \omega_o) = \int_{S^2} f(p, \omega_o, \omega_i)\, L_d(p, \omega_i)\, |\cos\theta_i|\, d\omega_i.$$

If we were to perform importance sampling to estimate this integral according to distributions based on either Ld or fr, one of these two will often perform poorly.

Consider a near-mirror BRDF illuminated by an area light where Ld’s distribution is used to draw samples. Because the BRDF is almost a mirror, the value of the integrand will be close to 0 at all ωi directions except those around the perfect specular reflection direction. This means that almost all of the directions sampled by Ld will have 0 contribution, and variance will be quite high. Even worse, as the light source grows large and a larger set of directions is potentially sampled, the value of the PDF decreases, so for the rare directions where the BRDF is non-0 for the sampled direction we will have a large integrand value being divided by a small PDF value. While sampling from the BRDF’s distribution would be a much better approach to this particular case, for diffuse or glossy BRDFs and small light sources, sampling from the BRDF’s distribution can similarly lead to much higher variance than sampling from the light’s distribution.

Unfortunately, the obvious solution of taking some samples from each distribution and averaging the two estimators is little better. Because the variance is additive in this case, this approach doesn’t help—once variance has crept into an estimator, we can’t eliminate it by adding it to another estimator even if it itself has low variance.

Multiple importance sampling (MIS) addresses exactly this issue, with a simple and easy-to-implement technique. The basic idea is that, when estimating an integral, we should draw samples from multiple sampling distributions, chosen in the hope that at least one of them will match the shape of the integrand reasonably well, even if we don’t know which one this will be. MIS provides a method to weight the samples from each technique that can eliminate large variance spikes due to mismatches between the integrand’s value and the sampling density. Specialized sampling routines that only account for unusual special cases are even encouraged, as they reduce variance when those cases occur, with relatively little cost in general. See Figure 14.13 in Chapter 14 to see the reduction in variance from using MIS to compute reflection from direct illumination compared to sampling either just the BSDF or the light’s distribution by itself.

If two sampling distributions pf and pg are used to estimate the value of ∫ f (x)g(x) dx, the new Monte Carlo estimator given by MIS is

$$\frac{1}{n_f}\sum_{i=1}^{n_f} \frac{f(X_i)\,g(X_i)\,w_f(X_i)}{p_f(X_i)} + \frac{1}{n_g}\sum_{j=1}^{n_g} \frac{f(Y_j)\,g(Y_j)\,w_g(Y_j)}{p_g(Y_j)},$$

where nf is the number of samples taken from the pf distribution method, ng is the number of samples taken from pg, and wf and wg are special weighting functions chosen such that the expected value of this estimator is the value of the integral of f (x)g(x).

The weighting functions take into account all of the different ways that a sample Xi or Yj could have been generated, rather than just the particular one that was actually used. A good choice for this weighting function is the balance heuristic:

$$w_s(x) = \frac{n_s\, p_s(x)}{\sum_i n_i\, p_i(x)}.$$

The balance heuristic is a provably good way to weight samples to reduce variance.

Consider the effect of this term for the case where a sample X has been drawn from the pf distribution at a point where the value pf(X) is relatively low. Assuming that pf is a good match for the shape of f (x), then the value of f (X) will also be relatively low. But suppose that g(X) has a relatively high value. The standard importance sampling estimate

$$\frac{f(X)\,g(X)}{p_f(X)}$$

will have a very large value due to pf (X) being small, and we will have high variance. With the balance heuristic, the contribution of X will be

$$\frac{f(X)\,g(X)\,w_f(X)}{p_f(X)} = \frac{f(X)\,g(X)}{p_f(X)}\,\frac{n_f\,p_f(X)}{n_f\,p_f(X) + n_g\,p_g(X)} = \frac{f(X)\,g(X)\,n_f}{n_f\,p_f(X) + n_g\,p_g(X)}.$$

As long as pg’s distribution is a reasonable match for g(x), then the denominator won’t be too small thanks to the ngpg(X) term, and the huge variance spike is eliminated, even though X was sampled from a distribution that was in fact a poor match for the integrand. The fact that another distribution will also be used to generate samples and that this new distribution will likely find a large value of the integrand at X are brought together in the weighting term to reduce the variance problem.

Here we provide an implementation of the balance heuristic for the specific case of two distributions pf and pg. We will not need a more general multidistribution case in pbrt.

〈Sampling Inline Functions〉 +≡

  inline Float BalanceHeuristic(int nf, Float fPdf, int ng, Float gPdf) {
      return (nf * fPdf) / (nf * fPdf + ng * gPdf);
  }

In practice, the power heuristic often reduces variance even further. For an exponent β, the power heuristic is

$$w_s(x) = \frac{\left(n_s\, p_s(x)\right)^\beta}{\sum_i \left(n_i\, p_i(x)\right)^\beta}.$$

Veach determined empirically that β = 2 is a good value. We have β = 2 hard-coded into the implementation here.

〈Sampling Inline Functions〉 +≡

  inline Float PowerHeuristic(int nf, Float fPdf, int ng, Float gPdf) {
      Float f = nf * fPdf, g = ng * gPdf;
      return (f * f) / (f * f + g * g);
  }
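To make the weighting concrete, here is a small self-contained C++ example, not pbrt code: the functions f and g, their sampling densities, and the inverse CDFs are chosen purely for illustration. It applies the balance heuristic to a toy 1D product integral whose exact value is 1/12:

    #include <cmath>
    #include <random>

    using Float = double;  // stand-in for pbrt's Float

    // The balance heuristic from the text, restated so the example is
    // self-contained.
    inline Float BalanceHeuristic(int nf, Float fPdf, int ng, Float gPdf) {
        return (nf * fPdf) / (nf * fPdf + ng * gPdf);
    }

    // MIS estimate of the integral of f(x) = x^2 times g(x) = 1 - x over
    // [0, 1) (exact value 1/12). One strategy samples proportionally to f,
    // the other proportionally to g; each sample is weighted by the balance
    // heuristic.
    Float MISEstimate(int n, std::mt19937 &rng) {
        std::uniform_real_distribution<Float> u(0.0, 1.0);
        auto f  = [](Float x) { return x * x; };
        auto g  = [](Float x) { return 1 - x; };
        auto pf = [](Float x) { return 3 * x * x; };    // proportional to f
        auto pg = [](Float x) { return 2 * (1 - x); };  // proportional to g
        Float sum = 0;
        for (int i = 0; i < n; ++i) {
            // One sample from pf: inverse CDF of 3x^2 is x = xi^(1/3).
            Float x = std::cbrt(u(rng));
            sum += f(x) * g(x) * BalanceHeuristic(n, pf(x), n, pg(x)) / pf(x);
            // One sample from pg: inverse CDF of 2(1-x) is x = 1 - sqrt(1-xi).
            Float y = 1 - std::sqrt(1 - u(rng));
            sum += f(y) * g(y) * BalanceHeuristic(n, pg(y), n, pf(y)) / pg(y);
        }
        return sum / n;  // averages both strategies, matching the MIS estimator
    }

pbrt's direct lighting routines in Chapter 14 follow the same pattern, with the light's and the BSDF's sampling distributions in place of pf and pg.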

Further reading

Many books have been written on Monte Carlo integration. Hammersley and Handscomb (1964), Spanier and Gelbard (1969), and Kalos and Whitlock (1986) are classic references. More recent books on the topic include those by Fishman (1996) and Liu (2001). Chib and Greenberg (1995) have written an approachable but rigorous introduction to the Metropolis algorithm. The Monte Carlo and Quasi Monte Carlo Web site is a useful gateway to recent work in the field (www.mcqmc.org).

Good general references about Monte Carlo and its application to computer graphics are the theses by Lafortune (1996) and Veach (1997). Dutré’s Global Illumination Compendium (2003) also has much useful information related to this topic. The course notes from the Monte Carlo ray-tracing course at SIGGRAPH have a wealth of practical information (Jensen et al. 2001a, 2003).

The square-to-disk mapping was described by Shirley and Chiu (1997). The implementation here benefits from observations by Cline and “franz” that the logic could be simplified considerably from the original algorithm (Shirley 2011). Marques et al. (2013) note that well-distributed samples on [0, 1)2 still suffer some distortion when they are mapped to the sphere of directions and show how to generate low-discrepancy samples on the unit sphere.

Steigleder and McCool (2003) described an alternative to the multidimensional sampling approach from Section 13.6.7: they linearized 2D and higher dimensional domains into 1D using a Hilbert curve and then sampled using 1D samples over the 1D domain. This leads to a simpler implementation that still maintains desirable stratification properties of the sampling distribution, thanks to the spatial coherence preserving properties of the Hilbert curve.

Lawrence et al. (2005) described an adaptive representation for CDFs, where the CDF is approximated with a piecewise linear function with fewer, but irregularly spaced, vertices compared to the complete CDF. This approach can substantially reduce storage requirements and improve lookup efficiency, taking advantage of the fact that large ranges of the CDF may be efficiently approximated with a single linear function.

Cline et al. (2009) observed that the time spent in binary searches for sampling from precomputed distributions (like Distribution1D does) can take a meaningful amount of execution time. (Indeed, pbrt spends nearly 7% of the time when rendering the car scene lit by an InfiniteAreaLight in the Distribution1D::SampleContinuous() method, which is used by InfiniteAreaLight::Sample_Li().) They presented two improved methods for doing this sort of search: the first stores a lookup table with n integer values, indexed by ⌊nξ⌋, which gives the first entry in the CDF array that is less than or equal to ξ. Starting a linear search from this offset in the CDF array can be much more efficient than a complete binary search over the entire array. They also presented a method based on approximating the inverse CDF as a piecewise linear function of ξ and thus enabling constant-time lookups at a cost of some accuracy (and thus some additional variance).
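A sketch of the first of these ideas, assuming a discrete CDF stored as a monotonically increasing array that starts at 0 and ends at 1 (this is not Cline et al.'s or pbrt's actual code):

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Guide table: for each of n equal subintervals of [0, 1), store an offset
    // in the CDF array from which a short linear search can proceed.
    struct GuideTable {
        std::vector<double> cdf;
        std::vector<int> guide;

        GuideTable(std::vector<double> c, int n) : cdf(std::move(c)), guide(n) {
            int offset = 0;
            for (int i = 0; i < n; ++i) {
                double xiMin = double(i) / n;  // left edge of the i-th subinterval
                while (offset + 1 < (int)cdf.size() - 1 && cdf[offset + 1] <= xiMin)
                    ++offset;
                guide[i] = offset;
            }
        }

        // Return i such that cdf[i] <= xi < cdf[i + 1], starting the linear
        // search at the guide entry for xi instead of binary-searching the
        // whole array.
        int Lookup(double xi) const {
            int entry = std::min((int)(xi * guide.size()),
                                 (int)guide.size() - 1);
            int i = guide[entry];
            while (i + 1 < (int)cdf.size() - 1 && cdf[i + 1] <= xi) ++i;
            return i;
        }
    };

When the guide table size is proportional to the number of CDF entries, each lookup only walks the entries that fall inside a single subinterval, so the expected search length is constant.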

The alias method is a technique that makes it possible to sample from discrete distributions in O(1) time (Walker 1974, 1977); this is much better than the O(log n) of the Distribution1D class when it is used for sampling discrete distributions. The downside of this approach is that it does not preserve sample stratification. See Schwarz’s writeup (2011) for information about implementing this approach well.

Arithmetic coding offers another interesting way to approach sampling from distributions (MacKay 2003, p. 118; Piponi 2012). If we have a discrete set of probabilities we’d like to generate samples from, one way to approach the problem is to encode the CDF as a binary tree where each node splits the [0, 1) interval at some point and where, given a random sample ξ, we determine which sample value it corresponds to by traversing the tree until we reach the leaf node for its sample value. Ideally, we’d like to have leaf nodes that represent higher probabilities be higher up in the tree, so that it takes fewer traversal steps to find them (and thus, those more frequently generated samples are found more quickly). Looking at the problem from this perspective, it can be shown that the optimal structure of such a tree is given by Huffman coding, which is normally used for compression.

Mitchell (1996b) wrote a paper that investigates the effects of stratification for integration problems in graphics (including the 2D problem of pixel antialiasing). In particular, he investigated the connection between the complexity of the function being integrated and the effect of stratification. In general, the smoother or simpler the function, the more stratification helps: for pixels with smooth variation over their areas or with just a few edges passing through them, stratification helps substantially, but as the complexity in a pixel is increased, the gain from stratification is reduced. Nevertheless, because stratification never increases variance, there’s no reason not to do it.

Starting with Durand et al.’s work (2005), a number of researchers have approached the analysis of light transport and variance from Monte Carlo integration for rendering using Fourier analysis. See Pilleboue et al.’s paper (2015) for recent work in this area, including references to previous work. Among other results, they show that Poisson disk patterns give higher variance than simple jittered patterns. They found that the blue noise pattern of de Goes et al. (2012) was fairly effective. Other work in this area includes the paper by Subr and Kautz (2013).

Multiple importance sampling was developed by Veach and Guibas (Veach and Guibas 1995; Veach 1997). Normally, a pre-determined number of samples are taken using each sampling technique; see Pajot et al. (2011) and Lu et al. (2013) for approaches to adaptively distributing the samples over strategies in an effort to reduce variance by choosing those that are the best match to the integrand.

Exercises

13.1 Write a program that compares Monte Carlo and one or more alternative numerical integration techniques. Structure this program so that it is easy to replace the particular function being integrated. Verify that the different techniques compute the same result (given a sufficient number of samples for each of them). Modify your program so that it draws samples from distributions other than the uniform distribution for the Monte Carlo estimate, and verify that it still computes the correct result when the correct estimator, Equation (13.3), is used. (Make sure that any alternative distributions you use have nonzero probability of choosing any value of x where f (x) > 0.)

13.2 Write a program that computes Monte Carlo estimates of the integral of a given function. Compute an estimate of the variance of the estimates by taking a series of trials and using Equation (13.2) to compute variance. Demonstrate numerically that variance decreases at a rate of O(1/n).

13.3 The depth-of-field code for the ProjectiveCamera in Section 6.2.3 uses the ConcentricSampleDisk() function to generate samples on the circular lens, since this function gives less distortion than UniformSampleDisk(). Try replacing it with UniformSampleDisk(), and measure the difference in image quality. For example, you might want to compare the error in images from using each approach and a relatively low number of samples to a highly sampled reference image.
Does ConcentricSampleDisk() in fact give less error in practice? Does it make a difference if a relatively simple scene is being rendered versus a very complex scene?

13.4 Modify the Distribution1D implementation to use the adaptive CDF representation described by Lawrence et al. (2005), and experiment with how much more compact the CDF representation can be made without causing image artifacts. (Good test scenes include those that use InfiniteAreaLights, which use the Distribution2D and, thus, Distribution1D for sampling.) Can you measure an improvement in rendering speed due to more efficient searches through the approximated CDF?

13.5 One useful technique not discussed in this chapter is the idea of adaptive density distribution functions that dynamically change the sampling distribution as samples are taken and information is available about the integrand’s actual distribution as a result of evaluating the values of these samples. The standard Monte Carlo estimator can be written to work with a nonuniform distribution of random numbers used in a transformation method to generate samples Xi,

$$\frac{1}{N}\sum_{i}^{N} \frac{f(X_i)}{p(X_i)\, p_r(\xi_i)},$$

just like the transformation from one sampling density to another. This leads to a useful sampling technique, where an algorithm can track which samples ξi were effective at finding large values of f (x) and which weren’t and then adjusts probabilities toward the effective ones (Booth 1986). A straightforward implementation would be to split [0, 1] into bins of fixed width, track the average value of the integrand in each bin, and use this to change the distribution of ξi samples.
Investigate data structures and algorithms that support such sampling approaches and choose a sampling problem in pbrt to apply them to. Measure how well this approach works for the problem you selected. One difficulty with methods like this is that different parts of the sampling domain will be the most effective at different times in different parts of the scene. For example, trying to adaptively change the sampling density of points over the surface of an area light source has to contend with the fact that, at different parts of the scene, different parts of the area light may be visible and thus be the important areas. You may find it useful to read Cline et al.’s paper (2008) on this topic.