We have introduced Bayesian parameter estimation in Section 4.6, as a possible way to overcome some limitations of orthodox statistics. The essence of the approach can be summarized as follows:
posterior ∝ prior × likelihood,
where the prior collects information, possibly of a subjective nature, that we have about a set of parameters before observing new data; the likelihood measures how likely the observed data are, on the basis of our current knowledge or belief; and the posterior merges the two above pieces of information in order to provide us with an updated picture. This is an informal restatement of Eq. (4.12), which we rewrite below for convenience:
pn(θ1, …, θq | x1, …, xn) ∝ p(θ1, …, θq) × fn(x1, …, xn | θ1, …, θq).   (14.1)
The parameters θj, j = 1, …, q, are regarded as random variables within the Bayesian framework, and p(θ1, …, θq) is their prior. The prior can be a probability density function (PDF) or a probability mass function (PMF), or a mixed object, depending on the nature of the parameters involved. For the sake of simplicity, in this chapter we will not deal systematically with both cases, and we will refer to either one according to convenience. We have a set of independent observations, i.e., realizations of a random variable taking values xi, i = 1, …, n. Note that here we are considering n observations of a scalar random variable, whose distribution depends on q parameters; the concepts can be generalized to multivariate distributions. We know from maximum-likelihood estimation that an important role is played by the likelihood function fn(x1, …, xn | θ1, …, θq), for given parameters. Since we are considering independent observations, the likelihood is the product of PDF values corresponding to each observation xi. Bayes’ theorem yields the posterior density pn(θ1, …, θq | x1, …, xn), summarizing our knowledge about the parameters, merging the possibly subjective prior and the empirical evidence.
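As a small illustration of the role played by the likelihood, the following R snippet evaluates it as a product of PDF values, or, more conveniently, the log-likelihood as a sum; for the sake of the example, we assume i.i.d. normal observations with unknown mean θ and unit standard deviation, and the numbers are arbitrary.

# likelihood of i.i.d. normal observations x, as a function of the mean theta
# (unit standard deviation assumed for simplicity)
lik <- function(theta, x) prod(dnorm(x, mean = theta, sd = 1))
loglik <- function(theta, x) sum(dnorm(x, mean = theta, sd = 1, log = TRUE))
set.seed(55555)
x <- rnorm(20, mean = 3)     # hypothetical sample
lik(3, x)                    # typically much larger than ...
lik(0, x)                    # ... the likelihood of an implausible mean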
Armed with the posterior density, we may proceed to find estimates of the parameters θ1, …, θq, e.g., by computing their expected values. We may also find credible intervals, which are the Bayesian counterpart of confidence intervals, and test hypotheses about the parameters. All of this requires knowledge of the posterior density, but there is a missing piece in Eq. (14.1). There, we only state that the posterior is proportional to the product of prior and likelihood; to fully specify the posterior density, we should also find a normalization constant B such that the PDF integrates to 1. Hence, we should divide the product of prior and likelihood by
B = ∫ ⋯ ∫ p(θ1, …, θq) fn(x1, …, xn | θ1, …, θq) dθ1 ⋯ dθq.
Unless there is some special structure to be exploited, the calculation of this multidimensional integral is difficult. As we have seen in Section 4.6, there are nice cases in which the form of the posterior is easy to find. If we deal with observations from a normal distribution and the prior is normal, then the posterior is normal too, and figuring out the normalizing constant is easy. When the posterior belongs to the same family of distributions as the prior, for a given kind of likelihood, we speak of conjugate distributions; the prior is also called a conjugate prior. In less lucky cases, one possibility is to find the normalization constant by numerical evaluation of the multidimensional integral, possibly by Monte Carlo methods. However, there is a possibly better idea, if what we need is just the ability to sample from the posterior density. As it turns out, there are ways to sample from a density that is known only up to a constant, and in fact we have already met one: acceptance–rejection.
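For instance, the following R sketch carries out the conjugate update for the mean θ of normal observations with known standard deviation, under a normal prior; the numbers are arbitrary, and the update formulas are the standard ones for this conjugate pair, so that no numerical integration is required.

# conjugate update: normal prior N(mu0, tau0^2) on the mean theta of normal
# data with known standard deviation sigma (illustrative numbers)
set.seed(55555)
sigma <- 2
x <- rnorm(30, mean = 5, sd = sigma)    # hypothetical observations
mu0 <- 0; tau0 <- 10                    # rather vague prior on theta
prec.post <- 1/tau0^2 + length(x)/sigma^2              # posterior precision
mu.post <- (mu0/tau0^2 + sum(x)/sigma^2) / prec.post   # posterior mean
sd.post <- sqrt(1/prec.post)                           # posterior st.dev.
c(mu.post, sd.post)    # the posterior is N(mu.post, sd.post^2)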
In Section 14.1 we show that we may apply acceptance–rejection sampling when we miss a normalization constant, but the resulting approach need not be the most efficient one. There is a class of sampling approaches, collectively known as Markov chain Monte Carlo (MCMC) methods, that can be used to this purpose. The idea, as the name suggests, is to simulate a Markov chain whose stationary distribution is the posterior density. MCMC methods can be introduced as a tool for computational Bayesian statistics, but they have a wider application range, as they can be used as random sample generators for difficult cases that cannot be tackled using standard approaches. We review a few basic facts about the stationary distribution of Markov chains in Section 14.2. Then, we outline the Metropolis–Hastings algorithm and some of its variations in Section 14.3. Finally, in Section 14.4 we show how the simulated annealing algorithm, a stochastic search algorithm for both global and combinatorial optimization that we introduced in Section 10.3, can be cast and analyzed within the MCMC framework.
Acceptance–rejection sampling is a general-purpose approach to sample from a distribution with a given PDF or PMF, whereas the inverse transform method works with the CDF. As we have seen in Section 5.4, given a target density f(·), we choose an instrumental density g(·), together with a constant c such that f(x) ≤ c·g(x) for every x, and proceed as follows: we sample a candidate value Y from g(·) and a uniform variable U ~ U(0, 1), and we accept Y if U ≤ f(Y)/[c·g(Y)]; otherwise, we reject the candidate and repeat the two steps.
If you look at the proof of the validity of the method, in Section 5.4, you will notice that what really matters is that c is an upper bound on the ratio f/g, and we may also work with an unscaled density f(·), i.e., a density known up to a constant. In the Bayesian framework, we have an unscaled posterior
p(θ) fn(x | θ),
where x is the vector of (given) observations and θ the vector of (random) parameters. To sample a vector of parameters from the unscaled posterior, we have to find a candidate density g(θ) and an upper bound M on the ratio:
p(θ) fn(x | θ) / g(θ) ≤ M   for every θ.
Then, we sample a candidate value θ randomly from g(·), and we accept it with probability
p(θ) fn(x | θ) / [M g(θ)].
Improvements to this basic strategy have been proposed, but they do not solve one basic difficulty: the number of iterations that are needed to accept a sampled value can be large. We know that the expected number of candidates per accepted value grows with the bound M; hence, the closer the instrumental density is to the target, the better. If we are working with a distribution with bounded support, like a beta distribution, we may use a uniform as the instrumental density. However, the choice of the instrumental density is more difficult if the target has unbounded support, in which case the candidate should be a heavy-tailed distribution. Unfortunately, when we are dealing with multiple parameters, acceptance–rejection may require the generation of many proposals before one is accepted. Because of this practical limitation, alternative methods, described in the following, have been proposed. Nevertheless, acceptance–rejection is a viable approach for the case of a single parameter.
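As a minimal illustration, the following R sketch applies acceptance–rejection to a single parameter whose unscaled density is θ²(1 − θ)³ on [0, 1], i.e., a Beta(3, 4) density stripped of its normalization constant; the uniform candidate and the bound M are chosen for illustration only.

# acceptance-rejection from an unscaled density f(theta) = theta^2*(1-theta)^3,
# i.e., a Beta(3,4) density known up to a constant; candidate g = U(0,1)
f.unscaled <- function(theta) theta^2 * (1 - theta)^3
M <- 0.035   # any bound on f/g works; the max of f is about 0.0346 at theta = 0.4
ar.sample <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      cand <- runif(1)                     # candidate from g
      if (runif(1) <= f.unscaled(cand)/M) { out[i] <- cand; break }
    }
  }
  out
}
set.seed(55555)
theta <- ar.sample(10000)
hist(theta, probability = TRUE, nclass = 50)
curve(dbeta(x, 3, 4), add = TRUE)          # check against the true Beta(3,4) density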
Before we introduce MCMC, we need to refresh some essential concepts pertaining to discrete-time Markov chains. MCMC methods are indirect sampling strategies, based on the simulation of a Markov process whose long-run state density is the target density. We have already met Markov processes in the book, since processes described by Itô stochastic differential equations are Markovian. Here, though, we focus on their long-term state distribution. Within a Bayesian framework, the state of the stochastic process is related to the value of the parameters θ. When the parameters are real numbers, we have to deal with a continuous-state process. However, the necessary intuition to understand MCMC methods can be built on the basis of the much simpler discrete-state case, which is referred to as a Markov chain. A discrete-time Markov chain is a process with a state variable Xt, t = 0, 1, 2, …,
taking values on a discrete set, which may be identified with the set of integer numbers, a subset of it, or a finite discrete set (e.g., the credit rating of a bond). We may represent the process as a graph, which is where the name “chain” comes from. A simple example is shown in Fig. 14.1. There, the state space is the set {A, B, C, D}, and nodes correspond to states. Directed arcs are labeled by transition probabilities. For instance, if we are in state C, there is a probability 0.7 of being in state B at the next step, and a probability 0.3 of being in state A. Note that these probabilities depend only on the current state, and not on the whole past history. Indeed, the essential property of Markov processes is related to their limited memory: The only information that is needed to characterize the future system evolution is the current state. Formally, we may express this in terms of the conditional transition probability
P{Xt+1 = j | Xt = i, Xt−1 = it−1, …, X0 = i0} = P{Xt+1 = j | Xt = i}.
FIGURE 14.1 A discrete-state Markov chain.
Hence, we just need to know the one-step transition probabilities to characterize the chain. A further simplification occurs if the process is stationary, i.e., if the transition probabilities do not depend on time. In such a case, we may describe the chain in terms of the transition probabilities
Pij = P{Xt+1 = j | Xt = i},
which may be collected into the one-step transition probability matrix P. For the chain in Fig. 14.1, we have
Such a matrix has rather peculiar properties, related to the fact that after a transition we must land somewhere within the state space, and therefore each and every row adds up to 1:
∑j=1,…,N Pij = 1,   i = 1, …, N,
where for the sake of simplicity we assume a finite state space of size N. Let the row vector π0 ∈ ℝ^N denote the initial state distribution at step 0, i.e.,
πi,0 = P{X0 = i},   i = 1, …, N.
The total probability theorem implies that
πj,1 = P{X1 = j} = ∑i=1,…,N P{X1 = j | X0 = i} · P{X0 = i} = ∑i=1,…,N Pij πi,0,
which can be rewritten in compact form as
π1 = π0 P,
where π1 ∈ ℝ^N is a row vector with components πi,1 = P{X1 = i}, i = 1, …, N. Note how, given the shape of the transition probability matrix, we write the relationship in terms of premultiplication by a row vector. Thanks to the Markov property, we may easily compute multiple-step transition probabilities:
P{X2 = j | X0 = l} = ∑i=1,…,N P{X2 = j, X1 = i | X0 = l}
= ∑i=1,…,N P{X2 = j | X1 = i, X0 = l} · P{X1 = i | X0 = l}
= ∑i=1,…,N P{X2 = j | X1 = i} · P{X1 = i | X0 = l}
= ∑i=1,…,N Pli Pij.
Here, in the first line we have used the fact that the event {X2 = j} can be expressed as the union of the disjoint events {X2 = j, X1 = i}; then we have used conditioning and the Markov property. Since the final expression can be interpreted as an entry in the row-by-column matrix product, the two-step transition probability matrix can be expressed as
P^(2) = P · P = P^2.
By the same token, if we consider an arbitrary number m of steps,
P^(m) = P^m,   so that   πm = π0 P^m.
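In R, multiple-step transition probabilities are easily obtained by matrix multiplication; the 3 × 3 matrix below is a purely hypothetical example, not the chain of Fig. 14.1.

# m-step transition probabilities as powers of a (hypothetical) transition matrix
P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.4, 0.2, 0.4), nrow = 3, byrow = TRUE)
P %*% P                              # two-step transition probabilities
Pm <- diag(3)
for (k in 1:50) Pm <- Pm %*% P       # m-step probabilities, here m = 50
round(Pm, 4)                         # rows are (almost) identical
pi0 <- c(1, 0, 0)                    # start from the first state with certainty
pi0 %*% Pm                           # state distribution after 50 steps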
Now imagine that we start with distribution π0 and we let the system run for a long time. What happens to the limit
lim m→+∞ π0 P^m ?
We might guess that, after such a long time, the memory of the initial state is lost, and that the system reaches a steady-state (or long-run) distribution π. We might also guess that this distribution should be an equilibrium distribution characterized by the system of linear equations
π = πP.   (14.2)
This system can be interpreted by saying that if we start with the state distribution π, the next state has the same distribution. Hence, π is an eigenvector of P associated with the eigenvalue 1. To be precise, since we premultiply P by a row vector, we should use the term left eigenvector. To make sense as a probability distribution, this should be a unit (normalized) and non-negative eigenvector. It turns out that, in fact, we may find such a non-negative eigenvector, provided that some conditions are satisfied by P. Indeed, we cannot take for granted that a stationary distribution exists, independent of the initial state, as a few things may go wrong: the chain may fail to be irreducible, so that some states cannot be reached from some initial states and the long-run behavior depends on where we start (see Fig. 14.2); or the chain may be periodic, so that the state distribution keeps oscillating and no limit exists (see Fig. 14.3).
FIGURE 14.2 A nonirreducible periodic Markov chain.
FIGURE 14.3 A periodic Markov chain.
We will refrain from giving a formal definition of irreducible and aperiodic chains, as intuition is all we need for our limited purposes.
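For a well-behaved (irreducible and aperiodic) chain, the stationary distribution can be computed numerically as the normalized left eigenvector of P associated with the eigenvalue 1, as in the following sketch, which uses the same hypothetical matrix as above.

# stationary distribution as the left eigenvector of P for the eigenvalue 1,
# i.e., the eigenvector of t(P), normalized so that its components add up to 1
P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.4, 0.2, 0.4), nrow = 3, byrow = TRUE)
e <- eigen(t(P))
v <- Re(e$vectors[, which.min(abs(e$values - 1))])
pi.stat <- v / sum(v)
pi.stat                  # stationary distribution
pi.stat %*% P            # check: premultiplying P leaves it unchanged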
Another very useful way to interpret the equilibrium condition of Eq. (14.2) can be obtained by focusing on a single equation,
πi = ∑j=1,…,N πj Pji,
which gives the equilibrium probability πi of state i, and by rewriting it as follows. Since ∑j Pij = 1, we may write
πi ∑j=1,…,N Pij = ∑j=1,…,N πj Pji,
and, by dropping the term corresponding to j = i on both sides,
∑j≠i πi Pij = ∑j≠i πj Pji.
The last equality can be interpreted intuitively in terms of equilibrium of “probability flows”: the left-hand side is the total probability flow out of state i, whereas the right-hand side is the total probability flow from the other states into state i. Thus, equating the two flows for every state i establishes a global balance condition, which is associated with equilibrium. For some Markov chains, a stronger equilibrium condition is satisfied, corresponding to a detailed, rather than global, balance:
πi Pij = πj Pji,   for all i, j.   (14.3)
This condition enforces a probability flow equilibrium for any pair of states, and it is easy to see that if we can find a vector π, whose components add up to 1, satisfying Eq. (14.3), then this vector will also satisfy Eq. (14.2). To prove this, it is sufficient to sum the detailed balance conditions over states i:
∑i=1,…,N πi Pij = ∑i=1,…,N πj Pji = πj ∑i=1,…,N Pji = πj,
which is just the j-th equation of the system π = πP.
The detailed balance conditions are quite interesting for a few reasons:
These concepts are beyond the scope of the book, but from our viewpoint, detailed balance conditions are useful when we have to go the other way around: Given a target equilibrium distribution, how can we build a Markov chain that has that equilibrium distribution? This is what we need in MCMC methods applied to Bayesian statistics, where the target equilibrium distribution is the posterior, with the additional twist that the posterior is known up to a multiplicative constant. This would be difficult in general, but, by taking advantage of detailed balance equations, it can be easily accomplished, as we show in the next section.
In Bayesian inference, we need to sample from the posterior density, which is hard to find computationally, with the exception of conjugate priors. A computational breakthrough was achieved by realizing that we may create a discrete-time Markov process, whose states are values of the parameters of interest, such that its long-run distribution can be shaped according to our needs. In other words, we need to define a state space based on the values of the parameters, and compute suitable transition probabilities. This is a daunting task in general, but it turns out that we may take advantage of the simple structure of time-reversible Markov chains, whose transition probabilities satisfy the detailed balance equations (14.3).
The original idea was devised in 1953 and is known as the Metropolis–Hastings algorithm.1 The intuition is as follows. We want to find a Markov chain with steady-state probabilities πi, for discrete states i = 1, …, N (the idea can be easily adapted to continuous distributions). To this end, we need suitable transition probabilities Pij, satisfying the detailed balance equations. A first point is that whatever Pii we choose, it will do the job, as the trivial balance condition
πi Pii = πi Pii
is clearly satisfied. Unfortunately, in general, the detailed balance condition does not hold for i ≠ j. For instance, if
πi Pij < πj Pji,
then we should restore the balance by rejecting some of the transitions from j to i, whereas all of the transitions from i to j can be accepted. A similar consideration applies if πi Pij > πj Pji. The trick can be done, for any desired steady-state distribution π and any transition matrix P, by adjusting the transition probabilities using a sort of acceptance–rejection strategy. For each pair of states i and j, we adjust the transition probability by multiplying it by an acceptance probability
αij = min{1, (πj Pji) / (πi Pij)}.
Hence, whenever we are in state i and a candidate transition from i to j is randomly selected in the simulation, it is actually accepted with probability αij, which implicitly defines an adjusted transition probability matrix P̃, where
P̃ij = αij Pij,   for i ≠ j,
and
P̃ii = Pii + ∑j≠i (1 − αij) Pij.
It is not difficult to show that the pair consisting of π and P̃ satisfies the detailed balance conditions, since, for i ≠ j,
πi P̃ij = πi Pij αij = πi Pij min{1, (πj Pji)/(πi Pij)} = min{πi Pij, πj Pji} = πj Pji min{(πi Pij)/(πj Pji), 1} = πj P̃ji.
Therefore, we can easily build a Markov chain with the desired long-run distribution by adjusting any candidate transition matrix P. Now, the reader should be a bit puzzled, for a couple of reasons:
The Metropolis–Hastings algorithm takes advantage of the above ideas:
An important point to bear in mind is that, unlike the random sampling approaches that we have described so far in the book, the sample produced by this procedure is not independent, since the visited states are correlated. This must be taken into due account when performing output analysis, which may require the batching method that we have discussed in Section 7.1. Furthermore, since the target density is achieved in the long run, we should discard the initial portion of the sample path, which is called the burn-in phase.
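To make the construction concrete, the following R sketch builds such a chain for an arbitrary target PMF on a handful of states, known only up to a constant, using a simple symmetric random-walk proposal; after simulating the chain, the empirical state frequencies are compared with the normalized target.

# discrete-state Metropolis-Hastings with a target PMF known up to a constant
set.seed(55555)
p.unscaled <- c(1, 3, 5, 3, 1)   # target weights; the normalization is never used
N <- length(p.unscaled)
nsteps <- 1e5
state <- 1                       # arbitrary initial state (no burn-in, for simplicity)
visits <- numeric(N)
for (k in 1:nsteps) {
  cand <- state + sample(c(-1, 1), 1)        # propose a neighboring state
  if (cand < 1 || cand > N) cand <- state    # stay put at the boundaries
  # accept with probability min(1, p(cand)/p(state)); constants cancel
  if (runif(1) <= p.unscaled[cand] / p.unscaled[state]) state <- cand
  visits[state] <- visits[state] + 1
}
rbind(empirical = visits / nsteps,
      target    = p.unscaled / sum(p.unscaled))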
We have discussed the Metropolis–Hastings algorithm in a discrete-state setting, but when the parameter to be estimated is continuous, we should simulate a continuous-state Markov process. We will not analyze this extension in detail, as it would not provide us with much additional insight.2 The major change is that densities and integrals are involved, rather than probability mass functions and sums. Since the probability of a single point within a continuous state space is always zero, we should work with subsets of the state space, over which we integrate the density. Hence, the equilibrium condition of Eq. (14.2) is replaced by the following one:
∫B π(x) dx = ∫ P(x, B) π(x) dx,   (14.5)
for any (measurable) subset B of the state space. Here, π(x) is the equilibrium density, so that the integral on the left-hand side is the long-run probability of being in set B. The role of the transition matrix is played by the transition kernel function P(x, B), which gives the probability of moving from state x to set B. The interpretation of Eq. (14.5) is the same as Eq. (14.2).
Now, in order to be more concrete, let us apply the above ideas to Bayesian inference for a single-parameter problem. The state space of the Markov process is the set of possible values of the parameter θ and, given a vector of observations with value x, we want to sample from the posterior density pn(θ | x). To this aim, we have to define a transition kernel satisfying the detailed balance equation, which in practical terms requires the definition of a transition density p̃(θ, θ′) that satisfies the detailed balance condition
pn(θ | x) p̃(θ, θ′) = pn(θ′ | x) p̃(θ′, θ).
By integrating the transition density p̃(θ, θ′) for θ′ in a set B, we obtain the corresponding transition kernel P(θ, B). The transition density is used to move from state θ to another state θ′ by random sampling.3 In order to satisfy the balance condition, we use the Metropolis–Hastings trick: Start with any transition density q(θ, θ′), and enforce the balance condition by multiplying it by the acceptance probability
α(θ, θ′) = min{1, [pn(θ′ | x) q(θ′, θ)] / [pn(θ | x) q(θ, θ′)]}.
Once again, we notice that we only need the posterior up to a constant, as whatever constant we multiply the posterior by, it gets canceled in the above ratio. The algorithm runs as follows: given the current state θk−1, we sample a candidate value θ′ from the transition density q(θk−1, ·); then, we accept the candidate and set θk = θ′ with probability α(θk−1, θ′); otherwise, we stay where we are and set θk = θk−1.
The random acceptance mechanism is simulated by sampling a uniform variable U ~ U(0, 1) and accepting the candidate if U ≤ α(θk−1, θ’).
Last but not least, to get the whole thing going we need a last piece of the puzzle: the transition density q(θ, θ′). There are two basic possibilities. In an independence sampler, the candidate does not depend on the current state, i.e., q(θ, θ′) = q(θ′). In a random walk sampler, the increment ε is sampled from a distribution with a density symmetric around the origin, so that q(θ, θ′) = q(θ′, θ), the candidate densities cancel in the acceptance probability, and the candidate is obtained as
θ′ = θ + ε.
(14.8)
Finding a good candidate density requires finding a satisfactory trade-off between two requirements. On the one hand, we want to explore the state space, which means that, for a given θ, we should be free to move outside its close neighborhood. On the other hand, we do not want to generate candidates that are too likely to be rejected. As the reader can imagine, striking a satisfactory balance may take some experimentation. Another issue is the potential bias introduced by the arbitrary choice of θ0. As we have mentioned, this issue may be overcome by allowing a burn-in, i.e., by letting the system run for a while, before starting the collection of relevant statistics.
Let us implement the above ideas in R, using a mixture of two normals as the target density. We assume that the posterior is proportional to
which is not a proper density, as we multiply it by 20, rather than dividing by √2π. As the candidate density, we use a random walk transition density N(0, 3). We first run a 1000-step burn-in phase, where no statistics are collected, starting from some arbitrary initial state θ0; then we generate a sample of size 50,000, whose histogram is plotted against the true density. The R code implementing a naive version of this random walk Metropolis–Hastings algorithm is displayed in Fig. 14.4, and it produces the histogram plotted in Fig. 14.5. As we see, there is good agreement between the true density and the sampled one.
FIGURE 14.4 An illustration of the Metropolis–Hastings algorithm.
FIGURE 14.5 Histogram produced by the sampler of Fig. 14.4.
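Since the listing of Fig. 14.4 is not reproduced here, the following minimal sketch follows the same recipe; the mixture weights and means below are illustrative placeholders (the actual target used in the book is the unnormalized density displayed above), and the random walk step is taken as a normal increment with standard deviation 3.

# random walk Metropolis-Hastings for an unnormalized two-component normal mixture
set.seed(55555)
w1 <- 0.4; m1 <- 0; m2 <- 5                  # hypothetical weights and means
target.unscaled <- function(th)              # known only up to a constant
  20 * (w1 * exp(-(th - m1)^2/2) + (1 - w1) * exp(-(th - m2)^2/2))
true.density <- function(th)                 # normalized version, for checking
  w1 * dnorm(th, m1) + (1 - w1) * dnorm(th, m2)
sd.step <- 3                                 # random walk step (assumed st.dev.)
burnin <- 1000; nsamples <- 50000
th <- 0                                      # arbitrary initial state
out <- numeric(nsamples)
accepted <- 0
for (k in 1:(burnin + nsamples)) {
  cand <- th + rnorm(1, 0, sd.step)          # random walk candidate
  if (runif(1) <= target.unscaled(cand) / target.unscaled(th)) {
    th <- cand
    if (k > burnin) accepted <- accepted + 1
  }
  if (k > burnin) out[k - burnin] <- th
}
accepted / nsamples                          # acceptance rate, useful for tuning
hist(out, probability = TRUE, nclass = 60)
curve(true.density(x), add = TRUE)           # compare with the true mixture density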
The Metropolis–Hastings algorithm is the starting point for several types of samplers, most notably the Gibbs sampler, which may be useful when we are dealing with multiple parameters, i.e., with a Markov chain featuring a vector state variable. Consider a vector of parameters
θ = (θ1, θ2, …, θq),
and let us denote by θ−j the vector of parameters with θj removed:
θ−j = (θ1, …, θj−1, θj+1, …, θq).
Since sampling directly from the multivariate distribution can be difficult, we may consider sampling each parameter in turn, while keeping the remaining ones fixed. This means that we should use a transition kernel based on a density q(θj, θ′j | θ−j). The idea can also be applied to blocks of parameters, rather than single ones, and is known as the blockwise Metropolis–Hastings algorithm.
The Gibbs sampler is a specific case of the blockwise Metropolis–Hastings algorithm, where we use the exact conditional density as the candidate density. One possible version of the algorithm is as follows: at each iteration, we sweep the components in a fixed order, sampling θ1 from its conditional density p(θ1 | θ2, …, θq, x), then θ2 from p(θ2 | θ1, θ3, …, θq, x), and so on up to θq, where each conditional density is evaluated at the most recent values of all the other components.
Thus, we see that Gibbs sampling is an iterative sampling strategy that should be contrasted with one-step sampling from a multivariate distribution. Furthermore, since we are using the true conditional density as the candidate, there is no need to check acceptance, as the acceptance probability is always 1. An alternative version of the Gibbs sampler, rather than cycling systematically through each variable, selects one dimension at a time randomly for sampling. The approach makes sense when conditional distributions are available, which is the case, e.g., for hierarchical models.
As a very simple illustration, let us implement a Gibbs sampler for a standard bivariate normal distribution, with zero means, unit variances, and correlation ρ. We know from Section 3.3.4.1 that the two conditional distributions are
X1 | X2 = x2 ∼ N(ρ x2, 1 − ρ²),   X2 | X1 = x1 ∼ N(ρ x1, 1 − ρ²).
The code in Fig. 14.6 implements the corresponding Gibbs sampler. The function receives the sample size, the length of the burn-in period, and the correlation. The output is used by the script to produce the scatterplot matrix, with histograms on the diagonal, shown in Fig. 14.7. The result is in fact what we would expect from such a bivariate normal distribution.
FIGURE 14.6 A simple Gibbs sampler.
FIGURE 14.7 Scatterplot matrix, with histograms on the diagonal, produced by the Gibbs sampler of Fig. 14.6.
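Since the listing of Fig. 14.6 is likewise not reproduced, a minimal sketch of such a Gibbs sampler, assuming a standard bivariate normal with correlation ρ and using made-up function names, is the following.

# Gibbs sampler for a standard bivariate normal with correlation rho: each
# component, conditional on the other, is N(rho*other, 1 - rho^2)
gibbs.binorm <- function(nsamples, burnin, rho) {
  sd.cond <- sqrt(1 - rho^2)
  x1 <- 0; x2 <- 0                           # arbitrary starting point
  out <- matrix(0, nrow = nsamples, ncol = 2)
  for (k in 1:(burnin + nsamples)) {
    x1 <- rnorm(1, mean = rho * x2, sd = sd.cond)   # sample X1 given X2
    x2 <- rnorm(1, mean = rho * x1, sd = sd.cond)   # sample X2 given X1
    if (k > burnin) out[k - burnin, ] <- c(x1, x2)
  }
  out
}
set.seed(55555)
xy <- gibbs.binorm(10000, 1000, rho = 0.7)
cor(xy)       # sample correlation should be close to 0.7
pairs(xy)     # rough counterpart of the scatterplot matrix of Fig. 14.7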
The Metropolis–Hastings algorithm should ring some bells for readers familiar with simulated annealing, which we described in Section 10.3.2. In both algorithms we generate a candidate state, which is accepted with some probability. With the random walk candidate density, the acceptance probability in the Metropolis–Hastings algorithm is given by Eq. (14.7); uphill moves are always accepted, whereas the acceptance of downhill moves is random. When applying simulated annealing to a minimization problem,
the roles of uphill and downhill moves are swapped: a candidate solution is accepted with probability
α = min{1, e^(−Δf/T)},   (14.9)
where Δf is the increase in the objective function implied by the candidate solution and T is the current temperature.
The acceptance probability is 1 if Δf < 0, i.e., the candidate solution improves the current one; otherwise, we have an uphill move, which is accepted with a probability depending on T. When T is large, it is relatively easy to accept an uphill move, whereas the system tends to favor only improvements when the temperature is low.
The similarity with the Metropolis–Hastings algorithm is in fact a deep one, as simulated annealing can be regarded as an MCMC optimization method, whose aim is to sample from a distribution concentrated around the global optimizers. Let us define the set of global optimizers
M ≐ {x ∈ S : f(x) = f*},
where S is the feasible set and f* is the minimum of f over S.
The optimal solution need not be unique, so let us denote the cardinality of this set by |M|. Consider the target PMF
πλ(x) = e^(−λ f(x)) / ∑y∈S e^(−λ f(y)),   (14.10)
for a given parameter λ > 0. Note that we may speak of a PMF if the set S is finite or enumerable, as it happens in combinatorial optimization problems; we assume that this is the kind of optimization problem we are dealing with, rather than a continuous and nonconvex one. Note that this PMF favors low-cost solutions, for which the negative exponential is larger. In fact, when λ → +∞, the above PMF tends to
π∞(x) = 1M(x) / |M|,
where the indicator function 1M(x) is 1 if x is optimal, zero otherwise. To see this, it is sufficient to divide both the numerator and the denominator in Eq. (14.10) by e^(−λf*), which yields
πλ(x) = e^(−λ [f(x) − f*]) / ( |M| + ∑y∈S\M e^(−λ [f(y) − f*]) ),
where \ denotes set difference. By letting λ → +∞, we obtain a very desirable distribution assigning the same positive probability to optimal solutions, and zero to nonoptimal solutions.
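A tiny numerical illustration of this limiting behavior, for an arbitrary cost vector on a five-element set S, is the following.

# the PMF proportional to exp(-lambda*f(x)) concentrates on the minimizers of f
f <- c(3, 1, 4, 1, 5)                   # arbitrary costs; two minimizers, f* = 1
pmf <- function(lambda) {
  w <- exp(-lambda * (f - min(f)))      # subtracting f* does not change the PMF
  w / sum(w)
}
round(rbind(lambda.0.5 = pmf(0.5),
            lambda.2   = pmf(2),
            lambda.10  = pmf(10)), 4)   # tends to (0, 0.5, 0, 0.5, 0)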
Now we have to define a candidate density to generate an alternative candidate solution x′. A simple approach is to sample randomly the neighborhood N(x) of the current solution x; the candidate PMF assigns a uniform probability, 1/|N(x)|, to each solution in that neighborhood. Then, given the desired target PMF, the acceptance probability is
α(x, x′) = min{1, [πλ(x′) / |N(x′)|] / [πλ(x) / |N(x)|]} = min{1, (|N(x)| / |N(x′)|) · e^(−λ [f(x′) − f(x)])}.
If the neighborhood structure is the same for each solution, then the above expression boils down to
α(x, x′) = min{1, e^(−λ [f(x′) − f(x)])},
which is just Eq. (14.9), provided that we substitute λ = 1/T. This is a reassuring result, as it suggests a convergence guarantee of simulated annealing to a globally optimal solution. However, we need to find a suitable cooling schedule for the temperature. Furthermore, a cynical reader might argue that, if the set of feasible solutions is finite, any complete enumeration algorithm will find an optimal solution in finite time, rather than in the long run. The objection does not necessarily apply if we are considering a continuous nonconvex optimization model, possibly arising from a difficult model calibration problem.
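To wrap up, the following R sketch shows the Metropolis acceptance rule at work within a bare-bones simulated annealing scheme, applied to a toy combinatorial problem: minimizing a quadratic form over binary vectors, with a neighborhood obtained by flipping one randomly chosen component. The initial temperature and the geometric cooling schedule are arbitrary choices for illustration.

# simulated annealing with the Metropolis rule: accept with prob. min(1, exp(-df/T))
set.seed(55555)
n <- 20
Q <- matrix(rnorm(n * n), n, n); Q <- Q + t(Q)   # arbitrary symmetric cost matrix
f <- function(x) drop(t(x) %*% Q %*% x)
x <- rbinom(n, 1, 0.5)                  # random starting solution
fx <- f(x)
best <- x; fbest <- fx
Temp <- 10                              # initial temperature (illustrative)
for (k in 1:5000) {
  cand <- x
  j <- sample(n, 1)                     # flip one randomly chosen bit
  cand[j] <- 1 - cand[j]
  df <- f(cand) - fx
  if (df < 0 || runif(1) <= exp(-df / Temp)) {   # Metropolis acceptance
    x <- cand; fx <- fx + df
    if (fx < fbest) { best <- x; fbest <- fx }
  }
  Temp <- 0.999 * Temp                  # geometric cooling schedule
}
fbest                                   # best objective value found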
1 F. Black and R. Litterman. Global portfolio optimization. Financial Analysts Journal, 48:28–43, 1992.
2 K. Böcker, editor. Rethinking Risk Measurement and Reporting: Volumes I and II. Risk Books, London, 2010.
3 W.M. Bolstad. Introduction to Bayesian Statistics (2nd ed.). Wiley, Hoboken, NJ, 2007.
4 W.M. Bolstad. Understanding Computational Bayesian Statistics. Wiley, Hoboken, NJ, 2010.
5 E. Greenberg. Introduction to Bayesian Econometrics (2nd ed.). Cambridge University Press, Cambridge, 2012.
6 W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
7 G. He and R. Litterman. The intuition behind Black–Litterman model portfolios. Technical report, Investment Management Research, Goldman, Sachs & Company, 1999. A more recent version can be downloaded from http://www.ssrn.org.
8 D.P. Kroese, T. Taimre, and Z.I. Botev. Handbook of Monte Carlo Methods. Wiley, Hoboken, NJ, 2011.
9 J.K. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press, Burlington, MA, 2011.
10 N.A. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
11 S. Ross. Simulation (2nd ed.). Academic Press, San Diego, CA, 1997.
1See [10] for an early reference. The idea was refined and improved in [6].
2See, e.g., [4, pp. 129–130] for a readable proof.
3Some authors use the notation q(θ’ | θ) to clarify the role of the two arguments.