We have introduced Bayesian parameter estimation in Section 4.6, as a possible way to overcome some limitations of orthodox statistics. The essence of the approach can be summarized as follows:
posterior ∝ prior × likelihood,
where the prior collects information, possibly of a subjective nature, that we have about a set of parameters before observing new data; the likelihood measures how likely the observed data are, on the basis of our current knowledge or belief; and the posterior merges the two above pieces of information in order to provide us with an updated picture. This is an informal restatement of Eq. (4.12), which we rewrite below for convenience:
pn(θ1, …, θq | x1, …, xn) ∝ p(θ1, …, θq) × fn(x1, …, xn | θ1, …, θq).   (14.1)
The parameters θj, j = 1, …, q, are regarded as random variables within the Bayesian framework, and p(θ1, …, θq) is their prior. The prior can be a probability density function (PDF) or a probability mass function (PMF), or a mixed object, depending on the nature of the parameters involved. For the sake of simplicity, in this chapter we will not deal systematically with both cases, and we will refer to either one according to convenience. We have a set of independent observations, i.e., realizations of a random variable taking values xi, i = 1, …, n. Note that here we are considering n observations of a scalar random variable, whose distribution depends on q parameters; the concepts can be generalized to multivariate distributions. We know from maximum-likelihood estimation that an important role is played by the likelihood function fn(x1, …, xn | θ1, …, θq), for given parameters. Since we are considering independent observations, the likelihood is the product of PDF values corresponding to each observation xi. Bayes’ theorem yields the posterior density pn(θ1, …, θq | x1, …, xn), summarizing our knowledge about the parameters, merging the possibly subjective prior and the empirical evidence.
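As a small illustration of the role played by the likelihood, the following R snippet evaluates it as a product of PDF values, or, more conveniently, the log-likelihood as a sum; for the sake of the example, we assume i.i.d. normal observations with unknown mean θ and unit standard deviation, and the numbers are arbitrary.

# likelihood of i.i.d. normal observations x, as a function of the mean theta
# (unit standard deviation assumed for simplicity)
lik <- function(theta, x) prod(dnorm(x, mean = theta, sd = 1))
loglik <- function(theta, x) sum(dnorm(x, mean = theta, sd = 1, log = TRUE))
set.seed(55555)
x <- rnorm(20, mean = 3)     # hypothetical sample
lik(3, x)                    # typically much larger than ...
lik(0, x)                    # ... the likelihood of an implausible mean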
Armed with the posterior density, we may proceed to find estimates of the parameters θ1, …, θq, e.g., by computing their expected values. We may also find credible intervals, which are the Bayesian counterpart of confidence intervals, and test hypotheses about the parameters. All of this requires knowledge of the posterior density, but there is a missing piece in Eq. (14.1). There, we only state that the posterior is proportional to the product of prior and likelihood; to fully specify the posterior density, we should also find a normalization constant B such that the PDF integrates to 1. Hence, we should divide the product of prior and likelihood by
B = ∫ ⋯ ∫ p(θ1, …, θq) fn(x1, …, xn | θ1, …, θq) dθ1 ⋯ dθq.
Unless there is some special structure to be exploited, the calculation of this multidimensional integral is difficult. As we have seen in Section 4.6, there are nice cases in which the form of the posterior is easy to find. If we deal with observations from a normal distribution and the prior is normal, then the posterior is normal too, and figuring out the normalizing constant is easy. When the posterior belongs to the same family of distributions as the prior, for a given kind of likelihood, we speak of conjugate distributions; the prior is also called a conjugate prior. In less lucky cases, one possibility is to find the normalization constant by numerical evaluation of the multidimensional integral, possibly by Monte Carlo methods. However, there is a possibly better idea, if what we need is just the ability to sample from the posterior density. As it turns out, there are ways to sample from a density that is known only up to a constant, and in fact we have already met one: acceptance–rejection.
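For instance, the following R sketch carries out the conjugate update for the mean θ of normal observations with known standard deviation, under a normal prior; the numbers are arbitrary, and the update formulas are the standard ones for this conjugate pair, so that no numerical integration is required.

# conjugate update: normal prior N(mu0, tau0^2) on the mean theta of normal
# data with known standard deviation sigma (illustrative numbers)
set.seed(55555)
sigma <- 2
x <- rnorm(30, mean = 5, sd = sigma)    # hypothetical observations
mu0 <- 0; tau0 <- 10                    # rather vague prior on theta
prec.post <- 1/tau0^2 + length(x)/sigma^2              # posterior precision
mu.post <- (mu0/tau0^2 + sum(x)/sigma^2) / prec.post   # posterior mean
sd.post <- sqrt(1/prec.post)                           # posterior st.dev.
c(mu.post, sd.post)    # the posterior is N(mu.post, sd.post^2)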
In Section 14.1 we show that we may apply acceptance–rejection sampling when we miss a normalization constant, but the resulting approach need not be the most efficient one. There is a class of sampling approaches, collectively known as Markov chain Monte Carlo (MCMC) methods, that can be used to this purpose. The idea, as the name suggests, is to simulate a Markov chain whose stationary distribution is the posterior density. MCMC methods can be introduced as a tool for computational Bayesian statistics, but they have a wider application range, as they can be used as random sample generators for difficult cases that cannot be tackled using standard approaches. We review a few basic facts about the stationary distribution of Markov chains in Section 14.2. Then, we outline the Metropolis–Hastings algorithm and some of its variations in Section 14.3. Finally, in Section 14.4 we show how the simulated annealing algorithm, a stochastic search algorithm for both global and combinatorial optimization that we introduced in Section 10.3, can be cast and analyzed within the MCMC framework.
Acceptance–rejection sampling is a general-purpose approach to sample from a distribution with a given PDF or PMF, whereas the inverse transform method works with the CDF. As we have seen in Section 5.4, given a target density f(·), we choose an instrumental density g(·), together with a constant c such that f(x) ≤ c·g(x) for every x, and proceed as follows: we sample a candidate value Y from g(·) and a uniform variable U ~ U(0, 1), and we accept Y if U ≤ f(Y)/[c·g(Y)]; otherwise, we reject the candidate and repeat the two steps.
If you look at the proof of the validity of the method, in Section 5.4, you will notice that what really matters is that c is an upper bound on the ratio f/g, and we may also work with an unscaled density f(·), i.e., a density known up to a constant. In the Bayesian framework, we have an unscaled posterior
p(θ) fn(x | θ),
where x is the vector of (given) observations and θ the vector of (random) parameters. To sample a vector of parameters from the unscaled posterior, we have to find a candidate density g(θ) and an upper bound M on the ratio:
p(θ) fn(x | θ) / g(θ) ≤ M   for every θ.
Then, we sample a candidate value θ randomly from g(·), and we accept it with probability
p(θ) fn(x | θ) / [M g(θ)].
Improvements to this basic strategy have been proposed, but they do not solve one basic difficulty: the number of iterations that are needed to accept a sampled value can be large. We know that the expected number of candidates per accepted value grows with the bound M; hence, the closer the instrumental density is to the target, the better. If we are working with a distribution with bounded support, like a beta distribution, we may use a uniform as the instrumental density. However, the choice of the instrumental density is more difficult if the target has unbounded support, in which case the candidate should be a heavy-tailed distribution. Unfortunately, when we are dealing with multiple parameters, acceptance–rejection may require the generation of many proposals before one is accepted. Because of this practical limitation, alternative methods, described in the following, have been proposed. Nevertheless, acceptance–rejection is a viable approach for the case of a single parameter.
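As a minimal illustration, the following R sketch applies acceptance–rejection to a single parameter whose unscaled density is θ²(1 − θ)³ on [0, 1], i.e., a Beta(3, 4) density stripped of its normalization constant; the uniform candidate and the bound M are chosen for illustration only.

# acceptance-rejection from an unscaled density f(theta) = theta^2*(1-theta)^3,
# i.e., a Beta(3,4) density known up to a constant; candidate g = U(0,1)
f.unscaled <- function(theta) theta^2 * (1 - theta)^3
M <- 0.035   # any bound on f/g works; the max of f is about 0.0346 at theta = 0.4
ar.sample <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      cand <- runif(1)                     # candidate from g
      if (runif(1) <= f.unscaled(cand)/M) { out[i] <- cand; break }
    }
  }
  out
}
set.seed(55555)
theta <- ar.sample(10000)
hist(theta, probability = TRUE, nclass = 50)
curve(dbeta(x, 3, 4), add = TRUE)          # check against the true Beta(3,4) density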
Before we introduce MCMC, we need to refresh some essential concepts pertaining to discrete-time Markov chains. MCMC methods are indirect sampling strategies, based on the simulation of a Markov process whose long-run state density is the target density. We have already met Markov processes in the book, since processes described by Itô stochastic differential equations are Markovian. Here, though, we focus on their long-term state distribution. Within a Bayesian framework, the state of the stochastic process is related to the value of the parameters θ. When the parameters are real numbers, we have to deal with a continuous-state process. However, the necessary intuition to understand MCMC methods can be built on the basis of the much simpler discrete-state case, which is referred to as a Markov chain. A discrete-time Markov chain is a process with a state variable Xt, t = 0, 1, 2, …,
taking values on a discrete set, which may be identified with the set of integer numbers, a subset of it, or a finite discrete set (e.g., the credit rating of a bond). We may represent the process as a graph, which is where the name “chain” comes from. A simple example is shown in Fig. 14.1. There, the state space is the set {A, B, C, D}, and nodes correspond to states. Directed arcs are labeled by transition probabilities. For instance, if we are in state C, there is a probability 0.7 of being in state B at the next step, and a probability 0.3 of being in state A. Note that these probabilities depend only on the current state, and not on the whole past history. Indeed, the essential property of Markov processes is related to their limited memory: The only information that is needed to characterize the future system evolution is the current state. Formally, we may express this in terms of the conditional transition probability
P{Xt+1 = j | Xt = i, Xt−1 = it−1, …, X0 = i0} = P{Xt+1 = j | Xt = i}.
FIGURE 14.1 A discrete-state Markov chain.
Hence, we just need to know the one-step transition probabilities to characterize the chain. A further simplification occurs if the process is stationary, i.e., if the transition probabilities do not depend on time. In such a case, we may describe the chain in terms of the transition probabilities
Pij = P{Xt+1 = j | Xt = i},
which may be collected into the one-step transition probability matrix P. For the chain in Fig. 14.1, we have
Such a matrix has rather peculiar properties, related to the fact that after a transition we must land somewhere within the state space, and therefore each and every row adds up to 1:
∑j=1,…,N Pij = 1,   i = 1, …, N,
where for the sake of simplicity we assume a finite state space of size N. Let the row vector π0 ∈ ℝ^N denote the initial state distribution at step 0, i.e.,
πi,0 = P{X0 = i},   i = 1, …, N.
The total probability theorem implies that
πj,1 = P{X1 = j} = ∑i=1,…,N P{X1 = j | X0 = i} · P{X0 = i} = ∑i=1,…,N Pij πi,0,
which can be rewritten in compact form as
π1 = π0 P,
where π1 ∈ ℝ^N is a row vector with components πi,1 = P{X1 = i}, i = 1, …, N. Note how, given the shape of the transition probability matrix, we write the relationship in terms of premultiplication by a row vector. Thanks to the Markov property, we may easily compute multiple-step transition probabilities:
P{X2 = j | X0 = l} = ∑i=1,…,N P{X2 = j, X1 = i | X0 = l}
= ∑i=1,…,N P{X2 = j | X1 = i, X0 = l} · P{X1 = i | X0 = l}
= ∑i=1,…,N P{X2 = j | X1 = i} · P{X1 = i | X0 = l}
= ∑i=1,…,N Pli Pij.
Here, in the first line we have used the fact that the event {X2 = j} can be expressed as the union of the disjoint events {X2 = j, X1 = i}; then we have used conditioning and the Markov property. Since the final expression can be interpreted as an entry in the row-by-column matrix product, the two-step transition probability matrix can be expressed as
P^(2) = P · P = P^2.
By the same token, if we consider an arbitrary number m of steps,
P^(m) = P^m,   so that   πm = π0 P^m.
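In R, multiple-step transition probabilities are easily obtained by matrix multiplication; the 3 × 3 matrix below is a purely hypothetical example, not the chain of Fig. 14.1.

# m-step transition probabilities as powers of a (hypothetical) transition matrix
P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.4, 0.2, 0.4), nrow = 3, byrow = TRUE)
P %*% P                              # two-step transition probabilities
Pm <- diag(3)
for (k in 1:50) Pm <- Pm %*% P       # m-step probabilities, here m = 50
round(Pm, 4)                         # rows are (almost) identical
pi0 <- c(1, 0, 0)                    # start from the first state with certainty
pi0 %*% Pm                           # state distribution after 50 steps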
Now imagine that we start with distribution π0 and we let the system run for a long time. What happens to the limit
lim m→+∞ π0 P^m ?
We might guess that, after such a long time, the memory of the initial state is lost, and that the system reaches a steady-state (or long-run) distribution π. We might also guess that this distribution should be an equilibrium distribution characterized by the system of linear equations
π = πP.   (14.2)
This system can be interpreted by saying that if we start with the state distribution π, the next state has the same distribution. Hence, π is an eigenvector of P associated with the eigenvalue 1. To be precise, since we premultiply P by a row vector, we should use the term left eigenvector. To make sense as a probability distribution, this should be a unit (normalized) and non-negative eigenvector. It turns out that, in fact, we may find such a non-negative eigenvector, provided that some conditions are satisfied by P. Indeed, we cannot take for granted that a stationary distribution exists, independent of the initial state, as a few things may go wrong: the chain may fail to be irreducible, so that some states cannot be reached from some initial states and the long-run behavior depends on where we start (see Fig. 14.2); or the chain may be periodic, so that the state distribution keeps oscillating and no limit exists (see Fig. 14.3).
FIGURE 14.2 A nonirreducible periodic Markov chain.
FIGURE 14.3 A periodic Markov chain.
We will refrain from giving a formal definition of irreducible and aperiodic chains, as intuition is all we need for our limited purposes.
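For a well-behaved (irreducible and aperiodic) chain, the stationary distribution can be computed numerically as the normalized left eigenvector of P associated with the eigenvalue 1, as in the following sketch, which uses the same hypothetical matrix as above.

# stationary distribution as the left eigenvector of P for the eigenvalue 1,
# i.e., the eigenvector of t(P), normalized so that its components add up to 1
P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.4, 0.2, 0.4), nrow = 3, byrow = TRUE)
e <- eigen(t(P))
v <- Re(e$vectors[, which.min(abs(e$values - 1))])
pi.stat <- v / sum(v)
pi.stat                  # stationary distribution
pi.stat %*% P            # check: premultiplying P leaves it unchanged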
Another very useful way to interpret the equilibrium condition of Eq. (14.2) can be obtained by focusing on a single equation,
πi = ∑j=1,…,N πj Pji,
which gives the equilibrium probability πi of state i, and by rewriting it as follows. Since ∑j Pij = 1, we may write
πi ∑j=1,…,N Pij = ∑j=1,…,N πj Pji,
and, by dropping the term corresponding to j = i on both sides,
∑j≠i πi Pij = ∑j≠i πj Pji.
The last equality can be interpreted intuitively in terms of equilibrium of “probability flows”: the left-hand side is the total probability flow out of state i, whereas the right-hand side is the total probability flow from the other states into state i. Thus, equating the two flows for every state i establishes a global balance condition, which is associated with equilibrium. For some Markov chains, a stronger equilibrium condition is satisfied, corresponding to a detailed, rather than global, balance:
πi Pij = πj Pji,   for all i, j.   (14.3)
This condition enforces a probability flow equilibrium for any pair of states, and it is easy to see that if we can find a vector π, whose components add up to 1, satisfying Eq. (14.3), then this vector will also satisfy Eq. (14.2). To prove this, it is sufficient to sum the detailed balance conditions over states i:
∑i=1,…,N πi Pij = ∑i=1,…,N πj Pji = πj ∑i=1,…,N Pji = πj,
which is just the j-th equation of the system π = πP.
The detailed balance conditions are quite interesting for a few reasons:
These concepts are beyond the scope of the book, but from our viewpoint, detailed balance conditions are useful when we have to go the other way around: Given a target equilibrium distribution, how can we build a Markov chain that has that equilibrium distribution? This is what we need in MCMC methods applied to Bayesian statistics, where the target equilibrium distribution is the posterior, with the additional twist that the posterior is known up to a multiplicative constant. This would be difficult in general, but, by taking advantage of detailed balance equations, it can be easily accomplished, as we show in the next section.
In Bayesian inference, we need to sample from the posterior density, which is hard to find computationally, with the exception of conjugate priors. A computational breakthrough was achieved by realizing that we may create a discrete-time Markov process, whose states are values of the parameters of interest, such that its long-run distribution can be shaped according to our needs. In other words, we need to define a state space based on the values of the parameters, and compute suitable transition probabilities. This is a daunting task in general, but it turns out that we may take advantage of the simple structure of time-reversible Markov chains, whose transition probabilities satisfy the detailed balance equations (14.3).
The original idea was devised in 1953 and is known as the Metropolis–Hastings algorithm.1 The intuition is as follows. We want to find a Markov chain with steady-state probabilities πi, for discrete states i = 1, …, N (the idea can be easily adapted to continuous distributions). To this end, we need suitable transition probabilities Pij, satisfying the detailed balance equations. A first point is that whatever Pii we choose, it will do the job, as the trivial balance condition
πi Pii = πi Pii
is clearly satisfied. Unfortunately, in general, the detailed balance condition does not hold for i ≠ j. For instance, if
πi Pij < πj Pji,
then we should restore the balance by rejecting some of the transitions from j to i, whereas all of the transitions from i to j can be accepted. A similar consideration applies if πi Pij > πj Pji. The trick can be done, for any desired steady-state distribution π and any transition matrix P, by adjusting the transition probabilities using a sort of acceptance–rejection strategy. For each pair of states i and j, we adjust the transition probability by multiplying it by an acceptance probability
αij = min{1, (πj Pji) / (πi Pij)}.
Hence, whenever we are in state i and a candidate transition from i to j is randomly selected in the simulation, it is actually accepted with probability αij, which implicitly defines an adjusted transition probability matrix P̃, where
P̃ij = αij Pij,   for i ≠ j,
and
P̃ii = Pii + ∑j≠i (1 − αij) Pij.
It is not difficult to show that the pair consisting of π and P̃ satisfies the detailed balance conditions, since, for i ≠ j,
πi P̃ij = πi Pij αij = πi Pij min{1, (πj Pji)/(πi Pij)} = min{πi Pij, πj Pji} = πj Pji min{(πi Pij)/(πj Pji), 1} = πj P̃ji.
Therefore, we can easily build a Markov chain with the desired long-run distribution by adjusting any candidate transition matrix P. Now, the reader should be a bit puzzled, for a couple of reasons:
The Metropolis–Hastings algorithm takes advantage of the above ideas:
An important point to bear in mind is that, unlike the random sampling approaches that we have described so far in the book, the sample produced by this procedure is not independent, since the visited states are correlated. This must be taken into due account when performing output analysis, which may require the batching method that we have discussed in Section 7.1. Furthermore, since the target density is achieved in the long run, we should discard the initial portion of the sample path, which is called the burn-in phase.
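To make the construction concrete, the following R sketch builds such a chain for an arbitrary target PMF on a handful of states, known only up to a constant, using a simple symmetric random-walk proposal; after simulating the chain, the empirical state frequencies are compared with the normalized target.

# discrete-state Metropolis-Hastings with a target PMF known up to a constant
set.seed(55555)
p.unscaled <- c(1, 3, 5, 3, 1)   # target weights; the normalization is never used
N <- length(p.unscaled)
nsteps <- 1e5
state <- 1                       # arbitrary initial state (no burn-in, for simplicity)
visits <- numeric(N)
for (k in 1:nsteps) {
  cand <- state + sample(c(-1, 1), 1)        # propose a neighboring state
  if (cand < 1 || cand > N) cand <- state    # stay put at the boundaries
  # accept with probability min(1, p(cand)/p(state)); constants cancel
  if (runif(1) <= p.unscaled[cand] / p.unscaled[state]) state <- cand
  visits[state] <- visits[state] + 1
}
rbind(empirical = visits / nsteps,
      target    = p.unscaled / sum(p.unscaled))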
We have discussed the Metropolis–Hastings algorithm in a discrete-state setting, but when the parameter to be estimated is continuous, we should simulate a continuous-state Markov process. We will not analyze this extension in detail, as it would not provide us with much additional insight.2 The major change is that densities and integrals are involved, rather than probability mass functions and sums. Since the probability of a single point within a continuous state space is always zero, we should work with subsets of the state space, over which we integrate the density. Hence, the equilibrium condition of Eq. (14.2) is replaced by the following one:
∫B π(x) dx = ∫ P(x, B) π(x) dx,   (14.5)
for any (measurable) subset B of the state space. Here, π(x) is the equilibrium density, so that the integral on the left-hand side is the long-run probability of being in set B. The role of the transition matrix is played by the transition kernel function P(x, B), which gives the probability of moving from state x to set B. The interpretation of Eq. (14.5) is the same as Eq. (14.2).
Now, in order to be more concrete, let us apply the above ideas to Bayesian inference for a single-parameter problem. The state space of the Markov process is the set of possible values of the parameter θ and, given a vector of observations with value x, we want to sample from the posterior density pn(θ | x). To this aim, we have to define a transition kernel satisfying the detailed balance equation, which in practical terms requires the definition of a transition density p̃(θ, θ′) that satisfies the detailed balance condition
pn(θ | x) p̃(θ, θ′) = pn(θ′ | x) p̃(θ′, θ).
By integrating the transition density p̃(θ, θ′) for θ′ in a set B, we obtain the corresponding transition kernel P(θ, B). The transition density is used to move from state θ to another state θ′ by random sampling.3 In order to satisfy the balance condition, we use the Metropolis–Hastings trick: Start with any transition density q(θ, θ′), and enforce the balance condition by multiplying it by the acceptance probability
α(θ, θ′) = min{1, [pn(θ′ | x) q(θ′, θ)] / [pn(θ | x) q(θ, θ′)]}.
Once again, we notice that we only need the posterior up to a constant, as whatever constant we multiply the posterior by, it gets canceled in the above ratio. The algorithm runs as follows: given the current state θk−1, we sample a candidate value θ′ from the transition density q(θk−1, ·); then, we accept the candidate and set θk = θ′ with probability α(θk−1, θ′); otherwise, we stay where we are and set θk = θk−1.
The random acceptance mechanism is simulated by sampling a uniform variable U ~ U(0, 1) and accepting the candidate if U ≤ α(θk−1, θ’).
Last but not least, to get the whole thing going we need a last piece of the puzzle: the transition density q(θ, θ′). There are two basic possibilities. In an independence sampler, the candidate does not depend on the current state, i.e., q(θ, θ′) = q(θ′). In a random walk sampler, the increment ε is sampled from a distribution with a density symmetric around the origin, so that q(θ, θ′) = q(θ′, θ), the candidate densities cancel in the acceptance probability, and the candidate is obtained as
θ′ = θ + ε.
(14.8)
Finding a good candidate density requires finding a satisfactory trade-off between two requirements. On the one hand, we want to explore the state space, which means that, for a given θ, we should be free to move outside its close neighborhood. On the other hand, we do not want to generate candidates that are too likely to be rejected. As the reader can imagine, striking a satisfactory balance may take some experimentation. Another issue is the potential bias introduced by the arbitrary choice of θ0. As we have mentioned, this issue may be overcome by allowing a burn-in, i.e., by letting the system run for a while, before starting the collection of relevant statistics.
Let us implement the above ideas in R, using a mixture of two normals as the target density. We assume that the posterior is proportional to
which is not a proper density, as we multiply it by 20, rather than dividing by √2π. As the candidate density, we use a random walk transition density N(0, 3). We first run a 1000-step burn-in phase, where no statistics are collected, starting from some arbitrary initial state θ0; then we generate a sample of size 50,000, whose histogram is plotted against the true density. The R code implementing a naive version of this random walk Metropolis–Hastings algorithm is displayed in Fig. 14.4, and it produces the histogram plotted in Fig. 14.5. As we see, there is good agreement between the true density and the sampled one.
FIGURE 14.4 An illustration of the Metropolis–Hastings algorithm.
FIGURE 14.5 Histogram produced by the sampler of Fig. 14.4.
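Since the listing of Fig. 14.4 is not reproduced here, the following minimal sketch follows the same recipe; the mixture weights and means below are illustrative placeholders (the actual target used in the book is the unnormalized density displayed above), and the random walk step is taken as a normal increment with standard deviation 3.

# random walk Metropolis-Hastings for an unnormalized two-component normal mixture
set.seed(55555)
w1 <- 0.4; m1 <- 0; m2 <- 5                  # hypothetical weights and means
target.unscaled <- function(th)              # known only up to a constant
  20 * (w1 * exp(-(th - m1)^2/2) + (1 - w1) * exp(-(th - m2)^2/2))
true.density <- function(th)                 # normalized version, for checking
  w1 * dnorm(th, m1) + (1 - w1) * dnorm(th, m2)
sd.step <- 3                                 # random walk step (assumed st.dev.)
burnin <- 1000; nsamples <- 50000
th <- 0                                      # arbitrary initial state
out <- numeric(nsamples)
accepted <- 0
for (k in 1:(burnin + nsamples)) {
  cand <- th + rnorm(1, 0, sd.step)          # random walk candidate
  if (runif(1) <= target.unscaled(cand) / target.unscaled(th)) {
    th <- cand
    if (k > burnin) accepted <- accepted + 1
  }
  if (k > burnin) out[k - burnin] <- th
}
accepted / nsamples                          # acceptance rate, useful for tuning
hist(out, probability = TRUE, nclass = 60)
curve(true.density(x), add = TRUE)           # compare with the true mixture density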
The Metropolis–Hastings algorithm is the starting point for several types of samplers, most notably the Gibbs sampler, which may be useful when we are dealing with multiple parameters, i.e., with a Markov chain featuring a vector state variable. Consider a vector of parameters
θ = (θ1, θ2, …, θq),
and let us denote by θ−j the vector of parameters with θj removed:
θ−j = (θ1, …, θj−1, θj+1, …, θq).
Since sampling directly from the multivariate distribution can be difficult, we may consider sampling each parameter in turn, while keeping the remaining ones fixed. This means that we should use a transition kernel based on a density q(θj, θ′j | θ−j). The idea can also be applied to blocks of parameters, rather than single ones, and is known as the blockwise Metropolis–Hastings algorithm.
The Gibbs sampler is a specific case of the blockwise Metropolis–Hastings algorithm, where we use the exact conditional density as the candidate density. One possible version of the algorithm is as follows: at each iteration, we sweep the components in a fixed order, sampling θ1 from its conditional density p(θ1 | θ2, …, θq, x), then θ2 from p(θ2 | θ1, θ3, …, θq, x), and so on up to θq, where each conditional density is evaluated at the most recent values of all the other components.
Thus, we see that Gibbs sampling is an iterative sampling strategy that should be contrasted with one-step sampling from a multivariate distribution. Furthermore, since we are using the true conditional density as the candidate, there is no need to check acceptance, as the acceptance probability is always 1. An alternative version of the Gibbs sampler, rather than cycling systematically through each variable, selects one dimension at a time randomly for sampling. The approach makes sense when conditional distributions are available, which is the case, e.g., for hierarchical models.
As a very simple illustration, let us implement a Gibbs sampler for a standard bivariate normal distribution, with zero means, unit variances, and correlation ρ. We know from Section 3.3.4.1 that the two conditional distributions are
X1 | X2 = x2 ∼ N(ρ x2, 1 − ρ²),   X2 | X1 = x1 ∼ N(ρ x1, 1 − ρ²).
The code in Fig. 14.6 implements the corresponding Gibbs sampler. The function receives the sample size, the length of the burn-in period, and the correlation. The output is used by the script to produce the scatterplot matrix, with histograms on the diagonal, shown in Fig. 14.7. The result is in fact what we would expect from such a bivariate normal distribution.
FIGURE 14.6 A simple Gibbs sampler.
FIGURE 14.7 Scatterplot matrix, with histograms on the diagonal, produced by the Gibbs sampler of Fig. 14.6.
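Since the listing of Fig. 14.6 is likewise not reproduced, a minimal sketch of such a Gibbs sampler, assuming a standard bivariate normal with correlation ρ and using made-up function names, is the following.

# Gibbs sampler for a standard bivariate normal with correlation rho: each
# component, conditional on the other, is N(rho*other, 1 - rho^2)
gibbs.binorm <- function(nsamples, burnin, rho) {
  sd.cond <- sqrt(1 - rho^2)
  x1 <- 0; x2 <- 0                           # arbitrary starting point
  out <- matrix(0, nrow = nsamples, ncol = 2)
  for (k in 1:(burnin + nsamples)) {
    x1 <- rnorm(1, mean = rho * x2, sd = sd.cond)   # sample X1 given X2
    x2 <- rnorm(1, mean = rho * x1, sd = sd.cond)   # sample X2 given X1
    if (k > burnin) out[k - burnin, ] <- c(x1, x2)
  }
  out
}
set.seed(55555)
xy <- gibbs.binorm(10000, 1000, rho = 0.7)
cor(xy)       # sample correlation should be close to 0.7
pairs(xy)     # rough counterpart of the scatterplot matrix of Fig. 14.7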
The Metropolis–Hastings algorithm should ring some bells for readers familiar with simulated annealing, which we described in Section 10.3.2. In both algorithms we generate a candidate state, which is accepted with some probability. With the random walk candidate density, the acceptance probability in the Metropolis–Hastings algorithm is given by Eq. (14.7); uphill moves are always accepted, whereas the acceptance of downhill moves is random. When applying simulated annealing to a minimization problem,
the roles of uphill and downhill moves are swapped: a candidate solution is accepted with probability
α = min{1, e^(−Δf/T)},   (14.9)
where Δf is the increase in the objective function implied by the candidate solution and T is the current temperature.
The acceptance probability is 1 if Δf < 0, i.e., the candidate solution improves the current one; otherwise, we have an uphill move, which is accepted with a probability depending on T. When T is large, it is relatively easy to accept an uphill move, whereas the system tends to favor only improvements when the temperature is low.
The similarity with the Metropolis–Hastings algorithm is in fact a deep one, as simulated annealing can be regarded as an MCMC optimization method, whose aim is to sample from a distribution concentrated around the global optimizers. Let us define the set of global optimizers
M ≐ {x ∈ S : f(x) = f*},
where S is the feasible set and f* is the minimum of f over S.
The optimal solution need not be unique, so let us denote the cardinality of this set by |M|. Consider the target PMF
πλ(x) = e^(−λ f(x)) / ∑y∈S e^(−λ f(y)),   (14.10)
for a given parameter λ > 0. Note that we may speak of a PMF if the set S is finite or enumerable, as it happens in combinatorial optimization problems; we assume that this is the kind of optimization problem we are dealing with, rather than a continuous and nonconvex one. Note that this PMF favors low-cost solutions, for which the negative exponential is larger. In fact, when λ → +∞, the above PMF tends to
π∞(x) = 1M(x) / |M|,
where the indicator function 1M(x) is 1 if x is optimal, zero otherwise. To see this, it is sufficient to divide both the numerator and the denominator in Eq. (14.10) by e^(−λf*), which yields
πλ(x) = e^(−λ [f(x) − f*]) / ( |M| + ∑y∈S\M e^(−λ [f(y) − f*]) ),
where \ denotes set difference. By letting λ → +∞, we obtain a very desirable distribution assigning the same positive probability to optimal solutions, and zero to nonoptimal solutions.
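A tiny numerical illustration of this limiting behavior, for an arbitrary cost vector on a five-element set S, is the following.

# the PMF proportional to exp(-lambda*f(x)) concentrates on the minimizers of f
f <- c(3, 1, 4, 1, 5)                   # arbitrary costs; two minimizers, f* = 1
pmf <- function(lambda) {
  w <- exp(-lambda * (f - min(f)))      # subtracting f* does not change the PMF
  w / sum(w)
}
round(rbind(lambda.0.5 = pmf(0.5),
            lambda.2   = pmf(2),
            lambda.10  = pmf(10)), 4)   # tends to (0, 0.5, 0, 0.5, 0)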
Now we have to define a candidate density to generate an alternative candidate solution x′. A simple approach is to sample randomly the neighborhood N(x) of the current solution x; the candidate PMF assigns a uniform probability, 1/|N(x)|, to each solution in that neighborhood. Then, given the desired target PMF, the acceptance probability is
α(x, x′) = min{1, [πλ(x′) / |N(x′)|] / [πλ(x) / |N(x)|]} = min{1, (|N(x)| / |N(x′)|) · e^(−λ [f(x′) − f(x)])}.
If the neighborhood structure is the same for each solution, then the above expression boils down to
α(x, x′) = min{1, e^(−λ [f(x′) − f(x)])},
which is just Eq. (14.9), provided that we substitute λ = 1/T. This is a reassuring result, as it suggests a convergence guarantee of simulated annealing to a globally optimal solution. However, we need to find a suitable cooling schedule for the temperature. Furthermore, a cynical reader might argue that, if the set of feasible solutions is finite, any complete enumeration algorithm will find an optimal solution in finite time, rather than in the long run. The objection does not necessarily apply if we are considering a continuous nonconvex optimization model, possibly arising from a difficult model calibration problem.
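To wrap up, the following R sketch shows the Metropolis acceptance rule at work within a bare-bones simulated annealing scheme, applied to a toy combinatorial problem: minimizing a quadratic form over binary vectors, with a neighborhood obtained by flipping one randomly chosen component. The initial temperature and the geometric cooling schedule are arbitrary choices for illustration.

# simulated annealing with the Metropolis rule: accept with prob. min(1, exp(-df/T))
set.seed(55555)
n <- 20
Q <- matrix(rnorm(n * n), n, n); Q <- Q + t(Q)   # arbitrary symmetric cost matrix
f <- function(x) drop(t(x) %*% Q %*% x)
x <- rbinom(n, 1, 0.5)                  # random starting solution
fx <- f(x)
best <- x; fbest <- fx
Temp <- 10                              # initial temperature (illustrative)
for (k in 1:5000) {
  cand <- x
  j <- sample(n, 1)                     # flip one randomly chosen bit
  cand[j] <- 1 - cand[j]
  df <- f(cand) - fx
  if (df < 0 || runif(1) <= exp(-df / Temp)) {   # Metropolis acceptance
    x <- cand; fx <- fx + df
    if (fx < fbest) { best <- x; fbest <- fx }
  }
  Temp <- 0.999 * Temp                  # geometric cooling schedule
}
fbest                                   # best objective value found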
1 F. Black and R. Litterman. Global portfolio optimization. Financial Analysts Journal, 48:28–43, 1992.
2 K. Böcker, editor. Rethinking Risk Measurement and Reporting: Volumes I and II. Risk Books, London, 2010.
3 W.M. Bolstad. Introduction to Bayesian Statistics (2nd ed.). Wiley, Hoboken, NJ, 2007.
4 W.M. Bolstad. Understanding Computational Bayesian Statistics. Wiley, Hoboken, NJ, 2010.
5 E. Greenberg. Introduction to Bayesian Econometrics (2nd ed.). Cambridge University Press, Cambridge, 2012.
6 W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
7 G. He and R. Litterman. The intuition behind Black–Litterman model portfolios. Technical report, Investment Management Research, Goldman, Sachs & Company, 1999. A more recent version can be downloaded from http://www.ssrn.org.
8 D.P. Kroese, T. Taimre, and Z.I. Botev. Handbook of Monte Carlo Methods. Wiley, Hoboken, NJ, 2011.
9 J.K. Kruschke. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press, Burlington, MA, 2011.
10 N.A. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
11 S. Ross. Simulation (2nd ed.). Academic Press, San Diego, CA, 1997.
1See [10] for an early reference. The idea was refined and improved in [6].
2See, e.g., [4, pp. 129–130] for a readable proof.
3Some authors use the notation q(θ’ | θ) to clarify the role of the two arguments.