Consider an arbitrary connected graph (see Section 3.6 for definitions) having a number wij associated with arc (i,j) for each arc. One instance of such a graph is given by Figure 4.1. Now consider a particle moving from node to node in this manner: If at any time the particle resides at node i, then it will next move to node j with probability Pij where
P_{ij} = \frac{w_{ij}}{\sum_j w_{ij}}
and where w_{ij} is 0 if (i, j) is not an arc. For instance, for the graph of Figure 4.1, P_{12} = 3/(3+1+2) = 1/2.
The time reversibility equations
\pi_i P_{ij} = \pi_j P_{ji}
reduce to
\frac{\pi_i w_{ij}}{\sum_j w_{ij}} = \frac{\pi_j w_{ji}}{\sum_i w_{ji}}
or, equivalently, since wij=wji
\frac{\pi_i}{\sum_j w_{ij}} = \frac{\pi_j}{\sum_i w_{ji}}
which is equivalent to
\frac{\pi_i}{\sum_j w_{ij}} = c
or
\pi_i = c \sum_j w_{ij}
or, since 1=∑iπi
\pi_i = \frac{\sum_j w_{ij}}{\sum_i \sum_j w_{ij}}
Because the \pi_i's given by this equation satisfy the time reversibility equations, it follows that the process is time reversible with these limiting probabilities.
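As a quick numerical illustration of this result, the following sketch (ours, not from the text) builds a small symmetric weight matrix, a made-up stand-in for Figure 4.1, forms the transition matrix P_{ij} = w_{ij}/\sum_j w_{ij}, and checks both stationarity and the time reversibility equations.

```python
import numpy as np

# Made-up symmetric weights; w[i][j] = 0 means there is no arc (i, j).
w = np.array([[0.0, 3.0, 1.0, 2.0],
              [3.0, 0.0, 0.0, 4.0],
              [1.0, 0.0, 0.0, 5.0],
              [2.0, 4.0, 5.0, 0.0]])

P = w / w.sum(axis=1, keepdims=True)    # P_ij = w_ij / sum_j w_ij
pi = w.sum(axis=1) / w.sum()            # pi_i = sum_j w_ij / sum_i sum_j w_ij

print(np.allclose(pi @ P, pi))          # stationarity: pi P = pi
flow = pi[:, None] * P                  # flow[i, j] = pi_i * P_ij
print(np.allclose(flow, flow.T))        # detailed balance: pi_i P_ij = pi_j P_ji
```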
If we try to solve Equation (4.22) for an arbitrary Markov chain with states 0,1,…,M, it will usually turn out that no solution exists. For example, from Equation (4.22),
x_i P_{ij} = x_j P_{ji}, \qquad x_k P_{kj} = x_j P_{jk}
implying (if PijPjk>0) that
\frac{x_i}{x_k} = \frac{P_{ji} P_{kj}}{P_{ij} P_{jk}}
which in general need not equal Pki/Pik. Thus, we see that a necessary condition for time reversibility is that
P_{ik} P_{kj} P_{ji} = P_{ij} P_{jk} P_{ki} \quad \text{for all } i, j, k \qquad (4.25)
which is equivalent to the statement that, starting in state i, the path i→k→j→i has the same probability as the reversed path i→j→k→i. To understand the necessity of this, note that time reversibility implies that the rate at which a sequence of transitions from i to k to j to i occurs must equal the rate of ones from i to j to k to i (why?), and so we must have

\pi_i P_{ik} P_{kj} P_{ji} = \pi_i P_{ij} P_{jk} P_{ki}

which, since \pi_i > 0, implies Equation (4.25). In fact, we can show the following theorem.

Theorem 4.2

An ergodic Markov chain for which P_{ij} = 0 whenever P_{ji} = 0 is time reversible if and only if, starting in state i, any path back to i has the same probability as the reversed path. That is, if
P_{i,i_1} P_{i_1,i_2} \cdots P_{i_k,i} = P_{i,i_k} P_{i_k,i_{k-1}} \cdots P_{i_1,i} \qquad (4.26)
for all states i,i1,…,ik.
Proof
We have already proven necessity. To prove sufficiency, fix states i and j and rewrite (4.26) as
P_{i,i_1} P_{i_1,i_2} \cdots P_{i_k,j} P_{j,i} = P_{i,j} P_{j,i_k} \cdots P_{i_1,i}
Summing the preceding over all states i1,…,ik yields
P_{ij}^{k+1} P_{ji} = P_{ij} P_{ji}^{k+1}
Letting k→∞ yields
\pi_j P_{ji} = P_{ij} \pi_i
which proves the theorem. ■
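The necessary condition (4.25) is easy to test numerically. The sketch below is our own illustration (function and matrix names are made up): it checks the triple condition for a transition matrix and contrasts a reversible random walk on a weighted graph with a chain that tends to cycle in one direction. A complete check of the theorem's path condition would also examine longer cycles.

```python
import numpy as np
from itertools import permutations

def satisfies_triple_condition(P, tol=1e-12):
    """Check condition (4.25): P_ik P_kj P_ji = P_ij P_jk P_ki for all distinct i, j, k."""
    n = P.shape[0]
    return all(abs(P[i, k] * P[k, j] * P[j, i] - P[i, j] * P[j, k] * P[k, i]) <= tol
               for i, j, k in permutations(range(n), 3))

# Reversible: random walk on a weighted graph (made-up symmetric weights).
w = np.array([[0.0, 3.0, 1.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])
print(satisfies_triple_condition(w / w.sum(axis=1, keepdims=True)))   # True

# Not reversible: a chain that mostly cycles 0 -> 1 -> 2 -> 0.
Q = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
print(satisfies_triple_condition(Q))                                  # False
```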
Example 4.37
Suppose we are given a set of n elements, numbered 1 through n, which are to be arranged in some ordered list. At each unit of time a request is made to retrieve one of these elements, element i being requested (independently of the past) with probability Pi. After being requested, the element then is put back but not necessarily in the same position. In fact, let us suppose that the element requested is moved one closer to the front of the list; for instance, if the present list ordering is 1, 3, 4, 2, 5 and element 2 is requested, then the new ordering becomes 1, 3, 2, 4, 5. We are interested in the long-run average position of the element requested.
For any given probability vector P = (P_1, \ldots, P_n), the preceding can be modeled as a Markov chain with n! states, with the state at any time being the list order at that time. We shall show that this Markov chain is time reversible and then use this to show that the average position of the element requested when this one-closer rule is in effect is less than when the rule of always moving the requested element to the front of the line is used. The time reversibility of the resulting Markov chain when the one-closer reordering rule is in effect easily follows from Theorem 4.2. For instance, suppose n = 3 and consider the following path from state (1, 2, 3) to itself:

(1, 2, 3) \to (2, 1, 3) \to (2, 3, 1) \to (3, 2, 1) \to (3, 1, 2) \to (1, 3, 2) \to (1, 2, 3)
The product of the transition probabilities in the forward direction is
P_2 P_3 P_3 P_1 P_1 P_2 = P_1^2 P_2^2 P_3^2
whereas in the reverse direction, it is
P_3 P_3 P_2 P_2 P_1 P_1 = P_1^2 P_2^2 P_3^2
Because the general result follows in much the same manner, the Markov chain is indeed time reversible. (For a formal argument note that if fi denotes the number of times element i moves forward in the path, then as the path goes from a fixed state back to itself, it follows that element i will also move backward fi times. Therefore, since the backward moves of element i are precisely the times that it moves forward in the reverse path, it follows that the product of the transition probabilities for both the path and its reversal will equal
\prod_i P_i^{f_i + r_i}
where ri is equal to the number of times that element i is in the first position and the path (or the reverse path) does not change states.)
For any permutation i_1, i_2, \ldots, i_n of 1, 2, \ldots, n, let \pi(i_1, i_2, \ldots, i_n) denote the limiting probability under the one-closer rule. By time reversibility we have

P_{i_{j+1}} \pi(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) = P_{i_j} \pi(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) \qquad (4.27)

for all permutations and all 1 \leqslant j < n.
Now the average position of the element requested can be expressed (as in Section 3.6.1) as
\text{Average position} = \sum_i P_i E[\text{Position of element } i]
= \sum_i P_i \Bigl(1 + \sum_{j \neq i} P\{\text{element } j \text{ precedes element } i\}\Bigr)
= 1 + \sum_i \sum_{j \neq i} P_i P\{e_j \text{ precedes } e_i\}
= 1 + \sum\sum_{i<j} \bigl[P_i P\{e_j \text{ precedes } e_i\} + P_j P\{e_i \text{ precedes } e_j\}\bigr]
= 1 + \sum\sum_{i<j} \bigl[P_i P\{e_j \text{ precedes } e_i\} + P_j \bigl(1 - P\{e_j \text{ precedes } e_i\}\bigr)\bigr]
= 1 + \sum\sum_{i<j} (P_i - P_j) P\{e_j \text{ precedes } e_i\} + \sum\sum_{i<j} P_j
Hence, to minimize the average position of the element requested, we would want to make P\{e_j \text{ precedes } e_i\} as large as possible when P_j > P_i and as small as possible when P_i > P_j. Under the front-of-the-line rule we showed in Section 3.6.1 that
P\{e_j \text{ precedes } e_i\} = \frac{P_j}{P_j + P_i}
(since under the front-of-the-line rule element j will precede element i if and only if the last request for either i or j was for j).
Therefore, to show that the one-closer rule is better than the front-of-the-line rule, it suffices to show that under the one-closer rule
P\{e_j \text{ precedes } e_i\} > \frac{P_j}{P_j + P_i} \quad \text{when } P_j > P_i
Now consider any state where element i precedes element j, say, (\ldots, i, i_1, \ldots, i_k, j, \ldots). By successive transpositions using Equation (4.27), we have

\pi(\ldots, i, i_1, \ldots, i_k, j, \ldots) = \left(\frac{P_i}{P_j}\right)^{k+1} \pi(\ldots, j, i_1, \ldots, i_k, i, \ldots) \qquad (4.28)

and so, when P_j > P_i,

\pi(\ldots, i, i_1, \ldots, i_k, j, \ldots) \leqslant \frac{P_i}{P_j}\, \pi(\ldots, j, i_1, \ldots, i_k, i, \ldots)

with strict inequality when k \geqslant 1.
Letting α(i,j)=P{eiprecedesej}, we see by summing over all states for which i precedes j and by using the preceding that
\alpha(i,j) < \frac{P_i}{P_j}\, \alpha(j,i)
which, since α(i,j)=1-α(j,i), yields
\alpha(j,i) > \frac{P_j}{P_j + P_i}
Hence, the average position of the element requested is indeed smaller under the one-closer rule than under the front-of-the-line rule. ■
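A direct simulation makes the comparison concrete. The sketch below is ours (the request probabilities are illustrative): it estimates the long-run average position of the requested element under both rules, and with unequal P_i the one-closer estimate comes out smaller, as the argument above predicts.

```python
import random

def average_position(p, rule, iters=200_000, seed=1):
    """Estimate the long-run average position of the requested element under
    'one_closer' (transpose with the element in front) or 'front' (move to front)."""
    rng = random.Random(seed)
    n = len(p)
    order = list(range(n))                 # current list ordering
    total = 0
    for _ in range(iters):
        i = rng.choices(range(n), weights=p)[0]      # requested element
        pos = order.index(i)
        total += pos + 1                             # positions counted from 1
        if rule == 'front':
            order.insert(0, order.pop(pos))
        elif rule == 'one_closer' and pos > 0:
            order[pos - 1], order[pos] = order[pos], order[pos - 1]
    return total / iters

p = [0.4, 0.3, 0.2, 0.1]                   # illustrative request probabilities
print(average_position(p, 'one_closer'))   # smaller ...
print(average_position(p, 'front'))        # ... than the move-to-front average
```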
The concept of the reversed chain is useful even when the process is not time reversible. To illustrate this, we start with the following proposition whose proof is left as an exercise.
Proposition 4.9
Consider an irreducible Markov chain with transition probabilities Pij. If we can find positive numbers πi,i⩾0, summing to one, and a transition probability matrix Q=[Qij] such that
\pi_i P_{ij} = \pi_j Q_{ji} \qquad (4.29)
then the Qij are the transition probabilities of the reversed chain and the πi are the stationary probabilities both for the original and reversed chain.
The importance of the preceding proposition is that, by thinking backward, we can sometimes guess at the nature of the reversed chain and then use the set of Equations (4.29) to obtain both the stationary probabilities and the Qij.
Example 4.38
A single bulb is necessary to light a given room. When the bulb in use fails, it is replaced by a new one at the beginning of the next day. Let Xn equal i if the bulb in use at the beginning of day n is in its ith day of use (that is, if its present age is i). For instance, if a bulb fails on day n-1, then a new bulb will be put in use at the beginning of day n and so Xn=1. If we suppose that each bulb, independently, fails on its ith day of use with probability pi,i⩾1, then it is easy to see that {Xn,n⩾1} is a Markov chain whose transition probabilities are as follows:
P_{i,1} = P\{\text{bulb, on its } i\text{th day of use, fails}\} = P\{\text{life of bulb} = i \mid \text{life of bulb} \geqslant i\} = \frac{P\{L = i\}}{P\{L \geqslant i\}}
where L, a random variable representing the lifetime of a bulb, is such that P{L=i}=pi. Also,
P_{i,i+1} = 1 - P_{i,1}
Suppose now that this chain has been in operation for a long (in theory, an infinite) time and consider the sequence of states going backward in time. Since, in the forward direction, the state is always increasing by 1 until it reaches the age at which the item fails, it is easy to see that the reverse chain will always decrease by 1 until it reaches 1 and then it will jump to a random value representing the lifetime of the (in real time) previous bulb. Thus, it seems that the reverse chain should have transition probabilities given by
Q_{i,i-1} = 1, \quad i > 1, \qquad Q_{1,i} = p_i, \quad i \geqslant 1
To check this, and at the same time determine the stationary probabilities, we must see if we can find, with the Qi,j as previously given, positive numbers {πi} such that
\pi_i P_{i,j} = \pi_j Q_{j,i}
To begin, let j=1 and consider the resulting equations:
\pi_i P_{i,1} = \pi_1 Q_{1,i}
This is equivalent to
\pi_i \frac{P\{L = i\}}{P\{L \geqslant i\}} = \pi_1 P\{L = i\}
or
\pi_i = \pi_1 P\{L \geqslant i\}
Summing over all i yields
1 = \sum_{i=1}^{\infty} \pi_i = \pi_1 \sum_{i=1}^{\infty} P\{L \geqslant i\} = \pi_1 E[L]
and so, for the preceding Qij to represent the reverse transition probabilities, it is necessary for the stationary probabilities to be
\pi_i = \frac{P\{L \geqslant i\}}{E[L]}, \quad i \geqslant 1
To finish the proof that the reverse transition probabilities and stationary probabilities are as given, all that remains is to show that they satisfy
\pi_i P_{i,i+1} = \pi_{i+1} Q_{i+1,i}
which is equivalent to
\frac{P\{L \geqslant i\}}{E[L]} \left(1 - \frac{P\{L = i\}}{P\{L \geqslant i\}}\right) = \frac{P\{L \geqslant i + 1\}}{E[L]}
and which is true since P{L⩾i}-P{L=i}=P{L⩾i+1}. ■
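The stationary probabilities \pi_i = P\{L \geqslant i\}/E[L] can also be confirmed numerically. The sketch below is ours (the lifetime distribution is made up): it builds the forward transition matrix for a bounded lifetime and checks that the claimed \pi is stationary.

```python
import numpy as np

pL = np.array([0.1, 0.2, 0.3, 0.25, 0.15])         # made-up P{L = i}, i = 1..5
m = len(pL)
tail = np.array([pL[i:].sum() for i in range(m)])  # tail[i-1] = P{L >= i}
EL = tail.sum()                                    # E[L] = sum_i P{L >= i}

# Forward chain on ages 1..m (index i-1): fail w.p. P{L=i}/P{L>=i}, else age + 1.
P = np.zeros((m, m))
for i in range(m):
    fail = pL[i] / tail[i]
    P[i, 0] = fail
    if i + 1 < m:
        P[i, i + 1] = 1 - fail

pi = tail / EL                                     # claimed stationary distribution
print(np.allclose(pi @ P, pi))                     # True
```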
4.9 Markov Chain Monte Carlo Methods
Let X be a discrete random vector whose set of possible values is xj,j⩾1. Let the probability mass function of X be given by P{X=xj},j⩾1, and suppose that we are interested in calculating
\theta = E[h(X)] = \sum_{j=1}^{\infty} h(x_j) P\{X = x_j\}
for some specified function h. In situations where it is computationally difficult to evaluate the function h(xj),j⩾1, we often turn to simulation to approximate θ. The usual approach, called Monte Carlo simulation, is to use random numbers to generate a partial sequence of independent and identically distributed random vectors X1,X2,…,Xn having the mass function P{X=xj}, j⩾1 (see Chapter 11 for a discussion as to how this can be accomplished). Since the strong law of large numbers yields
\lim_{n \to \infty} \frac{\sum_{i=1}^{n} h(X_i)}{n} = \theta \qquad (4.30)
it follows that we can estimate θ by letting n be large and using the average of the values of h(Xi),i=1,…,n as the estimator.
It often, however, turns out that it is difficult to generate a random vector having the specified probability mass function, particularly if X is a vector of dependent random variables. In addition, its probability mass function is sometimes given in the form P{X=xj}=Cbj,j⩾1, where the bj are specified, but C must be computed, and in many applications it is not computationally feasible to sum the bj so as to determine C. Fortunately, however, there is another way of using simulation to estimate θ in these situations. It works by generating a sequence, not of independent random vectors, but of the successive states of a vector-valued Markov chain X1,X2,… whose stationary probabilities are P{X=xj}, j⩾1. If this can be accomplished, then it would follow from Proposition 4.7 that Equation (4.30) remains valid, implying that we can then use ∑i=1nh(Xi)/n as an estimator of θ.
We now show how to generate a Markov chain with arbitrary stationary probabilities that may only be specified up to a multiplicative constant. Let b(j),j=1,2,… be positive numbers whose sum B=∑j=1∞b(j) is finite. The following, known as the Hastings–Metropolis algorithm, can be used to generate a time reversible Markov chain whose stationary probabilities are
\pi(j) = b(j)/B, \qquad j = 1, 2, \ldots
To begin, let Q be any specified irreducible Markov transition probability matrix on the integers, with q(i, j) representing the row i column j element of Q. Now define a Markov chain \{X_n,\ n \geqslant 0\} as follows. When X_n = i, generate a random variable Y such that P\{Y = j\} = q(i, j),\ j = 1, 2, \ldots. If Y = j, then set X_{n+1} equal to j with probability \alpha(i, j), and set it equal to i with probability 1 - \alpha(i, j). Under these conditions, it is easy to see that the sequence of states constitutes a Markov chain with transition probabilities P_{i,j} given by

P_{i,j} = q(i, j)\, \alpha(i, j), \qquad j \neq i
P_{i,i} = q(i, i) + \sum_{k \neq i} q(i, k)\bigl(1 - \alpha(i, k)\bigr)
This Markov chain will be time reversible and have stationary probabilities π(j) if
\pi(i) P_{i,j} = \pi(j) P_{j,i} \quad \text{for } j \neq i
which is equivalent to
\pi(i) q(i,j) \alpha(i,j) = \pi(j) q(j,i) \alpha(j,i) \qquad (4.31)
But if we take \pi(j) = b(j)/B and set
\alpha(i,j) = \min\left(\frac{\pi(j) q(j,i)}{\pi(i) q(i,j)},\ 1\right) \qquad (4.32)
then Equation (4.31) is easily seen to be satisfied. For if
\alpha(i,j) = \frac{\pi(j) q(j,i)}{\pi(i) q(i,j)}
then α(j,i)=1 and Equation (4.31) follows, and if α(i,j)=1 then
\alpha(j,i) = \frac{\pi(i) q(i,j)}{\pi(j) q(j,i)}
and again Equation (4.31) holds, thus showing that the Markov chain is time reversible with stationary probabilities π(j). Also, since π(j)=b(j)/B, we see from (4.32) that
\alpha(i,j) = \min\left(\frac{b(j) q(j,i)}{b(i) q(i,j)},\ 1\right)
which shows that the value of B is not needed to define the Markov chain, because the values b(j) suffice. Also, it is almost always the case that π(j),j⩾1 will not only be stationary probabilities but will also be limiting probabilities. (Indeed, a sufficient condition is that Pi,i>0 for some i.)
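The algorithm translates almost line for line into code. The following sketch is our illustration (the target weights b(j) = j^2 and the reflected random-walk proposal are made up); it implements the acceptance rule (4.32) using only the unnormalized weights b(j).

```python
import random
from collections import Counter

def hastings_metropolis(b, q_sample, q_prob, x0, steps, seed=0):
    """b(j): unnormalized target weight; q_sample(i): draw a candidate from row i of Q;
    q_prob(i, j): proposal probability q(i, j).  Returns the chain's trajectory."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(steps):
        y = q_sample(x)
        alpha = min(b(y) * q_prob(y, x) / (b(x) * q_prob(x, y)), 1.0)   # Eq. (4.32)
        if rng.random() < alpha:
            x = y
        path.append(x)
    return path

# Illustrative target on {1,...,10} with b(j) = j^2 and a +/-1 proposal walk
# reflected at the endpoints.
def q_sample(i):
    if i == 1:  return 2
    if i == 10: return 9
    return i + random.choice((-1, 1))

def q_prob(i, j):
    if abs(i - j) != 1:
        return 0.0
    return 1.0 if i in (1, 10) else 0.5

path = hastings_metropolis(lambda j: j * j, q_sample, q_prob, x0=5, steps=200_000)
freq = Counter(path[20_000:])                       # drop a burn-in prefix
print({j: round(freq[j] / len(path[20_000:]), 3) for j in range(1, 11)})
# The empirical frequencies should be roughly proportional to j^2.
```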
Example 4.39
Suppose that we want to generate a uniformly distributed element in S, the set of all permutations (x_1, \ldots, x_n) of the numbers (1, \ldots, n) for which \sum_{j=1}^{n} j x_j > a for a given constant a. To utilize the Hastings–Metropolis algorithm we need to define an irreducible Markov transition probability matrix on the state space S. To accomplish this, we first define a concept of “neighboring” elements of S, and then construct a graph whose vertex set is S. We start by putting an arc between each pair of neighboring elements in S, where any two permutations in S are said to be neighbors if one results from an interchange of two of the positions of the other. That is, (1, 2, 3, 4) and (1, 2, 4, 3) are neighbors whereas (1, 2, 3, 4) and (1, 3, 4, 2) are not. Now, define the q transition probability function as follows. With N(s) defined as the set of neighbors of s, and |N(s)| equal to the number of elements in the set N(s), let
q(s, t) = \frac{1}{|N(s)|} \quad \text{if } t \in N(s)
That is, the candidate next state from s is equally likely to be any of its neighbors. Since the desired limiting probabilities of the Markov chain are π(s)=C, it follows that π(s)=π(t), and so
\alpha(s, t) = \min\bigl(|N(s)|/|N(t)|,\ 1\bigr)
That is, if the present state of the Markov chain is s then one of its neighbors is randomly chosen, say, t. If t is a state with fewer neighbors than s (in graph theory language, if the degree of vertex t is less than that of vertex s), then the next state is t. If not, a uniform (0, 1) random number U is generated and the next state is t if U < |N(s)|/|N(t)| and is s otherwise. The limiting probabilities of this Markov chain are \pi(s) = 1/|S|, where |S| is the (unknown) number of permutations in S. ■
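A sketch of this sampler follows (ours; the helper names and the values of n and a are illustrative). Note that the acceptance test \min(|N(s)|/|N(t)|,\ 1) only ever requires counting neighbors, never the size of S.

```python
import random

def neighbors(perm, a):
    """Permutations in S reachable from perm by interchanging two positions."""
    n, out = len(perm), []
    for i in range(n):
        for j in range(i + 1, n):
            cand = list(perm)
            cand[i], cand[j] = cand[j], cand[i]
            if sum((k + 1) * x for k, x in enumerate(cand)) > a:   # stay inside S
                out.append(tuple(cand))
    return out

def sample_constrained_permutation(n, a, steps=20_000, seed=0):
    """Approximate a uniform draw from S = {permutations with sum_j j*x_j > a}."""
    rng = random.Random(seed)
    s = tuple(range(1, n + 1))     # the identity permutation maximizes sum_j j*x_j
    for _ in range(steps):
        Ns = neighbors(s, a)
        t = rng.choice(Ns)
        if rng.random() < len(Ns) / len(neighbors(t, a)):   # alpha(s, t)
            s = t
    return s

print(sample_constrained_permutation(n=6, a=80))
```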
The most widely used version of the Hastings–Metropolis algorithm is the Gibbs sampler. Let X=(X1,…,Xn) be a discrete random vector with probability mass function p(x) that is only specified up to a multiplicative constant, and suppose that we want to generate a random vector whose distribution is that of X. That is, we want to generate a random vector having mass function
p(x)=Cg(x)
where g(x) is known, but C is not. Utilization of the Gibbs sampler assumes that for any i and values xj,j≠i, we can generate a random variable X having the probability mass function
P\{X = x\} = P\{X_i = x \mid X_j = x_j,\ j \neq i\}
It operates by using the Hastings–Metropolis algorithm on a Markov chain with states x = (x_1, \ldots, x_n), and with transition probabilities defined as follows. Whenever the present state is x, a coordinate that is equally likely to be any of 1, \ldots, n is chosen. If coordinate i is chosen, then a random variable X with probability mass function P\{X = x\} = P\{X_i = x \mid X_j = x_j,\ j \neq i\} is generated. If X = x, then the state y = (x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_n) is considered as the candidate next state. In other words, with x and y as given, the Gibbs sampler uses the Hastings–Metropolis algorithm with
q(x, y) = \frac{1}{n} P\{X_i = x \mid X_j = x_j,\ j \neq i\} = \frac{p(y)}{n P\{X_j = x_j,\ j \neq i\}}
Because we want the limiting mass function to be p, we see from Equation (4.32) that the vector y is then accepted as the new state with probability

\alpha(x, y) = \min\left(\frac{p(y)\, q(y, x)}{p(x)\, q(x, y)},\ 1\right) = \min\left(\frac{p(y)\, p(x)}{p(x)\, p(y)},\ 1\right) = 1
Hence, when utilizing the Gibbs sampler, the candidate state is always accepted as the next state of the chain.
Example 4.40
Suppose that we want to generate n uniformly distributed points in the circle of radius 1 centered at the origin, conditional on the event that no two points are within a distance d of each other, when the probability of this conditioning event is small. This can be accomplished by using the Gibbs sampler as follows. Start with any n points x_1, \ldots, x_n in the circle that have the property that no two of them are within d of each other; then generate the value of I, equally likely to be any of the values 1, \ldots, n. Then continually generate a random point in the circle until you obtain one that is not within d of any of the n - 1 points x_j,\ j \neq I. At this point, replace x_I by the generated point and then repeat the operation. After a large number of iterations of this algorithm, the set of n points will approximately have the desired distribution. ■
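A sketch of this procedure is given below (ours; the number of points, the value of d, and the initial configuration are illustrative). Uniform points in the circle are produced by rejection from the bounding square.

```python
import math
import random

def random_point_in_circle(rng):
    """Uniform point in the unit circle, by rejection from the bounding square."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)

def gibbs_hard_points(points, d, iters=20_000, seed=0):
    """Gibbs sampler for n points in the unit circle conditioned on pairwise
    distances of at least d; `points` must already satisfy the constraint."""
    rng = random.Random(seed)
    pts = list(points)
    n = len(pts)
    for _ in range(iters):
        i = rng.randrange(n)                 # the coordinate (point) to resample
        while True:                          # regenerate until the constraint holds
            cand = random_point_in_circle(rng)
            if all(math.dist(cand, pts[j]) >= d for j in range(n) if j != i):
                pts[i] = cand
                break
    return pts

init = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
print(gibbs_hard_points(init, d=0.4))
```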
Example 4.41
Let Xi,i=1,…,n, be independent exponential random variables with respective rates λi,i=1,…,n. Let S=∑i=1nXi, and suppose that we want to generate the random vector X=(X1,…,Xn), conditional on the event that S>c for some large positive constant c. That is, we want to generate the value of a random vector whose density function is
f(x_1, \ldots, x_n) = \frac{1}{P\{S > c\}} \prod_{i=1}^{n} \lambda_i e^{-\lambda_i x_i}, \qquad x_i \geqslant 0,\ \sum_{i=1}^{n} x_i > c
This is easily accomplished by starting with an initial vector x=(x1,…,xn) satisfying xi>0,i=1,…,n,∑i=1nxi>c. Then generate a random variable I that is equally likely to be any of 1,…,n. Next, generate an exponential random variable X with rate λI conditional on the event that X+∑j≠Ixj>c. This latter step, which calls for generating the value of an exponential random variable given that it exceeds c-∑j≠Ixj, is easily accomplished by using the fact that an exponential conditioned to be greater than a positive constant is distributed as the constant plus the exponential. Consequently, to obtain X, first generate an exponential random variable Y with rate λI, and then set
X = Y + \left(c - \sum_{j \neq I} x_j\right)^+
where a+=max(a,0). The value of xI should then be reset as X and a new iteration of the algorithm begun. ■
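In code, each Gibbs step is a single conditional-exponential draw. The sketch below is ours (the rates and the value of c are illustrative); it follows the recipe above, using expovariate for the exponential variable Y.

```python
import random

def gibbs_exponentials_given_sum(lam, c, iters=20_000, seed=0):
    """Gibbs sampler for independent exponentials with rates lam, conditioned
    on their sum exceeding c; returns the final state of the chain."""
    rng = random.Random(seed)
    n = len(lam)
    x = [(c + 1.0) / n] * n                   # initial vector with sum(x) > c
    for _ in range(iters):
        i = rng.randrange(n)
        rest = sum(x) - x[i]                  # sum over j != i of x_j
        y = rng.expovariate(lam[i])           # fresh exponential with rate lambda_I
        x[i] = y + max(c - rest, 0.0)         # X = Y + (c - sum_{j != I} x_j)^+
    return x

print(gibbs_exponentials_given_sum([1.0, 2.0, 0.5], c=20.0))
```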
Remark
As can be seen from Examples 4.40 and 4.41, although the theory for the Gibbs sampler was presented under the assumption that the distribution to be generated was discrete, it also holds when this distribution is continuous.
4.10 Markov Decision Processes
Consider a process that is observed at discrete time points to be in any one of M possible states, which we number by 1,2,…,M. After observing the state of the process, an action must be chosen, and we let A, assumed finite, denote the set of all possible actions.
If the process is in state i at time n and action a is chosen, then the next state of the system is determined according to the transition probabilities Pij(a). If we let Xn denote the state of the process at time n and an the action chosen at time n, then the preceding is equivalent to stating that
P\{X_{n+1} = j \mid X_0, a_0, X_1, a_1, \ldots, X_n = i, a_n = a\} = P_{ij}(a)
Thus, the transition probabilities are functions only of the present state and the subsequent action.
By a policy, we mean a rule for choosing actions. We shall restrict ourselves to policies for which the action prescribed at any time depends only on the state of the process at that time (and not on any information concerning prior states and actions). However, we shall allow the policy to be “randomized” in that its instructions may be to choose actions according to a probability distribution. In other words, a policy β is a set of numbers β = \{\beta_i(a),\ a \in A,\ i = 1, \ldots, M\} with the interpretation that if the process is in state i, then action a is to be chosen with probability \beta_i(a). Of course, we must have
0 \leqslant \beta_i(a) \leqslant 1 \quad \text{for all } i, a, \qquad \sum_a \beta_i(a) = 1 \quad \text{for all } i
Under any given policy β, the sequence of states \{X_n,\ n = 0, 1, \ldots\} constitutes a Markov chain with transition probabilities P_{ij}(\beta) given by

P_{ij}(\beta) = P_\beta\{X_{n+1} = j \mid X_n = i\} = \sum_a P_{ij}(a)\, \beta_i(a)
where the last equality follows by conditioning on the action chosen when in state i. Let us suppose that for every choice of a policy β, the resultant Markov chain {Xn,n=0,1,…} is ergodic.
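For concreteness, here is a small sketch (ours; the states, actions, and all probabilities are made up) that forms P_{ij}(\beta) = \sum_a P_{ij}(a)\beta_i(a) from arrays and simulates one step of the chain under a randomized policy.

```python
import random
import numpy as np

# P[a][i][j] = P_ij(a); beta[i][a] = beta_i(a).  Two states, two actions, made-up numbers.
P = np.array([[[0.7, 0.3],
               [0.4, 0.6]],
              [[0.2, 0.8],
               [0.9, 0.1]]])
beta = np.array([[0.5, 0.5],
                 [0.1, 0.9]])

# Transition matrix of the induced chain: P_ij(beta) = sum_a P_ij(a) * beta_i(a).
P_beta = np.einsum('aij,ia->ij', P, beta)
print(P_beta)

rng = random.Random(0)

def step(i):
    """One transition under the randomized policy beta, starting from state i."""
    a = rng.choices([0, 1], weights=beta[i])[0]   # choose the action
    j = rng.choices([0, 1], weights=P[a][i])[0]   # then the next state given the action
    return j, a

print(step(0))
```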
For any policy β, let \pi_{ia} denote the limiting (or steady-state) probability that the process will be in state i and action a will be chosen if policy β is employed. That is,

\pi_{ia} = \lim_{n \to \infty} P_\beta\{X_n = i,\ a_n = a\}

The vector \pi = (\pi_{ia}) must satisfy

(i) \pi_{ia} \geqslant 0 \quad \text{for all } i, a
(ii) \sum_i \sum_a \pi_{ia} = 1
(iii) \sum_a \pi_{ja} = \sum_i \sum_a \pi_{ia} P_{ij}(a) \quad \text{for all } j \qquad (4.33)
Equations (i) and (ii) are obvious, and Equation (iii), which is an analogue of Theorem 4.1, follows as the left-hand side equals the steady-state probability of being in state j and the right-hand side is the same probability computed by conditioning on the state and action chosen one stage earlier.
Thus for any policy β, there is a vector π=(πia) that satisfies (i)–(iii) and with the interpretation that πia is equal to the steady-state probability of being in state i and choosing action a when policy β is employed. Moreover, it turns out that the reverse is also true. Namely, for any vector π=(πia) that satisfies (i)–(iii), there exists a policy β such that if β is used, then the steady-state probability of being in i and choosing action a equals πia. To verify this last statement, suppose that π=(πia) is a vector that satisfies (i)–(iii). Then, let the policy β=(βi(a)) be
\beta_i(a) = P\{\beta \text{ chooses } a \mid \text{state is } i\} = \frac{\pi_{ia}}{\sum_a \pi_{ia}}
Now let P_{ia} denote the limiting probability of being in i and choosing a when policy β is employed. We need to show that P_{ia} = \pi_{ia}. To do so, first note that \{P_{ia},\ i = 1, \ldots, M,\ a \in A\} are the limiting probabilities of the two-dimensional Markov chain \{(X_n, a_n),\ n \geqslant 0\}. Hence, by the fundamental Theorem 4.1, they are the unique solution of

(i′) P_{ia} \geqslant 0
(ii′) \sum_i \sum_a P_{ia} = 1
(iii′) P_{ja} = \sum_i \sum_{a'} P_{ia'} P_{ij}(a')\, \beta_j(a)

where (iii′) follows because the two-dimensional chain moves from (i, a′) to (j, a) with probability P_{ij}(a′)\beta_j(a). Substituting \beta_j(a) = \pi_{ja}/\sum_a \pi_{ja} and using (i)–(iii) shows that (\pi_{ia}) satisfies (i′)–(iii′); hence, by uniqueness, P_{ia} = \pi_{ia}.
Thus we have shown that a vector \pi = (\pi_{ia}) will satisfy (i), (ii), and (iii) of Equation (4.33) if and only if there exists a policy β such that \pi_{ia} is equal to the steady-state probability of being in state i and choosing action a when β is used. In fact, the policy β is defined by \beta_i(a) = \pi_{ia}/\sum_a \pi_{ia}.
The preceding is quite important in the determination of “optimal” policies. For instance, suppose that a reward R(i,a) is earned whenever action a is chosen in state i. Since R(Xi,ai) would then represent the reward earned at time i, the expected average reward per unit time under policy β can be expressed as
\text{expected average reward under } \beta = \lim_{n \to \infty} E_\beta\!\left[\frac{\sum_{i=1}^{n} R(X_i, a_i)}{n}\right]
Now, if πia denotes the steady-state probability of being in state i and choosing action a, it follows that the limiting expected reward at time n equals
\lim_{n \to \infty} E[R(X_n, a_n)] = \sum_i \sum_a \pi_{ia} R(i, a)
which implies that
\text{expected average reward under } \beta = \sum_i \sum_a \pi_{ia} R(i, a)
Hence, the problem of determining the policy that maximizes the expected average reward is the following:

\text{maximize } \sum_i \sum_a \pi_{ia} R(i, a)
\text{subject to } \pi_{ia} \geqslant 0 \text{ for all } i, a, \qquad \sum_i \sum_a \pi_{ia} = 1, \qquad \sum_a \pi_{ja} = \sum_i \sum_a \pi_{ia} P_{ij}(a) \text{ for all } j

This maximization problem is a linear program, and so it can be solved by standard linear programming techniques.
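Since the maximization is a linear program in the variables \pi_{ia}, it can be handed to any LP solver. The sketch below is ours (the data are made up, and scipy.optimize.linprog is just one convenient choice); it solves the program and recovers the policy via \beta_i(a) = \pi_{ia}/\sum_a \pi_{ia}.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up data: M = 2 states, A = 2 actions.  P[a][i][j] = P_ij(a), R[i][a] = R(i, a).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
R = np.array([[5.0, 10.0],
              [-1.0, 2.0]])
M, A = R.shape

c = -R.flatten()                      # linprog minimizes, so negate the rewards

A_eq, b_eq = [np.ones(M * A)], [1.0]  # (ii): sum_{i,a} pi_{ia} = 1
for j in range(M):                    # (iii): sum_a pi_{ja} = sum_{i,a} pi_{ia} P_ij(a)
    row = np.zeros(M * A)
    for i in range(M):
        for a in range(A):
            row[i * A + a] += P[a][i][j]
            if i == j:
                row[i * A + a] -= 1.0
    A_eq.append(row)
    b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq,
              bounds=[(0, None)] * (M * A), method='highs')   # (i): pi_{ia} >= 0
pi = res.x.reshape(M, A)
beta = pi / pi.sum(axis=1, keepdims=True)   # beta_i(a) = pi_{ia} / sum_a pi_{ia}
print(pi, beta, -res.fun, sep='\n')
```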