Example 4.36

Consider an arbitrary connected graph (see Section 3.6 for definitions) having a number $w_{ij}$ associated with arc $(i, j)$ for each arc. One instance of such a graph is given by Figure 4.1. Now consider a particle moving from node to node in this manner: If at any time the particle resides at node $i$, then it will next move to node $j$ with probability $P_{ij}$, where

$$P_{ij} = \frac{w_{ij}}{\sum_j w_{ij}}$$

and where $w_{ij}$ is 0 if $(i, j)$ is not an arc. For instance, for the graph of Figure 4.1, $P_{12} = 3/(3+1+2) = \tfrac{1}{2}$.

The time reversibility equations

$$\pi_i P_{ij} = \pi_j P_{ji}$$

reduce to

$$\pi_i \frac{w_{ij}}{\sum_j w_{ij}} = \pi_j \frac{w_{ji}}{\sum_i w_{ji}}$$

or, equivalently, since $w_{ij} = w_{ji}$,

$$\frac{\pi_i}{\sum_j w_{ij}} = \frac{\pi_j}{\sum_i w_{ji}}$$

which is equivalent to

$$\frac{\pi_i}{\sum_j w_{ij}} = c$$

or

$$\pi_i = c \sum_j w_{ij}$$

or, since $1 = \sum_i \pi_i$,

$$\pi_i = \frac{\sum_j w_{ij}}{\sum_i \sum_j w_{ij}}$$

Because the $\pi_i$'s given by this equation satisfy the time reversibility equations, it follows that the process is time reversible with these limiting probabilities.

For the graph of Figure 4.1 we have that

$$\pi_1 = \tfrac{6}{32}, \quad \pi_2 = \tfrac{3}{32}, \quad \pi_3 = \tfrac{6}{32}, \quad \pi_4 = \tfrac{5}{32}, \quad \pi_5 = \tfrac{12}{32}$$
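As a quick numerical illustration of the formula $\pi_i = \sum_j w_{ij}/\sum_i \sum_j w_{ij}$, the following sketch (in Python, with an arbitrary symmetric weight matrix, not the one of Figure 4.1) checks both stationarity and the time reversibility equations:

```python
import numpy as np

# Hypothetical symmetric weight matrix (not the one in Figure 4.1);
# w[i, j] = 0 means (i, j) is not an arc.
w = np.array([[0.0, 3.0, 1.0, 2.0],
              [3.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 4.0],
              [2.0, 1.0, 4.0, 0.0]])

row_sums = w.sum(axis=1)
P = w / row_sums[:, None]        # P_ij = w_ij / sum_j w_ij
pi = row_sums / w.sum()          # pi_i = sum_j w_ij / sum_ij w_ij

print(np.allclose(pi @ P, pi))   # stationarity: pi P = pi
flow = pi[:, None] * P           # flow[i, j] = pi_i P_ij
print(np.allclose(flow, flow.T)) # detailed balance: pi_i P_ij = pi_j P_ji
```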

If we try to solve Equation (4.22) for an arbitrary Markov chain with states $0, 1, \ldots, M$, it will usually turn out that no solution exists. For example, from Equation (4.22),

$$x_i P_{ij} = x_j P_{ji}, \qquad x_k P_{kj} = x_j P_{jk}$$

implying (if $P_{ij} P_{jk} > 0$) that

$$\frac{x_i}{x_k} = \frac{P_{ji} P_{kj}}{P_{ij} P_{jk}}$$

which in general need not equal $P_{ki}/P_{ik}$. Thus, we see that a necessary condition for time reversibility is that

$$P_{ik} P_{kj} P_{ji} = P_{ij} P_{jk} P_{ki} \quad \text{for all } i, j, k \tag{4.25}$$

which is equivalent to the statement that, starting in state $i$, the path $i \to k \to j \to i$ has the same probability as the reversed path $i \to j \to k \to i$. To understand the necessity of this, note that time reversibility implies that the rate at which a sequence of transitions from $i$ to $k$ to $j$ to $i$ occurs must equal the rate of transitions from $i$ to $j$ to $k$ to $i$ (why?), and so we must have

$$\pi_i P_{ik} P_{kj} P_{ji} = \pi_i P_{ij} P_{jk} P_{ki}$$

implying Equation (4.25) when $\pi_i > 0$.

In fact, we can show the following.

Theorem 4.2

An ergodic Markov chain for which $P_{ij} = 0$ whenever $P_{ji} = 0$ is time reversible if and only if, starting in state $i$, any path back to $i$ has the same probability as the reversed path. That is, if

$$P_{i,i_1} P_{i_1,i_2} \cdots P_{i_k,i} = P_{i,i_k} P_{i_k,i_{k-1}} \cdots P_{i_1,i} \tag{4.26}$$

for all states $i, i_1, \ldots, i_k$.

Proof

We have already proven necessity. To prove sufficiency, fix states $i$ and $j$ and rewrite (4.26) as

$$P_{i,i_1} P_{i_1,i_2} \cdots P_{i_k,j} P_{ji} = P_{ij} P_{j,i_k} \cdots P_{i_1,i}$$

Summing the preceding over all states $i_1, \ldots, i_k$ yields

$$P_{ij}^{k+1} P_{ji} = P_{ij} P_{ji}^{k+1}$$

Letting $k \to \infty$ yields

$$\pi_j P_{ji} = P_{ij} \pi_i$$

which proves the theorem. ■
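The necessary condition (4.25) on triples is easy to check numerically for a chain with a small number of states. The following sketch (the function name and the example matrix are illustrative, not from the text) verifies it for a symmetric chain, which is time reversible with respect to the uniform distribution:

```python
import itertools
import numpy as np

def satisfies_triple_condition(P, tol=1e-12):
    """Check condition (4.25): P_ik P_kj P_ji = P_ij P_jk P_ki for all i, j, k."""
    n = P.shape[0]
    return all(
        abs(P[i, k] * P[k, j] * P[j, i] - P[i, j] * P[j, k] * P[k, i]) < tol
        for i, j, k in itertools.product(range(n), repeat=3)
    )

# A symmetric chain is time reversible w.r.t. the uniform distribution,
# so it must pass the check.
P = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])
print(satisfies_triple_condition(P))   # True
```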

Example 4.37

Suppose we are given a set of $n$ elements, numbered 1 through $n$, which are to be arranged in some ordered list. At each unit of time a request is made to retrieve one of these elements, element $i$ being requested (independently of the past) with probability $P_i$. After being requested, the element is then put back but not necessarily in the same position. In fact, let us suppose that the element requested is moved one closer to the front of the list; for instance, if the present list ordering is 1, 3, 4, 2, 5 and element 2 is requested, then the new ordering becomes 1, 3, 2, 4, 5. We are interested in the long-run average position of the element requested.

For any given probability vector $P = (P_1, \ldots, P_n)$, the preceding can be modeled as a Markov chain with $n!$ states, with the state at any time being the list order at that time. We shall show that this Markov chain is time reversible and then use this to show that the average position of the element requested when this one-closer rule is in effect is less than when the rule of always moving the requested element to the front of the line is used. The time reversibility of the resulting Markov chain when the one-closer reordering rule is in effect easily follows from Theorem 4.2. For instance, suppose $n = 3$ and consider the following path from state (1, 2, 3) to itself:

$$(1,2,3) \to (2,1,3) \to (2,3,1) \to (3,2,1) \to (3,1,2) \to (1,3,2) \to (1,2,3)$$

The product of the transition probabilities in the forward direction is

$$P_2 P_3 P_3 P_1 P_1 P_2 = P_1^2 P_2^2 P_3^2$$

whereas in the reverse direction, it is

$$P_3 P_3 P_2 P_2 P_1 P_1 = P_1^2 P_2^2 P_3^2$$

Because the general result follows in much the same manner, the Markov chain is indeed time reversible. (For a formal argument note that if $f_i$ denotes the number of times element $i$ moves forward in the path, then as the path goes from a fixed state back to itself, it follows that element $i$ will also move backward $f_i$ times. Therefore, since the backward moves of element $i$ are precisely the times that it moves forward in the reverse path, it follows that the product of the transition probabilities for both the path and its reversal will equal

$$\prod_i P_i^{\,f_i + r_i}$$

where $r_i$ is equal to the number of times that element $i$ is in the first position and the path (or the reverse path) does not change states.)

For any permutation $i_1, i_2, \ldots, i_n$ of $1, 2, \ldots, n$, let $\pi(i_1, i_2, \ldots, i_n)$ denote the limiting probability under the one-closer rule. By time reversibility we have

$$P_{i_{j+1}} \pi(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) = P_{i_j} \pi(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) \tag{4.27}$$

for all permutations.

Now the average position of the element requested can be expressed (as in Section 3.6.1) as

$$
\begin{aligned}
\text{Average position} &= \sum_i P_i E[\text{Position of element } i] \\
&= \sum_i P_i \Bigl[ 1 + \sum_{j \ne i} P\{\text{element } j \text{ precedes element } i\} \Bigr] \\
&= 1 + \sum_i \sum_{j \ne i} P_i P\{e_j \text{ precedes } e_i\} \\
&= 1 + \sum_{i < j} \bigl[ P_i P\{e_j \text{ precedes } e_i\} + P_j P\{e_i \text{ precedes } e_j\} \bigr] \\
&= 1 + \sum_{i < j} \bigl[ P_i P\{e_j \text{ precedes } e_i\} + P_j \bigl(1 - P\{e_j \text{ precedes } e_i\}\bigr) \bigr] \\
&= 1 + \sum_{i < j} (P_i - P_j) P\{e_j \text{ precedes } e_i\} + \sum_{i < j} P_j
\end{aligned}
$$

Hence, to minimize the average position of the element requested, we would want to make $P\{e_j \text{ precedes } e_i\}$ as large as possible when $P_j > P_i$ and as small as possible when $P_i > P_j$. Under the front-of-the-line rule we showed in Section 3.6.1 that

$$P\{e_j \text{ precedes } e_i\} = \frac{P_j}{P_j + P_i}$$

(since under the front-of-the-line rule element $j$ will precede element $i$ if and only if the last request for either $i$ or $j$ was for $j$).

Therefore, to show that the one-closer rule is better than the front-of-the-line rule, it suffices to show that under the one-closer rule

$$P\{e_j \text{ precedes } e_i\} > \frac{P_j}{P_j + P_i} \quad \text{when } P_j > P_i$$

Now consider any state where element $i$ precedes element $j$, say, $(\ldots, i, i_1, \ldots, i_k, j, \ldots)$. By successive transpositions using Equation (4.27), we have

$$\pi(\ldots, i, i_1, \ldots, i_k, j, \ldots) = \left(\frac{P_i}{P_j}\right)^{k+1} \pi(\ldots, j, i_1, \ldots, i_k, i, \ldots) \tag{4.28}$$

For instance,

$$\pi(1,2,3) = \frac{P_2}{P_3}\,\pi(1,3,2) = \frac{P_2}{P_3}\,\frac{P_1}{P_3}\,\pi(3,1,2) = \frac{P_2}{P_3}\,\frac{P_1}{P_3}\,\frac{P_1}{P_2}\,\pi(3,2,1) = \left(\frac{P_1}{P_3}\right)^2 \pi(3,2,1)$$

Now when $P_j > P_i$, Equation (4.28) implies that

$$\pi(\ldots, i, i_1, \ldots, i_k, j, \ldots) < \frac{P_i}{P_j}\, \pi(\ldots, j, i_1, \ldots, i_k, i, \ldots)$$

Letting $\alpha(i,j) = P\{e_i \text{ precedes } e_j\}$, we see by summing over all states for which $i$ precedes $j$ and by using the preceding that

$$\alpha(i,j) < \frac{P_i}{P_j}\, \alpha(j,i)$$

which, since $\alpha(i,j) = 1 - \alpha(j,i)$, yields

$$\alpha(j,i) > \frac{P_j}{P_j + P_i}$$

Hence, the average position of the element requested is indeed smaller under the one-closer rule than under the front-of-the-line rule. ■
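A short simulation can be used to compare the two rules empirically. The following sketch (with hypothetical request probabilities) estimates the long-run average position of the requested element under both the one-closer and the front-of-the-line rules; the first estimate should come out smaller, in line with the result just proved:

```python
import random

def average_position(p, rule, steps=200_000, seed=0):
    """Estimate the long-run average (1-indexed) position of the requested
    element under a self-organizing list rule: 'one_closer' or 'front'."""
    rng = random.Random(seed)
    n = len(p)
    order = list(range(n))                       # current list ordering
    total = 0
    for _ in range(steps):
        i = rng.choices(range(n), weights=p)[0]  # element i requested w.p. p[i]
        pos = order.index(i)
        total += pos + 1
        if pos > 0:
            if rule == "front":
                order.insert(0, order.pop(pos))                          # move to front
            else:
                order[pos - 1], order[pos] = order[pos], order[pos - 1]  # one closer
    return total / steps

p = [0.5, 0.3, 0.1, 0.07, 0.03]                  # hypothetical request probabilities
print(average_position(p, "one_closer"))         # should be smaller ...
print(average_position(p, "front"))              # ... than this
```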

The concept of the reversed chain is useful even when the process is not time reversible. To illustrate this, we start with the following proposition whose proof is left as an exercise.

Proposition 4.9

Consider an irreducible Markov chain with transition probabilities $P_{ij}$. If we can find positive numbers $\pi_i, i \geq 0$, summing to one, and a transition probability matrix $Q = [Q_{ij}]$ such that

$$\pi_i P_{ij} = \pi_j Q_{ji} \tag{4.29}$$

then the $Q_{ij}$ are the transition probabilities of the reversed chain and the $\pi_i$ are the stationary probabilities both for the original and reversed chain.

The importance of the preceding proposition is that, by thinking backward, we can sometimes guess at the nature of the reversed chain and then use the set of Equations (4.29) to obtain both the stationary probabilities and the $Q_{ij}$.

Example 4.38

A single bulb is necessary to light a given room. When the bulb in use fails, it is replaced by a new one at the beginning of the next day. Let $X_n$ equal $i$ if the bulb in use at the beginning of day $n$ is in its $i$th day of use (that is, if its present age is $i$). For instance, if a bulb fails on day $n - 1$, then a new bulb will be put in use at the beginning of day $n$ and so $X_n = 1$. If we suppose that each bulb, independently, fails on its $i$th day of use with probability $p_i, i \geq 1$, then it is easy to see that $\{X_n, n \geq 1\}$ is a Markov chain whose transition probabilities are as follows:

$$
\begin{aligned}
P_{i,1} &= P\{\text{bulb, on its } i\text{th day of use, fails}\} \\
&= P\{\text{life of bulb} = i \mid \text{life of bulb} \geq i\} \\
&= \frac{P\{L = i\}}{P\{L \geq i\}}
\end{aligned}
$$

where $L$, a random variable representing the lifetime of a bulb, is such that $P\{L = i\} = p_i$. Also,

$$P_{i,i+1} = 1 - P_{i,1}$$

Suppose now that this chain has been in operation for a long (in theory, an infinite) time and consider the sequence of states going backward in time. Since, in the forward direction, the state is always increasing by 1 until it reaches the age at which the item fails, it is easy to see that the reverse chain will always decrease by 1 until it reaches 1 and then it will jump to a random value representing the lifetime of the (in real time) previous bulb. Thus, it seems that the reverse chain should have transition probabilities given by

$$
\begin{aligned}
Q_{i,i-1} &= 1, \quad i > 1, \\
Q_{1,i} &= p_i, \quad i \geq 1
\end{aligned}
$$

To check this, and at the same time determine the stationary probabilities, we must see if we can find, with the $Q_{i,j}$ as previously given, positive numbers $\{\pi_i\}$ such that

$$\pi_i P_{i,j} = \pi_j Q_{j,i}$$

To begin, let $j = 1$ and consider the resulting equations:

$$\pi_i P_{i,1} = \pi_1 Q_{1,i}$$

This is equivalent to

$$\pi_i \frac{P\{L = i\}}{P\{L \geq i\}} = \pi_1 P\{L = i\}$$

or

$$\pi_i = \pi_1 P\{L \geq i\}$$

Summing over all $i$ yields

$$1 = \sum_{i=1}^{\infty} \pi_i = \pi_1 \sum_{i=1}^{\infty} P\{L \geq i\} = \pi_1 E[L]$$

and so, for the preceding $Q_{ij}$ to represent the reverse transition probabilities, it is necessary for the stationary probabilities to be

$$\pi_i = \frac{P\{L \geq i\}}{E[L]}, \quad i \geq 1$$

To finish the proof that the reverse transition probabilities and stationary probabilities are as given, all that remains is to show that they satisfy

$$\pi_i P_{i,i+1} = \pi_{i+1} Q_{i+1,i}$$

which is equivalent to

$$\frac{P\{L \geq i\}}{E[L]} \left[1 - \frac{P\{L = i\}}{P\{L \geq i\}}\right] = \frac{P\{L \geq i+1\}}{E[L]}$$

and which is true since $P\{L \geq i\} - P\{L = i\} = P\{L \geq i+1\}$. ■
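As a small numerical check of Example 4.38, the following sketch builds the forward transition matrix for a hypothetical four-day lifetime distribution and confirms that $\pi_i = P\{L \geq i\}/E[L]$ is stationary:

```python
import numpy as np

p = np.array([0.1, 0.3, 0.4, 0.2])               # hypothetical P{L = i}, i = 1..4
n = len(p)
surv = np.array([p[i:].sum() for i in range(n)]) # P{L >= i}
EL = surv.sum()                                  # E[L] = sum_i P{L >= i}

P = np.zeros((n, n))                             # forward chain on ages 1..n
for i in range(n):
    P[i, 0] = p[i] / surv[i]                     # P_{i,1} = P{L = i}/P{L >= i}
    if i + 1 < n:
        P[i, i + 1] = 1 - P[i, 0]                # P_{i,i+1} = 1 - P_{i,1}

pi = surv / EL                                   # claimed stationary probabilities
print(np.allclose(pi @ P, pi))                   # True
```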

4.9 Markov Chain Monte Carlo Methods

Let $X$ be a discrete random vector whose set of possible values is $x_j, j \geq 1$. Let the probability mass function of $X$ be given by $P\{X = x_j\}, j \geq 1$, and suppose that we are interested in calculating

$$\theta = E[h(X)] = \sum_{j=1}^{\infty} h(x_j) P\{X = x_j\}$$

for some specified function $h$. In situations where it is computationally difficult to evaluate the function $h(x_j), j \geq 1$, we often turn to simulation to approximate $\theta$. The usual approach, called Monte Carlo simulation, is to use random numbers to generate a partial sequence of independent and identically distributed random vectors $X_1, X_2, \ldots, X_n$ having the mass function $P\{X = x_j\}$, $j \geq 1$ (see Chapter 11 for a discussion as to how this can be accomplished). Since the strong law of large numbers yields

$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} h(X_i)}{n} = \theta \tag{4.30}$$

it follows that we can estimate $\theta$ by letting $n$ be large and using the average of the values of $h(X_i), i = 1, \ldots, n$ as the estimator.
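In code, the plain Monte Carlo estimator of (4.30) is just a sample average. A minimal sketch, with a hypothetical choice of $h$ and of the distribution of $X$:

```python
import random

# Plain Monte Carlo: estimate theta = E[h(X)] by a sample average, here for the
# hypothetical case of X uniform on {1, ..., 10} and h(x) = x^2.
def h(x):
    return x * x

n = 100_000
theta_hat = sum(h(random.randint(1, 10)) for _ in range(n)) / n
print(theta_hat)        # close to E[X^2] = 38.5
```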

It often, however, turns out that it is difficult to generate a random vector having the specified probability mass function, particularly if $X$ is a vector of dependent random variables. In addition, its probability mass function is sometimes given in the form $P\{X = x_j\} = C b_j, j \geq 1$, where the $b_j$ are specified, but $C$ must be computed, and in many applications it is not computationally feasible to sum the $b_j$ so as to determine $C$. Fortunately, however, there is another way of using simulation to estimate $\theta$ in these situations. It works by generating a sequence, not of independent random vectors, but of the successive states of a vector-valued Markov chain $X_1, X_2, \ldots$ whose stationary probabilities are $P\{X = x_j\}$, $j \geq 1$. If this can be accomplished, then it would follow from Proposition 4.7 that Equation (4.30) remains valid, implying that we can then use $\sum_{i=1}^{n} h(X_i)/n$ as an estimator of $\theta$.

We now show how to generate a Markov chain with arbitrary stationary probabilities that may only be specified up to a multiplicative constant. Let $b(j), j = 1, 2, \ldots$ be positive numbers whose sum $B = \sum_{j=1}^{\infty} b(j)$ is finite. The following, known as the Hastings–Metropolis algorithm, can be used to generate a time reversible Markov chain whose stationary probabilities are

$$\pi(j) = b(j)/B, \quad j = 1, 2, \ldots$$

To begin, let $Q$ be any specified irreducible Markov transition probability matrix on the integers, with $q(i, j)$ representing the row $i$, column $j$ element of $Q$. Now define a Markov chain $\{X_n, n \geq 0\}$ as follows. When $X_n = i$, generate a random variable $Y$ such that $P\{Y = j\} = q(i, j), j = 1, 2, \ldots$. If $Y = j$, then set $X_{n+1}$ equal to $j$ with probability $\alpha(i, j)$, and set it equal to $i$ with probability $1 - \alpha(i, j)$. Under these conditions, it is easy to see that the sequence of states constitutes a Markov chain with transition probabilities $P_{i,j}$ given by

$$
\begin{aligned}
P_{i,j} &= q(i,j)\,\alpha(i,j), \quad \text{if } j \ne i, \\
P_{i,i} &= q(i,i) + \sum_{k \ne i} q(i,k)\bigl(1 - \alpha(i,k)\bigr)
\end{aligned}
$$

This Markov chain will be time reversible and have stationary probabilities $\pi(j)$ if

$$\pi(i) P_{i,j} = \pi(j) P_{j,i} \quad \text{for } j \ne i$$

which is equivalent to

$$\pi(i)\, q(i,j)\, \alpha(i,j) = \pi(j)\, q(j,i)\, \alpha(j,i) \tag{4.31}$$

But if we take $\pi(j) = b(j)/B$ and set

$$\alpha(i,j) = \min\left(\frac{\pi(j)\, q(j,i)}{\pi(i)\, q(i,j)},\ 1\right) \tag{4.32}$$

then Equation (4.31) is easily seen to be satisfied. For if

$$\alpha(i,j) = \frac{\pi(j)\, q(j,i)}{\pi(i)\, q(i,j)}$$

then $\alpha(j,i) = 1$ and Equation (4.31) follows, and if $\alpha(i,j) = 1$ then

$$\alpha(j,i) = \frac{\pi(i)\, q(i,j)}{\pi(j)\, q(j,i)}$$

and again Equation (4.31) holds, thus showing that the Markov chain is time reversible with stationary probabilities $\pi(j)$. Also, since $\pi(j) = b(j)/B$, we see from (4.32) that

$$\alpha(i,j) = \min\left(\frac{b(j)\, q(j,i)}{b(i)\, q(i,j)},\ 1\right)$$

which shows that the value of $B$ is not needed to define the Markov chain, because the values $b(j)$ suffice. Also, it is almost always the case that $\pi(j), j \geq 1$ will not only be stationary probabilities but will also be limiting probabilities. (Indeed, a sufficient condition is that $P_{i,i} > 0$ for some $i$.)
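The following sketch implements the Hastings–Metropolis recipe just described for a target on the positive integers that is specified only up to the constant $B$; the function names, the proposal chain, and the target $b(j) = 1/j$ on $\{1, \ldots, 20\}$ are illustrative assumptions, not part of the text:

```python
import random

def hastings_metropolis(b, q_sample, q_prob, x0, n_steps, seed=0):
    """Hastings-Metropolis sketch.  b(j): unnormalized target weights,
    q_sample(i, rng): draws Y ~ q(i, .), q_prob(i, j): returns q(i, j).
    Returns the list of visited states."""
    rng = random.Random(seed)
    x = x0
    path = []
    for _ in range(n_steps):
        y = q_sample(x, rng)
        # Acceptance probability (4.32) with pi proportional to b.
        accept = min((b(y) * q_prob(y, x)) / (b(x) * q_prob(x, y)), 1.0)
        if rng.random() < accept:
            x = y
        path.append(x)
    return path

# Hypothetical target on {1, ..., 20}: pi(j) proportional to b(j) = 1/j,
# with a symmetric +/-1 random-walk proposal, reflected at the boundaries.
b = lambda j: 1.0 / j

def q_sample(i, rng):
    return min(max(i + rng.choice([-1, 1]), 1), 20)

def q_prob(i, j):
    return 0.5 if abs(i - j) == 1 or (i == j and i in (1, 20)) else 0.0

path = hastings_metropolis(b, q_sample, q_prob, x0=10, n_steps=200_000)
print(sum(x == 1 for x in path) / len(path))   # approx b(1) / sum_j b(j)
```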

Example 4.39

Suppose that we want to generate a uniformly distributed element in $\mathcal{S}$, the set of all permutations $(x_1, \ldots, x_n)$ of the numbers $(1, \ldots, n)$ for which $\sum_{j=1}^{n} j x_j > a$ for a given constant $a$. To utilize the Hastings–Metropolis algorithm we need to define an irreducible Markov transition probability matrix on the state space $\mathcal{S}$. To accomplish this, we first define a concept of "neighboring" elements of $\mathcal{S}$, and then construct a graph whose vertex set is $\mathcal{S}$. We start by putting an arc between each pair of neighboring elements in $\mathcal{S}$, where any two permutations in $\mathcal{S}$ are said to be neighbors if one results from an interchange of two of the positions of the other. That is, (1, 2, 3, 4) and (1, 2, 4, 3) are neighbors whereas (1, 2, 3, 4) and (1, 3, 4, 2) are not. Now, define the $q$ transition probability function as follows. With $\mathcal{N}(s)$ defined as the set of neighbors of $s$, and $|\mathcal{N}(s)|$ equal to the number of elements in the set $\mathcal{N}(s)$, let

$$q(s,t) = \frac{1}{|\mathcal{N}(s)|} \quad \text{if } t \in \mathcal{N}(s)$$

That is, the candidate next state from $s$ is equally likely to be any of its neighbors. Since the desired limiting probabilities of the Markov chain are $\pi(s) = C$, it follows that $\pi(s) = \pi(t)$, and so

$$\alpha(s,t) = \min\bigl(|\mathcal{N}(s)|/|\mathcal{N}(t)|,\ 1\bigr)$$

That is, if the present state of the Markov chain is $s$ then one of its neighbors is randomly chosen, say, $t$. If $t$ is a state with fewer neighbors than $s$ (in graph theory language, if the degree of vertex $t$ is less than that of vertex $s$), then the next state is $t$. If not, a uniform (0, 1) random number $U$ is generated and the next state is $t$ if $U < |\mathcal{N}(s)|/|\mathcal{N}(t)|$ and is $s$ otherwise. The limiting probabilities of this Markov chain are $\pi(s) = 1/|\mathcal{S}|$, where $|\mathcal{S}|$ is the (unknown) number of permutations in $\mathcal{S}$. ■
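A sketch of this sampler, assuming the neighbor relation of the example (interchange of two positions, with the result required to remain in $\mathcal{S}$) and assuming that $a$ is small enough that the identity permutation, which maximizes $\sum_j j x_j$, lies in $\mathcal{S}$:

```python
import itertools
import random

def in_S(perm, a):
    """Membership test for S: sum_j j * x_j > a (positions j are 1-indexed)."""
    return sum((j + 1) * x for j, x in enumerate(perm)) > a

def neighbors(perm, a):
    """Permutations in S obtained from perm by interchanging two positions."""
    out = []
    for i, j in itertools.combinations(range(len(perm)), 2):
        cand = list(perm)
        cand[i], cand[j] = cand[j], cand[i]
        cand = tuple(cand)
        if in_S(cand, a):
            out.append(cand)
    return out

def uniform_in_S(n, a, n_steps=20_000, seed=0):
    """Run the Hastings-Metropolis chain of Example 4.39; returns the final
    state, approximately uniform on S when n_steps is large."""
    rng = random.Random(seed)
    s = tuple(range(1, n + 1))          # identity maximizes sum_j j*x_j, so it
    assert in_S(s, a)                   # is in S whenever S is nonempty
    for _ in range(n_steps):
        t = rng.choice(neighbors(s, a))                 # candidate neighbor of s
        if rng.random() < len(neighbors(s, a)) / len(neighbors(t, a)):
            s = t                       # accept with prob min(|N(s)|/|N(t)|, 1)
    return s

print(uniform_in_S(n=5, a=50))
```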

The most widely used version of the Hastings–Metropolis algorithm is the Gibbs sampler. Let $X = (X_1, \ldots, X_n)$ be a discrete random vector with probability mass function $p(x)$ that is only specified up to a multiplicative constant, and suppose that we want to generate a random vector whose distribution is that of $X$. That is, we want to generate a random vector having mass function

$$p(x) = C g(x)$$

where $g(x)$ is known, but $C$ is not. Utilization of the Gibbs sampler assumes that for any $i$ and values $x_j, j \ne i$, we can generate a random variable $X$ having the probability mass function

$$P\{X = x\} = P\{X_i = x \mid X_j = x_j,\ j \ne i\}$$

It operates by using the Hastings–Metropolis algorithm on a Markov chain with states $x = (x_1, \ldots, x_n)$, and with transition probabilities defined as follows. Whenever the present state is $x$, a coordinate that is equally likely to be any of $1, \ldots, n$ is chosen. If coordinate $i$ is chosen, then a random variable $X$ with probability mass function $P\{X = x\} = P\{X_i = x \mid X_j = x_j,\ j \ne i\}$ is generated. If $X = x$, then the state $y = (x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_n)$ is considered as the candidate next state. In other words, with $x$ and $y$ as given, the Gibbs sampler uses the Hastings–Metropolis algorithm with

$$q(x, y) = \frac{1}{n} P\{X_i = x \mid X_j = x_j,\ j \ne i\} = \frac{p(y)}{n\, P\{X_j = x_j,\ j \ne i\}}$$

Because we want the limiting mass function to be $p$, we see from Equation (4.32) that the vector $y$ is then accepted as the new state with probability

$$\alpha(x, y) = \min\left(\frac{p(y)\, q(y,x)}{p(x)\, q(x,y)},\ 1\right) = \min\left(\frac{p(y)\, p(x)}{p(x)\, p(y)},\ 1\right) = 1$$

Hence, when utilizing the Gibbs sampler, the candidate state is always accepted as the next state of the chain.

Example 4.40

Suppose that we want to generate $n$ uniformly distributed points in the circle of radius 1 centered at the origin, conditional on the event that no two points are within a distance $d$ of each other, when the probability of this conditioning event is small. This can be accomplished by using the Gibbs sampler as follows. Start with any $n$ points $x_1, \ldots, x_n$ in the circle that have the property that no two of them are within $d$ of the other; then generate the value of $I$, equally likely to be any of the values $1, \ldots, n$. Then continually generate a random point in the circle until you obtain one that is not within $d$ of any of the other $n - 1$ points excluding $x_I$. At this point, replace $x_I$ by the generated point and then repeat the operation. After a large number of iterations of this algorithm, the set of $n$ points will approximately have the desired distribution. ■
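A sketch of this Gibbs sampler (function names and parameter values are illustrative); each iteration discards one point, chosen uniformly, and regenerates it uniformly in the circle until it is at least $d$ from the others:

```python
import math
import random

def gibbs_hard_disks(n, d, n_iters=10_000, seed=0):
    """Gibbs-sampler sketch of Example 4.40: n points uniform in the unit
    circle conditioned on every pair being at least distance d apart."""
    rng = random.Random(seed)

    def random_point():
        # Rejection sampling of a uniform point in the unit circle.
        while True:
            x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
            if x * x + y * y <= 1:
                return (x, y)

    def far_enough(p, others):
        return all(math.dist(p, q) >= d for q in others)

    # Initial configuration: keep drawing until n mutually separated points are found.
    pts = []
    while len(pts) < n:
        p = random_point()
        if far_enough(p, pts):
            pts.append(p)

    for _ in range(n_iters):
        i = rng.randrange(n)               # coordinate I chosen uniformly
        others = pts[:i] + pts[i + 1:]
        while True:                        # regenerate point i given the rest
            p = random_point()
            if far_enough(p, others):
                pts[i] = p
                break
    return pts

print(gibbs_hard_disks(n=5, d=0.3)[:2])
```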

Example 4.41

Let $X_i, i = 1, \ldots, n$, be independent exponential random variables with respective rates $\lambda_i, i = 1, \ldots, n$. Let $S = \sum_{i=1}^{n} X_i$, and suppose that we want to generate the random vector $X = (X_1, \ldots, X_n)$, conditional on the event that $S > c$ for some large positive constant $c$. That is, we want to generate the value of a random vector whose density function is

$$f(x_1, \ldots, x_n) = \frac{1}{P\{S > c\}} \prod_{i=1}^{n} \lambda_i e^{-\lambda_i x_i}, \quad x_i \geq 0,\ \sum_{i=1}^{n} x_i > c$$

This is easily accomplished by starting with an initial vector $x = (x_1, \ldots, x_n)$ satisfying $x_i > 0, i = 1, \ldots, n$, and $\sum_{i=1}^{n} x_i > c$. Then generate a random variable $I$ that is equally likely to be any of $1, \ldots, n$. Next, generate an exponential random variable $X$ with rate $\lambda_I$ conditional on the event that $X + \sum_{j \ne I} x_j > c$. This latter step, which calls for generating the value of an exponential random variable given that it exceeds $c - \sum_{j \ne I} x_j$, is easily accomplished by using the fact that an exponential conditioned to be greater than a positive constant is distributed as the constant plus the exponential. Consequently, to obtain $X$, first generate an exponential random variable $Y$ with rate $\lambda_I$, and then set

$$X = Y + \left(c - \sum_{j \ne I} x_j\right)^+$$

where $a^+ = \max(a, 0)$. The value of $x_I$ should then be reset as $X$ and a new iteration of the algorithm begun. ■
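A sketch of this procedure (names and rate values are illustrative), using the memoryless property exactly as described: generate $Y$ exponential with rate $\lambda_I$ and set $x_I = Y + (c - \sum_{j \ne I} x_j)^+$:

```python
import random

def gibbs_exponentials_given_sum(rates, c, n_iters=10_000, seed=0):
    """Gibbs-sampler sketch of Example 4.41: (X_1,...,X_n) independent
    exponentials with the given rates, conditioned on sum(X_i) > c."""
    rng = random.Random(seed)
    n = len(rates)
    # Initial vector with x_i > 0 and sum > c: spread c + 1 evenly (arbitrary choice).
    x = [(c + 1.0) / n] * n
    for _ in range(n_iters):
        i = rng.randrange(n)                  # coordinate I chosen uniformly
        rest = sum(x) - x[i]
        y = rng.expovariate(rates[i])         # Y ~ Exp(lambda_I)
        x[i] = y + max(c - rest, 0.0)         # memoryless shift: X = Y + (c - rest)^+
    return x

sample = gibbs_exponentials_given_sum([1.0, 2.0, 0.5], c=20.0)
print(sample, sum(sample) > 20.0)             # the sum always exceeds c
```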

Remark

As can be seen from Examples 4.40 and 4.41, although the theory for the Gibbs sampler was presented under the assumption that the distribution to be generated was discrete, it also holds when this distribution is continuous.

4.10 Markov Decision Processes

Consider a process that is observed at discrete time points to be in any one of $M$ possible states, which we number by $1, 2, \ldots, M$. After observing the state of the process, an action must be chosen, and we let $\mathcal{A}$, assumed finite, denote the set of all possible actions.

If the process is in state $i$ at time $n$ and action $a$ is chosen, then the next state of the system is determined according to the transition probabilities $P_{ij}(a)$. If we let $X_n$ denote the state of the process at time $n$ and $a_n$ the action chosen at time $n$, then the preceding is equivalent to stating that

$$P\{X_{n+1} = j \mid X_0, a_0, X_1, a_1, \ldots, X_n = i, a_n = a\} = P_{ij}(a)$$

Thus, the transition probabilities are functions only of the present state and the subsequent action.

By a policy, we mean a rule for choosing actions. We shall restrict ourselves to policies that are of the form that the action they prescribe at any time depends only on the state of the process at that time (and not on any information concerning prior states and actions). However, we shall allow the policy to be "randomized" in that its instructions may be to choose actions according to a probability distribution. In other words, a policy $\beta$ is a set of numbers $\beta = \{\beta_i(a), a \in \mathcal{A}, i = 1, \ldots, M\}$ with the interpretation that if the process is in state $i$, then action $a$ is to be chosen with probability $\beta_i(a)$. Of course, we need have

$$
\begin{aligned}
0 \leq \beta_i(a) \leq 1, \quad &\text{for all } i, a, \\
\sum_a \beta_i(a) = 1, \quad &\text{for all } i
\end{aligned}
$$

Under any given policy $\beta$, the sequence of states $\{X_n, n = 0, 1, \ldots\}$ constitutes a Markov chain with transition probabilities $P_{ij}(\beta)$ given by

$$P_{ij}(\beta) = P_\beta\{X_{n+1} = j \mid X_n = i\} = \sum_a P_{ij}(a)\,\beta_i(a)$$

where the last equality follows by conditioning on the action chosen when in state $i$. Let us suppose that for every choice of a policy $\beta$, the resultant Markov chain $\{X_n, n = 0, 1, \ldots\}$ is ergodic.
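For a concrete picture of the induced chain, the following sketch (with a hypothetical two-state, two-action model) computes $P_{ij}(\beta) = \sum_a P_{ij}(a)\beta_i(a)$:

```python
import numpy as np

# Hypothetical two-state, two-action model: P[a] is the transition matrix under action a.
P = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]]),
     1: np.array([[0.2, 0.8],
                  [0.9, 0.1]])}

# Randomized policy: beta[i, a] = probability of choosing action a in state i.
beta = np.array([[0.5, 0.5],
                 [0.1, 0.9]])

# Induced chain: P_ij(beta) = sum_a P_ij(a) * beta_i(a)
P_beta = sum(beta[:, a][:, None] * P[a] for a in P)
print(P_beta)
print(P_beta.sum(axis=1))   # each row sums to 1
```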

For any policy $\beta$, let $\pi_{ia}$ denote the limiting (or steady-state) probability that the process will be in state $i$ and action $a$ will be chosen if policy $\beta$ is employed. That is,

$$\pi_{ia} = \lim_{n \to \infty} P_\beta\{X_n = i, a_n = a\}$$

The vector $\pi = (\pi_{ia})$ must satisfy

$$
\begin{aligned}
\text{(i)} \quad & \pi_{ia} \geq 0 \quad \text{for all } i, a, \\
\text{(ii)} \quad & \sum_i \sum_a \pi_{ia} = 1, \\
\text{(iii)} \quad & \sum_a \pi_{ja} = \sum_i \sum_a \pi_{ia} P_{ij}(a) \quad \text{for all } j
\end{aligned}
\tag{4.33}
$$

Equations (i) and (ii) are obvious, and Equation (iii), which is an analogue of Theorem 4.1, follows as the left-hand side equals the steady-state probability of being in state $j$ and the right-hand side is the same probability computed by conditioning on the state and action chosen one stage earlier.

Thus for any policy $\beta$, there is a vector $\pi = (\pi_{ia})$ that satisfies (i)–(iii) and with the interpretation that $\pi_{ia}$ is equal to the steady-state probability of being in state $i$ and choosing action $a$ when policy $\beta$ is employed. Moreover, it turns out that the reverse is also true. Namely, for any vector $\pi = (\pi_{ia})$ that satisfies (i)–(iii), there exists a policy $\beta$ such that if $\beta$ is used, then the steady-state probability of being in $i$ and choosing action $a$ equals $\pi_{ia}$. To verify this last statement, suppose that $\pi = (\pi_{ia})$ is a vector that satisfies (i)–(iii). Then, let the policy $\beta = (\beta_i(a))$ be

$$\beta_i(a) = P\{\beta \text{ chooses } a \mid \text{state is } i\} = \frac{\pi_{ia}}{\sum_a \pi_{ia}}$$

Now let $P_{ia}$ denote the limiting probability of being in $i$ and choosing $a$ when policy $\beta$ is employed. We need to show that $P_{ia} = \pi_{ia}$. To do so, first note that $\{P_{ia}, i = 1, \ldots, M, a \in \mathcal{A}\}$ are the limiting probabilities of the two-dimensional Markov chain $\{(X_n, a_n), n \geq 0\}$. Hence, by the fundamental Theorem 4.1, they are the unique solution of

$$
\begin{aligned}
\text{(i$'$)} \quad & P_{ia} \geq 0, \\
\text{(ii$'$)} \quad & \sum_i \sum_a P_{ia} = 1, \\
\text{(iii$'$)} \quad & P_{ja} = \sum_i \sum_{a'} P_{ia'} P_{ij}(a')\, \beta_j(a)
\end{aligned}
$$

where (iii$'$) follows since

$$P\{X_{n+1} = j, a_{n+1} = a \mid X_n = i, a_n = a'\} = P_{ij}(a')\,\beta_j(a)$$

Because

$$\beta_j(a) = \frac{\pi_{ja}}{\sum_a \pi_{ja}}$$

we see that $(P_{ia})$ is the unique solution of

$$P_{ia} \geq 0, \qquad \sum_i \sum_a P_{ia} = 1, \qquad P_{ja} = \sum_i \sum_{a'} P_{ia'} P_{ij}(a') \frac{\pi_{ja}}{\sum_a \pi_{ja}}$$

Hence, to show that $P_{ia} = \pi_{ia}$, we need show that

$$\pi_{ia} \geq 0, \qquad \sum_i \sum_a \pi_{ia} = 1, \qquad \pi_{ja} = \sum_i \sum_{a'} \pi_{ia'} P_{ij}(a') \frac{\pi_{ja}}{\sum_a \pi_{ja}}$$

The top two equations follow from (i) and (ii) of Equation (4.33), and the third, which is equivalent to

$$\sum_a \pi_{ja} = \sum_i \sum_a \pi_{ia} P_{ij}(a)$$

follows from condition (iii) of Equation (4.33).

Thus we have shown that a vector $\pi = (\pi_{ia})$ will satisfy (i), (ii), and (iii) of Equation (4.33) if and only if there exists a policy $\beta$ such that $\pi_{ia}$ is equal to the steady-state probability of being in state $i$ and choosing action $a$ when $\beta$ is used. In fact, the policy $\beta$ is defined by $\beta_i(a) = \pi_{ia}/\sum_a \pi_{ia}$.

The preceding is quite important in the determination of "optimal" policies. For instance, suppose that a reward $R(i, a)$ is earned whenever action $a$ is chosen in state $i$. Since $R(X_i, a_i)$ would then represent the reward earned at time $i$, the expected average reward per unit time under policy $\beta$ can be expressed as

$$\text{expected average reward under } \beta = \lim_{n \to \infty} E_\beta\!\left[\frac{\sum_{i=1}^{n} R(X_i, a_i)}{n}\right]$$

Now, if $\pi_{ia}$ denotes the steady-state probability of being in state $i$ and choosing action $a$, it follows that the limiting expected reward at time $n$ equals

$$\lim_{n \to \infty} E[R(X_n, a_n)] = \sum_i \sum_a \pi_{ia} R(i, a)$$

which implies that

$$\text{expected average reward under } \beta = \sum_i \sum_a \pi_{ia} R(i, a)$$

Hence, the problem of determining the policy that maximizes the expected average reward is

$$
\begin{aligned}
\underset{\pi = (\pi_{ia})}{\text{maximize}} \quad & \sum_i \sum_a \pi_{ia} R(i, a) \\
\text{subject to} \quad & \pi_{ia} \geq 0 \quad \text{for all } i, a, \\
& \sum_i \sum_a \pi_{ia} = 1, \\
& \sum_a \pi_{ja} = \sum_i \sum_a \pi_{ia} P_{ij}(a) \quad \text{for all } j
\end{aligned}
\tag{4.34}
$$
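Problem (4.34) is a linear program in the variables $\pi_{ia}$, so it can be solved with any LP solver. The following sketch (hypothetical two-state, two-action data, using scipy.optimize.linprog) maximizes $\sum_{i,a}\pi_{ia}R(i,a)$ subject to the constraints above and then recovers the policy via $\beta_i(a) = \pi_{ia}/\sum_a \pi_{ia}$:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-state, two-action MDP: P[a][i, j] and rewards R[i, a].
P = {0: np.array([[0.7, 0.3], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.9, 0.1]])}
R = np.array([[1.0, 3.0],
              [2.0, 0.5]])
M, A = R.shape

# Variables pi_{ia}, flattened with index i * A + a; linprog minimizes, so negate R.
c = -R.flatten()

A_eq = [np.ones(M * A)]                  # normalization: sum_{i,a} pi_{ia} = 1
b_eq = [1.0]
for j in range(M):                       # balance: sum_a pi_{ja} = sum_{i,a} pi_{ia} P_ij(a)
    row = np.zeros(M * A)
    for a in range(A):
        row[j * A + a] += 1.0
        for i in range(M):
            row[i * A + a] -= P[a][i, j]
    A_eq.append(row)
    b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * (M * A))
pi = res.x.reshape(M, A)
beta = pi / pi.sum(axis=1, keepdims=True)    # beta_i(a) = pi_{ia} / sum_a pi_{ia}
print(pi)
print(beta)
print(-res.fun)                              # maximal expected average reward
```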