Chapter 2. MATLAB Procedures for Markov Decision Processes

2.1. A Reorder Problem -- Electronic Store*

A reorder problem

Example 2.1. 

An electronic store stocks a certain type of DVD player. At the end of each week, an order is placed for early delivery the following Monday. A maximum of four units is stocked. Let the states be the number of units on hand at the end of the sales week: 0, 1, 2, 3, or 4. Two possible ordering actions:

  • Order two units, at a cost of $150 each

  • Order four units, at a cost of $120 each

Units sell for $200. If demand exceeds the stock in hand, the retailer assumes a penalty of $40 per unit (in losses due to customer dissatisfaction, etc.). Because of turnover, return on sales is considered two percent per week, so that discount is α = 1 / 1.02 on a weekly basis.

In state 0, there are three possible actions: order 0, 2, or 4. In states 1 and 2 there are two possible actions: order 0 or 2. In states 3 and 4, the only action is to order 0. Customer demand in week n + 1 is represented by a random variable D_{n+1}. The class {D_n : 1 ≤ n} is iid, uniformly distributed on the values 0, 1, 2, 3, 4. If X_n is the state at the end of week n, then the pair {X_n, D_{n+1}} is independent for each n.

Analyze the system as a Markov decision process with type 3 gains, depending upon current state, action, and demand. Determine the transition probability matrix PA (properly padded) and the gain matrix GA (also padded). Sample calculations are as follows:

  • State 0, action 0: p_00(0) = 1 (all other p_0k(0) = 0)

  • State 0, action 2: p_00(2) = P(D ≥ 2) = 3/5, p_01(2) = P(D = 1) = 1/5, etc.

  • State 2, action 2: p_2j(2) = 1/5 for j = 0, 1, 2, 3, 4

For state = i, action = a, and demand = k, we seek the gain g(i, a, k).
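
For example, take state i = 0, action a = 2 (two units ordered at $150 each, a cost of $300), and demand k = 3. Two units are sold at $200 and one unit of demand goes unsatisfied, so

g(0, 2, 3) = -300 + 2(200) - 1(40) = 60

in agreement with the entry for demand 3 in the row of the gain matrix GA (below) for state 0, action 2.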

  1. Complete the transition probability table and the gain table.

  2. Determine an optimum infinite-horizon strategy with no discounting.

  3. Determine an optimum infinite-horizon strategy with discounting (α = 1/1.02).

  4. The manager decides to set up a six-week strategy, after which new sales conditions may be established. Determine an optimum strategy for the six-week period.

Data file
% file orderdata.m
% Version of 4/5/94
% Data organized for computation
type = 3;
states = 0:4;
A  = [0  2  4 ...  % Actions (padded)
 0  2 02 ...
 0  2 02 ...
 0 00 00 ...
 0 00 00];
C  = [0 -300 -480 ... % Order costs (padded)
0 -300 -300 ...
0 -300 -300 ...
0  0  0 ...
0  0  0];
SP = 200; % Selling price
BP = 40;  % Backorder penalty
PD = 0.2*ones(1,5);  % Demand probabilities

Transition Probabilities and Gains

The procedure
% file reorder.m
% Version of 4/11/94
% Calculates PA and GA for reorder policy
states = input('Enter row vector of states ');
A = input('Enter row vector A of actions (padded)  ');
C = input('Enter row vector C of order costs (padded) ');
D = input('Enter row vector D of demand values ');
PD = input('Enter row vector PD of demand probabilities ');
SP = input('Enter unit selling price SP ');
BP = input('Enter backorder penalty cost BP ');
m = length(states');
q = length(A);
na = q/m;
N = length(D);
S = ones(na,1)*states;
S = S(:)';
[d,s] = meshgrid(D,S);
a = A'*ones(1,N);
ca = C'*ones(1,N);
TA = (s + a - d).*(s + a - d >= 0);
for i = 1:q
PA(i,:) = tdbn(states,TA(i,:),PD);
end
PA
GA = ca + SP*d - (SP + BP)*(d -s -a).*(d > s+a)

The calculations
orderdata
reorder
Enter row vector of states states
Enter row vector A of actions (padded) A
Enter row vector C of order costs (padded) C
Enter row vector D of demand values D
Enter row vector PD of demand probabilities PD
Enter unit selling price SP SP
Enter backorder penalty cost BP BP
PA =
1.0000  0       0       0       0
0.6000  0.2000  0.2000  0       0
0.2000  0.2000  0.2000  0.2000  0.2000
0.8000  0.2000  0       0       0
0.4000  0.2000  0.2000  0.2000  0
0.4000  0.2000  0.2000  0.2000  0
0.6000  0.2000  0.2000  0       0
0.2000  0.2000  0.2000  0.2000  0.2000
0.2000  0.2000  0.2000  0.2000  0.2000
0.4000  0.2000  0.2000  0.2000  0
0.4000  0.2000  0.2000  0.2000  0
0.4000  0.2000  0.2000  0.2000  0
0.2000  0.2000  0.2000  0.2000  0.2000
0.2000  0.2000  0.2000  0.2000  0.2000
0.2000  0.2000  0.2000  0.2000  0.2000
GA =
 0 -40 -80 -120 -160
 -300 -100 100 60 20
 -480 -280 -80 120 320
 0 200 160  120  80
-300 -100 100 300 260
-300 -100 100 300 260
0     200 400 360 320
-300 -100 100 300 500
-300 -100 100 300 500
0     200 400 600 560
0     200 400 600 560
0     200 400 600 560
0     200 400 600 800
0     200 400 600 800
0     200 400 600 800

Infinite-horizon strategy (no discounting)
polit
Data needed:
- - - - - - - - - - - - - - -
Enter type number to show gain type  type
Enter row vector of states states
Enter row vector A of possible actions A
Enter value of alpha (= 1 for no discounting) 1
Enter matrix PA of transition probabilities PA
Enter matrix GA of gains GA
Enter row vector PD of demand probabilities PD
Index Action Value
1 0 -80
2 2 -44
3 4 -80
4 0 112
5 2 52
6 2 52
7 0 256
8 2 100
9 2 100
10 0 352
11 0 352
12 0 352
13 0 400
14 0 400
15 0 400
Initial policy: action numbers
2 1 1 1 1
Policy: actions
2 0 0 0 0
New policy: action numbers
3 2 2 1 1
Policy: actions
4 2 2 0 0
 
Long-run distribution
0.2800 0.2000 0.2000 0.2000 0.1200
Test values for selecting new policy
Index Action Test Value
1.0000    0         -248.0000
2.0000    2.0000    -168.8000
3.0000    4.0000    -41.6000
4.0000    0         -48.8000
5.0000    2.0000    -5.6000
6.0000    2.0000    -5.6000
7.0000    0         131.2000
8.0000    2.0000    138.4000
9.0000    2.0000    138.4000
10.0000   0         294.4000
11.0000   0         294.4000
12.0000   0         294.4000
13.0000   0         438.4000
14.0000   0         438.4000
15.0000   0         438.4000
Optimum policy
State Action Value
0         4.0000    -168.0000
1.0000    2.0000    -132.0000
2.0000    2.0000    12.0000
3.0000    0         168.0000
4.0000    0         312.0000
Long-run expected gain per period G
126.4000

Infinite-horizon strategy (with discounting)
polit
Data needed:
- - - - - - - - - - - - - - -
Enter type number to show gain type  type
Enter row vector of states states
Enter row vector A of possible actions A
Enter value of alpha (= 1 for no discounting) 1/1.02
Enter matrix PA of transition probabilities PA
Enter matrix GA of gains GA
Enter row vector PD of demand probabilities PD
Index Action Value
1 0 -80
2 2 -44
3 4 -80
4 0 112
5 2 52
6 2 52
7 0 256
8 2 100
9 2 100
10 0 352
11 0 352
12 0 352
13 0 400
14 0 400
15 0 400
Initial policy: action numbers
2 1 1 1 1
Policy: actions
2 0 0 0 0
 
New policy: action numbers
3 2 2 1 1
Policy: actions
4 2 2 0 0
Test values for selecting policy
Index Action Test Value
1.0e+03 *
0.0010 0  6.0746
0.0020 0.0020 6.1533
0.0030 0.0040 6.2776
0.0040 0      6.2740
0.0050 0.0020 6.3155
0.0060 0.0020 6.3155
0.0070 0      6.4533
0.0080 0.0020 6.4576
0.0090 0.0020 6.4576
0.0100 0      6.6155
0.0110 0      6.6155
0.0120 0      6.6155
0.0130 0      6.7576
0.0140 0      6.7576
0.0150 0      6.7576
Optimum policy
State Action Value
1.0e+03 *
0      0.0040 6.2776
0.0010 0.0020 6.3155
0.0020 0.0020 6.4576
0.0030 0      6.6155
0.0040 0      6.7576

Finite-horizon calculations
dpinit
Initialize for finite horizon calculations
Matrices A, PA, and GA, padded if necessary
Enter type number to show gain type  type
Enter vector of states states
Enter row vector A of possible actions A
Enter matrix PA of transition probabilities PA
Enter matrix GA of gains GA
Enter row vector PD of demand probabilities  PD
Call for dprog
dprog
States and expected total gains
0    1    2    3    4
-44  112  256  352  400
States Actions
0  2
1  0
2  0
3  0
4  0
dprog
States and expected total gains
0          1.0000    2.0000    3.0000   4.0000
135.2000   178.4000  315.2000  478.4000 615.2000
States Actions
0  4
1  2
2  2
3  0
4  0
dprog
States and expected total gains
0          1.0000     2.0000      3.0000     4.0000
264.4800   300.4800   444.4800    600.4800   744.4800
States Actions
0  4
1  2
2  2
3  0
4  0
dprog
States and expected total gains
0  1.0000  2.0000  3.0000  4.0000
390.8800 426.8800 570.8800 726.8800  870.8800
States  Actions
    0     4
    1     2
    2     2
    3     0
    4     0
dprog
States and expected total gains
        0    1.0000    2.0000    3.0000    4.0000
 517.2800  553.2800  697.2800  853.2800  997.2800
States  Actions
    0     4
    1     2
    2     2
    3     0
    4     0
dprog
States and expected total gains
  1.0e+03 *
        0    0.0010    0.0020    0.0030    0.0040
   0.6437    0.6797    0.8237    0.9797    1.1237
States  Actions
    0     4
    1     2
    2     2
    3     0
    4     0

2.2. Markov Decision -- Type 3 Gains*

Example 2.2. 

An electronic store stocks a certain type of VCR. At the end of each week, an order is placed for early delivery the following Monday. A maximum of four units is stocked. Let the states be the number of units on hand at the end of the sales week: 0, 1, 2, 3, or 4. Two possible ordering actions:

  • Order two units, at a cost of $150 each

  • Order four units, at a cost of $120 each

Units sell for $200. If demand exceeds the stock in hand, the retailer assumes a penalty of $40 per unit (in losses due to customer dissatisfaction, etc.). Because of turnover, return on sales is considered two percent per week, so that discount is α = 1 / 1.02 on a weekly basis.

In state 0, there are three possible actions: order 0, 2, or 4. In states 1 and 2 there are two possible actions: order 0 or 2. In states 3 and 4, the only action is to order 0. Customer demand in week n + 1 is represented by a random variable D_{n+1}. The class {D_n : 1 ≤ n} is iid, uniformly distributed on the values 0, 1, 2, 3, 4. If X_n is the state at the end of week n, then the pair {X_n, D_{n+1}} is independent for each n.

Analyze the system as a Markov decision process with type 3 gains, depending upon current state, action, and demand. Determine the transition probability matrix PA (properly padded) and the gain matrix GA (also padded). Sample calculations are as follows:

  • State 0, action 0: p_00(0) = 1 (all other p_0k(0) = 0)

  • State 0, action 2: p_00(2) = P(D ≥ 2) = 3/5, p_01(2) = P(D = 1) = 1/5, etc.

  • State 2, action 2: p_2j(2) = 1/5 for j = 0, 1, 2, 3, 4
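
As a quick numerical check on entries of this kind, the following short MATLAB sketch (a check separate from the procedures below; it assumes the uniform demand probabilities PD of orderdata.m) computes the next-state distribution for state 0, action 2:

% Check of p_0j(2): stock after ordering is s + a = 2
PD = 0.2*ones(1,5);              % demand probabilities, as in orderdata.m
s = 0; a = 2;
D = 0:4;                         % possible demand values
T = max(s + a - D, 0);           % next state for each demand value
p = zeros(1,5);
for j = 0:4
  p(j+1) = sum(PD(T == j));      % P(next state = j)
end
disp(p)                          % 0.6  0.2  0.2  0  0, as in the calculation above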

For state = i, action = a, and demand = k, we seek the gain g(i, a, k).

  1. Complete the transition probability table and the gain table.

  2. Determine an optimum infinite-horizon strategy with no discounting.

  3. Determine an optimum infinite-horizon strategy with discounting (α = 1/1.02).

  4. The manager decides to set up a six-week strategy, after which new sales conditions may be established. Determine an optimum strategy for the six-week period.

Data file

% file orderdata.m

    % Version of 4/5/94

    % Data organized for computation

    type = 3;

    states = 0:4;

    A   = [0   2   4 ...                   % Actions (padded)

           0   2  02 ...

           0   2  02 ...

           0  00  00 ...

           0  00  00];

    C   = [0 -300 -480 ...           % Order costs (padded)

           0 -300 -300 ...

           0 -300 -300 ...

           0    0    0 ...

           0    0    0];

    SP = 200;                                   % Selling price

    BP = 40;                                     % Backorder penalty

    PD = 0.2*ones(1,5);               % Demand probabilities

Transition Probabilities and Gains

The procedure

 % file reorder.m
% Version of 4/11/94
% Calculates PA and GA for reorder policy
states = input('Enter row vector of states   ');
A   = input('Enter row vector A of actions (padded)   ');
C   = input('Enter row vector C of order costs (padded)   ');
D   = input('Enter row vector D of demand values   ');
PD = input('Enter row vector PD of demand probabilities     ');
SP = input('Enter unit selling price SP   ');
BP = input('Enter backorder penalty cost BP   ');
m   = length(states');
q   = length(A);
na = q/m;
N   = length(D);
S   = ones(na,1)*states;
S   = S(:)';
[d,s] = meshgrid(D,S);
a   = A'*ones(1,N);
ca = C'*ones(1,N);
TA = (s + a - d).*(s + a - d >= 0);
for i = 1:q
PA(i,:) = tdbn(states,TA(i,:),PD);
end
PA
GA = ca + SP*d - (SP + BP)*(d -s -a).*(d > s+a)

The calculations

orderdata
 reorder
Enter row vector of states   states
Enter row vector A of actions (padded)   A
Enter row vector C of order costs (padded)   C
Enter row vector D of demand values   D
Enter row vector PD of demand probabilities     PD
Enter unit selling price SP   SP
Enter backorder penalty cost BP   BP
PA =
      1.0000            0            0            0            0
      0.6000       0.2000       0.2000            0            0
      0.2000       0.2000       0.2000       0.2000       0.2000
      0.8000       0.2000            0            0            0
      0.4000       0.2000       0.2000       0.2000            0
      0.4000       0.2000       0.2000       0.2000            0
      0.6000       0.2000       0.2000            0            0
      0.2000       0.2000       0.2000       0.2000       0.2000
      0.2000       0.2000       0.2000       0.2000       0.2000
      0.4000       0.2000       0.2000       0.2000            0
      0.4000       0.2000       0.2000       0.2000            0
      0.4000       0.2000       0.2000       0.2000            0
      0.2000       0.2000       0.2000       0.2000       0.2000
      0.2000       0.2000       0.2000       0.2000       0.2000
      0.2000       0.2000       0.2000       0.2000       0.2000
GA =
           0          -40          -80         -120         -160
        -300         -100          100           60           20
        -480         -280          -80          120          320
           0          200          160          120           80
        -300         -100          100          300          260
        -300         -100          100          300          260
           0          200          400          360          320
        -300         -100          100          300          500
        -300         -100          100          300          500
           0          200          400          600          560
           0          200          400          600          560
           0          200          400          600          560
           0          200          400          600          800
           0          200          400          600          800
           0          200          400          600          800
    

Infinite-horizon strategy (no discounting)

polit
Data needed:

- - - - - - - - - - - - - - -

Enter type number to show gain type   type
Enter row vector of states   states
Enter row vector A of possible actions   A
Enter value of alpha (= 1 for no discounting)   1
Enter matrix PA of transition probabilities   PA
Enter matrix GA of gains   GA
Enter row vector PD of demand probabilities   PD
Index    Action   Value
    1         0     -80
    2         2     -44
    3         4     -80
    4         0     112
    5         2      52
    6         2      52
    7         0     256
    8         2     100
    9         2     100
   10         0     352
   11         0     352
   12         0     352
   13         0     400
   14         0     400
   15         0     400
Initial policy: action numbers
        2         1         1         1         1
Policy: actions
        2         0         0         0         0

New policy: action numbers
        3         2         2         1         1
Policy: actions
        4         2         2         0         0
Long-run distribution
      0.2800       0.2000       0.2000       0.2000       0.1200
Test values for selecting new policy
      Index         Action   Test Value
      1.0000             0    -248.0000
      2.0000        2.0000    -168.8000
      3.0000        4.0000     -41.6000
      4.0000             0     -48.8000
      5.0000        2.0000      -5.6000
      6.0000        2.0000      -5.6000
      7.0000             0     131.2000
      8.0000        2.0000     138.4000
      9.0000        2.0000     138.4000
     10.0000             0     294.4000
     11.0000             0     294.4000
     12.0000             0     294.4000
     13.0000             0     438.4000
     14.0000             0     438.4000
     15.0000             0     438.4000
    Optimum policy
      State         Action        Value
           0        4.0000    -168.0000
      1.0000        2.0000    -132.0000
      2.0000        2.0000      12.0000
      3.0000             0     168.0000
      4.0000             0     312.0000
Long-run expected gain per period G
  126.4000
    

Infinite-horizon strategy (with discounting)

polit
Data needed:
- - - - - - - - - - - - - - -
Enter type number to show gain type   type
Enter row vector of states   states
Enter row vector A of possible actions   A
Enter value of alpha (= 1 for no discounting)   1/1.02
Enter matrix PA of transition probabilities   PA
Enter matrix GA of gains   GA
Enter row vector PD of demand probabilities   PD
 Index    Action     Value
     1         0      -80
     2         2      -44
     3         4      -80
     4         0      112
     5         2       52
     6         2       52
     7         0      256
     8         2      100
     9         2      100
     10         0     352
     11         0     352
     12         0     352
     13         0     400
     14         0     400
     15         0     400
Initial policy: action numbers
        2         1         1         1         1
Policy: actions
        2         0         0         0         0
New policy: action numbers
        3         2         2         1         1
Policy: actions
        4         2         2         0         0
Test values for selecting policy
Index         Action     Test Value
1.0e+03 *
0.0010             0         6.0746
0.0020        0.0020         6.1533
0.0030        0.0040         6.2776
0.0040             0         6.2740
0.0050        0.0020         6.3155
0.0060        0.0020         6.3155
0.0070             0         6.4533
0.0080        0.0020         6.4576
0.0090        0.0020         6.4576
0.0100             0         6.6155
0.0110             0         6.6155
0.0120             0         6.6155
0.0130             0         6.7576
0.0140             0         6.7576
0.0150             0         6.7576
Optimum policy
State       Action       Value
1.0e+03 * 
     0       0.0040       6.2776
0.0010       0.0020       6.3155
0.0020       0.0020       6.4576
0.0030            0       6.6155
0.0040            0       6.7576

Finite-horizon calculations

dpinit
Initialize for finite horizon calculations
Matrices A, PA, and GA, padded if necessary
Enter type number to show gain type   type
Enter vector of states   states
Enter row vector A of possible actions   A
Enter matrix PA of transition probabilities   PA
Enter matrix GA of gains   GA
Enter row vector PD of demand probabilities   PD
Call for dprog
dprog
States and expected total gains
      0       1       2       3       4
    -44     112     256     352     400
States   Actions
     0         2
     1         0
     2         0
     3         0
     4         0
dprog
States and expected total gains
         0     1.0000     2.0000     3.0000     4.0000
  135.2000   178.4000   315.2000   478.4000   615.2000
States   Actions
     0         4
     1         2
     2         2
     3         0
     4         0
dprog
States and expected total gains
         0     1.0000     2.0000     3.0000     4.0000
  264.4800   300.4800   444.4800   600.4800   744.4800
States   Actions
     0         4
     1         2
     2         2
     3         0
     4         0
dprog
States and expected total gains
         0     1.0000     2.0000     3.0000     4.0000
  390.8800   426.8800   570.8800   726.8800   870.8800
States   Actions
     0         4
     1         2
     2         2
     3         0
     4         0
 dprog
States and expected total gains
         0     1.0000     2.0000     3.0000     4.0000
  517.2800   553.2800   697.2800   853.2800   997.2800
States   Actions
     0         4
     1         2
     2         2
     3         0
     4         0
dprog
States and expected total gains
    1.0e+03 *
           0       0.0010       0.0020       0.0030       0.0040
      0.6437       0.6797       0.8237       0.9797       1.1237
States   Actions
     0         4
     1         2
     2         2
     3         0
     4         0

2.3. MATLAB Calculations for Decision Models*

Data

There are three types. In all types, we need the row vector A of actions, the row vector PH with PH(i) = P(H = u_i), the row vector X of test random variable values, and the matrix PXH with PXH(i,j) = P(X = x_j|H = u_i).

Type 1:  The usual type.  In addition to the above, we need the loss matrix L, with L(a,k) the loss when action number a is taken and the outcome is Y = y_k, and the matrix PYH with PYH(i,k) = P(Y = y_k|H = u_i).
Type 2:  The matrix RH = [r(a,i)] = [E[L(a,Y)|H = u_i]] is given.  L and PYH are not needed.
Type 3:  Sometimes Y = H.  In this case RH = L, which we need, in addition to the above.

Calculated quantities

  1. RH(a,i) = E[L(a,Y)|H = u_i] = Σ_k L(a,k)P(Y = y_k|H = u_i)   [Risk function = expected loss, given H]        MATLAB:  RH = L*PYH'

  2. PX(j) = P(X = x_j) = Σ_i P(H = u_i)P(X = x_j|H = u_i)        MATLAB:  PX = PH*PXH

  3. PHX(j,i) = P(H = u_i|X = x_j) = P(X = x_j|H = u_i)P(H = u_i)/P(X = x_j)        MATLAB:  [a,b] = meshgrid(PH,PX);  PHX = PXH'.*a./b

  4. RX(a,j) = E[L(a,Y)|X = x_j] = Σ_i RH(a,i)P(H = u_i|X = x_j)   [Expected risk, given X]        MATLAB:  RX = RH*PHX'

  5. Select d* from RX:  d*(j) is the action a (row number) for minimum expected loss, given X = x_j. Set RD(j) = RX(d*(j),j) = min_a RX(a,j).        MATLAB:  [RD D] = min(RX)

  6. Calculate the Bayesian risk BD for d*:  BD = Σ_j RD(j)P(X = x_j).        MATLAB:  BD = RD*PX'

Note

Actions are represented in calculations by action number (position in the matrix). In some cases, each action has a value other than its position number. The actual values can be presented in the final display.

file dec.m
% file dec.m
% Version of 12/12/95
disp('Decision process with experimentation')
disp('There are three types, according to the data provided.')
disp('In all types, we need the row vector A of actions,')
disp('the row vector PH with PH(i) = P(H = u_i),')
disp('the row vector X of test random variable values, and')
disp('the matrix PXH with PXH(i,j) = P(X = x_j|H = u_i).')
disp('Type 1.  Loss matrix L of L(a,k)')
disp('         Matrix PYH with PYH(i,k) = P(Y = y_k|H = u_i)')
disp('Type 2.  Matrix RH of r(a,i) = E[L(a,Y)|H = u_i].')
disp('         L and PYH are not needed for this type.')
disp('Type 3.  Y = H, so that only RH = L is needed.')
c   = input('Enter type number  ');
A   = input('Enter vector A of actions ');
PH  = input('Enter vector PH of parameter probabilities  ');
PXH = input('Enter matrix PXH of conditional probabilities  ');
X   = input('Enter vector X of test random variable values  ');
s = length(PH);
q = length(X);
if c == 1
 L   = input('Enter loss matrix L  ');
 PYH = input('Enter matrix PYH of conditional probabilities  ');
 RH  = L*PYH';
elseif c == 2
 RH  = input('Enter matrix RH of expected loss, given H  ');
else
 L   = input('Enter loss matrix L  ');
 RH  = L;
end
PX   = PH*PXH;        % (1 x s)(s x q) = (1 x q)
[a,b] = meshgrid(PH,PX);
PHX = PXH'.*a./b;     % (q x s)
RX  = RH*PHX';        % (m x s)(s x q) = (m x q)
[RD D] = min(RX);     % determines min of each col
                      % and row on which min occurs
S = [X; A(D); RD]';
BD = RD*PX';          % Bayesian risk
h  = ['  Optimum losses and actions'];
sh = ['  Test value  Action     Loss'];
disp(' ')
disp(h)
disp(sh)
disp(S)
disp(' ')
disp(['Bayesian risk  B(d*) = ',num2str(BD)])

Example 2.3. General case

% file dec1.m
% Data for Problem 22-11
type = 1;
A = [10 15];          % Artificial actions list
PH = [0.3 0.2 0.5];   % PH(i) = P(H = i)
PXH = [0.7 0.2 0.1;   % PXH(i,j) = P(X = j|H= i)
      0.2 0.6 0.2;
      0.1 0.1 0.8];
X = [-1 0  1];
L = [1  0 -2;         % L(a,k) = loss when action number is a, outcome is k
    3 -1 -4];
PYH = [0.5 0.3 0.2;   % PYH(i,k) = P(Y = k|H = i)
      0.2 0.5 0.3;
      0.1 0.3 0.6];
 
dec1
dec
Decision process with experimentation
There are three types, according to the data provided.
In all types, we need the row vector A of actions,
the row vector PH with PH(i) = P(H = i),
the row vector X of test random variable values, and
the matrix PXH with PXH(i,j) = P(X = j|H = i).
Type 1.  Loss matrix L of L(a,k)
        Matrix PYH with PYH(i,k) = P(Y = k|H = i)
Type 2.  Matrix RH of r(a,i) = E[L(a,Y)|H = i].
        L and PYH are not needed in this case.
Type 3.  Y = H, so that only RH = L is needed.
Enter type number  type
Enter vector A of actions A
Enter vector PH of parameter probabilities  PH
Enter matrix PXH of conditional probabilities  PXH
Enter vector X of test random variable values  X
Enter loss matrix L  L
Enter matrix PYH of conditional probabilities  PYH
 
 Optimum losses and actions
 Test value  Action     Loss
  -1.0000   15.0000   -0.2667
        0   15.0000   -0.9913
   1.0000   15.0000   -2.1106
 
Bayesian risk  B(d*) = -1.3

Intermediate steps in the solution of Example 2.3, to show results of various operations

RH
RH  =  0.1000   -0.4000   -1.1000
      0.4000   -1.1000   -2.4000
PX
PX  =  0.3000    0.2300    0.4700
a
a   =  0.3000    0.2000    0.5000
      0.3000    0.2000    0.5000
      0.3000    0.2000    0.5000
b
b   =  0.3000    0.3000    0.3000
      0.2300    0.2300    0.2300
      0.4700    0.4700    0.4700
PHX
PHX =  0.7000    0.1333    0.1667
      0.2609    0.5217    0.2174
      0.0638    0.0851    0.8511
RX
RX  = -0.1667   -0.4217   -0.9638
     -0.2667   -0.9913   -2.1106

Example 2.4.  RH given

% file dec2.m  
% Data for type in which RH is given
type = 2;
A = [1 2];
X = [-1 1 3];
PH = [0.2 0.5 0.3];
PXH = [0.5 0.4 0.1;   % PXH(i,j) = P(X = j|H = i)
      0.4 0.5 0.1;
      0.2 0.4 0.4];
RH = [-10   5 -12;
       5 -10  -5];    % r(a,i) = expected loss when
                      %   action is a, given H = i
 
dec2
dec
Decision process with experimentation
------------------- Instruction lines edited out
Enter type number  type
Enter vector A of actions A
Enter vector PH of parameter probabilities  PH
Enter matrix PXH of conditional probabilities  PXH
Enter vector X of test random variable values  X
Enter matrix RH of expected loss, given H  RH
 
 Optimum losses and actions
 Test value  Action     Loss
  -1.0000    2.0000   -5.0000
   1.0000    2.0000   -6.0000
   3.0000    1.0000   -7.3158
 
Bayesian risk  B(d*) = -5.89

Example 2.5. Carnival problem

Carnival example (type in which Y = H )

A carnival is scheduled to appear on a given date.  Profits to be earned depend heavily on the weather.  If rainy, the carnival loses $15 (thousands); if cloudy, the loss is $5 (thousands); if sunny, a profit of $10 (thousands) is expected.  If the carnival sets up its equipment, it must give the show;  if it decides not to set up, it forfeits $1,000.  For an additional cost of $1,000, it can delay setup until the day before the show and get the latest weather report.

      Actual weather H = Y is 1 rainy, 2 cloudy, or 3 sunny.

The weather report X has values 1, 2, or 3, corresponding to predictions rainy, cloudy, or sunny respectively.

      Reliability of the forecast is expressed in terms of P(X = j|H = i) -- see matrix PXH

      Two actions: 1 set up;  2 no set up.

      Possible losses for each action and weather condition are in matrix L.

% file dec3.m
% Carnival problem
type = 3;             % Y = H  (actual weather)
A = [1  2];           % 1: setup  2: no setup
X = [1  2  3];        % 1; rain,  2: cloudy, 3: sunny
L = [16 6 -9;         % L(a,k) = loss when action number is a, outcome is k
     2 2  2];         % --with premium for postponing setup
PH = 0.1*[1 3 6];     % P(H = i)
PXH = 0.1*[7 2 1;     % PXH(i,j) = P(X = j|H = i)
          2 6 2;
          1 2 7];
 
dec3
dec
Decision process with experimentation
------------------- Instruction lines edited out
Enter type number  type
Enter vector A of actions A
Enter vector PH of parameter probabilities  PH
Enter matrix PXH of conditional probabilities  PXH
Enter vector X of test random variable values  X
Enter loss matrix L  L
 
 Optimum losses and actions
 Test value  Action     Loss
   1.0000    2.0000    2.0000
   2.0000    1.0000    1.0000
   3.0000    1.0000   -6.6531
 
Bayesian risk  B(d*) = -2.56

2.4. Matlab Procedures for Markov Decision Processes*

In order to provide the background for Matlab procedures for Markov decision processes, we first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. References are to Pfeiffer: PROBABILITY FOR APPLICATIONS.  

  1. Some cost and reward patterns Consider a finite, ergodic (i.e., recurrent, positive, aperiodic, irreducible) homogeneous Markov chain X_N, with state space E = {1,2,⋯,M}. To facilitate use of matrix computation programs, we number states from one, except in certain examples in which state zero has a natural meaning. Associated with this chain is a cost or reward structure belonging to one of the three general classes described below. The one-step transition probability matrix is P = [p(i,j)], where p(i,j) = P(X_{n+1} = j|X_n = i). The distribution for X_n is represented by the row matrix π(n) = [p_1(n) p_2(n) ⋯ p_M(n)], where p_i(n) = P(X_n = i). The long-run distribution is represented by the row matrix π = [π_1 π_2 ⋯ π_M].

    1. Type 1. Gain associated with a state. A reward or gain in the amount g(i) is realized in the next period if the current state is i. The gain sequence of random variables is the sequence of gains realized as the chain X_N evolves. We represent the gain function g by the column matrix g = [g(1) g(2) ⋯ g(M)]^T.

    2. Type 2. One-step transition gains. A reward or gain in the amount g(i,j) = g_ij is realized in period n + 1 if the system goes from state i in period n to state j in period n + 1. The corresponding one-step transition probability is p(i,j) = p_ij. The gain matrix is g = [g(i,j)]. The gain sequence of random variables is the sequence of gains experienced as the chain X_N evolves.

    3. Type 3. Gains associated with a demand. In this case, the gain random variables are of the form

      (2.1) G_{n+1} = g(X_n, D_{n+1})

      If n represents the present, the random vector U_n = (X_0, X_1, ⋯, X_n) represents the “past” at n of the chain X_N. We suppose {D_n : 1 ≤ n} is iid and each pair {U_n, D_{n+1}} is independent. Again, the gain sequence represents the gains realized as the process evolves. Standard results on Markov chains show that in each case the sequence is Markov.

    A recurrence relation. We are interested in the expected gains in future stages, given the present state of the process. For any n > 0 and any m > 1, the gain G n (m) in the m periods following period n is given by

    (2.2) G_n(m) = G_{n+1} + G_{n+2} + ⋯ + G_{n+m}

    If the process is in state i, the expected gain in the next period q_i is

    (2.3) q_i = E[G_{n+1}|X_n = i]

    and the expected gain in the next m periods is

    (2.4) v_i(m) = E[G_n(m)|X_n = i]

    We utilize a recursion relation that allows us to begin with the transition matrix P and the next-period expected gain matrix q and calculate the v_i(m) for any m > 1. Although the analysis is somewhat different for the various gain structures, the result exhibits a common pattern. In each case

    (2.5) v_i(1) = E[G_{n+1}|X_n = i] = q_i

    (2.6) v_i(m) = E[G_n(m)|X_n = i] = E[G_{n+1}|X_n = i] + E[G_{n+1}(m - 1)|X_n = i]

    (2.7) E[G_{n+1}(m - 1)|X_n = i] = Σ_j p(i,j) E[G_{n+1}(m - 1)|X_{n+1} = j] = Σ_j p(i,j) v_j(m - 1)

    We thus have the fundamental recursion relation

    (2.8) v_i(m) = q_i + Σ_j p(i,j) v_j(m - 1),  with v_i(1) = q_i

    The recursion relation (2.8) shows that the transition matrix P = [p(i,j)] and the vector of next-period expected gains, which we represent by the column matrix q = [q_1 q_2 ⋯ q_M]^T, determine the v_i(m). The difference between the cases is the manner in which the q_i are obtained.

    Type 1: q_i = g(i)
    Type 2: q_i = Σ_j p(i,j) g(i,j)
    Type 3: q_i = E[g(i, D_{n+1})] = Σ_k g(i,k) P(D = k)
    Matrix form. For computational purposes, we utilize these relations in matrix form. If
    (2.9) v(m) = [v_1(m) v_2(m) ⋯ v_M(m)]^T  and  q = [q_1 q_2 ⋯ q_M]^T
    then
    (2.10) v(m) = q + P v(m - 1),  with v(1) = q

    Examination of the expressions for q_i, above, shows that the next-period expected gain matrix q takes the following forms. In each case, g is the gain matrix. For type 3, p_D is the column matrix whose elements are P(D = k).

    Type 1: q = g
    Type 2: q = the diagonal of P g^T, as a column matrix (in MATLAB, q = diag(P*g'))
    Type 3: q = g p_D
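
    The following sketch illustrates these forms on a small hypothetical chain (P, the gain matrices, and the demand distribution below are made-up stand-ins, not data from the examples). It computes q for each gain type and then applies the recursion v(m) = q + P v(m - 1):

    % Hypothetical two-state chain -- stand-in data for illustration only
    P  = [0.6 0.4; 0.3 0.7];        % transition matrix
    g1 = [5; 2];                    % type 1: g(i), gain for each state
    g2 = [1 3; 2 4];                % type 2: g(i,j), one-step transition gains
    g3 = [0 2 4; 1 3 5];            % type 3: g(i,k), gains for demand values k = 0,1,2
    pD = [0.5; 0.3; 0.2];           % demand distribution P(D = k)
    q1 = g1;                        % type 1: q = g
    q2 = diag(P*g2');               % type 2: q_i = sum_j p(i,j) g(i,j)
    q3 = g3*pD;                     % type 3: q_i = sum_k g(i,k) P(D = k)
    q  = q2;                        % use the type 2 gains in the recursion
    v  = q;                         % v(1) = q
    for m = 2:5
       v = q + P*v;                 % v(m) = q + P v(m-1)
    end
    disp(v')                        % expected gains in the next five periods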

  2. Some long-run averages Consider the average expected gain for m periods

    (2.11) (1/m) v_i(m) = (1/m) Σ_{k=1}^{m} E[G_{n+k}|X_n = i]

    Use of the Markov property and the fact that for an ergodic chain

    (2.12) P(X_{n+k} = j|X_n = i) → π_j  as  k → ∞

    yields the result that

    (2.13) E[G_{n+k}|X_n = i] → Σ_j π_j q_j = g  as  k → ∞

    and for any state i,

    (2.14) lim_{m→∞} (1/m) v_i(m) = g

    The expression for g may be put into matrix form. If the long-run probability distribution is represented by the row matrix π = [π_1 π_2 ⋯ π_M] and q = [q_1 q_2 ⋯ q_M]^T is the column matrix of next-period expected gains, then

    (2.15) g = πq
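
    As a brief illustration of (2.15), again with a hypothetical two-state chain (stand-in values, not data from the examples), the long-run distribution may be approximated by any row of a high power of P:

    P  = [0.6 0.4; 0.3 0.7];        % hypothetical transition matrix (stand-in)
    q  = [1.8; 2.4];                % next-period expected gains (stand-in)
    P0 = P^16;                      % rows approximate the long-run distribution
    PI = P0(1,:);                   % long-run probabilities (any row of P0)
    G  = PI*q                       % long-run expected gain per period g = pi*q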

  3. Discounting and potentials Suppose in a given time interval the value of money increases by a fraction a. This may represent the potential earning if the money were invested. One dollar now is worth 1 + a dollars at the end of one period. It is worth (1 + a)^n dollars after n periods. Set α = 1/(1 + a), so that 0 < α ≤ 1. It takes α dollars to be worth one dollar after one period. It takes α^n dollars at present to be worth one dollar n periods later. Thus the present worth or discounted value of one dollar n periods in the future is α^n dollars. This is the amount one would need to invest presently, at interest rate a per unit of time, to have one dollar n periods later. If f is any function defined on the state space E, we designate by f the column matrix [f(1) f(2) ⋯ f(M)]^T. We make an exception to this convention in the case of the distributions π(n) of the state probabilities, their generating function, and the long-run probabilities π, which we represent as row matrices. It should be clear that much of the following development extends immediately to infinite state space E = N = {0,1,2,⋯}. We assume one of the gain structures introduced in Sec 1 and the corresponding gain sequence {G_n : 1 ≤ n}. The value of the random variable G_{n+1} is the gain or reward realized during the period n + 1. We let each transition time be at the end of one period of time (say a month or a quarter). If n corresponds to the present, then G_{n+k} is the gain in period n + k. If we do not discount for the period immediately after the present, then α^{k-1} G_{n+k} is the present value of that gain. Hence

    (2.16) Σ_{k=1}^{∞} α^{k-1} G_{n+k} is the present value of all future gains

    Definition. The α-potential of the gain sequence is the function φ defined by

    (2.17) φ(i) = E[Σ_{k=1}^{∞} α^{k-1} G_{n+k}|X_n = i],  i ∈ E

    Remark. φ(i) is the expected total discounted gain, given the system starts in state i. We next define a concept which is related to the α-potential in a useful way. Definition. The α-potential matrix R_α for the process X_N is the matrix

    (2.18) R_α = Σ_{n=0}^{∞} α^n P^n

    Theorem 2.1.

    Let X_N be an ergodic, homogeneous Markov chain with state space E and gain sequence {G_n : 1 ≤ n}. Let q = [q_1 q_2 ⋯ q_M]^T, where q_i = E[G_{n+1}|X_n = i]. For α ∈ (0,1), let φ be the α-potential for the gain sequence. That is, φ(i) = E[Σ_{k=1}^{∞} α^{k-1} G_{n+k}|X_n = i] for each i ∈ E. Then, if R_α is the α-potential matrix for X_N, we have φ = R_α q.


    Theorem 2.2.

    Consider an ergodic, homogeneous Markov chain X_N with gain sequence {G_n : 1 ≤ n} and next-period expected gain matrix q. For α ∈ (0,1), the α-potential φ and the α-potential matrix R_α satisfy

    [I - αP] R_α = I  and  [I - αP] φ = q

    If the state space E is finite, then

    R_α = [I - αP]^{-1}  and  φ = [I - αP]^{-1} q = R_α q


    Example 2.6. A numerical example

    Suppose the transition matrix P and the next-period expected gain matrix q are given. For α = 0.9, we have φ = R_α q = [I - 0.9P]^{-1} q.
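
    A sketch of the computation, with hypothetical stand-ins for P and q (not the matrices of this example), shows the pattern:

    % Hypothetical P and q -- stand-ins for illustration, not the example data
    P  = [0.5 0.5 0; 0.2 0.5 0.3; 0.1 0.4 0.5];
    q  = [10; 4; -2];
    alpha = 0.9;
    Ra  = inv(eye(3) - alpha*P);    % alpha-potential matrix R_alpha = [I - alpha*P]^(-1)
    phi = Ra*q                      % alpha-potential phi = R_alpha*q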


    The next example is of type 3, but the demand random variable is Poisson, which does not have finite range. The simple pattern for calculating the q_i must be modified.

    Example 2.7. Costs associated with inventory under an (m,M) order policy.

    If k units are ordered, the cost in dollars is

    For each unit of unsatisfied demand, there is a penalty of 50 dollars. We use the inventory policy described in Ex 23.1.3. X 0 = M , and Xn is the stock at the end of period n, before restocking. Demand in period n is Dn . The cost for restocking at the end of period n + 1 is For m = 1,M = 3, we have We take m = 1,M = 3 and assume D is Poisson (1). From Ex 23.1.3 in PA, we obtain

    (2.19)

    The largest eigenvalue other than one has magnitude |λ| ≈ 0.26, so n ≥ 10 should be sufficient for convergence. We use n = 16.

    Taking any row of P 16 , we approximate the long-run distribution by

    (2.20)

    We thus have

    (2.21)

    For the Poisson distribution . Hence q 3 = q 0 – 85 = 1.1668 so that

    (2.22)

    Then, for α = 0.8 we have

    (2.23)

    Recall that we are dealing with expected discounted costs. Since we usually consider starting with a full stock, the amount φ(M) = φ(3) = 100.68 is the quantity of interest. Note that this is smaller than other values of φ(j), which correspond to smaller beginning stock levels. We expect greater restocking costs in such cases.


  4. Evolution of chains with costs and rewards

    1. No discounting A generating function analysis of

      (2.24) v(m) = q + P v(m - 1),  with v(1) = q

      shows

      (2.25) v(m) = m g 1 + v + transient terms which decay to zero,  where v = B_0 q

      Here g = πq, 1 is a column matrix of ones, P_0 = lim_n P^n (each row of P_0 is the long-run distribution π), and B_0 is a matrix which analysis also shows we may approximate by

      (2.26) B_0 ≈ Σ_{j=0}^{s} (P^j - P_0) = I + P + P^2 + ⋯ + P^s - (s + 1)P_0

      The largest |λ_i| < 1 gives an indication of how many terms are needed.
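
      The approximation in (2.26) is easy to carry out numerically. The following sketch (again with a small hypothetical two-state chain as a stand-in, not data from the examples) computes B_0 and the constant-part values v = B_0 q; the same pattern appears later in the procedure newpol.m:

      % Hypothetical two-state chain (stand-in) -- approximation of B0 and v
      P  = [0.6 0.4; 0.3 0.7];
      q  = [1.8; 2.4];
      P0 = P^16;                    % approximates lim P^n
      s  = 6;
      C  = eye(2);
      for j = 1:s
         C = C + P^j;               % I + P + P^2 + ... + P^s
      end
      B0 = C - (s+1)*P0;            % B0 approx sum_{j=0}^{s} (P^j - P0)
      v  = B0*q                     % constant part: v(m) approx m*g + v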

      Example 2.8. The inventory problem (continued)

      We consider again the inventory problem. We have

      (2.27)

      The eigenvalues are 1, 0.0920 + i0.2434, 0.0920 - i0.2434, and 0. Thus, the decay of the transients is controlled by λ* = 0.0920 + i0.2434, with |λ*| ≈ 0.26. Since |λ*|^4 ≈ 0.0046, the transients die out rather rapidly. We obtain the following approximations

      (2.28)

      The approximate value of B0 is found to be

      (2.29)

      The value g = πq ≈ 28.80 is found in the earlier treatment. From the values above, we have

      (2.30)

      The average gain per period is clearly g ≈ 28.80. This soon dominates the constant terms represented by the entries in v.


    2. With discounting Let the discount factor be α ∈ (0,1). If n represents the present period, then G_{n+1} is the reward in the first succeeding period and G_{n+k} is the reward in the kth succeeding period. If we do not discount the first period, then

      (2.31) the present value of the gains in the next m periods is G_{n+1} + α G_{n+2} + ⋯ + α^{m-1} G_{n+m}

      Thus

      (2.32) v_i(m) = E[G_{n+1} + α G_{n+2} + ⋯ + α^{m-1} G_{n+m}|X_n = i] = q_i + α Σ_j p(i,j) v_j(m - 1)

      In matrix form, this is

      (2.33) v(m) = q + α P v(m - 1),  with v(1) = q

      A generating function analysis shows that

      (2.34) v(m) = v + transient terms which decay to zero as m increases

      Hence, the steady state part of v_i(m) is v_i, which satisfies

      (2.35) v_i = q_i + α Σ_j p(i,j) v_j,  for each i ∈ E

      In matrix form,

      (2.36) v = q + α P v,  so that  v = [I - αP]^{-1} q

      Since the q_i and p(i,j) are known, we can solve for the v_i. Also, since v_i(m) is the present value of expected gain m steps into the future, the v_i represent the present value for an unlimited future, given that the process starts in state i.

  5. Stationary decision policies We suppose throughout this section that X_N is ergodic, with finite state space E = {1,2,⋯,M}. For each state i ∈ E, there is a class A_i of possible actions which may be taken when the process is in state i. A decision policy is a sequence of decision functions d_1, d_2, ⋯ such that the action taken at stage n is

    (2.37) A_n = d_n(X_0, X_1, ⋯, X_n)

    The action selected is in A_i whenever X_n = i. We consider a special class of decision policies. Stationary decision policy: the decision function is the same at every stage, so that

    (2.38) A_n = d(X_n)  for all n

    The possible actions depend upon the current state. That is, the action taken is d(i) ∈ A_i whenever X_n = i. Analysis shows the Markov character of the process is preserved under stationary policies. Also, if E is finite and every stationary policy produces an irreducible chain, then a stationary policy which is optimal among stationary policies is optimal among all policies. We suppose each policy yields an ergodic chain. Since the transition probabilities from any state depend in part on the action taken when in that state, the long-run probabilities will depend upon the policy. In the no-discounting case, we want to determine a stationary policy that maximizes the gain g = πq for the corresponding long-run probabilities. We say “a stationary policy,” since we do not rule out the possibility there may be more than one policy which maximizes the gain. In the discounting case, the goal is to maximize the steady state part v_i in the expressions

    (2.39) v_i(m) = v_i + transient terms which decay to zero as m increases

    In simple cases, with small state space and a small number of possible actions in each state, it may be feasible to obtain the gain or present values for each permissible policy and then select the optimum by direct comparison. However, for most cases, this is not an efficient approach. In the next two sections, we consider iterative procedures for step-by-step optimization which usually greatly reduces the computational burden.

  6. Policy iteration method– no discounting We develop an iterative procedure for finding an optimal stationary policy. The goal is to determine a stationary policy that maximizes the long run expected gain per period g. In the next section, we extend this procedure to the case where discounting is in effect. We assume that each policy yields an ergodic chain. Suppose we have established that under a given policy

    (1) v_i(n) = q_i + Σ_j p(i,j) v_j(n - 1), where q_i = E[G_{n+1}|X_n = i]

    In this case, the analysis in Sec 4 shows that for large n,

    (2) v_i(n) ≈ ng + v_i, where v = [v_1 v_2 ⋯ v_M]^T = B_0 q

    We note that v_i and g depend upon the entire policy, while q_i and p(i,j) depend on the individual actions a_i. Using (1) and (2), we obtain

    (2.40) ng + v_i ≈ q_i + Σ_j p(i,j)[(n - 1)g + v_j] = q_i + (n - 1)g + Σ_j p(i,j) v_j

    From this we conclude

    (3) g + v_i = q_i + Σ_j p(i,j) v_j  for all i ∈ E

    Suppose a policy d has been used. That is, action d(i) is taken whenever the process is in state i. To simplify writing, we drop the indication of the action and simply write p(i,j) for p_ij(d(i)), etc. Associated with this policy, there is a gain g. We should like to determine whether or not this is the maximum possible gain, and if it is not, to find a policy which does give the maximum gain. The following two-phase procedure has been found to be effective. It is essentially the procedure developed originally by Ronald Howard, who pioneered the treatment of these problems. The new feature is the method of carrying out the value-determination step, utilizing the approximation method noted above.

    1. Value-determination procedure We calculate the long-run distribution π and the gain g = πq for the current policy. As in Sec 4, we calculate

      (2.41) v = B_0 q ≈ [I + P + P^2 + ⋯ + P^s - (s + 1)P_0] q

    2. Policy-improvement procedure We suppose policy d has been used through period n. Then, by (3), above,

      (2.42) g + v_i = q_i + Σ_j p(i,j) v_j  for all i ∈ E

      We seek to improve policy d by selecting policy d*, with d*(i) = a_i*, to satisfy

      (2.43) q_i(a_i*) + Σ_j p(i,j|a_i*) v_j = max_{a ∈ A_i} [q_i(a) + Σ_j p(i,j|a) v_j]

    Remarks

    • In the procedure for selecting d* , we use the “old” vj .

    • It has been established that the gain g* for the new policy satisfies g* ≥ g, with equality iff g is optimal. Since there is a finite number of policies, the procedure must converge to an optimum stationary policy in a finite number of steps.

    We implement this two step procedure with Matlab. To demonstrate the procedure, we consider the following

    Example 2.9. A numerical example

    A Markov decision process has three states: State space E = {1, 2, 3}.

    Actions: State 1: A_1 = {1, 2, 3};  State 2: A_2 = {1, 2};  State 3: A_3 = {1, 2}.

    Transition probabilities and rewards are:

    Table 2.1.
    p_1j(1): [1/3 1/3 1/3]   g_1j(1): [1 3 4]
    p_1j(2): [1/4 3/8 3/8]   g_1j(2): [2 2 3]
    p_1j(3): [1/3 1/3 1/3]   g_1j(3): [2 2 3]
    p_2j(1): [1/8 3/8 1/2]   g_2j(1): [2 1 2]
    p_2j(2): [1/2 1/4 1/4]   g_2j(2): [1 4 4]
    p_3j(1): [3/8 1/4 3/8]   g_3j(1): [2 3 3]
    p_3j(2): [1/8 1/4 5/8]   g_3j(2): [3 2 2]

    Use the policy iteration method to determine the policy which gives the maximum gain g.

    A computational procedure utilizing Matlab We first put the data in an m-file. Since we have several cases, it is expedient to include the case number. This example belongs to type 2. Data in file md61.m

    type = 2
    PA = [1/3 1/3 1/3; 1/4 3/8 3/8; 1/3 1/3 1/3; 0 0 0;
         1/8 3/8 1/2; 1/2 1/4 1/4; 0 0 0;
         3/8 1/4 3/8; 1/8 1/4 5/8]
    GA = [1 3 4; 2 2 3; 2 2 3; 0 0 0;        % Zero rows are separators
         2 1 2; 1 4 4; 0 0 0;
         2 3 3; 3 2 2]
    A = [1 2 3 0 1 2 0 1 2]
    

    md61
    type = 2
    PA = 0.3333    0.3333    0.3333
        0.2500    0.3750    0.3750
        0.3333    0.3333    0.3333
             0         0         0
        0.1250    0.3750    0.5000
        0.5000    0.2500    0.2500
             0         0         0
        0.3750    0.2500    0.3750
        0.1250    0.2500    0.6250
     
    GA =  1     3     4
         2     2     3
         2     2     3
         0     0     0
         2     1     2
         1     4     4
         0     0     0
         2     3     3
         3     2     2
     
    A =   1     2     3     0     1     2     0     1     2
    

    Once the data are entered into Matlab by the call for file “md61,” we make preparation for the policy-improvement step. The procedure is in the file newpolprep.m

    % file newpolprep.m
    % version of 3/23/92
    disp('Data needed:')
    disp('  Matrix PA of transition probabilities for states and actions')
    disp('  Matrix GA of gains for states and actions')
    disp('  Type number to identify GA matrix types')
    disp('    Type 1.  Gains associated with a state')
    disp('    Type 2.  One-step transition gains')
    disp('    Type 3.  Gains associated with a demand')
    disp('  Row vector of actions')
    disp('  Value of alpha (= 1 for no discounting)')
    c  = input('Enter type number to show gain type ');
    a  = input('Enter value of alpha (= 1 for no discounting) ');
    PA = input('Enter matrix PA of transition probabilities ');
    GA = input('Enter matrix GA of gains ');
    if c == 3
     PD = input('Enter row vector of demand probabilities ');
    end
    if c == 1
     QA = GA';
    elseif c ==2
     QA = diag(PA*GA');     % (qx1)
    else
     QA = GA*PD';
    end
    
    A  = input('Enter row vector A of possible actions ');  % (1xq)
    m  = length(PA(1,:));
    q  = length(A);
    n  = input('Enter n, the power of P  to approximate P0 ');
    s  = input('Enter s, the power of P  in the approximation of V ');
    QD = [(1:q)' A' QA];    % First col is index; second is A; third is QA
    DIS = ['    Index      Action    Value'];
    disp(DIS)
    disp(QD)
    if a == 1
     call = 'Call for newpol';
    else
     call = 'Call for newpold';
    end
    disp(call)
    
    newpolprep   % Call for preparatory program in file newpolprep.m
    Data needed:
     Matrix PA of transition probabilities for states and actions
     Matrix GA of gains for states and actions
     Type number to identify GA matrix types
       Type 1.  Gains associated with a state
       Type 2.  One-step transition gains
       Type 3.  Gains associated with a demand
     Row vector of actions
     Value of alpha (= 1 for no discounting)
     
    Enter type number to show gain type 2
    Enter value of alpha (=1 for no discounting) 1
    Enter matrix PA of transition probabilities  PA
    Enter matrix GA of gains GA
    Enter row vector A of possible actions  A
    Enter n, the power of P  to approximate P0  16
    Enter s, the power of P  in the approximation of V  6
       Index      Action    Value
       1.0000    1.0000    2.6667
       2.0000    2.0000    2.3750
       3.0000    3.0000    2.3333
       4.0000         0         0
       5.0000    1.0000    1.6250
       6.0000    2.0000    2.5000
       7.0000         0         0
       8.0000    1.0000    2.6250
       9.0000    2.0000    2.1250
     
    Call for newpol
    

    The procedure has prompted for the procedure newpol (in file newpol.m)

    % file:  newpol.m  (without discounting)
    % version of 3/23/92
    d  = input('Enter policy as row matrix of indices ');
    D  = A(d);               % policy in terms of actions
    P  = PA(d',:);           % selects probabilities for policy
    Q  = QA(d',:);           % selects next-period gains for policy
    P0 = P^n;                % Display to check convergence
    PI = P0(1,:);            % Long-run distribution
    G = PI*Q                 % Long-run expected gain per period
    C = P + eye(size(P));
    for j=2:s
     C = C + P^j;           % C = I + P + P^2 + ... + P^s
    end
    V = (C - (s+1)*P0 )*Q;  % V = B0*Q, the constant-part values
    disp(' ')
    disp('Approximation to P0; rows are long-run dbn')
    disp(P0)
    disp('Policy in terms of indices')
    disp(d)
    disp('Policy in terms of actions')
    disp(D)
    disp('Values for the policy selected')
    disp(V)
    disp('Long-run expected gain per period G')
    disp(G)
    T = [(1:q)' A' (QA + PA*V)];  % Test values for determining new policy
    DIS =['    Index     Action   Test Value'];
    disp(DIS)
    disp(T)
    disp('Check current policy against new test values.')
    disp('--If new policy needed, call for newpol')
    disp('--If not, use policy, values V, and gain G, above')
    
    newpol
    Enter policy as row matrix of indices [2 6 9]  % A deliberately poor choice
     
    Approximation to P0; rows are long-run dbn
       0.2642    0.2830    0.4528
       0.2642    0.2830    0.4528
       0.2642    0.2830    0.4528
     
    Policy in terms of indices
        2     6     9
     
    Policy in terms of actions
        2     2     2
    
    Long-run expected gain per period G
       2.2972
     
       Index     Action   Test Value
       1.0000    1.0000    2.7171
       2.0000    2.0000    2.4168
       3.0000    3.0000    2.3838
       4.0000         0         0
       5.0000    1.0000    1.6220
       6.0000    2.0000    2.5677
       7.0000         0         0
       8.0000    1.0000    2.6479
       9.0000    2.0000    2.0583
     
    Check current policy against new test values.
    --If new policy needed, call for newpol
    --If not, use policy and gain G, above      % New policy is needed
    newpol
     
    Enter policy as row matrix of indices [1 6 8]
     
    Approximation to P0; rows are long-run dbn
       0.3939    0.2828    0.3232
       0.3939    0.2828    0.3232
       0.3939    0.2828    0.3232
     
    Policy in terms of indices
        1     6     8
     
    Policy in terms of actions
        1     2     1
     
    Values for selected policy
       0.0526
      -0.0989
       0.0223
     
    Long-run expected gain per period G
       2.6061
     
       Index     Action   Test Value
       1.0000    1.0000    2.6587
       2.0000    2.0000    2.3595
       3.0000    3.0000    2.3254
       4.0000         0         0
       5.0000    1.0000    1.6057
       6.0000    2.0000    2.5072
       7.0000         0         0
       8.0000    1.0000    2.6284
       9.0000    2.0000    2.1208
     
    Check current policy against new test values.
    --If new policy needed, call for newpol    
    --If not, use policy, values V, and gain G, above
    

    Since the policy selected on this iteration is the same as the previous one, the procedure has converged to an optimal policy. The first of the first three rows is maximum; the second of the next two rows is maximum; and the first of the final two rows is maximum. Thus, we have selected rows 1, 6, 8, corresponding to the optimal policy d*(1) = 1, d*(2) = 2, d*(3) = 1 (actions 1, 2, 1 for states 1, 2, 3). The expected long-run gain per period is g = 2.6061.

    The value matrix is

    (2.44) v = [0.0526  -0.0989  0.0223]^T

    We made a somewhat arbitrary choice of the powers of P used in the approximation of P_0 and B_0. As we note in the development of the approximation procedures in Sec 4, the convergence of P^n is governed by the magnitude of the largest eigenvalue other than one. We can always get a check on the reliability of the calculations by checking the eigenvalues for P corresponding to the presumed optimal policy. For the choice above, we find the eigenvalues to be 1, -0.1250, and 0.0833. Since 0.1250^4 ≈ 0.0002, the choices of exponents should be quite satisfactory. In fact, we could probably use P^8 as a satisfactory approximation to P_0, if that were desirable. The margin allows for the possibility that for some policies the eigenvalues may not be so small.


  7. Policy iteration– with discounting It turns out that the policy iteration procedure is simpler in the case of discounting, as suggested by the evolution analysis in Sec 4. We have the following two-phase procedure, based on that analysis.

    1. Value-determination procedure. Given the q_i and transition probabilities p(i,j) for the current policy, solve v = q + αPv to get the corresponding v_i:

      (2.45) v = [I - αP]^{-1} q

    2. Policy-improvement procedure Given the values v_i for the current policy, determine a new policy d*, with each d*(i) = a_i* determined as that action for which

      (2.46) q_i(a) + α Σ_j p(i,j|a) v_j  is a maximum over a ∈ A_i

    We illustrate the Matlab procedure with the same numerical example as in the previous case, using discount factor a = 0.8. The data file is the same, so that we call for it, as before.

    Example 2.10. 

    Assume data in file md61.m is in Matlab; if not, call for md61. We use the procedure newpolprep to prepare for the iterative procedure by making some initial choices.

    newpolprep
    Data needed:
     Matrix PA of transition probabilities for states and actions
     Matrix GA of gains for states and actions
     Type number to identify GA matrix types
       Type 1.  Gains associated with a state
       Type 2.  One-step transition gains
       Type 3.  Gains associated with a demand
     Row vector of actions
     Value of alpha (= 1 for no discounting)
    
    Enter type number to show gain type 2
    Enter value of alpha (= 1 for no discounting) 0.8
    Enter matrix PA of transition probabilities PA
    Enter matrix GA of gains GA
    Enter row vector A of possible actions A
    Enter n, the power of P  to approximate P0 16
    Enter s, the power of P  in the approximation of V 6
     
       Index      Action  Test Value
       1.0000    1.0000    2.6667
       2.0000    2.0000    2.3750
       3.0000    3.0000    2.3333
       4.0000         0         0
       5.0000    1.0000    1.6250
       6.0000    2.0000    2.5000
       7.0000         0         0
       8.0000    1.0000    2.6250
       9.0000    2.0000    2.1250
     
    Call for newpold
    

    The procedure for policy iteration is in the file newpold.m.

    % file newpold.m  (with discounting)
    % version of 3/23/92
    d = input('Enter policy as row matrix of indices ');
    D = A(d);
    P = PA(d',:);        % transition matrix for policy selected
    Q = QA(d',:);        % average next-period gain for policy selected
     
    V = (eye(size(P)) - a*P)\Q;      % Present values for unlimited future
    T = [(1:q)' A' (QA + a*PA*V)];   % Test values for determining new policy
    P0 = P^n;                        % Approximate long-run matrix for this policy
    disp(' ')
    disp('Approximation to P0; rows are long-run dbn')
    disp(P0)
    disp('    Policy in terms of indices')
    disp(d)
    disp('    Policy in terms of actions')
    disp(D)
    disp('  Values for policy')
    disp(V)
    DIS =['    Index     Action    Test Value'];
    disp(DIS)
    disp(T)
    disp('Check current policy against new test values.')
    disp('--If new policy needed, call for newpold')
    disp('--If not, use policy and values above')
    
    newpold
    Enter policy as row matrix of indices [3 5 9]  % Deliberately poor choice
     
    Approximation to P0; rows are long-run dbn
       0.3939    0.2828    0.3232
       0.3939    0.2828    0.3232
       0.3939    0.2828    0.3232
     
       Policy in terms of indices
        3     5     9
     
       Policy in terms of actions
        3     1     2
     
     Values for policy
      10.3778
       9.6167
      10.1722
     
       Index     Action  Test Value
       1.0000    1.0000   10.7111
       2.0000    2.0000   10.3872
       3.0000    3.0000   10.3778
       4.0000         0         0
       5.0000    1.0000    9.6167
       6.0000    2.0000   10.6089
       7.0000         0         0
       8.0000    1.0000   10.7133
       9.0000    2.0000   10.1722
     
    Check current policy against new test values.
    --If new policy needed, call for newpold
    --If not, use policy and values above
     
    newpold
    Enter policy as row matrix of indices [1 6 8]
     
    Approximation to P0; rows are long-run dbn
       0.3939    0.2828    0.3232
       0.3939    0.2828    0.3232
       0.3939    0.2828    0.3232
     
       Policy in terms of indices
        1     6     8
     
       Policy in terms of actions
        1     2     1
     
     Values for policy
      13.0844
      12.9302
      13.0519
    
       Index     Action  Test Value
       1.0000    1.0000   13.0844
       2.0000    2.0000   12.7865
       3.0000    3.0000   12.7511
       4.0000         0         0
       5.0000    1.0000   12.0333
       6.0000    2.0000   12.9302
       7.0000         0         0
       8.0000    1.0000   13.0519
       9.0000    2.0000   12.5455
     
    Check current policy against new test values.
    --If new policy needed, call for newpold
    --If not, use policy and values above
    

    Since the policy indicated is the same as the previous policy, we know this is an optimal policy. Rows 1, 6, 8 indicate the optimal policy to be d*(1) = 1, d*(2) = 2, d*(3) = 1 (actions 1, 2, 1). It turns out in this example that the optimal policies are the same for the discounted and undiscounted cases. That is not always true. The v matrix gives the present values for unlimited futures, for each of the three possible starting states.