Example 2.2.
An electronic store stocks a certain type of VCR. At the end of each week,
an order is placed for early delivery the following Monday. A maximum of four
units is stocked. Let the states be the number of units on hand at the end
of the sales week: E = {0, 1, 2, 3, 4}. Two possible actions:
Order two units, at a cost of $150 each
Order four units, at a cost of $120 each
Units sell for $200. If demand exceeds the stock in hand, the retailer assumes a penalty of $40 per unit (in losses due to customer dissatisfaction, etc.). Because of turnover, return on sales is considered two percent per week, so that discount is α = 1 / 1.02 on a weekly basis.
In state 0, there are three possible actions: order 0, 2, or 4. In states 1 and 2
there are two possible actions: order 0 or 2. In states 3 and 4, the only action
is to order 0. Customer demand in week n + 1 is represented by a random variable D_{n+1}. The class {D_n : 1 ≤ n} is iid, uniformly distributed on the values 0, 1, 2, 3, 4.
If X_n is the state at the end of week n, then X_n and D_{n+1} are independent for each n.
Analyze the system as a Markov decision process with type 3 gains (gains depending upon current state, action, and demand). Determine the transition probability matrix PA (properly padded) and the gain matrix GA (also padded). Sample calculations are as follows:
State 0, action 0: p_00(0) = 1 (all other p_0k(0) = 0)
State 0, action 2: p_00(2) = P(D ≥ 2) = 3/5, p_01(2) = P(D = 1) = 1/5, etc.
State 2, action 2: p_2j(2) = 1/5, j = 0, 1, 2, 3, 4
For state i, action a, and demand k, the gain is
g(i, a, k) = C(a) + SP·min(k, i + a) − BP·max(k − i − a, 0)
where C(a) ≤ 0 is the order cost (as entered in the data file), SP the selling price, and BP the penalty per unit of unsatisfied demand.
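As a spot check (not part of the original files), the entry for state 0, action 2, demand 3 can be computed directly from this formula; it agrees with the value 60 in the GA matrix produced below.

% Spot check of one gain entry: state 0, order 2, demand 3 -- a sketch
% using the constants of orderdata.m
i = 0; a = 2; k = 3;
Ca = -300;                                        % cost of ordering two units
SP = 200; BP = 40;
g = Ca + SP*min(k, i + a) - BP*max(k - i - a, 0)  % = -300 + 400 - 40 = 60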
Complete the transition probability table and the gain table.
Determine an optimum infinite-horizon strategy with no discounting.
Determine an optimum infinite-horizon strategy with discounting (α = 1/1.02).
The manager decides to set up a six-week strategy, after which new sales conditions may be established. Determine an optimum strategy for the six-week period.
Data file
% file orderdata.m
% Version of 4/5/94
% Data organized for computation
type = 3;
states = 0:4;
A = [0 2 4 ...      % Actions (padded)
     0 2 2 ...
     0 2 2 ...
     0 0 0 ...
     0 0 0];
C = [0 -300 -480 ... % Order costs (padded)
     0 -300 -300 ...
     0 -300 -300 ...
     0    0    0 ...
     0    0    0];
SP = 200;            % Selling price
BP = 40;             % Backorder penalty
PD = 0.2*ones(1,5);  % Demand probabilities
The procedure
% file reorder.m
% Version of 4/11/94
% Calculates PA and GA for reorder policy
states = input('Enter row vector of states ');
A  = input('Enter row vector A of actions (padded) ');
C  = input('Enter row vector C of order costs (padded) ');
D  = input('Enter row vector D of demand values ');
PD = input('Enter row vector PD of demand probabilities ');
SP = input('Enter unit selling price SP ');
BP = input('Enter backorder penalty cost BP ');
m  = length(states);          % number of states
q  = length(A);               % number of (state, action) rows
na = q/m;                     % actions per state (after padding)
N  = length(D);
S  = ones(na,1)*states;
S  = S(:)';                   % each state repeated na times
[d,s] = meshgrid(D,S);        % demand and current stock for each row
a  = A'*ones(1,N);            % action (order quantity) for each row
ca = C'*ones(1,N);            % order cost for each row
TA = (s + a - d).*(s + a - d >= 0);     % next state: stock after demand, not below zero
for i = 1:q
  PA(i,:) = tdbn(states,TA(i,:),PD);    % distribution of next state for row i
end
PA
GA = ca + SP*d - (SP + BP)*(d - s - a).*(d > s + a)
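The procedure calls tdbn, a utility from Pfeiffer's toolbox that is not listed in this module. Its assumed role: given the list of possible states, the next-state value for each demand value, and the demand probabilities, it returns the induced distribution over the states. A minimal sketch of such a function (an assumption, not the toolbox original):

% file tdbn_sketch.m   (hypothetical stand-in for tdbn)
% p(k) = total demand probability carried to states(k) by the next-state map t
function p = tdbn_sketch(states, t, pd)
  p = zeros(1, length(states));
  for k = 1:length(states)
    p(k) = sum(pd(t == states(k)));
  end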
The calculations
orderdata
reorder
Enter row vector of states  states
Enter row vector A of actions (padded)  A
Enter row vector C of order costs (padded)  C
Enter row vector D of demand values  D
Enter row vector PD of demand probabilities  PD
Enter unit selling price SP  SP
Enter backorder penalty cost BP  BP
PA =
    1.0000         0         0         0         0
    0.6000    0.2000    0.2000         0         0
    0.2000    0.2000    0.2000    0.2000    0.2000
    0.8000    0.2000         0         0         0
    0.4000    0.2000    0.2000    0.2000         0
    0.4000    0.2000    0.2000    0.2000         0
    0.6000    0.2000    0.2000         0         0
    0.2000    0.2000    0.2000    0.2000    0.2000
    0.2000    0.2000    0.2000    0.2000    0.2000
    0.4000    0.2000    0.2000    0.2000         0
    0.4000    0.2000    0.2000    0.2000         0
    0.4000    0.2000    0.2000    0.2000         0
    0.2000    0.2000    0.2000    0.2000    0.2000
    0.2000    0.2000    0.2000    0.2000    0.2000
    0.2000    0.2000    0.2000    0.2000    0.2000
GA =
      0   -40   -80  -120  -160
   -300  -100   100    60    20
   -480  -280   -80   120   320
      0   200   160   120    80
   -300  -100   100   300   260
   -300  -100   100   300   260
      0   200   400   360   320
   -300  -100   100   300   500
   -300  -100   100   300   500
      0   200   400   600   560
      0   200   400   600   560
      0   200   400   600   560
      0   200   400   600   800
      0   200   400   600   800
      0   200   400   600   800
polit Data needed: - - - - - - - - - - - - - - - Enter type number to show gain type type Enter row vector of states states Enter row vector A of possible actions A Enter value of alpha (= 1 for no discounting) 1 Enter matrix PA of transition probabilities PA Enter matrix GA of gains GA Enter row vector PD of demand probabilities PD Index Action Value 1 0 -80 2 2 -44 3 4 -80 4 0 112 5 2 52 6 2 52 7 0 256 8 2 100 9 2 100 10 0 352 11 0 352 12 0 352 13 0 400 14 0 400 15 0 400 Initial policy: action numbers 2 1 1 1 1 Policy: actions 2 0 0 0 0 New policy: action numbers 3 2 2 1 1 Policy: actions 4 2 2 0 0 Long-run distribution 0.2800 0.2000 0.2000 0.2000 0.1200 Test values for selecting new policy Index Action Test Value 1.0000 0 -248.0000 2.0000 2.0000 -168.8000 3.0000 4.0000 -41.6000 4.0000 0 -48.8000 5.0000 2.0000 -5.6000 6.0000 2.0000 -5.6000 7.0000 0 131.2000 8.0000 2.0000 138.4000 9.0000 2.0000 138.4000 10.0000 0 294.4000 11.0000 0 294.4000 12.0000 0 294.4000 13.0000 0 438.4000 14.0000 0 438.4000 15.0000 0 438.4000 Optimum policy State Action Value 0 4.0000 -168.0000 1.0000 2.0000 -132.0000 2.0000 2.0000 12.0000 3.0000 0 168.0000 4.0000 0 312.0000 Long-run expected gain per period G 126.4000
polit Data needed: - - - - - - - - - - - - - - - Enter case number to show gain type type Enter row vector of states states Enter row vector A of possible actions A Enter value of alpha (= 1 for no discounting) 1/1.02 Enter matrix PA of transition probabilities PA Enter matrix GA of gains GA Enter row vector PD of demand probabilities PD Index Action Value 1 0 -80 2 2 -44 3 4 -80 4 0 112 5 2 52 6 2 52 7 0 256 8 2 100 9 2 100 10 0 352 11 0 352 12 0 352 13 0 400 14 0 400 15 0 400 Initial policy: action numbers 2 1 1 1 1 Policy: actions 2 0 0 0 0 New policy: action numbers 3 2 2 1 1 Policy: actions 4 2 2 0 0 Test values for selecting policy Index Action Test Value 1.0e+03 * 0.0010 0 6.0746 0.0020 0.0020 6.1533 0.0030 0.0040 6.2776 0.0040 0 6.2740 0.0050 0.0020 6.3155 0.0060 0.0020 6.3155 0.0070 0 6.4533 0.0080 0.0020 6.4576 0.0090 0.0020 6.4576 0.0100 0 6.6155 0.0110 0 6.6155 0.0120 0 6.6155 0.0130 0 6.7576 0.0140 0 6.7576 0.0150 0 6.7576 Optimum policy State Action Value 1.0e+03 * 0 0.0040 6.2776 0.0010 0.0020 6.3155 0.0020 0.0020 6.4576 0.0030 0 6.6155 0.0040 0 6.7576
dpinit Initialize for finite horizon calculations Matrices A, PA, and GA, padded if necessary Enter type number to show gain type type Enter vector of states states Enter row vector A of possible actions A Enter matrix PA of transition probabilities PA Enter matrix GA of gains GA Enter row vector PD of demand probabilities PD Call for dprog dprog States and expected total gains 0 1 2 3 4 -44 112 256 352 400 States Actions 0 2 1 0 2 0 3 0 4 0 dprog States and expected total gains 0 1.0000 2.0000 3.0000 4.0000 135.2000 178.4000 315.2000 478.4000 615.2000 States Actions 0 4 1 2 2 2 3 0 4 0 dprog States and expected total gains 0 1.0000 2.0000 3.0000 4.0000 264.4800 300.4800 444.4800 600.4800 744.4800 States Actions 0 4 1 2 2 2 3 0 4 0 dprog States and expected total gains 0 1.0000 2.0000 3.0000 4.0000 390.8800 426.8800 570.8800 726.8800 870.8800 States Actions 0 4 1 2 2 2 3 0 4 0 dprog States and expected total gains 0 1.0000 2.0000 3.0000 4.0000 517.2800 553.2800 697.2800 853.2800 997.2800 States Actions 0 4 1 2 2 2 3 0 4 0 dprog States and expected total gains 1.0e+03 * 0 0.0010 0.0020 0.0030 0.0040 0.6437 0.6797 0.8237 0.9797 1.1237 States Actions 0 4 1 2 2 2 3 0 4 0
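The code for dpinit and dprog is not listed in this module. Judging from the transcript, each call of dprog performs one backward-induction step without discounting: with v the column of optimal expected total gains for the weeks already accounted for (initially zero), the new values are the maxima over actions of QA + PA*v, where QA = GA*PD' for type 3 gains. A sketch of that step, under those assumptions:

% One backward-induction step (a sketch of what dprog appears to do).
% v: column of optimal expected total gains for the remaining weeks
%    (start with v = zeros(m,1) before the first call)
T = QA + PA*v;                  % test value for every (state, action) row
vnew = zeros(m,1); anew = zeros(1,m);
for i = 1:m                     % m states, na padded actions per state
  rows = (i-1)*na + (1:na);
  [vnew(i), k] = max(T(rows));  % padded rows duplicate a real row here, so the max is unaffected
  anew(i) = A(rows(k));         % optimal order quantity with i-1 units on hand
end
v = vnew;                       % feed into the next (earlier-week) step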
Decision process with experimentation
There are three types, according to the data provided. In all types, we need the row vector A of actions, the row vector PH with PH(i) = P(H = u_i), the row vector X of test random variable values, and the matrix PXH with PXH(i,j) = P(X = x_j|H = u_i).
Type 1: The usual type. In addition to the above, we need
![]() |
Type 2: The matrix R H = [r(a,i)] is given. L and PYH are not needed. |
Type 3: Sometimes Y = H . In this case R H = L , which we need, in addition to the above. |
The calculation of the optimal decision function d* proceeds as follows:
Risk function (expected loss, given H): r(a,i) = E[L(a,Y)|H = u_i]. MATLAB: RH = L*PYH'
Distribution for the test random variable: P(X = x_j). MATLAB: PX = PH*PXH
Conditional probabilities, by Bayes' rule: P(H = u_i|X = x_j) = P(X = x_j|H = u_i)P(H = u_i)/P(X = x_j). MATLAB: [a,b] = meshgrid(PH,PX); PHX = PXH'.*a./b
Expected risk, given X. MATLAB: RX = RH*PHX'
Select d* from RX: d*(j) is the action a (row number) for minimum expected loss, given X = x_j. Set RD(j) = RX(d*(j),j), the minimum of column j. MATLAB: [RD D] = min(RX)
Calculate the Bayesian risk BD for d*. MATLAB: BD = RD*PX'
Actions are represented in calculations by action number (position in the matrix). In some cases, each action has a value other than its position number. The actual values can be presented in the final display.
% file dec.m
% Version of 12/12/95
disp('Decision process with experimentation')
disp('There are three types, according to the data provided.')
disp('In all types, we need the row vector A of actions,')
disp('the row vector PH with PH(i) = P(H = u_i),')
disp('the row vector X of test random variable values, and')
disp('the matrix PXH with PXH(i,j) = P(X = x_j|H = u_i).')
disp('Type 1. Loss matrix L of L(a,k)')
disp('        Matrix PYH with PYH(i,k) = P(Y = y_k|H = u_i)')
disp('Type 2. Matrix RH of r(a,i) = E[L(a,Y)|H = u_i].')
disp('        L and PYH are not needed for this type.')
disp('Type 3. Y = H, so that only RH = L is needed.')
c   = input('Enter type number ');
A   = input('Enter vector A of actions ');
PH  = input('Enter vector PH of parameter probabilities ');
PXH = input('Enter matrix PXH of conditional probabilities ');
X   = input('Enter vector X of test random variable values ');
s = length(PH);
q = length(X);
if c == 1
  L   = input('Enter loss matrix L ');
  PYH = input('Enter matrix PYH of conditional probabilities ');
  RH  = L*PYH';
elseif c == 2
  RH = input('Enter matrix RH of expected loss, given H ');
else
  L  = input('Enter loss matrix L ');
  RH = L;
end
PX  = PH*PXH;              % (1 x s)(s x q) = (1 x q)
[a,b] = meshgrid(PH,PX);
PHX = PXH'.*a./b;          % (q x s)
RX  = RH*PHX';             % (m x s)(s x q) = (m x q)
[RD D] = min(RX);          % determines min of each col
                           % and row on which min occurs
S  = [X; A(D); RD]';
BD = RD*PX';               % Bayesian risk
h  = ['   Optimum losses and actions'];
sh = ['  Test value   Action   Loss'];
disp(' ')
disp(h)
disp(sh)
disp(S)
disp(' ')
disp(['Bayesian risk B(d*) = ',num2str(BD)])
Example 2.3. General case
% file dec1.m % Data for Problem 22-11 type = 1; A = [10 15]; % Artificial actions list PH = [0.3 0.2 0.5]; % PH(i) = P(H = i) PXH = [0.7 0.2 0.1; % PXH(i,j) = P(X = j|H= i) 0.2 0.6 0.2; 0.1 0.1 0.8]; X = [-1 0 1]; L = [1 0 -2; % L(a,k) = loss when action number is a, outcome is k 3 -1 -4]; PYH = [0.5 0.3 0.2; % PYH(i,k) = P(Y = k|H = i) 0.2 0.5 0.3; 0.1 0.3 0.6]; dec1 dec Decision process with experimentation There are three types, according to the data provided. In all types, we need the row vector A of actions, the row vector PH with PH(i) = P(H = i), the row vector X of test random variable values, and the matrix PXH with PXH(i,j) = P(X = j|H = i). Type 1. Loss matrix L of L(a,k) Matrix PYH with PYH(i,k) = P(Y = k|H = i) Type 2. Matrix RH of r(a,i) = E[L(a,Y)|H = i]. L and PYH are not needed in this case. Type 3. Y = H, so that only RH = L is needed. Enter type number type Enter vector A of actions A Enter vector PH of parameter probabilities PH Enter matrix PXH of conditional probabilities PXH Enter vector X of test random variable values X Enter loss matrix L L Enter matrix PYH of conditional probabilities PYH Optimum losses and actions Test value Action Loss -1.0000 15.0000 -0.2667 0 15.0000 -0.9913 1.0000 15.0000 -2.1106 Bayesian risk B(d*) = -1.3
Intermediate steps in the solution of Example 2.3, to show the results of the various operations
RH
RH =
    0.1000   -0.4000   -1.1000
    0.4000   -1.1000   -2.4000
PX
PX =
    0.3000    0.2300    0.4700
a
a =
    0.3000    0.2000    0.5000
    0.3000    0.2000    0.5000
    0.3000    0.2000    0.5000
b
b =
    0.3000    0.3000    0.3000
    0.2300    0.2300    0.2300
    0.4700    0.4700    0.4700
PHX
PHX =
    0.7000    0.1333    0.1667
    0.2609    0.5217    0.2174
    0.0638    0.0851    0.8511
RX
RX =
   -0.1667   -0.4217   -0.9638
   -0.2667   -0.9913   -2.1106
Example 2.4. RH given
% file dec2.m % Data for type in which RH is given type = 2; A = [1 2]; X = [-1 1 3]; PH = [0.2 0.5 0.3]; PXH = [0.5 0.4 0.1; % PXH(i,j) = P(X = j|H = i) 0.4 0.5 0.1; 0.2 0.4 0.4]; RH = [-10 5 -12; 5 -10 -5]; % r(a,i) = expected loss when % action is a, given H = i dec2 dec Decision process with experimentation ------------------- Instruction lines edited out Enter type number type Enter vector A of actions A Enter vector PH of parameter probabilities PH Enter matrix PXH of conditional probabilities PXH Enter vector X of test random variable values X Enter matrix RH of expected loss, given H RH Optimum losses and actions Test value Action Loss -1.0000 2.0000 -5.0000 1.0000 2.0000 -6.0000 3.0000 1.0000 -7.3158 Bayesian risk B(d*) = -5.89
Example 2.5. Carnival example (the type in which Y = H)
A carnival is scheduled to appear on a given date. Profits to be earned depend heavily on the weather. If rainy, the carnival loses $15 (thousands); if cloudy, the loss is $5 (thousands); if sunny, a profit of $10 (thousands) is expected. If the carnival sets up its equipment, it must give the show; if it decides not to set up, it forfeits $1,000. For an additional cost of $1,000, it can delay setup until the day before the show and get the latest weather report.
Actual weather H = Y is 1 rainy, 2 cloudy, or 3 sunny.
The weather report X has values 1, 2, or 3, corresponding to predictions rainy, cloudy, or sunny respectively.
Reliability of the forecast is expressed in terms of the conditional probabilities P(X = j|H = i); see matrix PXH.
Two actions: 1 set up; 2 no set up.
Possible losses for each action and weather condition are in matrix L. Each entry includes the $1,000 premium for waiting on the forecast: with setup the losses are 16, 6, and −9 (thousands) for rain, clouds, and sun; with no setup the loss is the $1,000 forfeit plus the premium, i.e., 2 (thousands) in every case.
% file dec3,m % Carnival problem type = 3; % Y = H (actual weather) A = [1 2]; % 1: setup 2: no setup X = [1 2 3]; % 1; rain, 2: cloudy, 3: sunny L = [16 6 -9; % L(a,k) = loss when action number is a, outcome is k 2 2 2]; % --with premium for postponing setup PH = 0.1*[1 3 6]; % P(H = i) PXH = 0.1*[7 2 1; % PXH(i,j) = P(X = j|H = i) 2 6 2; 1 2 7]; dec3 dec Decision process with experimentation ------------------- Instruction lines edited out Enter case number case Enter vector A of actions A Enter vector PH of parameter probabilities PH Enter matrix PXH of conditional probabilities PXH Enter vector X of test random variable values X Enter loss matrix L L Optimum losses and actions Test value Action Loss 1.0000 2.0000 2.0000 2.0000 1.0000 1.0000 3.0000 1.0000 -6.6531 Bayesian risk B(d*) = -2.56
In order to provide the background for Matlab procedures for Markov decision processes, we first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. References are to Pfeiffer: PROBABILITY FOR APPLICATIONS.
Some cost and reward patterns
Consider a finite, ergodic (i.e., recurrent, positive, aperiodic, irreducible) homogeneous Markov chain X_N, with state space E = {1, 2, ⋯, M}. To facilitate use of matrix computation programs, we number states from one, except in certain examples in which state zero has a natural meaning. Associated with this chain is a cost or reward structure belonging to one of the three general classes described below. The one-step transition probability matrix is P = [p(i,j)], where p(i,j) = P(X_{n+1} = j|X_n = i). The distribution for X_n is represented by the row matrix π(n) = [p_1(n) p_2(n) ⋯ p_M(n)], where p_i(n) = P(X_n = i). The long-run distribution is represented by the row matrix π = [π_1 π_2 ⋯ π_M].
Type 1. Gain associated with a state. A reward or gain in the amount g(i) is realized in the next period if the current state is i. The gain sequence {G_n : 1 ≤ n} of random variables G_{n+1} = g(X_n) is the sequence of gains realized as the chain X_N evolves. We represent the gain function g by the column matrix g = [g(1) g(2) ⋯ g(M)]^T.
Type 2. One-step transition gains. A reward or gain in the amount g(i,j) = g_ij is realized in period n + 1 if the system goes from state i in period n to state j in period n + 1. The corresponding one-step transition probability is p(i,j) = p_ij. The gain matrix is g = [g(i,j)]. The gain sequence {G_n : 1 ≤ n} of random variables G_{n+1} = g(X_n, X_{n+1}) is the sequence of gains experienced as the chain X_N evolves.
Type 3. Gains associated with a demand. In this case, the gain random variables are of the form G_{n+1} = g(X_n, D_{n+1}). If n represents the present, the random vector U_n = (X_0, X_1, ⋯, X_n) represents the "past" at n of the chain X_N. We suppose the class {D_n : 1 ≤ n} is iid and each D_{n+1} is independent of the past U_n. Again, the gain sequence {G_n : 1 ≤ n} represents the gains realized as the process evolves.
Standard results on Markov chains show that in each case the sequence of pairs (X_n, G_n) is Markov.
A recurrence relation. We are interested in the expected gains in future stages, given the present state of the process. For any n > 0 and any m > 1, the gain G_n(m) in the m periods following period n is given by
G_n(m) = G_{n+1} + G_{n+2} + ⋯ + G_{n+m}
If the process is in state i, the expected gain in the next period is
q_i = E[G_{n+1}|X_n = i]
and the expected gain in the next m periods is
v_i(m) = E[G_n(m)|X_n = i]
We utilize a recursion relation that allows us to begin with the transition matrix P and the next-period expected gain matrix q and calculate the v_i(m) for any m > 1. Although the analysis is somewhat different for the various gain structures, the result exhibits a common pattern. In each case
v_i(m) = E[G_{n+1}|X_n = i] + Σ_j p(i,j) E[G_{n+1}(m − 1)|X_{n+1} = j]
We thus have the fundamental recursion relation
(*)  v_i(m) = q_i + Σ_j p(i,j) v_j(m − 1),  with v_j(1) = q_j
The recursion relation (*) shows that the transition matrix P = [p(i,j)] and the vector of next-period expected gains, which we represent by the column matrix q = [q_1 q_2 ⋯ q_M]^T, determine the v_i(m). The difference between the cases is the manner in which the q_i are obtained.
Type 1: q_i = E[g(X_n)|X_n = i] = g(i)
Type 2: q_i = E[g(X_n, X_{n+1})|X_n = i] = Σ_j p(i,j) g(i,j)
Type 3: q_i = E[g(X_n, D_{n+1})|X_n = i] = Σ_k g(i,k) P(D = k)
Matrix form. For computational purposes, we utilize these relations in matrix form. If
v(m) = [v_1(m) v_2(m) ⋯ v_M(m)]^T  and  q = [q_1 q_2 ⋯ q_M]^T    (2.9)
then
v(m) = q + Pv(m − 1) = q + Pq + P^2 q + ⋯ + P^{m−1} q,  with v(0) = 0    (2.10)
Examination of the expressions for the q_i above shows that the next-period expected gain matrix q takes the following forms. In each case, g is the gain matrix. For type 3, p_D is the column matrix whose elements are P(D = k).
Type 1: q = g
Type 2: q is the column matrix with i-th element Σ_j p(i,j) g(i,j) (in MATLAB, q = diag(P*g'))
Type 3: q = g p_D
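A compact MATLAB sketch of the recursion (2.10), using a hypothetical type 2 gain matrix (these values are illustrative, not taken from the module):

% Expected gains over an m-period horizon, v(m) = q + P*v(m-1) -- a sketch
P = [0.6 0.4; 0.3 0.7];          % hypothetical transition matrix
g = [2 -1; 0 3];                 % hypothetical type 2 gain matrix
q = diag(P*g');                  % next-period expected gains q_i = sum_j p(i,j)g(i,j)
nper = 5;                        % number of periods in the horizon
v = zeros(size(q));
for k = 1:nper
  v = q + P*v;                   % after the loop, v holds v(nper)
end
v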
Some long-run averages
Consider the average expected gain for m periods, (1/m)v_i(m). Use of the Markov property and the fact that for an ergodic chain P^n → P_0, where each row of P_0 is the long-run distribution π, yields the result that, for any state i,
(1/m)v_i(m) → g  as m → ∞
The expression for g may be put into matrix form. If the long-run probability distribution is represented by the row matrix π = [π_1 π_2 ⋯ π_M] and q = [q_1 q_2 ⋯ q_M]^T, then
g = πq = Σ_i π_i q_i
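In MATLAB the long-run distribution may be approximated by a high power of P, as the procedures below do with P0 = P^n; a sketch using the same hypothetical P and q as in the previous sketch:

% Long-run distribution and gain per period -- a sketch with hypothetical data
P = [0.6 0.4; 0.3 0.7];
q = [0.8; 2.1];
P0 = P^16;             % rows converge to the long-run distribution
pi0 = P0(1,:);         % any row approximates pi
G = pi0*q              % long-run expected gain per period, g = pi*q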
Discounting and potentials
Suppose in a given time interval the value of money increases by a fraction a. This may represent the potential earning if the money were invested. One dollar now is worth 1 + a dollars at the end of one period. It is worth (1 + a)^n dollars after n periods. Set α = 1/(1 + a), so that 0 < α ≤ 1. It takes α dollars to be worth one dollar after one period. It takes α^n dollars at present to be worth one dollar n periods later. Thus the present worth or discounted value of one dollar n periods in the future is α^n dollars. This is the amount one would need to invest presently, at interest rate a per unit of time, to have one dollar n periods later.
If f is any function defined on the state space E, we designate by f the column matrix [f(1) f(2) ⋯ f(M)]^T. We make an exception to this convention in the case of the distributions of the state probabilities π(n) = [p_1(n) p_2(n) ⋯ p_M(n)], their generating function, and the long-run probabilities π = [π_1 π_2 ⋯ π_M], which we represent as row matrices. It should be clear that much of the following development extends immediately to infinite state space E = N = {0, 1, 2, ⋯}.
We assume one of the gain structures introduced above and the corresponding gain sequence {G_n : 1 ≤ n}. The value of the random variable G_{n+1} is the gain or reward realized during the period n + 1. We let each transition time be at the end of one period of time (say a month or a quarter). If n corresponds to the present, then G_{n+k} is the gain in period n + k. If we do not discount for the period immediately after the present, then α^{k−1} G_{n+k} is the present value of that gain. Hence the total discounted gain from period n on is
Σ_{k=1}^∞ α^{k−1} G_{n+k}
Definition. The α-potential of the gain sequence {G_n : 1 ≤ n} is the function φ defined by
φ(i) = E[ Σ_{k=1}^∞ α^{k−1} G_{n+k} | X_n = i ]
Remark. φ(i) is the expected total discounted gain, given the system starts in state i. We next define a concept which is related to the α-potential in a useful way.
Definition. The α-potential matrix R_α for the process X_N is the matrix
R_α = Σ_{n=0}^∞ α^n P^n
Theorem 2.1.
Let X_N be an ergodic, homogeneous Markov chain with state space E and gain sequence {G_n : 1 ≤ n}. Let q = [q_1 q_2 ⋯ q_M]^T, where q_i = E[G_{n+1}|X_n = i]. For α ∈ (0,1), let φ be the α-potential for the gain sequence. That is,
φ(i) = E[ Σ_{k=1}^∞ α^{k−1} G_{n+k} | X_n = i ]
Then, if R_α is the α-potential matrix for X_N, we have
φ = R_α q
— □
Theorem 2.2.
Consider an ergodic, homogeneous Markov chain X_N with gain sequence {G_n : 1 ≤ n} and next-period expected gain matrix q. For α ∈ (0,1), the α-potential φ and the α-potential matrix R_α satisfy
[I − αP] R_α = I  and  [I − αP] φ = q
If the state space E is finite, then
R_α = [I − αP]^{−1}  and  φ = [I − αP]^{−1} q = R_α q
— □
Example 2.6. A numerical example
Suppose the transition matrix P and the next-period expected gain matrix
q are
For
α = 0.9, we have
— □
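The computation of φ is a one-line matrix solve. A sketch with hypothetical P and q (not the values used in Example 2.6):

% alpha-potential phi = (I - alpha*P)\q -- a sketch with hypothetical data
P = [0.6 0.4; 0.3 0.7];          % hypothetical transition matrix
q = [0.8; 2.1];                  % hypothetical next-period expected gains
alpha = 0.9;
Ralpha = inv(eye(2) - alpha*P);  % alpha-potential matrix
phi = Ralpha*q                   % equivalently, (eye(2) - alpha*P)\q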
The next example is of type 3, but the demand random variable is Poisson, which does not have finite range. The simple pattern for calculating the q_i must be modified.
Example 2.7. Costs associated with inventory under an (m,M) order policy.
If k units are ordered, the cost in dollars is
For each unit of unsatisfied demand, there is a penalty of 50 dollars. We use
the
inventory policy described in Ex 23.1.3.
X_0 = M, and X_n is the stock at the end of period n, before restocking. Demand in period n is D_n. The cost for restocking at the end of period n + 1 is
For
m = 1,M = 3, we have
We take
m = 1,M = 3 and assume D is Poisson (1).
From Ex 23.1.3 in PA, we obtain
The largest eigenvalue |λ| ≈ 0.26, so n ≥ 10 should be sufficient for convergence. We use n = 16.
Taking any row of P^16, we approximate the long-run distribution by
We thus have
For the Poisson distribution, we find that
q_3 = q_0 − 85 = 1.1668
so that
Then, for α = 0.8 we have
Recall that we are dealing with expected discounted costs. Since we usually consider starting with a full stock, the amount φ(M) = φ(3) = 100.68 is the quantity of interest. Note that this is smaller than other values of φ(j), which correspond to smaller beginning stock levels. We expect greater restocking costs in such cases.
Evolution of chains with costs and rewards
No discounting. A generating function analysis of the recursion (*) shows that, for large m,
v(m) ≈ mg1 + v,  where v = B_0 q
Here g = πq, 1 is a column matrix of ones, P_0 = lim_n P^n (each row of P_0 is the long-run distribution π), and B_0 is a matrix which the analysis also shows we may approximate by
B_0 ≈ I + P + P^2 + ⋯ + P^s − (s + 1)P_0
The largest |λ i | < 1 gives an indication of how many terms are needed.
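These approximations are exactly what the procedure newpol.m (below) computes. A minimal sketch, assuming P (transition matrix) and q (next-period expected gains) for the chain are in the workspace:

% Approximate B0 and the constant term v = B0*q -- a sketch
s  = 6;                     % number of powers of P used
P0 = P^16;                  % approximation to the limit matrix
B  = eye(size(P));
for j = 1:s
  B = B + P^j;              % B = I + P + P^2 + ... + P^s
end
B0 = B - (s+1)*P0;
v  = B0*q;                  % constant term in v(m) = m*g*1 + v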
Example 2.8. The inventory problem (continued)
We consider again the inventory problem. We have
The eigenvalues are 1, 0.0920 + 0.2434i, 0.0920 − 0.2434i, and 0. Thus, the decay of the transients is controlled by λ* = 0.0920 + 0.2434i, with |λ*| ≈ 0.26. Since |λ*|^4 ≈ 0.0046, the transients die out rather rapidly. We obtain the following approximations
The approximate value of B0 is found to be
The value g = πq ≈ 28.80 is found in the earlier treatment. From the values above, we have
The average gain per period is clearly g ≈ 28.80. This soon dominates the constant terms represented by the entries in v.
With discounting
Let the discount factor be α ∈ (0,1). If n represents the present period, then G_{n+1} is the reward in the first succeeding period and G_{n+k} is the reward in the kth succeeding period. If we do not discount the first period, then the present value of the gains in the next m periods is
G_{n+1} + α G_{n+2} + ⋯ + α^{m−1} G_{n+m}
Thus
v_i(m) = q_i + α Σ_j p(i,j) v_j(m − 1)
In matrix form, this is
v(m) = q + αPv(m − 1)
A generating function analysis shows that v(m) → v as m → ∞. Hence, the steady state part of v_i(m) satisfies
v_i = q_i + α Σ_j p(i,j) v_j
In matrix form,
v = q + αPv,  so that  v = [I − αP]^{−1} q
Since the p(i,j) and q_i are known, we can solve for the v_i. Also, since v_i(m) is the present value of the expected gain m steps into the future, the v_i represent the present value for an unlimited future, given that the process starts in state i.
Stationary decision policies
We suppose throughout this section that X_N is ergodic, with finite state space E = {1, 2, ⋯, M}. For each state i ∈ E, there is a class A_i of possible actions which may be taken when the process is in state i. A decision policy is a sequence of decision functions d_1, d_2, ⋯ such that the action taken at stage n is d_n(X_0, X_1, ⋯, X_n); the action selected is in A_i whenever X_n = i. We consider a special class of decision policies.
Stationary decision policy. The action taken depends only upon the current state: the action is d(i) whenever X_n = i, for a single decision function d. Analysis shows the Markov character of the process is preserved under stationary policies. Also, if E is finite and every stationary policy produces an irreducible chain, then a stationary policy which is optimal among stationary policies is optimal among all policies.
We suppose each policy yields an ergodic chain. Since the transition
probabilities from any state depend in part on the action taken when in
that state, the long-run probabilities will depend upon the policy.
In the no-discounting case, we want to determine a stationary policy that maximizes
the gain
g = πq
for the corresponding long-run probabilities.
We say “a stationary policy,” since we do not rule out the possibility
there may be more than one policy which maximizes the gain.
In the discounting case, the goal is to maximize the steady-state part v_i in the expressions for the v_i(m), that is, to maximize the present value for an unlimited future.
In simple cases, with small state space and a small number of possible actions in each state, it may be feasible to obtain the gain or present values for each permissible policy and then select the optimum by direct comparison. However, for most cases, this is not an efficient approach. In the next two sections, we consider iterative procedures for step-by-step optimization which usually greatly reduces the computational burden.
Policy iteration method– no discounting
We develop an iterative procedure for finding an optimal
stationary policy. The goal is to determine a stationary policy that
maximizes the long run expected gain per period g. In the next section,
we extend this procedure to the
case where discounting is in effect. We assume that each policy yields an
ergodic chain. Suppose we have established that under a given policy
(1)  v_i(n + 1) = q_i(a_i) + Σ_j p_ij(a_i) v_j(n),  where q_i(a_i) and p_ij(a_i) are the expected next-period gain and transition probabilities for the action a_i taken in state i.
In this case, the analysis above shows that for large n,
(2)  v_i(n) ≈ ng + v_i,  where v = B_0 q and g = πq.
We note that the v_i and g depend upon the entire policy, while the q_i(a_i) and p_ij(a_i) depend on the individual actions a_i. Using (1) and (2), we obtain
(n + 1)g + v_i ≈ q_i + Σ_j p_ij(ng + v_j) = q_i + ng + Σ_j p_ij v_j
From this we conclude
(3)  g + v_i = q_i + Σ_j p_ij v_j  for all i ∈ E
Suppose a policy d has been used. That is, action
d(i) is
taken whenever the process is in state i. To simplify writing, we drop
the indication of the action and simply write p_ij for p_ij(d(i)) and q_i for q_i(d(i)), etc. Associated with this policy, there is a gain g. We should
like to determine whether or not this is the maximum possible gain, and if
it is not, to find a policy which does give the maximum gain.
The following two-phase procedure has been found to be effective. It is essentially
the procedure developed originally by Ronald Howard, who pioneered the treatment
of these problems. The new feature is the method of carrying out the value-determination
step, utilizing the approximation method noted above.
Value-determination procedure
We calculate g = πq. As in the preceding section, we calculate
v = B_0 q ≈ [I + P + P^2 + ⋯ + P^s − (s + 1)P_0] q
Policy-improvement procedure
We suppose policy d has been used through period n. Then, by (3), above,
g + v_i = q_i + Σ_j p_ij v_j  for all i ∈ E
We seek to improve policy d by selecting policy d*, with d*(i) = a_i*, to satisfy
q_i(a_i*) + Σ_j p_ij(a_i*) v_j = max_{a ∈ A_i} [ q_i(a) + Σ_j p_ij(a) v_j ]
Remarks
In the procedure for selecting d*, we use the "old" v_j.
It has been established that the gain g* for the new policy satisfies g* ≥ g, with equality iff g is optimal.
Since there is a finite number of policies, the procedure
must converge to an optimum stationary policy in a finite number of steps.
We implement this two-step procedure with Matlab. To demonstrate the procedure, we consider the following
Example 2.9. A numerical example
A Markov decision process has three states: state space E = {1, 2, 3}.
Actions: State 1: A_1 = {1, 2, 3}; State 2: A_2 = {1, 2}; State 3: A_3 = {1, 2}.
Transition probabilities and rewards are:
p_1j(1): [1/3 1/3 1/3] | g_1j(1): [1 3 4]
p_1j(2): [1/4 3/8 3/8] | g_1j(2): [2 2 3]
p_1j(3): [1/3 1/3 1/3] | g_1j(3): [2 2 3]
p_2j(1): [1/8 3/8 1/2] | g_2j(1): [2 1 2]
p_2j(2): [1/2 1/4 1/4] | g_2j(2): [1 4 4]
p_3j(1): [3/8 1/4 3/8] | g_3j(1): [2 3 3]
p_3j(2): [1/8 1/4 5/8] | g_3j(2): [3 2 2]
Use the policy iteration method to determine the policy which gives the maximum gain g.
A computational procedure utilizing Matlab
We first put the data in an m-file. Since we have several gain types, it is expedient to include the type number. This example belongs to type 2. Data in file md61.m
type = 2
PA = [1/3 1/3 1/3;
      1/4 3/8 3/8;
      1/3 1/3 1/3;
      0   0   0;
      1/8 3/8 1/2;
      1/2 1/4 1/4;
      0   0   0;
      3/8 1/4 3/8;
      1/8 1/4 5/8]
GA = [1 3 4;
      2 2 3;
      2 2 3;
      0 0 0;    % Zero rows are separators
      2 1 2;
      1 4 4;
      0 0 0;
      2 3 3;
      3 2 2]
A = [1 2 3 0 1 2 0 1 2]
md61 type = 2 PA = 0.3333 0.3333 0.3333 0.2500 0.3750 0.3750 0.3333 0.3333 0.3333 0 0 0 0.1250 0.3750 0.5000 0.5000 0.2500 0.2500 0 0 0 0.3750 0.2500 0.3750 0.1250 0.2500 0.6250 GA = 1 3 4 2 2 3 2 2 3 0 0 0 2 1 2 1 4 4 0 0 0 2 3 3 3 2 2 A = 1 2 3 0 1 2 0 1 2
Once the data are entered into Matlab by the call for file “md61,” we make preparation for the policy-improvement step. The procedure is in the file newpolprep.m
% file newpolprep.m
% version of 3/23/92
disp('Data needed:')
disp('  Matrix PA of transition probabilities for states and actions')
disp('  Matrix GA of gains for states and actions')
disp('  Type number to identify GA matrix types')
disp('    Type 1. Gains associated with a state')
disp('    Type 2. One-step transition gains')
disp('    Type 3. Gains associated with a demand')
disp('  Row vector of actions')
disp('  Value of alpha (= 1 for no discounting)')
c = input('Enter type number to show gain type ');
a = input('Enter value of alpha (= 1 for no discounting) ');
PA = input('Enter matrix PA of transition probabilities ');
GA = input('Enter matrix GA of gains ');
if c == 3
  PD = input('Enter row vector of demand probabilities ');
end
if c == 1
  QA = GA';
elseif c == 2
  QA = diag(PA*GA');        % (qx1)
else
  QA = GA*PD';
end
A = input('Enter row vector A of possible actions ');   % (1xq)
m = length(PA(1,:));
q = length(A);
n = input('Enter n, the power of P to approximate P0 ');
s = input('Enter s, the power of P in the approximation of V ');
QD = [(1:q)' A' QA];   % First col is index; second is A; third is QA
DIS = ['  Index   Action   Value'];
disp(DIS)
disp(QD)
if a == 1
  call = 'Call for newpol';
else
  call = 'Call for newpold';
end
disp(call)
newpolprep % Call for preparatory program in file npolprep.m Data needed: Matrix PA of transition probabilities for states and actions Matrix GA of gains for states and actions Type number to identify GA matrix types Type 1. Gains associated with a state Type 2. One-step transition gains Type 3. Gains associated with a demand Row vector of actions Value of alpha (= 1 for no discounting) Enter type number to show gain type 2 Enter value of alpha (=1 for no discounting) 1 Enter matrix PA of transition probabilities PA Enter matrix GA of gains GA Enter row vector A of possible actions A Enter n, the power of P to approximate P0 16 Enter s, the power of P in the approximation of V 6 Index Action Value 1.0000 1.0000 2.6667 2.0000 2.0000 2.3750 3.0000 3.0000 2.3333 4.0000 0 0 5.0000 1.0000 1.6250 6.0000 2.0000 2.5000 7.0000 0 0 8.0000 1.0000 2.6250 9.0000 2.0000 2.1250 Call for newpol
The procedure has prompted for the procedure newpol (in file newpol.m)
% file: newpol.m (without discounting)
% version of 3/23/92
d = input('Enter policy as row matrix of indices ');
D = A(d);                    % policy in terms of actions
P = PA(d',:);                % selects probabilities for policy
Q = QA(d',:);                % selects next-period gains for policy
P0 = P^n;                    % Display to check convergence
PI = P0(1,:);                % Long-run distribution
G = PI*Q                     % Long-run expected gain per period
C = P + eye(size(P));        % written eye(P) in the original (pre-version-5) listing
for j = 2:s
  C = C + P^j;               % C = I + P + P^2 + ... + P^s
end
V = (C - (s+1)*P0)*Q;        % V = B0*Q
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('Policy in terms of indices')
disp(d)
disp('Policy in terms of actions')
disp(D)
disp('Values for the policy selected')
disp(V)
disp('Long-run expected gain per period G')
disp(G)
T = [(1:q)' A' (QA + PA*V)];   % Test values for determining new policy
DIS =['  Index   Action   Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpol')
disp('--If not, use policy, values V, and gain G, above')
newpol Enter policy as row matrix of indices [2 6 9] % A deliberately poor choice Approximation to P0; rows are long-run dbn 0.2642 0.2830 0.4528 0.2642 0.2830 0.4528 0.2642 0.2830 0.4528 Policy in terms of indices 2 6 9 Policy in terms of actions 2 2 2
Long-run expected gain per period G 2.2972 Index Action Test Value 1.0000 1.0000 2.7171 2.0000 2.0000 2.4168 3.0000 3.0000 2.3838 4.0000 0 0 5.0000 1.0000 1.6220 6.0000 2.0000 2.5677 7.0000 0 0 8.0000 1.0000 2.6479 9.0000 2.0000 2.0583 Check current policy against new test values. --If new policy needed, call for newpol --If not, use policy and gain G, above % New policy is needed newpol Enter policy as row matrix of indices [1 6 8] Approximation to P0; rows are long-run dbn 0.3939 0.2828 0.3232 0.3939 0.2828 0.3232 0.3939 0.2828 0.3232 Policy in terms of indices 1 6 8 Policy in terms of actions 1 2 1 Values for selected policy 0.0526 -0.0989 0.0223 Long-run expected gain per period G 2.6061 Index Action Test Value 1.0000 1.0000 2.6587 2.0000 2.0000 2.3595 3.0000 3.0000 2.3254 4.0000 0 0 5.0000 1.0000 1.6057 6.0000 2.0000 2.5072 7.0000 0 0 8.0000 1.0000 2.6284 9.0000 2.0000 2.1208 Check current policy against new test values. --If new policy needed, call for newpol --If not, use policy, values V, and gain G, above
Since the policy selected on this iteration is the same as the
previous one, the procedure has converged to an optimal policy.
The first of the first three rows is maximum; the second of the two rows for state 2 (indices 5 and 6) is maximum; and the first of the two rows for state 3 (indices 8 and 9) is maximum. Thus, we have selected rows 1, 6, 8, corresponding to the optimal policy d, with actions (1, 2, 1). The expected long-run gain per period is g = 2.6061. The value matrix is
V = [0.0526  −0.0989  0.0223]^T
We made a somewhat arbitrary choice of the powers of P used in the approximations of P_0 and B_0. As noted in the development of the approximation procedures above, the convergence of P^n is governed by the magnitude of the largest eigenvalue other than one. We can always get a check on the reliability of the calculations by checking the eigenvalues of P corresponding to the presumed optimal policy. For the choice above, we find the eigenvalues to be 1, −0.1250, and 0.0833. Since 0.1250^4 ≈ 0.0002, the choices of exponents should be quite satisfactory. In fact, we could probably use P^8 as a satisfactory approximation to P_0, if that were desirable. The margin allows for the possibility that for some policies the eigenvalues may not be so small. — □
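The interactive calls above can also be collected into a single loop. A sketch of such a wrapper (not part of the original toolbox), assuming the padded PA, GA, A of md61.m are loaded:

% Non-interactive policy iteration without discounting -- a sketch
QA = diag(PA*GA');                  % type 2 next-period expected gains
m = length(PA(1,:)); q = length(A); na = q/m;
n = 16; s = 6;
d = [1 5 8];                        % initial policy (row indices into PA, GA)
for iter = 1:20
  P  = PA(d',:);  Q = QA(d',:);
  P0 = P^n;  PI = P0(1,:);
  G  = PI*Q;                        % long-run gain for the current policy
  C  = eye(size(P));
  for j = 1:s
    C = C + P^j;
  end
  V = (C - (s+1)*P0)*Q;             % relative values
  T = QA + PA*V;                    % test values for every (state, action) row
  dnew = zeros(1,m);
  for i = 1:m
    rows  = (i-1)*na + (1:na);
    valid = find(sum(PA(rows,:),2) > 0);   % skip zero padding rows
    [tmax, k] = max(T(rows(valid)));
    dnew(i) = rows(valid(k));
  end
  if isequal(dnew, d), break; end   % converged: policy repeats
  d = dnew;
end
d, A(d), G, V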
Policy iteration– with discounting
It turns out that the policy iteration procedure is simpler in the case of discounting, as suggested by the evolution analysis above. We have the following two-phase procedure, based on that analysis.
Value-determination procedure. Given the q_i and transition probabilities p(i,j) for the current policy, solve v = q + αPv, that is, v = [I − αP]^{−1} q, to get the corresponding v_i.
Policy-improvement procedure
Given the v_i for the current policy, determine a new policy d*, with each d*(i) = a_i* determined as that action for which
q_i(a) + α Σ_j p_ij(a) v_j
is a maximum.
We illustrate the Matlab procedure with the same numerical example as in the previous case, using discount factor α = 0.8 (the variable a in the procedures). The data file is the same, so we call for it, as before.
Example 2.10.
Assume data in file md61.m is in Matlab; if not, call for md61. We use the procedure newpolprep to prepare for the iterative procedure by making some initial choices.
newpolprep Data needed: Matrix PA of transition probabilities for states and actions Matrix GA of gains for states and actions Type number to identify GA matrix types Type 1. Gains associated with a state Type 2. One-step transition gains Type 3. Gains associated with a demand Row vector of actions Value of alpha (= 1 for no discounting)
Enter type number to show gain type 2 Enter value of alpha (= 1 for no discounting) 0.8 Enter matrix PA of transition probabilities PA Enter matrix GA of gains GA Enter row vector A of possible actions A Enter n, the power of P to approximate P0 16 Enter s, the power of P in the approximation of V 6 Index Action Test Value 1.0000 1.0000 2.6667 2.0000 2.0000 2.3750 3.0000 3.0000 2.3333 4.0000 0 0 5.0000 1.0000 1.6250 6.0000 2.0000 2.5000 7.0000 0 0 8.0000 1.0000 2.6250 9.0000 2.0000 2.1250 Call for newpold
The procedure for policy iteration is in the file newpold.m.
% file newpold.m (with discounting)
% version of 3/23/92
d = input('Enter policy as row matrix of indices ');
D = A(d);
P = PA(d',:);               % transition matrix for policy selected
Q = QA(d',:);               % average next-period gain for policy selected
P0 = P^n;                   % approximation to the long-run matrix (not in the original
                            % listing, but assumed for the display below; n from newpolprep)
V = (eye(size(P)) - a*P)\Q;     % Present values for unlimited future
T = [(1:q)' A' (QA + a*PA*V)];  % Test values for determining new policy
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('  Policy in terms of indices')
disp(d)
disp('  Policy in terms of actions')
disp(D)
disp('  Values for policy')
disp(V)
DIS =['  Index   Action   Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpold')
disp('--If not, use policy and values above')
newpold Enter policy as row matrix of indices [3 5 9] % Deliberately poor choice Approximation to P0; rows are long-run dbn 0.3939 0.2828 0.3232 0.3939 0.2828 0.3232 0.3939 0.2828 0.3232 Policy in terms of indices 3 5 9 Policy in terms of actions 3 1 2 Values for policy 10.3778 9.6167 10.1722 Index Action Test Value 1.0000 1.0000 10.7111 2.0000 2.0000 10.3872 3.0000 3.0000 10.3778 4.0000 0 0 5.0000 1.0000 9.6167 6.0000 2.0000 10.6089 7.0000 0 0 8.0000 1.0000 10.7133 9.0000 2.0000 10.1722 Check current policy against new test values. --If new policy needed, call for newpold --If not, use policy and values above newpold Enter policy as row matrix of indices [1 6 8] Approximation to P0; rows are long-run dbn 0.3939 0.2828 0.3232 0.3939 0.2828 0.3232 0.3939 0.2828 0.3232 Policy in terms of indices 1 6 8 Policy in terms of actions 1 2 1 Values for policy 13.0844 12.9302 13.0519
Index Action Test Value 1.0000 1.0000 13.0844 2.0000 2.0000 12.7865 3.0000 3.0000 12.7511 4.0000 0 0 5.0000 1.0000 12.0333 6.0000 2.0000 12.9302 7.0000 0 0 8.0000 1.0000 13.0519 9.0000 2.0000 12.5455 Check current policy against new test values. --If new policy needed, call for newpold --If not, use policy and values above
Since the policy indicated is the same as the previous policy, we know this is an optimal policy. Rows 1, 6, 8 indicate the optimal policy to be d*(1) = 1, d*(2) = 2, d*(3) = 1 (actions 1, 2, 1). It turns out in this example that the optimal policies are the same for the discounted and undiscounted cases. That is not always true.
The v matrix gives the present values for unlimited
futures, for each of the three possible starting states.
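As with the undiscounted case, the interactive steps can be folded into a loop. A sketch under the same assumptions as the earlier wrapper (PA, GA, A loaded; QA, m, na computed as there):

% Non-interactive policy iteration with discounting -- a sketch
a = 0.8;  d = [1 5 8];              % discount factor and initial policy
for iter = 1:20
  P = PA(d',:);  Q = QA(d',:);
  V = (eye(size(P)) - a*P)\Q;       % present values for the current policy
  T = QA + a*PA*V;                  % test values
  dnew = zeros(1,m);
  for i = 1:m
    rows  = (i-1)*na + (1:na);
    valid = find(sum(PA(rows,:),2) > 0);   % skip zero padding rows
    [tmax, k] = max(T(rows(valid)));
    dnew(i) = rows(valid(k));
  end
  if isequal(dnew, d), break; end
  d = dnew;
end
d, A(d), V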