Markov Decision Processes:
A Survey II
Cheng-Ta Lee
March 27, 2006
Outline
 Introduction
 Markov Theory
 Markov Decision Processes
 Conclusion
 Future Work in Sensor Networks
Introduction
Decision Theory
 Probability Theory: describes what an agent should believe based on evidence.
+
 Utility Theory: describes what an agent wants.
=
 Decision Theory: describes what an agent should do.
Introduction
 Markov decision process (MDP) theory has developed substantially in the last three decades and has become an established topic within operational research.
 Modeling of an (infinite) sequence of recurring decision problems (general behavioral strategies)
 MDPs defined by
 Objective functions
 Utility function
 Revenue
 Cost
 Policies
 Set of decisions
 Dynamic (MDPs)
 Static
Outline
 Introduction
 Markov Theory
 Markov Decision Processes
 Conclusion
 Future Work in Sensor Networks
Markov Theory
 Markov process
A mathematical model that is useful in the study of complex systems.
The basic concepts of the Markov process are the "state" of a system and the state "transition".
A graphic example of a Markov process is presented by a frog in a lily pond.
State transition system
 Discrete-time process
 Continuous-time process
Markov Theory
 To study the discrete-time process, suppose that there are N states in the system, numbered from 1 to N. If the system is a simple Markov process, then the probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i. (Memoryless)
 In other words, we may specify a set of conditional probabilities p_ij satisfying

   Σ_{j=1}^{N} p_ij = 1,   where 0 ≤ p_ij ≤ 1
The Toymaker Example
 First state: the toy is in great favor.
 Second state: the toy is out of favor.
 Matrix form:

   P = [p_ij] = [0.5 0.5; 0.4 0.6]

 Transition diagram
The Toymaker Example
 Let π_i(n) be the probability that the system will occupy state i after n transitions, given that its state at n = 0 is known. It follows that

   Σ_{i=1}^{N} π_i(n) = 1

   π_j(n+1) = Σ_{i=1}^{N} π_i(n) p_ij,   n = 0, 1, 2, ...

 In vector form π(n+1) = π(n)P, so that

   π(1) = π(0)P
   π(2) = π(1)P = π(0)P^2
   π(3) = π(2)P = π(0)P^3
   π(n) = π(0)P^n,   n = 0, 1, 2, ...
The Toymaker Example
 If the toymaker starts with a successful toy, then π_1(0) = 1 and π_2(0) = 0, so that π(0) = [1 0].

   π(1) = π(0)P = [1 0][0.5 0.5; 0.4 0.6] = [0.5 0.5]

   π(2) = π(1)P = [0.5 0.5][0.5 0.5; 0.4 0.6] = [9/20 11/20]

   π(3) = π(2)P = [9/20 11/20][0.5 0.5; 0.4 0.6] = [89/200 111/200]
The Toymaker Example
 Table 1.1. Successive State Probabilities of Toymaker Starting with a Successful Toy

   n =      0    1     2     3      4       5        ...
   π_1(n)   1    0.5   0.45  0.445  0.4445  0.44445  ...
   π_2(n)   0    0.5   0.55  0.555  0.5555  0.55555  ...

 Table 1.2. Successive State Probabilities of Toymaker Starting without a Successful Toy

   n =      0    1     2     3      4       5        ...
   π_1(n)   0    0.4   0.44  0.444  0.4444  0.44444  ...
   π_2(n)   1    0.6   0.56  0.556  0.5556  0.55556  ...
The Toymaker Example
 The limiting state-probability row vector π, with components π_i, is the limit of π(n) as n approaches infinity. It satisfies

   π = πP,   Σ_{i=1}^{N} π_i = 1

 For the toymaker:

   π_1 = 0.5π_1 + 0.4π_2
   π_2 = 0.5π_1 + 0.6π_2
   π_1 + π_2 = 1

 so that π_1 = 4/9 and π_2 = 5/9.
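Numerical check (not part of the original slides): a minimal Python/NumPy sketch that reproduces Table 1.1 by iterating π(n+1) = π(n)P and then solves for the limiting probabilities [4/9, 5/9].

    import numpy as np

    # Toymaker transition matrix from the slides
    P = np.array([[0.5, 0.5],
                  [0.4, 0.6]])

    pi = np.array([1.0, 0.0])          # pi(0) = [1 0], start with a successful toy
    for n in range(6):
        print(n, pi)                   # reproduces Table 1.1
        pi = pi @ P                    # pi(n+1) = pi(n) P

    # Limiting probabilities: solve pi = pi P together with sum(pi) = 1
    A = np.vstack([P.T - np.eye(2), np.ones(2)])
    b = np.array([0.0, 0.0, 1.0])
    pi_inf, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(pi_inf)                      # approximately [0.4444, 0.5556] = [4/9, 5/9]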
z-Transformation
 For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-transform.
 Consider a time function f(n) that takes on arbitrary values f(0), f(1), f(2), and so on, at nonnegative, discrete, integrally spaced points of time and that is zero for negative time.
 Such a time function is shown in Fig. 2.4.

   Fig. 2.4. An arbitrary discrete-time function
z-Transformation
 The z-transform F(z) is defined such that

   F(z) = Σ_{n=0}^{∞} f(n) z^n

 Table 1.3. z-Transform Pairs

   Time function for n >= 0          z-transform
   f(n)                              F(z)
   f1(n) + f2(n)                     F1(z) + F2(z)
   kf(n) (k a constant)              kF(z)
   f(n-1)                            zF(z)
   f(n+1)                            z^{-1}[F(z) - f(0)]
   α^n                               1/(1 - αz)
   1 (unit step)                     1/(1 - z)
   α^n f(n)                          F(αz)
   n (unit ramp)                     z/(1 - z)^2
   n α^n                             αz/(1 - αz)^2
z-Transformation
 Consider first the step function

   f(n) = 1 for n = 0, 1, 2, 3, ...;   f(n) = 0 for n < 0

 Its z-transform is

   F(z) = Σ_{n=0}^{∞} f(n) z^n = 1 + z + z^2 + z^3 + ...   or   F(z) = 1/(1 - z)

 For the geometric sequence f(n) = α^n, n >= 0,

   F(z) = Σ_{n=0}^{∞} f(n) z^n = Σ_{n=0}^{∞} (αz)^n   or   F(z) = 1/(1 - αz)
z-Transformation
 We shall now use the z-transform to analyze Markov processes. Starting from

   π(n+1) = π(n)P,   n = 0, 1, 2, ...

 and taking transforms,

   z^{-1}[Π(z) - π(0)] = Π(z)P
   Π(z) - zΠ(z)P = π(0)
   Π(z)(I - zP) = π(0)
   Π(z) = π(0)(I - zP)^{-1}

 In this expression I is the identity matrix.
z-Transformation
 Let us investigate the toymaker's problem by z-transformation.

   P = [1/2 1/2; 2/5 3/5]

   I - zP = [1 - (1/2)z   -(1/2)z;   -(2/5)z   1 - (3/5)z]

   (I - zP)^{-1} = 1/[(1 - z)(1 - z/10)] [1 - (3/5)z   (1/2)z;   (2/5)z   1 - (1/2)z]

 Expanding in partial fractions,

   (I - zP)^{-1} = 1/(1 - z) [4/9 5/9; 4/9 5/9] + 1/(1 - z/10) [5/9 -5/9; -4/9 4/9]

 Let the matrix H(n) be the inverse transform of (I - zP)^{-1} on an element-by-element basis:

   H(n) = [4/9 5/9; 4/9 5/9] + (1/10)^n [5/9 -5/9; -4/9 4/9]

 Then, from Π(z) = π(0)(I - zP)^{-1},

   π(n) = π(0)H(n)
z-Transformation
 If the toymaker starts in the successful state 1, then π(0) = [1 0] and

   π(n) = [4/9 5/9] + (1/10)^n [5/9 -5/9]

 or

   π_1(n) = 4/9 + (5/9)(1/10)^n,   π_2(n) = 5/9 - (5/9)(1/10)^n

 If the toymaker starts in the unsuccessful state 2, then π(0) = [0 1] and

   π(n) = [4/9 5/9] + (1/10)^n [-4/9 4/9]

 or

   π_1(n) = 4/9 - (4/9)(1/10)^n,   π_2(n) = 5/9 + (4/9)(1/10)^n

 We have now obtained analytic forms for the data in Tables 1.1 and 1.2.
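Numerical check (not in the original slides): a short Python/NumPy sketch confirming that the closed forms above agree with the direct matrix powers π(0)P^n.

    import numpy as np

    P = np.array([[0.5, 0.5],
                  [0.4, 0.6]])

    def pi_closed_form(n, successful_start=True):
        """Closed-form state probabilities from the z-transform analysis."""
        if successful_start:                       # pi(0) = [1 0]
            return np.array([4/9 + (5/9) * 0.1**n,
                             5/9 - (5/9) * 0.1**n])
        return np.array([4/9 - (4/9) * 0.1**n,     # pi(0) = [0 1]
                         5/9 + (4/9) * 0.1**n])

    for n in range(6):
        direct = np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, n)
        assert np.allclose(direct, pi_closed_form(n, True))
    print("closed form matches pi(0) P^n")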
Laplace Transformation
 We shall extend our previous work to the case in which the process may make transitions at random time intervals.
 The Laplace transform of a time function f(t) which is zero for t < 0 is defined by

   F(s) = ∫_0^∞ f(t) e^{-st} dt

 Table 2.4. Laplace Transform Pairs

   Time function for t >= 0          Laplace transform
   f(t)                              F(s)
   f1(t) + f2(t)                     F1(s) + F2(s)
   kf(t) (k a constant)              kF(s)
   d f(t)/dt                         sF(s) - f(0)
   e^{-at}                           1/(s + a)
   1 (unit step)                     1/s
   t e^{-at}                         1/(s + a)^2
   t (unit ramp)                     1/s^2
   e^{-at} f(t)                      F(s + a)
Laplace Transformation
 For the unit step f(t) = 1 for t >= 0 and 0 for t < 0,

   F(s) = ∫_0^∞ e^{-st} dt = 1/s

 For f(t) = e^{-at}, t >= 0,

   F(s) = ∫_0^∞ e^{-at} e^{-st} dt = ∫_0^∞ e^{-(s+a)t} dt = 1/(s + a)

 For a continuous-time Markov process with transition rates a_ij,

   π_j(t + dt) = π_j(t)[1 - Σ_{i≠j} a_ji dt] + Σ_{i≠j} π_i(t) a_ij dt

 Defining a_jj = -Σ_{i≠j} a_ji, this becomes

   π_j(t + dt) = π_j(t)[1 + a_jj dt] + Σ_{i≠j} π_i(t) a_ij dt
               = π_j(t) + Σ_i π_i(t) a_ij dt   (dt → 0)
Laplace Transformation
 We shall now use the Laplace transform to analyze Markov processes. The state probabilities obey

   dπ_j(t)/dt = Σ_{i=1}^{N} π_i(t) a_ij,   or in vector form   dπ(t)/dt = π(t)A

 Taking transforms,

   sΠ(s) - π(0) = Π(s)A
   Π(s)(sI - A) = π(0)
   Π(s) = π(0)(sI - A)^{-1}

 The time-domain solution is

   π(t) = π(0)e^{At},   where   e^{At} = I + tA + (t^2/2!)A^2 + (t^3/3!)A^3 + ...

 For discrete processes, π(n) = π(0)P^n, n = 0, 1, 2, ..., and the two descriptions are related by e^A = P or A = ln P.
Laplace Transformation
 Recall the toymaker's initial policy, for which the transition-probability matrix was

   P = [1/2 1/2; 2/5 3/5]

 The continuous-time transition-rate matrix used in the example is

   A = [-5 5; 4 -4],   so that   sI - A = [s+5 -5; -4 s+4]

   (sI - A)^{-1} = 1/[s(s + 9)] [s+4 5; 4 s+5]

 Expanding in partial fractions,

   (sI - A)^{-1} = (1/s) [4/9 5/9; 4/9 5/9] + 1/(s + 9) [5/9 -5/9; -4/9 4/9]
Laplace Transformation
 Let the matrix H(t) be the inverse transform of (sI - A)^{-1}.
 Then Π(s) = π(0)(sI - A)^{-1} becomes, by means of inverse transformation,

   π(t) = π(0)H(t)

   H(t) = [4/9 5/9; 4/9 5/9] + e^{-9t} [5/9 -5/9; -4/9 4/9]
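Numerical check (not in the original slides): a minimal Python sketch, assuming SciPy is available, that evaluates π(t) = π(0)e^{At} with the matrix exponential and compares it with the closed form above.

    import numpy as np
    from scipy.linalg import expm

    # Transition-rate matrix for the continuous-time toymaker (from the slides)
    A = np.array([[-5.0,  5.0],
                  [ 4.0, -4.0]])

    def pi_t(t):
        """pi(t) = pi(0) exp(At) for pi(0) = [1 0]."""
        return np.array([1.0, 0.0]) @ expm(A * t)

    for t in (0.0, 0.1, 0.5, 2.0):
        closed = np.array([4/9 + (5/9) * np.exp(-9 * t),
                           5/9 - (5/9) * np.exp(-9 * t)])
        assert np.allclose(pi_t(t), closed)
    print("matrix exponential matches 4/9 + (5/9) e^{-9t}")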
Laplace Transformation
 If the toymaker starts in the successful state 1, then π(0) = [1 0] and

   π(t) = [4/9 5/9] + e^{-9t} [5/9 -5/9]

 or

   π_1(t) = 4/9 + (5/9)e^{-9t},   π_2(t) = 5/9 - (5/9)e^{-9t}

 If the toymaker starts in the unsuccessful state 2, then π(0) = [0 1] and

   π(t) = [4/9 5/9] + e^{-9t} [-4/9 4/9]

 or

   π_1(t) = 4/9 - (4/9)e^{-9t},   π_2(t) = 5/9 + (4/9)e^{-9t}

 We have now obtained continuous-time analogues of the analytic forms behind Tables 1.1 and 1.2.
Outline
 Introduction
 Markov Theory
 Markov Decision Processes
 Conclusion
 Future Work in Sensor Networks
Markov Decision Processes
 MDPs apply dynamic programming to the solution of a stochastic decision problem with a finite number of states.
 The transition probabilities between the states are described by a Markov chain.
 The reward structure of the process is described by a matrix that represents the revenue (or cost) associated with movement from one state to another.
 Both the transition and revenue matrices depend on the decision alternatives available to the decision maker.
 The objective of the problem is to determine the optimal policy that maximizes the expected revenue over a finite or infinite number of stages.
Markov Process with Rewards
 Suppose that an N-state Markov process earns r_ij dollars when it makes a transition from state i to j.
 We call r_ij the "reward" associated with the transition from i to j.
 The rewards need not be in dollars; they could be voltage levels, units of production, or any other physical quantity relevant to the problem.
 Let us define v_i(n) as the expected total earnings in the next n transitions if the system is now in state i.
Markov Process with Rewards
 Recurrence relation:

   v_i(n) = Σ_{j=1}^{N} p_ij [r_ij + v_j(n-1)],   i = 1, 2, ..., N,   n = 1, 2, 3, ...

   v_i(n) = Σ_{j=1}^{N} p_ij r_ij + Σ_{j=1}^{N} p_ij v_j(n-1)

 Defining the expected immediate reward

   q_i = Σ_{j=1}^{N} p_ij r_ij,   i = 1, 2, ..., N

 we obtain

   v_i(n) = q_i + Σ_{j=1}^{N} p_ij v_j(n-1),   i = 1, 2, ..., N,   n = 1, 2, 3, ...

 or, in vector form, v(n) = q + Pv(n-1).
The Toymaker Example
 For the toymaker,

   P = [0.5 0.5; 0.4 0.6],   R = [9 3; 3 -7],   q = [6; -3]

   v_i(n) = q_i + Σ_{j=1}^{N} p_ij v_j(n-1),   i = 1, 2, ..., N,   n = 1, 2, 3, ...

 Table 3.1. Total Expected Reward for Toymaker as a Function of State and Number of Weeks Remaining

   n =      0    1     2     3      4       5        ...
   v_1(n)   0    6     7.5   8.55   9.555   10.5555  ...
   v_2(n)   0    -3    -2.4  -1.44  -0.444  0.5556   ...
Toymaker's Problem
 Figure: total expected reward in each state as a function of weeks remaining.
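Numerical check (not in the original slides): a small Python/NumPy sketch iterating v(n) = q + Pv(n-1) to reproduce Table 3.1.

    import numpy as np

    # Toymaker process with rewards (from the slides)
    P = np.array([[0.5, 0.5],
                  [0.4, 0.6]])
    R = np.array([[9.0, 3.0],
                  [3.0, -7.0]])

    q = (P * R).sum(axis=1)       # expected immediate rewards, q = [6, -3]

    v = np.zeros(2)               # v(0) = 0
    for n in range(1, 6):
        v = q + P @ v             # v(n) = q + P v(n-1)
        print(n, v)               # reproduces Table 3.1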
z-Transform Analysis of the Markov Process with Rewards
 The z-transform of the total-value vector v(n) will be called V(z), where

   V(z) = Σ_{n=0}^{∞} v(n) z^n

 From v(n+1) = q + Pv(n), n = 0, 1, 2, ...,

   z^{-1}[V(z) - v(0)] = (1/(1 - z)) q + PV(z)
   V(z) - v(0) = (z/(1 - z)) q + zPV(z)
   (I - zP)V(z) = (z/(1 - z)) q + v(0)
   V(z) = (z/(1 - z))(I - zP)^{-1} q + (I - zP)^{-1} v(0)

 With v(0) = 0,

   V(z) = (z/(1 - z))(I - zP)^{-1} q
z-Transform Analysis of the Markov Process with Rewards
 Using the partial-fraction expansion of (I - zP)^{-1} obtained earlier,

   (z/(1 - z))(I - zP)^{-1} = z/(1 - z)^2 [4/9 5/9; 4/9 5/9]
                              + (10/9)[1/(1 - z) - 1/(1 - z/10)] [5/9 -5/9; -4/9 4/9]

 Let the matrix F(n) be the inverse transform of (z/(1 - z))(I - zP)^{-1}:

   F(n) = n [4/9 5/9; 4/9 5/9] + (10/9)[1 - (1/10)^n] [5/9 -5/9; -4/9 4/9]

 The total-value vector v(n) is then F(n)q by inverse transformation of V(z) = (z/(1 - z))(I - zP)^{-1} q, and, since q = [6; -3],

   v(n) = n [1; 1] + (10/9)[1 - (1/10)^n] [5; -4]

   v_1(n) = n + (50/9)[1 - (1/10)^n]
   v_2(n) = n - (40/9)[1 - (1/10)^n]

 We see that, as n becomes very large,

   v_1(n) → n + 50/9,   v_2(n) → n - 40/9

 Both v_1(n) and v_2(n) have slope 1 and v_1(n) - v_2(n) = 10.
Optimization Techniques in General Markov Decision Processes
 Value Iteration
 Exhaustive Enumeration
 Policy Iteration (Policy Improvement)
 Linear Programming
 Lagrangian Relaxation
Value Iteration
 Original alternatives:
   State 1: [p_1j^1] = [0.5 0.5],   [r_1j^1] = [9 3]
   State 2: [p_2j^1] = [0.4 0.6],   [r_2j^1] = [3 -7]
 Advertising? (state 1)
   No:  [p_1j^1] = [0.5 0.5],   [r_1j^1] = [9 3]
   Yes: [p_1j^2] = [0.8 0.2],   [r_1j^2] = [4 4]
 Research? (state 2)
   No:  [p_2j^1] = [0.4 0.6],   [r_2j^1] = [3 -7]
   Yes: [p_2j^2] = [0.7 0.3],   [r_2j^2] = [1 -19]
Diagram of States and Alternatives
 Figure: diagram of the toymaker's states and decision alternatives.
The Toymaker's Problem Solved by Value Iteration
 The quantity q_i^k is the expected reward from a single transition from state i under alternative k. Thus,

   q_i^k = Σ_{j=1}^{N} p_ij^k r_ij^k,   i = 1, 2, ..., N

 The alternatives for the toymaker are presented in the following table.

   State i                Alternative k         Transition probabilities   Rewards           Expected immediate reward
                                                p_i1^k   p_i2^k            r_i1^k   r_i2^k   q_i^k
   1 (Successful toy)     1 (No advertising)    0.5      0.5               9        3        6
                          2 (Advertising)       0.8      0.2               4        4        4
   2 (Unsuccessful toy)   1 (No research)       0.4      0.6               3        -7       -3
                          2 (Research)          0.7      0.3               1        -19      -5
The Toymaker's Problem Solved by Value Iteration
 We call d_i(n) the "decision" in state i at the nth stage. When d_i(n) has been specified for all i and all n, a "policy" has been determined. The optimal policy is the one that maximizes total expected return for each i and n.
 To analyze this problem, let us redefine v_i(n) as the total expected return in n stages starting from state i if an optimal policy is followed. It follows that for any n

   v_i(n+1) = max_k Σ_{j=1}^{N} p_ij^k [r_ij^k + v_j(n)],   n = 0, 1, 2, ...

 "Principle of optimality" of dynamic programming: in an optimal sequence of decisions or choices, each subsequence must also be optimal.
The Toymaker's Problem Solved by Value Iteration
 With

   q_i^k = Σ_{j=1}^{N} p_ij^k r_ij^k,   i = 1, 2, ..., N

 the recurrence becomes

   v_i(n+1) = max_k [q_i^k + Σ_{j=1}^{N} p_ij^k v_j(n)],   n = 0, 1, 2, ...

 Table 3.6. Toymaker's Problem Solved by Value Iteration

   n =      0    1    2     3      4       ...
   v_1(n)   0    6    8.2   10.22  12.222  ...
   v_2(n)   0    -3   -1.7  0.23   2.223   ...
   d_1(n)   -    1    2     2      2       ...
   d_2(n)   -    1    2     2      2       ...

 Sample computations:
   n = 1:  0.5(9)+0.5(3) = 6  vs.  0.8(4)+0.2(4) = 4;   0.4(3)+0.6(-7) = -3  vs.  0.7(1)+0.3(-19) = -5
   n = 2:  6+0.5(6)+0.5(-3) = 7.5  vs.  4+0.8(6)+0.2(-3) = 8.2;   -3+0.4(6)+0.6(-3) = -2.4  vs.  -5+0.7(6)+0.3(-3) = -1.7
   n = 3:  6+0.5(8.2)+0.5(-1.7) = 9.25  vs.  4+0.8(8.2)+0.2(-1.7) = 10.22;   -3+0.4(8.2)+0.6(-1.7) = -0.74  vs.  -5+0.7(8.2)+0.3(-1.7) = 0.23
   (Note: -0.74 + (-2.4 - (-1.7)) = -1.44, the corresponding value in Table 3.1.)
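Numerical check (not in the original slides): a Python/NumPy sketch of the value-iteration recurrence that reproduces Table 3.6.

    import numpy as np

    # Toymaker alternatives: P[i][k], R[i][k] are the transition-probability and
    # reward rows for state i under alternative k (data from the table above)
    P = [[np.array([0.5, 0.5]), np.array([0.8, 0.2])],    # state 1: no advertising / advertising
         [np.array([0.4, 0.6]), np.array([0.7, 0.3])]]    # state 2: no research / research
    R = [[np.array([9.0, 3.0]), np.array([4.0, 4.0])],
         [np.array([3.0, -7.0]), np.array([1.0, -19.0])]]

    q = [[P[i][k] @ R[i][k] for k in range(2)] for i in range(2)]   # expected immediate rewards

    v = np.zeros(2)                          # v(0) = 0
    for n in range(1, 5):
        test = [[q[i][k] + P[i][k] @ v for k in range(2)] for i in range(2)]
        d = [int(np.argmax(test[i])) + 1 for i in range(2)]         # decisions, 1-based
        v = np.array([max(test[i]) for i in range(2)])
        print(n, v, d)                       # reproduces Table 3.6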
The Toymaker’s Problem Solved by
Value Iteration
 Note that for n=2, 3, and 4, the second alternative
in each state is to be preferred. This means that
the toymaker is better advised to advertise and to
carry on research in spite of the costs of these
activities.
 For this problem the convergence seems to have
taken place at n=2, and the second alternative in
each state has been chosen. However, in many
problems it is difficult to tell when convergence has
been obtained.
Evaluation of the Value-Iteration Approach
 The value-iteration method is not particularly suited to long-duration processes: it must be run stage by stage, and it can be hard to tell when the policy has converged.
 A practical stopping rule is to stop when successive differences stabilize, that is, when

   [v(n+1) - v(n)] ≈ [v(n+2) - v(n+1)]
Exhaustive Enumeration
 A method for solving the infinite-stage problem.
 The method calls for evaluating all possible stationary policies of the decision problem.
 This is equivalent to an exhaustive enumeration process and can be used only if the number of stationary policies is reasonably small.
Exhaustive Enumeration
 Suppose that the decision problem has S stationary policies, and assume that P^s and R^s are the (one-step) transition and revenue matrices associated with policy s, s = 1, 2, ..., S.
Exhaustive Enumeration
 The steps of the exhaustive enumeration method are as follows.
 Step 1. Compute v_i^s, the expected one-step (one-period) revenue of policy s given state i, i = 1, 2, ..., m.
 Step 2. Compute π_i^s, the long-run stationary probabilities of the transition matrix P^s associated with policy s. These probabilities, when they exist, are computed from the equations

   π^s P^s = π^s
   π_1^s + π_2^s + ... + π_m^s = 1

 where π^s = (π_1^s, π_2^s, ..., π_m^s).
 Step 3. Determine E^s, the expected revenue of policy s per transition step (period), by using the formula

   E^s = Σ_{i=1}^{m} π_i^s v_i^s

 Step 4. The optimal policy s* is determined such that E^{s*} = max_s {E^s}.
Exhaustive Enumeration
 We illustrate the method by solving the gardener problem for an infinite-period planning horizon.
 The gardener problem has a total of eight stationary policies, as the following table shows:

   Stationary policy s   Action
   1                     Do not fertilize at all.
   2                     Fertilize regardless of the state.
   3                     Fertilize if in state 1.
   4                     Fertilize if in state 2.
   5                     Fertilize if in state 3.
   6                     Fertilize if in state 1 or 2.
   7                     Fertilize if in state 1 or 3.
   8                     Fertilize if in state 2 or 3.
Exhaustive Enumeration
 The matrices P^s and R^s for policies 3 through 8 are derived from those of policies 1 and 2 and are given as

   P^1 = [0.2 0.5 0.3; 0 0.5 0.5; 0 0 1]             R^1 = [7 6 3; 0 5 1; 0 0 -1]
   P^2 = [0.3 0.6 0.1; 0.1 0.6 0.3; 0.05 0.4 0.55]   R^2 = [6 5 -1; 7 4 0; 6 3 -2]
   P^3 = [0.3 0.6 0.1; 0 0.5 0.5; 0 0 1]             R^3 = [6 5 -1; 0 5 1; 0 0 -1]
   P^4 = [0.2 0.5 0.3; 0.1 0.6 0.3; 0 0 1]           R^4 = [7 6 3; 7 4 0; 0 0 -1]
   P^5 = [0.2 0.5 0.3; 0 0.5 0.5; 0.05 0.4 0.55]     R^5 = [7 6 3; 0 5 1; 6 3 -2]
   P^6 = [0.3 0.6 0.1; 0.1 0.6 0.3; 0 0 1]           R^6 = [6 5 -1; 7 4 0; 0 0 -1]
   P^7 = [0.3 0.6 0.1; 0 0.5 0.5; 0.05 0.4 0.55]     R^7 = [6 5 -1; 0 5 1; 6 3 -2]
   P^8 = [0.2 0.5 0.3; 0.1 0.6 0.3; 0.05 0.4 0.55]   R^8 = [7 6 3; 7 4 0; 6 3 -2]
Exhaustive Enumeration
 Step 1: the values of v_i^s can thus be computed as given in the following table.

   s    v_1^s   v_2^s   v_3^s
   1    5.3     3       -1
   2    4.7     3.1     0.4
   3    4.7     3       -1
   4    5.3     3.1     -1
   5    5.3     3       0.4
   6    4.7     3.1     -1
   7    4.7     3       0.4
   8    5.3     3.1     0.4
Exhaustive Enumeration
 Step 2: the computations of the stationary probabilities are achieved by using the equations

   π^s P^s = π^s,   π_1^s + π_2^s + ... + π_m^s = 1

 As an illustration, consider s = 2. The associated equations are

   0.3π_1^2 + 0.1π_2^2 + 0.05π_3^2 = π_1^2
   0.6π_1^2 + 0.6π_2^2 + 0.4π_3^2  = π_2^2
   0.1π_1^2 + 0.3π_2^2 + 0.55π_3^2 = π_3^2
   π_1^2 + π_2^2 + π_3^2 = 1

 The solution yields

   π_1^2 = 6/59,   π_2^2 = 31/59,   π_3^2 = 22/59

 In this case, the expected yearly revenue is

   E^2 = Σ_{i=1}^{3} π_i^2 v_i^2 = (1/59)(6 × 4.7 + 31 × 3.1 + 22 × 0.4) = 2.256
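Numerical check (not in the original slides): a Python/NumPy sketch that recovers the stationary probabilities and E^2 for policy 2 of the gardener problem.

    import numpy as np

    # Gardener problem, policy s = 2 (fertilize regardless of the state)
    P2 = np.array([[0.30, 0.60, 0.10],
                   [0.10, 0.60, 0.30],
                   [0.05, 0.40, 0.55]])
    R2 = np.array([[6.0, 5.0, -1.0],
                   [7.0, 4.0,  0.0],
                   [6.0, 3.0, -2.0]])

    v2 = (P2 * R2).sum(axis=1)                    # one-step revenues [4.7, 3.1, 0.4]

    # Stationary probabilities: solve pi P = pi together with sum(pi) = 1
    A = np.vstack([P2.T - np.eye(3), np.ones(3)])
    b = np.array([0.0, 0.0, 0.0, 1.0])
    pi2, *_ = np.linalg.lstsq(A, b, rcond=None)   # [6/59, 31/59, 22/59]

    print(pi2, pi2 @ v2)                          # E^2 is approximately 2.256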
Exhaustive Enumeration
 Steps 3 and 4: the following table summarizes π^s and E^s for all the stationary policies.

   s    π_1^s     π_2^s     π_3^s     E^s
   1    0         0         1         -1
   2    6/59      31/59     22/59     2.256  (optimal)
   3    0         0         1         -1
   4    0         0         1         -1
   5    5/154     69/154    80/154    1.724
   6    0         0         1         -1
   7    5/137     62/137    70/137    1.734
   8    12/135    69/135    54/135    2.216

 Policy 2 yields the largest expected yearly revenue. The optimum long-range policy calls for applying fertilizer regardless of the state of the system.
Policy Iteration
 If the system is completely ergodic, the limiting state probabilities π_i are independent of the starting state, and the gain g of the system is

   g = Σ_{i=1}^{N} π_i q_i

 where q_i is the expected immediate return in state i, defined by

   q_i = Σ_{j=1}^{N} p_ij r_ij,   i = 1, 2, ..., N

 Figure labels: reducible Markov chain states; irreducible Markov chain states.
 Note: a Markov chain is ergodic when it has a finite number of states and is irreducible and aperiodic.
Policy Iteration
 A possible five-state problem.
 The alternative thus selected is called the “decision” for
that state; it is no longer a function of n. The set of X’s or
the set of decisions for all states is called a “policy”.
Policy Iteration
 It is possible to describe the policy by a decision vector d whose elements represent the number of the alternative selected in each state. In this case

   d = [3; 2; 2; 1; 3]

 An optimal policy is defined as a policy that maximizes the gain, or average return per transition.
Policy Iteration
 In the five-state problem diagrammed, there are 4 × 3 × 2 × 1 × 5 = 120 different policies.
 However feasible exhaustive evaluation may be for 120 policies, it becomes infeasible for very large problems. For example, a problem with 50 states and 50 alternatives in each state contains 50^50 (≈ 10^85) policies.
 The policy-iteration method that will be described finds the optimal policy in a small number of iterations.
 It is composed of two parts, the value-determination operation and the policy-improvement routine.
Policy Iteration
 The Iteration Cycle
 Figure: the policy-iteration cycle, alternating the value-determination operation and the policy-improvement routine, with

   q_i = Σ_{j=1}^{N} p_ij r_ij,   i = 1, 2, ..., N
The Toymaker's Problem
 Let us suppose that we have no a priori knowledge about which policy is best.
 Then if we set v_1 = v_2 = 0 and enter the policy-improvement routine, it will select as an initial policy the one that maximizes expected immediate reward in each state.
 For the toymaker, this policy consists of selecting alternative 1 in both state 1 and state 2. For this policy

   d = [1; 1],   P = [0.5 0.5; 0.4 0.6],   R = [9 3; 3 -7],   q = [6; -3]

 The corresponding data for alternative 2 in each state are

   P = [0.8 0.2; 0.7 0.3],   R = [4 4; 1 -19],   q = [4; -5]
The Toymaker's Problem
 We are now ready to begin the value-determination operation that will evaluate our initial policy:

   g + v_1 = 6 + 0.5v_1 + 0.5v_2
   g + v_2 = -3 + 0.4v_1 + 0.6v_2

 Setting v_2 = 0 and solving these equations, we obtain

   v_1 = 10,   v_2 = 0,   g = 1

 We are now ready to enter the policy-improvement routine, as shown in Table 3.8.

   State i   Alternative k   Test quantity q_i^k + Σ_j p_ij^k v_j
   1         1               6 + 0.5(10) + 0.5(0) = 11
   1         2               4 + 0.8(10) + 0.2(0) = 12
   2         1               -3 + 0.4(10) + 0.6(0) = 1
   2         2               -5 + 0.7(10) + 0.3(0) = 2
The Toymaker's Problem
 The policy-improvement routine reveals that the second alternative in each state produces a higher value of the test quantity than does the first alternative. For this policy,

   d = [2; 2],   P = [0.8 0.2; 0.7 0.3],   q = [4; -5]

 We are now ready to perform the value-determination operation that will evaluate this policy:

   g + v_1 = 4 + 0.8v_1 + 0.2v_2
   g + v_2 = -5 + 0.7v_1 + 0.3v_2

 With v_2 = 0, the results of the value-determination operation are

   g = 2,   v_1 = 10,   v_2 = 0

 The gain of the policy d = [2; 2] is thus twice that of the original policy. A further pass through the policy-improvement routine (again with v_1 = 10, v_2 = 0) selects the same alternatives, so we have found the optimal policy.
 For the optimal policy, v_1 = 10, v_2 = 0, so that v_1 - v_2 = 10. This means that, even when the toymaker is following the optimal policy by using advertising and research, starting in the successful state is worth 10 more units of expected return than starting in the unsuccessful state.
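Numerical check (not in the original slides): a compact Python/NumPy sketch of the two-part policy-iteration cycle for the toymaker, which converges to the policy [2 2] with gain 2.

    import numpy as np

    # Toymaker data: P[i, k] and R[i, k] are the rows for state i, alternative k
    P = np.array([[[0.5, 0.5], [0.8, 0.2]],
                  [[0.4, 0.6], [0.7, 0.3]]])
    R = np.array([[[9.0, 3.0], [4.0, 4.0]],
                  [[3.0, -7.0], [1.0, -19.0]]])
    q = (P * R).sum(axis=2)                  # q[i, k]

    d = np.array([0, 0])                     # start with alternative 1 in each state
    while True:
        Pd = P[np.arange(2), d]              # transition matrix under policy d
        qd = q[np.arange(2), d]
        # Value determination: g + v_i = q_i + sum_j p_ij v_j, with v_2 = 0
        A = np.array([[1.0, 1.0 - Pd[0, 0]],
                      [1.0,      -Pd[1, 0]]])
        g, v1 = np.linalg.solve(A, qd)
        v = np.array([v1, 0.0])
        # Policy improvement: pick the alternative with the largest test quantity
        test = q + P @ v                     # test[i, k] = q_i^k + sum_j p_ij^k v_j
        d_new = test.argmax(axis=1)
        if np.array_equal(d_new, d):
            break
        d = d_new

    print("policy:", d + 1, "gain:", g, "v:", v)   # expect [2 2], gain 2, v = [10, 0]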
Linear Programming
 The infinite-stage Markov decision problem can be formulated and solved as a linear program.
 We have defined the policy of an MDP as a decision vector

   D = [d_1; d_2; ...; d_N]

 Each state has K decisions, so D can be characterized by assigning values d_ik ∈ {0, 1} in the matrix

   D = [d_11 d_12 ... d_1K; d_21 d_22 ... d_2K; ...; d_N1 d_N2 ... d_NK]

 where each row must contain a single 1 with the rest of the elements zero. When an element d_ik = 1, it can be interpreted as calling for decision k when the system is in state i.
Linear Programming
 When we use linear programming to solve the MDP problem, the objective is

   E = Σ_{i=1}^{N} Σ_{k=1}^{K} π_i d_ik q_i^k

 The linear programming formulation is best expressed in terms of a variable w_ik, which is related to d_ik as follows.
 Let w_ik be the unconditional probability that the system is in state i and decision k is made; that is,

   w_ik = P{state = i and decision = k}

 From the rules of conditional probability, w_ik = π_i d_ik. Furthermore, π_i = Σ_{k=1}^{K} w_ik, so that

   d_ik = w_ik / π_i = w_ik / Σ_{k=1}^{K} w_ik
Linear Programming
 There exist several constraints on w_ik:
 1. Σ_{i=1}^{N} π_i = 1, so that

      Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik = 1

 2. From the results on steady-state probabilities, π_j = Σ_{i=1}^{N} π_i p_ij, so that

      Σ_{k=1}^{K} w_jk = Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik p_ij^k,   for j = 1, 2, ..., N

 3. w_ik >= 0,   i = 1, 2, ..., N and k = 1, 2, ..., K
Linear Programming
 The long-run expected average revenue per unit time is given by

   E = Σ_{i=1}^{N} Σ_{k=1}^{K} π_i d_ik q_i^k = Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik q_i^k

 Hence the problem is to choose the w_ik that

   Maximize Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik q_i^k

 subject to the constraints

   1. Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik = 1
   2. Σ_{k=1}^{K} w_jk - Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik p_ij^k = 0,   for j = 1, 2, ..., N
   3. w_ik >= 0,   i = 1, 2, ..., N and k = 1, 2, ..., K

 This is clearly a linear programming problem that can be solved by the simplex method. Once the w_ik are obtained, the policy follows from

   d_ik = w_ik / Σ_{k=1}^{K} w_ik
Linear Programming
 The following is an LP formulation of the gardener problem without discounting:

   Maximize E = 5.3w_11 + 4.7w_12 + 3w_21 + 3.1w_22 - w_31 + 0.4w_32

 subject to

   w_11 + w_12 - (0.2w_11 + 0.3w_12 + 0.1w_22 + 0.05w_32) = 0
   w_21 + w_22 - (0.5w_11 + 0.6w_12 + 0.5w_21 + 0.6w_22 + 0.4w_32) = 0
   w_31 + w_32 - (0.3w_11 + 0.1w_12 + 0.5w_21 + 0.3w_22 + w_31 + 0.55w_32) = 0
   w_11 + w_12 + w_21 + w_22 + w_31 + w_32 = 1
   w_ik >= 0, for all i and k

 The optimal solution is w_11 = w_21 = w_31 = 0 and w_12 = 0.1017, w_22 = 0.5254, w_32 = 0.3729.
 This result means that d_12 = d_22 = d_32 = 1. Thus, the optimal policy selects alternative k = 2 for i = 1, 2, and 3.
 The optimal value of E is 4.7(0.1017) + 3.1(0.5254) + 0.4(0.3729) = 2.256.
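Numerical check (not in the original slides): a minimal sketch of the gardener LP using scipy.optimize.linprog (assuming SciPy is available); linprog minimizes, so the objective is negated.

    import numpy as np
    from scipy.optimize import linprog

    # Variables ordered as [w11, w12, w21, w22, w31, w32]
    c = -np.array([5.3, 4.7, 3.0, 3.1, -1.0, 0.4])        # negate to maximize E

    A_eq = np.array([
        [0.8,  0.7,  0.0,  -0.1,  0.0, -0.05],            # balance equation for state 1
        [-0.5, -0.6, 0.5,   0.4,  0.0, -0.4 ],            # balance equation for state 2
        [-0.3, -0.1, -0.5, -0.3,  0.0,  0.45],            # balance equation for state 3
        [1.0,  1.0,  1.0,   1.0,  1.0,  1.0 ],            # probabilities sum to 1
    ])
    b_eq = np.array([0.0, 0.0, 0.0, 1.0])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 6)
    print(res.x)        # approximately [0, 0.1017, 0, 0.5254, 0, 0.3729]
    print(-res.fun)     # approximately 2.256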
Lagrangian Relaxation
 If the linear programming method cannot find the optimal solution once additional constraints are imposed,
 we can use Lagrangian relaxation to fold the constraints into the objective function and then solve the resulting subproblem without the additional constraints.
 By adjusting the Lagrange multipliers, we can obtain an upper bound and a lower bound on the problem.
 We will use the Lagrange multipliers to rearrange the revenue of the Markov decision process, and then solve the original Markov decision problem.
Comparison
 The five methods can be compared along four characteristics: whether they are simple to compute, whether they guarantee an optimal policy, whether they remain practical for large problems, and whether they can handle additional constraints.
 Methods compared: Value Iteration, Exhaustive Enumeration, Policy Iteration, Linear Programming, and Lagrangian Relaxation.
Semi-Markov Decision Processes
 So far we have assumed that decisions are
taken at each of a sequence of unit time
intervals.
 We will allow decisions to be taken at
varying integral multiples of the unit time
interval.
 The interval between decisions may be
predetermined or random.
Partially Observable MDPs
 MDPs assume complete observability (the agent can always tell what state it is in).
 We can't always be certain of the current state
 e.g., the brightness level of a lamp
 POMDPs are more difficult to solve than MDPs
 Most real-world problems are POMDPs
 State space transformation [22]
Applications of MDPs
 Capacity Expansion
 Decision Analysis
Update video. (2004/9/16, VoD)
 Network Control
Optimization of GPRS Time Slot Allocation
Packet Classification
 Queueing System Control
 Inventory Management
Outline
 Introduction
 Markov Theory
 Markov Decision Processes
 Conclusion
 Future Work in Sensor Networks
Conclusion
 MDPs provide an elegant and formal framework for sequential decision making.
 They are a powerful tool for formulating models and finding optimal policies.
 Five algorithms were presented:
 Value Iteration
 Exhaustive Enumeration
 Policy Iteration
 Linear Programming
 Lagrangian Relaxation
Outline
 Introduction
 Markov Theory
 Markov Decision Processes
 Conclusion
 Future Work in Sensor Networks
Future Work in Sensor Networks
 Markovian recovery policy in object-tracking sensor networks
 Objective function: minimize communication delay (response time) or maximize system lifetime
 Policies: ALL_NBR and ALL_NODE
 Constraint: energy
 Markovian monitoring and reporting policy in WSNs (2004/10/7, WSNs Oral)
 Objective functions: minimize communication cost or delay (response) time
 Policies: sensor node density and number of sinks
 Markovian sensor node placement policy with application to WSNs
 Objective functions: minimize budget cost or maximize coverage of the sensor field
 Policies: planning and deployment, post-deployment, and redeployment
References
1.  Hamdy A. Taha, "Operations Research: An Introduction," third edition, 1982.
2.  Hillier and Lieberman, "Introduction to Operations Research," fourth edition, Holden-Day, Inc., 1986.
3.  R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, "Network Flows," Prentice-Hall, 1993.
4.  Leslie Pack Kaelbling, "Techniques in Artificial Intelligence: Markov Decision Processes," MIT OpenCourseWare, Fall 2002.
5.  Ronald A. Howard, "Dynamic Programming and Markov Processes," Wiley, New York, 1970.
6.  D. J. White, "Markov Decision Processes," Wiley, 1993.
7.  Dean L. Isaacson and Richard W. Madsen, "Markov Chains Theory and Applications," Wiley, 1976.
8.  M. H. A. Davis, "Markov Models and Optimization," Chapman & Hall, 1993.
9.  Martin L. Puterman, "Markov Decision Processes: Discrete Stochastic Dynamic Programming," Wiley, New York, 1994.
10. Hsu-Kuan Hung, adviser Yeong-Sung Lin, "Optimization of GPRS Time Slot Allocation," June 2001.
11. Hui-Ting Chuang, adviser Yeong-Sung Lin, "Optimization of GPRS Time Slot Allocation Considering Call Blocking Probability Constraints," June 2002.
12. 高孔廉, "Operations Research: Quantitative Methods for Management Decisions" (in Chinese), 4th edition, San Min, 1985.
13. 李朝賢, "Introduction to Operations Research" (in Chinese), Hung-Yeh Culture Enterprise Co., August 1977.
14. 楊超然, "Operations Research" (in Chinese), San Min Book Co., first edition, September 1977.
15. 葉若春, "Operations Research" (in Chinese), Chung Hsing Management Consultants, 5th edition, August 1997.
16. 薄喬萍, "Operations Research: Decision Analysis" (in Chinese), Fu Wen Book Co., first edition, June 1989.
17. 葉若春, "Linear Programming: Theory and Applications" (in Chinese), 10th revised edition, September 1984.
18. Leonard Kleinrock, "Queueing Systems, Volume I: Theory," Wiley, New York, 1975.
19. Hsien-Ming Chiu, "Lagrangian Relaxation," Tamkang University, Fall 2003.
20. L. Cheng, E. Subrahmanian, and A. W. Westerberg, "Design and planning under uncertainty: issues on problem formulation and solution," Computers and Chemical Engineering, 27, 2003, pp. 781-801.
21. Regis Sabbadin, "Possibilistic Markov Decision Processes," Engineering Applications of Artificial Intelligence, 14, 2001, pp. 287-300.
22. K. Karen Yin, Hu Liu, and Neil E. Johnson, "Markovian Inventory Policy with Application to the Paper Industry," Computers and Chemical Engineering, 26, 2002, pp. 1399-1413.
Q&A