Incremental Least Squares Policy Iteration for POMDPs

Hui Li, Xuejun Liao, and Lawrence Carin
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291, USA
{hl1, xjliao, lcarin}@ee.duke.edu
behavior of the belief state is a discrete-time continuous-state Markov process and that for a finite horizon the optimal
value of belief states is a piecewise linear and convex function. Based on these facts, the value iteration method for
MDPs (Bellman 1957) was extended to POMDPs in (Smallwood & Sondik 1973). Though the algorithm in (Smallwood & Sondik 1973) is exact, it has an exponential computational complexity in the worst case. More recently,
researchers have developed scalable value iteration algorithms based on a finite set of belief points (Lovejoy 1991;
Brafman 1997; Poon 2001; Pineau, Gordon, & Thrun 2003).
In particular, Pineau et al. (2003) suggested backing up
the value and its gradient on a finite set of belief points
most probably reached by the POMDP. The resulting algorithm, point-based value iteration (PBVI), has proven to
be a practical POMDP solution scaling up to large problems. The idea has subsequently been pursued in a number
of papers (Spaan & Vlassis 2004; Smith & Simmons 2004;
2005), where various heuristics were proposed to further improve the algorithmic efficiency.
Value iteration methods solve the finite-horizon policy of a POMDP. To obtain the infinite-horizon policy, one
must solve the policy over successively larger horizons until
the value function converges to that of the infinite horizon.
An alternative approach is to solve the infinite-horizon policy directly. Sondik (1978) presented such an approach by
extending the policy iteration of MDP (Howard 1971) to the
POMDP case. Sondik’s algorithm focuses on improving the
policy in the entire belief simplex, involving complicated
computations, and therefore it is only appropriate for small
problems. Hansen (1997) presented an improved algorithm,
integrating Sondik’s multiple policy representations into a
single one. Hansen’s algorithm still focuses on the entire
belief simplex and does not scale to large problems.
In this paper, we introduce a new policy iteration algorithm for POMDPs. This algorithm, called incremental least
squares policy iteration (ILSPI), concentrates only on the belief states that are actually reachable by the POMDP. As
noted in (Pineau, Gordon, & Thrun 2003), the reachable
belief points of most practical POMDP models constitute
only a subset of the belief simplex. It is therefore sufficient to only find optimal actions for the reachable belief
points, instead of all points in the belief simplex. The ILSPI
represents an extension of the least squares policy iteration
Abstract
We present a new algorithm, called incremental least squares policy iteration (ILSPI), for finding the infinite-horizon stationary policy of partially observable Markov decision processes (POMDPs). The ILSPI computes a basis representation of the infinite-horizon value function by minimizing the square of the Bellman residual, and it performs policy improvement on reachable belief states. A number of optimal basis functions are determined by the algorithm to minimize the Bellman residual incrementally, via efficient computations. We show that, by using the optimally determined basis functions, the policy can be improved successively on a set of most probable belief points sampled from the reachable belief set. As the ILSPI is based on belief sample points, it represents a point-based policy iteration method. Results on four benchmark problems show that the ILSPI compares competitively to its value-iteration counterparts in terms of both performance and computational efficiency.
Introduction
The partially observable Markov decision process (POMDP)
(Smallwood & Sondik 1973; Sondik 1978; Kaelbling,
Littman, & Cassandra 1998) provides a rich mathematical
framework for planning under uncertainty. The POMDP inherits from its predecessor, the Markov decision process
(MDP) (Bellman 1957), the uncertainty about state transitions that results from taking an action. However, the
POMDP goes one step further by introducing uncertainty
about the state itself: the states are not observed directly
but are inferred from observations that depend probabilistically on them. This second type of uncertainty makes the
POMDP a more realistic decision model in many planning
problems, where the underlying state is hidden and one only
observes features that partially characterize the state.
At any given time the uncertainties about the POMDP
states are summarized in the belief state, which is defined as
the probability distribution over the states, given the history
of past observations and actions. The goal in POMDP planning is to find a policy that maps any belief state to an optimal action, with the objective of maximizing the expected
future reward of each belief state over a specified horizon. It
was shown in (Smallwood & Sondik 1973) that the dynamic
Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
(LSPI) for MDP (Lagoudakis & Parr 2002). The ILSPI extends LSPI in two respects: (a) it solves the policy for a
POMDP, including the MDP as a special case; (b) it determines the optimal basis functions to incrementally reduce
the Bellman residual. As the ILSPI is based on a finite set
of belief points, it can be thought of as a point-based policy
iteration algorithm.
A policy iteration algorithm iteratively applies the policy
improvement theorem to obtain successively improved policies. Policy iteration consists of two basic steps: the first
is policy evaluation, in which we compute the value function V^π(b) by solving the Bellman equation (3); the second is policy improvement, in which we improve the policy
according to Theorem 1. These two steps are performed alternately until V^π(b) converges for all b.
The Howard-Blackwell policy improvement theorem requires that the maximization in (5) be performed for every
possible b. This poses great challenges to policy iteration
for POMDPs, as the belief state b is continuous. Performing
policy improvement in the entire belief simplex is computationally expensive, and is difficult when the optimal value
function is not piecewise linear (Sondik 1978).
The ILSPI algorithm focuses on improving the policy
only on the belief points which are reachable by the POMDP.
We show that policy improvement can be achieved strictly in
the reachable belief states.
The ILSPI Algorithm for POMDP
The Infinite Horizon POMDP Problem
A POMDP is defined as a tuple (S, A, T, O, Ω, R), where S,
A, and O respectively denote finite sets of states, actions, and
observations; T contains the state-transition matrices, with T_{ss'}(a)
the probability of transiting to state s' by taking action a in
state s; Ω contains the observation functions, with Ω_{s'o}(a) the probability of observing o after taking action a and transiting to
state s'; and R is a reward function, with R(s, a) the expected immediate reward received when taking action a in state s.
The states are not observed directly in a POMDP. Instead
the agent maintains a belief state, defined as the probability
distribution over the states given past actions and observations. The belief state constitutes a continuous-state Markov
process (Smallwood & Sondik 1973). Given that at time t−1
the agent has belief state b and takes action a, and that the
agent observes o at time t, the belief state at time t follows
from Bayes rule and is given by
$$b_o^a(s') = \frac{\sum_{s\in S} b(s)\, T_{ss'}(a)\, \Omega_{s'o}(a)}{p(o\,|\,b,a)} \qquad (1)$$
where
$$p(o\,|\,b,a) = \sum_{s'\in S}\sum_{s\in S} b(s)\, T_{ss'}(a)\, \Omega_{s'o}(a) \qquad (2)$$
is the probability of transiting from b to b_o^a.
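As an illustration, the following is a minimal NumPy sketch of the update (1)-(2); it is not the authors' code, and the array layout (T[a] an |S|x|S| transition matrix, Omega[a] an |S|x|O| observation matrix) is an assumption of the sketch.

import numpy as np

def belief_update(b, a, o, T, Omega):
    """Return (b_o^a, p(o|b,a)) for belief b, action a, observation o.

    b     : (|S|,)          current belief state
    T     : (|A|, |S|, |S|) T[a, s, s'] = probability of s -> s' under action a
    Omega : (|A|, |S|, |O|) Omega[a, s', o] = probability of observing o in s' after a
    """
    predicted = b @ T[a]                       # sum_s b(s) T_{ss'}(a), shape (|S|,)
    unnormalized = predicted * Omega[a, :, o]  # multiply by Omega_{s'o}(a)
    p_o = unnormalized.sum()                   # Eq. (2)
    if p_o == 0.0:
        return None, 0.0                       # observation o cannot occur under (b, a)
    return unnormalized / p_o, p_o             # Eq. (1)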
Policy Improvement in Reachable Belief States  Let B_0 be a set of initial belief points (states) at time t = 0. For time t = 1, 2, ..., let B_t be the set of all possible b_o^a in (1), ∀ b ∈ B_{t-1}, ∀ a ∈ A, ∀ o ∈ O, such that p(o|b,a) > 0. Then ∪_{t=0}^∞ B_t is the set of belief points reachable by the POMDP by starting from B_0. It is known that for most practical POMDP models, ∪_{t=0}^∞ B_t constitutes only a subset of the belief simplex (Pineau, Gordon, & Thrun 2003). It is therefore sufficient to improve the policy only on the belief points in ∪_{t=0}^∞ B_t, since they are the belief states that the POMDP will encounter by starting from B_0. It is reasonable to let B_0 contain only the uniform (most uncertain) belief state, i.e., the centroid of the belief simplex.
For any given set of initial belief states (points) B_0, we have the following policy improvement theorem.
Theorem 2 (Policy improvement in reachable belief states)
Let B_0 be a set of initial belief points (states) at time t = 0. For time t = 1, 2, ..., let B_t be the set of all possible b_o^a in (1), ∀ b ∈ B_{t-1}, ∀ a ∈ A, ∀ o ∈ O, such that p(o|b,a) > 0. Let π be a stationary policy and V^π(b) its infinite-horizon value function. Define the Q function
$$Q^\pi(b,a) = \sum_{s\in S} b(s)\,R(s,a) + \gamma \sum_{o\in O} p(o\,|\,b,a)\, V^\pi\big(b_o^a\big) \qquad (7)$$
Note that we have used the subscript and superscript to indicate explicitly that b_o^a depends on a and o. We are interested in finding
a stationary policy that maps any belief state to an optimal
action, with the goal of maximizing the expected discounted future reward of each belief state over the infinite
horizon.
Let π denote a stationary policy producing action a = π(b) for any belief state b. Let V^π(b) denote the expected
sum of discounted rewards accrued by the POMDP agent
when it has an initial belief b and follows the stationary policy
π over the infinite horizon. We refer to V^π(b) as the infinite-horizon value function of π throughout the paper. According
to (Sondik 1978), π must satisfy the Bellman equation
$$V^\pi(b) = \sum_{s\in S} b(s)\,R(s,\pi(b)) + \gamma \sum_{o\in O} p(o\,|\,b,\pi(b))\, V^\pi\big(b_o^{\pi(b)}\big) \qquad (3)$$
where b_o^{π(b)} is determined by (1) and γ ∈ [0, 1) is a discount factor.
For any stationary policy π, we have the following policy improvement theorem (Blackwell 1965; Howard 1971; Sondik 1978).
Theorem 1 (Howard-Blackwell policy improvement) Let V^π(b) be the infinite-horizon value function of a stationary policy a = π(b). Define the Q function
$$Q^\pi(b,a) = \sum_{s\in S} b(s)\,R(s,a) + \gamma \sum_{o\in O} p(o\,|\,b,a)\, V^\pi\big(b_o^a\big) \qquad (4)$$
where b_o^a is defined by (1), and the new policy
$$\pi'(b) = \arg\max_{a} Q^\pi(b,a) \qquad (5)$$
Then π' is an improved policy over π, i.e.,
$$V^{\pi'}(b) \ge V^{\pi}(b) \qquad (6)$$
for any belief point (belief state) b.
Then the new policy
$$\pi'(b) = \arg\max_{a\in A} Q^\pi(b,a), \qquad \forall\, b \in \cup_{t=0}^{\infty} B_t \qquad (8)$$
improves over π for any belief state (point) b ∈ ∪_{t=0}^∞ B_t, i.e.,
$$V^{\pi'}(b) \ge V^{\pi}(b), \qquad \forall\, b \in \cup_{t=0}^{\infty} B_t \qquad (9)$$
Proof: By (3), (7), and (8), we have, ∀ b ∈ ∪_{t=0}^∞ B_t,
$$V^{\pi}(b) = Q^{\pi}(b, \pi(b)) \le Q^{\pi}(b, \pi'(b)) \qquad (10)$$
This equation must also hold for b_o^{π'(b)} for any o ∈ O, because they are all members of ∪_{t=0}^∞ B_t. Therefore we can keep expanding Q^π(b, π'(b)) until every π appearing in it is replaced by π', at which point the rightmost side of (10) becomes Q^{π'}(b, π'(b)) = V^{π'}(b), and we have V^π(b) ≤ V^{π'}(b), which completes the proof.
The reachable belief set ∪_{t=0}^∞ B_t may still be infinitely large. We obtain a manageable belief set by sampling from ∪_{t=0}^∞ B_t. As noted in (7), the term p(o|b,a) V^π(b_o^a) can be ignored if p(o|b,a) is near zero. Therefore, when expanding B_t from B_{t-1}, we perform the following procedure, similar to that in (Pineau, Gordon, & Thrun 2003). For any b ∈ B_{t-1}, we draw a sample o according to p(o|b,a) for every a ∈ A. We then obtain |A| new belief points b_o^a, each produced by a different a. We select the single point among these new points that has the maximum Euclidean distance from its ancestral points, and put this selected point in B_t. In this way, we expand a trajectory of belief points starting from each initial point in B_0. A trajectory usually expands only a small number of unique belief points and is terminated when no new points can be expanded; otherwise, it is terminated when it reaches a specified length. A number of such trajectories are drawn and then merged to produce B. We remove redundant and near-duplicate elements in B to yield the final belief sample set.
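The following sketch illustrates the expansion of one trajectory of belief points as described above; it reuses belief_update() from the earlier sketch, and the distance threshold tol, as well as the use of all points collected so far as the "ancestral points", are assumptions of this sketch.

import numpy as np

def expand_belief_trajectory(b0, T, Omega, max_len=50, tol=1e-3, rng=None):
    """Expand one trajectory of belief points starting from b0 (shape (|S|,))."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions, n_states, n_obs = Omega.shape
    points = [np.asarray(b0, dtype=float)]
    b = points[0]
    for _ in range(max_len):
        candidates = []
        for a in range(n_actions):
            p_obs = (b @ T[a]) @ Omega[a]                  # p(o|b,a) for all o, Eq. (2)
            o = rng.choice(n_obs, p=p_obs / p_obs.sum())   # draw one observation
            b_new, _ = belief_update(b, a, o, T, Omega)
            candidates.append(b_new)
        # keep the candidate farthest (Euclidean) from the points collected so far
        dists = [min(np.linalg.norm(c - p) for p in points) for c in candidates]
        best = int(np.argmax(dists))
        if dists[best] < tol:          # no sufficiently new point: stop the trajectory
            break
        points.append(candidates[best])
        b = candidates[best]
    return points

Several such trajectories, started from the points in B_0, would then be merged and pruned of near-duplicates to form the final sample set B.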
As noted in (Pineau, Gordon, & Thrun 2003), the belief samples thus produced yield a uniform representation of the belief states most probably visited by the POMDP. The belief states that are scarcely visited have negligible contribution to the right-hand side of (7) and can be ignored. The uniform sampling ensures that the V^π computed by minimizing the Bellman residual on the samples will generalize well to the remaining belief points most probably visited. In addition, the optimally determined basis functions also enhance the generalization of V^π.
Basis Representation of the Value Function  We represent the infinite-horizon value function V^π(b) as a linear combination of basis functions of b,
$$V^\pi(b) = w^T \phi(b) \qquad (11)$$
where
$$\phi(b) = \big[\phi_0(b), \phi_1(b), \ldots, \phi_N(b)\big]^T \qquad (12)$$
is a column containing N + 1 basis functions, with φ_0(b) ≡ 1 accounting for a constant bias term. Substituting (11) into the Bellman equation (3), we obtain
$$w^T\phi(b) = \sum_{s\in S} R(s,\pi(b))\,b(s) + \gamma \sum_{o\in O} p(o\,|\,b,\pi(b))\, w^T\phi\big(b_o^{\pi(b)}\big) \qquad (13)$$
which is simplified to
$$w^T\phi(b) = \eta^{\pi(b)} + w^T \sum_{o\in O} \phi\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)}$$
by defining
$$\eta^{\pi(b)} = \sum_{s\in S} R(s,\pi(b))\,b(s) \qquad (14)$$
$$\rho_o^{\pi(b)} = \gamma\, p(o\,|\,b,\pi(b)) = \gamma \sum_{s,s'\in S} b(s)\, T_{ss'}(\pi(b))\, \Omega_{s'o}(\pi(b)) \qquad (15)$$
Note there is an equation of the form of (13) for every belief point b ∈ B.

Policy Evaluation by Minimizing Bellman Residual  We want to compute the value function V^π, which evaluates how good the policy π is. This V^π is computed by solving the Bellman equation which, under the basis representation of V^π, is given by (13). The problem reduces to finding the φ and w that satisfy the Bellman equation as closely as possible. To do so, we utilize the idea of LSPI for MDPs (Lagoudakis & Parr 2002) and minimize the square of the Bellman residual (the difference between the left side and the right side of (13)), accumulated over all b ∈ B,
$$e(\phi, w) = \sum_{b\in B} \Big[ w^T\phi(b) - \eta^{\pi(b)} - w^T \sum_{o\in O} \phi\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big]^2 \qquad (16)$$
For given φ, the w is solved as
$$w = M^{-1} \sum_{b\in B} \eta^{\pi(b)} \Big[ \phi(b) - \sum_{o\in O} \phi\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big] \qquad (17)$$
where
$$M = \sum_{b\in B} \Big[ \phi(b) - \sum_{o\in O} \phi\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big] \Big[ \phi(b) - \sum_{o\in O} \phi\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big]^T \qquad (18)$$
Substituting the solution of w in (17) back into (16) gives
$$e(\phi) = \sum_{b\in B} \Big[ \big(\eta^{\pi(b)}\big)^2 - \eta^{\pi(b)}\, w^T \Big( \phi(b) - \sum_{o\in O} \phi\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big) \Big] \qquad (19)$$
where w is related to φ by (17), and therefore e(φ) is a functional with φ as the free variable. By minimizing e(φ), we can determine the optimal φ for V^π. Recalling φ(b) = [1, φ_1(b), ..., φ_N(b)]^T, this amounts to determining N, the number of basis functions, and the functional form of each basis function φ_n(·), n = 1, ..., N. We consider an incremental procedure and determine the optimal basis functions one after another. The following theorem gives an efficient algorithm for such an incremental procedure. The proof of the theorem is given in the Appendix.

Theorem 3  Let φ(b) = [1, φ_1(b), ..., φ_N(b)]^T. Let φ_{N+1}(b) be a single basis function. Assume the M matrices in (18), corresponding to φ and [φ, φ_{N+1}]^T, are all non-degenerate. Then
$$\delta e(\phi, \phi_{N+1}) = e(\phi) - e\big([\phi, \phi_{N+1}]^T\big) = \Big[ c^T w - \sum_{b\in B} \eta^{\pi(b)} \Big( \phi_{N+1}(b) - \sum_{o\in O} \phi_{N+1}\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big) \Big]^2 q^{-1} \qquad (20)$$
where w is given by (17), M is given by (18), and
$$c = \sum_{b\in B} \Big[ \phi(b) - \sum_{o\in O} \phi\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big] \Big[ \phi_{N+1}(b) - \sum_{o\in O} \phi_{N+1}\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big] \qquad (21)$$
$$d = \sum_{b\in B} \Big[ \phi_{N+1}(b) - \sum_{o\in O} \phi_{N+1}\big(b_o^{\pi(b)}\big)\,\rho_o^{\pi(b)} \Big]^2 \qquad (22)$$
$$q = d - c^T M^{-1} c > 0 \qquad (23)$$
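For a fixed basis φ and a fixed policy π on the sample set B, (16)-(19) form an ordinary linear least-squares problem. The sketch below (a minimal NumPy illustration with assumed array shapes, not the authors' Matlab implementation) builds the "residual features" φ(b) - Σ_o φ(b_o^{π(b)}) ρ_o^{π(b)}, solves (17)-(18) for w, and evaluates the squared Bellman residual (16).

import numpy as np

def solve_weights(phi_b, phi_bo, eta, rho):
    """
    phi_b  : (|B|, K)       phi(b) for each sampled belief b (K = N+1 basis values)
    phi_bo : (|B|, |O|, K)  phi(b_o^{pi(b)}) for each b and o
    eta    : (|B|,)         eta^{pi(b)}, Eq. (14)
    rho    : (|B|, |O|)     rho_o^{pi(b)} = gamma * p(o|b, pi(b)), Eq. (15)
    """
    # residual features: phi(b) - sum_o phi(b_o^{pi(b)}) rho_o^{pi(b)}
    phi_res = phi_b - np.einsum('bok,bo->bk', phi_bo, rho)
    M = phi_res.T @ phi_res             # Eq. (18)
    rhs = phi_res.T @ eta               # sum_b eta^{pi(b)} * residual feature
    w = np.linalg.solve(M, rhs)         # Eq. (17)
    e = np.sum((phi_res @ w - eta) ** 2)  # squared Bellman residual, Eq. (16)
    return w, e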
Table 2: The ILSPI Algorithm: Main Loop
function [w, φ(·)] = ILSPI(POMDP, B, Φ)
% Φ is a set of candidate basis functions
% Pre-computation
Compute b_o^a with (1) and ρ_o^{π(b)=a} with (15), ∀ b ∈ B, ∀ a ∈ A, ∀ o ∈ O;
Compute Φ_B = {ψ(b) : b ∈ B, ψ ∈ Φ};
while (1/|B|) Σ_{b∈B} V^π(b) does not converge
    % Policy Evaluation
    Compute η^{π(b)} with (14), ∀ b ∈ B, and Φ̄_B = {ψ(b_o^{π(b)}) : b ∈ B, o ∈ O, ψ ∈ Φ};
    [w, φ(·)] = ILS(Φ_B, Φ̄_B, η, ρ);
    % Policy Improvement
    Compute Q^π(b, a) with (7) and (11), ∀ b ∈ B, ∀ a ∈ A;
    Update π(b) = arg max_{a∈A} Q^π(b, a), ∀ b ∈ B;
end
By (20) and (23), δe(φ, φ_{N+1}) ≥ 0; thus adding φ_{N+1} to φ either decreases the squared Bellman residual or leaves it unchanged (the latter indicating convergence). The decrease δe(φ, φ_{N+1}) depends on φ_{N+1}. By selecting the basis function that brings the maximum decrease, we incrementally minimize the squared Bellman residual e(φ). The pseudo-Matlab code for the ILSPI policy evaluation is given in Table 1.
Table 1: The ILSPI Algorithm: Policy Evaluation
function [w, φ(·)] = ILS(Φ_B, Φ̄_B, η, ρ)
% Φ_B = {ψ(b) : b ∈ B, ψ ∈ Φ}
% Φ̄_B = {ψ(b_o^{π(b)}) : b ∈ B, o ∈ O, ψ ∈ Φ}
% Initialization
N = 0; φ(·) = 1;
Compute M with (18), w with (17), and e_0 with (16);
while the sequence {e_n}_{n=0:N} has not converged
    for all basis functions φ_{N+1} = ψ ∈ Φ
        Compute c with (21), d with (22), and q with (23);
        if q = 0
            Φ = Φ \ {ψ}; continue;
        else
            compute δe(φ, ψ) using (20);
        end
    end
    Compute ψ* = arg max_{ψ∈Φ} δe(φ, ψ);
    % Update
    φ(·) = [φ^T(·), ψ*(·)]^T; Φ = Φ \ {ψ*};
    M ← M^new and w ← w^new, where M^new is computed with (A-1) and w^new with (A-3);
    e_{N+1} = e_N − δe(φ, ψ*); N = N + 1;
end
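The greedy loop of Table 1 can be written compactly once the residual features of the candidate basis functions are available. Below is a minimal NumPy sketch of that loop; the names, shapes, and stopping tolerance are assumptions, and for brevity it re-solves the normal equations after each addition rather than applying the rank-one updates (A-1) and (A-3) used in the paper.

import numpy as np

def incremental_ls(phi0_res, cand_res, eta, max_bases=30, tol=1e-9):
    """
    phi0_res : (|B|,)      residual feature of the constant bias basis phi_0(b) = 1
    cand_res : (|B|, |Φ|)  residual features psi(b) - sum_o psi(b_o^{pi(b)}) rho_o^{pi(b)}
    eta      : (|B|,)      eta^{pi(b)}
    Returns the indices of the selected candidates, the weights w, and e(phi).
    """
    nB, n_cand = cand_res.shape
    Phi = np.asarray(phi0_res, dtype=float).reshape(-1, 1)  # current residual design matrix
    M = Phi.T @ Phi                               # Eq. (18)
    w = np.linalg.solve(M, Phi.T @ eta)           # Eq. (17)
    e = np.sum(eta ** 2) - eta @ (Phi @ w)        # Eq. (19)
    available = list(range(n_cand))
    chosen = []
    while available and len(chosen) < max_bases:
        best_gain, best_j = 0.0, None
        for j in available:
            x = cand_res[:, j]
            c = Phi.T @ x                         # Eq. (21)
            d = x @ x                             # Eq. (22)
            q = d - c @ np.linalg.solve(M, c)     # Eq. (23)
            if q <= tol:                          # (near-)degenerate candidate: skip it
                continue
            g = c @ w - eta @ x
            gain = g * g / q                      # delta-e of Eq. (20)
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None or best_gain <= tol:    # no candidate reduces the residual
            break
        chosen.append(best_j)
        available.remove(best_j)
        Phi = np.column_stack([Phi, cand_res[:, best_j]])
        M = Phi.T @ Phi                           # the paper instead updates M, w via (A-1), (A-3)
        w = np.linalg.solve(M, Phi.T @ eta)
        e -= best_gain
    return chosen, w, e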
Table 3: Time and Memory Complexity of the ILSPI

Computation          Time                   Memory
ρ_o^{π(b)} (out)     O(|B||A||O||S|^2)      O(|B||A||O|)
ρ_o^{π(b)} (in)      O(|B||O||S|^2 L)       O(1)
b_o^a (out)          O(|B||A||O||S|^2)      O(|B||A||O||S|)
b_o^a (in)           O(|B||O||S|^2 L)       O(|S|)
φ(b) (out)           O(|B||Φ| Υ_φ)          O(|B||Φ|)
φ(b) (in)            O(|B||Φ| Υ_φ L)        O(N)
φ(b_o^a) (out)       O(|B||A||O||Φ| Υ_φ)    O(|B||A||O||Φ|)
φ(b_o^a) (in)        O(|B||O||Φ| Υ_φ L)     O(N)
η^{π(b)} (out)       O(|B||A||S|)           O(|B||A|)
η^{π(b)} (in)        O(|B||S| L)            O(|B|)
c (in)               O(|B||Φ| N L)          O(|Φ| N)
d (in)               O(|B||Φ| N L)          O(1)
u (in)               O(|Φ| N^2 L)           O(|Φ| N)
q (in)               O(|B| N^2 L)           O(N)
w (in)               O(N^2 L)               O(N)
Q^π(b,a) (in)        O(|B||A||O| N L)       O(|B|)
Some computation tasks can be performed either inside or outside the loop, at different time and memory costs. Table 2 reflects one specific combination of the “out” and “in” choices. The results presented in this paper were produced by computing b_o^a, ρ_o^{π(b)}, and φ(b) outside of the loop and all others inside the loop.
In Table 3, u = M^{-1}c, L is the number of ILSPI iterations (i.e., passes of the while-loop in Table 2) performed, and Υ_φ is the time of computing a single basis function ψ ∈ Φ at a single belief point b ∈ B. For the radial basis functions (RBFs) used to produce the results in this paper, Υ_φ = |S|. The vector c is computed in time O(|B||Φ|NL) by storing the elements of c for every ψ ∈ Φ, which consumes memory of O(|Φ|N). The intermediate variable u = M^{-1}c is introduced to speed up the computation of q = d − c^T M^{-1} c, and u is stored for every ψ ∈ Φ using memory of O(|Φ|N).
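As an illustration of the Υ_φ = |S| cost, the following sketch generates a radial-basis-function candidate set over belief points; the choice of centers (e.g., the sampled belief points themselves) and the width σ are assumptions of this sketch and are not details specified by the paper.

import numpy as np

def rbf_candidates(B, centers, sigma=0.25):
    """B: (|B|, |S|) belief points; centers: (|Φ|, |S|). Returns psi(b) for all pairs."""
    # squared Euclidean distances between every belief point and every center:
    # each evaluation touches all |S| belief components, hence cost |S| per basis value
    d2 = ((B[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))      # shape (|B|, |Φ|)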
Policy Improvement by Pointwise Maximization of Q^π  Once the basis representation of the value function V^π, i.e., φ and w, is found in the policy evaluation, we plug V^π(b_o^a) = w^T φ(b_o^a) into (7) to obtain Q^π(b, a). Then we perform the maximization formulated in (8) to get the improved policy π'. This maximization is performed in a pointwise manner, working on each b ∈ B one by one. The ILSPI policy improvement is given in Table 2, as part of the main ILSPI loop, presented in the form of pseudo-Matlab code.
Note that the ILSPI outputs w and φ(·), giving V^π(b) = w^T φ(b), which is substituted into (7) to obtain Q^π(b, a) and produce the policy π(b) = arg max_a Q^π(b, a).
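A minimal sketch of this pointwise maximization is given below (assumed shapes; R is taken as an |S|x|A| array of expected immediate rewards). It reuses belief_update() from the earlier sketch, and phi() stands for an assumed callable that returns the (N+1)-vector of basis values at a belief point.

import numpy as np

def improve_policy(B, w, phi, T, Omega, R, gamma):
    """B: list of belief vectors. Returns the improved policy and V^pi'(b) on B."""
    n_actions, _, n_obs = Omega.shape
    new_pi = np.empty(len(B), dtype=int)
    values = np.empty(len(B))
    for i, b in enumerate(B):
        Q = np.empty(n_actions)
        for a in range(n_actions):
            q = b @ R[:, a]                               # sum_s b(s) R(s, a)
            for o in range(n_obs):
                b_next, p_o = belief_update(b, a, o, T, Omega)
                if p_o > 0.0:
                    q += gamma * p_o * (w @ phi(b_next))  # gamma p(o|b,a) V^pi(b_o^a), Eq. (7)
            Q[a] = q
        new_pi[i] = np.argmax(Q)                          # Eq. (8)
        values[i] = Q[new_pi[i]]
    return new_pi, values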
Time and Memory Complexity  The time and memory complexity of the ILSPI is given in Table 3, where time refers to the number of multiplications and memory refers to the number of intermediate variables (not counting the input variables) involved in the computation. A computation task is performed outside the while-loop in Table 2 if it is marked with “out” and inside the while-loop if it is marked with “in”.
Experimental Results
We demonstrate the performance of the proposed ILSPI on four benchmark problems. The first three, namely Tiger-grid, Hallway, and Hallway2, were introduced in (Littman, Cassandra, & Kaelbling 1995) and have since been widely used to test scalable POMDP solutions. The fourth problem is the Tag problem introduced in (Pineau, Gordon, & Thrun 2003), which is relatively new and has a larger problem size. The proposed ILSPI is compared to five point-based value iteration algorithms: Grid (Brafman 1997), PBUA (Poon 2001), PBVI (Pineau, Gordon, & Thrun 2003), Perseus (Spaan & Vlassis 2004), and HSVI (Smith & Simmons 2004; 2005), in terms of performance and policy computation time. To replicate the experiments of previous authors, we measure algorithm performance by the accumulated discounted reward averaged over N_test independent tests. For Tiger-grid, N_test = 151 and each test terminates after the agent takes 500 actions; during a test the agent is reset each time it reaches the goal. For Hallway and Hallway2, N_test = 251 and each test terminates when the agent reaches the goal or has taken a maximum of 251 actions. For Tag, N_test = 1000 and each test terminates when the agent successfully tags its opponent or has taken a maximum of 100 actions.
Figure 2: Comparison of the converged ILSPI policy with the exact policy, for the tiger problem. Left panel: exact value function and optimal policy; right panel: value function and policy of the ILSPI. The horizontal axis is b(1) = p(s = "tiger on the left" | history); the vertical axis is the value function. Green circle denotes the action "open left door"; blue plus denotes the action "open right door"; red dot denotes the action "listen".

Table 4: Results on the benchmark problems, where the time is displayed in seconds. The results marked with (*) are those we obtained by coding the respective algorithms in Matlab; other results may have been coded in languages other than Matlab and executed on computer platforms different from ours.
Method                                    Reward    Time (s)
Tiger-grid (|S| = 33, |A| = 5, |O| = 17)
  Grid (Brafman 1997)                     0.94      n.v.
  PBUA (Poon 2001)                        2.30      12116
  PBVI (Pineau, Gordon, & Thrun 2003)     2.25      3448
  PBVI (*)                                2.23      2239
  Perseus (Spaan & Vlassis 2004)          2.34      104
  HSVI1 (Smith & Simmons 2004)            2.35      10341
  HSVI2 (Smith & Simmons 2005)            2.30      52
  ILSPI (*)                               2.21      136
Hallway (|S| = 57, |A| = 5, |O| = 21)
  PBUA (Poon 2001)                        0.53      450
  PBVI (Pineau, Gordon, & Thrun 2003)     0.53      288
  PBVI (*)                                0.54      1166
  Perseus (Spaan & Vlassis 2004)          0.51      35
  HSVI1 (Smith & Simmons 2004)            0.52      10836
  HSVI2 (Smith & Simmons 2005)            0.52      2.4
  ILSPI (*)                               0.54      66
Hallway2 (|S| = 89, |A| = 5, |O| = 17)
  PBUA (Poon 2001)                        0.35      27898
  PBVI (Pineau, Gordon, & Thrun 2003)     0.34      360
  PBVI (*)                                0.35      2345
  Perseus (Spaan & Vlassis 2004)          0.35      10
  HSVI1 (Smith & Simmons 2004)            0.35      10010
  HSVI2 (Smith & Simmons 2005)            0.35      1.5
  ILSPI (*)                               0.30      206
Tag (|S| = 870, |A| = 5, |O| = 30)
  PBVI (Pineau, Gordon, & Thrun 2003)     -9.180    180880
  Perseus (Spaan & Vlassis 2004)          -6.17     1670
  HSVI1 (Smith & Simmons 2004)            -6.37     10113
  HSVI2 (Smith & Simmons 2005)            -6.36     24
  ILSPI (*)                               -12.3     737
Figure 1: The value function V^π(b) at each ILSPI iteration for the tiger problem (Kaelbling, Littman, & Cassandra 1998), which has |S| = 2 (six panels, Iteration 0 through Iteration 5). The horizontal axis is b(1) = p(s = "tiger on the left" | history); the vertical axis is V. The initial policy is a random one. Green circle denotes the action "open left door"; blue plus denotes the action "open right door"; red dot denotes the action "listen".
The experimental results of the proposed ILSPI algorithm
are summarized in Table 4, in comparison to the other five
algorithms. It is demonstrated that for each of the four problems the ILSPI compares competitively to its value-iteration
counterparts in terms of both performance and computational time. The time efficiency of ILSPI can be attributed
to two facts: (a) ILSPI leverages the simplicity of the least
squares criterion (squared Bellman residual) and uses matrix tricks to speed up the computation; (b) ILSPI typically
converges within a smaller number of iterations than value
iteration methods. This is not surprising as fast convergence
is generally observed in policy iteration algorithms (Sutton
& Barto 1998).
To demonstrate the accuracy of the ILSPI algorithm as well as its fast convergence, we apply the algorithm to the tiger problem (Kaelbling, Littman, & Cassandra 1998). Figure 1 shows the evolution of the value function V^π(b) over the ILSPI iterations, starting from a random policy. Figure 2 compares the V^π(b) of the converged ILSPI policy with that of the exact policy. It is seen from the figures that the ILSPI converges after only three iterations and that the converged policy is almost identical to the exact policy.
Conclusions
We have presented a new algorithm, incremental least squares policy iteration (ILSPI), for solving infinite-horizon stationary policies of partially observable Markov decision processes (POMDPs). The ILSPI represents the value function with optimally determined basis functions and computes it by minimizing the squared error between the left-hand side and the right-hand side of the Bellman equation (the Bellman residual). In the policy improvement step, the ILSPI improves the policy on the belief states reachable by the POMDP. Like policy iteration in general, the ILSPI converges to the optimal policy within a small number of iterations. In addition, the simplicity of the least squares criterion is leveraged to make the ILSPI policy evaluation efficient. The ILSPI was applied to four benchmark problems and was demonstrated to be competitive with value-iteration algorithms in terms of performance and computational efficiency.

Appendix: Proof of Theorem 3
For any belief state b, we define the ϕ notation $\varphi(b) = \phi(b) - \sum_{o\in O} \phi(b_o^{\pi(b)})\,\rho_o^{\pi(b)}$ and $\varphi_{N+1}(b) = \phi_{N+1}(b) - \sum_{o\in O} \phi_{N+1}(b_o^{\pi(b)})\,\rho_o^{\pi(b)}$. We use $\varphi$ and $\varphi_{N+1}$ in this proof to simplify the equations. Let $\phi^{new} = [\phi, \phi_{N+1}]^T$, which is transformed to the ϕ notation as $\varphi^{new} = [\varphi, \varphi_{N+1}]^T$. By (18), the M matrix corresponding to $\phi^{new}$ is
$$M^{new} = \sum_{b\in B} \begin{bmatrix} \varphi(b) \\ \varphi_{N+1}(b) \end{bmatrix} \begin{bmatrix} \varphi(b) \\ \varphi_{N+1}(b) \end{bmatrix}^T = \begin{bmatrix} M & c \\ c^T & d \end{bmatrix} \qquad \text{(A-1)}$$
where M, c, and d are as in (18), (21), and (22), respectively. By the conditions of the theorem, the matrices M and $M^{new}$ are all full rank. Using the block matrix inversion formula, we get
$$(M^{new})^{-1} = \begin{bmatrix} M^{-1} + M^{-1} c\, q^{-1} c^T M^{-1} & -M^{-1} c\, q^{-1} \\ -q^{-1} c^T M^{-1} & q^{-1} \end{bmatrix} \qquad \text{(A-2)}$$
where q is given in (23). By (17), the $w^{new}$ corresponding to $[\phi^T, \phi_{N+1}]^T$ is
$$w^{new} = (M^{new})^{-1} \sum_{b\in B} \eta^{\pi(b)} \begin{bmatrix} \varphi(b) \\ \varphi_{N+1}(b) \end{bmatrix} = \begin{bmatrix} w + M^{-1} c\, q^{-1} g \\ -q^{-1} g \end{bmatrix} \qquad \text{(A-3)}$$
with
$$g = c^T w - \sum_{b\in B} \eta^{\pi(b)} \varphi_{N+1}(b) \qquad \text{(A-4)}$$
Hence,
$$[\varphi^{new}(b)]^T w^{new} = [\varphi(b)]^T w + \big( [\varphi(b)]^T M^{-1} c - \varphi_{N+1}(b) \big)\, g\, q^{-1} \qquad \text{(A-5)}$$
By (19),
$$e(\phi^{new}) = \sum_{b\in B} \Big[ \big(\eta^{\pi(b)}\big)^2 - \eta^{\pi(b)} (w^{new})^T \varphi^{new}(b) \Big] \qquad \text{(A-6)}$$
Substituting (A-5), and applying (17), (19), and (A-4),
$$e(\phi^{new}) = e(\phi) - \Big[ c^T w - \sum_{b\in B} \eta^{\pi(b)} \varphi_{N+1}(b) \Big]^2 q^{-1} \qquad \text{(A-7)}$$
which is (20) in the ϕ notation. By the conditions of the theorem, $M^{new}$ is full rank and is positive definite by construction. By (A-2), $q^{-1}$ is a diagonal element of $(M^{new})^{-1}$, hence $q^{-1} > 0$ and q > 0. The proof is thus completed.
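As a numerical sanity check on this derivation (not part of the paper), the snippet below verifies on random data that the incrementally updated weights of (A-3) coincide with re-solving (17) for the augmented basis, and that the resulting decrease of (19) equals the expression in (20).

import numpy as np

rng = np.random.default_rng(0)
nB, N = 50, 4
Phi = rng.normal(size=(nB, N))        # residual features of the current basis (one column per phi_n)
phi_new = rng.normal(size=nB)         # residual feature of the candidate basis function
eta = rng.normal(size=nB)

M = Phi.T @ Phi
w = np.linalg.solve(M, Phi.T @ eta)                         # Eq. (17)
e_old = np.sum(eta ** 2) - eta @ (Phi @ w)                  # Eq. (19)

c = Phi.T @ phi_new                                         # Eq. (21)
d = phi_new @ phi_new                                       # Eq. (22)
q = d - c @ np.linalg.solve(M, c)                           # Eq. (23)
g = c @ w - eta @ phi_new                                   # Eq. (A-4)
w_new_incremental = np.concatenate([w + np.linalg.solve(M, c) * (g / q), [-g / q]])  # Eq. (A-3)

Phi_aug = np.column_stack([Phi, phi_new])
w_new_direct = np.linalg.solve(Phi_aug.T @ Phi_aug, Phi_aug.T @ eta)
e_new = np.sum(eta ** 2) - eta @ (Phi_aug @ w_new_direct)   # Eq. (19) for the augmented basis

assert np.allclose(w_new_incremental, w_new_direct)
assert np.isclose(e_old - e_new, g ** 2 / q)                # decrease matches Eq. (20)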
References
Bellman, R. 1957. Dynamic Programming. Princeton University
Press.
Blackwell, D. 1965. Discounted dynamic programming. Ann.
Math. Stat. 36:226–235.
Brafman, R. I. 1997. A heuristic variable grid solution method for POMDPs. In AAAI, 727–733.
Hansen, E. A. 1997. An improved policy iteration algorithm for partially observable MDPs. Neural Information Processing Systems 10.
Howard, R. A. 1971. Dynamic Probabilistic Systems. John Wiley
and Sons, New York.
Kaelbling, L.; Littman, M.; and Cassandra, A. 1998. Planning
and acting in partially observable stochastic domains. Artificial
Intelligence 101:99–134.
Lagoudakis, M. G., and Parr, R. 2002. Model-free least-squares
policy iteration. In Dietterich, T. G.; Becker, S.; and Ghahramani,
Z., eds., Advances in Neural Information Processing Systems 14.
Cambridge, MA: MIT Press.
Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Learning policies for partially observable environments: Scaling up. In ICML, 362–370.
Lovejoy, W. S. 1991. Computationally feasible bounds for partially observed Markov decision processes. Operations Research
39(1):162–175.
Pineau, J.; Gordon, G.; and Thrun, S. 2003. Point-based value
iteration: An anytime algorithm for POMDPs. In IJCAI, 1025 –
1032.
Poon, K.-M. 2001. A fast heuristic algorithm for decision-theoretic planning. Master's thesis, The Hong Kong University of Science and Technology.
Smallwood, R. D., and Sondik, E. J. 1973. The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071–1088.
Smith, T., and Simmons, R. 2004. Heuristic search value iteration
for POMDPs. In Proc. of UAI.
Smith, T., and Simmons, R. 2005. Point-based POMDP algorithms: Improved analysis and implementation. In Proc. of UAI.
Sondik, E. J. 1978. The optimal control of partially observable
Markov processes over the infinite horizon: Discounted costs.
Operations Research 26(2):282–304.
Spaan, M., and Vlassis, N. 2004. A point-based POMDP algorithm for robot planning. In Proc. IEEE Int. Conf. on Robotics
and Automation (ICRA), 2399–2404.
Sutton, R., and Barto, A. 1998. Reinforcement learning: An
introduction. Cambridge, MA: MIT Press.