Incremental Least Squares Policy Iteration for POMDPs

Hui Li, Xuejun Liao, and Lawrence Carin
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291, USA
{hl1, xjliao, lcarin}@ee.duke.edu

Abstract

We present a new algorithm, called incremental least squares policy iteration (ILSPI), for finding the infinite-horizon stationary policy for partially observable Markov decision processes (POMDPs). The ILSPI computes a basis representation of the infinite-horizon value function by minimizing the squared Bellman residual, and it performs policy improvement on reachable belief states. A set of optimal basis functions is determined by the algorithm to minimize the Bellman residual incrementally, via efficient computations. We show that, by using the optimally determined basis functions, the policy can be improved successively on a set of most probable belief points sampled from the reachable belief set. As the ILSPI is based on belief sample points, it represents a point-based policy iteration method. Results on four benchmark problems show that the ILSPI compares competitively to its value-iteration counterparts in terms of both performance and computational efficiency.

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction

The partially observable Markov decision process (POMDP) (Smallwood & Sondik 1973; Sondik 1978; Kaelbling, Littman, & Cassandra 1998) provides a rich mathematical framework for planning under uncertainty. The POMDP inherits from its predecessor, the Markov decision process (MDP) (Bellman 1957), the uncertainty about state transitions that results from taking an action. The POMDP goes one step further by introducing uncertainty about the state itself: the states are not observed directly but are inferred from observations that depend probabilistically on them. This second type of uncertainty makes the POMDP a more realistic decision model for many planning problems in which the underlying state is hidden and one only observes features that partially characterize the state.

At any given time, the uncertainty about the POMDP state is summarized in the belief state, defined as the probability distribution over the states given the history of past observations and actions. The goal in POMDP planning is to find a policy that maps any belief state to an optimal action, with the objective of maximizing the expected future reward of each belief state over a specified horizon. It was shown in (Smallwood & Sondik 1973) that the dynamic behavior of the belief state is a discrete-time, continuous-state Markov process and that, for a finite horizon, the optimal value of belief states is a piecewise linear and convex function. Based on these facts, the value iteration method for MDPs (Bellman 1957) was extended to POMDPs in (Smallwood & Sondik 1973). Though the algorithm in (Smallwood & Sondik 1973) is exact, it has exponential computational complexity in the worst case. More recently, researchers have developed scalable value iteration algorithms based on a finite set of belief points (Lovejoy 1991; Brafman 1997; Poon 2001; Pineau, Gordon, & Thrun 2003). In particular, Pineau et al. (2003) suggested backing up the value and its gradient on a finite set of belief points most probably reached by the POMDP. The resulting algorithm, point-based value iteration (PBVI), has proven to be a practical POMDP solution that scales up to large problems. The idea has subsequently been pursued in a number of papers (Spaan & Vlassis 2004; Smith & Simmons 2004; 2005), where various heuristics were proposed to further improve algorithmic efficiency.

Value iteration methods solve the finite-horizon policy of a POMDP. To obtain the infinite-horizon policy, one must solve the policy over successively larger horizons until the value function converges to that of the infinite horizon. An alternative approach is to solve the infinite-horizon policy directly. Sondik (1978) presented such an approach by extending the policy iteration of MDPs (Howard 1971) to the POMDP case. Sondik's algorithm focuses on improving the policy over the entire belief simplex, involving complicated computations, and therefore is appropriate only for small problems. Hansen (1997) presented an improved algorithm, integrating Sondik's multiple policy representations into a single one. Hansen's algorithm still operates on the entire belief simplex and does not scale to large problems.

In this paper, we introduce a new policy iteration algorithm for POMDPs. This algorithm, called incremental least squares policy iteration (ILSPI), concentrates only on the belief states that are actually reachable by the POMDP. As noted in (Pineau, Gordon, & Thrun 2003), the reachable belief points of most practical POMDP models constitute only a subset of the belief simplex. It is therefore sufficient to find optimal actions only for the reachable belief points, instead of all points in the belief simplex. The ILSPI represents an extension of the least squares policy iteration (LSPI) for MDPs (Lagoudakis & Parr 2002). The ILSPI extends LSPI in two respects: (a) it solves the policy for a POMDP, which includes the MDP as a special case; (b) it determines the optimal basis functions to incrementally reduce the Bellman residual. As the ILSPI is based on a finite set of belief points, it can be thought of as a point-based policy iteration algorithm.
The Infinite Horizon POMDP Problem

A POMDP is defined as a tuple (S, A, T, O, Ω, R), where S, A, and O respectively denote finite sets of states, actions, and observations; T is the set of state-transition matrices, with T_{ss'}(a) the probability of transiting to state s' by taking action a in state s; Ω is the set of observation functions, with Ω_{s'o}(a) the probability of observing o after taking action a and transiting to state s'; and R is a reward function, with R(s, a) the expected immediate reward for taking action a in state s.

The states are not observed directly in a POMDP. Instead, the agent maintains a belief state, defined as the probability distribution over the states given past actions and observations. The belief state constitutes a continuous-state Markov process (Smallwood & Sondik 1973). Given that at time t−1 the agent has belief state b and takes action a, and that the agent observes o at time t, the belief state at time t follows from Bayes rule and is given by

    b_o^a(s') = \frac{\sum_{s \in S} b(s) T_{ss'}(a) \Omega_{s'o}(a)}{p(o|b,a)}    (1)

where

    p(o|b,a) = \sum_{s' \in S} \sum_{s \in S} b(s) T_{ss'}(a) \Omega_{s'o}(a)    (2)

is the probability of transiting from b to b_o^a. Note that we use the subscript and superscript to indicate explicitly that b_o^a depends on a and o.

We are interested in finding a stationary policy that maps any belief state to an optimal action, with the goal of maximizing the expected discounted future reward of each belief state over the infinite horizon. Let π denote a stationary policy producing action a = π(b) for any belief state b, and let V^π(b) denote the expected sum of discounted rewards accrued by the POMDP agent when it has an initial belief b and follows the stationary policy π over the infinite horizon. We refer to V^π(b) as the infinite-horizon value function of π throughout the paper. According to (Sondik 1978), π must satisfy the Bellman equation

    V^\pi(b) = \sum_{s \in S} b(s) R(s,\pi(b)) + \gamma \sum_{o \in O} p(o|b,\pi(b)) V^\pi(b_o^{\pi(b)})    (3)

where b_o^a is determined by (1) and γ ∈ [0, 1) is a discount factor. For any stationary policy π, we have the following policy improvement theorem (Blackwell 1965; Howard 1971; Sondik 1978).

Theorem 1 (Howard-Blackwell policy improvement) Let V^π(b) be the infinite-horizon value function of a stationary policy a = π(b). Define the Q function

    Q^\pi(b,a) = \sum_{s \in S} b(s) R(s,a) + \gamma \sum_{o \in O} p(o|b,a) V^\pi(b_o^a)    (4)

where b_o^a is defined by (1), and the new policy

    \pi'(b) = \arg\max_a Q^\pi(b,a)    (5)

Then π' is an improved policy over π, i.e.,

    V^{\pi'}(b) \ge V^\pi(b)    (6)

for any belief point (belief state) b.

A policy iteration algorithm iteratively applies the policy improvement theorem to obtain successively improved policies. Policy iteration consists of two basic steps: the first is policy evaluation, in which we compute the value function V^π(b) by solving the Bellman equation (3); the second is policy improvement, in which we improve the policy according to Theorem 1. These two steps are performed alternately until V^π(b) converges for all b.

The Howard-Blackwell policy improvement theorem requires that the maximization in (5) be performed for every possible b. This poses great challenges to policy iteration for POMDPs, as the belief state b is continuous. Performing policy improvement over the entire belief simplex is computationally expensive, and is difficult when the optimal value function is not piecewise linear (Sondik 1978). The ILSPI algorithm focuses on improving the policy only on the belief points that are reachable by the POMDP. We show below that policy improvement can be achieved strictly within the reachable belief states.
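As a concrete reference point, the belief update in (1)-(2) can be implemented in a few lines. The following NumPy sketch is ours, not the authors' code; it assumes T[a] is the |S|x|S| transition matrix for action a and Omega[a] the |S|x|O| observation matrix indexed by successor state and observation.

```python
import numpy as np

def belief_update(b, a, o, T, Omega):
    """Eqs. (1)-(2): b_o^a(s') = sum_s b(s) T_ss'(a) Omega_s'o(a) / p(o|b,a)."""
    unnormalized = (b @ T[a]) * Omega[a][:, o]  # sum_s b(s) T_ss'(a), weighted by Omega_s'o(a)
    p_o = unnormalized.sum()                    # p(o | b, a), Eq. (2)
    if p_o == 0.0:
        return None, 0.0                        # o cannot be observed after taking a in belief b
    return unnormalized / p_o, p_o              # new belief b_o^a and its probability
```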
The ILSPI Algorithm for POMDP

Policy Improvement in Reachable Belief States

Let B_0 be a set of initial belief points (states) at time t = 0. For time t = 1, 2, ..., let B_t be the set of all possible b_o^a in (1), ∀ b ∈ B_{t−1}, ∀ a ∈ A, ∀ o ∈ O, such that p(o|b, a) > 0. Then ∪_{t=0}^∞ B_t is the set of belief points reachable by the POMDP starting from B_0. It is known that for most practical POMDP models, ∪_{t=0}^∞ B_t constitutes only a subset of the belief simplex (Pineau, Gordon, & Thrun 2003). It is therefore sufficient to improve the policy only on the belief points in ∪_{t=0}^∞ B_t, since these are the belief states that the POMDP will encounter when starting from B_0. It is reasonable to let B_0 contain only the uniform (most uncertain) belief state, i.e., the centroid of the belief simplex. For any given set of initial belief states (points) B_0, we have the following policy improvement theorem.

Theorem 2 (Policy improvement in reachable belief states) Let B_0 be a set of initial belief points (states) at time t = 0. For time t = 1, 2, ..., let B_t be the set of all possible b_o^a in (1), ∀ b ∈ B_{t−1}, ∀ a ∈ A, ∀ o ∈ O, such that p(o|b, a) > 0. Let π be a stationary policy and V^π(b) its infinite-horizon value function. Define the Q function

    Q^\pi(b,a) = \sum_{s \in S} b(s) R(s,a) + \gamma \sum_{o \in O} p(o|b,a) V^\pi(b_o^a)    (7)

Then the new policy

    \pi'(b) = \arg\max_a Q^\pi(b,a), \quad \forall b \in \cup_{t=0}^\infty B_t    (8)

improves over π for any belief state (point) b ∈ ∪_{t=0}^∞ B_t, i.e.,

    V^{\pi'}(b) \ge V^\pi(b), \quad \forall b \in \cup_{t=0}^\infty B_t    (9)

Proof: By (3), (7), and (8), we have, ∀ b ∈ ∪_{t=0}^∞ B_t,

    V^\pi(b) = Q^\pi(b,\pi(b)) \le Q^\pi(b,\pi'(b))    (10)

This inequality must also hold at b_o^{π'(b)} for any o ∈ O, because these points are all members of ∪_{t=0}^∞ B_t. Therefore we can keep expanding Q^π(b, π'(b)) until every π appearing in it is replaced by π', at which point the rightmost side of (10) becomes Q^{π'}(b, π'(b)) = V^{π'}(b), and we have V^π(b) ≤ V^{π'}(b), which completes the proof.

The reachable belief set ∪_{t=0}^∞ B_t may still be infinitely large. We obtain a manageable belief set by sampling from ∪_{t=0}^∞ B_t. As noted in (7), the term p(o|b, a) V^π(b_o^a) can be ignored if p(o|b, a) is near zero. Therefore, when expanding B_t from B_{t−1}, we perform the following procedure, similar to that in (Pineau, Gordon, & Thrun 2003). For any b ∈ B_{t−1}, we draw a sample o according to p(o|b, a) for every a ∈ A. We thereby obtain |A| new belief points b_o^a, each produced by a different a. We select the single point among these new points that has the maximum Euclidean distance from its ancestral points, and put this selected point in B_t. In this way, we expand a trajectory of belief points starting from each initial point in B_0. A trajectory usually can expand only a small number of unique belief points; it is terminated when no new points can be expanded, or otherwise when it reaches a specified length. A number of such trajectories are drawn and then merged to produce B. We remove redundant and nearly identical elements in B to yield the final belief sample set. As noted in (Pineau, Gordon, & Thrun 2003), the belief samples thus produced yield a uniform representation of the belief states most probably visited by the POMDP. The belief states that are scarcely visited have negligible contribution to the right-hand side of (7) and can be ignored. The uniform sampling ensures that the V^π computed by minimizing the Bellman residual on the samples will generalize well to the remaining belief points most probably visited. In addition, the optimally determined basis functions also enhance the generalization of V^π.
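The expansion step just described can be sketched as follows. This is our rough NumPy illustration (not the paper's code); it reuses belief_update from the earlier sketch and interprets "maximum Euclidean distance from its ancestral points" as the largest minimum distance to the belief points collected so far.

```python
import numpy as np

def expand_beliefs(B_prev, T, Omega, rng):
    """One expansion step B_{t-1} -> B_t of the sampling procedure described above."""
    B_new = []
    for b in B_prev:
        candidates = []
        for a in range(len(T)):                       # one candidate successor per action
            p_obs = (b @ T[a]) @ Omega[a]             # p(o | b, a) for all o, Eq. (2)
            p_obs = p_obs / p_obs.sum()               # guard against floating-point rounding
            o = rng.choice(len(p_obs), p=p_obs)       # draw a sample observation
            b_next, p_o = belief_update(b, a, o, T, Omega)
            if p_o > 0:
                candidates.append(b_next)
        if not candidates:
            continue
        ancestors = np.asarray(list(B_prev) + B_new)  # points expanded so far
        dist = [np.linalg.norm(ancestors - c, axis=1).min() for c in candidates]
        B_new.append(candidates[int(np.argmax(dist))])  # keep the farthest candidate
    return B_new
```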
Basis Representation of the Value Function

We represent the infinite-horizon value function V^π(b) as a linear combination of basis functions of b,

    V^\pi(b) = w^T \phi(b)    (11)

where

    \phi(b) = [\phi_0(b), \phi_1(b), \ldots, \phi_N(b)]^T    (12)

is a column vector containing N + 1 basis functions, with φ_0(b) ≡ 1 accounting for a constant bias term. Substituting (11) into the Bellman equation (3), we obtain

    w^T \phi(b) = \sum_{s \in S} R(s,\pi(b)) b(s) + \gamma \sum_{o \in O} p(o|b,\pi(b)) \, w^T \phi(b_o^{\pi(b)})    (13)

which is simplified to

    w^T \phi(b) = \eta^{\pi(b)} + w^T \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)}

by defining

    \eta^{\pi(b)} = \sum_{s \in S} R(s,\pi(b)) b(s)    (14)

    \rho_o^{\pi(b)} = \gamma \, p(o|b,\pi(b)) = \gamma \sum_{s,s' \in S} b(s) T_{ss'}(\pi(b)) \Omega_{s'o}(\pi(b))    (15)

Note that there is an equation of the form of (13) for every belief point b ∈ B.

Policy Evaluation by Minimizing Bellman Residual

We want to compute the value function V^π, which evaluates how good the policy π is. This V^π is computed by solving the Bellman equation which, under the basis representation of V^π, is given by (13). The problem reduces to finding the φ and w that satisfy the Bellman equation as closely as possible. To do so, we utilize the idea of LSPI for MDPs (Lagoudakis & Parr 2002) and minimize the square of the Bellman residual (the difference between the left and right sides of (13)), accumulated over all b ∈ B,

    e(\phi, w) = \sum_{b \in B} \Big[ w^T \Big( \phi(b) - \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big) - \eta^{\pi(b)} \Big]^2    (16)

For given φ, the w is solved as

    w = M^{-1} \sum_{b \in B} \eta^{\pi(b)} \Big( \phi(b) - \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big)    (17)

where

    M = \sum_{b \in B} \Big( \phi(b) - \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big) \Big( \phi(b) - \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big)^T    (18)

Substituting the solution of w in (17) back into (16) gives

    e(\phi) = \sum_{b \in B} \Big[ (\eta^{\pi(b)})^2 - \eta^{\pi(b)} w^T \Big( \phi(b) - \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big) \Big]    (19)

where w is related to φ by (17); therefore e(φ) is a functional with φ as the free variable. By minimizing e(φ), we can determine the optimal φ for V^π. Recalling φ(b) = [1, φ_1(b), ..., φ_N(b)]^T, this amounts to determining N, the number of basis functions, and the functional form of each basis function φ_n(·), n = 1, ..., N. We consider an incremental procedure and determine the optimal basis functions one after another. The following theorem gives an efficient algorithm for such an incremental procedure. The proof of the theorem is given in the Appendix.

Theorem 3 Let φ(b) = [1, φ_1(b), ..., φ_N(b)]^T and let φ_{N+1}(b) be a single additional basis function. Assume the M matrices in (18), corresponding to φ and [φ^T, φ_{N+1}]^T, are all non-degenerate. Then

    \delta e(\phi, \phi_{N+1}) = e(\phi) - e([\phi^T, \phi_{N+1}]^T)
        = \Big[ c^T w - \sum_{b \in B} \eta^{\pi(b)} \Big( \phi_{N+1}(b) - \sum_{o \in O} \phi_{N+1}(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big) \Big]^2 q^{-1}    (20)

where w is given by (17), M is given by (18), and

    c = \sum_{b \in B} \Big( \phi(b) - \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big) \Big( \phi_{N+1}(b) - \sum_{o \in O} \phi_{N+1}(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big)    (21)

    d = \sum_{b \in B} \Big( \phi_{N+1}(b) - \sum_{o \in O} \phi_{N+1}(b_o^{\pi(b)}) \rho_o^{\pi(b)} \Big)^2    (22)

    q = d - c^T M^{-1} c > 0    (23)

By (20) and (23), δe(φ, φ_{N+1}) ≥ 0; thus adding φ_{N+1} to φ either decreases the squared Bellman residual or leaves it unchanged (the latter indicating convergence). The decrease δe(φ, φ_{N+1}) depends on φ_{N+1}. By selecting the basis functions that bring the maximum decrease, we incrementally minimize the squared Bellman residual e(φ). The pseudo Matlab code for the ILSPI policy evaluation is given in Table 1.
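To make (17)-(18) and (20)-(23) concrete, the following is a minimal NumPy sketch of the policy-evaluation core (our illustration, not the authors' code). It assumes Phi is a |B|x(N+1) matrix whose row for belief b is φ(b) − Σ_o φ(b_o^{π(b)}) ρ_o^{π(b)}, eta is the vector of η^{π(b)}, and psi is the analogous column for a candidate basis function φ_{N+1}.

```python
import numpy as np

def solve_weights(Phi, eta):
    """Eqs. (17)-(18): least-squares weights w minimizing the Bellman residual (16)."""
    M = Phi.T @ Phi                      # Eq. (18)
    w = np.linalg.solve(M, Phi.T @ eta)  # Eq. (17)
    return w, M

def residual_decrease(Phi, eta, w, M, psi):
    """Eqs. (20)-(23): decrease in squared Bellman residual from adding one basis function."""
    c = Phi.T @ psi                       # Eq. (21)
    d = psi @ psi                         # Eq. (22)
    q = d - c @ np.linalg.solve(M, c)     # Eq. (23)
    if q <= 0:                            # degenerate candidate; skip it (cf. Table 1)
        return 0.0
    g = c @ w - eta @ psi                 # bracketed term in Eq. (20)
    return g * g / q                      # Eq. (20)
```

A greedy pass over the candidate set Φ then picks the ψ with the largest residual_decrease, which is the selection rule used in Table 1.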
Table 1: The ILSPI Algorithm: Policy Evaluation

    function [w, φ(·)] = ILS(Φ_B, Φ'_B, η, ρ)
    % Φ_B  = {ψ(b) : b ∈ B, ψ ∈ Φ}
    % Φ'_B = {ψ(b_o^{π(b)}) : b ∈ B, o ∈ O, ψ ∈ Φ}
    % Initialization
    N = 0;  φ(·) = 1;
    Compute M with (18), w with (17), and e_0 with (16);
    while the sequence {e_n}_{n=0:N} has not converged
        for each candidate basis function φ_{N+1} = ψ ∈ Φ
            Compute c with (21), d with (22), and q with (23);
            if q = 0
                Φ = Φ \ {ψ};  continue;
            else
                compute δe(φ, ψ) using (20);
            end
        end
        ψ* = arg max_{ψ∈Φ} δe(φ, ψ);
        % Update
        φ(·) = [φ^T(·), ψ*(·)]^T;  Φ = Φ \ {ψ*};
        M ← M_new and w ← w_new, where M_new is computed with (A-1) and w_new with (A-3);
        e_{N+1} = e_N − δe(φ, ψ*);  N = N + 1;
    end

Policy Improvement by Pointwise Maximization of Q^π

Once the basis representation of the value function V^π, i.e., φ and w, is found in the policy evaluation, we plug V^π(b_o^a) = w^T φ(b_o^a) into (7) to obtain Q^π(b, a). We then perform the maximization formulated in (8) to get the improved policy π'. This maximization is performed in a pointwise manner, working on each b ∈ B one by one. The ILSPI policy improvement is given in Table 2, as part of the main ILSPI loop, presented in the form of pseudo Matlab code. Note that the ILSPI outputs w and φ(·), giving V^π(b) = w^T φ(b), which is substituted into (7) to obtain Q^π(b, a) and produce the policy π(b) = arg max_a Q^π(b, a).

Table 2: The ILSPI Algorithm: Main Loop

    function [w, φ(·)] = ILSPI(POMDP, B, Φ)
    % Φ is a set of candidate basis functions
    % Pre-computation
    Compute b_o^a with (1) and ρ_o^{π(b)=a} with (15), ∀ b ∈ B, ∀ a ∈ A, ∀ o ∈ O;
    Compute Φ_B = {ψ(b) : b ∈ B, ψ ∈ Φ};
    while (1/|B|) Σ_{b∈B} V^π(b) has not converged
        % Policy Evaluation
        Compute η^{π(b)} with (14), ∀ b ∈ B;
        Compute Φ'_B = {ψ(b_o^{π(b)}) : b ∈ B, o ∈ O, ψ ∈ Φ};
        [w, φ(·)] = ILS(Φ_B, Φ'_B, η, ρ);
        % Policy Improvement
        Compute Q^π(b, a) with (7) and (11), ∀ b ∈ B, ∀ a ∈ A;
        Update π(b) = arg max_{a∈A} Q^π(b, a), ∀ b ∈ B;
    end

Time and Memory Complexity

The time and memory complexity of the ILSPI is given in Table 3, where time refers to the number of multiplications and memory refers to the number of intermediate variables (not counting the input variables) involved in the computation. A computation task is performed outside the while-loop of Table 2 if it is marked "out" and inside the while-loop if marked "in". Some computation tasks can be performed either inside or outside the loop, at different time and memory costs; Table 2 represents one specific combination of "out" and "in". The results presented in this paper were produced by computing b_o^a, ρ_o^{π(b)}, and φ(b) outside the loop and everything else inside the loop. In Table 3, u = M^{-1} c, L is the number of ILSPI iterations (i.e., while-loops in Table 2) performed, and Υ_φ is the time for computing a single basis function ψ ∈ Φ at a single belief point b ∈ B. For the radial basis functions (RBFs) used to produce the results in this paper, Υ_φ = |S|. The c is computed in time O(|B||Φ|N L) by storing the elements of c for every ψ ∈ Φ, which consumes memory O(|Φ|N). The u = M^{-1} c is introduced as an intermediate variable to speed up the computation of q = d − c^T M^{-1} c; u is stored for every ψ ∈ Φ using memory O(|Φ|N).

Table 3: Time and Memory Complexity of the ILSPI

    Computation           Time                   Memory
    ρ_o^{π(b)} (out)      O(|B||A||O||S|^2)      O(|B||A||O|)
    ρ_o^{π(b)} (in)       O(|B||O||S|^2 L)       O(1)
    b_o^a (out)           O(|B||A||O||S|^2)      O(|B||A||O||S|)
    b_o^a (in)            O(|B||O||S|^2 L)       O(|S|)
    φ(b) (out)            O(|B||Φ|Υ_φ)           O(|B||Φ|)
    φ(b) (in)             O(|B||Φ|Υ_φ L)         O(N)
    φ(b_o^a) (out)        O(|B||A||O||Φ|Υ_φ)     O(|B||A||O||Φ|)
    φ(b_o^a) (in)         O(|B||O||Φ|Υ_φ L)      O(N)
    η^{π(b)} (out)        O(|B||A||S|)           O(|B||A|)
    η^{π(b)} (in)         O(|B||S|L)             O(|B|)
    c (in)                O(|B||Φ|N L)           O(|Φ|N)
    d (in)                O(|B||Φ|N L)           O(1)
    u (in)                O(|Φ|N^2 L)            O(|Φ|N)
    q (in)                O(|B|N^2 L)            O(N)
    w (in)                O(N^2 L)               O(N)
    Q^π(b, a) (in)        O(|B||A||O|N L)        O(|B|)
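Pulling the pieces together, the following rough sketch (ours, with the same hypothetical array conventions as the earlier snippets) illustrates the pointwise policy-improvement step of Table 2: evaluate Q^π(b, a) from (7) using V^π = w^T φ, then take the argmax at each sampled belief point. Here phi_fn(b) stands for the learned feature map and R for an |S|x|A| reward matrix.

```python
import numpy as np

def improve_policy(B, w, phi_fn, R, T, Omega, gamma):
    """Pointwise policy improvement: pi(b) = argmax_a Q^pi(b, a), Eqs. (7)-(8)."""
    new_pi = {}
    for i, b in enumerate(B):
        q_values = np.zeros(R.shape[1])
        for a in range(R.shape[1]):
            q = b @ R[:, a]                                  # immediate-reward term of Eq. (7)
            p_obs = (b @ T[a]) @ Omega[a]                    # p(o | b, a) for all o
            for o, p_o in enumerate(p_obs):
                if p_o > 0:
                    b_next, _ = belief_update(b, a, o, T, Omega)
                    q += gamma * p_o * (w @ phi_fn(b_next))  # discounted future value, V = w^T phi
            q_values[a] = q
        new_pi[i] = int(np.argmax(q_values))                 # Eq. (8), restricted to b in B
    return new_pi
```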
Experimental Results

We demonstrate the performance of the proposed ILSPI on four benchmark problems. The first three, namely Tiger-grid, Hallway, and Hallway2, were introduced in (Littman, Cassandra, & Kaelbling 1995) and have since been widely used to test scalable POMDP solutions. The fourth problem is the Tag problem introduced in (Pineau, Gordon, & Thrun 2003), which is relatively new and has a larger problem size. The proposed ILSPI is compared to five point-based value iteration algorithms, Grid (Brafman 1997), PBUA (Poon 2001), PBVI (Pineau, Gordon, & Thrun 2003), Perseus (Spaan & Vlassis 2004), and HSVI (Smith & Simmons 2004; 2005), in terms of performance and policy computation time.

To replicate the experiments of previous authors, we measure algorithm performance by the accumulated discounted reward averaged over N_test independent tests. For Tiger-grid, N_test = 151 and each test terminates after the agent takes 500 actions; during a test the agent is reset each time it reaches the goal. For Hallway and Hallway2, N_test = 251 and each test terminates when the agent reaches the goal or has taken a maximum of 251 actions. For Tag, N_test = 1000 and each test terminates when the agent successfully tags its opponent or has taken a maximum of 100 actions.

[Figure 1: The value functions V^π(b) of each ILSPI iteration for the tiger problem (Kaelbling, Littman, & Cassandra 1998), which has |S| = 2 (panels: Iteration 0 through Iteration 5). The horizontal axis is b(1) = p(s = "tiger on the left" | history); the vertical axis is V. The initial policy is a random one. A green circle denotes the action "open left door"; a blue plus denotes the action "open right door"; a red dot denotes the action "listen".]

[Figure 2: Comparison of the converged ILSPI policy ("Value function and policy of ILSPI") with the exact policy ("Exact value function and optimal policy"), for the tiger problem. A green circle denotes the action "open left door"; a blue plus denotes the action "open right door"; a red dot denotes the action "listen".]

The experimental results of the proposed ILSPI algorithm are summarized in Table 4, in comparison with the other five algorithms.

Table 4: Results on the benchmark problems, where the time is displayed in seconds. The results marked with (*) are those we obtained by coding the respective algorithms in Matlab; other results may have been produced by code written in languages other than Matlab and executed on computer platforms different from ours.

    Tiger-Grid (|S| = 33, |A| = 5, |O| = 17)
    Method                                   Reward    Time (s)
    Grid (Brafman 1997)                      0.94      n.v.
    PBUA (Poon 2001)                         2.30      12116
    PBVI (Pineau, Gordon, & Thrun 2003)      2.25      3448
    PBVI (*)                                 2.23      2239
    Perseus (Spaan & Vlassis 2004)           2.34      104
    HSVI1 (Smith & Simmons 2004)             2.35      10341
    HSVI2 (Smith & Simmons 2005)             2.30      52
    ILSPI (*)                                2.21      136

    Hallway (|S| = 57, |A| = 5, |O| = 21)
    Method                                   Reward    Time (s)
    PBUA (Poon 2001)                         0.53      450
    PBVI (Pineau, Gordon, & Thrun 2003)      0.53      288
    PBVI (*)                                 0.54      1166
    Perseus (Spaan & Vlassis 2004)           0.51      35
    HSVI1 (Smith & Simmons 2004)             0.52      10836
    HSVI2 (Smith & Simmons 2005)             0.52      2.4
    ILSPI (*)                                0.54      66

    Hallway2 (|S| = 89, |A| = 5, |O| = 17)
    Method                                   Reward    Time (s)
    PBUA (Poon 2001)                         0.35      27898
    PBVI (Pineau, Gordon, & Thrun 2003)      0.34      360
    PBVI (*)                                 0.35      2345
    Perseus (Spaan & Vlassis 2004)           0.35      10
    HSVI1 (Smith & Simmons 2004)             0.35      10010
    HSVI2 (Smith & Simmons 2005)             0.35      1.5
    ILSPI (*)                                0.30      206

    Tag (|S| = 870, |A| = 5, |O| = 30)
    Method                                   Reward    Time (s)
    PBVI (Pineau, Gordon, & Thrun 2003)      -9.180    180880
    Perseus (Spaan & Vlassis 2004)           -6.17     1670
    HSVI1 (Smith & Simmons 2004)             -6.37     10113
    HSVI2 (Smith & Simmons 2005)             -6.36     24
    ILSPI (*)                                -12.3     737
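For concreteness, the evaluation protocol described above (averaging accumulated discounted reward over N_test independent runs) amounts to something like the sketch below; simulate_episode is a hypothetical helper, not part of the paper, that runs the learned policy in the POMDP simulator and returns the sequence of rewards.

```python
import numpy as np

def average_discounted_return(simulate_episode, n_test, gamma, max_steps):
    """Average accumulated discounted reward over n_test independent test runs."""
    returns = []
    for _ in range(n_test):
        rewards = simulate_episode(max_steps)      # one test run of at most max_steps actions
        discounted = sum(gamma**t * r for t, r in enumerate(rewards))
        returns.append(discounted)
    return float(np.mean(returns))
```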
It is demonstrated that for each of the four problems the ILSPI compares competitively to its value-iteration counterparts in terms of both performance and computation time. The time efficiency of the ILSPI can be attributed to two facts: (a) the ILSPI leverages the simplicity of the least squares criterion (the squared Bellman residual) and uses matrix identities to speed up the computation; (b) the ILSPI typically converges within a smaller number of iterations than value iteration methods. This is not surprising, as fast convergence is generally observed in policy iteration algorithms (Sutton & Barto 1998).

To demonstrate the accuracy of the ILSPI algorithm as well as its fast convergence, we apply the algorithm to the tiger problem (Kaelbling, Littman, & Cassandra 1998). Figure 1 shows the evolution of the value function V^π(b) over the ILSPI iterations, starting from a random policy. Figure 2 compares the V^π(b) of the converged ILSPI policy with that of the exact policy. It is seen from the figures that the ILSPI converges after only three iterations and that the converged policy is almost identical to the exact policy.

Conclusions

We have presented a new algorithm, incremental least squares policy iteration (ILSPI), for solving infinite-horizon stationary policies of partially observable Markov decision processes (POMDPs). The ILSPI represents the policy with optimally determined basis functions and computes the value function by minimizing the squared error between the left-hand side and right-hand side of the Bellman equation (the Bellman residual). In the policy improvement step, the ILSPI improves the policy on the belief states reachable by the POMDP. Like policy iteration in general, the ILSPI converges to the optimal policy within a small number of iterations. In addition, the simplicity of least squares is leveraged to make the ILSPI policy evaluation efficient. The ILSPI was applied to four benchmark problems and was demonstrated to be competitive with value iteration algorithms in terms of performance and computational efficiency.

Appendix: Proof of Theorem 3

For any belief state b, we define the ϕ notation

    \varphi(b) = \phi(b) - \sum_{o \in O} \phi(b_o^{\pi(b)}) \rho_o^{\pi(b)}, \qquad \varphi_{N+1}(b) = \phi_{N+1}(b) - \sum_{o \in O} \phi_{N+1}(b_o^{\pi(b)}) \rho_o^{\pi(b)}

We use ϕ and ϕ_{N+1} in this proof to simplify the equations. Let φ_new = [φ^T, φ_{N+1}]^T, which is transformed to the ϕ notation as ϕ_new = [ϕ^T, ϕ_{N+1}]^T. By (18), the M matrix corresponding to φ_new is

    M_{new} = \sum_{b \in B} \begin{bmatrix} \varphi(b) \\ \varphi_{N+1}(b) \end{bmatrix} \begin{bmatrix} \varphi(b) \\ \varphi_{N+1}(b) \end{bmatrix}^T = \begin{bmatrix} M & c \\ c^T & d \end{bmatrix}    (A-1)

where M, c, and d are as in (18), (21), and (22), respectively. By the conditions of the theorem, the matrices M and M_new are full rank. Using the block matrix inversion formula, we get

    (M_{new})^{-1} = \begin{bmatrix} M^{-1} + M^{-1} c q^{-1} c^T M^{-1} & -M^{-1} c q^{-1} \\ -q^{-1} c^T M^{-1} & q^{-1} \end{bmatrix}    (A-2)

where q is given in (23). By (17), the w_new corresponding to [φ^T, φ_{N+1}]^T is

    w_{new} = (M_{new})^{-1} \sum_{b \in B} \eta^{\pi(b)} \begin{bmatrix} \varphi(b) \\ \varphi_{N+1}(b) \end{bmatrix} = \begin{bmatrix} w + M^{-1} c q^{-1} g \\ -q^{-1} g \end{bmatrix}    (A-3)

with

    g = c^T w - \sum_{b \in B} \eta^{\pi(b)} \varphi_{N+1}(b)    (A-4)

Hence,

    [\varphi_{new}(b)]^T w_{new} = [\varphi(b)]^T w + \big( [\varphi(b)]^T M^{-1} c - \varphi_{N+1}(b) \big) g q^{-1}    (A-5)

By (19),

    e(\phi_{new}) = \sum_{b \in B} \big[ (\eta^{\pi(b)})^2 - \eta^{\pi(b)} (w_{new})^T \varphi_{new}(b) \big]    (A-6)

Substituting (A-5), and applying (17), (19), and (A-4),

    e(\phi_{new}) = e(\phi) - \Big[ c^T w - \sum_{b \in B} \eta^{\pi(b)} \varphi_{N+1}(b) \Big]^2 q^{-1}    (A-7)

which is (20) in the ϕ notation. By the conditions of the theorem, M_new is full rank, and it is positive definite by construction. By (A-2), q^{-1} is a diagonal element of (M_new)^{-1}, hence q^{-1} > 0 and q > 0. The proof is thus completed.
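The incremental update in (A-1)-(A-3) is easy to sanity-check numerically. The snippet below is our illustration, not part of the paper: it builds a random design matrix whose rows play the role of ϕ(b)^T, grows it by one column, and verifies that the block-inverse update (A-2) reproduces the directly computed inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))           # rows play the role of varphi(b)^T
psi = rng.normal(size=50)                # candidate column varphi_{N+1}(b)

M = Phi.T @ Phi                          # Eq. (18)
c = Phi.T @ psi                          # Eq. (21)
d = psi @ psi                            # Eq. (22)
q = d - c @ np.linalg.solve(M, c)        # Eq. (23)

M_inv = np.linalg.inv(M)
top_left = M_inv + np.outer(M_inv @ c, c @ M_inv) / q
top_right = -(M_inv @ c)[:, None] / q
M_new_inv = np.block([[top_left, top_right],
                      [top_right.T, np.array([[1.0 / q]])]])        # Eq. (A-2)

M_new = np.block([[M, c[:, None]], [c[None, :], np.array([[d]])]])  # Eq. (A-1)
assert np.allclose(M_new_inv, np.linalg.inv(M_new))                 # matches the direct inverse
```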
References

Bellman, R. 1957. Dynamic Programming. Princeton University Press.
Blackwell, D. 1965. Discounted dynamic programming. Ann. Math. Stat. 36:226-235.
Brafman, R. I. 1997. A heuristic variable grid solution method for POMDPs. In AAAI, 727-733.
Hansen, E. A. 1997. An improved policy iteration algorithm for partially observable MDPs. In Advances in Neural Information Processing Systems 10.
Howard, R. A. 1971. Dynamic Probabilistic Systems. New York: John Wiley and Sons.
Kaelbling, L.; Littman, M.; and Cassandra, A. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99-134.
Lagoudakis, M. G., and Parr, R. 2002. Model-free least-squares policy iteration. In Dietterich, T. G.; Becker, S.; and Ghahramani, Z., eds., Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press.
Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Learning policies for partially observable environments: Scaling up. In ICML, 362-370.
Lovejoy, W. S. 1991. Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1):162-175.
Pineau, J.; Gordon, G.; and Thrun, S. 2003. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, 1025-1032.
Poon, K.-M. 2001. A fast heuristic algorithm for decision-theoretic planning. Master's thesis, The Hong Kong University of Science and Technology.
Smallwood, R. D., and Sondik, E. J. 1973. The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071-1088.
Smith, T., and Simmons, R. 2004. Heuristic search value iteration for POMDPs. In Proc. of UAI.
Smith, T., and Simmons, R. 2005. Point-based POMDP algorithms: Improved analysis and implementation. In Proc. of UAI.
Sondik, E. J. 1978. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research 26(2):282-304.
Spaan, M., and Vlassis, N. 2004. A point-based POMDP algorithm for robot planning. In Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), 2399-2404.
Sutton, R., and Barto, A. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.