Exploiting Symmetries in POMDPs for Point-Based Algorithms

Kee-Eung Kim

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Department of Computer Science, Korea Advanced Institute of Science and Technology
383-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, Korea
kekim@cs.kaist.ac.kr

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We extend the model minimization technique for partially observable Markov decision processes (POMDPs) to handle symmetries in the joint space of states, actions, and observations. The POMDP symmetry we define in this paper cannot be handled by the model minimization techniques previously published in the literature. We formulate the problem of finding the symmetries as a graph automorphism (GA) problem, and although GA is not yet known to be tractable, we experimentally show that the sparseness of the graph representing the POMDP allows us to quickly find symmetries. We show how the symmetries in POMDPs can be exploited for speeding up point-based algorithms. We experimentally demonstrate the effectiveness of our approach.

Partially observable Markov decision processes (POMDPs) are a natural model for stochastic sequential decision problems under observation uncertainty. However, due to intractability results in solving POMDPs, common algorithms scale poorly with the size of the problem.

One approach to address this issue is to exploit the redundancy present in the model: by aggregating equivalent states of the model, we can derive a minimized model which can be solved by traditional algorithms with reduced computational complexity. Such model minimization methods have been explored in depth for Markov decision processes (MDPs) and POMDPs. Dean & Givan (1997) define the notion of state equivalence in MDPs and provide a model minimization algorithm that computes the partition of the state space so that the blocks of the partition can be regarded as abstract states. The result is a reduced MDP that preserves equivalence to the original MDP and can be solved by traditional algorithms. Since the complexity of the algorithm depends on the size of the state space, we achieve savings in the computational cost compared to directly solving the original MDP. Pineau, Gordon, & Thrun (2003) extend the technique to POMDPs. These model minimization methods are concerned with grouping behaviorally equivalent states only, while leaving the original actions and observations unchanged. Hence, further reduction is possible by recoding actions and observations.

Ravindran & Barto (2001) define this type of regularity as the symmetry in the model, and extend the model minimization method to cover the symmetries in MDPs. Later work by Wolfe (2006) extends the method to POMDPs. In this paper, we study a symmetry in POMDPs that has not been previously explored in the literature. While previous approaches have primarily focused on aggregating states in order to reduce the size of the model, we are interested in the POMDP symmetry that is not related to reducing the size of the POMDP, but can nonetheless be exploited to speed up conventional POMDP algorithms.

Homomorphisms of POMDPs

A POMDP $M$ is defined as a 7-tuple $\langle S, A, Z, T, O, R, b_0 \rangle$: $S$ is a set of states; $A$ is a set of actions; $Z$ is a set of observations; $T$ is a transition function where $T(s, a, s')$ denotes the probability $P(s' \mid s, a)$ of changing to state $s'$ from state $s$ by executing action $a$; $O$ is an observation function where $O(s, a, z)$ denotes the probability $P(z \mid s, a)$ of making observation $z$ when executing action $a$ and arriving in state $s$; $R$ is a reward function where $R(s, a)$ denotes the immediate reward of executing action $a$ in state $s$; $b_0$ is the initial belief where $b_0(s)$ is the probability that we start at state $s$.

Model minimization methods for POMDPs search for a homomorphism $\phi$ that maps $M$ to another equivalent POMDP $M' = \langle S', A', Z', T', O', R' \rangle$ with the minimal model size. Formally, a homomorphism $\phi$ is defined as $\langle f, g, h \rangle$ where $f : S \to S'$ is the function that maps the states, $g : A \to A'$ maps the actions, and $h : Z \to Z'$ maps the observations. Note that $M'$ is a reduced model of $M$ if any of the mappings is many-to-one. Depending on the definition of the homomorphism $\phi$, we obtain different definitions of the minimal model.

Pineau, Gordon, & Thrun (2003) concern themselves with a homomorphism $\phi$ defined only on the state spaces; the action and observation spaces remain the same as in the original POMDP. Hence $\phi$ takes the form $\langle f, 1, 1 \rangle$ where $1$ denotes the identity mapping. $f$ should satisfy the following constraints in order for $M$ and $M'$ to be equivalent:

$$T'(f(s), a, f(s')) = \sum_{s'' \in f^{-1}(s')} T(s, a, s'')$$
$$R'(f(s), a) = R(s, a)$$
$$O'(f(s), a, z) = O(s, a, z)$$

Wolfe (2006) extends the minimization method to compute a homomorphism of the more general form $\langle f, g, h \rangle$ where the observation mapping $h$ can change depending on the action. The constraints for the equivalence are given by:

$$T'(f(s), g(a), f(s')) = \sum_{s'' \in f^{-1}(s')} T(s, a, s'')$$
$$R'(f(s), g(a)) = R(s, a)$$
$$O'(f(s), g(a), h_a(z)) = O(s, a, z)$$

Note that these methods are interested in finding many-to-one mappings in order to find a model with reduced size. Hence, they focus on computing partitions of the state, action, and observation spaces whose blocks represent aggregates of equivalent states, actions, and observations, respectively. Once the partitions are found, we can employ conventional POMDP algorithms on the abstract POMDP with a reduced number of states, actions, or observations, which in effect reduces the computational complexity of the algorithms.

Automorphisms of POMDPs

An automorphism is a special class of homomorphism. In the context of POMDPs, an automorphism $\phi$ is defined as $\langle f, g, h \rangle$ where the state mapping $f : S \to S$, the action mapping $g : A \to A$, and the observation mapping $h : Z \to Z$ are all one-to-one mappings. Hence, $\phi$ maps the original POMDP to itself, and there is no assumption regarding the reduction in the size of the model.

The classic tiger domain (Kaelbling, Littman, & Cassandra 1998) is perhaps one of the best examples to describe automorphisms in POMDPs. The state space $S$ of the tiger domain is defined as $\{s_{\mathrm{LEFT}}, s_{\mathrm{RIGHT}}\}$, representing the state of the world when the tiger is behind the left door or the right door, respectively. The action space $A$ is defined as $\{a_{\mathrm{LEFT}}, a_{\mathrm{RIGHT}}, a_{\mathrm{LISTEN}}\}$, representing opening the left door, opening the right door, or listening, respectively. The observation space $Z$ is defined as $\{z_{\mathrm{LEFT}}, z_{\mathrm{RIGHT}}\}$, representing hearing on the left or hearing on the right, respectively. The specifications of transition probabilities, observation probabilities, and the rewards are as given in Figure 1, Figure 2, and Figure 3. The initial belief is given as $b_0(s_{\mathrm{LEFT}}) = b_0(s_{\mathrm{RIGHT}}) = 0.5$.

Figure 1: Transition probabilities of the tiger problem

  T(s, a_LISTEN, s')    s' = s_LEFT   s' = s_RIGHT
  s = s_LEFT            1.0           0.0
  s = s_RIGHT           0.0           1.0

  T(s, a_LEFT, s')      s' = s_LEFT   s' = s_RIGHT
  s = s_LEFT            0.5           0.5
  s = s_RIGHT           0.5           0.5

  T(s, a_RIGHT, s')     s' = s_LEFT   s' = s_RIGHT
  s = s_LEFT            0.5           0.5
  s = s_RIGHT           0.5           0.5

Figure 2: Observation probabilities of the tiger problem

  O(s, a_LISTEN, z)     z = z_LEFT    z = z_RIGHT
  s = s_LEFT            0.85          0.15
  s = s_RIGHT           0.15          0.85

  O(s, a_LEFT, z)       z = z_LEFT    z = z_RIGHT
  s = s_LEFT            0.5           0.5
  s = s_RIGHT           0.5           0.5

  O(s, a_RIGHT, z)      z = z_LEFT    z = z_RIGHT
  s = s_LEFT            0.5           0.5
  s = s_RIGHT           0.5           0.5

Figure 3: Reward function of the tiger problem

  R(s, a)               a = a_LISTEN  a = a_LEFT    a = a_RIGHT
  s = s_LEFT            -1            -100          10
  s = s_RIGHT           -1            10            -100

Note that the tiger problem is already compact in the sense that the minimization methods mentioned in the previous section cannot reduce the size of the model: examining the reward function alone, we cannot aggregate $a_{\mathrm{LEFT}}$ and $a_{\mathrm{RIGHT}}$ since the rewards differ depending on whether the current state is $s_{\mathrm{LEFT}}$ or $s_{\mathrm{RIGHT}}$. By a similar argument, we can reduce neither the state space nor the observation space. Thus, if we were to compute partitions using minimization methods, we would end up with every block being a singleton set. However, $s_{\mathrm{LEFT}}$ and $s_{\mathrm{RIGHT}}$ can be interchanged to yield an equivalent POMDP, provided the corresponding actions and observations are changed simultaneously:

$$f(s) = \begin{cases} s_{\mathrm{RIGHT}} & \text{if } s = s_{\mathrm{LEFT}} \\ s_{\mathrm{LEFT}} & \text{if } s = s_{\mathrm{RIGHT}} \end{cases}$$

$$g(a) = \begin{cases} a_{\mathrm{LISTEN}} & \text{if } a = a_{\mathrm{LISTEN}} \\ a_{\mathrm{RIGHT}} & \text{if } a = a_{\mathrm{LEFT}} \\ a_{\mathrm{LEFT}} & \text{if } a = a_{\mathrm{RIGHT}} \end{cases}$$

$$h(z) = \begin{cases} z_{\mathrm{RIGHT}} & \text{if } z = z_{\mathrm{LEFT}} \\ z_{\mathrm{LEFT}} & \text{if } z = z_{\mathrm{RIGHT}} \end{cases}$$

The automorphism of POMDPs is the type of regularity we intend to exploit in this paper: the symmetry in the model that does not necessarily help the model minimization algorithm further reduce the size of the model.
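The interchange of the two tiger states described above can be checked mechanically. The following Python sketch encodes the tiger problem from Figures 1 to 3 and tests the invariance conditions $T(s,a,s') = T(f(s),g(a),f(s'))$, $O(s,a,z) = O(f(s),g(a),h(z))$, $R(s,a) = R(f(s),g(a))$ and $b_0(s) = b_0(f(s))$ that the appendix relies on. The function name `is_automorphism` and the string labels are illustrative assumptions, not part of the paper.

```python
from itertools import product

# Tiger problem as specified in Figures 1-3 (string labels are illustrative).
S = ["sLEFT", "sRIGHT"]
A = ["aLISTEN", "aLEFT", "aRIGHT"]
Z = ["zLEFT", "zRIGHT"]

def T(s, a, s_next):
    """Transition probability: listening keeps the state, opening a door resets it."""
    if a == "aLISTEN":
        return 1.0 if s == s_next else 0.0
    return 0.5

def O(s, a, z):
    """Observation probability: listening reports the correct side with probability 0.85."""
    if a == "aLISTEN":
        return 0.85 if s[1:] == z[1:] else 0.15
    return 0.5

R = {
    ("sLEFT", "aLISTEN"): -1, ("sRIGHT", "aLISTEN"): -1,
    ("sLEFT", "aLEFT"): -100, ("sRIGHT", "aLEFT"): 10,
    ("sLEFT", "aRIGHT"): 10, ("sRIGHT", "aRIGHT"): -100,
}
b0 = {"sLEFT": 0.5, "sRIGHT": 0.5}

def is_automorphism(f, g, h):
    """Return True if the permutations f, g, h (dicts) leave T, O, R and b0 unchanged."""
    bijective = (sorted(f.values()) == sorted(S)
                 and sorted(g.values()) == sorted(A)
                 and sorted(h.values()) == sorted(Z))
    same_T = all(T(s, a, s2) == T(f[s], g[a], f[s2]) for s, a, s2 in product(S, A, S))
    same_O = all(O(s, a, z) == O(f[s], g[a], h[z]) for s, a, z in product(S, A, Z))
    same_R = all(R[s, a] == R[f[s], g[a]] for s, a in product(S, A))
    same_b = all(b0[s] == b0[f[s]] for s in S)
    return bijective and same_T and same_O and same_R and same_b

# The mappings f, g, h from the text: swap left and right everywhere.
f_swap = {"sLEFT": "sRIGHT", "sRIGHT": "sLEFT"}
g_swap = {"aLISTEN": "aLISTEN", "aLEFT": "aRIGHT", "aRIGHT": "aLEFT"}
h_swap = {"zLEFT": "zRIGHT", "zRIGHT": "zLEFT"}

# Identity mappings, used to show that swapping the states alone is not enough.
g_id = {a: a for a in A}
h_id = {z: z for z in Z}

swap_all = is_automorphism(f_swap, g_swap, h_swap)
swap_states_only = is_automorphism(f_swap, g_id, h_id)
print(swap_all, swap_states_only)
```

Swapping the states without also swapping the actions and observations fails the reward and observation conditions, which is why the symmetry lives in the joint space of states, actions, and observations.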
Rather than computing partitions, we therefore focus on computing all possible automorphisms of the original POMDP. Note that if the original POMDP can be reduced in size, the number of automorphisms can be exponential in the number of blocks in the partition. For example, if the model minimization yields a state partition with $K$ blocks of 2 states each, the number of automorphisms becomes $2^K$. Hence, it is advisable to compute automorphisms after we compute the minimal model of the POMDP.

Graph Automorphism for POMDPs

In this section, we describe how we can cast the problem of finding automorphisms in POMDPs as a graph automorphism (GA) problem. A vertex-colored graph $G$ is specified by $\langle V, E, C, \psi \rangle$, where $V$ denotes the set of vertices, $E$ denotes the set of edges $\langle v_i, v_j \rangle$, $C$ is the set of colors, and $\psi : V \to C$ denotes the color associated with each vertex. An automorphism $\phi : G \to G$ is a permutation $\pi$ of $V$ with the property that for any edge $\langle v_i, v_j \rangle$ in $G$, $\langle \pi(v_i), \pi(v_j) \rangle$ is also an edge in $G$, and for any vertex $v_i$ in $G$, $\psi(v_i) = \psi(\pi(v_i))$. GA is the problem of finding an automorphism of $G$ (Bondy & Murty 1976). The computational complexity of GA is known to be in NP, but it is known neither to be in P nor to be NP-complete (Garey & Johnson 1979). However, GA can be solved quite efficiently in practice.

Figure 4: Encoding the tiger problem as a vertex-colored graph. Two vertices have the same color if and only if their shapes and fillings are the same.

We can encode a POMDP as a vertex-colored graph in order to apply graph automorphism algorithms (see footnote 1). First, for every state $s$ in the POMDP, we prepare a vertex $v_s$ (called a state vertex) and make every such vertex share the same color, i.e., $\forall s \in S, \psi(v_s) = c_{\mathrm{state}}$. We proceed in the same manner for every action (action vertex), next state (next state vertex), and observation (observation vertex), i.e., $\forall a \in A, \psi(v_a) = c_{\mathrm{action}}$, $\forall s' \in S, \psi(v_{s'}) = c_{\mathrm{state'}}$, and $\forall z \in Z, \psi(v_z) = c_{\mathrm{obs}}$.
We choose distinct colors for $c_{\mathrm{state}}$, $c_{\mathrm{action}}$, $c_{\mathrm{state'}}$ and $c_{\mathrm{obs}}$ so that a state vertex cannot be mapped to an action vertex, and so on. Next, for every triplet $(s, a, s')$ in the POMDP, we prepare a vertex $v_{T(s,a,s')}$ that represents the transition probability $T(s, a, s')$ (transition probability vertex), and assign colors so that two vertices share the same color if and only if the transition probabilities are the same, i.e., $\forall (s, a, s'), \forall (s'', a', s'''), \psi(v_{T(s,a,s')}) = \psi(v_{T(s'',a',s''')})$ if and only if $T(s, a, s') = T(s'', a', s''')$. We connect the transition probability vertex $v_{T(s,a,s')}$ to the corresponding state and action vertices, $v_s$, $v_{s'}$ and $v_a$. We proceed in the same manner for observation probabilities, rewards, and the initial belief. Finally, we prepare edges between the state vertex and the next state vertex that correspond to the same state. Figure 4 shows the result of the graph encoding process for the tiger problem.

The encoded graph is sparse: the number of vertices in the graph is $2|S| + |A| + |Z| + |S|^2|A| + |S||A||Z| + |S||A| + |S|$, and the number of edges is $3|S|^2|A| + 3|S||A||Z| + 2|S||A| + |S|$. Hence, the number of edges is linear in the number of vertices, and typical GA algorithms quickly find automorphisms in benchmark POMDP domains. In our study, we used nauty (McKay 2007) for experiments.

Footnote 1: We can also formulate arbitrary GA as a POMDP symmetry finding problem, although such a hardness proof is omitted here.

Application to Point-Based Algorithms

Point-based value iteration (PBVI) is an approximate POMDP algorithm using finite samples of belief points for value updates (Pineau, Gordon, & Thrun 2006). The belief point at time $t$ is the probability distribution over the states given a history of actions and observations. It is updated from the belief point at time $t-1$ by incorporating the action at time $t-1$ and the observation at time $t$:

$$b_t(s) = \frac{1}{C}\, O(s, a_{t-1}, z_t) \sum_{s' \in S} T(s', a_{t-1}, s)\, b_{t-1}(s')$$
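The belief update equation above can be sketched in a few lines of Python. This is a minimal illustration on the tiger problem of Figures 1 and 2, not the implementation used in the experiments; the function name `belief_update` and the dictionary-based belief representation are assumptions made for the example.

```python
# Tiger problem transition and observation models (Figures 1 and 2).
S = ["sLEFT", "sRIGHT"]

def T(s, a, s_next):
    """Transition probability T(s, a, s')."""
    if a == "aLISTEN":
        return 1.0 if s == s_next else 0.0
    return 0.5

def O(s, a, z):
    """Observation probability O(s, a, z) after arriving in state s."""
    if a == "aLISTEN":
        return 0.85 if s[1:] == z[1:] else 0.15
    return 0.5

def belief_update(b, a, z, states, T, O):
    """Compute b_t from b_{t-1} = b, action a and observation z.

    b is a dict mapping each state to its probability.
    """
    unnormalized = {
        s: O(s, a, z) * sum(T(s_prev, a, s) * b[s_prev] for s_prev in states)
        for s in states
    }
    C = sum(unnormalized.values())  # normalizing constant
    if C == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return {s: p / C for s, p in unnormalized.items()}

b0 = {"sLEFT": 0.5, "sRIGHT": 0.5}
b1 = belief_update(b0, "aLISTEN", "zLEFT", S, T, O)
b2 = belief_update(b1, "aLISTEN", "zLEFT", S, T, O)
b1_mirror = belief_update(b0, "aLISTEN", "zRIGHT", S, T, O)
print(b1, b2, b1_mirror)
```

Hearing the tiger on the right instead of on the left produces the mirrored belief `b1_mirror`, which is the belief-space symmetry the rest of this section exploits.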
In the belief update equation, $C$ is the normalizing constant. We will use the notation $b_t = \tau(b_{t-1}, a_{t-1}, z_t)$ to denote the update equation.

In order to compute the optimal value at belief point $b$, we apply dynamic programming (referred to as the backup operation) to compute the $t$-step value function from the $(t-1)$-step value function:

$$V_t(b) = \max_{a \in A} \Big[ R(b, a) + \gamma \sum_{z \in Z} P(z \mid a, b)\, V_{t-1}(\tau(b, a, z)) \Big]$$

where $R(b, a) = \sum_{s \in S} R(s, a) b(s)$. The value function at each time step is represented by a set of α-vectors so that $\Gamma_t = \{\alpha_0, \ldots, \alpha_m\}$, and the value at a particular belief point $b$ is calculated as $V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s) b(s)$. Hence, the backup operation can be formulated as

$$V_t(b) = \max_{a \in A} \Big[ \sum_{s \in S} R(s, a) b(s) + \gamma \sum_{z \in Z} \max_{\alpha' \in \Gamma_{t-1}} \sum_{s, s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha'(s')\, b(s) \Big]$$

In finding $\Gamma_t$ for $V_t$, PBVI generates intermediate sets $\Gamma_t^{a,z}$, $\forall a \in A, \forall z \in Z$, whose elements are calculated by

$$\alpha_i^{a,z}(s) = \gamma \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha_i(s'), \quad \forall \alpha_i \in \Gamma_{t-1}$$

PBVI then constructs $\Gamma_t^a$, $\forall a \in A$, whose elements are calculated by

$$\alpha_b^a(s) = R(s, a) + \Big[ \sum_{z \in Z} \operatorname*{argmax}_{\alpha \in \Gamma_t^{a,z}} (\alpha \cdot b) \Big](s), \quad \forall b \in B$$

where $B$ is the finite set of belief points. In principle, the backup operation has to be done on all possible belief points in order to obtain the exact optimal value function. However, PBVI performs the backup operation only on the selected belief points $B$. The set of α-vectors for $V_t$ is finally obtained by $\Gamma_t = \cup_{a \in A} \Gamma_t^a$.

The automorphisms of POMDPs reveal the symmetries in belief points and α-vectors. Given a POMDP $M$ with automorphism $\phi = \langle f, g, h \rangle$, let $\Gamma^*$ be the set of α-vectors for the optimal value function. If $b$ is a reachable belief point, then $f(b)$ is also a reachable belief point (by a slight abuse of notation, $f(b)$ is the transformed vector of $b$ whose elements are permuted by $f$). Also, if $\alpha \in \Gamma^*$, then $f(\alpha) \in \Gamma^*$ (proofs are provided in the appendix). Figure 5 shows the symmetric relationships of belief points and α-vectors in the tiger problem.

Figure 5: Value function of the tiger problem obtained by PBVI with 5 belief points $b_1, \ldots, b_5$ (horizontal axis: belief, 0.0 to 1.0; vertical axis: value, 0 to 25). The α-vectors are $\alpha_1 = [-86.5, 23.5]$, $\alpha_2 = [3.38, 19.4]$, $\alpha_3 = [15.0, 15.0]$, $\alpha_4 = [19.4, 3.38]$, and $\alpha_5 = [23.5, -86.5]$. $b_1$ and $b_5$ are symmetric, hence the corresponding α-vectors $\alpha_1$ and $\alpha_5$ are symmetric. The same argument applies to $b_2$ and $b_4$. Although the illustration uses an approximate value function from PBVI, the value function from exact methods will show the same phenomenon.

We can modify PBVI to take advantage of the symmetries in belief points and α-vectors. First, when we sample the set of belief points, one of the heuristics used by PBVI is to select the belief point with the farthest L1 distance from any belief point already in $B$. However, the belief point selected by such a heuristic may have a symmetric belief point already in $B$. Thus, we can modify the heuristic to exclude the redundant belief points by calculating the distances between all possible symmetric images of belief points. Second, since $B$ will exclude redundant belief points, we can modify the backup operation to include symmetric images of α-vectors in $\Gamma_t^a$. Figure 6 shows the pseudo-code for performing the symmetric backup operation.

Figure 6: The backup operation of PBVI taking into account Φ, the set of all symmetries.

    Procedure: Γ_t = backup(B, Γ_{t−1}, Φ)
    for each a ∈ A, z ∈ Z, α_i ∈ Γ_{t−1} do
        for each s ∈ S do
            α_i^{a,z}(s) = γ Σ_{s'∈S} T(s, a, s') O(s', a, z) α_i(s')
        end for
        Γ_t^{a,z} = ∪_i α_i^{a,z}
    end for
    Γ_t = {}
    for each b ∈ B do
        for each a ∈ A, s ∈ S do
            α_b^a(s) = R(s, a) + [Σ_{z∈Z} argmax_{α∈Γ_t^{a,z}} (α · b)](s)
        end for
        a* = argmax_a (α_b^a · b)
        α_b = α_b^{a*}
        if α_b ∉ Γ_t then
            Γ_t = Γ_t ∪ α_b
            for each φ = ⟨f, g, h⟩ ∈ Φ do
                if f(α_b) ∉ Γ_t then
                    Γ_t = Γ_t ∪ f(α_b)
                end if
            end for
        end if
    end for

There is a small but important issue regarding the symmetric backup of α-vectors: some of the belief points will have the same symmetric image, i.e., $b = f(b)$. For these belief points, it is often unnecessary to add $f(\alpha_b)$ to $\Gamma_t$, since $f(\alpha_b)$ is relevant to the belief point $f(b)$, but $b$ and $f(b)$ are the same. Hence, it is advisable to identify which automorphisms yield $b \neq f(b)$ for each belief point $b$, and to include symmetric images of α-vectors only for these automorphisms for further performance improvement of PBVI.

Experiments

Before we demonstrate the performance gain of the PBVI algorithm from the symmetric backup operator, we first report test results for the existence of automorphisms in standard POMDP benchmark problems. Most of the benchmark problems are already compact in the sense that the model minimization algorithm was not able to reduce their size further. For the tiger-grid problem, we were able to reduce the size and find symmetries. For the tiger problem, we were not able to reduce the size, but still found symmetries.

We further tested for automorphism existence on larger domains, the spoken dialogue management problems by Williams, Poupart, & Young (2005). In this domain, the user is trying to buy a ticket to travel from one city to another city, and the machine has to request or confirm information from the user in order to issue the correct ticket. We instantiated the problem for 2 and 3 possible cities, and for two different rates of speech recognition errors $p_{\mathrm{err}}$, where $p_{\mathrm{err}} = 0$ assumes no speech recognition error and $p_{\mathrm{err}} = 0.1$ assumes an error rate of 10%. Note that even in the case where $p_{\mathrm{err}} = 0$, the problem is still a POMDP since the user may provide partial information about the request (e.g., origin city only). None of these problems could be reduced in size, but all still had symmetries. Regardless of the value of $p_{\mathrm{err}}$, the graphs encoding the POMDP problems were exactly the same. The small differences in the nauty execution times may be due to the differences in the orderings of the vertices of the graph. Figure 7 summarizes the results of the automorphism finding experiments.

Figure 7: Model minimization and graph automorphism results on benchmark problems. |S| is the number of states in the original model, Min |S| is the number of states in the minimized model, |V| is the number of vertices in the graph encoding of the model, and |Φ| is the number of automorphisms found by nauty including the identity mapping.

  Problem                         |S|    Min |S|   |V|        nauty exec time   |Φ|
  Tiger                           2      2         39         0.004 s           2
  Tiger-grid                      36     35        9814       0.061 s           4
  2-city-ticketing (perr = 0)     397    397       1624545    31.872 s          4
  2-city-ticketing (perr = 0.1)   397    397       1624545    23.873 s          4
  3-city-ticketing (perr = 0)     1945   1945      61123604   2585.770 s        12
  3-city-ticketing (perr = 0.1)   1945   1945      61123604   2601.543 s        12

Next, we experimented with the PBVI algorithm on the above benchmark problems using the automorphisms found by nauty. First, we sampled a fixed number of symmetric belief points (e.g., 300 for the tiger-grid) and ran the symmetric version of PBVI. We then counted the number of unique belief points obtained when the symmetric belief points are expanded by the automorphisms. We set this number (e.g., 590 for the tiger-grid) as the number of belief points to be used by PBVI, and ran PBVI in the same setting without automorphisms. Note that our implementation of PBVI differs from the original version: the original PBVI interleaves the belief point exploration and the value iteration, whereas ours fixes the belief points at the onset of execution. This was to analyze the efficiency of the symmetric backup in isolation from the effects of symmetric belief point exploration. Figure 8 shows the results of the experiments. In summary, automorphisms reduce the running time of PBVI by a factor of roughly 1.4 to 4.6 on these benchmarks without sacrificing the quality of the policy.

Figure 8: Performance comparisons of the PBVI algorithm with automorphisms. Symm-PBVI is the PBVI algorithm exploiting the automorphisms, i.e., symmetric belief collection and symmetric backup. |B| is the number of belief points given to the algorithms, |Γ| is the number of α-vectors comprising the policy, Iter is the number of iterations until convergence, V(b0) is the average return of the policy starting from initial belief b0, and ε is the convergence criterion of each algorithm, which runs until $\max_{b \in B} |V^{(n)}(b) - V^{(n-1)}(b)| \le \epsilon$. All V(b0)'s are within the 95% confidence interval of the optimal.

  Problem                         Algorithm   |B|   |Γ|   Iter   Exec time    V(b0)   ε
  Tiger                           PBVI        19    5     89     0.07 s       6.40    0.01
                                  Symm-PBVI   10    5     89     0.05 s       6.40
  Tiger-grid                      PBVI        590   532   88     359.69 s     0.80    0.03
                                  Symm-PBVI   300   529   85     196.09 s     0.80
  2-city-ticketing (perr = 0)     PBVI        51    5     167    157.80 s     8.74    0.02
                                  Symm-PBVI   17    5     168    57.60 s      8.74
  2-city-ticketing (perr = 0.1)   PBVI        104   9     167    546.04 s     7.76    0.02
                                  Symm-PBVI   30    10    167    201.97 s     7.73
  3-city-ticketing (perr = 0)     PBVI        261   37    91     43094.32 s   8.08    1.00
                                  Symm-PBVI   36    42    91     9395.06 s    8.08
  3-city-ticketing (perr = 0.1)   PBVI        275   39    91     43286.92 s   6.95    1.00
                                  Symm-PBVI   30    133   91     16791.17 s   6.94

Conclusion and Future Work

In this paper, we have presented a graph theoretic framework for finding symmetries in POMDPs. Given a POMDP, we can cast the problem of finding a symmetry as a graph automorphism (GA) problem. Our approach is an extension to the traditional model minimization methods in the sense that there are cases where minimal models still exhibit a number of symmetries. Although the problem of finding a graph automorphism is not yet known to be tractable, we were able to efficiently compute automorphisms of graphs with more than 60 million vertices, due to the sparseness of the encoded graph. We also demonstrated how the symmetries thus found can be exploited for speeding up conventional POMDP algorithms, especially the point-based backup operator in PBVI. Our approach is typically effective in the types of problems where there are multiple goals and the symmetries exist among these goals.

Although we have demonstrated the effectiveness of our approach only in the context of PBVI, we believe that our approach can be applied to a wide variety of POMDP algorithms as well. For example, the heuristic search value iteration algorithm (Smith & Simmons 2005) can be extended to exploit symmetries in belief point exploration and value function bound updates. Our approach can also be used to find symmetries for factored POMDPs (Boutilier & Poole 1996) or DEC-POMDPs (Bernstein et al. 2002) by working at the level of variables or agents. Finally, we are also working on extending approximate equivalence in model minimization (Givan, Leach, & Dean 2000) to approximate symmetries, in order to achieve further speed-up of POMDP algorithms by trading off the potential loss in the optimality of the solution against the speed gain in the algorithm.

Acknowledgments

This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant (R01-2007-000-210900) funded by the Korea government (MOST).

References

Bernstein, D. S.; Givan, R.; Immerman, N.; and Zilberstein, S. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4).
Bondy, J. A., and Murty, U. S. R. 1976. Graph Theory with Applications. Elsevier Science.
Boutilier, C., and Poole, D. 1996. Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of AAAI-1996.
Dean, T., and Givan, R. 1997. Model minimization in Markov decision processes. In Proceedings of AAAI-1997.
Garey, M. R., and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company.
Givan, R.; Leach, S.; and Dean, T. 2000. Bounded-parameter Markov decision processes. Artificial Intelligence 122.
Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101.
McKay, B. D. 2007. nauty user's guide (version 2.4).
Pineau, J.; Gordon, G.; and Thrun, S. 2003. Policy-contingent abstraction for robust robot control. In Proceedings of UAI-2003.
Pineau, J.; Gordon, G.; and Thrun, S. 2006. Anytime point-based approximation for large POMDPs. Journal of Artificial Intelligence Research 27.
Ravindran, B., and Barto, A. G. 2001. Symmetries and model minimization in Markov decision processes. Technical Report CMPSCI 01-43, University of Massachusetts, Amherst.
Smith, T., and Simmons, R. 2005. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of UAI-2005.
Williams, J. D.; Poupart, P.; and Young, S. 2005. Factored partially observable Markov decision processes for dialogue management. In Proceedings of IJCAI-2005 Workshop on Knowledge and Reasoning in Practical Dialogue Systems.
Wolfe, A. P. 2006. POMDP homomorphisms. In Proceedings of NIPS-2006.

Appendix

Theorem. If $b$ is a reachable belief point, then $f(b)$ is also a reachable belief point.

Proof. If $b$ is a reachable point from the initial belief point $b_0$ by executing a policy tree, $f(b)$ can also be reached by executing the policy tree where action nodes are relabeled as $g(a)$ and the observation edges are relabeled as $h(z)$. This is because the automorphism ensures that $T(s, a, s') = T(f(s), g(a), f(s'))$, $O(s, a, z) = O(f(s), g(a), h(z))$, and $b_0(s) = b_0(f(s))$.

Theorem. If $\alpha \in \Gamma^*$, then $f(\alpha) \in \Gamma^*$.

Proof. We prove by induction on the time step $t$ in $\Gamma_t$. By the definition of automorphism, $R(s, a) = R(f(s), g(a))$. Hence, if $\alpha \in \Gamma_0$ then $f(\alpha) \in \Gamma_0$. Suppose that the argument holds for $\Gamma_{t-1}$. This implies that $\forall \alpha \in \Gamma_t^{a,z}$, $f(\alpha) \in \Gamma_t^{g(a),h(z)}$, by the definition of $\Gamma_t^{a,z}$. If $\alpha \in \Gamma_t$, then by definition, for some $a$ and $b$,

$$\alpha(s) = \alpha_b^a(s) = R(s, a) + \sum_{z \in Z} \operatorname*{argmax}_{\alpha' \in \Gamma_t^{a,z}} (\alpha' \cdot b)$$

Consider the symmetric image defined as

$$\alpha(f(s)) = R(f(s), g(a)) + \sum_{h(z) \in Z} \operatorname*{argmax}_{\alpha'' \in \Gamma_t^{g(a),h(z)}} (\alpha'' \cdot f(b))$$

For each observation $h(z)$, the argmax will select the $\alpha''$ which is the symmetric image of the $\alpha'$ selected in $\operatorname*{argmax}_{\alpha' \in \Gamma_t^{a,z}} (\alpha' \cdot b)$. Hence we have $f(\alpha) \in \Gamma_t$.
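As an illustration of the second theorem, the following Python sketch applies the tiger state permutation to the five α-vectors reported in Figure 5 and checks that the set is closed under it. It also mirrors the inner step of the symmetric backup in Figure 6, where each newly backed-up α-vector is added together with its symmetric images. The helper names `permute` and `add_with_symmetric_images`, and the convention that a permutation is a list mapping state index s to f(s), are assumptions made for this example.

```python
# Alpha-vectors of the tiger problem as listed in Figure 5,
# indexed as (value at sLEFT, value at sRIGHT).
FIG5_ALPHAS = [
    (-86.5, 23.5),   # alpha_1
    (3.38, 19.4),    # alpha_2
    (15.0, 15.0),    # alpha_3
    (19.4, 3.38),    # alpha_4
    (23.5, -86.5),   # alpha_5
]

IDENTITY = [0, 1]    # f(s) = s
SWAP = [1, 0]        # f(sLEFT) = sRIGHT, f(sRIGHT) = sLEFT

def permute(vec, f):
    """Symmetric image of a vector: the entry of state s moves to state f[s]."""
    image = [0.0] * len(vec)
    for s, fs in enumerate(f):
        image[fs] = vec[s]
    return tuple(image)

def add_with_symmetric_images(gamma, alpha, state_perms):
    """Add alpha and its images under every state permutation, skipping duplicates."""
    if alpha in gamma:
        return
    gamma.append(alpha)
    for f in state_perms:
        image = permute(alpha, f)
        if image not in gamma:
            gamma.append(image)

# Closure check: the image of every Figure 5 alpha-vector is again in the set.
closed = all(permute(alpha, SWAP) in FIG5_ALPHAS for alpha in FIG5_ALPHAS)

# Backing up only alpha_1, alpha_2, alpha_3 and adding symmetric images
# recovers all five vectors.
gamma = []
for alpha in FIG5_ALPHAS[:3]:
    add_with_symmetric_images(gamma, alpha, [IDENTITY, SWAP])

print(closed, sorted(gamma) == sorted(FIG5_ALPHAS))
```

In this small case, three backups plus the swap reproduce all five α-vectors, which is the effect the symmetric backup aims for on larger problems.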
