What Are Partially Observable Markov Decision Processes and Why Might You Care?
Bob Wall
CS 536

POMDPs
• A special case of the Markov Decision Process (MDP). In an MDP, the environment is fully observable, and with the Markov assumption for the transition model, the optimal policy depends only on the current state.
• For POMDPs, the environment is only partially observable.

POMDP Implications
• Since the current state is not necessarily known, the agent cannot execute the optimal policy for that state.
• A POMDP is defined by the following:
  – Set of states S, set of actions A, set of observations O
  – Transition model T(s, a, s')
  – Reward model R(s)
  – Observation model O(s, o) – probability of observing observation o in state s

POMDP Implications (cont.)
• The optimal action depends not on the current state but on the agent's current belief state.
  – The belief state is a probability distribution over all possible states.
• Given a belief state b, if the agent does action a and perceives observation o, the new belief state is
  – b'(s') = α O(s', o) Σ_s T(s, a, s') b(s)
• The optimal policy π*(b) maps from belief states to actions.

POMDP Solutions
• Solving a POMDP on a physical state space is equivalent to solving an MDP on the belief state space.
• However, the belief state space is continuous and very high-dimensional, so solutions are difficult to compute.
• Even finding approximately optimal solutions is PSPACE-hard (i.e., really hard).

Why Study POMDPs?
• In spite of the difficulties, POMDPs are still very important.
  – Many real-world problems and situations are not fully observable, but the Markov assumption is often valid.
• Active area of research
  – A Google search on "POMDP" returns ~5000 results
  – A number of current papers on the topic

Some Solution Techniques
• Most exact solution algorithms (value iteration, policy iteration) use dynamic programming techniques.
  – These techniques transform one value function into another that can be used in an MDP solution technique; the value function over the belief space is piecewise linear and convex (PWLC). (A small sketch of this representation appears at the end of these notes.)
  – Dynamic programming algorithms: one-pass (1971), exhaustive (1982), linear support (1988), witness (1996)
  – A better method – incremental pruning (1996)

POMDPs at Work
• Pattern recognition tasks
  – SA-POMDP (Single-Action POMDP) – the only decision is whether to change state or not
  – Model constructed to recognize words within text to which noise was added, i.e., individual letters within the words were corrupted
  – The SA-POMDP outperformed a pattern recognizer based on Hidden Markov Models and exhibited better immunity to noise

POMDPs at Work (cont.)
• Robotics
  – Mission planning
  – Robot navigation
    • POMDP used to control the movement of an autonomous robot within a crowded environment
    • Used to predict the motion of other objects within the robot's environment
    • The state space is decomposed into a hierarchy, so individual POMDPs have a computationally tractable task

POMDPs at Work (cont.)
• BATmobile – the Bayesian Autonomous Taxi
  – Many different tasks make use of a number of AI techniques
  – POMDPs used for the actual driving control (as opposed to higher-level trip planning)
  – To compute efficiently, uses approximation techniques

BAT (cont.)
• Several different techniques combined:
  – Dynamic Probabilistic Network (DPN) to maintain the current belief state
  – Dynamic Decision Network (DDN) to perform bounded lookahead (see the sketch after this slide)
  – Hand-coded explicit policy representations – i.e., decision trees
  – Supervised / reinforcement learning techniques to learn policy decisions
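To make the belief-state machinery on the preceding slides concrete, here is a minimal Python sketch (not part of the original presentation). It writes down a tiny two-state toy POMDP with made-up states, rewards, and probabilities, implements the belief update b'(s') = α O(s', o) Σ_s T(s, a, s') b(s) from the "POMDP Implications (cont.)" slide, and performs a bounded-depth lookahead over belief states in the spirit of the BAT's dynamic decision network. Every name and number below is an illustrative assumption; the reward is written R(s, a) rather than the slides' R(s) purely for the example.

# Minimal sketch, assuming a made-up two-state toy problem.
S = ["left", "right"]                      # hidden states
A = ["listen", "open-left", "open-right"]  # actions
Omega = ["hear-left", "hear-right"]        # observations (the set O on the slides)

def T(s, a, s2):
    """Transition model T(s, a, s'): listening leaves the state unchanged;
    opening a door resets the problem to a uniformly random state."""
    if a == "listen":
        return 1.0 if s2 == s else 0.0
    return 0.5

def R(s, a):
    """Reward model (state-action form used only for this sketch)."""
    if a == "listen":
        return -1.0
    if (a == "open-left") == (s == "left"):
        return -100.0                      # opened the penalty door
    return 10.0

def Obs(s2, o):
    """Observation model O(s', o): probability of observing o in state s'."""
    return 0.85 if o == "hear-" + s2 else 0.15

def belief_update(b, a, o):
    """b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)."""
    unnormalized = {s2: Obs(s2, o) * sum(T(s, a, s2) * b[s] for s in S) for s2 in S}
    alpha = sum(unnormalized.values())
    return {s2: p / alpha for s2, p in unnormalized.items()}

def lookahead(b, depth, gamma=0.95):
    """Return (value, action) of the best action from belief b,
    searching only `depth` steps ahead (bounded lookahead)."""
    if depth == 0:
        return 0.0, None
    best_v, best_a = float("-inf"), None
    for a in A:
        v = sum(b[s] * R(s, a) for s in S)             # expected immediate reward
        for o in Omega:
            # probability of observing o after doing a in belief b
            p_o = sum(Obs(s2, o) * T(s, a, s2) * b[s] for s in S for s2 in S)
            if p_o > 0:
                v += gamma * p_o * lookahead(belief_update(b, a, o), depth - 1, gamma)[0]
        if v > best_v:
            best_v, best_a = v, a
    return best_v, best_a

if __name__ == "__main__":
    b = {"left": 0.5, "right": 0.5}                    # uniform initial belief
    b = belief_update(b, "listen", "hear-left")        # belief shifts toward "left"
    print(b, lookahead(b, depth=3)[1])

Running the sketch shows the two ideas in one place: the agent never learns its state exactly, only a distribution over states, and its action choice is computed from that distribution by looking a few steps ahead.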
BAT (cont.)
• The BAT has been constructed in a simulation environment and has been demonstrated to successfully handle a variety of driving problems, such as passing slower vehicles, reacting to unsafe drivers, avoiding stalled vehicles, and merging into traffic.

Resources
• Tutorial on POMDPs:
  – http://www.cs.brown.edu/research/ai/pomdp/tutorial/index.html
• Additional pointers to articles on my web site:
  – http://www.cs.montana.edu/~bwall/cs536
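Appendix-style sketch, referenced from the "Some Solution Techniques" slide and not part of the original presentation: it illustrates how a piecewise linear and convex (PWLC) value function over belief space can be represented as a finite set of alpha-vectors, with a crude sampled pruning check standing in for the linear-programming pruning used by exact algorithms such as incremental pruning. The vectors and numbers are made up for illustration.

# Each alpha-vector assigns a value to every physical state; the value of a
# belief b is the best dot product over the current set of vectors.
alpha_vectors = [
    {"left": 10.0, "right": -100.0},   # e.g. "open the right door"
    {"left": -100.0, "right": 10.0},   # e.g. "open the left door"
    {"left": -1.0, "right": -1.0},     # e.g. "keep listening"
]

def value(b, vectors):
    """V(b) = max over alpha-vectors of sum_s alpha(s) * b(s);
    this maximum of linear functions is piecewise linear and convex in b."""
    return max(sum(alpha[s] * b[s] for s in b) for alpha in vectors)

def prunable(alpha, others, beliefs):
    """Crude check: alpha can be dropped if, at every sampled belief point, some
    other vector does at least as well, so alpha never attains the maximum.
    Exact algorithms test this with linear programs rather than a sample grid."""
    return all(
        any(sum(o[s] * b[s] for s in b) >= sum(alpha[s] * b[s] for s in b) for o in others)
        for b in beliefs
    )

if __name__ == "__main__":
    grid = [{"left": p / 10, "right": 1 - p / 10} for p in range(11)]
    print(value({"left": 0.5, "right": 0.5}, alpha_vectors))   # -1.0 ("listen")
    extra = {"left": -2.0, "right": -2.0}                      # worse everywhere
    print(prunable(extra, alpha_vectors, grid))                # True

The point of the representation is that dynamic programming only has to manipulate this finite set of vectors; pruning dominated vectors, as in incremental pruning, keeps the set from growing out of control.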