CS 188: Artificial Intelligence, Fall 2008
Lecture 27: Conclusion
12/9/2008
Dan Klein – UC Berkeley

Autonomous Robotics

Policy Search [demo]
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
- E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
- The same distinction between modeling and prediction showed up in classification (where?)
- Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter (a small sketch of the idea follows these slides)

POMDPs
- Up until now:
  - Search / MDPs: decision making when the world is fully observable (even if the actions are nondeterministic)
  - Probabilistic reasoning: computing beliefs in a static world
  - Learning: discovering how the world works
- What about acting under uncertainty?
- In general, the problem formalization is the partially observable Markov decision process (POMDP)
- A simple case: value of information

POMDPs
- MDPs have:
  - States S
  - Actions A
  - Transition function P(s'|s,a) (or T(s,a,s'))
  - Rewards R(s,a,s')
- POMDPs add:
  - Observations O
  - Observation function P(o|s) (or O(s,o))
- POMDPs are MDPs over belief states b (distributions over S); a belief-update sketch follows these slides
[Figure: MDP lookahead tree over s, a, (s,a), (s,a,s'), s'; POMDP lookahead tree over b, (b,a), o, b']

Example: Ghostbusters
- In (static) Ghostbusters:
  - The belief state is determined by the evidence to date {e}
  - The tree is really over evidence sets
  - Probabilistic reasoning is needed to predict new evidence given past evidence

Solving POMDPs
- One way: use truncated expectimax to compute approximate values of actions
- What if you only considered busting, or one a_sense followed by a bust? You get the VPI agent from project 4! (sketch below)
[Figure: expectimax tree over evidence sets {e} and {e, e'}, actions a_bust and a_sense, new evidence e', values U(a_bust, {e}) and U(a_bust, {e, e'})]

More Generally
- General solutions map belief functions to actions
  - Can divide the belief space (the set of belief functions) into policy regions (gets complex quickly)
  - Can build approximate policies using discretization methods
  - Can factor belief functions in various ways
- Overall, POMDPs are very (actually PSPACE-) hard
- Most real problems are POMDPs, but we can rarely solve them in general!

Pacman Contest
- 8 teams, 26 people qualified
- 3rd Place: Niels Joubert, Michael Ngo, Rohit Nambiar, Tim Chen
- What they did: split offense / defense
  - Strong offense: feature-based balance between eating dots and helping defend

Pacman Contest
- Blue Team: Yiding Jia
- Red Team: William Li, York Wu
- What they did (Yiding): reflex plus tracking!
  - Probabilistic inference: particle filtering, considering direct ghost observations and dot vanishings
  - Defense: move toward the distributions, hope to get better info and hunt, stay near remaining food
  - Offense: move toward guard-free dots, flee guard clouds
- What they did (William, York): … ??
[DEMO]
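To accompany the Policy Search slide above, here is a minimal sketch of the idea in Python: instead of fitting weights so a linear value function predicts returns well, search directly over weights for the policy that earns the most reward. The hooks run_episodes and features are hypothetical stand-ins for an environment interface, not course code.

    import random

    def policy(weights, state, legal_actions, features):
        """Greedy policy w.r.t. a linear feature-based score (not a value estimate)."""
        def score(a):
            return sum(weights[f] * v for f, v in features(state, a).items())
        return max(legal_actions(state), key=score)

    def evaluate(weights, run_episodes):
        """Average episode reward of the induced policy: the objective is reward, not Bellman error."""
        # run_episodes is a hypothetical hook that plays episodes with the given policy.
        return run_episodes(lambda s, legal, feats: policy(weights, s, legal, feats))

    def policy_search(weights, run_episodes, iters=100, step=0.1):
        """Simple hill climbing in weight space: keep perturbations that improve reward."""
        best = evaluate(weights, run_episodes)
        for _ in range(iters):
            candidate = {f: w + random.gauss(0, step) for f, w in weights.items()}
            value = evaluate(candidate, run_episodes)
            if value > best:
                weights, best = candidate, value
        return weights

Real policy search methods (e.g. policy gradients) are more sophisticated, but the objective is the same: optimize the rewards the policy actually collects.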
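The POMDP slides say a POMDP is an MDP over belief states b. A minimal sketch of the corresponding belief update, assuming dictionary-based beliefs and illustrative transition / observation interfaces (names are not from the course code):

    def update_belief(belief, action, observation, transition, observe):
        """Compute b'(s') proportional to P(o|s') * sum_s P(s'|s,a) * b(s).

        belief:     dict mapping state -> probability
        transition: function (s, a) -> dict mapping s' -> P(s'|s,a)   (assumed interface)
        observe:    function (s', o) -> P(o|s')                       (assumed interface)
        """
        new_belief = {}
        # Push the belief through the transition model for the chosen action.
        for s, p in belief.items():
            for s2, p_trans in transition(s, action).items():
                new_belief[s2] = new_belief.get(s2, 0.0) + p * p_trans
        # Weight by the observation likelihood, then renormalize.
        for s2 in new_belief:
            new_belief[s2] *= observe(s2, observation)
        total = sum(new_belief.values())
        return {s2: p / total for s2, p in new_belief.items()} if total > 0 else new_belief

The agent then plans over these belief states b rather than over the hidden states themselves.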
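The Solving POMDPs slide compares U(a_bust, {e}) against sensing once and then busting, which is exactly the project 4 VPI agent. A hedged sketch of that one-step lookahead, assuming hypothetical helpers bust_utility(belief) (expected utility of busting now) and observation_distribution(belief) (predictive distribution over the next reading), plus a belief updater like the one sketched above specialized to the sensing action:

    def value_of_busting_now(belief, bust_utility):
        """U(a_bust, {e}): expected utility of busting under the current belief."""
        return bust_utility(belief)

    def value_of_sensing_then_busting(belief, bust_utility, observation_distribution,
                                      update_belief, sense_cost=0.0):
        """Expected utility of a_sense followed by a_bust: average U(a_bust, {e, e'})
        over the predicted next evidence e'."""
        # observation_distribution and update_belief are assumed interfaces, not project code.
        expected = 0.0
        for observation, prob in observation_distribution(belief).items():
            new_belief = update_belief(belief, observation)
            expected += prob * bust_utility(new_belief)
        return expected - sense_cost

    def vpi_agent_action(belief, bust_utility, observation_distribution, update_belief):
        """Sense only if the expected gain from the new evidence exceeds its cost."""
        now = value_of_busting_now(belief, bust_utility)
        later = value_of_sensing_then_busting(belief, bust_utility,
                                              observation_distribution, update_belief)
        return 'sense' if later > now else 'bust'

This is truncated expectimax with a depth of one observation; deeper truncations interleave more sense/act layers before backing up utilities.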
Example: Stratagus [DEMO]

Hierarchical RL
- Stratagus: an example of a large RL task
- Stratagus is hard for reinforcement learning algorithms:
  - > 10^100 states
  - > 10^30 actions at each point
  - Time horizon ≈ 10^4 steps
- Stratagus is hard for human programmers:
  - Typically takes several person-months for game companies to write a computer opponent
  - Still no match for experienced human players
  - Programming involves much trial and error
- Hierarchical RL: humans supply high-level prior knowledge as a partial program; the learning algorithm fills in the details
[From Bhaskara Marthi's thesis]

Partial "Alisp" Program

    (defun top ()
      (loop
        (choose (gather-wood) (gather-gold))))

    (defun gather-wood ()
      (with-choice (dest *forest-list*)
        (nav dest)
        (action 'get-wood)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun gather-gold ()
      (with-choice (dest *goldmine-list*)
        (nav dest)
        (action 'get-gold)
        (nav *base-loc*)
        (action 'dropoff)))

    (defun nav (dest)
      (until (= (pos (get-state)) dest)
        (with-choice (move '(N S E W NOOP))
          (action move))))

Hierarchical RL
- Define a hierarchical Q-function, which learns a linear feature-based mini-Q-function at each choice point (a sketch appears at the end of these notes)
- Very good at balancing resources and directing rewards to the right region
- Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect)
[DEMO]

Bugman
- AI = Animal Intelligence?
- Wim van Eck at Leiden University
- Pacman controlled by a human; ghosts controlled by crickets
- Vibrations drive the crickets toward or away from Pacman's location
[DEMO] http://pong.hku.nl/~wim/bugman.htm

Where to go next?
- Congratulations, you've seen the basics of modern AI!
- More directions:
  - Robotics / vision / IR / language: cs194 (there will be a web form to get more info, from the 188 page)
  - Machine learning: cs281a
  - Cognitive modeling: cog sci 131
  - NLP: 288
  - Vision: 280
  - … and more; ask if you're interested

That's It!
- Help us out with some course evaluations
- Have a good break, and always maximize your expected utilities!
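As referenced on the Hierarchical RL slide above, here is a minimal sketch of keeping a linear feature-based mini-Q-function at each choice point of a partial program. The feature extractor and learning constants are illustrative, not Marthi's actual Alisp machinery.

    from collections import defaultdict

    class ChoicePointLearner:
        """One linear Q-function per choice point: the partial program asks it
        which allowed completion to take, and TD errors update that point's weights."""

        def __init__(self, alpha=0.1, gamma=0.99):
            # choice point -> feature name -> weight
            self.weights = defaultdict(lambda: defaultdict(float))
            self.alpha, self.gamma = alpha, gamma

        def q(self, point, state, choice, features):
            # features(state, choice) is an assumed feature extractor returning a dict.
            return sum(self.weights[point][f] * v
                       for f, v in features(state, choice).items())

        def choose(self, point, state, choices, features):
            return max(choices, key=lambda c: self.q(point, state, c, features))

        def update(self, point, state, choice, reward, next_state, next_choices, features):
            # Standard feature-based Q-learning update, applied only at choice points.
            target = reward + self.gamma * max(
                (self.q(point, next_state, c, features) for c in next_choices), default=0.0)
            error = target - self.q(point, state, choice, features)
            for f, v in features(state, choice).items():
                self.weights[point][f] += self.alpha * error * v

The hierarchy comes from the partial program: primitive actions between choice points need no learned values, so learning effort is concentrated where the programmer left decisions open.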