CS 188: Artificial Intelligence
Fall 2008
Lecture 27: Conclusion
12/9/2008
Dan Klein – UC Berkeley
1
Autonomous Robotics
3
Policy Search
[DEMO]
 Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
   E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
   Same distinction between modeling and prediction showed up in classification (where?)
 Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
 This is the idea behind policy search, such as what controlled the upside-down helicopter (a sketch follows below)
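Below is a minimal policy-search sketch in Python (not the lecture's code; run_episodes is a hypothetical simulator hook that runs the policy with a given weight vector and returns its average reward): hill-climb directly on empirical reward, never fitting V or Q at all.

import random

# Hedged sketch: hill-climbing policy search over the weights of a
# feature-based policy. run_episodes(w) is an assumed hook returning
# the average reward earned by acting with weights w.
def policy_search(run_episodes, num_features, iters=1000, step=0.1):
    w = [0.0] * num_features                      # policy weights
    best = run_episodes(w)                        # evaluate by acting
    for _ in range(iters):
        i = random.randrange(num_features)
        w_new = list(w)
        w_new[i] += random.choice([-step, step])  # perturb one weight
        score = run_episodes(w_new)
        if score > best:                          # keep only reward improvements
            w, best = w_new, score
    return w

Note the contrast with Q-learning: nothing here ever tries to predict future rewards; the weights are judged solely by the decisions they produce.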
4
POMDPs
 Up until now:
   Search / MDPs: decision making when the world is fully observable (even if the actions are nondeterministic)
   Probabilistic reasoning: computing beliefs in a static world
   Learning: discovering how the world works
 What about acting under uncertainty?
 In general, the problem formalization is the partially observable Markov decision process (POMDP)
 A simple case: value of information
5
POMDPs
 MDPs have:
   States S
   Actions A
   Transition function P(s'|s,a) (or T(s,a,s'))
   Rewards R(s,a,s')
 POMDPs add:
   Observations O
   Observation function P(o|s) (or O(s,o))
 POMDPs are MDPs over belief states b (distributions over S); a belief-update sketch follows below
[Diagram: MDP expectimax tree s → a → (s,a) → s' beside the POMDP analogue b → a → (b,a) → o → b']
6
Example: Ghostbusters
 In (static) Ghostbusters:
   Belief state determined by evidence to date {e}
   Tree really over evidence sets
   Probabilistic reasoning needed to predict new evidence given past evidence
 Solving POMDPs:
   One way: use truncated expectimax to compute approximate values of actions, e.g. U(abust, {e})
   What if you only considered busting or one sense followed by a bust? You get the VPI agent from project 4! (sketch below)
[Diagram: expectimax tree over evidence sets, comparing U(abust, {e}) against asense → e' → abust with value U(abust, {e, e'})]
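The bust-now-or-sense-first comparison fits in a few lines. A hedged sketch (bust_utility and sense_model are hypothetical hooks into the inference code, not project 4's actual API):

def vpi_agent(bust_utility, sense_model, evidence):
    # utility of busting immediately: U(a_bust, {e})
    u_now = bust_utility(evidence)
    # expected utility of sensing once, then busting: average
    # U(a_bust, {e, e'}) over the predicted distribution of new
    # evidence e' given the evidence so far
    u_sense = sum(p * bust_utility(evidence + [e2])
                  for e2, p in sense_model(evidence))
    return 'bust' if u_now >= u_sense else 'sense'

This is just expectimax truncated to depth two, which is why it reproduces the VPI agent's behavior.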
7
More Generally
 General solutions map belief functions to actions
   Can divide regions of belief space (the set of belief functions) into policy regions (gets complex quickly)
   Can build approximate policies using discretization methods (toy example below)
   Can factor belief functions in various ways
 Overall, POMDPs are very (in fact PSPACE-) hard
 Most real problems are POMDPs, but we can rarely solve them in general!
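To make the discretization idea concrete, a toy sketch (nothing like a real solver; policy_table is a hypothetical lookup filled in offline) for a two-state problem, where a belief is just P(state 1):

def discretized_policy(b1, policy_table, buckets=10):
    # bucket the belief into a coarse grid over [0, 1] and return the
    # action precomputed for that region of belief space
    bucket = min(int(b1 * buckets), buckets - 1)
    return policy_table[bucket]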
8
Pacman Contest
 8 teams, 26 people qualified
 3rd Place: Niels Joubert, Michael Ngo, Rohit Nambiar, Tim Chen
 What they did: split offense / defense
   Strong offense: feature-based balance between eating dots and helping defend
10
Pacman Contest
 Blue Team: Yiding Jia
 Red Team: William Li, York Wu
 What they did (Yiding):
   Reflex plus tracking!
   Probabilistic inference, particle filtering, consider direct ghost observations and dot vanishings (tracking sketch below)
   Defense: move toward distributions, hope to get better info and hunt, stay near remaining food
   Offense: move toward guard-free dots, flee guard clouds
 What they did (William, York):
   … ??
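The tracking half of that design can be sketched as a standard particle filter step (illustrative only; transition_sample and observation_prob stand in for the ghost-movement model and the noisy-reading model):

import random

def particle_filter_step(particles, observation,
                         transition_sample, observation_prob):
    # elapse time: advance every particle with a sampled ghost move
    particles = [transition_sample(p) for p in particles]
    # weight each particle by how well it explains the observation
    weights = [observation_prob(observation, p) for p in particles]
    if sum(weights) == 0:
        return particles              # nothing explains it: keep prior
    # resample in proportion to the weights
    return random.choices(particles, weights=weights, k=len(particles))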
[DEMO]
11
Example: Stratagus
[DEMO]
12
Hierarchical RL
 Stratagus: example of a large RL task
 Stratagus is hard for reinforcement learning algorithms:
   > 10^100 states
   > 10^30 actions at each point
   Time horizon ≈ 10^4 steps
 Stratagus is hard for human programmers:
   Typically takes several person-months for game companies to write a computer opponent
   Still no match for experienced human players
   Programming involves much trial and error
 Hierarchical RL:
   Humans supply high-level prior knowledge using a partial program
   Learning algorithm fills in the details
[From Bhaskara Marthi's thesis]
13
Partial “Alisp” Program
(defun top ()
  ;; top level: forever choose between the two gathering subroutines
  (loop
    (choose
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  ;; choose a forest, collect wood, and carry it back to base
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  ;; choose a goldmine, collect gold, and carry it back to base
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  ;; move one step at a time (the learner picks the steps)
  ;; until the destination is reached
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))
14
Hierarchical RL
 Define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point (sketch below)
 Very good at balancing resources and directing rewards to the right region
 Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect)
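One way to picture the per-choice-point learner (a sketch under assumptions, not Marthi's actual implementation): keep a separate linear Q-function indexed by choice point, and apply the ordinary feature-based TD update to whichever one made the decision.

from collections import defaultdict

class ChoicePointQ:
    def __init__(self, alpha=0.1, gamma=0.9):
        # one weight vector (feature -> weight dict) per choice point
        self.weights = defaultdict(lambda: defaultdict(float))
        self.alpha, self.gamma = alpha, gamma

    def q(self, point, feats):
        # linear mini-Q-function for this choice point
        return sum(self.weights[point][f] * v for f, v in feats.items())

    def update(self, point, feats, reward, next_value):
        # ordinary TD update, applied only to the deciding choice point
        error = reward + self.gamma * next_value - self.q(point, feats)
        for f, v in feats.items():
            self.weights[point][f] += self.alpha * error * v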
[DEMO]
15
Bugman
 AI = Animal Intelligence?
 Wim van Eck at Leiden University
 Pacman controlled by a human
 Ghosts controlled by crickets
 Vibrations drive crickets toward or away from Pacman's location
[DEMO]
http://pong.hku.nl/~wim/bugman.htm
16
Where to go next?
 Congratulations, you've seen the basics of modern AI!
 More directions:
   Robotics / vision / IR / language: cs194
     There will be a web form to get more info, from the 188 page
   Machine learning: cs281a
   Cognitive modeling: cog sci 131
   NLP: 288
   Vision: 280
   … and more; ask if you're interested
17
That’s It!
 Help us out with some course evaluations
 Have a good break, and always maximize your expected utilities!
18