CS 188: Artificial Intelligence
Fall 2008
Lecture 27: Conclusion
12/9/2008
Dan Klein – UC Berkeley
1
Autonomous Robotics
3
Policy Search
[DEMO]
 Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
   E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
   Same distinction between modeling and prediction showed up in classification (where?)
 Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
 This is the idea behind policy search, such as what controlled the upside-down helicopter (a sketch follows below)
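Below is a minimal policy-search sketch in Python (not the lecture's code; run_episodes is a hypothetical simulator hook that runs the policy with a given weight vector and returns its average reward): hill-climb directly on empirical reward, never fitting V or Q at all.

import random

# Hedged sketch: hill-climbing policy search over the weights of a
# feature-based policy. run_episodes(w) is an assumed hook returning
# the average reward earned by acting with weights w.
def policy_search(run_episodes, num_features, iters=1000, step=0.1):
    w = [0.0] * num_features                      # policy weights
    best = run_episodes(w)                        # evaluate by acting
    for _ in range(iters):
        i = random.randrange(num_features)
        w_new = list(w)
        w_new[i] += random.choice([-step, step])  # perturb one weight
        score = run_episodes(w_new)
        if score > best:                          # keep only reward improvements
            w, best = w_new, score
    return w

Note the contrast with Q-learning: nothing here ever tries to predict future rewards; the weights are judged solely by the decisions they produce.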
4
POMDPs
 Up until now:
   Search / MDPs: decision making when the world is fully observable (even if the actions are nondeterministic)
   Probabilistic reasoning: computing beliefs in a static world
   Learning: discovering how the world works
 What about acting under uncertainty?
 In general, the problem formalization is the partially observable Markov decision process (POMDP)
 A simple case: value of information
5
POMDPs
 MDPs have:
   States S
   Actions A
   Transition function P(s'|s,a) (or T(s,a,s'))
   Rewards R(s,a,s')
 POMDPs add:
   Observations O
   Observation function P(o|s) (or O(s,o))
 POMDPs are MDPs over belief states b (distributions over S); a belief-update sketch follows below
[Diagram: MDP expectimax tree s → a → (s,a) → s' beside the POMDP analogue b → a → (b,a) → o → b']
6
Example: Ghostbusters
 In (static) Ghostbusters:
   Belief state determined by evidence to date {e}
   Tree really over evidence sets
   Probabilistic reasoning needed to predict new evidence given past evidence
 Solving POMDPs:
   One way: use truncated expectimax to compute approximate values of actions, e.g. U(abust, {e})
   What if you only considered busting or one sense followed by a bust? You get the VPI agent from project 4! (sketch below)
[Diagram: expectimax tree over evidence sets, comparing U(abust, {e}) against asense → e' → abust with value U(abust, {e, e'})]
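The bust-now-or-sense-first comparison fits in a few lines. A hedged sketch (bust_utility and sense_model are hypothetical hooks into the inference code, not project 4's actual API):

def vpi_agent(bust_utility, sense_model, evidence):
    # utility of busting immediately: U(a_bust, {e})
    u_now = bust_utility(evidence)
    # expected utility of sensing once, then busting: average
    # U(a_bust, {e, e'}) over the predicted distribution of new
    # evidence e' given the evidence so far
    u_sense = sum(p * bust_utility(evidence + [e2])
                  for e2, p in sense_model(evidence))
    return 'bust' if u_now >= u_sense else 'sense'

This is just expectimax truncated to depth two, which is why it reproduces the VPI agent's behavior.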
7
More Generally
 General solutions map belief functions to actions
   Can divide regions of belief space (the set of belief functions) into policy regions (gets complex quickly)
   Can build approximate policies using discretization methods (toy example below)
   Can factor belief functions in various ways
 Overall, POMDPs are very (in fact PSPACE-) hard
 Most real problems are POMDPs, but we can rarely solve them in general!
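To make the discretization idea concrete, a toy sketch (nothing like a real solver; policy_table is a hypothetical lookup filled in offline) for a two-state problem, where a belief is just P(state 1):

def discretized_policy(b1, policy_table, buckets=10):
    # bucket the belief into a coarse grid over [0, 1] and return the
    # action precomputed for that region of belief space
    bucket = min(int(b1 * buckets), buckets - 1)
    return policy_table[bucket]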
8
Pacman Contest
 8 teams, 26 people qualified
 3rd Place: Niels Joubert, Michael Ngo, Rohit Nambiar, Tim Chen
 What they did: split offense / defense
   Strong offense: feature-based balance between eating dots and helping defend
10
Pacman Contest
 Blue Team: Yiding Jia
 Red Team: William Li, York Wu
 What they did (Yiding):
   Reflex plus tracking!
   Probabilistic inference, particle filtering, consider direct ghost observations and dot vanishings (tracking sketch below)
   Defense: move toward distributions, hope to get better info and hunt, stay near remaining food
   Offense: move toward guard-free dots, flee guard clouds
 What they did (William, York):
   … ??
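The tracking half of that design can be sketched as a standard particle filter step (illustrative only; transition_sample and observation_prob stand in for the ghost-movement model and the noisy-reading model):

import random

def particle_filter_step(particles, observation,
                         transition_sample, observation_prob):
    # elapse time: advance every particle with a sampled ghost move
    particles = [transition_sample(p) for p in particles]
    # weight each particle by how well it explains the observation
    weights = [observation_prob(observation, p) for p in particles]
    if sum(weights) == 0:
        return particles              # nothing explains it: keep prior
    # resample in proportion to the weights
    return random.choices(particles, weights=weights, k=len(particles))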
[DEMO]
11
Example: Stratagus
[DEMO]
12
Hierarchical RL
 Stratagus: example of a large RL task
 Stratagus is hard for reinforcement learning algorithms:
   > 10^100 states
   > 10^30 actions at each point
   Time horizon ≈ 10^4 steps
 Stratagus is hard for human programmers:
   Typically takes several person-months for game companies to write a computer opponent
   Still no match for experienced human players
   Programming involves much trial and error
 Hierarchical RL:
   Humans supply high-level prior knowledge using a partial program
   Learning algorithm fills in the details
[From Bhaskara Marthi's thesis]
13
Partial “Alisp” Program
(defun top ()
  ;; top level: forever choose between the two gathering subroutines
  (loop
    (choose
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  ;; choose a forest, collect wood, and carry it back to base
  (with-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  ;; choose a goldmine, collect gold, and carry it back to base
  (with-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  ;; move one step at a time (the learner picks the steps)
  ;; until the destination is reached
  (until (= (pos (get-state)) dest)
    (with-choice (move '(N S E W NOOP))
      (action move))))
14
Hierarchical RL
 Define a hierarchical Q-function which learns a linear feature-based mini-Q-function at each choice point (sketch below)
 Very good at balancing resources and directing rewards to the right region
 Still not very good at the strategic elements of these kinds of games (i.e. the Markov game aspect)
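One way to picture the per-choice-point learner (a sketch under assumptions, not Marthi's actual implementation): keep a separate linear Q-function indexed by choice point, and apply the ordinary feature-based TD update to whichever one made the decision.

from collections import defaultdict

class ChoicePointQ:
    def __init__(self, alpha=0.1, gamma=0.9):
        # one weight vector (feature -> weight dict) per choice point
        self.weights = defaultdict(lambda: defaultdict(float))
        self.alpha, self.gamma = alpha, gamma

    def q(self, point, feats):
        # linear mini-Q-function for this choice point
        return sum(self.weights[point][f] * v for f, v in feats.items())

    def update(self, point, feats, reward, next_value):
        # ordinary TD update, applied only to the deciding choice point
        error = reward + self.gamma * next_value - self.q(point, feats)
        for f, v in feats.items():
            self.weights[point][f] += self.alpha * error * v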
[DEMO]
15
Bugman
 AI = Animal Intelligence?
 Wim van Eck at Leiden University
 Pacman controlled by a human
 Ghosts controlled by crickets
 Vibrations drive crickets toward or away from Pacman's location
[DEMO]
http://pong.hku.nl/~wim/bugman.htm
16
Where to go next?
 Congratulations, you've seen the basics of modern AI!
 More directions:
   Robotics / vision / IR / language: cs194
     There will be a web form to get more info, from the 188 page
   Machine learning: cs281a
   Cognitive modeling: cog sci 131
   NLP: 288
   Vision: 280
   … and more; ask if you're interested
17
That’s It!
 Help us out with some course evaluations
 Have a good break, and always maximize your expected utilities!
18