CS 182/CogSci110/Ling109
Spring 2008
Reinforcement Learning: Algorithms
4/1/2008
Srini Narayanan – ICSI and UC Berkeley
Lecture Outline
- Introduction
- Basic Concepts
  - Expectation, Utility, MEU
- Neural correlates of reward-based learning
- Utility theory from economics
  - Preferences, Utilities
- Reinforcement Learning: AI approach
  - The problem
  - Computing total expected value with discounting
  - Q-values, Bellman's equation
  - TD-Learning
Reinforcement Learning
DEMO
- Basic idea:
  - Receive feedback in the form of rewards
  - Agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected utility
  - Change the rewards, change the behavior
Elements of RL
[Diagram: the agent-environment loop. The agent observes a state, its policy selects an action, and the environment returns a reward and the next state, producing a trajectory s0 --a0--> s1 --a1--> s2 --a2--> ... with rewards r0, r1, r2, ...]
- Transition model: how actions influence states
- Reward R: the immediate value of a state-action transition
- Policy π: maps states to actions
Markov Decision Processes
- Markov decision processes (MDPs):
  - A set of states s ∈ S
  - A model T(s, a, s') = P(s' | s, a)
    - the probability that action a in state s leads to s'
  - A reward function R(s, a, s') (sometimes just R(s) for leaving a state, or R(s') for entering one)
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are the simplest case of reinforcement learning
- In general reinforcement learning, we don't know the model or the reward function
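As a concrete (hypothetical) illustration of these ingredients, here is how a tiny MDP could be written down in Python; the states, transition probabilities, and rewards are invented for this sketch and are not from the lecture:

```python
# A tiny illustrative MDP: states, actions, transition model T(s,a,s') and rewards R(s,a,s').
# All numbers below are made up for illustration.
STATES = ["s0", "s1", "G"]           # "G" is a terminal goal state
ACTIONS = ["left", "right"]

# T[(s, a)] is a list of (next_state, probability) pairs
T = {
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("G", 0.8), ("s1", 0.2)],
    ("s1", "left"):  [("s0", 1.0)],
}

# R[(s, a, s')] is the immediate reward for that transition; only entering G pays off
R = {(s, a, s2): 0.0 for (s, a), outcomes in T.items() for s2, _ in outcomes}
R[("s1", "right", "G")] = 100.0
```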
Elements of RL
[Grid-world figure: r(state, action) immediate reward values. Every transition has reward 0 except those entering the goal state G, which have reward 100.]
Reward Sequences
- In order to formalize optimality of a policy, we need to understand utilities of reward sequences
- Typically we consider stationary preferences: if I prefer one state sequence starting today, I would prefer the same sequence starting tomorrow
- Theorem: there are only two ways to define stationary utilities
  - Additive utility: U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
  - Discounted utility: U([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + ...
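A quick sketch of how a discounted utility could be computed for a finite reward sequence; the rewards and discount factor below are arbitrary illustrations:

```python
def discounted_utility(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps with reward 0, then 100 on entering a goal
print(discounted_utility([0, 0, 0, 100], gamma=0.9))  # 0.9**3 * 100 ≈ 72.9
```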
Elements of RL
[Grid-world figure: r(state, action) immediate reward values (100 for entering the goal G, 0 otherwise) and the resulting V*(state) values 81, 90, 100 along the path to G.]
- Value function: maps states to state values
  V^π(s) ≡ r(t) + γ r(t+1) + γ² r(t+2) + ...
- Discount factor γ ∈ [0, 1) (here 0.9)
RL task (restated)
- Execute actions in the environment, observe the results
- Learn an action policy π : state → action that maximizes the expected discounted reward
  E[ r(t) + γ r(t+1) + γ² r(t+2) + ... ]
  from any starting state in S
Hyperbolic discounting (Ainslie 1992)
- Short-term rewards are treated differently from long-term rewards
- Used in many animal discounting models
- Has been used to explain procrastination and addiction
- Evidence from neuroscience (next lecture)
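As a rough illustration (not from the slides): a common hyperbolic form associated with this line of work is 1/(1 + kD) for a reward delayed by D, which falls off more slowly at long delays than exponential discounting γ^D. The parameter values below are arbitrary:

```python
# Compare exponential and hyperbolic discount weights at a few delays.
# k and gamma are arbitrary illustrative values.
def exponential_discount(delay, gamma=0.9):
    return gamma ** delay

def hyperbolic_discount(delay, k=0.5):
    return 1.0 / (1.0 + k * delay)

for d in [0, 1, 5, 20]:
    print(d, round(exponential_discount(d), 3), round(hyperbolic_discount(d), 3))
```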
MDP Solutions
- In deterministic single-agent search, we want an optimal sequence of actions from the start to a goal
- In an MDP we want an optimal policy π(s)
  - A policy gives an action for each state
  - The optimal policy maximizes expected utility (i.e., expected rewards) if followed
[Figure: the optimal policy when R(s, a, s') = -0.04 for all non-terminal states s]

Example Optimal Policies
[Figure: four optimal-policy grids for different living rewards: R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0]
Utility of a State
- Define the utility of a state under a policy:
  V^π(s) = expected total (discounted) rewards starting in s and following π
- Recursive definition (one-step look-ahead):
  V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
- Also called policy evaluation (a code sketch follows below)
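A minimal sketch of iterative policy evaluation built on this one-step look-ahead, reusing the illustrative T, R, and STATES from the earlier MDP sketch (all names and values are assumptions, not course code):

```python
def policy_evaluation(policy, T, R, states, gamma=0.9, iters=100):
    """Repeatedly apply V(s) <- sum_s' T(s,pi(s),s') * [R(s,pi(s),s') + gamma * V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: sum(p * (R[(s, policy[s], s2)] + gamma * V[s2])
                   for s2, p in T.get((s, policy[s]), []))
            for s in states
        }
    return V

# Evaluate the "always move right" policy on the toy MDP
print(policy_evaluation({"s0": "right", "s1": "right", "G": "right"}, T, R, STATES))
```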
Bellman's Equation for Selecting Actions
- The definition of utility leads to a simple relationship among optimal utility values:
  optimal rewards = maximize over the first action, then follow the optimal policy
- Formally, Bellman's equation:
  V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
[Image of Richard Bellman: "That's my equation!"]
Q-values
- The expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s, a)):
  Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
[Grid-world figure: the same grid shown three ways: r(state, action) immediate reward values (100 into the goal G, 0 otherwise), V*(state) values (81, 90, 100), and Q(state, action) values (72, 81, 90, 100).]
Representation
- Explicit: a table of Q(s, a) values for each state and action:

  State | Action    | Q(s, a)
  ------+-----------+--------
  2     | MoveLeft  | 81
  2     | MoveRight | 100
  ...   | ...       | ...

- Implicit: weighted linear function / neural network
  - Classical weight updating
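A sketch of the explicit representation as a Python dictionary keyed by (state, action), mirroring the table above; the states and values are illustrative:

```python
# Explicit representation: a lookup table keyed by (state, action).
Q = {
    (2, "MoveLeft"): 81.0,
    (2, "MoveRight"): 100.0,
}

def best_action(state, actions, Q):
    """Greedy choice: the action with the highest stored Q-value (0 if unseen)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(best_action(2, ["MoveLeft", "MoveRight"], Q))  # MoveRight
```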
Q-Functions
- A Q-value is the value of a (state, action) pair under a policy: the utility of starting in state s, taking action a, and following π thereafter
  Q^π(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V^π(s') ]
The Bellman Equations
- The definition of utility leads to a simple relationship among optimal utility values:
  optimal rewards = maximize over the first action, then follow the optimal policy
- Formally:
  V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
Optimal Utilities
- Goal: calculate the optimal utility of each state
  V*(s) = expected (discounted) rewards with optimal actions
- Why: given optimal utilities, MEU tells us the optimal policy
MDP solution methods
- If we know T(s, a, s') and R(s, a, s'), we can solve the MDP to find the optimal policy in a number of ways
- Dynamic programming / iterative estimation methods:
  - Value iteration: assume 0 initial values for each state and update using the Bellman equation to pick actions (see the sketch below)
  - Policy iteration: evaluate a given policy (find V^π(s) for that policy), then change it using Bellman updates until there is no improvement in the policy
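A minimal value-iteration sketch along these lines, again using the illustrative T, R, STATES, and ACTIONS from the earlier MDP sketch (a toy, not the course's reference solver):

```python
def value_iteration(T, R, states, actions, gamma=0.9, iters=100):
    """Repeatedly apply V(s) <- max_a sum_s' T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T.get((s, a), []))
                for a in actions
            )
            for s in states
        }
    return V

print(value_iteration(T, R, STATES, ACTIONS))  # terminal state G keeps value 0
```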
Reinforcement Learning
- Reinforcement learning setting:
  - We have an MDP:
    - A set of states s ∈ S
    - A set of actions (per state) A
    - A model T(s, a, s')
    - A reward function R(s, a, s')
  - We are looking for a policy π(s)
  - But we don't know T or R
    - i.e., we don't know which states are good or what the actions do
    - We must actually try out actions and states to learn
Reinforcement Learning
- The target function is π : state → action
- However...
  - We have no training examples of the form <state, action>
  - Training examples are of the form <<state, action>, new-state, reward>
Passive Learning
- Simplified task:
  - You don't know the transitions T(s, a, s')
  - You don't know the rewards R(s, a, s')
  - You are given a policy π(s)
  - Goal: learn the state values (and maybe the model)
- In this case:
  - No choice about which actions to take
  - Just execute the policy and learn from experience
Example: Direct Estimation (Simple Monte Carlo)
[Grid-world figure: 4x3 grid with exit rewards +100 at (4,3) and -100 at (4,2); γ = 1, living reward R = -1]
- Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done)
- Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done)
- Averaging the observed returns (see the sketch below):
  U(1,1) ≈ (92 + -106) / 2 = -7
  U(3,3) ≈ (99 + 97 + -102) / 3 = 31.3
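A sketch of direct estimation as code: for every visit to a state, record the (discounted) return observed from that point on, then average; the episode encoding below is an illustrative assumption:

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Average, for every state, the return observed from each visit onward."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:                          # episode: list of (state, reward)
        for i, (s, _) in enumerate(episode):
            ret = sum((gamma ** k) * r for k, (_, r) in enumerate(episode[i:]))
            totals[s] += ret
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Episode 2 from the slide, encoded as (state, reward) pairs
ep2 = [((1, 1), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
       ((3, 3), -1), ((3, 2), -1), ((4, 2), -100)]
print(direct_estimation([ep2])[(1, 1)])  # -106.0, the return used in the U(1,1) average
```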
Full Estimation (Dynamic Programming)
  V(s_t) ← E[ r_{t+1} + γ V(s_{t+1}) ]
[Backup diagram: from s_t, the expectation is taken over every action and every possible successor state s_{t+1} under the model.]
Simple Monte Carlo
  V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]
  where R_t is the actual return following state s_t.
[Backup diagram: a single complete sampled trajectory from s_t to a terminal state.]
Combining DP and MC
  V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
[Backup diagram: a single sampled step from s_t to s_{t+1} with reward r_{t+1}.]
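A sketch of this combined (temporal-difference) update applied online, one observed transition at a time; the function name and parameter values are illustrative:

```python
def td_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(s) <- V(s) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s)]."""
    v_s = V.get(s, 0.0)
    target = r_next + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (target - v_s)
    return V

# Example: observe one transition from (3,3) to (4,3) with living reward -1
V = {}
td_update(V, (3, 3), -1, (4, 3), alpha=0.5, gamma=1.0)
print(V)
```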
Model-Free Learning
- Big idea: why bother learning T?
  - Update each time we experience a transition (s, a, s')
  - Frequent outcomes will contribute more updates (over time)
- Temporal difference learning (TD)
  - Policy still fixed!
  - Move values toward the value of whatever successor actually occurs
Q-Learning
- Learn Q*(s, a) values
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s, a)
  - Consider your new sample estimate: sample = r + γ max_a' Q(s', a')
  - Nudge the old estimate towards the new sample (sketched in code below):
    Q(s, a) ← (1 − α) Q(s, a) + α · sample
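A sketch of that nudge as code, with Q assumed to be a dictionary keyed by (state, action) as in the representation slide; the names and parameter values are illustrative:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q

Q = {}
q_learning_update(Q, "s1", "right", 100.0, "G", ["left", "right"], alpha=0.5)
print(Q)  # {('s1', 'right'): 50.0}
```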
Any problems with this?
- What if the starting policy doesn't let you explore the state space?
  - T(s, a, s') is unknown and never estimated
  - The value of unexplored states is never computed
- How do we address this problem?
  - A fundamental problem in RL and in biology
  - AI solutions include ε-greedy and Softmax
  - Evidence from neuroscience (next lecture)
Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions (ε-greedy, sketched below)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1 − ε, act according to the current policy (e.g., the best Q-value)
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
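A minimal ε-greedy action-selection sketch under these assumptions (Q again a dictionary keyed by (state, action); the ε value is illustrative):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon act randomly (explore), otherwise take the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit

print(epsilon_greedy({(2, "MoveRight"): 100.0}, 2, ["MoveLeft", "MoveRight"]))
```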
Q-Learning
Q Learning features
- On-line, incremental
- Bootstrapping (like DP, unlike MC)
- Model free
- Converges to an optimal policy
  - On average, when the learning rate α is small
  - With probability 1, when α is high at the beginning and low at the end (say, α = 1/k)
Reinforcement Learning
DEMO
- Basic idea:
  - Receive feedback in the form of rewards
  - Agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected utility
  - Change the rewards, change the behavior
- Examples:
  - Learning your way around: reward for reaching the destination
  - Playing a game: reward at the end for winning / losing
  - Vacuuming a house: reward for each piece of dirt picked up
  - Automated taxi: reward for each passenger delivered
Demo of Q Learning
- Demo: arm-control
- Parameters
  - α = learning rate
  - γ = discount factor (high values favor future rewards)
  - ε = exploration rate (should decrease with time)
- MDP
  - Reward = number of pixels moved to the right / iteration number
  - Actions: arm up and down (yellow line), hand up and down (red line)