Reinforcement Learning

RS Sutton and AG Barto
Summarized by Joon Shik Kim
Computational Models of
• The idea that we learn by interacting with our
environment is probably the first to occur to us
when we think about the nature of learning.
• When an infant plays, waves its arms, or looks
about, it has no explicit teacher, but it does have
a direct sensorimotor connection to its environment.
• Exercising this connection produces a wealth of
information about cause and effect, about the
consequences of actions, and about what to do
in order to achieve goals.
Elements of Reinforcement
Learning (1/2)
• A policy: the learning agent’s way of
behaving at a given time. A mapping from
perceived states of the environment to
actions to be taken when in those states.
• A reward function: the goal in a
reinforcement learning problem. Each
perceived state of the environment is
mapped into a single number, a reward,
indicating the intrinsic desirability of that state.
Elements of Reinforcement
Learning (2/2)
• A value function: specifies what is good
in the long run. Roughly speaking, the
value of a state is the total amount of
reward an agent can expect to
accumulate over the future, starting from
that state.
Update Rule
• If we let s denote the state before the greedy
move, and s’ the state after the move, then the
update to the estimated value of s, denoted V(s),
can be written as
$$V(s) \leftarrow V(s) + \alpha \, [V(s') - V(s)]$$
where α is a small positive fraction called the
step-size parameter, which influences the rate of learning.
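A minimal Python sketch of this update, assuming states are hashable keys and the value estimates live in a dictionary. The function name `td_update`, the state labels, and the step size 0.1 are illustrative choices, not taken from the source.

```python
def td_update(values, s, s_next, alpha=0.1):
    """Move V(s) a fraction alpha of the way toward V(s'), in place."""
    values[s] += alpha * (values[s_next] - values[s])
    return values[s]

# Hypothetical two-state example: the state after the move looks better.
V = {"s": 0.5, "s_next": 1.0}
td_update(V, "s", "s_next", alpha=0.1)
# V["s"] has moved from 0.5 to about 0.55; V["s_next"] is unchanged.
```

A larger alpha moves V(s) toward V(s') faster but makes the estimate noisier.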
Action-Value Methods (1/2)
• We denote the true (actual) value of action
a as Q*(a) and the estimated value at the
tth play as Qt(a).
• Recall that the true value of an action is the
mean reward received when that action is selected.
• If at the tth play action a has been chosen
ka times prior to t, yielding rewards r1,
r2,…,rka, then its value is estimated to be
$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$$
Action-Value Methods (2/2)
• As ka →∞, by the law of large numbers
Qt(a) converges to Q*(a).
• The simplest action selection rule is to
select the action (or one of the actions)
with highest estimated action value, that
is, to select on play t one of the greedy
actions, a*, for which Qt(a*) = max_a Qt(a).
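The sample-average estimate and the greedy selection rule can be sketched in Python as follows. The action names, the reward histories, the default estimate of 0.0 for an untried action, and the random tie-breaking are all assumptions made for illustration.

```python
import random

def sample_average(rewards):
    """Estimate Q_t(a) as the mean of the rewards received so far for one action."""
    # Assumption: an action that has never been tried gets the estimate 0.0.
    return sum(rewards) / len(rewards) if rewards else 0.0

def greedy_action(q_estimates, rng=random):
    """Return an action with the highest estimate, breaking ties at random."""
    best = max(q_estimates.values())
    candidates = [a for a, q in q_estimates.items() if q == best]
    return rng.choice(candidates)

# Hypothetical reward histories for three actions.
history = {"left": [1.0, 0.0, 1.0], "right": [0.5, 0.5], "stay": []}
Q = {a: sample_average(r) for a, r in history.items()}
a_star = greedy_action(Q)  # "left", whose estimate 2/3 is the largest
```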
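Incremental Implementation
$$
\begin{aligned}
Q_{k+1} &= \frac{1}{k+1} \sum_{i=1}^{k+1} r_i \\
&= \frac{1}{k+1} \left( r_{k+1} + \sum_{i=1}^{k} r_i \right) \\
&= \frac{1}{k+1} \left( r_{k+1} + k Q_k + Q_k - Q_k \right) \\
&= \frac{1}{k+1} \left( r_{k+1} + (k+1) Q_k - Q_k \right) \\
&= Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right]
\end{aligned}
$$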
Reinforcement Comparison
• A central intuition underlying reinforcement
learning is that actions followed by large
rewards should be made more likely to
recur, whereas actions followed by small
rewards should be made less likely to recur.
• If an action is taken and the environment
returns a reward of 5, is that large or small?
To make such a judgment one must
compare the reward with some standard or
reference level, called the reference reward.
• In order to pick among the actions,
reinforcement comparison methods
maintain a separate measure of their
preference for each action.
• Let us denote the preference for action a
on play t by Pt(a).
• The preferences might be used to
determine action-selection probabilities
according to a softmax relationship, such as
$$\pi_t(a) = \frac{e^{P_t(a)}}{\sum_{b=1}^{n} e^{P_t(b)}}$$
where $\pi_t(a)$ denotes the probability of selecting action a on the
tth play, and the sum runs over all n actions.
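A small Python sketch of the softmax mapping from preferences to selection probabilities. The function name and the example preferences are hypothetical; subtracting the maximum preference before exponentiating is a common numerical safeguard added here, and it leaves the probabilities unchanged.

```python
import math

def softmax_probabilities(preferences):
    """Map action preferences P_t(a) to selection probabilities pi_t(a)."""
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(preferences.values())
    exps = {a: math.exp(p - m) for a, p in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

# Hypothetical preferences: a3 is preferred by ln(2) over the other two.
prefs = {"a1": 0.0, "a2": 0.0, "a3": math.log(2.0)}
pi = softmax_probabilities(prefs)
# pi is approximately {"a1": 0.25, "a2": 0.25, "a3": 0.5}
```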
• After each play, the preference for the
action selected on that play, a_t, is
incremented by the difference between
the reward, r_t, and the reference reward,
$\bar{r}_t$:
$$P_{t+1}(a_t) = P_t(a_t) + \beta \, [r_t - \bar{r}_t]$$
where β is a positive step-size parameter.
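The preference update can be sketched in Python as below. The function name, the action labels, and the numeric values are illustrative assumptions. The reference reward is passed in as an argument because its own update is not described here.

```python
def update_preference(preferences, action, reward, reference_reward, beta=0.1):
    """Apply P_{t+1}(a_t) = P_t(a_t) + beta * (r_t - reference reward), in place."""
    preferences[action] += beta * (reward - reference_reward)
    return preferences[action]

# Hypothetical play: action a1 earned a reward of 5 against a reference of 3.
prefs = {"a1": 0.0, "a2": 0.0}
update_preference(prefs, "a1", reward=5.0, reference_reward=3.0, beta=0.1)
# prefs["a1"] rose to about 0.2 because the reward exceeded the reference;
# prefs["a2"] is unchanged because only the selected action is updated.
```

A reward below the reference would lower the preference instead, making that action less likely under the softmax rule.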
Reinforcement Comparison