Chap. 13 Reinforcement Learning (RL)

Machine Learning
Tom M. Mitchell
Outline




What is Reinforcement Learning?
Methods Used in Reinforcement Learning
Temporal Difference Methods
Applications
Introduction
What is reinforcement learning?
History
What can reinforcement learning do?
Elements of reinforcement learning
What is Reinforcement Learning?
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals.
In reinforcement learning, the computer is simply given a goal to achieve. The computer then learns how to achieve that goal by trial-and-error interactions with its environment.
What is Reinforcement Learning?
To provide the intuition behind reinforcement learning, consider the problem of learning to ride a bicycle. The goal given to the RL system is simply to ride the bicycle without falling over.
In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two possible actions: turn the handlebars left or turn them right.
What’s reinforcement Learning?
The RL system turns the handle bars to the left and
immediately crashes to the ground, thus receiving a
negative reinforcement.
The RL system has just learned not to turn the handle
bars left when tilted 45 degrees to the right.
In the next trial The RL system knows not to turn the
handle bars to the left, so it performs the only other
possible action: turn right when tilted 45 degrees to the
right.
What is Reinforcement Learning?
It immediately crashes to the ground, again receiving a strong negative reinforcement. At this point the RL system has learned not only that turning the handlebars right or left when tilted 45 degrees to the right is bad, but that the "state" of being tilted 45 degrees to the right is bad.
Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right, and so on.
What is Reinforcement Learning?
[Figure: the agent-environment interaction loop. The agent observes state s0, takes action a0, and receives reward r0 from the environment, then s1, a1, r1, then s2, a2, r2, and so on.]
r0 + γ r1 + γ² r2 + … , where 0 ≤ γ < 1
The discount factor γ is used to exponentially decrease the weight of reinforcements received in the future.
RL systems learn a mapping from situations to actions by trial-and-error interactions with a dynamic environment. The "goal" of the RL system is defined using the concept of a reward function, which specifies exactly which function of future reinforcements (rewards) the agent seeks to maximize.
In other words, there exists a mapping from state/action pairs to rewards; after performing an action in a given state, the RL agent receives some reward in the form of a scalar value. The RL agent learns to perform actions that maximize the sum of the rewards received when starting from some initial state and proceeding to a terminal state.
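As a minimal sketch of this idea (not taken from the chapter), the trial-and-error loop and the discounted sum of rewards the agent tries to maximize might look like the following; the `env` object with its `reset`/`step` methods and the `policy` function are hypothetical stand-ins:

```python
# Minimal sketch of the agent-environment loop and the discounted return.
# `env.reset()` and `env.step(action)` are hypothetical stand-ins for an
# environment; `policy` maps a state to an action.

def discounted_return(env, policy, gamma=0.9, max_steps=100):
    """Roll out `policy` and accumulate r0 + gamma*r1 + gamma^2*r2 + ..."""
    state = env.reset()
    total, weight = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # choose action a_t in state s_t
        state, reward, done = env.step(action)  # receive r_t, observe s_{t+1}
        total += weight * reward                # add gamma^t * r_t
        weight *= gamma                         # decay the weight for the next step
        if done:                                # stop at a terminal state
            break
    return total
```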
RL vs. Other Function Approximation
RL is similar in some respects to the function approximation problems discussed in other chapters. The target function to be learned in this case is a control policy π : S → A. A policy determines which action should be performed in each state; it is a mapping from states to actions.
The value of a state is defined as the sum of the rewards received when starting in that state and following some fixed policy to a terminal state. The optimal policy is therefore the mapping from states to actions that maximizes the sum of the rewards when starting in an arbitrary state and performing actions until a terminal state is reached.
RL vs. Other Function Approximation
This reinforcement learning problem differs from other function approximation tasks in several important respects.
• Delayed reward: In RL, a direct correspondence between states and the correct actions is not available. The trainer provides only a sequence of immediate reward values as the agent executes its actions, so the agent faces the problem of temporal credit assignment.
• Exploration: The agent influences the distribution of training examples through the action sequence it chooses. The question is which experimentation strategy produces the most effective learning (one common strategy, ε-greedy selection, is sketched below).
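One common experimentation strategy (shown here only as an illustration, not prescribed by the chapter) is ε-greedy action selection: with probability ε the agent tries a random action, otherwise it exploits its current value estimates.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one.

    Q is a dict mapping (state, action) pairs to estimated values;
    `actions` lists the actions available in `state`.
    """
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```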
The Learning Task
Ways of Formulation
Markov Decision Process (MDP)
Precise Task Definition
An Example
Ways of Formulation
There are many ways to formulate the problem of learning sequential control strategies:
• Agent's actions: may be deterministic or nondeterministic.
• The agent may or may not have the ability to predict the next state that will result from each action.
• Trainer of the agent: an expert (who shows it examples of optimal action sequences) or the agent itself (which trains itself by performing actions of its own choice).
States & Transitions
[Figure: a state-transition graph in which actions a1, …, a7 move the agent between states s0, …, s8.]
Markov Decision Process
Finite set of states S; set of actions A.
• t: discrete time step
• st: the state at time t
• at: the action at time t
At each discrete time step the agent observes the state st ∈ S and chooses an action at ∈ A. It then receives an immediate reward rt, and the state changes to st+1.
Markov assumption: st+1 = δ(st, at), rt = r(st, at)
• i.e., rt and st+1 depend only on the current state and action
• The functions δ and r may be nondeterministic
• The functions δ and r are not necessarily known to the agent
A concrete sketch of δ and r as small lookup tables is given below the figure.
[Figure: the resulting interaction sequence st, at, rt → st+1, at+1, rt+1 → st+2, at+2, rt+2 → …]
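To make the notation concrete, a deterministic MDP of this kind can be written down as two small lookup tables, one for δ and one for r. The states, actions, and rewards below are purely illustrative, not taken from the chapter:

```python
# A tiny hand-made deterministic MDP: delta gives the successor state,
# reward gives the immediate reward r(s, a).  "G" is an absorbing goal state.
delta = {
    ("s1", "right"): "s2",
    ("s1", "left"):  "s1",
    ("s2", "right"): "G",
}
reward = {
    ("s1", "right"): 0,
    ("s1", "left"):  0,
    ("s2", "right"): 100,   # entering the goal pays 100
}

def step(state, action):
    """One Markovian transition: s_{t+1} and r_t depend only on (s_t, a_t)."""
    return delta[(state, action)], reward[(state, action)]
```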
Learning Task
Execute actions in the environment, observe the results, and learn a policy π : S → A from states st ∈ S to actions at ∈ A that maximizes the accumulated reward
Vπ(st) = rt + γ rt+1 + γ² rt+2 + …
from any starting state st, where 0 ≤ γ < 1 is the discount factor for future rewards.
The target function is π : S → A, but there are no direct training examples of the form <s, a>; training examples are of the form <<s, a>, r>.
State Value Function
Consider deterministic environments, namely environments in which δ(s, a) and r(s, a) are deterministic functions of s and a.
For each policy π : S → A the agent might adopt, we define an evaluation function
Vπ(s) = rt + γ rt+1 + γ² rt+2 + … = Σ_{i=0}^{∞} γ^i rt+i     (13.1)
where rt, rt+1, … are generated by following the policy π from start state s.
Task: learn the optimal policy π* that maximizes Vπ(s):
π* = argmax_π Vπ(s), ∀s     (13.2)
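Under the deterministic assumption, equation (13.1) can be evaluated by simply following π from s and summing the discounted rewards. A sketch, using a `policy` dict that maps each non-terminal state to its chosen action and hypothetical `delta`/`reward` tables like those in the earlier MDP sketch:

```python
def V_pi(policy, s, delta, reward, gamma=0.9, horizon=100):
    """Evaluate V^pi(s) = sum_i gamma^i * r_{t+i} by rolling out `policy` from s."""
    total, weight = 0.0, 1.0
    for _ in range(horizon):
        if s not in policy:                   # terminal state: no further reward
            break
        a = policy[s]                         # the action pi(s)
        total += weight * reward[(s, a)]      # add gamma^i * r_{t+i}
        weight *= gamma
        s = delta[(s, a)]                     # deterministic transition delta(s, a)
    return total
```

With the tables from the earlier sketch and policy = {"s1": "right", "s2": "right"}, V_pi(policy, "s1", delta, reward) returns 0 + 0.9 · 100 = 90.0.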
Action Value Function
The state value function denotes the reward for starting in state s and following policy π:
Vπ(s) = rt + γ rt+1 + γ² rt+2 + … = Σ_{i=0}^{∞} γ^i rt+i
The action value function denotes the reward for starting in state s, taking action a, and following policy π afterwards:
Qπ(s, a) = r(s, a) + γ rt+1 + γ² rt+2 + … = r(s, a) + γ Vπ(δ(s, a))
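The same identity can be written directly in code; the sketch below reuses the hypothetical V_pi function and tables from the earlier sketches:

```python
def Q_pi(policy, s, a, delta, reward, gamma=0.9):
    """Q^pi(s, a) = r(s, a) + gamma * V^pi(delta(s, a))."""
    return reward[(s, a)] + gamma * V_pi(policy, delta[(s, a)], delta, reward, gamma)
```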
Optimal Value Functions
Concept of V*:
V*(s) = max_π Vπ(s)
Concept of π*: the policy π that maximizes Vπ(s) for all states s.
π* = argmax_π Vπ(s), ∀s     (13.2)
π*(s) = argmax_a { r(s, a) + γ V*(δ(s, a)) }     (13.3)
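Equation (13.3) says that once V* is known, acting optimally requires only a one-step lookahead. A sketch, where `available_actions(s)` is a hypothetical helper listing the actions applicable in s and `V_star` is a dict of state values:

```python
def greedy_action(s, V_star, available_actions, delta, reward, gamma=0.9):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]   (eq. 13.3)."""
    return max(available_actions(s),
               key=lambda a: reward[(s, a)] + gamma * V_star[delta[(s, a)]])
```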
Example
[Figure: a simple grid world with an absorbing goal state G. The panels show the r(s, a) (immediate reward) values, which are 100 for actions entering G and 0 otherwise; the V*(s) values (100, 90, 81, …); one optimal policy; and all optimal policies.]
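One way to see where values such as 100, 90, and 81 come from (with γ = 0.9 and a single reward of 100 for entering G) is to run repeated Bellman backups on a small example. The chain of states below is a hypothetical stand-in for the grid, not the chapter's exact world:

```python
from collections import defaultdict

# Hypothetical chain s3 -> s2 -> s1 -> G; reward 100 only for the action
# entering the absorbing goal G, 0 otherwise; discount gamma = 0.9.
delta  = {("s3", "right"): "s2", ("s2", "right"): "s1", ("s1", "right"): "G"}
reward = {("s3", "right"): 0, ("s2", "right"): 0, ("s1", "right"): 100}
gamma  = 0.9

actions = defaultdict(list)                  # actions available in each state
for (s, a) in delta:
    actions[s].append(a)

V = {s: 0.0 for s in ["s1", "s2", "s3", "G"]}
for _ in range(50):                          # Bellman backups until values settle
    for s in actions:
        V[s] = max(reward[(s, a)] + gamma * V[delta[(s, a)]] for a in actions[s])

print(V)   # {'s1': 100.0, 's2': 90.0, 's3': 81.0, 'G': 0.0}
```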
What to Learn
Q-Function
Training Rule to Learn APPROXIMATE Q
A Simple Deterministic World
Q-Learning for Deterministic Worlds
Example
Explore or Exploit?
Homework: 13.1, 13.2
Example
[Figure: one Q-learning backup in the grid world. Taking action aright moves the agent from state s1 to state s2, where the current Q estimates for the available actions are 66, 81, and 100.]
Q(s1, aright) ← r + γ max_a' Q(s2, a')
            ← 0 + 0.9 × max{66, 81, 100}
            ← 90
[Figure: grid-world panels showing the initial Q(s, a) values and the resulting Q(s, a) values, with entries such as 100, 90, 81, and 72.]
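Putting the pieces together, a compact tabular Q-learning loop for a deterministic world might look like the sketch below. The episodic structure, the step cap, and the ε-greedy exploration are illustrative assumptions rather than the chapter's exact algorithm:

```python
import random
from collections import defaultdict

def q_learning(delta, reward, start, goal, gamma=0.9, episodes=500, epsilon=0.2):
    """Tabular Q-learning for a deterministic world.

    delta[(s, a)] gives the successor state and reward[(s, a)] the immediate
    reward; the update rule is Q(s, a) <- r + gamma * max_a' Q(s', a').
    """
    actions = defaultdict(list)                 # actions applicable in each state
    for (s, a) in delta:
        actions[s].append(a)

    Q = defaultdict(float)                      # all estimates start at 0
    for _ in range(episodes):
        s, steps = start, 0
        while s != goal and steps < 1000:       # one episode, with a safety cap
            # epsilon-greedy: explore occasionally, otherwise act greedily
            if random.random() < epsilon:
                a = random.choice(actions[s])
            else:
                a = max(actions[s], key=lambda x: Q[(s, x)])
            s_next, r = delta[(s, a)], reward[(s, a)]
            # deterministic Q-learning backup
            Q[(s, a)] = r + gamma * max((Q[(s_next, a2)] for a2 in actions[s_next]),
                                        default=0.0)
            s, steps = s_next, steps + 1
    return Q
```

Run on a small deterministic world such as the chain sketched earlier, the learned Q values settle into the same 100 / 90 / 81 pattern seen in the figures above.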
Thank you