Today’s Topics

Some Additional ML Formulations
• Active Learning
• Transfer Learning
• Structured-Output Learning
• Reinforcement Learning (RL)
• Q learning
• Exploration vs Exploitation
• Generalizing Across State
• Used in Clinical Trials
• Advice Taking & Refinement (later lecture)
CS 540 - Fall 2015 (Shavlik©), Lecture 29, Weeks 13 & 14 (12/1 & 12/8/15)
Active Learning
[Image: reconstruction of the original Mechanical Turk (a human hid inside the table)]
The ML algorithm gets to ask a human to label some examples (eg, via Amazon Mechanical Turk)
- don't want to burden the human by asking for too many labels
- the algorithm should pick the most useful examples (borderline cases?) for labeling
Sample Algorithm (a code sketch follows below)
1. train an ensemble on the labeled data
2. find unlabeled examples where members of the ensemble disagree the most
3. ask the human to label some of these
4. go to 1
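Below is a minimal query-by-committee sketch of this loop, assuming scikit-learn and NumPy are available; the data names (X_labeled, y_labeled, X_unlabeled), the committee size, and the query budget are illustrative placeholders rather than anything from the lecture.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def most_disagreed_indices(X_labeled, y_labeled, X_unlabeled,
                           n_members=5, n_queries=10, seed=0):
    """Return indices of the unlabeled examples the committee disagrees on most."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_members):
        # step 1: train an ensemble member on a bootstrap sample of the labeled data
        idx = rng.integers(0, len(X_labeled), size=len(X_labeled))
        member = clone(LogisticRegression(max_iter=1000))
        member.fit(X_labeled[idx], y_labeled[idx])
        votes.append(member.predict(X_unlabeled))
    votes = np.stack(votes).astype(int)          # assumes integer class labels
    # step 2: disagreement = fraction of members NOT voting with the majority
    majority = np.array([np.bincount(col).max() for col in votes.T])
    disagreement = 1.0 - majority / n_members
    # step 3: these are the examples to hand to the human labeler (then go to 1)
    return np.argsort(-disagreement)[:n_queries]
```

The returned indices would be labeled by the human, moved into the labeled pool, and the loop repeated.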
Transfer Learning
Active topic for DEEP ANNs: reuse lower layers in new image-processing tasks
(survey by Torrey & Shavlik on my pubs page)
• Agent learns Task A (the source)
• Agent encounters related Task B (the target)
• Agent is told / discovers how the tasks are related
• Agent uses knowledge from Task A to learn Task B faster
(a code sketch of the lower-layer-reuse idea follows below)
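As one concrete illustration of that lower-layer reuse, here is a minimal sketch assuming PyTorch; the function name, the assumption that the source network is an nn.Sequential whose last layer is nn.Linear, and the freeze-everything-but-the-top strategy are illustrative choices, not the method from the survey.

```python
import torch.nn as nn

def make_target_network(source_net: nn.Sequential, n_target_classes: int) -> nn.Sequential:
    """Reuse Task A's lower layers as a frozen feature extractor for Task B."""
    lower_layers = list(source_net.children())[:-1]   # drop Task A's output layer
    for layer in lower_layers:
        for p in layer.parameters():
            p.requires_grad = False                   # freeze transferred knowledge
    n_features = source_net[-1].in_features           # width feeding the old output layer
    # only this new output layer is trained on Task B
    return nn.Sequential(*lower_layers, nn.Linear(n_features, n_target_classes))
```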
Potential Benefits of Transfer Learning
[Figure: learning curves of performance on the target task vs. amount of training. With transfer, the learner can have a higher start, a steeper slope, and a higher asymptote than without transfer.]
Close-Transfer Scenarios
• 2-on-1 BreakAway
• 3-on-2 BreakAway
• 4-on-3 BreakAway
Goal in each: score within a few seconds  [Maclin et al., AAAI 2005]
Distant-Transfer Scenarios
• 3-on-2 KeepAway: keep the ball from opponents [Stone & Sutton, ICML 2001]
• 3-on-2 MoveDownfield: advance the ball [Torrey et al., ECML 2006]
• 3-on-2 BreakAway: score within a few seconds
Some Results:
Transfer to 3-on-2 BreakAway
[Figure: probability of scoring a goal (y-axis, 0 to 0.6) vs. number of training games (x-axis, 0 to 3000) for Standard RL and for skill transfer from 2-on-1 BreakAway, 3-on-2 MoveDownfield, and 3-on-2 KeepAway.]
Torrey, Shavlik, Walker & Maclin: ECML 2006, ICML Workshop 2006
Structured-Output Learning
• So far we have only learned 'scalar' outputs
• Some tasks require learning outputs with more structure
  - learning parsers for English (or other 'natural' languages)
  - learning to segment an image into objects
  - learning to predict the 3D shape of a protein given its sequence
• We won't cover this topic in CS 540, though
Expecting Less of the Teacher
• Imagine teaching robots to play soccer
• We could tell them what to do at every step
  - this would be supervised ML
• What if we occasionally gave them numeric feedback instead?
  - if they scored a goal, get +1
  - if they intercepted a pass, get +0.5
  - if they ran out of bounds, get -0.5
  - if they scored a goal for the other team, get -10
  - ...
Reinforcement Learning vs. Supervised Learning
RL requires much less of the teacher
– The teacher must set up the 'reward structure'
– The learner 'works out the details', ie, in effect writes a program to maximize the rewards received
Sequential Decision Problems
Courtesy of Andy Barto (pictured)
• Decisions are made in stages
• The outcome of each decision is not fully predictable, but can be observed before the next decision is made
• The objective is to maximize a numerical measure of total reward (or, equivalently, to minimize a measure of total cost)
• Decisions cannot be viewed in isolation: need to balance the desire for immediate reward against the possibility of high future reward
Reinforcement Learning
Task of an agent embedded
in an environment
Repeat forever:
1) sense the world
2) reason
3) choose an action to perform
4) get feedback (usually reward = 0)
5) learn

Notice: RL is active, online learning (queries are 'asked' of the environment, not of a human teacher)
RL Systems: Formalization
S_E = the set of states of the world
  eg, an N-dimensional vector of 'sensor' readings (plus memory of past sensations)
A_E = the set of possible actions an agent can perform ('effectors')
W = the world
R = the immediate reward structure
W and R make up the environment and can be stochastic functions
(usually most states have R = 0; ie, rewards are sparse)
Embedded Learning Systems:
Formalization (cont.)
W: S_E × A_E → S_E   [the arrow means 'maps to']
  The world maps a state and an action to a new state
R: S_E → reals
  Provides a reward (a number; often 0) as a function of the current state
Note: we can instead use R: S_E × A_E → reals
  (ie, rewards depend on how we ENTER a state)
A Graphical View of RL
• Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions
• For CS 540, we'll assume deterministic problems
[Diagram: the agent and the real world, W, form a loop. The world sends the agent sensory info and a scalar reward R (an 'indirect teacher'); the agent sends the world an action.]
Common Confusion
State need not be solely the current sensor readings
– The Markov Assumption is commonly used in RL: the value of a state is independent of the path taken to reach that state
– But we can store memory of the past in the current state: one can always create a Markovian task by remembering the entire past history
Need for Memory: Simple Example
‘Out of sight, but not out of mind’
[Diagram: at Time = 1 the learning agent can see an opponent near a wall; at Time = 2 only the wall is visible (the opponent is hidden behind it). It seems reasonable to remember that an opponent was recently seen.]
State vs. Current Sensor Readings
Remember: state is what is in one's head (past memories, etc), not ONLY what one currently sees/hears/smells/etc
Policies
The agent needs to learn a policy
  π_E : Ŝ_E → A_E
The policy function π_E answers: given a world state in Ŝ_E, which action in A_E should be chosen? Ŝ_E is our learner's APPROXIMATION to the true S_E.
Remember: the agent's task is to maximize the total reward received during its lifetime.
True World States vs. the Learner’s
Representation of the World State
• From here forward, S will be our learner's approximation of the true world state
• Exceptions:
  W: S × A → S
  R: S → reals
  These are our notations for how the true world behaves when we act upon it. You can think of W and R as taking the learner's representation of the world state as an argument and internally converting it to the 'true' world state(s).
Recall: we can instead use R: S_E × A_E → reals (ie, rewards might instead depend on how we ENTER a state)
Policies (cont.)
To construct π_E, we will assign a utility U (a number) to each state:

  U^{π_E}(s) = Σ_{t=1}^{∞} γ^{t-1} R(s, π_E, t)

• γ is a positive constant ≤ 1
• R(s, π_E, t) is the reward received at time t, assuming the agent follows policy π_E and starts in state s at t = 0
• Note: future rewards are discounted by γ^{t-1}
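As a quick numeric illustration of this discounted sum, here is a tiny plain-Python sketch; the reward sequence and γ value are made-up examples, not from the lecture.

```python
def discounted_return(rewards, gamma):
    """U = sum over t >= 1 of gamma**(t-1) * R_t, for a finite reward sequence."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# sparse rewards: nothing for three steps, then +1 on the fourth
print(discounted_return([0, 0, 0, 1], gamma=0.9))   # 0.9**3 = 0.729
```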
Why have a Decay on Rewards?
• Getting 'money' in the future is worth less than getting money right now
  – inflation
  – more time to enjoy what it buys
  – risk of death before collecting
• Allows convergence proofs of the functions we're learning
The Action-Value Function
We want to choose the 'best' action in the current state. So, pick the one that leads to the best next state (and include any immediate reward).

Let
  Q^{π_E}(s, a) = R(W(s, a)) + γ U^{π_E}(W(s, a))

The first term is the immediate reward received for going to state W(s, a) [alternatively, R(s, a)]; the second term is the future reward from further actions, discounted due to the 1-step delay.
The Action-Value Function (cont.)
If we can accurately learn Q (the action-value function), choosing actions is easy.
Choose action a, where
  a = argmax_{a' ∈ actions} Q(s, a')
Note: x = argmax f(x) sets x to the value that maximizes f(x)
Q vs. U Visually
[Diagram: a small state-action graph. Each state node stores a utility, eg U(1) ... U(6); each action arc stores an action value, eg Q(1, ii). U's are 'stored' on states; Q's are 'stored' on arcs.]
Q's vs. U's
[Diagram: state S with outgoing action arcs labeled with Q values, each leading to a next state labeled with a U value]
• Assume we're in state S. Which action do we choose?
• U's (model-based)
  – need a 'next state' function to generate all possible next states (eg, chess)
  – choose the next state with the highest U value
• Q's (model-free, though one can also do model-based Q-learning)
  – need only know which actions are legal (eg, the web)
  – choose the arc with the highest Q value
Q-Learning
(Watkins PhD, 1989)
Let Q_t be our current estimate of the optimal Q.
Our current policy is
  π_t(s) = a  such that  Q_t(s, a) = max_{b ∈ known actions} Q_t(s, b)
Our current utility-function estimate is
  U_t(s) = Q_t(s, π_t(s))
- hence, the U table is embedded in the Q table and we don't need to store both
Q-Learning (cont.)
Assume we are in state S_t.
'Run the program' * for a while (N steps).
Determine the actual reward received and compare it to the predicted reward.
Adjust the prediction to reduce the error.
* ie, follow the current policy
Updating Qt
Let the N-step estimate of future rewards be

  r_t^{(N)} = ( Σ_{k=1}^{N} γ^{k-1} R_{t+k} ) + γ^N U_t(S_{t+N})

The first term is the actual (discounted) reward received during the N time steps; the second is the estimate of the future reward if we continued on to t = ∞.
Changing the Q Function
(ie, learn a better approx.)
  Q_{t+N}(S_t, a_t) = Q_t(S_t, a_t) + α [ r_t^{(N)} − Q_t(S_t, a_t) ]

Here Q_{t+N}(S_t, a_t) is the new estimate (at time t + N), Q_t(S_t, a_t) is the old estimate, α is the learning rate (for deterministic worlds, set α = 1), and the bracketed term is the error.
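A small plain-Python sketch of this N-step update; the helper names and argument layout are my own, chosen only to mirror the two formulas above.

```python
def n_step_target(rewards, u_final, gamma):
    """r_t^(N): rewards = [R_{t+1}, ..., R_{t+N}], u_final = U_t(S_{t+N})."""
    n = len(rewards)
    discounted = sum(gamma ** (k - 1) * r for k, r in enumerate(rewards, start=1))
    return discounted + gamma ** n * u_final

def q_update(q_old, target, alpha):
    """Q_{t+N}(S_t, a_t) = Q_t(S_t, a_t) + alpha * (r_t^(N) - Q_t(S_t, a_t))."""
    return q_old + alpha * (target - q_old)
```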
Pictorially (here rewards are on arcs, rather than states)
[Diagram: the actual moves made (in red) from state S1, collecting rewards r1, r2, r3 and arriving at state S_N, which has several potential next states]

  Q_est(S1, a) = r1 + γ r2 + γ² r3 + <estimate of remainder of infinite sum>
               = r1 + γ r2 + γ² r3 + γ³ U(S_N)
               = r1 + γ r2 + γ² r3 + γ³ max_{b ∈ actions} Q(S_N, b)
How Many Actions Should We Take Before Updating Q?
Why not do so after each action?
– one-step Q-learning
– the most common approach
Exploration vs. Exploitation
In order to learn about better alternatives, we can't always follow the current policy ('exploitation').
Sometimes we need to try random moves ('exploration').
Exploration vs. Exploitation (cont)
Approaches (a code sketch of both follows below)
1) p percent of the time, make a random move; could let
     p = 1 / (# moves made)
2) Prob(picking action A in state S) = const^{Q(S, A)} / Σ_{i ∈ actions} const^{Q(S, i)}
   (exponentiating gets rid of negative values)
One-Step Q-Learning Algo
0. S ← initial state
1. If random # ≤ P then a = random choice    // occasionally 'explore'
   Else a = π_t(S)                           // else 'exploit'
2. S_new ← W(S, a);  R_immed ← R(S_new)      // act on the world and get the reward
3. Error ← R_immed + γ U(S_new) − Q(S, a)    // use Q to compute U
4. Q(S, a) ← Q(S, a) + α × Error             // should also decay α over time
5. S ← S_new
6. Go to 1
A runnable sketch of this loop appears below.
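The sketch below runs the one-step algorithm on a tiny, made-up deterministic chain world; the environment, constants, and variable names are illustrative, not the lecture's example. Q is kept as a table, per the slides, and α = 1 since the world is deterministic.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]
N_STATES = 5                      # states 0..4; reaching state 4 gives reward +1

def world(s, a):                  # W: S x A -> S  (deterministic)
    return max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)

def reward(s_new):                # R: S -> reals (sparse)
    return 1.0 if s_new == N_STATES - 1 else 0.0

def q_learn(episodes=200, gamma=0.9, alpha=1.0, explore_p=0.2):
    q = defaultdict(float)
    for _ in range(episodes):
        s = 0                                             # step 0: initial state
        while s != N_STATES - 1:
            if random.random() <= explore_p:              # step 1: explore ...
                a = random.choice(ACTIONS)
            else:                                         # ... or exploit
                a = max(ACTIONS, key=lambda b: q[(s, b)])
            s_new = world(s, a)                           # step 2: act, get reward
            r = reward(s_new)
            u_new = max(q[(s_new, b)] for b in ACTIONS)   # U comes from the Q table
            error = r + gamma * u_new - q[(s, a)]         # step 3: prediction error
            q[(s, a)] += alpha * error                    # step 4: update Q
            s = s_new                                     # step 5, then go to 1
    return q

if __name__ == "__main__":
    q = q_learn()
    print({k: round(v, 2) for k, v in sorted(q.items())})
```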
Visualizing Q-Learning (1-step 'lookahead')
[Diagram: from state I, take action a and receive reward R, arriving at state J, which has available actions a, b, ..., z]
The estimate Q(I, a) should equal R + γ max_x Q(J, x)
- train the ML system to learn a consistent set of Q values
Bellman Optimality Equation
(from 1957, though for U function back then)
IF
  ∀ s, a:  Q(s, a) = R(S_N) + γ max_{a' ∈ actions} Q(S_N, a')
  where S_N = W(s, a), ie, the next state
THEN
the resulting policy, π(s) = argmax_a Q(s, a), is optimal, ie, leads to the highest discounted total reward
(also, any optimal policy satisfies the Bellman equation)
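A small sketch, in plain Python with hypothetical helper names, of checking this condition for a deterministic world given as functions world(s, a) and reward(s) and a complete Q table q[(s, a)].

```python
def bellman_residual(q, states, actions, world, reward, gamma):
    """Max over (s, a) of |Q(s,a) - (R(S_N) + gamma * max_a' Q(S_N, a'))|."""
    worst = 0.0
    for s in states:
        for a in actions:
            s_next = world(s, a)
            target = reward(s_next) + gamma * max(q[(s_next, b)] for b in actions)
            worst = max(worst, abs(q[(s, a)] - target))
    return worst   # 0 everywhere means the greedy policy w.r.t. q is optimal
```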
A Simple Example (of Q-learning, with updates after each step, ie N = 1)
[Diagram: a small deterministic state graph with states S0 (R = 0), S1 (R = 1), S2 (R = -1), S3 (R = 0), and S4 (R = 3); every arc's Q value is initialized to 0]
Let γ = 2/3
Update rule:  Q_new ← R + γ max Q_{next state}   (deterministic world, so α = 1)
A Simple Example (Step 1): move S0 → S2
  Q(S0 → S2) = R(S2) + γ max Q(S2, ·) = -1 + (2/3)(0) = -1
(all other Q values remain 0)
A Simple Example (Step 2): move S2 → S4
  Q(S2 → S4) = R(S4) + γ max Q(S4, ·) = 3 + (2/3)(0) = 3
A Simple Example (Step i): move S0 → S2 again
Assume we got to the end of the game and were 'magically' restarted in S0. At this point Q(S0 → S2) = -1 and Q(S2 → S4) = 3.
A Simple Example (Step i+1)
The repeated move S0 → S2 now uses the improved estimate of S2's future value:
  Q(S0 → S2) = R(S2) + γ max Q(S2, ·) = -1 + (2/3)(3) = 1
A Simple Example (Step ∞), ie, the Bellman optimum
What would the final Q values be if we explored + exploited for a long time, always returning to S0 after 5 actions?
(update rule:  Q_new ← R + γ max Q_{next state};  γ = 2/3)
A Simple Example (Step ∞)
Let γ = 2/3. In the limit, Q(S0 → S1) = 1, Q(S0 → S2) = 1, and Q(S2 → S4) = 3, with the remaining Q values 0, so the two paths out of S0 tie.
What would happen if γ > 2/3?  The lower path is better.
What would happen if γ < 2/3?  The upper path is better.
This shows the need for EXPLORATION, since the first action ever taken out of S0 may or may not be the optimal one.
(update rule:  Q_new ← R + γ max Q_{next state})
An “On Your Own” RL HW
(Solution in Class Next Tues)
Consider the deterministic reinforcement environment drawn below. Let γ=0.5. Immediate
rewards are indicated inside nodes. Once the agent reaches the ‘end’ state the current
episode ends and the agent is magically transported to the ‘start’ state.
[Diagram: states Start (r = 0), A (r = 2), B (r = 5), C (r = 3), and End (r = 5), connected by directed arcs; every arc's initial Q-table entry is 4]
(a) A one-step, Q-table learner follows the path Start → B → C → End. On the graph below, show the Q values that have changed, and show your work. Assume that for all legal actions (ie, for all the arcs on the graph), the initial values in the Q table are 4, as shown above (feel free to copy the above 4's below, but somehow highlight the changed values).
[Blank copy of the graph for recording your updated Q values]
An “On Your Own” RL HW
(Solution in Class Next Tues)
(b) Starting with the Q table you produced in Part (a), again follow the path Start → B → C → End and show the Q values below that have changed from Part (a). Show your work.
[Blank copy of the graph for recording your updated Q values]
(c) What would the final Q values be in the limit of trying all possible arcs ‘infinitely’
often? Ie, what is the Bellman-optimal Q table? Explain your answer.
[Blank copy of the graph for recording your answer]
(d) What is the optimal path between Start and End? Explain.
Estimating Value’s ‘In Place’
(see Sec 2.6 +2.7 of Sutton+Barto RL textbook)
Let ri be our i th estimate of some Q
Note: ri is not the immediate reward, Ri
ri = Ri +  U(next statei)
Assume we have k +1 such measurements
Estimating Value’s (cont)
Estimate
based on
k + 1 trails
1 k 1
Qk 1 
ri

k  1 i 1
k
1





 rk 1   ri 
 k  1 
i 1 
 1 

rk 1  k  Qk 
 k 1
(cont.)
12/1&8/15
CS 540 - Fall 2015 (Shavlik©), Lecture 29, Weeks 13 & 14
Ave of the k + 1
measurements
Pull out last term
Stick in
definition of Qk
48
‘In Place’ Estimates (cont.)
 1 

rk 1  k  1Qk  Qk 
 k  1
 1 
 Qk  
rk 1  Qk 
 k  1

latest
estimate
current
‘running’
average
Add and
subtract Qk
Notice that 
needs to decay
over time
Repeating
Qk 1  Qk   rk 1  Qk 
Q-Learning:
The Need to ‘Generalize Across State’
Remember, conceptually we are filling in a huge table
[Diagram: a huge table with one column per state S0, S1, S2, ..., Sn and one row per action a, b, c, ..., z; each cell holds one Q value, eg Q(S2, c)]
Tables are a very verbose representation of a function.
Representing Q Functions
More Compactly
We can use some other function representation (eg, a neural net) to compactly encode this big table.
[Diagram: a neural net whose input units each encode a property of the state (eg, a sensor value) and whose outputs are Q(S, a), Q(S, b), ..., Q(S, z); the second argument of Q is held constant for each output. Alternatively, one could have a separate net for each possible action.]
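A minimal NumPy sketch of such a net, mapping an encoding of the state to one Q output per action; the layer sizes, initialization, and activation are illustrative choices, not anything prescribed by the lecture.

```python
import numpy as np

class QNet:
    def __init__(self, n_state_features, n_actions, n_hidden=100, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (n_state_features, n_hidden))
        self.w2 = rng.normal(0, 0.1, (n_hidden, n_actions))

    def q_values(self, state_features):
        """Forward prop: returns the vector [Q(S,a), Q(S,b), ..., Q(S,z)]."""
        hidden = np.tanh(state_features @ self.w1)   # hidden units
        return hidden @ self.w2                      # one linear output per action

# eg, 100 Boolean features and 10 actions, as on the next slide
net = QNet(n_state_features=100, n_actions=10)
print(net.q_values(np.zeros(100)).shape)             # (10,)
```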
CS 540 - Fall 2015 (Shavlik©), Lecture 29, Weeks 13 & 14
51
Q Tables vs Q Nets
[Diagram: a Q net with outputs Q(S, 0), Q(S, 1), ..., Q(S, 9)]
Given: 100 Boolean-valued features and 10 possible actions
• size of the Q table: 10 × 2^100 entries
• size of the Q net (100 HUs): 100 × 100 + 100 × 10 = 11,000 weights
  (weights between inputs and HUs, plus weights between HUs and outputs)
Similar idea as full joint probability tables vs. Bayes nets (called 'factored' representations)
Why Use a Compact
Q-Function?
1. The full Q table may not fit in memory for realistic problems
2. Can generalize across states, thereby speeding up convergence
   (ie, one example 'fills' many cells in the Q table)
Notes
1. When generalizing across states, cannot use α = 1
2. Convergence proofs only apply to Q tables
Three Forward Props
and a BackProp
[Diagram: the same Q net is run three times, twice with S0 as input and once with S1]
1. Forward prop on S0 to get Q(S0, A), ..., Q(S0, Z) and choose an action (say A); execute the chosen action in the world, then 'read' the new sensors and the reward
2. Forward prop on S1 to estimate U(S1) = max Q(S1, X) over X ∈ actions
3. Forward prop on S0 again to calculate the 'teacher's' output: replace Q(S0, A) with the new estimate, and assume Q is 'correct' for the other actions
4. Backprop to reduce the error at Q(S0, A)
Aside: could save some forward props by caching information
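Here is a sketch, assuming PyTorch, of those three forward props and the single backprop; qnet (a network returning one Q value per action), optimizer, the state encodings, and gamma are placeholders I am assuming for illustration.

```python
import torch

def q_train_step(qnet, optimizer, s0, action_idx, reward, s1, gamma=0.9):
    # forward prop 1: Q values in S0 (the action was already chosen from these)
    q_s0 = qnet(s0)
    with torch.no_grad():
        # forward prop 2: estimate U(S1) = max_x Q(S1, x); no gradient through the target
        u_s1 = qnet(s1).max()
        # forward prop 3: 'teacher' output = current Q(S0, .), with only the
        # taken action's entry replaced by the new estimate
        target = qnet(s0).clone()
        target[action_idx] = reward + gamma * u_s1
    loss = torch.nn.functional.mse_loss(q_s0, target)   # error only at Q(S0, A)
    optimizer.zero_grad()
    loss.backward()                                      # the one backprop
    optimizer.step()
    return loss.item()
```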
The Agent World
(Rough sketch, implemented in Java [by me], linked to class home page)
[Diagram: a 2D world containing pushable ice cubes, opponents, food items, and the RL agent]
Some (Ancient) Agent World Results
[Figure: mean (discounted) score on the test-set suite (y-axis, roughly -60 to 0) vs. training-set steps in thousands (x-axis, 0 to 2000; about 2 weeks of training on a CPU roughly 1000x slower than today's). Curves compare Q-nets with 5, 15, 25, and 50 hidden units against a Q-table learner, perceptrons trained by supervised learning on 600 examples, and a hand-coded policy.]
Q-Learning Convergence
• Only applies to Q tables and deterministic, Markovian worlds
• Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then
    ∀ s, a:  lim_{t→∞} Q̂_t(s, a) = Q_actual(s, a)
  where Q̂ is the approximate Q table and Q_actual is the true Q table
Developing a World Model
• The world involves two functions
  – 'next state' function:  W: S × A → S
  – reward function:  R: S → reals
  How could we learn these two functions? Eg, think about chess.
• Even if we knew these functions, we would still need to compute the Q table/function to have a policy
Learning World Models
Can use any supervised learning technique to learn these functions:
• R: S → reals. A supervised learner maps a representation of state S to the observed reward.
• W: S × A → S. A supervised learner maps representations of state S and action A to a representation of the observed next state.
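As a sketch of this, assuming scikit-learn and a log of observed transitions, one could fit the two models as below; the linear models, the one-hot action encoding, and the variable names are illustrative assumptions, not the lecture's choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_world_models(states, actions_onehot, next_states, rewards):
    """states: (n, d); actions_onehot: (n, k); next_states: (n, d); rewards: (n,)."""
    # R: S -> reals
    reward_model = LinearRegression().fit(states, rewards)
    # W: S x A -> S  (concatenate state and action representations as the input)
    sa = np.hstack([states, actions_onehot])
    next_state_model = LinearRegression().fit(sa, next_states)
    return next_state_model, reward_model
```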
Using a World Model
(‘The DYNA architecture’ by Sutton in From Animals to Animats, MIT Press, 1991)
If we have a good world model, we can
mentally simulate exploration to produce
Q-learning data
– Faster than running in the real world
– May need to periodically update world model
with the results of real-world experiments
(trade-off between exploration and calculation)
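A minimal sketch of the DYNA idea in plain Python: after each real step, replay a few transitions from a remembered model of W and R. The dict-based model, the defaultdict(float) Q table, and the constants are assumptions for illustration, not Sutton's exact algorithm.

```python
import random

def dyna_q_step(q, model, s, a, s_new, r, actions, gamma=0.9, alpha=0.5, n_sim=10):
    """q: defaultdict(float) keyed by (state, action); model: dict of seen transitions."""
    # real-world update, as in ordinary one-step Q-learning
    q[(s, a)] += alpha * (r + gamma * max(q[(s_new, b)] for b in actions) - q[(s, a)])
    model[(s, a)] = (s_new, r)                    # remember what the world did
    # mentally simulate n_sim previously seen (state, action) pairs
    for _ in range(n_sim):
        (ss, aa), (ss_new, rr) = random.choice(list(model.items()))
        q[(ss, aa)] += alpha * (rr + gamma * max(q[(ss_new, b)] for b in actions)
                                - q[(ss, aa)])
```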
Advantages of
Using World Models
• Interleave real and simulated actions
• Can get extra training examples
quickly and cheaply
• Provides one way to incorporate
prior knowledge (eg, simulators)
• Allows planning of what to explore in
the real world (ie, mental simulation)
Some Other Ways to Reduce Number
of ‘Real World’ Examples Needed in RL
• Replay old examples periodically (Lin)
• Have a teacher occasionally say
which action to do (Clouse & Utgoff)
• Give ‘verbal’ advice to the learner
(Maclin & Shavlik)
• Transfer learning (discussed earlier)
Recap: Supervised Learners
Helping the RL Learner
• Note that Q learning automatically
creates I/O pairs for a supervised ML
algo when ‘generalizing across state’
• Can also learn a model of the world (W)
and the reward function (R)
– Simulations via learned models reduce
need for ‘acting in the physical world’
Shaping
• Allow the teacher to change the
reward function over time
– Eg, consider some team sport
– Most basic reward is
win = +1, tie = 0, lose = -1
– But during training we might initially give rewards for catching passes, scoring points, blocking shots, etc
– Over time the reward function might become less detailed
(maybe because shaping leads to non-optimality)
• Some similarities to transfer learning
Challenges in RL
• Q tables too big, so use function approximation
– can ‘generalize across state’ (eg, via ANNs)
– convergence proofs no longer apply, though
• Hidden state (‘perceptual aliasing’)
– two different states might look the same
(eg, due to ‘local sensors’)
– can use theory of ‘Partially Observable
Markov Decision Problems’ (POMDP’s)
• Multi-agent learning (world no longer stationary)
Could use GAs for RL Task
• Another approach is to use GAs
to evolve good policies
– Create N 'agents'
– Measure each one's rewards over some time period
– Discard the worst, cross over the best, do some mutation
– Repeat 'forever' (a model of biology)
• Both ‘predator’ and ‘prey’ evolve/learn,
ie co-evolution
Summary of Non-GA
Reinforcement Learning
Positives
– Requires much less 'teacher feedback'
– Appealing approach to learning to predict and control (eg, robotics, softbots)
  [Demo of Google's Q-learning]
– Solid mathematical foundations
  • dynamic programming
  • Markov decision processes
  • convergence proofs (in the limit)
– Core of a solution to the general AI problem?
Summary of Non-GA
Reinforcement Learning (cont.)
Negatives
– Need to deal with huge state-action spaces (so convergence is very slow)
– Hard to design the R function?
– Learns a specific environment rather than general concepts, and depends on the state representation?
– Dealing with multiple learning agents?
– Hard to learn at multiple 'grain sizes' (hierarchical RL)