Animal learning theory

Based on [Sutton and Barto, 1990, Dayan and Abbott, 2001]
Bert Kappen
[Sutton and Barto, 1990]
Classical conditioning:
- A conditioned stimulus (CS) and unconditioned stimulus (US) are paired in close succession
- The US produces a response UR
- After sufficient pairing, the CS produces a response CR similar to UR
Example: rabbit eye-blink conditioning
- The CS is the sound of a buzzer, the US an air puff to the rabbit's eye, and the UR an eye blink
- After CS-US pairing, the buzzer alone yields a CR eye blink
[Sutton and Barto, 1990]
Classical theory proposes that a causal relation between CS and US is learned.
∆V = (level of US processing) × (level of CS processing) = Reinforcement × Eligibility
V is association strength.
- US processing is about reinforcement (positive or negative)
- CS processing is about attention (which stimulus is credited)
We focus on reinforcement processing with very simple eligibility models.
[Sutton and Barto, 1990]
Trial level vs. Real-time theories
Trial level theories
- ignore temporal aspects within an individual trial; updates apply at the end of the trial
- Rescorla-Wagner model
- mostly used in experiments
Real-time theories
- consider detailed timing aspects of the stimuli and rewards
- TD learning
[Sutton and Barto, 1990]
Rescorla-Wagner (RW) model
Central idea is that learning occurs whenever events violate expectations.
In particular, when expected US V̄ differs from the actual US level λ:
Vi := Vi + ∆Vi
∆Vi = β(λ − V̄) αi Xi
V̄ = Σi Vi Xi
Xi = 0, 1 indicates the absence/presence of CSi. αi is the (fixed) saliency (attention) of CSi. The term αi Xi is often omitted for a single stimulus.
Vi is the association strength of stimulus CSi to the US.
λ denotes the 'level of US', "combining, in an unspecified way, the US's intensity, duration, and temporal relationship with the CS".
[Sutton and Barto, 1990]
Rescorla-Wagner model
The Rescorla-Wagner rule for multiple inputs can predict various phenomena:
- Blocking: learned s1 → r prevents learning of association s2 → r
- Inhibition: s2 reduces prediction when combined with any predicting stimulus
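A minimal sketch of the RW update applied to a compound stimulus, reproducing blocking; the learning rate β = 0.1 and the trial counts are arbitrary choices, not values from the slides:

```python
import numpy as np

def rw_update(V, X, lam, beta=0.1, alpha=None):
    """One Rescorla-Wagner trial: Vi += beta * (lam - Vbar) * alpha_i * X_i."""
    alpha = np.ones_like(V) if alpha is None else alpha
    Vbar = np.dot(V, X)                     # compound prediction Vbar = sum_i Vi Xi
    return V + beta * (lam - Vbar) * alpha * X

V = np.zeros(2)                             # association strengths of CS1, CS2

# Phase 1: CS1 alone is paired with the US (lam = 1); V1 approaches 1
for _ in range(200):
    V = rw_update(V, X=np.array([1.0, 0.0]), lam=1.0)

# Phase 2: the compound CS1+CS2 is paired with the same US;
# the prediction error is already near zero, so V2 stays near zero (blocking)
for _ in range(200):
    V = rw_update(V, X=np.array([1.0, 1.0]), lam=1.0)

print(V)                                    # approximately [1.0, 0.0]
```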
[Sutton and Barto, 1990]
Rescorla-Wagner model
Fails to explain second-order conditioning: B → A → US
- Empirical finding: if A predicts US and B predicts A, then B predicts US.
- A → US relation correctly reinforced by RW
- B → A relation not reinforced by RW because US λ = 0 in B → A trials
The solution is to assume that A also generates a reinforcement signal λ
[Sutton and Barto, 1990]
Temporal aspects play an important role.
[Sutton and Barto, 1990]
Real-Time Theories of Eligibility
Conditioning depends on time interval between CS and US.
Mechanisms for coincidence of CS and US:
- Stimulus trace (Hull, 1939) is an internal representation of the stimulus. The disadvantage of using the stimulus trace for learning is that long ISIs require a broad stimulus trace, which disagrees with fast, precisely timed responses.
- Eligibility trace X̄ (Klopf, 1972) is similar to the stimulus trace, but is used only for learning
[Sutton and Barto, 1990]
From RW to a real time theory
Left: trial-based reinforcement λ is the area under the curve. Right: real-time reinforcement is the future area under the curve.
Problem:
- The prediction is constant for all times prior to the US. This disagrees with empirical observation.
The solution is imminence weighting, which reduces the association at long delays.
[Sutton and Barto, 1990]
Temporal difference learning
At time t the subject predicts the imminence-weighted area (c) rather than the unweighted area (a), which may be infinite.
[Sutton and Barto, 1990]
Temporal difference learning
When rewards λt occur at discrete times t = 1, 2, 3, . . .:
V̄t = Σ_{s=t}^{∞} γ^{s−t} λs+1 = λt+1 + γλt+2 + γ²λt+3 + . . .
   = λt+1 + γ (λt+2 + γλt+3 + . . .)
   = λt+1 + γ Σ_{s=t+1}^{∞} γ^{s−t−1} λs+1
   = λt+1 + γV̄t+1
The error λ − V̄ in the RW rule becomes the error λt+1 + γV̄t+1 − V̄t:
∆Vi = β(λt+1 + γV̄t+1 − V̄t) αi Xi
with V̄t = Σi Vi Xi(t).
[Sutton and Barto, 1990]
Temporal difference learning
Complete-serial-compound (CSC) experiment:
- A CS stimulus is present at each time i = 1, . . . , T, with αi = 1
- In each learning step we take stimuli Xi(t) = δi,t and compute V̄t = Σi Vi Xi(t) = Vt for t = 1, . . . , T
- Compute
∆Vi = β(λt+1 + γV̄t+1 − V̄t) αi δi,t
so that
∆V̄t = β(λt+1 + γV̄t+1 − V̄t)
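A minimal sketch of this CSC learning loop (tabular, since V̄t = Vt); the trial length, the learning rate β, and the single reward at the end of the trial are assumptions:

```python
import numpy as np

T = 10                        # trial length: one CS component per time step
gamma, beta = 1.0, 0.1        # assumed discount and learning rate
lam = np.zeros(T + 1)
lam[T] = 1.0                  # US level lambda delivered at the end of the trial

V = np.zeros(T + 1)           # V[t] = Vbar_t, since Xi(t) = delta_{i,t}

for trial in range(500):
    for t in range(T):
        delta = lam[t + 1] + gamma * V[t + 1] - V[t]   # TD error
        V[t] += beta * delta                           # Delta V_t = beta * delta

print(V[:T])                  # each V_t approaches the future reward (here 1, since gamma = 1)
```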
[Sutton and Barto, 1990]
Difference TD and standard RL
Note the difference between the TD equation
∆Vi = β(λt+1 + γV̄t+1 − V̄t) αi Xi
and standard RL:
∆V̄t = β(λt+1 + γV̄t+1 − V̄t )
Differences:
- The normal RL interpretation is that V̄t is the value of state t, found self-consistently through the Bellman equation
- Here instead Vi is updated and V̄t = Σi Vi Xi(t).
The relation is established by interpreting i also as a time label and defining Xi(t) = δi,t.
[Sutton and Barto, 1990]
Inter-Stimulus-Interval Dependency
The rabbit's CR (eye blink) to the CS (sound) as a function of ISI length shows good agreement with the TD model.
Dayan Abbott 9.2
Learning to predict a reward
DA Eq. 9.6:
v(t) = Στ w(τ) u(t − τ)
relates to SB rule with v(t) = V̄(t), w(τ) = Vi, u(t − τ) = Xi(t).
DA Eq. 9.10:
∆w(τ) = (r(t) + v(t + 1) − v(t)) u(t − τ)
is SB rule with r(t) = λ(t + 1), γ = 1 and stimulus u(t − τ) = Xi(t).
Dayan Abbott 9.2
Fig. 9.2
The stimulus is u(t) = δt,100, thus Eq. 9.6 gives v(t) = Στ w(τ) u(t − τ) = w(t − 100).
Delta rule Eq. 9.10, with δ(t) = r(t) + v(t + 1) − v(t):
∆w(τ) = δ(t) u(t − τ) = δ(t) δt−τ,100
so ∆v(t) = ∆w(t − 100) = δ(t), and ∆w(τ) = 0 if τ ≠ t − 100.
The reward is r(t) = δt,200.
Dayan Abbott 9.2
First iteration: w(t − 100) = v(t) = 0, so δ(t) = δt,200.
Second iteration: v(t) = δt,200, so δ(t) = δt,200 + δt,199 − δt,200 = δt,199.
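A sketch that plays out this scenario numerically, with the stimulus at t = 100, the reward at t = 200, and the learning rate taken as 1 as in the iterations above; over trials the nonzero prediction error δ moves back from the time of the reward toward the time of the stimulus:

```python
import numpy as np

T = 250
u = np.zeros(T); u[100] = 1.0        # stimulus u(t) = delta_{t,100}
r = np.zeros(T); r[200] = 1.0        # reward r(t) = delta_{t,200}
w = np.zeros(T)                      # delay-line weights w(tau)

for trial in range(5):
    # Eq. 9.6: v(t) = sum_tau w(tau) u(t - tau) = w(t - 100) for t >= 100
    v = np.array([w[t - 100] if t >= 100 else 0.0 for t in range(T)])
    v_next = np.append(v[1:], 0.0)
    delta = r + v_next - v           # delta(t) = r(t) + v(t+1) - v(t)
    # Eq. 9.10: Delta w(tau) = delta(t) u(t - tau), i.e. only tau = t - 100 changes
    for t in range(100, T):
        w[t - 100] += delta[t]
    print(trial, np.flatnonzero(delta))   # first trial: [200]; second: [199]; ...
```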
Dayan Abbott 9.2
Dopamine
The activity of dopamine neurons in the ventral tegmental area (VTA) of the monkey encodes δ in Fig. 9.2.
The monkey presses a button after a stimulus (sound) to receive a reward (fruit juice).
- A: Left panels, trials locked to the stimulus; right panels, trials locked to the reward. The top row shows early trials, the bottom row late trials. VTA cells respond to the reward in early trials and to the stimulus in late trials.
- B top: stimulus-locked trials after learning, as in the bottom row of A.
- B bottom: withholding the (expected) reward yields inhibition.
Dayan Abbott 9.3
Instrumental conditioning/Static action choice
Conditioning:
• Classical: reward (punishment) independent of the action taken (Pavlov's dog)
• Instrumental: reward depends on action taken
Reward timing:
• immediate: reward follows directly after action (static action choice)
• delayed: reward follows after sequence of actions
Foraging of bees as example of learning behavior with immediate reward:
• bees visit flowers of two colors (blue, yellow), preferring the color with the higher reward (nectar)
• when rewards are swapped, bees adjust their preference
Dayan Abbott 9.3
The quantity of nectar rb,y is stochastic, drawn from a (fixed) distribution qb,y(r). This is like a two-armed bandit problem.
The policy of the bees is given by probabilities pb,y with pb + py = 1. We can parametrize this as
pb = eβmb / (eβmb + eβmy)
py = eβmy / (eβmb + eβmy)
The parameters mb,y are called action values; they do not require normalization.
The parameter β controls exploration-exploitation:
• β = 0: exploration
• β = ∞: exploitation
Two learning strategies:
• learn action value as past expected reward (indirect actor)
• learn action value to maximize future expected reward (direct actor)
Dayan Abbott 9.3
Indirect actor
We need a learning rule to estimate
mb = ⟨rb⟩ = Σrb qb(rb) rb
my = ⟨ry⟩ = Σry qy(ry) ry
Consider the cost function E(mb) = ½ Σrb qb(rb)(mb − rb)². The function is minimized when
dE/dmb = Σrb qb(rb)(mb − rb) = mb − ⟨rb⟩ = 0
We use stochastic gradient descent, replacing ⟨rb⟩ by a single sample rb:
mb := mb − ε dE/dmb = mb + ε (rb − mb)
and similarly for my. This is a delta rule.
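A minimal sketch of the indirect actor: softmax action selection (previous slide) combined with this delta rule. The learning rate ε = 0.1 is an assumption; the reward distributions are those used in the figure on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, beta = 0.1, 1.0                  # assumed learning rate and inverse temperature
m = {'b': 0.0, 'y': 0.0}              # action values

def reward(flower):
    # qb(r) = 1/2 d_{r,0} + 1/2 d_{r,2},  qy(r) = 1/2 d_{r,0} + 1/2 d_{r,4}
    return rng.choice([0.0, 2.0]) if flower == 'b' else rng.choice([0.0, 4.0])

for visit in range(200):
    # softmax policy: pb = exp(beta*mb) / (exp(beta*mb) + exp(beta*my))
    pb = 1.0 / (1.0 + np.exp(-beta * (m['b'] - m['y'])))
    a = 'b' if rng.random() < pb else 'y'
    m[a] += eps * (reward(a) - m[a])  # delta rule: m_a tracks <r_a>

print(m)                              # mb near <rb> = 1, my near <ry> = 2
```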
Dayan Abbott 9.3
Indirect actor
⟨rb⟩ = 1, ⟨ry⟩ = 2 for the first 100 visits, and reversed for the second 100 visits:
qb(r) = ½ δr,0 + ½ δr,2,  qy(r) = ½ δr,0 + ½ δr,4.
A) Visits selected according to the action values with β = 1, and learning of mb,y with the delta rule.
B) Cumulative visits for β = 1 and (C, D) for β = 50: larger β gives better exploitation but (sometimes) worse exploration.
Dayan Abbott 9.3
Indirect actor
Foraging in bumblebees. Blue flowers provide 2 µl of nectar; yellow flowers provide 6 µl on 1/3 of the trials and zero otherwise.
After 15 trials the rewards are reversed.
A) Mean preference of 5 bees for blue flowers over 30 trials, each consisting of 40 visits.
B) The action value is a concave function of nectar volume, so the low-risk option is preferred.
C) Preference of a single model bee (ε = 0.3, β = 23/8).
Dayan Abbott 9.3
Direct actor
The direct actor method estimates mb,y so as to maximize the expected future reward
⟨r⟩ = pb ⟨rb⟩ + py ⟨ry⟩
We again use stochastic gradient ascent. The derivative is
d⟨r⟩/dmb = βpb py (⟨rb⟩ − ⟨ry⟩)
where we used
dpb/dmb = βpb py,  dpy/dmb = −βpb py
The stochastic version is (and similarly for my):
mb := mb + (1 − pb) rb   if b is selected
mb := mb − pb ry   if y is selected
Dayan Abbott 9.3
Direct actor
For multiple actions the rule generalizes to, when action a is taken:
ma′ := ma′ + (δa,a′ − pa′) ra   for all a′
We will use this in the actor-critic algorithm (DA 9.4).
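A minimal sketch of this multi-action update with a softmax policy; the explicit learning rate ε is an addition, the rest follows the rule above:

```python
import numpy as np

def softmax(m, beta=1.0):
    e = np.exp(beta * (m - m.max()))          # subtract max for numerical stability
    return e / e.sum()

def direct_actor_update(m, a, r, beta=1.0, eps=0.1):
    """After taking action a with reward r: m_a' += eps * (delta_{a,a'} - p_a') * r."""
    p = softmax(m, beta)
    onehot = np.zeros_like(m)
    onehot[a] = 1.0
    return m + eps * (onehot - p) * r

# usage: three actions, action 0 taken, reward 2 received
m = direct_actor_update(np.zeros(3), a=0, r=2.0)
```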
Dayan Abbott 9.3
Direct actor
The setting is as before. A, B) Successful learning. C, D) Failure to learn to switch.
Note the larger difference my − mb. Learning methods that directly optimize the expected future reward may be too greedy and have poor exploration.
Dayan Abbott 9.4
Delayed reward/Sequential decision making
The reward is obtained after a sequence of actions. The rat moves through the maze without backtracking. After obtaining the reward, it is removed from the maze and the trial restarts.
Delayed reward problem: the choice at A has no direct reward.
Dayan Abbott 9.4
Delayed reward/Sequential decision making
Policy iteration (see [Kaelbling et al., 1996] 3.2.2): Loop:
• Policy evaluation: compute the value Vπ of the current policy π by running Bellman backups until convergence
• Policy improvement: Improve π
Dayan Abbott 9.4
Delayed reward/Sequential decision making
Actor Critic (see [Kaelbling et al., 1996] 4.1): Loop:
• Critic: use TD to evaluate V(state) under the current policy
• Actor: improve the policy p(state)
Dayan Abbott 9.4
Policy evaluation with TD
Policy is random left/right at each turn.
v(B) = ½ (0 + 5) = 2.5
v(C) = ½ (0 + 2) = 1
v(A) = ½ (v(B) + v(C)) = 1.75
Learn through TD learning (v(s) = w(s), s = A, B, C):
v(s) := v(s) + δ,   δ = r(s) + v(s′) − v(s)
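A small sketch of this TD evaluation of the random policy; which arm carries which reward at B and C is an assumption, and an explicit learning rate ε = 0.1 is added:

```python
import numpy as np

rng = np.random.default_rng(0)
v = {'A': 0.0, 'B': 0.0, 'C': 0.0}
eps = 0.1                                      # assumed learning rate

for trial in range(5000):
    # at A: random left/right leads to B or C with no immediate reward
    s = rng.choice(['B', 'C'])
    v['A'] += eps * (0.0 + v[s] - v['A'])      # delta = r + v(s') - v(s)
    # at B or C: random left/right leads to a terminal reward (5/0 or 2/0)
    r = rng.choice([5.0, 0.0]) if s == 'B' else rng.choice([2.0, 0.0])
    v[s] += eps * (r + 0.0 - v[s])             # terminal value is 0

print(v)                                       # approaches v(A)=1.75, v(B)=2.5, v(C)=1
```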
Dayan Abbott 9.4
Policy improvement
We use the direct actor method, which updates all ma′ upon taking action a as:
ma′ := ma′ + (δa,a′ − pa′) ra   for all a′
This generalizes for the delayed reward case to
ma′(s) := ma′(s) + (δa,a′ − pa′(s)) δ,   δ = ra + v(s′) − v(s)
pa(s) = eβma(s) / Σa′ eβma′(s)
Example: the values of the random policy are
v(A) = 1.75,  v(B) = 2.5,  v(C) = 1
Dayan Abbott 9.4
When in state A:
δ = 0 + v(B) − v(A) = 0.75   for a left turn
δ = 0 + v(C) − v(A) = −0.75   for a right turn
The learning rule increases mleft(A) and decreases mright(A). β controls the exploration.
Actor-critic learning: the figures show the probability of a left turn versus trials. The slow convergence at C is due to fewer visits to the right arm.
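A compact actor-critic sketch for this maze, combining the TD critic with the direct actor update; the learning rates, β, and which side carries which reward at B and C are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
v = {s: 0.0 for s in 'ABC'}                    # critic: state values
m = {s: np.zeros(2) for s in 'ABC'}            # actor: action values, index 0 = left, 1 = right
eps, beta = 0.1, 1.0                           # assumed learning rate and inverse temperature

# transitions: at A, left -> B and right -> C; reward sides at B and C are assumed
step = {('A', 0): ('B', 0.0), ('A', 1): ('C', 0.0),
        ('B', 0): (None, 5.0), ('B', 1): (None, 0.0),
        ('C', 0): (None, 0.0), ('C', 1): (None, 2.0)}

def policy(s):
    e = np.exp(beta * (m[s] - m[s].max()))     # softmax p_a(s)
    return e / e.sum()

for trial in range(2000):
    s = 'A'
    while s is not None:
        p = policy(s)
        a = int(rng.choice(2, p=p))
        s_next, r = step[(s, a)]
        delta = r + (v[s_next] if s_next is not None else 0.0) - v[s]   # TD error
        v[s] += eps * delta                                             # critic update
        m[s] += eps * (np.eye(2)[a] - p) * delta                        # actor update
        s = s_next

print(policy('A'))                             # probability of a left turn at A approaches 1
```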
Dayan Abbott 9.4
Generalizations
Generalizations of the basic actor-critic model are:
1. The state s may not be directly observable; instead a vector ui(s) of sensory information is available
• The value function v(s) can be a parametrized function of ui(s). For instance, with a linear parametrization, the TD learning rule is
v(s) = Σi wi ui(s),   wi := wi + δ ui(s),   δ = r(s) + v(s′) − v(s)
• The action value ma(s) also becomes a function of ui instead of s directly:
ma(s) = Σi Mai ui(s),   Ma′i := Ma′i + (δa,a′ − pa′(s)) δ ui(s)
Three-term learning rule: the 'connection' between 'neurons' i and a is changed depending on a third term δ.
2. Discounted reward: earlier rewards/punishments are more important than later ones.
δ = r(s) + γv(s′) − v(s)
Dayan Abbott 9.4
3. Updating the values of past states in each learning step: TD(λ)
Dayan Abbott 9.4
Water Maze
The rat is placed in a large pool of milky water and has to swim around until it finds a small submerged platform. After several trials, the rat learns to locate the platform.
ui(s) is the activity of the place-cell array, s is the physical location (B)
Critic: v(s) = Σi wi ui(s) is the value function (C, upper)
Actor: ma(s) = Σi Mai ui(s) is the action value (C, lower)
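A sketch of these critic and actor updates with a linear parametrization over place-cell activities; the Gaussian place-cell tuning, the numbers of cells and actions, the learning rate ε, and the example transition are all assumptions, only the update equations follow the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A = 100, 4                                  # assumed numbers of place cells and actions
centers = rng.uniform(0.0, 1.0, size=(N, 2))   # place-cell centres in a unit arena
w = np.zeros(N)                                # critic weights:  v(s)   = sum_i w_i  u_i(s)
M = np.zeros((A, N))                           # actor weights:   m_a(s) = sum_i M_ai u_i(s)
eps, beta, gamma = 0.05, 2.0, 0.95             # assumed learning rate, temperature, discount

def u(s):
    """Place-cell activities: Gaussian tuning around each centre (an assumption)."""
    return np.exp(-((centers - s) ** 2).sum(axis=1) / 0.02)

# one actor-critic update for an example transition (s, a, r, s_next)
s, s_next = np.array([0.20, 0.30]), np.array([0.25, 0.30])
a, r = 1, 1.0                                  # action taken and reward received

us = u(s)
delta = r + gamma * (w @ u(s_next)) - (w @ us)         # TD error
w += eps * delta * us                                  # critic: w_i += eps * delta * u_i(s)
p = np.exp(beta * (M @ us)); p /= p.sum()              # softmax policy p_a(s)
onehot = np.zeros(A); onehot[a] = 1.0
M += eps * np.outer(onehot - p, us) * delta            # actor: three-term rule with delta
```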
Dayan Abbott 9.4
References
[Dayan and Abbott, 2001] Dayan, P. and Abbott, L. (2001). Theoretical Neuroscience. Computational and Mathematical Modeling of Neural Systems. MIT Press, New York.
[Kaelbling et al., 1996] Kaelbling, L., Littman, M., and Moore, A. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285.
[Sutton and Barto, 1990] Sutton, R. S. and Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement, pages 497–537. Cambridge University Press.