Animal learning theory

Based on [Sutton and Barto, 1990, Dayan and Abbott, 2001]

Bert Kappen


[Sutton and Barto, 1990] Classical conditioning

- A conditioned stimulus (CS) and an unconditioned stimulus (US) are paired in close succession
- The US produces a response UR
- After sufficient pairing, the CS alone produces a response CR similar to UR

Example: rabbit eye blink conditioning
- CS is the sound of a buzzer, US is an air puff to the rabbit's eye, UR is an eye blink
- After CS-US pairing, the buzzer alone yields a CR eye blink


[Sutton and Barto, 1990] Classical theory

Classical theory proposes that a causal relation between CS and US is learned:

  ∆V = (level of US processing) × (level of CS processing) = Reinforcement × Eligibility

V is the association strength.
- US processing is about reinforcement (positive or negative)
- CS processing is about attention (which stimulus is credited)

We focus on reinforcement processing with very simple eligibility models.


[Sutton and Barto, 1990] Trial-level vs. real-time theories

Trial-level theories
- ignore the temporal structure of the individual trial; updates apply at the end of the trial
- Rescorla-Wagner model
- mostly used in experiments

Real-time theories
- consider the detailed timing of the stimuli and rewards
- TD learning


[Sutton and Barto, 1990] Rescorla-Wagner (RW) model

The central idea is that learning occurs whenever events violate expectations, in particular when the expected US level V̄ differs from the actual US level λ:

  V_i := V_i + ∆V_i
  ∆V_i = β (λ − V̄) α_i X_i
  V̄ = Σ_i V_i X_i

X_i = 0, 1 indicates absence/presence of CS_i. α_i is the saliency (attention) of CS_i (fixed). The term α_i X_i is often ignored for a single stimulus. V_i is the association strength of stimulus CS_i to the US. λ denotes the 'level of US', "combining, in an unspecified way, the US's intensity, duration, and temporal relationship with the CS".


[Sutton and Barto, 1990] Rescorla-Wagner model

The Rescorla-Wagner rule for multiple inputs can predict various phenomena:
- Blocking: a learned association s1 → r prevents learning of the association s2 → r
- Inhibition: s2 reduces the prediction when combined with any predicting stimulus


[Sutton and Barto, 1990] Rescorla-Wagner model

The RW rule fails to explain second-order conditioning: B → A → US
- Empirical finding: if A predicts the US and B predicts A, then B predicts the US
- The A → US relation is correctly reinforced by RW
- The B → A relation is not reinforced by RW, because the US level λ = 0 in B → A trials

The solution is to assume that A also generates a λ.


[Sutton and Barto, 1990] Temporal aspects play an important role.


[Sutton and Barto, 1990] Real-time theories of eligibility

Conditioning depends on the time interval between CS and US. Mechanisms for the coincidence of CS and US:
- A stimulus trace (Hull 1939) is an internal representation. The disadvantage of using a stimulus trace for learning is that long ISIs require a broad stimulus trace, which disagrees with fast, precisely timed responses.
- An eligibility trace X̄ (Klopf 1972) is similar to a stimulus trace, but is only used for learning.


[Sutton and Barto, 1990] From RW to a real-time theory

Left: trial-based reinforcement λ is the area under the curve. Right: real-time reinforcement is the future area under the curve.

Problem:
- The prediction is constant for all times prior to the US, which disagrees with empirical observation.

The solution is imminence weighting, which reduces long-time associations.
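To make the trial-level RW rule above concrete, here is a minimal Python sketch applied to a blocking experiment. The learning rate β, the saliencies α_i, the US level λ and the number of trials are illustrative choices, not values from the slides.

import numpy as np

def rw_update(V, X, lam, beta=0.1, alpha=None):
    # One Rescorla-Wagner trial: Delta V_i = beta * (lam - V_bar) * alpha_i * X_i
    if alpha is None:
        alpha = np.ones_like(V)
    V_bar = np.dot(V, X)                  # expected US level, V_bar = sum_i V_i X_i
    return V + beta * (lam - V_bar) * alpha * X

# Blocking: first train CS1 alone, then train CS1 and CS2 together.
V = np.zeros(2)
for _ in range(200):                      # phase 1: CS1 -> US
    V = rw_update(V, X=np.array([1.0, 0.0]), lam=1.0)
for _ in range(200):                      # phase 2: CS1 + CS2 -> US
    V = rw_update(V, X=np.array([1.0, 1.0]), lam=1.0)

print(V)  # V[0] is close to 1, V[1] stays close to 0: CS1 already predicts the US, so CS2 acquires little strength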
[Sutton and Barto, 1990] Temporal difference learning

The subject predicts at time t the imminence-weighted area (c) rather than the unweighted area (a), which may be infinite.


[Sutton and Barto, 1990] Temporal difference learning

When rewards λ_t occur at discrete times t = 1, 2, 3, ...:

  V̄_t = Σ_{s=t}^∞ γ^{s−t} λ_{s+1}
      = λ_{t+1} + γ λ_{t+2} + γ² λ_{t+3} + ...
      = λ_{t+1} + γ (λ_{t+2} + γ λ_{t+3} + ...)
      = λ_{t+1} + γ Σ_{s=t+1}^∞ γ^{s−t−1} λ_{s+1}
      = λ_{t+1} + γ V̄_{t+1}

The error λ − V̄ in the RW rule becomes the error λ_{t+1} + γ V̄_{t+1} − V̄_t:

  ∆V_i = β (λ_{t+1} + γ V̄_{t+1} − V̄_t) α_i X_i    with V̄_t = Σ_i V_i X_i(t)


[Sutton and Barto, 1990] Temporal difference learning

Complete-serial-compound (CSC) experiment:
- There is a CS stimulus at each time i = 1, ..., T, with α_i = 1
- In each learning step we take all stimuli X_i(t) = δ_{i,t} and compute V̄_t = Σ_i V_i X_i(t) = V_t, t = 1, ..., T
- Compute

  ∆V_i = β (λ_{t+1} + γ V̄_{t+1} − V̄_t) α_i δ_{i,t}

  which is equivalent to

  ∆V̄_t = β (λ_{t+1} + γ V̄_{t+1} − V̄_t)


[Sutton and Barto, 1990] Difference between TD and standard RL

Note the difference between the TD equation

  ∆V_i = β (λ_{t+1} + γ V̄_{t+1} − V̄_t) α_i X_i

and standard RL:

  ∆V̄_t = β (λ_{t+1} + γ V̄_{t+1} − V̄_t)

Differences:
- The normal RL interpretation is that V̄_t is the value of state t, found self-consistently through the Bellman equation
- Here instead V_i is updated and V̄_t = Σ_i V_i X_i(t). The relation is established by interpreting i also as a time label and defining X_i(t) = δ_{i,t}.


[Sutton and Barto, 1990] Inter-stimulus-interval dependency

Rabbit CR (eye blink) to CS (sound) as a function of the ISI length shows good agreement with the TD model.


Dayan Abbott 9.2 Learning to predict a reward

DA Eq. 9.6,

  v(t) = Σ_τ w(τ) u(t − τ)

relates to the SB rule with v(t) = V̄(t), w(τ) = V_i, u(t − τ) = X_i(t).

DA Eq. 9.10,

  ∆w(τ) = ε (r(t) + v(t + 1) − v(t)) u(t − τ)

is the SB rule with r(t) = λ_{t+1}, γ = 1 and stimulus u(t − τ) = X_i(t).


Dayan Abbott 9.2 Fig. 9.2

The stimulus is u(t) = δ_{t,100}, thus Eq. 9.6 gives v(t) = Σ_τ w(τ) u(t − τ) = w(t − 100). The delta rule Eq. 9.10 becomes (taking ε = 1)

  ∆w(τ) = δ(t) u(t − τ) = δ(t) δ_{t−τ,100}
  ∆v(t) = ∆w(t − 100) = δ(t)
  δ(t) = r(t) + v(t + 1) − v(t)

and ∆w(τ) = 0 if τ ≠ t − 100. The reward is r(t) = δ_{t,200}. (A numerical sketch of this setting is given further below.)


Dayan Abbott 9.2

First iteration:  w(t − 100) = v(t) = 0, so δ(t) = δ_{t,200}.
Second iteration: v(t) = δ_{t,200}, so δ(t) = δ_{t,200} + δ_{t,199} − δ_{t,200} = δ_{t,199}.


Dayan Abbott 9.2 Dopamine

The activity of dopamine neurons in the ventral tegmental area (VTA) of the monkey encodes δ of fig. 9.2. The monkey presses a button after a stimulus (sound) to receive a reward (fruit juice).
- A: Left panels show trials locked to the stimulus, right panels trials locked to the reward. The top row shows early trials, the bottom row late trials. VTA cells respond to the reward in early trials and to the stimulus in late trials.
- B top: stimulus-locked trials after learning, as in the bottom row of A
- B bottom: withholding the (expected) reward yields inhibition


Dayan Abbott 9.3 Instrumental conditioning / static action choice

Conditioning:
• Classical: the reward (punishment) is independent of the action taken (Pavlov's dog)
• Instrumental: the reward depends on the action taken

Reward timing:
• immediate: the reward follows directly after the action (static action choice)
• delayed: the reward follows after a sequence of actions

Foraging of bees as an example of learning behaviour with immediate reward:
• bees visit flowers of two colors (blue, yellow), preferring the color with the higher reward (nectar)
• when the rewards are swapped, the bees adjust their preference
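Before turning to action choice, here is a minimal sketch of the reward-prediction setting of DA 9.2 / fig. 9.2 above: a stimulus at t = 100, a reward at t = 200, γ = 1 and ε = 1, so that the TD error δ(t) steps backwards by one time step per trial. The trial count and array size are illustrative choices.

import numpy as np

T = 250
w = np.zeros(T)                     # w(tau): weight for a stimulus tau time steps in the past
eps = 1.0                           # learning rate; eps = 1 reproduces the one-step-back picture

def run_trial(w):
    # One CSC trial: stimulus at t = 100, reward at t = 200, gamma = 1
    r = np.zeros(T); r[200] = 1.0
    v = np.zeros(T)
    v[100:] = w[:T - 100]           # v(t) = w(t - 100) for t >= 100, zero before the stimulus
    delta = np.zeros(T)
    delta[:-1] = r[:-1] + v[1:] - v[:-1]      # delta(t) = r(t) + v(t+1) - v(t)
    w_new = w.copy()
    w_new[:T - 100] += eps * delta[100:]      # only the eligible weight w(t - 100) is updated
    return w_new, delta

for trial in range(3):
    w, delta = run_trial(w)
    print(trial, np.nonzero(delta)[0])        # the nonzero delta moves back: t = 200, 199, 198, ...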
Dayan Abbott 9.3

The quantity of nectar r_{b,y} is stochastic, drawn from a (fixed) distribution q_{b,y}(r). This is like a two-armed bandit problem. The policy of the bees is given by probabilities p_{b,y} with p_b + p_y = 1. We can parametrize this as

  p_b = e^{β m_b} / (e^{β m_b} + e^{β m_y}),    p_y = e^{β m_y} / (e^{β m_b} + e^{β m_y})

The parameters m_{b,y} are called action values; they do not require normalization. The parameter β controls exploration versus exploitation:
• β = 0: exploration
• β = ∞: exploitation

Two learning strategies:
• learn the action value as the past expected reward (indirect actor)
• learn the action value so as to maximize the future expected reward (direct actor)


Dayan Abbott 9.3 Indirect actor

We need a learning rule to estimate

  m_b = ⟨r_b⟩ = Σ_{r_b} q_b(r_b) r_b,    m_y = ⟨r_y⟩ = Σ_{r_y} q_y(r_y) r_y

Consider the cost function E(m_b) = ½ Σ_{r_b} q_b(r_b) (m_b − r_b)². The function is minimized when

  dE/dm_b = Σ_{r_b} q_b(r_b) (m_b − r_b) = m_b − ⟨r_b⟩ = 0

We use stochastic gradient descent, replacing ⟨r_b⟩ by a single sample r_b:

  m_b := m_b − ε dE/dm_b = m_b + ε (r_b − m_b)

and similarly for m_y. This is the delta rule.


Dayan Abbott 9.3 Indirect actor

⟨r_b⟩ = 1, ⟨r_y⟩ = 2 for the first 100 visits, and reversed for the second 100 visits; q_b(r) = ½ δ_{r,0} + ½ δ_{r,2}, q_y(r) = ½ δ_{r,0} + ½ δ_{r,4}.
A) Visits selected according to the action values with β = 1, and learning of m_{b,y} with the delta rule.
B) Cumulative visits for β = 1 and (C, D) for β = 50: large β gives better exploitation but (sometimes) worse exploration.


Dayan Abbott 9.3 Indirect actor

Foraging in bumblebees. Blue flowers provide 2 µl of nectar; yellow flowers provide 6 µl on 1/3 of the visits and zero otherwise. After 15 trials the rewards are reversed.
A) Mean preference of 5 real bees for blue flowers over 30 trials, each consisting of 40 visits.
B) When the action value is a concave function of nectar volume, the bee prefers the low-risk option.
C) Preference of a single model bee (ε = 0.3, β = 23/8).


Dayan Abbott 9.3 Direct actor

The direct actor method estimates m_{b,y} so as to maximize the expected future reward

  ⟨r⟩ = p_b ⟨r_b⟩ + p_y ⟨r_y⟩

We again use stochastic gradient ascent. The derivative is

  d⟨r⟩/dm_b = β p_b p_y (⟨r_b⟩ − ⟨r_y⟩)

where we used dp_b/dm_b = β p_b p_y and dp_y/dm_b = −β p_b p_y. The stochastic version is (and similarly for m_y):

  m_b := m_b + ε (1 − p_b) r_b    if b is selected
  m_b := m_b − ε p_b r_y          if y is selected


Dayan Abbott 9.3 Direct actor

For multiple actions the rule generalizes to, when action a is taken:

  m_{a′} := m_{a′} + ε (δ_{a,a′} − p_{a′}) r_a    for all a′

We will use this in the actor-critic algorithm (DA 9.4). (A numerical sketch of both actors follows below.)


Dayan Abbott 9.3 Direct actor

Same setting as before. A, B) Successful learning. C, D) Failure to learn to switch; note the larger difference m_y − m_b. Learning methods that directly optimize the expected future reward may be too greedy and have poor exploration.


Dayan Abbott 9.4 Delayed reward / sequential decision making

The reward is obtained after a sequence of actions. The rat moves through the maze without backtracking. After obtaining the reward it is removed from the maze, and the trial restarts. Delayed reward problem: the choice at A has no direct reward.


Dayan Abbott 9.4 Delayed reward / sequential decision making

Policy iteration (see [Kaelbling et al., 1996], section 3.2.2). Loop:
• Policy evaluation: compute the value V^π of the policy π; run Bellman backups until convergence
• Policy improvement: improve π


Dayan Abbott 9.4 Delayed reward / sequential decision making

Actor-critic (see [Kaelbling et al., 1996], section 4.1). Loop:
• Critic: use TD to evaluate V(state) under the current policy
• Actor: improve the policy p(state)
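Here is a minimal sketch of the two strategies of DA 9.3 on the two-flower bandit, using the softmax policy above. The reward distributions and the reversal after 100 visits follow the indirect-actor example; the values β = 1, ε = 0.1, the number of visits and the random seed are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
beta, eps, n_visits = 1.0, 0.1, 200

def reward(flower, swapped):
    # q_b(r) = 1/2 delta_{r,0} + 1/2 delta_{r,2}; q_y(r) = 1/2 delta_{r,0} + 1/2 delta_{r,4}
    high = 4.0 if (flower == 1) != swapped else 2.0
    return rng.choice([0.0, high])

def run(update):
    m = np.zeros(2)                       # action values m_b, m_y (index 0 = blue, 1 = yellow)
    choices = []
    for t in range(n_visits):
        p = np.exp(beta * m) / np.exp(beta * m).sum()   # softmax policy
        a = rng.choice(2, p=p)
        r = reward(a, swapped=(t >= 100)) # rewards swapped after 100 visits
        update(m, p, a, r)
        choices.append(a)
    return np.array(choices)

def indirect(m, p, a, r):                 # delta rule: m_a += eps (r - m_a)
    m[a] += eps * (r - m[a])

def direct(m, p, a, r):                   # direct actor: m_a' += eps (delta_{a,a'} - p_a') r
    m += eps * ((np.arange(2) == a) - p) * r

for name, rule in [("indirect", indirect), ("direct", direct)]:
    c = run(rule)
    print(name, "fraction yellow before/after the swap:",
          c[:100].mean().round(2), c[100:].mean().round(2))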
Dayan Abbott 9.4 Policy evaluation with TD

The policy is a random left/right choice at each turn:

  v(B) = ½ (0 + 5) = 2.5
  v(C) = ½ (0 + 2) = 1
  v(A) = ½ (v(B) + v(C)) = 1.75

Learn through TD learning (v(s) = w(s), s = A, B, C):

  v(s) := v(s) + ε δ,    δ = r(s) + v(s′) − v(s)


Dayan Abbott 9.4 Policy improvement

We use the direct actor method, which updates all m_{a′} upon taking action a as:

  m_{a′} := m_{a′} + ε (δ_{a,a′} − p_{a′}) r_a    for all a′

For the delayed reward case this generalizes to

  m_{a′}(s) := m_{a′}(s) + ε (δ_{a,a′} − p_{a′}(s)) δ
  p_a(s) = e^{β m_a(s)} / Σ_{a′} e^{β m_{a′}(s)}
  δ = r_a + v(s′) − v(s)

Example: the values of the random policy are v(A) = 1.75, v(B) = 2.5, v(C) = 1.


Dayan Abbott 9.4

When in state A:

  δ = 0 + v(B) − v(A) = 0.75     for a left turn (towards B)
  δ = 0 + v(C) − v(A) = −0.75    for a right turn (towards C)

The learning rule increases m_left(A) and decreases m_right(A). β controls the exploration.

Actor-critic learning: the figure shows the probability of a left turn versus trials. Convergence at C is slow because the right arm is visited less often. (A numerical sketch of this actor-critic is given at the end, after the references.)


Dayan Abbott 9.4 Generalizations

Generalizations of the basic actor-critic model:

1. The state s may not be directly observable; instead a vector u_i(s) of sensory information is available.
• The value function v(s) can be a parametrized function of u_i(s). For a linear parametrization, the TD learning rule is

  v(s) = Σ_i w_i u_i(s),    w_i := w_i + ε δ u_i(s),    δ = r(s) + v(s′) − v(s)

• The action value m_a(s) also becomes a function of u_i instead of s directly:

  m_a(s) = Σ_i M_{ai} u_i(s),    M_{a′i} := M_{a′i} + ε (δ_{a,a′} − p_{a′}(s)) δ u_i(s)

This is a three-term learning rule: the 'connection' between 'neurons' i and a is changed depending on a third term δ.

2. Discounted reward: earlier reward/punishment is more important than later reward/punishment:

  δ = r(s) + γ v(s′) − v(s)

3. Updating the values of past states in each learning step: TD(λ).


Dayan Abbott 9.4 Water maze

The rat is placed in a large pool of milky water and has to swim around until it finds a small platform that is submerged below the surface. After several trials, the rat learns to locate the platform.

u_i(s) is the activity of an array of place cells, s is the physical location (B).
Critic: v(s) = Σ_i w_i u_i(s) is the value function (C, upper).
Actor: m_a(s) = Σ_i M_{ai} u_i(s) is the action value (C, lower).


References

[Dayan and Abbott, 2001] Dayan, P. and Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge, MA.

[Kaelbling et al., 1996] Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285.

[Sutton and Barto, 1990] Sutton, R. S. and Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In Gabriel, M. and Moore, J., editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, pages 497–537. MIT Press, Cambridge, MA.
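Closing illustration: a minimal actor-critic sketch for the maze task of DA 9.4 above, combining the TD critic v(s) := v(s) + ε δ with the direct actor m_{a′}(s) := m_{a′}(s) + ε (δ_{a,a′} − p_{a′}(s)) δ. The left/right assignment of the terminal rewards at B and C, the values of β and ε, the number of trials and the random seed are illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng(1)
beta, eps, n_trials = 1.0, 0.2, 500

states = ["A", "B", "C"]
# transition and reward structure; left/right assignment of the end rewards is illustrative
step = {("A", 0): ("B", 0.0), ("A", 1): ("C", 0.0),
        ("B", 0): (None, 0.0), ("B", 1): (None, 5.0),
        ("C", 0): (None, 2.0), ("C", 1): (None, 0.0)}

v = {s: 0.0 for s in states}                 # critic: v(s) = w(s)
m = {s: np.zeros(2) for s in states}         # actor: action values m_a(s), a = 0 (left), 1 (right)

def policy(s):
    e = np.exp(beta * m[s])
    return e / e.sum()

for _ in range(n_trials):
    s = "A"
    while s is not None:
        p = policy(s)
        a = rng.choice(2, p=p)
        s_next, r = step[(s, a)]
        v_next = v[s_next] if s_next is not None else 0.0
        delta = r + v_next - v[s]            # TD error, gamma = 1
        v[s] += eps * delta                  # critic: v(s) := v(s) + eps * delta
        m[s] += eps * ((np.arange(2) == a) - p) * delta   # actor: m_a'(s) += eps (delta_{a,a'} - p_a') delta
        s = s_next

print({s: round(v[s], 2) for s in states})   # the critic's values track the current (improving) policy
print({s: policy(s).round(2) for s in states})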