Predictive Hebbian Learning

Terrence J. Sejnowski*, Peter Dayan†, P. Read Montague‡

* Howard Hughes Medical Institute, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, and Department of Biology, University of California, San Diego, La Jolla, CA 92093.
† Department of Brain and Cognitive Science, MIT, Cambridge, MA 02139.
‡ Division of Neuroscience, Baylor College of Medicine, Houston, TX 77030.

Abstract

A creature presented with a variable environment needs to anticipate events that have important consequences for its survival, such as the presence of food, mates, or risk. Animals that have the means to predict the most likely next state of the world can respond to stimuli that signal potential food or mates and can diminish the impact of destructive events. Psychologists have studied the conditions under which animals can learn to predict future reward and punishment. Prediction has also been studied in engineering (Kalman, 1960; Widrow & Stearns, 1985), and it may come as a surprise that the best existing rules for animal learning in psychology (Rescorla & Wagner, 1972; Pearce & Hall, 1980; Mackintosh, 1983) are essentially based on errors in prediction. Prediction is a computational concept, but this still leaves a wide range of possible theories about the psychological and neural mechanisms that underlie it, and these mechanisms are less well understood. We review below a computational theory of predictive learning and evidence that some of the mechanisms it requires have been discovered in both vertebrate and invertebrate brains (Ljungberg et al., 1992; Hammer, 1994). In particular, we consider the construction and delivery of a prediction error signal in the brain and its use for learning and for the control of action.

Animals are capable of predicting events on the basis of the sensory information they receive and of directing their actions according to those predictions (see Dickinson, 1980; Mackintosh, 1983; Gallistel, 1990 for reviews). One way for a nervous system to generate predictions is to have a system that guesses the next state of the world, compares the guess with what actually happens, and makes adaptation contingent on the errors in its predictions. From the perspective of a system that must learn in an uncertain environment, such an error signal would require the following: i) access to a representation of the predicted stimuli; ii) access to the events to be predicted, so that the predictions can be compared with what the organism actually receives; iii) the capacity to influence the structures responsible for constructing the predictions; and iv) sufficiently widespread delivery, so that predictions about salient events in different sensory modalities can be influenced.

These requirements are met by a number of diffusely projecting neuromodulatory systems found in many nervous systems. The axons of these systems broadcast to widespread target areas and release neuromodulators that are thought to change the effectiveness of synapses within those areas, so that an error signal carried by such a system could make plasticity contingent on the predictions being made. This anatomical motif is common to vertebrate and invertebrate brains.
Invertebrates have sets of neurons analogous to the vertebrate neuromodulatory systems, with extensive axonal arborizations that deliver neuromodulators (directly or indirectly) to widespread target regions (Hawkins and Kandel, 1984; Greenough and Bailey, 1988; Hammer, 1994). Experimental evidence from both behavioral and physiological work suggests that these systems broadcast information about salient events, such as the delivery of food, and influence ongoing neural activity as well as a number of other physiological functions. These systems are also known to be required for the normal development of the response properties of sensory neurons in various regions of cerebral cortex. However, the computational form of their influence on learning and behavior is not known.

Classical Conditioning

We briefly summarize the computational framework that underlies our modeling of classical conditioning, the phenomenon in which an animal learns that some stimuli predict the delivery of reward (see Dickinson, 1980, and Gluck and Thompson, 1987, for reviews). At time t, the animal sees various stimuli in its environment, represented here by the activities {x_i(t)}, i ∈ {1, ..., N}, of a population of neurons that can be taken to be units in cerebral cortex representing the conditioned stimuli (CSs). The animal also receives reward, a scalar quantity r(t). The temporal difference (TD) approach (Sutton & Barto, 1981, 1987) assumes that the computational task in prediction is to use the CSs to fit V(x(t)), a function of the stimuli, to the expected discounted sum of future rewards:

V(x(t)) = ⟨ Σ_{u=t}^∞ γ^{u−t} r(u) ⟩    (1)

where 0 ≤ γ < 1 is a discount factor that makes rewards in the distant future worth a fraction of current rewards.
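The expectation in equation 1 is over the randomness of the environment; for a single fixed sequence of rewards the discounted sum itself is easy to compute. The following Python fragment is a minimal illustration, not part of the original paper, with an invented reward sequence and an assumed discount factor γ = 0.9:

    # Discounted return of equation 1 for one fixed reward sequence.
    # The sequence and gamma = 0.9 are invented for this illustration.
    gamma = 0.9
    rewards = [0.0, 0.0, 1.0, 0.0, 0.5]   # hypothetical r(t), r(t+1), ...

    def discounted_return(rs, gamma):
        # sum of gamma**k * r(t+k) over the remaining sequence
        return sum(gamma**k * r for k, r in enumerate(rs))

    print(discounted_return(rewards, gamma))   # 0.81 + 0.32805 = 1.13805

A reward two steps away contributes only γ² = 0.81 of its value, so earlier rewards dominate the sum.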
The next assumption is key: that the environment has a Markov property, i.e., that its future states and rewards depend only on its current state and not on past ones. Taking the mean over the non-determinism of the environment, V(x(t)) must then satisfy a consistency condition:

V(x(t)) = ⟨ r(t) + γ V(x(t+1)) ⟩    (2)

or, equivalently, the prediction error

ε(t) = r(t) + γ V(x(t+1)) − V(x(t))    (3)

must have mean zero, ⟨ε(t)⟩ = 0. It is natural to use ε(t) to drive learning. In the simplest form, V(x(t)) is a linear function of the current stimuli, V(x(t)) = Σ_i w_i(t) x_i(t), and the weights {w_i(t)} are adjusted to make ⟨ε(t)⟩ = 0 using the delta rule (Rescorla & Wagner, 1972; Widrow & Stearns, 1985):

Δw_i(t) = α_i(t) x_i(t) ε(t)    (4)

where α_i(t) is a learning rate. Sutton (1988) and Dayan and Sejnowski (1994) have shown conditions under which the weights w(t) converge, so that V(x(t)) comes to predict the discounted sum of future rewards.

The learning rule in equation 4 is Hebbian: the weights are updated according to the correlation between presynaptic activity x_i(t) and a postsynaptic term ε(t). Hebb (1949) first suggested that synaptic changes should depend on such a correlation with a quantity that reflects postsynaptic activity, and the computational role of the resulting rules has been the subject of debate ever since. Equation 4 is also computationally more powerful than using a neuromodulator merely as a 'print now' command that gates Hebbian plasticity (Rauschecker, 1991), because the same signal that reports errors in the predictions is used to train the predictions. How the brain learns the representations x_i(t) on which the predictions are based is an open research problem (Quartz et al., 1992; Montague et al., submitted).

Neuroscientists may recently have discovered a signal in the brain that reports this kind of prediction error. Dopaminergic neurons of the ventral tegmental area (VTA) send their axons diffusely to widespread target areas, including the cerebral cortex and the ventral striatum. The latter brain region is known to be an important center for reward learning and is involved in the self-administration of addictive drugs and in electrical self-stimulation. In a behavioral task in which some sensory stimulus, such as a light, consistently predicts the delivery of reward, cells in the VTA tend to respond to the delivery of reward in a naive animal; after the task has been learned, however, they respond to the onset of the earlier, predictive sensory cue rather than to the reward itself (Ljungberg et al., 1992; Schultz et al., 1993). We suggest that the output of these VTA cells reports the prediction error ε(t) in equation 3 (Montague et al., submitted).
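To see how equations 2-4 work together, the following Python sketch, our illustration rather than a simulation from the original model, trains the linear predictor of equation 4 on a toy trial in which a conditioned stimulus at the start of a trial is followed by reward at the end. The tapped delay-line representation (one unit per time step after stimulus onset), the trial length, the learning rate, and the choice γ = 1 are assumptions made only for this example:

    import numpy as np

    # A minimal sketch of TD prediction learning (equations 1-4); the
    # representation, trial structure, and parameters are assumptions.
    T = 5                          # time steps per trial
    gamma, alpha = 1.0, 0.3        # discount factor and learning rate
    w = np.zeros(T)                # linear predictor: V(x(t)) = w . x(t)
    r = np.zeros(T)
    r[-1] = 1.0                    # reward arrives at the end of the trial

    def x(t):                      # one stimulus unit per time step
        v = np.zeros(T)
        v[t] = 1.0
        return v

    for trial in range(100):
        errors = []
        for t in range(T):
            V_next = w @ x(t + 1) if t + 1 < T else 0.0
            delta = r[t] + gamma * V_next - w @ x(t)   # TD error, equation 3
            w += alpha * x(t) * delta                  # delta rule, equation 4
            errors.append(delta)
        if trial in (0, 3, 99):
            print(trial, np.round(errors, 2))

On the first trial the error ε(t) is large only at the time of reward; over trials it propagates backward toward stimulus onset and shrinks as V(x(t)) becomes accurate, mirroring the transfer of responses from reward to predictive cue described above.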
Instrumental Conditioning

Classical conditioning concerns the capacity of animals to learn that some events predict others, and the conditioned responses, such as the blink of a nictitating membrane before the delivery of a puff of air, are comparatively simple. Predictions are of most use, however, if they can be used to control actions. In instrumental conditioning, the delivery of reward is contingent on the actions the animal performs, and animals can learn arbitrary and complicated sequences of actions to obtain reward. The exact relationship between classical and instrumental conditioning is the subject of substantial debate (Mackintosh, 1983); here we consider how the prediction error in equation 3 could also be used to choose actions appropriately.

In instrumental conditioning tasks such as mazes, the animal has a set of actions available to it (characterized, say, as coming from a set A), and its choice of actions affects the rewards it receives. If the states of such a task form a Markov sequence, the task can be treated as a sequential decision problem (Barto et al., 1989), and the techniques of dynamic programming (Bellman, 1957) provide a general theoretical framework for its solution. The animal also faces a temporal credit assignment problem: reward is usually provided only at a fixed goal, at some distance in time from the choice points at which the critical decisions were made. In general, the way an animal chooses actions is called a policy, which is something like a stimulus-response connection (Mackintosh, 1983). For a policy π, the value V^π(x) is the reward that would be expected to accrue if the animal starts at state x and follows π. The method of policy iteration (Howard, 1960) in dynamic programming evaluates all the states under a fixed policy in exactly this sense, and then uses the evaluation to improve the policy.

Barto et al. (1983) described a simple version of policy iteration in which each action a ∈ A is associated with a set of adjustable parameters {u_i^a(t)}. At time t, the system calculates a 'value' for each action:

v_a(t) = Σ_i u_i^a(t) x_i(t) + η_a(t)    (5)

where the η_a(t) are random noise values, and selects the action with the largest value:

a* = argmax_a v_a(t)    (6)

The role of the noise is to ensure that each action is tried often enough in each state that the overall appropriateness of one action relative to the others can be identified. In this formulation, ε(t) in equation 3 is used to criticize the choice of action, indicating whether the action selected was better or worse than the average. The parameters of the selected action a* are updated as:

Δu_i^{a*}(t) = β_i(t) x_i(t) ε(t)    (7)

where β_i(t) is another learning rate; the parameters of the non-selected actions are left fixed. As learning proceeds, this tends to improve the policy.

The capacity for learning predictions over time is used here to solve the temporal credit assignment problem, which comes from the distance in time between making an action and seeing its consequence in terms of getting to the goal. It turns out that temporal difference algorithms learn exactly the predictions of proximity to the goal that are required for dynamic programming (Bellman, 1957). In fact, the same error signal that these algorithms use to learn how far states are away from the goal can also be used to criticize the choice of actions (Barto, Sutton & Anderson, 1983; Barto, Sutton & Watkins, 1989). The key attraction of this form of policy iteration is that the signal ε(t), which we postulated the VTA cells to be reporting, has two roles: training the prediction parameters w(t), and training the action parameters {u_i^a(t)}.
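A compact way to see equations 5-7 at work is the following actor-critic sketch in Python. Everything task-specific here is our own assumption for illustration: a five-state chain with actions 'left' and 'right', reward on reaching the right end, and arbitrarily chosen learning rates and noise scale. The point is structural: the single TD error of equation 3 trains both the critic's weights w (equation 4) and, for the selected action only, the actor's weights u (equation 7):

    import numpy as np

    # Actor-critic sketch of equations 5-7 on an invented chain task.
    rng = np.random.default_rng(0)
    N, gamma = 5, 0.9
    alpha, beta, noise = 0.1, 0.1, 0.1
    w = np.zeros(N)            # critic weights: V(x) = w . x
    u = np.zeros((2, N))       # actor weights u_a for actions {left, right}

    def one_hot(s):
        v = np.zeros(N)
        v[s] = 1.0
        return v

    for episode in range(500):
        s = 0
        while s != N - 1:
            xs = one_hot(s)
            # Equations 5-6: noisy action values, pick the largest.
            va = u @ xs + rng.normal(scale=noise, size=2)
            a = int(np.argmax(va))
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == N - 1 else 0.0
            # Equation 3: one TD error serves both critic and actor.
            delta = r + gamma * (w @ one_hot(s_next)) - w @ xs
            w += alpha * xs * delta        # equation 4: train the critic
            u[a] += beta * xs * delta      # equation 7: selected action only
            s = s_next

    print(np.round(u[1] - u[0], 2))

After training, u[1] − u[0] is positive in every nonterminal state; that is, the noise-perturbed competition of equation 6 has been biased toward the action that moves the animal closer to the goal.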
Conclusions

In summary, these models of classical and instrumental conditioning are very closely related: learning to predict future events is used both to anticipate rewards and, through the influence of structures such as the ventral striatum on decision-making, to improve the choice of actions. There are many aspects of this framework that remain to be understood, and many open questions to be addressed: How are representations of sensory stimuli formed by populations of cortical neurons? How does the brain uncover which correlations between events are causal rather than spurious? How is time represented? Some of these questions are taken up elsewhere (Montague et al., submitted). A model, however interesting, may also turn out to be wrong; the advantage of having an overall framework like the one outlined here is that it is grounded in ways that are subject to experimental test. The prospect of a rigorous mathematical framework for animal learning will allow sharper questions to be asked, and perhaps definitive answers to be found.

References

Barto, AG, Sutton, RS and Anderson, CW (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.

Barto, AG, Sutton, RS and Watkins, CJCH (1989). Learning and sequential decision making. Technical Report 89-95, Computer and Information Science, University of Massachusetts, Amherst, MA.

Bellman, R (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.

Dayan, P (1994). Computational modelling. Current Opinion in Neurobiology, 4, 212-217.

Dayan, P and Sejnowski, TJ (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295-301.

Dickinson, A (1980). Contemporary Animal Learning Theory. Cambridge, England: Cambridge University Press.

Gallistel, CR (1990). The Organization of Learning. Cambridge, MA: MIT Press.

Gluck, MA and Thompson, RF (1987). Modeling the neural substrates of associative learning and memory: a computational approach. Psychological Review, 94, 176-191.

Greenough, WT and Bailey, CH (1988). The anatomy of a memory: convergence of results across a diversity of tests. Trends in Neurosciences, 11, 142-147.

Hammer, M (1994). An identified neuron mediates the unconditioned stimulus in associative olfactory learning in honeybees. Nature, 366, 59-63.

Hawkins, RD and Kandel, ER (1984). Is there a cell-biological alphabet for simple forms of learning? Psychological Review, 91(3), 375-391.

Hebb, DO (1949). The Organization of Behavior. New York: Wiley.

Kalman, RE (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, Transactions of the ASME, Series D, 82(1), 35-45.

Ljungberg, T, Apicella, P and Schultz, W (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology, 67(1), 145-163.

Mackintosh, NJ (1983). Conditioning and Associative Learning. Oxford, UK: Oxford University Press.

Montague, PR, Dayan, P and Sejnowski, TJ (submitted for publication). A framework for mesolimbic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience.

Pearce, JM and Hall, G (1980). A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87, 532-552.

Quartz, S, Dayan, P, Montague, PR and Sejnowski, TJ (1992). Expectation learning in the brain using diffuse ascending connections. Society for Neuroscience Abstracts, 18, 1210.

Rauschecker, JP (1991). Mechanisms of visual plasticity: Hebb synapses, NMDA receptors, and beyond. Physiological Reviews, 71(2), 587-615.

Rescorla, RA and Wagner, AR (1972). A theory of Pavlovian conditioning: the effectiveness of reinforcement and non-reinforcement. In AH Black and WF Prokasy, editors, Classical Conditioning II: Current Research and Theory, pp 64-99. New York, NY: Appleton-Century-Crofts.

Schultz, W, Apicella, P and Ljungberg, T (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13(3), 900-913.

Sutton, RS (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

Sutton, RS and Barto, AG (1981). Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review, 88(2), 135-170.

Sutton, RS and Barto, AG (1987). A temporal-difference model of classical conditioning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA.

Sutton, RS and Barto, AG (1989). Time-derivative models of Pavlovian reinforcement. In M Gabriel and J Moore, editors, Learning and Computational Neuroscience. Cambridge, MA: MIT Press.

Widrow, B and Stearns, SD (1985). Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.