Predictive Hebbian Learning

Terrence J. Sejnowski*    Peter Dayan†    P. Read Montague‡
Abstract

A creature presented with a variable environment must have the means to learn about important future events, since such events can have significant consequences for its survival. The presence of potential mates, the delivery of food, or the risk of destruction may all be contingent on its current state and actions. One way for an animal to anticipate such events is for its nervous system to generate guesses about the most likely next state of the world and to have a mechanism that reports on errors in these guesses: the animal is surprised, and learns, only when its predictions fail. This is essentially the principle behind the most general adaptation rules in psychology (Rescorla & Wagner, 1972; Pearce & Hall, 1980; Mackintosh, 1983) and behind error-correcting learning in engineering (Kalman, 1960; Widrow & Stearns, 1985). Psychologists have established the conditions under which animals can learn to predict future reward and punishment, and some of the neural mechanisms required for this form of predictive learning may have been discovered in both vertebrate and invertebrate brains (Ljungberg et al, 1992; Hammer, 1994). In this paper, we review a computational theory of predictive learning and the experimental data that may be relevant to its implementation in the brain.

Prediction is essentially a computational concept, but this still leaves a wide range of possible psychological and neural mechanisms, and the mechanisms that underlie prediction are less well understood than the behavior itself. Considered from the perspective of a nervous system that must construct and use predictions, such a system would require the following:

i) access to a representation of the events to be predicted;
ii) access to the actual events, so that they can be compared with the predictions;
iii) the capacity to influence the structures responsible for constructing the predictions;
iv) a sufficiently wide broadcast of the resulting error signal, so that predictions can be based on stimuli in different sensory modalities.

Animals are certainly capable of predicting important events on the basis of the sensory information they receive and of directing their actions according to those predictions (see Dickinson, 1980; Mackintosh, 1983; Gallistel, 1990 and Gluck and Thompson, 1987 for reviews). Although the behavioral conclusions are well established, the central feature of the account given here is a computational theory that ties these predictions to a particular error signal in the brain; we consider below the anatomical and physiological requirements for such a signal.
*Howard Hughes Medical Institute, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, and Department of Biology, University of California, San Diego, La Jolla, CA 92093.
†Department of Brain and Cognitive Science, MIT, Cambridge MA 02139.
‡Division of Neuroscience, Baylor College of Medicine, Houston TX 77030.
These requirements are met by a number of anatomical systems that are a common motif in vertebrate brains: sets of neurons with diffusely projecting axons that deliver neuromodulators to widespread target regions. The released neuromodulators can influence, directly or indirectly, the effectiveness and plasticity of synapses in their targets. Invertebrates have analogous sets of neurons with extensive axonal arborizations (Hawkins and Kandel, 1984; Greenough and Bailey, 1988; Hammer, 1994). These systems are thought to report on salient events in the world, such as the delivery of food, so that information about the events to be predicted is broadcast to the structures constructing the predictions, and so that stimuli in different sensory modalities can come to contribute to those predictions.

Experimental evidence from both behavioral and physiological work suggests that these systems influence ongoing neural activity as well as synaptic plasticity. They are also known to be required for the normal development of cerebral cortex, for the regulation of critical periods, and for the normal response properties of sensory neurons in various cortical regions. Models of these systems have suggested that they deliver to their widespread targets a scalar quantity that depends on salient sensory events; however, the precise computational role of such a signal is less well understood.
Conditioning

We briefly summarize the mathematical assumptions that underlie the modeling. At time t, the animal receives a scalar reward r(t), and the conditioned stimuli (CSs) that it sees are represented by the activity x(t) = {x_i(t)}, i ∈ {1, N}, of a population of neurons that can be considered to be in cerebral cortex. The approach taken here, called temporal difference (TD) learning (Sutton & Barto, 1987; Sutton, 1988), assumes that the computational task facing the animal is to use the CSs to predict the discounted sum of future rewards:

    V(t) = Σ_{u=t}^{∞} γ^{u−t} r(u)                                (1)

where 0 < γ < 1 is a discount factor expressing the fact that a reward delivered in the distant future may be worth a significant fraction less than the same reward delivered now.

The next assumption is key: the environment is taken to have a Markov property, so that the rewards and the consequences of the current stimulus state do not depend on past states. In this case the predictions V(x(t)), expressed as a function of the current state x(t), must satisfy a consistency condition relating the value of one state to the mean reward and the value of the next:

    V(x(t)) = r(t) + γ V(x(t+1))                                   (2)

or equivalently,

    δ(t) = r(t) + γ V(x(t+1)) − V(x(t)) = 0                        (3)

The quantity δ(t) is the TD error; it measures the degree to which the consistency condition fails to hold for the current estimates.
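As a concrete illustration of equations 1-3 (a sketch of our own, not code from the paper), the discounted return and the TD error can be computed directly; the function and variable names are ours:

    def discounted_return(rewards, gamma):
        """Equation 1 evaluated at t = 0: V(0) = sum_u gamma**u * r(u)."""
        return sum((gamma ** u) * r for u, r in enumerate(rewards))

    def td_error(r_t, v_t, v_next, gamma):
        """Equation 3: delta(t) = r(t) + gamma * V(x(t+1)) - V(x(t))."""
        return r_t + gamma * v_next - v_t

    # The consistency condition (2) holds for the true returns,
    # so the TD error computed from them is zero:
    rewards = [0.0, 0.0, 1.0]   # a single reward, two steps in the future
    gamma = 0.9
    v0 = discounted_return(rewards, gamma)
    v1 = discounted_return(rewards[1:], gamma)
    assert abs(td_error(rewards[0], v0, v1, gamma)) < 1e-12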
In the simplest form of the model, the prediction is represented as a weighted sum of the activities representing the stimuli,

    V(x(t)) = Σ_i x_i(t) w_i(t)

and the weights {w_i(t)} are adjusted to make the error δ(t) in equation 3 as close to zero as possible. Sutton (1988) and Dayan & Sejnowski (1994) have shown conditions under which the weights w(t) converge so that V(x(t)) comes to match the true discounted sum of future rewards. The learning rule suggested by this analysis updates each weight according to the correlation between its presynaptic activity and the error:

    Δw_i(t) = α_i(t) x_i(t) δ(t)                                   (4)

where α_i(t) is a learning rate.

For conditioning experiments in which the time within a trial can be ignored, equation 4 reduces to the delta rule of engineering (Widrow & Stearns, 1985) and to the Rescorla-Wagner (1972) model, which accounts for a significant amount of classical conditioning data, such as the conditioning of a nictitating membrane (eyeblink) response to stimuli that predict a puff of air. Equation 4 is computationally more powerful than these static models because its predictions concern the discounted sum of all future rewards rather than just the reward on the current trial, so stimuli can come to predict events that happen well before the rewards themselves are delivered.

Equation 4 is also essentially a Hebbian rule. Hebb (1949) proposed that synaptic strength should change according to the correlation between presynaptic and postsynaptic activity; in equation 4, the postsynaptic activity term is replaced by the diffusely broadcast error δ(t), which multiplies the presynaptic activity x_i(t) and gates plasticity like a "print now" command (Rauschecker, 1991). The exact conditions under which real synapses are updated, and whether the relevant signals act on the presynaptic or the postsynaptic side of the membrane, are still the subject of substantial debate.
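A minimal sketch of this rule (our own illustration; the uniform learning rate alpha stands in for the α_i(t) of the text) combines the linear prediction with equations 3 and 4:

    import numpy as np

    def predict(x, w):
        """Linear prediction V(x(t)) = sum_i x_i(t) * w_i(t)."""
        return float(np.dot(x, w))

    def td_update(w, x_t, x_next, r_t, gamma=0.9, alpha=0.1):
        """Apply equations 3 and 4 for one time step, updating w in place."""
        delta = r_t + gamma * predict(x_next, w) - predict(x_t, w)  # equation 3
        w += alpha * x_t * delta   # equation 4: presynaptic activity times error
        return delta

Note that the update touches only those weights whose presynaptic activity x_i(t) is nonzero, which is what makes the rule Hebbian in character.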
We suggest that the dopaminergic neurons of the ventral tegmental area (VTA) carry just such a signal: their output reports the prediction error δ(t) of equation 3 (Montague et al, submitted; Quartz et al, 1992). The axons of these neurons project diffusely to widespread target areas, including cerebral cortex and the ventral striatum. The latter brain region is known to be an important center for reward learning and is involved in the addictive phenomena of drug self-administration and electrical self-stimulation.

Neuroscientists have recently reported properties of these neurons that are consistent with this hypothesis. Consider a behavioral task in which some sensory stimulus, such as a light, consistently predicts the delivery of reward. In a naive animal, the dopaminergic cells in the VTA tend to fire in response to the delivery of the reward. In an animal that has learned the task, however, the same cells fire in response to the onset of the predictive sensory cue and respond only weakly to the delivery of the reward itself (Ljungberg et al, 1992; Schultz et al, 1993). This is just what would be expected of a system reporting the error δ(t): once the reward is fully predicted, its delivery generates no error, whereas the earlier, unpredicted onset of the predictive stimulus does.
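This shift of the error response from the time of the reward to the time of the cue falls out of equations 3 and 4. The simulation below is our own construction: it assumes a tapped-delay-line stimulus representation (one unit per time step since cue onset), a conventional choice on our part rather than anything specified in the text:

    import numpy as np

    T = 12                        # time steps per trial
    CUE, REWARD = 2, 8            # cue onset and reward delivery times
    gamma, alpha = 0.98, 0.3

    def stimulus(t):
        """Tapped delay line started by the cue; before cue onset there is
        no stimulus, so nothing is available to predict the cue itself."""
        x = np.zeros(T)
        if t >= CUE:
            x[t - CUE] = 1.0
        return x

    def run_trial(w):
        """One trial of equations 3 and 4; returns the error at each step."""
        deltas = np.zeros(T)
        for t in range(T - 1):
            r = 1.0 if t == REWARD else 0.0
            delta = r + gamma * np.dot(stimulus(t + 1), w) - np.dot(stimulus(t), w)
            w += alpha * stimulus(t) * delta
            deltas[t] = delta
        return deltas

    w = np.zeros(T)
    first = run_trial(w)
    for _ in range(300):
        last = run_trial(w)
    print(first.round(2))   # naive: the error peaks at the reward (t = 8)
    print(last.round(2))    # trained: the error peaks at the step where the cue appears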
So far, the system has only been required to learn to predict the delivery of future rewards on the basis of external stimuli; the rewards arrive regardless of what the animal does. The same capacity for learning predictions over time can, however, also be used to select actions.

Instrumental Conditioning
In instrumental conditioning, the rewards an animal receives are contingent on its choice of actions. In tasks such as mazes, the actions available to the animal (characterized, say, as coming from a fixed set A) and its choices at successive choice points can affect its rewards in complicated ways. Worse, the rewards may be provided only at the end of a whole sequence of actions, so the animal faces the problem of working out how credit for the rewards should be assigned to the individual actions that produced them. This is the temporal credit assignment problem, which comes from the distance in time between making an action and seeing its consequence in terms of getting to the goal. In general, a mapping that specifies which action to take in each state is called a policy, of which a stimulus-response connection is the simplest example. Markov decision problems of this kind describe a wide variety of instrumental tasks (Barto et al, 1989), and the techniques of dynamic programming (Bellman, 1957) provide a general theoretical framework for their solution.

A key concept in dynamic programming is V^π(x), the expected discounted sum of future rewards when the animal starts at state x and chooses actions according to a fixed policy π. The method of policy iteration (Howard, 1960) uses this evaluation of states to improve the policy: if choosing a different action at a state, and following the same policy thereafter, would be expected to yield a higher value, then the policy should be changed at that state.

Barto et al (1983) developed a simple version of policy iteration in which each action a ∈ A is associated with a set of adjustable parameters {w_i^a(t)}. At time t, the system calculates a 'value' for each available action as:

    v^a(t) = Σ_{i=1}^{N} w_i^a(t) x_i(t) + η^a(t)                  (5)

where the η^a(t) are random noise. The action with the largest value is selected:

    a* = argmax_a v^a(t)                                           (6)

The noise makes the policy stochastic and ensures that each action is tried often enough in each state that good actions can be identified. In this formulation, the error δ(t) in equation 3 is used to criticize the choice of action, indicating whether the one selected was better or worse than the average; the parameters of the selected action are updated as:

    Δw_i^{a*}(t) = β_i(t) x_i(t) δ(t)                              (7)

where β_i(t) is another learning rate, and the parameters of the non-selected actions a ≠ a* are left fixed. As learning proceeds, this scheme tends to improve the policy, since actions that lead to states with higher values than expected become more likely to be selected in the states where they were tried.

The key attraction of this form of policy iteration is that the signal δ(t), which we postulated the VTA cells to be reporting, has two roles: training the prediction parameters w(t), and training the action parameters {w_i^a(t)} to solve the temporal credit assignment problem. It turns out that temporal difference algorithms learn exactly the predictions of proximities that are required for dynamic programming (Bellman, 1957). In fact, the same error signal that these algorithms use to learn how far states are away from the goal can also be used to criticize the choice of actions (Barto, Sutton & Anderson, 1983; Barto, Sutton & Watkins, 1989). In these models, therefore, classical and instrumental conditioning are very closely related: if the dopaminergic projection to the ventral striatum carries the error δ(t), the same signal that trains predictions of reward could also influence the selection of actions.
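A compact sketch of equations 5-7 follows (our own; the Gaussian form of the noise and the particular learning rates are assumptions, since the text does not specify them):

    import numpy as np

    rng = np.random.default_rng(0)

    def select_action(W, x, noise_scale=0.1):
        """Equations 5 and 6: v^a(t) = sum_i w^a_i(t) x_i(t) + eta^a(t);
        the action with the largest noisy value is chosen."""
        eta = noise_scale * rng.standard_normal(W.shape[0])   # exploration noise
        v = W @ x + eta                                       # equation 5
        return int(np.argmax(v))                              # equation 6

    def update_actor(W, a_star, x, delta, beta=0.1):
        """Equation 7: only the selected action's parameters change,
        driven by the same error delta(t) that trains the predictions."""
        W[a_star] += beta * x * delta
        return W

    # Hypothetical usage: 3 actions, 4-dimensional stimulus representation.
    W = np.zeros((3, 4))
    x = np.array([1.0, 0.0, 0.0, 0.0])
    a_star = select_action(W, x)
    delta = 0.5               # in the full model this comes from equation 3
    update_actor(W, a_star, x, delta)

In use, delta would be supplied by the critic of equation 3, so that a single broadcast error trains both the predictions and the policy.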
Conclusions

In summary, a single diffusely broadcast prediction error of the kind described here can support both the learning of predictions of future reward and the selection of actions that improve future rewards. There are many interesting open questions yet to be addressed within this framework: How are the representations of sensory stimuli formed by neural populations? How do the correlations uncovered by this kind of learning come to influence decisions? How is time represented in the brain, and how do predictions become causal? Certain aspects of cortical representations and decision-making are not yet understood, and spurious correlations in complicated situations may mislead the learning (Montague et al, submitted). The advantage of having an overall theoretical framework like the one outlined here is that it is grounded in behavioral and physiological data in ways that are subject to experimental test, so that future experiments can reject parts of the model. The prospect of a rigorous mathematical framework for animal learning will allow sharper questions to be asked, and perhaps definitive answers to be found.
References

Barto, AG, Sutton, RS and Anderson, CW (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.

Barto, AG, Sutton, RS and Watkins, CJCH (1989). Learning and sequential decision making. Technical Report 89-95, Computer and Information Science, University of Massachusetts, Amherst, MA.

Bellman, R (1957). Dynamic Programming. Princeton: Princeton University Press.

Dayan, P (1994). Computational modelling. Current Opinion in Neurobiology, 4, 212-217.

Dayan, P and Sejnowski, TJ (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295-301.

Dickinson, A (1980). Contemporary Animal Learning Theory. Cambridge, England: Cambridge University Press.

Gallistel, CR (1990). The Organization of Learning. Cambridge, MA: MIT Press.

Gluck, MA and Thompson, RF (1987). Modeling the neural substrates of associative learning and memory: a computational approach. Psychological Review, 94, 176-191.

Greenough, WT and Bailey, CH (1988). The anatomy of a memory: convergence of results across a diversity of tests. Trends in Neurosciences, 11, 142-147.

Hammer, M (1994). An identified neuron mediates the unconditioned stimulus in associative olfactory learning in honeybees. Nature, 366, 59-63.

Hawkins, RD and Kandel, ER (1984). Is there a cell-biological alphabet for simple forms of learning? Psychological Review, 91(3), 375-391.

Hebb, DO (1949). The Organization of Behavior. New York: Wiley.

Kalman, RE (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, Transactions of the ASME, Series D, 82(1), 35-45.

Ljungberg, T, Apicella, P and Schultz, W (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology, 67(1), 145-163.

Mackintosh, NJ (1983). Conditioning and Associative Learning. Oxford, UK: Oxford University Press.

Montague, PR, Dayan, P and Sejnowski, TJ (submitted). A framework for mesolimbic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience.

Pearce, JM and Hall, G (1980). A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87, 532-552.

Quartz, S, Dayan, P, Montague, PR and Sejnowski, TJ (1992). Expectation learning in the brain using diffuse ascending connections. Society for Neuroscience Abstracts, 18, 1210.

Rauschecker, JP (1991). Mechanisms of visual plasticity: Hebb synapses, NMDA receptors, and beyond. Physiological Reviews, 71(2), 587-615.

Rescorla, RA and Wagner, AR (1972). A theory of Pavlovian conditioning: the effectiveness of reinforcement and non-reinforcement. In AH Black and WF Prokasy, editors, Classical Conditioning II: Current Research and Theory, pp 64-99. New York, NY: Appleton-Century-Crofts.

Schultz, W, Apicella, P and Ljungberg, T (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13(3), 900-913.

Sutton, RS (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

Sutton, RS and Barto, AG (1981). Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review, 88, 135-170.

Sutton, RS and Barto, AG (1987). A temporal-difference model of classical conditioning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA.

Sutton, RS and Barto, AG (1990). Time-derivative models of Pavlovian reinforcement. In M Gabriel and J Moore, editors, Learning and Computational Neuroscience. Cambridge, MA: MIT Press.

Widrow, B and Stearns, SD (1985). Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.