Shreesh P. Mysore
08/24/2003
Why Let Networks Grow?
Thomas R. Shultz
Department of Psychology and School of Computer Science
McGill University
1205 Penfield Ave.
Montreal, Quebec,
Canada H3A 1B1
E-mail: thomas.shultz@mcgill.ca
Shreesh P. Mysore
Control and Dynamical Systems Program
California Institute of Technology
1200 E. California Blvd.
Pasadena, CA 91125
U.S.A.
E-mail: shreesh@cds.caltech.edu
Steven R. Quartz
Division of the Humanities and Social Sciences, and
Computation and Neural Systems Program
California Institute of Technology
1200 E. California Blvd.
Pasadena, CA 91125
U.S.A.
E-mail: steve@hss.caltech.edu
1. Introduction
Beyond the truism that to develop is to change, the process of developmental change itself has
been relatively neglected in developmental science. Methodologically, behavioral studies have
traditionally utilized cross-sectional studies, which have revealed a great deal about how certain
behavioral and cognitive abilities differ at various developmental times, but tend to reveal less
about the developmental processes that operate to transform those cognitive and behavioral
abilities. A variety of reasons account for this lack of explanations of developmental
change, ranging from the methodological challenges their study entails to principled,
learning-theoretic arguments against the very existence of such processes (Macnamara, 1982;
Pinker, 1984).
Among the most influential arguments against developmental change was Chomsky’s (1980)
instantaneity argument, which holds that there is nothing from a learning standpoint that
developmental change adds to an account of development – if one starts with the null hypothesis
that children are qualitatively similar learners to adults and suppose that development is
instantaneous (in the sense that there are no time dependencies in development), then departing
from this by supposing children are initially more restricted learners only results in weakening
their acquisition properties. The upshot is that less is less, and developmental change will only
add to an account of development by reducing acquisition properties. Thus, as a sort of charity
principle, one ought to at least start with the hypothesis that children do not differ substantially in
their acquisition capacity from adults, since explaining development is hard enough without
having to do so with greatly diminished learning.
These sorts of arguments lead to the widespread assumption that there is little intrinsic
theoretical insight to be gained from studying processes of developmental change. Indeed, even
with the advent of connectionist models and their adoption in developmental science, most of the
models that were used were qualitatively similar to models of adult learning. That is, most
models used a fixed feedforward architecture in which the main free parameter was connection
strengths. Therefore, such models implicitly adopted Chomsky’s argument and began with the
assumption that the immature and mature state are qualitatively identical.
Yet, how well-founded is this assumption, and is it the case that the main free parameter of
learning is akin to the connection strengths of a fixed architecture? Viewed from the perspective
of developmental cognitive neuroscience, it is now well established that the structural features of
neural circuits undergo substantial alterations throughout development (e.g., Quartz and
Sejnowski, 1997). Do such changes add anything of interest to the explanation of development
and developmental change? If so, does a better understanding of the processes of developmental
change undermine Chomsky’s argument, and thereby demonstrate that an understanding of
developmental change is crucial for understanding the nature of cognitive development?
In this chapter, we ask the question, why let networks grow? We begin by reviewing the wealth
of accumulating data from neuroscience that network growth appears to be a much more central
feature of learning than is traditionally assumed. Indeed, it appears that the assumption that the
main modifiable parameter underlying learning is changes in connection strength in an otherwise
fixed network may be in need of revision. In section 2, we revisit this question with our eye
toward the evidence for neurogenesis – the postnatal birth and functional integration of new
neurons – as an important component of learning throughout the lifespan. In section 3, we turn to
consider the computational implications of learning-directed growth. There, we will consider
whether less really is less, or whether by breaking down the traditional distinction between
intrinsic maturation and cognitive processes of learning the learning mechanism underlying
cognitive development thereby becomes substantially different, and more powerful, in its
learning properties than a fixed architecture. In section 4, we present a case in which activity-dependent network growth underlies important acquisition properties in a model system, auditory
localization in the barn owl. Our motivation for presenting this model system is that this non-human system is very well characterized in terms of the biological underpinnings, which provide
important clues into the likely biological mechanisms underlying human cognitive development,
and which give rise to general computational principles of activity-dependent network growth. In
section 5, we present evidence from modeling cognitive development in children. There, we
explore a number of computational simulations utilizing the Cascade Correlation (CC) algorithm.
In contrast to the back-propagation algorithm, CC starts with a minimal network and adds new units
as a function of learning. Thus, CC can be viewed at a high level as a rule for neurogenesis, and
thus is a means of exploring the computational and learning properties of such developmental
processes. In section 6, we present the main conclusions from this work and point to areas of
future research and open research questions.
2. Experience-dependent neurogenesis in adult mammalian brains
Two broad categories of neural plasticity exist, based on the nature of expression and encoding
of change in the nervous system. They are "synaptic efficacy change" and "structural
plasticity". Whereas synaptic efficacy change has been the predominantly studied form of
plasticity (Martin and Morris, 2002), neuroscience research also contains abundant evidence of
structural plasticity - dendritic spine morphology change in response to stimuli, synaptogenesis,
and/or reorganization of neural circuits (see (Lendvai et al., 2000; Zito and Svoboda, 2002) for
experience-dependent spine plasticity and a review of synaptogenesis respectively). The one
form of adult structural plasticity that has not been a part of mainstream neuroscience until
recently is neurogenesis, i.e., the birth or generation of new neurons. Adult neurogenesis, as the
name suggests, is the production of new neurons postnatally. We briefly discuss the state of
current day research in this area.
2.1 Background and status of current research in neurogenesis
As early as the turn of the twentieth century, researchers such as Koelliker and His had described
the development of the central nervous system in mammals in great detail and had found that the
complex architecture of the brain appears to remain fixed from soon after birth (see (Gross,
2000) for references). Hence, the idea that neurons may be added continually was not considered
seriously. Cajal and others described the different phases of neuronal development and, as neither
mitotic figures nor the phases of development were seen in adult brains, they suggested that
neurogenesis stops soon after birth. Although there were sporadic reports raising the possibility
of mammalian adult neurogenesis, the predominantly held view was opposed to this idea. In the
1950s, the 3H-dT autoradiography[1] technique was introduced to neuroscience research and in
1961, Smart (Smart, 1961) used it, for the first time, to study neurogenesis in the post-natal brain
(3-day old mice). Subsequently, Altman (Altman, 1962; Altman and Das, 1965) published a
series of papers in the 1960s reporting autoradiographic evidence for new neurons in the
neocortex, olfactory bulb (OB), and dentate gyrus (DG) of the young and adult rat. He also
reported new neurons in the neocortex of the adult cat. Unambiguous evidence testifying to the
neuronal nature (as opposed to glial) of these proliferating cells was not available. Starting from

[1] 3H-thymidine (3H-dT) contains the radioactive isotope Tritium of Hydrogen. Neurogenesis is, by definition, the
creation of new cells and this requires DNA synthesis. 3H-dT is readily incorporated into a cell during the S-phase
(synthesis phase) of its cell cycle. In essence, 3H-dT serves as a marker of DNA synthesis and its presence is
visualized by capturing its radioactive emissions on a photographic emulsion (autoradiography), a process similar to
taking a photograph.
1977 through 1985, Kaplan (Kaplan, 1984; Kaplan, 1985) published a series of articles that used
electron microscopy along with 3H-dT labeling to confirm the results Altman et al. had reported 15
years earlier. He was able to verify the neuronal nature of the proliferating cells in the DG, OB,
and cerebral cortex of adult rats. In parallel, from 1983 through 1985, Nottebohm et al.
(Goldman and Nottebohm, 1983; Nottebohm, 1985; Paton and Nottebohm, 1984) published
results showing (using EM, 3H-dT autoradiography and electrophysiology) that neurogenesis
occurs in adult songbirds in areas that correspond to the primate cerebral cortex and
hippocampus. Simultaneously, in 1985, Rakic (Rakic, 1985)
published an authoritative 3H-dT study on adult rhesus monkeys which concluded contrarily that
all neurons of the adult rhesus are generated during prenatal and early postnatal life. Further,
based on subsequent studies with electron microscopy and an immunocytochemical marker for
astroglia (and not neurons), Eckenhoff and Rakic (Eckenhoff and Rakic, 1988) suggested that a
stable neuronal population in adult brains may be a biological necessity in an organism whose
survival relies on learned behavior acquired over time, and the traditional view that adult
neurogenesis does not occur continued to be held.
Starting from 1988, it was finally established beyond doubt that neurogenesis occurs in the
dentate gyrus of the adult rat. Stanfield and Trice (Stanfield and Trice, 1988) showed that new
cells in the rat DG extend axons into the mossy fiber pathway (DG to CA3 projection). More
studies that followed the fate of a cell from cell division to differentiation in its final site of
residence (Kornack and Rakic, 1999) confirmed adult neurogenesis in the DG and OB of the
macaque monkey. While DG neurogenesis is now widely accepted in both rodents and primates,
neocortical neurogenesis is still debated. In primates, cortical neurogenesis has stirred a
controversy in the past few years. Gould et al (Gould et al., 1999) recently reported that the
macaque monkey neocortex (prefrontal, parietal and temporal lobes) acquires large numbers of
new neurons throughout adulthood. Further, Shankle et al. (Shankle et al., 1998) reported
independently that large additions that alternate with comparable losses of neurons occur in the
human neocortex in the first years of postnatal life. The potentially broad biomedical
implications and impact on the development of replacement therapy for neurological disorders
have added to the attention these reports have received. In their study, Gould et al (Gould et al.,
2001) used the BrdU immunohistochemical labeling[2] technique to detect cell proliferation. Other
researchers like Rakic (Nowakowski and Hayes, 2000; Rakic, 2002) are skeptical about the
results because of the high doses of BrdU, the argued non-specificity of cell type markers used,
potential misinterpretation of endothelial cells as migrating neurons, the ambiguity in the identity
of migrating bipolar shaped cells (neurons, immature oligodendrocytes and immature astrocytes
can all adopt bipolar shapes), the large numbers of neurons in the migratory stream
(approximately 2500 new neurons per day (Gould et al., 2001)), and false positives due to optical
problems (potential superposition of small BrdU labeled satellite glial cells on unlabeled
neuronal soma due to the use of an insufficient number of optical planes in confocal
microscopy).
[2] BrdU, Bromodeoxyuridine, is also incorporated into a cell during its S-phase and hence is also a marker of DNA
synthesis. However, unlike 3H-dT, its incorporation is not stoichiometric and this makes quantification more
difficult. Low doses of BrdU are advocated by researchers (Rakic 2002) (to prevent artifacts that may arise due to
the signal amplification methods used), along with the use of independent means to detect cell proliferation.
In summary, incontrovertible evidence now exists showing that adult neurogenesis occurs in the
DG and OB of mammals (including primates) (Gage, 2002). The research community is at
present not unanimous in its view of neocortical neurogenesis in mammals, especially primates.
We now turn to the effects of psychological and environmental factors on DG neurogenesis.
2.2 Modulation of hippocampal neurogenesis
Several studies (see (Gross, 2000)) show that the lifespan of a newly generated hippocampal
neuron is positively affected by living in an enriched environment (in the wild and in laboratory
conditions): in birds (Barnea and Nottebohm, 1994; Nottebohm et al., 1994), in
mice (Kempermann et al., 1997; Kempermann et al., 1998), and in rats (Nilsson et al., 1999).
Oestrogen is known to stimulate the production of new immature neurons in the DG.
Interestingly, just running in a wheel enhances the number of BrdU labeled cells in the DG,
although this could be due to several factors (enhanced environmental stimulation, reduction in
stress, or an effect of exercise such as increased blood flow).
Stressful experiences have been shown to decrease the numbers of new neurons in the DG, by
downregulating cell proliferation. Studies suggest that this effect is caused by alterations in the
hypothalamic pituitary adrenal axis (HPA). The influence of stress has been introduced through
exposure to predator odor in adult rats and in 1 week old rat pups, social stress in tree shrews and
marmosets, and developmental stress in rats (with effects that persist into adulthood). Adrenal
steroids probably underlie this effect as stress increases adrenal steroid levels and glucocorticoids
decrease the rate of neurogenesis. Further, it has been shown that the decreased rate of
neurogenesis associated with ageing is due to an increase in glucocorticoid levels (see (Gould
and Gross, 2002) for a summary of these effects and for references).
Neurogenesis has been correlated with hippocampally dependent learning experiences. For
instance, trace eye blink conditioning and spatial learning (both hippocampally dependent tasks)
in animals lead to an increase in the number of neurons – this happens through an extension in
the survival of neurons rather than an increase in their production. There appears to be a critical
period following cell production such that learning occurring in this period increases neuronal
lifespan.
All the above factors that affect adult neurogenesis in the dentate, along with the evidence of
continual neuronal addition in several areas of the vertebrate brain, are interpreted as indicating
that adult neurogenesis is functionally significant and not a mere relic left over from evolution.
2.3 Function of neurogenesis
An important question in the context of adult neurogenesis is whether it subserves a valid
function or is simply a vestige of development (Gross, 2000). We will first look at some
evidence that suggests that it has a useful role to play, and then look at what exactly that role
may be (Kempermann, 2002; Nottebohm, 2002) and how it is achieved.
New hippocampal neurons are suspected of having a role in learning and memory. Several pieces
of evidence have been suggested (Gross, 2000) in support of this argument. New neurons are
added to structures that are important for learning and memory (like the hippocampus, lateral
prefrontal cortex, inferior temporal cortex, and posterior parietal cortex). Several conditions that
are detrimental to the proliferation of granule cells in the dentate (like stress, increased levels of
glucocorticoids, etc) are also implicated in lower performance in hippocampally dependent
learning tasks and this suggests a causal link between the two. Several conditions that increase
proliferation (like enriched environments, increased oestrogen levels, wheel running, etc) also
enhance performance. Increases in social complexity have been found to enhance the survival of
new neurons in birds (Lipkind et al., 2002). However, it is to be noted that in the cases of both
positive and negative modulators of neurogenesis, other factors such as changes in the dendritic
structure of synapses can contribute to changes in learning performance. Finally, adult generated
neurons may share some properties with embryonic and early postnatal neurons in their ability
to extend axons, form new connections more readily, and make more synapses. Granule
cells in the dentate show LTP of greater duration in younger rats compared with older ones, and
adult generated granule cells appear to have a lower threshold for LTP. More speculatively,
Gross (Gross, 2000) suggests that the transient nature of many adult neurons could mediate the
conversion of short term memory (encoded in their activity patterns) to long term memory
(involving a transfer of these activity patterns to older circuits followed by the death of these new
cells). A curious fact is that the lifetimes of the new adult-generated cells in the rat and macaque
dentate are three and nine weeks respectively, which correspond to the approximate time of
hippocampal storage in these species. Thus, learning and memory may involve the construction
of entirely new circuits with new or unused elements (structural plasticity through neurogenesis
(discussed here), spine motility (Lendvai et al., 2000), and synaptogenesis (Zito and Svoboda,
2002)), as well as the modulation of existing circuits (synaptic efficacy change, (Martin and
Morris, 2002)).
3. Computational importance of learning-directed growth
Formal learning theory (FLT) deals with the ability of an algorithm or learner to arrive at a target
concept (or learn) based on examples of the concept. Three key aspects are that the algorithm
searches through a predefined space of candidate hypotheses (concepts), it is expected to learn
the concept exactly, and no restrictions are placed on the actual time taken by the learner to
arrive at the target concept. FLT is therefore interested in ``exact" and ``in-principle"
learnability. This process is also referred to as inductive inference (see (Shavlik, 1990)),
and the expectation is that generalization is achieved. Generalization is roughly a measure of the
performance of the learned concept on novel examples (not seen during learning) generated by
the target concept; if the target concept is indeed learned, then the performance should be
perfect. The classical formulation of inductive inference is in language learning (Gold, 1967). A
quantification of generalization, or equivalently, of generalization error, is achieved statistically
by defining it to be the sum of two components - the bias squared and the variance (Geman et al.,
1992). In turn, the bias can be defined roughly as the distance of the best hypothesis in the space
from the target concept, whereas the variance roughly measures how much the learned hypothesis
varies with the particular training sample. The traditional bias-variance tradeoff is as follows. Restricting the hypothesis space
generally tends to reduce the variance, while increasing the likelihood that the bias is very large
(the target concept may be far away from the few hypotheses in the space). On the other hand,
expanding the hypothesis space increases the variance while increasing the likelihood that the
target hypothesis is included in the space (low bias). Clearly, while the objective is to minimize
generalization error, its two components compete, where a highly biased learner is useful only if
meticulously tailored to the problem at hand, while a more general learner requires larger
resources in terms of training data sets. In a sense, this tradeoff is a consequence of the three
key aspects of formal learning theory discussed above.
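The decomposition referred to above (Geman et al., 1992) can be stated explicitly. In its standard statistical form (with the irreducible noise term omitted for a deterministic target), the expected squared error at an input x, with the expectation taken over training sets D, splits as:

```latex
\mathbb{E}_{D}\!\left[\big(\hat{h}_D(x) - f(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_{D}[\hat{h}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_{D}\!\left[\big(\hat{h}_D(x) - \mathbb{E}_{D}[\hat{h}_D(x)]\big)^2\right]}_{\text{variance}}
```

Here f is the target concept and \(\hat{h}_D\) is the hypothesis produced by the learner from training set D; a rich hypothesis space lets \(\hat{h}_D\) swing widely with D (high variance), while a restricted one pins \(\mathbb{E}_{D}[\hat{h}_D]\) far from f (high bias).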
PAC learning (or probably-approximately-correct learning), proposed by Valiant (Valiant, 1984)
and subsequently accepted as the standard model of inductive inference in machine learning
(Dietterich, 1990), addressed two of the three key aspects of formal learning theory. In this model, the
impractical assumption of infinite time horizons was remedied and the stringent restriction of
exact learning was relaxed. In short, PAC learning deals not with in-principle learnability, but
with learnability in polynomial time, and not with exact learning, but with approximate learning.
Formally, a concept is PAC-learnable if the learner arrives, with probability at least 1 − δ, at a hypothesis
that classifies examples correctly with probability at least 1 − ε. Nevertheless, the hypothesis space
is still fixed a priori. A fixed hypothesis space yields such problematic theoretical issues as
Fodor's paradox (Fodor, 1980), which states in essence that nothing can be learned
that is not already known; hence nothing is really learned. The idea here is that no hypothesis
that is more complex than the ones in the given hypothesis space can be evaluated and hence
achieved. Therefore, all concepts need to be available in the hypothesis space before the search
begins.
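For a finite hypothesis space, a standard textbook PAC result (not derived in this chapter) makes the (ε, δ) guarantee concrete: a consistent learner needs only m ≥ (1/ε)(ln|H| + ln(1/δ)) examples, so the required data grows only logarithmically with the size of the hypothesis space. A minimal sketch:

```python
import math

def pac_sample_bound(hypothesis_space_size: int, epsilon: float, delta: float) -> int:
    """Sufficient number of training examples for a consistent learner over a
    finite hypothesis space H to output, with probability at least 1 - delta,
    a hypothesis with error at most epsilon.
    Standard bound: m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    m = (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon
    return math.ceil(m)

# Doubling |H| adds only ln(2)/epsilon examples to the requirement:
m1 = pac_sample_bound(1024, epsilon=0.05, delta=0.01)   # 231
m2 = pac_sample_bound(2048, epsilon=0.05, delta=0.01)   # 245
```

This logarithmic dependence is why expanding the hypothesis space is not catastrophic for sample complexity, a point that matters for the constructivist argument below.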
Constructive learning (Piaget, 1970; Piaget, 1980; Quartz, 1993; Quartz and Sejnowski, 1997)
addresses this issue of a fixed hypothesis space and hence issues like Fodor's paradox
satisfactorily. The central idea of Piagetian constructivism is that of construction of more
complex hypotheses from simpler ones, and a building up of representation. Unfortunately,
Piaget did not specify the mechanisms by which such a process could be neurally implemented.
This issue is dealt with more formally in (Quartz, 1993; Quartz and Sejnowski, 1997). Constructivist learning
deals directly with the issue of increasing hypothesis complexity as learning progresses.
Computational models of constructive learning that implement upward changes in complexity
exist (Shultz et al., 1995; Westermann, 2000). Activity dependent structural plasticity is viewed
as the mechanism that implements constructivist learning. Baum (Baum, 1988) showed that
networks with the ability to add structure during the process of learning are capable of learning,
in polynomial time, any learning problem that can be solved in polynomial time by any
algorithm whatsoever (Quartz and Sejnowski, 1997), thus conferring a computational
universality to this paradigm. Interestingly, the bias-variance tradeoff can be broken by
constructivist learning, by adding hypotheses incrementally to the space in such a way as to keep
variance low, while reducing the bias. Further, in the context of neurobiology, the burden of
innate knowledge on the network is greatly relaxed. Given a basic set of primitives (in the form
of mechanisms, physical substrate), the construction of the network and hence representations
occurs under the guidance of activity and within genetic constraints.
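The incremental-growth idea can be illustrated with a toy sketch (our illustration, not any of the cited algorithms): the learner starts with a minimal hypothesis space and adds one basis function ("unit") at a time, only when the current space cannot fit the data, so complexity grows exactly as learning demands it and the variance is kept as low as the task allows.

```python
import numpy as np

def growing_fit(x, y, max_units=10, tol=1e-3):
    """Constructivist-style learner sketch: the hypothesis space is the span of
    an expanding set of basis functions (here, powers of x). We start minimal
    and add one new 'unit' only when the current space cannot account for the
    data, mirroring how constructive algorithms grow capacity during learning."""
    for k in range(1, max_units + 1):
        # Design matrix for the current hypothesis space: [1, x, ..., x^(k-1)]
        phi = np.vander(x, k, increasing=True)
        coeffs, *_ = np.linalg.lstsq(phi, y, rcond=None)
        err = np.mean((phi @ coeffs - y) ** 2)
        if err < tol:
            return k, coeffs, err   # stop growing: current complexity suffices
    return max_units, coeffs, err

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x**2 - x + 0.5            # noise-free quadratic target
k, coeffs, err = growing_fit(x, y)  # stops at k = 3 (constant, linear, quadratic)
```

Constant and linear spaces leave large residual error (high bias); the third unit brings the target into the space, and growth stops, which is the bias-reducing, variance-limiting behavior discussed above.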
4. Structural plasticity and auditory localization in barn owls
An excellent example of experience-driven structural plasticity resulting in representational
change is seen in the auditory localization system (ALS) of barn owls. The ``task" in question
here is the formation of associations between particular values of auditory cues like interaural
time differences (ITD), interaural level differences (ILD) and amplitude spectra with specific
locations in space. Whereas many animals can localize sounds soon after birth (Field et al., 1980)
(indicating the hard-wiring of at least a part of this system (Brainard and Knudsen, 1998;
Knudsen et al., 1991)), experience plays an important role in shaping and modifying the auditory
localization pathway (ALP). The need for experience dependence can be qualitatively deduced
for several reasons (Knudsen, 2002). For instance, there are individual differences in head and
ear sizes and shapes, and in the course of its lifetime, an animal faces changing conditions, both
external, and those imposed on it through aging processes. Hence, the ability of the ALS to adapt
to experience could be very useful for the animal. Indeed, behavioral experiments readily attest
to the existence of such adaptability. These studies (called visual displacement studies) involve
disruption of the auditory-location association by displacement of visual input along the azimuth
through the use of prismatic spectacles (Brainard and Knudsen, 1998). The experimental
evidence from these studies is summarized below.
4.1 Experimental evidence
Young owls (<200 days old) that are fitted with prisms (23° shift) learn in about 5 weeks to
modify their orienting behavior appropriately so that they are able to localize sound sources.
Removal of prisms results in a gradual return to normal orienting behavior. On the other hand,
adult owls (>200 days old) adapt poorly (Knudsen, 1998). Interestingly, when shifted-back owls
(juvenile owls that were initially fitted with prisms, which were later removed) were
subsequently refitted with prisms as adults, they displayed adaptability very close to that
displayed originally as juveniles. Finally, recent experiments (Linkenhoker and Knudsen, 2002)
show that the capacity for behavioral plasticity in adult owls can be significantly increased with
incremental training (where the adults experience prismatic shifts in small increments). Among
the several interesting findings in these studies is that behavioral adaptation is mediated by
structural plasticity.
Figure 1 schematically shows the key brain areas involved in auditory localization plasticity in
owls. Studies showed that when owls are reared with displacing prisms, the ITD tuning curves of
neurons in the OT and the ICX, but not in the ICC, adapt appropriately (Brainard and Knudsen,
1993; Brainard and Knudsen, 1995).
----------------------
Figure 1 goes here
----------------------
This suggested that the site of plasticity is after the ICC in the midbrain pathway. Anterograde
(DeBello et al., 2001) and retrograde (Feldman and Knudsen, 1997) tracer experiments that
studied the topography of the axonal projections from the ICC to the ICX as a function of
experience showed that prism experience expands axonal arborization and causes
synaptogenesis, in a manner that results in adaptive reorganization of the anatomical circuit (see
figure 2). Compared to normal juveniles, the axonal projection fields in normal adults are sparser
and more restricted. Relative to the projection fields in normal adults, those in owls subjected to
prism-rearing are denser and broader, showing a large net increase of axons in the adaptive
portion of the field. The shifted ITD tuning of the ICX neurons matched the tuning of the ICC
neurons from which the new projections were formed. Axonal projections corresponding to
normal behavior remained in the prism-reared owls, thus showing the co-existence of two
different auditory space encoding circuits. Further, in owls (called shifted-back owls) that were
re-exposed to normal visual fields (after having been prism-reared for a while), the new axonal
projections persist along with the normal projections, although the behavioral response belies
their existence (Knudsen, 1998).
----------------------
Figure 2 goes here
----------------------

These results all indicate that structural change between the ICC and ICX is largely responsible
for the experience driven adjustment in the auditory space map. The intriguing presence of two
different encoding circuits in prism-reared owls, but only one overt behavior at any time led to
the hypothesis that anatomical projections were being appropriately functionally inhibited
through GABAA-ergic circuits. Indeed this is the case (Zheng and Knudsen, 1999; Zheng and
Knudsen, 2001).
One major question that remains unanswered relates to the nature of the error signal (or
instructive signal) that guides plasticity. Results from (Hyde and Knudsen, 2001) show that a
visually based topographic template signal is primarily responsible for calibration. As the visual
information arriving from the retina is topographically arranged in the OT, and as a point-to-point feedback projection was found from OT to ICX (Hyde and Knudsen, 2000), it was
suggested that OT provides a visually-based calibration signal to the ICX (Gelfland and Pearson,
1988; Hyde and Knudsen, 2001). Experimental support for this argument was found when a
restricted lesion in the optic tectum eliminated plasticity only in the corresponding portion of the
auditory space map in the ICX. It was not known until very recently (Gutfreund et al., 2002),
whether this topographic signal was a simple visual template, or a processed version of it that
reflected an alignment error. In (Gutfreund et al., 2002) the authors found that the ICX neurons
have well-defined visual receptive fields (vRFs), that the projection of this information from the OT to
the ICX is precise, that the auditory receptive fields (locations corresponding to the ITD tuning)
of the ICX neurons overlapped and covaried with their vRFs, and hence that the instructive
signal is simply a visual topographic template. Further, a presentation of only an auditory
stimulus (that has the appropriate (best tuned) ITD) to an ICX neuron produces a maximal,
40ms-long firing response with a latency of about 10ms. When just a visual stimulus is presented
at the center of the neuron's vRF, the neuron produces a maximal, 40ms-long firing response, but
this time with a latency of approximately 90ms (see figure 3). And when both auditory and
visual stimuli are presented, with the ITD of the auditory stimulus corresponding to the vRF
location, then the ICX neuron shows the auditory response (firing with 10ms latency), but does
not show the visual response (response with 90ms latency). However, if a bimodal stimulus is
presented, with an ITD not equal to that predicted by the vRF location (i.e., if there is a
mismatch), then only the visual response is observed (firing with 90ms latency). Thus the
dynamics of the ICX response (in particular, the presence of a visual response) following a
bimodal stimulus is different in the misaligned case from the normal case and this serves as the
error signal between the spatial representations of the auditory and visual stimuli.
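The latency-based gating just described can be summarized in a small sketch (our illustrative abstraction of the Gutfreund et al. (2002) findings, not the authors' code; the identity map standing in for the vRF-to-best-ITD correspondence is hypothetical): an ICX neuron responds to an aligned bimodal stimulus at the short auditory latency, and the delayed visual response appears only on misalignment, so the presence of the late response is itself the error signal.

```python
def icx_response_latency(auditory_itd, visual_location, preferred_itd_for,
                         tolerance=1.0):
    """Illustrative abstraction of the ICX error signal: returns the latency
    (ms) of the neuron's maximal response to a bimodal stimulus. Aligned input
    yields the fast auditory response (~10 ms) and the visual response is
    suppressed; misaligned input yields only the delayed visual response
    (~90 ms), which flags the spatial mismatch."""
    AUDITORY_LATENCY_MS = 10
    VISUAL_LATENCY_MS = 90
    expected_itd = preferred_itd_for(visual_location)
    if abs(auditory_itd - expected_itd) <= tolerance:
        return AUDITORY_LATENCY_MS   # aligned: auditory response wins
    return VISUAL_LATENCY_MS         # mismatch: late visual response = error

# Identity map as a hypothetical stand-in for the vRF -> best-ITD correspondence:
latency_aligned = icx_response_latency(5.0, 5.0, preferred_itd_for=lambda loc: loc)
latency_shifted = icx_response_latency(5.0, 28.0, preferred_itd_for=lambda loc: loc)
```

A downstream learning rule need only test whether the response arrived at the long latency to know that the auditory map is miscalibrated at that location.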
We now present a computational model of auditory localization plasticity.
----------------------
Figure 3 goes here
----------------------

4.2 Modeling plasticity in auditory localization behavior
We have implemented a simple connectionist model of the auditory localization network in barn
owls using firing-rate coding neurons (the output of each neuron is a nonlinear function of the
weighted sum of the inputs). Past attempts (Pouget et al., 1995; Rosen et al., 1994; Rucci et al.,
1997) have primarily used reward-based learning schemes, about which little is known in the
owl's auditory localization system. We move away from reward-based methods and, further,
implement a template-based teaching signal in our scheme (while (Gelfland and Pearson, 1988)
did not use a reward-based scheme, they implemented learning via the back-propagation
algorithm, which is acknowledged as not being biophysical and hence as needing modification).
The driving assumption in our modeling effort is that the default structures and pathways
already exist, and that the behavior of interest is the plasticity between the ICC and ICX in response
to prismatic shifts in the visual field (see (Kempter et al., 2001) for a plausible spike-time-dependent
Hebbian scheme for the ab initio development of an ITD map in the laminar nucleus
of barn owls). The default pathways are modeled by projections from one layer to the next,
where the layers are the ICC, ICXI, ICXE, OTD, OTM, and OTS. This scheme is summarized in
figure 4. We further make the simplifying assumption that the ITD tuning curves of the neurons
are sharp (delta-function-like distributions). As mentioned before, the observed plasticity occurs between the
neurons in the ICC and the ICX layers – both structural plasticity (axo- and synaptogenesis) and
synaptic efficacy change (ICC→ICXI and ICC→ICXE). In order to implement the activity-dependent plasticity, we have used a learning rule that is inspired by the classical Hebbian rule.

----------------------
Figure 4 goes here
----------------------

The error signal is encoded in the selective and delayed firing of an ICXE
neuron in response to a visual stimulus presented at its visual receptive field, when the auditory
input that is simultaneously presented is mismatched. Under normal conditions (figure 4), when
an audio-visual (AV) stimulus is presented, the green spikes representing the auditory response
appear, but not the faded-red spikes representing the visual response. This is implemented
through adaptation in the ICXE neurons to input stimuli such that if an ICXE neuron fires at
instant t, then it is unable to fire for another 80 ms. Under mismatched conditions (with prism),
an AV stimulus now elicits only the (delayed) visual response (red spike). This is seen in figure
5 in the activity plot of the ICXE neuron.
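The response logic just described can be sketched as a small function. The 10 ms and 90 ms latencies and the 80 ms adaptation window are the values given above, while the function itself is only an illustrative abstraction of the ICXE dynamics, not a piece of the implemented network:

```python
AUD_LATENCY_MS = 10   # latency of the auditory-driven ICX response
VIS_LATENCY_MS = 90   # latency of the visually driven ICX response
ADAPT_MS = 80         # an ICXE neuron that fires cannot fire again for 80 ms

def icxe_spike_times(auditory_matches_vrf, visual_present, auditory_present=True):
    """Return ICXE spike latencies (ms) for one stimulus presentation.

    The fast auditory response adapts the neuron for 80 ms, which
    suppresses the later visual response when the ITD matches the
    visual receptive field; on a mismatch only the delayed visual
    response survives, and that surviving response is the error signal.
    """
    spikes = []
    last_spike = None
    if auditory_present and auditory_matches_vrf:
        spikes.append(AUD_LATENCY_MS)
        last_spike = AUD_LATENCY_MS
    if visual_present:
        if last_spike is None or VIS_LATENCY_MS - last_spike > ADAPT_MS:
            spikes.append(VIS_LATENCY_MS)
    return spikes

# Aligned bimodal stimulus: only the fast auditory response appears.
print(icxe_spike_times(True, True))   # [10]
# Misaligned bimodal stimulus: only the delayed visual response appears.
print(icxe_spike_times(False, True))  # [90]
```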
----------------------
Figure 5 goes here
----------------------

In our model, we detect the error signal by checking for ICX activity 90ms after the presentation
of an AV stimulus. Weight change is then implemented such that weights reduce between ICC
neurons that do not fire and ICXE (and ICXI) neurons that do, soon after. If the error occurs
repeatedly for a sufficiently long period of time (which is taken to imply a systematic error in
encoding information), then new "synapses" are formed, that is, the population of the ICC-ICXI
and ICC-ICXE weight matrices is increased. Once this happens, the learning rule from above
increases the weights between ICC neurons that fire and ICXE (and ICXI) neurons that also fire
soon after. The time course of weight evolution is implemented to be sigmoidal. The results of
this scheme match the experiments, in that the ITD tuning curves of the ICX neurons shift in
the adaptive direction after training. Figures 6 (a) and (b) show the tuning curve of ICXE neuron
#1, before and after training, with a shift corresponding to 20 µs (a shift by one unit, i.e., by 1°).
With this model, we are thus able to capture the essence of the learning exhibited by the barn
owls. Future work will look into adding further details.
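A minimal numerical sketch of this weight-update scheme follows. The depression of weights from silent ICC neurons onto just-fired ICXE neurons, the error-count gate on synaptogenesis, and the sigmoidal time course all follow the description above, but every numerical constant (learning rate, error threshold, sigmoid parameters) is invented for illustration and is not a value from the actual model:

```python
import numpy as np

def sigmoid(t, t_half=50.0, rate=0.1):
    """Sigmoidal time course of weight evolution (illustrative parameters)."""
    return 1.0 / (1.0 + np.exp(-rate * (t - t_half)))

def update_weights(W, icc_fired, icxe_fired, t, error_count,
                   error_threshold=20, lr=0.05):
    """One update of W[i, j], the weight from ICC neuron i to ICXE neuron j."""
    g = sigmoid(t)
    # Depress weights from ICC neurons that did not fire onto ICXE
    # neurons that fired soon after.
    depress = np.outer(~icc_fired, icxe_fired)
    W = W - lr * g * depress * W
    if error_count >= error_threshold:
        # A sufficiently persistent error signal implies a systematic
        # encoding error: new "synapses" are allowed, and weights between
        # co-active ICC and ICXE pairs are potentiated.
        potentiate = np.outer(icc_fired, icxe_fired)
        W = W + lr * g * potentiate
    return W

W = np.ones((2, 2))
icc = np.array([True, False])   # ICC neuron 0 fired, neuron 1 was silent
icxe = np.array([True, True])   # both ICXE neurons fired soon after
print(update_weights(W, icc, icxe, t=50.0, error_count=30))
```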
4.3 Structural plasticity and representation construction
While the "how" of plasticity is answered through mechanisms of structural change, getting at
principles of brain design requires the more intriguing and inherently nebulous question: "why
structural plasticity?" Is it true that, given the constraints of the biology in question,
structural plasticity is the most plausible, if not the only, way? Interestingly, one can argue that
given a topographic arrangement of information, with this information being propagated in a
point-to-point manner, drastic map shifts of magnitude greater than can be supported by
the existing projection spread can only be incorporated through the formation of new projection
pathways (and concomitant new synapses). In contrast, a diffuse network architecture with an
all-to-all projection could implement plasticity simply by changing synaptic efficacies. One
quantification of the "ability" or "complexity" of a network is in terms of its memory capacity3
(Poirazi and Mel, 2001). Applying a similar function-counting approach to our problem (by
labeling the inputs to the ICC and OTS as xi, and treating the output of the network as the vector
of ICXE activities), we see that
the existing network architecture implements one possible function, whereas the architecture
post-learning implements a qualitatively different function.

----------------------------------
Figures 6 (a) and (b) go here
----------------------------------

3 An alternate quantifier of complexity is the Vapnik-Chervonenkis dimension, or VC-dim (see Bishop, 1995): a
measure of the complexity of a concept class; in our case, the network can be treated as a generator of various
concepts, i.e., a concept class.

Post-natal structural plasticity has, in effect, permitted an increase in the repertoire of functions
that the network can compute (memory capacity), subject to the topographic projection
constraints, from 1 to more than 2 (the actual number depends on the extent of the new axonal
growth). Here, we are assuming a simple one-neuron-per-concept representation, as shown in
figures 4 and 5. For instance, if the shift is by 1 unit (20 µs), then the capacity is 2; if the shift is
by 2 units, the capacity is 3; and so on. In this sense, a maximum capacity increase from 1 to 41
is possible with a 40° prism shift. Note that with age, the ability for structural plasticity
diminishes, and beyond 200 days it is almost non-existent. In essence, this means that if the
capacity of the network was N before 200 days, then it remains N (or very close to it) even when
subject to a maximal prism shift, thus precluding
further increase in complexity. If the memory capacity measure is applied to the Cascade
Correlation scheme from the earlier section, one sees immediately that at the end of every input
phase, the complexity of the network, in terms of the functions it can compute (memory
capacity), is augmented.
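Under the one-neuron-per-concept counting argument above, the capacity bookkeeping is simple enough to state as a one-line helper (a restatement of the text's arithmetic, not code from the model):

```python
def memory_capacity(shift_units):
    """Capacity under one-neuron-per-concept counting: a shift of k units
    adds k new computable maps to the single default map."""
    return 1 + shift_units

assert memory_capacity(0) == 1    # no shift: only the default function
assert memory_capacity(1) == 2    # shift by one unit
assert memory_capacity(40) == 41  # maximal 40-degree prism shift
```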
In general, one expects structural plasticity to achieve an abrupt and qualitative change in the
representational ability of a network, and to typically increase its functional complexity. Another
view of this is that if the network is viewed as a representation of the hypothesis space that is
explored while learning from examples, then structural plasticity helps to increase the hypothesis
space. This is fundamentally different from synaptic weight change, which only implements a
search within a given hypothesis space; structural plasticity is hence a powerful and
indispensable ally in representation construction.
5. Evidence from modeling cognitive development in children
One way to assess the importance of constructive growth in cognitive development is to compare
models whose networks are allowed to grow to models containing static networks. This
comparison is facilitated by the fact that the principal difference between two of the most
popular neural algorithms for modeling cognitive development is that one of them incorporates
growth and the other does not (Shultz, 2003). Both back-propagation (BP) and cascade-correlation (CC) involve multi-layered feed-forward networks that learn by adjusting their
connection weights to reduce error, defined as the discrepancy between actual and target outputs.
BP networks are typically designed by a programmer and retain their initial structure throughout
learning. In contrast, CC networks begin with a minimal topology containing only inputs and
outputs defined by a programmer, and then proceed to add hidden units as needed to master the
training problem.
There are two phases in CC learning – the input phase and the output phase. Training typically
begins in output phase by using the so-called delta rule to adjust the weights entering output units
in order to reduce network error. When error reduction stagnates for a certain number of epochs, or
when the problem has not been mastered within another, larger number of epochs, CC
switches to input phase.4 In input phase, the task is no longer to reduce error, but rather to re-conceptualize the problem by recruiting a useful hidden unit downstream of the existing units but
upstream of the output units. A pool of typically eight candidate hidden units, usually equipped
with sigmoid activation functions, is evaluated for recruitment. This evaluation consists of
training randomly initialized weights from the inputs and any existing hidden units to these
4 An epoch is a pass through the training patterns.
candidates, with the goal of maximizing a modified correlation between candidate activation and
network error. In other words, the CC algorithm in input phase is looking for a new hidden unit
whose activation patterns track the network’s current error, becoming either very active or very
inactive with error fluctuations – in short, a candidate unit which is sensitive to the major
difficulties that the network is currently having. The same delta rule is used here in input phase,
but instead of minimizing error as in output phase, it is used to maximize correlations. When
these correlations stagnate over a certain number of input-phase epochs, the candidate with the
highest absolute correlation with network error is selected and installed into the network, as the
less successful candidates are discarded. This implements a kind of proliferation and pruning of
units and connections that is characteristic of selection models of brain development (e.g.,
Changeux & Danchin, 1976). The new recruit is initialized with randomized output weights of
positive values if the correlation with error was negative or with negative values if the
correlation with error was positive. At this point, CC returns to output phase to determine how to
best utilize the new conceptualization of the problem afforded by its latest recruit. In this way,
new and more complex interpretations build on top of older, simpler interpretations. Cycling
between input and output phases continues until the training problem is mastered. CC can be
interpreted as implementing the processes of synaptogenesis and neurogenesis that have recently
been documented to be at least partly under the control of learning. As well, CC implements the
hierarchical cascaded progression now thought to be characteristic of cortical development.
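The input-phase selection criterion lends itself to a small runnable sketch. In the full algorithm, the candidates' incoming weights are first trained (by the delta rule) to maximize the covariance between candidate activation and residual error; the sketch below assumes that training has already happened and shows only the scoring-and-selection step, using invented data rather than real network activations:

```python
import numpy as np

def candidate_score(activations, errors):
    """Magnitude of the covariance between a candidate's activations and
    the network's residual error, summed over output units - the quantity
    CC maximizes and then uses to pick the recruit."""
    a = activations - activations.mean()
    e = errors - errors.mean(axis=0)      # one column per output unit
    return np.abs(a @ e).sum()

rng = np.random.default_rng(0)
errors = rng.normal(size=(50, 3))                    # residual error
tracking = errors[:, 0] + 0.1 * rng.normal(size=50)  # tracks the error
unrelated = rng.normal(size=50)                      # ignores the error

candidates = {"tracking": tracking, "unrelated": unrelated}
best = max(candidates, key=lambda name: candidate_score(candidates[name], errors))
# The candidate whose activation tracks network error wins recruitment;
# the less successful candidate would be discarded.
print(best)
```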
There are presently three domains of cognitive development to which both BP and CC networks
have been applied. Here we briefly review these “bakeoff” competitions in order to examine the
effects of growth on simulation success. The three relevant domains are the balance scale, the
integration of velocity, time, and distance cues for moving objects, and age changes in sensitivity
to correlations vs. features in infant category learning. We consider each of these domains in
turn.
Balance-scale stages
The balance-scale problem involves presenting a child with a rigid beam balanced on a fulcrum
(Siegler, 1976, 1981). The beam is equipped with several pegs spaced at regular intervals to the
left and right of the fulcrum. The experimenter typically places some number of identical
weights on one peg on the left side and some number of weights on one peg on the right side.
While supporting blocks prevent the beam from moving, the child is asked to predict which side
of the beam will descend, or whether the scale will remain balanced, once the supporting blocks
are removed.
The “rules” that children use to make balance-scale predictions can be diagnosed by presentation
of six different types of problems. Three problems are relatively simple in the sense that one cue
(either weight or distance from the fulcrum) predicts the outcome because the other cue does not
differ from one side of the scale to the other. Three other problems are relatively complex in that
the two relevant cues conflict with each other, with weight predicting one outcome and distance
predicting the opposite outcome. For simple or complex types, there are three possible outcomes
in which the direction of the tip is predicted by weight information or by distance information or
the scale remains balanced. The pattern of a child’s predictions across these six different problem
types can be used to diagnose the “rule” the child is apparently using to make predictions. Of
course, in a connectionist network a rule is not implemented as a symbolic if-then structure, but
is rather an emerging epiphenomenon of the network’s connection-weight values.
There is a consensus in the psychological literature that children progress through four different
stages of balance-scale performance (Siegler, 1976, 1981). In stage 1, children use weight
information only, predicting that the side with greater weight will descend or that the scale will
balance when the two sides have equal weights. In stage 2, children continue to use weight
information but begin to use distance information when the weights are equal on each side. In
stage 3, weight and distance information are used about equally, but the child guesses when
weight and distance information conflict on the more complex problems. Finally in stage 4,
children respond correctly on all types of problems, whether simple or complex.5
In one of the first connectionist simulations of cognitive development, McClelland and Jenkins
(1991) found that a static BP network with two groups of hidden units segregated for either
weight or distance information progressed through the first three stages of the balance scale and
even into the fourth stage. However, the network never settled into stage 4 performance, instead
cycling between stage 3 and stage 4 functioning for as long as the programmer’s patience lasted.
Further simulations indicated that BP networks could settle into stage 4, but only at the cost of
missing stages 1 and 2 (Schmidt & Shultz, 1991). There seems to be no way for BP networks to
capture both consistently mature stage 4 performance and earlier progressions through stages 1
and 2. In contrast, the first CC model of cognitive development naturally captured all four stages
of balance-scale performance, and did so without requiring any hand-designed segregation of
hidden units (Shultz, Mareschal, & Schmidt, 1994). Capturing balance-scale stages is not a
trivial matter, as witnessed by the shortcomings of earlier symbolic-rule-learning algorithms
(Langley, 1987; Newell, 1990).
Both BP and CC models are able to naturally capture the other psychological regularity of
balance-scale development, the torque-difference effect. This is the tendency for problems with
large absolute torque differences (from one side of the scale to the other) to be easier for children
to solve, regardless of their current stage (Ferretti & Butterfield, 1986).6 Connectionist feed-forward networks produce perceptual effects like torque difference whenever they learn to
convert quantitative input differences into a binary output judgment. The bigger the input
differences (in this case, from one side of the fulcrum to the other side), the clearer the hidden
unit activations, and the more decisive the output decisions. Symbolic rule-based models are
unable to naturally capture the torque-difference effect because symbolic rules typically don’t
care about the amount of differences, only the direction of differences.
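The torque-difference effect falls out of any smooth squashing of a quantitative difference into a binary judgment. The toy readout below makes the point with a sigmoid of the left-right torque difference; the gain value and the example weights are arbitrary assumptions, not fitted to children's data or to any of the networks discussed:

```python
import math

def torque(weight, distance):
    return weight * distance  # torque on one side of the scale (footnote 6)

def tip_confidence(lw, ld, rw, rd, gain=0.5):
    """Sigmoid of the left-minus-right torque difference: near 1 means
    'left side descends', near 0 'right side descends', near 0.5 undecided."""
    diff = torque(lw, ld) - torque(rw, rd)
    return 1.0 / (1.0 + math.exp(-gain * diff))

small = tip_confidence(3, 2, 5, 1)  # torque difference = 6 - 5 = 1
large = tip_confidence(5, 4, 2, 1)  # torque difference = 20 - 2 = 18
# The larger torque difference yields the more decisive judgment,
# mirroring the problems that are easier for children.
assert abs(large - 0.5) > abs(small - 0.5)
```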
The ability of neural-network models to capture stages 1 and 2 on the balance scale is due to a
bias towards equal-distance problems in the training patterns. These are problems in which the
weights are placed equally distant from the fulcrum on each side. This bias forces the network to
first emphasize weight information at the expense of distance information because weight
information is much more relevant to reducing prediction errors. Once weight information is
successfully dealt with, then the network can turn its attention to distance information,
particularly on those problems where distance varies from one side of the fulcrum to the other. In
effect, the network must find the particular region of connection-weight space that allows it to
emphasize the numbers of weights on the scale and then move to another region of weight space
that focuses on the multiplication of equally important weight and distance information. It
5 Some degree of latitude is typically incorporated into stage diagnosis. For example, with four problems of each of
the six types, a child could make up to four deviant responses and still be diagnosed into one of these four stages
(Siegler, 1976, 1981).
6 Torque is the product of weight and distance on one side of the scale.
requires a powerful learning algorithm to make this move in connection-weight space.
Apparently, a static BP network, once committed to using one source of information in stage 1,
cannot easily find its way downhill to the stage-4 region merely by continuing to reduce error. A
constructive algorithm, such as CC, has an easier time with this move because each newly
recruited hidden unit effectively changes the shape of connection-weight space by adding a new
dimension. This new dimension affords a path on which to move downhill in error reduction
towards the stage-4 region. More colloquially, BP cannot have its cake and eat it too, but CC,
due to its dynamic increases in computational power, can both have and eat its cake.
Integration of velocity, time, and distance cues
In Newtonian physics, the velocity of a moving object is the ratio of distance traveled to the time
of the journey: velocity = distance / time. Algebraically then, distance = velocity x time, and time
= distance / velocity. Some of the cleanest evidence on children's acquisition of these relations
was collected by Wilkening (1981), who asked children to predict one dimension (e.g., time)
from knowledge of the other two (e.g., velocity and distance). For example, three levels of
velocity information were represented by the locomotion of a turtle, a guinea pig, and a cat.
These three animals were described as fleeing from a barking dog, and a child participant was
asked to imagine these animals running while the dog barked. The child's task was to infer how
far an animal would run given the length of time the dog barked, an example of inferring
distance from velocity and time cues. CC networks learning similar tasks typically progress
through an equivalence stage (e.g., velocity = distance), followed by an additive stage (e.g.,
velocity = distance - time), and finally the correct multiplicative stage (e.g., velocity = distance /
time) (Buckingham & Shultz, 2000). Some of these stages had been previously found with
children, and others were subsequently confirmed as predictions of our CC simulations. As with
children in psychological experiments, our CC networks learned to predict the value of one
dimension from knowledge of values on the other two dimensions.
We diagnosed rules based on correlations between network outputs and the various algebraic
rules observed in children. To be diagnosed, an algebraic rule had to correlate positively with
network responses, account for most of the variance in network responses, and account for more
variance than any other rules did. As with rules for solving balance-scale problems, these rules
are epiphenomena emerging from patterns of network connection weights. For velocity and time
inferences, CC networks first acquired an equivalence rule, followed by a difference rule,
followed in turn by the correct ratio rule. Results were similar for distance inferences, except that
there was no equivalence rule for distance inferences. In making distance inferences, there is no
reason a network should favor either velocity or time information because both velocity and time
vary proportionally with distance.
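The diagnosis procedure sketched above - positive correlation, most variance accounted for, and more variance than any rival rule - can be written down directly. The responses below are invented toy data (noisy products of velocity and time), not the network outputs from the simulations:

```python
import numpy as np

# Candidate algebraic rules for distance inferences.
rules = {
    "equivalence":    lambda v, t: v,      # distance = velocity
    "additive":       lambda v, t: v + t,  # distance = velocity + time
    "multiplicative": lambda v, t: v * t,  # distance = velocity x time
}

velocity = np.array([1, 1, 2, 2, 3, 3], dtype=float)
time_ = np.array([1, 2, 1, 3, 2, 3], dtype=float)
# Invented responses: products plus a little noise.
responses = velocity * time_ + np.array([0.1, -0.1, 0.0, 0.1, -0.1, 0.0])

def diagnose(responses, velocity, time_):
    """Diagnose the rule that correlates positively with the responses
    and accounts for more variance (r^2) than any rival rule."""
    scores = {}
    for name, rule in rules.items():
        pred = rule(velocity, time_)
        r = np.corrcoef(pred, responses)[0, 1]
        scores[name] = (r, r * r)
    return max((n for n in scores if scores[n][0] > 0),
               key=lambda n: scores[n][1])

print(diagnose(responses, velocity, time_))  # multiplicative
```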
A shift from linear to nonlinear performance occurred with continued recruitment of hidden
units. Linear rules include equivalence (e.g., time = distance), sum (e.g., distance = velocity +
time), and difference (e.g., velocity = distance - time) rules, whereas nonlinear rules include
product (e.g., distance = velocity x time) and ratio (e.g., time = distance / velocity) rules.
Because the sum and difference rules in the second stage are linear, one might wonder why they
require a hidden unit. The reason is that networks without a hidden unit are unable to
simultaneously encode the relations among the three dimensions for all three inference types. In
distance inferences, distance varies directly with both velocity and time. But in velocity
inferences, distance and time vary inversely. And in time inferences, distance and velocity vary
inversely. Networks without hidden units are unable to encode these different relations
simultaneously. The first-recruited hidden unit differentiates distance information from velocity and
time information, essentially by learning connection weights with one sign (positive or negative)
from the former input and opposite signs from the latter inputs. This weight pattern enables the
network to consolidate the different directions of relations across the different inference types.
In contrast to constructive CC networks, static BP networks seem unable to capture these stage
sequences (Buckingham & Shultz, 1996). If a static BP network has too few hidden units, it fails
to reach the correct multiplicative rules. If a static BP network has too many hidden units, it fails
to capture the intermediate additive stages on velocity and time inferences. Our extensive
exploration of a variety of network topologies and variation in critical learning parameters
suggests that there is no static BP network topology that can capture all three types of stages in
this domain. Even the use of cross-connections that bypass hidden layers, a standard feature of
CC, failed to improve the stage performance of BP networks. There is no apparent way to get BP
to cover all three stages because the difference between underpowered and overpowered BP
networks is a single hidden unit. Such a situation would surely frustrate Goldilocks if she were a
BP advocate – there is either too little or too much computational power and no apparent way to
have just the right amount of static network power. Thus, we conclude that the ability to grow in
computational power is essential in simulating stages in the integration of velocity, time, and
distance cues.
Age changes in infant category learning
Our final simulation bakeoff concerns age changes in sensitivity to correlations vs. features in
category learning by infants. Using a stimulus-familiarization-and-recovery procedure to study
categorization, Younger and Cohen (1983, 1986) found that 4-month-olds process information
about independent features of visual stimuli, whereas 10-month-olds are able to abstract relations
among those features. Such findings relate to a longstanding controversy about the extent to
which perceptual development involves integration (Hebb, 1949) or differentiation (Gibson,
1969) of stimulus information. A developing ability to understand relations among features
suggests that perceptual development involves information integration, a view compatible with
constructivism.
Infants are assumed to construct representational categories for repeated stimuli, ignoring novel
stimuli that are consistent with a category, while concentrating on novel stimuli that are not
members of a category (Cohen & Younger, 1983; Younger & Cohen, 1985). After repeated
presentation of visual stimuli with correlated features, 4-month-olds recovered attention to
stimuli with novel features more than to stimuli with either correlated or uncorrelated familiar
features (Younger & Cohen, 1983, 1986). In contrast, 10-month-olds recovered attention to both
stimuli with novel features and familiar uncorrelated features more than to stimuli with familiar
correlated features. Mean fixation time in seconds is shown for the key interaction between age
and correlated vs. uncorrelated test stimuli, plotted in solid lines in Figure x against the left-hand
y-axis. This pattern of recovery of attention indicates that young infants learned about the
individual stimulus features, but not about the relationship among features, whereas older infants
in addition learned about how these features correlate with one another.
----------------------
Figure 7 goes here
----------------------
These infant experiments were recently simulated with CC encoder networks (Shultz & Cohen,
2003). Encoder networks have the same number of input units as output units, and their job is to
reproduce their input values on their output units. They learn to do this reproduction by encoding
an input representation onto their hidden units and then decoding that representation onto output
units. In this fashion, encoder networks develop a recognition memory for the stimuli they are
exposed to. Network error can be used as an index of stimulus novelty. Both BP and CC
encoders were applied to the infant data (Shultz & Cohen, 2003). In CC networks, age was
implemented by varying the score-threshold parameter: 0.25 for 4-month-olds and 0.15 for 10-month-olds. This parameter governs how much learning occurs because learning stops only
when all network outputs are within score-threshold of their target values for all training
patterns. It was assumed that older infants would learn more from the same exposure time than
would younger infants. Mean network error for the key interaction between age and correlated
vs. uncorrelated test stimuli is plotted with dashed lines in Figure x against the right-hand y-axis.
On the assumption that network error reflects stimulus novelty (and thus infant interest), the CC
networks closely matched the infant data – more error to the uncorrelated test stimulus than to
the correlated test stimulus only at the smaller score-threshold of 0.15.
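The score-threshold stopping criterion is simple to state: learning ends only when every output is within the threshold of its target on every pattern, so a smaller threshold forces deeper learning from the same exposure. A minimal sketch with invented activations:

```python
import numpy as np

def all_within_threshold(outputs, targets, score_threshold):
    """CC-style stopping test: every output within threshold of its target."""
    return bool(np.all(np.abs(outputs - targets) < score_threshold))

targets = np.array([[0.2, 0.8], [0.9, 0.1]])
outputs = np.array([[0.38, 0.62], [0.72, 0.28]])  # max deviation 0.18

# With the 4-month-old setting (0.25) learning would already stop, while
# the 10-month-old setting (0.15) forces further, deeper learning.
assert all_within_threshold(outputs, targets, 0.25)
assert not all_within_threshold(outputs, targets, 0.15)
```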
In sharp contrast, a wide range of static BP networks could not capture this key interaction. Our
BP simulator was modified to use a score-threshold parameter to decide on learning success, just
as in CC. The same coding scheme used in the CC simulations was also used here. A wide
variety of score-threshold values were explored in a systematic attempt to cover the infant data.
BP networks were equipped with three hidden units, the typical number recruited by CC
networks. BP network topology was varied to explore the possible roles of network depth and the
presence of cross connections in simulation success. Both flat (6-3-6) and deep (6-1-1-1-6) BP
networks were tried, both with and without the cross-connection weights native to CC.7 There
was never lower error on correlated than on uncorrelated test items, the signature of effective
correlation detection. It is difficult to prove that any algorithm cannot in principle cover a set of
phenomena, but we certainly gave BP a fair shake, running 9 networks in each of 80 simulations
(2 familiarization sets x 10 score-threshold levels x 4 network topologies).
Unlike previous psychological explanations that postulated unspecified qualitative shifts in
processing with age, our computational explanation based on CC networks focused on
quantitatively deeper learning with increasing age, a principle with considerable empirical
support over a wide range of ages and experiments (e.g., Case, Kurland, & Goldberg, 1982). CC
networks also generated a crossover prediction, with deep learning showing a correlation effect,
and even more superficial learning showing the opposite – a similarity effect, in the form of
more error to (or interest in) the correlated test item than to the uncorrelated test item. This is
called a similarity effect because the uncorrelated test item was most similar to those in the
familiarization set. This simulation predicted that with a single, suitable age group, say 10-month-olds, there would be a correlation effect under optimal learning conditions and a
similarity effect with less than optimal familiarization learning. Tests of this prediction found
that 10-month-olds who habituated to the training stimuli looked longer at the uncorrelated than
the correlated test stimulus, but those who did not habituate did just the opposite, looking longer
at the correlated than the uncorrelated test stimulus (Cohen & Arthur, unpublished).
7 There were six input and six output units in BP networks, just as there were in CC networks.
In addition to covering the essential age x correlation interaction, CC networks covered two
other important features of the infant data: discrimination of the habituated from the correlated
test stimulus via the presence of two additional noncorrelated features, and more recovery to test
stimuli having novel features than to test stimuli having familiar features.
Although the BP simulations did not capture the infant data, they were useful in establishing why
CC networks were successful. Namely, the growth of cascaded networks with cross-connection
weights seems critical to capturing detection of correlations between features and the
developmental shift to this ability from earlier feature-value learning.
6. Conclusions
The work reviewed here indicates that it is important to let networks grow as a way of satisfying
both biological and computational constraints. Biological constraints clearly favour the use of
neural networks that grow as they learn. Implementing network growth in simulations of
psychological data produces superior coverage of the data when contrasted with comparable
static networks. Network growth offers a number of computational advantages including the
building of more complex knowledge on top of simpler knowledge, superior ability to escape
from local minima during error reduction, and deeper learning of training problems. Such
advantages translated into better coverage of the developmental stage progressions seen in the
psychological literature.
It is noteworthy that these computational constraints can operate in both supervised and
unsupervised learning paradigms, including both reward-based Hebbian learning and multi-layered feed-forward CC networks. As used here, the differences between these learning
paradigms became somewhat blurred. Typically considered as an unsupervised learner, the Hebb
rule was here modified to include the discrepancy between predicted and actual reward.
Typically considered as supervised learning, CC can often be construed as not requiring a human
teacher to supply the required target outputs. For example, CC networks learn about balance
scale problems by computing the discrepancy between what they predict will happen and what
actually happens as a scale is allowed to tip. Similarly, they learn about velocity, time, and
distance relations by predicting and then noting how these cues are related as objects move. In
these and in many other cases, observations of the natural environment supply the target outputs
used in error computation. Likewise, CC encoder networks use the stimulus itself as the target
output, computing error as the discrepancy between input and output activations. In this sense,
encoders constitute a case of unsupervised learning about stimuli from mere stimulus exposure.
Although we emphasized network growth in this chapter, it is also clear from neuroscience
research that brains discard unused neurons and synapses. In a drive towards even greater
biological realism, it would be worth experimenting with networks that both grow and prune.
Some preliminary work along these lines has found that pruning of CC networks during growth
and learning improves learning speed, generalization, and network interpretability (Thivierge,
Rivest, & Shultz, 2003). This would appear to be yet another case of a biological constraint
offering computational advantages.
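For concreteness, here is one generic way pruning is often sketched (this is a simple magnitude-based scheme of our own for illustration, not the dual-phase technique of Thivierge et al., 2003): the smallest-magnitude connections are eliminated, loosely analogous to the discarding of unused synapses.

```python
import numpy as np

def prune_smallest(weights, frac=0.25):
    """Zero out the smallest-magnitude fraction of connections.

    A generic illustration of pruning; not the dual-phase method
    cited in the text.
    """
    k = int(frac * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights).ravel())[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0            # eliminate weak connections
    return pruned

w = np.array([[0.9, -0.05], [0.02, -0.7]])
wp = prune_smallest(w, frac=0.5)  # the two smallest-magnitude weights are removed
```

In a grow-and-prune regime, a step like this would be interleaved with the recruitment of new units, so the network both adds and discards structure during learning.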
References (SPM and SRQ)
Altman, J. (1962). Are new neurons formed in the brains of adult mammals? Science 135, 1127-1128.
Altman, J., and Das, G. D. (1965). Autoradiographic and histological evidence of postnatal
hippocampal neurogenesis in rats. Journal of Comparative Neurology 124, 319--335.
Barnea, A., and Nottebohm, F. (1994). Seasonal recruitment of hippocampal neurons in adult
free-ranging black-capped chickadees. Proc Natl Acad Sci U S A 91, 11217-11221.
Baum, E. B. (1988). Complete representations for learning from examples. In Complexity in
information theory, Y. Abu-Mostafa, ed.
Brainard, M. S., and Knudsen, E. I. (1993). Experience-dependent plasticity in the inferior
colliculus: a site for visual calibration of the neural representation of auditory space in the barn
owl. The Journal of Neuroscience 13, 4589--4608.
Brainard, M. S., and Knudsen, E. I. (1995). Dynamics of visually guided auditory plasticity in
the optic tectum of the barn owl. Journal of Neurophysiology 73, 595--614.
Brainard, M. S., and Knudsen, E. I. (1998). Sensitive periods for visual calibration of the
auditory space map in the barn owl optic tectum. Journal of Neuroscience 18, 3929--3942.
Chomsky, N. (1980). Rules and representations. Behavioral and Brain Sciences 3, 1-61.
DeBello, W. M., Feldman, D. E., and Knudsen, E. I. (2001). Adaptive axonal remodeling in the
midbrain auditory space map. The Journal of Neuroscience 21, 3161--3172.
Dietterich, T. G. (1990). Machine learning. Annual Review of Computer Science 4, 255--306.
Eckenhoff, M. F., and Rakic, P. (1988). Nature and fate of proliferative cells in the hippocampal
dentate gyrus during the life span of the rhesus monkey. J Neurosci 8, 2729-2747.
Feldman, D. E., and Knudsen, E. I. (1997). An anatomical basis for visual calibration of the
auditory space map in the barn owl's optic tectum. Journal of Neuroscience 17, 6820--6837.
Field, J., Muir, D., Pilon, R., Sinclair, M., and Dodwell, P. (1980). Infants' orientation to lateral
sounds from birth to three months. Child Development 51, 295--298.
Fodor, J. (1980). On the impossibility of learning more "powerful" structures. In The debate
between Jean Piaget and Noam Chomsky, M. Piattelli-Palmarini, ed.
Gage, F. H. (2002). Neurogenesis in the adult brain. The Journal of Neuroscience 22, 612--613.
Gelfland, J. J., and Pearson, J. C. (1988). Multisensor integration in biological systems.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance
dilemma. Neural Computation 4, 1--58.
Gold, M. (1967). Language identification in the limit. Information and Control 10, 447-474.
Goldman, S. A., and Nottebohm, F. (1983). Neuronal production, migration, and differentiation
in a vocal control nucleus of the adult female canary brain. Proc Natl Acad Sci U S A 80, 2390--2394.
Gould, E., and Gross, C. G. (2002). Neurogenesis in adult mammals: some progress and
problems. The Journal of Neuroscience 22, 619--623.
Gould, E., Reeves, A. J., Graziano, M. S. A., and Gross, C. G. (1999). Neurogenesis in the
neocortex of adult primates. Science 286, 548--552.
Gould, E., Vail, N., Wagers, M., and Gross, C. G. (2001). Adult generated hippocampal and
neocortical neurons in macaques have transient existence. Proceedings of the National
Academy of Sciences, USA 98, 10910--10917.
Gross, C. G. (2000). Neurogenesis in the adult brain: death of a dogma. Nature Reviews
Neuroscience 1, 67--73.
Gutfreund, Y., Zheng, W., and Knudsen, E. I. (2002). Gated visual input to the central auditory
system. Science 297, 1556--1559.
Hyde, P. S., and Knudsen, E. I. (2000). Topographic projection from the optic tectum to the
auditory space map in the inferior colliculus of the barn owl. The Journal of Comparative
Neurology 421, 146--160.
Hyde, P. S., and Knudsen, E. I. (2001). A topographic instructive signal guides the adjustment of
the auditory space map in the optic tectum. Journal of Neuroscience 21, 8586--8593.
Kaplan, M. S. (1984). Mitotic neuroblasts in 9-day-old and 11-month-old rodent hippocampus.
Journal of Neuroscience 4, 1429--1441.
Kaplan, M. S. (1985). Formation and turnover of neurons in young and senescent animals: an
electron microscopic and morphometric analysis. Annals of the New York Academy of Sciences
457, 173--192.
Kempermann, G. (2002). Why new neurons? Possible functions for adult hippocampal
neurogenesis. Journal of Neuroscience 22, 635--638.
Kempermann, G., Kuhn, H. G., and Gage, F. H. (1997). More hippocampal neurons in adult
mice living in an enriched environment. Nature 386, 493-495.
Kempermann, G., Kuhn, H. G., and Gage, F. H. (1998). Experience-dependent neurogenesis in
the senescent dentate gyrus. The Journal of Neuroscience 18, 3206--3212.
Kempter, R., Leibold, C., Wagner, H., and van Hemmen, J. L. (2001). Formation of temporal-feature
maps by axonal propagation of synaptic learning. Proceedings of the National Academy
of Sciences 98, 4166--4171.
Knudsen, E. I. (1998). Capacity for plasticity in the adult owl auditory system expanded by
juvenile experience. Science 279, 1531--1533.
Knudsen, E. I. (2002). Instructed learning in the auditory localization pathway of the barn owl.
Nature 417, 322-328.
Knudsen, E. I., Esterly, S. D., and du Lac, S. (1991). Stretched and upside-down maps of auditory
space in the optic tectum of blind-reared owls: acoustic basis and behavioral correlates. The
Journal of Neuroscience 11, 1727--1747.
Kornack, D., and Rakic, P. (1999). Continuation of neurogenesis in the hippocampus of the adult
macaque monkey. Proceedings of the National Academy of Sciences 96, 5768--5773.
Lendvai, B., Stern, E. A., Chen, B., and Svoboda, K. (2000). Experience-dependent plasticity of
dendritic spines in the developing rat barrel cortex in vivo. Nature 404, 876--881.
Linkenhoker, B. A., and Knudsen, E. I. (2002). Incremental training increases the plasticity of
the auditory space map in adult barn owls. Nature 419, 293--296.
Lipkind, D., Nottebohm, F., Rado, R., and Barnea, A. (2002). Social change affects the survival
of new neurons in the forebrain of adult songbirds. Behav Brain Res 133, 31-43.
Macnamara, J. (1982). Names for Things: A Study of Child Language. Cambridge, MA: MIT
Press.
Martin, S. J., and Morris, R. G. M. (2002). New life to an old idea: The synaptic plasticity and
memory hypothesis revisited. Hippocampus 12, 609--636.
Nilsson, M., Perfilieva, E., Johansson, V., Orwar, O., and Eriksson, P. S. (1999). Enriched
environment increases neurogenesis in the adult rat dentate gyrus and improves spatial memory.
Journal of Neurobiology 39, 569--578.
Nottebohm, F. (1985). Neuronal replacement in adulthood. Ann N Y Acad Sci 457, 143-161.
Nottebohm, F. (2002). Why are some neurons replaced in adult brain? J Neurosci 22, 624-628.
Nottebohm, F., O'Loughlin, B., Gould, K., Yohay, K., and Alvarez-Buylla, A. (1994). The life
span of new neurons in a song control nucleus of the adult canary brain depends on time of year
when these cells are born. Proc Natl Acad Sci U S A 91, 7849-7853.
Nowakowski, R. S., and Hayes, N. L. (2000). New neurons: extraordinary evidence or
extraordinary conclusions. Science 288, 771a.
Paton, J. A., and Nottebohm, F. N. (1984). Neurons generated in the adult brain are recruited into
functional circuits. Science 225, 1046-1048.
Piaget, J. (1970). Genetic epistemology.
Piaget, J. (1980). Adaptation and intelligence: organic selection and phenocopy.
Poirazi, P., and Mel, B. W. (2001). Impact of active dendrites and structural plasticity on the
memory capacity of neural tissues. Neuron 29, 779--796.
Pouget, A., Defayet, C., and Sejnowski, T. J. (1995). Reinforcement learning predicts the site of
plasticity for auditory remapping in the barn owl. In Advances in Neural Information
Processing Systems 7, G. Tesauro, D. Touretzky, and J. Alspector, eds. (Cambridge, MA), pp. 125--132.
Quartz, S. R. (1993). Neural networks, nativism and the plausibility of constructivism. Cognition
48, 223--242.
Quartz, S. R., and Sejnowski, T. J. (1997). The neural basis of cognitive development: A
constructivist manifesto. Behavioral and Brain Sciences 20, 537--596.
Rakic, P. (1985). Limits to neurogenesis in primates. Science 227, 1054--1056.
Rakic, P. (2002). Neurogenesis in the adult primate neocortex: an evaluation of the evidence.
Nature Reviews Neuroscience 3, 65--71.
Rosen, D. J., Rumelhart, D. E., and Knudsen, E. I. (1994). A connectionist model of the owl's
sound localization system. In Advances in Neural Information Processing Systems 6, G. Tesauro,
D. S. Touretzky, and T. K. Leen, eds. (Cambridge, MA).
Rucci, M., Tononi, G., and Edelman, G. M. (1997). Registration of neural maps through value-dependent
learning: modeling the alignment of auditory and visual maps in the barn owl's optic
tectum. The Journal of Neuroscience 17, 334--352.
Shankle, W. R., Landing, B. H., Rafii, M. S., Schiano, A., Chen, J. M., and Hara, J. (1998).
Evidence for a postnatal doubling of neuron number in the developing human cerebral cortex
between 15 months and 6 years. J Theor Biol 191, 115-140.
Shavlik, J. W. (1990). Inductive learning from preclassified training examples: Introduction. In
Readings in machine learning, J. W. S. a. T. D. Dietterich, ed.
Shultz, T. R., Schmidt, W. C., Buckingham, D., and Mareschal, D. (1995). Modeling cognitive
development with a generative constructionist algorithm. In Developing cognitive competence:
New approaches to process modeling, T. J. S. a. G. S. Halford, ed.
Smart, I. (1961). The subependymal layer of the mouse brain and its cell production as shown by
radioautography after H3-thymidine injection. Journal of Comparative Neurology 116, 325--347.
Stanfield, B. B., and Trice, J. E. (1988). Evidence that granule cells generated in the dentate
gyrus of adult rats extend axonal projections. Experimental Brain Research 72, 399--406.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM 27, 1134--1142.
Westermann, G. (2000). Constructivist neural network models of cognitive development.
Zheng, W., and Knudsen, E. I. (1999). Functional selection of adaptive auditory space map by
GABAA-mediated inhibition. Science 284, 962--965.
Zheng, W., and Knudsen, E. I. (2001). GABA-ergic inhibition antagonizes adaptive adjustment
of the owl's auditory space map during the initial phase of plasticity. Journal of Neuroscience
21, 4356--4365.
Zito, K., and Svoboda, K. (2002). Activity-dependent synaptogenesis in the adult mammalian
cortex. Neuron 35, 1015--1017.
References (TRS)
Buckingham, D., & Shultz, T. R. (1996). Computational power and realistic cognitive
development. Proceedings of the Eighteenth Annual Conference of the Cognitive Science
Society (pp. 507-511). Mahwah, NJ: Erlbaum.
Buckingham, D., & Shultz, T. R. (2000). The developmental course of distance, time, and
velocity concepts: A generative connectionist model. Journal of Cognition and Development,
1, 305-345.
Case, R., Kurland, D. M., & Goldberg, J. (1982). Operational efficiency and the growth of
short-term memory span. Journal of Experimental Child Psychology, 33, 386-404.
Changeux, J. P., & Danchin, A. (1976). Selective stabilisation of developing synapses as a
mechanism for the specification of neuronal networks. Nature, 264, 705-712.
Cohen, L. B., & Arthur, A. E. (unpublished). The role of habituation in 10-month-olds'
categorization.
Cohen, L. B., & Younger, B. A. (1983). Perceptual categorization in the infant. In E. Scholnick
(Ed.), New trends in conceptual representation (pp. 197-220). Hillsdale, NJ: Lawrence
Erlbaum Associates.
Ferretti, R. P., & Butterfield, E. C. (1986). Are children's rule assessment classifications
invariant across instances of problem types? Child Development , 57, 1419-1428.
Gibson, E. J. (1969). Principles of perceptual learning and development. New York: Appleton-Century-Crofts.
Gould, E., Tanapat, P., Hastings, N. B., & Shors, T. J. (1999). Neurogenesis in adulthood: a
possible role in learning. Trends in Cognitive Sciences, 3, 186-192.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Langley, P. (1987). A general theory of discrimination learning. In D. Klahr, P. Langley, & R.
Neches (eds.), Production system models of learning and development (pp. 99-161).
Cambridge, MA: MIT Press.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
Quartz, S. R. (1993). Neural networks, nativism, and the plausibility of constructivism.
Cognition, 48, 223-242.
Quartz, S. R., & Sejnowski, T. J. (1997). The neural basis of cognitive development: A
constructivist manifesto. Behavioral and Brain Sciences, 20, 537-596.
Schmidt, W. C., & Shultz, T. R. (1991). A replication and extension of McClelland's balance
scale model. Technical Report No. 91-10-18, McGill Cognitive Science Centre, McGill
University, Montreal.
Shultz, T. R. (2001). Connectionist models of development. In N. J. Smelser & P. B. Baltes
(eds.), International Encyclopedia of the Social and Behavioral Sciences (Vol. 4, pp. 2577-2580). Oxford: Pergamon.
Shultz, T. R. (2003). Computational developmental psychology. Cambridge, MA: MIT Press.
Shultz, T. R., & Cohen, L. B. (2003, in press). Modeling age differences in infant category
learning. Infancy.
Shultz, T. R., & Mareschal, D. (1997). Rethinking innateness, learning, and constructivism:
Connectionist perspectives on development. Cognitive Development, 12, 563-586.
Shultz, T. R., Mareschal, D., & Schmidt, W. C. (1994). Modeling cognitive development on
balance scale phenomena. Machine Learning, 16, 57-86.
Shultz, T. R., & Rivest, F. (2001). Knowledge-based cascade-correlation: Using knowledge to
speed learning. Connection Science, 13, 1-30.
Siegler, R. S. (1976). Three aspects of cognitive development. Cognitive Psychology, 8, 481-520.
Siegler, R. S. (1981). Developmental sequences within and between concepts. Monographs of
the Society for Research in Child Development, 46, Serial No. 189.
Thivierge, J. P., Rivest, F., & Shultz, T. R. (2003). A dual-phase technique for pruning
constructive networks. IEEE International Joint Conference on Neural Networks.
Wilkening, F. (1981). Integrating velocity, time, and distance information: A developmental
study. Cognitive Psychology, 13, 231-247.
Younger, B. A., & Cohen, L. B. (1983). Infant perception of correlations among attributes. Child
Development, 54, 858-867.
Younger, B. A., & Cohen, L. B. (1985). How infants form categories. In G. Bower (Ed.), The
psychology of learning and motivation: Advances in research and theory, Vol. 19 (pp. 211-247). New York: Academic Press.
Younger, B. A., & Cohen, L. B. (1986). Developmental change in infants’ perception of
correlations among attributes. Child Development, 57, 803-815.
Figure 1: Schematic representation of key anatomical areas.
[Figure 1 diagram: auditory input drives the tonotopic map in the ICC, which projects to the
topographic map in the ICX and on to the auditory topographic map in the OTD; visual input
drives the visual topographic map in the OTS; visual-auditory integration occurs in the OTM.
Abbreviations: ICC, inferior colliculus central nucleus; ICX, inferior colliculus external nucleus;
OTD, OTM, OTS, deep, middle, and superficial layers of the optic tectum. The figure marks sites
of auditory processing, visual processing, integration, and plasticity.]
Figure 2: Comparison of axonal projections. Reproduced from DeBello et al. (2001).
Figure 3: The influence of auditory stimuli on visual responses in the ICX. (A and B) Response
of an ICX site to an auditory stimulus as a function of ITD; negative values indicate left-ear-leading.
(C and D) Responses at the same site to light flashes. (E and F) Responses to the
simultaneous presentation of auditory and visual stimuli. [(A), (C), and (E)] Multiunit recording;
[(B), (D), and (F)] single-unit recording. The single unit was isolated from the same site (13).
The arrows indicate the ITD value corresponding to the location of the visual RF.
Reproduced from Gutfreund et al. (2002).
Figure 4: Information flow before mounting prisms
[Figure 4 diagram: information flow among the visual field, ICC, ICXE, ICXI, OTD, OTM, and
OTS, with (A) marking the auditory signal and (V) the visual signal.]
Figure 5: Information flow immediately after mounting prisms
[Figure 5 diagram: the same circuit as Figure 4 (visual field, ICC, ICXE, ICXI, OTD, OTM,
OTS), showing the auditory (A) and visual (V) signals immediately after prism mounting.]
Figure 6: (a) Tuning curve of ICX neuron #1 before training; (b) after training.
Figure 7. Mean fixation of key test stimuli in infants (solid lines) from Younger and Cohen
(1986) and mean CC network error at two levels of score-threshold (dashed lines) from Shultz
and Cohen (2003).
[Figure 7 plot: mean fixation in seconds (left axis, 7 to 12) and mean network error (right axis,
0.7 to 1.2) as a function of test stimulus (correlated vs. uncorrelated), with curves for 4-month-old
and 10-month-old infants and for score-thresholds of 0.25 and 0.15.]