
Multimodal Dynamics:
Self-Supervised Learning in Perceptual and Motor Systems
by
Michael Harlan Coen
S.M. Computer Science, Massachusetts Institute of Technology (1994)
S.B. Computer Science, Massachusetts Institute of Technology (1991)
Submitted to the Department of Electrical Engineering and Computer Science in partial
fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2006
© 2006 Massachusetts Institute of Technology.
All rights reserved.
Signature of Author
Department of Electrical Engineering and Computer Science
May 25, 2006
Certified by
Whitman Richards
Professor of Brain and Cognitive Sciences
Thesis Supervisor
Certified by
Howard Shrobe
Principal Research Scientist, Computer Science
Thesis Supervisor
Accepted by
Arthur Smith
Chair, Committee on Graduate Students
Multimodal Dynamics:
Self-Supervised Learning in Perceptual and Motor Systems
by
Michael Harlan Coen
Submitted to the Department of Electrical Engineering and Computer Science on
May 25, 2006 in partial fulfillment of the requirements for the degree of Doctor of
Philosophy in Computer Science
ABSTRACT
This thesis presents a self-supervised framework for perceptual and motor learning based upon
correlations in different sensory modalities. The brain and cognitive sciences have gathered an
enormous body of neurological and phenomenological evidence in the past half century
demonstrating the extraordinary degree of interaction between sensory modalities during the
course of ordinary perception. We develop a framework for creating artificial perceptual systems
that draws on these findings, where the primary architectural motif is the cross-modal
transmission of perceptual information to enhance each sensory channel individually. We present
self-supervised algorithms for learning perceptual grounding, intersensory influence, and sensorimotor coordination, which derive training signals from internal cross-modal correlations rather
than from external supervision. Our goal is to create systems that develop by interacting with the
world around them, inspired by development in animals.
We demonstrate this framework with: (1) a system that learns the number and structure of vowels
in American English by simultaneously watching and listening to someone speak. The system
then cross-modally clusters the correlated auditory and visual data. It has no advance linguistic
knowledge and receives no information outside of its sensory channels. This work is the first
unsupervised acquisition of phonetic structure of which we are aware, outside of that done by
human infants. (2) a system that learns to sing like a zebra finch, following the developmental
stages of a juvenile zebra finch. It first learns the song of an adult male and then listens to its own
initially nascent attempts at mimicry through an articulatory synthesizer. In acquiring the
birdsong to which it was initially exposed, this system demonstrates self-supervised sensorimotor
learning. It also demonstrates afferent and efferent equivalence – the system learns motor maps
with the same computational framework used for learning sensory maps.
Thesis Supervisor: Whitman Richards
Title: Professor of Brain and Cognitive Sciences
Thesis Supervisor: Howard Shrobe
Title: Principal Research Scientist, EECS
Acknowledgements
It is safe to say that I have spent more time than most as a student at MIT. After finishing my
S.B. and S.M. degrees, I spent almost five years running the Artificial Intelligence Laboratory’s
Intelligent Room project. At the beginning of 2000, I left MIT to join a start-up company on
Wall Street in NYC. I returned to CSAIL in January of 2005 and have completed this thesis in
the interim.
None of the work in this thesis would exist had I not received the constant encouragement,
mentoring, and friendship of Whitman Richards. My debt to him is enormous, and I am at a loss
for words to thank him properly; my relationship with Whitman has been the most rewarding of
my academic career. Among the great joys of finishing is that we can now collaborate on the
many projects dreamed up over our weekly coffees in the past year and a half.
Howard Shrobe provided tremendous encouragement and welcomed me back to MIT with open
arms, graciously and generously. Without his support too, this work could not have come to
fruition.
In my many years at MIT, a number of faculty members have changed my view of the world.
None has had a greater impact than Rodney Brooks, whose approach to AI has thoroughly
informed my own in the most basic of ways, the evidence of which can be found throughout this
thesis. I have also learned a tremendous amount about science, life, problem solving, and clear
thinking from Patrick Winston, Robert Berwick, and Gerald Sussman. They have all provided
constant encouragement for which I am deeply and forever grateful. Patrick especially has
mentored me through these last months of graduate school, and I continue to learn something new
from him at least once a week.
I have also greatly benefited from interactions with many other MIT faculty members. These
include Boris Katz, Randall Davis, Hal Abelson, Lynn Stein, Gian-Carlo Rota, and George
Boolos. I am indebted to Jim Glass and Eric Grimson, who managed to find time in their
schedules to be on my committee and provided invaluable feedback. I am also grateful to the
fellow members of the BICA project, particularly Sajit Rao, who have provided a regular source
of stimulation, excitement, debate, camaraderie, and encouragement.
In my long tenure here, a number of friends have made a tremendous difference in my life. I
particularly want to thank Rajesh and Becky Kasturirangan, Upendra Chaudhari, Stacy Ho, Ye
Gu, Jeff Kuo, Yuri Bendana, Andy Cytron, Krzysztof Gajos, Brenton Phillips, Joanna Bryson and
Will Lowe, and Kimberley and Damon Asher. I also thank the many students who were involved
in the Intelligent Room project, without whom it could not have existed. Marilyn Pierce in the
EECS Graduate Office made my transition back to student status far more pleasant and
streamlined.
I spent many years living in the community associated with the Chai Odom synagogue in
Brookline, MA. They provided a constant source of love and support, and I particularly thank
Rabbi and Rebbetzin Dovid Moskowitz.
Above all, I thank my family. My parents and siblings never gave up hope I would finish, even
when I was none too sure myself. And foremost among all, I thank my wife Aimee, who
patiently tolerated my constant working late into the night. To her, this thesis and my life are
dedicated.
In memory of my beloved Grandmother, Mrs. Dora Estrin ‫צ״ל‬
March 19, 1910 – September 7, 2003
Funding Acknowledgements
The work in this thesis was supported by the United States Government, for which I am
extremely grateful, through AFOSR under contract #F49610-03-1-0213 and through
AFRL under contract #FA8750-05-2-0274. Thanks to the DARPA/IPTO BICA program
and to AFRL.
Contents

1  Introduction
   1.1  Computational Contributions
   1.2  Theoretical Contributions
   1.3  A Brief Motivation
   1.4  Demonstrations
   1.5  What Is a "Sense?"
   1.6  Roadmap

2  Setting the Stage
   2.1  Peterson and Barney at the World's Fair
   2.2  Nature Does Not Label Its Data
   2.3  Why Is This Difficult?
   2.4  Perceptual Interpretation

3  Perceptual Grounding
   3.1  The Simplest Complex Example
   3.2  Generating Codebooks
   3.3  Generating Slices
   3.4  Hebbian Projection
   3.5  Measuring Distance in a Slice
   3.6  Defining Similarity
        3.6.1  Introduction
        3.6.2  Intuition
        3.6.3  Probabilistic Framework
               3.6.3.1  Kantorovich-Wasserstein Distance
               3.6.3.2  Computational Complexity
               3.6.3.3  The One-to-Many Distance
        3.6.4  Visualizing the Metrics: A Simpler Example
        3.6.5  Defining Similarity
               3.6.5.1  A Word about Generality
   3.7  Defining the Distance between Regions
        3.7.1  Defining a Mutually Iterative System
   3.8  Cross-Modal Clustering
        3.8.1  Defining Self-Distance
        3.8.2  A Cross-Modal Clustering Algorithm
   3.9  Clustering Phonetic Data
   3.10 Related Work
        3.10.1  Language Acquisition
        3.10.2  Machine Vision
        3.10.3  Statistical Clustering
        3.10.4  Blind Signal Separation
        3.10.5  Neuroscientific Models
        3.10.6  Minimally Supervised Models
   3.11 Summary

4  Perceptual Interpretation
   4.1  Introduction
   4.2  The Simplest Complex Example
        4.2.1  Visualizing an Influence Network
        4.2.2  Towards a Dynamic Model
        4.2.3  Intersensory Influence
   4.3  Influence Networks
        4.3.1  Preliminary Concepts
               4.3.1.1  Defining a Surface
               4.3.1.2  Hebbian Gradients
               4.3.1.3  Activation Sets
        4.3.2  Perceptual Trajectories
        4.3.3  Perceptual Dynamics
        4.3.4  An Activation Potential Model
   4.4  Summary

5  Sensorimotor Learning
   5.1  A Sensorimotor Architecture
        5.1.1  A Model of Sensory Perception
        5.1.2  A Simple Model of Innate Motor Activity
        5.1.3  An Integrated Architecture
   5.2  Stages of Sensorimotor Development
        5.2.1  Parental Training
        5.2.2  Internal and External Self-Observation
        5.2.3  Cross-Modal Clustering
        5.2.4  Voluntary Motor Control
   5.3  Learning Birdsong
        5.3.1  Introduction to the Zebra Finch
        5.3.2  Songemes
        5.3.3  Birdsong Analysis
        5.3.4  Articulatory Synthesis
        5.3.5  Experimental Results
   5.4  Summary

6  Biological and Perceptual Connections
   6.1  Sensory Background
   6.2  Intersensory Perception
   6.3  The Relevance
        6.3.1  Perception in Pipelines
        6.3.2  Classical Architecture
        6.3.3  What's Missing Here?
   6.4  Our Direct Motivation
        6.4.1  Two Examples of Cross-Modal Influence
   6.5  Speculation on an Intersensory Model
   6.6  Summary

7  Conclusion
   7.1  Future Work

8  References

Appendix 1
We have sat around for hours and wondered how you look. If you have closed
your senses upon silk, light, color, odor, character, temperament, you must be
by now completely shriveled up. There are so many minor senses, all running
like tributaries into the mainstream of love, nourishing it.
The Diary of Anais Nin (1943)
He plays by sense of smell.
Tommy, The Who (1969)
Chapter 1
Introduction
This thesis presents a unified framework for perceptual and motor learning based upon
correlations in different sensory modalities.
The brain and cognitive sciences have
gathered a large body of neurological and phenomenological evidence in the past half
century demonstrating the extraordinary degree of interaction between sensory modalities
during the course of ordinary perception.
We present a framework for artificial
perceptual systems that draws on these findings, where the primary architectural motif is
the cross-modal transmission of perceptual information to structure and enhance sensory
channels individually. We present self-supervised algorithms for learning perceptual
grounding, intersensory influence, and sensorimotor coordination, which derive training
signals from internal cross-modal correlations rather than from external supervision. Our
goal is to create perceptual and motor systems that develop by interacting with the world
around them, inspired by development in animals.
Our approach is to formalize mathematically an insight in Aristotle's De Anima (350
B.C.E.), that differences in the world are only detectable because different senses
perceive the same world events differently. This implies both that sensory systems need
some way to share their different perspectives on the world and that they need some way
to incorporate these shared influences into their own internal workings.

(A glossary of technical terms is contained in Appendix 1. Our usage of the word "sense" is defined in §1.5.)
We begin with a computational methodology for perceptual grounding, which addresses
the first question that any natural (or artificial) creature faces: what different things in the
world am I capable of sensing? This question is deceptively simple because a formal
notion of what makes things different (or the same) is non-trivial and often elusive. We
will show that animals (and machines) can learn their perceptual repertoires by
simultaneously correlating information from their different senses, even when they have
no advance knowledge of what events these senses are individually capable of
perceiving. In essence, by cross-modally sharing information between different senses,
we demonstrate that sensory systems can be perceptually grounded by mutually
bootstrapping off each other. As a demonstration of this, we present a system that learns
the number (and formant structure) of vowels in American English, simply by watching
and listening to someone speak and then cross-modally clustering the accumulated
auditory and visual data. The system has no advance knowledge of these vowels and
receives no information outside of its sensory channels.
This work is the first
unsupervised acquisition of phonetic structure of which we are aware, at least outside of
that done by human infants, who solve this problem easily.
The second component of this thesis naturally follows perceptual grounding. Once an
animal (or a machine) has learned the range of events it can detect in the world, how does
it know what it's perceiving at any given moment? We will refer to this as perceptual
interpretation. Note that grounding and interpretation are different things. By way of
analogy to reading, one might say that grounding provides the dictionary and
interpretation explains how to disambiguate among possible word meanings. More
formally, grounding is an ontological process that defines what is perceptually knowable,
and interpretation is an algorithmic process that describes how perceptions are
categorized within a grounded system. We will present a novel framework for perceptual
interpretation called influence networks (unrelated to a formalism known as influence
diagrams) that blurs the distinctions between different sensory channels and allows them
to influence one another while they are in the midst of perceiving. Biological perceptual
systems share cross-modal information routinely and opportunistically (Stein and
Meredith 1993, Lewkowicz and Lickliter 1994, Rock 1997, Shimojo and Shams 2001,
Calvert et al. 2004, Spence and Driver 2004); intersensory influence is an essential
component of perception but one that most artificial perceptual systems lack in any
meaningful way. We argue that this is among the most serious shortcomings facing
them, and an engineering goal of this thesis is to propose a workable solution to this
problem.
The third component of this thesis enables sensorimotor learning using the first two
components, namely, perceptual grounding and interpretation. This is surprising because
one might suppose that motor activity is fundamentally different than perception.
However, we take the perspective that motor control can be seen as perception
backwards. From this point of view, we imagine that – in a notion reminiscent of a
Cartesian theater – an animal can "watch" the activity in its own motor cortex, as if it
were a privileged form of internal perception. Then for any motor act, there are two
associated perceptions – the internal one describing the generation of the act and the
external one describing the self-observation of the act.
The perceptual grounding
framework described above can then cross-modally ground these internal and external
perceptions with respect to one another. The power of this mechanism is that it can learn
mimicry, an essential form of behavioral learning (see the developmental sections of
Meltzoff and Prinz 2002) where one animal acquires the ability to imitate some aspect of
another's activity, constrained by the capabilities and dynamics of its own sensory and
motor systems. We will demonstrate sensorimotor learning in our framework with an
artificial system that learns to sing like a zebra finch by first listening to a real bird sing
and then by learning from its own initially uninformed attempts to mimic it.
This thesis has been motivated by surprising results about how animals process sensory
information. These findings, gathered by the brain and cognitive sciences communities
primarily over the past 50 years, have challenged long-held notions about how
the brain works and how we experience the world in which we live. We argue that
current approaches to building computers that perceive and interact with the real, human
world are largely based upon developmental and structural assumptions, tracing back
several hundred years, that are no longer thought to be descriptively or biologically
accurate. In particular, the notion that perceptual senses are in functional isolation – that
they do not internally structure and influence each other – is no longer tenable, although
we still build artificial perceptual systems as if it were.
1.1 Computational Contributions
This thesis introduces three new computational tools. The first is a mathematical model
of slices, which are a new type of data structure for representing sensory inputs. Slices
are topological manifolds that encode dynamic perceptual states and are inspired by
surface models of cortical tissue (Dale et al. 1999, Fischl et al. 1999, Citti and Sarti 2003,
Ratnanather et al. 2003). They can represent both symbolic and numeric data and
provide a natural foundation for aggregating and correlating information.
Slices
represent the data in a perceptual system, but they are also amodal, in that they are not
specific to any sensory representation. For example, we may have slices containing
visual information and other slices containing auditory information, but it may not be
possible to distinguish them further without additional information. In fact, we can
equivalently represent either sensory or motor information within a slice. This generality
will allow us to easily incorporate the learning of motor control into what is initially a
perceptual framework.
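To make the idea a bit more concrete, the sketch below shows one minimal way a slice-like
structure could be represented in code. It is only an illustration of the general notion (an
amodal container of prototype points plus co-occurrence counts with regions of another slice),
not the data structure defined in Chapter 3, and the class and method names are hypothetical.

```python
from collections import defaultdict
import numpy as np

class Slice:
    """A minimal, hypothetical sketch of a slice: prototype points in some feature
    space plus co-occurrence counts with regions of one other slice. The structure
    is amodal: it does not know whether its data are auditory, visual, or motor."""

    def __init__(self, prototypes):
        self.prototypes = np.asarray(prototypes, dtype=float)    # (n, d) region centers
        self.cooccur = defaultdict(lambda: defaultdict(int))     # my region -> other region -> count

    def region_of(self, x):
        """Index of the nearest prototype, i.e., the region an input falls into."""
        return int(np.argmin(np.linalg.norm(self.prototypes - np.asarray(x), axis=1)))

    def observe(self, x, other, y):
        """Record that input x here co-occurred with input y in another slice."""
        self.cooccur[self.region_of(x)][other.region_of(y)] += 1
```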
The second tool is an algorithm for cross-modal clustering, which is an unsupervised
technique for organizing slices based on their spatiotemporal correlations with other
slices. These correlations exist because an event in the world is simultaneously – but
differently – perceived through multiple sensory channels in an observer. The hypothesis
underlying this approach is that the world has regularities – natural laws tend to correlate
physical properties (Thompson 1917, Richards 1980, Mumford 2004) – and biological
perceptory systems have evolved to take advantage of this. One may contrast this with
mathematical approaches to clustering where some knowledge of the clusters, e.g., how
many there are or their distributions, must be known a priori in order to derive them.
Without knowing these parameters in advance, many algorithmic clustering techniques
may not be robust (Kleinberg 2002, Still and Bialek 2004). Assuming that in many
circumstances animals cannot know the parameters underlying their perceptual inputs,
how can they learn to organize their sensory perceptions?
Cross-modal clustering
answers this question by exploiting naturally occurring intersensory correlations.
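As a toy illustration of the intuition (and only the intuition; the actual algorithm is
developed in Chapter 3), one could group two regions of one mode together whenever they
co-occur most often with the same region of the other mode. The function below is a
hypothetical sketch of that idea.

```python
from collections import defaultdict

def cross_modal_merge(cooccur):
    """Toy sketch: group regions of mode A that co-occur most often with the same
    region of mode B, on the assumption that they reflect the same world event.

    cooccur: dict mapping region-of-A -> {region-of-B: co-occurrence count}
    """
    groups = defaultdict(list)
    for region_a, counts in cooccur.items():
        best_partner = max(counts, key=counts.get)    # B-region that best explains region_a
        groups[best_partner].append(region_a)
    return list(groups.values())

# A's regions 0 and 2 both fire mainly with B's region 1, so they are grouped together.
print(cross_modal_merge({0: {1: 9, 0: 1}, 1: {0: 8}, 2: {1: 7}}))   # -> [[0, 2], [1]]
```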
The third tool in this thesis is a new family of models called influence networks
(Figure 1.1).
Influence networks use slices to interconnect independent perceptual
systems, such as those illustrated in the classical view in Figure 1.1a, so they can
influence one another during perception, as proposed in Figure 1.1b. Influence networks
dynamically modify percepts within these systems to effect influence among their
different components. The influence is designed to increase perceptual accuracy within
individual perceptual channels by incorporating information from other co-occurring
senses. More formally, influence networks are designed to move ambiguous perceptual
inputs into easily recognized subsets of their representational spaces. In contrast with
approaches taken in engineering what are typically called multimodal systems, influence
networks are not intended to create high-level joint perceptions. Instead, they share
sensory information across perceptual channels to increase local perceptual accuracy
within the individual perceptual channels themselves. As we discuss in Chapter 6, this
type of cross-modal perceptual reinforcement is ubiquitous in the animal world.
Figure 1.1– Adding an influence network to two preexisting systems. We start in (a) with two pipelined
networks that independently compute separate functions. In (b), we compose on each function a
corresponding influence function, which dynamically modifies its output based on activity at the other
influence functions. The interaction among these influence functions is described by an influence network,
which is defined in Chapter 4. The parameters describing this network can be found via unsupervised
learning for a large class of perceptual systems, due to correspondences in the physical events that generate
the signals they perceive and to the evolutionary incorporation of these regularities into the biological
sensory systems that these computational systems model. Note influence networks are distinct from an
unrelated formalism called influence diagrams.
1.2 Theoretical Contributions
The work presented here addresses several important problems. From an engineering
perspective, it provides a principled, neurologically informed approach to building
complex, interactive systems that can learn through their own experiences. In perceptual
domains, it answers a fundamental question in mathematical clustering: how should an
unknown dataset be clustered?
The connection between clustering and perceptual
grounding follows from the observation that learning to perceive is learning to organize
perceptions into meaningful categories. From this perspective, asking what an animal
can perceive is equivalent to asking how it should cluster its sensory inputs. This thesis
presents a self-supervised approach to this problem, meaning our sub-systems derive
feedback from one another cross-modally rather than rely on an external tutor such as a
parent (or a programmer). Our approach is also highly nonparametric, in that it presumes
that neither the number of clusters nor their distributions are known in advance,
conditions which tend to defy other algorithmic techniques.
The benefits of self-
supervised learning in perceptual and motor domains are enormous because engineered
approaches tend to be ad hoc and error prone; additionally, in sensorimotor learning we
generally have no adequate models to specify the desired input/output behaviors for our
systems.
The notion of programming by example is nowhere truer than in the
developmental mimicry widespread in the animal kingdom (Meltzoff and Prinz 2002), and
this work is a step in that direction for artificial sensorimotor systems.
Furthermore, beyond the well-established observation that the senses influence each other
during perception, this thesis proposes that perceptual channels
cooperatively structure their internal representations. This mutual structuring is a basic
feature in our approach to perceptual grounding. We argue, however, that it is not simply
an epiphenomenon; rather, it is a fundamental component of perception itself, because it
provides representational compatibility for sharing information cross-modally during
higher-level perceptual processing. The inability to share perceptual data is one of the
major shortcomings in current engineered approaches to building interactive systems.
Finally, within this framework, we will address three questions that are basic to
developing a coherent understanding of cross-modal perception. They concern both
process and representation and raise the possibility that unifying (i.e. meta-level)
principles might govern intersensory function:
1) Can the senses be perceptually grounded by bootstrapping off each other? Is
shared experience sufficient for learning how to categorize sensory inputs?
2) How can seemingly different senses share information? What representational
and computational restrictions does this place upon them?
3) Could the development of motor control use the same mechanism? In other
words, can there be afferent and efferent equivalence in learning?
1.3 A Brief Motivation
The goal of this thesis is to propose a design for artificial systems that more accurately
reflects how animal brains appear to process sensory inputs. In particular, we argue
against post-perceptual integration, where the sensory inputs are separately processed in
isolated, increasingly abstracted pipelines and then merged in a final integrative step as in
Figure 1.2. Instead, we argue for cross-modally integrated perception, where the senses
share information during perception that synergistically enhances them individually, as in
Figure 1.1b. The main difficulty with the post-perceptual approach is that integration
happens after the individual perceptions are generated. Integration occurs after each
perceptual subsystem has already “decided” what it has perceived, when it is too late for
intersensory influence to affect the individual, concurrent perceptions. This is due to
information loss from both vector quantization and the explicit abstraction fundamental
to the pipeline design. Most importantly, these approaches also preclude cooperative
perceptual grounding; the bootstrapping provided by cross-modal clustering cannot occur
when sensory systems are independent. These architectures are therefore also at odds
with developmental approaches to building interactive systems.
Not only is the post-perceptual approach to integration biologically implausible from a
scientific perspective, it is poor engineering as well.
The real world is inherently
multimodal in a way that most modern artificial perceptual systems do not capture or take
advantage of. Isolating sensory inputs while they are being processed prevents the lateral
sharing of information across perceptual channels, even though these sensory inputs are
inherently linked by the physics of the world that generates them. Furthermore, we will
argue that the co-evolution of senses within an individual species provided evolutionary
pressure towards representational and algorithmic compatibilities essentially unknown in
modern artificial perception. These issues are examined in detail in Chapter 6.
Our work is computationally motivated by Gibson (1950, 1987), who viewed perception
as an external as well as an internal event, by Brooks (1986, 1991), who elevated
perception onto an equal footing with symbolic reasoning, and by Richards (1988), who
described how to exploit regularities in the world to make learning easier. The recursive
use of a perceptual mechanism to enable sensorimotor learning in Chapter 5 is a result of
our exposure to the ideas of Sussman and Abelson (1983).
Figure 1.2 – Classical approaches to post-perceptual integration in traditional multimodal systems. Here,
auditory (A) and visual (V) inputs pass through specialized unimodal processing pathways and are
combined via an integration mechanism, which creates multimodal perceptions by extracting and
reconciling data from the individual channels. Integration can happen earlier (a) or later (b). Hybrid
architectures are also common. In (c), multiple pathways process the visual input and are pre-integrated
before the final integration step; for example, the output of this preintegration step could be spatial
localization derived solely through visual input. This diagram is modeled after (Stork and Hennecke 1996).
1.4 Demonstrations
The framework and its instantiation will be evaluated by a set of experiments that explore
perceptual grounding, perceptual interpretation, and sensorimotor learning. These will
be demonstrated with:
1) Phonetic learning: We present a system that learns the number and formant
structure of vowels (monophthongs) in American English, simply by watching
and listening to someone speak and then cross-modally clustering the
accumulated auditory and visual data. The system has no advance knowledge of
these vowels and receives no information outside of its sensory channels. This
work is the first unsupervised machine acquisition of phonetic structure of which
we are aware.
2) Speechreading: We incorporate an influence network into the cross-modally
clustered slices obtained in Experiment 1 to increase the accuracy of perceptual
classification within the slices individually.
In particular, we demonstrate the
ability of influence networks to move ambiguous perceptual inputs to
unambiguous regions of their perceptual representational spaces.
3) Learning birdsong: We will demonstrate self-supervised sensorimotor learning
with a system that learns to mimic a Zebra Finch. The system is directly modeled
on the dynamics of how male baby finches learn birdsong from their fathers
(Tchernichovski et al. 2004, Fee et al. 2004). Our system first listens to an adult
finch and uses cross-modal clustering to learn songemes, primitive units of bird
song that we propose as an avian equivalent of phonemes.
It then uses a
vocalization synthesizer to generate its own nascent birdsong, guided by random
exploratory motor behavior. By listening to itself sing, the system organizes the
motor maps generating its vocalizations by cross-modally clustering them with
respect to the previously learned songeme maps of its parent. In this way, it
learns to generate the same sounds to which it was previously exposed. Finally,
we incorporate a standard hidden Markov model into this system, to model the
temporal structure and thereby combine songemes into actual birdsong. The
Zebra Finch is a particularly suitable species to use for guiding this
demonstration, as each bird essentially sings a single unique song accompanied
by minor variations.
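As a loose illustration of the last step above (a plain first-order Markov chain rather than
the hidden Markov model actually used in the thesis; all names below are hypothetical), one can
estimate transition probabilities between songeme labels from observed sequences and then
sample new label sequences from them:

```python
import numpy as np

def fit_transitions(sequences, n_songemes):
    """Estimate a first-order transition matrix from observed songeme label sequences."""
    counts = np.ones((n_songemes, n_songemes))        # add-one smoothing
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample_song(transitions, start, length, rng=np.random.default_rng(2)):
    """Generate a new songeme sequence by walking the transition matrix."""
    song = [start]
    for _ in range(length - 1):
        song.append(int(rng.choice(len(transitions), p=transitions[song[-1]])))
    return song

T = fit_transitions([[0, 1, 2, 1, 2, 3], [0, 1, 2, 3]], n_songemes=4)
print(sample_song(T, start=0, length=6))
```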
We note that the above examples all use real data, gathered from a real person speaking
and from a real bird singing. We also present results on a number of synthetic datasets
drawn from a variety of mixture distributions to provide basic insights into how the
algorithms and the slice data structure work.
Finally, we believe it is possible to allow the
computational side of this question to inform the biological one, and we will analyze the
model, in its own right and in light of these results, to explore its algorithmic and
representational implications for biological perceptual systems, particularly from the
perspective of how sharing information restricts the modalities individually.
1.5 What Is a "Sense?"
Although Appendix 1 contains a glossary of technical terms, one clarification is so
important that it deserves special mention. We have repeatedly used the word sense, e.g.,
sense, sensory, intersensory, etc., without defining what a sense is. One generally thinks
of a sense as the perceptual capability associated with a distinct, usually external, sensory
organ. It seems quite natural to say vision is through the eyes, touch is through the skin,
etc. (Notable exceptions include proprioception – the body's sense of internal state –
which is somewhat more difficult to localize, and vestibular perception, which occurs
mainly in the inner ear but is not necessarily experienced there.) However, this coarse
definition of sense is misleading.
Each sensory organ provides an entire class of sensory capabilities, which we will
individually call modes. For example, we are familiar with the bitterness mode of taste,
which is distinct from other taste modes such as sweetness. In the visual system, object
segmentation is a mode that is distinct from color perception, which is why we can
appreciate black and white photography.
Most importantly, individuals may lack
particular modes without other modes in that sense being affected (e.g., Wolfe 1983),
thus demonstrating they are phenomenologically independent. For example, people who
like broccoli are insensitive to the taste of the chemical phenylthiocarbamide (Drayna et
al. 2003); however, we would not say these people are unable to taste – they are simply
missing an individual taste mode. There are a wide variety of visual agnosias that
selectively affect visual experience, e.g., simultanagnosia is the inability to perform
visual object segmentation, but we certainly would not consider a patient with this deficit
to be blind, as it leaves the other visual processing modes intact.
Considering these fine grained aspects of the senses, we allow intersensory influence to
happen between modes even within the same sensory system, e.g., entirely within vision,
or alternatively, between modes in different sensory systems, e.g., in vision and audition.
Because the framework presented here is amodal, i.e., not specific to any sensory system
or mode, it treats both of these cases equivalently.
1.6 Roadmap
Chapter 2 sets the stage for the rest of this thesis by visiting an example stemming from
the 1939 World's Fair. It intuitively makes clear what we mean by perceptual grounding
and interpretation, which until now have remained somewhat abstract.
Chapter 3 presents our approach to perceptual grounding by introducing slices, a data
structure for representing sensory information. We then define our algorithm for cross-modal clustering, which autonomously learns perceptual categories within slices by
considering how the data within them co-occur.
We demonstrate this approach in
learning the vowel structure of American English by simultaneously watching and
listening to a person speak.
Finally, we examine and contrast related work in
unsupervised clustering with our approach.
Chapter 4 builds upon the results in Chapter 3 to present our approach to perceptual
interpretation. We incorporate the temporal dynamics of sensory perception by treating
slices as phase spaces through which sensory inputs move during the time windows
corresponding to percept formation. We define a dynamic activation model on slices and
interconnect them through an influence network, which allows different modes to
influence each other's perceptions dynamically. We then examine using this framework
to disambiguate simultaneous audio-visual speech inputs. Note that this mathematical
chapter may be skipped on a cursory reading of this thesis.
Chapter 5 builds upon the previous two chapters to define our architecture for
sensorimotor learning, based on a Cartesian theater.
Our system simultaneously
"watches" its internal motor activity while it observes the effects of its own actions
externally.
Cross-modal clustering then allows it to ground its motor maps using
previously clustered perceptual maps. This is possible because slices can equivalently
contain perceptual or motor data, and in fact, slices do not "know" what kind of data they
contain.
The principal example in this chapter is the acquisition of species-specific
birdsong.
Chapter 6 connects the work in the computational framework presented in this thesis with
a modern understanding of perception in biological systems. Doing so motivates the
approach taken here and allows us to suggest how this work may reciprocally contribute
towards a better computational understanding of biological perception. We also examine
related work in multimodal integration and the engineered system that motivated much of
the work in this thesis. Finally, we speculate on a number of theoretical issues in
intersensory perception and discuss how the work in this thesis addresses them.
Chapter 7 contains a brief summary of the contributions of this thesis and outlines future
work.
Chapter 2
Setting the Stage
We begin with an example to illustrate the two fundamental problems of perception
addressed in this thesis:
1) Grounding – how are sensory inputs categorized in a perceptual system?
2) Interpretation – how should sensory inputs be classified once their possible
categories are known?
The example presented below concerns speechreading, but the techniques presented in
later chapters for solving the problems raised here are not specific to any perceptual
modality. They can be applied to a range of perceptual and motor learning problems, and
we will examine some of their nonperceptual applications as well.
2.1 Peterson and Barney at the World's Fair
Our example begins with the 1939 World’s Fair in New York, where Gordon Peterson
and Harold Barney (1952) collected samples of 76 speakers saying sustained American
English vowels. They measured the fundamental frequency and first three formants (see
Figure 2.1) for each sample and noticed that when plotted in various ways (Figure 2.2),
different vowels fell into different regions of the formant space. This regularity gave
hope that spoken language – at least vowels – could be understood through accurate
estimation of formant frequencies. This early hope was dashed in part because
coarticulation effects lead to considerable movement of the formants during speech
(Holbrook and Fairbanks 1962). Although formant-based classifications were largely
abandoned in favor of the dynamic pattern matching techniques commonly used today
(Jelinek 1997), the belief persists that formants are potentially useful in speech
recognition, particularly for dimensional reduction of data.

Figure 2.1 – On the left is a spectrogram of the author saying, "Hello." The demarcated region (from
690-710ms) marks the middle of phoneme /ow/, corresponding to the middle of the vowel "o" in "hello." The
spectrum corresponding to this 20ms window is shown on the right. A 12th order linear predictive coding
(LPC) model is shown overlaid, from which the formants, i.e., the spectral peaks, are estimated. In this
example: F1 = 266Hz, F2 = 922Hz, and F3 = 2531Hz. Formants above F3 are generally ignored for sound
classification because they tend to be speaker dependent. Notice that F2 is slightly underestimated in this
example, a reflection of the heuristic nature of computational formant determination.
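For readers unfamiliar with formant estimation, the following sketch shows the standard
autocorrelation-LPC approach described in the Figure 2.1 caption: fit a 12th-order all-pole
model to a short windowed frame and read the formants off the angles of the model's roots.
It is a generic illustration under the usual textbook assumptions, not the tool used to
produce the figure.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for LPC coefficients a (with a[0] = 1)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]      # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return a

def estimate_formants(frame, sample_rate, order=12):
    """Estimate formant frequencies (Hz) of one short speech frame from an LPC model."""
    frame = frame * np.hamming(len(frame))                  # taper the analysis window
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = levinson_durbin(r, order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                       # keep one root per conjugate pair
    freqs = np.angle(roots) * sample_rate / (2.0 * np.pi)   # pole angle -> frequency
    return sorted(f for f in freqs if f > 90.0)             # drop near-DC poles; F1, F2, F3, ...
```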
It has long been known that watching the movement of a speaker’s lips helps people
understand what is being said (Bender 1981, p41). The sight of someone’s moving lips
in an environment with significant background noise makes it easier to understand what
the speaker is saying; visual cues – e.g., the sight of lips – can alter the signal-to-noise
ratio of an auditory stimulus by 15-20 decibels (Sumby and Pollack 1954). The task of
lip-reading has by far been the most studied problem in the computational multimodal
literature (e.g., Mase and Pentland 1990, Huang et al. 2003, Potamianos et al. 2004), due
to the historic prominence of automatic speech recognition in computational perception.

Figure 2.2 – Peterson and Barney data. On the left is a scatterplot of the first two formants, with
different regions labeled by their corresponding vowel categories.

Figure 2.3 – Automatically tracking the mouth positions of a test subject in a video stream. Lip positions
are found via a deformable template and fit to an ellipse using least squares. The upper images contain
excerpts from speech segments, corresponding left to right with the phonemes /eh/, /ae/, /uw/, /ah/, and /iy/.
The bottom row contains non-speech mouth positions. Notice that fitting the mouth to an ellipse can be
non-optimal, as is the case with the two left-most images; independently fitting the upper and lower lip
curves to low-order polynomials would yield a better fit. For the purposes of this example, however,
ellipses provide an adequate, distance-invariant, and low-dimensional model. The author is indebted to his
wife for having lips that were computationally easy to detect.
Although significant progress has been made in automatic speech recognition, state of the
art performance has lagged human speech perception by up to an order of magnitude,
even in highly controlled environments (Lippmann 1997). In response to this, there has
been increasing interest in non-acoustic sources of speech information, of which vision
has received the most attention. Information about articulator position is of particular
interest, because in human speech, acoustically ambiguous sounds tend to have visually
unambiguous features (Massaro and Stork 1998). For example, visual observation of
tongue position and lip contours can help disambiguate unvoiced velar consonants /p/ and
/k/, voiced consonants /b/ and /d/, and nasals /m/ and /n/, all of which can be difficult to
distinguish on the basis of acoustic data alone.
Articulation data can also help to disambiguate vowels. Figure 2.3 contains images of a
speaker voicing different sustained vowels, corresponding to those in Figure 2.2. These
images are the unmodified output of a mouth tracking system written by the author,
where the estimated lip contour is displayed as an ellipse and overlaid on top of the
speaker’s mouth.
The scatterplot in Figure 2.4 shows how a speaker’s mouth is
represented in this way, with contour data normalized such that a resting mouth
configuration (referred to as null in the figure) corresponds with the origin, and other
mouth positions are viewed as offsets from this position. For example, when the subject
makes an /iy/ sound, the ellipse is elongated along its major axis, as reflected in the
scatterplot.

Figure 2.4 – Modeling lip contours with an ellipse. The scatterplot shows normalized major (x) and minor
(y) axes for ellipses corresponding to the same vowels as those in Figure 2.2. In this space, a closed mouth
corresponds to a point labeled null. Other lip contours can be viewed as offsets from the null configuration
and are shown here segmented by color. These data points were collected from video of this woman
speaking.
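A rough sketch of this representation is shown below: approximate the tracked lip contour by
an ellipse (here via the principal axes of the contour points, a simple stand-in for the
least-squares fit mentioned in Figure 2.3) and express each mouth position as the offset of
its major and minor axes from the resting, "null" configuration. The function names and the
PCA shortcut are illustrative assumptions, not the author's tracker.

```python
import numpy as np

def ellipse_axes(contour_xy):
    """Approximate a lip contour by an ellipse; return rough (major, minor) axis lengths.

    contour_xy: (N, 2) array of lip-contour points from a mouth tracker.
    """
    centered = contour_xy - contour_xy.mean(axis=0)
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]   # variances along principal axes
    return 2.0 * np.sqrt(eigvals)                                     # crude axis-length estimates

def mouth_feature(contour_xy, null_axes):
    """Represent a mouth position as an offset from the resting ('null') configuration."""
    return ellipse_axes(contour_xy) - np.asarray(null_axes)
```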
Suppose we now consider the formant and lip contour data simultaneously, as in Figure
2.5.
Because the data are conveniently labeled, the clusters within and the
correspondences between the two scatterplots are obvious. We notice that the two
domains can mutually disambiguate one another. For example, /er/ and /uh/ are difficult
to separate acoustically with formants but are easy to distinguish visually. Conversely,
/ae/ and /eh/ are visually similar but acoustically distinct. Using these complementary
representations, one could imagine combining the auditory and visual information to
create a simple speechreading system for vowels.
2.2 Nature Does Not Label Its Data
Given this example, it may be surprising that our interest here is not in building a
speechreading system. Rather, we are concerned with a more fundamental problem: how
do sensory systems learn to segment their inputs to begin with? In the color-coded plots
in Figure 2.5, it is easy to see the different represented categories. However, perceptual
events in the world are generally not accompanied by explicit category labels. Instead,
animals are faced with data like those in Figure 2.6 and must somehow learn to make
sense of them. We want to know how the categories are learned in the first place. We
note that this learning process is not confined to development, as perceptual
correspondences are plastic and can change over time.

Figure 2.5 – Labeled scatterplots side by side. Formant data (from Peterson and Barney 1952) are displayed
on the left and lip contour data (from the author's wife) are shown on the right. Each plot contains data
corresponding to the ten listed vowels in American English.
We would therefore like to have a general purpose way of taking data (such as shown in
Figure 2.6) and deriving the kinds of correspondences and segmentations (as shown in
Figure 2.5) without external supervision. This is what we mean by perceptual grounding,
and our perspective here is that it is a clustering problem: animals must learn to organize
their perceptions into meaningful categories. We examine below why this is a
challenging problem.

Figure 2.6 – Unlabeled data. These are the same data shown above in Figure 2.5, with the labels removed.
This picture is closer to what animals actually encounter in Nature. As above, formants are displayed on
the left and lip contours are on the right. Our goal is to learn the categories present in these data without
supervision, so that we can automatically derive categories and clusters such as those shown above.
2.3 Why Is This Difficult?
As we have noted above, Nature does not label its data. By this, we mean that the
perceptual inputs animals receive are not generally accompanied by any meta-level data
explaining what they represent. Our framework must therefore assume the learning is
unsupervised, in that there are no data outside of the perceptual inputs themselves
available to the learner.
From a clustering perspective, perceptual data is highly non-parametric in that both the
number of clusters and their underlying distributions may be unknown.
Clustering
algorithms generally make strong assumptions about one or both of these. For example,
the Expectation Maximization algorithm (Dempster et al. 1977) is frequently used as a basis
for clustering mixtures of distributions whose maximum likelihood estimation is easy to
compute. This algorithm is therefore popular for clustering known finite numbers of
Gaussian mixture models (e.g., Nabney 2002, Witten and Frank 2005). However, if the
number of clusters is unknown, the algorithm tends to converge to a local minimum with
the wrong number of clusters. Also, if the data deviate from a mixture of Gaussian (or
some expected) distributions, the assignment of clusters degrades accordingly. More
generally, when faced with nonparametric, distribution-free data, algorithmic clustering
techniques tend not to be robust (Fraley and Raftery 2002, Still and Bialek 2004).
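The point is easy to reproduce with an off-the-shelf EM implementation. The snippet below
(using scikit-learn's GaussianMixture purely for illustration; it is not part of this thesis)
fits the same two-component synthetic data with a correct and an incorrect assumption about
the number of components, and the fit "succeeds" either way.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Unlabeled samples from two overlapping 2-D Gaussians, standing in for perceptual data.
data = np.vstack([
    rng.normal(loc=[-1.0, 0.0], scale=0.6, size=(300, 2)),
    rng.normal(loc=[+1.0, 0.0], scale=0.6, size=(300, 2)),
])

# EM needs the number of clusters up front; with the wrong guess it still converges,
# just to a different (and here, worse) model that only a model-selection score reveals.
for k in (2, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
    print(f"k={k}: converged={gmm.converged_}, BIC={gmm.bic(data):.1f}")
```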
Perceptual data are also noisy. This is due both to the enormous amount of variability in
the world and to the probabilistic nature of the neuronal firings that are responsible for
the perception (and sometimes the generation) of perceivable events. The brain itself
introduces a great deal of uncertainty into many perceptual processes. In fact, one may
perhaps view the need for high precision as the exception rather than the rule. For
example, during auditory localization based on interaural time delays, highly specialized
neurons known as the end-bulbs of Held – among the largest neuronal structures in the
brain – provide the requisite accuracy by making neuronal firings in this section of
auditory cortex highly deterministic (Trussell 1999).
It appears that the need for
mathematical precision during perceptual processing can require extraordinary
neuroanatomical specialization.
Perhaps most importantly, perceptual grounding is difficult because there is no objective
mathematical definition of "coherence" or "similarity." In many approaches to clustering,
each cluster is represented by a prototype that, according to some well-defined measure,
is an exemplar for all other data it represents. However, in the absence of fairly strong
assumptions about the data being clustered, there may be no obvious way to select this
measure. In other words, it is not clear how to formally define what it means for data to
be objectively similar or dissimilar. In perceptual and cognitive domains, it may also
depend on why the question of similarity is being asked.
Consider a classic AI
conundrum, "what constitutes a chair?" (Winston 1970, Minsky 1974, Brooks 1987).
For many purposes, it may be sufficient to respond, "anything upon which one can sit."
However, when decorating a home, we may prefer a slightly more sophisticated answer.
Although this is a higher level distinction than the ones we examine in this thesis, the
principle remains the same and reminds us why similarity can be such a difficult notion
to pin down.
Finally, even if we were to formulate a satisfactory measure of similarity for static data,
one might then ask how this measure would behave in a dynamic system.
Many
perceptual (and motor) systems are inherently dynamic – they involve processes with
complex, non-linear temporal behavior (Thelen and Smith 1994), as can be seen during
perceptual bistability, cross-modal influence, habituation, and priming. Thus, one may
ask whether a similarity metric captures a system's temporal dynamics; in a clustering
domain, the question might be posed: do points that start out in the same cluster end up
in the same cluster? We know from Lorentz (1964) that it is possible for arbitrarily small
differences to be amplified in a non-linear system. It is quite plausible that a static
similarity metric might be oblivious to a system's temporal dynamics, and therefore,
sensory inputs that initially seem almost identical could lead to entirely different percepts
being generated. This issue will be raised in more detail in Chapter 4, where we will
view clusters as fixed points in representational phase spaces in which perceptual inputs
follow trajectories between different clusters.

Figure 2.7 – On the left is a scatterplot of the first two formants, with different regions labeled by their
corresponding vowel categories. The output of a backpropagation neural network trained on this data is
shown on the right and displays decision boundaries and misclassified points. The misclassification error
in this case is 19.7%. Other learning algorithms, e.g., AdaBoost using C4.5, boosted stumps with
LogitBoost, and SVM with a 5th order polynomial kernel, have all shown similarly lackluster performance,
even when additional dimensions (corresponding to F0 and F3) are included (Klautau 2002). (Figure on
right is derived from ibid. and used with permission.)
In Chapter 3, we will present a framework for perceptual grounding that addresses many
of the issues raised here. We show that animals (and machines) can learn how to cluster
their perceptual inputs by simultaneously correlating information from their different
senses, even when they have no advance knowledge of what events these senses are
individually capable of perceiving.
By cross-modally sharing information between
different senses, we will demonstrate that sensory systems can be perceptually grounded
by bootstrapping off each other.
2.4 Perceptual Interpretation
The previous section outlined some of the difficulties in unsupervised clustering of
nonparametric sensory data.
However, even if the data came already labeled and
clustered, it would still be challenging to classify new data points using this information.
Determining how to assign a new data point to a preexisting cluster (or category) is what
we mean by perceptual interpretation. It is the process of deciding what a new input
actually represents. In the example here, the difficulty is due to the complexity of
partitioning formant space to separate the different vowels. This 50-year-old
classification problem still receives attention today (e.g., Jacobs et al. 1991, de Sa and
Ballard 1998, Clarkson and Moreno 1999) and Klautau (2002) has surveyed modern
machine learning algorithms applied to it, an example of which is shown on the right in
Figure 2.7.
A common way to distinguish classification algorithms is by visualizing the different
spaces of possible decision boundaries they are capable of learning. If one closely
examines the Peterson and Barney dataset (Figure 2.8), there are many pairs of points that
are nearly identical in any formant space but correspond to different vowels in the actual
data, at least according to the speaker’s intention. It is difficult to imagine any accurate
partitioning that would simultaneously avoid overfitting. There are many factors that
contribute to this, including the information loss of formant analysis (i.e., incomplete data
is being classified), computational errors in estimating the formants, lack of
differentiation in vowel pronunciation in different dialects of American English,
variations in prosody, and individual anatomical differences in the speakers' vocal tracts.
It is worth pointing out that the latter three of these for the most part exist independently
of how data is extracted from the speech signal and may present difficulties regardless of
how the signal is processed.

Figure 2.8 – Focusing on one of many ambiguous regions in the Peterson-Barney dataset (F1 vs. F2, in Hz).
Due to a confluence of factors described in the text, the data in these regions are not obviously separable.
The curse of dimensionality (Bellman 1961) is a statement about exponential growth in
hypervolume as a function of a space’s dimension. Of its many ramifications, the most
important here is that many low dimensional phenomena that we are intuitively familiar
with do not exist in higher dimensions. For example, the natural clustering of uniformly
distributed random points in a two dimensional space becomes extremely unlikely in
higher dimensions; in other words, random points are relatively far apart in high
dimensions. In fact, transforming nonseparable samples into higher dimensions is a
general heuristic for improving separation with many classification schemes. There is a
flip-side to this high dimensional curse for us: low dimensional spaces are crowded. It
can be difficult to separate classes in these spaces because of their tendency to overlap.
However, blaming low dimensionality for this problem is like the proverbial cursing of
darkness. Cortical architectures make extensive use of low dimensional spaces, e.g.,
throughout visual, auditory, and somatosensory processing (Amari 1980, Swindale 1996,
Dale et al. 1999, Fischl et al. 1999, Kaas and Hackett 2000, Kardar and Zee 2002, Bednar
et al. 2004), and this was a primary motivating factor in the development of Self
Organizing Maps (Kohonen 1984).
In these crowded low-dimensional spaces,
approaches that try to implicitly or explicitly refine decision boundaries such as those in
Figure 2.8 (e.g., de Sa 1994) are likely to meet with limited success because there may be
no good decision boundaries to find; perhaps in these domains, decision boundaries are
the wrong way to think about the problem.
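A few lines of simulation make the dimensional point concrete (a generic illustration, not an
experiment from the thesis): the average nearest-neighbor distance among uniformly random
points grows rapidly with dimension, so the apparent crowding seen in two dimensions disappears.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)

def mean_nn_distance(dim, n=500):
    """Average nearest-neighbor distance among n uniform random points in [0, 1]^dim."""
    pts = rng.random((n, dim))
    d = cdist(pts, pts)                   # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)           # ignore self-distances
    return d.min(axis=1).mean()

for dim in (2, 10, 100):
    print(f"dim={dim:3d}  mean nearest-neighbor distance ~ {mean_nn_distance(dim):.2f}")
```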
Rather than trying to improve classification boundaries directly, one could instead look
for a way to move ambiguous inputs into easily classified subsets of their representational
spaces. This is the essence of the influence network approach presented in Chapter 4 and
is our proposed solution to the problem of perceptual interpretation. The goal is to use
cross-modal information to "move" sensory inputs within their own state spaces to make
them easier to classify. Thus, we take the view that perceptual interpretation is inherently
a dynamic – rather than static – process that occurs during some window of time. This
approach relaxes the requirement that the training data be separable in the traditional
machine learning sense; unclassifiable subspaces are not a problem if we can determine
how to move out of them by relying on other modalities, which are experiencing the same
sensory events from their unique perspectives. We will show that this approach is not
only biologically plausible, it is also computationally efficient in that it allows us to use
lower dimensional representations for modeling sensory and motor data.
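The flavor of this idea can be sketched in a few lines (this is only a caricature of the
intuition; the influence networks of Chapter 4 are defined quite differently, and every name
below is hypothetical): an ambiguous auditory point is nudged toward the class centers favored
by the co-occurring visual channel before it is classified in its own space.

```python
import numpy as np

def nudge_then_classify(audio_x, class_means, visual_posterior, step=0.5):
    """Move an ambiguous auditory input toward the classes favored by the visual
    channel, then classify it in the auditory space.

    audio_x:          point in the auditory (e.g., formant) space
    class_means:      (k, d) per-class means in that space
    visual_posterior: length-k class probabilities estimated from the visual channel
    """
    target = visual_posterior @ class_means              # visually weighted attractor
    moved = audio_x + step * (target - audio_x)          # partial step toward it
    label = int(np.argmin(np.linalg.norm(class_means - moved, axis=1)))
    return label, moved
```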
It might be asked why we have more senses than one. [Had it been otherwise],…
everything would have merged for us into an indistinguishable identity.
Aristotle, De Anima (350 B.C.E)
Chapter 3
Perceptual Grounding
Most of the enormous variability in the world around us is unimportant. Variations in our
sensory perceptions are not only tolerated, they generally pass unnoticed. Of course,
some distinctions are of paramount importance and learning which are meaningful as
opposed to which can be safely ignored is a fundamental problem of cognitive
development. This process is a component of perceptual grounding, where a perceiver
learns to make sense of its sensory inputs. The perspective taken here is that this is a
clustering problem, in that each sense must learn to organize its perceptions into
meaningful categories.
That animals do this so readily belies its complexity.
For
example, people learn phonetic structures for languages simply by listening to them; the
phonemes are somehow extracted and clustered from auditory inputs even though the
listener does not know in advance how many unique phonemes are present in the signal.
Contrast this with a standard mathematical approach to clustering, where some
knowledge of the clusters, e.g., how many there are or their distributions, must be known
a priori in order to derive them.
Without knowing these parameters in advance,
algorithmic clustering techniques may not be robust (Fraley and Raftery 2002, Kleinberg
2002, Still and Bialek 2004). Assuming that in many circumstances animals cannot
know the parameters underlying their perceptual inputs, how then do they learn to
organize their sensory perceptions reliably?
This chapter presents an approach to clustering based on observed correlations between
different sensory modalities. These cross-modal correlations exist because perceptions
are created through physical processes governed by natural laws (Thompson 1917,
Richards 1980, Mumford 2004). An event in the world is simultaneously perceived
through multiple sensory pathways in a single observer; while each pathway may have a
unique perspective on the event, their perspectives tend to be correlated by regularities in
the physical world (Richards and Bobick 1988).
We propose here that these
correspondences play a primary role in organizing the sensory channels individually.
Based on this hypothesis, we develop a new framework for grounding artificial
perceptual systems.
Towards this, we will introduce a mathematical model of slices, which are topological
manifolds that encode dynamic perceptual states and are inspired by surface models of
cortical tissue (Dale et al. 1999, Fischl et al. 1999, Citti and Sarti 2003, Ratnanather et al.
2003).
Slices partition perceptual spaces into large numbers of small regions
(hyperclusters) and then reassemble them to construct clusters corresponding to the actual
sensory events being perceived.
This reassembly is performed by cross-modal
clustering, which uses temporal correlations between slices to determine which
hyperclusters within a slice correspond to the same sensory events.
The cross-modal
clustering algorithm does not presume that either the number of clusters in the data or
their distributions is known beforehand. We examine the outputs and behavior of this
algorithm on simulated datasets, drawn from a variety of mixture distributions, and on
real data gathered in computational experiments. Some of the work in this chapter has
appeared in (Coen 2005, Coen 2006).
3.1 The Simplest Complex Example
As in Chapter 2, we proceed here by first considering an example. We will return to
using real datasets towards the end of this chapter, but for the moment, it is helpful to
pare down the subject matter to its bare essentials.
Let us consider two hypothetical sensory modes, each of which is capable of sensing the
same two events in the world, which we call the red and blue events. These two modes
are illustrated below in Figure 3.1, where the dots within each mode represent its
perceptual inputs and the blue and red ellipses delineate the two events. For example, if a
"red" event takes place in the world, each mode would receive sensory input that
Figure 3.1 – Two hypothetical co-occurring perceptual modes. Each mode, unbeknownst to itself,
receives inputs generated by a simple, overlapping Gaussian mixture model. To make matters more
concrete, we might imagine Mode A is a simple auditory system that hears two different events in the
world and Mode B is a simple visual system that sees those same two events, which are indicated by the red and
blue ellipses.
(probabilistically) falls within its red ellipse.
Notice that events within each mode
overlap, and they are in fact represented by a mixture of two overlapping Gaussian
distributions. We have chosen this example because it is simple – each mode perceives
only two events – but it has the added complexity that the events overlap – meaning there
is likely to be some ambiguity in interpreting the perceptual inputs.
Keep in mind that while we know there are only two events (red and blue) in this
hypothetical world, the modes themselves do not "know" anything at all about what they
can perceive. The colorful ellipses are solely for the reader's benefit; the only thing the
modes receive is their raw input data. Our goal then is to learn the perceptual categories
in each mode – e.g., to learn that each mode in this example senses these two overlapping
events – by exploiting the temporal correlations between them.
3.2 Generating Codebooks
We are going to proceed by hyperclustering each perceptual space into a codebook. This
simply means that we are going to generate far more clusters than are necessary for
representing the actual number of perceptual events in the data. In this case, that would
Figure 3.2 – Hyperclustering Mode B with the algorithm given below. Mode B is shown hyperclustered
on the right. Here, we specified k=30 and the algorithm ended up generating 53 clusters after normalizing
their densities.
be two, but instead, we will employ a (much) larger number.
For the rest of this
discussion, we will refer to two different types of clusters:
1) codebook clusters (or hyperclusters) are generated by hyperclustering and are
illustrated by the Voronoi regions shown in Figure 3.2 on the right.
2) perceptual clusters refer to actual sensory events and are outlined with the colored
ellipses in Figure 3.1.
Our goal will be to combine the codebook clusters to "assemble" the perceptual clusters.
We note that while perceptual clustering is quite difficult, for reasons outlined in the
previous chapter, hyperclustering is quite easy because there is no notion of perceptual
correctness associated with it. Although we must determine how many codebook clusters
to generate, we will show this number influences the amount of training data necessary
rather than the correctness of the derived perceptual clusters.
In other words, this
approach is not overly sensitive to the hyperclustering: generating too many hyperclusters
simply means learning takes longer, not that the end results are incorrect. Generating too
few hyperclusters tends not to happen because of the density normalization described
below. It is also sometimes possible to detect that too few clusters have been generated
by using cross-modal information, a technique we examine later in this chapter.
To generate the codebooks, we will use a variant of the Generalized Lloyd Algorithm
(GLA) (Lloyd 1982). We modify the algorithm to normalize the point densities within
the hyperclusters, which otherwise can vary enormously. Many clustering algorithms,
including GLA, optimize initially random codebooks by minimizing a strongly Euclidean
distance metric between cluster centroids and their members. A cluster with a large
number of nearby points may be viewed as equivalent to (from the perspective of the
optimization) a cluster with a small number of distant points. It is therefore possible to
have substantial variance in the number of points assigned to each codebook cluster.
This is problematic because our approach will require that each perceptual cluster be
represented by multiple codebook clusters, from which it is "assembled." The Euclidean
bias introduced by the distance metric used for codebook optimization means that "small"
perceptual events may be relegated to a single codebook cluster. This would prevent
them from ever being detected.
There are many ways one could imagine achieving this density normalization. For
example, we could explicitly add inverse cluster size to the minimization calculation
performed during codebook refinement. This would leave the number of codebook
clusters constant overall but introduce pressure against wide variation in the number of
points assigned to each one. Rather than take an approach that preserves the overall
number of clusters, we will instead modify the algorithm to recluster codebook regions
that have been assigned "too many" points. The benefit of this is that we leave the GLA algorithm intact but now invoke it recursively on subregions where its performance is unsatisfactory. By keeping the basic structure of GLA, many of the mathematical
properties of the generated codebooks remain unchanged. The downside of this approach
is that the recursive reclustering increases the total number of generated hyperclusters.
Thus, the algorithm generates at least as many codebook clusters as we specify and
sometimes many more. This increase in codebook size can affect the computational
complexity of algorithms operating over these codebooks, which we investigate later in
this chapter. We note, however, that adding these additional clusters does not tend to
require gathering more training data, an issue raised above. This is because the extra
clusters are generated in regions that already have high point densities.
Our hyperclustering algorithm for generating (at least) $k$ codebook regions over a dataset $D \subseteq \mathbb{R}^N$ is:
Figure 3.3 – The hyperclusters generated for the data in Mode B, with the data removed. The number
identifying each cluster is located at its centroid. Notice how the number of clusters increases in the region
corresponding to the overlap of the two Gaussian distributions, where the overall point density is highest.
1) Let $s = |D|/k$. This is our goal size for the number of data points per cluster.
2) Let $P = \{P_1, P_2, \ldots, P_k\}$, $P_i \subset \mathbb{R}^N$, be a Lloyd partitioning of $D$ over $k$ clusters. This is the output of the Generalized Lloyd Algorithm.
3) For each cluster $P_i \in P$: if $|P_i| > s$ (the cluster has too many points), then recursively partition $P_i$:
   a. Let $Q = \{Q_1, Q_2\}$, $Q_j \subset \mathbb{R}^N$, be a Lloyd partitioning of $P_i$ over 2 clusters.
   b. Set $P = (P \cup Q) \setminus \{P_i\}$. Add the two new partitions and remove the old one.
4) Repeat step 3 until no new partitions are added. Then, return the centroids of the sets in $P$ as the final hyperclustering. Empirically, we find that $k < |P| < 2k$.
The output of this algorithm on the data in Mode B is shown above in Figure 3.3. Notice
how the number of clusters increases in the region corresponding to the overlap of the
two Gaussian distributions, which is due to the density normalization. We note that any
number of variations on this algorithm is possible. For example, in the reclustering step
Figure 3.4 – Slices generated for Modes A and B using the hyperclustering algorithm in the previous
section. Our goal is to combine the codebook clusters to reconstruct the actual sensory events perceived
within the slices.
in (3), we might recursively generate $|P_i|/s$ clusters rather than 2. We could also modify
the goal size s to change the degree of density normalization. In any event, we have
found that our approach is not particularly sensitive to the precise details of the
codebook's generation; we confirm this statement later in this chapter, when we consider
hyperclustering other mixture distributions. At present, the most important consideration
is that the cluster densities are normalized, which minimizes the Euclidean bias inherent
in the centroid optimization performed by the Lloyd algorithm.
3.3 Generating Slices
We now introduce a new data structure, the slice, which is constructed using the codebooks defined in the previous section. Figure 3.4 illustrates slices constructed for
Modes A and B from our example above. Slices are topological manifolds that encode
dynamic perceptual states and are inspired by surface models of cortical tissue (Citti and
Sarti 2003, Ratnanather et al. 2003). They are able to represent both symbolic and
numeric data and provide a natural foundation for aggregating and correlating
information.
Intuitively, a slice is a codebook with a non-Euclidean distance metric
defined between its cluster centroids. In other words, distances within each cluster are
Euclidean, whereas distances between clusters are not. A topological manifold is simply
a manifold "glued" together from Euclidean spaces, and that is exactly what a slice is.
Figure 3.5 – Viewing Hebbian linkages between two different slices. The modes have been vertically
stacked here to make the correspondences clearer. The blue lines indicate that two codebook regions
temporally co-occur with each other. Note that these connections are weighted based on their strengths,
which is not visually represented here, and that these weights are additionally asymmetric between each
pair of connected regions.
Our goal is to combine the codebook regions to "reconstruct" the larger perceptual
regions within a slice. To do this, we will define a non-Euclidean distance metric
between codebook regions that reflects how much we think they are part of the same
perceptual event. In this metric, codebook regions corresponding to the same perceptual
event will be closer together and those corresponding to different events will be further
apart. Towards defining this metric, we first collect co-occurrence data between the
codebook regions in different modes. We want to know how each codebook region in a
mode temporally co-occurs with the codebook regions in other modes.
This data can be easily gathered with the classical sense of Hebbian learning (Hebb
1949), where connections between regions are strengthened as they are simultaneously
active. The result of this process is illustrated in Figure 3.5, where the modes are
vertically stacked to make the correspondences clearer. We will exploit the spatial
structure of this Hebbian co-occurrence data to define the distance metric within each
mode.
3.4 Hebbian Projections
In this section, we define the notion of a Hebbian projection.
These are spatial
probability distributions that provide an intuitive way to view co-occurrence relations
between different slices. We first give a formal definition and then illustrate the concept
visually.
Consider two slices $M_A, M_B \subseteq \mathbb{R}^N$, with associated codebooks $C_A = \{p_1, p_2, \ldots, p_a\}$ and $C_B = \{q_1, q_2, \ldots, q_b\}$, where the cluster centroids $p_i, q_j \in \mathbb{R}^N$.
For some event $x$, we define $h(x)$ = the number of times event $x$ occurs. Similarly, for events $x$ and $y$, we define $h(x, y)$ = the number of times events $x$ and $y$ co-occur. For example, $h(p_1)$ is the number of times inputs belonging to cluster $p_1$ were seen during some time period of interest. So, we see that $\Pr(q \mid p) = h(p, q) / h(p)$.
We define the Hebbian projection of a codebook cluster $p_i \in C_A$ onto mode $M_B$:

$\vec{H}_{AB}(p_i) = [\Pr(q_1 \mid p_i), \Pr(q_2 \mid p_i), \ldots, \Pr(q_b \mid p_i)]$   (3.1)
When the modes are clear from context, we will simply refer to the projection by $\vec{H}(p_i)$. A Hebbian projection is simply a conditional spatial probability distribution that lets us know what mode $M_B$ probabilistically "looks" like when a region $p_i$ is active in co-occurring mode $M_A$. This is visualized in Figure 3.6.
We can equivalently define a Hebbian projection for a region $r \subseteq M_A$ constructed out of a subset of its codebook clusters $C_r = \{p_{r_1}, p_{r_2}, \ldots, p_{r_k}\} \subseteq C_A$:

$\vec{H}_{AB}(r) = [\Pr(q_1 \mid r), \Pr(q_2 \mid r), \ldots, \Pr(q_b \mid r)]$   (3.2)
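As a concrete illustration, the following sketch accumulates Hebbian co-occurrence counts between two codebooks and normalizes them into the projections of equations (3.1) and (3.2). It assumes each sensory input has already been quantized to the index of its nearest codebook centroid; the names and the toy label streams are illustrative.

import numpy as np

def hebbian_counts(labels_A, labels_B, a, b):
    """Accumulate co-occurrence counts h(p_i, q_j) from two streams of codebook
    labels observed at the same time steps (classical Hebbian strengthening of
    simultaneously active regions)."""
    h = np.zeros((a, b))
    for p, q in zip(labels_A, labels_B):
        h[p, q] += 1.0
    return h

def hebbian_projection(h, region):
    """Forward Hebbian projection (3.1)/(3.2): the conditional distribution over
    Mode B's codebook given that one of the clusters in `region` (a list of
    Mode A cluster indices) is active."""
    row = h[region].sum(axis=0)               # h(r, q_j), summed over the region
    total = row.sum()                         # h(r)
    return row / total if total > 0 else row  # Pr(q_j | r)

# Tiny example with a = 4 and b = 3 codebook clusters and hypothetical labels.
labels_A = [0, 0, 1, 2, 3, 1, 0, 2]
labels_B = [0, 0, 1, 2, 2, 1, 0, 2]
h = hebbian_counts(labels_A, labels_B, a=4, b=3)
print(hebbian_projection(h, region=[0]))      # projection of cluster p_0 onto Mode B
print(hebbian_projection(h, region=[1, 2]))   # projection of a two-cluster region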
We will also define the notion of a reverse Hebbian projection, which projects a Hebbian
projection back onto its source mode. It lets us measure – from the perspective of
another modality – which other codebook regions in a slice appear similar to a reference
region.
Figure 3.6 – A visualization of two Hebbian projections. On the top, we project from a cluster pi in Mode
A onto Mode B. The dotted lines correspond to Hebbian linkages and the blue shading in each cluster qj in
Mode B is proportional to Pr(qj|pi). A Hebbian projection lets us know what Mode B probabilistically
"looks" like when some prototype in Mode A is active. On the bottom, we see a projection from a cluster
in Mode B onto Mode A.
To do this, we first define weighted versions of the functions defined above for a set of weights $\omega$. Consider a region $r$, $|r| = k$, where each cluster is assigned some weight $\omega_i$. We assume that $\sum_i \omega_i = 1$.

$h_\omega(r) = \sum_{p \in r} \omega_p\, h(p)$, where $p$ is a codebook cluster in region $r$

$\Pr_\omega(q \mid r) = h_\omega(r, q) / h_\omega(r) = \sum_{p \in r} \omega_p\, h(p, q) \Big/ \sum_{p \in r} \omega_p\, h(p)$

$\vec{H}_\omega(r) = \left[ \Pr_\omega(q_1 \mid r), \Pr_\omega(q_2 \mid r), \ldots, \Pr_\omega(q_b \mid r) \right]$
The reverse Hebbian projection $\overleftarrow{H}_{AB}(r)$ of a region $r \subseteq M_A$ onto mode $M_B$ is then defined:

$\overleftarrow{H}_{AB}(r) = \vec{H}_{\vec{H}(r)}(M_B)$   (3.3)

$\qquad\quad\;\; = \left[ \Pr_{\vec{H}(r)}(p_1 \mid M_B),\ \Pr_{\vec{H}(r)}(p_2 \mid M_B),\ \ldots,\ \Pr_{\vec{H}(r)}(p_a \mid M_B) \right]$   (3.4)

Again, when the modes are clear from context, we will simply refer to this as $\overleftarrow{H}(r)$.
This distribution has a simple interpretation: the reverse Hebbian projection from mode
M A onto mode M B for some region r ⊆ M A is the Hebbian projection of all of mode
M B onto mode M A , weighted by the forward Hebbian projection of region r, as shown
in equation (3.3). This process is visualized in Figure 3.7. Note that we are projecting an
entire mode M B here. This might seem initially surprising, but it simply corresponds to
a projection of a region that contains all the codebook clusters for a given slice.
The reverse Hebbian projection $\overleftarrow{H}(r)$ answers the question: what other regions does
mode M B think region r is similar to in mode M A ? It can therefore be viewed as a
distribution that measures cross-modal confusion. For this reason, it provides a useful
optimization tool, because we will only need to disambiguate regions that appear in each
other's reverse Hebbian projections, i.e., they have a non-zero (or above some threshold)
probability of being confused for one another by other modalities.
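Continuing the previous sketch (the co-occurrence matrix h and the hebbian_projection helper), the reverse projection can be computed directly from the same counts by weighting Mode B's clusters with the forward projection of the reference region; the names here are again illustrative.

def reverse_hebbian_projection(h, region):
    """Reverse Hebbian projection (3.3)-(3.4): project all of Mode B back onto
    Mode A, weighting each Mode B cluster by the forward projection of `region`.
    Uses the same co-occurrence matrix h (shape a x b) as the previous sketch."""
    w = hebbian_projection(h, region)        # forward projection: the weights omega
    weighted = h * w                         # h(p_i, q_j) * omega_j
    numer = weighted.sum(axis=1)             # sum over q of omega_q * h(p_i, q)
    denom = (w * h.sum(axis=0)).sum()        # sum over q of omega_q * h(q)
    return numer / denom if denom > 0 else numer

print(reverse_hebbian_projection(h, region=[0]))   # cross-modal confusion with p_0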
Figure 3.7 – Visualizing a reverse Hebbian projection. We first generate the Hebbian projection of the
green cluster pi in Mode A onto Mode B. This projection is represented by the shading of each region qj in
Mode B, corresponding to Pr(qj|pi). We then project all of Mode B back onto Mode A, weighting the
contributions of each cluster qj by Pr(qj|pi). This generates the reverse Hebbian projection, which is
indicated by the shading of regions in Mode A.
3.5 Measuring Distance in a Slice
Let us briefly review where we stand at this point. We have introduced the idea of a
slice, which breaks up a representational space into many smaller pieces that are
generated by hyperclustering it. We would like to assemble these small hyperclusters
into larger regions that represent actual perceptual categories present in the input data. In
this section, we define the non-Euclidean distance metric between the hyperclusters that
helps make this possible.
Consider the colored regions in Figure 3.8. We would like to determine that the blue and
red regions are part of their respective blue and red events, indicated by the colored
ellipses. It is important to recall that the colors here are simply for the reader's benefit.
There is no labeling of regions or perceptual events within the slice itself. We will
proceed by formulating a distance metric that minimizes the distance between codebook
regions that are actually within the same perceptual region and maximizes the distance
Figure 3.8 – Combining codebook regions to construct perceptual regions. We would like to determine
that regions within each ellipse are all part of the same perceptual event. Here, for example, the two blue
codebook regions (probabilistically) correspond to the blue event and the red regions correspond to the red
event.
between codebook regions that are in different perceptual regions. That this metric must be non-Euclidean is clear from looking at the figure: each highlighted region is closer to one of a different color than it is to its matching partner.
We are going to use the Hebbian projections defined in the previous section to formulate this similarity metric for codebook regions. This will make the metric inherently cross-modal, because we will rely on co-occurring modalities to determine how similar two regions within a slice are. Our approach is to compare codebook regions by comparing their Hebbian projections onto co-occurring slices. This process is illustrated in Figure 3.9. The problem of measuring distances between prototypes is thereby transformed into a problem of measuring similarity between spatial probability distributions. The distributions are spatial because the codebook regions have definite locations within a slice, which are subspaces of $\mathbb{R}^n$. Hebbian projections are thus spatial distributions on n-dimensional data. It is therefore not possible to use one-dimensional metrics, e.g., the Kolmogorov-Smirnov distance, to compare them, because doing so would throw away the essential spatial information within each slice.
[Figure 3.9: What is the distance between $\Pr(\text{Mode A} \mid r_1)$ and $\Pr(\text{Mode A} \mid r_2)$? In other words, how similar are $\vec{H}(r_1)$ and $\vec{H}(r_2)$?]
Figure 3.9 – Our approach to computing distances cross-modally. To determine the distance between
codebook regions r1 and r2 in Mode B on top, we project them onto a co-occurring modality (Mode A) as
shown in the middle. We then ask: how similar are their Hebbian projections onto Mode A?, as shown on
the bottom. We have thereby transformed a question about distance between regions into a question of
similarity between the spatial probability distributions provided by their Hebbian projections.
3.6 Defining Similarity
What does it mean for two things to be similar? This deceptively difficult question is at
the heart of mathematical clustering and perceptual categorization and is common to a
number of fields, including computer vision, statistical physics, and information and
probability theory. The goal of measuring similarity between different things is often
cast as a problem of measuring distances between multidimensional distributions on
descriptive features.
For example, in computer vision, finding minimum matchings
between image feature distributions is a common approach to object recognition
(Belongie et al. 2002).
In this section, we present a new metric for measuring similarity between spatial
probability distributions, i.e., distributions on multidimensional metric spaces. We will
use this metric to compute distances between codebook regions by comparing their
Hebbian projections onto co-occurring modalities, as shown above in Figure 3.9. Our
approach is therefore inherently multimodal – although we may be unable to determine a
priori how similar two codebook regions are in isolation (i.e., unimodally), we can
measure their similarity by examining how they are viewed from the perspectives of
other co-occurring sensory channels. We therefore want to formulate a similarity metric
on Hebbian projections that tells us not how far apart they are but rather, how similar
they are to one another. This will enable perceptual bootstrapping by allowing us to
answer a fundamental question:
Can any other modality distinguish between two regions in the same
codebook? If not, then they represent the same percept.
3.6.1 Introduction
There are a wide variety of metrics available to quantify distances between probability
distributions (see the surveys in Rachev 1991, Gibbs and Su 2002). We may contrast
these in many ways, including whether they are actually metrics (i.e., symmetric and
satisfy the triangle inequality), the properties of their state spaces, their computational
complexity, whether they admit practical bounding techniques, etc. For example, the
common $\chi^2$ distance is not a metric because it is asymmetric. In contrast, Kolmogorov-Smirnov distance is a metric but is defined only over $\mathbb{R}^1$. Choosing an appropriate
metric for a given problem is a fundamental step towards solving it and can yield
important insights into its internal structure.
In discussions of probability metrics, the notion of similarity generally follows directly
from the definition of distance. Two distributions are deemed similar if the distance
between them is small according to some metric; conversely, they are deemed dissimilar
when the metric determines they are far apart. In our approach, we will reverse this
dependency. We first intuitively describe our notion of similarity and then formulate a
metric that computes it in a well-defined way. We call this metric the Similarity distance
and it is the primary contribution of this section.
Our approach is applicable to
comparing distributions over any metric space and has a number of interesting properties,
such as scale invariance, that make it additionally useful for work beyond the scope of
this thesis.
3.6.2 Intuition
We begin by first examining similarity informally. Consider the two simple examples
shown in Figure 3.10.
Each shows two overlapping Gaussian distributions, whose
similarity we would like to compare. Intuitively, we would say the distributions in
Example A (on the left) are more similar to one another than those in Example B (on the
right), because we will think of similarity as a measure of the overlap or proximity of
spatial density. We are not yet concerned with formally defining similarity, but the
intuition in these examples is exactly what we are trying to capture. Notice that the
distributions in Example A cover roughly two orders of magnitude more area than those
in Example B. Therefore, if we were to derive similarity from distance, the strong
Euclidean bias incorporated into a wide variety of probability metrics would lead us to
the opposite of our desired result. Namely, because the examples in B are much closer
than those in A, we would therefore deem them more similar, thereby contradicting our
Figure 3.10 – Intuitively defining similarity. We consider the two distributions illustrated in Example A to
be far more similar to one another than those in Example B, even though many metrics would deem them
further apart due to inherent Euclidean biases. Notice that the distributions in Example A cover roughly
two orders of magnitude more area than those in Example B. Note that simply normalizing the
distributions before computing some metric on them would be ad hoc, very sensitive to outliers, and make
common comparisons difficult.
desired meaning. Because we are looking for a distance measure based on similarity –
and not a direct similarity measure – it should be smaller for things that are more similar
and larger for things that are less similar. This is the opposite of what one would expect
were we directly formulating a measure of similarity, which presumably would be higher
for more similar distributions.
Note that we cannot simply normalize pairs of distributions before computing some
metric on them because our results would be highly sensitive to outliers. Doing so would
also make common comparisons difficult, which is particularly important when
demonstrating convergence in a sequence of probability measures. Finally, we want our
similarity metric to be distribution-free and make no assumptions about the underlying
data, which would make generalizing a simple normalization schema difficult.
3.6.3 Probabilistic Framework
We begin with some formal definitions. Our approach will be to define Similarity
distance DS as the ratio between two other metrics.
These are the Kantorovich-
Wasserstein distance and a new metric we introduce called the one-to-many distance. For
each of these, we will provide a definition over continuous distributions and then present
equivalent formulations for discrete weighted point sets. These are more computationally
efficient for computing Similarity distance on the slice data structures introduced earlier.
After this formal exposition, we intuitively explain and motivate these metrics in Section
3.6.4 and then show how Similarity distance is derived from them.
3.6.3.1 Kantorovich-Wasserstein Distance
Let $\mu$ and $\nu$ be distributions on the state space $\Omega = \mathbb{R}^n$. The Kantorovich-Wasserstein distance $D_W$ (Kantorovich 1942, Gibbs and Su 2002) between $\mu$ and $\nu$ may be defined:

$D_W(\mu, \nu) = \inf_J \left\{\, \mathbb{E}\,D(x, y) : \mathcal{L}(x) = \mu,\ \mathcal{L}(y) = \nu \,\right\}$   (3.5)

where the infimum is taken over all joint distributions $J$ on $x$ and $y$ with marginals $\mathcal{L}(x) = \mu$ and $\mathcal{L}(y) = \nu$ respectively. For brevity, we will refer to $D_W$ simply as the
Wasserstein distance. Notice that in order to compute the Wasserstein distance, we
already need to have a distance metric D defined to calculate the infimum. Where does D
come from?
In fact, in the approach described above, isn't D supposed to be the
Similarity distance DS , because we are proposing to use Similarity distance to measure
distances within slices? Thus, we seem to have a “chicken and egg” problem from the
start. We will sidestep this by defining D recursively through an iterative function
system on DS . This will allow us to compute Similarity distance by incrementally
refining our calculation of it.
The definition in (3.5) assumes the distributions are continuous. Hebbian projections,
however, are discrete distributions (i.e., weighted point sets) because they are over the
codebooks within a slice. We may therefore simplify our computation by carrying it out
directly over these codebooks. To do so, we define the Wasserstein distance on weighted
point sets corresponding to discrete probability distributions.
Consider finite sets $r_1, r_2 \subset \Omega$ with point densities $\varphi_1, \varphi_2$ respectively. Then we have:

$D_W(\langle r_1, \varphi_1 \rangle, \langle r_2, \varphi_2 \rangle) = \inf_J \left\{\, \mathbb{E}\,D(x, y) : \mathcal{L}(x) = \langle r_1, \varphi_1 \rangle,\ \mathcal{L}(y) = \langle r_2, \varphi_2 \rangle \,\right\}$
which by (Levina and Bickel 2001) is equal to:

$D_W(\langle r_1, \varphi_1 \rangle, \langle r_2, \varphi_2 \rangle) = \min_{j_1, \ldots, j_m} \left( \frac{1}{m} \sum_{i=1}^{m} \left[ D\!\left( \langle r_1, \varphi_1 \rangle_i,\ \langle r_2, \varphi_2 \rangle_{j_i} \right) \right]^2 \right)^{1/2}$   (3.6)

where $m$ is the maximum of the sizes of $r_1$ and $r_2$, the minimum is taken over all permutations of $\{1, \ldots, m\}$, and $\langle r_a, \varphi_a \rangle_i$ is the $i$th element of the set $\langle r_a, \varphi_a \rangle$. We note that
(ibid.) has shown this is equivalent to the Earth Mover's distance (Rubner et al. 1998), a
popular empirical measure used primarily in the machine vision community, when they
are both computed over probability distributions.
We can now define the Wasserstein distance between the Hebbian projections of $r_1, r_2 \subseteq M_A$ onto $M_B$ as:

$D_W\!\left( \vec{H}(r_1), \vec{H}(r_2) \right) = \inf_J \left\{\, \mathbb{E}\,D(x, y) : \mathcal{L}(x) = \vec{H}(r_1),\ \mathcal{L}(y) = \vec{H}(r_2) \,\right\}$   (3.7)

where the infimum is taken over all joint distributions $J$ on $x$ and $y$ with marginals $\vec{H}(r_1)$ and $\vec{H}(r_2)$. By (3.6), we have this is equal to:
$D_W\!\left( \vec{H}(r_1), \vec{H}(r_2) \right) = \min_{j_1, \ldots, j_m} \left( \frac{1}{m} \sum_{i=1}^{m} \left[ D\!\left( \vec{H}(r_1)_i,\ \vec{H}(r_2)_{j_i} \right) \right]^2 \right)^{1/2}$   (3.8)

where $m$ is the number of codebook regions in $M_B$, the minimum is taken over all permutations of $\{1, \ldots, m\}$, and $\vec{H}(r)_i$ is the $i$th component of $\vec{H}(r)$.
We note that the Wasserstein distance presented above is not a candidate for measuring
similarity.
In fact, referring back to Figure 3.10, the red and blue distributions in
Example A here are further apart as determined by the Wasserstein distance than those in
Example B, i.e., DW (Example A) > DW (Example B) , which does not capture our intended
meaning of similarity.
The additional measure needed will be described shortly,
following a brief discussion of how DW can be computed efficiently.
3.6.3.2 Computational Complexity
The optimization problem in (3.6) was first proposed by Monge (1781) and is known as the Transportation Problem. It involves combinatorial optimization because the minimum is taken over $O(2^m)$ different permutations and can be solved by Kuhn's Hungarian method (1955, see also Frank 2004). However, by treating it as a flow problem, we instead use the Transportation Simplex method introduced by Dantzig (1951) and subsequently enhanced by Munkres (1957), which has worst-case exponential time but in practice is quite efficient (Klee and Minty 1972).
To get some insight into the structure of this problem, we take a moment to examine the
complexity of determining exact solutions to it according to (3.8). Although this is not
necessary in practice, it is instructive to see how the choice of mixture distributions
influences the complexity of the problem and the implications this has for selecting
perceptual features. Notice that in the minimization in equation (3.8), the vast majority
of permutations can be ignored because we need only examine regions that have non-zero probabilities in the Hebbian projections. In other words, we could choose to ignore any region $q_i \subseteq M_B$ where $\max\!\left(\vec{H}(r_1)_i, \vec{H}(r_2)_i\right) \leq \varepsilon$ for some small $\varepsilon$. A conservative approach would set $\varepsilon = 0$; however, one can certainly imagine using a slightly higher
threshold to simultaneously reduce noise and computational complexity.
We may estimate the running time of calculating $D_W$ exactly by asking how many non-zero values we expect to find in the Hebbian projections onto mode $M_B$ of two regions in mode $M_A$. Let us suppose that mode $M_B$ actually has $d$ events (of equal likelihood) distributed over $m$ codebook regions. How many codebook regions are there within each event? If the events do not overlap, then we expect that each perceptual event is covered by $m/d$ codebook regions, due to the density normalization performed during codebook generation. In this case, the minimization must be performed over $O(2^{1+m/d})$ permutations. Alternatively, it is possible for all of the sensory events to overlap, giving a worst-case upper bound of $m$ regions per event and $O(2^m)$ running time. Thus, the running time is as much a function of the event mixture distributions as it is of the number of codebook regions. When Hebbian distributions are "localized," in the sense that they are confined to subsets of the codebook regions, the worst-case running time is closer to $O(2^{1+m/d})$.
We can optimize the computation by taking advantage of the fact that the projections are
over identical codebooks, i.e., their spatial distributions are over the same set of points
generated by the hyperclustering of M B . In the general statement of the Transportation
Problem, this need not be the case. We can therefore reduce the number of codebook
regions involved by removing the intersection of the projections from the calculation.
Where they overlap, namely on the distribution described by a normalized $\min\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)$ (taken componentwise), we know the Wasserstein distance between them is 0. Therefore, let:
$\vec{H}'(r_1) = \vec{H}(r_1) - \min\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)$   (3.9)

$\vec{H}'(r_2) = \vec{H}(r_2) - \min\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)$   (3.10)

$\Delta = 1 - \sum \min\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)$   (3.11)

We then have:

$D_W\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = \Delta \cdot D_W\!\left(\vec{H}'(r_1), \vec{H}'(r_2)\right)$   (3.12)
In other words, the Wasserstein distance computed over a common codebook (3.12) is equal to the distance computed on the residual distributions (3.9) and (3.10), scaled by their non-intersecting mass (3.11). (Note that we must normalize (3.9) and (3.10) to ensure they remain probability distributions, but the reader may assume this normalization step is always implied when necessary.) When the distributions overlap strongly, which we previously identified as the worst-case scenario, we can typically use this optimization to cut the number of involved codebook regions in half. When the distributions do not overlap, this optimization provides no benefit, but as we have already noted, this is a best-case scenario and optimization is less necessary. As a further enhancement, we could also establish thresholds for $\Delta$ to avoid calculating $D_W$ altogether. For example, in the case where their non-intersecting mass is extremely small, we might choose to define $D_W = 0$ or some other approximation.
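To make this computation concrete, the sketch below computes the Wasserstein distance between two Hebbian projections over a common codebook in the Earth Mover's distance formulation noted above as equivalent (Levina and Bickel 2001). It applies the overlap-removal optimization of equations (3.9)-(3.12) and then solves the remaining transportation problem as a generic linear program with scipy.optimize.linprog rather than with the Transportation Simplex method. The ground-distance matrix D, the toy centroids, and the projections H1 and H2 are all illustrative.

import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wasserstein(w1, w2, D):
    """Earth Mover's / Wasserstein distance between two discrete distributions
    w1 and w2 over a common codebook, given a matrix D of ground distances
    between the codebook regions."""
    overlap = np.minimum(w1, w2)
    delta = 1.0 - overlap.sum()                    # non-intersecting mass, eq. (3.11)
    if delta < 1e-12:
        return 0.0                                 # the projections coincide
    w1p = (w1 - overlap) / delta                   # renormalized residuals, eqs. (3.9)-(3.10)
    w2p = (w2 - overlap) / delta

    src = np.flatnonzero(w1p > 0)                  # only regions carrying residual mass
    snk = np.flatnonzero(w2p > 0)
    cost = D[np.ix_(src, snk)]
    n, m = len(src), len(snk)

    # Flow variables f_ij >= 0; each source ships its mass, each sink receives its mass.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0           # row sums equal the source masses
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                    # column sums equal the sink masses
    b_eq = np.concatenate([w1p[src], w2p[snk]])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return delta * res.fun                         # rescale by eq. (3.12)

# Hypothetical 5-region codebook with 2-D centroids and two Hebbian projections.
centroids = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.5], [0.9, 0.9], [1.0, 1.0]])
D = cdist(centroids, centroids)
H1 = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
H2 = np.array([0.0, 0.0, 0.2, 0.4, 0.4])
print(wasserstein(H1, H2, D))

Because both projections live over the same codebook, the overlap removal typically reduces the number of regions entering the linear program substantially, as noted above.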
In summary then, the computational complexity of exactly computing the Wasserstein
distance very much rests on the selection of mixture distributions over which it is
computed. These in turn depend upon the feature selection used in our perceptual
algorithms, which directly determine the distributions of sensory data within a slice. We
say that "good" features are ones that tend to restrict Hebbian projections to smaller
subsets of slices and to reduce the amount of overlap among detectable perceptual events.
(We suspect one can directly formulate a measure of the entropy in features based on
these criteria but have not done so here.) Empirically, "good" features for computing the
Wasserstein distance tend to be similar to the ones we naturally select when creating
artificial perceptual systems. "Bad" features provide little information because their
values are difficult to separate, i.e., they have high entropy. Later in the thesis, we will
draw biological evidence for these ideas from (Ernst and Banks 2002).
3.6.3.3 The One-to-Many Distance
We now introduce a new distance metric called the one-to-many distance. Afterwards,
we examine this metric intuitively and show how it naturally complements the
Wasserstein distance. We will use these metrics together to formalize our intuitive notion
of similarity.
Let $f$ and $g$ be the respective density functions of distributions $\mu$ and $\nu$ on the state space $\Omega = \mathbb{R}^n$. Then the one-to-many distance ($D_{OTM}$) between $\mu$ and $\nu$ is:

$D_{OTM}(\mu, \nu) = \int_\mu f(x) \cdot D_W(x, \nu)\, dx$
$\qquad = \int_\mu \int_\nu f(x) \cdot g(y) \cdot d(x, y)\, dx\, dy$
$\qquad = \int_\nu g(y) \cdot D_W(\mu, y)\, dy = D_{OTM}(\nu, \mu)$
We define this over weighted point sets as:

$D_{OTM}(\langle r_1, \varphi_1 \rangle, \langle r_2, \varphi_2 \rangle) = \sum_{p_i \in r_1} \varphi_i\, D_W\!\left(p_i, \langle r_2, \varphi_2 \rangle\right)$
$\qquad = \sum_{p_i \in r_1} \varphi_i \sum_{p_j \in r_2} \varphi_j\, d(p_i, p_j)$
$\qquad = \sum_{p_j \in r_2} \varphi_j \sum_{p_i \in r_1} \varphi_i\, d(p_i, p_j)$
$\qquad = D_{OTM}(\langle r_2, \varphi_2 \rangle, \langle r_1, \varphi_1 \rangle)$
The one-to-many distance is the weighted sum of the Wasserstein distances between each
individual point within a distribution and the entirety of another distribution.
It is
straightforward to directly calculate DW between a point and a distribution and
computing $D_{OTM}$ requires $O(m^2)$ time, where m is the maximum number of weighted
points in the distributions. Also, we note from these definitions that DOTM is symmetric.
From this definition, we can now formulate $D_{OTM}$ between the Hebbian projections of $r_1, r_2 \subseteq M_A$ onto $M_B$:

$D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = \sum_{i \in \vec{H}(r_1)} \omega_i\, D_W\!\left(i, \vec{H}(r_2)\right)$   (3.13)

$\qquad\qquad\qquad\quad\;\, = \sum_{i \in \vec{H}(r_1)} \omega_i \sum_{j \in \vec{H}(r_2)} \omega_j\, D(i, j)$   (3.14)

where $\omega_i = \vec{H}(r_1)_i$ and $\omega_j = \vec{H}(r_2)_j$. Recall that $\vec{H}(r)_i$ is the $i$th component of $\vec{H}(r)$. We see this is symmetric:

$D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = D_{OTM}\!\left(\vec{H}(r_2), \vec{H}(r_1)\right)$

The one-to-many distance is a weighted sum of the Wasserstein distances between each individual region in a projection and the other projection in its entirety. The weights are taken directly from the original Hebbian projections. This is represented by the term $\omega_i\, D_W\!\left(i, \vec{H}(r_2)\right)$ in (3.13). We note that the Wasserstein distance between a single region and a distribution is trivial to compute directly, as shown in (3.14), and it can be calculated in $O(m)$ time, where $m$ is the number of codebook regions in Mode B. Therefore, $D_{OTM}$ can be computed in $O(m^2)$ time.
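The discrete one-to-many computation is short enough to state directly. The sketch below reuses the illustrative ground-distance matrix D and projections H1 and H2 from the previous sketch and simply forms the doubly weighted sum of equation (3.14).

def one_to_many(w1, w2, D):
    """One-to-many distance between two discrete distributions over a common
    codebook with ground-distance matrix D: the doubly weighted sum
    sum_i sum_j w1_i * w2_j * D_ij. Symmetric and computable in O(m^2) time."""
    return float(w1 @ D @ w2)

print(one_to_many(H1, H2, D))                      # equals one_to_many(H2, H1, D)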
Figure 3.11 – A simpler example. Mode B is the same as in previous examples but the events in Mode A
have been separated so they do not overlap. The colored ellipses show how the external red and blue
events probabilistically appear within each mode.
Figure 3.12 – Hebbian projections from regions in Mode B onto Mode A. Notice that the Hebbian
projections have no overlap, which simplifies the visualization and discussion in the text. However, this
does not affect the generality of the results.
3.6.4 Visualizing the Metrics: A Simpler Example
Consider the example in Figure 3.11. We have modified Mode A, on the bottom, so that
its events no longer overlap. (Mode B on top remains unchanged.) This will simplify the
presentation but does not affect the generality of the results presented here. As before,
the two world events perceived by each mode are delineated with colored ellipses for the
benefit of the reader but the modes themselves have no knowledge of them. The Hebbian
projections from two codebook regions in Mode B are shown in Figure 3.12. We see this
example was designed so that the projections have no overlap, making it easy to view
them independently.
We now give an intuitive interpretation of the two distance metrics, DW and DOTM , based
on the classic statement of the Transportation Problem (Monge 1781). This problem is
more naturally viewed with discrete distributions, but the presentation generalizes readily
to continuous distributions. Consider the Hebbian projections from our example in isolation, as shown in Figure 3.13. On the left, $\vec{H}(r_1)$ is shown in red, and $\vec{H}(r_2)$ is shown
in blue on the right. The shading within each Voronoi region is proportional to its weight
(i.e., point density) within its respective distribution.
In the Transportation Problem, we imagine the red regions (on the left) need to deliver
supplies to the blue regions (on the right). Each red region contains a mass of supplies
proportional to its shading and each blue region is expecting a mass of supplies
proportional to its shading. (We know that mass being "shipped" is equal to the mass
being "received" because they are described by probability distributions.) The one-to-many distance is how much work would be necessary to deliver all of the material from the red regions to the blue regions if each region had to independently deliver its mass proportionally to all regions in the other distribution. Work here is defined as mass × distance.
The Wasserstein distance computes the minimum amount of work that would be
necessary if the regions cooperate with one another. Namely, red regions could deliver
material to nearby blue regions on behalf of other red regions, and blue regions could
receive material from nearby red regions on behalf of other blue regions. Nonetheless, we
maintain the restriction that each region has a maximum amount it can send or receive,
corresponding to its point density. This is why the Wasserstein distance computes the
solution to the Transportation Problem, which is directly concerned with this type of
delivery optimization.
Thus, we may summarize that DOTM computes an unoptimized Transportation
Problem, where cooperation is forbidden, and that DW computes the optimized
Transportation Problem, where cooperation is required.
Figure 3.13 – Visualizing $D_W$ and $D_{OTM}$ through the Transportation Problem. We examine the Hebbian projections onto Mode A shown in Figure 3.12. $\vec{H}(r_1)$ is shown in red on the left and $\vec{H}(r_2)$ is shown in blue on the right. Each region is shaded according to its point density. In the Transportation Problem, we want to move the "mass" from one distribution onto the other. If we define work = mass × distance, then $D_{OTM}$ computes the work required if each codebook region must distribute its mass proportionally to all regions in the other distribution. $D_W$ computes the work required if the regions cooperatively distribute their masses, to minimize the total amount of work required.
3.6.5 Defining Similarity
We are now in a position to formalize the intuitive notion of similarity presented above.
We define a new metric called the Similarity distance ($D_S$) between continuous distributions $\mu$ and $\nu$:

$D_S(\mu, \nu) = \dfrac{D_W(\mu, \nu)}{D_{OTM}(\mu, \nu)}$

and over weighted point sets:

$D_S(\langle r_1, \varphi_1 \rangle, \langle r_2, \varphi_2 \rangle) = \dfrac{D_W(\langle r_1, \varphi_1 \rangle, \langle r_2, \varphi_2 \rangle)}{D_{OTM}(\langle r_1, \varphi_1 \rangle, \langle r_2, \varphi_2 \rangle)}$
We thereby define the Similarity distance between the Hebbian projections of $r_1, r_2 \subseteq M_A$ onto $M_B$:

$D_S\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = \dfrac{D_W\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)}{D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)}$   (3.15)
The Similarity distance is the ratio of the Wasserstein to the one-to-many distance. It
measures the optimization gained when transferring the mass between two spatial
probability distributions if cooperation is allowed.
Intuitively, it normalizes the
Wasserstein distance. It is scale invariant (see Figure 3.14) and captures our desired
notion of similarity.
An important note to avoid confusion: Because DS is a distance measure based on
similarity – and not a similarity measure – it is smaller for things that are more similar
and larger for things that are less similar. So, for any distribution ν , DS (ν ,ν ) = 0 ,
expressing the notion that anything is (extremely) similar to itself.
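Putting the pieces together, a minimal sketch of the Similarity distance of equation (3.15) is just the ratio of the two quantities computed above; it reuses the illustrative wasserstein and one_to_many helpers (and D, H1, H2) from the earlier sketches.

def similarity_distance(w1, w2, D):
    """Similarity distance (3.15): the ratio of the Wasserstein distance to the
    one-to-many distance over a common codebook with ground-distance matrix D.
    It lies in [0, 1]; identical distributions give 0, widely separated ones
    approach 1."""
    d_otm = one_to_many(w1, w2, D)
    if d_otm == 0.0:
        return 0.0                     # both distributions sit on one shared region
    return wasserstein(w1, w2, D) / d_otm

print(similarity_distance(H1, H2, D))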
We briefly examine the behavior of $D_S$ at and between its limits. Let $\vec{H}(r_1)$ and $\vec{H}(r_2)$ be identical Hebbian projections separated by some distance $\Delta$. Then we have the
following properties:
Figure 3.14 – Examining Similarity distance. Comparing the distributions in the two examples, we have $D_S(\text{Example A}) \ll D_S(\text{Example B})$, which captures our intuitive notion of similarity.
1) We see that as the distributions are increasingly separated, the optimization provided in the Wasserstein calculation disappears:

$\lim_{\Delta \to \infty} D_W\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)$

and therefore:

$\lim_{\Delta \to \infty} D_S\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = 1$

2) As the distributions are brought closer together, the Wasserstein distance decreases much faster than the one-to-many distance:

$\dfrac{\partial}{\partial \Delta} D_W\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) > \dfrac{\partial}{\partial \Delta} D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)$

3) As they approach one another, this eventually dominates the calculation:

$\lim_{\Delta \to 0} D_W\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = 0$ and therefore $\lim_{\Delta \to 0} D_S\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) = 0$

4) So, we see that $0 \leq D_S\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) \leq 1$ and $D_S$ varies non-linearly between these limits.
On the next two pages, we visualize the dependence of $D_S$ on the distance $\Delta$ and the
angle θ between pairs of samples drawn from different distributions. We examine
samples drawn from Gaussian and Beta distributions in Figure 3.15 and Figure 3.16
respectively.
Figure 3.15 – Effects on DS as functions of the distance DE and angle θ between Gaussian distributions. As the distance DE or angle θ between two Gaussian distributions decreases, we see how their corresponding similarity distance DS decreases non-linearly in the graph on the bottom. DE is shown in red and θ is shown in blue.
Figure 3.16 – Effects on DS as functions of the distance DE and angle θ between Beta distributions. As the distance DE or angle θ between the two Beta distributions decreases, we see how their corresponding similarity distance DS decreases non-linearly in the graph on the bottom. DE is shown in red and θ is shown in blue.
3.6.5.1 A Word about Generality
We can use Similarity distance to compare arbitrary discrete spatial probability
distributions. Several such comparisons are illustrated in Figure 3.17. These examples
are important, because our being able to compute Similarity distances between these
distributions means that we will be able to perceptually ground events drawn from their
mixtures. That all our examples have so far involved mixtures of Gaussians has simply
been for convenience. We will demonstrate later in this chapter that we can separate
events corresponding to a wide assortment of mixture distributions such as the ones
shown here.
Figure 3.17 – Comparing spatial probability distributions. In each slice, the green and blue points
represent samples drawn from equivalent but rotated 2-dimensional distributions. For each example, we
identify the source distribution and the Similarity distance between the green and blue points. (A) A 2-D
beta distribution with a, b = 4. Note the low density of points in the center of the distributions and the
corresponding sizes of the codebook regions. DS = 0.22. (B) A 2-D uniform distribution. DS = 0.11. (C)
A 2-D Gaussian distribution with σ = (.15, .04). DS = 0.67. (D) A 2-D Poisson distribution, with λ =
(50,30) and scaled by (.8, .3). DS = .55.
Figure 3.18 – Some familiar non-parametric probability distributions within codebooks.
We can also apply the Similarity distance to non-parametric distributions. For example,
consider the familiar distributions shown in Figure 3.18. We have examined using DS
for handwriting recognition, and one can notice some of the well known properties of
codebooks constructed on contours in the above images, for example, how they capture
the medial axes.
3.7 Defining the Distance between Regions
We now use Similarity distance to define the Cross-Modal distance ($D_{CM}$) between two regions $r_1, r_2 \in M_A$ with respect to mode $M_B$:

$D_{CM}(r_1, r_2) = \left[ (1-\lambda)\,[D_E(r_1, r_2)]^2 + \lambda \left[ \sqrt{2}\, D_S\!\left(\vec{H}(r_1), \vec{H}(r_2)\right) \right]^2 \right]^{1/2}$   (3.16)

$\qquad\qquad\;\; = \left[ (1-\lambda)\,[D_E(r_1, r_2)]^2 + 2\lambda \left( \dfrac{D_W\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)}{D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)} \right)^{\!2}\, \right]^{1/2}$   (3.17)
where DE is Euclidean distance and λ is the relative importance of cross-modal to
Euclidean distance. Thus the distance between two regions within a slice is defined to
have some component (1 − λ ) of their Euclidean distance and some component ( λ ) of
the Similarity distance between their Hebbian projections. This is illustrated in Figure
3.19.
In almost all uses of cross-modal distance in this thesis, we set λ = 1 and ignore
Euclidean distance entirely.
However, in some applications, e.g., hand-writing or
drawing recognition, spatial locality within a slice is important because it is a
fundamental component of the phenomenon being recognized. If so, we can use a lower value of $\lambda$. Determining the proper balance between Euclidean and Similarity distances is an
empirical process for such applications.
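The blending in (3.16) is straightforward to state in code. The sketch below reuses the illustrative similarity_distance helper from above; the region centroids are hypothetical, lam plays the role of λ, and with the default lam = 1.0 the Euclidean term drops out, as in most of this thesis.

import numpy as np

def cross_modal_distance(h1, h2, c1, c2, D_other, lam=1.0):
    """Cross-modal distance (3.16) between two regions of one slice: h1 and h2
    are their Hebbian projections onto a co-occurring mode, c1 and c2 are their
    own centroids, and D_other is the ground-distance matrix over the other
    mode's codebook. With lam = 1.0 the Euclidean term is ignored."""
    d_e = np.linalg.norm(c1 - c2)                          # Euclidean component
    d_s = similarity_distance(h1, h2, D_other)             # cross-modal component
    return np.sqrt((1.0 - lam) * d_e**2 + lam * (np.sqrt(2.0) * d_s)**2)

# Two regions of Mode B with hypothetical centroids, projected onto Mode A.
print(cross_modal_distance(H1, H2, np.array([0.2, 0.2]), np.array([0.8, 0.8]), D))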
So far, we have only considered two co-occurring modes simultaneously to keep the
examples simple.
However, it is straightforward to generalize the definition to
incorporate additional modalities, and the calculation scales linearly with the number of
modalities involved. To define the cross-modal distance ( DCM ) between two regions
$r_1, r_2 \in M_A$ with respect to a set of co-occurring modes $M_I \in \mathcal{M}$, we define:

$D_{CM}(r_1, r_2) = \left[ (1-\lambda_E)\,[D_E(r_1, r_2)]^2 + 2 \sum_{M_I \in \mathcal{M}} \lambda_I \left[ D_S\!\left(\vec{H}_{AI}(r_1), \vec{H}_{AI}(r_2)\right) \right]^2 \right]^{1/2}$   (3.18)

where the contribution of each mode $M_I$ is weighted by $\lambda_I$ and we set $\lambda_E + \sum \lambda_I = 1$.
For guidance in setting the values of the λI , we can turn to (Ernst and Banks 2002), who
found that in intersensory influence, people give preference to senses which minimize the
variance in joint perceptual interpretations, confirming an earlier prediction by (Welch
and Warren 1986) about sensory dominance during multimodal interactions. This lends
credence to our hypothesis in section 3.6.3.2 regarding the computational value of
entropy minimization in the selection of perceptual features. We reexamine these issues
in the dynamic model presented in Chapter 4.
[Figure 3.19: The cross-modal distance $D_{CM}(r_1, r_2)$ is calculated from (1) their Euclidean distance $D_E(r_1, r_2)$ and (2) the Similarity distance of their Hebbian projections, $D_S(\vec{H}(r_1), \vec{H}(r_2))$.]
Figure 3.19 – Calculating the cross-modal distance between codebook regions in a slice. The distance is a function of their local Euclidean distance and how similar they appear from the perspective of a co-occurring modality. To determine this for regions r1 and r2 in Mode B on top, we project them onto
Mode A, as shown in the middle. We then compute the Similarity distance of their Hebbian projections, as
shown on the bottom.
3.7.1 Defining a Mutually Iterative System
In this section, we show how to use the cross-modal distance function defined above to
calculate the distances between regions within a slice.
This statement may seem
surprising. Why is any elaboration required to use DCM , which we just defined? There
are two remaining issues we must address:
1) We have yet to specify the distance function D used to define the Wasserstein
distance in equations (3.7) and (3.8), which was also "inherited" in our definition
of the one-to-many distance in equation (3.14).
2) By defining distances cross-modally, we have created a mutually recursive system of functions. Consider any two regions $r_1, r_2$ in mode $M_A$. When we calculate $D_{CM}(r_1, r_2)$, we are relying on knowing the distances between regions within another mode $M_B$, which are used to calculate $D_S\!\left(\vec{H}(r_1), \vec{H}(r_2)\right)$.
However, the distances between regions in M B are calculated exactly the same
way but with respect to M A . So, every time we calculate distances in a mode, we
are implicitly changing the distances within every other mode that relies upon it.
And of course, this means its own inter-region distances may change as a result!
How do we account for this and how do we know such a system is stable?
We will approach both of these issues simultaneously. Suppose we parameterize the
distance function D in all of our definitions:
$D_W\!\left(\vec{H}(r_1), \vec{H}(r_2), D\right) = \min_{j_1, \ldots, j_m} \left( \frac{1}{m} \sum_{i=1}^{m} \left[ D\!\left(\vec{H}(r_1)_i,\ \vec{H}(r_2)_{j_i}\right) \right]^2 \right)^{1/2}$   (3.19)

$D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2), D\right) = \sum_{i \in \vec{H}(r_1)} \omega_i\, D_W\!\left(i, \vec{H}(r_2), D\right)$   (3.20)

$D_S\!\left(\vec{H}(r_1), \vec{H}(r_2), D\right) = \dfrac{D_W\!\left(\vec{H}(r_1), \vec{H}(r_2), D\right)}{D_{OTM}\!\left(\vec{H}(r_1), \vec{H}(r_2), D\right)}$   (3.21)

$D_{CM}(r_1, r_2, D) = \left[ (1-\lambda)\,[D_E(r_1, r_2)]^2 + 2\lambda \left[ D_S\!\left(\vec{H}(r_1), \vec{H}(r_2), D\right) \right]^2 \right]^{1/2}$   (3.22)
We now define an iterative function system on modes $M_A$ and $M_B$ that mutually calculates $D_{CM}$ over their regions.

Let $\Delta_X^t$ denote $D_{CM}$ in mode $M_X$ at time $t$. Recall that $D_E$ is Euclidean distance. For all pairs of regions $r_i, r_j \in M_A$ and $q_i, q_j \in M_B$, we define:

$\Delta_A^0(r_i, r_j) = D_E(r_i, r_j)$   (3.23)

$\Delta_B^0(q_i, q_j) = D_E(q_i, q_j)$   (3.24)

$\Delta_A^t(r_i, r_j) = D_{CM}\!\left(r_i, r_j, \Delta_B^{t-1}\right)$   (3.25)

$\Delta_B^t(q_i, q_j) = D_{CM}\!\left(q_i, q_j, \Delta_A^{t-1}\right)$   (3.26)
Thus, we start by assuming in (3.23) and (3.24) that the distances between regions in a slice are Euclidean, in the absence of any other information. (We later eliminate this assumption in the intermediate steps of cross-modal clustering, where we have good estimates on which to base the iteration.) The iterative steps are shown in (3.25) and (3.26), where at time $t$ we recalculate the distances within each slice based upon the distances in the other slice at time $t-1$. For example, notice how the definition of $\Delta_A^t(r_i, r_j)$ calculates $D_{CM}$ using $\Delta_B^{t-1}$ in (3.25). After all pairs of distances have been
computed at time t, we can then proceed to compute them for time t + 1 . As we did in
equation (3.18), we can easily generalize this system to include any number of mutually
recursive modalities. The complexity again scales linearly with the number of modalities
involved.
We stop the iteration when $\Delta_A^t$ and $\Delta_B^t$ begin to converge, which empirically tends to happen very quickly. Thus, we stop iterating on mode $M_X$ at time $t$ when:

$\max_{r_i, r_j \in M_X} \dfrac{\left| \Delta_X^t(r_i, r_j) - \Delta_X^{t-1}(r_i, r_j) \right|}{\Delta_X^t(r_i, r_j)} < \kappa$

For $\kappa = 0.9$, we typically have $t \leq 4$. We will refer to this final value of $\Delta_X^t$ for any regions $r_i, r_j \in M_X$ as $\bar{D}_{CM}(r_i, r_j)$.
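A compact sketch of the mutual iteration follows, again reusing the illustrative similarity_distance helper and taking λ = 1, so that $D_{CM}$ reduces to $\sqrt{2}\,D_S$ and the Euclidean distances appear only in the initialization. Each mode is summarized by its centroid array and by the matrix of its regions' Hebbian projections onto the other mode; the names, the toy data, and the convergence test are illustrative.

import numpy as np
from scipy.spatial.distance import cdist

def iterate_cross_modal(cent_A, cent_B, H_AB, H_BA, kappa=0.9, max_t=10):
    """Mutually iterate the cross-modal distances of (3.23)-(3.26) with lambda = 1.
    cent_A, cent_B : centroid arrays of the two slices (Euclidean initialization).
    H_AB[i]        : Hebbian projection of region i of Mode A onto Mode B; H_BA likewise.
    Returns the final distance matrices (Delta_A, Delta_B)."""
    delta_A = cdist(cent_A, cent_A)                      # eq. (3.23)
    delta_B = cdist(cent_B, cent_B)                      # eq. (3.24)

    def recompute(H, ground):
        n = len(H)
        return np.array([[np.sqrt(2.0) * similarity_distance(H[i], H[j], ground)
                          for j in range(n)] for i in range(n)])

    def max_rel_change(new, old):
        nz = new > 1e-12                                 # ignore zero (diagonal) entries
        return float(np.max(np.abs(new[nz] - old[nz]) / new[nz])) if nz.any() else 0.0

    for t in range(1, max_t + 1):
        new_A = recompute(H_AB, delta_B)                 # eq. (3.25): uses Delta_B at t-1
        new_B = recompute(H_BA, delta_A)                 # eq. (3.26): uses Delta_A at t-1
        converged = (max_rel_change(new_A, delta_A) < kappa and
                     max_rel_change(new_B, delta_B) < kappa)
        delta_A, delta_B = new_A, new_B
        if converged:
            break
    return delta_A, delta_B

# Tiny illustrative example: 3 regions per mode with made-up projections.
cent_A = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
cent_B = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
H_AB = np.array([[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]])
H_BA = np.array([[0.7, 0.3, 0.0], [0.2, 0.6, 0.2], [0.0, 0.3, 0.7]])
print(iterate_cross_modal(cent_A, cent_B, H_AB, H_BA))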
With this, we complete our formal definition of the slice data structure. The final
component necessary for specifying the topological manifold defined by a slice was the
non-Euclidean distance metric between the hyperclustered regions. We now define this
distance to be $\bar{D}_{CM}$.
3.8 Cross-Modal Clustering
Recall that our goal has been to combine codebook regions to "reconstruct" the larger
perceptual regions within a slice. The definition of the iterated cross-modal distance $\bar{D}_{CM}$ in the previous section allows us to proceed, because it suggests how to answer the
following fundamental question:
Can any other modality distinguish between two regions in the same
codebook? If not, then they represent the same percept.
Because $\bar{D}_{CM}$ represents the distance between two regions from the perspective of other modalities, we will use it to define a metric that determines whether to combine them or not. If $\bar{D}_{CM}(r_i, r_j)$ is sufficiently small for two regions $r_i, r_j \subseteq M_A$, then we will say they are indistinguishable and therefore part of the same perceptual event. If $\bar{D}_{CM}(r_i, r_j)$ between two regions is large, we will say they are distinguishable and therefore cannot be part of the same perceptual event. These criteria suggest the general structure of our cross-modal clustering algorithm. One important detail remains: how small must $\bar{D}_{CM}(r_i, r_j)$ be for us to say it is "sufficiently" small? How do we define the threshold for merging two regions? An earlier version of this work appeared in (Coen 2005).
3.8.1 Defining Self-Distance
We define the notion of self-distance, which measures the internal value of D CM within
an individual region. Thus, rather than measure the distance between two different
regions, which has been our focus so far, self-distance measures the internal cross-modal
distance between points within a single region. It is a measure of internal coherence and
will allow us to determine whether two different regions are sufficiently similar to merge.
Suppose we are considering merging two regions r_1 and r_2 within some slice M, where
we have already determined their cross-modal distance D_CM(r_1, r_2). For example, on the
left in Figure 3.20 we are considering merging the green and blue regions. Let us
hypothetically assume that we did merge them to create a new region r ' . We will now
Figure 3.20 -- Determining when to merge two regions. On the left, we are considering merging the green
and blue regions r1 and r2 . To determine whether or not to proceed, we divide the candidate resulting
region using a hyperplane generated by its principal components, as shown on the right. This generates
two new regions r^+ and r^-. If D_CM(r^+, r^-) << D_CM(r_1, r_2), then we determine the regions should not be
merged. See the accompanying text for details.
immediately split this new region into two pieces, r^+ and r^-, as shown on the right in
Figure 3.20.
We now ask: what is the cross-modal distance D_CM(r^+, r^-) between these
new regions? The intuition here is that if r_1 and r_2 really are part of the same region,
then if we split r' differently, we should find the two new regions have approximately
the same cross-modal distance D_CM(r_1, r_2). In other words, the Hebbian projections of the
new regions r + and r − should be roughly as similar as the Hebbian projections of the
original regions r1 and r2 we are considering merging, because they should co-occur in
the same way with other modalities.
If the distance D_CM(r^+, r^-) is much less than D_CM(r_1, r_2), then we have merged two
regions that are actually different from one another. Why? Because we would now be
averaging the Hebbian projections of two genuinely different regions, which would
drastically increase their similarity and therefore make D_CM(r^+, r^-) much smaller than
D_CM(r_1, r_2). The question remains then,
how do we divide r ' ?
We will partition r ' by fitting a linear orthogonal regression onto it.
For a slice
M ⊆ ℝ^N, this will generate an (N − 1)-dimensional hyperplane that divides r' into two
sets, r + and r − , minimizing the perpendicular distances from them to the hyperplane.
Note that because the data are drawn from independent distributions, there is no error-free
predictor dimension that generates the other dimensions according to some function.
This is equivalent to the case where all variables are measured with error, and standard
least squares techniques do not work in this circumstance. We therefore perform
principal components analysis on the points in region r' and generate the hyperplane by
retaining its N − 1 principal components. This computes what is known as the
orthogonal regression, and it works even when all the data in r' are independent.
We use this hyperplane to partition r ' into r + and r − , as shown in Figure 3.20. We
define the self-distance (D_self) of region r':

    D_self(r') = D_CM(r^+, r^-) / D_CM(r_1, r_2)        (3.27)
If Dself (r ') < ½, then we do not combine r1 and r2 , because this indicates we would be
averaging two dissimilar Hebbian projections were the merger to occur. At the moment,
this remains an empirical statement, but we note that this threshold is not a parameter of
the cross-modal clustering algorithm and it is fixed throughout the results in this thesis.
In practice, the self-distance value ( Dself ) tends towards either zero or one, which
motivated our selection of ½ as the merger threshold. A more theoretical investigation of
this empirical criterion is among our future work.
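The splitting test can be sketched as follows, assuming each region is given as an array of its sensory points and that cross_modal_distance(a, b) stands in for D_CM between two point sets; both names are placeholders for this illustration.

    import numpy as np

    def self_distance(r1, r2, cross_modal_distance):
        """Hypothetically merge r1 and r2, re-split the union along the hyperplane of its
        top N-1 principal components, and form the ratio in equation (3.27)."""
        merged = np.vstack([r1, r2])
        centered = merged - merged.mean(axis=0)

        # Orthogonal regression: the hyperplane normal is the direction of least
        # variance, i.e., the last principal component of the merged points.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        normal = vt[-1]

        # Partition the merged region by which side of the hyperplane each point falls on.
        side = centered @ normal
        r_plus, r_minus = merged[side >= 0], merged[side < 0]

        return cross_modal_distance(r_plus, r_minus) / cross_modal_distance(r1, r2)

    # Merge r1 and r2 only if self_distance(r1, r2, dcm) >= 0.5.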
3.8.2 A Cross-Modal Clustering Algorithm
We now present an algorithm for combining codebook clusters into regions that represent
the sensory events within a slice. This is done in a greedy fashion, by combining the
closest regions according to D_CM within each slice. We use the definition of self-distance to derive a threshold for ensuring regions are sufficiently close to merge.
Afterwards, we examine the algorithm and some examples of its output.
Cross-Modal Clustering:

Given: A set of slices M and λ, the parameter for weighting Euclidean to Similarity
distances. For each slice M_i ∈ M, we will call its codebook C_i = {p_1, ..., p_{k_i}}.

Initialization: For each slice M_i ∈ M, initialize a set of regions R_i = C_i. Each slice
will begin with a set of regions based on its codebook. We will merge these regions
together in the algorithm below.

Algorithm:
    Calculate D_CM over the slices in set M.
    While (true) do:
        Calculate D_CM over the slices in set M. Use the current D_CM as the t = 0 value.
        For each slice M_i ∈ M:
            Sort the pairs of regions in M_i, r_a, r_b ∈ R_i, by D_CM(r_a, r_b)
            For each pair r_a, r_b ∈ R_i, in sorted order:
                If D_self(r') ≥ .5, where r' = r_a ∪ r_b:
                    Merge(r_a, r_b)
                    Exit inner for loop.
            For each codebook cluster p_i in C_i:
                Let r = argmin_{r ∈ R_i} [ D_CM(p_i, r) ]
                Move p_i into region r
        If no regions were merged in any slice:
            Either wait for new data or stop

Procedure Merge(r_a, r_b):
    r_a = r_a ∪ r_b
    R_i = R_i \ r_b
    For all regions r_c ∈ R_i, set D_CM(r_a, r_c) = min( D_CM(r_a, r_c), D_CM(r_b, r_c) ).
The cross-modal clustering algorithm initially creates a set of regions in each slice
corresponding to its codebook. The goal is to merge these regions based on their cross-modal distances. The algorithm proceeds in a two-step greedy fashion:
1) For each slice, consider its regions in pairs, sorted by D CM . If we find two
regions ra and rb satisfying Dself ( r ') ≥ .5, where r ' = ra ∪ rb , we merge them and
move onto the next step.
2) If as a result of this merger, some codebook cluster pi is now closer to another
region, we simply move it there.
When we merge two regions, we set the pairwise distances to other regions to be the
minimum of the distances to the original regions, because we now view them as all part
of the same underlying perceptual event and therefore equivalent to one another. At the
end of each loop, we recompute D CM using the current value as the starting point in the
iteration, which propagates the effects of mergers to the other slices in M. In the event no
mergers are made in any slices, we can choose to either wait for new data, which will
update the Hebbian linkages, or we can terminate the algorithm, if we assume sufficient
training data has already been collected.
Most clustering techniques work by iteratively refining a model subject to an
optimization constraint.
The iterative refinement in our algorithm occurs in the
recalculation of D CM , which is updated after each round of mergers within the slices.
This spreads the effect of a merger within a slice by changing the Similarity distances
between Hebbian projections onto it. This in turn changes the distances between regions
in other slices and so forth, as discussed in section 3.7.1. The optimization constraint is
that we do not create regions whose internal self-distances fall below the merger
threshold defined above, since a drastic decrease in self-distance would indicate that the
regions under consideration are viewed differently by other co-occurring modalities.
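A compact sketch of the greedy step within a single slice appears below; it mirrors the pseudocode above but abstracts D_CM and the self-distance test behind the hypothetical callables dcm and self_distance, and it leaves the recomputation of D_CM across slices to the caller.

    from itertools import combinations

    def merge_step(regions, dcm, self_distance, threshold=0.5):
        """One greedy step within a slice: try region pairs in order of increasing D_CM and
        merge the first pair whose hypothetical union passes the self-distance test.
        Returns True if a merge occurred; the caller then recomputes D_CM over all slices."""
        pairs = sorted(combinations(range(len(regions)), 2),
                       key=lambda ij: dcm(regions[ij[0]], regions[ij[1]]))
        for i, j in pairs:
            if self_distance(regions[i], regions[j]) >= threshold:
                regions[i] = regions[i] | regions[j]   # regions held as sets of codebook clusters
                del regions[j]                         # mirrors Procedure Merge above
                return True
        return False

The outer loop over slices, the reassignment of stray codebook clusters, and the propagation of merged distances follow the pseudocode directly.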
Figure 3.21 – The progression of the cross-modal clustering algorithm. (A) shows the initial codebook
creation in each slice. (B) and (C) show intermediate region formation. (D) shows the correctly clustered
outputs, with the confusion region between the categories indicated by the yellow region in the center.
Note in this example, we set λ = .7 to make region formation easier to see by favoring spatial locality. The
final clustering was obtained by setting λ = 1.
Figure 3.22 – The output of cross-modally clustering four overlapping Gaussian distributions in each slice.
The confusion region between them is indicated in the center of the clusters.
Figure 3.23 – Finding one cluster embedded in another. In mode B, cross-modal clustering is able both to
detect the small cluster embedded in the larger one and to use this separation of clusters to detect those in
mode A. This is due to the non-Euclidean scale invariance of Similarity distance, which is used for
determining the cross-modal distance between regions. Thus, region size is unimportant in this framework,
and "small" regions are as effective in disambiguating other modes as are "large" regions.
Figure 3.24 – Self-supervised acquisition of vowels (monophthongs) in American English. This work is
the first unsupervised acquisition of human phonetic data of which we are aware. The identifying labels
were manually added for reference and ellipses were fit onto the regions to aid visualization. All data have
been normalized. Note the correspondence between this and the Peterson-Barney data shown below.
Figure 3.25 – The Peterson-Barney dataset. Note the correspondence between this and Figure 3.24.
The progression of the algorithm starting with the initial codebook is shown in Figure
3.21. Setting λ < 1 includes Euclidean distances in the calculation of D CM . This favors
mergers between adjacent regions, which makes the algorithm easier to visualize. At the
final step, setting λ = 1 and thereby ignoring Euclidean distance allows the remaining
spatially disjoint regions to merge.
Had we been uninterested in visualizing the
intermediate clusterings, we would have set λ = 1 at the beginning. Doing so yields an
identical result with this dataset, but the regions merge in a different, disjoint order. Note
in general, however, it is not the case that different values of λ yield identical
clusterings. All other examples in this thesis use λ = 1 exclusively.
Figure 3.22 demonstrates that the algorithm is able to resolve multiple overlapping
clusters, in this case, two mixtures of four Gaussian distributions. Figure 3.23 shows an
important property of the Similarity distance, namely, it is scale invariant. The smaller
cluster in Mode B is just as "distinct" as the larger one in which it is embedded. It is both
detected and used to help cluster the regions in Mode A.
3.9 Clustering Phonetic Data
In Chapter 2, we asked the basic question of how categories are learned from unlabelled
perceptual data. In this section, we provide an answer to this question using cross-modal
clustering. We present a system that learns the number (and formant structure) of vowels
(monophthongs) in American English, simply by watching and listening to someone
speak and then cross-modally clustering the accumulated auditory and visual data. The
system has no advance knowledge of these vowels and receives no information outside of
its sensory channels. This work is the first unsupervised machine acquisition of phonetic
structure of which we are aware.
For this experiment, data was gathered using the same pronunciation protocol employed
by (Peterson and Barney 1952). Each vowel was spoken within the context of an English
word beginning with [h] and ending with [d]; for example, /ae/ was pronounced in the
context of "had."
Each vowel was spoken by an adult female approximately 90-140
times. The speaker was videotaped and we note that during the recording session, a small
number of extraneous comments were included and analyzed with the data. The auditory
and video streams were then extracted and processed.
Formant analysis was done with the Praat system (Goedemans 2001, Boersma and
Weenink 2005), using a 30ms FFT window and a 14th order LPC model. Lip contours
were extracted using a system written by the author and described in Chapter 2. Time-stamped formant and lip contour data were fed into slices in an implementation of the
work in this thesis written by the author in Matlab and C. This implementation is able to
visually animate many of the computational processes described here. This capability
was used to generate most of the figures in this thesis, which represent actual system
outputs.
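For readers who want a concrete picture of the formant front end, the sketch below estimates formant candidates from a single 30 ms frame with a 14th-order LPC model fit by the autocorrelation method. It is illustrative only and is not the Praat analysis used here; the sampling rate, window, and the limits used to prune the roots are assumptions.

    import numpy as np

    def lpc_formants(frame, sample_rate, order=14):
        """Estimate formant frequencies (Hz) from one windowed frame via LPC root-finding."""
        x = frame * np.hamming(len(frame))
        # Autocorrelation at lags 0..order.
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        # Solve the Yule-Walker equations R a = -r[1:] for the LPC coefficients.
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))
        # Formant candidates are the angles of the complex poles of the LPC model.
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]          # keep one member of each conjugate pair
        freqs = np.angle(roots) * sample_rate / (2 * np.pi)
        return np.sort(freqs[(freqs > 90) & (freqs < sample_rate / 2 - 50)])

    # Example: at 16 kHz, a 30 ms frame is 480 samples; the lowest two returned
    # frequencies approximate F1 and F2 for a vowel frame.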
Figure 3.24 shows the result of cross-modally clustering formant data with respect to lip
contour data.
Notice the close correspondence between the formant clusterings in
Figure 3.24 and Figure 3.25, which displays the Peterson-Barney dataset introduced earlier. We
see the cross-modal clustering algorithm was able to derive the same clusters with the
same spatial topology, without knowing either the number of clusters or their
distributions.
The formant and lip slices are shown together in Figure 3.26, where the colors show
region correspondences between the slices. This picture exactly captures what we mean
by mutual bootstrapping. Initially, the slices "knew" nothing about the events they
perceive.
Cross-modal clustering lets them mutually structure their perceptual
representations and thereby learn the event categories that generated their sensory inputs.
The black lines in the figure connect neighboring regions within each slice and the red
lines connect corresponding regions in different slices. They show a graph view of the
clustering within each slice and illustrate how a higher-dimensional manifold may be
constructed out of lower-dimensional slices. This proposes an alternative view of the
structures created by cross-modal clustering, which we hope to explore in future work.
Figure 3.26 – Mutual bootstrapping through cross-modal clustering. This displays the formant and lip
slices together, where the colors show the region correspondences that are obtained from cross-modal
clustering. Initially, the slices "knew" nothing about the events they perceive. Cross-modal clustering lets
them mutually structure their perceptual representations and thereby learn the event categories that
generated their sensory inputs. The black lines in the figure connect neighboring regions within each slice
and the red lines connect corresponding regions in different slices. The identifying labels were manually
added for reference and ellipses were fit onto the regions to aid visualization. All data have been
normalized.
3.10 Related Work
There is a vast literature on unsupervised clustering techniques.
However, these
generally make strong assumptions about the data being clustered, have no corresponding
notion of correctness associated with their results, or employ arbitrary threshold values.
The intersensory approach taken here is entirely non-parametric and makes no a priori
assumptions about the underlying distributions or the number of clusters being
represented. We examine several of the main avenues of related work below.
3.10.1 Language Acquisition
De Sa (1994) and de Sa and Ballard (1997) have taken a similar approach to the work
presented here.
Namely, they are interested in unsupervised learning of linguistic
category boundaries based on cross-modal co-occurrences. However, their approach
requires that the number of categories be known beforehand. For example, to separate
the vowels in English, they would need to know in advance that ten such vowels
exist. Thus, their approach cannot solve the perceptual grounding problem presented in
this chapter. However, their work is capable of refining boundaries once the categories
have been learned, which would be a useful addition to our framework. Thus, one can
easily imagine combining the two approaches to yield something more powerful than
either of them alone.
De Marcken (1996) has studied the unsupervised acquisition of English from audio
streams.
However, he uses the phonetic model of Young and Woodland (1993)
implemented in the HTK HMM toolkit to perform phonetic segmentation, which has
already been trained to segregate phonemes. Thus, this aspect of his work is heavily
supervised.
Other unsupervised linguistic learning systems (e.g., McCallum and Nigam 1998) are
built around the Expectation Maximization algorithm (Dempster et al. 1977), which we
discuss below. These approaches make very strong assumptions about the parametric
nature of the data being clustered.
3.10.2 Machine Vision
The vast majority of research in machine vision has employed supervised learning
techniques. Some notable unsupervised approaches include Bartlett’s (2001) work on
using independent component analysis for face recognition, based upon assumed
statistical dependencies among image features. Although there are no obvious metrics
for correctness associated with the clustering itself, her selection of features has yielded
performance comparable to that of humans in face recognition. One may therefore view
her approach and ours as complementary. Her work would be an ideal basis for guiding
the feature selection that generates inputs into slices in our approach.
Stauffer (2002) has studied using multiple sensors to learn object and event segmentation.
The unsupervised learning component of his system assumes that the data are represented
by mixtures of Gaussians, which is quite reasonable given the enormous amount of data
collected by his sensor networks and their intended applications. In this, his framework
makes strong parametric assumptions.
It is also difficult to gauge the absolute
correctness of his results, in that his framework does not seem to have been applied to problems that
have well-defined metrics. As this is not the goal of his framework, this comment should
be viewed more as a differing characteristic than a criticism of his work.
3.10.3 Statistical Clustering
There is an enormous body of literature on statistical clustering techniques, which use
assumed properties of the data being clustered to guide the segmentation process. For
example, the Expectation Maximization (EM) algorithm (Dempster et al. 1977) is widely
used as a basis for clustering mixtures of distributions whose maximum likelihood
estimation is easy to compute. This algorithm is therefore popular for clustering known
finite numbers of Gaussian mixture models (e.g., Nabney 2002, Witten and Frank 2005).
However, if the number of clusters is unknown, the algorithm tends to converge to a local
minimum with the wrong number of clusters. Also, if the data deviate from a mixture of
Gaussian (or some expected) distributions, the assignment of clusters degrades
accordingly.
A wide range of clustering techniques (e.g., Cadez and Smyth 1999, Taskar et al. 2001)
base their category assignment upon the EM algorithm. As such, they tend to require
large amounts of data corresponding to presumed distributions, and they generally require
the number of expected clusters be known in advance. While these assumptions may be
reasonable for a wide range of applications, they violate the clustering requirements
stated here.
3.10.4 Blind Signal Separation
Separating a set of independent signals from a mixture is a classic problem in clustering.
For example, from a recording of an orchestra, we might like to isolate the sounds
contributed by each individual instrument. This is an extraordinarily difficult problem in
its general form. However, the assumption that the signals are mutually independent,
which is not always realistic, supports a number of approaches. Principal components
analysis (PCA) is a frequent tool used for separating signals, and while it is useful for
reducing the dimensionality of a dataset, it does not necessarily provide a notion of
correct clustering.
A number of interesting results have been found using PCA for data mining, for example,
Honkela and Hyvärinen’s (2004) work on linguistic data extraction, but our approach is
founded on the notion that our cross-modal signals are mutually dependent, not
independent. In that sense, our work makes a very different set of assumptions than most
efforts in signal separation.
Instead of looking for principal components in a high
dimensional space, which for example might correspond to the vowels in the example in
this chapter, we use a larger number of low dimensional spaces (i.e., slices) in parallel.
Thus, we approach the problem from an entirely different angle, and as such, our
framework appears to be more biologically realistic in that it is dimensionally compact.
3.10.5 Neuroscientific Models
The neuroscience community has proposed a number of approaches to unsupervised
clustering (e.g., Becker and Hinton 1995, Rodriguez et al. 2004, Becker 2005). These are
frequently based upon standard statistical approaches and often involve threshold values
that are difficult to justify independently. Most problematically, they generally have no
measure of absolute correctness. So, for example, the work of Aleksandrovsky et al.
(1996) learns multiple phonetic representations of audio signals without actually knowing
how many unique phonemes are represented. Although one may argue that each
phoneme is represented by multiple phones in practice, there is a clear consensus that it is
meaningful to speak about the actual number of phonemes within a
language, which these techniques do not capture.
3.10.6 Minimally Supervised Methods
Blum and Mitchell (1998) have proposed using a small amount of supervision to help
bootstrap unsupervised learning systems. Although their approach is clearly supervised,
one may argue that this supervision could come from phylogenetic development
(Tinbergen 1951) and thereby provide support for innate models. We return to this
discussion in Chapter 5, and note that it is entirely complementary to our approach. By
pre-partitioning slices, we can easily incorporate varying amounts of supervision into our
framework.
3.11 Summary
This chapter introduced slices, a neurologically inspired data structure for representing
sensory information.
Slices partition perceptual spaces into codebooks and then
reassemble them to construct clusters corresponding to the actual sensory events being
perceived. To enable this, we defined a new metric for comparing spatial probability
distributions called Similarity distance; this allows us to measure distances within slices
through cross-modal Hebbian projections onto other slices.
We then presented an algorithm for cross-modal clustering, which uses temporal
correlations between slices to determine which hyperclusters within a slice correspond to
the same sensory events. The cross-modal clustering algorithm does not presume that
either the number of clusters in the data or their distributions is known beforehand and
has no arbitrary thresholds.
We also examined the outputs and behavior of this algorithm on simulated datasets and
on real data gathered in computational experiments.
Finally, using cross-modal
clustering, we learned the ten vowels in American English without supervision by
watching and listening to someone speak. In this, we have shown that sensory systems
can be perceptually grounded by bootstrapping off each other.
Chapter 4
Perceptual Interpretation
We saw in Chapter 3 that sensory systems can mutually structure one another by
exploiting their temporal co-occurrences. We called this process of discovering shared
sensory categories perceptual grounding and suggested that it is a fundamental
component of cognitive development in animals; it answers the first question that any
natural (or artificial) creature faces: what different events in the world can I detect?
The subject of this chapter follows naturally from this question. Once an animal (or a
machine) has learned the set of events it can detect in the world, how does it know what it
is perceiving at any given moment? We refer to this as perceptual interpretation. We
will take the view that perceptual interpretation is inherently a dynamic – rather than
static – process that occurs during some window of time. This approach relaxes the
requirement that our perceptual categories be separable in the traditional machine
learning sense; unclassifiable subspaces are not a problem if we can determine how to
move out of them by relying on other modalities. We will argue that this approach is not
only biologically plausible, it is also computationally efficient in that it allows us to use
lower dimensional representations for modeling sensory and motor data.
4.1 Introduction
In this chapter, we introduce a new family of models called influence networks, which
incorporate temporal dynamics into our framework. Influence networks connect cross-modally
clustered slices and modify their sensory inputs to reflect the perceptual states
within other slices. This cross-modal influence is designed to increase perceptual
accuracy by fusing together information from co-occurring senses, which are all
experiencing the same sensory events from their unique perspectives. This type of cross-modal perceptual reinforcement is commonplace in the animal world, as we discuss in
Chapter 6.
Our approach will be to formulate a dynamic system in which slices are viewed as
coupled perceptual state spaces.
In these state spaces, sensory inputs move along
trajectories determined both locally – due to a slice's internal structure – and cross-modally
– due to the influence of other co-occurring slices. This addition of temporal dynamics
will allow us to define our notion of perceptual interpretation, which is the goal of this
chapter.
The model presented here is inspired by sensory dynamics in animals, but it does not
approach the richness or complexity inherent in biological perception.
Our intent,
however, is to work toward creating better artificial perceptual systems by providing
them with a more realistic sensorial framework.
4.2 The Simplest Complex Example
We will proceed by considering an example. We turn to the hypothetical perceptual
modes introduced in Chapter 3, as shown in Figure 4.1. Recall that each mode here is
capable of sensing the same two events in the world, which we have called the red and
blue events. The cross-modal clustering of these modes is shown in Figure 4.2. We see
that the larger perceptual regions have been assembled out of the codebook clusters,
shaded red or blue to indicate their corresponding sensory category. Our interest here,
however, is not in these large perceptual regions but rather in the small yellow regions at
their intersections. Sensory inputs within the yellow regions are ambiguous – these are
the inputs that we cannot classify, at least not without further information.
Although this example was selected for its simplicity, it is worthwhile pointing out some of
the complexity it presents, and thereby examining some of the assumptions in our
framework before we proceed. Notice that because the mixtures of Gaussians here
intersect near their means, the "small" yellow regions will receive almost as many
sensory inputs as either of the "large" blue and red ones. Estimating from the limits of
the density normalization performed during codebook generation, we expect at least 1/4
of the inputs in this example to be unclassifiable because they will fall into these yellow
confusion zones. If our goal is to categorize sensory inputs, these yellow regions will
prove quite troublesome; we need to find some way of avoiding them.
Figure 4.1 -- Two hypothetical co-occurring perceptual modes. Each mode, unbeknownst to itself,
receives inputs generated by a simple, overlapping Gaussian mixture model. For example, if a "red" event
takes place in the world, each mode would receive sensory input that probabilistically falls within its red
ellipse. To make matters more concrete, we might imagine Mode A is a simple auditory system that hears
two different events in the world and Mode B is a simple visual system that sees those same two events, which
are indicated by the red and blue ellipses.
[Figure 4.2 panels: Mode A, Mode B]
Figure 4.2 – The output of cross-modally clustering the modes in Figure 4.1. The perceptual regions have
been constructed out of the codebook clusters and are indicated respectively by the blue and red shading.
The confusion region between them is indicated at their intersection in yellow. To assist with visualization,
ellipses have been fit using least squares onto the data points within each sensory region, as determined by
cross-modal clustering. Note that approximately ¼ of the inputs within each mode fall into the confusion
region and are therefore ambiguous.
In fact, as things stand at the moment, we expect the overall classification error rate
would be somewhat higher than this. Sensory events are only probabilistically described
by the derived perceptual regions. Thus, inputs will not always fall into the "right" area,
at least according to the output of cross-modal clustering. We also assume that the
feature extraction (e.g., formant estimation) generating the sensory inputs to slices
introduces some degree of error; this may be due to noise, heuristic estimation,
instability, incorrect or incomplete modeling, limitations from perceptual thresholds, etc.
We elaborate on this point in Chapter 6, but it is reasonable to expect these types of
processing errors occur in both biological and artificial systems. Although not a source
of error, an additional complexity is that we generally expect many more than two modes
will be active simultaneously, corresponding to the fine-grained model of perception we
adopted in Chapter 1.
4.2.1 Visualizing an Influence Network
Our goal is to "move" sensory inputs within slices to make them easier to classify. In
doing so, we seek to avoid perceptual ambiguity when possible and to recover from
errors introduced during perceptual processing.
We have so far considered slices through their codebooks or through the entire perceptual
(e.g., red and blue) regions found within them by cross-modal clustering. We now
instead look at the local neighboring (connected) components within each of these larger
regions; we are going to call these local regions nodes (Figure 4.3); to be clear, the nodes
partition the modes (i.e., slices) into locally connected regions. Nodes correspond to the
representational areas that are easy to classify, and therefore, they will help us
disambiguate perceptual inputs.
We note that the lines between nodes represent
perceptual equivalence as determined by cross-modal clustering. Although they are
derived from Hebbian data, the lines do not represent the Hebbian linkages described
earlier because they are restricted to regions corresponding to the same perceptual
categories.
[Figure 4.3 panels: Mode B, Mode A]
Figure 4.3 – Viewing the nodes within the modes. Nodes are the locally connected components in each
perceptual region; they are indicated here by color and with ellipses, fit on their inputs after cross-modal
clustering. In this example, each mode has 5 nodes: 2 blue, 2 red, and 1 yellow. The colored lines indicate
perceptual equivalence between nodes in different slices.
We would like to use cross-modal correspondences, as indicated by these lines of
perceptual equivalence, to define a framework where slices can mutually disambiguate
one another. For example, suppose Mode B on top sees an input in a blue node. One
could certainly imagine that knowing this might help resolve a simultaneous ambiguous
input in Mode A.
In a static framework, we could for example implement a
disambiguation strategy using posterior probabilities (e.g., Wu et al. 1999). This would
be relatively straightforward, at least for a small number of co-occurring modes.
The problem with this solution, however, is that perception in animals has complex
temporal dynamics – generating percepts does not correspond to an instantaneous
decision process. This is evidenced during cross-modal influence (Calvert et al. 2004),
perceptual warping (Beale and Keil 1995), interpretative bistability (Blake et al. 2003),
habituation (Grunfeld et al. 2000), and priming (Wiggs and Martin 1998), and these
effects are particularly prominent during development (Thelen and Smith 1994). We will
examine these phenomena in detail in Chapter 6. For the moment, we further note that
even unimodal percept generation has complex sensory and temporal threshold dynamics
(Nakayama et al. 1986); in fact, different features may have different thresholds, which
themselves change dynamically (e.g., Hamill et al. 1989, Wang et al. 2002).
Most importantly, these phenomena are not simply biological "implementation" details;
they are not epiphenomena. Rather, they are fundamental components of perceptual
activity and are a large part of why animal perception is so robust.
Because the
perceptual phenomena we are interested in modeling correspond to temporal processes,
we will argue that our models should be similarly dynamic.
In particular, static
techniques seem poor approaches for understanding the intersensory and temporal
complexities of biological interpretative mechanisms.
4.2.2 Towards a Dynamic Model
Instead, our approach will be to view slices as perceptual state spaces, where the nodes
correspond to fixed points. Ambiguous nodes, like the yellow ones in our example, will
be treated as repellers in this space, and unambiguous nodes, such as the red and blue
ones, will be treated as attractors. Perceptual inputs will travel along trajectories defined
by these fixed points.
We visualize this state space view of slices in Figure 4.4 and Figure 4.5. The repellers (ambiguous
nodes) correspond to maxima; the attractors (unambiguous nodes) are the minima and are
surrounded by their basins of attraction. In this framework, perceptual interpretation will
loosely correspond to energy minimization; that is, we would like to move sensory inputs
into the basins that represent their correct classifications. In this way, the dynamics of
the system will perform the sensory classification for us.
Intersensory influence – the ability of one slice to modify perceptions in another – will be
realized by having slices induce vector fields upon one another, thereby cross-modally
modifying their temporal dynamics. This in turn can modify perceptual classifications by
leading sensory inputs into different basins of attraction (or perhaps by even introducing
bistability).
Figure 4.4 -- Visualizing slices as state spaces where the nodes are the fixed points. Repellers,
corresponding to ambiguous nodes, are maxima and shown in the center; attractors, corresponding to
unambiguous nodes, are minima and are outlined by their basins of attraction. It may be helpful to
compare this view to the one in Figure 4.3.
Figure 4.5 -- A view of the basins of attraction from below. This is a rotation of Figure 4.4 into the page to
help visualize the basins, which are partially occluded above. The basins here correspond to the blue and
red nodes in Figure 4.3.
To formulate a dynamic model, the first step is to incorporate some model of time into
our framework. We start by assuming that any perceivable event in the world persists for
some (perhaps variable) temporal duration, e.g., 10 – 1000 milliseconds. During this
interval, the event will be perceptually sampled, i.e., some finite number of sensory
inputs will be generated describing it. This assumption is quite reasonable from a
biological perspective. For example, this could correspond to a series of neuronal firings
in the striate cortex, during the period of time an object is visually observed (Hubel and
Wiesel 1962). Alternatively, in an artificial system, a stream of sensory inputs might be
generated by sliding a fast Fourier transform window over an auditory signal. Regardless
of how the features are selected and extracted, the important consideration here is that an
event in the world generates a stream of perceptual inputs, rather than a unitary data
point. We will use the time this provides to effect intersensory influence.
We also need to define the state within a slice, namely, the quantity that will be changing
during the lifetime of an individual perception. In our model, each slice will receive a
stream of sensory inputs when it is stimulated, as described above. After receiving some
number of inputs, a slice may eventually "decide" that a recognizable perceptual event
has occurred. In the interim, however, a slice maintains an estimate of what it might be
perceiving. The estimate corresponds to a point in ℝ^N that moves through the slice's
state space. At any given moment, three factors influence the trajectory of this point
through state space.
1) The current perceptual input. An estimate tends to move towards the current
input.
2) A gradient defined by the fixed points within a slice. An estimate moves towards
the attractors, which are the unambiguous nodes in a slice. The gradients for the
two slices we have been examining are displayed in Figure 4.6.
3) Any induced vector fields from co-occurring slices. This is visualized in Figure
4.7 for the ambiguous scenario described above. We discuss this in more detail
below.
[Figure 4.6 panels: Mode B, Mode A]
Figure 4.6 – Viewing innate state space dynamics. The nodes define a gradient in the state space. In the
absence of other influences, points move towards attractors (unambiguous nodes) and away from repellers
(ambiguous nodes). The generation of these gradients is discussed in § 4.3.1.1.
Figure 4.7 – Visualizing cross-modal influence through a vector field induced on Mode A by Mode B.
This field modifies the dynamic behavior in Mode A to favor Mode B's perceptual interpretation. This
"pulls" the estimate in Mode A towards the blue attractors, corresponding in this case to Mode B's
assessment of the world. We have added light red and blue shading in the background to recall the
perceptual categories contained in this slice.
4.2.3 Intersensory Influence
Biological perceptual systems share cross-modal information routinely and
opportunistically (Stein and Meredith 1993, Lewkowicz and Lickliter 1994, Rock 1997,
Shimojo and Shams 2001, Calvert et al. 2004, Spence and Driver 2004); we have
suggested that intersensory influence is an essential component of perception but one that
most artificial perceptual systems lack in any meaningful way. In our framework, cross-modal
influence is realized by having modes bias the temporal dynamics of other modes,
thereby sharing their views of the world.
Recall the ambiguous scenario described in § 4.2.1. We imagined that in Figure 4.3,
Mode B saw an input corresponding to a blue node. We asked how this might help
disambiguate a simultaneous event in Mode A. Our solution to this problem is illustrated
in Figure 4.7. We allow Mode B to induce a vector field on Mode A, thereby modifying
its dynamic behavior. This vector field incorporates Mode B's perceptual perspective
into Mode A.
This "pulls" the estimate in Mode A towards the blue attractors,
corresponding to Mode B's assessment of what is being perceived.
The "interlingua"
that makes this influence possible is their shared perceptual categories, which are
obtained through cross-modal clustering and provide a common frame of sensorial
reference.
Therefore, we see that the movement of the estimate within a slice is governed:
(1) externally, by events happening in the world; (2) innately, by the previously learned
structure of events within the slice; and (3) cross-modally, from the perspectives of other
perceptual channels viewing the same external events.
We can in fact adjust the
influences of these individual contributions dynamically.
For example, in noisy
conditions, an auditory slice may prefer to discount perceptual inputs and rely more
heavily on cross-modal inputs. These kinds of tradeoffs are common in biological
perception (Cherry 1953, Sumby and Pollack 1954). We discuss this further below.
4.3 Influence Networks
In our model of perception, an event in the world corresponds to a path taken through a
slice. In this section, we define both the rules that govern this path and how a slice
decides something has been perceived, a concept we have not encountered so far in this
chapter.
By formulating this problem dynamically, we lose one of the primary attractions of a
static framework – that we know with certainty something has "actually" occurred. In
static frameworks, this decision is usually made independently by some precursor to the
interpretative mechanism; in other words, the decision must be made but it is done
somewhere else. This usually corresponds to fixed sets of determination criteria within
the independent perceptual components.
In our framework, how far down its path must an input travel for a slice to know that an
event is actually occurring?
Presumably, a single sampled input is insufficient.
Generally, both natural and artificial perceptual systems have thresholds – they require
some minimum amount of stimulation to register an event in the world (Hughes 1946,
Fitzpatrick and McCloskey 1994). Rather than assume this happens externally, a slice
must make this determination itself, but it is not necessarily doing so alone.
This
decision itself is now a dynamic process open to intersensory influence – consistent
interpretations among slices will lower their sensory and temporal thresholds, whereas
inconsistent interpretations will raise them.
Cross-modal perception is known to
significantly improve upon unimodal response times in humans (Hershenson 1962, Frens
1995, Calvert et al. 2000), a phenomenon that is captured by our model.
We now proceed by defining the state-space view of slices that has been motivated
informally so far. State within each slice has two distinct components:
1) The position of its estimate. This is what the slice "thinks" it might be perceiving.
2) The activation potential of its nodes. Nodes have internal potentials, which are
increased whenever the perceptual estimate falls within them.
Below, we define the state space equations that govern how these quantities change
during the lifetime of a perception. Towards this, we begin with some preliminary
definitions covering the concepts raised earlier in this chapter.
Note that we will limit some of our presentation to 2-dimensional slices, because ℝ^2 is
easy to describe and a reasonable model for cortical representations. ℝ^3 is the largest
space in which we have constructed slices. Aspects of the presentation may generalize
to higher dimensional spaces, but that is an unexamined hypothesis. Also,
because cross-modal clustering connects regions in different slices, it lets us approximate
higher-dimensional manifolds. This may reduce the need for directly implementing
higher dimensional perceptual representations.
We note that the dynamic framework below is implemented in the examples in this thesis
by a discrete simulation that uses a 10 millisecond time step.
4.3.1 Preliminary Concepts
Consider a slice M_A ⊆ ℝ^n with associated codebook C_A = {p_1, p_2, ..., p_a}. Suppose M_A
is cross-modally clustered with respect to slice M_B into k perceptual categories. We call
the set of perceptual categories C = {c_1, ..., c_k}. To represent the assignment of codebooks
to categories, we define a multi-partitioning MP of M_A as a mapping
MP : C_A → 2^{1, 2, ..., k}. This assigns each codebook cluster to a subset of integers from 1 to
k, representing the k different perceptual categories discovered during cross-modal
clustering.
If a codebook cluster is in more than one category, then it is deemed
ambiguous, as shown in (4.1).
We define a node u ⊆ C A as a locally connected set of codebook clusters that are
members of the same perceptual categories; for any two clusters pb and pc in u,
MP( pb ) = MP( pc ) . We say u ∈ ci if node u is in category ci . We define the nodebook
N A as the set of all nodes in slice M A . Nodebooks are used to represent the connected
components in the output of cross-modal clustering.
The density of node u is the
percentage of sensory inputs falling within it over some assumed time window.
For node u_i, we define

    sign(u_i) = +1,  if |MP(u_i)| = 1  (attractor)
                −1,  if |MP(u_i)| > 1  (repeller)        (4.1)

which indicates whether u_i will be an attractor or a repeller in the state space, determined
by whether it is or is not ambiguous. Let S_A = [sign(u_1), sign(u_2), ..., sign(u_a)] be a vector
containing all signs within a slice, which captures the ambiguity within it.
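These definitions can be summarized in a few lines of Python; the field names below are illustrative, and the only substantive content is that a node records the categories assigned to it by cross-modal clustering.

    from dataclasses import dataclass

    @dataclass
    class Node:
        clusters: set         # codebook clusters forming this locally connected region
        categories: set       # MP(u): the perceptual categories that contain the node
        density: float = 0.0  # fraction of recent sensory inputs falling within the node

    def sign(node: Node) -> int:
        """Equation (4.1): unambiguous nodes attract, ambiguous nodes repel."""
        return 1 if len(node.categories) == 1 else -1

    def slice_signs(nodebook: list) -> list:
        """The vector S_A of signs, which captures the ambiguity within a slice."""
        return [sign(u) for u in nodebook]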
4.3.1.1 Defining a Surface
Towards defining a surface on slice M_A ⊆ ℝ^2, we first define a reference surface from
which it will be piecewise constructed. Let Z ( x, y ) = sin(π x) ⋅ sin(π y ) , corresponding to
the shape of our fixed points.
We will fit copies of Z onto M A's nodebook, towards building a surface over the entire
slice. For each node u_i ∈ N_A, let ⟨a, b, x_0, y_0, θ⟩_i be an ellipse fit onto the node's sensory
inputs via (Fitzgibbon et al. 1999). We then define a surface on node u_i,
Z_i = sign(u_i) · density(u_i) · Z(⟨a, b, x_0, y_0, θ⟩_i), where we stretch, translate, and rotate the
reference shape Z onto the node's descriptive ellipse. The term sign(u_i) determines
whether a node is a local maximum or minimum.
We define the innate surface Z A over the entire slice by piecewise summing the
contributions of the individual nodes, Z A = ∑ Z i . Although this surface is constructed
piecewise, we know it is smooth due to the reference surface Z being sine-based, and
therefore, it is continuously differentiable within the boundaries of the slice. This surface
is illustrated in Figure 4.4 and Figure 4.5, where the repellers correspond to local maxima
and the attractors correspond to local minima.
The gradient of this surface,

    ∇Z_A = ( ∂Z_A/∂x , ∂Z_A/∂y )        (4.2)

captures the innate dynamic behavior of the slice due to cross-modal clustering.
Two
examples of this are illustrated in Figure 4.6. We can view the effect of this gradient as
the slice trying to classify inputs to match its learned events. In other words, perceptual
grounding biases slices in favor of particular interpretations, namely, the ones derived
from cross-modal clustering. This may correspond with the "perceptual magnet effect,"
which reduces detectable differences between sensory inputs near previously acquired
categories (Kuhl 1991).
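A sketch of how the innate surface and its gradient might be evaluated numerically is given below. The reference bump is approximated by a radial cosine lobe over each node's ellipse rather than the exact sin(πx)·sin(πy) construction, and nodes are passed as plain parameter tuples; both choices are simplifications for illustration.

    import numpy as np

    def node_surface(xx, yy, node_sign, density, a, b, x0, y0, theta):
        """One node's contribution: a smooth bump fit to its descriptive ellipse,
        scaled by sign(u) * density(u)."""
        dx, dy = xx - x0, yy - y0
        # Rotate into the ellipse frame and normalize by the semi-axes.
        u = (np.cos(theta) * dx + np.sin(theta) * dy) / a
        v = (-np.sin(theta) * dx + np.cos(theta) * dy) / b
        rho = np.sqrt(u ** 2 + v ** 2)
        bump = np.where(rho <= 1.0, np.cos(np.pi * rho / 2.0), 0.0)
        return node_sign * density * bump

    def innate_surface_and_gradient(nodes, grid_x, grid_y):
        """Z_A as the piecewise sum of node surfaces, with its gradient (4.2)
        approximated by finite differences on the grid."""
        xx, yy = np.meshgrid(grid_x, grid_y)
        z_total = sum(node_surface(xx, yy, *params) for params in nodes)
        dz_dy, dz_dx = np.gradient(z_total, grid_y, grid_x)
        return z_total, (dz_dx, dz_dy)

    # nodes = [(sign, density, a, b, x0, y0, theta), ...] for each node in the slice.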
An inefficiency introduced by representing data in Euclidean Voronoi regions is that
sometimes space is "wasted," particularly around the boundaries. We can see this in
Figure 4.2, where the data points are fairly sparse. We note this is a synthetic data set
with only two events, so the sparsity is exaggerated here. Nonetheless, we would prefer a
"warped" representation, where unused areas of the map are reduced in favor of
expanding or stretching the areas that are more perceptually prominent. It is clear that
cortical maps are continuously modified in animals, where regions are allocated
proportionally with their use (Buonomano and Merzenich 1998, Kaas 2000).
For
example, it has been found that the fingers of the left hand in right-handed violinists have
increased cortical representation (Elbert et al. 1995), corresponding to their development
of fine motor control during fingering.
Because our implementation does not support spatial plasticity, there may be sections of
Voronoi regions which are generally devoid of inputs. However, over time, we expect
that inputs will fall within them, albeit somewhat infrequently. To account for these stray
inputs, we induce a weak, constant magnitude vector field over the surface of each slice,
specifically for the benefit of these "empty" regions. The field within each node ui
points towards or away from its fixed point, depending upon sign(ui ) . This reflects each
node's innate perceptual bias throughout its entire Voronoi region.
4.3.1.2 Hebbian Gradients
Towards defining cross-modal influence, we parameterize the approach above. Consider
a slice M_A ⊆ ℝ^n with associated nodebook N_A = {u_1, u_2, ..., u_a}. Let S = {s_1, s_2, ..., s_a},
where each si ∈ [−1, 1] . We call S a sign set and use it to provide signs for each node in
N A , i.e., whether it attracts or repels. Although signs were previously restricted to be
either -1 or 1, a continuous range here reflects confidence in the assignments.
We define:

    Z_i(S) = S(u_i) · density(u_i) · Z(⟨a, b, x_0, y_0, θ⟩_i)        (4.3)

    Z_A(S) = Σ_i Z_i(S)        (4.4)

    ∇H(S) = ( ∂Z_A(S)/∂x , ∂Z_A(S)/∂y )        (4.5)
We call ∇H ( S ) the Hebbian gradient induced on M A by sign set S. Slices will induce
vector fields derived from Hebbian gradients on one another to create cross-modal
influence. The influence is effected by assigning values in sign set S that correspond to
their perceptual outlooks. This is possible because cross-modal clustering provides slices
with a common frame of reference for making these assignments. Figure 4.7 contains an
example of such a field.
4.3.1.3 Activation Sets
Finally, we define the notion of an activation set {e1 , e2 ,..., e j } , where 0 ≤ ei ≤ 1 .
Activation sets will be used to represent activation potentials in our state space models,
when we define leaky integrate and fire networks. The activation potential for each node
corresponds to a temporal integration of its inputs over some time period, as detailed in
section 4.3.3. In anticipation of applying Hebbian gradients in the next section, we show
here how to project the activation potentials in one mode onto another, by using their
shared categories as a common frame of reference.
For mode M A , let N A = {v1 , v2 ,..., va } be its nodebook and let N A contain k distinct
categories, as determined by cross-modal clustering with some mode M B . We call this
set of perceptual categories C = {c1 ,..., ck } .
Let E A = {e1 , e2 ,..., ea } be an activation set on M A . We define the activation set on C
derived from E A as Φ = {φ1 , φ2 ,..., φk } , where φi is simply the sum of the activation
potentials of the individual nodes within category i :
    φ_i = Σ_{v ∈ c_i} e_v        (4.6)
We will refer to the mapping in (4.6) as Φ( E A ) , which projects nodebook activations
onto category activations.
Let Φ = {φ1 , φ2 ,..., φk } be an activation set on the categories in C. We define an activation
set E_A = {e_1, e_2, ..., e_a} on a mode M_A derived from Φ by reversing the above process:

    e_i = Σ_{c_j ∈ MP(v_i)} φ_j        (4.7)
We will refer to the mapping in (4.7) as E A (Φ ) , which projects category activations onto
nodebook activations.
Being able to move back and forth between activations in nodebooks and categories is
quite useful.
It allows us to share state between two different slices, using their
categories as an interlingua, and thereby define the Hebbian gradients used in
intersensory influence.
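The two projections are simple sums; the sketch below assumes each node carries the set of category indices it belongs to (an illustrative data layout, not the thesis implementation).

    def project_to_categories(node_potentials, node_categories, num_categories):
        """Equation (4.6): a category's potential is the sum of the potentials of its nodes."""
        phi = [0.0] * num_categories
        for e, cats in zip(node_potentials, node_categories):
            for c in cats:
                phi[c] += e
        return phi

    def project_to_nodes(phi, node_categories):
        """Equation (4.7): a node's potential is the sum of the potentials of its categories."""
        return [sum(phi[c] for c in cats) for cats in node_categories]

    # Example: node_categories = [{0}, {1}, {0, 1}] describes two unambiguous nodes
    # and one ambiguous node shared between categories 0 and 1.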
4.3.2 Perceptual Trajectories
During the time window corresponding to a sensory event, a slice M_A ⊆ ℝ^N integrates its
sensory inputs into a single estimate h ∈ ℝ^N that models what it is in the midst of
perceiving.
This estimate changes over time, and its movement is governed:
(1) externally, by events happening in the world; (2) innately, by the previously learned
structure of events within the slice; and (3) cross-modally, from the perspectives of other
perceptual channels viewing the same external events.
We now define the state space dynamics due to each of these components.
(1) Events in the world reach the slice through a stream of sensory inputs. (We discuss
integration of slices into perceptual pipelines later in this chapter.)
We call this input
stream I = ⟨d_1, t_1⟩, ⟨d_2, t_2⟩, ..., ⟨d_k, t_k⟩, where input d_i ∈ ℝ^N arrives at time t_i.
Let ΔI i = (di − di −1 ) , which is a vector measuring the difference between successive
inputs. Let δWA (ti ) be a unit impulse function that is nonzero if an input arrives at M A at
time ti .
We define the change in the estimate due to events in the world as:
W (t ) = δWA (ti ) ⋅ ΔI i
(4.8)
W (t ) provides the gradient of the input change at time t, assuming an input occurred.
Otherwise, it is zero. The "W" in this function reminds us it is due to changes in the world.
In the absence of other factors, ∫ W (t ) dt tracks the inputs exactly.
(2) The learned category structure within each slice also influences its perceptual inputs.
In equation (4.4), we defined a parameterized surface Z A ( S ) constructed over slice M A .
The sign values in set S determine the heights of the fixed points corresponding to the
nodes; these range from -1 to 1, corresponding to repellers or attractors respectively.
Values within this range reduce the strength of the corresponding gradient.
Recall vector S A = {sign(u1 ), sign(u 2 ),...,sign(ua )} , which captures the ambiguity of M A
and is used to define the innate surface Z A .
Let At ( M A ) be a vector of activation potentials for M A at time t, At ( M A ) = [e1 , e2 ,..., ea ] ,
where 0 ≤ ei ≤ 1 . (This was defined above in § 4.3.1.3 and receives a more detailed
treatment in § 4.3.4.)
We are going to scale S A by At ( M A ) to generate a gradient corresponding to the innate
dynamics of slice M A . Let S A (t ) = {e1 ⋅ sign(u1 ), e2 ⋅ sign(u2 ),..., ea ⋅ sign(ua )} .
We define:
    I(t) = ( ∇Z_A( S_A(t) ) )(h)        (4.9)
This is the component of the estimate's change that is due to a slice's preference to favor
previously learned categories. This influence increases as the activation potentials in the
nodes increase; the intuition here is that the nodes become increasingly "sticky" as they
grow closer to generating a perception. We refer to this influence as the slice's
innate dynamics, and the "I" in this function is a mnemonic for "innate."
(3) To define the component of state space dynamics due to intersensory influence, let
M B be a slice that has been cross-modally clustered with M A .
We are going to use M B to construct a Hebbian gradient ∇H ( S ) on M A . To do this, we
need to determine a sign set S that reflects M B 's perceptual preferences. As we touched
upon at the beginning of this section, we will be defining an activation potential model on
the nodebooks in each slice. We are going to use this model to determine the sign set for
calculating ∇H ( S ) .
Let A_t(M_B) be the set of activation potentials for M_B. Defining the sign set S is a two-step
process. First, we use A_t(M_B) to determine the potentials Φ of the mutual
categories C between M_A and M_B. Then, we use Φ to determine signs for the nodes in
M_A.
The simplest definition of S = {e1 , e2 ,..., ea } is:
S = E A ( Φ ( At ( M B ) ) )
(4.10)
This projects the activation potentials in M B directly onto M A , by using C as a common
reference. Because ei ≥ 0 , this means M B can only induce a vector field corresponding
to attractors on M A , per equation (4.1). Thus, it can encourage the recognition of certain
categories but it cannot discourage the recognition of others.
We can imagine a less permissive strategy, where some nodes are encouraged and some
discouraged, as in:
    S = E_A( 2 tan^{-1}( α · ( Φ(A_t(M_B)) − mean(Φ(A_t(M_B))) ) ) / π ),  α ≥ 2        (4.11)
Here, nodes corresponding to categories with above mean potentials are treated as
attractors and those with below mean potentials are treated as repellers.
Finally, let us consider the scenario where M B determines what it is perceiving before
M A does. In the model defined below, this will correspond to max(Φ ) ≥ 1 − ε . Namely,
the activation potential of some category reaches a threshold value, at least from M B 's
perspective, which "resets" the potentials in all other categories. Perhaps in this case we
would like to discourage all of the non-recognized categories in M A . We can do that by
substituting:
    S = E_A( 2 tan^{-1}( α · Φ(A_t(M_B)) ) / π ),  for some large α        (4.12)
In practice, we use (4.11) and dynamically substitute in (4.12) when a co-occurring node
fires.
We can now define the change in the estimate due to intersensory influence as:
    C(t) = ( ∇H(S) )(h)        (4.13)
The "C" here is a reminder this represents cross-modal influence.
4.3.3 Perceptual Dynamics
We can now define the state space equation for estimate h:
    dh/dt = α·W(t) + β·I(t) + γ·C(t)
          = α·δWA(t_i)·ΔI_i(h) + β·( ∇Z_A(S_A(t)) )(h) + γ·( ∇H(S) )(h)        (4.14)
which combines the dynamic effects from the world (W), its innate structure (I), and
cross-modal influence (C), where S is defined as above. We define h = d1 at time t1 .
How do we set the constants α, β, and γ? By default, we set α = 1 so that each slice
"receives" its full input stream. We also set β + γ = 1 so that by the time each slice is
ready to generate a percept, its innate structure and cross-modal influences are as
important as its inputs in determining which category is recognized. In particular, we set
γ > β , so that intersensory influences dominate innate ones. The reason for this is that
we suppose the world is already enforcing the innate structure of events in generating
them.
From the perspectives of both Gibson (1950) and Brooks (1987), dynamically
reinforcing the world's structure within a slice is redundant. Nonetheless, we view it as
an important component of interpretative stability. While it may not be necessary from a
theoretical Gibsonian perspective, incorporating innate perceptual dynamics seems to
have clear computational advantages in real world conditions. It stabilizes sensory inputs
by biasing perceptual interpretation to favor events the slice has previously learned.
Ideally, these parameters would be adjusted dynamically. For example, in noisy
conditions, an auditory slice may prefer to discount its perceptual inputs in favor of a
visual cross-modal one, realizing what is known as the cocktail party effect (Cherry 1953,
Sumby and Pollack 1954); raising γ and lowering α in (4.14) would have this effect.
This condition is likely detectable, particularly where we suppose one slice is not
registering events and another co-occurring one is. Although we have not implemented
an automatic mechanism for dynamically balancing these parameters, doing so does not
seem implausible (e.g., according to the maximum-likelihood estimation approach of Ernst
and Banks, 2002).
4.3.4 An Activation Potential Model
Let us examine where we stand at this point. We have defined how to calculate the path
of a slice's perceptual estimate within its state space. However, we have not yet specified
how these path dynamics generate a percept. We now return to the question posed earlier:
how far down its path must an input travel for a slice to know that an event has occurred?
There is a vast body of literature on perceptual thresholds tracing back to the seminal
work of Weber and Fechner in the 19th century. Of particular interest here is the notion
of temporal thresholds – that sensory stimuli are temporally integrated to generate
perceptions (Hughes 1946, Green 1960, Watson 1986, Sussman et al. 1999).
We will
use this to motivate a leaky integrate and fire network (Gerstner and Kistler 2002) over
the nodes within a slice. Each node is modeled as a leaky integrator; when it fires, a
perception corresponding to its category is generated.
For each node ui in the nodebook of slice M A , we refer to its activation potential at time
t as At (ui ) , which ranges between 0 and 1. If At (ui ) ≥ 1 − ε , we say the node fires, and
the quantity 1 − ε is called its firing threshold.
We define a mapping function V : R N → N A , which assigns a point to its nearest node in
nodebook N A . Namely, V describes the Voronoi regions over the nodebook. For each
node ui , we associate a function ui ( p ) where ui ( p) = 1 if and only if V ( p) = ui .
Otherwise, it is equal to zero. When ui ( p) = 1 , we say ui ( p ) is active.
A node's activation increases when the perceptual estimate falls within it. We define the
change in potential due to direct activation:
$D(t) = \frac{1}{\eta}\, \delta W_A(t) \cdot u_i\big(h(t)\big)$    (4.15)
The variable η determines the temporal threshold for the category represented by ui . It
answers the question: how many times must the nodes in a single category be active for
that category to be perceived? Although much has been written documenting temporal
thresholds, it is not always clear how (or if) they are acquired. In our framework, we
make the following fairly weak assumption: during development, a sufficient number of
unambiguous sensory inputs are observed such that we can derive (e.g., 90%) confidence
intervals on their lengths. Then for some confidence interval, [tmin , tmax ] , we take the
lower bound tmin as the temporal threshold. If the inputs are sampled at λ Hz, then we
define η = λ ⋅ tmin . We note there is some evidence for simplified (unambiguous) inputs
during development in people, where parents exaggerate inflection and pitch in speech
(in what is traditionally known as "motherese"), presumably to help infants analyze
sounds (Garnicia 1977).
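A small sketch of how $\eta$ might be derived from a set of observed, unambiguous event durations. The use of the empirical 5th and 95th percentiles as the confidence interval is an assumption of this sketch; the thesis does not commit to a particular estimator. (The upper bound $t_{max}$ reappears below as the leakage time constant $\tau$.)

```python
import numpy as np

def temporal_threshold(durations_sec, sample_rate_hz):
    # 90% empirical interval on observed event lengths (illustrative choice)
    t_min, t_max = np.percentile(durations_sec, [5.0, 95.0])
    eta = sample_rate_hz * t_min   # eta = lambda * t_min, per the text
    tau = t_max                    # later used as the leakage constant tau
    return eta, tau
```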
To this, we also add a leakage term. Leakage provides that in the absence of stimulation,
each node drifts to its rest value, which we uniformly define as zero in this model. This
is necessary to counter the effect of noise in the perceptual inputs and of spurious
intersensory activations. We define the leakage at time t:
$L(t) = \exp\big(-(t - t_0)/\tau\big)$    (4.16)
where time constant τ is equal to tmax in the confidence interval above and t0 is the most
recent time the node was active.
Therefore, for each node ui , we have:
$\frac{dA_t(u_i)}{dt} = D(t) + L(t) = \frac{1}{\eta}\, \delta W_A(t) \cdot u_i\big(h(t)\big) - \frac{1}{\tau} \exp\big(-(t - t_0)/\tau\big)$    (4.17)
We thereby define a system of N A simultaneous state equations for mode M A .
Based on these node activation potentials, we derive activation potentials for perceptual
categories:
$A_t(c_i) = \sum_{u_j \in c_i} A_t(u_j)$    (4.18)
That is, the activation potential of a category is the sum of the activation potentials of its
nodes. If At (ci ) ≥ 1 − ε , then we say that category ci has fired and the slice outputs this
category as its perception. At this point, the activation potentials for all nodes are reset to
zero. Assuming we do not implement a refractory period, the node is free to await its
next input. In practice, we set ε = 10−4 to account for numerical rounding errors during
our simulation.
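The sketch below gathers (4.15)-(4.18) into a single leaky integrate-and-fire update for one slice, using simple Euler integration. The class layout, the node and category representation, and the decision to apply leakage only to inactive nodes are simplifying assumptions of this sketch, not details fixed by the text.

```python
import numpy as np

class LeakySlice:
    """Minimal sketch of the activation-potential model in (4.15)-(4.18).

    nodes_xy   : (N, d) array of nodebook positions
    categories : list of index arrays, one per perceptual category
    eta, tau   : temporal threshold and leakage constants (see text)
    """
    def __init__(self, nodes_xy, categories, eta, tau, eps=1e-4):
        self.nodes = np.asarray(nodes_xy, dtype=float)
        self.categories = categories
        self.eta, self.tau, self.eps = eta, tau, eps
        self.A = np.zeros(len(self.nodes))             # activation potentials
        self.last_active = np.zeros(len(self.nodes))   # t_0 for each node

    def update(self, h, t, dt, dW=1.0):
        # V(h): index of the nearest node (Voronoi assignment)
        active = int(np.argmin(np.linalg.norm(self.nodes - h, axis=1)))
        # direct activation (4.15) for the active node only
        self.A[active] += dt * dW / self.eta
        self.last_active[active] = t
        # leakage toward the rest value zero for the other nodes (cf. (4.16)-(4.17))
        idle = np.arange(len(self.A)) != active
        self.A[idle] -= dt * np.exp(-(t - self.last_active[idle]) / self.tau) / self.tau
        self.A = np.clip(self.A, 0.0, 1.0)
        # category firing (4.18): sum node potentials within each category
        for ci, members in enumerate(self.categories):
            if self.A[members].sum() >= 1.0 - self.eps:
                self.A[:] = 0.0     # reset all potentials after a percept
                return ci           # the slice outputs this category
        return None
```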
4.3.4.1 Hebbian Activations
Finally, we note an important variation on this model by allowing activation potentials to
spread among nodes in different slices.
In this way, we can incorporate Hebbian
influence directly into activation potentials and thereby reduce the temporal thresholds
for perceptual events.
For At ( M B ) , M B 's activation potentials at time t, let:
$S = E_A\big(f(\Phi(A_t(M_B)))\big) = \{e_1, e_2, \ldots, e_a\}$    (4.19)
be the activation set projected onto mode M A from M B via (4.11). We define the
Hebbian activation potential H t (ui ) induced on node ui at time t:
$H_t(u_i) = \frac{1}{\eta_b}\, \delta W_B(t) \cdot e_i$    (4.20)
where ηb is defined with respect to M B . This allows M A to directly incorporate M B 's
activation potentials into its own, yielding new state space equations for each ui :
$\frac{dA_t(u_i)}{dt} = D(t) + \kappa H_t(u_i) + L(t) = D(t) + \frac{\kappa}{\eta_b}\, \delta W_B(t) \cdot e_i + L(t)$    (4.21)
We fully expand (4.21) to show the contribution of $A_t(M_B)$, the activation potentials in $M_B$:

$\frac{dA_t(u_i)}{dt} = D(t) + \left[\frac{\kappa}{\eta_b}\, \delta W_B(t) \cdot E_A\big(f(\Phi(A_t(M_B)))\big)\right]_i + L(t)$    (4.22)
The system defined by (4.22) reduces the effective temporal thresholds in perceptual
systems by spreading activation potentials between their nodes. We vary κ between 0
and 1 to adjust this cross-modal sensory acceleration, in proportion to the observed
agreement between slices. Cross-modal perception is known to significantly improve
upon unimodal response times in humans (Hershenson 1962, Frens 1995, Calvert et al.
2000); this is captured by our model here.
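As a short sketch of the Hebbian term added in (4.20)-(4.21): the projected activation set from $M_B$ is scaled by $\kappa$ and by $M_B$'s temporal threshold, then folded into $M_A$'s node potentials. The inputs `e` and `dW_B` are assumed to come from the cross-modal machinery defined above.

```python
import numpy as np

def hebbian_contribution(e, dW_B, eta_b, kappa, dt):
    # Amount added to each A_t(u_i) in M_A over one time step dt,
    # per the kappa * H_t(u_i) term of (4.21)
    return dt * kappa * dW_B * np.asarray(e, dtype=float) / eta_b

# usage sketch: slice_A.A += hebbian_contribution(e, dW_B, eta_b, kappa, dt)
```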
4.4 Summary
In this chapter, we have introduced a new family of models called influence networks,
which incorporate temporal dynamics into a perceptual framework. These networks fuse
together information from the outside world, innate perceptual structure, and intersensory
influence to increase perceptual accuracy within slices. The cross-modal aspect of this is
illustrated in Figure 4.8, where auditory formant data influences the visual perception of
lip contours by inducing a Hebbian gradient on it. This has the effect of “moving” the
visual sensation to be in closer accord with the auditory perception.
Influence networks incorporate several basic biological phenomena into a computational
model, including: cross-modal influence; dynamic adjustment of sensory and temporal
thresholds; cross-modal substitution; and bistability. We will show in Chapter 6 that
while these are fundamental perceptual features in biological systems, they largely tend
to be absent in artificial ones. Influence networks are a step toward creating more
capable artificial perceptual systems by providing them with a more realistic sensorial
framework. Figure 4.9 illustrates how an influence network can be incorporated into a
preexisting artificial perceptual system to enable a new type of cross-modal influence,
designed to increase overall perceptual consistency.

Figure 4.8 -- Cross-modal activations in an influence network. The auditory perception in formant space
on top increases activations in the lip contour space on the bottom. This example is unusual because it
shows an auditory modality influencing a visual one, which is biologically realistic but unusual in an
artificial perceptual system. The effect of this influence is that the visual perception is modified due to an
induced Hebbian gradient. Shading here corresponds to activation potential levels.

Figure 4.9 – Adding an influence network to two preexisting systems. We start in (a) with two pipelined
networks that independently compute separate functions. In (b), we incorporate an influence network into
this architecture that interconnects the functional components of the pipelines and enables them to
dynamically modify their percepts.
Chapter 5
Sensorimotor Learning
Up to this point, we have been concerned with learning to recognize events in the world.
We now turn to the complementary problem of learning to generate events in the world.
That these two problems are interrelated is well established – animal behaviors are
frequently learned through observation, particularly in vertebrates (Bloedel et al. 1996).
In this chapter, we will propose a computational architecture for acquiring intentional
motor control guided by sensory perception.
Our approach addresses two basic
questions:
1) What is the role of perceptual grounding in learning motor activity? In other
words, how does the categorization of sensory events assist in the acquisition of
voluntary motor behaviors?
2) Can motor systems internally reuse perceptual mechanisms? Specifically, we
examine the possibility that the perceptual framework presented in Chapter 3 can
be applied to learning motor coordination. This is the most important issue
addressed in this chapter, because it generalizes to suggest how higher level
cognitive structures may be iteratively bootstrapped off lower level perceptual
inputs. In doing so, it suggests a framework for realizing the embodied cognition
approaches of Brooks (1991a), Lakoff (1987), and Mataric (1997).
Towards answering these questions, we present a computational architecture for
sensorimotor learning, where an animal (or a machine) acquires control over its motor
systems by observing the effects of its own actions. Sensory feedback can both initially
guide juvenile development and then subsequently refine adult motor activity. This type
of self-supervised learning is thought to be among the most powerful developmental
mechanisms available both to natural creatures (Thorndike 1898, Piaget 1971, Hall and
Moschovakis 2004) and to artificial ones (Wiener 1948, Maes and Brooks 1990).
Our approach is to reapply the framework in Chapter 3.
We will treat the motor
component of sensorimotor learning as if it were a perceptual problem. This is surprising
because one might suppose that motor activity is fundamentally different than perception.
However, we take the perspective that motor control can be seen as perception
backwards. We imagine that – in a notion reminiscent of a Cartesian theater – an animal
can "watch" the activity in its own motor cortex, as if it were a privileged form of
internal perception. Then for any motor act, there are two associated perceptions – the
internal one describing the generation of the act and the external one describing the self-observation of the act. The perceptual grounding framework described in Chapter 3 can
then cross-modally ground these internal and external perceptions with respect to one
another. The insight behind this approach is that a system can develop motor control by
learning to generate the events it has previously acquired through perceptual grounding.
A benefit of this framework is that it can learn imitation, a fundamental form of
biological behavioral learning (Byrne and Russon 1998, Meltzoff and Prinz 2002). In
imitative behaviors – sometimes known as mimicry – an animal acquires the ability to
reproduce some aspect of another's activity, constrained by the capabilities and dynamics
of its own sensory and motor systems. This is widespread in the animal kingdom (Galef
1988) and is thought to be among the primary enablers for creating self-supervised
intelligent machines (Schaal 1999, Dautenhahn and Nehaniv 2002).
We will demonstrate sensorimotor learning in this framework with an artificial system
that learns to sing like a zebra finch. Our system first listens to the song of an adult finch;
it cross-modally clusters this input to learn songemes, primitive units of bird song that we
propose as an avian equivalent of phonemes. It then uses a vocalization synthesizer to
generate its own nascent birdsong, guided by random exploratory motor behavior. The
motor parameters describing this exploratory vocal behavior are fed into motor slices – as
if they corresponded to external perceptual inputs. By simultaneously listening to itself
sing, the system organizes these motor slices by cross-modally clustering them with
respect to the previously learned songeme slices. During this process, the fact that the
motor data were derived internally from innate exploratory behaviors, rather than from
external perceptual events, is irrelevant. By treating the motor data as if they were
derived perceptually, the system thereby learns to reproduce the same sounds to which it
was previously exposed.
This approach is modeled on the dynamics of how male
juvenile finches learn birdsong from their fathers (Tchernichovski et al. 2004, Fee et al.
2004).
The model presented here is inspired by sensorimotor learning in animals, and it shares a
number of features with several prominent approaches to modeling biological
sensorimotor integration (e.g., Massone and Bizzi 1990, Stein and Meredith 1994,
Wolpert et al. 1995). However, it is intended as an abstract computational model; as
such, it does not approach the richness or complexity inherent in biological sensorimotor
systems.
Our goal in this chapter is to demonstrate that motor learning can be
accomplished through iterative perceptual grounding. In other words, we show that
perceptual and motor learning are unexpectedly similar processes and can be achieved
within a common mathematical framework. This surprising result also suggests an
approach to grounding higher level cognitive development, by iteratively reapplying this
technique of internal perception. We discuss these issues below and will examine them
again in Chapter 6.
5.1 A Sensorimotor Architecture
We begin by examining abstract models of innate sensory and motor processing in
isolation. Initially, the isolated sensory system simply categorizes the different events to
which it is exposed, using cross-modal clustering. One may view this as an unsupervised
learning phase, in which perceptual categories are acquired through passive observation.
Subsequently, the isolated motor system generates innately specified behaviors using an
open-loop control system (Prochazka 1993).
In other words, it receives no initial
feedback. To enable sensorimotor learning, we must close this loop by interconnecting
these isolated systems.
For this purpose, we will reuse the perceptual machinery of Chapter 3. Specifically, we
introduce the notion of internal perception, which allows a system to watch the
generation of its internal motor parameters as if they were coming from the outside
world. By treating the motor parameters as perceptual inputs, they can be cross-modally
clustered, regardless of their internal origins. The system thereby learns to generate
events it has previously learned how to recognize, by associating motor parameters with
their observed effects.
5.1.1 A Model of Sensory Perception
Our framework begins with the model of afferent sensory perception outlined in Figure
5.1, which schematically diagrams an abstract computational sensory cortex. In this
model, external events in the world impinge upon sensory organs. These receptors in
turn generate perceptual inputs, which feed into specialized perceptual processing
channels. A primary outcome of this processing is the extraction of descriptive features
(e.g., Muller and Leppelsack 1985, Hubel 1995), which capture abstracted sensory detail.
This process occurs in parallel within multiple sensory pathways, as illustrated on the
right of Figure 5.1.

Figure 5.1 – An abstract model of sensory processing in our framework. A schematic view is shown on
the left, which is expanded upon in the example on the right. Events in the world are detected by sensory
organs, here labeled A and V, representing auditory and visual receptors. These are fed into processing
pipelines shown here by the composition of functional units. The features extracted from these pipelines
are fed into slices, which are then cross-modally clustered with respect to one another. We discuss this
model further in the text.

This hypothetical example shows auditory and visual receptors that
provide inputs to their respective perceptual pathways. These channels extract features
from their perceptual input streams, which are fed into the slices displayed on top. These
slices are then cross-modally clustered with respect to one another, as described in
Chapter 3.
Later in the thesis, we reexamine the structure of these sensory pipelines. In biological
systems, sensory channels are highly interconnected and display complex temporal
dynamics; we will modify our perceptual model to reflect that. In Chapter 6, we examine
the biological implausibility of assuming independence between perceptual channels.
However, the sensorimotor learning in this chapter assumes only that slices can be cross-modally clustered, an assumption which remains valid in the subsequent elaborations
later in this thesis.
5.1.2 A Simple Model of Innate Motor Activity
We now present an abstract model of innate efferent motor activity, which is sometimes
called reflexive behavior. It is well established that young animals engage in a range of
involuntary motor activities; much of this appears to facilitate the acquisition of cognitive
and motor functions, leading to the development of voluntary, intentional behaviors
(Pierce and Cheney 2003, Chapter 3). Behavioral learning is therefore not a passive
phenomenon; instead, it is often guided by phylogenetically "programmed" activities that
have been specifically selected to satisfy the idiosyncratic developmental requirements of
an individual species (Tinbergen 1951).
An abstract model describing the generation of innate efferent motor activity is shown in
Figure 5.2. In a sense, this model is the reverse of the one displayed in Figure 5.1.
Instead of the outside world generating events, we assume an innate generative
mechanism stimulates a motor control center. This in turn evokes coordinated activity in
a muscle or effector system, leading to the generation of an external event in the world.
In our model, the innate specification of developmental behaviors is represented by a
joint probability distribution over a set of parameters governing motor activity. This is
motivated by Keele and Summers (1976), where a motor program is described by a
descriptive parameterization. Alternatively, one could assume the existence of a set of
deterministic motor schemas, corresponding to predetermined patterns of activity (Arbib
1985). From the perspective of our model, this distinction makes little difference; we
simply assume some mechanism (or set thereof) is responsible for producing the innate
behaviors that will eventually generate feedback for sensorimotor learning.
To give a clearer sense of this process, we examine the diagram on the right in Figure 5.2.
This presents an example of human vocal articulation, motivated by (Rubin et al. 1981).
Although we will focus primarily on avian vocalization later in this chapter, first
examining human articulation has strong intuitive and didactic appeal, and it sets the
stage for discussing birdsong generation below.

Figure 5.2 – An abstract model of innate motor activity. A schematic view is shown on the left, which is
expanded upon in the example on the right. On the bottom right is a model of human vocal articulation.
This is parameterized by articulator positions at the lips (L), tongue tip (T), jaw (J), tongue center (C),
velum (V), and hyoid (H). Motor control corresponds to a set of state equations on the left governing the
constrained movement of these articulators over some time period. Parameters describing this movement
are selected from some assumed innate distribution on the top right. The selection of parameters in this
model is based on (Rubin et al. 1981).

In this example, a set of parameters are
selected from some innate distribution DInnate ; these parameters describe the trajectories
of six primary human vocal articulators, as illustrated in the figure. The selection of
parameters thereby corresponds to a primitive speech act. The actual movements of the
articulators during a speech act are modeled by a set of partial differential equations,
determined by the biomechanical constraints on the musculoskeletal apparatus of the
human vocal tract. In other words, these equations model the physical characteristics and
limitations of human vocal production independently of what is being said. Together, the
parameters and the state equations lead to the generation of a speech act. In this domain,
learning how to produce meaningful sounds corresponds to selecting parameters that
appropriately generate them, given the physical and dynamic constraints of human
vocalization.
5.1.3 An Integrated Architecture
Towards sensorimotor learning, we now interconnect these isolated sensory and motor
systems. To do this, we introduce the notion of internal perception, which allows a
system to "watch" the generation of its internal motor parameters as if they were coming
from the outside world. Thus, we will create motor slices that are populated with
behavioral data, in exactly the same way we created perceptual slices, which were
populated with sensory data. The resulting slices do not "know" if their data were
generated internally or externally, and for the purposes of cross-modal clustering, it
makes no difference.
We can thereby learn motor categories that correspond to
previously acquired perceptual categories.
In our model, internal perception occurs through the addition of a Cartesian theater
(Dennett 1991), so named because it provides a platform for internal observation.
Pursuing this philosophical metaphor a bit further, the homunculus in our theater will be
replaced by cross-modal clustering. As we saw in Chapter 3, this is an unsupervised
learning technique. We may therefore employ the notion of a Cartesian theater without
engendering the associated dualistic criticisms of Ryle (1941) or Dennett (1991). In fact,
we will argue that internal perception is a useful framework for higher level cognitive
bootstrapping, where cross-modal clustering replaces a homunculus and any notions of
"intentionality" (Dennett and Haugeland 1991) are attributed to innate phylogenetic
structures and tendencies. In other words, we will keep the theater but eliminate the
metaphysical audience.

Figure 5.3 – An integrated sensorimotor framework. We connect the isolated sensory and motor systems
with the addition of a Cartesian theater (G), which receives data via (1), corresponding to innate
exploratory behaviors generated in (D). These data are fed into motor slices (H) via (2). These exploratory
behaviors also trigger motor activity via the efferent pathways in (E) and (F). Most importantly, the system
is able to perceive its own actions, as shown by (3). These inputs feed into the afferent sensory system,
where features are extracted and fed into perceptual slices (C). We thereby learn the Hebbian linkages
between the codebook clusters in perceptual slices (C) and motor slices (H), which describe the generation
of these perceptions. In the final step, we cross-modally cluster the motor slices (H) with respect to the
perceptual slices (C); we thereby learn the motor categories that generate the sensory categories previously
acquired when the system was perceptually grounded.
Our integrated sensorimotor framework is shown in Figure 5.3. We briefly outline this
architecture and then examine the stages of sensorimotor learning in detail below. In the
above diagram, we see the independent sensory and motor components described above
on the left and right respectively. In addition, there is now a Cartesian theater (G), which
receives inputs corresponding to innate exploratory behaviors generated by (D). These
internally produced data are fed to motor slices (H), which are thereby populated with
behavioral rather than perceptual data. These data are also simultaneously fed into a
motor control system (E), which leads to the generation of perceivable events in the
external world.
Most importantly, the system observes its own actions.
Innately generated events
impinge upon the sensory organs and are fed into the sensory apparatus on the left.
Features extracted from these data are fed into sensory slices (C). This process thereby
creates Hebbian linkages between the sensory slices (C) and the motor slices (H).
We point out that slices are what may be deemed agnostic data structures – they neither
"know" nor "care" what type of data they contain. We can therefore cross-modally
cluster the motor slices (H), based on the categories acquired during the perceptual
grounding of the sensory slices (C). Note that this is a one-way process. In other words,
we fix the sensory categories and only cluster the motor data. We thereby learn motor
categories that correspond to previously acquired perceptual categories. We discuss
relaxing this one-way restriction below, to allow increasing motor sophistication to assist
in restructuring perceptual categories.
This type of perceptual refinement as a
consequence of fine motor development has been observed in humans, particularly in
musicians (e.g., Ohnishi et al. 2001).
[Figure 5.4 panels: i) Parental Training; ii) Internal Self-Observation; iii) External Self-Observation;
iv) Cross-Modal Clustering; v) Acquisition of Voluntary Motor Control; vi) Continued Motor Refinement.]
Figure 5.4 – Developmental stages in our model. i) The juvenile acquires perceptual structures from its
parent. ii) Motor acts are observed internally through a Cartesian Theater. iii) The effects of motor acts
are observed externally through perceptual channels. iv) Motor slices are cross-modally clustered with
respect to perceptual slices. The juvenile thereby learns how to generate the events it learned in stage (i).
v) Random exploratory behaviors are disconnected and motor slices take over the generation of motor
activity. The juvenile is now able to intentionally generate the sensory events acquired from its parent.
vi) Internal perception can be used subsequently in non-juveniles to refine motor control.
5.2 Stages of Sensorimotor Development
We now examine the developmental stages of our framework, as illustrated in Figure 5.4.
Note that one may introduce any number of complexities into these stages, a few of
which we examine below. However, we sidestep a number of issues relevant to any
developmental model that do not uniquely distinguish our approach. For example, we
make no commitment to the role of innate vs. learned knowledge (e.g., Chomsky 1957,
Meltzoff and Moore 1977, Fodor 1983) and believe we can incorporate either
perspective. The primary goal here is to outline how our model computationally learns
imitative behaviors, towards examining the acquisition of birdsong in the next section.
We therefore avoid a number of important albeit orthogonal topics, some of which we
will examine in Chapter 6.
5.2.1 Parental Training
In the first stage of our model, we assume the system corresponds to a neonate. Its
sensory processing channels are sufficiently developed to extract features from perceptual
inputs, provided by its parents, other animals (particularly conspecifics), or directly from
the environment. The goal of the system at this point is to collect sufficient data to begin
cross-modal clustering and thereby become perceptually grounded. Let us consider the
outcome of this grounding process, as illustrated in Figure 5.5. Here, a set of four
different events in the world has been acquired using the framework of Chapter 3. The
goal then is for the system to learn how to generate these events by itself.
We note that perceptual grounding may involve a gradual progression rather than a
sudden transition. Within a single modality, some categories may be easier to learn than
others, due to their inherent structures, availability of training data, degree of perceptual
redundancy, individual variations, etc. In contrast, some perceptual features may be
higher-level aggregates of simpler ones. As such, their acquisition depends upon that of
their components, which themselves may be differentially acquired.
For example,
phonological development in children proceeds in a number of well-defined,
interconnected stages that constructively build upon each other (Vihman 1996). During
this lengthy process, children need constant exposure to language, during which they
extract features corresponding to their current state of development. More generally, a
juvenile may undergo multiple stages of perceptual grounding, during which it acquires
increasingly abstract knowledge from its tutors or environment.

Figure 5.5 – A hypothetical sensory system that has learned four events in the world. These are acquired
through cross-modal clustering, using the framework in the previous chapter. For simplicity, only a single
sensory mode is illustrated here.
Although we assume that slices are initially unstructured, i.e., tabula rasa, it is certainly
possible to incorporate innate pre-partitioning of slices into our framework. Also, our
model does not specify how perceptual features are selected in the first place. We
assume this is specified genetically in the biological world and programmatically in the
artificial world.
5.2.2 Internal and External Self-Observation
The second and third stages of our model correspond to a system's observation of its own
innate, exploratory motor activity. Although reflexive behaviors are phylogenetically
selected in animals to satisfy their individual motor requirements (Tinbergen 1951), in
artificial systems, we must specify how these innate behaviors are generated. While it
may often be reasonable to design exploratory behaviors that are predetermined to satisfy
a set of motor goals, we examine a generic strategy here. Our goal is simply to explore a
motor space and in doing so, simultaneously observe the effects internally through the
Cartesian theater and externally through normal perceptual channels.

Figure 5.6 – Internal perception of exploratory motor behavior corresponding to an Archimedean spiral.
These data correspond to the parameters used to generate motor activity.
Consider, for example, the problem of generating pairs of exploratory parameters (x,y) in
a hypothetical motor system. We have found it useful to select these parameters and
thereby explore motor spaces according to an Archimedean spiral. We could therefore
specify that the parameter values are drawn from a distribution described by
$x = \alpha_1\theta \cdot \cos(\alpha_2\theta)$ and $y = \alpha_1\theta \cdot \sin(\alpha_2\theta)$. In this case, the internal perception of this motor
activity might be represented by the slice in Figure 5.6.

Figure 5.7 – External perception of exploratory motor behavior. This slice perceives the events generated
by the motor activity described by Figure 5.6. These data correspond to perceptual features describing
sensory observations.
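A minimal sketch of the spiral exploration strategy described above; the constants and sample count below are illustrative choices rather than values used by our system.

```python
import numpy as np

def spiral_parameters(n=2000, a1=0.05, a2=1.0, theta_max=20 * np.pi):
    # Archimedean spiral: x = a1*theta*cos(a2*theta), y = a1*theta*sin(a2*theta)
    theta = np.linspace(0.0, theta_max, n)
    x = a1 * theta * np.cos(a2 * theta)
    y = a1 * theta * np.sin(a2 * theta)
    return np.column_stack([x, y])   # each row is one exploratory (x, y) pair
```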
Note that the slice representing the external perception of this motor activity may "look"
entirely different than the motor slice representing its generation. In other words, there is
no reason to expect any direct correspondence or isomorphism between motor and
perceptual slices. The motor parameters indirectly generate perceptual events through an
effector system, which may be non-linear, have discontinuities, or display complex
dynamics or artifacts.
We see this phenomenon in the slice in Figure 5.7, which visualizes perception of the
motor activity described by Figure 5.6. For the purpose of this example, we generated a
non-bijective mapping between the motor parameters and the perceptual features of
events they generate in order to illustrate the degree to which corresponding motor and
perceptual slices may appear incongruous. Thus, even though these two slices may
represent the same set of percepts abstractly, they should not be expected to bear any
superficial resemblance to each other. This was also the case with the solely perceptual
slices in Chapter 3; however, one might have assumed that a stronger correspondence
would exist here due to the generative coupling between the slices, which is not the case.
5.2.3 Cross-Modal Clustering
The fourth stage of our model allows us to learn the correspondences between motor and
perceptual slices. Figure 5.8 shows the interconnection of the three slices introduced
above. On the bottom left in (A), we see the categories acquired during perceptual
grounding, as shown in Figure 5.5. Using these, the system can categorize external
observations of its own activities as shown in (B); this corresponds to the cross-modal
clustering of the slice in Figure 5.7. It can then subsequently use these classifications to
categorize internally observed motor behaviors, which were responsible for generating
this activity to begin with. This is shown in (C) and corresponds to the cross-modal
clustering of the slice in Figure 5.6.

Figure 5.8 – Stages of cross-modal clustering. Starting from the acquisition of perceivable events in (A),
we learn to classify the effects of our own behaviors in terms of these events in (B). Finally, we can then
relate this back to the innate motor activity generating our actions, as in (C). There are thereby three
stages of cross-modal clustering in this model.
Although the discussion here presumes these steps happen sequentially, i.e., there are three
"rounds" of cross-modal clustering represented in Figure 5.8, taking us from stages (A)
through (C), it is possible to imagine them overlapping. In other words, motor learning
may co-occur with perceptual grounding and one may propose a continuum between
these two temporal alternatives. Particularly in systems that have limited access to
training inputs, self-supplementing these data by generating them independently, as
perceptual capabilities are differentially acquired, may help to overcome a paucity in
external stimulation. This may be additionally helpful in determining ambiguous subsets
of perceptual and motor space, which were discussed in the previous chapter.
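As a crude illustration of stage (C), the sketch below labels each motor node with the perceptual category that most often fires while that node is active. This co-occurrence bookkeeping is only a simplified stand-in for the cross-modal clustering of Chapter 3, and the array-based interface is an assumption of the sketch.

```python
import numpy as np

def label_motor_nodes(motor_node_ids, percept_labels, n_motor_nodes, n_categories):
    # Count, for each motor node, how often each perceptual category fires
    # while that node is the active one in the motor slice.
    counts = np.zeros((n_motor_nodes, n_categories))
    for node, cat in zip(motor_node_ids, percept_labels):
        if cat is not None:              # only count frames where a percept fired
            counts[node, cat] += 1
    return counts.argmax(axis=1)         # one perceptual label per motor node
```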
5.2.4 Voluntary Motor Control
The fifth stage of our model replaces innate, exploratory behavior with voluntary,
intentional motor control. Figure 5.9 displays the bottom two slices of Figure 5.8, where
the motor map on the right is now clustered according to the perceptions it externally
generates – by way of some effector system – in the sensory map on the left.
Significantly, the system is now capable of generating the events to which it was exposed
during parental training.
As we saw in Chapter 4, motor and sensory events do not happen instantaneously in time.
They are not discrete, discontinuous phenomena. Thus, rather than select individual
points in a motor map to trigger behaviors, it is far more plausible to imagine a system
"moving" through a motor map during a time period corresponding to sustained activity.
We previously examined how to move through sensory maps to avoid perceptual
ambiguity. We note, however, that one may also wish to incorporate other types of
constraints into motor systems, for example, to minimize energy or maximize stability.
Figure 5.9 – Acquisition of voluntary motor control. Regions in the motor map on the right are now
labeled with the perceptual events they generate in the sensory map on the left.
5.3 Learning Birdsong
We now propose to use the framework in the previous section for self-supervised
learning of birdsong. Our presentation will focus on song learning in the zebra finch, a
popular species for studying oscine songbird vocal production. We begin with a brief
introduction to this species, focusing on the structure of its song and the developmental
stages of its acquisition. Towards building a computational model, we next introduce the
notion of songemes, primitive units of birdsong that we propose as avian equivalents of
phonemes. Finally, we present a system that learns to imitate an adult zebra finch in a
developmentally realistic way, modeled on the dynamics of how male juvenile finches
learn birdsong from their fathers (Tchernichovski et al. 2004, Fee et al. 2004).
Our system first listens to an adult male finch and uses cross-modal clustering to learn the
songemes comprising the song of its "father." It then uses an articulatory synthesizer to
generate its own nascent birdsong, guided by random exploratory motor behavior. By
listening to itself sing, the system organizes the motor maps generating its vocalizations
by cross-modally clustering them with respect to the previously learned songeme maps of
its parent. In this way, it learns to generate the same sounds to which it was previously
exposed.
We are indebted to Ofer Tchernichovski (2005, 2006) for providing the adult zebra finch
song recordings used to train our system. We are also grateful to Heather Williams for
making available a generationally-indexed birdsong library, which provided an additional
source of inputs used during initial testing (Williams 1997, 2006). Other sources of
birdsong files are cited individually below.
5.3.1 Introduction to the Zebra Finch
The zebra finch is an extremely popular species for researching birdsong acquisition.
(For surveys on birdsong learning, see Brenowitz et al. 1997, Ziegler and Marler
2004, and Nottebohm 2005). In part, this is due to the ease of maintaining and breeding
the species in captivity and its rapid sexual maturity. However, there are two additional
characteristics of zebra finches that make them particularly attractive for study.

Figure 5.10 – An adult male zebra finch (Taeniopygia guttata). Zebra finches are small, unusually social
songbirds that grow to approximately 10cm (4 inches) tall. They are extremely popular both as pets and as
research subjects for studying neural, physiological, evolutionary, social, and developmental aspects of
birdsong acquisition.
The first of these is the noisy spectral quality of their songs, which are distinct from the
whistled, tonal characteristics of most other songbirds, such as sparrows or canaries. We
can see this harmonic complexity in Figure 5.11, which displays spectrograms of songs
from several oscine species, along with a human vocalization for reference.
It is
hypothesized that the spectral complexity of a zebra finch's song reflects its physical prowess
and aids in sexual selection (Kroodsma and Byers 1991). The complexity of their
vocalizations, along with a range of behavioral and neurological similarities, has
prompted many researchers to propose studying song learning in zebra finches as a model
for understanding speech development in humans (Marler 1970, Nottebohm 1972, Doupe
and Kuhl 1999, Brainard and Doupe 2002, Goldstein et al. 2003). Perhaps supporting
this notion, it has been determined that human FOXP2, the first gene linked to speech and
vocal production, has a protein sequence that is 98% identical to the same gene in the
zebra finch (Haesler et al. 2004, see also Webb and Zhang 2005).
Figure 5.11 – Spectrograms of songs from five different species. These include: i) a zebra finch; ii) an
evening grosbeak; iii) a blue jay; iv) a northern mockingbird; and v) the author singing "Do Re Mi" to
provide a reference with human vocalization. Notice the complex harmonic structure of the zebra finch's
song compared to those of the other birds. Song in (i) provided by (Tchernichovski 2006). Songs in (ii)(iv) were obtained from the U.S. National Park Service (2005). Note that the frequency ranges are different
in each of these spectrograms.
The second characteristic that makes zebra finches popular models for song learning is
the well-defined developmental process through which their song is acquired (Slater et al.
1988).
Fledglings of both sexes begin learning their father's song early in life.
Approximately one month after hatching, male juveniles begin producing nascent,
squeaky sounds and then proceed through a series of stages of vocal refinement
(Immelmann 1969, Tchernichovski and Mitra 2002). The goal of this process for a zebra
finch is to learn to approximately reproduce its father's song, accompanied by
idiosyncratic, individual variations unique to each bird. Therefore, a son sounds similar
but not identical to his father.
At 90 days of age, a bird's song crystallizes as it
simultaneously reaches sexual maturity, and its song template remains unchanged for the
remainder of its life (Doupe and Kuhl 1999). Thus, an adult male zebra finch adheres to
a single song motif, where each vocalization is constructed from a fixed set of acquired
components. We note that as with many other oscine songbird species, female zebra
finches do not sing. Instead, they make simple vocalizations known as distance calls,
through which birds of both sexes are able to individually recognize one another (Miller
1979, Vignal et al. 2004).
The juvenile development of song generation is heavily guided by auditory feedback
(Konishi 1965). In fact, as a bird begins to vocalize, it no longer requires exposure to its
tutor's song but instead, it must be able to hear itself sing. Adult birds also need feedback
to maintain their singing ability (Nordeen and Nordeen 1992). Thus, even though adult
males cannot learn new songs, they require auditory feedback to maintain the neural song
patterns acquired as juveniles (Brainard and Doupe 2000).
How juveniles learn from auditory feedback is unknown. Marler (1997) outlines three
models of sensorimotor-based song development.
These range from fully open,
instructive tutoring to the assumption of innate neural templates for highly constrained,
conspecific song patterns. He argues for an intermediate approach, incorporating song
memorization into a phylogenetically constrained framework that has been selected to
facilitate rapid learning within an individual species.
Figure 5.12 – Comparing a spectrogram to a spectral derivative display for the song of a zebra finch. A
spectrogram for a zebra finch song is displayed on top. The spectral derivative of that song is shown on the
bottom and provides much clearer visual detail. It also provides a framework for subsequent harmonic
analysis.
5.3.2 Songemes
We now introduce the notion of songemes, primitive units of birdsong that we propose as
avian equivalents of phonemes. Marler (1997, p508) notes that although "it is the subject
of score of studies of song learning and its neural basis, … little effort has been directed
to descriptive studies of zebra finch song structure." In the research literature, birdsong
is generally divided into components known as syllables, which are continuous sounds
bounded by silent intervals (e.g., Williams and Staples 1992, Ölveczky et al. 2005). It is
not, however, generally broken down into smaller units. The goal of this section is to
define primitive units we call songemes, which we argue correspond more closely to
basic elements of physiological song production.
Before proceeding, we briefly discuss the multitaper spectral analysis methods (Thomson
1982) that were introduced into the analysis of birdsong by (Tchernichovski et al. 2000).
Traditional spectrograms show the power at different frequency components of a signal
over time, as in the top of Figure 5.12. Multitaper methods compute spectrograms while
simultaneously providing estimates of spectral derivatives. That is, rather than only
measuring power, they also measure instantaneous changes in power, and as such,
perform a sort of edge detection on the spectrogram, making contours easier to detect
(see the bottom of Figure 5.12). They also provide a framework for harmonic analysis
(ibid, Tchernichovski and Mitra 2004). In the rest of this chapter, we use spectral
derivatives in place of spectrograms for examining acoustic analyses of birdsongs.

Figure 5.13 – Breaking a birdsong down into constituent songemes. On the top, the song is displayed
divided into seven syllables. The 22 derived songemes, defined via peaks of the song's smoothed
log(power), are shown on the bottom. The peaks are indicated by the dotted vertical yellow lines. The blue
lines indicate songeme boundaries, determined by locally adjacent minima. We note the long vocalization
at the end of the song corresponds to a distance call.
Returning to our discussion of syllables in bird song, let us consider the song displayed in
Figure 5.13. On the top of the figure, we have partitioned the song according to the
traditional syllabic breakdown, as determined by the intervals of silence. On the bottom,
we have partitioned the song into songemes, which captures the fine structure in the song.
The songeme partitioning is computed by finding the peaks in the smoothed
log(power) of the signal between 860 and 8600Hz, corresponding to the expected
vocalization range of a zebra finch. The smoothing is done with a low-pass Savitzky-Golay
filter, using a 2nd order polynomial over a window corresponding to approximately
40 msec. The songeme boundaries are determined by finding the local minima adjacent to
these peaks, and the boundaries are shared between songemes when they are temporally
adjacent. We examine the partitioning of an individual complex syllable into seven
songemes in Figure 5.14. This example supports our belief that the widespread syllabic
approach to studying birdsong is a poor model for capturing its internal complexity.

Figure 5.14 – Partitioning a single syllable into songemes. This figure displays the segment of birdsong in
Figure 5.13 between 520 and 875 msecs. The single syllable shown on top has been automatically
partitioned into seven songemes on the bottom, which correspond more closely to the changes in
vocalization during this interval. This example supports our belief that the widespread syllabic approach to
studying birdsong is a poor model for capturing its internal complexity.
We have used this technique to automatically extract approximately 10,000 songemes
from wav files recorded from two different zebra finches provided by (Tchernichovski
2005, 2006).
Of these, we heuristically rejected any songeme of duration less than 10
msecs, which eliminated approximately 1000 of them. Many of these are due to nonverbal sounds, e.g., from a bird moving in its cage or other background noises, that are
audible on the recordings.
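A sketch of this segmentation procedure, assuming a standard short-time spectrogram front end; the spectrogram settings and the exact peak and minimum bookkeeping are illustrative rather than the precise values used in our system.

```python
import numpy as np
from scipy.signal import savgol_filter, find_peaks, spectrogram

def segment_songemes(wave, fs, fmin=860.0, fmax=8600.0, min_dur=0.010):
    # Band-limited power between 860 and 8600 Hz (zebra finch vocal range)
    f, t, Sxx = spectrogram(wave, fs=fs, nperseg=256, noverlap=192)
    band = (f >= fmin) & (f <= fmax)
    power = Sxx[band].sum(axis=0)
    # Smooth log(power) with a ~40 ms, 2nd-order Savitzky-Golay filter
    frame_dt = t[1] - t[0]
    win = max(5, int(round(0.040 / frame_dt)) | 1)   # odd window length
    smooth = savgol_filter(np.log(power + 1e-12), win, 2)
    # Peaks define songeme centers; adjacent local minima define boundaries
    peaks, _ = find_peaks(smooth)
    minima, _ = find_peaks(-smooth)
    songemes = []
    for p in peaks:
        left, right = minima[minima < p], minima[minima > p]
        t0 = t[left[-1]] if len(left) else t[0]
        t1 = t[right[0]] if len(right) else t[-1]
        if t1 - t0 >= min_dur:                       # reject songemes < 10 ms
            songemes.append((t0, t1))
    return songemes
```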
[Figure 5.15 panels, top to bottom: Song, AM, FM, Entropy, Amplitude, Mean Frequency, Pitch Goodness,
Pitch, Pitch Weight.]
Figure 5.15 – Feature extraction for a zebra finch song partitioned into songemes. The solid line within each
songeme shows the mean value for the corresponding feature within it. These values are cross-modally clustered
to learn the structure of the birdsong. The dotted line within each songeme shows the actual feature data, which is
smoothed with a low-pass Savitzky-Golay filter. The feature values have been normalized to fit within each plot.
We note this is the same song as shown in Figure 5.13
We note several other efforts at separating complex animal sounds into primitive
components. Kogan and Margoliash (1997) used dynamic time warping and hidden
Markov models (HMMs) to derive simple units of birdsong for testing automated
recognition of birds.
Mellinger and Clark (1993) used HMMs for detecting and
identifying bowhead whales, and Clemins and Johnson (2003) used HMMs for
recognizing vocalizations in African elephants.
5.3.3 Birdsong Analysis
We extracted streams of acoustic features from our birdsong sound files using a
customized version of the Sound Analysis For Matlab software (Saar 2005).
The
extracted features include: 1) amplitude modulation; 2) frequency modulation; 3)
entropy; 4) amplitude; 5) mean frequency; 6) pitch goodness; 7) pitch; and 8) pitch
weight, as shown in Figure 5.15. We note the song in this figure is the one shown in Figure
5.13.
For each songeme, we average the values of the features within it to obtain a compact
acoustic description. These average values are shown by the solid horizontal lines within
each songeme in Figure 5.15a. The dotted lines within each songeme display the actual
feature values, which have been smoothed as described above. The average feature
values for approximately 9,000 songemes, derived from our training sound files, were fed
into an assembly of interconnected slices, to be discussed below.
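A sketch of this per-songeme averaging step; the `feature_tracks` layout (feature name mapped to time and value arrays) is an assumed interface rather than the actual output format of the Sound Analysis for Matlab software.

```python
import numpy as np

def summarize_songemes(songemes, feature_tracks):
    # songemes       : list of (t0, t1) intervals from segmentation
    # feature_tracks : dict of name -> (times, values) arrays (assumed layout)
    rows = []
    for (t0, t1) in songemes:
        row = {}
        for name, (times, values) in feature_tracks.items():
            inside = (times >= t0) & (times < t1)
            # mean value of each acoustic feature within the songeme
            row[name] = float(values[inside].mean()) if inside.any() else np.nan
        rows.append(row)
    return rows   # one averaged feature vector (as a dict) per songeme
```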
We see two of the outputs of this clustering in Figure 5.16. Among the most interesting
of our technical results, the upper slice demonstrates that the system has
learned there are three different types of vocalizations: 1) the blue region corresponds to
noisy sounds, perhaps generated by chaotic activity in the avian syringeal sound
generator. This is similar to chaotic (e.g., fricative) speech in humans; 2) the green
region corresponds to pure tones, such as whistles; and 3) the orange region corresponds
to harmonic sounds, such as in the distance call. We note the relative scarcity of pure
tones in zebra finch song, as reflected by the sparsity of data in the green region.
Figure 5.16 -- Cross-modally clustered zebra finch slices. We can interpret the upper slice as
demonstrating the system has learned there are three different types of vocalizations: 1) the blue region
corresponds to noisy sounds, perhaps generated by chaotic activity in the avian syringeal sound generator.
This is similar to aspirated speech in humans; 2) the green region corresponds to pure tones, such as
whistles; and 3) the orange region corresponds to harmonic sounds, such as in the distance call. The lower
slice shows the system has learned the pitch structure for seven different component vocalizations.
The slice at the bottom of Figure 5.16 demonstrates that the system has acquired the pitch
structure of its parent’s birdsong, corresponding with seven different base “notes,” out of
which songemes are constructed by varying other acoustic features such as harmonic
complexity and frequency modulation.
A view of all the slices used for birdsong learning is illustrated in Figure 5.17. The
low-level features, such as those in Figure 5.15, supplied data to these slices, which were
manually selected and interconnected. Although these types of interconnections are likely
phylogenetically determined in nature, a more sophisticated artificial system might
employ techniques such as (Bartlett 2001) to automatically select interconnections
between acoustic features based on independent component analysis. However, given
our task here is to demonstrate acquisition of sensorimotor control using perceptual
mechanisms, the precise details of feature selection and interconnection were not of great
concern. Nonetheless, it should be acknowledged that a fair amount of manual effort
went into architecting the birdsong learner illustrated in Figure 5.17, which was then
cross-modally clustered based on the input song of its parent, as described in §5.2.1.

Figure 5.17 – Slices for birdsong learning. On the bottom, one-dimensional slices feed songeme feature
values into the two-dimensional slices on the top. The colored lines represent learned Hebbian linkages.
The slices are then cross-modally clustered to learn songeme categories. This perceptually grounds the
system with respect to its "parent's" song. Detailed views of two of these slices are contained in Figure
5.16. See the text for additional details of this architecture.
5.3.4 Articulatory Synthesis
To implement the motor component of this system, we created a naive articulatory
synthesizer for generating birdsong, based on the additive synthesizer in the Common
Lisp Music System (Taube 1989) and translated into Matlab by (Strum 2001). The motor
parameters in our model correspond to: (1) syringeal excitation; (2) pitch; (3) power; and
(4) temporal, frequency and amplitude envelopes corresponding to simple models of
avian vocalization. In our implementation, chaotic syringeal excitation is realized by
phase and amplitude perturbations of a vocalization's harmonic components. We note
that this does not correspond with a biologically realistic syringeal mechanism, which
would be complicated to model accurately. However, our goal here is not to model
birdsong with perfect accuracy but rather to demonstrate self-supervised sensorimotor
learning within our framework. Making the synthesizer sounds generally realistic was
sufficient for our purposes, as we discuss below.
We refer to the nascent activity of the system as babbling.
Some examples of
increasingly complex babbling are shown in Figure 5.18. These demonstrate the system's
acquisition of harmonic complexity in response to auditory feedback and comparison
with its parent’s song.
The initial babbling corresponds to uninformed, innate motor
behavior as described above. As the system simultaneously listened to its own outputs
while “watching” the internal generation of motor activity, it was able to determine which
regions of its motor maps were responsible for providing harmonic complexity matching
that of its parent through cross-modal clustering, as described in §5.2.3.
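This babble-and-listen loop can be outlined as follows; the sketch is illustrative Python rather than the thesis implementation, and motor_map.sample_codebook_region, synthesize, and extract_features stand in for the system's motor slices, the articulatory synthesizer, and the acoustic front end applied to the parent's song.

def babble_and_listen(motor_map, synthesize, extract_features, n_trials=5000):
    # Pair each babbled motor command with the acoustic features of the sound
    # it produced.  These (motor, auditory) pairs are the raw material for
    # cross-modal clustering: regions of the motor slice become grounded by
    # the perceptual categories they reliably evoke.
    pairs = []
    for _ in range(n_trials):
        command = motor_map.sample_codebook_region()   # uninformed, innate babbling
        waveform = synthesize(**command)               # articulatory synthesizer
        percept = extract_features(waveform)           # same front end used on the parent
        pairs.append((command, percept))
    return pairs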
The increase in complexity is due to continued refinement of the system’s motor maps.
In other words, as it explored its motor capabilities more completely, it was able to
classify codebook regions in its motor maps to learn which generated harmonically
complex outputs. We manually implemented the strategy that our system select different
aspects of its parent's song to master independently, which reflects the actual strategy
taken by zebra finches in the wild (Liu et al. 2004), who sequentially focus on different
features of their song rather than trying to master the parent's song in its entirety. A
Figure 5.18 – The temporal evolution of bird babbling in our system. This figure illustrates the acquisition
of harmonic complexity due to auditory feedback. See the text for explanatory commentary.
presumable benefit of this approach is that it reduces search space complexity
enormously by isolating and focusing on individual acoustic features, thereby permitting
gradual song acquisition and differential acquisition in case other male siblings are also
in the midst of their own learning. Thus, this strategy also prevents co-located siblings
from confusing one another.
For example, two brothers raised together will select
(presumably) non-conflicting developmental paths, where one may focus on the temporal
aspects of its song while the other focuses on pitch, even though they are both essentially
learning the same song. Our system was designed to focus on individual songemes, in
essence acquiring the "notes" of its parent's song. As this is a much lower level treatment
than is typically given in the oscine developmental literature, which focuses on song
syllables, it is difficult to evaluate its developmental realism.
Nonetheless, for an
artificial system, it seems quite adequate.
Our strategy was to learn mimicry of each acoustic feature in terms of the articulator’s
model, which formed the structure of our motor slices. For each acquired songeme, we
selected the combination of articulations that best approximated it. We
note that given the constraints of our articulator, certain acoustic features of the parent
Figure 5.19 – Birdsong mimicry. On the top is a sample of the zebra finch song used as the "parent" for
our system. On the bottom is the system's learned imitation, where the acquired songemes have been fit to
the template of the parent’s song and smoothed.
birdsong were not possible to reproduce well. In particular, the harmonic complexity
associated with chaotic syringeal excitation, a common phenomenon in zebra finch song,
could only roughly be approximated by our articulatory model. However, the human ear
seems far more sensitive to pitch accuracy, which we were able to capture quite well, and
tends to be less sensitive to reduced acoustic entropy.
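The selection step described earlier in this subsection can be sketched as a simple search over candidate articulator settings; the distance metric and the candidate set in this Python outline are assumptions, not a commitment to the procedure actually used.

import numpy as np

def fit_songeme(target_features, candidate_commands, synthesize, extract_features):
    # Choose the articulator setting whose synthesized output lies closest to
    # the acquired songeme in the acoustic feature space used during grounding.
    best_command, best_error = None, float("inf")
    for command in candidate_commands:
        features = extract_features(synthesize(**command))
        error = np.linalg.norm(np.asarray(features) - np.asarray(target_features))
        if error < best_error:
            best_command, best_error = command, error
    return best_command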
5.3.5 Results
This experiment was in essence empirical.
Our goal was to demonstrate that our
framework could perform self-supervised sensorimotor learning, using the zebra finch as
a testbed. Although the acquired song sounds recognizably like the parent bird’s to
human ears, does it sound like a bird to another bird? In other words, it is unclear how
to evaluate this mimicry. We note that human auditory range overlaps well with that of
the zebra finch, and as mentioned above, the FOXP2 gene thought to be responsible for
vocal production is remarkably similar between the two species. Nonetheless, one should
assume that auditory feature extraction and subsequent processing in the zebra finch is
uniquely tuned to its auditory requirements, which are surely quite different than our
own.
Perhaps the best way to evaluate this work would be to use it to train a fledgling and see
if it acquires song. In other words, use our system as a parent.1 This is a fascinating
possibility which we are currently investigating. However, it is important to keep in
mind that the goal of this chapter is to present an architecture for sensorimotor learning in
artificial systems; songbirds are a well-studied model for this type of acquisition, and as
such, they are helpful guideposts for examining this problem. Nonetheless, our goal is to
propose an architecture for more sophisticated computational systems rather than to
precisely imitate the song of a particular bird.
1 It has been suggested by Chris Atkeson that we attempt mating our system with a female zebra finch, but
that would presumably void our computer's warranty.
5.4 Summary
This chapter has demonstrated that the cross-modal clustering framework presented in
Chapter 3 can be recursively reapplied to acquiring self-supervised sensorimotor control.
We have developed an architecture for this type of learning using an internal Cartesian
theater to correlate the generation of motor activity with its perceived effects through
cross-modal clustering. This is possible because slices neither know nor care whether
they represent sensory or motor data. It is thus straightforward to seamlessly move
between the two during clustering.
We demonstrated this approach with a system that learns birdsong, following the
developmental stages of a fledgling zebra finch. This work suggests a number of other
possible applications for learning motor control through observation, among the most
common forms of learning in the animal world.
The benefits of self-supervised learning for artificial sensorimotor systems are enormous
because engineered approaches tend to be ad hoc and error prone; additionally, in
sensorimotor learning we generally have no adequate models to specify the desired
input/output behaviors for our systems. As we mentioned in Chapter 1, the notion of
programming by example is nowhere truer than in the developmental mimicry
widespread in the animal kingdom, and this chapter is a significant step toward providing
that capability to artificial sensorimotor systems.
Chapter 6
Biological and Perceptual Connections
This chapter connects the computational framework introduced in this thesis to a modern
understanding of perception in biological systems. In doing so, we motivate the approach
taken here and simultaneously suggest how this work may reciprocally contribute
towards a better computational understanding of biological perception. We begin by
examining the interaction between sensory systems during the course of ordinary
perception.
6.1 Sensory Background
Who would question that our senses are distinct? We see, we feel, we hear, we smell,
and we taste, and these are qualitatively such different experiences that there is no room
for confusion among them. Even those affected with the peculiar syndrome synesthesia,
in which real perceptions in one sense are accompanied by illusory ones in another, never
lose awareness of the distinctiveness of the senses involved.
Consider the woman
described in (Cytowic 2002), for whom a particular taste always induced the sensation of
Figure 6.1 - Cross-modal matching: a subject is asked to use haptic (e.g., tactile) cues to select an object
matching a visual stimulus. From (Stein and Meredith 1993).
a specific, corresponding geometric object in her left hand. A strange occurrence indeed,
but nonetheless, the tasting and touching – however illusory – were never confused; they
were never merged into a sensation the rest of us could not comprehend, as would be the
case, for example, had the subject said something tasted octagonal. Even among those
affected by synesthesia, sensory perceptions remain extremely well defined.
Given that our senses appear so unitary, how then does the brain coordinate and combine
information from different sensory modalities? This has become known as the binding
problem (see Wolfe and Cave 2000 for a review), and the classical assumption has been
that the sensory streams are abstracted, merged, and integrated in the cortex, at
the highest levels of brain functioning.
This hypothesis assumed a cognitive
developmental process in which children slowly developed high-level mappings between
innately distinct modalities through their interactions with the world.
We may examine this assumption in the context of the most frequently studied of
intersensory phenomena, cross-modal matching, which is determining that perceptions in
two different sensory modalities could have the same source.
Figure 6.1 shows a
standard experimental matching task, in which a subject is asked to use tactile cues to
select an object matching a visual stimulus. The actual mechanisms that make cross-modal matching possible are unknown, but clearly, sufficient correspondences between
the modalities must exist to enable making this type of equivalency judgment. Whatever
form these correspondences take – whether through topographic maps, amodal perceptual
interlinguas, or other representations – is according to Piaget (1954), among many others,
developed slowly through experience. Only when the involved senses have developed to
the point of descriptional (i.e., representational) parity, can these interrelations develop
and cross-modal matching thereby take place.
This position directly traces back to Helmholtz (1884) and even earlier, to Berkeley
(1709) and Locke (1690), who believed that neonatal senses are congenitally separate and
interrelated only through experience. According to this viewpoint, the interrelation does
not diminish the distinctiveness of the senses themselves, it merely accounts for
correspondences among them based on perceived co-occurrences. This was seemingly a
very intuitive foundational assumption for studying perception. Unlike so much of the
rest of human cognition, we can be introspective about how we perceive the world. Are
not our sensory systems innately designed to bring themselves to our attention?
6.2 Intersensory Perception
Unfortunately for this introspective approach to perception, an overwhelming body of
evidence has been gathered in the past half century demonstrating that perceptions
themselves emerge as the integrated products of a surprising diversity of components.
(For surveys, see Stein and Meredith 1993, Lewkowicz and Lickliter 1994, Rock 1997,
Shimojo and Shams 2001, Calvert et al. 2004, and Spence and Driver 2004.) Our
auditory, olfactory, proprioceptive, somatosensory, vestibular, and visual systems
influence one another in complex interactive processes rarely subject to conscious
introspection. In fact, the results can often be completely counterintuitive. Perception is
such a fundamental component of our experience that we seldom give its mechanisms
any direct attention. It is the perceptions produced by these mechanisms that draw our
attention, not the mechanisms themselves, and we are ill equipped to examine the
mechanisms directly without the aid of clever, sometimes even serendipitous,
experimentation. In Gibson's (1950) view, perceivers are aware of the world, not their
own perceptions.2
For example, let us consider the well-known work of McGurk and MacDonald (1976),
who studied how infants perceive speech during different periods of development. In
preparing an experiment to determine how infants reconcile conflicting information in
different sensory modalities, they had a lab technician dub the audio syllable /ba/ onto a
video of someone saying the syllable /ga/. Much to their surprise, upon viewing the
dubbed video, they repeatedly and distinctly heard the syllable /da/ (alternatively, some
hear /tha/), corresponding neither to the actual audio nor video sensory input. Initial
2 One may contrast this with Russell's (1913) criticism of "materialistic monism," in which he argues that
abstract, self-aware mental models are an essential component of perception. This distinction, in the guise
of the Physical Symbol System Hypothesis (Miller, Galanter, and Pribram 1960, Newell and Simon 1976),
has been the subject of intense scrutiny within the artificial intelligence community (e.g., Johnson-Laird
1983, Brooks 1990a).
assumptions that this was due to an error on the part of the technician were easily
discounted simply by shutting their eyes while watching the video; immediately, the
sound changed to a /ba/. This surprising fused perception, subsequently verified in
numerous redesigned experiments and now known as the McGurk effect, is robust and
persists even when subjects are aware of it.
The McGurk effect is among the most convincing demonstrations of the intersensory
nature of face-to-face spoken language and the undeniable ability of one modality to
radically change perception in another. It has been one of many components leading to
the reexamination of the introspective approach to perception. Although it may appear
reasonable to relegate intersensory processing to the cortex for the reasoning (as opposed
to perceptual) processes involved in the cross-modal matching experiments described
above, it becomes far more implausible in cases where different senses impinge upon
each other in ways that locally change the perceptions in the sensory apparatus
themselves.
One might object that the McGurk effect is pathological – it describes a perceptual
phenomenon outside of ordinary experience.
Only within controlled, laboratory
conditions do we expect to have such grossly conflicting sensory inputs; obviously, were
these signals to co-occur naturally in the real world, we would not call them conflicting.
We can refute this objection both because the real world is filled with ambiguous,
sometimes directly conflicting, perceptual events, and because the McGurk effect is by
no means the only example of its kind. There is a large and growing body of evidence
that the type of direct perceptual influence illustrated by the McGurk effect is
commonplace in much of ordinary human and more generally animal perception, and it
Figure 6.2 - The McGurk Effect. Disparate auditory and visual inputs can create perceptions corresponding
to neither. Picture is from (Haskins 2005).
strongly makes the case that our perceptual streams are far more interwoven than
conscious experience tends to make us aware. In this, the McGurk effect is unusual in
that the conflicting audio and visual inputs create perceptions corresponding to neither
actual input. More typically, sensory systems influence (e.g., enhance, degrade, change)
perceptions in one another and at times, substitute for each other as well, by transferring
perceptual information across modalities (Calvert et al. 2004).
An example of one sense enhancing another occurs all the time in noisy environments.
The sight of someone’s moving lips in an environment with significant background noise
makes it easier to understand what the speaker is saying; visual cues – e.g., the sight of
lips – can alter the signal-to-noise ratio of an auditory stimulus by 15-20 decibels (Sumby
and Pollack 1954). Tied in with audio source separation, this phenomenon is commonly
known as the cocktail party effect (Cherry 1953). We see, therefore, that a decrease in
auditory acuity can be offset by increased reliance on visual input. In fact, it has long
been known that watching the movement of a speaker’s lips helps people understand
what is being said: Juan Pablo Bonet wrote in his 1620 classic “Simplification of Sounds
and the Act of Teaching the Deaf to Speak,” (cited in Bender 1981, p41):
“For the deaf to understand what is said to them by the motions of the lips
there is no teaching necessary; indeed to attempt to teach them it would be a
very imperfect thing ...to enable the deaf-mute to understand by the lips alone,
as it is well known many of them have done, cannot be performed by
teaching, but only by great attention on their part…”
Although the neural substrate behind this interaction is unknown, it has been determined
that just the sight of moving lips – without any audio component – modifies activity in
the auditory cortex (Sams et al. 1991, Calvert et al. 1997). Furthermore, psycholinguistic
evidence has long supported the belief that lip-read and heard speech share a degree of
common processing, notwithstanding the obvious differences in their sensory channels
(Dodd et al. 1984).
Examples of senses substituting for one another are commonplace, as when auditory and
tactile cues replace visual ones in the dark; this is familiar to anyone who has walked
through a dark room with outstretched, waving arms and hyper-attentive ears. A far more
interesting example is seen in a phenomenon known as the “facial vision” of the blind. In
locating objects, blind people often have the impression of a slight touch on their
forehead, cheeks, and sometimes chest, as though being touched by a fine veil or cobweb
(James 1890, Kohler 1967). Consider this quote contained in James (1890, p.204), from
a blind author of the time:
“Whether within a house or in the open air, whether walking or standing still,
I can tell, although quite blind, when I am opposite an object, and can
perceive whether it be tall or short, slender or bulky. I can also detect whether
it be a solitary object or a continuous fence; whether it be a close fence or
composed of open rails, and often whether it be a wooden fence, a brick or
stone wall, or a quick-set hedge. …The sense of hearing has nothing to do
with it, as when snow lies thickly on the ground objects are more distinct,
although the footfall cannot be heard. I seem to perceive objects through the
skin on my face, and to have the impressions immediately transmitted to the
brain.”
The explanation for this extraordinary perceptory capability had long been a subject of
fanciful debate. James demonstrated, by stopping up the ears of blind subjects with
putty, that audition was behind this sense, which is now known to be caused by intensity,
direction, and frequency shifts of reflected sounds (Cotzin and Dallenbach 1950, Hoshino
and Kuroiwa 2001). The auditory input is so successfully represented haptically in the
case of facial vision that the perceiver himself cannot identify the source of his
perceptions.
There is no doubt that we constantly make use of intersensory cues during perception and
in directing our attention, and when deprived of these cues, through artificially contrived
or naturally occurring circumstances, we can display marked degradation in what seem to us
conceptually and functionally isolated sensory systems. The breadth of these interactions
and the range of influences they demonstrate have been so surprising that they have
called for radically new approaches to understanding how our individual perceptual
systems work, how the brain merges their perceptions, and why these two questions are
inseparable.
6.3 The Relevance
Why should this reexamination of sensory independence concern us? We believe the
answer stems from the fact that the early assumption of perceptual isolation underlies
nearly all modern interactive systems – it remains the primary architectural metaphor for
multimodal integration.
The unfortunate consequence of this has been to make
integration a post-perceptual process, which assembles and integrates sensory input after
the fact, in a mechanism separate from perception itself.
Modern psychophysical,
neuroanatomical, evolutionary, and phenomenological evidence paints a very different
picture of how animals, including humans, merge their senses; the notion that the senses
are processed in isolation is highly implausible. Much of this evidence also undermines
classical symbol-processing models of perception (e.g., Fodor and Pylyshyn 1998), where
cognitivist assumptions can obscure rather than illuminate the subtle intersensory
perceptual mechanisms we are interested in here.
6.3.1 Perception in Pipelines
Computational approaches to perception have traditionally been bottom-up, feeding raw
perceptual inputs into abstraction pipelines, as shown in Figure 6.3. In these frameworks,
a pipeline is constructed through functional composition of increasingly high-level
feature detectors.
This approach is apparent in some of the earliest work in
computational object recognition (Horn 1970, Binford 1971, Agin 1972). It is also
explicitly reflected in the perceptual theories described by Marr (1976, 1982), Minsky
(1987), and Ullman (1998), as well as in the subsumption architecture of Brooks (1986)
for sensorimotor control.3
Early approaches to artificial perception focused exclusively on modeling aspects of
individual modalities in isolation, although the potential for more complex multimodal
interactions drew the imaginations of early researchers (Turing 1950, Selfridge 1955).
Among the first efforts to create a user interface that combined two independent
3 Note that both Ullman and Brooks make extensive use of top-down feedback to govern their bottom-up
processing.
Figure 6.3 – Unimodal processing pathways. Individual modalities are processed in specialized pipelines.
The visual pathway on the left uses 3-dimensional depth and color maps to find candidate regions for
locating faces. (Modeled after Darrell et al. 1999.) The auditory pathway on the right performs speech
recognition. (Modeled after Jelinek 1997.) Notice in this example that higher-level syntactic constraints
can feed back into the lower-level morphological and phonetic analyses.
perceptual channels was that of Bolt (1980). His "Put-that-there" system (Figure 6.4)
enabled users to interact with projected maps by speaking and pointing simultaneously.
Bolt's system resolved spoken deictic references involving identity ("that") and location
("there") by visually observing the pointing gestures that accompanied them. Deictic
gesture resolution has subsequently become a very popular application for multimodal
user-interface development (e.g., Oviatt et al. 2003, Kumar et al. 2004).
Another significant application combining separate modalities is lip-reading, which has
become perhaps the most studied problem in the computational multimodal literature
(e.g., Mase and Pentland 1990, Massaro et al. 1993, Brooke and Scott 1994, Bregler and
Konig 1994, Benoît 1995, Hennecke et al. 1996, Bangalore and Johnston 2000, Huang et
al. 2003, Potamianos et al. 2004).
This is due both to the historic prominence of
automatic speech recognition in computational perception and more importantly, to the
inherent difficulty of recognizing unconstrained speech.
We note that the robotics
community was perhaps the first to realize that combining multiple sensory channels
could increase overall perceptual reliability (Nilson 1984, Flynn 1985, Brooks 1986).
However, in these systems, sensors would substitute for one another depending upon
context; in other words, one sensor would be used at a time and a robot would
dynamically switch between them. Speechreading systems were the first multimodal
Figure 6.4 – An early multimodal user-interface. The "Put-that-there" system combined speech processing
and visual gesture recognition to resolve spoken deictic references to a projected map. From (Bolt 1980).
approaches that sought to increase perceptual accuracy by simultaneously combining
information from complementary sensory channels. This is in contrast with the more
typical goal of multimodal user-interface design, which is to make human-computer
interaction more natural for people by providing additional input modalities such as
pointing or context-awareness (Dey et al. 2001).
In recent years, a wide range of multimodal research areas has emerged, many of which
were inspired by Weiser's (1991, 1994) notion of ubiquitous computing. These include
intelligent environments (Coen 1999, Vanderheiden et al. 2005), wearable computing
(Starner 1999), physiological monitoring (Intille et al. 2003, Sung and Pentland 2005),
ambient intelligence (Remagnino et al. 2005), multimodal design (Adler and Davis
2004), and affective computing (Picard 1997).
Additionally, we note that roboticists
have a long tradition of combining multimodal perception with sensorimotor control (see
the references above, along with Brooks et al. 1998, Thrun et al. 2005).
6.3.2 Classical Architectures
Surprisingly, even though all of the above applications address a diverse assortment of
computational problems, their implementations have similar – sometimes almost identical
– architectures. Namely, they share a common framework where sensory inputs are
individually processed in isolated, specialized pathways (Figure 1.2). Each perceptual
pipeline then outputs an abstract description of what it sees, hears, senses, etc. This
description captures detail sufficient for higher-level manipulation of perceptions, while
omitting the actual signal data and intermediate analytic representations. Typically, the
perceptual subsystems are independently developed and trained on unimodal data; in
other words, each system is designed to work in isolation. They are then interconnected
through some fusive mechanism that combines temporally proximal, abstract unimodal
inputs into some integrated event model.
The integration itself may be effected in many different ways.
These include:
multilayered neural networks (Waibel et al. 1995); hidden Markov models (Stork and
Hennecke 1996); coupled hidden Markov models (Nefian et al. 2002); dynamic Bayesian
Figure 6.5 – Classical post-perceptual integration in multimodal systems. Here, auditory (A) and visual
(V) inputs pass through specialized unimodal processing pathways and are combined via an integration
mechanism, which creates multimodal perceptions by extracting and reconciling data from the individual
channels. Integration can happen earlier (a) or later (b). Hybrid architectures are also common. In (c),
multiple pathways process the visual input and are pre-integrated before the final integration step; for
example, the output of this preintegration step could be spatial localization derived solely through visual
input. This diagram is modeled after (Stork and Hennecke 1996).
networks (Wang et al. 2005); unification logics (Cohen et al. 1997); Harmony theory
(Smolensky 1986); posterior probabilities (Wu et al. 1999); fuzzy logic (Massaro 1998);
maximally informative joint subspaces (Fisher et al. 2001); recurrent mixture models
(Hsu and Ray 1999); subsumption architectures (Brooks 1986); agent hierarchies
(Minsky 1986); partially observable Markov decision processes (Lopez et al. 2003);
layered topographic maps (Coen 2001); and various ad hoc techniques, which tend to be
the most popular (e.g., Bobick et al. 1998). The integrated events themselves have
specialized representations, which may be multimodal, in that they explicitly represent
information gathered from separate modalities, or they may be amodal, in that they
represent features not tied to any specific perceptual channel. For example, intensity is an
amodal feature, because it can be measured independently of any sensory system,
whereas deictic references are inherently multimodal, because they consist of cooccurrences of two distinct modal cues, namely, of speech and gesture. The output of the
integration process, whether amodal or multimodal, is then fed into some higher-level
interpretative mechanism – the architectural equivalent of a cortex.
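To illustrate how abstract and lossy this stage typically is, the following Python sketch captures the common pattern of late, post-perceptual fusion; the data structures are hypothetical, but the temporal grouping of already-abstracted unimodal outputs is representative of the architectures cited above.

from dataclasses import dataclass
from typing import List

@dataclass
class UnimodalPercept:
    modality: str      # e.g., "audio" or "vision"
    label: str         # the pipeline's abstract description of what it perceived
    confidence: float
    timestamp: float   # seconds

def integrate(percepts: List[UnimodalPercept], window_s: float = 0.5):
    # Group abstract unimodal outputs that are temporally proximal and
    # reconcile them into integrated events.  By this point the raw signals
    # and intermediate representations have already been discarded.
    percepts = sorted(percepts, key=lambda p: p.timestamp)
    events, current = [], []
    for p in percepts:
        if current and p.timestamp - current[0].timestamp > window_s:
            events.append(current)
            current = []
        current.append(p)
    if current:
        events.append(current)
    fused = []
    for event in events:
        best = {}   # keep the most confident label per modality
        for p in event:
            if p.modality not in best or p.confidence > best[p.modality].confidence:
                best[p.modality] = p
        fused.append({modality: p.label for modality, p in best.items()})
    return fused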
6.3.3 What's Missing Here?
This post-perceptual approach to integration is commonplace in engineered systems. It
suffers, however, from a number of serious flaws:
1) As we saw earlier in this chapter, the evidence is strongly against animals
perceiving this way. Perceptual modalities interact constantly during ordinary
perception, and even unimodal perception has strong multimodal components.
That is not to say that all perception must be multimodal. Nonetheless, symbiotic
interactions between sensory systems go a long way toward explaining why
perception is so robust in biological systems, in marked contrast with their
engineered counterparts.
Because the individual components of multimodal
systems in these architectures tend to be independently designed, they are both
representationally and algorithmically incompatible with one another. Therefore,
it is often extraordinarily difficult to enable information sharing among them after
the fact. We return to this point below.
2) Integration in these models happens too late. Integration occurs after each system
has already “decided” what it has perceived, when it is too late for intersensory
influence to affect the individual concurrent perceptions. This is due to the
information loss from both vector quantization and the explicit abstraction
fundamental to the pipeline design.
3) Biological sensory systems are perceptually impoverished – they are incapable of
simultaneously detecting all sensible events to which they are exposed. Instead,
they exhibit attentive focusing, which narrows their windows of sensory
awareness and thereby increases sensory discrimination. This prevents seemingly
chaotic composite signals from impinging upon the sensory cortex, and this
mechanism is thought to be responsible for the coherence in our generation of
perceptual scenes. However, it simultaneously requires that something guides this
focus. For example, saccadic behavior in the eye enables perception of a large
image, overcoming the retina's narrow foveal limits. However, without highly
informed sampling, key details may be missed simply because they are never
observed. Many of our perceptual channels have similar attentive mechanisms
(Naeaetaenen et al. 2001), of which we are generally not consciously aware.
Importantly for us, the cues which guide attentive focusing are frequently
generated cross-modally (Driver et al. 2005). The role of attentive focusing in
eliminating extraneous sensory detail appears to be a basic component of robust
animal perception. Representational and algorithmic incompatibilities make this
type of cross-modal influence implausible in many engineered systems.
4) The independence between sensory components in classical architectures
precludes mutual bootstrapping, such as with the cross-modal clustering of
Chapter 3. Since these systems tend to be developed separately and connected
only at their outputs, there is no possibility of perceptually grounding them based
on naturally occurring temporal correspondences, to which they may remain
totally oblivious. Furthermore, the derivation of common perceptual categories
provided by cross-modal clustering would help alleviate the representational
incompatibility raised above in (1). Preventing this common grounding simply
exacerbates the problem.
We note that early integration (Figure 6.5a) would seem to bypass perceptual isolation by
combining sensory streams at a low-level feature (or sometimes, even signal) level. It
does so, however, at the cost of losing the domain structuring, high-level feature
extraction, and dimensional reduction explicitly provided by the abstraction-based,
pipeline architectures that are ubiquitous in encephalized species. Early integration has
shown much promise with problems amenable to information theoretic and statistical
approaches where there is little available domain knowledge. However, the majority of
multisensory problems of interest in this thesis do not appear to fit into this category.
When combining sensory information, structured domain knowledge, as described by
Wertheimer (1923), Chomsky and Halle (1968), and Marr (1982) along with many others
– and as implied by the Ugly Duckling Theorem (Watanabe 1985) – is essential for
reducing the size of perceptual search spaces.
It is doubtful that a wide range of
interesting sensory phenomena can be detected without it, particularly given the roles of
context and expectation in perceptual interpretation, even in relatively simple tasks
(Bruner and Postman 1949), and the computational intractability of directly describing
complex perceptual phenomena in terms of raw sensory input or low level features.
Early integration generally also precludes the possibility of top-down processing models,
such as with the visual sequence seeking approach in (Ullman 1996) or in various
approaches to auditory scene analysis (e.g., Slaney 1995).
The approach taken in this thesis is not motivated by the idea that computational systems
should use biologically-inspired mechanisms simply for the sake of doing so. Rather,
there are two justifications for the path taken here: (1) the perceptual phenomena we are
interested in computationally understanding are complex amalgams of mutually
interacting sensory input streams – they are not end-state combinations of unimodal
abstractions or features. Therefore, they cannot be accurately represented or described by
mechanisms that make this assumption; and (2) biological systems use low-level sensory
integration to handle a vast array of perceptual ambiguity and "errors." In designing
robust artificial perceptual systems, it would be a poor design to ignore Nature’s schema
for interpretative stabilization. This is all the more so given the relative fragility and
high degree of error in our best computational methods for processing unimodal data
streams. We believe that the current approach to building multimodal interfaces is an
artifact of how people like to build symbol processing systems and not at all well-suited
to dealing with the cross-modal interdependencies of perceptual understanding.
Perception does not seem to be entirely amenable to the clear-cut abstraction barriers that
computer scientists find so valuable for solving other problems, and approaching it as if it
were has led to the fragility of current multimodal systems.
Modern multimodal systems tend towards being inflexible and unpredictable and require
structured, typically scripted, interactions. Each of these interactions is subject to what
may be deemed the weakest link effect, in which every modality must receive input just
so, i.e., exactly as expected, in order for the overall system to function. The addition of
new modalities tends to weaken, rather than strengthen, interactive systems, as additional
inputs simply offer new opportunities for interpretive errors and combinatorially increase
the complexity of fusing perceptions. From a biological perspective, this is exactly the
opposite of what one should expect. Additional modalities should strengthen perceptual
systems by capitalizing on the inherent multimodal nature of events in the real world and
the mutual reinforcement between interconnected sensory channels. Modern multimodal
systems suffer an unfortunate fate predicted by von Neumann (1956), where adding
additional components reduces a system’s inherent stability.
Research on perceptual computing has focused almost entirely on unimodal perception:
the isolated analysis of auditory, linguistic, visual, haptic, and to a lesser degree biometric
data. It seems to put the proverbial cart before the horse to ponder how information from
different modalities can be merged while the perceptory mechanisms in the sensory
channels are themselves largely unknown. Is it not paradoxical to suggest we should or
even could study integration without thoroughly understanding the individual systems to
be integrated? Nonetheless, that is the course taken here. We argue that while trying to
understand the processing performed within individual sensory channels, we must
simultaneously ask how their intermediary results and final products are merged into an
integrated perceptual system. We believe that because perceptual systems within a given
species coevolved to interoperate, compatibility pressures existed on their choices of
internal representations and processing mechanisms. In other words, to explain the types
of intersensory influences that have been discovered experimentally, disparate perceptual
mechanisms must have some degree of overall representational and algorithmic
compatibility that makes this influence possible.
Our approach is gestalt, not only from a Gibsonian (1986) perspective, but also because
there are few, if any, known examples of complex unimodal sensory systems evolving in
isolation. Even the relatively simple perceptual mechanisms in paramecium (Stein and
Meredith 1994, Chapter 2) and sponges (Mackie and Singla 1983) have substantial cross-sensory interactions. It seems that perceptual interoperation is a prerequisite for the
development of complex perceptual systems.
Thus, rather than study any single
perceptual system in depth – the traditional approach – we prefer to study them in
breadth, by elucidating and analyzing interactions between different sensory systems.
We believe these interactions make possible the co-evolution that leads to complex
perceptual mechanisms, without which, they would not otherwise arise. This approach is
similar in spirit to the work of (Atkeson et al. 2000, Brooks et al. 1998, Cheng and
Kuniyoshi 2000, Ferrell 1996, and Sandini et al. 1997). Although they are primarily
concerned with motor coordination, there is a common biological inspiration and long-term goal to use knowledge of human and animal neurophysiology to design more
sophisticated artificial systems.
6.4 Our Direct Motivation
The examples for our discussion of multimodal interactions will be drawn from a
previous research project known as the Intelligent Room, as described in (Coen 1998,
1999). The Intelligent Room provided a framework for examining fundamental issues in
multimodal integration – most importantly for this thesis: how can independently
designed perceptual systems be made to work together?
This question proved
surprisingly difficult to answer. To see why, we now examine two simple multimodal
interactions that were implemented in the room.
Much of this section previously
appeared in (Coen 2001).
The Intelligent Room had multiple perceptual user interfaces to connect with some of the
ordinary human-level events going on within it. The goal of the room was to support
people engaged in everyday, traditionally non-computational activity in both work and
leisure contexts. The room contained nine video cameras, three of which it could actively
steer, several microphones, and a large number of computer-interfaced devices.
Its
computational infrastructure (Coen et al. 1999), consisting of over 120 software agents
running on a network of ten workstations, was housed in an adjacent laboratory to
enhance the room’s appeal as a naturally interactive space.
Before exploring how novel intersensory influences might have been incorporated into a
system such as the Intelligent Room, we first examine a traditional and explicit
multimodal interaction, with the resolution of a deictic reference, e.g., use of a word such
as this in referring to a member of some class of objects. Suppose, for example, someone
inside the room said, “Computer, dim this lamp.” The room used its ability to track its
occupants, in conjunction with a map of its own layout, to dim the lamp closest to the
speaker when the verbal command was uttered.
This kind of interaction was
implemented with a simple, post-perceptual integration mechanism that reconciled
location information obtained from the person tracker with the output of a speech
recognition system. Here, multimodal integration of positional and speech information
allowed straightforward disambiguation of the deictic lamp reference.
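A minimal sketch of this reconciliation is given below; the tracker, room map, and parse interfaces are hypothetical stand-ins for the room's actual software agents, intended only to show how little the two pipelines need to know about each other in this post-perceptual style.

def resolve_deictic_lamp(parse, tracker, room_map):
    # Post-perceptual integration for "Computer, dim this lamp": the speech
    # parse and the tracker's position estimate are reconciled only after each
    # pipeline has finished its own interpretation.
    if parse.command != "dim" or parse.object_class != "lamp":
        return None
    x, y, z = tracker.current_position()       # occupant's (x, y, z) head coordinate
    lamps = room_map.devices(kind="lamp")      # known lamp locations in the room map
    nearest = min(lamps, key=lambda lamp: (lamp.x - x) ** 2 + (lamp.y - y) ** 2)
    nearest.dim()
    return nearest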
Given the
simplicity of this example, it seems far from obvious that a more complex integration
mechanism might have been called for. To motivate a more involved treatment, we start
by examining some of the problems with current approaches to perceptual computing.
Despite many recent and significant advances, computer vision and speech
understanding, along with many other perceptual research areas (Picard 1997,
Oviatt 2002), are still infant sciences.
The non-trivial perceptual components of
multimodal systems are therefore never perfect and are subject to a wide variety of failure
modes.
For example, the Intelligent Room sometimes "lost" people while visually
tracking them, due to occlusion, coincidental color matches between foreground and background
objects, unfavorable lighting conditions, etc. Perceptual components can also dictate sets
of environmental conditions that are required for proper operation. For vision systems,
these may include assumptions about brightness levels, object size, image backgrounds,
scene complexity, pose, etc. Although the particular failure modes of computational
perceptual systems vary from system to system, one may assume that under a wide
variety of conditions any one of them may temporarily stop working as desired. How
these systems manifest this undesired operation is itself highly idiosyncratic. Some may
simply provide no information, for example, a speech recognition system confused by a
foreign accent. Far more troublesome are those that continue to operate as if nothing
were amiss but simply provide incorrect data, such as a vision-based tracking system that
mistakes a floor lamp for a person and reports that he is standing remarkably still.
Unimodal systems also suffer from what might be deemed perceptual impoverishment.
They implement single, highly specific perceptual capabilities, such as locating people
within a room, using individual (or far less often, a small number of) recognition
paradigms, such as looking for their faces. They are oblivious to phenomena outside of
their perceptual scope, even if these phenomena are indicative of events or conditions
they are intended to recognize.
That perceptual systems have a variety of failure modes is not confined to their artificial
instantiations. Biological systems also display a wide range of pathological conditions,
many of which are so engrained that they are difficult to notice.
These include
limitations in innate sensory capability, as with visual blind spots on the human retina,
and limited resources while processing sensory input, as with our linguistic difficulty
understanding nested embedded clauses (Miller and Chomsky 1963).
Cross-modal
perceptual mechanisms have evolved to cope both with innate physiological limitations
and specific environmental constraints.
Stein and Meredith (1994) argue for the
evolutionary advantages of overlapping and reinforcing sensory abilities; they reduce
dependence on specific environmental conditions and reliance on unique perceptual
pathways and thereby provide clear survival advantages. The facial vision of the blind
discussed earlier is an extreme example of this kind of reinforcement. Generally, the
influences are more subtle, never bringing themselves to our attention, but demonstrably
a fundamental component of perception nonetheless.
Given their role in biological systems, one might assume cross-modal influences could
provide similar benefit in artificial systems such as an Intelligent Room. How then
should we go about incorporating them? Answering this question is a two-step process.
Because the Intelligent Room was an engineered as opposed to an evolved system, we
would first need to explicitly find potential synergies between its modalities that could
have been exploited.
Ideally, these influences would be learned, an approach the
influence network model enables, but here we examine how this would be done in the
absence of such a model, where the synergies must be identified manually. Once
identified, these synergies must then somehow be incorporated (i.e., reverse engineered)
into the overall system. At the time, this emerged as the primary obstacle to integrating
cross-modal influences into the Intelligent Room and more generally, into other types of
engineered interactive systems.
Reverse-engineering intersensory influences into
systems not designed with them in mind can be convoluted at best and impossible at
worst.
6.4.1 Two Examples of Cross-Modal Influence
To examine these issues in more detail, we start with the following empirical and
complementary observations:
1) People tend to talk about objects they are near (Figure 6.6a).
2) People tend to be near objects they talk about (Figure 6.6b).
These heuristics reflect a relationship between a person’s location and what that person is
referring to when he speaks; knowing something about one of them provides some degree
of information about the other. For example, someone walking up to a video display of a
map is likely to speak about the map, as in Figure 6.6a; here, person location
Figure 6.6a (top) – People talk about objects they are near. Someone approaching a projected display
showing a map, for example, is more likely to make a geographical utterance. Here, location information
can augment speech recognition.
Figure 6.6b (bottom) – People are near objects they talk about. Someone speaking about the contents of a
video display is likely to be located somewhere (delineated by the triangle) from which the display is
viewable. Here, speech information can augment person tracking.
data can inform a speech model. Conversely, someone who speaks about a displayed
map is likely to be in a position to see it, as illustrated in Figure 6.6b; here, speech data can
inform a location model. Of course, it is easy to imagine situations where these heuristics
are wrong. Nonetheless, in our experience they are empirically valid, so how could we
have incorporated influences based on them into a system such as an Intelligent Room?
Mechanistically, we might imagine the person tracking system exchanging information
with the speech recognition system. For example, the tracking system might provide
hints to the speech recognition system to preferentially expect utterances involving
objects the person is near, such as a displayed map. Conversely, we could also imagine
that the speech recognition system would send hints to the person tracking system to be
especially observant looking for someone in indicated sections of the room, based on
what that person is referring to in his speech.
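The sketch below shows what such a hint exchange might look like in Python, assuming both subsystems exposed hooks such as boost_grammar_rules and bias_search_region; as the following paragraphs discuss, off-the-shelf components typically expose nothing of the kind, which is precisely the problem.

def exchange_hints(tracker, recognizer, room_map, proximity_m=1.5):
    # Heuristic 1: bias speech expectations toward objects the person is near.
    x, y, z = tracker.current_position()
    for obj in room_map.objects_within(x, y, radius=proximity_m):
        recognizer.boost_grammar_rules(topic=obj.name, weight=2.0)

    # Heuristic 2: bias the tracker toward regions from which a referred-to
    # display is viewable.
    utterance = recognizer.last_utterance()
    if utterance is not None and utterance.topic in room_map.display_names():
        region = room_map.viewable_region(utterance.topic)
        tracker.bias_search_region(region, weight=2.0)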
This seems reasonable until we try to build a system that works this way. There are both
representational and algorithmic stumbling blocks that make this conceptually
straightforward cross-modal information sharing difficult to implement. These are due
not only to post-perceptual architectural integration, but also to how the perceptual systems are
themselves typically created for use in post-perceptual systems. We first examine issues
of representational compatibility, the interlingua used to represent shared information,
and then address how the systems could incorporate hints they receive in this interlingua
into their algorithmic models.
Consider a person tracking system that provides the coordinates of people within a room
in real-time, relative to some real-world origin; the system outputs the actual locations of
the room’s occupants. We will refer to the tracking system in the Intelligent Room as a
representative example of other such systems (e.g., Wren et al. 1997, Gross et al. 2000).
Its only inputs were stereo video cameras and its sole outputs were sets of (x,y,z) tuples
representing occupants’ centroid head coordinates, which were generated at 20Hz.
Contrast this with the room’s speech recognition system, which was based upon the Java
Speech API (Sun 2001) using IBM’s ViaVoice platform and was typical of similar
hidden Markov model based spoken language systems (Jelinek 1997). Its inputs were
audio voice signals and a formal linguistic model of expected utterances, which were
represented as probabilistically weighted context free grammars. Its outputs were sets of
parse trees representing what was heard, along with the system’s confidence levels that
the spoken utterances were interpreted correctly.
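To make the mismatch concrete, the two output types can be caricatured as follows; the field names in this Python sketch are illustrative, not the actual interfaces used in the room.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrackerOutput:
    # Emitted at 20Hz: centroid head coordinates relative to the room's origin.
    occupants: List[Tuple[float, float, float]]

@dataclass
class SpeechOutput:
    # Emitted per utterance: a parse tree over a weighted context-free grammar,
    # plus the recognizer's confidence in its interpretation.
    parse_tree: object
    confidence: float

# Neither structure contains anything the other subsystem can interpret: the
# tracker has no notion of grammar rules, and the recognizer has no notion of
# spatial coordinates or regions.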
How then should these two systems have exchanged information? It does not seem
plausible from an engineering perspective, whether in natural or artificial systems, to
have provided each modality with access to the internal representations of the other.
Thus, we do not expect that a tracking system would know anything about linguistic
models, nor do we expect that the language system would be skilled in spatial reasoning and
representation. Even were we to suppose the speech recognition system could somehow
represent spatial coordinates, the example in Figure 6b above involves regions of space,
not isolated point coordinates. From an external point of view, it is not obvious how the
tracking system internally represents regions, presuming it even has that capability in the
first place. The complementary example of how the tracking system might refer to
classes of linguistic utterances is similarly convoluted.
Unfortunately, even were this interlingua problem easily solved and the subsystems had a
common language for representing information, the way perceptual subsystems are
generally implemented would make incorporation of cross-modal data difficult or
impossible. For example, in the case of a person tracking system, the real-world body
coordinates are generated via three-dimensional spatial reconstruction based on
correspondences between sets of image coordinates.
The common techniques for
computing the reconstructed coordinates, such as neural networks or fit polynomials, are
in a sense closed – once the appropriate coordinate transform has been learned, there is
generally no way to bias the transformation in favor of particular points or spatial
regions. Thus, there is no way to incorporate the influence, even if the systems had a
common way of encoding it. Here again, the complementary situation with influencing
speech recognition from a tracking system can be similarly intractable. For example, not
all linguistic recognition models support dynamic, preferential weightings for classes of
commonly themed utterances.
So, even if the tracking system could somehow
communicate what the speech recognition system should expect to hear, the speech
recognition system might not be able to do anything useful with this information.
We see that not only are the individual modal representations incompatible, the
perceptual algorithms (i.e., the computations that occur in the sensory pipelines) are
incompatible as well.
This comes as no surprise given that these systems were
engineered primarily for unimodal applications.
Unlike natural perceptual systems
within an individual species, artificial perceptual systems do not co-evolve, and therefore,
have had no evolutionary pressure to force representational and algorithmic
compatibility. These engineered systems are intended to be data sources feeding into
other systems, such as the ones performing multimodal integration, that are intended to
be data sinks. There is no reason to expect that these perceptual subsystems would or
even could directly interoperate.
6.5 Speculation on an Intersensory Model
For disparate perceptual systems to interact, there must be some agreement – implicit or
explicit – as to how they represent and process perceptual data. The goal of this section
is to motivate and briefly outline a model of intersensory perception that formalizes this
notion of “agreement.” The intent of the model is to allow us: (1) to formalize an
understanding of cross-modal phenomena; and (2) to design artificial systems that exploit
intersensory phenomena to increase the scope of their perceptual capabilities. It provides
the theoretical framework for evaluating influence networks and other approaches to
intersensory integration, by describing essential phenomena that any integration scheme
must somehow address.
In this sense, it also provides for the development of a
competency test for the influence network approach described above.
During the past century, research in intersensory processing has proceeded along two
main avenues. The first has been the design of experimental scenarios that reveal what
are usually hidden cross-modal influences at the behavioral, and with adult humans,
sometimes the conscious level. The second has been based on intrusive physiologic
examinations in animals that elucidate the cellular (in primitive organisms) and the
neurological (in encephalized species) substrates of intersensory function. Thus, a wide
range of phenomenological cross-modal interactions has been described and some of the
physiology behind them has been identified. What has not been addressed in a general
framework are the processes enabling intersensory influence and their implications for our
understanding of the individual senses themselves.
This chapter started by examining evidence that our senses are more interdependent than
they seem on the surface. Contrary to the understanding arrived at by introspection,
perceptions are the products of complex interactions between our sensory channels.
What, however, are the contents of these channels? Returning to a question we asked in
Chapter 1, how exactly should we understand what a sense is? The common notion
might be that a sense is the perceptual ability to interpret impressions received via a
single sensory organ. For example, vision is the ability to interpret waves of light
impacting upon the retinas. Viewed this way, each sense is defined by a wide gamut of
sensitivities and abilities tied to a particular biological organ. This view, however, is
misleading. It appears instead that each sense is composed of a multitude of capabilities,
some of which may be absent in particular individuals (e.g., loss of color perception or
depth sensitivity in visual perception) and many of which operate independently. The
most persuasive case studies are drawn from experiments with the blind; even though
they cannot see, their eyes can still respond to visual cues. For example, people with
cerebral cortex injuries rendering them perceptually blind are able to direct their eyes
towards spots of light, even though they cannot consciously see these spots and think
they are simply guessing (Poppel et al. 1981). Their loss of visual perception has not
hindered other unconscious visual processes, which have also been demonstrated in the
normally sighted (Wolfe 1983).
The blind can even experience visual cross-modal phenomena.
For example, a
congenitally blind child has been shown to have convergence of the eyes in response to
approaching sounds (Butterworth 1981), even though this convergence has no practical
benefit. The existence of independent visual processes has led to a more complex view
of visual perception and makes it difficult to view vision as a monolithic capability; it
seems instead to be an assortment of somewhat independent processes that have a single
sensory organ, the eye, in common. For this reason, although the modes (i.e., perceptual
primitives) in multimodal perception are generally understood to correspond to different
gross senses, e.g., vision, audition, proprioception, etc., we have found a finer grained
definition more useful, in which the different perceptual capabilities of each sensory
organ are treated as the individual modes, rather than taking the abstract function of that
organ as a whole. Aside from providing a clearer engineering viewpoint, this perspective
allows us to make explicit what we will call the cross-modal influences between different
perceptual pathways that start from the same sensory organ and would traditionally be
considered to be within a single modality.
Given this minimalist view of unimodal perception, we now outline the three types of
cross-modal influence to be covered by the model. This type of categorization has not
been made in the multisensory literature, but it is well motivated by both
phenomenological and neurological data. Although the influences listed below appear to
cover a variety of different phenomena, we will argue their primary distinction is
temporal and a single mechanism could account for many such influences within an
artificial system. Our categories of cross-modal influence are cueing, mutual influence,
and resolution:
1. Cueing occurs when a sensory system is biased in some way before a perceptual event
due to cross-modal influences. The most studied form of cueing is priming (Meyer and
Schvaneveldt 1971), which is an increase in the speed or accuracy of a perceptual
decision task, based on previous exposure to some of its content.
Although not
necessarily intersensory, many priming experiments are conducted cross-modally.4
Cueing is distinguished from priming because cueing removes any notion of perceptual
correctness – influences that simply bias perceptual interpretation are included – and
because cueing covers inhibitory, not just excitatory, influences. Thus, accommodation
experiments (Kohler 1964, Blake et al. 1981), in which prolonged or repeated exposures
to sensory stimuli lead to modification in baseline perceptual responses, are also
examples of cueing.
Kohler’s work is perhaps the best known example of these
experiments, in which subjects gradually acclimate to the effects of wearing distortive
spectacles, e.g., ones in which the left halves of the lens are blue and right halves are
yellow. Cueing is important for artificial perceptual systems because they rarely have
any notion of how their perceptual expectations should change over time and with
contextual circumstances.

4 A mechanism accounting for priming, spreading activation (Collins and Loftus 1975, Collins and
Quillian 1969), has been very influential in the development of the theory of semantic networks (e.g.,
Fahlman 1979) and in later perceptual theories, such as sequence seeking (Ullman 1996).
2. Mutual influence occurs when a set of modalities interact in a dynamic feedback process
during percept formation.
A wide range of interactions, particularly among the
vestibular, visual, auditory, and proprioceptive systems, has been studied, many of them
investigated in depth because of the unusual effects of gravity on the perception of pilots
and astronauts. For example, during high acceleration takeoffs, pilots can experience the
sensation that their body is tilted backwards and their instrument panel is rising too
quickly (Graybiel 1952). They must learn to ignore these perceptions, because reacting
to them could be life threatening. In influences that have a visual component, it is not
uncommon for the visual input to dominate the joint perceptual interpretation (Warren et
al. 1981), as with the “ventriloquism effect” (Howard and Templeton 1966), where a
ventriloquist’s voice appears to emanate from a dummy’s mouth because its lips are
moving rather than his. However, the dominant modality in mutual influence is generally
the one perceived most strongly (Welch and Warren 1986), and circumstances can
uniquely exaggerate any modality to the point of perceptual dominance.
The McGurk
effect discussed earlier is an unusual example of mutual influence where neither involved
modality is dominant.
Evidence for mutual influence interactions has been found
neurologically; in the thalamus, superior colliculus, and cerebral cortex, neurons respond
to and integrate multisensory information in a time frame of tens of milliseconds
following stimulus presentation.
Therefore, components of multisensory processing
occur long before perceptual and cognitive effects even begin (Meredith 2001). Mutual
influence is important for artificial perceptual systems because it allows unimodal
systems to overcome perceptual impoverishment by supplementing their sensory input
with indicative cross-modal data.
3. Resolution occurs when a potentially ambiguous sensory input is crosschecked among
parallel sensory input streams and potentially changed after percept formation.
Resolution influences are reflexive and reactionary, as in those demonstrated in
nociception (i.e., pain response) in rodents (Stein and Dixon 1978, Auroy et al. 1991).
Resolution influences specifically do not involve ratiocination; they are confined to
perceptual channels and are therefore unconscious and automatic. There is no direct
awareness of the influence itself and it is not subject to willful control. Many post-perceptual intersensory influences do not meet these criteria, including explicit memory
tests (Jacoby 1983) and cross-modal matching, and will not be considered here.
Resolution is important for artificial perceptual systems because it can increase or
decrease confidence in perceptions based on cross-modal agreement or conflict. Modern
interactive systems rarely have any way to self-validate their own operation or to adjust
internal mechanisms (e.g., threshold values) without external supervision. Biological
systems have self-supervised plasticity, which provides a dynamic, adaptive fluidity
unknown in artificial perception.
How might these influences be effected in a perceptual system?
Even though the
perceptual channels are themselves highly specialized, we argue that the mechanisms
behind intersensory influence can be amodal to a far greater extent – there is less of a
need for them to be dedicated to specific perceptual modes. This position at first glance
is also hard to support. Mechanistic specialization is so commonplace in biological
systems that one can make the case that general-purpose, non-specific mechanisms need
justification more so than do ad hoc ones (Gazzaniga 2000, Chomsky 2000). Why then
suppose there is an amodal component to intersensory influence?
This question is somewhat misleading. The brain does, in fact, use amodal codings in
perception, the most well known of these being the spatial representations in the sensory
and motor maps of the superior colliculus, discussed in more detail below. The superior
colliculus is the only known region of the brain where auditory information is represented
spatially, as opposed to tonotopically, and one may argue that the extraordinary
specialization in this case is actually in the neurological mechanisms that determine
spatial localization based on interaural time delays. This elaborate calculation allows
representation of auditory information in the superior colliculus’ common, amodal
coordinate system. It is precisely the mechanistic perceptual specialization here that
allows a general-purpose, amodal system, such as the superior colliculus, to exist. This
can be seen identically with other specialized perceptory capabilities unique to given
species that provide spatial localization, such as echolocation (bats), whisker
displacement (rats and mice), acute audition (owls), infrared detection (rattlesnakes), and
electroreception (fish) (Stein and Meredith, Chapter 6). Perceptions in each of these
animals are represented in common, amodal coordinate systems and support the view that
perceptual specialization in no way precludes the existence of amodal representations.
One may also more tentatively propose strong evolutionary advantages to amodal
intersensory perception. It allows incorporation of newly evolved or modified perceptual
capabilities without requiring the development of specialized mechanisms that enable
their participation in an intersensory perceptual system.
More generally, were the
mechanisms for intersensory influence between each pair of modalities uniquely
specified, there would be O(n²) of them required for a set of n modalities; for five
modalities, for example, that is ten distinct pairwise mechanisms. General-purpose
amodal mechanisms for effecting cross-modal influence would simplify this
integration problem. For this reason, one can take the position that regardless of how
Nature proceeds, from an engineering perspective, the benefit provided by amodally
effected intersensory influence is clear. A general purpose mechanism for incorporating
individual sensory components into integrated, artificial perceptual systems would be
very useful when building them. This raises the question, however, of how Nature
actually proceeds. What is known about the neural substrates of intersensory influence?
Consider the effect of touching someone and having his eyes turn to determine the source
of the stimulus. This is an example of cross-modal influence – the position of the touch
determines the foveation of the eyes – but it is different from the interaction described in
the McGurk effect above. The touch scenario leads to behavioral effects that center the
stimulus with respect to the body and peripheral sensory organs. In contrast, the McGurk
effect is an example of sensory influence solely within perceptual channels and has no
behavioral component. Is this difference important?
Motor influences – i.e., ones that cause attentive and orientation behaviors – are by far
the more understood of the two. The primary neurological substrate behind them is the
superior colliculus (or optic tectum in non-mammalian vertebrates), a small region of the
brain that produces signals that cross-modally orient peripheral sensory organs based on
sensory input. The superior colliculus contains layered, topographic sensory and motor
maps that share a common rostral-caudal vs. medial-lateral coordinate system. The maps
so far elucidated are in register; that is, co-located positions in the real world – in the
sensory case representing derived locations of perceptual inputs and in the motor case
representing peripheral sensory organ motor coordinates that focus on those regions – are
essentially vertically overlapping. The actual mechanisms that use these maps to effect
intersensory motor influence are currently unknown – variants on spreading vertical
activation are suspected – but there is little doubt the maps’ organization is a fundamental
component of that mechanism.
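To make the notion of maps that are in register concrete, the following sketch, which is our own illustration rather than code from this thesis, models each modality's topographic map as a two-dimensional array defined over a single shared rostral-caudal by medial-lateral grid. The grid resolution, the Gaussian form of activation, and the particular weighting used for "spreading vertical activation" are assumptions made purely for the example; the point is only that, because the maps share one coordinate system, activity localized by one modality can bias every other map at the very same indices.

    import numpy as np

    # Toy illustration (not code from this thesis): three topographic maps
    # sharing a single rostral-caudal x medial-lateral grid, i.e., maps "in register".
    GRID = (20, 20)                        # arbitrary resolution, an assumption
    maps = {
        "visual":   np.zeros(GRID),
        "auditory": np.zeros(GRID),
        "motor":    np.zeros(GRID),        # orienting commands for the eyes/head
    }

    def activate(modality, loc, strength=1.0, sigma=1.5):
        """Deposit a Gaussian bump of activity at a shared grid coordinate."""
        r, c = np.indices(GRID)
        maps[modality] += strength * np.exp(
            -((r - loc[0]) ** 2 + (c - loc[1]) ** 2) / (2 * sigma ** 2))

    def spread_vertically(weight=0.5):
        """Crude stand-in for 'spreading vertical activation': each map receives
        a weighted copy of the other maps' activity at the same coordinates,
        which is only meaningful because the maps share one coordinate system."""
        total = sum(maps.values())
        for name in maps:
            maps[name] += weight * (total - maps[name])

    # A sound localized at cell (5, 12) biases the motor map toward the same
    # cell, i.e., toward orienting the peripheral sensory organs at the source.
    activate("auditory", (5, 12))
    spread_vertically()
    peak = np.unravel_index(maps["motor"].argmax(), GRID)
    print(tuple(int(i) for i in peak))     # -> (5, 12)

In this toy, the registered layout is what lets cross-modal orienting fall out of a purely local, coordinate-wise operation.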
Far less is known neurologically about semantic influences – i.e., ones that have internal
effects confined to perceptual channels. The superior colliculus itself has been directly
approachable from a research perspective because the brain has dedicated inner space,
namely, the tissue of the topographic maps, to representing the outer space of the real world.
The representation is both isomorphic and perspicuous and has made the
superior colliculus uniquely amenable to study. The perceptual, as opposed to spatial,
representations of the senses are far more elusive because they are specialized to the
individual modalities and the organs that perceive them. For example, the auditory
system is organized at both the thalamic and cortical levels according to sound frequency
whereas the retina organizes visual input spatiotopically (in retinocentric coordinates) and
maintains this representation into the visual cortex.
Although many of their neuroanatomical interconnections have been elucidated, how
these systems actually share non-spatial perceptual information during intersensory
influence is unknown. The approach in this thesis has been to introduce the slice data
structure in Chapter 3, which was inspired by the role of cortical topographic maps in
equivalently organizing both sensory and motor information in animals.
Slices are
inherently amodal and independent of the features they represent. This, for example,
enables them to freely share information between sensory and motor systems in learning
birdsong, without requiring the development of new formalisms or representations.
By treating slices as state spaces, we can model activations in these maps that correspond
to dynamic sensory and motor processes. This allows us to incorporate the temporal
intersensory influences described above. For example, in our model, cueing corresponds
to pre-activation of a region within one slice through the prior activation of another.
Mutual influence corresponds to two slices cooperatively cross-activating each other
while they are in the midst of perceiving. These temporal interactions also allow us to
use lower dimensional representations for modeling sensory and motor data because they
in essence trade space (or dimensionality) for time. In other words, our data need not be
completely separable if we can find ways of avoiding ambiguity during the
temporal window of percept formation. Thus, we believe that cross-modal influence not
only provides perceptual stability, it makes sensory systems more computationally
efficient in doing so.
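As a concrete reading of the preceding paragraph, the sketch below treats a slice as a one-dimensional activation map over regions and renders cueing as pre-activation of one slice by another before any unimodal evidence arrives, and mutual influence as repeated cross-activation while a percept forms. The class and function names, the identity coupling matrices, and the rate constants are our own illustrative assumptions; the slices and influence networks developed in this thesis are substantially richer than this.

    import numpy as np

    class Slice:
        """Toy stand-in for a slice: a 1-D activation map over regions
        (e.g., learned perceptual categories). Illustrative only."""
        def __init__(self, n_regions):
            self.a = np.zeros(n_regions)

        def sense(self, evidence):
            """Add bottom-up (unimodal) evidence."""
            self.a += np.asarray(evidence, dtype=float)

        def decode(self):
            """Current best interpretation: the most active region."""
            return int(self.a.argmax())

    def cue(target, source, coupling):
        """Cueing: prior activation of `source` pre-activates (biases) regions of
        `target` before the target's own evidence arrives. `coupling` maps source
        regions to target regions (learned cross-modally, in the thesis)."""
        target.a += coupling @ source.a

    def mutual_influence(s1, s2, c12, c21, steps=5, rate=0.2):
        """Mutual influence: two slices cooperatively cross-activate each other
        during percept formation."""
        for _ in range(steps):
            s1.a += rate * (c21 @ s2.a)
            s2.a += rate * (c12 @ s1.a)

    # A nearly ambiguous auditory input is pulled toward region 1 because the
    # visual (lip) slice strongly favors region 1: visual input dominating the
    # joint interpretation, as in the ventriloquism-style examples above.
    audio, lips = Slice(3), Slice(3)
    audio.sense([0.51, 0.49, 0.0])
    lips.sense([0.10, 0.90, 0.0])
    mutual_influence(audio, lips, np.eye(3), np.eye(3))
    print(audio.decode())                  # -> 1

    # Cueing: lip movement seen before any sound arrives biases a fresh
    # auditory slice toward the visually indicated region.
    audio2 = Slice(3)
    cue(audio2, lips, np.eye(3))
    print(audio2.decode())                 # -> 1

Even this crude version exhibits the behavior described above: a nearly ambiguous auditory activation is resolved toward the visually favored region during the temporal window of percept formation, without requiring the two inputs to be separable on their own.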
6.6 Summary
This chapter has examined the connections between the computational framework
presented in this thesis and perception in biological systems. Our goal was to motivate
our approach and to provide a context for evaluating related work in multimodal
integration.
We also examined the representational and algorithmic challenges in
engineering biologically realistic perceptual systems. The individual sensory subsystems
from which such systems are assembled do not co-evolve and may be grossly incompatible
with one another, even when external relationships between their perceptions are readily
apparent.
Finally, we speculated on a number of
theoretical issues in intersensory perception and examined how the work in this thesis
addresses them.
Chapter 7
Conclusions
The primary contribution of this thesis has been to identify how artificial systems can use
cross-modal correspondences to learn without supervision.
Building systems that
develop by interacting with the world around them has been a dream of artificial
intelligence researchers since the days of Turing, and this thesis marks a step towards
that goal. This work has also demonstrated that a biologically-inspired approach
can help answer historically difficult computational problems, for example, how
to cluster non-parametric data corresponding to an unknown number of categories and
how to learn motor control through self-observation.
We have also identified cross-modal cooperation as a central problem in theoretical and
applied artificial intelligence. In doing so, we have introduced three key formalisms:
1. Slices, which represent and share information amodally between different
subsystems. They can be used to model sensory and motor data and have wide
applications for modeling static and dynamic inputs.
2. Cross-modal clustering, which learns categories in slices without supervision, based
on how they co-occur with other slices (a toy illustration follows this list).
3. Influence networks, which spread influence among different slices and provide
perceptual and interpretative stability.
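To make the second of these formalisms concrete, the sketch below, which is our own simplification and not the algorithm implemented in this thesis, groups the regions of one slice by the similarity of their co-occurrence profiles over a second slice's regions. For brevity it simulates the co-occurrence counts, fixes the number of categories at two, and uses off-the-shelf hierarchical clustering from SciPy, whereas the cross-modal clustering developed here is non-parametric and determines the number of categories from the data itself.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)

    # Simulated paired activations: 6 regions in slice A, 2 regions in slice B.
    # A-regions 0-2 mostly co-occur with B-region 0; A-regions 3-5 with B-region 1.
    counts = np.zeros((6, 2))
    for a_region in range(6):
        preferred_b = 0 if a_region < 3 else 1
        for _ in range(200):
            b_region = preferred_b if rng.random() < 0.9 else 1 - preferred_b
            counts[a_region, b_region] += 1

    # Each A-region is described by how it co-occurs with B's regions; regions
    # with similar cross-modal profiles are merged into a single category.
    profiles = counts / counts.sum(axis=1, keepdims=True)
    labels = fcluster(linkage(profiles, method="average"), t=2, criterion="maxclust")
    print(labels.tolist())    # e.g. [1, 1, 1, 2, 2, 2]: two induced categories

The intent is only to show the shape of the computation: categories in one slice are induced from nothing but how its regions co-occur with the regions of another.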
We demonstrated the power of cross-modal cooperation and clustering by learning the
vowel structure of American English, simply by watching and listening to someone
speak. This is the first self-supervised computational approach to this problem of which
we are aware. We further demonstrated that this framework can equivalently be applied
to sensorimotor learning, by acquiring birdsong according to the developmental stages of
a juvenile zebra finch. In this, we have shown that this work can be applied recursively
to learn higher-level knowledge. In the example given in Chapter 5, we recursively
grounded motor control in self-acquired sensory categories, thereby demonstrating
afferent and efferent equivalence in sensorimotor learning, which in itself is a very
surprising result. It further suggests how to carry this work forward towards more
complex, higher-level learning tasks, as we discuss below.
7.1 Future Work
Our immediate plans are to apply the birdsong learning framework in Chapter 5 towards
learning human babbling and the entire phonetic set of American English.
Early
protolinguistic behavior in humans is neither a well-studied nor a well-understood phenomenon.
We intend to create a system that learns to babble at the level of an eight-month-old child,
in a developmentally realistic way.
We believe examining the earliest intellectual
development in people provides a path to understanding fundamental issues in knowledge
acquisition and to building a new generation of intelligent machines. Along these lines,
we plan to extend this work to hierarchical clustering and address issues in concept
formation based on perceptual grounding. We would particularly like to apply the ideas
of Lakoff (1987) for grounding internal mental activity using external, real world
metaphors and observations, which is a basic feature of human cognition.
We would also like to embed this framework in a robotic platform or an interactive
environment, to provide a real-time environment for visual and motor development.
We are also interested in non-perceptual applications of the work presented here. The
mathematical framework in this thesis can be applied to problems with similar structures.
For example, by treating the messages passed between human or software agents as
perceptual events, we can categorize both the messages and the agents passing them.
This has applications both in software engineering and in understanding cells of
interacting people.
It is also applicable to learning how to detect events within
distributed sensor networks. Finally, this work can be used to separate non-ergodic
Markov chains into their ergodic components, a problem that arises in probability
theory and statistical physics.
Chapter 8
References
1.
Agin, G.J. Representation and description of curved objects. Ph.D. Thesis, Stanford
University, Stanford, CA. 1972.
2.
Alcock, J. Animal Behavior: An Evolutionary Approach, 7th edition. Sinauer Associates,
Inc., Sunderland, Massachusetts. 2001.
3.
Aleksandrovsky, B., Whitson, J., Andes, G., Lynch, G. Granger, R. Novel Speech
Processing Mechanism Derived from Auditory Neocortical Circuit Analysis. In
Proceedings of ICSLP’96. 1996.
4.
Amari, S. Topographic organization of nerve fields. Bulletin of Mathematical Biology,
42:339-364. 1980.
5.
Arbib, M.A., Schemas for the temporal organization of behavior. Human Neurobiology,
4, 63-72. 1985
6.
Aristotle. De Anima (c. 350 BCE). Translated by Tancred, H.L. Penguin Classics. London. 1987.
7.
Atkeson C.G., Hale J, Pollick F., Riley M., Kotosaka S., Schaal S., Shibata T., Tevatia
G., Vijayakumar S., Ude A., Kawato M. Using humanoid robots to study human
behavior. IEEE Intelligent Systems, Special Issue on Humanoid Robotics, 46-56. 2000.
8.
Auroy P., Irthum B., Woda A. Oral nociceptive activity in the rat superior colliculus.
Brain Res. 549:275-284. 1991.
9.
Bangalore, S. and Johnston, M. Integrating multimodal language processing with speech
recognition. In ICSLP-2000, vol.2, 126-129. 2000.
10.
Bartlett, M.S. Face Image Analysis by Unsupervised Learning. Kluwer. MA. 2001.
11.
Becker, S. A computational principle for hippocampal learning and neurogenesis.
Hippocampus 15(6):722-738. 2005.
12.
Becker, S. and Hinton, G. E., Spatial coherence as an internal teacher for a neural
network. In Backpropagation: Theory, Architectures, and Applications, Y. Chauvin and
D. Rumelhart (eds),. Lawrence Erlbaum. Hillsdale, N.J. 1995.
13.
Bednar, J.A., Choe, Y., De Paula, J., Miikkulainen, R., Provost, J., and Tversky, T.
Modeling Cortical Maps with Topographica, Neurocomputing. 58—60 pp. 1129-1135.
2004.
14.
Bellman, R. Adaptive Control Processes. Princeton University Press, 1961.
15.
Bender, R. E. The Conquest of Deafness (3rd Ed.). Danville, Il: Interstate Publishers.
1981.
16.
Binford, T. O., Visual perception by computer. In the Proceedings of the IEEE Systems
Science and Cybernetics Conference. Miami, FL. IEEE, New York. 1971.
17.
Bishop, C.M., Neural Networks for Pattern Recognition. Oxford University Press, 1995.
18.
Blake, R., Sobel, K. V., and Gilroy, L. A., Visual Motion Retards Alternations between
conflicting Perceptual Interpretations. Neuron, (39):1–20, August, 2003.
19.
Bloedel, JR, Ebner, TJ, and Wise, SP (eds.), Acquisition of Motor Behavior in
Vertebrates, MIT Press: Cambridge, MA 1996.
20.
Blum, A., and Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training.
In Proceedings of the 11th Annual Conference on Computational
Learning Theory, pages 92--100, 1998.
21.
Bobick, A.; Intille, S.; Davis, J.; Baird, F.; Pinhanez, C.; Campbell, L.; Ivanov, Y.;
Schütte, A.; and Wilson, A. The KidsRoom: A Perceptually-Based Interactive and
Immersive Story Environment. M.I.T. Media Laboratory Perceptual Computing Section
398. 1996.
22.
Boersma, P., and Weenink, D., PRAAT, a system for doing phonetics by computer.
(Version 4.3.14) [Computer program]. Glot International 5(9/10): 341-345. http://www.praat.org/. May 26, 2005.
23.
Bolt, R. A., Put-That-There: Voice and gesture at the graphics interface. Computer
Graphics. Vol. 14, No. 3. pp. 262 – 270. July, 1980.
24.
Brainard, M.S. and Doupe, A.J. Auditory feedback in learning and maintenance of vocal
behaviour. Nat Rev Neurosci, 2000.
25.
Brenowitz, E.A., Margoliash, D., Nordeen, K.W. (Eds.) Special issue on birdsong.
Journal of Neurobiology. Volume 33, Issue 5, pp 495-709. 1997.
26.
Brooke N. M., & Scott S. D. PCA image coding schemes and visual speech
intelligibility. Proc. Institute of Acoustics, 16(5), 123–129. 1994.
27.
Brooks, R., A Robust Layered Control System For A Mobile Robot. IEEE Journal of
Robotics and Automation. 2:14-23. March, 1986.
28.
Brooks, R. Intelligence without Representation. Artificial Intelligence. 47:139–159. 1991a.
29.
Brooks, R. Intelligence without Reason. In Proceedings of the 12th International Joint
Conference on Artificial Intelligence (IJCAI-91). Sydney, Australia. 1991b.
30.
Brooks, R.A., C. Breazeal (Ferrell), R. Irie, C. Kemp, M. Marjanovic, B. Scassellati and
M. Williamson, Alternate Essences of Intelligence. In Proceedings of The Fifteenth
National Conference on Artificial Intelligence. (AAAI98). Madison, Wisconsin. 1998.
31.
Bruner, J.S. and Postman, L. On the Perception of Incongruity: A Paradigm. In Journal
of Personality, 18, 206-223. 1949.
32.
Bushara, K.O., Hanakawa, T., Immisch, I., Toma, K., Kansaku, K., and Hallett, M.,
Neural correlates of cross-modal binding. Nature Neuroscience 6, 190-195.
1 Feb
2003.
33.
Butterworth, G. The origins of auditory-visual perception and visual proprioception in
human development. In Intersensory Perception and Sensory Integration, R.D. Walk
and L.H. Pick, Jr. (Eds.) New York. Plenum. 1981.
34.
Cadez, I. and Smyth, P. Probabilistic Clustering using Hierarchical Models. 1999.
35.
Calvert, A.G., Spence, C., and Stein, B.E. The Handbook of Multisensory Processes.
Bradford Books. 2004
36.
Calvert, G.A., Bullmore, E., Brammer, M.J., Campbell, R., Iversen, S.D., Woodruff, P.,
McGuire, P., Williams, S., and David, A.S., Silent lipreading activates the auditory
cortex. Science, 276, 593-596. 1997.
37.
Calvert, GA., Campbell R., and Brammer, MJ., Evidence from functional magnetic
resonance imaging of crossmodal binding in the human heteromodal cortex. Curr. Biol., 2000.
38.
Camarillo D.B., Krummel T.M., Salisbury J.K., Robotic technology in surgery: past,
present, and future. Am J Surg. 188:2S-15S. 2004.
39.
Cheng, G., and Kuniyoshi, Y. Complex Continuous Meaningful Humanoid Interaction:
A Multi Sensory-Cue Based Approach Proc. of IEEE International Conference on
Robotics and Automation (ICRA 2000), pp.2235-2242, San Francisco, USA, April 24-28,
2000.
40.
Cherry, E. C., Some experiments on the recognition of speech with one and two ears. J.
Acoust. Soc. Am. 25, 975–979. 17, 227–246. 1953.
41.
Chomsky, N. Language and the Brain. Quandaries and Prospects. Hans-Lukas Teuber
Lecture. Massachusetts Institute of Technology. October 13, 2000.
42.
Chomsky, N., and Halle, M. The sound pattern of English. New York: Harper and Row.
1968.
43.
Citti, G. and Sarti, A. A cortical based model of perceptual completion in the roto-translation space. In Proceedings of the Workshop on Second Order Subelliptic Equations
and Applications. Cortona. 2003.
44.
Clark B, and Graybiel A., Factors Contributing to the Delay in the Perception of the
Oculogravic Illusion. Am J Psychol 79:377-88. 1966.
45.
Clarkson, P. and Moreno, P.J. On the use of support vector machines for phonetic
classification. In Proceedings IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP '99). 1999.
46.
Clemins, P.J. and Johnson, M.T. Application Of Speech Recognition To African
Elephant (Loxodonta Africana) Vocalizations. In Proceedings of the International
Conference on Acoustics, Speech, and Signal Processing. Vol. I p. 487. 2003.
47.
Coen, M.H. Self-supervised acquisition of vowels in American English. To appear in
Proceedings of the Twenty-First National Conference on Artificial Intelligence. (AAAI-06) Boston, MA. 2006.
48.
Coen, M.H. Cross-modal clustering. In Proceedings of the Twentieth National
Conference on Artificial Intelligence. (AAAI-05) Pittsburgh, Pennsylvania. 2005.
49.
Coen, M.H. Multimodal interaction: a biological view. In Proceedings of 17th
International Joint Conference on Artificial Intelligence. (IJCAI-01) Seattle, Washington.
2001.
50.
Coen, M.H., and Wilson, K. Learning Spatial Event Models from Multiple-Camera
Perspectives in an Intelligent Room. In Proceedings of MANSE’99. Dublin, Ireland.
1999.
51.
Coen, M.H, Phillips, B., Warshawsky, N., Weisman, L., Peters, S., Gajos, K., and Finin,
P. Meeting the computational needs of intelligent environments: The Metaglue System.
In Proceedings of MANSE’99. Dublin, Ireland. 1999.
52.
Coen, M.H., The Future of Human-Computer Interaction, or How I Learned to Stop
Worrying and Love My Intelligent Room, IEEE Intelligent Systems, vol. 14, no. 2,
Mar./Apr., pp. 8—19. 1999.
53.
Coen, M.H. Design Principles for Intelligent Environments. In Proceedings of The
Fifteenth National Conference on Artificial Intelligence. (AAAI-98). Madison,
Wisconsin. 1998.
54.
Cohen, P. R., Johnston, M., McGee, D., Smith, I. Oviatt, S., Pittman, J., Chen, L., and
Clow, J. QuickSet: Multimodal interaction for simulation set-up and control. 1997,
Proceedings of the Applied Natural Language Conference, Association for
Computational Linguistics. 1997.
55.
Collins, A. M. & Loftus, E. F., A spreading activation theory of semantic processing.
Psychological Review, 82, 407-428. 1975.
56.
Collins, A. M. & Quillian, M. R., Retrieval time from semantic memory. Journal of
Verbal Learning and Verbal Behavior, 8, 240-247. 1969.
57.
Cotzin, M. and Dallenbach, K., Facial Vision: the role of pitch and loudness in the
perception of obstacles by the blind. The American Journal of Psychology, 63: 485-515.
1950.
58.
Cytowic, R. E., Synesthesia: A Union of Senses. New York. Springer-Verlag. 1989.
59.
Dale AM, Fischl B, and Sereno MI., Cortical surface-based analysis. I. Segmentation and
surface reconstruction. Neuroimage. Feb; 9(2):179-94. 1999.
60.
Dantzig, G.B. Application of the simplex method to a transportation problem. In Activity
Analysis of Production and Allocation, Koopmans, T.C. (Ed.) pp. 339--347. Wiley, New
York, 1951.
61.
Darrell, T., Gordon, G., Harville, M., and Woodfill, J., Integrated person tracking using
stereo, color, and pattern detection, Proceedings of the Conference on Computer Vision
and Pattern Recognition (CVPR '98), pp. 601-609, Santa Barbara, June, 1998.
62.
Dautenhahn, K. and Nehaniv, C. L. (eds.), Imitation in Animals and Artifacts. MIT
Press: London, 2002.
63.
de Marcken, C. Unsupervised Language Acquisition. Ph.D. Thesis. Massachusetts
Institute of Technology. Cambridge, MA. 1996.
64.
de Sa, V.R. Unsupervised Classification Learning from Cross-Modal Environmental
Structure. Doctoral Dissertation, Department of Computer Science, University of
Rochester. 1994.
65.
de Sa, V.R., & Ballard, D. Category Learning through Multi-Modality Sensing. In Neural
Computation 10(5). 1998.
66.
Dempster, A., Laird, N. and Rubin, D., Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
67.
Dennett, D.C. Consciousness Explained, Little, Brown and Company. Boston, MA.
1991.
68.
Dennett, D. C., and Haugeland, J. Intentionality. The Oxford Companion to the Mind.
Gregory, R.L. (Ed.)., Oxford University Press. 1987.
69.
Dey, A.K., Salber, D. and Abowd, G.D. A Conceptual Framework and a Toolkit for
Supporting the Rapid Prototyping of Context-Aware Applications. Human-Computer
Interaction (HCI) Journal, Volume 16 (2-4), pp. 97-166. 2001.
70.
Dobbins, A. C., Jeo, R., and Allman, J., Absence of spike frequency adaptation during
binocular rivalry [Abstract]. Society for Neuroscience Abstracts, 21. 1995.
71.
Donoho, D.L., and Grimes, C., Hessian eigenmaps: Locally linear embedding techniques
for high-dimensional data. PNAS; vol. 100; no. 10;5591-5596. May 13, 2003.
72.
Elbert, T., Pantev, C., Wienbruch, C., Rockstroh, B., and Taub, E. Increased Cortical
Representation of the Fingers of the Left Hand in String Players, Science: 270: 305-307.
1995.
73.
Ernst, M.O., and Banks, M.S., Humans integrate visual and haptic information in a
statistically optimal fashion. Nature. 415, 429-433; doi: 10.1038/415429a. 24 January
2002.
74.
Fahlman, S.E. NETL: A System for Representing and Using Real-World Knowledge,
MIT Press, Cambridge MA. 1979.
75.
Ferrell, C. Orientation behavior using registered topographic maps. In Proceedings of the
Fourth International Conference on Simulation of Adaptive Behavior (SAB-96). Society
of Adaptive Behavior. 1996.
76.
Finney, E.M., Fine, I., and Dobkins, K.R., Visual stimuli activate auditory cortex in the
deaf. Nature Neuroscience 4, 1171-1173, 2001.
77.
Fischl B., Sereno, M.I. and Dale, A.M., Cortical Surface-Based Analysis. II: Inflation,
Flattening, and a Surface-Based Coordinate System. Neuroimage. 9(2):195-207. 1999.
78.
Fisher, J. and Darrell, T., Signal-Level Audio Video Fusion Using Information Theory.
Proceedings of Workshop on Perceptive User Interfaces. 2001.
79.
Fitzgibbon, A., Pilu, M., and Fisher, R.B., "Direct Least Square Fitting of Ellipses," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 476-480,
May, 1999.
80.
Fitzpatrick, R. and McCloskey, D.I., Proprioceptive, visual and vestibular thresholds for
the perception of sway during standing in humans. The Journal of Physiology, Vol 478,
Issue 1 pp. 173-186, 1994.
81.
Flynn, A., Redundant sensors for mobile robot navigation, M.S. Thesis, Department of
Electrical Engineering and Computer Science, M.I.T., Cambridge, MA, July 1985.
82.
Frank, A. On Kuhn's Hungarian Method - a tribute from Hungary. Naval Research &
Logistics. 52. 2005.
83.
Frisby, J. P., and Davies, I. R. L., Is the Haptic Müller–Lyer a Visual Phenomenon?
Nature. 231, 463-465. 18 Jun 1971.
84.
Galef, B. G., Imitation in animals: History, definition, and interpretation of data from the
psychological laboratory. In: Social learning: Psychological and biological perspectives,
eds. T. R. Zentall & B. G. Galef, Jr. Erlbaum. 1988.
85.
Garnicia, O. K., Some prosodic and paralinguistic features of speech to young children.
In C. E. Snow and C. A. Ferguson (Eds.) Talking to children. Cambridge University
Press. 1977.
86.
Gazzaniga, M. (Ed.) The New Cognitive Neurosciences. 2nd edition. MIT Press.
Cambridge, MA. 2000.
87.
Gerstner, W and Kistler, W.M., Spiking Neuron Models. Single Neurons, Populations,
Plasticity Cambridge University Press. 2002.
88.
Gibbon, J. Scalar expectancy theory and Weber's law in animal timing. Psychol. Review
84(3):279-325. 1977.
89.
Gibbs, A.L and Su, F.E. On choosing and bounding probability metrics. International
Statistical Review, vol. 70, number 3, 419-435. 2002
90.
Gibson, J.J. The Perception of the Visual World. Boston, Houghton Mifflin. 1950.
91.
Gibson, J.J. The Ecological Approach to Visual Perception. Lawrence Erlbaum
Associates. Hillsdale, N.J. 1987.
92.
Gold, B. and Morgan, N. Speech and Audio Signal Processing. Wiley Press, New York.
2000.
93.
Graybiel A. Oculogravic Illusion. Archives of Ophthalmology (Chicago, IL) 48:605-15.
1952.
94.
Gross, R., Yang, J., and Waibel, A. Face Recognition in a meeting room. Fourth IEEE
International Conference on Automatic Face and Gesture Recognition, Grenoble, France,
March 2000
95.
Grunfeld, E.A., Okada, T, Jauregui-Renaud, K., and Bronstein, A.M., The effect of
habituation and plane of rotation on vestibular perceptual responses. J Vestib Res. 10(4-5):193-200. 2000.
96.
Haesler, S., Wada, K., Nshdejan, A., Morrisey, E.E., Lints, T., Jarvis, E.D., and Scharff,
C. FoxP2 Expression in Avian Vocal Learners and Non-Learners. J. Neurosci. 24: 3164-3175. 2004.
97.
Hall, A.C. and Moschovakis, A. eds., The superior colliculus: new approaches for
studying sensorimotor integration. (Boca Raton: CRC Press) 2004.
98.
Hamill, N.J., McGinn, M.D. and Horowitz, J.M., Characteristics of auditory brainstem
responses in ground squirrels. Journal of Comparative Physiology B: Volume 159,
Number 2. Pages: 159 – 165. March 1989.
99.
Hebb, D.O. 1949. The Organization of Behaviour. John Wiley & Sons, New York.
100.
Heil, P. and Neubauer ,H., A unifying basis of auditory thresholds based on temporal
summation. PNAS, Vol. 100, no. 10. May 13, 2003.
101.
Held, R. Shifts in binaural localization after prolonged exposures to atypical
combinations of stimuli. Am. J. Psychol. 68:526-266. 1955.
102.
Helmholtz, H. v. Handbook of Physiological Optics. 1856. as reprinted. in James P.C.
Southall. (Ed.) 2000.
103.
Hennecke M. E., Prasad K. V., & Stork D. G. Using deformable templates to infer visual
speech dynamics. In Proc. 28th Annual Asilomar Conf. on Signals, Systems and
Computers. 1994.
104.
Hinton, G.E. and Sejnowski, T.J. Learning and relearning in Boltzman machines. In
Parallel Distributed Processing, Rumelhart, D.E. and McClelland, J.L. (eds), Volume
1:Foundations. MIT Press, Cambridge, MA. 1986.
105.
Holbrook, A., and Fairbanks, G. Diphthong Formants and their Movements. J. Speech
Hear. Res. 5, 38-58. 1962
106.
Honkela, T. and Hyvärinen, A. Linguistic Feature Extraction using Independent
Component Analysis. In Proceedings of IJCNN 2004, Intern. Joint Conf. on Neural
Networks, pp. 279-284. Budapest, Hungary, 25-29 July 2004.
107.
Horn, B.K.P. Shape from Shading. MIT Artificial Intelligence Laboratory Technical
Report. AITR 232. November 1970.
108.
Hoshino, O. and Kuroiwa, K., Echo sound detection in the inferior colliculus for human
echolocation. Neurocomputing: An International Journal Special Issue, 38-40, 1289-1296. 2001.
109.
Howard, I. P. & Templeton, W. B. Human Spatial Orientation, Wiley, New York. 1966.
110.
Hsu, W.H., and Ray, S.R., Construction of recurrent mixture models for time series
classification. International Joint Conference on Neural Networks (IJCNN'99), 1999.
111.
Huang, X, Acero, A., and Hon, H.W. Spoken Language Processing. Prentice Hall, New
Jersey. 2001.
112.
Hubel, D. H. and Wiesel, T. N., J. Physiol., London. 160, 106−154. 1962.
113.
Hughes, J. W. The threshold of audition for short periods of stimulation, Proc. R. Soc.
London Ser. B 133, 486-490. 1946.
114.
Intille, S. S., Larson, K., and Tapia, E. M., Designing and evaluating technology for
independent aging in the home, in Proceedings of the International Conference on Aging,
Disability and Independence. 2003.
115.
Kleinberg, J. An Impossibility Theorem for Clustering. Advances in Neural Information
Processing Systems (NIPS) Whistler, British Columbia, Canada. 2002.
116.
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., and Hinton, G.E. Adaptive mixtures of local
experts. Neural Computation, 3, 79-87. 1991.
117.
Jacoby, L.L. Remembering the data: analyzing interactive processes in reading. Journal
of Verbal learning and Verbal Behaviour, 22, 485-508. 1983.
118.
James, W. 1890. Principles of Psychology. Vol. 2. Dover. New York. 1955.
119.
Jelinek, F. Statistical Methods for Speech Recognition. MIT Press. Cambridge, MA.
1997.
120.
Johnson-Laird, P.N. Mental models: Towards a cognitive science of language, inference,
and consciousness. Cambridge, MA: Harvard University Press. 1983.
121.
Kaas J.H., and Hackett T.A. Subdivisions of auditory cortex and processing streams in
primates. In PNAS. Oct 24;97(22):11793-9. 2000.
122.
Kaas, J. The Reorganization of Sensory and Motor Maps after Injury in Adult Mammals,
in M. Gazzaniga (ed.), The New Cognitive Neurosciences, 2nd ed. (Cambridge: MIT
Press: 223-236). 2000.
123.
Kardar, M. and Zee, A. Information optimization in coupled audio-visual cortical maps.
In PNAS 99: 15894-15897. 2002
124.
Kautau, A. Classification of the Peterson & Barney vowels using Weka. Technical
Report. UCSD. 2002.
125.
Keele S.W. and Summers, J. J., The Structure of Motor Programs. In Motor Control –
Issues and Trends. Stelmach, G. E., (Ed.) Academic Press. San Diego. 1976.
126.
Klee, V., and Minty, G. J., How good is the simplex algorithm?, in Inequalities, III.
Shisha, O. (Ed.), Academic Press, New York, pp. 159-175. 1972.
127.
Kogan, J.A. and Margoliash, D. Automated recognition of bird song elements from
continuous recordings using dynamic time warping and hidden Markov models: A
comparative study. Journal of the Acoustic Society of America. Vol. 103, No. 4, April
1998.
128.
Kohler, I. The formation and transformation of the perceptual world. Psychological
Issues 3(4):1-173. 1964.
129.
Kohonen, T. Self-Organization and Associative Memory. Springer-Verlag, Berlin. 1984.
130.
Kuhl, P.K. Human adults and human infants show a ‘perceptual magnet effect’ for the
prototypes of speech categories, monkeys do not. Perception & Psychophysics, 50, pp.
93-107. 1991.
131.
Kuhn, H. The Hungarian method for the assignment problem. Naval Res. Logist. Q.
2:83—97. 1955.
132.
Kumar, V. Towards Trainable Man-machine Interfaces: Combining Top-down
Constraints with Bottom-up Learning in Facial Analysis. Ph.D. Thesis. Department of
Brain and Cognitive Sciences. Massachusetts Institute of Technology. Cambridge, MA.
2002.
133.
Lakoff, G. Women, Fire, and Dangerous Things: What Categories Reveal About the
Mind. University of Chicago Press, 1987.
134.
Leopold, D. A., Wilke, M., Maier, A. and Logothetis, N.K. Stable perception of visually
ambiguous patterns. Nature. Volume 5, Number 6, pp 605 – 609. June 2002.
135.
Levina, E., and Bickel P.J., The Earth Mover's Distance is the Mallows Distance: Some
Insights from Statistics. Proceedings of ICCV, Vancouver, Canada, pp. 251-256. 2001.
136.
Lewkowicz, D.J. and Lickliter, R. (eds.) The Development of Intersensory Perception.
Lawrence Erlbaum Associations. Hillsdale, N.J. 1994.
137.
Lippmann, R.P. Speech recognition by machines and humans. Speech Communication
22, 1-15. 1987.
138.
Lloyd, S. P., `Least squares quantization in PCM,' IEEE Trans. Inform. Theory, vol. 28,
pp. 129-137, 1982.
139.
López, M.E., Barea, R., Bergasa, L.M. and Escudero, M.S., Visually Augmented
POMDP for Indoor Robot Navigation (Spain) From Proceedings of Applied Informatics.
2003.
140.
MacKay, D.H.J. Information Theory, Inference, and Learning Algorithms. Cambridge
University Press. 2003.
141.
Mackey, M.C., and Glass, L. Oscillation and Chaos in Physiological Control Systems.
Science, 197:287-289. 1977.
142.
Mackie, G.O. and Singla, C.L. Studies on Hexactinellid Sponges. I. Histology of
Rhabdocalyptus dawsoni (Lambe, 1873). Philosophical Transactions of the Royal
Society of London. Series B, Biological Sciences, Vol. 301, No. 1107. (Jul. 5, 1983), pp.
365-400. 1983.
143.
Marler, P. Three models of song learning: Evidence from behavior. Journal of
Neurobiology. Volume 33, Issue 5, Pages 501-516. 1997.
144.
Marr, D. Vision. WH Freeman, San Francisco, 1982.
145.
Mase, K., and Pentland, A. Automatic Lipreading by Computer. Trans. IEEE., vol. J73D-II, No. 6, pp. 796-803, June 1990.
146.
Massaro, D.W. The fuzzy logical model of speech perception: A framework for research
and theory. In Y. Tohkura, E. Vatikiotis-Bateson, and Y. Sagisaka (Eds.), Speech
Perception, Production and Linguistic Structure (pp.79-82). Ohmsha Ltd, Tokyo. 1992.
147.
Massaro, D.W. and Cohen, M.M. Fuzzy logical model bimodal emotion perception:
Comment on The perception of emotions by ear and by eye by de Gelder and Vroomen
Cognition and Emotion, 14(3), 313-320. 2000.
148.
Massaro, D.W., Cohen, M.M., Gesi, A. and Heredia, R. Bimodal Speech Perception: An
Examination across Languages. Journal of Phonetics, 21, 445-478. 1993.
149.
Massie, T. M. and Salisbury, J. K.. The PHANToM Haptic Interface: A Device for
Probing Virtual Objects. In ASME Haptic Interfaces for Virtual Environment and
Teleoperator Systems 1994, Dynamic Systems and Control 1994, volume 1, pages 295-301, Nov. 1994.
150.
Massone, L., and Bizzi, E., On the Role of Input Representations in Sensorimotor
Mapping, Proc. IJCNN, Washington DC, 1:173-176. 1990.
151.
Mataric, M.J., Studying the Role of Embodiment in Cognition. Cybernetics and Systems
28(6):457-470. 1997.
152.
Maunsell, J.H.R, Nealy, T.A., Sclar, G., and DePriest, D.D. Representation of
extraretinal information in monkey visual cortex. In Neural mechanisms of visual
perception. Proceedings of the Second Retina Research Foundation Symposium, 14-15
April 1989. D.M Lam and C.D. Gilbert (eds). Woodlands, Texas. Portfolio Pub. Co.
1989.
153.
McCallum, A., and Nigam, K. Employing EM in Pool-Based Active Learning for Text
Classification. In Proceedings of ICML-98, 15th International Conference on Machine
Learning. 1998.
154.
McGraw, M. B., The Neuromuscular Maturation of the Human Infant. New York:
Institute of Child Development. 1939.
155.
McGurk, H., and MacDonald, J. Hearing lips and seeing voices. Nature. 264:746-748.
1976.
156.
Mellinger, D.K. and Clark. C.W. Bioacoustic transient detection by image convolution.
Journal of the Acoustical Society of America. -- Volume 93, Issue 4, p. 2358. April 1993.
157.
Meltzoff, A.N. and Moore, M.K. Imitation of facial and manual gestures by human
neonates. Science 198:75-78. 1977.
158.
Meltzoff, A. N., and Prinz, W., The imitative mind: Development, evolution, and brain
bases. Cambridge, England: Cambridge University Press. 2002.
159.
Metcalfe, J.S., Chang, T.Y., Chen, L.C., McDowell, K., Jeka, J.J., and Clark, J. E.
Development of somatosensory-motor integration: An event-related analysis of infant
posture in the first year of independent walking. Developmental Psychobiology 46(1):19-35. 2005.
160.
Meyer, D. E., and Schvaneveldt, R. (1971). Facilitation in recognizing pairs of words:
evidence of a dependence between retrieval operations. Journal of Experimental
Psychology 90:227--235.
161.
Miller, G. and Chomsky, N. Finitary models of language users. In Luce, R.; Bush, R. and
Galanter, E. (eds.) Handbook of Mathematical Psychology, Vol 2. New York: Wiley.
419-93. 1963.
162.
Miller, D. B. The acoustic basis of mate recognition by female zebra finches
(Taeniopygia guttata). Anim. Behav. 27, 376−380. 1979.
163.
Minnen, D., Starner, T., Ward, J.A. ,Lukowicz, P., and Troester, G., Recognizing and
Discovering Human Actions from On-Body Sensor Data ICME 2005, Amsterdam, NL,
July 6-8, 2005.
164.
Minsky, M., A Framework for Representing Knowledge. Reprinted in The Psychology of
Computer Vision, Winston, P.H. (Ed.), McGraw Hill, 1975.
165.
Mitchell, T., Machine Learning. McGraw Hill, 1997.
166.
Moody, J. and Darken, C. J. Fast learning in networks of locally tuned processing units.
Neural Computation, 1:281-294. 1989.
167.
Moran, T.P., and Dourish, R., Introduction to This Special Issue on Context-Aware
Computing. Human-Computer Interaction, Volume 16, 2001.
168.
Müller, C. M. and Leppelsack, H. J., Feature extraction and tonotopic organization in the
avian auditory forebrain. Experimental Brain Research (Historical Archive), Volume 59,
Issue 3, pp. 587 – 599, Aug, 1985.
169.
Mumford, S. Laws in Nature. London, England: Routledge. 2004.
170.
Naeaetaenen, R., Tervaniemi, M., and E Sussman, P., Primitive intelligence in the
auditory cortex. Trends Neurosci. 2001.
171.
Nakayama, K. and Silverman, G. H., Serial and parallel processing of visual feature
conjunctions. Nature 320, 264–265. 1986.
172.
Nefian, A., Liang, L., Pi, X., Xiaoxiang, L., Mao, C. and Murphy, K. ICASSP '02
(IEEE Int'l Conf on Acoustics, Speech and Signal Proc.), 2:2013--2016. 2002
173.
Newell A. and Rosenbloom P.S., Mechanisms of skill acquisition and the law of practice.
In: Cognitive skills and their acquisition (Anderson JR, Ed.), pp. 1–51. Hillsdale, NJ:
Erlbaum. 1981.
174.
Newell, A., & Simon, H.A. Computer science as empirical inquiry: Symbols and search.
Communications of the Association for Computing Machinery, 19(3), 113-126. 1976.
175.
Nilson, N. (Ed.), Shakey the Robot. SRI A.I. Center Technical Note 323. April, 1984.
176.
Nordeen KW, Nordeen EJ. Auditory feedback is necessary for the maintenance of
stereotyped song in adult zebra finches. Behav Neural Biol. Jan;57(1):58-66. 1992.
177.
Ohnishi T, Matsuda H, Asada T, Aruga M, Hirakata M, Nishikawa M, Katoh A,
Imabayashi E. Functional anatomy of musical perception in musicians. Cereb Cortex.
11(8):754-60. 2001.
178.
Ölveczky B.P., Andalman A.S., Fee M.S. Vocal Experimentation in the Juvenile
Songbird Requires a Basal Ganglia Circuit. PLoS Biol 3(5): e153. 2005
179.
Oviatt, S. Multimodal interfaces, The human-computer interaction handbook:
fundamentals, evolving technologies and emerging applications, Lawrence Erlbaum
Associates, Inc., Mahwah, NJ. 2002
180.
Pearson, P. and Kingdom, F.A.A., On the interference of task-irrelevant hue variation on
texture segmentation. Perception, 30, 559-569. 2001.
181.
Peterson, G.E. and Barney, H.L. Control methods used in a study of the vowels.
J.Acoust.Soc.Am. 24, 175-184. 1952.
182.
Piaget, J. Construction of reality in the child. London: Routledge & Kegan Paul, 1954.
183.
Piaget, J. Genetic Epistemology. W. W. Norton & Co. Scranton, Pennsylvania. 1971.
184.
Picard, R. Affective Computing. MIT Press. Cambridge, MA. 1997
185.
Pierce, W.D., and Cheney, C.D. Behavior Analysis and Learning. 3rd edition. Lawrence
Erlbaum Associates. NJ. 2003.
186.
Poppel, E. A hierarchical model of temporal perception. Trends in Cognitive Sciences,
Volume 1, Number 2, pp. 56-61(6). May 1997.
187.
Poppel, E., Held, R., and Frost. D. Residual visual function after brain wounds involving
the central visual pathways in man. Nature, 243, 295-296. 1973.
188.
Potamianos, G., Neti, C., Luettin, J., and Matthews, I. Audio-Visual Automatic Speech
Recognition: An Overview. In: Issues in Visual and Audio-Visual Speech Processing, G.
Bailly, E. Vatikiotis-Bateson, and P. Perrier (Eds.), MIT Press (In Press), 2004.
189.
Ratnanather, J. T., Barta, P. E., Honeycutt, N. A., Lee, N. G., Morris, H. M., Dziorny, A.
C., Hurdal, M. K., Pearlson, G. D., and Miller, M. I. Dynamic programming generation
of boundaries of local coordinatized submanifolds in the neocortex: application to the
Planum Temporale, NeuroImage, vol. 20, pp. 359-377. 2003.
190.
Remagnino, P., Foresti, G., and Ellis, T.J. (Eds.), Ambient Intelligence a Novel
Approach. Springer-Verlag New York Inc. 2004.
191.
Richards, W. (Ed.) Natural Computation. Cambridge, MA. The MIT Press. 1988.
192.
Rodriguez, A., Whitson, J., Granger, R. Derivation and analysis of basic computational
operations of thalamocortical circuits. Journal of Cognitive Neuroscience, 16: 856-877.
2004.
193.
Rubin, P., Baer, T. and Mermelstein, P., An articulatory synthesizer for perceptual
research. Journal of the Acoustical Society of America, 70, 321-328. 1981.
194.
Rubner, Y., Tomasi, C., and Guibas, L. J., A Metric for Distributions with Applications
to Image Databases. Proceedings of the 1998 IEEE International Conference on
Computer Vision, Bombay, India, ,pp. 59-66. January 1998.
195.
Russell, B. Theory of Knowledge: The 1913 Manuscript, London, Boston, Sydney:
George Allen and Unwin, 1984.
196.
Ryle, G. The Concept of Mind. Chicago: The University of Chicago Press. Chicago, IL.
1949.
197.
Saar, S. Sound analysis in Matlab. http://ofer.sci.ccny.cuny.edu/html/sam.html. 2005.
198.
Sams, M., Aulanko, R., Hamalainen, M., Hari, R., Lounasmaa, O., Lu, S., and Simola, J.
Seeing speech: Visual information from lip movements modifies activity in the human
auditory cortex. Neurosci. Lett. 127:141-145. 1991.
199.
Sandini G., Metta G. and Konczak J. Human Sensori-motor Development and Artificial
System. In Proceedings of AIR&IHAS '97, Japan. 1997.
200.
Schmidt, R. A., Motor control and learning, 2nd edition. Human Kinetics Publishers.
1988.
201.
Shimojo, S., and Shams, L. Sensory modalities are not separate modalities: plasticity and
interactions. Current Opinion in Neurobiology. 11:505-509. 2001.
202.
Slaney, M. A Critique of Pure Audition. In Proceedings of Computational Auditory
Scene Analysis Workshop. International Joint Conference on Artificial Intelligence,
Montreal, Canada. 1995.
203.
Slater, P. J. B., Eales, L. A., & Clayton, N. S. Song learning in zebra finches: Progress
and prospects. Advances in the Study of Behavior, 18, 1–34. 1988.
204.
Smolensky, P. Information processing in dynamical systems: Foundations of harmony
theory. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1:
Foundations. Cambridge, MA: MIT Press/Bradford Books. 194-281. 1986.
205.
Starner, T., Wearable Computing and Contextual Awareness. Ph.D. Thesis.
Massachusetts Institute of Technology. Cambridge, MA. 1999.
206.
Stauffer, C. Perceptual Data Mining. Ph.D. Thesis, Massachusetts Institute of
Technology. Cambridge, MA. 2002.
207.
Stein, B.E. and Dixon, J.E. Superior colliculus neurons respond to noxious stimuli. Brain
Research. 158:65-73. 1978.
208.
Stein, B.E., and Meredith, M. A. 1994. The Merging of the Senses. Cambridge, MA.
MIT Press.
209.
Stein, R.B. The frequency of nerve action potentials generated by applied currents. Proc.
R. Soc. Lond B. Biol. Sci., 1967.
210.
Stevens, S.S., On the psychophysical law. Psychol. Review 64(3):153-181, 1957.
211.
Still, S., and Bialek, W. How many clusters? An information theoretic perspective,
Neural Computation. 16:2483-2506. 2004.
212.
Stork, D.G., and Hennecke, M. Speechreading: An overview of image processing,
feature extraction, sensory integration and pattern recognition techniques", Proc. of the
Second Int. Conf. on Auto. Face and Gesture Recog. Killington, VT pp. xvi--xxvi. 1996.
213.
Sumby, W.H., and Pollack, I. Visual contribution to speech intelligibility in noise. J.
Acoust. Soc. Am. 26:212-215. 1954.
214.
Summerfield, Q. Some preliminaries to a comprehensive account of audio-visual speech
perception, in Dodd, B. and Campbell, R., editors, Hearing by Eye: The psychology of
lip-reading. Lawrence Erlbaum Associates, Hillsdale NJ. pgs 3-52. 1987.
215.
Sun. http://www.javasoft.com/products/java-media/speech/. 2001.
216.
Sung, Michael and Pentland, Alex (Sandy)., Minimally-Invasive Physiological Sensing
for Human-Aware Interfaces. HCI International. 2005.
217.
Sussman E., Winkler I., Ritter W., Alho K., and Naatanen, R., Temporal integration of
auditory stimulus deviance as reflected by the mismatch negativity. Neuroscience
Letters, Volume 264, Number 1, 2, pp. 161-164(4). April 1999.
218.
Sussman, G. Abelson, H. and Sussman, J. Structure and Interpretation of Computer
Programs. MIT Press. Cambridge, MA. 1983.
219.
Takeuchi, A. and Amari S-I., Formation of Topographic Maps and Columnar
Microstructures in Nerve Fields. 1999.
220.
Taskar, B., Segal, E., and Koller, D. Probabilistic Clustering in Relational Data, In
Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence
(IJCAI’01), Seattle, Washington, August 2001.
221.
Tchernichovski O, Nottebohm F, Ho C, Pesaran B, Mitra P. A procedure for an
automated measurement of song similarity. Anim Behav 59: 1167-1176. 2000.
222.
Tchernichovski, O. Private communication. 2005.
223.
Tenenbaum, J. B., de Silva, V. and Langford, J. C., A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science. 290 (5500): 2319-2323, 22 December
2000.
224.
Thelen, E., and Smith, L. A Dynamic Systems Approach to the Development of
Cognition and Action. Cambridge, MIT Press. 1998.
225.
Thompson, DW. On Growth and Form. New York: Dover Publications. 1917 revised
1942.
226.
Thorndike, E. L., Animal intelligence: An experimental study of the associative process
in animals. Psychological Review Monograph 2(8):551–53. 1898.
227.
Thrun, S., Burgard, W., and Fox, D., Probabilistic Robotics. MIT Press. Cambridge,
MA. 2005.
228.
Tinbergen, N., The Study of Instinct, Oxford University Press, New York, NY. 1951.
229.
Treisman, A., Preattentive processing in vision. Computer Vision, Graphics and Image
Processing 31, 156–177. 1985.
230.
Trussell LO. Physiological mechanisms for coding timing in auditory neurons.
Ann. Rev. Physiol. 61:477-496. 1999.
231.
Ullman, Shimon.
High-level vision: object recognition and visual cognition.
Cambridge. MIT Press. 1996.
232.
van der Meer A.L., van der Weel F.R. and Lee D.N., The functional significance of arm
movements in neonates. Science. 267(5198):693-5. 1995.
233.
Vignal, C., Mathevon, N. & Mottin, S. Audience drives male songbird response to
partner's voice. Nature 430, 448−451. 2004.
234.
Von Neumann, J. The Computer and the Brain, Silliman Lectures Series, Yale Univ.
Press, New Haven, CT. 1958.
235.
Waibel, A., Vo, M.T., Duchnowski, P., and Manke, S. Multimodal Interfaces. Artificial
Intelligence Review. 10:3-4. p299-319. 1996.
236.
Wang, F., Ma, Y.F., Zhang, H.J., Li, J.T., A Generic Framework for Semantic Sports
Video Analysis Using Dynamic Bayesian Networks, pp. 115-122, 11th International
Multimedia Modelling Conference (MMM'05), 2005.
237.
Wang, L., Walker, V.E., Sardi, H., Fraser, C. and Jacob, T.J.C., The correlation between
physiological and psychological responses to odour stimulation in human subjects.
Clinical Neurophysiology 113, 542-551. 2002.
238.
Warren, D.H., Welch, R.B. and McCarthy, T.J. The role of auditory-visual
'compellingness' in the ventriloquism effect: Implications for transitivity amongst the
spatial senses. Perception and Psychophysics, 30(6), pp 557-564. 1981.
239.
Watanabe, S. Pattern Recognition. Human and Mechanical. John Wiley, New York. 1985
240.
Watson, A.B., Temporal sensitivity in Handbook of perception and human performance.
K. Boff, L. Kaufman and J. Thomas, (Eds.) Volume 1 (A87-33501 14-53). New York,
Wiley-Interscience, p. 6-1 to 6-43. 1986.
241.
Watts, D., and Strogatz, S. Collective dynamics of 'small-world' networks. Nature
393:440-442. 1998.
242.
Webb, D. M., and Zhang, J. FoxP2 in Song-Learning Birds and Vocal-Learning
Mammals. Journal of Heredity. 96: 212-216. 2005.
243.
Wiener, N. Cybernetics: On Control and Communication in the Animal and the Machine.
John Wiley, 1948.
244.
Weiser, Mark. The Computer for the Twenty-First Century. Scientific American.
265(3):94—104, September 1991.
245.
Weiser, Mark. The world is not a desktop. Interactions. pp. 7—8. January, 1994.
246.
Welch, R. B., and Warren, D. H. 1986. "Intersensory Interactions." In Handbook of
Perception and Human Performance, edited by K. R. Boff, L. Kaufman, and J. P.
Thomas, chap. 25. New York: Wiley.
247.
Wertheimer, M. Laws of Organization in Perceptual Forms. First published as
Untersuchungen zur Lehre von der Gestalt II, in Psychologische Forschung, 4, 301-350.
Translation published in Ellis, W. A source book of Gestalt psychology (pp. 71-88).
London: Routledge & Kegan Paul. 1938
248.
Wiener, N. Cybernetics: or the Control and Communication in the Animal and the
Machine, Cambridge: MIT Press. 1948.
249.
Wiggs, C.L., and Martin, A., Properties and mechanisms of perceptual priming. Current
Opinion in Neurobiology Cognitive Neuroscience, (8):227-233, 1998.
250.
Williams, H., and Staples, K. Syllable chunking in zebra finch (Taeniopygia guttata)
song. J Comp Psychol. 106(3):278-86. 1992.
251.
Winston, P.H. Learning Structural Descriptions from Examples. MIT Artificial
Intelligence Laboratory Technical Report. AITR-231. September 1970.
252.
Witten, I, and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann. 2005.
253.
Wolfe, J.M. Hidden visual processes. Scientific American, 248(2), 94-103. 1983.
254.
Wolfe, J.M., & Cave, K.R. The psychophysical evidence for a binding problem in human
vision. Neuron, 24, 11-17. 2000.
255.
Wren, C., Azarbayejani, A., Darrell, T., and Pentland, P., "Pfinder: Real-Time Tracking
of the Human Body ", IEEE Transactions on Pattern Analysis and Machine Intelligence,
July 1997.
256.
Wu, Lizhong, Oviatt, Sharon L., Cohen, Philip R., Multimodal Integration -- A Statistical
View, IEEE Transactions on Multimedia, Vol. 1, No. 4, December 1999, pp. 334-341.
257.
Young, S.J. and Woodland, P.C. Hidden Markov Model Toolkit, Entropic Research
Laboratory, Washington, DC, 1993.
258.
Zann R. The Zebra Finch: A Synthesis of Field and Laboratory Studies. Oxford: Oxford
University Press, 1996
259.
Ziegler, P. & Marler, P. (Eds.) Special issue: Neurobiology of Birdsong, Annals of the
New York Academy of Science. 1016: 348–363. 2004.
Appendix 1
Glossary
Amodal – Not relating to any particular sense directly. For example, intensity is amodal, because
different senses can have their inputs described as being intense, whereas purple is not an
amodal attribute because it applies only to visual inputs.
Artificial perceptual system – A computational system that perceives real-world phenomena.
For example, speech recognition and computer vision systems are examples of artificial
perceptual systems.
Cross-Modal – Any phenomenon that involves information being shared between seemingly
disparate perceptual channels. In this thesis, these channels may fall within the same
gross modality, such as vision, as long as different modes are involved.
Intersensory – Used interchangeably with cross-modal.
Modality – A sense or perceptory capability, such as touch, vision, etc. Generally refers to an
entire class of such capabilities, such as the vision modality, which is comprised of a
range of individual visual capabilities, such as color sensitivity, peripheral vision, motion
detection, etc., all of which share a common sensory organ.
Mode – A highly specific sense or perceptory capability. Unlike modality, modes refer to the
specific capabilities of particular senses, e.g., the sweetness mode of taste or the color
sensitivity mode of vision, etc. This definition is elaborated upon in §1.5.
Multimodal – A perceptory phenomenon involving multiple modes simultaneously. For
example, hand clapping is multimodal, because it can be both seen and heard. For that
matter, most real-world events are multimodal, because they generate multiple types of
energy simultaneously to which different sensory channels are receptive. Also used to
describe artificial perceptual systems that are sensitive to multiple modalities; these are
typically called multimodal systems.
Sense – In this thesis, a sense is used equivalently with a mode, as defined above. In other words,
a sense refers to a highly specific perceptual capability.
System – A computer system unless context indicates biological.
Unimodal – A perceptory phenomenon or quality involving only a single modality. For example,
speech recognition systems are generally considered unimodal, because they only
perceive spoken language inputs. The designation can be confusing, however, because
many unimodal systems have traditional graphical user interfaces (GUI) that provide
inputs outside of their perceptual channel. They are nonetheless still deemed unimodal,
because GUI inputs are non-perceptual.