arXiv:1311.4158v5 [cs.CV] 11 Mar 2014
CBMM Memo No. 001
March 12, 2014
Unsupervised learning of invariant
representations with low sample complexity: the
magic of sensory cortex or a new framework for
machine learning?
by
Fabio Anselmi, Joel Z. Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti and
Tomaso Poggio.
Abstract: The present phase of Machine Learning is characterized by supervised learning algorithms relying on large
sets of labeled examples (n → ∞). The next phase is likely to focus on algorithms capable of learning from very few labeled
examples (n → 1), like humans seem able to do. We propose an approach to this problem and describe the underlying
theory, based on the unsupervised, automatic learning of a “good” representation for supervised learning, characterized
by small sample complexity (n). We consider the case of visual object recognition though the theory applies to other
domains. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to
translations, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an
invariant and unique (discriminative) signature can be computed for each image patch, I, in terms of empirical distributions
of the dot-products between I and a set of templates stored during unsupervised learning. A module performing filtering
and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such estimates. Hierarchical
architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability
while capturing the compositional organization of the visual world in terms of wholes and parts. The theory extends existing
deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational
goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant
to transformations, stable, and discriminative for recognition—and that this representation may be continuously learned in
an unsupervised way during development and visual experience.
This work was supported by the Center for Brains, Minds and Machines
(CBMM), funded by NSF STC award CCF - 1231216.
http://cbmm.mit.edu
Unsupervised learning of invariant representations
with low sample complexity: the magic of sensory
cortex or a new framework for machine learning?
Fabio Anselmi∗†, Joel Z. Leibo∗, Lorenzo Rosasco∗†, Jim Mutch∗, Andrea Tacchetti∗†, and Tomaso Poggio∗†
∗Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, and †Istituto Italiano di Tecnologia, Genova, 16163
Invariance | Hierarchy | Convolutional networks | Visual cortex
It is known that Hubel and Wiesel’s original proposal
[1] for visual area V1—of a module consisting of complex
cells (C-units) combining the outputs of sets of simple cells
(S-units) with identical orientation preferences but differing retinal positions—can be used to construct translation-invariant detectors. This is the insight underlying many networks for visual recognition, including HMAX [2] and convolutional neural nets [3, 4]. We show here how the original
idea can be expanded into a comprehensive theory of visual recognition relevant for computer vision and possibly
for visual cortex. The first step in the theory is the conjecture that a representation of images and image patches,
with a feature vector that is invariant to a broad range of
transformations—such as translation, scale, expression of a
face, pose of a body, and viewpoint—makes it possible to
recognize objects from only a few labeled examples, as humans do. The second step is proving that hierarchical architectures of Hubel-Wiesel ('HW') modules (indicated by ∧ in Fig. 1) can provide such invariant representations while maintaining discriminative information about the original image. Each ∧-module provides a feature vector, which we call a signature, for the part of the visual field that is inside its "receptive field"; the signature is invariant to (R²) affine transformations within the receptive field. The hierarchical architecture, since it computes a set of signatures for different parts of the image, is proven to be invariant to the rather general family of locally affine transformations (which includes globally affine transformations of the whole image).
[Fig. 1 diagram: HW-module layers ℓ = 1, …, 4 above the image at level 0.]
Fig. 1: A hierarchical architecture built from HW-modules.
Each red circle represents the signature vector computed by
the associated module (the outputs of complex cells) and
double arrows represent its receptive fields – the part of the
(neural) image visible to the module (for translations this is
also the pooling range). The “image” is at level 0, at the
bottom. The vector computed at the top of the hierarchy
consists of invariant features for the whole image and is usually fed as input to a supervised learning machine such as
a classifier; in addition signatures from modules at intermediate layers may also be inputs to classifiers for objects and
parts.
The basic HW-module is at the core of the properties of
the architecture. This paper focuses first on its characterization and then outlines the rest of the theory, including
its connections with machine learning, machine vision and
neuroscience. Most of the theorems are in the supplementary information, where in the interest of telling a complete
story we quote some results which are described more fully
elsewhere [5, 6, 7].
Invariant representations and sample complexity
One could argue that the most important aspect of intelligence is the ability to learn. How do present supervised learning algorithms compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few labeled examples. A child, or a monkey, can learn a recognition task from just a few examples. The main motivation of this paper is the conjecture that the key to reducing the sample complexity of object recognition is invariance to transformations. Images of the same object usually differ from each other because of simple transformations such as translation, scale (distance) or more complex deformations such as viewpoint (rotation in depth) or change in pose (of a body) or expression (of a face).

¹ Notes on versions and dates: The current paper evolved from one that first appeared online in Nature Precedings on July 20, 2011 (npre.2011.6117.1). It follows a CSAIL technical report of December 30, 2012 (MIT-CSAIL-TR-2012-035) and a CBCL paper, Massachusetts Institute of Technology, Cambridge, MA, April 1, 2013, titled "Magic Materials: a theory of deep hierarchical architectures for learning sensory representations" ([5]). Shorter papers describing isolated aspects of the theory have also appeared: [6, 7].
The conjecture is supported by previous theoretical work
showing that almost all the complexity in recognition tasks
is often due to the viewpoint and illumination nuisances
that swamp the intrinsic characteristics of the object [8].
It implies that in many cases, recognition—i.e., both identification, e.g., of a specific car relative to other cars—as
well as categorization, e.g., distinguishing between cars and
airplanes—would be much easier (only a small number of
training examples would be needed to achieve a given level
of performance, i.e. n → 1), if the images of objects were
rectified with respect to all transformations, or equivalently,
if the image representation itself were invariant. In SI Appendix, section 0, we provide a proof of the conjecture for the special case of translation (and for obvious generalizations of it).

[Fig. 2 plot: panel A, classifier accuracy (%) vs. number of examples per class (1–20); panel B, rectified example images; panel C, unrectified example images.]
Fig. 2: Sample complexity for the task of categorizing cars
vs airplanes from their raw pixel representations (no preprocessing). A. Performance of a nearest-neighbor classifier
(distance metric = 1 - correlation) as a function of the number of examples per class used for training. Each test used
74 randomly chosen images to evaluate the classifier. Error
bars represent +/- 1 standard deviation computed over 100
training/testing splits using different images out of the full
set of 440 objects × number of transformation conditions.
Solid line: The rectified task. Classifier performance for the
case where all training and test images are rectified with respect to all transformations; example images shown in B.
Dashed line: The unrectified task. Classifier performance
for the case where variation in position, scale, direction of
illumination, and rotation around any axis (including rotation in depth) is allowed; example images shown in C. The
images were created using 3D models from the Digimation
model bank and rendered with Blender.
The case of identification is obvious since the difficulty
in recognizing exactly the same object, e.g., an individual
face, is only due to transformations. In the case of categorization, consider the suggestive evidence from the classification task in Fig. 2. The figure shows that if an oracle
factors out all transformations in images of many different
cars and airplanes, providing “rectified” images with respect
to viewpoint, illumination, position and scale, the problem
of categorizing cars vs airplanes becomes easy: it can be
done accurately with very few labeled examples. In this
case, good performance was obtained from a single training
image of each class, using a simple classifier. In other words,
the sample complexity of the problem seems to be very low.
We propose that the ventral stream in visual cortex tries
to approximate such an oracle, providing a quasi-invariant
signature for images and image patches.
Invariance and uniqueness
Consider the problem of recognizing an image, or an image
patch, independently of whether it has been transformed by
the action of a group like the affine group in R2 . We would
like to associate to each object/image I a signature, i.e., a
vector which is unique and invariant with respect to a group
of transformations, G. (Note that our analysis, as we will see
later, is not restricted to the case of groups.) In the following,
we will consider groups that are compact and, for simplicity,
finite (of cardinality |G|). We indicate, with slight abuse of
notation, a generic group element and its (unitary) representation with the same symbol $g$, and its action on an image as $gI(x) = I(g^{-1}x)$ (e.g., a translation, $g_\xi I(x) = I(x - \xi)$). A natural mathematical object to consider is the orbit $O_I$—the set of images $gI$ generated from a single image $I$ under the action of the group. We say that two images are equivalent when they belong to the same orbit: $I \sim I'$ if $\exists g \in G$ such that $I' = gI$. This equivalence relation formalizes the idea that an orbit is invariant and unique. Indeed, if two orbits have a point in common they are identical everywhere. Conversely, two orbits are different if none of the images in one orbit coincides with any image in the other [9].
How can two orbits be characterized and compared?
There are several possible approaches. A distance between
orbits can be defined in terms of a metric on images, but
its computation is not obvious (especially by neurons). We
follow here a different strategy: intuitively two empirical orbits are the same irrespective of the ordering of their points.
This suggests that we consider the probability distribution
PI induced by the group’s action on images I (gI can be
seen as a realization of a random variable). It is possible to
prove (see Theorem 2 in SI Appendix section 2) that if two
orbits coincide then their associated distributions under the
group $G$ are identical, that is

$$I \sim I' \iff O_I = O_{I'} \iff P_I = P_{I'}. \qquad [1]$$

The distribution $P_I$ is thus invariant and discriminative, but
it also inhabits a high-dimensional space and is therefore difficult to estimate. In particular, it is unclear how neurons
or neuron-like elements could estimate it.
As argued later, neurons can effectively implement (high-dimensional) inner products, $\langle\cdot,\cdot\rangle$, between inputs and stored "templates" which are neural images. It turns out that classical results (such as the Cramer-Wold theorem [10]; see Theorems 3 and 4 in section 2 of SI Appendix) ensure that a probability distribution $P_I$ can be almost uniquely characterized by $K$ one-dimensional probability distributions $P_{\langle I,t^k\rangle}$ induced by the (one-dimensional) results of projections $\langle I, t^k\rangle$, where $t^k$, $k = 1, \dots, K$, are a set of randomly chosen images called templates. A probability function in $d$ variables (the image dimensionality) induces a unique set of 1-D projections which is discriminative; empirically, a small number of projections is usually sufficient to discriminate among a finite number of different probability distributions. Theorem 4 in SI Appendix section 2 says (informally) that an approximately invariant and unique signature of an image $I$ can be obtained from the estimates of $K$ 1-D probability distributions $P_{\langle I,t^k\rangle}$ for $k = 1, \dots, K$. The number $K$ of projections needed to discriminate $n$ orbits, induced by $n$ images, up to precision $\varepsilon$ (and with confidence $1 - \delta^2$) is $K \geq \frac{2}{c\varepsilon^2}\log\frac{n}{\delta}$, where $c$ is a universal constant.
Thus the discriminability question can be answered positively (up to $\varepsilon$) in terms of empirical estimates of the one-dimensional distributions $P_{\langle I,t^k\rangle}$ of projections of the image onto a finite number of templates $t^k$, $k = 1, \dots, K$, under the action of the group.
Memory-based learning of invariance
Notice that the estimation of $P_{\langle I,t^k\rangle}$ requires the observation of the image and "all" its transforms $gI$. Ideally, however, we would like to compute an invariant signature for a new object seen only once (e.g., we can recognize a new face at different distances after just one observation, i.e. $n \to 1$). It is remarkable and almost magical that this is also made possible by the projection step. The key is the observation that $\langle gI, t^k\rangle = \langle I, g^{-1}t^k\rangle$. The same one-dimensional distribution is obtained from the projections of the image and all its transformations onto a fixed template, as from the projections of the image onto all the transformations of the same template. Indeed, the distributions of the variables $\langle I, g^{-1}t^k\rangle$ and $\langle gI, t^k\rangle$ are the same. Thus it is possible for the system to store for each template $t^k$ all its transformations $gt^k$ for all $g \in G$ and later obtain an invariant signature for new images without any explicit knowledge of the transformations $g$ or of the group to which they belong. Implicit knowledge of the transformations, in the form of the stored templates, allows the system to be automatically invariant to those transformations for new inputs (see eq. [8] in SI Appendix).
Estimates of the one-dimensional probability density functions (PDFs) $P_{\langle I,t^k\rangle}$ can be written in terms of histograms as $\mu_n^k(I) = \frac{1}{|G|}\sum_{i=1}^{|G|}\eta_n(\langle I, g_i t^k\rangle)$, where $\eta_n$, $n = 1, \dots, N$, is a set of nonlinear functions (see remark 1 in SI Appendix section 1 or Theorem 6 in section 2, but also [11]). A visual system need not recover the actual probabilities from the empirical estimate in order to compute a unique signature. The set of $\mu_n^k(I)$ values is sufficient, since it identifies the associated orbit (see box 1 in SI Appendix). Crucially, mechanisms capable of computing invariant representations under affine transformations for future objects can be learned and maintained in an unsupervised, automatic way by storing and updating sets of transformed templates which are unrelated to those future objects.
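The following is a minimal numerical sketch of this scheme (an illustration added here, not the authors' implementation): 1-D cyclic translation stands in for the group, templates are random "neural images" stored with all their shifts, and the signature is the empirical histogram of the normalized dot products. All names and parameters are hypothetical.

# Minimal sketch of a memory-based HW-module: 1-D cyclic translations
# stand in for the group G; templates t^k are random stored "neural images".
import numpy as np

rng = np.random.default_rng(0)
d, K, n_bins = 64, 8, 10                  # image size, templates, histogram bins

def normalize(v):
    v = v - v.mean()                      # zero mean, unit norm (see SI remarks)
    return v / (np.linalg.norm(v) + 1e-12)

templates = [normalize(rng.standard_normal(d)) for _ in range(K)]
# Unsupervised storage: every transformation g_i t^k of each template.
stored = [np.stack([np.roll(t, s) for s in range(d)]) for t in templates]

def signature(image):
    I = normalize(image)
    sig = []
    for transformed in stored:
        proj = transformed @ I            # <I, g_i t^k> for all i
        hist, _ = np.histogram(proj, bins=n_bins, range=(-1, 1))
        sig.append(hist / d)              # empirical 1-D distribution
    return np.concatenate(sig)

I = rng.standard_normal(d)
print(np.allclose(signature(I), signature(np.roll(I, 5))))   # True: invariant

Because the projections of a shifted image onto all shifts of a template are a permutation of the original set of projections, the histogram, and hence the signature, is exactly invariant for this compact (cyclic) group.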
A theory of pooling
The arguments above make a few predictions. They require an effective normalization of the elements of the inner product (e.g. $\langle I, g_i t^k\rangle \mapsto \frac{\langle I, g_i t^k\rangle}{\|I\|\,\|g_i t^k\|}$) for the property $\langle gI, t^k\rangle = \langle I, g^{-1}t^k\rangle$ to be valid (see remark 8 of SI Appendix section 1 for the affine transformations case). Notice that invariant signatures can be computed in several ways from one-dimensional probability distributions. Instead of the $\mu_n^k(I)$ components directly representing the empirical distribution, the moments $m_n^k(I) = \frac{1}{|G|}\sum_{i=1}^{|G|}(\langle I, g_i t^k\rangle)^n$ of the same distribution can be used [12] (this corresponds to the choice $\eta_n(\cdot) \equiv (\cdot)^n$). Under weak conditions, the set of all moments uniquely characterizes the one-dimensional distribution $P_{\langle I,t^k\rangle}$ (and thus $P_I$). $n = 1$ corresponds to pooling via sum/average (and is the only pooling function that does not require a nonlinearity); $n = 2$ corresponds to "energy models" of complex cells and $n = \infty$ is related to max-pooling. In our simulations, just one of these moments usually seems to provide sufficient selectivity to a hierarchical architecture (see SI Appendix section 6). Other nonlinearities are also possible [5]. The arguments of this section
begin to provide a theoretical understanding of “pooling”,
giving insight into the search for the “best” choice in any
particular setting—something which is normally done empirically [13]. According to this theory, these different pooling
functions are all invariant, each one capturing part of the
full information contained in the PDFs.
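As a small added illustration (the notation and numbers are hypothetical), the different pooling nonlinearities can be applied to the same set of projections; each pooled value is invariant under the full group and retains a different aspect of the 1-D distribution:

# Illustration: pooling functions eta_n applied to the same
# projections <I, g_i t^k>; each pooled value is invariant under the group.
import numpy as np

proj = np.random.default_rng(1).uniform(-1, 1, 64)  # stand-in for <I, g_i t^k>
mean_pool   = proj.mean()                 # n = 1: sum/average pooling
energy_pool = (proj ** 2).mean()          # n = 2: "energy model" of complex cells
max_pool    = np.abs(proj).max()          # n -> infinity: related to max-pooling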
Implementations
The theory has strong empirical support from several specific
implementations which have been shown to perform well on
a number of databases of natural images. The main support
is provided by HMAX, an architecture in which pooling is
done with a max operation and invariance, to translation
and scale, is mostly hardwired (instead of learned). Its performance on a variety of tasks is discussed in SI Appendix
section 6. Good performance is also achieved by other very
similar architectures [14]. This class of existing models inspired the present theory, and may now be seen as special
cases of it. Using the principles of invariant recognition the theory makes explicit, we have now begun to develop models that incorporate invariance to more complex transformations which cannot be solved by the architecture of the network, but must be learned from examples of objects undergoing transformations. These include non-affine and even non-group transformations, allowed by the hierarchical extension of the theory (see below). Performance for one such model is shown in Fig. 3 (see caption for details).
Fig. 3: Performance of a recent model [7] (inspired
by the present theory) on Labeled Faces in the Wild, a
same/different person task for faces seen in different poses
and in the presence of clutter. A layer which builds invariance to translation, scaling, and limited in-plane rotation is
followed by another which pools over variability induced by
other transformations.
Extensions of the Theory

Invariance Implies Localization and Sparsity. The core of the theory applies without qualification to compact groups such as rotations of the image in the image plane. Translation and scaling are however only locally compact and, in any case, each of the modules of Fig. 1 observes only a part of the transformation's full range. Each ∧-module has a finite
pooling range, corresponding to a finite “window” over the
orbit associated with an image. Exact invariance for each
module, in the case of translations or scaling transformations, is equivalent to a condition of localization/sparsity of
the dot product between image and template (see Theorem
6 and Fig. 5 in section 2 of SI Appendix). In the simple case
of a group parameterized by one parameter r the condition
is (for simplicity $I$ and $t$ have support centered at zero):

$$\langle I, g_r t^k\rangle = 0, \quad |r| > a. \qquad [2]$$
Since this condition is a form of sparsity of the generic image $I$ w.r.t. a dictionary of templates $t^k$ (under a group), this result provides a computational justification for sparse encoding in sensory cortex [15].
It turns out that localization yields the following surprising result (Theorems 7 and 8 in SI Appendix): optimal
invariance for translation and scale implies Gabor functions
as templates. Since a frame of Gabor wavelets follows from
natural requirements of completeness, this may also provide
a general motivation for the Scattering Transform approach
of Mallat based on wavelets [16].
The same Equation 2, if relaxed to hold approximately, that is $\langle I_C, g_r t^k\rangle \approx 0$ for $|r| > a$, becomes a sparsity condition for the class of $I_C$ w.r.t. the dictionary $t^k$ under the group $G$ when restricted to a subclass $I_C$ of similar images.
This property (see SI Appendix, end of section 2), which is an extension of the compressive sensing notion of "incoherence", requires that $I$ and $t^k$ have a representation with sharply peaked correlation and autocorrelation. When the condition is satisfied, the basic HW-module equipped with such templates can provide approximate invariance to non-group transformations such as rotations in depth of a face
or its changes of expression (see Proposition 9, section 2, SI
Appendix). In summary, Equation 2 can be satisfied in two
different regimes. The first one, exact and valid for generic
I, yields optimal Gabor templates. The second regime, approximate and valid for specific subclasses of I, yields highly
tuned templates, specific for the subclass. Note that this
argument suggests generic, Gabor-like templates in the first
layers of the hierarchy and highly specific templates at higher
levels. (Note also that incoherence improves with increasing
dimensionality.)
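A numerical check of the localization argument can be added here (our construction, in the spirit of Theorem 6 of the SI; signals, supports, and window sizes are hypothetical): when $\langle I, g_r t\rangle$ vanishes for $|r| > a$ and the pooling window is $[-b, b]$ with $b > a$, the pooled value is unchanged by image shifts smaller than $b - a$:

# Numerical check of the localization condition, eq. [2]: I and t are
# localized so that <g_r I, t> = 0 for |r| > a; pooling over [-b, b] with
# b > a is then invariant to image shifts smaller than b - a.
import numpy as np

d = 128
I = np.zeros(d); I[60:68] = 1.0           # localized image patch
t = np.zeros(d); t[60:68] = 1.0           # localized template (a = 7 here)

def pooled(image, b=30):
    # pool |<g_r image, t>| over the translations r in the window [-b, b]
    return sum(abs(np.roll(image, r) @ t) for r in range(-b, b + 1))

print(np.isclose(pooled(I), pooled(np.roll(I, 10))))   # True: 10 < b - a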
Hierarchical architectures. We have focused so far on the basic HW-module. Architectures consisting of such modules
can be single-layer as well as multi-layer (hierarchical) (see
Fig. 1). In our theory, the key property of hierarchical architectures of repeated HW-modules—allowing the recursive
use of modules in multiple layers—is the property of covariance. By a covariant response at layer ` we mean that the
distribution of the values of each projection is the same if we
consider the image or the template transformations, i.e. (see
Property 1 and Proposition 10 in section 3, SI Appendix),
$$\mathrm{distr}(\langle\mu_\ell(gI), \mu_\ell(t^k)\rangle) = \mathrm{distr}(\langle\mu_\ell(I), \mu_\ell(gt^k)\rangle), \quad \forall k.$$
One-layer networks can achieve invariance to global transformations of the whole image while providing a unique global
signature which is stable with respect to small perturbations
of the image (see Theorem 5 in section 2 of SI Appendix and
[5]). The two main reasons for a hierarchical architecture
such as Fig. 1 are (a) the need to compute an invariant
representation not only for the whole image but especially
for all parts of it, which may contain objects and object
parts, and (b) invariance to global transformations that are
not affine, but are locally affine, that is, affine within the
pooling range of some of the modules in the hierarchy. Of
course, one could imagine local and global one-layer architectures used in the same visual system without a hierarchical
configuration, but there are further reasons favoring hierarchies including compositionality and reusability of parts. In
addition to the issues of sample complexity and connectivity,
one-stage architectures are unable to capture the hierarchical organization of the visual world where scenes are composed of objects which are themselves composed of parts.
Objects can move in a scene relative to each other without
changing their identity and often changing the scene only
in a minor way; the same is often true for parts within an
object. Thus global and local signatures from all levels of
[Fig. 4 plots: (b) C2 response (% change) vs. eye displacement (pixels); (d) C2 response (% change) vs. translation (pixels), for the same individual and for a distractor.]
Fig. 4: Empirical demonstration of the properties of invariance, stability and uniqueness of the hierarchical architecture in a specific two-layer implementation (HMAX). Inset (a) shows the reference image on the left and a deformation of it (the eyes are closer to each other) on the right; (b) shows the relative change in signature provided by 128 HW-modules at layer 2 (C2) whose receptive fields contain the whole face. This signature vector is (Lipschitz) stable with respect to the deformation. Error bars represent ±1 standard deviation. Two different images (c) are presented at various locations in the visual field. (d) shows the relative change of the signature vector for different values of translation. The signature vector is invariant to global translation and discriminative (between the two faces). In this example the HW-module represents the top of a hierarchical, convolutional architecture. The images used were 200 × 200 pixels and error bars represent ±1 standard deviation.
the hierarchy must be able to access memory in order to
enable the categorization and identification of whole scenes
as well as of patches of the image corresponding to objects
and their parts. Fig. 4 shows examples of invariance and stability for wholes and parts. In the architecture of Fig. 1, each ∧-module provides uniqueness, invariance and stability at different levels, over increasing ranges from bottom
match the hierarchical structure of the visual world and the
need to retrieve items from memory at various levels of size
and complexity. The results described here are part of a general theory of hierarchical architectures which is beginning
to take form (see [5, 16, 17, 18]) around the basic function
of computing invariant representations.
The property of compositionality discussed above is related to the efficacy of hierarchical architectures vs. one-layer architectures in dealing with the problem of partial
occlusion and the more difficult problem of clutter in object
recognition. Hierarchical architectures are better at recognition in clutter than one-layer networks [19] because they
provide signatures for image patches of several sizes and
locations. However, hierarchical feedforward architectures
cannot fully solve the problem of clutter. More complex
(e.g. recurrent) architectures are likely needed for human-level recognition in clutter (see for instance [20, 21, 22]) and
for other aspects of human vision. It is likely that much of
the circuitry of visual cortex is required by these recurrent
computations, not considered in this paper.
Visual Cortex
The theory described above effectively maps the computation of an invariant signature onto well-known capabilities of cortical neurons. A key difference between the basic elements of our digital computers and neurons is the number of connections: 3 vs. $10^3$–$10^4$ synapses per cortical neuron. Taking into account basic properties of synapses, it follows that a single neuron can compute high-dimensional ($10^3$–$10^4$) inner products between input vectors and the stored vector of synaptic weights [23].
Consider an HW-module of “simple” and “complex” cells [1]
looking at the image through a window defined by their receptive fields (see SI Appendix, section 2, POG). Suppose
that images of objects in the visual environment undergo
affine transformations. During development—and more generally, during visual experience—a set of $|G|$ simple cells store in their synapses an image patch $t^k$ and its transformations $g_1 t^k, \dots, g_{|G|} t^k$—one per simple cell. This is done, possibly at separate times, for $K$ different image patches $t^k$ (templates), $k = 1, \dots, K$. Each $gt^k$ for $g \in G$ is a sequence of frames, literally a movie of image patch $t^k$ transforming.
There is a very simple, general, and powerful way to learn
such unconstrained transformations. Unsupervised (Hebbian) learning is the main mechanism: for a “complex” cell
to pool over several simple cells, the key is an unsupervised
Foldiak-type rule: cells that fire together are wired together.
At the level of complex cells this rule determines classes
of equivalence among simple cells – reflecting observed time
correlations in the real world, that is, transformations of the
image. Time continuity, induced by the Markovian physics
of the world, allows associative labeling of stimuli based on
their temporal contiguity.
Later, when an image is presented, the simple cells compute $\langle I, g_i t^k\rangle$ for $i = 1, \dots, |G|$. The next step, as described above, is to estimate the one-dimensional probability distribution of such a projection, that is, the distribution of the outputs of the simple cells. It is generally assumed that complex cells pool the outputs of simple cells. Thus a complex cell could compute

$$\mu_n^k(I) = \frac{1}{|G|}\sum_{i=1}^{|G|}\sigma(\langle I, g_i t^k\rangle + n\Delta),$$

where $\sigma$ is a smooth version of the step function ($\sigma(x) = 0$ for $x \leq 0$, $\sigma(x) = 1$ for $x > 0$) and $n = 1, \dots, N$ (this corresponds to the choice $\eta_n(\cdot) \equiv \sigma(\cdot + n\Delta)$). Each of these $N$ complex cells would estimate one bin of an approximated CDF (cumulative distribution function) for $P_{\langle I,t^k\rangle}$. Following the theoretical arguments above, the complex cells could compute, instead of an empirical CDF, one or more of its moments: $n = 1$ is the mean of the dot products, $n = 2$ corresponds to an energy model of complex cells [24], and very large $n$ corresponds to a max operation. Conventional wisdom interprets available physiological data to suggest that simple/complex cells in V1 may be described in terms of energy models, but our alternative suggestion of empirical histogramming by sigmoidal nonlinearities with different offsets may fit the diversity of data even better.
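A sketch of this computation follows (sharpness, offset, and bin count are hypothetical parameters chosen only for illustration): each of the $N$ model complex cells applies a shifted sigmoid to the simple-cell outputs and averages, yielding one bin of an approximate empirical CDF:

# Sketch: N complex cells as CDF-bin estimators.  The n-th cell computes
# (1/|G|) sum_i sigma(<I, g_i t^k> + n*Delta), a smoothed count of how many
# projections exceed the offset -n*Delta.
import numpy as np

def sigma(x, sharpness=50.0):             # smooth version of the step function
    return 1.0 / (1.0 + np.exp(-sharpness * x))

def cdf_bins(proj, N=10, delta=0.2):
    return np.array([sigma(proj + n * delta).mean() for n in range(1, N + 1)])

proj = np.random.default_rng(2).uniform(-1, 1, 64)
print(np.round(cdf_bins(proj), 2))        # monotone bins of an approximate CDF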
As described above, a template and its transformed versions may be learned from unsupervised visual experience through Hebbian plasticity. Remarkably, our analysis and empirical studies [5] show that Hebbian plasticity, as formalized by Oja, can yield Gabor-like tuning—i.e., the templates that provide optimal invariance to translation and scale (see SI Appendix section 2).
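For reference, here is a minimal sketch of Oja's rule in its standard form (the toy Gaussian inputs below are hypothetical; the claim in the text concerns translated natural image patches, for which the learned weights become Gabor-like):

# Sketch of Oja's rule: unsupervised Hebbian updates with an implicit
# normalization; the weight vector converges to the principal component.
import numpy as np

rng = np.random.default_rng(3)
d, eta = 16, 0.01
C = np.eye(d); C[0, 0] = 5.0              # inputs with one dominant direction
w = rng.standard_normal(d); w /= np.linalg.norm(w)

for _ in range(20000):
    x = rng.multivariate_normal(np.zeros(d), C)
    y = w @ x
    w += eta * y * (x - y * w)            # Hebbian term minus implicit decay

print(abs(w[0]))                          # ~1: aligned with the top eigenvector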
The localization condition (Equation 2) can also be satisfied by images and templates that are similar to each other.
The result is invariance to class-specific transformations.
This part of the theory is consistent with the existence of
class-specific modules in primate cortex such as a face module and a body module [25, 26, 6]. It is intriguing that the
same localization condition suggests general Gabor-like templates for generic images in the first layers of a hierarchical
architecture and specific, sharply tuned templates for the last
stages of the hierarchy. This theory also fits physiology data
concerning Gabor-like tuning in V1 and possibly in V4 (see
[5]). It can also be shown that the theory, together with
the hypothesis that storage of the templates takes place via
Hebbian synapses, also predicts properties of the tuning of
neurons in the face patch AL of macaque visual cortex [5, 27].
From the point of view of neuroscience, the theory makes
a number of predictions, some obvious, some less so. One of
the main predictions is that simple and complex cells should
be found in all visual and auditory areas, not only in V1. Our
definition of simple cells and complex cells is different from
the traditional ones used by physiologists; for example, we
propose a broader interpretation of complex cells, which in the theory represent invariant measurements associated with histograms of the outputs of simple cells or of their moments.
The theory implies that invariance to all image transformations could be learned, either during development or in adult
life. It is, however, also consistent with the possibility that
basic invariances may be genetically encoded by evolution
but also refined and maintained by unsupervised visual experience. Studies on the development of visual invariance in
organisms such as mice raised in virtual environments could
test these predictions.
Discussion
The goal of this paper is to introduce a new theory of learning
invariant representations for object recognition which cuts
across levels of analysis [5, 28]. At the computational level, it
gives a unified account of why a range of seemingly different
models have recently achieved impressive results on recognition tasks. HMAX [2, 29, 30], Convolutional Neural Net-
works [3, 4, 31, 32] and Deep Feedforward Neural Networks
[33, 34, 35] are examples of this class of architectures—as is,
possibly, the feedforward organization of the ventral stream.
At the algorithmic level, it motivates the development, now
underway, of a new class of models for vision and speech
which includes the previous models as special cases. At the
level of biological implementation, its characterization of the
optimal tuning of neurons in the ventral stream is consistent
with the available data on Gabor-like tuning in V1[5] and
the more specific types of tuning in higher areas such as in
face patches.
Despite significant advances in sensory neuroscience over
the last five decades, a true understanding of the basic functions of the ventral stream in visual cortex has proven to be
elusive. Thus it is interesting that the theory of this paper
follows from a novel hypothesis about the main computational function of the ventral stream: the representation of
new objects/images in terms of a signature which is invariant
to transformations learned during visual experience, thereby
allowing recognition from very few labeled examples—in the
limit, just one. A main contribution of our work to machine
learning is a novel theoretical framework for the next major
challenge in learning theory beyond the supervised learning setting which is now relatively mature: the problem of
representation learning, formulated here as the unsupervised
learning of invariant representations that significantly reduce
the sample complexity of the supervised learning stage.
ACKNOWLEDGMENTS. We would like to thank the McGovern Institute for Brain Research for their support. We would also like to thank Yann LeCun, Ethan Meyers, Andrew Ng, Bernhard Schoelkopf and Alan Yuille for having read earlier versions of the manuscript. We also thank Michael Buice, Charles Cadieu, Robert Desimone, Leyla Isik, Christof Koch, Gabriel Kreiman, Lakshminarayanan Mahadevan, Stephane Mallat, Pietro Perona, Ben Recht, Maximilian Riesenhuber, Ryan Rifkin, Terrence J. Sejnowski, Thomas Serre, Steve Smale, Stefano Soatto, Haim Sompolinsky, Carlo Tomasi, Shimon Ullman and Lior Wolf for useful comments. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. This research was also sponsored by grants from the National Science Foundation (NSF-0640097, NSF-0827427) and AFOSR-THRL (FA8650-05-C-7262). Additional support was provided by the Eugene McDermott Foundation.
Supplementary Information
0. Invariance significantly reduces sample complexity
In this section we show how, in the simple case of transformations which are translations, an invariant representation
of the image space considerably reduces the sample complexity of the classifier.
If we view images as vectors in $\mathbb{R}^d$, the sample complexity of a learning rule depends on the covering number of the ball $B \subset \mathbb{R}^d$ that contains the whole image distribution. More precisely, the covering number $N(\varepsilon, B)$ is defined as the minimum number of $\varepsilon$-balls needed to cover $B$. Suppose $B$ has radius $r$; then

$$N(\varepsilon, B) \sim \left(\frac{r}{\varepsilon}\right)^d.$$

For example, in the case of linear learning rules, the sample complexity is proportional to the logarithm of the covering number.
Consider the simplest and most intuitive example: an image made of a single pixel and its translations in a square of dimension $p \times p$, where $p^2 = d$. In the pixel basis, the space of the image and all its translates has dimension $p^2$, while the image dimension is one. The associated covering numbers are therefore

$$N^I(\varepsilon, B) = \left(\frac{r}{\varepsilon}\right)^1, \qquad N^{TI}(\varepsilon, B) = \left(\frac{r}{\varepsilon}\right)^{p^2},$$

where $N^I$ stands for the covering number of the image space and $N^{TI}$ for the covering number of the translated image space. The sample complexity associated to the image space (see e.g. [36]) is $O(1)$ and that associated to the translated images is $O(p^2)$. The sample complexity reduction of an invariant representation is therefore given by

$$m_{inv} = \frac{m_{image}}{p^2}.$$

The above reasoning is independent of the choice of basis, since it depends only on the dimensionality of the ball containing all the images. For example, we could have determined the dimensionality by looking at the cardinality of the eigenvectors (with non-null eigenvalue) associated to a circulant matrix of dimension $p \times p$, i.e. using the Fourier basis. In the simple case above, the cardinality is clearly $p^2$.

In general, any transformation of an abelian group can be analyzed using the Fourier transform on the group. We conjecture that a similar reasoning holds for locally compact groups, using a wavelet representation instead of the Fourier representation.
The example and ideas above lead to the following theorem:

Theorem 1. Consider a space of images of dimensions $p \times p$ pixels which may appear in any position within a window of size $rp \times rp$ pixels. The usual image representation yields a sample complexity (of a linear classifier) of order $m_{image} = O(r^2p^2)$; the invariant representation yields (because of much smaller covering numbers) a sample complexity of order

$$m_{inv} = O(p^2) = \frac{m_{image}}{r^2}.$$
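A quick arithmetic check of the orders in Theorem 1 can be added here, under the stated assumption that the sample complexity of a linear rule is proportional to the log covering number (all numbers are illustrative):

# Arithmetic check of Theorem 1's orders, with sample complexity taken
# proportional to log N(eps, B) and N(eps, B) ~ (radius/eps)^dim.
import math

p, window_ratio, radius_over_eps = 16, 4, 10.0
m_image = (window_ratio * p) ** 2 * math.log(radius_over_eps)   # ~ O(r^2 p^2)
m_inv   = p ** 2 * math.log(radius_over_eps)                    # ~ O(p^2)
print(m_image / m_inv)                    # = window_ratio^2 = 16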
1. Setup and Definitions

Let $\mathcal{X}$ be a Hilbert space with norm and inner product denoted by $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$, respectively. We can think of $\mathcal{X}$ as the space of images (our images are usually "neural images"). We typically consider $\mathcal{X} = \mathbb{R}^d$ or $L^2(\mathbb{R})$ or $L^2(\mathbb{R}^2)$. We denote with $G$ a (locally) compact group and, with an abuse of notation, we denote by $g$ both a group element in $G$ and its action/representation on $\mathcal{X}$.

When useful, we will make the following assumptions, which are justified from a biological point of view.

Normalized dot products of signals (e.g. images or "neural activities") are usually assumed throughout the theory, for convenience but also because they provide the most elementary invariances – to measurement units (origin and scale). We assume that the dot products are between functions or vectors that are zero-mean and of unit norm. Thus $\langle I, t\rangle$ sets $I = \frac{I' - \bar{I'}}{\|I' - \bar{I'}\|}$, $t = \frac{t' - \bar{t'}}{\|t' - \bar{t'}\|}$, with $\bar{(\cdot)}$ the mean. This normalization stage before each dot product is consistent with the convention that the empty surround of an isolated image patch has zero value (which can be taken to be the average "grey" value over the ensemble of images). In particular, the dot product of a template – in general different from zero – and the "empty" region outside an isolated image patch will be zero. The dot product of two uncorrelated images – for instance of random 2D noise – is also approximately zero.

Remarks:

1. The $k$-th component of the signature associated with a simple-complex module is (see Equation [10] or [13]) $\mu_n^k(I) = \frac{1}{|G_0|}\sum_{g \in G_0}\eta_n(\langle gI, t^k\rangle)$, where the functions $\eta_n$ are such that $\mathrm{Ker}(\eta_n) = \{0\}$: in words, the empirical histogram estimated for $\langle gI, t^k\rangle$ does not take into account the 0 value, since it does not carry any information about the image patch. The functions $\eta_n$ are also assumed to be positive and bijective.

2. Images $I$ have a maximum total possible support corresponding to a bounded region $B \subseteq \mathbb{R}^2$, which we refer to as the visual field, and which corresponds to the spatial pooling range of the module at the top of the hierarchy of Figure 1 in the main text. Neuronal images are inputs to the modules in higher layers and are usually supported in a higher dimensional space, corresponding to the signature components provided by lower layers' modules; isolated objects are images with support contained in the pooling range of one of the modules at an intermediate level of the hierarchy. We use the notation $\nu(I)$, $\mu(I)$ respectively for the simple responses $\langle gI, t^k\rangle$ and for the complex response $\mu_n^k(I) = \frac{1}{|G_0|}\sum_{g \in G_0}\eta_n(\langle gI, t^k\rangle)$. To simplify the notation, we suppose that the center of the support of the signature at each layer $\ell$, $\mu_\ell(I)$, coincides with the center of the pooling range.

3. The domain of the dot products $\langle gI, t^k\rangle$ corresponding to templates and to simple cells is in general different from the domain of the pooling $\sum_{g \in G_0}$. We will continue to use the commonly used term receptive field – even if it mixes these two domains.

4. The main part of the theory characterizes properties of the basic HW-module – which computes the components of an invariant signature vector from an image patch within its receptive field.

5. It is important to emphasize that the basic module is always the same throughout the paper. We use different mathematical tools, including approximations, to study under which conditions (e.g. localization or linearization, see end of section 2) the signature computed by the module is invariant or approximately invariant.

6. The pooling $\sum_{g \in G_0}$, $G_0 \subseteq G$, is effectively over a pooling window in the group parameters. In the case of 1D scaling and 1D translations, the pooling window corresponds to an interval of scales, e.g. $[a_j, a_{j+k}]$, and an interval of $x$ translations, e.g. $[-\bar{x}, \bar{x}]$, respectively.

7. All the results in this paper are valid in the case of a discrete or a continuous compact (locally compact) group: in the first case we have a sum over the transformations, in the second an integral over the Haar measure of the group.

8. Normalized dot products also eliminate the need for the explicit computation of the determinant of the Jacobian for affine transformations (which is a constant and is simplified away dividing by the norms), assuring that $\langle AI, At\rangle = \langle I, t\rangle$, where $A$ is an affine transformation.
2. Invariance and uniqueness: Basic Module

Compact Groups (fully observable). Given an image $I \in \mathcal{X}$ and a group representation $g$, the orbit $O_I = \{I' \in \mathcal{X}\ \mathrm{s.t.}\ I' = gI,\ g \in G\}$ is uniquely associated to an image and all its transformations. The orbit provides an invariant representation of $I$, i.e. $O_I = O_{gI}$ for all $g \in G$. Indeed, we can view an orbit as all the possible realizations of a random variable with distribution $P_I$ induced by the group action. From this observation, a signature $\Sigma(I)$ can be derived for compact groups, by using results characterizing probability distributions via their one-dimensional projections.

In this section we study the signature given by

$$\Sigma(I) = (\mu^1(I), \dots, \mu^K(I)) = (\mu_1^1(I), \dots, \mu_N^1(I), \dots, \mu_1^K(I), \dots, \mu_N^K(I)),$$

where each component $\mu^k(I) \in \mathbb{R}^N$ is a histogram corresponding to a one-dimensional projection defined by a template $t^k \in \mathcal{X}$. In the following we let $\mathcal{X} = \mathbb{R}^d$.

Orbits and probability distributions. If $G$ is a compact group, the associated Haar measure $dg$ can be normalized to be a probability measure, so that, for any $I \in \mathbb{R}^d$, we can define the random variable

$$Z_I : G \to \mathbb{R}^d, \qquad Z_I(g) = gI.$$

The corresponding distribution $P_I$ is defined as $P_I(A) = dg(Z_I^{-1}(A))$ for any Borel set $A \subset \mathbb{R}^d$ (with some abuse of notation we let $dg$ be the normalized Haar measure). Recall that we define two images $I, I' \in \mathcal{X}$ to be equivalent (and we indicate it with $I \sim I'$) if there exists $g \in G$ s.t. $I = gI'$. We have the following theorem:

Theorem 2. The distribution $P_I$ is invariant and unique, i.e. $I \sim I' \Leftrightarrow P_I = P_{I'}$.
Proof:
We first prove that $I \sim I' \Rightarrow P_I = P_{I'}$. Let $\bar{g} \in G$ be such that $I' = \bar{g}I$. By definition $P_I = P_{I'}$ iff $\int_A dP_I(s) = \int_A dP_{I'}(s)$, $\forall A \subseteq \mathcal{X}$, that is $\int_{Z_I^{-1}(A)} dg = \int_{Z_{I'}^{-1}(A)} dg$, where

$$Z_I^{-1}(A) = \{g \in G\ \mathrm{s.t.}\ gI \subseteq A\}, \qquad Z_{I'}^{-1}(A) = \{g \in G\ \mathrm{s.t.}\ gI' \in A\} = \{g \in G\ \mathrm{s.t.}\ g\bar{g}I \subseteq A\},$$

$\forall A \subseteq \mathcal{X}$. Note that $\forall A \subseteq \mathcal{X}$, if $gI \in A$ then $g\bar{g}^{-1}\bar{g}I = g\bar{g}^{-1}I' \in A$, so that $g \in Z_I^{-1}(A) \Rightarrow g\bar{g}^{-1} \in Z_{I'}^{-1}(A)$, i.e. $Z_I^{-1}(A) \subseteq Z_{I'}^{-1}(A)$. Conversely $g \in Z_{I'}^{-1}(A) \Rightarrow g\bar{g} \in Z_I^{-1}(A)$, so that $Z_I^{-1}(A) = Z_{I'}^{-1}(A)\bar{g}$, $\forall A$. Using this observation we have

$$\int_{Z_I^{-1}(A)} dg = \int_{(Z_{I'}^{-1}(A))\bar{g}} dg = \int_{Z_{I'}^{-1}(A)} d\hat{g},$$

where in the last integral we used the change of variable $\hat{g} = g\bar{g}^{-1}$ and the invariance property of the Haar measure: this proves the implication.

To prove that $P_I = P_{I'} \Rightarrow I \sim I'$, note that $P_I(A) - P_{I'}(A) = 0$ for some $A \subseteq \mathcal{X}$ implies that the support of the probability distribution of $I$ has non-null intersection with that of $I'$, i.e. the orbits of $I$ and $I'$ intersect. In other words, there exist $g', g'' \in G$ such that $g'I = g''I'$. This implies $I = (g')^{-1}g''I' = \bar{g}I'$, $\bar{g} = (g')^{-1}g''$, i.e. $I \sim I'$. Q.E.D.
Random Projections for Probability Distributions. Given the above discussion, a signature may be associated to $I$ by constructing a histogram approximation of $P_I$, but this would require dealing with high-dimensional histograms. The following classic theorem gives a way around this problem.

For a template $t \in S(\mathbb{R}^d)$, where $S(\mathbb{R}^d)$ is the unit sphere in $\mathbb{R}^d$, let $I \mapsto \langle I, t\rangle$ be the associated projection. Moreover, let $P_{\langle I,t\rangle}$ be the distribution associated to the random variable $g \mapsto \langle gI, t\rangle$ (or equivalently $g \mapsto \langle I, g^{-1}t\rangle$, if $g$ is unitary). Let $E = [t \in S(\mathbb{R}^d)\ \mathrm{s.t.}\ P_{\langle I,t\rangle} = Q_{\langle I,t\rangle}]$.

Theorem 3. (Cramer-Wold, [10]) For any pair $P, Q$ of probability distributions on $\mathbb{R}^d$, we have that $P = Q$ if and only if $E = S(\mathbb{R}^d)$.
In words, two probability distributions are equal if and only if their projections onto all of the unit-sphere directions are equal. The above result can be equivalently stated as saying that the probability of choosing $t$ such that $P_{\langle I,t\rangle} = Q_{\langle I,t\rangle}$ is equal to 1 if and only if $P = Q$, and the probability of choosing $t$ such that $P_{\langle I,t\rangle} = Q_{\langle I,t\rangle}$ is equal to 0 if and only if $P \neq Q$ (see Theorem 3.4 in [37]). The theorem suggests a way to define a metric on distributions (orbits) in terms of

$$d(P_I, P_{I'}) = \int d_0(P_{\langle I,t\rangle}, P_{\langle I',t\rangle})\, d\lambda(t), \quad \forall I, I' \in \mathcal{X},$$

where $d_0$ is any metric on one-dimensional probability distributions and $d\lambda(t)$ is a distribution measure on the projections. Indeed, it is easy to check that $d$ is a metric. In particular note that, in view of the Cramer-Wold theorem, $d(P, Q) = 0$ if and only if $P = Q$. As mentioned in the main text, each one-dimensional distribution $P_{\langle I,t\rangle}$ can be approximated by a suitable histogram $\mu^t(I) = (\mu_n^t(I))_{n=1,\dots,N} \in \mathbb{R}^N$, so that, in the limit in which the histogram approximation is accurate,

$$d(P_I, P_{I'}) \approx \int d_\mu(\mu^t(I), \mu^t(I'))\, d\lambda(t), \quad \forall I, I' \in \mathcal{X}, \qquad [3]$$

where $d_\mu$ is a metric on histograms induced by $d_0$.
A natural question is whether there are situations in which a finite number of projections suffice to discriminate any two probability distributions, that is, $P_I \neq P_{I'} \Leftrightarrow d(P_I, P_{I'}) \neq 0$. Empirical results show that this is often the case with a small number of templates (see [38] and HMAX experiments, section 6). The problem of mathematically characterizing the situations in which a finite number of (one-dimensional) projections are sufficient is challenging. Here we provide a partial answer to this question.

We start by observing that the metric [3] can be approximated by uniformly sampling $K$ templates and considering

$$\hat{d}_K(P_I, P_{I'}) = \frac{1}{K}\sum_{k=1}^K d_\mu(\mu^k(I), \mu^k(I')), \qquad [4]$$
where $\mu^k = \mu^{t^k}$. The following result shows that a finite number $K$ of templates is sufficient to obtain an approximation within a given precision $\varepsilon$. Towards this end let

$$d_\mu(\mu^k(I), \mu^k(I')) = \|\mu^k(I) - \mu^k(I')\|_{\mathbb{R}^N}, \qquad [5]$$

where $\|\cdot\|_{\mathbb{R}^N}$ is the Euclidean norm in $\mathbb{R}^N$. The following theorem holds:
Theorem 4. Consider $n$ images $\mathcal{X}_n$ in $\mathcal{X}$. Let $K \geq \frac{2}{c\varepsilon^2}\log\frac{n}{\delta}$, where $c$ is a universal constant. Then

$$|d(P_I, P_{I'}) - \hat{d}_K(P_I, P_{I'})| \leq \varepsilon, \qquad [6]$$

with probability $1 - \delta^2$, for all $I, I' \in \mathcal{X}_n$.
Proof:
The proof follows from an application of Hoeffding's inequality and a union bound. Fix $I, I' \in \mathcal{X}_n$. Define the real random variable $Z : S(\mathbb{R}^d) \to \mathbb{R}$,

$$Z(t^k) = \|\mu^k(I) - \mu^k(I')\|_{\mathbb{R}^N}, \quad k = 1, \dots, K.$$

From the definitions it follows that $\|Z\| \leq c$ and $E(Z) = d(P_I, P_{I'})$. Then Hoeffding's inequality implies

$$|d(P_I, P_{I'}) - \hat{d}_K(P_I, P_{I'})| = \Big|E(Z) - \frac{1}{K}\sum_{k=1}^K Z(t^k)\Big| \geq \varepsilon,$$

with probability at most $e^{-c\varepsilon^2 K}$. A union bound implies a result holding uniformly on $\mathcal{X}_n$; the probability becomes at most $n^2 e^{-c\varepsilon^2 K}$. The desired result is obtained noting that this probability is less than $\delta^2$ as soon as $n^2 e^{-c\varepsilon^2 K} < \delta^2$, that is, $K \geq \frac{2}{c\varepsilon^2}\log\frac{n}{\delta}$. Q.E.D.
The above result shows that the discriminability question can be answered in terms of empirical estimates of the one-dimensional distributions of projections of the image and transformations induced by the group on a number of templates $t^k$, $k = 1, \dots, K$.

Theorem 4 can be compared to a version of the Cramer-Wold theorem for discrete probability distributions: Theorem 1 in [39] shows that for a probability distribution consisting of $k$ atoms in $\mathbb{R}^d$, at most $k + 1$ directions ($d_1 = d_2 = \dots = d_{k+1} = 1$) are enough to characterize the distribution, thus a finite – albeit large – number of one-dimensional projections.
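For intuition, one can plug illustrative numbers into the bound of Theorem 4 (the universal constant $c$ is unknown; it is set to 1 below purely for illustration):

# Plugging numbers into Theorem 4: number of templates K needed to
# discriminate n orbits at precision eps with confidence 1 - delta^2.
# The universal constant c is unknown; c = 1 is an illustrative assumption.
import math

def K_bound(n, eps, delta, c=1.0):
    return math.ceil(2.0 / (c * eps ** 2) * math.log(n / delta))

print(K_bound(n=1000, eps=0.1, delta=0.1))   # 1843 projections suffice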
Memory-based learning of invariance. The signature $\Sigma(I) = (\mu_1^1(I), \dots, \mu_N^K(I))$ is obviously invariant (and unique) since it is associated to an image and all its transformations (an orbit). Each component of the signature is also invariant – it corresponds to a group average. Indeed, each measurement can be defined as

$$\mu_n^k(I) = \frac{1}{|G|}\sum_{g \in G}\eta_n(\langle gI, t^k\rangle), \qquad [7]$$

for $G$ a finite group, or equivalently

$$\mu_n^k(I) = \int_G dg\, \eta_n(\langle gI, t^k\rangle) = \int_G dg\, \eta_n(\langle I, g^{-1}t^k\rangle), \qquad [8]$$

when $G$ is a (locally) compact group. Here, the nonlinearity $\eta_n$ can be chosen to define a histogram approximation; in general it is a bijective positive function. Then, it is clear from the properties of the Haar measure that we have

$$\mu_n^k(\bar{g}I) = \mu_n^k(I), \quad \forall \bar{g} \in G,\ I \in \mathcal{X}. \qquad [9]$$

Note that in the r.h.s. of eq. [8] the transformations act on the templates: this mathematically trivial (for unitary transformations) step has a deeper computational aspect. Invariance is in fact now achieved through transformations of templates instead of transformations of the image, which are not always available.
Stability. With $\Sigma(I) \in \mathbb{R}^{NK}$ denoting as usual the signature of an image, and $d(\Sigma(I), \Sigma(I'))$, $I, I' \in \mathcal{X}$, a metric, we say that a signature $\Sigma$ is stable if it is Lipschitz continuous (see [16]), that is

$$d(\Sigma(I), \Sigma(I')) \leq L\|I - I'\|_2, \quad L > 0,\ \forall I, I' \in \mathcal{X}. \qquad [10]$$

In our setting we let

$$d(\Sigma(I), \Sigma(I')) = \frac{1}{K}\sum_{k=1}^K d_\mu(\mu^k(I), \mu^k(I')),$$

and assume that $\mu_n^k(I) = \int dg\, \eta_n(\langle gI, t^k\rangle)$ for $n = 1, \dots, N$ and $k = 1, \dots, K$. If $L < 1$ we call the signature map contractive. In the following we prove a stronger form of eq. [10] where the $L^2$ norm is substituted with the Hausdorff norm on the orbits (which is independent of the choice of $I$ and $I'$ in the orbits), defined as $\|I - I'\|_H = \min_{g,g' \in G}\|gI - g'I'\|_2$, $I, I' \in \mathcal{X}$; i.e. we have:

Theorem 5. Assume normalized templates and let $L_\eta = \max_n(L_{\eta_n})$ s.t. $N L_\eta \leq 1$, where $L_{\eta_n}$ is the Lipschitz constant of the function $\eta_n$. Then

$$d(\Sigma(I), \Sigma(I')) < \|I - I'\|_H, \qquad [11]$$

for all $I, I' \in \mathcal{X}$.
Proof:
By definition, if the nonlinearities $\eta_n$ are Lipschitz continuous, for all $n = 1, \dots, N$, with Lipschitz constant $L_{\eta_n}$, it follows that for each $k$-th component of the signature we have

$$\|\Sigma^k(I) - \Sigma^k(I')\|_{\mathbb{R}^N} \leq \frac{1}{|G|}\sqrt{\sum_{n=1}^N\Big(\sum_{g \in G} L_{\eta_n}|\langle gI, t^k\rangle - \langle gI', t^k\rangle|\Big)^2} \leq \frac{1}{|G|}\sqrt{\sum_{n=1}^N L_{\eta_n}^2\sum_{g \in G}\big(|\langle g(I - I'), t^k\rangle|\big)^2},$$

where we used the linearity of the inner product and Jensen's inequality. Applying Schwartz's inequality we obtain

$$\|\Sigma^k(I) - \Sigma^k(I')\|_{\mathbb{R}^N} \leq \frac{L_\eta}{|G|}\sqrt{\sum_{n=1}^N\sum_{g \in G}\|I - I'\|_2^2\,\|g^{-1}t^k\|_2^2},$$

where $L_\eta = \max_n(L_{\eta_n})$. If we assume the templates and their transformations to be normalized to unity then we finally have

$$\|\Sigma^k(I) - \Sigma^k(I')\|_{\mathbb{R}^N} \leq N L_\eta\|I - I'\|_2, \qquad [12]$$

from which we obtain [10] summing over all $K$ components and dividing by $K$, since $N L_\eta < 1$ by hypothesis. Note now that the l.h.s. of [12], each component of the signature $\Sigma(\cdot)$ being invariant, is independent of the choice of $I, I'$ in the orbits. We can then choose $\tilde{I}, \tilde{I}'$ such that

$$\|\tilde{I} - \tilde{I}'\|_2 = \min_{g,g' \in G}\|gI - g'I'\|_2 = \|I - I'\|_H.$$

In particular, since $N L_\eta < 1$ the map is non-expansive; summing each component and dividing by $K$ we have eq. [11]. Q.E.D.
The above result shows that the stability of the empirical signature $\Sigma(I) = (\mu_1^1(I), \dots, \mu_N^K(I)) \in \mathbb{R}^{NK}$, provided with the metric [4] (together with [5]), holds for nonlinearities with Lipschitz constants $L_{\eta_n}$ such that $N\max_n(L_{\eta_n}) < 1$.

POG: Stability and Uniqueness. A direct consequence of Theorem 2 is that any two orbits with a common point are identical. This follows from the fact that if $gI$, $g'I'$ is a common point of the orbits, then

$$g'I' = gI \Rightarrow I' = (g')^{-1}gI.$$

Thus the two images are transformed versions of one another and $O_I = O_{I'}$.

Suppose now that only a fragment of the orbits – the part within the window – is observable; the reasoning above is still valid since if the orbits are different or equal, so must be any of their "corresponding" parts. Regarding the stability of POG signatures, note that the reasoning in the previous section, Theorem 5, can be repeated without any significant change. In fact, only the normalization over the transformations is modified accordingly.
Box 1: computing an invariant signature $\mu(I)$

1: procedure Signature($I$)
2:   Given $K$ templates $\{gt^k\ |\ \forall g \in G\}$.
3:   for $k = 1, \dots, K$ do
4:     Compute $\langle I, gt^k\rangle$, the normalized dot products of the image with all the transformed templates (all $g \in G$).
5:     Pool the results: POOL($\{\langle I, gt^k\rangle\ |\ \forall g \in G\}$).
6:   end for
7:   return $\mu(I)$ = the pooled results for all $k$.  ▷ $\mu(I)$ is unique and invariant if there are enough templates.
8: end procedure
Partially Observable Groups case: invariance implies localization and sparsity. This section outlines the invariance, uniqueness and stability properties of the signature obtained in the case in which transformations of a group are observable only within a window "over" the orbit. The term POG (Partially Observable Groups) emphasizes the properties of the group – in particular the associated invariants – as seen by an observer (e.g. a neuron) looking through a window at a part of the orbit. Let $G$ be a finite group and $G_0 \subseteq G$ a subset (note: $G_0$ is not usually a subgroup). The subset of transformations $G_0$ can be seen as the set of transformations that can be observed through a window on the orbit, that is, the transformations that correspond to a part of the orbit. A local signature associated to the partial observation of $G$ can be defined considering

$$\mu_n^k(I) = \frac{1}{|G_0|}\sum_{g \in G_0}\eta_n(\langle gI, t^k\rangle), \qquad [13]$$

and $\Sigma_{G_0}(I) = (\mu_n^k(I))_{n,k}$. This definition can be generalized to any locally compact group considering

$$\mu_n^k(I) = \frac{1}{V_0}\int_{G_0}\eta_n(\langle gI, t^k\rangle)\, dg, \qquad V_0 = \int_{G_0} dg. \qquad [14]$$

Note that the constant $V_0$ normalizes the Haar measure, restricted to $G_0$, so that it defines a probability distribution. The latter is the distribution of the images subject to the group transformations which are observable, that is, in $G_0$. The above definitions can be compared to definitions [7] and [8] in the fully observable groups case. In the next sections we discuss the properties of the above signature. While stability and uniqueness follow essentially from the analysis of the previous section, invariance requires developing a new analysis.
POG: Partial Invariance and Localization. Since the group is only partially observable, we introduce the notion of partial invariance for images and transformations $G_0$ that are within the observation window. Partial invariance is defined in terms of invariance of

$$\mu_n^k(I) = \frac{1}{V_0}\int_{G_0} dg\, \eta_n(\langle gI, t^k\rangle). \qquad [15]$$

We recall that when $gI$ and $t^k$ do not share any common support on the plane, or $I$ and $t$ are uncorrelated, then $\langle gI, t^k\rangle = 0$. The following theorem, where $G_0$ corresponds to the pooling range, states a sufficient condition for partial invariance in the case of a locally compact group:
Theorem 6. Localization and Invariance. Let $I, t \in \mathcal{H}$, a Hilbert space, $\eta_n : \mathbb{R} \to \mathbb{R}^+$ a set of bijective (positive) functions, and $G$ a locally compact group. Let $G_0 \subseteq G$ and suppose $\mathrm{supp}(\langle gI, t^k\rangle) \subseteq G_0$. Then, for any given $\bar{g} \in G$ and $t^k, I \in \mathcal{X}$, the following condition holds:

$$\langle gI, t^k\rangle = 0,\ \forall g \in G/(G_0 \cap \bar{g}G_0) \quad \text{(or equivalently } \langle gI, t^k\rangle \neq 0,\ \forall g \in G_0 \cap \bar{g}G_0\text{)} \ \Rightarrow\ \mu_n^k(I) = \mu_n^k(\bar{g}I). \qquad [16]$$
Proof:
To prove the implication note that if ⟨gI, t^k⟩ = 0, ∀g ∈ G/(G0 ∩ ḡG0), since G0 ∆ ḡG0 ⊆ G/(G0 ∩ ḡG0) (∆ denotes the symmetric difference, A∆B = (A ∪ B)/(A ∩ B) for sets A, B), we have:

$$0 = \int_{G/(G_0\cap \bar gG_0)} dg\, \eta_n\big(\langle gI, t^k\rangle\big) = \int_{G_0\Delta \bar gG_0} dg\, \eta_n\big(\langle gI, t^k\rangle\big) \;\geq\; \Big|\int_{G_0} dg\, \big[\eta_n\big(\langle gI, t^k\rangle\big) - \eta_n\big(\langle g\bar gI, t^k\rangle\big)\big]\Big|. \qquad [17]$$

The second equality is true since, ηn being positive, the fact that the integral is zero implies ⟨gI, t^k⟩ = 0 ∀g ∈ G/(G0 ∩ ḡG0) (and therefore in particular ∀g ∈ G0 ∆ ḡG0). The r.h.s. of the inequality being positive, we have

$$\int_{G_0} dg\, \big[\eta_n\big(\langle gI, t^k\rangle\big) - \eta_n\big(\langle g\bar gI, t^k\rangle\big)\big] = 0, \qquad [18]$$

i.e. µ^k_n(I) = µ^k_n(ḡI) (see also Fig. 5 for a visual explanation). Q.E.D.
Equation [ 48 ] describes a localization condition on the inner product of the transformed image and the template. The above result naturally raises the question of whether the localization condition is also necessary for invariance. Clearly, this would be the case if eq. [ 17 ] could be turned into an equality, that is

$$\int_{G_0\Delta \bar gG_0} dg\, \eta_n\big(\langle gI, t^k\rangle\big) = \Big|\int_{G_0} dg\, \big[\eta_n\big(\langle gI, t^k\rangle\big) - \eta_n\big(\langle g\bar gI, t^k\rangle\big)\big]\Big| = |\mu^k_n(I) - \mu^k_n(\bar gI)|. \qquad [19]$$

Indeed, in this case, if µ^k_n(I) − µ^k_n(ḡI) = 0, and we further assume the natural condition ⟨gI, t^k⟩ ≠ 0 if and only if g ∈ G0, then the localization condition [ 48 ] would be necessary, since ηn is a positive bijective function.

The equality in eq. [ 19 ] in general is not true. However, this is clearly the case if we consider the group of transformations to be translations, as illustrated in Fig. 7 a). We discuss this latter case in some detail.
Fig. 5: A sufficient condition for invariance for locally compact groups: if the support of ⟨gI, t⟩ is sufficiently localized it will be completely contained in the pooling interval even if the image is group shifted, or, equivalently (as shown in the Figure), if the pooling interval is group shifted by the same amount.
Fig. 6: An HW-module pooling the dot products of transformed templates with the image. The input image I is shown centered on the template t; the same module is shown above for a group shift of the input image, which now localizes around the transformed template g7 t. Images and templates satisfy the localization condition ⟨I, T_x t⟩ = 0 for |x| > a, with a = 3. The interval [−b, b] indicates the pooling window. The shift in x shown in the Figure is a special case: the reader should consider the case in which the transformation parameter, instead of x, is for instance rotation in depth.
Assume that G0 = [0, a]. Let

$$S = \{\langle T_x I, t\rangle : x \in [0, a]\}, \qquad S_c = \{\langle T_x I, t\rangle : x \in [c, a + c]\}, \qquad [20]$$

for a given c, where T_x is a unitary representation of the translation operator. We can view S, S_c as sets of simple responses to a given template through two receptive fields. Let S_0 = {⟨T_x I, t⟩ : x ∈ [0, a + c]}, so that S, S_c ⊂ S_0 for all c. We assume S_0, S, S_c to be closed intervals for all c. Then recall that a bijective function (in this case ηn) is strictly monotonic on any closed interval, so that the difference of integrals in eq. [ 19 ] is zero if and only if S = S_c. Since we are interested in considering all the values of c up to some maximum C, we can consider the condition

$$\langle T_x I, t\rangle = \langle T_x T_a I, t\rangle, \qquad \forall x \in [0, c],\ c \in [0, C]. \qquad [21]$$

The above condition can be satisfied in two cases: 1) both dot products are zero, which is the localization condition, or 2) T_a I = I (or equivalently T_a t = t), i.e. the image or the template are periodic. A similar reasoning applies to the case of scale transformations.
In the next paragraph we will see how localization conditions for scale and translation transformations imply a specific form of the templates.
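Before doing so, it is easy to probe the localization-implies-invariance mechanism numerically. The following is a small illustration, not from the paper: it assumes Gaussian-bump images and templates on a discrete 1D grid, uses cyclic shifts as a stand-in for translations (the supports stay far from the boundary), and pools η(⟨T_x I, t⟩) over a window of half-width b.

```python
import numpy as np

d = 512
x = np.arange(d)

# image and template localized well inside the pooling window
I = np.exp(-0.5 * ((x - 256) / 4.0) ** 2)
t = np.exp(-0.5 * ((x - 256) / 4.0) ** 2)

def mu(I, t, b, eta=np.abs):
    # POG signature: pool eta(<T_x I, t>) over the observable window [-b, b]
    return np.mean([eta(np.dot(np.roll(I, s), t)) for s in range(-b, b + 1)])

b = 50
for xbar in [0, 5, 10, 40, 80]:
    print(xbar, round(mu(np.roll(I, xbar), t, b), 6))
```

The pooled value is unchanged for shifts small enough that the support of ⟨T_x I, t⟩ (here a few times the bump width) stays inside the window, and degrades once that support crosses the window boundary, exactly as in Theorem 6.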
The Localization condition: Translation and Scale
In this section we identify G0 with subsets of the affine group. In particular, we study separately the case of scale and translations (in 1D for simplicity).
In the following it is helpful to assume that all images I and templates t are strictly contained in the range of translation or scale pooling, P, since image components outside it are not measured. We will consider images I restricted to P: for translation this means that the support of I is contained in P; for scaling, since g_s I = I(sx) and $\widehat{I(sx)} = (1/s)\,\hat I(\omega/s)$ (where ˆ· indicates the Fourier transform), assuming a scale pooling range of [s_m, s_M] implies ranges [ω^I_m, ω^I_M], [ω^t_m, ω^t_M] (m and M indicate minimum and maximum) of spatial frequencies for the maximum support of I and t. As we will see, because of Theorem 6, invariance to translation requires spatial localization of images and templates and, less obviously, invariance to scale requires bandpass properties of images and templates. Thus images and templates are assumed to be localized from the outset in either space or frequency. The corollaries below show that a stricter localization condition is needed for invariance and that this condition determines the form of the template. Notice that in our framework images and templates are bandpass because they are zero-mean. Notice that, in addition, neural “images” which are input to the hierarchical architecture are spatially bandpass because of retinal processing.
We now state the result of Theorem 6 for one dimensional signals under the translation group and – separately – under the dilation group.

Let I, t ∈ L²(R), with (R, +) the one dimensional locally compact group of translations (T_x : L²(R) → L²(R) is a unitary representation of the translation operator, as before). Let, e.g., G0 = [−b, b], b > 0, and suppose supp(t) ⊆ supp(I) ⊆ [−b, b]. Further suppose supp(⟨T_x I, t⟩) ⊆ [−b, b]. Then eq. [ 48 ] (and the following discussion for the translation (scale) transformations) leads to

Corollary 1: Localization in the spatial domain is necessary and sufficient for translation invariance. For any fixed t, I ∈ X we have:

$$\mu^k_n(I) = \mu^k_n(T_x I),\ \forall x \in [0, \bar x] \;\Longleftrightarrow\; \langle T_x I, t\rangle \neq 0,\ \forall x \in [-b + \bar x,\, b], \qquad [22]$$

with x̄ > 0.

Similarly let G = (R+, ·) be the one dimensional locally compact group of dilations and denote with D_s : L²(R) → L²(R) a unitary representation of the dilation operator. Let G0 = [1/S, S], S > 1, and suppose supp(⟨D_s I, t⟩) ⊆ [1/S, S]. Then

Corollary 2: Localization in the spatial frequency domain is necessary and sufficient for scale invariance. For any fixed t, I ∈ X we have:

$$\mu^k_n(I) = \mu^k_n(D_s I),\ s \in [1, \bar s] \;\Longleftrightarrow\; \langle D_s I, t\rangle \neq 0,\ \forall s \in \Big[\frac{\bar s}{S}, S\Big], \qquad [23]$$

with S > 1.

Localization conditions of the support of the dot product for translation and scale are depicted in Figure 7 a), b).

As shown by the following Lemma 1, Eq. [ 22 ] and [ 23 ] give interesting conditions on the supports of t and its Fourier transform t̂. For translation, the corollary is equivalent to zero overlap of the compact supports of I and t. In particular, using Theorem 6, for I = t, maximal invariance in translation implies the following localization condition on t:

$$\langle T_x t, t\rangle = 0,\quad |x| > a,\ a > 0, \qquad [24]$$

which we call self-localization.

For scaling we consider the support of the Fourier transforms of I and t. Parseval's theorem allows us to rewrite the dot product ⟨D_s I, t⟩, which is in L²(R), as ⟨D̂_s I, t̂⟩ in the Fourier domain. In the following we suppose that the supports of t̂ and Î are respectively [ω^t_m, ω^t_M] and [ω^I_m, ω^I_M], where ω_m could be very close to zero (images and templates are supposed to be zero-mean) but usually is bigger than zero. Note that the effect of scaling I (typically with s = 2^j, j ≤ 0) is to change the support as supp(D̂_s I) = s(supp(Î)). This change of the support of Î in the dot product ⟨D̂_s I, t̂⟩ gives non-trivial conditions on the intersection with the support of t̂ and therefore on the localization w.r.t. scale invariance. We have the following Lemma:

Lemma 1. Invariance to translation in the range [0, x̄], x̄ > 0, is equivalent to the following localization condition of t in space:

$$\mathrm{supp}(t) \subseteq [-b + \bar x,\, b] - \mathrm{supp}(I), \qquad I \in X. \qquad [25]$$

Separately, invariance to dilations in the range [1, s̄], s̄ > 1, is equivalent to the following localization condition of t̂ in frequency ω:

$$\mathrm{supp}(\hat t) \subseteq [-\omega_t - \Delta^*_t,\, -\omega_t + \Delta^*_t] \cup [\omega_t - \Delta^*_t,\, \omega_t + \Delta^*_t], \qquad \Delta^*_t = \max(\Delta_t) = S\,\omega^I_m - \frac{\bar s}{S}\,\omega^I_M, \quad \omega_t = \frac{\omega^t_M - \omega^t_m}{2}. \qquad [26]$$

Proof:
To prove that supp(t) ⊆ [−b + x̄, b] − supp(I), note that eq. [ 22 ] implies supp(⟨T_x I, t⟩) ⊆ [−b + x̄, b] (see Figure 7 a)). In general supp(⟨T_x I, t⟩) = supp(I ∗ t) ⊆ supp(I) + supp(t). The inclusion accounts for the fact that the integral ⟨T_x I, t⟩ can be zero even if the supports of T_x I and t are not disjoint. However, if we suppose invariance for a continuous set of translations x̄ ∈ [0, X̄] (where, for any given I, t, X̄ is the maximum translation for which we have an invariant signature) and for a generic image in X, the inclusion becomes an equality, since the invariance condition in Theorem 6 gives

$$\langle T_{\bar x} I, T_x t\rangle = \langle I, T_x T_{\bar x} t\rangle = \int_{-\infty}^{+\infty} I(\xi)\,\big(T_x t(\xi + \bar x)\big)\, d\xi = 0, \qquad \forall x \in [-\infty, -b] \cup [b, \infty],\ \forall \bar x \in [0, \bar X],\ \forall I \in X,$$

which is possible, given the arbitrariness of x̄ and I, only if

$$\mathrm{supp}(I) \cap T_x T_{-\bar x}\, \mathrm{supp}(t) = \emptyset, \qquad \forall \bar x \in [0, \bar X],\ \forall x \in [-\infty, -b] \cup [b, \infty],$$

where we used the property supp(T_x f) = T_x supp(f), ∀f ∈ X. Under these conditions supp(⟨T_x I, t⟩) = supp(I) + supp(t), so supp(t) ⊆ [−b + x̄, b] − supp(I), i.e. eq. [ 25 ].

To prove the condition in eq. [ 26 ], note that eq. [ 23 ] is equivalent in the Fourier domain to

$$\langle D_s I, t\rangle = \langle \widehat{D_s I}, \hat t\rangle = \frac{1}{s}\int d\omega\, \hat I\Big(\frac{\omega}{s}\Big)\, \hat t(\omega) \neq 0, \qquad \forall s \in \Big[\frac{\bar s}{S}, S\Big]. \qquad [27]$$

The situation is depicted in Fig. 7 b′) for S big enough: in this case we can suppose the support of $\widehat{D_{\bar s/S}\, I}$ to be on an interval to the left of that of supp(t̂) and $\widehat{D_S\, I}$ on the right; the condition supp(⟨D̂_s I, t̂⟩) ⊆ [s̄/S, S] is in this case equivalent to

$$\frac{\bar s}{S}\,\omega^I_M < \omega^t_m, \qquad \omega^t_M < S\,\omega^I_m, \qquad [28]$$

which gives

$$\frac{\omega^t_M - \omega^t_m}{2} = S\,\omega^I_m - \frac{\bar s}{S}\,\omega^I_M \qquad [29]$$

and therefore eq. [ 26 ]. Note that for some s ∈ [s̄/S, S] the condition that the Fourier supports are disjoint is only sufficient and not necessary for the dot product to be zero, since cancellations can occur. However, we can repeat the reasoning done for the translation case and ask for ⟨D̂_s I, t̂⟩ = 0 on a continuous interval of scales. Q.E.D.

The results above lead to a statement connecting invariance with localization of the templates:

Theorem 7. Maximum translation invariance implies a template with minimum support in the space domain (x); maximum scale invariance implies a template with minimum support in the Fourier domain (ω).

Fig. 7: a), b): if the support of the dot product between the image and the template is contained in the intersection between the pooling range and the group translated (a) or dilated (b) pooling range, the signature is invariant. In frequency, condition b) becomes b′): when the Fourier supports of the dilated image and the template do not intersect, their dot product is zero.
Proof:
We illustrate the statement of the theorem with a simple example. In the case of translations suppose, e.g., supp(I) = [−b′, b′], supp(t) = [−a, a], a ≤ b′ ≤ b. Eq. [ 25 ] reads

$$[-a, a] \subseteq [-b + \bar x + b',\, b - b'],$$

which gives the condition −a ≥ −b + b′ + x̄, i.e. x̄_max = b − b′ − a; thus, for any fixed b, b′, the smaller the template support 2a in space, the greater the translation invariance. Similarly, in the case of dilations, increasing the range of invariance [1, s̄], s̄ > 1, implies a decrease in the support of t̂, as shown by eq. [ 29 ]; in fact, noting that |supp(t̂)| = 2∆_t, we have

$$\frac{d\,|\mathrm{supp}(\hat t)|}{d\bar s} = -\frac{2\,\omega^I_M}{S} < 0,$$

i.e. the measure | · | of the support of t̂ is a decreasing function of the invariance range [1, s̄]. Q.E.D.
Because of the assumption that the maximum possible support of all I is finite, there is always localization for any choice of I and t under spatial shift. Of course, if the localization support is larger than the pooling range there is no invariance. For a complex cell with pooling range [−b, b] in space, only templates with self-localization smaller than the pooling range make sense. An extreme case of self-localization is t(x) = δ(x), corresponding to maximum localization of tuning of the simple cells.
Invariance, Localization and Wavelets. The conditions equivalent to optimal translation and scale invariance – maximum localization in space and frequency – cannot be simultaneously satisfied because of the classical uncertainty principle: if a function t(x) is essentially zero outside an interval of length ∆x and its Fourier transform t̂(ω) is essentially zero outside an interval of length ∆ω, then

$$\Delta x \cdot \Delta \omega \geq 1. \qquad [30]$$

In other words, a function and its Fourier transform cannot both be highly concentrated. Interestingly, for our setup the uncertainty principle also applies to sequences [40]. It is well known that the equality sign in the uncertainty principle above is achieved by Gabor functions [41] of the form

$$\psi_{x_0,\omega_0}(x) = e^{-\frac{(x - x_0)^2}{2\sigma^2}}\, e^{i\omega_0 (x - x_0)}, \qquad \sigma \in \mathbb{R}_+,\ \omega_0, x_0 \in \mathbb{R}. \qquad [31]$$

The uncertainty principle leads to the concept of “optimal localization” instead of exact localization. In a similar way, it is natural to relax our definition of strict invariance (e.g. µ^k_n(I) = µ^k_n(g′I)) and to introduce ε-invariance as |µ^k_n(I) − µ^k_n(g′I)| ≤ ε. In particular, if we suppose, e.g., the following localization condition

$$\langle T_x I, t\rangle = e^{-\frac{x^2}{\sigma_x^2}}, \qquad \langle D_s I, t\rangle = e^{-\frac{s^2}{\sigma_s^2}}, \qquad \sigma_x, \sigma_s \in \mathbb{R}_+, \qquad [32]$$

we have

$$|\mu^k_n(T_{\bar x} I) - \mu^k_n(I)| = \frac{1}{2}\sqrt{\sigma_x}\, \mathrm{erf}\big([-b, b]\,\Delta\,[-b + \bar x, b + \bar x]\big), \qquad |\mu^k_n(D_{\bar s} I) - \mu^k_n(I)| = \frac{1}{2}\sqrt{\sigma_s}\, \mathrm{erf}\big([-1/S, S]\,\Delta\,[\bar s/S, S\bar s]\big),$$

where erf is the error function. The differences above, with an opportune choice of the localization ranges σ_x, σ_s, can be made as small as desired.
We end this paragraph with a conjecture: the optimal ε-invariance is satisfied by templates with non-compact support which decays exponentially, such as a Gaussian or a Gabor wavelet. We can then speak of optimal invariance meaning “optimal ε-invariance”. The reasoning above leads to the theorem:
Theorem 8. Assume invariants are computed from pooling within a pooling window with a set of linear filters. Then the optimal templates (e.g. filters) for maximum simultaneous invariance to translation and scale are Gabor functions

$$t(x) = e^{-\frac{x^2}{2\sigma^2}}\, e^{i\omega_0 x}. \qquad [33]$$
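As a numerical sanity check on Theorem 8 (an illustration, not part of the paper's development), the snippet below builds a discrete Gabor template and measures its space and frequency spreads; with standard-deviation measures of concentration, the Gaussian envelope attains the minimal Heisenberg product σ_x σ_ω = 1/2, the discrete analogue of the equality case of the uncertainty relation [ 30 ]. Grid size and parameters are arbitrary choices.

```python
import numpy as np

# discrete Gabor template t(x) = exp(-x^2 / 2 sigma^2) exp(i w0 x)
d = 4096
x = np.linspace(-40, 40, d)
dx = x[1] - x[0]
sigma, w0 = 2.0, 3.0
t = np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * w0 * x)

# spatial spread of |t|^2
p = np.abs(t)**2; p /= p.sum()
var_x = np.sum(p * (x - np.sum(p * x))**2)

# frequency spread of |t_hat|^2
T = np.fft.fftshift(np.fft.fft(t))
w = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(d, dx))
q = np.abs(T)**2; q /= q.sum()
var_w = np.sum(q * (w - np.sum(q * w))**2)

print(np.sqrt(var_x * var_w))   # ~0.5: the minimal uncertainty product
```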
Remarks
1. The Gabor function ψ_{x0,ω0}(x) corresponds to a Heisenberg box which has an x-spread σ_x² = ∫ x² |ψ(x)|² dx and an ω-spread σ_ω² = ∫ ω² |ψ̂(ω)|² dω, with area σ_x σ_ω. Gabor wavelets arise under the action on ψ(x) of the translation and scaling groups as follows. The function ψ(x), as defined, is zero-mean and normalized, that is,

$$\int \psi(x)\, dx = 0 \qquad [34]$$

and

$$\|\psi(x)\| = 1. \qquad [35]$$

A family of Gabor wavelets is obtained by translating and scaling ψ:

$$\psi_{u,s}(x) = \frac{1}{s^{1/2}}\,\psi\Big(\frac{x - u}{s}\Big). \qquad [36]$$

Under certain conditions (in particular, the Heisenberg boxes associated with each wavelet must together cover the space-frequency plane) the Gabor wavelet family becomes a Gabor wavelet frame.
2. Optimal self-localization of the templates (which follows
from localization), when valid simultaneously for space
and scale, is also equivalent to Gabor wavelets. If they
are a frame, full information can be preserved in an optimal quasi invariant way.
3. Note that the result proven in Theorem 8 is not related
to that in [42]. While Theorem 8 shows that Gabor functions emerge from the requirement of maximal invariance
for the complex cells – a property of obvious computational importance – the main result in [42] shows that (a)
the wavelet transform (not necessarily Gabor) is covariant
with the similitude group and (b) the wavelet transform
follows from the requirement of covariance (rather than
invariance) for the simple cells (see our definition of covariance in the next section). While (a) is well-known
(b) is less so (but see p.41 in [43]). Our result shows that
Gabor functions emerge from the requirement of maximal
invariance for the complex cells.
Approximate Invariance and Localization. In the previous section we analyzed the relation between localization and invariance in the case of group transformations. By relaxing the requirement of exact invariance and exact localization, we show how the same strategy for computing invariants can still be applied even in the case of non-group transformations, if certain localization properties of ⟨TI, t⟩ hold, where T is a smooth transformation (for simplicity, think of a transformation parametrized by one parameter).

We first notice that the localization condition of Theorems 6 and 8 – when relaxed to approximate localization – takes the form (e.g. for the 1D translation group, supposing for simplicity that the supports of I and t are centered in zero)

$$\langle I, T_x t^k\rangle < \delta \ \ \forall x\ \text{s.t.}\ |x| > a, \qquad \langle I, T_x t^k\rangle \approx 1 \ \ \forall x\ \text{s.t.}\ |x| < a,$$

where δ is small, in the order of 1/√n (n being the dimension of the space).
We call this property sparsity of I in the dictionary tk under G. This condition can be satisfied by templates that are
similar to images in the set and are sufficiently “rich” to be
incoherent for “small” transformations. Note that from the
reasoning above the sparsity of I in tk under G is expected
to improve with increasing n and with noise-like encoding of
I and tk by the architecture.
Another important property of sparsity of I in t^k (in addition to allowing local approximate invariance to arbitrary transformations, see later) is clutter tolerance, in the sense that if n1, n2 are additive uncorrelated spatial noisy clutter, then ⟨I + n1, gt^k + n2⟩ ≈ ⟨I, gt^k⟩.
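This clutter tolerance is easy to see in high dimension, since the three extra inner products involve (nearly) uncorrelated vectors and each concentrates around 0 at rate 1/√n. A minimal numeric illustration, with all vectors chosen at random and unit-normalized (illustrative, not the paper's encoding):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000                                   # dimensionality of the "neural image"
unit = lambda v: v / np.linalg.norm(v)

gt = unit(rng.standard_normal(n))             # a transformed template
I = gt.copy()                                 # sparsity regime: <I, gt> ~ 1
n1 = unit(rng.standard_normal(n))             # additive uncorrelated clutter
n2 = unit(rng.standard_normal(n))

print(np.dot(I, gt))                          # 1.0
print(np.dot(I + n1, gt + n2))                # 1 + O(1/sqrt(n)), i.e. ~1.0
```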
Interestingly, the sparsity condition under the group is related to associative memories, for instance of the holographic type [44],[45]. If the sparsity condition holds only for I = t^k and for a very small set of g ∈ G, that is, it has the form ⟨I, gt^k⟩ = δ(g)δ_{I,t^k}, it implies strict memory-based recognition (see the non-interpolating look-up table in the description of [46]), with inability to generalize beyond the stored templates or views.
While the first regime – exact (or ε-) invariance for generic images, yielding universal Gabor templates – applies to the first layer of the hierarchy, this second regime (sparsity) – approximate invariance for a class of images, yielding class-specific templates – is important for dealing with non-group transformations at the top levels of the hierarchy, where receptive fields may be as large as the visual field.
Several interesting transformations do not have the
group structure, for instance the change of expression of a
face or the change of pose of a body. We show here that approximate invariance to transformations that are not groups
can be obtained if the approximate localization condition
above holds, and if the transformation can be locally approximated by a linear transformation, e.g. a combination of
translations, rotations and non-homogeneous scalings, which
corresponds to a locally compact group admitting a Haar
measure.
Suppose, for simplicity, that the smooth transformation T, at least twice differentiable, is parametrized by the parameter r ∈ R. We approximate its action on an image I with a Taylor series (around e.g. r = 0) as:

$$T_r(I) = T_0(I) + \frac{dT}{dr}\Big|_{r=0}(I)\, r + R(I) = I + J^I(I)\, r + R(I) = [e + rJ^I](I) + R(I) = L^I_r(I) + R(I), \qquad [37]$$

where R(I) is the remainder, e is the identity operator, J^I the Jacobian, and L^I_r = e + rJ^I a linear operator.
Let R be the range of the parameter r where we can approximately neglect the remainder term R(I), and let L be the range of r where the scalar product ⟨T_r I, t⟩ is localized, i.e. ⟨T_r I, t⟩ = 0, ∀r ∉ L. If L ⊆ R we have

$$\langle T_r I, t\rangle \approx \langle L^I_r I, t\rangle, \qquad [38]$$

and we have the following:
Proposition 9. Let I, t ∈ H, a Hilbert space, ηn : R → R+ a set of bijective (positive) functions, and T a smooth transformation (at least twice differentiable) parametrized by r ∈ R. Let L = supp(⟨T_r I, t⟩), P the pooling interval in the r parameter, and R defined as above. If L ⊆ P ⊆ R and

$$\langle T_r I, t\rangle = 0,\ \forall r \in R/(T_{\bar r} P \cap P),\ \bar r \in R,$$

then µ^k_n(T_r̄ I) = µ^k_n(I).
Proof:
Following the reasoning done in Theorem 6, we have

$$\mu^k_n(T_{\bar r} I) = \int_P dr\, \eta_n\big(\langle T_r T_{\bar r} I, t\rangle\big) = \int_P dr\, \eta_n\big(\langle L^I_r L^I_{\bar r} I, t\rangle\big) = \int_P dr\, \eta_n\big(\langle L^I_{r + \bar r} I, t\rangle\big) = \mu^k_n(I),$$

where the last equality is true if ⟨T_r I, t⟩ = ⟨L^I_r I, t⟩ = 0, ∀r ∈ R/(T_r̄ P ∩ P). Q.E.D.
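The key approximation ⟨T_r I, t⟩ ≈ ⟨L^I_r I, t⟩ can be checked numerically. The sketch below (an illustration under simple assumptions, not the paper's code) takes T_r to be a dilation I(x) ↦ I(x/(1+r)) on a 1D grid, whose first-order term is (J I)(x) = −x I′(x), and compares the two dot products as r grows:

```python
import numpy as np

x = np.linspace(-5, 5, 4001)
dx = x[1] - x[0]
I = np.exp(-x**2)                       # a smooth test "image"
t = np.exp(-(x - 0.3)**2)               # a template
dI = np.gradient(I, x)

dot = lambda u, v: float(np.dot(u, v) * dx)

for r in [0.01, 0.05, 0.1, 0.3]:
    TrI = np.interp(x / (1 + r), x, I)  # T_r I = I(x / (1 + r))
    LrI = I - r * x * dI                # L_r I = [e + r J] I, (J I)(x) = -x I'(x)
    print(r, round(dot(TrI, t), 5), round(dot(LrI, t), 5))
```

The two values agree to O(r²) for small r and drift apart as the remainder grows, which is exactly the regime R in which Proposition 9 applies.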
As an example, consider the transformation induced on the
image plane by rotation in depth of a face: it can be decomposed into piecewise linear approximations around a small
number of key templates, each one corresponding to a specific 3D rotation of a template face. Each key template
corresponds to a complex cell containing as (simple cells)
a number of observed transformations of the key template
within a small range of rotations. Each key template corresponds to a different signature which is invariant only for
rotations around its center. Notice that the form of the linear approximation or the number of key templates needed
does not affect the algorithm or its implementation. The
templates learned are used in the standard dot-product-andpooling module. The choice of the key templates – each one
corresponding to a complex cell, and thus to a signature component – is not critical, as long as there are enough of them.
For one parameter groups, the key templates correspond to
the knots of a piecewise linear spline approximation. Optimal placement of the centers – if desired – is a separate
problem that we leave aside for now.
Summary of the argument: Different transformations
can be classified in terms of invariance and localization.
Compact Groups: consider the case of a compact group
transformation such as rotation in the image plane. A complex cell is invariant when pooling over all the templates
which span the full group θ ∈ [−π, +π]. In this case there
is no restriction on which images can be used as templates:
any template yields perfect invariance over the whole range
of transformations (apart from mild regularity assumptions)
and a single complex cell pooling over all templates can provide a globally invariant signature.
Locally Compact Groups and Partially Observable Compact Groups: consider now the POG situation in which the
pooling is over a subset of the group: (the POG case always
applies to Locally Compact groups (LCG) such as translations). As shown before, a complex cell is partially invariant
if the value of the dot-product between a template and its
shifted template under the group falls to zero fast enough
with the size of the shift relative to the extent of pooling.
In the POG and LCG case, such partial invariance holds
over a restricted range of transformations if the templates
and the inputs have a localization property that implies
wavelets for transformations that include translation and
scaling.
General (non-group) transformations: consider the case
of a smooth transformation which may not be a group.
Smoothness implies that the transformation can be approximated by piecewise linear transformations, each centered
around a template (the local linear operator corresponds to
the first term of the Taylor series expansion around the chosen template). Assume – as in the POG case – a special
form of sparsity – the dot-product between the template
and its transformation falls to zero with increasing size of the
transformation. Assume also that the templates transform
as the input image. For instance, the transformation induced on the image plane by rotation in depth of a face may
have piecewise linear approximations around a small number
of key templates corresponding to a small number of rotations of a given template face (say at ±30o , ±90o , ±120o ).
Each key template and its transformed templates within a
range of rotations corresponds to complex cells (centered in
±30o , ±90o , ±120o ). Each key template, e.g. complex cell,
corresponds to a different signature which is invariant only
for that part of rotation. The strongest hypothesis is that
there exist input images that are sparse w.r.t. templates of
the same class – these are the images for which local invariance holds.
Remarks:
1. We are interested in two main cases of POG invariance:
• partial invariance simultaneously to translations in
x, y, scaling and possibly rotation in the image plane.
This should apply to “generic” images. The signatures
should ideally preserve full, locally invariant information. This first regime is ideal for the first layers of
the multilayer network and may be related to Mallat’s scattering transform, [16]. We call the sufficient
condition for LCG invariance here, localization, and
in particular, in the case of translation (scale) group
self-localization given by Equation [ 24 ].
• partial invariance to linear transformations for a subset
of all images. This second regime applies to high-level
modules in the multilayer network specialized for specific classes of objects and non-group transformations.
The condition that is sufficient here for LCG invariance is given by Theorem 6 which applies only to a
specific class of I. We prefer to call it sparsity of the
images with respect to a set of templates.
2. For classes of images that are sparse with respect to a set of templates, the localization condition does not imply wavelets. Instead it implies templates that are
• similar to a class of images, so that ⟨I, g0 t^k⟩ ≈ 1 for some g0 ∈ G, and
• complex enough to be “noise-like”, in the sense that ⟨I, gt^k⟩ ≈ 0 for g ≠ g0.
3. Templates must transform similarly to the input for approximate invariance to hold. This corresponds to the
assumption of a class-specific module and of a nice object
class [47, 6].
4. For the localization property to hold, the image must be
similar to the key template or contain it as a diagnostic feature (a sparsity property). It must be also quasiorthogonal (highly localized) under the action of the local
group.
5. For a general, non-group, transformation it may be impossible to obtain invariance over the full range with a
single signature; in general several are needed.
6. It would be desirable to derive a formal characterization
of the error in local invariance by using the standard module of dot-product-and-pooling, equivalent to a complex
cell. The above arguments provide the outline of a proof
based on local linear approximation of the transformation and on the fact that a local linear transformation is
a LCG.
3. Hierarchical Architectures
So far we have studied the invariance, uniqueness and stability properties of signatures, both in the case when a whole
group of transformations is observable (see [ 7 ] and [ 8 ]), and
in the case in which it is only partially observable (see [ 13 ]
and [ 14 ]). We now discuss how the above ideas can be
iterated to define a multilayer architecture. Consider first
the case when G is finite. Given a subset G0 ⊂ G, we can
associate a window gG0 to each g ∈ G. Then, we can use
definition [ 13 ] to define for each window a signature Σ(I)(g)
given by the measurements

$$\mu^k_n(I)(g) = \frac{1}{|G_0|}\sum_{\bar g \in gG_0} \eta_n\big(\langle I, \bar g t^k\rangle\big).$$

We will keep this form as the definition of signature. For fixed n, k, the set of measurements corresponding to different windows can be seen as a |G|-dimensional vector. A signature Σ(I) for the whole image is obtained as a signature of signatures, that is, a collection of signatures (Σ(I)(g1), . . . , Σ(I)(g_{|G|})) associated to each window.
Since we assume that the output of each module is made
zero-mean and normalized before further processing at the
next layer, conservation of information from one layer to the
next requires saving the mean and the norm at the output of
each module at each level of the hierarchy.
We conjecture that the neural image at the first layer is
uniquely represented by the final signature at the top of the
hierarchy and the means and norms at each layer.
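The conservation-of-information point is concrete: zero-mean normalization is invertible only if the discarded mean and norm are kept as side information. A minimal sketch (illustrative helper names, not from the paper):

```python
import numpy as np

def normalize_with_sideinfo(v):
    """Zero-mean, unit-norm module output, plus the (mean, norm) side
    information that must be stored for the step to be invertible."""
    m = v.mean()
    u = v - m
    r = np.linalg.norm(u)
    return (u / r if r > 0 else u), (m, r)

def reconstruct(u, m, r):
    return u * r + m

rng = np.random.default_rng(4)
v = rng.standard_normal(32)
u, (m, r) = normalize_with_sideinfo(v)
print(np.allclose(reconstruct(u, m, r), v))   # True: nothing is lost
```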
The above discussion can be easily extended to continuous (locally compact) groups considering

$$\mu^k_n(I)(g) = \frac{1}{V_0}\int_{gG_0} d\bar g\, \eta_n\big(\langle I, \bar g t^k\rangle\big), \qquad V_0 = \int_{G_0} d\bar g,$$

where, for fixed n, k, µ^k_n(I) : G → R can now be seen as a function on the group. In particular, if we denote by K0 : G → R the indicator function on G0, then we can write

$$\mu^k_n(I)(g) = \frac{1}{V_0}\int_{G} d\bar g\, K_0(\bar g^{-1} g)\, \eta_n\big(\langle I, \bar g t^k\rangle\big).$$

The signature for an image can again be seen as a collection of signatures corresponding to different windows, but in this case it is a function Σ(I) : G → R^{NK}, where Σ(I)(g) ∈ R^{NK} is the signature corresponding to the window G0 “centered” at g ∈ G.
The above construction can be iterated to define a hierarchy of signatures. Consider a sequence G1 ⊂ G2 ⊂ · · · ⊂ GL = G. For h : G → R^p, p ∈ N, with an abuse of notation we let gh(ḡ) = h(g^{-1}ḡ). Then we can consider the following construction.

We call complex cell operator at layer ℓ the operator that maps an image I ∈ X to a function µℓ(I) : G → R^{NK}, where

$$\mu^{n,k}_\ell(I)(g) = \frac{1}{|G_\ell|}\sum_{\bar g \in gG_\ell} \eta_n\big(\nu^k_\ell(I)(\bar g)\big), \qquad [39]$$

and simple cell operator at layer ℓ the operator that maps an image I ∈ X to a function νℓ(I) : G → R^K,

$$\nu^k_\ell(I)(g) = \langle \mu_{\ell-1}(I),\, g t^k_\ell\rangle, \qquad [40]$$

with t^k_ℓ the kth template at layer ℓ and µ0(I) = I. Several comments are in order:
• Besides the first layer, the inner product defining the simple cell operator is that in L²(G) = {h : G → R^{NK}, ∫ dg |h(g)|² < ∞};
• The index ℓ corresponds to different layers, corresponding to different subsets Gℓ.
• At each layer a (finite) set of templates Tℓ = (t¹_ℓ, . . . , t^K_ℓ) ⊂ L²(G) (T0 ⊂ X) is assumed to be available. For simplicity, in the above discussion we have assumed that |Tℓ| = K for all ℓ = 1, . . . , L. The templates at layer ℓ can be thought of as compactly supported functions, with support much smaller than the corresponding set Gℓ. Typically templates can be seen as image patches in the space of complex operator responses, that is, tℓ = µ_{ℓ−1}(t̄) for some t̄ ∈ X.
• Similarly, we have assumed that the number of nonlinearities ηn considered at every layer is the same.
Following the above discussion, the extension to continuous (locally compact) groups is straightforward. We collect it in the following definition.

Definition 1. (Simple and complex response) For ℓ = 1, . . . , L, let Tℓ = (t¹_ℓ, . . . , t^K_ℓ) ⊂ L²(G) (and T0 ⊂ X) be a sequence of template sets. The complex cell operator at layer ℓ maps an image I ∈ X to a function µℓ(I) : G → R^{NK}; in components

$$\mu^{n,k}_\ell(I)(g) = \frac{1}{V_\ell}\int d\bar g\, K_\ell(\bar g^{-1} g)\, \eta_n\big(\nu^k_\ell(I)(\bar g)\big), \qquad g \in G, \qquad [41]$$

where Kℓ is the indicator function on Gℓ, Vℓ = ∫_{Gℓ} dḡ, and where

$$\nu^k_\ell(I)(g) = \langle \mu_{\ell-1}(I),\, g t^k_\ell\rangle, \qquad g \in G \qquad [42]$$

(µ0(I) = I) is the simple cell operator at layer ℓ that maps an image I ∈ X to a function νℓ(I) : G → R^K.

Remark. Note that eq. [ 41 ] can be written as

$$\mu^{n,k}_\ell(I) = K_\ell * \eta_n\big(\nu^k_\ell(I)\big), \qquad [43]$$

where ∗ is the group convolution.
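Eq. [ 43 ] suggests a direct implementation. The sketch below is a toy instantiation under simplifying assumptions (the cyclic group, η = | · |, a single nonlinearity, FFT-based circular convolution with the window indicator Kℓ); it stacks two simple/complex layers, with second-layer templates that are themselves first-layer responses, tℓ = µ_{ℓ−1}(t̄), as in the text. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def layer(I, templates, b):
    """One HW layer on the cyclic group: simple responses <I, gbar t^k>
    followed by complex pooling over windows g + [-b, b], computed as a
    group (circular) convolution with the window indicator (eq. [43])."""
    G = I.shape[0]
    nu = np.stack([[np.vdot(I, np.roll(t, g, axis=0)) for g in range(G)]
                   for t in templates])                  # K x G simple cells
    kernel = np.zeros(G)
    kernel[np.r_[0:b + 1, G - b:G]] = 1.0 / (2 * b + 1)  # indicator of [-b, b]
    mu = np.stack([np.real(np.fft.ifft(np.fft.fft(np.abs(nu[k])) *
                                       np.fft.fft(kernel)))
                   for k in range(len(templates))])
    return mu.T                                          # G x K complex cells

G, K1, K2 = 64, 4, 3
t1 = rng.standard_normal((K1, G))                        # layer-1 templates
I = rng.standard_normal(G)

m1 = layer(I, t1, b=4)                                   # mu_1(I): G x K1
t2 = [layer(rng.standard_normal(G), t1, b=4)             # layer-2 templates as
      for _ in range(K2)]                                # layer-1 responses
m2 = layer(m1, t2, b=8)                                  # mu_2(I): G x K2
print(m2.shape)                                          # (64, 3)
```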
In the following we study the properties of the complex response, in particular:

Property 1: covariance. We call the map Σ covariant under G iff

$$\Sigma(gI) = g^{-1}\Sigma(I), \qquad \forall g \in G,\ I \in X,$$

where the action of g^{-1} is intended on the representation space L²(G) and that of g on the image space L²(R²). In practice, since we only take into account the distribution of the values of ⟨µ(I), µ(t^k)⟩, we can ignore this technical detail, the definition of covariance being equivalent to the statement ⟨µ(gI), µ(t^k)⟩ = ⟨µ(I), µ(g^{-1}t^k)⟩, where the transformation always acts on the image space. In the following we show the covariance property for the µ^{n,k}_1 response (see Fig. 8). An inductive reasoning can then be applied for higher order responses. We assume that the architecture is isotropic in the relevant covariance dimension (this implies that all the modules in each layer should be identical, with identical templates) and that there is a continuum of modules in each layer.

Fig. 8: Covariance: the response for an image I at position g is equal to the response of the group shifted image at the shifted position.
Proposition 10. Let G be a locally compact group, g̃ ∈ G, and µ^{n,k}_1 as defined in eq. [ 41 ]. Then µ^{n,k}_1(g̃I)(g) = µ^{n,k}_1(I)(g̃^{-1}g), ∀g̃ ∈ G.
Proof:
Using definition [ 41 ] we have

$$\mu^{n,k}_1(\tilde g I)(g) = \frac{1}{V_1}\int_G d\bar g\, K_1(\bar g^{-1} g)\, \eta_n\big(\langle \tilde g I, \bar g t^k\rangle\big) = \frac{1}{V_1}\int_G d\bar g\, K_1(\bar g^{-1} g)\, \eta_n\big(\langle I, \tilde g^{-1}\bar g t^k\rangle\big) = \frac{1}{V_1}\int_G d\hat g\, K_1(\hat g^{-1}\tilde g^{-1} g)\, \eta_n\big(\langle I, \hat g t^k\rangle\big) = \mu^{n,k}_1(I)(\tilde g^{-1} g),$$

where in the third line we used the change of variable ĝ = g̃^{-1}ḡ and the invariance of the Haar measure. Q.E.D.
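For the cyclic translation group, Proposition 10 can be verified directly; a short check (toy setup, η = | · |, one template, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
G, b = 64, 6
t = rng.standard_normal(G)
I = rng.standard_normal(G)

def mu1(I, g):
    # complex response at window position g: pool eta(<I, gbar t>)
    # over gbar in g + [-b, b] (indices mod G)
    return np.mean([np.abs(np.dot(I, np.roll(t, (g + s) % G)))
                    for s in range(-b, b + 1)])

gt = 11                                                 # the shift g~
lhs = [mu1(np.roll(I, gt), g) for g in range(G)]        # mu_1(g~ I)(g)
rhs = [mu1(I, (g - gt) % G) for g in range(G)]          # mu_1(I)(g~^{-1} g)
print(np.allclose(lhs, rhs))                            # True: covariance
```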
Remarks
1. The covariance property described in Proposition 10 can be stated equivalently as µ^{n,k}_1(I)(g) = µ^{n,k}_1(ḡI)(ḡg). This last expression has a more intuitive meaning, as shown in Fig. 8.
2. The covariance property described in Proposition 10 holds both for abelian and non-abelian groups. However, the group average over the template transformations in the definition of eq. [ 41 ] is crucial. In fact, if we define the signature by averaging over the image transformations we do not have a covariant response:

$$\mu^{n,k}_1(\tilde g I)(g) = \frac{1}{V_1}\int_G d\bar g\, K_1(\bar g^{-1} g)\, \eta_n\big(\langle \bar g\tilde g I, t^k\rangle\big) = \frac{1}{V_1}\int_G d\hat g\, K_1(\tilde g\hat g^{-1} g)\, \eta_n\big(\langle \hat g I, t^k\rangle\big),$$

where in the second line we used the change of variable ĝ = ḡg̃ and the invariance of the Haar measure. The last expression cannot be written as µ^{n,k}_1(I)(g′g) for any g′ ∈ G.
3. With respect to the range of invariance, the following
property holds for multilayer architectures in which the
output of a layer is defined as covariant if it transforms in
the same way as the input: for a given transformation of
an image or part of it, the signature from complex cells
at a certain layer is either invariant or covariant with respect to the group of transformations; if it is covariant
there will be a higher layer in the network at which it is
invariant (more formal details are given in Theorem 12),
assuming that the image is contained in the visual field.
This property predicts a stratification of ranges of invariance in the ventral stream: invariances should appear in
a sequential order meaning that smaller transformations
will be invariant before larger ones, in earlier layers of the
hierarchy[48].
Property 2: partial and global invariance (whole and parts).
We now find the conditions under which the functions µ` are
locally invariant, i.e. invariant within the restricted range of
the pooling. We further prove that the range of invariance
increases from layer to layer in the hierarchical architecture.
The fact that for an image, in general, no more global invariance is guaranteed allows, as we will see, a novel definition
of “parts” of an image.
The local invariance conditions are a simple reformulation of Theorem 6 in the context of a hierarchical architecture. In the following, for the sake of simplicity, we suppose that at each layer we only have a template t and a nonlinear function η.

Proposition 11. Localization and Invariance: hierarchy. Let I, t ∈ H, a Hilbert space, η : R → R+ a bijective (positive) function, and G a locally compact group. Let Gℓ ⊆ G and suppose supp(⟨gµ_{ℓ−1}(I), t⟩) ⊆ Gℓ. Then for any given ḡ ∈ G,

$$\langle g\mu_{\ell-1}(I), t\rangle = 0,\ \forall g \in G/(G_\ell \cap \bar g G_\ell) \quad\text{or equivalently}\quad \langle g\mu_{\ell-1}(I), t\rangle \neq 0,\ \forall g \in G_\ell \cap \bar g G_\ell \;\Rightarrow\; \mu_\ell(I) = \mu_\ell(\bar g I). \qquad [44]$$

The proof follows the reasoning done in Theorem 6 (and the following discussion for the translation and scale transformations), with I substituted by µ_{ℓ−1}(I), using the covariance property µ_{ℓ−1}(gI) = gµ_{ℓ−1}(I). Q.E.D.
We can now give a formal definition of object part as the subset of the signal I whose complex response, at layer ℓ, is invariant under transformations in the range of the pooling at that layer. This definition is consistent, since the invariance increases from layer to layer (as formally proved below), therefore allowing bigger and bigger parts. Consequently, for each transformation there will exist a layer ℓ̄ such that any signal subset will be a part at that layer. We can now state the following:
Theorem 12. Whole and parts. Let I ∈ X (an image or a subset of it) and µℓ the complex response at layer ℓ. Let G0 ⊂ · · · ⊂ Gℓ ⊂ · · · ⊂ GL = G be a set of nested subsets of the group G. Suppose η is a bijective (positive) function and that the template t and the complex response at each layer have finite support. Then ∀ḡ ∈ G, µℓ(I) is invariant for some ℓ = ℓ̄, i.e.

$$\mu_m(\bar g I) = \mu_m(I), \qquad \exists\, \bar\ell \ \text{s.t.}\ \forall m \geq \bar\ell.$$

The proof follows from the observation that the pooling range over the group is a bigger and bigger subset of G with growing layer number; in other words, there exists a layer such that the image and its transformations are within the pooling range at that layer (see Fig. 9). This is clear since for any ḡ ∈ G the nested sequence

$$G_0 \cap \bar g G_0 \subseteq \dots \subseteq G_\ell \cap \bar g G_\ell \subseteq \dots \subseteq G_L \cap \bar g G_L = G$$

will include, for some ℓ̄, a set G_ℓ̄ ∩ ḡG_ℓ̄ such that ⟨gµ_{ℓ̄−1}(I), t⟩ ≠ 0 for g ∈ G_ℓ̄ ∩ ḡG_ℓ̄, since supp(⟨gµ_{ℓ̄−1}(I), t⟩) ⊆ G. Q.E.D.
Fig. 9: An image I with finite support may or may not be fully included in the receptive field of a single complex cell at layer n (more generally, the transformed image may not be included in the pooling range of the complex cell). However, there will be a higher layer such that the support of its neural response is included in the pooling range of a single complex cell.
Property 3: stability. Using the definition of stability given in [ 11 ], we can formulate the following theorem characterizing stability for the complex response:

Theorem 13. Stability. Let I, I′ ∈ X and µℓ the complex response at layer ℓ. Let the nonlinearity η be a Lipschitz function with Lipschitz constant Lη ≤ 1. Then

$$\|\mu_\ell(I) - \mu_\ell(I')\| \leq \|I - I'\|_{\mathcal H}, \qquad \forall \ell,\ \forall I, I' \in X, \qquad [45]$$

where ‖I − I′‖_H = min_{g,g′∈Gℓ} ‖gI − g′I′‖₂. The proof follows from a repeated application of the reasoning done in Theorem 5. See details in [5].
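A quick numerical check of this non-expansiveness, with fully pooled signatures on the cyclic group, unit-norm templates and η = | · | (Lipschitz constant 1); the RMS signature distance is compared against the orbit-aligned metric ‖I − I′‖_H (an illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(6)
G, K = 64, 16
T = rng.standard_normal((K, G))
T /= np.linalg.norm(T, axis=1, keepdims=True)          # unit-norm templates

def mu(I):
    # fully pooled signature: one component per template
    return np.array([np.mean([np.abs(np.dot(I, np.roll(t, g)))
                              for g in range(G)]) for t in T])

I = rng.standard_normal(G)
Ip = np.roll(I, 9) + 0.1 * rng.standard_normal(G)      # shifted, perturbed copy

lhs = np.linalg.norm(mu(I) - mu(Ip)) / np.sqrt(K)      # RMS signature distance
dH = min(np.linalg.norm(I - np.roll(Ip, s)) for s in range(G))
print(lhs <= dH, float(lhs), float(dH))                # stability holds
```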
Comparison with stability as in [16]. The same definition of stability we use (Lipschitz continuity) was recently given in [16], in a related context. Let I, I′ ∈ L²(R²) and Φ : L²(R²) → L²(R²) a representation. Φ is stable if it is Lipschitz continuous with Lipschitz constant L ≤ 1, i.e., is a non-expansive map:

$$\|\Phi(I) - \Phi(I')\|_2 \leq \|I - I'\|_2, \qquad \forall I, I' \in L^2(\mathbb{R}^2). \qquad [46]$$

In particular, in [16] the author is interested in the stability of group-invariant scattering representations to the action of small diffeomorphisms close to translations. Consider transformations of the form I′(x) = L_τ I(x) = I(x − τ(x)) (which can be thought of as small diffeomorphic transformations close to translations, implemented by a displacement field τ : R² → R²). A translation invariant operator Φ is said to be Lipschitz continuous to the action of C²(R²) diffeomorphisms if for any compact Ω ⊆ R² there exists C such that for all I ∈ L²(R²) supported in Ω and all τ ∈ C²(R²)

$$\|\Phi(I) - \Phi(L_\tau I)\|_2 \leq C\, \|I\|_2 \Big(\sup_{x \in \mathbb{R}^2} |\nabla \tau(x)| + \sup_{x \in \mathbb{R}^2} |H\tau(x)|\Big), \qquad [47]$$

where H is the Hessian and C a positive constant.

Condition [ 47 ] is a different condition than that in eq. [ 45 ], since it gives a Lipschitz bound for a diffeomorphic transformation at each layer of the scattering representation. Our approach differs in the assumption that small (close to identity) diffeomorphic transformations can be well approximated, at the first layer, as locally affine transformations or, in the limit, as local translations, which therefore fall in the POG case. This assumption is substantiated by the following reasoning, in which any smooth transformation is seen as parametrized by the parameter t (the r parameter of the T_r transformation in section 2), which can be thought of, e.g., as time.
Let T ⊆ R be a bounded interval and Ω ⊆ R^N an open set, and let Φ = (Φ1, ..., Φ_N) : T × Ω → R^N be C² (twice differentiable), where Φ(0, ·) is the identity map. Here R^N is assumed to model the image plane; intuitively we should take N = 2, but general values of N allow our result to apply in subsequent, more complex processing stages, for example continuous wavelet expansions, where the image is also parameterized in scale and orientation, in which case we should take N = 4. We write (t, x) for points in T × Ω, and interpret Φ(t, x) as the position in the image at time t of an observed surface feature which is mapped to x = Φ(0, x) at time zero. The map Φ results from the (not necessarily rigid) motions of the observed object, the motions of the observer and the properties of the imaging apparatus. The implicit assumption here is that no surface features which are visible in Ω at time zero are lost within the time interval T. The assumption that Φ is twice differentiable reflects assumed smoothness properties of the surface manifold, the fact that object and observer are assumed massive, and corresponding smoothness properties of the imaging apparatus, including eventual processing.

Now consider a closed ball B ⊂ Ω of radius δ > 0 which models the aperture of observation. We may assume B to be centered at zero, and we may equally take the time of observation to be t0 = 0 ∈ T. Let

$$K_t = \sup_{(t,x) \in T \times B} \Big\|\frac{\partial^2}{\partial t^2}\Phi(t, x)\Big\|_{\mathbb{R}^N}, \qquad K_x = \sup_{x \in B} \Big\|\frac{\partial^2}{\partial x\, \partial t}\Phi(0, x)\Big\|_{\mathbb{R}^{N \times N}}.$$

Here (∂/∂x) is the spatial gradient in R^N, so that the last expression is spelled out as

$$K_x = \sup_{x \in B}\left(\sum_{l=1}^N \sum_{i=1}^N \Big(\frac{\partial^2}{\partial x_i\, \partial t}\Phi_l(0, x)\Big)^2\right)^{1/2}.$$

Of course, by compactness of T × B and the C²-assumption, both K_t and K_x are finite. The following theorem is due to Maurer and Poggio:

Theorem 14. There exists V ∈ R^N such that for all (t, x) ∈ T × B

$$\|\Phi(t, x) - [x + tV]\|_{\mathbb{R}^N} \leq K_x\, \delta\, |t| + K_t\, \frac{t^2}{2}.$$

The proof reveals this to be just a special case of Taylor's theorem.

Proof: Denote V(t, x) = (V1, ..., V_N)(t, x) = (∂/∂t)Φ(t, x), V̇(t, x) = (V̇1, ..., V̇_N)(t, x) = (∂²/∂t²)Φ(t, x), and set V := V(0, 0). For s ∈ [0, 1] we have, with Cauchy-Schwarz,

$$\Big\|\frac{d}{ds}V(0, sx)\Big\|^2_{\mathbb{R}^N} = \sum_{l=1}^N\left(\sum_{i=1}^N \frac{\partial^2}{\partial x_i\,\partial t}\Phi_l(0, sx)\, x_i\right)^2 \leq K_x^2\, \|x\|^2 \leq K_x^2\, \delta^2,$$

whence

$$\begin{aligned} \|\Phi(t, x) - [x + tV]\| &= \Big\|\int_0^t V(s, x)\, ds - tV(0, 0)\Big\| \\ &= \Big\|\int_0^t\!\!\int_0^s \dot V(r, x)\, dr\, ds + t\,\big[V(0, x) - V(0, 0)\big]\Big\| \\ &\leq \int_0^t\!\!\int_0^s \Big\|\frac{\partial^2}{\partial t^2}\Phi(r, x)\Big\|\, dr\, ds + |t|\,\Big\|\int_0^1 \frac{d}{ds}V(0, sx)\, ds\Big\| \\ &\leq K_t\, \frac{t^2}{2} + K_x\, |t|\, \delta. \end{aligned}$$

Q.E.D.

Of course we are more interested in the visible features themselves than in the underlying point transformation. If I : R^N → R represents these features, for example as a spatial distribution of gray values observed at time t = 0, then we would like to estimate the evolved image I(Φ(t, x)) by a translate I(x + tV) of the original I. It is clear that this is possible only under some regularity assumption on I. The simplest one is that I is globally Lipschitz. We immediately obtain the following

Corollary 15. Under the above assumptions suppose that I : R^N → R satisfies |I(x) − I(y)| ≤ c‖x − y‖ for some c > 0 and all x, y ∈ R^N. Then there exists V ∈ R^N such that for all (t, x) ∈ T × B

$$|I(\Phi(t, x)) - I(x + tV)| \leq c\left(K_x\, |t|\, \delta + K_t\, \frac{t^2}{2}\right).$$

Theorem 14 and Corollary 15 give a precise mathematical motivation for the assumption that any sufficiently smooth (at least twice differentiable) transformation can be approximated, in a small enough compact set, by a group transformation (e.g. a translation), thus allowing, based on eq. [ 11 ], stability w.r.t. small diffeomorphic transformations.
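Theorem 14's bound is easy to probe numerically. The sketch below is an independent illustration, not from the paper: it takes Φ(t, x) to be an in-plane rotation by angle t, for which V = (∂/∂t)Φ(0, 0) = 0, K_t = δ, and K_x = √2 (Frobenius norm of the rotation generator), and checks the bound at random points of the aperture B.

```python
import numpy as np

def Phi(t, x):
    """Phi(t, x): rotation of the plane by angle t (a smooth non-translation)."""
    c, s = np.cos(t), np.sin(t)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

delta = 0.5                     # radius of the aperture B
Kx = np.sqrt(2.0)               # ||d^2 Phi / dx dt (0, x)||_F for rotation
Kt = delta                      # sup_B ||d^2 Phi / dt^2|| = sup ||x|| = delta
V = np.zeros(2)                 # (d/dt) Phi(0, 0) = 0 for rotation about 0

rng = np.random.default_rng(7)
worst = -np.inf
for _ in range(1000):
    v = rng.standard_normal(2)
    x = v / np.linalg.norm(v) * delta * rng.uniform()   # random point in B
    t = rng.uniform(-0.5, 0.5)
    err = np.linalg.norm(Phi(t, x) - (x + t * V))
    bound = Kx * delta * abs(t) + Kt * t**2 / 2
    worst = max(worst, err - bound)
print(worst <= 0)               # True: the Taylor bound of Theorem 14 holds
```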
Approximate Factorization: hierarchy. In the first version of [5] we conjectured that a signature invariant to a group of transformations could be obtained by factorizing in successive layers the computation of signatures invariant to a subgroup of the transformations (e.g. the subgroup of translations of the affine group) and then adding the invariance w.r.t. another subgroup (e.g. rotations). While factorization of invariance ranges is possible in a hierarchical architecture (Theorem 12), it can be shown that in general the factorization in successive layers, for instance of invariance to translation followed by invariance to rotation (by subgroups), is impossible [5]. However, approximate factorization is possible under the same conditions as in the previous section. In fact, a transformation that can be linearized piecewise can always be performed in higher layers, on top of other transformations, since the global group structure is not required but weaker smoothness properties are sufficient.

Why Hierarchical architectures: a summary.
1. Optimization of local connections and optimal reuse of computational elements. Despite the high number of synapses on each neuron, it would be impossible for a complex cell to pool information across all the simple cells needed to cover an entire image.
2. Compositionality. A hierarchical architecture provides signatures of larger and larger patches of the image in terms of lower level signatures. Because of this, it can access memory in a way that matches naturally the linguistic ability to describe a scene as a whole and as a hierarchy of parts.
3. Approximate factorization. In architectures such as the network sketched in Fig. 1 in the main text, approximate invariance to transformations specific to an object class can be learned and computed in different stages. This property may provide an advantage in terms of the sample complexity of multistage learning [49]. For instance, approximate class-specific invariance to pose (e.g. for faces) can be computed on top of a translation-and-scale-invariant representation [6]. Thus the implementation of invariance can, in some cases, be “factorized” into different steps corresponding to different transformations (see also [50, 51] for related ideas).

Probably all three properties together are the reason evolution developed hierarchies.
4. Synopsis of Mathematical Results

List of Theorems/Results
• Orbits are equivalent to probability distributions P_I, and both are invariant and unique.
Theorem. The distribution P_I is invariant and unique, i.e. I ∼ I′ ⇔ P_I = P_{I′}.

• P_I can be estimated within ε in terms of 1D probability distributions of ⟨gI, t^k⟩.
Theorem. Consider n images X_n in X. Let $K \geq \frac{2}{c\varepsilon^2}\log\frac{n}{\delta}$, where c is a universal constant. Then

$$|d(P_I, P_{I'}) - \hat d_K(P_I, P_{I'})| \leq \varepsilon,$$

with probability 1 − δ², for all I, I′ ∈ X_n.
• Invariance from a single image based on memory of template transformations. The simple property

$$\langle gI, t^k\rangle = \langle I, g^{-1}t^k\rangle$$

implies (for unitary groups, without any additional property) that the signature components µ^k_n(I) = (1/|G|) Σ_{g∈G} ηn(⟨I, gt^k⟩), calculated on template transformations, are invariant, that is, µ^k_n(I) = µ^k_n(ḡI).
• Condition in eq. [ 48 ] on the dot product between image and template implies invariance for Partially Observable Groups (observed through a window) and is equivalent to it in the case of translation and scale transformations.
Theorem. Let I, t ∈ H, a Hilbert space, η : R → R+ a bijective (positive) function, and G a locally compact group. Let G0 ⊆ G and suppose supp(⟨gI, t⟩) ⊆ G0. Then

$$\langle gI, t^k\rangle = 0,\ \forall g \in G/(G_0 \cap \bar g G_0) \quad\text{or equivalently}\quad \langle gI, t^k\rangle \neq 0,\ \forall g \in G_0 \cap \bar g G_0 \;\Rightarrow\; \mu^k_n(I) = \mu^k_n(\bar g I). \qquad [48]$$
• Condition in Theorem 6 is equivalent to a localization or sparsity property of the dot product between image and template (⟨I, gt⟩ = 0 for g ∉ G_L, where G_L is the subset of G where the dot product is localized). In particular
Proposition. Localization is necessary and sufficient for translation and scale invariance. Localization for translation (respectively scale) invariance is equivalent to the support of t being small in space (respectively in frequency).
• Optimal simultaneous invariance to translation and scale can be achieved by Gabor templates.
Theorem. Assume invariants are computed from pooling within a pooling window with a set of linear filters. Then the optimal templates (e.g. filters) for maximum simultaneous invariance to translation and scale are Gabor functions $t(x) = e^{-\frac{x^2}{2\sigma^2}}\, e^{i\omega_0 x}$.
• Approximate invariance can be obtained if there is approximate sparsity of the image in the dictionary of templates. Approximate localization (defined as ⟨t, gt⟩ < δ for g ∉ G_L, where δ is small, in the order of ≈ 1/√n, and ⟨t, gt⟩ ≈ 1 for g ∈ G_L) is satisfied by templates (vectors of dimensionality n) that are similar to images in the set and are sufficiently “rich” to be incoherent for “small” transformations.
• Approximate invariance for smooth (non group) transformations.
Proposition. µ^k(I) is locally invariant if
– I is sparse in the dictionary t^k;
– I and t^k transform in the same way (belong to the same class);
– the transformation is sufficiently smooth.
• Sparsity of I in the dictionary t^k under G increases with the size of the neural images and provides invariance to clutter. The definition is ⟨I, gt⟩ < δ for g ∉ G_L, where δ is small, in the order of ≈ 1/√n, and ⟨I, gt⟩ ≈ 1 for g ∈ G_L. Sparsity of I in t^k under G improves with the dimensionality n of the space and with noise-like encoding of I and t. If n1, n2 are additive uncorrelated spatial noisy clutter, ⟨I + n1, gt + n2⟩ ≈ ⟨I, gt⟩.
• Covariance of the hierarchical architecture.
Proposition. The operator µℓ is covariant with respect to a (in general non-abelian) group transformation, that is, µℓ(gI) = gµℓ(I).
• Factorization.
Proposition. Invariance to separate subgroups of the affine group cannot be obtained in a sequence of layers, while factorization of the ranges of invariance can (because of covariance). Invariance to a smooth (non group) transformation can always be performed in higher layers, on top of other transformations, since the global group structure is not required.
• Uniqueness of signature.
Conjecture. The neural image at the first layer is uniquely represented by the final signature at the top of the hierarchy together with the means and norms at each layer.
5. General Remarks on the Theory
1. The second regime of localization (sparsity) can be considered as a way to deal with situations that do not fall
under the general rules (group transformations) by creating a series of exceptions, one for each object class.
2. Whereas the first regime “predicts” Gabor tuning of neurons in the first layers of sensory systems, the second
regime predicts cells that are tuned to much more complex features, perhaps similar to neurons in inferotemporal cortex.
3. The sparsity condition under the group is related to properties used in associative memories for instance of the
holographic type (see [44]). If the sparsity condition holds
only for I = tk and for very small a then it implies strictly
memory-based recognition.
4. The theory is memory-based. It is also view-based. Even assuming 3D images (for instance by using stereo information), the various stages will be based on the use of 3D views and on stored sequences of 3D views.
5. The mathematics of the class-specific modules at the top
of the hierarchy – with the underlying localization condition – justifies old models of viewpoint-invariant recognition (see [52]).
6. The remark on factorization of general transformations
implies that layers dealing with general transformations
can be on top of each other. It is possible – as empirical
results by Leibo and Li indicate – that a second layer can
improve the invariance to a specific transformation of a
lower layer.
7. The theory developed here for vision also applies to other
sensory modalities, in particular speech.
8. The theory represents a general framework for using representations that are invariant to transformations that
are learned in an unsupervised way in order to reduce the
sample complexity of the supervised learning step.
9. Simple cells (e.g. templates) under the action of the affine
group span a set of positions and scales and orientations.
The size of their receptive fields therefore spans a range.
The pooling window can be arbitrarily large – and this
does not affect selectivity when the CDF is used for pooling. A large pooling window implies that the signature
is given to large patches and the signature is invariant to
uniform affine transformations of the patches within the
window. A hierarchy of pooling windows provides signature to patches and subpatches and more invariance (to
more complex transformations).
10. Connections with the Scattering Transform.
• Our theorems about optimal invariance to scale and
translation implying Gabor functions (first regime)
may provide a justification for the use of Gabor
wavelets by Mallat [16], that does not depend on the
specific use of the modulus as a pooling mechanism.
• Our theory justifies several different kinds of pooling
of which Mallat’s seems to be a special case.
• With the choice of the modulus as a pooling mechanism, Mallat proves a nice property of Lipschitz continuity on diffeomorphisms. Such a property is not valid in general for our scheme, where it is replaced by a hierarchical parts-and-wholes property, which can be regarded as an approximation, as refined as desired, of the continuity w.r.t. diffeomorphisms.
• Our second regime does not have an obvious corre-
sponding notion in the scattering transform theory.
11. The theory characterizes under which conditions the signature provided by a HW module at some level of the
hierarchy is invariant and therefore could be used for
retrieving information (such as the label of the image
patch) from memory. The simplest scenario is that signatures from modules at all levels of the hierarchy (possibly
not the lowest ones) will be checked against the memory. Since there are of course many cases in which the
signature will not be invariant (for instance when the relevant image patch is larger than the receptive field of the
module) this scenario implies that the step of memory
retrieval/classification is selective enough to discard efficiently the “wrong” signatures that do not have a match
in memory. This is a nontrivial constraint. It probably
implies that signatures at the top level should be matched
first (since they are the most likely to be invariant and
they are fewer) and lower level signatures will be matched
next possibly constrained by the results of the top-level
matches – in a way similar to reverse hierarchies ideas. It
also has interesting implications for appropriate encoding
of signatures to make them optimally quasi-orthogonal
e.g. incoherent, in order to minimize memory interference. These properties of the representation depend on
memory constraints and will be object of a future paper
on memory modules for recognition.
12. There is psychophysical and neurophysiological evidence that the brain employs such learning rules (e.g. [53, 54] and references therein). A second step of Hebbian learning may be responsible for wiring complex cells to simple cells that are activated in close temporal contiguity
and thus correspond to the same patch of image undergoing a transformation in time [55]. Simulations show that
the system could be remarkably robust to violations of
the learning rule’s assumption that temporally adjacent
images correspond to the same object [57]. The same
simulations also suggest that the theory described here
is qualitatively consistent with recent results on plasticity of single IT neurons and with experimentally-induced
disruptions of their invariance [54].
13. Simple and complex units do not need to correspond to
different cells: it is conceivable that a simple cell may be
a cluster of synapses on a dendritic branch of a complex
cell with nonlinear operations possibly implemented by
active properties in the dendrites.
14. Unsupervised learning of the template orbit. While the
templates need not be related to the test images (in the
affine case), during development, the model still needs to
observe the orbit of some templates. We conjectured that
this could be done by unsupervised learning based on the
temporal adjacency assumption [55, 56]. One might ask,
do “errors of temporal association” happen all the time
over the course of normal vision? Lights turn on and off,
objects are occluded, you blink your eyes – all of these
should cause errors. If temporal association is really the
method by which all the images of the template orbits are
associated with one another, why doesn’t the fact that its
assumptions are so often violated lead to huge errors in
invariance?
The full orbit is needed, at least in theory. In practice we have found that significant scrambling is possible as long as the errors are not correlated. That is, normally an HW module would pool all the ⟨I, g_i t^k⟩; we tested the effect of, for some i, replacing t^k with a different template t^{k′}. Even scrambling 50% of our model's connections in this manner yielded only very small effects on performance.
These experiments were described in more detail in [57]
for the case of translation. In that paper we modeled
Li and DiCarlo’s ”invariance disruption” experiments in
which they showed that a temporal association paradigm
can induce individual IT neurons to change their stimulus preferences under specific transformation conditions
[54, 58]. We also report similar results on another ”nonuniform template orbit sampling” experiment with 3D
rotation-in-depth of faces in [7].
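The following toy sketch (in Python) illustrates the matching scheme of remark 11. The quasi-orthogonal random signatures, the cosine-similarity threshold standing in for the selectivity of memory retrieval, and all sizes are our illustrative assumptions, not part of the theory.

import numpy as np

rng = np.random.default_rng(0)

def match(signature, memory, threshold=0.9):
    # Return the index of the best-matching stored signature, or None if
    # nothing in memory is similar enough (the "wrong", non-invariant case).
    sims = memory @ signature / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(signature) + 1e-12)
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None

top_memory = rng.standard_normal((5, 64))    # stored top-level signatures
low_memory = rng.standard_normal((50, 32))   # stored lower-level signatures

query_top = top_memory[2] + 0.05 * rng.standard_normal(64)
query_low = [low_memory[i] + 0.05 * rng.standard_normal(32) for i in (7, 8)]

label = match(query_top, top_memory)         # top level is matched first ...
if label is not None:                        # ... and, only if it succeeds,
    parts = [match(s, low_memory) for s in query_low]  # parts are matched
    print("object", label, "with parts", parts)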
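The wiring step of remark 12 can be caricatured with a Földiák-style trace rule [55]. In the minimal sketch below, a single complex cell watches a toy stimulus translate across ten simple cells, and a temporally smoothed trace of the cell’s response gates a Hebbian update; the stimulus, the constants and the weight clipping are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n_simple, T = 10, 2000
w = 0.1 * rng.random(n_simple)     # complex-cell weights onto simple cells
trace = 0.0

for t in range(T):
    x = np.zeros(n_simple)
    x[(t // 4) % n_simple] = 1.0   # one object translating across the field
    y = w @ x                      # complex-cell response
    trace = 0.8 * trace + 0.2 * y  # slow trace bridges temporally nearby frames
    w = np.clip(w + 0.05 * trace * x, 0.0, 1.0)  # trace-gated Hebbian update

print(np.round(w, 2))  # weights spread over every simple cell the orbit visits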
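Finally, the scrambling experiment of remark 14 can be sketched as follows, with 1D circular translations and random templates standing in for the model of [57]; the 50% corruption rate matches the experiment, and everything else is an illustrative assumption. The clean module is exactly invariant, while the scrambled one degrades only mildly.

import numpy as np

rng = np.random.default_rng(2)
d, K = 32, 8                                 # image size, number of templates
templates = rng.standard_normal((K, d))

def signature(img, orbits):
    # Component k pools (max) the dot products of img with template k's orbit.
    return np.array([max(o @ img for o in orbit) for orbit in orbits])

clean = [[np.roll(t, i) for i in range(d)] for t in templates]
scrambled = [list(orbit) for orbit in clean]
for k in range(K):
    for i in rng.choice(d, d // 2, replace=False):  # corrupt 50% of the cells
        scrambled[k][i] = np.roll(templates[rng.integers(K)], i)

img = rng.standard_normal(d)
shifted = np.roll(img, 11)                   # the same image, translated
for name, orbits in (("clean", clean), ("scrambled", scrambled)):
    s1, s2 = signature(img, orbits), signature(shifted, orbits)
    print(name, "invariance error:",
          round(float(np.linalg.norm(s1 - s2) / np.linalg.norm(s1)), 3))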
6. Empirical support for the theory
The theory presented here was inspired by a set of related
computational models for visual recognition, dating from
1980 to the present day. While differing in many details,
HMAX, Convolutional Networks [31], and related models
use similar structural mechanisms to hierarchically compute
translation (and sometimes scale) invariant signatures for
progressively larger pieces of an input image, completely in
accordance with the present theory.
With the theory in hand, and the deeper understanding of invariance it provides, we have now begun to develop
a new generation of models that incorporate invariance to
larger classes of transformations.
Existing models. Fukushima’s Neocognitron [3] was the first
of a class of recognition models consisting of hierarchically
stacked modules of simple and complex cells (a “convolutional” architecture). This class has grown to include Convolutional Networks, HMAX, and others [14, 59]. Many of
the best performing models in computer vision are instances
of this class. For scene classification with thousands of labeled examples, the best performing models are currently
Convolutional Networks [34]. A variant of HMAX [29] scores
74% on the Caltech 101 dataset, competitive with the state-of-the-art for a single feature type. Another HMAX variant
added a time dimension for action recognition [60], outperforming both human annotators and a state-of-the-art commercial system on a mouse behavioral phenotyping task. An
HMAX model [30] was also shown to account for human performance in rapid scene categorization. A simple illustrative
empirical demonstration of the HMAX properties of invariance, stability and uniqueness is in figure 10.
All of these models work very similarly once they have
been trained. They all have a convolutional architecture and
compute a high-dimensional signature for an image in a single bottom-up pass. At each level, complex cells pool over
sets of simple cells which have the same weights but are centered at different positions (and for HMAX, also scales). In
the language of the present theory, for these models the group G is the set of 2D translations in x and y (3D if scaling is included), and complex cells pool over partial orbits of this group, typically outputting a single moment of the distribution, usually the sum or the max.
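A minimal one-dimensional sketch of such an HW module (random template and illustrative sizes; not the implementation of any of the cited models):

import numpy as np

rng = np.random.default_rng(3)
template = rng.standard_normal(7)      # shared simple-cell weight vector
image = np.zeros(64)
image[20:27] = rng.standard_normal(7)  # an "object" at one position

def hw_module(x, t, pool=np.max):
    n = len(x) - len(t) + 1
    simple = np.array([x[i:i + len(t)] @ t for i in range(n)])  # simple cells
    return pool(simple)          # complex cell: one moment, here the max

shifted = np.roll(image, 9)      # the same object, translated
print(hw_module(image, template), hw_module(shifted, template))  # equal

Translating the object merely permutes the simple-cell responses, so the pooled value, and hence the signature, is unchanged.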
The biggest difference among these models lies in the
training phase. The complex cells are fixed, always pooling
only over position (and scale), but the simple cells learn their
weights (templates) in a number of different ways. Some
models assume the first level weights are Gabor filters, mimicking cortical area V1. Weights can also be learned via
backpropagation, via sampling from training images, or even
by generating random numbers. Common to all these models is the notion of automatic weight sharing: at each level
i of the hierarchy, the Ni simple cells centered at any given
position (and scale) have the same set of Ni weight vectors
as do the Ni simple cells for every other position (and scale).
Weight sharing occurs by construction, not by learning; the resulting model, however, is equivalent to one that learned its weights by observing Ni different objects translating (and scaling) everywhere in the visual field, as the following sketch illustrates.
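A minimal sketch of weight sharing by construction (illustrative sizes, random weights): the full table of simple-cell responses at a level is just the valid cross-correlation of the input with each of the Ni shared weight vectors.

import numpy as np

rng = np.random.default_rng(5)
image = rng.standard_normal(64)
N, w = 4, 7
bank = rng.standard_normal((N, w))   # the N shared weight vectors (templates)

responses = np.array([np.correlate(image, t, mode="valid") for t in bank])
print(responses.shape)  # (4, 58): the same N templates applied at 58 positions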
One of the observations that inspired our theory is that
in convolutional architectures, random features can often
perform nearly as well as features learned from objects
[61, 62, 13, 59] – the architecture often matters more than the
particular features computed. We postulated that this was
due to the paramount importance of invariance. In convolutional architectures, invariance to translation and scaling is
a property of the architecture itself, and all objects in images translate and scale in the same way.
New models. Using the principles of invariant recognition made explicit by the present theory, we have begun to develop models that incorporate invariance to more complex transformations which, unlike translation and scaling, cannot be provided by the architecture of the network alone but must be learned from examples of objects undergoing the transformations. Two examples follow.
Faces rotating in 3D. In [6], we added a third HW layer to an existing HMAX model which was already invariant to translation and scaling. This third layer modeled invariance to rotation in depth for faces. Rotation in depth is a difficult transformation because of self-occlusion. Invariance to it cannot be derived from the network architecture, nor can it be learned generically for all objects. Faces are an important class for which specialized areas are known to exist in higher regions of the ventral stream. We showed that by pooling over stored views of template faces undergoing this transformation, we can recognize novel faces from a single example view, robustly with respect to rotations in depth.
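A toy sketch of this view pooling, with powers of a random orthogonal matrix Q standing in for rotation in depth and random vectors standing in for faces (all of these are illustrative assumptions, not the actual model of [6]): the pooled signature of a novel face barely changes when the face is "rotated".

import numpy as np

rng = np.random.default_rng(4)
d, K, V = 40, 6, 12                   # dim, template faces, stored views each
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]   # one "rotation" step

def rotate(x, a):
    return np.linalg.matrix_power(Q, a) @ x

templates = rng.standard_normal((K, d))
stored = np.array([[rotate(t, v) for v in range(V)] for t in templates])

def signature(x):
    # One component per template face: pool (max) over its stored views.
    return (stored @ x).max(axis=1)

novel = rng.standard_normal(d)        # a face never seen before
s0 = signature(novel)
s2 = signature(rotate(novel, 2))      # the same face, rotated in "depth"
print("relative change:",
      round(float(np.linalg.norm(s0 - s2) / np.linalg.norm(s0)), 3))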
Faces undergoing unconstrained transformations.
Another model [7] inspired by the present theory recently advanced the state-of-the-art on the Labeled Faces in the Wild dataset, a challenging same-person / different-person task.
Starting this time with a first layer of HOG features [63],
the second layer of this model built invariance to translation, scaling, and limited in-plane rotation, leaving the third
layer to pool over variability induced by other transformations. Performance results for this model are shown in figure
3 in the main text.
Fig. 10: Empirical demonstration of the properties of invariance, stability and uniqueness of the hierarchical architecture in a specific two-layer implementation (HMAX). Inset (a) shows the reference image on the left and a deformation of it (the eyes are closer to each other) on the right; (b) shows that an HW module in layer 1 whose receptive field covers the left eye provides a signature vector (C1) which is invariant to the deformation; in (c) an HW module at layer 2 (C2) whose receptive field contains the whole face provides a signature vector which is (Lipschitz) stable with respect to the deformation. In all cases, the figure shows just the Euclidean norm of the signature vector. Notice that the C1 and C2 vectors are not only invariant but also selective. Error bars represent ±1 standard deviation. Two different images (d) are presented at various locations in the visual field. The Euclidean distance between the signatures of a set of HW modules at layer 2 with the same receptive field (the whole image) and a reference vector is shown in (e). The signature vector is invariant to global translation and discriminative (between the two faces). In this example the HW module represents the top of a hierarchical, convolutional architecture. The images used were 200×200 pixels.
1. D.H. Hubel and T.N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160, 1962.
2. M. Riesenhuber and T. Poggio. Models of object recognition. Nature Neuroscience, 3(11), 2000.
3. K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, Apr. 1980.
4. Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
5. F. Anselmi, J.Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Magic materials: a theory of deep hierarchical architectures for learning sensory representations. CBCL paper, Massachusetts Institute of Technology, Cambridge, MA, April 1, 2013.
6. J.Z. Leibo, J. Mutch, and T. Poggio. Why the brain separates face recognition from object recognition. In Advances in Neural Information Processing Systems (NIPS), Granada, Spain, 2011.
7. Q. Liao, J.Z. Leibo, and T. Poggio. Learning invariant representations and applications to face verification. NIPS, to appear, 2013.
8. T. Lee and S. Soatto. Video-based descriptors for object recognition. Image and Vision Computing, 2012.
9. H. Schulz-Mirbach. Constructing invariant features by averaging techniques. In Pattern Recognition, 1994. Vol. 2 - Conference B: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on, volume 2, pages 387–390, 1994.
10. H. Cramér and H. Wold. Some theorems on distribution functions. J. London Math. Soc., 4:290–294, 1936.
11. W. Pitts and W. McCulloch. How we know universals: the perception of auditory and visual forms. Bulletin of Mathematical Biophysics, 9(3):127–147, 1947.
12. A. Koloydenko. Symmetric measures via moments. Bernoulli, 14(2):362–390, 2008.
13. K. Jarrett, K. Kavukcuoglu, M.A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? IEEE International Conference on Computer Vision, pages 2146–2153, 2009.
14. N. Pinto, D. Doukhan, J.J. DiCarlo, and D.D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5, 2009.
15. B.A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
16. S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
17. S. Soatto. Steps towards a theory of visual information: Active perception, signal-to-symbol conversion and the interplay between sensing and control. arXiv:1110.2053, pages 0–151, 2011.
18. S. Smale, L. Rosasco, J. Bouvrie, A. Caponnetto, and T. Poggio. Mathematics of the neural response. Foundations of Computational Mathematics, 10(1):67–91, 2010.
19. T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio. A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex. CBCL Paper #259/AI Memo #2005-036, 2005.
20. S.S. Chikkerur, T. Serre, C. Tan, and T. Poggio. What and where: A Bayesian inference theory of attention. Vision Research, May 2010.
21. D. George and J. Hawkins. A hierarchical Bayesian model of invariant pattern recognition in the visual cortex. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), volume 3, pages 1812–1817, 2005.
22. S. Geman. Invariance and selectivity in the ventral visual pathway. Journal of Physiology-Paris, 100(4):212–224, 2006.
23. W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophysics, 5:115–133, 1943.
24. E. Adelson and J. Bergen. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2):284–299, 1985.
25. N. Kanwisher. Functional specificity in the human brain: a window into the functional architecture of the mind. Proceedings of the National Academy of Sciences, 107(25):11163, 2010.
26. D.Y. Tsao and W.A. Freiwald. Faces and objects in macaque cerebral cortex. Nature Neuroscience, 6(9):989–995, 2003.
27. J.Z. Leibo, F. Anselmi, J. Mutch, A.F. Ebihara, W. Freiwald, and T. Poggio. View-invariance and mirror-symmetric tuning in a model of the macaque face-processing system. Computational and Systems Neuroscience, I-54, 2013.
28. D. Marr and T. Poggio. From understanding computation to understanding neural circuitry. AIM-357, 1976.
29. J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. Computer Vision and Pattern Recognition, 1:11–18, 2006.
30. T. Serre, A. Oliva, and T. Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences of the United States of America, 104(15):6424–6429, 2007.
31. Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, pages 255–258, 1995.
32. Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–97. IEEE, 2004.
33. O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural network concepts to hybrid NN-HMM model for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4277–4280. IEEE, 2012.
34. A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
35. Q.V. Le, R. Monga, M. Devin, G. Corrado, K. Chen, M. Ranzato, J. Dean, and A.Y. Ng. Building high-level features using large scale unsupervised learning. CoRR, abs/1112.6209, http://arxiv.org/abs/1112.6209, 2011.
36. F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2002.
37. J. Cuesta-Albertos, R. Fraiman, and T. Ransford. A sharp form of the Cramér–Wold theorem. Journal of Theoretical Probability, 20:201–209, 2007.
38. J. Cuesta-Albertos. How many random projections suffice to determine a probability distribution? IPMs sections, 2009.
39. A. Heppes. On the determination of probability distributions of more dimensions by their projections. Acta Mathematica Hungarica, 7(3):403–410, 1956.
40. D.L. Donoho and P.B. Stark. Uncertainty principles and signal recovery. SIAM J. Appl. Math., 49(3):906–931, 1989.
41. D. Gabor. Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, 93(26):429–441, 1946.
42. C. Stevens. Preserving properties of object shape by computations in primary visual cortex. PNAS, 101(43), 2004.
43. J.P. Antoine, R. Murenzi, and P. Vandergheynst. Two-Dimensional Wavelets and their Relatives. Cambridge Univ. Press, Cambridge, 2008.
44. T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19(4):201–209, 1975.
45. T. Plate. Holographic reduced representations: Convolution algebra for compositional distributed representations. International Joint Conference on Artificial Intelligence, pages 30–35, 1991.
46. T. Poggio. A theory of how the brain might work. Cold Spring Harb Symp Quant Biol, 1990.
47. T. Poggio and T. Vetter. Recognition and structure from one 2D model view: Observations on prototypes, object classes and symmetries. MIT AI Lab, 1992.
48. L. Isik, E.M. Meyers, J.Z. Leibo, and T. Poggio. The timing of invariant object recognition in the human visual system. Submitted, 2013.
49. T. Poggio and S. Smale. The mathematics of learning: Dealing with data. Notices of the American Mathematical Society (AMS), 50(5):537–544, 2003.
50. D. Arathorn. Computation in the higher visual cortices: Map-seeking circuit theory and application to machine vision. In Proceedings of the 33rd Applied Imagery Pattern Recognition Workshop, AIPR '04, pages 73–78, Washington, DC, USA, 2004. IEEE Computer Society.
51. L. Sifre and S. Mallat. Combined scattering for rotation invariant texture analysis, 2012.
52. T. Poggio and S. Edelman. A network that learns to recognize three-dimensional objects. Nature, 343(6255):263–266, 1990.
53. G. Wallis and H.H. Bülthoff. Effects of temporal association on recognition memory. Proceedings of the National Academy of Sciences of the United States of America, 98(8):4800–4804, Apr. 2001.
54. N. Li and J.J. DiCarlo. Unsupervised natural experience rapidly alters invariant object representation in visual cortex. Science, 321(5895):1502–1507, Sept. 2008.
55. P. Földiák. Learning invariance from transformation sequences. Neural Computation, 3(2):194–200, 1991.
56. L. Wiskott and T.J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
57. L. Isik, J.Z. Leibo, and T. Poggio. Learning and disrupting invariance in visual recognition with a temporal association rule. Frontiers in Computational Neuroscience, 2, 2012.
58. N. Li and J.J. DiCarlo. Unsupervised natural visual experience rapidly reshapes size-invariant object representation in inferior temporal cortex. Neuron, 67(6):1062–1075, 2010.
59. D. Yamins, H. Hong, C.F. Cadieu, and J.J. DiCarlo. Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream. NIPS, pages 3093–3101, 2013.
60. H. Jhuang, E. Garrote, J. Mutch, X. Yu, V. Khilnani, T. Poggio, A. Steele, and T. Serre. Automated home-cage behavioural phenotyping of mice. Nature Communications, 1:68, doi:10.1038/ncomms1064, 2010.
61. J.Z. Leibo, J. Mutch, L. Rosasco, S. Ullman, and T. Poggio. Learning generic invariances in object recognition: Translation and scale. MIT-CSAIL-TR-2010-061, CBCL-294, 2010.
62. A. Saxe, P.W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Ng. On random weights and unsupervised feature learning. Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1089–1096, 2011.
63. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. International Conference on Computer Vision & Pattern Recognition, 2:886–893, 2005.