
Using Genetic Algorithm to “Fool” HMAX Object
Recognition Model
By
Maysun Mazhar Hasan
S.B., Electrical Engineering, M.I.T., 2010
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2012
Copyright 2012 Massachusetts Institute of Technology
All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Maysun M. Hasan, Department of Electrical Engineering and Computer Science
May 21st, 2012
Certified by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Prof. Tomaso Poggio, Eugene McDermott Professor in the Brain Sciences and Human Behavior,
Thesis Supervisor
May 21st, 2012
Certified by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Victor Chan, Corporate Research & Development, Qualcomm Inc.,
Company Thesis Supervisor
May 21st, 2012
Accepted by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Prof. Dennis M. Freeman, Chairman, Master of Engineering Thesis Committee
May 21st, 2012
Using Genetic Algorithm to “Fool” HMAX Object
Recognition Model
by
Maysun Mazhar Hasan
Submitted to the Department of Electrical Engineering
and Computer Science on May 21st, 2012
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
The HMAX ("Hierarchical Model and X") system is among the best machine vision
approaches available today for many object recognition tasks [1]. HMAX decomposes an
image into features which are passed to a classifier. Each feature captures information
about a small section of the input image, but the features might not carry information about the
overall structure of the image unless a significant number of them overlap. The system can
therefore produce a false positive when two images from different classes have sufficiently
similar feature profiles but completely different structures. To demonstrate the problem, this
thesis aimed to show that the features of a given subject are not unique because they lack
geometric information. A genetic algorithm (GA) was used to create an image with a feature
profile similar to a subject's but which clearly does not belong to the subject. Using the GA,
random pixel images converged to an image whose feature profile has a small Euclidean distance
from a target profile. This generated GA image does not resemble the target image but has a
similar profile, which successfully fooled the classifier in most cases. This implies that the
“binding problem” is a major issue in an HMAX model of the size tested. Furthermore, methods
of improving the system were investigated.
Thesis Supervisor: Tomaso Poggio, Ph.D.
Title: Eugene McDermott Professor in the Brain Sciences and Human Behavior,
Director of the Center for Biological and Computational Learning, MIT
Thesis Supervisor: Victor Chan
Title: Corporate Research & Development, Qualcomm Inc.
Acknowledgements
I wish to express my sincere gratitude to all those who have helped me in writing this
thesis. First and foremost, I would like to thank Victor Chan. I owe him my deepest gratitude
for giving me the opportunity to work on this project as well as for all his guidance and help.
There is no doubt that this thesis would not have been possible without him. I am indebted to
him, and to Yinyin Liu and Thomas Zheng, who have also helped me and taught me so much
throughout my time at Qualcomm. They have truly made me feel a part of the team. I would
also like to thank my officemates and fellow interns, Bardia Behabadi and Corinne Teeter, who
have so patiently answered all my questions about neuroscience, no matter how trivial. I would
like to extend my thanks to everyone at Qualcomm, especially Rob Gilmore, who gave me the
great opportunity to work there.
I would like to show my gratitude to Professor Tomaso Poggio, my thesis supervisor, for
his guidance. It was an honor for me to have worked with him. Additionally, I would like to
take this opportunity to thank MIT for the opportunity to study, work, and live here for the past
6 years. It has been an amazing journey and I have been fundamentally altered by my time here.
On a personal level, I would like to thank my family for all their love. I have
undoubtedly neglected them during this writing process, but they have supported me regardless. I
would not be where I am today without them. They are my inspiration and my safety net. I
would also like to thank all my friends, especially those who helped me focus when my mind
wandered and distracted me when it felt like there were impossible hurdles ahead. Last but not
least, I would like to thank God for all the opportunities I have been given. This thesis focuses on
two very complex and effective systems that mimic natural processes, and by studying them I
have gained an appreciation of their intricacies and am in awe of the world around me.
I am grateful to everyone for all their help. I cannot express my thanks enough.
Table of Contents
Abstract ........................................................................................................................................... 2
Acknowledgements ......................................................................................................................... 3
Chapter 1: Introduction ................................................................................................................... 6
Chapter 2: Background in Computer Vision .................................................................................. 9
Chapter 3: Background in Human Vision ..................................................................................... 12
Chapter 4: HMAX Object Recognition System ...................................................... 15
How HMAX Works .................................................................................................................. 15
HMAX Performance ................................................................................................................. 17
Chapter 5: Drawbacks to HMAX: The Binding Problem ............................................................. 19
Chapter 6: Finding a way to “Fool” HMAX................................................................................. 22
Global Optimizations ................................................................................................................ 24
Genetic Algorithm .................................................................................................................... 25
Chapter 7: Implementation ........................................................................................................... 30
HMAX on GPU ........................................................................................................................ 33
Chapter 8: Results and Discussion ................................................................................................ 34
Genetic Algorithm Results ........................................................................................................ 34
Issues Regarding the Process .................................................................................................... 36
Wickelgren’s Approach to the Binding Problem ..................................................... 37
Chapter 9: Further Investigation ................................................................................................... 41
STDP Macro Features ............................................................................................................... 41
Location Based System ............................................................................................................. 43
Chapter 10: Conclusion and Future Work .................................................................................... 46
Works Cited .................................................................................................................... 48
Table of Figures
Figure 1. ........................................................................................................................................ 15
Figure 2. ........................................................................................................................................ 16
Figure 3 ......................................................................................................................................... 20
Figure 4. ........................................................................................................................................ 22
Figure 5. ........................................................................................................................................ 23
Figure 6. ........................................................................................................................................ 29
Figure 7. ........................................................................................................................................ 30
Figure 8 ......................................................................................................................................... 31
Figure 9. ........................................................................................................................................ 34
Figure 10. ...................................................................................................................................... 35
Figure 11. ...................................................................................................................................... 36
Figure 12. ...................................................................................................................................... 37
Figure 13. ...................................................................................................................................... 37
Figure 14. ...................................................................................................................................... 38
Figure 15. ...................................................................................................................................... 39
Figure 16. ...................................................................................................................................... 40
Figure 17. ...................................................................................................................................... 42
Figure 18. ...................................................................................................................................... 43
Figure 19. ...................................................................................................................................... 44
Figure 20. ...................................................................................................................................... 45
List of Tables
Table 1 .......................................................................................................................................... 17
Table 2 .......................................................................................................................................... 45
To my parents,
who have worked so hard
and sacrificed so much for me and my education.
Chapter 1: Introduction
Human beings have the ability to recognize a multitude of objects with very little effort
regardless of variations in conditions. An object may vary somewhat in orientation, scale, or
even be partially obstructed, but this often does not impair our ability to recognize it. This task,
however, is still a challenge for computer vision systems in general. In generic object
recognition, humans and other primates outperform any computer vision system on almost all
measures. The reason is primarily the significant variation exhibited in real-world images.
Thus, the development of a robust object recognition system is necessary.
Object recognition can be divided into two fundamental vision problems: what is the
object in the image, and where in the image is it. This thesis will address the “what” question by
modifying the HMAX system [2] for object recognition. The HMAX ("Hierarchical Model and
X") system models the primate ventral visual stream in an attempt to build a system that
emulates the information processing of object recognition in the brain [2]. The ventral visual
stream is a hierarchy of brain areas thought to answer the “what” question in the cortex (versus
the “where” question in the dorsal stream). This system is among the best machine vision
approaches developed today. HMAX uses a series of layers in a feed-forward manner to process
visual information. It extracts features from an image and then uses these features for
recognition. This makes the system relatively invariant to size and rotation. However,
there is room for key improvements in the HMAX model to make it a better object
recognition system.
HMAX decomposes an image into features which are passed to a classifier. These
features relate patches of natural images to the input image and are used to describe parts of the
object. Each feature captures information about a small section of the input image. These
features, however, might not carry information about the overall structure of the image unless a
significant number of them overlap. The system can therefore produce a false positive when two
images from two different classes have sufficiently similar feature profiles but completely
different structures.
This thesis will focus first on showing that there can be a lack of spatial information,
which is a problem in HMAX, and then on investigating methods to improve the system for use
in facial recognition.
Chapter 2: Background in Computer Vision
Computer vision is a multidisciplinary field of study that uses mathematical techniques to
automatically extract, characterize and interpret information about the three dimensional world
from images. The goal of computer vision is to build an artificial vision system that is capable of
interpreting an image as well as a human. As human beings, we are capable of taking in visible
light and perceiving the environment around us with relative ease. However, tasks such as
identifying and categorizing objects, assessing distances, and guiding body movements, which
we can do without effort, are a huge challenge for computer systems.
Research in the field of computer vision can be divided into a number of well-defined
processing problems or tasks. These tasks have many applications and can be solved in a variety
of ways. Some of the most common computer vision tasks to research include image
restoration, motion analysis, and object recognition.
This thesis focuses on the task of object recognition. This is the task of determining
whether or not a particular object or feature is within an image. There have been great advances
in the field over the years; however, even the best object recognition systems fail to perform as
well as a two-year-old child when asked how many animals are in a picture [3]. This is the
challenge of generic object recognition: the task of identifying or locating an arbitrary object in
an arbitrary situation. Objects in real-world images vary in orientation and scale, or can even be
partially obstructed. The human visual system is very robust, and our ability to recognize
real-world images is generally not impaired by the significant variations they exhibit. However,
it is still a challenge for computer vision systems in general.
Object recognition can be divided into two fundamental vision problems: what is the
object in the image, and where in the image is it. Again, this thesis only focuses on the “what”
problem. There are essentially two types of approaches used to address the “what” problem:
appearance-based and feature-based.
Appearance-based methods compare objects directly to a template image, for example by
edge matching. The problem with this approach, however, is that a single image is unlikely to
represent all appearances of an object. Attempting to arbitrarily identify an object based on a
single template would cause many false negatives, but representing all appearances of an
object is not possible either. Additionally, at least one template must be stored for each object
that the system has to recognize. This is plausible for a few objects or even a few thousand, but
for arbitrary recognition, this would be memory expensive.
Feature-based methods match features between the image and known objects. The
feature-based approach involves extracting features from a set of reference images and using
those features as a dictionary to define other images. Thus, all images with the same definition,
that is, the same feature profile, would be categorized as the same object. Bag-of-features
models are one type of feature-based approach to object recognition. A bag-of-features model
builds a feature profile, a histogram of how many times each feature appears in an image, and
uses this histogram for recognition.
One example of this is the Scale-Invariant Feature Transform (SIFT) developed by David
Lowe [4], which has become an industry standard for generic recognition. SIFT works by
extracting keypoints, or features, from reference images and creating a feature profile. The
Euclidean distances between feature profiles are then used to match images to objects.
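The matching scheme just described can be sketched in a few lines. This is an illustrative bag-of-features sketch, not Lowe's actual SIFT pipeline: `feature_profile` and `match` are hypothetical helper names, and the descriptors and dictionary are assumed to be given as NumPy arrays.

```python
import numpy as np

def feature_profile(descriptors, dictionary):
    """Build a bag-of-features histogram: count how often each
    dictionary feature is the nearest match to an image descriptor."""
    profile = np.zeros(len(dictionary))
    for d in descriptors:
        # Assign the descriptor to its nearest dictionary entry.
        nearest = np.argmin(np.linalg.norm(dictionary - d, axis=1))
        profile[nearest] += 1
    return profile

def match(query_profile, reference_profiles):
    """Return the index of the reference object whose profile has the
    smallest Euclidean distance to the query profile."""
    dists = np.linalg.norm(reference_profiles - query_profile, axis=1)
    return int(np.argmin(dists))
```

Note that the histogram itself discards where in the image each feature occurred; this is the property examined in later chapters.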
There are many motivations behind research in computer vision. For example, by
designing a system capable of visual perception, we can further understand how the brain
processes this information. In fact, there are many good recognition systems, like HMAX, that
try to build off neuroscience, which in turn furthers our understanding of the field. There are
also many applications to computer vision. From medical image processing to augmented
reality, there are many commercial uses to having a good computer vision system.
This thesis evaluates the performance of the HMAX object recognition system on the
application of facial recognition. Facial recognition is the task of recognizing who is in a
particular image. The uses of this are endless, from security systems to gaming applications.
Chapter 3: Background in Human Vision
The role of any visual system, natural or artificial, is to perceive the environment by
interpreting information from visible light. The visual system in humans and other primates
outperforms computer vision systems in almost all vision tasks. In fact, many computer vision
systems, like HMAX, draw inspiration from the visual system of humans. Therefore, to
understand HMAX, it is important to understand the neuroscience that inspired it.
The visual cortex, which is responsible for processing visual images, is the largest system
in the human brain. There are two primary pathways for information to be processed in the
visual cortex: the dorsal stream and the ventral stream. As mentioned in the previous chapter,
object recognition is divided into two problems, the “what” and the “where” problems. The
dorsal stream is believed to focus on the “where” problem, and its pathway runs from V1 to V2
and then to parts of the brain called the dorsomedial area (V6), the middle temporal region (MT
or V5), and the posterior parietal cortex. The ventral stream, on the other hand, focuses on the
“what” problem. Its pathway runs from V1 to V2, then to V4, and on to the inferior temporal
cortex (IT). Since this thesis focuses on the “what” problem, we will concentrate on the ventral
stream [5].
The first layer in the visual cortex is the primary visual cortex, often referred to as V1.
Visual information from the retina passes to V1 through the lateral geniculate nucleus
(LGN), found inside the thalamus. V1 is comprised mostly of cells with elongated receptive
fields. As a result, the neurons in this region respond to bars and slits rather than spots of light.
Essentially, this region decomposes an image into a grid of edges in which the retinotopic map
(the relative positions) is preserved. Additionally, adjacent neurons in V1
respond to overlapping portions of the visual field, meaning that the neurons in the primary
visual cortex are spatially organized.
The V1 is comprised of two types of cells: simple and complex. Simple cells are cells
that respond to edges and gratings of a particular orientation. These cells often have distinct
excitatory and inhibitory regions. Complex cells similarly respond to a particular orientation, but
they do not have distinct excitatory and inhibitory regions. This makes complex cells spatially
invariant [5].
The human visual cortex is comprised of hundreds of millions of neurons. V1 alone has
approximately 140 million cells [6]. As we move along the ventral stream, millions of neurons
fire and make connections with each other to represent all aspects of an image. Additionally, the
visual system in humans and other primates seems to become more complex as you move
through the layers of the visual cortex away from V1. That is to say, V2 has very similar
properties to V1, like orientation preference, but it has a more complex receptive field. The
theory is that the complexity continues to increase through V4 and IT until particular neurons
respond to particular objects or people. This is the theory of the “grandmother cell” [5]. In fact,
examples of “grandmother cells” were found for a few celebrities like Halle Berry, meaning that
these cells only fired when the subject saw a particular person, i.e. Halle Berry [7]. With
hundreds of millions of neurons in V4 and IT, humans are capable of amazing recognition.
It is important to also note that the human visual system is not a purely feedforward
system as described above. That is, information does not simply travel from V1 to V2 to V4,
and so on. There is also a great deal of feedback between the layers; for example, V4 might
transmit information back to V1. HMAX and other object recognition systems, however, tend to
model only the feedforward path for simplicity.
Chapter 4: HMAX Object Recognition System
How HMAX Works
Figure 1: HMAX ("Hierarchical Model and X") System Overview [2].
The standard HMAX model [2] consists of four layers that alternate template matching
and max pooling operations to build a feature profile that is selective yet position and scale
invariant (figure 1). Like the V1 area of the visual cortex in primates, the first two layers of the
model consist of simple (S1) and complex (C1) cells. The first layer, referred to as S1, applies a
battery of 2-D convolutions in the form of Gabor filters to the input image. The filters come in 4
orientations and 16 scales (2 scales for each of a total of 8 bands), thus translating the input
image from the pixel domain into the edge domain at multiple scales. The next layer, called the
C1 layer, takes the local maximum over scale and position for each band, essentially
down-sampling the S1 images per band based on the filter output strength. This builds local
scale and position invariance, as shown in figure 2.

Figure 2: Scale and position tolerance building in C1 [8].
The S2 layer filters the C1 input image with a set of patches from natural images called
S2 maps, which are also in the C1 format. These S2 maps are sometimes extracted from the
training images. These S2 units behave as radial basis function (RBF) units, responding in a
Gaussian-like way to the Euclidean distance between the natural image patches and the input.
The global max over all scale and position for each S2 map type is then taken to create the final
C2 feature profile. This C2 profile is shift and scale-invariant, and is passed to the classifier for
final analysis.
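The S1 and C1 stages described above can be sketched as follows. This is a simplified illustration under assumed parameters, not the model's exact configuration: the Gabor parameter values, the pooling neighborhood, and the grouping of scales into bands are placeholder choices, and `s1_layer`/`c1_layer` are hypothetical helper names.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor(size, theta, wavelength=5.0, sigma=3.0, gamma=0.5):
    """One Gabor filter (an S1 template) at orientation theta.
    Parameter values here are illustrative, not HMAX's exact settings."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2)) \
           * np.cos(2 * np.pi * xr / wavelength)

def s1_layer(image, thetas, sizes):
    """S1: convolve the image with Gabor filters at several orientations
    and scales, translating pixels into multi-scale edge responses."""
    return {(theta, size): np.abs(convolve2d(image, gabor(size, theta), mode='same'))
            for theta in thetas for size in sizes}

def c1_layer(s1, pool=4):
    """C1: take a local max over position, then a max across the scales
    that share an orientation (a crude stand-in for HMAX's scale bands).
    This builds tolerance to small shifts and scale changes."""
    out = {}
    for (theta, size), resp in s1.items():
        h, w = resp.shape
        trimmed = resp[:h - h % pool, :w - w % pool]
        pooled = trimmed.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))
        out[theta] = np.maximum(out[theta], pooled) if theta in out else pooled
    return out
```

The S2/C2 stages would then compare C1 patches against a feature dictionary and take a global max, collapsing the remaining position information into a single profile vector.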
HMAX Performance
HMAX is essentially a biological model for object recognition. However, as is evident in
table 1, the system performance is comparable to some of the best computer vision systems
available today. The performance of the HMAX system on facial recognition was tested using
the AT&T database.
The AT&T face database, also referred to as the ORL database, is a collection of
head-and-shoulder images taken under controlled settings. The subjects are in front of a dark
homogeneous background and the faces are upright and frontal, although roll (which is left/right
tilt of the head) and some side movements are also present. There are 10 images for each of the
40 subjects, and the images vary in lighting, facial expressions, and even facial detail, such as
glasses versus no glasses.

Table 1: Performance rate of algorithms on the AT&T face database

Algorithm                          Performance (%)¹
HMAX, Generic Dictionary           81.6
HMAX, Class Specific Dictionary    91.1
SIFT                               85.7²
Eigenfaces                         80.1²
Fisherfaces                        85.0²

¹ Ten independent trials, with 20% of the images used for training and 80% for test, were run
independently and their results averaged.
² Results taken from [9].
Experiments were carried out to check the performance of the system using 20% of the
images as the training set and 80% of them as the testing set. Two types of feature dictionaries
were tested: a generic dictionary and a class-specific dictionary. Both dictionaries are comprised
of 2000 features that are patches from the C1 maps of natural images. The difference is that the
generic dictionary was created from a database with images of a variety of different objects,
while the class-specific dictionary was created from only the AT&T database. Ten independent
trials were performed with randomly chosen training and test sets. Again, the results are in
table 1. It is important to note that each feature in the C2 layer is in theory equivalent to a
neuron within V4/IT. With only 2000 features, this system is much smaller than the human
visual system, which has up to hundreds of millions of neurons in V4 and IT. Despite its small
size, the system still performed well.
Because HMAX is fundamentally a bag-of-features approach to object recognition, it has
good rotational and size invariance. However, it also has a high rate of false positives. This is
believed to occur when two images from two different classes have sufficiently similar C2
profiles. This is what is addressed in this thesis.
Chapter 5: Drawbacks to HMAX: The Binding Problem
In the HMAX model, a retinotopic map is maintained until the C2 layer. That is, as the
input image is processed through layers S1 to S2, the spatial relationships of the initial image
with respect to the visual field are preserved. In the C2 layer, however, the features are pooled
over the whole field. Thus, all information about the relative locations of the features is lost
before the decision-making process. The features used have no reference to where they
originated in the initial image. Because of this global pooling, the final C2 feature profile is scale and
translation invariant throughout the visual field, but this also implies that the C2 features might
not be sensitive to the overall geometry of a subject.
The possible lack of overall geometry due to spatially invariant features is an example of
the “binding problem” in neuroscience [10]. The “binding problem” arises when multiple
objects are being encoded by separate features and it is unclear how the system binds these
features together to represent a particular image. The binding problem implies that important
information about an input image, like relative position and size, can be lost in any visual
representation that uses modular features. It further implies that this loss of information can lead
to false positives, since the feature profile of a particular object can be activated by combining
features extracted from other objects in the visual scene. If the binding problem is an issue in
HMAX, the C2 feature profile would not be unique to a subject. That is, there can exist an
image that has the same C2 profile as a given subject but does not resemble the subject at all,
since it has completely different geometry.
The fact that the HMAX model uses features from decomposed images that are
independent of one another in terms of their positions in space has been a source of criticism
since HMAX was first introduced. One common criticism is that HMAX could be “fooled” by
scrambling an image into pieces larger than the features. In theory, this could produce the
appropriate feature profile and confuse the model with the original image. Riesenhuber and
Poggio showed with the simulation in figure 3, however, that this does not happen. They claim
that by having a larger number of redundant features, HMAX has an over-complete definition of
each object. This makes it practically impossible to scramble an image and preserve all the
features, even for a low number of features [11].
Figure 3: HMAX response to scrambled images [10]. (a) shows example scrambled
images used in Riesenhuber and Poggio’s simulations. (b) demonstrates that the response
of the model to the scrambled images mimics physiology.
This shows that HMAX uses Wickelgren’s approach to addressing the binding problem
[12]. This approach claims that intermediate features, composed of combinations of smaller
features, can create a unique representation since these intermediate features can overlap. This
can be demonstrated using text [10]. If we were to code the word “apple” by its individual
letters, without regard to letter location, the word “elppa” could easily be misclassified. On the
other hand, if we code by groups of letters, the set “app”, “ppl”, and “ple” uniquely defines the
word “apple”. Images, however, are much more complex than text, so finding features that make
the representation unique is more difficult. This thesis shows that the model can still lack spatial
relationships between features, which is a major drawback for a system of this size.
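The text example above can be made concrete with a short sketch, assuming only Python's standard `Counter`: coding a word by single letters cannot tell “apple” from “elppa”, while overlapping letter triples can.

```python
from collections import Counter

def bag_of_letters(word):
    # Unordered letter counts: all positional information is discarded,
    # analogous to a feature profile with no geometry.
    return Counter(word)

def bag_of_trigrams(word):
    # Overlapping 3-letter features: each carries local order, and
    # together they constrain the global structure of the word,
    # which is Wickelgren's intermediate-feature idea.
    return Counter(word[i:i + 3] for i in range(len(word) - 2))
```

Here "apple" and "elppa" share identical letter counts but disjoint trigram sets ({app, ppl, ple} versus {elp, lpp, ppa}), so the trigram representation distinguishes them.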
Chapter 6: Finding a way to “Fool” HMAX
This thesis mainly focuses on trying to demonstrate that the binding problem can exist in
the HMAX model of a certain size. To demonstrate the problem, this thesis seeks to show that
the C2 features of a given subject are not unique since they lack geometric information. This is
accomplished by developing a process that can create an image with a similar C2 profile to a
given subject’s but which clearly does not belong to the subject. C2 profiles are believed to be,
like other bag-of-features profiles, very similar within the same class but different between
different classes. That is to say, two different images of the same person should have very
similar profiles, while images of two different people should have different profiles. Thus, the
distance between two profiles within the same class, the intra-class distance d_int, should be less
than the distance between two profiles from two different classes, the inter-class distance d_ext
(figure 4). That is exactly what is observed.
For our implementation, which is described in a later section, the Euclidean distance
between intra-class profiles, d_int, ranges from 0 to 1.0 unit. The inter-class distance, d_ext,
however, ranges from 2.5 to 4.6 units. These exact numbers can vary depending on the database
and features, but the important observation is that there is a clear difference between intra- and
inter-class distances.

Figure 4: Intra- versus inter-class distances.

If the C2 profile is a unique representation of a target object, all the distances (d) between
the profile of a target image and the profiles of all other images would be large (d > d_int),
except for other images of the same object, for which the distance should be small (d ≤ d_int), as
in figure 5a. If the profile is not unique, there would be multiple images, other than the images
of the target object, whose profiles also have a small distance from the target image profile, as in
figure 5b. Again, the aim of this thesis is to show that the result is closer to the latter.
Figure 5: Distances from the target image if the C2 profile is: (a) a unique representation of one
object’s images, or (b) a representation of multiple objects’ images.
This is accomplished by creating a random image whose C2 profile is within the intra-class
distance but which clearly does not belong to the given class. This is essentially a global
optimization problem that seeks to minimize the distance between the C2 profile of a given
target image and that of a generated random image.
Global Optimizations
Global optimization algorithms can be divided into two basic classes: deterministic and
probabilistic algorithms. A deterministic algorithm is an algorithm that does not use random
numbers to determine what to do or how to modify the data [13]. It is used when there is a clear
relation between the possible solutions and their “fitness” to the problem. “Fitness” refers to the
metric by which a particular solution is judged to have solved the problem. For our case, the
“fitness” metric is the Euclidean distance of the candidate image’s C2 profile to the target C2
profile. A deterministic method explores the search space so that the true optimal solution can
be determined, often employing search schemes like “divide and conquer”.
For the search problem in this thesis, the dimensionality of the search space is too high.
This thesis aims to create a 112×92 pixel 8-bit grayscale image that has a similar C2 profile as a
target image. That means that each pixel can take 2^8 (256) values, meaning there are a total of
256^(112×92) candidate images in our search space, not a feasible number to test exhaustively.
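The infeasibility claim can be made concrete: 256^(112×92) is a number with roughly twenty-five thousand decimal digits. Rather than materializing that integer, a quick sketch counts its digits with a logarithm:

```python
import math

# A 112x92 8-bit grayscale image has 10,304 pixels with 256 possible values each,
# so the search space holds 256**10304 candidate images. Count the decimal digits
# of that number via its base-10 logarithm instead of computing it directly.
pixels = 112 * 92
digits = math.floor(pixels * math.log10(256)) + 1
print(digits)  # 24815
```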
A probabilistic algorithm, on the other hand, uses random processes to select and modify
candidate solutions, significantly reducing the search time. These algorithms often trade the
guarantee of finding the best solution for a shorter runtime. That is, it would be best if we could
find an image with exactly the same profile as our target image. However, because of the
difficulty and size of the problem, it is more feasible to find any image that has a small
enough distance to confuse the classifier.
Genetic Algorithm
A special type of probabilistic algorithm is the genetic algorithm [14]. A genetic algorithm
(GA) searches for optimal solutions by changing system parameters in a manner that mimics
natural evolution. In GA, a population of candidate solutions to a given problem “evolves”
toward better solutions using techniques inspired by survival of the fittest and mutation,
among other things. All genetic algorithm searches proceed according to the scheme below:
1. The evolution starts with a population, Pop, of n randomly generated solutions, p1, p2, … pn.
2. The values of the objective function are computed for each candidate solution in the
population: g(p1), g(p2), … g(pn).
3. For each generation, the fitness of every individual in the population is evaluated based
on a given fitness function, f, which measures the quality of a candidate solution in
solving the given problem: f(g1), f(g2), … f(gn).
4. These fitnesses are compared; solutions with low fitness are filtered out, while solutions
with high fitness enter the “mating” pool with a higher probability. The selection process
depends on the type of problem. For example, in a minimization problem, the best
candidates are the solutions with the minimum fitness score. So,
Mating Pool = min {f(g1), f(g2), … f(gn)}
5. The top candidates enter the “reproduction phase”, where their children are created by
modifying the genome of the top candidates. Modifications include random “mutation”,
“crossover”, and even “migration” from a different subpopulation.
6. The top candidates and their children are passed to the next generation. The new
population is used for the next iteration of the algorithm. The algorithm continues at step
2 until a termination criterion is met, such as a number of generations, or a solution
reaching a desired fitness score.
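The six steps above can be sketched as a minimal genetic algorithm. The code below is illustrative, not the implementation used in this thesis: it minimizes Euclidean distance to a small random target vector, with truncation selection, scattered crossover, Gaussian mutation, and implicit elitism (the mating pool is carried over unchanged).

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(candidate, target):
    # Steps 2-3: fitness is the Euclidean distance to the target (lower is better).
    return float(np.linalg.norm(candidate - target))

def evolve(target, pop_size=50, generations=300):
    n_genes = target.size
    # Step 1: randomly generated initial population.
    pop = rng.random((pop_size, n_genes))
    for _ in range(generations):
        scores = np.array([fitness(p, target) for p in pop])
        # Step 4: the better half enters the mating pool.
        pool = pop[np.argsort(scores)[: pop_size // 2]]
        # Step 5: children via scattered crossover of random parents, then mutation.
        n_children = pop_size - len(pool)
        pa = pool[rng.integers(len(pool), size=n_children)]
        pb = pool[rng.integers(len(pool), size=n_children)]
        children = np.where(rng.random(pa.shape) < 0.5, pa, pb)
        children = children + rng.normal(0.0, 0.02, children.shape)
        # Step 6: the pool (elites) plus the children form the next generation.
        pop = np.vstack([pool, children])
    scores = np.array([fitness(p, target) for p in pop])
    return pop[np.argmin(scores)], float(scores.min())

target = rng.random(16)
best, best_d = evolve(target)
```

In the actual thesis setup, the target vector is a C2 profile and each genome is a candidate image pushed through HMAX; this toy version keeps only the evolutionary loop.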
To create the next generation, the genetic algorithm selects certain individuals with the
best fitness values in the current population, called parents, and uses them to create individuals in
the next generation, called children. There are four types of children: children due to mutation,
children due to crossover, children due to migration, and elite children.
Three of the four types of children result from the modifications mentioned above:
mutation, crossover, and migration, demonstrated in figure 6. Mutation children are created by
simply adding random noise to the genome of the top candidate solutions (figure 6a). Mutation
is a method of adding small variations to the gene pool so that similar solutions are considered.
To create the mutations, the algorithm adds a random vector drawn from a Gaussian distribution
to the parent genes.
Crossover, on the other hand, creates children by combining the vectors of a pair of
parents (figure 6b). This is done by randomly dividing the genes of two individuals and
recombining them to create the next generation. It is a method of adding large variations to the
gene pool so that different locales of the solution space are explored; it also reduces the
likelihood of a solution getting trapped in a local minimum.
The last modification mentioned is migration. This is when the best individuals from one
subpopulation replace the worst individuals in another subpopulation (figure 6c). Migration
need not happen every generation; the algorithm can select how many generations pass
between migrations. Additionally, the migration fraction, which specifies how many
individuals move between subpopulations, is also variable.
There is one more type of child common in the next generation: elite children
(figure 6d). These are the individuals in the current generation with the best fitness
values, which are passed, unaltered, to the next generation. Elite children ensure that if the best
or near-best solutions have been found, they will be preserved.
The genetic algorithm was selected as the search approach used in this thesis for a variety of
reasons. First of all, it can quickly scan a vast solution set, like the one this thesis is faced with.
Additionally, GA has an inductive nature: the algorithm does not need to know any
specifics of the problem, which is ideal for a complex problem like ours with no fixed
relationship or formula. Another advantage is that there is a large variety in the candidate
solutions, since bad proposals are tolerated without negatively affecting the end solution. GA
also maintains a population of possible solutions, instead of the single solution iterated in
other search heuristics.
One possible problem in GA is premature convergence. An optimization algorithm
usually converges to a candidate solution, which is considered to be the best fit. The algorithm
is said to converge when it cannot reach new solution candidates but keeps producing solutions
from within a small subset. For GA, like many other forms of global optimization, it is often
not possible to determine whether the value the algorithm converged to is the local best or the
global best. Since GA does not evolve towards a good solution but away from bad ones, it is
very possible for the search to settle on a suboptimal solution such as a local best answer.
Because candidate solutions are selected from the best candidates in the past generation, the
search can gradually begin to center around a local minimum or maximum.
This, however, is not a huge issue for our problem, since the final solution is allowed to be
a local minimum as long as it is within the intra-class distance. Additionally, having diversity in
our population decreases the likelihood of a “trap”. Diversity refers to the average distance
between individuals in a population. A population has high diversity if the average distance is
large, which can be accomplished with a large population and a large rate of migration,
mutation, and crossover. Lastly, diversity can be attained by running GA with different random
seeds for the initial population, so that different parts of the solution space are explored
between GA runs.
Overall, the genetic algorithm seemed like the best approach to finding an image with the
same C2 profile as a target image in order to “fool” HMAX.
Figure 6: Modifications used on “parents” to create the next generation of “children”.
Chapter 7: Implementation
This thesis created a process that uses a genetic algorithm to create a 112×92 grayscale
image, the GA image, which has a similar C2 feature profile as a particular image of the same
resolution, the target image. The process is demonstrated in figure 7. It first creates a
“population” of candidate solutions consisting of 112×92 grayscale random-pixel images. These
random-pixel images are then processed through HMAX, and the output of the C2 layer is
compared to that of the target image. The Euclidean distance between the C2 output of each
GA-generated candidate solution and that of the target image is calculated and used as the fitness
metric. This metric is used to select the best candidates to be passed on, or to undergo “mutation”
and/or “crossover”, for the next generation. A “mutation” randomly changes the value of a
certain percentage of pixels, while a “crossover” swaps a large part of the image of one
candidate with that of another.
Figure 7: Genetic Algorithm process overview.
The GA continues to iterate until the distance between the C2 output of the target and
that of the GA images is within an acceptable range, or until a given number of generations has
passed, as shown in figure 8. For our implementation, we used features from 500 natural patches
at 4 sizes, for a total of 2000 features. The Euclidean distance between the target C2 profile and
the C2 profiles of the candidate solutions was used as the fitness measure to determine the best
candidate solutions. The aim was to get the C2 output of the GA image well within the
intra-class distance. This ensured that the GA-generated image would be sufficiently similar in
the C2 feature space to “fool” the classifier.
Figure 8: Genetic Algorithm generated image profile converging to the target image profile.
Although 2000 features is far fewer than the hundreds of millions of features in the V4/IT
region of the brain that the C2 layer is based on, a system of this size had very good recognition
performance and is commercially viable to implement. Since more features mean more
computation and memory usage, 2000 features is a sufficient number to achieve a recognition
success rate competitive with other top algorithms, making this size of system the most
commercially viable.
The genetic algorithm optimization was performed using the Optimization Toolbox in
MATLAB R2009b. The GA used a population of 100 individuals that were binary
representations of a 112×92 8-bit grayscale image. The initial population was randomly
created white noise. The standard HMAX model, described in chapter 4, was used
as the objective function.
The mutation children were created by a two-step process. First, the algorithm selected a
fraction of the parent’s entries for mutation. Each entry, referred to as a gene, had a given
probability of being mutated. The rate was initially set to 0.1, but if a large
number of generations passed without much change in the fitness score, the rate was increased.
After the genes to be mutated were selected, the algorithm replaced each one with a uniformly
selected random number. This is referred to as the uniform mutation function.
Crossover children for the next generation were created according to the crossover function; a
scattered function was used in this implementation. This function randomly selects individual
genes to take from each parent and combines them to form the child. Migration children in
the next generation were created by forward migration, meaning individuals migrate
toward the last subpopulation. The number of individuals moving between subpopulations
was determined as a fraction of the smaller of the two subpopulations; that fraction
was set to 0.2. Migration occurred every 20 generations. Lastly, there were always two
elite children, which were exact copies of the best-fit individuals of the previous generation.
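A minimal sketch of the two operators just described, uniform mutation at a configurable rate (the rate was raised when fitness stalled) and scattered crossover, assuming genes are flattened 8-bit pixel values. The function names and the seeded generator are illustrative, not the MATLAB Optimization Toolbox internals.

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_mutation(parent, rate=0.1):
    # Each gene is replaced, with probability `rate`, by a uniform random 8-bit value.
    child = parent.copy()
    mask = rng.random(parent.shape) < rate
    child[mask] = rng.integers(0, 256, size=int(mask.sum()))
    return child

def scattered_crossover(parent_a, parent_b):
    # Each gene is taken at random from one of the two parents.
    return np.where(rng.random(parent_a.shape) < 0.5, parent_a, parent_b)

genome = rng.integers(0, 256, size=112 * 92)  # one flattened candidate image
mutant = uniform_mutation(genome, rate=0.1)
child = scattered_crossover(genome, mutant)
```

By construction, roughly 10% of the mutant’s genes differ from the parent, and every gene of the crossover child comes from one of its two parents.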
HMAX on GPU
The GA was run for thousands of generations, which took approximately 3 months. This
is because HMAX, like all cortical models, is computationally expensive, and running the GA
requires the whole population (of 100 individuals) to be analyzed for each generation. For
further investigations, a faster system was necessary.
A parallel-computing version of HMAX, developed by Jim Mutch, allowed for an
increase in the algorithm’s speed. This version was developed as a model using a programming
framework called the Cortical Network Simulator (CNS). This framework runs on a
general-purpose graphics processing unit (GPU) and can process the HMAX model 80 to 100
times faster than a single CPU [15]. Using this version of HMAX as our objective function, the
computation time for the GA was significantly reduced.
Chapter 8: Results and Discussion
Genetic Algorithm Results
Figure 9: The C2 profile of the generated image converges to the target profile (distance falling
from 3.2 to 0.71), but its appearance does not resemble the target image.
The genetic algorithm ran for thousands of generations, and the output of the C2
layer slowly converged toward the C2 output of the target image, as shown in figure 9. With the
original setup, using the CPU version of HMAX with 2000 features, the final distance was
0.71 after 8100 generations. This is well within the intra-class distance, meaning the C2 outputs
of the GA image and the target image are very similar. The GA image, however, does not
resemble the target image at all. As is apparent in the C1 map of the GA image, on the
left side of figure 10, the generated image converged to a combination of random edges. Thus,
with the genetic algorithm, we were able to create an image comprised of random edges that has a
similar C2 output as a target image. This shows that the C2 layer does not preserve the
retinotopic map, and that the profile is not unique. Since the retinotopic map is not preserved,
the binding problem, which in our case relates to the relative spatial relationships between
features, exists for this setup.
For a definitive test of whether we successfully “fooled” HMAX, we tested the GA image using
a Support Vector Machine (SVM) classifier. When the classifier was asked which of the 40
subjects the generated image belonged to, the GA image was classified into the target class. This
proves that we were able to fool HMAX.
Figure 10: C1 map of generated image (left) is a collection of random edges and does not
resemble the C1 map of the target image (right).
Issues Regarding the Process
The process developed in this thesis worked well in showing that the C2 profile of an
image is not unique when we have only 2000 features. We repeated this process with the
GPU version of HMAX, also with 2000 features, and found similar results. As is evident in
figure 11, the optimization decays exponentially with a time constant, τ, of less than 100
generations. That is, the C2 profiles of the candidate images converge exponentially to the
target C2 profile.
Figure 11: Distance between generated profile and target profile decreases exponentially through
the generations.
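The exponential decay described above can be modeled as d(g) = d_inf + (d0 - d_inf) * exp(-g / tau). A small sketch follows, with illustrative constants loosely matching figures 9 and 11 (d0 = 3.2, d_inf = 0.7, tau = 100 generations), not fitted values:

```python
import math

def distance_model(g, d0=3.2, d_inf=0.7, tau=100.0):
    # Exponential decay of the C2-profile distance with generation g:
    # starts at d0, decays toward the floor d_inf with time constant tau.
    return d_inf + (d0 - d_inf) * math.exp(-g / tau)
```

After a few time constants the modeled distance is essentially at its floor, matching the plateau seen in the convergence plots.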
A major issue with the genetic algorithm, as previously mentioned, is premature
convergence. As seen in figure 12, there were times during which hundreds of generations
passed without a new solution candidate having a smaller distance than the current best. This
suboptimal solution was caused by a lack of diversity in the population. To fix this, the mutation
rate and the migration rate were adjusted and the GA restarted. Sometimes a new population
was manually introduced, with only the elite children preserved. This restarted the stalled
algorithm.
Figure 12: The genetic algorithm sometimes stalls at a suboptimal convergence and requires
restarting GA with higher diversity to continue.
Wickelgren’s Approach to the Binding Problem
The genetic algorithm was also used to investigate how best to preserve relative spatial
information in the HMAX model without maintaining actual feature locations. Wickelgren’s
work implies that if there is a significant number of overlapping features, these features will
constrain the representation to be unique [12]. By increasing the number of features, we can
increase the chance of overlapping features. Furthermore, increasing the number of features has
been shown to increase the overall performance of HMAX [2].
Figure 13: Performance increases with number of features.
To further this study, GA was run using the GPU version of HMAX with 70, 700, 2000,
and 7000 features, and the results were evaluated. The overall performance on the AT&T
database did in fact increase as the number of features increased (figure 13), as predicted by
Serre’s work [2]. This might be explained by the fact that the overall distance between C2
profiles of different objects increases as the number of features increases. This is apparent in
figure 14, where the difference between the average distance between images in the same class
and the average distance between images in different classes is plotted, and an increasing trend
is shown. This trend would make classification easier, since different objects are easier to
categorize with a larger difference.
Figure 14: Difference between dint and dext increases with number of features.
This trend of increasing distances with increasing numbers of features also applies to the
distances between the random-pixel images in the initial population and the target image (figure
15a). This means that the genetic algorithm should take longer, since it has more ground to
cover. In fact, with more features, more generations are required to find the optimal solution.
Figure 15b shows the number of generations required for the GA image to reach the mean
intra-class distance. There is a slight increasing trend at the beginning, but a large jump in the
number of generations required with 7000 features. Since there is a direct relationship between
the difficulty of finding a good solution to a problem and the amount of time needed to find
that solution, it is clear that as the number of features increases, the difficulty of finding a
solution increases. This is a good sign in proving Wickelgren’s theory.
Figure 15: (a) Initial distance from the target image increases with the number of features. (b)
Number of generations needed to reach the average dint increases with the number of features.
Both trends indicate that it is getting harder to generate an image with the same C2 output.
Definitive proof of whether increasing the number of features constrains the
C2 profile to be unique depends on the C1 maps of the GA images. Figure 16a
shows the C1 image at a particular orientation for a GA image created with 70 features. After
5000 generations, this GA image has a C2 profile distance of 0.0247, which is well within dint for
this class. As is apparent in the image, the C1 image is a collection of random edges, so the
retinotopic map is not preserved and the representation is not unique. Figure 16b shows the C1
map at one orientation for a GA image created with 7000 features. After 764,000 generations,
the distance from the target C2 profile is 0.4457, which is also well within dint. Looking at the
figure, the C1 maps of the target and generated images are not exactly the same, but there are
clear similarities. This is a drastic difference from the previously generated images, which had
no correlation. All other aspects of the GA were held constant for this test, indicating that by
only increasing the number of features, some parts of the retinotopic map were preserved. That
is, with 7000 features, it is clear that more spatial organization is present than with fewer
features.
Considering that V4/IT consists of hundreds of millions of features, it is clear that
Wickelgren’s approach to the binding problem is valid for the visual cortex, and that in order to
avoid the issue in computer vision a significant number of features is necessary.
(a) C1 map of GA image created using 70 features vs. the C1
map of the target. There are no similarities in any orientation.
(b) C1 map of GA image created using 7000 features vs. the C1
map of the target. Similarities exist in multiple orientations,
and are circled in the image above.
Figure 16: Comparing C1 map of generated images for different number of features (left side
image in (a) and (b)) to the target image C1 map (right side).
Chapter 9: Further Investigation
The generated image showed that spatial relationships between features are not preserved
when there is a small number of features. However, without parallel computing, a larger number
of features means longer computation time, which would make the system less desirable in
commercial applications. Because of this, other identification schemes were tested.
STDP Macro Features
For one trial, using the forty classes from the AT&T face database with 20% of the
images for training and 80% for testing, HMAX alone had a 17% failure rate (83% success). For
most of these failures, ~76%, the correct class was in the top 5 classifier outputs. The theory
behind the first identification system designed was to correctly discriminate the correct class
from among the top five choices. For this trial set, that would result in a ~13% increase in
performance. The proposed face detection system, shown in figure 17, would use the invariance
of HMAX to select the best candidates, and macro features that contain spatial information to
select the best-fit class.
Figure 17: Proposed system that uses intermediate “macro” features for template-matching the
best image.
The algorithm used to develop the macro features was an asynchronous feedforward
spiking neural network developed by Timothée Masquelier and Simon Thorpe [16]. This
algorithm mimics spike-timing-dependent plasticity (STDP) in order to extract important visual
features from an image in a completely unsupervised manner. STDP is a learning rule used by
neurons that are repeatedly exposed to a similar input: it modifies the synaptic weight of that
input based on the relative timing of the action potentials. The features extracted by this
algorithm were capable of supporting various classification tasks [16].
The proposed setup was tested with the first 20 subjects in the database. Of the 160
images, 14 of the incorrectly categorized images had the correct class in the top 5 classes.
Using macro features, 8 of the 14 errors were corrected (~57%), 4 failed (~29%), and 2 trials had
no definitive answer (~14%). However, the overall test showed that the proposed solution was
not useful. The system had a success rate of 78%, a large decrease from the 86% rate with
HMAX alone. The system failed to recognize 17% of the faces HMAX alone recognized.
Examining the images that failed, the sources of error were clear. The first source of
error was false positives. As can be seen in figure 18a, some of the macro features are too
non-specific. Because of this, many images not belonging to that class fit the feature, so that
class was often selected. Additionally, pose and gesture variation between faces was another
source of error (figure 18b). Things like glasses and differences in facial expression cause the
macro feature not to be detected. All of the variation tolerance developed by HMAX is lost with
the macro features. These two sources of error likely resulted from the template matching of
large features. This failed experiment further shows the advantages HMAX has over other
recognition systems.
Figure 18: Sources of error in the proposed system.
Location-Based System
Another proposed system used, instead of only the C2 output as the input to the
SVM, the distances between the locations of C2 features. As figure 19 shows,
there is a clear distinction between the distances between locations within and between
classes. That is, the distances between locations for a feature within the same class tend to be
smaller than the distances between different classes. This implies that the distances between
locations might be a good element by which to classify images.
Figure 19: Standardized distance between locations. There is a clear distinction
between the distances between locations within and between classes, implying
distances might be a good element by which to classify images.
Two types of locations were used to test this system: the exact location, and a scaled XY
location. The exact location is the unaltered index number of the C2 feature output by the
model, without regard to band. The scaled XY location, on the other hand, remaps the C2
location to a single band (28×23), as shown in figure 20, and uses XY coordinates. The
coordinates were input into the SVM as a single vector, i.e. [x1 x2 x3 y1 y2 y3].
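A sketch of that remapping, assuming the flat C2 index is row-major on the 28×23 band; the function name and index convention are assumptions for illustration, not the thesis code:

```python
import numpy as np

BAND_H, BAND_W = 28, 23  # single band that C2 locations are remapped onto

def scaled_xy(flat_indices):
    # Remap flat (row-major) indices on the 28x23 band into the single
    # [x1..xn, y1..yn] vector fed to the classifier.
    flat = np.asarray(flat_indices)
    ys, xs = np.divmod(flat, BAND_W)  # row = index // width, col = index % width
    return np.concatenate([xs, ys])

vec = scaled_xy([0, 23, 24])  # -> [0, 0, 1, 0, 1, 1]
```

Index 0 maps to (x=0, y=0), index 23 wraps to the start of the second row (x=0, y=1), and index 24 to (x=1, y=1), so the output vector groups all x coordinates before all y coordinates.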
Another system fed the C1 maps into the classifier in order to categorize the
images. The C1 maps preserve the retinotopic map, so the theory was that this system would
outperform the standard HMAX that uses the C2 profile.
Figure 20: The scaled XY location remaps the C2 location to a single band (28×23).
Each system was tested with all 40 subjects in the AT&T database, with 20% of the
images per class used for training and 80% for testing. Ten independent trials were performed,
and the results are shown in table 2. Both generic and class-specific features were used;
class-specific features refer to features that were extracted from among the training images. It is
clear that, although the other systems can work for object recognition, the standard HMAX
system that uses the C2 profiles for classification performs the best.
Table 2: Performance of Proposed Systems. Success Rate (%) over 10 trials.

System                 Generic Dictionary    Class-Specific Dictionary
C2 Profile             81.59                 91.12
C1 Profile             87.19                 -
Exact location         80.72                 81.25
Scaled XY location     -                     73.31
Chapter 10: Conclusion and Future Work
Using a genetic algorithm, we were able to create an image that successfully fooled the HMAX
object recognition system at the given size. The generated image was composed of random
edges but still had the same C2 profile as a target image. This showed that, for our testing
conditions, the C2 profile is an undercomplete representation of an image and thus not unique.
This means that, with these conditions and this size of model, the binding problem is an issue in
the system. We furthered our study by examining Wickelgren’s approach to addressing the
binding problem. This approach claims that if enough features overlap, the representation will
be unique. We saw that this was true, since increasing the number of features made the
generated image converge from a collection of random edges to edges resembling those of the
target image. Considering the human brain can have millions of neurons for feature detection,
this shows that this is a valid method for solving the binding problem in the brain.
In addition to these tests, other systems were created that directly used location
information in classification. These systems did not perform better than the original HMAX, but
they did expose issues that might arise in any system that uses location information. One of
these issues is that a system with location information is less invariant to pose and gesture
changes. Another is that if the features are too large, they might be too ambiguous to be useful.
Overall, we found the best way to preserve the retinotopic map was to increase the number of
features.
By testing the effect of an increased number of features on the generated images, we were
able to validate one possible way the brain addresses the binding problem. This, however, is not
a commercially viable approach to solving the problem. Using a larger number of features
increases the performance and provides a more complete solution, but it also makes the system
more computationally expensive, so processing one image takes more time with more features
than with fewer. Further analysis of whether HMAX can be used as a commercial object
recognition system is needed.
Work Cited
[1] P. Moreno, et al., “A Comparative Study of Local Descriptors for Object Category
Recognition: SIFT vs. HMAX.” in Pattern Recognition and Image Analysis, 2007, pp.
515-522.
[2] T. Serre, et al., “Robust Object Recognition with Cortex-like Mechanisms” in IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 29 No.3, Mar. 2007.
[3] M. Bennamoun, G. J. Mamic. “Introduction” in Object Recognition: Fundamentals and Case
Studies (Advanced Pattern Recognition). London, England: Springer-Verlag, 2002.
[4] D. G. Lowe. "Object recognition from local scale-invariant features" in Proceedings of the
International Conference on Computer Vision. 2, Sept. 1999, pp. 1150–1157.
[5] M.F. Bear, B.W. Connors, M.A. Paradiso. “Chapter 10: The Central Visual System” in
Neuroscience: Exploring the Brain (4th ed.). Philadelphia, PA: Lippincott Williams &
Wilkins, 2001. pp 339.
[6] G. Leuba, R. Kraftsik. "Changes in volume, surface estimate, three-dimensional shape and
total number of neurons of the human primary visual cortex from midgestation until old
age" in Anatomy and Embryology 190, 1994, pp. 351-366.
[7] R. Quiroga, et al. "Invariant visual representation by single neurons in the human brain." in
Nature, Vol. 435 (7045), 6/23/2005, p1102-1107.
[8] T. Serre, L. Wolf, and T. Poggio, “Object recognition with features inspired by visual
cortex.” In CVPR, 2005.
[9] M. Aly. (2006). Face Recognition using SIFT Features [Online]. Available FTP:
vision.caltech.edu Directory: malaa/publications/ File: aly06face.pdf
[10] M. Riesenhuber and T. Poggio, “Are Cortical Models Really Bound by the 'Binding
Problem'?” in Neuron 24, 87-93, 1999.
[11] M. Riesenhuber and T Poggio, “Hierarchical models of object recognition in cortex” in
Nature Neuroscience, Vol. 2 No 9, Nov. 1999.
[12] W. Wickelgren, “Context-sensitive coding, associative memory, and serial order in
(speech) behavior” in Psychology Review 76, pp. 1–15, 1969.
[13] T. Weise (2009). Global Optimization Algorithms: Theory and Application [Online].
Available FTP: www.it-weise.de Directory: projects File: Book.pdf
[14] J. H. Holland, “Adaptation in Natural and Artificial Systems”, University of Michigan
Press, Ann Arbor (1975).
[15] J. Mutch, U. Knoblich, T. Poggio. “CNS: a GPU-based framework for simulating
cortically-organized networks.” MIT-CSAIL-TR-2010-013 / CBCL-286, Massachusetts
Institute of Technology, Cambridge, MA, February 26, 2010.
[16] T. Masquelier, S.J. Thorpe. “Unsupervised Learning of Visual Features through Spike
Timing Dependent Plasticity.” In PLoS Comput Biol 3(2):e31, pp. 247-257, 2007.