Using Genetic Algorithm to "Fool" HMAX Object Recognition Model

By Maysun Mazhar Hasan
S.B., Electrical Engineering, M.I.T., 2010

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

June 2012

Copyright 2012 Massachusetts Institute of Technology. All rights reserved.

Author: Maysun M. Hasan, Department of Electrical Engineering and Computer Science, May 21st, 2012
Certified by: Prof. Tomaso Poggio, Eugene McDermott Professor in the Brain Sciences and Human Behavior, Thesis Supervisor, May 21st, 2012
Certified by: Victor Chan, Corporate Research & Development, Qualcomm Inc., Company Thesis Supervisor, May 21st, 2012
Accepted by: Prof. Dennis M. Freeman, Chairman, Masters of Engineering Thesis Committee, May 21st, 2012

Submitted to the Department of Electrical Engineering and Computer Science on May 21st, 2012, in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science.

Abstract

The HMAX ("Hierarchical Model and X") system is among the best machine vision approaches developed to date for many object recognition tasks [1]. HMAX decomposes an image into features which are passed to a classifier.
These features each capture information about a small section of the input image, but they might not capture the overall structure of the image unless a significant number of the features overlap. The system can therefore produce a false positive when two images from two different classes have sufficiently similar feature profiles but completely different structures. To demonstrate this problem, this thesis aimed to show that the features of a given subject are not unique because they lack geometric information. A genetic algorithm (GA) was used to create an image with a feature profile similar to that of a subject but which clearly does not depict the subject. Using GA, random pixel images converged to an image whose feature profile has a small Euclidean distance from a target profile. The generated GA image does not resemble the target image but has a similar profile, which successfully fooled the classifier in most cases. This implies that the "binding problem" is a major issue in an HMAX model of the size tested. Furthermore, methods of improving the system were investigated.

Thesis Supervisor: Tomaso Poggio, Ph.D.
Title: Eugene McDermott Professor in the Brain Sciences and Human Behavior, Director of the Center for Biological and Computational Learning, MIT

Thesis Supervisor: Victor Chan
Title: Corporate Research & Development, Qualcomm Inc.

Acknowledgements

I wish to express my sincere gratitude to all those who have helped me in writing this thesis. First and foremost, I would like to thank Victor Chan. I owe him my deepest gratitude for giving me the opportunity to work on this project as well as for all his guidance and help. There is no doubt that this thesis would not have been possible without him. I am indebted to him, and to Yinyin Liu and Thomas Zheng, who have also helped me and taught me so much throughout my time at Qualcomm. They have truly made me feel a part of the team.
I would also like to thank my officemates and fellow interns, Bardia Behabadi and Corinne Teeter, who have so patiently answered all my questions about neuroscience, no matter how trivial. I would like to extend my thanks to everyone at Qualcomm, especially Rob Gilmore, who gave me the great opportunity to work there. I would like to show my gratitude to Professor Tomaso Poggio, my thesis supervisor, for his guidance. It was an honor for me to have worked with him. Additionally, I would like to take this opportunity to thank MIT for the chance to study, work, and live here for the past 6 years. It has been an amazing journey, and I have been fundamentally altered by my time here.

On a personal level, I would like to thank my family for all their love. I have undoubtedly neglected them during this writing process, but they have supported me regardless. I would not be where I am today without them. They are my inspiration and my safety net. I would also like to thank all my friends, especially those who helped me focus when my mind wandered and distracted me when it felt like there were impossible hurdles ahead. Last but not least, I would like to thank God for all the opportunities I have been given. This thesis focuses on two very complex and effective systems that mimic natural processes, and by studying them I have gained an appreciation of their intricacies and am in awe of the world around me. I am grateful to everyone for all their help. I cannot express my thanks enough.

Table of Contents

Abstract .......................................................................... 2
Acknowledgements .................................................................. 3
Chapter 1: Introduction ........................................................... 6
Chapter 2: Background in Computer Vision .......................................... 9
Chapter 3: Background in Human Vision ............................................. 12
Chapter 4: HMAX Object Recognition System ......................................... 15
    How HMAX Works ................................................................ 15
    HMAX Performance .............................................................. 17
Chapter 5: Drawbacks to HMAX: The Binding Problem ................................. 19
Chapter 6: Finding a way to "Fool" HMAX ........................................... 22
    Global Optimizations .......................................................... 24
    Genetic Algorithm ............................................................. 25
Chapter 7: Implementation ......................................................... 30
    HMAX on GPU ................................................................... 33
Chapter 8: Results and Discussion ................................................. 34
    Genetic Algorithm Results ..................................................... 34
    Issues Regarding the Process .................................................. 36
    Wickelgren's Approach to the Binding Problem .................................. 37
Chapter 9: Further Investigation .................................................. 41
    STDP Macro Features ........................................................... 41
    Location Based System ......................................................... 43
Chapter 10: Conclusion and Future Work ............................................ 46
Works Cited ....................................................................... 48

Table of Figures

Figure 1 .......................................................................... 15
Figure 2 .......................................................................... 16
Figure 3 .......................................................................... 20
Figure 4 .......................................................................... 22
Figure 5 .......................................................................... 23
Figure 6 .......................................................................... 29
Figure 7 .......................................................................... 30
Figure 8 .......................................................................... 31
Figure 9 .......................................................................... 34
Figure 10 ......................................................................... 35
Figure 11 ......................................................................... 36
Figure 12 ......................................................................... 37
Figure 13 ......................................................................... 37
Figure 14 ......................................................................... 38
Figure 15 ......................................................................... 39
Figure 16 ......................................................................... 40
Figure 17 ......................................................................... 42
Figure 18 ......................................................................... 43
Figure 19 ......................................................................... 44
Figure 20 ......................................................................... 45

List of Tables

Table 1 ........................................................................... 17
Table 2 ........................................................................... 45

To my parents, who have worked so hard and sacrificed so much for me and my education.

Chapter 1: Introduction

Human beings have the ability to recognize a multitude of objects with very little effort, regardless of variations in conditions. An object may vary somewhat in orientation or scale, or even be partially obstructed, but this often does not impair our ability to recognize it. This task, however, remains a challenge for computer vision systems in general. Generic object recognition by humans and other primates outperforms any computer vision system in almost all measures, primarily because of the significant variations exhibited in real-world images. Thus, the development of a robust object recognition system is necessary.

Object recognition can be divided into two fundamental vision problems: what is the object in the image, and where in the image is it. This thesis will address the "what" question by modifying the HMAX system [2] for object recognition.

The HMAX ("Hierarchical Model and X") system models the primate ventral visual stream in an attempt to build a system that emulates the information processing of object recognition in the brain [2]. The ventral visual stream is a hierarchy of brain areas thought to answer the "what" question in the cortex (versus the "where" question in the dorsal stream). This system is among the best machine vision approaches developed today. HMAX uses a series of layers in a feedforward manner to process visual information. It extracts features from an image and then uses these features for recognition, which makes the system relatively size- and rotation-invariant. However, there is room for key improvements that would make HMAX a better object recognition system.
HMAX decomposes an image into features which are passed to a classifier. These features relate patches of natural images to the input image and are used to describe parts of the object. Each feature captures information about a small section of the input image. These features, however, might not capture the overall structure of the image unless a significant number of them overlap. The system can therefore produce a false positive when two images from two different classes have sufficiently similar feature profiles but completely different structures. This thesis will focus on first showing that this lack of spatial information is a problem in HMAX, and then investigate methods of improvement for use in facial recognition.

Chapter 2: Background in Computer Vision

Computer vision is a multidisciplinary field of study that uses mathematical techniques to automatically extract, characterize, and interpret information about the three-dimensional world from images. The goal of computer vision is to build an artificial vision system that is capable of interpreting an image as well as a human can. As human beings, we are capable of taking in visible light and perceiving the environment around us with relative ease. However, tasks such as identifying and categorizing objects, assessing distances, and guiding body movements, which we perform without effort, are a huge challenge for computer systems.

Research in the field of computer vision can be divided into a number of well-defined processing problems or tasks. These tasks have many applications and can be solved in a variety of ways. Some of the most commonly researched computer vision tasks include image restoration, motion analysis, and object recognition. This thesis focuses on object recognition: the task of determining whether or not a particular object or feature is present in an image.
There have been great advances in the field over the years; however, even the best object recognition systems fail to perform as well as a two-year-old child when asked how many animals are in a picture [3]. This is the challenge of generic object recognition, which is the task of identifying or locating an arbitrary object in an arbitrary situation. Objects in real-world images vary in orientation and scale, or can even be partially obstructed. The human visual system is very robust, and our ability to recognize real-world images is generally not impaired by the significant variations they exhibit. For computer vision systems in general, however, this is still a challenge.

Object recognition can be divided into two fundamental vision problems: what is the object in the image, and where in the image is it. Again, this thesis only focuses on the "what" problem. There are essentially two types of approaches used to address it: appearance-based and feature-based.

Appearance-based methods compare objects directly to a template image, for example by edge matching. The problem with this approach, however, is that a single image is unlikely to represent all appearances of an object. Attempting to identify an arbitrary object based on a single template would cause a lot of false negatives, but representing all appearances of an object is not possible either. Additionally, at least one template must be stored for each object that the system has to recognize. This is plausible for a few objects, or even a few thousand, but for arbitrary recognition it would be memory-expensive.

Feature-based methods match features between the image and the objects. The feature-based approach involves extracting features from a set of reference images and using those features as a dictionary to describe other images. Thus, all images with the same definition, i.e. the same feature profile, would be categorized as the same object.
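To make the feature-based idea concrete, the following Python sketch (not from the thesis) builds a feature profile as a histogram over a feature dictionary and categorizes a query image by its nearest reference profile. The feature-extraction step itself is abstracted away: each image is assumed to already be reduced to a list of matched dictionary indices.

```python
import numpy as np

def feature_profile(matched_ids, dictionary_size):
    """Histogram of how often each dictionary feature was matched in an
    image; the actual feature extraction is assumed to happen upstream."""
    profile = np.zeros(dictionary_size)
    for i in matched_ids:
        profile[i] += 1
    return profile

def nearest_reference(query, references):
    """Index of the reference profile closest to the query profile in
    Euclidean distance."""
    distances = [np.linalg.norm(query - ref) for ref in references]
    return int(np.argmin(distances))

# Two images of the same object yield similar profiles, so the query is
# matched to the other view of that object rather than to a different one:
dog_view1 = feature_profile([0, 0, 3, 5], dictionary_size=8)
dog_view2 = feature_profile([0, 3, 3, 5], dictionary_size=8)
cat_view = feature_profile([1, 2, 2, 7], dictionary_size=8)
print(nearest_reference(dog_view1, [dog_view2, cat_view]))  # prints 0
```

Note that nothing in this scheme records where in the image a feature occurred, which is exactly the property examined later in this thesis.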
Bag-of-feature models are one type of feature-based approach to object recognition. A bag-of-feature model builds a feature profile, which is a histogram of how many times each feature appears in an image, and uses this histogram for recognition. One example of this is the Scale-Invariant Feature Transform (SIFT) developed by David Lowe [4], which has become the industry standard for generic recognition. SIFT works by extracting keypoints, or features, from reference images and creating a feature profile. The Euclidean distances between feature profiles are used to match the images to objects.

There are many motivations behind research in computer vision. For example, by designing a system capable of visual perception, we can further understand how the brain processes this information. In fact, there are many good recognition systems, like HMAX, that build off neuroscience, which in turn furthers our understanding of that field. There are also many applications of computer vision. From medical image processing to augmented reality, there are many commercial uses for a good computer vision system. This thesis evaluates the performance of the HMAX object recognition system on the application of facial recognition, which is the task of recognizing who is in a particular image. The uses of this are endless, from security systems to gaming applications.

Chapter 3: Background in Human Vision

The role of any visual system, natural or artificial, is to perceive the environment by interpreting information from visible light. The vision system in humans and other primates outperforms computer vision systems in almost all vision tasks. In fact, many computer vision systems, like HMAX, draw inspiration from the human visual system. Therefore, to understand HMAX, it is important to understand the neuroscience that inspired it. The visual cortex, which is responsible for processing visual images, is the largest system in the human brain.
There are two primary pathways for information to be processed in the visual cortex: the dorsal stream and the ventral stream. As mentioned in the previous chapter, object recognition is divided into two problems, the "what" and the "where" problems. The dorsal stream is believed to focus on the "where" problem; its pathway runs from V1 to V2 and then to parts of the brain called the dorsomedial area (V6), the middle temporal region (MT or V5), and the posterior parietal cortex. The ventral stream, on the other hand, focuses on the "what" problem. Its pathway runs from V1 to V2, then to V4, and on to the inferior temporal cortex (IT). Since this thesis focuses on the "what" problem, we will focus on the ventral stream [5].

The first layer in the visual cortex is the primary visual cortex, often referred to as V1. Visual information from the retina is passed to V1 through the lateral geniculate nucleus (LGN) found inside the thalamus. V1 is comprised mostly of cells with elongated receptive fields; as a result, the neurons in this region respond to bars and slits rather than spots of light. Essentially, this region decomposes an image into a grid of edges in which the retinotopic map (the relative positions) is preserved. Additionally, adjacent neurons in V1 respond to overlapping portions of the visual field, meaning that the neurons in the primary visual cortex are spatially organized.

V1 contains two types of cells: simple and complex. Simple cells respond to edges and gratings of a particular orientation, and they often have distinct excitatory and inhibitory regions. Complex cells similarly respond to a particular orientation, but they do not have distinct excitatory and inhibitory regions, which makes them spatially invariant [5]. The human visual cortex is comprised of hundreds of millions of neurons; V1 alone has approximately 140 million cells [6].
As we move along the ventral stream, millions of neurons fire and make connections with each other to represent all aspects of an image. Additionally, the visual system in humans and other primates seems to become more complex as you move through the layers of the visual cortex away from V1. That is to say, V2 has very similar properties to V1, like orientation preference, but it has a more complex receptive field. The theory is that the complexity continues to increase through V4 and IT until particular neurons respond to particular objects or people. This is the theory of the "grandmother cell" [5]. In fact, examples of "grandmother cells" were found for a few celebrities like Halle Berry, meaning that these cells fired only when the person saw that particular person, i.e. Halle Berry [7]. With hundreds of millions of neurons in V4 and IT, humans are capable of amazing recognition.

It is important to also note that the human visual system is not a purely feedforward system as described above. That is, information does not simply travel from V1 to V2 to V4, etc. There is also a lot of feedback between the layers; for example, V4 might transmit information back to V1. HMAX and other object recognition systems, however, tend to model only the feedforward system for simplicity.

Chapter 4: HMAX Object Recognition System

How HMAX Works

Figure 1: HMAX ("Hierarchical Model and X") System Overview [2].

The standard HMAX model [2] consists of four layers that alternate template matching and max pooling operations to build a feature profile that is selective yet position- and scale-invariant (figure 1). Like the V1 area of the primate visual cortex, the first two layers of the model consist of simple (S1) and complex (C1) cells. The first layer, referred to as S1, applies a battery of 2-D convolutions in the form of Gabor filters to the input image.

Figure 2: Scale and position tolerance building in C1 [8].
The filters come in 4 orientations and 16 scales (2 scales for each of a total of 8 bands), thus translating the input image from the pixel domain into the edge domain at multiple scales. The next layer, called the C1 layer, takes the local maximum over scale and position for each band, essentially down-sampling the S1 images per band based on the filter output strength. This builds local scale and position invariance, as shown in figure 2.

The S2 layer filters the C1 input image with a set of patches from natural images called S2 maps, which are also in the C1 format. These S2 maps are sometimes extracted from the training images. The S2 units behave as radial basis function (RBF) units, responding in a Gaussian-like way to the Euclidean distance between the natural image patches and the input. The global maximum over all scales and positions for each S2 map type is then taken to create the final C2 feature profile. This C2 profile is shift- and scale-invariant, and is passed to the classifier for final analysis.

HMAX Performance

HMAX is essentially a biological model for object recognition. However, as is evident in table 1, its performance is comparable to some of the best computer vision systems available today. The performance of the HMAX system on facial recognition was tested using the AT&T face database, also referred to as the ORL database, which is a collection of head-and-shoulder images taken under controlled settings.

Table 1: Performance of algorithms on the AT&T face database

    Algorithm                          Performance (%)¹
    HMAX, Generic Dictionary           81.6
    HMAX, Class-Specific Dictionary    91.1
    SIFT                               85.7²
    Eigenfaces                         80.1²
    Fisherfaces                        85.0²

    ¹ Ten independent trials, with 20% of the images used for training and 80% for testing, were run independently and their results averaged.
    ² Results taken from [9].
The subjects are in front of a dark homogeneous background, and the faces are upright and frontal, although roll (left/right tilt of the head) and some side movement are also present. There are 10 images for each of the 40 subjects, and the images vary in lighting, facial expression, and even facial detail, such as glasses versus no glasses.

Experiments were carried out to check the performance of the system using 20% of the images as the training set and 80% as the testing set. Two types of feature dictionaries were tested: a generic dictionary and a class-specific dictionary. Both dictionaries are comprised of 2000 features that are patches from the C1 maps of natural images. The difference is that the generic dictionary was created from a database with images of a variety of different objects, while the class-specific dictionary was created from the AT&T database only. Ten independent trials were performed with randomly chosen training and test sets. Again, the results are in table 1.

It is important to note that each feature in the C2 layer is in theory equivalent to a neuron within V4/IT. With only 2000 features, this system is much smaller than the human visual system, which has up to hundreds of millions of neurons in V4 and IT. Despite its small size, the system still performed well.

Because HMAX is fundamentally a bag-of-feature approach to object recognition, it has good rotational and size invariance. However, it also has a high rate of false positives. This is believed to occur when two images from two different classes have sufficiently similar C2 profiles, which is what is addressed in this thesis.

Chapter 5: Drawbacks to HMAX: The Binding Problem

In the HMAX model, a retinotopic map is maintained until the C2 layer. That is, as the input image is processed through layers S1 to S2, the spatial relationships of the initial image with respect to the visual field are preserved.
In the C2 layer, however, the features are pooled over the whole field. Thus, all information about the relative locations of the features is lost before the decision-making process; the features used have no reference to where they originated in the initial image. Because of this global pooling, the final C2 feature profile is scale- and translation-invariant throughout the visual field, but this also implies that the C2 features might not be sensitive to the overall geometry of a subject.

The possible lack of overall geometry due to spatially invariant features is an example of the "binding problem" in neuroscience [10]. The binding problem arises when multiple objects are encoded by separate features and it is unclear how the system binds these features together to represent a particular image. It holds that important information about an input image, like relative position and size, can get lost in any visual representation that uses modular features, and further that this lack of information leads to false positives, since the feature profile of a particular object can be activated by combining features extracted from other objects in the visual scene. If the binding problem is an issue in HMAX, the C2 feature profile would not be unique to a subject. That is, there can exist an image that has the same C2 profile as a given subject but does not resemble the subject at all, since it has completely different geometry.

The fact that the HMAX model uses features from decomposed images that are independent of one another in terms of their positions in space has been a source of criticism since HMAX was first introduced. One common criticism is that HMAX could be "fooled" by scrambling an image into pieces larger than the features, which in theory could produce the appropriate feature profile to confuse the model with the original image. Riesenhuber and Poggio showed with the simulation in figure 3, however, that this does not happen.
They claim that by having a large number of redundant features, HMAX has an over-complete definition for each object. This makes it practically impossible to scramble an image and preserve all the features, even for a low number of features [11].

Figure 3: HMAX response to scrambled images [10]. (a) shows example scrambled images used in Riesenhuber and Poggio's simulations. (b) demonstrates that the response of the model to the scrambled images mimics physiology.

This shows that HMAX uses Wickelgren's approach to addressing the binding problem [12]. This approach claims that having intermediate features, composed of combinations of smaller features, can create a unique representation, since these intermediate features can overlap. This can be demonstrated using text [10]. If we were to code the word "apple" by its individual letters, without regard to letter location, the word "elppa" could easily be misclassified. On the other hand, if we code by groups of letters, the set "app", "ppl", and "ple" can uniquely define the word "apple". Images, however, are much more complex than text, so finding features that make the representation unique is more difficult. This thesis shows that the model can still lack spatial relationships between features, which is a major drawback for a system of this size.

Chapter 6: Finding a way to "Fool" HMAX

This thesis mainly focuses on trying to demonstrate that the binding problem can exist in an HMAX model of a certain size. To demonstrate the problem, this thesis seeks to show that the C2 features of a given subject are not unique, since they lack geometric information. This is accomplished by developing a process that can create an image with a C2 profile similar to a subject's but which clearly does not depict the subject. C2 profiles are believed to be, like other bag-of-feature profiles, very similar within the same class but different between different classes.
That is to say, two different images of the same person should have very similar profiles, while images of two different people should have different profiles. Thus, the distance between two profiles within the same class, the intra-class distance d_int, should be less than the distance between two profiles from two different classes, the inter-class distance d_ext (figure 4). That is exactly what is observed: for our implementation, which is described in a later section, the Euclidean distance between intra-class profiles, d_int, ranges from 0 to 1.0 units, while the inter-class distance, d_ext, ranges from 2.5 to 4.6 units. These exact numbers can vary depending on the database and features, but the important observation is that there is a clear difference between intra- and inter-class distances.

Figure 4: Intra- versus inter-class distances.

If the C2 profile is a unique representation of a target object, the distances (d) between the profile of a target image and those of all other images would be large (d > d_int), except for images of the same object, for which they should be small (d ≤ d_int), as in figure 5a. If the profile is not unique, there should be multiple images, other than images of the target object, whose profiles also have a small distance from the target image's profile, as in figure 5b. Again, the aim of this thesis is to show that the result is closer to the latter.

Figure 5: Distances from the target image if the C2 profile is (a) a representation of a unique object's images, or (b) a representation of multiple objects' images.

This is accomplished by creating a random image whose C2 profile is within the intra-class distance but which clearly does not belong to the given class. This is essentially a global optimization problem that seeks to minimize the distance between the C2 profile of a given target image and that of a generated random image.
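The decision criterion just described can be sketched in a few lines of Python. This is an illustrative fragment, not the thesis implementation: C2 profiles are treated as plain vectors (HMAX itself is not run here), and the 1.0-unit intra-class threshold is taken from the distances reported above.

```python
import numpy as np

def fitness(candidate_profile, target_profile):
    """Fitness of a candidate image: the Euclidean distance between its
    C2 profile and the target image's C2 profile (smaller is better)."""
    return np.linalg.norm(np.asarray(candidate_profile) - np.asarray(target_profile))

def fools_classifier(candidate_profile, target_profile, d_int=1.0):
    """A candidate 'fools' the system if its profile falls within the
    intra-class distance d_int of the target (about 1.0 units in this
    setup, versus inter-class distances of roughly 2.5 to 4.6 units)."""
    return fitness(candidate_profile, target_profile) <= d_int

target = np.array([0.2, 0.9, 0.4, 0.7])
near = np.array([0.3, 0.8, 0.4, 0.6])   # small distance: treated as same class
far = np.array([2.5, 3.0, 0.1, 2.2])    # large distance: different class
print(fools_classifier(near, target))   # True
print(fools_classifier(far, target))    # False
```

The optimization task of the following sections is then to drive `fitness` below `d_int` for an image that does not depict the target subject.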
Global Optimizations

Global optimization algorithms can be divided into two basic classes: deterministic and probabilistic algorithms. A deterministic algorithm is an algorithm that does not use random numbers to determine what to do or how to modify the data [13]. It is used if there is a clear relation between the possible solutions and their "fitness" to the problem. "Fitness" refers to the metric by which a particular solution is judged to have solved the problem. For our case, the "fitness" metric is the Euclidean distance of the candidate image's C2 profile to the target C2 profile. A deterministic method explores the search space so that the true optimal solution can be determined, often employing search schemes like "divide and conquer". For the search problem in this thesis, the dimensionality of the search space is too high. This thesis aims to create a 112×92 pixel 8-bit grayscale image that has a similar C2 profile as a target image. Each pixel can take 2^8 (256) values, so there are a total of 256^(112×92) candidate images in our search space, not a feasible number to exhaustively test. A probabilistic algorithm, on the other hand, uses random processes to select and modify candidate solutions, significantly reducing the search time. These algorithms often trade the guarantee of finding the best solution for a shorter runtime. That is, it would be best if we could find an image with exactly the same profile as our target image; however, because of the difficulty and size of the problem, it is more feasible to find any image that has a small enough distance to confuse the classifier.

Genetic Algorithm

A special type of probabilistic algorithm is the genetic algorithm [14]. A genetic algorithm (GA) is a search method that looks for optimal solutions by changing system parameters in a manner that mimics natural evolution.
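The infeasibility of exhaustive search can be checked with a few lines of arithmetic: with 256 possible values per pixel, the number of candidate images is 256 raised to the number of pixels, a number with roughly 25,000 decimal digits:

```python
import math

height, width = 112, 92
num_pixels = height * width   # 10,304 pixels
levels = 2 ** 8               # 256 gray levels per pixel

# The number of candidate images is levels ** num_pixels; count its decimal
# digits via logarithms, since the number itself is astronomically large.
digits = num_pixels * math.log10(levels)
print(f"~10^{digits:.0f} candidate images")
```

For comparison, the number of atoms in the observable universe is estimated at around 10^80, so no deterministic enumeration of this space is remotely possible.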
In GA, a population of candidate solutions to a given problem "evolves" toward better solutions by implementing techniques inspired by survival of the fittest and mutation, among other things. All genetic algorithm searches proceed according to the scheme below:

1. The evolution starts with a population, Pop, of n randomly generated solutions, p1, p2, ... pn.
2. The values of the objective function are computed for each candidate solution in the population: g(p1), g(p2), ... g(pn).
3. For each generation, the fitness of every individual in the population is evaluated based on a given fitness function, f. The fitness function measures the quality of a candidate solution in solving the given problem: f(g1), f(g2), ... f(gn).
4. These fitnesses are compared, and the solutions in the population with low fitness are filtered out, while solutions with high fitness enter the "mating" pool with a higher probability. The selection process depends on the type of problem. For example, in a minimization problem, the best candidates are the solutions with the minimum fitness score, so Mating Pool = min {f(g1), f(g2), ... f(gn)}.
5. The top candidates enter the "reproduction phase", where their children are created by modifying the genomes of the top candidates. Modifications include random "mutation", "crossover", and even "migration" from a different subpopulation.
6. The top candidates and their children are passed to the next generation. The new population is used for the next iteration of the algorithm. The algorithm continues at step 2 until a termination criterion is met, such as a maximum number of generations or a solution with a desired fitness score.

To create the next generation, the genetic algorithm selects certain individuals with the best fitness values in the current population, called parents, and uses them to create individuals in the next generation, called children.
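The generic scheme above can be condensed into a short sketch. This toy Python version minimizes a simple quadratic objective rather than a C2-profile distance, and uses only truncation selection, Gaussian mutation, and implicit elitism; the parameter choices are illustrative, not those used in this thesis:

```python
import numpy as np

def genetic_minimize(fitness, dim, pop_size=30, generations=200, sigma=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: start from randomly generated solutions.
    pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))
    for _ in range(generations):
        # Steps 2-3: score every individual (here the objective IS the fitness).
        scores = np.array([fitness(p) for p in pop])
        # Step 4: truncation selection -- the best half forms the mating pool.
        parents = pop[np.argsort(scores)[: pop_size // 2]]
        # Step 5: children via Gaussian mutation of the parents' genomes.
        children = parents + rng.normal(0.0, sigma, size=parents.shape)
        # Step 6: surviving parents (acting as elites) plus children form the
        # next generation, and the loop returns to step 2.
        pop = np.vstack([parents, children])
    scores = np.array([fitness(p) for p in pop])
    return pop[np.argmin(scores)]

best = genetic_minimize(lambda p: float(np.sum(p ** 2)), dim=3)
```

Because the parents are carried over unchanged, the best score never worsens between generations, which is the same guarantee elite children provide in the full algorithm.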
There are four types of children: children due to mutation, children due to crossover, children due to migration, and elite children. The first three result from the modifications mentioned above (mutation, crossover, and migration) and are demonstrated in figure 6. Mutations are created by simply adding random noise to the genomes of the top candidate solutions (figure 6a). Mutation is a method of adding small variations to the gene pool so that similar solutions are considered. To create the mutations, the algorithm adds a random vector drawn from a Gaussian distribution to the parent's genes. Crossover, on the other hand, creates children by combining the vectors of a pair of parents (figure 6b). This is done by randomly dividing the genes of two individuals and recombining them to create the next generation. Crossover is a method of adding large variations to the gene pool so that different locales of the solution space are explored, and it reduces the likelihood of a solution getting trapped in a local minimum. The last modification is migration, in which the best individuals from one subpopulation replace the worst individuals in another subpopulation (figure 6c). Migration need not happen every generation; the algorithm can select how many generations pass between migrations. Additionally, the migration fraction, which specifies how many individuals move between subpopulations, is also variable. The last type of child common in the next generation is the elite child (figure 6d). Elite children are the individuals in the current generation with the best fitness values, and they are passed, unaltered, to the next generation. Elite children ensure that if the best or near-best solutions have been found, they will be preserved. The genetic algorithm was selected as the search approach used in this thesis for a variety of reasons.
First of all, it can quickly scan a vast solution set, like the one this thesis is faced with. Additionally, GA has an inductive nature: the algorithm does not need to know any specifics of the problem, which is ideal for a complex problem like ours with no fixed relationship or formula. Other advantages are that there is a large variety among the candidate solutions, since bad proposals are allowed because they do not affect the end solution negatively, and that GA maintains a population of possible solutions instead of the single solution iterated in other search heuristics. One possible problem in GA is premature convergence. An optimization algorithm usually converges to a candidate solution that is considered to be the best fit. The algorithm is said to converge when it cannot reach new candidate solutions but keeps producing solutions from within a small subset. For GA, as for many other forms of global optimization, it is often not possible to determine whether the value the algorithm converged to is the local best or the global best. Since GA does not evolve toward a good solution but away from bad ones, it is very possible for the search to fall into a suboptimal solution such as a local best answer. Because the candidate solutions are selected based on the best candidates of the past generation, the search can gradually begin to center around a local minimum or maximum. This, however, is not a huge issue for our problem, since the final solution is allowed to be a local minimum as long as it is within the intra-class distance. Additionally, having diversity in our population decreases the likelihood of a "trap". Diversity refers to the average distance between individuals in a population. A population has high diversity if the average distance is large, which can be accomplished by having a large population and a large rate of migration, mutation, and crossover.
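Diversity as defined here, the average distance between individuals, can be measured directly. The following is a small sketch; the population sizes and genome lengths are illustrative:

```python
import numpy as np

def diversity(population):
    """Average pairwise Euclidean distance between distinct individuals."""
    d = np.linalg.norm(population[:, None, :] - population[None, :, :], axis=2)
    n = len(population)
    return float(d.sum() / (n * (n - 1)))   # exclude the zero self-distances

rng = np.random.default_rng(1)
clustered = rng.normal(0.0, 0.01, size=(20, 50))   # low-diversity population
scattered = rng.normal(0.0, 1.00, size=(20, 50))   # high-diversity population
```

A population like `clustered` is the warning sign of premature convergence: every individual sits in the same small neighborhood, so mutation and crossover keep producing near-duplicates.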
Lastly, diversity can be attained by running the GA with different random seeds to generate the initial population, so that different parts of the solution space are explored between different GA runs. Overall, the genetic algorithm seemed like the best approach to finding an image with the same C2 profile as a target image in order to "fool" HMAX.

Figure 6: Modifications used on "parents" to create the next generation of "children".

Chapter 7: Implementation

This thesis created a process that uses a genetic algorithm to create a 112×92 grayscale image, the GA image, which has a similar C2 feature profile as a particular image of the same resolution, the target image. The process is demonstrated in figure 7. It first creates a "population" of candidate solutions consisting of 112×92 grayscale random pixel images. These random pixel images are then processed through HMAX, and the output of the C2 layer is compared to that of the target image. The Euclidean distance between the C2 output of each GA-generated candidate solution and that of the target image is calculated and used as the fitness metric. This metric is used to select the best candidates to be passed on, or to undergo "mutation" and/or "crossover", for the next generation. A "mutation" randomly changes the value of a certain percentage of pixels, while a "crossover" swaps a large part of the image of one candidate with that of another.

Figure 7: Genetic Algorithm process overview.

The GA continues to iterate until the distance between the C2 output of the target and that of the GA images is within an acceptable range, or until a given number of generations has passed, as shown in figure 8. For our implementation, we used features from 500 natural patches of 4 sizes, for a total of 2000 features. The Euclidean distance between the target C2 profile and the C2 profiles of the candidate solutions was used as the fitness measure to determine the best candidate solutions.
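The pixel-level ingredients of this process can be sketched as follows. Here the images are flattened 112×92 genomes, and `c2_profile` is a stand-in (a fixed random projection) for the real HMAX C2 stage; that stand-in, and all function names, are assumptions made purely so the sketch is self-contained:

```python
import numpy as np

H, W, N_FEATURES = 112, 92, 2000
_proj = np.random.default_rng(0).normal(
    size=(N_FEATURES, H * W)).astype(np.float32) / np.float32((H * W) ** 0.5)

def c2_profile(image):
    """Placeholder for HMAX: maps a flattened image to a 2000-dim profile."""
    return _proj @ (image.astype(np.float32) / 255.0)

def fitness(candidate, target_profile):
    """Euclidean distance between C2 profiles -- lower is better."""
    return float(np.linalg.norm(c2_profile(candidate) - target_profile))

def mutate(parent, rate, rng):
    """Randomly change a fraction `rate` of the pixels."""
    child = parent.copy()
    mask = rng.random(parent.shape) < rate
    child[mask] = rng.integers(0, 256, size=int(mask.sum()), dtype=np.uint8)
    return child

def crossover(a, b, rng):
    """Swap a large contiguous block of pixels between two candidates."""
    cut = int(rng.integers(1, a.size))
    return np.concatenate([a[:cut], b[cut:]])

rng = np.random.default_rng(42)
population = rng.integers(0, 256, size=(100, H * W), dtype=np.uint8)
target = rng.integers(0, 256, size=H * W, dtype=np.uint8)
target_profile = c2_profile(target)
```

With the real HMAX substituted for `c2_profile`, the loop from chapter 6 would repeatedly score `population` with `fitness` and rebuild it from mutated and crossed-over top candidates until the distance falls within the intra-class range.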
The aim was to get the C2 output of the GA image to be well within the intra-class distance. This ensured that the GA-generated image would be sufficiently similar in the C2 feature space to "fool" the classifier.

Figure 8: Genetic Algorithm generated image profile converging to the target image profile.

Although 2000 features is far fewer than the hundreds of millions of features in the V4/IT region of the brain that the C2 layer is based on, a system of this size had very good recognition performance and is commercially viable to implement. That is, since more features mean more computation and memory usage, 2000 features is a sufficient number to achieve a recognition success rate competitive with other top algorithms, making a system of this size the most commercially viable. The genetic algorithm optimization was performed using the Optimization Toolbox in Matlab R2009b. The GA was performed with a population of 100 individuals that were binary representations of a 112×92 8-bit grayscale image. The initial population was randomly created white noise. The standard HMAX model, described in chapter 4, was used as the objective function. The mutation children were created by a two-step process. First, the algorithm selected a fraction of the parent's entries for mutation: each entry, referred to as a gene, had a given probability of being mutated. The rate was initially set to 0.1, but if a large number of generations passed without much change to the fitness score, the rate was increased. After the genes to be mutated were selected, the algorithm replaced each one with a uniformly selected random number. This is referred to as the uniform mutation function. Crossover children for the next generation were created according to the crossover function. A scattered function was used for this implementation.
This function randomly selected individual genes to take from each parent and combined them to form the child. Migration children in the next generation were created by forward migration, meaning individuals migrated toward the last subpopulation. The number of individuals that move between subpopulations was determined by taking a fraction of the smaller of the two subpopulations; that fraction was set to 0.2. Migration occurred every 20 generations. Lastly, there were always two elite children selected, which were exact copies of the best-fit individuals of the previous generation.

HMAX on GPU

The GA was run for thousands of generations, which took approximately 3 months. This is because HMAX, like all cortical models, is computationally expensive, and running the GA requires the whole population (of 100 individuals) to be analyzed for each generation. For further investigations, a faster system was necessary. A parallel computing version of HMAX, developed by Jim Mutch, allowed for an increase in the algorithm's speed. This version was developed as a model using a programming framework called the Cortical Network Simulator (CNS). This framework runs on a general-purpose Graphics Processing Unit (GPU) and can process the HMAX model 80 to 100 times faster than a single CPU [15]. Using this version of HMAX as our objective function, the computation time for the GA was significantly reduced.

Chapter 8: Results and Discussion

Genetic Algorithm Results

Figure 9: The C2 profile of the generated image converges to the target profile (distance falling from 3.2 to 0.71), but its appearance does not resemble the target image.

The genetic algorithm was run for thousands of generations, and the output of the C2 layer slowly converged toward the C2 output of the target image, as shown in figure 9. With the original setup, using the CPU version of HMAX with 2000 features, the final distance was 0.71 after 8100 generations.
This is well within the intra-class distance, meaning the C2 outputs of the GA image and the target image are very similar. The GA image, however, does not resemble the target image at all. As is apparent in the C1 map of the GA image, on the left side of figure 10, the generated image converged to a combination of random edges. Thus, with the genetic algorithm, we were able to create an image comprised of random edges that has a similar C2 output as the target image. This shows that the C2 layer does not preserve the retinotopic map and that the profile is not unique. Since the retinotopic map is not preserved, the binding problem, which in our case relates to the relative spatial relationships between features, exists for this setup. As a definitive test of whether we successfully "fooled" HMAX, we tested the GA image using a Support Vector Machine (SVM) classifier. When the classifier was asked which of the 40 subjects the generated image belonged to, the GA image was classified into the target class. This proves that we were able to fool HMAX.

Figure 10: The C1 map of the generated image (left) is a collection of random edges and does not resemble the C1 map of the target image (right).

Issues Regarding the Process

The process developed in this thesis worked well in showing that the C2 profile of an image is not unique when we have only 2000 features. We repeated this process with the GPU version of HMAX, also with 2000 features, and found similar results. As is evident in figure 11, the optimization decays exponentially with a time constant, τ, of less than 100 generations; that is, the C2 profiles of the candidate images converge exponentially to the target C2 profile.

Figure 11: The distance between the generated profile and the target profile decreases exponentially through the generations.

A major issue with the genetic algorithm, as previously mentioned, is premature convergence.
As seen in figure 12, there were times during which hundreds of generations passed without a new candidate solution having a smaller distance than the current best. This suboptimal solution was caused by a lack of diversity in the population. To fix this, the mutation rate and the migration rate were adjusted and the GA restarted. Sometimes a new population was manually introduced, with only the elite children preserved. This restarted the stalled algorithm.

Figure 12: The Genetic Algorithm sometimes stalls at a suboptimal convergence and requires restarting the GA with higher diversity to continue.

Wickelgren's Approach to the Binding Problem

The genetic algorithm was also used to investigate how best to preserve relative spatial information in the HMAX model without maintaining actual feature locations. Wickelgren's work implies that if there is a significant number of overlapping features, these features will constrain the representation to be unique [12]. By increasing the number of features, we can increase the chance of features overlapping. Furthermore, increasing the number of features has been shown to increase the overall performance of HMAX [2]. To further this study, the GA was run using the GPU version of HMAX with 70, 700, 2000, and 7000 features, and the results were evaluated. The overall performance on the AT&T database did in fact increase as the number of features increased (figure 13), as predicted by Serre's work [2].

Figure 13: Performance increases with the number of features.

This might be explained by the fact that the overall distance between C2 profiles of different objects increases as the number of features increases. This is apparent in figure 14, where the difference between the average distance between images in the same class and the average distance between images in different classes is plotted, and an increasing trend is shown.

Figure 14: The difference between dint and dext increases with the number of features.
This trend would make classification easier, since different objects are easier to categorize with a larger difference. This trend of increasing distances with an increasing number of features also applies to the distances between the random pixel images in the initial population and the target image (figure 15a). This means that the genetic algorithm should take longer, since it has more ground to cover. In fact, with more features, more generations are required to find the optimal solution. Figure 15b shows the number of generations required for the GA image to reach the mean intra-class distance. There is a slight increasing trend at the beginning, but a large jump in the number of generations required with 7000 features. Since there is a direct relationship between the degree of difficulty of finding a good solution to a problem and the amount of time needed to find that solution, it is clear that as the number of features increases, the difficulty of finding a solution increases. This is a good sign in support of Wickelgren's theory.

Figure 15: (a) The initial distance from the target image increases with the number of features. (b) The number of generations needed to reach the average dint increases with the number of features. Both trends indicate that it is getting harder to generate an image with the same C2 output.

The definitive proof of whether or not increasing the number of features will constrain the C2 profile to be unique depends on the C1 maps of the GA images. Figure 16a shows the C1 image of a particular orientation for a GA image created with 70 features. After 5000 generations, this GA image has a C2 profile distance of 0.0247, which is well within dint for this class. As is apparent in the image, the C1 image is a collection of random edges, so the retinotopic map is not preserved and the representation is not unique. Figure 16b shows the C1 map of one orientation for a GA image created with 7000 features.
After 764,000 generations, the distance from the target C2 profile is 0.4457, which is also well within dint. Looking at the figure, the C1 maps of the target and generated images are not exactly the same, but there are clear similarities. This is a drastic difference from the previously generated images, which had no correlation. All other aspects of the GA were held constant for this test, indicating that by only increasing the number of features, some parts of the retinotopic map were preserved. That is, with 7000 features, it is clear that more spatial organization is present than with fewer features. Considering that V4/IT consists of hundreds of millions of features, it is clear that Wickelgren's approach to the binding problem is valid for the visual cortex, and that in order to avoid the issue in computer vision a significant number of features is necessary.

Figure 16: Comparing the C1 maps of generated images for different numbers of features (left side of (a) and (b)) to the target image C1 map (right side). (a) C1 map of a GA image created using 70 features vs. the C1 map of the target; there are no similarities in any orientation. (b) C1 map of a GA image created using 7000 features vs. the C1 map of the target; similarities exist in multiple orientations and are circled in the image.

Chapter 9: Further Investigation

The generated image showed that spatial relationships between the features are not preserved if there is a small number of features. However, without parallel computing, a larger number of features means longer computation time, which would make the system less desirable in commercial applications. Because of this, other identification schemes were tested.

STDP Macro Features

For one trial, using the forty classes from the AT&T face database, with 20% of the images for training and 80% for testing, HMAX alone had a 17% failure rate (83% success). For most of these failures, ~76%, the correct class was in the top 5 classifier outputs.
The theory behind the first identification system designed was to correctly discriminate the correct class from the top five choices. For this trial set, that would result in a ~13% increase in performance. The proposed face detection system, shown in figure 17, would use the invariance of HMAX to select the best candidates, and macro features that contain spatial information to select the best-fit class.

Figure 17: Proposed system that uses intermediate "macro" features for template matching the best image.

The algorithm used to develop the macro features was an asynchronous feedforward spiking neural network developed by Timothée Masquelier and Simon Thorpe [16]. This algorithm mimics spike-timing-dependent plasticity (STDP) in order to extract important visual features from an image while remaining completely unsupervised. STDP is a learning rule used by neurons that are repeatedly exposed to a similar input: STDP modifies the synaptic weight of that input based on the relative timing of the action potentials. The features extracted by this algorithm were capable of various classification tasks [16]. The proposed setup was tested with the first 20 subjects in the database. Of the 160 images, 14 of the images that were incorrectly categorized had the correct class in the top 5 classes. Using macro features, 8 of the 14 errors were corrected (~57%), 4 failed (~29%), and 2 trials had no definitive answer (~14%). However, the overall test of the proposed solution showed that the system was not useful. The system had a success rate of 78%, a large decrease from the 86% rate with HMAX alone, and it failed to recognize 17% of the faces HMAX alone recognized. Examining the images that failed, the sources of error were clear. The first source of error was false positives. As can be seen in figure 18a, some of the macro features are too non-specific; because of this, many images not belonging to a given class fit that feature, so that class was often selected.
Additionally, pose and gesture variation between faces was another source of error (figure 18b). Things like glasses and differences in facial expression cause the macro feature to not be detected; all of the variation tolerance developed by HMAX is lost with the macro features. These two sources of error likely resulted from the template matching of large features.

Figure 18: Sources of error in the proposed system.

This failed experiment further shows the advantages HMAX has over other recognition systems.

Location Based System

Another proposed system was to use, as the input to the SVM, the distances between the locations of C2 features instead of only the C2 output. As figure 19 shows, there is a clear distinction between the location distances within and between classes. That is, the distances between the locations of a feature within the same class tend to be smaller than the distances between different classes. This implies that the distances between locations might be a good element by which to classify images.

Figure 19: Standardized distance between locations. There is a clear distinction between the location distances within and between classes, implying distances might be a good element by which to classify images.

Two types of locations were used to test this system: the exact location and a scaled XY location. The exact location is the unaltered index number of the C2 features output by the model, without regard to band. The scaled XY location, on the other hand, remaps the C2 location to a single band (28×23), as shown in figure 20, and uses XY coordinates. The coordinates were input into the SVM as a single vector, i.e. [x1 x2 x3 y1 y2 y3]. Another system put the C1 maps into the classifier in order to categorize the images. The C1 maps preserve the retinotopic map, so the theory was that this would outperform the standard HMAX that uses the C2 profile.
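The scaled-XY encoding can be sketched as follows. The exact remapping used in the implementation is not fully specified here, so the linear rescaling and the function name below are illustrative assumptions:

```python
import numpy as np

def scaled_xy_vector(locations, band_shapes, target_shape=(28, 23)):
    """Remap per-band (row, col) feature locations onto a single 28x23 band
    and pack them as one vector [x1 ... xn y1 ... yn] for the SVM."""
    th, tw = target_shape
    xs, ys = [], []
    for (row, col), (h, w) in zip(locations, band_shapes):
        xs.append(col * tw / w)   # scale the column (x) coordinate
        ys.append(row * th / h)   # scale the row (y) coordinate
    return np.array(xs + ys)

# Three features found in bands of different resolutions, all at the same
# relative position, so they should land on the same remapped coordinates:
vec = scaled_xy_vector(
    locations=[(14, 23), (7, 11.5), (28, 46)],
    band_shapes=[(56, 46), (28, 23), (112, 92)],
)
```

Because the remapping is purely a rescaling, it preserves relative positions across bands while discarding the band index itself, which is exactly what makes the encoding comparable between features detected at different scales.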
Figure 20: The scaled XY location remaps the C2 location to a single band (28×23).

Each system was tested with all 40 subjects in the AT&T database, with 20% of the images per class for training and 80% for testing. Ten independent trials were performed, and the results are shown in table 2. Both generic and class-specific features were used; class-specific features refer to features that were extracted from among the training images. It is clear that, although the other systems can work for object recognition, the standard HMAX system that uses the C2 profiles for classification performs the best.

Table 2: Performance of Proposed Systems. Success Rate (%) over 10 trials.

                        Generic Dictionary   Class-Specific Dictionary
  C2 Profile                 81.59                    91.12
  C1 Profile                 87.19                      -
  Exact location             80.72                    81.25
  Scaled XY location           -                      73.31

Chapter 10: Conclusion and Future Work

Using a genetic algorithm, we were able to create an image that successfully fooled an HMAX object recognition system of the given size. The generated image was composed of random edges but still had the same C2 profile as the target image. This showed that, for our testing conditions, the C2 profile is an under-complete representation of an image and thus not unique. This means that, with these conditions and this size of model, the binding problem is an issue in the system. We furthered our study by examining Wickelgren's approach to addressing the binding problem. This approach claims that if enough features overlap, the representation will be unique. We saw that this was true, since increasing the number of features made the generated image converge from a collection of random edges to edges that resemble those of the target image. Considering that the human brain can have millions of neurons for feature detection, this shows that this is a valid method of solving the binding problem in the brain. In addition to these tests, other systems were created that directly used location information in classification.
These systems did not perform better than the original HMAX, but they did show issues that might arise in any system that uses location information. One of these issues is that a system with location information is less invariant to pose and gesture changes. Another issue is that if the features are too large, they might be too ambiguous to be useful. Overall, we found that the best way to preserve the retinotopic map was to increase the number of features. By testing the effect of an increased number of features on the generated images, we were able to validate one possible way the brain addresses the binding problem. This, however, is not a commercially viable approach to solving the problem. Using a larger number of features increases the performance and provides a more complete solution, but it also makes the system more computationally expensive, taking more time to process one image than it would with fewer features. Further analysis of whether HMAX can be used as a commercial object recognition system is needed.

Work Cited

[1] P. Moreno, et al., "A Comparative Study of Local Descriptors for Object Category Recognition: SIFT vs. HMAX," in Pattern Recognition and Image Analysis, 2007, pp. 515-522.
[2] T. Serre, et al., "Robust Object Recognition with Cortex-like Mechanisms," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, Mar. 2007.
[3] M. Bennamoun and G. J. Mamic, "Introduction," in Object Recognition: Fundamentals and Case Studies (Advanced Pattern Recognition). London, England: Springer-Verlag, 2002.
[4] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the International Conference on Computer Vision, vol. 2, Sept. 1999, pp. 1150-1157.
[5] M. F. Bear, B. W. Connors, and M. A. Paradiso, "Chapter 10: The Central Visual System," in Neuroscience: Exploring the Brain (4th ed.). Philadelphia, PA: Lippincott Williams & Wilkins, 2001, p. 339.
[6] G. Leuba and R. Kraftsik,
"Changes in volume, surface estimate, three-dimensional shape and total number of neurons of the human primary visual cortex from midgestation until old age," in Anatomy and Embryology, vol. 190, 1994, pp. 351-366.
[7] R. Quiroga, et al., "Invariant visual representation by single neurons in the human brain," in Nature, vol. 435, no. 7045, June 23, 2005, pp. 1102-1107.
[8] T. Serre, L. Wolf, and T. Poggio, "Object recognition with features inspired by visual cortex," in CVPR, 2005.
[9] M. Aly. (2006). Face Recognition using SIFT Features [Online]. Available FTP: vision.caltech.edu Directory: malaa/publications/ File: aly06face.pdf
[10] M. Riesenhuber and T. Poggio, "Are Cortical Models Really Bound by the 'Binding Problem'?" in Neuron, vol. 24, 1999, pp. 87-93.
[11] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," in Nature Neuroscience, vol. 2, no. 11, Nov. 1999.
[12] W. Wickelgren, "Context-sensitive coding, associative memory, and serial order in (speech) behavior," in Psychological Review, vol. 76, 1969, pp. 1-15.
[13] T. Weise. (2009). Global Optimization Algorithms: Theory and Application [Online]. Available FTP: www.it-weise.de Directory: projects File: Book.pdf
[14] J. H. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
[15] J. Mutch, U. Knoblich, and T. Poggio, "CNS: a GPU-based framework for simulating cortically-organized networks," MIT-CSAIL-TR-2010-013 / CBCL-286, Massachusetts Institute of Technology, Cambridge, MA, Feb. 26, 2010.
[16] T. Masquelier and S. J. Thorpe, "Unsupervised Learning of Visual Features through Spike Timing Dependent Plasticity," in PLoS Comput Biol, vol. 3, no. 2, e31, pp. 247-257, 2007.