Integrating Bottom-up and Top-Down Information

by

Giuliano Pezzolo Giacaglia

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 17, 2014

Certified by: Patrick Henry Winston, Ford Professor of Artificial Intelligence and Computer Science, Thesis Supervisor

Accepted by: Albert R. Meyer, Chairman, Department Committee on Graduate Theses
Integrating Bottom-up and Top-Down Information
by
Giuliano Pezzolo Giacaglia
Submitted to the Department of Electrical Engineering and Computer Science
on January 17, 2014, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Computer Science and Engineering
Abstract
In this thesis I present a framework for integrating bottom-up and top-down computer
vision algorithms. I developed this framework, which I call the Map-Dictionary Pixel
framework, because my intuition is that there is a need for tools that make it easier
to build computer vision systems that mimic the way human visual systems process
information. In particular, we humans create models of objects around us,
and we use these models, top-down, to interpret, analyze and discern objects in the
information that comes bottom-up from the visual world.
After introducing my Map-Dictionary Pixel framework, I demonstrate how it empowers computer vision algorithms. I implement two different systems that extract
the pixels of the image that correspond to a human. Even though each system uses
different sets of algorithms, both use the Map-Dictionary Pixel framework as the connecting pipeline. The two implementations demonstrate the utility of the Map-Dictionary
Pixel framework and provide an example of how it can be used.
Thesis Supervisor: Patrick Henry Winston
Title: Ford Professor of Artificial Intelligence and Computer Science
Acknowledgments
If I were to describe my life in one word, that word would be hustle. Urban Dictionary defines hustle as "To do things to get closer to the point you want to get." I've always worked really hard to achieve my aspirations. When I was in middle school I hustled to get a scholarship to the best high school in Brazil. When I was in high school I hustled to increase my scholarship, trying to be among the best mathematicians of my age in Brazil. In college in Brazil I hustled to survive the hazing and military training. A semester later at MIT I hustled to pay for food and living.
I was really fortunate to have all these opportunities and I know that there are a
lot of people that want to be in the same place but didn't have the same opportunities.
Jay-Z in "American Dreamin" explains it well:
"Mama forgive me, should be thinking about Harvard
But that's too far away, niggas are starving
Ain't nothing wrong with my aim, just gotta change the target
I got dreams of bagging snidd-ow, the size of pillows
I see pies every time my eyes clidd-ose
I see rides, sixes, I got to get those
Life's a bitch, I hope to not make her a widow"
MIT was especially hard. It was hard not only because of the endless psets and labs, the midterms and projects, but especially because I had to do all of this while worrying about bills. Sometimes there wouldn't be money for lunch.
I survived. I did it only with the help of my friends and family. I'm extremely thankful to them. I dedicate all my work to them.
To my Mom, Leticia Maria Pezzolo Giacaglia, for all her love.
To my dad for all the knowledge and for being an example of honesty and good work ethic. To my brother for always standing by my side.
I want to thank my friends Aldo and Beneah, who kept me sane and going. Sometimes we would go partying on the weekend nights so we could forget about the
problems of the past week and the problems that we would encounter the next day.
To all my friends in the Genesis group, Sam, Hiba, Nikola, Igor, Adam, Matthew, Sila, Mark, and Dylan, for great discussions that improved this work. But most of all, for all our discussions about other random things that made my experience so great.
To Rafa, Bujao and Pond, for helping me transition to this country.
To Professor Winston, for being an example of love and care for his students.
This research was supported, in part, by the Defense Advanced Research Projects
Agency, Award Number Mind's Eye: W911NF-10-2-0065.
Contents
1 Introduction
  1.1 Vision
  1.2 Problem Statement
  1.3 Map-Dictionary Pixel
      1.3.1 Implementation
  1.4 Organization

2 Previous Work
  2.1 Object Detection
      2.1.1 Histogram of Oriented Gradients Detection
  2.2 Image Segmentation
  2.3 Top-Down Segmentation
  2.4 Object Detection from Contour
      2.4.1 Chamfer Distance
  2.5 Related Work
      2.5.1 Conditional Random Fields
      2.5.2 Refining Top-Down Segmentation
      2.5.3 Scoring Bottom-up Segmentation

3 Usage Examples
  3.1 Ultrametric Contour Map
  3.2 Matching Fragments
  3.3 Other Possible Usages
      3.3.1 3D Reconstruction and Object Detection
      3.3.2 Image Segmentation and 3D Reconstruction

4 Histogram of Oriented Gradients Filter
  4.1 Compute Gradients
  4.2 Orientation Binning
  4.3 Descriptor Blocks
  4.4 Contrast Normalize over Overlapping Spatial Blocks
  4.5 Collect Histogram of Oriented Gradients over Detection Window
  4.6 Support Vector Machine
      4.6.1 New Data Points
  4.7 Map-Dictionary Pixel

5 Image Segmentation
  5.1 Map-Dictionary Pixel
  5.2 Minimum N Cuts
  5.3 Local Information
      5.3.1 Each Angle
      5.3.2 Local Distance
  5.4 Global Information
      5.4.1 Distance Transformation
      5.4.2 Each Angle
      5.4.3 Global Distance
  5.5 Oriented Watershed Transform
      5.5.1 Map-Dictionary Pixel

6 Histogram of Oriented Gradients Filters for the Subparts
      6.0.2 Map-Dictionary Pixel
      6.0.3 Algorithm
  6.1 Training Data
      6.1.1 PASCAL Dataset
  6.2 Output

7 Matching Fragments
  7.1 Training Data
      7.1.1 Fragment
      7.1.2 Mask
  7.2 Algorithm
      7.2.1 Correlation
  7.3 Textons
      7.3.1 Filter Bank
  7.4 Output

8 Conclusion
  8.1 Contributions

A Code
  A.0.1 Chapter 5
  A.0.2 Chapter 7
List of Figures
1-1 An example of an image in which the human visual system would have trouble identifying different objects (people) if it did not have a model of the objects, i.e., if it did not use Top-Down Information.
1-2 Input image and output image of the Ultrametric Contour Map algorithm.
1-3 The input and output of a second algorithm using the proposed framework.
1-4 3D reconstruction; the algorithm that generates the output data cannot easily be used in other algorithms.
1-5 What a Map-Dictionary Pixel looks like; the dictionary contained in the upper-right pixel is shown.
1-6 How the Map-Dictionary Pixel ties together with the algorithms.
2-2 An example of how to calculate the chamfer distance between two sets of edges.
3-1 The Ultrametric Contour Map algorithm ties together three algorithms using the Map-Dictionary Pixel; each algorithm outputs the Map-Dictionary Pixel that is processed by the next algorithm in the chain.
3-2 The input and output of running the Ultrametric Contour Map algorithm; the output is a probability map.
3-3 How the Map-Dictionary Pixel ties together the two presented algorithms, Object Detection and Top-Down Segmentation.
4-1 How the Histogram of Oriented Gradients filter is subdivided and how each part is connected to the others.
4-2 Representation of a 9x9 patch in the RGB color channels.
4-3 The red color channel and the corresponding value block.
4-4 Multiplying the first matrix with the horizontal mask [1, 0, -1].
4-5 Orientation binning in 6x6 pixel cells.
4-6 The size of the detection window.
4-7 How a Support Vector Machine separates the training data.
4-8 The input image and P(x, y).
5-1 Minimum cut with sum equal to 10.
5-2 The process of calculating the local distance for a certain angle (from Arbelaez et al., 2012).
5-3 e^{-x} from -3 to 3.
5-4 The process of calculating the Oriented Watershed Transform (from [1]).
5-5 The input and output of running the Ultrametric Contour Map algorithm.
6-1 A raw image, E(x, y), a segmented part in which the Histogram of Oriented Gradients filter detects a face, and one in which it does not.
6-2 Set of face images used for training the Histogram of Oriented Gradients filter model for the human face.
6-3 The input of the Ultrametric Contour Map and the output mask.
7-1 A pair of a fragment and its mask.
7-2 An input image, the 100 best matching fragments, and the 100 best matching fragments using texton information.
7-3 Two different histograms representing how many textons of each type are in each region.
7-4 An image and the corresponding textons.
7-5 The representation of the LM Filter Bank.
Chapter 1
Introduction
The human visual system is a very complex and fascinating system. It is a challenge to create a computer system that achieves the same results as the human visual system does. The human visual system has the ability to interpret a large array of data in a short period of time. Moreover, to accomplish the many tasks it is in charge of, the human visual system is engineered in a different way than computer systems usually are. Most of the information that flows between the eyes and the brain flows from the brain to the eyes. The human visual system uses prior information to identify objects.
Figure 1-1: An example of an image in which the human visual system would have trouble identifying different objects (people) if it did not have a model of the objects, i.e., if it did not use Top-Down Information.
Figure 1-1 shows an example of an image that demonstrates how the brain creates models of objects and uses that information to interpret images.
To solve the problem of integrating the prior information that the brain gathers, I define two types of information: Bottom-Up and Top-Down Information. Top-Down Information is the information contained in the models of objects and high-level structures. Bottom-Up Information is the information given only by the image. This thesis creates a framework to integrate the two types of information easily.
1.1
Vision
One example of an implementation of the presented framework is an algorithm that creates a mask on a single image; the mask is created on top of the humans that appear in that image. Figure 1-2 presents an example of this implementation.
The leftmost image in Figure 1-2 represents the input of the algorithm with the Histogram of Oriented Gradients filter on top of it. The image on the right in Figure 1-2 represents the output of the first implementation of the presented framework. Figure 1-3 shows another pair of input and output images for the first implementation of the presented framework.
Figure 1-2: Input Image and Output Image of the Ultrametric Contour Map Algorithm.
The problem is how to integrate these two types of information in a simple model. This model should be simple enough that different types of algorithms, created separately, can be joined together easily.

Figure 1-3: These images represent the input and output of a second algorithm using the proposed framework.
1.2
Problem Statement
Computer Vision has been divided into many single tasks that are solved in isolation. In this thesis I am only concerned with tasks that relate to single images. These tasks include Object Detection, Image Segmentation, Contour Extraction, and 3D reconstruction, among others.
Each single task is handled separately. The question is how to integrate different algorithms into a single system, or how to transfer the output or knowledge from one algorithm to another easily, without modifying each of the algorithms.
Figure 1-4: This image depicts 3D reconstruction. The algorithm that generates the output data cannot easily be used in other algorithms.
[Diagram: a grid of Map-Dictionary Pixels, where the dictionary of the highlighted pixel is {Image: [1,3,4], Person: 0.3, Edge: 0.5}.]
Figure 1-5: This image represents what a Map-Dictionary Pixel looks like. The
dictionary that is contained in the pixel at the most upper-right position is represented
in this image.
1.3
Map-Dictionary Pixel
Map-Dictionary Pixel addresses the problem of integrating top-down information and bottom-up information. The Map-Dictionary Pixel can be used for detecting objects, segmenting images, finding contours of objects inside a single image, and other applications. In this thesis I present two implementations of the presented framework.
Each algorithm can be thought of as a black box; the Map-Dictionary Pixel is the representation of the input and the output of each algorithm. No algorithm has any information about the other algorithms, except that they are connected using the Map-Dictionary Pixel.
Definition: A Map-Dictionary Pixel is a map between each pixel in an image and a dictionary. Each entry in the dictionary contains information encoding the image and encoding the outputs of the algorithms.
[Diagram: Map-Dictionary Pixel → Algorithm → Output.]
Figure 1-6: This image shows a representation of how Map-Dictionary Pixel ties
together with the algorithms.
1.3.1
Implementation
The implementation of the Map-Dictionary Pixel class is shown below:

class MapDictionary:
    def __init__(self, m, n):
        # One empty dictionary per pixel of an m x n image.
        self.matrix = [[{} for y in range(n)] for x in range(m)]

    def addDictToEntry(self, dict, i, j):
        # Store the dictionary produced by an algorithm at pixel (i, j).
        self.matrix[i][j] = dict

    def getDictAtEntry(self, i, j):
        return self.matrix[i][j]
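As a usage sketch (the image size, pixel coordinates, and dictionary entries below are made up for illustration), an algorithm reads the dictionary at a pixel, adds its own entry, and writes it back:

# Hypothetical usage of the MapDictionary class above.
mdp = MapDictionary(640, 480)      # one dictionary per pixel of a 640x480 image
d = mdp.getDictAtEntry(10, 20)     # read whatever earlier algorithms stored
d["Person"] = 1                    # add this algorithm's output for this pixel
mdp.addDictToEntry(d, 10, 20)      # write the augmented dictionary back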
1.4
Organization
In Chapter 2 I present the previous work in each area. In Chapter 3 I give an overview of the possible use cases of the framework. In Chapter 4 I explain how Histogram of Oriented Gradients filters work in detail and how I reimplemented them. In Chapter 5 the bottom-up segmentation algorithm is explained in detail. In Chapter 6 I present how I implemented Histogram of Oriented Gradients filters for the face, lower body and upper body. In Chapter 7 I present the implementation of the top-down segmentation of the image using the model of an object. In Chapter 8 I present the contributions of the work presented in this thesis.
Chapter 2
Previous Work
In this chapter I describe the most influential and state-of-the-art work on Object Detection and Image Segmentation. I do not intend to offer a comprehensive summary of all the approaches to Object Detection and Image Segmentation, but instead I summarize the most important results in this area.
Object Detection and Image Segmentation are important areas of Computer Vision. Most of the work deals with each problem separately, but object detection can help segmentation and vice versa.
2.1
Object Detection
Object Detection is one of the most important areas of Computer Vision, because detecting objects in images has many applications.
The most famous work in the area is the Histogram of Oriented Gradients. Histogram of Oriented Gradients detection uses gradients in images to create features for object detection. The reason behind using gradients is that gradients give a good cue for shapes. The reason behind using histograms is that histograms are a good way of representing areas of images and comparing the respective areas.
2.1.1
Histogram of Oriented Gradients Detection
Work done by Dalal and Triggs (2005) has influenced most of Histogram of Oriented
Gradients detection.
The most significant improvement on top of Dalal and Triggs (2005) is the work done by Felzenszwalb (2010), which improves the algorithm by creating a root filter and part filters for subparts of the object. Deformation and mixture models help to create a multi-view representation of a given object.
2.2
Image Segmentation
Arbelaez et al. (2008) segment natural images with a seed point inside the object of interest, and Arbelaez et al. (2011) without one. Arbelaez et al. (2011) define a metric between different regions inside the image. Regions are defined as connected pixels. Arbelaez et al. use a metric such that the distance between two regions signals how strong the edge between them is.
2.3
Top-Down Segmentation
Borenstein and Ullman (2002) segment images using training data that represents the edges, and the surroundings of the edges, of object classes. Each pair of an edge and its respective boundary region is called a fragment. For each object class, Borenstein and Ullman (2002) extract the fragments that best represent that class.
Given a new image, Borenstein and Ullman (2002) compare their stored fragments with the image and try to superimpose the fragments, like a jigsaw puzzle. With the fragments in place, they can infer where the edges should be, and therefore segment the image.
2.4
Object Detection from Contour
Shotton et al. (2005) implemented object detection using the contours of images. First, they detect the edges of a set of training images and create a model of the contour of objects by defining the center of the object and creating fragments of the contours of the objects. After creating a model of the possible fragments that represent the contour of the object, the edges of a new image are extracted using the Canny Edge Detector.
Given a new image, they try to find the point that represents the center of the object. The center of the object is chosen such that the distance between the contours of the object representation and the edges of the image being compared is minimal. Then, the distance between the two images is the distance when the center is at that location.
2.4.1
Chamfer Distance
The chamfer distance is calculated as the sum of the distances between each edge pixel in one image and the set of edges in the other. The distance between an edge pixel and the set of edges is equal to the distance from that pixel to the closest edge pixel when the two images are superimposed.
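A minimal sketch of this computation, assuming each edge set is given as a list of (x, y) coordinates in a shared coordinate frame (the brute-force nearest-neighbor search and the example points are for illustration only):

import math

def chamfer_distance(edges_a, edges_b):
    # Sum, over edge pixels of A, of the distance to the closest edge pixel of B.
    total = 0.0
    for (ax, ay) in edges_a:
        total += min(math.hypot(ax - bx, ay - by) for (bx, by) in edges_b)
    return total

# Example with two small, made-up edge sets, in the spirit of Figure 2-2.
print(chamfer_distance([(0, 0), (1, 1)], [(0, 1), (2, 2)]))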
2.5
Related work
The problem of combining top-down information with bottom-up information has been well studied over the last decade. In this overview I focus on the combination of top-down object knowledge with bottom-up grouping cues, i.e., the combination of object detection with image segmentation.
The different approaches to combining them can be divided into three broad lines of work:
Figure 2-2: This is an example of how to calculate the chamfer distance between 2
sets of edges.
2.5.1
Conditional Random Fields
One way to encode segment relations is to use Conditional Random Fields (Kohli and Kumar (2010), Ladicky et al. (2010), Boix et al. (2012), Lucchi et al. (2011)), i.e., neighboring regions impose constraints on a region. Kohli and Kumar (2010) encode the relation between pixels and regions, so that low-level cues bind pixels to one another, encouraging them to respect region coherence; part detections bind to each other according to their correlation of being the same object or not.
Ladicky et al. (2010) model object co-occurrences in a single image with CRFs. Boix et al. (2012) create a generic representation such that the problem becomes one of reducing the total energy. Lucchi et al. (2011) propose a new potential which allows multiple classes to be assigned to each node.
2.5.2
Refining Top-Down Segmentation
This technique can be defined as follows: start by scanning a window object detector, and then refine the image segmentation using that piece of information. Brox et al. (2011) align an object mask to image contours, apply variational smoothing, and assign figure/ground using self-similarity. Yang et al. (2010) use an object detector to estimate the object shape and depth ordering to help with segmentation.
2.5.3
Scoring bottom-up segmentation
This technique can be defined as first generating multiple figure-ground hypotheses by bottom-up processing of the image and then producing the final segmentation by weighted voting for each class of objects. Arbelaez et al. (2012) address the problem of joining together the information of segmentation and Object Detection.
Chapter 3
Usage Examples
In this chapter I describe possible usage examples of the Map-Dictionary Pixel and give an overview of how the Map-Dictionary Pixel is used in the two algorithms that I implemented for this thesis.
The next two sections give a brief overview of the two algorithms that were implemented using the described framework. The core difference between the two algorithms is that the first implements a form of N-cuts for bottom-up segmentation of the image, called the Ultrametric Contour Map, and the second implements a variation of the technique described by Borenstein and Ullman (2002) for top-down segmentation. The first algorithm is called Ultrametric Contour Map and the second is called Matching Fragments.
3.1
Ultrametric Contour Map
The Ultrametric Contour Map can be divided into three main algorithms. The details of their implementation are described in Chapters 4, 5 and 6. The Ultrametric Contour Map uses the Map-Dictionary Pixel to feed the output of one algorithm to the next. Figure 3-1 shows how the algorithms are connected through the use of the Map-Dictionary Pixel. The following list enumerates the three algorithms that are part of the Ultrametric Contour Map algorithm and gives a small overview of each of them.
1. Object Detection: To identify the part of the image that belongs to the object, this algorithm produces a Map-Dictionary with a mask marking the possible locations of the object. In this thesis I implement the Histogram of Oriented Gradients filter for identifying people. The implementation of Histogram of Oriented Gradients filters is discussed in Chapter 4 of this thesis.

2. Bottom-up Segmentation: To separate the image of the object that was found in the Object Detection part, this algorithm produces a set of edges that separates the image into subparts. In this thesis I implement the Ultrametric Contour Map, a variation of the algorithm presented by Arbelaez et al. (2012). The implementation of this algorithm is discussed in Chapter 5 of this thesis. Figure 3-2 shows the input and the output of this algorithm.

3. Subparts Detector: After segmenting the image of the object into subparts, this algorithm determines whether the subparts belong to the object. To identify the subparts I use the Histogram of Oriented Gradients filter detector. I have tested the algorithm on the subparts of a person, i.e., the upperbody, lowerbody and face. The implementation of this algorithm is explained in Chapter 6.
[Diagram: Image → Object Detection → Map-Dictionary Pixel → Bottom-Up Segmentation → Map-Dictionary Pixel → Subparts Detector.]
Figure 3-1: The Ultrametric Contour Map algorithm ties together the 3 algorithms
using the Map-Dictionary Pixel. Each algorithm outputs the Map-Dictionary Pixel
that is processed by the next algorithm in the chain.
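The chaining in Figure 3-1 can be sketched as follows. This is only an illustration: the stage functions detect_objects, segment_bottom_up and detect_subparts are hypothetical names standing in for the algorithms of Chapters 4, 5 and 6, each assumed to take and return a MapDictionary.

def ultrametric_contour_map_pipeline(image, width, height):
    # Start with one empty dictionary per pixel, then let each stage
    # read the previous stage's entries and add its own.
    mdp = MapDictionary(width, height)
    mdp = detect_objects(image, mdp)        # adds the Person entry (Chapter 4)
    mdp = segment_bottom_up(image, mdp)     # adds the Edge entry (Chapter 5)
    mdp = detect_subparts(image, mdp)       # adds Face/UpperBody/LowerBody (Chapter 6)
    return mdp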
3.2
Matching fragments.
The algorithm described in this section, called Matching Fragments, ties together two algorithms. It uses the Histogram of Oriented Gradients filter to detect objects in the same way that the first algorithm does. The output of each algorithm is a Map-Dictionary Pixel, so there is no need to change the implementation of the Histogram of Oriented Gradients filter used in the Ultrametric Contour Map algorithm described previously. The two algorithms forming the Matching Fragments algorithm are listed below:

Figure 3-2: The input and the output of running the Ultrametric Contour Map algorithm. The output is a probability map. The rightmost image represents the probability map. The whiter the pixel, the closer its value is to 1; analogously, the blacker the pixel, the closer its value is to 0.
1. Object Detection: This algorithm is the same as the one presented in the Ultrametric Contour Map algorithm. Its implementation is discussed in Chapter 4.

2. Top-Down Segmentation: This algorithm filters the image by creating a set of fragments that represent an object. The fragments are laid on top of the image by finding the corresponding region with the highest correlation. The implementation of this algorithm is discussed in Chapter 7.
[Diagram: Image → Object Detection → Map-Dictionary Pixel → Top-Down Segmentation.]
Figure 3-3: This image shows how the Map-Dictionary Pixel ties together the 2
presented algorithms, Object Detection and Top-Down Segmentation.
3.3 Other possible usages

3.3.1 3D reconstruction and Object Detection
3D reconstruction algorithms do not use the information given by the classification of objects. The proposed framework can be used to connect the two: the Map-Dictionary Pixel of the 3D reconstruction can be used as an input for creating features for Object Detection. In the same way, the Object Detection Map-Dictionary Pixel can be used as a prior for the 3D reconstruction of the image.
3.3.2
Image Segmentation and 3D reconstruction
Image Segmentation can tell which parts of the image represent edges, and therefore large discontinuities in depth. Therefore, the Map-Dictionary Pixel of Image Segmentation can be used for 3D reconstruction of the image.
Chapter 4
Histogram of Oriented Gradients
filter
Object Detection is one of the most important areas of Computer Vision. Histogram of Oriented Gradients filters give a comprehensive way of approaching the problem. Histogram of Oriented Gradients filters do not have the same accuracy as a human subject, but they are among the best tools for identifying objects. In this chapter I discuss the problem of identifying objects in a single image and explain how Histogram of Oriented Gradients filters work.
The creation of a Histogram of Oriented Gradients filter for a given object can be divided into six parts, which are connected as shown in Figure 4-1.
1. Compute gradients;
2. Orientation Binning;
3. Descriptor Blocks;
4. Contrast normalize over overlapping spatial blocks;
5. Collect Histogram of Oriented Gradients over detection window;
6. Linear Support Vector Machine.
Figure 4-1: The diagram above summarizes how the Histogram of Oriented Gradients
filter is subdivided and how each part is connected to the other.
Orientation binning is done in order to discretize the data points, i.e., the gradients. The descriptor blocks are collected to reduce the impact of noise, for example differences in lighting.
After the creation of the Support Vector Machine, a new data point is classified as an object by running the first five parts of the algorithm and then classifying the new data point using the created Support Vector Machine.
Figure 4-2: Representation of a 9x9 patch in the RGB color channels. Each 9x9 square on the right of the image represents the magnitude of the color in the respective color channel.
4.1
Compute gradients
For computing gradients, the algorithm runs the mask [-1, 0, 1] along the x-axis and the y-axis of each color channel.
[A 3x3 block of red-channel values: 189 163 5 / 189 195 11 / 189 195 112.]
Figure 4-3: The red color channel and the corresponding value block.
Figure 4-4: Multiplying the first matrix with the horizontal mask [1, 0, -1].
There is no need to do Gaussian smoothing of the image, because Gaussian smoothing does not improve the performance of Histogram of Oriented Gradients filters, as shown by Felzenszwalb et al. (2010).
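A minimal sketch of this step, assuming the image is supplied as one 2D NumPy array per color channel (the function and variable names are illustrative, not the thesis code):

import numpy as np

def compute_gradients(channel):
    # channel: one 2D color channel as a NumPy array.
    c = channel.astype(float)
    gx = np.zeros_like(c)
    gy = np.zeros_like(c)
    # Centered differences along x and y, i.e., the [-1, 0, 1] mask.
    gx[:, 1:-1] = c[:, 2:] - c[:, :-2]
    gy[1:-1, :] = c[2:, :] - c[:-2, :]
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    # "Unsigned" orientation in [0, 180) degrees, as used by the binning step.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation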
4.2
Orientation Binning
The algorithm discretizes the space of angles into 9 bins between 0° and 180° ("unsigned" gradient). Figure 4-5 represents the vote of a set of pixels. The weight of the vote of each pixel is equal to the magnitude of the gradient at that pixel. Each pixel is added to a bin according to its proximity to it.
Figure 4-5: Orientation binning is done in 6x6 pixel cells. Each pixel will contribute
to one direction of the gradient. The direction of the pixel at the center of the cell
is determined by voting of the pixels on that cell. In this example, the orientation of
the bin will be down.
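A sketch of the voting step for one cell, assuming the magnitude and orientation arrays from the gradient sketch above (9 unsigned bins of 20° each; interpolation between neighboring bins is omitted for brevity):

import numpy as np

def cell_histogram(magnitude, orientation, num_bins=9):
    # Each pixel votes for the bin containing its orientation,
    # weighted by its gradient magnitude.
    bin_width = 180.0 / num_bins
    hist = np.zeros(num_bins)
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        hist[int(ang // bin_width) % num_bins] += mag
    return hist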
4.3
Descriptor blocks
There is a large variation in the strength of the illumination and a large difference in contrast between the foreground and the background, so it is important to have blocks that integrate information from sets of neighboring pixels. Each set of neighboring pixels is called a block. Each block is normalized to reduce the impact of these differences when identifying objects. In this thesis I use blocks of size 16x16 pixels.
4.4
Contrast normalize over overlapping spatial blocks
The image is represented by a vector of overlapping spatial blocks. The vector is built such that each horizontal row of spatial blocks is part of the vector, and each row is appended to the end of the previous row.
The blocks overlap in an 8x16 region, i.e., each block overlaps with its neighbors such that each pair of overlapping blocks shares half of each block.
The vector is normalized to reduce the impact of the brightness of certain areas when calculating the gradients. Each block vector $v$ is normalized by:

$$v \leftarrow \frac{v}{\sqrt{\|v\|_2^2 + \epsilon^2}}$$
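A one-function sketch of this normalization, assuming the block has already been flattened into a NumPy vector and that the L2 scheme above is the intended one (epsilon is a small constant that avoids division by zero):

import numpy as np

def normalize_block(v, eps=1e-5):
    # L2 normalization of a flattened block of cell histograms.
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)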
4.5
Collect Histogram of Oriented Gradients over
detection window
To detect objects, the algorithm produces these vectors from 64x128-pixel detection windows. To handle objects of different sizes, I sub-sample and super-sample each image. Figure 4-6 shows a 64x128 detection window.
Figure 4-6: The size of the detection window.
4.6
Support Vector Machine
After collecting the data, I train a linear Support Vector Machine on it; new data points are then classified by comparing them against the training data using this linear Support Vector Machine. Figure 4-7 illustrates how a Support Vector Machine separates the training data.
Figure 4-7: This image exemplifies how a Support Vector Machine separates the training data. The blue points are training data classified as not objects and the green points are training data classified as objects.

Each window of 64x128 pixels is represented as a vector $\vec{x}$,
and the output of the vector is represented as a scalar y. The scalar y is such that:
$$y_i = \begin{cases} +1 & \text{if the window is an example of the object,} \\ -1 & \text{otherwise.} \end{cases}$$
The linear Support Vector Machine can be found by solving the following system:

$$\min_{\vec{w},\,b} \; \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1, \quad i = 1, \ldots, n$$

I solved this system using Gaussian elimination. Solving it gives $\vec{w}$ and $b$, which are used to describe the linear Support Vector Machine in the following way:

$$\vec{w} \cdot \vec{x} + b = 0 \tag{4.1}$$
4.6.1 New data points
To classify new images, I run all the previous steps, and then use the formula of the linear Support Vector Machine to classify a new data point $\vec{x}$ as being the object if it satisfies the following formula:

$$\vec{w} \cdot \vec{x} + b > 0$$
Therefore, the output of this algorithm is a set of masks of the regions that are classified as being the object.
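As a sketch of this training and classification step using an off-the-shelf solver in place of the Gaussian-elimination solution described above (scikit-learn's LinearSVC is an assumption of this example, not the thesis implementation):

import numpy as np
from sklearn.svm import LinearSVC

def train_detector(features, labels):
    # features: one HOG descriptor vector per 64x128 training window.
    # labels: +1 for object windows, -1 for non-object windows.
    svm = LinearSVC()
    svm.fit(features, labels)
    return svm

def is_object(svm, descriptor):
    # Equivalent to checking w . x + b > 0 for the learned hyperplane.
    return svm.decision_function(np.asarray(descriptor).reshape(1, -1))[0] > 0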
4.7
Map-Dictionary Pixel
Given an image I(x, y), the algorithm outputs the regions of the image containing the object, i.e., it outputs a mask P(x, y) such that:

$$P(x, y) = \begin{cases} 1 & \text{if the pixel at location } (x, y) \text{ belongs to a person,} \\ 0 & \text{otherwise.} \end{cases}$$

The output Map-Dictionary Pixel of this algorithm is equal to:

MDp(x, y) = {Image:I(x, y), Person:P(x, y)}    (4.2)

Figure 4-8 shows an example of an input image and P(x, y). All the points inside the red rectangle are such that P(x, y) = 1, and all points outside it are such that P(x, y) = 0.
Figure 4-8: The input image and P(x, y). All the points inside the red rectangle are
such that P(x, y) = 1 and all points outside it are such that P(x, y) = 0.
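A sketch of how this output could be written into the Map-Dictionary Pixel structure from Chapter 1; the detector output is assumed here to be a list of rectangles, and the image is assumed to be indexed as image[y][x], both of which are illustrative simplifications:

def hog_output_to_mdp(image, detections, width, height):
    # detections: list of (x0, y0, x1, y1) rectangles returned by the detector.
    mdp = MapDictionary(width, height)
    for x in range(width):
        for y in range(height):
            inside = any(x0 <= x <= x1 and y0 <= y <= y1
                         for (x0, y0, x1, y1) in detections)
            mdp.addDictToEntry({"Image": image[y][x],
                                "Person": 1 if inside else 0}, x, y)
    return mdp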
Chapter 5
Image Segmentation
Image segmentation is one of the essential parts of the algorithm. Image segmentation is the process of partitioning an image into multiple regions. Each region is also known as a super-pixel.
The image segmentation algorithm that I use is the technique presented by Arbelaez et al. (2011). The technique is a variant of Normalized Cuts. Shi and Malik (2000) present the Normalized Cuts algorithm in more detail.
Normalized Cuts is an algorithm that associates a distance metric with every pixel and its neighbors. The algorithm finds cuts that maximize the distance between two regions. The distance between regions is defined as the sum over all pixel pairs that are separated by the cut.
Garey and Johnson (1979) proved that finding the maximum k-cut of a graph is an NP-complete problem. Therefore Arbelaez et al. (2012) transform the edge weights such that finding the maximum k-cut becomes equivalent to finding the minimum k-cut. To find the minimum k-cut, Arbelaez et al. use an approximation algorithm.
5.1
Map-Dictionary Pixel
The input of this algorithm is the Map-Dictionary Pixel given at the end of Chapter
4, i.e.:
MDp(x, y) = {Image:I(x, y), Person:P(x, y)}
(5.1)
Each object is represented as a mask P(x, y). The algorithm described in this chapter uses the Map-Dictionary Pixel to identify all the objects in the image and segment them separately. The advantage of using the Map-Dictionary Pixel is that there is no need to split the input into different images and run the algorithm separately on each one. Instead, the algorithm described in this chapter runs on each region of the image such that P(x, y) = 1.
The Map-Dictionary Pixel is easily augmented. The output of the algorithm is similar to the input, with the addition of the information that the algorithm provides. The output is equal to:
MDp(x, y) = {Image:I(x, y), Person:P(x,y), Edge:E(x, y)},
(5.2)
such that E(x, y) is equal to 0 if P(x, y) is equal to 0.
5.2
Minimum N cuts
Instead of calculating the minimum N cuts directly, the algorithm below finds approximate minimum N-cuts using an approximation algorithm. Figure 5-1 shows an example of the minimum cut for a certain graph.
Figure 5-1: Minimum Cut with sum equal to 10.
The next two sections present how I reimplemented the calculation of the distance between pixels. The first section presents how to calculate the distance between pixels using only local information, i.e., the information from neighboring pixels. The second section presents how to calculate the distance between pixels using the information of the whole image.
5.3 Local Information

5.3.1 Each angle
To calculate the distance $L_r(x, y, \theta)$ at a certain position (x, y), at a certain angle $\theta$ and at a certain radius $r$, I use the histogram difference between the two half-discs of the circle. The formula is:

$$L_r(x, y, \theta) = \frac{1}{2} \sum_i \frac{(g(i) - h(i))^2}{g(i) + h(i)}$$

where $g$ and $h$ are the histograms of the two half-discs.
For each direction $\theta$ we have a function of the distance between neighboring pixels. To smooth out the distance function, I apply a second-order Savitzky-Golay filter along the direction of the angle over the whole image.
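A sketch of the half-disc comparison at a single pixel, assuming the two half-disc histograms have already been extracted (the helper below only implements the chi-squared formula above):

import numpy as np

def half_disc_distance(g, h):
    # Chi-squared distance between the histograms of the two half-discs.
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    denom = g + h
    denom[denom == 0] = 1.0          # avoid division by zero for empty bins
    return 0.5 * np.sum((g - h) ** 2 / denom)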
5.3.2
Local Distance
The local distance for every position (x, y) is the maximum of all the distances for
different angles. The formula is shown below:
$$L(x, y) = \max_{\theta} L_r(x, y, \theta)$$
[Panels: Upper Half-Disc Histogram; Lower Half-Disc Histogram.]
Figure 5-2: Image extracted from work done by Arbelaez et al. (2012) exemplifying
the process of calculating the local distance for a certain angle.
5.4 Global Information

5.4.1 Distance Transformation
The following procedure is an approximation algorithm: instead of finding the maximum N cuts directly, the edge weights are transformed so that neighboring pixels that are far apart in distance have a correspondingly small weight.
Given the function L(x, y) for every position (x, y), the intervening contour cue weight, for a fixed radius r, is defined as:

$$W_{ij} = e^{-\max_{p \in \overline{ij}} L(p)} \tag{5.3}$$

In the implementation of the algorithm I set r equal to 5. The maximum of $e^{-x_1} + \cdots + e^{-x_n}$ is not attained at exactly the same point as the minimum of $x_1 + \cdots + x_n$, but the two sets of solutions are very similar. The function $e^{-x}$ is depicted in the figure below:
44
1(J
I51
-3
-2
-1
1
2
3
Figure 5-3: $e^{-x}$ from -3 to 3.
5.4.2
Each angle
I use the affinity matrix of a graph. The entries of the affinity matrix are calculated
using equation (5.3).
Definition: Given an affinity matrix W whose entries encode the similarity between pixels, one defines the diagonal matrix $D_{ii} = \sum_j W_{ij}$.
To find the minimum N-cuts, the algorithm finds the generalized eigenvectors. The eigenvectors provide information on the axes along which the matrix has the highest variation. The equation for the eigenvectors is the following:

(D − W)v = λDv
(5.4)
After this step, each pixel is represented in the basis of the largest N eigenvectors. To find the approximate minimum N cuts, the pixels are clustered into N regions using k-means clustering.
The above solution usually breaks regions that have smooth gradients, but it provides a very good clue as to where the regions should be. The solution given by Arbelaez et al. (2012) is to run the above algorithm but to separate regions only at boundaries that are not smooth.
The N eigenvectors found from equation 5.4 with the largest eigenvalues $|\lambda|$ represent the axes of highest variability of the points. I pick the largest N = 16 eigenvectors, as Arbelaez et al. (2012) did. The
eigenvectors are combined in the following manner:
$$g_D(x, y, \theta) = \sum_{k=1}^{N} \frac{1}{\sqrt{\lambda_k}} \, \nabla_\theta v_k(x, y) \tag{5.5}$$
For every angle $\theta$ the global distance is:

$$G(x, y, \theta) = L_r(x, y, \theta) + g_D(x, y, \theta)$$

5.4.3 Global Distance
Given $G(x, y, \theta)$ for every angle $\theta$, the global distance for each position (x, y) is:

$$G(x, y) = \max_{\theta} G(x, y, \theta) \tag{5.6}$$

5.5 Oriented Watershed Transform
The watershed transform is calculated given a function value at each location (x, y); in this case I use the global distance G(x, y).
Imagine that the function G(x, y) represents how tall each pixel is. Calculating the watershed transform is then equivalent to placing a drop of water on each pixel and letting it flow to the pixels at the lowest positions. A sketch of this flooding step is given below, and the full code is in Appendix A.
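This is a minimal priority-flood sketch of the plain (unoriented) watershed step, assuming G is a 2D NumPy array of boundary strengths; it illustrates the idea and is not the thesis implementation:

import heapq
import numpy as np

def watershed(G):
    # Label each pixel with the basin it drains to, treating G as a height map.
    h, w = G.shape
    labels = np.zeros((h, w), dtype=int)
    heap = []
    next_label = 0
    # Seed one basin per local minimum (plateaus are handled naively here).
    for y in range(h):
        for x in range(w):
            nbrs = [(y + dy, x + dx) for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            if all(G[y, x] <= G[ny, nx] for ny, nx in nbrs):
                next_label += 1
                labels[y, x] = next_label
                heapq.heappush(heap, (float(G[y, x]), y, x))
    # Flood: always expand from the lowest frontier pixel into unlabeled neighbors.
    while heap:
        _, y, x = heapq.heappop(heap)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]
                heapq.heappush(heap, (float(G[ny, nx]), ny, nx))
    return labels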
Given the watershed transform, to calculate the oriented watershed transform I estimate an orientation at each pixel on an arc from the local geometry of the arc itself. The orientations are obtained by approximating the watershed arcs with line segments.
I subdivide any arc that is not well fit by the line segment connecting its endpoints, until no such arcs remain. Then, I assign each pixel (x, y) on a subdivided arc the orientation $o(x, y) \in [0, \pi)$ of the corresponding line segment.
Next, I use $G(x, y, \theta)$, defined in Section 5.4.2, to assign each arc pixel (x, y) a boundary strength of G(x, y, o(x, y)). I quantize o(x, y) in the same manner as $\theta$, so this operation is a simple lookup. Finally, each original arc is assigned a weight equal to the average boundary strength of the pixels it contains. Comparing the middle-left and far-right panels in Figure 5-4 shows that this reweighting scheme removes artifacts. The output is a function E(x, y) defined at every location (x, y) of the image.
Figure 5-4: Image extracted from [1] exemplifying the process of calculating the Oriented Watershed Transform. The image on the far left represents the function G(x, y). The image on the middle left represents the watershed transform. The image in the middle represents the process of approximating the arcs with line segments. The image on the middle right represents the strength of G(x, y, o(x, y)) for 4 different angles. The image on the far right represents the output.
5.5.1
Map-Dictionary Pixel
The output of the algorithm is described in eq. 5.2 and is equal to:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y)}.    (5.7)
Figure 5-5: The input and the output of running the Ultrametric Contour Map algorithm.
Chapter 6
Histogram of Oriented Gradients
filters for the subparts
The goal of this chapter is to describe the process of how I used the Histogram of
Oriented Gradients filter detector for the subparts of the human body - upperbody,
lowerbody and face - and how I annotated the data for training the Histogram of
Oriented Gradients filter detector described in Chapter 4.
To classify the subparts of the human body I used the Map-Dictionary Pixel
given by the Image Segmentation algorithm described in Chapter 5. The output of
the Image Segmentation algorithm segments the image such that regions that are
distinct are more likely to be separated.
Thus the human subparts - upperbody, lowerbody and face - are very likely to be separated by an edge created by the Image Segmentation algorithm described in Chapter 5. The reason why a human can be separated into these subparts is that people are usually dressed in such a way that the upperbody, the lowerbody and the face each have a consistent color scheme.
The following subsections describe the input of the algorithm and how the algorithm processes the input.
6.0.2
Map-Dictionary Pixel
The input of this algorithm is the output given by the algorithm of Chapter 5, which is equal to:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y)}.    (6.1)

The Image Segmentation algorithm gives the N cuts, and the scale of each cut (edge) encodes how strongly separated the two regions on either side of it are. I therefore set a threshold $\lambda$ such that the chosen subparts are, most of the time, separated. The chosen threshold was $\lambda = 0.7$.
Therefore, a function $f$ returns 1 at a pixel (x, y) if the ultrametric contour map value at that pixel is greater than or equal to $\lambda$, and 0 if it is less than $\lambda$. The formula below summarizes it:

$$f(x, y) = \begin{cases} 1 & \text{if } E(x, y) \geq \lambda, \\ 0 & \text{otherwise.} \end{cases}$$
The Histogram of Oriented Gradients filters for the subparts are run on each segmented part of the image individually, i.e., each Histogram of Oriented Gradients filter only runs through windows inside the subparts generated by the Map-Dictionary Pixel. For each segmented part of the image, the Histogram of Oriented Gradients filter outputs whether that subpart is present. The algorithm outputs the following Map-Dictionary Pixel:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y), Face:F(x, y), LowerBody:L(x, y), UpperBody:U(x, y)}    (6.2)
6.0.3 Algorithm

The Histogram of Oriented Gradients filters for each subpart are run on each segmented part, and if any of the filters detects one of the subparts, the respective segmented part is considered to be of that subpart. For example, if the Histogram of Oriented Gradients filter detects the lowerbody in a segmented area A, then:

$$L(x, y) = 1 \quad \forall (x, y) \in A$$
Figure 6-1 shows an example of a segmented part in which the face Histogram of Oriented Gradients filter detects a face, and an example of a segmented part in which it does not.
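A sketch of this per-segment labeling, assuming segments is a list of pixel-coordinate lists and that detects_face, detects_lowerbody and detects_upperbody stand in for the subpart detectors of this chapter (all of these names are illustrative):

def label_subparts(image, segments, width, height):
    # Run each subpart filter on each segmented part; if a filter fires,
    # mark every pixel of that segment in the corresponding mask.
    F = [[0] * height for _ in range(width)]
    L = [[0] * height for _ in range(width)]
    U = [[0] * height for _ in range(width)]
    for segment in segments:
        hit_f = detects_face(image, segment)
        hit_l = detects_lowerbody(image, segment)
        hit_u = detects_upperbody(image, segment)
        for (x, y) in segment:
            if hit_f: F[x][y] = 1
            if hit_l: L[x][y] = 1
            if hit_u: U[x][y] = 1
    return F, L, U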
Figure 6-1: The leftmost image in this figure shows the raw image. The middle left image shows E(x, y). The middle right image shows a segmented part for which the Histogram of Oriented Gradients filter detects a face. The rightmost image shows a segmented part for which the Histogram of Oriented Gradients filter does not detect a face.
6.1
Training Data
I have annotated 830 faces in the images provided by the PASCAL dataset, as well as 89 lowerbody parts and 155 upperbody parts, totalling 1049 labeled data points. Figure 6-2 shows a set of images used for training the Histogram of Oriented Gradients filter detector for faces. The training data is available in this repository. The training data was annotated using the PASCAL Dataset. The following subsection gives an overview of the PASCAL Dataset.
Figure 6-2: Set of image faces used for training the Histogram of Oriented Gradients
filter model for the human face.
6.1.1
PASCAL Dataset
The PASCAL Dataset is divided into 3 parts:
1. Annotations: The annotations contain information on where and how the subparts are laid out inside an image.
2. Image Sets: The image sets contain the classification of which annotations are
classified as being the object and which annotations are not the object.
3. JPEG Images: This folder contains the images that are annotated.
In the implementation of the algorithm described in this chapter, I created 1049 annotations using the JPEG images given by the PASCAL Dataset. I also created 3 Image Sets describing the data points that do and do not represent the following subparts: lowerbody, upperbody and face.
6.2
Output
The output of this algorithm is the following Map-Dictionary Pixel:
MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y), Face:F(x, y), LowerBody:L(x, y), UpperBody:U(x, y)}.    (6.3)

Given the output of the algorithm, the final image is the union of the Histogram of Oriented Gradients filters of the subparts, so the output is equal to:

$$O(x, y) = \begin{cases} I(x, y) & \text{if } F(x, y) = 1 \text{ or } L(x, y) = 1 \text{ or } U(x, y) = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Figure 6-3 shows the input of the Ultrametric Contour Map algorithm and the output mask O(x, y).
Figure 6-3: The left image is the input of the Ultrametric Contour Map and the right image is the output mask O(x, y) described above. Note that the mask on top of the person is very close to the actual pixels that belong to the person depicted in the image.
Chapter 7
Matching Fragments
In this chapter I discuss the reimplementation of the algorithm developed by Borenstein and Ullman (2002). To achieve that, I first annotated images of parts of a person, and then matched the fragments of the annotated images against new images of a person.
The input of the algorithm is the Map-Dictionary Pixel that is the output of the
Histogram of Oriented Gradients filter described in chapter 4. Thus, the input is:
MDp(x,y) = {Image:I(x,y), Person:P(x,y)}.
(7.1)
The input is easily processed, because the only parts that are analyzed as possibly being part of a person are the ones where P(x, y) = 1. Thus, if P(x, y) = 0 then there are no fragments F(x, y) that belong to the image of a person. Therefore, if P(x, y) = 0 then F(x, y) = 0, and the output of the algorithm is the following Map-Dictionary Pixel:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Fragments:F(x, y)}.    (7.2)

The Fragments represent the pixels that belong to the image of a person. The representation of the output is such that:

$$F(x, y) = \begin{cases} I(x, y) & \text{if } (x, y) \text{ is part of a fragment,} \\ 0 & \text{otherwise.} \end{cases}$$
The following sections describe the training data, the implementation of the algorithm and the output.
7.1
Training Data
The training data consist of a set of pairs of images. Each pair contains a fragment image of the object class and an image of its mask. Fragments and masks are described below:
7.1.1
Fragment
A fragment image is an image of a fixed size, m x n, that contains part of the image of the object. In the implementation of this algorithm I set m = 24 and n = 45. The fragments and their respective masks are extracted manually.
7.1.2
Mask
The mask of an image is an image of the same size as the original one. Each mask image contains zeros and ones. The ones represent the interior of the object and the zeros represent the background. Thus, for every pixel at position (x, y) the mask image M is such that:

$$M(x, y) = \begin{cases} 1 & \text{if the pixel at position } (x, y) \text{ belongs to the object,} \\ 0 & \text{otherwise.} \end{cases}$$
Figure 7-1 shows such a pair: a fragment image and its mask. The fragment image is described later in this chapter.
Figure 7-1: This figure represents a pair of a fragment and its mask. Note that the
mask is black and white, because the mask is either equal to 0 or 1.
7.2
Algorithm
Given the set of pairs of fragments and masks and the input image, I calculate the best matching fragments and their respective positions in the input image. Then, the algorithm lays out the best matching fragments on top of the image, like a jigsaw puzzle.
To calculate the best matching fragments, one needs to calculate the correlation between each fragment and the region of the input image at a certain position (x, y). Figure 7-2 shows an example of the 100 best matching fragments on top of an input image.
Figure 7-2: The first image is the input image and the second image shows the 100
best matching fragments superimposed on top of a black image of the same size. The
third image shows the 100 best matching fragments adding the information of the
textons.
For each input image, given the correlation between two fragments of the same size, the best matching fragments are found using the following algorithm:
best_fragments = []
for f in Fragments:
    maxF = -Inf
    pf = (0, 0)
    # Slide the fragment over every window position (x, y) of the image.
    while not end of the figure:
        v = correlation(I[x : x + delta_x][y : y + delta_y], f)
        if maxF < v:
            maxF = v
            pf = (x, y)
        move(x, y)
    record (f, pf, maxF) in best_fragments
The best matching fragments are the ones that maximize maxF. The correlation between two fragments is calculated using the formula described in the next subsection.
7.2.1
Correlation
The distance between two fragments of the same size is calculated by first changing the representation of the image from the RGB color space to the LAB color space. Then, the distance between two fragments, $f_1$ and $f_2$, is calculated by clustering the color space and then computing the distance between the histograms of the fragments:

$$d(f_1, f_2) = \frac{1}{2} \cdot \frac{\sum_{k=1}^{K} (f_1(k) - f_2(k))^2}{\sum_{k=1}^{K} (f_1(k) + f_2(k))}$$

To cluster the color space, I use k-means clustering with K = 32.
Example

Given the two histograms shown in Figure 7-3, the difference between the two fragments is equal to:

$$\frac{1}{2} \times \frac{(2-2)^2 + (4-3)^2 + (4-2)^2 + (3-0)^2 + (0-1)^2 + (1-4)^2}{(2+2) + (4+3) + (4+2) + (3+0) + (0+1) + (1+4)} = \frac{1}{2} \times \frac{0+1+4+9+1+9}{4+7+6+3+1+5} = \frac{1}{2} \times \frac{24}{26} = \frac{12}{26}$$
TP.Aw1
Tetn
eln
Txw
etn
Tim6
Texion1
Textm 2 Texa0,3 Texio4
Twon 0Tex0mi
Figure 7-3: Two different histograms representing how many textons of each type are
in each region.
1
24
12
2
26
26
The new distance is the weighted sum of the distance in the LAB color space and the distance between the textons of the pixels.
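The weight of this sum is not specified here, so the line below only illustrates the combination with a placeholder weight alpha.

def combined_distance(d_lab, d_texton, alpha=0.5):
    # alpha is a placeholder weight; the thesis does not state its value
    return alpha * d_lab + (1 - alpha) * d_texton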
7.3
Textons
To improve the algorithm, I added textons as additional information for comparing different fragments. Textons help when representing regions of the same kind of material: the distance between textons is a good indicator of different materials, and thus helps to separate different objects.
To calculate the textons, each pixel p is assigned a corresponding vector v with entries

\[
v_i = p * F_i,
\]

where F_i is the i-th filter of a filter bank and v_i is the response of the image to that filter at p. In the implementation of this algorithm I used the Leung-Malik (LM) filter bank, which is described in the next subsection. The textons are then obtained by clustering these vectors into 32 clusters with the k-means clustering algorithm. Figure 7-4 shows an image and the corresponding textons.
Figure 7-4: An image and the corresponding textons.
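A minimal sketch of this texton step with off-the-shelf tools is shown below; it assumes a grayscale image and a list of filter kernels (for example the LM bank of the next subsection), and it uses scikit-learn's KMeans, which may differ from the clustering implementation actually used in the thesis.

import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans

def compute_textons(gray_image, filter_bank, n_textons=32):
    # One response per filter for each pixel: v_i = p * F_i.
    responses = np.stack(
        [convolve(gray_image, f, mode="reflect") for f in filter_bank],
        axis=-1,
    )
    H, W, D = responses.shape
    # Cluster the response vectors into n_textons groups; each pixel's
    # label is its texton.
    labels = KMeans(n_clusters=n_textons, n_init=10).fit_predict(
        responses.reshape(-1, D)
    )
    return labels.reshape(H, W)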
7.3.1
Filter Bank
In the implementation of this algorithm I used the Leung-Malik (LM) Filter Bank. The Leung-Malik filter bank is a multi-scale, multi-orientation filter bank containing 48 filters. The code to generate the Leung-Malik (LM) Filter Bank is located in the Appendix.
The Leung-Malik Filter Bank contains first and second derivatives of Gaussians at 6 orientations and 3 scales, 8 Laplacian of Gaussian filters, and 4 Gaussians.
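These counts are consistent with the stated total of 48 filters: 6 orientations × 3 scales × 2 derivative orders gives 36 oriented filters, and 36 + 8 + 4 = 48.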
Figure 7-5: The representation of the LM Filter Bank.
7.4
Output
The output of this algorithm is the following Map-Dictionary Pixel:

MDp(x, y) = {Image: I(x, y), Person: P(x, y), Mask: M(x, y)}.    (7.3)

The output image, O, of the person can be calculated in the following way:

\[
O(x, y) = \begin{cases} I(x, y) & \text{if } P(x, y) = 1 \text{ and } M(x, y) = 1, \\ 0 & \text{otherwise.} \end{cases}
\]
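As a small illustration, the output image O can be obtained from the arrays in this Map-Dictionary Pixel with a few lines of NumPy, assuming P and M are stored as binary arrays of the same height and width as I; this is a sketch, not the thesis code.

import numpy as np

def person_output(I, P, M):
    # O(x, y) = I(x, y) where both P and M are 1, and 0 elsewhere.
    keep = (P == 1) & (M == 1)
    return I * keep[..., np.newaxis]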
Chapter 8
Conclusion
In this chapter I review the goals of the thesis, what has been done to achieve these goals, and the contributions of this work.
The goal of computer vision systems is to simulate the human visual system. In order to do that, a computer vision system is divided into different parts, and many algorithms focus on a single specific question. Examples of questions that individual algorithms try to answer are:
1. Where is a certain object located in an image?
2. What are the subparts of an image?
3. How can we find the distance between the points depicted in the image and the camera?
Each question is answered by a single class of algorithms. The algorithms that answer these questions are, respectively, Object Detection, Image Segmentation, and 3D Reconstruction. The presented framework integrates these individual algorithms in a simple way and therefore empowers them.
8.1
Contributions
In this thesis, I make the following contributions:

* I developed the Map-Dictionary Pixel framework for passing information between various computer vision algorithms.

* I demonstrated the value of the Map-Dictionary Pixel framework by implementing two human-detecting algorithms, the Ultrametric Contour Map algorithm and the Matching Fragments algorithm, and connecting them using that framework.

* I created training data for modeling human faces, upper bodies, and lower bodies.
Appendix A
Code
This appendix includes code for algorithms described in previous chapters. The code is organized by the chapter in which each algorithm is discussed.
A.0.1
Chapter 5
The following code shows how to calculate the Oriented Watershed Transform:
for pixel in pixels:
    output(pixel) = 1
still_dropping = true
while still_dropping:
    still_dropping = false
    for pixel in pixels:
        neighbors = get_neighbors(pixel)
        number_of_lower_neighbors = get_number_of_lower_neighbors(pixel)
        if number_of_lower_neighbors == 0:
            continue  # local minima keep their accumulated water
        for neighbor in neighbors:
            if G(neighbor) < G(pixel):
                still_dropping = true
                output(neighbor) += output(pixel) / number_of_lower_neighbors
        output(pixel) = 0
return output
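For reference, here is a runnable NumPy sketch of the same water-dropping idea, assuming G is a 2D array of gradient magnitudes and using 4-connected neighbors; it illustrates the pseudocode above rather than reproducing the thesis implementation.

import numpy as np

def drop_water(G, max_iters=100):
    # Each pixel starts with one unit of water, which repeatedly flows
    # to its strictly lower neighbors until everything sits in minima.
    H, W = G.shape
    output = np.ones((H, W))
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connected neighbors
    for _ in range(max_iters):
        still_dropping = False
        new_output = np.zeros_like(output)
        for y in range(H):
            for x in range(W):
                lower = [(y + dy, x + dx) for dy, dx in offsets
                         if 0 <= y + dy < H and 0 <= x + dx < W
                         and G[y + dy, x + dx] < G[y, x]]
                if lower and output[y, x] > 0:
                    still_dropping = True
                    share = output[y, x] / len(lower)
                    for ny, nx in lower:
                        new_output[ny, nx] += share
                else:
                    new_output[y, x] += output[y, x]
        output = new_output
        if not still_dropping:
            break
    return output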
A.0.2
Chapter 7
The following code is the MATLAB code to generate the Leung-Malik Filter Bank.
function F=makeLMfilters
% Returns the LML filter bank of size 49x49x48 in F. To convolve an
% image I with the filter bank you can either use the matlab function
% conv2, i.e. responses(:,:,i)=conv2(I,F(:,:,i),'valid'), or use the
% Fourier transform.

  SUP=49;                 % Support of the largest filter (must be odd)
  SCALEX=sqrt(2).^[1:3];  % Sigma_{x} for the oriented filters
  NORIENT=6;              % Number of orientations

  NROTINV=12;
  NBAR=length(SCALEX)*NORIENT;
  NEDGE=length(SCALEX)*NORIENT;
  NF=NBAR+NEDGE+NROTINV;
  F=zeros(SUP,SUP,NF);
  hsup=(SUP-1)/2;
  [x,y]=meshgrid([-hsup:hsup],[hsup:-1:-hsup]);
  orgpts=[x(:) y(:)]';

  count=1;
  for scale=1:length(SCALEX),
    for orient=0:NORIENT-1,
      angle=pi*orient/NORIENT;  % Not 2pi as filters have symmetry
      c=cos(angle);s=sin(angle);
      rotpts=[c -s;s c]*orgpts;
      F(:,:,count)=makefilter(SCALEX(scale),0,1,rotpts,SUP);
      F(:,:,count+NEDGE)=makefilter(SCALEX(scale),0,2,rotpts,SUP);
      count=count+1;
    end;
  end;

  count=NBAR+NEDGE+1;
  SCALES=sqrt(2).^[1:4];
  for i=1:length(SCALES),
    F(:,:,count)=normalise(fspecial('gaussian',SUP,SCALES(i)));
    F(:,:,count+1)=normalise(fspecial('log',SUP,SCALES(i)));
    F(:,:,count+2)=normalise(fspecial('log',SUP,3*SCALES(i)));
    count=count+3;
  end;
return

function f=makefilter(scale,phasex,phasey,pts,sup)
  gx=gauss1d(3*scale,0,pts(1,:),phasex);
  gy=gauss1d(scale,0,pts(2,:),phasey);
  f=normalise(reshape(gx.*gy,sup,sup));
return