Integrating Bottom-up and Top-Down Information

by

Giuliano Pezzolo Giacaglia

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 17, 2014

Certified by: Patrick Henry Winston, Ford Professor of Artificial Intelligence and Computer Science, Thesis Supervisor

Accepted by: Albert R. Meyer, Chairman, Department Committee on Graduate Theses
Integrating Bottom-up and Top-Down Information
by
Giuliano Pezzolo Giacaglia
Submitted to the Department of Electrical Engineering and Computer Science
on January 17, 2014, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Computer Science and Engineering
Abstract
In this thesis I present a framework for integrating bottom-up and top-down computer
vision algorithms. I developed this framework, which I call the Map-Dictionary Pixel
framework, because my intuition is that there is a need for tools that make it easier
to build computer vision systems that mimic the way human visual systems process
information. In particular, we humans create models of objects around us,
and we use these models, top-down, to interpret, analyze and discern objects in the
information that comes bottom-up from the visual world.
After introducing my Map-Dictionary Pixel framework, I demonstrate how it empowers computer vision algorithms. I implement two different systems that extract
the pixels of the image that correspond to a human. Even though each system uses
different sets of algorithms, both use the Map-Dictionary Pixel framework as the connecting pipeline. The two implementations demonstrate the utility of the Map-Dictionary
Pixel framework and provide an example of how it can be used.
Thesis Supervisor: Patrick Henry Winston
Title: Ford Professor of Artificial Intelligence and Computer Science
Acknowledgments
If I were to describe my life in one word, that word would be hustle. Urban Dictionary defines hustle as "To do things to get closer to the point you want to get." I've always worked really hard to achieve my aspirations. When I was in middle school I hustled to get a scholarship to the best high school in Brazil. When I was in high school I hustled to increase my scholarship, trying to be among the best mathematicians of my age in Brazil. In college in Brazil I hustled to survive the hazing and military training. A semester later at MIT I hustled to pay for food and living.
I was really fortunate to have all these opportunities and I know that there are a
lot of people that want to be in the same place but didn't have the same opportunities.
Jay-Z in "American Dreamin" explains it well:
"Mama forgive me, should be thinking about Harvard
But that's too far away, niggas are starving
Ain't nothing wrong with my aim, just gotta change the target
I got dreams of bagging snidd-ow, the size of pillows
I see pies every time my eyes clidd-ose
I see rides, sixes, I got to get those
Life's a bitch, I hope to not make her a widow"
MIT was especially hard. It was hard not only because of the endless psets and labs, the midterms and projects, but especially because I had to do all of this while worrying about bills. Sometimes there wouldn't be money for lunch.
I survived. I did it only with the help of my friends and family. I'm extremely thankful to them. I dedicate all my work to them.
To my Mom, Leticia Maria Pezzolo Giacaglia, for all her love.
To my dad for all the knowledge and for being an example of honesty and good work ethic. To my brother for always standing by my side.
I want to thank my friends Aldo and Beneah, who kept me sane and going. Sometimes we would go partying on the weekend nights so we could forget about the
problems of the past week and the problems that we would encounter the next day.
To all my friends in the Genesis group, Sam, Hiba, Nikola, Igor, Adam, Matthew, Sila, Mark, and Dylan, for great discussions that improved this work. But most of all, for all our discussions about other random things that made my experience so great.
To Rafa, Bujao and Pond, for helping me transition to this country.
To Professor Winston, for being an example of love and care for his students.
This research was supported, in part, by the Defense Advanced Research Projects
Agency, Award Number Mind's Eye: W911NF-10-2-0065.
Contents
1 Introduction
  1.1 Vision
  1.2 Problem Statement
  1.3 Map-Dictionary Pixel
      1.3.1 Implementation
  1.4 Organization

2 Previous Work
  2.1 Object Detection
      2.1.1 Histogram of Oriented Gradients Detection
  2.2 Image Segmentation
  2.3 Top-Down Segmentation
  2.4 Object Detection from Contour
      2.4.1 Chamfer Distance
  2.5 Related Work
      2.5.1 Conditional Random Fields
      2.5.2 Refining Top-Down Segmentation
      2.5.3 Scoring Bottom-up Segmentation

3 Usage Examples
  3.1 Ultrametric Contour Map
  3.2 Matching Fragments
  3.3 Other Possible Usages
      3.3.1 3D Reconstruction and Object Detection
      3.3.2 Image Segmentation and 3D Reconstruction

4 Histogram of Oriented Gradients Filter
  4.1 Compute Gradients
  4.2 Orientation Binning
  4.3 Descriptor Blocks
  4.4 Contrast Normalize over Overlapping Spatial Blocks
  4.5 Collect Histogram of Oriented Gradients over Detection Window
  4.6 Support Vector Machine
      4.6.1 New Data Points
  4.7 Map-Dictionary Pixel

5 Image Segmentation
  5.1 Map-Dictionary Pixel
  5.2 Minimum N Cuts
  5.3 Local Information
      5.3.1 Each Angle
      5.3.2 Local Distance
  5.4 Global Information
      5.4.1 Distance Transformation
      5.4.2 Each Angle
      5.4.3 Global Distance
  5.5 Oriented Watershed Transform
      5.5.1 Map-Dictionary Pixel

6 Histogram of Oriented Gradients Filters for the Subparts
      6.0.2 Map-Dictionary Pixel
      6.0.3 Algorithm
  6.1 Training Data
      6.1.1 PASCAL Dataset
  6.2 Output

7 Matching Fragments
  7.1 Training Data
      7.1.1 Fragment
      7.1.2 Mask
  7.2 Algorithm
      7.2.1 Correlation
  7.3 Textons
      7.3.1 Filter Bank
  7.4 Output

8 Conclusion
  8.1 Contributions

A Code
  A.0.1 Chapter 5
  A.0.2 Chapter 7
List of Figures
1-1 An example of an image in which the human visual system would have trouble identifying different objects (people) if it did not have a model of the objects, i.e., if it did not use Top-Down Information.
1-2 Input image and output image of the Ultrametric Contour Map algorithm.
1-3 The input and output of a second algorithm using the proposed framework.
1-4 3D reconstruction; the algorithm that generates the output data cannot easily be used in other algorithms.
1-5 What a Map-Dictionary Pixel looks like; the dictionary contained in the upper-right pixel is shown.
1-6 How the Map-Dictionary Pixel ties together with the algorithms.
2-2 An example of how to calculate the chamfer distance between two sets of edges.
3-1 The Ultrametric Contour Map algorithm ties together three algorithms using the Map-Dictionary Pixel; each algorithm outputs the Map-Dictionary Pixel that is processed by the next algorithm in the chain.
3-2 The input and output of running the Ultrametric Contour Map algorithm; the output is a probability map.
3-3 How the Map-Dictionary Pixel ties together the two presented algorithms, Object Detection and Top-Down Segmentation.
4-1 How the Histogram of Oriented Gradients filter is subdivided and how each part is connected to the others.
4-2 Representation of a 9x9 patch in the RGB color channels.
4-3 The red color channel and the corresponding value block.
4-4 Multiplying the first matrix with the horizontal mask [1, 0, -1].
4-5 Orientation binning in 6x6 pixel cells.
4-6 The size of the detection window.
4-7 How a Support Vector Machine separates the training data.
4-8 The input image and P(x, y).
5-1 Minimum cut with sum equal to 10.
5-2 The process of calculating the local distance for a certain angle (from Arbelaez et al., 2012).
5-3 e^{-x} from -3 to 3.
5-4 The process of calculating the Oriented Watershed Transform (from [1]).
5-5 The input and output of running the Ultrametric Contour Map algorithm.
6-1 A raw image, E(x, y), a segmented part in which the Histogram of Oriented Gradients filter detects a face, and one in which it does not.
6-2 Set of face images used for training the Histogram of Oriented Gradients filter model for the human face.
6-3 The input of the Ultrametric Contour Map and the output mask.
7-1 A pair of a fragment and its mask.
7-2 An input image, the 100 best matching fragments, and the 100 best matching fragments using texton information.
7-3 Two different histograms representing how many textons of each type are in each region.
7-4 An image and the corresponding textons.
7-5 The representation of the LM Filter Bank.
Chapter 1
Introduction
The human visual system is a very complex and fascinating system. It is a challenge to create a computer system that achieves the same results as the human visual system does. The human visual system has the ability to interpret a large array of data in a short period of time. Moreover, to accomplish the many tasks it is in charge of, the human visual system is engineered in a different way than computer systems usually are. Most of the information that flows between the eyes and the brain flows from the brain to the eyes. The human visual system uses prior information to identify objects.
Figure 1-1: An example of an image in which the human visual system would have trouble identifying different objects (people) if it did not have a model of the objects, i.e., if it did not use Top-Down Information.
Figure 1-1 shows an example of an image that demonstrates how the brain creates models of objects and uses that information to interpret images.
To solve the problem of integrating the prior information that the brain gathers, I define two types of information: Bottom-Up and Top-Down Information. Top-Down Information is the information contained in the models of objects and high-level structures. Bottom-Up Information is the information given only by the image. This thesis creates a framework to integrate the two types of information easily.
1.1
Vision
One example of an implementation of the presented framework is an algorithm that creates a mask on a single image; the mask is created on top of the humans that appear in that image. Figure 1-2 presents an example of this implementation.
The leftmost image in Figure 1-2 represents the input of the algorithm with the Histogram of Oriented Gradients filter on top of it. The image on the right in Figure 1-2 represents the output of the first implementation of the presented framework. Figure 1-3 shows another pair of input and output images for the first implementation of the presented framework.
Figure 1-2: Input Image and Output Image of the Ultrametric Contour Map Algorithm.
The problem is how to integrate these two types of information in a simple model. This model should be simple enough that different types of algorithms, created separately, can be joined together easily.

Figure 1-3: These images represent the input and output of a second algorithm using the proposed framework.
1.2
Problem Statement
Computer Vision has been divided into many single tasks that are solved in isolation. In this thesis I am only concerned with tasks that relate to single images. These tasks include Object Detection, Image Segmentation, Contour Extraction, and 3D reconstruction, among others.
Each single task is handled separately. The question is how to integrate different algorithms into a single system, or how to transfer the output or knowledge from one algorithm to another easily, without modifying each of the algorithms.
Figure 1-4: This image depicts 3D reconstruction. The algorithm that generates the output data cannot easily be used in other algorithms.
[Diagram: a grid of Map-Dictionary Pixels, where the dictionary of the highlighted pixel is {Image: [1,3,4], Person: 0.3, Edge: 0.5}.]
Figure 1-5: This image represents what a Map-Dictionary Pixel looks like. The
dictionary that is contained in the pixel at the most upper-right position is represented
in this image.
1.3
Map-Dictionary Pixel
Map-Dictionary Pixel addresses the problem of integrating top-down information and bottom-up information. The Map-Dictionary Pixel can be used for detecting objects, segmenting images, finding contours of objects inside a single image, and other applications. In this thesis I present two implementations of the presented framework.
Each algorithm can be thought of as a black box; the Map-Dictionary Pixel is the representation of the input and the output of each algorithm. No algorithm has any information about the other algorithms, except that they are connected using the Map-Dictionary Pixel.
Definition: A Map-Dictionary Pixel is a map between each pixel in an image and a dictionary. Each entry in the dictionary contains information encoding the image and encoding the outputs of the algorithms.
[Diagram: Map-Dictionary Pixel → Algorithm → Output.]
Figure 1-6: This image shows a representation of how Map-Dictionary Pixel ties
together with the algorithms.
1.3.1
Implementation
The implementation of the Map-Dictionary Pixel class is shown below:

class MapDictionary:
    def __init__(self, m, n):
        # One empty dictionary per pixel of an m x n image.
        self.matrix = [[{} for y in range(n)] for x in range(m)]

    def addDictToEntry(self, dict, i, j):
        # Store the dictionary produced by an algorithm at pixel (i, j).
        self.matrix[i][j] = dict

    def getDictAtEntry(self, i, j):
        return self.matrix[i][j]
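As a usage sketch (the image size, pixel coordinates, and dictionary entries below are made up for illustration), an algorithm reads the dictionary at a pixel, adds its own entry, and writes it back:

# Hypothetical usage of the MapDictionary class above.
mdp = MapDictionary(640, 480)      # one dictionary per pixel of a 640x480 image
d = mdp.getDictAtEntry(10, 20)     # read whatever earlier algorithms stored
d["Person"] = 1                    # add this algorithm's output for this pixel
mdp.addDictToEntry(d, 10, 20)      # write the augmented dictionary back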
1.4
Organization
In Chapter 2 I present the previous work in each area. In Chapter 3 I give an overview of the possible use cases of the framework. In Chapter 4 I explain how Histogram of Oriented Gradients filters work in detail and how I reimplemented them. In Chapter 5 the bottom-up segmentation algorithm is explained in detail. In Chapter 6 I present how I implemented Histogram of Oriented Gradients filters for the face, lower body and upper body. In Chapter 7 I present the implementation of the top-down segmentation of the image using the model of an object. In Chapter 8 I present the contributions of the work presented in this thesis.
Chapter 2
Previous Work
In this chapter I describe the most influential and state-of-the-art work on Object Detection and Image Segmentation. I do not intend to offer a comprehensive summary of all the approaches to Object Detection and Image Segmentation, but instead I summarize the most important results in this area.
Object Detection and Image Segmentation are important areas of Computer Vision. Most of the work deals with each problem separately, but object detection can help segmentation and vice versa.
2.1
Object Detection
Object Detection is one of the most important areas of Computer Vision, because detecting objects in images has many applications.
The most famous work in the area is the Histogram of Oriented Gradients. Histogram of Oriented Gradients detection uses gradients in images to create features for object detection. The reason behind using gradients is that gradients give a good cue for shapes. The reason behind using histograms is that histograms are a good way of representing areas of images and comparing the respective areas.
2.1.1
Histogram of Oriented Gradients Detection
Work done by Dalal and Triggs (2005) has influenced most of Histogram of Oriented
Gradients detection.
The most significant improvement on top of Dalal and Triggs (2005) is the work done by Felzenszwalb (2010), which improves the algorithm by creating a root filter and part filters for subparts of the object. Deformation and mixture models help to create a multi-view representation of a given object.
2.2
Image Segmentation
Arbelaez et al. (2008) segment natural images with a seed point inside the object of interest, and Arbelaez et al. (2011) without one. Arbelaez et al. (2011) define a metric between different regions inside the image. Regions are defined as connected pixels. Arbelaez et al. use a metric such that the distance between two regions signals how strong the edge between them is.
2.3
Top-Down Segmentation
Borenstein and Ullman (2002) segment images using training data that represents the edges, and the surroundings of the edges, of object classes. Each pair of an edge and its respective boundary region is called a fragment. For each object class, Borenstein and Ullman (2002) extract the fragments that best represent that class.
Given a new image, Borenstein and Ullman (2002) compare their stored fragments with the image and try to superimpose the fragments, like a jigsaw puzzle. With the fragments in place, they can infer where the edges should be, and therefore segment the image.
2.4
Object Detection from Contour
Shotton et al. (2005) implemented object detection using the contours of images. First, they detect the edges of a set of training images and create a model of the contour of objects by defining the center of the object and creating fragments of the contours of the objects. After creating a model of the possible fragments that represent the contour of the object, the edges of a new image are extracted using the Canny Edge Detector.
Given a new image, they try to find the point that represents the center of the object. The center of the object is chosen such that the distance between the contours of the object representation and the edges of the image being compared is minimal. Then, the distance between the two images is the distance when the center is at that location.
2.4.1
Chamfer Distance
The chamfer distance is calculated as the sum of the distances between each edge pixel in one image and the set of edges in the other. The distance between an edge pixel and the set of edges is equal to the distance from that pixel to the closest edge pixel when the two images are superimposed.
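A minimal sketch of this computation, assuming each edge set is given as a list of (x, y) coordinates in a shared coordinate frame (the brute-force nearest-neighbor search and the example points are for illustration only):

import math

def chamfer_distance(edges_a, edges_b):
    # Sum, over edge pixels of A, of the distance to the closest edge pixel of B.
    total = 0.0
    for (ax, ay) in edges_a:
        total += min(math.hypot(ax - bx, ay - by) for (bx, by) in edges_b)
    return total

# Example with two small, made-up edge sets, in the spirit of Figure 2-2.
print(chamfer_distance([(0, 0), (1, 1)], [(0, 1), (2, 2)]))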
2.5
Related work
The problem of combining top-down information with bottom-up information has been well studied over the last decade. In this overview I focus on the combination of top-down object knowledge with bottom-up grouping cues, i.e., the combination of object detection with image segmentation.
The different approaches to combining them can be divided into three broad lines of work:
Figure 2-2: This is an example of how to calculate the chamfer distance between 2
sets of edges.
2.5.1
Conditional Random Fields
One way to encode segment relations is to use Conditional Random Fields (Kohli and Kumar (2010), Ladicky et al. (2010), Boix et al. (2012), Lucchi et al. (2011)), i.e., neighboring regions impose constraints on a region. Kohli and Kumar (2010) encode the relation between pixels and regions, so that low-level cues bind pixels to one another, encouraging them to respect region coherence; part detections bind to each other according to their correlation of being the same object or not.
Ladicky et al. (2010) model object co-occurrences in a single image with CRFs. Boix et al. (2012) create a generic representation such that the problem becomes one of reducing the total energy. Lucchi et al. (2011) propose a new potential which allows multiple classes to be assigned to each node.
2.5.2
Refining Top-Down Segmentation
This technique can be defined as follows: start by scanning a window object detector, and then refine the image segmentation using that piece of information. Brox et al. (2011) align an object mask to image contours, apply variational smoothing, and assign figure/ground using self-similarity. Yang et al. (2010) use an object detector to estimate the object shape and depth ordering to help with segmentation.
2.5.3
Scoring bottom-up segmentation
This technique can be defined as first generating multiple figure-ground hypotheses by bottom-up processing of the image and then producing the final segmentation by weighted voting for each class of objects. Arbelaez et al. (2012) address the problem of joining together the information of segmentation and Object Detection.
Chapter 3
Usage Examples
In this chapter I describe possible usage examples of the Map-Dictionary Pixel and give an overview of how the Map-Dictionary Pixel is used in the two algorithms that I implemented for this thesis.
The next two sections give a brief overview of the two algorithms that were implemented using the described framework. The core difference between the two algorithms is that the first implements a form of N-cuts for bottom-up segmentation of the image, called the Ultrametric Contour Map, and the second implements a variation of the technique described by Borenstein and Ullman (2002) for top-down segmentation. The first algorithm is called Ultrametric Contour Map and the second is called Matching Fragments.
3.1
Ultrametric Contour Map
The Ultrametric Contour Map can be divided into three main algorithms. The details of their implementation are described in Chapters 4, 5 and 6. The Ultrametric Contour Map uses the Map-Dictionary Pixel to feed the output of one algorithm to the next. Figure 3-1 shows how the algorithms are connected through the use of the Map-Dictionary Pixel. The following list enumerates the three algorithms that are part of the Ultrametric Contour Map algorithm and gives a small overview of each of them.
1. Object Detection: To identify the part of the image that belongs to the object, this algorithm produces a Map-Dictionary with a mask marking the possible locations of the object. In this thesis I implement the Histogram of Oriented Gradients filter for identifying people. The implementation of Histogram of Oriented Gradients filters is discussed in Chapter 4 of this thesis.

2. Bottom-up Segmentation: To separate the image of the object that was found in the Object Detection part, this algorithm produces a set of edges that separates the image into subparts. In this thesis I implement the Ultrametric Contour Map, a variation of the algorithm presented by Arbelaez et al. (2012). The implementation of this algorithm is discussed in Chapter 5 of this thesis. Figure 3-2 shows the input and the output of this algorithm.

3. Subparts Detector: After segmenting the image of the object into subparts, this algorithm determines whether the subparts belong to the object. To identify the subparts I use the Histogram of Oriented Gradients filter detector. I have tested the algorithm on the subparts of a person, i.e., the upperbody, lowerbody and face. The implementation of this algorithm is explained in Chapter 6.
[Diagram: Image → Object Detection → Map-Dictionary Pixel → Bottom-Up Segmentation → Map-Dictionary Pixel → Subparts Detector.]
Figure 3-1: The Ultrametric Contour Map algorithm ties together the 3 algorithms
using the Map-Dictionary Pixel. Each algorithm outputs the Map-Dictionary Pixel
that is processed by the next algorithm in the chain.
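The chaining in Figure 3-1 can be sketched as follows. This is only an illustration: the stage functions detect_objects, segment_bottom_up and detect_subparts are hypothetical names standing in for the algorithms of Chapters 4, 5 and 6, each assumed to take and return a MapDictionary.

def ultrametric_contour_map_pipeline(image, width, height):
    # Start with one empty dictionary per pixel, then let each stage
    # read the previous stage's entries and add its own.
    mdp = MapDictionary(width, height)
    mdp = detect_objects(image, mdp)        # adds the Person entry (Chapter 4)
    mdp = segment_bottom_up(image, mdp)     # adds the Edge entry (Chapter 5)
    mdp = detect_subparts(image, mdp)       # adds Face/UpperBody/LowerBody (Chapter 6)
    return mdp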
3.2
Matching fragments.
The algorithm described in this section, called Matching Fragments, ties together two algorithms. It uses the Histogram of Oriented Gradients filter to detect objects in the same way that the first algorithm does. The output of each algorithm is a Map-Dictionary Pixel, so there is no need to change the implementation of the Histogram of Oriented Gradients filter used in the Ultrametric Contour Map algorithm described previously. The two algorithms forming the Matching Fragments algorithm are listed below:

Figure 3-2: The input and the output of running the Ultrametric Contour Map algorithm. The output is a probability map. The rightmost image represents the probability map. The whiter the pixel, the closer its value is to 1; analogously, the blacker the pixel, the closer its value is to 0.
1. Object Detection: This algorithm is the same as the one presented in the Ultrametric Contour Map algorithm. Its implementation is discussed in Chapter 4.

2. Top-Down Segmentation: This algorithm filters the image by creating a set of fragments that represent an object. The fragments are laid on top of the image by finding the corresponding region with the highest correlation. The implementation of this algorithm is discussed in Chapter 7.
[Diagram: Image → Object Detection → Map-Dictionary Pixel → Top-Down Segmentation.]
Figure 3-3: This image shows how the Map-Dictionary Pixel ties together the 2
presented algorithms, Object Detection and Top-Down Segmentation.
3.3 Other possible usages

3.3.1 3D reconstruction and Object Detection
3D reconstruction algorithms do not use the information given by the classification of objects. The proposed framework can be used to connect the two: the Map-Dictionary Pixel of the 3D reconstruction can be used as an input for creating features for Object Detection. In the same way, the Object Detection Map-Dictionary Pixel can be used as a prior for the 3D reconstruction of the image.
3.3.2
Image Segmentation and 3D reconstruction
Image Segmentation can tell which parts of the image represent edges, and therefore large discontinuities in depth. Therefore, the Map-Dictionary Pixel of Image Segmentation can be used for 3D reconstruction of the image.
Chapter 4
Histogram of Oriented Gradients
filter
Object Detection is one of the most important areas of Computer Vision. Histogram of Oriented Gradients filters give a comprehensive way of approaching the problem. Histogram of Oriented Gradients filters do not have the same accuracy as a human subject, but they are among the best tools for identifying objects. In this chapter I discuss the problem of identifying objects in a single image and explain how Histogram of Oriented Gradients filters work.
The creation of a Histogram of Oriented Gradients filter for a given object can be divided into six parts, which are connected as shown in Figure 4-1.
1. Compute gradients;
2. Orientation Binning;
3. Descriptor Blocks;
4. Contrast normalize over overlapping spatial blocks;
5. Collect Histogram of Oriented Gradients over detection window;
6. Linear Support Vector Machine.
Figure 4-1: The diagram above summarizes how the Histogram of Oriented Gradients
filter is subdivided and how each part is connected to the other.
Orientation binning is done in order to discretize the data points, i.e., the gradients. The descriptor blocks are collected to reduce the impact of noise, for example differences in lighting.
After the creation of the Support Vector Machine, a new data point is classified as an object by running the first five parts of the algorithm and then classifying the new data point using the created Support Vector Machine.
Figure 4-2: Representation of a 9x9 patch in the RGB color channels. Each 9x9 square on the right of the image represents the magnitude of the color in the respective color channel.
4.1
Compute gradients
For computing gradients, the algorithm runs the mask [-1, 0, 1] along the x-axis and the y-axis of each color channel.
[A 3x3 block of red-channel values: 189 163 5 / 189 195 11 / 189 195 112.]
Figure 4-3: The red color channel and the corresponding value block.
Figure 4-4: Multiplying the first matrix with the horizontal mask [1, 0, -1].
There is no need to do Gaussian smoothing of the image, because Gaussian smoothing does not improve the performance of Histogram of Oriented Gradients filters, as shown by Felzenszwalb et al. (2010).
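A minimal sketch of this step, assuming the image is supplied as one 2D NumPy array per color channel (the function and variable names are illustrative, not the thesis code):

import numpy as np

def compute_gradients(channel):
    # channel: one 2D color channel as a NumPy array.
    c = channel.astype(float)
    gx = np.zeros_like(c)
    gy = np.zeros_like(c)
    # Centered differences along x and y, i.e., the [-1, 0, 1] mask.
    gx[:, 1:-1] = c[:, 2:] - c[:, :-2]
    gy[1:-1, :] = c[2:, :] - c[:-2, :]
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    # "Unsigned" orientation in [0, 180) degrees, as used by the binning step.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation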
4.2
Orientation Binning
The algorithm discretizes the space of angles into 9 bins between 0° and 180° ("unsigned" gradient). Figure 4-5 represents the vote of a set of pixels. The weight of the vote of each pixel is equal to the magnitude of the gradient at that pixel. Each pixel is added to a bin according to its proximity to it.
Figure 4-5: Orientation binning is done in 6x6 pixel cells. Each pixel will contribute
to one direction of the gradient. The direction of the pixel at the center of the cell
is determined by voting of the pixels on that cell. In this example, the orientation of
the bin will be down.
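A sketch of the voting step for one cell, assuming the magnitude and orientation arrays from the gradient sketch above (9 unsigned bins of 20° each; interpolation between neighboring bins is omitted for brevity):

import numpy as np

def cell_histogram(magnitude, orientation, num_bins=9):
    # Each pixel votes for the bin containing its orientation,
    # weighted by its gradient magnitude.
    bin_width = 180.0 / num_bins
    hist = np.zeros(num_bins)
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        hist[int(ang // bin_width) % num_bins] += mag
    return hist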
4.3
Descriptor blocks
There is a large variation in the strength of the illumination and a large difference in contrast between the foreground and the background, so it is important to have blocks that integrate information from sets of neighboring pixels. Each set of neighboring pixels is called a block. Each block is normalized to reduce the impact of these differences when identifying objects. In this thesis I use blocks of size 16x16 pixels.
4.4
Contrast normalize over overlapping spatial blocks
The image is represented by a vector of overlapping spatial blocks. The vector is built such that each horizontal row of spatial blocks is part of the vector, and each row is appended to the end of the previous row.
The blocks overlap in an 8x16 region, i.e., each block overlaps with its neighbors such that each pair of overlapping blocks shares half of each block.
The vector is normalized to reduce the impact of the brightness of certain areas when calculating the gradients. Each block vector $v$ is normalized by:

$$v \leftarrow \frac{v}{\sqrt{\|v\|_2^2 + \epsilon^2}}$$
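A one-function sketch of this normalization, assuming the block has already been flattened into a NumPy vector and that the L2 scheme above is the intended one (epsilon is a small constant that avoids division by zero):

import numpy as np

def normalize_block(v, eps=1e-5):
    # L2 normalization of a flattened block of cell histograms.
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)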
4.5
Collect Histogram of Oriented Gradients over
detection window
To detect objects, the algorithm produces these vectors from 64x128-pixel detection windows. To handle objects of different sizes, I sub-sample and super-sample each image. Figure 4-6 shows a 64x128 detection window.
Figure 4-6: The size of the detection window.
4.6
Support Vector Machine
After collecting the data, I train a linear Support Vector Machine on it; new data points are then classified by comparing them against the training data using this linear Support Vector Machine. Figure 4-7 illustrates how a Support Vector Machine separates the training data.
Figure 4-7: This image exemplifies how a Support Vector Machine separates the training data. The blue points are training data classified as not objects and the green points are training data classified as objects.

Each window of 64x128 pixels is represented as a vector $\vec{x}$,
and the output of the vector is represented as a scalar y. The scalar y is such that:
$$y_i = \begin{cases} +1 & \text{if the window is an example of the object,} \\ -1 & \text{otherwise.} \end{cases}$$
The linear Support Vector Machine can be found by solving the following system:

$$\min_{\vec{w},\,b} \; \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1, \quad i = 1, \ldots, n$$

I solved this system using Gaussian elimination. Solving it gives $\vec{w}$ and $b$, which are used to describe the linear Support Vector Machine in the following way:

$$\vec{w} \cdot \vec{x} + b = 0 \tag{4.1}$$
4.6.1 New data points
To classify new images, I run all the previous steps, and then use the formula of the linear Support Vector Machine to classify a new data point $\vec{x}$ as being the object if it satisfies the following formula:

$$\vec{w} \cdot \vec{x} + b > 0$$
Therefore, the output of this algorithm is a set of masks of the regions that are classified as being the object.
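As a sketch of this training and classification step using an off-the-shelf solver in place of the Gaussian-elimination solution described above (scikit-learn's LinearSVC is an assumption of this example, not the thesis implementation):

import numpy as np
from sklearn.svm import LinearSVC

def train_detector(features, labels):
    # features: one HOG descriptor vector per 64x128 training window.
    # labels: +1 for object windows, -1 for non-object windows.
    svm = LinearSVC()
    svm.fit(features, labels)
    return svm

def is_object(svm, descriptor):
    # Equivalent to checking w . x + b > 0 for the learned hyperplane.
    return svm.decision_function(np.asarray(descriptor).reshape(1, -1))[0] > 0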
4.7
Map-Dictionary Pixel
Given an image I(x, y), the algorithm outputs the regions of the image containing the object, i.e., it outputs a mask P(x, y) such that:

$$P(x, y) = \begin{cases} 1 & \text{if the pixel at location } (x, y) \text{ belongs to a person,} \\ 0 & \text{otherwise.} \end{cases}$$

The output Map-Dictionary Pixel of this algorithm is equal to:

MDp(x, y) = {Image:I(x, y), Person:P(x, y)}    (4.2)

Figure 4-8 shows an example of an input image and P(x, y). All the points inside the red rectangle are such that P(x, y) = 1, and all points outside it are such that P(x, y) = 0.
Figure 4-8: The input image and P(x, y). All the points inside the red rectangle are
such that P(x, y) = 1 and all points outside it are such that P(x, y) = 0.
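A sketch of how this output could be written into the Map-Dictionary Pixel structure from Chapter 1; the detector output is assumed here to be a list of rectangles, and the image is assumed to be indexed as image[y][x], both of which are illustrative simplifications:

def hog_output_to_mdp(image, detections, width, height):
    # detections: list of (x0, y0, x1, y1) rectangles returned by the detector.
    mdp = MapDictionary(width, height)
    for x in range(width):
        for y in range(height):
            inside = any(x0 <= x <= x1 and y0 <= y <= y1
                         for (x0, y0, x1, y1) in detections)
            mdp.addDictToEntry({"Image": image[y][x],
                                "Person": 1 if inside else 0}, x, y)
    return mdp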
Chapter 5
Image Segmentation
Image segmentation is one of the essential parts of the algorithm. Image segmentation is the process of partitioning an image into multiple regions. Each region is also known as a super-pixel.
The image segmentation algorithm that I use is the technique presented by Arbelaez et al. (2011). The technique is a variant of Normalized Cuts. Shi and Malik (2000) present the Normalized Cuts algorithm in more detail.
Normalized Cuts is an algorithm that associates a distance metric with every pixel and its neighbors. The algorithm finds cuts that maximize the distance between two regions. The distance between regions is defined as the sum over all pixel pairs that are separated by the cut.
Garey and Johnson (1979) proved that finding the maximum k-cut of a graph is an NP-complete problem. Therefore Arbelaez et al. (2012) transform the edge weights such that finding the maximum k-cut becomes equivalent to finding the minimum k-cut. To find the minimum k-cut, Arbelaez et al. use an approximation algorithm.
5.1
Map-Dictionary Pixel
The input of this algorithm is the Map-Dictionary Pixel given at the end of Chapter
4, i.e.:
MDp(x, y) = {Image:I(x, y), Person:P(x, y)}
(5.1)
Each object is represented as a mask P(x, y). The algorithm described in this chapter uses the Map-Dictionary Pixel to identify all the objects in the image and segment them separately. The advantage of using the Map-Dictionary Pixel is that there is no need to split the input into different images and run the algorithm separately on each one. Instead, the algorithm described in this chapter runs on each region of the image such that P(x, y) = 1.
The Map-Dictionary Pixel is easily augmented. The output of the algorithm is similar to the input, with the addition of the information that the algorithm provides. The output is equal to:
MDp(x, y) = {Image:I(x, y), Person:P(x,y), Edge:E(x, y)},
(5.2)
such that E(x, y) is equal to 0 if P(x, y) is equal to 0.
5.2
Minimum N cuts
Instead of calculating the minimum N cuts directly, the algorithm below finds approximate minimum N-cuts using an approximation algorithm. Figure 5-1 shows an example of the minimum cut for a certain graph.
Figure 5-1: Minimum Cut with sum equal to 10.
The next two sections present how I reimplemented the calculation of the distance between pixels. The first section presents how to calculate the distance between pixels using only local information, i.e., the information from neighboring pixels. The second section presents how to calculate the distance between pixels using the information of the whole image.
5.3 Local Information

5.3.1 Each angle
To calculate the distance $L_r(x, y, \theta)$ at a certain position (x, y), at a certain angle $\theta$ and at a certain radius $r$, I use the histogram difference between the two half-discs of the circle. The formula is:

$$L_r(x, y, \theta) = \frac{1}{2} \sum_i \frac{(g(i) - h(i))^2}{g(i) + h(i)}$$

where $g$ and $h$ are the histograms of the two half-discs.
For each direction $\theta$ we have a function of the distance between neighboring pixels. To smooth out the distance function, I apply a second-order Savitzky-Golay filter along the direction of the angle over the whole image.
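A sketch of the half-disc comparison at a single pixel, assuming the two half-disc histograms have already been extracted (the helper below only implements the chi-squared formula above):

import numpy as np

def half_disc_distance(g, h):
    # Chi-squared distance between the histograms of the two half-discs.
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    denom = g + h
    denom[denom == 0] = 1.0          # avoid division by zero for empty bins
    return 0.5 * np.sum((g - h) ** 2 / denom)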
5.3.2
Local Distance
The local distance for every position (x, y) is the maximum of all the distances for
different angles. The formula is shown below:
$$L(x, y) = \max_{\theta} L_r(x, y, \theta)$$
[Panels: Upper Half-Disc Histogram; Lower Half-Disc Histogram.]
Figure 5-2: Image extracted from work done by Arbelaez et al. (2012) exemplifying
the process of calculating the local distance for a certain angle.
5.4 Global Information

5.4.1 Distance Transformation
The following procedure is an approximation algorithm: instead of finding the maximum N cuts directly, the edge weights are transformed so that neighboring pixels that are far apart in distance have a correspondingly small weight.
Given the function L(x, y) for every position (x, y), the intervening contour cue weight, for a fixed radius r, is defined as:

$$W_{ij} = e^{-\max_{p \in \overline{ij}} L(p)} \tag{5.3}$$

In the implementation of the algorithm I set r equal to 5. The maximum of $e^{-x_1} + \cdots + e^{-x_n}$ is not attained at exactly the same point as the minimum of $x_1 + \cdots + x_n$, but the two sets of solutions are very similar. The function $e^{-x}$ is depicted in the figure below:
44
1(J
I51
-3
-2
-1
1
2
3
Figure 5-3: $e^{-x}$ from -3 to 3.
5.4.2
Each angle
I use the affinity matrix of a graph. The entries of the affinity matrix are calculated
using equation (5.3).
Definition: Given an affinity matrix W whose entries encode the similarity between pixels, one defines the diagonal matrix $D_{ii} = \sum_j W_{ij}$.
To find the minimum N-cuts, the algorithm finds the generalized eigenvectors. The eigenvectors provide information on the axes along which the matrix has the highest variation. The equation for the eigenvectors is the following:

(D − W)v = λDv
(5.4)
After this step, each pixel is represented in the basis of the largest N eigenvectors. To find the approximate minimum N cuts, the pixels are clustered into N regions using k-means clustering.
The above solution usually breaks regions that have smooth gradients, but it provides a very good clue as to where the regions should be. The solution given by Arbelaez et al. (2012) is to run the above algorithm but to separate regions only at boundaries that are not smooth.
The N eigenvectors found from equation 5.4 with the largest eigenvalues $|\lambda|$ represent the axes of highest variability of the points. I pick the largest N = 16 eigenvectors, as Arbelaez et al. (2012) did. The
eigenvectors are combined in the following manner:
$$g_D(x, y, \theta) = \sum_{k=1}^{N} \frac{1}{\sqrt{\lambda_k}} \, \nabla_\theta v_k(x, y) \tag{5.5}$$
For every angle $\theta$ the global distance is:

$$G(x, y, \theta) = L_r(x, y, \theta) + g_D(x, y, \theta)$$

5.4.3 Global Distance
Given $G(x, y, \theta)$ for every angle $\theta$, the global distance for each position (x, y) is:

$$G(x, y) = \max_{\theta} G(x, y, \theta) \tag{5.6}$$

5.5 Oriented Watershed Transform
The watershed transform is calculated given a function value at each location (x, y); in this case I use the global distance G(x, y).
Imagine that the function G(x, y) represents how tall each pixel is. Calculating the watershed transform is then equivalent to placing a drop of water on each pixel and letting it flow to the pixels at the lowest positions. A sketch of this flooding step is given below, and the full code is in Appendix A.
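This is a minimal priority-flood sketch of the plain (unoriented) watershed step, assuming G is a 2D NumPy array of boundary strengths; it illustrates the idea and is not the thesis implementation:

import heapq
import numpy as np

def watershed(G):
    # Label each pixel with the basin it drains to, treating G as a height map.
    h, w = G.shape
    labels = np.zeros((h, w), dtype=int)
    heap = []
    next_label = 0
    # Seed one basin per local minimum (plateaus are handled naively here).
    for y in range(h):
        for x in range(w):
            nbrs = [(y + dy, x + dx) for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            if all(G[y, x] <= G[ny, nx] for ny, nx in nbrs):
                next_label += 1
                labels[y, x] = next_label
                heapq.heappush(heap, (float(G[y, x]), y, x))
    # Flood: always expand from the lowest frontier pixel into unlabeled neighbors.
    while heap:
        _, y, x = heapq.heappop(heap)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]
                heapq.heappush(heap, (float(G[ny, nx]), ny, nx))
    return labels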
Given the watershed transform, to calculate the oriented watershed transform I estimate an orientation at each pixel on an arc from the local geometry of the arc itself. The orientations are obtained by approximating the watershed arcs with line segments.
I subdivide any arc that is not well fit by the line segment connecting its endpoints, until no such arcs remain. Then, I assign each pixel (x, y) on a subdivided arc the orientation $o(x, y) \in [0, \pi)$ of the corresponding line segment.
Next, I use $G(x, y, \theta)$, defined in Section 5.4.2, to assign each arc pixel (x, y) a boundary strength of G(x, y, o(x, y)). I quantize o(x, y) in the same manner as $\theta$, so this operation is a simple lookup. Finally, each original arc is assigned a weight equal to the average boundary strength of the pixels it contains. Comparing the middle-left and far-right panels in Figure 5-4 shows that this reweighting scheme removes artifacts. The output is a function E(x, y) defined at every location (x, y) of the image.
Figure 5-4: Image extracted from [1] exemplifying the process of calculating the Oriented Watershed Transform. The image on the far left represents the function G(x, y). The image on the middle left represents the watershed transform. The image in the middle represents the process of approximating the arcs with line segments. The image on the middle right represents the strength of G(x, y, o(x, y)) for 4 different angles. The image on the far right represents the output.
5.5.1
Map-Dictionary Pixel
The output of the algorithm is described in eq. 5.2 and is equal to:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y)}.    (5.7)
Figure 5-5: The input and the output of running the Ultrametric Contour Map algorithm.
Chapter 6
Histogram of Oriented Gradients
filters for the subparts
The goal of this chapter is to describe the process of how I used the Histogram of
Oriented Gradients filter detector for the subparts of the human body - upperbody,
lowerbody and face - and how I annotated the data for training the Histogram of
Oriented Gradients filter detector described in Chapter 4.
To classify the subparts of the human body I used the Map-Dictionary Pixel
given by the Image Segmentation algorithm described in Chapter 5. The output of
the Image Segmentation algorithm segments the image such that regions that are
distinct are more likely to be separated.
Thus the human subparts - upperbody, lowerbody and face - are very likely to be separated by an edge created by the Image Segmentation algorithm described in Chapter 5. The reason why a human can be separated into these subparts is that people are usually dressed in such a way that the upperbody, the lowerbody and the face each have a consistent color scheme.
The following subsections describe the input of the algorithm and how the algorithm processes the input.
6.0.2
Map-Dictionary Pixel
The input of this algorithm is the output given by the algorithm of Chapter 5, which is equal to:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y)}.    (6.1)

The Image Segmentation algorithm gives the N cuts, and the scale of each cut (edge) encodes how strongly separated the two regions on either side of it are. I therefore set a threshold $\lambda$ such that the chosen subparts are, most of the time, separated. The chosen threshold was $\lambda = 0.7$.
Therefore, a function $f$ returns 1 at a pixel (x, y) if the ultrametric contour map value at that pixel is greater than or equal to $\lambda$, and 0 if it is less than $\lambda$. The formula below summarizes it:

$$f(x, y) = \begin{cases} 1 & \text{if } E(x, y) \geq \lambda, \\ 0 & \text{otherwise.} \end{cases}$$
The Histogram of Oriented Gradients filters for the subparts are run on each segmented part of the image individually, i.e., each Histogram of Oriented Gradients filter only runs through windows inside the subparts generated by the Map-Dictionary Pixel. For each segmented part of the image, the Histogram of Oriented Gradients filter outputs whether that subpart is present. The algorithm outputs the following Map-Dictionary Pixel:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y), Face:F(x, y), LowerBody:L(x, y), UpperBody:U(x, y)}    (6.2)
6.0.3 Algorithm

The Histogram of Oriented Gradients filters for each subpart are run on each segmented part, and if any of the filters detects one of the subparts, the respective segmented part is considered to be of that subpart. For example, if the Histogram of Oriented Gradients filter detects the lowerbody in a segmented area A, then:

$$L(x, y) = 1 \quad \forall (x, y) \in A$$
Figure 6-1 shows an example of a segmented part in which the face Histogram of Oriented Gradients filter detects a face, and an example of a segmented part in which it does not.
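A sketch of this per-segment labeling, assuming segments is a list of pixel-coordinate lists and that detects_face, detects_lowerbody and detects_upperbody stand in for the subpart detectors of this chapter (all of these names are illustrative):

def label_subparts(image, segments, width, height):
    # Run each subpart filter on each segmented part; if a filter fires,
    # mark every pixel of that segment in the corresponding mask.
    F = [[0] * height for _ in range(width)]
    L = [[0] * height for _ in range(width)]
    U = [[0] * height for _ in range(width)]
    for segment in segments:
        hit_f = detects_face(image, segment)
        hit_l = detects_lowerbody(image, segment)
        hit_u = detects_upperbody(image, segment)
        for (x, y) in segment:
            if hit_f: F[x][y] = 1
            if hit_l: L[x][y] = 1
            if hit_u: U[x][y] = 1
    return F, L, U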
Figure 6-1: The leftmost image in this figure shows the raw image. The middle left image shows E(x, y). The middle right image shows a segmented part for which the Histogram of Oriented Gradients filter detects a face. The rightmost image shows a segmented part for which the Histogram of Oriented Gradients filter does not detect a face.
6.1
Training Data
I have annotated 830 faces in the images provided by the PASCAL dataset, as well as 89 lowerbody parts and 155 upperbody parts, totalling 1049 labeled data points. Figure 6-2 shows a set of images used for training the Histogram of Oriented Gradients filter detector for faces. The training data is available in this repository. The training data was annotated using the PASCAL Dataset. The following subsection gives an overview of the PASCAL Dataset.
Figure 6-2: Set of image faces used for training the Histogram of Oriented Gradients
filter model for the human face.
6.1.1
PASCAL Dataset
The PASCAL Dataset is divided into 3 parts:
1. Annotations: The annotations contain information on where and how the subparts are laid out inside an image.
2. Image Sets: The image sets contain the classification of which annotations are
classified as being the object and which annotations are not the object.
3. JPEG Images: This folder contains the images that are annotated.
In the implementation of the algorithm described in this chapter, I created 1049 annotations using the JPEG images given by the PASCAL Dataset. I also created 3 Image Sets describing the data points that do and do not represent the following subparts: lowerbody, upperbody and face.
6.2
Output
The output of this algorithm is the following Map-Dictionary Pixel:
MDp(x, y) = {Image:I(x, y), Person:P(x, y), Edge:E(x, y), Face:F(x, y), LowerBody:L(x, y), UpperBody:U(x, y)}.    (6.3)

Given the output of the algorithm, the final image is the union of the Histogram of Oriented Gradients filters of the subparts, so the output is equal to:

$$O(x, y) = \begin{cases} I(x, y) & \text{if } F(x, y) = 1 \text{ or } L(x, y) = 1 \text{ or } U(x, y) = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Figure 6-3 shows the input of the Ultrametric Contour Map algorithm and the output mask O(x, y).
Figure 6-3: The left image is the input of the Ultrametric Contour Map and the right image is the output mask O(x, y) described above. Note that the mask on top of the person is very close to the actual pixels that belong to the person depicted in the image.
Chapter 7
Matching Fragments
In this chapter I discuss the reimplementation of the algorithm developed by Borenstein and Ullman (2002). To achieve that, I first annotated images of parts of a person, and then matched the fragments of the annotated images against new images of a person.
The input of the algorithm is the Map-Dictionary Pixel that is the output of the
Histogram of Oriented Gradients filter described in chapter 4. Thus, the input is:
MDp(x,y) = {Image:I(x,y), Person:P(x,y)}.
(7.1)
The input is easily processed, because the only parts that are analyzed as possibly being part of a person are the ones where P(x, y) = 1. Thus, if P(x, y) = 0 then there are no fragments F(x, y) that belong to the image of a person. Therefore, if P(x, y) = 0 then F(x, y) = 0, and the output of the algorithm is the following Map-Dictionary Pixel:

MDp(x, y) = {Image:I(x, y), Person:P(x, y), Fragments:F(x, y)}.    (7.2)

The Fragments represent the pixels that belong to the image of a person. The representation of the output is such that:

$$F(x, y) = \begin{cases} I(x, y) & \text{if } (x, y) \text{ is part of a fragment,} \\ 0 & \text{otherwise.} \end{cases}$$
The following sections describe the training data, the implementation of the algorithm and the output.
7.1
Training Data
The training data consist of a set of pairs of images. Each pair contains a fragment image of the object class and an image of its mask. Fragments and masks are described below:
7.1.1
Fragment
A fragment image is an image of a fixed size, m x n, that contains part of the image of the object. In the implementation of this algorithm I set m = 24 and n = 45. The fragments and their respective masks are extracted manually.
7.1.2
Mask
The mask of an image is an image of the same size as the original one. Each mask image contains zeros and ones. The ones represent the interior of the object and the zeros represent the background. Thus, for every pixel at position (x, y) the mask image M is such that:

$$M(x, y) = \begin{cases} 1 & \text{if the pixel at position } (x, y) \text{ belongs to the object,} \\ 0 & \text{otherwise.} \end{cases}$$
Figure 7-1 shows such a pair: a fragment image and its mask. The fragment image is described later in this chapter.
Figure 7-1: This figure represents a pair of a fragment and its mask. Note that the
mask is black and white, because the mask is either equal to 0 or 1.
7.2
Algorithm
Given the set of pairs of fragments and masks and the input image, I calculate the best matching fragments and their respective positions in the input image. Then, the algorithm lays out the best matching fragments on top of the image, like a jigsaw puzzle.
To calculate the best matching fragments, one needs to calculate the correlation between each fragment and the region of the input image at a certain position (x, y). Figure 7-2 shows an example of the 100 best matching fragments on top of an input image.
Figure 7-2: The first image is the input image and the second image shows the 100
best matching fragments superimposed on top of a black image of the same size. The
third image shows the 100 best matching fragments adding the information of the
textons.
For each input image, given the correlation between two fragments of the same size, the best matching fragments are found using the following algorithm:
best_fragments = []
for f in Fragments:
    maxF = -Inf
    pf = (0, 0)
    # Slide the fragment over every window position (x, y) of the image.
    while not end of the figure:
        v = correlation(I[x : x + delta_x][y : y + delta_y], f)
        if maxF < v:
            maxF = v
            pf = (x, y)
        move(x, y)
    record (f, pf, maxF) in best_fragments
The best matching fragments are the ones that maximize maxF. The correlation between two fragments is calculated using the formula described in the next subsection.
7.2.1
Correlation
The distance between two fragments of the same size is calculated by first changing the representation of the image from the RGB color space to the LAB color space. Then, the distance between two fragments, $f_1$ and $f_2$, is calculated by clustering the color space and then computing the distance between the histograms of the fragments:

$$d(f_1, f_2) = \frac{1}{2} \cdot \frac{\sum_{k=1}^{K} (f_1(k) - f_2(k))^2}{\sum_{k=1}^{K} (f_1(k) + f_2(k))}$$

To cluster the color space, I use k-means clustering with K = 32.
Example

Given the two histograms shown in Figure 7-3, the difference between the two fragments is equal to:

$$\frac{1}{2} \times \frac{(2-2)^2 + (4-3)^2 + (4-2)^2 + (3-0)^2 + (0-1)^2 + (1-4)^2}{(2+2) + (4+3) + (4+2) + (3+0) + (0+1) + (1+4)} = \frac{1}{2} \times \frac{0+1+4+9+1+9}{4+7+6+3+1+5} = \frac{1}{2} \times \frac{24}{26} = \frac{12}{26}$$
TP.Aw1
Tetn
eln
Txw
etn
Tim6
Texion1
Textm 2 Texa0,3 Texio4
Twon 0Tex0mi
Figure 7-3: Two different histograms representing how many textons of each type are
in each region.
1
24
12
2
26
26
The new distance is the weighted sum of the distance in the LAB color space and the distance between the textons of the pixels.
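The weight of this sum is not specified here, so the line below only illustrates the combination with a placeholder weight alpha.

def combined_distance(d_lab, d_texton, alpha=0.5):
    # alpha is a placeholder weight; the thesis does not state its value
    return alpha * d_lab + (1 - alpha) * d_texton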
7.3
Textons
To improve the algorithm, I added textons as additional information for comparing different fragments. Textons help when representing regions of the same kind of material: the distance between textons is a good indicator of different materials, and thus helps to separate different objects.
To calculate the textons, each pixel p is assigned a corresponding vector v with entries

\[
v_i = p * F_i,
\]

where F_i is the i-th filter of a filter bank and v_i is the response of the image to that filter at p. In the implementation of this algorithm I used the Leung-Malik (LM) filter bank, which is described in the next subsection. The textons are then obtained by clustering these vectors into 32 clusters with the k-means clustering algorithm. Figure 7-4 shows an image and the corresponding textons.
Figure 7-4: An image and the corresponding textons.
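A minimal sketch of this texton step with off-the-shelf tools is shown below; it assumes a grayscale image and a list of filter kernels (for example the LM bank of the next subsection), and it uses scikit-learn's KMeans, which may differ from the clustering implementation actually used in the thesis.

import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans

def compute_textons(gray_image, filter_bank, n_textons=32):
    # One response per filter for each pixel: v_i = p * F_i.
    responses = np.stack(
        [convolve(gray_image, f, mode="reflect") for f in filter_bank],
        axis=-1,
    )
    H, W, D = responses.shape
    # Cluster the response vectors into n_textons groups; each pixel's
    # label is its texton.
    labels = KMeans(n_clusters=n_textons, n_init=10).fit_predict(
        responses.reshape(-1, D)
    )
    return labels.reshape(H, W)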
7.3.1
Filter Bank
In the implementation of this algorithm I used the Leung-Malik (LM) Filter Bank. The Leung-Malik filter bank is a multi-scale, multi-orientation filter bank containing 48 filters. The code to generate the Leung-Malik (LM) Filter Bank is located in the Appendix.
The Leung-Malik Filter Bank contains first and second derivatives of Gaussians at 6 orientations and 3 scales, 8 Laplacian of Gaussian filters, and 4 Gaussians.
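These counts are consistent with the stated total of 48 filters: 6 orientations × 3 scales × 2 derivative orders gives 36 oriented filters, and 36 + 8 + 4 = 48.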
Figure 7-5: The representation of the LM Filter Bank.
7.4
Output
The output of this algorithm is the following Map-Dictionary Pixel:

MDp(x, y) = {Image: I(x, y), Person: P(x, y), Mask: M(x, y)}.    (7.3)

The output image, O, of the person can be calculated in the following way:

\[
O(x, y) = \begin{cases} I(x, y) & \text{if } P(x, y) = 1 \text{ and } M(x, y) = 1, \\ 0 & \text{otherwise.} \end{cases}
\]
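As a small illustration, the output image O can be obtained from the arrays in this Map-Dictionary Pixel with a few lines of NumPy, assuming P and M are stored as binary arrays of the same height and width as I; this is a sketch, not the thesis code.

import numpy as np

def person_output(I, P, M):
    # O(x, y) = I(x, y) where both P and M are 1, and 0 elsewhere.
    keep = (P == 1) & (M == 1)
    return I * keep[..., np.newaxis]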
Chapter 8
Conclusion
In this chapter I review the goals of the thesis, what has been done to achieve these goals, and the contributions of this work.
The goal of computer vision systems is to simulate the human visual system. In order to do that, a computer vision system is divided into different parts, and many algorithms focus on a single specific question. Examples of questions that individual algorithms try to answer are:
1. Where is a certain object located in an image?
2. What are the subparts of an image?
3. How can we find the distance between the points depicted in the image and the camera?
Each question is answered by a single class of algorithms. The algorithms that answer these questions are, respectively, Object Detection, Image Segmentation, and 3D Reconstruction. The presented framework integrates these individual algorithms in a simple way and therefore empowers them.
8.1
Contributions
In this thesis, I make the following contributions:

* I developed the Map-Dictionary Pixel framework for passing information between various computer vision algorithms.

* I demonstrated the value of the Map-Dictionary Pixel framework by implementing two human-detecting algorithms, the Ultrametric Contour Map algorithm and the Matching Fragments algorithm, and connecting them using that framework.

* I created training data for modeling human faces, upper bodies, and lower bodies.
Appendix A
Code
This appendix includes code for algorithms described in previous chapters. The code is organized by the chapter in which each algorithm is discussed.
A.0.1
Chapter 5
The following code shows how to calculate the Oriented Watershed Transform:
for pixel in pixels:
    output(pixel) = 1
still_dropping = true
while still_dropping:
    still_dropping = false
    for pixel in pixels:
        neighbors = get_neighbors(pixel)
        number_of_lower_neighbors = get_number_of_lower_neighbors(pixel)
        if number_of_lower_neighbors == 0:
            continue  # local minima keep their accumulated water
        for neighbor in neighbors:
            if G(neighbor) < G(pixel):
                still_dropping = true
                output(neighbor) += output(pixel) / number_of_lower_neighbors
        output(pixel) = 0
return output
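For reference, here is a runnable NumPy sketch of the same water-dropping idea, assuming G is a 2D array of gradient magnitudes and using 4-connected neighbors; it illustrates the pseudocode above rather than reproducing the thesis implementation.

import numpy as np

def drop_water(G, max_iters=100):
    # Each pixel starts with one unit of water, which repeatedly flows
    # to its strictly lower neighbors until everything sits in minima.
    H, W = G.shape
    output = np.ones((H, W))
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connected neighbors
    for _ in range(max_iters):
        still_dropping = False
        new_output = np.zeros_like(output)
        for y in range(H):
            for x in range(W):
                lower = [(y + dy, x + dx) for dy, dx in offsets
                         if 0 <= y + dy < H and 0 <= x + dx < W
                         and G[y + dy, x + dx] < G[y, x]]
                if lower and output[y, x] > 0:
                    still_dropping = True
                    share = output[y, x] / len(lower)
                    for ny, nx in lower:
                        new_output[ny, nx] += share
                else:
                    new_output[y, x] += output[y, x]
        output = new_output
        if not still_dropping:
            break
    return output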
A.0.2
Chapter 7
The following code is the MATLAB code to generate the Leung-Malik Filter Bank.
function F=makeLMfilters
% Returns the LML filter bank of size 49x49x48 in F. To convolve an
% image I with the filter bank you can either use the matlab function
% conv2, i.e. responses(:,:,i)=conv2(I,F(:,:,i),'valid'), or use the
% Fourier transform.

  SUP=49;                 % Support of the largest filter (must be odd)
  SCALEX=sqrt(2).^[1:3];  % Sigma_{x} for the oriented filters
  NORIENT=6;              % Number of orientations

  NROTINV=12;
  NBAR=length(SCALEX)*NORIENT;
  NEDGE=length(SCALEX)*NORIENT;
  NF=NBAR+NEDGE+NROTINV;
  F=zeros(SUP,SUP,NF);
  hsup=(SUP-1)/2;
  [x,y]=meshgrid([-hsup:hsup],[hsup:-1:-hsup]);
  orgpts=[x(:) y(:)]';

  count=1;
  for scale=1:length(SCALEX),
    for orient=0:NORIENT-1,
      angle=pi*orient/NORIENT;  % Not 2pi as filters have symmetry
      c=cos(angle);s=sin(angle);
      rotpts=[c -s;s c]*orgpts;
      F(:,:,count)=makefilter(SCALEX(scale),0,1,rotpts,SUP);
      F(:,:,count+NEDGE)=makefilter(SCALEX(scale),0,2,rotpts,SUP);
      count=count+1;
    end;
  end;

  count=NBAR+NEDGE+1;
  SCALES=sqrt(2).^[1:4];
  for i=1:length(SCALES),
    F(:,:,count)=normalise(fspecial('gaussian',SUP,SCALES(i)));
    F(:,:,count+1)=normalise(fspecial('log',SUP,SCALES(i)));
    F(:,:,count+2)=normalise(fspecial('log',SUP,3*SCALES(i)));
    count=count+3;
  end;
return

function f=makefilter(scale,phasex,phasey,pts,sup)
  gx=gauss1d(3*scale,0,pts(1,:),phasex);
  gy=gauss1d(scale,0,pts(2,:),phasey);
  f=normalise(reshape(gx.*gy,sup,sup));
return