3D Model-Based Pose Estimation of Rigid Objects
From A Single Image For Robotics

by

Samuel I. Davies

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2015
(signature redacted)

Certified by: Tomás Lozano-Pérez, Professor, Thesis Supervisor
(signature redacted)

Certified by: Leslie Pack Kaelbling, Professor, Thesis Supervisor
(signature redacted)

Accepted by: Professor Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Theses
(signature redacted)
3D Model-Based Pose Estimation of Rigid Objects From A
Single Image For Robotics
by
Samuel I. Davies
Submitted to the Department of Electrical Engineering and Computer Science
on June 5, 2015, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
We address the problem of finding the best 3D pose for a known object, supported on a
horizontal plane, in a cluttered scene in which the object is not significantly occluded.
We assume that we are operating with RGB-D images and some information about the
pose of the camera. We also assume that a 3D mesh model of the object is available,
along with a small number of labeled images of the object. The problem is motivated
by robot systems operating in indoor environments that need to manipulate particular
objects and therefore need accurate pose estimates. This contrasts with other vision
settings in which there is great variability in the objects but precise localization is
not required.
Our approach is to find the global best object localization in a full 6D space of
rigid poses. There are two key components to our approach: (1) learning a view-based model of the object and (2) detecting the object in an image. An object model
consists of edge and depth parts whose positions are piece-wise linear functions of
the object pose, learned from synthetic rendered images of the 3D mesh model. We
search for objects using branch-and-bound search in the space of the depth image
(not directly in the Euclidean world space) in order to facilitate an efficient bounding
function computed from lower-dimensional data structures.
Thesis Supervisor: Tomás Lozano-Pérez
Title: Professor
Thesis Supervisor: Leslie Pack Kaelbling
Title: Professor
Acknowledgments
I dedicate this thesis to the Lord Jesus Christ, who, in creating the universe, was
the first Engineer, and in knowing all the mysteries is the greatest Scientist and
Mathematician.
I am very grateful to my advisors, Tomás Lozano-Pérez and Leslie Kaelbling, for
their kindness, insightful ideas and well-seasoned advice during each stage of this
process. If it was not for your patient insistence on finding a way to do branch and
bound search over the space of object poses using a probabilistic model, I would
have believed it was impossible to do efficiently. And thank you for fostering an
environment in the Learning and Intelligent Systems (LIS) group that is conducive
to thinking about the math, science and engineering of robotics.
I am also indebted to my loving parents for raising me, and thank you for
supporting me all these years. I love you! And I would never have learned engineering
or computer programming if you had not taught me, Dad. Thanks for sparking my
early interest in robotics with the WAO-II mobile robot!
Thanks also to our administrative assistant Teresa Cataldo for helping with logistics. Thanks to William Ang from TechSquare.com for keeping the lab's robot
and computers running and updated, and to Jonathan Proulx from The Infrastructure Group who was very helpful in maintaining and supporting the cloud computing
platform on which we ran the experiments.
Special thanks to my officemate Eun-Jong (Ben) Hong whose algorithm for exhaustive search over protein structures [19] encouraged me to find a way to do exhaustive
object recognition. I would also like to thank other fellow graduate students who
worked on object recognition in the LIS group: Meg Lippow, Hang Pang Chiu and
Jared Glover whose insights were valuable to this work. And I would like to thank
the undergraduates I had the privilege of supervising: Freddy Bafuka, Birkan Uzun
and Hemu Arumugam-thank you for being patient students! I would especially like
to thank Freddy, who turned me from atheism to Christ and has become my Pastor.
By his faithful preaching, he has guided me towards God during these years.
Contents
1 Introduction
    1.1 Overview of the Approach
        1.1.1 Learning
        1.1.2 Detection
    1.2 Outline of Thesis

2 Related Work
    2.1 Low-Level Features
    2.2 Generic Categories vs. Specific Objects
    2.3 2D vs. 3D vs. 2½D view-based models
        2.3.1 2D view-based models
        2.3.2 3D view-based models
        2.3.3 2½D view-based models
    2.4 Search: Randomized vs. Cascades vs. Branch-and-Bound
        2.4.1 Randomized
        2.4.2 Cascades
        2.4.3 Branch-and-Bound
    2.5 Contextual Information
    2.6 Part Sharing

3 Representation
    3.1 Object Poses
    3.2 Images
    3.3 View-Based Models
    3.4 Approximations
        3.4.1 (x, y) Translation Is Shifting In The Image Plane
        3.4.2 Weak Perspective Projection
        3.4.3 Small Angle Approximation
    3.5 Sources Of Variability
    3.6 Choice of Distributions

4 Learning
    4.1 View-Based Model Learning Subsystem
        4.1.1 Rendering
        4.1.2 Feature Enumeration
        4.1.3 Feature Selection
        4.1.4 Combining Viewpoint Bin Models
    4.2 High Level Learning Procedure
        4.2.1 Tuning Parameters

5 Detection
    5.1 Detecting Features
    5.2 Pre-Processing Features
    5.3 Branch-and-Bound Search
        5.3.1 Branching
        5.3.2 Bounding
        5.3.3 Initializing the Priority Queue
        5.3.4 Constraints On The Search Space
        5.3.5 Branch-and-Bound Search
        5.3.6 Parallelizing Branch-and-Bound Search
    5.4 Non-Maximum Suppression

6 Experiments
    6.1 Dataset
    6.2 Setting Parameters
    6.3 Results
        6.3.1 Speed

7 Conclusion
    7.1 Future Work

A Proofs
    A.1 Visual Part Bound
    A.2 Depth Part Bound
List of Figures
1-1 Examples of correct object detections.
1-2 An overview of the view-based model learning subsystem.
1-3 An overview of the manual labor required to learn a new object.
1-4 An overview of detection.
2-1 Fergus et al. [13] used a fully-connected model.
2-2 Crandall et al. [5] used a 1-fan model.
2-3 Torralba et al. [35] showed that sharing parts can improve efficiency.
3-1 An illustration of features in an RGB-D image.
3-2 Warping spherical coordinates into rectangular coordinates.
3-3 Examples of object poses that are at the same rotation in spherical coordinates.
3-4 An example of edges missed by an edge detector.
3-5 Normal distributions with and without a receptive field radius.
3-6 2D normal distributions with elliptical and circular covariances.
4-1 Examples of synthetic images.
4-2 Visualizations of enumerated features.
4-3 The effect of varying the minimum distance between parts.
4-4 Different objects and viewpoints vary in the area of the image they cover.
4-5 The minimum distance between parts should not be the same for all views.
4-6 The PR2 robot with camera height and pitch angles.
5-1 Hough transforms for visual parts in 1D.
5-2 Adding a rotation dimension to figure 5-1.
5-3 Adding a scale dimension to figure 5-1.
5-4 Hough transforms for visual parts in 2D.
5-5 Hough transforms for depth parts in 2D.
5-6 The maximum of the sum of 1D Hough votes in a region.
5-7 The maximum of the sum of Hough votes with rotation in a region.
5-8 The maximum of the sum of Hough votes with scale in a region is broken into parts.
5-9 The maximum of the sum of Hough votes in a region for optical character recognition.
5-10 The maximum of the sum of Hough votes for depth parts in 2D.
5-11 1D Hough transform votes and bounding regions aligned to image coordinates.
5-12 Hough transform votes for visual parts (with rotation) and bounding regions aligned to image coordinates.
5-13 Hough transform votes for visual parts (with scale) and bounding regions aligned to image coordinates.
5-14 Hough transform votes for depth parts and bounding regions in warped coordinates.
5-15 1D Hough transform votes and bounding regions with receptive field radius.
5-16 Hough transform votes (with scale), bounding regions and receptive field radius.
5-17 Hough transform votes for depth parts with bounding regions and receptive field radius.
6-1 Average Precision vs. Number of Training Images (n)
6-2 Average Precision vs. Number of Visual Parts (nV)
6-3 Average Precision vs. Number of Depth Parts (nD)
6-4 Average Precision vs. Number of Visual and Depth Parts (nD = nV)
6-5 Average Precision vs. Receptive Field Radius For Visual Parts (rV)
6-6 Average Precision vs. Receptive Field Radius For Depth Parts (rD)
6-7 Average Precision vs. Maximum Visual Part Variance (vVmax)
6-8 Average Precision vs. Maximum Depth Part Variance (vDmax)
6-9 Average Precision vs. Rotational Bin Width
6-10 Average Precision vs. Minimum Edge Probability Threshold
6-11 Average Precision vs. Camera Height Tolerance (htol)
6-12 Average Precision vs. Camera Pitch Tolerance (rtol)
6-13 Detection Running Time vs. Number of Processors
List of Tables
6.1 Detailed information about the objects and 3D mesh models used in the experiments.
6.2 Images of the objects and 3D mesh models used in the experiments.
6.3 Parameter values used in experiments.
6.4 Average precision for each object, compared with the detector of Felzenszwalb et al. [10].
6.5 A confusion matrix with full-sized images.
6.6 A confusion matrix with cropped images.
6.7 Errors in predicted poses for asymmetric objects.
6.8 Errors in predicted poses for symmetric objects.
List of Algorithms
1 Render and crop a synthetic image.
2 Update an incremental least squares visual part by adding a new training example.
3 Finalize an incremental least squares visual part after it has been updated with all training examples.
4 Update an incremental least squares depth part by adding a new training example.
5 Finalize an incremental least squares depth part after it has been updated with all training examples.
6 Enumerate all possible features.
7 Select features greedily for a particular minimum allowable distance between chosen parts dmin.
8 Select features greedily for a particular maximum allowable part variance vmax.
9 Learn a new viewpoint bin model.
10 Learn a full object model.
11 Evaluates a depth part in an image at a particular pose.
12 Evaluates a visual part in an image at a particular pose.
13 Evaluates an object model in an image at a particular pose.
14 An uninformative design for a bounding function.
15 A brute-force design for a bounding function.
16 Calculate an upper bound on the log probability of a visual part for poses within a hypothesis region.
17 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region by brute force.
18 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region.
19 Calculate an upper bound on the log probability of an object for poses within a hypothesis region.
20 A set of high-level hypotheses used to initialize branch-and-bound search.
21 A test to see whether a point is in the constraint region.
22 Update the range of rx values for a pixel.
23 Find the range of rx values for a hypothesis region.
24 Update the range of ry values for a pixel.
25 Find the range of ry values for a hypothesis region.
26 Update the range of z values for a pixel.
27 Find the range of z values for a hypothesis region.
28 Find the smallest hypothesis region that contains the intersection between a hypothesis region and the constraint.
29 One step in branch-and-bound search.
30 Detect an object in an image by branch-and-bound search.
31 Send a status update from a worker for parallel branch-and-bound.
32 A worker for a parallel branch-and-bound search for an object in an image.
33 Coordinate workers to perform branch-and-bound search in parallel.
34 Non-maximum suppression: remove detections that are not local maxima.
Chapter 1
Introduction
In this thesis we address the problem of finding the best 3D pose for a known object,
supported on a horizontal plane, in a cluttered scene in which the object is not
significantly occluded. We assume that we are operating with RGB-D images and
some information about the pose of the camera. We also assume that a 3D mesh
model of the object is available, along with a small number of labeled images of the
object.
The problem is motivated by robot systems operating in indoor environments that
need to manipulate particular objects and therefore need accurate pose estimates.
This contrasts with other vision settings in which there is great variability in the
objects but precise localization is not required.
Our goal is to find the best detection for a given view-based object model in an
image without exhaustively evaluating every possible object pose. Branch-and-bound
search lets us guarantee that the detection we return is the best one in the whole
space of poses, even though only a fraction of that space is explicitly examined.
Our solution requires the user to have a 3D mesh model of the object and an
RGB-D camera that senses both visual and depth information. Recent RGB-D cameras
provide much more accurate depth information than previous stereo cameras.
Moreover, RGB-D cameras like the Microsoft Kinect
are cheap, reliable and broadly available. We also require an estimate of the height
and pitch angle of the camera with respect to the horizontal supporting plane (i.e.
table) on which the upright object is located. The result is a 6 degree-of-freedom
pose estimate. We allow background clutter in images, but we restrict the problem
to images in which the object is not significantly occluded. We also assume that for
each RGB-D image, the user knows the pitch angle of the camera and the height of
the camera measured from the table the object is on.

Figure 1-1: Examples of correct object detections. Detections include the full
6-dimensional location (or pose) of the object. The laundry detergent bottle (top left)
and mustard bottle (top right) are both correctly detected (bottom).
This is a useful problem in the context of robotics, in which it is necessary to
have an accurate estimate of an object's pose before it can be grasped or picked
up. Although less flexible than the popular paradigm of learning from labeled real
images of a highly variable object class, this method requires less manual labor: only
a small number of real images of the object instance with 2D bounding box labels are
used to test the view-based model and to tune learning parameters. This makes the
approach practical as a component of a complete robotic system, as there is a large
class of real robotic manipulation domains in which a mesh model for the object to
be manipulated can be acquired ahead of time.
1.1 Overview of the Approach
Our approach to the problem is to find the global maximum probability object localization in the full space of rigid poses. We represent this pose space using 3 positional
dimensions plus 3 rotational dimensions, for a total of 6 degrees of freedom.
There are two key components to our approach:
" learning a view-based model of the object and
" detecting the object in an image.
A view-based model consists of a number of parts. There are two types of parts:
visual parts and depth parts. Visual parts are matched to edges detected in an image
by an edge detector. Each visual edge part is tuned to find edges at one of 8 discrete
edge angles. In addition, small texture elements can be used to define other kinds
of visual parts. Visual parts do not have depth, and the uncertainty about their
positions is restricted to the image plane.
Depth parts, on the other hand, only model uncertainty in depth, not in the image
plane. Each depth part is matched to a depth measurement from the RGB-D camera
at some definite pixel in the image. Thus we can think of the 1D uncertainty of depth
part locations as orthogonal to the 2D uncertainty of image part locations.
The expected position of each of the view-based model parts (both visual and
depth parts) is a function of the object pose. We divide the 3 rotational dimensions
of pose space into a number of viewpoint bins, and we model the positions of the
parts as a linear function of the object rotation within each viewpoint bin (i.e. we
use a small angle approximation). In this way, the position of the object parts is
a piecewise linear function of the object rotation, and the domain of each of the
"pieces" is a viewpoint bin. The 3 positional dimensions are defined with respect to
the camera: as the object moves tangent to a sphere centered at the focal point of the
camera, all the model parts are simply translated in the image plane. As the object
moves nearer or farther from the camera, the positions of the parts are appropriately
scaled with an origin at the center of the object (this is known as the weak perspective
approximation to perspective projection). A view-based model consists of parts whose
expected positions in the image plane are modeled by a function of all 6 dimensions
of the object pose.
A view-based model is learned primarily from synthetic images rendered from a
mesh of the particular object instance. For each viewpoint bin, synthetic images are
scaled and aligned (using the weak perspective assumption) and a linear model (using
the small angle assumption) is fit to the aligned images using least squares.
An object is detected by a branch-and-bound search that guarantees that the best
detections will always be found first. This guarantee is an attractive feature of our
detection system because it allows the user to focus on tuning the learning parameters
that affect the model, with the assurance that errors will not be introduced by the
search process. A key aspect of branch-and-bound search is bounding. Bounding gives
a conservative estimate (i.e. an upper bound) on the probability that the object is
located within some region in pose space.
Each part in a view-based model casts a weighted "vote" for likely object poses
based on the part's position. These votes assign a weight to every point in the
6D space of object poses. The pose with the greatest sum of "votes" from all of
the parts is the most probable detection in the image. We therefore introduce a
bounding function that efficiently computes an upper bound on the votes from each
part over a region of pose space. The bounding function is efficient because of the
weak perspective projection and small angle approximations. These approximations
allow the geometric redundancy of the 6D votes to be reduced, representing them in
lower dimensional (2D and 3D) tables in the image plane. To further save memory
and increase efficiency, these lower dimensional tables are shared by all the parts
tuned to a particular kind of feature, so that they can be re-used to compute the
bounding functions for all the parts of each kind.
1.1.1 Learning
Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera
with the ability to measure camera height and pitch angle
Output: A view-based model of the object, composed of a set of viewpoint bins,
each with visual and depth parts and their respective parameters
We break the process of learning a new view-based model (described in depth in chapter 4) into two parts. First we will discuss the fully automated view-based
model learning subsystem that generates a view-based model from a 3D mesh and a
specific choice of parameter values. Then we will discuss the procedure required to
tune the parameter values. This is a manual process in which the human uses the
view-based learning subsystem and the detection system to repeatedly train and test
parameter values. The model learning and the detection sub-procedures can be called
by a human in this manual process.
1.1.1.1 View-Based Model Learning Subsystem
Input: A 3D mesh and parameter values such as the set of viewpoint bins
Output: A view-based model of the object, which is composed of a set of viewpoint
bins, each with visual and depth parts along with the coefficients of the linear
model for their positions and the uncertainty about those positions
The view-based model learning subsystem (see figure 1-2) is a fully automated
process that takes a mesh and some parameter values and produces a view-based
model of the object. This subsystem is described in detail in section 4.1. The position
of each object part is a piecewise linear function of the three rotation angles about
each axis. Each piece of this piecewise linear model covers an axis-aligned "cube"
in this 3D space of rotations. We call these cubes viewpoint bins. 3D objects are
modeled with a number of different viewpoint bins, each with its own linear model
of the object's shape and appearance for poses within that bin. The first three of the
following learning phases are repeated for each viewpoint bin, and the fourth is
performed once at the end:
1. rendering the images,
2. enumerating the set of features that could be used as model parts,
3. selecting the features that will be used in the final viewpoint bin model and
4. combining the viewpoint bin models into the final view-based model.
Since each viewpoint bin is learned independently, we parallelize the learning procedure, learning each viewpoint bin model on a separate core.
In our tests, we had
nearly enough CPUs to learn all of the viewpoint bin models in parallel, so the total
learning time was primarily determined by the time taken to learn a single viewpoint
bin model. On a single 2.26 GHz Intel CPU core, learning takes an average of
approximately 2 minutes.
The view-based model learning subsystem is designed to be entirely automated,
and require few parameter settings from the user. However, there are still a number
of parameters to tune, as mentioned in section 1.1.1.2.
An unusual aspect of this learning subsystem is that the only training input to
the algorithm is a single 3D mesh. The learning is performed entirely using synthetic
images generated from rendering this mesh. This means that the learned view-based
model will be accurate for the particular object instance that the mesh represents,
and not for a general class of objects.
Rendering
Input: a 3D mesh and parameters such as viewpoint bin size and ambient lighting
level
Output: a sequence of cropped, scaled, and aligned rendered RGB-D images for
randomly sampled views within the viewpoint bin
Objects are rendered using OpenGL at a variety of positions and rotations in the
view frustum of the virtual camera, with a variety of virtual light source positions.
This causes variation in resolution, shading and perspective distortion, in addition
to the changes in appearance as the object is rotated within the viewpoint bin. The
virtual camera parameters are set to match the calibration of the real Microsoft Kinect
camera. The OpenGL Z-buffer is used to reconstruct what the depth image from the
Microsoft Kinect would look like.¹ Each of the images is then scaled and translated
such that the object centers are exactly aligned on top of each other. We describe
the rendering process in more detail in section 4.1.1.
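One detail worth making explicit is that OpenGL Z-buffer values are nonlinear in distance and must be converted back to metric depth before they can stand in for Kinect measurements. Below is a minimal sketch of the standard conversion, assuming the usual perspective projection with near and far clip planes (the variable names are ours, not the thesis's).

    def zbuffer_to_depth(z_buf, z_near, z_far):
        """Convert an OpenGL depth-buffer value in [0, 1] to metric depth.

        Assumes the usual perspective projection matrix with clip planes
        z_near and z_far (in meters). Applying this to every pixel of the
        rendered Z-buffer yields a synthetic depth image.
        """
        z_ndc = 2.0 * z_buf - 1.0   # to normalized device coordinates
        return (2.0 * z_near * z_far) / (z_far + z_near - z_ndc * (z_far - z_near))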
Feature Enumeration
Input: a sequence of scaled and aligned RGB-D images
Output: a least squares linear model of the closest feature position at each pixel for
depth features and for each kind of visual feature
The rendered images are used to fit a number of linear functions that model how
the position of each visual feature (such as edges) and depth value varies with small
object rotations within the viewpoint bin. A linear function is fit at each pixel in
the aligned images, and for each edge angle as well as for each pixel in the aligned
depth images. The linear functions are fit using least squares, so the mean squared
error values are a readily available metric to determine how closely the models fit the
actual simulated images.

¹This method is only an approximate simulation of the true process that generates RGB-D images
in the Kinect. For example, the real Kinect has a few centimeters of disparity between the infrared
camera that measures depth and the color camera, so that the visual and depth images are not
aligned at all depths.

Figure 1-2: An overview of the view-based model learning subsystem. Random poses
are sampled from within each view, and synthetic RGB-D images are rendered at
these views. These images are then scaled and translated so that the centers of the
objects in the images are aligned. Next, a linear model is fit at each pixel for each type
of visual feature (8 edge directions in this figure) detected in the synthetic images, and
another linear model is fit at each pixel for the depth measurements that come from
the Z-buffer of the synthetic images. Finally, some of those linear models are selected
and become parts of the final viewpoint bin model. The 360 viewpoint bin models are
combined to form the piecewise linear segments of a full view-based model. Note: this
procedure does not take any real images as an input; the learned models will later be
tested on real images.
In reality, feature enumeration is a process that occurs incrementally as each new
rendered image is generated. This saves memory and greatly increases learning speed.
We use a formulation of the least squares problem that allows each training point to
be added sequentially in an online fashion.
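A standard way to realize such an online fit is to accumulate the normal-equation sums XᵀX and Xᵀy one training example at a time and solve only at the end; the sketch below illustrates the idea with our own naming and is not the exact formulation of Algorithms 2-5. In this setting, x would hold the rotation offsets within the viewpoint bin (plus a constant term) and y a feature position or depth value at a pixel.

    import numpy as np

    class OnlineLeastSquares:
        """Fit y ~ x . w incrementally, one (x, y) training example at a time."""

        def __init__(self, dim):
            self.xtx = np.zeros((dim, dim))   # running sum of outer products x x^T
            self.xty = np.zeros(dim)          # running sum of x * y
            self.yty = 0.0                    # running sum of y * y (for the MSE)
            self.n = 0

        def add(self, x, y):
            x = np.asarray(x, dtype=float)
            self.xtx += np.outer(x, x)
            self.xty += x * y
            self.yty += y * y
            self.n += 1

        def finalize(self):
            """Return the least-squares coefficients and the mean squared error."""
            w = np.linalg.solve(self.xtx, self.xty)
            # Residual sum of squares = sum(y^2) - w . (X^T y), a standard
            # identity that holds at the least-squares solution.
            mse = (self.yty - w @ self.xty) / self.n
            return w, mse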
The speed of this process could be improved by the fine-grained parallelism available on a GPU architecture, computing the feature enumeration for each pixel of each
kind of feature in parallel.
We give more details in section 4.1.2.
Feature Selection
Input: a set of linear models for positions of each kind of feature at each pixel and
parameter values for how visual and depth parts should be modeled
Output: a model for a viewpoint bin, consisting of a selection of these part models
with low mean-squared-error and spaced evenly across the image of the viewpoint bin
From the set of enumerated features, a small fixed number are selected to be parts
of the model for the viewpoint bin. They are greedily chosen to have a low mean
squared error, and even spacing across the image of the viewpoint bin model. Most
of the user-specified learning parameters are used to control this stage of the learning
process. The selected parts constitute the model for each particular viewpoint bin,
and the set of viewpoint bins form the whole object model. We provide more details
in section 4.1.3.
Combining Models Of Viewpoint Bins
Input: a set of object models for all the different viewpoint bins
Output: a view-based model covering the region of pose space covered by the union
of the input viewpoint bins
A view-based model consists of a set of viewpoint bin models, each of which has
a set of visual and depth parts. After feature enumeration and selection, the set of
viewpoint bin models are grouped together to form the complete view-based model.
We give more details in section 4.1.4.
1.1.1.2 High Level Learning Procedure
Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera
with the ability to measure camera height and pitch angle and computational power.
Output: A view-based model of the object, composed of a set of viewpoint bin
models, each with visual and depth parts
The procedure to learn a new view-based model is depicted in figure 1-3. The
steps involved are:
1. Collect RGB-D images of the object, along with information about the camera
pose, and partition the images into a test and hold-out set.
2. Label the images with bounding boxes.
3. Acquire a 3D mesh of the object (usually by 3D scanning).
4. Tune learning parameters while testing the view-based models on the test images.
5. Evaluate the accuracy of the view-based models.
We describe this procedure in section 4.2.
Collect RGB-D Images
Input:
• the object instance, placed on a table and
• an RGB-D camera with the ability to measure pitch angle and height above
the table

Output: a set of RGB-D images labeled with the camera's pitch and its height above
the table

Figure 1-3: An overview of the manual labor required to learn a new object. To
learn a new view-based model of an object, we first collect a data set of about 30
positive image examples of the object and about 30 background images (all images
used for training the Downy bottle view-based model are in this figure). Each image
must also include the pitch angle of the camera and the height of the camera above
the table when the image was taken. The images should also include depth
information. We use a PR2 robot with a Microsoft Kinect mounted on its head
to gather our data sets. Each positive example must be labeled by a human with
an approximate 2D bounding box for the object to detect. A 3D mesh of the object
should be acquired (usually using a 3D scanner). The mesh is used to learn a new
view-based model, and the learning parameters must be manually adjusted as the
user tests each new learned view-based model on the real images and evaluates the
accuracy (measured by average precision).
In our experiments, we collect a set of around 15 images with depth information
(RGB-D images) for each object using a Microsoft Kinect mounted on a PR2 robot. In
our data sets, we did not manually change the scene between each image capture: we set up a table with background clutter and the object one time, and we drove
the robot to different positions around the table, ensuring that the object was not
occluded in any images, since our detector does not currently deal explicitly with
occlusion. This process took about 10 minutes per object. The reason we used the
robot instead of manually holding the Kinect camera and walking around the table
is that we also record the camera height above the table and its pitch (assuming the
camera is upright and level with ground, i.e., the roll angle of the camera in the plane
of the image is always zero). An affordable alternative to this method would be to
instrument a tripod with a system to measure the camera's height and pitch. This
information is used to constrain the 6D search space to search a region surrounding
the table, at object rotations that would be consistent (or nearly consistent) with the
object standing upright on the table.
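As an illustration of how the camera height and pitch give rise to such a constraint, the sketch below tests whether a 3D point expressed in camera coordinates lies near the table plane. It assumes the common vision convention (x right, y down, z forward) and zero camera roll; it is our own simplified rendering of the idea, not the constraint test of Algorithm 21.

    import math

    def near_table(point_cam, cam_height, cam_pitch, tol=0.02):
        """True if a camera-frame point (x right, y down, z forward; meters)
        lies within tol of the table plane.

        cam_height: camera height above the table (meters).
        cam_pitch:  downward pitch of the camera (radians), zero roll assumed.
        With zero pitch, 'straight down' is +y; pitching the camera down by
        cam_pitch tilts that direction toward +z.
        """
        x, y, z = point_cam
        height_below_camera = math.cos(cam_pitch) * y + math.sin(cam_pitch) * z
        return abs(height_below_camera - cam_height) <= tol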
We also collected a set of about 15 background images that contained none of the
objects we were interested in detecting using this same methodology. We were able
to re-use this set of images as negative examples for each of the objects we tested.
Label Bounding Boxes
Input: a set of RGB-D images
Output: left, right, top and bottom extents of an approximate rectangular bounding
box for each image of the object
We use a simple metric to decide whether a detection is correct: if the 2D rectangle
that bounds the detection in the image plane overlaps with the manually-labeled
bounding box according to the standard intersection over union (IoU) overlap metric
of the detected bounding box A and the ground truth bounding box B:

    area(A ∩ B) / area(A ∪ B) > 0.5        (1.1)
This leaves some room for flexibility, so the labeled bounding boxes do not need to be
accurate to the exact pixel. Labeling approximate bounding boxes for a set of about
30 images takes around 10 minutes for a single trained person.
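A minimal sketch of this overlap test, with axis-aligned boxes given as (left, top, right, bottom) tuples (a hypothetical helper, not code from this thesis):

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes (l, t, r, b)."""
        l = max(box_a[0], box_b[0])
        t = max(box_a[1], box_b[1])
        r = min(box_a[2], box_b[2])
        b = min(box_a[3], box_b[3])
        inter = max(0.0, r - l) * max(0.0, b - t)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def detection_is_correct(detected_box, ground_truth_box):
        return iou(detected_box, ground_truth_box) > 0.5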
We labeled our image sets with 6D poses for the purposes of evaluating our algorithm in this thesis. However, we found that, even if we know the plane of the table
from the camera height and pitch angle, labeling 6D poses is very time-consuming,
difficult and error-prone, so we decided to reduce the overall manual effort required
of the end user by relaxing the labeling task to simple bounding boxes.
We suggest that, in practice, the accuracy of detected poses can be evaluated
directly by human inspection, rather than using full 6D pose labels.
Acquire A 3D Mesh
Input: the object instance
Output: an accurate 3D mesh of the object instance
We found that the accuracy of the detector is highly related to the accuracy of
the 3D mesh, so it is important to use an accurate method of obtaining a mesh. The
scanned mesh models used in this thesis were mostly obtained from a commercial
scanning service: 3D Scan Services, LLC. Some of the mesh models we used (such as
boxes and cylinders) were simple enough that hand-built meshes yielded reasonable
accuracy.
Tune Parameters
Input:
• results of evaluating the view-based object detector
• a sample of correct and incorrect detections from the view-based model learned
from the previous parameter settings
Output: a new set of parameter values that should improve the accuracy of the
view-based model
There are many learning parameters involved in producing a view-based model,
such as:
" the size of the viewpoint bin,
* the amount of ambient lighting in the rendered images,
* the maximum allowable mean squared error in feature selection, etc.
It would be computationally infeasible to test all combinations of parameter settings
to automatically find the best-performing values, so this is left as a manual process.
A human can look at a visualization of a view-based model, and the set of detections
for that model, and see what the common failure cases are. A bit of
intuition, experience and understanding of how the view-based model is constructed
can help the human to make educated guesses as to which parameters need to be
adjusted to improve the performance. For example, by looking at a visualization of a
view-based model, one may realize that the set of view bins does not fully cover the
set of object rotations in the real world, so the user would adjust the set of viewpoint
bins and re-run the learning and test to see if it performs more accurately. Or the user
may notice that there appear to be randomly scattered edge parts in the view-based
model. In this case, the user may try to reduce the maximum allowable mean squared
error for edge feature selection.
This is admittedly the most difficult part of the process, as it requires a fair
amount of experience.
Section 4.2.1 gives more details on our methodology and
chapter 6 gives a sample of the kinds of experiments that we used to determine good
parameter settings, but the real process involves some careful inspection of detections,
an understanding of how the view-based model is affected by the parameters, and
some critical thinking.
Evaluation of View-Based Models
Input:
• a set of detected object poses in images,
• hand-labeled bounding boxes for the object in the images,
• a set of about 15 RGB-D images not containing the object
Output: a score between 0 and 1 evaluating the accuracy of the view-based model
on the set of test images
Since we only require 2D bounding box labels (to save manual labor), we are able
to evaluate the accuracy of results following the standard and accepted methodology
defined by the PASCAL [8] and ImageNet [29] challenges. The challenge defines
correct detection with respect to the ground truth label by the intersection over
union (IoU) metric (see section 1.1.1.2), and the overall detection accuracy on a set
of test images is measured by an average precision that is a number between 0 and
1, where 1 represents perfect detection.
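For reference, a bare-bones version of the average precision computation might look like the sketch below (our own simplification; the PASCAL tools add interpolation of the precision-recall curve and handling of duplicate detections):

    def average_precision(detections, num_ground_truth):
        """detections: list of (score, is_true_positive) pairs, one per detection,
        where is_true_positive means it matched an unused ground-truth box with
        IoU > 0.5. Returns a simple non-interpolated average precision in [0, 1].
        """
        detections = sorted(detections, key=lambda d: d[0], reverse=True)
        true_pos = 0
        ap = 0.0
        for rank, (_, is_tp) in enumerate(detections, start=1):
            if is_tp:
                true_pos += 1
                ap += true_pos / rank        # precision at this recall step
        return ap / num_ground_truth if num_ground_truth else 0.0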
1.1.2 Detection
Input:
" An RGB-D image along with the camera height and the camera pitch angle
" A view-based model
Output: A sequence of detected poses, ordered by decreasing probability that the
object is at each pose
The detection algorithm uses branch-and-bound search to find detections in decreasing order of probability.
Branch-and-bound search operates by recursively breaking up the 6D search space
into smaller and smaller regions, guided by a bounding function which gives an overestimate of the maximum probability that the object may be found in a particular
region. Using the bounding function, branch-and-bound explores the most promising
regions first, so that it can provably find the most probable detections first.
The bounding function in branch-and-bound search is the critical factor that determines the running time of the search process. The over-estimate of the bound
should not be too far above the true maximum probability (i.e. the bound should
be tight), and time to compute the bound should be minimal. In our design of the
detection algorithm, computational efficiency of the bounding function is the primary
consideration.
The detection algorithm consists of five steps:
1. detect visual features
2. pre-process the image to create low-dimensional tables to quickly access the 6D
search space
3. initialize the priority queue for branch-and-bound search
4. run the branch-and-bound search to generate a list of detections
5. suppress detections that are not local maxima to remove many redundant detections that are only slight variations of each other
Chapter 5 gives more details about the detection algorithm.
Visual Feature Detection
Input: an RGB-D image
Output: a binary image of the same dimensions, for each kind of visual feature
The first phase of detection is to detect the low-level features. Depth measurements are converted from Euclidean space, into measurements along a 3D ray starting
at the focal point of the camera passing through each pixel. Visual features must be
extracted from the input image. Visual feature detectors determine a binary value of
whether the feature is present or absent at each pixel. In this thesis, we use an edge
detector to extract edge pixels at around 8 different edge directions. We provide
more details in section 5.2.
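A rough sketch of this stage using off-the-shelf OpenCV operations is shown below; the Canny thresholds and the orientation-binning details are assumptions for illustration, not the settings used in the experiments.

    import cv2
    import numpy as np

    def edge_feature_images(gray, num_bins=8):
        """gray: 8-bit grayscale image. Returns one binary image per direction bin."""
        edges = cv2.Canny(gray, 50, 150) > 0               # assumed thresholds
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        orientation = np.mod(np.arctan2(gy, gx), np.pi)    # folded to [0, pi)
        bins = np.minimum((orientation / np.pi * num_bins).astype(int), num_bins - 1)
        return [edges & (bins == b) for b in range(num_bins)]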
Figure 1-4: An overview of detection. First features are detected in the image, then
these binary feature images are preprocessed to produce summed area tables and
distance transforms. The priority queue used in the branch-and-bound search is
initialized with a full image-sized region for each viewpoint bin. As branch-and-bound
runs, it emits a sequence of detections sorted in decreasing order of probability. Some
of these detections that are redundantly close to other, higher-probability detections
are then removed (or "suppressed").
Pre-processing
Input: a depth image and a binary image of the same dimensions, for each kind of
visual feature
Output:
" a 3D summed area table computed from the depth image
" a 2D summed area table computed from each kind of visual feature
" a 2D distance transform computed from each kind of visual feature
Before the process of searching for an object in an image begins, our algorithm
builds a number of tables that allow the bounding function to be computed efficiently.
First, edges are detected in the image. Each edge has an angle, and edges are
grouped into 8 discrete edge angle bins. Each edge angle bin is treated as a separate
feature. An optional texture detection phase may be used to detect other types of
visual features besides edges. 2D binary-valued images are created for each feature,
recording where in the RGB-D image the features were detected, and where they were
absent.
The bounding function needs to efficiently produce an over-estimate of the maximum probability that the object is in a 6D region in pose space. A dense 6D structure
table would be large, and even if it could fit in RAM, it would be slow because the
whole structure could never fit in the CPU cache. We therefore store 2D and 3D
tables that are smaller in size and more likely to fit in a CPU cache for fast read
access.
Each feature in the image has a maximum receptive radius, which is the region
of pose space where it may increase the overall "votes" for those poses. A feature
can have no effect on the total sum of votes for any pose outside of its receptive field
radius. The key idea of the bounding function for a particular part is to conservatively
assume the highest possible vote for that feature for a region of pose space that may
intersect the receptive field radius of some feature. Otherwise, it is safe to assume
the lowest possible vote for that region. To make the bounding function efficient, we
take advantage of a property of a summed area table [6] (also known as an integral
image [36]) that allows us to determine whether a feature is present in any rectangular
region with a small constant number of reads from the table. A separate 2D summed
area table is used for each visual feature (such as each edge angle). We similarly
compute a 3D summed area table for depth features.
The constant access time
property of the summed area table means that the bounding function takes the same
amount of time to compute, regardless of how big or small the region is.
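To illustrate the data structure, here is a minimal 2D summed area table with a constant-time query for whether any feature falls inside a rectangle (our own sketch; the depth case uses a 3D analogue of the same idea).

    import numpy as np

    class SummedAreaTable2D:
        def __init__(self, binary_image):
            # sat[i, j] = number of feature pixels in binary_image[:i, :j].
            self.sat = np.zeros((binary_image.shape[0] + 1,
                                 binary_image.shape[1] + 1), dtype=np.int64)
            self.sat[1:, 1:] = np.cumsum(np.cumsum(binary_image, axis=0), axis=1)

        def count(self, r0, c0, r1, c1):
            """Number of feature pixels in rows [r0, r1) and columns [c0, c1);
            four table reads regardless of the rectangle's size."""
            s = self.sat
            return s[r1, c1] - s[r0, c1] - s[r1, c0] + s[r0, c0]

        def any_feature(self, r0, c0, r1, c1):
            return self.count(r0, c0, r1, c1) > 0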
We would also like to efficiently compute the exact probabilities when branch-and-bound
arrives at a leaf of the search tree. The uncertainty of visual feature locations
is modeled by normal distributions in the image plane. The logarithm of a normal
distribution is simply a quadratic function (i.e. a parabola with a 2D domain), which
is the square of a distance function. To find the highest probability match between a
visual part and a visual feature in the image (such as an edge), we want to find the
squared distance to the visual feature that is closest to the expected location of the
part. A distance transform is a table that provides exactly this information: it gives
the minimum distance to a feature detection at each pixel [11]. A distance transform
is pre-computed for each kind of visual feature so that the exact probability of any
visual part can be computed with only one look-up into the table for the appropriate
kind of visual feature.
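A compact sketch of how such a table can be built and used is given below, with scipy's Euclidean distance transform standing in for the method of [11]; the truncated Gaussian part model and all names here are our own illustration.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def build_distance_transform(feature_image):
        """Distance (in pixels) from each pixel to the nearest detected feature."""
        return distance_transform_edt(~feature_image.astype(bool))

    def visual_part_log_prob(dist_table, expected_uv, sigma, receptive_radius):
        """Log probability (up to an additive constant) of one visual part:
        quadratic in the distance from the part's expected (column, row)
        location to the nearest feature of its kind, truncated at the
        receptive field radius."""
        u, v = int(round(expected_uv[0])), int(round(expected_uv[1]))
        d = min(dist_table[v, u], receptive_radius)     # one table look-up
        return -0.5 * (d / sigma) ** 2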
We underscore that these pre-computed tables are shared for all parts of a particular kind. In other words, the total number of pre-computed tables in memory
is proportional to the number of different kinds of features (usually 8 discrete edge directions), which is many fewer than the number of parts in a viewpoint bin model
(usually 100 visual and 100 depth parts), or the number of viewpoint bin models
or even the number of different object types being detected. This contributes to a
significant increase in search efficiency.
These tables take about 4 seconds to compute on a single core 2.26 GHz Intel
CPU.
Initializing the Priority Queue
Input: an empty priority queue, and the viewpoint bin sizes of the view-based model
Output: a priority queue containing a maximum-size region for each viewpoint bin,
each with its appropriate bound
Besides the tables used to compute the bounding function discussed in the last
chapter, the other major data structure used by branch-and-bound search is a priority
queue (usually implemented by a heap data structure). The priority queue stores the
current collection of working hypothesis regions of pose space, prioritized by their
maximum probability. The priority queue is designed to make it efficient to find and
remove the highest-priority (maximum probability bound) region. It is also fast to
add new regions with arbitrary bounds onto the priority queue.
Branch-and-bound search starts with the initial hypothesis that the object could
be anywhere (subject to the current set of constraints, such as whether it is near
a table top). We therefore put a full-sized region that covers the whole 6D pose
space we are considering for each viewpoint bin on the priority queue. These initial
regions are so large that they are uninformative-the bounding function will give a
very optimistic over-estimate, but it will still be fast to compute since the running
time does not vary with the size of the region.
Initializing the priority queue takes a negligible amount of time.
Branch-and-Bound Search
Input: an initialized priority queue
Output: the sequence detections (i.e. points in pose space) sorted in descending
order of probability that the object is located there
Branch-and-bound search removes the most promising hypothesis from the priority queue and splits it into as many as 2⁶ = 64 "branch" hypotheses because there are 6
dimensions in the search space. It computes the bounding function for each branch,
and puts them back onto the queue. When it encounters a region that is small enough,
it exhaustively tests a 6D grid of points within the region to find the maximum probability point, and that point is then pushed back onto the priority queue as a leaf.
The first time that a leaf appears as the maximum probability hypothesis, we know
that we have found the best possible detection.
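The control loop just described can be summarized by the following schematic sketch; the functions passed in are placeholders for the bounding, branching and leaf-evaluation steps detailed in chapter 5, not the actual implementation.

    import heapq
    import itertools

    def branch_and_bound(initial_regions, upper_bound, split, is_leaf, exact_best):
        """Return (log_prob, pose) of the most probable detection.

        upper_bound(region) -> over-estimate of the best log probability inside it
        split(region)       -> list of sub-regions (up to 2^6 for a 6D region)
        is_leaf(region)     -> True when the region is small enough to grid-search
        exact_best(region)  -> (log_prob, pose) of the best grid point inside it
        """
        queue, tie = [], itertools.count()

        def push(priority, evaluated, region, pose):
            # The counter breaks ties so regions themselves are never compared.
            heapq.heappush(queue, (-priority, next(tie), evaluated, region, pose))

        for region in initial_regions:     # one full-sized region per viewpoint bin
            push(upper_bound(region), False, region, None)

        while queue:
            neg_p, _, evaluated, region, pose = heapq.heappop(queue)
            if evaluated:
                # An exactly evaluated pose now outranks every remaining upper
                # bound, so it is guaranteed to be the most probable detection.
                return -neg_p, pose
            if is_leaf(region):
                log_prob, best_pose = exact_best(region)
                push(log_prob, True, region, best_pose)
            else:
                for child in split(region):
                    push(upper_bound(child), False, child, None)
        return None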
As we have said, the bounding function is the most critical factor in the efficiency of
the branch-and-bound search process. Inherent in the design of our view-based model
representation are two approximations: weak perspective projection and the small-angle approximation. These two approximations make the bounding function simple
to compute using the low dimensional pre-computed tables (discussed in section 1.1.2).
These approximations make it possible to access these low-dimensional tables in a
simple way: using only scaling and translations, rather than complex perspective
functions or trigonometric functions. A rectangular 6D search region can be projected
down to a 2D (or 3D) region, bounded by a rectangle in the pre-computed summed
area tables, with some translation and scaling. This region can then be tested with
a small constant number of reads from these tables in constant time for each visual
and depth part.
The leaf probabilities can also be computed quickly by looking up values at appropriately scaled and translated pixel locations in the distance transforms for each
kind of visual feature, and scaling and thresholding those distances values according
to the parameters of each part.
We tested the detection algorithm on 20 2.26 GHz Intel 24-core machines in parallel; each 24-core machine had 12 GB of RAM. Under these conditions, this process
usually takes about 15-30 seconds. The search procedure is parallelized by giving
each processor core its own priority queue to search its own sub-set of the pose search
space. When a priority queue for one core becomes empty, it requests more work,
and another core is chosen to delegate part of its priority queue to the empty core.
If searching for only the best n detections, then the current nth best probability is
continually broadcasted to all of the CPUs because any branches in the search tree
with lower probability can be safely discarded. At the end of the search, the final set
of detections are collected and sorted together.
The speed of this process could be improved by the fine-grained parallelism available on a GPU architecture, by computing the "vote" from each of the object parts
in parallel. Under this strategy, the memory devoted to the priority queue would be
located on the host CPU, while the memory devoted to the read-only precomputed
tables would be located on the GPU for faster access. The amount of communication
between the GPU and the CPU would be low: only a few numbers would be transferred at evaluation of a search node: the current search region would be sent to the
GPU, and the probability of that region would be returned to the CPU. This means
the problem would be unlikely to suffer from the relatively low-bandwidth connection
between a CPU and a GPU.
Non-Maximum Suppression
Input: a list of detections with their corresponding probabilities
Output: a subset of that list that only keeps detections whose probabilities are a
local maximum
If branch-and-bound search continues after the first detection is found, the sequence of detections will always be in decreasing order of probability. In this sequence
of detections, there are often many redundant detections bunched very close to each
other around the same part of the pose space.
In order to make the results easier to interpret, only the best detection in a local
region of search space is retained, and the rest are discarded. We refer to this process
as non-maximum suppression.
Non-maximum suppression takes a negligible amount of computation time.
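A simple version of this suppression step (our own sketch, with a pose-distance function and radius supplied by the caller) is:

    def non_maximum_suppression(detections, pose_distance, radius):
        """detections: (probability, pose) pairs, already sorted in decreasing
        order of probability, as produced by the search. A detection is kept
        only if no already-kept, higher-probability detection lies within
        `radius` of it according to pose_distance."""
        kept = []
        for prob, pose in detections:
            if all(pose_distance(pose, kept_pose) > radius for _, kept_pose in kept):
                kept.append((prob, pose))
        return kept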
1.2 Outline of Thesis
In chapter 2 we discuss related work in the field of object recognition and situate
this thesis in the larger context. In chapter 3, we give the formal representation of
the view-based model we developed. Chapter 4 describes how we learn a view-based
model from a 3D mesh model and from images. Chapter 5 describes our algorithm for
detecting objects represented by a view-based model in an RGB-D image. Chapter 6
describes our experiments and gives experimental results. Finally, chapter 7 discusses
the system and gives conclusions and directions for future work.
Chapter 2
Related Work
In this chapter, we give a brief overview of some of the work in the field of object
recognition that relates to this thesis. For an in-depth look at the current state of
the entire field of object recognition, we refer the reader to three surveys:
" Hoiem and Savarese [18], primarily address 3D recognition and scene understanding.
" Grauman and Leibe [16] compare and contrast some of the most popular object recognition learning and detection algorithms, with an emphasis on 2D
algorithms.
" Andreopoulos and Tsotsos [2], examines the history of object recognition along
with its current real-world applications, with a particular focus on active vision.
2.1 Low-Level Features
Researchers have explored many different low-level features over the years. Section
4.3 of Hoiem and Savarese [18] and chapter 3 of Grauman and Leibe [16] give a good
overview of the large number of features that are popular in the field such as SIFT [25]
or HOG [7] descriptors. In addition to these, much recent attention in the field has
been given to features learned automatically in deep neural networks. Krizhevsky et
al. [21] present a recent breakthrough work in this area which is often referred to as
deep learning. In this thesis, we use edges and depth features.
Edges are useful because they are invariant to many changes in lighting conditions.
One of the most popular edge detectors, by Canny [3], is fast: it can detect edges in an
image in milliseconds. More recent edge detectors like that of Maire et al. [26] achieve
a higher accuracy by using a more principled approach and evaluating accuracy on
human-labeled edge datasets-however these detectors tend to take minutes to run
on a single CPU'. With the advent of RGB-D cameras and edge datasets, the most
recent edge detectors such as the one by Ren and Bo [28] have taken advantage of
the additional information provided by the depth channel. We use the Canny [3] and
Ren and Bo [28] edge detectors in our experiments.
Although the computer vision research community has traditionally focused on
analyzing 2D images, research (including our work in this thesis) has begun to shift
towards making use of the depth channel in RGB-D images. In this thesis, we also
use simple depth features: at almost every pixel in an RGB-D image, there is a depth
measurement, in meters, to the nearest surface intersected by the ray passing from
the focal point of the camera through that pixel (however, at some pixels, the RGB-D
camera fails, giving undefined depth measurements).
2.2 Generic Categories vs. Specific Objects
We humans can easily recognize a chair when we see one, even though they come in
such a wide variety of shapes and appearances. Researchers have primarily focused
on trying to develop algorithms that are able to mimic this kind of flexibility in
object recognition-they have developed systems to recognize generic categories of
objects like airplanes, bicycles or cars for popular contests like the PASCAL [8] or
ImageNet recognition challenges [29]. In addition to the variability within the class,
researchers have also had to cope with the variability caused by changes in viewpoint
and lighting. The most successful of these algorithms, such as Felzenszwalb et al. [10],
¹But GPUs seem to be a promising way to speed these detectors up.
Viola and Jones [36] and Krizhevsky et al. [21] are impressive in their ability to locate
and identify instances of generic object classes in cluttered scenes with occlusion and
without any contextual priming. These systems usually aim to draw a bounding box
around the object, rather than finding an exact estimate of the position and rotation
of the object. They also usually require a large number of images with hand-labeled
annotations as training examples to learn the distribution of appearances within the
class. Image databases such as ImageNet [29], LabelMe [30] and SUN [37] have been
used to train generic detection systems for thousands of objects.
In this work, we, along with some other researchers in the field, such as Lowe [25],
and Nister and Stewenius [27] have chosen to work on a different problem-recognizing
an object instance without class variability, but requiring a more accurate pose estimate. Setting up the problem in this way rules out all of the variability from a
generic class of objects. For example, instead of looking for any bottle of laundry
detergent, these algorithms might specifically look for a 51 fl. oz. bottle of Downy
laundry detergent manufactured in 2014. Although within-class variability is eliminated by simplifying the problem, there is still variability in shape and appearance
from changing viewpoints and lighting. Chapter 3 of Grauman and Leibe [16] discusses a number of local feature-based approaches that have been very successful in
detecting and localizing specific object instances with occlusion in cluttered scenes
using only a single image of the object as a training example. But these approaches
usually require the objects to be highly textured, and their accuracy tends to decrease
with large changes in viewpoint.
2.3 2D vs. 3D vs. 2½D view-based models
Hoiem and Savarese [18] divide view-based object models into three groups: 2D, 3D
and 2½D.
2.3.1 2D view-based models
Researchers have used a variety of different 2D object representations. If the object
class is like a face or a pedestrian that is usually found in a single canonical viewpoint,
then it can be well represented by a single 2D view-based model. Two of the most
popular techniques for detecting a single view of an object are rigid window-based
templates and flexible part-based models.
Window-based models
One search strategy, commonly referred to as the sliding window approach, compares
a rigid template (a "window") to every possible position (and scale) in the scene to
detect and localize the object. Viola and Jones [36] demonstrated a very fast and
accurate face detector, and Dalal and Triggs [7] made a very accurate pedestrian
detector using this technique. More recently, Sermanet et al. [31] have successfully
used the deep learning approach in a sliding window strategy, and Farfade et al. [9]
have shown that this kind of strategy can even be robust to substantial variations in
poses. However, window-based methods have primarily been used for object detection
and have not yet been demonstrated to localize precise poses.
Part-based models
In order to detect and localize objects in a broader range of viewpoints, the view-based model may need to be more flexible. A common way of adding flexibility is to
modularize the single window template by breaking it into parts. Each part functions
as a small template that can be compared to the image.
Several different representations have been used to add flexibility in the geometric
layout of these parts relative to each other.
Lowe [25] used an algorithm called
RANSAC (invented by Fischler and Bolles [14]) to greedily and randomly match
points to a known example (see section 2.4.1).
Another technique is to represent the layout of the parts as if they were connected
by springs. The less the spring needs to be stretched to fit the image, the better the
Figure 2-1: Fergus et al. [13] learned representations of object classes (for example,
spotted cats) using a fully-connected model.
match. The stretch of the springs and the quality of matching the individual part
templates are combined together to score the overall object detection. In this way,
each part can be thought of as casting a weighted "vote" for the position of the object.
The space of possible "votes" from the parts is sometimes referred to as a Hough
transform space. Fergus et al. [13] worked with part-based models in which the parts
are fully connected to each other by springs (see figure 2-1), but detection using these
models can be computationally expensive. Crandall et al. [5] introduced a family of
models called k-fans in which the number of springs connected to each part can range
from the fully-connected model where k = n with n parts in the model (as in Fergus
et al.), down to the star model where k = 1. They showed that k = 1-fans, in which
each part is only connected to a single central part (see figure 2-2), can be comparably
accurate to k > 1-fans, and detection can be much more computationally efficient by
using distance transforms to represent the votes from each part. Felzenszwalb et
al. [10] used multiple templates designed by Dalal and Triggs [7] as parts of a 1-fan
model to create one of the most successful 2D object detectors. To represent objects
from a wider range of viewpoints, Felzenszwalb et al. [10] (and many others) have
combined multiple 2D view-based models into a single mixture model, in which each
model represents a different viewpoint. In essence, this strategy treats different views
of an object as different objects, each to be detected separately.
Figure 2-2: Crandall et al. [5] used a 1-fan model in which most part locations are
independent of each other, yielding faster detection.
The view-based models proposed in this thesis can be seen as an extension of
1-fans to the full 6 degree-of-freedom space of rigid transformations (translations and
rotations). The parts of our view-based models "vote" in Hough Transform space,
and the "votes" are represented by distance transforms.
2.3.2 3D view-based models
The other end of the spectrum of view-based object models is to represent the distribution of object appearances entirely in three dimensions. Chiu et al. [4] represent
an object by a collection of nearly planar facades centered at fixed 3D positions in
space and detect the object using distance transforms. Lim et al. [24] use 3D mesh
models of furniture from Ikea to detect and localize the pose of objects in 2D images
using 2D keypoints and RANSAC search. Glover and Popovic [15] represent an object by a collection of oriented features in 3D space, and use a randomized method to
match model points to points in the "point cloud" from an RGB-D camera. Aldoma
et al. [1] use a 3D mesh model to detect the object in a depth image from the Kinect
camera. They introduce a new 3D point descriptor that is used to match the mesh to
the point cloud.
The view-based models in this thesis do not contain a full 3D representation of
the object, so would not directly fit into this category of models. However, the work
of Aldoma et al. [1] can be viewed, from an end-user's perspective, as similar to ours,
because the training input is a 3D CAD model, and the detection algorithm operates
on depth images from the Kinect. However, in addition to the depth images, we also
use the picture (RGB) channel of the RGB-D image.
2.3.3 2½D view-based models
There has also been work on models that are not entirely 2D, but not entirely 3D
either. Hoiem and Savarese [18] use the name 2½D to refer to models that have some
dependency between viewpoints (i.e. they are not simply a mixture of separate 2D
models), yet they do not have a full explicit 3D model of the object.
Thomas et al. [33] demonstrate a system that tracks the affine transformations
of regions across a sequence of images of a particular object instance from different
viewpoints. Each discrete view-based model is represented separately, but it is linked
to the other view-based models in the view sphere by sharing parts. Detected features
"vote" via a Hough transform for where object is likely to be for each view. Votes
from other view-based models are also combined to find the final detection.
Su et al. [32] use videos from a camera moving around a single object instance,
along with a set of unsorted and unlabeled training images of other instances in
the category to learn a dense model of visual appearance. The viewpoint models are
morphed linearly between key views on the view sphere, so they can be used to detect
objects from previously unseen views and accurately localize their poses.
The view-based models we present in this thesis bear resemblance to Su et al. [32]
because we use piecewise linear models to represent the transformation of parts in
the model, much like their linear morphing between key views. For this reason, our
view-based models can also be used to accurately localize object poses.
2.4 Search: Randomized vs. Cascades vs. Branch-and-Bound
Object detection, which is the main computational task of an object recognition
system, involves searching for the object over a large space of potential hypothesis
locations. The "brute force" approach of fully evaluating every possible hypothesis
is only feasible for low-dimensional object pose spaces. There are so many hypotheses in high-dimensional spaces that they are prohibitively computationally expensive
to evaluate exhaustively.
We look briefly at three search strategies: randomized,
cascades and branch-and-bound.
2.4.1 Randomized
Many object recognition systems have made effective use of the RANdom SAmple
Consensus (RANSAC) algorithm to efficiently search the space of object positions
in an image; Lowe [25] and Lim et al. [24] use this algorithm, as mentioned above.
RANSAC is a robust iterative method that estimates parameters by random sampling
and greedy matching. RANSAC runs for a pre-specified number of iterations before
terminating. Although RANSAC is often very efficient, there is no guarantee that
it will have found the best solution when the iterations are completed.
Moreover, RANSAC is designed only to estimate the best detection, so it cannot be
directly applied to images with multiple instances of the same object.
2.4.2 Cascades
Viola and Jones [36] introduced another efficient method to search an image that they
call a cascade of classifiers. A cascade of classifiers is a sequence of increasingly
complex classifiers that "fails fast." The early classifiers in the cascade are very fast
to evaluate and are chosen to have nearly 100% true positive detection rate, with
some false positives. In this way, if an early classifier says a hypothesis is not the
object, then one can be reasonably certain that it is not the object without running
any further classifiers in the cascade. Viola and Jones used a cascade of classifiers
to evaluate every position and scale in an image. This kind of "brute force" search
would normally be computationally expensive, but since the cascade "fails fast", it
can run very efficiently. Their face detector was the first to run on a full-sized image
in less than one second using only a single CPU. Cascades of classifiers have since
been applied to many other detection systems, including an extension [12] of the work
by Felzenszwalb et al. (mentioned above [10]).
Another technique that Viola and Jones used to achieve efficient face detection
is summed area tables (also known as integral images). Summed area tables allow
the summation of values in any rectangular sub-window of an image with a small
constant number of machine instructions. This property allowed their detector to
detect faces quickly at any size or scale without influencing the running time. The
detector in this thesis also makes use of summed area tables for the same reason.
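To make the constant-time window sums concrete, here is a small illustrative sketch of a summed area table; numpy and the helper names are assumptions, not the thesis code.

```python
# A summed area table (integral image): one O(N) pass, then the sum over any
# rectangular window costs a constant number of operations.
import numpy as np

def summed_area_table(image):
    """S[i, j] = sum of image[:i, :j], with a zero first row and column."""
    S = np.zeros((image.shape[0] + 1, image.shape[1] + 1))
    S[1:, 1:] = image.cumsum(axis=0).cumsum(axis=1)
    return S

def window_sum(S, top, left, bottom, right):
    """Sum of image[top:bottom, left:right] from four table lookups."""
    return S[bottom, right] - S[top, right] - S[bottom, left] + S[top, left]

image = np.arange(12).reshape(3, 4)
S = summed_area_table(image)
assert window_sum(S, 1, 1, 3, 3) == image[1:3, 1:3].sum()
```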
2.4.3 Branch-and-Bound
Branch-and-bound search is another method that is sometimes used to efficiently
search a space of potential object positions. Branch-and-bound search uses a bounding function that gives an over-estimate of the probability that an object is located
in a region of hypothesis space. Branch-and-bound search is guaranteed to find the
best detections in an image without individually evaluating every hypothesis, saving
time without sacrificing accuracy.
Lampert et al. [22] demonstrate several applications of branch-and-bound to searching the 4-dimensional space of rectangular bounding boxes in an image.
Lehmann et al. [23] formulate a branch-and-bound search for objects in the 3D
pose space (2D position in the image plane, plus 1D scale).
Like our algorithm, Lehmann et al. also use Hough transforms in which each part "votes" for where the
object is likely to be located in space. Moreover, they use summed area tables to
compute their bounding function efficiently in a way that also bears a very close
resemblance to the system presented in this thesis.
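The following is a generic best-first branch-and-bound sketch, included only to illustrate the search pattern; the bound, split and is_leaf functions are assumed, user-supplied pieces, and the thesis's own bounding function over pose space is described in chapter 5.

```python
# Generic best-first branch-and-bound.  Assumes `bound(region)` over-estimates
# the best score inside a region, and that for a leaf the bound equals the
# exact score, so the first leaf popped is the global optimum.
import heapq
import itertools

def branch_and_bound(root_region, bound, split, is_leaf):
    counter = itertools.count()   # tie-breaker so regions are never compared
    queue = [(-bound(root_region), next(counter), root_region)]
    while queue:
        neg_bound, _, region = heapq.heappop(queue)
        if is_leaf(region):
            # No unexplored region's over-estimate can beat this exact score.
            return region
        for child in split(region):
            heapq.heappush(queue, (-bound(child), next(counter), child))
    return None
```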
2.5 Contextual Information
A number of researchers have pointed out the importance of using contextual information to help improve object recognition. Notably, Torralba introduced a probabilistic
representation of context based on 2D image scene statistics called GIST [34]. Although we do not incorporate 2D contextual information in this thesis, our system
could be extended to incorporate this useful information.
Hoiem et al. [17] demonstrate a framework for estimating the 3D perspective of
the scene and its ground plane (using vanishing points), and they show that this
information can be used to significantly improve the accuracy of an object detector.
We also make use of this information to improve both accuracy and running time.
However, rather than estimating the table plane, we calculate it directly from the pitch
angle of the camera and the height of the camera above the table. This information
was read from the PR2 robot we used in our experiments. As with Hoiem et al., we
assume that the camera is always level, i.e., camera roll angle is always 0 degrees.
2.6 Part Sharing
Researchers have often highlighted the importance of re-using a shared library of
features to detect many different objects and views. Notably, Torralba et al. [35]
demonstrated a multi-class learning procedure that scales efficiently with the number
of object classes by sharing visual words among classes. As the number of classes
increases, more sharing occurs, and the total number of unique visual words used by
all classes remains small.
We noticed that their algorithm usually chooses simple visual words for sharing
(see figure 2-3). In particular, the most commonly shared parts look like small line
segments. This served as motivation for the set of visual features we chose to use in
this work.
U"-R.
U--"
screen
poster
car rontal
chair
keyboard
bottle
car side
mouse
mouse pad
can
trash can
head
person
mug
speaker
traffic light
one way Sign
do not enter
stop Sign
light
CPU
Figure 2-3: Torralba et al. [35] introduced a learning algorithm to share parts among
a large number of object classes. A white entry in the matrix indicates that the part
(column) is used by the detector for that class (row). Parts are sorted from left to
right in order of how often they are shared.
Chapter 3
Representation
In this thesis, we aim to find the most likely pose(s) of an object in an image. We
represent an object as a probability distribution, rather than rigidly fixed values, in
order to account for uncertainty in sensing and variability caused by explicit approximations made by our model.
Section 3.1 discusses our particular choices in representing the space of object
poses, section 3.2 discusses how we represent images and section 3.3 discusses our
representation for view-based models. Then section 3.4 discusses some of the particular approximations we use, section 3.5 discusses the sources of variability and 3.6
discusses and defends our particular choice of probability distributions.
3.1 Object Poses
We choose to represent the position and rotation of objects so as to facilitate several
natural approximations to the appearances of 3D objects. The most popular representations of an object pose are (1) a 4 x 4 homogeneous transform matrix and (2) a
3D position vector and a quaternion for rotation. We choose a different representation
that can be easily converted to either of these popular representations.
We represent the position of an object as a triple (x, y, z), in which (x, y) is the
pixel position of the perspective-projected center of the object on the image. In this
way, translations in the (x, y) plane are simply approximated as shifting the pixels
of the appearance of the object in the image plane. We represent the depth of the
object as the distance z from the focal point of the camera to the center of the object,
rather than the Euclidean z-axis which is always measured perpendicular to the image
plane. This choice is well suited to the weak-perspective approximation, as z can be used
directly to compute the scale of the object. Together, the position triple (x, y, z)
forms a modified spherical coordinate system. The origin of the coordinate system is
the focal point of the camera, the radius is z and inclination and azimuth angles are
replaced by pixel positions (x, y) on the image plane.
There are two important factors that lead to our choice of a representation for
object rotations. First, we would like to respect the approximations discussed above.
In other words, if an object is translated in the image plane, its rotation should not
have to change in order to keep approximately the same (shifted) visual appearance.
We accomplish this by measuring an object's rotation with respect to the surface
of a focal point-centered sphere (see figure 3-3). Second, we would like to represent
rotations with the smallest number of variables possible. The most natural option
for representing rotations, quaternions, requires four variables (constrained to a 3D
manifold). If computational speed were not a factor, we might choose quaternions.
However, we will be searching a large space of object poses using branch-and-bound,
and the speed of the overall search is directly related to the number of branches that
must be explored. Branching in a higher dimensional space increases the number of
branches multiplicatively. Moreover, explicitly branching on a 3D manifold embedded
in a 4D space would add some additional computational burden.
We therefore use Euler angles to represent rotations since they use the smallest
number of variables. As discussed above, we define the Euler angles (rx, ry, rz) for
an object with respect to the tangent of a sphere centered on the focal point of the
camera and the natural "up" direction of the camera.
One of the biggest problems with Euler angles is the loss of a degree of freedom at
certain singular points known as gimbal lock. Because we already must
constrain our search to upright object rotations on horizontal surfaces in order to
detect objects in a reasonable amount of time (see section 5.3.4), we naturally avoid
these singular points in the space of Euler angles. If we were able to speed up our
detection algorithm to run unconstrained searches at a more practical speed then we
might be able to afford the computational luxury of switching back to the quaternion
representation for rotations.
Thus we define a pose to be a six-tuple (x, y, z, rx, ry, rz), where (x, y) is the
projected pixel location of the center of the object in the image plane, z is the distance
of the center of the object from the focal point of the camera and (rx, ry, rz) are Euler
angles representing the rotation of the object: rx is the rotation of the object about a
line that passes through the focal point of the camera through the center of the object
(this can roughly be thought of as rotation in the image plane), ry is the rotation
of the object about a horizontal line that is perpendicular to the first line, and rz is
the rotation of the object about a vertical line perpendicular to the other two. The
rotations are applied in order: rx followed by ry, followed by rz.
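As an illustration only, the pose tuple can be mapped to a 3D position and rotation matrix in the camera frame. The sketch below assumes pinhole intrinsics (fx, fy, cx, cy) and a particular sign and composition convention that the text above does not fix.

```python
# Hedged sketch: convert (x, y, z, rx, ry, rz) into a position and rotation
# in the camera frame.  The intrinsics and the "up" direction are assumptions.
import numpy as np

def rot(axis, angle):
    """Rotation matrix about a unit axis by `angle` radians (Rodrigues)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def pose_to_position_rotation(x, y, z, rx, ry, rz, fx, fy, cx, cy):
    # Ray through pixel (x, y); z is the distance along the ray, not the
    # Euclidean z-coordinate.
    ray = np.array([(x - cx) / fx, (y - cy) / fy, 1.0])
    ray = ray / np.linalg.norm(ray)
    position = z * ray                    # object center in the camera frame

    # Local frame tangent to the camera-centered sphere at the object center.
    up = np.array([0.0, -1.0, 0.0])       # assumed camera "up" (image y points down)
    right = np.cross(up, ray)
    right = right / np.linalg.norm(right) # degenerate if ray is parallel to up
    up_t = np.cross(ray, right)

    # Apply rx about the viewing ray, then ry, then rz (composition order is
    # an assumption about the convention in the text).
    R = rot(up_t, rz) @ rot(right, ry) @ rot(ray, rx)
    return position, R
```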
3.2 Images
We represent an RGB-D image as a collection of visual and depth features.
All
features have an integer pixel location in the image. There is a pre-defined library
of kinds of visual features (we use edge detections within a range of angles, but this
could be extended to include a class of local textures). The detectors for each kind
of visual feature are binary-valued. In other words, a visual feature of a certain kind
is either present or absent at a particular pixel. This choice to restrict visual features
to 2 values is for the sake of reducing memory overhead, as we will see in section 5.1.
Depth features, on the other hand, are real-valued. There is a real-valued depth
measurement associated with each pixel in the image (except where the depth detector
fails-for such a pixel, the depth measurement is undefined). Depth measurements
are measured along a line from the focal point of the camera, through the pixel in
the image plane, to the point where it intersects an object in the scene. This means
that the depth measurements are not measured as z-coordinates in the standard
Euclidean coordinate system. Instead, the coordinate system can be viewed as a
spherical coordinate system centered at the focal point of the camera.
3.3 View-Based Models
We define a viewpoint to be a triple (rx, ry, rz) containing only the rotational information from a pose, and we define a viewpoint bin to be an axis-aligned bounding
box in viewpoint space. Formally, a viewpoint bin is a pair of viewpoints (v1, v2),
where v1 = (rx1, ry1, rz1) and v2 = (rx2, ry2, rz2) such that rx2 ≥ rx1, ry2 ≥ ry1 and
rz2 ≥ rz1. A viewpoint (rx, ry, rz) is in the viewpoint bin (v1, v2) iff rx1 ≤ rx ≤ rx2,
ry1 ≤ ry ≤ ry2 and rz1 ≤ rz ≤ rz2.
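A viewpoint bin membership test is then a per-angle interval check; a tiny illustrative sketch:

```python
# A viewpoint (rx, ry, rz) lies in the bin (v1, v2) iff each angle is within
# the corresponding interval.
def in_viewpoint_bin(viewpoint, v1, v2):
    return all(lo <= angle <= hi for angle, lo, hi in zip(viewpoint, v1, v2))

assert in_viewpoint_bin((10, -45, 90), (0, -60, 80), (20, -30, 100))
```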
We define a viewpoint bin model to be a pair (V, D) of vectors of random variables
representing visual parts and depth parts respectively. A depth feature d_j is drawn
from the distribution of D_j. In other words, a depth part D_j can be matched to a
particular depth feature d_j in the depth channel of the image with some probability.
Similarly, a visual feature v_k is drawn from the distribution of V_k. In other words,
a visual part V_k can be matched to a particular visual feature v_k in the image with
some probability.
We define a view-based model as a set of pairs. Each pair (M, B) in the view-based
model represents the viewpoint bin model M of the appearance of the object for views
in the corresponding viewpoint bin B.
The goal of detection for each viewpoint bin model (V, D) is to find the highest
probability pose p in the image by finding the best matching of parts to features:

$$\operatorname*{argmax}_p \max_{v,d} \Pr(V = v, D = d, P = p) = \operatorname*{argmax}_p \max_{v,d} \Pr(V = v, D = d \mid P = p)\,\Pr(P = p) \qquad (3.1)$$
We assume that all of the distributions for parts are conditionally independent of
each other and that the distribution over poses Pr(P) is uniform within the region of
pose space we are searching. These assumptions give:

$$\operatorname*{argmax}_p \max_{v,d} \Pr(V = v, D = d \mid P = p) = \operatorname*{argmax}_p \max_{v,d} \prod_j \Pr(D_j = d_j \mid P = p) \prod_k \Pr(V_k = v_k \mid P = p) \qquad (3.2)$$

Figure 3-1: An illustration of features in a 1D slice of an RGB-D image. This simplified example uses three kinds of visual features (A, B and C). The depth features are a set of depth measurements at every pixel in the image.
Each visual part can only be matched to its own kind of visual feature. We define
the projected expected location m_V of a visual part in the image plane as a function
of the object pose p = (x, y, z, rx, ry, rz):

$$m_V = \begin{bmatrix} x \\ y \end{bmatrix} + \frac{1}{z} \begin{bmatrix} u \\ w \end{bmatrix}, \qquad (3.3)$$

where

$$\begin{bmatrix} u \\ w \end{bmatrix} = \begin{bmatrix} a & b & c & d \\ e & f & g & h \end{bmatrix} \begin{bmatrix} 1 \\ r_x \\ r_y \\ r_z \end{bmatrix} \qquad (3.4)$$
and where a, b, c, d, e, f, g and h are constants. We choose Pr(V | P) to be a distribution
that represents our uncertainty about the location of the visual feature t that matches
a visual part V in the image plane:

$$\Pr(V = t \mid P = p) \propto e^{-\frac{\min\left(\lVert t - m_V \rVert^2,\; r^2\right)}{2v}} \qquad (3.5)$$

where t is the 2D pixel location of a certain feature in the image plane, v is the
scalar variance of the uncertainty of this location around the mean m_V and r is the
receptive field radius, outside of which visual features are ignored. A visual part does
not have any representation of depth. We discuss more of the details of this choice of
distributions in section 3.6.
For simplicity, we assume there is no uncertainty over the location of a depth
feature in the image plane. We define the projected location f of a depth part in the
image plane for the pose p = (x, y, z, rx, ry, rz) and constants px, py as:

$$f = \begin{bmatrix} x \\ y \end{bmatrix} + \frac{1}{z} \begin{bmatrix} p_x \\ p_y \end{bmatrix}. \qquad (3.6)$$

Instead of uncertainty in the image plane, we represent uncertainty of the depth of a
depth part along a line from the focal point of the camera through f. We define the
expected depth of a depth part m_d as a linear function of the pose:

$$m_d = z + \begin{bmatrix} i & j & k & l \end{bmatrix} \begin{bmatrix} 1 \\ r_x \\ r_y \\ r_z \end{bmatrix}, \qquad (3.7)$$

for constants i, j, k, l. And we choose the distribution Pr(D | P) that represents our
uncertainty about the depth of the depth feature that matches a depth part D along
this line:

$$\Pr(D = t \mid P = p) \propto e^{-\frac{\min\left((t - m_d)^2,\; r^2\right)}{2v}} \qquad (3.8)$$

where t is the depth measurement from the depth channel at pixel location f, v is the
variance of the uncertainty of this depth about the mean m_d and r is the receptive
field radius, outside of which depth features are ignored. We discuss more of the
details of this choice of distributions in section 3.6.
We note that the probability distributions in equations 3.8 and 3.5 do not integrate
to 1 on [−∞, ∞] because their value outside of the receptive field radius is non-zero
(see figure 3-5). These can be made into valid probability distributions if we assume
that the domain is t ∈ [m_d − r, m_d + r] for equation 3.8 and ‖t − m_V‖² ≤ r² for
equation 3.5 and that there is also a single discrete "out" state in the domain of each
distribution when the feature is outside the receptive field radius.
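For concreteness, a small sketch of the resulting truncated log-likelihoods (an illustration of equations 3.5 and 3.8, not the thesis code):

```python
# Unnormalized log-likelihoods with a receptive field radius r: the squared
# error is capped at r**2, so missing or distant features incur a constant
# penalty instead of an unbounded one.
import numpy as np

def visual_part_log_prob(t, m_v, v, r):
    """Unnormalized log Pr(V = t | P = p) for a visual part (eq. 3.5)."""
    sq_err = float(np.sum((np.asarray(t) - np.asarray(m_v)) ** 2))
    return -min(sq_err, r ** 2) / (2.0 * v)

def depth_part_log_prob(t, m_d, v, r):
    """Unnormalized log Pr(D = t | P = p) for a depth part (eq. 3.8)."""
    return -min((t - m_d) ** 2, r ** 2) / (2.0 * v)

# A feature beyond the receptive field radius (or absent) gets the constant
# penalty -r**2 / (2 * v), matching the discrete "out" state described above.
```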
3.4 Approximations
3.4.1 (x, y) Translation Is Shifting In The Image Plane
Built into these equations is the assumption that motion tangent to a sphere located
at the focal point of the camera is equivalent to translation on the image plane. In
other words, if the object is moved to a new point on the sphere, and is turned such
that it has the same rotation with respect to the plane tangent to the surface of
the sphere at the new point, it should register the same depth values, at a translated
position in the image plane. While this is a good approximation for most cameras, which
have narrow fields of view, in figure 3-3, one can see that this assumption does not
hold perfectly, especially near the boundaries of the image.
3.4.2 Weak Perspective Projection
Standard perspective projection calculates the projected position of each point in
space from its Euclidean coordinates (px, py, pz). In other words, standard perspective projection would replace the denominators in both equations 3.3 and 3.6 with
pz so that the locations of each part is scaled independently according to its own
depth. However we use weak perspective projection, as an approximation to standard
perspective projection. Weak perspective approximation scales the whole object as if
all points had the same depth, rather than scaling each point separately. The denominators in these equations contain the distance z from the focal point of the camera
to the center of the object, rather than pz, the Euclidean z-coordinate of the point
to be projected. This approximation is used because the depth of visual features
(especially edges at the contour of the object) is not always available. This is a
reasonable approximation when the size of the object is small compared to its distance
from the camera, as is true for small objects to be manipulated by a robot.
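A brief illustrative comparison of the two projections (the function names and the focal length parameter are assumptions, not the thesis code):

```python
# Full perspective scales each 3D point by its own depth; weak perspective
# scales every point of the object by the single object-center distance z.
import numpy as np

def perspective_project(points, f):
    """points: N x 3 array of (px, py, pz); returns N x 2 image coordinates."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / points[:, 2:3]

def weak_perspective_project(points, f, z):
    """Same interface, but every point is divided by the object distance z."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / z
```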
3.4.3 Small Angle Approximation
Visual features like edges seen by the camera are not always produced by the same
point on the object frame as the object rotates-especially for objects with rounded
surfaces. Moreover, some features may disappear due to self-occlusion as the object
rotates. Rather than trying to model all of these complex factors explicitly, we choose
simple linear models (equations 3.4 and 3.7). The constant coefficients are chosen
during the learning phase (chapter 4) by selecting features whose motion appears
nearly linear over the domain of the viewpoint bin.
It is also true that rotating the object will generally change the edge angles. We
do not explicitly address this issue, but leave it to the learning procedure (described
in chapter 4) to select edge features that do not rotate beyond their angle range for
object rotations within the viewpoint bin. This means that the design choice of the
best viewpoint bin size is related to the choice of the edge angle discretization.
3.5 Sources Of Variability
There are several sources of variability in part locations:
• the assumption that motion tangent to a camera-centered sphere is equivalent to translation on the image plane (see figure 3-3),
• the weak perspective approximation to perspective projection for part locations,
• the small angle approximation to the true part locations under object rotation,
• errors in detecting visual features such as missed or spurious detections (figure 3-4) and
• errors in depth sensing, such as missing or wrong depth values (see Khoshelham et al. [20] for details on the accuracy of the Kinect).
For these reasons, we use probability distributions, rather than rigidly fixed values, to model the locations of parts.

Figure 3-2: Warping spherical coordinates into rectangular coordinates. Although a natural representation for a depth image is a Euclidean space, we represent this space in spherical coordinates and visualize the spherical coordinate system (top) as if it was warped into a rectangular coordinate system (bottom). You will also notice that the hidden and occluded surfaces of the objects are indicated by thin lines in the world space (top) and removed entirely in the warped spherical coordinates (bottom).
3.6 Choice of Distributions
We would like to use the simplest distribution that can represent these uncertainties
well. Normal distributions, $\propto e^{-\frac{t^2}{2v}}$, are simple, but they give a high penalty for
errors made by the visual feature detector or by the depth sensor. Therefore, we
modify the normal distribution by removing the additional penalty for parts that are
farther than r units from the expected location: $\propto e^{-\frac{\min(t^2,\, r^2)}{2v}}$ (see figure 3-5). We call
r the receptive field radius because the distribution is only receptive to features found
within that radius. If the feature is not found within that radius, it is assigned the
same penalty as if it was found exactly at that radius.
Crandall et al. [5] also used normal distributions with a receptive field radius.¹
But we further simplify this distribution in two ways: (1) we constrain the covariance
matrix to be circular rather than generally elliptical (see figure 3-6), and (2) we
assume that each part is binary-valued (it is either present or absent at every pixel),
rather than real-valued.²
¹Although Crandall et al. do not explicitly describe their use of a receptive field radius in their paper [5], the code they used for the experiments in their paper uses this technique.
²See section 5.1 for a discussion on why we use these simplifications.
Figure 3-3: (Left top) Four object poses, all at the same rotation (rx, ry, rz). (Left bottom) These same object poses when spherical coordinates are warped onto rectangular axes. (Right top) Four object poses at another rotation (rx′, ry′, rz′). (Right bottom) These same object poses when spherical coordinates are warped onto rectangular axes. This figure shows the difference between what we consider rotation in the spherical coordinate system and rotation in the rectangular coordinate system. Changing the position of the object in the spherical coordinate system is actually moving the object tangent to a sphere centered at the camera's focal point. When the spherical coordinate system is warped to be rectangular (bottom), we can see why weak perspective projection is only an approximate representation of the actual transformation: objects are distorted.
Figure 3-4: Parts of these monitors have intensity values similar to the background, so some of their boundaries are missed by the Canny edge detector [3].
Figure 3-5: (Top left) The normal distribution $\propto e^{-\frac{t^2}{2v}}$. (Top right) The normal distribution with a receptive field radius $\propto e^{-\frac{\min(t^2,\, r^2)}{2v}}$. (Bottom) The log of these plots. Observe that the logarithm of a normal distribution (top) is a parabola (bottom).
Figure 3-6: (Left) A contour plot of a two-dimensional normal distribution $\propto e^{-(t - m)^T C^{-1} (t - m)}$ with a general elliptical covariance matrix. A 2D covariance matrix is any matrix $C = \begin{bmatrix} a & b \\ b & c \end{bmatrix}$ which satisfies a > 0 and ac − b² > 0. (Right) A normal distribution with a circular covariance matrix $C = vI = \begin{bmatrix} v & 0 \\ 0 & v \end{bmatrix}$.
Chapter 4
Learning
Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera
with the ability to measure camera height and pitch angle
Output: A view-based model of the object, composed of a set of viewpoint bins,
each with visual and depth parts and their respective parameters
In this chapter, we break the process of creating a new view-based model into
two parts. Section 4.1 describes the fully-automated subsystem that learns a new
view-based model from synthetic images of a 3D mesh, while 4.2 describes the manual
process involved in collecting and labeling data, acquiring a mesh, and setting parameters. The view-based model learning subsystem of section 4.1 is a "sub-procedure"
used by the human in the manual process required to train, test and tune a new
view-based model in section 4.2.
4.1 View-Based Model Learning Subsystem
Input: A 3D mesh and parameter values such as the set of viewpoint bins
Output: A view-based model of the object, which is composed of a set of viewpoint
bins, each with visual and depth parts along with the coefficients of the linear
model for their positions and the uncertainty about those positions
The view-based model learning subsystem (see figure 1-2) is a fully automated
process that takes a mesh and some parameter values and produces a view-based
model of the object. The position of each object part is a piecewise function of the
three rotation angles about each axis, where the domain of each piece of this piecewise
function is a viewpoint bin. In other words, objects are modeled using a number of
different viewpoint bins, each with its own model of the object's shape and appearance
for poses within that bin. The first three of the following learning phases are repeated for each
viewpoint bin, and the fourth combines the results:
1. rendering the image (section 4.1.1),
2. enumerating the set of features that could be used as model parts (section 4.1.2),
3. selecting the features that will be used in the final viewpoint bin model (section
4.1.3) and
4. combining viewpoint bin models into the final view-based model (section 4.1.4).
The view-based model learning subsystem is designed to be entirely automated,
and to require few parameter settings from the user. However, there are still a number
of parameters to tune, as mentioned in section 1.1.1.2.
An unusual aspect of this learning subsystem is that the only training input to
the algorithm is a single 3D mesh. The learning is performed entirely using synthetic
images generated from rendering this mesh. This means that the learned view-based
model will be accurate for the particular object instance that the mesh represents,
and not for a general class of objects.
4.1.1 Rendering
We generate synthetic RGB-D images from a 3D mesh model M by randomly sampling poses of the object that fall within the specified viewpoint bin B and the field
of view of the camera. We also randomly change the virtual light source position to
add some extra variability in appearance. We use the OpenGL MESA GLX library
to render images, combined with the xvfb utility, which allows us to render on server
machines on the cloud using software when hardware rendering (on a GPU) is not
available. Values from the z-buffer are used to compute depth. Recall that depth is
computed as the distance $\sqrt{x^2 + y^2 + z^2}$ between the Euclidean point (x, y, z) and
the focal point of the camera located at the origin (0, 0, 0). We then crop the image
to a minimal square containing the full bounding sphere of the object.

Figure 4-1: Examples of synthetic images from two different viewpoint bins (viewpoint bin 1 and viewpoint bin 2) of a Downy bottle.
Procedure 1 Render and crop a synthetic image.
Input: 3D mesh model M, viewpoint bin B
Output: color image RGB, depth image D, randomly chosen pose p of the object within B
1: procedure SAMPLESYNTHETICIMAGE(M, B)
2:   p ← a random pose of M in B and within the simulated camera's field of view
3:   r ← the projected radius (in pixels) of the bounding sphere of M at p
4:   (x, y) ← center (in pixels) of the projected pose p
5:   (lx, ly, lz) ← a random position for a simulated ambient light
6:   (RGB, D) ← render M in pose p with ambient light at (lx, ly, lz)
7:   RGB ← crop RGB for x-range [x − r, x + r] and y-range [y − r, y + r]
8:   D ← crop D for x-range [x − r, x + r] and y-range [y − r, y + r]
9:   return (RGB, D, p)
4.1.2 Feature Enumeration
Recall that equations 3.4 and 3.7 define the locations of visual and depth parts as
a linear function of the viewpoint (rx, ry, rz). We designed a supervised learning
procedure to learn the parameters of these linear functions from synthetically rendered
images, along with the variances v in equations 3.5 and 3.8, by a standard least squares
formulation with a matrix of training examples A, a matrix of training labels B and
a matrix of linear coefficients X. We use a training set of n synthetic images of the
object along with the exact pose of the object in the image for both visual parts and
depth parts. The training examples matrix A has size n × 4, where each row is an
example of the form $[1\;\; r_x\;\; r_y\;\; r_z]$. The sizes of the other matrices B and X depend on
whether it is a visual or depth part. The linear least squares formulation is:

$$\hat{X} = \operatorname*{argmin}_X \operatorname{Tr}\left((AX - B)^T (AX - B)\right). \qquad (4.1)$$
We can assume that the columns of A are linearly independent (i.e. A is full column
rank) because the rotations rx, ry and rz are sampled at random. If A is full column
rank, we know that $A^T A$ is invertible. Then the solution for the optimal matrix of
constants X is:

$$\hat{X} = (A^T A)^{-1} A^T B \qquad (4.2)$$

and the sum of square errors is used to compute the unbiased variance v for this
solution:

$$v = \frac{1}{n}\left(\operatorname{Tr}(\hat{X}^T A^T A \hat{X}) - 2\operatorname{Tr}(\hat{X}^T A^T B) + \operatorname{Tr}(B^T B)\right). \qquad (4.3)$$
In the case of visual parts, the label matrix B has size n × 2 with a 2D pixel
location of the nearest visual part [x y] in each of its n rows, and the matrix of
constant coefficients X is the 4 × 2 matrix containing the constants a, b, c, d, e, f, g, h
from equation 3.4.
In the case of depth parts, the label matrix B has size n × 1 with a scalar depth
measurement for a particular location relative to the scaled and aligned image in each
of the n entries, and the matrix of constants X is the 4 × 1 vector $[i\;\; j\;\; k\;\; l]^T$ from
equation 3.7.
In principle, this formulation is sufficient to compute all depth and visual parts
from the n training images. In practice, however, this requires a large amount of
memory: there will be Kw² pairs of training-example matrices A and label matrices B, one
for each of the K kinds of visual parts at each pixel in the w × w scaled and aligned training images.
Similarly, there will be w² pairs of training and label matrices for all of the enumerated depth
parts. Since each of these matrices has n rows, this requires (6K + 5)nw² floating
point numbers to be kept in memory as the synthetic images are being rendered. For
an image width w = 245, K = 8 kinds of visual features, and n = 100 training images,
and if the numbers are represented with 8-byte double-precision, it would take 2.3
GiB to represent these matrices alone. Another, less memory-intensive, alternative
would be to render n images separately for each of the w 2 pixels, which means that
SAMPLESYNTHETICIMAGE would be called nw² times, rather than just n, but this
would be computationally expensive.
To address this issue, we observe that the quantities ATA, ATB and Tr BTB are
sufficient to compute X and v, and they have small constant sizes that do not depend
on the number of training images n. Moreover, we observe that they can be updated
incrementally when each new synthetic image is rendered and a new observation and
label becomes available. In particular, when the jth training example is $[1\;\; r_{xj}\;\; r_{yj}\;\; r_{zj}]$:

$$A^T A = \sum_{j=1}^{n} \begin{bmatrix} 1 & r_{xj} & r_{yj} & r_{zj} \\ r_{xj} & r_{xj}^2 & r_{xj} r_{yj} & r_{xj} r_{zj} \\ r_{yj} & r_{xj} r_{yj} & r_{yj}^2 & r_{yj} r_{zj} \\ r_{zj} & r_{xj} r_{zj} & r_{yj} r_{zj} & r_{zj}^2 \end{bmatrix} \qquad (4.4)$$
In the case of visual parts, when the jth label is $[x_j\;\; y_j]$:

$$A^T B = \sum_{j=1}^{n} \begin{bmatrix} x_j & y_j \\ x_j r_{xj} & y_j r_{xj} \\ x_j r_{yj} & y_j r_{yj} \\ x_j r_{zj} & y_j r_{zj} \end{bmatrix} \qquad (4.5)$$

$$\operatorname{Tr} B^T B = \sum_{j=1}^{n} \left( x_j^2 + y_j^2 \right). \qquad (4.6)$$
And in the case of depth parts, when the jth label is $z_j$:

$$A^T B = \sum_{j=1}^{n} \begin{bmatrix} z_j \\ z_j r_{xj} \\ z_j r_{yj} \\ z_j r_{zj} \end{bmatrix} \qquad (4.7)$$

$$\operatorname{Tr} B^T B = \sum_{j=1}^{n} z_j^2. \qquad (4.8)$$
We use this observation to define an algorithm. We define an incremental least
squares visual part to be a triple (ATAV, ATBV, TrBTBV). The notation "ATAV" is just
a name for a 4 × 4 matrix containing the current value of the expression $A^T A$ for a
visual feature given the training examples that have been sampled so far. Similarly,
ATBV is a 4 × 2 matrix with the incremental value of the expression $A^T B$ for a visual
feature, and TrBTBV is a scalar with the incremental value of the expression $\operatorname{Tr} B^T B$
for a visual feature.
An incremental least squares visual part can be updated by the viewpoint training
example (rx, ry, rz) and a visual feature location training label (x, y):
Procedure 2 Update an incremental least squares visual part by adding a new training example.
Input: least squares visual part (ATAV, ATBV, TrBTBV), viewpoint training example (rx, ry, rz), visual feature location training label (x, y)
Output: updated least squares visual part (ATAV′, ATBV′, TrBTBV′)
1: procedure UPDATEVISUALPART((ATAV, ATBV, TrBTBV), (x, y), (rx, ry, rz))
2:   ATAV′ ← ATAV + [1 rx ry rz; rx rx² rxry rxrz; ry rxry ry² ryrz; rz rxrz ryrz rz²]   ▷ from equation 4.4
3:   ATBV′ ← ATBV + [x y; xrx yrx; xry yry; xrz yrz]   ▷ from equation 4.5
4:   TrBTBV′ ← TrBTBV + x² + y²   ▷ from equation 4.6
5:   return (ATAV′, ATBV′, TrBTBV′)
Procedure 3 Finalize an incremental least squares visual part after it has been updated with all training examples.
Input: least squares visual part (ATAV, ATBV, TrBTBV), receptive field radius r, integer index of the kind of visual part k
Output: a visual part
1: procedure FINALIZEVISUALPART((ATAV, ATBV, TrBTBV), r, k)
2:   X ← (ATAV)⁻¹ ATBV   ▷ from equation 4.2
3:   n ← ATAV₁,₁   ▷ The top left element is the number of training images.
4:   v ← n⁻¹ (Tr(Xᵀ ATAV X) − 2 Tr(Xᵀ ATBV) + TrBTBV)   ▷ from equation 4.3
5:   (a, b, c, d, e, f, g, h) ← the entries of X
6:   return a visual part of kind k, with variance v, receptive field radius r, and constants a, b, c, d, e, f, g, h
We similarly define an incremental least squares depth part to be a triple (ATAD, ATBD, TrBTBD),
where ATAD is a 4 × 4 matrix, ATBD is a 4 × 1 matrix and TrBTBD is a scalar. By
default, an incremental least squares depth part is initialized to be (0, 0, 0). An incremental least squares depth part can be updated by the viewpoint training example
(rx, ry, rz) and a depth feature depth training label z:
Procedure 4 Update an incremental least squares depth part by adding a new training example.
Input: least squares depth part (ATAD, ATBD, TrBTBD), viewpoint training example (rx, ry, rz), depth feature depth training label z
Output: updated least squares depth part (ATAD′, ATBD′, TrBTBD′)
1: procedure UPDATEDEPTHPART((ATAD, ATBD, TrBTBD), z, (rx, ry, rz))
2:   ATAD′ ← ATAD + [1 rx ry rz; rx rx² rxry rxrz; ry rxry ry² ryrz; rz rxrz ryrz rz²]   ▷ from equation 4.4
3:   ATBD′ ← ATBD + [z; zrx; zry; zrz]   ▷ from equation 4.7
4:   TrBTBD′ ← TrBTBD + z²   ▷ from equation 4.8
5:   return (ATAD′, ATBD′, TrBTBD′)
Procedure 5 Finalize an incremental least squares depth part after it has been updated with all training examples.
Input: least squares depth part (ATAD, ATBD, TrBTBD), receptive field radius r, (px, py) pixel location of depth part
Output: a depth part
1: procedure FINALIZEDEPTHPART((ATAD, ATBD, TrBTBD), r, (px, py))
2:   X ← (ATAD)⁻¹ ATBD   ▷ from equation 4.2
3:   n ← ATAD₁,₁   ▷ The top left element is the number of training images.
4:   v ← n⁻¹ (Tr(Xᵀ ATAD X) − 2 Tr(Xᵀ ATBD) + TrBTBD)   ▷ from equation 4.3
5:   (i, j, k, l) ← the entries of X
6:   return a depth part at location (px, py), with variance v, receptive field radius r, and constants i, j, k, l
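The incremental bookkeeping in Procedures 2-5 can be summarized in a few lines of numpy; the class and parameter names below are illustrative assumptions, not the thesis implementation.

```python
# Accumulate A^T A, A^T B and Tr(B^T B) one training example at a time, then
# solve for the linear coefficients X and the variance v.
import numpy as np

class IncrementalLeastSquaresPart:
    def __init__(self, label_dim):
        self.ata = np.zeros((4, 4))           # running A^T A
        self.atb = np.zeros((4, label_dim))   # running A^T B
        self.trbtb = 0.0                      # running Tr(B^T B)

    def update(self, rotation, label):
        """Add one training example (eqs. 4.4-4.8)."""
        a = np.array([1.0, *rotation])        # row [1, rx, ry, rz]
        b = np.atleast_1d(np.asarray(label, dtype=float))
        self.ata += np.outer(a, a)
        self.atb += np.outer(a, b)
        self.trbtb += float(b @ b)

    def finalize(self):
        """Solve for X (eq. 4.2) and the variance v (eq. 4.3)."""
        n = self.ata[0, 0]                    # number of training examples
        X = np.linalg.solve(self.ata, self.atb)
        v = (np.trace(X.T @ self.ata @ X)
             - 2.0 * np.trace(X.T @ self.atb)
             + self.trbtb) / n
        return X, v

# Usage: label_dim=2 for a visual part (pixel location), 1 for a depth part.
```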
Figure 4-2: This figure shows two different view bins (left and right). For each viewpoint bin, the w² depth parts are arranged according to their position in the w × w scaled and aligned image, with red pixels being those in which there was no depth found in at least one of the training images, and the gray values are proportional to the variance v of the least squares fit for that pixel. There are also K = 8 kinds of visual features (an edge feature found at various angles), and the gray similarly depicts the variance v.
In order to describe the algorithm to enumerate features, we now review the concept of a distance transform. A distance transform takes a binary matrix M : m × n
and produces an integer-valued matrix D : m × n in which each element $D_{x,y}$ is the
squared Euclidean distance between $M_{x,y}$ and a nearest element in M whose value is '1':

$$D_{x,y} = \min_{x_1, y_1 \,:\, M_{x_1, y_1} = 1} (x - x_1)^2 + (y - y_1)^2 \qquad (4.9)$$

Optionally, matrices X and Y may also be produced, containing the argmin indexes
$x_1$ and $y_1$ respectively, for each element in M. Felzenszwalb and Huttenlocher [11]
describe an algorithm to compute the distance transform in time that scales linearly
with the number of pixels O(mn) in an m × n image.
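As an illustration, the distance transform of equation 4.9 can also be obtained from an off-the-shelf routine; the sketch below uses SciPy purely for demonstration (the thesis relies on the Felzenszwalb and Huttenlocher algorithm).

```python
# Squared distance to the nearest '1' entry of a binary matrix M, plus the
# argmin indices, via scipy.ndimage.
import numpy as np
from scipy.ndimage import distance_transform_edt

def squared_distance_transform(M):
    # distance_transform_edt measures distance to the nearest zero, so we
    # invert M to measure distance to the nearest one instead.
    dist, indices = distance_transform_edt(1 - M, return_indices=True)
    D = dist ** 2            # squared Euclidean distances (eq. 4.9)
    X, Y = indices           # coordinates of the nearest '1' for each pixel
    return D, X, Y

M = np.zeros((5, 5), dtype=int)
M[2, 3] = 1
D, X, Y = squared_distance_transform(M)   # D[2, 0] == 9, for example
```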
We now describe the ENUMERATEFEATURES procedure, which uses synthetic images to enumerate a set of linear models, one for each pixel and each kind of part.
The ENUMERATEFEATURES procedure is efficient since it calls SAMPLESYNTHETICIMAGE
exactly n times, and it is much more memory efficient than the naive algorithm that stores all of the A and B matrices. To calculate the memory usage, we first
observe that $A^T A$ is a symmetric matrix, so we only need 10 numbers to represent
ATAV and ATAD. ATBV has 8 elements and TrBTBV is just a scalar, so an incremental
least squares visual part takes 19 numbers in memory. Similarly, ATBD has 4 elements
and TrBTBD is just one number, for a total of 15 numbers for each incremental least
squares depth part. So (15 + 19K)w² numbers are required to represent all of the least
squares parts.
Procedure 6 Enumerate all possible features.
Input: 3D mesh model M, viewpoint bin B, number of images to render n, synthetic image width w, receptive field radius for visual parts rV, receptive field radius for depth parts rD
Output: set of visual parts V (where |V| = Kw² for the number of different kinds of visual features K), set of depth parts D (where |D| = w²)
1: procedure ENUMERATEFEATURES(M, B, n, w)
2:   VF ← a vector of K matrices, each of size w × w, containing incremental least squares visual parts initialized to (0, 0, 0)
3:   DF ← a w × w matrix of incremental least squares depth parts, each initialized to (0, 0, 0)
4:   for j = 1 to n do
5:     (RGB, D, (x, y, z, rx, ry, rz)) ← SAMPLESYNTHETICIMAGE(M, B)
6:     for k = 1 to K do
7:       V ← binary image of visual features of kind k detected in RGB, D
8:       V ← scale V to size w × w
9:       (DT, X, Y) ← distance transform of V
10:      for all pixel locations (px, py) in w × w do
11:        VF_{k,px,py} ← UPDATEVISUALPART(VF_{k,px,py}, (X_{px,py}, Y_{px,py}), (rx, ry, rz))
12:    D ← scale D to size w × w
13:    for all pixel locations (px, py) in w × w do
14:      DF_{px,py} ← UPDATEDEPTHPART(DF_{px,py}, D_{px,py}, (rx, ry, rz))
15:  V ← {}
16:  D ← {}
17:  for all pixel locations (px, py) in w × w do
18:    for k = 1 to K do
19:      V ← V ∪ { FINALIZEVISUALPART(VF_{k,px,py}, rV, k) }
20:    D ← D ∪ { FINALIZEDEPTHPART(DF_{px,py}, rD, (px, py)) }
21:  return V, D
For w = 245, K = 8 and 8-byte double-precision values, this takes
76 MiB of memory, compared to the 2.3 GiB required to represent the full matrices.
This memory savings is significant because it causes fewer cache misses. The overall
running time for this procedure is approximately 15 seconds on a single CPU without
any GPU acceleration.¹
The memory savings becomes even more significant when we consider the possibility of parallelizing ENUMERATEFEATURES on a GPU architecture. The smaller
memory footprint fits easily into the video memory on a GPU, even if the GPU is
being shared by several cores running ENUMERATEFEATURES in parallel. Lines 10,
13 and 17 contain for loops that could be parallelized in GPU kernels.
We also note that, to save a considerable amount of time in the learning phase,
we use the edge detector developed by Canny [3], rather than the detector of Ren
and Bo [28], on line 7 of ENUMERATEFEATURES. We have not seen a significant
degradation in detection accuracy from this choice to use different edges for learning
and detection.
4.1.3 Feature Selection
Given the enumeration of all possible parts, the next step is to choose a subset of
them to use in the final viewpoint bin model.
First, we decide how many parts we want in our model. We do this empirically in
chapter 6. Then we use variance as the primary selection criterion to choose the parts,
because features with lower variance in the aligned training images tend to be closest
to their mean values, and they are therefore the most reliable and repeatable for
detection. However, we cannot use variance as the sole criterion for selecting features,
because low-variance features are usually bunched near each other (see figure 4-3).
¹After implementing this algorithm, the author realized a simple way of eliminating common subexpressions that would lead to a significant savings in memory and time. The observation is that ATAV = ATAD is the same for every pixel, since $A^T A$ only uses the rotation information rx, ry and rz, which is constant for the whole training image. By storing the A matrix once for all features and pixels, the naive approach of representing the whole matrix would be reduced from 2.3 GiB down to 4n + (2K + 1)nw² numbers (about 779 MiB), and the approach we presented would be reduced from 76 MiB down to 10 + (5 + 9K)w² numbers, or 35 MiB. Moreover, this observation could be used to reduce the number of matrix inversions of the ATAV = ATAD matrices from (K + 1)w² down to just 1, yielding significant time savings in the FINALIZEVISUALPART and FINALIZEDEPTHPART procedures.
Figure 4-3: The effect of varying the minimum distance between parts. When the minimum distance is too low (left), some of the higher-variance regions of the object are not modeled at all. However, when the minimum distance is too high (right), the desired number of parts do not fit on the object model. When the maximum variance constraint is used (center), the minimum distance between parts is chosen automatically.
A good object model should ideally have parts spread evenly over the object. We
therefore focus our attention on designing an algorithm to select features that are
both low variance and also spread out over the object.
The core of the algorithm, SELECTFEATURESGREEDY, uses a greedy strategy to
select low-variance features, constrained to be at least some distance dmin from
each other.
Note that if dmin is too large, we will be forced to choose some very high-variance
features that will not be found consistently in images of the object. On the other
hand, if dmin is too small, it is like the case of only selecting the lowest-variance
features without regard to the distance between parts (dmin = 0): the parts will be
bunched and will not be distributed evenly over the whole viewpoint.
How, then, can we choose a good value for dmin? Observe in figure 4-4 that different
objects and even different views of the same object vary in the area of the image they
cover, even when all the distances from the camera are equal.
Procedure 7 Select features greedily for a particular minimum allowable distance between chosen parts dmin.
Input: a vector of parts P sorted in order of increasing variance, the minimum allowed distance between two selected parts dmin
Output: a set of selected parts Q
1: procedure SELECTFEATURESGREEDY(P, dmin)
2:   Q ← {}
3:   for all Pj in P do
4:     (px, py) ← the original pixel location of Pj
5:     d ← ∞
6:     for all original pixel locations (qx, qy) of parts in Q do
7:       d ← min(d, √((px − qx)² + (py − qy)²))
8:     if d > dmin then
9:       Q ← Q ∪ {Pj}
10:  return Q
Figure 4-4: Different objects and viewpoints vary in the area of the image they cover.
Each object on the top row covers more area in the image than a different viewpoint
of the same object directly below it.
78
0
0
'0
O *040
oOOO
7:0.0000
.0'
040.
0 0000000~00
o~0000*
.0
000
.0
* 00
0
0 o 04
.0 o~0
je0 0 0
00,
000
0"
. 0. 00 .00.
0 0 0
*
I
*
0
0.
*.
0
9
4II4**~
Figure 4-5: When dmin is held constant from one viewpoint (left) to another view
with a different area (center), the parts are not evenly spread out to cover the whole
area. However, when dmin is chosen automatically using a parameter for the maximum
allowable variance (right), the parts tend to be evenly spread, independent of the area
of the viewpoint.
The minimum distance
between parts needs to be able to change so that the parts will always evenly fill the
entire area of the object in the image. Lower-area viewpoints should have a shorter
dmin so that all of the features will fit within the area, and higher-area viewpoints
should have a greater dmin so that the same number of features will be spread out,
filling the whole area.
The SELECTFEATURESGREEDY procedure requires a choice of dmin. It would be
difficult and time-consuming for a user to manually select a different value for the
parameter for each and every viewpoint region. Instead, we use an automatic method
to find a good dmin value; we introduce a different user-specified parameter, vmax, that
specifies the greatest allowable part variance, and, using binary search, we find the
greatest dmin that still allows us to greedily pick the desired number of parts from the
set of all enumerated features whose variances are all small enough (see figure 4-5).
Procedure 8 Select features greedily for a particular maximum allowable part variance vmax.
Input: a vector of parts P sorted in order of increasing variance, the desired number of features n, the maximum allowed part variance vmax
Output: a set of selected parts Q
1: procedure SELECT_FEATURES(P, n, vmax)
2:   d ← the vector of possible integer distances from 0 to 500
3:   jmin ← 1
4:   jmax ← |d|
5:   while jmax > jmin do   ▷ binary search for the best distance in d
6:     jmid ← ⌊(jmin + jmax)/2⌋
7:     Q ← SELECT_FEATURES_GREEDY(P, d_jmid)
8:     if |Q| = n then
9:       return Q
10:    else if |Q| < n then
11:      jmax ← jmid − 1   ▷ search lower subarray
12:    else
13:      jmin ← jmid + 1   ▷ search upper subarray
14:  return SELECT_FEATURES_GREEDY(P, d_jmin)
This is essentially the same as the original idea of setting dmin based on the area
of the viewpoint, as it automatically adapts to the area of the training images. Since
vmax is not affected by the total area of the object, we can keep it constant for all
viewpoints. This feature selection method accomplishes the goal of choosing features
that are evenly spread over the whole area of the visible view, while only choosing
reliable, low-variance features.
We now have the machinery necessary to learn a viewpoint bin model.
The LEARN_VIEWPOINT_BIN_MODEL procedure takes around 18 to 20 seconds to
run on a single CPU core. As mentioned, the call to ENUMERATE_FEATURES on line
2 takes the majority of the time, about 15 seconds. Sorting the features takes a
negligible amount of time, and the remainder of the time is spent selecting features.
4.1.4 Combining Viewpoint Bin Models
The process of learning a full object model is a matter of learning each viewpoint bin
model.
Line 3 in the LEARN procedure ensures that there is a viewpoint bin that
Procedure 9 Learn a new viewpoint bin model.
Input: 3D mesh model M, viewpoint bin B, number of images to render n, synthetic image width w, receptive field radius for visual parts rV, receptive field radius for depth parts rD, desired number of visual parts nV, maximum visual part variance vVmax, desired number of depth parts nD, maximum depth part variance vDmax
Output: a set of visual parts V', a set of depth parts D'
1: procedure LEARN_VIEWPOINT_BIN_MODEL(M, B, n, w, rV, rD, nV, vVmax, nD, vDmax)
2:   V, D ← ENUMERATE_FEATURES(M, B, n, w)
3:   V ← V sorted by variance, increasing
4:   D ← D sorted by variance, increasing
5:   V' ← SELECT_FEATURES(V, nV, vVmax)
6:   D' ← SELECT_FEATURES(D, nD, vDmax)
7:   return V', D'

Procedure 10 Learn a full object model.
Input: 3D mesh model M, number of images to render n, synthetic image width w, receptive field radius for visual parts rV, receptive field radius for depth parts rD, desired number of visual parts nV, maximum visual part variance vVmax, desired number of depth parts nD, maximum depth part variance vDmax, rotational bin width rω, z-symmetry angle sz
Output: a view-based model ℳ
1: procedure LEARN(M, n, w, rV, rD, nV, vVmax, nD, vDmax, rω, sz)
2:   ℳ ← {}
3:   for rx = −rω to 30, step by rω do
4:     for ry = −90 to 0, step by rω do
5:       for rz = 0 to sz, step by rω do
6:         B ← ((rx − ½rω, ry − ½rω, rz − ½rω), (rx + ½rω, ry + ½rω, rz + ½rω))
7:         V ← LEARN_VIEWPOINT_BIN_MODEL(M, B, n, w, rV, rD, nV, vVmax, nD, vDmax)
8:         ℳ ← ℳ ∪ {(B, V)}
9:   return ℳ
is centered on the upright image plane rotation angle (rx = 0), which is the most
common object rotation in the images we used in our experiments. The angle ranges
in lines 3-5 are chosen to be an upper bound of the maximum and minimum rotations
found in the images we used in our experiments. The z-symmetry angle sz represents
the symmetry of an object about a vertical axis perpendicular to the table. A can
would have sz = 0, a cereal box would have sz = 180, and an asymmetric object
would have sz = 360. The rotational bin width parameter rω is chosen empirically in
chapter 6. When rω = 20 and sz = 360, the resulting view-based model contains 360
viewpoint bin models.
In practice, we parallelize the LEARN procedure by distributing each call to
LEARN_VIEWPOINT_BIN_MODEL (line 7) to different cores and different physical machines in
the cloud using OpenMPI.
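As a rough illustration of this distribution scheme (not the training code used in the thesis), the viewpoint bins could be dealt out to MPI ranks in a round-robin fashion with mpi4py; learn_viewpoint_bin_model, the bin list, and the parameter passing below are hypothetical stand-ins.

from mpi4py import MPI

def learn_in_parallel(mesh, bins, learn_viewpoint_bin_model, **params):
    """Spread the independent per-bin learning calls over MPI ranks."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Each rank trains every size-th viewpoint bin.
    local = [(b, learn_viewpoint_bin_model(mesh, b, **params))
             for i, b in enumerate(bins) if i % size == rank]
    # Gather the per-rank pieces of the model back on rank 0.
    pieces = comm.gather(local, root=0)
    return [entry for piece in pieces for entry in piece] if rank == 0 else None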
4.2 High Level Learning Procedure
Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera
with the ability to measure camera height and pitch angle
Output: A view-based model of the object, composed of a set of viewpoint bin
models, each with visual and depth parts
The procedure to learn a new view-based model is depicted in figure 1-2. The
steps involved are:
1. Collect RGB-D images of the object, along with information about the camera
pose, and partition the images into a test and hold-out set (section 1.1.1.2).
2. Label the images with bounding boxes (section 1.1.1.2).
3. Acquire a 3D mesh of the object (section 1.1.1.2).
4. Tune learning parameters while testing the view-based models on the test images
(section 4.2.1).
5. Evaluate the accuracy of the view-based models (section 1.1.1.2).
Figure 4-6: At the time each RGB-D image was captured, we also record the camera's
height above the table and its pitch angle from the joint angles and torso height of
the PR2 robot.
4.2.1 Tuning Parameters
Input:
* results of evaluating the view-based object detector
* a sample of correct and incorrect detections from the view-based model learned
from the previous parameter settings
Output: a new set of parameter values that should improve the accuracy of the
view-based model
The signature of LEARN (procedure 10) reveals the following parameters that must be set:
* number of images to render n
* desired number of visual parts nV
* desired number of depth parts nD
* receptive field radius for visual parts rV
* receptive field radius for depth parts rD
* maximum visual part variance vVmax
* maximum depth part variance vDmax
* rotational bin width rω
* synthetic image width w
A parameter not explicitly mentioned in the pseudo-code above is the amount of ambient lighting in the synthetic images, and a related parameter is the threshold that
defines the minimum edge detection probability that will be counted as an edge. Another important pair of parameters is the tolerances on the camera height htol and the
camera pitch angle rtol; these parameters are used to constrain the search space during detection and will be discussed more in chapter 5. We experiment with independently varying most of these parameters (n, nV, nD, rV, rD, vVmax, vDmax, rω, htol, rtol,
edge threshold) in chapter 6.
It would be computationally infeasible to test all combinations of parameter settings to automatically find the best-performing values. Taking advantage of the high
parallelism available on GPUs would enable significantly more parameter settings to
be tested; however, finding the optimal combination of settings for all variables remains a hard problem. We therefore wish to explicitly mention the importance of
human training and intuition in guiding the parameter setting process.
This is admittedly the most difficult part of the process, as it requires a fair
amount of experience. Chapter 6 helps to provide some intuition of how accuracy
may be affected by varying any one of these parameters, but the real process involves
some careful inspection of detections, an understanding of how the view-based model
is affected by the parameters, and some critical thinking.
Chapter 5
Detection
Input:
* An RGB-D image along with the camera height and the camera pitch angle
* A view-based model
Output: A sequence of detected poses, ordered by decreasing probability that the
object is at each pose
The detection algorithm consists of five steps:
1. detect visual features (section 5.1)
2. pre-process the image to create low-dimensional tables to quickly access the 6D
search space (section 5.2)
3. run the branch-and-bound search to generate a list of detections (section 5.3)
4. remove detections that are not local maxima to remove many redundant detections that are only slight variations of each other (section 5.4)
5.1 Detecting Features
Input: an RGB-D image
Output: a binary image of the same dimensions, for each kind of visual feature
The depth measurement at each pixel from the RGB-D camera is converted from
Euclidean space into a depth feature measuring the distance from the focal point of
the camera to the nearest surface along the ray passing through the pixel.
In general, visual features are any kind of feature whose uncertainty over position
can be represented as being restricted to the image plane, as defined by equations 3.3
and 3.4. This could include binary texture detection. In this thesis, visual features are
restricted to edges in the image. We use the edge detector of Ren and Bo [28], which
uses the depth channel in addition to the RGB channels, and outputs a probability for 8
different edge directions at every pixel. We then use the minimum edge probability
threshold parameter to change these probability tables into 8 binary images, one for
each edge direction. The running time for this edge detector is several minutes on a
single CPU, which is obviously impractical for most situations in which a detection is
needed in seconds or less. For practical situations, we would use a GPU-accelerated
implementation of this edge detector, or a simpler edge detector such as the one by
Canny [3].
We use a single threshold to change the real-valued probability into a binary
decision that determines whether the edge is present or absent. It is also possible to
introduce a second threshold such that each edge detection could have three possible
values: absent, weak or strong. To preserve the re-usability of the pre-processed
distance transforms and 2D summed area tables described in section 5.2, 3-valued
visual features must be implemented by introducing a new kind of visual feature for
each edge direction, effectively doubling the amount of memory required to store the
pre-processed image. As the number of thresholds increases (the limit is real-valued
visual features), the size of the pre-processed image would increase proportionally.
We have implemented 3-valued visual features, but we have not yet performed the
experiments to determine the potential gain in accuracy.
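A minimal sketch of the thresholding step follows, assuming the edge detector's output is stacked into an (8, H, W) probability array; the array layout and function names are assumptions made for illustration, not the thesis code.

import numpy as np

def binarize_edges(edge_prob, threshold):
    """Turn an (8, H, W) stack of per-direction edge probabilities into
    8 binary feature images using a single threshold."""
    return edge_prob >= threshold            # boolean array, shape (8, H, W)

def binarize_edges_3valued(edge_prob, weak_thresh, strong_thresh):
    """Two thresholds give absent/weak/strong edges; each direction then
    yields two binary feature kinds (weak-or-stronger, strong), doubling
    the number of pre-processed feature images."""
    weak = edge_prob >= weak_thresh
    strong = edge_prob >= strong_thresh
    return np.concatenate([weak, strong], axis=0)   # shape (16, H, W)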
5.2 Pre-Processing Features
Input: a depth image and a binary image of the same dimensions, for each kind of
visual feature
Output:
* a 3D summed area table computed from the depth image
* a 2D summed area table computed from each kind of visual feature
* a 2D distance transform computed from each kind of visual feature
The detection algorithm searches for detections in decreasing order of probability.
We pre-process the visual and depth features in order to make this search more
efficient. A preprocessed image I is a 6-tuple ⟨T, S, DT, Z, rc, h⟩. T is a vector of
k 2D summed area tables, where k is the number of different kinds of visual features
(usually 8). S is a 3D summed area table that is computed from the depth image.
DT is a vector of k distance transforms (one for each kind of visual feature), Z is the
depth image from the original RGB-D image, rc is the pitch angle of the camera and
h is the height of the camera above the table.
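For readers who prefer code, a minimal container mirroring this 6-tuple might look as follows; the field names are assumptions chosen for this sketch only.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class PreprocessedImage:
    sat_2d: List[np.ndarray]           # T: one 2D summed area table per visual feature kind
    sat_3d: np.ndarray                 # S: 3D summed area table built from the depth image
    dist_transforms: List[np.ndarray]  # DT: one distance transform per visual feature kind
    depth: np.ndarray                  # Z: the original depth image
    camera_pitch: float                # r_c: pitch angle of the camera
    camera_height: float               # h: height of the camera above the table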
To explain this pre-processing step, we begin by transforming the objective probability optimization equation 3.2 (reprinted below) to show how it can be computed
as a Hough transform. We then describe how a detection is evaluated at a specific
point using a procedure called EVAL and its two sub-procedures EVAL_DEPTH and
EVAL_VISUAL.
In a Hough transform, each part casts probabilistic "votes" for where it thinks the
object might be. The votes are summed at each location, and the location with the
highest total sum is the most likely location for the object. However, the optimization
objective (equation 3.2) takes the product of the inputs from each part, rather than
summing them. To fix this, we can change the products to sums by taking the log.
Moreover, Crandall et al. [5] noted that we can move each max over the locations
of the depth parts dj and visual parts vk close to its probability distribution (which
is a form of dynamic programming).
These two modifications allow us to rewrite the optimization objective into a
Hough transform:
argmax_p max_{v,d} Pr(V = v, D = d | P = p) = argmax_p max_{v,d} Πj Pr(Dj = dj | P = p) Πk Pr(Vk = vk | P = p)   (3.2 revisited)

                                            = argmax_p [ Σj HDj(p) + Σk HVk(p) ],   (5.1)
where the Hough votes for depth parts HDj and visual parts HVk are functions of
the pose p:
HDj(p) = max_{dj} log Pr(Dj = dj | P = p)   (5.2)

HVk(p) = max_{vk} log Pr(Vk = vk | P = p).   (5.3)
Equation 5.1 makes it clear that we are simply summing up Hough votes from each
part. The votes are tallied over the 6-dimensional Hough space of object poses.
Before describing the EVAL_DEPTH procedure to evaluate HDj(p), we note that we
can drop the max_{dj} operator from equation 5.2, since there is only one valid depth
measurement from the depth image that matches a particular feature Dj when the
object is at pose p:

HDj(p) = log Pr(Dj = dj | P = p).   (5.4)
The depth measurement dj is read from the depth image at a pixel location that
is a function of the pose. This simplification makes the EVAL_DEPTH procedure
straightforward to compute:
Figure 5-1 shows a visualization of equation 5.1 for a 1D Hough space and an
object with only 3 visual parts. Figure 5-2 extends figure 5-1 by adding a rotational
dimension, so that the Hough vote space is two-dimensional. Figure 5-3 also extends
figure 5-1 to form a two-dimensional Hough space; this figure adds the scale dimension.
Procedure 11 Evaluates a depth part in an image at a particular pose for a Hough vote HDj according to equation 5.2.
Input: pose p, depth part D, a preprocessed image I
Output: the log probability of the depth part D for an object at pose p in the image I
1: procedure EVAL_DEPTH(p, D, I)
2:   (x, y, z, rx, ry, rz) ← p
3:   [i j k l] ← linear coefficients of D
4:   md ← z + [i j k l]·[1 rx ry rz]ᵀ   ▷ From equation 3.7
5:   (x', y') ← the pixel location of D if the pose was at (x, y, z) = (0, 0, 1)
6:   (p'x, p'y) ← (x + x'/z, y + y'/z)   ▷ From equation 3.6
7:   d ← the depth at location (round(p'x), round(p'y)) in the depth image of I
8:   v ← the variance of depth part D
9:   r ← the receptive field radius of D
10:  if d is defined then
11:    return −min((d − md)², r²)/(2v)   ▷ From equation 3.8
12:  else
13:    d̄ ← default depth difference of D when undefined
14:    return −d̄   ▷ From equation 3.8
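A hedged Python sketch of this evaluation is given below; the part fields (coeffs, pixel0, variance, radius, default_diff) are assumed names for this sketch, and the explicit bounds check stands in for the "d is defined" test.

import math

def eval_depth(pose, part, image):
    """Hough vote of one depth part at a single pose (cf. Procedure 11)."""
    x, y, z, rx, ry, rz = pose
    i, j, k, l = part.coeffs
    expected_depth = z + i + j * rx + k * ry + l * rz     # piece-wise linear depth model
    x0, y0 = part.pixel0                                  # pixel location at unit depth
    px, py = round(x + x0 / z), round(y + y0 / z)         # project to a pixel
    in_bounds = 0 <= py < image.depth.shape[0] and 0 <= px < image.depth.shape[1]
    d = image.depth[py, px] if in_bounds else None
    if d is None or math.isnan(d):
        return -part.default_diff                         # depth undefined at that pixel
    return -min((d - expected_depth) ** 2, part.radius ** 2) / (2 * part.variance)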
Figure 5-1: A one-dimensional illustration of Hough transforms for visual parts. The
object model is comprised of three different visual parts, which are matched to features
detected in the 1D input image. Each feature detection casts a vote for where it thinks
the center of the object is; locations with more certainty receive more weight. Recall
from equation 3.5 that we represent the distribution over visual part positions as
normal with a receptive field radius. The Hough votes are the log of the normal
distribution, which is a parabola (also seen in figure 3-5). Notice how the votes
are shifted horizontally to account for the offset between the expected part position
and the center of the object. The shape of the votes is defined by the visual part
distribution Pr(V = x | P = p) ∝ exp(−min((x − m)², r²)/(2v)) (equation 3.5 in one
dimension). The best detection is the global maximum of the sum of the votes
(equation 5.1), which we can see is indeed where the object is found in the original image.
Figure 5-2: This adds a rotation dimension to figure 5-1. The Hough transform votes
are two-dimensional, so the darkness of the Hough transform vote images indicates
the weight of the vote at that location in the space of poses. A horizontal cross section
of these 2D Hough transform votes is a shifted version of the 1D Hough transform
vote depicted in figure 5-1. Since we use a small angle approximation for rotation, the
reader will notice that the shift in the Hough votes is a linear function of the rotation
angle (the vertical axis). The sum of the Hough transform votes is rendered using
a contour map. In this image, we can also see that the maximum in the sum of
distance transforms occurs at the place we would expect (with no rotation). The
second best detection would occur when the object undergoes some rotation to put
the blue feature further to the right.
Figure 5-3: This adds a scale dimension to figure 5-1. A horizontal cross section of
these 2D Hough transform votes is related to the 1D Hough transform vote depicted
in figure 5-1: the parabolas are widened as the scale increases, since the receptive
field radius is also changed with scale. In addition, the entire 1D image is shifted.
The sum of the Hough transform votes is rendered using a contour map. In this
image, we can also see that the maximum in the sum of distance transforms occurs
at the place we would expect.
Unlike depth features, which (when defined) are always at a known pixel in the
image plane, visual features may be found anywhere in the image plane. We are
interested in finding the nearest visual feature to the expected location of a visual
part. A naive algorithm would search for the nearest visual feature every time we
need to evaluate a new visual part. However, we can pre-compute this information
in a distance transform (introduced previously in section 4.1.2).
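A minimal sketch of this pre-computation follows, assuming the binary edge images are stacked into an (8, H, W) boolean array and using SciPy's Euclidean distance transform; this is illustrative, not the thesis implementation.

from scipy.ndimage import distance_transform_edt

def edge_distance_transforms(binary_edges):
    """One distance transform per edge direction.

    Each output pixel holds the Euclidean distance to the nearest detected
    edge of that direction; distance_transform_edt measures the distance to
    the nearest zero, so we pass the inverted mask.
    """
    return [distance_transform_edt(~channel) for channel in binary_edges]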
Figure 5-4 shows an application of Hough transforms to optical character recognition with a two dimensional Hough space, motivating the use of distance transforms.
Each Hough transform vote from a visual feature Hvk is a clipped and translated distance transform. The values of the distance transforms are clipped according to the
receptive field radius of the visual part and the entire distance transform is translated
by the expected location of the part relative to the center of the object model. Note
that there are two kinds of edge parts used in the 4-part model in this figure, two
of each edge angle. In general, the distance transforms for visual parts of the same
kind are the same, but they are shifted according to the expected locations of the
respective parts.
Figure 5-5 illustrates the concept of Hough transforms for depth parts in two
dimensions. We cannot render Hough votes for depth parts with rotation on a page,
but the idea is similar to figure 5-2.
Before we begin searching for an object in a new image, we always pre-compute a
distance transform for each kind of visual feature. Since we use edges with 8 different
directions as visual features in this thesis, we pre-compute 8 distance transforms for
an image. These distance transforms are then re-used for every visual part of the
same kind. This enables us to quickly evaluate the Hough vote for a visual feature
(equation 5.3) at any point in pose space, as shown in the EVAL_VISUAL procedure.
Now that we have procedures for evaluating the Hough votes from both depth and
visual parts, we sum these votes (as in equation 5.1) using a procedure called EVAL.
Distance transforms allow us to quickly evaluate a viewpoint bin model at any
pose. But evaluation is not enough-in order to perform branch-and-bound search,
we need a way to bound a model over a region of poses. We discuss the details of
our bounding method in section 5.3.2. However, we will briefly describe another kind
of data structure that we pre-compute in order to speed up the bounding procedure:
summed area tables.
Summed area tables (also known as integral images) give a fast (constant time)
way to compute the sum of a rectangular range of a matrix. They were first used
in computer graphics by Crow [6] and popularized in computer vision by Viola and
Jones [36]. Summed area tables can be computed from a matrix of any number of
dimensions, but we use 2D and 3D summed area tables.
Figure 5-4: Recognizing the letter 'X' in an image using visual edge features. We
are searching the 2D space of poses in the image plane (without depth or scale), as
opposed to the one-dimensional pose space used in figure 5-1. The object model is
made up of four visual edge parts, two of which have one diagonal angle and the other
two have a different diagonal angle. The feature detections are used to make Hough
votes using the visual edge feature distribution (equation 3.5). Note that the log of
this distribution, log Pr(V = v | P) ∝ −min((v − m)ᵀ(v − m), r²), is a squared distance
function, clipped by the min() operator for distances greater than r. For this reason,
these Hough votes are referred to as distance transforms. The distance transforms
are translated such that each feature votes for where it thinks the center should be.
Adding up the votes from each feature yields an image with the best detection at the
center of the letter 'X' in the original image, as we would expect.
Figure 5-5: A two-dimensional slice of Hough votes and their sum. The object model
is designed specifically to detect the rotated rectangle near the center of the scene.
It consists of three depth parts located on the surfaces of the rectangle that should
be visible to the camera at that angle. The three Hough votes are derived from the
input depth image-the darkness indicates the weight of the vote at that pose in the
space. The sum of the Hough votes reveals that the best detection is where we would
expect. The second best detection is another rectangular part of an object which has
a similar rotation with respect to the tangent of the circle centered at the focal point
of the camera. All graphics are rendered in Euclidean world coordinates.
Procedure 12 Evaluates a visual part in an image at a particular pose for a Hough vote HVk according to equation 5.3.
Input: pose p, visual part V, a preprocessed image I
Output: the log probability of the visual part V for an object at pose p in the image I
1: procedure EVAL_VISUAL(p, V, I)
2:   (x, y, z, rx, ry, rz) ← p
3:   [a b; c d; e f; g h] ← linear coefficients of V
4:   (x', y') ← (a + c·rx + e·ry + g·rz, b + d·rx + f·ry + h·rz)   ▷ From equation 3.3
5:   (p'x, p'y) ← (x + x'/z, y + y'/z)   ▷ From equation 3.4
6:   k ← the index of the kind of visual part V
7:   d ← the kth distance transform in the preprocessed image I at pixel location (round(p'x), round(p'y))
8:   r ← the receptive field radius of V
9:   v ← the variance of visual part V
10:  return −min(d², r²)/(2v)   ▷ From equation 3.5
Procedure 13 Evaluates an object model in an image at a particular pose according to equation 5.1.
Input: pose p, viewpoint bin model M, a preprocessed image I
Output: the log probability of the viewpoint bin model M at pose p in the image I
1: procedure EVAL(p, M, I)
2:   (V, D) ← M   ▷ visual parts V and depth parts D of M
3:   return Σj EVAL_DEPTH(p, Dj, I) + Σk EVAL_VISUAL(p, Vk, I)
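The two procedures above can be sketched in Python as follows, reusing eval_depth from the earlier sketch; the visual part field names and the (visual_parts, depth_parts) model layout are assumptions, and bounds checking of the pixel index is omitted for brevity.

def eval_visual(pose, part, image):
    """Hough vote of one visual part at a single pose (cf. Procedure 12)."""
    x, y, z, rx, ry, rz = pose
    a, b, c, d, e, f, g, h = part.coeffs            # linear coefficients
    dx = a + c * rx + e * ry + g * rz               # image-plane offset at unit depth
    dy = b + d * rx + f * ry + h * rz
    px, py = round(x + dx / z), round(y + dy / z)   # perspective division
    dt = image.dist_transforms[part.kind]           # distance transform for this edge kind
    dist = dt[py, px]                               # distance to the nearest matching feature
    return -min(dist ** 2, part.radius ** 2) / (2 * part.variance)

def eval_model(pose, model, image):
    """Sum of all depth and visual Hough votes (equation 5.1, Procedure 13)."""
    visual_parts, depth_parts = model
    return (sum(eval_depth(pose, p, image) for p in depth_parts) +
            sum(eval_visual(pose, p, image) for p in visual_parts))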
For an m × n matrix M, a summed area table S can be pre-computed in time
linear in the number of entries, O(mn). It can then be used to answer queries
SUM_2D(S, x1, x2, y1, y2) = Σ_{x=x1..x2} Σ_{y=y1..y2} M_{x,y} in constant time, that is, the running
time is independent of the size of the query region. We pre-compute a 2D summed
area table for each kind of visual feature (typically 8 edge directions), so that we can
quickly determine if there are 1 or more visual features of a particular kind within a
rectangular bounding region. We also pre-compute a 2D summed area table U for a
binary image the same size as the original RGB-D image, whose entries are 1 where
the depth is undefined. This allows us to quickly determine if there are any missing
depth values in a rectangle.
For an l × m × n matrix N, a summed area table T can be pre-computed in
time linear in the number of entries, O(lmn). It can then be used to answer queries
SUM_3D(T, x1, x2, y1, y2, z1, z2) = Σ_{x=x1..x2} Σ_{y=y1..y2} Σ_{z=z1..z2} N_{x,y,z} in constant time that is
independent of the size of the query region. We compute a 3D summed area table
for depth images by discretizing depth into equally-spaced intervals of 5 centimeters.
We transform the m × n real-valued depth image Z into an l × m × n matrix N such
that N_{x,y,z} is 1 if Z_{x,y} falls into the zth depth interval (and 0 otherwise).
Pre-computing the 8 distance transforms, 9 2D summed area tables and the 3D
summed area table takes less than 1 second on a single CPU.
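A small sketch of the 2D case follows, under the assumptions that axis 0 is x, axis 1 is y, and query bounds are inclusive; the 3D table is built the same way with one more cumulative sum.

import numpy as np

def build_sat_2d(binary_image):
    """Summed area table with a zero row and column prepended, so that
    S[i, j] holds the sum of binary_image[:i, :j]."""
    return np.pad(binary_image, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def sum_2d(S, x1, x2, y1, y2):
    """Number of features with x1 <= x <= x2 and y1 <= y <= y2,
    computed with four table look-ups."""
    return S[x2 + 1, y2 + 1] - S[x1, y2 + 1] - S[x2 + 1, y1] + S[x1, y1]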
The last two components of a preprocessed image, rc and h, are two scalar numbers that provide contextual information about the camera's pose. They help us to
constrain the search space to a horizontal surface such as a table top. rc measures
the pitch angle of the camera, which is changed by pointing the camera further up
or down with respect to the ground plane (we assume that the camera never rotates
in the image plane; in other words, the roll angle is always zero). h measures the
height of the camera above the table.
5.3 Branch-and-Bound Search
Input: a pre-processed image, a view-based object model
Output: the sequence of detections (i.e. points in pose space) sorted in descending
order of probability that the object is located there
After pre-processing an image, we now turn to the business of searching for the
object.
This step is the most time-consuming part of the process, so we try to
take extra care to ensure that the most frequently called sub-procedures are efficient.
Section 5.3.1 describes how we branch regions into sub-regions, section 5.3.2 discusses
the method we use to bound the probability of a detection over regions of pose space
and section 5.3.5 puts these parts together to form the branch-and-bound search
procedure.
5.3.1 Branching
We define a hypothesis region R to be an axis-aligned bounding box in the space of
poses. As the search progresses, the priority queue will store the current collection of
working hypothesis regions of pose space. Formally, a hypothesis region is an index
m into the vector of viewpoint bin models and a pair of poses (m, p1, p2) where p1 =
(x1, y1, z1, rx1, ry1, rz1) and p2 = (x2, y2, z2, rx2, ry2, rz2) such that x2 ≥ x1, y2 ≥ y1,
z2 ≥ z1, rx2 ≥ rx1, ry2 ≥ ry1 and rz2 ≥ rz1. A pose (x, y, z, rx, ry, rz) of viewpoint bin
model m is in hypothesis region (m, p1, p2) iff x1 ≤ x ≤ x2, y1 ≤ y ≤ y2, z1 ≤ z ≤ z2,
rx1 ≤ rx ≤ rx2, ry1 ≤ ry ≤ ry2 and rz1 ≤ rz ≤ rz2.
The BRANCH(R) procedure partitions a hypothesis region into 64 smaller hypothesis regions, all of equal size. Each of the 6 dimensions is split in half. BRANCH
returns the set of all 2^6 = 64 combinations of upper and lower halves for each dimension. We omit the pseudocode for this procedure for brevity.
5.3.2 Bounding
In this section, we will develop a function b(R) that gives an upper bound on the
log-probability of finding the object within a hypothesis region R of the space of
poses:

b(R) ≥ max_{p∈R} [ Σj HDj(p) + Σk HVk(p) ].   (5.5)
The design of this function is the critical bottleneck that determines the running
time of the whole detection procedure. There are two important aspects to the
bounding procedure: its running time and its tightness. The bounding procedure is
in the "inner loop" of the branch-and-bound search; it is called every time a region
is evaluated, so the bounding function must be efficient. On the other hand, the
bounding function should be tight: b(R) should be nearly equal to the right-hand
side of inequality 5.5. The closer b(R) is to the true maximum value, the faster the
search will narrow in on the true global maximum.
For illustration, consider two straw-man bounding functions, UNINFORMATIVE_BOUND
and BRUTE_FORCE_BOUND.
Procedure 14 An uninformative design for a bounding function.
Input: hypothesis region R, view-based object model O, a preprocessed image I
Output: an upper bound on the log probability of O being located in R in the image I
1: procedure UNINFORMATIVE_BOUND(R, O, I)
2:   return 0   ▷ The maximum possible return value of EVAL

Procedure 15 A brute-force design for a bounding function.
Input: hypothesis region R, view-based object model O, a preprocessed image I
Output: an upper bound on the log probability of O being located in R in the image I
1: procedure BRUTE_FORCE_BOUND(R, O, I)
2:   m ← the index of the viewpoint bin model for region R
3:   (M, B) ← Om   ▷ viewpoint bin B for viewpoint bin model M
4:   (V, D) ← M   ▷ visual parts V and depth parts D
5:   vmax ← −∞
6:   for all p in a minimum-resolution grid in R do
7:     vmax ← max(vmax, EVAL(p, M, I))
8:   return vmax
UNINFORMATIVE_BOUND is extremely fast to evaluate, but it is completely useless:
if we used it, the branch-and-bound search would never eliminate any branches. It
would only terminate after evaluating every pose in the full search space, which cancels
out any benefit that could come from trying to use branch-and-bound search to
speed up detection.
BRUTE_FORCE_BOUND, on the other hand, is perfectly accurate; it is always precisely
equal to the right-hand side of inequality 5.5. However, the procedure explicitly
evaluates every point in the minimum-resolution grid in the hypothesis region R by
brute force. Clearly, this bounding function is far too slow to be practical.
These bounding functions are extreme examples; they lie at opposite ends of the
trade-off between speed and accuracy. A good design for a bounding function should
be reasonably fast to compute, but also reasonably tight.
To design a bounding function, we first observe that if we let

p* = argmax_{p∈R} [ Σj HDj(p) + Σk HVk(p) ],

then since

max_{p∈R} HDj(p) ≥ HDj(p*)

and

max_{p∈R} HVk(p) ≥ HVk(p*),

the following inequality must hold:

Σj max_{p∈R} HDj(p) + Σk max_{p∈R} HVk(p) ≥ max_{p∈R} [ Σj HDj(p) + Σk HVk(p) ].   (5.6)

This allows us to break the bounding function into parts:

b(R) = Σj bDj(R) + Σk bVk(R),   (5.7)

where, because of the observation in inequality 5.6, we require the part bounding
Figure 5-6: To find an upper bound on the maximum value of the sum of the Hough
transform votes in a region (the bottom red range), we take the sum of the maximums
for each part in that region (the top 3 red ranges).
functions bDj and bVk to satisfy the following:

bDj(R) ≥ max_{p∈R} HDj(p)   (5.8)

bVk(R) ≥ max_{p∈R} HVk(p).   (5.9)
To illustrate this idea for visual features, figure 5-6 shows how finding the maximum within a region of the sum of Hough transform votes can be bounded by the
sum of the maximums of each of the Hough transform votes in the same region, and
figures 5-7 and 5-8 apply this to Hough transform spaces with rotation and scale,
respectively.
Figure 5-9 then applies this same principle to 2D optical character
recognition.
Similarly, to illustrate this idea for depth features, figure 5-10 shows how finding
Figure 5-7: To find an upper bound on the maximum value of the sum of the Hough
transform votes in a region (the bottom red rectangle), we take the sum of the maximums for each part in that region (the top 3 red rectangles).
Figure 5-8: To find an upper bound on the maximum value of the sum of the Hough
transform votes in a region (the bottom red rectangle), we take the sum of the maximums for each part in that region (the top 3 red rectangles).
Figure 5-9: To find an upper bound on the maximum value of the sum of the Hough
transform votes in a region (the bottom red rectangle), we take the sum of the maximums for each part in that region (the 4 red rectangles on the distance transforms).
Note that the sum of the maximums for each part will be greater than the true maximum of the sum of the Hough votes in that region because the maximums for each
part do not occur at the same location in the space of poses. We also show that,
because the distance transforms are shifted, the query regions are not aligned in the
binary feature detection images (even for parts of the same edge direction).
Figure 5-10: To find an upper-bound on the maximum value of the sum of the Hough
transform votes in a region (bottom red region), we take the sum of the maximums
for each part in the region (the top 3 red regions).
the maximum within a region of the sum of Hough transforms can be bounded by
the sum of the maximums of each of the Hough transform votes in the same region.
In order to compute a bound on the vote for a particular visual feature bvk(R),
we need to transform the Hough votes so that they are in alignment with the original
image (and the data structures described in section 5.2).
In figure 5-11, we shift
the Hough transform votes into alignment with the feature detections in the original
image coordinates (compare with figure 5-6). Since the shifted regions are also shifted
into image coordinates, we can use them to access the pre-processed summed area
tables. Recall that horizontal cross-sections of the Hough votes in figures 5-2 and 5-3
are the same as the Hough votes in figure 5-1, except that they are shifted horizontally.
For the same reason, we can see that the bounding region must be sheared from a
Figure 5-11: In figure 5-6, the Hough transform votes were horizontally translated
according to the expected location of the part. In this figure, we translate the Hough
transform votes in order to find which regions in the original image could contribute
to the maximum value in the bottom red query region. This enables us to access the
correct regions of the pre-processed tables for the image.
Figure 5-12: In figure 5-7, the Hough transform votes were aligned with the Hough
transform voting space. In this figure, we shear the Hough transform votes so that
they are aligned to the original image coordinates. We also shear the bounding regions
in the same way so that they change from rectangles to parallelograms. The extent
of each parallelogram is then projected onto the binary feature detection image.
rectangle to a parallelogram in order to be aligned to the original image coordinates
when we add rotation (figure 5-12) or scale (figure 5-13) to the pose space. Figure
5-14 shows how the depth measurements are warped from the Euclidean coordinates
into the coordinate system in which the depth is measured from the focal point of the
camera through the image plane to the nearest surface.
These transformations are
the key insight that enables us to compress the 6D Hough transform space down to
lower-dimensional data structures, which allows us to design fast bounding functions.
The following procedure, BOUND_VISUAL, provides our implementation of bVk
that satisfies inequality 5.9 (for a proof, see appendix A.1). In essence, it does this
Figure 5-13: In figure 5-8, the Hough transform votes were aligned with the Hough
transform voting space. In this figure, we shear the Hough transform votes so that
they are aligned to the original image coordinates. We also shear the bounding regions
in the same way so that they change from rectangles to parallelograms. The extent
of each parallelogram is then projected onto the binary feature detection image.
Figure 5-14: In figure 5-10, the Hough transform votes were rendered in the Euclidean
world coordinates. In this figure, we warp the coordinates such that the rays from the
focal point of the camera are parallel. In this view, the reader can see that the Hough
votes are all simply translated versions of the same image, and that the red regions
become parallelograms. This warping transform aligns all of the votes to the summed
area table, which is coarsely discretized (with 5 cm increments) in the vertical (depth)
dimension.
by projecting the six-dimensional hypothesis region R onto a two-dimensional region
of the image plane, such that if a visual feature was found in that region there would
be a pose in R that would match that visual feature within the receptive field radius
of the visual part.
Procedure 16 Calculate an upper bound on the log probability of a visual part for poses within a hypothesis region.
Input: hypothesis region R, visual part V, preprocessed image I
Output: an upper bound on the log probability of visual part V for an object located in R in the image I
1: procedure BOUND_VISUAL(R, V, I)
2:   (m, (x1, y1, z1, rx1, ry1, rz1), (x2, y2, z2, rx2, ry2, rz2)) ← R
3:   [a b; c d; e f; g h] ← linear coefficients of V
4:   x'min ← a + min(c·rx1, c·rx2) + min(e·ry1, e·ry2) + min(g·rz1, g·rz2)
5:   xmin ← ⌊x1 + min(x'min/z1, x'min/z2)⌋
6:   x'max ← a + max(c·rx1, c·rx2) + max(e·ry1, e·ry2) + max(g·rz1, g·rz2)
7:   xmax ← ⌈x2 + max(x'max/z1, x'max/z2)⌉
8:   y'min ← b + min(d·rx1, d·rx2) + min(f·ry1, f·ry2) + min(h·rz1, h·rz2)
9:   ymin ← ⌊y1 + min(y'min/z1, y'min/z2)⌋
10:  y'max ← b + max(d·rx1, d·rx2) + max(f·ry1, f·ry2) + max(h·rz1, h·rz2)
11:  ymax ← ⌈y2 + max(y'max/z1, y'max/z2)⌉
12:  r' ← receptive field radius of V
13:  rmax ← ⌈r'/z1⌉
14:  k ← index of the kind of visual part V
15:  S ← the kth summed area table for visual features of kind k in I
16:  if SUM_2D(S, xmin − rmax, xmax + rmax, ymin − rmax, ymax + rmax) > 0 then
17:    return 0   ▷ The maximum value of equation 3.5
18:  else
19:    v ← variance of V
20:    return −r'²/(2v)   ▷ The minimum value of equation 3.5
Note that the BOUND_VISUAL procedure can only return two different log probabilities: the maximum log probability for the part, 0, or the minimum log probability
for the part, −r'²/(2v). The procedure returns the maximum value when it is possible that
a visual feature is a distance of less than or equal to the receptive field radius from
the hypothesis region. It returns the minimum value for the part when it is certain
that there are no visual features within the receptive field radius of the hypothesis
region.
Recall that SUM_2D(S, ...) (line 16) returns the total number of visual features in
the region. Notice that the region is expanded by the receptive field radius rmax. We
illustrate this expanded region in figure 5-15, in which the hypothesis regions aligned
to image coordinates are expanded by the receptive field radius. There is a minor
complication with this strategy that comes when we consider adding scale to the pose
space: as the scale changes, the receptive field also changes. In this case, we simply
choose the maximum radius over the range of scales, as depicted in figure 5-16.
BOUND_VISUAL is computed using a small constant number of operations that
is independent of the size of the region R, which means that it is very efficient to
compute. However, it is a somewhat loose bound.
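The following sketch shows the shape of such a bound check in Python; it reuses sum_2d from the earlier summed-area-table sketch, assumes the same hypothetical part fields as before, and omits clamping of the query rectangle to the image borders.

import math

def bound_visual(region, part, image):
    """Upper bound on one visual part's vote over a hypothesis region
    (cf. Procedure 16); returns either the vote's maximum (0) or its minimum."""
    m, (x1, y1, z1, rx1, ry1, rz1), (x2, y2, z2, rx2, ry2, rz2) = region
    a, b, c, d, e, f, g, h = part.coeffs
    # Extreme image-plane offsets of the part over the rotation range ...
    dx_lo = a + min(c * rx1, c * rx2) + min(e * ry1, e * ry2) + min(g * rz1, g * rz2)
    dx_hi = a + max(c * rx1, c * rx2) + max(e * ry1, e * ry2) + max(g * rz1, g * rz2)
    dy_lo = b + min(d * rx1, d * rx2) + min(f * ry1, f * ry2) + min(h * rz1, h * rz2)
    dy_hi = b + max(d * rx1, d * rx2) + max(f * ry1, f * ry2) + max(h * rz1, h * rz2)
    # ... projected over the depth range and padded by the receptive field radius.
    x_min = math.floor(x1 + min(dx_lo / z1, dx_lo / z2))
    x_max = math.ceil(x2 + max(dx_hi / z1, dx_hi / z2))
    y_min = math.floor(y1 + min(dy_lo / z1, dy_lo / z2))
    y_max = math.ceil(y2 + max(dy_hi / z1, dy_hi / z2))
    pad = math.ceil(part.radius / z1)                 # largest radius over the scale range
    S = image.sat_2d[part.kind]
    if sum_2d(S, x_min - pad, x_max + pad, y_min - pad, y_max + pad) > 0:
        return 0.0                                    # a feature may be close enough
    return -part.radius ** 2 / (2 * part.variance)    # no feature can be within radius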
Part of the bound inaccuracy stems from the fact that it can only return two
different values: the maximum (line 17) or the minimum (line 20) value of the Hough
vote for a part. When the size of the hypothesis region is small compared to the
receptive field radius, the projected expanded hypothesis region is relatively large, so
that most of the visual features will be found in the expanded receptive field radius
region (the green shaded regions in figure 5-15), rather than the projected region.
When the visual feature is in the expanded receptive field radius region and not in
the projected region, it will certainly contribute a vote that is less than the maximum
possible value, so the maximum value (returned on line 17) will be an overestimate.
The following two procedures, BRUTE_FORCE_BOUND_DEPTH and BOUND_DEPTH,
provide our implementation of bDj that satisfies inequality 5.8 (for a proof, see appendix A.2). They do this by projecting the six-dimensional hypothesis region R onto a
three-dimensional region in the warped coordinates, such that if a depth measurement
was found in that region, there would be a pose in R that would match that depth
feature within the receptive field radius of the depth part. We note that lines 13-15
of BOUND_DEPTH could be omitted (and the entire BRUTE_FORCE_BOUND_DEPTH
procedure could be eliminated). However, we include these lines because they make
the search run faster.
If BRUTE_FORCE_BOUND_DEPTH is omitted, BOUND_DEPTH can return only three
different values: the maximum value for the part, 0, the minimum value for the part, −r'²/(2v),
Figure 5-15: We expand the bounding regions in the Hough transform votes by adding
the receptive field radius to both sides of each region. Using summed area tables, we
can efficiently count the number of feature detections in each region. From this
diagram, we can use BOUNDVISUAL to bound each part: the projected expanded
hypothesis region for the red diagonal part contains one feature detection, so it receives the maximum bound: 0. The projected expanded hypothesis region for the
green vertical part contains no feature detections, so it receives the minimum bound:
−r'²/(2v). The projected expanded hypothesis region for the blue diagonal part contains
one feature detection, so it receives the maximum bound: 0.
Figure 5-16: In this figure, we can see that the receptive field radius changes with
scale. We therefore use the maximum receptive field radius to determine the extent
of the expanded hypothesis region when it is projected on to image coordinates.
Procedure 17 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region by brute force.
Input: the depth image Z, the receptive field radius r, the default depth difference when depth is missing d̄, a bounding box (xmin, xmax, ymin, ymax, zmin, zmax)
Output: an upper bound on the log probability for a depth part in the bounding box
1: procedure BRUTE_FORCE_BOUND_DEPTH(Z, r, d̄, xmin, xmax, ymin, ymax, zmin, zmax)
2:   dmin ← ∞
3:   for x = xmin to xmax do
4:     for y = ymin to ymax do
5:       if Zx,y is undefined then
6:         dmin ← min(dmin, d̄)
7:       else if zmin ≤ Zx,y ≤ zmax then
8:         return 0
9:       else if Zx,y < zmin then
10:        dmin ← min(dmin, zmin − Zx,y)
11:      else
12:        dmin ← min(dmin, Zx,y − zmax)
13:  return −min(dmin², r²)/(2v), where v is the variance of the part   ▷ From equation 3.8
and the default value returned when a depth measurement is undefined, −d̄. It
returns the maximum value when it is possible that a depth measurement is within
the receptive field radius of the range of depth values associated with the hypothesis
region. Otherwise, it returns the default value if there is an undefined measurement,
or the minimum value for the part if all the depth measurements are out of the
receptive field radius.
Recall that SUM_3D(T, ...) (line 20) returns the total number of depth measurements that are within the receptive field radius of the region. We illustrate the use of
the receptive field radius in figure 5-17, in which the hypothesis regions are expanded
by the receptive field radius. This 3D summed area table is constructed using discrete
steps of size ZSATstep = 5 cm.
If BRUTE_FORCE_BOUND_DEPTH is omitted, BOUND_DEPTH is computed using a
small constant number of operations that is independent of the size of R. However,
we found that the 5 cm discretization in the 3D summed area table leads to very
loose bounds, especially as the hypothesis region gets small. Since most calls to
BOUND_DEPTH are with small hypothesis regions, this increases the overall detection
Procedure 18 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region.
Input: hypothesis region R, depth part D, preprocessed image I
Output: an upper bound on the log probability of depth part D for an object located in R in the image I
1: procedure BOUND_DEPTH(R, D, I)
2:   (m, (x1, y1, z1, rx1, ry1, rz1), (x2, y2, z2, rx2, ry2, rz2)) ← R
3:   [i j k l] ← linear coefficients of D
4:   (x, y) ← the pixel location of D if the pose was at (x, y, z) = (0, 0, 1)
5:   xmin ← ⌊x1 + min(x/z1, x/z2)⌋
6:   xmax ← ⌈x2 + max(x/z1, x/z2)⌉
7:   ymin ← ⌊y1 + min(y/z1, y/z2)⌋
8:   ymax ← ⌈y2 + max(y/z1, y/z2)⌉
9:   z'min ← z1 + i + min(j·rx1, j·rx2) + min(k·ry1, k·ry2) + min(l·rz1, l·rz2)
10:  z'max ← z2 + i + max(j·rx1, j·rx2) + max(k·ry1, k·ry2) + max(l·rz1, l·rz2)
11:  d̄ ← default depth difference of D when undefined
12:  r' ← receptive field radius of D
13:  if (xmax − xmin)(ymax − ymin) < 500 then
14:    Z ← the depth image from I
15:    return BRUTE_FORCE_BOUND_DEPTH(Z, r', d̄, xmin, xmax, ymin, ymax, z'min, z'max)
16:  U ← the summed area table of the undefined depths in I
17:  r ← ⌈r'/ZSATstep⌉
18:  zmin ← ⌊(z'min − ZSATmin)/ZSATstep⌋
19:  zmax ← ⌈(z'max − ZSATmin)/ZSATstep⌉
20:  if SUM_3D(T, xmin, xmax, ymin, ymax, zmin − r, zmax + r) > 0 then   ▷ T is the 3D summed area table in I
21:    return 0   ▷ The maximum value of equation 3.8
22:  else if xmin < 1 ∨ xmax > w ∨ ymin < 1 ∨ ymax > h ∨ SUM_2D(U, xmin, xmax, ymin, ymax) > 0 then
23:    return −d̄   ▷ The value of equation 3.8 when the depth is undefined
24:  else
25:    return −r'²/(2v), where v is the variance of D   ▷ The minimum value of equation 3.8
Figure 5-17: We expand the bounding regions in the Hough transform votes by adding
the receptive field radius to the top and bottom of each region. We then expand
each of these regions further such that it is aligned to the 5 cm summed area table
increments. Using the summed area table, we can efficiently count the number of
depth features in each green region. All three green regions contain at least one
depth feature, so BOUNDDEPTH will return the maximum bound for each part: 0
(assuming BRUTEFORCEBOUND-DEPTH is not called).
time. To address this issue, we added BRUTE_FORCE_BOUND_DEPTH. Even though
BRUTE_FORCE_BOUND_DEPTH runs in time that is proportional to the number of
pixels in the region, it is a much tighter bound, so it significantly decreases detection
time.¹
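A sketch of the brute-force bound follows, with the part variance passed in explicitly (an assumption; Procedure 17 leaves it implicit) and missing depths modelled as None or NaN.

import math

def brute_force_bound_depth(depth, radius, default_diff, variance,
                            x_min, x_max, y_min, y_max, z_min, z_max):
    """Tight upper bound for one depth part over a small projected region
    (cf. Procedure 17); scans every pixel in the rectangle."""
    d_min = float("inf")
    for x in range(x_min, x_max + 1):
        for y in range(y_min, y_max + 1):
            z = depth[y][x]
            if z is None or math.isnan(z):
                d_min = min(d_min, default_diff)       # missing depth: default penalty
            elif z_min <= z <= z_max:
                return 0.0                             # a depth inside the range: best possible vote
            elif z < z_min:
                d_min = min(d_min, z_min - z)
            else:
                d_min = min(d_min, z - z_max)
    return -min(d_min ** 2, radius ** 2) / (2 * variance)   # penalty for the closest miss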
The following procedure, BOUND, combines the part bounds according to equation
5.7. Its correctness follows directly from the proofs of the correctness of BOUND_DEPTH
and BOUND_VISUAL in appendix A.
Procedure 19 Calculate an upper bound on the log probability of an object for poses within a hypothesis region.
Input: hypothesis region R, view-based object model O, a preprocessed image I
Output: an upper bound on the log probability of an object O located in R in the image I
1: procedure BOUND(R, O, I)
2:   m ← the index of the viewpoint bin model for region R
3:   (M, B) ← Om   ▷ viewpoint bin B for viewpoint bin model M
4:   (V, D) ← M   ▷ visual parts V and depth parts D
5:   return Σj BOUND_DEPTH(R, Dj, I) + Σk BOUND_VISUAL(R, Vk, I)
5.3.3 Initializing the Priority Queue
Input: an empty priority queue, and the viewpoint bin sizes of the view-based model
Output: the priority queue contains a maximum-size region for each viewpoint bin, each with its appropriate bound
A priority queue is an essential data structure in branch-and-bound search. A
priority queue contains elements associated with real-valued priorities. We use a heap
data structure to implement the priority queue. A heap allows new elements to be
efficiently added to the queue, and also allows the highest-priority element to be
efficiently found and removed from the priority queue.
¹If we had a fast algorithm for computing the maximum or minimum entry in a rectangular sub-region of a 2D table in time that does not depend on the size of the sub-region,
the code would be significantly simpler and faster. Such a method could be used to eliminate
BRUTE_FORCE_BOUND_DEPTH and its slow for-loops, as well as the coarsely-discretized 3D summed
area table used by BOUND_DEPTH, while improving the bound tightness. Such a method could also
be used to give a tighter bound than the 2D summed area tables used by BOUND_VISUAL. Using
this method would make both the depth and visual bounds tighter, especially for smaller hypothesis
regions, reaching perfect accuracy when the area of the region reaches 0. This would probably lead
to a significant improvement in running time since most of the searching work is near the leaves of
the search tree in small hypothesis regions. This means that lines 4-11 of BRANCH_AND_BOUND_STEP
could be replaced with a single line: "D ← D ∪ {R}." It would also shorten the proofs in appendix A
and eliminate the need for figures 5-15, 5-16 and 5-17.
In our implementation, the elements of the priority queue are hypothesis regions.
The priority of the hypothesis region is an upper bound on the (log) probability that
the object is located in the region.
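In Python one might realize such a max-priority queue with the standard heapq module, negating the bounds because heapq is a min-heap; this is an illustrative sketch, not the implementation used in the thesis.

import heapq

class RegionQueue:
    """Max-priority queue of hypothesis regions keyed by their upper bound."""
    def __init__(self):
        self._heap = []
        self._counter = 0            # tie-breaker so regions never get compared directly

    def push(self, bound, region):
        heapq.heappush(self._heap, (-bound, self._counter, region))
        self._counter += 1

    def pop(self):
        neg_bound, _, region = heapq.heappop(self._heap)
        return -neg_bound, region    # region with the largest upper bound

    def __len__(self):
        return len(self._heap)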
Our branch-and-bound search implementation starts with the initial hypothesis
that any viewpoint bin model of the object could be located anywhere (subject to the
current set of constraints, such as whether it is nearly upright on a table top). The
images we use for our experiments are 640 x 480 pixels, and we search within a range
of distances that can be accurately measured by the Kinect (0.5 to 3 meters from
the focal point of the camera). We set the initial ranges for rotation angles to the
ranges covered by the viewpoint bin models. Before we start the search process, we
compute upper bounds for the initial hypothesis region of each viewpoint bin model
(as described in section 5.3.5), and put them on the empty priority queue.
Procedure 20 A set of high-level hypotheses used to initialize branch-and-bound search.
Input: preprocessed image I, view-based object model O
Output: a priority queue of hypotheses Q used to initialize branch-and-bound search
1: procedure INITIAL_PRIORITY_QUEUE(I, O)
2:   Q ← empty priority queue
3:   for m = 1 to |O| do
4:     (M, B) ← Om   ▷ viewpoint bin B for viewpoint bin model M
5:     ((rx1, ry1, rz1), (rx2, ry2, rz2)) ← B
6:     R ← (m, (0, 0, 0.5, rx1, ry1, rz1), (640, 480, 3, rx2, ry2, rz2))
7:     b ← BOUND(R, O, I)
8:     add R to Q with priority b
9:   return Q

5.3.4 Constraints On The Search Space
In order to speed up the detection time (and improve accuracy by eliminating some
false positive detections), we constrain the search space to a region consistent with the
object being located on a horizontal surface, such as a table, including some margin
of error. Recall that we record two numbers along with each RGB-D image: the
height of the camera above the table h and the camera pitch rc. When we assume
that the roll angle of the camera is zero, this is enough information to define the
unique position of the table plane in space. Constraining the object to rest upright
on a table affects three of the variables in the object pose: z, rx and ry. In our choice
of constraints, x and y remain unconstrained because we assume the table plane extends
infinitely in all directions. And rz remains unconstrained because the object can be
rotated while remaining upright, as if on a turn-table. But because of perspective
projection, the roll angle rx of the object generally varies by a few degrees from 0, even
though the camera roll angle is always exactly 0. And the pitch angle ry of the object
also changes for different object positions within an image even when rc is fixed.
The intrinsic parameters of the camera consist of the focal lengths fx and fy and
the center of the image (cx, cy). We ignore radial and tangential distortion. A 3D
point in Euclidean coordinates (xe, ye, ze) is projected to the screen coordinate (x, y)
by:

x = cx + fx·xe/ze,   y = cy + fy·ye/ze.   (5.10)

The intrinsic parameters for the Kinect are fx = fy = 525 and (cx, cy) = (319.5, 239.5)
because the image size is 640 × 480.
When the camera has pitch angle rc and the object is upright on a table, then it
has Euler angles:

rx = atan2((cx − x) tan(rc), (cy − y) + fy)   (5.11)

ry = sin⁻¹( (sin(rc) + py·cos(rc)) / √(1 + px² + py²) )   (5.12)

where

px = (x − cx)/fx   (5.13)

py = (y − cy)/fy.   (5.14)
However, since there are some errors due to approximations and inaccurate sensing,
we allow an error tolerance of ±rtol for both rx and ry. The following procedure,
CONSTRAINT_CONTAINS, tests if a point is contained in the constrained region.²
Procedure 21 A test to see whether a point is in the constraint region.
Input: a point in pose space p, the camera pitch angle rc
Output: a boolean value indicating whether p is in the constraint region (upright on a horizontal surface)
1: procedure CONSTRAINT_CONTAINS(p, rc)
2:   (x, y, z, rx, ry, rz) ← p
3:   r'x ← atan2((cx − x) tan(rc), (cy − y) + fy)
4:   px ← (x − cx)/fx
5:   py ← (y − cy)/fy
6:   r'y ← sin⁻¹( (sin(rc) + py·cos(rc)) / √(1 + px² + py²) )
7:   return (r'x − rtol ≤ rx ≤ r'x + rtol) ∧ (r'y − rtol ≤ ry ≤ r'y + rtol)
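The constraint test can be sketched directly from equations 5.11 to 5.15; the Kinect intrinsics below come from the text, while the function names and the inclusion of the depth tolerance (which footnote 2 notes was omitted from the thesis code) are choices made for this sketch only.

import math

FX = FY = 525.0                 # Kinect focal lengths from the text
CX, CY = 319.5, 239.5           # image center

def table_pose_at_pixel(x, y, camera_pitch, camera_height):
    """Roll, pitch and depth implied by an upright object at pixel (x, y)
    (equations 5.11 to 5.15)."""
    px = (x - CX) / FX
    py = (y - CY) / FY
    rx = math.atan2((CX - x) * math.tan(camera_pitch), (CY - y) + FY)
    s = (math.sin(camera_pitch) + py * math.cos(camera_pitch)) / math.sqrt(1.0 + px * px + py * py)
    ry = math.asin(max(-1.0, min(1.0, s)))      # clamp against rounding error
    z = (camera_height * math.sqrt(1.0 + px * px + py * py)
         / (py * math.cos(camera_pitch) + math.sin(camera_pitch)))
    return rx, ry, z

def constraint_contains(pose, camera_pitch, camera_height, r_tol, z_tol):
    """Cf. Procedure 21; unlike the thesis pseudocode, this sketch also checks
    the depth tolerance."""
    x, y, z, rx, ry, rz = pose
    rx_t, ry_t, z_t = table_pose_at_pixel(x, y, camera_pitch, camera_height)
    return (abs(rx - rx_t) <= r_tol and
            abs(ry - ry_t) <= r_tol and
            abs(z - z_t) <= z_tol)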
The following procedures, RX_RANGE and its subprocedure RX, compute the maximum and minimum values of rx that can be found within a hypothesis region R. RX
computes the value of rx for a particular pixel location (x, y) and camera pitch angle
rc and updates the current greatest and least known rx values.
Procedure 22 Update the range of rx values for a pixel.
Input: the camera pitch angle rc, a pixel location (x, y), the current least known rx value rx min, the current greatest known rx value rx max
Output: (r'x min, r'x max) updated to the new greatest and least known values of rx
1: procedure RX(rc, x, y, rx min, rx max)
2:   rx ← atan2((cx − x) tan(rc), (cy − y) + fy)
3:   r'x min ← min(rx min, rx)
4:   r'x max ← max(rx max, rx)
5:   return (r'x min, r'x max)
RX-RANGE uses RX to test the possible extremes at each corner of the range and
also the center vertical line where a local optimum may occur.
Similarly, the following procedures, RY_RANGE and its subprocedure RY, compute
the range of values of ry that can be found within a hypothesis region R. RY computes
²However, due to an oversight, the author omitted the constraint on the depth z in the
CONSTRAINT_CONTAINS procedure. This is a bug. However, since CONSTRAINT_CONTAINS is only used
by the BRANCH_AND_BOUND_STEP procedure when the object is known to be close to the plane, the
bug only leads to a slight over-estimate of the volume of the constrained region during detection.
+
Procedure 23 Find the range of r, values for a hypothesis region.
Input: a hypothesis region R, the camera pitch angle r,
Output: the greatest and least values (r min, r' max) of rx in R
1: procedure RXRANGE(1, r,)
2: (m (x1 , y, , ziI rx,Iry1, I) (z,9,zr2y2,iz2))
3:
4:
(rx min, rx max) +-(o,
-o)
(rx min
5:
rx max) +- RX(r,, x 1 , yi, rx min rx max)
(rx mil, rx max) +- RX (rc, X 1,Y2, rx mil, rx max)
6:
(rx min, rx max)
7:
8:
(rx min, rx max)
9:
i, , rx mil, rx max)
+- RX(rc, X2, Y2rx min, rx max)
+- RX(rc, X 2
if Xi < cx < x 2 then
(rx min, rx max) +- RX(rc, cxj , yi, rx min, rx max)
10:
(rx mini, rx max) +- RX(rc,
11:
(rx min, rx max)
12:
(rx min, rx max) +- RX(r,
13:
Lcx
, Y2, rx min, rx max)
+- RX(rc, cx1 , y1, rx min, rx max)
c
, y2, rx min, rx max)
return (rx min, rxmax)
RY computes the value of ry for a particular pixel location (x, y) and camera pitch angle rc, and updates the current greatest and least known ry values.
Procedure 24 Update the range of ry values for a pixel.
Input: the camera pitch angle rc, a pixel location (x, y), the current least known ry value ry min, the current greatest known ry value ry max
Output: (r'y min, r'y max) updated to the new greatest and least known values of ry
1: procedure RY(rc, x, y, ry min, ry max)
2:   px ← (x − cx)/fx
3:   py ← (y − cy)/fy
4:   ry ← sin⁻¹((sin(rc) + py cos(rc)) / √(1 + px² + py²))
5:   r'y min ← min(ry min, ry)
6:   r'y max ← max(ry max, ry)
7:   return (r'y min, r'y max)
RY-RANGE uses RY to test the possible extremes at each corner of the range and
also several other locations where a local optimum may occur.

When the camera has pitch angle rc and the vertical height of the camera above
the center of the object on the table is h, then it has depth:

z = h √(1 + px² + py²) / (py cos(rc) + sin(rc)),    (5.15)

where px and py are defined in equations 5.13 and 5.14 above.
Procedure 25 Find the range of ry values for a hypothesis region.
Input: a hypothesis region R, the camera pitch angle rc
Output: the greatest and least values (ry min, ry max) of ry in R
1: procedure RY-RANGE(R, rc)
2:   (M, (x1, y1, z1, rx1, ry1, rz1), (x2, y2, z2, rx2, ry2, rz2)) ← R
3:   (ry min, ry max) ← (∞, −∞)
4:   (ry min, ry max) ← RY(rc, x1, y1, ry min, ry max)
5:   (ry min, ry max) ← RY(rc, x1, y2, ry min, ry max)
6:   (ry min, ry max) ← RY(rc, x2, y1, ry min, ry max)
7:   (ry min, ry max) ← RY(rc, x2, y2, ry min, ry max)
8:   yx1 ← cy + fy (cx² + fx² − 2 cx x1 + x1²) cot(rc) / fx²
9:   if y1 < yx1 < y2 then
10:    (ry min, ry max) ← RY(rc, x1, yx1, ry min, ry max)
11:  yx2 ← cy + fy (cx² + fx² − 2 cx x2 + x2²) cot(rc) / fx²
12:  if y1 < yx2 < y2 then
13:    (ry min, ry max) ← RY(rc, x2, yx2, ry min, ry max)
14:  if x1 < cx < x2 then
15:    (ry min, ry max) ← RY(rc, cx, y1, ry min, ry max)
16:    (ry min, ry max) ← RY(rc, cx, y2, ry min, ry max)
17:    yx0 ← cy + fy cot(rc)
18:    if y1 < yx0 < y2 then
19:      (ry min, ry max) ← RY(rc, cx, yx0, ry min, ry max)
20:  return (ry min, ry max)
The errors from approximations and inaccurate sensing are modeled by an error tolerance of ±ztol.
Again, in the same pattern as above, the following procedures, Z-RANGE and its
subprocedure Z, compute the range of values of z that can be found within a hypothesis region R. Z computes the value of z for a particular pixel location (x, y) and
camera pitch angle rc and updates the current greatest and least known z values.
Procedure 26 Update the range of z values for a pixel.
Input: the camera pitch angle rc, the camera height above the table h, a pixel location (x, y), the current least known z value zmin, the current greatest known z value zmax
Output: (z'min, z'max) updated to the new greatest and least known values of z
1: procedure Z(rc, h, x, y, zmin, zmax)
2:   px ← (x − cx)/fx
3:   py ← (y − cy)/fy
4:   z ← h √(1 + px² + py²) / (py cos(rc) + sin(rc))
5:   z'min ← min(zmin, z)
6:   z'max ← max(zmax, z)
7:   return (z'min, z'max)
Z-RANGE uses Z to test the possible extremes at each corner of the range and also
several other locations where a local optimum may occur.
Finally, CONSTRAINT-INTERSECT uses the above subprocedures to find the smallest hypothesis region that contains the intersection between a hypothesis region R and the
constraint region.
5.3.5 Branch-and-Bound Search
Branch-and-bound search operates by calling BRANCH and BOUND on the constrained
6D search space. BOUND provides a heuristic so that BRANCH-AND-BOUND explores
the most promising regions first, provably finding the most probable detections first.
Lines 3-11 of BRANCH-AND-BOUND-STEP include a brute-force search for small
hypothesis regions. We included this because we found that it increases the overall
speed of detection (much like our decision to use BRUTE-FORCE-BOUND-DEPTH).
Procedure 27 Find the range of z values for a hypothesis region.
Input: a hypothesis region R, the camera pitch angle rc, the camera height above the table h
Output: the greatest and least values (z'min, z'max) of z in R
1: procedure Z-RANGE(R, rc, h)
2:   (M, (x1, y1, z1, rx1, ry1, rz1), (x2, y2, z2, rx2, ry2, rz2)) ← R
3:   (zmin, zmax) ← (∞, −∞)
4:   (zmin, zmax) ← Z(rc, h, x1, y1, zmin, zmax)
5:   (zmin, zmax) ← Z(rc, h, x1, y2, zmin, zmax)
6:   (zmin, zmax) ← Z(rc, h, x2, y1, zmin, zmax)
7:   (zmin, zmax) ← Z(rc, h, x2, y2, zmin, zmax)
8:   yx2 ← cy + fy (1 + (x2 − cx)²/fx²) cot(rc)
9:   if y1 < yx2 < y2 then
10:    (zmin, zmax) ← Z(rc, h, x1, yx2, zmin, zmax)
11:    (zmin, zmax) ← Z(rc, h, x2, yx2, zmin, zmax)
12:  if x1 < cx < x2 then
13:    (zmin, zmax) ← Z(rc, h, cx, y1, zmin, zmax)
14:    (zmin, zmax) ← Z(rc, h, cx, y2, zmin, zmax)
15:  yx0 ← cy + fy cot(rc)
16:  if y1 < yx0 < y2 then
17:    (zmin, zmax) ← Z(rc, h, x1, yx0, zmin, zmax)
18:    (zmin, zmax) ← Z(rc, h, x2, yx0, zmin, zmax)
19:  return (zmin, zmax)
Procedure 28 Find the smallest hypothesis region that contains the intersection between a hypothesis region and the constraint.
Input: hypothesis region R, camera pitch angle rc, camera height above table h
Output: the smallest hypothesis region that contains the intersection between R and the constraint
1: procedure CONSTRAINT-INTERSECT(R, rc, h)
2:   (M, (x1, y1, z1, rx1, ry1, rz1), (x2, y2, z2, rx2, ry2, rz2)) ← R
3:   (rx min, rx max) ← RX-RANGE(R, rc)
4:   (ry min, ry max) ← RY-RANGE(R, rc)
5:   (zmin, zmax) ← Z-RANGE(R, rc, h)
6:   r'x min ← max(rx min − rtol, rx1)
7:   r'x max ← min(rx max + rtol, rx2)
8:   r'y min ← max(ry min − rtol, ry1)
9:   r'y max ← min(ry max + rtol, ry2)
10:  z'min ← max(zmin − ztol, z1)
11:  z'max ← min(zmax + ztol, z2)
12:  return (M, (x1, y1, z'min, r'x min, r'y min, rz1), (x2, y2, z'max, r'x max, r'y max, rz2))
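The clipping pattern on the rotation and depth intervals in CONSTRAINT-INTERSECT can be illustrated with a few lines of Python; this assumes that lower bounds are clipped with max and upper bounds with min, which is our reading of the partially legible original.

```python
def clip_interval(lo, hi, tol, region_lo, region_hi):
    """Expand a computed range by a tolerance and intersect it with the
    hypothesis region's own bounds (the pattern in CONSTRAINT-INTERSECT).
    Returns (lo, hi); an empty intersection shows up as lo > hi."""
    return max(lo - tol, region_lo), min(hi + tol, region_hi)

# Example: rx over the region is [0.5, 1.0] rad, tolerance 0.25 rad,
# and the region itself allows rx in [0.0, 1.5] rad.
print(clip_interval(0.5, 1.0, 0.25, 0.0, 1.5))   # -> (0.25, 1.25)
```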
Procedure 29 One step in branch-and-bound search.
Input: a priority queue Q, a preprocessed image I, an object O, the minimum log probability of detections m
Output: Q is modified; a new detection pose p or null
1: procedure BRANCH-AND-BOUND-STEP(Q, I, O, m)
2:   R ← remove the highest priority hypothesis region from Q
3:   if R is smaller than or equal to the minimum resolution region then
4:     M ← the viewpoint bin model corresponding to R
5:     vmax ← −∞
6:     for all poses p in a grid within R do
7:       v ← EVAL(p, M, I)
8:       if v > vmax and CONSTRAINT-CONTAINS(p, rc) then
9:         vmax ← v
10:        pbest ← p
11:    return pbest
12:  else
13:    rc ← camera pitch angle for I
14:    h ← camera height above table for I
15:    for all S in BRANCH(R) do
16:      S' ← CONSTRAINT-INTERSECT(S, rc, h)
17:      if S' is not empty then
18:        b ← BOUND(S', O, I)
19:        if b > m then
20:          add S' to Q with priority b
21:  return null
Procedure 30 Detect an object in an image by branch-and-bound search.
Input: preprocessed image I, view-based object model O, minimum log probability of detections m, the maximum number of detections to return n
Output: the set of at most n detections sorted in decreasing order of probability
1: procedure BRANCH-AND-BOUND(I, O, m, n)
2:   Q ← INITIAL-PRIORITY-QUEUE(I, O)
3:   D ← {}
4:   while Q is not empty and |D| < n do
5:     p ← BRANCH-AND-BOUND-STEP(Q, I, O, m)
6:     if p ≠ null then
7:       D ← D ∪ {p}
8:   return D
The minimum resolution region referred to in line 3 of BRANCH-AND-BOUND-STEP is 2 pixels for x and y, 2 cm for z, and 4 degrees for rx, ry and rz. The points in the grid sampled on line 6 are spaced at 1 pixel for x and y, 2 mm for z, and 1 degree for rx, ry and rz.
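For intuition, the following Python sketch captures the overall shape of the serial search in Procedures 29 and 30. The callables bound, branch, evaluate and is_leaf are placeholders standing in for BOUND, BRANCH, the grid evaluation with EVAL and CONSTRAINT-CONTAINS, and the minimum-resolution test; this is an illustrative sketch, not the thesis code.

```python
import heapq

def branch_and_bound(initial_regions, bound, branch, evaluate, is_leaf, m, n):
    """Best-first branch-and-bound sketch: pop the most promising region,
    either evaluate it exhaustively (leaf) or split it and re-queue the
    children whose bound exceeds the minimum log probability m."""
    heap = [(-bound(r), r) for r in initial_regions]   # max-heap via negation
    heapq.heapify(heap)
    detections = []
    while heap and len(detections) < n:
        neg_b, region = heapq.heappop(heap)
        if -neg_b <= m:
            break                        # nothing left above the threshold
        if is_leaf(region):
            pose, value = evaluate(region)
            if value > m:
                detections.append((value, pose))
        else:
            for child in branch(region):
                b = bound(child)
                if b > m:
                    heapq.heappush(heap, (-b, child))
    return sorted(detections, key=lambda d: d[0], reverse=True)
```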
5.3.6 Parallelizing Branch-and-Bound Search
Since the BRANCH-AND-BOUND algorithm does not run at practical speeds on a single
CPU, we parallelize it on a cloud of multicore machines. We assume that we are operating
in a shared cloud environment, in which different CPUs have different loads because
multiple tenants' virtual CPUs share a single physical CPU.

In our implementation, there are many worker cores that run the BRANCH-AND-BOUND-WORKER procedure, and there is one coordinator core that runs the PARALLEL-BRANCH-AND-BOUND procedure. Each worker repeatedly calls BRANCH-AND-BOUND-STEP on its own private priority queue. When a worker's queue is empty, the coordinator tells another worker to delegate part of its queue to the empty worker. In this
way, the workers are kept busy nearly all of the time.
Each time a worker reaches a detection (a leaf in its search tree), it sends it back to
the coordinator. This parallel implementation of branch-and-bound no longer carries
the guarantee that the detections will be found in decreasing order of probability.
However, the coordinator maintains a sorted list of detections found so far. In order
to keep the total number of search node evaluations near the minimum necessary
(as is the case in the serial BRANCH-AND-BOUND implementation), the coordinator
also informs workers when the current global minimum log probability increases,
which will happen once the requested number of detections n has already been found.

For brevity, we do not include the pseudo-code for distributing the image to the
workers or pre-processing the image; we rather assume that the pre-processed
image I is already available to each worker. We also do not include pseudo-code for
terminating BRANCH-AND-BOUND-WORKER, but this is relatively straightforward to
add.
There are five different types of messages: a status update message, a minimum
log probability message, a delegate-to message, a delegate-from message, and a leaf
message.
A status update message is a triple (q, r, wfrom) that is sent periodically from each
worker to the coordinator, where q is the current number of hypothesis regions on the
worker's priority queue, r is the current rate at which the worker has been evaluating
hypothesis regions (measured in evaluations per second), and, when wfrom is not null,
it indicates that the worker has just received a delegation of work from worker wfrom.
A minimum log probability message is a real number m that is always sent from
the coordinator to a worker in response to a status update message. This number
defines the threshold below which hypotheses can be safely discarded. For example,
imagine that in PARALLEL-BRANCH-AND-BOUND, n = 1 and m starts at −∞. As
soon as a worker finds a leaf whose log probability is m' and sends it back to the
coordinator, the workers no longer need to consider any hypothesis region whose log
probability is less than m', so the coordinator updates m ← m' and notifies each
worker in response to its next status update message. The WORKER-STATUS-UPDATE
procedure handles the worker's side of the communication of these two messages.
Procedure 31 Send a status update from a worker for parallel branch-and-bound.
Input: the current size of the priority queue q, the current rate of evaluating hypotheses r, the worker that just delegated to this worker wfrom
Output: the current minimum log probability m
1: procedure WORKER-STATUS-UPDATE(q, r, wfrom)
2:   send a status update message (q, r, wfrom) to the coordinator
3:   wait for a new minimum log probability message
4:   m ← receive the new minimum log probability message
5:   return m
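A worker's side of this exchange can be sketched in a few lines of Python. The channel object with blocking send and recv calls is a stand-in for whatever transport is used (the thesis uses OpenMPI); it is a placeholder, not a real API.

```python
def worker_status_update(channel, queue_size, rate, w_from):
    """Sketch of Procedure 31. `channel` is any object with blocking
    send(msg) / recv() toward the coordinator (e.g., a socket or an MPI
    communicator wrapper); it is illustrative, not the thesis interface."""
    channel.send(("status", queue_size, rate, w_from))
    kind, m = channel.recv()           # blocks until the coordinator replies
    assert kind == "min_log_prob"
    return m                           # new global pruning threshold
```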
A delegate-to message is a pair (wto, rto) sent from the coordinator to a worker
to tell it to delegate some part of the contents of its private priority queue to worker
wto, where rto is the last reported hypothesis evaluation rate from worker wto. The
worker that receives a delegate-to message decides what fraction of its priority queue to
hand over so as to maximize the productivity of the pair of workers based on their evaluation
rates, assuming that the delegated hypotheses have the same branching factor as the
hypotheses that are not delegated.
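The exact expression for the delegated fraction is not legible in this reproduction; a natural rate-proportional choice, which we assume here purely for illustration, is sketched below.

```python
import math

def delegation_count(queue_size, r_self, r_to):
    """How many of this worker's queued regions to hand to worker w_to.
    A rate-proportional split (our assumption, not necessarily the thesis's
    exact formula): the receiving worker gets a share matching its fraction
    of the combined evaluation rate."""
    return math.ceil(queue_size * r_to / (r_self + r_to))

# Example: 1000 queued regions, both workers equally fast -> delegate 500.
print(delegation_count(1000, 50.0, 50.0))   # -> 500
```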
A delegate-from message is a vector R of hypothesis regions. A delegate-from
message is sent from a worker as soon as it receives a delegate-to message from the
coordinator, except when the INITIAL-PRIORITY-QUEUE is sent from the coordinator
to the first worker that sends a status update.
A leaf message is a pose p for a detection at the leaf of the search tree for one of
the workers. Each newly found leaf is immediately sent back to the coordinator and
added to the global set of detections D.
Procedure 32 A worker for a parallel branch-and-bound search for an object in an image.
Input: preprocessed image I, view-based object model O, minimum log probability of detections m
1: procedure BRANCH-AND-BOUND-WORKER(I, O, m)
2:   r ← 50
3:   Q ← empty priority queue
4:   while true do
5:     m ← WORKER-STATUS-UPDATE(|Q|, r, null)
6:     while |Q| = 0 or there is a new delegate-to or delegate-from message do
7:       if there is a new delegate-to message then
8:         (wto, rto) ← receive the delegate-to message
9:         R ← remove the top ⌈|Q| · rto / (r + rto)⌉ hypothesis regions from Q
10:        send a delegate-from message R to wto
11:        m ← WORKER-STATUS-UPDATE(|Q|, r, null)
12:      else if there is a new delegate-from message then
13:        R ← receive the delegate-from message
14:        for all hypothesis regions R' in R do
15:          add R' to Q with priority BOUND(R', O, I)
16:        wfrom ← the worker that sent the delegate-from message
17:        m ← WORKER-STATUS-UPDATE(|Q|, r, wfrom)
18:    t ← the current time (in seconds)
19:    c ← 0
20:    while c < 1000 and there is not a new delegate-to message do
21:      p ← BRANCH-AND-BOUND-STEP(Q, I, O, m)
22:      if (p ≠ null) ∧ (EVAL(p, M, I) > m) then
23:        send a leaf message p to the coordinator
24:      c ← c + 1
25:    t' ← the current time (in seconds)
26:    r ← c / (t' − t)
A worker is represented by the coordinator as a 4-tuple (q, r, d, a), where q is the
number of remaining hypotheses in its queue that it last reported (initially 0), r is the rate
at which that worker has evaluated hypotheses in its queue, d is a boolean indicating
whether the worker is currently involved in a delegation, either sending to or receiving
from another worker (initially false), and a is a boolean indicating whether the worker
is currently available (initially false). Because we are operating in a shared
cloud environment, some workers may be much slower than others because
they are sharing a load with other users; this is why we keep track of the rate r of
evaluating hypotheses. In fact, some workers may not even become available during
the course of a computation, which is why we track in a whether a worker has checked in
with the coordinator. However, once a worker has checked in and been marked as
available, we assume that it will complete its computation without failing.
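The coordinator's bookkeeping described above can be sketched as a small Python record plus the pairing rule used when delegations are triggered; the field names and the queue-size threshold of 5 follow the reconstructed Procedure 33, but the code itself is illustrative, not the thesis implementation.

```python
from dataclasses import dataclass

@dataclass
class WorkerState:
    """Coordinator-side record for one worker, mirroring the 4-tuple
    (q, r, d, a) described above."""
    queue_size: int = 0        # q: hypotheses remaining, as last reported
    rate: float = 0.0          # r: reported evaluations per second
    delegating: bool = False   # d: currently sending or receiving a delegation
    available: bool = False    # a: has checked in with the coordinator

def pick_delegation(workers):
    """Pair an idle worker with the most loaded one (the delegation rule in
    Procedure 33 as reconstructed). Returns (to_index, from_index) or None."""
    idle = [i for i, w in enumerate(workers)
            if w.available and not w.delegating and w.queue_size == 0]
    busy = [i for i, w in enumerate(workers)
            if w.available and not w.delegating and w.queue_size > 5]
    if not idle or not busy:
        return None
    to_i = max(idle, key=lambda i: workers[i].rate)
    from_i = max(busy, key=lambda i: workers[i].queue_size)
    return to_i, from_i
```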
In our experiments, we used 24-core virtual CPUs on an OpenStack cluster shared
by many different research groups at our lab. We used OpenMPI to implement the
message passing and coordination.
Although each of our workers runs on a single CPU core, this algorithm would also be well
suited to parallelization if a GPU were available at each worker. In particular,
we would store the preprocessed image in the GPU's memory and parallelize line 5 of
BOUND so that each visual part and each depth part is bounded in parallel within a GPU
kernel. The worker CPU would take care of the priority queue and the constraint
evaluations. This would lead to low communication overhead between the GPU and
the CPU, since the preprocessed image needs to be sent to the GPU only once, just before a call to
BRANCH-AND-BOUND-WORKER when a new detection is beginning.
5.4 Non-Maximum Suppression
Two detections that are very close to each other in the space of poses tend to have
similar log probabilities. This often leads to many slight variations on a single detection corresponding to a single object in the world. To shorten the list of
detections returned by BRANCH-AND-BOUND or PARALLEL-BRANCH-AND-BOUND, we
remove the detections that are not local maxima. We do this by calculating the bounding
box of each object (the details of this calculation are omitted for brevity) and removing
detections that overlap with other detections of greater log probability.
Procedure 33 Coordinate workers to perform branch-and-bound search in parallel.
Input: a preprocessed image I, an object O, the minimum log probability of detections m, the maximum number of detections to return n
Output: the set of detections sorted in decreasing order of probability
1: procedure PARALLEL-BRANCH-AND-BOUND(I, O, m, n)
2:   W ← a vector of workers with |W| = the number of worker CPU cores
3:   wait for a status update message
4:   (q, r, wfrom) ← receive the status update message
5:   let wto be the worker that sent the message
6:   send a minimum log probability message m to wto
7:   the bit indicating whether wto is available ← true
8:   the bit indicating whether wto is delegating ← true
9:   delegate INITIAL-PRIORITY-QUEUE(I, O) to worker wto
10:  D ← {}
11:  while there is a worker in W with d ∨ (q > 0) do
12:    if there is a new leaf message then
13:      p ← receive the leaf message
14:      D ← D ∪ {p}
15:      if |D| ≥ n then
16:        m ← the log probability bound of the nth best detection in D
17:    if there is a new status update message then
18:      (q, r, wfrom) ← receive the status update message
19:      ws ← the worker that sent the status update message
20:      send m to ws
21:      the number of remaining hypotheses for ws ← q
22:      the rate of hypothesis evaluation for ws ← r
23:      the bit a indicating whether ws is available ← true
24:      if wfrom ≠ null then
25:        the bit indicating whether ws is delegating ← false
26:        the bit indicating whether wfrom is delegating ← false
27:    while there is a worker in W with a ∧ (¬d) ∧ (q = 0) and there is a worker in W with a ∧ (¬d) ∧ (q > 5) do
28:      wto ← the worker with a ∧ (¬d) ∧ (q = 0) that has the greatest r
29:      wfrom ← the worker with a ∧ (¬d) that has the greatest q
30:      the bit indicating whether wfrom is delegating ← true
31:      the bit indicating whether wto is delegating ← true
32:      send a delegate-to message (wto, rto) to wfrom
33:  return D
We use the standard intersection over union (IoU) overlap metric (equation 1.1) to determine whether two bounding boxes overlap.
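The following Python sketch mirrors Procedure 34 with an explicit IoU test; the 0.5 overlap threshold is an illustrative choice of ours, since the thesis text does not state the threshold here.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_maximum_suppression(detections, bbox_of, iou_threshold=0.5):
    """Sketch of Procedure 34: keep a detection only if its bounding box does
    not overlap (IoU above threshold) any higher-probability detection already
    kept. `detections` must be sorted by decreasing log probability; the 0.5
    threshold is illustrative, not a value taken from the thesis."""
    kept = []
    for d in detections:
        if all(iou(bbox_of(d), bbox_of(s)) <= iou_threshold for s in kept):
            kept.append(d)
    return kept
```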
Procedure 34 Non-maximum suppression: remove detections that are not local maxima.
Input: a vector of detections D sorted in decreasing order of log probability, the object model O for those detections
Output: a set of detections S without the non-maxima
1: procedure NON-MAXIMUM-SUPPRESSION(D, O)
2:   S ← {}
3:   for all d ∈ D do
4:     m ← true
5:     for all s ∈ S do
6:       if the bounding box of d overlaps with that of s then
7:         m ← false
8:     if m then    ▷ d is a local maximum
9:       S ← S ∪ {d}
10:  return S
Chapter 6
Experiments
6.1 Dataset
We test our object detection system using a set of 13 object instances: downy, cascade,
frenchs, jiffy, cereal, printed-tiki, tiki, campbells, sugar-jar, coffee-jar, quaker, sardine
and spool. Table 6.1 provides some details about the objects and their meshes, and
table 6.2 gives images of each of the objects and meshes.
We collected RGB-D images of each object using a PR2 robot. Each image contained only one instance of the object to be detected, without any occlusion, and no
instances of any of the other objects. To take the pictures, we drove the robot around
a table with the object of interest and other background objects on it. We captured
RGB-D images at various positions around the table. Along with each image, we
also stored the precise height of the camera and the pitch angle1 of the camera, as
measured by the PR2 proprioception sensors. We labeled all of the images containing
each object of interest with 2D axis-aligned bounding boxes.
There were also another 36 background images (18 training, 18 hold-out), collected
using the same procedure. The background images are similar in composition, but
none of the objects in them are from the set of objects of interest. This background
set was used for additional negative examples in the experiments.
The images were then divided into a training set and a hold-out set by alternating
through the sequence of the images taken.

¹Changing the pitch angle causes the camera to look up and down.
Table 6.1: Detailed information about the objects and 3D mesh models used in the experiments (the per-object table entries are not legible in this reproduction). *The tiki and printed-tiki objects share an identical 3D mesh: the mesh was hand-built to resemble the original tiki glass, and the printed-tiki glass was printed from this exact mesh using a 3D printer.
Table 6.2: Images of the objects and 3D mesh models used in the experiments.
For the purposes of evaluating pose accuracy in this thesis, we also manually
added full 6D pose ground truth labels for the objects of interest in each of the images.
Acquiring these labels involved a significant amount of human effort, and may be
somewhat error-prone. However, we do not expect that the average user training a
new object instance would need to invest the resources necessary to add these labels
to their data set.
6.2 Setting Parameters
For each object, we experiment with independently varying 12 different parameters as
described in section 4.2.1. The following plots show how the average precision for each
object varies as the value of one parameter is varied at a time while the other values
are held constant at their default values as given in table 6.3. We acknowledge that
we could have obtained higher accuracy values by allowing more than one parameter
value to vary at the same time. However, we decided to restrict our search to limit
the computational time required for the experiments. Figures 6-1 through 6-12 show
the effect of varying each parameter along with a brief intuition as to the reason behind
each observed trend.
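The sweep itself is the usual one-at-a-time loop; a minimal Python sketch follows, with placeholder parameter names and values (they are not the actual entries of table 6.3).

```python
# One-at-a-time parameter sweep, as described above: every parameter is held
# at its default while a single parameter takes each of its experimental
# values in turn. The dictionaries below are illustrative placeholders.
defaults = {"n_V": 100, "n_D": 100, "r_V": 5}
sweeps = {"n_V": [0, 50, 100, 150, 200], "r_V": [4, 6, 8, 10]}

def run_sweep(evaluate_average_precision):
    results = {}
    for name, values in sweeps.items():
        for value in values:
            params = dict(defaults)      # copy the defaults
            params[name] = value         # vary exactly one parameter
            results[(name, value)] = evaluate_average_precision(params)
    return results
```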
6.3 Results
Table 6.4 gives a summary of the average precision results for each class. In particular,
we compare with the deformable part models (DPM) of Felzenszwalb et al. [10]. We
used the training images described in table 6.1 as input to DPM, and the hold-out
images for testing. The DPM algorithm outputs 2D bounding boxes, rather than full
3D poses as our detector does.

Table 6.5 is a confusion matrix quantifying how well our algorithm can be used in
a setting in which there may be multiple objects of interest in a scene with other
background clutter. In the experiments used to construct this confusion matrix, a
false detection of the cereal box in the background of an image containing the Downy
bottle, for example, would count towards a confusion between the Downy bottle and
the cereal box, even if the detection had no overlap with the actual Downy bottle in
the image.
Table 6.3: Parameter values used in experiments. The default value is used in all experiments, except when that particular parameter is being independently varied to the experimental values. The parameters, one per figure 6-1 through 6-12, are the number of training images n, the number of visual parts nV, the number of depth parts nD, nV and nD varied together, the receptive field radii rV (pixels) and rD (m), the maximum part variances vV max (pixels²) and vD max (m²), the rotational bin width rw, the minimum edge probability threshold, the camera height tolerance htol (m), and the camera pitch tolerance rtol. (The default and experimental values are not legible in this reproduction.)
Figure 6-1: Although accuracy is affected by the number of training images, it is
difficult to say whether increasing to more than 100 images gives any additional
improvement. We guess that around 100 examples may be enough to support a
reasonably robust least squares fit for each of the features.
Figure 6-2: As the number of visual parts increases while the number of depth parts
remains constant, the accuracy generally increases because the edges of the object
are represented more completely. Two notable exceptions are the tiki glass and the
domino sugar jar, each of whose mesh models match the real objects in the general
shape (so depth features work well), but the edges on the silhouettes of the meshes
are not accurate enough to provide a good match to images of real objects.
Figure 6-3: We can see from this graph that, for a fixed number of visual parts,
50 depth parts gives a significant improvement over 0. For more than 50 depth
features, we see that objects without flat horizontal surfaces on top tend to increase
or remain at the same accuracy, while objects that do have flat horizontal surfaces
tend to decrease in accuracy. This behavior is explained by the fact that in many
of the training images, a box with a horizontal surface was positioned close to the
camera, and since the Kinect measures shorter depths with higher precision, that
surface matched better to the horizontal surfaces in the view-based model, causing
false-positive detections.
Figure 6-4: As the number of visual and depth parts both increase simultaneously, the
accuracy increases slightly because the shape and appearance of the object is more
accurately modeled with more features in the model. The most notable exception
is the jiffy muffin box. The main cause for the low accuracy of this detector is that
another horizontal surface at the same height as the jiffy box was located close to
the camera in some cases. Nearer depth measurements from the Kinect camera have
higher precision, so this nearer horizontal surface is usually favored, in spite of the
poor matching of the edge features.
Increasing the number of visual and depth parts would generally give higher-accuracy
view-based models; however, the time to compute the bound increases proportionally
with the number of parts. Usually it is also true that increasing the time to compute
the bound slows down the whole searching process.
Figure 6-5: As the receptive field radius increases, the accuracy generally increases,
except in models learned from poor meshes like the tiki-glass. Since the edges in the
rendered mesh are usually a poor fit to images of the real object, a larger receptive
field makes the model more flexible to admit false positive detections. The total
detection time increases with increasing receptive field radius because the bound
becomes looser.
Figure 6-6: Increasing the receptive field radius for depth parts significantly increases
accuracy; we therefore set the default value (rD = 0.02 m) significantly above the
parameter values we tried here. The reason for this is that the linear model does not
capture the exact depth of the surface of the object, so it is useful to have less
penalty for error within some reasonable margin.
Figure 6-7: As the maximum visual part variance increases, accuracy increases
slightly, because if it is too low, some of the important distinguishing edge parts
are not selected for the model. The notable exception to this trend is the jiffy muffin
box, since it is one of the smallest of the models, so the edge features are closer to
the center of the object. Edge parts near the center of the object tend to have lower
variance, so increasing the maximum edge position variance admits the selection of
edge features that are a poor match to the true object.
Figure 6-8: Varying maximum depth part variance has little effect on the overall
accuracy, except for the sardine can. We believe this is due to the small size of the
sardine can compared to the other objects in our experiments. The small size of
the sardine can means that rotations around the center of the object tend to have a
smaller effect on the variance of the depth compared to the other objects. Another
effect of the small size of the sardine can may be that there is more error introduced
by scaling the synthetic images more.
Figure 6-9: As the size of the viewpoint bins increases, accuracy decreases, because
the linear model does not remain close to the true appearance of the object as it
rotates over such a large viewpoint bin. Poor hand-made models such as the tiki
glass are an exception to this trend.
However, models with smaller viewpoint bins also have many more viewpoint bins, so
the total learning time increases. The learning time increases proportionally with the
number of viewpoint bins, and the number of viewpoint bins increases as O(1/w³) as
the width w of the viewpoint bins decreases in all 3 rotational dimensions.
Figure 6-10: If the edge detector threshold is too low, then too many spurious edges
are detected, so there are more false positives. If the edge detector threshold is too
high, then some of the true edges of the object are missed, so accuracy decreases.
Some of the view-based models are more accurate for a higher edge threshold and
some are more accurate for a lower edge threshold. We have found that the optimal
edge detector threshold is highly dependent on the brightness, color and texture of the
object in comparison to the background. Objects with significant contrast against
the background can be detected with lower edge thresholds, admitting fewer false
positive detections.
Figure 6-11: As the search space increases with higher tolerance for errors in the
measurement of the camera's height, the overall accuracy decreases and the detection
time increases. The tiki glass is an outlier in this trend, since extra variability in
the height of the object above the table often allows a better match between the
view-based model learned from an inaccurate mesh and the actual RGB-D image of
the object.
Figure 6-12: As the search space increases with higher tolerance for errors in the rotation of the camera, the overall accuracy decreases and the detection time increases.
This trend is shown in all object classes except those which already have perfect (or
near-perfect) detection accuracy (the Downy laundry detergent bottle, the French's
mustard bottle and the printed tiki glass).
Table 6.4: Average precision for each object, compared with the detector of Felzenszwalb et al. [10]. In these experiments, each object detector was run on the set of images containing the object as well as the set of background images containing none of the objects. In order to obtain this table, we chose the setting of a single parameter that yielded the highest average precision on the set of training images, and, for validation, we also report the average precision on the set of images that were held out for testing. All parameters other than the one explicitly modified are set to the default values as specified in table 6.3. (The per-object parameter settings and precision values are not legible in this reproduction.)
Table 6.6 is another confusion matrix that just deals with possible confusions
between true objects of interest (without considering background clutter). The false
detection described in the previous paragraph would not count if it did not overlap
the actual Downy bottle in the image.
Tables 6.7 and 6.8 show histograms of the deviation between the hand-labeled
ground truth poses and the detected poses. In these histograms, 80% of the objects
are localized to within 1 cm of the hand-labeled ground truth position, and 80% are
within 15 degrees of the hand-labeled ground truth rotation.
Table 6.5: A confusion matrix in which false positives of the predicted object are
allowed to be found anywhere in the full image of the actual object (including in the
background clutter). Empty cells denote 0.
Table 6.6: A confusion matrix in which the search for the predicted object is restricted
to a bounding box surrounding the actual object. Empty cells denote 0.
Table 6.7: These histograms show the error in the predicted pose for objects that are
asymmetric about the z axis. The first column shows histograms of the distance, in
centimeters, between the detected center of the object and the hand-labeled ground
truth center of the object. The second column shows the angle difference, in degrees,
in the rz angle (that is, as if the object were on a turntable) between the detected
and ground truth poses.
Table 6.8: These histograms show the error in the predicted pose for objects that
are symmetric about the z axis. These histograms show the distance, in centimeters,
between the detected center of the object and the hand-labeled ground truth center
of the object.
Figure 6-13: This plot shows that adding more 24-core processors to the set of workers
available for the detection task decreases the detection time until about twenty 24-core
processors, after which point detection time increases slightly. This behavior is probably
due to the extra overhead in initializing and synchronizing more MPI processes.
6.3.1 Speed
Figure 6-13 shows how the running time to detect an object in a single image is
affected by adding more CPUs to help with the process of detection.
Chapter 7
Conclusion
In this thesis, we have introduced a new detection system that searches for the global
maximum probability detection of a rigid object in a 6 degree-of-freedom space of
poses. The model representation makes use of both visual and depth features from
an RGB-D camera. In principle, we could have omitted the depth parts of our model,
and our system could be used to detect shiny or transparent objects, for which it is
difficult to measure depth information.
In addition to introducing a new model representation, we have also developed a
corresponding learning algorithm that is simple and efficient. The algorithm learns
primarily from synthetic images rendered from a 3D mesh, which greatly reduces the
amount of work necessary to train a new model.
7.1 Future Work
There are many possible directions of future work. We will briefly discuss a few of
the most interesting and promising directions.
In order to improve accuracy, we can use visual features with more than just 2
values (as discussed in section 5.1). We believe the 3-valued visual features, in which
an edge may be absent, weak or strong at each pixel, will yield a significant increase
in accuracy. This will involve some parameter tuning to choose the best thresholds
and corresponding probabilities for each threshold. We could naturally extend this
direction of inquiry to 4-valued visual features and beyond.
Another important future direction of research to improve accuracy will be to
augment our model representation to include a weight for each depth and visual
part, according to which part is most informative in distinguishing the object. If
we had access to scanned objects with realistic textures in realistic synthetic scenes
for training, we could use synthetically rendered images as input to a support vector machine (SVM) to learn optimal weights that separate positive examples from
negative examples. This technique is particularly promising for our system because
our branch-and-bound search finds the global maximum detections, which correspond
exactly to the support vectors in the SVM. SVM-based detection systems such as
that of Felzenszwalb et al. [10] often sample a few difficult negative examples because
the training set would be too large if it included every possible object pose in the
set of negative images. These difficult negative examples are chosen during the SVM
learning process because they are likely to be the true support vectors for the full
training set (or they are similar to the true support vectors). We may be able to
leverage the fact that our detection algorithm can find global maximum detections
in negative images by incorporating these new detections as support vectors in each
iteration of SVM learning.
In this thesis we restricted our inquiry to specific objects, rather than generic
object classes. However, our representation already has the capability to model some
amount of class variability.
If we had access to a set of visually-similar scanned
3D mesh models of objects within the same class, and if we carefully aligned these
meshes, we believe we could directly use synthetic rendered images of all of these
objects simultaneously to learn a model that captures some class variability.
One other important topic we have not addressed is object occlusion. Our current
models inherently deal with some missing part detections, yielding a small amount
of robustness to object occlusion. However, it is important for us to treat occlusion
explicitly, especially in the case when the occluder is known (i.e., either the object is
cropped by the boundary of the image, or occluded by a known object such as the
robot gripper).
Along similar lines, we could learn models for articulated objects if we had access
to parameterized meshes. For example, if we could generate a mesh model for any
angle of a pair of scissors opening and closing, we could add this as another dimension
in the space we search, such that it becomes a 7-dimensional configuration space. We
could imagine extending this idea to more and more dimensions, modeling natural
or deformable objects, etc. However, a crucial challenge would be to represent the
search space such that it can be searched efficiently.
Appendix A
Proofs
A.1 Visual Part Bound
We will prove that the bounding procedure for visual parts is a valid bound. We
re-write the requirement for the visual part bounding function (inequality 5.9) more
formally in terms of EVAL-VISUAL (procedure 12) and BOUND-VISUAL (procedure 16):

∀p ∈ R : BOUND-VISUAL(R, V, I) ≥ EVAL-VISUAL(p, V, I).    (A.1)

We will prove that inequality A.1 holds.
First, we prove that x'min ≤ p'x ≤ x'max, where x'min is from line 4 of BOUND-VISUAL,
x'max is from line 6 of BOUND-VISUAL and p'x is from line 4 of EVAL-VISUAL. This is
evident since x'min is calculated in the same way as p'x, except that a min operation
ensures that the least term for any pose p in the hypothesis region R is added (whether
the constants are positive, zero or negative). Therefore, x'min ≤ p'x holds for any pose
p in the region R. A similar argument can be used to show p'x ≤ x'max. Therefore, we
can conclude that x'min ≤ p'x ≤ x'max.

In a similar way, it can be shown that y'min ≤ p'y ≤ y'max, where y'min is from line
8 of BOUND-VISUAL, y'max is from line 10 of BOUND-VISUAL and p'y is from line 4 of
EVAL-VISUAL.

Next, we prove that xmin ≤ round(px) ≤ xmax, where xmin is from line 5 of
BOUND-VISUAL, xmax is from line 7 of BOUND-VISUAL and px is from line 5 of
EVAL-VISUAL. We can see that xmin ≤ round(px) using the fact that x'min ≤ p'x
(proved above), the fact that x1 ≤ x2 by the definition of a hypothesis region,
and the fact that we use the min operator and the floor operator. By similar
reasoning, we can also see that round(px) ≤ xmax. Therefore, we know that
xmin ≤ round(px) ≤ xmax.

Again, we can follow a similar line of reasoning to show that ymin ≤ round(py) ≤
ymax, where ymin is from line 9 of BOUND-VISUAL, ymax is from line 11 of BOUND-VISUAL
and py is from line 5 of EVAL-VISUAL.

Another way to rephrase these results is that the projected expected pixel location
(round(px), round(py)) is contained in the rectangular region in the image plane (x, y)
where xmin ≤ x ≤ xmax and ymin ≤ y ≤ ymax. However, we note that this rectangular
region is a superset of the range of pixel locations of the visual feature for poses
p ∈ R. We will refer to the set difference between the superset and the true range of
pixel locations as the image plane difference region.
Now we prove that if there exists a pose p* = (x*, y*, z*, r*x, r*y, r*z) ∈ R such that
d z*² < r'² then SUM_2D(S, xmin − rmax, xmax + rmax, ymin − rmax, ymax + rmax) > 0,
where d is the squared distance from the kth distance transform as on line 7 of
EVAL-VISUAL(p*, V, I), r' is the receptive field radius of the visual feature, rmax is the
scaled receptive field radius from line 13 of BOUND-VISUAL and S is the kth summed
area table as on line 15 of BOUND-VISUAL. If there is such a pose p*, then, by the definition
of the distance transform, there must be a visual feature of kind k in the image with
a pixel distance of at most r'/z* from the projected expected pixel location for the visual part
(round(px), round(py)), where px and py are from line 5 of EVAL-VISUAL(p*, V, I). We
also know that rmax ≥ r'/z* since we use the ceiling operator and divide by the minimum
z1 in the region R to calculate rmax on line 13 of BOUND-VISUAL. Therefore the
circle on the image plane with pixel radius r'/z* centered at (round(px), round(py))
must be contained in the rectangular region in the image plane (x, y) where xmin −
rmax ≤ x ≤ xmax + rmax and ymin − rmax ≤ y ≤ ymax + rmax. Then, since there is
a visual feature in that circle, the definition of the summed area table implies that
SUM_2D(S, xmin − rmax, xmax + rmax, ymin − rmax, ymax + rmax) > 0, which is the condition
of the if statement in BOUND-VISUAL.
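For readers unfamiliar with summed-area tables, the following Python sketch shows the kind of SUM_2D rectangle query used in this argument: a count of features inside an axis-aligned rectangle obtained from a few table lookups (Crow [6]). It is illustrative only and assumes a binary per-kind feature mask.

```python
import numpy as np

def build_sat(feature_mask):
    """2D summed-area table of a binary feature mask (1 where a visual
    feature of kind k was detected); entry [y, x] sums rows <= y, cols <= x."""
    return feature_mask.cumsum(axis=0).cumsum(axis=1)

def sum_2d(sat, x_lo, x_hi, y_lo, y_hi):
    """Number of features with x_lo <= x <= x_hi and y_lo <= y <= y_hi,
    by inclusion-exclusion on the table (a sketch of SUM_2D, not the
    thesis code; bounds are assumed to lie inside the image)."""
    total = sat[y_hi, x_hi]
    if x_lo > 0:
        total -= sat[y_hi, x_lo - 1]
    if y_lo > 0:
        total -= sat[y_lo - 1, x_hi]
    if x_lo > 0 and y_lo > 0:
        total += sat[y_lo - 1, x_lo - 1]
    return total
```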
In this case, where d z*² < r'² for p* ∈ R, it follows directly that
−min(d z*², r'²)/(2v) ≤ 0 is a valid bound on line 10 of EVAL-VISUAL(p*, V, I) and on line
17 of BOUND-VISUAL. If, on the other hand, d z*² ≥ r'², it is still possible that
BOUND-VISUAL will return 0 on line 17 if there is a visual feature in the image plane
difference region, which is still a valid bound. Finally, if the pixel location of the visual
feature lies entirely outside of the rectangular region, then −min(d z*², r'²)/(2v) = −r'²/(2v) is a
valid bound on line 10 of EVAL-VISUAL and on line 20 of BOUND-VISUAL. Therefore,
we have covered both cases of the if statement in BOUND-VISUAL, so we can conclude
that the bound is valid: ∀p ∈ R : BOUND-VISUAL(R, V, I) ≥ EVAL-VISUAL(p, V, I).

A.2 Depth Part Bound
We will prove that the bounding procedure for depth parts is a valid bound. We
re-write the requirement for the depth part bounding function (inequality 5.8) more
formally in terms of EVAL-DEPTH (procedure 11) and BOUND-DEPTH (procedure 18):

∀p ∈ R : BOUND-DEPTH(R, D, I) ≥ EVAL-DEPTH(p, D, I).    (A.2)

We will prove that inequality A.2 holds.
First, we prove that xmin ≤ px ≤ xmax, where xmin is from line 5 of BOUND-DEPTH,
xmax is from line 6 of BOUND-DEPTH and px is from line 6 of EVAL-DEPTH. This is
evident since xmin is calculated in the same way as px, except that a min operation
ensures that the least term for any pose p in the hypothesis region R is added (whether
x is positive, zero or negative). Therefore, xmin ≤ px holds for any pose p in the region
R. A similar argument can be used to show px ≤ xmax. Therefore, we can conclude
that xmin ≤ px ≤ xmax.

In a similar way, it can be shown that ymin ≤ py ≤ ymax, where ymin is from line
7 of BOUND-DEPTH, ymax is from line 8 of BOUND-DEPTH and py is from line 6 of
EVAL-DEPTH.
Another way to rephrase these results is that the projected pixel location (round(px), round(py))
is contained in the rectangular region in the image plane (x, y) where xmin ≤ x ≤ xmax
and ymin ≤ y ≤ ymax. However, we note that this rectangular region is a superset of
the range of pixel locations of the depth feature for poses p ∈ R; we call this the
true image plane region. We will refer to the set difference between the superset and
the true image plane region as the image plane difference region.
Next, we prove that z'min ≤ md ≤ z'max, where z'min is from line 9 of BOUND-DEPTH,
z'max is from line 10 of BOUND-DEPTH and md is from line 4 of EVAL-DEPTH. This is
evident since z'min is calculated in the same way as md, except that a min operation
ensures that the least term for any pose p in the hypothesis region R is added (whether
the constants are positive, zero or negative). Therefore, z'min ≤ md holds for any pose
p in the region R. A similar argument can be used to show md ≤ z'max. Therefore,
we can conclude that z'min ≤ md ≤ z'max.
We continue the proof as if lines 13-15 were omitted from BOUND-DEPTH; we will
address BRUTE-FORCE-BOUND-DEPTH afterward.

Now we prove that if there exists a pose p* = (x*, y*, z*, r*x, r*y, r*z) ∈ R such that
(d − md)² < r² then SUM_3D(T, xmin, xmax, ymin, ymax, zmin − r, zmax + r) > 0, where
d is the depth from the depth image D at the projected pixel location (px, py) as
on line 6 of EVAL-DEPTH(p*, D, I), md is from line 4 of EVAL-DEPTH(p*, D, I), r
is from line 17 of BOUND-DEPTH, zmin is from line 18 of BOUND-DEPTH, and zmax
is from line 19. If there is such a pose p*, then it follows from the inequalities
proven above that z'min − r' ≤ d ≤ z'max + r'. Then it follows that zmin zSATstep +
zSATmin − r zSATstep ≤ d ≤ zmax zSATstep + zSATmin + r zSATstep (this expanded range
of depth values is a superset of the previous inequalities; we will refer to the set
difference between this expanded range and the previous range as the depth difference
range). Now, from the definition of the 3D summed area table T, we know that
SUM_3D(T, xmin, xmax, ymin, ymax, zmin − r, zmax + r) > 0, which is the condition of the
if statement on line 20 of BOUND-DEPTH.
In this case, where (d − md)² < r² for p* ∈ R, it follows directly that −min((d − md)², r²)/(2v) ≤
0 is a valid bound on line 11 of EVAL-DEPTH(p*, D, I) and on line 21 of BOUND-DEPTH.
Otherwise, if (d − md)² ≥ r², it is still possible that BOUND-DEPTH will return 0 on
line 21 if a depth value in the image plane difference region falls in the expanded
depth range or if a depth value in the image plane region falls in the depth difference
range, which is still a valid bound.

If the depth d is undefined then SUM_2D(U, xmin, xmax, ymin, ymax) > 0. If d is
undefined or if the pixel location (round(px), round(py)) is outside of the image, then
the bound returned on line 23 of BOUND-DEPTH will be equal to the return value of
EVAL-DEPTH(p*, D, I) on line 14. If d is defined and its pixel location is within the
image, it is still possible that BOUND-DEPTH will return the same value on line 23 if there is
another undefined depth in the rectangular image plane region or if part of the rectangular
image plane region is outside of the image. In this case, it is still a valid
bound for BOUND-DEPTH to return that value on line 23, since line 20 of BOUND-DEPTH
eliminated the possibility that there are any depth measurements within the rectangular
image plane region that are within the receptive field radius (between z'min − r'
and z'max + r').

Finally, if the depth measurement is defined and if we know that it is beyond the
receptive field radius, that is, (d − md)² ≥ r², then BOUND-DEPTH will return −r²/(2v) on line 25,
which will be the same value returned by EVAL-DEPTH on line 14.
Therefore, we have covered all the cases of the if statements in BOUND-DEPTH and
EVAL-DEPTH, so we can conclude that the bound is valid if lines 13-15 were omitted
from BOUND-DEPTH.

We now prove that calling BRUTE-FORCE-BOUND-DEPTH on line 15 of BOUND-DEPTH
yields a valid bound (regardless of the expression used as the condition of the if
statement on line 13). The BRUTE-FORCE-BOUND-DEPTH procedure exhaustively
evaluates each pixel (x, y) in the rectangular image plane region to determine the minimum
possible absolute difference between the depth value at that pixel and the depth
range. If the image plane difference region were empty (i.e., the rectangular image plane
region were equal to the true image plane region), then BRUTE-FORCE-BOUND-DEPTH
would be perfectly tight. However, since the true image plane region is a subset of the
rectangular image plane region, it is possible that a pixel in the image plane difference
region would cause BRUTE-FORCE-BOUND-DEPTH to return a value that is greater
than EVAL-DEPTH for some p ∈ R. But depth measurements from the image plane
difference region will never violate the bound condition, since the min operations on
lines 6, 10 and 12 can only decrease dmin for pixels in the image plane difference
region, so they can only increase the return value on line 13.

Therefore, we have covered both the case when BRUTE-FORCE-BOUND-DEPTH is
called and the case when it is not called, so we can conclude that the bound is valid:
∀p ∈ R : BOUND-DEPTH(R, D, I) ≥ EVAL-DEPTH(p, D, I).
Bibliography
[1] A. Aldoma, M. Vincze, N. Blodow, D. Gossow, S. Gedikli, R.B. Rusu, and
G. Bradski. Cad-model recognition and 6dof pose estimation using 3d cues.
In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International
Conference on, pages 585-592, 2011.
[2] Alexander Andreopoulos and John K. Tsotsos. 50 years of object recognition: Directions forward. Computer Vision and Image Understanding, 117(8):827-891, 2013.
[3] John Canny. A computational approach to edge detection. IEEE Trans. Pattern
Anal. Mach. Intell., 8(6):679-698, 1986.
[4] Han-Pang Chiu, L.P. Kaelbling, and T. Lozano-Pérez. Virtual training for multi-view object class recognition. In Computer Vision and Pattern Recognition, 2007.
CVPR '07. IEEE Conference on, pages 1-8, June 2007.
[5] David J. Crandall, Pedro F. Felzenszwalb, and Daniel P. Huttenlocher. Spatial
priors for part-based recognition using statistical models. In CVPR (1), pages
10-17. IEEE Computer Society, 2005.
[6] Franklin C. Crow. Summed-area tables for texture mapping. In Proceedings of
the 11th Annual Conference on Computer Graphics and Interactive Techniques,
SIGGRAPH '84, pages 207-212, New York, NY, USA, 1984. ACM.
[7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages
886-893, INRIA Rhone-Alpes, ZIRST-655, av. de l'Europe, Montbonnot-38334,
June 2005.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The
pascal visual object classes (voc) challenge. International Journal of Computer
Vision, 88(2):303-338, June 2010.
[9] Sachin Sudhakar Farfade, Mohammad Saberian, and Li-Jia Li. Multi-view face detection using deep convolutional neural networks. arXiv preprint arXiv:1502.02766, 2015.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object
detection with discriminatively trained part based models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, 2010.
[11] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Distance transforms of sampled functions. Technical Report 1963, Cornell Computing and Information Science, 2004.
[12] P.F. Felzenszwalb, R.B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In Computer Vision and Pattern Recognition
(CVPR), 2010 IEEE Conference on, pages 2241-2248, June 2010.
[13] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised
scale-invariant learning. In CVPR, pages 264-271, 2003.
[14] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm
for model fitting with applications to image analysis and automated cartography.
Commun. ACM, 24(6):381-395, June 1981.
[15] Jared Glover and Sanja Popovic. Bingham procrustean alignment for object
detection in clutter. In IEEE/RSJ International Conference on Intelligent Robots
and Systems, 2013.
[16] Kristen Grauman and Bastian Leibe. Visual object recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(2):1-181, 2011.
[17] Derek Hoiem, Alexei A Efros, and Martial Hebert. Putting objects in perspective.
InternationalJournal of Computer Vision, 80(1):3-15, 2008.
[18] Derek Hoiem and Silvio Savarese. Representations and Techniques for 3D Object
Recognition and Scene Interpretation. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2011.
[19] Eun-Jong Hong, Shaun M. Lippow, Bruce Tidor, and Tomis Lozano-Perez. Rotamer optimization for protein design through map estimation and problem-size
reduction. J. Computational Chemistry, 30(12):1923-1945, 2009.
[20] Kourosh Khoshelham and Sander Oude Elberink. Accuracy and resolution of
kinect depth data for indoor mapping applications. Sensors, 12(2):1437-1454,
2012.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification
with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing
Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[22] Christoph H. Lampert, M.B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Computer Vision
and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8,
2008.
168
[23] Alain Lehmann, Bastian Leibe, and Luc Gool. Fast prism: Branch and bound
hough transform for object class detection. InternationalJournal of Computer
Vision, 94(2):175-197, 2011.
[24] Joseph J. Lim, Hamed Pirsiavash, and Antonio Torralba. Parsing IKEA Objects:
Fine Pose Estimation. ICCV, 2013.
[25] D.G. Lowe. Object recognition from local scale-invariant features. In Computer
Vision, 1999. The Proceedings of the Seventh IEEE InternationalConference on,
volume 2, pages 1150-1157 vol.2, 1999.
[26] M. Maire, P. Arbelaez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. In Computer Vision and Pattern Recognition,
2008. CVPR 2008. IEEE Conference on, pages 1-8, 2008.
[27] D. Nister and H. Stew nius. Scalable recognition with a vocabulary tree. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), volume 2,
pages 2161-2168, June 2006. oral presentation.
[28] Xiaofeng Ren and Liefeng Bo. Discriminatively trained sparse code gradients
for contour detection. In Advances in Neural Information Processing Systems
25: 26th Annual Conference on Neural Information Processing Systems 2012.
Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United
States., pages 593-601, 2012.
[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,
Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition
Challenge, 2014.
[30] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman.
Labelme: A database and web-based tool for image annotation. Int. J. Comput.
Vision, 77(1-3):157-173, May 2008.
[31] Pierre Sermanet, David Eigen, Xiang Zhang, Michasl Mathieu, Rob Fergus, and
Yann LeCun. Overfeat: Integrated recognition, localization and detection using
convolutional networks. CoRR, abs/1312.6229, 2013.
[32] Hao Su, Min Sun, Li Fei-Fei, and S. Savarese. Learning a dense multi-view
representation for detection, viewpoint classification and synthesis of object categories. In Computer Vision, 2009 IEEE 12th InternationalConference on, pages
213-220, Sept 2009.
[33] A. Thomas, V. Ferrar, B. Leibe, T. Tuytelaars, B. Schiel, and L. Van Gool.
Towards multi-view object class detection. In Computer Vision and Pattern
Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 15891596, 2006.
169
[34] Antonio Torralba. Contextual priming for object detection. InternationalJournal
of Computer Vision, 53(2):169-191, 2003.
[35] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Sharing visual
features for multiclass and multiview object detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 29(5):854-869, 2007.
[36] Paul Viola and Michael Jones. Robust real-time object detection. In International Journal of Computer Vision, 2001.
[37] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages
3485-3492. IEEE, 2010.
170