3D Model-Based Pose Estimation of Rigid Objects From A Single Image For Robotics by Samuel I. Davies Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of IJJ Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2015 I N I.- @ Massachusetts Institute of Technology 2015. All rights reserved. Signature .... .U ................redacted ... . Author .... Department of Electrical Engineering and Computer Science May 20, 2015 Signature redacted / Certified by Signature redact Certified by... JTomis Lozano-Perez Professor Thesis Supervisor .................. leslie Pack Kaelbling Professor Thesis Supervisor Accepted by .... z: OLL Clo Signatre redacted................ / & 4 ~essor Leslie A. Kolodziejski Chairman, Department Committee on Graduate Theses w MITLibraries 77 Massachusetts Avenue Cambridge, MA 02139 http://Iibraries.mit.edu/ask DISCLAIMER NOTICE Due to the condition of the original material, there are unavoidable flaws in this reproduction. We have made every effort possible to provide you with the best copy available. Thank you. Figure 3-6 (p.65) is missing from the thesis. 3D Model-Based Pose Estimation of Rigid Objects From A Single Image For Robotics by Samuel I. Davies Submitted to the Department of Electrical Engineering and Computer Science on June 5, 2015, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract We address the problem of finding the best 3D pose for a known object, supported on a horizontal plane, in a cluttered scene in which the object is not significantly occluded. We assume that we are operating with RGB-D images and some information about the pose of the camera. We also assume that a 3D mesh model of the object is available, along with a small number of labeled images of the object. The problem is motivated by robot systems operating in indoor environments that need to manipulate particular objects and therefore need accurate pose estimates. This contrasts with other vision settings in which there is great variability in the objects but precise localization is not required. Our approach is to find the global best object localization in a full 6D space of rigid poses. There are two key components to our approach: (1) learning a viewbased model of the object and (2) detecting the object in an image. An object model consists of edge and depth parts whose positions are piece-wise linear functions of the object pose, learned from synthetic rendered images of the 3D mesh model. We search for objects using branch-and-bound search in the space of the depth image (not directly in the Euclidean world space) in order to facilitate an efficient bounding function computed from lower-dimensional data structures. Thesis Supervisor: Tomas Lozano-P6rez Title: Professor Thesis Supervisor: Leslie Pack Kaelbling Title: Professor 3 4 Acknowledgments I dedicate this thesis to the Lord Jesus Christ, who, in creating the universe, was the first Engineer, and in knowing all the mysteries is the greatest and Scientist and Mathematician. I am very grateful to my advisors, Tomis Lozano-P rez and Leslie Kaelbling for their kindness, insightful ideas and well-seasoned advice during each stage of this process. If it was not for your patient insistence on finding a way to do branch and bound search over the space of object poses using a probabilistic model, I would have believed it was impossible to do efficiently. And thank you for fostering an environment in the Learning and Intelligent Systems (LIS) group that is conducive to thinking about the math, science and engineering of robotics. I am also indebted to my loving parents for their upraising-and thank you for supporting me all these years. I love you! And I would never have learned engineering or computer programing if you had not taught me, Dad. Thanks for sparking my early interest in robotics by with the WAO-II mobile robot! Thanks also to our administrative assistant Teresa Cataldo for helping with logistics. Thanks to William Ang from TechSquare.com for keeping the lab's robot and computers running and updated, and to Jonathan Proulx from The Infrastructure Group who was very helpful in maintaining and supporting the cloud computing platform on which we ran the experiments. Special thanks to my officemate Eun-Jong (Ben) Hong whose algorithm for exhaustive search over protein structures [19] encouraged me to find a way to do exhaustive object recognition. I would also like to thank other fellow graduate students who worked on object recognition in the LIS group: Meg Lippow, Hang Pang Chiu and Jared Glover whose insights were valuable to this work. And I would like to thank the undergraduates I had the privilege of supervising: Freddy Bafuka, Birkan Uzun and Hemu Arumugam-thank you for being patient students! I would especially like to thank Freddy, who turned me from atheism to Christ and has become my Pastor. By his faithful preaching, he has guided towards God during these years. 5 6 Contents 19 1.1 Overview of the Approach . . . . . . 21 1.1.1 Learning . . . . . . . . . . . . 23 1.1.2 Detection . . . . . . . . . . . 33 Outline of Thesis . . . . . . . . . . . 40 1.2 . 41 2.1 Low-Level Features ...... 2.2 Generic Categories vs. Specific Objects . . . . . . . . . . . . . . . . 42 2.3 2D vs. 3D vs. 2-D view-based models . . . . . . . . . . . . . . . . 43 ............................ . . 41 2D view-based models . . . . . . . . . . . . . . . . . . . . . 44 2.3.2 3D view-based models . . . . . . . . . . . . . . . . . . . . . 46 2.3.3 21D view-based models . . . . . . . . . . . . . . . . . . . . . 47 Search: Randomized vs. Cascades vs. Branch-and-Bound . . . . . . 48 2.4.1 Randomized . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.4.2 C ascades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.4.3 Branch-and-Bound . . . . . . . . . . . . . . . . . . . . . . . 49 2.5 Contextual Information. . . . . . . . . . . . . . . . . . . . . . . . . 50 2.6 Part Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 . . . . . . . . . 2.3.1 2.4 Representation 53 3.1 O bject Poses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.2 Im ages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 . 3 . Related Work . 2 . Introduction . 1 7 3.3 View-Based Models ................. 3.4 Approximations . . . . . . . . . . . . . . . . . . . 56 . . . . . . . . . . . 3.4.1 (x, y) Translation Is Shifting In The Image Plane 3.4.2 3.4.3 59 Weak Perspective Projection . . . . . . . . . . . . . . . . . . 59 Small Angle Approximation . . . . . . . . . . . . . . . . . . 60 . . . . . . . . . Sources Of Variability . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.6 Choice of Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 62 . . 3.5 4 Learning . . . . . . . . . . . 67 4.1.1 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.1.2 Feature Enumeration . . . . . . . . . . . . . . . . . . . . . . 69 4.1.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 76 4.1.4 Combining Viewpoint Bin Models..... . . . . . . . . . . . 80 . . . . . . . . . . . . . . . . . . . . 82 Tuning Parameters . . . . . . . . . . . . . . . . . . . . . . . 84 . . . . View-Based Model Learning Subsystem . . . . . High Level Learning Procedure 4.2.1 . 4.2 67 . 4.1 5 Detection 87 Detecting Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.2 Pre-Processing Features . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3 Branch-and-Bound Search . . . . . . . . . . . . . . . . . . . . . . . 100 5.3.1 Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.3.2 Bounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.3.3 Initializing the Priority Queue . . . . . . . . . . . . . . . . . 119 5.3.4 Constraints On The Search Space . . . . . . . . . . . . . . . 120 5.3.5 Branch-and-Bound Search . . . . . . . . . . . . . . . . . . . 125 5.3.6 Parallelizing Branch-and-Bound Search . . . . . . . . . . . . 128 . . . . . . . . . . . 131 . . . . . . . . . 5.1 Non-Maximum Suppression . . . . . . . . . . . . 5.4 Experiments 6.1 D ataset 135 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 59 8 135 6.2 Setting Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 6.3.1 7 157 Conclusion 7.1 Future Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 161 A Proofs A.1 Visual Part Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 A.2 Depth Part Bound 9 10 . '. - - , , - . - , . -, - II 1 111 k . - . . - . - 4 4W List of Figures 1-1 Examples of correct object detections. . . . . . . . . . . . . . . . . . 20 1-2 An overview of the view-based model learning subsystem. . . . . . . . 26 1-3 An overview of the manual labor required to learn a new object. . . . 29 1-4 An overview of detection . . . . . . . . . . . . . . . . . . . . . . . . . 35 2-1 Fergus et al. [13] used a fully-connected model . . . . . . . . . . . . . 45 2-2 Crandall et al. [5] used a 1-fan model. . . . . . . . . . . . . . . . . . . 46 2-3 Torralba et al. [35] showed that sharing parts can improve efficiency. 51 3-1 An illustration of features in an RGB-D image. . . . . . . . . . . . . 57 3-2 Warping spherical coordinates into rectangular coordinates. . . . . . . 61 3-3 Examples of object poses that are at the same rotation in spherical coordinates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 . . . . . . . . . . . 64 3-4 An example of edges missed by an edge detector. 3-5 Normal distributions with and without a receptive field radius. ..... 64 3-6 2D normal distributions with elliptical and circular covariances. 65 4-1 Examples of synthetic images. . . . . . . . . . . . . . . . . . . . . . . 69 4-2 Visualizations of enumerated features . . . . . . . . . . . . . . . . . . 74 4-3 The effect of varying the minimum distance between parts. . . . . . . 77 4-4 Different objects and viewpoints vary in the area of the image they cover. 78 4-5 The minimum distance between parts should not be the same for all view s. 4-6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The PR2 robot with camera height and pitch angles. 11 . . . . . . . . . 79 83 5-1 Hough transforms for visual parts in 1D. . . . . . . . . . . . . . . . . 92 5-2 Adding a rotation dimension to figure 5-1. . . . . . . . . . . . . . . . 93 5-3 Adding a scale dimension to figure 5-1. . . . . . . . . . . . . . . . . . 94 5-4 Hough transforms for visual parts in 2D. . . . . . . . . . . . . . . . . 96 5-5 Hough transforms for depth parts in 2D. . . . . . . . . . . . . . . . . 97 5-6 The maximum of the sum of 1D Hough votes in a region. . . . . . . . 103 5-7 The maximum of the sum of Hough votes with rotation in a region. . 104 5-8 The maximum of the sum of Hough votes with scale in a region is broken into parts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-9 105 The maximum of the sum of Hough votes in a region for optical character recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10 The maximum of of the sum of Hough votes for depth parts in 2D. . 106 107 5-11 ID Hough transform votes and bounding regions aligned to image coordinates. ....... ........ ......................... 108 5-12 Hough transform votes for visual parts (with rotation) and bounding regions aligned to image coordinates. . . . . . . . . . . . . . . . . . . 109 5-13 Hough transform votes for visual parts (with scale) and bounding regions aligned to image coordinates. . . . . . . . . . . . . . . . . . . . 110 5-14 Hough transform votes for depth parts and bounding regions in warped coordinates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5-15 1D Hough transform votes and bounding regions with receptive field radius. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5-16 Hough transform votes (with scale), bounding regions and receptive field radius. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5-17 Hough transform votes for depth parts with bounding regions and receptive field radius. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6-1 Average Precision vs. Number of Training Images (n) . . . . . . . . . 139 6-2 Average Precision vs. Number of Visual Parts (ny) . . . . . . . . . . 140 6-3 Average Precision vs. Number of Depth Parts (nD) .. 141 12 .......... . 6-4 Average Precision vs. Number of Visual and Depth Parts (nD = nV) 142 6-5 Average Precision vs. Receptive Field Radius For Visual Parts (rv) 143 6-6 Average Precision vs. Receptive Field Radius For Depth Parts (rD) 144 6-7 Average Precision vs. Maximum Visual Part Variance (vvmax) . . . . 145 6-8 Average Precision vs. Maximum Depth Part Variance . . . 146 6-9 Average Precision vs. Rotational Bin Width (r,) . . . . . . . . . . . 147 6-10 Average Precision vs. Minimum Edge Probability Threshold . . . . . 148 6-11 Average Precision vs. Camera Height Tolerance (hto0 ) . . . . . . . . . 149 6-12 Average Precision vs. Camera Pitch Tolerance (rtoi) . . . . . . . . . . 150 6-13 Detection Running Time vs. Number of Processors . . . . . . . . . . 155 13 (VDmax) . 14 List of Tables 6.1 Detailed information about the objects and 3D mesh models used in the experim ents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 6.2 Images of the objects and 3D mesh models used in the experiments. . 137 6.3 Parameter values used in experiments. . . . . . . . . . . . . . . . . . 139 6.4 Average precision for each object, compared with the detector of Felzenszw alb et al. [10]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 6.5 A confusion matrix with full-sized images. . . . . . . . . . . . . . . . 152 6.6 A confusion matrix with cropped images. . . . . . . . . . . . . . . . . 152 6.7 Errors in predicted poses for asymmetric objects. . . . . . . . . . . . 153 6.8 Errors in predicted poses for symmetric objects. . . . . . . . . . . . . 154 15 16 List of Algorithms 1 Render and crop a synthetic image. . . . . . . . . . . . . . . . . . . . 2 Update an incremental least squares visual part by adding a new training exam ple. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 . . . . . . . . . . . . . . . . . . . . 72 Update an incremental least squares depth part by adding a new training exam ple. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 72 Finalize an incremental least squares visual part after it has been updated with all training examples. 4 69 73 Finalize an incremental least squares depth part after it has been updated with all training examples. . . . . . . . . . . . . . . . . . . . . 73 6 Enumerate all possible features. . . . . . . . . . . . . . . . . . . . . . 75 7 Select features greedily for a particular minimum allowable distance between chosen parts 8 dmin. . . . . . . . . . . . . . . . . . . . . . . . . 78 Select features greedily for a particular maximum allowable part variance vmax. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 9 Learn a new viewpoint bin model. . . . . . . . . . . . . . . . . . . . . 81 10 Learn a full object model. . . . . . . . . . . . . . . . . . . . . . . . . 81 11 Evaluates a depth part for in image at a particular pose. . . . . . . . 91 12 Evaluates a visual part for in image at a particular pose. . . . . . . . 98 13 Evaluates an object model in an image at a particular pose. . . . . . 98 14 An uninformative design for a bounding function. . . . . . . . . . . . 101 15 A brute-force design for a bounding function. 101 16 Calculate an upper bound on the log probability of a visual part for poses within a hypothesis region. 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 17 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region by brute force. . . . . . . . . . . . . 18 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region. 19 116 . . . . . . . . . . . . . . . . . . . . 117 Calculate an upper bound on the log probability of an object for poses within a hypothesis region. . . . . . . . . . . . . . . . . . . . . . . . . 119 20 A set of high-level hypotheses used to initialize branch-and-bound search. 120 21 A test to see whether a point is in the constraint region. . . . . . . . 122 22 Update the range of r, values for a pixel. . . . . . . . . . . . . . . . . 122 23 Find the range of r, values for a hypothesis region. . . . . . . . . . . 123 24 Update the range of r. values for a pixel. . . . . . . . . . . . . . . . . 123 25 Find the range of r. values for a hypothesis region. 26 Update the range of z values for a pixel. 27 Find the range of z values for a hypothesis region. . . . . . . . . . . . 126 28 Find the smallest hypothesis region that contains the intersection be- . . . . . . . . . . 124 . . . . . . . . . . . . . . . . 125 tween a hypothesis region and the constraint. . . . . . . . . . . . . . 126 29 One step in branch-and-bound search. . . . . . . . . . . . . . . . . . 127 30 Detect an object in an image by branch-and-bound search. . . . . . . 127 31 Send a status update from a worker for parallel branch-and-bound. 129 32 A worker for a parallel branch-and-bound search for an object in an im age. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 33 Coordinate workers to perform branch-and-bound search in parallel. . 132 34 Non maximum suppression: remove detections that are not local maxima. 133 18 Chapter 1 Introduction In this thesis we address the problem of finding the best 3D pose for a known object, supported on a horizontal plane, in a cluttered scene in which the object is not significantly occluded. We assume that we are operating with RGB-D images and some information about the pose of the camera. We also assume that a 3D mesh model of the object is available, along with a small number of labeled images of the object. The problem is motivated by robot systems operating in indoor environments that need to manipulate particular objects and therefore need accurate pose estimates. This contrasts with other vision settings in which there is great variability in the objects but precise localization is not required. Our goal is to find the best detection for a given view-based object model in an image, even though it takes time to search the whole space of possible object locations. After searching, we guarantee that we have found the best detection without exhaustively searching the whole space of poses by using branch-and-bound methods. Our solution requires the user to have a 3D mesh model of the object and an RGB-D camera that senses both visual and depth information. The recent advances in RGB-D cameras have proven to give much higher-accuracy depth information than previous stereo cameras. Moreover, RGB-D cameras like the Microsoft Kinect are cheap, reliable and broadly available. We also require an estimate of the height and pitch angle of the camera with respect to the horizontal supporting plane (i.e. 19 Figure 1-1: Examples of correct object detections. Detections include the full 6 dimensional location (or pose) of the object. The laundry detergent bottle (top left) and mustard bottle (top right) are both correctly detected (bottom). 20 table) on which the upright object is located. The result is a 6 degree-of-freedom pose estimate. We allow background clutter in images, but we restrict the problem to images in which the object is not significantly occluded. We also assume that for each RGB-D image, the user knows the pitch angle of the camera and the height of the camera measured from the table the object is on. This is a useful problem in the context of robotics, in which it is necessary to have an accurate estimate of an object's pose before it can be grasped or picked up. Although less flexible than the popular paradigm of learning from labeled real images of a highly variable object class, this method requires less manual labor-only a small number of real images of the object instance with 2D bounding box labels are used to test the view-based model and to tune learning parameters. This makes the approach practical as a component of a complete robotic system, as there is a large class of real robotic manipulation domains in which a mesh model for the object to be manipulated can be acquired ahead of time. 1.1 Overview of the Approach Our approach to the problem is to find the global maximum probability object localization in the full space of rigid poses. We represent this pose space using 3 positional dimensions plus 3 rotational dimensions, for a total of 6 degrees of freedom. There are two key components to our approach: " learning a view-based model of the object and " detecting the object in an image. A view-based model consists of a number of parts. There are two types of parts: visual parts and depth parts. Visual parts are matched to edges detected in an image by an edge detector. Each visual edge part is tuned to find edges at one of 8 discrete edge angles. In addition, small texture elements can be used to define other kinds of visual parts. Visual parts do not have depth, and the uncertainty about their positions is restricted to the image plane. 21 Depth parts, on the other hand, only model uncertainty in depth, not in the image plane. Each depth part is matched to a depth measurement from the RGB-D camera at some definite pixel in the image. Thus we can think of the 1D uncertainty of depth part locations as orthogonal to the 2D uncertainty of image part locations. The expected position of each of the view-based model parts (both visual and depth parts) is a function of the object pose. We divide the 3 rotational dimensions of pose space into a number of viewpoint bins, and we model the positions of the parts as a linear function of the object rotation within each viewpoint bin (i.e. we use a small angle approximation). In this way, the position of the object parts is a piecewise linear function of the object rotation, and the domain of each of the "pieces" is a viewpoint bin. The 3 positional dimensions are defined with respect to the camera: as the object moves tangent to a sphere centered at the focal point of the camera, all the model parts are simply translated in the image plane. As the object moves nearer or farther from the camera, the positions of the parts are appropriately scaled with an origin at the center of the object (this is known as the weak perspective approximation to perspective projection). A view-based model consists of parts whose expected positions in the image plane are modeled by a function of all 6 dimensions of the object pose. A view-based model is learned primarily from synthetic images rendered from a mesh of the particular object instance. For each viewpoint bin, synthetic images are scaled and aligned (using the weak perspective assumption) and a linear model (using the small angle assumption) is fit to the aligned images using least squares. An object is detected by a branch-and-bound search that guarantees that the best detections will always be found first. This guarantee is an attractive feature of our detection system because it allows the user to focus on tuning the learning parameters that affect the model, with the assurance that errors will not be introduced by the search process. A key aspect of branch-and-bound search is bounding. Bounding gives a conservative estimate (i.e. an upper bound) on the probability that the object is located within some region in pose space. Each part in a view-based model casts a weighted "vote" for likely object poses 22 based on the part's position. These votes assign a weight to every point in the 6D space of object poses. The pose with the greatest sum of "votes" from all of the parts is the most probable detection in the image. We therefore introduce a bounding function that efficiently computes an upper bound on the votes from each part over a region of pose space. The bounding function is efficient because of the weak perspective projection and small angle approximations. These approximations allow the geometric redundancy of the 6D votes to be reduced, representing them in lower dimensional (2D and 3D) tables in the image plane. To further save memory and increase efficiency, these lower dimensional tables are shared by all the parts tuned to a particular kind of feature, so that they can be re-used to compute the bounding functions for all the parts of each kind. 1.1.1 Learning Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera with the ability to measure camera height and pitch angle Output: A view-based model of the object, composed of a set of viewpoint bins, each with visual and depth parts and their respective parameters We break the process of learning (described in depth in chapter 4) a new viewbased model into two parts. First we will discuss the fully automated view-based model learning subsystem that generates a view-based model from a 3D mesh and a specific choice of parameter values. Then we will discuss the procedure required to tune the parameter values. This is a manual process in which the human uses the view-based learning subsystem and the detection system to repeatedly train and test parameter values. The model learning and the detection sub-procedures can be called by a human in this manual process. 1.1.1.1 View-Based Model Learning Subsystem Input: A 3D mesh and parameter values such as the set of viewpoint bins Output: A view-based model of the object, which is composed of a set of viewpoint 23 bins, each with visual and depth parts along with the coefficients of the linear model for their positions and the uncertainty about those positions The view-based model learning subsystem (see figure 1-2) is a fully automated process that takes a mesh and some parameter values and produces a view-based model of the object. This subsystem is described in detail in section 4.1. The position of each object part is a piecewise linear function of the three rotation angles about each axis. Each piece of this piecewise linear model covers an axis-aligned "cube" in this 3D space of rotations. We call these cubes viewpoint bins. 3D objects are modeled with a number of different viewpoint bins, each with its own linear model of the object's shape and appearance for poses within that bin. The following three learning phases are repeated for each viewpoint bin: 1. rendering the image, 2. enumerating the set of features that could be used as model parts, 3. selecting the features that will be used in the final viewpoint bin model and 4. combining viewpoint bin models into the final view-based model. Since each viewpoint bin is learned independently, we parallelize the learning procedure, learning each view-based model on a separate core. In our tests, we had nearly enough CPUs to learn all of the viewpoint bin models in parallel, so the total learning time was primarily determined by the time taken to learn a single viewpoint bin model. On a single 2.26 GHz Intel CPU core, learning time takes an average of approximately 2 minutes. The view-based model learning subsystem is designed to be entirely automated, and require few parameter settings from the user. However, there are still a number of parameters to tune, as mentioned in section 1.1.1.2. An unusual aspect of this learning subsystem is that the only training input to the algorithm is a single 3D mesh. The learning is performed entirely using synthetic images generated from rendering this mesh. This means that the learned view-based 24 model will be accurate for the particular object instance that the mesh represents, and not for a general class of objects. Rendering Input: a 3D mesh and parameters such as viewpoint bin size and ambient lighting level Output: a sequence of cropped, scaled, and aligned rendered RGB-D images for randomly sampled views within the viewpoint bin Objects are rendered using OpenGL at a variety of positions and rotations in the view frustum of the virtual camera, with a variety of virtual light source positions. This causes variation in resolution, shading and perspective distortion, in addition to the changes in appearance as the object is rotated within the viewpoint bin. The virtual camera parameters are set to match the calibration of the real Microsoft Kinect camera. The OpenGL Z-buffer is used to reconstruct what the depth image from the Microsoft Kinect would look like. 1 Each of the images are then scaled and translated such that the object centers are exactly aligned on top of each other. We describe the rendering process in more detail in section 4.1.1. Feature Enumeration Input: a sequence of scaled and aligned RGB-D images Output: a least squares linear model of the closest feature position at each pixel for depth features and for each kind of visual feature The rendered images are used to fit a number of linear functions that model how the position of each visual feature (such as edges) and depth value varies with small object rotations within the viewpoint bin. A linear function is fit at each pixel in the aligned images, and for each edge angle as well as for each pixel in the aligned depth images. The linear functions are fit using least squares, so the mean squared 'This method is only an approximate simulation of the true process that generates RGB-D images in the Kinect. For example, the real Kinect has a few centimeters of disparity between the infrared camera that measures depth and the color camera, so that the visual and depth images are not aligned at all depths. 25 viewpoint bin 2 viewpoint bin 1 . I 0-0I -67.50 edges (8 edge -450 - depth Is) 'M.& -77 ko n 45* 67.5* 90* S*. . fetr edges (8 edge angles) depth enmraS 22.5* 45* 6 7.5* 22.5* goo ; seletio view-based model Figure 1-2: An overview of the view-based model learning subsystem. Random poses are sampled from within each view, and synthetic RGB-D images are rendered at these views. These images are then scaled and translated so that the centers of the objects in the images are aligned. Next, a linear model is fit at each pixel for each type of visual feature (8 edge directions in this figure) detected in the synthetic images, and another linear model is fit at each pixel for the depth measurements that come from the Z-buffer of the synthetic images. Finally, some of those linear models are selected and become parts of the final viewpoint bin model. Each of the 360 viewpoint bin models are combined to form piecewise linear segments of a full view-based model. Note: this procedure does not take any real images as an input-the learned models will later be tested on real images. 26 error values are a readily available metric to determine how closely the models fit the actual simulated images. In reality, feature enumeration is a process that occurs incrementally as each new rendered image is generated. This saves memory and greatly increases learning speed. We use a formulation of the least squares problem that allows each training point to be added sequentially in an online fashion. The speed of this process could be improved by the fine-grained parallelism available on a GPU architecture, computing the feature enumeration for each pixel of each kind of feature in parallel. We give more details in section 4.1.2. Feature Selection Input: a set of linear models for positions of each kind of feature at each pixel and parameter values for how visual and depth parts should be modeled Output: a model for a viewpoint bin, consisting of a selection of these part models with low mean-squared-error and spaced evenly across the image of the viewpoint From the set of enumerated features, a small fixed number are selected to be parts of the model for the viewpoint bin. They are greedily chosen to have a low mean squared error, and even spacing across the image of the viewpoint bin model. Most of the user-specified learning parameters are used to control this stage of the learning process. The selected parts constitute the model for each particular viewpoint bin, and the set of viewpoint bins form the whole object model. We provide more details in section 4.1.3. Combining Models Of Viewpoint Bins Input: a set of object models for all the different viewpoint bins Output: an view-based model covering the region of pose space covered by the union of the input viewpoint bins A view-based model consists of a set of viewpoint bin models, each of which has a set of visual and depth parts. After feature enumeration and selection, the set of 27 viewpoint bin models are grouped together to form the complete view-based model. We give more details in section 4.1.4. 1.1.1.2 High Level Learning Procedure Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera with the ability to measure camera height and pitch angle and computational power. Output: A view-based model of the object, composed of a set of viewpoint bin models, each with visual and depth parts The procedure to learn a new view-based model is depicted in figure 1-2. The steps involved are: 1. Collect RGB-D images of the object, along with information about the camera pose, and partition the images into a test and hold-out set. 2. Label the images with bounding boxes. 3. Acquire a 3D mesh of the object (usually by 3D scanning). 4. Tune learning parameters while testing the view-based models on the test images. 5. Evaluate the accuracy of the view-based models. We describe this procedure in section 4.2. Collect RGB-D Images Input: * the object instance, e the object instance, placed on a table and * an RGB-D camera with the ability to measure pitch angle and height above the table, 28 """""""pararaeterr collect images & data using a PR2 robot with a inect labe Figure 1-3: An overview of the manual labor required to learn a new object. To learn a new view-based model of an object, we first collect a data set of about 30 positive image examples of the object and about 30 background images of the object (all images used for training the downy bottle view-based model are in this figure). Each image must also include the pitch angle of the camera and the height of the camera above the table when the image was taken. The images should also include depth information. We use a PR2 robot with a Microsoft Kinect mounted on its head to gather our data sets. Each positive example must be labeled by a human with an approximate 2D bounding box for the object to detect. A 3D mesh of the object should be acquired (usually using a 3D scanner). The mesh is used to learn a new view-based model, and the learning parameters must be manually adjusted as the user tests each new learned view-based model on the real images and evaluates the accuracy (measured by average precision). 29 Output: a set of RGB-D images labeled with the camera's pitch and its height above table In our experiments, we collect a set of around 15 images with depth information (RGB-D images) for each object using a Microsoft Kinect mounted on a PR2 robot. In our data sets, we did not manually change the scene between each image capturewe set up a table with background clutter and the object one time, and we drove the robot to different positions around the table, ensuring that the object was not occluded in any images, since our detector does not currently deal explicitly with occlusion. This process took about 10 minutes per object. The reason we used the robot instead of manually holding the Kinect camera and walking around the table is that we also record the camera height above the table and its pitch (assuming the camera is upright and level with ground, i.e., the roll angle of the camera in the plane of the image is always zero). An affordable alternative to this method would be to instrument a tripod with a system to measure the camera's height and pitch. This information is used to constrain the 6D search space to search a region surrounding the table, at object rotations that would be consistent (or nearly consistent) with the object standing upright on the table. We also collected a set of about 15 background images that contained none of the objects we were interested in detecting using this same methodology. We were able to re-use this set of images as negative examples for each of the objects we tested. Label Bounding Boxes Input: a set of RGB-D images Output: left, right, top and bottom extents of an approximate rectangular bounding box for each image of the object We use a simple metric to decide whether a detection is correct: if the 2D rectangle that bounds the detection in the image plane overlaps with the manually-labeled bounding box according to the standard intersection over union (IoU) overlap metric 30 of the detected bounding box A and the ground truth bounding box B: AuB> 0.5 (1.1) This leaves some room for flexibility, so the labeled bounding boxes do not need to be accurate to the exact pixel. Labeling approximate bounding boxes for a set of about 30 images takes around 10 minutes for a single trained person. We labeled our image sets with 6D poses for the purposes of evaluating our algorithm in this thesis. However, we found that, even if we know the plane of the table from the camera height and pitch angle, labeling 6D poses is very time-consuming, difficult and error-prone, so we decided to reduce the overall manual effort required of the end user by relaxing the labeling task to simple bounding boxes. We suggest that, in practice, the accuracy of detected poses can be evaluated directly by human inspection, rather than using full 6D pose labels. Acquire A 3D Mesh Input: the object instance Output: an accurate 3D mesh of the object instance We found that the accuracy of the detector is highly related to the accuracy of the 3D mesh, so it is important to use an accurate method of obtaining a mesh. The scanned mesh models used in this thesis were mostly obtained from a commercial scanning service: 3D Scan Services, LLC. Some of the mesh models we used (such boxes and cylinders) were simple enough that hand-built meshes yielded reasonable accuracy. Tune Parameters Input: results of evaluating the view-based object detector a sample of correct and incorrect detections from the view-based model learned from the previous parameter settings 31 Output: a new set of parameter values that should improve the accuracy of the view-based model There are many learning parameters involved in producing a view-based model, such as: " the size of the viewpoint bin, * the amount of ambient lighting in the rendered images, * the maximum allowable mean squared error in feature selection, etc. It would be computationally infeasible to test all combinations of parameter settings to automatically find the best-performing values, so this is left as a manual process. A human can look at a visualization of a view-based model, and the set of detections for that model, and see where the pattern of common failure cases are. A bit of intuition, experience and understanding of how the view-based model is constructed can help the human to make educated guesses as to which parameters need to be adjusted to improve the performance. For example, by looking at a visualization of a view-based model, one may realize that the set of view bins does not fully cover the set of object rotations in the real world, so the user would adjust the set of viewpoint bins and re-run the learning and test to see if it performs more accurately. Or the user may notice that there appear to be randomly scattered edge parts in the view-based model. In this case, the user may try to reduce the maximum allowable mean squared error for edge feature selection. This is admittedly the most difficult part of the process, as it requires a fair amount of experience. Section 4.2.1 gives more details on our methodology and chapter 6 gives a sample of the kinds of experiments that we used to determine good parameter settings, but the real process involves some careful inspection of detections, an understanding of how the view-based model is affected by the parameters, and some critical thinking. Evaluation of View-Based Models 32 Input: " a set of detected object poses in images, " hand-labeled bounding boxes for the object in the images, " a set of about 15 RGB-D images not containing the object Output: a score between 0 and 1 evaluating the accuracy of the view-based model on the set of test images Since we only require 2D bounding box labels (to save manual labor), we are able to evaluate the accuracy of results following the standard and accepted methodology defined by the PASCAL [8] and ImageNet [29] challenges. The challenge defines correct detection with respect to the ground truth label by the intersection over union (IoU) metric (see section 1.1.1.2), and the overall detection accuracy on a set of test images is measured by an average precision that is a number between 0 and 1, where 1 represents perfect detection. 1.1.2 Detection Input: " An RGB-D image along with the camera height and the camera pitch angle " A view-based model Output: A sequence of detected poses, ordered by decreasing probability that the object is at each pose The detection algorithm uses branch-and-bound search to find detections in decreasing order of probability. Branch-and-bound search operates by recursively breaking up the 6D search space into smaller and smaller regions, guided by a bounding function which gives an overestimate of the maximum probability that the object may be found in a particular 33 region. Using the bounding function, branch-and-bound explores the most promising regions first, so that it can provably find the most probable detections first. The bounding function in branch-and-bound search is the critical factor that determines the running time of the search process. The. over-estimate of the bound should not be too far above the true maximum probability (i.e. the bound should be tight), and time to compute the bound should be minimal. In our design of the detection algorithm, computational efficiency of the bounding function is the primary consideration. The detection algorithm consists of five steps: 1. detect visual features 2. pre-process the image to create low-dimensional tables to quickly access the 6D search space 3. initialize the priority queue for branch-and-bound search 4. run the branch-and-bound search to generate a list of detections 5. suppress detections that are not local maxima to remove many redundant detections that are only slight variations of each other Chapter 5 gives more details about the detection algorithm. Visual Feature Detection Input: an RGB-D image Output: a binary image of the same dimensions, for each kind of visual feature The first phase of detection is to detect the low-level features. Depth measurements are converted from Euclidean space, into measurements along a 3D ray starting at the focal point of the camera passing through each pixel. Visual features must be extracted from the input image. Visual feature detectors determine a binary value of whether the feature is present, or absent at any feature. In this thesis, we use an edge detector to extract edge pixels from an edge detector from around 8 different edge directions. We provide more details in section 5.2. 34 depimage(1D slice of 2D) RGB/visual image (1D Lr . slice of 2D) F I -Ili t l1_ ------------- =. $0 N a9m It rM6WM ---* - 4_4 *9----. -"" summed area table of depth (21) slice of 3D) IO summed area tables of visual features (ID slice of 2D) distance transforms of visual features I summed area tables and distance transforms Figure 1-4: An overview of detection. First features are detected in the image, then these binary feature images are preprocessed to produce summed area tables and distance transforms. The priority queue used in the branch-and-bound search is initialized with a full image-sized region for each viewpoint bin. As branch-and-bound runs, it emits a sequence of detections sorted in decreasing order of probability. Some of these detections that are redundantly close to other, higher-probability detections are then removed (or "suppressed"). 35 Pre-processing Input: a depth image and a binary image of the same dimensions, for each kind of visual feature Output: " a 3D summed area table computed from the depth image " a 2D summed area table computed from each kind of visual feature " a 2D distance transform computed from each kind of visual feature Before the process of searching for an object in an image begins, our algorithm builds a number of tables that allow the bounding function to be computed efficiently. First, edges are detected in the image. Each edge has an angle, and edges are grouped into 8 discrete edge angle bins. Each edge angle bin is treated as a separate feature. An optional texture detection phase may be used to detect other types of visual features besides edges. 2D binary-valued images are created for each feature, recording where in the RGB-D image the features were detected, and where they were absent. The bounding function needs to efficiently produce an over-estimate of the maximum probability that the object is in a 6D region in pose space. A dense 6D structure table would be large, and even if it could fit in RAM, it would be slow because the whole structure could never fit in the CPU cache. We therefore store 2D and 3D tables that are smaller in size and more likely to fit in a CPU cache for fast read access. Each feature in the image has a maximum receptive radius, which is the region of pose space where it may increase the overall "votes" for those poses. A feature can have no effect on the total sum of votes for any pose outside of its receptive field radius. The key idea of the bounding function for a particular part is to conservatively assume the highest possible vote for that feature for a region of pose space that may intersect the receptive field radius of some feature. Otherwise, it is safe to assume the lowest possible vote for that region. To make the bounding function efficient, we 36 take advantage of a property of a summed area table [6] (also known as an integral image [36]) that allows us to determine whether a feature is present in any rectangular region with a small constant number of reads from the table. A separate 2D summed area table is used for each visual feature (such as each edge angle). We similarly compute a 3D summed area table for depth features. The constant access time property of the summed area table means that the bounding function takes the same amount of time to compute, regardless of how big or small the region is. We would also like to efficiently compute the exact probabilities when branch-andbound arrives at a leaf of the search tree. The uncertainty of visual feature locations are modeled by normal distributions in the image plane. The logarithm of a normal distribution is simply a quadratic function (i.e. a parabola with a 2D domain), which is the square of a distance function. To find the highest probability match between a visual part and a visual feature in the image (such as an edge), we want to find the squared distance to the visual feature that is closest to the expected location of the part. A distance transform is a table that provides exactly this information-it gives the minimum distance to a feature detection at each pixel [11]. A distance transform is pre-computed for each kind of visual feature so that any the exact probability of any visual part can be computed with only one look-up into the table for the appropriate kind of visual feature. We underscore that these pre-computed tables are shared for all parts of a particular kind. In other words, the total number of pre-computed tables in memory is proportional to the number of different kinds features (usually 8 discrete edge directions), which is many fewer than the number of parts in a viewpoint bin model (usually 100 visual and 100 depth parts), or the number of viewpoint bin models or even the number of different object types being detected. This contributes to a significant increase in search efficiency. These tables take about 4 seconds to compute on a single core 2.26 GHz Intel CPU. Initializing the Priority Queue 37 Input: an empty priority queue, and the viewpoint bin sizes of the view-based model Output: a priority queue containing a maximum-size region for each viewpoint bin, each with its appropriate bound Besides the tables used to compute the bounding function discussed in the last chapter, the other major data structure used by branch-and-bound search is a priority queue (usually implemented by a heap data structure). The priority queue stores the current collection of working hypothesis regions of pose space, prioritized by their maximum probability. The priority queue is designed to make it efficient to find and remove the highest-priority (maximum probability bound) region. It is also fast to add new regions with arbitrary bounds onto the priority queue. Branch-and-bound search starts with the initial hypothesis that the object could be anywhere (subject to the current set of constraints, such as whether it is near a table top). We therefore put a full-sized region that covers the whole 6D pose space we are considering for each viewpoint bin on the priority queue. These initial regions are so large that they are uninformative-the bounding function will give a very optimistic over-estimate, but it will still be fast to compute since the running time does not vary with the size of the region. Initializing the priority queue takes a negligible amount of time. Branch-and-Bound Search Input: an initialized priority queue Output: the sequence detections (i.e. points in pose space) sorted in descending order of probability that the object is located there Branch-and-bound search removes the most promising hypothesis from the priority queue and splits it into as many as 26 "branch" hypotheses because there are 6 dimensions in the search space. It computes the bounding function for each branch, and puts them back onto the queue. When it encounters a region that is small enough, it exhaustively tests a 6D grid of points within the region to find the maximum probability point, and that point is then pushed back onto the priority queue as a leaf. The first time that a leaf appears as the maximum probability hypothesis, we know 38 that we have found the best possible detection. As we have said, the bounding function is the most critical factor in the efficiency of the branch-and-bound search process. Inherent in the design of our view-based model representation are two approximations: weak perspective projection and the smallangle approximation. These two approximations make the bounding function simple to compute using the low dimensional pre-computed tables (discussed in section 1.1.2). These approximations make it possible to access these low-dimensional tables in a simple way: using only scaling and translations, rather than complex perspective functions or trigonometric functions. A rectangular 6D search region can be projected down to a 2D (or 3D) region, bounded by a rectangle in the pre-computed summed area tables, with some translation and scaling. This region can then be tested with a small constant number of reads from these tables in constant time for each visual and depth part. The leaf probabilities can also be computed quickly by looking up values at appropriately scaled and translated pixel locations in the distance transforms for each kind of visual feature, and scaling and thresholding those distances values according to the parameters of each part. We tested the detection algorithm on 20 2.26 GHz Intel 24-core machines in parallel, each 24-core machine had 12 GB of RAM. Under these conditions, this process usually takes about 15-30 seconds. The search procedure is parallelized by giving each processor core its own priority queue to search its own sub-set of the pose search space. When a priority queue for one core becomes empty, it requests more work, and another core is chosen to delegate part of its priority queue to the empty core. If searching for only the best n detections, then the current nth best probability is continually broadcasted to all of the CPUs because any branches in the search tree with lower probability can be safely discarded. At the end of the search, the final set of detections are collected and sorted together. The speed of this process could be improved by the fine-grained parallelism available on a GPU architecture, by computing the "vote" from each of the object parts in parallel. Under this strategy, the memory devoted to the priority queue would be 39 located on the host CPU, while the memory devoted to the read-only precomputed tables would be located on the GPU for faster access. The amount of communication between the GPU and the CPU would be low: only a few numbers would be transferred at evaluation of a search node: the current search region would be sent to the GPU, and the probability of that region would be returned to the CPU. This means the problem would be unlikely to suffer from the relatively low-bandwidth connection between a CPU and a GPU. Non-Maximum Suppression Input: a list of detections with their corresponding probabilities Output: a subset of that list that only keeps detections whose probabilities are a local maximum If branch-and-bound search continues after the first detection is found, the sequence of detections will always be in decreasing order of probability. In this sequence of detections, there are often many redundant detections bunched very close to each other around the same part of the pose space. In order to make the results easier to interpret, only the best detection in a local region of search space is retained, and the rest are discarded. We refer to this process as non-maximum suppression. Non-maximum suppression takes a negligible amount of computation time. 1.2 Outline of Thesis In chapter 2 we discuss related work in the field of object recognition and situate this thesis in the larger context. In chapter 3, we give the formal representation of the view-based model we developed. Chapter 4 describes how we learn a view-based model from a 3D mesh model and from images. Chapter 5 describes our algorithm for detecting objects represented by a view-based model in an RGB-D image. Chapter 6 describes our experiments and gives experimental results. Finally, chapter 7 discusses the system and gives conclusions and directions for future work. 40 Chapter 2 Related Work In this chapter, we give a brief overview of some of the work in the field of object recognition that relates to this thesis. For an in-depth look at the current state of the entire field of object recognition, we refer the reader to three surveys: " Hoiem and Savarese [18], primarily address 3D recognition and scene understanding. " Grauman and Leibe [16] compare and contrast some of the most popular object recognition learning and detection algorithms, with an emphasis on 2D algorithms. " Andreopoulos and Tsotsos [2], examines the history of object recognition along with its current real-world applications, with a particular focus on active vision. 2.1 Low-Level Features Researchers have explored many different low-level features over the years. Section 4.3 of Hoiem and Savarese [18] and chapter 3 of Grauman and Leibe [16] give a good overview of the large number of features that are popular in the field such as SIFT [25] or HOG [7] descriptors. In addition to these, much recent attention in the field has been given to features learned automatically in deep neural networks. Krizhevsky et 41 al. [21] present a recent breakthrough work in this area which is often referred to as deep learning. In this thesis, we use edges and depth features. Edges are useful because they are invariant to many changes in lighting conditions. One of the most popular edge detectors by Canny [3] is fast-it can detect edges in an image in milliseconds. More recent edge detectors like that of Maire et al. [26] achieve a higher accuracy by using a more principled approach and evaluating accuracy on human-labeled edge datasets-however these detectors tend to take minutes to run on a single CPU'. With the advent of RGB-D cameras and edge datasets, the most recent edge detectors such as the one by Ren and Bo [28] have taken advantage of the additional information provided by the depth channel. We use the Canny [3] and Ren and Bo [28] edge detectors in our experiments. Although the computer vision research community has traditionally focused on analyzing 2D images, research (including our work in this thesis) has begun to shift towards making use of the depth channel in RGB-D images. In this thesis, we also use simple depth features: at almost every pixel in an RGB-D image, there is a depth measurement, in meters, to the nearest surface intersected by the ray passing from the focal point of the camera through that pixel (however, at some pixels, the RGB-D camera fails, giving undefined depth measurements). 2.2 Generic Categories vs. Specific Objects We humans can easily recognize a chair when we see one, even though they come in such a wide variety of shapes and appearances. Researchers have primarily focused on trying to develop algorithms that are able to mimic this kind of flexibility in object recognition-they have developed systems to recognize generic categories of objects like airplanes, bicycles or cars for popular contests like the PASCAL [8] or ImageNet recognition challenges [29]. In addition to the variability within the class, researchers have also had to cope with the variability caused by changes in viewpoint and lighting. The most successful of these algorithms, such as Felzenszwalb et al. [10], 'but GPUs seem to be a promising way to speed these detectors up. 42 Viola and Jones [36] and Krizhevsky et al. [21] are impressive in their ability to locate and identify instances of generic object classes in cluttered scenes with occlusion and without any contextual priming. These systems usually aim to draw a bounding box around the object, rather than finding an exact estimate of the position and rotation of the object. They also usually require a large number of images with hand-labeled annotations as training examples to learn the distribution of appearances within the class. Image databases such as ImageNet [29], LabelMe [30] and SUN [37] have been used to train generic detection systems for thousands of objects. In this work, we, along with some other researchers in the field, such as Lowe [25], and Nister and Stewenius [27] have chosen to work on a different problem-recognizing an object instance without class variability, but requiring a more accurate pose estimate. Setting up the problem in this way rules out all of the variability from a generic class of objects. For example, instead of looking for any bottle of laundry detergent, these algorithms might specifically look for a 51 fl. oz. bottle of Downy laundry detergent manufactured in 2014. Although within-class variability is eliminated by simplifying the problem, there is still variability in shape and appearance from changing viewpoints and lighting. Chapter 3 of Grauman and Leibe [16] discusses a number of local feature-based approaches that have been very successful in detecting and localizing specific object instances with occlusion in cluttered scenes using only a single image of the object as a training example. But these approaches usually require the objects to be highly textured, and their accuracy tends to decrease with large changes in viewpoint. 2.3 2D vs. 3D vs. 21D view-based models Hoiem and Savarese [18] divide view-based object models into three groups: 2D, 3D and 21D. 43 2.3.1 2D view-based models Researchers have used a variety of different 2D object representations. If the object class is like a face or a pedestrian that is usually found in a single canonical viewpoint, then it can be well represented by a single 2D view-based model. Two of the most popular techniques for detecting a single view of an object are rigid window-based templates and flexible part-based models. Window-based models One search strategy, commonly referred to as the sliding window approach, compares a rigid template (a "window") to every possible position (and scale) in the scene to detect and localize the object. Viola and Jones [36] demonstrated a very fast and accurate face detector, and Dalal and Triggs [7] made a very accurate pedestrian detector using this technique. More recently, Sermanet et al. [31] have successfully used the deep learning approach in a sliding window strategy, and Farfade et al. [9] have shown that this kind of strategy can even be robust to substantial variations in poses. However, window-based methods have primarily been used for object detection and have not yet been demonstrated to localize precise poses. Part-based models In order to detect and localize objects in a broader range of viewpoints, the viewbased model may need to be more flexible. A common way of adding flexibility is to modularize the single window template by breaking it into parts. Each part functions as a small template that can be compared to the image. Several different representations have been used to add flexibility in the geometric layout of these parts relative to each other. Lowe [25] used an algorithm called RANSAC (invented by Fischler and Bolles [14]) to greedily and randomly match points to a known example (see section 2.4.1). Another technique is to represent the layout of the parts as if they were connected by springs. The less the spring needs to be stretched to fit the image, the better the 44 Figure 2-1: Fergus et al. [13] learned representations of object classes (for example, spotted cats) using a fully-connected model. match. The stretch of the springs and the quality of matching the individual part templates are combined together to score the overall object detection. In this way, each part can be thought of as casting a weighted "vote" for the position of the object. The space of possible "votes" from the parts is sometimes referred to as a Hough transform space. Fergus et al. [13] worked with part-based models in which the parts are fully connected to each other by springs (see figure 2-1), but detection using these models can be computationally expensive. Crandall et al. [5] introduced a family of models called k-fans in which the number of springs connected to each part can range from the fully-connected model where k = n with n parts in the model (as in Fergus et al.), down to the star model where k = 1. They showed that k = 1-fans, in which each part is only connected to a single central part (see figure 2-2), can be comparably accurate to k > 1-fans, and detection can be much more computationally efficient by using distance transforms to represent the votes from each part. Felzenszwalb et al. [10] used multiple templates designed by Dalal and Triggs [7] as parts of a 1-fan model to create one of the most successful 2D object detectors. To represent objects from a wider range of viewpoints, Felzenszwalb et al. [10] (and many others) have combined multiple 2D view-based models into a single mixture model, in which each model represents a different viewpoint. In essence, this strategy treats different views of an object as different objects, each to be detected separately. 45 Figure 2-2: Crandall et al. [5] used a 1-fan model in which most part locations are independent of each other, yielding faster detection. The view-based models proposed in this thesis can be seen as an extension of 1-fans to the full 6 degree-of-freedom space of rigid transformations (translations and rotations). The parts of our view-based models "vote" in Hough Transform space, and the "votes" are represented by distance transforms. 2.3.2 3D view-based models The other end of the spectrum of view-based object models is to represent the distribution of object appearances entirely in three dimensions. Chiu et al. [4], represent an object by a collection of nearly planar facades centered at fixed 3D positions in space and detect the object using distance transforms. Lim et al. [24] use 3D mesh models of furniture from Ikea to detect and localize the pose of objects in 2D images using 2D keypoints and RANSAC search. Glover and Popovic [15] represent an object by a collection of oriented features in 3D space, and use a randomized method to match model points to points in the "point cloud" from an RGB-D camera. Aldoma et al. [1] use a 3D mesh model to detect the object in a depth image from the Kinect camera. They introduce new 3D point descriptor that is used to match the mesh to the point cloud. The view-based models in this thesis do not contain a full 3D representation of 46 the object, so would not directly fit into this category of models. However, the work of Aldoma et al. [1] can be viewed, from an end-user's perspective, as similar to ours, because the training input is a 3D CAD model, and the detection algorithm operates on depth images from the Kinect. However, in addition to the depth images, we also use the picture (RGB) channel of the RGB-D image. 2.3.3 2 12 D view-based models There has also been work on models that are not entirely 2D, but not entirely 3D either. Hoiem and Savarese [18] use the name 2)D to refer to models that have some dependency between viewpoints (i.e. they are not simply a mixture of separate 2D models), yet they do not have a full explicit 3D model of the object. Thomas et al. [33] demonstrate a system that tracks the affine transformations of regions across a sequence of images of a particular object instance from different viewpoints. Each discrete view-based model is represented separately, but it is linked to the other view-based models in the view sphere by sharing parts. Detected features "vote" via a Hough transform for where object is likely to be for each view. Votes from other view-based models are also combined to find the final detection. Su et al. [32] use videos from a camera moving around a single object instance, along with a set of unsorted and unlabeled training images of other instances in the category to learn a dense model of visual appearance. The viewpoint models are morphed linearly between key views on the view sphere, so they can be used to detect objects from previously unseen views and accurately localize their poses. The view-based models we present in this thesis bear resemblance to Su et al [32] because we use piecewise linear models to represent the transformation of parts in the model, much like their linear morphing between key views. For this reason, our view-based models can also be used to accurately localize object poses. 47 2.4 Search: Randomized vs. Cascades vs. Branchand-Bound Object detection, which is the main computational task of an object recognition system, involves searching for the object over a large space of potential hypothesis locations. The "brute force" approach of fully evaluating every possible hypothesis is only feasible for low-dimensional object poses spaces. There are so many hypotheses in high-dimensional spaces that they are prohibitively computationally expensive to evaluate exhaustively. We look briefly at three search strategies: randomized, cascades and branch-and-bound. 2.4.1 Randomized Many object recognition systems have made effective use of the RANdom ASmple Consensus (RANSAC) algorithm to efficiently search the space of object positions in an image-Lowe [25] and Lim et al. [24] use this algorithm, as mentioned above. RANSAC is a robust iterative method that estimates parameters by random sampling and greedy matching. RANSAC runs for a pre-specified number of iterations before terminating. Although RANSAC is often very efficient, there is no guarantee that it will have found the best solution when the number of iterations are completed. Moreover, RANSAC is designed only to estimate the best detection, so it cannot be directly applied to images with multiple instances of the same object. 2.4.2 Cascades Viola and Jones [36] introduced another efficient method to search an image they call a cascade of classifiers. A cascade of classifiers is a sequence of increasingly complex classifiers that "fails fast." The early classifiers in the cascade are very fast to evaluate and are chosen to have nearly 100% true positive detection rate, with some false positives. In this way, if an early classifier says a hypothesis is not the object, then one can be reasonably certain that it is not the object without running 48 any further classifiers in the cascade. Viola and Jones used a cascade of classifiers to evaluate every position and scale in an image. This kind of "brute force" search would normally be computationally expensive, but since the cascade "fails fast", it can run very efficiently. Their face detector was the first to run on a full-sized image in less than one second using only a single CPU. Cascades of classifiers have since been applied to many other detection systems, including an extension [12] of the work by Felzenszwalb et al. (mentioned above [10]). Another technique that Viola and Jones also used to achieve efficient face detection is summed area tables (also known as integral images). Summed area tables allow the summation of values in any rectangular sub-window of an image with a small constant number of machine instructions. This property allowed their detector to detect faces quickly at any size or scale without influencing the running time. The detector in this thesis also makes use of summed area tables for the same reason. 2.4.3 Branch-and-Bound Branch-and-bound search is another method that is sometimes used to efficiently search a space of potential object positions. Branch-and-bound search uses a bounding function that gives an over-estimate of the probability that an object is located in a region of hypothesis space. Branch-and-bound search is guaranteed to find the best detections in an image without individually evaluating every hypothesis, saving time without sacrificing accuracy. Lampert et al. [22] demonstrate several applications of branch-and-bound to searching the 4-dimensional space of rectangular bounding boxes in an image. Lehmann et al. [23] formulate a branch-and-bound search for objects in the 3D pose space (2D position in the image plane, plus ID scale). Like our algorithm, Lehman et al. also use Hough transforms in which each part "votes" for where the object is likely to be located in space. Moreover, they use summed area tables to compute their bounding function efficiently in a way that also bears a very close resemblance to the system presented in this thesis. 49 2.5 Contextual Information A number of researchers have pointed out the importance of using contextual information to help improve object recognition. Notably, Torralba introduced a probabilistic representation of context based on 2D image scene statistics called GIST [34]. Although we do not incorporate 2D contextual information in this thesis, our system could be extended to incorporate this useful information. Hoiem et al. [17] demonstrate a framework for estimating the 3D perspective of the scene and its ground plane (using vanishing points), and they show that this information can be used to significantly improve the accuracy of an object detector. We also make use of this information to improve both accuracy and running time. However, rather than estimating the table plane, we calculate it directly from the pitch angle of the camera and the height of the camera above the table. This information was read from the PR2 robot we used in our experiments. As with Hoiem et al., we assume that the camera is always level, i.e., camera roll angle is always 0 degrees. 2.6 Part Sharing Researchers have often highlighted the importance of re-using a shared library of features to detect many different objects and views. Notably, Torralba et al. [35] demonstrated a multi-class learning procedure that scales efficiently with the number of object classes by sharing visual words among classes. As the number of classes increases, more sharing occurs, and the total number of unique visual words used by all classes remains small. We noticed that their algorithm usually chooses simple visual words for sharing (see figure 2-3). In particular, the most commonly shared parts look like small line segments. This served as motivation for the set of visual features we chose to use in this work. 50 U"-R. U--" screen poster car rontal chair keyboard bottle car side mouse mouse pad can trash can head person mug speaker traffic light one way Sign do not enter stop Sign light CPU Figure 2-3: Torralba et al. [35] introduced a learning algorithm to share parts among a large number of object classes. A white entry in the matrix indicates that the part (column) is used by the detector for that class (row). Parts are sorted from left to right in order of how often they are shared. 51 52 Chapter 3 Representation In this thesis, we aim to find the most likely pose(s) of an object in an image. We represent an object as a probability distribution, rather than rigidly fixed values, in order to account for uncertainty in sensing and variability caused by explicit approximations made by our model. Section 3.1 discusses our particular choices in representing the space of object poses, section 3.2 discusses how we represent images and section 3.3 discusses our representation for view-based models. Then section 3.4 discusses some of the particular approximations we use, section 3.5 discusses the sources of variability and 3.6 discusses and defends our particular choice of probability distributions. 3.1 Object Poses We choose to represent the position and rotation of objects so as to facilitate several natural approximations to the appearances of 3D objects. The most popular representations of an object pose are (1) a 4 x 4 homogeneous transform matrix and (2) a 3D position vector and a quaternion for rotation. We choose a different representation that can be easily converted to either of these popular representations. We represent the position of an object as a triple (x, y, z), in which (x, y) is the pixel position of the perspective-projected center of the object on the image. In this way, translations in the (x, y) plane are simply approximated as shifting the pixels 53 of the appearance of the object in the image plane. We represent the depth of the object as the distance z from the focal point of the camera to the center of the object, rather than the Euclidean z-axis which is always measured perpendicular to the image plane. This choice is well suited to the weak-perspective approximation, as z be used directly to compute the scale of the object. Together, the position triple (x, y, z) forms a modified spherical coordinate system. The origin of the coordinate system is the focal point of the camera, the radius is z and inclination and azimuth angles are replaced by pixel positions (x, y) on the image plane. There are two important factors that lead to our choice of a representation for object rotations. First, we would like to respect the approximations discussed above. In other words, if an object is translated in the image plane, its rotation should not have to change in order to keep approximately the same (shifted) visual appearance. We accomplish this by measuring an object's rotation with respect to the surface of a focal point-centered sphere (see figure 3-3). Second, we would like to represent rotations with the smallest number of variables as possible. The most natural option for representing rotations-quaternions-require four variables (constrained to a 3D manifold). If computational speed was not a factor, we might choose quaternions. However, we will be searching a large space of object poses using branch-and-bound, and the speed of the overall search is directly related to the number of branches that must be explored. Branching in a higher dimensional space increases the number of branches multiplicatively. Moreover, explicitly branching on a 3D manifold embedded in a 4D space would add some additional computational burden. We therefore use Euler angles to represent rotations since they use the smallest number of variables. As discussed above, we define the Euler angles (r-, ry, rz) for an object with respect to the tangent of a sphere centered on the focal point of the camera and the natural "up" direction of the camera. One of the biggest problems with Euler angles is the loss of a degree of freedom at certain singular points known as a gimbal lock. Due to the fact that we already must constrain our search to upright object rotations on horizontal surfaces in order to detect objects in a reasonable amount of time (see section 5.3.4), we naturally avoid 54 these singular points in the space of Euler angles. If we were able to speed up our detection algorithm to run unconstrained searches at a more practical speed then we might be able to afford the computational luxury of switching back to the quaternion representation for rotations. Thus we define a pose to be a six-tuple (x, y, z, rX, rY, rz), where (x, y) is the projected pixel location of the center of the object in the image plane, z is the distance of the center of the object from the focal point of the camera and (rx, ry, rz) are Euler angles representing the rotation of the object: rx is the rotation of the object about a line that passes through the focal point of the camera through the center of the object (this can roughly be thought of as rotation in the image plane), ry is the rotation of the object about a horizontal line that is perpendicular to the first line, and rz is the rotation of the object about a vertical line perpendicular to the other two. The rotations are applied in order: rx followed by ry, followed by rz. 3.2 Images We represent an RGB-D image as a collection of visual and depth features. All features have an integer pixel location in the image. There is a pre-defined library of kinds of visual features (we use edge detections within a range of angles, but this could be extended to include a class of local textures). The detectors for each kind of visual feature are binary-valued. In other words, a visual feature of a certain kind is either present or absent at a particular pixel. This choice to restrict visual features to 2 values is for the sake of reducing memory overhead, as we will see in section 5.1. Depth features, on the other hand, are real-valued. There is a real-valued depth measurement associated with each pixel in the image (except where the depth detector fails-for such a pixel, the depth measurement is undefined). Depth measurements are measured along a line from the focal point of the camera, through the pixel in the image plane, to the point where it intersects an object in the scene. This means that the depth measurements are not measured as z-coordinates in the standard Euclidean coordinate system. Instead, the coordinate system can be viewed as a 55 spherical coordinate system centered at the focal point of the camera. 3.3 View-Based Models We define a viewpoint to be a triple (rz, ry, rz) containing only the rotational information from a pose, and we define a viewpoint bin to be an axis-aligned bounding ) box in viewpoint space. Formally, a viewpoint bin is a pair of viewpoints (v 1 , v 2 where vi = (rxi, ry , rzi) and v 2 = (r2, ry 2 , rz 2 ) such that rx2 > rx1, ry2 > ry1 and rzi. A viewpoint (rX, ry, rz) is in the viewpoint bin (v 1 , v 2 ) iff rx1 < r < rx2, rz2 ry1 < ry 5 y2 and rzi < rz < rz2. We define a viewpoint bin model to be a pair (V, D) of vectors of random variables representing visual parts and depth parts respectively. A depth feature dj is drawn from the distribution of Dj. In other words, a depth part Dj can be matched to a particular depth feature dj in the depth channel of the image with some probability. Similarly, a visual feature Vk is drawn from the distribution of Vk. In other words, a visual part Vk can be matched to a particular visual feature Vk in the image with some probability. We define a view-based model as a set of pairs. Each pair (M, B) in the view-based model represents the viewpoint bin model M of the appearance of the object for views in the corresponding viewpoint bin B. The goal of detection for each viewpoint bin model (V, D) is to find the highest probability pose p in the image by finding the best matching of parts to features: argmax max Pr(7 = V, D = d P = p) = argmax max Pr(V = 7, D = dIP = p) Pr(P= p) p p v,d V,d (3.1) We assume that all of the distributions for parts are conditionally independent of each other and that the distribution over poses Pr(P) is uniform within the region of 56 ............. image - ID A A visual features B r W e W A A F A F 19 depth features - F Figure 3-1: An illustration of features in a 1D slice of an RGB-D image. This simplified example uses three kinds of visual features (A, B and C). The depth features are a set of depth measurements at every pixel in the image. pose space we are searching. These assumptions give: argmax max Pr(V = 6, D' = dP = p) = argmax max fPr(Dj = d IP = p) p p ,d* Vd fJ Pr(Vk = VkIP = p) k (3.2) Each visual part can only be matched to its own kind of visual feature. We define the projected expected location M~, of a visual part in the image plane as a function of the object pose p = (x, y, z, rx, ry, rz): mVn = [ + z [ , (3.3) erfr] (3.4) where [i;] = 1 g h 57 and where a, b, c, d, e, f, g and h are constants. We choose Pr(VIP) to be a distribution that represents our uncertainty about the location of the visual feature t that matches a visual part V in the image plane: Pr(V = t}P = p) OC e - (3.5) v where t is the 2D pixel location of a certain feature in the image plane, v is the scalar variance of the uncertainty of this location around the mean nmv and r is the receptive field radius, outside of which visual features are ignored. A visual part does not have any representation of depth. We discuss more of the details of this choice of distributions in section 3.6. For simplicity, we assume there is no uncertainty over the location of a depth feature in the image plane. We define the projected location f of a depth part in the image plane for the pose p = (x, y, z, rx, ry, rz) and constants px, py as: .(3.6) [1 +i [2Y 1= Instead of uncertainty in the image plane, we represent uncertainty of the depth of a depth part along a line from the focal point of the camera through f We define the expected depth of a depth part md as a linear function of the pose: md = Z + fi rirr2] [, (3.7) for constants i, j, k, 1. And we choose the distribution Pr(DIP) that represents our uncertainty about the depth of the depth feature that matches a depth part D along this line: Pr(D~t~~p~cx6min((t-md) Pr(D = tIP = p) cX e2 ,r2(.8 (3.8) where t is the depth measurement from the depth channel at pixel location f, v is the variance of the uncertainty of this depth about the mean md and r is the receptive field radius, outside of which depth features are ignored. We discuss more of the 58 details of this choice of distributions in section 3.6. We note that the probability distributions in equations 3.8 and 3.5 do not integrate to 1 on [-o, o] because their value outside of the receptive field radius is non-zero (see figure 3-5). These can be made into valid probability distributions if we assume that the domain is t E [md - r,md + r] for equation 3.8 and (F- iV) 2 ; r 2 for equation 3.5 and that there is also a single discrete "out" state in the domain of each distribution when the feature is outside the receptive field radius. 3.4 3.4.1 Approximations (x, y) Translation Is Shifting In The Image Plane Built into these equations is the assumption that motion tangent to a sphere located at the focal point of the camera is equivalent to translation on the image plane. In other words, if the object is moved to a new point on the sphere, and is turned such that it has the same rotation with respect to the plane tangent to the surface of the sphere at the new point, it should register the same depth values, at a translated position in the image plane. While this a good approximation for most cameras which have narrow fields of view, in figure 3-3, one can see that this assumption does not hold perfectly, especially near the boundaries of the image. 3.4.2 Weak Perspective Projection Standard perspective projection calculates the projected position of each point in space from its Euclidean coordinates (p, py, pz). In other words, standard perspective projection would replace the denominators in both equations 3.3 and 3.6 with pz so that the locations of each part is scaled independently according to its own depth. However we use weak perspective projection, as an approximation to standard perspective projection. Weak perspective approximation scales the whole object as if all points had the same depth, rather than scaling each point separately. The denominators in these equations contain the distance z from the focal point of the camera 59 to the center of the object, rather than pz, the Euclidean z-coordinate of the point to be projected. This approximation is used because the depth of visual features (especially edges at the contour of the object) are not always available. This is a reasonable approximation when the size of the object is small compared its distance from the camera, as is true for small objects to be manipulated by a robot. 3.4.3 Small Angle Approximation Visual features like edges seen by the camera are not always produced by the same point on the object frame as the object rotates-especially for objects with rounded surfaces. Moreover, some features may disappear due to self-occlusion as the object rotates. Rather than trying to model all of these complex factors explicitly, we choose simple linear models (equations 3.4 and 3.7). The constant coefficients are chosen during the learning phase (chapter 4) by selecting features whose motion appear nearly linear over the domain of the viewpoint bin. It is also true that rotating the object will generally change the edge angles. We do not explicitly address this issue, but leave it to the learning procedure (described in chapter 4) to select edge features that do not rotate beyond their angle range for object rotations within the viewpoint bin. This means that the design choice of the best viewpoint bin size is related to the choice of the edge angle discretization. 3.5 Sources Of Variability There are several sources of variability in part locations: * the assumption that motion tangent to a camera-centered sphere is equivalent to translation on the image plane (see figure 3-3), " the weak perspective approximation to perspective projection for part locations, * the small angle approximation to the true part locations under object rotation, 60 .............................................. .... .............. ...... Figure 3-2: Warping spherical coordinates into rectangular coordinates. Although a natural representation for a depth image is a Euclidean space, we represent this space in spherical coordinates and visualize the spherical coordinate system (top) as if it was warped into a rectangular coordinate system (bottom). You will also notice that the hidden and occluded surfaces of the objects are indicated by thin lines in the world space (top) and removed entirely in the warped spherical coordinates (bottom). 61 --------- * errors in detecting visual features such as missed or spurious detections (figure 3-4) and " errors in depth sensing, such as missing or wrong depth values (see Khoshelham et al. [20] for details on the accuracy of the Kinect). For these reasons, we use probability distributions, rather than rigidly fixed values, to model the locations of parts. 3.6 Choice of Distributions We would like to use the simplest distribution that can represent these uncertainties well. Normal distributions, C e t 2 2v, are simple, but they give a high penalty for errors made by the visual feature detector or by the depth sensor. Therefore, we modify the normal distribution by removing additional the penalty for parts that are farther than r units from the expected location: oc e min(t2,2) 2v (see figure 3-5).We call r the receptive field radius because the distribution is only receptive to features found within that radius. If the feature is not found within that radius, it is assigned the same penalty as if it was found exactly at that radius. Crandall et al. [5] also used normal distributions with a receptive field radius.' But we further simplify this distribution in two ways: (1) we constrain the covariance matrix to be circular rather than generally elliptical (see figure 3-6), and (2) we assume that each part is binary-valued (it is either present or absent at every pixel), rather than real-valued.2 1Although Crandall et al. do not explicitly describe their use of a receptive field radius in their paper [5], the code they used for the experiments in their paper uses this technique. 2 See section 5.1 for a discussion on why we use these simplifications. 62 - -i "I I _ _ 11 1 11. I'll focal point 11- H- '0, L - focal point U 4D depth depth Figure 3-3: (Left top) Four object poses, all at the same rotation (rx, ry, rz). (Left bottom) These same object poses when spherical coordinates are warped onto a rectangular axes. (Right top) Four object poses at another rotation (r',r', r' ). (Left bottom) These same object poses when spherical coordinates are warped onto a rectangular axes. This figure shows the difference between what we consider rotation in the spherical coordinate system and rotation in the rectangular coordinate system. Changing the position of the object in the spherical coordinate system is actually moving the object tangent to a sphere centered at the camera's focal point. When the spherical coordinate system is warped to be rectangular (bottom), we can see why weak perspective projection is only an approximate representation of the actual transformation: objects are distorted. 63 JL lu Figure 3-4: Parts of these monitors have intensity values similar to the background, so some of their boundaries are missed by Canny's edge detector of [3]. 1-- x x r A Figure 3-5: (Top left) The normal distribution oc e 2. (Top right) The normal dismin(x2, (t2 T 2-v . (Bottom) The log of these plots. tribution with a receptive field radius o e Observe that the logarithm of a normal distribution (top) is a parabola (bottom). 64 Figure 3-6: (Left) A contour plot of a two dimensional normal distribution oc with a general elliptical covariance matrix. A 2D covariance mae-(-Tc-(-) trix is any matrix C = [, b] which satisfies a > 0 and ac - b2 > 0. (Right) A normal distribution with a circular covariance matrix C = vI 65 - [v 0]. 66 Chapter 4 Learning Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera with the ability to measure camera height and pitch angle Output: A view-based model of the object, composed of a set of viewpoint bins, each with visual and depth parts and their respective parameters In this chapter, we break the process of creating a new view-based model into two parts. Section 4.1 describes the fully-automated subsystem that learns a new view-based model from synthetic images of a 3D mesh, while 4.2 describes the manual process involved in collecting and labeling data, acquiring a mesh, and setting parameters. The view-based model learning subsystem of section 4.1 is a "sub-procedure" used by the human in the manual process required to train, test and tune a new view-based model in section 4.2. 4.1 View-Based Model Learning Subsystem Input: A 3D mesh and parameter values such as the set of viewpoint bins Output: A view-based model of the object, which is composed of a set of viewpoint bins, each with visual and depth parts along with the coefficients of the linear model for their positions and the uncertainty about those positions The view-based model learning subsystem (see figure 1-2) is a fully automated process that takes a mesh and some parameter values and produces a view-based 67 model of the object. The position of each object part is a piecewise function of the three rotation angles about each axis, where the domain of each piece of this piecewise function is a viewpoint bin. In other words, objects are modeled using a number of different viewpoint bins, each with its own model of the object's shape and appearance for poses within that bin. The following three learning phases are repeated for each viewpoint bin: 1. rendering the image (section 4.1.1), 2. enumerating the set of features that could be used as model parts (section 4.1.2), 3. selecting the features that will be used in the final viewpoint bin model (section 4.1.3) and 4. combining viewpoint bin models into the final view-based model (section 4.1.4). The view-based model learning subsystem is designed to be entirely automated, and require few parameter settings from the user. However, there are still a number of parameters to tune, as mentioned in section 1.1.1.2. An unusual aspect of this learning subsystem is that the only training input to the algorithm is a single 3D mesh. The learning is performed entirely using synthetic images generated from rendering this mesh. This means that the learned view-based model will be accurate for the particular object instance that the mesh represents, and not for a general class of objects. 4.1.1 Rendering We generate synthetic RGB-D images from a 3D mesh model M by randomly sampling poses of the object that fall within the specified viewpoint bin B and the field of view of the camera. We also randomly change the virtual light source position to add some extra variability in appearance. We use the OpenGL MESA GLX library to render images, combined with the xvf b utility, which allows us to render on server machines on the cloud using software when hardware rendering (on a GPU) is not available. Values from the z-buffer are used to compute depth. Recall that depth is 68 vlA-nn"nt Kin I vipwnnint hin 2 Figure 4-1: Examples of synthetic images from two different viewpoint bins of a downy bottle. computed as the distance /x 2 + y 2 + z 2 between the Euclidean point (x, y, z) and the focal point of the camera located at the origin (0, 0, 0). We then crop the image to a minimal square containing the full bounding sphere of the object. Procedure 1 Render and crop a synthetic image. Input: 3D mesh model M, viewpoint bin B Output: color image RGB, depth image D, randomly chosen pose p of the object within B 1: procedure SAMPLESYNTHETIC-IMAGE(M, B) p +- a random pose of M in B and within the simulated camera's field of view 2: 3: 4: 5: 6: 7: 8: r +- the projected radius (in pixels) of the bounding sphere of M at p (x, y) +- center (in pixels) of the projected pose p (lX, ly, 1) +- a random position for a simulated ambient light (RGB, D) +- render M in pose p with ambient light at (lx, 1Y, lZ) RGB <- crop RGB for x-range [x - r, x + r] and y-range [y - r, y + r] D +- crop D for x-range [x - r, x + r] and y-range [y - r, y + r] 9: return (RGB, D, p) 4.1.2 Feature Enumeration Recall that equations 3.4 and 3.7 define the locations of visual and depth parts as a linear function of the viewpoint (rx, r., rz). We designed a supervised learning procedure to learn the parameters of these linear functions from synthetically rendered images, along with the variances v in equations 3.5 and 3.8 by a standard least squares formulation with a matrix of training labels A, a matrix of training examples B and a matrix of linear coefficients X. We use a training set of n synthetic images of the object along with the exact pose of the object in the image for both visual parts and depth parts. The training examples matrix A has size n x 4, where each row is an 69 example of the form [1 rx ry rz ]. The size of the other matrices B and X depend on whether it is a visual or depth part. The linear least squares formulation is: X = argmin(AX - B)T (AX - B). x (4.1) We can assume that the columns of A are linearly independent (i.e. A is full column rank) because the rotations rx, ry and r, are sampled at random. If A is full column rank, we know that ATA is invertible. Then the solution for the optimal matrix of constants X is: = (ATA)- 1 AT (4.2) B and the sum of square errors is used to compute the unbiased variance v for this solution: V =Tr (IT AT A )- Tr(X TAT B) +Tr BT B . (4.3) In the case of visual parts, the label matrix B has size n x 2 with a 2D pixel location of the nearest visual part [x Y] in each of its n rows, and the matrix of constant coefficients X is ca b c]. _g h. In the case of depth parts, the label matrix B has size n x 1 with a scalar depth of the n entries, and the matrix of constants X is . measurement for a particular location relative to the scaled and aligned image in each In principle, this formulation is sufficient to compute all depth and visual parts from the n training images. In practice, however, this requires a large amount of memory-there will be Kw 2 label matrices A and training matrices B for each of the K kinds of visual parts at each pixel in the w x w scaled and aligned training images. Similarly, there will be w 2 label and training matrices for all of the enumerated depth parts. Since each of these matrices has n columns, this requires (6K + 5)nw2 floating point numbers to be kept in memory as the synthetic images are being rendered. For an image width w = 245, K = 8 kinds of visual features, and n = 100 training images, and if the numbers are represented with 8-byte double-precision, it would take 2.3 GiB to represent these matrices alone. Another, less memory-intensive, alternative 70 would be to render n images separately for each of the w 2 pixels, which means that SAMPLESYNTHETIC-IMAGE would be called nw 2 times, rather than just n, but this would be computationally expensive. To address this issue, we observe that the quantities ATA, ATB and Tr BTB are sufficient to compute X and v, and they have small constant sizes that do not depend on the number of training images n. Moreover, we observe that they can be updated incrementally when each new synthetic image is rendered and a new observation and label becomes available. In particular, when the jth training example is [1 ATA = y L A rzj rxjzj ryi rjyj rx, rxj r2 (4.4) yj rr r ryjrzj zj r xjzj ryjrzj rZj : . 1 n ryj In the case of visual parts, when the jth label is [xj Y]: fl AT B = \' F X Y [xjr, yjr 1 xjrz yjrz. j=1 (4.5) n Tr BTB = +y . x (4.6) j=1 j=1zrz . And in the case of depth parts, when the jth label is zj: n Tr BT B = z . (4.8) j=1 We use this observation to define an algorithm. We define an incremental least squares visual part to be a triple (ATAV,ATB VTr BTB V). The notation "ATAV" is just a name for a 4 x 4 matrix containing the current value of the expression ATA for a visual feature given the training examples that have been sampled so far. Similarly, ATBV is a 4 x 2 matrix with the incremental value of the expression ATB for a visual feature, and Tr BTBV is a scalar with the incremental value of the expression Tr BTB for a visual feature. 71 An incremental least squares visual part can be updated by the viewpoint training example (rx, ry, rZ) and a visual feature location training label (x, y): ) Procedure 2 Update an incremental least squares visual part by adding a new training example. Input: least squares visual part (ATAVATB VTr BTB V), viewpoint training example (rX, ry, rz), visual feature location training label (x, y) Output: updated least squares visual part (ATAV',ATB V',YBTB V procedure UPDATEVISUALPART((ATAVATB V,TrBTB V), (X, Y), (rx, ry, rz)) rx -l 2: 2:~~~~ ATAV' -ATA ry rz +xomeuain r1 r rxrrxr 2~r V ry rxry r Y ryrz V + Lrz V + c> from equation 4.4 rxrz ryrz r2J x Y > from equation 4.5 3: ATBV 4: TrBTBV' +TrBTB 5: return (ATAV',ATB V'Tr BTB V r yry [xrz yrz J > from equation 4.6 V + X2 + y2 ) +ATB . 1: Procedure 3 Finalize an incremental least squares visual part after it has been updated with all training examples. Input: least squares visual part (ATAVATB V,Tr BTB V), receptive field radius r, integer index of the kind of visual part k Output: a visual part 1: procedure FINALIZEVISUALPART((ATAVATB V,Tr BTB V), r, k) > from equation 4.2 2: X +-ATA V- 1 ATBV > The top left element is the number of training images. 3: n +-ATA Vi, 1 4: V +- n- (Tr(XT ATAVX) - 2 Tr(XT ATBV) +TrBTB V) > from equation 4.3 Sa b 5: e <-X .g h] 6: return a visual part of kind k, with variance v, receptive field radius r, and constants a, b, c, d, e, f, g, h We similarly define an incremental least squares depth part to be a triple (ATAD,ATB DTrBTB D), where ATAD is a 4 x 4 matrix, ATBD is a 4 x 1 matrix and TrBTBD is a scalar. By default, an incremental least squares depth part is initialized to be (0, 0, 0). An incremental least squares depth part can be updated by the viewpoint training example (ri, ry, r,) and a depth feature depth training label z: In order to describe the algorithm to enumerate features, we now review the concept of a distance transform. A distance transform takes a binary matrix M : m x n and produces an integer-valued matrix D : m x n in which each element Dxy is the 72 Procedure 4 Update an incremental least squares depth part by adding a new training example. Input: least squares depth part (ATAD,ATB D,TrhBTB D), viewpoint training example (rX, ry, rz), depth feature depth training label z Output: updated least squares depth part (ATAD',ATB D',Tr BTB D') 1: procedure UPDATEDEPTHPART((ATAD,ATB D,T1 rBTB D), z, (rX, ry, rz)) r1 ATAD' +ATA D+ Lrz 3: ABD+ ry rz rxrz ryrz r wATB D+ zI, r2 > from equation 4.4 . 2: rx rx r2Iryrr x r2r ry rxry r V ryrz >from equation 4.7 [zrzJ + z2 4: TrBTBD' +-TrBTB D 5: return (ATAD',ATB D',TrBTB D') r> from equation 4.8 Procedure 5 Finalize an incremental least squares depth part after it has been updated with all training examples. Input: least squares depth part (ATAD,ATB D,TrBTB D), receptive field radius r, (Px, py) pixel location of depth part Output: a depth part 1: procedure FINALIZEDEPTHPART((ATADATB DiT BTB D), r) > from equation 4.2 2: X +-ATA D- 1 ATBD > The top left element is the number of training images. 3: n +-ATA D1 ,1 > from equation 4.3 4: v+- n- (Tr(XTATADX) - 2Tr(XT ATBD) +Tr BTB D) 5: +- X . Il 6: return a depth part at location (px, py), with variance v, receptive field radius r, and constants i, j, k, 1 73 viewpoint bin 1 depth viewpoint bin 2 edges (8 edge angles) -22.5' -45' -67.5' 22.5' 45* 67.5* depth 00 -67.51 90* 22.5' edges (8 edge angles) -22.5* -45' 45' 67.5* 0. 90* Figure 4-2: This figure shows two different view bins (left and right). For each viewpoint bin, the w 2 depth parts are arranged according to their position in the w x w scaled and aligned image, with red pixels being those in which there was no depth found in at least one of the training images, and the gray values are proportional to the variance v of the least squares fit for that pixel. There are also K = 8 kinds of visual features (an edge feature found at various angles), and the gray similarly depicts the variance v. squared euclidean distance between M,,y and a nearest element in M whose value is '1'. DX,= min x1, yl Mx (x- X) 2 + (y- y)2 (4.9) 1,y,=1I Optionally, matrices X and Y may also be produced, containing the argmin indexes x 1 and yi respectively, for each element in M. Felzenswalb and Huttenlocher [11] describe an algorithm to compute the distance transform in time that scales linearly with the number of pixels O(mn) in an m x n image. We now describe the ENUMERATEFEATURES procedure which uses synthetic im- ages to enumerate a set of linear models, one for each pixel and each kind of part. The ENUMERATEFEATURES procedure is efficient since it calls SAMPLESYNTHETICAMAGE exactly n times, and it is much more memory efficient than the naive algo- rithm that stores all of the A and B matrices. To calculate the memory usage, we first observe that ATA is a symmetric matrix, so we only need 10 numbers to represent ATAV and ATAD- ATBV has 8 elements and TrBTBV is just a scalar, so an incremental least squares visual part takes 19 numbers in memory. Similarly, ATBD has 4 elements and ly BTBD is just one number, for a total of 15 numbers for each incremental least squares depth part. So (15 + 19K)w 2 numbers are required to represent all of the least squares parts. For w = 245, K = 8 and 8-byte double-precision values, this takes 74 ) Procedure 6 Enumerate all possible features. Input: 3D mesh model M, viewpoint bin B, number of images to render n, synthetic image width w, receptive field radius for visual parts rv, receptive field radius for depth parts rD Output: set of visual parts V (where JVI = Kw 2 for the number of different kinds of visual features K), set of depth parts E (where JV = w 2 1: procedure ENUMERATEFEATURES(M, 2: 3: 4: 5: 6: 7: B, n, w) VF +- a vector of K matrices, each of size w x w containing incremental least squares visual parts initialized to (0, 0, 0) DF +- a w x w matrix of incremental least squares depth parts, each initialized to (0, 0, 0) for j = I to n do (RGB, D, (x, y, z, rx, ry, rz)) +- SAMPLESYNTHETIC-IMAGE(M, B) fork= 1 toKdo V +- binary image of visual features of kind k detected in RGB, D 8: 9: 10: V +- scale V to size w x w (DT, X, Y) +- distance transform of V for all pixel locations (p, py) in w x w do vik,,,,p, 11: 12: 13: +-UPDATE_VISUAL _PA RT(vFk,p,py,, (rx , ry, rz),I W,,,,,, Yxp) D +- scale D to size w x w for all pixel locations (pr, py) in w x w do -UPDATE-DEPTH _PART(DF,,,p,,, (rx, ry, rz), Dp,,y) DFp,p 14: 15: V- 16: D 17: 18: for all pixel locations (px, py) in w x w do fork= 1 toKdo V +- V U { FINALIZEVISUAL -PART(v Fk,px,p, rvk)} 19: 20: 21: 0 ÷0 D +- V U { FINALIZEDEPTHPART(DI,,,PV, rD) return V,D 75 } 76 MiB of memory, compared to the 2.3 GiB required to represent the full matrices. This memory savings is significant because it causes fewer cache misses. The overall running time for this procedure is approximately 15 seconds on a single CPU without any GPU acceleration. 1 The memory savings becomes even more significant when we consider the possibility of parallelizing ENUMERATE-FEATURES on a GPU architecture. The smaller memory footprint fits easily into the video memory on a GPU, even if the GPU is being shared by several cores running ENUMERATE-FEATURES in parallel. Lines 10, 13 and 17 contain for loops that could be parallelized in GPU kernels. We also note that, to save a considerable amount of time in the learning phase, we use the edge detector developed by Canny [3], rather than the detector of Ren and Bo [28] on line 7 of ENUMERATE-FEATURES. We have not seen a significant degradation in detection accuracy from this choice to use different edges for learning and detection. 4.1.3 Feature Selection Given the enumeration of all possible parts, the next step is to choose a subset of them to use in the final viewpoint bin model. First, we decide how many parts we want in our model. We do this empirically in chapter 6. Then we use variance as the primary selection criteria to choose the parts, because features with lower variance in the aligned training images tend to be closest to their mean values, and they are therefore the most reliable and repeatable for detection. However, we cannot use variance as the sole criteria for selecting features because low variance features are usually bunched near by each other (see figure 4-3). 1After implementing this algorithm, the author realized a simple way of eliminating common subexpressions that would lead to a significant savings in memory and time. The observation is that ATA =ATA D is the same for every pixel, since AT A only uses the rotation information r,, r. and rz which is constant for the whole training image. By storing the A matrix once for all features and pixels, the naive approach of representing the whole matrix would be reduced from 2.3 GiB down to 4n + (2K + 1)nw 2 numbers 779 MiB), and the approach we presented would be reduced from 76 MiB down to 10 + (5 + 9K)w 2 numbers, or 35 MiB. Moreover, this observation could be used to reduce the number of matrix inversions of the ATAV =ATA D matrices from (K + 1)w 2 down to just 1, yielding significant time savings in the FINALIZE-VISUALPART and FINALIZEDEPTHPART procedures. 76 0 * *.. 1 0 0 \4- 4* - . I 0 0 01 Figure 4-3: The effect of varying the minimum distance between parts. When the minimum distance is too low (left), some of the higher-variance regions of the object are not modeled at all. However, when the minimum distance is too high (right), the desired number of parts do not fit on the object model. When the maximum variance constraint is used (center), the minimum distance between parts is chosen automatically. A good object model should ideally have parts spread evenly over the object. We therefore focus our attention on designing an algorithm to select features that are both low variance and also spread out over the object. The core of the algorithm, S ELECTF EATURES_-GREEDY, uses a greedy strategy to select features with high variance, constrained to be at least some distance dmin from each other. Note that if dmi,, is too large, we will be forced to choose some very high-variance features that will not be found consistently in images of the object. On the other hand, if dmi,, is too small, it is like the case of only selecting the lowest-variance features without regard to the distance between parts (dmin = 0)-the parts will be bunched and will not be distributed evenly over the whole viewpoint. How, then, can we choose a good value for dmi,,? Observe in figure 4-4 that different objects and even different views of the same object vary in the area of the image they 77 Procedure 7 Select features greedily for a particular minimum allowable distance between chosen parts dmin. Input: a vector of parts P sorted in order of increasing variance, the minimum allowed distance between two selected parts dmin Output: a set of selected parts Q 1: procedure SELECTFEATURESGREEDY(P, dmin) 2: Q +-{} 3: 4: 5: for all P in P do (px, ptY) <- the original pixel location of Pj d +- oo 8: 9: 10: for all original pixel locations (q2, qy) of parts in Q do d +- min(d, v/(px - qx) 2 + (py - qy) 2 if d > dmin then ) 6: 7: Q +- Q U {P} return Q Figure 4-4: Different objects and viewpoints vary in the area of the image they cover. Each object on the top row covers more area in the image than a different viewpoint of the same object directly below it. 78 0 0 '0 O *040 oOOO 7:0.0000 .0' 040. 0 0000000~00 o~0000* .0 000 .0 * 00 0 0 o 04 .0 o~0 je0 0 0 00, 000 0" . 0. 00 .00. 0 0 0 * I * 0 0. *. 0 9 4II4**~ Figure 4-5: When dmin is held constant from one viewpoint (left) to another view with a different area (center), the parts are not evenly spread out to cover the whole area. However, when dmin is chosen automatically using a parameter for the maximum allowable variance (right), the parts tend to be evenly spread, independent of the area of the viewpoint. cover, even when all the distances from the camera are equal. The minimum distance between parts needs to be able to change so that the parts will always evenly fill the entire area of the object in the image. Lower-area viewpoints should have a shorter dmin so that all of the features will fit within the area, and higher-area viewpoints should have a greater dmi,, so that the same number of features be will spread out, filling the whole area. The SELECT-FEATURESGREEDY procedure requires a choice of dmi". It would be difficult and time-consuming for a user to manually select a different value for the parameter for each and every viewpoint region. Instead, we use an automatic method to find a good dmin value; we introduce a different user-specified parameter, Vmax, that specifies the greatest allowable part variance, and, using binary search, we find the greatest dmin that still allows us to greedily pick the desired number of parts from the set of all enumerated features whose variances are all small enough (see figure 4-5). 79 Procedure 8 Select features greedily for a particular maximum allowable part variance vmax. Input: a vector of parts P sorted in order of increasing variance, the desired number of features n, the maximum allowed part variance Vmax Output: a set of selected parts Q 1: procedure SELECT -FEATURES(P, n, vmax) 2: d+- the vector of possible integer distances from 0 to 500 3: Jmin +- I Jmax + idl while Jmax > jmin do > binary search for the best distance in d Jmid +- [ j Q +- SELECTFEATURESGREEDY(P, djmid) 4: 5: 6: J 7: 8: if IQI= n then 9: return Q 11: else if IQ1 < n then Jmin < Jmid + 1 12: else 10: 13: Jmax - Jmi - > search upper subarray 1 > search lower subarray return SELECTFEATURESGREEDY(, djAmin) 14: This is essentially the same as the original idea of setting dmin based on the area of the viewpoint, as it automatically adapts to the area of the training images. Since vmax is not affected by the total area of the object, we can keep it constant for all viewpoints. This feature selection method accomplishes the goal of choosing features that are evenly spread over the whole area of the visible view, while only choosing reliable, low-variance features. We now have the machinery necessary to learn a viewpoint bin model. The LEARNVIEWPOINT-BIN-MODEL procedure takes around 18 to 20 seconds to run on a single CPU core. As mentioned, the call to ENUMERATE-FEATURES on line 2 takes the majority of the time, about 15 seconds. Sorting the features takes a negligible amount of time, and the remainder of the time is spent selecting features. 4.1.4 Combining Viewpoint Bin Models The process of learning a full object model is a matter of learning each viewpoint bin model. Line 3 in the LEARN procedure ensures that there is a viewpoint bin that 80 Procedure 9 Learn a new viewpoint bin model. Input: 3D mesh model M, viewpoint bin B, number of images to render n, synthetic image width w, receptive field radius for visual parts rv, receptive field radius for depth parts rD, desired number of visual parts nV, maximum visual part variance WVmax, desired number of depth parts nD, maximum depth part variance VDmax Output: a set of visual parts V', a set of depth parts V' 1: procedure LEARNVIEWPOINTBINMODEL(M, B, n, w, rv, rD, nv, vv max, nD, VD 2: V, D +- ENUMERATE-FEATURES(M, B, n, w) 3: V +- V sorted by variance, increasing 4: D +- D sorted by variance, increasing 5: V' + SELECTFEATURES(V, n VV max) 6: 7: max) 79' + SELECT-FEATURES(D, nD, VD max) return V', 7' Procedure 10 Learn a full object model. Input: 3D mesh model M, number of images to render n, synthetic image width w, receptive field radius for visual parts rv, receptive field radius for depth parts rD, desired number of visual parts ny, maximum visual part variance WVmax, desired number of depth parts nD, maximum depth part variance VDmax, rotational bin width re, z-symmetry angle s, Output: a view-based model M 1: procedure LEARN(M, n, w, rv, r,D, n, IVmax, nD, VDmax, 2: M <- {} to 30, step by r,, do 3: for r, = rw rw, s) [9g 4: 5: 6: 7: for ry = -90 to 0, step by rw do for rz = 0 to sz, step by r. do B - -rwr - -rL,. - -r,),(,,, + -r-,r+ -rwr + -rV +- LEARN)VIEWPOINTBINMODEL(M, B, n, w, +-((r,, rv, rD, 8: 9: M <- M U{(B, V)} return M 81 ,vVmax, nD, VDmax) is centered on the upright image plane rotation angle (r. = 0), which is the most common object rotation in the images we used in our experiments. The angle ranges in lines 3-5 are chosen to be an upper bound of the maximum and minimum rotations found in the images we used in our experiments. The z-symmetry angle s2 represents the symmetry of an object about a vertical axis perpendicular to the table. A can would have s2 would have s2 0, a cereal box would have sz = 180, and an asymmetric object = = 360. The rotational bin width parameter rz is chosen empirically in chapter 6. When rz = 20 and sz = 360, the resulting view-based model contains 360 viewpoint bin models. In practice, we parallelize the LEARN procedure by distributing each call to LEARN_VIEWPOINTBINMODEL (line 7) to different cores and different physical machines in the cloud using OpenMPI. 4.2 High Level Learning Procedure Input: An instance of the object, a way to acquire a 3D mesh, an RGB-D camera with the ability to measure camera height and pitch angle Output: A view-based model of the object, composed of a set of viewpoint bin models, each with visual and depth parts The procedure to learn a new view-based model is depicted in figure 1-2. The steps involved are: 1. Collect RGB-D images of the object, along with information about the camera pose, and partition the images into a test and hold-out set (section 1.1.1.2). 2. Label the images with bounding boxes (section 1.1.1.2). 3. Acquire a 3D mesh of the object (section 1.1.1.2). 4. Thne learning parameters while testing the view-based models on the test images (section 4.2.1). 5. Evaluate the accuracy of the view-based models (section 1.1.1.2). 82 kinect - ---------------------- Figure 4-6: At the time each RGB-D image was captured, we also record the camera's height above the table and its pitch angle from the joint angles and torso height of the PR2 robot. 83 4.2.1 Tuning Parameters Input: * results of evaluating the view-based object detector * a sample of correct and incorrect detections from the view-based model learned from the previous parameter settings Output: a new set of parameter values that should improve the accuracy of the view-based model The parameters to LEARN (procedure 10) reveal the following parameters that must be set: " number of images to render n * desired number of visual parts nV " desired number of depth parts nD " receptive field radius for visual parts rv " receptive field radius for depth parts rD " maximum visual part variance Vymax " maximum depth part variance VDmax * rotational bin width r,, " synthetic image width w A parameter not explicitly mentioned in the pseudo-code above is the amount of ambient lighting in the synthetic images, and a related parameter is the threshold that defines the minimum edge detection probability that will be counted as an edge. Another important pair of parameters is the tolerances on the camera height ht.i and the 84 camera pitch angle rtoi-these parameters are used to constrain the search space during detection and will be discussed more in chapter 5. We experiment with independently varying most of these parameters (n, ny, nD1 rV, rD, Vmax, VDmax, rw, htol, rtol, edge threshold) in chapter 6. It would be computationally infeasible to test all combinations of parameter settings to automatically find the best-performing values. Taking advantage of the high parallelism available on GPUs would enable significantly more parameter settings to be tested, however, finding the optimal combination of settings for all variables remains a hard problem. We therefore wish to explicitly mention the importance of human training and intuition in guiding the parameter setting process. This is admittedly the most difficult part of the process, as it requires a fair amount of experience. Chapter 6 helps to provide some intuition of how accuracy may be affected by varying any one of these parameters, but the real process involves some careful inspection of detections, an understanding of how the view-based model is affected by the parameters, and some critical thinking. 85 - - -r . .. - .... - - s .- ..-.. .....- ...... . . . ... . . ... - - .. s. . . --- - - -. - . - .- . . -. -- - .-- nir ,. ~ .. -- - .--,. - -. ..--. , . - - --. -. . . . . 86 Chapter 5 Detection Input: " An RGB-D image along with the camera height and the camera pitch angle " A view-based model Output: A sequence of detected poses, ordered by decreasing probability that the object is at each pose The detection algorithm consists of five steps: 1. detect visual features (section 5.1) 2. pre-process the image to create low-dimensional tables to quickly access the 6D search space (section 5.2) 3. run the branch-and-bound search to generate a list of detections (section 5.3) 4. remove detections that are not local maxima to remove many redundant detections that are only slight variations of each other (section 5.4) 5.1 Detecting Features Input: an RGB-D image 87 Output: a binary image of the same dimensions, for each kind of visual feature The depth measurement at each pixel from the RGB-D camera is converted from Euclidean space into a depth feature measuring the distance to the nearest surface from the focal point of the camera along a 3D along a ray passing through the pixel. In general, visual features are any kind of feature whose uncertainty over position can be represented as being restricted to the image plane, as defined by equations 3.3 and 3.4. This could include binary texture detection. In this thesis, visual features are restricted to edges in the image. We use the edge detector of Ren and Bo [28], which uses the depth channel in addition the RGB channel, and outputs a probability for 8 different edge directions at every pixel. We then use the minimum edge probability threshold parameter to change these probability tables into 8 binary images, one for each edge direction. The running time for this edge detector is several minutes on a single CPU, which is obviously impractical for most situations in which a detection is needed in seconds or less. For practical situations, we would use a GPU-accelerated implementation of this edge detector, or a simpler edge detector such as the one by Canny [3]. We use a single threshold to change the real-valued probability into a binary decision that determines whether the edge is present or absent. It is also possible to introduce a second threshold such that each edge detection could have three possible values: absent, weak or strong. To preserve the re-usability of the pre-processed distance transforms and 2D summed area tables described in section 5.2, 3-valued visual features must be implemented by introducing a new kind of visual feature for each edge direction, effectively doubling the amount of memory required to store the pre-processed image. As the number of thresholds increase (the limit is real-valued visual features), the size of the pre-processed image would increase proportionally. We have implemented 3-valued visual features, but we have not yet performed the experiments to determine the potential gain in accuracy. 88 5.2 Pre-Processing Features Input: a depth image and a binary image of the same dimensions, for each kind of visual feature Output: * a 3D summed area table computed from the depth image " a 2D summed area table computed from each kind of visual feature * a 2D distance transform computed from each kind of visual feature The detection algorithm searches for detections in decreasing order of probability. We pre-process the visual and depth features in order to make this search more efficient. A preprocessed image I is a 6-tuple < T, S, DT, Z, re, h >. T is a vector of k 2D summed area tables where k is the number of different kind of visual features (usually 8). S is a 3D summed area table that is computed from the depth image. DT is a vector of k distance transforms (one for each kind of visual feature), Z is the depth image from the original RGB-D image, r, is the pitch angle of the camera and h is the height of the camera above the table. To explain this pre-processing step, we begin by transforming the objective probability optimization equation 3.2 (reprinted below) to show how it can be computed as a Hough transform. We then describe how a detection is evaluated at a specific point using a procedure called EVAL and its two sub-procedures EVALDEPTH and EVALVISUAL. In a Hough transform, each part casts probabilistic "votes" for where it thinks the object might be. The votes are summed at each location, and the location with the highest total sum is the most likely location for the object. However the optimization objective (equation 3.2) takes the product of the inputs from each part, rather than summing them. To fix this, we can change the products to sums by taking the log. Moreover, Crandall et al. [5] noted that we can move each max over the locations of the depth parts dj and visual parts Vk close to its probability distribution (which is a form of dynamic programming). 89 These two modifications allow us to rewrite the optimization objective into a Hough transform: argmax max Pr(V = 'I, p = d,P = p) = argmax max 7Pr(Dj = djIP = p) p id i7,d 1 Pr(V = Vk|P = p) k (3.2 revisited) = argmax p HDj(p) + ZHvk(p), i (5.1) k where the Hough votes for depth parts HDj and visual parts Hvk are a functions of the pose p: HDj(p) = maxlogPr(Dj = djlP = p) (5.2) dj Hvk(p) = maxlogPr(Vk=vk|P=p). (5.3) Vk Equation 5.1 makes it clear that we are simply summing up Hough votes from each part. The votes are tallied over the 6-dimensional Hough space of object poses. Before describing the EVALDEPTH procedure to evaluate HDj (p), we note that we can drop the maxdj operator from equation 5.2, since there is only one valid depth measurement from the depth image that matches a particular feature Dj when the object is at pose p. HDj (p) = log Pr(Dj = djP = p). (5.4) The depth measurement dj is read from the depth image at a pixel location that is a function of the the pose. This simplification makes the EVALDEPTH procedure straightforward to compute: Figure 5-1 shows a visualization of equation 5.1 for a 1D Hough space and an object with only 3 visual parts. Figure 5-2 extends figure 5-1 by adding a rotational dimension, so that the Hough vote space is two-dimensional. Figure 5-3 also ex- tends figure 5-1 to form a two-dimensional Hough space-this figure adds the scale 90 Procedure 11 Evaluates a depth part in an image at a particular pose for a Hough vote HDj according to equation 5.2. Input: pose p, depth part D, a preprocessed image I Output: the log probability of the depth part D for an object at pose p in the image 1J 2: procedure EVAL-DEPTH(p, D, 1) (x, y, z,rx, ry, rz) +- p 3: +- linear coefficients of D 1: 4: [ +- z+[r, m 5: [, 6: [X] +- the r > From equation 3.7 r] pixel location of D if the pose was at (x, y, z) = (0, 0, 1) 7: > From equation 3.6 d +- the depth at location (round(p'),round(p',)) in the depth image of I 8: v +- the variance of depth part D 9: if d is defined then 10: 11: 12: 13: 14: +- [>] r +- the receptive field radius of D return els e 2 min((d-md) ,r2) > From equation 3.8 d <- default depth difference of D' when undefined return - d 91 > From equation 3.8 object model (3 edge parts) input image (D) IE feature detections I / Hough transform votes (translated for each part) sum of Hough transform votes best detection ' q, I \ I/ / original image Figure 5-1: A one-dimensional illustration of Hough transforms for visual parts. The object model is comprised of three different visual parts, which are matched to features detected in the 1D input image. Each feature detection casts a vote for where it thinks the center of the object is-locations with more certainty receive more weight. Recall from equation 3.5 that we represent the distribution over visual part positions as normal with a receptive field radius. The Hough votes are the log of the normal distribution, which is a parabola (also seen in figure 3-5). Notice how the votes are shifted horizontally to account for the offset between the expected part position and the center of the object. The shape of the votes is defined by the visual part min((x-m)2,r2) (equation 3.5 in one dimension). The 2v distribution Pr(V = xIP = p) oc ebest detection is the global maximum of the sum of the votes (equation 5.1), and which we can see is indeed where the object is found in the original image. 92 object model (3 edge parts) input image (D) 1W p feature detections Hough transform votes (with rotation and translation) I sum of Hough transform votes I best detection \ 2 \ -/ / original image \ nd best detection A Figure 5-2: This adds a rotation dimension to figure 5-1. The Hough transform votes are two dimensional, so the darkness of the Hough transform vote images indicates the weight of the vote at that location in the space of poses. A horizontal cross section of these 2D Hough transform votes is a shifted version of the ID Hough transform vote depicted in figure 5-1. Since we use a small angle approximation for rotation, the reader will notice that the shift in the Hough votes is a linear function of the rotation angle (the vertical axis). The sum of the Hough transform votes are rendered using a contour map. In this image, we can also see that the maximum in the sum of distance transforms occurs at the place we would expect (with no rotation). The second best detection would occur when the object undergoes some rotation to put the blue feature further to the right. 93 object model (3 edge parts) L / input image (1D) feature detections mow Hough transform votes (with scaling and translation) s im of Hough transform votes original image I best detection / \ / /\/ 2nd best detection Figure 5-3: This adds a scale dimension to figure 5-1. A horizontal cross section of these 2D Hough transform votes are related to the 1D Hough transform vote depicted in figure 5-1: The parabolas are widened as the scale increases, since the receptive field radius is also changed with scale. In addition, the entire 1D image is shifted. The sum of the Hough transform votes are rendered using a contour map. In this image, we can also see that the maximum in the sum of distance transforms occurs at the place we would expect. dimension. Unlike depth features, which (when defined) are always at a known pixel in the image plane, visual features may be found anywhere in the image plane. We are interested in finding the nearest visual feature to the expected location of a visual part. A naive algorithm would search for the nearest visual feature every time we need to evaluate a new visual part. However, we can pre-compute this information in a distance transform (introduced previously in section 4.1.2). Figure 5-4 shows an application of Hough transforms to optical character recognition with a two dimensional Hough space, motivating the use of distance transforms. 94 Each Hough transform vote from a visual feature Hvk is a clipped and translated distance transform. The values of the distance transforms are clipped according to the receptive field radius of the visual part and the entire distance transform is translated by the expected location of the part relative to the center of the object model. Note that there are two kinds of edge parts used in the 4-part model in this figure, two of each edge angle. In general, the distance transforms for visual parts of the same kind are the same, but they are shifted according to the expected locations of the respective parts. Figure 5-5 illustrates the concept of Hough transforms for depth parts in two dimensions. We cannot render Hough votes for depth parts with rotation on a page, but the idea is similar to 5-2. Before we begin searching for an object in a new image, we always pre-compute a distance transform for each kind of visual feature. Since we use edges with 8 different directions as visual features in this thesis, we pre-compute 8 distance transforms for an image. These distance transforms are then re-used for every visual part of the same kind. This enables us to quickly evaluate the Hough vote for a visual feature (equation 5.3) at any point in pose space, as shown in the EVALVISUAL procedure. Now that we have procedures for evaluating the Hough votes from both depth and visual parts, we sum these votes (as in equation 5.1) using a procedure called EVAL. Distance transforms allow us to quickly evaluate a viewpoint bin model at any pose. But evaluation is not enough-in order to perform branch-and-bound search, we need a way to bound a model over a region of poses. We discuss the details of our bounding method in section 5.3.2. However, we will briefly describe another kind of data structure that we pre-compute in order to speed up the bounding procedure: summed area tables. Summed area tables (also known as integral images), give a fast (constant time) way to compute the sum of a rectangular range of a matrix. They were first used in computer graphics by Crow [6] and popularized in computer vision by Viola and Jones [36]. Summed area tables can be computed from a matrix of any number of dimensions, but we use 2D and 3D summed area tables. 95 r - - - - - - - - - -- object model (4 edge parts, origin at '+') X input image (2D) ----------------------------I r - - - - - - - - I trnfr voe atsepce - r- N oain - - - - - - - - -- I IYZ ' 4- best detection I-I Houg usdtieIrnltdfrec deeto (each fetr - I J - - J - feature detections i (2 edge directions) ------------ - XYZ | I I - Ls - fHg--------- original image sum of Hough transform votes Figure 5-4: Recognizing the letter 'X' in an image using visual edge features. We are searching the 2D space of poses in the image plane (without depth or scale), as opposed to the one dimensional pose space used in figure 5-1. The object model is made up of four visual edge parts, two of which have one diagonal angle and the other two have a different diagonal angle. The feature detections are used to make Hough votes using the visual edge feature distribution (equation 3.5). Note that the log of ) 2 ) is a squared distance this distribution log Pr(V = AIP) oc min((i - M)T(I function, clipped by the min() operator for distances greater than r. For this reason, these Hough votes are referred to as distance transforms. The distance transforms are translated such that each feature votes for where it thinks the center should be. Adding up the votes from each feature yields an image with the best detection at the center of the letter 'X' in the original image, as we would expect. 96 object model (3 depth parts, origin at '+') S!. ~i77 input depth image (in world space) i A 0*1* 2 nd best detection '*'* 14 Hough transform votes (in world space) '-"\ r best detection original image 7 ~- sum of Hough transform votes Figure 5-5: A two-dimensional slice of Hough votes and their sum. The object model is designed specifically to detect the rotated rectangle near the center of the scene. It consists of three depth parts located on the surfaces of the rectangle that should be visible to the camera at that angle. The three Hough votes are derived from the input depth image-the darkness indicates the weight of the vote at that pose in the space. The sum of the Hough votes reveals that the best detection is where we would expect. The second best detection is another rectangular part of an object which has a similar rotation with respect to the tangent of the circle centered at the focal point of the camera. All graphics are rendered in Euclidean world coordinates. 97 Procedure 12 Evaluates a visual part in an image at a particular pose for a Hough vote Hvk according to equation 5.3. Input: pose p, visual part V, a preprocessed image I Output: the log probability of the visual part V for an object at pose p in the image I 1: procedure EVALVISUAL(p, V, I) 2: (x, y, z, rx, ry, rz) +-p a b linear coefficients of V +- 3: .g h. <- From equation 3.3 [] a b' 4: [ 5: [ 6: k +- the index of the kind of visual part V rx rvrz] Y e[ cd From equation 3.4 z 7: d <- the kth distance transform in the preprocessed image I at pixel location (round(px), round(py)) 8: r' +- the receptive field radius of V 9: 10: v +- the variance of visual part V return - 2 2 min(dz 2v ,r ) From equation 3.5 Procedure 13 Evaluates an object model in an image at a particular pose according to equation 5.1. Input: pose p, viewpoint bin model M, a preprocessed image I Output: the log probability of the viewpoint bin model M for at pose p in the image I 1: procedure EVAL(p, M, I) 2: (V, 1) +-M > visual features V and depth features D of M 3: return ZEVAL-DEPTH(p, j5,1)+EkEVAL-VISUAL(p,1 k, I) 98 For an m x n matrix M, a summed area table S can be pre-computed in time Then it can be used to answer queries linear in the number of entries O(mn). SUM2D(Sx, x2 , 1, 2) = EV M, in constant time, that is, the running time is independent of the size of the query region. We pre-compute a 2D summed area table for each kind of visual feature (typically 8 edge directions), so that we can quickly determine if there are 1 or more visual features of a particular kind within a rectangular bounding region. We also pre-compute a 2D summed area table U for a binary image the same size as the original RGB-D image, whose entries are 1 where the depth is undefined. This allows us to quickly determine if there are any missing depth values in a rectangle. For an 1 x m x n matrix N, a summed area table T can be pre-computed in time linear in the number of entries O(lmn). Then it can be used to answer queries SUM-3D(TX, 2 ,,X 2 Nx,y,z in constant time that is , zZ 2) = independent of the size of the query region. We compute a 3D summed area table for depth images by discretizing depth into equally-spaced intervals of 5 centimeters. We transform the m x n real-valued depth image Z into a 1 x m x n matrix N such that Nx,y,2 is 1 if Zxv falls into the zth depth interval (and 0, otherwise). Pre-computing the 8 distance transforms, 9 2D summed area tables and the 3D summed area table takes less than 1 second on a single CPU. The last two components of a preprocessed image, r, and h, are two scalar numbers that provide contextual information about the camera's pose. They help us to constrain the search space to a horizontal surface such as a table top. r, measures the pitch angle of the camera, which is changed by pointing the camera further up or down with respect to the ground plane (we assume that the camera never rotates in the image plane-in other words, the roll angle is always zero). h measures the height of the camera above the table. 99 5.3 Branch-and-Bound Search Input: a pre-processed image, a view-based object model Output: the sequence of detections (i.e. points in pose space) sorted in descending order of probability that the object is located there After pre-processing an image, we now turn to the business of searching for the object. This step is the most time-consuming part of the process, so we try to take extra care to ensure that the most frequently called sub-procedures are efficient. Section 5.3.1 describes how we branch regions into sub-regions, section 5.3.2 discusses the method we use to bound the probability of a detection over regions of pose space and section 5.3.5 puts these parts together to form the branch-and-bound search procedure. 5.3.1 Branching We define a hypothesis region R to be an axis-aligned bounding box in the space of poses. As the search progresses, the priority queue will store the current collection of working hypothesis regions of pose space. Formally, a hypothesis region is an index m into the vector of viewpoint bin models and a pair of poses (M, Pi, P2) where pi = (xi, yi, z1 , rzi, ry1 , rZI) and > r1, ry 2 (x 2 , Y2, Z2 , rx2, ry 2 , rz2) such that x 2 ry,1 and rz2 > rzi. A pose (x, y, z, rX, ry, rz) of viewpoint bin model m is in hypothesis region (rn,p1 ,p 2 ) iff X1 < x rXi < rX < X 1 , Y2 > Yi, rx2 , ry1 < r. < ry 2 and ri x 2 , Y1 < y < Y2, Z1 < Z < Z 2 , z 2 > zi, rx 2 P2 = < rz < rz2. The BRANCH(R) procedure partitions a hypothesis region into 64 smaller hypothesis regions, all of equal size. Each of the 6 dimensions are split in half. BRANCH returns the set of all 26 = 64 combinations of upper and lower halves for each dimension. We omit the pseudocode for this procedure for brevity. 5.3.2 Bounding In this section, we will develop a function b(R) that gives an upper bound on the log-probability of finding the object within a hypothesis region 7? of the space of 100 poses. b(1Z) > max HDj (p) + 1 j Hy,(p) (5.5) k The design of this function is the critical bottleneck that determines the running time of the whole detection procedure. There are two important aspects to the bounding procedure: its running time and its accuracy or tightness. The bounding procedure is in the "inner loop" of the branch-and-bound search-it is called every time a region is evaluated, so the bounding function must be efficient. On the other hand, the bounding procedure should be accurate. That is, the bounding function b(1Z) should be tight, that is, it should be nearly equal to the right hand side of inequality 5.5. The closer b(1Z) is to the true maximum value, the faster the search will narrow in on the true global maximum. For illustration, consider two straw-man bounding functions UNINFORMATIVEBOUND and BRUTEFORCE-BOUND. Procedure 14 An uninformative design for a bounding function. Input: hypothesis region 7?, view-based object model 0, a preprocessed image I Output: an upper bound on the log probability of ( being located in 7? in the image 1 0, > The maximum possible return value of EVAL 2: return 0 ) 1: procedure UNINFORMATIVEBOUND(R, Procedure 15 A brute-force design for a bounding function. Input: hypothesis region R, view-based object model 0, a preprocessed image I Output: an upper bound on the log probability of 0 being located in R in the image I 1: procedure BRUTEFORCE-BOUND(R, 0,I) 2: m +- the index of the viewpoint bin model for region R 3: (M, B) <- Om > viewpoint bin B for viewpoint bin model M 4: (V, b) +- M visual features V and depth features D . +- -00 5: Vmax 6: 7: for all p in a minimum-resolution grid in R do Vmax +- max(vmax, EVAL(p, M, I)) 8: return Vmax UNINFORMATIVE-BOUND is extremely fast to evaluate, but it is completely useless- if we used it, the branch-and-bound search would never eliminate any branches. It 101 would only terminate after evaluating every pose in the full search space, which cancels out any benefit that could come from trying to use branch-and-bound search to speed up detection. BRUTEFORCEBOUND, on the other hand, is perfectly accurate-it is always pre- cisely equal to the right hand side of inequality 5.5. However, the procedure explicitly evaluates every point in the minimum-resolution grid in the hypothesis region R by brute force. Clearly, this bounding function is far too slow to be practical. These bounding functions are extreme examples-they lie at opposite ends of the trade-off between speed and accuracy. A good design for a bounding function should be reasonably fast to compute, but also reasonably tight. To design a bounding function, we first observe that if we let P= argmaxp, Z HDj(p) + 1 Hvk(p), k then since max HDj(p) HDj(p*) pER and maxHvk(p) > Hk(p*), the following inequality must hold: max HDj(p) + E maxHvk(P) > max PE k HDj(p)+ PRERk/ Hvk(P) . (5.6) This allows us to break the bounding function into parts: b(1Z) = bR) + 5 bVk(P-), (5.7) k where, because of the observation in inequality 5.6, we require the part bounding 102 ------- object model (3 edge parts) input image (1D) feature detections p low I Hough transform votes (translated I for each part) <_ z sum of Hough transform votes best detection ' q original image I \I|/ / /\/ Figure 5-6: To find an upper bound on the maximum value of the sum of the Hough transform votes in a region (the bottom red range), we take the sum of the maximums for each part in that region (the top 3 red ranges). functions bDj and bVk to satisfy the following: > max HDj(p) (5.8) > max Hvk(P). (5.9) ) bDj( bvk(7Z) PER PER To illustrate this idea for visual features, figure 5-6 shows how finding the maximum within a region of the sum of Hough transform votes can be bounded by the sum of the maximums of each of the Hough transform votes in the same region, and figures 5-7 and 5-8 apply this to Hough transform spaces with rotation and scale, respectively. Figure 5-9 then applies this same principle to 2D optical character recognition. Similarly, to illustrate this idea for depth features, figure 5-10 shows how finding 103 I--------------- object model (3 edge parts) :\ It/'! L input image (D) Hough transform votes (with rotation and translation) sum of Hough transform votes original image p p p IE feature detections low IT MJ IT IT IT I best detection \/\/ 2 nd best detection Figure 5-7: To find an upper bound on the maximum value of the sum of the Hough transform votes in a region (the bottom red rectangle), we take the sum of the maximums for each part in that region (the top 3 red rectangles). 104 object model (3 edge parts) L--------------- input image (D) p p feature p detections Hough transform votes (with scaling and translation) s am of Hough transform votes original image I best detectioUn+ -4 + ~ 2 nd best detection A\ Figure 5-8: To find an upper bound on the maximum value of the sum of the Hough transform votes in a region (the bottom red rectangle), we take the sum of the maximums for each part in that region (the top 3 red rectangles). 105 object model (4 edge parts, origin at '+') X input image (2D) XY Z r -F feature detections (2 edge I directions) -'- - - A: "Bounded by the sum of the maxes of the Hough transform votes in these regions." region?" this in max the is Q: "What r -;-I I IY LIL Z original image sum of Hough transform votes Figure 5-9: To find an upper bound on the maximum value of the sum of the Hough transform votes in a region (the bottom red rectangle), we take the sum of the maximums for each part in that region (the 4 red rectangles on the distance transforms). Note that the sum of the maximums for each part will be greater than the true maximum of the sum of the Hough votes in that region because the maximums for each part do not occur at the same location in the space of poses. We also show that, because the distance transforms are shifted, the query regions are not aligned in the binary feature detection images (even for parts of the same edge direction). 106 40 // / object model (3 depth parts, origin at '+') input depth image (in world space) I A 0*1* 10 A*'* Hough transform votes (in world space) 2 nd best detection 10\ best detection original image 7 0' rr- sum of Hough transform votes Figure 5-10: To find an upper-bound on the maximum value of the sum of the Hough transform votes in a region (bottom red region), we take the sum of the maximums for each part in the region (the top 3 red regions). the maximum within a region of the sum of Hough transforms can be bounded by the sum of the maximums of each of the Hough transform votes in the same region. In order to compute a bound on the vote for a particular visual feature bvk(R), we need to transform the Hough votes so that they are in alignment with the original image (and the data structures described in section 5.2). In figure 5-11, we shift the Hough transform votes into alignment with the feature detections in the original image coordinates (compare with figure 5-6). Since the shifted regions are also shifted into image coordinates, we can use them to access the pre-processed summed area tables. Recall that horizontal cross-sections of the Hough votes in figures 5-2 and 5-3 are the same as the Hough votes in figure 5-1, except that they are shifted horizontally. For the same reason, we can see that the bounding region must be sheared from a 107 - -- --- -- L - - - - :\ Il/i - object model (3 edge parts) input image (D) I low feature low detections Hough transform votes (translated for each y part) sum of Hough transform votes bs dtnLU best detection q/ T original image Figure 5-11: In figure 5-6, the Hough transform votes were horizontally translated according to the expected location of the part. In this figure, we translate the Hough transform votes in order to find which regions in the original image could contribute. to maximum value in the bottom red query region. This enables us to access the correct regions of the pre-processed tables for the image. 108 object model (3 edge parts) input image (D) 60 feature detections p p p p Hough transform votes (with rotation and translation) sum of Hough transform votes best detection original image I 2 nd / / \ best detection /1 Figure 5-12: In figure 5-7, the Hough transform votes were aligned with the Hough transform voting space. In this figure, we shear the Hough transform votes so that they are aligned to the original image coordinates. We also shear the bounding regions in the same way so that they change from rectangles to parallelograms. The extent of each parallelogram is then projected onto the binary feature detection image. rectangle to a parallelogram in order to be aligned to the original image coordinates when we add rotation (figure 5-12) or scale (figure 5-13) to the pose space. Figure 5-14 shows how the depth measurements are warped from the Euclidean coordinates into the coordinate system in which the depth is measured from the focal point of the camera through the image plane to the nearest surface. These transformations are the key insight that enables us to compress the 6D Hough transform space down to lower-dimensional data structures, which allows us to design fast bounding functions. The following procedure, BOUND-VISUAL, provides our implementation of Bvk that satisfies inequality 5.9 (for a proof, see appendix A.1). 109 In essence, it does this --------------- object model (3 edge parts) \I/ I input imagE (D) 'Ro feature detections Hough transform votes (with scaling and translation )h gI a II aI sum of Hou transform votes original ima ge ~Ch I best de tection7J \ // 2 nd best detection Figure 5-13: In figure 5-8, the Hough transform votes were aligned with the Hough transform voting space. In this figure, we shear the Hough transform votes so that they are aligned to the original image coordinates. We also shear the bounding regions in the same way so that they change from rectangles to parallelograms. The extent of each parallelogram is then projected onto the binary feature detection image. 110 object model (3 depth parts, origin at '+') 8!1 input depth image (in w arped space) summed area table Hough transform votes (in warped space) original image 7" 144 sum of Hough transform votes Figure 5-14: In figure 5-10, the Hough transform votes were rendered in the Euclidean world coordinates. In this figure, we warp the coordinates such that the rays from the focal point of the camera are parallel. In this view, the reader can see that the Hough votes are all simply translated versions of the same image, and that the red regions become parallelograms. This warping transform aligns all of the votes to the summed area table, which is coarsely discretized (with 5 cm increments) in the vertical (depth) dimension. 111 by projecting the six-dimensional hypothesis region R onto a two-dimensional region of the image plane, such that if a visual feature was found in that region there would be a pose in R that would match that visual feature within the receptive field radius of the visual part. Procedure 16 Calculate an upper bound on the log probability of a visual part for poses within a hypothesis region. Input: hypothesis region R, visual part V, preprocessed image I Output: an upper bound on the log probability of visual part V for an object located in 7Z in the image I 1: procedure BOUNDVISUAL(Z, V,1) (mn, (xi, y1, zi, rxi, ry1, r'zi), (X2, 2: Y2, Z2, rx2, ry2, rz2)) < a b~ 3: e <Ig h]. linear coefficients of V 12: min- a + min(crxi, crx2) + min(ery1, ery2 ) + min(grzi, grz2 Xmin <- [xi + min(xz'i/zi, f'mi/z2)J X'ax <-a + max(crxi, crx2) + max(eryi, ery2 ) + max(grzi, 9rz2) Xmax +- FX2 + max(x'ax/zi, X'ax/Z2)l ymin - b + min(drxi, dr 2 ) + min(fry1, fry 2 ) + min(hrzi, hrz 2 ymin - Ly1 + min(ymin/zi, ym-n/z 2)J Y'ax b + max(drx1, drxx-2 ) + max(fry1, fry 2 ) + max(hrzi, hrz 2 Ymax +- [y2 + max(y' ax/zi, y'ax/z 2 )] r' <- receptive field radius of V 13: rmax +- I 14: 15: k +- index of the kind of visual part V S +- the kth summed area table for visual features of kind k in I 16: if SUM_2D(S, Xmin ) 4: 5: 6: 7: ) 8: 9: ) 10: 11: 17: - rminax, Xmax + rmax, Ymin - rmax, Ymax return 0 18: 19: + rmax)> 0 then > The maximum value of equation 3.5 else v +- variance of V return -1 20: > The minimum value of equation 3.5 Note that the BOUND-VISUAL procedure can only return two different log proba- bilities: the maximum log probability for the part, 0, or the minimum log probability for the part, - 1. The procedure returns the maximum value when it is possible that a visual feature is a distance of less than or equal to the receptive field radius from the hypothesis region. It returns the minimum value for the part when it is certain that there are no visual features within the receptive field radius of the hypothesis region. 112 Recall that suM_2D(S, ... ) (line 16) returns the total number of visual features in the region. Notice that the region is expanded by the receptive field radius rm,. We illustrate this expanded region in figure 5-15, in which the hypothesis regions aligned to image coordinates are expanded by the receptive field radius. There is a minor complication with this strategy that comes when we consider adding scale to the pose space-as the scale changes, the receptive field also changes. In this case, we simply choose the maximum radius over the range of scales, as depicted in figure 5-16. BOUND-VISUAL is computed using a small constant number of operations that is independent of the size of the region IZ, which means that it is very efficient to compute. However, it is a somewhat loose bound. Part of the bound inaccuracy stems from the fact that it can only return two different values: the maximum (line 17) or the minimum (line 20) value of the Hough vote for a part. When the size of the hypothesis region is small compared to the receptive field radius, the projected expanded hypothesis region is relatively large, so that most of the visual features will be found in the expanded receptive field radius region (the green shaded regions in figure 5-15), rather than the projected region. When the visual feature is in the expanded receptive field radius region and not in the projected region, it will certainly contribute a vote that is less than the maximum possible value, so the maximum value (returned on line 17) will be an overestimate. The following two procedures, BRUTE FORCE-BOUNDDEPTH and BOUNDDEPTH provide our implementation of BDk that satisfies inequality 5.8 (for a proof, see appendix A.2). It does this by projecting the six-dimensional hypothesis region R onto a three-dimensional region in the warped coordinates, such that if a depth measurement was found in that region, there would be a pose in 7Z that would match that depth feature within the receptive field radius of the depth part. We note that lines 13-15 of BOUND-DEPTH could be omitted (and the entire BRUTEFORCEBOUNDDEPTH procedure could be eliminated). However, we include these lines because they make the search run faster. If BRUTEFORCEBOUND-DEPTH is omitted, BOUND-DEPTH can return only three different values: the maximum value for the part, 0, the minimum value for the part, 113 --------- object model (3 edge parts) L / I input image (D) feature detections - I / Hough transform votes (translated for each part) -M sum of Hough transform votes best detection original image Figure 5-15: We expand the bounding regions in the Hough transform votes by adding the receptive field radius to both sides of each region. Using summed area tables, we can efficiently count the number of feature detections in each region. From this diagram, we can use BOUNDVISUAL to bound each part: the projected expanded hypothesis region for the red diagonal part contains one feature detection, so it receives the maximum bound: 0. The projected expanded hypothesis region for the green vertical part contains no feature detections, so it receives the minimum bound: - .2 The projected expanded hypothesis region for the blue diagonal part contains one feature detection, so it receives the maximum bound: 0. 114 -------- object model (3 edge parts) input image (D) pq feature detections Hough transform votes (with scaling and translation) sum of Hough transform votes original image I^ I best detection / \ / /\/ 2nd best detection Figure 5-16: In this figure, we can see that the receptive field radius changes with scale. We therefore use the maximum receptive field radius to determine the extent of the expanded hypothesis region when it is projected on to image coordinates. 115 Procedure 17 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region by brute force. Input: the depth image Z, the receptive field radius r, the default depth difference when depth is missing d, a bounding box (Xmin, Xmax, Ymin, Ymax, Zmin, Zmax) Output: an upper bound on the log probability for a depth part in the bounding box 1: procedure BRUTEFORCEBOUNDDEPTH(Z, r, d, Xmin, Xmax, Ymin, Ymax, Zmin, Zmax) 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: dmin +- 00 for X = Xmin to xmax do for Y = Ymin to Ymax do if Zx,y is undefined then dmin <- min (dmiin, d) else if Zmin <; Zx,, < zmax then return 0 else if Zx,y < zmin then dmin <- min(dmin, Zmin - Zx,y) else 12: 13: - dmin - min(dmin, Zx,y - Zmax) return min(d ,r2 ) > From equation 3.8 , and the default value returned when a depth measurement is undefined, - d. It returns the maximum value when it is possible that a depth measurement is within the receptive field radius of the range of depth values associated with the hypothesis region. Otherwise, it returns the default depth if there is an undefined measurement, or the minimum value for the part if all the depth measurements are out of the receptive field radius. Recall that suM-3D(T,...) (line 20) returns the total number of depth measurements that are within the receptive field radius of the region. We illustrate the use of the receptive field radius in figure 5-17, in which the hypothesis regions are expanded by the receptive field radius. This 3D summed area table is constructed using discrete steps of size ZSATstep = 5 cm. If BRUTEFORCEBOUNDDEPTH is omitted, BOUND-DEPTH is computed using a small constant number of operations that is independent of the size of 1Z. However, we found that the 5 cm discretizatiori in the 3D summed area table leads to very loose bounds, especially as the hypothesis region gets small. Since most calls to BOUNDDEPTH are with small hypothesis regions, this increases the overall detection 116 Procedure 18 Calculate an upper bound on the log probability of a depth part for poses within a hypothesis region. Input: hypothesis region R, depth part D, preprocessed image I Output: an upper bound on the log probability of depth part D for an object located in R in the image I 1: procedure BOUNDDEPTH(Z, D, I) (M) (X1,~ y I, rX1,I rylI, rzi), (X2,1Y2,iZ2,Jx2,iry2,irz2)) +- IZ 2: 3: [ 4: (x, y) 5: 6: xmin +- [+1 + min(x/zi, x/z 2 )] 7: 8: 9: ymin +- Lyi + min(y/zi, y/z 2 )] Ymax [y2 + max(y/zi, y/z 2 )1 z' in+ z1 + i + min(jrxi, jrx2 ) + min(kryi, kry2 ) + min(lrz, lrz2) z'nax + z 2 + i + max(jrx,, jrx2 ) + max(kryi, kry2 ) + max(lrzi, lrZ2) d +- default depth difference of D when undefined r' +- receptive field radius of D if (Xmax - Xmin)(Ymax - ym in) < 500 then Z +- the depth image from I 10: 11: 12: 13: 14: +- linear coefficients of D +- Xmax +- the pixel location of D if the pose was at (x, y, z) = (0, 0, 1) [x 2 + max(x/zi, x/z 2 )] return BRUTEFORCEBOUND-DEPTH(Z, r', 15: d, Xmin, Xmax, Ymin, Ymax, Z' in, Z4ax) the summed area table of the undefined depths 16: U 17: r 18: Zmin +- zi-SATmin 19: Zmax Zmax-ZSATmin 20: if sUM-3D(T, Xmin, Xmax, Ymin, Ymax, Zmin - r, Zmax ZSATstep ZSATstep | 21: 22: 23: 24: 25: ZSATstep + r)> 0 then > The maximum value of equation 3.8 return 0 < I V Xmax else if xmin SUM_2D(U, Xmin, Xmax, Ymin, Ymax) > return - L W V Ymin > < I V ymax > hv 0 then > The value of equation 3.8 when the depth is undefined else > The minimum value of equation 3.8 return _ 117 object model (3 depth parts, origin at '+') S. input depth image (in w arped space) summed area table Hough transform votes (in warped space) original image At 145 sum of Hough transform votes Figure 5-17: We expand the bounding regions in the Hough transform votes by adding the receptive field radius to the top and bottom of each region. We then expand each of these regions further such that it is aligned to the 5 cm summed area table increments. Using the summed area table, we can efficiently count the number of depth features in each green region. All three green regions contain at least one depth feature, so BOUNDDEPTH will return the maximum bound for each part: 0 (assuming BRUTEFORCEBOUND-DEPTH is not called). 118 time. To address this issue, we added BRUTEFORCEBOUNDDEPTH. Even though BRUTEFORCEBOUND -DEPTH runs in time that is proportional to the number of pixels in the region, it is a much tighter bound, so it significantly decreases detection time. 1 The following procedure, BOUND combines the part bounds according to equation 5.7. Its correctness follows directly from the proofs the correctness of BOUNDDEPTH and BOUNDVISUAL in appendix A. Procedure 19 Calculate an upper bound on the log probability of an object for poses within a hypothesis region. Input: hypothesis region R, view-based object model 0, a preprocessed image I Output: an upper bound on the log probability of an object 0 located in R in the image I 1: procedure BOUND(, 0,1) 2: m +- the index of the viewpoint bin model for region R 3: (M, B) +- On > viewpoint bin B for viewpoint bin model M 4: (V, 5) +- M > visual features V and depth features D 5: return ZjBOUNDDEPTH(R, 15,1)+ EkBOUND-VISUAL(1Z, Vk, 1) 5.3.3 Initializing the Priority Queue Input: an empty priority queue, and the viewpoint bin sizes of the view-based model Output: the priority queue contains a maximum-size region for each viewpoint bin, each with its appropriate bound A priority queue is an essential data structure in branch-and-bound search. A priority queue contains elements associated with real-valued priorities. We use a heap 'If we had a fast algorithm for computing the maximum or minimum entry in a rectangular sub-region of a 2D table in time that does not depend on the size of the sub-region, the code would be significantly simpler and faster. Such a method could be used to eliminate BRUTE-FORCEBOUNDDEPTH and its slow for-loops, as well as the coarsely-discretized 3D summed area table used by BOUND.DEPTH while improving the bound tightness. Such a method could also be used to give a tighter bound than the 2D summed area tables used by BOUNDNVISUAL. Using this method would make both the depth and visual bounds tighter especially for smaller hypothesis regions, reaching perfect accuracy when the area of the region reaches 0. This would probably lead to a significant improvement in running time since most of the searching work is near the leaves of the search tree in small hypothesis regions. This means that lines 4-11 of BRANCH-ANDBOUND-STEP could replaced with a single line: "D +- D U {1}." It would also shorten the proofs in appendix A and eliminate the need for figures 5-15, 5-16 and 5-17. 119 data structure to implement the priority queue. A heap allows new elements to be efficiently added to the queue, and also allows the highest-priority element to be efficiently found and removed from the priority queue. In our implementation, the elements of the priority queue are hypothesis regions. The priority of the hypothesis region is an upper bound on the (log) probability that the object is located in the region. Our branch-and-bound search implementation starts with the initial hypothesis that any viewpoint bin model of the object could be located anywhere (subject to the current set of constraints, such as whether it is nearly upright on a table top). The images we use for our experiments are 640 x 480 pixels, and we search within a range of distances that can be accurately measured by the Kinect (0.5 to 3 meters from the focal point of the camera). We set the initial ranges for rotation angles to the ranges covered by the viewpoint bin models. Before we start the search process, we compute upper bounds for the initial hypothesis region of each viewpoint bin model (as described in section 5.3.5), and put them on the empty priority queue. Procedure 20 A set of high-level hypotheses used to initialize branch-and-bound search. Input: preprocessed image I, view-based object model 0 Output: A priority queue of hypotheses Q used to initialize branch-and-bound search 1: procedure 2: 3: 4: 5: 6: 7: 8: 9: 5.3.4 INITIAL-PRIORITYQUEUE (1, 0) Q <- empty priority queue for m = 1 to 101 do (M, B) <- Om > viewpoint bin B for viewpoint bin model M ((ri, ry1 , rzi), (rx2, ry 2, rz2)) <- B R +- (m, (0, 0, 0.5, rxi, ry1, rZ), (640, 480, 3, rx2, ry2, rz2)) b +- BOUND(R, 0, I) add R to Q with priority b return Q Constraints On The Search Space In order to speed up the detection time (and improve accuracy by eliminating some false positive detections), we constrain the search space to a region consistent with the 120 object being located on a horizontal surface, such as a table, including some margin of error. Recall that we record two numbers along with each RGB-D image: the height of the camera above the table h and the camera pitch r,. When we assume that the roll angle of the camera is zero, this is enough information to define the unique position of the table plane in space. Constraining the object to rest upright on a table effects three of the variables in the object pose: z, r, and ry. In our choice of constraints, x and y remain unconstrained because we assume table plane extends infinitely in all directions. And r2 remains unconstrained because the object can be rotated while remaining upright, as if on a turn-table. But because of perspective projection, the roll angle rx of the object generally varies by few degrees from 0, even though the camera roll angle is always exactly 0. And the pitch angle ry of the object also changes for different object positions within an image even when r, is fixed. The intrinsic parameters of the camera consist of the focal lengths fx and fy and the center of the image (ci, cy). We ignore radial and tangential distortion. A 3D point in Euclidean coordinates (xe, ye, ze) is projected to the screen coordinate (X, y) by: (5.10) {ey=+ The intrinsic parameters for the Kinect are fx = fy = 525 and (cr, cy) = (319.5, 239.5) because the image size is 640- x 480. When the camera has pitch angle r, and the object is upright on a table, then it has Euler angles: rx = atan2((cx - X) tan(rc), (cy - y) r = sin- 1 sin +Pcos rc)) + fy) (5.11) (5.12) where Px = py X - = fy 121 (5.13) - fX . (5.14) However, since there are some errors due to approximations and inaccurate sensing, we allow an error tolerance of CONSTRAINT-CONTAINS, tests rtol for both r, and ry. The following procedure, if a point is contained in the constrained region.2 Procedure 21 A test to see whether a point is in the constraint region. Input: a point in pose space p, the camera pitch angle r, Output: a boolean value indicating whether p is in the constraint region (upright on a horizontal surface) 1: procedure CONSTRAINTCONTAINS(p, rc) 2: 4: (x, y, z, rx, ry, rz) +- p r/ <- atan2((cx - x) tan(re), (c, - y) + fy) X-C Px 5: pY 3: 6: 1 Y5 JL +- 7: sin-- cos(rc) 1 +p2p -sin(rc)+p return (r' - rto 5 rx r + , r,rtoi) r +rto) A (r - rr'i The following procedures, RXRANGE and its subprocedure RX, compute the maximum and minimum values of rx that can be found within a hypothesis region R. RX computes the value of rx for a particular pixel location (x, y) and camera pitch angle r. and updates the current greatest and least known rx values. Procedure 22 Update the range of rx values for a pixel. Input: the camera pitch angle r,, a pixel location (x, y), the current least known rx value rx mini the current greatest known r. value rx max Output: (r/min, r/im) updated to the new greatest and least known values of rx 1: procedure RX(r., x, y, rx min, rx max) 2: rx +- atan2((cx - x) tan(r,), (cy - y) + fy) 3: r/ mi +-min(rx min, rx) 4: 5: rxmax max(rx max, rx) return (r/min, iax) RX-RANGE uses RX to test the possible extremes at each corner of the range and also the center vertical line where a local optimum may occur. Similarly, the following procedures, RYRANGE and its subprocedure RY, compute the range of values of ry that can be found within a hypothesis region R. RY computes 2 However, due to an oversight, the author omitted the constraint on the depth z in the CON- STRAINT-CONTAINS procedure. This is a bug. However, since CONSTRAINT-CONTAINS is only used by the BRANCHANDBOUND.STEP procedure when the object is known to be close to the plane, the bug only leads to a slight over-estimate of the volume of the constrained region during detection. 122 + Procedure 23 Find the range of r, values for a hypothesis region. Input: a hypothesis region R, the camera pitch angle r, Output: the greatest and least values (r min, r' max) of rx in R 1: procedure RXRANGE(1, r,) 2: (m (x1 , y, , ziI rx,Iry1, I) (z,9,zr2y2,iz2)) 3: 4: (rx min, rx max) +-(o, -o) (rx min 5: rx max) +- RX(r,, x 1 , yi, rx min rx max) (rx mil, rx max) +- RX (rc, X 1,Y2, rx mil, rx max) 6: (rx min, rx max) 7: 8: (rx min, rx max) 9: i, , rx mil, rx max) +- RX(rc, X2, Y2rx min, rx max) +- RX(rc, X 2 if Xi < cx < x 2 then (rx min, rx max) +- RX(rc, cxj , yi, rx min, rx max) 10: (rx mini, rx max) +- RX(rc, 11: (rx min, rx max) 12: (rx min, rx max) +- RX(r, 13: Lcx , Y2, rx min, rx max) +- RX(rc, cx1 , y1, rx min, rx max) c , y2, rx min, rx max) return (rx min, rxmax) the value of ry for a particular pixel location (x, y) and camera pitch angle r, and updates the current greatest and least known ry values. Procedure 24 Update the range of ry values for a pixel. Input: the camera pitch angle rc, a pixel location (x, y), the current least known r. value ry min, the current greatest known ry value ry max Output: (rmin,7 max) updated to the new greatest and least known values of ry 1: procedure RY(rc, X, y, rymin, Tymax) 2: PX 3: py +- 4: 5: 6: 7: y XC sin(sin(rc)+py cos(rc) rI min <- min(ry min, ry) r +- max(ry maxry) return (rY min, r' max) RYRANGE uses RY to test the possible extremes at each corner of the range and also several other locations where a local optimum may occur. When the camera has pitch angle r, and the vertical height of the camera above the center of the object on the table is h, then it has depth: z = h 1+ +(p2 ip2 py cos (rc) + sin (re)' 123 (5.15) Procedure 25 Find the range of r. values for a hypothesis region. Input: a hypothesis region R, the camera pitch angle r, Output: the greatest and least values (r' mi., r' ma) of r. in R 1: procedure RYRANGE(R, T,) 2: (M, (x1, y1, zi, TxiI ryI, rzl)7 (X2, 3: (ry min, ry max) + 4: (ry min, ry max) 5: 6: 7: 8: 10: 12: 13: 14: 15: 16: 17: 18: 19: 20: rx2i ry21 rz2)) + RY(rc,xi, y1, ry min,Ty (ry min, ry max) +- RY(rc, X1, y2, ry mil, ry (ry min, ry max) + RY(rc, X2, y1, Ty min, ry (ry min, ry max) + RY (rc, X2, y2, ry min, ry cy+fy(C2 +fx2-2c;xx1+xij) cot (r,) xif max) max) max) max) if y1 < yx 1 < y 2 then 9: 11: Y2, Z2, (OO, -0c) (ry min, 11: y max) +- RY(r, xi, yx 1 ,ry min, ry max) x2 +- fcy+fY(cx+f.T-2cxX2+X2) Yx2 I COt(r,) if y1 < yx2 < Y2 then (7y min, ry max) +- RY (r,X 1 ,Yx2, ry mill, ry max) if x 1 < cx < x 2 then (ry min, ry max) <- RY (rc, cY 1 , Ty min, ry max) (ry min, ry max) - RY (rc,cx, y2, ry min, ry max) Yx0 ffcy+f f cot(rc) if y 1 < yXo < y 2 then (ry min, ry max) +- RY (r,cx, yxo, ry min, ry max) return (ry min, ry max) 124 R- where p. and p. are defined in equations 5.13 and 5.14 above. The errors from approximations and inaccurate sensing is modeled by an error tolerance of +ztol. Again in the same pattern as above, the following procedures, ZRANGE and its subprocedure z, compute the range of values of z that can be found within a hypothesis region R. z computes the value of z for a particular pixel location (x, y) and camera pitch angle r, and updates the current greatest and least known z values. Procedure 26 Update the range of z values for a pixel. Input: the camera pitch angle re, the camera height above the table h, a pixel location (x, y), the current least known z value zmin, the current greatest known z value Zmax Output: (z'j., z' ax) updated to the new greatest and least known values of z 1: procedure z(r., h, X, y, ZYmin, Zmax) PX~ X- 3: pY +- 4: z 5: z ein +-min(zmin, Z) 6: 7: znax <-max(zmax, z) & - 2: h 1~+px+p2 py cos(rc)+sin(rc) return (Zn, z' ax) Z_RANGE uses z to test the possible extremes at each corner of the range and also several other locations where a local optimum may occur. Finally, CONSTRAINTINTERSECT uses the above subprocedures to find the small- est hypothesis region that contains the intersection between a hypothesis R and the constraint region. 5.3.5 Branch-and-Bound Search Branch-and-bound search operates by calling BRANCH and BOUND on the constrained 6D search space. BOUND provides a heuristic so that BRANCH-ANDBOUND explores the most promising regions first, provably finding the most probable detections first. Lines 3-11 of BRANCHAND-BOUNDSTEP include a brute force search for small hypothesis regions. This was included because we found that it increases the overall speed of detection (much like our decision to use BRUTE-FORCE-BOUND-DEPTH). The minimum resolution region referred to in line 3 is 2 pixels for x and y, 2 cm for z, 125 Procedure 27 Find the range of z values for a hypothesis region. Input: a hypothesis region 1Z, the camera pitch angle r,, the camera height above the table h Output: the greatest and least values (z'ai, zmax) of z in R 1: procedure ZRANGE(Z, r,, h) 2: (m, (Xi,yi, zi rXi, ry1, rzi), (X2,Y2 ,z2 ,rx2,ry 2,rz 2 )) 3: (Zmin, Zmax) + 4: (Zmin,7Zmax) 5: (Zmin, Zmax) 7: h, , 1,IZmin,7Zmax) z(re, h, Xi, Y2, Zmin, Zmax) (Zmin, Zmax) +- Z(rc, h, X 2,Y1, Zmin, Zmax) (Zmin, Zmax) <- z(rc, h, X 2 ,Y2, Zmin, Zmax) 8: Yx2 +- cY 9: if y1 < yx2 < ry then 6: - Z(rc" + + fy (1 + 10: (Zmin, Zmax) 11: (Zmin, Zmax) 12: if X1 < cx < x (x2-Cx)2) cot(rc) +- Z(rc, h, Xi, Yx2, Zmin, Zmax) - z(rc, h, x 2 , Yx2, Zmin, Zmax) 2 then 13: (Zmin, Zmax) +- Z(rc, 14: (Zmin,iZmax) +- Z(r, h, Cx,iY2,iZmin,iZmax) YxO +- cy + fy cot(re) 15: 16: h, Cx, Yi, Zmin, Zmax) if y 1 < yxo < y 2 then 17: (Zmin, Zmax) +- Z(rc, 18: (Zmin, Zmax) +- Z(rc, h,x2, yxo, zmin, zmax) 19: 1 (00, -oc) h, X, YxO, Zmin, Zmax) return (Zmin, Zmax) Procedure 28 Find the smallest hypothesis region that contains the intersection between a hypothesis region and the constraint. Input: hypothesis region R, camera pitch angle r,, camera height above table h Output: the smallest hypothesis region that contains the intersection between R and the constraint 1: procedure CONSTRAINT-INTERSECT(R, rc, h) 2: (M, (xi, yi, zi, rxi, ry1, r'z1), (X2, Y2, Z2, rx2, ry2, (rxmin, rxmax) +- RXRANGE(Zrc) 5: 6: (rymin, rymax) +- RYRANGE(1Z,rc) 7: r' min 8: r'ymax - mmnary max + rtoi, ry2 (zmin, Zmax) +- RANGE,(I, r,, h) 9: 1Z r r' mim r' max min(rx max + rtoi, rx2) ) ) min(ry min - rtoi, ry 2 i +-min(zmin -to-, z2 Zax +- min(zmax + Zto, Z2 return (M, (i, Yi,z, r mI , ) 10: 7rz2)) 11: 12: ) min - rtoi, rx2) 3: 4: min, rz 2 ), (x2, Y2, z' 126 ,rx max, max, rz2)) Procedure 29 One step in branch-and-bound search. Input: a priority queue Q, a preprocessed image I, an object 0, the minimum log probability of detections m Output: Q is modified, a new detection pose p or null 1: procedure BRANCHANDBOUNDSTEP(Q1, (, m) 2: 1? +-remove the highest priority hypothesis region from Q 3: if 1? is smaller than or equal to the minimum resolution region then 4: M +- the viewpoint bin model corresponding to R 5: Vmax +- -0 6: for all poses p in a grid within R do v -- EVAL(p, M, I) if v > vmax and CONSTRAINT-CONTAINS(p, rc) then 7: 8: 9: Vmax 10: Pbest 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: V - P return Pbest else r, +- camera pitch angle for I h +- camera height above table for I for all S in BRANCH(R) do S' +- CONSTRAINTINTERSECT(S, rc, h) if S' is not empty then b <- BOUND(S', 0, 1) if b > m then add S' to Q with priority b return null Procedure 30 Detect an object in an image by branch-and-bound search. Input: preprocessed image I, view-based object model 0, minimum log probability of detections m, the maximum number of detections to return m Output: the set of at most n detections sorted in decreasing order of probability 1: procedure BRANCHAND-BOUND(I, 0, m, n) 2: Q -- INITIALPRIORITY-QUEUE(I, 0) 3: D +-{} 4: 5: while Q is not empty and IDI < n do p +- BRANCHANDBOUNDSTEP(QI, 0, M) 6: 7: 8: if p # null then E <- D U {p} return D 127 and 4 degrees for rx, ry and r. The points in the grid samples on line 6 are spaced at 1 pixel for x and y, 2 mm for z, and 1 degree for rx, ry and rz. 5.3.6 Parallelizing Branch-and-Bound Search Since the BRANCHAND-BOUND algorithm does not run at practical speeds on a single CPU, we parallelize it on a cloud of multicore CPUs. We assume that we are operating in a shared cloud environment, in which different CPUs have different loads due to multitenancy virtual CPUs on a single shared physical CPU. In our implementation, there are many worker cores that run the BRANCHAND_BOUND-WORKER procedure, and there is one coordinator core that runs the PARALLEL_BRANCH-ANDBOUND procedure. Each worker repeatedly calls BRANCHANDBOUND- _STEP on its own private priority queue. When a worker's queue is empty, the coordinator tells another worker to delegate part of its queue to the empty worker. In this way, the workers are kept busy nearly all of the time. Each time a worker reaches a detection (a leaf in its search tree), it sends it back to the coordinator. This parallel implementation of branch-and-bound no longer carries the guarantee that the detections will be found in decreasing order of probability. However, the coordinator maintains a sorted list of detections found so far. In order to reduce the total number of search node evaluations near to the minimum necessary (as is in the case of the serial BRANCHANDBOUND implementation), the coordinator also informs workers when the current global minimum log probability increases, which will happen if the requested number detections n have already been found. For brevity, we do not include the pseudo-code for distributing the image to the workers or pre-processing the image, but we rather assume that the pre-processed image I is already available to each worker. We also do not include pseudo-code for terminating BRANCHAND-BOUNDWORKER, but this is relatively straightforward to add. There are five different types of messages: a status update message, a minimum log probability message, a delegate-to message, a delegate-from message, and a leaf message. 128 A status update message is a triple (q, r, wfrom) that is sent periodically from each worker to the coordinator, where q is the current number of hypothesis regions on the worker's priority queue, r is the current rate at which the worker has been evaluating hypothesis regions (measured in evaluations per second), and when Wfrom is not null, it indicates that the worker has just received a delegation of work from worker wfrro,. A minimum log probability message is a real number m that is always sent from the coordinator to a worker in response to a status update message. This number defines the threshold below which hypotheses can be safely discarded. For example, imagine that in PARALLELBRANCH-AND-BOUND, n = 1 and m starts at -oc. As soon as a worker finds a leaf whose log probability is m' and sends it back to the coordinator, the workers no longer need to consider any hypothesis region whose log probability is less than m', so the coordinator updates m +- m', and notifies each worker in response to its next status update message. The WORKERSTATUSUPDATE procedure handles the worker's side of the communication of these two messages. Procedure 31 Send a status update from a worker for parallel branch-and-bound. Input: the current size of the priority queue q, the current rate of evaluating hypotheses r, the worker that just delegated to this worker wfrom Output: the current minimum log probability m 1: procedure WORKER-STATUSUPDATE(q, r, Wfrom) 2: send a status update message (q, r, wfrom) to the coordinator 3: wait for a new minimum log probability message 4: 5: m +-- receive the new minimum log probability message return m A delegate-to message is a pair (wto, rto) sent from the coordinator to a worker to tell it to delegate some part of the contents of its private priority queue to worker Wto, where rTt is the last reported hypothesis evaluation rate from worker wt 0 . The worker that receives a delegate-to message decides the fraction of its priority queue in order to maximize the productivity of the pair of workers based on their evaluation rates, assuming that the delegated hypotheses have the same branching factor as the hypotheses are not delegated. A delegate-from message is a vector R of hypothesis regions. A delegate-from message is sent from a worker as soon as it receives a delegate-from message from the 129 coordinator, except when the INITIALPRIORITYQUEUE is sent from the coordinator to the first worker that sends a status update. A leaf message is a pose p for a detection at the leaf of the search tree for one of the workers. Each new found leaf is immediately sent back to the coordinator, and added to the global set of detections D. Procedure 32 A worker for a parallel branch-and-bound search for an object in an image. Input: preprocessed image I, view-based object model 0, minimum log probability of detections m, the maximum number of detections to return m 1: procedure BRANCH-AND-BOUNDWORKER(I, 0, m) 2: 3: r- 50 Q <- empty priority queue 4: while true do rn <- WORKERSTATUSUPDATE(I Q1, r, null) while QI = 0 or there is a new delegate-to or delegate-from message do if there is a new delegate-to message then (uto, rto) +- receive the delegate-to message 5: 6: 7: 8: 9: 10: R +- remove the top , Jhypothesis regions from Q 11: send a delegate-from message 1? to Wt" rn +- WORKER-STATUSUPDATE(IQ1, r, null) 12: 13: 14: 15: else if there is a new delegate-from message then R +- receive the delegate-from message for all R in 1? do add R to Q with priority BOUND(1Z, 0,1) 16: wfrom 17: m +- <- the worker that sent the delegate-from message WORKER-STATUSUPDATE(I Q1, r, Wfrom) t +- the current time (in seconds) 19: 20: C +- 0 while c < 1000 and there is not a new delegate-to message do p <- BRANCH-AND-BOUND-STEP(Q, I, 0, m) if (p # null) A (EVAL(p, M, I)> m) then send a leaf message p to the coordinator t' +- the current time (in seconds) 21: 22: 23: 24: 25: 26: r +c +-c - 18: 1 A worker is represented by the coordinator as a 4-tuple (q, r, d, a) where q is number of remaining hypotheses in its queue it last reported (initially 0), r is the rate that that worker has evaluated hypotheses in the queue, d is a boolean indicating 130 whether the worker is currently involved in a delegation-either sending to or receiving from another worker (initially false), and a is a boolean indicating whether the worker is currently available (initially false). Due to the fact that we are operating in a shared cloud environment, some workers may be much slower than other workers because they are sharing a load with other users. This is why we keep track of the rate r of evaluating hypotheses. In fact, some workers may not even become available during the course of a computation, this is why we track whether a worker has checked in to the coordinator a. However, once a worker has checked in and been marked as available, we assume that it will complete its computation without failing. In our experiments, we used 24-core virtual CPUs on an OpenStack cluster shared by many different research groups at our lab. We used OpenMPI to implement the message passing and coordination. Although we only use single-core CPU worker machines, this algorithm is well suited to parallelization if GPUs were available at each worker core. In particular, we would store the preprocessed image in the GPU's memory and parallelize line 5 of BOUND so that visual part and each depth part are bounded in parallel within a GPU kernel. The worker CPU would take care of the priority queue and the constraint evaluations. This would lead to low communication overhead between the GPU and the CPU once the preprocessed image has been sent to the GPU just before a call to BRANCH-ANDBOUNDWORKER when a new detection is beginning. 5.4 Non-Maximum Suppression Two detections that are very close to each other in the space of poses tend to have similar log probabilities. This often leads to many slight variations on a single detection corresponding to a single object in the world. In order to shorten the list of detections returned from BRANCH-ANDBOUND or PARALLELBRANCHANDBOUND. To address this problem, we remove the detections that are near a local maximum. We do this by calculating the bounding box of each object (the details of this calculation are omitted for brevity), and removing detections that overlap with other 131 Procedure 33 Coordinate workers to perform branch-and-bound search in parallel. Input: a preprocessed image 1, an object 0, the minimum log probability of detections m, the maximum number of detections to return n Output: the set of detections sorted in decreasing order of probability 1: procedure PARALLEL-BRANCHAND-BOUND(I, 0, m, n) 2: w +- a vector of workers with |w J = the number of worker CPU cores 3: wait for a status update message 4: 5: (q, r, wfrom) +- receive the status update message let wto be the worker that sent the message 6: send a minimum log probability message m to wt, 7: 8: the bit indicating whether wt, is available +- true the bit indicating whether wt, is delegating +- true 9: delegate INITIAL-PRIORITY-QUEUE(I, 0) to worker D +-{} while there is a worker in W' with d V (q > 0) do if there is a new leaf message then 10: 11: 12: 13: 14: p +- receive the leaf message D -D U {p} 15: if IDI > n then 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: wt, m +- the log probability bound of the nth best detection in D if there is a new status update message then (q, r, Wfrom) +- receive the status update message ws +- the worker that sent the status update message send m to w, the number of remaining hypotheses for w, +- q the rate of hypothesis evaluation for w, +- r the bit a indicating whether w, is available +- true if wfrom # null then the bit indicating whether w, is delegating +- false the bit indicating whether Wfr. 11 is delegating +- false while there is a worker in W' with a A (-d) A (a = 0) and there is a worker in W' with a A (-,cd) A (q > 5).do 28: 29: 30: 31: 32: 33: wt. +- the worker with a A (-,d) A (q = 0) that has the greatest r wfrom +- the worker with a A (-d) that has the greatest q the bit indicating whether wfrom is delegating +- true the bit indicating whether wt, is delegating +-true send a delegate-to message (Wto, rto) to Wfrom return D 132 detections with greater log probability. We use the standard intersection over union (IoU) overlap metric (equation 1.1) to determine if the bounding boxes overlap. Procedure 34 Non maximum suppression: remove detections that are not local maxima. Input: a vector of detections D sorted in decreasing order of log probability, the object model 0 for those detections Output: a set of detections S without the non-maximums 1: procedure NONMAXIMUM-SUPPRESSION(D, 2: S <- {} 3: for alldE Ddo 4: m +- true 5: for all s E S do 6: 7: 8: 9: 10: 0) if the bounding box of d overlaps with that of s then m +- false if m then S +- S U {Ej} t> d is a local maximum return S 133 134 Chapter 6 Experiments 6.1 Dataset We test our object detection system using a set of 13 object instances: downy, cascade, frenchs, jiffy, cereal, printed-tiki, tiki, campbells, sugar-jar, coffee-jar, quaker, sardine and spool. Table 6.1 provides some details about the objects and their meshes, and table 6.2 gives images of each of the objects and meshes. We collected RGB-D images of each object using a PR2 robot. Each image contained only one instance of the object to be detected, without any occlusion, and no instances of any of the other objects. To take the pictures, we drove the robot around a table with the object of interest and other background objects on it. We captured RGB-D images at various positions around the table. Along with each image, we also stored the precise height of the camera and the pitch angle1 of the camera, as measured by the PR2 proprioception sensors. We labeled all of the images containing each object of interest with 2D axis-aligned bounding boxes. There were also another 36 background images (18 training, 18 hold-out), collected using the same procedure. The background images are similar in composition, but none of the objects in them are from the set of objects of interest. This background set was used for additional negative examples in the experiments. The images were then divided into a training set and a hold-out set by alternating 'Changing the pitch angle causes the camera to look up and down. 135 Cd i -Ci bk downy cascade frenchs jiffy cereal printed-tiki* tiki* campbells sugar-jar coffee-jar quaker sardine spool 3600 1800 1800 1800 00 00 00 00 00 00 00 00 24,999 24,999 24,999 25,000 8 1,244 1,244 24,998 24,999 1,120 25,000 80 96 27.8 26.3 18.0 14.0 29.2 18.0 18.0 12.0 13.7 17.0 18.2 9.0 5.7 supermarket supermarket supermarket supermarket supermarket 3D print unknown supermarket supermarket supermarket supermarket supermarket hardware store 3D scan 3D scan 3D scan 3D scan hand-built hand-built* hand-built* 3D scan 3D scan hand-built 3D scan hand-built hand-built 18 19 19 19 17 18 19 17 17 18 17 17 17 17 19 18 19 16 17 19 17 16 17 17 16 16 Table 6.1: Detailed information about the objects and 3D mesh models used in the experiments. *The tiki and printed-tiki objects both share the identical 3D mesh. The mesh was hand-built to resemble the original tiki glass and the printed-tiki glass was printed from this exact mesh using a 3D printer. 136 object object mesh downy S cascade frenchs S jiffy printed-tiki cereal tiki I campbells sugar-jar coffee-jar quaker sardine mesh S I I I I spool Table 6.2: Images of the objects and 3D mesh models used in the experiments. 137 through the sequence of the images taken. For the purposes of evaluating pose accuracy in this thesis, we also manually added full 6D pose ground truth labels for the objects of interest in each of images. Acquiring these labels involved a significant amount of human effort, and may be somewhat error-prone. However, we do not expect that the average user training a new object instance would need to invest the resources necessary to add these labels to their data set. 6.2 Setting Parameters For each object, we experiment with independently varying 12 different parameters as described in section 4.2.1. The following plots show how the average precision for each varies object as the value of one parameter is varied at a time while the other values are held constant at their default values as given in table 6.3. We acknowledge that we could have obtained higher accuracy values by allowing more than one parameter value to vary at the same time. However, we decided to restrict our search to limit the computational time required for the experiments. Figures 6-1 through 6-12 show the effect of varying each variable along with a brief intuition as to the reason behind each observed trend. 6.3 Results Table 6.4 gives a summary of the average precision results for each class. In particular, we compare with the deformable part models (DPM) of Felzenszwalb et al. [10]. We used the training images described in table 6.1 as input to DPM, and the hold-out images for testing. The DPM algorithm outputs 2D bounding boxes, rather than full 3D poses as our detector does. Table 6.5 is a confusion matrix qualifying how well our algorithm can be used in a setting in which there may be multiple objects of interest in a scene with other background clutter. In the experiments used to construct this confusion matrix, a 138 default 6-1 6-2 6-3 6-4 6-5 6-6 6-7 6-8 6-9 6-10 6-11 6-12 100 100 100 100 5 0.02 275 0.02 200 0.4 0.0135 experimental values ' 100 200 1000 50 100 150 50 0 100 150 0 50 150 200 100 50 10 6 8 4 0.002 0.004 0.006 0.008 350 250 300 200 0.04 0.03 0.02 0.01 400 200 100 300 0.4 0.5 0.2 0.3 0.005 0.010 0.015 0.020 250 150 200 100 I ' parameter I figure n nyv nD nD= nV rV (pixels) rD (M) (pixels 2 ) WVmax ) VDmax (Mi 2 r, minimum edge probability threshold ht., (m) rtol 120 Table 6.3: Parameter values used in experiments. The default value is used in all experiments, except when that particular parameter is being independently varied to the experimental values. ,Average frecision vs. Number of Training Images 1 W-M M I -i downy -E cascade --*--- frenchs ------- ---- - 0.95 x -- x 7 0.9 r . 0 0.85 C) (D A.. campbells sugar-jar C Z jy cereal printed-tiki tiki D coffee-jar - quaker sardine spool 0.8 0.75 F- 0.7 F 0.65 0 I 200 I 400 I -600 -- I -800 1000 Figure 6-1: Although accuracy is affected by the number of training images, it is difficult to say whether increasing to more than 100 images gives any additional improvement. We guess that around 100 examples may be enough to support a reasonably robust least squares fit for each of the features. 139 .. ... ..... .. .... "I'll, :::..:::-.- --- - ..... ... ... ........... .-I , I - - -I - - - . - '- - -- - - - - - -- - - -AN12- - M Average Precis on vs. Number of Visu i Parts 1 I downy cascade E) frenchs 0.91 jiffy cereal 0.8 x F3 0.7 A -W printed-tiki tOki campbells sugar-jar to coffee-jar - quaker sardine spool - -- . 0.6 > - a. 0.5 0) as 0.4 | 0.3 0.2 0.1 0O 0 100 50 150 fly Figure 6-2: As the number of visual parts increases while the number of depth parts remains constant, the accuracy generally increases because the edges of the object are represented more completely. Two notable exceptions are the tiki glass and the domino sugar jar, each of whose mesh models match the real objects in the general shape (so depth features work well), but the edges on the silhouettes of the meshes are not accurate enough to provide a good match to images of real objects. 140 . ......... - - . .. ....... . .. ............ Average Precis on vs. Number of Dep h Parts i downy -e- cascade - 0.9 frenchs jiffy cereal - 0.81 At 3F> 0.7 -511 a coffee-jar _-4- quaker 0 0.6 sardine spool - 0 printed-tiki tiki campbells sugar-jar 0.4 0.31 0.2 0.1 0 50 100 15 0 Figure 6-3: We can see from this graph that, for a fixed number of visual parts, 50 depth parts gives a significant improvement over 0. For more than 50 depth features, we see that objects without flat horizontal surfaces on top tend to increase or remain at the same accuracy, while objects that do have flat horizontal surfaces tend to decrease in accuracy. This behavior is explained by the fact that in many of the training images, a box with a horizontal surface was positioned close to the camera, and since the Kinect measures shorter depths with higher precision, that surface matched better to the horizontal surfaces in the view-based model, causing false-positive detections. 141 1 Average I Z' - - - Precision vs.,Number of Visual And Depth Parts - - - __ - - - - - - - __ - i - '- - _' ' ' - downy cascade frenchs e - 0.95 - 0.91 jiffy cereal - printed-tiki tiki A v 0.85, .0 0.8 -- . -... ....... ..- - :_ _ - _ - ____ ... ... . . ... ... ............................ I campbells sugar-jar E+ coffee-jar - quaker sardine spool - (D n 0.75 (. < 0. 7 0.65 F 0.6 F 0.55 50 I I 100 150 nV 200 D Figure 6-4: As the number of visual and depth parts both increase simultaneously, the accuracy increases slightly because the shape and appearance of the object is more accurately modeled with more features in the model. The most notable exception is the jiffy muffin box. The main cause for the low accuracy of this detector is that another horizontal surface at the same height as the jiffy box was located close to the camera in some cases. Nearer depth measurements from the Kinect camera have higher precision, so this nearer horizontal surface is usually favored, in spite of the poor matching of the edge features. Increasing the number of visual and depth parts would generally give higher-accuracy view-based models, however, the time to compute the bound increases proportionally with the number of parts. Usually it is also true that increasing the time to compute the bound also slows down the whole searching process. 142 . .... ...... I __ - _:_::- - - _'__ _._:_:_ Average Precision vs. Recep Id Radius F rts downy -+- I cascade --- 0.9 Scereal r 0.8 ) tiki A campbells 1- sugar-jar D- coffee-jar -0 quaker sardine spool ------ ---- - C L 0- 0.7 CL frenchs jiffy 0.5 0.4 r - 0.6 - (D 4 5 6 7 8 9 10 r (pixels) Figure 6-5: As the receptive field radius increases, the accuracy generally increases, except in models learned from poor meshes like the tiki-glass. Since the edges in the rendered mesh are usually a poor fit to images of the real object, a larger receptive field makes the model more flexible to admit false positive detections. The total detection time increases with increasing receptive field radius because the bound becomes looser. 143 ....................... ......... .... .. ... .. ............... ...... Average Precision vs. Receptive Field Radius For Depth Parts i - e cascade Irenchs -*-- 0.9 downy -jiffy -~--cereal 0.8 -- A 0.7 v C 0 0.6 printed-tiki tiki campbells sugar-jar F) coffee-jar -- quaker sardine spool 41) a~ E5 (D) 0.4 0.3 0.2 0.1 n 2 3 4 7 6 5 rD (M) 8 x 10- 3 Figure 6-6: Increasing the receptive field radius for depth parts significantly increases accuracy, therefore we set the default value (rD = .02m) significantly above the parameter values we tried. The reason for this is that the linear model does not capture the exact depth of the surface of the object, so it is useful to have less penalty for error within some reasonable margin. 144 . ....... .... . ......... .... ........ . ..... .... ........................ Average Precision vs. Maximum Visual Par Variance 1 i downy cascade frenchs -0.95 - jiffy cereal - 0.9 printed-tiki tiki A- campbells V sugar-jar -- 0.85 coffee-jar -D0- .2 0.8 a. 0.75 - quaker sardine spool 0.7 F0.65 - () 4 0.55 - - 0.6 ' 0.5 20 0 300 250 350 V VMax (pixelS 2) Figure 6-7: As the maximum visual part variance increases, accuracy increases slightly, because if it is too low, some of the important distinguishing edge parts are not selected for the model. The notable exception to this trend is the jiffy muffin box, since it is one of the smallest of the models, so the edge features are closer to the center of the object. Edge parts near the center of the object tend to have lower variance, so increasing the maximum edge position variance admits the selection of edge features that are a poor match to the true object. 145 . ................ ........ .... . ....... Average Precision s. Maximum Depth Part Variance i -E ®R I downy cascade frenchs jiffy - 0.95 X cereal -3 printed-tiki tiki campbells 6 0.9 -V- sugar-jar A - -+ o$ 0 .S 0.85 - coffee-jar quaker sardine Sspool 0) 0.8 PV 0.75 0.7 F 'I- 0.65 0.0 1 0.015 0.02 0.025 VDmax 2) 0.03 0.035 0.04 Figure 6-8: Varying maximum depth part variance has little effect on the overall accuracy, except for the sardine can. We believe this is due to the small size of the sardine can compared to the other objects in our experiments. The small size of the sardine can means that rotations around the center of the object tend to have a smaller effect on the variance of the depth compared to the other objects. Another effect of the small size of the sardine can may be that there is more error introduced by scaling the synthetic images more. 146 .............. .... ... ........... Average Preciplon vs. Rotational BInWIdth 1 - e - 0.95 E3 0.9 A downy cascade frenchs jiffy cereal printed-tiki tiki A 0.85 C 0 0.8 --- .(CL 0.75 () V campbells sugar-jar - coffee-jar -4 quaker sardine 4- spool 0.7 0.65 0.6 0.55 0.5 10 15 20 25 30 35 40 rw (degrees) Figure 6-9: As the size of the viewpoint bins increases, accuracy decreases, because the linear model does not remain close to the true appearance of the object as it rotates over such a large viewpoint bin. Poor hand-made models such as the tiki glass are an exception to this trend. However models with smaller viewpoint bins also have many more viewpoint bins, so the total learning time increases. The learning time increases proportionally with the number of viewpoint bins, and the number of viewpoint bins increases as O(1/w3 ) as the width w of the viewpoint bins decrease in all 3 rotational dimensions. 147 .. . ... --- ------ ... .. ....... .. .. .. .......... ....... ..... . ...... ... ... ?verage Precision vs. lyinimum Edge Probabi Ity Threshold downy cascade frenchs - 0.9 - --------iffy cereal -- printed-tiki tiki campbells i *1 y 0.8, 0 v sugar-jar E- coffee-jar -4 quaker -sardine - spool a. 0.7 0 CD n0r I- 0.5 17 0.4 L 0. 2 0.25 0.3 0.35 0.4 0.45 0.5 minimum edge probability threshold Figure 6-10: If the edge detector threshold is too low, then too many spurious edges are detected, so there are more false positives. If the edge detector threshold is too high, then some of the true edges of the object are missed, so accuracy decreases. Some of the view-based models are more accurate for a higher edge threshold and some are more accurate for a lower edge threshold. We have found that the optimal edge detector threshold is highly dependent on the brightness, color and texture of the object in comparison to the background. Objects with significant contrast against the background can be detected with lower edge thresholds, admitting fewer false positive detections. 148 ... ..... .......... .............. ..... 1A Average Precisln vs. Camera Height T lerance - downy E) 0.9 cascade frenchs - -~~ - + - ------------ jiffy * cereal * printed-tki tiki 0.8 A campbells -4 quaker sardine spool -v sugar-jar E> coffee-jar C - 0) -. 7 cc'30.6 F- 0.5 - a) - 0.4 ' 0.3 0.005 5I 0.10.1 0.01 0.015 0.02 camera height tolerance (m) Figure 6-11: As the search space increases with higher tolerance for errors in the measurement of the camera's height, the overall accuracy decreases and the detection time increases. The tiki glass is an outlier in this trend, since extra variability in the height of the object above the table often allows a better match between the view-based model learned from an inaccurate mesh and the actual RGB-D image of the object. 149 . .. .. 1 0.9, ....... ..... ... ........... .... ......... ........... 1111 .. . . ...... - - .. rnco Average Precision vs. Camera Pitch T --- downy ---- cascade - frenchs * cereal - - -- e Oiki A 3v 0.8 ID> 0) 0 (D CL (D --- 0. campbells sugar-jar coffee-jar quaker sardine spool 0.6 0.5 1- 0.4' 10 I I 15 20 25 camera pitch tolerance (degrees) Figure 6-12: As the search space increases with higher tolerance for errors in the rotation of the camera, the overall accuracy decreases and the detection time increases. This trend is shown in all object classes except those which already have perfect (or near-perfect) detection accuracy (the Downy laundry detergent bottle, the French's mustard bottle and the printed tiki glass). 150 hold-out training 0.005 0.005 0.005 0.005 1.00 0.93 0.99 0.99 1.00 0.96 1.00 0.97 cereal nD = 50 0.99 0.98 printed-tiki tiki campbells sugar-jar coffee-jar quaker sardine spool htoi = 0.005 r, = 30 ht0 i = 0.005 ht.i = 0.005 rv = 10 ht.i = 0.005 rv = 10 hto, = 0.005 1.00 0.94 0.90 0.90 0.94 0.96 0.76 1.00 1.00 0.93 0.97 0.98 0.94 0.95 0.99 1.00 parameter downy cascade frenchs jiffy ht., ht., ht., hto, = = = = DPM 1.00 1.00 0.98 0.94 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 Table 6.4: Average precision for each object, compared with the detector of Felzenszwalb et al. [10]. In these experiments, each object detector was run on the set of images containing the object as well as the set of background images containing none of the objects. In order to obtain this table, we chose the setting of a single parameter that yielded the highest average precision on the set of training images, and, for validation, we also report the average precision on the set of images that were held out for testing. All parameters other than the one explicitly modified are set to the default values as specified in table 6.3. false detection of the cereal box in the background of an image containing the Downy bottle, for example, would count towards a confusion between the downy bottle and the cereal box, even if the detection had no overlap with the actual Downy bottle in the image. Table 6.6 is another confusion matrix that just deals with possible confusions between true objects of interest (without considering background clutter). The false detection described in the previous paragraph would not count if it did not overlap the actual Downy bottle in the image. Tables 6.7 and 6.8 show histograms of the deviation between the hand-labeled ground truth poses and the detected poses. In these histograms, 80% of the objects are localized to within 1 cm of the hand-labeled ground truth position, and 80% are within 15 degrees of the hand-labeled ground truth rotation. 151 predicted 0.06 0.71 0.68 cascade 0.06 0.05 0.06 frenchs jiffy cereal -printed-tiki tiki S campbells 0.83 0.06 0.58 0.75 0.76 0.11 0.11 sugar-jar - coffee-jar quaker ~ ~ ~ 0.06 0.06 0.12 0.21 1 0.06 sardine 0.06 spool 0.06 0.06 0.06 0.12 0.05 0.16 0.11 0.16 0.06 0.06 0.26 0.11 0.76 zM~'4~ 0.62 0.19 0.18 0.12 0.06 0.11 0.06 0.05 0.06 0.06 0.11 0.06 0.06 0.11 0.12 0.05 0.06 0.19 * downy 0.65 0.06 0.12 0.65 0.88 0.06 0.69 0.12 0.12 Table 6.5: A confusion matrix in which false positives of the predicted object are allowed to be found anywhere in the full image of the actual object (including in the background clutter). Empty cells denote 0. predicted - downy cascade frenchs jiffy cereal printed-tiki tiki campbells sugar-jar coffee-jar quaker sardine spool 1.00 0.05 0.95 1.00 1.00 0.06 0.94 0.05 0.76 0.16 0.24 0.79 1.00 1.00 1.00 0.12 0.88 0.06 0.94 1.00 Table 6.6: A confusion matrix in which the search for the predicted object is restricted to a bounding box surrounding the actual object. Empty cells denote 0. 152 distance (cm) r, angle (0) 6 7 6 5 4 5 4- - 3- 1 downy 0 2 3 4 L -200 -100 0 100 200 50 100 10 20 10 8 7 6 5 4 3 2 cascade 1 9 0 1 3 2 -I 6 4 400 5 -50 1-~ 3 4 3- 2 2 I 1 frenchs 0.1 0.5 0.4 0.3 0.2 0.6 -10 2-20 5 4 4 3 -10 3 32I 2 o iffy 0.8 0.6 0.4 0.2 2-15 6 0 -10 0 1052 -5 3 I 5 43 2- cereal 0 1 2 3 4 -15 -10 -5 0 5 10 Table 6.7: These histograms show the error in the predicted pose for objects that are asymmetric about the z axis. The first column shows histograms of the distance, in centimeters, between the detected center of the object and the hand-labeled ground truth center of the object. The second column shows the angle difference, in degrees, in the r, angle (that is, as if the object was on a turn-table) between the detected and ground truth poses. 153 distance (cm) 8 7 3- 6 2- 5 4 - distance (cm) 4 3- 2 o printed-tiki 0.5 1.5 1 2 tiki 4 0 0.5 1 15 2 5 4- 3 3 2 2 1 campbells 0 0.1 0.2 0.3 0.4 0.5 0.6 sugar-jar 5 0 0.5 5 1 3 4 2 32 coffee-jar 0 0.2 0.4 0.6 0.8 1 1.2 quaker 5 0.2 0.4 0.6 0.8 1 1.2 3 4 2 3 2 sardine 0 0.1 0.2 0.3 0.4 0.5 spool 0 0.2 0.4 0.6 0.8 1 Table 6.8: These histograms show the error in the predicted pose for objects that are symmetric about the z axis. These histograms show the distance, in centimeters, between the detected center of the object and the hand-labeled ground truth center of the object. 154 1.2 600 500 S400 0 S 300 200 I 100 0 0 5 10 20 15 25 30 35 40 # 24-core processors Figure 6-13: This plot shows that adding more 24-core processors to the set of workers available for the detection task increases until about 20 24-core processors, after which point, detection time increases slightly. This behavior is probably due to the extra overhead in initializing and synchronizing more MPI processes. 6.3.1 Speed Figure 6.3.1 shows how the running time to detect an object in a single image is effected by adding more CPUs to help with the process of detection. 155 teidiMEshdawarsi ferimierfdrauhailledh littelsfitiilida A adeb=ddekshwamek 156 Chapter 7 Conclusion In this thesis, we have introduced a new detection system that searches for the global maximum probability detection of a rigid object in a 6 degree-of-freedom space of poses. The model representation makes use of both visual and depth features from an RGB-D camera. In principle, we could have omitted the depth parts of our model, and our system could be used to detect shiny or transparent objects, for which it is difficult to measure depth information. In addition to introducing a new model representation, we have also developed a corresponding learning algorithm that is simple and efficient. The algorithm learns primarily from synthetic images rendered from a 3D mesh, which greatly reduces the amount of work necessary to train a new model. 7.1 Future Work There are many possible directions of future work. We will briefly discuss a few of the most interesting and promising directions. In order to improve accuracy, we can use visual features with more than just 2 values (as discussed in section 5.1). We believe the 3-valued visual features, in which an edge may be absent, weak or strong at each pixel, will yield a significant increase in accuracy. This will involve some parameter tuning to choose the best thresholds and corresponding probabilities for each threshold. We could naturally extend this 157 direction of inquiry to 4-valued visual features and beyond. Another important future direction of research to improve accuracy will be to augment our model representation to include a weight for each depth and visual part, according to which part is most informative in distinguishing the object. If we had access to scanned objects with realistic textures in realistic synthetic scenes for training, we could use synthetically rendered images as input to a support vector machine (SVM) to learn optimal weights that separate positive examples from negative examples. This technique is particularly promising for our system because our branch-and-bound search finds the global maximum detections, which correspond exactly to the support vectors in the SVM. SVM-based Detection systems such as that of Felzenszwalb et al. [10] often sample a few difficult negative examples because the training set would be too large if it included every possible object pose in the set of negative images. These difficult negative examples are chosen during the SVM learning process because they are likely to be the true support vectors for the full training set (or they are similar to the true support vectors). We may be able to leverage the fact that our detection algorithm can find global maximum detections in negative images by incorporating these new detections as support vectors in each iteration of SVM learning. In this thesis we restricted our inquiry to specific objects, rather than generic object classes. However, our representation already has the capability to model some amount of class variability. If we had access to a set of visually-similar scanned 3D mesh models of objects within the same class, and if we carefully aligned these meshes, we believe we could directly use synthetic rendered images of all of these objects simultaneously to learn a model that captures some class variability. One other important topic we have not addressed is object occlusion. Our current models inherently deal with some missing part detections, yielding a small amount of robustness to object occlusion. However, it is important for us to treat occlusion explicitly, especially in the case when the occluder is known (i.e., either the object is cropped by the boundary of the image, or occluded by a known object such as the robot gripper). 158 Along similar lines, we could learn models for articulated objects if we had access to parameterized meshes. For example, if we could generate a mesh model for any angle of a pair of scissors opening and closing, we could add this as another dimension in the space we search, such that it becomes a 7-dimensional configuration space. We could imagine extending this idea to more and more dimensions, modeling natural or deformable objects, etc. However, a crucial challenge would be to represent the search space such that it can be searched efficiently. 159 160 Appendix A Proofs A.1 Visual Part Bound We will prove that the bounding procedure for visual parts is a valid bound. We re-write the requirement for the visual part bounding function (inequality 5.9) more formally in terms of EVALNVISUAL (procedure 12) and BOUND-VISUAL (procedure 16): Vp E R : BOUND-VISUAL(R, V, I) > EVALVISUAL(p, VI) (A.1) We will prove that inequality A. 1 holds. First, we prove that x'in < p/ < X'ma, where x' X'm is from line 4 of BOUNDVISUAL, is from line 6 of BOUND-VISUAL and p' is from line 4 of EVALVISUAL. This is evident since x'min is calculated in the same way as p', except that a min operation ensures that the least term for any pose p in the hypothesis region R is added (whether the constants are positive, zero or negative). Therefore, x'min < p' holds for any pose p in the region R. A similar argument can be used to show px x'ma Therefore, we can conclude that xmin < Px < X'ma. In a similar way, it can be shown that y i p' y, where y is from line 8 of BOUNDVNISUAL, yQ is from line 10 of BOUND-VISUAL and p' is from line 4 of EVALVISUAL. Next, we prove that Xmin round(px) 161 Xinax, where xmin is from line 5 of BOUNDVISUAL, EVALVISUAL. Ximax is from line 7 of BOUNDVISUAL and px is from line 5 of We can see that xmin < round(px) using the fact that 'Min P' (proved above), and the fact that x1 5 x 2 by the definition of a hypothesis region, and since we use the min operator, and since we use the floor operator. ilar reasoning, we can also see that round(px) 5 xmax. Xmin < round(px) Therefore, we know that xmax. Again, we can follow a similar line of reasoning to show that Ymin Ymax where ymin By sim- round(p.) < is from line 9 of BOUND-VISUAL, Ymax is from line 11 of BOUND-VISUAL and py is from line 5 of EVALVISUAL. Another way to rephrase these results is that the projected expected pixel location (round(px), round(py)) is contained in the rectangular region in the image plane (x, y) where xmin < x < xmax and Ymin < y 5 ymax. However, we note that this rectangular region is a superset of the range of pixel locations of the visual feature for poses p E R. We will refer to the set difference between the superset and the true range of pixel rotations as the image plane difference region. Now we prove that if there exists a pose p* = (x*, y*, z*, r*, r*, r*) E R such that dz*2 < r'2 then SUM_2D(S, Xmin - rmax, Xmax + rmax, Ymin - rmax, Ymax + rmax) > 0, where d is the squared distance from the kth distance transform as on line 7 of EVALNVISUAL(p*, V, I), r' is the receptive field radius of the visual feature, rmax is the scaled receptive field radius from line 13 of BOUND-VISUAL and S is the kth summed area table as on line 15 of BOUNDVISUAL. If there is such a pose p* then, by definition of the distance transform, there must be a visual feature of kind k in the image with a pixel distance < 4 from the projected expected pixel location for the visual part (round(px), round(py)), where px and py are from line 5 of EVALVISUAL(p*, V, I). We > " since we use the ceil operator and divide by the minimum zi in the region R to calculate rmax on line 13 of BOUND-VISUAL. circle on the image plane with pixel radius 1 Therefore the centered at (round(px), round(py)) must be contained in the rectangular region in the image plane (x, y) where xi rmax < X < Xmax + rmax and Ymin - rmax Y 5 Ymax - also know that rmax + rmax. Then, since there is a visual feature in that circle, the definition of the summed area table implies that 162 sUM_2D(S, xmin-rmax, xmax+rmax, ymin-rmax, ymax+rmax) > 0, which is the condition of the if statement in BOUND-VISUAL. In this case, where dz*2 < r'2 for p* E R, it follows directly from this that -m(dz* 2v- 2 2 ,r ) < 0 is a valid bound on line 10 of EVALVISUAL(p*, V, I) and on line 17 of BOUND-VISUAL. If, on the other hand, dz*2 > r12 it is still possible that BOUNDVISUAL will return 0 on line 17 if there is a visual feature in the image plane difference region, which is still a valid bound. Finally, if the pixel location of the visual 2 2 r ) -< feature lies entirely outside of the rectangular region, then - min(dz* 2v - E 2v is a valid bound on line 10 of EVALVISUAL and on line 20 of BOUNDVISUAL. Therefore, we have covered both cases of the if statement in BOUND-VISUAL, so we can conclude that the bound is valid: Vp E R : BOUNDVISUAL(R, V,1) A.2 EVALNVISUAL(p, V,1). Depth Part Bound We will prove that the bounding procedure for depth parts is a valid bound. We re-write the requirement for the depth part bounding function (inequality 5.8) more formally in terms of EVAL-DEPTH (procedure 11) and BOUND-DEPTH (procedure 18): Vp E R: BOUND-DEPTH(R, D, I) > EVAL-DEPTH(p, D, I) (A.2) We will prove that inequality A.2 holds. First, we prove that xmin < Px < Xmax, where Xmin is from line 5 of BOUNDDEPTH, Xmax is from line 6 of BOUND-DEPTH and px is from line 6 of EVALDEPTH. That is evident since Xmin is calculated in the same way as px, except that a min operation ensures that the least term for any pose p in the hypothesis region R is added (whether x is positive, zero or negative). Therefore, xmin 5 px holds for any pose p in the region R. A similar argument can be used to show px < xmax. Therefore, we can conclude that Xmin Px < Xmax. In a similar way, it can be shown that ymin Py Ymax, where ymin is from line 7 of BOUNDDEPTH, ymax is from line 8 of BOUND-DEPTH and py is from line 6 of 163 EVALDEPTH. Another way to rephrase these results is that the projected pixel location (round(p,), round(py)) is contained in the rectangular region in the image plane (x, y) where Xmin and ymin 5 Y x < xmax Ymax. However, we note that this rectangular region is a superset of the range of pixel locations of the depth feature for poses p E R-we call this the true image plane region. We will refer to the set difference between the superset and the true image plane region as the image plane difference region. Next, we prove that z i z' a m:5 z x, where z'n is from line 9 of BOUND-DEPTH, is from line 10 of BOUND-DEPTH and md is from line 4 of EVALDEPTH. This is evident since z in is calculated in the same way as md, except that a min operation ensures that the least term for any pose p in the hypothesis region R is added (whether the constants are positive, zero or negative). Therefore, z' _<m holds for any pose p in the region R. A similar argument can be used to show md <z'. we can conclude that z' Therefore, < md < Z' We continue the proof as if lines 13-15 were omitted from BOUNDDEPTH, then we will address BRUTEFORCEBOUNDDEPTH afterward. Now we prove that if there exists a pose p* = (x*, y*, z*, r*, r*, r*) E R such that (d - md) 2 < r2 then sUM_3D(T, Xmin, Xmax, Ymin, Ymax, Zmin - r, Zmax + r) > 0, where d is the depth from the depth image D at the projected pixel location (px, py) as on line 6 of EVALDEPTH(p*, D,I), md is from line 4 of EVALDEPTH(p*, D,I), r is from line 17 of BOUNDDEPTH, Zmin is from line 18 of BOUND-DEPTH, and zmax is from line 19. If there is such a pose p*, then it follows from the inequalities + proven above that z'in - r' < d < z'ax + r'. Then it follows that ZminZSATstep ZsATmin - rzsATstep d < ZmaxzSATstep + ZSATmin - rZSATstep (this expanded range of depth values is a superset of the previous inequalities-we will refer to the set difference between this expanded range and the previous range as the depth difference range). Now, from the definition of the 3D summed area table T, we know that SUM_3D(T, Xmin, Xmax, Ymin, Ymax, Zmin - r, Zmax + r) > 0, which is the condition of the if statement on line 20 of BOUND-DEPTH. In this case, where (d-md) 2 < r2 for p* E1R, it follows directly that min((d-Md)',r') < 164 0 is a valid bound on line 11 of EVAL-DEPTH (p*, D, Otherwise, if (d - md) 2 > I) and on line 21 of BOUND-DEPTH. r2 , it is still possible that BOUNDDEPTH will return 0 on line 21 if a depth value in the image plane difference region falls in the expanded depth range or if a depth value in the image plane region falls in the depth difference range, which is still a valid bound. If the depth d is undefined then sUM_2D(U minX ma, Ymin, Ymax) > 0. If d is undefined or if the pixel location (round(px), round(py)) is outside of the image, then the bound -d on line 23 of BOUND-DEPTH will be equal to the return value of EVALDEPTH(p*, D, I) on line 14. If d is defined and its pixel location is within the image, it is still possible that BOUND-DEPTH will return -d2v on line 23 if there is another undefined depth in the rectangular image plane region or if part of the rect- angular image plane region is outside of the image. In this case, it is still a valid bound for BOUNDDEPTH to return -2 on line 23 since line 20 of BOUNDDEPTH eliminated the possibility that there are any depth measurements within the rectangular image plane region that are within the receptive field radius (between z' j. - r' and z' + r'). Finally, if the depth measurement is defined and if we know that it is beyond the receptive field radius (d - md) 2 < r2 , then BOUNDDEPTH will return -. 2v on line 25 which will be the same value returned by EVALDEPTH on line 14. Therefore, we have covered all the cases of the if statements in BOUND-DEPTH and EVALDEPTH, so we can conclude that the bound is valid if lines 13-15 were omitted from BOUND-DEPTH. We now prove that calling BRUTEFORCEBOUNDDEPTH on line 15 of BOUNDVISUAL yields a valid bound (regardless of the expression used as the condition of the ifstatement on line 13). The BRUTEFORCEBOUNDDEPTH procedure exhaustively evaluates each pixel (x, y) in the rectangular image plane region to determine the min- imum possible absolute difference between the depth value at that pixel and the depth range. If the image plane difference region was empty (i.e. the rectangular image plane region is equal to the true image plane region), then BRUTEFORCEBOUNDDEPTH would be perfectly tight. However, since the true image plane region is a subset of the 165 rectangular image plane region, it is possible that a pixel in the image plane difference region would cause BRUTEFORCEBOUNDDEPTH to return a value that is greater than EVALDEPTH for some p E R. But depth measurements from the image plane difference region will never violate the bound condition since the min operations on lines 6, 10 and 12 can only decrease dmin for pixels in the image plane difference region. So they can only increase the return value on line 13. Therefore, we have covered both the case when BRUTE FORCEBOUNDDEPTH is called and the case when it is not called, so we can conclude that the bound is valid: Vp E R : BOUND-DEPTH(Z, D, I) > EVALDEPTH(p, V, I). 166 Bibliography [1] A. Aldoma, M. Vincze, N. Blodow, D. Gossow, S. Gedikli, R.B. Rusu, and G. Bradski. Cad-model recognition and 6dof pose estimation using 3d cues. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 585-592, 2011. - [2] Alexander Andreopoulos and John K. Tsotsos. 50 years of object recognition: Directions forward. Computer Vision and Image Understanding, 117(8):827 891, 2013. [3] John Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6):679-698, 1986. [4] Han-Pang Chiu, L.P. Kaelbling, and T. Lozano-Perez. Virtual training for multiview object class recognition. In Computer Vision and PatternRecognition, 2007. CVPR '07. IEEE Conference on, pages 1-8, June 2007. [5] David J. Crandall, Pedro F. Felzenszwalb, and Daniel P. Huttenlocher. Spatial priors for part-based recognition using statistical models. In CVPR (1), pages 10-17. IEEE Computer Society, 2005. [6] Franklin C. Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '84, pages 207-212, New York, NY, USA, 1984. ACM. [7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886-893, INRIA Rhone-Alpes, ZIRST-655, av. de l'Europe, Montbonnot-38334, June 2005. [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303-338, June 2010. [9] Sachin Sudhakar Farfade, Mohammad Saberian, and Li-Jia Li. Multi-view arXiv preprint face detection using deep convolutional neural networks. arXiv:1502.02766, 2015. 167 [10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, 2010. [11] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Distance transforms of sampled functions. Technical Report 1963, Cornell Computing and Information Science, 2004. [12] P.F. Felzenszwalb, R.B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2241-2248, June 2010. [13] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In In CVPR, pages 264-271, 2003. [14] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, June 1981. [15] Jared Glover and Sanja Popovic. Bingham procrustean alignment for object detection in clutter. In IEEE/RSJ InternationalConference on Intelligent Robots and Systems, 2013. [16] Kristen Grauman and Bastian Leibe. Visual object recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(2):1-181, 2011. [17] Derek Hoiem, Alexei A Efros, and Martial Hebert. Putting objects in perspective. InternationalJournal of Computer Vision, 80(1):3-15, 2008. [18] Derek Hoiem and Silvio Savarese. Representations and Techniques for 3D Object Recognition and Scene Interpretation. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2011. [19] Eun-Jong Hong, Shaun M. Lippow, Bruce Tidor, and Tomis Lozano-Perez. Rotamer optimization for protein design through map estimation and problem-size reduction. J. Computational Chemistry, 30(12):1923-1945, 2009. [20] Kourosh Khoshelham and Sander Oude Elberink. Accuracy and resolution of kinect depth data for indoor mapping applications. Sensors, 12(2):1437-1454, 2012. [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012. [22] Christoph H. Lampert, M.B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8, 2008. 168 [23] Alain Lehmann, Bastian Leibe, and Luc Gool. Fast prism: Branch and bound hough transform for object class detection. InternationalJournal of Computer Vision, 94(2):175-197, 2011. [24] Joseph J. Lim, Hamed Pirsiavash, and Antonio Torralba. Parsing IKEA Objects: Fine Pose Estimation. ICCV, 2013. [25] D.G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE InternationalConference on, volume 2, pages 1150-1157 vol.2, 1999. [26] M. Maire, P. Arbelaez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8, 2008. [27] D. Nister and H. Stew nius. Scalable recognition with a vocabulary tree. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2161-2168, June 2006. oral presentation. [28] Xiaofeng Ren and Liefeng Bo. Discriminatively trained sparse code gradients for contour detection. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pages 593-601, 2012. [29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014. [30] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Labelme: A database and web-based tool for image annotation. Int. J. Comput. Vision, 77(1-3):157-173, May 2008. [31] Pierre Sermanet, David Eigen, Xiang Zhang, Michasl Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013. [32] Hao Su, Min Sun, Li Fei-Fei, and S. Savarese. Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories. In Computer Vision, 2009 IEEE 12th InternationalConference on, pages 213-220, Sept 2009. [33] A. Thomas, V. Ferrar, B. Leibe, T. Tuytelaars, B. Schiel, and L. Van Gool. Towards multi-view object class detection. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 15891596, 2006. 169 [34] Antonio Torralba. Contextual priming for object detection. InternationalJournal of Computer Vision, 53(2):169-191, 2003. [35] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):854-869, 2007. [36] Paul Viola and Michael Jones. Robust real-time object detection. In International Journal of Computer Vision, 2001. [37] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages 3485-3492. IEEE, 2010. 170