
3D Reconstruction of Cuboid-Shaped Objects
from Labeled Images
by
Erika Lee
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2013
© Massachusetts Institute of Technology 2013. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
August 9, 2013
Certified by
Antonio Torralba
Associate Professor
Thesis Supervisor
Accepted by
Prof. Dennis M. Freeman
Chairman, Masters of Engineering Thesis Committee
3D Reconstruction of Cuboid-Shaped Objects from Labeled
Images
by
Erika Lee
Submitted to the Department of Electrical Engineering and Computer Science
on August 9, 2013, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
In this thesis, my goal is to determine a rectangular 3D cuboid that outlines the
boundaries of a cuboid-shaped object shown in an image. The positions of the corners
of each cuboid are manually labeled, or annotated, in a 2D color image. Given the
color image, the labels, and a 2D depth image of the same scene, an algorithm extrapolates a cuboid's 3D position, orientation, and size characteristics by minimizing
two quantities: the deviation of each estimated corner's projected position from its
annotated position in 2D space and the distance from each estimated surface to the
observed points associated with that surface in 3D space. I found that this approach
successfully estimated the 3D boundaries of a cuboid object for 72.6% of the cuboids
in a data set of 1,089 manually-labeled cuboids in images taken from the SUN3D
database [12].
Thesis Supervisor: Antonio Torralba
Title: Associate Professor
Acknowledgments
I would like to thank Professor Antonio Torralba for supervising my research at CSAIL during my MEng year and giving me the opportunity to explore this topic in computer vision with his group.

I would also like to thank Jianxiong Xiao for his support and indispensable advice throughout the year.
Contents

1 Introduction
    1.1 Motivation for Reconstruction
    1.2 Description of a Reconstructed Cuboid
    1.3 Description of Labeling
    1.4 Related Works
2 Background
    2.1 Projecting a Cuboid
    2.2 Going from 2D to 3D
        2.2.1 The Cuboid
        2.2.2 The Projection Equation
        2.2.3 Rotation, Translation, and Scale
        2.2.4 Intrinsic Camera Parameters
3 System Overview
    3.1 Capturing Color and Depth
    3.2 Labeling a Cuboid
    3.3 Bundle Adjustment
    3.4 Initializing the Parameters
        3.4.1 Fitting Planes with RANSAC
        3.4.2 Making the Planes Orthogonal
        3.4.3 Solving for Corner Positions
        3.4.4 Estimating Initial Cuboid Parameters
4 Results
    4.1 Data Set and Results
    4.2 Edge Cases
        4.2.1 Cuboids with Only One Visible Face
        4.2.2 Cuboids Missing a Face
        4.2.3 Cuboid Deformation
        4.2.4 Reflective or Clear Surfaces
        4.2.5 Occlusion by Other Objects
        4.2.6 Labeling Outside of the Image Boundary
        4.2.7 Objects Outside of the Depth Sensor's Distance of Use
        4.2.8 Inaccuracies in Annotations
5 Conclusion
    5.1 Contribution
    5.2 Suggestions for Future Work
List of Figures

1-1 3D Reconstruction
2-1 Perspective Projection
2-2 Intrinsic Parameters
2-3 A Unit Cube
3-1 Color and Depth Images
3-2 The SUN Labeling Tool
4-1 Reconstruction Results
4-2 Edge Case: Cuboid with One Visible Face
4-3 Edge Case: Cuboid Missing a Face
4-4 Edge Case: Cuboid Deformation
4-5 Edge Case: Black Reflective Surface
4-6 Edge Case: Occlusion
4-7 Edge Case: Outside of Image Boundaries
4-8 Edge Case: Outside of Sensor's Distance of Use
4-9 Edge Case: Annotation Error
Chapter 1
Introduction
1.1 Motivation for Reconstruction
Human vision captures information in a series of 2D pictures; at any point in time, we see a 2D image of the scene in front of us. Yet we have no trouble understanding how to interact with a 3D world. For example, suppose there is a cup of coffee sitting at eye level. Even though you can only see one side of the cup, your brain is somehow able to fill in the rest of it; you can tell how big the cup is, how it is shaped, how far away it is, and so on. For most people, this is a natural process that requires relatively little conscious thought.
Similarly, computers see using cameras. In a picture, each pixel's color value contributes to creating a cohesive image. However, without further processing, that is all that the pixels are: a series of color values. Computers do not automatically have a semantic understanding of what items are in a scene, how much space those items take up, or how an item looks from the other side. There is a gap between mere color values and semantic meaning that needs to be filled before computers are able to conduct the same perception tasks that our brains are naturally wired to do. 3D reconstruction is an attempt at taking a step towards filling that gap by building a 3D model which represents the shape and geometry of objects visible in a 2D picture.
Figure 1-1: A table reconstructed from a 2D image, seen from different angles.
1.2 Description of a Reconstructed Cuboid
Cuboids are a category of 3D shape encompassing geometries composed of six faces
connected at right angles, including cubes and rectangular prisms. A unique cuboid
can be constrained by three properties: size, position, and orientation. The goal of
this thesis is to estimate these properties to a reasonable approximation of what they are in reality for a given object. In a successful estimate, the estimated cuboid
looks like a bounding box of the targeted cuboid-shaped object.
Numerous artificial items are cuboid-shaped, such as boxes, books, cabinets, etc.
Being able to identify the 3D properties of cuboid-shaped objects can potentially
be helpful in areas such as object recognition and scene understanding. This thesis
discusses an approach to find a cuboid-shaped object's position, orientation, and size
from a 2D color image, a 2D depth image, and some labels.
1.3 Description of Labeling
Locating and recognizing objects in an image is a large component in scene understanding, and it poses a significant question in the field of computer vision. Image
labeling is one way to accomplish this task. Labels are manually added to an image to
indicate where an object exists. They are intended to provide some information about
the content of the image and, by extension, some indication of how the image should be
processed.
There are several ways in which these labels can be recorded; a popular approach
is asking a user to click points that enclose a polygon outlining the object of interest
[8, 10]. For the purposes of this thesis, the user labels a cuboid-shaped object by
specifying where the corners of that object appear in the image, thereby constraining
the boundaries of the object.
Because labeling is done manually, it is a labor-intensive process. However, it remains a valid approach because manual labeling offers a higher rate of accuracy than automatic algorithms. People do a much better job of recognizing objects and segmenting which parts of an image correspond to which object than current automatic algorithms that attempt to do general object recognition. In addition, user annotations can ultimately provide training examples for studies in related fields that use machine learning. These manually-labeled instances could perhaps shed some insight on the processes that our visual system undergoes which allow us to segment the objects in these images so naturally.
1.4 Related Works
There have been a number of works published on cuboid fitting. One approach includes training a system on the color gradients of several variations of cuboid corners.
The gradients are matched to an image to determine the position of the corner in the
image, and therefore the location of the cuboid [13]. Other works have relied on the geometry of the scene to estimate cuboid location, using perspective and edge detection to find potential fits for the cuboid and the edges and corners of the room [4, 7].
Whereas the works described above rely primarily on processing a color image,
the approach discussed in this thesis relies primarily on processing the depth map and
user-provided labels to segment the objects of interest. There have been approaches
that attempt to extrapolate the depth from the color image by guessing the orientation
of every visible surface in the room [7]. Another common approach to find depth
information in general is to use multiple cameras to capture the same scene at slightly
different viewpoints. By contrast, the approach in this thesis uses a single camera
that produces two different images from the same viewpoint.
Two labeling frameworks of note are LabelMe [8, 10] and SUN [12], both of which are image labeling tools developed by the Vision group at CSAIL. Both are publicly accessible, and use an interface that asks the user to click on a color image to label its contents. LabelMe3D, an extension of LabelMe, explores extrapolating 3D information from labeled 2D polygons, but does not support labeling cuboids at this point in time [9].
Chapter 2
Background
This section highlights the geometry and mathematical concepts relevant to the topic
of this thesis. The ideas are well established and discussed at length in reference texts
such as [3]. Therefore, concepts that are considered common knowledge in the field
(such as perspective projection, homogeneous coordinates, intrinsic camera parameters, etc.) are left without further citation in this section.
2.1 Projecting a Cuboid
In order to estimate the 3D characteristics of a cuboid-shaped object from a picture,
it is useful to examine how a 2D image correlates to its 3D counterpart. A camera
projects one viewpoint of a 3D world onto a plane in a manner described as perspective
projection. The color of each pixel on the resulting image plane is determined by the
visible object in the 3D world that intersects with the ray that passes through the
camera origin and that pixel on the image plane.
Focal length and principal point are the intrinsic parameters of a camera; they describe the measurements of perspective projection. The focal length values ($f_x$ and $f_y$) determine the distance between the camera lens and the image plane. The principal point ($p_x$, $p_y$) refers to the point of intersection of the perpendicular line from the image plane to the camera origin. The principal point is typically very close to the center of the image.
Figure 2-1: In perspective projection, each pixel of an image is determined by the
intersection of the 3D scene and the ray through the camera origin and that pixel.
Assuming that there is a way to derive the distance from the camera to an object
from the depth image, it is possible to estimate the 3D coordinates of the points in a
scene using similar triangles. The following equations express these relationships:
$$Z = \text{depth-map}(x, y) \tag{2.1}$$

$$X = \frac{(x - p_x) \cdot Z}{f_x} \tag{2.2}$$

$$Y = \frac{(y - p_y) \cdot Z}{f_y} \tag{2.3}$$
where (x, y) refer to a point in image coordinates and (X, Y, Z) refer to the corresponding point in real world coordinates.
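As a minimal illustration, the following sketch (Python with NumPy, not part of the original thesis) back-projects an entire depth map into a point cloud using equations (2.1), (2.2), and (2.3). The intrinsic values here are hypothetical placeholders, not the Xtion's actual calibration.

```python
import numpy as np

def backproject(depth_map, fx, fy, px, py):
    """Convert a depth map to 3D points using equations (2.1)-(2.3)."""
    h, w = depth_map.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    Z = depth_map                                   # equation (2.1)
    X = (x - px) * Z / fx                           # equation (2.2)
    Y = (y - py) * Z / fy                           # equation (2.3)
    return np.stack([X, Y, Z], axis=-1)             # (h, w, 3) world coordinates

# Hypothetical intrinsics for illustration; real values come from calibration.
depth = np.random.uniform(0.8, 3.5, size=(480, 640))
points = backproject(depth, fx=570.0, fy=570.0, px=320.0, py=240.0)
```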
Figure 2-2: A camera's intrinsic parameters are defined by its focal length and principal point.
2.2 Going from 2D to 3D

2.2.1 The Cuboid
Generally, a cuboid can be represented by the positions of its 8 corners. To maintain
consistency, each corner is assigned a number from 1 through 8, and each visible face
is assigned a number from 1 through 3. The corners are numbered such that for every
cuboid, each face has the same set of 4 corners, as shown in Figure 2-3.
The corner positions of the unit cube centered at (0, 0, 0) can be represented as
follows:
$$X_{3D} = \begin{bmatrix} -1 & -1 & -1 & 1 & -1 & 1 & 1 & 1 \\ 1 & 1 & -1 & 1 & -1 & -1 & 1 & -1 \\ 1 & -1 & 1 & 1 & -1 & 1 & -1 & -1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \tag{2.4}$$
Each column of the first three rows corresponds to the X, Y, and Z positions of a corner, from 1 through 8.

Figure 2-3: A unit cube centered at (0,0,0). Each of its 8 corners and 3 visible faces are numbered in this fashion. Corner 8 is not visible in this representation.
2.2.2 The Projection Equation
Each annotated corner with position (x, y) in the 2D image can be mapped to its 3D
world coordinates (X, Y, Z). The projection of 3D coordinates onto a 2D plane can
be written as:
$$X_{2D} = P \times X_{3D} \tag{2.5}$$

where $X_{2D}$ takes the form:

$$X_{2D} = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 \\ y_1 & y_2 & y_3 & y_4 & y_5 & y_6 & y_7 & y_8 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \tag{2.6}$$

P, the projection matrix, can be further broken down into:

$$P = K \times [R|t] \times S \tag{2.7}$$

2.2.3 Rotation, Translation, and Scale
$[R|t]$ is a 4x4 matrix that combines a regular 3x3 rotation matrix, R, and a vector, t, representing the center of the cuboid after translation. The matrix takes the form:

$$[R|t] = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \tag{2.8}$$

with the bottom row being all zeros except for the last element.
The R matrix can be converted to angle-axis notation, which is a vector of three
elements that represents the axis of rotation. The magnitude of rotation is encoded by
the magnitude of the vector. This reduces the number of free parameters for rotation
from nine to three.
Since the t component of the matrix consists of the X, Y, Z values of the center
of the cuboid, the number of free parameters for translation is three.
S is a standard scaling matrix of the form:
$$S = \begin{bmatrix} w & 0 & 0 & 0 \\ 0 & h & 0 & 0 \\ 0 & 0 & d & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{2.9}$$
where w is the width, h is the height, and d is the depth of the cuboid. Thus,
scale also has three free parameters, for a total of nine free parameters to optimize
for.
2.2.4 Intrinsic Camera Parameters
K is a matrix representing the intrinsic parameters of the camera of the form:
$$K = \begin{bmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{bmatrix} \tag{2.10}$$
Multiplying a set of 3D corners by K gives the corners' projected locations by the
relations discussed in section 2.1. The corner positions are converted from homogeneous coordinates back to Cartesian coordinates before being multiplied by the K
matrix.
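The full projection pipeline of equations (2.4) through (2.10) can be sketched as follows. This is an illustrative reconstruction, not the thesis code; the intrinsic and cuboid parameter values are hypothetical, and SciPy's Rotation stands in for the angle-axis conversion described in section 2.2.3.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Corner matrix of the canonical cube in homogeneous coordinates,
# matching equation (2.4).
X3D = np.array([[-1, -1, -1,  1, -1,  1,  1,  1],
                [ 1,  1, -1,  1, -1, -1,  1, -1],
                [ 1, -1,  1,  1, -1,  1, -1, -1],
                [ 1,  1,  1,  1,  1,  1,  1,  1]], dtype=float)

def project_cuboid(angle_axis, t, whd, K):
    """Sketch of X2D = K x [R|t] x S x X3D from equations (2.5)-(2.10)."""
    S = np.diag([whd[0], whd[1], whd[2], 1.0])     # scale, equation (2.9)
    Rt = np.eye(4)                                 # [R|t], equation (2.8)
    Rt[:3, :3] = Rotation.from_rotvec(angle_axis).as_matrix()
    Rt[:3, 3] = t
    X_cam = Rt @ S @ X3D                           # corners in camera frame
    x = K @ X_cam[:3]                              # homogeneous -> Cartesian, then K
    return x[:2] / x[2]                            # perspective divide: 2 x 8 pixels

# Hypothetical intrinsics and cuboid parameters, for illustration only.
K = np.array([[570.0, 0.0, 320.0], [0.0, 570.0, 240.0], [0.0, 0.0, 1.0]])
corners_2d = project_cuboid(angle_axis=[0.0, 0.3, 0.0], t=[0.0, 0.0, 2.5],
                            whd=[0.4, 0.3, 0.2], K=K)
```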
Chapter 3
System Overview
The tasks involved in completing this thesis include capturing videos of different
scenes, labeling cuboid-shaped objects in extracted frames, and solving for the cuboid
for each labeled object.
3.1 Capturing Color and Depth
The SUN database consists of videos of various rooms and indoor spaces. The images
in this thesis are randomly sampled frames from the videos in SUN. Data is captured
with the Asus Xtion camera, which contains a regular RGB camera and a depth
sensor. These images, along with the labels provided by the user, serve as the input
for the algorithm. The depth map is much like the regular image, except each pixel corresponds to a value that can be used to derive how far that point is from the camera, instead of a color value.
Ideally, the depth sensor would give an accurate depth reading for all pixels;
however, the sensor fails in a number of situations. The sensor gives no readings
near the top, bottom, and right boundaries of the image. In addition, certain surfaces leave large gaps in the depth map, and there also tend to be small gaps
around the boundaries of objects. Accurate readings for those pixels are unavailable.
The implications of the depth sensor's limitations are discussed in more depth in the
results section 4.2.
Figure 3-1: An RGB image and a depth image captured by the Asus Xtion camera.
There are gaps in the depth map for some of the items in the far background, around
the back outline of the table, and around the top, right, and bottom borders of the
image.
The significance of introducing a depth sensor for this project is that it provides an absolute depth reading, which can be used to determine absolute scaling. In perspective projection, a large object far away from the camera and a smaller object close to the camera could look exactly the same in 2D projection if viewed from the right angle. Thus, absolute scaling is difficult to determine. However, a depth sensor removes that ambiguity since the exact depth of objects in a scene becomes known.
3.2 Labeling a Cuboid
Labels refer to the manual annotations to an image that provide information about the
image's content, and specifically in this thesis, which parts of the image correspond to
a cuboid-shaped object. The labels play a pivotal role in discerning the orientation,
position, and size of the cuboid. To label a cuboid with the SUN labeling tool, the user
chooses the viewpoint that best matches that of the object in the image, then clicks
on the corners of the cuboid in the color image. For a cuboid, the corners define face
boundaries; the pixels enclosed by the corners of a face are also the pixels associated
with the face of the cuboid.
The labeling provides some insight of the cuboid's
geometry as well, since all the visible faces of a cuboid from a given viewpoint must
Figure 3-2: The SUN labeling tool interface allows users to label cuboids on a color
image [12].
be adjacent.
From any perspective, there are 1, 2, or 3 visible faces of a cuboid. The number of visible corners is 4, 6, or 7, depending on the number of faces that are visible. For a cuboid with 1 visible face, the problem is not constrained enough to solve for a unique cuboid. Thus, this thesis explores the cases with 2 or 3 visible faces.
The primary annotation file format this system uses is compatible with the SUN
annotation tool [12]. However, the system is also compatible with an .xml file format
similar in structure to LabelMe's annotation file [8, 10].
3.3 Bundle Adjustment
Bundle adjustment is a standard technique used to estimate the values of a cuboid's parameters [3, 11]. This thesis's approach minimizes two measurements: 1. the 2D distance between a labeled corner position and the projected location of the corresponding estimated 3D corner in the image, and 2. the 3D distance of each observed point from its estimated plane.

This value to be minimized can be expressed by the following:
$$\sum_{c \in \text{Corners}} \left\| X_{2D} - P \times X_{3D} \right\|^2 + \lambda \sum_{p \in \text{Planes}} \sum_{m \in \text{PointCloud}} \text{distance}(p, m)^2 \tag{3.1}$$

where Corners is the set of annotated corners of the cuboid, $\lambda$ is some weight, Planes is the set of estimated planes of the cuboid, PointCloud is the set of points associated with plane p, m is a point in the point cloud associated with p, and distance(p, m) computes the perpendicular distance between a plane p and point m.
The library used to solve for the parameter values that minimize this error is ceres-solver, which was developed at Google [1]. The solver iteratively solves for the best parameter values for orientation, position, and size of a cuboid to minimize the expression above.
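The thesis uses ceres-solver; as an illustrative stand-in, the sketch below expresses the objective of equation (3.1) as a residual vector for SciPy's least_squares. It reuses the hypothetical project_cuboid helper from the sketch in chapter 2, and cuboid_planes is an assumed helper that derives the face planes from the same parameterization using canonical face normals.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

# Assumed canonical face normals (rows) of the unit cube of equation (2.4).
M = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])

def cuboid_planes(angle_axis, t, whd):
    """Hypothetical helper: planes (unit normal n, offset d with n . X = d)
    of the three faces, derived from the cuboid parameters."""
    R = Rotation.from_rotvec(angle_axis).as_matrix()
    planes = []
    for m in M:
        n = R @ m                                          # rotated face normal
        center = np.asarray(t) + (np.abs(m) @ np.asarray(whd)) * n
        planes.append((n, n @ center))                     # face center lies on the plane
    return planes

def residuals(params, labeled_2d, face_points, K, lam):
    """Residual vector for equation (3.1): corner reprojection errors plus
    weighted point-to-plane distances (squared by the solver)."""
    aa, t, whd = params[:3], params[3:6], params[6:9]
    corner_res = (project_cuboid(aa, t, whd, K) - labeled_2d).ravel()
    plane_res = [np.sqrt(lam) * (pts @ n - d)
                 for (n, d), pts in zip(cuboid_planes(aa, t, whd), face_points)]
    return np.concatenate([corner_res] + plane_res)

# result = least_squares(residuals, x0, args=(labeled_2d, face_points, K, 0.1))
```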
3.4 Initializing the Parameters
As described in section 2.2.3, a cuboid object's orientation, position, and size can
be encoded as rotation, translation, and scale matrices. Given the constraints in section 3.3 above, a solver should yield values for the free parameters. Because the quality of the results can be improved with good initialization, determining a reasonable set of initial parameters is important.
3.4.1 Fitting Planes with RANSAC
The first step to making an initial guess of the cuboid parameters is to extract the point cloud associated with each face. This is accomplished by choosing all of the 2D points enclosed by the corners of a face and converting them to 3D using equations (2.1), (2.2), and (2.3). Then, the RANSAC algorithm is used to find the best-fit plane for each visible face [2]. This process involves taking a random sample of three 3D points from a face and seeing how many other points in the point cloud fall onto the plane defined by these three points. After some number of iterations, the plane that yields the most matches is selected as the best-fit plane. It is possible to evaluate how confidently each plane fits by taking the number of points in the point cloud that fall in the plane. This metric of confidence favors faces of higher surface area, which usually does correspond correctly to a more confident fit.
Although RANSAC does not guarantee the correct answer, if more than 50% of a face's points lie on the correct plane, each random sample of three points lies entirely on that plane with probability at least $(0.5)^3$, so n iterations of RANSAC will yield a successful result with probability at least $1 - (1 - 0.5^3)^n$, which comes out to be greater than 0.93 for n = 20.
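A minimal sketch of this plane-fitting step follows; the inlier tolerance and the choice of random generator are hypothetical, since the thesis does not state them.

```python
import numpy as np

def ransac_plane(points, n_iters=20, tol=0.01, seed=0):
    """Fit a plane (unit normal n, offset d with n . X = d) to an (N, 3)
    point cloud by repeatedly sampling three points and counting how many
    other points fall near the plane they define."""
    rng = np.random.default_rng(seed)
    best_n, best_d, best_count = None, None, -1
    for _ in range(n_iters):
        a, b, c = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(b - a, c - a)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n /= norm
        count = np.sum(np.abs(points @ n - n @ a) < tol)
        if count > best_count:               # keep the plane with the most inliers
            best_n, best_d, best_count = n, n @ a, count
    return best_n, best_d, best_count        # count doubles as the confidence score
```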
3.4.2 Making the Planes Orthogonal
After determining the best-fit plane for each visible face of the cuboid, the planes are adjusted so that the faces are mutually orthogonal, if they are not already. The plane with the highest confidence score is left untouched. The second plane's normal is determined by projecting two points along the second most confident plane's normal onto the first plane. If the second plane is just slightly misaligned, this should only slightly adjust the plane's orientation while producing a normal that lies in the first plane. Thus, the second plane will be orthogonal to the first plane. The third plane's normal is determined by taking the cross product of the normals of the first two planes, and so will be orthogonal to both of them. Thus, all three planes will be mutually orthogonal. Even if the original object had only two visible faces, this third plane can be derived once the first two planes are fixed.
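One way to realize this orthogonalization is sketched below: removing from the second normal its component along the first (equivalent to the point-projection described above) and taking a cross product for the third. This is an illustrative reconstruction, not the thesis code.

```python
import numpy as np

def orthogonalize_normals(n1, n2):
    """Given the normals of the two most confident fitted planes, return
    three mutually orthogonal unit normals. n1 is kept fixed; n2 loses its
    component along n1; n3 is the cross product of the first two."""
    n1 = np.asarray(n1) / np.linalg.norm(n1)
    n2 = np.asarray(n2, dtype=float)
    n2 = n2 - (n2 @ n1) * n1          # project n2 into the plane normal to n1
    n2 /= np.linalg.norm(n2)
    n3 = np.cross(n1, n2)             # orthogonal to both by construction
    return n1, n2, n3
```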
3.4.3 Solving for Corner Positions
The algorithm uses two techniques to solve for the 3D positions of the corners. The first gets the 3D position of each corner from equations (2.1), (2.2), and (2.3), then adjusts each corner by projecting it onto the intersection of all of the fitted planes that the corner sits on. If the corner only sits on one of those faces, the corner is simply projected onto that plane instead of an intersection of multiple planes. This technique shifts the position of a 3D point so that it lies exactly on the orthogonal planes; however, it fails when the depth map reading is not available for that corner's 2D position, which can occur when the depth sensor simply fails for that pixel or because the labeled corner is outside of the captured image.
The second technique uses perspective projection and projects the labeled 2D
point on the image plane, along the ray drawn from the camera origin, onto one of
the fitted, orthogonal planes that the corner lies on. If the corner lies on more than
one plane, the algorithm chooses the plane that is most orthogonal to the ray drawn
from the camera center through the point in the image plane. The most orthogonal
plane, determined by calculating the cross product of the ray and the plane's normal,
is chosen because this technique tends to fail if the ray projects onto a plane parallel
to the ray. A labeling error of a few pixels can make a very large difference in the
final 3D position of the point if the ray and plane are close to parallel.
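A sketch of this second technique, assuming planes are stored as a unit normal n and offset d with n . X = d; the near-zero denominator check mirrors the parallel-plane failure mode described above.

```python
import numpy as np

def corner_from_ray(x, y, n, d, K):
    """Intersect the viewing ray through labeled pixel (x, y) with the
    fitted plane n . X = d. Points on the ray are s * ray for a scalar s,
    since the ray passes through the camera origin."""
    ray = np.linalg.inv(K) @ np.array([x, y, 1.0])  # direction through the pixel
    denom = n @ ray
    if abs(denom) < 1e-9:          # ray (nearly) parallel to the plane: unstable
        return None
    return (d / denom) * ray
```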
In both techniques, the point could be adjusted to a position that does not correspond to its original observed value since the fitted orthogonal planes have been
adjusted after the fitting.
Because the two techniques fail in different circumstances, a combination of both yields better results than either technique by itself. This thesis combines the two techniques into two different orderings. First, the system makes a pass through all the corners with the first technique, then makes a second pass with the second technique on the points that failed the first pass. The system saves the results, clears all the corners, then repeats the process with the order of the two techniques reversed. Trying both sequences and picking the better set of corners tends to yield more accurate results than trying just one order or the other. Thus, the algorithm yields two sets of corners. A corner will fail both techniques when there is no depth reading and only relatively parallel planes for the corner to project onto, or if the corner is not visible from the image plane (which is always true for corner 8, and true for corner 5 for cuboids with only two visible faces).
If a cuboid is fully constrained, the positions of the remaining corners can be solved for by some additive combination of the known corners. The known corners need to establish the properties of width, height, and depth. This occurs when five or more corners are already known, or when four corners are known as long as the four corners are not all along the same face. If one set of corners is under-constrained, the set is discarded. If both orderings result in an under-constrained set of corners, the algorithm fails.
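The thesis does not spell out the exact additive combination used; one concrete instance is parallelogram completion on a face with three known corners, sketched here for illustration.

```python
import numpy as np

def complete_face_corner(a, b, c):
    """Parallelogram completion: given three corners of a cuboid face,
    where b is adjacent to both a and c, the missing fourth corner is
    a + c - b. Applying this face by face fills in remaining corners."""
    return np.asarray(a) + np.asarray(c) - np.asarray(b)
```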
Because this approach can return multiple sets of corners, each set is assigned a score that approximates how good the guess is. The set with the higher score is chosen for extracting the orientation, position, and size parameters. The score measures the integrity of the cuboid by two metrics. The first is the fraction of points in each point cloud that fall within the boundary defined by the corners of that face. The second is how similar in length the edges are within each dimension of width, height, and depth. A cuboid with significantly uneven edge lengths along the same dimension is unlikely to be a good estimate.
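The second metric could be computed along the lines of the sketch below, where edges_by_dim is a hypothetical mapping from each dimension (width, height, depth) to the corner-index pairs of its four edges; the exact pairs depend on the corner numbering of Figure 2-3.

```python
import numpy as np

def edge_uniformity(corners, edges_by_dim):
    """Score how similar the four edge lengths are within each dimension;
    1.0 means perfectly even, lower values flag a badly skewed cuboid."""
    scores = []
    for pairs in edges_by_dim.values():
        lengths = np.array([np.linalg.norm(np.asarray(corners[i]) - np.asarray(corners[j]))
                            for i, j in pairs])
        scores.append(lengths.min() / lengths.max())
    return float(np.mean(scores))
```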
3.4.4 Estimating Initial Cuboid Parameters
The final step is to convert the set of corners to a true cuboid, represented by the
parameters as described in 2.2.3. The scale of the cuboid is approximated by the
average length of the edges of the resulting set of corners. The translation of the
cuboid is approximated by taking the average position values of each of the corners.
These values can easily be translated to matrix form.
The orientation, represented by the rotation matrix, is solved for using the following equation [5]:

$$R = M_r^T \times M \tag{3.2}$$

where the goal coordinate system of the cuboid is:

$$M = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ -1 & 0 & 0 \end{bmatrix} \tag{3.3}$$

as represented by figure 2-3. The rows represent the normals of faces 1, 2, and 3. The initial coordinate system, $M_r$, is also a 3x3 matrix, where each row represents the normal of the fitted orthogonal plane 1, 2, or 3:

$$M_r = \begin{bmatrix} X_1 & Y_1 & Z_1 \\ X_2 & Y_2 & Z_2 \\ X_3 & Y_3 & Z_3 \end{bmatrix} \tag{3.4}$$

The resulting rotation matrix can then be converted to angle-axis format.
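A sketch of this initialization step, following equations (3.2) and (3.3) as reconstructed above; SciPy's Rotation performs the conversion to angle-axis form, and the fitted normals are assumed to already be orthonormal from section 3.4.2.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Goal coordinate system M of equation (3.3): rows are the canonical
# normals of faces 1, 2, and 3.
M = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])

def initial_rotation(fitted_normals):
    """Equation (3.2) as reconstructed above: a rotation taking each row of
    M (canonical face normal) to the corresponding row of M_r (fitted,
    orthonormal plane normal), converted to angle-axis for the solver."""
    Mr = np.asarray(fitted_normals)   # 3x3; rows are the orthogonalized normals
    R = Mr.T @ M                      # R @ M[i] == Mr[i] when rows are orthonormal
    return Rotation.from_matrix(R).as_rotvec()
```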
(3.4)
Chapter 4
Results
4.1 Data Set and Results
To construct a data set of images, I randomly sampled 7,130 images from the SUN3D
database. Of these images, there were a total of 675 images containing 1,089 two- or three-face cuboids. The experiment yielded a success rate of 72.6%, with 791 successful and 298 failed reconstructions. On average, each cuboid took about 3.2 seconds to process. This is perhaps fast enough for a visualization tool for a labeling database, but not quite fast enough for more time-sensitive applications.
4.2 Edge Cases
This section discusses various edge cases that arose. All of these cases except for the
first were included in the results.
4.2.1 Cuboids with Only One Visible Face
Cuboid-shaped objects that have only one visible face in an image are not constrained enough for this algorithm to solve for their 3D reconstruction. These cases were excluded from the results.
Figure 4-1: Example of a reconstructed scene with three cuboid-shaped objects.
Figure 4-2: Example of an image with a cuboid-shaped object with only one visible
face (labeled in red).
4.2.2 Cuboids Missing a Face
Certain objects are cuboid-shaped, but may be "open" or missing a part of a face or a
whole face. For example, a box without a cover has five solid faces, but its open face
may be visible to the camera. This causes the algorithm to fail in some cases, since the point cloud for that face no longer resembles a plane, thus causing the algorithm to fit the wrong plane to the point cloud. It is arguable whether objects in this category
should be considered a cuboid since they do not actually have six faces.
4.2.3 Cuboid Deformation
Few objects in reality are actually perfect cuboids. Most are at least slightly deformed.
Although the deformation is not usually an issue, an item that deviates too much in shape from an actual cuboid can cause the algorithm to fail.

Figure 4-3: Example of a reconstructed object with a missing face. This particular object was a successful example.

Figure 4-4: Example of a reconstructed object that deviates somewhat from a cuboid. The algorithm still attempts to fit a cuboid to the object.
4.2.4 Reflective or Clear Surfaces
The depth sensor fails to return accurate readings for certain surfaces. For example,
it ignores clear surfaces and gives incomplete readings for reflective black or metallic
surfaces. In mirrors, it returns the distance of the object shown in the mirrored
image. These cases can cause the algorithm to fail, although some objects with
reflective surfaces were still successful if enough of the depth reading was usable for
reconstruction.
Figure 4-5: Example of a reconstructed object with a black reflective surface that
returns poor depth readings. Many of the pixels on the object are blacked out
in the depth image.
Figure 4-6: Example of a reconstructed object that is partially occluded. The points associated with the occluding objects cause the estimated cuboid to be shifted upward from its actual position.
4.2.5 Occlusion by Other Objects
Objects in a scene are often at least partially occluded. Not only does occlusion make
accurate labeling more difficult, but the points associated with the occluding object
can also skew the estimated translation parameters.
4.2.6 Labeling Outside of the Image Boundary
Some objects are partially cut off at the edge of the image. Both the labeling tool
and the reconstruction algorithm handle this as long as at least two faces of the
object have enough points in the image to fit a plane to. The object is successfully
reconstructed in most cases.
Figure 4-7: Example of an object that has some of its visible faces cut off from the
image. Enough of the object is preserved for reconstruction.
Figure 4-8: Example of a labeled object that cannot be reconstructed because no
depth reading is available. The depth map is completely black for parts of the picture
that are too far away from the depth sensor.
4.2.7 Objects Outside of the Depth Sensor's Distance of Use
According to the Asus Xtion's specification, its depth sensor reads between 0.8 and 3.5
meters. In a few cases, the cuboid is out of the range of the depth sensor. This tends
to occur in large spaces such as lecture halls. Although the RGB camera still captures
the image of the cuboid, the algorithm fails because there is no depth reading for the
object.
4.2.8 Inaccuracies in Annotations
For cuboid faces that are nearly parallel to the camera's viewpoint, mislabeling a corner by a few pixels can appear correct upon visual inspection, but result in a significant error during the reconstruction. Human errors in labeling suggest that perhaps the annotations should not be taken as strict constraints, or perhaps that there could be some automatic readjustment of labeled corner positions for better results. Another possibility is a more fine-grained interface for labeling the images.
Figure 4-9: Example of an object that is slightly mislabeled. The boundaries of the top face are difficult to discern in the color image. Because the angle of the top face is nearly parallel to the viewpoint of the camera, this mislabeling results in a reconstructed cuboid that is significantly too large for the object. In this case, the estimated cuboid juts out beyond the back wall.
Chapter 5
Conclusion
5.1 Contribution
This thesis proposes a method for reconstructing a 3D cuboid that estimates the size,
location, and orientation of a cuboid-shaped object. The reconstruction is accomplished using depth images and user-provided labels of color images. I evaluated this
approach on a data set constructed from images taken from the SUN3D database.
This data set and the performance results could serve as a reference for future attempts at cuboid reconstruction. Alternatively, it could also be used as a potential training set for algorithms that rely on learning to automatically recognize cuboid-shaped objects in an image, such as [6].
5.2 Suggestions for Future Work
Potential future improvements might include gracefully handling some of the edge
cases, especially occlusion, since it occurs very frequently in images. More
than half of the objects labeled in the data set were at least partially occluded.
One approach could be to identify when occlusion occurs and ignore the occluding
points, which might be feasible since occluding points will have a closer depth reading
than the actual cuboid. Another possibility is to integrate occlusion boundaries into
the labeling tool so the algorithm is aware that a part of the image is occluded.
LabelMe3D has already begun to explore this option [9].
Another change could be to include more information in cuboid labels. If the labeling included information about the general orientation of each face, the reconstruction
algorithm could become more constrained and potentially more robust. This change
would improve cases such as the hollow objects that are missing a face, since a plane
fit to the wrong general orientation would not be attributed very high confidence.
And finally, a similar approach to the one described in this thesis could be extended to apply to other commonly occurring geometric shapes, such as cylinders. This would increase the variety of objects that the algorithm could reconstruct.
Bibliography
[1] Sameer Agarwal and Keir Mierle. Ceres Solver. https://code.google.com/p/ceres-solver/.

[2] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24, 1981.

[3] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[4] Varsha Hedau, Derek Hoiem, and David Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision - ECCV 2010, volume 6316 of Lecture Notes in Computer Science, pages 224-237. Springer Berlin Heidelberg, 2010.

[5] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4, 1987.

[6] H. Jiang and J. Xiao. A linear approach to matching cuboids in RGBD images. Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[7] David C. Lee, Abhinav Gupta, Martial Hebert, and Takeo Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. Advances in Neural Information Processing Systems, 24, 2010.

[8] B. C. Russell and A. Torralba. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77, 2008.

[9] B. C. Russell and A. Torralba. Building a database of 3D scenes from user annotations. Computer Vision and Pattern Recognition, 2009.

[10] A. Torralba, B. C. Russell, and J. Yuen. LabelMe: Online image annotation and applications. Proceedings of the IEEE, 98, 2010.

[11] Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment: A modern synthesis. In Vision Algorithms: Theory and Practice, LNCS, pages 298-375. Springer Verlag, 1999.

[12] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. Computer Vision and Pattern Recognition, 2010.

[13] J. Xiao, B. C. Russell, and A. Torralba. Localizing 3D cuboids in single-view images. Neural Information Processing Systems, 2012.

[14] Jianxiong Xiao. Multiview 3D reconstruction for dummies. https://6.869.csail.mit.edu/fa12/lectures/lecture3D/SFMedu.pdf, October 2012.