
Multi-Volume High Resolution RGB-D Mapping
with Dynamic Volume Placement
by
Michael Salvato
S.B. Massachusetts Institute of Technology (2013)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2015
© Massachusetts Institute of Technology 2015. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
May 26, 2015
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
John Leonard
Samuel C. Collins Professor of Mechanical and Ocean Engineering
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Albert R. Meyer
Professor of Electrical Engineering and Computer Science
Chairman, Masters of Engineering Thesis Committee
Multi-Volume High Resolution RGB-D Mapping with
Dynamic Volume Placement
by
Michael Salvato
Submitted to the Department of Electrical Engineering and Computer Science
on May 26, 2015, in partial fulfillment of the
requirements for the degree of
Master of Engineering
Abstract
We present a novel method for creating high-resolution 3D reconstructions using
RGB-D cameras, such as the Microsoft Kinect. We demonstrate how such a technique
can be used both for large scale mapping and high-resolution object detection. Our
work consists of two core contributions. First, we present an algorithm to generate
3D reconstructions of arbitrary resolution, up to that of the limitations of the camera,
with a linear increase in computation time as a function of resolution. We then build
upon this method to provide a general infrastructure for scanning regions of variable
shape with variable resolution across the scanned regions. We then show quantitative
improvements in resolution in a virtual environment, show qualitative improvements
in resolution through images, and provide runtime performance details.
Thesis Supervisor: John Leonard
Title: Samuel C. Collins Professor of Mechanical and Ocean Engineering
Acknowledgments
Foremost I would like to thank my thesis advisor John Leonard. It was by working
in your group that I’ve become excited about robotics research, and your enthusiasm
for the work is contagious. You’ve been a supportive mentor, and I’m always amazed
at how much is accomplished when meeting with you. Thanks for all you’ve done for
me.
I’d like to thank the members of the Marine Robotics Group. First Ross has
been a tremendous help, both in my research and in learning about academia beyond
the walls of MIT. Thanks for spending the time to review this thesis and the IROS
paper, supporting me in my research, and giving me helpful advice along the way.
Thanks to Liam for organizing so many things in the group and always providing
helpful advice when asked. Thanks Dehann for welcoming me to the group when
I first arrived, and answering any questions I had in more depth than I expected.
Thanks Dave for taking the time to explain your research when I asked, teaching me
mathematical principles I didn’t yet know in a clearly articulated way. And thanks to
Janille, Ted, Sudeep, Thomas, and the rest of MRG; I’ve enjoyed spending time with
all of you, and will miss you.
Last I’d like to thank my dad. You’ve been incredibly supportive in everything
I’ve done, and helped me more than I can account for, in all aspects of my life. This
thesis wouldn’t exist without your support, and I can’t thank you enough for all the
opportunities I’ve had thanks to you.
Contents
1 Introduction
  1.1 Related Work
    1.1.1 Simultaneous Localization and Mapping
    1.1.2 Dense 3D Reconstruction
    1.1.3 Dense Methods for SLAM
  1.2 Dense Reconstruction for SLAM and Object Detection
  1.3 Overview
2 Higher Resolution Reconstruction Using Multiple Volumes
  2.1 Problem Definition
    2.1.1 Input and Output
    2.1.2 Metrics
    2.1.3 Assumptions
  2.2 Reconstruction Algorithm
    2.2.1 Initialization
    2.2.2 Depth Map Conversion
    2.2.3 Camera Tracking
    2.2.4 Memory Passing
    2.2.5 Volumetric Integration
    2.2.6 Raycasting
    2.2.7 Comparison to KinectFusion
  2.3 Benefits
  2.4 Summary
3 Dynamic Allocation
  3.1 Problem Definition
    3.1.1 Input and Output
    3.1.2 Assumptions
  3.2 Algorithm
    3.2.1 Insertion and Removal
    3.2.2 Dynamic Mapping for SLAM
  3.3 Benefits
  3.4 Summary
4 Results
  4.1 Hardware
  4.2 Runtime Performance
  4.3 Comparison with Baseline
  4.4 Qualitative Results
  4.5 Summary
5 Conclusion
  5.1 Summary of contribution
  5.2 Future work
    5.2.1 Large Scale Improvements
    5.2.2 Color Integration
    5.2.3 Integration with SLAM and object recognition
List of Figures
1-1 A simple pose graph representation.
1-2 Example of a structured light camera.
1-3 Example of space discretization in KinectFusion.
2-1 Diagram of volume placement in higher resolution reconstruction.
2-2 Flowchart for model creation.
2-3 Image generated from 4 1.5 meter TSDF subvolumes.
3-1 ICL-NUIM Livingroom dataset reconstruction.
3-2 Example of spaced out TSDF subvolumes.
3-3 Scan of a hallway with cube placement.
4-1 Example of point cloud registration using Cloud Compare.
4-2 Visual comparison between KinectFusion and higher resolution dynamic reconstruction in a desk scene.
4-3 Colored fireplace and mantel scan at Kendall Hotel.
List of Tables
4.1 Model to Ground-Truth Error
List of Acronyms
SLAM simultaneous localization and mapping
TSDF truncated signed distance function
ICP iterative closest point
Chapter 1
Introduction
The core of this work develops an algorithm to create high-resolution models of a 3D
environment. To motivate this work, we present a brief history of the simultaneous
localization and mapping (SLAM) problem, which can use our work for both object
detection and map generation. We then provide a history of general 3D scanning
techniques and those specifically for SLAM, to motivate the role of our work within
both.
1.1 Related Work
1.1.1 Simultaneous Localization and Mapping
The simultaneous localization and mapping (SLAM) problem [25] arises when a robot
in an environment must both create a map of the environment and
determine its own location within it. Many formulations of the SLAM problem
rely on the odometry measurements from the robot, as well as some measurements
of the environment using a sensor. Using these, the robot both attempts to derive a
map of the environment based on given metrics, as well as derive its placement within
this map.
In the case where all sensors have no noise, the SLAM problem can be reduced
to a pure mapping problem. However, as real sensors are noisy, substantial
work has been put into solving the SLAM problem. Current state of the art largely
focuses on probabilistic models of the position of the robot and beliefs of the map
structure. An overview of the history of such methods can be found in [47]. Pose
graph methods, first presented by [28], are often used in practice, as they provide a
robust representation and can be solved efficiently using sparse methods. State-of-
the-art methodology can be found in [9] and [20]. These methods involve
creating a graph where nodes consist of robot poses (x), landmarks (l), loop closures,
odometry measurements, and landmark measurements. Landmark measurements are
used to denote detected features of the environment. Loop closures indicate that
two poses have the same location. A graph is created by appropriately connecting
the landmarks, poses, measurements, and loop closures. The goal is then to find the
maximally likely values for x and l given the constraints in the graph, which can be
solved as a sparse coding problem (see [39]). The details of pose graph optimization
are not key to this thesis. However, we note a major step to maximize the consistency
of the graph is to recognize when landmarks measured at different times are in fact
the same landmark, so they can be treated as one node on the graph. By doing so the
system becomes more constrained. Our work is used to improve on the robustness of
landmark measurements, which can increase the number of constraints in the graph,
and reduce the probability of incorrect constraints. See Figure 1-1 for a graphical
representation of a pose graph.
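As a concrete illustration of the optimization described above, the following sketch solves a toy one-dimensional pose graph by linear least squares; the poses, the single re-observed landmark, and all measurement values are invented for illustration, and a loop closure would simply add another row of the same form.

```python
import numpy as np

# Toy 1D pose graph: unknowns are poses x0, x1, x2 and one landmark l0.
# Each row encodes one linear constraint; meas holds the measured value.
# State ordering: [x0, x1, x2, l0]. All numbers are illustrative.
rows, meas = [], []

def add_constraint(row, value, weight=1.0):
    rows.append(np.sqrt(weight) * np.array(row, dtype=float))
    meas.append(np.sqrt(weight) * value)

add_constraint([1, 0, 0, 0], 0.0, weight=100.0)  # prior anchoring x0 at the origin
add_constraint([-1, 1, 0, 0], 1.0)               # odometry: x1 - x0 ~= 1.0
add_constraint([0, -1, 1, 0], 1.1)               # odometry: x2 - x1 ~= 1.1
add_constraint([-1, 0, 0, 1], 2.0)               # landmark measured from x0: l0 - x0 ~= 2.0
add_constraint([0, 0, -1, 1], 0.0)               # the SAME landmark re-observed from x2

A, b = np.vstack(rows), np.array(meas)

# Maximum-likelihood estimate under Gaussian noise reduces to least squares.
# Real pose-graph solvers exploit the sparsity of A; dense lstsq suffices here.
estimate, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(["x0", "x1", "x2", "l0"], np.round(estimate, 3))))
```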
Given the above formulation, it is important to consider where landmark measurements come from. Historically, SLAM landmarks have been relatively simple geometric landmarks [1, 2], such as corners. In recent history there has been work toward
using objects as SLAM landmarks, such as [37], [6], and [43]. By using objects, landmarks
will ideally be more robust to noise, and provide more unique information about
a particular environment. To use objects as landmarks we need to develop robust
methods to parse out relevant objects. Because our work provides high-resolution
reconstructions, it might be extended to provide a robust method with which
to parse these landmarks.
Figure 1-1: A simple pose graph representation. Blue: Pose nodes. These are not
measurements themselves, but are solved for to determine the pose at each step.
Each one is connected to measurements made during that step. Green: Landmarks
in the environment. u: Odometry nodes. These represent odometry measurements
taken from the robot. c: loop closure nodes, these indicate that two locations are
believed to be the same. m: Landmark measurements. The landmark measurements,
odometry measurements, and loop closures are used to solve for the robot poses and
landmarks. Our work contributes to SLAM by providing the ability to make more robust
landmark measurements. Courtesy David Rosen [38].
1.1.2 Dense 3D Reconstruction
Recently 3D reconstruction has seen tremendous growth in robotics. This is largely
due to the growth of structured light techniques [44] for generating depth maps,
combined with the advent of the Microsoft Kinect structured light sensor. This led to
the commercial availability of cheap (≈ $200), robust structured light sensors. These
cameras found use in applications such as motion detection and object tracking,
largely for humans [35] [52]. While these frame based methods were sufficient for
some applications, we will focus on the growth of dense 3D object reconstruction
based techniques.
Perhaps the most famous 3D reconstruction technique is the KinectFusion algorithm, developed by Newcombe, Izadi, et al. [32] [19]. In KinectFusion the user
defines the size of a cubic volume in front of the camera to be scanned. This volume is then broken into many (≈ 512³) voxels, each of which contains a belief of its
distance to the nearest surface, which is updated each frame. See Figure 1-3. This
effectively smooths out the noise in the depth camera, allowing for a more accurate
Figure 1-2: Example of a structured light camera [27]. The projected IR dot grid
has a known pattern, which is deformed by the surface it is projected on. By reading
in the deformed grid with a camera, the surface can be reconstructed based on the
deformation in the projected pattern from the surface.
model to be recovered from the volume. All computation is run on the GPU for
real-time performance. However the size of the physical volume that can be scanned
is bounded by GPU memory in KinectFusion, and fast movement results in blurring.
These limitations have spawned other algorithms such as [53], which uses octrees to
map larger spaces to partially overcome the finite scope of KinectFusion, and [33], which
allows for accurate reconstruction even for slowly moving objects. The PointCloud
Library [41] has an open source implementation of KinectFusion known as KinFu,
which many methods build off of. Our work will build off of KinectFusion, and our
implementation is modified from KinFu.
After creating 3D reconstructions using techniques such as those described above,
objects can then be parsed from the reconstructions. Work such as [18] [42] [10] [13]
and [24] provide methods to parse objects from 3D data. The work we present in the
remainder of this thesis will allow for higher resolution reconstruction than current
methods, which can result in more robust object parsing by increasing the available
information.
Figure 1-3: Example of space discretization in KinectFusion. On the left are two
spherical objects which we wish to perform a 3D scan on. KinectFusion discretizes
the space into many voxels, each of which keep track of their believed distance to the
nearest surface on the objects. The size of the space to be scanned must be specified
at runtime and cannot be changed.
1.1.3 Dense Methods for SLAM
While our work will focus on dense methods using RGB-D sensors [5, 36, 46, 54, 55],
there has been significant prior work in monocular SLAM. Methods such as [7] were
developed before the prevalence of RGB-D cameras for SLAM reconstruction. Additionally, due to their high resolution and potentially lower cost and energy
requirements, work such as [12, 23, 34] is important to creating SLAM systems for a
wider range of environments, though not the focus of our work.
As noted above, KinectFusion [34] is limited by the size of its reconstruction region.
Work such as [16], one of the first RGB-D mapping systems, removes this size restriction. This work, and others based on 3D features [11], consequently throw away
much of the data, resulting in lower resolution reconstructions than those seen in
methods designed strictly for dense reconstruction.
Keyframe approaches such as [48] and the work of Meilland et al. [29–31] integrate new data into separate keyframes that exist in image space, allowing for high-resolution regions that expand as keyframes are added. Specifically,
Meilland et al. present work which combines keyframes and volumetric approaches
to mapping, in order to create real-time, robust, and scalable systems in a memory
efficient manner. However, in order to represent the 3D features of a small cluttered
scene, many small keyframes are required to properly capture the 3D features
of different views. This could potentially cause tracking drift or failure if each
keyframe does not have sufficient features to track.
There has been notable work to extend KinectFusion methods into maps of larger
size, such as the work of Roth and Vona [40], and the Kintinuous algorithm developed
by Whelan et al. [50]. The Kintinuous algorithm is designed to accomplish a similar
goal to our work, which is to allow for scanning of dynamically increasing regions.
Kintinuous overcomes the GPU memory restriction by cycling data out of working memory
as it exits the scope of the scan. Steinbrücker et al. [45] use an octree representation to
reduce computational costs, and keyframe methods as well as loop closures to reduce
drift. Wurm et al. [51] similarly use an octree representation to reduce data overhead,
and make use of CPU capabilities to additionally reduce processing overhead. Henry
et al. [17] use a series of KinectFusion-like volumes, which are stitched together to
create a cohesive map. They use plane detection to determine where to place volumes,
and they are explicitly scoped to be used in mapping. While these methods
provide robust mapping solutions, our work provides the ability for a user to specify
the resolution desired and parameters for mapping, resulting in a flexibility that no
other solution provides.
1.2 Dense Reconstruction for SLAM and Object Detection
This work will focus on a novel algorithm for dense reconstruction and SLAM. The
KinectFusion algorithm has physical scope coupled to reconstruction resolution, limited by the size of GPU memory. Functionally this results in a limited size that
KinectFusion can scan, which is fixed during a given scan, as well as a limited resolution. This is because KinectFusion keeps the entire volume being scanned in GPU
memory during runtime, in order to increase speed efficiency. In order to remove
this constraint, we pass memory in and out of the GPU, allowing for the reconstruction
resolution to be specified by the user.
Because the above extension passes memory in and out of GPU memory already,
we then expanded on this to keep representations of different areas of the environment
stored in CPU memory. We remove the restriction of KinectFusion that it can only map
a fixed-size region by introducing new scanning regions during runtime. We present
both a general framework for accomplishing this, and an example case used for 3D
mapping.
1.3 Overview
The remainder of the thesis is organized into the following chapters:
Chapter 2 gives a description of our algorithm for high-resolution dense scanning
in fixed sized environments.
Chapter 3 extends our algorithm to scan arbitrary regions in resolution determined
by a user given function. It also gives an example of how to use this in a general
mapping solution.
Chapter 4 provides qualitative results from running our algorithms, as well as quantitative metrics for resolution and runtime.
Chapter 5 addresses some limitations and future directions of our work, as well as
concluding remarks.
Chapter 2
Higher Resolution Reconstruction Using Multiple Volumes
The following describes a procedure for generating a high-resolution 3D reconstruction
of a region of fixed size. This is a first step towards developing a high-resolution
mapping system of dynamically increasing size. This technique allows for the creation
of arbitrarily high-resolution scans of fixed size, and is useful in robotic object
detection (see Section 2.3).
2.1 Problem Definition
We are trying to solve the problem of using a series of depth maps to best reconstruct
the 3D model that generates those depth maps. Assuming there is no noise and
the depth maps are drawn from a physically realizable model, we would be able
to generate a perfect model. However we assume the depth maps provide a noisy
representation of our model, which we then attempt to account for to create the most
accurate model possible. We attempt to maximize both the accuracy and the density of
points in our reconstruction. Here we present the problem more precisely based on
the input and output relationship, metrics, and assumptions.
2.1.1 Input and Output
Our algorithm is designed to output the 3D reconstruction of a fixed size cubic region
of physical space using a sequence of provided depth maps. The input to the algorithm
is therefore as follows:
• The linear dimension, l ∈ R+ , of the physical region to be scanned. Our region
is assumed to be a cube, so this is the side length of a cube to be scanned.
• The linear number of voxels, r ∈ N, of the model used during 3D reconstruction.
Our region is assumed to be a cube, so this is the side length of a virtual cube
of voxels.
• A series of n × m depth frames Di ∈ R+^{n×m}, indexed by frame i and provided by a camera. We assume
each frame is generated from a camera by a noisy, physically realizable model.
• The camera intrinsic matrix, K ∈ R3×3 . The camera intrinsics provide information about the mapping from the values in a depth frame to the physical
location of the surface that generated that reading [15].
During the scanning process a collection of voxels is stored as a cube of voxels.
Each voxel stores the signed distance to the nearest surface, which is then
truncated if this distance is too large. The collection of voxels is known as
the truncated signed distance function (TSDF) [8] volume. The number of voxels r3
in the TSDF volume acts as a metric for the resolution of the reconstruction. The
larger the number of voxels in a given physical space, the higher the resolution of
the model that is output. For example a low resolution water bottle may only have
enough points to appear as a cylinder, while a high-resolution one may be able to
represent grooves in the water bottle. The definition of resolution will be refined in
Section 2.1.2.
The depth frames are each a 2D matrix, with each value representing the depth to
the nearest surface of a given ray. The camera intrinsic matrix is used to determine
the angle of a given ray described by the depth map.
Each measurement in Di is itself noisy with an unknown noise function. The core
function of this algorithm is to integrate all measurements in Di to determine the
true generating model.
The algorithm outputs the following:
• A set V of vertices, consisting of elements v j ∈ R3 for j = [1, |V |]. Each vertex is
the physical location of a 3D point relative to the camera initialization location.
• A set N of normals, consisting of elements nj ∈ R3 for j = [1, |N |]. nj corresponds to the normal for vertex v j .
We note that the magnitude of V is upper bounded by r3 . That is, the number of
vertices in the reconstruction cannot exceed the number of voxels in the model used
for reconstruction.
2.1.2 Metrics
In the ideal case, each vertex represents the location of a surface, with location identical to that of a point on some surface relative to the initial camera location. To
capture the error in these distances, we define the mean distance from each vertex to
the nearest true surface as the accuracy of the model. In the trivial case the accuracy
of our model could be maximized by outputting only the single most accurate measurement. However this would not provide useful information about the model we
are attempting to reconstruct. As a second constraint in the ideal case, for each voxel
in the reconstructive model, if a surface crosses through that voxel, then a vertex is
created for that voxel. Increasing the number of vertices in the output will be defined
as increasing the resolution of the model.
The primary contribution of this thesis is to increase the resolution of 3D reconstructions over existing methods, without reducing accuracy relative to state-of-the-art models.
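As a minimal sketch of these two metrics, the snippet below computes accuracy as the mean distance from each reconstructed vertex to its nearest ground-truth surface sample and reports the vertex count as the resolution; the synthetic point sets here stand in for a real reconstruction and ground-truth model.

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy(vertices, ground_truth_points):
    """Mean distance from each reconstructed vertex to the nearest
    ground-truth surface sample; lower is better."""
    tree = cKDTree(ground_truth_points)
    distances, _ = tree.query(vertices)
    return distances.mean()

# Synthetic stand-ins: a planar "surface" patch and a noisy reconstruction of it.
rng = np.random.default_rng(0)
surface = rng.uniform(0.0, 1.0, size=(5000, 3)) * np.array([1.0, 1.0, 0.0])
reconstruction = surface[:1000] + rng.normal(scale=0.005, size=(1000, 3))

print(f"accuracy (mean error): {accuracy(reconstruction, surface):.4f} m")
print(f"resolution (vertex count): {len(reconstruction)}")
```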
2.1.3 Assumptions
• This algorithm assumes the depth frames come from a camera and are derived from
a physically realizable source.
• The scene contains features which can be tracked across frames. Because our
algorithm uses the iterative closest point algorithm [4], without features to track
the algorithm will fail.
• Static portions of the scene dominate relative to non-static ones. This is necessary to be able to track the camera movement relative to the scene.
• The movement of the camera between frames is small enough that features may
be tracked across frames.
• Surfaces are in the view of the camera long enough to smooth over the noise
generating the depth maps.
• The probability distribution on the noise of each depth value is centered on the
true value. Otherwise the accuracy is bounded by the center of the distribution
for each point.
2.2 Reconstruction Algorithm
Those who are familiar with the KinectFusion algorithm will notice that the steps
listed below follow the same structure as KinectFusion. The key modification in
our algorithm is that it allows the resolution of the reconstruction to be increased to a user-specified degree in exchange for computation time. This will be
further explained in Section 2.3.
2.2.1 Initialization
This algorithm allows the number of voxels r³ to be a user-tunable parameter.
However, future steps in our algorithm rely on being able to perform computation
Figure 2-1: Left: 2D representation of KinectFusion volume with 8 voxels in the
linear dimension scanning a bunny. Right: Four volumes each with eight linear
voxels scanning a bunny of the same size. We double the linear resolution by using
four volumes each with half the length.
on the TSDF volume on a GPU. In the case where the user-provided r results in too
many voxels for the entire volume to be stored in GPU memory, the TSDF volume
will be broken into a set of TSDF subvolumes H. Each subvolume h ∈ H acts as a
TSDF volume and is used in reconstruction, but the subvolumes are kept separate so that computation
can be performed on them independently.
We define rGPU to be the maximum number of linear voxels that can fit into GPU
memory. The linear number of TSDF subvolumes initialized is r / rGPU.
The initial location of the camera is initialized as (0, 0, 0). The back, center voxel
of the TSDF volume is additionally initialized to this location. In order to achieve
this, each subvolume is placed such that they form a cube with back center at (0, 0, 0).
Each subvolume stores the translation ht of its back center voxel relative to (0, 0, 0), its
linear number of voxels hr, and its linear physical dimension hl. For reasons mentioned in
Section 2.2.6, each subvolume overlaps its neighbors by two voxels, but this will be ignored for the
majority of the analysis, as it only results in a constant factor decrease in the total
volume size.
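The following sketch illustrates this initialization, assuming the two-voxel overlap is ignored as above: given the user-specified linear voxel count r and the per-GPU limit rGPU, it allocates a grid of subvolumes whose assembled cube has its back center at the origin, storing each subvolume's translation ht, linear voxel count hr, and linear physical dimension hl. The class layout and the exact placement of the back center are illustrative choices, not the thesis implementation.

```python
import math
from dataclasses import dataclass
import numpy as np

@dataclass
class Subvolume:
    h_t: np.ndarray          # translation of the subvolume's back corner, in voxels
    h_r: int                 # linear number of voxels
    h_l: float               # linear physical dimension (meters)
    tsdf: np.ndarray = None  # allocated lazily when the volume is uploaded to the GPU

def initialize_subvolumes(l, r, r_gpu):
    """Split an l-meter, r-voxel cube into a grid of subvolumes that each fit in GPU memory."""
    n = math.ceil(r / r_gpu)   # linear number of subvolumes
    h_r = r // n               # linear voxels per subvolume (two-voxel overlap ignored)
    h_l = l / n                # physical side length per subvolume
    subvolumes = []
    for ix in range(n):
        for iy in range(n):
            for iz in range(n):
                # Offset the grid so the assembled cube has its back center at (0, 0, 0):
                # centered in x and y, starting at z = 0.
                h_t = np.array([ix - n / 2.0, iy - n / 2.0, iz]) * h_r
                subvolumes.append(Subvolume(h_t=h_t, h_r=h_r, h_l=h_l))
    return subvolumes

volumes = initialize_subvolumes(l=3.0, r=1024, r_gpu=512)
print(len(volumes), "subvolumes;", volumes[0].h_r, "voxels,", volumes[0].h_l, "m each")
```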
2.2.2 Depth Map Conversion
For each frame i, the depth map Di is converted into a collection of vertices. A
given vertex in the coordinates of the current frame is vli(x, y) = K^{-1}[x, y, 1]^T Di(x, y) for
pixel coordinates x = [0, m − 1] and y = [0, n − 1]. This is done on the GPU
on a per-pixel basis. The normals are then computed for each vertex: nli(x, y) =
(vli(x + 1, y) − vli(x, y)) × (vli(x, y + 1) − vli(x, y)). We note here that the expected
number of pixels in the depth map is assumed to be much smaller than that which
could be supported by GPU memory.
Using the transformation matrix derived from the previous camera tracking iteration (see Section 2.2.3), the global position of each point is then derived. The global
transformation matrix is Ti−1 = [Ri−1 | ti−1], with translation vector ti−1 ∈ R3 and rotation matrix Ri−1 ∈ SO(3). The global vertex location is then vi(x, y) = Ti−1 vli(x, y)
and the global normal is ni(x, y) = Ri−1 nli(x, y). This puts our vertices and normals in the
global coordinate frame, which is necessary for future steps.
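A vectorized sketch of the depth map conversion step, assuming a pinhole intrinsic matrix K and a depth image in meters; it back-projects every pixel and forms normals from the forward differences given above (edge pixels wrap around here and would be masked in a real implementation). The intrinsic values are illustrative.

```python
import numpy as np

def depth_to_vertices_and_normals(depth, K):
    """Back-project an n x m depth image into camera-frame vertices and compute
    forward-difference normals, as in Section 2.2.2."""
    n, m = depth.shape
    K_inv = np.linalg.inv(K)
    xs, ys = np.meshgrid(np.arange(m), np.arange(n))         # pixel coordinates (x, y)
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous [x, y, 1]
    vertices = (pixels @ K_inv.T) * depth[..., None]         # v_l(x, y) = K^-1 [x, y, 1]^T D(x, y)

    # n_l(x, y) = (v_l(x+1, y) - v_l(x, y)) x (v_l(x, y+1) - v_l(x, y))
    dx = np.roll(vertices, -1, axis=1) - vertices
    dy = np.roll(vertices, -1, axis=0) - vertices
    normals = np.cross(dx, dy)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    return vertices, normals

# Illustrative intrinsics roughly matching a 640 x 480 structured light sensor.
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 1.5)   # a flat wall 1.5 m in front of the camera
V, N = depth_to_vertices_and_normals(depth, K)
print(V.shape, N[240, 320])        # per-pixel vertices and the unit normal at the image center
```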
2.2.3 Camera Tracking
The transformation matrix between the vertices derived from the previous raycasting step and the current depth map conversion step is determined. This is run by
performing a GPU-based iterative closest point (ICP) algorithm, the details of which
can be found in [32]. As a brief overview, the first step of the ICP algorithm is to
match each point in a source point cloud with those in a reference point cloud. A
transformation is then applied to minimize the least-squares error between the matching
points. Each point in the source point cloud is then matched against the reference,
and the procedure is repeated until some minimum error threshold is obtained. The
transformation from the initial source point cloud position to the final source point
cloud position is then the transformation matrix. The transformation is then applied
to T i−1 to determine the current global transformation matrix T i .
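The sketch below shows a generic point-to-point ICP of the kind summarized above, using nearest-neighbour matching and an SVD-based least-squares alignment; KinectFusion's GPU variant uses projective data association and a point-to-plane error instead, so this is only a conceptual stand-in, and the test data are synthetic.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t aligning src to dst (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    return R, dst_c - R @ src_c

def icp(source, reference, iterations=20):
    """Repeatedly match each source point to its nearest reference point and
    re-solve for the rigid transform, as outlined in Section 2.2.3."""
    R_total, t_total = np.eye(3), np.zeros(3)
    current = source.copy()
    tree = cKDTree(reference)
    for _ in range(iterations):
        _, idx = tree.query(current)
        R, t = best_rigid_transform(current, reference[idx])
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Synthetic check: a copy of the cloud moved by a small known rotation and translation.
rng = np.random.default_rng(1)
reference = rng.uniform(-1, 1, size=(500, 3))
a = 0.05
R_true = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
source = reference @ R_true.T + np.array([0.02, -0.01, 0.03])

R_est, t_est = icp(source, reference)
print("max alignment error:", np.abs(source @ R_est.T + t_est - reference).max())
```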
[Figure 2-2 diagram: Initialization; per frame: Depth Map Conversion, Camera Tracking; per volume: Memory Passing, Volumetric Integration, Raycasting.]
Figure 2-2: Flowchart for model creation. At the beginning of each dataset, TSDF
volumes are initialized. For each frame, depth map conversion is performed to convert the frame’s depth map values to global vertices. ICP is then performed during the
camera tracking step to obtain the camera transformation from the previous frame to
the current one. In order to perform volumetric integration and raycasting on each
volume, a memory passing step is performed to pass a volume into GPU memory.
Volumetric integration and raycasting are then performed to update our TSDF volumes and obtain the raycasted depth map. Red: Performed once per dataset. Blue:
Performed once per frame. Green: Performed on each volume for each frame.
2.2.4 Memory Passing
To store the amount of data required to have multiple TSDF subvolumes, they generally need to be stored in CPU memory, as we constructed the subvolumes such
that only one may fit in GPU memory. Generally a computer has many times more
CPU memory than GPU memory. Because of this, and the benefits of virtual memory, we can store more volumes in CPU memory than in GPU memory. Prior to each volumetric integration step, one TSDF subvolume is uploaded to
the GPU for volumetric integration and raycasting. After raycasting, this volume is downloaded back to CPU memory. This happens for each subvolume, upon which
the final raycasting step is complete for that frame.
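A schematic of the per-frame control flow implied by the memory passing step: each subvolume is uploaded, integrated, raycast, then downloaded again so that only one subvolume occupies GPU memory at a time. The FakeGpu object is a placeholder standing in for the CUDA transfers and kernels of the real KinFu-based implementation, not an actual API.

```python
import numpy as np

class FakeGpu:
    """Placeholder for the CUDA side of the pipeline; it only records the call order
    so that the control flow below can run without a GPU."""
    def __init__(self):
        self.log = []
    def upload(self, v):                 self.log.append(("upload", id(v)))
    def integrate(self, v, depth, pose): self.log.append(("integrate", id(v)))
    def raycast(self, v, pose):          self.log.append(("raycast", id(v)))
    def download(self, v):               self.log.append(("download", id(v)))
    def raycast_result(self):            return self.log

def process_frame(depth, pose, subvolumes, gpu):
    """One frame of the multi-volume pipeline (Figure 2-2): each subvolume is moved
    onto the GPU, processed, and moved back so only one is resident at a time."""
    for volume in subvolumes:
        gpu.upload(volume)                  # host -> device copy of the TSDF grid
        gpu.integrate(volume, depth, pose)  # Section 2.2.5: update TSDF values
        gpu.raycast(volume, pose)           # Section 2.2.6: accumulate shared depth/normal maps
        gpu.download(volume)                # device -> host copy back to CPU memory
    return gpu.raycast_result()             # used for ICP on the next frame

subvolumes = [object() for _ in range(4)]
result = process_frame(np.zeros((480, 640)), np.eye(4), subvolumes, FakeGpu())
print(len(result), "GPU operations for", len(subvolumes), "subvolumes")
```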
2.2.5 Volumetric Integration
During this step we update our TSDF subvolumes, which store our beliefs of where
surfaces are relative to our global origin. Each subvolume stores a set of values Dk,i(tl),
where tl = (tl^x, tl^y, tl^z) ∈ R3 represents the voxel coordinate, and Dk,i ∈ R^(r³) stores the
believed distances to the nearest surface for each voxel in the kth subvolume on the
ith frame.
Each GPU thread is assigned a (tl^x, tl^y) location, for tl^x = [0, r − 1] and tl^y =
[0, r − 1]. Each thread then iterates over each tl^z = [0, r − 1] for its given (tl^x, tl^y).
Each voxel’s distance to the nearest surface is determined using its tl coordinate.
However because each subvolume is translated relative to the origin by translation
ht, the coordinate used to calculate distance is then ts = tl + ht. From here the
physical model distance is calculated as dm = (l/r)·||ts||, which represents the distance
from the camera origin to the voxel.
In order to determine the measured distance for that frame, each ts voxel position
is then back projected to pixel coordinates to get an (x, y) location. Using this
the distance to the nearest surface is calculated as dist = Di (x, y) − dm . dist is
then constrained to be within some truncation bounds proportional to the physical
distance of a voxel in order to prevent unnecessary computation. From here we can
calculate the updated TSDF value Dk,i(tl) = (wi−1 Dk,i−1(tl) + wi · dist) / (wi−1 + wi) for each voxel, which
is the belief of the distance from that voxel to the nearest surface. wi is a linearly
increasing weighting function, up to some maximum.
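A minimal sketch of the weighted TSDF update for a single voxel, following the running-average formula above; the truncation bound and the cap on the weight are illustrative values, not those used in the implementation.

```python
import numpy as np

TRUNCATION = 0.03   # truncation bound in meters (illustrative)
MAX_WEIGHT = 64.0   # cap on the per-voxel weight (illustrative)

def update_voxel(tsdf_prev, weight_prev, measured_depth, voxel_depth):
    """Weighted running-average TSDF update for one voxel:
    D_i = (w_{i-1} D_{i-1} + w_i * dist) / (w_{i-1} + w_i)."""
    dist = measured_depth - voxel_depth              # signed distance along the ray
    dist = np.clip(dist, -TRUNCATION, TRUNCATION)    # truncate far-from-surface values
    w_i = 1.0                                        # per-frame weight increment
    tsdf_new = (weight_prev * tsdf_prev + w_i * dist) / (weight_prev + w_i)
    weight_new = min(weight_prev + w_i, MAX_WEIGHT)
    return tsdf_new, weight_new

# A voxel 1.00 m from the camera observing noisy depth readings of a surface at ~1.02 m.
tsdf, weight = 0.0, 0.0
for measurement in [1.021, 1.018, 1.023, 1.019]:
    tsdf, weight = update_voxel(tsdf, weight, measurement, voxel_depth=1.00)
print(round(tsdf, 4), weight)   # settles near +0.02: the voxel sits 2 cm in front of the surface
```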
2.2.6 Raycasting
During this step we derive the depth map constructed from our TSDF volume. As
mentioned above, the raycasting step is run once separately per subvolume, each after
the matching volumetric integration step.
Prior to the memory passing step, a pixel distance map vr ∈ R+ and a normal map
nr ∈ R3, each of size m × n, are initialized. Based on the camera intrinsics, a ray can
be cast from each pixel location (x, y) in the camera space through the TSDF volume.
As each ray iterates through the TSDF volume space, it checks the value of the TSDF
voxel it is currently in. If the ray passes through a change of sign between voxels, it is
noted to have crossed through a surface. By trilinearly interpolating over the voxels
on either side of the zero-crossing, the location of the surface is determined. Because
the size of each voxel is known, the raycasted voxel location vr(x, y) is determined.
Similarly, the derivative of the TSDF at the zero-crossing is used to determine the
normal value nr (x, y) for this pixel.
However because we are only working on a given subvolume, a few modifications
need to be made. While raycasting, calculation is only done while the ray is within
the TSDF subvolume, or until it finds a zero-crossing. As a ray is cast from a camera
location, its voxel position in global space can be described as a position tr ∈ R3 . The
voxel position in volume coordinates is then t′r = tr − ht. We then read the values of the
voxels at t′r when searching for zero-crossings. When determining the global physical
locations g of each voxel, the translations must be added back to our new indices,
to accurately reflect the physical distance the ray travels to our translated volume.
This gives us g = hl · (t′r + ht). When a zero-crossing is found, the value is trilinearly
interpolated over neighboring voxels to determine vr(x, y), and the derivative of the
TSDF at the zero-crossing is calculated to determine nr(x, y).
It is worth noting that all raycasting steps modify the same map. This means
Figure 2-3: Image generated from 4 1.5 meter TSDF subvolumes.
each pixel coordinate is searched over by each subvolume, even though all edit the
same map. In order to guarantee that we use the correct value in our voxel map,
we take the value with the shortest raycast distance from the local camera location.
This guarantees that we always take the value of the surface visible from the current
camera location. To trilinearly interpolate successfully, each subvolume must overlap
its neighbors by 2 voxels.
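The sketch below marches a single ray through a dense TSDF grid and locates the zero-crossing. For brevity it interpolates linearly along the ray rather than trilinearly over the eight neighbouring voxels, and it operates on one full-resolution grid rather than translated, overlapping subvolumes; the toy planar scene is synthetic.

```python
import numpy as np

def raycast(tsdf, origin, direction, voxel_size, step=0.5, max_steps=2000):
    """March one ray through a TSDF grid (values in meters) and return the
    zero-crossing position in world coordinates, or None if no surface is hit."""
    direction = direction / np.linalg.norm(direction)
    pos = origin / voxel_size                       # work in voxel units
    prev_val, prev_pos = None, None
    for _ in range(max_steps):
        idx = tuple(np.floor(pos).astype(int))
        if any(i < 0 or i >= s for i, s in zip(idx, tsdf.shape)):
            return None                             # ray left the volume
        val = tsdf[idx]
        if prev_val is not None and prev_val > 0 >= val:
            # Sign change: interpolate the crossing point along the ray.
            alpha = prev_val / (prev_val - val)
            crossing = prev_pos + alpha * (pos - prev_pos)
            return crossing * voxel_size
        prev_val, prev_pos = val, pos
        pos = pos + step * direction
    return None

# Toy volume: a planar surface at z = 1.0 m inside a 2 m cube with 128^3 voxels.
r, l = 128, 2.0
voxel_size = l / r
zs = (np.arange(r) + 0.5) * voxel_size
tsdf = np.broadcast_to(1.0 - zs, (r, r, r)).copy()  # signed distance to the z = 1.0 plane
hit = raycast(tsdf, origin=np.array([1.0, 1.0, 0.0]),
              direction=np.array([0.0, 0.0, 1.0]), voxel_size=voxel_size)
print(np.round(hit, 3))    # lands near (1.0, 1.0, 1.0)
```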
2.2.7 Comparison to KinectFusion
We show that when creating a cube of multiple volumes, the results are identical
to those of running the KinectFusion algorithm on one volume with the same total
number of voxels. Because the volumetric integration and raycasting steps are the
only ones modified, if we can show they give identical outputs to KinectFusion, then
the entire algorithm also does so. For the sake of simplicity we will ignore the overlapping
voxels in our volumes when discussing the number of voxels, which is accounted for
by requiring that each volume be larger by the size of 2 voxels.
During the volumetric integration step, each voxel is independent of each other
voxel. That is, the calculation performed only depends on the TSDF value of the
given voxel, and the depth of the ray back projected from that voxel. All modifications
done by our algorithm guarantee that the voxel is associated with the correct physical
location, but do not affect the TSDF value or the depth value. Due to the fact that
all computation depends on each voxel independently, that voxels exist in separate
volumes does not affect the correctness of the computation.
In the raycasting step, unlike the volumetric integration step, the voxels are not
independent. Due to trilinear interpolation, each voxel depends on each neighboring
voxel. However given that the volumetric integration step is identical to KinectFusion, upon overlapping the volumes, the overlapping voxels will have identical values.
Therefore during the trilinear interpolation step, every voxel for which trilinear interpolation is possible will have a set of neighbors identical to that of KinectFusion. By
then taking only the shortest-distance zero-crossing found by each ray, the depth and normal
maps are identical to those in KinectFusion. With this, the outputs of volumetric integration and raycasting are both identical to those of KinectFusion, so our algorithm
will produce results identical to KinectFusion, assuming that an equal resolution
KinectFusion volume could fit in GPU memory.
2.3 Benefits
This algorithm is designed explicitly to create high-resolution volumes of a fixed area.
By using the TSDF volume techniques of KinectFusion we create a more accurate
3D model than any individual frame. Then by allowing for a number of subvolumes,
and thereby voxels, only bounded by disk space, we allow for resolution functionally
bounded only by the noise in the camera which generates the depth frames. This
is compared to the resolution bounded by GPU size in KinectFusion. In exchange
we increase the processing time of the algorithm, which will be further explored in
Chapter 4.
2.4 Summary
This chapter has presented a method to generate higher resolution 3D reconstructions
in exchange for increased computation time, by using multiple TSDF subvolumes
passed in and out of GPU memory. The next chapter presents a method to expand
33
the use of multiple TSDF subvolumes to map different regions of an environment
at different resolutions, and dynamically allocate subvolumes based on user specified
parameters.
Chapter 3
Dynamic Allocation
In the previous chapter we developed a method for generating high-resolution models by using multiple subvolumes. Building off of this infrastructure, we present a
method to dynamically allocate these subvolumes during reconstruction. This allows
for the reconstruction of a dynamically growing region, and the ability to decide the
reconstruction quality of different regions at runtime.
3.1 Problem Definition
Our goal is again to create a 3D reconstruction from a noisy measurement model while
optimizing over accuracy and resolution. However, in this case we present the additional requirement that the resolution of the model in a given region be variable based
on user input. Here we present the problem more precisely based on the input and
output relationship, metrics, and assumptions.
3.1.1 Input and Output
The output of dynamic allocation is identical to that of high-resolution
reconstruction:
• A set V of vertices, consisting of elements v j ∈ R3 for j = [1, |V |]. Each vertex is
the physical location of a 3D point relative to the camera initialization location.
Figure 3-1: Top: A colored model of the ICL-NUIM Livingroom dataset [14]. Bottom: A reconstruction of the same scene with segments showing regions mapped by
a volume. Each volume has the maximum resolution that can be loaded into GPU
RAM. Volumes are dynamically placed and removed as the camera passes over the
scene.
• A set N of normals, consisting of elements nj ∈ R3 for j = [1, |N |]. nj corresponds to the normal for vertex v j .
During runtime the algorithm keeps a set H of active subvolumes that it is processing. Each subvolume h ∈ H has three parameters: hl, the linear physical dimension,
hr , the linear number of voxels, and ht , the translation relative to the camera origin.
A volume can be added to the set at any point during execution, and will be processed
in subsequent frames. To remove a volume from processing, it is removed from the
set of volumes. The point cloud for the removed volume can then be downloaded to
main memory. Thus the input parameters are:
• A series of n × m depth frames Di ∈ R+^{n×m}, indexed by frame i and provided by a camera. We assume
each frame is generated from a camera by a noisy, physically realizable model.
• The camera intrinsic matrix, K ∈ R3×3 . The camera intrinsics provide information about the mapping from the values in a depth frame to the physical
location of the surface that generated that reading.
• f(·), which can take in any data available during a given frame. It returns
a list of volume parameters (hp^1, ..., hp^n) for the volumes to be added, where
hp = (hl, hr, ht) for some subvolume.
• g(h, ·), which is run for all h ∈ H. The function removes the volume from
H if the conditions it specifies are met. It can additionally take in any other
parameters available in the system.
By defining f and g, this algorithm can be extended to a multitude of use cases.
For example, the work of Karpathy et al. [21] could be applied to create higher
resolution models for areas with object-like properties; a minimal example of an (f, g) pair is sketched below.
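The sketch requests a single subvolume whose translation is snapped to the voxel grid in front of the current camera position (f), and retires subvolumes that have drifted too far from the camera (g). The parameter names mirror (hl, hr, ht), but the placement policy itself is invented for illustration.

```python
import numpy as np
from dataclasses import dataclass

@dataclass(frozen=True)
class VolumeParams:
    h_l: float   # linear physical dimension (meters)
    h_r: int     # linear number of voxels
    h_t: tuple   # translation relative to the camera origin (voxels)

def f(camera_position, voxel_size, h_l=1.0, h_r=512):
    """Insertion function: request one subvolume whose back corner sits at the
    current camera position, snapped to the voxel grid."""
    h_t = tuple(int(v) for v in np.round(np.asarray(camera_position) / voxel_size))
    return [VolumeParams(h_l=h_l, h_r=h_r, h_t=h_t)]

def g(h, camera_position, voxel_size, max_distance=3.0):
    """Removal predicate: retire a subvolume once its translation is farther
    than max_distance meters from the camera."""
    center = np.asarray(h.h_t) * voxel_size
    return np.linalg.norm(center - camera_position) > max_distance

# One frame of bookkeeping with these callbacks.
voxel_size = 1.0 / 512
H = set(f(np.zeros(3), voxel_size))                 # initial subvolume at the origin
camera = np.array([4.0, 0.0, 0.0])                  # camera has since moved 4 m
H = {h for h in H if not g(h, camera, voxel_size)}  # removals
H |= set(f(camera, voxel_size))                     # additions
print(sorted(h.h_t for h in H))
```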
3.1.2 Assumptions
The assumptions of dynamic allocation include the assumptions of high-resolution
reconstruction:
• This algorithm assumes the depth frames come from a camera and are derived from
a physically realizable model.
• The scene contains features which can be tracked across frames. Because our
algorithm uses the iterative closest point algorithm [4], without features to track
the algorithm will fail.
• Static portions of the scene dominate relative to non-static ones. This is necessary to be able to track the camera movement relative to the scene.
• The movement of the camera between frames is small enough that features may
be tracked across frames.
• Surfaces are in the view of the camera long enough to smooth over the noise
generating the depth maps.
• The probability distribution on the noise of each depth value is centered on the
true value. Otherwise the accuracy is bounded by the center of the distribution
for each point.
We additionally assume:
• H is never empty after initialization. If it were, ICP could not be performed,
and the camera could not localize.
• The collection of subvolumes at any given time has features such that ICP can
be performed adequately.
• All h ∈ H have a number of voxels that can fit into GPU memory.
• f (·) and g(h, ·) are designed such that a given surface will exist in memory
long enough to develop an accurate model of that surface. This is because
smoothing over the noise in our model takes a number of measurements, so if
we pass volumes out too quickly the reconstruction will be poor.
3.2 Algorithm
Figure 3-2: Volumes can be placed anywhere relative to the initial camera location.
Here we scan with three spaced out volumes, and choose not to allocate a volume in
the bottom left corner for demonstration purposes.
3.2.1 Insertion and Removal
After the final raycasting step an additional insertion and removal step is added.
During this step the insertion function f(·) is first run. Each subvolume to be added
is specified with hl, the linear physical dimension, hr, the linear number of voxels,
and ht, the translation relative to the camera origin. Each subvolume specified by
f (·) is then marked for addition.
g(h, ·) is then run for each h ∈ H. Note that this function can use any parameter
of the system, including the set of subvolumes marked for addition. If a subvolume
in H has identical parameters to one marked for addition, that proposed volume
will not be marked for addition. This provides a mechanism for f(·) and g(h, ·) to
interact. Each subvolume specified by g(h, ·) is then marked for removal.
Because each TSDF volume can be described as a point cloud, the volumes marked
for deletion have their point cloud representations downloaded to CPU memory. The
volumes marked for addition are then added to H, and used during the volumetric
integration and raycasting steps of future iterations.
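A sketch of the insertion-and-removal bookkeeping itself, run after the final raycast of each frame. Here subvolumes are identified by their parameter tuples, proposed additions that duplicate an existing volume are dropped so that accumulated TSDF data is not discarded, and the point-cloud download is a placeholder; this reflects one reading of the interaction rule above rather than the exact implementation.

```python
def insertion_and_removal_step(H, proposed_additions, should_remove, download_point_cloud):
    """Per-frame volume bookkeeping (Section 3.2.1).

    H                    -- current set of subvolume parameter tuples
    proposed_additions   -- output of f(.): parameters of volumes to add
    should_remove        -- wraps g(h, .): True if h should be retired
    download_point_cloud -- placeholder for saving a retired volume's point cloud to CPU memory
    """
    # A proposed volume identical to one already in H is not re-added, which
    # would otherwise discard the TSDF data it has accumulated.
    additions = set(proposed_additions) - H
    removals = {h for h in H if should_remove(h)}

    for h in removals:
        download_point_cloud(h)          # retired volumes leave as point clouds
    return (H - removals) | additions

# Tiny demonstration: volumes identified by their grid translations.
H = {(0, 0, 0), (1, 0, 0)}
new_H = insertion_and_removal_step(
    H,
    proposed_additions={(1, 0, 0), (2, 0, 0)},   # f re-proposes an existing volume and a new one
    should_remove=lambda h: h == (0, 0, 0),      # g retires the volume behind the camera
    download_point_cloud=lambda h: print("downloading", h),
)
print(sorted(new_H))   # (1, 0, 0) is kept rather than re-added; (2, 0, 0) is new
```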
3.2.2 Dynamic Mapping for SLAM
To perform large-scale mapping we propose a tool for dynamically placing subvolumes
in areas expected to have a high density of surface vertices. There are three tunable
parameters in this system: the linear number of voxels in each TSDF subvolume
dv, the linear physical dimension of each subvolume dl, and the maximum number of
subvolumes to exist at one time, n. Our system dynamically places n subvolumes h,
each with hl = dl and hr = dv, in locations based on the following process.
We define the origin of the global coordinate frame as the initial location of the
camera. Assuming voxels of fixed size, we can view this coordinate system as having
length units in voxels rather than a physical length. We allow subvolumes to be placed
on this coordinate system such that their translations from the origin are constrained
to ht = k · (dv − 2), where k ∈ Z3. This allows the subvolumes to be placed on a 3D
grid, where each cell is the size of a subvolume. The (−2) term is so that adjacent
cubes do not create a gap in the raycasting result and will be excluded for simplicity
of notation. We do not allow multiple subvolumes to have the same translation, so a
subvolume’s placement in this grid defines a unique property of the subvolume.
During the raycasting operation, for each pixel p = Di(x, y) in the depth frame,
we calculate the endpoint e = (ex, ey, ez) ∈ R3 of the ray generated from p, based
on the depth rd and direction rv = (rv^x, rv^y, rv^z) ∈ R3 of that ray. Using this we calculate
the location in global coordinates to which that depth value maps:

ex = rd / √((rv^z/rv^x)^2 + (rv^y/rv^x)^2 + 1),   ey = (rv^y/rv^x) · ex,   ez = (rv^z/rv^x) · ex
We then convert each endpoint to a coordinate l which is constrained to ht = k · dv, as the subvolume locations are:

l = (ex, ey, ez)/dl − ((ex, ey, ez)/dl mod dv)
Any ray with endpoint l maps uniquely to a subvolume. We define c(l) as the
number of rays that end at the subvolume mapped to l. The set L is defined as all
subvolume locations with non-zero c(l). Our insertion function takes in the current
subvolume set H and all c(l) for l ∈ L, so the function signature is f (H, L, c(L)). We
then keep only the subvolumes with the n largest c(l) values. Our removal function
g(h, ·) removes a subvolume which has been marked as having fallen out of the top
n subvolumes. In order to prevent volumes from switching in and out too often, we
also set a threshold that a subvolume to be added needs to have a count larger than
the lowest count subvolume by some constant factor.
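A sketch of this placement heuristic: depth-ray endpoints are binned into subvolume-sized grid cells, and the n cells with the largest counts determine which subvolumes stay live. The two-voxel overlap and the switching-hysteresis threshold are omitted, and the synthetic endpoints are invented for illustration.

```python
import numpy as np
from collections import Counter

def top_n_subvolume_locations(endpoints, d_l, n):
    """Bin ray endpoints (meters, camera-origin frame) into subvolume-sized
    cells of side d_l and return the n cell origins with the highest counts."""
    cells = np.floor(endpoints / d_l).astype(int)   # which subvolume each endpoint falls in
    counts = Counter(tuple(int(c) for c in row) for row in cells)
    return [cell for cell, _ in counts.most_common(n)]

# Synthetic endpoints: most rays hit a wall ~2 m ahead, a few hit the floor nearby.
rng = np.random.default_rng(2)
wall = np.column_stack([rng.uniform(-1, 1, 800), rng.uniform(-0.5, 0.5, 800), np.full(800, 2.0)])
floor = np.column_stack([rng.uniform(-1, 1, 200), np.full(200, -1.0), rng.uniform(0.5, 1.5, 200)])
endpoints = np.vstack([wall, floor])

live = top_n_subvolume_locations(endpoints, d_l=1.0, n=4)
print(live)    # the cells covering the wall dominate the live set
```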
3.3 Benefits
Our method scans environments of dynamically increasing size, as subvolumes will
be continually inserted and removed from H. This provides a key utility over dense
3D scanning algorithms which are bounded to a fixed region of space.
While high-resolution reconstructions are possible with methods such as Kintinuous [50], such methods do not provide dynamic volume allocation. While doing a scan, the
user needs to understand where the TSDF volume exists in the scene while scanning,
and scan based on that. In contrast our algorithm allows for high-resolution reconstruction of all objects that were in the view of the camera, and does not require
understanding where scanning boundaries are.
Our dynamic mapping method is just one function for dynamic allocation. Our
framework allows any metric to be used in deciding where to place volumes. For example,
if there were certain objects of interest for which higher resolution scanning would
Figure 3-3: Top: Scan of hallway. This scan was taken while simply walking forward,
yet still captures both walls and the floor at high resolution. 2 m subvolumes were
dynamically allocated such that at most 16 were live at one time. This image consists
of 24 subvolumes. Bottom: Volume placement on a short hallway scene. We see no
subvolumes allocated in the region between the walls.
be useful, volumes could be placed on those objects if detected. Additionally other
metrics, such as the amount of clutter in a given region, could be used for dynamic
mapping.
3.4 Summary
This chapter has presented a method to allocate TSDF subvolumes to different regions
of a space, allowing for dynamically growing regions of variable resolution. In the
next chapter we explore the time and accuracy tradeoffs of higher resolution dynamic
placement, as well as qualitative results in the form of model visualizations.
Chapter 4
Results
The key benefit of our algorithm is the ability to increase the resolution of our model
without sacrificing accuracy, but at the cost of computation time. Therefore to quantitatively evaluate our algorithm we use two metrics:
• Runtime performance of the algorithm as a function of the number of TSDF
volumes being used.
• Comparison against the PCL implementation of KinectFusion (KinFu) [41] using benchmarks developed by [14].
We further discuss qualitative results of our algorithm.
4.1 Hardware
Our computer uses an nVidia GeForce GTX 880M graphics card with 8GB GPU
memory, along with a 2.7GHz Intel Core i7 4800MQ processor. It has 32GB main
memory, along with a 1TB SSD. We allocated an additional 64GB of swap space
in order to guarantee the function of our algorithm. We used an ASUS Xtion Pro
camera for all experiments.
Table 4.1: Model to Ground-Truth Error

                          traj0      traj1      traj2
  KinFu Volume
    vertices            1497300    1465488     506229
    mean (m)             0.0309     0.0534     0.0213
    median (m)           0.0215     0.0375     0.0134
    std (m)              0.0282     0.0542     0.0254
  Eight Dense Volumes
    vertices            7318191    7354923    2192196
    mean (m)             0.0335     0.0537     0.0222
    median (m)           0.0253     0.0368     0.0152
    std (m)              0.0315     0.0545     0.0250
4.2 Runtime Performance
The fundamental tradeoff of our algorithm is speed versus resolution. As the number
of volumes we use increases we expect the time transferring data from GPU memory
to CPU memory to account for the majority of the processing time. Because this
happens for every volume on each frame, we expect the time to be linear with the
number of volumes, until the computer switches to swap memory.
We ran the following experiment to validate our runtime expectations. On a
pre-recorded 100 frame real world dataset we dynamically allocated a fixed number
of volumes with 512 linear voxels on the first frame. For 2 to 7 volumes we used
1.5 m volumes. In order to guarantee there existed unique data for each volume, we
decreased the size of each volume as the number of volumes increased.
Our results show that the time per frame is linear in the number of volumes, with
each volume adding approximately 0.17 s per frame. By running the algorithm with 48
volumes we demonstrate the ability to generate a 48x increase in resolution compared
to a single volume. We also found that both PCL KinFu and our implementation ran
at approximately 11 frames per second for one volume. It is worth noting that this
is less than the expected 30 frames per second because the images were read out of
memory instead of a live camera feed.
Figure 4-1: Example of point cloud registration using Cloud Compare. The white
model is ground truth. The color model is our reconstruction. The colors represent
a heat map of distances from ground truth, where blue indicates small error, and red
indicates large error.
4.3 Comparison with Baseline
We compare our algorithm to PCL KinFu on the synthetic dataset developed by
Handa et al. [14]. This dataset provides a series of RGB and depth frames that
are structured to simulate camera movement through a virtual living room. Also
provided is an .obj model for each dataset as ground truth. In order to accurately
simulate a depth camera, Handa et al. added noise as described in [3] to each frame.
For our experiments we limited the range of camera scenes to those contained by
a 6 m KinectFusion volume. We ran each mapping algorithm on
a scene in order to generate a 3D map. We then use the open source Cloud Compare tool (http://www.danielgm.net/cc/)
to align the reconstruction to the ground truth model. This provides the translation
of each point in the reconstruction relative to the nearest mesh triangle in the ground
truth model. Using the statistics provided by [14], we then determine the mean,
median, and standard deviation for each reconstruction’s translation, the results of
Figure 4-2: Top: Model of a desk generated by a 4m KinectFusion volume. Here we
can clearly see the triangles that generate the mesh, and it is difficult to distinguish
the keyboard keys. Bottom: Model of desk generated by dynamically allocated
0.5 m volumes. The triangles are smaller and thus less noticeable, and the delineation
between keyboard keys becomes clear. In order to map the space we needed to
translate the volume over a larger space, but zoom in on the laptop to highlight the
visible resolution difference.
which can be seen in Table 4.1. For traj0 and traj1 we ran PCL KinFu with a 6 m
volume size, and our algorithm with a set of 8 volumes forming a cube of 6 m (3 m per
volume). For traj2 we ran PCL KinFu with a 3 m volume size, and our algorithm
with a set of 8 volumes forming a cube of 3m (so 1.5m per volume). All trajectories
used 512 linear voxels per volume.
Table 4.1 shows that our algorithm generates on average 4.75 times as many vertices, indicating a 4.75-fold increase in resolution by our definition. Reconstruction
accuracy is in line with KinFu, with the average mean error increased by 4.2%. Due
to the relatively small number of points used to generate the synthetic meshes, we
do not expect to see an improvement in model reconstruction statistics relative to
KinectFusion, even for large volumes: because the synthetic dataset consists almost
entirely of flat surfaces, reconstruction resolution has only a minor impact on error. Given
the small difference in quality and the large increase in the number of vertices, we believe this
highlights the algorithm’s ability to generate high-resolution datasets without reducing accuracy. As a matter of future work it would be useful to generate a dense
synthetic dataset and determine if our model reduces mesh reconstruction error.
4.4 Qualitative Results
For small datasets we found small-volume dynamic allocation to be effective. Figure
4-2 shows an office scene scanned with a single 4 m volume, compared
to a series of dynamically allocated 0.5 m volumes. We see that the indentations
between the keys of the computer keyboard are more visible for the small volumes.
We tested our dynamic volume allocation with 3m volumes while walking down
a hallway, as seen in Figure 3-3. To obtain this dataset the camera was not rotated
as it was translated through the hallway, highlighting the ability to obtain large
scale datasets easily. We found that large datasets like this often had drift due to
accumulated error in ICP, which can be resolved by improved visual odometry (see
Section 5.2.1).
The fireplace seen in Figure 4-3 highlights the power of our algorithm. The figure
Figure 4-3: Top: Fireplace scene at the Kendall Hotel in Cambridge, MA. Bottom:
Zoom of two busts from the same model. Model was generated using dynamically
placed 0.75 m volumes. This highlights the ability to create models of relatively large
spaces, while maintaining a high level of detail. Unfortunately some colors appear
pasted behind objects. This is due to a misalignment in the colors with the 3D model,
and is not an issue with the 3D model itself.
was generated by a 40 second scan of the fireplace and mantelpiece. By acting on a
"medium"-scale scan such as this, we do not see the sensor drift present in hallway
scans. Additionally this highlights how a large area can be scanned in high resolution,
while we still see the detail in the individual busts.
4.5 Summary
This chapter has presented quantitative measures of the accuracy and time tradeoffs
of higher resolution reconstruction. We found a linear increase in computation time
relative to resolution, but little loss in accuracy with increased resolution. The next
chapter provides a summary of our work and areas for future work.
Chapter 5
Conclusion
5.1 Summary of contribution
This thesis makes two significant contributions towards 3D mapping for robotics.
First, it provides a framework by which a user can arbitrarily increase the resolution of a reconstructed model in exchange for computation time. This allows
a user to create high-resolution scans of relatively small objects, which can result in
more precise object detection for methods such as object recognition through change
detection [13].
On top of this we built a general framework for dynamic placement of subvolumes, allowing the scanned region to grow dynamically at runtime. This has a multitude of potential uses. In the case of SLAM, it allows mapping of an environment of unbounded size. Because of the computational cost, this is most beneficial for mapping a region in advance, which is useful for techniques such as those in [26]. It also produces high-resolution maps while requiring minimal work on the part of the user: KinectFusion and derived works require that the user be able to mentally project the TSDF volume onto the space for optimal reconstruction, whereas our technique has no such requirement. Our technique allows users to easily create high-resolution reconstructions of regions with minimal input, with the infrastructure to vary the reconstruction resolution of different regions based on user-specified parameters.
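As a sketch of what such an infrastructure can look like, the following hypothetical C++ data structure keys subvolumes by a coarse world grid and allocates them on demand. The class and field names, and the fixed per-volume grid, are illustrative assumptions rather than the actual implementation.

    // Hypothetical sketch of the bookkeeping behind dynamic subvolume placement.
    // Each occupied cell of a coarse world grid owns its own TSDF volume, so the
    // scanned region can grow at runtime and resolution can vary per region.
    #include <Eigen/Core>
    #include <array>
    #include <cmath>
    #include <map>

    struct SubVolume {
      Eigen::Vector3f origin_m;  // world-frame corner of this volume
      float side_m;              // physical edge length, e.g. 0.75 m
      int voxels_per_side;       // finer grids give higher resolution
      // ... TSDF and weight arrays would live here ...
    };

    class VolumeRegistry {
     public:
      VolumeRegistry(float side_m, int voxels_per_side)
          : side_m_(side_m), voxels_(voxels_per_side) {}

      // Return the volume covering a world point, allocating it on first use.
      SubVolume& volumeAt(const Eigen::Vector3f& p) {
        const std::array<int, 3> key = {
            static_cast<int>(std::floor(p.x() / side_m_)),
            static_cast<int>(std::floor(p.y() / side_m_)),
            static_cast<int>(std::floor(p.z() / side_m_))};
        auto it = volumes_.find(key);
        if (it == volumes_.end()) {
          const Eigen::Vector3f origin(key[0] * side_m_, key[1] * side_m_, key[2] * side_m_);
          it = volumes_.emplace(key, SubVolume{origin, side_m_, voxels_}).first;
        }
        return it->second;
      }

     private:
      float side_m_;
      int voxels_;
      std::map<std::array<int, 3>, SubVolume> volumes_;  // sparse world grid
    };

    int main() {
      VolumeRegistry registry(0.75f, 256);  // illustrative parameters
      SubVolume& v = registry.volumeAt(Eigen::Vector3f(1.2f, 0.1f, 2.9f));
      (void)v;  // allocated on demand as the camera enters uncovered space
      return 0;
    }

In this sketch, a camera pose entering an uncovered grid cell triggers allocation of a new volume, matching the dynamic placement described above.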
5.2 Future work
A number of improvements can be made to the work in this thesis to extend its utility
in SLAM and robotic object scanning.
5.2.1 Large Scale Improvements
The biggest limitation of our system is sensor drift in large scale reconstructions. Presently, when stitching TSDF subvolumes over a large region, noise in the sensor measurements causes drift in the reconstruction; for example, long straight hallways tend to curve. This issue could be largely mitigated by camera tracking techniques that are more robust than ICP. For example, [49] compares the DVO camera tracking of Kerl et al. [22], the RGB-D SLAM method of Endres et al. [11], the multi-resolution surfel maps of Stückler and Behnke [46], and a custom solution developed by Whelan et al. All of these provide better camera tracking than the current solution and would greatly improve large scale mapping.
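One low-friction way to adopt such methods would be to hide the tracking front-end behind a small interface so that ICP can be swapped without touching volume integration. The sketch below is hypothetical; the names and the Frame placeholder are assumptions, not part of the current code.

    // Hypothetical tracker abstraction illustrating how ICP could be swapped for
    // a more robust front-end such as DVO [22] without changing volume integration.
    #include <Eigen/Geometry>

    struct Frame {};  // placeholder for a depth + RGB image pair

    class CameraTracker {
     public:
      virtual ~CameraTracker() = default;
      // Estimated camera motion from the previous frame to the current frame.
      virtual Eigen::Affine3f track(const Frame& previous, const Frame& current) = 0;
    };

    class IcpTracker final : public CameraTracker {
     public:
      Eigen::Affine3f track(const Frame&, const Frame&) override {
        return Eigen::Affine3f::Identity();  // stand-in for the existing ICP estimate
      }
    };

    int main() {
      IcpTracker tracker;
      Frame previous, current;
      // A DVO- or surfel-based tracker could be dropped in here without other changes.
      Eigen::Affine3f motion = tracker.track(previous, current);
      (void)motion;
      return 0;
    }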
Additionally, our work incurs significant memory and computational overhead when scanning larger regions. To reduce these costs it would be useful to integrate our work with sparser representations, such as octrees that subdivide recursively around surface crossings. OctoMap [51], for example, provides one such direction that could be explored.
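As a sense of how little code such an integration requires on the mapping side, the sketch below stores example surface points in an OctoMap octree. The resolution, points, and file name are placeholders; in practice the points would come from the zero crossings of the TSDF.

    // Sketch of one possible sparse representation using OctoMap [51].
    #include <octomap/octomap.h>

    int main() {
      octomap::OcTree tree(0.01);  // 1 cm leaf resolution (illustrative)
      // Insert a few example surface points as occupied leaves.
      tree.updateNode(octomap::point3d(0.10f, 0.20f, 0.50f), true);
      tree.updateNode(octomap::point3d(0.11f, 0.20f, 0.50f), true);
      tree.updateInnerOccupancy();     // propagate occupancy up the tree
      tree.writeBinary("surface.bt");  // compact on-disk representation
      return 0;
    }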
5.2.2 Color Integration
The focus of this work was to create high resolution point clouds from the available depth data. However, being able to appropriately color our reconstructions would provide rich information for object detection. As seen in Figure 4-3, our current method results in drift between the color information and the depth information. As future work it would be useful to formally integrate color and remove this drift.
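One common approach in KinectFusion derivatives is to store a per-voxel color and blend each new RGB observation with a weighted running average, mirroring the signed-distance update. The sketch below illustrates that idea as a possible extension; it is not code from the current system, and the parameter choices are assumptions.

    // Weighted running-average color fusion for a TSDF voxel (possible extension).
    #include <algorithm>
    #include <cstdint>

    struct VoxelColor {
      float r = 0.0f, g = 0.0f, b = 0.0f;  // accumulated color
      float w = 0.0f;                      // accumulation weight
    };

    // w_obs can be reduced for oblique views or distant measurements to limit the
    // color smearing visible in Figure 4-3; w_max bounds the averaging window.
    inline void fuseColor(VoxelColor& v, std::uint8_t r, std::uint8_t g, std::uint8_t b,
                          float w_obs, float w_max = 64.0f) {
      const float w_new = v.w + w_obs;
      v.r = (v.r * v.w + static_cast<float>(r) * w_obs) / w_new;
      v.g = (v.g * v.w + static_cast<float>(g) * w_obs) / w_new;
      v.b = (v.b * v.w + static_cast<float>(b) * w_obs) / w_new;
      v.w = std::min(w_new, w_max);
    }

    int main() {
      VoxelColor v;
      fuseColor(v, 200, 180, 150, 1.0f);  // first observation sets the color
      fuseColor(v, 210, 185, 155, 1.0f);  // later observations are averaged in
      return 0;
    }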
5.2.3 Integration with SLAM and object recognition
Currently our work is standalone and has not been integrated with any existing SLAM solution. Moving forward, we hope that once the large scale improvements above are made, large scale models built with our system can be used in robotics. With such a high resolution map as a backdrop, a robot traveling through a pre-mapped environment can localize itself with higher certainty. This is reinforced by the fact that our algorithm can potentially reduce the error of object detection systems [13], since the ground truth is known at higher resolution. Additionally, for missions where a robot is able to enter an environment only a limited number of times, our algorithm provides the ability to retrieve high-resolution information and reduce the need to enter the environment again. We hope to extend our algorithm for such mapping and data collection scenarios in the future.
Bibliography
[1] T. Bailey and H.F. Durrant-Whyte. Simultaneous localisation and mapping
(SLAM): Part I. Robotics & Automation Magazine, 13(2):99–110, June 2006.
[2] T. Bailey and H.F. Durrant-Whyte. Simultaneous localisation and mapping
(SLAM): Part II. Robotics & Automation Magazine, 13(3):108–117, September
2006.
[3] Jonathan T Barron and Jitendra Malik. Intrinsic scene properties from a single
RGB-D image. In Computer Vision and Pattern Recognition (CVPR), 2013
IEEE Conference on, pages 17–24. IEEE, 2013.
[4] Paul J. Besl and Neil D. McKay. A method for registration of 3-D shapes. IEEE
Trans. Pattern Anal. Machine Intell., 14(2):239–256, February 1992.
[5] D. R. Canelhas, T. Stoyanov, and A. J. Lilienthal. SDF Tracker: A parallel algorithm for on-line pose estimation and scene reconstruction from depth images. In
IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, IROS, Tokyo, Japan,
November 2013.
[6] Siddharth Choudhary, Alexander JB Trevor, Henrik I Christensen, and Frank
Dellaert. SLAM with object discovery, modeling and mapping. In Intelligent
Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference
on, pages 1018–1025. IEEE, 2014.
[7] A.I. Comport, E. Malis, and P. Rives. Accurate Quadri-focal Tracking for Robust 3D Visual Odometry. In IEEE International Conference on Robotics and
Automation, ICRA’07, Rome, Italy, April 2007.
[8] B. Curless and M. Levoy. A volumetric method for building complex models
from range images. In SIGGRAPH, pages 303–312, August 1996.
[9] F. Dellaert. Square Root SAM: Simultaneous location and mapping via square
root information smoothing. In Robotics: Science and Systems (RSS), Cambridge, MA, 2005.
[10] Bertram Drost, Markus Ulrich, Nassir Navab, and Slobodan Ilic. Model globally,
match locally: Efficient and robust 3D object recognition. In Computer Vision
and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 998–1005.
IEEE, 2010.
[11] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard. An
evaluation of the RGB-D SLAM system. In IEEE Intl. Conf. on Robotics and
Automation (ICRA), 2012.
[12] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale
direct monocular SLAM. In Computer Vision–ECCV 2014, pages 834–849.
Springer, 2014.
[13] R. Finman, T. Whelan, M. Kaess, and J. Leonard. Toward lifelong object segmentation from change detection in dense RGB-D maps. In European Conference
on Mobile Robotics, Barcelona, Spain, September 2013.
[14] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. A benchmark for
RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE Intl. Conf. on
Robotics and Automation, ICRA, Hong Kong, China, May 2014. (to appear).
[15] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision.
Cambridge University Press, 2003. Second Edition.
[16] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using
Kinect-style depth cameras for dense 3D modeling of indoor environments. Intl.
J. of Robotics Research, 31(5):647–663, 2012.
[17] Peter Henry, Dieter Fox, Achintya Bhowmik, and Rajiv Mongia. Patch Volumes:
Segmentation-based Consistent Mapping with RGB-D Cameras. In Third Joint
3DIM/3DPVT Conference (3DV), 2013.
[18] Evan Herbst, Peter Henry, and Dieter Fox. Toward online 3-D object segmentation and mapping. In IEEE Intl. Conf. on Robotics and Automation (ICRA),
2014.
[19] S. Izadi, R.A. Newcombe, D. Kim, O. Hilliges, D. Molyneaux, S. Hodges,
P. Kohli, J. Shotton, A.J. Davison, and A. Fitzgibbon. KinectFusion: Real-time
dynamic 3D surface reconstruction and interaction. In SIGGRAPH, page 23,
August 2011.
[20] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J.J. Leonard, and F. Dellaert.
iSAM2: Incremental smoothing and mapping with fluid relinearization and incremental variable reordering. In IEEE Intl. Conf. on Robotics and Automation
(ICRA), Shanghai, China, May 2011.
[21] Andrej Karpathy, Stephen Miller, and Li Fei-Fei. Object discovery in 3D scenes
via shape analysis. In International Conference on Robotics and Automation
(ICRA), 2013.
[22] C. Kerl, J. Sturm, and D. Cremers. Dense visual SLAM for RGB-D cameras. In
Proc. of the Int. Conf. on Intelligent Robot Systems (IROS), 2013.
[23] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces.
In IEEE and ACM Intl. Sym. on Mixed and Augmented Reality (ISMAR), pages
225–234, Nara, Japan, November 2007.
[24] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Detection-based object
labeling in 3D scenes. In IEEE International Conference on on Robotics and
Automation, 2012.
[25] J. J. Leonard and H. F. Durrant-Whyte. Simultaneous map building and localization for an autonomous mobile robot. In Proc. IEEE Int. Workshop on
Intelligent Robots and Systems, pages 1442–1447, Osaka, Japan, 1991.
[26] J. Levinson. Automatic Laser Calibration, Mapping, and Localization for Autonomous Vehicles. PhD thesis, Stanford University, 2011.
[27] C. Liebe, C. Padgett, J. Chapsky, D. Wilson, K. Brown, S. Jerebets, H. Goldberg,
and J. Schroeder. Breadboard of a hazard avoidance sensor using structured
light developed at the Jet Propulsion Laboratory, 2005. [Online; accessed May
13, 2015].
[28] F. Lu and E. Milios. Globally consistent range scan alignment for environmental
mapping. Autonomous Robots, 4:333–349, April 1997.
[29] M. Meilland and A.I. Comport. On unifying key-frame and voxel-based dense
visual SLAM at large scales. In International Conference on Intelligent Robots
and Systems, Tokyo, Japan, 3-8 November 2013. IEEE/RSJ.
[30] M. Meilland, T. Drummond., and A.I. Comport. A Unified Rolling Shutter and
Motion Model for Dense 3D Visual Tracking. In International Conference on
Computer Vision, Sydney, Australia, December 3-6 2013.
[31] Maxime Meilland and Andrew I Comport. Super-resolution 3D tracking and mapping. In Robotics and Automation (ICRA), 2013 IEEE International Conference
on, pages 5717–5723. IEEE, 2013.
[32] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton,
D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon. KinectFusion: Real-time
dense surface mapping and tracking. In IEEE and ACM Intl. Sym. on Mixed
and Augmented Reality (ISMAR), pages 127–136, Basel, Switzerland, October
2011.
[33] R.A. Newcombe, D. Fox, and S. Seitz. DynamicFusion: Reconstruction and
tracking of non-rigid scenes in real-time. In Proc. IEEE Int. Conf. Computer
Vision and Pattern Recognition, 2015.
[34] R.A. Newcombe, S.J. Lovegrove, and A.J. Davison. DTAM: Dense tracking and
mapping in real-time. In Intl. Conf. on Computer Vision (ICCV), pages 2320–
2327, Barcelona, Spain, November 2011.
[35] I. Oikonomidis, N. Kyriazis, and A.A. Argyros. Efficient model-based 3D tracking
of hand articulations using Kinect. In Proc. BMVC, 2011.
[36] K. Pirker, M. Rüther, G. Schweighofer, and H. Bischof. GPSlam: Marrying
sparse geometric and dense probabilistic visual mapping. In British Machine
Vision Conf. (BMVC), August 2011.
[37] J.G. Rogers, A.J.B. Trevor, C. Nieto-Granda, and H.I. Christensen. Simultaneous localization and mapping with learned object recognition and semantic data
association. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS),
pages 1264–1270, 2011.
[38] D.M. Rosen, M. Kaess, and J.J. Leonard. An incremental trust-region method
for robust online sparse least-squares estimation. In IEEE Intl. Conf. on Robotics
and Automation (ICRA), pages 1262–1269, St. Paul, MN, May 2012.
[39] D.M. Rosen, M. Kaess, and J.J. Leonard. RISE: An incremental trust-region
method for robust online sparse least-squares estimation. Robotics and Automation, IEEE Transactions on, 2014.
[40] H. Roth and M. Vona. Moving volume KinectFusion. In British Machine Vision
Conf. (BMVC), September 2012.
[41] R. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In IEEE Intl.
Conf. on Robotics and Automation (ICRA), Shanghai, China, May 2011.
[42] Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, Mihai Dolha, and
Michael Beetz. Towards 3D point cloud based object maps for household environments. J. of Robotics and Autonomous Systems, 56(11):927–941, 2008.
[43] R.F. Salas-Moreno, R.A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison. SLAM++: Simultaneous localisation and mapping at the level of objects.
In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Portland,
Oregon, June 2013.
[44] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured
light. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), volume 1, pages 195–202, June 2003.
[45] F. Steinbrücker, C. Kerl, J. Sturm, and D. Cremers. Large-scale multi-resolution
surface reconstruction from RGB-D sequences. In IEEE International Conference
on Computer Vision (ICCV), Sydney, Australia, 2013.
[46] Jörg Stückler and Sven Behnke. Multi-resolution surfel maps for efficient dense
3D modeling and tracking. Journal of Visual Communication and Image
Representation, 2013.
[47] S. Thrun, D. Fox, and W. Burgard. A probabilistic approach to concurrent
mapping and localization for mobile robots. Machine Learning, 31:29–53, 1998.
also appeared in Autonomous Robots 5, 253–271 (joint issue).
[48] T. Tykkälä, A. I. Comport, and J-K. Kamarainen. Photorealistic 3D Mapping of
Indoors by RGB-D Scanning Process. In International Conference on Intelligent
Robots and Systems, Tokyo, Japan, 3-7 November 2013.
[49] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald.
Real-time large scale dense RGB-D SLAM with volumetric fusion. Intl. J. of
Robotics Research, 2014. To Appear.
[50] T. Whelan, M. Kaess, J.J. Leonard, and J.B. McDonald. Deformation-based
loop closure for large scale dense RGB-D SLAM. In IEEE/RSJ Intl. Conf. on
Intelligent Robots and Systems (IROS), Tokyo, Japan, November 2013.
[51] K. M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard. OctoMap: A probabilistic, flexible, and compact 3D map representation for robotic
systems. In Proc. of the ICRA 2010 Workshop on Best Practice in 3D Perception
and Modeling for Mobile Manipulation, Anchorage, AK, USA, May 2010.
[52] L. Xia, C. Chen, and J.K. Aggarwal. Human detection using depth information
by Kinect. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition,
2011.
[53] Ming Zeng, Fukai Zhao, Jiaxiang Zheng, and Xinguo Liu. A Memory-Efficient
KinectFusion Using Octree. In Computational Visual Media, volume 7633 of
Lecture Notes in Computer Science, pages 234–241. Springer, 2012.
[54] Q. Zhou and V. Koltun. Dense scene reconstruction with points of interest. In
SIGGRAPH 2013, Anaheim, CA, USA, 2013. ACM.
[55] Q. Zhou, S. Miller, and V. Koltun. Elastic Fragments for Dense Scene Reconstruction. In Int. Conf. on Computer Vision (ICCV), December 2013.