Multi-Volume High Resolution RGB-D Mapping with Dynamic Volume Placement

by Michael Salvato

S.B., Massachusetts Institute of Technology (2013)

Submitted to the Department of Electrical Engineering and Computer Science on May 26, 2015, in partial fulfillment of the requirements for the degree of Master of Engineering at the Massachusetts Institute of Technology, June 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 26, 2015
Certified by: John Leonard, Samuel C. Collins Professor of Mechanical and Ocean Engineering, Thesis Supervisor
Accepted by: Albert R. Meyer, Professor of Electrical Engineering and Computer Science, Chairman, Masters of Engineering Thesis Committee

Abstract

We present a novel method for creating high-resolution 3D reconstructions using RGB-D cameras, such as the Microsoft Kinect. We demonstrate how such a technique can be used both for large-scale mapping and for high-resolution object detection. Our work consists of two core contributions. First, we present an algorithm to generate 3D reconstructions of arbitrary resolution, up to the limitations of the camera, with a linear increase in computation time as a function of resolution. We then build upon this method to provide a general infrastructure for scanning regions of variable shape, with variable resolution across the scanned regions. We then show quantitative improvements in resolution in a virtual environment, show qualitative improvements in resolution through images, and provide runtime performance details.

Thesis Supervisor: John Leonard
Title: Samuel C. Collins Professor of Mechanical and Ocean Engineering

Acknowledgments

Foremost I would like to thank my thesis advisor John Leonard. It was by working in your group that I've become excited about robotics research, and your enthusiasm for the work is contagious. You've been a supportive mentor, and I'm always amazed at how much is accomplished when meeting with you. Thanks for all you've done for me.

I'd like to thank the members of the Marine Robotics Group. First, Ross has been a tremendous help, both in my research and in learning about academia beyond the walls of MIT. Thanks for spending the time to review this thesis and the IROS paper, supporting me in my research, and giving me helpful advice along the way. Thanks to Liam for organizing so many things in the group and always providing helpful advice when asked. Thanks Dehann for welcoming me to the group when I first arrived, and answering any questions I had in more depth than I expected. Thanks Dave for taking the time to explain your research when I asked, teaching me mathematical principles I didn't yet know in a clearly articulated way. And thanks to Janille, Ted, Sudeep, Thomas, and the rest of MRG; I've enjoyed spending time with all of you, and will miss you.

Last I'd like to thank my dad.
You've been incredibly supportive in everything I've done, and helped me more than I can account for, in all aspects of my life. This thesis wouldn't exist without your support, and I can't thank you enough for all the opportunities I've had thanks to you.

Contents

1 Introduction
  1.1 Related Work
    1.1.1 Simultaneous Localization and Mapping
    1.1.2 Dense 3D Reconstruction
    1.1.3 Dense Methods for SLAM
  1.2 Dense Reconstruction for SLAM and Object Detection
  1.3 Overview
2 Higher Resolution Reconstruction Using Multiple Volumes
  2.1 Problem Definition
    2.1.1 Input and Output
    2.1.2 Metrics
    2.1.3 Assumptions
  2.2 Reconstruction Algorithm
    2.2.1 Initialization
    2.2.2 Depth Map Conversion
    2.2.3 Camera Tracking
    2.2.4 Memory Passing
    2.2.5 Volumetric Integration
    2.2.6 Raycasting
    2.2.7 Comparison to KinectFusion
  2.3 Benefits
  2.4 Summary
3 Dynamic Allocation
  3.1 Problem Definition
    3.1.1 Input and Output
    3.1.2 Assumptions
  3.2 Algorithm
    3.2.1 Insertion and Removal
    3.2.2 Dynamic Mapping for SLAM
  3.3 Benefits
  3.4 Summary
4 Results
  4.1 Hardware
  4.2 Runtime Performance
  4.3 Comparison with Baseline
  4.4 Qualitative Results
  4.5 Summary
5 Conclusion
  5.1 Summary of contribution
  5.2 Future work
    5.2.1 Large Scale Improvements
    5.2.2 Color Integration
    5.2.3 Integration with SLAM and object recognition

List of Figures

1-1 A simple pose graph representation.
1-2 Example of a structured light camera.
1-3 Example of space discretization in KinectFusion.
2-1 Diagram of volume placement in higher resolution reconstruction.
2-2 Flowchart for model creation.
2-3 Image generated from four 1.5 m TSDF subvolumes.
3-1 ICL-NUIM Livingroom dataset reconstruction.
3-2 Example of spaced out TSDF subvolumes.
3-3 Scan of a hallway with cube placement.
4-1 Example of point cloud registration using Cloud Compare.
4-2 Visual comparison between KinectFusion and higher resolution dynamic reconstruction in a desk scene.
4-3 Colored fireplace and mantel scan at Kendall Hotel.

List of Tables

4.1 Model to Ground-Truth Error

List of Acronyms

SLAM  simultaneous localization and mapping
TSDF  truncated signed distance function
ICP   iterative closest point

Chapter 1

Introduction

The core of this work develops an algorithm to create high-resolution models of a 3D environment. To motivate this work, we present a brief history of the simultaneous localization and mapping (SLAM) problem, which our work supports through both object detection and map generation. We then provide a history of general 3D scanning techniques and of those specifically for SLAM, to motivate the role of our work within both.

1.1 Related Work

1.1.1 Simultaneous Localization and Mapping

The simultaneous localization and mapping (SLAM) problem [25] describes the situation in which a robot in an environment must both create a map of the environment and determine its location within it. Many formulations of the SLAM problem rely on odometry measurements from the robot, as well as measurements of the environment using a sensor. Using these, the robot attempts both to derive a map of the environment based on given metrics and to derive its placement within this map.

In the case where all sensors have no noise, the SLAM problem can be reduced to a mapping problem. However, as this is not the case, substantial work has been put into solving the SLAM problem. Current state of the art largely focuses on probabilistic models of the position of the robot and beliefs of the map structure. An overview of the history of such methods can be found in [47]. Pose graph methods, first presented by [28], are often used in practice, as they provide a robust representation and can be solved efficiently using sparse methods. State-of-the-art methodology can be found in [9], followed by [20]. These methods involve creating a graph whose nodes consist of robot poses (x), landmarks (l), loop closures, odometry measurements, and landmark measurements. Landmark measurements denote detected features of the environment. Loop closures indicate that two poses have the same location. A graph is created by appropriately connecting the landmarks, poses, measurements, and loop closures. The goal is then to find the maximally likely values for x and l given the constraints in the graph, which can be solved as a sparse optimization problem (see [39]). The details of pose graph optimization are not key to this thesis.
However, we note that a major step in maximizing the consistency of the graph is to recognize when landmarks measured at different times are in fact the same landmark, so that they can be treated as one node in the graph. By doing so the system becomes more constrained. Our work is used to improve the robustness of landmark measurements, which can increase the number of constraints in the graph and reduce the probability of incorrect constraints. See Figure 1-1 for a graphical representation of a pose graph.

Given the above formulation, it is important to consider where landmark measurements come from. Historically SLAM landmarks have been relatively simple geometric landmarks [1, 2], such as corners. In recent history there has been work towards using objects as SLAM landmarks, such as [37] [6] [43]. By using objects, landmarks will ideally be more robust to noise and provide more unique information about a particular environment. To use objects as landmarks we need to develop robust methods to parse out relevant objects. Because our work provides high-resolution reconstructions, it might be extended to provide a robust method with which to parse these landmarks.

Figure 1-1: A simple pose graph representation. Blue: Pose nodes. These are not measurements themselves, but are solved for to determine the pose at each step. Each one is connected to measurements made during that step. Green: Landmarks in the environment. u: Odometry nodes. These represent odometry measurements taken from the robot. c: Loop closure nodes; these indicate that two locations are believed to be the same. m: Landmark measurements. The landmark measurements, odometry measurements, and loop closures are used to solve for the robot poses and landmarks. Our work contributes to SLAM by providing the ability to make more robust landmark measurements. Courtesy David Rosen [38].

1.1.2 Dense 3D Reconstruction

Recently 3D reconstruction has seen tremendous growth in robotics. This is largely due to the growth of structured light techniques [44] for generating depth maps, combined with the advent of the Microsoft Kinect structured light sensor. This led to the commercial availability of cheap (≈$200), robust structured light sensors. These cameras found use in applications such as motion detection and object tracking, largely for humans [35] [52]. While these frame-based methods were sufficient for some applications, we will focus on the growth of dense 3D object reconstruction techniques.

Perhaps the most famous 3D reconstruction technique is the KinectFusion algorithm, developed by Newcombe, Izadi, et al. [32] [19]. In KinectFusion the user defines the size of a cubic volume in front of the camera to be scanned. This volume is then broken into many (≈ 512^3) voxels, each of which contains a belief of its distance to the nearest surface, which is updated each frame. See Figure 1-3. This effectively smooths out the noise in the depth camera, allowing for a more accurate model to be recovered from the volume. All computation is run on the GPU for real-time performance. However, the size of the physical volume that can be scanned is bounded by GPU memory in KinectFusion, and fast movement results in blurring.

Figure 1-2: Example of a structured light camera [27]. The projected IR dot grid has a known pattern, which is deformed by the surface it is projected on. By reading in the deformed grid with a camera, the surface can be reconstructed based on the deformation of the projected pattern by the surface.
These limitations have spawned other algorithms such as [53], which uses octrees to map larger spaces and partially overcome the finite scope of KinectFusion, and [33], which allows for accurate reconstruction even for slowly moving objects. The Point Cloud Library [41] has an open source implementation of KinectFusion known as KinFu, which many methods build off of. Our work will build off of KinectFusion, and our implementation is modified from KinFu.

After creating 3D reconstructions using techniques such as those described above, objects can then be parsed from the reconstructions. Work such as [18] [42] [10] [13] and [24] provides methods to parse objects from 3D data. The work we present in the remainder of this thesis will allow for higher resolution reconstruction than current methods, which can result in more robust object parsing by increasing the available information.

Figure 1-3: Example of space discretization in KinectFusion. On the left are two spherical objects on which we wish to perform a 3D scan. KinectFusion discretizes the space into many voxels, each of which keeps track of its believed distance to the nearest surface on the objects. The size of the space to be scanned must be specified at runtime and cannot be changed.

1.1.3 Dense Methods for SLAM

While our work will focus on dense methods using RGB-D sensors [5, 36, 46, 54, 55], there has been significant prior work in monocular SLAM. Methods such as [7] had been developed since before the prevalence of RGB-D cameras for SLAM reconstruction. Additionally, due to their high resolution and potentially lower cost and energy requirements, work such as [12, 23, 34] is important to creating SLAM systems for a wider range of environments, though it is not the focus of our work.

As noted above, KinectFusion [34] is limited by the size of its reconstruction region. Work such as [16], one of the first RGB-D mapping systems, removes this size restriction. This work, and others based on 3D features [11], consequently throws away much of the data, resulting in lower resolution reconstructions than those seen in methods designed strictly for dense reconstruction. Keyframe approaches such as [48] and the work of Meilland et al. [29–31] integrate new data into separate keyframes that exist in image space, allowing for high-resolution regions as well as expanded regions as keyframes are added. Specifically, Meilland et al. present work which combines keyframes and volumetric approaches to mapping, in order to create real-time, robust, and scalable systems in a memory-efficient manner. However, in order to represent the 3D features of a small cluttered scene, many small keyframes are required to properly capture the 3D features of different views. This could potentially cause tracking drift or failure if each keyframe does not have sufficient features to track.

There has been notable work to extend KinectFusion methods to maps of larger size, such as the work of Roth and Vona [40], and the Kintinuous algorithm developed by Whelan et al. [50]. The Kintinuous algorithm is designed to accomplish a similar goal to our work, which is to allow for scanning of dynamically increasing regions. Kintinuous overcomes the GPU memory restriction by cycling regions of the volume out of working memory as they exit the scope of the scan. Steinbrücker et al. [45] use an octree representation to reduce computational costs, and keyframe methods as well as loop closures to reduce drift. Wurm et al.
[51] similarly use an octree representation to reduce data overhead, and make use of CPU capabilities to additionally reduce processing overhead. Henry et al. [17] use a series of KinectFusion-like volumes, which are stitched together to create a cohesive map. They use plane detection to determine where to place volumes, and they are explicitly scoped to be used in mapping. While both of these methods provide robust mapping solutions, our work provides the ability for a user to specify the desired resolution and the parameters for mapping, resulting in a flexibility that no other solution provides.

1.2 Dense Reconstruction for SLAM and Object Detection

This work will focus on a novel algorithm for dense reconstruction and SLAM. The KinectFusion algorithm has its physical scope coupled to its reconstruction resolution, limited by the size of GPU memory. Functionally this results in a limited size that KinectFusion can scan, which is fixed during a given scan, as well as a limited resolution. This is because KinectFusion keeps the entire volume being scanned in GPU memory during runtime, in order to increase speed efficiency. In order to remove this constraint, we pass memory in and out of the GPU, allowing the reconstruction resolution to be specified by the user.

Because the above extension already passes memory in and out of GPU memory, we then expand on this to keep representations of different areas of the environment stored in CPU memory. We remove KinectFusion's restriction to a fixed-size mapping region by introducing new scanning regions during runtime. We present both a general framework for accomplishing this and an example case used for 3D mapping.

1.3 Overview

The remainder of the thesis is organized into the following chapters:

Chapter 2 gives a description of our algorithm for high-resolution dense scanning in fixed-size environments.

Chapter 3 extends our algorithm to scan arbitrary regions at a resolution determined by a user-given function. It also gives an example of how to use this in a general mapping solution.

Chapter 4 provides qualitative results from running our algorithms, as well as quantitative metrics for resolution and runtime.

Chapter 5 addresses some limitations and future directions of our work, as well as concluding remarks.

Chapter 2

Higher Resolution Reconstruction Using Multiple Volumes

The following describes a procedure for generating a high-resolution 3D reconstruction of a region of fixed size. This is a first step towards developing a high-resolution mapping system of dynamically increasing size. This technique allows for the creation of arbitrarily high resolution scans of regions of fixed size, and is useful in robotic object detection (see Section 2.3).

2.1 Problem Definition

We are trying to solve the problem of using a series of depth maps to best reconstruct the 3D model that generates those depth maps. If there were no noise and the depth maps were drawn from a physically realizable model, we would be able to generate a perfect model. However, we assume the depth maps provide a noisy representation of our model, which we then attempt to account for to create the most accurate model possible. We attempt to maximize both the accuracy and the density of points in our reconstruction. Here we present the problem more precisely in terms of the input and output relationship, metrics, and assumptions.
2.1.1 Input and Output

Our algorithm is designed to output the 3D reconstruction of a fixed-size cubic region of physical space using a sequence of provided depth maps. The input to the algorithm is therefore as follows:

• The linear dimension, l ∈ R+, of the physical region to be scanned. Our region is assumed to be a cube, so this is the side length of the cube to be scanned.

• The linear number of voxels, r ∈ N, of the model used during 3D reconstruction. Our region is assumed to be a cube, so this is the side length of a virtual cube of voxels.

• A series of n × m depth frames D_i ∈ R_+^{n×m}, indexed by frame i, provided by a camera. We assume each frame is generated from a camera by a noisy, physically realizable model.

• The camera intrinsic matrix, K ∈ R^{3×3}. The camera intrinsics provide information about the mapping from the values in a depth frame to the physical location of the surface that generated that reading [15].

During the scanning process a collection of voxels is stored as a cube of voxels. Each voxel stores its signed distance to the nearest surface, which is truncated if this distance is too large. The collection of voxels is known as the truncated signed distance function (TSDF) [8] volume. The number of voxels r^3 in the TSDF volume acts as a metric for the resolution of the reconstruction. The larger the number of voxels in a given physical space, the higher the resolution of the model that is output. For example, a low-resolution water bottle may only have enough points to appear as a cylinder, while a high-resolution one may be able to represent grooves in the water bottle. The definition of resolution will be refined in Section 2.1.2.

The depth frames are each a 2D matrix, with each value representing the depth to the nearest surface along a given ray. The camera intrinsic matrix is used to determine the angle of a given ray described by the depth map. Each measurement in D_i is itself noisy with an unknown noise function. The core function of this algorithm is to integrate all measurements in D_i to determine the true generating model.

The algorithm outputs the following:

• A set V of vertices, consisting of elements v^j ∈ R^3 for j = 1, ..., |V|. Each vertex is the physical location of a 3D point relative to the camera initialization location.

• A set N of normals, consisting of elements n^j ∈ R^3 for j = 1, ..., |N|. n^j corresponds to the normal for vertex v^j.

We note that |V| is upper bounded by r^3. That is, the number of vertices in the reconstruction cannot exceed the number of voxels in the model used for reconstruction.

2.1.2 Metrics

In the ideal case, each vertex represents the location of a surface, with location identical to that of a point on some surface relative to the initial camera location. To capture the error in these distances, we define the mean distance from each vertex to the nearest true surface as the accuracy of the model. In the trivial case the accuracy of our model could be maximized by outputting only the single most accurate measurement. However, this would not provide useful information about the model we are attempting to reconstruct. As a second constraint in the ideal case, for each voxel in the reconstruction model, if a surface crosses through that voxel, then a vertex is created for that voxel. Increasing the number of vertices in the output will be defined as increasing the resolution of the model.
The primary contribution of this thesis is to increase the resolution of 3D reconstructions over existing methods without reducing their accuracy, compared to state-of-the-art models.

2.1.3 Assumptions

• This algorithm assumes the depth frames come from a camera and are derived from a physically realizable source.

• The scene contains features which can be tracked across frames. Because our algorithm uses the iterative closest point algorithm [4], without features to track the algorithm will fail.

• Static portions of the scene dominate relative to non-static ones. This is necessary to be able to track the camera movement relative to the scene.

• The movement of the camera between frames is small enough that features may be tracked across frames.

• Surfaces are in the view of the camera long enough to smooth over the noise in the generated depth maps.

• The probability distribution on the noise of each depth value is centered on the true value. Otherwise the accuracy is bounded by the center of the distribution for each point.

2.2 Reconstruction Algorithm

Those who are familiar with the KinectFusion algorithm will notice that the steps listed below follow the same structure as KinectFusion. The key modification in our algorithm is that it allows the resolution of the reconstruction to be increased to a user-specified degree in exchange for computation time. This will be further explained in Section 2.3.

2.2.1 Initialization

This algorithm allows the number of voxels r^3 to be a user-tunable parameter. However, future steps in our algorithm rely on being able to perform computation on the TSDF volume on a GPU. In the case where the user-provided r results in too many voxels for the entire volume to be stored in GPU memory, the TSDF volume will be broken into a set of TSDF subvolumes H. Each subvolume h ∈ H acts as a TSDF volume and is used in reconstruction, but the subvolumes are separated so that computation can be performed on them independently. We define r_GPU to be the maximum number of linear voxels that can fit into GPU memory. The linear number of TSDF subvolumes initialized is r / r_GPU.

Figure 2-1: Left: 2D representation of a KinectFusion volume with 8 voxels in the linear dimension scanning a bunny. Right: Four volumes, each with eight linear voxels, scanning a bunny of the same size. We double the linear resolution by using four volumes each with half the length.

The initial location of the camera is initialized as (0, 0, 0). The back, center voxel of the TSDF volume is additionally initialized to this location. In order to achieve this, each subvolume is placed such that together they form a cube with its back center at (0, 0, 0). Each subvolume stores the translation h_t of its back center voxel relative to (0, 0, 0), its linear number of voxels h_r, and its linear physical dimension h_l. For reasons mentioned in Section 2.2.6, adjacent subvolumes overlap by two voxels, but this will be ignored for the majority of the analysis, as it only results in a constant factor decrease in the total volume size.

2.2.2 Depth Map Conversion

For each frame i, the depth map D_i is converted into a collection of vertices. A given vertex in the coordinates of the current frame is

v_l^i(x, y) = K^{-1} [x, y, 1]^T D_i(x, y)

for pixel coordinates x = 0, ..., m−1 and y = 0, ..., n−1. This is done on the GPU on a per-pixel basis. The normals are then computed for each vertex:

n_l^i(x, y) = (v_l^i(x+1, y) − v_l^i(x, y)) × (v_l^i(x, y+1) − v_l^i(x, y)).
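To make the depth map conversion concrete, the following is a minimal NumPy sketch of the per-pixel back-projection and normal computation described above. It is an illustration rather than our GPU implementation; the function name backproject_depth and the array-based layout are ours, and border pixels are simply given zero normals.

```python
import numpy as np

def backproject_depth(depth, K):
    """Convert an n x m depth map to per-pixel vertices v_l and normals n_l.

    depth : (n, m) array of depth values D_i(x, y), in meters.
    K     : (3, 3) camera intrinsic matrix.
    Returns vertices (n, m, 3) and normals (n, m, 3) in the camera frame.
    """
    n, m = depth.shape
    K_inv = np.linalg.inv(K)

    # Homogeneous pixel coordinates [x, y, 1] for every pixel.
    ys, xs = np.mgrid[0:n, 0:m]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)

    # v_l(x, y) = K^{-1} [x, y, 1]^T * D_i(x, y)
    rays = pix @ K_inv.T
    verts = rays * depth[..., None]

    # n_l(x, y) = (v_l(x+1, y) - v_l(x, y)) x (v_l(x, y+1) - v_l(x, y)),
    # computed with forward differences; border pixels keep a zero normal.
    normals = np.zeros_like(verts)
    dx = verts[:, 1:, :] - verts[:, :-1, :]   # neighbor in the x (column) direction
    dy = verts[1:, :, :] - verts[:-1, :, :]   # neighbor in the y (row) direction
    normals[:-1, :-1, :] = np.cross(dx[:-1, :, :], dy[:, :-1, :])

    # Normalize where the normal is non-zero.
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    np.divide(normals, norm, out=normals, where=norm > 0)
    return verts, normals
```

In the actual pipeline each pixel is handled by its own GPU thread; the sketch only expresses the same per-pixel arithmetic with array operations, and the resulting vertices and normals are subsequently transformed into the global frame using the previous frame's pose estimate, as described next.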
We note here that the number of pixels in the depth map is assumed to be much smaller than what could be supported by GPU memory. Using the transformation matrix derived from the previous camera tracking iteration (see Section 2.2.3), the global position of each point is then derived. The global transformation matrix is T^{i−1} = [R^{i−1} | t^{i−1}], with translation vector t^{i−1} ∈ R^3 and rotation matrix R^{i−1} ∈ SO(3). The global vertex location is then v^i(x, y) = T^{i−1} v_l^i(x, y) and the global normal is n^i(x, y) = R^{i−1} n_l^i(x, y). This puts our vertices and normals in the global coordinate frame, which is necessary for future steps.

2.2.3 Camera Tracking

The transformation matrix between the vertices derived from the previous raycasting step and those from the current depth map conversion step is determined. This is done by performing a GPU-based iterative closest point (ICP) algorithm, the details of which can be found in [32]. As a brief overview, the first step of the ICP algorithm is to match each point in a source point cloud with a point in a reference point cloud. A transformation is then applied to minimize the least-squares error between the matching points. Each point in the source point cloud is then matched against the reference again, and the procedure is repeated until some minimum error threshold is obtained. The transformation from the initial source point cloud position to the final source point cloud position is then the transformation matrix. This transformation is then applied to T^{i−1} to determine the current global transformation matrix T^i.

Figure 2-2: Flowchart for model creation. At the beginning of each dataset, TSDF volumes are initialized. For each frame, depth map conversion is performed to convert the frame's depth map values to global vertices. ICP is then performed during the camera tracking step to obtain the camera transformation from the previous frame to the current one. In order to perform volumetric integration and raycasting on each volume, a memory passing step is performed to pass a volume into GPU memory. Volumetric integration and raycasting are then performed to update our TSDF volumes and obtain the raycasted depth map. Red: Performed once per dataset. Blue: Performed once per frame. Green: Performed on each volume for each frame.

2.2.4 Memory Passing

To store the amount of data required to have multiple TSDF subvolumes, they generally need to be stored in CPU memory, as we constructed the subvolumes such that only one may fit in GPU memory at a time. Generally a computer has many times more CPU memory than GPU memory. Because of this, and the benefits of virtual memory, we can store more volumes in CPU memory than in GPU memory. Prior to a volumetric integration step, one TSDF subvolume is uploaded to the GPU for volumetric integration and raycasting. After raycasting, this volume is downloaded back to CPU memory. This happens for each subvolume, after which the raycasting result for that frame is complete.

2.2.5 Volumetric Integration

During this step we update our TSDF subvolumes, which store our beliefs of where surfaces are relative to our global origin. Each subvolume stores a set of values D_{k,i}(t_l), where t_l = (t_l^x, t_l^y, t_l^z) ∈ R^3 represents the voxel coordinate, and D_{k,i} ∈ R^{r^3} stores the believed distances to the nearest surface for each voxel in the k-th subvolume on the i-th frame.
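As an illustration of the state each subvolume carries, here is a minimal CPU-side sketch. The class name Subvolume, the 2×2×2 layout helper, and the default initialization values are assumptions for exposition, not the data layout of our GPU implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Subvolume:
    """One TSDF subvolume h in H."""
    hr: int                    # linear number of voxels (h_r)
    hl: float                  # linear physical dimension in meters (h_l)
    ht: np.ndarray             # translation of the back-center voxel, in voxels (h_t)
    tsdf: np.ndarray = None    # D_k: believed distance to the nearest surface per voxel
    weight: np.ndarray = None  # accumulated integration weights w

    def __post_init__(self):
        shape = (self.hr, self.hr, self.hr)
        if self.tsdf is None:
            self.tsdf = np.ones(shape, dtype=np.float32)     # initialized to "far"
        if self.weight is None:
            self.weight = np.zeros(shape, dtype=np.float32)  # no observations yet

    def voxel_size(self) -> float:
        return self.hl / self.hr

def make_cube_of_subvolumes(r_per_sub=512, l_per_sub=1.5):
    """A 2 x 2 x 2 arrangement of subvolumes forming one cube with its back
    center at the origin, as in the initialization step (overlap ignored)."""
    subs = []
    for kx in (-1, 0):
        for ky in (-1, 0):
            for kz in (0, 1):
                ht = np.array([kx, ky, kz]) * r_per_sub
                subs.append(Subvolume(hr=r_per_sub, hl=l_per_sub, ht=ht))
    return subs
```

During a frame, exactly one such subvolume is resident on the GPU at a time; its tsdf and weight arrays are what the volumetric integration step below updates.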
Each GPU thread is assigned an (t_l^x, t_l^y) location, for t_l^x = 0, ..., r−1 and t_l^y = 0, ..., r−1. Each thread then iterates over each t_l^z = 0, ..., r−1 for its given (t_l^x, t_l^y). Each voxel's distance to the nearest surface is determined using its t_l coordinate. However, because each subvolume is translated relative to the origin by translation h_t, the coordinate used to calculate distance is t_s = t_l + h_t. From here the physical model distance is calculated as d_m = (l/r) ||t_s||, which represents the distance from the camera origin to the voxel. In order to determine the measured distance for that frame, each t_s voxel position is then back-projected to pixel coordinates to get an (x, y) location. Using this, the distance to the nearest surface is calculated as dist = D_i(x, y) − d_m. dist is then constrained to be within some truncation bounds proportional to the physical size of a voxel, in order to prevent unnecessary computation. From here we can calculate the updated TSDF value

D_{k,i}(t_l) = (w_{i−1} D_{k,i−1}(t_l) + w_i · dist) / (w_{i−1} + w_i)

for each voxel, which is the belief of the distance from that voxel to the nearest surface. w_i is a linearly increasing weighting function, up to some maximum.
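A minimal sketch of this per-voxel update in scalar form follows; the truncation bound tau, the per-frame weight of 1, and the cap w_max are assumed stand-ins for the implementation's constants.

```python
def update_tsdf_voxel(d_prev, w_prev, depth_measurement, d_m, tau, w_max):
    """Weighted running-average TSDF update for a single voxel.

    d_prev : previous TSDF value D_{k,i-1}(t_l)
    w_prev : previous accumulated weight w_{i-1}
    depth_measurement : D_i(x, y), the depth at the voxel's projected pixel
    d_m    : physical distance from the camera origin to the voxel
    tau    : truncation bound
    w_max  : cap on the accumulated weight
    Returns the updated (TSDF value, weight).
    """
    dist = depth_measurement - d_m
    # Truncate: values far in front of or behind the surface carry no extra detail.
    dist = max(-tau, min(tau, dist))

    w_i = 1.0  # each new frame contributes one unit of weight
    d_new = (w_prev * d_prev + w_i * dist) / (w_prev + w_i)
    w_new = min(w_prev + w_i, w_max)
    return d_new, w_new
```

On the GPU this update runs in parallel for every voxel of the subvolume currently resident in memory; the subvolume translation h_t only changes which pixel a voxel projects to, not the update rule itself.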
In order to guarantee that we use the correct value in our voxel map, we take the value with the shortest raycast distance from the local camera location. This guarantees that we always take the value of the surface visible by the current camera location. To trilinearly interpolate successfully each subvolume must overlap by 2 voxels. 2.2.7 Comparison to KinectFusion We show that when creating a cube of multiple volumes the results are identical to that of running the KinectFusion algorithm on one volume with the same total number of voxels. Because the volumetric integration and raycasting steps are the only ones modified, if we can show they give identical outputs to KinectFusion, then the entire algorithm also does so. For sake of simplicity we will ignore the overlapping voxels in our volumes when discussing the number of voxels, which is accounted for by requiring that each volume be larger by the size of 2 voxels. During the volumetric integration step, each voxel is independent of each other voxel. That is, the calculation performed only depends on the TSDF value of the given voxel, and the depth of the ray back projected from that voxel. All modifications done by our algorithm guarantee that the voxel is associated with the correct physical location, but do not affect the TSDF value or the depth value. Due to the fact that 32 all computation depends on each voxel independently, that voxels exist in separate volumes does not affect the correctness of the computation. In the raycasting step, unlike the volumetric integration step, the voxels are not independent. Due to trilinear interpolation, each voxel depends on each neighboring voxel. However given that the volumetric integration step is identical to KinectFusion, upon overlapping the volumes, the overlapping voxels will have identical values. Therefore during the trilinear interpolation step, every voxel for which trilinear interpolation is possible will have a set of neighbors identical to that of KinectFusion. By then only the shortest distance zero-crossing found by each ray, the depth and normal maps are identical to those in KinectFusion. With this the outputs of volumetric integration and raycasting are both identical to that of KinectFusion, so our algorithm with produce results identical to KinectFusion, assuming that an equal resolution KinectFusion volume could fit in GPU memory. 2.3 Benefits This algorithm is designed explicitly to create high-resolution volumes of a fixed area. By using the TSDF volume techniques of KinectFusion we create a more accurate 3D model than any individual frame. Then by allowing for a number of subvolumes, and thereby voxels, only bounded by disk space, we allow for resolution functionally bounded only by the noise in the camera which generates the depth frames. This is compared to the resolution bounded by GPU size in KinectFusion. In exchange we increase the processing time of the algorithm, which will be furthered explored in Chapter 4. 2.4 Summary This chapter has presented a method to generate higher resolution 3D reconstructions in exchange for increased computation time, by using multiple TSDF subvolumes passed in and out of GPU memory. The next chapter presents a method to expand 33 the use of multiple TSDF subvolumes to map different regions of an environment at different resolutions, and dynamically allocate subvolumes based on user specified parameters. 
Chapter 3

Dynamic Allocation

In the previous chapter we developed a method for generating high-resolution models by using multiple subvolumes. Building off of this infrastructure, we present a method to dynamically allocate these subvolumes during reconstruction. This allows for the reconstruction of a dynamically growing region, and the ability to decide the reconstruction quality of different regions during runtime.

3.1 Problem Definition

Our goal is again to create a 3D reconstruction from a noisy measurement model while optimizing over accuracy and resolution. However, in this case we present the additional requirement that the resolution of the model in a given region be variable based on user input. Here we present the problem more precisely in terms of the input and output relationship, metrics, and assumptions.

3.1.1 Input and Output

The output of dynamic allocation is identical to that of high-resolution reconstruction:

• A set V of vertices, consisting of elements v^j ∈ R^3 for j = 1, ..., |V|. Each vertex is the physical location of a 3D point relative to the camera initialization location.

• A set N of normals, consisting of elements n^j ∈ R^3 for j = 1, ..., |N|. n^j corresponds to the normal for vertex v^j.

Figure 3-1: Top: A colored model of the ICL-NUIM Livingroom dataset [14]. Bottom: A reconstruction of the same scene with segments showing regions mapped by a volume. Each volume has the maximum resolution that can be loaded into GPU RAM. Volumes are dynamically placed and removed as the camera passes over the scene.

During runtime the algorithm keeps a set H of active subvolumes it is processing. Each subvolume h ∈ H has three parameters: h_l, the linear physical dimension; h_r, the linear number of voxels; and h_t, the translation relative to the camera origin. A volume can be added to the set at any point during execution, and will be processed in subsequent frames. To remove a volume from processing, it is removed from the set of volumes. The point cloud for the removed volume can then be downloaded to main memory. Thus the input parameters are:

• A series of n × m depth frames D_i ∈ R_+^{n×m}, indexed by frame i, provided by a camera. We assume each frame is generated from a camera by a noisy, physically realizable model.

• The camera intrinsic matrix, K ∈ R^{3×3}. The camera intrinsics provide information about the mapping from the values in a depth frame to the physical location of the surface that generated that reading.

• f(·), which can take in any data available during a given frame. It returns a list of volume parameters (h_p^1, ..., h_p^n) for the volumes to be added, where h_p = (h_l, h_r, h_t) for some subvolume.

• g(h, ·), which is run for all h ∈ H. The function removes the volume from H if the conditions it specifies are met. It can additionally take in any other parameters available in the system.

By defining f and g, this algorithm can be extended to a multitude of use cases. For example, the work of Karpathy et al. [21] could be applied to create higher resolution models for areas with object-like properties.

3.1.2 Assumptions

The assumptions of dynamic allocation include the assumptions of high-resolution reconstruction:

• This algorithm assumes the depth frames come from a camera and are derived from a physically realizable model.

• The scene contains features which can be tracked across frames. Because our algorithm uses the iterative closest point algorithm [4], without features to track the algorithm will fail.
• Static portions of the scene dominate relative to non-static ones. This is necessary to be able to track the camera movement relative to the scene.

• The movement of the camera between frames is small enough that features may be tracked across frames.

• Surfaces are in the view of the camera long enough to smooth over the noise in the generated depth maps.

• The probability distribution on the noise of each depth value is centered on the true value. Otherwise the accuracy is bounded by the center of the distribution for each point.

We additionally assume:

• H is never empty, except prior to initialization. If it were, ICP could not be performed, and the camera could not localize.

• The collection of subvolumes at any given time has features such that ICP can be performed adequately.

• All h ∈ H have a number of voxels that can fit into GPU memory.

• f(·) and g(h, ·) are designed such that a given surface will exist in memory long enough to develop an accurate model of that surface. This is because smoothing over the noise in our model takes a number of measurements, so if we pass volumes out too quickly the reconstruction will be poor.

3.2 Algorithm

Figure 3-2: Volumes can be placed anywhere relative to the initial camera location. Here we scan with three spaced out volumes, and choose not to allocate a volume in the bottom left corner for demonstration purposes.

3.2.1 Insertion and Removal

After the final raycasting step, an additional insertion and removal step is added. During this step the insertion function f(·) is first run. Each subvolume to be added is specified with h_l, the linear physical dimension, h_r, the linear number of voxels, and h_t, the translation relative to the camera origin. Each subvolume specified by f(·) is then marked for addition. g(h, ·) is then run for each h ∈ H. Note that this function can use any parameter of the algorithm, including the set of subvolumes marked for addition. If a subvolume specified by g(h, ·) has identical parameters to one marked for addition, that volume is kept and the new volume will not be marked for addition. This provides a mechanism for f(·) and g(h, ·) to interact. Each remaining subvolume specified by g(h, ·) is then marked for removal.

Because each TSDF volume can be described as a point cloud, the volumes marked for removal have their point cloud representations downloaded to CPU memory. The volumes marked for addition are then added to H, and are used during the volumetric integration and raycasting steps of future iterations.

3.2.2 Dynamic Mapping for SLAM

To perform large-scale mapping we propose a tool for dynamically placing subvolumes in areas expected to have a high density of surface vertices. There are three tunable parameters in this system: the linear number of voxels in each TSDF subvolume, d_v; the linear physical dimension of each subvolume, d_l; and the maximum number of subvolumes to exist at one time, n. Our system dynamically places n subvolumes h, each with h_l = d_l and h_r = d_v, in locations chosen by the following process.

We define the origin of the global coordinate frame as the initial location of the camera. Assuming voxels of fixed size, we can view this coordinate system as having length units in voxels rather than a physical length. We allow subvolumes to be placed on this coordinate system such that their translations from the origin are constrained to h_t = k · (d_v − 2), where k ∈ Z^3. This allows the subvolumes to be placed on a 3D grid, where each cell is the size of a subvolume. The (−2) term is there so that adjacent cubes do not create a gap in the raycasting result, and it will be excluded below for simplicity of notation. We do not allow multiple subvolumes to have the same translation, so a subvolume's placement in this grid uniquely identifies the subvolume.

During the raycasting operation, for each pixel p = D_i(x, y) in the depth frame, we calculate the endpoint e = (e_x, e_y, e_z) ∈ R^3 of the ray generated from p, based on the depth r_d and direction r_v = (r_v^x, r_v^y, r_v^z) ∈ R^3 from p. Using this we calculate the location in global coordinates to which that depth value maps:

e_x = r_d / sqrt( (r_v^z / r_v^x)^2 + (r_v^y / r_v^x)^2 + 1 )
e_y = (r_v^y / r_v^x) · e_x
e_z = (r_v^z / r_v^x) · e_x

We then convert each endpoint to a coordinate l, constrained to the same grid h_t = k · d_v as the subvolume locations:

l = (e_x, e_y, e_z) / d_l − ( (e_x, e_y, e_z) / d_l  mod  d_v )

Any ray with endpoint l maps uniquely to a subvolume. We define c(l) as the number of rays that end at the subvolume mapped to l. The set L is defined as all subvolume locations with non-zero c(l). Our insertion function takes in the current subvolume set H and all c(l) for l ∈ L, so the function signature is f(H, L, c(L)). We then keep only the subvolumes with the n largest c(l) values. Our removal function g(h, ·) removes a subvolume which has been marked as having fallen out of the top n subvolumes. In order to prevent volumes from switching in and out too often, we also set a threshold: a subvolume to be added needs to have a count larger than that of the lowest-count subvolume by some constant factor.
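The following is a minimal sketch of this placement heuristic, i.e., one possible instantiation of f(·) and g(h, ·) operating on precomputed ray endpoints. The function name choose_active_cells and the hysteresis factor are illustrative assumptions, not the constants used in our implementation.

```python
import numpy as np
from collections import Counter

def choose_active_cells(endpoints, d_l, d_v, n, active, hysteresis=1.5):
    """Pick the n grid cells (candidate subvolume translations) with the most
    ray endpoints, keeping currently active cells unless they are beaten by
    the hysteresis factor.

    endpoints : (N, 3) array of ray endpoints e in global coordinates
    d_l, d_v  : linear physical dimension and linear voxel count per subvolume
    n         : maximum number of live subvolumes
    active    : set of currently active cell keys (tuples)
    """
    # Snap each endpoint to its grid cell l = e/d_l - (e/d_l mod d_v).
    cells = np.floor(endpoints / d_l / d_v).astype(int) * d_v
    counts = Counter(map(tuple, cells))

    # Rank candidate cells by their ray counts c(l).
    ranked = sorted(counts, key=counts.get, reverse=True)
    keep = set(ranked[:n])

    # Hysteresis: an inactive cell only displaces an active one if its count
    # exceeds the weakest active cell's count by the given factor.
    weakest = min((counts.get(c, 0) for c in active), default=0)
    keep = {c for c in keep
            if c in active or counts[c] > hysteresis * weakest}

    to_add = keep - active         # role of f(.): subvolumes to insert
    to_remove = active - keep      # role of g(h,.): subvolumes to drop
    return to_add, to_remove
```

Per frame, the cells returned in to_add become new subvolumes, while those in to_remove have their point clouds downloaded to CPU memory and are dropped from H, as described above.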
3.3 Benefits

Our method scans environments of dynamically increasing size, as subvolumes are continually inserted into and removed from H. This provides a key utility over dense 3D scanning algorithms which are bounded to a fixed region of space. While high-resolution reconstructions are possible with methods such as Kintinuous [50], such methods do not provide dynamic volume allocation. When performing a scan with them, the user needs to understand where the TSDF volume sits in the scene and scan based on that. In contrast, our algorithm allows for high-resolution reconstruction of all objects that were in the view of the camera, and does not require understanding where the scanning boundaries are.

Our dynamic mapping method is just one function for dynamic allocation. Our framework allows for any metric in deciding where to place volumes. For example, if there were certain objects of interest for which higher resolution scanning would be useful, volumes could be placed on those objects when detected. Additionally, other metrics, such as the amount of clutter in a given region, could be used for dynamic mapping.

Figure 3-3: Top: Scan of a hallway. This scan was taken while simply walking forward, yet it still captures both walls and the floor at high resolution. 2 m subvolumes were dynamically allocated such that at most 16 were live at one time. This image consists of 24 subvolumes. Bottom: Volume placement on a short hallway scene. We see no subvolumes allocated in the region between the walls.

3.4 Summary

This chapter has presented a method to allocate TSDF subvolumes to different regions of a space, allowing for dynamically growing regions of variable resolution. In the next chapter we explore the time and accuracy tradeoffs of higher resolution dynamic placement, as well as qualitative results in the form of model visualizations.
Chapter 4

Results

The key benefit of our algorithm is the ability to increase the resolution of our model without sacrificing accuracy, at the cost of computation time. Therefore, to quantitatively evaluate our algorithm we use two metrics:

• Runtime performance of the algorithm as a function of the number of TSDF volumes being used.

• Comparison against the PCL implementation of KinectFusion (KinFu) [41] using the benchmarks developed by [14].

We further discuss qualitative results of our algorithm.

4.1 Hardware

Our computer uses an nVidia GeForce GTX 880M graphics card with 8GB of GPU memory, along with a 2.7GHz Intel Core i7 4800MQ processor. It has 32GB of main memory, along with a 1TB SSD. We allocated an additional 64GB of swap space in order to guarantee the function of our algorithm. We used an ASUS Xtion Pro camera for all experiments.

Table 4.1: Model to Ground-Truth Error

                          traj0      traj1      traj2
  KinFu Volume
    vertices            1497300    1465488     506229
    mean (m)             0.0309     0.0534     0.0213
    median (m)           0.0215     0.0375     0.0134
    std (m)              0.0282     0.0542     0.0254
  Eight Dense Volumes
    vertices            7318191    7354923    2192196
    mean (m)             0.0335     0.0537     0.0222
    median (m)           0.0253     0.0368     0.0152
    std (m)              0.0315     0.0545     0.0250

4.2 Runtime Performance

The fundamental tradeoff of our algorithm is speed versus resolution. As the number of volumes we use increases, we expect the time spent transferring data between GPU memory and CPU memory to account for the majority of the processing time. Because this transfer happens for every volume on each frame, we expect the time to be linear in the number of volumes, until the computer switches to swap memory.

We ran the following experiment to validate our runtime expectations. On a pre-recorded 100-frame real-world dataset we dynamically allocated a fixed number of volumes with 512 linear voxels on the first frame. For 2 to 7 volumes we used 1.5m volumes. In order to guarantee there existed unique data for each volume, we decreased the size of each volume as the number of volumes increased. Our results show that the seconds per frame is linear in the number of volumes, with each volume adding approximately 0.17s per frame. By running the algorithm with 48 volumes we demonstrate the ability to generate a 48x increase in resolution compared to a single volume. We also found that both PCL KinFu and our implementation ran at approximately 11 frames per second for one volume. It is worth noting that this is less than the expected 30 frames per second because the images were read out of memory instead of from a live camera feed.

Figure 4-1: Example of point cloud registration using Cloud Compare. The white model is ground truth. The colored model is our reconstruction. The colors represent a heat map of distances from ground truth, where blue indicates small error and red indicates large error.

4.3 Comparison with Baseline

We compare our algorithm to PCL KinFu on the synthetic dataset developed by Handa et al. [14]. This dataset provides a series of RGB and depth frames that are structured to simulate camera movement through a virtual living room. Also provided is an .obj model for each dataset as ground truth. In order to accurately simulate a depth camera, Handa et al. added noise as described in [3] to each frame. For our experiments we limited the range of camera scenes to those contained by a 6m KinectFusion volume.

For the experiments we ran each mapping algorithm on a scene in order to generate a 3D map. We then used the open source tool Cloud Compare (http://www.danielgm.net/cc/) to align the reconstruction to the ground truth model. This provides the distance from each point in the reconstruction to the nearest mesh triangle in the ground truth model. Using the statistics provided by [14], we then determine the mean, median, and standard deviation of these distances for each reconstruction, the results of which can be seen in Table 4.1.
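For reference, a minimal sketch of how such statistics can be computed once the reconstruction is aligned to the ground truth is given below. This is a cloud-to-cloud approximation against a densely sampled ground-truth model; the function name and the use of a KD-tree are illustrative assumptions, whereas the numbers in Table 4.1 were obtained with Cloud Compare's cloud-to-mesh distances.

```python
import numpy as np
from scipy.spatial import cKDTree

def error_stats(recon_pts, ground_truth_pts):
    """Mean / median / std of nearest-neighbor distances from the
    reconstructed vertices to a densely sampled ground-truth model.

    recon_pts        : (N, 3) reconstructed vertices, aligned to ground truth
    ground_truth_pts : (M, 3) points sampled from the ground-truth mesh
    """
    dists, _ = cKDTree(ground_truth_pts).query(recon_pts)
    return {
        "vertices": len(recon_pts),
        "mean (m)": float(np.mean(dists)),
        "median (m)": float(np.median(dists)),
        "std (m)": float(np.std(dists)),
    }
```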
For traj0 and traj1 we ran PCL KinFu with a 6m volume size, and our algorithm with a set of 8 volumes forming a cube of 6m (3m per volume). For traj2 we ran PCL KinFu with a 3m volume size, and our algorithm with a set of 8 volumes forming a cube of 3m (so 1.5m per volume). All trajectories used 512 linear voxels per volume.

Table 4.1 shows that our algorithm generates on average 4.75 times as many vertices, indicating a 4.75-fold increase in resolution by our definition. Reconstruction accuracy is in line with KinFu, with the average mean error increased by 4.2%. Because the synthetic meshes are generated from a relatively small number of points and the synthetic dataset consists almost entirely of flat surfaces, reconstruction resolution has only a minor impact on error, and we do not expect to see an improvement in model reconstruction statistics relative to KinectFusion, even for large volumes. Given the small difference in quality and the large increase in the number of vertices, we believe this highlights the algorithm's ability to generate high-resolution reconstructions without reducing accuracy. As a matter of future work it would be useful to generate a dense synthetic dataset and determine whether our model reduces mesh reconstruction error.

4.4 Qualitative Results

For small datasets we found small-volume dynamic allocation to be effective. Figure 4-2 shows the comparison of an office scene scanned with a single 4m volume against a series of .5m volumes which were dynamically allocated. We see that the indentations between the keys of the computer keyboard are more visible for the small volumes.

Figure 4-2: Top: Model of a desk generated by a 4m KinectFusion volume. Here we can clearly see the triangles that generate the mesh, and it is difficult to distinguish the keyboard keys. Bottom: Model of the desk generated by dynamically allocated .5m volumes. The triangles are smaller and so less noticeable, and the delineation between keyboard keys becomes clear. In order to map the space we needed to translate the volume over a larger space, but we zoom in on the laptop to highlight the visible resolution difference.

We tested our dynamic volume allocation with 3m volumes while walking down a hallway, as seen in Figure 3-3. To obtain this dataset the camera was not rotated as it was translated through the hallway, highlighting the ability to obtain large-scale datasets easily. We found that large datasets like this often had drift due to accumulated error in ICP, which can be resolved by improved visual odometry (see Section 5.2.1).

The fireplace seen in Figure 4-3 highlights the power of our algorithm. The figure was generated by a 40 second scan of the fireplace and mantelpiece.

Figure 4-3: Top: Fireplace scene at the Kendall Hotel in Cambridge, MA. Bottom: Zoom of two busts from the same model. The model was generated using dynamically placed .75m volumes. This highlights the ability to create models of relatively large spaces while maintaining a high level of detail. Unfortunately some colors appear pasted behind objects. This is due to a misalignment of the colors with the 3D model, and is not an issue with the 3D model itself.
By acting on a "medium" scale scan such as this, we do not see the sensor drift present in hallway scans. Additionally, this highlights how a large area can be scanned at high resolution while we still see the detail in the individual busts.

4.5 Summary

This chapter has presented quantitative measures of the accuracy and time tradeoffs of higher resolution reconstruction. We found a linear increase in computation time relative to resolution, but little loss in accuracy with increased resolution. The next chapter provides a summary of our work and areas for future work.

Chapter 5

Conclusion

5.1 Summary of contribution

This thesis makes two significant contributions towards 3D mapping for robotics. First, it provides a framework by which a user can arbitrarily increase the resolution of a reconstructed model in exchange for computation time. By providing this, a user can create high-resolution scans of relatively small objects, which can result in more precise object detection for methods such as object recognition through change detection [13].

On top of this we built a general framework for dynamic placement of subvolumes, allowing the scanned region to grow dynamically during runtime. This has a multitude of potential uses. In the case of SLAM, it allows for mapping of an environment of unbounded size. Because of the computational cost, this is mostly beneficial for mapping a region in advance, which is useful for techniques such as those in [26]. Additionally, it allows for high-resolution maps while requiring minimal work on the part of the user. KinectFusion and derived works require that the user be able to mentally project the TSDF volume onto the space for optimal reconstruction, whereas our technique has no such requirement. Our technique allows users to easily create high-resolution reconstructions of regions with minimal user input, with the infrastructure to modify the reconstruction resolution of different regions based on user-specified parameters.

5.2 Future work

A number of improvements can be made to the work in this thesis to extend its utility in SLAM and robotic object scanning.

5.2.1 Large Scale Improvements

The biggest limitation of our system is sensor drift for large-scale reconstructions. Presently, when stitching TSDF subvolumes over large regions, noise in sensor measurements causes drift in the reconstruction. For example, long straight hallways tend to curve. This issue could be largely mitigated by using improved camera tracking techniques that are more robust than ICP. For example, [49] compares the DVO camera tracking of Kerl et al. [22], the RGB-D methods of Endres et al. [11], the multiresolution surfel maps of Stückler and Behnke [46], and a custom solution Whelan et al. developed. All of these provide better camera tracking than the current solution and would greatly improve large-scale mapping.

Additionally, our work incurs significant memory and computational overhead when scanning larger regions. In order to reduce the computational costs it would be useful to integrate our work with sparser representations, such as octrees that recursively subdivide space based on surface crossing locations. OctoMap [51], for example, provides such a direction which could be explored.

5.2.2 Color Integration

The focus of this work was to create high-resolution point clouds based on available depth data. However, being able to appropriately color our reconstructions can provide rich information for object detection.
5.2.2 Color Integration

The focus of this work was to create high-resolution point clouds from the available depth data. However, being able to appropriately color our reconstructions can provide rich information for object detection. As seen in Figure 4-3, our current method exhibits drift between the color information and the depth information. As future work it would be useful to formally integrate color into the reconstruction and remove this drift; one possible accumulation scheme is sketched below.
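One plausible starting point (a sketch under our own assumptions, not something implemented in this thesis) is to keep a running weighted average of color alongside each voxel's TSDF value and update it in the same pass as the depth integration. The TsdfColorVoxel struct, the sample weighting, and the weight cap below are illustrative choices rather than parts of the existing system.

#include <algorithm>
#include <cstdint>

// Illustrative per-voxel state: truncated signed distance plus accumulated color.
struct TsdfColorVoxel {
  float tsdf = 1.0f;                   // truncated signed distance
  float weight = 0.0f;                 // depth integration weight
  float r = 0.0f, g = 0.0f, b = 0.0f;  // running color average
  float color_weight = 0.0f;
};

// Fold one RGB sample into a voxel, mirroring the weighted TSDF update.
// sample_weight must be positive; max_weight bounds the influence of old data.
void integrateColor(TsdfColorVoxel& v, uint8_t r, uint8_t g, uint8_t b,
                    float sample_weight, float max_weight = 64.0f) {
  const float w_old = v.color_weight;
  const float w_sum = w_old + sample_weight;
  v.r = (v.r * w_old + r * sample_weight) / w_sum;
  v.g = (v.g * w_old + g * sample_weight) / w_sum;
  v.b = (v.b * w_old + b * sample_weight) / w_sum;
  v.color_weight = std::min(w_sum, max_weight);
}

Note that this only adds color accumulation; the misalignment visible in Figure 4-3 is a registration problem between the RGB and depth streams, so the per-frame color-to-depth calibration would still need to be corrected separately.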
5.2.3 Integration with SLAM and object recognition

Currently our work is standalone and has not been integrated with any existing SLAM solution. Moving forward, we hope that once the large-scale improvements described above are made, large-scale models can be built for use in robotics. With such a high-resolution backdrop, a robot traveling through a pre-mapped environment can know with higher certainty where it is. This is highlighted by the fact that our algorithm potentially reduces the error in object detection systems [13], as the ground truth will be known at higher resolution. Additionally, for missions where a robot is able to enter an environment only a limited number of times, our algorithm provides the ability to retrieve high-resolution information and reduce the need to enter the environment again. Thus we hope to extend our algorithm for such mapping and data collection scenarios in the future.

Bibliography

[1] T. Bailey and H.F. Durrant-Whyte. Simultaneous localisation and mapping (SLAM): Part I. Robotics & Automation Magazine, 13(2):99–110, June 2006.
[2] T. Bailey and H.F. Durrant-Whyte. Simultaneous localisation and mapping (SLAM): Part II. Robotics & Automation Magazine, 13(3):108–117, September 2006.
[3] Jonathan T. Barron and Jitendra Malik. Intrinsic scene properties from a single RGB-D image. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 17–24. IEEE, 2013.
[4] Paul J. Besl and Neil D. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Machine Intell., 14(2):239–256, February 1992.
[5] D. R. Canelhas, T. Stoyanov, and A. J. Lilienthal. SDF Tracker: A parallel algorithm for on-line pose estimation and scene reconstruction from depth images. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Tokyo, Japan, November 2013.
[6] Siddharth Choudhary, Alexander J.B. Trevor, Henrik I. Christensen, and Frank Dellaert. SLAM with object discovery, modeling and mapping. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pages 1018–1025. IEEE, 2014.
[7] A.I. Comport, E. Malis, and P. Rives. Accurate quadri-focal tracking for robust 3D visual odometry. In IEEE International Conference on Robotics and Automation (ICRA), Rome, Italy, April 2007.
[8] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In SIGGRAPH, pages 303–312, August 1996.
[9] F. Dellaert. Square Root SAM: Simultaneous location and mapping via square root information smoothing. In Robotics: Science and Systems (RSS), Cambridge, MA, 2005.
[10] Bertram Drost, Markus Ulrich, Nassir Navab, and Slobodan Ilic. Model globally, match locally: Efficient and robust 3D object recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 998–1005. IEEE, 2010.
[11] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard. An evaluation of the RGB-D SLAM system. In IEEE Intl. Conf. on Robotics and Automation (ICRA), 2012.
[12] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In Computer Vision–ECCV 2014, pages 834–849. Springer, 2014.
[13] R. Finman, T. Whelan, M. Kaess, and J. Leonard. Toward lifelong object segmentation from change detection in dense RGB-D maps. In European Conference on Mobile Robotics, Barcelona, Spain, September 2013.
[14] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation (ICRA), Hong Kong, China, May 2014. (to appear).
[15] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. Second edition.
[16] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments. Intl. J. of Robotics Research, 31(5):647–663, 2012.
[17] Peter Henry, Dieter Fox, Achintya Bhowmik, and Rajiv Mongia. Patch Volumes: Segmentation-based consistent mapping with RGB-D cameras. In Third Joint 3DIM/3DPVT Conference (3DV), 2013.
[18] Evan Herbst, Peter Henry, and Dieter Fox. Toward online 3-D object segmentation and mapping. In IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.
[19] S. Izadi, R.A. Newcombe, D. Kim, O. Hilliges, D. Molyneaux, S. Hodges, P. Kohli, J. Shotton, A.J. Davison, and A. Fitzgibbon. KinectFusion: Real-time dynamic 3D surface reconstruction and interaction. In SIGGRAPH, page 23, August 2011.
[20] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J.J. Leonard, and F. Dellaert. iSAM2: Incremental smoothing and mapping with fluid relinearization and incremental variable reordering. In IEEE Intl. Conf. on Robotics and Automation (ICRA), Shanghai, China, May 2011.
[21] Andrej Karpathy, Stephen Miller, and Li Fei-Fei. Object discovery in 3D scenes via shape analysis. In IEEE Intl. Conf. on Robotics and Automation (ICRA), 2013.
[22] C. Kerl, J. Sturm, and D. Cremers. Dense visual SLAM for RGB-D cameras. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2013.
[23] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In IEEE and ACM Intl. Sym. on Mixed and Augmented Reality (ISMAR), pages 225–234, Nara, Japan, November 2007.
[24] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Detection-based object labeling in 3D scenes. In IEEE Intl. Conf. on Robotics and Automation (ICRA), 2012.
[25] J. J. Leonard and H. F. Durrant-Whyte. Simultaneous map building and localization for an autonomous mobile robot. In Proc. IEEE Int. Workshop on Intelligent Robots and Systems, pages 1442–1447, Osaka, Japan, 1991.
[26] J. Levinson. Automatic Laser Calibration, Mapping, and Localization for Autonomous Vehicles. PhD thesis, Stanford University, 2011.
[27] C. Liebe, C. Padgett, J. Chapsky, D. Wilson, K. Brown, S. Jerebets, H. Goldberg, and J. Schroeder. Breadboard of a hazard avoidance sensor using structured light developed at the Jet Propulsion Laboratory, 2005. [Online; accessed May 13, 2015].
[28] F. Lu and E. Milios. Globally consistent range scan alignment for environmental mapping. Autonomous Robots, 4:333–349, April 1997.
[29] M. Meilland and A.I. Comport. On unifying key-frame and voxel-based dense visual SLAM at large scales. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Tokyo, Japan, November 2013.
[30] M. Meilland, T. Drummond, and A.I. Comport. A unified rolling shutter and motion model for dense 3D visual tracking. In Intl. Conf. on Computer Vision (ICCV), Sydney, Australia, December 2013.
[31] Maxime Meilland and Andrew I. Comport. Super-resolution 3D tracking and mapping. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 5717–5723. IEEE, 2013.
[32] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE and ACM Intl. Sym. on Mixed and Augmented Reality (ISMAR), pages 127–136, Basel, Switzerland, October 2011.
[33] R.A. Newcombe, D. Fox, and S. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2015.
[34] R.A. Newcombe, S.J. Lovegrove, and A.J. Davison. DTAM: Dense tracking and mapping in real-time. In Intl. Conf. on Computer Vision (ICCV), pages 2320–2327, Barcelona, Spain, November 2011.
[35] I. Oikonomidis, N. Kyriazis, and A.A. Argyros. Efficient model-based 3D tracking of hand articulations using Kinect. In Proc. BMVC, 2011.
[36] K. Pirker, M. Rüther, G. Schweighofer, and H. Bischof. GPSlam: Marrying sparse geometric and dense probabilistic visual mapping. In British Machine Vision Conf. (BMVC), August 2011.
[37] J.G. Rogers, A.J.B. Trevor, C. Nieto-Granda, and H.I. Christensen. Simultaneous localization and mapping with learned object recognition and semantic data association. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 1264–1270, 2011.
[38] D.M. Rosen, M. Kaess, and J.J. Leonard. An incremental trust-region method for robust online sparse least-squares estimation. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1262–1269, St. Paul, MN, May 2012.
[39] D.M. Rosen, M. Kaess, and J.J. Leonard. RISE: An incremental trust-region method for robust online sparse least-squares estimation. Robotics and Automation, IEEE Transactions on, 2014.
[40] H. Roth and M. Vona. Moving volume KinectFusion. In British Machine Vision Conf. (BMVC), September 2012.
[41] R. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In IEEE Intl. Conf. on Robotics and Automation (ICRA), Shanghai, China, May 2011.
[42] Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, Mihai Dolha, and Michael Beetz. Towards 3D point cloud based object maps for household environments. J. of Robotics and Autonomous Systems, 56(11):927–941, 2008.
[43] R.F. Salas-Moreno, R.A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison. SLAM++: Simultaneous localisation and mapping at the level of objects. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Portland, Oregon, June 2013.
[44] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 195–202, June 2003.
[45] F. Steinbrücker, C. Kerl, J. Sturm, and D. Cremers. Large-scale multi-resolution surface reconstruction from RGB-D sequences. In IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 2013.
[46] Jörg Stückler and Sven Behnke. Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation, 2013.
[47] S. Thrun, D. Fox, and W. Burgard. A probabilistic approach to concurrent mapping and localization for mobile robots. Machine Learning, 31:29–53, 1998. Also appeared in Autonomous Robots 5, 253–271 (joint issue).
[48] T. Tykkälä, A. I. Comport, and J-K. Kamarainen. Photorealistic 3D mapping of indoors by RGB-D scanning process. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Tokyo, Japan, November 2013.
[49] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald. Real-time large scale dense RGB-D SLAM with volumetric fusion. Intl. J. of Robotics Research, 2014. To appear.
[50] T. Whelan, M. Kaess, J.J. Leonard, and J.B. McDonald. Deformation-based loop closure for large scale dense RGB-D SLAM. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Tokyo, Japan, November 2013.
[51] K. M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard. OctoMap: A probabilistic, flexible, and compact 3D map representation for robotic systems. In Proc. of the ICRA 2010 Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation, Anchorage, AK, USA, May 2010.
[52] L. Xia, C. Chen, and J.K. Aggarwal. Human detection using depth information by Kinect. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2011.
[53] Ming Zeng, Fukai Zhao, Jiaxiang Zheng, and Xinguo Liu. A memory-efficient KinectFusion using octree. In Computational Visual Media, volume 7633 of Lecture Notes in Computer Science, pages 234–241. Springer, 2012.
[54] Q. Zhou and V. Koltun. Dense scene reconstruction with points of interest. In SIGGRAPH 2013, Anaheim, CA, USA, 2013. ACM.
[55] Q. Zhou, S. Miller, and V. Koltun. Elastic fragments for dense scene reconstruction. In Intl. Conf. on Computer Vision (ICCV), December 2013.