3D Sensing and Mapping
Intelligent Robotics, Spring 2014
Kris Hauser

Agenda
• A high-level overview of visual sensors and perception algorithms
• Core concepts:
  • Camera / projective geometry
  • Point clouds
  • Occupancy grids
  • Iterative closest point algorithm

Sensors
• Proprioceptive: sense one's own body
  • Motor encoders (absolute or relative)
  • Contact switches (joint limits)
• Inertial: sense accelerations of a link
  • Accelerometers
  • Gyroscopes
  • Inertial Measurement Units (IMUs)
• Visual: sense the 3D scene with reflected light
  • RGB: cameras (monocular, stereo)
  • Depth: lasers, radar, time-of-flight, stereo + projection
  • Infrared, etc.
• Tactile: sense forces
  • Contact switches
  • Force sensors
  • Pressure sensors
• Other
  • Motor current feedback: sense effort
  • GPS
  • Sonar

Sensing vs Perception
• Sensing: acquisition of signals from hardware
• Perception: processing of "raw" signals into "meaningful" representations
• Example:
  • Reading pixels from a camera is sensing.
  • Declaring "it's a rounded shape with skin color", "it's a face", or "it's Joe's smiling face" are different levels of perception.
• Especially at lower levels of perception like signal processing, the line is blurry, and often the processed results can essentially be considered "sensed".

3D Perception Topics
• Sensors: visible-light cameras, depth sensors, laser sensors
• (Some) perception tasks:
  • Stereo reconstruction
  • Object recognition
  • 3D mapping
  • Object pose recognition
• Key issues:
  • How to represent and optimize camera transforms?
  • How to fit models in the presence of noise?
  • How to represent large 3D models?

Visual Sensors
• Visible-light cameras: cheap, low power, high resolution, high frame rates
  • Data: 2D field of RGB pixels
  • Stereo cameras
  [Image: Bumblebee stereo camera]
• Depth field sensors
  • Two major types: infrared pattern projection (Kinect, ASUS, PrimeSense) and time-of-flight (SwissRanger)
  • Data: 2D field of depth values
  [Image: ASUS Xtion depth sensor]
• Sweeping laser sensors
  • Data: 1D field of depth values
  • Hokuyo, SICK, Velodyne
  • Can be mounted on a tilt/spin mount to get a 3D field of view
  [Image: Hokuyo laser sensor]

Sensors vary in strengths / weaknesses
• Velodyne (DARPA Grand Challenge)
  • 1.3 million readings/s
  • $75k price tag

Image formation
• Light bounces off an object, passes through a lens, and lands on a CCD pixel on the image plane
• Depth of focus
• Illumination and aperture
• Color: accomplished through the use of filters, e.g. a Bayer filter
  • Each channel's in-between pixels are interpolated

Idealized projective geometry
• Let:
  • Zim: distance from the image plane to the focal point along the depth axis
  • (X,Y,Z): point in 3D space relative to the focal point, Z > 0
• Then, the image-space point is:
  • Xc = Zim · X / Z
  • Yc = Zim · Y / Z
• …which gets scaled and offset to obtain pixel coordinates
[Figure: focal point, image plane, and depth axis z; the 3D point (X,Z) projects to (Xc,Zim) on the image plane]

Issues with real sensors
• Motion blur
• Distortion caused by lenses
• Non-square pixels
• Exposure
• Noise
  • Film grain
  • Salt-and-pepper noise
  • Shot noise
[Example images: motion blur, noise, distortion, exposure]

Calibration
• Determine the camera's intrinsic parameters
  • Focal length
  • Field of view
  • Pixel dimensions
  • Radial distortion
• These determine the mapping from image pixels to an idealized pinhole camera
• Rectification

Stereo vision processing
• Dense reconstruction
  • Given two rectified images, find the binocular disparity at each pixel
  • Take a small image patch around each pixel in the left image, and search for the best horizontally shifted copy in the right image (a small sketch follows below)
  • What size patch? What search size? What matching criterion?
  • Works best for highly textured scenes
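To make the block-matching idea concrete, here is a minimal Python/NumPy sketch. It assumes the two rectified images are equal-sized grayscale float arrays; the patch size and disparity search range are illustrative choices, not prescribed values.

```python
import numpy as np

def disparity_ssd(left, right, patch=5, max_disp=64):
    """Brute-force dense disparity by SSD block matching on two rectified
    grayscale images (equal-shape float arrays). A teaching sketch, not an
    optimized implementation."""
    h, w = left.shape
    r = patch // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(r, h - r):
        for x in range(r, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            best_d, best_cost = 0, np.inf
            # search horizontal shifts (to the left) in the right image
            for d in range(0, min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1, x - d - r:x - d + r + 1]
                cost = np.sum((ref - cand) ** 2)   # sum-of-squared-differences criterion
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

Practical systems use better matching criteria (e.g. normalized cross-correlation) and add smoothness constraints, but the nested patch search above is the core of dense disparity estimation.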
Point Clouds
• Unordered list of 3D points P = {p1, …, pn}
• Each point optionally annotated by:
  • Color (RGB)
  • Sensor reading ID # (why?)
  • Estimated surface normal (nx, ny, nz)
• No information about objects, occlusions, or topology
• Point Cloud Library (PCL)
  • http://pointclouds.org

3D Mapping
• Each frame of a depth sensor gives a narrow snapshot of the world geometry from a given position
• 3D mapping is the process of stitching multiple views into a global model

Three scenarios
• Consider two raw point clouds P1 and P2 from cameras with transformations T1 and T2.
• Goal: build point cloud P in frame T1 (assume identity)
• Case 1: relative transformations known
  • Simple union P = P1 ∪ (T2 ⋅ P2)
• Case 2: small transformation
  • Pose registration problem
  • Vast majority of points correspond between scenes
• Case 3: large transformation
  • Significant fraction of points do not correspond; lighting differences, more occlusions
  • Pose registration must be more robust to outliers

Case 2: Small transformations
• Visual odometry: estimate the relative motion of subsequent frames using optical flow
  • Define feature points in P1 (e.g., a corner detector)
  • Estimate a transformation of an image patch around each feature that best matches P2 (defines an optical flow field)
  • Transformation: translation, rotation, scale
  • Fit T2 to match these feature transforms

Case 3: Large transformations
• Iterative closest point (ICP) algorithm
• Input: initial guess for T2
• Repeat until convergence:
  • Find nearest-neighbor pairings between P1 and T2 ⋅ P2 (what metric?)
  • Select those pairs that fall below some distance threshold (outlier rejection; what criteria?)
  • Assign an error metric and optimize T2 to minimize this metric (how to minimize?)

What metrics for matching?
• Position
• Surface normal
• Color
• Nearest-neighbor methods
  • Fast data structures, e.g. k-d trees
  • For large scans, usually want a constant-sized subsample
• Projection-based methods
  • Render the scene from the perspective of T1 to determine a match
  • Very fast (used in the Kinect Fusion algorithm)
  • Only uses position information

What criteria for outlier rejection?
• Distance too large (e.g., top X%)
• Inconsistencies with neighboring pairs
• On the boundary of the scan
(A small sketch of the pairing and rejection steps follows below.)
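As an illustration of these two steps, here is a minimal Python/NumPy sketch. It assumes P1 and P2 are N×3 arrays of points, T2 is a 4×4 homogeneous transform, and it keeps a fixed fraction of the closest pairs; the 90% figure is an arbitrary illustrative threshold.

```python
import numpy as np
from scipy.spatial import cKDTree

def pair_and_reject(P1, P2, T2, keep_fraction=0.9):
    """One ICP data-association step: transform P2 by the current guess T2,
    pair each transformed point with its nearest neighbor in P1 (k-d tree),
    then reject the largest-distance pairs as outliers."""
    R, t = T2[:3, :3], T2[:3, 3]
    Q = P2 @ R.T + t                    # apply T2 to P2 (points stored as rows)
    dists, idx = cKDTree(P1).query(Q)   # nearest neighbor in P1 for each point
    cutoff = np.quantile(dists, keep_fraction)
    keep = dists <= cutoff              # drop e.g. the worst 10% by distance
    return P1[idx[keep]], P2[keep]      # matched pairs handed to the optimizer
```

The remaining step, solving for the R and t that best align the surviving pairs, is exactly the optimization discussed next.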
How to optimize?
• Want to find the rotation R and translation t of T2 that minimize some error function
• Sum of squared point-to-point differences
  • E(R, t) = Σ_{i=1..N} ‖p_i − (R·q_i + t)‖²
  • Closed-form solution (SVD); a small sketch is given at the end of these notes
  • Very fast per step
• Sum of squared point-to-plane differences
  • E(R, t) = Σ_{i=1..N} (n_i · (p_i − (R·q_i + t)))²
  • Must use numerical methods
  • Must deal with the rotation variable
  • Tends to lead ICP to converge in fewer iterations

Other applications of ICP
• Fitting known, complete 3D triangulated models to partial point clouds for object recognition / pose estimation

Stitching multiple scans
• In its most basic form, multi-view 3D mapping is simply a repetition of the two-camera case
• But two major issues:
  • Drift and "closing the loop"
  • Point clouds become massive after many scans (this time)

Point cloud growth problem
• With N points, at F frames/sec, and T seconds of run time, N·F·T points are gathered
• Kinect: N = 307200, F = 30, T = 60 => 552,960,000 points
• With RGB in 4 bytes and XYZ in 12 bytes => ~8 GB / min
• Solutions:
  • Forget earlier scans (short-term memory)
  • Build a persistent, "collapsed" representation of the environment geometry
    • Polygon meshes
    • Occupancy grids
  • Key issue: how to estimate with a low # of points and update later?

Occupancy grids
• Store a grid G with a fixed minimum resolution
• Mark which cells (voxels) are occupied by a point
• Representation size is independent of T
• Two options for updating on a new scan:
  1. Compute ICP to align the current scan to the prior scan, then add points to G
  2. Modify ICP to work directly with the representation G

Probabilistic Occupancy Grids with Ray Casting
• Scans are noisy, so simply adding points is likely to overestimate the occupied cells
• Ray casting approaches:
  • Each cell has a probability of being free / occupied / unseen
  • Each scan defines a line segment that passes through free space and ends in an occupied cell (or near one)
  • Walk along the segment, increasing P(free(c)) for each encountered cell c, and finally increase P(occupied(c)) for the terminal cell c

Compact geometry representations within a cell
• On-line averaging
• On-line least-squares estimation of a fitting plane

Handling large 3D grids
• Problem: tabular 3D grid storage grows as O(N³)
  • e.g., a 1024³ grid is 1 GB at one byte per cell
• Solutions:
  • Store a hash table of only the occupied cells
  • Octree data structure
  • OctoMap library (http://octomap.sourceforge.net)

Dynamic Environments
• Real environments have people, animals, and objects that move around
• Two options:
  • Map the static parts, assuming dynamic objects will average out as noise over time (probabilistic occupancy grids)
  • Segment (and possibly model) the dynamic objects

Related topics
• Sensor fusion
• Object segmentation and recognition
• Simultaneous Localization and Mapping (SLAM)
• Next-best-view planning

Next week
• Computer Vision guest lecture (Zeeshan)
• Kalman filtering
• Principles Ch. 8
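For reference, here is a minimal Python/NumPy sketch of the closed-form SVD step mentioned under "How to optimize?". Given matched N×3 arrays P (target points p_i) and Q (source points q_i) from the pairing step, it computes the R and t minimizing Σ ‖p_i − (R·q_i + t)‖² (the standard Kabsch/Umeyama solution).

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t minimizing
    sum_i ||p_i - (R q_i + t)||^2 for matched point sets P, Q (both N x 3),
    solved in closed form via SVD."""
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - q_bar).T @ (P - p_bar)          # 3x3 cross-covariance of the centered sets
    U, _, Vt = np.linalg.svd(H)
    # guard against a reflection: force det(R) = +1
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = p_bar - R @ q_bar
    return R, t
```

Alternating the pairing/rejection sketch above with this solver, and composing the result into T2, gives one complete iteration of point-to-point ICP.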