Autonomous Visual Navigation for Planetary Exploration Rovers

M. Lourakis, G. Chliveros and X. Zabulis
Institute of Computer Science, Foundation for Research and Technology – Hellas, Heraklion, Crete, Greece

ASTRA 2013, 15–17 May 2013, Noordwijk, The Netherlands

Introduction
• The talk concerns ongoing work pursued in the context of the SEXTANT activity
• SEXTANT is funded by ESA and aims to develop visual navigation algorithms suitable for use by Martian rovers
• Low-performance target CPU (150 MIPS)
• The vision algorithms should require as little computing power, memory footprint and communication overhead as possible
• The computationally most intensive parts are implemented on an FPGA (not part of this talk)

SEXTANT – CDR 2012/11/28

Terrain mapping & visual odometry
• Two key elements of visual navigation are terrain mapping and visual odometry
• Terrain mapping uses dense binocular stereo on a set of image pairs to produce a 3D representation of the viewed scene that will be utilized for obstacle avoidance
• Visual odometry (VO) is the process of estimating the egomotion of a mobile system using as input only the images acquired by its cameras
• VO will be used as a building block for a complete vSLAM system (the latter may include loop closure and possibly global optimization of the map)

Camera setup
• The prototype rover is equipped with two pairs of stereo cameras on a mast
• The high pair is used for mapping, the low one for VO
• Stereo ⇒ no depth/scale ambiguity

Dense 3D reconstruction: plane-sweeping
• A local stereo method based on optimizing visual similarity along rays, using a hypothetical plane swept along depth
• Simple yet effective
• Amenable to efficient implementation
• Inherently parallelizable; the level of detail can be modulated
• Has been successfully used for large-scale urban reconstruction (e.g.
the UrbanScape project)
• Expandable to more than two cameras for greater accuracy
• Limitations: textureless or non-reflective surfaces; illumination specularities
• The result can be a depth map or a point cloud

R. Collins. A Space-Sweep Approach to True Multi-Image Matching. CVPR 1996.

SEXTANT – TConf1 2012/09/24

Plane-sweeping: sample result
• 512×348 images
• 101 depth planes
• 441×351 plane resolution
• 15×15 correlation kernel

Visual odometry: processing pipeline
• Harris corner detection
• BRIEF descriptor extraction
• BRIEF descriptor matching
• Sparse stereo triangulation
• Pose determination from 2D–3D matches in two views
• No temporal smoothing, e.g. local bundle adjustment

Harris (Plessey) corner detection: formulation
• Corner points exhibit significant intensity change in all directions
• The intensity change within a shifted image patch is captured by the 2×2 autocorrelation matrix M involving the image derivatives:

  M = \sum_{x,y} w(x,y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

• The eigenvalues λ₁, λ₂ of M quantify the intensity change in a shifting window
• A large intensity change in all directions implies that both eigenvalues should be large
• The eigenvalues are not computed explicitly, thus avoiding the costly calculation of square roots

Harris corner detection: formulation (cont'd)
• A "cornerness" measure R determines whether λ₁, λ₂ are sufficiently large:

  R = \det M - k\,(\operatorname{trace} M)^2, \qquad \det M = \lambda_1 \lambda_2, \quad \operatorname{trace} M = \lambda_1 + \lambda_2

• k is an empirical constant, chosen between 0.04 and 0.06; alternative cornerness measures that avoid arbitrary constants are also available
• Subpixel corner approximation by locating the maximum of a quadratic surface fitted to R
• Harris has high detection and repeatability rates but poor localization
• Involves moderate computational cost (mostly separable convolutions)
• Not rotation- or scale-invariant

C. Harris and M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conf.
1988.

Harris corner selection: ANMS
• Improved spatial distribution with the ANMS (Adaptive Non-Maximal Suppression) scheme of Brown, Szeliski and Winder
• Only corners whose cornerness is locally maximal are retained
Strongest corners using plain Harris (left) and Harris with ANMS (right)

M. Brown et al. Multi-Image Matching Using Multi-Scale Oriented Patches. CVPR 2005.

BRIEF descriptor (BRIEF: Binary Robust Independent Elementary Features)
• Performs several pairwise intensity comparisons on a Gaussian-smoothed image patch and encodes the outcomes in a bit vector
• The pattern of pixels to be compared by BRIEF in each patch is selected randomly (and is the same for all patches)
• BRIEF is less discriminative than SIFT but much faster to compute and match, and more compact to store

M. Calonder et al. BRIEF: Binary Robust Independent Elementary Features. ECCV 2010.

BRIEF descriptor matching
• Standard distance ratio test (Lowe): matches between images I1 and I2 are identified by finding the two nearest neighbors of each keypoint from I1 among those in I2, and accepting a match only if the distance to the closest neighbor is less than a fixed fraction of the distance to the second closest neighbor
• The test is non-symmetric
• Nearest neighbors are determined with the Hamming distance, which counts the number of positions at which two bit strings differ
• Maximum-disparity and epipolar constraints are also imposed

Sparse stereo
• Given some corner matches in a calibrated stereo pair, the 3D points they originate from can be recovered via triangulation
• Triangulating rays will not intersect exactly; approximating their point of intersection with least squares has no physical meaning
Rays L1, L2 from camera centers C1, C2 through image points m1, m2, passing near the 3D point M
• The skewness of L1, L2 is dealt with by correcting m1, m2 so that they are guaranteed to comply with the epipolar geometry
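The triangulation step just described can be sketched with a standard linear (DLT) solve. This is an illustrative stand-in rather than the pipeline's exact method, and the names `P1`, `P2`, `m1`, `m2` are assumptions:

```python
import numpy as np

def triangulate(P1, P2, m1, m2):
    """Linear (DLT) triangulation of one corner match.

    P1, P2 : 3x4 projection matrices of the calibrated stereo pair (assumed given);
    m1, m2 : matched image points (x, y) in the two views.
    Returns the 3D point minimizing the algebraic error; the slides' pipeline
    instead first corrects m1, m2 to comply with the epipolar geometry.
    """
    # Each view contributes two linear equations on the homogeneous 3D point:
    # x * P[2] - P[0] = 0 and y * P[2] - P[1] = 0.
    A = np.vstack([
        m1[0] * P1[2] - P1[0],
        m1[1] * P1[2] - P1[1],
        m2[0] * P2[2] - P2[0],
        m2[1] * P2[2] - P2[1],
    ])
    # The point is the null vector of A, i.e. the last right singular vector.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With noise-free projections this recovers the 3D point exactly; with noisy corners the algebraic minimum has no geometric meaning, which is why the pipeline corrects the matched points first.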
Sparse stereo (cont'd)
• Given an epipolar plane, we seek the optimal 3D point for (m1, m2)
Corrected points m1´, m2´ lying on the epipolar lines l1, l2
• The solution is to select the closest points (m1´, m2´) on the epipolar lines and obtain the 3D point through exact triangulation
• This is achieved by minimizing the distances to the epipolar lines with a non-iterative method involving the roots of a sixth-degree polynomial
• An approximate but much cheaper alternative, which we use, is to rely on the Sampson approximation of the distance error

Pose estimation: overview
• Concerns the determination of the position and orientation of a camera given its intrinsic parameters and a set of N correspondences between 3D points and their 2D projections
• Has been extensively studied due to its diverse applicability in computer vision, robotics, augmented reality, HCI, …
• Our solution:
• A preliminary pose is estimated using an analytic P3P solver combined with RANSAC to cope with mismatches
• The preliminary pose is refined by minimizing the total image reprojection error pertaining to the inliers
• Extended to the binocular case for better accuracy by jointly minimizing the reprojection error in two images
• The covariance of the motion parameters is computed as a byproduct

Pose estimation: 3D–2D correspondences
• A stereo rig moving freely in space (poses at times t and t+1)
Blue: projection rays; green: spatial matches; red: temporal matches
• Temporal matches associate 3D points triangulated at time t with their 2D image projections at time t+1

Quantitative evaluation of the VO pipeline
• A 512×348 synthetic stereo sequence with ground truth; motion is primarily forward with a shallow right turn
• Moderate numbers of matches across images
• VO was run for 363 frames (total traveled distance ~22 m)
• Naming convention: X-Y denotes detector X with descriptor Y, e.g.
HARRIS-BRIEF refers to Harris corners with BRIEF descriptors

Sequence courtesy of Marcos Aviles, GMV

Accuracy of the VO pipeline
Left: translational error w.r.t. the ground truth; right: rotational error

Summary & conclusions
• VO performance was tested against the ground truth associated with simulated data
• The HARRIS+BRIEF detector/descriptor combination achieves lower accuracy than SIFT+SIFT
• HARRIS+BRIEF has considerably lower computational requirements (32 times faster than SIFT+SIFT)
• HARRIS+BRIEF yields a relative translational error below 2%, hence provides a good accuracy/performance tradeoff
• Binocular HARRIS+BRIEF is more accurate than monocular HARRIS+BRIEF
• Binocular HARRIS+BRIEF runs at 4 fps on a 3 GHz Intel Core CPU

Thank you. Any questions?
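As a closing illustration of the descriptor-matching step described earlier, here is a minimal sketch of Hamming-distance matching with Lowe's ratio test; the packed-byte descriptor layout and the 0.8 ratio are assumptions, not values taken from the slides:

```python
import numpy as np

def match_brief(desc1, desc2, ratio=0.8):
    """Non-symmetric ratio-test matching of binary descriptors.

    desc1, desc2 : (N, B) and (M, B) uint8 arrays of packed descriptor bytes;
    ratio : assumed threshold; a match is kept only if the best distance is
    below ratio times the second-best distance.
    Returns a list of (index-in-desc1, index-in-desc2) pairs.
    """
    matches = []
    for i, d in enumerate(desc1):
        # Hamming distance = number of differing bits = popcount of the XOR.
        dists = np.unpackbits(np.bitwise_xor(desc2, d), axis=1).sum(axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```

In the actual pipeline, the maximum-disparity and epipolar constraints would additionally prune the candidate set before the nearest-neighbor search.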