High Frame Rate Egomotion Estimation

Natesh Srinivasan, Richard Roberts, and Frank Dellaert
Georgia Institute of Technology

Abstract. In this paper, we present an algorithm for high frame rate egomotion estimation. We achieve this by using a basis flow model, along with a novel inference algorithm that uses spatio-temporal gradients, foregoing the computation of slow and noisy optical flow. The inherent linearity in our model allows us to achieve fine-grained parallelism. We demonstrate this by running our algorithm on GPUs to achieve egomotion estimation at 120 Hz. Image motion is tightly coupled with the camera egomotion and the depth of the scene. Hence, we validate our approach by using the egomotion estimate to compute the depth of a static scene. Our applications are aimed at autonomous navigation scenarios, where a quick estimate of the state of the vehicle is required while freeing up computation time for higher-level vision tasks.

Fig. 1. (a) Input stream from a high frame rate camera. (b) Spatio-temporal gradients as an RGB image: r-Dx, g-Dy, b-Dt. (c) Learned dense basis flows corresponding to pitch, yaw and forward motion. (e) Platform showing typical motion during autonomous navigation. (f) Estimated translational and rotational velocity. (g) Recovered translational flow. (h) Depth map of the scene.

1 Introduction

As high frame rate cameras are becoming commodity hardware, they may play a very important role in the future of vision algorithms. This is because of the several obvious benefits [6] that they offer over their lower frame rate counterparts, such as reduced motion blur, smaller inter-frame baseline, etc. However, these cameras also pose different challenges that necessitate developing new models while taking a close look at the underlying imaging process. This paper develops one such model and demonstrates its application to robot egomotion estimation.

In a high frame rate setting, a vision-only solution to egomotion estimation is advantageous compared to the traditional way of doing egomotion estimation by fusing IMUs (Inertial Measurement Units) and cameras. This is because, from a system standpoint, it is simpler to deal with a single sensor due to fewer integration and synchronization issues. Approaches that fuse multiple sensors [3] work well in a low frame rate setting; however, this might be superfluous when using high frame rate cameras. State estimates from high speed cameras are more accurate because they have more measurements for the same trajectory, reducing the dependency on IMUs for high frequency state estimates.

The primary objective of this paper is to develop and evaluate a fast egomotion estimation method which runs upwards of 120 Hz. This is possible because of the small inter-frame displacements and brightness variations across images. As a result, continuous time models, such as the well known constant brightness model [8], are a near exact representation of the underlying imaging process. These models enable us to develop a closed form analytic expression for the unknown egomotion that can be computed in a single step. In contrast, models that rely on image warping or feature tracking require multiple iterations of non-linear estimation to converge to the correct solution.
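For concreteness, the constant brightness model of Horn and Schunck [8] referred to above asserts that the intensity $I(x, y, t)$ of a scene point does not change as it moves across the image, so its total time derivative vanishes; Sec. 3.2 applies exactly this constraint per pixel (Eq. 5):

$\frac{\partial I}{\partial x} u_x + \frac{\partial I}{\partial y} u_y + \frac{\partial I}{\partial t} = 0$

where $(u_x, u_y)$ is the optic flow and the partial derivatives are the spatial and temporal image gradients.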
The first contribution of this paper is to exploit the idea of basis flows in a continuous time model to compute egomotion directly from image gradients. We take a two-tiered approach, where a slower learning algorithm learns a mapping from egomotion to optic flow by using a dense version of the basis flow model developed by Roberts et al. [14]. However, rather than using sparse flow as in that paper, we develop a fully dense approach using spatio-temporal gradients. A significant advantage of using learned basis flows is that they are independent of camera calibration. In fact, such an approach is not restricted to using a projective camera model at all. The trade-off is that, while rotational basis flows are scene independent, translational basis flows depend on the typical scene structure [14].

The second contribution of this paper is to rapidly estimate the motion and the depth map of a rigid scene, at better-than-framerate operation, using a consumer-grade graphics processing unit (GPU). This is possible because the bulk of the arithmetic to solve our linear model amounts to per-pixel operations. In the depth estimation stage, we reuse the basis flow model to first obtain the optical flow corresponding to translational motion. We then exploit the linearity that exists between the translational flow and inverse depth to estimate the depth map in a single step. We further use a Kalman filter model to incrementally refine our estimate of the depth map over time.

This work was developed to serve as a harness to explore the idea of mental rotations in robot navigation. The depth maps obtained from this paper serve as a quasi-instantaneous snapshot of the environment that enables the robot to reach a goal location (an x, y, z co-ordinate location) by iteratively aligning itself with an abstract representation of the goal state. A similar approach to navigation is believed to exist among primates [2]. However, the details of this work are beyond the scope of this paper and we will address them in a subsequent paper.

2 Related Work

A brief survey of egomotion estimation methods starts with the works of Gibson, followed by a series of methods developed during the 80s and 90s. They are mostly classified based on the constraints and the underlying optimization techniques used to solve them. The constraints can either be differential epipolar [11], which map the 3D velocity of a point in the real world onto a 2D velocity vector in the image plane, or discrete epipolar [10], which map a 3D rigid egomotion between views onto the corresponding 2D image points in each view.

A variety of optimization techniques have been developed using these two fundamental constraints. These techniques attempt to solve for the egomotion quickly and robustly, a challenging task due to the non-linear relationship between depth, camera motion, and optical flow. Heeger and Jepson [7] solve this by sampling possible translations from a unit sphere and then deriving an equation that is linear in the inverse depth and the rotational velocity. Bruss and Horn [4] directly use the bi-linear constraint that exists between the depth and the translational velocity. Other approaches include the use of motion parallax [13] and the epipolar geometry constraints [10,11]. However, in most of these methods, estimation of either optical flow or feature correspondences is an intermediate step.
This can be noisy and slow because of wrong feature matches and high-dimensional search spaces. This led to the advent of a class of methods called direct methods [9]. In these methods, the optical flow is treated as a hidden variable and camera motion is directly estimated using only image intensity gradients. The estimation is thus reduced to at most 6 parameters, with each pixel providing a single constraint, resulting in a highly over-determined system. A more detailed analysis of the current state of the art in egomotion estimation can be found in the work of Florian et al. [5].

3 Approach

In this section, we will briefly revisit the basis flow model described in [14] (Sec 3.1), then derive a closed form expression for estimating egomotion (Sec 3.2). We will then show how we can use this egomotion to compute scene depth (Sec 3.3).

3.1 Basis Flows

Basis flows are a powerful model that represents a linear relationship between optical flow and camera motion. They are a learned mapping that is derived by exploiting the following bi-linear relationship between the camera motion $(v, \omega)$, the inverse depth $\rho_i$, and the optic flow $u_i = [u_{x_i}\ u_{y_i}]^T$ at the $i$-th pixel:

$u_i = \rho_i A_i v + B_i \omega$   (1)

where $v = [v_x\ v_y\ v_z]^T$ is the translational component of egomotion and $\omega = [\omega_x\ \omega_y\ \omega_z]^T$ is the rotational component. $A_i \in \mathbb{R}^{2\times3}$ and $B_i \in \mathbb{R}^{2\times3}$ are matrices that depend on the camera parameters and pixel co-ordinates. This idea was first observed by Longuet-Higgins et al. [11,10], who called it a differential epipolar constraint to distinguish it from the discrete epipolar case. Roberts et al. [14] developed a calibration-free, probabilistic version of this idea by making certain simplifying assumptions about depth, namely approximate constancy per pixel over time. This simplifies Eq. 1 to

$u_i = W_i \begin{bmatrix} v \\ \omega \end{bmatrix}$   (2)

where the columns of $W_i$ are the learned basis flows. This method also works for non-projective camera models, as discussed in [14].

Basis flows have several interesting and intuitive properties and afford a fine-grained parallelism in the computation of egomotion. During the inference phase, Roberts et al. use a sparse optical flow as measurements from which to compute egomotion. In this paper we take an alternate approach that leverages the high frame rate by using image gradients directly.
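As a minimal illustration of Eq. 2, the sketch below evaluates the learned basis flow model for every pixel of a dense grid. The array shapes, the choice of NumPy, and the function name are our own assumptions rather than details from the paper.

```python
import numpy as np

def predict_flow(W, y):
    """Predicted dense optic flow from the basis flow model (Eq. 2): u_i = W_i y.

    W : (H, W, 2, K) array of learned basis flows, one 2xK matrix per pixel,
        where K is the number of egomotion components (K = 6 for full 6-DOF).
    y : (K,) egomotion vector, e.g. (v_x, v_y, v_z, w_x, w_y, w_z).
    Returns an (H, W, 2) array of per-pixel flow vectors.
    """
    # Per-pixel matrix-vector product, batched over the whole image grid.
    return np.einsum('hwck,k->hwc', W, y)

# Hypothetical usage: 32x32 bases (the resolution used in Sec. 4.1) and a
# robot moving forward while yawing slightly.
W = np.random.randn(32, 32, 2, 6) * 0.01
y = np.array([0.0, 0.0, 0.5, 0.0, 0.05, 0.0])
flow = predict_flow(W, y)   # shape (32, 32, 2)
```

In the actual system the learned bases are upscaled from 32x32 to the 512x512 image resolution before inference (Sec. 4.1).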
3.2 Gradient-Based Inference

In this section we develop a probabilistic model for estimating egomotion directly from image gradients. This is the key contribution of this paper, which enables us to leverage the computational power of GPUs to estimate egomotion. In a probabilistic sense, the computation of egomotion is the maximization of the posterior probability of the egomotion $y_t = (v_t, \omega_t)$ given the image gradients $Z_t = \{D^x_t, D^t_t\}$ (where $D^x \in \mathbb{R}^{1\times2}$ and $D^t \in \mathbb{R}$ represent the spatial and temporal gradients) and the basis flows $W$ at time $t$:

$\hat{y}_t = \arg\max_{y_t}\ p(y_t \mid Z_t, W)$   (3)

Rewriting this as a product of the measurement likelihood and prior, and using the independence between the basis flows $W$ and the egomotion $y_t$, we get

$\hat{y}_t = \arg\max_{y_t}\ L(D^t_t \mid D^x_t, y_t, W)\, p(y_t)$   (4)

Fig. 2. (a) Bayes net representation of our generative model for pixel $i$, showing the relation between the basis flow $W$, the image gradients $Z_i$, the camera egomotion $y_t$ and the optical flow $u_i$. (b) Memory distribution of the GPU implementation of our model.

We use $L(\cdot)$ to represent the likelihood (unnormalized probability) to distinguish it from a probability, $p(\cdot)$. Eq. 4 states that the temporal gradient $D^t_{ti}$ can be computed given the spatial gradient $D^x_{ti}$, the camera egomotion $y_t$ and the basis flows $W$. The basis flow model given in Eq. 2 provides us with a way to compute the flow $u_t$ given the egomotion $y_t$. However, the model does not provide an explicit relationship between the image gradients $Z_t$ and the egomotion $y_t$. In order to evaluate Eq. 4, we exploit the brightness constancy assumption [8], stated as

$D^t_{ti} + D^x_{ti}\, u_{ti} = 0$   (5)

where the subscript $i$ corresponds to the $i$-th pixel. This provides us with an additional constraint between the flow $u_{ti}$ and the image gradients $Z_{ti} = \{D^x_{ti}, D^t_{ti}\}$. An important observation at this point is that the temporal gradient at each pixel, $D^t_{ti}$, is independent of that at every other pixel given the spatial gradient $D^x_{ti}$ at that pixel (this is not completely true, since the spatio-temporal Sobel kernel used in the computation of gradients induces some correlation between adjacent pixels). This allows us to re-write Eq. 4 as a per-pixel product:

$\hat{y}_t = \arg\max_{y_t}\ p(y_t) \prod_i L(D^t_{ti} \mid D^x_{ti}, y_t, W_i)$   (6)

where the subscript $i$ refers to the $i$-th pixel. The Bayes net representation of our model is shown in Fig. 2a. It tells us that the hidden flow term $u_{ti}$ provides a link between the egomotion $y_t$ and the image gradients $Z_{ti}$. This suggests that we can factorize the measurement likelihood (Eq. 4) as

$L(D^t_{ti} \mid D^x_{ti}, y_t, W_i) = L(D^t_{ti} \mid D^x_{ti}, u_{ti})\, L(u_{ti} \mid y_t, W_i)$   (7)

Since our goal is to get rid of the intermediate flow term, we compute the marginal corresponding to the joint distribution given in Eq. 7 by integrating out the hidden flow term $u_{ti}$,

$\hat{y}_t = \arg\max_{y_t}\ p(y_t) \prod_i \int_{u_{ti}} L(D^t_{ti} \mid D^x_{ti}, u_{ti})\, L(u_{ti} \mid y_t, W_i)\, du_{ti}$

which provides us with the following expression for the marginal distribution:

$\int_{u_{ti}} L(D^t_{ti} \mid D^x_{ti}, u_{ti})\, L(u_{ti} \mid y_t, W_i)\, du_{ti} = L(D^t_{ti} \mid D^x_{ti}, y_t, W_i)$   (8)

Taking the negative log-likelihood of Eq. 8 and minimizing with respect to $y_t = (v, \omega)$ gives us the following maximum likelihood estimate:

$\hat{y}_t = -\left( \sum_i \frac{1}{J_{ti}}\, W_i^T (D^x_{ti})^T D^x_{ti} W_i \right)^{-1} \sum_i \frac{1}{J_{ti}}\, W_i^T (D^x_{ti})^T D^t_{ti}$   (9)

where $J_{ti} = D^x_{ti}\, \Sigma_{u_i} (D^x_{ti})^T + \sigma_I$ and $\Sigma_{u_i}$ is the covariance matrix of the flow at pixel $i$.

Evaluation of $J_{ti}$ is a per-pixel independent operation, and $D^x_{ti}$, $D^t_{ti}$ can be pre-computed in parallel. Each pixel provides a completely independent constraint, as detailed in the estimate given in Eq. 9.
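Because Eq. 9 is a single weighted least-squares solve whose normal equations accumulate independent per-pixel terms, it maps naturally onto data-parallel hardware. The sketch below is a plain NumPy version of that accumulation under our own assumptions about array layout (one 1x2 spatial gradient, one scalar temporal gradient, and one 2x6 basis matrix per pixel); it is not the authors' GPU implementation.

```python
import numpy as np

def estimate_egomotion(Dx, Dt, W, Sigma_u=np.eye(2), sigma_I=0.1):
    """Closed-form egomotion estimate of Eq. 9 from spatio-temporal gradients.

    Dx : (N, 1, 2) spatial gradients, one row vector per pixel.
    Dt : (N,)      temporal gradients, one scalar per pixel.
    W  : (N, 2, 6) basis flow matrices, one per pixel.
    Sigma_u, sigma_I : flow covariance and intensity variance
                       (Sec. 4.1 sets these to 1 and 0.1).
    """
    DxW = np.einsum('nij,njk->nik', Dx, W)[:, 0, :]            # (N, 6): Dx_i W_i
    J = np.einsum('nij,jk,nik->n', Dx, Sigma_u, Dx) + sigma_I  # (N,): per-pixel variance J_i
    w = 1.0 / J                                                # inverse-variance weights
    A = np.einsum('n,ni,nj->ij', w, DxW, DxW)                  # 6x6 normal matrix
    b = np.einsum('n,ni,n->i', w, DxW, Dt)                     # right-hand side
    return -np.linalg.solve(A, b)                              # y_t = (v, omega)
```

On a GPU each pixel's contribution to the 6x6 normal matrix and the right-hand side can be computed independently and combined with a parallel reduction, which is consistent with the 3.24 ms per-frame time reported in Sec. 4.1.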
3.3 Depth Maps

In this section we develop a Kalman filter model for computing the scene depth. The depth map is used as a qualitative measure of the accuracy of the estimated egomotion. Eq. 1 provides a bi-linear constraint between the egomotion, depth and flow at each pixel. However, the problem is greatly simplified if the optical flow corresponding to any one of the translational components of egomotion is known. This is because the following linear model relates the forward flow $u^f_{ti}$ and the inverse depth $\rho_i$, given the forward velocity $v_t$ of the camera:

$u^f_{ti} = \begin{bmatrix} u^f_{x_{ti}} \\ u^f_{y_{ti}} \end{bmatrix} = \rho_i \begin{bmatrix} x \\ y \end{bmatrix} v_t + \eta$   (10)

where $\eta \sim N(0, P_m)$ is the noise on the flow, the subscript $t$ refers to time, the superscript $f$ indicates the forward component of flow, and $(x, y)$ are the co-ordinates of pixel $i$. In order to get the forward flow $u^f$, we subtract the rotational component of flow $u^\omega$ from the total flow $u$. The rotational component of flow is obtained by scaling the basis flows corresponding to rotation, $W^\omega$, by our estimate of the rotational velocity $\omega$. (Note that $W = [W^v\ W^\omega]$, implying that there is a basis for each component of egomotion; e.g., for 6 DOF motion, we have 6 basis flows.)

Matthies et al. [12] discuss various techniques to refine the depth map estimate using a Kalman filter. Assuming that the camera calibration is known, we get the set of equations shown in Fig. 3, borrowed from [12].

Fig. 3. Kalman filter equations for estimating inverse depth from the measured disparity d. The set of equations is borrowed from [12].

4 Experimental Evaluation

Fig. 5a shows the experimental setup. The test platform is a Pioneer 3 P3-DX. We use a Flea3 USB 3.0 high speed camera running at 512×512 resolution. The onboard computer has a quad-core CPU and an Nvidia Quadro K2000M GPU. The platform has 3 degrees of freedom at any time $t$, namely forward velocity $v_t$, yaw $\omega_y$ and pitch $\omega_x$ (due to rough terrain). We first do a quantitative evaluation of the accuracy of the estimated egomotion and then we do a qualitative evaluation by generating depth maps.

4.1 Accuracy of Egomotion Estimation

Fig. 4 shows the accuracy of the estimated egomotion on two test datasets: (1) an indoor scene and (2) an outdoor scene. The learning stage is done offline and we obtain the basis flows $W$ at a resolution of 32×32. They are then upscaled to match the resolution of the image (512×512). The covariance term of the flow, $\Sigma_u$, was assumed to be 1 and the pixel intensity variance, $\sigma_I$, was assumed to be 0.1. The spatial gradients $D^x$ were computed with a 3×3 Sobel operator. The temporal gradients $D^t$ were computed by taking the difference of adjacent frames after blurring with a 3×3 Gaussian kernel. The accuracy of the estimated egomotion was compared with wheel odometry, as shown in Fig. 4 (d) & (e).

Fig. 4. (a) Learned translational basis flow [left] (color coding of flow can be found in Fig. 5b) for the indoor scene [right]. (c) Learned translational basis flow [left] for the outdoor scene [right]. (b) Computed rotational basis flow, which is the same for both sequences because it is independent of scene structure. (d) & (e) Comparison of egomotion estimation with wheel odometry.

The estimated motion is quite robust for the rotational components, but the translational estimates incur errors when the robot is undergoing rotations. This is because rotations drastically change the heading of the camera and, as a result, the scene structure. The learned translational bases do not deal well with this kind of transformation. It is also to be noted that the method described here is sensitive to passing objects, especially if they form a significant portion of the camera's field of view. However, we are currently working on a robust outlier rejection model using the method described in [14]. The core computation time of the motion estimation for each frame was measured at 3.24 ms, which is well within the 8.33 ms budget for an input stream of 120 Hz.

4.2 Application: Dense Depth Maps

We demonstrate the application of our egomotion estimation by obtaining the depth map of the scene. The pipeline for evaluating the depth map is shown in Fig. 5b. The goal is to first compute the magnitudes of the egomotion components and then estimate the depth from the translational component of flow. We use the TV-L1 dense flow [1] (Fig. 5b (i)) to compute the total flow and then use the method described in Section 3.3 to subtract the rotational flow (Fig. 5b (e)). The translational flow (Fig. 5b (h)) that is obtained is used to get a rudimentary depth map (Fig. 5b (f)), which is then refined using the Kalman filter equations to get the final depth (Fig. 5b (j)).
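To make the pipeline of Fig. 5b concrete, the sketch below chains the steps described above: subtract the rotational flow predicted by the rotational bases, convert the remaining translational flow to inverse depth with the per-pixel linear model of Eq. 10, and fuse successive inverse-depth measurements with a scalar per-pixel Kalman update. The centred pixel coordinates, the variance bookkeeping, and all function and variable names are our own assumptions; in particular, we use a generic scalar Kalman update rather than the exact equations of [12] reproduced in Fig. 3.

```python
import numpy as np

def translational_flow(u_total, W_rot, omega):
    """Subtract the rotational flow u^w = W^w * omega from the total flow.

    u_total : (H, W, 2) dense flow, W_rot : (H, W, 2, 3) rotational bases,
    omega   : (3,) estimated rotational velocity.
    """
    return u_total - np.einsum('hwck,k->hwc', W_rot, omega)

def inverse_depth(u_f, v_t, eps=1e-6):
    """Per-pixel least-squares inverse depth from Eq. 10: u^f = rho * [x, y] * v_t."""
    h, w, _ = u_f.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    x, y = xs - w / 2.0, ys - h / 2.0          # centred pixel coordinates (assumed)
    num = x * u_f[..., 0] + y * u_f[..., 1]    # [x, y] . u^f
    return num / (v_t * (x**2 + y**2) + eps)

def kalman_fuse(rho_prev, P_prev, rho_meas, R_meas, Q=1e-4):
    """Generic per-pixel scalar Kalman update of the running inverse-depth map."""
    P_pred = P_prev + Q                        # predict: inflate uncertainty
    K = P_pred / (P_pred + R_meas)             # Kalman gain
    rho_post = rho_prev + K * (rho_meas - rho_prev)
    P_post = (1.0 - K) * P_pred
    return rho_post, P_post
```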
Fig. 5. (a) Experimental setup. [Left] Diagrammatic representation of the top view; the platform is moving directly toward the object with a velocity $v_t$ and yaw $\omega_{y_t}$. It has 3 degrees of freedom at any instant of time. [Top Right] Side view showing the pitching motion due to rough terrain. [Bottom Right] Picture of the setup. (b) Depth map generation pipeline [see Section 4.2]. We first compute the egomotion $y_t$. Using this estimate, we then get the flow corresponding to translation, which we use to obtain a rudimentary depth map that is refined using the Kalman filter equations to get the final depth map.

5 Conclusions and Acknowledgments

We have presented a novel, fully probabilistic model for doing egomotion estimation directly from image gradients. We have further demonstrated its applicability to the estimation of the depth of a static, rigid scene. From a system-level perspective, we have shown that we can exploit GPUs to achieve better-than-frame-rate performance. While the primary objective of the paper is to estimate motion, the end result was to obtain quick estimates of the depth of the scene.

This research is supported by the Office of Naval Research under grant #00014-11-1-0593. The authors would like to thank Prof. Ronald Arkin for his significant contributions to this project. We would also like to acknowledge the other student team members for their roles: Ivan Walker and Ryan Kerwin.

References

1. Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D.: An Improved Algorithm for TV-L1 Optical Flow. In: Cremers, D., Rosenhahn, B., Yuille, A.L., Schmidt, F.R. (eds.) Statistical and Geometrical Approaches to Visual Motion Analysis. LNCS, vol. 5604, pp. 23–45. Springer, Heidelberg (2009)
2. Arkin, R.C.: The role of mental rotations in primate-inspired robot navigation. Cognitive Processing 13, 83–87 (2012)
3. Beall, C., Nguyen, T.H.D., Ok, C., Dellaert, F.: Attitude heading reference system with rotation-aiding visual landmarks (2012)
4. Bruss, A.R., Horn, B.K.P.: Passive navigation. Computer Vision, Graphics, and Image Processing 21, 3–20 (1983)
5. Florian, R., Heiko, N.: A review and evaluation of methods estimating ego-motion. Comput. Vis. Image Underst. 116(5), 606–633 (2012)
6. Handa, A., Newcombe, R.A., Angeli, A., Davison, A.J.: Real-time camera tracking: when is high frame-rate best? In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VII. LNCS, vol. 7578, pp. 222–235. Springer, Heidelberg (2012)
7. Heeger, D., Jepson, A.: Subspace methods for recovering rigid motion I: Algorithm and implementation. International Journal of Computer Vision 7(2), 95–117 (1992)
8. Horn, B.K.P., Schunck, B.G.: Determining Optical Flow. Artificial Intelligence 17(1-3), 185–203 (1981)
9. Irani, M., Anandan, P.: All about direct methods (1999)
10. Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981)
11. Longuet-Higgins, H.C., Prazdny, K.: The Interpretation of a Moving Retinal Image. Proceedings of the Royal Society of London. Series B, Biological Sciences (1934-1990) 208(1173), 385–397 (1980)
12. Matthies, L., Szeliski, R., Kanade, T.: Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision 3, 209–236 (1989)
13. Rieger, J.H., Lawton, D.T.: Sensor motion and relative depth from difference fields of optic flows. In: IJCAI, pp. 1027–1031 (1983)
14. Roberts, R., Potthast, C., Dellaert, F.: Learning general optical flow subspaces for egomotion estimation and detection of motion anomalies. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2009)