Camera pose estimation for augmented reality in a small indoor dynamic scene

Rawia Frikha,* Ridha Ejbali, and Mourad Zaied
University of Gabes, Research Team in Intelligent Machines, National Engineering School of Gabes, Gabes, Tunisia

J. Electron. Imaging 26(5), 053029 (Sep/Oct 2017), doi: 10.1117/1.JEI.26.5.053029

Abstract. Camera pose estimation remains a challenging task for augmented reality (AR) applications. Simultaneous localization and mapping (SLAM)-based methods are able to estimate the six-degrees-of-freedom camera motion while constructing a map of an unknown environment. However, these methods do not provide any reference for where to insert virtual objects, since they have no information about the scene structure, and they may fail when three-dimensional (3-D) map points are occluded or objects move. This paper presents a real-time monocular piecewise planar SLAM method based on the planar scene assumption. Using planar structures in the mapping process allows rendering virtual objects in a meaningful way on the one hand, and improving the precision of the camera pose and the quality of the 3-D reconstruction of the environment by adding constraints on 3-D points and poses in the optimization process on the other hand. We also propose to exploit the rigid motion of 3-D planes in the tracking process to enhance the system robustness in the case of dynamic scenes. Experimental results show that using a constrained planar scene improves the accuracy and robustness of our system compared with classical SLAM systems. © 2017 SPIE and IS&T [DOI: 10.1117/1.JEI.26.5.053029]

Keywords: augmented reality; camera pose; simultaneous localization and mapping system; homography; planar structures; motion rigidity constraint; tracking.

Paper 170207 received May 23, 2017; accepted for publication Oct. 6, 2017; published online Oct. 30, 2017.

1 Introduction
In recent decades, there has been wide interest in the augmented reality (AR) field because of the many different uses this technology can serve, whether in entertainment, training, or education.1,2 Its aim is to improve our perception of the real world by superimposing virtual objects on real images captured by a video capture device. The major problem is inserting virtual entities within real scenes precisely enough to create the illusion that the real and virtual worlds coexist. Solving this problem requires recovering the properties of the real camera used to shoot the scene, so that the perspective of the virtual object to be superimposed matches the actual scene and the final composite image is consistent and produced in real time. Several techniques exist to find the actual parameters of the camera.3 These can be grouped into two categories. The first is related to localization methods with knowledge of a predefined model in the scene. This model can be defined as a marker or a three-dimensional (3-D) model of the scene.
Markers are easier to manage since they are generally represented by a simple shape, typically a square with black edges and a simple design in the center.4 By contrast, 3-D models are more difficult to use since the accuracy of the reconstructed model is essential.5 The principle is to use a similarity measure between the 3-D model and the real object in the image to determine the pose of the camera: the 3-D model is projected into the image, and matches are sought between the points of interest detected in the image and the projected ones. However, these techniques have the disadvantage of requiring a priori models of the filmed scene, which restricts the user. This has led to the development of a new category of systems known as simultaneous localization and mapping (SLAM),6–8 which aims to localize the camera over time while reconstructing the scene incrementally from previous reconstructions. SLAM-based methods minimize the initial cost of building AR systems, which frequently need large and precise a priori models of the environment. However, several practical questions arise when using monocular SLAM in AR applications. (1) It is important to identify a reference for inserting virtual objects in a meaningful way. Since the environment is unknown, the SLAM system does not possess sufficient information about the exact location where the virtual objects should be inserted. (2) It is essential to estimate an accurate camera pose using only the visual information extracted from images captured by a monocular camera, so that the virtual objects are rendered precisely. (3) It is important to establish a robust tracking process when the environment is full of dynamic objects, as in a training environment. To meet these challenges, we present an approach called piecewise planar simultaneous localization and mapping (PWP-SLAM) that estimates the camera pose while building a map of the environment under the planar scene assumption. Our aim is to add geometric constraints in the mapping process, where each subset of 3-D points is constrained to a defined plane. This approach provides three contributions. (1) A map initialization method segments the two initial keyframes selected by the user into piecewise planar surfaces and reliably estimates the relative motion of the camera using the information obtained from the multiple homographies after segmentation. The use of piecewise planar surfaces improves the precision of the camera pose and the quality of the 3-D reconstruction of the environment, since planes provide more information than the points usually used in classical SLAM systems. (2) We propose including these specific characteristics of points constrained to planes in the optimization phase, which helps increase the accuracy. In addition, planes can be used as a reference for AR applications to add virtual objects in a meaningful way.

*Address all correspondence to: Rawia Frikha, E-mail: frikha.rawia.tn@ieee.org
(3) Finally, planes provide additional cues that are useful for the tracking process, especially when there are dynamic objects in the scene. Under the planar rigidity constraint, 3-D points situated on the same plane have the same motion. Our aim is to improve the robustness of the tracking process in the case of a dynamic scene, and we explore the use of the planar rigidity constraint in the tracking process when points on planes are occluded by dynamic or static objects. Our system is mainly intended for indoor medical training use, where there are many man-made planar structures, such as tables and walls.

2 Related Work
2.1 Monocular Simultaneous Localization and Mapping: Camera Localization in an Unknown Environment
Various approaches have emerged to solve the monocular SLAM problem. They are mainly classified into two types: filtering-based methods9–11 and keyframe-based approaches, which rely on bundle adjustment.7,8,12 The filtering-based approach constructs a probabilistic map of the primitives, representing, at each instant, the current estimates of the state of the camera and of all the primitives. This approach was first used in the robotics field by Davison,6 who proposed a real-time camera localization system, called monocular simultaneous localization and mapping (MonoSLAM), based on an extended Kalman filter to solve for the 3-D points and the camera pose. The major drawback of this approach is its expensive computation time, since the tracking and the mapping are closely linked. Keyframe-based approaches7,8,13 build the map using only selected frames from a video sequence, allowing them to achieve more precise bundle adjustment optimizations since the mapping process is not tied to the frame rate. Klein and Murray7 proposed a bundle-adjustment-based approach called parallel tracking and mapping (PTAM) that solves the monocular SLAM problem without filters, reducing the computation time and meeting the real-time requirement of AR applications. They subdivided the tracking and the mapping into two parallel threads to improve the system robustness and speed. The tracking thread is responsible for estimating the camera pose (position and orientation) and for graphics augmentation. The mapping thread is responsible for generating the map and relocalizing the camera pose. The map consists of point features, which are continuously updated for each keyframe. Klein and Murray7 estimated a dominant virtual plane from the mapped points to provide the user with meaningful augmentations; however, the estimated plane may not correspond to a real plane in the scene, which prevents many AR-based applications. Strasdat et al.13 showed that keyframe-based methods are more precise than filtering for the same computational cost. In the same vein, Mur-Artal et al.8 developed oriented FAST and rotated BRIEF (ORB)-SLAM, a feature-based monocular SLAM system that works in small and large environments in real time. They improved the accuracy of classical SLAM by adding a third loop-closing thread based on a keyframe graph. In the initialization phase, they developed an automatic approach that uses a heuristic to choose between two geometric models: a homography for planar scenes and a fundamental matrix for nonplanar scenes.
Unlike Klein and Murray,7 their approach does not require user intervention to select two good views with significant parallax, but it may fail or take a long time to initialize when the scene is not static. Large-scale direct monocular SLAM (LSD-SLAM)14 provides a semidense map of the environment. This system, which needs no feature extraction or matching between points, offers more information about the scene than the previous approaches, whereas the dense tracking and mapping system (DTAM)15 provides a dense map of the environment in which each pixel contributes to the registration process; its principle is to directly register the two images using the color and the inverse depth. However, all the cited SLAM systems have their own limitations. The most important issue is that the environment has to remain stable during processing, since they do not consider updating the 3-D structure of dynamic objects.

2.2 Simultaneous Localization and Mapping in a Dynamic Scene
Recently, several authors16–18 have proposed SLAM variants that achieve robust tracking and mapping in dynamic environments. The main idea is to update the map when there are dynamic objects or occlusions in the scene. Pirker et al.16 used the visibility information of a 3-D feature reference to remove potential ambiguities in the 2-D–3-D matching process using the histogram of oriented cameras descriptor. Tan et al.17 handled occlusions by comparing feature appearance and structure to filter out changed features in the map. However, updating the map is not a very accurate way to handle a dynamic scene, since the set of 3-D features is used by an optimization process, generally called bundle adjustment, each time a keyframe is inserted in the map. In contrast, our system assumes a rigid scene at the beginning and later exploits the planar rigidity motion in the tracking process to increase the system robustness.

2.3 Plane Segmentation
Detecting multiple planes in images is a challenging problem. Most existing methods are built on top of PTAM or an EKF-SLAM system19,20 and detect planes from 3-D point clouds. However, these approaches can detect only virtual planes, which may reduce the system performance. To simplify the task in our case, we assume that the scene contains only planar structures; we consider this assumption suitable for indoor man-made environments, where surfaces are mostly planar. We generate superpixels from the two initial keyframes and classify and merge them into 3-D planes by applying the homography of one planar superpixel to another.
3 Preliminaries
In this section, we give an overview of the relevant mathematical concepts and notation. In particular, we summarize the perspective projection, the homography computation, and the reprojection error.

3.1 Perspective Projection
In a pinhole camera model, a scene view is formed by projecting 3-D points onto the image plane using a perspective transformation. Let us denote by $X_w Y_w Z_w$ the world coordinate system and by $X_c Y_c Z_c$ the camera coordinate system:

$$ m = K \, {}^{w}T_{c} \, M, \qquad \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \tag{1} $$

where $m = (x, y, 1)$ is the image point expressed in pixels, $M = (X, Y, Z, 1)$ is a 3-D point in the world coordinate space, $K$ is the camera intrinsic parameter matrix, $c_x$ and $c_y$ are the coordinates of the principal point, $f_x$ and $f_y$ are the focal lengths expressed in pixels, and ${}^{w}T_{c}$ is the camera extrinsic parameter matrix. Equation (1) gives the transformation between the coordinates of a 3-D point in the world coordinate space and its projection in the camera coordinate space. It contains $R$ and $T$, which represent the rotation and translation matrices, respectively. Since the intrinsic parameters can be obtained in an offline calibration step,21 camera pose estimation amounts to determining $R$ and $T$, aligning the virtual world with the real one.

3.2 Homography Computation
It is possible to estimate the camera pose using only the 2-D information extracted from two images of a scene. Given a set of corresponding points in the left and right images, a homography can be built. Suppose $m(x, y)$ is a point in the first image and $H$ is the $3 \times 3$ homography matrix; the corresponding point $m'(x', y')$ in the second image is determined as

$$ m' = H m, \qquad \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \tag{2} $$

with

$$ H = R + \frac{T n^{T}}{d}, \tag{3} $$

where $n$ and $d$ are the normal of the reference plane and its distance to the origin, respectively, both expressed in the camera frame. $H$ can be estimated from a minimum of four point correspondences using a direct linear transform (DLT) algorithm. The relation between the projected point $Hx$ and its measurement $x'$ is captured by the geometric error function

$$ \arg\min_{H} \sum_{i=1}^{N} d(x'_i, H x_i)^2. \tag{4} $$

4 System Overview
Our system (see the overview in Fig. 1) can be divided into two phases. The first phase is the system initialization; the second consists of two threads running in parallel, tracking and mapping, similar to PTAM.7 Both phases rely on ORB feature extraction, which is applied to each acquired image frame.

Fig. 1 Our framework.

The initialization phase is in charge of segmenting the scene into planar regions to generate an initial map and estimate the initial camera pose. First of all, this phase requires user intervention to select the first two keyframes by clicking twice while the camera is moved smoothly. Then, we generate superpixels using the efficient graph-based segmentation method of Felzenszwalb and Huttenlocher.22 We assume that the scene is planar, which lets us use a homography model to group superpixels into planar regions. The output of this step is a set of planes, each defined by a set of ORB features.
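To make these preliminaries concrete, the short Python sketch below (our illustration, not code from the paper) walks through Eqs. (1)–(4) with OpenCV: synthetic points on a plane are projected into two views with Eq. (1), the homography of Eq. (2) is estimated robustly, the reprojection error of Eq. (4) is measured, and the plane-induced structure of Eq. (3) is inverted with OpenCV's homography decomposition. The intrinsics, poses, and plane are made-up values.

```python
# Illustrative sketch of Eqs. (1)-(4) with OpenCV; the intrinsics,
# poses, and plane below are made-up values, not the paper's calibration.
import numpy as np
import cv2

K = np.array([[525.0, 0.0, 319.5],    # [fx 0 cx]
              [0.0, 525.0, 239.5],    # [0 fy cy]  (Eq. 1 intrinsics)
              [0.0, 0.0, 1.0]])

def project(K, R, t, X):
    """Perspective projection m = K [R|t] M of Eq. (1)."""
    x = K @ (R @ X.T + t.reshape(3, 1))
    return (x[:2] / x[2]).T

# Synthetic 3-D points on the plane Z = 2 seen from two camera poses.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, (50, 2)), np.full(50, 2.0)])
R1, t1 = np.eye(3), np.zeros(3)
R2, _ = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))
t2 = np.array([0.2, 0.0, 0.0])
m1, m2 = project(K, R1, t1, X), project(K, R2, t2, X)

# Eq. (2): the homography mapping first-view points onto the second view,
# estimated with RANSAC, which minimizes the geometric error of Eq. (4).
H, inliers = cv2.findHomography(m1, m2, cv2.RANSAC, 3.0)

# Reprojection error d(x'_i, H x_i) of Eq. (4).
m1_h = cv2.perspectiveTransform(m1.reshape(-1, 1, 2), H).reshape(-1, 2)
print("mean reprojection error:", np.linalg.norm(m1_h - m2, axis=1).mean())

# Eq. (3): decomposing H with K recovers up to four (R, T/d, n)
# hypotheses; Sec. 5.2 keeps the one satisfying the positive depth test.
n_sols, Rs, Ts, normals = cv2.decomposeHomographyMat(H, K)
print("candidate decompositions:", n_sols)
```

The decomposition returns up to four (R, T/d, n) hypotheses, which is why the positive depth constraint of Sec. 5.2 is needed to select a single solution.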
Our initialization obtains an accurate initial camera pose using the information from the homographies after segmentation, on the one hand, and generates an initial 3-D map through triangulation, on the other hand. All the initialization steps are detailed in Sec. 5. The tracking thread is responsible for localizing the camera over time and for selecting keyframes. At each instant, the points of the reconstructed 3-D local map are searched by reprojection, and the camera pose is optimized again with all the found matches. Using this pose, virtual graphics can then be drawn on top of the video sequence. Finally, the tracking process decides when to insert a new keyframe. All these steps are explained in Sec. 6. The mapping thread receives the keyframes selected by the tracking thread and processes them with local bundle adjustment. The aim of the bundle adjustment is to achieve an optimal reconstruction of the environment. The map is updated when new points are detected and initialized with the epipolar search; this allows the reconstruction of areas of the environment that have not been explored yet. All the mapping steps are detailed in Sec. 7.

5 Map Initialization
The initialization phase consists of segmenting the scene into piecewise planar regions to estimate the initial camera pose and generate an initial 3-D map using two keyframes selected by the user. Our initialization is achieved in two stages: the first stage is the piecewise planar segmentation, and the second is structure from motion, which estimates the initial camera pose and generates the initial 3-D map. The first stage is in turn divided into two steps. The first step generates superpixels using the method of Felzenszwalb and Huttenlocher,22 producing n homogeneous regions S_1, S_2, ..., S_n. The second step applies the homography model to cluster the superpixels into multiple planar maps.

5.1 Piecewise Planar Segmentation
This section shows how to segment the scene into piecewise planar surfaces in two steps. The first step divides the two first keyframes selected by the user into superpixels using the efficient graph-based segmentation method of Felzenszwalb and Huttenlocher.22 The image is represented by an undirected graph G = (V, E), where the vertices V correspond to pixels and the edges E connect adjacent vertices; w(e) denotes the weight of an edge e between two adjacent vertices. The aim of this method is to find a relevant partition of the image into n homogeneous regions S_1, S_2, ..., S_n of G. Each region S_i, i = 1, ..., n, is a set of connected pixels that builds a minimum spanning tree.

Fig. 2 Example of an image segmentation with 17 superpixels: (a) and (b) the two first keyframes of a sequence extracted from the TUM RGB-D dataset; (c) and (d) the corresponding segmentations into superpixels.
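As a minimal illustration of this superpixel step, the sketch below uses scikit-image's implementation of the same Felzenszwalb-Huttenlocher algorithm, with the parameter values reported below in this section (scikit-image's `scale` plays the role of the threshold parameter k); the file name is a placeholder.

```python
# Minimal superpixel step, assuming scikit-image's implementation of the
# Felzenszwalb-Huttenlocher algorithm; "keyframe1.png" is a placeholder.
import cv2
from skimage.segmentation import felzenszwalb

img1 = cv2.cvtColor(cv2.imread("keyframe1.png"), cv2.COLOR_BGR2RGB)
# Parameters as reported in Sec. 5.1: min region size 40,
# threshold k = 200, smoothing standard deviation 0.9.
labels1 = felzenszwalb(img1, scale=200, sigma=0.9, min_size=40)
print("number of superpixels:", labels1.max() + 1)  # regions S_1 ... S_n
```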
Int(S_i) denotes the internal variation of each region, defined by the greatest weight w(e) within region S_i, whereas Ext(S_i, S_j) is the external variation, defined by the minimum weight of an edge connecting the two regions S_i and S_j. Three empirical parameters were proposed by Felzenszwalb and Huttenlocher:22 the minimum region size enforced by postprocessing, the threshold function, and the standard deviation used to smooth the input image before segmenting it. These three parameters were set to 40, 200, and 0.9, respectively, in our work. Figure 2 illustrates an image segmentation with 17 superpixels. Figures 2(a) and 2(b) show the two first keyframes of a sequence acquired from the Technische Universität München (TUM) RGB-D dataset. Figures 2(c) and 2(d) show the two images clustered into superpixels defining the homogeneous regions. After segmentation, a filtering scheme is applied to eliminate superpixels belonging to small regions: only superpixels above a certain threshold, defined as the minimum superpixel size, are considered useful in our application.

In the second step, the superpixels S obtained from the filtering scheme are merged when their pixels belong to the same plane in the two images. Assuming that S = {S^m_1, ..., S^m_n} is the set of superpixels of an image m corresponding to planar areas in the scene, a homography model is applied to cluster the superpixels into multiple planar maps. To apply the homography model, we first have to find the corresponding superpixels between the two first keyframes. To this end, each superpixel is defined by a set of ORB features together with their ORB descriptors, which describe each feature and its local neighborhood. A matching process is then performed to find the nearest descriptor of the requested image in the descriptor set. Since ORB is a binary descriptor, we use the Hamming distance to find all the matching superpixels between the two first keyframes; it measures the similarity between the two superpixel descriptors, and if the obtained value is less than a threshold, the descriptors are matched. To obtain good matches, a homography with a random sample consensus (RANSAC) scheme is used to filter out bad matches, since we are dealing with perspective transformations of planar regions. ORB is robust to scale changes because it uses image pyramids, and it ensures good invariance to viewpoint and illumination changes. Finally, for the N matched points obtained from the last step, a homography is computed for each superpixel. To identify the planar surfaces, as shown in Algorithm 1, H_n is defined as the set of all n homographies existing between the superpixels. Each homography in H_n defines a piecewise planar surface existing in both images. To verify whether the points of a superpixel S_i are related to any other existing plane in H_n, we use the reprojection error of S_i [Eq. (4)]: if the reprojection error of S_i is below a fixed threshold, the superpixels belonging to the same plane are merged. The thresholds used above are chosen empirically. This loop is performed until all homographies are clustered, yielding multiple planar regions.

Algorithm 1 Clustering piecewise planar surfaces.

P_m, P_m': sets of ORB features representing each superpixel in both images;
S_n: set of piecewise planar surfaces of P_m representing superpixels S_{i=1...n};
H_n: set of homographies, one per piecewise planar surface, H_{i=1...n};
t_herr: reprojection error threshold;
Define: rep_err: the reprojection error;
for i = each homography of H_n do
    for j = each superpixel of S_n do
        if i ≠ j then
            for k = each point of P_m do
                compute the reprojection error rep_err = distance(H_i P_k, P'_k)
                if rep_err ≤ t_herr then
                    merge superpixel j into the plane defined by homography i;
                end if
            end for
        end if
    end for
end for
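A possible Python rendering of Algorithm 1 is sketched below (our reconstruction, not the authors' code): ORB matches are grouped by the superpixel that contains them, one RANSAC homography is fitted per superpixel, and superpixels whose points reproject under another plane's homography within a threshold are merged. The minimum match count and the `TH_ERR` threshold are illustrative assumptions.

```python
# Our reconstruction of Algorithm 1; thresholds are illustrative.
import numpy as np
import cv2

TH_ERR = 3.0  # reprojection error threshold t_herr, in pixels (assumed)

def cluster_planes(gray1, gray2, labels1):
    """gray1, gray2: the two keyframes; labels1: superpixel label map
    of the first keyframe (e.g., from the Felzenszwalb step above)."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(gray1, None)
    k2, d2 = orb.detectAndCompute(gray2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(d1, d2)  # Hamming distance on ORB descriptors

    # Group matched point pairs P_m, P_m' by the superpixel containing
    # the first-image point.
    groups = {}
    for m in matches:
        p1, p2 = k1[m.queryIdx].pt, k2[m.trainIdx].pt
        lbl = labels1[int(p1[1]), int(p1[0])]
        groups.setdefault(lbl, []).append((p1, p2))

    # One RANSAC homography H_i per superpixel with enough matches
    # (small regions are dropped, as in the filtering scheme).
    homs = {}
    for lbl, pts in groups.items():
        if len(pts) < 8:
            continue
        p1 = np.float64([p for p, _ in pts])
        p2 = np.float64([q for _, q in pts])
        H, _ = cv2.findHomography(p1, p2, cv2.RANSAC, 3.0)
        if H is not None:
            homs[lbl] = (H, p1, p2)

    # Algorithm 1 loop: superpixel j joins the plane of H_i when H_i
    # reprojects j's points with an error below t_herr.
    merged = {i: {i} for i in homs}
    for i, (Hi, _, _) in homs.items():
        for j, (_, q1, q2) in homs.items():
            if i == j:
                continue
            proj = cv2.perspectiveTransform(q1.reshape(-1, 1, 2), Hi)
            err = np.linalg.norm(proj.reshape(-1, 2) - q2, axis=1).mean()
            if err <= TH_ERR:
                merged[i].add(j)
    return merged  # label -> set of superpixels lying on the same plane
```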
5.2 Structure from Motion
Once the piecewise planar surfaces are identified (Sec. 5.1), we compute their 3-D positions in two steps using a structure from motion technique, i.e., the process of estimating the 3-D structure of a scene from a set of 2-D views. First, we estimate the camera pose associated with the selected keyframes. Since a set of homographies is built between all the identified planar regions, we use the homography decomposition method proposed in Ref. 23, which determines the relative camera motion, up to scale, between two different views of a planar region. We choose the homography that produces the lowest reprojection error, thus giving the most accurate camera pose. By applying a positive depth constraint, we select one of the four generated solutions. Knowing the camera pose and the corresponding points, we then triangulate the 3-D features and estimate their positions (Fig. 3). Let π_i be the 3-D plane defined by n_i^T X = d_i, where n_i is the plane normal and d_i is the distance to the origin. The map features mentioned in this section, such as 3-D points, camera poses, keyframes, and homographies, are used in the next section by the planar tracking process.

Fig. 3 Structure from motion: motion and pose recovery from a planar scene.

6 Tracking
In this section, we describe the steps of the tracking process, which are applied to every frame acquired from the camera.

Local map tracking in a dynamic scene: After creating the initial map, tracking consists of finding matches between the local map and the points detected in the current frame. In each acquired frame, we run the ORB feature extractor and compute the ORB descriptors. We then project the local map into the frame and search for the map point correspondences. The local map contains the reference keyframe and its nearest n keyframes. At initialization, we choose the initial keyframe as the reference keyframe; if more than 50% of the image has changed, we choose a newly selected keyframe as the reference. Using the point correspondences, we compute the updated camera pose, which is used for graphics rendering. In a dynamic scene, the classical tracking process may fail when not enough corresponding points are found, and the camera pose is consequently lost. Our idea is to incorporate a planar rigidity constraint into the tracking process: points belonging to the same planar surface have coherent motion and are related by a homography transformation (see Sec. 3.2). In the classical methods used in SLAM systems, projected points that do not find their matches are marked as invalid, which disturbs the localization of the camera. In our system, however, these points are marked as valid, which ensures the system robustness against dynamic objects.
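The paper describes this rigidity-based tracking at the conceptual level only; below is a minimal sketch of the idea under our own data-layout assumptions: the inter-frame homography of a plane is estimated from its still-visible points, and the occluded points are propagated with the same motion instead of being invalidated.

```python
# Conceptual sketch (ours, not the paper's code) of the planar rigidity
# constraint: visible points on a plane fix its inter-frame homography,
# and occluded points on the same plane are propagated with that motion
# instead of being marked invalid.
import numpy as np
import cv2

def track_planar_points(prev_pts, matched_mask, matched_curr):
    """prev_pts: (N, 2) image positions of one plane's points at t-1;
    matched_mask: (N,) bool, True where a match was found at time t;
    matched_curr: (M, 2) current positions of the matched subset."""
    H, _ = cv2.findHomography(prev_pts[matched_mask], matched_curr,
                              cv2.RANSAC, 3.0)
    if H is None:  # not enough visible points on this plane
        return None
    # Rigidity: all points of the plane move under the same homography,
    # so the occluded ones are predicted rather than discarded.
    pred = cv2.perspectiveTransform(
        prev_pts.reshape(-1, 1, 2).astype(np.float64), H).reshape(-1, 2)
    pred[matched_mask] = matched_curr  # keep the measured positions
    return pred  # all N points remain valid for pose optimization
```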
Keyframe selection: Keyframe selection is very important in the reconstruction process. To select a keyframe, two criteria were defined: (1) at least 30 frames must have passed since the last keyframe insertion, and (2) at least 30% of the points of interest in the current frame must be correctly associated with map points.

Constrained bundle adjustment: Bundle adjustment is usually used as the last step of a feature-based 3-D reconstruction algorithm. It optimizes the 3-D structure and the camera parameters (rotation and translation); in other words, it simultaneously refines the 3-D points and the camera pose parameters according to an optimality criterion involving the image projections of all the points. The map is optimized each time a keyframe is added: the bundle adjustment adjusts the map point positions and the keyframe poses by minimizing the reprojection error of all the points in all the keyframes. It takes advantage of each new keyframe inserted in the map, as it brings new information about the 3-D points that constitute the map. To respect the real-time requirement of our application, the parameters to be optimized are limited to a subset of keyframes and 3-D points, selecting the keyframes closest to the reference keyframe. Assume that n 3-D points are seen in m keyframes, and let x_ij be the projection of the i'th point on keyframe j. Assume also that each keyframe j is parameterized by a vector KF_j and each 3-D point i by a vector PT_i. The cost function ε minimizes the total reprojection error with respect to the selected 3-D point and camera parameters:

$$ \epsilon = \min_{KF_j, PT_i} \sum_{i=1}^{n} \sum_{j=1}^{m} d[P(KF_j, PT_i), x_{ij}]^2, \tag{5} $$

where P(KF_j, PT_i) is the projection of point i on keyframe j and d(x, y) is the Euclidean distance between the image points and the projected ones. Adding a planar constraint in the bundle adjustment allows us to properly manage the 3-D points that belong to the same planar surface (Fig. 4). This implies adding a term to the previous cost function [Eq. (5)] that measures the 3-D structure error of the planar points, which leads to the following constraint, in which all planes are interdependent:

$$ \sum_{i=1}^{k} \sum_{j=1}^{x} (n_i^T \, PT_j - d_i) = 0, \tag{6} $$

where k is the number of planes, x is the number of 3-D points that belong to the same plane i, and (n_i, d_i) are the normal and the distance of plane i.

7 Mapping
In this section, we present the steps fulfilled by the mapping thread for every newly inserted keyframe.

Keyframe and point insertion: Similar to Ref. 16, we use a pose graph that represents the keyframes as nodes and the homographies calculated between them as edges. Each keyframe F_i has a camera pose C_i = (R_i | t_i), the rigid transformation from the world coordinate system to the camera coordinate system, and H_ik denotes the homography that maps points from keyframe F_i to F_k. After a new keyframe is added by the tracking process, new map points are found: the system first checks all unmatched ORB features of the newly added keyframe against the other keyframes and then performs an epipolar search to triangulate new points and add them to the map.

Fig. 4 Constrained planar mapping.
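As an illustration of how the planar term of Eq. (6) can be stacked with the reprojection term of Eq. (5), here is a hedged sketch using scipy's least_squares; the axis-angle pose parameterization and the constraint weight `W_PLANE` are our assumptions, not details given in the paper.

```python
# Hedged sketch of the constrained bundle adjustment: the residual
# vector stacks the reprojection terms of Eq. (5) with the planar terms
# n_i^T PT_j - d_i of Eq. (6). Parameterization and weight are assumed.
import numpy as np
import cv2
from scipy.optimize import least_squares

W_PLANE = 10.0  # weight of the planar constraint (assumed)

def ba_residuals(params, K, n_kf, n_pts, obs, planes):
    """params: 6 values per keyframe (axis-angle, translation) followed
    by 3 values per 3-D point; obs: list of (keyframe j, point i, x_ij);
    planes: list of (normal n_i, distance d_i, member point indices)."""
    poses = params[:6 * n_kf].reshape(n_kf, 6)
    pts = params[6 * n_kf:].reshape(n_pts, 3)
    res = []
    for j, i, x_ij in obs:               # Eq. (5): d[P(KF_j, PT_i), x_ij]
        R, _ = cv2.Rodrigues(poses[j, :3])
        pc = R @ pts[i] + poses[j, 3:]
        proj = K @ pc
        res.extend(proj[:2] / proj[2] - x_ij)
    for n_i, d_i, members in planes:     # Eq. (6): points stay on planes
        for i in members:
            res.append(W_PLANE * (n_i @ pts[i] - d_i))
    return np.asarray(res)

# Usage: x = least_squares(ba_residuals, x0,
#                          args=(K, n_kf, n_pts, obs, planes)).x
```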
8 Results and Evaluation
Our approach was compared with PTAM7 and ORB-SLAM8 for two reasons: their source code is available, and they can be used for indoor scenes, as in the case of our application. To evaluate the performance and accuracy of our system, we used three different video sequences captured in an office. The first video sequence was recorded by a handheld camera and contains textured planar surfaces; the other two were extracted from the public TUM RGB-D benchmark,24 which contains moving dynamic objects and planar structures. All experiments were run on a computer with an Intel Core i7-6500U 2.50-GHz CPU (four cores) and a GeForce 920M graphics card. All RGB images were recorded at 30 frames per second with a resolution of 640 × 480.

8.1 Qualitative Results
To evaluate the performance of our proposed approach, two experiments were performed. The first experiment demonstrates the utility of our initialization phase compared with PTAM, whereas the second shows the robustness of our system in the case of dynamic scenes. Figure 5 shows the first indoor experiment, where a desktop scene was captured. The sequence of this experiment had 626 frames recorded by a handheld camera and contains two textured planar surfaces. As shown in Fig. 5, the planar surfaces were detected by our initialization phase (see Sec. 5.1), unlike in PTAM (Fig. 6), and then tracked in the rest of the sequence.

Fig. 5 Results of the tracking thread, where two planar surfaces are detected with their x-, y-, and z-axis systems. (a)–(g) Video frames extracted at different instants.

In its initialization phase, PTAM assumes that the scene is locally planar and recovers the camera pose by decomposing a homography with the method of Faugeras.25 In PTAM, the first two keyframes are selected by the user, and a minimal set of corresponding points is used to compute the homography. This process is performed iteratively in a RANSAC scheme, verifying dist(x', Hx) < ε, until a sufficient number of corresponding points are consistent with the computed homography. However, this method can select a virtual plane that does not correspond to a real plane, as shown in Fig. 6. In contrast, the initialization phase of our approach takes into account the semantic meaning of the scene: we segment the scene into different planes representing real piecewise planar surfaces.

Fig. 6 Results of the initialization phase used in PTAM. The white grid corresponds to a virtual plane, and the axes show its orientation. (a)–(g) Video frames extracted at different instants.

The initial maps were created and tracked in the rest of the sequence. Our system chooses the plane that gives the best camera pose; in fact, it estimates an accurate camera pose by aligning the virtual objects with the real world. In the ORB-SLAM system, an automatic initialization based on model selection is used to create an initial map of both planar and nonplanar environments.
In a dynamic scene, this method may fail, and the system cannot initialize because of the moving objects. In addition, as in PTAM, a dominant plane is selected from the created map, which may correspond to a virtual plane. Figure 7 shows our second experiment, which involves a challenging task for a SLAM system. We compared our system with PTAM in two cases, with and without handling the occlusions caused by dynamic objects, using a moving sheet of paper that occluded some parts of the scene. In PTAM, the invalid 3-D points occluded by the moving object could not be reliably detected and handled, since the changed regions were not detected, whereas our system detects them and treats them effectively under the planar rigidity motion constraint, in which 3-D points situated on the same plane have the same motion. As shown in Fig. 7(c), in PTAM the occluded points (in yellow) were discarded by the system, and the camera tracking may be lost or drift. In contrast, our system robustly recovers the camera pose [Fig. 7(b)]. Figure 7(a) shows the occlusion handling result of our system: the changed 3-D points that were occluded by the moving paper were recognized and highlighted in red. Using the rigid motion constraint of planes allows the system to obtain better results than classical tracking methods.

Fig. 7 Handling occlusion caused by dynamic objects: (a) our result with occlusion handling, (b) the camera pose, and (c) the PTAM result without occlusion handling.

Furthermore, to evaluate the robustness of our system to moving dynamic objects, we used a sequence (fr3 walking xyz) extracted from the TUM RGB-D dataset,24 which contains two persons walking through an office scene. Figure 8 shows that our system is able to estimate the camera pose in the case of a dynamic scene.

Fig. 8 Example of dynamic scenes in which our system successfully estimates the camera pose. (a)–(e) Image frames extracted from the TUM RGB-D dataset at different instants.

Since our work is part of the exploration of AR in the medical field,26–30 we placed a 3-D virtual heart in the scene, as shown in Fig. 9. Our general aim is to train surgeons in a preoperative phase, including the various hand gestures that could be performed during surgery, on a virtual heart. The environment is therefore not static, since it contains dynamic objects such as the surgeon's hands and instruments. In this case, our approach presents a good solution by profiting from the planar constrained scene.

Fig. 9 Example of an AR-based system for visualizing and manipulating the virtual heart.

8.2 Quantitative Results
The TUM RGB-D benchmark24 is a good dataset to evaluate the precision of a camera pose since it provides accurate ground truth obtained from an external motion capture system. We chose two video sequences that we found suitable for the evaluation: fr3 walking xyz and fr3 str tex far. We ran PTAM, ORB-SLAM, and our PWP-SLAM system three times on the benchmark to average
the results of the evaluation. Table 1 shows the camera localization error on the TUM RGB-D benchmark.

Table 1 Absolute keyframe trajectory RMSE (cm).

Sequence            PWP-SLAM    ORB-SLAM    PTAM
fr3 walking xyz     0.7         1.06        X
fr3 str tex far     0.45        0.65        0.88

It can be seen that PTAM can only process fr3 str tex far and fails on fr3 walking xyz. The latter is a challenging case, as there are large occlusions caused by the persons moving in the scene: the camera tracking is lost at some point, and a large part of the video sequence is not processed by the system. In PTAM, the invalid 3-D points occluded by the moving objects could not be reliably detected and handled, since the changed regions were not detected; thus, the system could not robustly deal with dynamic changes. In contrast, PWP-SLAM successfully processes fr3 walking xyz, showing that it can robustly deal with a challenging situation where there are moving objects. Our initialization phase segments the scene into piecewise planar regions, which we use as additional cues in the tracking process: the occluded 3-D points are reliably detected and effectively treated under the planar rigidity motion constraint, in which 3-D points situated on the same plane have the same motion. ORB-SLAM is also robust to dynamic changes; one possible cause is its culling procedure, which inserts new keyframes into the map only if the visual content of the scene changes. For the fr3 str tex far sequence, ORB-SLAM and PTAM have similar accuracy, while PWP-SLAM achieves higher accuracy. One possible cause is that ORB-SLAM and PTAM perform a bundle adjustment using only two families of parameters, the poses and the 3-D points, whereas our system adds a planar constraint in the bundle adjustment that allows us to properly manage the 3-D points belonging to the same planar surface and thus improves both the precision of the camera pose and the quality of the 3-D map compared with classical SLAM systems (Fig. 10).

Fig. 10 Feature detection and feature-based 3-D reconstruction.

As shown in Fig. 11, we evaluated the camera keyframe trajectory against the ground truth data along the coordinate axes x and y; PWP-SLAM produces an accurate trajectory, with an error below 0.5 cm for the fr3 str tex far sequence. Our system has shown that it can process sequences that contain moving dynamic objects and planar structures.

Fig. 11 Camera trajectory for fr3 str tex far (RMSE with respect to the ground truth) along the coordinate axes.
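For reference, the absolute trajectory RMSE reported in Table 1 can be computed in the standard TUM benchmark fashion: rigidly align the estimated keyframe positions to the ground truth with a least-squares transform (Horn/Kabsch), then take the RMSE of the residuals. The sketch below assumes the two trajectories are already associated by timestamp.

```python
# Sketch of the absolute trajectory RMSE of Table 1 (TUM benchmark
# convention): least-squares rigid alignment, then RMSE of residuals.
import numpy as np

def ate_rmse(est, gt):
    """est, gt: (N, 3) keyframe positions with matching timestamps."""
    est_c, gt_c = est - est.mean(0), gt - gt.mean(0)
    U, _, Vt = np.linalg.svd(est_c.T @ gt_c)   # cross-covariance SVD
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # best rotation est -> gt
    aligned = est_c @ R.T + gt.mean(0)
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```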
9 Conclusion
The motivation for our work was to improve a state-of-the-art SLAM system in a dynamic scene using planar surfaces. As a novelty, we presented in this paper a constrained piecewise planar monocular SLAM system. The proposed approach is divided into two stages: a multiplanar surface segmentation using homographies and superpixels in the initialization phase, and a constrained monocular SLAM algorithm that tracks and maps the planar surfaces. In our experiments, we showed that using the constrained SLAM, we can render virtual objects in a meaningful way and improve the tracking process in a dynamic scene. In the future, we plan to use deep learning techniques in our segmentation for more robustness.

Acknowledgments
The authors would like to acknowledge the financial support of this work by grants from the General Direction of Scientific Research (DGRST), Tunisia, under the ARUB program.

References
1. G. Younes et al., "Pose tracking for augmented reality applications in outdoor archaeological sites," J. Electron. Imaging 26(1), 011004 (2016).
2. J. Li et al., "Handheld pose tracking using vision-inertial sensors with occlusion handling," J. Electron. Imaging 25(4), 041012 (2016).
3. E. Marchand, H. Uchiyama, and F. Spindler, "Pose estimation for augmented reality: a hands-on survey," IEEE Trans. Visual Comput. Graphics 22, 2633–2651 (2016).
4. H. Kato and M. Billinghurst, "Marker tracking and HMD calibration for a video-based augmented reality conferencing system," in Proc. 2nd IEEE and ACM Int. Workshop on Augmented Reality (IWAR '99), pp. 85–94 (1999).
5. G. Simon, A. W. Fitzgibbon, and A. Zisserman, "Markerless tracking using planar structures in the scene," in Int. Symp. on Augmented Reality, pp. 120–128 (2000).
6. A. Davison, "Real-time simultaneous localisation and mapping with a single camera," in Proc. Int. Conf. on Computer Vision, Nice (2003).
7. G. Klein and D. Murray, "Parallel tracking and mapping on a camera phone," in Proc. Eighth IEEE and ACM Int. Symp. on Mixed and Augmented Reality (ISMAR '09), Orlando (2009).
8. R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: a versatile and accurate monocular SLAM system," CoRR abs/1502.00956 (2015).
9. J. Civera, A. J. Davison, and J. M. M. Montiel, "Unified inverse depth parameterization for monocular SLAM," in Proc. of Robotics: Science and Systems (2006).
10. E. Eade and T. Drummond, "Scalable monocular SLAM," in Proc. of the 2006 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2006), Vol. 1, pp. 469–476, IEEE Computer Society, Washington, DC (2006).
11. A. J. Davison et al., "MonoSLAM: real-time single camera SLAM," IEEE Trans. Pattern Anal. Mach. Intell. 29, 1052–1067 (2007).
12. R. O. Castle, G. Klein, and D. W. Murray, "Video-rate localization in multiple maps for wearable augmented reality," in Proc. 2nd IEEE Int. Symp. on Wearable Computing, Pittsburgh, Pennsylvania (2008).
13. H. Strasdat, J. M. M. Montiel, and A. J. Davison, "Editors choice article: visual SLAM: why filter?" Image Vision Comput. 30, 65–77 (2012).
14. J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: large-scale direct monocular SLAM," in European Conf. on Computer Vision (2014).
15. R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: dense tracking and mapping in real-time," in Proc. of the 2011 Int. Conf. on Computer Vision (ICCV 2011), pp. 2320–2327, IEEE Computer Society, Washington, DC (2011).
16. K. Pirker, M. Rüther, and H. Bischof, "CD SLAM—continuous localization and mapping in a dynamic world," in Intelligent Robots and Systems (IROS), pp. 3990–3997, IEEE (2011).
17. W. Tan et al., "Robust monocular SLAM in dynamic environments," in IEEE Int. Symp. on Mixed and Augmented Reality (ISMAR), pp. 209–218 (2013).
18. S. Oh, M. Hahn, and J. Kim, "Dynamic EKF-based SLAM for autonomous mobile convergence platforms," Multimedia Tools Appl. 74(16), 6413–6430 (2015).
19. A. P. Gee et al., "Discovering higher level structure in visual SLAM," IEEE Trans. Rob. 24, 980–990 (2008).
20. J. Ventura and T. Hollerer, "Online environment model estimation for augmented reality," in 8th IEEE Int. Symp. on Mixed and Augmented Reality, pp. 103–106 (2009).
21. Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell. 22, 1330–1334 (2000).
22. P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vision 59, 167–181 (2004).
23. E. Malis et al., "Deeper understanding of the homography decomposition for vision-based control," INRIA Research Report (2007).
24. J. Sturm et al., "A benchmark for the evaluation of RGB-D SLAM systems," in Int. Conf. on Intelligent Robot Systems (IROS) (2012).
25. O. Faugeras and F. Lustman, "Motion and structure from motion in a piecewise planar environment," Tech. Rep. RR-0856, INRIA (1988).
26. R. Frikha et al., "Natural gesture based interaction with virtual heart in augmented reality," Lect. Notes Comput. Sci. 9375, 457–465 (2015).
27. R. Ben Ali, R. Ejbali, and M. Zaied, "GPU-based segmentation of dental x-ray images using active contours without edges," in Int. Conf. on Intelligent Systems Design and Applications (ISDA), pp. 505–510 (2015).
28. R. Ben Ali, R. Ejbali, and M. Zaied, "Detection and classification of dental caries in x-ray images using deep neural networks," in Int. Conf. on Software Engineering Advances (ICSEA), p. 236 (2016).
29. T. Chebbi et al., "3D human heart anatomy: simulation and visualization based on MR images," in Int. Conf. on Software Engineering Advances (ICSEA), p. 225 (2016).
30. R. Frikha, R. Ejbali, and M. Zaied, "Handling occlusion in augmented reality surgical training based instrument tracking," in IEEE/ACS 13th Int. Conf. of Computer Systems and Applications (AICCSA), pp. 1–5 (2016).

Rawia Frikha is pursuing her doctoral studies at the Engineering School of Sfax, Tunisia. She is a member of the Research Groups on Intelligent Machines (REGIM-Lab) and an IEEE student member. She received her BS and MS degrees in computer science and multimedia from the Higher Institute of Computer Science and Multimedia of Sfax in 2010 and 2012, respectively. Her current research interests include augmented reality, computer vision, image processing, and HCI.

Ridha Ejbali is an assistant professor at the Faculty of Sciences of Gabes, Tunisia (FSG), in the Department of Computer Sciences. He received his PhD in computer engineering, his master's degree, and his computer engineering degree from the National Engineering School of Sfax, Tunisia (ENIS), in 2012, 2006, and 2004, respectively. He is an IEEE senior member, SPS society. His research area is pattern recognition and machine learning using wavelet and wavelet network theories.

Mourad Zaied is an associate professor at the National Engineering School of Gabes in the Department of Electrical Engineering. He received his HDR and PhD degrees in computer engineering and his master of science from the National Engineering School of Sfax in 2013, 2008, and 2003, respectively. His research interests include computer vision, image and video analysis, wavelets and wavelet networks, pattern recognition, and image, audio, and video coding and indexing.