Journal of Electronic Imaging 26(5), 053029 (Sep∕Oct 2017)
Camera pose estimation for augmented reality in a small
indoor dynamic scene
Rawia Frikha,* Ridha Ejbali, and Mourad Zaied
University of Gabes, Research Team in Intelligent Machines, National Engineering School of Gabes, Gabes, Tunisia
Abstract. Camera pose estimation remains a challenging task for augmented reality (AR) applications.
Simultaneous localization and mapping (SLAM)-based methods are able to estimate the six degrees of freedom
camera motion while constructing a map of an unknown environment. However, these methods do not provide
any reference for where to insert virtual objects since they do not have any information about scene structure and
may fail in cases of occlusion of three-dimensional (3-D) map points or dynamic objects. This paper presents a
real-time monocular piecewise planar SLAM method based on the planar scene assumption. Using planar structures in the mapping process allows us, on the one hand, to render virtual objects in a meaningful way and, on the other hand, to improve the precision of the camera pose and the quality of the 3-D reconstruction of the environment by adding constraints on 3-D points and poses in the optimization process. We also propose to exploit the rigid motion of 3-D planes in the tracking process to enhance the system robustness in the case of dynamic scenes.
Experimental results show that using a constrained planar scene improves our system accuracy and robustness
compared with the classical SLAM systems. © 2017 SPIE and IS&T [DOI: 10.1117/1.JEI.26.5.053029]
Keywords: augmented reality; camera pose; simultaneous localization and mapping system; homography; planar structures; motion
rigidity constraint; tracking.
Paper 170207 received May 23, 2017; accepted for publication Oct. 6, 2017; published online Oct. 30, 2017.
1 Introduction
In recent decades, there has been a wide interest in the augmented reality (AR) field because of the many different uses
that this technology can provide, whether in entertainment,
training, or education.1,2 Its aim is to improve our perception
of the real world by superimposing virtual objects on real
images captured by a video capture device. The major problem is inserting virtual entities into real scenes precisely enough to create the illusion that the real and virtual worlds coexist. Solving this problem requires finding the properties of the real camera used to shoot the scene so that the perspective of the superimposed virtual object is consistent with the actual scene, yielding a coherent final synthetic image in real time.
Several techniques exist to find the actual parameters of
the camera.3 These can be grouped into two categories. The
first is related to localization methods with knowledge of a
predefined model in the scene. This model can be defined as
a marker or a three-dimensional (3-D) model of the scene.
Markers are easier to manage since they are generally represented by a simple shape, which consists of a square with
black edges and a simple design in the center.4 However, 3-D
models are more difficult to use since the reconstructed shape
accuracy of these models is essential.5 The principle is to use
a similarity measure between the 3-D model and the real
object on the image to determine the pose of the camera.
The idea is to project the 3-D model into the image and
find matches between the points of interest detected in the
image and those projected. However, these techniques have the disadvantage of requiring an a priori model of the filmed scene before the application can be used, which restricts the user. This has led to the development
of a new category of systems known as simultaneous localization and mapping (SLAM),6–8 which aims to locate the
camera over time and reconstruct the scene at the same
time in an incremental way based on the previous reconstructions. SLAM-based methods allow minimizing the initial
costs of constructing AR systems, which frequently need
a priori large and precise model of the environments.
However, several practical questions arise when dealing
with the use of monocular SLAM in AR applications.
(1) It is important to identify a reference for inserting virtual
objects in a meaningful way. Since the environment is
unknown, the SLAM system does not possess sufficient
information about the exact location where it must insert
the virtual objects for AR applications. (2) It is key to estimate an accurate camera pose, using only the visual information extracted from images captured by a monocular camera, so that the virtual objects can be rendered precisely. (3) It is important to establish a robust tracking process when the environment is full of dynamic objects, as in the context of a training environment. To meet all these challenges, we present an approach called piecewise planar simultaneous localization and mapping (PWPSLAM) to estimate the camera pose while building a map of the environment under the planar scene assumption. Our aim is to add geometric constraints in the mapping process, where each subset of 3-D points is constrained to a defined plane. This approach provides three contributions: (1) a method for map initialization that segments the two initial keyframes selected by the user into many piecewise planar surfaces and reliably estimates the relative motion of the camera using the information obtained
*Address all correspondence to: Rawia Frikha, E-mail:frikha.rawia.tn@ieee.org
1017-9909/2017/$25.00 © 2017 SPIE and IS&T
from the multiple homographies after segmentation. The use
of piecewise planar surfaces allows, first of all, improving
the precision of the camera pose and the quality of the 3-D
reconstruction of the environment since they provide more
information than the points usually used in the classical
SLAM systems. (2) In a second step, we propose including
these specific characteristics of the points constrained to the
planes into an optimization phase, which helps increase the
accuracy. In addition, planes can be used as a reference for
AR applications to add virtual objects in a meaningful way.
(3) Ultimately, using planes can provide additional cues that
are useful for the tracking process especially when there are
dynamic objects in the scene. 3-D points situated on the same
plan have the same motion under the planar rigidity constraint. Our aim is to improve the tracking process robustness
in the case of a dynamic scene. We explored the possibility of
using planar rigidity constraint into the tracking process
when there are points on planes that are occluded by
dynamic or static objects. Our system is mainly intended
for indoor medical training use, where there are many
man-made planar structures, such as tables and walls.
2 Related Work
2.1 Monocular Simultaneous Localization and
Mapping: Camera Localization in an Unknown
Environment
Various approaches have emerged to solve the monocular SLAM problem. They are mainly classified into two types: filtering-based methods9–11 and keyframe-based approaches, which rely on bundle adjustment.7,8,12
The filtering-based approach consists of the construction
of a probabilistic map of the primitives, representing, at each
instant, an overview of the current estimates of the state
of the camera and of all the primitives. This approach
was used for the first time in the robotics field by Davison,6 who proposed a real-time camera localization system, called monocular simultaneous localization and mapping, based on an extended Kalman filter to estimate the 3-D points and the camera pose. The major drawback of this approach is its expensive computation time, since the tracking and mapping are closely linked. Keyframe-based
approaches7,8,13 approximate the map using only selected
frames from a video sequence, allowing them to achieve
more precise bundle adjustment optimizations since the mapping process is not attached to the frame rate. Klein and
Murray7 proposed a new approach based on bundle adjustment called parallel tracking and mapping (PTAM) for
solving the monocular SLAM problem without using filters
to reduce the computation time and guarantee the real-time
requirement of AR applications. They subdivided the
tracking and the mapping into two parallel threads to
improve the system robustness and speed. The tracking
thread is responsible for estimating the camera pose (position
and orientation) and for graphics augmentation. The mapping thread is responsible for generating the map and relocalizing the camera pose. The map consists of point features, which are continuously updated for each keyframe. Klein and Murray7 estimated a dominant virtual plane from the mapped points to provide the user with meaningful augmentations. However, the estimated plane may correspond to a virtual plane in the scene, which hinders many AR-based applications. Strasdat et al.13 showed that keyframe-based methods are more precise than filtering for the same computational cost. In the same way, Mur-Artal et al.8 developed ORB-SLAM (oriented FAST and rotated BRIEF SLAM), a feature-based monocular SLAM system that works in small
and large environments in real time. They improved the
classical SLAM accuracy by adding a third loop closing
thread using a graph-based keyframe representation. In the initialization phase, they developed an automatic approach using a heuristic to choose between two geometric models: a homography to model a planar scene and a fundamental matrix to model a nonplanar scene. Unlike Klein and Murray,7 their approach does not require user intervention to select two good views with significant parallax, but it may fail or take a long time to initialize when the scene is not static.
Large-scale direct monocular simultaneous localization
and mapping14 provides a semidense map of the environment. This system, which does not need feature extraction
and matching between points, offers more information
about the scene than the previous approaches, whereas the
dense tracking and mapping system15 provides a dense map of the environment, in which each pixel contributes to the registration process. Its principle is to use direct registration between the two images based on the color and the inverse depth. However, all the cited SLAM systems have their
own limitations. The most important issue is that the environment has to remain stable during processing since they
do not take into consideration updating the 3-D structure
of dynamic objects.
2.2 Simultaneous Localization and Mapping in
a Dynamic Scene
Recently, several authors16–18 have proposed a new kind of
SLAM that can obtain robust tracking and mapping in
dynamic environments. The main idea is to update the
map when there are dynamic objects or occlusions in the
scene. Pirker et al.16 used the visibility information of a
3-D feature reference to remove potential ambiguities in
the 2D–3D matching process using the histogram of oriented
cameras descriptor. On the other hand, Tan et al.17 handled
occlusions by comparing the feature appearance and structure to filter out changed features in the map. However,
updating the map is not a very accurate solution for handling a dynamic scene, since the set of 3-D features is used by an optimization process, generally called bundle adjustment, each time a keyframe is inserted into the map. In contrast, our system assumes the rigidity of the scene at the beginning and later exploits the rigid motion of planes in the tracking process to increase the system robustness.
2.3 Plane Segmentation
Detecting multiple planes in images is a challenging problem. Most existing methods are typically built on top of PTAM and an EKF-SLAM system, respectively,19,20 and detect planes from 3-D point clouds. However, these approaches can detect only virtual planes, which may reduce the system performance. To simplify the task in our case, we assume that the scene contains only planar structures; we consider this assumption suited to indoor man-made environments, where surfaces are mostly planar. We
generate superpixels from the two initial keyframes, and
classify and merge them into 3-D planes by applying a
homography of one planar superpixel to another.
3 Preliminaries
In this section, we give an overview of the relevant mathematical concepts and notation. In particular, we summarize the perspective projection, the homography computation, and the reprojection error.

3.1 Perspective Projection
In a pinhole camera model, a scene view is formed by projecting 3-D points onto the image plane using a perspective transformation. Let us denote X_w Y_w Z_w the world coordinate system and X_c Y_c Z_c the camera coordinate system:

$$ m = K \, {}^{w}T_{c} \, M, \qquad
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \tag{1} $$

where m = (x, y, 1) is the image point expressed in pixels, M = (X, Y, Z, 1) is the coordinate of a 3-D point in the world coordinate space, K is the camera intrinsic parameter matrix, c_x and c_y are the coordinates of the principal point, f_x and f_y are the focal lengths expressed in pixels, and {}^{w}T_{c} is the camera extrinsic parameter matrix. Equation (1) gives the transformation between the coordinates of a 3-D point in the world coordinate space and its projection in the camera coordinate space. It also contains R and T, which represent the rotation and translation matrices, respectively. Since the intrinsic parameters can be obtained in an offline calibration step,21 camera pose estimation consists of determining R and T, which align the virtual world with the real one.

3.2 Homography Computation
It is possible to estimate the camera pose using only the 2-D information extracted from two images of a scene. Given a set of corresponding points in the left and right images, a homography can be built. Suppose m(x, y) is a point in the first image and H is the 3 × 3 homography matrix; the corresponding point m'(x', y') in the second image is determined as

$$ x' = H x, \qquad
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \tag{2} $$

with

$$ H = R + \frac{T}{d}\, n^{T}, \tag{3} $$

where n and d are, respectively, the normal of the reference plane and its distance to the origin, both expressed in the camera frame. H can be estimated from a minimum of four point correspondences using a direct linear transform algorithm. The relation between the projected point Hx and its measurement x' is given by the geometric error function

$$ \arg\min_{H} \sum_{i=1}^{N} d(x_i', H x_i)^2. \tag{4} $$
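For illustration, a minimal Python/OpenCV sketch of Eqs. (2)–(4) follows. It is not the authors' code; the point correspondences and the RANSAC threshold are placeholder values.

```python
# Estimate a homography from point correspondences with RANSAC and measure the
# geometric reprojection error of Eq. (4). Coordinates below are illustrative only.
import numpy as np
import cv2

pts1 = np.array([[10, 12], [200, 15], [190, 180], [15, 175], [100, 90]], dtype=np.float32)
pts2 = np.array([[14, 10], [205, 18], [193, 184], [18, 172], [104, 92]], dtype=np.float32)

# Direct linear transform inside a RANSAC loop (needs at least four correspondences), Eq. (2)
H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, ransacReprojThreshold=3.0)

# Geometric error of Eq. (4): sum of d(x_i', H x_i)^2 over the correspondences
pts1_h = cv2.convertPointsToHomogeneous(pts1).reshape(-1, 3).T   # 3 x N
proj = H @ pts1_h
proj = (proj[:2] / proj[2]).T                                    # back to pixel coordinates
reproj_error = np.sum(np.linalg.norm(proj - pts2, axis=1) ** 2)
print(H, reproj_error)
```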
4 System Overview
Our system (see the overview in Fig. 1) can be divided into two phases: the first phase is the system initialization, and the second consists of two threads running in parallel, the tracking and the mapping, similar to PTAM.7 Both phases require ORB feature extraction, which is applied to each acquired image frame.
Fig. 1 Our framework.
The initialization phase is in charge of segmenting the scene into planar regions to generate an initial map and
estimate the initial camera pose. First of all, this phase
requires user intervention to select the first two keyframes
by clicking twice when the camera is moved smoothly.
Then, we generate superpixels using the efficient
graph-based segmentation method of Felzenszwalb and
Huttenlocher.22 We assume that the scene is planar, which allows us to use a homography model to segment superpixels into planar regions. The output of this step is a set of planes
defined by a set of ORB features. Our initialization allows us, on the one hand, to obtain an accurate initial camera pose using the information obtained from the homographies after segmentation and, on the other hand, to generate an initial 3-D map using the triangulation process. All the initialization steps
are detailed in Sec. 5.
The tracking thread is responsible for localizing the camera over time and selecting keyframes. At each instant, the points of the reconstructed 3-D local map are searched by reprojection, and the camera pose is optimized again with all the found matches. Using this pose, virtual graphics
can then be drawn on top of the video sequence. Finally,
the tracking process decides when to insert a new keyframe.
All the steps are explained extensively in Sec. 6.
The mapping thread obtains keyframes selected by the
tracking thread to process them with the local bundle adjustment. The aim of the bundle adjustment is to achieve an optimal reconstruction of the environment. The map is updated
when new points are detected and initialized with the epipolar search; this allows the reconstruction of the areas of the
environment that have not been explored yet. All the mapping steps are detailed in Sec. 7.
5 Map Initialization
The initialization phase consists of segmenting the scene into
piecewise planar regions to estimate the initial camera pose
and generate a 3-D initial map using two keyframes selected
by the user. Our initialization is achieved in two stages: the first stage is the piecewise planar segmentation, and the second is structure from motion to estimate the initial camera pose and generate an initial 3-D map. The first stage is in turn divided into two steps. The first step consists of generating the superpixels using the method of Felzenszwalb and Huttenlocher.22 The output of this step is n homogeneous regions S_1, S_2, ..., S_n. The second step is the application
of the homography model for clustering superpixels to create
multiple planar maps.
5.1 Piecewise Planar Segmentation
This section shows how to segment the scene into multiple piecewise planar surfaces in two steps. The first step consists of dividing the first two keyframes selected by the user into superpixels using the efficient graph-based segmentation
method of Felzenszwalb and Huttenlocher.22 The resulting image is represented by an undirected graph G = (V, E), which contains a set V of N vertices (the pixels) and a set E of M edges. Note that w(e) is the weight of an edge e between two adjacent vertices.
The aim of this method is to find a relevant partition of the image into n homogeneous regions S_1, S_2, ..., S_n of G. Each region S_{i=1...n} represents a set
of connected pixels in the image that build a minimum spanning tree. It should be noted that Int(S_i) is the internal variation of each region, defined as the greatest weight w(e) within region S_i, whereas Ext(S_i, S_j) is the external variation, defined as the minimum weight of an edge connecting the two regions S_i and S_j.
Fig. 2 Example of an image segmentation with 17 superpixels: (a) and (b) the first two keyframes of a sequence extracted from the TUM RGB-D dataset; (c) and (d) the corresponding segmentations into superpixels.
Three empirical parameters were proposed by Felzenszwalb and Huttenlocher,22 which are the minimum region size enforced by postprocessing, the threshold function, and the standard deviation used to smooth the input image before segmenting it. These three parameters were set to 40, 200, and 0.9, respectively, in our work. Figure 2 illustrates an image segmentation with 17 superpixels. Figures 2(a) and 2(b) show the first two keyframes of a sequence acquired from the Technische Universität München (TUM) RGB-D dataset. Figures 2(c) and 2(d) show the two images clustered into superpixels defining the homogeneous regions. After segmentation, a filtering scheme was applied to eliminate superpixels belonging to small regions: only superpixels above a certain threshold, defined as the minimum superpixel size, are considered useful in our application.
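For reference, a minimal sketch using scikit-image's implementation of the Felzenszwalb–Huttenlocher segmentation with the three parameter values quoted above is shown below. Mapping the threshold function to skimage's scale parameter and the filtering threshold of 500 pixels are our own assumptions, and the test image is a stand-in for the first keyframe.

```python
# Generate superpixels with the Felzenszwalb-Huttenlocher method and filter out small ones.
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import felzenszwalb

keyframe = astronaut()                     # stand-in for the first selected keyframe
labels = felzenszwalb(keyframe, scale=200, sigma=0.9, min_size=40)

min_superpixel_px = 500                    # illustrative minimum superpixel size
ids, counts = np.unique(labels, return_counts=True)
kept = {int(i) for i, c in zip(ids, counts) if c >= min_superpixel_px}
print(f"{len(ids)} superpixels, {len(kept)} kept after filtering")
```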
In the second step, the superpixels S obtained from the filtering scheme of the first step are merged when they belong to the same plane in the two images. Assuming that S = {S_1^1, ..., S_n^m} are the superpixels of an image m corresponding to planar areas in the scene, a homography model for clustering superpixels is applied to create multiple planar maps. To apply the homography model, we first have to find the corresponding superpixels between the first two keyframes. To this end, each superpixel is defined by a set of ORB features as well as their ORB descriptors, which describe each feature and its local
neighborhood. Then a matching process is performed to find, for each descriptor of the query image, the nearest descriptor in the descriptor set. Since ORB is a binary descriptor, we use the Hamming distance to find all the matching superpixels between the first two keyframes. The Hamming distance determines the similarity between the two superpixel descriptors; if the obtained value is less than a threshold, the descriptors are matched. To obtain good matches, a homography with a random sample consensus (RANSAC) scheme is used to filter out bad matches, since we are dealing with perspective transformations of planar regions. ORB is robust to scale changes because it uses image pyramids, and it ensures good invariance to viewpoint and illumination. Finally, for the N matched points obtained from the last step, a homography is computed for each superpixel. To identify the planar surfaces, as shown in Algorithm 1, H_n is defined as the set of all n homographies existing between the superpixels. Each homography in H_n defines a planar piecewise surface existing in both images. To verify whether the points of a superpixel S_i are related to any other existing plane in H_n, we use the reprojection error of S_i: if it is below a fixed threshold, all the superpixels S_{i=1...n} belonging to the same plane are merged. The thresholds used above are chosen empirically. This loop is performed until all homographies are clustered, yielding multiple planar regions.

Algorithm 1 Clustering piecewise planar surfaces.
Input:
  P_m, P_m': sets of ORB features representing each superpixel in both images
  S_n: set of planar piecewise surfaces of P_m representing the superpixels S_{i=1...n}
  H_n: set of homographies representing each planar piecewise surface H_{i=1...n}
  th_err: reprojection error threshold
Define: rep_err: the reprojection error
for i = each homography of H_n do
  for j = each superpixel of S_n do
    if i ≠ j then
      for k = each point of P_m do
        rep_err = distance(H_i P_k, P_k')
        if rep_err ≤ th_err then
          merge superpixel j into the plane defined by homography i
        end if
      end for
    end if
  end for
end for
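As an illustration, a minimal Python sketch of the clustering loop of Algorithm 1 is given below, using OpenCV for the per-superpixel homographies. The data layout (matches_per_superpixel) and the threshold value are our own assumptions, not the paper's implementation.

```python
# Fit a homography per superpixel from its matched ORB keypoints, then merge superpixels
# whose points reproject well under another superpixel's homography (same plane).
import numpy as np
import cv2

def reprojection_error(H, pts, pts_prime):
    """Mean pixel distance between H-projected pts and their matches (N x 2 float arrays)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))]).T          # 3 x N
    proj = H @ pts_h
    proj = (proj[:2] / proj[2]).T
    return float(np.mean(np.linalg.norm(proj - pts_prime, axis=1)))

def cluster_superpixels(matches_per_superpixel, th_err=3.0):
    """matches_per_superpixel: {superpixel_id: (pts_in_kf1, pts_in_kf2)}, at least 4 matches each."""
    homographies = {}
    for sp_id, (pts, pts_prime) in matches_per_superpixel.items():
        H, _ = cv2.findHomography(pts, pts_prime, cv2.RANSAC, 3.0)
        if H is not None:
            homographies[sp_id] = H

    plane_of = {sp: sp for sp in homographies}                   # each superpixel starts alone
    for i, H_i in homographies.items():
        for j, (pts, pts_prime) in matches_per_superpixel.items():
            if i == j or j not in homographies:
                continue
            if reprojection_error(H_i, pts, pts_prime) <= th_err:
                plane_of[j] = plane_of[i]                        # merge j into i's plane
    return plane_of
```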
5.2 Structure from Motion
Once the planar piecewise surfaces are identified (Sec. 5.1),
we compute their 3-D positions in two steps using a structure
from motion technique, which is the process of estimating
the 3-D structure of a scene from a set of 2-D views.
First, we estimate the camera pose associated with the selected keyframes. Since a set of homographies is built between all identified planar regions, we use the homography decomposition method proposed in Ref. 23. This method determines the relative camera motion, up to scale, between two different views of a planar region. We choose the homography that produces the lowest reprojection error, thus giving the most accurate camera pose.
Fig. 3 Structure from motion: motion and pose recovery from a planar scene.
By applying a positive depth constraint, we select one of the four generated solutions. Knowing the camera pose and the corresponding points, we then triangulate 3-D features and estimate their positions (Fig. 3). Let π_i be the 3-D plane defined by n_i^T X = d_i, where n_i is the plane normal and d_i is the distance to the origin. The map features mentioned in this paragraph, such as 3-D points, camera poses, keyframes, and homographies, are dealt with in the next section by the planar tracking process.
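A minimal OpenCV sketch of this step, under our assumptions, is given below: it decomposes a homography into its (R, t, n) candidates, applies a positive depth check, and triangulates the matches. The function and variable names are ours, and the full disambiguation procedure of Ref. 23 is not reproduced.

```python
# Recover relative motion (up to scale) and 3-D structure from a planar homography.
import numpy as np
import cv2

def recover_structure(K, H, pts1, pts2):
    """K: 3x3 intrinsics; H: 3x3 homography; pts1, pts2: N x 2 float arrays of matches."""
    # Up to four (R, t, n) candidate solutions
    _, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)

    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])            # first camera at the origin
    for R, t in zip(rotations, translations):
        P2 = K @ np.hstack([R, t])
        pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)    # 4 x N homogeneous points
        pts3d = (pts4d[:3] / pts4d[3]).T
        if np.all(pts3d[:, 2] > 0):                              # positive depth constraint
            return R, t, pts3d
    return None
```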
6 Tracking
In this section, we describe the steps of the tracking process
that are used with every frame acquired from the camera.
Local map tracking in a dynamic scene: After creating the initial map, tracking consists of finding matches between the local map and the points detected in the current frame. For each acquired frame, we perform ORB feature extraction and compute the ORB descriptors. After that, we project the local map into the frame and search for map point correspondences. The local map contains the reference keyframes and their nearest n keyframes. At initialization, we choose the initial keyframe as the reference keyframe; if more than 50% of the image has changed, a newly selected keyframe becomes the reference. Using the point correspondences, we compute the updated camera pose, which is used for graphics rendering.
In the case of a dynamic scene, the classical tracking process may fail when not enough corresponding points are found, and consequently the camera pose is lost. Our idea is to incorporate a planar rigidity constraint into the tracking process. Points belonging to the same planar surface have coherent motion and are related by a homography transformation (see Sec. 3.2). In the classical methods used in SLAM systems, projected points that do not find their matches are marked as invalid, which disturbs the camera localization. In our system, however, these points are kept as valid, which ensures the system robustness against dynamic objects.
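The following minimal sketch illustrates this idea under our assumptions: the positions of occluded points of a plane in the current frame are predicted with the homography estimated from the still-visible points of the same plane, so they can be kept as valid. It is a simplified stand-in, not the system's actual tracking code.

```python
# Predict the current-frame positions of occluded points of a plane from its visible points.
import numpy as np
import cv2

def predict_occluded_points(prev_pts, curr_pts, visible_mask):
    """prev_pts, curr_pts: N x 2 float arrays for one plane; rows of curr_pts with
    visible_mask == False are unknown (occluded). Needs at least 4 visible points."""
    H, _ = cv2.findHomography(prev_pts[visible_mask], curr_pts[visible_mask],
                              cv2.RANSAC, 3.0)
    occluded = prev_pts[~visible_mask]
    occ_h = np.hstack([occluded, np.ones((len(occluded), 1))]).T
    pred = H @ occ_h
    pred = (pred[:2] / pred[2]).T
    out = curr_pts.copy()
    out[~visible_mask] = pred                       # occluded points stay valid
    return out
```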
Keyframe selection: Keyframe selection is very important in the reconstruction process. To select a keyframe, two criteria are defined: (1) at least thirty frames must have passed since the last keyframe insertion, and (2) the current frame must have at least 30% of its points of interest correctly associated with map points.
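These two criteria can be summarized in a small helper; the sketch below is ours, with counters assumed to be maintained by the tracking thread.

```python
# Decide whether the current frame should be inserted as a keyframe.
def should_insert_keyframe(frames_since_last_kf, n_matched, n_interest_points):
    enough_time = frames_since_last_kf >= 30                    # criterion (1)
    enough_matches = n_matched >= 0.30 * n_interest_points      # criterion (2)
    return enough_time and enough_matches

print(should_insert_keyframe(42, 180, 500))   # True: 42 frames elapsed, 36% matched
```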
Constrained bundle adjustment: Bundle adjustment is usually used as the last step of a feature-based 3-D reconstruction algorithm. It optimizes the 3-D structure and the camera parameters (rotation and translation). In other words, it can be defined as the problem of simultaneously refining the 3-D points and the camera pose parameters according to an optimality criterion involving the image projections of all the points. In fact, the map is optimized each time a keyframe is added using bundle adjustment, which adjusts the map point positions and the keyframe poses and minimizes the reprojection error of all the points in all the keyframes.
The bundle adjustment optimizes the camera parameters and the 3-D map points, taking advantage of each new keyframe inserted in the map, as it brings new information about the 3-D points that constitute the 3-D map. To respect the real-time requirement of our application, the parameters to be optimized are limited to a subset of keyframes and 3-D points, in which we select the keyframes closest to the reference keyframe.
Assume that n 3-D points are seen in m keyframes, and let x_{ij} be the projection of the i'th point on keyframe j. Assume also that each keyframe j is parameterized by a vector KF_j and each 3-D point i by a vector PT_i. The cost function ϵ minimizes the total reprojection error with respect to the selected 3-D point and camera parameters

$$ \epsilon = \min_{KF_j, PT_i} \sum_{i=1}^{n} \sum_{j=1}^{m} d\left[ P(KF_j, PT_i), x_{ij} \right]^2, \tag{5} $$
where P(KF_j, PT_i) is the projection of point i on keyframe j and d(x, y) is the Euclidean distance between the image points and the projected ones. The addition of a planar constraint in the bundle adjustment allows us to properly manage
the 3-D points that belong to the same planar surface (Fig. 4).
This implies that an additional term should be added to the
previous cost function [Eq. (5)] to measure the 3-D structure
error of planar points. This leads to the following constraint,
where all planes are interdependent:

$$ \sum_{i=1}^{k} \sum_{j=1}^{x} n_i^{T} \cdot (PT_j - d_i) = 0, \tag{6} $$
where k is the number of planes, x is the number of 3-D points that belong to the same plane i, and (n_i, d_i) are the normal and the distance of plane i.
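A minimal sketch of how such a constrained cost can be assembled is given below. It stacks the reprojection residuals of Eq. (5) with a point-on-plane term of the form n_i · X − d_i for each planar point (a standard rewriting of the constraint in Eq. (6)), and it keeps the camera poses fixed purely for brevity. This is an illustration under our assumptions, not the optimizer used in the paper; the residual vector could be handed to a generic solver such as scipy.optimize.least_squares.

```python
# Build a residual vector mixing reprojection errors with planar-membership terms.
import numpy as np

def project(K, R, t, X):
    # Perspective projection of Eq. (1)
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def residuals(points_flat, K, poses, observations, plane_of, planes):
    """poses: list of (R, t); observations: list of (point_idx, pose_idx, observed_xy);
    plane_of: {point_idx: plane_idx}; planes: list of (normal, distance)."""
    points = points_flat.reshape(-1, 3)
    res = []
    for i, j, x_obs in observations:                 # Eq. (5): reprojection terms
        R, t = poses[j]
        res.extend(project(K, R, t, points[i]) - x_obs)
    for i, p in plane_of.items():                    # Eq. (6): planar terms
        n, d = planes[p]
        res.append(n @ points[i] - d)
    return np.asarray(res)
```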
7 Mapping
In this section, we present the steps fulfilled by the mapping thread for every newly inserted keyframe.
Keyframe and point insertion: Similar to Ref. 16, we use a pose graph to represent the set of keyframes as nodes and the homographies calculated between them as edges. Each keyframe F_i has a camera pose C_i = (R_i | t_i) that represents the rigid transformation from the world coordinate system to the camera coordinate system. H_ik represents the homography that maps points from keyframe F_i to F_k. After a new keyframe is added by the tracking process, new map points are found. The system first checks all unmatched ORB features of the newly added keyframe against the other keyframes and then performs an epipolar search to triangulate new points and add them to the map.
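A minimal sketch of this pose-graph bookkeeping, with field names of our own choosing rather than the paper's, could look as follows.

```python
# Keyframes as nodes and inter-keyframe homographies as edges of a pose graph.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Keyframe:
    frame_id: int
    R: np.ndarray                      # 3x3 rotation, world -> camera
    t: np.ndarray                      # 3x1 translation, world -> camera
    orb_descriptors: np.ndarray        # N x 32 binary ORB descriptors

@dataclass
class PoseGraph:
    nodes: dict = field(default_factory=dict)     # frame_id -> Keyframe
    edges: dict = field(default_factory=dict)     # (i, k) -> homography H_ik mapping F_i to F_k

    def add_keyframe(self, kf, homographies_to_existing):
        self.nodes[kf.frame_id] = kf
        for other_id, H in homographies_to_existing.items():
            self.edges[(kf.frame_id, other_id)] = H
```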
Fig. 4 Constrained planar mapping.
8 Results and Evaluation
Our approach was compared with PTAM7 and ORB-SLAM8 for two reasons: their source code is available, and they can be used for indoor scenes, as in the case of our application. To evaluate the performance and accuracy of our system, we used three different video sequences recorded in an office. The first video sequence was captured by a handheld camera and contains textured planar surfaces, and the other two were extracted from the public TUM RGB-D benchmark,24 which contains moving dynamic objects and planar structures. The results were evaluated on a computer with an Intel Core i7-6500U 2.50 GHz CPU (four cores) and a GeForce 920M graphics card. All RGB images were recorded at 30 frames per second with a resolution of 640 × 480.
8.1 Qualitative Results
To evaluate the performance of our proposed approach, two
experiments were performed. In the first experiment, we
demonstrated the utility of our initialization phase compared
with the PTAM, whereas, in the second one, we showed the
robustness of our system in the case of dynamic scenes.
Figure 5 shows the first indoor experiment, where a desktop scene was captured. The sequence of this experiment contained 626 frames recorded by a handheld camera and included two textured planar surfaces. As shown in Fig. 5, the planar surfaces were detected by our initialization phase (see Sec. 5.1), unlike in PTAM (Fig. 6), and then tracked in the rest of the sequence.
In its initialization phase, PTAM assumes that the scene is locally planar and recovers the camera pose from a homography using the Faugeras method.25 In PTAM, the first two keyframes are selected by the user, and a minimal set of corresponding points is used to compute the homography. This process is performed iteratively in a RANSAC scheme, verifying dist(x', Hx) < ϵ, until a sufficient number of corresponding points are consistent with the computed homography.
However, this method can select a virtual plane that does not correspond to a real plane, as shown in Fig. 6. In contrast, the initialization phase of our approach takes into account the semantic meaning of the scene. We segment the scene into different planes representing real piecewise planar surfaces.
Fig. 5 Results of the tracking thread where two planar surfaces are detected in the x-, y-, and z-axis system. (a)–(g) Extracted video image frames at different instants.
Fig. 6 Results of different initialization phases used in PTAM. The white grid corresponds to a virtual
plane and the axis shows its orientation. (a)–(g) extracted video image frames at different instants.
The initial maps were created and tracked in the rest of the sequence. Our system chose the plane that gave the best camera pose; in fact, it estimates an accurate camera pose by aligning the virtual objects with the real world. In the ORB-SLAM system, an automatic initialization based on model selection is used to create an initial map of both planar and nonplanar environments. For a dynamic scene, this method may fail, and the system cannot initialize due to the moving objects. In addition, similar to PTAM, a dominant plane is selected from the created map, which may correspond to a virtual plane.
Figure 7 shows our second experiment, which involves a challenging task for a SLAM system. We compared our system with PTAM in two cases, with and without handling the occlusion caused by dynamic objects. We used a moving sheet of paper that occluded some parts of the scene. In PTAM, the invalid 3-D points occluded by the moving object could not be reliably detected and handled since the changed regions were not detected, whereas our system could detect them and treat them effectively under the planar rigidity motion constraint, in which 3-D points situated on the same plane have the same motion.
As shown in Fig. 7(c), in PTAM the occluded points in yellow were discarded by the system, and the camera
focus may be lost or drifted. In contrast, our system
could robustly recover the camera pose [Fig. 7(b)].
Figure 7(a) shows the occlusion-handling result of our system. The changed 3-D points that were occluded by the moving paper were recognized and highlighted in red. Using the rigid motion constraint of planes allows the system to obtain better results compared with classical tracking methods. Furthermore, to evaluate the robustness of our system to moving dynamic objects,
we used a sequence (fr3 walking xyz) extracted from the
TUM RGB-D24 dataset, which contains two persons walking through an office scene. Figure 8 shows that our system
is able to estimate the camera pose in the case of a
dynamic scene.
Since our work is part of the exploration of AR in the medical field,26–30 we used a 3-D virtual heart in the scene, as shown in Fig. 9. Our general aim is to train surgeons in a preoperative phase as well as in the various gestures that their hands could perform during surgery on a virtual heart. Therefore, the environment is not static since it contains dynamic objects, such as the surgeon's hands, instruments, etc. In this case, our approach provides a good solution by profiting from the planar constrained scene.
Fig. 7 Handling occlusion caused by dynamic objects: (a) our result with occlusion handling, (b) camera
pose, and (c) the PTAM result without occlusion handling.
Fig. 8 Example of dynamic scenes wherein our system successfully estimates the camera pose. (a)–
(e) extracted image frames from TUM RGBD dataset at different instants.
8.2 Quantitative Results
The TUM RGB-D benchmark24 is a good dataset to
evaluate the precision of a camera pose since it provides
an accurate ground truth, which is obtained from an external
motion capture system. We chose two video sequences that
we found suitable for the evaluation: fr3 walking xyz and
fr3 str tex far. We ran the PTAM, ORB-SLAM, and our
system PWPSLAM three times in the benchmark to average
the results of the evaluation. Table 1 shows the camera localization error with respect to the ground truth of the TUM RGB-D benchmark.

Fig. 9 Example of a system based on AR aiming at visualizing and manipulating the virtual heart.

Fig. 11 Camera trajectory of fr3 str tex far (RMSE with respect to ground truth) along the coordinate axes.

Table 1 Absolute keyframe trajectory RMSE (cm).

Sequence          | PWPSLAM | ORB-SLAM | PTAM
fr3 walking xyz   | 0.7     | 1.06     | X
fr3 str tex far   | 0.45    | 0.65     | 0.88
It can be seen that PTAM can only process fr3 str tex far and fails on fr3 walking xyz. The latter is a challenging case, as there are large occlusions due to the moving persons in the scene. The camera tracking is lost at some point, and a large part of the video sequence is not processed by the system. In PTAM, the invalid 3-D points occluded by moving objects could not be reliably detected and handled since the changed regions were not detected; thus, the system could not robustly deal with dynamic changes. In contrast, PWPSLAM can successfully process fr3 walking xyz and shows that it can robustly deal with a challenging situation where there are moving objects. Our initialization phase segments the scene into piecewise planar regions, which we use as additional cues in the tracking process. The occluded 3-D points are reliably detected and can be effectively treated under the planar rigidity motion constraint, in which 3-D points situated on the same plane have the same motion. Similarly, ORB-SLAM is also robust to dynamic changes. One possible cause is that it performs a culling procedure and inserts new keyframes into the map only if the visual content of the scene changes.
For the fr3 str tex far sequence, ORB-SLAM and PTAM have similar accuracy, while PWPSLAM achieves higher accuracy. One possible cause is that ORB-SLAM and PTAM perform bundle adjustment using only two families of parameters, the poses and the 3-D points, while our system adds a planar constraint to the bundle adjustment that allows us to properly manage the 3-D points belonging to the same planar surface and thus improve both the precision of the camera pose and the quality of the 3-D map compared with the classical SLAM systems (Fig. 10). As shown in Fig. 11, we evaluated the camera keyframe trajectory against the ground truth data along the coordinate axes x and y, where PWPSLAM produces a clearly more accurate trajectory, with an error below 0.5 cm for the fr3 str tex far sequence. Our system has shown that it can process sequences that contain moving dynamic objects and planar structures.
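For reference, a minimal sketch of how an absolute trajectory RMSE of this kind can be computed from timestamp-associated, aligned keyframe positions is given below; the alignment step itself is omitted, and the toy values are ours.

```python
# Absolute trajectory RMSE between estimated and ground-truth keyframe positions (M x 3).
import numpy as np

def ate_rmse(estimated_xyz, groundtruth_xyz):
    diff = estimated_xyz - groundtruth_xyz
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

est = np.array([[0.0, 0.0, 0.0], [0.10, 0.01, 0.0]])
gt  = np.array([[0.0, 0.0, 0.0], [0.11, 0.00, 0.0]])
print(ate_rmse(est, gt))   # RMSE in the units of the input positions
```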
Fig. 10 Feature detection and feature based 3-D reconstruction.
9 Conclusion
The motivation for our work was to improve a state-of-the-art SLAM system in a dynamic scene using planar surfaces. As a novelty, we presented in this paper a constrained piecewise planar monocular SLAM system. The proposed approach is divided into two stages: a multiplanar surface segmentation using homographies and superpixels in the initialization phase, and a constrained monocular SLAM algorithm that tracks and maps the planar surfaces. In our experiments, we showed that using the constrained SLAM, we can render virtual objects in a meaningful way and improve the tracking process in a dynamic scene. In the future, we plan to use deep learning techniques in our segmentation for more robustness.
Acknowledgments
The authors would like to acknowledge the financial support
of this work by grants from General Direction of Scientific
Research (DGRST), Tunisia, under the ARUB program.
References
1. G. Younes et al., “Pose tracking for augmented reality applications in
outdoor archaeological sites,” J. Electron. Imaging 26(1), 011004
(2016).
2. J. Li et al., “Handheld pose tracking using vision-inertial sensors with
occlusion handling,” J. Electron. Imaging 25(4), 041012 (2016).
3. E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: a hands-on survey,” IEEE Trans. Visual Comput.
Graphics 22, 2633–2651 (2016).
4. H. Kato and M. Billinghurst, “Marker tracking and HMD calibration for
a video-based augmented reality conferencing system,” in Proc. 2nd
IEEE and ACM Int. Workshop on Augmented Reality (IWAR '99),
pp. 85–94 (1999).
5. G. Simon, A. W. Fitzgibbon, and A. Zisserman, “Markerless tracking
using planar structures in the scene,” in Int. Symp. on Augmented
Reality, pp. 120–128 (2000).
6. A. Davison, “Real-time simultaneous localisation and mapping with a
single camera,” in Proc. Int. Conf. on Computer Vision, Nice (2003).
7. G. Klein and D. Murray, “Parallel tracking and mapping on a camera
phone,” in Proc. Eigth IEEE and ACM Int. Symp. on Mixed and
Augmented Reality (ISMAR’09), Orlando (2009).
8. R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: a versatile and accurate monocular SLAM system,” CoRR abs/1502.00956
(2015).
9. J. Civera, A. J. Davison, and J. M. M. Montiel, “Unified inverse depth
parameterization for monocular slam,” in Proc. of Robotics: Science
and Systems (2006).
10. E. Eade and T. Drummond, “Scalable monocular SLAM,” in Proc. of
the 2006 IEEE Computer Society Conf. on Computer Vision and
Pattern Recognition—Volume 1 (CVPR 2006), pp. 469–476, IEEE
Computer Society, Washington, DC (2006).
11. A. J. Davison et al., “MonoSLAM: real-time single camera SLAM,”
IEEE Trans. Pattern Anal. Mach. Intell. 29, 1052–1067 (2007).
12. R. O. Castle, G. Klein, and D. W. Murray, “Video-rate localization in
multiple maps for wearable augmented reality,” in Proc. 2nd IEEE Int.
Symp. on Wearable Computing, Pittsburgh, Pennsylvania (2008).
13. H. Strasdat, J. M. M. Montiel, and A. J. Davison, “Editors choice article:
visual SLAM: why filter?” Image Vision Comput. 30, 65–77 (2012).
14. J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: large-scale direct
monocular SLAM,” in European Conf. on Computer Vision (2014).
15. R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: dense
tracking and mapping in real-time,” in Proc. of the 2011 Int. Conf. on
Computer Vision (ICCV 2011), pp. 2320–2327, IEEE Computer
Society, Washington, DC (2011).
16. K. Pirker, M. Rüther, and H. Bischof, “CD SLAM—continuous localization and mapping in a dynamic world,” in Intelligent Robots and
Systems (IROS), pp. 3990–3997, IEEE (2011).
17. W. Tan et al., “Robust monocular SLAM in dynamic environments,” in
IEEE Int. Symp. on Mixed and Augmented Reality (ISMAR), pp. 209–
218 (2013).
18. S. Oh, M. Hahn, and J. Kim, “Dynamic EKF-based SLAM for autonomous mobile convergence platforms,” Multimedia Tools Appl. 74(16),
6413–6430 (2015).
19. A. P. Gee et al., “Discovering higher level structure in visual SLAM,”
IEEE Trans. Rob. 24, 980–990 (2008).
20. J. Ventura and T. Hollerer, “Online environment model estimation for
augmented reality,” in 8th IEEE Int. Symp. on Mixed and Augmented
Reality, pp. 103–106 (2009).
21. Z. Zhang, “A flexible new technique for camera calibration,” IEEE
Trans. Pattern Anal. Mach. Intell. 22, 1330–1334 (2000).
22. P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image
segmentation,” Int. J. Comput. Vision 59, 167–181 (2004).
23. E. Malis et al., “Deeper understanding of the homography decomposition for vision-based control,” INRIA Research Report (2007).
24. J. Sturm et al., “A benchmark for the evaluation of RGBD SLAM systems,” in Int. Conf. on Intelligent Robot Systems (IROS) (2012).
25. O. Faugeras and F. Lustman, “Motion and structure from motion in a
piecewise planar environment,” Tech. Rep. RR-0856, INRIA (1988).
26. R. Frikha et al., “Natural gesture based interaction with virtual heart in
augmented reality,” Lect. Notes Comput. Sci. 9375, 457–465 (2015).
27. R. Ben Ali, R. Ejbali, and M. Zaied, “GPU-based segmentation of dental x-ray images using active contours without edges,” in Int. Conf. on
Intelligent Systems Design and Applications (ISDA), pp. 505–510
(2015).
28. R. Ben Ali, R. Ejbali, and M. Zaied, “Detection and classification of
dental caries in x-ray images using deep neural networks,” in Int.
Conf. on Software Engineering Advances (ICSEA), p. 236 (2016).
29. T. Chebbi et al., “3D human heart anatomy: simulation and visualization
based on MR images,” in Int. Conf. on Software Engineering Advances
(ICSEA), p. 225 (2016).
30. R. Frikha, R. Ejbali, and M. Zaied, “Handling occlusion in augmented
reality surgical training based instrument tracking,” in IEEE/ACS 13th
Int. Conf. of Computer Systems and Applications (AICCSA), pp. 1–5
(2016).
Rawia Frikha is pursuing her doctoral studies at the Engineering
School of Sfax, Tunisia, and she is a member of Research Groups
on Intelligent Machines (REGIM-Lab) and is an IEEE student
member. She received her BS and MS degrees in computer science
and multimedia from the Higher Institute of Computer Science and
Multimedia of Sfax in 2010 and 2012, respectively. Her current
research interests include augmented reality, computer vision, image
processing, and HCI.
Ridha Ejbali is an assistant professor at the Faculty of Sciences of
Gabes Tunisia (FSG) in the Department of Computer Sciences. He
received his PhD in computer engineering, his master’s degree,
and his computer engineer degree from the National Engineering
School of Sfax, Tunisia (ENIS), in 2012, 2006, and 2004, respectively.
He is an IEEE senior member, SPS society. His research area is now
in pattern recognition and machine learning using wavelets and wavelet network theories.
Mourad Zaied is an associate professor at the National Engineering
School of Gabes in the Department of Electrical Engineering. He
received his HDR, PhD degrees in computer engineering, and his
master of science from the National Engineering School of Sfax in
2013, 2008, and 2003, respectively. His research interests include
computer vision, image and video analysis, wavelets and wavelet networks, pattern recognition, and image, audio, and video coding and
indexing.