Motion Capture using Body Mounted Cameras in an Unknown Environment

Nam Vo
Taeyoung Kim
Siddharth Choudhary
1. The Problem
Motion capture has been used to provide much of the character motion in several recent theatrical releases. It generally requires the recording to be done in an indoor environment with a controlled light setting, which prevents directors from capturing motion in a natural setting or in a large environment. In this paper we propose a method to solve this problem by capturing motion in an unknown, unconstrained environment using body-mounted cameras.
2. Related work
Motion capture technology has been studied for a long time [7] and is available as commercial products from Vicon and Qualisys [1, 2]. Most of these approaches are classified as outside-in, since they require sensors mounted in the environment. In contrast, our approach is an inside-out approach that uses cameras mounted on the body to recover the motion.
Markerless motion capture techniques have been developed by a number of researchers [3, 4]. More recently, Hasler et al. proposed a markerless motion capture method that uses structure from motion to register the cameras with respect to the background and conventional motion capture to estimate the motion of the articulated object [4]. Our work is most closely related to that of Shiratori et al., which also captures motion using body-mounted cameras [5]. However, they require a known environment that is reconstructed from reference images before their algorithm runs; we have no such requirement. Our approach is also fundamentally different from most of these approaches in that we propose an inside-out system.
3. Approach
Unlike [5], our hardware setup is simple: we use three cheap webcams attached to the torso and hands, all connected to a single computer that records the videos. Our approach, however, works with any number of cameras, as long as synchronized videos are given as input. The system consists of two main parts: 3D reconstruction and skeleton estimation.
3.1. 3D Reconstruction
For the 3D reconstruction step, we apply a traditional structure-from-motion technique to build a sparse model of the environment. At the same time, each camera's pose is recovered and serves as the initial guess for the position of the corresponding body part.
Initialize Model: First, we initialize the model off-line using a few frames from all cameras. SIFT features are extracted, and correspondences between pairs of images with a significant number of matches are estimated using RANSAC. Next, we triangulate the locations of the matched feature points in 3D. The set of points and cameras is optimized using GTSAM, and triangulated points with a reprojection error greater than a threshold are removed. As a result, we get an initial model consisting of cameras and points.
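To make this step concrete, the following is a minimal sketch of a two-view variant of the initialization, using OpenCV in Python. The function name, the ratio-test constant, and the reproj_thresh default are our own illustration rather than the exact implementation; the actual pipeline initializes from several frames across all cameras and refines cameras and points jointly with GTSAM.

```python
import cv2
import numpy as np

def initialize_two_view(img1, img2, K, reproj_thresh=2.0):
    """Illustrative two-view model initialization (assumes known intrinsics K)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Match SIFT descriptors with Lowe's ratio test.
    matches = [m for m, n in cv2.BFMatcher().knnMatch(des1, des2, k=2)
               if m.distance < 0.75 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Robust relative-pose estimation with RANSAC.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    inl = mask.ravel() == 1
    pts1, pts2 = pts1[inl], pts2[inl]
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Triangulate the inlier correspondences.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    X = (X[:3] / X[3]).T                                   # N x 3 points

    # Remove points whose reprojection error exceeds the threshold.
    proj = (P2 @ np.hstack([X, np.ones((len(X), 1))]).T).T
    err = np.linalg.norm(proj[:, :2] / proj[:, 2:3] - pts2, axis=1)
    return X[err < reproj_thresh], R, t
```

A pair with a wide enough baseline matters here; Section 5 discusses the low-parallax failure mode that motivates moving the camera sideways during the initial frames.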
Feature Tracking: Given a set of reconstructed images and points, we track the features in the remaining frames using the Kanade-Lucas-Tomasi (KLT) feature tracker. Using the KLT tracker, we create a mapping between the already-reconstructed points and the features found in new frames. Factors between the new frames and the existing points are added to the factor graph, which is then used to optimize each new frame's pose with respect to the world. We do not triangulate any new points in the tracking stage. When tracking fails due to a lack of correspondences between existing 3D points and 2D features, we reconstruct another frame using incremental reconstruction.
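The sketch below shows the tracking step with OpenCV's pyramidal KLT tracker. For brevity it substitutes PnP with RANSAC for the factor-graph pose solve; in our pipeline the surviving 2D-3D correspondences instead become projection factors in the GTSAM graph and only the new frame's pose is optimized. Names such as min_tracks are illustrative.

```python
import cv2
import numpy as np

def track_frame(prev_img, cur_img, prev_pts2d, pts3d, K, min_tracks=30):
    """Track known 2D projections of existing 3D points into the current
    frame and re-estimate the camera pose (illustrative thresholds)."""
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, prev_pts2d.astype(np.float32), None)
    ok = status.ravel() == 1
    if ok.sum() < min_tracks:
        return None   # too few tracks: fall back to incremental reconstruction

    # Pose from the surviving 2D-3D correspondences; no new points triangulated.
    found, rvec, tvec, _ = cv2.solvePnPRansac(pts3d[ok], cur_pts[ok], K, None)
    return (rvec, tvec, cur_pts[ok], pts3d[ok]) if found else None
```

Because no new points are triangulated during tracking, this step stays cheap; the expensive model growth and optimization are deferred to the incremental reconstruction step below.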
Incremental Reconstruction: Whenever tracking fails, we match the current frame to the previous five frames and grow the 3D model by incrementally adding points and images through their correspondences with the existing model. The model is again optimized using GTSAM over the newly added points and images and all the previously tracked frames, to minimize the error introduced by tracking. As a result, we have new 3D points and camera poses, which are used to track the remaining frames until tracking fails again. As the system progresses, the drift of the individual cameras makes the skeleton more and more inconsistent, so the final step in the pipeline is a global optimization to
solve for the optimal skeleton. Algorithm 1 gives the complete algorithm.
Algorithm 1 3D Reconstruction
Initialize Model
repeat
    Track features with respect to the model
    if tracking fails then
        Do incremental reconstruction and optimize
    end if
until end of the sequence
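In code, Algorithm 1 reduces to a simple driver loop. The sketch below assumes hypothetical helpers (initialize_model, track_features, incremental_reconstruction) standing in for the steps described above:

```python
def reconstruct(frames):
    """Driver loop for Algorithm 1; the helpers are hypothetical stand-ins."""
    model = initialize_model(frames[:2])        # off-line initialization
    for frame in frames[2:]:
        pose = track_features(model, frame)     # KLT tracking against the model
        if pose is None:                        # tracking failed
            # register a new key frame, grow the model, re-optimize with GTSAM
            model = incremental_reconstruction(model, frame)
    return model
```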
3.2. Skeleton Estimation
In addition, we make use of the fact that there are constraints between the cameras; for example, the relative pose between the cameras attached to the torso is unlikely to change over time, so it can be calibrated in advance. The cameras attached to the torso and the palms are related by a distance constraint between the cameras. Since we use only three cameras, we cannot model all the additional degrees of freedom of human motion, and we also restrict the human motion to walking. Beyond this, we place no constraint on the palm motion for generic activities: a palm can reach anywhere within some distance of the torso. In the implementation, we create a factor between the poses of the torso and each hand and optimize over the poses and the structure together.
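A minimal sketch of how such a torso-hand factor can be expressed with GTSAM's Python bindings. The nominal offset and the sigmas (rotation components first, then translation) are illustrative values, not our tuned settings:

```python
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
torso_key = gtsam.symbol('t', 0)   # torso camera pose at time 0
hand_key = gtsam.symbol('h', 0)    # hand camera pose at time 0

# Soft constraint: the hand pose stays near a nominal offset from the torso.
# Loose sigmas leave the wrist free to move within some distance of the torso.
nominal_offset = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(0.3, -0.4, 0.0))
noise = gtsam.noiseModel.Diagonal.Sigmas(
    np.array([1.0, 1.0, 1.0, 0.3, 0.3, 0.3]))
graph.add(gtsam.BetweenFactorPose3(torso_key, hand_key, nominal_offset, noise))
```

This factor is added to the same graph as the projection factors from Section 3.1, so the body poses and the structure are optimized jointly.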
Figure 1. Nam capturing right-palm motion using a Kinect
                        CPL Dataset    TUM RGB-D Dataset
#Cameras Registered     418            117
#Key Frames             24             14
#Points Added           1717           882
#Measurements           6005           3457
Avg. Repr. Error (px)   1.05           0.68
Table 1. Statistics of the reconstruction algorithm on different sequences, showing the number of cameras registered, number of key frames, number of points added, number of measurements, and the average reprojection error.
4. Evaluation
As there is no standard dataset for this line of approach and we do not have access to a modern MoCap system [1, 2], we conduct the experiments in an indoor environment at Georgia Tech. To experiment with the reconstruction pipeline, we record a video of the CPL lab (Sequence 1) and reconstruct it. To analyze the reconstruction algorithm, we first reconstruct only one video; later, all the videos capturing the motion of different body parts are merged and optimized together to capture the motion of the upper body. Figure 1 shows Nam capturing the motion of his right palm using a Kinect inside the CPL lab.
3D Reconstruction. To evaluate the performance of the reconstruction algorithm, we reconstruct one video sequence from the CPL lab videos. Figure 2 shows a screenshot of the reconstructed model: the blue dots represent the tracked frames and the red coordinate frames represent the key frames. Figure 3 shows the camera-landmark matrix representing the correspondences between the reconstructed cameras and landmarks. Each row represents one camera, each column represents one landmark, and the entry at (i, j) is non-zero only if the j-th landmark is seen in the i-th camera. The long horizontal lines in this figure represent new key frames being added and the corresponding new points being triangulated; the dripping effect below these lines is the points being tracked in the remaining frames. Table 1 shows the number of cameras registered and the number of points added.
Figure 2. Screenshot of the reconstructed cameras and points for the CPL dataset
Figure 3. CPL dataset camera-landmark matrix (rows: cameras, columns: landmarks)
We also show reconstruction results on the TUM RGB-D dataset [6]. Figure 4 shows a screenshot of the reconstructed model for the TUM RGB-D dataset, Figure 5 shows the corresponding camera-landmark matrix, and Table 1 again reports the number of cameras registered and the number of points added.
Figure 4. Screenshot of the reconstructed cameras and points for the TUM RGB-D dataset
Figure 5. TUM RGB-D dataset camera-landmark matrix (rows: cameras, columns: landmarks)
We can see from Figures 3 and 5 that the estimated camera poses follow the correct trajectory and that the reconstructed points resemble the approximate structure of a lab. From Table 1, the average reprojection error is less than 4 pixels for both datasets, which is acceptable.
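The camera-landmark matrices in Figures 3 and 5 are simply the visibility pattern of the measurements. The sketch below builds such a matrix from (camera index, landmark index) observation pairs; the observation format is our illustration:

```python
import numpy as np

def camera_landmark_matrix(observations, n_cameras, n_landmarks):
    """Binary visibility matrix: rows are cameras, columns are landmarks."""
    M = np.zeros((n_cameras, n_landmarks), dtype=np.uint8)
    for cam_idx, lm_idx in observations:
        M[cam_idx, lm_idx] = 1   # the j-th landmark is seen in the i-th camera
    return M
```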
Skeleton Estimation. Using the reconstruction algorithm, we estimate the camera motion corresponding to the different body parts. Figure 6 shows a screenshot from each video, corresponding to the left palm, torso, and right palm respectively. Figure 7 shows the estimated camera motion for the left palm, torso, and right palm; the blue dots are the tracked frames and the red axes are the keyframes. We see a sinusoidal motion in the torso as well, even though we did not expect this, since the torso motion is stable compared to that of the left and right palms. The left and right palm motion is much jerkier, causing tracking to fail in many places, which is visible as fewer blue dots and more keyframes. Given the optimized camera poses and structure, we optimize over the skeleton by adding a between factor between the poses, constraining the palm poses to be below and to the left and right of the torso pose. Figure 8 shows the output after merging the different videos. Table 2 shows the statistics of the reconstruction algorithm on the sequences corresponding to the left palm, torso, and right palm: the number of cameras registered, number of key frames, and number of measurements. As the table shows, we registered a good number of cameras with few key frames. As seen in Figure 8, our optimized model is not clean, but it follows our constraint that the torso is above both palms.
Figure 6. Screenshot of each video corresponding to the left palm, torso, and right palm respectively
Figure 7. Estimated camera motion corresponding to the left palm, torso, and right palm respectively
Figure 8. Estimated skeletal motion after optimizing over the left palm, torso, and right palm
                        Left Palm    Torso    Right Palm
#Cameras Registered     412          314      331
#Key Frames             28           20       43
#Measurements           5798         4361     8078
Table 2. Statistics of the reconstruction algorithm on the sequences corresponding to the left palm, torso, and right palm, showing the number of cameras registered, number of key frames, and number of measurements.
5. Discussion
As proposed, we are able to track the camera poses and capture the motion of different body parts. The reconstruction algorithm generally works well on slow video sequences. For jerky or fast motion, however, it loses track and a new key frame has to be added, so depending on how jerky the motion is, the process can take a long time, adding new key frames and re-optimizing over the complete sequence each time. The reconstruction algorithm also depends on how textured the scene is: textureless regions do not provide enough features to track, which can result in pose estimation failures. This may be one of the reasons behind the tracking failures in the left and right palm videos. A camera moving along its principal axis is another issue, causing bad initialization due to very low parallax. To rectify this, we move the camera sideways for some initial frames, or manually select the initial frames if the sideways movement also fails.
For the skeleton optimization, we find that the constraints between the torso and the palms are not effective enough to give a well-optimized result; additional cameras attached to the upper arms could provide better results. As future work, we could use other sensors such as gyroscopes and GPS and fuse them with the estimated poses to improve the results, and perhaps offload some work from the optimization to approach real-time performance.
References
[1] Qualisys. http://www.qualisys.com/.
[2] Vicon. http://www.vicon.com/.
[3] K. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In CVPR, 2003.
[4] N. Hasler, B. Rosenhahn, T. Thormahlen, M. Wand, J. Gall, and H.-P. Seidel. Markerless motion capture with unsynchronized moving cameras. In CVPR, 2009.
[5] T. Shiratori, H. S. Park, L. Sigal, Y. Sheikh, and J. K. Hodgins. Motion capture from body-mounted cameras. ACM Transactions on Graphics, 30(4), 2011.
[6] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
[7] G. Welch and E. Foxlin. Motion tracking: no silver bullet, but a respectable arsenal. IEEE Computer Graphics and Applications, 22(6):24-38, Nov.-Dec. 2002.