Motion Capture using Body Mounted Cameras in an Unknown Environment

Nam Vo, Taeyoung Kim, Siddharth Choudhary

1. The Problem

Motion capture has recently been used to provide much of the character motion in several theatrical releases. It generally requires the recording to be done in an indoor environment with a controlled light setting. This prevents directors from capturing motion in a natural setting or a large environment. In this paper we propose a method to solve this problem of motion capture in an unknown and unconstrained environment by using body-mounted cameras.

2. Related work

Motion capture technology has been studied for a long time [7]. The technology is also available as commercial products from Vicon and Qualisys [1, 2]. Most of these approaches are classified as outside-in, as they require sensors mounted in the environment. In contrast, our approach is an inside-out approach and uses cameras mounted on the body to recover the motion. Markerless motion capture techniques have been developed by a number of researchers [3, 4]. More recently, Hasler et al. proposed a markerless motion capture method that uses structure from motion to register the cameras with respect to the background, and conventional motion capture to estimate the motion of the articulated object [4]. Our work is most closely related to that of Shiratori et al., which also captures motion using body-mounted cameras [5]. However, they require a known environment, reconstructed from reference images prior to running their algorithm; we have no such requirement. Our approach is also fundamentally different from most of these approaches in that we propose an inside-out system.

3. Approach

Unlike [5], our hardware setup is simple: we use a set of three cheap webcams attached to the torso and hands. All webcams are connected to a single computer to record videos, as sketched below. Our approach, however, works with any number of cameras as long as synchronized videos are given as input. The system consists of two main parts: 3D reconstruction and skeleton estimation.
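This recording setup can be approximated with commodity hardware. Below is a minimal sketch using OpenCV; the device indices, resolution, frame rate and output naming are illustrative assumptions, not our exact configuration. Grabbing a frame on every device before decoding keeps the streams roughly synchronized.

```python
# Minimal sketch: record roughly synchronized video from several webcams.
# Device indices, resolution and codec are assumptions for illustration.
import cv2

CAMERA_IDS = [0, 1, 2]  # e.g. torso, left palm, right palm (assumed order)

caps = [cv2.VideoCapture(i) for i in CAMERA_IDS]
for cap in caps:
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

writers = [
    cv2.VideoWriter(f"cam{i}.avi", cv2.VideoWriter_fourcc(*"MJPG"),
                    30.0, (640, 480))
    for i in CAMERA_IDS
]

try:
    while True:
        # grab() latches a frame on every device first, so the frames are
        # captured as close together in time as possible; retrieve() then
        # does the (slower) decoding.
        if not all(cap.grab() for cap in caps):
            break
        for cap, writer in zip(caps, writers):
            ok, frame = cap.retrieve()
            if ok:
                writer.write(frame)
except KeyboardInterrupt:
    pass
finally:
    for cap in caps:
        cap.release()
    for writer in writers:
        writer.release()
```

Hardware triggering or timestamp alignment would give tighter synchronization than this grab/retrieve loop, but the sketch captures the idea of a single computer recording all streams together.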
3.1. 3D Reconstruction

For the 3D reconstruction step, we apply the traditional structure-from-motion technique to build a sparse model of the environment. At the same time, each camera's pose is also recovered and provides an initial guess for the corresponding body part's position.

Initialize Model: First of all, we initialize the model using a few images. This is done off-line on some frames from all cameras. First, SIFT features are extracted, and correspondences between pairs of images that have a significant number of matches are estimated using RANSAC. Next, we triangulate the locations of the matched feature points in 3D. The set of points and cameras is optimized using GTSAM. Triangulated points whose reprojection error is greater than a threshold are removed from the set. As a result, we get an initial model consisting of cameras and points.

Feature Tracking: Given a set of reconstructed images and points, we track the features in the rest of the frames using the Kanade-Lucas-Tomasi (KLT) feature tracker. Using the KLT tracker, we create a mapping between already reconstructed points and the features found in new frames. The factors between new frames and the existing points are added to the factor graph, which is then used to optimize each new frame's pose with respect to the world. We do not triangulate any new points in the tracking stage. If tracking fails due to a lack of correspondences between existing 3D points and 2D features, we reconstruct another frame using incremental reconstruction.

Incremental Reconstruction: Whenever tracking fails, we match the current frame to the previous five frames and grow the 3D model by incrementally adding points and images using their correspondences to the existing model. The model is again optimized using GTSAM over the newly added points and images and all the previously tracked frames, to minimize the error introduced by tracking. As a result, we have new 3D points and camera poses, which are used to track the rest of the frames until tracking fails again. As the system progresses, the drift error of the individual cameras would make the skeleton more and more inconsistent, so the final step in the pipeline is a global optimization that solves for the optimal skeleton. Algorithm 1 gives the complete algorithm; sketches of the initialization and tracking steps follow the listing.

Algorithm 1 3D Reconstruction
  Initialize model
  repeat
    Track features with respect to the model
    if tracking fails then
      Do incremental reconstruction and optimize
    end if
  until end of the sequence
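The Initialize Model step is, in essence, a small bundle adjustment. The following sketch shows how it might look with GTSAM's Python bindings (4.x), followed by the reprojection-error filtering; the intrinsics, the noise sigmas and the layout of `measurements` are illustrative assumptions, not our exact parameters.

```python
# Sketch of model initialization: bundle adjustment over a few frames with
# GTSAM, then removal of points whose reprojection error is above a
# threshold. Intrinsics and noise values are assumed for illustration.
import numpy as np
import gtsam
from gtsam.symbol_shorthand import X, L  # X(i): camera pose, L(j): 3D point

K = gtsam.Cal3_S2(500.0, 500.0, 0.0, 320.0, 240.0)    # assumed intrinsics
pix_noise = gtsam.noiseModel.Isotropic.Sigma(2, 1.0)  # ~1 px image noise

def build_and_optimize(measurements, init_poses, init_points, thresh=2.0):
    """measurements: list of (cam_idx, pt_idx, (u, v)) from SIFT + RANSAC."""
    graph = gtsam.NonlinearFactorGraph()
    for i, j, (u, v) in measurements:
        graph.add(gtsam.GenericProjectionFactorCal3_S2(
            gtsam.Point2(u, v), pix_noise, X(i), L(j), K))
    # Fix the gauge: anchor the first pose and one point.
    graph.add(gtsam.PriorFactorPose3(
        X(0), init_poses[0], gtsam.noiseModel.Isotropic.Sigma(6, 0.1)))
    graph.add(gtsam.PriorFactorPoint3(
        L(0), init_points[0], gtsam.noiseModel.Isotropic.Sigma(3, 0.1)))

    initial = gtsam.Values()
    for i, pose in enumerate(init_poses):
        initial.insert(X(i), pose)
    for j, point in enumerate(init_points):
        initial.insert(L(j), point)
    result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()

    # Keep only points whose reprojection error is below the threshold.
    keep = set()
    for i, j, (u, v) in measurements:
        cam = gtsam.PinholeCameraCal3_S2(result.atPose3(X(i)), K)
        err = np.linalg.norm(cam.project(result.atPoint3(L(j))) - (u, v))
        if err <= thresh:
            keep.add(j)
    return result, keep
```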
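For the tracking step of Algorithm 1, a pyramidal Lucas-Kanade tracker such as OpenCV's `calcOpticalFlowPyrLK` can propagate the 2D observations of already reconstructed landmarks into each new frame. The sketch below is one plausible version; the `min_tracks` failure threshold and the window and pyramid settings are assumptions.

```python
# Sketch of the KLT feature-tracking step: propagate the pixel locations
# of reconstructed landmarks from the previous frame into the next one.
import cv2

def track_klt(prev_gray, next_gray, prev_pts, landmark_ids, min_tracks=30):
    """prev_pts: (N, 1, 2) float32 pixel locations, one per landmark id."""
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    ok = status.ravel() == 1
    tracked_pts = next_pts[ok]
    tracked_ids = [lid for lid, good in zip(landmark_ids, ok) if good]
    if len(tracked_ids) < min_tracks:
        # Too few 3D-2D correspondences survive: tracking has failed, and
        # the pipeline falls back to incremental reconstruction.
        return None
    return tracked_pts, tracked_ids
```

The surviving 3D-2D correspondences become projection factors on the new frame's pose, which is then optimized against the fixed structure.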
3.2. Skeleton Estimation

In addition, we make use of the fact that there are constraints between the cameras; for example, the relative pose between cameras attached to the torso is unlikely to change over time, so it can be calibrated in advance. The cameras attached to the torso and the palms are constrained by introducing a distance constraint between them. Since we have only three cameras, we cannot model the additional degrees of freedom available in human motion. Beyond this, we also restrict the human motion to walking only. For generic activities we do not place any constraint on the palm motion: a palm can reach anywhere within some distance of the torso. In the implementation, we create a factor between the poses of the torso and each hand, and optimize over the poses and the structure together, as in the sketch below.
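A minimal sketch of such a factor, assuming GTSAM's Python bindings: a loose BetweenFactorPose3 ties each palm pose to the torso pose at the same time step. The nominal 40 cm offset and the noise sigmas are illustrative assumptions; a large rotational sigma leaves the arm orientation nearly free, while the tighter translational sigma bounds the palm-to-torso distance.

```python
# Sketch of the skeleton constraint: a soft relative-pose factor between
# the torso camera and a palm camera at each time step. Offset and noise
# values are assumptions for illustration.
import numpy as np
import gtsam
from gtsam.symbol_shorthand import P, T  # T(t): torso pose, P(t): palm pose

def add_skeleton_factors(graph, num_frames):
    # Assumed nominal offset: palm hanging ~40 cm below the torso camera.
    torso_T_palm = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(0.0, -0.4, 0.0))
    # Sigmas ordered [rotation (rad); translation (m)]: rotation is barely
    # constrained, translation keeps the palm within reach of the torso.
    noise = gtsam.noiseModel.Diagonal.Sigmas(
        np.array([1.0, 1.0, 1.0, 0.3, 0.3, 0.3]))
    for t in range(num_frames):
        graph.add(gtsam.BetweenFactorPose3(T(t), P(t), torso_T_palm, noise))
```

These factors are added to the same graph as the projection factors, so the skeleton constraint and the structure are optimized jointly.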
4. Evaluation

As there is no standard dataset for this line of approach and we do not have access to a modern MoCap system [1, 2], we conduct our experiments in an indoor environment at Georgia Tech. To experiment with the reconstruction pipeline, we create a CPL lab video (Sequence 1) and reconstruct it. To analyze the reconstruction algorithm, we first try reconstructing only one video. Later, all the videos capturing the motion of different body parts are merged together and optimized to capture the motion of the upper body. Figure 1 shows Nam capturing the motion of his right palm using a Kinect inside the CPL lab.

Figure 1. Nam capturing right-palm motion using Kinect

3D Reconstruction. In order to evaluate the performance of the reconstruction algorithm, we reconstruct one video sequence from the CPL lab videos. Figure 2 shows a screenshot of the reconstructed model; the blue dots represent the tracked frames and the red coordinate frames represent the key frames. Figure 3 shows the camera-landmark matrix representing the correspondences between the reconstructed cameras and landmarks. Each row represents one camera, each column represents one landmark, and the value at (i, j) is non-zero only if the j-th landmark is seen in the i-th camera. The long horizontal lines in this figure represent new key frames being added and the corresponding new points being triangulated; the dripping effect below these lines corresponds to those points being tracked in the rest of the cameras. Table 1 shows the number of cameras registered and the number of points added. We also show reconstruction results on the TUM RGB-D dataset [6]: Figure 4 shows a screenshot of the reconstructed model, Figure 5 shows the camera-landmark matrix, and Table 1 again lists the number of cameras registered and the number of points added. We can see from Figures 2 and 4 that the estimated camera poses follow the correct trajectory and the reconstructed points resemble the approximate structure of a lab. From Table 1 it can be seen that the average re-projection error is about one pixel or less for both datasets, which indicates an accurate reconstruction.

Figure 2. Screenshot of the reconstructed cameras and points for the CPL dataset
Figure 3. CPL dataset camera-landmark matrix (rows: cameras, columns: landmarks)
Figure 4. Screenshot of the reconstructed cameras and points for the TUM RGB-D dataset
Figure 5. TUM RGB-D dataset camera-landmark matrix (rows: cameras, columns: landmarks)

Dataset      #Cameras Registered   #Key Frames   #Points Added   #Measurements   Avg. Repr. Error
CPL          418                   24            1717            6005            1.05
TUM RGB-D    117                   14            882             3457            0.68

Table 1. Statistics of the reconstruction algorithm on different sequences, showing the number of cameras registered, the number of key frames, the number of points added, the number of measurements and the average reprojection error

Skeleton Estimation. Using the reconstruction algorithm, we estimate the camera motion corresponding to the different body parts. Figure 6 shows a screenshot of each video, corresponding to the left palm, torso and right palm respectively. Figure 7 shows the estimated camera motion corresponding to the left palm, torso and right palm; the blue dots are the tracked frames and the red axes are the keyframes. We see a sinusoidal motion even in the torso, which we did not expect, since the torso moves more steadily than the left and right palms. The left and right palm motion is much more jerky, so tracking is lost in many places, which is visible as fewer blue dots and more keyframes. Given the optimized camera poses and the structure, we optimize over the skeleton by adding a between factor between the poses, constraining the palm poses to be below, and to the left and right of, the torso pose. Figure 8 shows the output after merging the different videos. Table 2 gives the statistics of the reconstruction algorithm on the sequences corresponding to the left palm, torso and right palm: the number of cameras registered, the number of key frames and the number of measurements. As can be seen from this table, we registered a good number of cameras with few key frames. As seen in Figure 8, our optimized model is not clean, but it follows our constraint that the torso is above both palms.

Figure 6. Screenshot of each video, corresponding to the left palm, torso and right palm respectively
Figure 7. Estimated camera motion corresponding to the left palm, torso and right palm respectively
Figure 8. Estimated skeletal motion after optimizing over the left palm, torso and right palm respectively

Sequence     #Cameras Registered   #Key Frames   #Measurements
Left Palm    412                   28            5798
Torso        314                   20            4361
Right Palm   331                   43            8078

Table 2. Statistics of the reconstruction algorithm on the sequences corresponding to the left palm, torso and right palm, showing the number of cameras registered, the number of key frames and the number of measurements

5. Discussion

As proposed, we are able to track camera poses and capture the motion of different body parts. The reconstruction algorithm generally works well on slow video sequences. For jerky or fast motion, however, it loses track and a new key frame has to be added; depending on how jerky the motion is, this can take a lot of time, since every new key frame requires optimizing over the complete sequence. The reconstruction algorithm also depends on how textured the region is: in textureless regions there are not enough features to track, which can result in pose estimation failures. This may be one of the reasons behind the tracking failures in the left and right palm videos. A camera moving along its principal axis is another issue, as it causes bad initialization due to very low parallax. To rectify this, we move the camera sideways for some initial frames, or manually select the initial frames if the sideways movement fails too.

For skeleton optimization, we see that the constraints between the torso and the two palms are not effective enough to give a well-optimized result. More cameras, attached to the upper arms as well, could provide better results. As future work, we could use other sensors such as gyroscopes and GPS and fuse them with the estimated poses to obtain better results, and perhaps offload some of the optimization to reach near real-time performance.

References

[1] Qualisys. http://www.qualisys.com/.
[2] Vicon. http://www.vicon.com/.
[3] K. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In CVPR, 2003.
[4] N. Hasler, B. Rosenhahn, T. Thormahlen, M. Wand, J. Gall, and H.-P. Seidel. Markerless motion capture with unsynchronized moving cameras. In CVPR, 2009.
[5] T. Shiratori, H. S. Park, L. Sigal, Y. Sheikh, and J. K. Hodgins. Motion capture from body-mounted cameras. ACM Transactions on Graphics, 30(4), 2011.
[6] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
[7] G. Welch and E. Foxlin. Motion tracking: no silver bullet, but a respectable arsenal. IEEE Computer Graphics and Applications, 22(6):24-38, Nov.-Dec. 2002.