Interactive marker-less tracking of human limbs

Srinivasa G. Rao and Larry F. Hodges, Member, IEEE

Abstract— One of the hard problems that combines research in Computer Graphics and Computer Vision is tracking human limbs in real time without using any markers. This problem is known to be notoriously difficult and ill-posed. In this paper we present a technique to track human limbs at near interactive rates without markers. This work builds on, modifies, and adds to the state-of-the-art tracking techniques developed by previous researchers. We use a particle filtering tracking algorithm that operates on 3d data derived from a modified visual hull algorithm. The particle filtering algorithm runs on a GPU, and hence we are able to track human limbs at 4 Hz with reasonable accuracy. This method has the potential to be applied to tracking the whole human body without markers.

Index Terms— I.3.7.a Animation, I.4.8 Scene Analysis, I.4.8.n Tracking, I.3.7 Three Dimensional Graphics and Realism.

Srinivasa G. Rao is a PhD student in the Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC 28213. E-mail: srao3@uncc.edu. Larry F. Hodges is professor and chair of the Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC 28213. E-mail: lfhodges@uncc.edu.

1 INTRODUCTION AND MOTIVATION

The applications of Marker-less Limb Tracking in Real Time (abbreviated as MLTRT from now on) are many. First, if we can track a person's limb poses over time, we can construct a free view-point video by rendering his 3d model from novel camera positions. The viewer watching this video can alter his view-point at will. Previously, [1], [2], and [3] have achieved this and have synthesized video from novel view-points, but only with offline processing. Second, we can render the tracked position of the user's limbs in VR environments without the use of cumbersome marker-based systems. [21], [22] have shown that if a user in a VR environment can see his limbs, the feeling of immersion is increased. Third, MLTRT will be very useful for new types of user interfaces. We can use human actions such as pointing or waving instead of pointing devices such as mice or joysticks to interact with our computers. Fourth, MLTRT can be used for achieving telepresence and telecollaboration in combined virtual and real environments. Gross et al. [14] achieve this by constructing 3d video consisting of video fragments (3d point samples derived from a visual hull) and streaming it across the network. MLTRT would enable us to stream only the joint angle values over the network, thereby reducing the amount of bandwidth required to drive such systems.

Section 2 describes our ideas, and the remaining sections show how we implemented them. In Section 3 we describe previous work on human motion tracking in both the Computer Vision and Computer Graphics literature. Section 4 gives an overview of our work. Section 5 goes into the details of particle filtering. Section 6 describes the implementation of the particle filtering algorithm on a GPU and its tracking speed and accuracy. In Section 7 we describe how we obtain the necessary 3d data from a visual hull. Section 8 describes our attempt to do marker-less limb tracking in PCA (Eigen) space.
Section 9 describes the real-time implementation of the system. Finally, we conclude and describe our future work.

2 CONTRIBUTIONS

Our ideas and technical contributions can be summarized as follows. (a) Even though vision and computer graphics researchers have used camera images as input for tracking algorithms, only the silhouette information is used for estimating motion parameters [1]. Some researchers have used 3d data [6], [7], but have not used a robust tracking framework such as particle filtering. Our major contribution is to integrate current particle filtering techniques, found primarily in the computer vision literature, with three-dimensional visual hull techniques from computer graphics. (b) Previous techniques for marker-less limb tracking (MLT) have required offline processing and hence cannot track in real time as data is acquired. We implement a parallelizable particle filtering algorithm, with 3d data as input, on a GPU; this enables us to track human limbs at near interactive rates.

3 PREVIOUS WORK

This section starts by summarizing the work done on MLT in the computer vision literature. We then summarize similar research done in the computer graphics literature. MLTRT is challenging due to the large number of degrees of freedom (DOFs) of the articulated human body, the lack of explicit depth information in 2D images, occlusion, and the complexity of human dynamics and kinematics. There has been extensive research in the field of computer vision on MLT, but not at real-time rates (MLTRT). Deutscher et al. [9], [10] have used particle filters to do MLT. Particle filtering provides a robust Bayesian framework for human motion capture. A kinematic body model is used to "search for" the optimal body pose (a vector of angles representing the joint angles of the human body) such that it minimizes the error between the kinematic model's projection on an image plane and body silhouettes obtained by doing background subtraction on camera images. We refer the interested reader to [9], [10] for a more detailed description. This is the basic framework that we have modified, and we describe it in detail in Section 5. Sigal et al. [11] pose the problem of 3D human limb tracking as one of inference in a graphical model. Their model of the body is a collection of loosely-connected limbs. Conditional probabilities relating the 3D poses of connected limbs are learned from motion-captured training data. Human pose and motion estimation is then solved with non-parametric belief propagation, using a variation of particle filtering that can be applied over a general loopy graph. Gavrila and Davis [12] use a decomposition approach and a best-first technique to search through the high-dimensional pose parameter space for the pose of the articulated model that best matches the image data. Both ellipsoids [7] and superquadrics [12] have been used in kinematic models. Bregler and Malik [13] have used the product of exponential maps and twist motions to track a walking person in a video captured from his side view. Cheung et al. [8] use an algorithm that iteratively assigns silhouette points to each articulated part of the kinematic model and estimates the motion of each individual part using the segmented silhouette. Luck and Small [6] create a force field exerted by the voxels in a volumetric data set to fit a kinematic skeleton at near interactive frame rates.
Computer Graphics researchers have gone a step further and generated free view-point and virtual videos from novel views after estimating motion parameters. [15], [16] compute a visual hull, a coarse representation of the human shape, from a set of input images acquired by cameras at known positions. The visual hull can then be rendered from novel viewpoints. Vedula et al. [17] have created models of moving human actors using multi-view video, voxel-based reconstruction, and space-time interpolation along the 3D scene flow. Carranza et al. [1] have used sophisticated human models for offline parameter estimation and then rendered free view-point video from novel view-points. Gross et al. [14] have used 3d point samples derived from a visual hull for telepresence in a CAVE-like environment. However, a visual hull and its derivatives are a very coarse approximation for high quality video. Recently, [4], [5] have demonstrated that, with the use of many cameras and high bandwidth and/or offline computation, it is possible to generate virtual videos and autostereoscopic 3d displays without estimating any motion parameters. However, in such systems, not only is the computational and/or bandwidth cost very high, but the novel views generated are also restricted to lie close to, and in between, the views captured by the cameras.

4 OVERVIEW OF OUR WORK

Our work aims to generate a high quality, live-action virtual video at near interactive rates. To accomplish this, we track the joint angles of human motion using a simple kinematic model. Using the joint angle data estimated by the tracking algorithm, we can animate a high resolution, high quality, preprocessed 3d human model from novel view-points. We achieve this by implementing a particle filtering tracking algorithm that runs on a Graphics Processing Unit (GPU) and uses 3d data derived from a visual hull. MLT, let alone MLTRT, is known to be a notoriously difficult and ill-posed problem. In this work, we have decided to solve a much simpler problem first: track only human limbs, without any markers on the limbs. Fig. 1 shows the architecture of our system.

Fig. 1. The system architecture.

5 PARTICLE FILTERING ALGORITHM (PFA)

In this section, we describe the particle filtering techniques we use for tracking. Our algorithm is a modification of the techniques originally presented in [9], [10]. The term "particle" should not be confused with the particles used in physical simulation of fire, smoke, or water. Here, a "particle" is a vector whose dimension equals the number of degrees of freedom (say d) of the limbs of an articulated model. It can be thought of as a point in d-dimensional space. We first simulated our entire approach and studied the nature of the MLTRT problem using the CMU Mocap data and viewing software [18]. To determine the accuracy of our tracking approach, we compared the synthetic 3d data points rendered by the Mocap software from the motion-captured data (Fig. 2b, yellow points) to the pose determined by our tracking algorithm. Our kinematic model (Fig. 2a) consisted of ellipsoidal limbs with the same bone structure and 10 DOF: three rotation values for each thigh, one rotation value (about the X axis, shown as the red line) for each leg, and one rotation value (about the X axis) for each foot. Hence, in our case, each particle is a vector of floating point values with 10 components for tracking legs, and a 4-component vector (3 DOF for the shoulder and 1 DOF for the arm) for tracking arms.
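Concretely, a particle in this scheme is nothing more than a fixed-length vector of joint angles with an associated weight. The following minimal C++ sketch of that representation is ours, not the authors' code; the names are illustrative only.

    #include <array>
    #include <vector>

    // One particle = one candidate pose of the articulated model.
    // d = 10 for the leg model described above, d = 4 for the arm model.
    constexpr int NUM_DOF = 10;

    struct Particle {
        std::array<float, NUM_DOF> angles{};  // one joint angle per DOF
        float weight = 0.0f;                  // fraction of 3d points inside the model, in [0, 1]
    };

    // The PFA simply maintains N such particles per annealing layer.
    using ParticleSet = std::vector<Particle>;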
Fig. 2. Clockwise starting from bottom left: (a) kinematic model, (b) synthetic 3d data, (c), (d) particles having an imperfect match, (e) particle having a perfect match with the 3d data, and (f) a typical MLTRT weight function plot, the green line indicating the ground truth, i.e., the particle where the weight function has its highest peak.

5.1 Multimodal property of the PFA weight function

A particle P (a vector of dimension d) describes a pose of the kinematic model. A measure of how well this particle matches the actual 3d data is the fraction of 3d points that are inside the ellipsoids of the model for that particular pose. This is called the weight, w, of the particle. Weights lie in [0, 1]. For example, the particle in Fig. 2c has a smaller weight than the particle in Fig. 2d. The particle in Fig. 2e has a weight of 1, since it matches the 3d data perfectly; this is the particle/pose that we want to search for. However, even a non-perfect match like the one shown in Fig. 2d can have a weight close to 1. This is because the ellipsoids representing the right foot and the lower part of the right leg of the kinematic model contain some 3d points in the ground truth that actually belong to the middle part of the right leg. Because of this, the weight function in MLTRT is multimodal, i.e., it has many peaks, which in turn means we cannot use gradient descent methods or their variants to search for the peak of such a function. Say we have only 4 DOF (rotation about the X axis for the thighs and legs), and suppose we know the maximum and minimum angles of rotation for each DOF. We want to find where the highest peak lies. If we take 10 values between the maximum and minimum values for each DOF, we end up with 10^4 particles whose weights we have to evaluate to find the peak of the function. Hence, the computation grows exponentially with the number of DOFs. Fig. 2f shows a plot (particle number along the X axis and weight along the Y axis) of the weights of 10^4 particles; only the middle 6000 particles are shown for brevity. This illustrates the multimodal nature of the MLTRT weight function. The green line represents the particle in Fig. 2(e), which has the pose that is closest to the ground truth.
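To make the cost of exhaustive search concrete, the sketch below enumerates the 10^4-pose grid just described for a 4-DOF model and keeps the highest-weight pose. It is a toy illustration under our own naming; weight() stands for the particle weight function of Section 6.1 and is passed in by the caller.

    #include <array>
    #include <functional>

    using Pose4 = std::array<float, 4>;   // 4-DOF pose: X-axis rotations of thighs and legs

    // Evaluating 10 samples per DOF over 4 DOFs costs 10^4 weight evaluations;
    // with d DOFs the cost is 10^d, i.e. exponential in the number of DOFs.
    Pose4 gridSearch(const Pose4& minAng, const Pose4& maxAng,
                     const std::function<float(const Pose4&)>& weight) {
        const int S = 10;                 // samples per DOF
        Pose4 best{}, pose{};
        float bestW = -1.0f;
        for (int i0 = 0; i0 < S; ++i0)
        for (int i1 = 0; i1 < S; ++i1)
        for (int i2 = 0; i2 < S; ++i2)
        for (int i3 = 0; i3 < S; ++i3) {
            const int idx[4] = {i0, i1, i2, i3};
            for (int d = 0; d < 4; ++d)   // map each grid index to an angle in [min, max]
                pose[d] = minAng[d] + (maxAng[d] - minAng[d]) * idx[d] / float(S - 1);
            const float w = weight(pose);
            if (w > bestW) { bestW = w; best = pose; }
        }
        return best;                      // sampled pose with the highest weight
    }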
5.2 Details of the PFA

A Particle Filtering Algorithm (PFA for brevity) may be used to search for a peak of such a multimodal function. Fig. 3 demonstrates how the PFA works. We have to track the 3d data in each frame, which means we have to run the PFA every frame. Each run of the PFA has an annealing stack with several layers. The main idea is based on simulated annealing [23]. At the topmost layer (say M-1), the weight function described in the previous section is smoothed so that the search does not get "distracted" by smaller peaks. As the PFA progresses from the topmost layer to the bottommost layer, the degree of smoothing of the weight function is decreased, such that at layer 0 we use the original, unsmoothed weight function. At the beginning of the PFA, a particular number (say N) of random particles (poses, or vectors of d dimensions each) are generated. Their weights are computed depending on how closely they match the actual 3d data, as described above (for a more detailed description see Section 6). Then a fraction of the particles (say half of them) that have the higher weights is selected. Gaussian noise (also a vector) with zero mean and a particular variance (a diagonal covariance matrix; correlations between the noise in different dimensions are assumed to be zero) is generated and added to the selected weighted particles to generate the unweighted particles for the next layer. This process of computing the weights of the particles and "propagating" only a heavier fraction of them to the next layer is repeated M times. Thus, the particles/poses that match the given 3d data more closely "live" and the rest "die out" as the PFA progresses. The PFA is very robust for tracking because some of the particles with a low initial weight "live long enough" and sometimes turn out to be a very close match at the end.

We learn the variance for each DOF from the motion-captured data and use those values while running the PFA. A histogram of the absolute difference between consecutive frame values for each DOF is generated. A maximum change in that DOF is taken from each histogram after discounting for noise. The standard deviation of the noise for that DOF is set to one third of this maximum value at the topmost layer of the PFA. As the PFA progresses from the topmost to the bottommost layer, noise of smaller and smaller variance is added to the heavier fraction of weighted particles selected for generating the unweighted particles of the next layer, and the weight function becomes closer and closer to the actual, unsmoothed weight function. This results in many particles that are near the desired peak of the multimodal MLTRT weight function, as shown in Fig. 3. After we reach the bottom layer (layer 0), an average of the selected particles (weighted by their weights) is computed to determine the particle that is very close to the actual 3d data: the particle where the highest peak of the multimodal weight function occurs, i.e., the particle/pose that we want to search for. Fig. 3 shows the PFA in action with the number of particles N = 4, a selection fraction of 0.5 (i.e., the 2 heavier particles are selected), and the number of layers M = 3.

Fig. 3. The annealing stack of the PFA for an arbitrary frame, Frame K; the green line gives the ground truth. See Sections 5.1 and 5.2 for a description of the process.
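For readers who prefer code, here is a compact CPU-side sketch of one run of the annealed PFA as we read Section 5.2: N particles, M layers, the heavier half kept at every layer, Gaussian diffusion whose per-DOF standard deviation shrinks toward layer 0, and a weighted average of the survivors at the bottom. The per-layer weight smoothing is omitted, the linear noise schedule is our simplification, and all names are ours, not the authors'.

    #include <algorithm>
    #include <array>
    #include <functional>
    #include <random>
    #include <vector>

    constexpr int NUM_DOF = 10;
    using Pose = std::array<float, NUM_DOF>;

    // One annealing run of the PFA for a single frame.
    // weight(pose) returns the fraction of 3d points inside the model, in [0, 1].
    // sigmaTop[d] is the per-DOF noise std at the topmost layer (learned from mocap).
    Pose runPFA(std::vector<Pose> particles,                       // N initial particles
                const std::function<float(const Pose&)>& weight,
                const Pose& sigmaTop, int layers /* M */, std::mt19937& rng) {
        const size_t N = particles.size();
        std::vector<float> w(N);
        for (int layer = layers - 1; layer >= 0; --layer) {
            // 1. Weight every particle against the current 3d data.
            for (size_t i = 0; i < N; ++i) w[i] = weight(particles[i]);
            // 2. Deterministically keep the heavier half (selection fraction 0.5).
            std::vector<size_t> order(N);
            for (size_t i = 0; i < N; ++i) order[i] = i;
            std::sort(order.begin(), order.end(),
                      [&](size_t a, size_t b) { return w[a] > w[b]; });
            order.resize(N / 2);
            if (layer == 0) {
                // 3. At the bottom layer, return the weighted mean of the survivors.
                Pose est{}; float sumW = 0.0f;
                for (size_t i : order) {
                    for (int d = 0; d < NUM_DOF; ++d) est[d] += w[i] * particles[i][d];
                    sumW += w[i];
                }
                for (int d = 0; d < NUM_DOF; ++d) est[d] /= std::max(sumW, 1e-6f);
                return est;
            }
            // 4. Diffuse the survivors with per-DOF Gaussian noise whose std
            //    shrinks as we move toward layer 0 (simple linear schedule here).
            const float scale = float(layer) / float(layers - 1);
            std::vector<Pose> next(N);
            for (size_t i = 0; i < N; ++i) {
                const Pose& parent = particles[order[i % order.size()]];
                for (int d = 0; d < NUM_DOF; ++d) {
                    std::normal_distribution<float> noise(0.0f, sigmaTop[d] * scale);
                    next[i][d] = parent[d] + noise(rng);
                }
            }
            particles = std::move(next);
        }
        return particles.front();   // not reached when layers >= 1
    }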
5.3 Technical Contributions

We have modified the PFA described in [9] and [10] in the following ways.

Fig. 4. Clockwise from top left: 2d image from (a) the camera in front, (b) the camera on the left, (c) the camera behind, (d) the camera on the right, (e) the view from a camera at 45 degrees, and (f) the result of using 3d data.

(a) Deutscher et al. [9], [10] use 2d images to compute the weight function. The contours of an actor are obtained by doing background subtraction on a set of camera images. They then project the kinematic model onto the camera images and compute the weight of a particle depending on how closely the projection matches the contours. We have found that the use of 2d images causes many problems during the PFA. An example is shown in Fig. 4, where green is the kinematic model and red is the ground truth: (a), (b), (c) and (d) show four images from four surrounding synthetic cameras, placed at 90-degree separations (front, left, back and right) around the synthetic kinematic model. Notice that, even though the reprojection error (the number of green pixels not overlapping the red foreground contour) is not very high, the position picked by the PFA is far from the ground truth: the left leg of the kinematic model is matched up with the right leg of the ground truth and vice versa, as seen in Fig. 4(e). When we use 3d data for tracking, as shown in Fig. 4(f), this problem does not occur. In addition, using 3d data for weight computation is more efficient for GPU implementation, as we show in Section 6.

Fig. 5. Clockwise from top left: (a) the distribution is only approximately uniform; hence (b) particles with lower weights are sometimes Selected more often for Generating new particles (# SG) than ones with higher weights, which is avoided by the deterministic algorithm (c). (d) Use of the crossover operator results in particles having more variance in some DOFs.

(b) Deutscher et al. [9], [10] select the particles with a probability proportional to their weights, with replacement, for generating new particles. Initially, we implemented this approach by dividing the unit interval [0, 1] into N (the number of particles) intervals whose lengths were equal to the normalized particle weights. We then generated a uniform random number and, based on which interval this random number fell into, chose the corresponding particle. However, pseudo-random number generators produce numbers whose distribution is only approximately uniform (or approximately Gaussian, for that matter); see Fig. 5(a). Because of this, particles with lower weights were sometimes chosen more often than ones with heavier weights for generating new particles for the next layer, contrary to what the PFA requires (Fig. 5(b)). Hence, to fix this problem, we now use a deterministic rather than a probabilistic algorithm to do the selection (Fig. 5(c)).

(c) We also found that use of the cross-over operator from Genetic Algorithms [10] actually resulted in particles having more variance during the PFA compared to simulations where we did not use it. The smaller the variance is for the DOFs, the better it is for the PFA. Hence, unlike [10], we do not use the cross-over operator in our implementation. This comparison is shown in Fig. 5(d).
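The paper does not spell out the deterministic selection rule, so the sketch below is only one plausible reading: give each particle a child count proportional to its normalized weight, rounded deterministically (largest remainder), so that a heavier particle can never end up with fewer children than a lighter one. The function name and the rounding scheme are our assumptions.

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    // Deterministic selection: particle i receives a child count proportional to
    // its normalized weight, so a heavier particle is never selected fewer times
    // than a lighter one (the failure mode of the RNG-based scheme above).
    // Returns, for each of the N slots in the next layer, the index of its parent.
    std::vector<size_t> deterministicSelect(const std::vector<float>& weights, size_t N) {
        std::vector<size_t> parents;
        if (weights.empty()) return parents;
        parents.reserve(N);
        const float total = std::accumulate(weights.begin(), weights.end(), 0.0f);
        std::vector<float> remainder(weights.size());
        for (size_t i = 0; i < weights.size(); ++i) {
            const float share = (total > 0.0f) ? weights[i] / total * N : 1.0f;
            const size_t copies = static_cast<size_t>(std::floor(share));
            remainder[i] = share - float(copies);
            for (size_t c = 0; c < copies && parents.size() < N; ++c) parents.push_back(i);
        }
        // Fill the remaining slots by largest fractional remainder.
        std::vector<size_t> order(weights.size());
        std::iota(order.begin(), order.end(), size_t{0});
        std::sort(order.begin(), order.end(),
                  [&](size_t a, size_t b) { return remainder[a] > remainder[b]; });
        for (size_t k = 0; parents.size() < N; k = (k + 1) % order.size())
            parents.push_back(order[k]);
        return parents;
    }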
6 IMPLEMENTATION OF THE PFA ON THE GPU

The PFA executes partly on the CPU and partly on a GPU. To speed up the execution of the PFA, the weight computation, which is parallelizable, is implemented on the GPU. The remainder of the PFA is implemented on the CPU. The part of the PFA that executes on the CPU supplies the appropriate texture data to the GPU during multi-pass rendering, which then executes its part of the PFA.

Fig. 6 shows an articulated model (on the left) with a thigh and a leg each represented by one ellipsoid (solid blue), together with their local coordinate systems. On the right, we have synthetic ground-truth 3d data, color coded according to which limb (ellipsoid) of the articulated model each 3d point belongs to. The 3d data is in the global coordinate system. Yellow means the point belongs to the leg, cyan means it belongs to the thigh, and blue means it does not belong to any part of the kinematic articulated model. Green colored 3d points are not of interest. The articulated kinematic model and the synthetic 3d data have the same "root" position, i.e., the green sphere of the articulated model on the left contains all the red 3d points; they are shown apart here only for clarity. Also shown is the global coordinate system.

Fig. 6. 3d weight calculation.

6.1 3d weight calculation

The weight of a particle (pose) like the one shown in Fig. 6 can be calculated by the following simple algorithm.

(1) Mark all the 3d data points as Belongs to No Ellipsoid (BNE), i.e., set the BNE flags of all 3d points to 1.

(2) For each ellipsoid (having radii xradius, yradius and zradius in the corresponding directions) in the articulated model: transform the ellipsoid's local coordinate system to the global coordinate system. This gives O, the position of the local coordinate system's origin in global coordinates, and X, Y, Z, the unit vectors of the local coordinate system transformed to the global coordinate system. Then, for every 3d point P = (x, y, z) that has its BNE flag set to 1, let D = (P - O); Dx = (X - O).D; Dy = (Y - O).D; Dz = (Z - O).D. If the 3d point is inside the ellipsoid, i.e., (Dx <= xradius) and (Dy <= yradius) and (Dz <= zradius), then set its BNE flag to 0, indicating that the 3d point belongs to one of the ellipsoids.

(3) The weight of the particle is w = (total number of 3d points - number of 3d points with the BNE flag set to 1) / (total number of 3d points).

6.2 Parallelization of the 3d weight calculation

The computational cost of the PFA comes from calculating the weight of each particle, in each layer, for each frame. This process can be easily parallelized. Notice that the BNE flags for different particles are independent of each other. Also notice that, for a single particle, we have to process the BNE flags for the 1st ellipsoid first, and then for the 2nd ellipsoid using only those 3d points whose BNE flag was NOT set to 0 by the 1st ellipsoid, and so on. So, if we feed the same 3d data to every particle, the BNE flags for a particular ellipsoid can be processed for all the particles independently in parallel; this corresponds to one rendering pass. Note that a particular ellipsoid will have a different orientation for different particles. So we process the BNE flags for the 1st ellipsoid of all particles (independently of each other), use the resulting data for the 2nd ellipsoid of all particles, and so on. Hence we have to update the frame buffer a number of times equal to the number of ellipsoids in the articulated model. In this way we have parallelized the weight computation.

We use the OpenGL Shading Language (GLSL) to program the GPU. We use the fragment shader (FS for brevity) on the GPU to do this parallel computation. The vertex shader (VS) is just a pass-through shader. We input the 3d data to the FS as a 512x512 RGB texture. We do multi-pass rendering because we need to use the data from the previous render as input for the next render. Each render corresponds to one ellipsoid of the articulated kinematic model. In each rendering, we render N textured squares, where each square corresponds to one particle. We use a uniform 4x4 matrix variable (GLSL) to feed the ellipsoid's orientation matrix for the corresponding particle to the FS. Each point in the square triggers the FS, which checks whether the corresponding 3d point (read from the texture) has its BNE flag set to 1. If it does, the FS checks whether the point is inside the current ellipsoid. If so, its BNE flag is set to 0 and that fragment's color is set to black during rendering. If not, the fragment is rendered with the color read originally from the texture. At the end of all these renders we are left, for each particle, with only those 3d points that do not belong to ANY ellipsoid, i.e., their BNE flags are still set to 1. Fig. 7 shows two frame buffer read-backs, at the start (left) and near the end (right) of the multi-pass rendering. Notice that some squares have all black pixels (the corresponding particles fit the 3d data very well) and some do not.

We use two Pbuffers for multi-pass rendering. We first render to buffer1; in the next rendering we read from buffer1 and render to buffer2; in the next rendering we read from buffer2 and render to buffer1, and so on (the usual "ping-pong" scheme). We have 6 ellipsoids (2 thighs, 2 legs, 2 feet) in our articulated model for tracking legs and 2 ellipsoids for an arm. We have used N = 81 particles. So we have to do multi-pass rendering 6 (or 2) times for each anneal layer of the PFA, depending on whether we are tracking legs (or arms), followed by a read-back of the frame buffer. We have used M = 3 anneal layers. Hence, to track each frame, we need 18 (or 6) renderings and 3 read-backs of the frame buffer.
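For reference, here is a plain CPU version of the per-particle weight computation of Section 6.1 (the GPU version distributes exactly this work across fragments). Note one interpretation on our part: we use the standard normalized ellipsoid inequality for the inside test, whereas the algorithm as printed compares each local coordinate directly against the corresponding radius. Types and names are ours.

    #include <vector>

    struct Vec3 { float x, y, z; };
    static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

    // One ellipsoid of the articulated model, already posed by the particle:
    // O is its centre and X, Y, Z are its (unit) local axes in global coordinates.
    struct Ellipsoid {
        Vec3 O, X, Y, Z;
        float rx, ry, rz;   // radii along the local axes
    };

    // Weight of one particle = fraction of 3d points that fall inside at least
    // one ellipsoid of the model posed by that particle (Section 6.1).
    float particleWeight(const std::vector<Vec3>& points,
                         const std::vector<Ellipsoid>& model) {
        std::vector<bool> bne(points.size(), true);      // "Belongs to No Ellipsoid" flags
        for (const Ellipsoid& e : model) {
            for (size_t i = 0; i < points.size(); ++i) {
                if (!bne[i]) continue;                    // already claimed by a previous ellipsoid
                const Vec3 d = sub(points[i], e.O);
                const float dx = dot(e.X, d) / e.rx;      // local coordinates, normalized by the radii
                const float dy = dot(e.Y, d) / e.ry;
                const float dz = dot(e.Z, d) / e.rz;
                if (dx*dx + dy*dy + dz*dz <= 1.0f)        // inside the ellipsoid
                    bne[i] = false;
            }
        }
        size_t outside = 0;
        for (bool f : bne) outside += f ? 1 : 0;
        return points.empty() ? 0.0f
                              : float(points.size() - outside) / float(points.size());
    }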
Fig. 7. Two sample frame buffer read-backs.

6.3 Tracking Accuracy and Speed

If we implement the PFA on a Pentium 4 3.06 GHz CPU only, it takes 5 seconds to calculate the pose of the articulated model in each frame. When we use both the CPU and a GeForce FX 5900 GPU, the time to calculate the pose is reduced to 0.25 seconds, a 20x speedup. We also measured the accuracy of the PFA by calculating the Euclidean distance between the ground-truth particle (a vector of d joint angles) and the particle found by the PFA for each frame, averaged over many ground-truth sequences. The average absolute joint angle error was 7.7302 degrees. The maximum, minimum and standard deviation of the error were 19.3103, 0.0 and 4.5398 degrees respectively.

6.4 Comparison with 2d weight calculation

In this section we compare our approach with the approach taken in [9], [10]. Suppose that, instead of deriving 3d data from the camera images, we wanted to use the 2d re-projection error on the image plane for the weight computation. If we use only 6 renderings per anneal layer, to match the number of renderings per anneal layer in our 3d data method, then in each rendering we must process N (= 81) / 6, i.e. about 14, particles. If we use 4 cameras, as done in [9], and render a 512x512 quadrilateral each time during the PFA, then each camera image would occupy (512x512)/(14x4), roughly 4681, pixels. Assuming we are capturing 320x240 images, this means we would have to resize each image to less than 1/16th of its original size. Apart from having fundamental pose calculation problems like the one shown in Fig. 4, this would severely alias the image data, which would adversely affect the performance of the PFA.

7 DERIVING 3D DATA FROM A VISUAL HULL

We use a modified visual hull algorithm to derive 3d points inside the 3d object directly. We refer interested readers to [15] for more details. We use this 3d point data for the PFA and hence achieve MLT of human limbs at near interactive rates. Since we only need the 3d points inside the 3d object, the modified visual hull algorithm has O(n) complexity instead of O(n^2), where n is the number of cameras. Hence, less computation time is spent on computing the 3d data.

Fig. 8. Image contours from 3 cameras, one each along the X, Y and Z axes, and the modified visual hull computation.
Consider three calibrated cameras, one each along the X, Y and Z axes. The algorithm can be described as follows.

(1) Choose grid points on the foreground contour obtained from Camera1 (Fig. 8, red points on the black contour).

(2) For each of these points:
(a) Compute the 3d line (red line) passing through the camera center and the grid point (see Fig. 8).
(b) Compute the corresponding epipolar lines on the image planes of Camera2 and Camera3 (green and blue lines on the respective image planes) and intersect them with the corresponding contours. Using these points of intersection for Camera2, compute the portion of the red 3d line which, when projected onto the image plane of Camera2, belongs to the foreground contour of Camera2, say Cam2_3DLine. Similarly compute Cam3_3DLine.
(c) Compute the part of the red 3d line that is the intersection of Cam2_3DLine and Cam3_3DLine, the black part of the red 3d line in Fig. 8.
(d) Choose 3d points on this line segment. The number of 3d points chosen is directly proportional to the length of the black 3d line (see Fig. 8).

Fig. 9 shows the result of this algorithm at an instant in time. The 3d points inside the 3d object, derived from the contours in the camera images, are shown on the left. A close-up view of the 3d points is shown on the right.

Fig. 9. (Left) 3d point data derived from the modified visual hull computation and (Right) a zoomed-in view.
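Steps (c) and (d) reduce to a one-dimensional interval intersection followed by uniform sampling along the ray. The sketch below assumes the visible portion of the red 3d line has already been recovered for Camera2 and Camera3 as parametric intervals [tmin, tmax] along the ray (step (b)); the spacing constant and all names are ours.

    #include <algorithm>
    #include <vector>

    struct Vec3 { float x, y, z; };
    struct Interval { float tmin, tmax; };   // visible part of the ray, in ray parameter t

    // Step (c): the part of the ray consistent with both cameras is the
    // intersection of the two per-camera intervals.
    inline Interval intersect(const Interval& a, const Interval& b) {
        return { std::max(a.tmin, b.tmin), std::min(a.tmax, b.tmax) };
    }

    // Step (d): sample 3d points on the surviving segment; the number of points
    // grows in proportion to the segment length (one point per `spacing` units).
    std::vector<Vec3> samplePoints(const Vec3& origin, const Vec3& dir,
                                   const Interval& seg, float spacing = 0.02f) {
        std::vector<Vec3> pts;
        if (seg.tmax <= seg.tmin) return pts;             // empty intersection: no 3d points
        const int n = std::max(1, int((seg.tmax - seg.tmin) / spacing));
        for (int i = 0; i <= n; ++i) {
            const float t = seg.tmin + (seg.tmax - seg.tmin) * float(i) / float(n);
            pts.push_back({ origin.x + t * dir.x,
                            origin.y + t * dir.y,
                            origin.z + t * dir.z });
        }
        return pts;
    }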
8 PROBLEMS WITH DOING MLTRT IN PCA SPACE

During this work, we tried another idea. Safonova et al. [19] and Grochow et al. [20] showed that human motion can be simulated and solved for in a lower-dimensional space rather than in the original high-DOF space. We thought that this would help in particle filtering too. If we could run the PFA in a lower-dimensional Eigen (PCA) space, we could use a smaller number of particles and hence reduce the computation time. We computed the covariance matrix and basis vectors for the DOFs of our articulated model in the PCA space from the Mocap database [18]. We postulated that the error due to tracking in PCA space would be acceptable. It turns out, however, that the error in PCA space is not acceptable. Fig. 10 shows two instances where the PFA failed in PCA space. We think this is because Safonova et al. [19] actually solved for motion parameters over a whole sequence by specifying positions for the articulated model at some key frames. This results in the solution of the motion parameters "getting pulled to the right path". Since we do not do any kind of optimization or key-frame position specification, doing the PFA in PCA space actually decreases its robustness. Also, since the position computed for Frame K is used to generate particles for Frame K+1, tracking errors accumulate while doing the PFA in PCA space.

Fig. 10. Two instances showing failure of the PFA in PCA space. For both, tracking in the actual space (left) and tracking in PCA space (right).

9 REAL TIME IMPLEMENTATION OF OUR SYSTEM

We implemented the whole system on a Pentium 4 3.06 GHz processor and a GeForce FX 5900 video card. We used 3 IBOT fire-wire web cameras for hand tracking and leg tracking. GLSL and the GLEW library were used for GPU programming. The calibration toolbox of [24] was used to calibrate our cameras. The DirectX 9.0 library was used to capture camera images and synchronize them in software. The whole program has 7 threads. A main thread creates and controls all other threads. There is one thread for doing only the GPU part of the PFA. There are separate threads for capturing data from each camera. The main thread requests data from all the camera threads and suspends itself until it receives updates from all of them. A separate thread then computes the visual hull and the 3d data, which is fed to the PFA. A separate debug thread displays all the debugging information. We were able to track human limbs with reasonable accuracy at 4 Hz. See Fig. 11, Fig. 12, Fig. 13 and Fig. 14 for some tracking results.

Fig. 11. The first row shows images of a hand at a particular position captured by the three cameras. The second row shows the corresponding foreground segments extracted by background subtraction. The last row shows three virtual views of the articulated model that is used to track the user's hands; tracking and rendering of the virtual views happen in real time at near interactive rates.

Fig. 12. Same as Fig. 11, but for a different pose of the arm.

Fig. 13. The first row shows images of the legs at a particular position captured by the three cameras. The second row shows the corresponding foreground segments extracted by background subtraction. The last row shows three virtual views of the articulated model that is used to track the user's legs; tracking and rendering of the virtual views happen in real time at near interactive rates.

Fig. 14. Same as Fig. 13, but for a different pose of the legs.

10 CONCLUSION AND FUTURE WORK

We have demonstrated how human limbs can be tracked at near interactive rates. We have also demonstrated that using 3d data derived from a visual hull greatly improves the robustness of the particle filtering framework. Implementation of the particle filtering framework on the GPU has enabled us to do MLTRT at 4 Hz with a mean error of 7.7302 degrees. We are currently working on optimizing the algorithm further and increasing the update rate using a faster GPU. We also want to try out full-body MLTRT. We will investigate signal processing issues associated with smoothing and sampling of the MLTRT weight function. We will integrate human dynamics and collision detection between limbs into MLTRT. We will then proceed to using this technique for free view-point video, VR gaming, HCI, and telepresence in combined virtual and real environments.

ACKNOWLEDGMENT

The first author would like to thank his friends Gajendra Singh, Arvind Somya and Madhusudhan Reddy for their help and moral support during this project.

REFERENCES

[1] Carranza, J., Theobalt, C., Magnor, M. A., Seidel, H., "Free-Viewpoint Video of Human Actors". In Proceedings of SIGGRAPH 2003, ACM Press / ACM SIGGRAPH, 2003.
[2] Wuermlin, S., Lamboray, E., Staadt, O., Gross, M., "3D video recorder". In Proceedings of Pacific Graphics 2002, IEEE Computer Society Press, 325–334, 2002.
[3] Matsuyama, T., Akai, T., "Generation, visualization, and editing of 3D video". In Proc. of the 1st International Symposium on 3D Data Processing Visualization and Transmission (3DPVT'02), 234ff, 2002.
[4] Matusik, W., and Pfister, H., "3D TV: A Scalable System for Real-Time Acquisition, Transmission, and Autostereoscopic Display of Dynamic Scenes". In Proceedings of SIGGRAPH 2004, ACM Press / ACM SIGGRAPH, 2004.
[5] Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R., "High-quality Video View Interpolation using a Layered Representation". In Proceedings of SIGGRAPH 2004, ACM Press / ACM SIGGRAPH, 2004.
[6] Luck, J., Small, D., "Real-time markerless motion tracking using linked kinematic chains". In Proc. of CVPRIP02, 2002.
[7] Cheung, K., Kanade, T., Bouguet, J.Y., Holler, M., "A real time system for robust 3D voxel reconstruction of human motions". In Proc. of Computer Vision and Pattern Recognition, vol. 2, pp. 714–720, 2000.
[8] Cheung, G., Baker, S., Kanade, T., "Shape-From-Silhouette of Articulated Objects and its Use for Human Body Kinematics Estimation and Motion Capture". Proc. Conf. Computer Vision and Pattern Recognition, 2003.
[9] Deutscher, J., Blake, A., Reid, I., "Articulated body motion capture by annealed particle filtering". In Proc. Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 1144–1149, 2000.
[10] Deutscher, J., Davison, A.J., Reid, I., "Automatic Partitioning of High Dimensional Search Spaces associated with Articulated Body Motion Capture". Proc. IEEE Conference on Computer Vision and Pattern Recognition, Kauai, 2001.
[11] Sigal, L., Bhatia, S., Roth, S., Black, M.J., Isard, M., "Tracking Loose-limbed People". Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[12] Gavrila, D., and Davis, L., "3D model-based tracking of humans in action: a multi-view approach". Proc. Conf. Computer Vision and Pattern Recognition, pp. 73–80, 1996.
[13] Bregler, C., Malik, J., "Tracking People with Twists and Exponential Maps". Proc. IEEE Computer Vision and Pattern Recognition, 1998.
[14] Gross, M., Würmlin, S., Naef, M., Lamboray, E., Spagno, C., Kunz, A., Koller-Meier, E., Svoboda, T., Van Gool, L., Lang, S., Strehlke, K., Moere, V.A., Staadt, O., "blue-c: A Spatially Immersive Display and 3D Video Portal for Telepresence". Proceedings of ACM SIGGRAPH 2003, pp. 819–827, 2003.
[15] Matusik, W., Buehler, C., and McMillan, L., "Polyhedral Visual Hulls for Real-Time Rendering". In Proceedings of the Eurographics Workshop on Rendering, 2001.
[16] Laurentini, A., "The Visual Hull Concept for Silhouette-Based Image Understanding". IEEE PAMI, 16(2), pp. 150–162, 1994.
[17] Vedula, S., Baker, S., Kanade, T., "Spatio-temporal view interpolation". In Proceedings of the 13th ACM Eurographics Workshop on Rendering, pp. 65–75, 2002.
[18] CMU Graphics Lab Motion Capture Database, http://mocap.cs.cmu.edu/
[19] Safonova, A., Hodgins, J., Pollard, N., "Synthesizing Physically Realistic Human Motion in Low-Dimensional, Behavior-Specific Spaces". In Proceedings of SIGGRAPH 2004, ACM Press / ACM SIGGRAPH, 2004.
[20] Grochow, K., Martin, S.L., Hertzmann, A., Popović, Z., "Style-based Inverse Kinematics". ACM Transactions on Graphics (Proceedings of SIGGRAPH 2004), 2004.
[21] Slater, M. and Usoh, M., "The Influence of a Virtual Body on Presence in Immersive Virtual Environments". VR 93, Virtual Reality International, Proceedings of the Third Annual Conference on Virtual Reality, London, Meckler, 1993, pp. 34–42.
[22] Slater, M. and Usoh, M., "Body Centred Interaction in Immersive Virtual Environments". In Artificial Life and Virtual Reality (N. Magnenat, D. Thalmann, eds.), John Wiley and Sons, 1994, pp. 125–148.
[23] Kirkpatrick, S., Gelatt Jr., C.D., and Vecchi, M.P., "Optimization by Simulated Annealing". Science, vol. 220, no. 4598, pp. 671–680, 1983.
[24] Svoboda, T., Martinec, D., and Pajdla, T., "A convenient multi-camera self-calibration for virtual environments". PRESENCE: Teleoperators and Virtual Environments, 14(4), August 2005.

Srinivasa G. Rao is a PhD student in the Department of Computer Science in the College of Information Technology at the University of North Carolina at Charlotte. He earned his B.E. (Bachelor of Engineering) degree in Computer Science in 1999 from the University of Mysore, India. He worked with IBM as a software engineer (1999–2000). He holds an M.S. in Computer Science from the University of Maryland at College Park (2000–2002). He also worked at MERL as a research intern (2002–2003). His main research interests are Virtual Reality, Augmented Reality, Real-time Computer Graphics and Computer Vision.
Dr. Larry F. Hodges received the PhD degree in Computer Engineering from North Carolina State University in 1988. He is a professor and chair of the Department of Computer Science in the College of Information Technology at the University of North Carolina at Charlotte. Prior to moving to Charlotte, he spent 14 years as a faculty member in the College of Computing at the Georgia Institute of Technology, where he was a founding member of the Graphics, Visualization and Usability (GVU) Center. He is also a co-founder of Virtually Better, a company that specializes in creating virtual environments for clinical applications in psychiatry, psychology, and addiction. His research interests include virtual reality, interactive computer graphics, visualization and 3D HCI.