Papers:
• Pfinder: Real-Time Tracking of the Human Body, by Wren, C., Azarbayejani, A., Darrell, T., and Pentland, A.
• Tracking and Labelling of Interacting Multiple Targets, by J. Sullivan and S. Carlsson

This talk will cover two distinct tracking algorithms:
• Pfinder: real-time tracking of the human body
• Multi-target tracking and labeling

For each of them we will present:
• Motivation and previous approaches
• Review of relevant techniques
• Algorithm details
• Applications and demos

There is always a major trade-off between genericity and accuracy. Because we know we are trying to identify and track human beings, we can make assumptions about our objects. If we have more specific information (for example, that we are tracking players in a football game), we can add even more specific assumptions. Such assumptions help us obtain more accurate tracking.

Tracking Algorithm #1
Pfinder: Real-Time Tracking of the Human Body

Introduction
• Pfinder is a tracking algorithm that:
  – detects human motion in real time,
  – segments the person's body,
  – analyzes internal features (head, body, hands, and feet).
• Many tracking algorithms use a static model: for each frame, similar pixels are searched in the vicinity of the bounding box from the previous frame. Pfinder instead uses a dynamic model, one that learns over time.
• Most tracking algorithms need some user input for initialization; Pfinder initializes automatically.

Covariance
For a domain of dimension $n$, we define the sampling domain's variables $x_1, \ldots, x_n$. The covariance of two variables $x_i, x_j$ is defined as
$$\operatorname{cov}(x_i, x_j) = E\big[(x_i - \mu_i)(x_j - \mu_j)\big], \quad \text{where } \mu_i = E[x_i].$$
The covariance of two variables is a measure of how much they change together. The covariance matrix $\Sigma$ is defined by $\Sigma_{ij} = \operatorname{cov}(x_i, x_j)$.

The normal distribution of a variable $x$ is defined as
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$
and the more general multivariate distribution is
$$p(x_1, \ldots, x_N) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$

Mahalanobis distance: the distance $D_M(x)$ from a sample vector $x = (x_1, \ldots, x_N)^T$ to a group of samples with mean $\mu$ and covariance matrix $S$ is defined as
$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}.$$

Algorithm outline
1. (Automatic) initialization: the background is modeled from a few seconds of video in which the person does not appear. When the person enters the scene, he is detected and modeled.
2. The analysis loop: after the background and person models are initialized, each pixel in every new frame is checked against all models.

Background modeling
The first step of the algorithm is to build a preliminary representation of the person and the surrounding scene. First we acquire a video sequence of the scene that does not contain a person, in order to model the background. The algorithm assumes a mostly static background; however, it needs to be robust to illumination changes and able to recover from changes in the scene (e.g., a book that was moved from one place to another).

The frames use the YUV color representation (Y = luminance component, UV = chrominance components); a fixed transformation matrix converts RGB values to YUV. The algorithm models the background by assigning each pixel a Gaussian that describes the pixel's YUV mean and distribution, measured over time: a pixel has some YUV value in the current frame and may change in the next, so we record its mean as $\mu_0(x, y)$ and its covariance matrix as $K_0(x, y)$.

After the scene has been modeled, Pfinder watches for large deviations from this model. This is done by measuring the Mahalanobis distance in color space between each new pixel's value and the scene model at the corresponding location. If the distance is large enough, and the change is visible over a sufficient number of pixels, we begin to build a model of a person.
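To make this step concrete, here is a minimal NumPy sketch of the per-pixel background Gaussian and the Mahalanobis test, assuming frames are already in YUV; the function names, the regularization term, and the threshold value are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def fit_background(frames):
    """Fit a per-pixel Gaussian to a stack of person-free YUV frames.

    frames: array of shape (T, H, W, 3) holding T background frames.
    Returns the per-pixel mean mu0 (H, W, 3) and covariance K0 (H, W, 3, 3).
    """
    mu0 = frames.mean(axis=0)                        # per-pixel YUV mean
    d = frames - mu0                                 # deviations from the mean
    K0 = np.einsum('thwi,thwj->hwij', d, d) / len(frames)
    K0 = K0 + 1e-3 * np.eye(3)                       # keep every 3x3 block invertible
    return mu0, K0

def foreground_mask(frame, mu0, K0, thresh=3.0):
    """Flag pixels whose Mahalanobis distance to the scene model is large."""
    d = frame - mu0
    K_inv = np.linalg.inv(K0)                        # batched (H, W, 3, 3) inverse
    dist2 = np.einsum('hwi,hwij,hwj->hw', d, K_inv, d)
    return np.sqrt(dist2) > thresh                   # candidate person pixels
```

In the full algorithm, this mask would additionally be checked for a sufficiently large connected region of changed pixels before a person model is instantiated.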
The algorithm represents the detected person's body parts using blobs. A blob is a 2D representation of a Gaussian distribution of spatial and color statistics. In addition, a support map is built for each blob $k$:
$$S_k(x, y) = \begin{cases} 1 & (x, y) \in \text{blob } k, \\ 0 & \text{otherwise.} \end{cases}$$

To initialize the blob models, Pfinder uses a 2D contour shape analysis that attempts to identify the head, hands, and feet; a blob is created for each identified location. The class analyzer finds the location of body features by using statistics of their position and color in previous frames. Because no statistics have been gathered yet (these are the first frames in which the person appears), the algorithm uses ready-made statistical priors. Hand and face blobs have strong flesh-colored priors (it turns out that normalized skin color is nearly constant across different skin pigmentation levels); the other blobs are initialized to cover the clothing regions.

The contour analyzer can find features in a single frame, but its results tend to be noisy. The class analyzer produces accurate results, but it depends on the stability of the underlying models (i.e., no occlusion). A blend of contour analysis and the class model is therefore used to find the features in the next frame.

After the initialization step, the information is divided into a scene model and a person model:
• The scene (background) model consists of the color-space distribution of each pixel.
• The person model consists of a spatial and a color-space distribution for each blob: the spatial distribution determines the blob's location and size, and the color distribution describes the colors within the blob.

The analysis loop
Given a person model and a scene model, we can now acquire a new image, interpret it, and update both models.

1. Update the spatial model associated with each blob using the blob's measured statistics, to yield the blob's predicted spatial distribution for the current image. This is done with a Kalman filter assuming simple Newtonian dynamics.

Review: the Kalman filter. Measurements taken from a video sequence can be very inaccurate, and each measurement is used as a seed for the tracking algorithm in the next frame; without some kind of filtering it would be impossible to make short-term forward predictions, so filtering is needed to make the measurements more reliable. Each tracked object is represented by a state vector (usually its location). With each new frame, a linear operator is applied to the state to generate the new state, with some noise mixed in and, possibly, information from the controls on the system; usually Newton's laws of motion are applied. The added noise is Gaussian with mean 0 and a known covariance matrix. The predicted state is then updated with the actual measurement to create the estimate for the next frame.
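As a companion to this review, here is a minimal constant-velocity Kalman filter sketch for a single blob's image position; the state layout, frame interval, and the noise covariances Q and R are illustrative assumptions rather than values from the paper.

```python
import numpy as np

# State: [x, y, vx, vy]; we observe only the blob's position.
dt = 1.0                                    # one frame
F = np.array([[1, 0, dt, 0],                # Newtonian dynamics: x += vx * dt
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],                 # measurement picks out (x, y)
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                        # process (dynamics) noise
R = 4.0 * np.eye(2)                         # measurement noise

def predict(x, P):
    """Propagate the state estimate and its covariance one frame forward."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a measured blob centroid z = [mx, my]."""
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P
```

Each frame, `predict` supplies the blob's expected spatial distribution, and `update` folds in the measured blob statistics: the predict/correct cycle described above.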
2. When a new image is acquired, we measure the likelihood of each pixel under each of the blob models and the scene model. The vector $p = (x, y, Y, U, V)^T$ holds the location and color of a pixel. For each class $k$ with mean $\mu_k$ and covariance $K_k$, the log-likelihood is
$$d_k = -\frac{1}{2}(p - \mu_k)^T K_k^{-1}(p - \mu_k) - \frac{1}{2}\ln|K_k| - \frac{m}{2}\ln 2\pi.$$

3. Each pixel is now assigned to a particular class, either one of the blobs or the background. A support map is built which indicates which pixel belongs to which class:
$$s(x, y) = \arg\max_k \, d_k(x, y).$$
Connectivity constraints are enforced by iterative morphological growing from a single central point, to produce a connected region: first, a foreground region comprising all the blob classes is grown; then each individual blob is grown under the constraint that it remains confined to the foreground region.

4. Finally, the statistical model of each class is updated. For the blob classes, the new mean and covariance are computed from the pixels assigned to the class:
$$\mu_k = E[p], \qquad K_k = E\big[(p - \mu_k)(p - \mu_k)^T\big].$$
The Kalman filter statistics are also updated at this time. Background pixels are updated as well, so that the system can recover from changes in the scene.

Limitations
The algorithm employs several domain-specific assumptions in order to achieve accurate tracking. If one of the assumptions breaks, the system degrades; however, it can recover after a few frames once the assumptions hold again. The system can track only a single person.

Results
RMS (root mean square) errors were found to be on the order of a few pixels:

Test                 Hand                          Arm
Translation (X, Y)   0.7 pixels (0.2% relative)    2.1 pixels (0.8% relative)
Rotation (θ)         4.8 degrees (5.2% relative)   3.0 degrees (3.1% relative)

Applications
• A modular interface: an application that provides programmers with tracking, segmentation, and feature detection.
• ALIVE: places 3D animated characters that interact with the person according to his gestures. Here, Rexy!
• SURVIVE: uses the person's recorded movement to navigate a 3D virtual game environment. I guess you can't get any nerdier than this.
• Recognition of American Sign Language: Pfinder was used as a pre-processing step for detecting a 40-word subset of ASL, with 99% sign accuracy.
• Avatars and telepresence: the model of the person is translated to several blobs, which can be used to animate 2D characters.

Tracking Algorithm #2
Multi-Target Tracking and Labeling
(uses slides by Josephine Sullivan from http://www.csc.kth.se/~sullivan/)

Introduction
• The multi-target tracking and labeling algorithm:
  – tracks multiple targets over long periods of time,
  – recovers robustly from collisions,
  – labels targets even while they are interacting.
• Multi-target tracking and labeling is sometimes easy and sometimes hard.

The algorithm addresses the problem of surveillance and tracking of multiple persons over a wide area. Previous multi-target tracking algorithms are based on Kalman filtering and on advanced particle-filtering techniques, and tracking often fails when occlusion or interaction between the targets occurs. This work's specific goal is to track and label the players in a football game, which is especially hard when players collide and interact.

The researchers used a wide-screen video produced from four calibrated cameras; the images were stitched after the homography between them was computed. This yields a high-resolution video which gives good tracking results.

The algorithm proceeds in four stages:
1. Background modeling and subtraction
2. Build an interaction graph
3. Resolve split/merge situations
4. Recover identities of temporally separated player trajectories

Background modeling and subtraction
A probabilistic model of the image gradient of each background pixel is learned. The gradient is used to prevent situations where a player's uniform has the same color as the background. Let $g_x^t$ denote the image gradient at pixel $x$ in frame $t$. Each background pixel $x$ is modeled by a mixture of three bivariate normal distributions with means $\mu_x^i$ and covariance matrices $\Sigma_x^i$:
$$g_x \sim \sum_{i=1}^{3} \pi_x^i \, \mathcal{N}_2(\mu_x^i, \Sigma_x^i), \qquad \text{where } 0 \le \pi_x^i \le 1 \text{ and } \sum_{i=1}^{3} \pi_x^i = 1.$$
A pixel $x$ in frame $t$ is considered a foreground pixel if the Mahalanobis distance $(g_x^t - \mu_x)^T \Sigma_x^{-1}(g_x^t - \mu_x)$ to the background model is larger than a threshold $\tau$.
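A sketch of this foreground test for a single pixel might look as follows; we assume the per-pixel mixture parameters have already been fitted from player-free frames, and we take the smallest per-component Mahalanobis distance as the test statistic (an assumption on our part; the paper's exact decision rule may differ).

```python
import numpy as np

def is_foreground(g, mus, sigmas, tau=9.0):
    """Gradient-based foreground test for one pixel.

    g: 2-vector, the observed image gradient at this pixel.
    mus, sigmas: means (3, 2) and covariances (3, 2, 2) of the pixel's
    three-component background mixture; tau is an illustrative threshold.
    """
    for mu_i, sigma_i in zip(mus, sigmas):
        d = g - mu_i
        dist2 = d @ np.linalg.inv(sigma_i) @ d      # squared Mahalanobis distance
        if dist2 < tau:
            return False        # well explained by some background component
    return True                 # no component explains the observed gradient
```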
Let $F_t$ be the set of foreground pixels at time $t$, and let $B_t$ be the set of background pixels at time $t$. Connected components are then identified and post-processed by deleting small components or joining them to neighboring larger ones. This ensures that each connected component corresponds to at least one whole player.

The set of ellipses representing the detected connected components (marked by bounding boxes) is defined as
$$E_t = \{E_t^i\}_{i=1}^{n_t},$$
with $n_t$ being the number of ellipses detected in frame $t$.

Building the interaction graph
The first aim is to put the ellipses in $E_t$ and $E_{t+1}$ in correspondence.
Definition: ellipses $E_1$ and $E_2$ are an exact match if their sizes and orientations are sufficiently similar and the distance between their centers is sufficiently small.
Define a relation $\sim$: $E_t^i \sim E_{t+1}^j$ if $E_t^i$ and $E_{t+1}^j$ are an exact match. If no such exact match exists for $E_t^i$ in $E_{t+1}$, then $E_t^i \sim E_{t+1}^j$ if $\operatorname{Area}(E_t^i \cap E_{t+1}^j) > 0$ and $E_{t+1}^j$ has no exact match in $E_t$.

Define forward and backward mappings:
Forward mapping: $F_t(i) = \{\, j : E_t^i \sim E_{t+1}^j \,\}$
Backward mapping: $B_t(k) = \{\, i : k \in F_t(i) \,\}$

With the forward and backward mappings, we can define events at each frame:

Signal                              Event
$|F_t(i)| > 1$                      Split
$|B_t(j)| > 1$                      Merge
$|F_t(i)| = 0$                      Disappear
$|B_t(j)| = 0$                      Appear
$|F_t(i)| = |B_t(F_t(i))| = 1$      Stable

A maximal sequence of stable events sandwiched between non-stable events is termed a track. A player track is a track that corresponds to exactly one player.
• If the event sequence is track → split or merge → track (i.e., the track ends in a split or begins with a merge), then the track involves multiple players.
• If the event sequence is {split, appear} → track → {merge, disappear}, then the track may be a player track; if such a track is long enough and its ellipse size is not too big, it is considered a player track.
• Other tracks are called multiple-player tracks.

Because we are dealing with a football game, we know the players divide into three categories: team A, team B, and officials. This helps in cases where players from different teams appear in multiple-player tracks.

Given the labeling of the tracks and their interactions through merging and splitting, the game can be summarized by a graph structure called the target interaction graph. White and gray nodes correspond to team A / team B player tracks; black nodes correspond to multiple-player tracks. The graph shown is a small section of the ~5000-node graph describing 10 minutes of analyzed gameplay.

Resolving merge-split situations
By examining the player interaction graph, it is possible to isolate situations where $n$ player tracks merge and then split into $n$ player tracks. These merge-split situations are resolved by finding a correspondence between the input and output tracks, each a set of $n$ tracks. We wish to find the assignment $M$ of inputs to outputs: a bijective mapping $M : \{1, \ldots, n\} \to \{1, \ldots, n\}$, where $M(i) = j$ implies that tracks $T_i$ and $T_j$ belong to the same player. Not all assignments are physically possible.

For each valid assignment we estimate the intermediate tracks by exploiting continuity of motion and relative depth ordering. We first investigate whether any of the intermediary tracks can be described by a constant-velocity motion model, by linearly interpolating between the last ellipse of $T_i$ and the first ellipse of $T_{M(i)}$. If there is sufficient image data to support this, the penalty for the estimate is 0. The overall estimate for each assignment is scored as
$$Sc_M = \sum_{i=1}^{n} \operatorname{Dist}(T_i \to T_{M(i)}) \, \operatorname{Pen}(T_i \to T_{M(i)}),$$
where $\operatorname{Dist}(T_i \to T_{M(i)})$ is the distance traveled along the hypothesized trajectory, and $\operatorname{Pen}(T_i \to T_{M(i)})$ is 1 if $T_i$ is not consistent with the relevant $T_{M(i)}$, and 0 otherwise.
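Because only a handful of targets merge in practice, the minimum-score assignment can be found by brute force over all $n!$ bijections. The sketch below scores each permutation with the Dist·Pen rule above; representing tracks by their ellipse centers, using straight-line distance, and the `consistent` callback are simplifying assumptions on our part.

```python
import numpy as np
from itertools import permutations

def best_assignment(last_centers, first_centers, consistent):
    """Find the minimum-score bijection between merging and splitting tracks.

    last_centers:  (n, 2) centers of each input track's final ellipse.
    first_centers: (n, 2) centers of each output track's first ellipse.
    consistent(i, j): True if the interpolated trajectory from input i to
    output j is supported by the image data (assumed to be given).
    """
    n = len(last_centers)
    best, best_score = None, float('inf')
    for M in permutations(range(n)):            # all candidate bijections
        score = 0.0
        for i, j in enumerate(M):
            dist = np.linalg.norm(first_centers[j] - last_centers[i])
            pen = 0.0 if consistent(i, j) else 1.0
            score += dist * pen                 # Sc_M = sum of Dist * Pen
        if score < best_score:
            best, best_score = M, score
    return best, best_score
```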
If the minimum-score assignment is explained solely by linear interpolation and its score is lower than a threshold $\tau$, we accept it. Otherwise, we repeat the process at constant time intervals, exploiting relative depth ordering: intermediary tracks that cannot be explained by simple linear interpolation are analyzed every $m$-th frame in the interval between the merge and the split. Starting with the first interval, we define the region $R_k(t_j)$ as the union of all the ellipses, and try to interpolate over these shorter spans. The aim at each interval is to maximize the intersection of $R_k(t_j)$ with the foreground pixels and to minimize its intersection with the background pixels. Again, the penalty is set to 1 if this intersection is not consistent; the score is then recalculated and the minimum-score assignment is chosen.

This process was found to work when the number of merging targets was at most 5. The examined sequence contained roughly 200 merge-split situations of varying complexity, all of which were resolved. At this point it is interesting to examine how frequently each player was assigned a player track.

Recovering player identities
Not all split/merge situations are resolved accurately, but other features can usually be used to recover the identity of player tracks. In a football game, a player's identity can be inferred from his position relative to his teammates; the easiest example is the goalkeeper, who is always behind his teammates. We can view this as a partitioning problem. The formulation is specific to a football game, but a variation can be used in other applications.

The feature vector for each player $i \in \{1, \ldots, 11\}$ at frame $t$ is
$$v_t^i = (r_t^i, l_t^i, f_t^i, b_t^i),$$
which counts the number of teammates to the right of, to the left of, in front of, and behind the player. We assign an index to every possible configuration (feature vector), and for each unlabeled player track we build a histogram of the configurations over the track's ellipses.

We start by considering only long player tracks (over 40 seconds) and build their distance matrix, which shows the distance between every pair of player tracks (darker values indicate smaller distances). We then grow and merge clusters using player tracks of decreasing length: clustering with 750-frame tracks works well, while with 250-frame tracks errors begin to occur.

Summary
We have seen two algorithms: one deals with single-person tracking, the other with multi-target tracking. Both algorithms make specific assumptions, the first about the human body and its motion, the second about motion and the conditions of a football game.