Lecture 09: Pedestrian Tracking

Papers:
•Pfinder: Real-Time Tracking of the Human Body,
Wren, C., Azarbayejani, A., Darrell, T.,
and Pentland, A.
•Tracking and Labelling of Interacting Multiple
Targets,
J. Sullivan and S. Carlsson
This talk will cover two distinct tracking
algorithms.
 Pfinder: Real-Time Tracking of the Human Body
 Multi-target tracking and labeling
For each of them we will present:
 Motivation and previous approaches
 Review of relevant techniques
 Algorithm details
 Applications and demos
There is always a major trade-off between
genericity and accuracy.
Because we know we are trying to identify and
track human beings, we can start making
assumptions about our objects.
If we have more specific information (for example,
tracking players in a football game), we can add
even more specific assumptions.
These kinds of assumptions help us achieve more
accurate tracking.
Tracking Algorithm #1
Pfinder: Real-Time Tracking of the Human Body
Motivation
Introduction
• Pfinder is a tracking algorithm
– Detects human motion in real-time.
– Segments the person’s body
– Analyzes internal features (head, body, hands, and
feet)
 Many tracking algorithms use a static model – for
each frame, similar pixels are searched in the
vicinity of the previous frame's bounding box.
 We will use a dynamic model – One that learns over
time.
 Most tracking algorithms need some user-input
for initialization.
 The presented algorithm will do automatic
initialization.
Covariance
 For a domain of dimension n, we define the
sampling domain's variables x_1, ..., x_n.
 The covariance of two variables x_i, x_j is defined:
cov(x_i, x_j) = E[(x_i − μ_i)(x_j − μ_j)], where μ_i = E[x_i]
The covariance of two variables is a measure of
how much the two variables change together.
The covariance matrix (marked Σ) is defined:
Σ_ij = cov(x_i, x_j)
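As a small numerical check of the definition above, here is a minimal sketch (the sample values are made up for illustration; with x_j = 2·x_i the covariance comes out positive):

```python
import numpy as np

# Two sample sequences; x_j is exactly twice x_i, so they change together.
xi = np.array([1.0, 2.0, 3.0, 4.0])
xj = np.array([2.0, 4.0, 6.0, 8.0])

# cov(x_i, x_j) = E[(x_i - mu_i)(x_j - mu_j)], matching the definition above
cov = np.mean((xi - xi.mean()) * (xj - xj.mean()))
print(cov)  # -> 2.5
```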
The normal distribution of a variable x is defined:
p(x) = (1 / (σ √(2π))) · exp(−(x − μ)² / (2σ²))
The more general multivariate normal
distribution is defined:
p(x = [x_1, ..., x_N]^T) = (1 / ((2π)^{N/2} |Σ|^{1/2})) · exp(−½ (x − μ)^T Σ^{-1} (x − μ))
Mahalanobis distance:
 The distance D_M(x), measured from a sample
vector x = [x_1, ..., x_N]^T to a group of samples with
mean μ = [μ_1, ..., μ_N]^T and a covariance matrix S,
is defined:
D_M(x) = √((x − μ)^T S^{-1} (x − μ))
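A minimal sketch of the Mahalanobis distance, assuming NumPy; note that with an identity covariance it reduces to the ordinary Euclidean distance:

```python
import numpy as np

def mahalanobis(x, mu, S):
    """Mahalanobis distance from sample vector x to a group of
    samples with mean mu and covariance matrix S."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

# With the identity covariance the distance reduces to Euclidean:
x = np.array([3.0, 4.0])
mu = np.zeros(2)
print(mahalanobis(x, mu, np.eye(2)))  # -> 5.0
```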
1. (Automatic) Initialization
 Background is modeled in a few seconds of
video where the person does not appear.
 When the person enters the scene, he is
detected and modeled.
2. The analysis loop
 After the background and person models are
initialized, each pixel in the next frame is checked
against all models.
The first step in the algorithm is to build a
preliminary representation of the person and
the surrounding scene.
First we need to acquire a video sequence of
the scene that does not contain a person, in
order to model the background.
The algorithm assumes a mostly-static
background.
However, it needs to be robust to illumination
changes and able to recover from changes in
the scene (e.g. a book that was moved from
one place to another).
The images in the video use the YUV color
representation (Y = luminance component,
UV = chrominance components).
 There exists a transformation matrix which
transforms RGB representation to YUV.
The algorithm models the background by
fitting each pixel with a Gaussian that describes
the pixel's mean and distribution.
We do this by measuring the pixel's YUV mean
and distribution over time.
This pixel has some YUV value in this frame; in
the next frame it might change, so we mark its
mean as μ_0(x, y) and its covariance matrix as
K_0(x, y).
After the scene has been modeled, Pfinder
watches for large deviations from this model.
This is done by measuring the Mahalanobis
distance in color space between the new
pixel's value and the scene model value at
the corresponding location.
If the distance is large enough and the change
is visible over a sufficient number of pixels, we
begin to build a model of the person.
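The deviation test can be sketched as follows; the threshold value and the vectorized formulation are illustrative choices of this sketch, not the paper's:

```python
import numpy as np

def deviation_mask(frame, mean, cov_inv, thresh=9.0):
    """Flag pixels whose squared Mahalanobis distance in color space
    from the background model exceeds a threshold.
    thresh is an illustrative value, not from the paper."""
    d = frame - mean                                    # (H, W, 3) deviations
    # per-pixel squared Mahalanobis distance d^T K^{-1} d
    dist2 = np.einsum('hwi,hwij,hwj->hw', d, cov_inv, d)
    return dist2 > thresh

mean = np.zeros((2, 2, 3))
cov_inv = np.broadcast_to(np.eye(3), (2, 2, 3, 3)).copy()
frame = np.zeros((2, 2, 3))
frame[0, 0] = [4.0, 4.0, 4.0]       # one pixel deviates strongly
print(deviation_mask(frame, mean, cov_inv))
```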
The algorithm represents the detected
person’s body parts using blobs.
Blobs are 2D representations of a Gaussian
distribution of the spatial statistics.
Also, a support map is built for each blob k:
S_k(x, y) = 1 if (x, y) ∈ k, 0 otherwise
To initialize the blob models, Pfinder uses a
2D contour shape analysis that attempts to
identify the head, hands, and feet location.
A blob is created for each identified location.
The class analyzer finds the location of body
features by using statistics of their position
and color from the previous frames.
Because no statistics have been gathered yet
(this is the first frames where the person
appears), the algorithm uses ready-made
statistical priors.
Hand and face blobs have
strong flesh-colored color
priors (it appears that
normalized skin color is
constant across different
skin pigmentation levels).
The other blobs are
initialized to cover the
clothing regions.
The contour analyzer can find features in a
single frame, but the results tend to be noisy.
The class analyzer produces accurate results, but
it depends on the stability of the underlying
models (i.e. no occlusion).
A blend of contour analysis and the class model is
used to find the features in the next frame.
(Figure: original frame vs. contour analysis)
After the initialization step of the algorithm, the
information is now divided into scene and person
models.
 The scene (background) model consists of the color-space
distribution for each pixel.
 The person model consists of a spatial and a color-space
distribution for each blob.
 The spatial distribution determines the blob's location
and size.
 The color distribution determines the distribution of color in
the blob.
Given a person model and a scene model,
we can now acquire a new image, interpret
it, and update the scene and person models.
1. Update the spatial model associated with
each blob using the blob’s measured
statistics, to yield the blob’s predicted spatial
distribution for the current image.
This is done with a Kalman filter assuming
simple Newtonian dynamics.
Measuring information from a video sequence
can sometimes be very inaccurate.
 Without some kind of filtering it would be impossible
to make any short-term forward predictions.
 Also, each measurement is used as a seed for the
tracking algorithm at the next frame.
 Some kind of filtering is needed to make the
measurements more accurate.
Each tracked object is represented with a
state vector (usually location)
With each new frame, a linear operator is
applied to the state to generate the new
state, with some noise mixed in, and some
information from the controls on the system
Usually, Newton’s laws are applied.
The added noise is Gaussian with zero mean
and a known covariance matrix.
The predicted state is then updated with the
real measurement to create the estimate for
the next frame.
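The predict/update cycle described above can be sketched with a 1D constant-velocity filter. All matrices and noise values here are illustrative assumptions, not Pfinder's actual parameters:

```python
import numpy as np

# Minimal 1D constant-velocity Kalman filter sketch.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # Newtonian dynamics: pos += vel
H = np.array([[1.0, 0.0]])               # we only measure position
Q = np.eye(2) * 1e-4                     # process noise covariance
R = np.array([[0.1]])                    # measurement noise covariance

def kalman_step(x, P, z):
    # predict the next state with the linear dynamics
    x = F @ x
    P = F @ P @ F.T + Q
    # update the prediction with the real measurement z
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

# track an object moving one pixel per frame
x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.0, 3.0, 4.0]:
    x, P = kalman_step(x, P, np.array([z]))
# the velocity estimate x[1] converges toward 1
```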
2. When a new image is acquired, we measure
the likelihood of each pixel being a member of
each of the blob models and of the scene model:
The vector p = (x, y, Y, U, V) is defined as the
location and color of each pixel. For each
class k, the log-likelihood is measured:
d_k = −½ (p − μ_k)^T K_k^{-1} (p − μ_k) − ½ ln |K_k| − (m/2) ln(2π)
3. Each pixel is now assigned to a particular class,
either one of the blobs or the background.
A support map is built which indicates which
pixels belong to which class:
s(x, y) = argmax_k d_k(x, y)
Connectivity constraints are enforced by
iterative morphological growing from a single
central point, to produce a connected region.
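The per-pixel classification step can be sketched as follows; the two toy classes are made up for illustration:

```python
import numpy as np

def log_likelihood(p, mu, K):
    """Gaussian log-likelihood of feature vector p (position + color)
    under class mean mu and covariance K, as in the formula above."""
    d = p - mu
    m = len(p)
    return (-0.5 * d @ np.linalg.inv(K) @ d
            - 0.5 * np.log(np.linalg.det(K))
            - 0.5 * m * np.log(2 * np.pi))

def classify(p, classes):
    """Assign p to the class (blob or background) with the
    highest log-likelihood."""
    scores = [log_likelihood(p, mu, K) for mu, K in classes]
    return int(np.argmax(scores))

# Two illustrative classes with identity covariance:
classes = [(np.zeros(2), np.eye(2)), (np.array([5.0, 5.0]), np.eye(2))]
print(classify(np.array([4.5, 5.2]), classes))  # -> 1
```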
First, a foreground region is grown
comprising all the blob classes.
Then, each individual blob is
grown with the constraint that it
remains confined to the foreground
region.
4. Now the statistical model for each class is
updated.
 For the blob classes, the new mean and covariance are
calculated over the pixels p_k assigned to class k:
μ_k = E[p_k]
K_k = E[(p_k − μ_k)(p_k − μ_k)^T]
 The Kalman filter statistics are also updated at
this time.
 Background pixels are also updated to have the
ability to recover from changes in the scene.
The algorithm employs several domain-specific
assumptions in order to achieve accurate
tracking.
 If one of the assumptions breaks, the system
degrades.
 However, the system can recover after a few
frames if the assumptions hold again.
The system can track only a single
person.
RMS (Root Mean Square) errors were found
to be on the order of a few pixels:

Test               | Hand                        | Arm
Translation (X, Y) | 0.7 pixels (0.2% relative)  | 2.1 pixels (0.8% relative)
Rotation (θ)       | 4.8 degrees (5.2% relative) | 3.0 degrees (3.1% relative)
A Modular Interface – an application that
provides programmers with tracking, segmentation
and feature detection.
The ALIVE application places 3D animated
characters that interact with the person
according to his gestures.
Here, Rexy!
 The SURVIVE application used the recorded movement
of the person to navigate a 3D virtual game
environment.
I guess you can't
get any nerdier
than this
Recognition of American Sign Language
 Pfinder was used as a pre-processor for detecting a
40-word subset of ASL. It had 99% sign accuracy.
Avatars and Telepresence
 The model of the person is translated into several
blobs, which can be used to animate 2D characters.
Tracking Algorithm #2
Multi-Target Tracking and Labeling
uses slides by Josephine Sullivan
from http://www.csc.kth.se/~sullivan/
Motivation
Introduction
• The multi-target tracking and labeling
algorithm
– Tracks multiple targets over long periods of time
– Recovers robustly from collisions
– Labels targets even when they are interacting
Multi Tracking and Labeling
Sometimes Easy
Sometimes Hard
The algorithm addresses the problem of the
surveillance and tracking of multiple persons
over a wide area.
Previous multi-target tracking algorithms are
based on Kalman filtering and advanced
techniques of particle filtering.
Often, tracking algorithms fail if occlusion or
interaction between the targets occurs.
This work's specific goal is to track and
label the players in a football game.
This is especially hard when players collide
and interact
The researchers used a wide-screen video
which was produced using the video from four
calibrated cameras.
The images were stitched after the
homography between the images was
computed.
This produces a high-resolution video which
gives good tracking results
1. Background modeling and subtraction
2. Build an interaction graph
3. Resolve split/merge situations
4. Recover identities of temporally separated
player trajectories
A probabilistic model of the image gradient of
each pixel in the background is obtained.
The gradient is used to prevent situations
where the player's uniform has the same color
as the background.
Let g_x^t denote the image gradient at pixel x in
frame t.
Each background pixel x is modeled by a
mixture of three bivariate normal
distributions with means μ_x^i and covariance
matrices Σ_x^i:

g_x ~ Σ_{i=1}^{3} π_x^i · N(μ_x^i, Σ_x^i)

where 0 ≤ π_x^i ≤ 1 and Σ_{i=1}^{3} π_x^i = 1.
A pixel x in frame t is considered a
foreground pixel if

(g_x^t − μ_x)^T Σ_x^{-1} (g_x^t − μ_x)

is larger than a threshold τ.
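The foreground test can be sketched as follows; requiring the gradient to be far from every mixture component, as well as the example threshold and components, are assumptions of this sketch:

```python
import numpy as np

def is_foreground(g, components, tau=9.0):
    """A gradient sample g is treated as foreground when its squared
    Mahalanobis distance to every mixture component exceeds tau.
    tau and the min-over-components rule are illustrative choices."""
    d2 = [(g - mu) @ np.linalg.inv(cov) @ (g - mu)
          for w, mu, cov in components]
    return min(d2) > tau

# One illustrative mixture of three identical components:
comps = [(1/3, np.zeros(2), np.eye(2))] * 3
print(is_foreground(np.array([5.0, 0.0]), comps))  # -> True
print(is_foreground(np.array([0.5, 0.5]), comps))  # -> False
```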
 Let Ft be the set of foreground pixels at time t
 Let Bt be the set of background pixels at time t
Connected components are then identified
and processed by deleting small components
or joining them to neighboring larger ones.
 This is done to make sure that each connected
component corresponds to at least one whole
player.
The set of ellipses representing the detected
connected components (marked by bounding
boxes) is defined:

E_t = {E_t^i}_{i=1}^{n_t}

with n_t being the number of ellipses detected
in frame t.
The first aim is to put the ellipses in E_t and E_{t+1}
in correspondence.
Definition: ellipses E_1 and E_2 are an exact
match if their size and orientation are
sufficiently similar and the distance between their
centers is sufficiently small.
Define a relation ~:
 E_t^i ~ E_{t+1}^j if E_t^i and E_{t+1}^j are an exact match
 If no such exact match exists for E_t^i in E_{t+1}, then
E_t^i ~ E_{t+1}^j if Area(E_t^i ∩ E_{t+1}^j) > 0 and E_{t+1}^j has no
exact match in E_t
Define forward and backward mappings:
 Forward mapping: j ∈ F_t(i) ⟺ E_t^i ~ E_{t+1}^j
 Backward mapping: k ∈ B_t(i) ⟺ i ∈ F_t(k)
With the forward and backward mappings, we
can define events at each frame:

Signal                       | Event
|F_t(i)| > 1                 | Split
|B_t(j)| > 1                 | Merge
|F_t(i)| = 0                 | Disappear
|B_t(j)| = 0                 | Appear
|F_t(i)| = |B_t(F_t(i))| = 1 | Stable
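The event table can be sketched as a small classifier over the mapping sizes; reducing the input to one ellipse's forward matches and the backward matches of its forward match is a simplification of this sketch:

```python
def event(F_i, B_of_F):
    """Classify a frame-to-frame event for one ellipse.
    F_i: set of next-frame ellipses the ellipse maps to (forward).
    B_of_F: set of current-frame ellipses mapping to its forward
    match (backward). A simplified sketch of the table above."""
    if len(F_i) == 0:
        return "disappear"
    if len(F_i) > 1:
        return "split"
    if len(B_of_F) > 1:
        return "merge"
    return "stable"

# two current ellipses map to the same next-frame ellipse -> merge
print(event({3}, {1, 2}))  # -> "merge"
```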
A maximal sequence of stable events
sandwiched between non-stable events is
termed a track.
A player track is a track that corresponds to
exactly one player
If the event sequence is
track → split, or merge → track,
then the track involves multiple players.
If the event sequence is
{split, appear} → track → {merge, disappear},
then the track may be a player track.
If such a track is long enough and the ellipse size is
not too big, it is considered a player track.
Other tracks are called multiple-player tracks.
Because we're dealing with a football game,
we know that players are divided into 3
categories: Team A, Team B, and officials.
This helps us in cases where players from
different teams appear in multiple-player
tracks.
Given the labeling of the tracks and their
interactions through merging and splitting,
the game can be summarized by a graph
structure called target interaction graph.
White and gray nodes
correspond to team A /
team B player tracks.
Black nodes correspond to
multiple-player tracks.
This graph is a small section
of the ~5000 node graph
describing 10 minutes of
analyzed gameplay.
By examining the player interaction graph, it is
possible to isolate situations where n player
tracks merge and then split into n player
tracks.
These merge-split situations are resolved by
finding correspondence between input and
output tracks.
Input and output tracks are each a set of n
tracks.
We wish to find the assignment M of the
input to the output tracks. It is a bijective mapping
M : {1, ..., n} → {1, ..., n}, where M(i) = j implies that
tracks T_i and T_j are the same player.
Not all assignments are physically possible.
For each valid assignment, we estimate the
intermediate tracks by exploiting the
properties of maintaining continuity of motion
and relative depth ordering.
We investigate whether any of the intermediary
tracks can be described by a constant-velocity
motion model. This is done by linearly
interpolating between the last ellipse of T_i and
the first ellipse of T_M(i).
If there is sufficient image data to support
this, the penalty for this estimation is 0.
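The constant-velocity hypothesis can be sketched as a linear interpolation of ellipse positions; reducing ellipses to their (x, y) centers is a simplification of this sketch:

```python
def interpolate_ellipses(e0, e1, steps):
    """Linearly interpolate positions between the last ellipse of one
    track and the first ellipse of its hypothesized continuation
    (constant-velocity model). Ellipses are reduced to (x, y) centers."""
    (x0, y0), (x1, y1) = e0, e1
    return [(x0 + (x1 - x0) * t / (steps + 1),
             y0 + (y1 - y0) * t / (steps + 1))
            for t in range(1, steps + 1)]

# two intermediate positions between (0, 0) and (3, 3)
print(interpolate_ellipses((0.0, 0.0), (3.0, 3.0), 2))
# -> [(1.0, 1.0), (2.0, 2.0)]
```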
The overall estimation for each assignment is
scored:

Sc(M) = Σ_{i=1}^{n} [ Dist(T_i → T_M(i)) + Pen(T_i → T_M(i)) ]

Where:
 Dist(T_i → T_M(i)) is the distance traveled during the
hypothesized trajectory.
 Pen(T_i → T_M(i)) = 1 if T_i is not consistent with the
relevant T_M(i), and 0 otherwise.
If the minimum-score assignment can be
explained solely by linear interpolation, and
its score is lower than a threshold τ, then
we accept this assignment.
Otherwise, we repeat this process at constant
time intervals, using relative depth
ordering.
Intermediary tracks that cannot be explained
by simple linear interpolation are analyzed
every m-th frame in the interval between the
merge and the split.
Starting with the first interval, we define the
region R_k(t_j) as the union of all ellipses and
try to interpolate over smaller distances.
The aim at each interval is to maximize the
intersection of R_k(t_j) with the foreground
pixels and minimize the intersection with the
background pixels.
Again, the penalty is set to 1 if the resulting
intersection is not consistent.
Then the score is recalculated and the
minimum-scored assignment is chosen.
This process was found to work when the
number of targets merging was at most 5.
Nonetheless, the examined sequence
contained roughly 200 merge-split situations,
of varying complexity, all resolved.
At this step, it is interesting to see how
frequently a player was assigned a player
track.
Not all split/merge situations were accurately
resolved.
Usually, other features can be used to
resolve the identity of player tracks.
In a football game, a player's identity can be
obtained from his position relative to his
teammates.
The easiest example is the goalkeeper who is
always behind his teammates.
We can look at the problem as a partitioning
problem.
This is specific to a football game, but a
variation can be used for other applications.
The feature vector for each player i = 1, ..., 11
at frame t is v_t^i = (r_t^i, l_t^i, f_t^i, b_t^i), which counts the
number of teammates to the right, left,
in front of, and behind the player.
We assign an index to every possible
configuration (feature vector), and for each
unlabeled player track we build a histogram
of the configurations over the track's ellipses.
We start by considering only long player
tracks (over 40 seconds).
Build their distance matrix:
the distance between
every pair of player
tracks is shown, with
darker values indicating
smaller distances.
We grow and merge clusters by using player
tracks of decreasing lengths. This clustering
considers tracks 750 frames long.
Clustering with 250-frame tracks:
 Errors begin to occur
We've seen two algorithms.
 One deals with single-person tracking, the other
with multi-target tracking.
 Both algorithms make specific assumptions: the
first about the human body and motion, the
other about motion and the conditions of a
football game.