>> Rick Szeliski: Good morning everyone. It is my pleasure to reintroduce Pascal Fua,
who gave us a talk yesterday on feature descriptors. Pascal is a professor at the École
Polytechnique Fédérale de Lausanne. He has done a lot of work in stereo matching and feature detection
and also person tracking, and he is going to tell us about the latest work in that area.
>> Pascal Fua: Thank you. What we have been doing in the area of people tracking is to
work on a multi-camera system that ought to be able to follow a number of people in
potentially crowded spaces where people occlude each other. Which means we have
images like those, and from those video sequences we would like to produce
2-D trajectories, and ideally, of course, we would like to be able to do this over long
periods of time using cameras that may be above, which is actually easier, or, as is often the
case, at head level. This is important because it causes occlusions. So a lot of what I
have to talk about is how you deal with occlusions and how you put together a system
that is robust to that.
And of course we would like it to work in realistic scenarios where you have thousands
of frames, I mean long sequences. A typical case, and I'll mention basketball later, is that
it should track a whole period of a basketball game. The images may or may
not be of good quality; for example, we will see images from a subway underpass
where the lighting is bad and the resolution is not wonderful. It has to be able to
handle that, and of course as soon as you are outdoors you have to deal with shadows and
illumination changes and all that fun stuff that makes life more difficult. So the
algorithm we put together works in two different steps. It does its reasoning in the
ground plane, so we imagine that we have an area we want to model, in which people are
going to walk in and walk out. We discretize that area, and in the first step we compute
the probability of occupancy at each time frame independently, so no temporal
consistency at this point: we treat every instant separately and estimate how many
people there may be in the scene based on what all of the cameras see. And then, given
these probabilistic occupancy maps, we will enforce temporal consistency under very, very
generic assumptions. Essentially the only assumption we are going to make is that people
do not teleport.
>>: [inaudible]
>> Pascal Fua: Two or more. Actually, I will show you that to some extent it actually
works with one. The formulation [inaudible] can handle one, but it is really designed for two or
three or four. Here is the kind of input that we are going to work with. So we have two
graduate students, I know [inaudible] in these cases. What we do is we run a very simple
[inaudible] background subtraction algorithm that does something like this. The reason
I'm showing this is that you can see this background subtraction is not [inaudible] good, and that is
actually realistic; that is what is really going to happen in real situations. You're
going to need an algorithm that can handle all of this. And by the way, since we are at
Microsoft, I am using background subtraction because that is something we have been
using for a long time, but if you feel like using the output of the Kinect, you could.
Actually, if you are in a situation where the Kinect would work, the output would be
better than this. It would actually carry more information, and so everything else
after that should work even better.
But for the time being we are going to work with this. So the game we want to play in
this first stage of the algorithm is: given multiple binary maps like those, compute this
probabilistic occupancy map, which is just something that says that in every grid cell there is a
certain probability of a person being present. So of course this one indicates that there
are probably two people, one here and one here. There is a bit of noise, and this
actually is not binary; it is really a set of floating-point numbers. So how are we going to
get those? Let's try to formalize this a little bit. What we are going to try to
compute is the probability of there being people, so X1 to XN are binary variables that say
whether there is or isn't a person at location i, grid cell i, given B1 to BC, which are the
binary images. And this is going to be done at individual multiview time frames, at
one given instant. We should expect the Bs to be very noisy; we should take
advantage of the fact that the size of the blobs gives us some very rough estimate of how far
away the people are; and most importantly, it should be done in such a way that it robustly
handles occlusions.
So to estimate this, which is of course a complicated probability distribution, the first thing
that we are going to do is a mean field approximation, where we are going to write
it as a product of local probabilities. I am not actually making any assumption
here; I am just defining the variables of my optimization. I am going to write Q as a
product of the small qs, and the small q can be interpreted as a probability of presence in a given
grid cell. To compute them, what I am going to do is say: well, I am going
to choose my qs so that these two distributions are as close as possible, and one way to
say this is that I want the KL divergence to be as small as possible, so I am looking for a point
where the derivatives with respect to the small qs are zero.
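To make the idea concrete, here is a toy sketch of that mean-field fit, with a made-up joint P over just two binary variables instead of the talk's grid-and-images model: we pick the product distribution Q = q1 * q2 by coordinate updates, which is exactly the zero-derivative fixed point just described.

```python
import numpy as np

# Toy mean-field fit: approximate a joint P(x1, x2) over two binary
# variables by a product Q(x1, x2) = q1(x1) * q2(x2), minimizing
# KL(Q || P) by coordinate updates. The numbers in P are invented.
P = np.array([[0.05, 0.15],   # P[x1, x2], rows: x1 in {0, 1}
              [0.30, 0.50]])
logP = np.log(P)

q1, q2 = 0.5, 0.5             # q_i = probability that x_i = 1
for _ in range(20):
    # q1 is proportional to exp(E_{x2~q2}[log P(x1, x2)]).
    e1 = (1 - q2) * logP[1, 0] + q2 * logP[1, 1]
    e0 = (1 - q2) * logP[0, 0] + q2 * logP[0, 1]
    q1 = 1.0 / (1.0 + np.exp(e0 - e1))
    # Symmetric update for q2, with q1 held fixed.
    e1 = (1 - q1) * logP[0, 1] + q1 * logP[1, 1]
    e0 = (1 - q1) * logP[0, 0] + q1 * logP[1, 0]
    q2 = 1.0 / (1.0 + np.exp(e0 - e1))

print(q1, q2)  # the factorized approximation after the fixed point is reached
```

Iterating the two updates until nothing changes is the same fixed-point scheme the talk uses on the grid, just in a setting small enough to run by hand.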
So, of course, it is easy to write, but you actually have to compute it. So how are we
going to compute this term? This is essentially the [inaudible], the probability of
presence given the binary blobs. So the first thing you do is the standard Bayesian
reformulation…
>>: [inaudible] XK [inaudible] the top. QSK [inaudible] the X [inaudible] many,
many…
>> Pascal Fua: So capital X is a vector over all of the locations; basically it is a
binary vector whose size is the size of the grid.
>>: Oh, so it's [inaudible]?
>> Pascal Fua: If I go back to this: when I write capital X without an index, the
variable is all of these guys, which are all…
>>: [inaudible]
>> Pascal Fua: It is everything. That is the standard Bayes reformulation. I need to actually
estimate the probability of B knowing X and the probability of X. So let's now make a couple
of independence assumptions. The first is that I am not going to take into account the
interactions between people: I am not going to take into account that if I am in this
grid cell, the probability of somebody being in the grid cell next to me might be affected
by that. So I am going to write this P of X1 to XN as just a simple product of the prior
probabilities of there being somebody in each grid cell, and that is going to be a constant.
And then the second one says that all statistical dependencies between views are due to
people being present and moving. In other words, in the scenes I am dealing with, the
only things that are moving are the people. The furniture is fixed. There is nothing else.
And that means that if you know where the people are, then the probabilities
of observing particular binary images in the different views are conditionally
independent, which I will write this way. So when it comes down to it, the only term I
really have to estimate is this probability of observing a particular binary map if I know
where the people are. And to do this is actually fairly simple. I am going to use a very
simple generic model. If I know there is one person continually running around the
room, here are the binary images I expect to see. In other words, I am
representing a person as a cylinder, and if I know I have a person at a given location, the
image I expect to see is a rectangle at the appropriate location, which is given by the fact
that my cameras are calibrated.
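A minimal sketch of how such a rectangle could be obtained, assuming, as the talk mentions later for the PETS sequence, one homography mapping the ground plane into the image and one mapping a "head plane" at 1.8 m; the matrices below are invented placeholders, not real calibration data.

```python
import numpy as np

def apply_h(H, x, y):
    # Map a plane point (x, y) into the image with homography H.
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]

# Placeholder homographies: ground plane and head plane at 1.8 m.
H_ground = np.array([[100.0, 0.0, 320.0],
                     [0.0, 20.0, 400.0],
                     [0.0, 0.0, 1.0]])
H_head = np.array([[100.0, 0.0, 320.0],
                   [0.0, 20.0, 150.0],
                   [0.0, 0.0, 1.0]])

def person_rectangle(gx, gy, width=0.5):
    # Feet and head image positions of the cylinder axis at (gx, gy).
    fx, fy = apply_h(H_ground, gx, gy)
    hx, hy = apply_h(H_head, gx, gy)
    # Half-width in pixels, from projecting the cylinder edge on the ground.
    ex, _ = apply_h(H_ground, gx + width / 2, gy)
    half_w = abs(ex - fx)
    x0, x1 = min(fx, hx) - half_w, max(fx, hx) + half_w
    y0, y1 = min(fy, hy), max(fy, hy)
    return x0, y0, x1, y1

print(person_rectangle(0.0, 0.0))  # the avatar rectangle for a person at the origin
```

The point is only that, once the cameras are calibrated, a grid location deterministically yields one rectangle per view; no image processing is involved.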
Here is, for example, what I would expect to see if I knew there were exactly three people at
three specific locations. And what I am going to do, essentially, is compare this image to
the image I actually observe. One of the strengths of this, which is not shown here,
is that it handles occlusions naturally. If, for example, this person was behind this
one in this particular view, the fact that I am using a generic model means the
synthetic image I would produce would naturally take the occlusion into account without
any complicated reasoning, which [inaudible] is the reason we formulated it in this
particular way.
>>: [inaudible] the image and then compare it with…
>> Pascal Fua: I am going to. What I am really going to do is adjust it to put
ones at the right places so that this image looks as much as possible like the image I
observed. We could have gone a different way. Actually we did: instead of having a
generic model, you could say, I have blobs in one image; these blobs define a cone in
which a person might be; you do that for all possible images and you intersect
the cones. So we tried that. But what happens if you do it this way is that you then have to
do some real reasoning about the occlusions and people intersecting the cones, and it
becomes a massive headache very quickly, which is why we switched to that particular
way of doing things.
And I have to introduce one more concept. I have talked about the synthetic
image, and that was based on the idea that I knew with probability one: if I knew
with probability one that there were three people at three
specific locations, I expected to see this. But now suppose I know, not with probability one but
with some probability between zero and one, that there may be a person at these
locations. What kind of image am I expecting to see? What I expect to see is
something like that, which I am calling the average synthetic image: if you drew
many, many samples of there being or not being a person using those probabilities, generated
the synthetic images and averaged them, you would get this average synthetic
image, which I am going to use. So this is the one you get if you assume that at these
four different locations you have a person with probability 0.2 for each
one of the four.
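As a sketch, the average synthetic image can be painted directly from the qs: under the independence assumption, a pixel covered by several candidate rectangles is foreground unless all of them are empty, so its expected value is one minus the product of the (1 - q) factors. The grid size, rectangles, and probabilities below are illustrative only.

```python
import numpy as np

def average_synthetic_image(shape, rects, qs):
    # Probability that a pixel stays background is the product of (1 - q)
    # over every candidate rectangle covering it; foreground is 1 minus that.
    prob_empty = np.ones(shape)
    for (y0, y1, x0, x1), q in zip(rects, qs):
        prob_empty[y0:y1, x0:x1] *= (1.0 - q)
    return 1.0 - prob_empty

# Two overlapping candidate person rectangles, each occupied with q = 0.2.
rects = [(2, 8, 2, 5), (2, 8, 4, 7)]
img = average_synthetic_image((10, 10), rects, [0.2, 0.2])
print(img[5, 3], img[5, 4], img[0, 0])
```

A pixel under one rectangle gets 0.2, a pixel under both gets 1 − 0.8·0.8 = 0.36, and uncovered pixels stay at zero, which matches the talk's description of painting the rectangle with the probability instead of with ones.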
>>: [inaudible] computation that [inaudible] synthetic image…
>> Pascal Fua: No, because you don't need to; this is where the mean
field assumption comes in. What you do in the implementation is actually
trivial: you paint the rectangle, but instead of putting ones,
you put 0.2. That is it. I don't explicitly do a Monte [inaudible] simulation every time.
>>: I thought you randomly took…
>> Pascal Fua: No, no. I don't, because that would be horribly expensive. This is a
slight approximation. It is known as the mean field approximation. So maybe I should
call it the approximate average synthetic image.
>>: But why is [inaudible]
>> Pascal Fua: It is just an example. It can be any value because, in fact, those numbers are
what we are going to compute in the end. So, back to what I am trying to
estimate: I have this, which is the result of my background subtraction. These are the blobs that
my background subtractor gave me. Let's say I put people here with some values and
I create a synthetic image. What I want to do is measure the distance between these two
images, and the measure of this distance is going to be very simple. I'm going to define
this distance function Psi, which depends on their respective overlap, and I am
going to estimate the probability of observing these blue blobs given X as just e to the
minus Psi. So if the black rectangles and the blue blobs are perfectly superposed, this will
be close to one; if they are completely disjoint, this will be close to zero.
>>: [inaudible] distance [inaudible]
>> Pascal Fua: It works pixel-wise, and that is because it is going to be fast. I will also
get to this: not only is it pixel-wise, but because the avatars are rectangles, we will use
integral images to compute it, and that is how we are going to get real-time
performance. And so this term I was trying to estimate, the probability of observing all
of these binary images given the probabilities of presence, is going to be estimated as this
product of e to the minus Psi, with of course the usual [inaudible] constant that I ignore
when I look for the maximum likelihood.
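Here is a sketch of the integral-image machinery that makes evaluating the overlap terms in Psi cheap: any rectangle sum over the binary map costs four lookups, independent of the rectangle's size. The blob below is toy data, not output of a real background subtractor.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[0:y+1, 0:x+1].
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y0, y1, x0, x1):
    # Sum of img[y0:y1, x0:x1] in O(1) from the integral image.
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

B = np.zeros((10, 10))
B[2:8, 3:6] = 1.0          # a toy foreground blob
ii = integral_image(B)
print(rect_sum(ii, 2, 8, 3, 6))   # overlap with an avatar rectangle that matches the blob
print(rect_sum(ii, 0, 5, 0, 5))   # overlap with a badly placed rectangle
```

Because the avatars are rectangles, the pixel counts entering any overlap-based Psi reduce to a handful of such queries per candidate location, which is what makes the whole scheme real-time.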
>>: [inaudible]
>> Pascal Fua: Yes. I am assuming, so it is 1.8 meters.
>>: What if a child moves around [inaudible]
>> Pascal Fua: No. And I am going to show you why in a second.
>>: [inaudible]
>> Pascal Fua: No. So the answer is that a child who is taller than 0.9 meters, 90 cm, taller than
half the standard size, would be fine. Actually, let me show you; I have this video. So, back to what
I was trying to do: I am trying to make these two distributions as close
as possible as a function of those qs. I am minimizing the KL divergence, and I have
expressed this P of X [inaudible] B as this product of exponentials of minus
the distance between the binary images I observed and those average synthetic images.
And what is important to note, although it doesn't show, is that this is also a function of the
qs. The average images are a function of where the people are, so all of this
is a function of the qs, and differentiating with respect to q makes sense. And in the end, if you
work out the math, what happens?
What ends up happening is that you get the minimum when the q at each location i can be
expressed this way; for all locations, this has to hold. It looks like
a very ugly expression, but in fact it is fairly simple. You take all your
probabilities of presence, you take the one you are trying to estimate, at location
i, and you compute two different average images: one where the corresponding q is
forced to one, so there is definitely somebody, and one where the same q is forced to
zero, so there is definitely nobody. You get these two synthetic images, with and without
somebody at location i, and you compare the distances, with all your other qs being
fixed. So basically what you are measuring is whether you improve the distance by putting
a person at that location or by removing somebody. That is what
this term is, and when it is balanced, q at location i is at an equilibrium and your KL
divergence is minimized.
The formula doesn't matter that much, but the point is that all of this can be computed very,
very quickly. You just have to compute this Psi by doing a few integral image
computations, which is very fast. And what you end up with is a simple iterative scheme: you
start the algorithm with the qs set to some small value, the same small value
everywhere, and then you iterate this operation until it stabilizes, until you find the
fixed point of this iterative scheme, which typically takes four or five iterations. It doesn't
take very long. So, to make this a little bit more visual, first an interpretation: you have
these binary maps, and you are computing the qs so that the average synthetic images are as
close as possible to these maps in all views simultaneously.
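The whole first stage can be sketched in one dimension: locations paint intervals instead of rectangles, observations are 1-D binary blobs, and each q is updated by comparing the mismatch with that location forced occupied versus forced empty. Everything here, the sizes, the blob, the prior term, is made up for illustration.

```python
import numpy as np

def average_image(qs, intervals, width):
    # 1-D analogue of the average synthetic image.
    a = np.ones(width)
    for (x0, x1), q in zip(intervals, qs):
        a[x0:x1] *= (1.0 - q)
    return 1.0 - a

def psi(B, A):
    # Simple L1 mismatch, an illustrative stand-in for the talk's Psi.
    return np.abs(B - A).sum()

def update(qs, B, intervals, log_prior_odds=-2.0):
    # One sweep of the fixed-point scheme: for each location, compare the
    # mismatch with q_i forced to 1 versus forced to 0, then rebalance q_i.
    qs = qs.copy()
    for i in range(len(qs)):
        q1, q0 = qs.copy(), qs.copy()
        q1[i], q0[i] = 1.0, 0.0
        delta = psi(B, average_image(q1, intervals, len(B))) \
              - psi(B, average_image(q0, intervals, len(B)))
        qs[i] = 1.0 / (1.0 + np.exp(delta - log_prior_odds))
    return qs

intervals = [(0, 4), (3, 7), (8, 12)]   # each location's painted span
B = np.zeros(12)
B[8:12] = 1.0                           # one blob, over location 2
qs = np.full(3, 0.01)                   # small uniform start, as in the talk
for _ in range(5):
    qs = update(qs, B, intervals)
print(np.round(qs, 3))
```

After a handful of sweeps the distribution is very peaked: the location whose interval explains the blob converges to a high probability, while the others stay near zero, mirroring the behavior shown on the slides.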
>>: In this particular case you have floaters your [inaudible] which is separate to is much
closer than that path where it is being interpreted is far away.
>> Pascal Fua: Yes. That is why you need multiple cameras. What is going to
happen, and we will see it in the next slide, is that if you have only one camera, you can do
what I just described, and what you are going to get is these probabilities of presence being
kind of spread along the line of sight. But because you have multiple cameras,
what you get is essentially the intersection of these ellipses, and then the ambiguity goes
away.
>>: [inaudible] each other. There are not two cameras…
>> Pascal Fua: No. It is meant for this: the four cameras here in this room would be a perfect
setup, for instance.
>>: So you are not using information that the [inaudible] very far?
>> Pascal Fua: No. Because as you see oftentimes the legs can be cut off. The
background information is just not reliable enough for this.
>>: [inaudible] so it's not like you actually have an occupied [inaudible]
>> Pascal Fua: No. But it seems to be enough. I imagine that if you were to have acrobats in
all sorts of funny positions, I am not really sure what would happen. But you will see in a
moment, when you look at basketball players, who do move quite a bit, that it is good
enough for that. Okay. So what you are seeing here on the slide, with two cameras, is this
iterative scheme happening. You start with a kind of small uniform probability distribution,
and pretty soon it converges to this very peaked thing, where the result is a very
high probability of there being a person here and a bit of residual here.
There are some small numbers because of noise, and because, indeed, what is happening
here, and that is what we just discussed, is that part of the blob here is outside of the
rectangle, so it is not completely explained by this rectangle. The algorithm compensates for
that by adding a little bit of probability, shown in blue here, that corresponds to this. But
the result is still very peaked. The light blue indicates a value that is not zero but is small,
and the black here indicates a value that is very close to one.
>>: [inaudible]
>> Pascal Fua: Because of what is happening, the result is so peaked that the blur is
almost imperceptible. And here is another result with two people. So you see, again, it is the
same idea: it converges very quickly, and the distribution is very peaked, with two
peaks of course, corresponding to the two people. And actually this blob is kind of
interesting; it kind of answers the question of why we don't use the feet: in many
cases the feet just aren't there.
>>: Will we need to know how many people [inaudible]?
>> Pascal Fua: No. It has blobs and it is trying to explain those blobs, so it will
introduce as many black squares as it needs to explain what it sees. So now it is the same
thing, but over time. The previous video didn't have time in it; it was just what you do at
one time step. But now what I am showing is this probability distribution as the video plays,
the converged results, but still without temporal consistency, so all of these maps have been
computed at each time step independently.
You can see that occasionally there is a little bit of blue in this thing, which is when it is
not so sure anymore. Typically that happens when the person is visible in one
camera but not in the other. Okay. Here are three people, same idea; you can see that
one of the dots is blurred once in a while, but not that often. And now, to your question
about people's sizes, here is what happens when the person grows during the video.
The size of the person is changing, and what you can see is that the algorithm is not terribly
sensitive to that, especially when a person is seen in more than one view. So what you are
looking at is a person who is not moving. Well, now he is moving. His size
is changing. So at some point he is bigger than 1.8 meters and at some others he is
smaller than 1.8 meters. That is [inaudible]. So basically we take this as
the output of the background subtraction and we feed it to the algorithm. And
then, to make it a bit more real, here is a kid. That kid is obviously less than 1.8 m and
he is detected just fine.
And I will let this run because in a moment his mother is going to walk in. Okay. So
here is mom; here is little sister [laughter]. And the little sister is not detected. It is not
because the algorithm is sexist; it is because she is below 90 centimeters. She is shorter
than 90 centimeters, and if you look at the formula, what is happening when you
compute the difference between the blobs and the average images is that basically it will put a
rectangle if at least half of it is full. What it tries to do is explain everything that needs to be
explained, and not explain where there is nothing; you pay a penalty if you put an
avatar at a place where there is nobody. And so that is why you have this cutoff at
somebody who is half the standard size: it can go either way. If you are bigger than half the size,
then the gain you get by introducing the avatar overcompensates for the loss
you incur in explaining the top part that doesn't need to be explained. And in the
case of the little girl, who is too short, explaining all of that empty space above her
head is too expensive and the algorithm doesn't introduce her.
>>: You do a frame by frame and how do you maintain [inaudible] the person
[inaudible]? Secret sauce [laughter]?
>> Pascal Fua: No, not secret sauce. I am coming to that; give me one
slide to answer this. The key thing is that this runs in real time. The code is on the web if you
want to play with it, and the code is written so that, for example, if you wanted to
run an experiment feeding it Kinect output, you could, because the background
subtraction is not part of the algorithm. It is just that you would expect…
>>: [inaudible]
>> Pascal Fua: Yes. You need to give it blobs somehow; how you compute the blobs is
your problem. And so, to answer your question: what I have talked about so far is the algorithm
that produces, at every time step independently, a probabilistic occupancy map. The
next step, of course, is to connect the dots and produce the [inaudible]. So the answer
to your question is that what I showed in the previous slide was the output of the two steps;
for the kids, I showed the complete result, not the output of the first step only.
So how are you going to do this? Well, it turns out that a good way is to think
of it in terms of a flow problem. Why? Because you are on a grid, the grid on the
ground, at time T. And as I said, people do not teleport. So if somebody is in a given
cell at time T, it means that at time T-1 they had to be in one of the
neighboring grid cells. At the following time step they can only go to one of the
neighboring grid cells, or they can stay put; not moving in this case means going from a
cell at time T to the same cell at time T+1. In our practical implementation the grid cells
are about 30 cm x 30 cm, which means that at 20 frames per second, people can only move by
one grid cell from frame to frame; 30 cm per frame at 20 frames per second is 6 meters per
second, much faster than people walk.
You can allow larger moves, but of course that becomes
computationally more expensive. So think of it this way: you have a grid at each time step
and you connect the cells across time, so it becomes a graph. People can move from graph
location to graph location, but in this big graph only a very limited number of transitions
are possible, only the ones that correspond to going from one grid cell to one of its
neighboring grid cells at the next time step. So you can think of
this as a flow: people are flowing from grid cell to grid cell, and there are some
constraints on this flow. At time T you have a certain number of people in a
grid cell; in fact, that number can only be zero or one, but I am introducing
these notations because eventually we are going to relax that and make these floating-point
numbers for the purpose of optimization. So at time T, in grid cell K, you can have a certain
number of people N, and this number is the sum of the people who flowed in from the
neighboring grid cells. Likewise, all of these people are
going to flow out into neighboring grid cells during the next time step. So I have N,
which is the number of people, and F, which is the flow from one cell to any
neighboring one at the next time step.
>>: [inaudible] one person [inaudible].
>> Pascal Fua: I do in fact, and N will always be zero or one. But the reason I am forming it
this way is that I am going to reformulate my problem as a linear programming
problem. At this point it is purely binary: all of these numbers are in fact zero or one, because
there is at most one person in a cell, since the cells are about one foot square, and if
there is at most one person in a cell, this number can only ever be zero or one. But the
problem is that, as we know, integer optimization is difficult, so in the end I may want to
reformulate this integer optimization as a floating-point optimization, with numbers
between zero and one, to make it easier. And so I have some constraints on all of these
flows and numbers of objects: the sum of the people arriving is the number in the cell,
which is also equal to the sum of the people leaving at the next time step.
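Written out, the conservation constraints just described are, for every grid cell k and time t, with N(k) denoting the neighborhood of k including k itself (the symbols follow the talk's usage; the exact notation on the slides may differ):

```latex
% People flow in from neighboring cells, occupy the cell, then flow out.
N_t^k \;=\; \sum_{j \in \mathcal{N}(k)} f_{t-1}^{j \to k}
      \;=\; \sum_{j \in \mathcal{N}(k)} f_t^{k \to j},
\qquad 0 \le N_t^k \le 1, \qquad f_t^{k \to j} \ge 0.
```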
>>: [inaudible] some type of [inaudible] what people do is from the previous timestamp
a kind of [inaudible] at some [inaudible] to a neighboring…
>> Pascal Fua: Right now there is no probability. These are really hard constraints; it
cannot be any other way, probability one. Right now there is no image data. I
am just introducing notations.
>>: [inaudible] input [inaudible] it's called 01 [inaudible]
>> Pascal Fua: It is coming, but I have not used it yet. Right now, give me two slides. So
occupancy: there can't be more than one person in any grid cell, and all the flows
are positive; they are between zero and one. There is no negative flow.
So to make this complete, I also need to allow people to enter and exit. I
essentially have a grid at every time instant, and I also have two special
nodes in my graph which I am going to call the source and the sink, which are connected
to all of the nodes that are entrances and exits. In a room like this one, the source and
the sink would be connected to the doors. In something like a basketball court, they
might be connected to all of the grid cells at the edge of the court. And these are special
nodes in the sense that they are not subject to the flow constraint saying that at most one
person can be in a cell: any number of people can enter or exit at any one time.
Right now I have just set up the framework. I have given you a bunch of notations,
defined flows, occupancies and the like, and now what I really want to do is compute
these Ns, whether there are people at those grid locations, in such a way that
they obey the constraints and are as close as possible to the numbers that my first step
gave me. The first step, the occupancy maps I talked about before, gave me probabilities
of presence at every location but without any temporal consistency. Now I want to
compute new occupancies that are as close as possible to those but, at the same
time, obey the flow constraints, so that they are temporally consistent. And that can be
formulated essentially as a maximum likelihood computation: I am going to
compute the Ns, whether there is or is not somebody at a given location, so that this is
maximized, where the qs are those that I computed at the previous step.
So the Ns are as close as possible to the qs but with the additional constraints I
formulated in the previous slides, the flow conservation constraints. And it turns out that
out, this is an integer [inaudible] problem. In general this is going to be hard, but in fact
because our graph is very specific and has a very specific shape, the extreme measure is
totally modular, I believe is the term, which means it is not [inaudible]. You [inaudible]
it and it has this interesting property that if you formulate it as a lineal program, which is
this, which is instead of having these Ns here which are your, between that Ns you
optimize the variables, you optimize with respect to the Fs, the flows. The Ns are a
function of the flows, the Ns are the sums of the Ns, so doesn't change very much. You
can maximize in fact this. This is the same formula of that I have written before because
the sum here is just N under all of these constraints which are the full constraints I
introduced before and this is a linear programming problem. So the additional trick that
you play which is why I have introduced all of this limitations, is instead of treating the
Fs as values of 0 or 1 you treat them as floating-point numbers. And you solve this
problem for those floating-point numbers. And because the problem is, the constraints
are totally modular, what provably happens is when it converges the Fs are binary. So
you treat them as floating-point numbers, but once you have done the optimization you
get zeros and ones. In that is what we have done and that works pretty well, but it is
actually pretty slow. So it turns out you can do the same thing faster using what is known
as a K-[inaudible] algorithm. The K-[inaudible] algorithm is a bit like the [inaudible]
algorithm on this graph, except that instead of finding one path, the best path, you find
the K best paths with [inaudible] and the N.
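The linear-programming relaxation can be sketched on a toy trellis: two grid cells that are mutual neighbors, three time steps, a source feeding t=0 and a sink draining t=2, with node scores log(q/(1-q)) taken from made-up occupancy probabilities. As the talk says, with a totally unimodular constraint matrix the LP optimum comes out binary. This sketch uses SciPy's generic `linprog` solver, not the faster K-shortest-paths method.

```python
import numpy as np
from scipy.optimize import linprog

# Nodes are (t, cell); edge variables are flows. The q values are invented.
edges = [("s0", (0, 0)), ("s1", (0, 1))]            # source -> t = 0
for t in range(2):
    for i in range(2):
        for j in range(2):                           # both cells are neighbors
            edges.append((f"f{t}{i}{j}", (t + 1, j)))
ne = len(edges)

def inflow_row(node):
    return np.array([1.0 if head == node else 0.0 for _, head in edges])

def outflow_row(t, i):
    return np.array([1.0 if name.startswith(f"f{t}{i}") else 0.0
                     for name, _ in edges])

# Conservation: inflow = outflow for t = 0, 1 (t = 2 drains into the sink).
A_eq = np.array([inflow_row((t, i)) - outflow_row(t, i)
                 for t in range(2) for i in range(2)])
b_eq = np.zeros(4)

# At most one person per cell: inflow(t, i) <= 1 for every node.
A_ub = np.array([inflow_row((t, i)) for t in range(3) for i in range(2)])
b_ub = np.ones(6)

# High occupancy probability along the track (0,0) -> (1,0) -> (2,1).
q = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.9, (1, 1): 0.1,
     (2, 0): 0.1, (2, 1): 0.9}
# Maximizing sum of log-odds times occupancy equals summing each edge's
# head-node score times its flow; linprog minimizes, so negate.
c = -np.array([np.log(q[h] / (1 - q[h])) for _, h in edges])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * ne, method="highs")
flows = dict(zip([n for n, _ in edges], np.round(res.x, 6)))
print(flows)  # one unit of flow follows the high-probability cells
```

Even though the flows are allowed to be fractional, the solution puts a flow of exactly one on the single high-scoring track and zero everywhere else, which is the binary-at-the-optimum behavior the total unimodularity argument guarantees.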
>>: Does this require all of the frames to be in the buffer?
>> Pascal Fua: Yes. So it's not tracking; what we are doing is batch processing. We
do this on batches of about 100 frames, and one of the things we get from that is
robustness. It is real time, but with a delay because of the hundred frames. So here is the
kind of result we get. We put several cameras at both ends of this subway
underpass, and this is actually not a very easy sequence. It is not very crowded,
but the lighting is terrible, and when the cameras look at somebody at
the far end, the person is very, very small, only a few pixels. And it does a reasonable job
of tracking these people. Another example: this is the cafeteria at EPFL, and what
it is showing is what happens when people walk in. We are not telling the system how
many people there are; it just figures this out for itself. Basically, when blobs start
appearing, it decides it needs to start explaining them, and it starts doing so.
And what it is showing, of course, is that the identities are preserved reasonably well. One
important thing is that in this sequence appearance is not used at all. The only inputs
are the binary blobs.
And here is another example with people. This is just a basketball practice in front of the
cameras, and we get something like this. And finally, here is an example that actually
surprised us. I don't know if you have heard, but every year there is a PETS
competition where various algorithms that do surveillance compete. We
participated in 2009, and at some point they decided to run it on this monocular sequence,
where it still does a reasonable job. So I am not going to claim that it works for all
monocular sequences; it works for this one because the camera is reasonably high,
which means the monocular uncertainty is still not all that spread out, and the linking
step does the right thing by finding the [inaudible] that goes through the maximum.
>>: [inaudible].
>> Pascal Fua: Yes. So in the world, because we need to know where to draw
our avatars, the ground doesn't have to be flat, but it has to be known. And on this graph, at
least here, we did better than the others.
>>: [inaudible] so this is just a rough guess [inaudible]
>> Pascal Fua: We calibrated on--we gave it a homography for the ground floor. And it
was a subway, so there was a ceiling, and we could do another homography for the
ceiling of the subway. And actually, if you come to the Olympic Museum in Lausanne,
we have this system running in the museum. It has been running for several months now.
There is a kind of room behind, with four cameras observing the visitors. They walk in;
they do something in the room; they walk out; and they see their trajectory on the
screens as they walk out. This is actually the first time somebody came to ask us if we
could do something like that, and I was very scared. As Rick mentioned yesterday, we go
back a long way, and when we started, nothing in vision ever worked. The notion that
anything could possibly ever work was alien to me, and I haven't quite overcome this
bias. But we actually did manage to make it work in a public setting. Of course, I need to
be honest: this is engineered. They carefully put in the cameras to make sure that the
background was the way it was supposed to be, et cetera. But still, it has been working
on its own for quite a while.
>>: [inaudible] obviously you see their [inaudible].
>> Pascal Fua: They see their trace on the screen. So basically, the idea is that it is not
like broadcast. We are going to get to that. In a broadcast-style application where you
track players, there is a bit of a delay, which is happening here. So when they walk out is
when they see what they have done. But if you are going to show something after the
commercial break, of which there seem to be many on TV around here, it is okay; the
small delay does not matter.
Okay. So the last thing: I have told you that we don't use appearance. Up to now we
don't use appearance at all. But in some cases you do want to use appearance, and sports
is a good example, where players have uniforms with numbers on them. You want to use
this. But the trick is, and I think this is a little bit underestimated in many other
state-of-the-art approaches, that you don't get this information all the time. Typically,
the numbers on the players' shirts can in practice only be read once in a while. In most
of the frames they are too blurry, or you don't see them right. So you really need
something where, when you see a number once, you remember that you saw it. You use
it, but you do not depend on seeing it again immediately.
And so what we have done is to expand the framework I have just shown you to do this.
The result I just showed you was obtained in a setting where identities are ignored.
Now, to take identities into account, we take this graph and extend it so that we have
one grid per person. In a team setting, as in this case, we know how many players there
are, so we have one grid per player, which lets us track what each particular person
does. Since it is late, I am not going to go into details, but this formulation can be
expanded to take the identities into account, so that not only do we have the presence,
but we have the probability that player X is there. And it is pretty much the same. It is
just a bit more complicated, but it is still a linear program, a linear program in which
you add one thing that wasn't there before, namely the probability that a particular blob
corresponds to a particular person based on his appearance. And if you don't have any
information, you just put the same probability on all possible identities. In this way, if
you don't know, which is often the case, you can still keep going; you just say "I don't
know" and don't exploit it. So that is actually a very simple extension. It has one
drawback: we are now working on a graph that is much bigger than the previous one,
and it starts becoming slow.
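The "same probability on all possible identities" idea can be sketched like this, with a
hypothetical ten-player roster split into two teams (`TEAMS`, `identity_prior`, and the
roster layout are illustrative names and values, not taken from the actual system):

```python
import numpy as np

# Hypothetical rosters: identities 0-4 are the green team, 5-9 the red team.
TEAMS = {"green": [0, 1, 2, 3, 4], "red": [5, 6, 7, 8, 9]}
N_IDS = 10

def identity_prior(team=None, number_to_id=None):
    """Return a probability vector over the N_IDS identities.
    A read jersey number pins down one identity; a team colour spreads
    the mass over that team; no observation at all gives a uniform
    prior, so the tracker can say 'I don't know' and keep going."""
    p = np.zeros(N_IDS)
    if number_to_id is not None:            # jersey number was read
        p[number_to_id] = 1.0
    elif team is not None:                  # only the team colour was seen
        p[TEAMS[team]] = 1.0 / len(TEAMS[team])
    else:                                   # nothing seen: uniform prior
        p[:] = 1.0 / N_IDS
    return p
```

These per-blob probabilities would enter the linear program as costs on the
identity-specific grids; frames where nothing is readable simply contribute no
preference.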
We are losing the real-time performance, so the trick we use is to first run the algorithm
appearance-free to get a first result in which identities are not necessarily preserved or
accounted for, and to get rid of all the grid cells in which there was nobody, because
they are irrelevant. Then we rerun the linear-program algorithm only on this subset of
the grid cells, so that it is fast again. And what we get is something like this, where we
are tracking soccer players or, in this case, basketball players. These are videos from a
real world championship, somewhere in the Czech Republic, recently. We are working
on this with the International Basketball Federation, in collaboration with an actual
company, and the goal is of course to be present for real at major sports events,
basketball first and others if we can, in the not-too-distant future.
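The pruning trick, run the appearance-free pass first and keep only the grid cells that
were ever occupied, plus a small margin, might look like this toy sketch on boolean
occupancy maps (the function name is illustrative; note that `np.roll` wraps around at
the borders, which a real implementation would avoid):

```python
import numpy as np

def occupied_cells(occupancy_maps, radius=1):
    """Union, over all frames, of cells the appearance-free pass marked
    occupied, dilated by `radius` cells; the identity-aware linear
    program is then rerun on this subset only."""
    union = np.zeros(occupancy_maps[0].shape, dtype=bool)
    for occ in occupancy_maps:
        union |= occ
    # Dilate by OR-ing shifted copies within the Chebyshev radius.
    dilated = union.copy()
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            dilated |= np.roll(np.roll(union, dy, axis=0), dx, axis=1)
    return dilated
```

On a real sequence most of the grid is empty most of the time, which is why restricting
the second, much bigger linear program to this subset recovers the speed.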
>>: What are these numbers here?
>> Pascal Fua: This is just a frame number.
>>: The red and green.
>> Pascal Fua: Oh yeah. In this video, the only appearance we are taking into account is
which team you belong to.
>>: It is not trying to extract exact player identity?
>> Pascal Fua: No. It is the group. So we have the green team, the red team and the refs.
In a different version we could of course also look at the numbers on the backs of the
shirts, which we are not doing in this video.

To conclude, what I have presented is an approach that is robust and can track arbitrary
numbers of people over long periods of time, on the scale of a basketball game. We do
not require appearance information and can do quite a bit without it, but if you have it,
then you can use it. And we get real-time performance. This is not the end of the story;
there is still a lot to be done. One of the weakest parts is the reliance on background
subtraction, because we all know that in many cases it just doesn't work. So one idea
would be to replace it with a people detector. If you want to go to a much more crowded
scene than this one, detectors are going to be required, or a Kinect if you want. Another
idea is to use more sophisticated templates. People actually come out pretty nicely,
because modeling a person as just a cylinder is good enough for many things. But if you
want to detect cars, if you want to go into the street and detect people and cars, which
may move, then you are going to need more sophisticated templates; to the formulation,
though, they are just different kinds of rectangles. And
finally, right now there is no behavioral model whatsoever. The only constraint is that
you cannot move more than a fixed number of grid cells between time steps. That is it.
But of course we know there is more to it than that. For example, in the basketball case
we also track the ball. Obviously, what the players do and where the ball is are not
independent, and this should be taken into account. So these are the kinds of things we
are going to keep working on. And of course, if you are interested, there are the papers
and, just as important, the software, which is actually available on the web; we are
always delighted when people use it. Thank you. [applause]
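The motion constraint mentioned in the talk, that a person cannot move more than a
fixed number of grid cells between time steps, amounts to restricting transitions to a
small Chebyshev neighborhood on the grid. A minimal sketch (function name and
default step size are illustrative):

```python
def reachable_cells(cell, grid_w, grid_h, max_step=1):
    """Cells a person occupying `cell` may occupy at the next time
    step: everything within `max_step` grid cells in Chebyshev
    distance, clipped to the grid bounds."""
    x, y = cell
    return [(nx, ny)
            for nx in range(max(0, x - max_step), min(grid_w, x + max_step + 1))
            for ny in range(max(0, y - max_step), min(grid_h, y + max_step + 1))]
```

In the linear program, this neighborhood determines which transition variables exist
between consecutive time steps; everything outside it is simply not modeled.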
>>: [inaudible] passengers [inaudible].
>> Pascal Fua: They are, because one of the basic assumptions is that no two people can
be in the same location at the same time.
>>: [inaudible].
>> Pascal Fua: That won't work; we are not handling that. But even if they hug each
other, they will be in two different locations. If one is on the other's shoulders, then the
assumption is violated; but if they just hug, the grid cells are small enough that they still
have to occupy two grid cells. The one thing that will create problems when you don't
have appearance, and it is a standard problem, is this: there are cases where you can't
see them. So, is this where they came in, or is it that? Without appearance there is no
answer; the system will do something arbitrary. The one thing the appearance version
does well is that if it has read the number here, the number one, once, and it reads it
again here once, then it will make the right connection.
>>: [inaudible] better than the frames on the two angles.
>> Pascal Fua: There is no obvious--you would imagine that the ones in the middle
would be better, but we have not looked at it systematically.
>>: But your experiments [inaudible]
>> Pascal Fua: Basically we take overlapping things and and then we…
>>: [inaudible]
>> Pascal Fua: Okay. [applause]