>> Rick Szeliski: Good morning everyone. It is my pleasure to reintroduce Pascal Fua, who gave us a talk yesterday on feature descriptors. Pascal is a professor at the École Polytechnique Fédérale de Lausanne. He has done a lot of work in stereo matching and feature detection and also person tracking, and he is going to tell us about the latest work in that area. >> Pascal Fua: Thank you. What we have been doing in the area of people tracking is to work on a multi-camera system that ought to be able to follow a number of people in a potentially crowded space where people occlude each other. Which means we have images like those, and from those video sequences we would like to produce 2-D trajectories, and ideally, of course, we would like to be able to do this over long periods of time using cameras that may be above, which is actually easier, or, as is often the case, at head level. This is important because it causes occlusions. So a lot of what I have to talk about is how you deal with occlusions and how you put together a system that is robust to that. And of course we would like it to work in realistic scenarios where you have thousands of frames, I mean long sequences. A typical timeframe, and I will mention basketball later, would be a whole period of basketball that it should track. The images may or may not be of good quality; for example, we will see images from a subway underpass where the lighting is bad and the resolution is not wonderful. It has to be able to handle that, and of course as soon as you are outdoors you have to deal with shadows and illumination changes and all that fun stuff that makes life more difficult. So the algorithm we put together works in two different steps. It does its reasoning in the ground plane, so we imagine that we have an area we want to model in which people are going to walk in and walk out.
We discretize that area, and in the first step we are going to compute the probability of occupancy at each timeframe independently, so no temporal consistency at this point. We treat every instant separately and estimate how many people there may be in the scene based on what all of the cameras see. And then, given these probabilistic occupancy maps, we will enforce temporal consistency under very, very generic assumptions. Essentially the only assumption we are going to make is that people do not teleport. >>: [inaudible] >> Pascal Fua: Two or more. Actually I will show you that to some extent it actually works with one. The formalism can handle one, but it is really designed for two or three or four. Here is the kind of input that we are going to work with. So we have two graduate students, I know [inaudible] in these cases. What we do is we run a very simple [inaudible] background subtraction algorithm that does something like this. The reason I am showing this is because you can see this background [inaudible] good, and that is actually realistic; that is what is really going to happen in real situations. You are going to need an algorithm that can handle all of this. And by the way, since we are at Microsoft: I am using background subtraction because that is something we have been using for a long time, but if you feel like using the output of the Kinect, you could. Actually, if you are in a situation where the Kinect would work, its output would be better than this. It would actually have more information, and so everything after that should work even better. But for the time being we are going to work with this. So the game we want to play in this first stage of the algorithm is: given multiple binary maps like those, compute this probabilistic occupancy map, which is just something that says, for every grid cell, there is a certain probability of a person being present. So of course this one indicates that there are probably two people, one here and one here.
And there is a bit of noise, and this actually is not binary. It is really a set of floating-point numbers. So how are we going to get those? Let us try to formalize this a little bit. What we are going to try to compute is the probability of there being people, so X1 to XN, binary variables that say whether there is or is not a person at location i, grid cell i, given B1 to BC, which are the binary images. And this is going to be done at individual multiview time frames, at one given instant. We should expect the Bs to be very noisy; we should take advantage of the fact that the size of the blobs gives us some very rough estimate of how far the people are; and, most importantly, it should be done in such a way that it robustly handles occlusions. So to estimate this, which is of course a complicated probability distribution, the first thing that we are going to do is a mean field approximation, where we are going to write it as a product of local probabilities. I am not actually making any assumption here; I am just defining the variables of my optimization. So I am going to write Q as a product of the small qs, and the small q can be interpreted as a probability of presence in any grid cell. And to compute them, what I am going to do is say: well, I am going to choose my qs so that these two distributions are as close as possible, and one way to say this is that I want the KL divergence to be as small as possible, so I am looking for a place where the derivatives with respect to the small qs are zero. It is easy to write, but you actually have to compute this. So how are we going to compute this term? This is essentially the [inaudible], the probability of presence given the binary blobs. The first thing you do is the standard Bayesian reformulation… >>: [inaudible] XK [inaudible] the top.
QSK [inaudible] the X [inaudible] many, many… >> Pascal Fua: So capital X is a vector of all of the locations; basically it is a binary vector whose size is the size of the grid. >>: Oh, so it's [inaudible]? >> Pascal Fua: If I go back to this, when I write capital X without an index, the variable is all of these guys, which are all… >>: [inaudible] >> Pascal Fua: Is everything. That is the standard Bayes reformulation. I need to actually estimate the probability of B knowing X and the probability of X. So let us now make a couple of independence assumptions. First, I am not going to take into account the interactions between the people. I am not going to take into account that, if I am in this grid cell, the probability of somebody being in the grid cell next to me might be affected by that. So I am going to write this P of X1 to XN as just a simple product of the prior probabilities of there being somebody in the grid cell, and that is going to be a constant. And the second one says that all statistical dependencies between views are due to people being present and moving. In other words, in the scenes I am dealing with, the only things that are moving are the people. The furniture is fixed. There is nothing else. And that means that, if you know where the people are, then the probabilities of observing a particular binary image in the different views are conditionally independent, which I will write this way. So when it comes down to it, the only term I really have to estimate is this probability of observing a particular binary map if I know where the people are. And to do this is actually fairly simple. I am going to use a very simple generative model. If I know there is one person continually running around the room, here are the images, the binary images, I expect to see.
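Written out, the factorization just described looks like this (notation reconstructed from the talk, not copied from the slides):

```latex
% Independence assumptions: constant prior per cell, conditionally
% independent views given the occupancy vector X = (X_1, ..., X_N)
P(X \mid B_1,\dots,B_C) \;\propto\; \prod_{c=1}^{C} P(B_c \mid X)\,\prod_{i=1}^{N} p(X_i)

% Mean-field family: a product of per-cell Bernoulli distributions,
% fitted by minimizing the KL divergence to the posterior
Q(X) \;=\; \prod_{i=1}^{N} q_i^{X_i}\,(1-q_i)^{1-X_i},
\qquad
\{q_i\} \;=\; \arg\min_{\{q_i\}}\; \mathrm{KL}\bigl(Q \,\big\|\, P(\cdot \mid B_1,\dots,B_C)\bigr)
```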
So in other words, I am representing a person as a cylinder, and if I know I have a person at a given location, the image I expect to see is a rectangle at the appropriate location, which is given by the fact that my cameras are calibrated. Here is, for example, what I would expect to see if I knew there were exactly three people at three specific locations. And what I am going to do, essentially, is compare this image to the image I actually observe. And one of the strengths of this, which is not shown here, is that it handles occlusions naturally. So if, for example, this person were behind this one in this particular view, the fact that I am using a generative model means the synthetic image I would produce would naturally take into account the occlusion without any complicated reasoning, which [inaudible] is the reason we formulated it in this particular way. >>: [inaudible] the image and then compare it with… >> Pascal Fua: I am going to. What I am really going to do is adjust it to put ones at the right places so that this image looks as much as possible like the image I observed. We could have gone a different way. Actually we did--instead of having a generative model, you could say: I have blobs in one image. These blobs define a cone in which a person might be, and then you do that for all possible images and you intersect the cones. We tried that. But what happens if you do it this way is that you then have to do some real reasoning about the occlusions and people intersecting the cones, and it becomes a massive headache very quickly, and this is why we switched to that particular way of doing things. And I have to introduce one more concept. I have talked about the synthetic image; that was based on the idea that I knew exactly, with probability one, where the people were, the qs here being zero or one. So if I know, with probability one, that there are three people at three specific locations, I expect to see this.
But now suppose that I know, not with probability one but with some probability between zero and one, that there may be a person at these locations. What kind of image am I expecting to see? What I expect to see is something like this, which I am calling the average synthetic image: if you drew many, many realizations of there is or is not a person using those probabilities, generated the synthetic images and averaged them, you would get this, the average synthetic image, which I am going to use. So this is the one you get if you assume that at these four different locations you have a person with probability 0.2 for each one of the four. >>: [inaudible] computation that [inaudible] synthetic image… >> Pascal Fua: No. Here there is a bit, that is where the mean field assumption comes in. What you do in the implementation is actually just trivial: you paint the rectangle, but instead of putting ones you put 0.2. That is it. I do not explicitly do a multi-[inaudible] simulation every time. >>: I thought you randomly took… >> Pascal Fua: No, no. I do not, because that would be horribly expensive. This is a slight approximation. It is known as the mean field approximation. So maybe I should call it the approximate average synthetic image. >>: But why is [inaudible] >> Pascal Fua: It is just an example. It can be any value, because, in fact, what we are doing is iterating to compute those numbers in the end. So, back to what I am trying to estimate. I have this, which is the result of my background subtraction; these are the blobs that my background extractor gave me. And let us say I put people here with some values and I create a synthetic image. What I want to do is measure the distance between these two images, and the measure of this distance is going to be very simple. I am going to define this matching function Psi, which depends on their respective overlap.
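The average synthetic image just described can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the function name and the rectangle representation are assumptions, and occupied cells are taken to project to axis-aligned rectangles in the view (the calibrated-camera step is assumed done).

```python
import numpy as np

def average_synthetic_image(shape, rects, q):
    """Average synthetic image: per-pixel probability of being foreground.

    Assumes each grid location k is occupied independently with probability
    q[k], and that an occupant at k projects to the axis-aligned rectangle
    rects[k] = (x0, y0, x1, y1) in this view.
    """
    # A pixel is background only if every rectangle covering it is empty:
    # P(pixel on) = 1 - prod_k (1 - q[k] * [pixel inside rect k])
    p_off = np.ones(shape)
    for (x0, y0, x1, y1), qk in zip(rects, q):
        p_off[y0:y1, x0:x1] *= 1.0 - qk
    return 1.0 - p_off
```

With a single location at probability 0.2, this paints 0.2 inside the rectangle and 0 elsewhere, matching the "paint the rectangle with 0.2 instead of one" description; where rectangles overlap, the probabilities combine rather than sum.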
And I am going to estimate the probability of observing these blue blobs given X as just e to the minus Psi, so if the black rectangles and the blue blobs are perfectly superposed this will be close to one, and if they are completely disjoint this will be close to zero. >>: [inaudible] distance [inaudible] >> Pascal Fua: It is pixel-wise, and that is because it is going to be fast. And also, I will get to this: not only is it pixel-wise, but because the avatars are rectangles, we will use integral images to compute them, and that is how we are going to get real-time performance. And so this term I was trying to estimate, the probability of observing all of these binary images given the probabilities of presence, is going to be this product of e to the minus Psi, of course with the usual [inaudible] constant that I will ignore when I look for the maximum likelihood. >>: [inaudible] >> Pascal Fua: Yes. I am assuming, so, it is 1.8 meters. >>: What if a child moves around [inaudible] >> Pascal Fua: No. And I am going to show you why in a second. >>: [inaudible] >> Pascal Fua: No. So the answer is: a child that is taller than 0.9 meters, 90 centimeters, that is, taller than half the standard size, would be fine. Actually, let me show you, I have this video. So, back to what I was trying to do: I am trying to make these two distributions as close as possible as a function of those qs. I am minimizing the KL divergence, and I have expressed this p of X [inaudible] B as this product of exponentials of minus the distance between the binary images I observed and those average synthetic images. And what is important to note, it does not show, but this is also a function of the qs. Because the average images are a function of where the people are, all of this is a function of the qs, so differentiating with respect to q makes sense. And in the end, if you work out the math, what you get… what happened here?
What ends up happening is that you get the minimum when, for q at location i, for all locations, this has to hold. This looks like a very ugly expression, but in fact it is fairly simple. What it says is: you take all your probabilities of presence, you take the one you are trying to estimate, at location i, and you compute two different average images, the one where the corresponding q is forced to one, so there is definitely somebody, and the one where the same q is forced to zero, definitely nobody. So you get these two synthetic images, with and without somebody at location i, and you compare the distances, with all your other qs being fixed. So basically what you are measuring is: if I put a person at this location, do I improve the distance, or do I improve the distance by removing somebody? That is what this term is, and when it is balanced you are at an equilibrium and your KL divergence is minimized. The formula does not matter that much, but the point is that all of this can be computed very, very quickly. You just have to compute this Psi by doing a few integral image computations; it is very fast. And what you end up with is a simple iterative scheme where you start the algorithm by having these qs be some small value, the same small value everywhere, and then you iterate this operation until it stabilizes, until you find the fixed point of this iterative scheme, which typically takes four or five iterations. It does not take very long. So, to make this a little bit more visual, first an interpretation: you have these binary maps and you are computing the qs so that the average synthetic images are as close as possible to these maps in all possible views simultaneously. >>: In this particular case you have floaters your [inaudible] which is separate to is much closer than that path where it is being interpreted is far away. >> Pascal Fua: Yes. That is why you need multiple cameras.
So what is going to happen, and we will see it in the next slide, is that if you have only one camera, you can do what I just described, and what you are going to get is these probabilities of presence being kind of spread along the line of sight. But because you have multiple cameras, what you get is essentially the intersection of these ellipses, and then the ambiguity goes away. >>: [inaudible] each other. There are not two cameras… >> Pascal Fua: No. It is meant for something like the four cameras here in this room, for instance. >>: So you are not using the information that the [inaudible] very far? >> Pascal Fua: No. Because, as you see, oftentimes the legs can be cut off. The background information is just not reliable enough for this. >>: [inaudible] so it's not like you actually have an occupied [inaudible] >> Pascal Fua: No. But it seems to be enough. I imagine if you were to have acrobats in all sorts of funny positions I am not really sure what would happen. But you will see in a moment, when you look at basketball players, who do move quite a bit, it is good enough for that. Okay. So what you are seeing here on the slide, with two cameras, is this iterative scheme happening. You start with kind of a small uniform probability distribution, and pretty soon it converges to this very peaked thing, where the result is a very high probability of there being a person here and a bit of residual here. There are some small numbers because of noise, and because, indeed, what is happening here, and that is what we just discussed, is that part of the blob here is outside of the rectangle, so it is not completely explained by this rectangle, and it compensates for that by adding a little bit of probability, shown in blue here, that corresponds to this. But in fact the result is still very peaked.
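The iterative scheme just described can be sketched as a coordinate-wise fixed-point update. This is a sketch under assumptions: the callable `dist` is a hypothetical stand-in for the total image distance (the sum over views of Psi between the observed blobs and the average synthetic image), and the balance condition is written as a log-odds update, which is one standard way to express a mean-field fixed point; the talk's exact formula may differ in details.

```python
import numpy as np

def mean_field_fixed_point(dist, n_cells, prior=0.01, n_iter=5):
    """Iterate the mean-field update until the q's stabilize.

    `dist(q)` is assumed to return the total distance between the observed
    binary maps and the average synthetic images generated from the
    occupancy probabilities q (computed with integral images in the talk;
    here it is an abstract callable).
    """
    q = np.full(n_cells, prior)          # same small value everywhere
    for _ in range(n_iter):              # typically 4-5 iterations suffice
        for i in range(n_cells):
            q1, q0 = q.copy(), q.copy()
            q1[i], q0[i] = 1.0, 0.0      # force somebody / nobody at cell i
            # Balance condition: log-odds of presence equals the prior
            # log-odds minus the change in distance caused by adding the
            # person, with all other q's held fixed.
            log_odds = np.log(prior / (1 - prior)) - (dist(q1) - dist(q0))
            q[i] = 1.0 / (1.0 + np.exp(-log_odds))
    return q
```

On a toy `dist` that rewards presence in cell 0 and penalizes it in cell 1, the q's converge to a peaked result, mirroring the behavior shown on the slide.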
The light blue indicates a value that is not zero but is small, and the black here indicates a value that is very close to one. >>: [inaudible] >> Pascal Fua: Because of what is happening, the result is so peaked that the blur is almost imperceptible. And here is another result with two people. So see, again, it is the same idea, where it converges very quickly; the distribution is very peaked, with two peaks of course corresponding to the two people. And actually this blob is kind of interesting; it kind of answers the question of why we do not use the feet: in many cases the feet just are not there. >>: Do we need to know how many people [inaudible]? >> Pascal Fua: No. It has blobs and it is trying to explain those blobs, so it will introduce as many black squares as it needs to explain what it sees. So now it is the same thing but over time. The previous video did not have time in it; it was just what you did at one time step. But now what I am showing is this probability distribution as the video goes, the converged results, but still without temporal consistency, so all of these maps have been computed at each time step independently. You can see that occasionally you see a little bit of blue in this thing, which is when it is not so sure anymore. And typically that happens when the person is visible in one camera but not in the other. Okay. Here there are three people, same idea, so you can see that one of the dots is blurred once in a while, but not that often. And now, to your question about people's sizes, here is what happens when the person is growing during the video. So the size of the person is changing, and what you can see is that the algorithm is not terribly sensitive to that, especially when a person is seen in more than one view. So what you are looking at is a person who is not moving. Well, now he is moving. His size is changing.
So at some points he is bigger than 1.8 meters and at some others he is smaller than 1.8 meters. That is [inaudible]. So basically we take this as the output of the background subtraction and we feed it to the algorithm. And then, to make it a bit more real, here is a kid. That kid is obviously less than 1.8 meters tall and he is detected just fine. And I will let this run, because in a moment his mother is going to walk in. Okay. So here is mom; here is little sister [laughter]. And the little sister is not detected. It is not because the algorithm is sexist; it is because she is below 90 centimeters. She is shorter than 90 centimeters, and if you look at the formula, what happens when you compute the difference between the blobs and the average images is that, basically, it will put a rectangle if at least half of it is full. What it tries to do is explain everything that needs to be explained, and not explain where there is nothing to explain. So you pay a penalty if you put an avatar at a place where there is nobody. And that is why you have this cutoff. For somebody who is half the size, it can go either way. If you are bigger than half the size, then the gain that you get by introducing the avatar overcompensates for the loss that you incur in explaining all of the top part that does not need to be explained. And in the case of the little girl, who is too short, explaining all of that empty space above her head is too expensive and the algorithm does not do it. >>: You do it frame by frame, so how do you maintain [inaudible] the person [inaudible]? Secret sauce [laughter]? >> Pascal Fua: No. Not secret sauce. I am coming to that. Give me one slide to answer this. First, this runs in real time. The code is on the web if you want to play with it, and the code is also done so that, for example, if you want to make an experiment of running this on Kinect output, you could.
Because the background subtraction is not part of the algorithm. It is just that you would expect… >>: [inaudible] >> Pascal Fua: Yes. You need to give it blobs somehow. How you compute the blobs is your problem. And so, to answer your question: what I have talked about is the algorithm that produces, at every time step independently, a probabilistic occupancy map. The next step, of course, is to connect the dots and produce the [inaudible], and so the answer to your question is that what I showed in the previous slide was the output of the two steps. I do not have the output of the first step only for the kids; I showed the complete result. So how are you going to do this? Well, it turns out that a good way is to think of it in terms of a flow problem. Why? Because you are on a grid, the grid on the ground, at time T. And, as I said, people do not teleport. So if somebody is in a given cell at time T, it means that at time T-1 they had to be in one of the neighboring grid cells. And at the following time step they can only go to one of the neighboring grid cells. And they can also not move; not moving in this case means being here at time T and being at the same place at time T+1. Typically, in our practical implementation, the grid cells are about 30 cm x 30 cm, which means that at 20 frames per second people can only move by one grid cell from frame to frame. You can allow more, but of course it becomes computationally more expensive if you do. So think of it as: you have each grid, you connect them, and it becomes a graph. People can move from graph location to graph location, but in this big graph only a very limited number of transitions are possible, only the ones that correspond to going from one grid cell to one of the neighboring grid cells at the next time step.
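The allowed transitions just described can be enumerated directly. A minimal sketch, assuming an 8-connected ground grid with cells indexed row-major; the function name is illustrative:

```python
def allowed_transitions(w, h):
    """Edges of the tracking graph for one time step on a w x h ground grid.

    From each cell you may stay put or move to one of the (up to) 8
    spatially adjacent cells by the next frame; no other transition exists.
    Cells are indexed row-major: cell = y * w + x.
    """
    edges = []
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < w and 0 <= ny < h:
                        edges.append((y * w + x, ny * w + nx))
    return edges
```

An interior cell has nine outgoing edges (including staying put); a corner cell has four. This sparsity is what keeps the graph, and hence the later optimization, tractable.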
And in addition to this, you can think of this as a flow. People are flowing from grid cell to grid cell, and there are some constraints on this flow. If at time T you have a certain number of people in a grid cell, in fact that number can only be zero or one. But I am introducing these notations because eventually we are going to relax that and make these floating-point numbers for the purpose of optimization. So at time T, grid cell K, you can have a certain number of people N. And this number is the sum of the people who flowed in from the neighboring grid cells. Likewise, all of these people are going to flow out into neighboring grid cells during the next time step. So I have N, which is the number of people, and F, which is the flow from one cell to a neighboring one at the next time step. >>: [inaudible] one person [inaudible]. >> Pascal Fua: In fact, yes, N will always be 0 or 1. But the reason I am formulating it this way is because I am going to reformulate my problem as a linear programming problem. At this point it is purely binary; all of these numbers are, in fact, 0 or 1, because there is at most one person in a cell, since the cells are about one foot square, and if there is at most one person in a cell, this number can only ever be 0 or 1. But the problem is that, as we know, integer optimization is difficult, so in the end I may want to reformulate this integer optimization as a floating-point optimization, with numbers between zero and one, to make it easier. And so I have some constraints on all of these flows and numbers of objects: the sum of the people arriving is the number in the cell, which is also equal to the sum of the people leaving at the next time step. >>: [inaudible] some type of [inaudible] what people do is from the previous time step a kind of [inaudible] at some [inaudible] to a neighboring… >> Pascal Fua: Right now there is no probability.
These are really hard constraints. It cannot be any other way; this holds with probability one. Right now there is no image data; basically, I am just introducing notations. >>: [inaudible] input [inaudible] it's called 01 [inaudible] >> Pascal Fua: It is coming, but I have not used it yet. Give me two slides. So, occupancy: there cannot be more than one person in any grid cell. And all the flows are positive; they are between zero and one. There is no negative flow. To make this complete, I also need to allow people to enter and exit. So here I essentially have a grid at every time instant, and I also have two special nodes in my graph, which I am going to call the source and the sink, which are connected to all of the nodes that are entrances and exits. So in a room like this one, the source and the sink would be connected to the doors. In something like a basketball court, they might be connected to all of the grid cells at the edge of the court. And these are special nodes in the sense that they are not subject to the flow constraint saying that at most one person can come in: any number of people can enter or exit at any one time. Right now I have just set up the framework. I have given you a bunch of notations, defined flows, occupancies and the like, and now what I really want to do is compute these Ns, that is, whether there are people at those grid locations, in such a way that they obey the constraints and are as close as possible to the numbers that my first step gave me. The first step, the occupancy maps I talked about before, gave me probabilities of presence at every location but without any temporal consistency. Now I want to compute new occupancies that are as close as possible to those but, at the same time, obey those flow constraints, so that they are temporally consistent. And that can be formulated as, essentially, a maximum likelihood computation.
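In symbols, the conservation constraints and the maximum-likelihood objective just described look like this (indices reconstructed from the talk; $N(i)$ denotes cell $i$ together with its neighbours):

```latex
% Flow conservation at cell i, time t: what flowed in is what is there,
% which is what flows out at the next step
\sum_{j \in N(i)} f^{t-1}_{j \to i} \;=\; n^{t}_{i} \;=\; \sum_{j \in N(i)} f^{t}_{i \to j},
\qquad 0 \le n^{t}_{i} \le 1, \qquad f^{t}_{i \to j} \ge 0

% Maximum likelihood: stay as close as possible to the per-frame q's
\max_{\{n^t_i\}} \;\prod_{t,i} (q^t_i)^{n^t_i}\,(1-q^t_i)^{1-n^t_i}
\;\;\Longleftrightarrow\;\;
\max_{\{n^t_i\}} \;\sum_{t,i} n^t_i \,\log\frac{q^t_i}{1-q^t_i}
```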
It means I am going to compute the Ns, whether there is or is not somebody at a given location, so that this is maximized, where the qs are those that I computed at the previous step. So the Ns are as close as possible to the qs, but with the additional constraints I formulated in the previous slides, the flow conservation constraints. And this turns out to be an integer programming problem. In general this is going to be hard, but because our graph is very specific and has a very specific shape, the constraint matrix is totally unimodular, I believe is the term, which means it is not [inaudible]. You [inaudible] it, and it has this interesting property: if you formulate it as a linear program, which is this, then instead of optimizing with respect to the Ns, you optimize with respect to the Fs, the flows. The Ns are a function of the flows, the Ns are sums of the Fs, so it does not change very much. You can maximize this; it is the same formula I wrote before, because the sum here is just N, under all of the flow constraints I introduced before, and this is a linear programming problem. So the additional trick that you play, and this is why I introduced all of these notations, is that instead of treating the Fs as values of 0 or 1, you treat them as floating-point numbers. And you solve this problem for those floating-point numbers. And because the constraints are totally unimodular, what provably happens is that when it converges the Fs are binary. So you treat them as floating-point numbers, but once you have done the optimization you get zeros and ones. That is what we have done, and it works pretty well, but it is actually pretty slow. So it turns out you can do the same thing faster using what is known as a K-shortest paths algorithm.
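As a toy illustration of the relaxation just described, here is a two-cell, one-time-step version solved with `scipy.optimize.linprog`. The q values are illustrative, not data from the talk; the point is that even though the flows are optimized as floating-point numbers, the optimum comes out binary, as the total-unimodularity argument predicts.

```python
import numpy as np
from scipy.optimize import linprog

# Two grid cells between a source and a sink; one time step.
# Rewards are the log-odds of illustrative detection probabilities.
q = np.array([0.9, 0.1])
w = np.log(q / (1 - q))          # positive for likely cells, negative otherwise

# Variables: [f_src->1, f_src->2, f_1->sink, f_2->sink]
# Conservation at each cell: inflow minus outflow equals zero.
A_eq = np.array([[1.0, 0.0, -1.0, 0.0],
                 [0.0, 1.0, 0.0, -1.0]])
b_eq = np.zeros(2)

# linprog minimizes, so negate the reward on the flows through the cells.
res = linprog(c=[0.0, 0.0, -w[0], -w[1]],
              A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, 1.0)] * 4)
flows = res.x                    # relaxed LP, yet the solution is integral
```

Here the cell with high q attracts one full unit of flow and the unlikely cell none, with no rounding step required.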
And the K-shortest paths algorithm is a bit like the Dijkstra algorithm on this graph, except that instead of finding one path, the best path, you find the K best paths, with [inaudible] and the N. >>: Does this require all of the frames to be in the buffer? >> Pascal Fua: Yes. So it is not tracking; what we are doing is batch processing. We do this on batches of about 100 frames, and one of the things we get from that is robustness. It is real-time but with a delay, because of the hundred frames. So here is the kind of result we are going to get. We put several cameras at both ends of this subway underpass, and this is actually not a very easy sequence. It is not very crowded, but the lighting is terrible, and when the cameras look at somebody at the far end, that person is very, very small, only a few pixels. And so it does a reasonable job of tracking these people. Another example: this is the cafeteria at the EPFL, and what it is showing is what happens when people walk in. We are not telling the system how many people there are; it just figures this out for itself. So basically what happens is, when blobs start appearing, it decides it needs to start explaining them, and it starts doing so. And what it is showing, of course, is that the identities are preserved reasonably well. And one important thing: in this sequence, appearance is not used at all. The only input is the binary blobs. And here is another one with people. This is just a basketball practice with four cameras. And we get something like this. And finally, here is an example that actually surprised us. I do not know if you have heard, but every year there is a PETS competition where various algorithms that do surveillance compete. We participated in 2009, and at some point they decided to run it on this monocular sequence, where it still does a reasonable job.
So I am not going to claim that it works for all monocular sequences; it works for this one because the camera is reasonably high, which means the monocular uncertainty is still not all that spread out, and then the linking step does the right thing by finding the [inaudible] that goes through the maxima. >>: [inaudible]. >> Pascal Fua: Yes. Because we need to know where to draw our avatars. It does not have to be flat, but it has to be known. And so on this graph, at least here, we did better than the others. >>: [inaudible] so this is just a rough guess [inaudible] >> Pascal Fua: We calibrated; we gave a homography for the ground floor. And it was a subway, so there was a ceiling, and we could do another homography for the ceiling of the subway. And actually, if you come to Lausanne, to the Olympic Museum, we have this thing running in the museum. It has been running for several months now. There is kind of a room behind, with four cameras observing the visitors. They walk in; they do something in the room; they walk out; and they see their trajectory on the screens as they walk out. And this is actually the first time somebody came to ask us if we could do something like that, and I was actually very scared. As Rick mentioned yesterday, we go back a long way, and when we started, nothing in vision ever worked. The notion that anything could possibly ever work was alien to me, and I have not quite overcome this bias. But we did manage to make it work in a public setting. Of course, I need to be honest: this is engineered. They carefully put in the cameras to make sure that the background was the way it was supposed to be, et cetera. But still, it has been working on its own for quite a while. >>: [inaudible] obviously you see their [inaudible]. >> Pascal Fua: They see their trace on the screen. So basically the idea is, it is not like broadcast. We are going to get to that.
In a broadcast-style application, if you track players, there is a bit of a delay, which is what is happening here. They walk out, and when they walk out is when they see what they've done. But if you are going to show something after the commercial break, of which there seem to be many on TV around here, that is okay; the small delay does not matter. Okay. So the last thing: I have told you that we don't use appearance, and up to now we don't use appearance at all. But in some cases you do want to use appearance, and sports is a good example, where players have uniforms with numbers on them. You want to use this. But the trick is, and I think this is a little bit underestimated in many other state-of-the-art approaches, that you don't get this information all the time. Typically, the numbers on the players' shirts can in practice only be read once in a while; in most of the frames it is too blurry or you don't see them right. So you really need something where, if you see a number once, you remember that you saw it and you use it, but you do not depend on seeing it again immediately. And so we have expanded the framework I have just shown you to do this. The result I just showed you is computed in a context where identities are ignored. To take identities into account, we take this graph and extend it so that we have one grid per person. Typically, in a team setting, we know how many players there are, so we will have one grid per player, which lets us track what each particular person does. Since it is late, I am not going to go into details, but this formulation can be expanded to take identities into account, so that not only do we have the presence, we also have the probability that player X is there. And it is pretty much the same.
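The "use it when you see it, but don't depend on seeing it again" idea comes down to how the identity term is set when no appearance cue is available in a frame. A minimal sketch under that assumption, with hypothetical names; the real system folds these probabilities into the linear program rather than computing them in isolation like this:

```python
def identity_probabilities(appearance_score, identities):
    """Return a probability per identity for one detected blob in one frame.
    appearance_score is a dict {identity: score} from, say, a jersey-number
    reader, or None when nothing was legible.  With no cue we fall back to a
    uniform distribution, so the optimization simply has nothing to exploit
    in that frame and keeps going."""
    if not appearance_score:
        p = 1.0 / len(identities)
        return {i: p for i in identities}
    total = sum(appearance_score.get(i, 0.0) for i in identities)
    if total == 0.0:
        p = 1.0 / len(identities)
        return {i: p for i in identities}
    return {i: appearance_score.get(i, 0.0) / total for i in identities}
```

The key design point is the uniform fallback: saying "I don't know" costs nothing, whereas forcing an identity guess in every frame would propagate errors through the whole batch.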
It is just a bit more complicated, but it is still a linear program. It is a linear program to which you add one thing that wasn't there before, which is the probability that a particular blob corresponds to a particular person based on his appearance. And if you don't have any information, you just put the same probability on all possible identities. In this way, if you don't know, which is often the case, you can still keep going; you just say I don't know and don't exploit it. So that is actually a very simple extension. It has one drawback: we are now working on a graph that is much bigger than the previous one, and it starts becoming slow. We are losing the real-time performance, so the trick we use is to first run the algorithm appearance-free to get a first result, in which identities are not necessarily preserved or accounted for, and to get rid of all the grid cells in which there was nobody, because they are irrelevant. Then we rerun the linear-program algorithm only on this subset of the grid cells, so that it is fast again. And what we get is something like this, where we are tracking soccer players or, in this case, basketball players. These are videos from a real world championship, somewhere in the Czech Republic recently. We are working on this with the International Basketball Federation, in collaboration with an actual company, and the goal is of course to be present for real at major sports events, basketball first and others if we can, in the not-too-distant future. >>: What are these numbers here? >> Pascal Fua: This is just a frame number. >>: The red and green. >> Pascal Fua: Oh yeah. In this video, the only appearance we are taking into account is which team you belong to. >>: It is not trying to extract exact player identity? >> Pascal Fua: No. It is the group. So we have the green team, the red team and the refs.
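The speed-up trick described above, pruning the grid cells that the appearance-free pass never marks as occupied before rerunning the identity-aware linear program, can be sketched as follows. The data layout and threshold are illustrative assumptions, not the system's actual interfaces.

```python
def prune_grid(occupancy, threshold=1e-3):
    """Keep only the grid cells worth passing to the identity-aware rerun.
    occupancy maps each frame index to {cell: occupancy probability} from the
    appearance-free pass; a cell survives if it was plausibly occupied in at
    least one frame of the batch."""
    keep = set()
    for cells in occupancy.values():
        for cell, p in cells.items():
            if p >= threshold:
                keep.add(cell)
    return keep
```

Since people cover only a tiny fraction of the grid at any time, this typically shrinks the identity-aware graph by orders of magnitude, which is what restores near-real-time performance.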
In a different version we could of course also look at the numbers on the backs of the shirts, which we are not doing in this video. To conclude, what I have presented is an approach that is robust and can track arbitrary numbers of people over long periods of time, on the scale of a basketball game. We do not require appearance information and can do quite a bit without it, but if you have it, then you can use it. And we get real-time performance. This is not the end of the story; there is still a lot to be done. One of the weakest parts of this is the reliance on background subtraction, because we all know that in many cases it just doesn't work. So one idea would be to replace it by a people detector; if you want to go to a much more crowded scene than this, those are going to be required, or a Kinect if you want. Another direction is to use more sophisticated templates. People actually come out pretty nicely, because modeling them as just a cylinder is pretty much good enough for many things. But if you want to go into the street and detect people and cars, which both move, then you are going to need more sophisticated templates; in the formulation, supporting this is just a matter of different kinds of rectangles. And finally, right now there is no behavioral model whatsoever. The only constraint is that you cannot move more than a certain number of grid cells between timestamps. That is it. But of course we know there is more to it than that. For example, in the basketball case we also track the ball. Obviously, what the players do and where the ball is are not independent, and this should be taken into account. So these are the kinds of things we are going to keep working on, and of course if you are interested, there are the papers; just as important, the software is actually available on the web, and we are always delighted when people use it. Thank you.
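The sole motion constraint mentioned above, that nobody teleports between timestamps, is simple enough to state in code. A minimal sketch, assuming tracks are lists of (row, column) grid cells, one entry per frame, and a maximum step of one cell per direction:

```python
def physically_plausible(track, max_step=1):
    """Check the only motion constraint used: between consecutive frames a
    person may move at most max_step grid cells in each direction
    (Chebyshev distance), i.e. no teleporting."""
    for (r0, c0), (r1, c1) in zip(track, track[1:]):
        if max(abs(r1 - r0), abs(c1 - c0)) > max_step:
            return False
    return True
```

In the flow-graph formulation this constraint is not checked after the fact; it is built in, because edges only exist between a cell at frame t and its immediate neighborhood at frame t+1.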
[applause] >>: [inaudible] passengers [inaudible]. >> Pascal Fua: They are, because one of the basic assumptions is that no two people can be in the same location at the same time. >>: [inaudible]. >> Pascal Fua: That won't work. But even if they hug each other, they will be in two different locations. If they are on each other's shoulders, then the assumption is violated, but if they hug, the grid cells are small enough that they still have to occupy two grid cells. The one thing that will create problems when you don't have appearance, and that is a standard problem, is this: there are cases where you can't see them, and then, is this the person who came in here, or is it that one? Without appearance, there is no answer; the algorithm is going to do something arbitrary. The one thing the appearance extension does well is that if it has read the number here, say the number one, once, and it reads it again there once, then it will make the right connection. >>: [inaudible] better than the frames on the two angles. >> Pascal Fua: There is no obvious--you would imagine that the ones in the middle would be better, but we have not looked at it systematically. >>: But your experiments [inaudible] >> Pascal Fua: Basically we take overlapping things and then we… >>: [inaudible] >> Pascal Fua: Okay. [applause]