>> Eyal Ofek: Good morning. It's my pleasure to invite Professor Shai Avidan from Tel Aviv University to give a talk here at MSR. Shai did his PhD at the Hebrew University in computer vision. He was a postdoc at Microsoft Research Redmond and since then has been a researcher with Mitsubishi Electric Research Labs and a senior researcher at Adobe. So, Shai. >> Shai Avidan: Thank you. Thanks for the introduction. Actually, Eyal introduced me to the grad program in Jerusalem, so it's great being here. So I'm going to talk about a paper we had several months ago with Tali Dekel and Yael Moses at ECCV 2012, but I thought that before going into this paper I'll just give a short pointer to a paper that I have with Simon Korman, who is my grad student. He's an intern now at MSR and it's about template matching. So I'll give a five-minute trailer to this work that will be presented next week at CVPR and then I'll go to the main talk on photo sequencing. So FAsT-Match is a paper about fast affine template matching. It's joint work with Gilad Tsur and Danny Reichman from the Weizmann Institute, and what we are trying to solve there is template matching. So given a template you want to detect its location in the image, and what you see in magenta is the location we found and in green, barely noticeable, is the ground truth location. And you want to do it quickly and efficiently and you want to have some global guarantees on the accuracy of the solution that you find. In fact, we can do it even on a smaller template. You want to detect this template over there, and you see that we got almost perfect results; you can barely see the green ground truth rectangle. And we can even deal with something like this: we can take the whiskers of the cat as a template and detect their location in the image. The code is up on the web. You are more than welcome to go download and play with it; we would love to hear comments and feedback. And just a couple of slides on how this works. So we thought long and hard on how to solve the problem and this is the algorithm we came up with. All you have to do is sample the space of affine transformations, evaluate each one of them and return the best. All the paper really does is formally prove why this is the right thing to do, and present a lot of experiments to validate that the assumptions we're making hold in practice. So the idea is as follows. Let's say that this is the transformation space: the x-axis, shown here as one-dimensional, say, is the transformation space and the y-axis is the error of each transformation. So when we say at step one take a sample of the affine transformations, we evaluate the transformation at each of these locations; these are the values that we have. And when we return the best result, we would return something like this, because this would be the sampled transformation with the lowest error. Now the key observation that we make, and that's something that is well known, is that we assume that the images are piecewise smooth. What this means is that if you take a transformation and slightly perturb it, the error that you get will change only slightly, because the image is piecewise smooth, so most of the pieces of the template will land on pieces with similar values and only around the edges will there be a change. This means that the error of transformations that are not on the grid, like this one that is the optimal one, cannot be too far away from the error of the samples that we do have.
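To make that concrete, here is a minimal sketch of the brute-force core that the slide describes: lay a grid over the affine parameters, evaluate each transformation with a sampled sum-of-absolute-differences error, and return the best. This is not the released FAsT-Match code; the parameterization, the grid, and the `sad_error` helper are illustrative assumptions (the real algorithm chooses the grid density from the desired error bound).

```python
import itertools
import numpy as np

def affine_matrix(tx, ty, rot, scale, shear=0.0):
    """Compose a 2x3 affine matrix from simple parameters (illustrative)."""
    c, s = np.cos(rot), np.sin(rot)
    A = scale * np.array([[c, -s], [s, c]]) @ np.array([[1.0, shear], [0.0, 1.0]])
    return np.hstack([A, [[tx], [ty]]])

def sad_error(template, image, M, n_samples=200, seed=0):
    """Sampled sum-of-absolute-differences between the template and the image
    patch it maps to under affine transform M (nearest-neighbor sampling)."""
    rng = np.random.default_rng(seed)
    h, w = template.shape
    xs = rng.integers(0, w, n_samples)
    ys = rng.integers(0, h, n_samples)
    pts = np.stack([xs, ys, np.ones(n_samples)])      # template coords, homogeneous
    mapped = (M @ pts).round().astype(int)            # where they land in the image
    H, W = image.shape
    valid = (mapped[0] >= 0) & (mapped[0] < W) & (mapped[1] >= 0) & (mapped[1] < H)
    if valid.sum() == 0:
        return np.inf
    diffs = np.abs(template[ys[valid], xs[valid]].astype(float)
                   - image[mapped[1][valid], mapped[0][valid]].astype(float))
    return diffs.mean()

def fast_match_sketch(template, image, grid):
    """Step 1: sample the affine space; step 2: evaluate each sample; step 3: return the best."""
    best_M, best_err = None, np.inf
    for tx, ty, rot, scale in grid:
        M = affine_matrix(tx, ty, rot, scale)
        err = sad_error(template, image, M)
        if err < best_err:
            best_M, best_err = M, err
    return best_M, best_err

# Toy usage: cut a template out of a random image and search a coarse grid
# over translation, rotation and scale; the true transform is on the grid.
image = np.random.rand(240, 320)
template = image[60:120, 100:180].copy()
grid = itertools.product(range(0, 320, 20), range(0, 240, 20),
                         np.linspace(-np.pi, np.pi, 9), [0.8, 1.0, 1.25])
M, err = fast_match_sketch(template, image, grid)
print("best error:", err)
```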
So we have a way of formally connecting the error that you are willing to tolerate in the template matching to the sampling rate that you need in the affine space, and within a matter of 2 to 15 or 20 seconds you get the results that I've shown you. So here are a couple of additional examples. In this case the templates are 45 percent of the original image size and you can see that this is the result that we get and this is the ground truth, so you can barely see any difference between the two. Here are more examples. The template size is 35 percent, and it goes down even lower than that. Now in all these cases we did -- this is a controlled experiment, because we took the image, extracted the template and then tried to find it back in the image. You might say there might be noise, so how will it work in real scenarios in which you have a template in one image and you want to match it to another image? So we use the Mikolajczyk data set, and what we are showing here is: this is the template we are using, and this is the location that we found in each of the successive five images. These are different images. You can relate this template to each of the other images through the homographies that they supply and, again, you see that we are doing a pretty good job. This is the worst case. We are detecting it through an affine transformation, and in this case there is a projective transformation that you need, therefore you don't get the exact match. Here is another example. It's very challenging. There's a lot of repetitive texture and still we do a fairly good job, except for the last case here where we fail, and it's mainly because of the projective nature of the transformation in this case. We deal with blur. This is the original image, these are the five successive examples with an increasing amount of blur, and we correctly detect the template in each of those. This is the bark sequence. It's zoom plus rotation, and you see that we get a fairly good result except for this one, but we can detect the template even in these two cases. And when we fail here, the failure means that we found a region such that the absolute error of this template is within a constant of that at the true global position. JPEG artifacts are no problem, and we did the same thing with the Zürich building data set. Again, there is a template here and we find it in another image of the same building, and there are many more examples. So again, if you find this interesting, by all means go either to my website or to Simon's website, follow the link to the paper, download the code and play with it. Questions? >>: How do you differentiate your [indiscernible] from affine SIFT? >> Shai Avidan: Oh, ASIFT, yeah. So ASIFT assumes that you can detect and describe interest points. So in many of the examples I've shown you, like at the beginning, it's not clear that you will find interest points on something like this, so all the feature-based methods will not work in this case. On the other hand, you might use our method and then use direct methods to improve the result if you think you need [indiscernible], because our system is pixel-wise [indiscernible]. So in the paper we give an analysis and comparison both to ASIFT and to direct methods, and show the advantages of what we're doing. Okay. Good. So now back to the topic of today's talk, and that's photo sequencing, and this is the problem that we wanted to solve. Actually, that's an illustration of what we wanted to solve.
That's a picture taken in 2005 of the announcement of the new Pope, and that's a picture taken from the same location eight years later when they announced the new Pope. So there is a new type of camera. You might call it a crowd cam: it's a co-location of cameras in space and time, and the question is what do you do with that. And there are a couple of interesting things to point out. For the first time you don't have a single photographer who determines what's the right time to capture something. Everyone is capturing the image whenever they want, and you hope that they will capture it at roughly the same time when there is an important event going on, but there is no guarantee. And the second thing is that currently these images do not interact with each other. We do know that camera arrays are very useful for doing a variety of things. You can get a huge panorama with high dynamic range, do better tracking, whatever, but all of the camera arrays that I am familiar with assume there is one person, usually a PhD student or his advisor, who carefully calibrated and synchronized the camera array in order to capture the high-quality image. The question is, can you work with such a noisy data set? So that's the goal of our work: to make a baby step in this direction of analyzing dynamic content from a collection of still images. So let's try to define it more formally. This is the park next to Tel Aviv University, and if you go there on a sunny day you can take a picture of the boats on the river, and other people will take photos of the same event, and before you know it you have a folder full of images taken at roughly the same time from roughly the same location, and the question is: now what? How are you going to make sense of this information? So one obvious thing to try is to place the images one after another and play them as a video, and if you place them in a random order you get a very messy result. It's very difficult to understand what's going on here. You can't infer the course of each boat. You don't exactly understand the dynamics of the scene. If you run our algorithm instead you can see how the green boat moves, how the white boat moves and so on and so forth. So our goal is to take a folder full of images and try to infer the correct temporal order of the images that were taken. So that's the formal definition: given N still images, determine their temporal order. And just to give you a sense of the dimensionality of the space, you have N factorial possible permutations, and if you have 15 images like the example I've shown you now, the space is 10 to the 12, so you need to pick the correct order out of 10 to the 12 possibilities. So now that you know what photo sequencing is, let me explain what it is not. The first thing to notice is that it is not video synchronization, and it's not video synchronization because you have a very discrete and small number of samples from each camera, so you don't have the full video, which is a very dense sampling in time. You just have one or a few images from each camera, and you need that information to recover the relative temporal order between the images and cameras. The second thing that photo sequencing is not is photo tourism, because photo tourism deals with the static aspect of the problem: you want to recover the camera positions and the 3-D structure of the scene. You don't treat the dynamic content of the scene. And the last thing is, there was a very nice paper by Schindler from Georgia Tech back in CVPR 2007.
What they did there is they had images of Atlanta captured over a period of about 100 years, and they wanted to find the correct order because they didn't have the timestamp for each of those images. They used completely different techniques and the setup was different. We are interested in images that are captured in a similar space and time and want to organize them, so we are using completely different tools. So the assumption we are making is that the images were taken from roughly the same location at roughly the same time, and we want to organize them. The way we go about it is we take a geometric approach to solve the problem and we detect and match feature points. Here are examples of feature points detected in one of the images. And then we use standard tools from geometry to find the fundamental matrix and the static features in the images, and all of the outliers are considered the dynamic features. So we have the fundamental matrices between every image and the reference image, and we have all the dynamic features in this image. So let's look at the dynamic features. This is the tool that we are going to use to solve the problem, so let's look at the dynamic features in this image. You see that in this particular image the feature descriptor, I think, detected all of the points on the boat, but it failed to detect them on the boat up there. So you see we are dealing with a noisy process, and this particular feature point was detected in all of the images that you see marked with the red arrows. So in a subset of the images this dynamic feature was detected and matched. For a different dynamic feature, like the one marked with yellow over there, you get a different subset of images. Okay. So the question is how do you aggregate? There are two things: how do you take the images in which you have the feature point and use them to recover the temporal order of the images in the yellow subset or the red subset, and then how do you aggregate everything together into a global order of the images? So the algorithm consists of two steps. The first one is finding the partial order from each of the dynamic features. So for the red dynamic feature you get a partial order like this; for the yellow you get some partial order like that. And once you have all those partial orders you want to aggregate everything together and get one globally consistent temporal order for all of the images. And I'll spend the bulk of my talk on the first part and then I'll mention how we solve the second part. The tool we're using there is something called rank aggregation, but I'll explain more about it later. Questions? Okay. So we want to recover the temporal order from a single dynamic feature, and here's a plot to explain what's going on. You have the 3-D point and we assume it's moving along a straight line. And as it's moving along a straight line, it is captured by image three and then by image two and then by image four and then by image one. So there is this implicit assumption that the point moves along a straight line and, moreover, it doesn't go back and forth. It's going in one direction, and we don't care about the speed at which it moves. So one way to solve the problem is to say that the order of the points in 3-D implies the temporal order, so what we can do is, if we knew the positions of all the cameras, we could reconstruct the line, then intersect each of the rays with the line, get the positions of the points along the 3-D line and solve the problem.
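As a rough illustration of the preprocessing step mentioned a moment ago (match feature points against a reference image, estimate the fundamental matrix with RANSAC, keep the inliers as static features, and treat the outliers as candidate dynamic features), the pipeline might look something like the sketch below. This is not the authors' code; it assumes OpenCV, and the SIFT, ratio-test, and threshold choices are mine.

```python
import cv2
import numpy as np

def static_dynamic_split(img_ref, img_k, ratio=0.75, ransac_thresh=1.0):
    """Match features between a reference image and image k, estimate the
    fundamental matrix with RANSAC, and split the matches into static inliers
    (consistent with F) and dynamic outliers (candidate moving points)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_ref, None)
    kp2, des2 = sift.detectAndCompute(img_k, None)

    # Lowe's ratio test on 2-nearest-neighbor matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]

    pts_ref = np.float32([kp1[m.queryIdx].pt for m in good])
    pts_k = np.float32([kp2[m.trainIdx].pt for m in good])

    F, mask = cv2.findFundamentalMat(pts_ref, pts_k, cv2.FM_RANSAC,
                                     ransac_thresh, 0.99)
    mask = mask.ravel().astype(bool)
    static = (pts_ref[mask], pts_k[mask])      # background points that fit F
    dynamic = (pts_ref[~mask], pts_k[~mask])   # leftovers: likely moving objects
    return F, static, dynamic
```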
And theory tells us that five cameras are enough to recover the 3-D line, and then you can recover the positions of the points and get the temporal order. We want to avoid that for two reasons. The first one is that you make strong assumptions about the data, in the sense that you need all of the cameras to be in the same coordinate system, so you need to run bundle adjustment, which is a step above the fundamental matrix estimation that we are doing. The second thing is a little more delicate. We actually don't care about the 3-D structure of the scene. At this point all we care about is the order of the images, and therefore when you're trying to do this reconstruction you might be more sensitive to errors and noise. We did some synthetic experiments to validate that, but that's not the point. We really wanted to avoid the 3-D reconstruction because we just care about the temporal order of the images. So instead of working in 3-D, what we want to do is work in 2-D, and that's very simple. All you have to do is project this trajectory line onto the image plane here; the 3-D points are projected to points along this line and you can do the analysis in the image plane. So we are moving from 3-D to 2-D. In order to do that we have to make an assumption, and we are making something called the static image pair assumption. We are assuming there are two images that are taken by the same camera and this camera is static for that time period. Let's denote them i1 and i2. And how is this going to help us? Let's see an illustration here. So this is the first image taken by the reference camera and this is the dynamic feature point that we detected. And now we take the second image with the same static camera, and let's say the boat moved here. So now, because we have the static image pair assumption, we can pass a line connecting these two points, and this is the projection of the 3-D trajectory of the boat onto the reference image plane. Okay. That's the crucial thing. We made our life very easy because we assumed we had one static camera taking two images at different timestamps. It can move afterwards, but that's the assumption we are making for now. So now that we have this line, let's see what happens when we take a different image ik and we detect the same dynamic feature on the boat. Because we have the fundamental matrix connecting ik and i1, we can compute a temporal line (the epipolar line of the feature in the reference image), take the intersection between the temporal line and the projection of the trajectory, and get the position of this dynamic feature as it would have been seen at time tk in the reference camera system. Okay? So now we see that the temporal order of the images is i1 and then i2; those are the two images coming from the reference camera, and we see that ik was taken before them. And you can do that for each of the other images and you will get a temporal order out of the 2-D analysis that you are doing in the image plane. So this feature point was seen by five images, and that's the temporal order that we get here. >>: Obviously the reverse order is just as valid, so you are just putting both of those in there. [indiscernible] >> Shai Avidan: We assume that we have two images from the reference camera and we assume we know their order. >>: So you have two from the same camera? >> Shai Avidan: Yes. >>: Do you know the relative times? >> Shai Avidan: Yes. >>: [indiscernible] and then the other one obviously is that you're not crossing the line of action, so to speak.
[indiscernible] >> Shai Avidan: Yeah. This ties back to the assumption that the images are taken at approximately the same time, so a point doesn't go back and forth. We assume it's only going in one direction. If it's oscillating then we have a problem, and I'll show you an example later on dealing with that. One nice thing to observe here, and it goes to the fact that we don't do 3-D reconstruction: I'm talking all the time about points moving along a straight line. In fact, we know what the path of this boat is; you see it from the trail it's leaving behind, and you see it's not a straight line. But because all you care about is the temporal order, the algorithm is robust enough to give us the correct order even though the objects do not move along a straight line. Another point that I would like to make is that each feature point can move along a different 3-D line in a different direction. They don't have to all move in the same direction; that's not necessary. >>: [indiscernible] >> Shai Avidan: Yeah. It doesn't matter unless you cannot approximate its motion by a straight line, so it depends on how curved the actual trajectory is. As long as the order of the points does not flip when you project them onto the line, you will get the correct order. >>: I guess what I was trying to say is, in some cases there is a line of action, right? And what if you go to the other side? >> Shai Avidan: Yeah, yeah. >>: In one case the line is going from left to right and in the other it goes completely reversed. >> Shai Avidan: It doesn't matter in the sense -- oh. It would matter for this feature point, yeah, but if you have -- so what will work is if you have one dynamic object moving in this direction and another dynamic object moving in this direction; that would be fine. Looking from both sides of the action line would be a problem, yeah. We have an example of a similar nature later in the presentation. I'll mention that. There are a couple of limitations with this approach. The first one is that you have this reference image, which means that you need to match all feature points in all the images to the reference image, so it limits the number of images you can work with. And the second problem is that you are not making use of all the information that you have, and I will show an example in a minute. Here is an example of a parade in Tel Aviv. Let's say this is the reference image. It would be extremely difficult to match this image and this image to the reference image. The background is way too different, even though the dynamic object that is the focus of our work is very prominent, so we would like to alleviate this problem somehow. The other problem is that usually people take more than one picture with their camera, and we are not taking advantage of the fact that we know the temporal order of the images within a camera. We don't know the exact timestamps, but we know that image two was taken after image one and before image three, and so on. So we want to make the approach more scalable by removing the reference image assumption, and we want to make the approach more realistic by looking at multiple images per camera, where we know the temporal order within each of the cameras. So let's go back to our illustration. We don't have the 3-D trajectory line anymore, and all we have are the projections of the dynamic feature point in the image plane.
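(A brief aside before we drop the static image pair assumption: here is a rough sketch of the ordering computation just described, namely one temporal line per image, intersected with the projected trajectory line, with the images then sorted along that line. It assumes a single dynamic feature already matched across the images and that each F_k maps image-k points to epipolar lines in the reference image; the helper names are mine. It is an illustrative sketch, not the authors' implementation.)

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two 2-D points (cross product of their
    homogeneous coordinates)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def epipolar_line(F, x):
    """'Temporal' line in the reference image induced by point x in image k,
    assuming F maps image-k points to reference-image lines: l = F x."""
    return F @ np.array([x[0], x[1], 1.0])

def order_under_static_pair(p1, p2, others):
    """p1, p2: the dynamic feature in the two images of the static reference
    camera (known order).  others: list of (name, F_k, x_k), where x_k is the
    same feature observed in image k.  Returns image names sorted along the
    projected trajectory line, earliest first."""
    traj = line_through(p1, p2)               # projection of the 3-D trajectory
    direction = np.array(p2) - np.array(p1)   # known i1->i2 order fixes the sign of time

    events = [("i1", 0.0), ("i2", float(direction @ direction))]
    for name, F_k, x_k in others:
        l_k = epipolar_line(F_k, x_k)
        X = np.cross(traj, l_k)               # intersection of the two lines
        if abs(X[2]) < 1e-12:
            continue                          # (near-)parallel lines: skip this image
        pt = X[:2] / X[2]
        t = float((pt - np.array(p1)) @ direction)   # signed position along the line
        events.append((name, t))

    events.sort(key=lambda e: e[1])
    return [name for name, _ in events]
```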
So let's take this as the reference image, and we will repeat this with each of the images as the reference image, and you can map the feature points to temporal lines. So what you have in a particular image is a point and, let's say, three temporal lines coming from the remaining three cameras. So here is another -- this is a zoom-in just on the image plane of a particular camera. This is the projection of the dynamic feature point, and these are the temporal lines corresponding to the projections of this dynamic point from different timestamps onto the image plane of this reference camera. So if we had the projection of the trajectory then the problem would be solved, but we don't have it, so this might be the projection of the trajectory, this line here, or it can be this line here. So how are we going to solve this problem? One thing to notice is that if you take this line and perturb it slightly, the order of the points will not change. So the question is, when will it change? There are two types of cases in which it will change. The first one is very simple. The trajectory line must pass through this feature point, so to define a critical line we need to establish another point, and we'll take the point to be the intersection of these two temporal lines, for instance. So you'll notice that if you have the trajectory line to the right of this critical line then the order is green then red, or red then green, depending on the direction in which you're looking. And if this black line crosses this critical line, then the order of the points changes: it was green then red, and now it switches to red and then green. So this is one type of critical line. Every time the trajectory line crosses a critical line, the order of the points along the trajectory changes. There is another instance: the second type of critical line is a line that is parallel to each of the temporal lines but passes through the feature point here. To see why this is true, if you have this as the line then you have blue and then green, and if you cross this critical line you have green then blue. So taken together, we have removed the temporal assumption; we don't have the static image pair assumption, and therefore we have dramatically increased the number of possible permutations that a single dynamic feature can induce. In fact, if you look at the image plane, it's now divided by all of those critical lines into regions, and it's pretty easy to agree that within each region you have just one order, well actually two, up to the direction of the points, which induces the order of the images. So here is a quiz. Let's say you have 15 images; what's the maximum number of permutations? Let me remind you that N factorial equals 10 to the 12th possible permutations, so if I'm giving you this number, how many different permutations will you get from the analysis I've just shown you? Is the question clear? >>: [indiscernible] number of points? >> Shai Avidan: Yes. So let's see. The number of permutations is two times the number of regions, because for each region it can either go backward or forward, and the number of intersection points for the first type of critical lines is N-1 choose 2, and for the parallel lines, the second type of critical lines, it's N-1, and if you do the computation you end up having 210 possible permutations. Okay? So that's, in the case of N equals 15, the maximum number of permutations. In practice, you can do much better, and this is an example showing that. What we see here is a point.
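For reference, here is the arithmetic just quoted, written out. As I read the argument, every critical line passes through the feature point: there are at most N-1 choose 2 critical lines of the first type (one through each pairwise intersection of temporal lines) and N-1 of the second type (one parallel to each temporal line), and each region between consecutive critical lines gives one order up to reversal.

```latex
\#\,\text{permutations} \;\le\; 2\left[\binom{N-1}{2} + (N-1)\right]
  \;=\; 2\,(91 + 14) \;=\; 210 \quad (N = 15),
\qquad \text{compared with } N! = 15! \approx 1.3 \times 10^{12}.
```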
This is the dynamic feature point and you see all the temporal lines from the remaining images in the data set, and there are two cameras. And we know the order of the images in each camera: this is one particular order, and this is the particular order of images from the other camera. So now you evaluate all these 210 different regions and, for each one, you see if it violates any of these assumptions. And if it does, you throw it away; you know it's not a possible solution. In practice, we get about four or five different permutations for each dynamic feature, so we've replaced the static image pair assumption and we get many, many more possible permutations, but we still get a very small number of permutations per dynamic feature. So to recap what I've shown you so far, we're using geometry and geometric reasoning to infer the temporal order for each dynamic feature. You can either do 3-D reconstruction, you can do the solution based on the static image pair, or you can use the temporal constraints that I've shown you a minute ago. In any case, each dynamic feature will give you one or a few temporal orders on the image set. The second question that we need to solve is how to aggregate all the information into one consistent and global order of all the images. The way to solve that is to think of the problem as a graph. Each node is an image, and let's say that this is the correct temporal order of the images. Then you have the edges like this: i3 goes to i1, to i4, i5 and i2. Now if we were lucky and there was one dynamic feature that appeared in all the images, let's say this red point, and there were no errors, we would get that, and then it's just a topological sort on the graph; we get a solution and that's trivial. In practice that's not what we have. What we have is a partial order from the green point, another partial order from the blue point and another partial order from the red point. Now the problem is that there might be errors, and the way errors are manifested in this graph is that there might be a cycle. If you look at the edges you'll see that there is a cycle between 2, 1 and 4, and that's impossible. Okay? So the question is how do you aggregate the information and resolve this conflict. And the way to solve it is with a tool called rank aggregation. You define a metric on the space of all possible permutations; essentially you take two permutations and the distance between them is the number of pairs on whose order they disagree. So we are looking for a consensus order sigma that is as consistent as possible with each of the partial orders sigma i: each dynamic feature voted for one or a few temporal orders sigma i, and now we're looking for one global temporal order sigma. This problem has been investigated extensively in the social sciences; this is how we do voting. In the context of computer science we are following work by Dwork et al. from 2001, and what they had is the following problem, a search query problem. You run a query and you get results from multiple search engines, so now you have the pages ranked by each search engine separately, and you want to aggregate all the rankings into one globally consistent ranking. And their solution is based on a Markov chain approximation; the problem in general is known to be NP-hard. So how is the solution going to work? We construct the transition matrix.
We build the graph and we construct the edges such that the weight of an edge is, say, the probability that image one appeared before image four. You can simply count the number of features that voted i1 before i4, and that would be the probability, that would be the weight of the edge ij, and now you can look at the transition matrix that is associated with this graph and you run a Markov chain. Essentially you initialize the state probabilities, the probability that each of the images is the last one, to a uniform distribution, so it's 1 over 5, and if you look at the leading eigenvector of this transition matrix M you'll get that it's 1 on the last state and 0 everywhere else. Imagine a random walk: as you take it to infinity, with probability 1 it will end at the last position. So you can place image 2 there, it's the last image, remove it, and repeat. So that's the whole algorithm. What happens in case you have a conflict, a conflict meaning you have a cycle? In that case, when you look at the eigenvector it will not be all zeros and just one; it will be something like this. Let's say there is a cycle; all the elements outside the cycle will be 0, but within the cycle there will be some values, and you are not guaranteed that the correct image within the cycle will have the highest probability. Let's say this image got 68 percent probability, so we choose it to be the last image, but it might have been this one or this one, and that's the approximate nature of the algorithm. That's why we can't guarantee the globally optimal solution. Results. So the first question people ask us is why bother with this approach to begin with. Why can't you just use the clocks on the cell phones? And I thank you for not asking it until now. So we ran an experiment. We played a stopwatch on the screen and asked all of the students in class to take a picture when it reached 10 seconds and again when it switched to 20 seconds and send us the images. And this is the histogram of the times that you get. This is the time offset and this is the number of people that took an image at this time offset, so there is a total of 40 images here. A couple of things to notice here. The range is about 6 seconds, so if you're talking about a dynamic event, 6 seconds can be, depending on the type of event, quite a lot. That's problem number one. Problem number two is that it's not a unimodal distribution, I think because there are two types of -- there are more, I think two or three types of carriers in Israel. Maybe each group of phones is clustered around a different peak. Moreover, there are a couple of outliers that I'm not showing here whose camera phones were completely out of sync; they were several minutes off. So given all these reasons, I think there is room for using a vision-based technology to properly synchronize images. Here are some results. Here are nine images, I think, taken with two cameras. In this case we are using the static image pair assumption; these are the two images that are taken by the static camera. And these are the results in random order. It's not very clear what this guy is doing. And that's not me, by the way. And this is the result you get. Okay. I'm playing it back and forth. So you can clearly see that he is skating from left to right. Here is another example, the same setup. In this case look at the little girl on the slide; her location will give an indication of how well we are doing. And again. So we're recovering the correct temporal order here.
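To make the rank-aggregation step from a few minutes ago concrete, here is a rough numpy sketch in the spirit of that Markov-chain idea: count pairwise votes, build a row-stochastic transition matrix whose random walk drifts toward later images, take the leading (stationary) vector, declare the image with the most mass to be the last one, remove it, and repeat. The exact transition-matrix construction used in the paper may differ; the smoothing and iteration counts here are my own choices, so treat this as an illustration rather than the authors' algorithm.

```python
import numpy as np

def aggregate_order(n_images, partial_orders, n_iter=2000):
    """Aggregate partial temporal orders (each a list of image indices, earliest
    first) into one global order, using a Markov-chain heuristic in the spirit
    of Dwork et al.: a random walk that tends to move toward 'later' images."""
    # Pairwise votes: votes[i, j] = how many partial orders place i before j.
    votes = np.zeros((n_images, n_images))
    for order in partial_orders:
        for a, i in enumerate(order):
            for j in order[a + 1:]:
                votes[i, j] += 1

    remaining = list(range(n_images))
    reversed_result = []
    while remaining:
        V = votes[np.ix_(remaining, remaining)]
        P = V.copy()
        row_sums = P.sum(axis=1, keepdims=True)
        absorbing = (row_sums.ravel() == 0)           # nothing later: a sink state
        P[absorbing] = np.eye(len(remaining))[absorbing]
        row_sums[absorbing] = 1.0
        P = P / row_sums                              # row-stochastic transition matrix
        P = 0.5 * (P + np.eye(len(remaining)))        # lazy walk, avoids oscillation

        pi = np.full(len(remaining), 1.0 / len(remaining))
        for _ in range(n_iter):                       # power iteration: pi <- pi P
            pi = pi @ P
        last = remaining[int(np.argmax(pi))]          # most mass = latest image
        reversed_result.append(last)
        remaining.remove(last)
    return reversed_result[::-1]                      # earliest first

# Toy usage: 5 images, three consistent partial orders; should print [2, 0, 3, 4, 1].
print(aggregate_order(5, [[2, 0, 3, 4, 1], [2, 0, 3], [0, 4, 1]]))
```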
And another example. This is a challenging example because, if you look at the image, the top third is blue sky; there are no features there. The bottom third is reflections of the waterslide, so it's difficult to get any feature points. So just getting the temporal lines, the static feature points and computing the fundamental matrix is not trivial, and another problem is that the motion of the girl is almost parallel to the temporal lines, which makes the intersection that much more sensitive. So I think 3-D reconstruction here would not work very well, but in our case you see that we got a fairly good result. Okay. A couple of experiments using data courtesy of Park et al. from ECCV '10. We have here five cameras and we have a number of images from each camera. So in this case we're using this assumption; we're not using the static image pair assumption. The points were manually marked in the images and, going back to your comment earlier, the guy is actually climbing up and down, but because we only assume that the object is moving in a single direction, we clipped the sequence to only the images going in one direction. Otherwise it would be messed up. And you can see that he is constantly climbing up. Let me play that again. And another example from the same paper. There are 14 images, four cameras, and we correctly recovered the temporal order here. And the last example is from a parade. There are three cameras, with four images and two images respectively, and this is the result that we got, and again. Okay. So let me summarize what we've done. I think it's a cool problem. It also forces you to think about what assumptions you want to make and to play with geometry, which I personally like very much. Also, I haven't seen much work on rank aggregation in computer vision, and I think it's a very powerful tool; there are many instances you can think of where you have partial ranks and you want to aggregate everything together. For the future, I think the main problem we had is that detecting and matching feature points is extremely difficult. It works well for the static parts, where SIFT or ASIFT will work fine, but for the dynamic feature points, first, we couldn't get SIFT to detect feature points on the dynamic object, and even if we managed to do that, the matching would fail because the descriptor is not working. You can't match the descriptors of the same dynamic points because, especially on humans, they change their appearance too much for the matching to work well. So in this case what we did is use the code by the Adobe group for non-rigid dense correspondence, which computes dense flow between two images, and it seemed to work better for our problem. As a follow-up, solving the matching would give us scalability: ideally you want to work with hundreds of thousands of images, and if you have automatic matching then you can scale really well. And there are probably new applications, for instance separating the dynamic from the static objects in the scene to highlight if something really changes. That's it. Thank you very much. [applause] Yeah? >>: So I've got a question. Do you basically aggregate the voting and then find the majority vote? We know that time has one way of progressing, and it seems enough that you have one good vote. I'll give you an example.
Suppose you have a scene where you have a lot of objects that have a periodic motion, let's say between two points, and one or a few objects that have a linear motion. If you are looking at that, you would know that this is not the [indiscernible] >> Shai Avidan: We are not going into that in the talk, but in the paper we did some work trying to figure out the confidence that each dynamic feature has, so we might say this dynamic feature appeared in more images and therefore we trust it more. Or you can add all these priors into the voting process, so that if you have a dynamic feature that you are very confident in, then it can have a higher vote, and it will completely dominate the solution and then you'll get the trivial solution. So there is a way to introduce those priors into the framework. The rank aggregation is agnostic to all of this. It just takes a matrix that tells the probability of i occurring before j and gives you the solution. Yeah? >>: When you are not using any time information from the camera, why doesn't the vote go backwards? Aren't they equally plausible? >> Shai Avidan: Yeah. So in this case we manually selected, out of the two possible solutions, the solution where the boat is going forward. But you're completely right. In the first case, when you have the static image pair assumption, then because you know the order of the two images from the static camera, it determines the sign bit of time. >>: You said you match, or you determine the probability that one image happened before the other, based on the number of matched features, right? Did you try any other things, like the quality of the features or any geometric information? >> Shai Avidan: I don't think we did. You can think of the quality of the match, I think. That would be one option, but in practice that was never really the problem. We also did synthetic experiments where we used 200 or 300 cameras, or images if you will, and we introduced errors, and rank aggregation gave a solution that was like 98 percent accurate. The real problem is getting the matching. That's the killer problem. >>: These are relatively controlled scenarios. Did you try it, you know, in the wild? >> Shai Avidan: The carnival is as wild as we could put our hands on. It's difficult because the background changes dramatically. When you look at the images you see that the background changes quite a bit and there is lots of motion, so just getting the feature points there is difficult. >>: Did you try it on sports, like, say, football? Assuming that the action is moving forward you have your linear motion, but you still have players with similar jerseys and stuff. I would assume that the matching would probably be better? >> Shai Avidan: I agree. I wish. We don't have the data set for that. If we can get enough data, we would love to try our algorithm on that. Thank you. [applause]