>> Eyal Ofek: Good morning. It's my pleasure to invite Professor Shai Avidan from Tel Aviv
University to give a talk here at MSR. Shai did his PhD at the Hebrew University in computer vision. He was a postdoc at Microsoft Research Redmond and since then has been a researcher at Mitsubishi Electric Research Labs and a senior researcher at Adobe. So, Shai.
>> Shai Avidan: Thank you. Thanks for the introduction. Actually, Eyal introduced me to the
grad program in Jerusalem so it's great being here. So I'm going to talk about a paper we had
several months ago with Tali Dekel and Yael Moses at ECCV 2012, but I thought that before
going into this paper I'll just give a short pointer to a paper that I actually have with Simon
Korman who is my grad student. He's an intern now at MSR and it's about template matching.
So I'll give a five-minute trailer to this work that will be presented next week at CVPR and then
I'll go to the main talk on photo sequencing. So FAsT-Match is a paper about fast affine
template matching. It's a joint work with Gilad Tsur and Danny Reichman from the Weizmann
Institute and what we are trying to solve there is template matching. So given a template you
want to detect its location in the image, and what you see in magenta is the location we found, and in green, barely noticeable, is the ground truth location. And you want to do it
quickly and efficiently and you want to have some global guarantees on the accuracy of the
solution that you find. In fact, we can do it even on a smaller template. You want to detect this
template over there and you see that we got almost perfect results. You can barely see the
green ground truth rectangle. And we can even deal with something like this. We can detect
the whiskers of the cat and detect the location in the image. The code is up on the web. You
are more than welcome to go download and play with it. We would love to hear comments
and feedback and just a couple of slides on how this works. So we thought long and hard on
how to solve the problem and this is the algorithm we came up with. All you have to do is
sample the space of all affine transformations, evaluate each one of them and return the
best. All the paper does really is formally prove why this is the right thing to do, and a lot of
experiments to validate that the assumptions that we're making hold in practice. So the idea is
as follows: let's say the x-axis is the transformation space, drawn here as one-dimensional, and the y-axis is the error of each transformation. So when we say in step one take a sample of the affine transformations, we evaluate the transformation at each of these locations. These are the values that we have. And when we return the best result, we would return something like this, because this would be the sampled transformation
with the lowest error. Now the key observation that we make and that's something that is well
known, is that we assume that the images are piecewise smooth. Now what this means is that
if you take a transformation and slightly perturb it, the error that you get will change only
slightly because the image is piecewise smooth so most of the pieces of the template will land
on pieces with similar values, and only around the edges will there be a change. This
means that transformations that are not on the grid like this one that is the optimal one cannot
be too far away from the samples that we do have. So we have a way of formally connecting
the error that you are willing to tolerate in the template matching to the sampling rate that you need in the affine space, and within a matter of 2 to 15 or 20 seconds you get the
results that I’ve shown you. So here are a couple of additional examples. In this case the
templates are 45 percent of the original image size and you can see that this is the result that
we get. This is the ground truth, so you can barely see any difference between the two. Here
are more examples. Template size is 35 percent and it goes down even lower than that. Now
in all these cases we did -- this is a controlled experiment because we took the image, extracted
the template and then tried to find it back in the image. You might say there might be noise, so
how will it work with real scenarios in which you have a template in one image and you want to
match it to another image. So we use the Mikolajczyk data set and what we are showing here is
this is the template we are using and this is the location that we found in each of the successive
five images. These are different images. You can relate this template to each of the other
images through the homographies that they supply and, again, you see that we are doing a
pretty good job. This is the worst case: we are detecting it with an affine transformation, but in this case it is really a projective transformation that you need, therefore you don't get the
exact match. Here is another example. It's very challenging. There's a lot of repetitive texture
and still we do a fairly good job except for the last case here where we fail and it's mainly
because of the projective nature of the transformation in this case. We deal with blur. This is
the original image. These are the five successive examples with an increasing amount of blur
and we correctly detect the template in each of those. This is the bark sequence. It's zoom
plus rotation, and you see that we get a fairly good result except for this one, but we can detect
the template even in these two cases. And when we fail here, the failure means that we found a region such that the absolute error of the template is within a constant of the error at the true global position. JPEG artifacts are no problem, and we did the same thing with the Zürich
building data set. Again, there is a template here and we find it in another image of the same
building and there are many more examples. So again, if you find this interesting by all means
go either to my website or to Simon's website and follow the link to the paper and download
the code and play with it. Questions?
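A minimal sketch of the sample-evaluate-pick-the-best idea described above, written in Python with NumPy. It is not the construction from the paper: the grid here covers only translation, rotation, and uniform scale, and the step sizes and random pixel-sampling fraction are arbitrary assumptions, whereas FAsT-Match samples the full affine space with a density derived from the desired error bound.

```python
import itertools
import numpy as np

def sad_error(image, template, A, n_samples=200, seed=0):
    """Mean absolute difference between randomly sampled template pixels and
    the image pixels they map to under the 2x3 affine transform A."""
    rng = np.random.default_rng(seed)
    h, w = template.shape
    ys = rng.integers(0, h, n_samples)
    xs = rng.integers(0, w, n_samples)
    pts = np.stack([xs, ys, np.ones(n_samples)])              # homogeneous coords
    xi, yi = (A @ pts).round().astype(int)                    # mapped image coords
    ok = (xi >= 0) & (xi < image.shape[1]) & (yi >= 0) & (yi < image.shape[0])
    if ok.sum() < n_samples // 2:                             # mostly out of bounds
        return np.inf
    return np.abs(image[yi[ok], xi[ok]].astype(float)
                  - template[ys[ok], xs[ok]].astype(float)).mean()

def brute_force_affine_match(image, template, step=8,
                             scales=(0.8, 1.0, 1.2),
                             angles=np.deg2rad([-20, -10, 0, 10, 20])):
    """Sample a coarse grid of transforms and return the one with lowest error."""
    best_err, best_A = np.inf, None
    for tx, ty, s, a in itertools.product(range(0, image.shape[1], step),
                                          range(0, image.shape[0], step),
                                          scales, angles):
        A = np.array([[s * np.cos(a), -s * np.sin(a), tx],
                      [s * np.sin(a),  s * np.cos(a), ty]])
        err = sad_error(image, template, A)
        if err < best_err:
            best_err, best_A = err, A
    return best_A, best_err
```

The paper's contribution, as described above, is proving how densely the affine space must be sampled for a given error tolerance; this sketch only shows the brute-force skeleton.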
>>: How do you differentiate your [indiscernible] affine SIFT?
>> Shai Avidan: Oh, ASIFT, yeah. So ASIFT assumes that you can detect and describe
interest points. So in many of the examples I've shown you like at the beginning, it's not clear
that you will find interest points on something like this. So all the feature-based methods will not work in this case. On the other hand, you might use our method and then use direct methods to improve the result if you think you need [indiscernible] because our system is pixel-wise [indiscernible]. So in the paper we give an analysis and comparison both to ASIFT and to direct methods and show the advantages of what we're doing. Okay. Good.
So now back to the topic of today's talk and that's photo sequencing and this is the problem
that we wanted to solve. Actually, that's an illustration of what we wanted to solve. That's a
picture taken in 2005 of the announcement of the new Pope and that's a picture taken from the
same location eight years later when they announced the new Pope. So there is a new type of
camera. You might call it a crowd cam and it's a co-location of cameras in space and time and
the question is what do you do with that. And there are a couple of interesting things to point
out. For the first time you don't have a single photographer that determines what's the right
time to capture something. Everyone is capturing the image whenever they want and you hope
that they will capture it at roughly the same time when there is an important event going on,
but there is no guarantee. And the second thing is currently these images do not interact with
each other. We do know that camera arrays are very useful to do a variety of things. You can
get a huge panorama with high dynamic range and do better tracking, whatever, but all of the
camera arrays that I am familiar with assume there is one person using them, a PhD student or his advisor, who carefully calibrated and synchronized the camera array in order to capture the
high-quality image. The question is can you work with such a noisy data set. So that's the goal
of our work to make a baby step in this direction of analyzing dynamic content from a collection
of still images. So let's try to define it more formally. This is the park next to Tel Aviv University
and if you go there on a sunny day you can take a picture of the boats on the river and other
people will take photos of the same event and before you know it, you have a folder full of
images taken at roughly the same time from roughly the same location, and the question is: now what? How are you going to make sense of this information? So one obvious thing to try is
to place the images one after another and play the video and if you place them in a random
order you get a very messy result. It's very difficult to understand what's going on here. You
can't infer the course of each boat. You don't exactly understand the dynamics of the
scene. If you run our algorithm instead you can see that, how the green boat moves, how the
white boat moves and so on and so forth. So our goal is to take a folder full of images and try
to infer the correct temporal order of the images that were taken. So that's the formal
definition: given N still images, determine their temporal order. And just to give you a sense of the size of the space, you have N factorial possible permutations, and if you have 15
images like the example I've shown you now, the space is 10 to the 12, so you need to pick the
correct order out of 10 to the 12 possibilities. So now that you know what photo sequencing
is, let me explain what it is not. First thing to notice is that it is not video synchronization and
it's not video synchronization because you have a very discrete and small number of samples
from each camera, so you don't have the full video which is a very dense sampling in time. You
just have one or few images from each camera and you need that information to recover the
relative temporal order between the images and cameras. The second thing that photo sequencing is not is photo tourism, because photo tourism deals with the static aspect of
the problem, so you want to recover the camera position and the 3-D structure of the scene.
You don't treat the dynamic content of the scene. And the last thing is there was a very nice
paper by Schindler from Georgia Tech back in CVPR 2007. What they did there is they
had images of Atlanta captured over a period of about 100 years and they wanted to find the
correct order because they didn't have the timestamp for each of those images. They used
completely different techniques and the set up was different. We are interested in images that
are captured in a similar space and time and want to organize them, so we are using completely
different tools. So the assumption we are making is that the images were taken from roughly
the same location at roughly the same time and we want to organize them. The way we go
about it is we take a geometric approach to solve the problem and we detect and match
feature points. Here are examples of feature points detected in one of the images. And then
we use standard tools from geometry to find the fundamental matrix and the static features in
images and all of the outliers are considered the dynamic features. So we have the
fundamental matrices between every image and the reference image and we have all the
dynamic features in this image. So let's look at the dynamic features. This is the tool that we
are going to use to solve the problem and let's look at the dynamic features in this image. You
see that in this particular image the shape descriptor I think detected all of the points on the
boat, but it failed to detect them on the boat up there. So you see we are dealing with a noisy
process and for this particular feature point it was detected in all of these images that you see
the red arrows. So in a subset of the images this dynamic feature was detected and matched.
For a different dynamic feature like the one marked with yellow over there, you get a different
subset of images. Okay. So there are two questions: how do you take advantage of the images in which you have the feature point, that is, how do you use it to recover the temporal order of the images in the yellow subset or the red subset, and then how do you aggregate everything together into a global order of the images? So the algorithm
consists of two steps. The first one is finding the partial order from each of the dynamic
features. So for the red dynamic feature you get a partial order like this. For the yellow you get
some partial order like that. And once you have all those partial orders you want to aggregate
everything all together and get one globally consistent temporal order for all of the images.
And I'll spend the bulk of my talk on the first part and then I'll mention how we solve the
second part. And the tool we're using here is something called rank aggregation, but I'll explain
more about it later. Questions? Okay. So we want to recover the temporal order from a single
feature set and here's a plot to explain what's going on. You have the 3-D point and we assume
it's moving along a straight line. And as it's moving along a straight line, it is captured by image
three and then by image two and then by image four and then by image one. So there is this
implicit assumption that the point moves along a straight line and moreover, it doesn't go back
and forth. It's going in one direction and we don't care about the speed in which it moves. So
one way to solve the problem is to say the order of the points in 3-D implies temporal order, so
what we can do is if we knew the positions of all the cameras we can reconstruct the line and
then intersect each of the rays with the line, get the position of the points along the 3-D line
and solve the problem. And theory tells us that five cameras are enough to recover the 3-D line
and then you can recover the position of the points and get the temporal order. We want to
avoid that for two reasons. The first one is you make strong assumptions about the data in the
sense that you need all of the cameras to be in the same coordinate system, so you need to run
bundle adjustment, which is a step beyond the fundamental matrix estimation that we are
doing. The second thing is a little more delicate. We actually don't care about the 3-D
structure of the scene. At this point all we care about is the order of the images, and therefore
when you're trying to do this reconstruction you might be more sensitive to errors and noise.
So we did some synthetic experiments to validate that, but that's not the point. We really
wanted to avoid the 3-D reconstruction because we just care about the temporal order of the
images. So instead of working in 3-D, what we want to do is work in 2-D and that's very simple.
All you have to do is project this trajectory line onto the image plane here and the 3-D points
are projected to points along this line, and you can do the same ordering analysis in the image plane.
So we are moving from 3-D to 2-D. In order to do that we have to make an assumption and we
are making something called the static image pair assumption. We are assuming there are two
images that are taken by the same camera and this camera is static for that time period. So
let's denote them i1 and i2. And how is this going to help us? Let's see an illustration here. So
this is the first image taken by the reference camera and this is the dynamic feature point that
we detected. And now we take the second image with the same static camera and let's say the
boat moved here. So now because we have the static image pair assumption, we can pass a
line connecting these two points, and this is the projection of the 3-D trajectory of the boat onto the reference image plane. Okay. That's the crucial thing. We made our life very easy
because we assume we had one static camera taking two images at different timestamps. It
can move afterwards, but that's the assumption we are making for now. So now that we have
this line, let's see what happens when we take a different image ik and we detect the same
dynamic feature on the boat. Because we have the fundamental matrix connecting ik and i1,
we can compute a temporal line, take the intersection between the temporal line and the projection of the trajectory, and get the position of this dynamic feature as it
would have been seen at time tk in the reference camera system. Okay? So now we see that
the temporal order of the images is i1 and i2. That's the two images coming from the reference
camera and we see that ik was taken before them. And you can do that for each of the other
images and you will get a temporal order out of the 2-D analysis that you are doing in the image
plane. So all of these feature points, so this feature point was seen by five images; that's the
temporal order that we get here.
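As a rough Python sketch of the geometry just described, assuming the fundamental matrices have already been estimated (for example with RANSAC on the matched static features), the dynamic feature's temporal line in the reference image is intersected with the projected trajectory line, and the images are then ordered by position along that line. The function and variable names here are hypothetical, not from the authors' code.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two 2-D points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect(l1, l2):
    """Intersection point of two homogeneous lines."""
    x = np.cross(l1, l2)
    return x[:2] / x[2]

def order_images(p1, p2, others):
    """p1, p2: the dynamic feature in the two images of the static pair
    (reference camera), in temporal order.  others: list of (image_id, F, x)
    where F maps image k into the reference image and x is the feature's
    location in image k.  Returns image ids ordered along the trajectory."""
    p1 = np.asarray(p1, float)
    p2 = np.asarray(p2, float)
    traj = line_through(p1, p2)                     # projected 3-D trajectory
    d = p2 - p1                                     # motion direction (i1 -> i2)
    samples = [("i1", 0.0), ("i2", float(d @ d))]
    for image_id, F, x in others:
        temporal = F @ np.array([x[0], x[1], 1.0])  # temporal (epipolar) line
        q = intersect(temporal, traj)               # feature at time t_k, in reference view
        samples.append((image_id, float((q - p1) @ d)))  # 1-D coordinate along trajectory
    samples.sort(key=lambda s: s[1])                # sort along the known i1 -> i2 direction
    return [image_id for image_id, _ in samples]
```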
>>: Obviously the reverse order is just as valid so you are just putting both of those in there.
[indiscernible]
>> Shai Avidan: We assume that we have two images from the reference camera and we
assume we know the order of them.
>>: So you have two from the same camera?
>> Shai Avidan: Yes.
>>: Do you know the relative times?
>> Shai Avidan: Yes.
>>: [indiscernible] and then the other one obviously is that you're not crossing the line of
action so to speak. [indiscernible]
>> Shai Avidan: Yeah. This plays to the assumption that the images are taken at approximately
the same time so a point doesn't have to go back and forth. We assume it's only going in one
direction. If it's oscillating then we have a problem and I'll show you an example later on
dealing with that. One nice thing to observe here, and it goes to the fact that we don't do 3-D reconstruction: I'm talking all the time about points moving along a straight line. In fact, we know the path of this boat, you can see it from the trail it's leaving behind, and it's not a straight line. But because all you care about is the temporal order, the algorithm is robust enough to give us the correct order even though the objects do not move along a straight line. Another point that I would like to make is that each feature point can
move along a different 3-D line in a different direction. They don't have to all move in the same
direction. That's not necessary.
>>: [indiscernible]
>> Shai Avidan: Yeah. It doesn't matter, if you cannot approximate its motion by a straight line,
so it depends on how curved the actual trajectory is. As long as the order of the points does not flip when you project them onto the line, you will get the correct order.
>>: I guess what I was trying to say is in some cases there is a line of action, right? And if you
do go to the other side?
>> Shai Avidan: Yeah, yeah.
>>: In one case the line is going from left to right and in the other it goes completely reversed.
>> Shai Avidan: It doesn't matter in the sense -- oh. It would matter for this feature point,
yeah, but if you have -- so it will work if you have one dynamic object moving in this direction and another dynamic object moving in that direction; that would be fine. Looking from both sides of the action line would be a problem, yeah. We have an example of a similar
nature later in the presentation. I'll mention that. There are a couple of limitations with this
approach. The first one is you have this reference image which means that you need to match
all feature points in all the images to the reference image, so it limits the number of images you can work with. And the second problem is that you are not making use of all the
information that you will have and I will show an example in a minute. Here is an example of a
parade in Tel Aviv. Let's say this is a reference image. It would be extremely difficult to match
this image and this image to the reference image. The background is way too different even
though the dynamic object that is the focus of our work is very prominent, so we would like to
alleviate this problem somehow. The other problem is that usually people take more than one
picture with their camera and we are not taking advantage of the fact that we know the
temporal order of the images within a camera. We don't know the exact timestamp, but we
know that image two was taken after image one and before image three and so on. So we
want to make the approach more scalable by removing the reference image assumption and
we want to make the approach more realistic by looking at multiple images whose temporal order within each camera you know. So let's go back to our illustration. We don't have the 3-D trajectory line anymore, and all we have are the projections of the dynamic feature point in the
image plane. So let's take this as the reference image and we will do it for each of the images
as the reference image repeatedly, and you can map the feature points to temporal lines. So what you have in a particular image is a point and, let's say, three temporal
lines coming from the remaining three cameras. So here is another -- this is a zoom in just on
the image plane of a particular camera. This is the projection of the dynamic feature point and
these are the temporal lines corresponding to the projection of this dynamic point from
different timestamps on the image plane of this reference camera. So if we had the trajectory,
the projection of the trajectory, then the problem would be solved, but we don't have it, so this might be
the projection of the trajectory, this line here. Or it can be this line here. So how are we going
to solve this problem? One thing to notice is that if you take this line and perturb it slightly, the
order of the points will not change. So the question is when will they change? So there are two
types of cases in which it will change. The first one is very simple. The trajectory line must pass through this point, so to define a candidate critical line we need another point, and we'll take that point to be the intersection of two of the temporal lines, for instance. You'll notice that if the trajectory line is to the right of this line, then the order is green, then red, or red, then green, depending on the direction in which you're looking. And if this black line crosses this critical line, then the order of the points changes. This was green then red and now it switches to red and then
green. So this is one type of critical line. Every time the trajectory line crosses a critical line the
order of the points along the trajectory changes. The second type of critical line is a line that is parallel to one of the temporal lines but passes through the
feature point here. So to see why this is true, if you have this as a line then you have blue and
then green and if you cross this critical line you have green then blue. So taken together we
have removed the static image pair assumption, and therefore we have dramatically increased the number of possible permutations that a single dynamic feature can induce. In fact, if you look at the image plane, it's now divided by all of those critical lines into regions, and it's easy to agree that within each region you have just one order, well, actually two, up to the direction of the points, which implies the order of the
images. So here is a quiz. Let's say you have 15 images, what's the maximum number of
permutations? Let me remind you that N factorial is about 10 to the 12 possible permutations, so if I'm giving you this number, how many different permutations will you get
from the analysis I've just shown you? Is the question clear?
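For reference, the count that comes up in the answer below can be checked with a couple of lines, following the same reasoning (two directions per region, critical lines through intersection points plus critical lines parallel to the temporal lines):

```python
from math import comb

N = 15                                   # number of images
lines = N - 1                            # temporal lines seen at the feature point
critical = comb(lines, 2) + lines        # type 1 (through intersections) + type 2 (parallel)
print(2 * critical)                      # 210 possible permutations
```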
>>: [indiscernible] number of points?
>> Shai Avidan: Yes. So let's see. The number of permutations is two times the number of regions, because each region can give either the forward or the backward order, and the number of critical lines of the first type, those through intersection points of temporal lines, is N-1 choose 2, and for the parallel lines, the second type of critical lines, it's N-1. If you do the computation you end up having 210 possible permutations. Okay? So in the case of N equals 15, that's the maximum
number of permutations. In practice, you can do much better and this is an example showing
that. What we see here is a point. This is the dynamic feature point and you see all the
temporal lines from the remaining images in the data set and there are two cameras. And we
know the order of the images in each camera. This is one particular order, and this is the particular order of the images from the other camera. So now you evaluate all these 210 different regions; for each one, you check whether it violates any of these known within-camera orders, and if it does, you throw it away; you know it's not a possible solution. In practice, we get about four or five different permutations for each dynamic feature, so we've replaced the static image pair assumption with something much weaker, but we still get a very small number of permutations per
dynamic feature. So to recap what I've shown you so far, we're using geometry and geometric
reasoning to infer the temporal order for each dynamic feature. You can either do 3-D
reconstruction, use the solution based on the static image pair, or use these temporal constraints that I've shown you a minute ago. In any case each dynamic feature will give you one or a few candidate temporal orders on the image set. The second question that we need to solve is how to aggregate all the information into one consistent, global order
of all the images. The way to solve that is to think of the problem as a graph. Each node is an
image and let's say that this is the correct temporal order of the images. Then you have the
edges like this: i3 goes to i1, then to i4, i5 and i2. Now if we were lucky and there was one dynamic feature that appeared in all the images, let's say this red point, and there were no errors, we would get that, and then it's just a topological sort on the graph; getting the solution is trivial.
practice that's not what we have. What we have is a partial order from the green point,
another partial order from the blue point and another partial order from the red point. Now
the problem is that there might be errors and the way errors are manifested in this graph is
that there might be a cycle. If you look at the edges you'll see that there is a cycle between 2, 1
and 4 and that's impossible. Okay? So the question is how do you aggregate the information
and resolve this conflict. And the way to solve it is with a tool called rank aggregation. You define a metric on the space of all possible permutations; essentially, you take two permutations and the distance between them is the number of pairs on whose order they disagree. So we are looking for a consensus order sigma that is as consistent as possible with each of the partial orders sigma_i: each dynamic feature voted for one or a few temporal orders sigma_i, and now we're looking for one global temporal order sigma. This
problem has been investigated extensively in the social sciences; this is how we do voting. In the context of computer science, we are following a work by Dwork et al. from 2001, and what they had is the following problem: you run a query and get results from multiple search engines, so now you have the pages ranked by each search engine separately, and you want to aggregate all the rankings into one globally consistent ranking. Their solution is based on a Markov chain approximation; the problem in general is known to be NP-hard. So how is the solution
going to work? We construct the transition matrix. We build the graph and we construct the
edges such that the weight of an edge is the probability that, say, image one appeared before image four. You can simply count the number of features that voted i1 before i4, and that would be the probability, the weight of edge ij. Now you can look at the transition matrix that is associated with this graph and run a Markov chain. Essentially you initialize the state, the probability that the last state is each of the images, to be a uniform distribution, so it's 1 over 5, and if you look at the leading eigenvector of this transition matrix M
you'll get that it's 1 on the last state and 0 everywhere else. Imagine a random walk. As you
take it to infinity with probability 1 it will end at the last position. So you can place image 2
there; it's the last image, remove it, and repeat. So that's the whole algorithm. What happens
in case you have a conflict, a conflict meaning you have a cycle? In that case when you look at
the eigenvector it will not be all zeros and just one; it will be something like this. Let's say there
is a cycle, so all the elements outside the cycle will be 0, but within the cycle there will be some
values, and you are not guaranteed that the correct image within the cycle will have the
highest probability. Let's say this image got 68 percent probability so we choose that to be the
last image, but it might have been this one or this one; that's the approximate nature of the algorithm. That's why we can't guarantee the globally optimal solution. Now for the results. So the first
question people ask us is why bother with this approach to begin with. Why can't you just use
the clocks on the cell phones? And I thank you for not asking it until now. So we ran an
experiment. We displayed a stopwatch on the screen and asked all of the students in class to
take a picture when it reaches 10 seconds and again when it switches to 20 seconds and send
us the images. And this is the histogram of the times that you get. This is the time offset and
this is the number of people that took an image at this time offset, so there is a total of 40
images here. A couple of things to notice here. The range is about 6 seconds, so if you're
talking about a dynamic event, 6 seconds, depending on the type of event, can be quite a lot. That's problem number one. Problem number two is that it's not a unimodal distribution, I think because there are two or three types of carriers in Israel. Maybe each group of phones is clustered around a different peak. Moreover, there are
a couple of outliers that I'm not showing here that had their camera phones completely out of
sync. They were several minutes off. So given all these reasons, I think there is room for using
a vision-based technology to properly synchronize images. Here are some results. Here are
nine images I think taken with two cameras. In this case we are using the static image pair
assumption. These are the two images that were taken by the static camera. And these are the images in random order. It's not very clear what this guy is doing. And that's not me, by the
way. And this is the result you get. Okay. I'm playing it back and forth. So you can clearly see
that he is skating from left to right. Here is another example, the same setup. In this case look at the little girl on the slide. Her location will give an indication of how well we are doing. And again. So we're recovering the correct temporal order here. And another
example. This is a challenging example because if you look at the image, the top third is blue
sky. There are no features there. The bottom third is reflections of the waterslide so it's
difficult to get any feature points, so just getting the temporal lines, the static feature points
and computing the fundamental matrix is not trivial and another problem is the motion of the
girl is almost parallel to the temporal lines which makes the intersection that much more
sensitive. So I think the 3-D reconstruction here will not work very well, but in our case you see
that we got a fairly good result. Okay. A couple of experiments using the data courtesy of Park
et al from ECCV ‘10. We have here five cameras and we have a number of images from each
camera. So in this case we're using this assumption, not the static image pair assumption. The points were manually marked in the images, and going back to your comment earlier, the guy is actually climbing up and down, but because we only assume that the object is moving in a single direction, we clipped the sequence to only the images going in one direction. Otherwise it would be messed up. And you can see that he is constantly
climbing up. Let me play that again. And another example from the same paper. There are 14
images, four cameras and we correctly recovered the temporal order here. And the last
example is from a parade. There are three cameras, four images and two images respectively
and this is the result that we got, and again. Okay. So let me summarize what we've done. I
think it's a cool problem. It also forces you to think about what assumptions you want to make and to play with geometry, which I personally like very much. Also, I haven't seen much work on rank aggregation in computer vision, and I think it's a very powerful tool; in many instances you can think of, you have partial ranks and you want to
aggregate everything together. Looking ahead, I think the main problem we had is that detecting and matching feature points is extremely difficult. It works well for the static parts, where SIFT or ASIFT will work fine, but for the dynamic feature points, first we couldn't get SIFT to detect feature points on the dynamic object, and even if we managed to do that, then the matching would fail because the descriptor is not working. You can't match the descriptor of the same dynamic point because, especially on humans, they change their appearance too much for the matching to work well. So in this case what we did is we used the code by the Adobe group for non-rigid dense correspondence, which computes dense flow between two images, and it seemed to work better for our problem. As a
follow-up, better matching will give us scalability. Ideally you want to work with hundreds of
thousands of images and if you have automatic matching then you can scale really well. And
there are probably new applications, for instance separating the dynamic from the static objects in
the scene to highlight if something really changes. That's it. Thank you very much. [applause]
yeah?
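To illustrate the dense-correspondence fallback mentioned in the summary, here is a rough Python sketch that tracks a dynamic point between two images by looking up a dense flow field at its location. It uses OpenCV's Farneback optical flow purely as a stand-in; the talk refers to Adobe's non-rigid dense correspondence code, which handles large viewpoint and appearance changes far better than plain optical flow. Names and parameters are illustrative.

```python
import cv2

def track_point_dense(img_a, img_b, point):
    """Propagate a dynamic feature point from img_a to img_b via dense flow."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        pyr_scale=0.5, levels=4, winsize=21,
                                        iterations=3, poly_n=7, poly_sigma=1.5,
                                        flags=0)
    x, y = int(round(point[0])), int(round(point[1]))
    dx, dy = flow[y, x]                        # flow vector at the feature location
    return (point[0] + dx, point[1] + dy)      # corresponding location in img_b
```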
>>: So I got a question. Do you basically aggregate votes and then find the majority vote? We
know that time has one way of progressing and it seems enough that you have one good vote.
I'll give you an example. Suppose you have a scene where you have a lot of objects that have a periodic motion, let's say between two points, and one or a few objects that have a linear
motion around. If you are looking at that you would know that this is not the [indiscernible]
>> Shai Avidan: We are not going into that in the talk, but in the paper we did some work trying to
figure out the confidence that each dynamic feature has, so we might say this dynamic feature
appeared in more images and therefore we trust it more. Or you can add all these priors into
the voting process, so that if you have a dynamic feature that you are very confident in,
then it can have a higher vote. And this will completely dominate the solution and then you'll
get the trivial solution. So there is a way to introduce those priors into the framework. The
rank aggregation is agnostic to all of this. It just takes a matrix that tells the probability of i
occurring before j and gives you the solution. Yeah?
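A minimal, hypothetical Python sketch of the rank-aggregation step as described in the talk: build a transition matrix from the pairwise votes, take the leading eigenvector to find the most likely last image, remove it, and repeat. It assumes votes[i, j] counts how many dynamic-feature orders placed image i before image j; it is a simplification, not the exact Markov-chain variant of Dwork et al.

```python
import numpy as np

def aggregate_order(votes):
    """votes[i, j]: number of partial orders that voted image i before image j.
    Returns one globally consistent order of image indices, earliest first."""
    remaining = list(range(votes.shape[0]))
    last_to_first = []
    while len(remaining) > 1:
        V = votes[np.ix_(remaining, remaining)].astype(float) + 1e-6  # smoothing
        P = V / V.sum(axis=1, keepdims=True)     # walk from i to j if i came before j
        vals, vecs = np.linalg.eig(P.T)          # stationary distribution of the chain
        stat = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
        last = remaining[int(np.argmax(stat))]   # mass concentrates on the latest image
        last_to_first.append(last)
        remaining.remove(last)
    last_to_first.append(remaining[0])
    return last_to_first[::-1]
```

In case of a cycle, the stationary mass spreads over the images in the cycle and the argmax is only a guess, which is exactly the approximation discussed above.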
>>: When you are not using any time information from the camera, why doesn't the order go backwards? Aren't both directions equally plausible?
>> Shai Avidan: Yeah. So in this case we manually selected, out of the two possible solutions, the solution where the boat is going forward. But you're completely right. In the first case
when you have the static image pair assumption, then because you know the order of the two
images is from the static camera, it determines the sign bit of time.
>>: You said you match, or you determine the probability that one image happened before the
other based on the number of matched features, right? Did you try any other things, like the
quality of the features or any kind of geometric consistency?
>> Shai Avidan: I don't think we did. You can think of the quality of the match I think. That
would be one option, but in practice that was never really the problem. We also did synthetic
experiments where we used 200 or 300 cameras or images, if you will, and we introduced errors, and rank aggregation gave a solution that was like 98 percent accurate. The real problem is
getting the matching. That's the killer problem.
>>: These are relatively controlled scenarios. Did you try it, you know, in the wild?
>> Shai Avidan: The carnival is as wild as we could get our hands on. It's difficult because the background changes dramatically. When you look at the images, you see that the background changes quite a bit and there is a lot of motion, so just getting the
feature points there is difficult.
>>: Did you try it on sports, like say football? Assuming that the entity is moving forward and you have roughly linear motion, though you still have players with similar jerseys and stuff. I would assume that the matching would probably be better?
>> Shai Avidan: I agree. I wish. We don't have the data set for that. If we can get enough data
we would love to try our algorithm on that. Thank you. [applause]