>> Rick Szeliski: Okay. Good morning, everyone. My name is Rick Szeliski, and it's
my pleasure to introduce James Philbin who is a graduate student at the University of
Oxford.
James has been doing some amazing work, published in the last few Computer Vision
Conferences on matching images, especially with very large databases such as finding
pictures, matching any new photograph to pictures of all the buildings in Oxford. And
he's going to tell us -- give us sort of an overview of a large number of papers he's been
working on and some of the most recent work that just came out this year.
>> James Philbin: Okay. So thanks, Rick, for the introduction. I'm going to talk on
organizing Flickr, so object mining, using particular object retrieval. I must say this is
joint work with Ondrej Chum, Josef Sivic, Michael Isard and Andrew Zisserman.
So the aim here is to automatically organize large collections of unordered, unlabeled
images, things like Flickr. So you might be thinking, well, isn't Flickr already organized
by tags? You know, I can type something into Flickr and I get images which are
potentially relevant back.
So here's an example. I've never been to Redmond before, so I searched for -- did a
search in Flickr for Redmond. And these three images here I got back from the first
page. So one is some guys jumping around in a balloon. It's actually from New York.
But it was part of a photo set in which he had taken some things in Redmond.
The image in the middle here is Allysa Redmond who is the woman with the longest
fingernails in the world. Not particularly useful to me.
And on the right here is a lunar eclipse which was taken from Redmond. But, again, not
useful for me.
So tags are often inaccurate and ambiguous, or just not descriptive enough to be useful.
And so here we're going to demonstrate organizing or clustering images just on the image
content alone. And specifically in this work we want to cluster images containing the
same objects, so same landmark or building, et cetera. So I'm looking at this car rather
than the set of all cars.
So an outline of my talk, it's going to be sort of in two parts. I'm going to introduce sort
of -- or talk about particular object retrieval and take you from the sort of baseline Video
Google method to the state of the art. And I'm going to do that using sort of large
vocabularies, spatial re-ranking, soft assignment, and query expansion.
And the second part of my talk is going to be on using particular object retrieval to
perform object mining. And we'll talk about building a matching graph and then look at
two different methods for doing the mining. So one uses standard graph clustering
techniques, and the other uses a sort of mixture of models -- mixture-of-objects topic
model based approach.
So a word about the dataset I'll be demonstrating a lot of the particular object
retrieval work on. It's a dataset of Oxford buildings, automatically crawled from
Flickr. And it consists of just over 5,000 images found by searching for various Oxford
landmarks. So you type into Flickr Oxford Radcliffe Camera, Oxford Christ Church
College, and you get a lot of results back.
Of course as we saw before, a lot of these tags are not very good, so I think when you
type these in, only about 5 percent of the images actually contain some of these
landmarks. These are all medium-resolution images. There's a sample at the bottom
there.
And we've gone through and manually labeled groundtruth for 11 landmarks. And each
landmark has five queries each. So you can see all of the 55 queries here.
And to evaluate our system, we use a sort of mean of the average precision for each
query. So we sum up all the average precisions for each query and divide by 55 to get
sort of an overall view of the performance of our system.
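A minimal sketch of the evaluation just described, average precision per query and then the mean over the 55 queries; the function names are illustrative, and the handling of junk/ambiguous images in the real Oxford Buildings evaluation is omitted.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k at every rank where a relevant image appears."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for k, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(results_per_query, groundtruth_per_query):
    """mAP over all queries (55 for the Oxford Buildings ground truth)."""
    aps = [average_precision(results_per_query[q], groundtruth_per_query[q])
           for q in results_per_query]
    return sum(aps) / len(aps)
```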
And so, to test the scalability, we also have two other sort of distractor datasets. So the first
one is sort of a hundred thousand images, and another which is just over a million
images. And it's assumed that none of our groundtruth objects are present in these
datasets. So we can just wrap them in and see how the system is scaling.
So, particular object retrieval. What's the aim? So the idea is that the user selects a
subimage or a subpart of a query image. So here and here. And we've got a large unlabeled,
unordered collection of images. And we want to find other occurrences of that object in
this dataset. So we selected the bridge there. And you can see actually the bridge occurs
twice: so once in the middle there; once in the bottom left.
And this is the Radcliffe Camera, the other query, and it occurs twice as well. So once
there, but also way off in the distance in that top left image.
So this gives you some of the idea of the challenge of doing this. We want to search
millions of images at realtime speeds for particular objects, and we want to be able to find
these objects despite changes in scale, viewpoint, lighting and occlusion.
So here we've got sort of a two-time scale change, 45-degree viewpoint change, a lighting
change, the day to night, and occlusion. So this building is seen sort of through a
pane-glass window.
So I'm just going to review the sort of Video Google, the baseline method, which we're
going to build upon. So we start with every -- for every image in our dataset. And we're
going to find sparse affine invariant regions. So these are going to be Hessian affine
interest points, so they're elliptical regions in the images, which are supposed to be
somewhat invariant to viewpoint.
And we're going to take SIFT descriptors. So for each image we find interest points, we
generate SIFT descriptors. We're going to put all these descriptors into a big matrix
and quantize, so we've gone from elliptical region to descriptor to a single number. And
this is going to represent a visual word in our model.
So we can now represent each image as a bag of visual words, so it's just a sparse
histogram of the visual word occurrences.
And to compare two images, we're now sort of fully in the text domain because we've got
these two sparse histograms which we're going to sort of treat as a proxy for those
images. And now if you want to compare two images, we're going to simply compare
their histogram of visual words and score them.
And the baseline method, well, we just use tf-idf weighting from the text retrieval. And
to search on an image, we accumulate all the visual words in the query region and we use
an inverted index to find the documents that share at least one word. And we compute
similarity for those documents using tf-idf with L2 distance. And this gives us a rank list
for all the documents in our dataset.
So this inverted index is essentially a sort of reverse or a book index, so it maps from
particular words to the documents in which they occurred. And by looking at this index,
we only have to consider documents which share at least one word.
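As a rough sketch of this bag-of-visual-words scoring, assuming each image has already been quantized into a list of visual word IDs; the class and the scoring detail (dot product over L2-normalized tf-idf vectors, which ranks the same way as L2 distance on normalized vectors) are illustrative, not the actual system's code.

```python
import math
from collections import Counter, defaultdict

class InvertedIndex:
    """Toy tf-idf retrieval over bag-of-visual-words histograms."""

    def __init__(self):
        self.postings = defaultdict(dict)   # word_id -> {doc_id: term frequency}

    def add(self, doc_id, visual_words):
        for word, tf in Counter(visual_words).items():
            self.postings[word][doc_id] = tf

    def finalize(self):
        # idf per visual word, then the L2 norm of each document's tf-idf vector.
        docs = {d for posting in self.postings.values() for d in posting}
        self.idf = {w: math.log(len(docs) / len(p)) for w, p in self.postings.items()}
        sq = defaultdict(float)
        for w, posting in self.postings.items():
            for d, tf in posting.items():
                sq[d] += (tf * self.idf[w]) ** 2
        self.doc_norms = {d: math.sqrt(s) for d, s in sq.items()}

    def query(self, visual_words):
        # Only documents sharing at least one visual word with the query are touched.
        scores = defaultdict(float)
        for word, tf_q in Counter(visual_words).items():
            idf = self.idf.get(word, 0.0)
            for doc_id, tf_d in self.postings.get(word, {}).items():
                scores[doc_id] += (tf_q * idf) * (tf_d * idf)
        # Dividing by the document norm gives the cosine ranking (the query norm is
        # constant per query, so it does not change the ordering).
        return sorted(((s / self.doc_norms[d], d) for d, s in scores.items()), reverse=True)
```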
So when you implement the system, the first issue you sort of come up against is how
does vocabulary size affect performance. And it seems to basically increase as you
increase K. So for flat K-means, this graph covers a range of K from around 10,000 to
50,000. Performance just keeps on increasing.
But unfortunately K-means doesn't really scale, so K-means scales as order NK where N
is the number of descriptors and K is the number of clusters. So can we do better for
large K?
So the bottleneck for K-means is basically a nearest neighbor search, so from every
descriptor to every cluster center. And we can actually use approximate nearest neighbor
search techniques to reduce the complexity to order N log K. And sort of empirically for
small K, the approximate method doesn't seem to harm the cluster quality.
So the table on the bottom left here, I've compared flat K-means or standard K-means to
approximate K-means. And you can basically see the difference is sort of negligible.
But we're able to scale the approximate methods to much larger vocabulary sizes.
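A sketch of the approximate K-means idea under those assumptions: Lloyd's iterations with the exact nearest-centre search replaced by an approximate k-d tree query. The actual system uses a forest of randomized k-d trees; a single SciPy tree with a non-zero `eps` is used here as a stand-in, and all parameters are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def approximate_kmeans(descriptors, k, n_iters=20, eps=1.0, seed=0):
    """Lloyd's iterations with the nearest-centre search done approximately via a
    k-d tree, giving roughly O(N log K) work per iteration instead of O(NK)."""
    data = np.asarray(descriptors, dtype=np.float32)
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)].copy()
    assign = np.zeros(len(data), dtype=np.intp)
    for _ in range(n_iters):
        tree = cKDTree(centres)                        # rebuild the index each iteration
        _, assign = tree.query(data, k=1, eps=eps)     # approximate nearest centre
        for c in range(k):
            members = data[assign == c]
            if len(members):                           # empty clusters keep their old centre
                centres[c] = members.mean(axis=0)
    return centres, assign
```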
And so empirically it seems that large vocabularies are much better for particular object
retrieval. So here we actually start at 50,000 and go all the way out to one and a quarter
million. And we find a peak at a million visual words. So this is much, much bigger than
people normally do with K-means or with clustering.
>>: What is the measure here?
>> James Philbin: So this is mean average precision. So this is the mean of all the
average precisions for the 55 queries in the Oxford dataset. But there's a sort of plateau
from about a quarter of a million.
>>: So sometimes it does better [inaudible] the approximate K-means -- sorry, the
approximate nearest neighbor --
>> James Philbin: Right. I mean, there's a certain amount of noise here. So I think to
really find out, you need to sort of plot some error bars. Because you -- you know,
K-means, you randomly initialize it. So there's going to be some room for error.
>>: [inaudible] actually might get our minimum [inaudible] K-means because you
change the nearest neighbor structure [inaudible].
>> James Philbin: Yeah.
>>: So it's perfectly possible [inaudible].
>> James Philbin: So it could be that it's almost a bit like sort of stochastic -- some sort
of stochastic method where there's a bit of randomness in the system because it's
approximate. You don't always get the exact nearest neighbor. And actually by
sometimes getting it wrong you can minimize the -- the sort of objective function better.
I'm not -- I'm not going to claim that actually because the results don't necessarily show
it. But it's just an intuition.
But one thing to note is that when you do the final assignment, you want to be sort of
right. So although it might have been true when you built the clusters, you want to be
more accurate when you actually finally assign them and we actually up the thresholds on
the approximate K-means when you do the assignment.
>>: What is [inaudible] approximate nearest neighbor?
>> James Philbin: So we're using sort of approximate KD trees.
>>: [inaudible]
>> James Philbin: Yes. It's best-bin-first, but we have multiple trees as well with different
random sort of orderings of the --
>>: [inaudible]
>> James Philbin: Right. And then we have a single sort of priority queue for exploring
these trees.
>>: For all the trees [inaudible].
>> James Philbin: For all the trees. And that makes a big difference actually.
>>: Are you going to talk about that or...
>> James Philbin: I wasn't, actually.
>>: Oh, okay.
>> James Philbin: But we can talk about it later. It's actually an idea of David Lowe's,
and I think a student of his has probably published it. But I wouldn't be able to tell you
the paper. But we can talk about it later.
>>: [inaudible]
>>: There's a whole survey [inaudible]?
>>: From David, yeah. Whether it's published or not or whether -- it's public. You can
get it from [inaudible].
>>: Okay.
>> James Philbin: So he visited Oxford for a year and was sort of talking about this stuff
and gave us some code and we tried it and it worked well. But I think at that point it
wasn't actually published.
>>: I think it's actually [inaudible].
>> James Philbin: It might well be.
[multiple people speaking at once]
>> James Philbin: Okay. So that's sort of using large vocabularies.
The second thing we can look at is can we use the position and shape of the underlying
features to improve retrieval quality. So at the moment we're just considering an image as
a bag of words. And obviously we could take all the [inaudible] which define those
words and jumble them up any way we like. And that's going to have the same visual
words representation. But it's not going to look anything like the object we're searching
for.
So as an example, we've got the query on my left and two potential results on the right
here. And both have quite a lot of matches. And so the question is just using the visual
word histograms alone, we're not going to be able to disambiguate which one is correct
and which one isn't.
So what we want to do is enforce some spatial consistency between the query and each
result. And this improves retrieval quality a lot.
So on the left here we can now see when we've enforced the consistency, we get some
good set of inliers. So this is sort of correspondences which agree with the hypothesis.
And on the right not many consistent matches. So we can say fairly confidently that this
object isn't present here.
And as an extra bonus it gives us localization of the object in the target.
>>: [inaudible]
>> James Philbin: That's okay.
>>: Would you rather I --
>> James Philbin: Oh, no, that's fine.
>>: Well, you're going to have to go back about the nearest neighbor. So essentially
using several random trees with the same queue, you do something that is like LSH
because you're just going to cut the -- you're going to search [inaudible] with proximity
along these trees. So is this the motivation or...
>> James Philbin: I wasn't sure what the question is actually.
>>: So you're familiar with LSH?
>> James Philbin: Yeah.
>>: So doing several random KD trees is similar to doing LSH because you're going to
take -- in LSH you're going to do several random cuts and you're going to look at certain
bins there.
>> James Philbin: Right.
>>: So in a way you -- it can be seen as a way to combine the two. I guess that's --
>> James Philbin: Yeah.
>>: I was wondering if that's --
>> James Philbin: Especially in high dimensions you often get -- with KD trees you get
these splits which are sort of very -- these very thin splits in the space. And actually
the -- sort of the error or the amount of [inaudible] fall in the bin isn't necessarily going to
be very large.
So by doing multiple splits, you're sort of hedging your bet in a way. You can sort of
explore this space in a better way than just having a sort of fixed number of discrete sort
of hyper-rectangles is what you're searching in.
So how does the spatial work? So because we have elliptical regions, we can actually use
the shape of these [inaudible] to do a form of RANSAC which actually isn't randomized
because we can go through the linear list of correspondences.
So for each correspondence, we're going to generate a hypothesis, which is here. And so
each elliptical match gives you 4 degrees of freedom for the transform between these two
images. And we further constrain it by assuming the images are taken upright, which isn't always
the case but is normally the case in Flickr images. And that gives you a full 5 degree of
freedom affine transform.
Then we're going to find inliers as is standard. And we estimate the transformation. And
then we're going to score this image here by the number of consistent matches.
So I should say that this is obviously -- this is very fast because we just use one
correspondence to generate a hypothesis. But it's still quite expensive because you've got to
go to disk and read these features in.
So we actually only need to do this for the top sort of 200 ranked documents after we've done
the rest of the retrieval.
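A minimal sketch of the hypothesise-and-verify step: each correspondence proposes one transform, and the image is scored by the largest consistent set. For simplicity this transfers the full ellipse shape rather than the restricted upright 5-dof transform described above, skips re-estimating the transform from the inliers, and the region/match formats and threshold are invented.

```python
import numpy as np

def hypothesis_from_match(q_region, t_region):
    """One correspondence fixes one transform hypothesis, so no random sampling is
    needed. Each region is (x, y, A) with A the 2x2 matrix mapping the unit circle
    onto the elliptical region; here the full ellipse shape is transferred."""
    (xq, yq, Aq), (xt, yt, At) = q_region, t_region
    M = At @ np.linalg.inv(Aq)                 # linear part implied by the two ellipse shapes
    t = np.array([xt, yt]) - M @ np.array([xq, yq])
    return M, t

def spatial_score(query_regions, target_regions, matches, tol=15.0):
    """Score an image by the largest number of matches consistent with one hypothesis."""
    best = 0
    for qi, ti in matches:
        M, t = hypothesis_from_match(query_regions[qi], target_regions[ti])
        inliers = 0
        for qj, tj in matches:
            xq, yq, _ = query_regions[qj]
            xt, yt, _ = target_regions[tj]
            pred = M @ np.array([xq, yq]) + t
            if np.hypot(*(pred - np.array([xt, yt]))) < tol:
                inliers += 1
        best = max(best, inliers)
    return best
```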
>>: You said 5 degree of freedom? You don't use the skew or what?
>> James Philbin: Um...
>>: A full affine would be 6, right? So you've taken one from [inaudible].
>> James Philbin: Um...
>>: Maybe it's because --
>>: It looks like vertical lines [inaudible] vertical lines, right?
>> James Philbin: Right. So it's not the full affine. So we've taken the skew out. You're
right. So it has to be something like ABVC. That's what we're learning.
Okay. So going to move on to the third thing we can do to improve retrieval.
So very large vocabularies we saw gave us good discrimination and good performance for
particular object search. But sort of a natural question is do we actually lose any
performance from the quantization error because we've split up this space into a million
bins essentially. You might imagine that there's going to be lots of points which are sort
of near [inaudible] boundaries in the K-means space.
So instead of assigning each descriptor to one visual word, we're going to sort of soft
assign them [inaudible] better localizations. We've moved from a sort of hard sort of
partitioning of the space on the left to something a bit more nuanced on the right.
And there's sort of two main problems that soft assignment should overcome here. So
we've got cluster centers A to E, and points 1, 2, 3, and 4. And so the first problem is
that points 3 and 4 here are close in the space, but we never match them because they lie
on either side of this boundary. And 1, 2, and 3 are normally matched equally, whereas
actually we'd quite like to say that 2 is closer to 3 than 1 is to 3.
So a sort of practical point is that we don't really want to regenerate all these clusters. And
even if we did manage to regenerate the clusters, it's not totally clear what method we
should use to take into account this soft assignment. So let's just pretend -- let's just take
all the clusters and pretend that they're univariate Gaussians from K-means. And we can
compute these Gaussian conditionals from the nearest R centers to each point and
represent each point as a multinomial.
We then L2 normalize the multinomials, and instead of scoring a match as 1, we just score it as
a dot product. So some rough intuition is that you've sort of projected down onto a
three-dimensional space or an R-dimensional space and that we're taking L2 distance in
this new space between two points. And now in the index, instead of scoring a match as
1, we score it as a dot product.
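A small sketch of that soft assignment, assuming a k-d tree built over the cluster centres (for instance the one from the approximate K-means sketch earlier); r and sigma are illustrative values, with the talk reporting r of roughly 3 to 5 working well.

```python
import numpy as np

def soft_assign(descriptor, centres_tree, r=3, sigma=80.0):
    """Weights for the r nearest visual words: exp(-d^2 / (2*sigma^2)), L2-normalized,
    so a match between two such assignments scores a dot product instead of 0/1."""
    dist, idx = centres_tree.query(descriptor, k=r)      # r nearest cluster centres
    w = np.exp(-np.asarray(dist) ** 2 / (2.0 * sigma ** 2))
    w /= np.linalg.norm(w)                               # L2 normalization
    return dict(zip(np.atleast_1d(idx).tolist(), np.atleast_1d(w).tolist()))

def match_score(assign_a, assign_b):
    """Score shared visual words by the dot product of their weights."""
    return sum(wa * assign_b.get(word, 0.0) for word, wa in assign_a.items())
```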
And in practice -- yeah.
>>: Question. How do you determine sigma? That must affect the performance.
>> James Philbin: It does. And we just [inaudible]. We tried a few [inaudible] and it --
it affected things a bit, not as dramatically as you might expect. Actually the R here, the number
of nearest centers, has a much bigger effect.
>>: There's one sigma for all clusters, and you used them [inaudible] looking up more
than one center [inaudible].
>> James Philbin: Yes.
>>: And whenever you find this --
>> James Philbin: So actually the KD tree gives you more than just the nearest neighbor
anyway.
>>: So how do you truncate that? Because you're not going to look at all --
>> James Philbin: We just fix R to be something small; 3 to 5 gives fairly good results.
I'm not going to claim that this is sort of theoretically the best we could have done, but --
>>: I'm just interested. So then you go into the index, and then you actually run
five lists instead of one?
>> James Philbin: Yes. So there is a slowdown definitely. And we sort of experimented
with making the vocabulary even larger to get these things to, you know, run at a
reasonable speed. And the other issue is that when you do the spatial, you've also got a lot
more correspondences. So this is actually a blessing and a curse.
>>: Do you have any feeling for what this does running it per word or you could sort of
do a new vocabulary and do several queries and some averaged results? Would that
achieve similar thing?
>> James Philbin: Not quite sure what you mean. A new vocabulary --
>>: Sort of do it per word now. If you do a query, [inaudible] word is going to run five lists
and weight them. But you could sort of actually just do it on a macro scale: you do one
query with hard assignment, and do another one with a completely different vocabulary and
do it again and you average the scores somehow.
>> James Philbin: Yeah. So we didn't look at that actually. But that could be -- that
sounds like an interesting thing. So it might be that if you don't get enough good
matches, somehow using just the hard vocabulary, then you go [inaudible] other ones.
I'm not sure.
>>: [inaudible]
>> James Philbin: Mmm. So empirically this gives us sort of two benefits. So the first
is we just get better initial results, which are sort of ranked higher initially, and then the
spatial verification picks them up.
So for hard assignment, for this query on the left, we got only one
good highly ranked result. The soft assignment gave us three others. Actually, three
extra.
The second benefit is better spatial localization. But I should say at the cost of having to
check more correspondences. For hard assignments, we had sort of two inliers between
these two images. And for soft assignment we get many more inliers and a better
estimation of the transform.
Okay. Last thing for particular object retrieval. So the last three things I talked about
were sort of first-order effects. So they're going to improve the quality -- the sort of
retrieval quality for a single query. And now I'm going to talk about something which is
sort of a second-order effect.
So the idea is that in text I'm sure you've all sort of witnessed this that when you use
Google you might type in a term and you have a look at the results coming back. And
then based on what you see in the results, you can refine your query. So maybe you saw
a document you liked and you saw another term that went with it, so you added that term
into your query and reissued it. And that sort of honed in on the documents you're
actually looking for.
So query expansion in text tries to sort of apply this idea automatically. And essentially
you reissue the top n results as queries. And there's some more intelligent stuff that
goes on, but that's essentially the idea. It's also called pseudo/blind relevance feedback.
But there's a big problem of topic drift, or a big danger. And actually this is such a big
problem for text that people don't often use it. So because you don't know when a
document is relevant, you can easily sort of diverge from what the true set of results
should be.
But in vision, we're in a different situation. Because we have these spatially verified
images, and when we have more than a certain number of inliers we can actually be very
sure that the object that was queried for does occur. We can just reissue these verified
image regions as queries. And this spatial verification gives us sort of an oracle of truth
or oracle of relevance for a particular result.
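In outline, the expansion loop looks something like this; search_fn, verify_fn, and the thresholds are placeholders rather than the actual system's interfaces.

```python
def query_expansion(query_words, search_fn, verify_fn, top_m=10, min_inliers=20):
    """Pseudo-relevance feedback gated by spatial verification: only results that
    verify with enough inliers are allowed to contribute words back to the query."""
    expanded = list(query_words)
    for image_id in search_fn(query_words)[:top_m]:
        inliers, homography, region_words = verify_fn(query_words, image_id)
        if inliers >= min_inliers:             # verification acts as the oracle of relevance
            expanded.extend(region_words)      # words back-projected from the verified region
    return search_fn(expanded)                 # reissue the enriched query
```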
>>: Did you do anything to make sure that the top matching images aren't near
duplicates? Because you could almost -- you know, you could just waste your time
reissuing an almost identical image. In other words, do you look for something that's
different enough because that's more likely to [inaudible]?
>> James Philbin: We don't actually. Well, okay. So when there's more than a certain
number of inliers, we tend to just -- we let that image go and we choose one with a few
less. But one of the things that you overcome here is actually some sort of quantization
error and interest point error. So it might be that although they're very similar, actually
your interest point is [inaudible] on a whole bunch of different stuff which is going to
make your query much richer.
So here's a visual example. So we got an original query on the left. And we perform
retrieval without spatial to begin with and we get some good results, highly ranked, but
also a few false positives [inaudible] were on the list.
Once we do spatial verification, we sort of cut out those two false positives and we got a
list of sort of good results here.
And also we have a homography going from the original query to each of these results.
And now what we can do is we can take all the features which lie within the query region
in the target image, back project them into our original space, and we get this sort of new
enhanced query.
So this is going to contain the visual words that were part of the original query but also a
whole bunch of new visual words, which we didn't see initially, but which sort of
partially matched to the original query.
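The back-projection step might look roughly like this, assuming (x, y, word_id) features, a homography estimated during verification, and an axis-aligned query box; all of those representations are illustrative.

```python
import numpy as np

def back_project_words(result_features, H_result_to_query, query_bbox):
    """Map features of a verified result back into the query frame and keep those
    that land inside the original query region; they enrich the expanded query."""
    xmin, ymin, xmax, ymax = query_bbox
    new_words = []
    for x, y, word_id in result_features:
        p = H_result_to_query @ np.array([x, y, 1.0])
        xq, yq = p[:2] / p[2]                  # homogeneous -> image coordinates
        if xmin <= xq <= xmax and ymin <= yq <= ymax:
            new_words.append((xq, yq, word_id))
    return new_words
```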
When we reissue that query, then you get a lot sort of more challenging results. So very
large scale change with some occlusion here. [inaudible] scale change. This is actually a
zoom of a crest which is here, which is almost difficult to see I think.
Here's another example. So we have the query image on the left and originally retrieved
image in the middle and an image which wasn't retrieved by this query. So it's dark and
there's occlusion. So there weren't -- there simply weren't enough visual features shared
between them to match it.
So here are the inliers to the image in the middle from that query. And here are the
inliers to this image we didn't retrieve. But from that middle image, what you can
imagine doing is projecting all those words which were in the predicted region in the
middle back onto the original query. And then you can match this image from that query.
>>: So which [inaudible] this is a new query or you can sort of make the graph offline
for the whole database [inaudible] looking into it in one place and we're just --
>> James Philbin: Right.
>>: -- cascading to the graph.
>> James Philbin: You're jumping ahead of me a bit. But that's certainly true. But
there's one important difference, which is that this is query dependent, whereas anything
you build offline would be for a whole image.
>>: [inaudible]
>> James Philbin: Not that important. But -- well, if you were searching for something
that's very small in the image, then you can easily imagine the rest of the features would
sort of swamp out [inaudible] and you're simply not going to find it. So, I mean, I'm
mainly showing examples here where the object takes up most of the image. But I'll
show a demo and we can see that it does work for small objects as well.
So the big effect of doing query expansion is that we go from something that is high
precision but not necessarily very good recall to something very high precision and high
recall. And those are some of the expanded results.
And you can also look at a histogram of these average precisions for each of the 55
queries. So obviously the idea would be that we just see a peak at one, so that means
perfect retrieval for all of the queries. And obviously we're shifting up. When we use
query expansion, we really shift all these results into the right of the graph here.
So that sort of concludes the particular object retrieval. And I'll just take you through
how the different methods actually improve performance. So this is sort of the baseline
method for Video Google. Vocabulary size of 10,000. And we started at .389. Then we
got a big boost from the large vocabulary. That took us to .618. The spatial re-ranking
took us to .653, so not as big a boost as before, but it allows us then to use query
expansion and takes us to .8. And the soft assignment, not such a big gain, but
significant, to .825.
Okay. I didn't set up the [inaudible].
>> Rick Szeliski: No. That's not going to work.
>> James Philbin: That's not going to work.
>>: [inaudible] the running times along with the accuracies on that [inaudible] how
much of a performance [inaudible].
>> James Philbin: Yes. So actually the large vocabulary actually improves runtime
because you have many fewer words to match. Spatial re-ranking increases it a bit, and it
depends on how far down the list you go, so we do something like the top 200. But we
also experimented with things like you can just go -- you go down the list until you start
seeing bad results, and then you stop. That sort of thing.
>>: [inaudible]
>>: If you want to boot it up later, James can run the demo. Because getting your laptop
onto the Net [inaudible].
>> James Philbin: [inaudible]
>>: [inaudible] run it later.
>>: Do you have any feeling for if you deploy soft assignment first? I mean, there's
all these diminishing [inaudible].
>> James Philbin: Yeah. So the gain in soft assignment without query expansion is
much better. I'm just showing the sort of accumulated results. In the paper, we go
through sort of each of the stages and how it affects things.
>>: Can you get your screen resolution down to [inaudible] 24 by 768?
[multiple people speaking at once]
>>: I'm guessing it will work if it's the right screen resolution. But we can try it. If it
doesn't work, we'll just go back.
>> James Philbin: Sure.
>>: This is on the Oxford dataset you mentioned earlier, right?
>> James Philbin: Yes.
>>: For any time you say things will be much harder, right [inaudible] very similar --
>> James Philbin: Right. I've had this said before. And is it really true? I don't know.
So, I mean, certainly Seattle seems to have enough distinctive buildings that you could
pick out something interesting from a dataset on it.
>>: I have a related question, which is I think this spatial verification stuff is -- do you
think it might be more important for buildings than, for example, blurry toys that are
small [inaudible] it's all about getting a couple of features and it's almost [inaudible]
geometry [inaudible]?
>> James Philbin: So definitely in the geometry because it's sort of affine, we're
assuming that it's all [inaudible] facades.
>>: But in the sort of face of all the repetitive structure and that -- that's what I can
imagine that the spatial verification is more important.
>> James Philbin: Yes. But often, especially when you get a large enough dataset,
you're just going to see every visual word. It just isn't descriptive enough. That's the
problem: as you add more and more data, you just end up making more and more errors.
You just happen to see those visual words in some orientation.
So you sort of have to have this filtering I think when you go through, and then you do
something a bit more expensive but on fewer images, and then you can imagine doing more
and more --
>>: And those results that you showed, how many distractors did you have in that --
>> James Philbin: Right. So this is actually just for the 5k, these numbers. In the paper
we go into the 100k and the million. And things go down.
>>: Kind of uniformly goes down?
>> James Philbin: Yeah. Yeah. I pretty much think. Yeah.
>>: Are all these steps possible to scale to the millions or some of them are?
>> James Philbin: All of these steps we've scaled up to the -- the sort of million size.
So it does actually scale -- so it scales maybe not in the sense that vision people mean it,
which means that I can run the 2 million dataset on my laptop, which is often what's
meant. But it scales in the sense that if you've got a cluster of computers, you can actually stick
disjoint sets of the data on each machine and you sort of issue the query [inaudible] the
results and return it. It scales in that sense, I think.
>>: I guess my question also is, for example, soft assignment cost you a factor 10 and
multiple speedup, scoring, or something like that --
>> James Philbin: So soft assignment is an issue because you have to store more data.
And you have to stick it in the index. And if you don't have a large vocabulary, then it
can really kill you actually.
I would say that of all of these, probably the most important are sort of query expansion
of the large vocabulary.
>>: So could you try it without spatial [inaudible] at all?
>> James Philbin: The query expansion.
>>: [inaudible]
>> James Philbin: The query expansion?
>>: No, no. Just -- you know, if you'd run the method of 1, 2, 4, and 5 [inaudible] did
you ever try?
>> James Philbin: Well, so you can't really do query expansion without the spatial.
>>: Because you'll get too much?
>> James Philbin: You just get junk.
>>: How about if you throw 3 and 4 [inaudible] why do the soft assignment again, for
example? Could be the case that it gives you much lower --
>> James Philbin: Right. I could have just shown these results inverted so the query
expansion after soft assignment.
>>: No, but did you try it without it?
>> James Philbin: We have. And the result's in the paper. I can't off the top of my head
remember what that --
>>: Okay.
>> James Philbin: -- exactly is. Yeah. So in the paper we went through all these sort of
combinations of this, that and the other, to see what works.
Actually, I'll tell you what I'll do. I'll do the demo at the end, and then I -- otherwise I'll
lose the flow of it. Okay.
So that sort of ends the first part of the talk. So we saw how we can really improve sort
of particular object retrieval. And now we're going to sort of apply these methods to a
task which is object mining. So the goal here is to ultimately find and group images of
the same objects or scene. So we've gone from a large sort of bunch of images on the
left, we've automatically pulled out these separate objects.
And we sort of envisioned several applications potentially for this method. So one could
be dataset summarization, so you're given a huge bunch of images, it's very difficult to
sort of say or to characterize that dataset. But I think, yeah, a useful thing to be able to do
is to say, you know, this dataset is mainly of the Trevi Fountain or of, you know, these
particular scenes or objects.
We can also use it for efficient retrieval. So if we really are sure that an object is seen in
two images, we don't have to index both images. And we can also use it as sort of a
back-end or preprocessing step for 3D reconstruction methods like Photosynth. So
obviously the bundle adjustment is going to work much better when you've got actual
images of the object you're interested in.
I'm going to show some results on two additional datasets. So the Statue of Liberty
dataset. It's just under 40,000 images, crawled from Flickr by searching for Statue of
Liberty. It has lots of images of the statue, but also New York and other sites.
And also on the Rome datasets. This is over a million images of Rome. Again, crawled
from Flickr. And I should say both of these datasets come from Noah and Steve and
Rick.
So our approach is to try and build a matching graph of all the images in the dataset. And
this graph is formed such that each node is an image -- or an image represents a node in the
graph -- and a link between two nodes sort of encodes that these two images have some
object in common between them. So they've been matched in some way. And then given
this graph structure, we can sort of apply various algorithms to group the data.
And so our approach is to use particular object retrieval to increase the search speed.
And instead of considering all pair-wise matches, we'll only consider images which we
know [inaudible] match.
So the procedure is: using retrieval, we query using each image in the dataset, and
each query gives us a list of results scored by a measure of the spatial consistency with the
query. And then we simply threshold this consistency measure to determine the final
links in our matching graph.
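A compact sketch of that procedure; retrieve_fn, verify_fn, and the inlier threshold are placeholders for the retrieval and spatial verification machinery described in the first part of the talk.

```python
def build_matching_graph(image_ids, retrieve_fn, verify_fn, threshold=20):
    """Query the index with every image; a pair becomes an edge when its spatial
    consistency score (e.g. verified inlier count) clears the threshold."""
    edges = set()
    for query_id in image_ids:
        for result_id in retrieve_fn(query_id):          # only images sharing visual words
            if result_id == query_id:
                continue
            if verify_fn(query_id, result_id) >= threshold:
                edges.add((min(query_id, result_id), max(query_id, result_id)))
    return edges
```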
So the first thing we can look at is we can say, well, for any collection of images of multiple
disjoint objects we'd expect the matching graph to also be disjoint. Otherwise, we're a
bit worried I think.
So the first simple step is to take connected components of this matching graph and just
have a look at the clusters or the components returned.
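Connected components over that graph can be taken with plain union-find; this is a generic sketch, not the actual pipeline code.

```python
from collections import defaultdict

def connected_components(image_ids, edges):
    """Union-find over the matching graph; each component is a candidate cluster."""
    parent = {i: i for i in image_ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]                # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    groups = defaultdict(list)
    for i in image_ids:
        groups[find(i)].append(i)
    return list(groups.values())
```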
And here's just on the Oxford dataset. You can see in some cases it does a pretty good
job already. So these first three clusters on the left are pretty pure. So we've got the
bridge on the left. This is the back of Christ Church College in Oxford in the middle.
And there's a college hall in the middle there on the right.
But these last two components contain a mixture of objects. Part of the problem
is that we have a linking image. So this is sort of a Georgian facade which is here and
here, but then you start to move around it and then suddenly you've matched this
building.
And, again, the same problem with this last component that you have, this Georgian
facade, two buildings are seen in this middle image, and then you've linked to some other
stuff.
>>: What's the problem with that? It won't [inaudible] building [inaudible] clusters?
>> James Philbin: Yes. I mean -- yeah. So the aim is you want to group all the images
which contain the same scene or the same object. And obviously if you're -- if you have
enough images of the world, you can just -- you can spread everything. You can just
[inaudible] everything, which sort of isn't very useful.
>>: Well, as long as you know what the [inaudible] still makes sense even if it's one
cluster.
>> James Philbin: I'm not sure. I'd think you'd like to be able to say, oh, look, these
group of images contain the same objects. Or at least that's [inaudible].
>>: For some applications, that long-range linking is very useful, right, you're trying to
basically rebuild as much of the city in a three-dimensional consistent coordinate frame,
you want things to link to each other.
>> James Philbin: Right. So -- yeah. We actually ran this method on the Rome data for
Noah and Steve, and I think they use it as some preprocessing to sort of more efficiently
do the matching later.
So as I was saying, the problem with taking connected components is that we have sort
of connecting images which can join two disjoint objects. So here I've actually got the
graph of one of the components here. And it contains sort of the bridge and the
Ashmolean Theater in Oxford behind it. And it's essentially linked by this image and this
image, where this object and the bridge are both seen in the same image.
So some connected components are very pure and they contain just a single common
object. Usually the problems are linking images. Some components contain more than
one object or scene. And so in the first method I'll present, we're just going to use some sort
of standard graph clustering and spectral clustering to produce sort of purer clusters. So
this is going to split up, say, this component into sort of two disjoint sets.
And this is quite fast to run once you have the graph. So here's some results for the
Statue of Liberty dataset. So we got a bunch of -- sort of 11,000 images of the Statue of
Liberty. There's a Lego Statue of Liberty in this data. And I think this is some building
on Staten Island in New York.
Here are some results of the Rome dataset. So Coliseum, obviously the most popular
object. But then also there's the Trevi Fountain. This is the St. Paul's Square -- St. Peter's
Square and the Vatican. And this is -- anyone know? I've forgotten that scene. So some
palace, I think.
>>: Paris?
>>: It's in Rome. But I forget the name --
>> James Philbin: It's quite interesting to see sort of what people take, I think.
So to get the results seen previously, we have to do a fair bit of manual tweaking. So
especially as the data gets bigger, you end up matching more and more and more. And
this means that you get sort of high-precision clusters but possibly low recall.
So these clusters are quite pure, but I imagine there's actually quite a lot more images
we've got to see that we've had to trim out, because otherwise we just cluster too much
together.
And basically any clustering method you care to mention basically assumes this
transitivity of the sort of relation between them. So our relation here is sort of contains
the same object. But clearly we don't have transitivity in this case. So image 1 contains
object A, and that links to image 2, which also has A. And image 2 has B, which also
links to image 3, which also has object B. But we can't say that image 1 and image 3
share the same object.
So we can imagine representing this quite formally by saying that we're going
to represent images as a mixture of discrete objects. And so an object is now going to
consist of a histogram of visual words but also some spatial layout. And it's sort of
related to the Latent Dirichlet Allocation topic model. And we call it Geometric Latent
Dirichlet Allocation. So it's similar, but it has this spatial layout sort of encoded into the
model. And the aim is to jointly infer from the model what the objects look like, so
their visual words and spatial locations, and which images contain them.
So here's the model in a bit more detail. So I think if you haven't seen LDA before, this
will be meaningless. But if you have, essentially we've augmented the topic model here
with essentially a pinboard of where the visual words actually appear in sort of a
canonical frame.
And we're also going to generate a homography. So this is a document topic-specific
homography which projects words from the document into the topic or vice versa. And
there's a few more priors there as well.
So a topic in our model corresponds to a particular object or a landmark. And each topic is
represented as a pinboard storing the identity, position and shape of the constituent visual
words. And we sort of generate a document from this model with a mixture of these
topics together with the homography and the pinboard. So we pick a word, we pick a
topic, we pick sort of the homography, and then we generate the document in this way, so
we accumulate if new visual words.
>>: So you're using the word pinboard or --
>> James Philbin: Yes. A pinboard. So imagine that -- yeah. A pinboard is basically
exactly what it sounds like. But you have a topic here and you have a bunch of visual
words. So we've got elliptical regions like this. So the pinboard consists of the position,
the shape of this word, and the identity.
>>: So is it a descriptor or --
>> James Philbin: It's just a -- it's an elliptical region, basically, with a word ID
associated to it.
>>: But that's different than the homography.
>> James Philbin: So the homography then takes this word from the topic into the
document.
>>: Oh, okay. So within the topic the features have their own little geometrical
relationships.
>> James Philbin: Exactly. So there's sort of a canonical frame within this topic.
>>: Okay.
>> James Philbin: And then there's a homography which sort of projects the relevant
features into the document.
>>: And then Eric Sudderth had something called a transformed LDA, right? Which
was where you take the parts of an object and then transformed them to where it sat in the
image?
>> James Philbin: Right. So Eric actually had -- he had sort of full distributions over
these words and had lots and lots of priors. And it was basically -- his method sort of
worked, but it's very difficult to learn. And that's basically the issue we faced, that if you
can actually use it on --
>>: [inaudible] generic objects, like a screen, where you don't really know [inaudible]
you're trying to model the rigid side which is fairly rigid, so you can [inaudible] have
some kind of a very rigid, definite idea of where things are located.
>> James Philbin: Exactly. And to find or to estimate these homographies, we can use
RANSAC. And that's really the important thing. It's actually tractable to learn it.
>>: [inaudible] distribution, then, on the visual words in your canonical frame there?
>> James Philbin: So the model is a bit tricky. And the pinboard actually isn't shown in
this graphical model, and that's sort of quite deliberate. Because we didn't want to have
to represent each visual word as a distribution, which we [inaudible] parameters over
because it's just too -- it's just too complex to learn.
So the pinboard comes in when you estimate the H here. So we're actually going to
estimate the H based on the likelihood of that homography given the pinboard. And
there's sort of an EM-like step where we're learning this model: estimating the pinboard given
the model, and estimating the model given the pinboards.
>>: [inaudible] offline process from this which kind of does --
>> James Philbin: Right. So there's sort of a -- there's probably another variable hidden
in here somewhere, which is actually directed out from the H. And also connects to those
words up there. So this is sort of how we learn it. So of course we use Gibbs because of
the topic model, and we're going to sample the topic assignments and the pinboard given
the current hypothesis, or current set of hypotheses. And then we sample the hypotheses
given the current topic assignments and the pinboard.
And actually step one is simply another Gibbs sample. And I won't go into all the
variables here. And step two is we're going to sample hypotheses from likely hypotheses
found using RANSAC. So we're doing essentially RANSAC between this pinboard and
each image, and this gives you a bunch of hypotheses. And we're going to basically have
a likelihood for a particular hypothesis being the number of inliers between the document
and the topic.
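At a very high level, the alternation just described might be skeletonized like this; every callable argument is a placeholder for machinery the talk only outlines, so this is a structural sketch rather than the actual inference code.

```python
def fit_glda(documents, n_topics, n_sweeps,
             gibbs_sample_topics, update_pinboards, ransac_hypotheses, sample_by_inliers):
    """Alternate the two sampling steps described above."""
    state = {"topics": None, "pinboards": [None] * n_topics, "homographies": {}}
    for _ in range(n_sweeps):
        # Step 1: Gibbs-sample topic assignments (and refresh the pinboards) given
        # the current homography hypotheses.
        state["topics"] = gibbs_sample_topics(documents, state)
        state["pinboards"] = update_pinboards(documents, state)
        # Step 2: for each document/topic pair, propose homographies with RANSAC
        # between pinboard and image, then sample one with likelihood ~ inlier count.
        for d, doc in enumerate(documents):
            for z in range(n_topics):
                candidates = ransac_hypotheses(doc, state["pinboards"][z])
                state["homographies"][(d, z)] = sample_by_inliers(candidates)
    return state
```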
So, yeah, that's that slide.
So I'll give you an example. So this is one connected component from Oxford. This is
actually 56 images, and we saw this before, but we have sort of three facades or three
distinct objects in this component. So we have this sort of Georgian building here, and
then we have the college [inaudible], and then this sort of tower as well.
But we also have these linking images, which contain more than one of these sort of
distinct objects.
I'm actually going to go back to that.
So if we set the number of topics correctly, then we get results like this. So we've
actually managed to pull out with our model these three separate facades. So we've got
the Georgian building on the left, we've got these clusters here, and then we've got this
tower as well. Notice that these are actually sort of probabilities now. So we're not
getting a hard clustering for these objects. So these images will also occur in this list, but
much further down.
Because it's a probabilistic model, we can actually look at sort of log likelihoods, et
cetera. And we use this to sort of set the number of distinct objects in our component.
So here we can see there's a peak at three topics or three objects. And this is sort of
pretty much correct.
We can also look at sort of linking images and how actually these images -- how these
words are sort of projected from the topic model. On the top row here we've got two
topic models or two pinboards, and we've got one, this Georgian building on the left, this
college class is on the right, and an image which contains both in the middle.
And you can see that not only have we disambiguated these two objects, but we also sort
of localized them in each of these images.
We can also look at sort of probabilities here. So this is for the Sheldonian Theater. And
you can see it makes some sense that as you zoom in you get a different set of words,
and then it goes to night, so we're matching less. And then finally you've got very extreme
viewpoints.
And we can also sort of visualize these pinboards. So because each visual word in our
dataset has to belong to some topic, and we have the hypotheses, we can sort of splatter
these elliptical regions and sort of try and visualize what these pinboards look like.
So these are sort of two of the better ones, I must admit. But you can definitely make out
the sort of bridge, sort of a painterly effect to the bridge on the left. And this is this
facade from Christ Church on the right.
>> Rick Szeliski: Are you familiar with the work of Jason Salavon?
>> James Philbin: No, I'm not.
>> Rick Szeliski: Oh, okay. I'll tell you over lunch.
>> James Philbin: You think I can sell these or...
>> Rick Szeliski: Well, it's reminiscent. The work comes up -- people who know about
this artist mention it like when they see Antonio Torralba's Average Caltech 101 image.
His art is basically taking a lot of photographs of, let's say, a wedding photo, lining them
up, and then taking the average image. It looks a lot like this.
>> James Philbin: Okay. Yes. I mean, other objects, they don't look quite so clean. But
they look much more like a sort of -- sort of a splatter of -- and you can sort of make out
vague shapes and things.
So, in conclusion, then, my first part of my talk I looked at sort of improving particular
object retrieval, and we saw that using large vocabularies and spatial verification, soft
assignment and query expansion can really sort of improve retrieval quality.
Then applied -- looked at applying these techniques to a particular problem. In this case,
object mining. So how we can build a matching graph and introduce this sort of
mixture-of-objects model, which is similar to LDA but with spatial information.
And that's it. Thanks.
>> Rick Szeliski: Thank you.
[applause]
>> Rick Szeliski: Would people like to see the demo now? Or do they have questions
they can ask?
So the two latest publications were ICVGIP, which is the Indian conference?
>> James Philbin: That's an Indian conference. And BMVC as well.
You can ask questions while I'm...
>>: [inaudible]
>> James Philbin: Um...
>>: Like if you were to -- if you wanted to cluster a large collection of images, do you
use your clustering method or do you think the min-hash technique of --
>> James Philbin: So -- yeah. So I think actually Ondrej's work on -- in that clustering
can be seen as sort of a faster approximate way of building this matching graph. I don't
think it helps with the problem of sort of then how you pull out the distinct objects
necessarily.
>>: Basically you get an efficiency gain on --
>> James Philbin: It definitely is. Yeah. But it's also approximate. So he does miss
some stuff.
>> Rick Szeliski: The other related work in -- there's -- James, the other work that comes
to mind when you talked about finding the objects is Ian Simon's work. It sort of finds,
you know, areas of interest in photographs by looking at the matching graph.
>> James Philbin: I don't think I've seen that work.
>> Rick Szeliski: Have you not seen that? I'm trying to remember what --
>> James Philbin: I know that Tina had a paper.
>> Rick Szeliski: I'm sorry?
>> James Philbin: I know that Tina [inaudible] had a paper.
>> Rick Szeliski: Okay. No, this is -- he's advised by Noah Snavely, so it was kind of --
it was by Steve Seitz, so it was along that line of work. But I'll show you that.
>> James Philbin: No.
>> Rick Szeliski: Okay.
>> James Philbin: So I think it's an idea that's sort of been independently looked at by
several people. It's definitely sort of come of age, I think. Okay. So this is sort of a
realtime demo we've got running at Oxford. If it loads.
>>: [inaudible] machine.
>> James Philbin: Sorry?
>>: [inaudible] machine.
>> James Philbin: This is just on my machine. But it's like a cold cache also. Is there
some way of making it full screen? Press something. F11 or something?
>>: What is it? F11?
>> James Philbin: Got it. Okay. So we've got a query image here, and we can select any
part of the image. But let's look at the bridge. We know it works.
So this is just using the sort of standard inverted index plus spatial re-ranking. So we get
sort of results which are fairly similar to the query initially. I think we start getting
slightly more challenging things.
So we can actually see how -- where the first false positive is here. So 37. So someone
remember 37. And then we can --
>>: And it accepted such a degenerate affine transformation or...
>> James Philbin: Right. So that -- I mean, this is something we've thought about
before; we've just never got around to doing it. It's that this is obviously false because the
hypothesis is so wildly not what you'd see in real life. And we've always said, yeah, we
should have a prior on the hypotheses and we should vote [inaudible] like this and just
chuck that one out. And I think that would improve performance a bit.
>>: Chuck it out or re-rank it.
>> James Philbin: Re-rank it, yeah, just push it down.
>>: Okay.
>>: I'm sorry, so it seems like it has [inaudible] matches, right? These are before the
RANSAC, right?
>> James Philbin: Yes. I mean, if you want to see --
>>: [inaudible]
>> James Philbin: -- exactly what's actually matched --
>>: Just curious [inaudible].
>> James Philbin: This demo is online, by the way. So you can just go to the robots
Web site and just play around with it.
So you get four in a row. They've matched on sort of windows.
>>: [inaudible] had inliers there --
>>: Five inliers. This is after the homography.
>>: Yeah, after the homography, five.
>> James Philbin: Yeah.
>>: So the 40 is probably --
>> James Philbin: Ah. So, yeah, I can turn that off actually.
>>: [inaudible]
[multiple people speaking at once]
>> James Philbin: My experience with large datasets is you think, oh, you know,
these words are so distinctive -- and then you have enough data and you just see everything
eventually.
So, anyway, so someone remember 37. Or I'll remember it. And we can go back and
actually turn on query expansion and do the query again.
>>: This is using soft assignments?
>> James Philbin: This isn't actually, no. So it takes a little bit longer. So we've done a
query, and then we've got more results and then it's much richer, and then we do another
query. And then we can go down and see what the results are like.
You'll notice as well that we'll generally have a much higher number of inliers when we
do query expansion.
So we got to 37 before, and we're still going. Right.
>>: [inaudible]
>> James Philbin: 54, anyway, so we've got quite an improvement there. I always think
this one's amazing if it works at all.
That's good.
>> Rick Szeliski: Yeah. But it's quite likely that the query expansion really helped
because you got some --
>> James Philbin: Right. Yeah, so --
>> Rick Szeliski: -- series of more and more close-up images that kind of --
>> James Philbin: Exactly. I'm actually only plotting the original matches here. This
one's more interesting. So that doesn't normally work searching for faces, but in this case
it did. Start to get some [inaudible].
>>: [inaudible]
>> James Philbin: No. So we trim all the features which lie outside the box.
>>: [inaudible] black tie that's underneath his face [inaudible].
>> James Philbin: So I think probably that [inaudible] --
>>: [inaudible] cloudy day.
>> James Philbin: So I think it's his graduation hat.
>>: [inaudible]
>> James Philbin: I mean, it's a sort of same expression. It's obviously taken from the
same camera, roughly the same time. So -- but it's quite nice that it works.
Also, you'll notice I searched on his face, but some of the bridge lay behind it and we sort
of got that as a superposition of those two things.
And actually an interesting anecdote is that when Andrew gave this talk, he met someone
who knows this guy. And he actually sent us an e-mail and said I'm very pleased that my
image is being used for such good purposes.
I'll just show you it working on some small object. So we found out that in Oxford,
there's sort of a -- just the one template for lampposts. And you see it's seen all over the
place.
>>: So which implementation [inaudible] are you using?
>> James Philbin: So this is Christian's -- so Christian has several versions, and each one
performs totally differently. So you have to be very careful which one you
choose. And I believe ours is the second one down on the list on the Oxford Web site,
and that we've found works the best. It's definitely a black art, though, these features, how
you get them to work well.
>>: [inaudible]
>> James Philbin: Because you've got -- I mean, you've got almost exponential space of
parameters. So, you know --
>>: Well, people here are trying to learn better descriptors.
>> James Philbin: Yeah. So I think the descriptors is an interesting point, but the interest
points are almost more important to get right I find, because if you miss the interest point,
you're stuffed. There's nothing you can learn which is going to help you out in the
descriptor level. And certainly when Christian moved from more object based to more
category based, his -- it works much worse on our data. But I guess that's obvious
because he's tuned it. So...
Okay?
>> Rick Szeliski: Okay. Well, thank you very much.
>> James Philbin: Okay. Thanks.
[applause]