>> Rick Szeliski: Okay. Good morning, everyone. My name is Rick Szeliski, and it's my pleasure to introduce James Philbin who is a graduate student at the University of Oxford. James has been doing some amazing work, published in the last few Computer Vision Conferences on matching images, especially with very large databases such as finding pictures, matching any new photograph to pictures of all the buildings in Oxford. And he's going to tell us -- give us sort of an overview of a large number of papers he's been working on and some of the most recent work that just came out this year. >> James Philbin: Okay. So thanks, Rick, for the introduction. I'm going to talk on organizing Flickr, so object mining, using particular object retrieval. I must say this is joint work with Ondrej Chum, Josef Sivic, Michael Isard and Andrew Zisserman. So why I'm here is to automatically organize large collections of unordered, unlabeled images, things like Flickr. So you might be thinking, well, isn't Flickr already organized by tags? You know, I can type something into Flickr and I get images which are potentially relevant back. So here's an example. I've never been to Redmond before, so I searched for -- do a search in Flickr for Redmond. And these here three images I got back from the first page. So one is some guys jumping around in a balloon. It's actually from New York. But it was part of a photo set of which he had took some things in Redmond. The image in the middle here is Allysa Redmond who is the woman with the longest fingernails in the world. Not particularly useful to me. And on the right here is a lunar eclipse which was taken from Redmond. But, again, not useful for me. So tags are often inaccurate and ambiguous, or just not descriptive enough to be useful. And so here we're going to demonstrate organizing or clustering images just on the image content alone. And specifically in this work we want to cluster images containing the same objects, so same landmark or building, et cetera. So I'm looking at this car rather than the set of all cars. So an outline of my talk, it's going to be sort of in two parts. I'm going to introduce sort of -- or talk about particular object retrieval and take you from the sort of baseline Video Google method to the state of the art. And I'm going to do that using sort of large vocabularies, spatial re-ranking, soft assignment, and query expansion. And the second part of my talk is going to be on using particular object retrieval to perform object mining. And we'll talk about building a matching graph and then look at two different methods for doing the mining. So one uses standard graph clustering techniques, and the other uses a sort of mixture of models -- mixture-of-objects topic model based approach. So a word about the dataset. I'll be demonstrating a lot of the work on the particular object retrieval. So it's a dataset of Oxford buildings. Automatically crawled from Flickr. And it consists of just over 5,000 images found by searching for various Oxford landmarks. So you type into Flickr Oxford Radcliffe Camera, Oxford Christ Church College, and you get a lot of results back. Of course as we saw before, a lot of these tags are not very good, so I think when you type these in, only about 5 percent of the images actually contain some of these landmarks. These are all medium-resolution images. There's a sample at the bottom there. And we've gone through and manually labeled groundtruth for 11 landmarks. And each landmark has five queries each. 
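For reference ahead of the evaluation measure described next, here is a minimal sketch of mean average precision over a set of queries. It is not the authors' evaluation code, and it ignores the "junk" handling the released Oxford benchmark uses; the function and variable names are illustrative.

```python
# A minimal sketch (not the authors' evaluation code) of mean average precision:
# for each query, average precision is the mean of the precision values at every
# rank where a groundtruth-relevant image appears; the benchmark score is the
# mean of these APs over all queries.

def average_precision(ranked_ids, relevant_ids):
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank      # precision at this recall point
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(results, groundtruth):
    # results / groundtruth: {query_id: ranked list / relevant set}, e.g. 55 queries.
    aps = [average_precision(results[q], groundtruth[q]) for q in results]
    return sum(aps) / len(aps)
```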
So you can see all of the 55 queries here. And to evaluate our system, we use a sort of mean of the average precision for each query. So we sum up all the average precisions for each query and divide by 55 to get sort of an overall view of the performance of our system. And so test the scalability. We also have two other sort of distractor datasets. So the first one is sort of a hundred thousand images, and another which is just over a million images. And it's assumed that none of our groundtruth objects are present in these datasets. So we can just wrap them in and see how the system is scaling. So a particular object retrieval. What's the aim. So the idea is that the user selects a subimage or a subpart of a query image. So here and here. And we get a large unlabeled, unordered collection of images. And we want to find other occurrences of that object in this dataset. So we selected the bridge there. And you can see actually the bridge occurs twice: so once in the middle there; once in the bottom left. And this is the Radcliffe Camera, the other query, and it occurs twice as well. So once there, but also way off in the distance in that top left image. So this gives you some of the idea of the challenge of doing this. We want to search millions of images at realtime speeds of particular objects, and we want to be able to find these objects despite changes in scale, viewpoint, lighting and occlusion. So here we've got sort of a two-time scale change, 45-degree viewpoint change, a lighting change, the day to night, and occlusion. So this building is seen sort of through a pane-glass window. So I'm just going to review the sort of Video Google, the baseline method, which we're going to build upon. So we start with every -- for every image in our dataset. And we're going to find sparse affine invariant regions. So these are going to be Hessian affine interest points, so they're elliptical regions in the images, which is supposed to be somewhat invariant to viewpoint. And we're going to take SIFT descriptors. So for each image we find interest points, we generate SIFT descriptors. We're going to bind all these descriptors into the big matrix and quantize, so we've gone from elliptical region to descriptor to a single number. And this is going to represent a visual word in a model. So we can now represent each image as a bag of visual words, so it's just a sparse histogram of the visual word occurrences. And to compare two images, we're now sort of fully in the text domain because we've got these two sparse histograms which we're going to sort of treat as a proxy for those images. And now if you want to compare two images, we're going to simply compare their histogram of visual words and score them. And the baseline method, well, we just use tf-idf weighting from the text retrieval. And to search on an image, we accumulate all the visual words in the query region and we use an inverted index to find the documents that share at least one word. And we compute similarity for those documents using tf-idf with L2 distance. And this gives us a rank list for all the documents in our dataset. So this inverted index is essentially a sort of reverse or a book index, so it maps from particular words to the documents in which they occurred. And by looking at this index, we only have to consider documents which share at least one word. So when you implement the system, the first issue you sort of come up against is how does vocabulary size affect performance. 
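Before moving on to vocabulary size, here is a rough sketch of the bag-of-visual-words scoring just described: tf-idf weighted sparse histograms, an inverted index from visual words to the images containing them, and scoring only of images that share at least one word with the query. This is an illustrative reconstruction, not the system's code; with L2-normalised vectors, ranking by dot product gives the same order as ranking by L2 distance.

```python
# A rough sketch of tf-idf scoring over an inverted index: each image is a
# sparse histogram of visual words, the index maps each word to the images
# containing it, and only images sharing a word with the query are scored.

import math
from collections import defaultdict, Counter

class InvertedIndex:
    def __init__(self, images):                       # images: {image_id: [visual_word, ...]}
        self.num_docs = len(images)
        self.postings = defaultdict(dict)              # word -> {image_id: term frequency}
        for image_id, words in images.items():
            for word, tf in Counter(words).items():
                self.postings[word][image_id] = tf
        self.idf = {w: math.log(self.num_docs / len(p)) for w, p in self.postings.items()}
        norms = defaultdict(float)                      # each document's L2 norm in tf-idf space
        for word, posting in self.postings.items():
            for image_id, tf in posting.items():
                norms[image_id] += (tf * self.idf[word]) ** 2
        self.norms = {i: math.sqrt(n) or 1.0 for i, n in norms.items()}

    def query(self, query_words):
        q = {w: tf * self.idf.get(w, 0.0) for w, tf in Counter(query_words).items()}
        q_norm = math.sqrt(sum(v * v for v in q.values())) or 1.0
        scores = defaultdict(float)
        for word, q_weight in q.items():                # only touch postings for query words
            for image_id, tf in self.postings.get(word, {}).items():
                scores[image_id] += q_weight * tf * self.idf[word]
        return sorted(((s / (q_norm * self.norms[i]), i) for i, s in scores.items()),
                      reverse=True)
```

Because only the posting lists for the query's words are touched, the cost of a query grows with the lengths of those lists, which is one reason larger vocabularies also tend to make querying faster.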
And it seems to basically increase as you increase K. So for flat K-means, this graph is maybe of a range sort of of K around to the 10- to 50,000. Performance just keeps on increasing. But unfortunately K-means doesn't really scale, so K-means scales as order NK where N is the number of descriptors and K is the number of clusters. So can we do better for large K? So the bottleneck for K-means is basically a nearest neighbor search, so from every descriptor to every cluster center. And we can actually use approximate nearest neighbor search techniques to reduce the complexity to order N log K. And sort of empirically for small K, the approximate method doesn't seem to harm the cluster quality. So the table on the bottom left here, I've compared flat K-means or standard K-means to approximate K-means. And you can basically see the difference is sort of negligible. But we're able to scale the approximate methods to much larger vocabulary sizes. And so empirically it seems that large vocabularies are much better for particular object retrieval. So here we actually start at 50,000 and go all the way out to one and a quarter million. And we find a peak at a million visual words. So this is much, much bigger than people normally do with K-means or with clustering. >>: What is the measure here? >> James Philbin: So this is mean average precision. So this is the mean of all the average precisions for the 55 queries in the Oxford dataset. But there's a sort of plateau from about a quarter of a million. >>: So sometimes it does better [inaudible] the approximate K-means -- sorry, the approximate nearest neighbor -- >> James Philbin: Right. I mean, there's a certain amount of noise here. So I think to really find out, you need to sort of plot some error bars. Because you -- you know, K-means, you randomly initialize it. So there's going to be some room for error. >>: [inaudible] actually might get our minimum [inaudible] K-means because you change the nearest neighbor structure [inaudible]. >> James Philbin: Yeah. >>: So it's perfectly possible [inaudible]. >> James Philbin: So it could be that it's almost a bit like sort of stochastic -- some sort of stochastic method where there's a bit of randomness in the system because it's approximate. You don't always get the exact nearest neighbor. And actually by sometimes getting it wrong you can minimize the -- the sort of objective function better. I'm not -- I'm not going to claim that actually because the results don't necessarily show it. But it's something -- intuition. But one thing to know is that when you do the final assignment, you want to be sort of right. So although it might have been true when you built the clusters, you want to be more accurate when you actually finally assign them and we actually up the thresholds on the approximate K-means when you do the assignment. >>: What is [inaudible] approximate nearest neighbor? >> James Philbin: So we're using sort of approximate KD trees. >>: [inaudible] >> James Philbin: Yes. It's best in first, but we have multiple trees as well with different random sort of orderings of the ->>: [inaudible] >> James Philbin: Right. And then we have a single sort of priority queue for exploring these trees. >>: For all the trees [inaudible]. >> James Philbin: For all the trees. And that makes a big difference actually. >>: Are you going to talk about that or... >> James Philbin: I wasn't, actually. >>: Oh, okay. >> James Philbin: But we can talk about it later. 
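For reference, a simplified sketch of the approximate K-means idea being discussed: the O(NK) assignment step is replaced by a tree-based approximate nearest-neighbour search over the current cluster centres. The system described here uses several randomised KD trees searched best-first with one shared priority queue; this sketch substitutes SciPy's single KD-tree with a non-zero eps, so it shows the structure rather than the exact method.

```python
# A simplified sketch of approximate K-means: ordinary K-means, except the
# assignment step uses an approximate nearest-neighbour query over the current
# centres instead of comparing every descriptor with every centre.

import numpy as np
from scipy.spatial import cKDTree

def approx_kmeans(descriptors, k, iters=10, eps=0.5, seed=0):
    rng = np.random.default_rng(seed)
    centres = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    centres = centres.astype(np.float64)
    assign = np.zeros(len(descriptors), dtype=int)
    for _ in range(iters):
        tree = cKDTree(centres)                        # rebuild the tree over the new centres
        _, assign = tree.query(descriptors, eps=eps)   # approximate nearest-centre search
        for j in range(k):                             # standard mean update per cluster
            members = descriptors[assign == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres, assign
```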
It's actually an idea of David Lowe's, and I think a student of his has probably published it. But I wouldn't be able to tell you the paper. But we can talk about it later. >>: [inaudible] >>: There's a whole survey [inaudible]? >>: From David, yeah. Whether it's published or not or whether -- it's public. You can get it from [inaudible]. >>: Okay. >> James Philbin: So he visited Oxford for a year and was sort of talking about this stuff and gave us some code and we tried it and it worked well. But I think at that point it wasn't actually published. >>: I think it's actually [inaudible]. >> James Philbin: It might well be. [multiple people speaking at once] >> James Philbin: Okay. So that's sort of using large vocabularies. The second thing we can look at is can we use the position and shape of the underlying features to improve retrieval quality. So at the moment we're just considering an image as a bag of words. And obviously we could take all the [inaudible] which define those words and jumble them up any way we like. And that's going to have the same visual word representation. But it's not going to look anything like the object we're searching for. So as an example, we've got the query on my left and two potential results on the right here. And both have quite a lot of matches. And so the question is just using the visual word histograms alone, we're not going to be able to disambiguate which one is correct and which one isn't. So what we want to do is enforce some spatial consistency between the query and each result. And this improves retrieval quality a lot. So on the left here we can now see when we've enforced the consistency, we get some good set of inliers. So this is sort of correspondences which agree with the hypothesis. And on the right not many consistent matches. So we can say fairly confidently that this object isn't present here. And as an extra bonus it gives us localization of the object in the target. >>: [inaudible] >> James Philbin: That's okay. >>: Would you rather I -- >> James Philbin: Oh, no, that's fine. >>: Well, you're going to have to go back about the nearest neighbor. So essentially using several random trees with the same queue, you do something that is like LSH because you're just going to cut the -- you're going to search [inaudible] with proximity along these trees. So is this the motivation or... >> James Philbin: I wasn't sure what the question is actually. >>: So you're familiar with LSH? >> James Philbin: Yeah. >>: So doing several random KD trees is similar to doing LSH because you're going to take -- in LSH you're going to do several random cuts and you're going to look at certain bins there. >> James Philbin: Right. >>: So in a way you -- it can be seen as a way to combine the two. I guess that's -- >> James Philbin: Yeah. >>: I was wondering if that's -- >> James Philbin: Especially in high dimensions you often get -- with KD trees you get these splits which are sort of very -- these very thin splits in the space. And actually the -- sort of the error or the amount of [inaudible] fall in the bin isn't necessarily going to be very large. So by doing multiple splits, you're sort of hedging your bet in a way. You can sort of explore this space in a better way than just having a sort of fixed number of discrete hyper-rectangles that you're searching in. So how does the spatial work? 
So because we have elliptical regions, we can actually use the shape of these [inaudible] to do a form of RANSAC which actually isn't randomized because we can go through the linear list of correspondences. So for each correspondence, we're going to generate a hypothesis, which is here. And so each elliptical match gives you 4 degrees of freedom for the transform between these two images. And we further constrain it by assuming the images are taken upright, which isn't always the case but is normally the case in Flickr images. And that gives you a full 5 degree of freedom affine transform. Then we're going to find inliers as is standard. And we estimate the transformation. And then we're going to score this image here by the number of consistent matches. So I should say that this is obviously -- this is very fast because we just use one correspondence to generate a hypothesis. But it's still quite expensive because you've got to go to disk and read these features in. So we only do this for the top sort of 200 ranked documents after we've done the rest of the retrieval. >>: You said 5 degree of freedom? You don't use the skew or what? >> James Philbin: Um... >>: A full affine would be 6, right? So you've taken one from [inaudible]. >> James Philbin: Um... >>: Maybe it's because -- >>: It looks like vertical lines [inaudible] vertical lines, right? >> James Philbin: Right. So it's not full affine. So we've taken the skew out. You're right. So it has to be something like [a, b; 0, c]. That's what we're learning. Okay. So I'm going to move on to the third thing we can do to improve retrieval. So very large vocabularies we saw gave us good discrimination and good performance for particular object search. But sort of a natural question is do we actually lose any performance from the quantization error because we've split up this space into a million bins essentially. You might imagine that there's going to be lots of points which are sort of near [inaudible] boundaries in the K-means space. So instead of assigning each descriptor to one visual word, we're going to sort of soft assign them [inaudible] better localizations. We've moved from a sort of hard sort of partitioning of the space on the left to something a bit more nuanced on the right. And there's sort of two main problems that soft assignment should overcome here. So we've got cluster centers A to E, and points 1, 2, 3, and 4. And so the first problem is that points 3 and 4 here are close in the space, but we never match them because they lie on either side of this boundary. And 1, 2, and 3 are normally matched equally, whereas actually we'd quite like to say that 2 is closer to 3 than 1 is to 3. So a sort of practical point is that we don't really want to regenerate all these clusters. And even if we did manage to regenerate the clusters, it's not totally clear what method we should use to take into account this soft assignment. So let's just pretend -- let's just take all the clusters and pretend that they're univariate Gaussians from K-means. And we can compute these Gaussian conditionals from the nearest R centers to each point and represent each point as a multinomial. We then L2-normalize the multinomials, and instead of scoring a match as 1, we just score it as a dot product. So some rough intuition is that you've sort of projected down onto a three-dimensional space or an R-dimensional space and that we're taking L2 distance in this new space between two points. And now in the index, instead of scoring a match as 1, we score it as a dot product. 
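A minimal sketch of that soft-assignment step, assuming a KD-tree over the cluster centres: each descriptor is assigned to its R nearest visual words with Gaussian weights on the distances, the weight vector is L2-normalised, and a putative match contributes the dot product of the two weight vectors rather than a count of 1. The sigma and r values here are placeholders, not the tuned settings.

```python
# A minimal sketch of soft assignment: each descriptor goes to its r nearest
# cluster centres with Gaussian weights on the distances, the weight vector is
# L2-normalised, and a match between two descriptors is scored by the dot
# product of their weight vectors instead of counting 1.

import numpy as np
from scipy.spatial import cKDTree

def soft_assign(descriptor, centre_tree, r=3, sigma=100.0):
    dists, words = centre_tree.query(descriptor, k=r)    # r nearest visual words
    weights = np.exp(-dists ** 2 / (2.0 * sigma ** 2))   # Gaussian conditionals
    norm = np.linalg.norm(weights)
    return dict(zip(words.tolist(), (weights / norm).tolist())) if norm > 0 else {}

def match_score(assignment_a, assignment_b):
    # 1.0 only if both descriptors put all their weight on the same single word.
    return sum(w * assignment_b.get(word, 0.0) for word, w in assignment_a.items())
```

In the inverted index each descriptor now contributes r weighted entries instead of one, which is where the extra storage and the larger number of correspondences mentioned in the discussion that follows come from.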
And in practice -- yeah. >>: Question. How do you determine sigma? That must affect the performance. >> James Philbin: It does. And we just [inaudible]. We tried a few [inaudible] and it -it affected a bit, not as dramatically as you might expect. Actually the R here has a much bigger effect as the number of points. >>: There's one sigma for all clusters, and you used them [inaudible] looking up more than one center [inaudible]. >> James Philbin: Yes. >>: And whenever you find this ->> James Philbin: So actually the KD tree gives you more than just the nearest neighbor anyway. >>: So how do you truncate that? Because you're not going to be look at all ->> James Philbin: We just fix R to be something small, 3 to 5 gives fairly good results. I'm not going to claim that this is sort of theoretically the best we could have done, but ->>: I'm just interested. So then you go in the neighbor, and the next you can actually run five lists instead of one? >> James Philbin: Yes. So there is a slowdown definitely. And we sort of experimented with making the vocabulary even larger and get these things to be, you know -- run at reasonable speed. And the other issue is that when you do the spatial, you also got a lot more correspondences. So this is actually a blessing and a curse actually. >>: Do you have any feeling for what this does running it per word or you could sort of do a new vocabulary and do several queries and some averaged results? Would that achieve similar thing? >> James Philbin: Not quite sure what you mean. A new vocabulary ->>: Sort of do it per word now. If you do a query [inaudible] word going to run five lists and weight them. But you could sort of actually just do it on a macro scale, you do one query with hard on it, and do another one with a completely different vocabulary and you do again and you average the score somehow. >> James Philbin: Yeah. So we didn't look at that actually. But that could be -- that sounds like an interesting thing. So it might be that if you don't get enough good matches, somehow using just the hard vocabulary, then you go [inaudible] other ones. I'm not sure. >>: [inaudible] >> James Philbin: Mmm. So empirically this gives us sort of two benefits. So the first is we just get better initial results, which are sort of ranked higher initially, and then the spatial verification picks them up. So for hard assignment, we had for this query on the left, we got this sort of only one good highly ranked result. The soft assignment gave us three others. Actually be three extra. The second benefit is better spatial localization. But I should say at the cost of having to check more correspondences. For hard assignments, we had sort of two inliers between these two images. And for soft assignment we get many more inliers and a better estimation of the transform. Okay. Last thing for particular object retrieval. So the last three things I talked about were sort of first-order effects. So they're going to improve the quality -- the sort of retrieval quality for a single query. And now I'm going to talk about something which is sort of a second-order effect. So the idea is that in text I'm sure you've all sort of witnessed this that when you use Google you might type in a term and you have a look at the results coming back. And then based on what you see in the results, you can refine your query. So maybe you saw a document you liked and you saw another term that went with it, so you added that term into your query and reissued it. 
And that sort of honed in on the documents you're actually looking for. So query expansion in text tries to sort of apply this idea automatically. And essentially you reissue the top end results as queries. And there's some more intelligent stuff that goes on, but that's essentially the idea. It's also called pseudo/blind relevance feedback. But there's a big problem of topic drift, or a big danger. And actually this is such a big problem for text that people don't often use it. So because you don't know when a document is relevant, you can easily sort of diverge from what the true set of results should be. But in vision, we're in a different situation. Because we have these spatially verified images, and when we have more than a certain number of inliers we can actually be very sure that the object that was queries for does occur. We can just reissue these verified image regions as queries. And this spatial verification gives us sort of an oracle of truth or oracle of relevance for a particular result. >>: Did you do anything to make sure that the top matching images aren't near duplicates? Because you could almost -- you know, you could just waste your time reissuing and almost identical image. In other words, do you look for something that's different enough because that's more likely to [inaudible]? >> James Philbin: We don't actually. Well, okay. So when there's more than a certain number of inliers, we tend to just -- we let that image go and we choose one with a few less. But one of the things that you overcome here is actually some sort of quantization error and interest point error. So it might be that although they're very similar, actually your interest point is [inaudible] on a whole bunch of different stuff which is going to make your query much richer. So here's a visual example. So we got an original query on the left. And we perform retrieval without spatial to begin with and we get some good results, highly ranked, but also a few false positives [inaudible] were on the list. Once we do spatial verification, we sort of cut out those two false positives and we got a list of sort of good results here. And also we have a homography going from the original query to each of these results. And now what we can do is we can take all the features which lie within the query region in the target image, back project them into our original space, and we get this sort of new enhanced query. So this is going to contain the visual words that were part of the original query but also a whole bunch of new visual words, which we didn't see initially, but which sort of partially matched to the original query. When we reissue that query, then you get a lot sort of more challenging results. So very large scale change with some occlusion here. [inaudible] scale change. This is actually a zoom of a crest which is here, which is almost difficult to see I think. Here's another example. So we have the query image on the left and originally retrieved image in the middle and an image which wasn't retrieved by this query. So it's dark and there's occlusion. So there weren't -- there simply weren't enough visual features shared between them to match it. So here are the inliers to the image in the middle from that query. And here are the inliers to this image we didn't retrieve. But from that middle image, what you can imagine doing is projecting all those words which were in the predicted region in the middle back onto the original query. And then you can match this image from that query. 
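Schematically, the query-expansion loop described above might look like the following. The `search` and `spatial_verify` callables and the `features_inside` accessor are placeholders standing in for the retrieval engine, not its real API; the structure is the point: confidently verified results contribute the visual words lying inside the mapped query region, back-projected through the estimated transformation, and the enriched query is reissued.

```python
# A schematic sketch of query expansion with spatial verification as the
# relevance oracle. All callables and containers here are placeholders.

import numpy as np

def expand_and_requery(search, spatial_verify, query_features, query_region,
                       top_n=20, min_inliers=20):
    expanded = list(query_features)                      # (visual_word, x, y) tuples
    for result in search(query_features, query_region)[:top_n]:
        H, inliers = spatial_verify(query_features, result)   # H maps query -> result
        if len(inliers) < min_inliers:                   # only trust well-verified results
            continue
        H_inv = np.linalg.inv(H)
        for word, x, y in result.features_inside(H, query_region):
            xq, yq, wq = H_inv @ np.array([x, y, 1.0])   # back-project into the query frame
            expanded.append((word, xq / wq, yq / wq))
    return search(expanded, query_region)                # reissue the enriched query
```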
>>: So which [inaudible] this is a new query or you can sort of make the graph offline for the whole database [inaudible] looking into it in one place and we're just -- >> James Philbin: Right. >>: -- cascading to the graph. >> James Philbin: You're jumping ahead of me a bit. But that's certainly true. But there's one important difference, which is that this is query dependent, whereas anything you build offline would be for the whole image. >>: [inaudible] >> James Philbin: Not that important. But -- well, if you were searching for something that's very small in the image, then you can easily imagine the rest of the features would sort of swamp out [inaudible] and you're simply not going to find it. So, I mean, I'm mainly showing examples here where the object takes up most of the image. But I'll show a demo and we can see that it does work for small objects as well. So the big effect of doing query expansion is that we go from something that is high precision but not necessarily very good recall to something very high precision and high recall. And those are some of the expanded results. And you can also look at a histogram of these average precisions for each of the 55 queries. So obviously the idea would be that we just see a peak at one, so that means perfect retrieval for all of the queries. And obviously we're shifting up. When we use query expansion, we really shift all these results into the right of the graph here. So that sort of concludes the particular object retrieval. And I'll just take you through how the different methods actually improve performance. So this is sort of the baseline method for Video Google. Vocabulary size of 10,000. And we started at .389. Then we got a big boost from the large vocabulary. That took us to .618. The spatial re-ranking took us to .653, so not as big a boost as before, but it allows us then to use query expansion and takes us to .8. And the soft assignment, not such a big gain, but significant, to .825. Okay. I didn't set up the [inaudible]. >> Rick Szeliski: No. That's not going to work. >> James Philbin: That's not going to work. >>: [inaudible] the running times along with the accuracies on that [inaudible] how much of a performance [inaudible]. >> James Philbin: Yes. So actually the large vocabulary actually improves runtime because you have many fewer words to match. Spatial re-ranking increases it a bit, and it depends on how far down the list you go, so we do something like the top 200. But we also experimented with things like you can just go -- you go down the list until you start seeing bad results, and then you stop. That sort of thing. >>: [inaudible] >>: If you want to boot it up later, James can run the demo. Because getting your laptop onto the Net [inaudible]. >> James Philbin: [inaudible] >>: [inaudible] run it later. >>: Do you have any feeling for if you apply soft assignment first? I mean, there's all these diminishing [inaudible]. >> James Philbin: Yeah. So the gain in soft assignment without query expansion is much better. I'm just showing the sort of accumulated results. In the paper, we go through sort of each of the stages and how it affects things. >>: Can you get your screen resolution down to [inaudible] 24 by 768? [multiple people speaking at once] >>: I'm guessing it will work if it's the right screen resolution. But we can try it. If it doesn't work, we'll just go back. >> James Philbin: Sure. >>: This is on the Oxford dataset you mentioned earlier, right? >> James Philbin: Yes. 
>>: For any time you say things will be much harder, right [inaudible] very similar ->> James Philbin: Right. I've had this set before. And is it really true? I don't know. So, I mean, certainly Seattle seems to have enough distinctive buildings that you could pick out something interesting from a dataset on it. >>: I have a related question, which is I think this spatial verification stuff is -- do you think it might be more important for buildings than, for example, blurry toys that are small [inaudible] it's all about getting a couple of features and it's almost [inaudible] geometry [inaudible]? >> James Philbin: So definitely in the geometry because it's sort of affine, we're assuming that it's all [inaudible] facades. >>: But in the sort of phase of all the repetitive structure and that -- that's what I can imagine that the spatial verification is more important. >> James Philbin: Yes. But often, especially when you get a large enough dataset, you're just going to see every visual words. It just isn't descriptive enough. That's the problem, is you add more and more data. You just end up making more and more errors. You just happen to see those visual words in some orientation. So you sort of have to have this filtering I think when you go through and then you do something a bit more expensive but on less images, and then you can imagine doing more and more ->>: And those results that you showed, how many distractors did you have in that ->> James Philbin: Right. So this is actually just for the 5k, these numbers. In the paper we go into for the 100k and the million. And things go down. >>: Kind of uniformly goes down? >> James Philbin: Yeah. Yeah. I pretty much think. Yeah. >>: Are all these steps possible to scale to the millions or some of them are? >> James Philbin: All of these steps that we scaled up to the -- the sort of million size. So it does actually scale -- so it scales maybe not in the sense that vision people mean it, which means that I can run the 2 million dataset on my laptop, which is often what's meant. But it scales in the sense if you got a cluster of computers, you can actually stick disjoint sets of the data on each machine and you sort of issue the query [inaudible] the results and return it. It scales in that sense, I think. >>: I guess my question also is, for example, soft assignment cost you a factor 10 and multiple speedup, scoring, or something like that ->> James Philbin: So soft assignment is an issue because you have to store more data. And you have to stick it in the index. And if you don't have a large vocabulary, then it can really kill you actually. I would say that of all of these, probably the most important are sort of query expansion of the large vocabulary. >>: So could you try it without spatial [inaudible] at all? >> James Philbin: The query expansion. >>: [inaudible] >> James Philbin: The query expansion? >>: No, no. Just -- you know, if you'd run the method of 1, 2, 4, and 5 [inaudible] did you ever try? >> James Philbin: Well, so you can't really do query expansion without the spatial. >>: Because you'll get too much? >> James Philbin: You just get junk. >>: How about if you throw 3 and 4 [inaudible] why do the soft assignment again, for example? Could be the case that it gives you much lower ->> James Philbin: Right. I could have just shown these results inverted so the query expansion after soft assignment. >>: No, but did you try it without it? >> James Philbin: We have. And the result's in the paper. 
I can't off the top of my head remember what that ->>: Okay. >> James Philbin: -- exactly is. Yeah. So in the paper we went through all these sort of combinatorial this, that and the other, see what works. Actually, I'll tell you what I'll do. I'll do the demo at the end, and then I -- otherwise I'll lose the flow of it. Okay. So that sort of ends the first part of the talk. So we saw how we can really improve sort of particular object retrieval. And now we're going to sort of apply these methods to a task which is object mining. So the goal here is to ultimately find and group images of the same objects or scene. So we've gone from a large sort of bunch of images on the left, we've automatically pulled out these separate objects. And we sort of envisioned several applications potentially for this method. So one could be dataset summarization, so you're given a huge bunch of images, it's very difficult to sort of say or to characterize that dataset. But I think, yeah, a useful thing to be able to do is to say, you know, this dataset is mainly of the Trevi Fountain or of, you know, these particular scenes or objects. We can also use it for efficient retrieval. So if we really are sure that an object is seen in two images, we don't have to index both images. And we can also use it as sort of a back-end or preprocessing step for 3D reconstruction methods like Photosynth. So obviously the bundle adjustment is going to work much better when you've got actual images of the object you're interested in. I'm going to show some results on two additional datasets. So the Statue of Liberty dataset. It's just under 40,000 images. Crawled from Flickr by searching for Statue of Liberty. I have lots of images of statue, but I also have New York and other sites. And also on the Rome datasets. This is over a million images of Rome. Again, crawled from Flickr. And I should say both of these datasets come from Noah and Steve and Rick. So our approach is to try and build a matching graph of all the images in the dataset. And this graph is formed such that each node is an image or an image represents a node in the graph, and a link between these two nodes sort of encodes these two images have some object in common between them. So they've been matched in some way. And then given this graph structure, we can sort of apply various algorithms to group the data. And so our approach is to use particular object retrieval to increase the search speed. And instead of considering all pair-wise matches, we'll only consider images which we know [inaudible] match. So the procedure is to -- using retrieval, we query using each image in the dataset and each query gives us a list of results scored by a measure of the spatial consistencies of the query. And then we simply threshold this consistency measure to determine the final links in our matching graph. So the first thing we can look at is we can say, well, any collection of images of multiple disjoint objects we'd expect the matching graph to also be disjoined. Otherwise, we're a bit worried I think. So the first simple step is to take connected components of this matching graph and just have a look at the clusters or the components returned. And here's just on the Oxford dataset. You can see in some cases it does a pretty good job already. So these first three clusters on the left are pretty pure. So we've got the bridge on the left. This is the back of Christ Church College in Oxford in the middle. And there's a college hall in the middle there on the right. 
But these last two components contain a mixture of objects. In part, part of the problem is that we have a linking image. So this is sort of a Georgian facade which is here and here, but then you start to move around it and then suddenly you've matched this building. And, again, the same problem with this last component that you have, this Georgian facade, two buildings are seen in this middle image, and then you've linked to some other stuff. >>: What's the problem with that? It won't [inaudible] building [inaudible] clusters? >> James Philbin: Yes. I mean -- yeah. So the aim is you want to group all the images which contain the same scene or the same object. And obviously if you're -- if you have enough images of the world, you can just -- you can spread everything. You can just [inaudible] everything, which sort of isn't very useful. >>: Well, as long as you know what the [inaudible] still makes sense even if it's one cluster. >> James Philbin: I'm not sure. I'd think you'd like to be able to say, oh, look, these group of images contain the same objects. Or at least that's [inaudible]. >>: For some applications, that long-range linking is very useful, right, you're trying to basically rebuild as much of the city in a three-dimensional consistent coordinate frame, you want things to link to each other. >> James Philbin: Right. So -- yeah. We actually ran this method on the Rome data for Noah and Steve, and I think they use it as some preprocessing to sort of more efficiently do the matching later. So as I was saying, the problem with taking connecting components is that we have sort of connecting images which can join two disjoint objects. So here I've actually got the graph of one of the components here. And it contains sort of the bridge and the Ashmolean Theater in Oxford behind it. And it's essentially linked by this image and this image, where you have the objects here and the bridge is both seen in the same image. So some connected components are very pure and they contain just a single common object. Usually the problems are linking images. Some components contain more than one object or scene. And so the first method I'll present, we're just going to use some sort of standard graph clusterings and spatial clustering to produce sort of purer clusters. So this is going to split up, say, this component into sort of two disjoint sets. And this is quite fast to run once you have the graph. So here's some results for the Statue of Liberty dataset. So we got a bunch of -- sort of 11,000 images of the Statue of Liberty. There's a Lego Statue of Liberty in this data. And I think this is some building on Staten Island in New York. Here are some results of the Rome dataset. So Coliseum, obviously the most popular object. But then also there's the Trevi Fountain. This is the St. Paul's Square -- St. Peter's Square and the Vatican. And this is -- anyone know? I've forgotten that scene. So some palace, I think. >>: Paris? >>: It's in Rome. But I forget the name ->> James Philbin: It's quite interesting to see sort of what people take I think. So to get the results seen previously, we have to do a fair bit of manual tweaking. So especially as the data gets bigger, you end up matching more and more and more. And this means that you get sort of high-precision clusters but possibly low recall. So these clusters are quite pure, but I imagine there's actually quite a lot more images we've got to see that we've had to trim out, because otherwise we just cluster too much together. 
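A minimal sketch of the matching-graph construction and connected-component step described above, with `retrieve` standing in for the particular-object retrieval engine and the consistency score standing in for the number of verified inliers:

```python
# Build the matching graph by querying with every image, keeping edges whose
# spatial-consistency score clears a threshold, then take connected components.
# `retrieve` is a placeholder for the retrieval engine.

from collections import defaultdict

def build_matching_graph(image_ids, retrieve, min_consistency=20):
    graph = defaultdict(set)
    for q in image_ids:
        for result_id, consistency in retrieve(q):       # scored, spatially verified results
            if result_id != q and consistency >= min_consistency:
                graph[q].add(result_id)
                graph[result_id].add(q)                   # matching is treated as symmetric
    return graph

def connected_components(graph, image_ids):
    seen, components = set(), []
    for start in image_ids:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                                      # simple iterative DFS
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph.get(node, ()))
        seen |= comp
        components.append(comp)
    return components
```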
And basically any clustering method you care to mention basically assumes this transitivity of the sort of relation between them. So our relation here is sort of contains the same object. But clearly we don't have transitivity in this case. So image 1 contains object A, and that links to image 2, which also has A. And image 2 has B, which also links to image 3, which also has object B. But we can't say that image 1 and image 3 share the same object. So we can imagine that you're representing this quite formally in saying that we're going to represent images as a mixture of discrete objects. And so an object is now going to consist of a histogram of visual words but also some spatial layout. And it's sort of related to the Latent Dirichlet Allocation topic model. And we call it Geometric Latent Dirichlet Allocation. So it's similar, but it has this spatial layout sort of encoded into the model. And the aim is to jointly infer from the model what the objects look like, so they're visual words and spatial location, and which images contain them. So here's the model in a bit more detail. So I think if you haven't seen LDA before, this will be meaningless. But if you have, essentially we've augmented the topic model here with eventually a pinboard of where the visual words actually appear in sort of a canonical frame. And we're also going to generate a homography. So this is a document topic-specific homography which projects words from the document into the topic or vice versa. And there's a few more priors there as well. So topic in our model corresponds to a particular object or a landmark. And each topic is represented as a pinboard storing identity, position and shape of the constituent visual words. And we sort of generate a document from this model with a mixture of these topics together with the homography and the pinboard. So we pick a word, we pick a topic, we pick sort of the homography, and then we generate the document in this way, so we accumulate if new visual words. >>: So you're using the word pinboard or ->> James Philbin: Yes. A pinboard. So imagine that -- yeah. A pinboard is basically exactly what it sounds like. But you have a topic here and you have a bunch of visual words. So we've got elliptical regions like this. So the pinboard consists of the position, the shape of this word, and the identity. >>: So is it a descriptor or ->> James Philbin: It's just a -- it's an elliptical region, basically, with a word ID associated to it. >>: But that's different than the homography. >> James Philbin: So the homography then takes this word from the topic into the document. >>: Oh, okay. So within the topic the features have their own little geometrical relationships. >> James Philbin: Exactly. So there's sort of a canonical frame within this topic. >>: Okay. >> James Philbin: And then there's a homography which sort of projects the relevant features into the document. >>: And then Eric Sudderth had something called a transformed LDA, right? Which was where you take the parts of an object and then transformed them to where it sat in the image? >> James Philbin: Right. So Eric actually had -- he had sort of full distributions over these words and had lots and lots of priors. And it was basically -- his method sort of worked, but it's very difficult to learn. 
And that's basically the issue we faced, that if you can actually use it on ->>: [inaudible] generic objects, like a screen, where you don't really know [inaudible] you're trying to model the rigid size which is fairly rigid so you can [inaudible] have some kind of a very ridged definite idea of where things are located. >> James Philbin: Exactly. And to find or to estimate these homographies, we can use RANSAC. And that's really the important thing. It's actually tractable to learn it. >>: [inaudible] distribution, then, on the visual words in your canonical frame there? >> James Philbin: So the model is a bit tricky. And the pinboard actually isn't shown in this graphical model, and that's sort of quite deliberate. Because we didn't want to have to represent each visual word as a distribution, which we [inaudible] parameters over because it's just too -- it's just too complex to learn. So the pinboard comes into when you estimate the H here. So we're actually going to estimate the H based on the likelihood of that homography given the pinboard. And there's sort of the N step where we're learning this model, estimating the pinboard, given the model, we're estimating the model given the pinboards. >>: [inaudible] offline process from this which kind of does ->> James Philbin: Right. So there's sort of a -- there's probably another variable hidden in here somewhere, which is actually directed out from the H. And also connects to those words up there. So this is sort of how we learn it. So of course we use Gibbs because of the topic model, and we're going to sample the topic assignments and the pinboard given the current hypothesis, or current set of hypotheses. And then we sample the hypotheses given the current topic assignments and the pinboard. And actually step one is simply another Gibbs sample. And I won't go into all the variables here. And step two is we're going to sample hypotheses from likely hypotheses found using RANSAC. So we're doing essentially RANSAC between this pinboard and each image, and this gives you a bunch of hypotheses. And we're going to basically have a likelihood for a particular hypothesis being the number of inliers between the document and the topic. So, yeah, that's that slide. So I'll give you an example. So this is one connected component from Oxford. This is actually 56 images, and we saw this before, but we have sort of three that facades or three distinct objects in this component. So we have this sort of Georgian building here, and then we have the college [inaudible], and then this sort of tower as well. But we also have these linking images, which contain more than one of these sort of distinct objects. I'm actually going to go back to that. So if we set the number of topics correctly, then we get results like this. So we've actually managed to pull out with our model these three separate facades. So we've got the Georgian building on the left, we've got these clusters here, and then we've got this tower as well. Notice that these are actually sort of probabilities now. So we're not getting a hard clustering for these objects. So these images will also occur in this list, but much further down. Because it's a probabilistic model, we can actually look at sort of log likelihoods, et cetera. And we use this to sort of set the number of distinct objects in our component. So here we can see there's a peak at three topics or three objects. And this is sort of pretty much correct. 
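As a rough illustration of that alternation (one possible reading, not the authors' code), a single sweep might look like the sketch below. The count arrays `ndk` and `nkw`, the assignments `z`, the per-document-per-topic transformations `H`, and the `propose_hypotheses`, `count_inliers` and `geom_weight` callables are all placeholder bookkeeping; in particular, the pinboard is treated here as implicit in the current assignments and transformations rather than being sampled explicitly.

```python
# A much-reduced skeleton of the alternation: Gibbs-sample each word's topic,
# with the usual LDA conditional multiplied by a geometric consistency term
# under the current document-topic transformation; then re-sample each
# transformation from RANSAC-style hypotheses weighted by their inlier counts.

import numpy as np

def glda_sweep(docs, z, ndk, nkw, H, alpha, beta, vocab_size, n_topics,
               propose_hypotheses, count_inliers, geom_weight, rng):
    # Step 1: topic assignments.
    for d, words in enumerate(docs):                      # docs[d]: list of (word, x, y)
        for i, (w, x, y) in enumerate(words):
            k_old = z[d][i]
            ndk[d, k_old] -= 1
            nkw[k_old, w] -= 1
            p = np.empty(n_topics)
            for k in range(n_topics):
                lda = (ndk[d, k] + alpha) * (nkw[k, w] + beta) / \
                      (nkw[k].sum() + beta * vocab_size)
                p[k] = lda * geom_weight(w, x, y, H[d][k], k)
            p += 1e-12                                    # avoid an all-zero distribution
            k_new = rng.choice(n_topics, p=p / p.sum())
            z[d][i] = k_new
            ndk[d, k_new] += 1
            nkw[k_new, w] += 1

    # Step 2: document-topic transformations, sampled in proportion to inliers.
    for d, words in enumerate(docs):
        for k in range(n_topics):
            candidates = propose_hypotheses(words, k)     # from single correspondences
            scores = np.array([count_inliers(words, k, Hc) for Hc in candidates], float)
            if scores.sum() > 0:
                H[d][k] = candidates[rng.choice(len(candidates), p=scores / scores.sum())]
```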
We can also look at sort of linking images and how actually these images -- how these words are sort of projected from the topic model. On the top row here we've got two topic models or two pinboards, and we've got one, this Georgian building on the left, this college class is on the right, and an image which contains both in the middle. And you can see that not only have we disambiguated these two objects, but we also sort of localized them in each of these images. We can also look at sort of probabilities here. So this is for Sheldonian Theater. And you can see it makes some sense that as you zoom in you get a different set of words, and then it goes to night, so we're matching less. And then finally you get very extreme viewpoints. And we can also sort of visualize these pinboards. So because each visual word in our dataset has to belong to some topic, and we have the hypotheses, we can sort of splatter these elliptical regions and sort of try and visualize what these pinboards look like. So these are sort of two of the better ones, I must admit. But you can definitely make out the sort of bridge, sort of a painterly effect to the bridge on the left. And this is this facade from Christ Church on the right. >> Rick Szeliski: Are you familiar with the work of Jason Salavon? >> James Philbin: No, I'm not. >> Rick Szeliski: Oh, okay. I'll tell you over lunch. >> James Philbin: You think I can sell these or... >> Rick Szeliski: Well, it's reminiscent. The work comes up -- people who know about this artist mention it like when they see Antonio Torralba's Average Caltech 101 image. His art is basically taking a lot of photographs of, let's say, a wedding photo, lining them up, and then taking the average image. It looks a lot like this. >> James Philbin: Okay. Yes. I mean, other objects, they don't look quite so clean. But they look much more like a sort of -- sort of a splatter of -- and you can sort of make out vague shapes and things. So, in conclusion, then, in the first part of my talk I looked at sort of improving particular object retrieval, and we saw that using large vocabularies and spatial verification, soft assignment and query expansion can really sort of improve retrieval quality. Then applied -- looked at applying these techniques to a particular problem. In this case, object mining. So how we can build a matching graph and introduce this sort of mixture-of-objects model, which is similar to LDA but with spatial information. And that's it. Thanks. >> Rick Szeliski: Thank you. [applause] >> Rick Szeliski: Would people like to see the demo now? Or do they have questions they can ask? So the two latest publications were ICVGIP, which is the Indian conference? >> James Philbin: That's an Indian conference. And BMVC as well. You can ask questions while I'm... >>: [inaudible] >> James Philbin: Um... >>: Like if you were to -- if you wanted to cluster a large collection of images, do you use your clustering method or do you think the min-hash technique of -- >> James Philbin: So -- yeah. So I think actually Ondrej's work on -- in that clustering can be seen as sort of a faster approximate way of building this matching graph. I don't think it helps with the problem of sort of then how you pull out the distinct objects necessarily. >>: Basically you get an efficiency gain on -- >> James Philbin: It definitely is. Yeah. But it's also approximate. So he does miss some stuff. 
>> Rick Szeliski: The other related work in -- there's -- James, the other work that comes to mind when you talked about finding the objects is Ian Simon's work. It sort of finds, you know, areas of interest in photographs by looking at the matching graph. >> James Philbin: I don't think I've seen that work. >> Rick Szeliski: Have you not seen that? I'm trying to remember what -- >> James Philbin: I know that Tina had a paper. >> Rick Szeliski: I'm sorry? >> James Philbin: I know that Tina [inaudible] had a paper. >> Rick Szeliski: Okay. No, this is -- he's advised by Noah Snavely, so it was kind of -- it was by Steve Seitz, so it was along that line of work. But I'll show you that. >> James Philbin: No. >> Rick Szeliski: Okay. >> James Philbin: So I think it's an idea that's sort of been independently looked at by several people. It's definitely sort of come of age, I think. Okay. So this is sort of a realtime demo we've got running at Oxford. If it loads. >>: [inaudible] machine. >> James Philbin: Sorry? >>: [inaudible] machine. >> James Philbin: This is just on my machine. But it's like a cold cache also. Is there some way of making it full screen? Press something. F11 or something? >>: What is it? F11? >> James Philbin: Got it. Okay. So we've got a query image here, and we can select any part of the image. But let's look at the bridge. We know it works. So this is just using the sort of standard inverted index plus spatial re-ranking. So we get sort of results which are fairly similar to the query initially. I think we start getting slightly more challenging things. So we can actually see how -- where the first false positive is here. So 37. So someone remember 37. And then we can -- >>: And it accepted such a degenerate affine transformation or... >> James Philbin: Right. So that -- I mean, this is something we've thought about before; we've just never got around to doing it. Is that this is obviously false because the hypothesis is so wildly not what you'd see in real life. And we've always said, yeah, we should have a prior on the hypotheses and we should vote [inaudible] like this and just chuck that one out. And I think that would improve performance a bit. >>: Chuck it out or re-rank it. >> James Philbin: Re-rank it, yeah, just push it down. >>: Okay. >>: I'm sorry, so it seems like it has [inaudible] matches, right? These are before the RANSAC, right? >> James Philbin: Yes. I mean, if you want to see -- >>: [inaudible] >> James Philbin: -- exactly what's actually matched -- >>: Just curious [inaudible]. >> James Philbin: This demo is online, by the way. So you can just go to the robots Web site and just play around with it. So you get four in a row. They've matched on sort of windows. >>: [inaudible] had inliers there -- >>: Five inliers. This is after the homography. >>: Yeah, after the homography, five. >> James Philbin: Yeah. >>: So the 40 is probably -- >> James Philbin: Ah. So, yeah, I can turn that off actually. >>: [inaudible] [multiple people speaking at once] >> James Philbin: My experience with large datasets is you think, oh, you know, these words are so distinctive and you have enough data and you just see everything eventually. So, anyway, so someone remember 37. Or I'll remember it. And we can go back and actually turn on query expansion and do the query again. >>: This is using soft assignments? >> James Philbin: This isn't actually, no. So it takes a little bit longer. 
So we've done a query, and then we've got more results and then it's much richer, and then we do another query. And then we can go down and see what the results are like. You'll notice as well that we'll generally have a much higher number of inliers, we need to do query expansion. So we got to 37 before, and we're still going. Right. >>: [inaudible] >> James Philbin: 54, anyway, so we've got quite an improvement there. I always think this one's amazing if it works at all. That's good. >> Rick Szeliski: Yeah. But it's quite likely that the query expansion really helped because you got some ->> James Philbin: Right. Yeah, so ->> Rick Szeliski: -- series of more and more close-up images that kind of -- >> James Philbin: Exactly. I'm actually only prompting the original matches here. This one's more interesting. So that doesn't normally work searching for faces, but in this case it did. Start to get some [inaudible]. >>: [inaudible] >> James Philbin: No. So we trim all the features which lie outside the box. >>: [inaudible] black tie that's underneath his face [inaudible]. >> James Philbin: So I think probably that [inaudible] ->>: [inaudible] cloudy day. >> James Philbin: So I think it's his graduation hat. >>: [inaudible] >> James Philbin: I mean, it's a sort of same expression. It's obviously taken from the same camera, roughly the same time. So -- but it's quite nice that it works. Also, you'll notice I searched on his face, but some of the bridge lay behind it and we sort of got that as a superposition of those two things. And actually an interesting anecdote is that when Andrew gave this talk, he met someone who knows this guy. And he actually sent us an e-mail and said I'm very pleased that my image is being used for such good purposes. I'll just show you it working on some small object. So we found out that in Oxford, there's sort of a -- just the one template for lampposts. And you see it's seen all over the place. >>: So which implementation [inaudible] are you using? >> James Philbin: So this is Christian's -- so Christian has several versions, and each one does performance totally differently. So you have to be very careful which one you choose. And I believe ours is the second one down on the list on the Oxford Web site, and that we've found works the best. It's definitely a black art, though, these feature, how you get them to work well. >>: [inaudible] >> James Philbin: Because you've got -- I mean, you've got almost exponential space of parameters. So, you know ->>: Well, people here are trying to learn better descriptors. >> James Philbin: Yeah. So I think the descriptors is an interesting point, but the interest points are almost more important to get right I find, because if you miss the interest point, you're stuffed. There's nothing you can learn which is going to help you out in the descriptor level. And certainly when Christian moved from more object based to more category based, his -- it works much worse on our data. But I guess that's obvious because he's tuned it. So... Okay? >> Rick Szeliski: Okay. Well, thank you very much. >> James Philbin: Okay. Thanks. [applause]