>> Rick Szeliski: Good morning, everyone. It's my pleasure to welcome Svetlana Lazebnik from the University of North Carolina at Chapel Hill to give us a talk this morning about her research. Lana got her Ph.D. last year, actually two years ago, from the University of Illinois at Urbana-Champaign, supervised by Jean Ponce and Cordelia Schmid. And she's one of the world's leading experts on object recognition. And I think you'll find today's talk quite interesting and see the connections to the work we're doing here on Photosynth.

>> Svetlana Lazebnik: Thanks, Rick. So I will talk about some of the projects I started this year since moving to UNC, and the main theme is modeling and organizing large-scale Internet photo collections. So in the last few years many computer vision researchers have become fascinated by photo sharing websites like Flickr. There are a lot of websites out there; Flickr is the one I know the best and the one that gets used the most in the published literature. So I personally have a Flickr account. I put my pictures on there and I spend a fair amount of time just sort of browsing and playing around with it. So here are some numbers that are, I think, current as of the end of 2007. Flickr is very large in terms of the number of images, the number of users, the volume of images that gets uploaded daily. I think now they are also beginning to allow short video clips, and there's very rich tag data that goes along with the images. There is geotagging and so on and so forth. So it's a rich and interesting test bed.

So from my personal experience of playing around with the Flickr interface and just trying to learn about what kinds of pictures are out there, it's very often frustrating. The main thing that users do, or at least the main thing that I do when interacting with Flickr, is basically keyword searches. So here's a very simple example. I do a keyword search for apple, just to try to find out what comes up. What does the universe of apple images look like? And here is pretty much the best way that you can get a summary of the results in Flickr as far as I know: you basically display thumbnails of the most relevant search results. The most interesting search results kind of give you very bizarre pictures that might not be apples at all. And most recent, obviously, is not a very representative sample. So most relevant is about the best you can do there, and you can see that, well, there are a few images that are unquestionably apples, like this one over here. Then there are images having to do with Apple computers and Apple brand products. So you can see that there's kind of a (inaudible) going on. So apple has obviously two dominant meanings. But, then, even in these top 24 results there's a lot of garbage. There are images that might, somehow, if you look up close, be relevant to one or the other sense of apple. But, you know, this is not really the first kind of search result that you want to see. They're just not very salient.

Here's another example for Statue of Liberty. So here in the top 24 we get a lot of salient results. At a glance we can see fairly easily what the Statue of Liberty looks like. But, once again, even among the top six most relevant images retrieved, in five of them you can hardly see the statue at all. And also many of the results on this top page -- this was retrieved maybe a couple of days ago -- are basically from the same user.
So in the second row, like three of the middle images are from the same user, Scan G Blue, and here are some more from the same user. So we're getting basically a lot of very similar images. Instead it would be much better to see different salient aspects of the Statue of Liberty, or images that look very similar when taken by a variety of users. So we want to see a more representative slice, as opposed to just one user's pictures being overrepresented on this page. Also, from this page, it's hard to get a sense of whether there is just one unique Statue of Liberty or whether there are several copies of the statue in different parts of the world. Because if there are, and you will see later that there are, this is something we want to be able to see up front. So from these results, it's easy to see what the Statue of Liberty looks like, but you don't know if this is all there is to the Statue of Liberty. Are there other interesting views? Are there other statues? This is not necessarily complete. And of course, you know, five of the top six images are kind of garbage.

So this is what comes up again and again when interacting with Flickr: their interface is so simple that, not just with these queries but with any kind of query you want to issue, it's very hard to get a complete at-a-glance summary of what a given category looks like. What are all the relevant themes or motifs or iconic images for a given category? So this is what we want to do, and this is what I will talk about here: starting with a large noisy pool of photographs, basically downloaded by issuing a query search such as for the Statue of Liberty or apple. First of all, we want to sort out the good pictures from the bad pictures. So which pictures really do represent the concept of interest, and which ones are just noisily tagged or just spurious? And secondly, we want to very compactly summarize the set of relevant images. So we want to find iconic images that are somehow especially representative or salient within the good subset of the query results. And these two problems are actually very interrelated. So finding iconic images, my second point here, can actually help to do the first. It's not necessarily that first we have to sort the good images from the bad images and then cluster the good ones to get iconic images. We can actually start by somehow getting iconic images, and that will help us to determine what the other good images for the category are. So these two problems are really very tightly interrelated.

So how exactly can we define an iconic image? There is no unique answer to this, and let me just show briefly what people have done with this concept in the literature. One of the first places where you see this idea of an iconic or canonical image mentioned is the psychology literature. In the psychology literature, the work by Palmer et al. was concerned mainly with single rigid 3-D objects. If I give you a picture of a toy horse or a telephone or some common object, what are the aspects of that object that you recognize the most easily, that you think are the most characteristic? So this is the kind of question that the psychology literature has tried to answer. So this aspect of the horse might be relatively easy to recognize and relatively characteristic, as opposed to, let's say, a view of this horse from the top or from the back.
So there are certain views of 3-D objects that people like to imagine and find easier to recognize, and others that may be less common or more bizarre. So in the computer vision literature, people have tried to quantify these ideas: once again, given a known 3-D model, how can we compute some sort of common or representative views? Or instead of a 3-D model, we can try to do that with a large set of pictures taken around the viewing sphere of an object: how can we find the subset of views that are most representative? More recently people have looked at iconic or canonical images for products and logos, obviously kinds of imagery that have commercial significance. And that is not too difficult to do, because images of products and logos tend to be fairly simple and stylized. The kinds of product images you tend to search for on the web would have a very large picture of the product of interest with relatively little (inaudible) and so on. In this case, an iconic image would be an image that basically has as little clutter as possible and is similar to many other images. So if you're looking for images, let's say, with the Apple logo, you can use local features to cluster images, all of which contain the Apple logo, and then find the iconic image that basically has the most matches to all the other images and, for example, the fewest nonmatching features. So it would be the image that has the most salient instance of this logo that also occurs in a lot of other images.

People have also tried to extend essentially the same ideas to landmarks, like the Statue of Liberty, the Eiffel Tower and so on. So a lack of clutter, having a very salient representation of the landmark of interest that takes up most of the image, that maybe is not occluded, that has a lot of features that can match to other images of the landmark: this is how an iconic image for landmarks can be characterized. So in some work, for example by Berg and Forsyth, who I think were among the first to look at iconic images for landmarks, this is the key criterion: an image that has a clearly delineated instance of the object. And the way that iconic images are found in this work is by trying to do foreground/background separation, basically to isolate the foreground that contains the object and to look for images that have a lot of this salient foreground. And in other work from Ian Simon at UW, canonical images for landmarks are defined in terms of basically 3-D structure. So a landmark is just an instance of a single 3-D object. It's a unique object in the world, characterized by rigid 3-D geometry. So we can apply essentially many of the ideas that people have used for single 3-D objects to deal with landmarks. 3-D geometry, the distribution of viewpoints in space, plays a very large role.

Finally, there's another category of categories -- another category of visual classes -- that hasn't been looked at a lot but that actually intrigues me. So general categories, not single 3-D objects, not products, not landmarks, more or less abstract categories. Let's say love, beauty, things like that. Does an iconic image exist for such a category? And for these kinds of categories, 3-D geometry is no longer meaningful. Even foreground/background separation is not necessarily meaningful as it is for products and landmarks, because, you know, the whole image, like this image of a rose, might somehow represent love. And there's no point in trying to distinguish foreground from background because it's not a concrete concept.
So this is something on which there hasn't been a lot of work. But, you know, this gives you kind of an overview of the kinds of criteria that people have used to find iconic images.

Okay. So in my talk I will deal with two kinds of imagery. One is landmark collections, Statue of Liberty, et cetera, and for this, the viewpoint we take in our work is very much in line with the Photosynth work and with the UW photo collection summarization work: 3-D geometry, multi-view geometry, is really the most important cue that we can use to deal with landmark image collections. Other work on landmarks that tries to view them essentially as 2-D images, let's say just using gist feature matching or bags of features in 2-D, is in our opinion just not strong enough. We really do need to bring in multi-view geometry, we need to bring in strong 3-D geometric constraints, to successfully recognize landmark images and to find out which images are representative. So which images, for example, come from a tight cluster of similar 3-D viewpoints in the world. So this is very much a 3-D approach and not just a 2-D image clustering approach.

However, enforcing 3-D geometric constraints is very computationally expensive. Extracting features, doing RANSAC between every pair of images to estimate homographies or fundamental matrices, et cetera: if we do this for every pair of images it's quadratic in the size of the collection. So it's not very scalable. Enforcing 3-D geometric constraints and figuring out the 3-D relationships between all the images is not very scalable. So what we want to do is start with 2-D image clustering to sort of preprocess the image collection and to group images that are already very likely to come from similar viewpoints. So even though we completely believe in 3-D geometry for landmarks, we want to start with fast 2-D image clustering, as I will explain in more detail, to make this more scalable. And finally, summarization, as in Ian Simon's work, has been done very successfully as basically a byproduct of reconstruction, of estimating all the 3-D geometric relationships. If you start by estimating all the 3-D points of the landmark in the world and all the camera viewpoints from your image collection, you can then cluster these viewpoints to get iconic images as the centers of clusters. But that requires that you perform all the reconstruction and all the pairwise geometric matching first, and that takes a lot of time. So we actually take the opposite approach of doing summarization first, quickly, to give us a good initial guess for the 3-D structure of the world. So we kind of invert the process: we do summarization first, and out of that the 3-D reconstruction also more or less falls out.

And in the second part of the talk I will talk a little bit about this general category question: how can we deal with categories like love, beauty and so on? Here we don't have a lot to go on, no 3-D geometry or anything, so as you will see in that part of the talk we'll have to rely on semantics a lot more, as captured by image tags. Okay. Any questions so far? All right.

So let me talk then about our work on landmark image collections; that will appear at ECCV this fall. We start with a large collection. A typical size would be about 50,000 images, although in principle we can download a lot more. Typically 40 to 60 percent of this initial collection obtained through keyword search on Flickr is noise.
So for a search for Statue of Liberty, 40 to 60 percent -- I don't remember the exact number for that data set -- are images that don't have the Statue of Liberty in them at all. So it's a very noisy data set. And as a first step we perform clustering using 2-D global image descriptors. We just find images that look almost the same in 2-D, that pixel by pixel look more or less the same. And the idea is that if these images are so similar, most likely they come from very similar viewpoints anyway. So we do this clustering, followed by some geometric verification to make sure the images do come from the same viewpoint, to get a much smaller subset of representative images from the clusters, or iconic images. So we'll go from about 50,000 images to maybe a couple of hundred of these iconic images following this clustering stage.

>>: Do iconic images have to have multiple images behind them, or can they be single?

>> Svetlana Lazebnik: Sorry?

>>: Your iconic images are images that are clustered together because they appear similar enough. Can you have clusters of size one, or do you have to have a certain minimum?

>> Svetlana Lazebnik: In practice I think the threshold is about eight. I will show in a couple of slides exactly how we do the geometric verification, but in order to be able to do the verification, we have, you know, some sort of minimum size. And eight is what we use.

Okay. So we have these iconic images, and next we want to figure out what the viewpoint relationships between them are. So we match every pair of iconic images and try to estimate the epipolar geometry between them, to figure out how the aspects relate to one another. Then we go from just a pile of iconic images to a graph where edges represent two-view relationships. And we cut this graph basically to find groups of images that see more or less the same parts of the landmark. From these components we can perform -- whoops. We can perform, for example, 3-D reconstruction. So these components allow us to perform reconstruction very efficiently, because they will be a lot smaller than the whole graph, as I will explain. But also we can use these components as a natural browsing structure for the landmarks, as I will also explain. So this is kind of a brief overview of the approach, and now let me give you more of the details of every step.

So the first step is appearance-based clustering, and basically this is one of the end results of that step. You can see the kinds of images it looks for: images that almost pixel for pixel are the same. For this we use the so-called gist descriptors that were developed by Aude Oliva and Antonio Torralba and have been popularized in recent years in the graphics and vision literature by Alyosha Efros. Here you can see, on the left, just some generic image of a forest, and on the right is a synthetic image that has the same gist descriptor as this forest. And you can see that the gist descriptor captures sort of the coarse texture characteristics of the image: the mean and sort of directionality of textures, where the various bars or strong edges in the image are, and so on. This descriptor is about a thousand dimensional, and it's obtained by taking a steerable filter transform of the image and then averaging the filter responses over very large areas, at different, more or less, quadrants of the image. And this is the kind of information that it captures.
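As a rough illustration of this step (not the actual system described in the talk, which uses the full, roughly thousand-dimensional steerable-filter gist of Oliva and Torralba), here is a much simplified gist-like descriptor that pools oriented gradient energy over a coarse grid, followed by k-means clustering; the grid size, number of orientations, and cluster count are arbitrary placeholders.

```python
# Simplified sketch of gist-style appearance clustering (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def gist_like_descriptor(gray, grid=4, n_orient=4):
    """Pool oriented gradient energy over a coarse grid.
    A crude stand-in for the real gist descriptor; expects a 2-D grayscale array."""
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # orientation folded into [0, pi)
    h, w = gray.shape
    si, sj = h // grid, w // grid
    cells = []
    for i in range(grid):
        for j in range(grid):
            m = mag[i * si:(i + 1) * si, j * sj:(j + 1) * sj]
            a = ang[i * si:(i + 1) * si, j * sj:(j + 1) * sj]
            hist, _ = np.histogram(a, bins=n_orient, range=(0, np.pi), weights=m)
            cells.append(hist)
    desc = np.concatenate(cells)
    return desc / (np.linalg.norm(desc) + 1e-8)

def cluster_by_appearance(gray_images, n_clusters=100):
    """K-means on the descriptors; each cluster is a candidate group of
    near-identical views, to be geometrically verified afterwards."""
    X = np.vstack([gist_like_descriptor(im) for im in gray_images])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    return km.labels_, km.cluster_centers_
```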
So it turns out that by extracting these gist descriptors from the entire image and doing k-means clustering on them, it's fairly successful in finding groups of very similar-looking images. But, of course, we can't trust just the output of the gist. In fact, in this example here you can see there's one outlier. There's one image that doesn't really match the others but got clustered with them anyway. We have to make sure this kind of thing doesn't happen, so we can't just trust gist alone. After we do the gist clustering, we have to go to every cluster and make sure that the cluster does represent a set of images that are geometrically consistent, that have the same sets of features viewed from more or less the same viewpoint.

For this we take the top few representatives from each cluster, meaning the images that are closest to the center of the cluster, and we use eight in our implementation. So we take the top eight images closest to the center and perform pairwise matching between them. We extract SIFT features and do RANSAC to estimate either a fundamental matrix or a homography, depending on the relationship between the images. And then the number of inliers to this relationship tells us basically whether these images are geometrically consistent or not. Then among these top eight images we simply sum up all the inliers they get by being matched to the remaining images, and the image that gets the most inliers is declared the iconic image for the cluster. There's also a threshold on this total sum of inliers, so that if there are too few inliers among the top eight images, we can say that the cluster is not geometrically consistent and throw it out. Empirically, we have found that this is basically sufficient. Even if you have a very large cluster, you only take the top few representatives and do this matching, and it really pretty much tells you everything you need to know about the cluster. As I will show later, we have a quantitative evaluation of this in terms of recall-precision curves.

After verifying each cluster and keeping the representatives, or iconic images, from the geometrically consistent clusters, we might get a set of images that looks like this. These would be our iconic images, and we can see that among them there are some that still look almost the same, like a few of these images on top. And for a few of the other ones, clearly there's been some overclustering, some oversegmentation, because the clustering is just far from perfect. Some images that could really have been clustered together didn't get clustered together, and so on. Some of the images represent the same viewpoint but don't look the same in 2-D, either because of clutter, like a person standing in front in one of the images, or maybe the lighting is too different, or maybe there's too much camera zoom or viewpoint change for gist to work. So now we have to fix all this. Now we have to basically verify the geometric relationships between the iconic images. So we match every pair of iconic images to each other, once again to see whether or not there's a two-view relationship between them. And this gives us a graph structure, the so-called iconic scene graph, where the nodes are these iconic images representing their clusters, the edges represent either homographies or fundamental matrices, and the edges are also weighted by the number of inliers to the transformation.
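To make the verification step concrete, here is a rough sketch using OpenCV SIFT matching and RANSAC for a fundamental matrix only (the actual system also fits homographies where appropriate); the thresholds, the ratio test, and the minimum-match counts are illustrative assumptions, not the values used in the talk.

```python
# Sketch of cluster verification and iconic selection (assumed parameters throughout).
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def count_inliers(img_a, img_b, ratio=0.8):
    """SIFT + ratio test + RANSAC fundamental matrix; return the inlier count.
    Expects 8-bit grayscale images."""
    kpa, da = sift.detectAndCompute(img_a, None)
    kpb, db = sift.detectAndCompute(img_b, None)
    if da is None or db is None:
        return 0
    matches = matcher.knnMatch(da, db, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    if len(good) < 15:                       # need a reasonable minimum of tentative matches
        return 0
    pts_a = np.float32([kpa[m.queryIdx].pt for m in good])
    pts_b = np.float32([kpb[m.trainIdx].pt for m in good])
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0, 0.99)
    return 0 if mask is None else int(mask.sum())

def verify_cluster(top_images, min_total_inliers=100):
    """Pairwise-match the top cluster members (e.g. the eight closest to the
    centroid); pick the one with the most summed inliers as the iconic, or
    reject the cluster if the total is too low. The remaining cluster members
    can then be checked against the iconic only, which is linear in cluster size."""
    n = len(top_images)
    totals = np.zeros(n)
    for i in range(n):
        for j in range(i + 1, n):
            inl = count_inliers(top_images[i], top_images[j])
            totals[i] += inl
            totals[j] += inl
    if totals.sum() < min_total_inliers:
        return None                          # cluster is not geometrically consistent
    return int(np.argmax(totals))            # index of the iconic image
```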
Now, unfortunately, these iconic scene graphs are pretty hard to visualize, but here at the top you can kind of see one. There are two types of edges; I don't remember which is which, but one of them is homography and the other is fundamental matrix. And the nodes are images. So this is what the graphs tend to look like. They look fairly messy, but one thing you'll usually find in them is that they have very tight clumps, very tight clumps representing very popular views like the frontal view of the Statue of Liberty, and maybe several of them if there are several dominant aspects. And there might be a lot of weakly connected components or isolated nodes floating about. So we have to analyze this connectivity.

First of all, with isolated nodes: because we've already tried to estimate the geometric relationships between the iconic images, we know that if there's an isolated node in the graph, it doesn't match anything else. So this is already a pretty strong hint that this might be garbage, that this might be a set of images that doesn't truly belong with the landmark and should be thrown out. And for this, this is the only place where we use tags in our work on landmarks. We look at isolated nodes and try to match their tag distribution to the distribution of the tags for the largest graph components. Because for the most popular tags associated with the largest graph components, we have a pretty good idea that they are really a very good sample of tags that describe the landmark well. So by taking the tags from an isolated node and seeing if they appear in this list, we can quickly check whether the tags associated with this isolated node are garbage or not. And this gives us a filter that we can use to eliminate a lot of these isolated nodes, because what we have found is that there will be a few sets of near duplicates in the data set, just very, very similar images that were all taken by the same person and just somehow spuriously tagged. So most of their tags are garbage, and we can recognize that pretty easily. So this is just a filtering step to clean up some of these isolated graph nodes.

But what we're really interested in are the tightly connected components of the graph, because these give us the most salient, characteristic aspects of the landmarks. So what we do is take this graph, which has a loose structure, and with a few cuts -- we run normalized cuts -- we get these clumps, these tightly connected subgraphs, which really end up corresponding to very similar aspects. I will show the Statue of Liberty results in a little bit. But then what we can do with these components is use them for structure from motion. Basically, one of the big reasons to work on components as opposed to the whole graph is that they're much smaller. So structure from motion, especially global bundle adjustment, can be done much more efficiently on subgraphs as opposed to on the whole graph. So we basically process each component separately. We use the graph structure, and we find a maximum weight spanning tree, where the weights, if you will recall, are given by the number of inliers. So obviously, to incorporate the images into the model, we will want to go with images that have a lot of inliers to other images. So we use a maximum weight spanning tree to determine the order of incorporating images into the 3-D model. And finally, we don't forget about the connections between the components.
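Here is a minimal sketch of this graph bookkeeping using networkx; the pairwise inlier counts and per-image tag lists are assumed to come from the earlier steps, the tag-overlap threshold is a guess, and the normalized-cuts partitioning of large components used in the talk is left out (plain connected components stand in for it).

```python
# Sketch of iconic scene graph processing (illustrative assumptions throughout).
import networkx as nx
from collections import Counter

def build_scene_graph(n_iconics, pairwise_inliers, min_inliers=15):
    """Nodes are iconic images; edges are verified two-view relationships,
    weighted by the number of RANSAC inliers."""
    g = nx.Graph()
    g.add_nodes_from(range(n_iconics))
    for (i, j), inl in pairwise_inliers.items():
        if inl >= min_inliers:
            g.add_edge(i, j, weight=inl)
    return g

def filter_isolated_by_tags(g, tags, top_k=30, min_overlap=2):
    """Drop isolated iconics whose tags don't overlap the most popular tags
    of the largest connected component (a cheap 'is this garbage?' test)."""
    comps = sorted(nx.connected_components(g), key=len, reverse=True)
    popular = Counter(t for node in comps[0] for t in tags[node])
    popular = {t for t, _ in popular.most_common(top_k)}
    for node in [n for n in g.nodes if g.degree(n) == 0]:
        if len(set(tags[node]) & popular) < min_overlap:
            g.remove_node(node)
    return g

def reconstruction_order(component_subgraph):
    """Order images for incremental structure from motion by a maximum weight
    spanning tree, so strongly matched images are added to the model first."""
    tree = nx.maximum_spanning_tree(component_subgraph, weight="weight")
    root = max(tree.degree, key=lambda kv: kv[1])[0]
    return [root] + [v for _, v in nx.bfs_edges(tree, root)]
```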
It could be that there were a few weak connections that were cut when we did the normalized cuts, and we don't throw those out. In the end, when we separately reconstruct each component, we can go back to the cut edges and use the matches along those edges to merge the component models. And finally, once we reconstruct each component and do as much merging as we can, this gives us a good sort of Photosynth-style 3-D representation of the landmark, and it's built using relatively few images. It only uses the iconic images. So if we started with 50,000 images, maybe half of which actually contain the landmark, at this point we'll end up with a model that uses a couple of hundred iconic images. But we can fix that. We can enlarge the models by registering additional noniconic images, and this is very easy to do.

So let me show you a video that shows some of the models that we get. Unfortunately I don't have the models in their (inaudible) format, but in this video you can see we have three data sets. So this is the Statue of Liberty. And here's a brief screenshot of how merging works. These are different aspects of the Statue of Liberty that still have a lot of overlap, so they can be merged. And this merging, the matching and the estimation, is all automatic, based on the graph structure. So this is a bigger merged Statue of Liberty, and you can see there are a few outliers. We haven't done a lot to make our models perfect. And this is our second data set, San Marco -- Piazza San Marco in Venice. This is a big component that includes the front of the cathedral. And the third data set is Notre Dame in Paris. So that was the dominant model, and this is the model of the back of Notre Dame, which is hard to merge with the front. So this kind of gives you the sense of the 3-D models that we get using our approach. There are a few outliers, and we really have done a lot less to cosmetically clean up the models and make sure that there are no outliers, no incorrect matches and so on, but you get the idea that by and large the structure found by the algorithm is pretty much correct.

>>: For clustering the things that had similar views, did you find that sometimes there wasn't enough baseline for structure from motion?

>> Svetlana Lazebnik: So within a cluster, there's not a lot of baseline. But between clusters, because we match the iconic images basically to each other, this is where we get the baseline most of the time -- from different iconic images. And even there, you know, there's a good number of homographies among those as well. So there isn't always sort of -- I mean, in some cases it's true that it's hard to get the right baseline. But overall, I think the biggest problem remains when the baseline is too wide and you can't merge things. So I think that is kind of a bigger problem for us than not having enough images with a good baseline.

So this is 3-D reconstruction, and it shows what we can do with the iconic images: that the iconic images basically have enough information in them to permit the 3-D reconstruction of the landmarks. And the second thing, besides reconstruction, that the iconic images allow us to do is browse the collection. So we have a three-level browsing structure. At the first level we have the different components that we obtained by partitioning the iconic scene graph. And each component includes a number of iconic images.
So here we expand one component, and these are just the iconic images from that single component. And you can see that some of these images look very close to each other even in 2-D, so it's almost an identical aspect. But others include a lot of zoom, viewpoint change and so on, yet there's still a lot of overlap in terms of visibility, in terms of what all these images can see. So the components represent different iconic images that still see more or less the same part of the landmark. And then we can expand each iconic image to get the cluster that was obtained originally by gist clustering and that contains images that are almost the same as the iconic. So this is a three-level browsing structure for the data set. And I can show an example -- yes?

>>: In that, can you go back to the front? I'm surprised, like the image with the sunset in it, I'm surprised you got enough features on the Statue of Liberty there at all to pass any sort of geometric verification.

>> Svetlana Lazebnik: Yeah.

>>: How did that -- how did that pass?

>> Svetlana Lazebnik: Well, it was a reasonably high resolution image. So it's a thousand by whatever. But there wouldn't be a whole lot of features there.

>>: The second to last column, the third one down.

>> Svetlana Lazebnik: Like this one? Yeah, so I can't 100 percent guarantee that all of them are absolutely correct. But I think there are still enough features, maybe around the outline, and our threshold for the number of inliers, I think, was reasonably low, like maybe 15 or so. So it's possible you can get 15 or so features out of this. But in general, using even higher resolution images would help even more, I think. So it was kind of a compromise. If you use the original resolution images, you can spend forever extracting SIFT features and doing the matching and so on. But on the other hand, if you go to low resolution, you can lose images like that. So we haven't completely found the right trade-off. I think we use the 1024-by-whatever resolution from Flickr. And I'll show a little later that it's true that getting enough features for the Statue of Liberty was sometimes problematic.

But let me show examples of browsing for the Statue of Liberty. So this is level one, where we have all the different components of the statue. So this, for example, is the face of the statue, which is kept in the museum or something. And you can see that these are all the images in the same gist cluster.

>>: I have a question about that. Because the original just gives you a load of garbage, but at this point you've only filtered out the --

>> Svetlana Lazebnik: Actually, one thing I forgot to mention is that these are already filtered gist clusters. Once we decide what the iconic image for the cluster is, it's relatively quick to take every other image and verify it geometrically against the iconic. Obviously it's much quicker than doing pairwise verification of the whole cluster, matching every image against every other. So what we do instead is first do pairwise verification of the top eight, decide on the iconic, and then just verify every remaining image against the iconic. So that's basically linear in the number of images in the cluster. So what you see here are the clusters that have already been cleaned up by throwing out all images that don't match the iconic. So the biggest component is what you have already seen, the more or less frontal view of the statue.
And these gist clusters, following verification, are really very clean. There are almost no mistakes in them. And in fact the second component for the Statue of Liberty, the second largest connected subset of the graph, is Las Vegas. So this is the Statue of Liberty in Las Vegas. You can see that all of these images have the same distinctive background, and it's probably this background, to a large extent, that enables us to connect all these images together and to recognize this Las Vegas component. This is a nighttime Las Vegas component. So sometimes we get lucky and we can match night views to day views, but most of the time they're too different, both for gist and for SIFT features. So we really end up with different components for night and day. And then there's also a third component for Tokyo. So we get a pretty large component for the Statue of Liberty in Tokyo, and once again the background is distinctive enough so that we can distinguish it. There's a nighttime Tokyo component. So you get the idea of what the results are like. And I think there's one component that is basically incorrect. So this is a component that doesn't truly have the Statue of Liberty. I mentioned that we tried to use some tag-based filtering to remove isolated components, but oftentimes the tags are not informative enough. Tags like New York and City are very common for the Statue of Liberty, so we just don't have enough information to throw this out. So this gives you an idea of what kind of browsing this iconic scene graph enables.

>>: Is there one in Seattle?

>> Svetlana Lazebnik: There's one in Seattle?

>>: (Inaudible) Beach.

>> Svetlana Lazebnik: Oh, really. I know there's one in Paris, and that one hasn't shown up either. But that one is hard to photograph; you can't really get up close to it. So I don't know; I haven't gone through the data set and seen how many images of the Statue of Liberty in Paris there are. So there might be a few more, like in Seattle or wherever else, that just don't show up. And you also have to keep in mind that our approach is based on clustering. So it's pretty brute force, and it looks for modes, for very large clumps of images. If you only have a few images of the Statue of Liberty in Seattle or the one in Paris, it might end up just losing them. So this is one disadvantage of this clustering-based approach. It's very good at finding the very salient and dominant components, but it might throw out a lot of smaller components that could still be very interesting.

>>: So for the (inaudible) structure from motion reconstruction, the very first one, when you were doing it independently in each cluster, did you find that certain clusters didn't have enough parallax, too small a baseline?

>> Svetlana Lazebnik: So we're reconstructing from components. I think he asked a similar question. If you look at the images inside of a component, like this whole collection, I think there's enough parallax between them in order to do a good job. Some homographies are also usable for reconstruction; you can use information about camera rotation and so on. So overall we're okay. We don't try to reconstruct individual clusters, and when we deal with sufficiently large components it works out fine.
And sometimes, if the component is too small, you might get a reconstruction that's not very interesting. So here in Tokyo, you kind of get the idea that there's a statue in front and there's a bunch of stuff in the back, but I don't know how interesting this reconstruction really is. And this is still not completely a bad case. You could get an even more uninformative component, let's say a very distant view of the Statue of Liberty, like a view from the top of the Empire State Building or something that basically just shows the New York skyline. From that, you basically get a trivial reconstruction. So we can get reconstructions, but they're not always going to be good. There are going to be bad cases. But for the good cases, like the central aspect of the Statue of Liberty, we don't have that problem.

Okay. And we have tried to quantify our various filtering strategies. Does starting with gist clustering and then performing geometric verification and so on improve recall and precision? It definitely improves precision. And recall goes down a bit at every filtering stage because we end up rejecting some things. Things that are not geometrically verified, let's say against the iconic image, get thrown out, and sometimes we make a mistake. So at every stage we reject some images and the recall goes down a bit, but the precision stays pretty high. For the Statue of Liberty, in the end we're able to register about half of all the images that actually contain the Statue of Liberty, with pretty high precision.

And we've also tested how good the set of iconic images is for trying to recognize the landmark in new images. So we take a never-before-seen image and try to figure out whether there is a Statue of Liberty in that image using just our iconic images. And we can either do a k-nearest-neighbor matching of this test image to the iconics using gist, or we can use bag-of-local-features matching with a vocabulary tree and follow that by geometric verification. So it's a very basic registration approach. And one thing that's interesting about the Statue of Liberty is the vocabulary tree result. It's hard to see, but basically if we do just nearest neighbor matching using a vocabulary tree, this red curve is the recall-precision curve we get, and it's basically random. So if we try to do bag-of-features matching to the nearest iconic for the Statue of Liberty, it pretty much fails miserably. This is due to the fact that it's hard to get a lot of inliers for the Statue of Liberty. It's not very textured, so we don't have a lot of local features, so that's going to fail. So gist, just global 2-D matching, works a lot better than the vocabulary tree in this case. So this is just one interesting note.

Okay. So basically we get similar kinds of results for Notre Dame, which is smaller, and for Piazza San Marco. One interesting note I want to make about this one is the merging. We get models for the front of the square with the church, and the back of the square, and there's also this tower. And we've tried really hard to merge the two and we just haven't succeeded. There are a few difficulties. One of the difficulties is the tower. This clock tower is completely symmetric from all four sides, and there are a lot of images in which you just see the tower against the sky, nothing else. For those images it's basically very hard to get the correct camera pose.
What we end up with is a camera pose that has basically a four-fold ambiguity. So when you have a bunch of misestimated cameras for the pictures looking at the tower and you try to do submodel merging with them, it basically completely screws things up: it puts the square together backwards, with parts intersecting where they shouldn't, and so forth. So it can be tricky. Again, the clustering-based approach keeps the dominant aspects and sometimes throws out a lot of the things in the middle, which in this case often turn out to be the missing links. There might not be a lot of pictures that really see both aspects together, and those could help you do the merging. So this is one weakness of this clustering-based approach that we're going to work to address.

Okay. So basically this is all for this part of the talk, and I can go into the second part just a little bit. Does anybody have any questions? Okay. So I will briefly talk about the second part, which was actually presented last week at the Internet Vision workshop. This part is more speculative and open ended. So what I wanted to see was what I mentioned in the beginning: what can we do with just abstract categories? Can we find any useful summaries at all? Basically, one of the things we want to discover, in the apple example, is the fact that there's apple the fruit and Apple the computer company. For even more abstract terms like beauty, the best we can hope to discover is basically distinct themes: women may be beautiful, sunsets may be beautiful, flowers may be beautiful, and cats may be beautiful, but in slightly different ways. So we want to find different semantic subcategories, or sort of instantiations, of what an abstract term could mean.

Here we also have this recall versus precision problem. Ideally we would like to find all images that somehow represent love or beauty, but that is much harder to define than in the case of the Statue of Liberty. In the case of the Statue of Liberty, we could have ground truth -- does this image show the Statue of Liberty or not -- and we could have people actually annotate images with yes or no, whether or not they show the Statue of Liberty. For these kinds of categories, that's no longer possible. We can't have a person look at any given image and say whether this image represents love. So we kind of have to throw the whole question of recall out the window. We don't really have the hope of retrieving all of the images out of a large noisy collection that represent love. But what we can do is maybe find a few subsets of the images that are internally consistent, that correspond to some recurring visual motifs, like hearts or roses, which would be very typical for love. We don't know about a whole bunch of other images, but we can find a few clusters about which we can say, yes, these are representative images for this abstract term.

So it's hard to define what an iconic image means in this case, because it's so speculative. But the working definition we came up with is that an iconic image is basically a good-looking representative of a group of images that look the same and have similar semantics. So this really has three components. One, appearance: the images have to look the same. Two, semantics: they all have to be roses, or all hearts, or all rings. And third, they have to be good-looking. And we're not sure why they have to be good-looking, but it can't hurt. So this pretty much dictates the algorithm.
So we want to perform joint clustering to find groups of images that are consistent using both appearance descriptors and Flickr tags. And, secondly, for each cluster we pick a representative iconic image by doing automatic quality-based ranking. Our joint clustering is very simple. We first do clustering with gist, just like in the landmark work. And that does ensure some uniformity in appearance, but it doesn't ensure uniformity in semantics. So here we have a bunch of round structures, but some of them are apples, some of them are apple pies, some of them are iPod buttons, and some of them are actually apple pies with the Apple logo in the middle. So appearance gets us somewhere, but we still have this mishmash of different themes. To deal with this we try to capture semantics by clustering tags. We use a very simple approach: we run so-called probabilistic latent semantic analysis, pLSA, to transform tags into a latent topic space. Then we cluster the topic vectors with k-means, and this gives us a bunch of semantic themes, and we simply intersect the two clusterings. And then, you know, this is an actual snippet out of the results that we get. So this is a hand-picked snippet -- it shows where it works nicely, but we didn't make it up.

>>: Just a question. I thought you were (inaudible) in order for pLSA to run --

>> Svetlana Lazebnik: This is pLSA on the tags, not pLSA on the features. This is like the (inaudible) with text, not the pLSA that people have used on image features.

And so this is what we get when it works well. We're really able to separate out the different themes. And, finally, we use Yan Ke's work on learning quality ranking from a database of images that have been ranked as being more professional versus more amateur, and the sole goal of that is just to find a nice-looking representative for a cluster. Because, well, we could just use the image closest to the center or whatever, but somehow it seemed like a good idea to incorporate aesthetics into the approach.

So this is what the results look like for apple. Each quadruple of images here represents a distinct theme found with pLSA on the tags, and the different themes are laid out using multi-dimensional scaling -- multi-dimensional scaling using a distance between sets of tags. So it basically shows how close these tag distributions are to each other. You can see we can separate the apple fruits very nicely from the Apple company. And we have nice distinct themes for the logo, wallpaper, Apple stores; in fact, there's a theme for the store in New York and the store in London and so on. So you get the idea. So this is the top level. And we can expand this: for each theme, the top level just shows basically the top four iconic images, but there are more, because there are several gist clusters that get intersected with the same theme cluster. So this shows all of the gist clusters that are under the same theme. And once again, as in the landmark work, we can expand a gist cluster to hopefully get a lot of images from the collection that are very similar, that basically support this gist cluster. So this kind of gives you the idea. For beauty, once again, we have results that make a lot of sense. We have a cluster of different pictures of women. There's a theme, for some reason, for Japanese girls.
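As a rough sketch of what this joint clustering could look like, not the actual implementation from the talk: scikit-learn's LDA stands in for pLSA here, the inputs (a gist-cluster label per image and a tag string per image) are assumed to come from earlier steps, and the topic, theme, and support counts are illustrative guesses.

```python
# Sketch of joint appearance + tag clustering (LDA stands in for pLSA).
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def joint_clusters(gist_labels, tag_strings, n_topics=20, n_themes=15):
    """Cluster tags in a latent topic space, then intersect with the appearance
    (gist) clustering: each (theme, gist cluster) pair with enough members
    becomes a candidate group from which an iconic image can be picked."""
    counts = CountVectorizer(min_df=2).fit_transform(tag_strings)
    topics = LatentDirichletAllocation(n_components=n_topics).fit_transform(counts)
    theme_labels = KMeans(n_clusters=n_themes, n_init=10).fit_predict(topics)

    groups = defaultdict(list)
    for idx, (g, t) in enumerate(zip(gist_labels, theme_labels)):
        groups[(t, g)].append(idx)
    # Keep only intersections with enough support (the threshold is a guess).
    return {k: v for k, v in groups.items() if len(v) >= 5}
```

Within each surviving group, a representative could then be chosen by a photo-quality ranker, as the talk describes.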
Besides the women, we get flowers and sunsets and nature and cats. So this is an expansion of one of the women themes. The images here are, once again, different gist clusters that fall under the same theme, so they are not supposed to be consistent in appearance. But if we expand a gist cluster, we expect images that are more consistent in appearance. Once again, though, the results we get here are not as clean as for the landmarks, because with the landmarks what really rescued us was the 3-D geometric verification. That is what helped us clean up the clusters. Here we have no geometry, so we really have no additional cues to help us verify. So what we get in a gist cluster is really not as nice and clean a lot of the time. But it works very well for flowers. And one thing that's interesting is that we use no color at all. Gist is computed on gray-level images, so the appearance-based clustering does not use color. But in many cases we find the results to be consistent in color anyway. So this is something that would be interesting to investigate: basically, how dependent is color on global gist features, how correlated is it.

The third category we tried is close-up. This is even more abstract. It refers basically to a photographic technique and not to any concrete object or any concrete subject matter. So here you get themes of what people like to take close-ups of. They like to take close-ups of lips, eyes, cats' noses, insects, birds. There's a theme for more or less abstract macro shots, and we have a theme for strawberries. And also drops of water; taking macro shots of drops of water seems to be very popular. So you can see this theme expanded, and this is a gist cluster for drops of water. This is one of the nice ones with the eyes.

And one thing: it's really hard to evaluate this in any quantitative way; it's hard to say how well we're doing. One thing I can point to is Flickr clusters. Flickr clusters are, I think, computed solely based on correlations between tags; they don't use any visual information. These are the kinds of clusters you get for close-up in the Flickr interface. You can see, of course, there are not a lot of them; I think at most you only ever get four clusters in Flickr. And they're also not very satisfying. Like in the bottom cluster, cat and animal faces are mixed with people's faces. In the top cluster, we get some objects and some insects. So in my opinion the themes we get are kind of more satisfying. Everything is separated out a lot more nicely. And our approach is basically pretty simple. It was our first stab at how to cluster these images. And even just with pLSA clustering, we get, in my opinion, a lot nicer themes than what Flickr gets. So I guess the take-home message isn't really that we're doing so well, it's that it's really, really easy to do better than this.

The final one is love. One surprise here was that we have this cluster for self, me, self-portrait. So I was kind of surprised to see that as a big theme associated with love. I guess people love themselves or, I don't know, they love to take self-portraits. Then we get dogs and babies, which makes a lot of sense. What's interesting is that dogs show up under love and cats showed up under beauty. I don't know if this is just random or it really represents some big statistical bias. And there's a cluster for wedding.
A wedding theme, but the images in the wedding theme are actually not very good, and I'm not really sure why that is. And then there's clouds, sunset, beach. So beach and couples and love appear to be very tightly associated. It makes sense, but I guess I didn't really expect to see such a strong association. And there's a big heart cluster. Obviously that's nice; if there wasn't a big heart cluster, then we would know for sure that this thing had failed miserably. One other thing that was interesting to see in the heart cluster: we get five images that basically are images of a ring laid on top of an open book, where the shadow of the ring is kind of heart-shaped. How many of you have seen these kinds of images? So some of you have. Do you also browse Flickr a lot? Okay. So I think this is something that surprises a lot of people who are unfamiliar with Flickr and these kinds of amateur photo sharing sites, but I guess this is one of these visual cliches that people really love, and you wouldn't know it. So this was something I was pleased to see, because ideally this is the kind of output that I would hope for from this approach. It would help to reveal these visual cliches that are really very prevalent but that regular people might not necessarily know about. So this is what I think would be really useful to reveal in the Flickr interface. And once again, the Flickr clusters for love are not that great. There is a cluster for hearts, but the heart images here are not particularly salient. Although there's a better wedding cluster than what we have. But apart from that, once again, I think we do better in terms of comprehensiveness of themes and so on. Any questions?

>>: So how many -- when you start doing your analysis, how many photos -- I guess it's based on pruning a set of photos, right? How many photos start in the set?

>> Svetlana Lazebnik: So here it's smaller than for the landmark work. Here I think it's about 10 to 15,000 photos per category. I think getting more photos would definitely be helpful for getting more consistent results. So here one of the shortcomings we have is that we get a set of themes that make sense, but it's hard to say: is this comprehensive? Are these all the important themes? And even for the themes we do get, even if they make sense, how important are they? Like this self-portrait theme: it makes sense, but is it just a fluke of our clustering that it happened to reveal this theme, which might not actually be very common? Or is it indeed a very common theme? And by contrast, you would think that weddings would be huge under love, but we don't get such huge clusters for wedding. So the clustering is a bit iffy, and this is something that we really have to work on. By getting more data and implementing a more stable clustering algorithm, we'll be able to say more about why a given theme showed up or didn't show up, and so on. And because this is hard to evaluate, one reasonable way to evaluate it is to do user studies, which would probably be a good idea.

So that pretty much concludes my talk. Does anybody have any questions?

(Applause).

>>: I have two questions; I'll ask them one at a time. One is, from the second half of the talk, it's clear there's an amazing amount of information you can get out of these tags and the semantic information.
But from a pure vision standpoint, it's a little unsatisfying in the first half of the talk that you actually need them to really filter your results. I was wondering, do you have any insight into whether you could get just as good results using vision alone, without the tags?

>> Svetlana Lazebnik: Actually, in our landmark work we use tags for very little. The only thing we use them for is filtering the isolated nodes of the graph, the clusters that didn't match any other clusters geometrically. So we apply the tags in the first part of the work after we have applied all the stronger constraints that we have. And basically the only thing that would happen if we didn't apply that step is that we would end up with more garbage components. Here I can show, for Notre Dame, one example of a garbage component. For Notre Dame we have this component that is geometrically consistent, because they're all images of the same building, but it's obviously not the Notre Dame cathedral in Paris. And tags in this case don't give us enough information, once again, to remove this, because it's tagged with Notre Dame and art, which are both very common for the correct Notre Dame. So basically, if we used no tags at all for this step, we would simply end up with more garbage components like this. So we really don't rely on tags for a whole lot for the landmarks. And for the second part of the work, we rely on tags because semantics are much more important, and because we don't have strong constraints. So basically, when geometry makes sense, whether 2-D appearance similarity or 3-D geometry, it's really good to use it. But here for general categories, we're in a situation where we have very little to go on, so we're forced to rely on tags.

>>: My other question relates to that San Marco example where you couldn't glue the two halves of it together. It's a common problem in feature-based approaches that when you have repetitive structures you can get confused. I can see two approaches to handle this. One of them would be to do more verification; there might be some asymmetry that you could leverage to say those are actually two different views of almost the same thing. The other approach would be to have like a particle filter type approach where you say, well, we have multiple hypotheses and we need to carry them around. Do you have any insight into how you could use either of these approaches?

>> Svetlana Lazebnik: I think you could in fact use both. We've talked about it a little bit. I think being able to generate multiple hypotheses, especially in this tower case -- in the tower case there's really basically a four-fold ambiguity, so it's really not that hard to keep track of it. It may simply be a question of reimplementing your RANSAC so that instead of just spitting out one hypothesis, it spits out all of the consistent hypotheses. So if there's just a four-fold ambiguity, then you can keep track of it pretty easily. But also in that square, in the back of the square where you have these repeating columns, the ambiguity might be a lot bigger. When you have translational ambiguity you might have a lot more than four possibilities, so it could blow up. So it's a question of where you want to stop. And it's also a question, I guess, of careful implementation and so on.
I think just carefully keeping track of the ambiguities and generating multiple hypotheses is, in some cases where there's just a two-fold or four-fold symmetry, all that's needed, but in other cases, as with particle filtering, it might blow up when there's too much symmetry. But, yeah, I think especially in architectural scenes, dealing with symmetry is important. And the first thing you mentioned, obviously getting more matches, trying to get some unique features, I think is also part of the solution. So it could be as simple as just more carefully implementing the SIFT-based matching and extraction, using higher resolution images and so on. I think part of the answer is more careful implementation, and part of the answer is that there are probably issues that we don't even want to touch, when some of these ambiguities get out of hand.

>> Rick Szeliski: So Svetlana is here for today. If someone would like to meet with her, come see me. We'll probably go out and have lunch in the cafeteria if anyone wants to join us. So thanks again.

(Applause)