>> Larry Zitnick: Hi. It's my pleasure to introduce James Hays. He is a student currently at CMU and will be graduating in June. I think I first heard of his work when he started doing some repetitive texture work, but then recognizing it and using it for different tasks such as inpainting and texture synthesis, and then more recently he's been doing a string of work using large image collections. So he has an inpainting work, I think you're talking about today, and then also a work on localizing images using both scene recognition and feature recognition which I think you're also talking about today. So ->> James Hays: Yes. >> Larry Zitnick: I'll hand it over to James. >> James Hays: Fine. Thank you. And I'll talk about a few other things as well. So I'm kind of covering maybe a little too much ground. If I'm going too fast or something, then just stop me and get me to clarify. Hopefully it won't be a problem. So, you know, our visual experience, for us and for computers, is really extraordinarily varied and complex. You know, tons of different scenes and lots of complexity within any scene, and that complexity makes it really hard to understand images and to synthesize images. So -- and for the computer, you know, if you look at an image like this, a snapshot I took, then there's, you know, there's thousands of meaningful image dimensions. There are thousands of things that you could want a computer vision system to answer about this image, from what's the pose of all these articulated figures, to materials, to illumination, to geometry, and most computer vision and also computer graphics techniques kind of get around this huge complexity with a divide and conquer technique; that is, you know, the whole image isn't the representation, some part of the image is, so that's something like sliding windows or part based models or segmenting the image first and then classifying segments. I mean, in the graphics community they obviously tend to render things by building them up from part based geometry as well. So all the work I'm going to talk about is fundamentally different. It's using scene matching instead -- to, you know, whatever, find information about images instead of going kind of into the image. And the one example of that I'm going to show you, of what I mean by scene matching, is scene completion. And this was at SIGGRAPH 2007. So let's say you have an image, a photograph you took or something like that, and you want to remove part of it, or part of it is damaged, or you're doing a 3D reconstruction and you want to get a view of something that you never quite saw in the original views. So for whatever reason you need to hallucinate some data here, fill it in such that it seems plausible, such that a human observer can't tell that anything was ever taken away. There's a lot of good previous work in this area. In fact, some of the people sitting here have done it. The simplest is just diffusion based. Maybe I should ask for the lights to be brought down a little bit actually. That's great. Thank you. So diffusion based methods just propagate color from the boundary of the region, maybe trying to preserve straight lines or something like that. But they're not preserving texture, so for large holes it's not going to work reliably. Then there are texture methods, texture synthesis methods, a lot of them following from Efros and Leung, that explicitly propagate texture from the rest of the image generally.
But since they don't have any semantic knowledge of what the layout of the scene should be, you know, they can run into problems as well, kind of blindly copying texture. It might be seamless and it might preserve textures, but they're just not in the right place. So, you know, this is actually a really difficult problem, to complete an image. It's kind of a vision hard and graphics hard problem. You have to parse the scene and identify all the objects, even partial objects, and then render new objects on top of them with the correct illumination and viewpoint and things like that. So very hard. And what we're doing is trying to skip all of that and instead just find some scene that has as similar a layout as possible, as similar appearance and view parameters and high level semantics, you know, similar enough on all these axes that you can just steal whatever content is in the corresponding place and use computer graphics techniques to blend that in. And in this case, we can get a reasonable result by stealing a roadway from another image. It's not seamless, there's some blurriness, but it kind of got the vanishing point right, and it kind of -- I think you wouldn't notice it at first glance. So the algorithm is pretty simple. The input image is actually an incomplete image. We never see the original one. And from that we build a scene descriptor based on the gist descriptor from Antonio Torralba and Aude Oliva, which is a measure of basically where the edge energy is in the image at different frequencies, different scales, different orientations. And then we have some color on top of that. Anyway, then we search a large database of unlabeled images from the Internet, 2.3 million images in this case, and we find some nearest neighbors. And in this case it works pretty well. These nearest neighbor scenes are actually pretty similar to the input. And for every one of them, we're going to try and find some alignment where they overlap best. They should be roughly aligned already because the scene matching tends to find, you know, similar layouts. But we can maybe improve a little bit. And then at that best offset we'll do a graph cut and Poisson blending to make it seamless. And in the case of this image -- I'm walking too far away from that -- here's one of the scene matches underneath it, so it's overlaid, the incomplete image is overlaid with the scene match, and then after the compositing, that's the result. >>: So was that picture actually taken from the same location as the one that you found? >> James Hays: I don't think so. Actually the hills are very different. Like it's a very different climate, wherever it is. Same camera height, I think, over the water. So, you know, why does this work? We're not doing anything really smart to find similar scenes. If we have a small database of images, say 10,000 or 20,000, which isn't that small really, and we find the nearest neighbors using the same descriptor, this is what they look like. And you know, they're not really that similar scenes. They wouldn't be useful for image completion certainly. But with a database that's two orders of magnitude larger, the scenes start to become similar, at least some of them, and useful for image completion -- so, you know, similar enough that you can use them for photo realistic image editing. So we're kind of getting to a point where we're sampling the space of images densely enough to do some cool stuff with millions of images.
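To make the scene matching step concrete, here is a minimal sketch (Python with NumPy) of the descriptor-plus-nearest-neighbor search described above. The `tiny_gist` function is only a rough stand-in for the real gist descriptor, which uses a bank of oriented filters at several scales; the database arrays are assumed to exist elsewhere, and the alignment, graph cut, and Poisson blending steps are omitted.

```python
import numpy as np

def tiny_gist(img, grid=4, n_orient=4, n_scales=3):
    """Rough stand-in for the gist descriptor: oriented gradient energy
    pooled over a coarse spatial grid at several scales."""
    img = np.asarray(img, dtype=np.float64)
    if img.ndim == 3:                       # RGB -> grayscale
        img = img.mean(axis=2)
    feats = []
    for _ in range(n_scales):
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)              # edge energy
        ang = np.arctan2(gy, gx) % np.pi    # orientation in [0, pi)
        H, W = img.shape
        for i in range(grid):
            for j in range(grid):
                cell = (slice(i * H // grid, (i + 1) * H // grid),
                        slice(j * W // grid, (j + 1) * W // grid))
                hist, _ = np.histogram(ang[cell], bins=n_orient,
                                       range=(0, np.pi), weights=mag[cell])
                feats.append(hist)
        img = img[::2, ::2]                 # crude downsample for the next scale
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-8)

def nearest_scenes(query_desc, db_descs, k=20):
    """Indices of the k database scenes closest to the query in descriptor space."""
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    return np.argsort(dists)[:k]
```

In the actual system the descriptor is computed from the valid part of the incomplete image (with a coarse color term added), and each retrieved scene is then locally aligned, seamed with a graph cut, and Poisson blended into the hole.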
But, if we go back to this image, we're never going to have enough images to be able to, you know, explain this with other -- with an entirely different image. I mean, this is so high dimensional that it's not just a rare image, it's once in the lifetime of the universe. This is never going to occur again. So if you want to know how many percussion instruments there are here or what the pose of the people is, you have to use deeper image understanding and go in there with part based models or whatever you want to use. The idea with our scene matching is just to provide a prior, a much better starting point for all of those deeper image understanding techniques. And there's evidence that humans do exactly this as well, that humans use preattentive global scene features to help with things like object recognition, and further, Oliva says that those preattentive features are memory based. It's not that humans are doing anything exceptionally clever about parsing the scene and helping to determine where objects may be, it's that they have seen similar scenes before and they relate any novel scene to a scene that they have seen before, and thus they know, oh, these objects are likely to occur, or this is probably the correct scale that an object would be at, things like that. >>: Do you have any comments that, you know, when you [inaudible] an image [inaudible] do you have any way of telling or knowing whether there is a match, whether there is an image that is good enough? I mean, is there some sort of threshold? >> James Hays: I have certainly not found anything that's close to universal. >>: Okay. >> James Hays: I mean the -- yeah, I mean the distances of the features aren't really comparable across different scene types. And small differences in distance make a big difference in the perceived scene matching quality in a lot of cases. So I mean, no, I haven't looked into anything really like that yet. So the big picture is, you know, for any -- any image we want to be able to use kind of Internet scale scene matches to learn as much as we can, you know, learn as much about it as we can from that, and then maybe that's a good starting off point for deeper image understanding. So to inspire the next project, I'm going to let Hollywood give the sales pitch a little bit. See if this plays. [video played] >>: [inaudible] and the next one's still a mystery because we weren't able to download all of it. >>: Look. Do me a favor, [inaudible] to where this is. >>: Yeah. >>: Larry and Amita have found a match for your photograph. I don't know the significance, but it's somewhere in the valley. >>: Why, are you kidding, that's Robin's house. >>: What? >>: She's the next target. >> James Hays: So melodramatic when you only look at the clip. But so, yes, this is Numb3rs, you know, a CSI-like thing, where they've invented, you know, this geolocation thing that is actually, you know, what I want to pitch. So I want to try this, you know, given a single image, figure out where it is in the world. I should note that this was accepted for publication before that episode aired. Let's just go through kind of at a high level how humans maybe geolocate images. I think there's different, qualitatively different ways they do it. For images like this that are landmarks, both humans and machines have a pretty easy time saying where this is because it's visually unique and there's a lot of photographs of it or people visit it a lot, so it's not difficult to know that this is Notre Dame Cathedral in Paris.
For an image like this, though, humans have a lot more difficulty. They can probably tell you something about where this is in a generic sense, like this is the Mediterranean maybe, or southern Europe, or some people say South America. For machines it's possible that maybe, maybe instance level recognition could get this, maybe with enough street level views and the right matching and geometric verification, that there's enough discriminative stuff here going on that you could tell unambiguously where this is. But at some threshold the world becomes too dynamic or noisy for this kind of instance level recognition to work. Even if you had all of the coastline views of the world, I don't think, you know, matching SIFT features is going to help you unambiguously geolocate this, especially if the data set is slightly out of date, because, you know, this is dynamic, the water changes, the beach changes, a hurricane hits and the tree changes. But a human could still tell you, oh, this is kind of tropical and lush, it looks like a warm environment, it's on the water obviously. So even if you can't geolocate it unambiguously, you can at least narrow it down to a pretty small geographic area. So we're going to use scene matching. And so our data from the Internet is 6 and a half million geolocated images off of Flickr. And this is just the distribution of them, and this is in log scale, so it's really quite a bit peakier than it looks here. Some of the regions are pretty empty, like the middle of the Australian outback or Siberia. 110,000 different photographers. So it's an average of about 250 per photographer. And this is what a sample of these images looks like. This is from our test set actually, which is from the same distribution. Some of them are landmarks like Sagrada Familia. Some of them are generic indoor images. A lot of them are pretty difficult to geolocate, and some of them are kind of dynamic scenes like this glacier that are probably going to not exist in five minutes. And the features we're going to use, we're going to use the same as the scene completion paper that I just talked about. And I'm going to try and add some other ones. This is kind of a good task to quantitatively evaluate these different scene matching features: color histograms and texture histograms, and also histograms of straight line features, like where straight lines are facing and how long they are. And then we also tried geometric context, which is the probability of each segment being ground, sky, or vertical, and we also tried tiny images, like Antonio Torralba's, except even tinier. And those got left out because they just weren't useful in conjunction with the other features. So in the end I only used those four. So I'm going kind of fast. Sorry. So here's an example of what the scene matches look like. Very nice. Architecturally, they're very well correlated. Viewpoint is very well correlated. But also the location is correlated. At least among these scene matches you can see that they're mostly in Europe. If we plot these on the globe we can see that this is pretty different from the prior distribution. There's almost no matches in the Americas and there were no matches in Asia. It's all centered in Europe. It's not doing a great job of telling where in Europe exactly it is. The ground truth is it's here in Barcelona, and it has some scene matches there, but it has them in other European cities as well. So it's at least giving you some signal.
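As a rough illustration of geolocation by scene matching, here is a minimal single-feature sketch. The real system combines several descriptors (gist plus color, color and texture histograms, line statistics) and, in effect, looks for the densest cluster of matched geotags; the crude density vote below is a stand-in for that, and all of the array names are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; inputs in degrees, arrays OK."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp = np.radians(np.subtract(lat2, lat1))
    dl = np.radians(np.subtract(lon2, lon1))
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def geolocate(query_desc, db_descs, db_latlon, k=120, radius_km=200.0):
    """Take the k nearest neighbors in descriptor space, then return the
    neighbor whose geotag has the most other neighbors within radius_km
    (a crude stand-in for mean shift over the matched locations)."""
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    nn = np.argsort(dists)[:k]
    lats, lons = db_latlon[nn, 0], db_latlon[nn, 1]
    support = [np.sum(haversine_km(lats[i], lons[i], lats, lons) < radius_km)
               for i in range(len(nn))]
    best = int(np.argmax(support))
    return float(lats[best]), float(lons[best])
```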
For the landmark images our features aren't really suited to instance level recognition because they don't have any viewpoint invariance, or not much at least, but there's so many images of something like this that you can still unambiguously tell that, oh, the biggest cluster of images is right there in Paris. And then for this more generic image, the scene matches I think look reasonable, but geographically it's very distributed: Hawaii, Thailand, Brazil, Mexico, New Zealand, Philippines. And I think they are mostly plausible, but geographically they're kind of distributed all around water and the equator. The correct answer happens to be Thailand. And in this case, we do happen to have the most matches there. But it's kind of just a little lucky, I think. And then one more result. This is somewhere in the American southwest. And I like it because the scene matches -- it does a very good job, but none of them are actually matching this exact viewpoint or this landmark. They're finding some other landmarks that are similar to it actually, but it's, you know, it's kind of identifying that there's something -- something geographically discriminative about the texture, the rock formations. So it's pretty well peaked in the American southwest there. So quantitatively we can say that for about 16 percent of that test set when using the full data set, about 16 percent of them, the first nearest neighbor was within 200 kilometers, so it's kind of within the nearest major airport or something. And as we decimate the data set, we can see that the performance goes down to chance, which is one percent in this problem. So 16 percent is not great in absolute terms, but it's a pretty hard test set. And I thought it might actually be pretty difficult for a human to do this, so I've done a small pilot study where I asked people, you know, here's the test set, tell me where each of these images is, and this is just -- these are different human subjects, their performance compared to im2gps. There's a large variance. It depends a lot on people's exposure to international travel and things like that. Some people, you know, really don't recognize any of the landmarks, can't reason about some of the subtle things in the images, and don't do well. But some people do manage to beat the computational baseline. >>: In your experience, the 16 percent, are they landmarks or, you know, are the ones that ->> James Hays: They're half landmarks and then they're half, you know, other things. >>: Is it just -- it's kind of like, since most, you know, desert photos that exist on Flickr are southwest photos, you have a tendency to kind of peak around the southwest and it picks that, and it's not exactly a landmark? >> James Hays: Yeah. I mean, it's something geographically discriminative. But I think you could find another type of desert image that would not match well to the American southwest. So it's not -- I think it's not just at the level of classifying it to a scene type like desert or mountain. It does a little better than that. But it could be other things. It can be -- you know, there's pictures of the Greek islands I'll show later where the white -- you know, all the white buildings and the rocky ground is pretty geographically discriminative, I guess, even without landmark matching you can tell that that's Greece.
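Here is a sketch of the evaluation just described: how often the first nearest neighbor lands within 200 km of the ground truth, as the database is decimated toward the one percent chance level. Everything here (function and array names, the decimation fractions) is hypothetical.

```python
import numpy as np

def geo_error_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def accuracy_within_200km(test_descs, test_latlon, db_descs, db_latlon,
                          fractions=(1.0, 0.1, 0.01), seed=0):
    """For each database fraction, the share of test images whose first
    nearest neighbor is geolocated within 200 km of the truth."""
    rng = np.random.default_rng(seed)
    results = {}
    for frac in fractions:
        keep = rng.random(len(db_descs)) < frac       # decimate the database
        descs, locs = db_descs[keep], db_latlon[keep]
        hits = 0
        for q, (qlat, qlon) in zip(test_descs, test_latlon):
            j = int(np.argmin(np.linalg.norm(descs - q, axis=1)))
            if geo_error_km(qlat, qlon, locs[j, 0], locs[j, 1]) < 200.0:
                hits += 1
        results[frac] = hits / len(test_descs)
    return results
```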
So it's kind of interesting to look at the images on which the humans did maximally better than the machine, and I mean a lot of the reasons are pretty obvious, the humans can read the text and then know where that is roughly. And then here I don't think it's the text, but if people know that these are, you know, London cabs, then it makes that pretty easy. And then also interesting is the other way around, the images for which im2gps did maximally better than the humans. So in this case, it's not because the humans don't understand the scenes obviously, it's because, well, they know that that's probably Africa. They don't know where people actually go on safaris in Africa. And here, I don't know, does anybody know where that is? >>: [inaudible]. >>: Turkey. >> James Hays: Turkey. Okay. Well done. Yeah. And this one is in Kenya. Almost everybody goes on safari. >>: [inaudible]. >>: It looks like [inaudible]. >> James Hays: And the one in Kenya -- almost everybody goes on safari in Kenya or South Africa, so it's actually not that hard to guess. So, okay, so for most of the images the geolocation is not really accurate. Here's another set of scene matches, and the scene matches are pretty good, but geographically they're all over the place. In fact, this image is in Morocco and really almost none of the scene matches are there. But this can still be informative about the original image. If we have, for instance, an additional source of geographic information like a map, and we sample that at the locations where we thought this image might be, then we can get an explicit estimate of whatever geographic property. In this case, I'm showing how hilly it is. It's the gradient of elevation. And we can say, oh, this is how hilly this image is, we think. Remember, the image is unlabeled originally. We're just going through geolocation as an intermediary to get to whatever other metadata. We can do that for the whole test set. And this is what it looks like if you sort all of them by how hilly you think the images are. And I think it does a pretty good job -- you get the mountains at the top, and then these are the bottom ones. They're mostly urban images because a lot of big cities are in flat areas. And I mean, you can do that for any map. If you have a map of land cover, then you can kind of do scene classification that way. Here are the most forested images. That means that they were geolocated to be in the most forested regions. These are mountains, but they are forested. That's correct. >>: [inaudible] the images [inaudible]. >> James Hays: [inaudible]. >>: [inaudible]. >> James Hays: I guess. I mean, that's just the proportion of images people take. All right. So in summary, this data driven scene matching, I think it makes kind of an interesting baseline for this task. And I think there's a lot of room for improvement. And even if we can't geolocate an image accurately, it can still tell you something about the image, just because the GPS gives you access to a lot of metadata basically. So this is a -- now, moving into a work inspired by im2gps, maybe kind of an extension again. So here's an image which I think is very generic. Somebody can impress me again and tell me where this actually is. It's meant to be confusing. >>: Paris? >> James Hays: No. Okay. So nobody knows. Good. It's in -- what if I show you this image? Everybody should know where that is more or less. It's in central London.
What if I tell you -- show you these images together with timestamps on them, you know, and I tell you they were taken by the same photographer? Then you can reason that, okay, these are only 45 minutes apart and this one's in central London and you can't really get that far away, so, okay, that one is probably in London, too. And if I add another image, okay, so this one you can tell it's not in London, but it's a day after the other one, so if you're really trucking it, you could be almost anywhere on earth. It could be, you know, Rio, or it could be Barcelona. So to help answer this, you might want to know, you know, what are the likely locations that somebody is going to go to from London. So this work, image sequence geolocation, which is with all of [inaudible] -- and I should give them credit for doing most of the work, it's their idea -- this work aims to take, you know, just a folder of photos from your vacation, say, okay, this is my vacation or this is my entire photo collection, and, you know, they may or may not be from the geographically same place. In this case, there's some from China, and then we move to London, and then the Greek mainland and then the Greek islands. And you know, you want to geolocate all of these. So the inputs are images and timestamps. And timestamps aren't really trustworthy because maybe you didn't change your camera time to the right time zone, or maybe you never set the time in the first place, or the battery died or whatever. But the difference between pairs of timestamps should be accurate for the same camera, even if the absolute time isn't. And this is really all we need. So what we're going to use is the time differentials between pairs of photos, and the goal is to, you know, geolocate all of these. You can describe this with a hidden Markov model. So what we're interested in is the location of the photos, and what we get to observe are the images taken from those locations and the differentials in time between the pairs. And this relationship is kind of explained by im2gps. I'm not going to go into that. And the relationship between locations and images. But the interesting new thing about this is this, which is, you know, how likely are you to move from one location to another in a fixed amount of time? How likely are you to go [inaudible]. >>: [inaudible] when you showed us that, which you know are the top and bottom rows, that's during the inference or test phase, right, during the training phase you know the locations as well? >> James Hays: Right. That's true. And I don't think we have necessarily an explicit -- we don't have a training procedure that takes advantage of that right now necessarily -- well ->>: [inaudible] tags in your training? >> James Hays: Well, of course we do. When we learn this movement model we do. Right. >>: So I mean that there's nothing surprising there, right, I just want to be clear that the missing data, the hidden data, is the middle row, but in the training phase that's what you know, that's ->> James Hays: Exactly. Yes. So, yeah, again what this relationship is kind of encapsulating is, okay, I'm in London, what's the chance that I'm in Barcelona in six hours, something like that. That probability. And there's a lot of related work on human movement from other fields.
There's some publications in Nature recently where they, for example, track dollar bill serial numbers, you know, and use that to infer where people have moved around the country, or in this case they're tracking people by cell phones. Neither of these covers international travel. And what they're kind of more interested in is coming up with a compact parametric description of how people move. There's -- you know, they hypothesize that people follow this Lévy flight distribution, where they have -- it's a power law distribution where they have lots of short trips but the occasional really long trip. It's a heavy tailed distribution. But I don't think that simple parametric models are going to be necessarily that useful because, you know, human movement patterns are really complex. Your movement will look different if you're in Jacksonville than if you're in New York. And we're going to try to learn it from Flickr data, and more recently this group from MIT, they're kind of an urban planning group, has done the same thing, where they take Flickr geotag data and look at how people move around a city. And we're going to do the same thing on a global scale, you know, basically to say, okay, you know how many photographers move from Sydney to Hawaii in X hours. So we've discretized all of the locations of the globe and we've discretized time, also. And this is the type of model that we learn, or this is a visualization of the model, that is to say. If you're in Honolulu and zero to five minutes have elapsed, this is the probability of where you're going to be next. So it's pretty stationary, right? 30 to 45 minutes? An hour or two hours? You're still in Hawaii. Or you're in an airplane not taking pictures. Six to eight hours, a couple of people have managed to land and take photos of the airport or something. And then as you extend time over longer time periods, it starts to resemble more the prior distribution of the data. In general, it is more stationary than I expected; even after 60 days, you know, a lot of people are still in Hawaii. Very lucky people. So using this model and im2gps, we can, you know, find the maximum likelihood path over the globe for a sequence of photos. And I'll just show some preliminary results for that. >>: [inaudible]. So you go to Flickr, you download, you know, all the photos of some person, then you assume it's the same camera, you have the timestamps, and so you assume, okay, this is evidence of where that person has been, right? >> James Hays: The only restriction is that it's the same person. We don't look at the camera or anything. We assume that a single Flickr user has taken all the photographs that he's posted. >>: So you [inaudible]. Okay. So okay. >>: The photos are [inaudible] are you using the ->> James Hays: We're using the real geotags to train the movement models and, yeah, given a new sequence we're using im2gps to get a distribution like this, coupling that with those movement models to help refine them. >>: It doesn't really matter that it's photos, it's just sort of checking with the GPS tag, right? >> James Hays: Sorry? >>: It's just, you're just using the -- location, you're not [inaudible]. >> James Hays: Well, I mean, once im2gps has given you a geolocation distribution like this, then, yes. Are we talking about training phase or testing phase? >>: I guess -- this is just using GPS, right? >> James Hays: Yes, this is just GPS. This is just the movement model. >>: [inaudible].
>> James Hays: That's just what I talked about just before this. It's just, given a single image, giving a geolocation. >>: [inaudible]. >> James Hays: Oh, well, given a novel sequence, I mean, you have to start with kind of the unary, whatever, probabilities of where you think this image is. In this case, it's a very bad estimate because it didn't do a very good job. But, you know, im2gps provides this distribution, I go on to the next image, and then im2gps provides this distribution for the next image in the sequence. And then with the movement model you're going to help refine those. Find a maximum likelihood path. All right. So here's these results. You know, it didn't do very well on this. The scene matches aren't actually that bad, it's just finding Chinatowns in other cities. It's actually in Beijing. But, okay, this is the -- the posterior, whatever. After inference it's kind of the probability that it assigns to each location. And now the maximum likelihood is here in Beijing, and that's because the other images that were in Beijing that were close to it temporally were kind of more landmarks, and this one is more ambiguous than those. And also some confusion with Hong Kong. The other one was a landmark. And moving on through the sequence. Okay. This is after inference. Moving on through the sequence, this is the im2gps, you know, just the single image geolocation estimate for this one, so it's I guess the best one in the sequence, the most landmark-y one -- it's the Tate Modern in London. And then that helps -- okay. Sorry. This is after inference. It knows it's in London, not surprisingly. This is the next image, the generic one that I showed you. And again, I mean, the distribution here is just basically random. It's the prior of the data set. But since the previous one was in London, it knows that that's where it actually is. And since the next one is kind of a landmark as well, although there's some confusion with New York, then, you know, it knows it's still in London. Though there's some confusion. Anyway, moving on to this one. So this picture was in Athens. im2gps does a very bad job. It confuses it with Italy and Barcelona. And for that reason, you know, the inference isn't going to move it to Greece really. It just leaves it in London and then decides that the next photo now is in Greece. This is after the inference. There was another day that passed between these pairs of photos, so the temporal constraints weren't helping that much. Then it has more pictures here in the Greek mainland, and then they move to the Greek Isles. And it did actually make the move there. So anyway, out of this sequence, all of them are geolocated accurately except for that one. But it's kind of brittle right now. If I take these landmark images out, then all of these are wrong. It thinks you went to Beijing and then New York and then Greece and then the Greek Isles. So it's nothing -- it's pretty intuitive. It's using the landmarks to help support the more ambiguous images that are temporally nearby. Any other questions on this? All right. >>: I'm trying to get a sense for how helpful it is to have the training data rather than just a kind of generic [inaudible]. >> James Hays: Yeah. I thought the training data would be a lot more useful, but then actually playing with it I'm not sure how much more useful it is. It's something we'll have to quantify when we actually publish it, I think. Yeah. I mean, I thought -- yeah. I don't know. That's a good -- very good question. It's not clear to me either.
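To make the sequence inference concrete, here is a minimal Viterbi sketch over discretized location cells: the per-photo unary terms come from single-image geolocation, and the transition table used at each step is chosen by the elapsed time between consecutive photos. The discretization, smoothing, and learned tables in the actual system differ; every name below is an assumption.

```python
import numpy as np

def viterbi_geolocate(unary_log, time_gaps_hr, trans_log_by_bin, gap_bin_edges_hr):
    """Maximum likelihood sequence of location cells for a photo stream.

    unary_log:         (T, L) log-probabilities over L cells per photo,
                       e.g. from single-image scene-matching geolocation.
    time_gaps_hr:      T-1 elapsed times between consecutive photos.
    trans_log_by_bin:  (B, L, L) log transition matrices, one per time bin,
                       learned by counting how geotagged Flickr photographers
                       move between cells in that much time.
    gap_bin_edges_hr:  B upper edges of the elapsed-time bins.
    """
    T, L = unary_log.shape
    score = unary_log[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        b = min(int(np.searchsorted(gap_bin_edges_hr, time_gaps_hr[t - 1])),
                len(gap_bin_edges_hr) - 1)
        cand = score[:, None] + trans_log_by_bin[b]     # (from_cell, to_cell)
        back[t] = np.argmax(cand, axis=0)               # best predecessor per cell
        score = cand[back[t], np.arange(L)] + unary_log[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):                       # backtrack
        path[t - 1] = back[t, path[t]]
    return path
```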
>>: It's also possible that there's a factored model that captures a lot of this right here. When you looked at Honolulu, it wasn't just radial distance, it was sort of radial distance multiplied by population density, right? >> James Hays: Right. It's multiplied by the prior. >>: By the prior, right. So if you're going to test a -- if you're going to test this sort of data driven model versus a parametric model, that would be the natural one to ->> James Hays: I agree. I agree. >>: But then, you know, you could also make up other hypotheses, like, say, well, maybe nationality makes a difference because Chinese people [inaudible] ->> James Hays: And that's what I was thinking, it would be more subtle things like that. I'm not sure that we have enough data for those to really emerge. >>: Cameras recorded [inaudible] so people actually did check the time zone. >> James Hays: Yeah, I mean if they were explicitly changing the time zone and not just the time, then that would be super informative, right, I mean, that would be a huge constraint. If they told you which twenty-fourth of the globe you're on. >>: [inaudible]. >> James Hays: What's the distinction? It's not [inaudible] if they're in a different time zone or time actually elapsed in the same time zone. Okay. So this recent stuff has been about finding geolocations for images, and some people have given me a hard time saying this is not computer vision. Jean Ponce was very critical of this. So I wanted to show that these geographic estimates are useful for something, some pure computer vision task like object detection. Working with Derek and Santosh -- Derek Hoiem is a professor at UIUC now -- and again, they did most of the work here. I'm just going to talk about what part I did in this. And the idea is there's been a lot of investigation of context in computer vision, using context to improve, say, object detection. But the best performers in the Pascal challenge are all context free, things like -- or they were last year, at least. All these part based models from Pedro Felzenszwalb and David McAllester and Deva Ramanan work really well and don't use any context. No scene level context, no object to object context. And we thought, well, if we can, you know, add some scene level context on top of this, surely it will improve the performance. So that's exactly what we're going to do, we're going to take their base detector exactly as is, not even retrain it, and get candidate detections and then use context cues, you know, from what we know about the scene to say, you know, this is likely to be a cat, this is likely to be a cat, this is not likely to be a cat. And then there's also another part of it, which I'm not going to talk about, which is then resegmenting based on those potential bounding boxes to improve the bounding boxes. And that ends up helping a lot. I'm not going to talk about that. So here's the top five cat detections from their detector, their local detector. This is kind of unfair, because the detector actually does a pretty good job in general, but for cats it's just not so good. So these are confusions that shouldn't happen because the context is not even reasonable. Like, you know, a cat can't live there. You should know that's not a cat detection. So the type of context we're going to focus on is, given an image, what is the probability that it contains a cat or a chair, a dining table or whatever object.
And to do this, we're going to use features similar to the ones I already talked about, scene level features like the gist descriptor or the geometric context. We're also going to -- and we're going to try and learn the relationship between these features and object occurrence not using scene matching, just directly training from the Pascal training data, which is about 4,000 images, and train to learn, oh, this type of gist descriptor means cats are present. And then at the same time what I did was use scene descriptors to go out on the net, you know, find similar images, find the most similar matches, and then use the properties of those scene matches to try and say whether or not there's a cat present. So I've kind of got this baseline scene level context to compete against. So the type of context that I'm getting: say this is an input image and say these are some of the scene matches. Not great in this case, but the metadata associated with them, like, you know, say squirrel or lion, could be useful in saying whether there's a cat there. And also the geographic properties. These are all geotagged, so we know exactly, like, is this grassland or is this urban and high population, things like that. And we can hypothesize that maybe that correlates with object occurrence. Maybe pedestrians are more likely in high population density, things like that. We don't use the geolocation exactly explicitly, just the geographic properties derived from it. So the actual features we build from this are a histogram of keyword categories, saying like, okay, in the scene matches to that query I had four animal keywords, and that means it's likely to contain a cat. And then we also have geographic features, that is -- oh, the scene matches had this population density on average, so -- and then maybe in this case that's not very informative about whether or not there's a cat there. This data is of course extremely noisy. I'm relying on people from the Internet to have annotated their images, and they don't use the same vocabulary. The scene matches only average one keyword per image, so it's not densely labeled like the Pascal training set was, which was perfectly labeled. All the objects are identified. And what I just talked about goes into a final regression to rescore bounding boxes, along with some other types of context that I'm not going to talk about, as well as the local detector score. So that is to say, you know, for every -- we're rescoring every little box that the original detector found. And after we do that, here are the five most confident cat detections instead of these. And okay. So we do a little better. There's two cats instead of one. But the confusions are a lot more reasonable. They're in settings where it could have been a cat. But this confusion happens a lot. And in fact Pascal is intentionally constructed such that you can't use context alone to solve it. Here's the precision recall of the cat class; when we've added context we get a big jump versus the baseline detector. Overall, it's a lot more of an incremental increase. In fact, there's not much of an increase here in this low recall regime, only kind of when we maybe get the more challenging ones. I think this is kind of consistent with the work that Larry has done that says maybe the context is really only necessary when you get to kind of the more borderline cases. But if you have a high resolution object, then, you know, the appearance alone is definitely unambiguous.
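As a sketch of the scene-to-object context idea, here is one way to turn the scene matches into image-level features and use them to rescore candidate detections. The metadata field names (`keyword_cats`, `pop_density`, `forest_frac`) and the choice of logistic regression are illustrative assumptions, not the actual system, which trains its own regression over more context cues and also rescores via segmentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

KEYWORD_CATS = ['animal', 'vehicle', 'person', 'indoor', 'nature', 'urban']

def scene_context_features(scene_matches):
    """Image-level context from retrieved scene matches. Each match is a dict
    with a set of keyword categories and geographic properties sampled at
    its geotag (hypothetical field names)."""
    kw = np.array([sum(c in m['keyword_cats'] for m in scene_matches)
                   for c in KEYWORD_CATS], dtype=float)
    kw /= max(kw.sum(), 1.0)                       # histogram of keyword categories
    geo = np.array([np.mean([m['pop_density'] for m in scene_matches]),
                    np.mean([m['forest_frac'] for m in scene_matches])])
    return np.concatenate([kw, geo])

def train_rescorer(det_scores, ctx_feats, labels):
    """Rescore candidate detections: each row pairs a local detector score
    with the context features of the image it came from; labels say whether
    that detection was correct."""
    X = np.column_stack([det_scores, ctx_feats])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```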
And here's some more quantitative results. So our goal was to improve this baseline detector. And if you just look at the bottom row, average precision, we go from 18.2 to 22 overall. So that's a pretty significant increase. But most of that increase is actually coming from fixing the bounding boxes using segmentation. The scene context only got us 1.2 percent, which is a little disappointing. Larry said that this was kind of a negative result, which I kind of sort of agree with. I mean, we could have maybe done it better, and we're only using one type of context, scene to object context. There's other types of context like object-to-object context. But that was a little bit disappointing to me. >>: [inaudible] so bicycle performs so much better. >> James Hays: I think it's hard to read too much into these. Average precision is kind of a -- if you switch the very first detection from being correct to incorrect, the average precision can jump a lot, so it's kind of a noisy measure, actually. I don't want to read too much into these. There's not a whole lot -- there's not a whole lot of consistency in what we do better on. Is it indoor or outdoor, is it animal versus not animal? I mean, we kind of do just slightly better on a mix of things. My contextual features are I think worse for indoor images because the geography is more ambiguous there. Here's the relative weight that the regression assigned to each of the scene level features. And I'm not going to -- I'm just going to block those out because those are -- those aren't dealing with object presence, they're dealing with object size and location. So these are the ones that are determining whether or not an object is present in an image. Here's an image, what's the likelihood there's a cat in there, something like that. And this is the relative weight assigned. So the presence column, that's referring to the gist descriptor and the geometric context as kind of trained within the Pascal data set alone. And then semantic context refers to the keywords from the scene matches, and geographic context refers to the geographic features from the scene matches. So these two really dominate the scene features that are learned from within the Pascal data set, which is somewhat surprising. I'm happy about that, because it's the data set -- learning within Pascal makes a lot of sense because the labeling is very clean. You can trust the labeling completely. And if you go up to the Internet, you have so much noise to deal with, but the scene matching is so much better that it actually makes it worthwhile. You can really trust that the scenes are similar and that the properties, you know, are more likely to be shared. >>: Pascal [inaudible] within object categories, right? >> James Hays: Oh, yes. It's a very diverse set of images. >>: Yes. It's really hard to do scene level ->> James Hays: Yes, exactly. >>: Just from those images. >> James Hays: Right. I agree. And we entered this in the VOC 2008 challenge, and we got put in our own little special box because we're the only people to train with outside data. And we're the only -- I think we're the first to ever enter this competition that way. But it's okay. We can compare to these people who are doing the same task but only training within the Pascal data. The external data we used was only my scene matching stuff. And also because we used the 2007 version of Pedro's [inaudible], but that's not really an advantage.
So we did the best in six classes, second best in six classes, so we're competitive with the best of them. So, yeah, I mean that's not bad. Any questions on this before I move on? All right. So changing gears a little, some ongoing work I'm doing right now is high interest photo selection. I was hoping Yan Ke would be here because this kind of relates to his work a lot. His past work. So my motivation is that computer graphics does a lot with post processing images. I mean, that's a justification for a lot of the imaging track of SIGGRAPH. We want to fix this image in one way or another. But I think that a photographer creates a great photograph mostly at capture time and not as a post process. I mean, there's a lot of things you can do at post process to help, but, you know, here are some examples of good photos. And things like this don't happen because you post processed them correctly, they happened because you were at the right place at the right time and you were taking pictures, a lot of pictures. So I think we should really be going crazy at capture time as photographers. I mean, an iPhone holds 20,000 photos, digital cameras are really fast, memory is really cheap. But we don't do this. I don't do this because organization becomes such a hassle if you have that many photos. On my SLR, the click counter is up to 50,000, but I haven't posted photos in two years, just because it's so hard to go through all the photos and pick them out and things like that. So a little help, maybe suggesting that, you know, these might be the nicest photos to share, I think would go a long way. And this is a quote from a professional photographer saying if you get one good shot on a roll of 36 -- so this is from the age of film -- you are doing good. When you edit ruthlessly like that, people think you're better than you are, a better photographer than you are. So this is kind of saying the key to photography is throwing away most of your photographs, which I also kind of agree with. You have to be very selective. So the goal of this project is, given some collection of photos, whatever, your vacation photos, to automatically suggest that, you know, okay, these dozen or so seem, you know, really good, maybe you could share these. And then you can, you know, throw them up on Facebook without thinking too much. And this is an application that requires, or at least rewards, high precision but tolerates low recall. We're only grabbing a few of the interesting photos out of the larger corpus. Maybe it's okay if we miss some interesting ones and leave them unchosen, as long as the ones that we do pick are good. So what makes a photo good? That's very difficult. There's not a single objective measure. People could argue about it, and there's photography rules that you can reference, but they're scene dependent and they aren't set in stone. For example, you can go to a website trying to learn to be a better photographer and it will say the four rules of composition for landscape photography. So it's scene specific. And these rules might not apply to portrait photography or macro photography or architectural photography. In fact, framing images, I think that's probably a great piece of advice for landscape images, but for other scenes it could make them look kind of cluttered. And any guide book like this is going to say, you know, the rules are meant to be broken anyway. It's kind of nice to know these, but they're not set in stone.
So I think a nice way to try and evaluate how interesting a photo is, or how good a photo is, is just to find similar scenes and say, okay, what do people think of those scenes? They've been up on the Internet. How many times do people click on them, and things like that. So we're going to do that using the im2gps database, which was six and a half million images. They happen to be geolocated, but we're not using that here. And then I added 600,000 more photos, which are the most interesting photos on Flickr. That's the 500 most interesting for every day of the last four years. And then the pieces of metadata that might be relevant for determining whether a query is interesting: for each of these Flickr photos we have an interestingness rank, which is something of the form, you know, it was this rank out of this query set. We don't have like a real valued number for interestingness unfortunately. Flickr hides that. And then we also have the number of times it's been viewed and the number of tags, number of words in the title. Those might be correlated with increased quality, if people are more willing to label it. And, you know, for a query, the features we're going to assign to it are just, you know, how interesting are your scene matches, how many views did your scene matches have, you know, what's the number of tags in your scene matches? It's a little more complicated than that, but that's basically it. And there's also how far away you are from your scene matches. So these are all scene matching features, and then as kind of a baseline we're going to add to these the features that Yan, who is here, developed for photo quality assessment, which is probably a pretty similar task. And I'd say these are kind of, you know, lower level things. Does the image take up the entire dynamic range, is it well saturated, does it not have too many hues -- a narrow band of hues is actually positively correlated with quality. Things like that. And I'm going to define a test set from the Internet data. I'll just kind of skip over how I'm doing that. But here's an example of positive images. So these are images that should be determined to be high quality or interesting. And I think, okay, that's pretty reasonable. One of the problems that we run into a lot -- this is what's going on here -- is that the high quality photos often have a lot of post processing, and that's not useful for this task because we're trying to look at photos right off the camera, for instance. And we don't want to learn that post processing is good, because none of them will have post processing. But maybe they're post processing it in a way that, you know, a camera could produce. Not in this case. And here's the bad images. These images all have zero views. Nobody on the Internet ever clicked on these Flickr images, which means that by building this data set I actually -- it's a destructive process. These images no longer have zero views. So this is a high quality -- I mean I think I like this image. I think some of them are fine. Maybe this one's okay, too. Maybe a little too closely cropped. I don't know. I mean, this is a problem. There's a lot of reasons an image could have no views. They're all publicly available, but maybe this person just doesn't have a lot of friends looking at their album. Or maybe they posted a string of redundant photos and nobody bothered to click on the third one that was similar. I don't allow photos from the same photographer to appear multiple times. Sorry, that's one of the problems.
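Here is a minimal sketch of how the scene-match statistics just described could be turned into a feature vector and combined with per-image quality features in a linear SVM. The metadata field names are hypothetical stand-ins for the Flickr-derived quantities (interestingness rank, views, tags, title words, and distance to the matches).

```python
import numpy as np
from sklearn.svm import LinearSVC

def scene_match_features(match_meta, match_dists):
    """Summarize a query's scene matches. `match_meta` is a list of dicts with
    hypothetical fields 'interest_rank', 'views', 'num_tags', 'title_words';
    `match_dists` are the descriptor distances to the matches."""
    def avg(key):
        return float(np.mean([m[key] for m in match_meta]))
    return np.array([avg('interest_rank'),
                     np.log1p(avg('views')),
                     avg('num_tags'),
                     avg('title_words'),
                     float(np.mean(match_dists))])

def train_interestingness_classifier(scene_feats, quality_feats, labels):
    """Concatenate scene-match features with per-image quality features
    (sharpness, saturation, hue spread, ...) and fit a linear SVM."""
    X = np.hstack([scene_feats, quality_feats])
    return LinearSVC(C=1.0, max_iter=10000).fit(X, labels)
```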
So we have a test set and we use just a linear SVM to define weights for those different features I showed you earlier. And here's the results. So this is kind of the baseline, which is actually a very good baseline, because Yan's stuff works very well. So it's using his features. These are just derived from, you know, from the individual image, not using scene matching. And this is the precision recall for the task. These precision recall curves are higher than I thought, because I thought the test set would be very difficult, to separate the positive and negative classes. And it's pretty hard, but they actually do pretty well, especially in this low recall regime. So Yan's features get about 80 percent. Mine are up here. And then if we combine them both, we're getting like 90 percent precision at 20 percent recall, which is ->>: [inaudible] is it the Flickr photos that you're testing on instead of a ->> James Hays: Yes. No, no, no. Oh, yes, in this case, yes. >>: So the images then [inaudible] processing ->> James Hays: Yeah. >>: So do you know if you might be running a post processing detector rather than -- >> James Hays: I think that's part of it, yes. Unfortunately. >>: So remind me what your scene match features do. >> James Hays: So okay. Given a query, it goes and finds scene matches, and then for those scene matches it's looking at, you know, what was the interestingness rank of those scene matches, how many views did those scene matches have, things like that. >>: So you're saying your scene matches are based on trying to find things that are geographically similar? >> James Hays: Well, I mean ->>: Or just similar in layout? >> James Hays: Yeah. I mean, we're using the same features that we used for the geographic task, that's true, but this is just kind of another axis that we expect the scenes to be correlated on. The matching scenes are geographically correlated; they're also quality correlated, it turns out. Not surprising, really. Yeah, I mean we should -- yes. So you're talking about the features used in the scene matching, and those were the same as im2gps, right? >>: I'm just trying to get an intuitive understanding for what's happening, like with Yan's you just reminded us that it's brightness and blur and things like that, color and [inaudible], but for yours you're using [inaudible] and some color histograms and ->> James Hays: Texture histograms. >>: But are you using those features or are you actually going out and matching ->> James Hays: Going out and matching. And the features that go into this are the features derived from the scene matches. >>: So you go out, you find the closest matching images, and then those have all been tagged by [inaudible] or ->> James Hays: Well. >>: Or a small subset? >> James Hays: They all have the metadata. It's not something that -- it's generally not something that users are explicitly tagging. Like the interestingness, Flickr doesn't say exactly how it's calculated. It's proprietary. But it's derived from the viewing habits of people. >>: Okay. So you're then training a regressor which will be based on those things like number of times viewed, [inaudible] how interesting the ->> James Hays: Query, yes. >>: The query. Right. Well, the query image generates a couple of matches and then you're trying to find it or score ->> James Hays: Yes, score the matches first.
>>: Learning something [inaudible] match statistics, use statistics to interesting [inaudible] and that training is based on the ones which were actually -- that 600,000 subset, right, where you know the [inaudible] >> James Hays: The training process happens after the scene matches are collected. >>: Okay. So where are the labels that go into the training process? >> James Hays: So this is from the, whatever, could be the training or the -- so what are the labeled examples that ->>: So these were the things you thought were interesting based on what, on ->> James Hays: Well, so I tried to pick out a test set. >>: Oh, okay. >> James Hays: You know, some that had a lot of views and some that had no views, and some other properties. >>: Okay. >> James Hays: Anyway. So there's the performance. And if we look at the -- what the SVM, how it's actually weighting the features in the combined classifier, you can see it's about half Yan's features and about half my scene matching features. And I mean, so things like the number of tags turned out not to be useful, or might have already been encapsulated in the Flickr interestingness score. Distance -- increased distance to your scene matches is positively correlated with increased photo quality. I guess maybe that's capturing something like, if it's a unique photo, that's better. So to make this an actual system, like you give it to a user and they pull photos out of a collection, one last thing you need is to suppress redundant photos. If they're really taking a lot of photos of the same thing, we do that just by clustering and taking only the best image from each cluster. So here's some qualitative results. These are Alyosha's pictures from Paris, and this is a random sample of them. So it's kind of a lot of urban architecture photography. Also, here's some more samples. A lot of portraits. And I think all of the computer vision community is in here at some point. So here are the most interesting photos that it finds. And some of them I think are okay, like the black and white ones. Does anybody know who that is? >>: [inaudible]. >> James Hays: Yes. How did you know? >>: [inaudible]. >> James Hays: The vegetarian pride thing. Yes. But these are definitely not good photos. So I think it's kind of capturing the, oh, they're really saturated, or, oh, they look like they've been post processed to some degree. Here's the next set of 12, the next 12 most interesting ones. I think actually some of them are better here. I like most of these, I think. I mean, this is a little pedestrian and maybe that's a little cluttered, but, you know, pretty good, I think. I mean, different people might have different takes on it. Here's the least interesting photos. Again, they're all gray. So you can see that saturation plays a significant role. The next 12 least interesting. So one last thing I'll show is that if you have an interestingness assessment that's working well, which I'm not sure that I do, then you can use it as an oracle for some other image manipulation, whatever it is. I'm just going to show it with cropping, that is, here's an original image, let's generate lots of crops, let's pick which crops are the most interesting. And these are the 12 most interesting, and some of them are -- I mean, these aren't good clearly. Some of these are okay. Here's the 12 least interesting. Which, you know, this one is actually maybe okay. I think this task is harder when it's only doing scene matching based on parts of a single image; the scene matches might not change that much.
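A simple sketch of the redundancy suppression mentioned above: greedily keep the highest scored photo and drop anything whose scene descriptor falls too close to something already kept, which amounts to clustering and keeping the best image per cluster. The distance threshold is a made-up parameter.

```python
import numpy as np

def suppress_redundant(descs, scores, dist_thresh=0.5):
    """Visit photos from highest to lowest predicted interestingness and keep
    one only if its descriptor is farther than dist_thresh from every photo
    already kept; returns indices of the kept photos."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    kept = []
    for i in order:
        if all(np.linalg.norm(descs[i] - descs[j]) > dist_thresh for j in kept):
            kept.append(i)
    return kept
```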
>>: [inaudible]. There are whole groups of minimalists in photography, and they might argue differently than you. >> James Hays: Yeah. So I mean there's this whole thing like, if you're a good photographer, this isn't for you, right? If you actually have a mind to get out there, then this isn't for you. But this is based on the consensus of the Internet crowd, as dumb as it might be. But, I mean, you can disagree with what it thinks is interesting for the test set, for example, but you can't really deny that that's the consensus of the crowd for whatever reason. Even though, you know, experts could definitely disagree with this. All right. So this is a work in progress. I was kind of aiming for SIGGRAPH, but the results weren't there. So if anybody has feedback on any of this, it'd be welcome. >>: [inaudible] noticed is certain photographers get a following, no matter what they post they get thousands of views, and I just can't believe some of the things they post get all these positive reactions. >> James Hays: Exactly. >>: And I don't know how to filter that out. I look at their work and, like, everything they post gets hotter and they just seem to go out and try to collect friends. >> James Hays: Yeah, I thought about -- I thought about trying to normalize by photographer, to say that this photographer, you know, all of his stuff seems to be viewed, so maybe you can't trust that. But then maybe you do want to trust that. Maybe he's actually a sincerely good photographer. >>: He has some good stuff. >> James Hays: Right. >>: Establish that. >> James Hays: So it's hard to say. I hear you. >>: Yeah. That's there for a reason; at some point along the way it -- it gets less meaningful. >> James Hays: Yeah. And I thought about going to other data sets as well, like DPChallenge or photo.net, but they have explicit ratings that are pretty reliable, I think. But then you have orders of magnitude less data. So it's not going to look like a scene matching solution at that point. >>: [inaudible] label. >> James Hays: Sure. In fact, [inaudible] it's kind of been this game where I think you're doing some sort of binary forced choice between which photo you like better. But again, the results then are, I think, binary forced choice, like this person thinks this image is better. I think that data would be good to look at. I need to bug Luis about that. Yeah. >>: [inaudible] good for testing. It wouldn't be good for training [inaudible]. >> James Hays: I did actually. I did actually test on there. >>: Okay. And you get similar results? >> James Hays: No, actually, I do worse, and Yan does about the same, so in fact, Yan's features do better. That's what he developed his algorithm on. >>: Yeah. >> James Hays: And I'm slightly worse. And if I combine them, I really only very incrementally improve on his performance. So it's not -- it's not clear. So, yeah. >>: Instead of how many, it might be interesting to see who is giving positive feedback. >> James Hays: Yeah, I don't think I have access to that. >>: [inaudible] better taste than others, so if they're bothering to comment on which one ->> James Hays: Comment. >>: [inaudible] social network. >>: [inaudible] photographer to comment. >> James Hays: Yeah. >>: So you wouldn't go to, you know, things for the average user, right? I mean, you could be [inaudible] artist like [inaudible] right now. [inaudible].
>> James Hays: You want to be the most mainstream, cliche photographer possible? Yeah. >>: [inaudible]. >> James Hays: Yeah. If you don't like the philosophical idea behind this, I completely understand. >>: You know, when you first talked about taking all of the photos and having the system suggest to us when you have too many, I mean, I got very excited. I think it would be wonderful to develop such a system. It's a hard task right now. I don't know if the quality of your results -- you know, is this state of the art useful enough to deploy it today, or is it just not -- you know, it's a very subjective decision, but basically being useful enough means that for your target audience, which is the audience that doesn't have time to go manipulate their photos or select them, people like [inaudible]. Would it be good enough to trust doing [inaudible] to the system -- >> James Hays: Right. >>: [inaudible] suggest and so you don't [inaudible] that includes the clustering suggests that -- you mentioned clustering has to be done. Did you do that [inaudible]? >> James Hays: Yes. Yes. So there are still photos that look pretty redundant, because the clustering wasn't that extreme. >>: [inaudible]. >> James Hays: Exactly. There are two Notre Dame shots, and then there are also two of these shots. I mean -- so there's a different aspect ratio. In this case, there's enough translation such that the descriptors are different. I mean, I tried tuning it a little, and I can get rid of those duplicates -- oh, plus these are duplicates as well. But again, they're from different enough vantage points that they didn't cluster together. >>: [inaudible] another thing you can do is [inaudible] version, you know, you show this to the user, they click [inaudible], redo everything, and you could make it more personalized. >> James Hays: Right. Personalization makes sense, certainly. That would be. >>: Some day I will be just like everybody else. >> James Hays: Yeah. Yeah. I'm not sure how to couple that personalization with the scene matching process. >>: I can see the simplest thing would be to save the ranking with your database, show it to the user, get a reranking, and then you could [inaudible]. >> James Hays: Just go from there. >>: [inaudible], you know. That's the difference. >> James Hays: That's true. >>: The processing issue is interesting, too, because I noticed a lot of [inaudible] Flickr posted image and then it's -- it seems really popular, and then they go post the original. It's like you look at the two and, wow, I didn't even see that. >> James Hays: Really? >>: I would have never thought -- >> James Hays: That would have been very helpful, to be able to couple those. >>: Yeah. Post-process it that way. >> James Hays: Really. >>: It seems, yeah. >>: [inaudible] image potential [inaudible]. >>: It's the one in the lower right, tagged HDR, so it looks very typical of the kind of extreme [inaudible] in the halo that -- >> James Hays: Right. I hate this. But they love this. They love to show off their halos. I don't know if it's tagged with HDR. I didn't check. >>: [inaudible]. >> James Hays: Yeah. It's not quite as pronounced. I'm not sure that people even -- like maybe they could have had the dynamic range they needed with the original shot, and they exposure bracket and do this just because it's like the trendy thing to have. So it's very much against everything that SIGGRAPH has been going for over the last five years, which is interesting.
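The "save the ranking with your database, show it to the user, get a reranking" suggestion could be prototyped as a simple reweighting layer on top of whatever per-photo feature scores are already cached, so nothing about the scene matching itself has to be redone. The sketch below is only one reading of that idea, with hypothetical names; it fits a logistic reweighting of the cached feature scores from binary keep/reject feedback and returns a personalized ordering.

```python
import numpy as np

def rerank_with_feedback(feature_scores, kept, n_iters=500, lr=0.1):
    """Personalize a ranking from binary user feedback.

    feature_scores: one row per photo of already-computed scores (for example the
                    generic interestingness value plus individual feature channels).
    kept:           1 if the user kept/liked the photo, 0 if they rejected it.
    Returns (photo indices, best first) and the learned weights.
    """
    X = np.asarray(feature_scores, dtype=float)
    y = np.asarray(kept, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):                       # plain gradient descent on logistic loss
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # predicted probability the photo is kept
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    personalized = X @ w + b
    return np.argsort(-personalized), w
```

Because only the weights change, the expensive part -- scene matching over millions of images and computing the per-photo feature scores -- stays fixed, which is one possible answer to the question of how to couple personalization with the scene matching process.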
>>: Yeah, I don't even know if [inaudible] similar to just excessive [inaudible] because if you do [inaudible] you also get that -- >> James Hays: Yeah. I think it's -- I mean, it gets to where they manually segmented out the area they wanted, feathered it a large amount, like with a radius of a hundred, and then composited. >>: Yeah, I think that's [inaudible]. >> James Hays: I don't -- >>: It might be. >> James Hays: No, I think. >>: [inaudible]. >> James Hays: I think it's from selection feathering, I'm going to argue. But I could be wrong. >>: [inaudible]. >>: Another possible direction, if you can get your client to go that way, is to, you know, not exactly do what Luiz does, because he always sets up games, but to basically release something on the web and then gather data as people use it, right, so you have your -- >> James Hays: True. >>: Your photo paper, right, and you release this little -- [brief talking over] >>: Exactly. You upload it or they -- or you put it in the directory. I think I'll [inaudible] but you don't upload your photos, maybe it's good about uploading small [inaudible] works fast and it gets you kind of a little stack, clusters. You would have to think about what the service [inaudible] but if you have it as a web service then you can capture information, right, enough to cluster things that would give you your best [inaudible] the stack of things that were clustered and say [inaudible] I'm not sure, it would be almost like [inaudible] but because you do that on the web, you could use this as part of your research, because you get feedback. You can do that kind of thing internally by just having ten of your friends use the system, or you can launch the service and run it for a few weeks [inaudible]. >>: Another suggestion [inaudible] that we could -- this image of the house is part of a cluster, right, and someone -- you look at someone's picture and it looks something like some picture in the cluster, but this is the, you know, highest ranked one in the cluster; you could tell the user, you know, you could -- you could possibly, you know, change your photo to look more like this, you know, then you'd have lots of friends and -- >>: Here are creative suggestions of many -- >> James Hays: Right. >>: Here are a few examples. >> James Hays: Right. This is the idea of forcing them to be more like the positive images, not just selecting them as they are. Like you suggested, also, the idea of being able to learn the mapping between pre-process and post-process would be even better, but that training data has got to be hard to come by. >>: [inaudible] instrument something like [inaudible] photo gallery to -- I mean, there the users start with a huge collection of images and then you upload certain ones, so that's pretty valuable information -- >>: That's true. >>: Which ones [inaudible]. >> James Hays: [inaudible] Google. >>: Yeah, that's if you want the [inaudible] but if you want the [inaudible] research project, you know, and it's kind of explicit, and if people find it useful they can use it [inaudible] capture this data. >> James Hays: Well, do you think somebody with Google -- like, would Google be comfortable with training from their data? I mean, is that an invasion of privacy, to train from -- >>: [inaudible]. >>: I don't think you could get [inaudible]. >> James Hays: I would have to be interning there or something. Post-doc. >>: I don't know. I mean, I know with the photo gallery you can write your own plug-ins, so you could plug in yourself.
So you wouldn't even have to invoke the company's privacy policy. >>: Isn't it the case that with Google Images they preserve [inaudible] the web album, they preserve the rights [inaudible] or anyone can download it? [inaudible] something like that for a while. >> Larry Zitnick: Okay. Thank you. >> James Hays: Thank you so much. [applause]