>> Larry Zitnick: Hi. It's my pleasure to introduce James Hays. He is a student
currently at CMU and will be graduating in June. I think I first heard of his work
when he started doing some repetitive texture work, but then recognizing it and
using it for different tasks such as inpainting and texture synthesis, and then
more recently he's been doing a string of work using large image collections. So
he has an inpainting work, which I think you're talking about today, and then also
work on localizing images using both scene recognition and feature recognition,
which I think you're also talking about today. So --
>> James Hays: Yes.
>> Larry Zitnick: I'll hand it over to James.
>> James Hays: Fine. Thank you. And I'll talk about a few other things as well.
So I'm kind of covering maybe a little too much ground. If I'm going too fast or
something, then just stop me and get me to clarify. Hopefully it won't be a
problem.
So, you know, our visual experience, for us and for computers, is really
extraordinarily varied and complex. You know, tons of different scenes and lots of
complexity within any scene, and that complexity makes it really hard to
understand images and to synthesize images.
So -- and for the computer, you know, if you look at an image like this, a snapshot
I took, then there's, you know, there's thousands of meaningful image dimensions.
There's thousands of things that you could want a computer vision system to
answer about this image, from what's the pose of all these articulated figures, to
material, to illumination, to geometry. And most computer vision and also
computer graphics techniques kind of get around this huge complexity with a
divide and conquer technique; that is, you know, the image isn't the
representation, some part of the image is, so things like sliding windows or part
based models or segmenting the image first and then classifying segments. I
mean, in the graphics community they obviously tend to render things by building
it up from part based geometry as well.
So all the work I'm going to talk about is fundamentally different. It's using
scene matching instead -- to, you know, whatever, find information about images
instead of going kind of into the image. And one example I'm going to show you of
what I mean by scene matching is scene completion. And this was at
SIGGRAPH 2007. So let's say you have an image, a photograph you took or
something like that, and you want to remove part of it, or part of it is damaged, or
you're doing a 3D reconstruction and you want to get a view of something that
you never quite saw in the original views.
So for whatever reason you need to hallucinate some data here, fill it in such that
it seems plausible, such that a human observer can't tell that anything was ever
taken away. There's a lot of good previous work in this area. In fact, some of the
people sitting here have done it. The simplest is just diffusion based. Maybe I
should ask for the lights to be brought down a little bit actually.
That's great. Thank you. So diffusion based methods just propagate color from
the boundary of the region, maybe trying to preserve straight lines or something
like that. But they're not preserving texture, so for large holes it's not going to
work reliably. Then there are texture methods, texture synthesis methods, a lot
of them following from Efros and Leung, that explicitly propagate texture
from the rest of the image generally.
But since they don't have any semantic knowledge of what the layout of the
scene should be, you know, they can run into problems as well, kind of blindly
copying texture. It might be seamless and it might preserve textures but they're
just not in the right place.
So, you know, this is actually a really difficult problem to complete an image. It's
kind of a vision hard and graphics hard problem. You have to parse the scene
and identify all the objects, even partial objects and then render new objects on
top of them with the correct illumination and viewpoint and things like that. So
very hard.
And what we're doing is trying to skip all of that and instead just find some scene
that has as similar a layout as possible, as similar appearance and view parameters
and high level semantics, you know, similar enough on all these axes that you
can just steal whatever content is in the corresponding place and use computer
graphics techniques to blend that.
And in this case, we can get a reasonable result by stealing a roadway from
another image. It's not seamless, there's some blurriness, but it kind of got the
vanishing point right, and it kind of -- I think you wouldn't notice it at first glance.
So the algorithm is pretty simple. The input image is actually an incomplete
image. We never see the original one. And from that we build a scene
descriptor based on the gist descriptor from Antonio Torralba and Aude Oliva,
which is a measure of basically where the edge energy is in the image
at different frequencies, different scales, different orientations.
And then we have some color on top of that. Anyway, then we search a large
database of unlabeled images from the Internet. 2.3 million images in this case.
And we find some nearest neighbors. And in this case it works pretty well.
These nearest neighbor scenes are actually pretty similar to the input. And for
every one of them, we're going to try and find some alignment where they
overlap best. They should be roughly aligned already because the scene
matching tends to find, you know, similar layouts. But we can maybe improve a
little bit. And then at that best offset we'll do a graph cut and Poisson blending to
make it seamless.
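Roughly, the pipeline can be sketched like this. This is a minimal Python sketch of
my own, not the actual system; the descriptor below is a crude stand-in for gist
plus color, and the alignment, graph cut, and Poisson blending steps are only
indicated in comments.

```python
import numpy as np

def gist_like_descriptor(image, grid=4, n_orientations=4):
    """Crude stand-in for the gist + color descriptor: oriented gradient
    energy pooled over a coarse spatial grid, plus the mean color of each
    cell. `image` is assumed to be a float array of shape (H, W, 3)."""
    gray = image.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi
    h, w = gray.shape
    parts = []
    for i in range(grid):
        for j in range(grid):
            rs, re = i * h // grid, (i + 1) * h // grid
            cs, ce = j * w // grid, (j + 1) * w // grid
            hist, _ = np.histogram(ang[rs:re, cs:ce], bins=n_orientations,
                                   range=(0, np.pi), weights=mag[rs:re, cs:ce])
            parts.append(hist)
            parts.append(image[rs:re, cs:ce].mean(axis=(0, 1)))  # cell color
    return np.concatenate(parts)

def scene_matches(query_image, db_descriptors, k=20):
    """Brute-force nearest neighbors over a precomputed descriptor matrix
    (one row per database image). For each of the k matches, the real system
    then searches for a small alignment offset, runs a graph cut to pick a
    seam around the missing region, and Poisson-blends the stolen content."""
    q = gist_like_descriptor(query_image)
    dists = np.linalg.norm(db_descriptors - q, axis=1)
    return np.argsort(dists)[:k]
```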
And in the case of this image -- I'm walking too far away from that -- here's one of
the scene matches underneath it, so it's an overlay, the incomplete image is
overlaid with the scene match, and then after the compositing, that's the result.
>>: So was that picture actually taken from the same location as the one that
you found?
>> James Hays: I don't think so. Actually the hills are very different. Like it's a
very different climate, wherever it is. Same camera height, I think, over the
water.
So, you know, why does this work? We're not doing anything really smart to find
similar scenes. If we have a small database of images, say 10,000 or 20,000,
which isn't that small really, and we find the nearest neighbors using the same
descriptor, this is what they look like. And you know, they're not really that
similar scenes. They wouldn't be useful for image completion certainly. But with
a database that's two orders of magnitude larger, the scenes start to become
similar, at least some of them, and useful for image completion -- so, you know,
similar enough that you can use them for photo realistic image editing.
So we're kind of getting to a point where we're sampling the space of images
densely enough to do some cool stuff with millions of images. But if we go back
to this image, we're never going to have enough images to be able to, you know,
explain this with an entirely different image. I mean, this is so high
dimensional that it's not just a rare image, it's once in the lifetime of the universe.
This is never going to occur again. So if you want to know how many percussion
instruments there are here or what the pose of the people is you have to use
deeper image understanding and go in there with part based models or whatever
you want to use.
The idea with our scene matching is just to provide a prior, a much better starting
point for all of those deeper image understanding techniques. And there's
evidence that humans do exactly this as well, that humans use preattentive
global scene features to help with things like object recognition, and further Oliva
says that those preattentive features are memory based, it's not that humans are
doing anything exceptionally clever about parsing the scene and helping to
determine where objects may be, it's that they have seen similar scenes before
and they relate any novel scene to a scene that they have seen before and thus
they know, oh, these objects are likely to occur, or this is probably the correct
scale that an object would be at, things like that.
>>: Do you have any comments that, you know, when you [inaudible] an image
[inaudible] do you have any way of telling or knowing whether there is a match,
whether there is an image that is good enough? I mean, is there some sort of
threshold?
>> James Hays: I have certainly not found anything that's close to universal.
>>: Okay.
>> James Hays: I mean the -- yeah, I mean the distances of the features aren't
really comparable across different scene types. And small differences in
distance make a big difference in the perceived scene matching quality in a lot of
cases. So I mean, no, I haven't looked into anything really like that yet.
So the big picture is, you know, for any -- any image we want to be able to use
kind of Internet scale scene matches to learn as much as we can, you know,
learn as much about it as we can from that, and then maybe that's a good starting
off point for deeper image understanding.
So to inspire the next project, I'm going to let Hollywood give the sales pitch a
little bit. See if this plays.
[video played]
>>: [inaudible] and the next one's still a mystery because we weren't able to
download all of it.
>>: Look. Do me a favor, [inaudible] to where this is.
>>: Yeah.
>>: Larry and Anita have found a match for your photograph. I don't know the
significance, but it's somewhere in the valley.
>>: Why, are you kidding, that's Robin's house.
>>: What?
>>: She's the next target.
>> James Hays: So melodramatic when you only look at the clip. But so, yes,
this is Numbers, you know, a CSI-like thing where they've invented, you know, this
geolocation thing that, actually, you know, I want to pitch. So I want to try this, you
know: given a single image, figure out where it is in the world.
I should note that this was accepted for publication before that episode aired.
Let's just go through, kind of at a high level, how humans might geolocate
images. I think there are different, qualitatively different ways they do it.
For images like this that are landmarks, both humans and machines have a
pretty easy time saying where this is, because it's visually unique and there's a lot
of photographs of it or people visit it a lot, so it's not difficult to know that this is
Notre Dame Cathedral in Paris. For an image like this, though, humans have a lot
more difficulty. They can probably tell you something about where this is in a
generic sense, like this is the Mediterranean maybe, or southern Europe, or some
people say South America.
For machines it's possible that maybe, maybe instance level recognition could get
this, maybe with enough street level views and the right matching and geometric
verification, that there's enough discriminative stuff going on here that you could
tell unambiguously where this is. But at some threshold the world becomes too
dynamic or noisy for this kind of instance level recognition to work. Even if you
had all of the coastline views of the world, I don't think, you know, matching SIFT
features is going to help you unambiguously geolocate this, especially if the data
set is slightly out of date, because, you know, this is dynamic: the water changes,
the beach changes, a hurricane hits and the trees change. But a human
could still tell you, oh, this is kind of tropical and lush, it looks like a warm
environment, it's on the water obviously. So even if you can't geolocate it
unambiguously, you can at least narrow it down to a pretty small geographic
area.
So we're going to use scene matching. And so our data from the Internet is 6
and a half million geolocated images off of Flickr. And this is just the distribution
of them and this is in log scale so it's really quite a bit peakier than it looks here.
Some of the regions are pretty empty like the middle of the Australian outback or
Siberia. 110,000 different photographers. So it's an average of about 250 per
photographer.
And this is what a sample of these images look like. This is from our test set
actually, which is from the same distribution. Some of them are landmarks like
Sagrada Familia. Some of them are generic indoor images. A lot of them are
pretty difficult to geolocate, and some of them are kind of dynamic scenes like
this glacier that are probably going to not exist in five minutes. And the features
we're going to use, we're going to use the same as the scene completion paper
that I just talked about. And I'm going to try and add some other ones. This is
kind of a good task to quantitatively evaluate these different scene matching
features: color histograms and texture histograms, and also histograms of straight
line features, like where straight lines are facing and how long they are.
And then we also tried geometric context, which is the probability of each
segment being ground, sky or vertical, and we also tried tiny images, like
Antonio Torralba's, except even tinier. And those were left out because they just
weren't useful in conjunction with the other features. So in the end I only used
those four. So I'm going kind of fast. Sorry.
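As a hedged sketch, my paraphrase rather than the actual code, the geolocation
step is essentially a weighted nearest-neighbor lookup over the geotagged
collection. The feature names, weights, and the simple one-nearest-neighbor
estimate here are placeholders.

```python
import numpy as np

def combined_distance(query_feats, db_feats, weights):
    """query_feats: dict name -> 1D feature vector for the query image.
    db_feats:       dict name -> 2D array (one row per database image).
    weights:        hand-chosen relative weights for the feature types
                    mentioned above (all assumed here)."""
    total = np.zeros(next(iter(db_feats.values())).shape[0])
    for name, w in weights.items():
        total += w * np.linalg.norm(db_feats[name] - query_feats[name], axis=1)
    return total

def geolocate(query_feats, db_feats, db_latlon, weights, k=100):
    """Returns the geotag of the single best scene match plus the geotags of
    the top k matches; a fuller version would cluster the k geotags (for
    example with mean shift) and report the densest mode."""
    order = np.argsort(combined_distance(query_feats, db_feats, weights))[:k]
    return db_latlon[order[0]], db_latlon[order]
```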
So here's an example of what the scene matches look like. Very nice.
Architecturally, they're very well correlated. Viewpoint is very well correlated.
But also the location is correlated. At least among these scene matches you can
see that they're mostly in Europe. If we plot these on the globe we can see that
this is pretty different from the prior distribution. There's almost no matches in
the Americas and there were no matches in Asia. It's all centered in Europe. It's
not doing a great job of telling where in Europe exactly it is. The ground truth is
it's here in Barcelona, and it has some scene matches there, but it has them in
other European cities as well.
So it's at least giving you some signal. For the landmark images our features
aren't really suited to instance level recognition, because they don't have any
viewpoint invariance, or not much at least, but there's so many images of
something like this that you can still unambiguously tell that, oh, the biggest
cluster of images is right there in Paris. And then for this more generic image,
the scene matches I think look reasonable, but geographically it's very
distributed, Hawaii, Thailand, Brazil, Mexico, New Zealand, Philippines. And I
think they are mostly plausible, but geographically they're kind of distributed all
around water and the equator.
The correct answer happens to be Thailand. And in this case, we do happen to
have the most matches there. But it's kind of just a little lucky, I think. And then
one more result. This is somewhere in the American southwest. And I like it
because the scene matches -- it does a very good job, but none of them are
actually matching this exact viewpoint or this landmark. They're finding some
other landmarks that are similar to it actually, but it's, you know, it's kind of
identifying that there's something -- something geographically discriminative
about the texture, the rock formations. So it's pretty well peaked in the American
southwest there.
So quantitatively we can say that for about 16 percent of that test set when using
the full data set, about 16 percent of them, the first nearest neighbor was within
200 kilometers, so it's kind of within the nearest major airport or something. And
as we decimate the data set, we can see that the performance goes down to
chance, which is one percent in this problem.
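The "within 200 kilometers" numbers are just great-circle distance between the
first nearest neighbor's geotag and the ground truth; a minimal check looks like
this (standard haversine formula, a helper of my own, not from the paper):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * earth_radius_km * np.arcsin(np.sqrt(a))

def fraction_within(pred_latlon, true_latlon, threshold_km=200.0):
    """pred_latlon, true_latlon: (N, 2) arrays of degrees; roughly 0.16 is
    the number quoted in the talk for the full database."""
    d = haversine_km(pred_latlon[:, 0], pred_latlon[:, 1],
                     true_latlon[:, 0], true_latlon[:, 1])
    return float(np.mean(d <= threshold_km))
```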
So 16 percent is not great in absolute terms, but it's a pretty hard test set. And I
thought it might actually be pretty difficult for a human to do this, so I've done a
small pilot study where I asked people, you know, here's the test set, tell me
where each of these images is, and this is just -- these are different human
subjects, their performance compared to im2gps. There's a large
variance. It depends a lot on people's exposure to international travel and things
like that. Some people, you know, really don't recognize any of the landmarks,
can't reason about some of the subtle things in the images and don't do well. But
some people do manage to beat the computational baseline.
>>: In your experience, the 16 percent, are they landmarks or, you know, are the
ones that --
>> James Hays: They're half landmarks and then they're half, you know, other
things.
>>: Is it just -- is it kind of like, since most of the, you know, desert photos that
exist on Flickr are southwest photos, you have a tendency to kind of peak
around the southwest, and it picks that, and it's not exactly a landmark?
>> James Hays: Yeah. I mean, it's something geographically discriminative.
But I think you could find another type of desert image that would not match well
to the American southwest. So it's not -- I think it's not just at the level of
classifying it to a scene type like desert or mountain. It does a little better than
that. But it could be other things. It can be -- you know, there's pictures of the
Greek islands I'll show later where the white -- you know, all the white buildings
and the rocky ground is pretty geographically discriminative, I guess, even
without landmark matching you can tell that that's Greece.
So it's kind of interesting to look at the images on which the humans did maximally
better than the machine, and I mean a lot of the reasons are pretty obvious: the
humans can read the text and then know where that is roughly. And then here I
don't think it's the text, but if people know that these are, you know, London cabs,
then it makes that pretty easy. And then also interesting is the other way around,
the images for which im2gps did maximally better than the humans. So in this
case, it's not because the humans don't understand the scenes obviously, it's
because, well, they know that that's probably Africa. They don't know where
people actually go on safaris in Africa. And here, I don't know, does anybody
know where that is?
>>: [inaudible].
>>: Turkey.
>> James Hays: Turkey. Okay. Well done. Yeah. And this is in Kenya. Almost
everybody goes on safari.
>>: [inaudible].
>>: It looks like [inaudible].
>> James Hays: And the one in Kenya -- almost everybody goes on safari in
Kenya or South Africa, so it's actually not that hard to guess. So, okay, so for
most of the images the geolocation is not really accurate. Here's another set of
scene matches and the scene matches are pretty good, but geographically
they're all over the place. In fact, this image is in Morocco and really almost none
of the scene matches are there. But this can still be informative about the
original image.
If we have, for instance, an additional source of geographic information like a map,
and we sample that at the locations where we thought this image might be, then we
can get an explicit estimate of whatever geographic property. In this case, I'm
showing how hilly it is -- it's the gradient of the elevation. And we can say, oh, this is
how hilly we think this image is. Remember, the image is unlabeled originally.
We're just going through geolocation as an intermediary to get to whatever other
metadata. We can do that for the whole test set. And this is what it looks like if
you sort all of them by how hilly you think the images are. And I think it does a
pretty good job: you get the mountains at the top, and then these are the
bottom ones. They're mostly urban images, because a lot of big cities are in flat
areas.
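Here is a minimal sketch of that map-sampling step, with an assumed grid layout
(equirectangular, fixed degrees per cell); the real system's map resolution and
weighting are not specified here.

```python
import numpy as np

def sample_map(grid_map, lat, lon, deg_per_cell=0.5):
    """Look up a global gridded map (elevation gradient, land cover,
    population density, ...) at the given geotags. Assumes rows run from
    +90 to -90 latitude and columns from -180 to +180 longitude."""
    row = np.clip(((90.0 - np.asarray(lat)) / deg_per_cell).astype(int),
                  0, grid_map.shape[0] - 1)
    col = np.clip(((np.asarray(lon) + 180.0) / deg_per_cell).astype(int),
                  0, grid_map.shape[1] - 1)
    return grid_map[row, col]

def estimated_property(grid_map, match_latlon, match_weights):
    """match_latlon: (k, 2) geotags of the scene matches; match_weights could
    come from descriptor distance. Returns one 'how hilly' style score for
    the (unlabeled) query image."""
    values = sample_map(grid_map, match_latlon[:, 0], match_latlon[:, 1])
    return float(np.average(values, weights=match_weights))
```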
And I mean, you can do that for any map. If you have a map of land cover, then
you can kind of do scene classification that way. Here's the most forested
images. That means that they were geolocated to be in the most forested
regions. These are mountains, but they are forested. That's correct.
>>: [inaudible] the images [inaudible].
>> James Hays: [inaudible].
>>: [inaudible].
>> James Hays: I guess. I mean, that's just the proportion of images people
take. All right. So in summary, this data driven scene matching, I think it makes
kind of an interesting baseline for this task. And I think there's a lot of room for
improvement. And even if we can't geolocate it accurately then it can still tell you
something about the image, just because the GPS gives you access to a lot of
metadata basically.
So this is -- now, moving into a work inspired by im2gps, maybe kind of an
extension again. So here's an image which I think is very generic. Somebody
can impress me again and tell me where this actually is. It's meant to be
confusing.
>>: Paris?
>> James Hays: No. Okay. So nobody knows. Good. It's in -- what if I show
you this image? Everybody should know where that is, more or less. It's in
central London. What if I tell you -- show you these images together, with time
stamps on them, you know, and I tell you they were taken by the same
photographer? Then you can reason that, okay, these are only 45 minutes apart
and this one's in central London and you can't really get that far away, so, okay,
that one is probably in London, too. And if I add another image -- okay, so this
one you can tell it's not in London, but it's a day after the other one, so if you're
really trucking it, you could be almost anywhere on earth.
It could be, you know, Rio, or it could be Barcelona. So to help answer this, you
might want to know, you know, what are the likely locations that somebody is
going to go to from London. So this work, image sequence geolocation, which is
with [inaudible] -- and I should give them credit for doing most of the work, it's
their idea -- this work aims to take, you know, just a folder of photos from your
vacation, say, okay, this is my vacation or this is my entire photo collection, and,
you know, they may or may not be from the geographically same place. In this
case, there's some from China, and then we move to London, and then the
Greek mainland and then the Greek islands.
And you know, you want to geolocate all of these. So the inputs are images and
timestamps. And timestamps aren't really trustworthy, because maybe you didn't
change your camera time to the right time zone, or maybe you never set the time
in the first place, or the battery died or whatever. But the difference
between pairs of timestamps should be accurate for the same camera, even if
the absolute time isn't. And this is really all we need. So what we're going to use
is the time differentials between pairs of photos, and the goal is to, you know,
geolocate all of these. You can describe this with a hidden Markov model. So
what we're interested in is the location of the photos, and what we get to observe
are the images taken from those locations and the differentials in time between
the pairs. And this relationship -- the relationship between locations and images --
is kind of explained by im2gps. I'm not going to go into that.
But the interesting new thing about this is this, which is, you know, how likely are
you to move from one location to another in a fixed amount of time? How likely
are you to go [inaudible].
>>: [inaudible] when you showed us that which you know are the top and bottom
rows, that's during the inference or test phase, right, during the training phase
you know the locations as well?
>> James Hays: Right. That's true. And I don't think we have necessarily an
explicit -- we don't have a training procedure that takes advantage of that right now
necessarily -- well --
>>: [inaudible] tags in your training?
>> James Hays: Well, of course we do. When we learn this movement model
we do. Right.
>>: So I mean there's nothing surprising there, right, I just want to be clear
that the missing data, the hidden data, is the middle row, but
in the training phase, that's what you know, that's --
>> James Hays: Exactly. Yes. So, yeah, again what this relationship is kind of
encapsulating is okay, I'm in London, what's the chance that I'm in Barcelona in
six hours, something like that. That probability. And there's a lot of related work
on human movement from other fields. There's some publications in Nature
recently where they, for example, track dollar bill serial numbers to see, you
know, and use that to infer where people have moved around the country, or in
this case they're tracking people by cell phones. Neither of these covers
international travel. And what they're kind of more interested in is coming up with
a compact parametric description of how people move. There's -- you know, they
hypothesize that people follow this Lévy flight distribution where they have -- it's a
power law distribution where they have lots of short trips but the occasional really
long trip. It's a heavy tailed distribution.
But I don't think that simple parametric models are going to be necessarily that
useful because, you know, human movement patterns are really complex. Your
movement will look different if you're in Jacksonville than if you're in New York. And
we're going to try to learn it from Flickr data, and more recently this group from
MIT, a kind of urban planning group, has done the same thing, where they
take Flickr geotag data and look at how people move around a city.
And we're going to do the same thing on a global scale, you know, basically to
say, okay, how many photographers move from Sydney to Hawaii
in X hours. So we've discretized all of the locations of the globe and we've
discretized time, also. And this is the type of model that we learn, or a
visualization of the model, that is to say.
If you're in Honolulu and zero to five minutes have elapsed, this is the probability
of where you're going to be next. So it's pretty stationary, right, 30 to 45
minutes? An hour or two hours? You're still in Hawaii. Or you're in an airplane
not taking pictures. Six to eight hours, a couple of people have managed to land
and take photos of the airport or something. And then as you extend time over
longer time periods, it starts to resemble more the prior distribution of the data.
In general, it is more stationary than I expected; even after 60 days, you know, a
lot of people are still in Hawaii. Very lucky people. So using this model and
im2gps, we can, you know, find the maximum likelihood path over the globe for
a sequence of photos. And I'll just show some preliminary results for that.
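Here is a minimal sketch of that inference step as I understand it, not the authors'
code: hidden states are the discretized locations, the unary term comes from
single-image geolocation, and the pairwise term is the learned movement model
indexed by the discretized time gap; a standard Viterbi pass then gives the
maximum likelihood path.

```python
import numpy as np

def viterbi_geolocate(unary_log, transition_log, time_gap_bins):
    """
    unary_log:      (T, S) array, log P(image_t | location s) from scene matching.
    transition_log: dict mapping a time-gap bin to an (S, S) array of
                    log P(next location | current location, elapsed time),
                    estimated by counting Flickr photographers' moves.
    time_gap_bins:  length T-1 sequence of discretized gaps between photos.
    Returns the most likely location index for each photo.
    """
    T, S = unary_log.shape
    score = unary_log[0].copy()
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        A = transition_log[time_gap_bins[t - 1]]       # (S, S)
        cand = score[:, None] + A                      # previous x next state
        backptr[t] = np.argmax(cand, axis=0)
        score = cand[backptr[t], np.arange(S)] + unary_log[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```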
>>: [inaudible]. So you go to Flickr, you download, you know, all the photos of
some person, then you look at the same camera, you have the time stamps, and
so you assume, okay, this is evidence of where that person has been, right?
>> James Hays: The only restriction is that it's the same person. We don't look
at the camera or anything. We assume that a single Flickr user has taken all
the photographs that he's posted.
>>: So you [inaudible]. Okay. So okay.
>>: The photos are [inaudible] are you using the --
>> James Hays: We're using the real geotags to train the movement models
and, yeah, given a new sequence we're using im2gps to get a distribution like
this, coupling that with those movement models to help refine them.
>>: It doesn't really matter that it's photos it's just sort of checking with the GPS
tag, right?
>> James Hays: Sorry?
>>: It's just, you're just using the -- location, you're not [inaudible].
>> James Hays: Well, I mean, once im2gps has given you a geolocation
distribution like this, then, yes. Are we talking about training phase or testing
phase?
>>: I guess -- this is just using GPS, right?
>> James Hays: Yes, this is just GPS. This is just the movement model.
>>: [inaudible].
>> James Hays: That's just what I talked about just before this. It's just, given a
single image, giving a geolocation.
>>: [inaudible].
>> James Hays: Oh, well, given a novel sequence, I mean, you have to start with
kind of the unary, whatever, probabilities of where you think this image is. In this
case, it's a very bad estimate, because it didn't do a very good job. But, you know,
im2gps provides this distribution, I go on to the next image, and then im2gps
provides this distribution for the next image. And then with the
movement model you're going to help refine those, find a maximum likelihood
path. All right. So here's these results. You know, it didn't do very well on this.
The scene matches aren't actually that bad, it's just finding Chinatowns in other
cities. It's actually in Beijing. But, okay, this is the -- the posterior, whatever,
after inference; it's kind of the probability that it assigns to each location.
And now the maximum likelihood is here in Beijing, and that's because the
other images that were in Beijing that were close to it temporally were kind of
more landmarks, and this one is not as obviously Beijing. And also some
confusion with Hong Kong. The other one was a landmark. And moving on
through the sequence.
Okay. This is after inference. Moving on through the sequence, this is the
im2gps, you know, just the single image geolocation estimate for this one, so it's I
guess the best one in the sequence, the most landmark-like -- it's the Tate
Modern in London. And then that helps -- okay. Sorry. This is after inference. It
knows it's in London, not surprisingly. This is the next image, the generic one that
I showed you. And again, I mean, the distribution here is just basically random.
It's the prior of the data set. But since the previous one was in London, it knows
that that's where it actually is. And since the next one is kind of a landmark as
well, although there's some confusion with New York, then, you know, it knows
it's still in London. Though there's some confusion.
Anyway, moving on to this one. So this picture was in Athens. Im2gps does a
very bad job. It confuses it with Italy and Barcelona. And for that reason, you
know, the inference isn't going to move it to Greece really. It just leaves it in
London and then decides that the next photo is now in Greece. This is after the
inference. There was another day that passed between these pairs of photos,
so the temporal constraints weren't helping that much.
Then it has more pictures here in the Greek mainland, and then they move to the
Greek Isles. And it did actually make the movement there. So anyway, out of
this sequence, all of them are geolocated accurately except for that one. But it's
kind of brittle right now. If I take these landmark images out, then all of these are
wrong. It thinks you went to Beijing and then New York and then Greece and
then the Greek Isles. So it's nothing -- it's pretty intuitive. It's using the
landmarks to help support the more ambiguous images that are temporally
nearby. Any other questions on this? All right.
>>: I'm trying to get a sense for how helpful it is to have the training data rather
than just a kind of generic [inaudible].
>> James Hays: Yeah. I thought the training data would be a lot more useful,
but then actually playing with it I'm not sure how much more useful it is. It's
something we'll have to quantify when we actually publish it. I think. Yeah. I
mean, I thought -- yeah. I don't know. That's a good -- very good question. It's
not clear to me either.
>>: It's also possible that there's a factor model that captures a lot of this right
here. When you look to Honolulu, it wasn't just radial distance, it was sort of
radial distance multiplied by population density, right?
>> James Hays: Right. It's multiplied by the prior.
>>: By prior, right. So if you're going to test a -- if you're going to test this sort of
data driven model versus a parametric model, that would be the natural one to --
>> James Hays: I agree. I agree.
>>: But then, you know, you could also make up other hypotheses, like, say,
well maybe nationality makes a difference because Chinese people [inaudible] --
>> James Hays: And that's what I was thinking, it would be more subtle things
like that. I'm not sure that we have enough data for those to really emerge.
>>: Cameras recorded [inaudible] so people actually did check the time zone.
>> James Hays: Yeah, I mean if they were explicitly changing the time zone and
not just the time, then that would be super informative, right, I mean, that would
be a huge constraint, if they told you which twenty-fourth of the globe you're on.
>>: [inaudible].
>> James Hays: What's the distinction? It's not [inaudible] if they're in a different
time zone or time actually elapsed in the same time zone. Okay. So this recent
stuff has been about finding geolocations for images, and some people have
given me a hard time saying this is not computer vision. Jean Ponce
was very critical of this. So I wanted to show that these geographic estimates
are useful for something, some pure computer vision task like object detection.
Working with Derek and Santosh -- Derek is a professor at UIUC now --
and again, they did most of the work here. I'm just going to talk about what part I
did in this.
And the idea is, there's been a lot of investigation of context in computer vision,
using context to improve, say, object detection. But the best performers in the
past Pascal challenge are all context free -- or they were last year,
at least. All these part based models from Pedro Felzenszwalb and David McAllester
and Deva Ramanan work really well and don't use any context. No
scene level context, no object to object context.
And we thought, well, if we can, you know, add some scene level context on top
of this, surely it will improve the performance. So that's exactly what we're going
to do, we're going to take their base detector exactly as is, not even retrain it and
get candidate detections and then use context cues, you know, from what we
know about the scene to say, you know, this is likely to be a cat, this is likely to
be a cat, this is not likely to be a cat.
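In rough terms, and as a generic sketch rather than the exact system (which
learns the combination with a trained regressor over many cues), the rescoring
looks something like this:

```python
import numpy as np

def rescore_detections(detections, scene_presence_prob,
                       w_local=1.0, w_context=1.0):
    """detections: list of dicts with 'class', 'box', 'score' straight from
    the unmodified base detector. scene_presence_prob: dict mapping a class
    name to P(class appears anywhere in this image), computed from scene
    level features / scene matches (assumed precomputed). The weights here
    are toy values; the real system learns them."""
    rescored = []
    for det in detections:
        context = scene_presence_prob.get(det['class'], 0.5)
        new_score = w_local * det['score'] + w_context * np.log(context + 1e-6)
        rescored.append(dict(det, score=new_score))
    return sorted(rescored, key=lambda d: -d['score'])
```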
And then there's also another part of it which I'm not going to talk about, which is
resegmenting based on those potential bounding boxes to improve the bounding
boxes. And that ends up helping a lot. I'm not going to talk about that. So here's
the top five cat detections from their detector, their local detector. This is kind of
unfair, because the detector actually does a pretty good job in general, but for
cats it's just not so good. So these are confusions that shouldn't happen, because
the context is not even reasonable. Like, you know, a cat can't live there. You
should know that's not a cat detection.
So the type of context we're going to focus on is, given an image, what is the
probability that it contains a cat or a chair, a dining table or whatever object. And
to do this, we're going to use similar features to what I already talked about, scene
level features like the gist descriptor or geometric context.
We're also going to -- and we're going to try and learn the relationship between
these features and object occurrence not using scene matching, just directly
training from the Pascal training data, which is about 4,000 images, and training
to learn, oh, this type of gist descriptor means cats are present.
And then at the same time, what I did was use scene descriptors to go out on the
net, you know, find similar images, find the most similar matches, and then use
the properties of those scene matches to try and say whether or not there's a cat
present. So I've kind of got this baseline scene level context to compete
against.
So the type of context that I'm getting, say this is an input image and say this is
some of the scene matches. Not great in this case, but the metadata associated
with them like, you know, say squirrel or lion could be useful in saying whether
there's a cat there. And also the geographic properties. These are all geotag
data, so we know exactly like is this a grass land or is this urban and high
population, things like that. And we can hypothesize that maybe that correlates
with object occurrence. Maybe pedestrians are more likely in high population
density, things like that.
We don't use the geolocation exactly, explicitly, just the geographic properties
derived from it. So the actual features we build from this are a histogram of
keyword categories, saying like, okay, the scene matches to that query had
four animal keywords, and that means it's likely to contain a cat. And
then we also have geographic features, that is -- oh, the scene matches had
this population density on average, so -- and then maybe in this case that's not
very informative about whether or not there's a cat there.
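A small sketch of how those scene-match context features could be assembled;
the keyword-to-category map and the field names are made up for illustration.

```python
from collections import Counter
import numpy as np

# Tiny, hypothetical keyword-to-category map; the real vocabulary is larger.
KEYWORD_CATEGORY = {
    'cat': 'animal', 'dog': 'animal', 'lion': 'animal', 'squirrel': 'animal',
    'street': 'urban', 'building': 'urban', 'beach': 'nature', 'forest': 'nature',
}
CATEGORIES = ['animal', 'urban', 'nature', 'other']

def scene_context_features(scene_matches):
    """scene_matches: list of dicts, each with 'tags' (the Flickr keywords of
    that match) and 'geo' (geographic properties such as population density
    sampled at its geotag). Returns one context feature vector for the query."""
    counts = Counter()
    for match in scene_matches:
        for tag in match['tags']:
            counts[KEYWORD_CATEGORY.get(tag.lower(), 'other')] += 1
    total = max(sum(counts.values()), 1)
    keyword_hist = np.array([counts[c] / total for c in CATEGORIES])
    geo_mean = np.mean([match['geo'] for match in scene_matches], axis=0)
    return np.concatenate([keyword_hist, np.atleast_1d(geo_mean)])
```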
This data is of course extremely noisy. I'm relying on people from the Internet to
have annotated their images, and they don't use the same vocabulary. The
scene matches only average one key word per image, so it's not densely labeled
like the Pascal training set was which was perfectly labeled. All the objects are
identified.
And what I just talked about goes into a final regression to rescore bounding
boxes, along with some other types of context that I'm not going to talk about,
as well as the local detector score. So that is to say, you know, for every -- we're
rescoring every little box that the original detector found. And after we do that, here
is the five most confident cat detections instead of these. And okay. So we do a
little better. There's two cats instead of one. But the confusions are a lot more
reasonable. They're in settings where it could have been a cat. But this
confusion happens a lot. And in fact Pascal is intentionally constructed such that
you can't use context alone to solve it.
Here's the precision recall of the cat class when we've added context we get a
big jump versus the baseline detector. Overall, it's a lot more of an incremental
increase. In fact, there's not much of an increase here in this low recall regime.
Only kind of when we maybe get the more challenging ones. I think this is kind
of consistent with the work that Larry has done that says maybe the context is
really only necessary when you get to the kind of the more borderline cases. But
if you have a high resolution object, then, you know, the appearance alone is
definitely unambiguous.
And here's some more quantitative results. So our goal was to improve this
baseline detector. And if you just look at the bottom row, average precision,
we go from 18.2 to 22 overall. So that's a pretty significant increase. But most of
that increase is actually coming from fixing the bounding boxes using
segmentation. The scene context only got us 1.2 percent, which is a little
disappointing. Larry said that this was kind of a negative result, which I kind of
sort of agree with. I mean, we could maybe have done it better, and we're only
using one type of context, scene to object context. There's other types of context
like object-to-object context. But that was a little bit disappointing to me.
>>: [inaudible] so bicycle performs so much better.
>> James Hays: I think it's hard to read too much into these. Average precision
is kind of a -- if you switch the very first detection from being correct to incorrect,
the average precision can jump a lot, so it's kind of a noisy measure, actually.
I don't want to read too much into these. There's not a whole lot -- there's not a
whole lot of consistency on what we do better on. Is it indoor or outdoor, is it
animal versus not animal? I mean, we kind of do just slightly better on a mix of
things.
My contextual features are, I think, worse for indoor images because the
geography is more ambiguous there. Here's the relative weight that the
regression assigned to each of the scene level features. And I'm not going to --
I'm just going to block those out, because those aren't dealing with object
presence, they're dealing with object size and location. So these are the ones
that are determining whether or not an object is
present in an image. Here's an image. What's the likelihood there's a cat in
there, something like that. And this is the relative weight assigned. So the
presence column that's referring to the gist descriptor and the geometric context
as kind of trained within the Pascal data set alone. And then semantic context
refers to the key words from the scene matches and geographic context refers to
the geographic features from the scene matches.
So these two really dominate the scene features that are learned from within the
Pascal data set, which is somewhat surprising. I'm happy about that, because
learning within Pascal makes a lot of sense, because the
labeling is very clean. You could trust the labeling completely. And if you go up
to the Internet, you have so much noise to deal with, but the scene matching is
so much better that it actually makes it worthwhile. You can really trust that the
scenes are similar and that the properties, you know, are more likely to be
shared.
>>: Pascal [inaudible] within object categories, right?
>> James Hays: Oh, yes. It's a very diverse set of images.
>>: Yes. It's really hard to do scene level --
>> James Hays: Yes, exactly.
>>: Just from those images.
>> James Hays: Right. I agree. And we entered this in the VOC 2008
challenge, and we get put in our own little special box because we're the only
people to train with outside data. And we're the only -- I think we're the first
ones to enter this competition ever. But it's okay. We can compare to these
people who are doing the same task but only training within the Pascal data. The
external data we used was only my scene matching stuff.
And also because we used the 2007 version of Pedro's [inaudible], but that's not
really an advantage. So we did the best in six classes, second best in six
classes, so we're competitive with the best of them.
So, yeah, I mean that's not bad. Any questions on this before I move on? All
right.
So changing gears a little, some ongoing work I'm doing right now is
high interest photo selection. I was hoping Yan Ke would be here,
because this kind of relates to his work a lot, his past work. So my motivation is
that computer graphics does a lot with post processing images. I mean, that's the
justification for a lot of the imaging track of SIGGRAPH: we want to fix this image
in one way or another. But I think that a photographer creates a great
photograph mostly at capture time and not as a post process. I mean, there's a
lot of things you can do at post process to help, but, you know, here are some
examples of good photos. And things like this don't happen because you post
processed them correctly, they happened because you were at the right place at
the right time and you were taking pictures, a lot of pictures.
So I think we should really be going crazy at capture time as photographers. I
mean, an iPhone holds 20,000 photos, digital cameras are really fast, memory is
really cheap. But we don't do this -- I don't do this -- because organization becomes
such a hassle if you have that many photos. On my SLR the counter
is up to 50,000, but I haven't posted photos in two years, just because it's
so hard to go through all the photos and pick them out and things like that.
So a little help, maybe suggesting that, you know, these might be the nicest
photos to share, I think would go a long way. And this is a quote from a
professional photographer saying if you get one good shot on a roll of 36 -- so this
is from the age of film -- you are doing good. When you edit ruthlessly like that,
people think you're better than you are, a better photographer than you are. So
this is kind of saying the key to photography is throwing away most of your
photographs, which I also kind of agree with. You have to be very selective.
So the goal of this project is, given some collection of photos, whatever, your
vacation photos, to automatically suggest that, you know, okay, these dozen or so
seem, you know, really good, maybe you could share these. And then you can,
you know, throw them up on Facebook without thinking too much. And this is an
application that requires, or at least [inaudible], high precision but tolerates low
recall. We're only grabbing a few of the interesting photos out of the larger
corpus. Maybe it's okay if we miss some interesting ones and leave them
unchosen, as long as the ones that we do pick are good. So what makes a
photo good? That's very difficult. There's not a single objective measure.
People could argue about it, and there's photography rules that you can
reference, but they're scene dependent and they aren't set in stone.
For example, you can go to a website trying to learn to be a better photographer,
and it will say the four rules of composition for landscape photography. It seems
specific. And these rules might not be appropriate for portrait photography or macro
photography or architectural photography. In fact, framing images -- I think that's
probably a great piece of advice for landscape images, but for other scenes it
could make them look kind of cluttered. And any guide book like this is going to
say, you know, the rules are meant to be broken anyway. It's kind of nice to
know these, but they're not set in stone. So I think a nice way to try and evaluate
how interesting a photo is or how good a photo is just to find similar scenes and
say, okay, what do people think of those scenes? They've been up on the
Internet. How many times do people click on them and things like that.
So we're going to do that using the im2gps database, which was six and a half
million images. They happen to be geolocated, but we're not using that here.
And then I added 600,000 more photos, which are the most interesting photos on
Flickr. That's the 500 most interesting for every day of the last four years. And
then the pieces of metadata that might be relevant for determining whether a
query was interesting are that, for each of these Flickr photos, we have an
interestingness rank, which is something of the form, you know, it was
this rank out of this query segment. We don't have a real valued number for
interestingness, unfortunately. Flickr hides that.
And then we also have the number of times it's been viewed, and the number of
tags, the number of words in the title. Those might be correlated with increased
quality, if people are more willing to label it. And, you know, for a query, the
features we're going to assign to it are just, you know, how
interesting are your scene matches, how many views did your scene matches
have, you know, what's the number of tags in your scene matches? It's a little
more complicated than that, but that's basically it. And there's also how far away
you are from your scene matches.
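As a hedged sketch of those features -- the field names are my own, and the real
feature set is described as somewhat more complicated -- the per-query vector
might be assembled like this and then fed to a linear classifier (the talk later
mentions a linear SVM).

```python
import numpy as np

def photo_interest_features(scene_matches, match_distances):
    """scene_matches: dicts with per-match Flickr metadata; the field names
    here ('interest_rank', 'views', 'num_tags', 'title_words') are assumptions.
    match_distances: descriptor distances from the query to each match."""
    return np.array([
        np.mean([m['interest_rank'] for m in scene_matches]),
        np.mean([np.log1p(m['views']) for m in scene_matches]),
        np.mean([m['num_tags'] for m in scene_matches]),
        np.mean([m['title_words'] for m in scene_matches]),
        np.mean(match_distances),   # how far the query is from its matches
    ])

# These vectors (optionally concatenated with the low-level quality features
# mentioned below) are then fed to a linear classifier, for example:
#   from sklearn.svm import LinearSVC
#   clf = LinearSVC().fit(X_train, y_train)   # labels from the scraped positives/negatives
```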
So these are all scene matching features, and then as kind of a baseline we're
going to add to these the features that Yan, who is here, developed for quality
assessment, which is probably a pretty similar task. And I'd say these are kind of,
you know, lower level things: does the image take up the entire dynamic range, is
it well saturated, does it not have too many hues -- a narrow band of hues is
actually positively correlated with quality. Things like that.
And I'm going to define a test set from the Internet data. I'll just kind of skip over
how I'm doing that. But here's an example of positive images. So these are
images that should be determined to be high quality or interesting. And I think,
okay, that's pretty reasonable. One of the problems that we run into a lot is
this: what's going on here is that the high quality photos often have a lot of post
processing, and that's not useful for this task, because we're trying to look at
photos right off the camera, for instance.
And we don't want to learn that post processing is good, because none of them
will have post processing. But maybe they're post processing it in a way that,
you know, a camera could produce. Not in this case.
And here's the bad images. These images all have zero views. Nobody on the
Internet ever clicked on these Flickr images, which means that by building this
data set I actually -- it's a destructive process. These images no longer have
zero views. So this is a high quality -- I mean, I think I like this image. I think
some of them are fine. Maybe this one's okay, too. Maybe a little too closely
cropped. I don't know. I mean, this is a problem. There's a lot of reasons an
image could have no views. They're all publicly available, but maybe this
person just doesn't have a lot of friends looking at their album. Or maybe they
posted a string of redundant photos and nobody bothered to click on the third one
that was similar. I don't allow photos from the same photographer to appear
multiple times. Sorry. One of the problems.
So we have a test set, and we use just a linear SVM to find weights for those
different features I showed you earlier. And here's the results. So this is kind of
the baseline, which is actually a very good baseline, because Yan's stuff works
very well. So it's using his features. These are just derived from, you know, from
the individual image, not using scene matching. And this is the precision recall
for the task. These precision recalls are higher than I thought, because I thought
the test set would be very difficult, to separate the positive and
negative classes. And it's pretty hard, but they actually do pretty well, especially
in this low recall regime. So Yan's features get about 80 percent. Mine are up
here. And then if we combine them both, we're getting like 90 percent precision
at 20 percent recall, which is --
>>: [inaudible] is it the Flickr photos that you're testing on instead of a --
>> James Hays: Yes. No, no, no. Oh, yes, in this case, yes.
>>: So the images then [inaudible] processing --
>> James Hays: Yeah.
>>: So do you know if you might be running a post processing detector rather
than --
>> James Hays: I think that's part of it, yes. Unfortunately.
>>: So remind me what your scene match features do.
>> James Hays: So okay. Given a query, it goes and finds scene matches, and
then for those scene matches it's looking at, you know, what was the
interestingness rank of those scene matches, how many views did those scene
matches have, things like that.
>>: So you're saying your scene matches are based on trying to find things that
are geographically similar?
>> James Hays: Well, I mean --
>>: Or just similar in layout?
>> James Hays: Yeah. I mean, we're using the same features that we used for
the geographic task, that's true, but this is just kind of another axis that we
expect the scenes to be correlated on. The matching scenes are
geographically correlated; they're also quality correlated, it turns out. Not
surprising, really.
Yeah, I mean we should -- yes. So you're talking about the features used in the
scene matching, and those were the same as im2gps, right?
>>: I'm just trying to get an intuitive understanding for what's happening. Like with
Yan's, you just reminded us that it's brightness and blur and things like that, color
and [inaudible], but for yours you're using [inaudible] and some color histograms
and --
>> James Hays: Texture histograms.
>>: But you're using those features, or you're actually going out and matching --
>> James Hays: Going out and matching. And the features that go into this are
the features derived from the scene matches.
>>: So you go out, you find the closest matching images, and then those have
all been tagged by [inaudible] or --
>> James Hays: Well.
>>: Or a small subset?
>> James Hays: They all have the metadata. It's not something that -- it's
generally not something that users are explicitly tagging. Like the interestingness --
Flickr doesn't say exactly how it's calculated. It's proprietary. But it's derived
from viewing habits of people.
>>: Okay. So you're ending your training with a regressor which will be based on
those things like number of times viewed, [inaudible] how interesting the --
>> James Hays: Query, yes.
>>: The query. Right. Well, the query image generates a couple of matches
and then you're trying to find it or score --
>> James Hays: Yes, score the matches first.
>>: Learning something [inaudible] match statistics, use statistics to interesting
[inaudible] and that training is based on the ones which were actually -- that
600,000 subset, right, where you know the [inaudible]
>> James Hays: The training process happens after the scene matches are
collected.
>>: Okay. So where are the labels that go into the training process?
>> James Hays: So this is from the, whatever, could be the training or the -- so
what are the labeled examples that --
>>: So these were the things you thought were interesting based on what, on --
>> James Hays: Well, so I tried to pick out a test set.
>>: Oh, okay.
>> James Hays: You know, some that had a lot of views and some that had no
views, and some other properties.
>>: Okay.
>> James Hays: Anyway. So there's the performance. And if we look at what
the SVM -- how it's actually weighting the features in the combined classifier,
you can see it's about half Yan's features and about half my scene matching
features. And I mean, so, things like the number of tags turned out not to be
useful, or might have already been encapsulated in the Flickr interestingness score.
Distance -- increased distance to your scene matches is positively correlated with
increased photo quality. I guess maybe that's capturing something like, if it's a
unique photo, that's better.
So to make this an actual system, like you give it to a user and they pull photos
out of a collection, one last thing you need is to suppress redundant photos if
they're really taking a lot of photos of the same thing. We do that just by clustering
and taking only the best image from each cluster.
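One plausible way to do that suppression, sketched greedily and not necessarily
the exact clustering used: keep the highest-scoring photo and drop anything whose
scene descriptor is within a threshold of an already-kept photo.

```python
import numpy as np

def suppress_redundant(descriptors, scores, threshold):
    """descriptors: (N, D) scene descriptors for the collection.
    scores: (N,) predicted interestingness. Returns indices of kept photos,
    best first, with near-duplicates of already-kept photos removed."""
    kept = []
    for i in np.argsort(-scores):
        if all(np.linalg.norm(descriptors[i] - descriptors[j]) > threshold
               for j in kept):
            kept.append(int(i))
    return kept
```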
So here's some qualitative results. These are Alyosha's pictures from
Paris, and this is a random sample of them. So it's kind of a lot of urban
architecture photography. Also, here's some more samples. A lot of portraits.
And I think all of the computer vision community is in here at some point.
So here are the most interesting photos that it finds. And some of them I think
are okay, like the black and white ones. Does anybody know who that is?
>>: [inaudible].
>> James Hays: Yes. How did you know?
>>: [inaudible].
>> James Hays: The vegetarian pride thing. Yes. But these are definitely not
good photos. So I think it's kind of capturing the, oh, they're really saturated, or
oh, they look like they've been post processed to some degree. Here's the next set
of 12, the next 12 most interesting ones. I think actually some of them are
better here. I like most of these, I think. I mean, this is a little pedestrian and
maybe that's a little cluttered, but, you know, pretty good, I think. I mean, different
people might have different takes on it. Here's the least interesting photos.
Again, they're all gray. So you can see that saturation plays a significant role.
The next 12 least interesting. So one last thing I'll show is that if you have an
interestingness assessment that's working well, which I'm not sure that I do, then
you can use it as an oracle for some other image manipulation, whatever it is.
I'm just going to show it with cropping; that is, here's an original image, let's
generate lots of crops, let's pick which crops are the most interesting. And these
are the 12 most interesting, and some of them are -- I mean, these clearly aren't
good. Some of these are okay. Here's the 12 least interesting, which, you
know, this one is actually maybe okay. I think this task is harder because when it's
only doing scene matching based on parts of a single image, the scene matches
might not change that much.
>>: [inaudible]. Whole groups do minimalism in photography, and they might
argue differently than you.
>> James Hays: Yeah. So I mean there's this whole thing like, if you're a good
photographer, this isn't for you, right? If you actually have a mind to get out there,
then this isn't for you. But this is based on the consensus of the Internet crowd,
as dumb as it might be. But, I mean, you can disagree with what it thinks is
interesting for the test set, for example, but you can't really deny that that's the
consensus of the crowd, for whatever reason. Even though, you know, experts
could definitely disagree with this.
All right. So this is a work in progress. I was kind of aiming for SIGGRAPH, but the
results weren't there. So if anybody has feedback on any of this, it would be welcome.
>>: [inaudible] noticed is certain photographers get a following; no matter what
they post they get thousands of views, and I just can't believe some of the
things they post get all these positive reactions.
>> James Hays: Exactly.
>>: And I don't know how to filter that out. I look at their work and, like,
everything they post gets lots of views, and they just seem to go out and try to
collect friends.
>> James Hays: Yeah, I thought about -- I thought about trying to normalize by photographer, to say that, you know, all of this photographer's stuff seems to get viewed a lot, so maybe you can't trust that. But then maybe you do want to trust that. Maybe he's actually a sincerely good photographer.
>>: He has some good stuff.
>> James Hays: Right.
>>: Establish that.
>> James Hays: So it's hard to say. I hear you.
>>: Yeah. For that reason, at some point along the way it gets less meaningful.
>> James Hays: Yeah. And I thought about going to other data sets as well, like DPChallenge or photo.net, which have explicit ratings that are pretty reliable, I think. But then you have orders of magnitude less data, so it's not going to look like a scene matching solution at that point.
>>: [inaudible] label.
>> James Hays: Sure. In fact, [inaudible] it's kind of been this game where I think you're doing some sort of binary forced choice between which photo you like better. But again, the results then are, I think, binary forced choices, like this person thinks this image is better. I think that data would be good to look at. I need to bug Luiz about that. Yeah.
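If that kind of pairwise forced-choice data did become available, one simple way to turn "A beat B" votes into per-image labels would be a win-rate tally; this is only a hypothetical illustration, not anything built on that actual game data:

    from collections import defaultdict

    def win_rates(pairwise_votes):
        # pairwise_votes: iterable of (winner_id, loser_id) tuples from
        # binary forced-choice comparisons.
        wins = defaultdict(int)
        comparisons = defaultdict(int)
        for winner, loser in pairwise_votes:
            wins[winner] += 1
            comparisons[winner] += 1
            comparisons[loser] += 1
        # Return image_id -> fraction of comparisons won.
        return {img: wins[img] / comparisons[img] for img in comparisons}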
>>: [inaudible] good for testing. It wouldn't be good for training [inaudible].
>> James Hays: I did actually. I did actually test on there.
>>: Okay. Did you get similar results?
>> James Hays: No, actually, I do worse, and Yan does about the same, so in
fact, Yan's features do better. That's what he developed his algorithm on.
>>: Yeah.
>> James Hays: And I'm slightly worse. And if I combine them, I really only very incrementally improve over his performance. So it's not -- it's not clear. So, yeah.
>>: Instead of how many, it might be interesting to see who is giving positive feedback.
>> James Hays: Yeah, I don't think I have access to that.
>>: [inaudible] better taste than others so if they're bothering to comment on
which one ->> James Hays: Comment.
>>: [inaudible] social network.
>>: [inaudible] photographer to comment.
>> James Hays: Yeah.
>>: So you wouldn't go to, you know, things for the average user, right? I mean,
you could be [inaudible] artist like [inaudible] right now. [inaudible].
>> James Hays: You want to be the most mainstream, cliche photographer possible? Yeah.
>>: [inaudible].
>> James Hays: Yeah. If you don't like the philosophical idea behind this, I
completely understand.
>>: You know, when you first talked about taking all of the photos and having systems suggest to us when you have too many, I mean, I got very excited. I think it would be wonderful to develop such a system. It's a hard task right now. I don't know if the quality of your results -- you know, is this state of the art useful enough to deploy today, or is it just not -- you know, it's a very subjective decision, but basically being useful enough means that for your target audience, which is the audience that doesn't have time to go manipulate their photos or select them, people like [inaudible], would it be good enough to trust doing [inaudible] to the system ->> James Hays: Right.
>>: [inaudible] suggest and so you don't [inaudible] that includes the clustering
suggests that -- you mentioned clustering has to be done. Did you do that
[inaudible]?
>> James Hays: Yes. Yes. So there are still photos that look pretty redundant because the clustering wasn't that extreme.
>>: [inaudible].
>> James Hays: Exactly. There's two Notre Dame shots and then there's also two of these shots. I mean, but -- so there's a different aspect ratio. In this case, there's enough translation such that the descriptors are different. I mean, I tried tuning it a little, and I can get rid of those duplicates -- oh, plus these are duplicates as well. But again they're from different enough vantage points that they didn't cluster together.
>>: [inaudible] another thing you can do is [inaudible] version, you know, you show this to the user, they click [inaudible], you redo everything, and you could make it more personalized.
>> James Hays: Right. Personalization certainly makes sense. That would be --
>>: Some day I will be just like everybody else.
>> James Hays: Yeah. Yeah. I'm not sure how to couple that personalization
with the scene matching process.
>>: You can see, the simplest thing would be to save the ranking with your database, show it to the user, get a reranking, and then you could [inaudible].
>> James Hays: Just go from there.
>>: [inaudible], you know. That's the difference.
>> James Hays: That's true.
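A rough sketch of that save-the-ranking-and-rerank suggestion, with made-up names: blend the crowd-based score with a per-user adjustment inferred from which photos the user moved up or down when shown the ranking:

    def personalize_scores(base_scores, user_feedback, weight=0.5):
        # base_scores:   image_id -> crowd-based interestingness score.
        # user_feedback: image_id -> +1 (moved up), -1 (moved down), 0.
        # Returns blended scores used to rerank for this particular user.
        return {img: (1 - weight) * s + weight * user_feedback.get(img, 0)
                for img, s in base_scores.items()}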
>>: The post-processing issue is interesting, too, because I noticed a lot of [inaudible] Flickr posted image, and then it's -- it seems really popular, and then they go post the original. It's like you look at the two and, wow, I didn't even see that.
>> James Hays: Really?
>>: I would have never thought ->> James Hays: That would have been very helpful to be able to couple those.
>>: Yeah. Plus process it that way.
>> James Hays: Really.
>>: It seems, yeah.
>>: [inaudible] image potential [inaudible].
>>: It's the one in the lower right tagged HDR, so it looks very typical of the kind of extreme [inaudible] in the halo that ->> James Hays: Right. I hate this. But they love this. They love to show off their halos. I don't know if it's tagged with HDR. I didn't check.
>>: [inaudible].
>> James Hays: Yeah. It's not quite as pronounced. I'm not sure that people even -- like maybe they could have had the dynamic range they needed with the original shot, but they exposure bracket and do this just because it's like the trendy thing to have. So it's very much against everything that SGF has been going for over the last five years, which is interesting.
>>: Yeah, I don't even know if [inaudible] similar to just excessive [inaudible]
because if you do [inaudible] you also get that ->> James Hays: Yeah. I think it's -- I mean, my guess is they manually segmented out the area they wanted, feathered it a large amount, like with a radius of a hundred, and then composited.
>>: Yeah, I think that's [inaudible].
>> James Hays: I don't ->>: It might be.
>> James Hays: No, I think.
>>: [inaudible].
>> James Hays: I'm going to argue it's from selection feathering. But I could be wrong.
>>: [inaudible].
>>: Another possible direction, if you can get your client to go that way, is to, you know, not exactly do what Luiz does, because he always sets up games, but to basically release something on the web and then gather data as people use it,
right, so you have your ->> James Hays: True.
>>: Your photo paper, right, and you release this little -[brief talking over]
>>: Exactly. You upload it or they -- or you put it in the directory. I think I'll
[inaudible] but you don't float your photos maybe it's good about unfloating small
[inaudible] works fast and it gets you kind of a little stack, clusters. You would
have to think about what the service [inaudible] but if you have it as a web
service then you can capture information, right, equal to cluster things that would
give you your best [inaudible] the stack of things that were clustered and say
[inaudible] I'm not sure, it would be almost like [inaudible] but because you do
that on the web you could use this as part of your research because you get
feedback.
You can do that kind of thing internal by just having ten of your friends use the
system, or you can launch the service, run it for a few weeks [inaudible].
>>: Another suggestion [inaudible] that we could -- this image of the house is part of a cluster, right, and someone -- you look at someone's picture and it looks like some picture in that cluster, but this one is the, you know, highest ranked one in the cluster. You could tell the user, you know, you could -- you could possibly, you know, change your photo to look more like this, you know, then you
have lots of friends and ->>: Here's creative suggestions of many ->> James Hays: Right.
>>: Here are a few examples.
>> James Hays: Right. This is the idea of forcing them to be more like the positive images, not just selecting them as they are. Like you suggested, also, the idea of being able to learn the mapping between pre-process and post-process would be even better, but that training data has got to be hard to come by.
>>: [inaudible] instrument something like [inaudible] photo gallery to -- I mean there the users start with a huge collection of images and then you upload certain ones, so that's pretty valuable information ->>: That's true.
>>: Which ones [inaudible].
>> James Hays: [inaudible] Google.
>>: Yeah, that's if you want the [inaudible], but if you want the [inaudible] research project, you know, it's kind of explicit, and if people find it useful they can use it [inaudible] capture this data.
>> James Hays: Well, do you think somebody like Google -- would Google be comfortable with training from their data? I mean, is that an invasion of privacy, to train from ->>: [inaudible].
>>: I don't think you could get [inaudible].
>> James Hays: I would have to be interning there or something. Post-doc.
>>: I don't know. I mean, I know like with the photo gallery you can write your own plug-ins, so you could plug it in yourself. So you wouldn't even have to -- invoke the company's privacy policy.
>>: Isn't it the case that with Google images they preserve [inaudible] the web album, they preserve the rights, [inaudible] or anyone can download it? [inaudible] something like that for a while.
>> Larry Zitnick: Okay. Thank you.
>> James Hays: Thank you so much.
[applause]