>> Rick Szeliski: Good morning, everyone. It's my pleasure to welcome Svetlana
Lazebnik from the University of North Carolina Chapel Hill to give us a talk this morning
about her research.
Lana got her Ph.D. last year from the University of Illinois, actually two years ago, the
University of Illinois at Urbana-Champaign, supervised by Jean Ponce and Cordelia
Schmid. And she's one of the world's leading experts on object recognition. And I think
you'll find today's talk quite interesting and see the connections to the work we're doing
here on Photosynth.
>> Svetlana Lazebnik: Thanks, Rick. So I will talk about some of the projects I started
this year since moving to UNC and the main theme is modeling and organizing large
scale Internet photo collections.
So in the last few years many computer vision researchers have become fascinated by
photo sharing websites like Flickr. So there's a lot of websites out there. Flickr is the one
I know the best and the one that gets used the most in published literature. So I
personally have a Flickr account. I put my pictures on there and I spend a fair amount of
time just sort of browsing and playing around with it.
So here are some numbers that are, I think, current as of the end of 2007. So Flickr is
very large in terms of the number of images, the number of users, the volume of images
that gets uploaded daily.
I think now they also are beginning to allow short video clips, and there's very rich tag
data that goes along with the images. There is geotagging and so on and so forth.
So it's a sort of rich and interesting test bed. So from my personal experience of playing
around with the Flickr interface and just trying to learn about what kinds of pictures are
out there, it's very often frustrating.
So the main thing that users do, or at least the main thing that I do when interacting with
Flickr, is basically keyword searches. So here's a very simple example. I do a keyword
search for apple, just to try to find out what comes up. What does the universe of
apple images look like?
And here is pretty much the best way that you can get a summary of the results in Flickr
as far as I know. You basically display thumbnails of the most relevant search results.
The most interesting search results kind of give you very bizarre pictures that might not
be apples at all. And most recent, obviously, is not a very representative sample.
So most relevant is about the best you can do there, and you can see that, well, there are
a few images that are unquestionably apples, like this one over here. Then there's
images having to do with Apple computers and Apple brand products. So you can see
that there's kind of a (inaudible) going on. So apple has obviously two dominant
meanings. But, then, even in these top 24 results there's a lot of garbage.
There are images that might, somehow, if you look up close, be relevant to one or the other
sense of apple. But, you know, this is not really the first kind of search result that you
want to see. They're just not very salient.
Here's another example for Statue of Liberty. So here in the top 24 we get a lot of salient
results. At a glance we can see fairly easily what the Statue of Liberty looks like. But,
once again, even among the top six most relevant images retrieved, five of them you can
hardly see the statue at all. And also many of the results in this top page -- this was
retrieved maybe a couple of days ago -- are basically from the same user.
So in the second row like three of the middle images are from the same user, Scan G
Blue, and here's some more from the same user.
So we're getting basically a lot of very similar images. Instead it would be much better to
see sort of different salient aspects of the Statue of Liberty or images that look very
similar when taken by a variety of users.
So we want to see a more representative slice as opposed to just one user's picture
being overrepresented on this page.
Also, from this page it's hard to get a sense of whether there is just one unique Statue of
Liberty or whether there are several copies of the statue in different parts of the world.
Because if there are, and you will see later that there are, this is something we want to be
able to see up front.
So from these results, it's easy to see what the Statue of Liberty looks like, but you don't
know if this is all there is to the Statue of Liberty. Are there other interesting views? Are
there other statues? This is not necessarily complete.
Of course, you know, the top five images, five of the top six, are kind of garbage.
So this is what comes up again and again when interacting with Flickr: their interface is
so simple that, not just with these queries but with any kind of query you want to issue,
it's very hard to get a complete, at-a-glance summary of what a given category looks like.
What are all the relevant themes or motifs or iconic images for a given category? So this
is what we want to do. This is what I will talk about here, starting with a large noisy pool
of photographs basically downloaded by issuing a query search such as for the Statue of
Liberty or apple. First of all, I want to sort out the good pictures from the bad pictures.
So which pictures really do represent the concept of interest and which ones are just
noisily tagged or just spurious.
And secondly, we want to very compactly summarize the set of relevant images. So we
want to find iconic images that are somehow especially representative or salient of the
subset of the good subset of the query results.
And these two problems are actually very interrelated. So finding iconic images, my
second point here can actually help to do the first.
So it's not necessarily that first we have to sort the good images from the bad images and
then cluster the good ones to get iconic images.
We can actually start by somehow getting iconic images and that will help us to
determine what are the other good images for the categories. So these two problems are
really very tightly interrelated. So how exactly can we define an iconic image? So there
is no unique answer to this. And let me just show briefly what people have done with this
concept in the literature.
So one of the first places where you see this idea of an iconic or canonical image mentioned is
the psychology literature. In the psychology literature, the work by Palmer et al. was
concerned mainly with single rigid 3-D objects. If I give you a picture of a toy horse or a
telephone or some common object, what are the aspects of that object that you recognize
the most easily, that you think are the most characteristic? These are the kinds of
questions the psychology literature has tried to answer.
So this aspect of the horse might be relatively easy to recognize and relatively
characteristic as opposed to let's say a view of this horse from the top or from the back.
So there are certain views of 3-D objects that people find easier to imagine and to
recognize, and others that are less common or more bizarre views.
So in the computer vision literature, people have tried to quantify these ideas: once again,
given a known 3-D model, how can we compute some sort of common or representative
views? Instead of a 3-D model, we can try to do that with a large set of pictures taken
around the viewing sphere of an object: how can we find a subset of views that is
representative?
More recently people have looked at iconic or canonical images for products and logos --
obviously, kinds of imagery that have commercial significance. And that is not too difficult
to do, because images of products and logos tend to be fairly simple and stylized.
The kinds of product images you tend to search for on the web would have a very large
picture of the product of interest with relatively little (inaudible) and so on.
In this case, an iconic image would be an image that basically has as little clutter as
possible and is similar to many other images.
So if you're looking for images, let's say, with the Apple logo, you can use local features
to cluster images, all of which contain the Apple logo, and then find the iconic image that
basically has the most matches to all the other images.
And, for example, the fewest nonmatching features. So it would be the image that has
the most salient instance of this logo that also occurs in a lot of other images. So people
have also tried to extend essentially the same ideas to landmarks, like Statue of Liberty,
Eiffel Tower and so on.
So a lack of clutter, having a very salient representation of the landmark of interest that
takes up most of the image that maybe is not occluded, that has a lot of features that can
match to other images of the landmark, this is how an iconic image for landmarks can be
characterized.
So in some work, for example by Berg and Forsyth, who I think were among the first to look at
iconic images for landmarks, this is their key criterion: an image that has a clearly
delineated instance of the object. And the way that iconic images are found in this work
is by trying to do foreground-background separation, basically to isolate the foreground
that contains the object and to look for images that have a lot of this salient foreground.
And in other work, from Ian Simon at UW, canonical images for landmarks are defined in
terms of basically 3-D structure. So a landmark is just an instance of these single 3-D
objects; it's a unique object in the world, so it's characterized by rigid 3-D geometry.
So we can apply essentially many of the ideas that people have used for single 3-D
objects to deal with landmarks. 3-D geometry, the distribution of viewpoints in space, plays a
very large role.
Finally, there's another category of categories -- category of visual classes that hasn't
been looked at a lot but that actually intrigues me. So general categories, not single 3-D
objects, not products, not landmarks, more or less abstract categories. Let's say love,
beauty, things like that.
Does an iconic image even exist for such a category? And for these kinds of categories, 3-D
geometry is no longer meaningful. Even foreground-background separation is not
necessarily meaningful as it is for products and landmarks, because the whole
image, like this image of a rose, might somehow represent love.
And there's no point in trying to distinguish foreground from background because it's not
a concrete concept.
So this is something on which there hasn't been a lot of work. But, you know, this is kind
of an overview of the kinds of criteria that people have used to find the iconic images.
Okay. So in my talk I will deal with two kinds of imagery. One is landmark collections,
Statue of Liberty, et cetera, and for this, the viewpoint we take in our work is very much in
line with the Photosynth work and with the UW photo collection summarization work: that
3-D geometry, multi-view geometry, is really the most important cue that we can use to
deal with landmark image collections.
So there is other work on landmarks that tries to view them essentially as 2-D images, let's say
just using gist feature matching or bags of features in 2-D, and in our opinion that is just not
strong enough.
We really do need to bring in multi-view geometry, we need to bring in strong 3-D
geometric constraints, to successfully recognize landmark images and to find out which
images are representative -- so which images, for example, come from a tight cluster of
similar 3-D viewpoints in the world.
So this is very much a 3-D approach and not just a 2-D image clustering approach.
However, enforcing 3-D geometric constraints is very computationally expensive.
Extracting features, doing RANSAC between every pair of images to estimate
homographies or fundamental matrices, et cetera -- if we do this for every pair of images
it's quadratic in the size of the collection. So it's not very scalable. So enforcing 3-D
geometric constraints and figuring out the 3-D relationships between images is not
very scalable.
So what we want to do is start by 2-D image clustering to sort of preprocess the image
collection and to group images that are already very likely to come from similar
viewpoints. So even though we completely believe in 3-D geometry for landmarks, we
want to start with fast 2-D image clustering, and I will explain in more detail to make this
more scalable.
And finally, summarization, as in Ian Simon's work, has been done very successfully as
basically a byproduct of reconstruction, of estimation of all the 3-D geometric
relationships. If you start by estimating all the 3-D points of the landmark in the world
and all the camera viewpoints from your image collection, you can then cluster these
viewpoints to get iconic images as the centers of clusters.
But that requires that we perform all the reconstruction and all the pairwise
geometric matching first. That takes a lot of time. So we actually take the opposite
approach of doing summarization first quickly to give us a good initial guess for the 3-D
structure of the world.
So we kind of invert the process. We do summarization first and out of that 3-D
reconstruction also more or less falls out. And in the second part of the talk I will talk a
little bit about this general category question, how can we deal with just categories like
love, beauty and so on.
So here we don't have a lot to go on. No 3-D geometry or anything, so as you will see in
that part of the talk we'll have to rely on semantics a lot more as captured by image tags.
Okay. Any questions so far? All right. So let me talk then about our work on landmark
image collections; that will appear at ECCV this fall. So we start with a large
collection, typical size about 50,000 images, although in principle we can
download a lot more. Typically 40 to 60 percent of this initial collection, obtained through
keyword search on Flickr, is noise. So for a search for Statue of Liberty, 40 to 60%, I don't
remember the exact number for the data set, are images that don't have the Statue of
Liberty in them at all. So it's a very noisy data set. And as a first step we perform
clustering using 2-D global image descriptors. We just find images that look almost the
same in 2-D, that pixel by pixel look more or less the same.
And the idea is if these images are so similar, most likely they come from very similar
viewpoints anyway.
So we do this clustering followed by some geometric verification to make sure the images
do come from the same viewpoint to get a much smaller subset of these representative
images from clusters or iconic images.
So we'll go from about 50,000 images to maybe a couple of hundred of these iconic
images following this clustering stage.
>>: Do iconic images have to have multiple images backing them, or can they have a single one?
>> Svetlana Lazebnik: Sorry?
>>: Your iconic images are images that are clustered together because they appear
similar enough. Can you have clusters of size one, or do you have to have a certain
minimum?
>> Svetlana Lazebnik: In practice I think the threshold is about eight. I will show in a
couple of slides exactly how we do the geometric verification, but in order to be able to
do the verification, we have, you know, some sort of minimum size. And eight is what we
use.
Okay. So we have these iconic images and next we want to figure out what are the
viewpoint relationships between them. So we match every pair of iconic images and try
to estimate the epipolar geometry between them, to figure out how the aspects relate
to one another.
Then we go from just a pile of iconic images to a graph where edges represent two-view
relationships.
And we cut this graph basically to find groups of images that see more or less the
same part of the landmark. From these components we can perform -- whoops. We
can perform, for example, 3-D reconstruction. So these components allow us to perform
reconstruction very efficiently, because they would be a lot smaller than the whole graph,
as I will explain.
But also we can use these components as a natural browsing structure for the
landmarks, as I will also explain. So this is kind of a brief overview of the approach. And
now let me give you more of the details of every step. So the first step of
appearance-based clustering, basically this is one of the end results of the step. And you
can see the kinds of images it looks for: images that almost pixel by pixel are the same.
So for this we use so-called gist descriptors that were developed by Aude Oliva
and Antonio Torralba and have been popularized in recent years in the graphics and
vision literature by Alyosha Efros.
So here you can see on the left just some generic image of a forest, and on the right
a synthetic image that has the same gist descriptor as this forest.
And you can see that this gist descriptor captures the coarse texture
characteristics of the image: the mean and the directionality of textures, where the
various bars or strong edges in the image are, and so on. And this descriptor is roughly a
thousand-dimensional, and it's obtained by taking a steerable filter transform of the
image and then averaging the filter responses over very large areas, different more or less
quadrants of the image.
And this is the kind of information that it captures. So it turns out that by extracting these gist
descriptors from the entire image and doing k-means clustering on them, we are fairly
successful in finding groups of very similar-looking images.
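A minimal sketch of this clustering step, assuming the gist descriptors have already been computed by some implementation and using scikit-learn's k-means; the number of clusters here is an illustrative choice, not necessarily the value used in the actual system:

    # Sketch only: k-means on precomputed gist descriptors.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_by_gist(gist_descriptors, n_clusters=1000):
        """gist_descriptors: (num_images, D) array of global descriptors."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(gist_descriptors)
        clusters = []
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                continue
            # sort members by distance to the cluster center so the top few
            # can be handed to the geometric verification step
            dist = np.linalg.norm(gist_descriptors[idx] - km.cluster_centers_[c], axis=1)
            clusters.append(idx[np.argsort(dist)])
        return clusters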
But, of course, we can't trust just the output of the gist. So in fact in this example here
you can see there's one outlier. There's one image that doesn't really match the others,
but got clustered with them anyway.
We have to make sure this kind of thing doesn't happen. So we can't just trust gist alone.
After we do the gist clustering, we have to go to every cluster and make sure that the
cluster does represent a set of images that are geometrically consistent, that have the
same sets of features viewed from more or less the same viewpoint.
So for this we take the top few representatives from each cluster, meaning the images
that are closest to the center of the cluster -- we use eight in our implementation. So we
take the top eight images closest to the center and perform pairwise matching between
them. We extract SIFT features and do RANSAC to estimate either a fundamental matrix
or a homography, depending upon the relationship between the images.
And then the number of inliers to this relationship tells us basically whether these images are
geometrically consistent or not. And then, among these top eight images, we simply sum
up all the inliers each image accrues by being matched to the remaining images, and the image
that gets the most inliers is declared the iconic image for the cluster.
And there's also a threshold on this total sum of inliers, so that if there are too few inliers among
the top eight images, we can decide that the cluster is not geometrically consistent and
throw it out.
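A rough sketch of this verification step, using OpenCV's SIFT and RANSAC routines; the ratio-test value and the inlier thresholds are illustrative assumptions, not the values from the actual system:

    # Sketch only: SIFT + RANSAC verification of a gist cluster (OpenCV).
    import cv2
    import numpy as np

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()

    def pairwise_inliers(img1, img2, min_matches=8):
        """Return the number of RANSAC inliers between two grayscale images."""
        k1, d1 = sift.detectAndCompute(img1, None)
        k2, d2 = sift.detectAndCompute(img2, None)
        if d1 is None or d2 is None:
            return 0
        pairs = matcher.knnMatch(d1, d2, k=2)
        good = [m[0] for m in pairs
                if len(m) == 2 and m[0].distance < 0.8 * m[1].distance]
        if len(good) < min_matches:
            return 0
        p1 = np.float32([k1[m.queryIdx].pt for m in good])
        p2 = np.float32([k2[m.trainIdx].pt for m in good])
        # Try both a homography and a fundamental matrix; keep the larger support.
        _, mask_h = cv2.findHomography(p1, p2, cv2.RANSAC, 3.0)
        _, mask_f = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 3.0, 0.99)
        best = 0
        for mask in (mask_h, mask_f):
            if mask is not None:
                best = max(best, int(mask.sum()))
        return best

    def select_iconic(top_images, min_total_inliers=100):
        """top_images: the few cluster members closest to the gist center."""
        n = len(top_images)
        score = np.zeros(n)
        for i in range(n):
            for j in range(i + 1, n):
                s = pairwise_inliers(top_images[i], top_images[j])
                score[i] += s
                score[j] += s
        if score.sum() < min_total_inliers:
            return None                  # cluster judged geometrically inconsistent
        return int(np.argmax(score))     # index of the iconic image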
Empirically, we have found that this is basically sufficient. Even if you have a very large
cluster you only take the top very few representatives and do this matching. It really
pretty much tells you everything you need to know about the cluster.
And I will show later we have quantitative evaluation of this in terms of recall precision
curves. After verifying each cluster, keeping the representatives or iconic images from
the geometrically consistent clusters, we might get a set of images that looks like this.
This would be our iconic images and we can see that among them there are some that
still look almost the same, like a few of these images on top.
And for a few of the other ones, clearly there's been some overclustering, some
oversegmentation, because the clustering is just far from perfect. Some images
that could really have been clustered together didn't get clustered together, and so on.
Some of the images represent the same viewpoint but don't look the same in 2-D. Either
because of clutter like a person standing in front of one of the images, or maybe the
lighting is too different, or maybe there's too much camera zoom or viewpoint change for
gist to work.
So now I have to fix all this. Now we have to basically verify the geometric relationships
between the iconic images. So we match every pair of iconic images to each other.
Once again to see whether or not there's a two-view relationship between them.
And this gives us the graph structure. This gives us the so-called iconic scene graph
where the nodes are these iconic images representing their clusters and the edges
represent either homographies or fundamental matrices and the edges are also weighted
by the number of inliers to this transformation.
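A small sketch of how such an iconic scene graph could be assembled with networkx, reusing a pairwise_inliers helper like the one sketched above; the edge threshold is an assumption:

    # Sketch only: the iconic scene graph as a weighted networkx graph.
    import networkx as nx

    def build_iconic_scene_graph(iconic_images, min_inliers=18):
        g = nx.Graph()
        g.add_nodes_from(range(len(iconic_images)))
        for i in range(len(iconic_images)):
            for j in range(i + 1, len(iconic_images)):
                inl = pairwise_inliers(iconic_images[i], iconic_images[j])
                if inl >= min_inliers:           # illustrative threshold
                    g.add_edge(i, j, weight=inl)
        return g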
So this graph -- unfortunately these graphs have been pretty hard to
visualize. So here at the top you can kind of see an attempt to visualize the graph. And there are
two types of edges; I don't remember which is which, but one of them is homography and the
other is fundamental matrix.
And nodes are images. So this is what the graphs tend to look like. They look fairly
messy. But one thing you'll usually find is that they have very tight clumps -- very tight
clumps representing very popular views, like the frontal view of the Statue of Liberty, and
maybe several of them if there are several dominant aspects.
And there might be a lot of weakly connected components or isolated nodes
floating about.
So we have to analyze this connectivity. So, first of all, with isolated nodes, because
we've already tried to estimate the geometric relationships between the iconic images,
we have an idea that if there's an isolated node in the graph it doesn't match anything
else.
So this is already a pretty strong hint that this might be garbage and this might be a set of
images that doesn't truly belong with the landmark and should be thrown out.
So for this, this is the only place where we use tags for our work on landmarks. We look
at isolated nodes and try to match their tag distribution to the distribution of the tags for
the largest graph components. Because the tags, the most popular tags associated with
the largest graph component, we have a pretty good idea that this is really a very good
sample of tags that describe the landmark well.
So by taking the tags from the isolated nodes and seeing if they appear in this list, we
can basically quickly check, are the tags associated with this isolated node garbage or
not.
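A minimal sketch of this tag-based check, assuming each node carries a set of Flickr tags; the size of the reference tag list and the required overlap are made-up parameters:

    # Sketch only: keep an isolated node only if its tags overlap the most
    # frequent tags of the largest connected component.
    from collections import Counter
    import networkx as nx

    def filter_isolated_nodes(graph, node_tags, keep_top=200, min_overlap=2):
        """node_tags: dict mapping node id -> set of Flickr tags."""
        components = sorted(nx.connected_components(graph), key=len, reverse=True)
        reference = Counter()
        for n in components[0]:                     # largest component
            reference.update(node_tags[n])
        good_tags = {t for t, _ in reference.most_common(keep_top)}
        keep = [n for n in nx.isolates(graph)
                if len(node_tags[n] & good_tags) >= min_overlap]
        return keep                                 # isolated nodes that survive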
And this gives us a filter that we can use to basically eliminate a lot of these isolated
nodes, because what we have found is that there will be a few sort of sets of near
duplicates in the data set, just very, very similar images that were all taken by the same
person, and just somehow spuriously tagged.
So most of their tags are garbage. And we can recognize that pretty easily. So this is
just a filtering step to clean up some of these stray graph nodes. But what we're really
interested in are the tightly connected components of the graph, because these give us
the most salient, characteristic aspects of the landmark. So what we do is we take this
graph, which has a loose structure, and run normalized cuts to get these
clumps, these tightly connected subgraphs, which really end up corresponding to very
similar aspects.
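A sketch of the partitioning step; scikit-learn's spectral clustering on the inlier-weighted adjacency matrix is used here as a stand-in for normalized cuts, and the number of components is an assumption:

    # Sketch only: partition the scene graph into tightly connected components.
    import networkx as nx
    import numpy as np
    from sklearn.cluster import SpectralClustering

    def partition_scene_graph(graph, n_components=10):
        nodes = list(graph.nodes())
        adj = nx.to_numpy_array(graph, nodelist=nodes, weight="weight")
        labels = SpectralClustering(n_clusters=n_components,
                                    affinity="precomputed",
                                    random_state=0).fit_predict(adj)
        return {c: [nodes[i] for i in np.where(labels == c)[0]]
                for c in range(n_components)}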
So I will show the Statue of Liberty results in a little bit. But then what we can do with
these components is we can use them for structure from motion. So basically one of the
big reasons to work on components as opposed to the whole graph is that they're much
smaller. So structure from motion, and especially global bundle adjustment, can be
done much more efficiently on subgraphs as opposed to on the whole graph.
So we basically process each component separately. So we use the graph structure: we
find a maximum weight spanning tree, where the weights, if you will recall, are given by the
number of inliers. So obviously, to incorporate images into the model, we
will want to go by images that have a lot of inliers to other images.
So we use a maximum weight spanning tree to determine the order of incorporating
images into the 3-D model. And finally, we don't forget about the connections between
the components.
the components.
It could be that there were a few weak connections that were cut when we did the
normalized cuts and we don't throw those out. In the end, when we separately
reconstruct each component, we can go back to the cut edges and use the matches
along those edges to merge the component models.
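A minimal sketch of the ordering step on one component, using networkx; the choice of the root image is an assumption:

    # Sketch only: order images for incremental reconstruction by walking a
    # maximum weight spanning tree of one component (weights = inlier counts).
    import networkx as nx

    def reconstruction_order(component_subgraph):
        tree = nx.maximum_spanning_tree(component_subgraph, weight="weight")
        if tree.number_of_edges() == 0:
            return list(tree.nodes())
        # Start from an endpoint of the strongest edge, then add images in
        # breadth-first order along the tree.
        u, v, _ = max(tree.edges(data="weight"), key=lambda e: e[2])
        return [u] + [b for _, b in nx.bfs_edges(tree, u)]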
And finally, once we reconstruct each component and do as much merging as we
can, this gives us a good 3-D, sort of Photosynth-style representation of the landmark,
and it's built using relatively few images.
It only uses the iconic images. So if we started with 50,000 images, maybe half of which
actually contain the landmark, at this point we'll end up with a model that uses a couple of
hundred iconic images.
But we can fix that. We can enlarge the models by registering additional noniconic
images. And this is very easy to do.
So let me show you a video that kind of shows some of the models that we get. So
unfortunately I don't have the models in their (inaudible) format but in this video you can
see we have three data sets. So this is the Statue of Liberty. And here's sort of a brief
screen shot of how merging works. These are different aspects of the Statue of Liberty
that still have a lot of overlap so they can be merged.
And this merging the matching and the estimation, this is all automatic based on the
graph structure. So this is a bigger merged Statue of Liberty.
And you can see there are a few outliers. So we haven't done a lot to make our models
perfect. And this is our second data set, San Marco -- Piazza San Marco in Venice. This is
a big component that includes the front of the cathedral. And the third data set is Notre
Dame in Paris.
So that was a dominant model. And this is the model of the back of Notre Dame, which
is hard to merge with the front. So this kind of gives you the sense of the 3-D models that
we get using our approach. So there are a few outliers. And we really have done a lot
less to cosmetically clean up the models and make sure that there are no outliers,
no incorrect matches and so on, but you get the idea that by and large the structure
found by the algorithm is pretty much correct.
>>: Did you -- for clustering the things that had similar views, did you find that
sometimes there wasn't enough baseline for structure from motion?
>>: Svetlana Lazebnik: So within a cluster, there's not a lot of baseline. But between
clusters -- when we match the iconic images to each other, that is where
we get the baseline most of the time, from different iconic images.
And even there, you know, there's a good number of homographies among those as well.
You know, so there isn't always sort of -- I mean in some cases it's true that it's hard to
get a right baseline. But overall, I think the biggest problem still remains is when the
baseline is too wide and you can't merge things. So I think that is kind of a bigger
problem for us than not having enough images with a good baseline.
So this is 3-D reconstruction, and it shows what we can do with the iconic images. That
the iconic images basically have enough information in them to permit the 3-D
reconstruction of the landmarks. And the second thing, besides reconstruction that the
iconic images allow us to do, is browse the collection. So we have a three level browsing
structure. So at the first level we have the different components that we obtained by
partitioning the iconic scene graph.
So each component, it includes a number of iconic images. So here we expand one
component and these are just iconic images from that single component. And you can
see that some of these images look very close to each other even in 2-D, so
they're almost identical aspects.
But others include a lot of zoom, sort of viewpoint change and so on, but there's still a lot
of overlap in terms of visibility, in terms of what all these images can see. So the
components represent sort of different iconic images that still seem more or less the
same part of the landmark.
And then we can expand each iconic image to get a cluster that was obtained originally
by gist clustering and that all contains images that are almost the same as the iconic.
So this is a three-level browsing structure for the data set. And I can show an example -- yes?
>>: In that, can you go back to the front? I'm surprised, like the image with the sunset in
it, I'm surprised you got enough features on the Statue of Liberty there at all to pass any
sort of geometric verification.
>>: Svetlana Lazebnik: Yeah.
>>: How did that -- how did that pass?
>>: Svetlana Lazebnik: Well, we work with reasonably high resolution images. So it's a
thousand by whatever. But there wouldn't be a whole lot of features there.
>>: The second to last column, the third one down.
>>: Svetlana Lazebnik: Like this one? Yeah, so I can't 100 percent guarantee that all of
them are absolutely correct. But I think there's still enough maybe features around sort of
the outline and our threshold for the number of inliers, I think, was reasonably low. Like
maybe 15 or so. So it's possible you can get 15 or so features out of this.
But in general -- and using even higher resolution images I think would help even more.
So it was kind of a compromise. If you use the original resolution images, you can
spend forever extracting SIFT features and doing the matching and so on. But on the
other hand, if you go to low resolution, you can lose images like that.
So we haven't completely found the right trade-off. So I think we use the 1024-by-whatever
resolution from Flickr. But as I'll show a little later, it's true that getting enough
features for the Statue of Liberty was sometimes problematic.
But let me show examples of browsing for Statue of Liberty. So this is level one where
we have all the different components of the statue. So this, for example, is the face of
the statue which is kept in the museum or something. So you can see that these are all
the images in the same gist cluster.
>>: I have a question about that. Because the original gist clustering gives you a load of garbage,
but at this point you've only filtered out the --
>>: Svetlana Lazebnik: Actually, one thing I forgot to mention is that these are already
filtered gist clusters. So the garbage -- once we decide what the iconic
image for the cluster is, it's relatively quick to take every other image and verify it
geometrically against the iconic.
Obviously it's much quicker than doing pairwise verification of the whole cluster, where
we match every image against every other.
So what we do instead is first do pairwise verification of the top eight, decide the iconic,
and then just verify every remaining image against the iconic. So that's basically linear in the
number of images in the cluster.
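A sketch of this linear clean-up pass, reusing a pairwise_inliers helper like the one sketched earlier; the inlier threshold is an assumption:

    # Sketch only: verify every remaining cluster member against the iconic.
    def clean_cluster(iconic_image, member_images, min_inliers=15):
        kept = []
        for idx, img in enumerate(member_images):
            if pairwise_inliers(iconic_image, img) >= min_inliers:
                kept.append(idx)
        return kept          # indices of members consistent with the iconic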
So what you see here is the clusters that have already been cleaned up by throwing out
all images that don't match the iconic. So the biggest component is what you have
already seen is the more or less frontal view of the statue.
And these gist clusters following verification are really very clean. There's almost no
mistakes in them. And in fact the second component for the Statue of Liberty, the second
largest connected subset of the graph is Las Vegas. So this is the Statue of Liberty in
Las Vegas.
So you can see that all of these images have the same sort of distinctive background and
it's probably this background to a large extent that enables us to, you know, connect all
these images together and to recognize this Las Vegas component.
This is a nighttime Las Vegas component. So sometimes we get lucky and we can
match night views to day views. But most of the time they're too different, both for gist
and for SIFT features. So we really end up with different components for night and day.
And then there's also a third component for Tokyo. So we get a pretty large component
for Statue of Liberty in Tokyo. And once again the background is distinctive enough so
that we can distinguish it.
There's a nighttime Tokyo component. So you get the idea what the results are like. And
I think there's one component that basically is incorrect. So this is a component that
doesn't truly have the Statue of Liberty. But its tags -- so I mentioned that we tried to use
some tag-based filtering to remove isolated components, but oftentimes the tags are not
informative enough. So New York City tags are very common for the Statue of Liberty.
So we just don't have enough information to throw this out. So this gives you an idea
what kind of browsing this iconic scene graph enables.
>>: Is there one in Seattle?
>>: Svetlana Lazebnik: There's one in Seattle?
>>: (Inaudible) Beach.
>>: Svetlana Lazebnik: Oh, really. I know there's one in Paris, and that one hasn't
shown up either. But that one is hard to photograph. You can't really get up close to it.
So I don't know I haven't gone to the data set and seen how many images of the Statue
of Liberty in Paris there are.
So there might be a few more like in Seattle or wherever else that just don't show up.
And you also have to keep in mind that our approach is based on clustering. So it's
pretty brute force and it looks for modes, for very large clumps of images.
If you only have a few images of the Statue of Liberty in Seattle or the one in Paris, it
might end up just losing them. So this is one sort of disadvantage of this
clustering-based approach.
It's very good at finding like the very salient and dominant components but it might throw
out a lot of smaller components that could still be very interesting.
>>: So the (inaudible) structure from motion reconstruction, the very first one, when you were
doing it independently in each cluster, did you find that certain clusters didn't have enough
parallax, too small a baseline?
>>: Svetlana Lazebnik: So we're reconstructing from components. I think he asked a
similar question. So if you look at the images inside of a component, like this whole
collection, I think there's enough parallax between them in order to do a good job. Some
homographies are also usable for reconstruction.
You can use information about camera rotation and so on. So overall we're okay. We
don't try to reconstruct individual clusters; when we deal with sufficiently large
components, it works out fine.
And sometimes if the component is too small, you know, you might get reconstruction
that's not very interesting so here in Tokyo, you kind of get the idea that there's a statue
in front and there's a bunch of stuff in the back. But I don't know how interesting this
reconstruction really is.
And this is still not completely a bad case. You could get an even more uninformative
component. Let's say a very distant view of the Statue of Liberty, like a view from top of
Empire State Building or something that basically just shows the New York skyline.
So from that, you know, you basically get a trivial reconstruction. So we can get
reconstructions, but they're not always going to be good. There's going to be bad cases.
But for the good cases like the central aspect of the Statue of Liberty, we don't have that
problem.
Okay. And we have tried to quantify basically our various filtering strategies. Does, for
example, starting with gist clustering and then performing geometric verification and so
on, does it improve recall and precision. So it definitely improves precision. And recall
basically in every filtering stage goes down a bit because we end up rejecting some
things.
Things that are not geometrically verified let's say to the iconic image get thrown out and
sometimes we make a mistake. So in every stage, reject some images and the recall
goes down a bit, but the precision stays pretty high. So for the Statue of Liberty, in the
end we're able to register about half of all the images that actually contain the Statue of
Liberty, with pretty high precision.
And also we've tested how good the set of iconic images is for basically trying
to recognize the landmark in new images. So we take a never-before-seen image and
try to figure out whether there is a Statue of Liberty in that image, using just our iconic images.
And we can either do k-nearest-neighbor matching of this test image to the iconics
using gist, or we can use bag-of-local-features matching with a vocabulary tree and follow
that with geometric verification. So a very basic registration approach.
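A minimal sketch of the gist-based recognition test, assuming precomputed gist descriptors; the neighbor count and the distance threshold are illustrative:

    # Sketch only: nearest-neighbor test of a query image against the iconics.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def recognize_landmark(query_gist, iconic_gists, k=5, max_dist=0.5):
        nn = NearestNeighbors(n_neighbors=k).fit(iconic_gists)
        dist, idx = nn.kneighbors(np.asarray(query_gist).reshape(1, -1))
        # Candidate iconics close enough in gist space; in the full pipeline
        # these would then be verified geometrically (SIFT + RANSAC).
        return idx[0][dist[0] < max_dist]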
And one thing that's interesting about the Statue of Liberty is the vocabulary tree itself.
So it's hard to see, but basically if we do just nearest neighbor matching, using a
vocabulary tree, this is the recall precision curve we get. This red curve. And it's
basically random. So if we try to do bag-of-features matching to the nearest iconic for
the Statue of Liberty, it pretty much fails miserably. This is due to the fact that it's hard to
get a lot of inliers for the Statue of Liberty. It's not very textured so we don't have a lot of
local features so that's going to fail. So gist, just global 2-D matching works a lot better
than the vocabulary tree for this case. So this is just one interesting note.
Okay. So basically we get similar kinds of results for Notre Dame, which is smaller, and
for Piazza San Marco. So one interesting note I want to make about this one is the
merging. So we get models basically for the front of the square with the church and the
back of the square and there's also this tower. And we've tried really hard to merge the
two and we just haven't succeeded.
And there are a few sort of difficulties. One of the difficulties is the tower. So this tower,
this clock tower is completely symmetric from all four sides. And there's a lot of images in
which you just see the tower against the sky, nothing else.
For those images it's basically, you know, very hard to get the correct camera pose.
What we end up with is a camera pose that has basically a four-fold ambiguity.
So when you have a bunch of misestimated cameras for the pictures looking at the tower
and you try to do the merging with them, it basically completely screws things up: it puts
the square together backwards, with walls that intersect, and so forth.
So it can be tricky. Again, the clustering-based approach keeps the dominant aspect and
sometimes it throws out a lot of the things in the middle which in this case turns out a lot
of times to be the missing links.
There might not be a lot of pictures that really see both aspects together, and that can
help you to do the merging.
So this is one weakness of this clustering-based approach that we're going to work to
address. Okay. So basically this is all for this part of the talk. And I can go into the
second part just a little bit. Does anybody have any questions? Okay. So I will briefly
talk about the second part, which was actually presented last week at the Internet
Vision workshop. And this part is more sort of speculative and open-ended.
So what I wanted to see is what I mentioned in the beginning: what can we do basically
with just abstract categories? Can we find any useful summaries at all? And so basically
one of the things we want to discover in the apple example is the fact that
there's apple the fruit and Apple the computer company.
For even more abstract terms like beauty, the best we can hope to discover is
basically distinct themes: that women may be beautiful, sunsets may be beautiful, flowers
may be beautiful and cats may be beautiful, but in slightly different ways.
So we want to find different semantic subcategories, or instantiations of what an
abstract term could mean. So here, basically, we also have this recall versus precision
problem. Ideally we would like to find all images that somehow represent love or beauty.
But this is much harder to define than in the case of the Statue of Liberty.
In the case of the Statue of Liberty, we could have ground truth -- does this image show the
Statue of Liberty or not -- and we could have people actually annotate images with yes or
no, whether or not they show the Statue of Liberty.
For these kinds of categories, it's no longer possible. We can't have a person look at any
given image and say whether it represents love. So we have to kind of
throw the whole question of recall out the window. We don't really have the hope of
retrieving all of the images out of a large noisy collection that represent love. But what
we can do is maybe find a few subsets of the images that are internally consistent, that
correspond to some recurring visual motifs.
Like hearts or roses will be very typical for love. We don't know about a whole bunch of
other images, but we find a few clusters about which we can say yes these are
representative images for this abstract term.
So it's hard to define what an iconic image means in this case, because it's so
speculative. But the working definition we came up with is this: an iconic image is basically a
good-looking representative of a group of images that look the same and have similar
semantics.
So this really has three components. One, appearance. Images have to look the same.
Two, semantics: they all have to be roses or all be hearts or all be rings.
And the third, they have to be good-looking. And we're not sure why they have to be
good-looking but it can't hurt.
So this pretty much dictates the algorithm. So we perform joint clustering to
find groups of images that are consistent using both appearance descriptors and Flickr
tags. And, secondly, for each cluster we pick a representative iconic image by doing
automatic quality-based ranking. So our joint clustering is very simple.
We do first clustering with gist just like in the landmark work. And that, you know, does
ensure some uniformity in appearance but it doesn't ensure uniformity in semantics. So
here we have a bunch of round structures but some of them are apples. Some of them
are apple pies. Some of them are iPod buttons.
And some of them are actually apple pies with the Apple logo in the middle. So
appearance gets us somewhere, but we still have this mishmash of different themes. So
for this we try to capture semantics by clustering tags. We use a very simple approach.
We run so-called probabilistic latent semantic analysis (pLSA) to transform the tags into a latent topic
space. And then we cluster the topic vectors with k-means, and this gives us a bunch of
semantic themes, and we simply intersect the two clusterings.
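A sketch of the intersection step, assuming the pLSA topic vectors for the tags have already been computed (scikit-learn's LDA or NMF could stand in for pLSA); the cluster counts and minimum size are assumptions:

    # Sketch only: intersect the gist clustering with a tag-topic clustering.
    from collections import defaultdict
    from sklearn.cluster import KMeans

    def joint_clusters(gist_labels, topic_vectors, n_tag_clusters=30, min_size=5):
        """gist_labels[i]: gist cluster of image i; topic_vectors: (N, T) topics."""
        tag_labels = KMeans(n_clusters=n_tag_clusters, n_init=10,
                            random_state=0).fit_predict(topic_vectors)
        joint = defaultdict(list)
        for i, (g, t) in enumerate(zip(gist_labels, tag_labels)):
            joint[(g, t)].append(i)               # intersection of the two
        return {k: v for k, v in joint.items() if len(v) >= min_size}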
And then, you know, this is an actual sort of snippet out of the results that we get. So this
is a hand picked snippet. So this shows where it works nicely, but we didn't make it up.
>>: Just a question. I thought you were (inaudible) in order to be able to run pLSA --
>>: Svetlana Lazebnik: This is pLSA on the tags, not pLSA on the features. This is like the
(inaudible) with text, not the pLSA that people have used on image features.
And so this is what we get when it works well. We're really able to separate out the
different themes. And, finally, we use Yan Ke's work on learning quality
ranking from a database of images that have been ranked as being more professional
versus more amateur, and the sole goal of that is just to find basically a nice-looking
representative for a cluster.
Because, well, we could just use the image closest to the center whatever but somehow
it seemed like a good idea to incorporate aesthetics more into the approach.
So this is what the results look like for apple. So we have -- each quadruple of images
here represents a distinct theme found with pLSA on the tags, and the different themes
are laid out using multidimensional scaling -- multidimensional scaling using a distance
between sets of tags.
So it basically shows how close are these tag distributions to each other. So you can see
we can separate the apple fruits very nicely from the Apple company. And we have nice
distinct themes for the logo, wallpaper, Apple stores. In fact, there's a theme for a store
in New York and a store in London and so on.
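A minimal sketch of such a layout, using scikit-learn's MDS; the chi-squared distance between tag histograms is an assumption, since the talk only specifies a distance between sets of tags:

    # Sketch only: 2-D layout of themes by MDS on tag-histogram distances.
    import numpy as np
    from sklearn.manifold import MDS

    def layout_themes(theme_tag_histograms):
        """theme_tag_histograms: (num_themes, vocab_size) normalized tag counts."""
        h = np.asarray(theme_tag_histograms, dtype=float)
        n = len(h)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # chi-squared distance between tag histograms (an assumption)
                d[i, j] = 0.5 * np.sum((h[i] - h[j]) ** 2 / (h[i] + h[j] + 1e-10))
        return MDS(n_components=2, dissimilarity="precomputed",
                   random_state=0).fit_transform(d)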
So you get the idea. So this is the top level. And we can expand this: for each theme,
the top level just shows basically the four top iconic images, but there are more,
because in the intersection there are several gist clusters that
get intersected with the same theme cluster.
So this is all of the gist clusters that are under the same theme. And once again as in the
landmark work, we can expand a gist cluster to hopefully get a lot of images from the
collection that are very similar, that basically support this gist cluster.
So this kind of gives you the idea. So for beauty, what we have, once again, we have
results that make a lot of sense. We have a cluster of different pictures of women.
There's a theme for some reason for Japanese girls. And, you know, we get flowers and
sunsets and nature and cats.
So this is an expansion of some of the women themes. So images here -- so these are,
once again, different gist clusters that fall under the same theme. So these are not
supposed to be consistent in appearance. But if we expand a gist cluster, we expect
images that are more consistent in appearance.
But once again the results we give here are not as clean as for landmarks because in the
landmarks what really rescued us a lot was the 3-D geometric verification.
This is what helped us to clean up the clusters. And here we have no geometry;
we really have no additional cues to help us verify. So what we get in the gist
clusters is really not as nice and clean a lot of the time.
But it works very well for flowers. And one thing that's interesting is we use no color at
all. So gist is computed on gray level images. So appearance-based clustering does not
use color. But in many cases we find the results to be consistent in color anyway.
So this is something that would be interesting to investigate: basically, how dependent
color is on the global gist features and so on, how correlated it is.
So the third category we tried is close-up. So this is even more abstract. It refers
basically to photographic technique and not even to any sort of concrete object or
whatever or any concrete subject matter.
So, basically, here you get themes of what people like to take close-ups of.
So they like to take close-ups of lips, eyes, cat's noses, insects, birds. There's a theme
for more or less abstract macro shots and we have a theme for strawberries.
And also drops of water, taking macro shots of drops of water seems to be very popular.
So you can see this theme expanded, and this is a gist cluster for drops of water. This is
one of the nice ones with the eyes. And one thing we can do and it's really hard to
evaluate this in any quantitative way, it's hard to say how well we're doing. One thing I
can point to is Flickr clusters.
So Flickr clusters are I think computed solely based on correlations between tags. They
don't use any visual information. But these are the kinds of clusters you get for close-up
in the Flickr interface.
So you can see, of course, there's not a lot of them. I think at most you only ever get four
clusters in Flickr. And they're also not very satisfying. Like in the bottom cluster, you
know, cat and animal faces are mixed with people faces.
In the top cluster, you know, we get some objects and some insects. You know, so in my
opinion the themes we get are kind of more satisfying.
Everything is separated out a lot more nicely. And our approach is basically pretty
simple. It was our first stab at how to cluster these images. And even just with pLSA
clustering, we get, in my opinion, a lot nicer themes than what Flickr gets.
So I guess the take-home message isn't really that we're doing so well, it's that it's really,
really easy to do better than this. Final one is love. So one surprise that was here, we
have this cluster for self, me, self portrait. So I was kind of surprised to see that as a big
theme associated with love. So I guess people love themselves or, I don't know, they
love to take self-portraits.
Then we get dogs and babies, which makes a lot of sense. What's interesting is the dogs
show up under love and cats showed up under beauty. I don't know if this is just random
or it really represents some big statistical bias in things. There's a cluster for wedding. A
wedding theme, but the images in the wedding theme are actually not very good.
So I'm not really sure why that is. And then there's like clouds, sunset, beach. So beach
and couples and love appear to be very tightly associated. It makes sense, but I guess I
didn't really expect to see such a strong association.
And there's a big heart cluster. Obviously that's nice -- if there weren't a big heart cluster,
then we would know for sure that this thing had failed miserably.
One other thing that was interesting to see in the heart cluster, we get five images that
basically are images of a ring laid on top of an open book and the shadow of the ring is
kind of heart-shaped. How many of you have seen these kind of images? So some of
you have. Do you also browse Flickr a lot? Okay. So I think this is something that would
surprise a lot of people who are unfamiliar with Flickr and these kinds of amateur photo
sharing sites, but I guess this is one of those visual cliches that people really
love. And you wouldn't know it otherwise.
So this was something I was pleased to see, because ideally this is the kind of output that I
would hope for from this approach: it would help to reveal these visual cliches that are
really very prevalent but that regular people might not necessarily know about. So
this is what I would think would be really useful to reveal in the Flickr interface.
And once again Flickr clusters for love, not that great. So there is a cluster for hearts.
But, you know, the heart images here are not particularly salient. Although, there's a
better wedding cluster than what we have. But apart from that, once again, I think we do
better in terms of comprehensiveness of themes and so on.
Any questions?
>>: So how many, when you start doing your analysis, how many photos -- I guess it's
based on pruning a set of photos, right? How many photos start the set?
>>: Svetlana Lazebnik: So here it's smaller than for the landmark work. Here I think it's
about 10,000 to 15,000 photos per category. So I think getting more photos would definitely
be helpful for getting more consistent results.
So here one of the shortcomings we have is we get sort of a set of themes that make
sense. But it's hard to say. Basically is this comprehensive? Are these all important
themes? Or even if they were the themes we do get, even if they do make sense how
important are they? Like this self-portrait theme, it makes sense.
But is this just a fluke of our clustering that it happened to reveal this theme which might
not actually be very common? Or is it indeed sort of a very common theme? And by
contrast, you would think that weddings would be huge under love. But we don't get such
huge clusters for wedding.
So the clustering is a bit iffy. So this is something that we really have to work on. By
getting more data and implementing a more stable clustering algorithm,
we'll be able to say more about why a given theme did or did not show up,
and so on.
And because this is hard to evaluate, one reasonable way to evaluate it is basically to do
user studies, which would probably be a good idea. So that pretty much concludes my
talk. Does anybody have any questions?
(Applause).
>>: I have two questions; I'll ask them one at a time. One is, from the second half of the
talk, it's clear there's an amazing amount of information you can get out of these tags and the
semantic information. But from a pure vision standpoint, it's a little unsatisfying in the first
half of the talk that you actually need them to really filter your results.
I was wondering, do you have any insights could you get away with getting just as good
results using vision alone without the tags?
>>: Svetlana Lazebnik: Actually, in our landmark work we use tags for very little. The
only thing we use them for was for filtering the isolated nodes of a graph. So the clusters
that didn't match any other clusters geometrically. So we apply the tags in the first part of
the work after we applied all the stronger constraints that we have.
And basically the only thing that would happen if we didn't apply that step would be that
we would end up with more sort of garbage components. Like here I can show, for Notre
Dame, I'll show you one example of sort of a garbage component. So for Notre Dame we
have this component that is geometrically consistent because they're all images of the
same building.
But it's obviously not a Notre Dame cathedral in Paris. And tags in this case don't give us
enough information, once again, to remove this, because it's tagged with Notre Dame
and art, which are both very common for the correct Notre Dame.
So basically if we use no tags at all for this step, we would simply end up with more
garbage components like this. So we really don't rely on tags for a whole lot for the
landmarks. And for the second part of the work, we rely on tags because semantics are
much more important, and because we don't have strong constraints.
So geometry, basically when geometry makes sense, whether 2-D geometry appearance
similarity or 3-D geometry, it's really good to use it. But here for general categories, we're
basically in the situation where we have very little to go on. So we're forced to rely on
tags.
>>: My other question relates to that same San Marco example where you couldn't glue the
two halves of it together. It's a common problem in feature-based approaches that when
you have repetitive structures you can get confused. You can see two
approaches to handling this. One of them would be to do more verification; there might be
some asymmetry that you could leverage to say those are actually two different views
of almost the same thing.
The other approach would be to have a particle-filter type approach where you say,
well, we have multiple hypotheses, we need to keep them around. Do you have any insight
into how you could use either of these approaches?
>>: Svetlana Lazebnik: I think you could in fact use both. So we've talked about it a little
bit. So I think being able to generate multiple hypotheses, especially in this tower case,
in the tower case there's really basically a four-fold ambiguity. So it's really not that hard
to keep track of it.
It may be simply just a question of reimplementing your RANSAC so that instead of just spitting
out one hypothesis, it spits out all of the consistent hypotheses.
So if there's just a four-fold ambiguity, then you can keep track of it pretty easily. But also
in that square, in the back of the square where you have these repeating columns, the
ambiguity might be a lot bigger.
When you have translational ambiguity you might have a lot more than four possibilities, so
it could blow up. So it's a question of where you want to stop. And it's also a
question, I guess, of careful implementation and so on. I think carefully keeping track
of the ambiguities and generating multiple hypotheses, in some cases when there's just a
two-fold or four-fold symmetry, is all that's needed; but in some other cases, same as with
particle filtering, it might blow up when there's too much symmetry.
But, yeah, I think especially in architectural scenes, you know, dealing with symmetry is
important. And the first thing you mentioned obviously getting more matches trying to get
some unique features I think also is part of the solution.
So it could be as simple as just more carefully implementing the SIFT-based matching and
extraction, using higher resolution images, and so on.
I think part of the answer is more careful implementation and part of the answer is there
are probably issues that we don't even want to touch when some of these basically
ambiguities get out of hand.
>>: Rick Szeliski: So Svetlana is here for today. If someone would like to meet with her,
come see me. We'll probably go out and have lunch in the cafeteria if anyone wants to
join us. So thanks again.
(Applause)