>> Rick Szeliski: Okay. So it's my great pleasure to welcome Rob Fergus here
today. Rob is a professor at New York University. And he's very well known in
the computer vision recognition community for having developed one of the first
systems to learn constellation models. He's also done wonderful work in
computational photography, aspects of learning with very large data collections,
which I think is related to his topic today. And he's on his way up to NIPS where
he'll be delivering a version of this talk.
>> Rob Fergus: Thanks, Rick. This is a talk -- this is some work I've done with
Yair and Antonio. And what I'm going to talk about today is doing stuff with huge
collections of images. There's been some quite interesting work by people at
Microsoft here on taking people's holiday snaps on the Internet and creating rich
studio environments from them and also doing sort of fun things with your
photographs to repair them and so on. But all of these approaches are really
leveraging the vast amount of data that's out there on the Web.
People have been less successful, though, at developing recognition
techniques that scale to the sort of billions of images that are
out there now on things like Facebook and these other websites. And so
what I'm going to present today is one approach, which you could integrate with a
variety of different recognition systems, that really does scale to large
collections of images.
Okay. So one feature about the kind of data on the Internet is you don't just have
this idea of nice clean labels. Often you have the sort of spectrum of label
information. So you have some images which have been manually annotated.
You have nice clean labels which you trust. But that's likely to be a small fraction
of this giant amount of data out there.
What you're likely to have a lot of is sort of noisy labels where people have some
sort of text meta data or something like that which gives you some clue as to
what might be in the image but you certainly can't trust it entirely.
Then, of course, there's also going to be a vast amount of data for which you have
really no label of any sort. What you would like is some approach which can
integrate information from all these different sources and do something useful
with them.
So the machine learning people have developed approaches that do this, which
is essentially semi-supervised learning. I'll give a brief introduction for
those who aren't familiar. The idea is you're given a whole collection of data.
You assume that only a very small number of them are labeled. In this case only
two data points that are labeled. And what you want to do -- sorry. In the
classical sort of supervised setting, you would just take these two labels, train
your classifier and try and label the remaining examples.
So if you do something like nearest neighbor or something like that, this is the
kind of thing you end up with. So the decision boundary is just equidistant
between the two points, and you can see it's a perfectly reasonable thing to do,
but you can see it somehow doesn't respect the density of the data.
And in semi-supervised learning what you would like is for the labels
to sort of propagate smoothly through the density, so that the resulting
labels accurately reflect the sort of structure of the data that you have.
Okay. So the semi-supervised approach I'm going to talk about today is based
on a common family of methods that use these graph Laplacians as the basis
of the approach.
Okay. So the idea is that each image is going to be a vertex in a graph. You'll
look at the affinities between the images, which are going to correspond to the
edges in the graph. We're going to have a sort of Gaussian-type affinity,
something like W_ij = exp(-||x_i - x_j||^2 / 2 epsilon^2), which tells you how
similar two images are based on how far apart they are, x being some descriptor
we're going to extract for each image.
And the thing which we're going to use is this thing called the graph Laplacian.
So the idea is you have this affinity matrix W. D is just a diagonal matrix
where you sum across the rows of W. And then this matrix L is the thing that you
construct. It's a little bit hard to get intuition for what that does. To get
an understanding, what it's really doing is measuring the smoothness of
your label function F.
And the idea is if two points are close together and have a high affinity, you'd like
the F values of those two data points to be very similar. And equally, if the affinity
is small, then these F values can vary to a greater extent.
So what you want, of course, is to find an F that minimizes this quantity here.
Now, of course, this has a trivial solution. If you just set F equal to 1, or
F equal to any constant value, you end up with this being zero. Of
course, if you bring in the labels, from the few labeled examples you have, then you
end up with this expression here.
So you have something which measures the smoothness of your label function
and you have something which checks how well your label function agrees with
the few labels you've been provided with.
And so the lambda here is just a weight which you can think of as being large on
your labeled examples and being zero otherwise.
You can then rewrite this weight as a sort of
diagonal matrix, and what you then want to do is find the F that minimizes this
expression.
So this is a fairly straightforward thing to do. Just to make clear, the points
that are unlabeled have zero weight in this term and the ones that are
labeled have some fairly large value lambda.
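A minimal Python sketch of the formulation just described, for small N only. The variable names, the Gaussian width eps, the weight lam, and the use of the unnormalized Laplacian L = D - W are illustrative assumptions, not the talk's actual code (the slides use the normalized Laplacian).

    import numpy as np

    def exact_label_propagation(X, y, labeled_mask, eps=1.0, lam=100.0):
        # Gaussian affinities between all pairs of points (only feasible for small N).
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * eps ** 2))
        D = np.diag(W.sum(axis=1))
        L = D - W                                  # unnormalized graph Laplacian (assumption)
        # Diagonal weight: lam on labeled points, zero otherwise.
        Lam = np.diag(lam * labeled_mask.astype(float))
        # Minimize f' L f + (f - y)' Lam (f - y)  =>  (L + Lam) f = Lam y
        f = np.linalg.solve(L + Lam, Lam @ y)
        return f                                   # sign(f) gives the two-class decision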
>>: [Indiscernible] the label numbers of 1-B?
>> Rob Fergus: Um.
>>: The scalars.
>> Rob Fergus: The class label, yes, we're just considering the simple two class
problem here. We're not doing any multi-class thing in this work.
Okay. So finding the minimum of this is fairly straightforward. You've got to
solve a linear system like this. The problem is that if you have N images,
then this is going to be an N-by-N system, which is going to be -- if N is a billion
or something like that, you're in trouble. You've got to somehow invert this -- this
is going to be a billion-by-billion matrix that you have to invert.
Now one thing people often do is, instead of looking directly at this least-squares
system, they in fact consider the eigenvectors of the matrix L. The
intuition behind looking at the eigenvectors is that the
smallest eigenvectors will give you a basis which is smooth with respect to the data.
I'm just visualizing here the first eigen -- the smallest eigenvector of the graph
Laplacian. It's a small constant offset. We don't use that one. You can see the
second smallest partitions the data in two and the third one partitions it vertically
and so on.
So what we can do is just represent our label function F as a sort of linear
combination of these eigenvectors of the graph Laplacian.
>>: [indiscernible].
>> Rob Fergus: So the sigmas are the eigenvalues. You can see they're sort of
getting larger because we're looking at them from the smallest up.
>>: Just trying to understand -- the magnitude of sigma two and sigma three, does it mean anything?
>> Rob Fergus: It's hard to get a feel for what magnitude these should be. It's a
normalized graph Laplacian, so that does mean the eigenvalues are bounded, by
two or one, I can't remember which one it is.
Okay. If we make the substitution, if we substitute F for some linear combination
of the eigenvectors into the previous expression, we end up with this. So this is
the smoothness term as before. This is now just a diagonal matrix of the eigenvalues
of your graph Laplacian. We take the smallest K of them, just removing
the very smallest one because it's kind of -- it's just a DC term.
Then this is -- the alpha is sort of our vector of coefficients over the eigenvectors.
And this is the term which tells us we want our label function to agree with the
labels.
So if we want to solve for the optimal value of the coefficients alpha, we now just
have to worry about a K by K system. So things are better. But, of course, how
on earth do we find the eigenvectors of a billion-by-billion matrix if we're dealing
with that? So in some sense it makes things a bit easier in terms of solving the
label function but we have to find the eigenvectors in the first place.
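A hedged sketch of the reduced system: if U holds the K smallest nontrivial eigenvectors (N x K) and sig their eigenvalues, substituting f = U alpha gives a K x K solve. Names and the weight lam are illustrative, not from the talk.

    import numpy as np

    def solve_in_eigenbasis(U, sig, y, labeled_mask, lam=100.0):
        # U: N x K eigenvectors of the graph Laplacian; sig: their K eigenvalues.
        lam_vec = lam * labeled_mask.astype(float)        # diagonal of Lambda
        # Minimize a' diag(sig) a + (U a - y)' Lambda (U a - y)
        A = np.diag(sig) + U.T @ (lam_vec[:, None] * U)   # K x K system
        b = U.T @ (lam_vec * y)
        alpha = np.linalg.solve(A, b)
        return U @ alpha                                  # label function over all N points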
So just to hammer this point home: if you just deal with the original least-squares
system and you have an 80 million image dataset, you've got to
invert an 80 million by 80 million matrix. If you want to be solving in the
eigenvector basis, you have to diagonalize an 80 million by 80 million matrix. You
might be able to do it for a million, but you certainly can't do it for a billion images.
>>: Is the number 80 million magic?
>> Rob Fergus: It's the size of the dataset that I'm going to
show this operating on later.
So okay. So what do other people do? For people who want to use
semi-supervised learning, the standard approach is basically a common one
called Nystrom. There are many other
methods that share this common property, where essentially they take the
data and just pick a subset of it to act as a series of
landmarks, and solve for an exact solution on that. Typically they're solving for the
exact eigenvectors of the graph Laplacian of this reduced landmark set. Then
they'll interpolate the other data points back into those eigenvectors. And so the
basic intuition here is they somehow are reducing the amount of data they have
to deal with, solving an exact problem of that, and then figuring out what to do
with the rest.
Okay. So we're going to do something a bit different. Okay. So that was sort of
a warm-up. This is what we're actually going to do. So as I was saying before,
things like Nystrom, what they're fundamentally doing is reducing the number of
data points they're going to use.
So in contrast what we're going to do is actually think about the limit case as the
number of data points goes off to infinity. You can think of this now as a
continuous density. And the key sort of selling point is that our approach is going
to in fact be linear in the number of data points. We have to make a big
assumption to make that happen. Instead of trying to solve an exact problem on
a small subset, we're going to end up -- we're going to make an approximation
that makes it linear in N. The basic idea is, if you consider this little
distribution of data here as the number of data points goes off to infinity,
we can think about dealing not so much with a discrete set of data but with a
continuous density. In this toy example, we've got a two-dimensional distribution
of data. And what we're going to do is write down a sort of continuous analog
to the graph Laplacian. So we can write down an operator L_p, which is
going to measure the smoothness of a function F on this density.
So you can see it's very much the same as before. You've got this W here, which is an
affinity between two points in your density, sort of the Gaussian-type affinity. And
then you have this term here which says that nearby points x1 and x2, which are
going to have high affinity, should have similar values of the label
function F.
What we're going to do is look at the eigenfunctions -- eigenfunctions
now, because everything is continuous -- of this L_p operator.
So just to understand what these things look like. So this is the discrete case
where we had discrete data. And those were the eigenvectors of the graph
Laplacian. So the eigenfunctions of this density -- the continuous analog -- this
is the density and these are the functions.
So you can see the second smallest eigenfunction splits the data
vertically, and the third smallest one splits it in the horizontal direction.
Okay. So the idea is that as the number of data points goes off to infinity, the
eigenvectors in the limit become the eigenfunctions. Now, finding those
eigenfunctions is what we're going to do. If our density, the data that we're
looking at has a nice analytic form, for example, it was uniform or Gaussian, then
actually you can write down an expression for the eigenfunctions. But,
unfortunately, most real world data doesn't have uniform or Gaussian distribution.
So you're in trouble.
What we're going to do is basically compute a numerical approximation to these
eigenfunctions.
Okay. So now the key assumption we have to make to be able to do this is that
we do have to assume that the input distribution is separable. That is, in our toy
2-D example here, we assume that we can actually -- whoops, sorry. My slide is
slightly out of sync here. We're assuming that we can model this as a product of
two distributions, one over each dimension.
Now, there's a sort of theoretical result that says if we can compute the
eigenfunctions of each one of these guys in turn, then they will in fact be
eigenfunctions of the actual overall joint density, which is what we're trying to find.
Now, this is quite a big assumption. We're going to come back and revisit that on
both toy and real data.
>>: Your formula for the continuous Laplacian already had it in a separable
form? Just kind of jumping ahead.
>> Rob Fergus: Yes. Thank you. Okay.
>>: On the previous page.
>> Rob Fergus: Yes, you're right. You're right. Yes, you're right.
>>: They were already separable.
>> Rob Fergus: Yes, that's true. It's been written out like that.
>>: Okay. So it's not inherent in the definition.
>> Rob Fergus: Yes, exactly. Okay. So let's just consider how we would find a
numerical approximation for a one-dimensional
distribution, which is going to be the marginal. Let's just consider the marginal
distribution for, say, the first dimension -- looking at P of X1. Say we
have a lot of data points that we assume have been drawn from this density.
So we can compute the marginal distribution of this guy; that's what
this blue curve is here. And, of course, in practice we don't have this. What we
really have is the data itself. So we can form a histogram. We can put some
bins at discrete locations, see how many data points fall into each bin, and
we get an approximate version of this true marginal.
So we'll have this histogram H of X1. Now, what we can do is solve
a little eigenvector equation for the values of the eigenfunction at those discrete
locations -- basically the centers of the histogram bins.
And we're going to solve essentially for g, which holds the values of the
eigenfunction, and for the associated eigenvalue of that eigenfunction. You can
compute it pretty easily. So P is going to be the matrix which just holds -- it's a
diagonal matrix that holds the values of the density, just the histogram counts
as it were, normalized, on its diagonal. And these guys here are sort of analogs
to the things we had in the original graph Laplacian. So W is just the affinity
between the locations in your one-dimensional space, that is, between
different bins, and D is just the sum over the rows of W.
And so when we're solving this equation -- remember we're in 1-D, we're
looking at a one-dimensional distribution -- we're solving for G at a series of
discrete locations, which correspond to the bin centers, and for sigma, which is
the associated eigenvalue.
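A rough numerical sketch of this 1-D step: histogram one dimension, then solve a small generalized eigenproblem at the bin centers. It follows the structure described in the talk (a density matrix P, a bin affinity W, and a row-sum term), but the exact normalization on the slide may differ, so treat it as illustrative only.

    import numpy as np
    from scipy.linalg import eigh

    def eigenfunctions_1d(x, n_bins=50, eps=0.2, n_funcs=5):
        # Histogram one dimension to approximate its marginal density.
        counts, edges = np.histogram(x, bins=n_bins)
        centers = 0.5 * (edges[:-1] + edges[1:])
        p = np.maximum(counts / counts.sum(), 1e-10)
        # Gaussian affinity between the discrete bin locations.
        W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
        PWP = p[:, None] * W * p[None, :]            # density-weighted affinities
        Dt = np.diag(PWP.sum(axis=1))                # row sums, as in the graph case
        # Generalized eigenproblem for g (eigenfunction values at the bin centers)
        # and sigma (its eigenvalue); smallest come first, the very first is the DC term.
        sig, G = eigh(Dt - PWP, np.diag(p))
        return centers, G[:, :n_funcs], sig[:n_funcs]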
So when we do that, for example, in that distribution, what do they look like? So
this is the first eigenfunction of this little marginal distribution. This is what it
looks like here.
So you can see it changes rapidly when the density is pretty low here. And then
it's fairly constant when you actually have high density.
And the second one looks like this. And the third one looks like this. Each of
these little circles corresponds to the X coordinate, lined up to the X coordinates
of the bin centers.
And so we do that for one dimension and then we go on to the second
dimension. So this is what the marginal distribution looks like for the second
dimension. Sort of Gaussian, in our toy example. And so the first eigenfunction
is pretty linear. And then the second one has a sort of kink in it. And then the third
one has got a second kink in it, and so on.
>>: The little red bars in the third one just disappeared.
>> Rob Fergus: You're quite right. That's a cut and paste error, I think. It's
meant to be the same distribution.
Okay. So what we've done now is, for each dimension, we've computed these
little 1-D eigenfunctions. And so the question is -- sorry, these
slides are jumping around -- what we're interested in is computing the
approximate eigenvectors of the graph Laplacian. How do we go from these
eigenfunctions to the eigenvectors? What we do is take each data point and
simply interpolate it into the approximate eigenfunction.
So this is fortunately a pretty quick operation. One-dimensional interpolation. If
we have K of these eigenfunctions we're dealing with, then for each data point we
have to do K interpolations.
So if we have a billion data points, we may have to do this many times. But
fortunately it's a pretty quick operation to do.
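A tiny sketch of that interpolation step: each data point's value in a given eigenfunction is just a 1-D lookup (np.interp here), so K eigenfunctions cost K interpolations per point. Names are illustrative.

    import numpy as np

    def interpolate_into_eigenfunction(x_dim, centers, g):
        # x_dim: that coordinate of every data point (length N)
        # centers, g: bin centers and eigenfunction values from the 1-D solve
        return np.interp(x_dim, centers, g)          # length-N approximate eigenvector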
So that's pretty much the algorithm. Just one important preprocessing step:
we do assume the data is separable for this thing to work effectively, and your
input data may well not be separable at all. So we use PCA to rotate the
data. Now, it would be nice to be able to apply other types of mapping
to the input data.
But the catch is you must preserve the distances, because those affinities all
depend on the distances between points. If you do anything other than rotate,
then you're in trouble.
So PCA is one form of rotation. You can imagine other types. For the time being
we're just using PCA to rotate our input data and hopefully make it more
separable. And we'll look at that in due course.
Just to summarize the algorithm. We're going to take our input data and do PCA
on it to rotate it, hopefully making it more separable. Then for each dimension of
the PCA'd data in turn, we'll construct a one-dimensional histogram to get the
marginal density and solve numerically for the eigenfunctions and associated
eigenvalues -- those funny wiggly green curves I was just showing.
Then we're going to use the eigenvalues we computed to sort the eigenfunctions
from all the different dimensions and take the first K. You have to decide how
many you want to look at -- maybe K equals 64 or 128 or something like that. So
for each of the K eigenfunctions which have the smallest eigenvalues across all
dimensions, you're going to take your data and interpolate into those K
eigenfunctions. And that's going to give you an approximate set of K
eigenvectors of your normalized graph Laplacian, which you can plug into that
least-squares system we were looking at earlier and then solve to give you the
label function over your data.
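A compact end-to-end sketch of the summary above, reusing the hypothetical helpers from the earlier sketches (eigenfunctions_1d, interpolate_into_eigenfunction, solve_in_eigenbasis); the parameter values are made up for illustration.

    import numpy as np

    def approximate_ssl(X, y, labeled_mask, n_bins=50, K=64, lam=100.0):
        # 1. Rotate with PCA (a pure rotation, so pairwise distances are preserved).
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        Xr = Xc @ Vt.T
        # 2. Per dimension: histogram, then solve for 1-D eigenfunctions/eigenvalues.
        cands = []                                   # (eigenvalue, dim, centers, g)
        for d in range(Xr.shape[1]):
            centers, G, sig = eigenfunctions_1d(Xr[:, d], n_bins=n_bins)
            for k in range(1, G.shape[1]):           # skip the constant (DC) one
                cands.append((sig[k], d, centers, G[:, k]))
        # 3. Keep the K eigenfunctions with the smallest eigenvalues overall.
        cands.sort(key=lambda c: c[0])
        cands = cands[:K]
        # 4. Interpolate every data point into those K eigenfunctions.
        U = np.stack([interpolate_into_eigenfunction(Xr[:, d], centers, g)
                      for _, d, centers, g in cands], axis=1)
        sig = np.array([c[0] for c in cands])
        # 5. Solve the small K x K system for the label function.
        return solve_in_eigenbasis(U, sig, y, labeled_mask, lam=lam)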
>>: So is K here the number of dimensions of the vector, or is it the number
of eigenvalues you keep?
>> Rob Fergus: Sorry. This is true. It's not very clear. So K is the number of
eigenvectors in your little least-squares equation that you're going to deal with. So
alpha would be a vector of length K -- coefficients on each of those K eigenvectors.
What we've done basically is come up with a scheme for computing an
approximation to U. It assumes separability of the data, but it's linear time to do,
rather than some sort of polynomial-time algorithm for computing the exact eigenvectors.
>>: Okay.
>> Rob Fergus: Okay. So just to do a little comparison here. So in Nystrom,
what they'll do is select a small subset of M landmarks, where M is some sort of
moderate number, and compute the exact eigenvectors of that M-by-M system --
or pick the smallest K, or compute all of them, in fact. Then they interpolate the
N points back into those K eigenvectors and solve the K-by-K linear system.
And so they're polynomial in the number of landmarks, because they're doing this
thing exactly. So even with some fancy iteration, it's still going to be at least
quadratic in M here.
With us, we're rotating our points. We're forming D one-dimensional
histograms, where D is the dimensionality of the input vectors. For each of the D
dimensions we have to solve a little B-by-B linear system to get the
eigenfunctions. That's tiny because you only need like 50 histogram bins or
something -- 50 by 50. Then you have to interpolate in your data points.
So this is fast, but it's linear in N. And then you solve a K-by-K linear
system.
So we are linear in the number of data points. Of course, we are making this big
assumption about separability of the input data. Let's look at the toy experiments
now. So here's a little grid of data, each element perturbed by
Gaussian noise. Stick labels on two data points, see what happens. If you solve
the exact least-squares system, it's a little bit sensitive to the noise. It jumps up
and down. You can see it does roughly the right thing.
The eigenvector approach doesn't have that diagonal artifact, for whatever reason.
It seems to do a sensible thing, probably overfits a little bit to the two examples we
have. And our approach does something reasonable, too.
So let's look at these two approaches. This one is using the exact eigenvectors
of the graph Laplacian and this is using the approximation we computed using the
eigenfunctions. If we take a look at the eigenfunctions of that data, so the first
one -- let me get the laser pointer so I can wave.
This is the smallest one, which is the smoothest. So both of them pretty much
the same. And they've got very similar eigenvalues. The second one again
pretty similar. Just the sign has been flipped, but it's basically the same shape.
The third one is again very similar, with similar eigenvalues. Then the whole thing
starts to diverge, because currently we only consider these -- we don't allow any
sort of diagonal eigenfunctions, if you like.
In this case, you can start to see this thing is using both dimensions. You get a
funky type of function here which has a smaller eigenvalue than these ones here,
which start partitioning up the horizontal axis more finely.
>>: You don't take the outer product of the eigenfunctions for two dimensions?
>> Rob Fergus: We thought about doing this but not at the moment.
>>: That's why I was wondering why you don't have a K squared or K to the D
log.
>> Rob Fergus: That's right. So we've been thinking about ways to sort of
compute, sort of combinations, but we've not done that yet.
Okay. So now let's just look at another little toy dataset. This is
some toy data with two labels. In Nystrom you'll decimate the data and compute an
exact solution on that. You can see if you decimate too much, then you end up
losing the structure of the data. So the eigenvectors that it computes here are
sort of -- well, they do the wrong thing. These red points come across the gap.
By contrast, ours looks at all the data, and in this toy case anyway it does the
right thing.
Now, we fall down in a case like this. We have two concentric circles; there are
significant dependencies between the dimensions here now. You put a label on
the inner and the outer circle. Nystrom does the right thing even though it's got a
small set of data to work with -- it keeps enough of the data around. And then ours
does horribly, because it just -- it doesn't understand somehow the structure;
it's missing the ability to model the dependencies between
dimensions.
Okay. So this is a slightly philosophical slide. It makes the point that at
NIPS everyone would have you believe that all your data is like a Swiss roll -- you have
these incredibly intricate dependencies that loop back on themselves and so on. I
think it's not technically clear whether real high-dimensional data
is really like that. Another plausible model is that you have small
cliques of dimensions that are coupled, but those cliques are fairly
independent of one another. You have something like a sea urchin or a puffer
fish -- many of these long heavy-tailed kind of things in your high-dimensional
space -- and there isn't this sort of intricate manifold structure or whatever you
want to call it. And of course you can't visualize it. It's hard to get a feel for
it.
But, yeah, I'm not sure I totally buy the Swiss roll model that all the sort of
people at NIPS spend their time worrying about.
Let's look at some real data. What you see here is images we downloaded from the
Internet. In this case, it was 126 classes. We basically took different nouns, put
them into search engines, and got back the images -- 62,000 images. Geoff and Alex
[indiscernible] and [indiscernible] at Toronto got people to label these, so we
actually have ground truth labels, which lets us carefully assess the performance
of our algorithm on them.
This is part of the CIFAR project that provided funds for the labeling.
>>: Sorry. Why are there pictures of clocks in that class?
>> Rob Fergus: It's because we simply just put this noun into the search engine.
I'm showing you the raw output of the search engine. Someone has gone
through to label these, true instances of emu or not. So, for example,
this would be a negative example of an emu --
or, yeah, because it's a clock that came back for "emu". And now for each image we're
going to use a very simple representation. We're not going to work with pixels
directly. We'll use a single gist descriptor. This is kind of like looking at the
responses of oriented Gabors at different scales over the image. It's
conceptually not too dissimilar to SIFT. It's been hand-designed so that the
Euclidean distance between gist descriptors is meant to be a rough proxy for
the distance between images. There's no learning involved. It's just some
descriptor that someone's cooked up.
And we're going to PCA this down to 64 dimensions to compute the
eigenfunctions in. Now, one obvious question is: how independent are these
dimensions? If we take the original gist descriptor in its original dimensionality,
we can form a sort of histogram over pairs of dimensions and look at the mutual
information between them to get a feel for how tightly coupled the different
dimensions are.
And you can see there's very significant dependency going on here in the
original gist descriptors. After the PCA -- we do PCA on this -- it doesn't do
too badly. It does a pretty good job, in fact, of removing the dependencies. This is
looking at pairs of dimensions of the PCA'd gist descriptors. You can see the
mutual information scores aren't zero, which would indicate true independence,
but they're not too far off -- fairly small values. Visually they look fairly
symmetric and so on, where the original ones weren't.
Now, this could be sort of good luck, because we're dealing with images,
and the gist descriptors are sort of wavelet coefficients, and
maybe it's possible that doing PCA on them makes them fairly independent. And
admittedly it would be nice to try it on more general types of data and see
how realistic this separability assumption is. We haven't
done that. We're just looking at the gist descriptors here.
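One possible way to do the dependence check mentioned here, purely as a sketch: 2-D histogram a pair of dimensions and compute their mutual information, where values near zero suggest near-independence. The bin count is arbitrary.

    import numpy as np

    def mutual_information(a, b, bins=32):
        joint, _, _ = np.histogram2d(a, b, bins=bins)
        pxy = joint / joint.sum()
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        # MI in nats; zero would indicate the two dimensions are independent.
        return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())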
Okay. So one thing you might wonder is what these eigenfunctions of
these PCA'd gist descriptors look like. This is a little visualization here. What I'm
showing you is the eigenfunction which has the smallest eigenvalue up here
and then the 256th smallest one down here. And I'm just
color-coding them by the input dimension. You can see that the first few
are almost linear functions of the first few dimensions of your PCA, which
have higher variance, so you'd expect them to be more easily splittable.
As you come down here you can start to see after a bit you get second order
ones, ones with little wiggles. And down here you start to see high frequency
eigenfunctions of these first few dimensions, the ones up here.
And then, but you also see now different dimensions. So as you come down
here you can start to see the dimensions of the input space which had sort of
smaller variance starting to crop up. So the first eigenfunction in each
dimension always looks pretty linear, and thereafter it has this first-order
kink, and then two kinks, and so on.
So what we're going to be doing is taking each data point and sort of doing
interpolation in each of these eigenfunctions to compute a sort of 256
dimensional representation now that we're going to use in our semi-supervised
learning scheme.
So, sorry, to make clear what each of these is: this is one input dimension that is
bounded between some values x_min and x_max, and this little curve gives
the values of the eigenfunction at the histogram bin centers in this dimension. Then
what you're going to do is take your data point -- whoops -- take your data
point and just look it up in this curve here.
So the eigenvalues seem to do a fairly good job of picking sensibly
across the different dimensions. You can see here it allocates more eigenfunctions
to the dimensions with bigger variance, and the ones down here still
get some. We're just using the numerical eigenvalue here to pick the allocation
of the different eigenfunctions.
Okay. So the task we're going to do is to rerank the images of each class. So
the idea is you're given some set of images, say, for the word airbus. Maybe
several thousand of them or something. And they've been hand-labeled, so you
know which ones are true examples or not.
And the goal is to rerank them. You're going to give the algorithm all
63,000 images to compute the eigenfunctions of. And then what you're
going to do is give it a few labeled training
examples and propagate those labels using the eigenfunctions
to the remaining set of images for this keyword. And then hopefully the images
which have the highest label value F will correspond to sort of better examples of
the class -- so hopefully we improve the quality, as it were, of the
original ordering.
Just to measure that, we're going to look at the precision-recall curve and
specifically the precision at 15% recall.
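A quick sketch of that evaluation metric, assuming the scores are the label-function values F: sort by score and read off precision at the point where recall first reaches 15%.

    import numpy as np

    def precision_at_recall(scores, is_positive, recall_level=0.15):
        order = np.argsort(-scores)                  # best-ranked images first
        pos = is_positive[order].astype(float)
        tp = np.cumsum(pos)
        recall = tp / pos.sum()
        precision = tp / np.arange(1, len(pos) + 1)
        idx = np.searchsorted(recall, recall_level)  # first rank reaching 15% recall
        return precision[min(idx, len(pos) - 1)]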
And we're going to do this for lots of different classes. Just showing you some of
them here. Okay. Sorry. That was [indiscernible] on this slide. So what we're
doing here is varying the number of labeled training samples we use. Bear in mind
we've got these eigenfunctions that have been computed on all the data. So
that data is unlabeled when you compute the eigenfunctions; you don't use
labels in any way for it.
But you start to add in more label examples. And what we're measuring here is
the performance in terms of recall. So precision at a certain level of recall. And
chance is this line here. So you can see if we take a classic supervised method
like an SVM, it horribly overfits the data when we have just a few examples.
Then it sort of revives and starts to do pretty well once we have enough training
data. This is 100 training examples per class or something like that.
If we do the exact least-squares semi-supervised scheme, you can see it does
pretty well early on. So here it's doing great, in the sense that it's much better than
the supervised case, because it can regularize its solution using the data
distribution.
This is not using the eigenfunctions; this is just the standard
least-squares method of doing semi-supervised learning. It gets expensive to run
beyond a certain amount of data, so the curve stops here.
If we try Nystrom -- so Nystrom does better than the supervised case, but it's
definitely losing quite a bit over the exact semi-supervised scheme.
And then this is our approach. We seem to do about as well as the exact least-squares
solution, but then we can continue up. To make clear, with that exact
one you've got to invert a big matrix, so you're at least quadratic in the
number of data points, and ours is linear. It goes up, and starts to be beaten by
the SVM eventually once you have enough training data.
But you can see you're getting a big win here with small numbers of training
images.
Okay. And then a couple other things here. You can do nearest neighbors, here
in green. And then you can just do the exact eigenvectors, in magenta here.
Okay. So one thing you can try is to vary the size of K, that is, the number of
eigenfunctions I use, and see how that affects the performance. I'm showing you
K on this axis, and this axis is the number of training samples per class --
completely unsupervised at one end, then bringing in more labeled data. And the
color -- the more red it is, the better it is doing. You can see when you don't
have enough, it doesn't do very well; you start to overfit. If you have too many
eigenfunctions, it starts to degrade performance slightly.
Now, one nice feature about this is, as I was saying at the very beginning, you
have noisy examples in real data, and you might want to use those.
In semi-supervised learning it's pretty straightforward. What we can do is
take all the images belonging to a given keyword and just label them as being
positive,
even though in practice many of them will be incorrect. And we can take
the images from all other keywords and just label them as negative. So we're
somehow using the label that came from the original search engine.
And we obviously don't trust these labels as much as the actual
hand-specified ones, so we give them a much smaller weight -- the lambda value
we use is one-tenth of whatever we give a hand-labeled example.
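A small sketch of how those noisy labels might be folded in: build a per-point weight vector and target vector, with the noisy labels getting one-tenth of the hand-label weight (the one-tenth ratio is the only number taken from the talk; everything else is illustrative). The earlier least-squares sketch would simply use this per-point weight vector in place of its uniform lam times mask.

    import numpy as np

    def build_weights_and_targets(n, hand_pos, hand_neg, noisy_pos, noisy_neg,
                                  lam_hand=100.0):
        lam = np.zeros(n)                # per-point weight (diagonal of Lambda)
        y = np.zeros(n)                  # target labels, +1 / -1
        lam[hand_pos] = lam_hand;        y[hand_pos] = +1.0
        lam[hand_neg] = lam_hand;        y[hand_neg] = -1.0
        lam[noisy_pos] = lam_hand / 10;  y[noisy_pos] = +1.0   # keyword images
        lam[noisy_neg] = lam_hand / 10;  y[noisy_neg] = -1.0   # other keywords' images
        return lam, y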
If you do that, you find it actually gives you a very nice boost in performance.
Certainly when you've got very few labeled examples, the noisy examples can
give you a big help, and the nice red curve lifts up here.
So it clearly helps to still have labeled data, but the noisy examples certainly give
you a big advantage at the beginning.
Okay. So this is the same thing but shown in slightly more detail. This is
the increasing number of classes you have in your data, and this is the number of
training samples here. Without noisy labels you can see the performance is fairly
weak when you have relatively few samples per class. When you bring in the
noisy labels, some of these things start to work a bit better.
If you have more classes, then it can use all the data from all the other classes
as a large negative set, which is why you see the performance
improving as you move across this graph.
Okay. So I'm going slightly ahead of time here, I guess. What time did I start,
Rick? Okay.
>> Rick Szeliski: You started at around 1:35.
>> Rob Fergus: Fine. So I'm not going to be the full hour.
>> Rick Szeliski: No problem.
>> Rob Fergus: The fun thing is we did run some actual results on the 80 million
images. So it's not quite 80 million -- it's 79 million and whatever. It's about 75,000
different classes. The 126 classes I was just running experiments on are a
subset of the 75,000. We took all the labels from CIFAR that Geoff provided, so
there's about 450,000 of them, of which about 64,000 were positive.
So this is obviously -- it's a big number, but it's fairly small in
comparison to this.
And what we did is we took the descriptors of those 79
million images, PCA'd them down to 30 dimensions, and computed the first 48
eigenfunctions of that data. And so we precompute all this ahead of time.
And you end up with this giant 20-gigabyte matrix -- an 80 million by 48 matrix. And
what you can do is -- those eigenfunctions have basically been computed on
all 80 million. When you plug labels into some of the points, you can then
propagate the label information through; effectively you're propagating it through
the 80 million images via the eigenfunctions.
So this is the kind of result you get. This is the raw ranking of the
images. Here's what you get with nearest neighbors, which is just a
supervised scheme -- this is just using a few of the CIFAR labels per class for
each of these guys. We give, I think, two or three positive and two or three
negative examples for each one of these rows. And with nearest neighbors
you can see it does okay, improves things a bit. But you definitely get a better
result if you're actually able to use the data density to regularize things.
So both these two columns have the same amount of training data, I'm sorry,
same amount of hand-labeled data. Whereas, this one is sort of able to
regularize the solution by using the full set of 80 million images. It's very quick,
because once you compute the eigenvectors, what you need to do is just look up
the points that you want to sort. In this case we're dealing with just the images
from a given word and then we just need to basically solve a K by K system. So
K is 48, in fact.
So it takes a fraction of a second to actually produce these results.
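A sketch of that deployment step: with the eigenfunction values precomputed for every image (48 numbers each), reranking one keyword is just a small solve plus a sort, reusing the hypothetical solve_in_eigenbasis helper from the earlier sketch.

    import numpy as np

    def rerank_keyword(U_word, sig, y, labeled_mask, lam=100.0):
        # U_word: precomputed eigenfunction values (e.g. 48 per image) for this
        # keyword's images; sig: the corresponding eigenvalues.
        f = solve_in_eigenbasis(U_word, sig, y, labeled_mask, lam=lam)
        return np.argsort(-f)            # indices of the keyword's images, best first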
>>: Could you go through the three columns again?
>> Rob Fergus: Sure.
>>: The first column is what you get out of the search engine.
>> Rob Fergus: That's correct. Then what we do, I picked rows here for which
we actually have labels in Geoff's label set. And I take, for each one of these rows
I'll take maybe three positive examples and three negative or something like that.
So a few, sparse set of labels.
And then you can do nearest neighbor. You can take each image from that
keyword. So this was French spaniel or something, and you would then rerank it
based on how close it was to those six labeled examples.
And then this is what this column is here. This column is we are basically solving
that semi-supervised learning system using the approximate eigenvectors for
those images, computed using the eigenfunctions which were
formed from the whole 80 million image set.
>>: Could you, not right now, but if you were reworking the talk, could you show
the three positives and two negatives for each category?
>> Rob Fergus: Yes, that's true.
>>: You can see visually how similar they are, what's the intuition behind why
nearest neighbor is working so-so and not great, and what's going on in the other
one, that sort of defines that connected manifold away from each nearest
neighbor that's not obvious in the sort of nearest neighbor gist PCA sense, but
somehow more plausible as it's chaining. Isn't that sort of what's happening?
You're sort of chaining through sets of neighbors?
>> Rob Fergus: That's the idea, yeah. Yeah. That's a good idea. I should have
done that, you're right.
>>: Each one of these is a binary classification, right?
>> Rob Fergus: Well, each image is going to get a number assigned to it by that
label function. So I'm showing you here the ones which have the highest value of
that label function. You have the confidence associated with each of these. The
one at the top left guy is the one that had the highest label function.
>>: You couldn't solve this problem by taking the positive examples and then just
randomly sampling a lot of false positives, a lot of the non-category images,
and just using those -- actually doing like a sampling technique or a direct method on
that?
>> Rob Fergus: You mean just training the classifier on that and --
>>: Yeah.
>> Rob Fergus: The catch is -- there are sort of approaches to
semi-supervised learning that try to do this. The problem is, if you have --
okay, I mean, this is a slightly toy example -- if you have some sort of elongated
structure to your data like this, and you have labels down here, then
your classifier will learn to do something sensible with this portion. Then
you have to sort of iteratively, slowly work your way along this thing until you
get to the end.
So it's true that training the classifier in each iteration might be sort of
linear in the number of examples. But you may have to do many iterations of it to
actually get to these data points that aren't very close to anything else,
particularly.
>>: Because, like, as an intuition, all the toy examples you had at the beginning
had an even number of positives and negatives -- the number of points was
about even. But if the number of positives is much lower than the
negatives, do these techniques still work, if you have a large discrepancy in
the number of positives?
>> Rob Fergus: So you still have this sparse amount of labels that you have to
do something sensible with. Sure, you could declare everything negative, but
then you would have -- it's not going to do anything useful. So, yes, what we're
saying is the intuition there is, right, so you're sort of chaining between examples.
These eigen functions are providing you with some way of propagating those few
hand labeled examples, that label information through the rest of the data.
And so, yeah. It seems to do -- it seems to help over just somehow taking the
sort of, just using nearest neighbors, using sort of the immediate closest
examples.
>>: Nearest neighbor, did you specifically remove the positive examples or the
first three images?
>> Rob Fergus: Yes, I think we did remove the -- okay. So I guess I've been
talking about a semi-supervised scheme that can scale to these really large
problems. So essentially it's linear in the number of examples.
So it's feasible to do on these giant datasets. Certainly we've
worked on 80 million, but there's no reason why this wouldn't work on a thing of
a billion images or whatever. The fundamental difference is rather than
subsampling the data, we think about what happens when you have an infinite
amount of unlabeled data and treat it as a density. The big catch is we do assume
the distribution is somehow separable. The last experiment was showing it's
feasible to do this whole thing on a big graph with 80 million images or whatever,
in a fraction of a second.
So that's it.
>> Rick Szeliski: Thank you.
[applause]
>>: I've asked a lot. So you guys go ahead.
>>: So in your examples from the semi-supervised case, you knew it's a
two-class problem. So the 80 million Web image dataset, did you -- in other
words, like in unsupervised learning, it's a clustering problem. You don't know
how many classes there are. And in the semi-supervised case you find new
clusters or you don't do that in this one?
>> Rob Fergus: No. So exactly how you deploy the semi-supervised learning
thing, it's true there are many different ways you can do it. In the 80 million case
what we were doing was simply to just consider all the images belonging to a
single word in isolation. And then the only external source of information would
come from the eigenfunctions which have been computed on all 80 million --
which you can think of as in some way defining a sort of
clustering basis over the data.
So other than that we weren't using any other classes. In some of the
experiments on the 126 classes we were using other classes there as negative
examples. So in fact you can do it with the 80 million. It is a bit slower. It takes
about 45 seconds. But you can actually do a giant thing where you would use all
79 odd million examples and just have a few positive examples. I'm not running
it like that.
I mean, it's slower, because you are -- and you really are having to sort of work
with matrices, and the matrices get very big. You then have to hold in memory your
entire 20-gigabyte matrix of eigenvectors to do that.
>>: Maybe I'm misinterpreting but on one of the slides you said you did better
than the eigenvectors.
>> Rob Fergus: Yes. Certainly possible because the exact one does overfit a bit
to noise. Playing with toy examples.
>>: How would you interpret it? If the data is independent, is it exact?
>> Rob Fergus: Yeah, so in this sort of case here, this exact case is quite
sensitive to -- I mean, the solution you can get can be quite different if you
generate random samples. There's something wrong with that. I know which
one you're talking about. But I'm -- at least that's our sort of understanding. I
mean, it's the same up to noise. I'm not sure it's any particular --
>>: One more time, right? So the [indiscernible].
>> Rob Fergus: The --
>>: [indiscernible].
>> Rob Fergus: So you want to put the cyan or the pink one?
>>: Sorry, the pink one.
>> Rob Fergus: The pink one. What's happening there is --
>>: You're trying to approximate.
>> Rob Fergus: That's right. I think what's going on -- yeah, what we think is
going on is that by confining it to just these separable, per-dimension ones, it's
probably helping. Because in all these cases the number K is the
same -- 256 for all cases.
So -- yeah. If you increase the number, if you let K
go off to N, then this guy comes up to the cyan thing -- the magenta comes up
to the cyan.
>>: It doesn't necessarily converge the same, because you're just enforcing
separability to --
>> Rob Fergus: No, this one is the exact eigenvectors. That doesn't have any
separability assumption.
>>: If you increase N on that that's not necessarily going to match the
performance, your separable thing.
>> Rob Fergus: Oh, no, no, on the least-squares case, which, yeah.
>>: What's the percentage of the samples used for the Nystrom?
>> Rob Fergus: So I did it with as many as [indiscernible] -- I think we had about
5,000 data points here. What's confusing here is there are many
different classes in the big pile of data. It's not just like -- it's all the test data as
well. So there's a lot of data you actually have to compute over.
>>: What's the percentage?
>> Rob Fergus: So over here, you've certainly got -- let's say roughly 63,000
images. And the largest matrix you can compute eigenvectors of in any
reasonable time was about 5,000 -- a five or six thousand sized matrix. I wasn't
using all the tricks; you could use some fancy tricks to compute the
eigenvectors. The thing you have to be careful about is you can't necessarily
assume your Laplacian is sparse, because if you try to sparsify too much you
start ending up with disconnected islands of points. If it's nice and dense,
then there's lots of bandwidth in the graph for the labels to propagate through.
But if you start sparsifying like crazy, then you can start getting these
disconnected guys out here who just don't talk to the rest of the graph. So there
is a bit of danger.
There are lots of techniques for computing eigenvectors of very large sparse
systems and stuff like that. But you don't have to sparsify too much before this
sort of thing starts to happen.
>>: You're always doing two-class problems -- what about doing a multi-class problem?
>> Rob Fergus: At the moment -- at the moment we're just solving a two-class
problem.
>>: A whole bunch.
>> Rob Fergus: A whole bunch of different two-class problems, and averaging the
results to show you the curve. We have basically more work on doing the
multi-class thing, where we start considering all classes simultaneously.
>>: So basically taking -- quadratic? [indiscernible] in essence we would like to
solve a two-class problem [indiscernible] in essence linear, and in essence we'd
like to solve it. If you can use the unlabeled data somehow --
>> Rob Fergus: Yeah. [chuckling]
>>: It wasn't really a question.
>> Rob Fergus: No, no. [laughter]. So I guess the point is your choice of
features here. I mean, it's quite a simple scheme that you could plug into. If you
learned your features from some sort of deep belief net or something like that, you
could just plug them in and this would give you a new way of -- the other thing to
say is, yes, there's the ability to weight your examples according to how
confident you are in their label. It's trivial -- you just set the number.
>>: The other thing is, when you've got unreliably labeled data, if you translate it
back to the unreliable labels, it's trying to make the answer fit the label and
change the parameters so that it's [indiscernible] -- so what it ought to be trying to
do is first decide whether it believes this label and then decide whether to try to
fit it. So really, it really is a mixture of do I trust this label. If there's no
reason from the unsupervised learning, you can reject a lot of the labels. So I
think overall it will work much better. I think the idea of putting in unreliable
data is sort of don't trust these, trust [indiscernible].
>> Rob Fergus: I see what you're saying. I see.
>>: Based on what you already understand. With this label, we have evidence.
>> Rob Fergus: I know what you're saying.
>>: If you go back to your energy formulation on the very first page, right, that's
not sort of built in -- there's a lambda, and you trust it. In other words, a
covariance on the measurements to help out the process.
>>: As soon as you've got the unlabeled data, and if it fits the Laplacian, which
it has to for this particular machinery, then you ought to be able to believe
some labels more than others based on what the data says.
>> Rob Fergus: Right. In practice -- you sort of see that. When you solve
the label function, the guys that have true labels will actually have other
points nearby which have the same label. But the outliers will be way, way
out in space, having only a tiny affinity connecting back to the rest of the guys
where the action is somehow. So --
>>: You could say that this is still intrinsically a least-squares formulation, right?
Even though -- actually, in this formula there's nothing saying it's a two-label
problem. You could make Y a multi-valued function; it would have weird behaviors, it
would interpolate between labels, right? Let's say it's binary. This is least
squares. People who do robust stuff would say put a sort of robust function over
each one of these F minus Y terms.
>> Rob Fergus: Actually, we did try that quickly. At least for our work it didn't
make a massive difference.
>>: Depends on how noisy your data is. If it went through Geoff's process we
assume the labels aren't noisy, right? The humans annotated those, so it wouldn't
make a difference.
>> Rob Fergus: That's right. So we do have work where this is not a vector
anymore, it's a matrix, where each column is a different class, and
you're solving for a matrix there rather than a vector. This is our Gaussian
solution. Someone has been [indiscernible] so I'm not sure how much I should say
about this.
So if you just want to try it -- we've written a little MATLAB version of it so you
can. And, I mean, it's the same code that we used to run the 80 million images on.
So if you have enough memory to hold everything in memory, it's very simple code.
>>: But essentially, like you said when you summarized, you first do a rotation,
hoping to make it more separable, and then basically analyze each dimension
separately using a histogram. And among those you choose K from each
dimension?
>> Rob Fergus: No, no, that's the point, because you might have one
dimensional --
>>: You treat your basis functions as constant along all dimensions except for
one, right?
>> Rob Fergus: Constant --
>>: Like you showed in the diagrams, basically just one-dimensional functions.
>> Rob Fergus: Yes, that's right.
>>: Across the other dimensions. You just choose the best ones.
>> Rob Fergus: Yeah.
>>: Right. Thank you.
>> Rob Fergus: All righty.