>> Rick Szeliski: Okay. So it's my great pleasure to welcome Rob Fergus here today. Rob is a professor at New York University, and he's very well known in the computer vision and recognition community for having developed one of the first systems to learn constellation models. He's also done wonderful work in computational photography and in aspects of learning with very large data collections, which I think is related to his topic today. And he's on his way up to NIPS, where he'll be delivering a version of this talk. >> Rob Fergus: Thanks, Rick. This is some work I've done with Yair Weiss and Antonio Torralba, and what I'm going to talk about today is doing stuff with huge collections of images. There's been some quite interesting work by people at Microsoft here on taking people's holiday snaps on the Internet and creating rich 3-D environments from them, and also doing fun things with your photographs to repair them and so on. All these approaches really leverage the vast amount of data that's out there on the Web. But people have been less successful at developing recognition techniques that scale to the billions of images that you have out there now on Facebook and these other websites. So what I'm going to present today is one approach, which you could integrate with a variety of different recognition systems, that really does scale to large collections of images. Okay. So one feature of the kind of data on the Internet is that you don't just have nice clean labels. Often you have a spectrum of label information. You have some images which have been manually annotated, with nice clean labels which you trust, but that's likely to be a small fraction of this giant amount of data out there. What you're likely to have a lot of is noisy labels, where people have some sort of text metadata which gives you some clue as to what might be in the image, but you certainly can't trust it entirely. Then, of course, there's also going to be a vast amount of data which has really no label of any sort. What you would like is some approach which can integrate information from all these different sources and do something useful with them. The machine learning people have developed approaches that do this, which is essentially semi-supervised learning. I'll give a brief introduction for those who aren't familiar. The idea is you're given a whole collection of data, and you assume that only a very small number of points are labeled. In this case only two data points are labeled. In the classical supervised setting, you would just take these two labels, train your classifier and try to label the remaining examples. So if you do something like nearest neighbor, this is the kind of thing you end up with: the decision boundary is just equidistant between the two points. It's a perfectly reasonable thing to do, but you can see it somehow doesn't respect the density of the data. In semi-supervised learning what you would like is for the labels to propagate smoothly through the density, so that the resulting labels accurately reflect the structure of the data that you have. Okay. So the semi-supervised approach I'm going to talk about today is based on a common family of methods that use graph Laplacians as the basis of the approach. 
Okay. So the idea is that each image is going to be a vertex in a graph. You look at the affinities between the images, which correspond to the edges in the graph. We're going to have a Gaussian-type affinity based on how far apart the images are, X being some descriptor we extract for each image. And the thing we're going to use is the graph Laplacian. So here you have this affinity matrix W; D is just a diagonal matrix where you sum across the rows of W; and then L is the matrix you construct from them. It's a little bit hard to get intuition for what that does. To get an understanding, what it's really doing is measuring the smoothness of your label function F. The idea is that if two points are close together and have a high affinity, you'd like the F values of those two data points to be very similar. And equally, if the affinity is small, then these two F values can vary to a greater extent. So what you want, of course, is to find an F that minimizes this quantity here. Now, this has a trivial solution: if you just set F equal to a constant value, you end up with this being zero. But if you bring in the labels from the few labeled examples you have, then you end up with this expression here. So you have a term which measures the smoothness of your label function and a term which checks how well your label function agrees with the few labels you've been provided with. And the lambda here is just a weight, which you can think of as being large on your labeled examples and zero otherwise. You can then rewrite this weight as a diagonal matrix, and what you then want to do is find the F that minimizes this expression. This is a fairly straightforward thing to do. Just to make clear, the points that are unlabeled have zero weight in this term and the ones that are labeled have some fairly large value lambda. >>: [Indiscernible] the label numbers of 1-B? >> Rob Fergus: Um. >>: The scalars. >> Rob Fergus: The class label, yes, we're just considering the simple two-class problem here. We're not doing any multi-class thing in this work. Okay. So finding the minimum of this is fairly straightforward: you've got to solve a linear system like this. The problem is that if you have N images, then this is going to be an N-by-N system, which, if N is a billion or something like that, means you're in trouble. You've got to somehow invert this billion-by-billion matrix. Now, one thing people often do is, instead of directly solving this least-squares system, they consider the eigenvectors of the matrix L. The intuition behind looking at the eigenvectors is that the smallest eigenvectors will give you a basis which is smooth with respect to the data. I'm just visualizing here the smallest eigenvector of the graph Laplacian. It's a small constant offset; we don't use that one. You can see the second smallest partitions the data in two, and the third one partitions it vertically, and so on. So what we can do is just represent our label function F as a linear combination of these eigenvectors of the graph Laplacian. >>: [indiscernible]. >> Rob Fergus: So the sigmas are the eigenvalues. 
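To make the quantities just described concrete, here is one standard way to write them down. This is a sketch consistent with the description in the talk; the exact affinity bandwidth and Laplacian normalization used on the slide may differ.

```latex
% Gaussian affinity between images i and j, with descriptors x_i, x_j:
W_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\epsilon^2}\right), \qquad
D = \mathrm{diag}\Big(\textstyle\sum_j W_{ij}\Big), \qquad
L = D - W
% (the normalized variant L = I - D^{-1/2} W D^{-1/2} is also common)

% Smoothness of a label function f: small when high-affinity pairs agree
f^\top L f = \tfrac{1}{2}\sum_{i,j} W_{ij}\,(f_i - f_j)^2

% Add agreement with the few given labels y, weighted by \lambda_i
% (\lambda_i = \lambda for labeled points, 0 for unlabeled), \Lambda = \mathrm{diag}(\lambda_i):
J(f) = f^\top L f + (f - y)^\top \Lambda\,(f - y)
\;\;\Rightarrow\;\;
(L + \Lambda)\,f = \Lambda\, y
```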
You can see they're getting larger because we're looking at the smallest ones. >>: Just trying to -- the magnitude of sigma two and sigma three, do they mean anything? >> Rob Fergus: It's hard to get a feel for what magnitude these should be. It's a normalized graph Laplacian, so that does mean the eigenvalues are bounded, by two or one, I can't remember which. Okay. If we substitute F by a linear combination of the eigenvectors in the previous expression, we end up with this. So this is the smoothness term as before, and it's now just a diagonal matrix of the eigenvalues of your graph Laplacian, if we take the smallest K of them, removing the very smallest one because it's just a DC term. The alpha is our vector of coefficients over the eigenvectors. And this is the term which tells us we want our label function to agree with the labels. So if we want to solve for the optimal value of the coefficients alpha, we now just have to worry about a K-by-K system. So things are better. But, of course, how on earth do we find the eigenvectors of a billion-by-billion matrix if that's what we're dealing with? In some sense this makes things a bit easier in terms of solving for the label function, but we have to find the eigenvectors in the first place. Just to hammer this point home: if you deal with the original least-squares system and you have an 80 million image dataset, you've got to invert an 80 million by 80 million matrix. If you want to use the eigenvector basis, you have to diagonalize an 80 million by 80 million matrix. You might be able to do it for a million images, but you certainly can't do it for a billion. >>: Is the number 80 million magic? >> Rob Fergus: It's the size of the dataset that I'm going to show this operating on later. Okay. So what do other people do? For people who want to use semi-supervised learning at this scale, the standard approach is basically a common one called Nystrom. There are many other methods that share this common property: essentially they take the data and just pick a subset of it to act as a series of landmarks, and solve for an exact solution on that. Typically they're solving for the exact eigenvectors of the graph Laplacian of this reduced landmark set. Then they interpolate the other data points back into those eigenvectors. So the basic intuition is that they reduce the amount of data they have to deal with, solve an exact problem on that, and then figure out what to do with the rest. Okay. So we're going to do something a bit different. That was a warm-up; this is what we're actually going to do. As I was saying before, things like Nystrom fundamentally reduce the number of data points they use. In contrast, what we're going to do is think about the limit case as the number of data points goes off to infinity. You can think of the data now as a continuous density. And the key selling point is that our approach is going to be linear in the number of data points. We have to make a big assumption to make that happen: instead of trying to solve an exact problem on a small subset, we're going to make an approximation that makes it linear in N. 
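Written out, the substitution being described is the following (again a sketch in the notation above, with U the matrix of the K smallest eigenvectors and Sigma the diagonal matrix of their eigenvalues):

```latex
% Restrict f to the span of the K smallest eigenvectors of L:
f = U\alpha, \qquad U = [u_1, \dots, u_K], \qquad L\,u_k = \sigma_k u_k

% Substituting into J(f) and using U^\top L\, U = \Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_K):
J(\alpha) = \alpha^\top \Sigma\,\alpha + (U\alpha - y)^\top \Lambda\,(U\alpha - y)
\;\;\Rightarrow\;\;
\big(\Sigma + U^\top \Lambda\, U\big)\,\alpha = U^\top \Lambda\, y

% A K-by-K system: cheap to solve, but U itself is still expensive to obtain.
```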
The basic idea is that if you consider this little distribution of data here, as the number of data points goes off to infinity, we can think about dealing not with a discrete set of data but with a continuous density. In this toy example, we've got a two-dimensional distribution of data. And what we can do is write down a continuous analog of the graph Laplacian: an operator L_p which measures the smoothness of a function F on this density. You can see it's very much the same as before. This W here is an affinity between two points in your density, the same Gaussian-type affinity, and then you have this term which says that nearby points X1 and X2, which have high affinity, should have similar values of the label function F. What we're going to do is look at the eigenfunctions -- eigenfunctions now, because everything is continuous -- of this operator L_p. Just to understand what these things look like: this is the discrete case where we had discrete data, and those were the eigenvectors of the graph Laplacian. And these are the eigenfunctions of the continuous analog: this is the density and these are the functions. You can see the second smallest eigenfunction splits the data vertically, then the third smallest one splits it in the horizontal direction. Okay. So the idea is that as the number of data points goes off to infinity, the eigenvectors in the limit become the eigenfunctions. Finding those eigenfunctions is what we're going to do. Now, if the density of the data has a nice analytic form, for example if it is uniform or Gaussian, then you can actually write down an expression for the eigenfunctions. Unfortunately, most real-world data doesn't have a uniform or Gaussian distribution, so you're in trouble. What we're going to do is compute a numerical approximation to these eigenfunctions. Okay. So the key assumption we have to make to be able to do this is that the input distribution is separable. That is, in our toy 2-D example here, we assume that we can -- whoops, sorry, my slide is slightly out of sync here -- we assume that we can model this as a product of two distributions, one over each dimension. There's a theoretical result that says if we can compute the eigenfunctions of each one of these marginals in turn, then they will in fact be eigenfunctions of the overall joint density, which is what we're trying to find. Now, this is quite a big assumption, and we're going to come back and revisit it on both toy and real data. >>: Your formula for the continuous Laplacian already had it in a separable form? Just kind of jumping ahead. >> Rob Fergus: Yes. Thank you. Okay. >>: On the previous page. >> Rob Fergus: Yes, you're right. >>: They were already separable. >> Rob Fergus: Yes, that's true. It's been written out like that. >>: Okay. It's not endemic in the definition. >> Rob Fergus: Yes, exactly. Okay. So let's just consider how we would find a numerical approximation for a one-dimensional distribution, which is going to be a marginal. Let's just consider the marginal distribution for, say, the first dimension, P of X1. 
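For reference, the continuous analog being described can be written roughly as follows. This is a sketch: the factor of one half and the exact density weighting follow the usual convention and may differ slightly from the slide.

```latex
% Smoothness of f under the density p, with the same Gaussian affinity W:
L_p(f) = \tfrac{1}{2}\iint W(x_1, x_2)\,\big(f(x_1) - f(x_2)\big)^2\, p(x_1)\, p(x_2)\, dx_1\, dx_2

% As the number of samples drawn from p goes to infinity, the graph-Laplacian
% eigenvectors converge to the eigenfunctions of L_p evaluated at the samples.

% Key assumption: the (rotated) density is separable over dimensions,
p(x) = \prod_{k} p_k(x_k),
% so eigenfunctions of each 1-D marginal p_k are also eigenfunctions of the joint density.
```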
Here we in fact have a lot of data points that we assume have been drawn from this density. We'd like to compute the marginal distribution of this guy; that's what this blue curve is. Of course, in practice we don't have this; what we really have is the data itself. So we can form a histogram: we put some bins at discrete locations, see how many data points fall into each bin, and we get an approximate version of the true marginal, this histogram H of X1. Then what we can do is solve a little eigenvector equation for the values of the eigenfunction at those discrete locations -- basically the centers of the histogram bins. We're going to solve for G, which is the value of the eigenfunction, and the associated eigenvalue of that eigenfunction. You can compute it pretty easily. P is a diagonal matrix that holds the values of the density -- just the normalized histogram counts -- on its diagonal. And these other matrices are analogs of the things we had in the original graph Laplacian: W is just the affinity between the locations in your one-dimensional space, so the affinity between different bins, and D is just the sum over the rows of W. So when we're solving this equation -- remember we're in 1-D, looking at a one-dimensional distribution -- we're solving for G at a series of discrete locations, which correspond to the bin centers, and for sigma, the associated eigenvalue. So when we do that, for example on that distribution, what do they look like? This is the first eigenfunction of that little marginal distribution; this is what it looks like here. You can see it changes rapidly where the density is pretty low, and then it's fairly constant where you actually have high density. The second one looks like this, and the third one looks like this. Each of these little circles is lined up with the X coordinate of a bin center. So we do that for one dimension and then we go on to the second dimension. This is what the marginal distribution looks like for the second dimension -- sort of Gaussian, in our toy example. The first eigenfunction is pretty linear, then the second one has a kink in it, and the third one has got a second kink in it, and so on. >>: The little red bars in the third one just disappeared. >> Rob Fergus: You're quite right. That's a cut and paste error, I think. It's meant to be the same distribution. Okay. So what we've done now is, for each dimension, compute these little 1-D eigenfunctions. And so the question is -- sorry, these things are jumping around -- what we're interested in is computing the approximate eigenvectors of the graph Laplacian. How do we go from these eigenfunctions to the eigenvectors? What we do is take each data point and simply interpolate it into the approximate eigenfunction. Fortunately this is a pretty quick operation: one-dimensional interpolation. If we have K of these eigenfunctions, for each data point we have to do K interpolations. So if we have a billion data points, we have to do this many times, but fortunately it's a pretty quick operation. So that's pretty much the algorithm. Just one important preprocessing step: we do assume the data is separable for this to work effectively. 
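To make the numerical step concrete, here is a minimal Python sketch of computing 1-D eigenfunctions from a histogram. The function name, the default bin count, the affinity length scale, and the exact form of the density-weighted eigenproblem are illustrative assumptions; the talk only specifies the ingredients (the density P at the bin centers, the bin affinity W, and its row sums D), so this should be read as one plausible discretization, not the authors' exact equation.

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunctions_1d(x, n_bins=50, eps=None, n_funcs=10):
    """Numerically approximate the smallest eigenfunctions of a 1-D marginal density.

    Histogram the data, build a Gaussian affinity between bin centers, weight it by
    the density, and solve a small generalized eigenproblem for the eigenfunction
    values g at the bin centers (a plausible discretization in the spirit of the talk).
    """
    counts, edges = np.histogram(x, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = counts / counts.sum()                      # approximate marginal density (P's diagonal)

    if eps is None:
        eps = (centers[-1] - centers[0]) / n_bins  # affinity length scale (heuristic choice)
    d2 = (centers[:, None] - centers[None, :]) ** 2
    W = np.exp(-d2 / (2 * eps ** 2))               # affinity between bin centers

    # Density-weighted affinity and its degree matrix.
    Wp = np.diag(p) @ W @ np.diag(p)
    D = np.diag(Wp.sum(axis=1))

    # Generalized eigenproblem: (D - Wp) g = sigma * diag(p) g  (smoothness vs. density).
    # A tiny ridge keeps the right-hand matrix positive definite when bins are empty.
    sigmas, G = eigh(D - Wp, np.diag(p) + 1e-12 * np.eye(n_bins))
    return centers, sigmas[:n_funcs], G[:, :n_funcs]
```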
Your input data may well not be separable at all, so we use PCA to rotate the data. Now, it would be nice to be able to apply other types of mapping to the input data, but the catch is that you must preserve the distances, because the affinities all depend on the distances between points. If you do anything other than rotate, you're in trouble. PCA is one form of rotation; you can imagine other types. For the time being we're just using PCA to rotate our input data and hopefully make it more separable, and we'll look at that in due course. Just to summarize the algorithm. We take our input data and do PCA on it to rotate it, hopefully making it more separable. For each dimension of the PCA'd data in turn, we construct a one-dimensional histogram to get the marginal density, and solve numerically for the eigenfunctions and associated eigenvalues -- those wiggly green curves I was just showing. Then we use the eigenvalues we computed to sort the eigenfunctions from all the different dimensions and take the first K. You have to decide how many you want to look at; maybe K equals 64 or 128 or something like that. So for each of the K eigenfunctions which have the smallest eigenvalues across all dimensions, you take your data and interpolate into those K eigenfunctions. That gives you an approximate set of K eigenvectors of your normalized graph Laplacian, which you can plug into that least-squares system we were looking at earlier and then solve to give you the label function over your data; a sketch of this pipeline follows below. >>: So is K here the number of dimensions of the vector, or is it the first number of eigenvalues? >> Rob Fergus: Sorry, this is true, it's not very clear. K is the number of eigenvectors in your little least-squares equation that you're going to deal with. So alpha would be a vector of length K: coefficients on each of those K eigenvectors. What we've done basically is come up with a scheme for computing an approximation to U. It assumes separability of the data, but it's linear time, rather than some sort of polynomial-time algorithm for computing the exact eigenvectors. >>: Okay. >> Rob Fergus: Okay. So just to do a little comparison here. In Nystrom, they select a small subset of M landmarks, M being some moderate number, and compute the exact eigenvectors of that M-by-M system -- in fact they compute all of them, or pick the smallest K. Then they interpolate the N points back into those K eigenvectors and solve the K-by-K linear system. So they're polynomial in the number of landmarks, because they're doing this exactly; even with some fancy iteration, it's still going to be at least quadratic in M. With us, we rotate our points and form D one-dimensional histograms, D being the dimensionality of the input vectors. For each of the D dimensions we have to solve a little B-by-B linear system to get the eigenfunctions. That's tiny, because you only need something like 50 histogram bins, so 50 by 50. Then you have to interpolate your data points in; that's fast, but it's linear in N. And then you solve a K-by-K linear system. So we are linear in the number of data points. Of course, we are making this big assumption about separability of the input data. Let's look at the toy experiments now. So here's a little grid of data, each element perturbed by Gaussian noise. Stick labels on two data points and see what happens. 
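Putting the pieces together, here is a rough end-to-end sketch of the pipeline as summarized above: PCA rotation, per-dimension eigenfunctions, sorting by eigenvalue, then interpolating every data point. It reuses the hypothetical eigenfunctions_1d helper sketched earlier; all names and defaults are illustrative.

```python
import numpy as np

def approximate_eigenvectors(X, k=64, n_bins=50):
    """Approximate the k smallest graph-Laplacian eigenvectors via 1-D eigenfunctions.

    Sketch of the algorithm summarized in the talk; assumes the hypothetical
    eigenfunctions_1d() helper from the previous snippet.
    """
    # 1. Rotate with PCA (distance-preserving), hoping to make dimensions separable.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xr = Xc @ Vt.T

    # 2. Per dimension: histogram -> numerical 1-D eigenfunctions and eigenvalues.
    candidates = []                                # (eigenvalue, dimension, centers, g)
    for d in range(Xr.shape[1]):
        centers, sigmas, G = eigenfunctions_1d(Xr[:, d], n_bins=n_bins)
        for j in range(1, len(sigmas)):            # skip the constant (DC) eigenfunction
            candidates.append((sigmas[j], d, centers, G[:, j]))

    # 3. Keep the k eigenfunctions with the smallest eigenvalues across all dimensions.
    candidates.sort(key=lambda c: c[0])
    chosen = candidates[:k]

    # 4. Interpolate each data point into each chosen 1-D eigenfunction.
    U = np.empty((Xr.shape[0], k))
    for col, (_, d, centers, g) in enumerate(chosen):
        U[:, col] = np.interp(Xr[:, d], centers, g)

    sigma = np.array([c[0] for c in chosen])
    return U, sigma                                # approximate eigenvectors and eigenvalues
```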
If you solve the exact least-squares system, it's a little bit sensitive to the noise; it jumps up and down, but you can see it does roughly the right thing. The eigenvector approach doesn't have that diagonal slope for whatever reason; it seems to do a sensible thing, though it probably overfits a little bit to the two examples we have. And our approach does something reasonable too. So let's look at these two approaches: this one is using the exact eigenvectors of the graph Laplacian and this one is using the approximation we computed using the eigenfunctions. If we take a look at them for that data -- let me get the laser pointer so I can wave -- this is the smallest one, which is the smoothest. Both of them are pretty much the same, and they've got very similar eigenvalues. The second one again is pretty similar; the sign has been flipped, but it's basically the same shape. The third one is again very similar, with similar eigenvalues. Then the whole thing starts to diverge, because currently we only consider the per-dimension ones; we don't allow any sort of diagonal eigenfunctions, if you like. In this case, you can start to see this exact one is using both dimensions: you get a funky type of function here which has a smaller eigenvalue than these ones here, which start partitioning up the horizontal axis more finely. >>: Why don't you take the outer product of the eigenfunctions for two dimensions? >> Rob Fergus: We thought about doing this, but not at the moment. >>: That's why I was wondering why you don't have a K squared or K to the D. >> Rob Fergus: That's right. We've been thinking about ways to compute combinations, but we've not done that yet. Okay. So now just to look at another little toy dataset. This is some toy data with two labels. Nystrom will decimate the data and compute an exact solution on that, and you can see that if you decimate too much, then you end up losing the structure of the data. So the eigenvectors that it computes here do the wrong thing: these red points come across the gap. By contrast, ours looks at all the data, and in the toy case anyway it does the right thing. Now, we fall down in a case like this. We have two concentric circles, so there are significant dependencies between the dimensions now. You put a label on the inner and the outer circle. Nystrom does the right thing even though it's got a small set of data to work with; it keeps the structure of the data around. And ours has done horribly, because it's missing the ability to model the dependencies between dimensions. Okay. So this is a slightly philosophical slide. It makes the point that at NIPS everyone would have you believe that all your data is like a Swiss roll: you have these incredibly intricate dependencies that lap back on themselves and so on. I think it's not clear whether real high-dimensional data is really like that. Another plausible model is that you have small sets of dimensions that are coupled, but those cliques are fairly independent of one another. You have something like a sea urchin or a puffer fish: many of these long heavy-tailed kinds of things in your high-dimensional space, and there isn't this intricate manifold structure or whatever you want to call it. And of course you can't visualize it, so it's hard to get a feel for it. 
But, yeah, I'm not sure I totally buy the Swiss roll model that all the people at NIPS spend their time worrying about. Let's look at some real data. What you see here is images we downloaded from the Internet. In this case there were 126 classes: we basically took different nouns, put them into search engines, and got back the images -- 63,000 of them. Geoff Hinton, Alex Krizhevsky and their colleagues at Toronto got people to label these, so we actually have ground truth labels, which lets us carefully assess the performance of our algorithm on them. This was part of the CIFAR-funded labeling effort. >>: Sorry. Why are there pictures of clocks in the class of -- >> Rob Fergus: It's because we simply put this noun into the search engine; I'm showing you the raw output of the search engine. Someone has gone through and labeled these as true instances of emu or not. So, for example, this would be a negative example of an emu, because it's a clock that came back for the query. Now, for each image we're going to use a very simple representation. We're not going to work with pixels directly; we'll use a single gist descriptor. This is like looking at the responses of oriented Gabors at different scales over the image, conceptually not too dissimilar to SIFT. It's been hand designed so that the Euclidean distance between gist descriptors is meant to be a rough measure of the distance between images. There's no learning involved; it's just a descriptor that someone's cooked up. And we're going to PCA this down to 64 dimensions to compute the eigenfunctions in. Now, one obvious question is: how independent are these dimensions? If we take the original gist descriptor, in its full dimensionality, we can form histograms of pairs of dimensions and look at the mutual information between them to get a feel for how tightly coupled the different dimensions are. And you can see there are very significant dependencies going on in the original gist descriptors. After we do PCA on this, it doesn't do too badly; in fact PCA does a pretty good job of removing the dependencies. This is looking at pairs of dimensions of the PCA'd gist descriptors. You can see the mutual information scores aren't zero, which would indicate true independence, but they're not too far off: fairly small values, and visually the joint histograms look fairly symmetric and so on, where the original ones weren't. Now, this could be good luck because we're dealing with images, and gist descriptors are sort of wavelet coefficients, so maybe it's particularly easy to make them fairly independent with PCA. Admittedly it would be nice to try it on more general types of data and see how realistic this separability assumption is. We haven't done that; we're just looking at the gist descriptors here. Okay. So one thing you might wonder is what the eigenfunctions of these PCA'd gist descriptors look like. This is a little visualization. What I'm showing you is the eigenfunction which has the smallest eigenvalue up here and then the 256th smallest one down here, and I'm color-coding them by the input dimension. You can see that the first few are almost linear functions of the first few PCA dimensions, which have higher variance, so you'd expect them to be more easily splittable. As you come down here you start to see second order ones, ones with little wiggles. 
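As a rough illustration of the separability check described here, one could estimate pairwise mutual information from 2-D histograms along these lines. This is a sketch; the bin count and the helper name are placeholders, and the talk does not specify how the scores were computed.

```python
import numpy as np

def pairwise_mutual_information(X, i, j, n_bins=32):
    """Estimate mutual information (in bits) between dimensions i and j of X
    from a 2-D histogram. Small values suggest the dimensions are close to independent."""
    joint, _, _ = np.histogram2d(X[:, i], X[:, j], bins=n_bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                  # avoid log(0) on empty cells
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Usage idea: compare scores on the raw gist descriptors versus the PCA-rotated ones;
# after PCA the pairwise scores should drop toward zero if separability roughly holds.
```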
And down here you start to see high frequency eigenfunctions of those first few dimensions, the ones up here. But you also start to see different dimensions: as you come down, the dimensions of the input space which had smaller variance start to crop up. The first eigenfunction in each dimension always looks pretty linear, and thereafter it has a first-order kink, then two kinks, and so on. So what we're going to do is take each data point and do interpolation in each of these eigenfunctions to compute a 256-dimensional representation that we then use in our semi-supervised learning scheme. Sorry, to make clear what each of these is: this is one input dimension, bounded between some X min and X max, and this little curve is where you've computed the values of the eigenfunction at the histogram bin centers in this dimension. What you do is take your data point -- whoops -- take your data point and just look it up in this curve here. The eigenvalues seem to do a good job of picking sensibly across different dimensions: you can see it allocates more eigenfunctions to the dimensions with bigger variance, and the ones down here still get some. We're just using the numerical eigenvalue to pick the allocation of the different eigenfunctions. Okay. So the task we're going to do is to rerank the images of each class. The idea is you're given some set of images for, say, the word airbus -- maybe several thousand of them. They've all been hand-labeled, so we know which ones are true examples, and the goal is to rerank them. You give the algorithm all 63,000 images to compute the eigenfunctions on, and then you take a few labeled training examples and propagate them, using the eigenfunctions, to the remaining set of images for this keyword. Hopefully the images which end up with the highest label value F will correspond to better examples of the class, so you improve the quality of the original ordering. To measure that, we're going to look at the precision-recall curve, specifically the precision at 15% recall, and we're going to do this for lots of different classes; I'm just showing you some of them here. Okay. Sorry, that was [indiscernible] of this slide. So what we're doing here is varying the number of labeled training samples we use. Bear in mind we've got these eigenfunctions that have been computed on all the data; that data is unlabeled when you compute the eigenfunctions, you don't use labels in any way for it. But then you start to add in more labeled examples. What we're measuring here is the precision at a certain level of recall, and chance is this line here. You can see that if we take a classic supervised method like an SVM, it horribly overfits the data when we have just a few examples. Then it revives and starts to do pretty well once we have enough training data -- this is 100 training examples per class or something like that. If we do the exact least-squares semi-supervised scheme, you can see it does pretty well early on. Here it's doing great, in the sense that it's much better than the supervised case, because it can regularize its solution using the data distribution. 
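For the evaluation metric mentioned here, a minimal sketch of precision at a fixed recall level might look like the following. This is an illustrative helper, not the authors' evaluation code; it assumes binary 0/1 ground-truth labels and higher scores meaning more confident positives.

```python
import numpy as np

def precision_at_recall(scores, labels, recall_level=0.15):
    """Precision of a ranking at the point where `recall_level` of the positives
    have been retrieved. `scores` are the label-function values F, `labels` are 0/1."""
    order = np.argsort(-scores)                    # rank images by descending score
    ranked = np.asarray(labels)[order]
    cum_pos = np.cumsum(ranked)
    n_pos = ranked.sum()
    # first rank position where recall >= recall_level
    k = int(np.searchsorted(cum_pos, recall_level * n_pos) + 1)
    return cum_pos[k - 1] / k
```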
So this is not using the eigenfunctions; this is just the standard least-squares method of doing semi-supervised learning. It gets expensive to run beyond a certain amount of data, so the curve stops here. If we try Nystrom, it does better than the supervised case but it's definitely losing quite a bit relative to the exact semi-supervised scheme. And then this is our approach. We seem to do about as well as the exact least-squares solution, but then we can continue up. To make clear, with the exact one you've got to invert a big matrix, so it's polynomial in the number of data points, and ours is linear. It goes up, and eventually starts to be beaten by the SVM once you have enough training data, but you can see you're getting a big win here with small numbers of training images. Okay. And then a couple of other things here: you can do nearest neighbors, in green, and you can use the exact eigenvectors, in magenta. Okay. So one thing you can try is varying the size of K, that is, the number of eigenfunctions I use, and seeing how that affects performance. I'm showing you K on this axis; this axis is the number of training samples per class, so completely unsupervised at one end and bringing in more labeled data as you go. The color shows performance: the more red it is, the better it's doing. You can see that when you don't have enough eigenfunctions, it doesn't do very well, and if you have too many, you start to overfit and performance degrades slightly. Now, one nice feature of this is that, as I was saying at the very beginning, you have noisy labels in real data and you might want to use them in these schemes. In semi-supervised learning that's pretty straightforward. What we can do is take all the images belonging to a given keyword and just label them as positive, even though in practice many of them will be incorrect, and take the images from all other keywords and label them as negative. So we're using the label that comes from the original search engine. We obviously don't trust these labels as much as the actual hand-specified ones, so we give them a much smaller weight: the value of lambda we use is one-tenth of whatever we give a hand-labeled example. If you do that, you find it actually gives you a very nice boost in performance. Certainly when you've got very few labeled examples, the noisy examples can give you a big help, and you get this nice red curve lifting up here. So it clearly still helps to have labeled data, but the noisy examples certainly give you a big advantage at the beginning. Okay. This is the same thing but shown in slightly more detail. This is increasing the number of classes you have in your data; this is the number of training samples. Without noisy labels you can see the performance is fairly weak when you have relatively few samples per class. When you bring in the noisy labels, things start to work a bit better. If you have more classes, then it can use the data from all the other classes as a large negative set, which is why you see the performance improving as you move across this graph. Okay. So I'm going slightly ahead of time here, I guess. What time did I start, Rick? >> Rick Szeliski: You started at around 1:35. >> Rob Fergus: Fine. So I'm not going to take the full hour. >> Rick Szeliski: No problem. >> Rob Fergus: The fun thing is we did run some actual results on the 80 million images. Well, it's not 80 million, it's 79 million and change. 
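A minimal sketch of the final solve, including the differently weighted noisy labels just described, could look like this in the notation used earlier. U and sigma come from the hypothetical approximate_eigenvectors() sketch above; the specific weight values are illustrative, with the talk only saying that noisy labels get one-tenth the weight of hand labels.

```python
import numpy as np

def semi_supervised_scores(U, sigma, y, weights):
    """Solve the K-by-K system (Sigma + U^T Lambda U) alpha = U^T Lambda y
    and return the label-function values f = U alpha for every image.

    U       : (N, K) approximate eigenvectors (from the eigenfunctions)
    sigma   : (K,) their eigenvalues
    y       : (N,) labels, e.g. +1 / -1, arbitrary for unlabeled points
    weights : (N,) lambda_i -- e.g. 10.0 for hand labels, 1.0 for noisy
              search-engine labels, 0.0 for unlabeled images
    """
    Lam_U = weights[:, None] * U                  # Lambda @ U without forming Lambda
    A = np.diag(sigma) + U.T @ Lam_U              # K-by-K system matrix
    b = U.T @ (weights * y)
    alpha = np.linalg.solve(A, b)
    return U @ alpha                              # confidence score per image

# Usage idea: rank one keyword's images by their returned score, highest first,
# to rerank the raw search-engine results.
```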
It's about 75,000 different classes; the 126 classes I was just running experiments on are a subset of the 75,000. We took all the labels that CIFAR and Geoff's group provided, so there are about 450,000 of them, of which about 64,000 are positive. So it's a big number, but it's fairly small in comparison to the 79 million. What we did is take the descriptor of those 79 million images, PCA it down to 30 dimensions, and compute the first 48 eigenfunctions of that data. We precompute all this ahead of time, and you end up with this giant 20-gigabyte matrix: an 80 million by 48 matrix. Those eigenfunctions have basically been computed on all 80 million images, so when you plug labels into some of the points, you can then propagate the label information through; effectively you're propagating through the 80 million images via the eigenfunctions. So this is the kind of result you get. This is the raw ranking of the images. Here's what you get with nearest neighbors, which is just a supervised scheme, using a few of the CIFAR labels per class for each of these rows -- we give, I think, two or three positive and two or three negative examples for each row. With nearest neighbors you can see it does okay, improves things a bit. But you definitely get a better result if you're actually able to use the data density to regularize things. Both of these two columns have the same amount of hand-labeled data; it's just that this one is able to regularize the solution by using the full set of 80 million images. It's also very quick, because once you've computed the eigenvectors, all you need to do is look up the points that you want to sort -- in this case just the images from a given word -- and then solve a K-by-K system, where K is 48. So it takes a fraction of a second to actually produce these results. >>: Could you go through the three columns again? >> Rob Fergus: Sure. >>: The first column is what you get out of the search engine. >> Rob Fergus: That's correct. Then I picked rows here for which we actually have labels in Geoff's label set, and for each one of these rows I take maybe three positive examples and three negative, or something like that -- a sparse set of labels. Then you can do nearest neighbor: you take each image from that keyword -- this was French span or something -- and rerank it based on how close it is to those six labeled examples. That's what this column is. And this column is where we solve that semi-supervised learning system using the approximate eigenvectors for those images, computed from the eigenfunctions which were formed from the whole 80 million image set. >>: Could you, not right now, but if you were reworking the talk, could you show the three positives and two negatives for each category? >> Rob Fergus: Yes, that's true. >>: Then you can see visually how similar they are, and what the intuition is behind why nearest neighbor is working so-so and not great, and what's going on in the other one that defines a connected manifold away from each nearest neighbor -- something that's not obvious in the nearest-neighbor gist-PCA sense, but somehow more plausible as it's chaining. Isn't that sort of what's happening? You're chaining through sets of neighbors? 
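As a usage sketch of the workflow described here -- a precomputed eigenvector matrix on disk, then a tiny per-keyword solve -- the following reuses the hypothetical semi_supervised_scores() helper above. The file names, shapes, and weights are placeholders, not the actual artifacts from the talk.

```python
import numpy as np

# Placeholder path and shape: an (N, 48) float32 matrix of approximate eigenvectors,
# precomputed once over the whole collection and memory-mapped so that only the rows
# for one keyword's images ever need to be touched.
U_all = np.memmap("eigenvectors_80M_x_48.f32", dtype=np.float32, mode="r",
                  shape=(79_000_000, 48))
sigma = np.load("eigenvalues_48.npy")             # (48,) matching eigenvalues

def rerank_keyword(image_ids, hand_labels):
    """Rerank one keyword's images given a handful of hand labels.

    image_ids   : indices into the full collection for this keyword
    hand_labels : dict {image_id: +1 or -1} for the few labeled examples
    """
    U = np.asarray(U_all[image_ids], dtype=np.float64)      # rows for this keyword only
    y = np.zeros(len(image_ids))
    w = np.zeros(len(image_ids))
    for pos, img in enumerate(image_ids):
        if img in hand_labels:
            y[pos] = hand_labels[img]
            w[pos] = 10.0                                    # illustrative hand-label weight
    scores = semi_supervised_scores(U, sigma, y, w)          # 48-by-48 solve, fast
    return [image_ids[i] for i in np.argsort(-scores)]       # best candidates first
```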
>> Rob Fergus: That's the idea, yeah. Yeah. That's a good idea, I should have done that, you're right. >>: Each one of these is a binary classification, right? >> Rob Fergus: Well, each image is going to get a number assigned to it by that label function. I'm showing you here the ones which have the highest value of that label function, so you have a confidence associated with each of these; the top-left one is the one that had the highest label function value. >>: Couldn't you solve this problem by taking the positive examples and just randomly sampling a lot of the non-category images as negatives, and just using those -- actually doing a sampling technique or direct method on that? >> Rob Fergus: You mean just training a classifier on that? >>: Yeah. >> Rob Fergus: The catch is -- there are approaches to semi-supervised learning that try to do this. The problem is, okay, this is a slightly toy example: if you have some sort of elongated structure to your data like this and you have labels down here, then your classifier will learn to do something sensible with this portion. Then you have to iteratively, slowly work your way along this thing until you get to the end. So it's true that training the classifier in each iteration might be linear in the number of examples, but you may have to do many iterations of it to actually get to those data points that aren't very close to anything else, particularly. >>: Because, as an intuition, all the examples you showed at the beginning had roughly even numbers of positives and negatives. But if the number of positives is much lower than the negatives, do these techniques still work when you have a large discrepancy in the number of positives? >> Rob Fergus: You still have this sparse set of labels that you have to do something sensible with. Sure, you could declare everything negative, but then it's not going to do anything useful. So yes, the intuition is that you're sort of chaining between examples. These eigenfunctions are providing you with some way of propagating those few hand-labeled examples -- that label information -- through the rest of the data. And it seems to help over just using nearest neighbors, just using the immediate closest examples. >>: For nearest neighbor, did you specifically remove the positive examples from the first few images? >> Rob Fergus: Yes, I think we did remove those. Okay. So I guess I've been talking about a semi-supervised scheme that can scale to these really large problems. It's essentially linear in the number of examples, so it's feasible to do on these giant datasets. We've certainly worked on 80 million, but there's no reason why this wouldn't work on a billion images or whatever. The fundamental difference is that rather than subsampling the data, we think about what happens when you have an infinite amount of unlabeled data and treat it as a density. The big catch is we do assume the distribution is somehow separable. The last experiment was showing that it's feasible to do this whole thing on a big graph with 80 million images in a fraction of a second. So that's it. >> Rick Szeliski: Thank you. [applause] >>: I've asked a lot, so you guys go ahead. 
>>: So in your examples from the semi-supervised case, you knew it was a two-class problem. For the 80 million Web image dataset, did you -- in other words, in unsupervised learning it's a clustering problem, you don't know how many classes there are. In the semi-supervised case, do you find new clusters, or do you not do that in this one? >> Rob Fergus: No. So exactly how you deploy the semi-supervised learning, it's true there are many different ways you can do it. In the 80 million case what we were doing was simply to consider all the images belonging to a single word in isolation, and then the only external source of information comes from the eigenfunctions, which have been computed on all 80 million. You can think of those as in some way defining a sort of clustering basis over the data. Other than that we weren't using any other classes. In some of the experiments on the 126 classes we were using the other classes as negative examples. In fact you can do that with the 80 million too; it is a bit slower, it takes about 45 seconds, but you can do a giant solve where you use all 79-odd million examples and just have a few positive examples. I'm not running it like that here; it's slower because the matrices get very big, and you have to hold your entire 20-gigabyte matrix of eigenvectors in memory to do that. >>: Maybe I'm misinterpreting, but on one of the slides you said you did better than the exact eigenvectors. >> Rob Fergus: Yes. That's certainly possible, because the exact one does overfit a bit to noise when playing with toy examples. >>: How should we interpret that? If the data is independent, it is. >> Rob Fergus: Yeah, so in this case here, the exact case is quite sensitive -- I mean, the solution you get can be quite different if you generate new random samples. There's something funny with that one; I know which one you're talking about, but at least that's our understanding. It's the same up to noise; I'm not sure it's anything particular -- >>: One more time, right? So the [indiscernible]. >> Rob Fergus: The -- >>: [indiscernible]. >> Rob Fergus: So do you mean the cyan or the pink one? >>: Sorry, the pink one. >> Rob Fergus: The pink one. What's happening there is -- >>: You're trying to approximate. >> Rob Fergus: That's right. What we think is going on is that confining it to just these separable, per-dimension eigenfunctions is probably helping, because in all these cases we're using the same K, 256 for all of them. If you increase the number -- if you let K go off to N, then this guy comes up to the cyan curve; the magenta comes up to the cyan. >>: It doesn't necessarily converge to the same thing, because you're just enforcing separability -- >> Rob Fergus: No, that one is the exact eigenvectors; that doesn't have any separability assumption. >>: If you increase N on that, that's not necessarily going to match the performance of your separable thing. >> Rob Fergus: Oh, no, no, only for the least-squares case, which, yeah. >>: What percentage of the samples did you use for Nystrom? >> Rob Fergus: So I did it with as many as I could; I think we had about 5,000 data points here. What's confusing here is that there are many different classes in the big pile of data; it's not just one class, it's all the test data as well. 
So there's a lot of data you actually have to compute over. >>: What's the percentage? >> Rob Fergus: So over here you've got, let's say, roughly 63,000 images, and the largest matrix you can compute eigenvectors of in any reasonable time was about five or six thousand on a side. I wasn't using all the tricks; you could use some fancy tricks to compute the eigenvectors. The thing you have to be careful about is that you can't necessarily assume your Laplacian is sparse, because if you try to sparsify too much you start ending up with disconnected islands of points. If it's nice and dense, then there's lots of bandwidth in the graph for the labels to propagate through; but if you start sparsifying like crazy, then you can start getting these disconnected guys out here who just don't talk to the rest of the graph. So there is a bit of danger. There are lots of techniques for eigenvectors of very large sparse systems and so on, but you don't have to sparsify very much before this sort of thing starts to happen. >>: You're always doing two-class problems -- what about doing a multi-class problem? >> Rob Fergus: At the moment we're just solving a two-class problem. >>: A whole bunch. >> Rob Fergus: A whole bunch of different two-class problems, and averaging the results to show you the curve. We have more work on doing the multi-class thing, where we start considering all classes simultaneously. >>: So basically taking -- quadratic? [indiscernible] in essence you'd like to solve a two-class problem [indiscernible] in essence linear, and in essence you'd like to solve it, if you can use the unlabeled data somehow -- >> Rob Fergus: Yeah. [chuckling]. >>: It wasn't really a question. >> Rob Fergus: No, no. [laughter]. So I guess the point is your choice of features here. It's quite a simple scheme that you can plug into: if you learn your features from some sort of deep belief net or something like that, you could just plug them in and this will give you a new way of -- the other thing to say is, yes, there's the ability to weight your examples according to how confident you are in their label. It's trivial; you just set the number. >>: The other thing is when you've got unreliably labeled data: with the unreliable labels, it's trying to make the answer fit the label and change the parameters so that it [indiscernible] -- so what it ought to be trying to do is first decide whether it believes this label and then decide whether to try to fit it. So really it's a mixture of: do I trust this label? If there's enough evidence from the unsupervised learning, you can reject a lot of the labels. So I think overall it would work much better -- the idea of putting in unreliable data as sort of don't trust these, trust [indiscernible]. >> Rob Fergus: I see what you're saying. I see. >>: Based on what you already understand. With this label, we have evidence. >> Rob Fergus: I know what you're saying. >>: If you go back to your energy formulation on the very first page, right, that's not built in. There's a lambda and you trust it. In other words, a covariance on the measurements to help out the process. >>: As soon as you've got the unlabeled data, and if it fits the Laplacian structure, which it has to for this to work, then you ought to be able to believe some labels more than others based on what the data says. >> Rob Fergus: Right. In practice you do sort of see that. 
When you solve for the label function, the points with true labels will have other points nearby which get the same label, whereas the outliers will be way off in space, with only a tiny affinity connecting them back to the rest of the points where the action is. >>: You could say that this is still intrinsically a least-squares formulation, right? Actually, in this formula there's nothing saying it's a two-label problem; you could make Y a multi-valued function, which would have weird behaviors and would interpolate between labels, right? But let's say it's binary. This is least squares. People who do robust stuff would say you should put a robust function over each one of these F minus Y terms. >> Rob Fergus: Actually, we did try that quickly. At least on our data it didn't make much of a difference; it was rough. >>: It depends on how noisy your data is. If it went through Geoff's process we assume the labels aren't noisy, right? Humans vetted those, so it wouldn't make a difference. >> Rob Fergus: That's right. So we do have work where this is not a vector anymore, it's a matrix, where each column is a different class, and you're solving for a matrix rather than a vector. This is our Gaussian solution. Someone has been [indiscernible], so I'm not sure how much I should say about this. We've written a little MATLAB version of it that you can try, and it's the same code that we used to run the 80 million images. So if you have enough memory to hold it -- fancy amounts of memory -- it's a very simple code. >>: But essentially, like you said when you summarized, you first do a rotation hoping to make it more separable, and then basically analyze each dimension separately using a histogram. And among those, you choose K from each dimension? >> Rob Fergus: No, no, that's the point, because you might have one dimension that -- >>: You treat your basis functions as constant along all dimensions except for one, right? >> Rob Fergus: Constant -- >>: Like you showed in the diagrams, basically just one-dimensional functions. >> Rob Fergus: Yes, that's right. >>: Constant across the other dimensions. And you just choose the best ones overall. >> Rob Fergus: Yeah. >>: Right. Thank you. >> Rob Fergus: All righty.