>> Alice Zheng: Okay. So thanks for coming to the talk this morning. We'll hear from Kilian Weinberger, formerly at Yahoo and now at Washington University in St. Louis. Kilian's an expert on metric learning, transfer learning, and multitask learning. He's also interested in brain decoding problems. This morning he's going to tell us about a way of improving bag-of-words features. One of his co-authors, Minmin Chen, is in the back and is an almost-intern here this summer. And just a bonus question for the audience: in the subtitle of the talk, you will notice that Kilian has a middle name that starts with Q that is apparently not his secret agent code word handle. It is a real Bavarian middle name. Free lunch to whoever can guess what it is.

>> Kilian Weinberger: Thanks, Alice. Good luck. So thanks very much for inviting me. It's a lot of fun to be here, and I actually recognize a lot of people in the audience; some people just took my course at Wash U. I'm talking today about mSDA: marginalized stacked denoising autoencoders. It's an algorithm to improve bag-of-words features. This is joint work with Minmin Chen, who is my student, in the back here, the lady with the purple sweater, my student Eddie, and Fei Sha, who is also coming to Microsoft, I think, next month or something.

Okay. Actually, is there a way to turn down the light a little bit, or is that going to screw up the video? Because my slides are all black. That's going to screw up the video. Okay. So 90 percent of my talk will be on marginalized stacked denoising autoencoders, which will also be presented at ICML in two weeks. And then I will give a very sneak preview at the end, mostly because there might be some people, at Microsoft in particular, who might be interested in cost-sensitive training and who I would like to talk to. That's also a paper we have at ICML in two weeks.

Okay. So let me get started. Just a quick review of bag-of-words features. Probably most people are familiar with bag of words. Just as a reminder, bag of words is basically a way to represent text documents in machine learning. Let's say you have a text document here, this is a product review, and what you do is you take the words in the text and represent them as a vector. The way you do this is you take the text and you have a dictionary, and the dictionary maps any word to a particular entry in the vector, a particular dimension. You can also use a hashing function. Then you represent the entire text document as a vector that has as many dimensions as you have words in your dictionary, and each dimension tells you how many times that word is present in your text. For example, here a one means the word Kindle occurs exactly once in this text, and the word Nook occurs twice, et cetera. It's called bag of words because you're basically throwing all the words in a bag; the order of the words is lost. All right. So this is just machine learning 101 in some sense. Then, if you want to do a classification problem or something, you take these bag-of-words vectors, you put them in a space, and you train an SVM classifier or whatever you want to do with them. This is used in many different settings.
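As an aside, to make the representation itself concrete before the examples: a minimal MATLAB sketch of building such a count vector. The toy dictionary and review are made up for illustration and are not from the talk.

    % Toy dictionary mapping each word to a dimension of the vector.
    vocab = {'kindle', 'nook', 'read', 'book', 'love'};
    dict  = containers.Map(vocab, 1:numel(vocab));

    % A toy review, already lower-cased and tokenized.
    tokens = {'love', 'my', 'kindle', 'read', 'read', 'a', 'book'};

    % Count how often each dictionary word occurs; words outside the
    % dictionary are simply dropped. Word order is lost.
    x = zeros(numel(vocab), 1);
    for t = 1:numel(tokens)
        if isKey(dict, tokens{t})
            x(dict(tokens{t})) = x(dict(tokens{t})) + 1;
        end
    end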
For example, image classification, where people use HOG features and SIFT features as interest points, and then these become a bag of interest points. Or, of course, text classification, which is the domain I will focus on today: for example, classifying documents or web pages by topic, or classifying product reviews or blog entries by sentiment. That's the example I will use throughout the talk — given a review, is it positive or negative — but everything applies to all of these settings.

Okay. So let me illustrate why the bag-of-words vector is in some sense too limited. Here are three sentences. A: recently Obama signed an important bill. B: Sunday, our president mentioned a game-changing law. These two sentences are probably about the same topic. C: last Sunday Manchester United in Britain won the game I mentioned. Clearly, if I ask you which two sentences are more similar or about the same topic, A and B are similar, and C really doesn't have anything to do with the other two. But if you look at the bag-of-words representation, it turns out that B and C have a lot of words in common, so if you take the inner product between these two vectors, they might actually appear pretty similar, whereas A and B don't have any words in common. And why is this? Well, they talk about the same things, but they use different words. Here, for example, you have Obama, B says president. Here we have important and game-changing, and bill and law. These are very similar concepts, but because they use different words, in our representation this is completely lost. It looks like these two are very, very dissimilar.

So to sum this up, with bag-of-words vectors very often your problem is that the vectors are just too sparse. They tell you too little about the text document, and you might have very little overlap between documents. Just because two documents use different words and there's no overlap, you can't tell at all whether they're similar or dissimilar. And in particular, if your training set is small, one thing you're going to observe is a problem with uncommon words: in the English language almost all words are rare, because they follow Zipf's law, a power-law distribution. So if a document uses rare words, you only see them once or twice in your labeled data, and the classifier doesn't really know what weights to put on them. And the main problem here is that bag of words captures the words by themselves but not the meaning of the words. If I use president or Obama, these are two different entities, two different dimensions; there's nothing in the representation that tells me they actually refer to the same thing, and therefore that two documents might be similar even though one uses one word and the other uses the other. Now, none of these problems are too bad if you have an infinite amount of data, but you never have that; labeled data in particular is usually very limited. You have a small amount of labeled data, and then these problems are pretty severe.

So what we are proposing is: well, we can't really do very much about the little labeled data, but we assume that we have unlabeled data, so you basically have the same [indiscernible] setting.
You have unlabeled data from the same kind of distribution, or a similar distribution, and we use the unlabeled data to learn a representation of the documents. Then we take the labeled data and learn our classifier. That's the two-step approach we are proposing. And later on, I'm also going to show this for domain adaptation, where the unlabeled data actually comes from a different domain. So we can say, well, we might not have unlabeled data for the problem that we are interested in, but we have a similar problem that does have unlabeled data, and we just use that, and that might still work. Okay. Any questions?

All right. So our algorithm is called the marginalized stacked denoising autoencoder, and I will explain in a few slides where that name comes from and why it makes sense. And this is probably the most important slide: the basic building block of our algorithm is a linear denoising autoencoder. Let's say we have some text document that is one of the examples in our unlabeled dataset. This here is the text document, just a product review of some book on Amazon, and this here is the bag-of-words representation. So this here is a vector, and yellow means this word is present and white means it's not present, it's zero. Each entry in this bag-of-words vector corresponds to some dictionary entry: this might be the word favorite, this might be "best written", et cetera.

All right. So what do we do? The first thing we do is we take our text document and we duplicate it. Then we take the duplicated version and we corrupt it. How do we corrupt it? We go through it, look at every single word, and remove it with probability P. So we roll the die for every single word, and if the right thing comes up, we just remove it, just blank it out. And the intuition is that if you understand your domain, then even if you remove some of the words, you should still be able to make sense of the document. For example here, I'll read this to you: "I read two to three blank a week." Well, if you look at the text as a whole, you can probably figure out that this should be "books", just because you understand the domain pretty well. So that's our intuition: you take the corrupted version of the text, and from this corrupted version you try to reconstruct the uncorrupted version. In some sense what you're doing is simulating the test case. What are the hard examples at test time? They're the ones where important words about the domain are missing, where the author used different words that we don't know, that the classifier hasn't trained on. We're simulating this by taking our training data and just removing words randomly, basically making harder test cases out of it.

All right. So how does the algorithm work? You take this corrupted version — this is the bag-of-words vector of the corrupted text here, which is strictly sparser, there are strictly more zeroes in it because we removed some words — and we try to reconstruct the original text document, and we do this with a linear mapping. We take the corrupted version here and we try to minimize the square loss between the reconstruction and the original document. Any questions?

>>: What is the W?

>> Kilian Weinberger: W is basically what we learn.
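As an illustration of what gets learned here — a toy sketch, not the speaker's code — corrupt each entry once with probability p and solve ordinary least squares for W; the tiny ridge term is an assumption for numerical stability (a small regularizer of this kind comes up later in the talk).

    % Toy corpus: d words, n documents, as 0/1 bag-of-words columns.
    d = 5; n = 100; p = 0.5;
    X = double(rand(d, n) < 0.3);

    % One corrupted copy: each entry is blanked out with probability p.
    Xtilde = X .* (rand(d, n) > p);

    % Learn the linear denoiser W by ordinary least squares,
    % minimizing ||X - W*Xtilde||_F^2 (with a tiny ridge for stability).
    W = (X * Xtilde') / (Xtilde * Xtilde' + 1e-5 * eye(d));

    Xhat = W * Xtilde;   % reconstruction of the uncorrupted documents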
So we're just learning this mapping W. We have the original document, we corrupt it, and we try to find some W such that, if you map the corrupted version with this linear mapping, the two are very similar.

>>: So it's actually W, the [inaudible]?

>> Kilian Weinberger: No, just any [indiscernible] square matrix.

>>: But that's [inaudible].

>> Kilian Weinberger: I'll get to that, absolutely. You're right. Good point.

>>: For the corrupted version of the document, why do you want to do [indiscernible] substitution of synonyms?

>> Kilian Weinberger: Because it's really simple. You can definitely do fancier stuff, but you will see later on that if you just do this deletion, it turns out to be a noise model we can handle very nicely later on. That was a good question.

>>: So if your goal is to understand words you've never seen, it seems like it would make more sense to delete based on the similarity of the words [inaudible].

>> Kilian Weinberger: Yes, and you can definitely do that and have a different probability for every single word. We tried to keep it simple and just have [indiscernible] probability, but you can definitely do this. There's no reason not to. Yeah?

>>: [inaudible].

>> Kilian Weinberger: Actually, we do remove it from the bag of words, because the same word appears several times, right; we remove it in all locations. Yes.

>>: Do you remove the bin?

>> Kilian Weinberger: We remove the entire bin, so we just [indiscernible], right. And what this mapping really does is it learns to reconstruct from other words that co-occur. So it would learn that president occurs together with White House and Obama or something. Then later on, if you just see president, it will basically assume, well, maybe you removed White House and Obama, and it will start hallucinating those words and adding them to your representation. That's the idea. It makes the representation richer.

>>: So I see [indiscernible] to try to --

>> Kilian Weinberger: Perfect. That is the stacked denoising autoencoder, and we have the marginalized stacked denoising autoencoder. That's exactly what our work is building on, absolutely. And I will have a slide that puts those next to each other.

>>: So do you want to have a symmetric loss, whether you do addition or deletion? It might be, could you say, that some errors are [indiscernible] — worse to miss a word than to suggest a word.

>> Kilian Weinberger: Oh, I see.

>>: A symmetrical word [inaudible].

>> Kilian Weinberger: Not entirely sure what you mean. Did you mean, like, different words might be --

>>: You make a prediction, a word reconstruction prediction, and you can either predict a word or [indiscernible], like reduce the weight of the word.

>> Kilian Weinberger: Oh, I see.

>>: And is that the same type of mistake, right?

>> Kilian Weinberger: That's true. That's true. So yeah, we just handle it as a square loss. We keep it simple in that sense, yeah. That's right.

>>: So you're reconstructing the document [inaudible]?

>> Kilian Weinberger: We try to reconstruct the original bag of words, absolutely. That's right. Okay.

>>: So I think I understand what you're proposing, but I'm not sure I understand why it makes sense. Intuitively, at a very high level, maybe I understand.
But it seems that what's hiding here is maybe some type of generative assumption on how texts are generated. But otherwise, I mean, why is this different than -- I could propose something like -- we could argue forever.

>> Kilian Weinberger: All right, let's try to avoid that. In some sense you're absolutely right, and you can view it that way — a different spin on this, and we can talk about it offline because it goes a little bit more into depth, is that you're basically proposing a slightly different noise model. People use a lot of Gaussian noise, for example, et cetera. But in text documents, with these word [indiscernible] distributions, a lot of words are either present or they're not; they only appear once or twice in a document. So a good way to model it is to say, well, in the test case you see some documents and some words are just removed, and that's why they're not there. Does that make sense? It's in some sense an approximation of the tail of the power-law [indiscernible] distribution.

>>: [indiscernible] already should have noise.

>> Kilian Weinberger: Yeah. This is great. Okay. Good. Let me just move on to maybe a few more slides.

Okay. So this is the objective you're minimizing here. We learn this matrix W to go from the corrupted version to the non-corrupted version, and this is just a squared-loss regression — ordinary least squares. Ordinary least squares is, of course, very nicely behaved: it's a convex function and, in fact, there's a closed-form solution. So we just have a little matrix inversion here and then we jump right to the minimum of the function. This pops out in closed form.

One thing I did on the previous slide is I took the text document and made one corrupted version of it. Of course, if I did this again, I would get a different corrupted version, because I remove every word with probability P. So instead of having just one corrupted version, it would actually be better to train with many different corrupted copies. So let's say we make M different corrupted versions: we do this M times for every single text document in my training data — the unsupervised data — I corrupt it M times. And then I can do exactly the same thing: now I want a mapping that reconstructs the original text document well across all of these corruptions, and the optimization problem we get is exactly the same, except that we average over all M corruptions. Across all of them, we should do well. And it turns out, not surprisingly, that this still has a closed-form solution. It's basically just the average here: you average these outer-product matrices, the scatter matrices of the corrupted vectors.

Okay. And so one question is how large we set M. In some sense, that's like iterating over the dataset. But in this case there's no overfitting, because you're not using the labels anywhere. So the larger M, the more robust you get against this kind of noise — the more corrupted examples we see. Ideally, you would like to make M as large as possible. Ideally, you would like to make M, in fact, go to infinity.
And, in fact, that's exactly what we can do. We can just let M go to infinity, and in the limit these terms here just become the expected values of the outer products as M goes to infinity. We can stick that into the closed-form solution, and it turns out these expected values are actually really easy to compute. Why is this? Because of our noise model — and that answers the question you asked earlier about why we use such a simple noise model — if you just remove every word with probability P, it's just a [indiscernible] distribution, so this here we can compute in closed form. It's the scatter matrix modified by the probability that every feature survives the corruption. That's all there is; that's the expected value. And if you want to code this up, it's really straightforward. The whole thing in Matlab is just ten lines of code — this is the actual code that we use. Ten lines.

>>: Do you average over the target documents?

>> Kilian Weinberger: Where did that go? Sorry?

>>: You didn't seem to average over all possible sums of the least squares, right?

>> Kilian Weinberger: No, that's here. Here we average over all the possible corruptions.

>>: But there's only one x-i.

>> Kilian Weinberger: Oh, I see. So here, sorry, I'm summing over my entire corpus.

>>: [inaudible].

>> Kilian Weinberger: So you could imagine it as basically taking the entire corpus, replicating it M times, and corrupting it.

>>: So I'm surprised you didn't end up with a generalized eigenvalue problem --

>>: Exactly, like what's the [inaudible].

>>: Instead of taking an inverse, you should be taking the top N eigenvectors.

>> Kilian Weinberger: Okay, wait. Let me just see.

>>: You're averaging over those two [indiscernible] matrices, one of which is your maximum --

>> Kilian Weinberger: Yeah, but actually, this one here is fully corrupted, whereas this here is the scatter matrix of the uncorrupted version and the corrupted one. So it's a mix between the two.

>>: So what if I just took all of my corrupted data and ran PCA on it, [indiscernible]. How would that be different?

>> Kilian Weinberger: It's quite different. With PCA you're finding directions of maximum variance and projecting onto them, which is not what we're doing here.

>>: But there's OPCA, which tries to [indiscernible] maximize the variance of the data and simultaneously minimize the variance over your corruption.

>> Kilian Weinberger: So yeah, I don't know how that relates to it.

>>: It's a different loss function. They're doing regression.

>>: But linear [indiscernible] analysis always goes -- you can always take zero-one targets and boil it down into --

>>: [inaudible].

>>: The corruption will be different. This is a very special kind of loss function.

>>: Maybe.

>> Kilian Weinberger: So with PCA, you're not corrupting it.

>>: No, with OPCA.

>> Kilian Weinberger: I don't know OPCA. Oh, I see, okay. Maybe we can talk about this offline.

>>: Is W [indiscernible]?

>> Kilian Weinberger: No, one W for all samples. Very good question. Thanks. Okay. Any more questions?

Okay. So we can compute this in ten lines of Matlab. And someone asked earlier about a paper at last year's ICML and whether this is something similar — that is related to the paper we have at ICML this year. We show that you can take their version and make it much, much faster.
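The ten lines themselves are not reproduced in the transcript. The sketch below follows the closed form just described — expected scatter matrices under the word-deletion noise — with a constant bias feature and a tiny 1e-5 ridge term, both of which the speaker mentions later in the talk; variable names are assumptions, not the authors' actual code.

    function [W, h] = mDA(X, p)
    % X: d x n matrix of bag-of-words vectors, p: corruption probability.
    % Returns the linear denoiser W and the squashed output h.
    [d, n] = size(X);
    Xb = [X; ones(1, n)];                 % append a constant feature (never corrupted)
    q  = [(1 - p) * ones(d, 1); 1];       % survival probability of each feature
    S  = Xb * Xb';                        % scatter matrix of the uncorrupted data
    Q  = S .* (q * q');                   % E[Xtilde*Xtilde']: off-diagonals scaled by q_i*q_j
    Q(1:d+2:end) = q .* diag(S);          % diagonal entries scaled by q_i only
    P  = S(1:d, :) .* repmat(q', d, 1);   % E[X*Xtilde']: columns scaled by q_j
    W  = P / (Q + 1e-5 * eye(d + 1));     % solve W * E[Xtilde*Xtilde'] = E[X*Xtilde']
    h  = tanh(W * Xb);                    % squashing nonlinearity (used for stacking later)
    end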
What you were referring to is the paper by the [indiscernible] group in Montreal, [indiscernible] et al., and they were the ones who actually inspired all this work. They had this idea of corrupting the input vectors and then reconstructing them. They took text documents and randomly removed words, with [indiscernible] the same noise model that we use, and then they trained a neural network to reconstruct the original bag-of-words vector. The network had a hidden layer that's overcomplete, and they trained it with backpropagation, and that gets really, really nice results. That's what started our work. So we have an encoder here and a decoder, and this is their loss function. The difference is that theirs uses backpropagation and you have to iterate over the training set many times, and our idea was basically: how can you make this faster? Our idea was to remove the hidden layer in the middle — this way you can make it linear — and then, instead of going over the dataset many times and corrupting it over and over again, you can marginalize out all the noise and do the whole thing in closed form. Yeah?

>>: [inaudible].

>> Kilian Weinberger: So you're talking about this one here, the SDA? The reason they can make it overcomplete is because of the corruption. If it wasn't corrupted, then [indiscernible], right? Because of the corruption, they can get away with making this overcomplete.

>>: So it's bigger than the -- even for text?

>> Kilian Weinberger: Yeah. And this is going back to what Lynn said earlier: they only use the 5,000-dimensional version, 5,000 words.

>>: [indiscernible].

>> Kilian Weinberger: That's actually another problem: [indiscernible] doesn't scale to high dimensions, because that [indiscernible] is really slow. They have -- maybe --

>>: The thing that I don't know is how [indiscernible] you can remove [inaudible]. Then you're left with an --

>>: Empirically, when you do these autoencoders, does [indiscernible] the nonlinear [indiscernible]?

>> Kilian Weinberger: I guess that's what we are showing. We're actually getting similar results.

>>: Only on the output you can do something. Not much [indiscernible].

>>: If you use Gaussians anyway, might as well use the natural [indiscernible].

>> Kilian Weinberger: Okay. Yeah, let me just move on. So the reason we call it the marginalized stacked denoising autoencoder is that the algorithm was inspired by the stacked denoising autoencoder, but we're marginalizing out all the corruption. That's where the "m" comes from.

>>: So I was surprised that I didn't see that you needed some major amount of regularization. I would have thought you'd have had a problem with rare words reconstructing the entire rest of the document if those words appeared. That's not needed?

>> Kilian Weinberger: It's not really needed. One thing is that you're training it on the unsupervised corpus, which is usually a lot larger. So --

>>: So [indiscernible] one document. Then once you learn the Ws, when you see that one word, you generate the entire rest of that document?

>> Kilian Weinberger: You might. In those cases, you might run into a little bit of a problem. Let me address that later on.

>>: [indiscernible].

>> Kilian Weinberger: You do actually do a little bit of regularization. Here's the regularization.

>>: [indiscernible].
>> Kilian Weinberger: There's a little [indiscernible] term there, that's right. So I put that on.

>>: It's hidden in the --

>> Kilian Weinberger: That's right, [indiscernible]. But it's very, very tiny. It's ten to the minus five is the --

>>: Okay.

>> Kilian Weinberger: It's just that otherwise it's not as well defined.

>>: In that case -- in an earlier version of this, do you use [indiscernible]?

>> Kilian Weinberger: [indiscernible], yeah, uh-huh.

>>: You use cross-[indiscernible] instead of MAC?

>> Kilian Weinberger: Actually, Minmin would know. Is that what they do?

>>: Yeah, they use [indiscernible].

>>: So I think it's different, like all the symmetric cases you do on [indiscernible] distribution of errors, you get --

>> Kilian Weinberger: There's a little bit more subtlety to it, I agree, but bear with me for a few more slides. We do have to talk about this offline. [indiscernible]. Okay.

So one thing that's nice about the [indiscernible], though, is that it can be stacked to make it deep. In some sense, their claim is that their approach to deep learning is the stacked denoising autoencoder — that's the Montreal version of deep learning. You have another hidden layer, and another hidden layer, and so on; they make this five, six layers deep. We can do the same thing with our method. Let's say you have our input X, you learn a matrix W, and you apply W to X, and you get our output here — basically the reconstruction, in the sense that you take our X and hallucinate new words onto it. Then we apply a nonlinearity: we take a squashing function, some sigmoid function, that squashes these outputs, and this is now in some sense the output of our algorithm. Then we can use that as the input of the algorithm again. So you call this layer one, and you can use the output of layer one as the input of the denoising autoencoder again and get layer two, et cetera. Each one of these is solved in closed form: solve in closed form, apply a sigmoid function, solve the next layer in closed form, apply a sigmoid function, et cetera. And then we take these layers — and that's the same thing the [indiscernible] group does — we take the bag-of-words vectors and these hidden layers and make that the new representation of our data. So instead of just the bag-of-words vector, you now have the bag-of-words vector, the first reconstruction, the second reconstruction, and so on. That's our representation of the data, and now we train SVMs on that data. And these layers are learned completely on the unsupervised part of our corpus, the [indiscernible] part of our corpus. Yeah?

>>: So can you give me some [indiscernible] on what you need to [indiscernible] -- so you update every single -- but here -- what does this first layer do?

>> Kilian Weinberger: Good question. The intuition is that it takes the bag-of-words vector and, for every word, it adds words that co-occur with that word. So, for example, let's say I have the word Obama. It might add the words White House, president, government, or something. It wouldn't add Clinton, because Clinton and Obama don't really occur together — that's the wrong Clinton; Bill Clinton, I mean.
But then in the second layer, right, now you actually have White House, government, president, and the second layer would add Bill Clinton to it, because that also occurs in the same context. You can kind of view it as a graph — Rani pointed this out when I talked to him yesterday — a graph of which words co-occur with which other words, and each layer here takes one step in that graph.

>>: [inaudible].

>> Kilian Weinberger: I think at some point, it won't change anymore.

>>: How do you know the scale of the sigmoid? Seems like you can multiply H by -- well, your target is --

>> Kilian Weinberger: I don't think it's that important. I think it's pretty robust.

>>: Have you tried it?

>> Kilian Weinberger: Actually, Minmin tried a whole bunch of squashing functions. And, in fact, even if you don't use a sigmoid — sigmoid is a good idea, but even if you don't use it — it still helps to have multiple layers, because of that effect.

>>: [indiscernible] couldn't you just reinforce the sparsity?

>> Kilian Weinberger: [indiscernible] sparsity. It's not sparse. It's not sparse.

>>: But the small number of numbers would suggest that you are [indiscernible] sparsity, that you want to have some [indiscernible].

>> Kilian Weinberger: Not really.

>>: So the sigmoid [inaudible].

>> Kilian Weinberger: In some sense, what you're doing is exaggerating [indiscernible]. Actually, you know, [indiscernible].

>>: When you introduce the noise, you kind of do a [indiscernible], right?

>> Kilian Weinberger: Yeah, we're integrating over every possible noise there is.

>>: Not quite sure what you mean. There might not be a --

>> Kilian Weinberger: Oh, there might not be a perfect reconstruction, is that what you mean?

>>: The W is -- I mean, can you introduce any kind of [indiscernible]? W is linear, right? There might not be a transformation from the corrupted X back to X.

>> Kilian Weinberger: Yes, but the W is linear. Well, you're just removing words, right? So if you remove a word, you can have a linear combination of other words to reconstruct any word that you removed. Say you remove that word — here, "energy efficient"; well, here it's [indiscernible] — but you could say, for example, if "energy efficient" appears in a product review, it might appear together with "good" or "great" or something. So you reconstruct words from those co-occurring words. Does that kind of answer it?

>>: Probably. I'm just saying that the [indiscernible] model is linear and the noise model is totally random, so I'm not sure if that can go back to this.

>> Kilian Weinberger: But we don't have to, right? You're just trying to find a new representation, and you're still keeping the original input as part of the representation. When we train our SVM, we're not corrupting the test example; we take the test example the way it is, plus these additional hidden representations where you've basically added more words to it. So we're filling in the sparsity of the bag-of-words model.

>>: [inaudible].

>>: That's right, I think W is -- that's exactly what PCA does. PCA finds the low-rank representation of the [indiscernible].
You don't actually have to keep the reconstruction if you're computing the square root, you know what I'm saying. If you're computing the square root --

>>: I see what you're --

>> Kilian Weinberger: You score things in that direction. I don't want to get too much into that, because I still have a couple of slides. I do like this discussion a lot, actually. This is great.

So if you do this whole stacking business, then the whole thing is 20 lines of Matlab. What I'm trying to drive home here is that it's really, really easy to use. If you have any kind of bag-of-words data, this is the entire code that you need to get this new representation. And the only parameters you have here are: X is your input data, P is the probability of removing a word, and L is the number of layers that you stack. So it's a really, really simple model. And this is in big contrast to the neural networks people use, where you have the number of layers, the step size, the number of epochs you go over the data, and so on.

>>: [indiscernible] you do have layers over there.

>> Kilian Weinberger: Sure. I mean, [indiscernible]. Okay.

Let me talk about some results. The domain I focused on was domain adaptation, in particular because the paper by [indiscernible] at ICML used exactly that domain. The task here is: given a product review, try to predict whether it's positive or negative — the sentiment of the text. The dataset we use was created by John Blitzer back in 2006. He scraped product reviews from Amazon.com, and everything with five or four stars is positive, anything with fewer stars is negative. So given the text, our goal is to predict if it's a positive or negative review. And the way we do this is domain adaptation: the training data is from one domain, for example electronics, and testing is on a different domain. In this case we have four different domains. So, for example, you could train the classifier on book reviews and test it on electronic equipment or kitchen appliances. In this kind of domain adaptation it's really, really important that you have a good representation of the data, so that's why this is a good application, a good way to test our algorithm.

If you just do this naively — bag of words, and you train a classifier on book reviews — you get a test error of 13 percent. Take the same classifier and apply it to kitchen appliances, and you get a test error of 24 percent. So you're basically doubling your error by going from one to the other. And that makes a lot of sense, because the way you describe a good book is very different from the way you describe a good toaster. For example, here you might say the book is "best written", it's "one of my favorite" books — "favorite" you might use in both — but "eloquent story"? You would never say a toaster is best written or eloquent or something. For a coffee maker or a toaster, you might say it's "solidly constructed" or "easy to program". These are words you would never use for a book. So if I train my classifier on book reviews, the classifier has no idea what to do with these words.
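Returning to the stacking described a few paragraphs up — the "20 lines of Matlab" with parameters X, P, and L — a sketch of what such a wrapper might look like, reusing the single-layer mDA routine sketched earlier; again an illustration, not the speaker's actual code.

    function [Ws, rep] = mSDA(X, p, L)
    % X: d x n bag-of-words input, p: corruption probability, L: number of layers.
    % Returns one linear mapping per layer and the stacked representation
    % [raw input; hidden layer 1; ...; hidden layer L], which is what feeds the SVM.
    Ws  = cell(L, 1);
    h   = X;          % layer 0 is the raw bag-of-words input
    rep = X;
    for t = 1:L
        % Each layer is solved in closed form on the previous layer's output;
        % mDA applies the tanh squashing internally.
        [Ws{t}, h] = mDA(h, p);
        rep = [rep; h];
    end
    end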
So what we do instead is assume we have unlabeled data from both domains, and we run our mSDA representation learning algorithm over both domains to get a joint representation that we then map our data into.

Okay. So here are some results. These are different domain adaptation tasks: going from DVD to books, electronics to books, kitchen appliances to books, and so on. This is what you train on, this is what you test on. And what you see here is the transfer loss. Transfer loss is defined as the error you get from the transfer minus the error you would get if you had stayed within the domain — if you had trained on kitchen appliances and tested on kitchen appliances, but here you trained on books. One thing to note: this here is based on bag of words, whereas this here is based on mSDA. We compare against a bunch of different algorithms. This here is the baseline. Here is PCA — someone mentioned PCA earlier; the blue line is PCA, where you project everything onto one common subspace. Then you have coupled subspaces, the paper by John Blitzer, and a couple more baselines. And the red line here is SDA, the stacked denoising autoencoder — the neural network that does the same thing. mSDA is our dark red line. One thing you can see: out of those 12 tasks, on ten we get the best results; only on two does this algorithm do a little worse. To be fair, though, SDA was only one layer and mSDA was five layers. The reason SDA was only one layer is that it's so slow — it takes a really, really long time to train.

>>: Is there any story there in terms of what are hard examples — things like irony, or reviewers who were super hard or super soft? Is there any notion of what the reviewer correlation would be to words [indiscernible]? What's the actual noise?

>> Kilian Weinberger: Sorry — which question? I don't quite understand.

>>: There are two parts to the question. What is the error rate — what is the noise level in the labels? And the other one is: besides the better numbers, is there anything you think you're solving conceptually that you couldn't get before — things like irony, or a bias of the reviewer?

>> Kilian Weinberger: The answer to both questions is I don't know. I'm not sure about the natural error level. I'm sure there's some; there are some reviews you just can't possibly get right. And in terms of the second part — what it really helps you with is making use of words. It kind of connects words from the source domain to the target domain; that's really what it does. So if a review was written with a lot of words you would never use for books, you couldn't get that right before, because you just didn't know what those words mean. And now you can.

>>: [indiscernible] on the original domain [indiscernible]?

>> Kilian Weinberger: The entire [indiscernible] — and actually the same for mSDA and SDA. That's the entire corpus.

>>: So you never tried to [indiscernible]? None of these methods do that?

>> Kilian Weinberger: I don't think there's a benefit for these methods to only use one class. I think it always helps to throw in more data. At least that's what we got empirically.

>>: I don't understand the negative numbers here.
It seems like you're getting negative numbers on this B to D and D to B.

>> Kilian Weinberger: Right. I can tell you exactly why that is.

>>: And there's also K and E. There are pairs here.

>> Kilian Weinberger: That seems odd, right? Like, you train on D and you do better. And the reason is that it [indiscernible] bag of words: this here is the baseline of bag of words, whereas this here is the baseline of mSDA. Does that make sense?

>>: But why are D and B friends here, and K and E are also friends? I see somehow there's this pairing.

>>: So we [inaudible].

>> Kilian Weinberger: So again, it's electronics and -- sorry, actually, I don't know too much about the individual domains. I mean, you could look into what words you use to describe a DVD, what words you use to describe a book. So maybe electronics --

>>: The difficulty [inaudible]: books and DVDs are closer, and kitchen and electronics are closer, so it's easier to adapt.

>>: With mSDA from books to books, I would have done better.

>> Kilian Weinberger: Actually, if you had used mSDA — if you took all of the reviews that we had, built one representation for everything, and then just trained the classifier.

>>: A different capacity. That's the whole point, right?

>> Kilian Weinberger: Yeah, basically. Because the representation changed — does that make sense?

>>: It's five times the capacity. So the left-hand one actually has much more capacity than the right-hand one. That explains why you get negative results.

>>: That explains the negatives.

>>: [inaudible]. But I'm saying, why is there this B to B?

>> Kilian Weinberger: If I give you the dataset [indiscernible].

>>: [inaudible].

>> Kilian Weinberger: The domains have roughly the same size, roughly around 2,000, I think.

>>: 2,000?

>>: [inaudible].

>>: How large is your vocabulary?

>>: For this experiment, [inaudible] the largest we try is 7,000.

>> Kilian Weinberger: [indiscernible].

>>: They all have -- so these are a thousand, maybe 5,000 words, [indiscernible] classifier 25,000 [indiscernible].

>> Kilian Weinberger: Yeah, but it's regularized. We do SVM classification with regularization. So we do, you know --

>>: [inaudible].

>> Kilian Weinberger: Good question. I don't know. You would get something, I'm sure. It's great to have [indiscernible].

>>: Always have a grad student in your talk.

>> Kilian Weinberger: Do you have a --

>>: I think it almost reached like the upper [inaudible] bag of words.

>>: [indiscernible] as good as bag of words, but not quite?

>>: [indiscernible].

>>: So in the same domain, not significantly [indiscernible] when you transfer it?

>>: No, I mean --

>> Kilian Weinberger: Actually, let's do a few more slides.

>>: [indiscernible] domain classification.

>>: What he's asking is: in same-domain classification, [indiscernible] when it's [indiscernible], is it better than bag of words?

>> Kilian Weinberger: That's right. So it has to be better. So yeah, it makes sense that it has negative numbers, okay? Good.

So I think the most important thing, though, is to look at speed. From now on, we're going to look at average results across all of these different adaptation tasks. And when you average them, you actually use transfer ratios, where you divide the two errors instead, because otherwise the small ones are washed out by the large ones. Anyway, this transfer ratio is the same thing the [indiscernible] group uses. So here are the classification results.
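For reference, writing e(S, T) for the test error of a classifier trained on source domain S and evaluated on target domain T (notation assumed, not from the slides), the two quantities just described are:

    \[
      \text{transfer loss: } t(S,T) = e(S,T) - e(T,T),
      \qquad
      \text{transfer ratio: } r(S,T) = \frac{e(S,T)}{e(T,T)}
    \]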
This is the transfer ratio, so lower is better, and this here is the time, on a log scale. One thing you see is all these other baselines — bag of words, SCL, the autoencoder, and then SDA. Over the years (I don't have the year each was published, but it roughly goes in that order), the results got better and better, but the time it took to train these classifiers also grew exponentially. SDA is now five hours on this dataset. And what we managed to do is push the results of SDA, which are really awesome, to the left: we get the same results as the SDA representation, but instead of five hours of training it's a few seconds — well, it was about two minutes of training with five layers.

And here are some words, just to illustrate what gets hallucinated. Take a document that has only one word, the word "great"; here's the kind of document that gets generated from it, with words in order of the strength of the reconstruction: great, "is great", highly, "highly recommend", excellent, perfect, fantastic — and "waste" is in here too. And here's a bad one: "bad" reconstructs dead, worst, sorry, please, "the worst", bad, hope, horrible, and so on.

>>: I have a question. The total [indiscernible] — 20,000, 27,000?

>> Kilian Weinberger: I think that's right, yeah.

>>: So the SDA [indiscernible] trained on 27,000 examples [inaudible]?

>> Kilian Weinberger: Yeah.

>>: This is what [indiscernible].

>>: They must use punch cards or something.

>>: 27,000 examples.

>> Kilian Weinberger: It gets corrupted over and over again, right? You have many [indiscernible].

>>: But still, if you [indiscernible].

>> Kilian Weinberger: Make it two hours; compare that against two minutes. I mean, even if it's an hour, it would be --

>>: Let's be fair.

>> Kilian Weinberger: Actually, we used their code.

>>: [indiscernible].

>> Kilian Weinberger: Now we have --

>>: I agree that [indiscernible] would be faster.

>> Kilian Weinberger: Now we have a dataset with 340,000 data points. That's actually a large dataset that John created back then, and nobody really used it because it was too large for most algorithms. We ran SDA over it and it took two days. Make it six hours; we can do it in six to 20 minutes, depending on how many layers you have, and you get the same results.

>>: mSDA would be faster [indiscernible].

>> Kilian Weinberger: Yeah, I mean, I don't know. There might be ways of speeding it up. Actually, they have code that also runs it on a CUDA graphics card, but that doesn't help in this case.

>>: [inaudible].

>> Kilian Weinberger: I don't know.

>>: [indiscernible].

>> Kilian Weinberger: I don't know. It might be.

>>: Do you use, when you --

>>: I just -- I read the [indiscernible] by a good margin. It's just that I'm surprised by the SDA, because --

>> Kilian Weinberger: I'm sure --

>>: Like, it's maybe [indiscernible] times faster than [inaudible].

>> Kilian Weinberger: They don't have early stopping. With early stopping, they could make it maybe three times faster. But that's still, you know --

>>: [inaudible].

>>: [indiscernible].

>>: How did you come up with this word list?

>> Kilian Weinberger: You take a document — you artificially create a document that has only one entry in it, only one word; that's just this one here, just "bad". Then you run it through mSDA, and you look at what document you get at the end.
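A sketch of that probe, reusing the mappings Ws and the word-to-dimension map dict from the earlier sketches (both names, and the inverse-vocabulary construction, are assumptions):

    % Build an artificial one-word document containing just "bad".
    d = dict.Count;
    x = zeros(d, 1);
    x(dict('bad')) = 1;

    % Push it through the learned layers (bias appended, as in mDA).
    h = x;
    for t = 1:numel(Ws)
        h = tanh(Ws{t} * [h; 1]);
    end

    % Invert the dictionary so dimension i maps back to its word.
    words = keys(dict);
    idx   = cell2mat(values(dict, words));
    [~, pos] = sort(idx);
    vocab = words(pos);

    % List the most strongly reconstructed ("hallucinated") words.
    [~, order] = sort(h, 'descend');
    disp(vocab(order(1:10)));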
And these are basically the words it reconstructs. I put in "bad", and the strongest word it reconstructs is "dead". These are bigrams — the way John created the dataset in 2006 was with bigrams, so we just used it the same way he did.

>>: Do you have a trick for the inversion when you look at [indiscernible]?

>> Kilian Weinberger: Thank you, yeah. So far I've kind of swept something under the rug: early on we have this matrix inversion, and that's a D-cubed operation, where D is the number of words in the vocabulary. All of these results were done with 5,000 words — the same as the SDA paper uses — but that's a little bit lame, right, because text documents usually have a much, much larger vocabulary. So what if you have very high-dimensional data? Here's what we do, and it's based on an intuition. The intuition is: if you have a very large vocabulary, say 100,000 words, then in English you can get pretty far by just using five or ten thousand words. The other 90,000 words, most of the time, describe some concept that already exists in the first 10,000 words. You might have the word "tasty", but instead you can be fancy and say "delicious". That's a rarer word, but for all intents and purposes of our classification it says the same thing as "tasty". So what do we do? We take this very large vector — we can't invert a 100,000 by 100,000 matrix — so we randomly divide it up into chunks, and then we take the 5,000 or 10,000 most common words and learn an mSDA transformation for each one of these chunks. What are we doing here? In some sense, we're trying to reconstruct common words from rare words: if there's "delicious", try to reconstruct "tasty", for example. We're translating big words into common language. Yeah?

>>: So [indiscernible] is you actually are going to, like, a [indiscernible].

>> Kilian Weinberger: In this case, it should be [indiscernible]. That's one reason --

>>: No, no, if you want to go to, like, five-gram or six-gram --

>> Kilian Weinberger: It's a good question. I don't know. I would actually think that it might still work, but I don't know.

>>: [inaudible].

>> Kilian Weinberger: And no, it's not the same thing. It's not the same thing. We can talk about the differences offline.

So then you can still do the stacking. Now we have a mapping to a low-dimensional space, and I didn't write out the algebra here, but it's straightforward: you add up all these reconstructions into one reconstruction, and now you can stack it just as before. The subsequent layers are in the low-dimensional space. And so we tried this. Here's the dataset, the [indiscernible] set, with 5,000 dimensions, 10,000, 20,000, 30,000, and I think this is 40,000. And this here is the SDA curve, the result you get with the original stacked denoising autoencoder. There's a very clear trend: as you increase the dimensionality, the error goes down. So it really helps, and our mSDA matches this very, very nicely — for every point here, there's a parallel point that's just obtained much faster.
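Stepping back to the chunking scheme itself, a rough sketch of one way to realize it: each random chunk of the large vocabulary learns to reconstruct only the k most common words, and the per-chunk reconstructions are combined before the squashing function. Assumptions here: Xhigh has the k most common words in its first k rows, and the chunk outputs are averaged.

    % Xhigh: D x n data over a very large vocabulary, with the k most common
    % words occupying the first k dimensions (assumed for this sketch).
    k = 5000; p = 0.5;
    [D, n]  = size(Xhigh);
    perm    = randperm(D);
    nchunks = ceil(D / k);
    Hsum    = zeros(k, n);
    for c = 1:nchunks
        idx = perm((c-1)*k + 1 : min(c*k, D));   % one random chunk of input features
        Xc  = [Xhigh(idx, :); ones(1, n)];       % chunk plus bias feature
        q   = [(1 - p) * ones(numel(idx), 1); 1];
        S   = Xc * Xc';
        Q   = S .* (q * q');
        Q(1:size(Q,1)+1:end) = q .* diag(S);
        % Targets are only the k most common words, reconstructed from this chunk.
        P   = (Xhigh(1:k, :) * Xc') .* repmat(q', k, 1);
        Wc  = P / (Q + 1e-5 * eye(numel(idx) + 1));
        Hsum = Hsum + Wc * Xc;
    end
    h1 = tanh(Hsum / nchunks);   % first-layer output, now k-dimensional
    % Subsequent layers apply mDA to h1 exactly as before.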
And it gets roughly the same accuracy. At some point you start [indiscernible] — by 40,000 you're just including really rare words.

>>: Is the SDA on the larger vocabularies doing exact reconstruction?

>> Kilian Weinberger: It also does the mapping. This also does [indiscernible]; the hidden layers are actually then just [indiscernible].

>>: But what I'm saying is, when the SDA tries to [indiscernible], does it actually reconstruct every output, or does it --

>> Kilian Weinberger: We do the same thing: we only reconstruct the top K most common words. So we don't reconstruct everything in the SDA either.

>>: Okay, but the SDA -- I believe the group had a paper where, for very large sparse binary autoencoder problems, they can approximate the gradient very, very quickly. What I'm trying to ascertain is: is it that, or not that?

>> Kilian Weinberger: Actually, I don't know. Probably they used exactly that implementation. Okay.

>>: And basically, you have an optimizer which works in a specific setting, which is this small-dimensional target and the [indiscernible] loss. When you use SDA in the exact same condition, right, where [indiscernible], you can switch the [indiscernible] and you can use a large target space.

>> Kilian Weinberger: Yeah, maybe that helps. I mean, still — in some sense, you get locked in. Because of our little tricks that make us efficient, we're locked into that formulation, whereas they can change things around.

>>: Did you check, on the smaller vocabulary where the inversion is feasible, what you lose by actually having smaller targets?

>> Kilian Weinberger: Yes, and I don't have the graph — maybe I do, I can show it to you; I think it's at the end of the talk somewhere. It's very little. But yes, we did this.

Okay. And here — someone, I think Leon, asked earlier what happens if you just stay in-domain. That's basically just semi-supervised learning if you stay within the domain, so we also did semi-supervised learning experiments. Here's [indiscernible] and his writers' dataset. He also compared against LSI, which is basically PCA, latent Dirichlet allocation from David Blei, and TFIDF. So here's the graph: accuracy, higher is better, as you increase the number of training examples. mSDA is nicely on top. The benefit is more pronounced when you have little training data, which makes sense. As the training set gets larger, at some point there's not that much benefit; here there's still some benefit up to 7,000 data points.

>>: By training, do you mean training for the SVM, or --

>> Kilian Weinberger: Training for the SVM.

>>: So the decision to train only on the [indiscernible].

>> Kilian Weinberger: That's right. That's exactly right.

>>: And specifically, there are tricks you can use with LSI [indiscernible].

>> Kilian Weinberger: Yeah, we didn't do that. But there are also a lot of tricks you can do with mSDA. In the optimization problem you can, for example, add weights to the different words — an IDF score, say. We have some results on this, but I don't remember what they are.

>>: The weight was just TF?

>> Kilian Weinberger: In our case, it's just TF. We just reconstructed the TF score.
But for the other algorithms in the earlier comparison, we used TFIDF or whatever worked best for them.

>>: [indiscernible].

>> Kilian Weinberger: We just used TF.

>>: So those documents basically get more weight in the optimization problem. The [indiscernible] is weighted more for [indiscernible].

>> Kilian Weinberger: That's right. These cases don't vary too much.

>>: [inaudible] it's just the mean; the variance is the same, just shifted. So I think it's -- see what I'm saying? A target of one has an error, a target of zero has an error. It's not like the target of zero has the same --

>>: Yeah, but there are more words.

>>: So there, the covariance --

>> Kilian Weinberger: By the way, one thing I didn't mention: we also have a constant term in our optimization, in our square loss. I don't know if that matters here.

>>: [indiscernible] much more interesting than Gaussian noise.

>> Kilian Weinberger: That's exactly how --

>>: And the [indiscernible], if I think about it, it's nice, because if you have a word that's very rare, it's just like if you have -- and you please review this paper.

>>: Or he will.

>>: So that's the same realization. Once you start thinking about why this works, does that mean that the regularization [inaudible].

>> Kilian Weinberger: Yeah, so that's -- I'll send you my --

>>: And I bet you $20 that if you did log TF in your reconstruction, Gaussians would be better.

>> Kilian Weinberger: It might be. It might be worth trying out.

>>: Does mSDA, in this experiment, still keep the original representation, or did you use just the mSDA representation?

>> Kilian Weinberger: Yes, we actually kept the original representation and the --

>>: Is that the same for the --

>> Kilian Weinberger: Yes, and we actually tried both for those, and these did cross-validate all the parameters.

>>: So for the others, how did the original -- just curious.

>> Kilian Weinberger: That's one of our -- although I think it might have actually not. I think it didn't.

>>: You have to choose the dimensionality -- there are choices.

>> Kilian Weinberger: That's right. We did cross-validate those.

Okay. That's actually the last slide. So in summary, I talked about the marginalized stacked denoising autoencoder. The "marginalized" stands for marginalizing out the corruption while keeping the high accuracy of the SDA features. It's really feature generation, so it depends on what algorithm you use afterwards, but it creates the nice features of SDA just much, much faster, because it's layer-wise convex and there's a layer-wise closed-form solution. And one thing I hope is that people who work with bag-of-words features might experiment with it; it's really, really easy to implement. Minmin is here, so you can ask her for the code. It might lead to better results in different applications.

I also have a second part of the talk, but I feel like we've talked a lot and there's been a lot of discussion, so maybe I'll just leave that, and if someone wants to talk to me — it's another ICML paper we have, on cost-sensitive learning when you assign costs to features. Unless somebody really wants to hear it; I've already talked for an hour, so maybe I'll skip it. Do you want to hear it?

>>: I want to ask a question.

>> Kilian Weinberger: Okay.

>>: So going back to your algorithm, there are a couple of things going on. You have this procedure to generate extra [indiscernible] data.
It's a fairly standard technique used in learning methods, like when you're doing image work, for example.

>> Kilian Weinberger: Like virtual examples --

>>: Yeah, those kinds of hallucinated examples.

>> Kilian Weinberger: Uh-huh.

>>: So you're hallucinating examples, and you also have the cost function. Do you have any sense of what would happen if you took your hallucinated examples, the way you're generating them, and fed that dataset through LDA or LSI or whatever?

>> Kilian Weinberger: I don't know. It would probably be very slow. LDA and LSI were very -- those would be [indiscernible] those results, because those are data of all these different settings.

>>: There are online versions of these things.

>> Kilian Weinberger: Yeah, okay.

>>: But everything is a covariance.

>> Kilian Weinberger: But LDA — even if it's online, it still takes a lot, right? I don't know. So basically, you're saying: take this noise model and use the other algorithms with the noise model, right?

>>: Yeah. You're sort of proposing two things — the cost function and this hallucination process — and I wonder --

>> Kilian Weinberger: Well, the cost function is not in the -- I see what you're saying. Yeah, I don't know, to be honest. One thing we've tried — and that's basically the other paper — is to take the noise model of the corruption and put it into a classifier. So in the SVM or something, where you have the square loss and [indiscernible] regularization, it's really just Gaussian noise you're assuming there, you use this corruption noise instead and derive the update rules. That also helps; for text documents, it's a lot better than Gaussian noise. So there's definitely some benefit from the noise model. But it's generally the case that learning these representations layer by layer and then sticking them into an SVM seems to be better than just putting the noise model into your classifier. We're just at the beginning of exploring that, in some sense. Yeah?

>>: How did you decide [inaudible]?

>> Kilian Weinberger: How we chose the [indiscernible]? That's the one parameter that is sensitive. With the layers, in some sense, it improves as you make it deeper and just peters out at some point — you can't go too wrong. But the [indiscernible] actually depends on the dataset. And one thing that was quite surprising: we found that on some datasets, up to 90 percent noise was actually the best setting.

>>: Do you cross-validate based on reconstruction?

>> Kilian Weinberger: No, we basically take the whole dataset, apply mSDA, train the SVM on the training set, and then test on the holdout set. That takes a little while, but it's actually pretty fast, because the SVM only takes a few minutes.

>>: You mentioned that the best P is 90 percent. You are using the same data, the same data producing the [indiscernible] for different layers. For some, it's the best P, as you say.

>> Kilian Weinberger: [indiscernible] surprisingly high. Originally, my intuition would have been that it should be pretty low — with 90 percent you're removing a lot of the document, right, you have very little left. But because you're integrating out all possible corruptions, maybe you can get away with a pretty high P. It's not always 90 percent, though; it did vary.
So we plotted the curve, actually, and it's a nice trend — it kind of looks like a bucket — but it depends on the dataset where the best point is.

>>: [inaudible].

>> Kilian Weinberger: Sorry?

>>: Is it higher than the [indiscernible]?

>> Kilian Weinberger: That's a good question. I don't know; Minmin doesn't know. Actually, yes, it makes perfect sense that theirs is lower, because SDA can only do so many iterations over the training dataset, whereas we go over all possible corruptions. With SDA, you keep corrupting your training dataset and every [indiscernible] you have a new corruption, so even if you do a hundred [indiscernible], you only have a hundred corruptions of your data.

>>: [inaudible].

>> Kilian Weinberger: Millions?

>>: Minutes.

>> Kilian Weinberger: Oh, minutes. Yeah.

>>: Did you look at the unbalanced case, which is --

>> Kilian Weinberger: Unbalanced — what do you mean?

>>: Unbalanced representation. So you have very few --

>> Kilian Weinberger: Oh, I don't think it mattered. It's one thing we actually did check. Because what we train is completely unsupervised, right, so the labels don't really matter.

>>: When it's unbalanced, you have to make sure that your representation preserves the part of the data which is [inaudible].

>> Kilian Weinberger: Yeah, I bet it still works there. One thing we actually did: there's this nice paper by [indiscernible] and John Blitzer, who proved some nice bounds on domain adaptation, and they have this way of saying when domain adaptation works. Their claim is that the target and the source have to be similar, and the way they measure similarity between target and source is by training a classifier to distinguish between the two. If you can't do this very well — if you can't distinguish between electronics and DVDs — then they are very similar and therefore they help each other. And one thing we showed is that when you use the mSDA representation, you're better at transfer, but you're also better at separating between the two domains. So it actually helps in both cases — these are very different classification tasks, right — it just seems to be a better representation.

>>: [inaudible].

>> Kilian Weinberger: Let me know. Maybe I'll switch to the very last slide and just thank my co-authors. I want to thank Eddie, Minmin, and Fei, who will be here, and Olivier Chapelle, actually, who helped with this. And — questions? Here's the graph that I promised, the graph you asked about earlier. If you take this 16,000-word dataset — it's not so high-dimensional — here is the dimensionality and this is the computation time. From 15,000 to 16,000 you see a big gap, because of the cubic scalability of the inverse, so if you go down to 10,000, there's a very small hit in accuracy but actually a drastic reduction in time. At 5,000 it's noticeable — at some point you notice it because you remove some concepts.

>>: That was my concern with the unbalanced thing: if you have an unbalanced problem, probably your task will be looking at few words.

>> Kilian Weinberger: But 10,000 words — you can say a lot in 10,000 words, right? Make it 15,000.

>>: Yeah.

>> Kilian Weinberger: So that's --

>>: [inaudible].

>> Kilian Weinberger: Any more questions? All right. Thank you.