>> Alice Zheng: Okay. So thanks for coming to the talk this morning. We'll
hear from Kilian Weinberger, formerly at Yahoo and now at Washington University
in St. Louis. Kilian's an expert on metric learning, transfer learning, and
multitask learning. He's also interested in brain decoding problems.
This morning, he's going to tell us about a way of improving bag-of-words
features. One of his co-authors, Minmin Chen, is in the back and is almost an
intern this summer. And just a bonus question for the audience.
In the subtitle of the talk, you will notice that Kilian has a middle name that
starts with Q that is apparently not his secret agent code word handle. It is
a real Bavarian middle name. Free lunch to whoever can guess what it is.
>> Kilian Weinberger: Thanks, Alice. Good luck. Hey, hey. So thanks very
much for inviting me. It's a lot of fun to be here and actually I recognize a
lot of people in the audience. Actually, some people just took my course at
Wash U.
So I'm talking today about mSDA; marginalized stacked denoising autoencoders.
It's an algorithm to improve bag-of-words features. And this is joint work
with Minmin Chen, who is my student and is in the back here, the lady with the
purple sweater; my student Eddie; and Fei Sha, who is also coming to
Microsoft, I think, next month or something.
Okay. My talk -- actually, is there a way to turn down the light a little bit,
or is that going to screw up the video? Because my slides are all black.
That's going to screw up the video. Okay. So my talk, 90 percent of my talk
will be on marginalized stacked denoising autoencoders. Which will also be
presented at ICML in two weeks. And then I will give a sneak preview at
the end, mostly just because, you know, there might be some people at
Microsoft, in particular, who might be interested in this and whom I would like
to talk to about cost-sensitive training. That's also a paper that we have at
ICML in two weeks.
Okay. So let me get started. Just a quick review on bag-of-words features.
Probably most people are familiar with bag of words. So just as a reminder,
bag of words is basically a way to represent text documents in machine
learning. So let's say you have a text document here -- this is a product
review -- and basically what you do is you take the words in the text and you
represent them as a vector.
And the way you do this is you take the text and you have a dictionary and the
dictionary maps any word to a particular entry in a vector, a particular
dimension. You can also use a hashing function. And then you represent the
entire text document as basically a vector that has as many dimensions as you
have words in your dictionary. And then each dimension tells you how many
times that word is present in your text. For example, a one here means that the
word Kindle occurs exactly once in this text, and the word Nook occurs twice,
et cetera.
So in this case, basically, it's called bag of words, because you're basically
throwing all the words in a bag. The order of the words is lost.
All right. So this is just machine learning 101 in some sense. Then, you know,
if you want to do a classification problem or something, you take these
bag-of-words vectors, you put them in a space, and then you train an SVM
classifier or something. Or whatever you want to do with it.
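Just to make the representation concrete, here is a minimal sketch of building such a
count vector in Matlab. This is my own toy illustration, not code from the talk; the
dictionary and the tokenized review are made up.

    % Toy dictionary and tokenized review (both hypothetical).
    dictionary = {'kindle', 'nook', 'love', 'screen', 'battery'};
    tokens     = {'love', 'the', 'nook', 'more', 'than', 'the', 'kindle', 'nook'};
    x = zeros(numel(dictionary), 1);      % one dimension per dictionary word
    for t = 1:numel(tokens)
        [found, idx] = ismember(tokens{t}, dictionary);
        if found
            x(idx) = x(idx) + 1;          % count occurrences; word order is lost
        end
    end
    disp(x')                              % e.g. 1 2 1 0 0: kindle once, nook twice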
This is used in many different settings. For example, image classification,
where people use HOG features and SIFT features as interest points, and then
these become, you know, a bag of interest points, or, of course, in text
classification, that's the domain I will focus on today. For example,
classifying documents or web pages based on topic or, you know, classifying
product reviews or blog entries based on sentiment. And that's actually kind
of the example that I will use throughout the talk. Given a review, is it
positive or negative.
But everything applies to all of these settings.
Okay. So let me just illustrate why the bag-of-words vector is in some sense too
limited. Here are three sentences. A: recently, Obama signed an important
bill. B: Sunday, our president mentioned a game-changing law. These are
probably about the same topic, these two sentences. C: last Sunday, Manchester
United in Britain won the game I mentioned.
So, you know, clearly if I ask you which two sentences are more similar or
about the same topic, A and B are similar and C really doesn't have anything to
do with these two. But if you look at the bag-of-word representation, then it
actually turns out that B and C have a lot of words in common. So if you take
the inner product between these two vectors, they might actually appear pretty
similar, whereas A and B don't actually have any words in common.
And why is this? Well, they talk about the same things, but they actually use
different words. Here, for example, you have Obama. B says president. Here
we have important and game-changing, and bill and law. So these are actually
very similar concepts. But because they use different words in our
representations, this is completely lost. Looks like these two are very, very
dissimilar.
So to sum this up, with bag-of-words vectors, very often your problem is
that your vectors are just too sparse. They say too little about the text
document, and you might just have very little overlap between documents. So,
for example, just because you use different words, you know, there's no overlap
between two documents. You can't tell at all if they're similar or dissimilar.
And in particular, if your training set is [indiscernible] small, one thing
you're going to observe if you use words that are not that common -- in fact,
in the English language the majority, almost all words, are very rare because
they follow Zipf's law, the power-law distribution -- so if you use rare words,
then you only see them once or twice in your labeled data, and so the
classifier doesn't really know what weights to put on them. And the main
problem here is basically that bag-of-words just captures the words by
themselves but not the meaning of the words.
If I use president or Obama, these are two different things, two different
entities. It does not -- there's nothing in the representation that tells me
that these two are actually the same thing. And therefore, two documents might
be similar if they use these two words.
And so usually, these -- none of these problems are too bad if you have an
infinite amount of data, but you never have that; especially our labeled data
is usually very limited. You have a small amount of labeled data, and then
these problems are pretty severe.
So what we are proposing is to -- well, we can't really do very much about the
little labeled data, but we assume that we have unlabeled data, so you have
just basically the same [indiscernible] setting. You have unlabeled data from
the same kind of distribution or a similar distribution, and we use the
unlabeled data to learn a representation of the documents. And then we take
the labeled data and learn our classifier. That's the two-step approach that
we are proposing.
And so later on, I'm going to also show this for domain adaptation, where the
unlabeled data actually comes from a different domain. So we can say, well, we
might not have unlabeled data from the problem that we are interested in, but
we have a similar problem that actually has unlabeled data and we just use
that, and that might still work.
Okay.
Any questions?
All right. So our algorithm is called the marginalized stacked denoising
autoencoder, and I will explain in a few slides where that name comes from and
why it makes sense. And this is probably the most important slide. The basic
building block of our algorithm is a linear denoising autoencoder.
So let's say we have some text document that is one of the examples in our
unlabeled dataset. And this here is the text document, just a product review
of some book on Amazon, and this here is the bag-of-word representation. So
this here is a vector, and yellow here means that basically this word, you
know, might be present and white means it's not present. This is zero.
So each word here, each entry here in this bag-of-word vector corresponds to
some dictionary entry. This might be the word favorite. This might be the
word best written, et cetera.
All right. So what do we do? The first thing we do is we take our text
document and we duplicate it. And then we take the duplicated version and we
corrupt it. How do we corrupt it? We go through it and we look at every
single word and we remove it with probability P. So we roll the die for every
single word and, you know, if the right thing comes up, you just remove it,
just blank it out.
And the intuition is that if you understand your domain, then even if you
remove some of the words, you should still be able to make sense of the
document. So, for example here, I'll read this to you: I read two to three
blank a week. Right? Well, if you look at the text as a whole, you should
probably figure out that this blank should be books. Just because you
understand the domain pretty well.
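As a rough sketch of that corruption step (my own illustration, not the authors' code),
with a hypothetical corruption probability p:

    p = 0.5;                      % corruption probability (hypothetical value)
    x = [1 2 0 1 0 3]';           % toy bag-of-words counts
    mask = rand(size(x)) > p;     % each word/bin survives with probability 1 - p
    x_tilde = x .* mask;          % corrupted copy: strictly more zeros than x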
And so that's our intuition. You basically take the corrupted version, and from
this corrupted version of the text you try to reconstruct the uncorrupted
version. So in some sense what you're doing here is you're simulating the test
case. In the test case, what are the hard examples during test time? They're
the ones where, basically, important words about the domain are missing. The
author used some different words that we don't know -- the classifier hasn't
trained on those words. You're simulating this by taking our training data and
just removing words randomly, basically making harder test cases out of it in
some sense.
All right. So how does the algorithm work? You basically take this corrupted
version, and this is the bag-of-word vector of the corrupted text here, which
is strictly sparser, so there are strictly more zeroes in it because we removed
some words. And we try to reconstruct the original text document, and we do
this with a linear mapping. We basically take the corrupted version here and
we try to minimize the square loss between the reconstruction and the original
document.
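A minimal sketch of that reconstruction step (my own toy code, not the authors'): fit W
by regularized least squares so that W applied to the corrupted matrix approximates the
original bag-of-words matrix. The sizes are hypothetical.

    d = 1000; n = 500; p = 0.5;                    % hypothetical sizes
    X  = double(sprand(d, n, 0.02) > 0);           % toy bag-of-words data, d words x n docs
    Xc = X .* (rand(d, n) > p);                    % blank out each entry with probability p
    lambda = 1e-5;                                 % tiny ridge term so the inverse exists
    W = (X * Xc') / (Xc * Xc' + lambda * eye(d));  % minimizes ||X - W*Xc||^2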
Any questions?
>>: What is the W?
>> Kilian Weinberger: W is basically what we learn. So we're just learning
this mapping W. So we have the original document, we corrupt it, we try to
find some W that basically, you know, if you map this corrupted version with
this linear mapping, then these two are very similar.
>>: So it's actually W, the [inaudible].
>> Kilian Weinberger: No, it's just any [indiscernible] square matrix.
>>: But that's [inaudible].
>> Kilian Weinberger: You're right. Good point. I'll get to that, absolutely.
>>: The corrupted version of the document -- why do you want to do
[indiscernible] substitution of synonyms?
>> Kilian Weinberger: Because it's really simple. You can definitely do
fancier stuff. But you will see later on that basically, if you just do this
deletion, then it turns out it's a noise model that we can handle very nicely
later on. That was a good question.
>>: So if your goal is to understand words you've never seen, seems like it
would make more sense to delete based on the similarity of the words
[inaudible].
>> Kilian Weinberger: Yes, and actually, one thing -- you can definitely do
this and actually have different kinds of probabilities for every single word
in some sense. We tried to keep it simple and just have a [indiscernible]
probability, but you can definitely do this, actually. There's no reason not
to. Yeah?
>>: [inaudible].
>> Kilian Weinberger: Actually, we did remove it from the bag of words. Yes,
because the same word appears several times, right; we remove it in all
locations.
>>: Do you remove the bin?
>> Kilian Weinberger: We remove the entire bin, so we just [indiscernible],
right. And so what this mapping really does, basically, is it learns to
reconstruct from other words that co-occur. So basically, it would learn that
president occurs together with White House, you know, and Obama or something.
Then later on, if you just see president, what it will do is basically assume,
well, maybe you removed White House and Obama, and it will start hallucinating
those words and adding them to your representation. That's kind of the idea.
So it makes the representation richer.
>>: So I see [indiscernible] to try to --
>> Kilian Weinberger: Perfect. This is the stacked denoising autoencoder.
And we have the marginalized stacked denoising autoencoder. That's exactly
what our work is building on. Absolutely. And I will have a slide that
basically puts those next to each other.
>>: So you want to have an asymmetric loss, whether you do a deletion or, you
know -- like, could you say it might be [indiscernible] to miss a word than to
suggest a word?
>> Kilian Weinberger: Oh, I see.
>>: Symmetrical word [inaudible].
>> Kilian Weinberger: Not entirely sure what you mean. Did you mean like
different words might be --
>>: You make a prediction, a word reconstruction prediction, and you can either
predict a word or [indiscernible] like reduce the weight of the word.
>> Kilian Weinberger: Oh, I see.
>>: And the same type of mistake, right?
>> Kilian Weinberger: That's true. That's true. So yeah, we just handle it
as a square loss. We just keep it simple in that sense, yeah. That's right.
>>: Other kind of reconstructing document [inaudible].
>> Kilian Weinberger: So we try to reconstruct the original bag of words,
absolutely. That's right. Okay.
>>: So I think I understand what you're proposing, but I'm not sure I
understand why it makes sense. Intuitively, at a very high level, maybe I
understand. But it seems that what's hiding here is maybe some type of
generative assumption on how texts are generated. But otherwise, I mean, why
is this different than I could propose something like -- we could argue
forever.
>> Kilian Weinberger: All right. Let's try to avoid that. So in some sense,
you're absolutely right. And you can view this -- you know, just a different
spin on this would be to say, well -- and actually we can talk about this maybe
offline, because it goes a little bit more into depth -- but in some sense,
we're proposing a little bit different noise model. People basically use a lot
of Gaussian noise, for example, et cetera. But in text documents, right, your
words follow [indiscernible] distributions; a lot of words are either present
or they're not, right. They only appear once or twice in a document.
So a good way to model it in some sense is to say, well, they might either be
there, or -- in the test case, you see some documents and, you know, some words
are just removed, and that's why they're not there. Does that make sense? It's
in some sense an approximation of, basically, the tail of the power-law
[indiscernible] distribution.
>>: Already should have noise.
>> Kilian Weinberger: Yeah. Okay. Good. Let me just move on to maybe a few
more slides. This is great.
Okay. So this is basically -- you're minimizing this objective here. We
basically learn this matrix W to go from the corrupted version to the
non-corrupted version. And this is just a square-loss regression, right. It's
ordinary least squares. So ordinary least squares is, of course, very nicely
behaved -- it's a convex function -- and, in fact, there's a closed-form
solution, so we just have a little matrix inversion here and then we jump right
to the minimum of the function. So this pops out in closed form.
And one thing I did on the previous slide: I took the text document and
made one corrupted version of it. Of course, that's only -- if I did this
again, I would get a different corrupted version, because I remove every word
with probability P. So instead of just having one corrupted version, it would
actually be better to, kind of, you know, train my classifier with many
different copies of corrupted versions.
So let's say we make M different corrupted versions. So we do this M times:
for every single text document in my training data, the unsupervised data, I
corrupt it M times. And then I can do exactly the same thing, so now I want to
learn a mapping that, you know, reconstructs the original text document well
across all of these corruptions. And the optimization problem that we get is
exactly the same thing, except that we average over all M corruptions.
So across all of them, we should do well. And it turns out -- well, it's not
surprising -- that it's still a closed-form solution. And it's very analogous.
It's basically just the average here: you basically average the covariance
matrix -- you average these outer-product matrices, these outer products of
vectors.
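In symbols -- a sketch in my own notation of what I take the slide to be showing, with
x_i the original documents and \tilde{x}_{i,j} their j-th corrupted copies:

    \mathcal{L}(W) = \frac{1}{2nM}\sum_{i=1}^{n}\sum_{j=1}^{M}
        \bigl\| x_i - W\,\tilde{x}_{i,j} \bigr\|^2,
    \qquad
    W = P\,Q^{-1},
    \quad
    P = \sum_{i,j} x_i\,\tilde{x}_{i,j}^{\top},
    \quad
    Q = \sum_{i,j} \tilde{x}_{i,j}\,\tilde{x}_{i,j}^{\top}.

Both P and Q are just sums (averages) of outer products of document vectors, which is
the averaging described above.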
Okay. And so one question is how large we should set M. And in some cases --
sometimes that means iterating over the dataset. But in this case, we're not
overfitting. There's no overfitting, because you're not using the labels
anywhere. So the larger M is, the more robust you get against this kind of
noise -- the more corrupted examples we see.
So ideally, you would like to make M as large as possible. So ideally, you
would like to make M, in fact, go to infinity. And, in fact that's exactly
what we can do. We can just let M go to infinity. And in the limit, these
terms here actually just become the expected values of the outer products as M
goes to infinity.
And we can just stick that into, you know, the closed-form solution, and it turns out
these expected values are actually really easy to compute. Why is this?
Because of our noise model -- and that answers the question that was asked
earlier, why do we use such a simple noise model -- because, actually, if you
just remove every word with probability P, it's just a [indiscernible]
distribution. So this
here we can just compute in closed form. It's a scatter matrix modified by the
probability that every feature survives the corruption. That's all there is.
That's the expected value.
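A sketch of those expected values under the uniform word-dropout noise, following my
reading of the formulation above (the notation is mine): write q = 1 - p for the
survival probability and S = \sum_i x_i x_i^{\top} for the scatter matrix of the
uncorrupted data. Then

    \mathbb{E}[Q]_{\alpha\beta} =
    \begin{cases}
        S_{\alpha\beta}\, q^2 & \alpha \neq \beta,\\
        S_{\alpha\alpha}\, q  & \alpha = \beta,
    \end{cases}
    \qquad
    \mathbb{E}[P]_{\alpha\beta} = S_{\alpha\beta}\, q,
    \qquad
    W = \mathbb{E}[P]\,\mathbb{E}[Q]^{-1}.

With per-feature survival probabilities q_\alpha instead of a single q, the factors
become q_\alpha q_\beta and q_\beta.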
And so if you want to code this up, it's actually, it's really straightforward.
The whole thing in Matlab is just ten lines of code. So this is the actual
code that we use. Ten lines.
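The ten lines themselves were on the slide and aren't reproduced here, but a sketch in
the same spirit -- my own reconstruction from the formulas above, so details such as the
bias/constant-term handling may differ from the authors' actual listing -- looks roughly
like this:

    % X is d x n (vocabulary size x number of documents), p is the corruption probability.
    function W = mDA_sketch(X, p)
        d = size(X, 1);
        q = (1 - p) * ones(d, 1);        % survival probability of each feature
        S = X * X';                      % scatter matrix of the uncorrupted data
        Q = S .* (q * q');               % E[corrupted * corrupted'] off the diagonal
        Q(1:d+1:end) = q .* diag(S);     % diagonal picks up only one factor of q
        P = S .* repmat(q', d, 1);       % E[original * corrupted']
        W = P / (Q + 1e-5 * eye(d));     % solve W*Q = P with a tiny regularizer
    end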
>>: Average over the target documents.
>> Kilian Weinberger: Sorry? Where did that go?
>>: You didn't seem to average over all possible sums of the least squares,
right?
>> Kilian Weinberger: No, that's here. Here is the average over all the
possible corruptions.
>>: But there's only one XI.
>> Kilian Weinberger: Oh, I see. So here, sorry. So I'm summing over my entire
corpus.
>>: [inaudible].
>> Kilian Weinberger: So it sounds like you could imagine basically you take
the entire corpus and you replicate it M times and you corrupt it.
>>: So I'm surprised you didn't end up with a generalized eigenvalue problem --
>>: Exactly, like what's the [inaudible].
>>: You should be taking an inverse; you should be taking the top N
eigenvectors.
>> Kilian Weinberger: Okay, wait. Let me just see.
>>: Averaging over those two [indiscernible] matrices, one which is your
maximum --
>> Kilian Weinberger: Yeah, but actually, this is half -- this is basically --
this is fully corrupted. But this here is kind of the scatter matrix of the
uncorrupted version and the corrupted. So it's kind of the mix between these
two.
>>: So what if I just gave all of my corrupted data and I just ran PCA on it --
I can illustrate [indiscernible]. How would that be different?
>> Kilian Weinberger: It's quite different. So basically, you're finding the
vectors -- basically, you're projecting onto directions of maximum variance,
right, which is not what we're doing here.
>>: But there's OPCA, which tries to [indiscernible] maximize the variance of
the data and simultaneously minimize the variance over your corruption.
>> Kilian Weinberger: So yeah. I don't know how that relates to it.
>>: It's a different loss function. They're doing regression.
>>: But linear [indiscernible] analysis always goes -- you can always take
zero, one targets and boil it down into --
>>: [inaudible].
>>: The corruption will be different. This is a very special kind of function.
>>: Maybe.
>> Kilian Weinberger: So with PCA, you're not corrupting it.
>>: No, with OPCA.
>> Kilian Weinberger: Oh, I see, okay. I don't know OPCA. Maybe we can talk
about this offline.
>>: Is W [indiscernible]?
>> Kilian Weinberger: No, one W for all samples. Very good question. Thanks.
Okay. Any more questions? Okay. So we can compute this in ten lines in
Matlab and so here, someone asked earlier, there was a paper last year, ICML,
and is this something similar. That is actually the paper we have at ICML this
year. Actually, we show that you can take their version and make it much, much
faster. So what you were referring to is the paper by [indiscernible] group in
Montreal, [indiscernible], et al. And they were the ones who actually inspired
all this work. They had this idea of this corruption, corrupting data input
vectors and then reconstructing them.
So they basically took text documents and randomly removed, you know, words --
with [indiscernible] the same noise model that we used -- and then they trained
a neural network to reconstruct the original bag-of-word vector. And the
neural network had a hidden layer that's overcomplete. And they trained that
with back propagation, and that gets really, really nice results, and that's
kind of what started our work.
So we have an encoder here and a decoder. And this is their loss function.
The only difference is that it uses back propagation and you have to iterate
over the training set many times, and our idea was basically: how can you
possibly make this faster? And so our idea was basically that we remove the
hidden layer in the middle -- this way, you can make this linear -- and then
instead of going over the dataset many times and corrupting it over and over
again, you can actually marginalize out all the noise and do the whole thing in
closed form. Yeah?
>>: [inaudible].
>> Kilian Weinberger: So you're talking about this one here, the SDA? So the
reason they can have it overcomplete is because of the corruption. If it
wasn't corrupted, then [indiscernible], right? Because of the corruption, they
can get away with making this actually overcomplete.
>>:
So it's bigger than the -- even for text?
>> Kilian Weinberger: Yeah, so they only use -- and this is going back to
actually what Lynn said earlier. They only use the 5,000 dimensional -- 5,000
words.
>>: [indiscernible].
>> Kilian Weinberger: That's actually another problem. [indiscernible]
doesn't scale to high dimensions, because that [indiscernible] is really slow.
They have -- maybe --
>>: The thing that I don't know how [indiscernible] you can remove
[inaudible]. Then you're left with an --
>>: Empirically, when you do these autoencoders, does [indiscernible] the
nonlinear [indiscernible].
>> Kilian Weinberger: I guess that's what we are showing. We're actually
getting similar results.
>>: Only on the output you can do something. Not much [indiscernible].
>>: If you use Gaussians anyway, might as well use the natural
[indiscernible].
>> Kilian Weinberger: Okay. Yeah, let me just move on. So basically, this
relates -- the reason why we call it the marginalized stacked denoising
autoencoder is because the algorithm was inspired by the stacked denoising
autoencoder, but we're marginalizing out all the corruption. That's where the
m comes from.
>>: So I was surprised that I didn't see that you needed some major amount of
regularization. I would have thought you'd have had a problem with rare words
corrupting the entire rest of the document if those words appeared. That's not
needed?
>> Kilian Weinberger: It's not really needed. So one thing is because you're
training it on the unsupervised corpus, which is usually a lot larger. So --
>>: So [indiscernible] one document. Then once you learn the Ws, when you see
that one word, you generate the entire rest of that document?
>> Kilian Weinberger: You might. In those cases, you might run into a little
bit of problems. And actually, let me address that later on.
>>: [indiscernible].
>> Kilian Weinberger: You do actually do a little bit of regularization.
Here's the regularization.
>>: [indiscernible].
>> Kilian Weinberger: There's a little [indiscernible] question there. That's
right. So I put that on.
>>: It's hidden in the --
>> Kilian Weinberger: That's right. [indiscernible]. But it's very tiny.
It's very, very tiny. It's ten to the minus five is the --
>>: Okay.
>> Kilian Weinberger: Just that it's not as well defined.
>>: In that case, do you use an earlier version of this, do you use
[indiscernible]?
>> Kilian Weinberger: [indiscernible], yeah, uh-huh.
>>: But you use cross [indiscernible] instead of MAC?
>> Kilian Weinberger: Actually, Minmin would know. Is that what they do?
>>: Yeah, they use [indiscernible].
>>: So I think it's different, like all the symmetric cases you do on
[indiscernible]. Distribution of errors, you get --
>> Kilian Weinberger: There's a little bit more subtlety to it, I agree, but
bear with me for a few more slides. We do have to talk about this offline.
[indiscernible].
Okay. So one thing that's nice about the [indiscernible], though, is that it
can be stacked to make it deep. In some sense, their claim is that their
approach to deep learning is the stacked denoising autoencoder. That's the
Montreal version of deep learning. So basically, you have another hidden
layer and another hidden layer and so on. They make this five, six layers
deep.
So we can do the same thing with our method. So let's say you have our input X
and you learn a matrix W and you apply W to X, and you get some hidden -- we
get our output here, in some sense. Basically the reconstruction, in the sense
that you take our X here and hallucinate new words onto it. And then what we
do is we apply a nonlinearity to it. So we basically take a squashing
function, some sigmoid function, that basically just squashes these outputs,
and this is now in some sense the output of our algorithm.
Then we can use that as the input of the algorithm again. So we can apply it
again. So in some sense, you call this layer one. You can now use the output
of layer one as the input of the stacked denoising autoencoder again and get
layer two, et cetera. So each one of these is solved in closed form. So solve
in closed form, apply a sigmoid function, then solve the next layer in closed
form, apply a sigmoid function, et cetera.
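A rough sketch of that stacking loop, my own illustration: it reuses the hypothetical
mDA_sketch helper from the earlier sketch, assumes X, p and L are already defined, and
uses tanh as the squashing function where the talk mentions a sigmoid.

    H = X;                           % layer 0: the original bag-of-words matrix
    feats = {X};
    for layer = 1:L                  % L = number of layers (a parameter)
        W = mDA_sketch(H, p);        % closed-form solve for this layer
        H = tanh(W * H);             % squash the linear reconstruction
        feats{end+1} = H;            % keep every layer's representation
    end
    Z = cat(1, feats{:});            % final features: input plus all hidden layers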
And then we take these input layers, and that's basically the same thing that
[indiscernible] group does. Take the bag-of-word vectors and these hidden
layers and make that the new representation of our data. So instead of just
using the bag-of-word vector, you basically now have these hidden
representations, the bag-of-word vector, the first reconstruction, the second
reconstruction and so on. That's our representation of our data. And now we
train SVMs on that data.
And these here are basically completely learned on the unsupervised part of our
corpus [indiscernible] part of our corpus. Yeah?
>>: So can you give me some [indiscernible] on what you need to
[indiscernible] so you update every single -- but here --
>> Kilian Weinberger: Good intuition.
>>: What does this first layer do?
>> Kilian Weinberger: So sometimes it basically takes the bag-of-word vectors
and for every word, it adds words that basically co-occur with that word. So,
for example, let's say I have the word, you know, Obama. It might add the
words White House, president, government, or something.
It wouldn't add Clinton, because Clinton and Obama don't really occur together.
Clinton, that's the wrong Clinton. Bill Clinton. And but then if you, in the
second layer, right, now you have actually White House, government, president.
And the second layer would actually add Bill Clinton to it, because that also
occurs in the same context.
You can kind of view it as actually -- Rani pointed this out when I talked to
him yesterday. You can kind of see it, view it as a graph. Basically say what
words are connected, co-occur with other words. And you can kind of -- each
layer here takes one step in that graph.
>>: [inaudible].
>> Kilian Weinberger: I think at some point, it won't change anymore.
>>: How do you know the scale of the sigmoid? Seems like it can multiply H
by -- well, your target is --
>> Kilian Weinberger: You can. I don't think it's that important. I think
it's pretty robust.
>>: Have you tried it?
>> Kilian Weinberger: Actually, Minmin tried a whole bunch of squashing
functions. And, in fact, even if you don't use a sigmoid -- sigmoid is a good
idea, but even if you don't use it, it still improves to have multiple layers
because of that effect.
>>: [indiscernible] couldn't you just reinforce the sparsity?
>> Kilian Weinberger: [indiscernible] sparsity. It's not sparse. It's not
sparse.
>>: But the small number of numbers would suggest that you are [indiscernible]
sparsity, that you want to have some [indiscernible].
>> Kilian Weinberger: Not really.
>>: So the sigmoid [inaudible].
>> Kilian Weinberger: Sometimes, basically what you're doing is you're
exaggerating [indiscernible]. Actually, you know, [indiscernible].
>>: When you introduce the noise, you kind of do a [indiscernible], right?
>> Kilian Weinberger: Not quite sure what you mean.
>>: There might not be a --
>> Kilian Weinberger: We're using every possible noise there is.
>>: The W is -- I mean, can you introduce any kind of [indiscernible]? It is
linear, right?
>> Kilian Weinberger: Yeah, we're integrating it.
>>: Yes, but the W is linear.
>> Kilian Weinberger: Oh, there might not be a perfect reconstruction -- is
that what you mean?
>>: There might not be a transformation from X, Y to X.
>> Kilian Weinberger: Well, you're just removing it, right. So basically, you
just have to represent -- if you remove a word, right, if it's a linear
combination, you can just have a linear combination of other words to
reconstruct any word that you would remove. So you remove that word -- here,
energy efficient. Well, here it's [indiscernible]. But you could basically
say, you know, like for example, there's a word, you know, in the product
review -- if it's energy efficient, right, that might appear together with
good, you know, or great or something. So, you know, you reconstruct words
from those co-occurring words. Does that kind of answer it?
>>: Probably. I'm just saying that the [indiscernible] model is linear and
the noise model is totally random. So I'm not sure if that can go back to
this.
>> Kilian Weinberger: But we don't, right? You're just trying to find a new
representation. You're still sticking to the original input as the
representation. So when we train our SVM, we're not corrupting this, right?
We're not corrupting a test example. We take a test example the way it is and
you take these basically additional, you know, these hidden representations
where you basically added more words to it. So we're filling out the sparsity
of the bag-of-word model.
>>: [inaudible].
>>: That's right, I think W is -- that's exactly what PCA does. PCA finds the
low-rank representation of the [indiscernible]. You don't actually have to
keep the reconstruction if you're repeating the square root, you know what I'm
saying. If you're computing the square root --
>>: I see what you're --
>> Kilian Weinberger: You score things in that direction. I don't want to get
too much into that, because I still have a couple slides. I do like this
discussion a lot, actually. This is great.
So if you do this whole stacking business, and then actually the whole thing is
20 lines of Matlab. So what I'm trying to drive home here is that it's really,
really easy to use. If you have any kind of bag-of-word model, any kind of
bag-of-word data, this is the entire code that you need to get this new
representation. And the only parameters that you have here are your input
data, P, which is the probability of removing a word, and L, the number of
layers that you stack. So it's a really, really simple model.
And this is a big contrast to the neural networks people use, where you
basically have the number of layers, the step size, the number of epochs you go
over the data, and so on.
>>: [indiscernible] you do have layers for over there.
>> Kilian Weinberger: Sure. I mean, [indiscernible]. Okay. Let me talk
about some results. The domain that I focused on was actually domain
adaptation, particularly, actually, because the paper by [indiscernible]
for ICML used exactly that domain. And so here, the task basically is: given a
review, a product review, try to predict if it's positive or negative -- the
sentiment of the text. And the dataset that we have -- actually, John Blitzer
created the data in 2006. He scraped product reviews from Amazon.com, and he
said everything that has five or four stars is positive; anything with fewer
stars is negative.
So given the text, our goal is to predict if it's a positive or negative
review.
And the way we do this is we do domain adaptation. The training data is from
one domain -- for example, it could be, you know, electronics or something --
and testing is on a different domain. So in this case, actually, we have four
different domains. So for example, you could train the classifier on book
reviews and test it on electronic equipment or kitchen appliances.
And in this kind of case, in this kind of domain adaptation, it's really,
really important that you have a good representation of the data. So that's
why this is a good application, a good way to test our algorithm in some sense.
And so if you just do this naively, you just do bag of words and you train a
classifier on book reviews, you get a test error of 13 percent. You take the
same classifier and apply it on kitchen appliances, then you get a test error
of 24 percent. So you're basically doubling your error by going from one to
the other. And that makes a lot of sense, right, because the way you describe
is a good book is very different from the way you describe a good toaster.
So for example, here you might say the book is best written, you know, it's one
of my favorite books. Favorite you might use in both. But eloquent story --
you would never say a toaster is best written or, well, convenient or
something.
So for a coffee maker, a toaster, you might say it's solidly constructed or
easy to program. So these are things you would never use, these are words you
would never use for a book. So if I train my classifier on book reviews, the
classifier would have no idea what to do with these words.
But what we do is we basically assume we have unlabeled data from both domains,
and we run our mSDA representation learning algorithm over both domains and get
a joint representation that we then map our data into.
Okay. So here's some results. Basically, these are different domain
adaptation tasks. So this means going from DVD to books, electronics to books,
kitchen appliances to books and so on. So this is what you train it on. This
is what you test it on. And what you see here is the transfer loss, transfer
loss is defined as the error you get from the transfer minus the error you get
if you had actually stayed within the domain.
If you had trained on kitchen appliances and tested on kitchen appliances but
here you train on books. So one thing is just this here is basically just to
compare this -- this is basically the error on bag of words, whereas this here
is on mSDA.
And so one thing you can see, we compare against a bunch of different
algorithms. This here is the baseline. Here is PCA. Someone mentioned PCA
earlier. So the blue line here is PCA. You basically project everything to
one common sub-space. Then you have a couple sub-spaces, paper by John
Blitzer, and then you have a couple more baselines. And the red line here is
SDA. That's the stacked denoising autoencoder -- that's the neural network
that basically does the same thing. And mSDA is our dark red line here.
And so one thing you can see here: I think out of those 12 tasks, on ten we
actually get the best results. Only on two does this algorithm do a little
worse. To be fair, though, SDA was only one layer; mSDA was five layers. The
reason SDA was only one layer is because it's so slow. It takes a really,
really long time to train.
>>:
Is there any story there in terms of what are hard examples, things like
irony or reviewers were super hard or super soft? Is there any notion of what
the reviewer correlation would be to words [indiscernible]? What's the actual
noise?
>> Kilian Weinberger: Well, the one question I don't understand.
>>: There's two points to the question. What is the error rate. What is the
noise level in the label. And then the other one is besides the better
numbers, is there anything you think you're solving conceptually that you
couldn't get before? Things like whether it's irony or whether it's a bias of
the reviewer?
>> Kilian Weinberger: The answer to both questions is I don't know. So I'm
not sure about the natural error level. I'm sure there's some, right. And I
mean, there's some reviews you just can't possibly get right. And in terms
of -- I'm not sure if there's some -- so it just really helps you with making
use of words, basically. Really what it does, it kind of connects words from
the source domain to the target domain. That's really what it does.
So if you have a review that before was written with a lot of words that you
would never use in books, you couldn't get that right before, because you just
don't know what these words mean. And now you can.
>>: [indiscernible] on the original domain [indiscernible].
>> Kilian Weinberger: The entire [indiscernible]. Actually, the same for mSDA
and SDA. That's the entire corpus.
>>: So you never tried to [indiscernible] mold. None of these methods do
that?
>> Kilian Weinberger: So just training -- so I don't think there's a benefit
for these methods to only use one class. I think it always helps to throw in
more data. So I think that's at least what we got empirically.
>>: I don't understand the negative numbers here. So it seems like you're
getting a negative number on this D to B and B to B.
>> Kilian Weinberger: Right. I can tell you exactly why that is.
>>: And there's also K to K. There are pairs here.
>> Kilian Weinberger: That seems odd, right? Like you train on D and you do
better. And the reason is that it [indiscernible] bag of words. This here is
the baseline of bag of words, whereas this here is the baseline of mSDA. So
does that make sense?
>>: So but why are D and B friends here, and A and B are also friends? So we
see somehow there's this pairing.
>>: [inaudible].
>> Kilian Weinberger: So again, it's electronics and -- sorry, actually, I
don't know too much about the individual domains. I mean, you could look into
what words you use to describe a DVD, what words you use to describe a book.
So maybe electronics --
>>: The difficulty [inaudible]: book and DVD are closer, and kitchen and
electronics are closer. So it's easier to adapt.
>>: mSDA from books to books, I would have done better.
>> Kilian Weinberger: Actually, if you have used mSDA, so if you took all of
the reviews that we had and done one representation for everything, then we
just train the classifier.
>>: A different capacity. That's the whole point, right?
>> Kilian Weinberger: Yeah, so basically, yeah. Because the representation
changed -- does that make sense?
>>: It's five times the capacity. So the left-hand one actually has much more
capacity than the right-hand one.
>> Kilian Weinberger: That explains why you get negative results.
>>: That explains why negatives.
>>: [inaudible]. But I'm saying, why is there this B to B?
>> Kilian Weinberger: If I give you the dataset [indiscernible].
>>: [inaudible].
>> Kilian Weinberger: The domains have roughly the same size. They all have
roughly around 2,000, I think.
>>: 2,000?
>>: [inaudible].
>>: How large is your vocabulary?
>>: For this experiment, [inaudible] largest we try is 7,000.
>> Kilian Weinberger: So these are a thousand, maybe 5,000 words.
[indiscernible].
>>: [indiscernible] classifier 25,000 [indiscernible].
>> Kilian Weinberger: Yeah, but it's regularized. So SVM -- we do SVM, we do
classification with regularization. So we do, you know --
>>: [inaudible].
>> Kilian Weinberger: Good question. I don't know. You would get something,
I'm sure. It's great to have [indiscernible].
>>: Always have a grad student in your talk.
>> Kilian Weinberger: Do you have a --
>>: I think it almost reached like the upper [inaudible] bag of words.
>>: [indiscernible] as good as bag of words, but not quite?
>>: [indiscernible].
>>: So of the same mold, not significantly when you transfer it.
>>: No, I mean --
>> Kilian Weinberger: Actually, let's do a few more slides.
>>: [indiscernible] domain classification.
>>: [indiscernible] when it's [indiscernible] is better than bag of word.
>>: What he's asking is, in the same domain, is it classified --
>> Kilian Weinberger: That's right. So it has to be better.
>> Kilian Weinberger: So yeah, it makes sense that it has negative numbers,
okay? Good. So I think the most important thing, though, is to look at speed.
And so from now on, we're going to look at average results across all of these
different adaptation tasks. And when you average them, you actually use
transfer ratios, where you actually divide the two, because otherwise the small
ones are washed out by the large ones.
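For reference, here is a sketch of the two quantities as I understand them (notation is
mine): with e(S,T) the test error when training on source S and testing on target T, and
e_b(T,T) the in-domain baseline error,

    \text{transfer loss: } \; t(S,T) = e(S,T) - e_b(T,T),
    \qquad
    \text{transfer ratio: } \; r = \frac{1}{|\mathcal{P}|}
        \sum_{(S,T)\in\mathcal{P}} \frac{e(S,T)}{e_b(T,T)},

where \mathcal{P} is the set of source/target pairs being averaged over.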
So anyway, this transfer ratio -- it's the same thing that the [indiscernible]
group does. So here are the classification results. This is the transfer
ratio, so lower is better, and this here is the time, on a log scale. So one
thing you see is all these other baselines -- bag of words, SCL, and you have
CODA, and then SDA -- in some sense, like, over the years -- I don't have the
year when these were published, but basically it kind of goes in that order.
So the results got better and better. But also, the time that it took to train
these classifiers grew exponentially, right? SDA is now five hours on this
dataset. And what we managed to do is basically push the results of SDA, which
are really awesome, you know, to the left. We basically get the same results
as the SDA representation, but instead of five hours of training, it's a few
seconds. So there's like -- it was two minutes of training for five layers.
And so here's, just to illustrate this a little bit, what gets hallucinated.
So take a document that has only one word, only the word great; here's the kind
of document that gets generated from this. And these are the words, basically
in order of the strength with which it regenerates them. So we have great, is
great, highly, highly recommend, excellent, perfect, fantastic -- waste is here
too. So here's a bad one. Bad reconstructs dead, worst, sorry, please, the
worst, bad, hope, horrible, and so on.
>>: I have a question. The total [indiscernible] -- 20,000, 27,000?
>> Kilian Weinberger: I think that's right, yeah.
>>: So that's something -- the SDA [indiscernible] trained on 27,000 examples
[inaudible].
>> Kilian Weinberger: Yeah.
>>: This is what [indiscernible].
>>: They must use punch cards or something.
>>: 27,000 examples.
>> Kilian Weinberger: It would be corrupted over and over again, right? You
have many ranges.
>>: But still, if you [indiscernible].
>> Kilian Weinberger: Make it two hours. Compare it against two minutes. I
mean, even if it's an hour, it would be --
>>: Let's be fair.
>> Kilian Weinberger: Actually, we use their code.
>>: [inaudible]. [indiscernible].
>> Kilian Weinberger: Now we have --
>>: I agree that [indiscernible] would be faster.
>> Kilian Weinberger: Now we have a dataset with 340,000 data points. That's
actually a large dataset that John created back then and nobody really used
because it was too large for most algorithms. So we ran SDA over it and it
took two days. Make it six hours. We can do it in six to 20 minutes, depending
upon how many layers you have, and you get the same results.
>>: mSDA would be faster [indiscernible].
>> Kilian Weinberger: Yeah, I mean, so I don't know. There might be ways of
speeding it up. Actually, they have code that also puts it on a CUDA graphics
card. But that doesn't help in this case.
>>: [inaudible].
>> Kilian Weinberger: I don't know.
>>: [indiscernible].
>> Kilian Weinberger: I don't know. It might be.
>>: Do you use, when you --
>>: I just -- I read the [indiscernible] by a good margin. It's just that I'm
surprised by the SDA, because --
>> Kilian Weinberger: I'm sure --
>>: Like, it's maybe [indiscernible] times faster than [inaudible].
>> Kilian Weinberger: They don't have early stopping. If they do early
stopping, this way they can make it three times faster. But that's still, you
know, that's still --
>>: [inaudible].
>>: [indiscernible].
>>: How did you come up with this word token?
>> Kilian Weinberger: Basically, you take a document -- artificially create a
document that only has one entry in it, only one word. That's just this one
here, just bad. Then you run it through mSDA, and you look at what document you
get at the end -- what words, basically, it reconstructs. And so these are
basically the words that -- I put in bad, and the, you know, strongest word
that it reconstructs is dead. Those are bigrams. These are just bigrams. The
way John created the dataset in 2006 was with just bigrams. So we just used it
the same way he did.
>>: Do you have a trick for the inversion when you look at [indiscernible]?
>> Kilian Weinberger: Thank you. Yeah. So okay. So far, you know, I've kind
of swept something under the rug: [indiscernible] early on we have this matrix
inversion, and that's a D-cubed operation, where D is the number of words in my
vocabulary, right. So all of these results were actually done with 5,000
words, but that's the same as what the SDA paper uses. But that's a little bit
lame, right, because text documents usually have a much, much larger
vocabulary.
So what if you have very high dimensional data? So here's kind of what we do,
and it's based on an intuition. The intuition basically is: if you have a very
large vocabulary, let's say you use 100,000 words, then -- in the English
language -- you can get pretty far by just using five or
ten thousand words. So the other 90,000 words are in some sense words that
describe some concept that already exists in the first 10,000 words, most of
the time.
So you might have the word tasty, but instead you can be fancy and say
delicious. That's a rarer word, but it really says the same thing as tasty for
all intents and purposes in terms of our classification. So what do we do? We
basically take this very large vector -- we can't invert a 100,000 by 100,000
matrix, right? So we just randomly divide it up into chunks, and then we take
the 5,000 or whatever, 10,000 most common words, and then we learn an mSDA
transformation for each one of these chunks.
So what are we doing here? In some sense, we're trying to reconstruct common
words from rare words. So basically say, well, if there's delicious, try to
reconstruct tasty, for example. Or something like this. So we're translating,
you know, big words or something into common language. Yeah?
>>: So [indiscernible] is you actually are going to like a [indiscernible].
>> Kilian Weinberger: So in this case, it should be [indiscernible]. That's
one reason --
>>: No, no, if you want to go to like five-gram or six-gram --
>> Kilian Weinberger: It's a good question. I don't know. But I would
actually think that it might still work. But I don't know.
>>: [inaudible].
>> Kilian Weinberger: And no, it's not the same thing. It's not the same
thing. We can talk about the differences offline. So then what you do is you
can then do the stacking. Now we have a mapping to a low dimensional space,
and I didn't write out the algebra here, but it's straightforward. You
basically add up all these reconstructions into one reconstruction, and now
you can stack it just the way you did before.
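A rough sketch of that divide-and-reconstruct trick, based on my own reading of the
description above: the names X, K, d and nChunks are hypothetical, d is assumed to be
divisible by nChunks, and the marginalized corruption scaling (as in the earlier
mDA_sketch) is omitted within each chunk for brevity, so this is only the skeleton, not
the authors' implementation.

    [~, order] = sort(sum(X, 2), 'descend');             % rank words by frequency
    topK   = order(1:K);                                 % K most common words, e.g. K = 5000
    chunks = reshape(order(randperm(d)), [], nChunks);   % random split of the vocabulary
    H = zeros(K, size(X, 2));
    for c = 1:nChunks
        idx = chunks(:, c);                              % words in this chunk
        % map this chunk onto the K common words (corruption scaling omitted here)
        W = (X(topK, :) * X(idx, :)') / ...
            (X(idx, :) * X(idx, :)' + 1e-5 * eye(numel(idx)));
        H = H + W * X(idx, :);                           % add up the chunk reconstructions
    end
    H = tanh(H / nChunks);                               % subsequent layers use this K-dim output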
So now, actually, the subsequent layers are in the low dimensional space. And
so we tried this. Here's the dataset, the [indiscernible] set, and the 5,000
dimensions, 10,000, 20,000, 30,000 and I think this is 40,000. And this here
is the SDA curve. This is the result you get with the original stacked
denoising autoencoder. And basically, there's one very clear trend: as you increase
the dimensionality, the error goes down. So it really helps the error, and our
mSDA basically matches this very, very nicely.
So for every point here, there's a parallel point that's just matched much
faster. But it gets roughly the same accuracy. And so actually, at some
point, you start [indiscernible]. If you get 40,000, you just include really
rare words.
>>: Is the SDA on the larger vocabularies doing exact reconstructions?
>> Kilian Weinberger: It actually does -- it also does the mapping. This also
does a [indiscernible]; the hidden layers are actually then just
[indiscernible].
>>: But what I'm saying is, when the SDA tries to [indiscernible], does it
actually reconstruct every output, or do you just -- does it --
>> Kilian Weinberger: We do the same thing, in that we only reconstruct the
top M, whatever -- the top K most common words. So we don't reconstruct
everything in the SDA.
>>: Okay, but the SDA -- so I believe the group had a paper where, for very
large sparse binary autoencoder problems, they can approximate the gradient
very, very quickly. What I'm trying to ascertain is -- is it that or not that?
>> Kilian Weinberger: Actually, I don't know. We probably used exactly their
implementation. Okay.
>>: And basically, you have an optimizer which works in a specific setting,
which is this small dimensional target, and the [indiscernible] loss when you
use SDA the exact same condition, right, of where [indiscernible] you can
switch the [indiscernible] and you can use a large target space.
>> Kilian Weinberger: Yeah. Maybe that helps. I mean, still, yeah. In some
sense, you get locked into it. Because of our little tricks, right, that make
us efficient, we're locked into that formulation, right, whereas they can
change things around.
>>: Did you assess, like on the smaller vocabulary, where the inversion is
feasible, what you'd get if you actually had smaller targets?
>> Kilian Weinberger: Yes, and I don't have the graph. I don't. Maybe I do.
I can show it to you. I think it's at the end of the talk somewhere. So it's
very little. But yes, we did this. Okay. And so here's although someone, I
think Leon asked the question earlier, what if you're just in domain. And
that's basically just semi-supervised learning if you stay within the domain.
So we also did semi-supervised learning experiments.
So here's [indiscernible] and his writer's dataset. So he also compared
against SLI, which is basically PCA, and latent [indiscernible] allocation of
David Bly and TFIDF and so basically, here's the graph. The accuracy is higher
is better and this is as you increase the number of trained sets. mSDA nicely
on top. Here you can, you know, the benefit is more pronounced when you have
little trained data, which makes sense. As you keep getting larger at some
point, actually, here there's not that much benefit. Here some benefit, still
up to 7,000 data points.
>>:
By training, do you mean training for SDM or --
>> Kilian Weinberger:
Training for SDM.
>>: So the decision was to train only on the [indiscernible].
>> Kilian Weinberger: That's right. That's exactly right.
>>: And specifically, there are tricks you can use with LSI [indiscernible].
>> Kilian Weinberger: Yeah, so we didn't do that. But there are also a lot of
tricks you can do with mSDA. You can actually, for example -- you know, in the
optimization problem, you can actually also put in weights for the different --
you know, add weights to the different words, right. For example, the IDF
score. And we have some results on this, but actually, that's -- I don't know
what they are.
>>: The weight was just TF?
>> Kilian Weinberger: So in our case, it's just TF. So we just reconstructed
the TF score. But for the other algorithms and the comparison early on, we use
TFIDF or whatever worked best for those.
>>: [indiscernible].
>> Kilian Weinberger: We just used TF.
>>: So those documents basically get more weight in the optimization problem.
The [indiscernible] is weighted more for [indiscernible].
>> Kilian Weinberger: That's right. These cases don't vary too much.
>>: [inaudible] it's just the mean; the variance is the same, just shifted. So
I think it's -- see what I'm saying? So a target of one has an error. A target
of zero has an error. It's not like the target of zero has the same --
>>: Yeah, but there are more words.
>>: So there, the covariance --
>> Kilian Weinberger: By the way, I didn't mention it in the optimization --
we also have a constant term in our optimization, in our square loss. I don't
know if that matters for this.
>>: [indiscernible] much more interesting Gaussian noise.
>> Kilian Weinberger: That's exactly how --
>>: And the [indiscernible] if I think about it, it's nice, because if you
have the word that's very rare, it's just like if you have -- and you please
review this paper.
>>: Or he will.
>>: So that's the same realization. Once you start thinking about why this
works, does that mean that the regularization [inaudible].
>> Kilian Weinberger: Yeah, so that's -- I'll send you my --
>>: And I bet you $20 that if you did log TF in your reconstruction, Gaussians
would be better.
>> Kilian Weinberger: It might be. It might be worth trying out.
>>: So does mSDA, in this experiment, still have the original raw input, or
did you just use the mSDA representation?
>> Kilian Weinberger: Yes, we actually had the original representation and
the --
>>: Is that the same for the --
>> Kilian Weinberger: Yes, and we actually tried both for those. And all of
these did cross-validate all the parameters.
>>: So for the others, how did the original -- just curious.
>> Kilian Weinberger: I think it might have actually not. I think it didn't.
That's one of our --
>>: You have to choose the dimensionality -- there are choices.
>> Kilian Weinberger: That's right. We did cross-validate those. Okay.
That's actually the last slide. So in summary, I talked about the marginalized
stacked denoising autoencoder.
The marginalized stands for: we marginalize out the corruption and keep the
high, you know, high accuracy of the SDA features. It's really feature
generation, so it depends on what algorithm you use afterwards, but it creates
the nice features of SDA just much, much faster.
And that's because it's layer-wise convex and there's a layer-wise closed-form
solution. And, you know, one thing -- I hope people who work with
bag-of-words features might experiment with it. It's just really, really easy
to implement. Actually, Minmin is here, so you can ask her for the code. It
might lead to better results in different applications.
I also have a second part of the talk, but I feel like we've talked a lot and
there's been a lot of discussion. So maybe I'll just leave that and if someone
wants to talk to me, this is another ICML paper we have. It's about
cost-sensitive learning, where you assign costs to features. I don't know --
unless somebody really wants to hear it; I've already talked for an hour, so
maybe I'll skip it. Do you want to hear it?
>>: I want to ask a question.
>> Kilian Weinberger: Okay.
>>: So something back to your algorithm there, there's a couple things going
on. So you have this procedure to generate extra [indiscernible] data. Fairly
standard technique used in learning methods, like when you're doing image work,
for example.
>> Kilian Weinberger: Like virtual --
>>: Yeah, those kinds of hallucinated examples.
>> Kilian Weinberger: Uh-huh.
>>: So you're hallucinating examples and you also have the cost function. Do
you have any sense of what it would do if you sort of took your work -- you
hallucinate examples like you're doing -- and fed those, this dataset, through
LDA or LSI or whatever?
>> Kilian Weinberger: I don't know. It would probably be very slow. LDA and
LSI was very -- those would be [indiscernible] those results. Because those
are data of all these different settings.
>>: There are online versions of these things.
>> Kilian Weinberger: Yeah, okay.
>>: But everything is a covariance.
>> Kilian Weinberger: But LDA, you still have to -- even if it's online, it
still takes a lot, right. I don't know. So basically, you're saying I take
this noise model and use the other algorithms with the noise model, right?
>>: Yeah. You're sort of proposing two things: the cost function and this
hallucination process, and I wonder --
>> Kilian Weinberger: Well, the cost function is not in the -- I see what
you're saying. Yeah, I don't know, to be honest. So one thing we've tried,
and that's basically the paper, is we basically said let's just take the
noise model of the corruption and put it in a classifier.
So the SVM or something -- it basically sets up, what is the square loss? You
have [indiscernible] regularization; it's really just Gaussian noise you assume
there, as opposed to this corruption noise. And basically you derive the
update rules. That also helps -- like, for text documents, it's a lot better
than Gaussian noise or something.
So there's definitely some benefit from, you know, the noise model. But it's
actually generally the case that learning these representations layer by layer
and then sticking them into an SVM seems to be better than just putting the
noise model into your classifier. But we're just at the beginning of exploring
that in some sense. Yeah?
>>: How did you decide you would do [inaudible]?
>> Kilian Weinberger: That we had to use [indiscernible]. And that's the one
parameter that is sensitive. So with layers, in some sense, you know, it
improves if you make it deeper, and it just peters off at some point -- you
can't go too wrong. But the [indiscernible] actually depends on the dataset.
And so actually, one thing that was quite surprising was that we found on some
datasets that up to 90 percent noise was actually the best setting.
>>: Do you cross-validate based on reconstruction?
>> Kilian Weinberger: No, we actually just basically take the whole dataset,
apply mSDA on the training set, then train the SVM on it, you know, and then
test on the holdout set. So that takes a little while. But actually, it's
pretty fast, because the SVM only takes a few minutes.
>>: You mentioned that the best P is 90 percent. You are using the same
data, the same data producing the [indiscernible] for different layers. For
some, the best P is as you say.
>> Kilian Weinberger: [indiscernible] surprisingly high. Originally, my
intuition would have been that it should be pretty low. With 90 percent,
you're removing a lot of the document, right. Like you basically have very
little left. But because you're integrating out all possible corruptions, maybe you
get away with having pretty high P. It's not always 90 percent, though. Like,
you know, it did vary. So we plotted the curve, actually. And the curve is a
nice trend. It kind of looks like a bucket. But it depends on the dataset
where the best point is.
>>: [inaudible].
>> Kilian Weinberger: Sorry?
>>: Is it higher than the [indiscernible]?
>> Kilian Weinberger: That's a good question. I don't know. Minmin doesn't
know. So actually, yes, it actually makes perfect sense that it's lower,
because SDA can only go over so many iterations over the training dataset,
whereas we go over all the possible ones. With SDA, you keep corrupting your
training dataset, and every [indiscernible] you have a new corruption. So even
if you do a hundred [indiscernible], you only have a hundred corruptions of
your data.
>>: [inaudible].
>> Kilian Weinberger: Millions?
>>: Minutes.
>> Kilian Weinberger: Oh, minutes. Yeah.
>>: Did you look at the unbalanced cases, which is --
>> Kilian Weinberger: Unbalanced -- what do you mean?
>>: Unbalanced representation. So you have very few --
>> Kilian Weinberger: Oh, I don't think it mattered. It's one thing we
actually did. Because what we train is completely unsupervised, right? So the
label doesn't really matter.
>>: When it's unbalanced, you have to make sure that your representation
preserves the part of the data which is [inaudible].
>> Kilian Weinberger: Yeah, so I bet you that it still works there. So one
thing we actually did -- so there's this nice paper by [indiscernible] and John
Blitzer, who actually, you know, proved some nice bounds on domain adaptation,
and they have this way of saying when domain adaptation works, and their claim
is, well, you know, the target and the source have to be
similar. And the way they measure similarity between target and source is by
saying, well, we train a classifier to distinguish between the two.
And if you can't really do this very well, if you can't distinguish between
electronics and DVDs, then they are very similar and therefore they help each
other. And so one thing we showed actually is that when you run the mSDA
representation, then actually you're better at transfer, but you're also better
at separating between the two classes. So it's actually, you know, actually
helps in both cases. Like these are very different classification tasks,
right. And it just seems to be a better representation.
>>: [inaudible].
>> Kilian Weinberger: Let me know. Maybe I'll switch to the very last slide
and just thank my co-authors. So here, I just want to thank Eddie, Minmin and
Fei, who will be here, and Olivier Chapelle, actually, who helped with this.
And -- questions? Here's the graph that I promised, the graph that you asked
about earlier. If you take actually this 16,000-word dataset -- it's not so
high -- here basically we vary the dimensionality, and this is the computation
time. From 15,000 to 16,000 you see there's a big gap, because that's the
cubic scalability of the inverse, so if you go down to 10,000, there's a very
small hit in accuracy but actually a drastic reduction in time. At 5,000, it's
noticeable. At some point you notice it because you remove some concepts.
>>: That was my concern with the unbalanced things -- like, if you have an
unbalanced problem, probably your task will be looking at few words.
>> Kilian Weinberger: But 10,000 words, you can say a lot in 10,000 words,
right? Make it 15,000.
>>: Yeah.
>> Kilian Weinberger: So that's --
>>: [inaudible].
>> Kilian Weinberger: Any more questions? All right. Thank you.