>> Silviu-Petru Cucerzan: It's a pleasure to have John Blitzer with us today and
tomorrow and on Friday also. And for those who don't know John well, I should
say that he did his undergrad work at Cornell with Claire Cardie and Lillian Lee,
and then he got his Ph.D. at U Penn under the supervision of Fernando Pereira,
and now he's a post-doctoral fellow at UC Berkeley, working with Dan Klein. So
huge names in the field, right?
And in between U Penn and Berkeley, he went away from pure academic work
for a little while and worked at MSRA, so he has already worked at Microsoft. And I
should say his main interests lie in the area of machine learning applied to
natural language, and he did a lot of work on finding compact models, semantic
and syntactic, to represent phenomena in language, and he'll talk today more
about that.
So his talk is on natural language processing in multiple domains: relating the
unknown to the known.
>> John Blitzer: Okay. So thank you, Silviu. So I'm going to talk a little bit about,
kind of touch on, a few of the applications of this idea of relating the
unknown to the known.
And I hope that -- I'll touch on, again, at the very end, kind of other ideas that just
couldn't make it into this talk that I hope people who are interested in it will talk to
me about afterward.
So I actually want to start out by talking about, kind of machine learning for NLP
or statistical NLP in what I'll call the standard or single domain setting. So in this
setting, we build our models by training them from a corpus of data, so in this
case I've given -- this is an article from the Wall Street Journal, and we asked a
teacher to go through and label a whole document or pieces of a document for
us. So this can be anything from annotating a particular sentence with its
syntactic structure to, you know, maybe if the teacher is actually a user, clicking
on ads that might be, you know, telling us which ads are relevant for this
particular article.
And it might be something extremely complicated with a lot of structure like, you
know, taking a sentence in English and giving its Chinese translation.
And in all these cases, our goal here is to build from this an NLP system, which is
essentially mimicking the actions of the teacher, right. So in particular, after
we've looked at a bunch of examples, we can take as input now another
document, in this case again from the Wall Street Journal, ask our NLP -- feed it
to our NLP system and get out a bunch of predictions. So, you know, just to use
the first example, you know, we could ask our system to tell us what the best
syntactic analysis for this particular sentence would be.
And so I won't go so far as to call this a solved problem, but it's typically very well
understood in NLP. And one of the, you know, fundamental reasons for this is
that, you know, we can see that training and prediction are from the same
distribution. Because of this, empirical process theory actually gives us strong
guarantees and tells us, you know, that the more sentences from the
Wall Street Journal we see with their parses, if we build a model from that, we
expect to do better and better, the more that we see.
And, in fact, this is true empirically as well. So if you look at syntactic parsers
built on the Wall Street Journal, if you give them another similar Wall Street
Journal sentence, they tend to do extremely well. That's not what I'm talking
about today. What I'm going to talk about is what I'll call the multi-domain setting.
In this setting, the set up at training time is quite similar. Now I've labeled the
Wall Street Journal here what I've called a source domain, and the idea now is
that we want to go through and in a new target domain apply our model.
One possible scenario is, you know, I'm reading the MSDN forums and I want to
know when a particular question is answered. Well, it might be really helpful to
parse the sentences in those forums, but, of course, they look nothing like Wall
Street Journal text, right? So I run them through my parser and I get out a bunch
of predictions, and, you know, I can hope for the best. But of course, all the nice
things that I said in the previous slide are not true anymore, right. So for
example, you know, now I come to MSDN forums and people are talking about,
you know, sequel queries and race conditions and that just doesn't happen in
financial news, right.
And because of that, standard empirical process theory now doesn't have
anything to say about this case, right, because the distribution has changed, and
kind of in the limit, I'm taking a sample from a completely different distribution
and asking you to do well, and we have no more guarantees that we would do
well. And, in fact, this is true, you know,
state of the art models really tend to break down here, sometimes more than
doubling in error.
So just to -- I'm going to start by giving you guys two examples, at a high level, of
the two problems that I'm going to talk about today that illustrate this. So the first
is motivated from what I'll call sentiment classification. And the idea here is that
we get a review of a particular product. In this case, this is a product on Amazon.
This is a review of a book, and it reads, this book was horrible. I read half,
suffering from a headache the entire time, and eventually I lit it on fire. One less
copy in the world; don't waste your money. I wish I had the time I spent reading this
book back. It wasted my life.
So our goal here as machine learners is to take as input this document and
output either positive or unfortunately for this book, I won't keep you guys in
suspense, this is actually a negative review, and the crucial idea here is just that
if we've seen a lot of examples, you know, if we have a teacher who goes
through and tells us, you know, oh, this review is positive, this review is negative,
we can actually do very well on book reviews.
But now, you know, books aren't the only thing that's sold, even on Amazon,
and they're certainly not the only product or service that we'd like to try to
understand. And if I actually try to build a model and say, oh, well, I'd
like to now do a good job at reviews of other types of products. In this case, this
is a review of a deep fryer. Amazon also sells deep fryers.
And this reads, I love the way the Tefal deep fryer cooks. However, I am
returning my second one due to a defective lid closure. The lid may close
initially, but after a few uses, it no longer stays closed. I won't be buying this one
again. And the basic idea here is that we haven't seen any outputs from a
teacher on kitchen appliances -- we may have many kitchen appliance reviews
that we'd like to label as positive or negative, but our teacher hasn't gone through
and told us for any of them whether or not they are.
We'd like to be able to generalize, but in practice, there is a huge increase in
error when training on reviews of books and testing on kitchen appliances. In
fact, the error doubles here.
And so this particular setup is something I worked on with Mark Dredze and
Fernando Pereira as part of my Ph.D. thesis.
>>: John?
>> John Blitzer: Yeah.
>>: Human performance on this task?
>> John Blitzer: Human performance? Well, that's hard to say exactly. So all of
this is crawled from Amazon, and, you know, you can't say for sure. You know
what the star -- so we're basing this on the stars the reviewer himself gave.
So -- and we have a little bit of inter-annotator agreement, but the inter-annotator
agreement is basically just me and Mark looking at a review and trying to decide. So that
seems to be 90 and above for all of the separate domains. Actually, it's a little
better outside of books. Books tend to have higher variance.
But again, like, I wouldn't really call this a strict inter-annotator
experiment. That's very off the cuff, right. We really just have the stars and
we're trying to, you know, look at a review and
reproduce whether or not it got, let's say, five stars, right.
Okay. So the second problem that I'm going to talk to you about is motivated
from web search across multiple languages. So if I do a search on a Chinese
search engine for xiaonei, which is actually the top networking site in China, of
course, I get -- I actually get back very good results. So the first is -- so this is a
kind of navigational query. That's not the right button, okay.
So there -- in practice, this is exactly the right link and, you know, the second hit
is actually a mobile version of this site, and it's also, you know, also a great result
for this query.
But there are many -- so this is a common one of the, you know, one of the top
queries you see in any Chinese search engine. But there are many queries, like
this one, this is [speaking Chinese], which is the Chinese translation of
salmonella. And the top two results are okay, right. So the first one is kind of the
Chinese version of Wikipedia, and the second is this community question/answer
forum, where someone asked what is salmonella and someone responded and
basically gave a description of the disease.
But really, if you think about what you might be looking for at a high level when you
type in salmonella, it's kind of, well, respected sites, government sites. So there is
a CDC in China, and they do have a website on salmonella. It's not here. You
know, or you might want, you know, news about salmonella outbreaks. None of
that's there.
And the key insight, though -- this is something I worked on at MSRA with Wei
Gao, who is an intern there, Ming Zhou and Kam-Fai Wong. And they -- so the
key insight, though, here -- so actually, there is, there is a significant loss in
ranking performance here, and we'll get to that later on.
The key insight, though, here is that in English, actually if I search for salmonella,
I kind of get exactly what I'm hoping for. The top hit is the CDC and then there
are some news results. And the basic, the basic idea here though is that these
results which actually are kind of low, if you look at purely Chinese -- the Chinese
ranker, actually are quite high if you were able to know that these queries were
equivalent and ranked purely based on the English ranker, the English results.
And, I mean, just to give a high-level sense of certain features, the reason for this
disparity is, one, just that, you know, in particular for a search engine like Bing,
just a lot more effort has gone into the English ranker, right. But there are other
things too, like for example click-through data just is less meaningful in Chinese,
because there's less click-through.
And there are, you know, static ranking features like PageRank that are not as
predictive in Chinese as they are in English. So features
which may have been reliable in English directly are no longer as reliable in
Chinese.
And the rankers kind of suffer because of that. And that's kind of one, one
problem we're going to try to overcome by exploiting a ranker that we already
know, right. The English ranker we know is good.
Okay. So as you might have guessed, the talk is going to break down into two parts.
The first part is going to be about relating known and unknown features, and for this
case we're going to be looking at building a shared
representation across different products and reviews. The second is going to be
about known and unknown documents. So in this case, I may have a review -- a
query in Chinese which I don't know how to do good ranking for, but I have a
corresponding query in English. Again, I'm going to try to exploit cross-lingual
structure to do a better job ranking in Chinese.
Okay. So the first thing I want to mention, though, is that in order to build the
shared representation, I want to note first what's different. So I've highlighted
here in blue and in red words that are pretty predictive, but are kind of unique to
a particular domain.
So in this case, you know, I can read half of a book, and I know that's negative,
but I'm not going to read half of a kitchen appliance. I'm not going to read half of
a deep fryer. Similarly, deep fryers, when they don't work, I don't like them and if
I return them, I probably don't like it either. But I'm not going to say, you know
this book is defective. Just doesn't work. And I'm returning it, right. That's not
what you say to be negative about books.
And the idea here is just that these unique words, I'm going to look for a
representation, I'll discuss in detail what that representation looks like in a
minute, that maps returning and lit it on fire to similar areas of a low dimensional
space, and I'm going to try to exploit that to do better sentiment prediction.
Okay. So I want to begin with a brief interlude and actually this is probably not
necessary for this audience, but at least to get us on the same page notationally,
the kinds of models that I'm going to be looking at here are what's called
conditional exponential family models. Basically, the idea is that each review,
I'm going to score as the dot product of a feature vector and a weight vector.
Now, the feature vector in this case is going to be very simple. It's going to be
just the bag of words, bigrams, and trigrams. What this means is that while I might
have -- so it's going to be very high dimensional. Each dimension in my feature
vector corresponds to a single word or bigram so there will be millions of
dimensions, but any particular document will be quite sparse.
So, for example, if the word excellent occurs three times, then I give it a three in
my feature vector. Great would get a value of one in my feature vector if it
occurs once. Fascinating, two if it occurs twice.
Similarly, the parameter vector is also high dimensional, and each entry in this
parameter vector basically corresponds to the propensity of a particular word or
bigram to indicate positivity or negativity. So, for example, I might say that, you
know, excellent has a weight of one. You know, great has a weight of two and so
on. And taking the dot product of them gives me roughly a score that will indicate
whether or not this is positive or negative.
So in this case, looking at the dot product, it's three plus two plus 2.4, which is
7.4, and I say okay, this is a positive review.
And the only thing that I want to actually bring up, though, is that in terms of this
particular model, this linear model's paradigm, the thing that we're concerned
with is words that have zeroes in our parameter vector, right. So if I build a
model on books and I come to a word that's, you know, like sturdy, I've never
seen sturdy used to describe a book before. Then I don't really know whether
this is positive or negative, although, you know, as humans, kind of we know that
sturdy is a positive word for kitchen appliances. Yeah?
>>: Wouldn't you think that once in a while someone might say "this sturdy prose
reminds me of --"? Do you have that many real zeroes? [inaudible] you might, where
words would come up like sturdy or reliable or, you know, on fire. Do you think you
might have the word show up, but just very rarely, in a different context?
>> John Blitzer: Yeah, I think that's a fair characterization. There's always,
there's kind of, in language, always these heavy-tail phenomena. In
particular in reviews, people tend to take creative license so you're actually -- I
mean, Amazon has a lot of reviews. So, you know, hundreds of thousands and
there are still going to be the kind of hapax legomena, words that appear
only once that actually are still there.
But you're right that, like, you know, words like sturdy are probably not a fair
characterization of things that are unique. But even low-frequency things are
actually not -- it's not exactly clear, once you have a million, how to attribute
particular weight to something that's seen only once or only twice, all right? So
okay.
So all right, so let's back up for a second and see exactly what the set-up is. So I
have words that are, like fascinating and boring that are unique to books and
defective and sturdy that are unique or almost unique to kitchen appliances. And
what I basically -- the thing that I want to point out, though, is that I also have
words and phrases that are shared across the two domains.
So I can say that a book is fantastic and a blender is fantastic, and both of these
are ways of expressing positivity about books and blenders. Similarly, I can say
a book is a waste of money, and, you know, a deep fryer is a waste of money
and both of these can be used to describe books and deep fryers.
So the representation I'm going to focus on is a real valued representation and
what I'm going to do is use these words, these purple words here, to link up
words that are unique to each domain. So the idea is that a word like fascinating
can be linked to a word like sturdy via the bigram highly recommended.
Similarly, a word like boring can be linked to the word defective via the phrase
waste of money.
And the idea is that we're going to use these to map these other domain unique
words to similar areas of the low dimensional space.
Okay. So this, again, this is part of my thesis and I call these pivot words or
more generally, across all problems, pivot features. And this is again work that I
did with Fernando Pereira and other grad students at Penn.
Okay. So how are we actually going to exploit these pivots? One thing we can
say is that, well, actually if we've seen a pivot before, you know, I come to a
document where I see the phrase, do not buy. And even if it's a kitchen
appliance, I know that I've seen do not buy this book, and I can still do a good job
here, right?
Similarly, if I've seen an absolutely great purchase, well, this indicates that this
should be positive. So I don't actually need to exploit any unique information
here when I have the pivot words. It's the cases where someone says oh, this is
a sturdy deep fryer, but I've never seen the word sturdy before in book reviews.
And what I'm actually going to try to do here is, if I expand a particular
review, I can see that actually I get things like: do not buy the shark portable
steamer. The trigger mechanism is defective. Similarly, if I see an absolutely
great purchase, this blender is incredibly sturdy.
So I notice that these particular domain specific words are actually -- do actually
co-occur with the unique words.
So in particular here, how am I going to try to exploit this? Well, one thing I can
do is say I'd like to predict the presence of a particular pivot. So I'd say, I want to
predict whether or not great appears in this document using all the other words in
this document as context. All right?
So in particular, I've written here another exponential family model. The feature
vector is the same. Of course, I'm going to delete the word great from the
feature vector. For the weight vector, now, I've replaced theta with this W parameter
here, and W is basically unique to each pivot.
So in particular, I have a separate weight I want to predict separately for each
pivot, whether or not it appears in this document. And the thing to notice is that,
well, if sturdy appears a lot of times with great, then it's going to get high weight,
kind of automatically at an intuitive level. And because of this, I should be able to
say, you know, well, kind of the high-level intuition is: sturdy co-occurs with great,
and great is a good positive predictor, so maybe sturdy
should be as well.
The last thing I want to say is I showed you kind of five example pivots. In fact,
we can construct these automatically from unlabeled books and kitchen
appliance reviews. Yeah?
>>: So the phrase "not buy" contains "buy" and the phrase "does not work well"
contains "work well". So how do you deal with that?
>> John Blitzer: You're right that you have to be careful when constructing these
predictors. And I want to go through exactly how we construct the final
representation and then maybe come back to that question afterward. Yeah,
Chris?
>>: So [inaudible] never co-occurs with one of those pivots.
>> John Blitzer: Yeah.
>>: [indiscernible].
>> John Blitzer: Yeah, you are right. And that's why, in particular, that's why
you -- it's kind of always better to use more and more pivots, right. The more you
can get your hands on, the better. On the other hand, there are situations, and I
won't talk about it here, but there are situations where you can imagine kind of
completely disjoint areas of feature space, right, where, you know, there is just
nothing you can do, right. Maybe there's a set of kitchen appliances where no
one ever uses anything in common with books, right. And in practice, kind of this
is something that at least, at least in theory happens. Now, I've never seen it
empirically, but, you know, there certainly is the case that you might just be, you
know, you might just not be able to do perfectly on kitchen appliance reviews.
>>: [inaudible] feel for the fraction of time this occurs?
>> John Blitzer: So in our experience, it -- let's see. The experiments are in my
thesis. But it's certainly below five percent of the instances, right. And the idea is
just that like, you know, if you construct enough of these pivots, you might still get
the review wrong, right. So these are kind of cartoon pictures and there are
many reasons why you could get a review wrong. But in general, co-occurrence
does help; if you select enough pivots, you know, five thousand or ten
thousand, eventually you can saturate almost all of this space. So okay. So
yeah?
>>: Why do you need this notion of pivot if you can just look at the conditional
distribution. Say conditioned on the class and just look at the entire dictionary
and then you automatically just have a notion of pivotness for a given word and
you don't have to have a cut-off for certain things being pivots or not pivots.
>> John Blitzer: Yeah, that's right. It's mainly for computational reasons. You'll
see I want to train a bunch of pivot predictors, and it's easier to train 5,000 than 5
million.
>>: [inaudible].
>> John Blitzer: Yeah, and in fact that's the way we actually -- that's a good
question and I'll answer it. That's the way we actually do it. You can train, for
example, an L-1 regularized predictor. You get out a sparse predictor. And this
sparse predictor, you take the active features and use those as pivots, right. So
that's one example that tends to work well.
Okay. So I've trained up these predictors and maybe there are, you know, 5,000 of
them. And I can write down a matrix here, big W, where each column in
the matrix corresponds to a single predictor. So for example, maybe this column
is the predictor for highly recommended, right? Whether or not highly
recommended appears.
And the reason I want to write it down this way in particular is that I can actually
predict, using this weight matrix, the presence of all the pivots in a particular
document, right. So I see a document, I want to predict what's the chance that
highly recommended appears, that horrible appears, that great appears, and so
on for all of the different pivots, right.
And at a high level, we're almost done here, right. So if I have 5,000 pivots, I can
generate a representation which is essentially 5,000 new features, each of which
is a prediction about something that I know is shared with my source domain.
And because of that, this is kind of -- this in particular almost answers our
question, right. And the reason I say it doesn't is that, actually, related to your
question, a lot of the predictors are going to capture kind of information that we
don't quite want, right -- the unwelcome, non-sentiment information.
So one example, you know, not buy is a good example. But here's another one.
You could say, well, I've written a kind of cartoon picture here, where each axis
corresponds to a single pivot and each of these points is kind of the weight of a
feature for that particular -- in that particular pivot's predictor, right?
So one of the high weight features for highly recommended will be I. When
someone highly recommends something, they usually say I highly recommend
this book, right? But the word I has nothing to do with sentiment in particular,
right? It's a purely syntactic phenomenon. And what I want to be able to do here
is now distill from this the correct representation for actually predicting sentiment,
right?
The idea is there are words like I, and there are maybe a lot of words like I, but
there are still some like wonderful that are predictive of both highly recommend
and great. So what I'm going to look for is a sub-space of best fit to the space
whose columns -- the space which is spanned by the columns of the matrix W.
All right. So the idea is that this is kind of the best low-dimensional sub-space in
terms of the error to the full pivot predictor space. And you can think of this as
almost a kind of de-noising, right, where I want -- the sub-space will capture what
I want from sentiment, and kind of the syntactic phenomena, which aren't shared
across many pivots, will not be in the top eigenvectors of W -- well, of W W
transposed -- the top singular vectors of W.
So this is, in practice, what we do. This psi here is actually just the projection
onto the top left singular vectors of W, or the top eigenvectors of W W transposed,
right? And this is, kind of in a squared-loss sense, the best sub-space for the
space of pivot predictors, or the space of natural parameters for all of these
different conditional exponential families.
>>: Because W knows nothing about [inaudible]. I mean, I could be creating a
classifier on, I don't know, you know, overuse of long vocabulary words, right?
>> John Blitzer: That's right. So W -- remember, though, that it's not quite
true that W knows nothing about sentiment, because we select the pivots in a
particular way. Right. So I guess, as Misha asked, like, one question might
be -- I want to predict, you know, I trained a classifier to predict on the source
domain, right. For books, I know what I'm looking for, right, and I can use that to
bias what I consider to be a pivot. Right?
And, in fact, I bias it in two ways. This is actually crucial for the structure of
W. One way is I bias it to be predictive of the target classification I'm trying to do,
which is sentiment, and I can do that by looking at the source label data. The
other is I make sure that the pivots are shared and I can do that by just looking at
data from both domains and say, well, you're not a pivot unless you occur in both
source and target documents, right?
But your question actually is right. So that's intuitively why it's true. It turns out
that in order to kind of prove that this will work, that a method like this will work,
there are a lot of subtleties in kind of the structure of W.
So, I mean, it's actually a problem that I'm working on right now, but I think, like,
in order to kind of characterize when this will work, there are actually a lot of
subtleties in the structure of the distribution and how W is constructed. So okay.
So psi actually is our low-dimensional representation, and in particular, here, I'm
looking at psi times x for a particular input document x. And the thing to
notice is that this actually does map from the high-dimensional feature space into
a low-dimensional shared space, precisely because we force W to have that
structure. And these are the top left singular vectors of W. And so the only thing now
left to tell you is how I train my final model, right?
So before, I was constructing my features from words and bigrams. Now I'm
going to construct them from the projection of this document onto the low
dimensional shared sub-space. And basically, the idea here is that by
constructing features on the projection, I'm going to have something that
generalizes across domains.
All right. So I want to show you guys briefly some results. So here I lied -- I
actually have more than two domains. I kind of crawled all the different categories of
Amazon, and we have reviews from several different kinds of products. So what
I'm showing you here is what's the accuracy on reviews of kitchen appliances
when I train on reviews from all these other separate domains.
So the first gold bar here is kind of our gold standard. If we actually had a lot of
kitchen appliance reviews to train on, how well could we do. And in this case, so
this is 87.7. If we now train up a support vector machine just looking at each of
these domains separately, we get the following set of predictions. And the only
thing that's interesting here is just that, you know, electronics reviews tend to use
a lot of the same language as reviews of kitchen appliances so you can do a lot
better training on those.
The last piece, the last set of results here is, you know, the method I just
described, where instead of training on the high-dimensional unigram and
bigram feature vector, I train on its projection onto the shared sub-space and, in fact,
you do see a big improvement, even for electronics, but certainly for DVDs and
books.
And in general, if you look at kind of all the pairs of domains across all the data
that I have, you can close this gap -- between the green bars, sort of plain
adaptation, and the gold bars, what you could do if you had in-domain data -- by
about 36% using this technique. Mm-hmm?
>>: This may be the next slide. Did you try combining the other, the books and
the electronics, the DVDs?
>> John Blitzer: I did try that.
>>: See if you could somehow improve the results [inaudible] because you get
more data.
>> John Blitzer: Right. So I did try that, I don't have a slide for it, and you're right
that you can. Of course, there is kind of a ceiling -- so to think of it coarsely
in terms of bias and variance, the predictor that you learn for books is
biased with respect to the Bayes optimal predictor for kitchen.
So if I saw an infinite number of book reviews, I still won't be able to predict
kitchen appliances as accurately. And there is kind of a point after
which -- like, after I see some number of kitchen appliance reviews, you
just can't do any better no matter how many book reviews you add.
>>: The question is do you already have [inaudible] --
>> John Blitzer: No.
>>: Are so small [inaudible].
>> John Blitzer: If I crawled now, I might. At the time, which is two years ago,
Amazon just didn't have very many kitchen appliance reviews, and you could
still do better, because there are literally -- like, even in 2007, I'm sure there are
more now, we crawled millions of book reviews. For kitchen appliance reviews,
we only had like 10,000. You still run into this. Yeah.
>>: [inaudible] binary.
>> John Blitzer: Yeah.
>>: Five stars?
>> John Blitzer: That's right.
>>: How did you make that split.
>> John Blitzer: In this experiment, we threw out threes. So it turns out you can
actually treat this as a regression problem, either kind of an ordinal
regression or actually, you know, considering these to be real-valued
predictions. And the same sort of projection into a low-dimensional sub-space
also works well for these.
>>: This is following up a little bit on Robert's question. I think what you can also
show, though, is that in practice if you just combine more and more domains and
you have one held-out domain, you know, the more domains you combine, the
better you get on the held out-domain until you've pretty much reached pretty
much the same results or similar results. So if you throw in kitchen appliances, if
you throw in all electronic data you have, the book data you have, throw in the
movie data seat from Pang and Lee and the DVDs, you know, you actually, I
mean, you get close to -- it's a brute force approach. Let's use all the data that
we have from all the domains that we have. And if we have the luxury of having
an assortment of domains where we have [inaudible] data, which is not the case
in all domain adaptation problems, but in that case you can actually do fairly well.
>> John Blitzer: Yeah, I mean, I guess like that depends again on the structure
of the particular domains, right. So as long as, I mean, so there actually are
theoretical results that if, you know, let's say kitchen is kind of a mixture of
electronics and DVDs or something. Then you can do perfectly, right. But if
there's any sort of unique part of kitchen appliances, then you're always going to
miss something.
So yeah, you're right. And kind of it depends on the structure of these domains.
Yeah?
>>: [inaudible] how do you quantify the distributional difference?
>> John Blitzer: Yeah, let me do my last slide of this section and then I'll tell you.
Okay. So the last thing I want to give is some intuition for this low dimensional
sub-space. And what I'm showing here is in the top left, words that are unique to
books, but are negative under this single projection. So these are things like
number of pages, if you -- so when I say projection, sorry, I'm showing you one
row of the matrix psi, which is kind of a linear projection from the space of
features onto the real line.
And here, what I'm showing is so if I mention the plot of the book, I probably don't
like it. Nobody likes books that are predictable. If I say it had, you know, 586
pages, it probably means I'm, you know, starting a diatribe about how long and
boring it was.
Similarly, if I -- for kitchen appliances, you know, these are words that don't occur
at all in books. So if I didn't train on kitchen appliances, I wouldn't be able to get
them. But they still are negative under this projection, words like the plastic or
poorly designed, and these are words that are unique to kitchen appliances.
So here, positive words. Fascinating, engaging, must-read. A little
embarrassingly, the most positive unique word for books is Grisham. And people
just tend to love John Grisham novels on Amazon.
>>: Maybe he should make appliances.
>> John Blitzer: Well, okay. Actually, for appliances, you see things like I've
been using this for years now. This deep fryer is a breeze. Espresso. Espresso
turns out to be the John Grisham of kitchen appliances, basically. Everyone just
likes espresso machines, and when people write about espresso machines they
tend to give them high reviews. I guess if you paid a lot for an espresso machine,
it must be good.
>>: The point about words that didn't show up at all -- wouldn't there still be some
things where the meaning tends to change, or the word --
>> John Blitzer: Okay.
>>: One or the other. Even things like number of pages. Maybe I like lots of
pages when it comes to the manuals for my electronic appliances and I hate the
fact -- if I mention that, it's a good thing.
>> John Blitzer: If I could tease out the separate questions: the first was your
question of whether there is ever a word which is truly unique. And so in this
case, like, for the data set we have, these are. Like the bigram poorly designed
isn't in the books domain. That's not to say now if I crawled Amazon I wouldn't
see the character was poorly designed or something like that. So I agree with
you that, you know, there's always this question. On the other hand, you know,
the more data I see, kind of the more bigrams I have, and you can imagine a
non-parametric version of this where I basically increase the length of the
n-gram with the amount of data I have.
>>: Maybe the opposite. There are many words that weren't unique, that already
have shown up, but the meaning of them wasn't properly incorporated in your --
>> John Blitzer: So you're right. This does happen. You're absolutely right. It
may not be something like number of pages. Cell phones, it's really nice to be
small, but it's not good for hotel rooms to be small. And that's -- so in general,
that turns out to be extremely hard to deal with, sort of in the most general
setting. If I don't have any labeled data from kitchen appliance reviews, and I
have kind of arbitrary polarity switching of features, then I'm basically hosed.
There's nothing I can do.
>>: [inaudible] three or four times book reviews, and they just happened to all be
negative or positive, but that's not very much evidence. You notice, gosh,
espresso shows up a lot in this new corpus and I only had a few bits of
information. Seems like this hard edge thing doesn't show up -- seems fragile.
>> John Blitzer: That's one reason I'm trying to separate the qualitative and
quantitative results. I don't actually exploit the hard edge thing in the quantitative
section. These are the classification performance and it doesn't matter whether
or not a word showed up zero, one, ten, you know, 500 times. That's just what
the results are. But you're right. For these qualitative results, all these things are
unique.
But you're right, that phenomenon is true. And, you know, it doesn't seem to
affect us too much empirically here, but that could -- I could certainly envision
places where it would. Yeah?
>>: If you go back a page, these results might [inaudible] purely dimensionality
reduction.
>> John Blitzer: Yeah.
>>: What happens if you do LSI with the same dimensions of V? Do you get any
sort of comparable lift?
>> John Blitzer: You get lift, you get lift, absolutely. But you don't get
comparable lift. It's sort of, again, that -- either LSI or PLSI or some variant of
that was in my thesis. But in general, kind of if you don't do -- if you don't
somehow control the structure of W or, in particular, if you use the, just the
instances directly, you get kind of, you know, halfway between these two.
Yeah?
>>: You [inaudible] refer to [indiscernible].
>> John Blitzer: Uh-huh.
>>: [inaudible] reviews tended to be more positive than negative. Have you tried
training across [inaudible] domains, like books versus cars?
>> John Blitzer: No, I haven't tried that. That's interesting. I actually, that really
was a passing reference. I actually don't know -- I don't have any statistics on
that phenomenon. That's interesting.
>>: [inaudible].
>> John Blitzer: Yeah.
>>: I mean, I think it's a great point.
>> John Blitzer: Yeah, I couldn't say. I mean, it seems intuitive that that would
be true. And, of course, again, without knowing a priori kind of what -- just to
answer Misha's question at a high level, kind of the
constraints you need on the structure of your distribution fall into what's often
called covariate shift, which basically means that -- if you think
about a joint distribution on X and Y and look at the performance of a conditional
model, Y given X -- you basically assume that the conditional of Y given X is the
same across domains.
So once you start playing with that, you know, you really need some extra
information beyond just, you know, I can see some unlabeled data, right.
Because you can think of kind of an adversarial setting where you get to do
whatever you want on the books domain and with as much unlabeled kitchen
appliance data as you have, and then I get to look at your predictor and
change, change the output of my classifier, right? There's nothing you can do.
So but, I mean, like if you know kind of some relationship, you can constrain your
model using that. Yeah?
>>: So I'm wondering how the sub-space psi changes according to the rating. I
guess this is kind of a follow-up to my last question and Robert's last question, the
changing of the meaning of words. For example, the bigram not work contains
the unigram work. So if you have both of these on your axes, then for a
negative review, you will see both, right?
>> John Blitzer: Yeah.
>>: So on the axis, if you were to just find the sub-space on the negative reviews,
you wouldn't see the X equals Y line, whereas if it were a positive review, you
would just see work, where you would get this Y equals zero line. But if you
kind of put those two together and find the sub-space that tries to, you know, like
work for both kinds of reviews and you find something in the middle, and I
wonder, I mean, on the one hand, sub-space projection kind of gets rid of noise
because it projects things. On the other hand, it also blurs the differences. And
I'm wondering what's your insight on the effect of putting -- learning the same
sub-space for all different scores, which is what you're doing.
>> John Blitzer: Yes, you're absolutely right.
>>: Across domains, but for all scores.
>> John Blitzer: For all scores.
>>: You're assuming that the sub-space is the same.
>> John Blitzer: That's right, yeah. I guess I'll answer that in two ways. First,
this is a good question. In detail, we actually do handle that case explicitly. We
look at the bigram and we don't allow you to use its unigrams to predict it. So
for that particular case -- and that actually does make a difference. You can still do
well without it, but it does make a difference in the final performance.
The second is that kind of, you know, for all these -- we can't deal with everything
that way, right. So you can deal with that, but you can't really deal with I and
highly recommend, for example, right, because there's sort of certain things that
you -- that are just difficult to deal with in a general -- you always expect some
correlation. And that's kind of where the projection helps.
Now, for the washing out, remember that we learn -- the idea is that we learn
separate weights for each of the projection, you know, each dimension of the
projection, right. So it's linear in the projected sub-space. But that means that,
you know, there are many, many dimensions which actually aren't predictive at
all. I showed you one that is. But there are plenty which are just noise or which
distinguish between topics in the books domain -- religious books versus sci-fi
books, for example, is one. And that's not useful for sentiment, but that's okay
because as long as there's continuity across domains, we can learn that from just
looking at labeled books data.
You look at a books domain, you see oh, well, this thing isn't very predictive, and
therefore I just don't assign it any weight in my predictor, right.
>>: [inaudible].
>> John Blitzer: Yeah.
>>: You said you have a feature for don't work and another feature for work and
you don't populate those.
>> John Blitzer: Yeah, so suppose --
>>: [inaudible].
>> John Blitzer: It solves that problem, yeah, but there are other kind of subtle
problems where, you know, just kind of general syntactic phenomena, right. Like
oh, you know, I guess I highly recommend is something where, you know, we
might not have the trigram I highly recommend in there, but, you know, you still
don't -- you really would wish that you could say, oh, well, never predict highly
recommend using I.
>>: I don't understand that. It seems like it turns out that people using the
personal pronoun giving positive reviews [inaudible] but you're discovering that
fact.
>> John Blitzer: Yeah, that's true. And it turns out for that one that, you know,
I can also say, you know, I hate this. And so I actually isn't. But yeah, that actually is
a real noise case. But you're right, sometimes that happens. Part of the reason
is that, you know, you just got to -- you've just -- you know, you build as much
intuitive structure as you can into the model and, you know, kind of empirically
see whether it works or not. That's just, you know, you can't characterize, you
know, all of human language kind of in the structure of your model.
So actually, there is another half to the talk, but maybe I --
>> Silviu-Petru Cucerzan: Maybe we could save some questions for the end.
>> John Blitzer: So how much time is left?
>> Silviu-Petru Cucerzan: Half an hour. Half an hour for your presentation and the
questions. So it's good to be interactive, but I know how --
>> John Blitzer: Well, okay. So I'll finish this. The next half, it's not actually half.
It's more like you know, next third of the talk.
>> Silviu-Petru Cucerzan: Feel free to manage the time.
>> John Blitzer: It's till 11:00; is that right? 10:45.
>> Silviu-Petru Cucerzan: 12.
>> John Blitzer: Oh, till 12, yeah. So the next part of the talk is going to be
about, you know, projecting information across languages for web search. All
right. So I have my two queries, right, salmonella and [speaking Chinese]. And I
have with that a bunch of English and Chinese documents that I've retrieved. I
have this actually for many, many queries, right. So all the -- I'll explain in a bit
how I get them. But, you know, another one might be British history and
[speaking foreign language].
And the basic idea here is -- so in addition to that, you can almost think of it as
I have my ranker's output for the English as well, right.
So I know actually how to rank the English documents. So my goal here, where I
write E1, this is the best English document, where I write E2, this is the second
best for that particular query.
What I don't know, though, is how to rank the Chinese documents, right? This is
what I want to output. I want to say oh, well, you know the kind of first document
in my unordered list is actually ranked 15 and so on. This is what I want to
output. For those of you who know, I guess people are roughly familiar with
cross-lingual IR. People kind of know the set-up here. The set-up in
cross-lingual IR is: I see a Chinese query. I want to rank Chinese and English
documents. For the English documents that rank high, I want to translate back
into Chinese and show you Chinese output.
The reason I'm trying to avoid this, I guess with all due respect to people who
work on machine translation, we're not quite there yet, right? And I want to say
that right now, we can still give you a better ranking, purely monolingual ranking
for the Chinese documents without ever showing you translated English output.
I want to emphasize that. The user never kind of has to deal with machine
translated output yet.
Okay. So one question you might ask is, well, okay, you're going to do this for
bilingual queries, but how many queries are really bilingual? There are kind of at
least two kinds of phenomena that we can't deal with. So one is phenomena like
[speaking Chinese], which is, you know, the Chinese, you know, translation of
overview of learning to rank.
And you might say, okay if I had a really good dictionary, I should be able to look
this up and English should help me out here. But the real truth is, you know, we
can't get this. Just 'cause, you know, it's not common enough and we can't really
identify that this Chinese query corresponds to an English query which we could
do really well on.
So the second is kind of the opposite, where I can actually translate it right, like
so for this query [speaking Chinese] which Changhong is probably one of the
biggest electronics makers in China. I can translate this just fine into Changhong
TV set, but if I search for it in English, it's not going to be very helpful in ranking
Chinese queries.
All we do here is something really simple. We take an automatically mined
dictionary and we threshold the query logs at some number and then we just look
things up there. And it turns out that, you know, not a huge number, but a
significant number so here in this case, for the Chinese query log, 2.3% of the
queries are actually in the English query log.
And there are many like this one which we hope we could get eventually, like
overview of learning to rank, but we can't get yet. We could get it if we had better
machine translation, maybe.
>>: [inaudible].
>> John Blitzer: Say that again.
>>: By queries or by volume?
>> John Blitzer: This is, I think, by volume. I'm not 100% sure. So okay. So in
order to train and test -- I guess I'll go through this pretty quickly. So at training
time, I'm actually going to see a few -- I'm going to see some bilingual queries.
I'm going to get both the English and Chinese ranking. My goal here now, there's
going to be kind of several steps.
I want to take this and construct a ranking on pairs of documents, right. So
initially, I had two rankings, monolingual rankings, now I'm going to construct a
bilingual ranking on pairs. And from that, I'm going to learn a ranking function for
these pairs. And this will use kind of standard machine learning techniques.
Now when I see a new query, a new bilingual query, I will run this through my
ranking function, get out my hypothesized ranking on pairs and now I need to
convert the Chinese side back into a monolingual ranking on Chinese
documents.
Okay. So basically, there are kind of these three steps, constructing a joint
ranking from monolingual rankings, learning a ranking function on pairs, and
reconstructing the monolingual ranking from the bilingual rankings.
Okay. So constructing the bilingual ranking actually turns out to be quite simple.
There are many ways you could consider doing it. Here we force the bilingual
ranking to be absolutely consistent with the monolingual rankings. What I mean by
that is that I only rank a pair (English 1, Chinese 1) higher than (English 2, Chinese 2)
if English 1 is higher than English 2 and Chinese 1 is no lower than Chinese 2, or
vice versa.
So what I have here is then I can look at a monolingual ranking, construct a
bilingual ranking which is consistent with that.
The second thing I need to do is learn this joint ranking function. And for this we use
a standard rank-SVM-style objective. So basically, here, when I write this, I want
to look at, for a particular query, all pairs of, you know,
bilingual pairs, and I introduce basically a hinge loss penalty for each
pair that's ranked incorrectly, and again I have a feature
vector on pairs of documents and their query.
Okay. So this is, you know, pretty standard set-up. The only interesting thing is
kind of what features I can introduce now that I have pairs. So I have all the
standard features, and I'm sure you guys know much more about what those are
monolingually than I do. That's kind of top secret. They don't let visiting
researchers know that.
But so one thing we can introduce is just bilingual dictionary similarity and
machine translation and kind of weighted versions of these two, as well as URL
similarity, right, so I might say oh, well, you know, airbus.com and airbus.com.cn
are kind of similar and I can introduce all these features kind of that are
generated from the pair, rather than from any single monolingual document.
And really, this is what we expect to help us, right. This is kind of what we hope
will actually improve performance.
So the final thing is how do I convert. So I can build a now a new ranking on
these pairs. And I've written here kind of on the left again when I write EC1, I
mean this is the best pair according to my ranking function. And now, I see a
Chinese document and I say okay, well, what's the -- you know, where should I
insert this Chinese document in my final Chinese ranking? In this case, maybe
it's occurred in position 1 and position 23. And it turns out that what actually
works well -- so there's no heuristic that's going to be completely consistent.
Before, remember, we generate the training pairs to be completely consistent
with every monolingual -- with our monolingual ranking.
But now, and I won't go into details, but the ranking on pairs, this might not be
consistent with any monolingual ranking. We have to do something. What we do
is just rank a Chinese document by averaging across pairs in which it appears.
And this kind of gives you -- so in this case, we say, oh, well it should be ranked
12 here.
And there are more sophisticated variants of this that you might consider. In
particular, I guess, people here must be familiar with this area called rank
aggregation, where the idea is you see multiple rankings and you want to
aggregate across them. So my co-authors now have results that rank
aggregation, you can kind of aggregate across all the possible rankings of
Chinese of this pair and you can do even better.
But this actually tends to work already quite a bit better than the monolingual
ranking. So what I'm showing you here, I have to be honest now, we actually
finished this project after we left Microsoft. So there's a problem with that in
terms of, you know, well, we wanted to actually report NDCG, but we can't, right,
because, you know, that's all Microsoft internal stuff.
So what we did was we took the click-through and now we're reporting kind of for
the top queries how do we compare with ranking based on click-through alone.
And you can still do significantly better bilingually than you can monolingually for
this. So basically the idea -- so, I mean, we have lots of queries, right. So it
shouldn't be surprising that this difference is statistically significant. But, you
know, we have -- I don't know, 10 or 20 thousand queries to test on. And, you
know, so we can actually get a big gain by actually combining these two.
Okay. The last thing I want to show you is kind of what queries are most
improved, again kind of the qualitative version of this. Obviously, British history
and salmonella, I picked them for a reason. Political cartoons actually is one
thing that might be a bit controversial, but it turns out to be one of the most
improved Chinese queries. Turns out that if you can search for English political
cartoons, it's actually, that helps you a lot in finding political cartoons in Chinese.
Just for fun, we actually tried this in the opposite way, right. We tried to rank
English and it turns out that so for any of you guys that -- not that I'm advocating
this, but for any of you guys who actually do watch pirated TV, you'll notice that
almost all of the English sites are -- the best English sites for pirated TV are
actually routed through China. Turns out this is actually the number one best
improved English query. If you search for free online TV, the actual kind of best
query is this.
The others are less interesting, except perhaps for aniston. It turns out that, at
least during the kind of period we sampled the query log, Friends was
hugely popular in China and Jennifer Aniston was also very popular, and her
Chinese name doesn't have her first name. So this is just aniston, which is the
transliteration of her last name. And basically, the idea there is just that, maybe
in English, if you forgot her first name -- what is it, Jennifer Aniston or
Jessica Aniston -- and you just searched for aniston, which people do do, you can
actually get a significant performance increase by kind of looking at the Chinese
results.
Okay. So I guess I'll end with this. So there's a lot of things that I'm interested in.
I showed you guys these two. All of them kind of fall in this idea of building
robust models across multiple domains. So one that kind of came up a little bit is
how can we characterize kind of theoretically when the algorithms like the ones I
described here work well. Can I say -- and also I did work with Shai Ben-David
and Koby Crammer and Jenn Wortman and Alex Kulesza. And I'm continuing
this work as well now with other people. But I think that, you know, it's kind of
very interesting to consider when, you know -- I give you two distributions, kind of
what conditions on these distributions do I need such that an algorithm which
kind of learns on one and observes nothing about the other does well.
Or observes a few instances of the other does well, right.
The second and actually I'm going to talk about this Friday, I don't want to
compel you guys to come, but I'm going to talk to the NLP group about using
some of these techniques in machine translation. So it turns out just at a high
level, kind of there are all these different components to machine translation and
they're trained on all different domains. Might not be -- these domains might not
be comparable to the ones you want to translate.
Again, like, there's this idea of can I, can I use all of these different components
together to train a better joint model on the domain I care about.
So other things I'm interested in kind of fall under the general rubric of -- I don't
know what I call this, natural supervision. So, for example, there's been recent
work on face and object recognition that's really good. But now if I know kind of
the syntactic structure, can I recognize verbs. For example, here's a picture of
the Yankees pitcher who won the World Series, holding up his World Series
trophy. Can I look at a bunch of examples of the verb holding and if I knew the
nouns, could I, you know, learn something about holding in general?
And another thing that I'm interested in -- which I think, you know, people here
have already done good work on, but I think there's still a lot of interesting depth
in this area -- is: can I build a model for, let's say, serving ads on the Wall Street
Journal, and now I want to serve ads on WordPress or something, right? Can you
do this without any labeled data or without any click-through on a new blog? Can
I serve the right ads, right? And, you know, sort of questions in this vein. Okay.
So that's it. Thanks.
[Applause].
>>: We have a few minutes for questions.
>> John Blitzer: Um-hmm.
>>: What was the number of the initial [indiscernible] when you were showing
books and DVDs and electronics? You had a varying number of --
>> John Blitzer: Yeah, it turns out that just more is better, always.
>>: Was the case that you had a lot more people in electronics and DVDs than
kitchen appliances?
>> John Blitzer: Oh, I see. Yeah, you're right. You're right about that. You can
actually construct more if the domains are closer. That is true. And in practice,
we always use the same number. We always used 2,000.
>>: Project on the same --
>> John Blitzer: That's right, but that is a good question. You can construct
more from more similar domains. We haven't done that experiment. I suspect
that you would be able to do better. Part of the thing is just like the amount of
vocabulary overlap really helps you. So even with the same number of pivots
you can get, I guess, obviously better results.
>>: Could someone generalize this without looking at the second domain? In
other words, like building some sort of -- finding ways to make the model more
general. Have you thought of that? In other words, could you improve your
performance on [inaudible] looking at it [inaudible]?
>> John Blitzer: That, I mean I guess I haven't come up with any good ideas for
that. In some sense, that, like, in general seems very hard. Like, I mean,
consider not knowing anything about the reviews that you want, right. So if you
make some assumption about, like, you know, oh, I guess common words in
books are more appropriate, but --
>>: [inaudible] in your model, you assume some really powerful word like, you
know, page-turner. Say that was a great thing. You assume that that's -- you're
going to learn how to predict that, rather than actually use it. You might assume
that the general things that help you predict that would still be predictive in the
next model.
>> John Blitzer: So I actually --
>>: [inaudible].
>> John Blitzer: I guess you're right, although that still depends on you kind of
knowing that, you know, "a breeze" is equivalent to "page-turner," kind of when
you see it, right?
>>: I was saying, I guess my thought would be something along the lines of
anything that I find is really super strong in my training corpus, I shouldn't trust it.
I should move one step away from it and use the things that predict it and hope
that those things [inaudible] my space, those predictors, that I'll do better when I
get to the next one. I just assume some fraction of my strong things -- "great buy"
probably doesn't show up as much as some of the others.
>> John Blitzer: I see. So one variant of that is just to ignore the kitchen
appliances data and run completely this exact algorithm just only with books.
>>: That's what I was wondering.
>> John Blitzer: We did do that. It does help a little bit, but it's not nearly -- I
mean, like if you think about the intuition of, oh, I've just never seen that word
before, right --
>>: You still wouldn't be able to use words you haven't seen.
>> John Blitzer: Right. Those turn out to be --
>>: Words surrounding would be more helpful. Like your "I" example. If, in fact, it
turned out that "I" was always a positive word or generally tended to be a positive
word, that's still going to show up in your new corpus.
>> John Blitzer: Right.
>>: And you can still be okay. So as long as you tend to trust things that were
less -- I guess one of the things -- we did spam filtering stuff. We had the exact
problem: since it was spam, you'd expect them to stop using the words you found
most [inaudible]. They never did. Obviously, they stopped using the [inaudible].
And the model compensates by looking at things like "best price" or things like all
exclamation points.
>> John Blitzer: Yeah, I mean, I guess like --
>>: It would be nice to build a more generalized model without having to see the
future.
>> John Blitzer: Right. I mean, in this kind of -- let me actually say something to
that. You can -- there are actually algorithms which are online, right. Even in the
unsupervised case, right. So these are kind of bootstrapping algorithms where
you see a new example, and you kind of absorb the new vocabulary from that
example as you learn, right. And you kind of fuzz out over that, right. So you
can -- I mean, and I guess in your example, like I see -- well, okay. Suppose I
see, you know, "a breeze" just immediately. Then you can say, well, I've never
seen "a breeze" before, but I fuzz it out upon seeing it. So I can't get this one right,
but the next time I see it, I kind of absorb that in an online fashion.
And in that sense, you can be adaptable. I guess spam is particularly tricky,
right, because it really is an adversarial situation.
>>: It's a limited adversarial situation, but still a human has to read the final --
>> John Blitzer: Yes and no, right, because the spammers will read mail too and
mark their own messages as not spam, right?
>>: [inaudible] access to the rating system, if you will. [inaudible].
>> John Blitzer: The ratings? Oh, you actually pay people to do --
>>: We don't. We get volunteers. Volunteers who [inaudible]. So since you can
assume your ratings are --
>> John Blitzer: You can detect when a spammer is using Hotmail.
>>: Spammers don't use it enough. We can't get enough traction. Assume the
ratings are fine. What they can do is [inaudible] try to alter it so --
>> John Blitzer: I'm saying, like, at least -- I mean, take this with a grain of salt,
but this is via kind of conversation with the people at Yahoo. My impression is
that most people on Yahoo Mail are spammers. So you have to have some way
of kind of cleaning up your labels just initially, right.
So like every spammer -- I mean, he was describing the problem this way. Like, a
spammer sends out a mail -- I mean, a botnet sends out a mail, right, including a
lot of Yahoo addresses that he owns, right. And he automatically logs in, marks
that mail as not spam, over and over again. Then eventually, he'll send it out kind
of --
>>: This is off the subject of your topic. The way we do it in Hotmail, we have
two separate ways to get spam. One is the voluntary "I'd like to mark this mail
one way or the other." But the primary way is actually something called the
feedback, where we opted in about one-tenth of 1% of our users and send them a
random sampling of their own mail once a day. One mail each day that says,
please mark this as spam or not.
>> John Blitzer: I see.
>>: Volunteers who we asked to join, and they were already users for three
months or something like that.
>> John Blitzer: Right, I see.
>>: So from looking at the data, it doesn't look like these users have been
substantially -- I didn't find any value from trying to clean the users out when I
tried to define the bad guys.
>> John Blitzer: Yeah, that's interesting. I mean, I can actually -- it's not my
work, but I can send you pointers to kind of --
>>: [inaudible].
>> John Blitzer: Although there are newer online algorithms, in the purely
supervised setting, that try to basically deal with this by saying: oh, well, if I got
something wrong, then I should adjust the weights of the features I haven't seen
more than the ones I have seen.
Oftentimes, you know, you look at the gradient-based methods, and you look at a
pure gradient step, and what that basically means is that all the features are
treated equally, right. I mean, it's linear. So you linearize the function around that
particular point and all the features are treated the same.
You can think of things that are more complicated, where you kind of keep track
of some, let's say, second-order information online, and then you can kind of
update, even when you see an instance, based on which features are there or
not.
I do think that one thing that's interesting to me -- I don't have any, you know,
great ideas yet -- is that if I see an instance and I make a prediction about it, right,
like if I trust the prediction I made, then can I kind of, like, say, well, the new
features -- somehow I can kind of adjust weights based on my own prediction.
>>: [inaudible] ongoing algorithm. Also I think there's always the case that
there's some split between the case you're talking about, like the electronics
versus the books, that's the real world. We never have the same test data that
we had in training. The world is always shifting under your feet.
>> John Blitzer: Yeah, that's right.
>>: So if you can make your model generalize better across two domains,
[inaudible] the second one, then you would have something that would probably
just do better in the real world when you're talking about normally categorizing
things where all of a sudden, somebody shoots somebody else and all those
arguments [inaudible] or whatever it is.
>> John Blitzer: So let me give you, actually let me give you an easier version of
your problem, which is that I think this is maybe more feasible. But like, I don't
actually know what the new domain is, but I know that I'm gonna be exposed to it
eventually, right. So like I really feel that kind of you can't be -- you can't really
build an algorithm that's general enough such that -- I mean, maybe if you have a
lot of kind of problem-specific knowledge. But I think in general, you know, you
build an algorithm and you can't expect to kind of cover a new domain well
without ever seeing it, right.
But I do feel like that your intuition is right. Like the problem that I -- to be more
extreme, I introduced kind of an artificial version of this problem, right, where like
I had electronics data, and I knew, like, oh, this is called electronics and I have a
lot of it, and here it is, and go.
But in practice, of course, you might know, well, I'm going to go somewhere. I
don't know that it's called -- I'm going to have to apply my model somewhere that
doesn't look like training, but I don't know that it's called electronics and kind of
like I don't know -- maybe I'm deploying versions of my model simultaneously
and like I want each of them to be adaptive to the data that it's looking at. And
I'm exposed to it over time.
>>: [inaudible] I just wonder, almost any kind of -- if there's anything you can
generalize from the fact that -- I don't want to repeat my question.
>> John Blitzer: Well, let me say one thing, because I do think this is really
interesting.
>>: [inaudible].
>> John Blitzer: Okay. Well, there's kind of this hot-start/cold-start issue if you
think about all these kinds of recommender systems, where typically, you know,
you see a new user, and, like, can I -- do I cold-start this user, right? But in
reality, kind of like, you really can't do well on a new user. Or anything like
speech, right. You
say I wish I could build a general speech recognizer that kind of worked on
everyone regardless of accent, right? I mean, even I can't do that, right?
If I go and meet someone from -- you know, actually, the cab driver this morning
was from Russia. You know, after about a minute or two, I could talk to him fine.
But, you know, like it just, it takes a little while. You got to be exposed to
something, I think. You know. Unless you can actually get data from kind of all
the world's languages offline, then maybe. But, you know, again, there are sort of
new domains being created all the time, this kind of thing, so anyway.
It's really interesting. We'll talk later.
>>: A simpler problem in that respect would be what Michael, again, was
suggesting. That is, I mean, could you actually use data, information you get
from -- you have a lot of books, and then now you have the DVDs, and then you
have the electronics. And can you learn from that how to apply the book data for
a new domain, right? I mean, can you learn from -- so you have a distribution
from which you have a lot of data, [inaudible] data, and then you have these
other distributions from which you have just a tiny amount of data. Can you still
learn what's general enough in my initial distribution that I could use on any new
domain?
>>: Yeah, just like when you go to a new country. [inaudible] English differently
[indiscernible] like using big expressive hand gestures lets nobody know if you're
happy or sad. Or the other way around, I've learned over time if I can't share my
words, use facial expressions --
>> John Blitzer: This is interesting. I don't think it will get at the whole problem.
But this is a studied problem in machine learning. It's called multitask learning.
Basically, if you have labeled data from electronics, right, and you want to say,
well, I don't know, you know, kind of where or when I'm going to see my next
domain, but I know this is a new domain -- now kind of learn something that I
expect to generalize to my next problem.
>>: Exactly. I'm assuming there have got to be some things in books, or
something, from which you could figure out what's different. Maybe just assume
anything that's really of high value is probably specific to my domain.
>> John Blitzer: That's right, yeah, that's right, that's right. So I think --
>>: Like, what's the author you had?
>> John Blitzer: Grisham.
>>: You might just automatically distrust features that are too strong and try to
generalize any features that are popular but not strong.
>> John Blitzer: Yeah, that kind of, that kind of heuristic, actually, unfortunately,
tends not to work well.
>>: I never tried that, but again [inaudible] the mail if you learn the names, if you
[inaudible] the names of the people who are in your training and they never show
up in the test set and they just make your model worse. You never try to remove
them [inaudible] users. When you have only 100 users, each of those people's
last names was just a [inaudible] feature to the [inaudible].
>> John Blitzer: Yeah, yeah. I mean, so I guess like there is kind of this set of
literature, the multitask literature, but one version of the problem that I don't think
is studied yet is kind of like oh, I'm going to -- I see unlabeled data from one
domain, but I know I'm going to test on a different domain, right. So that could be
something, right. Like I know I'm -- you know, I don't know. I know I'm going to
France and kind of I, I don't know, I watch a video about France. But then I go to
Russia, right. Like something like this. I don't know what the real world analogy
is. But yeah, I agree with that. That is, that is interesting. I mean, there's this
whole sub-field of machine learning, kind of a cottage industry, called transfer
learning, where basically the idea kind of absorbs everything I did and also, you
know, where you actually do see some labels as well.
And I think the case where you start to see labels, you start to learn some things.
But -- and you're right, you would like to learn kind of what's -- what from books is
general, right. That is true. And you know, kind of regardless of where you're
going, there's kind of some core of, I don't know, what does it mean to be
sentiment?
>>: I wouldn't expect you to ever do as well as actually being able to look at the
appliance data first. But the question would be could you do better than you
would --
>> John Blitzer: Yeah, I think that's right. It turns out that the heuristics that you
actually suspect would work at least in all the problems I've looked at don't work
so well. Like basically the one you suggested is one of the first things you look
at, like oh well, I should ignore things that are too good on books. Or, you know,
if I have unlabeled data from kitchen, I could drop all features which don't appear
in kitchen appliances, right? That kind of always works no better, maybe
sometimes a little worse --
>>: Something along the lines of what you did, learning to predict the high-value
features from [inaudible].
>> John Blitzer: Yeah, except --
>>: I know that --
>> John Blitzer: Except that now you're still going to have gaps, right,
because the corpus you're actually interested in, you've never seen before. But
yeah, I think, I think all those things are, I mean, it is a fruitful area. I guess I sort
of feel like the online setting is probably the most compelling to me, because then
you can kind of start in this scenario you're talking about and kind of slowly move
to one where you actually are seeing data that you really want to deal with.
>>: I guess I'm really interested --
>> Silviu-Petru Cucerzan: Let's end here and continue the discussion. We're
going to set the record for a talk. Let's thank the speaker again.