>> Silviu-Petru Cucerzan: It's a pleasure to have John Blitzer with us today and tomorrow and on Friday also. And for those who don't know John well, I should say that he did his undergrad work at Cornell with Claire Cardie and Lillian Lee, and then he got his Ph.D. at U Penn under the supervision of Fernando Pereira, and now he's a post-doctoral fellow at UC Berkeley, working with Dan Klein. So huge names in the field, right? And in between U Penn and Berkeley, he went away from pure academic work for a little while and worked at MSRA, so he has already worked in Microsoft. And I should say his main interests lie in the area of machine learning applied to natural language, and he did a lot of work on finding compact models, semantic and syntactic, to represent phenomena in language, and he'll talk today more about that. So his talk is Natural Language Processing in Multiple Domains: Relating the Unknown to the Known. >> John Blitzer: Okay. So thank you, Silviu. So I'm going to talk a little bit about, kind of touch on, a few of the applications of this idea of relating the unknown to the known. And I hope that -- I'll touch on, again, at the very end, other ideas that just couldn't make it into this talk, and I hope people who are interested in them will talk to me about them afterward. So I actually want to start out by talking about machine learning for NLP, or statistical NLP, in what I'll call the standard or single domain setting. So in this setting, we build our models by training them from a corpus of data. So in this case -- this is an article from the Wall Street Journal -- we ask a teacher to go through and label a whole document or pieces of a document for us. So this can be anything from annotating a particular sentence with its syntactic structure to, you know, maybe if the teacher is actually a user, clicking on ads, telling us which ads are relevant for this particular article. And it might be something extremely complicated with a lot of structure, like taking a sentence in English and giving its Chinese translation. And in all these cases, our goal here is to build from this an NLP system, which is essentially mimicking the actions of the teacher, right. So in particular, after we've looked at a bunch of examples, we can take as input another document, in this case again from the Wall Street Journal, feed it to our NLP system and get out a bunch of predictions. So, you know, just to use the first example, we could ask our system to tell us what the best syntactic analysis for this particular sentence would be. And so I won't go so far as to call this a solved problem, but it's typically very well understood in NLP. And one of the fundamental reasons for this is that training and prediction are from the same distribution. Because of that, empirical process theory actually gives us strong guarantees and tells us, for example, that the more Wall Street Journal sentences we see with their parses, if we build a model from them, the better we expect to do. And, in fact, this is true empirically as well. So if you look at syntactic parsers built on the Wall Street Journal, if you give them another similar Wall Street Journal sentence, they tend to do extremely well. That's not what I'm talking about today. What I'm going to talk about is what I'll call the multi-domain setting.
In this setting, the setup at training time is quite similar. Now I've labeled the Wall Street Journal here as what I've called a source domain, and the idea now is that we want to go through and apply our model in a new target domain. One possible scenario is, you know, I'm reading the MSDN forums and I want to know when a particular question is answered. Well, it might be really helpful to parse the sentences in those forums, but, of course, they look nothing like Wall Street Journal text, right? So I run them through my parser and I get out a bunch of predictions, and I can hope for the best. But of course, all the nice things that I said in the previous slide are not true anymore, right. So for example, now I come to the MSDN forums and people are talking about, you know, SQL queries and race conditions, and that just doesn't happen in financial news, right. And because of that, standard empirical process theory now doesn't have anything to say about this case, right, because the distribution has changed, and kind of in the limit, I'm taking a sample from a completely different distribution and asking you to do well, and we have no more guarantees that we would do well. And, in fact, this is true: state of the art models really tend to break down here, sometimes more than doubling in error. So I'm going to start by giving you guys, at a high level, two examples of the two problems that I'm going to talk about today that illustrate this. So the first is motivated by what I'll call sentiment classification. And the idea here is that we get a review of a particular product. In this case, this is a product on Amazon. This is a review of a book, and it reads: this book was horrible. I read half, suffering from a headache the entire time, and eventually I lit it on fire. One less copy in the world. Don't waste your money. I wish I had the time spent reading this book back. It wasted my life. So our goal here as machine learners is to take as input this document and output either positive or, unfortunately for this book -- I won't keep you guys in suspense -- this is actually a negative review. And the crucial idea here is just that if we've seen a lot of examples, you know, if we have a teacher who goes through and tells us, oh, this review is positive, this review is negative, we can actually do very well on book reviews. But books aren't the only thing that's sold, even on Amazon, and they're certainly not the only product or service that we'd like to try to understand. And if I actually try to build a model, I'd like it to now do a good job on reviews of other types of products. In this case, this is a review of a deep fryer. Amazon also sells deep fryers. And this reads: I love the way the Tefal deep fryer cooks. However, I am returning my second one due to a defective lid closure. The lid may close initially, but after a few uses, it no longer stays closed. I won't be buying this one again. And the basic idea here is that we haven't seen any outputs from a teacher on kitchen appliances. We may have many kitchen appliance reviews that we'd like to label as positive or negative, but our teacher hasn't gone through and told us for any of them whether or not they are. We'd like to be able to generalize, but in practice, there is a huge increase in error when training on reviews of books and testing on kitchen appliances. In fact, the error doubles here.
And so this particular setup is something I worked on with Mark Dredze and Fernando Pereira as part of my Ph.D. thesis. >>: John? >> John Blitzer: Yeah. >>: What's human performance on this task? >> John Blitzer: Human performance? Well, that's hard to say exactly. So all of this is crawled from Amazon, and you can't say for sure. We're basing this on the stars the reviewer himself gave. So we have a little bit of inter-annotator agreement, but the inter-annotator agreement is basically just me and Mark looking at a review and trying to decide. So that seems to be 90 and above for all of the separate domains. Actually, it's a little better outside of books. Books tend to have higher variance. But again, I wouldn't really call this a strict inter-annotator experiment. That's very off the cuff, right. We really just have the stars, and we're trying to look at a review and reproduce whether or not it got, let's say, five stars, right. Okay. So the second problem that I'm going to talk to you about is motivated by web search across multiple languages. So if I do a search on a Chinese search engine for xiaonei, which is actually the top networking site in China, I actually get back very good results. So the first is -- so this is a kind of navigational query. That's not the right button, okay. So there -- in practice, this is exactly the right link, and the second hit is actually a mobile version of this site, and it's also a great result for this query. This is a common one, one of the top queries you see in any Chinese search engine. But there are many queries, like this one, this is [speaking Chinese], which is the Chinese translation of salmonella. And the top two results are okay, right. So the first one is kind of the Chinese version of Wikipedia, and the second is this community question/answer forum, where someone asked what is salmonella and someone responded and basically gave a description of the disease. But really, if you think about what you might be looking for at a high level if you type in salmonella, it's kind of, well, respected sites, government sites. So there is a CDC in China, and they do have a website on salmonella. It's not here. Or you might want news about salmonella outbreaks. None of that's there. And the key insight, though -- this is something I worked on at MSRA with Wei Gao, who was an intern there, Ming Zhou and Kam-Fai Wong. So actually, there is a significant loss in ranking performance here, and we'll get to that later on. The key insight, though, here is that in English, actually, if I search for salmonella, I kind of get exactly what I'm hoping for. The top hit is the CDC and then there are some news results. And the basic idea here, though, is that these results, which actually rank kind of low if you look at the purely Chinese ranker, actually are quite high if you were able to know that these queries were equivalent and ranked purely based on the English ranker, the English results. And just to give a high-level reason for this disparity: one is just that, in particular for a search engine like Bing, a lot more effort has gone into the English ranker, right.
But there are other things too. For example, click-through data just is less meaningful in Chinese, because there's less click-through. And static ranking features like PageRank are not as predictive in Chinese as they are in English. So there are these features which may have been reliable in English directly but are no longer as reliable in Chinese. And the rankers kind of suffer because of that. And that's one problem we're going to try to overcome by exploiting a ranker that we already know is good, right -- the English ranker. Okay. So as you might have guessed, the talk is going to break down as follows: the first part is going to be about relating known and unknown features, and for this case, we're going to be looking at building a shared representation across different products and reviews. The second is going to be about known and unknown documents. So in this case, I may have a query in Chinese which I don't know how to do good ranking for, but I have a corresponding query in English. Again, I'm going to try to exploit cross-lingual structure to do a better job ranking in Chinese. Okay. So the first thing I want to mention, though, is that in order to build the shared representation, I want to note first what's different. So I've highlighted here in blue and in red words that are pretty predictive, but are kind of unique to a particular domain. So in this case, you know, I can read half of a book, and I know that's negative, but I'm not going to read half of a kitchen appliance. I'm not going to read half of a deep fryer. Similarly, deep fryers, when they don't work, I don't like them, and if I return them, I probably don't like them either. But I'm not going to say, you know, this book is defective, it just doesn't work, and I'm returning it, right. That's not what you say to be negative about books. And the idea here is just that for these unique words, I'm going to look for a representation -- I'll discuss in detail what that representation looks like in a minute -- that maps returning and lit it on fire to similar areas of a low dimensional space, and I'm going to try to exploit that to do better sentiment prediction. Okay. So I want to begin with a brief interlude, and actually this is probably not necessary for this audience, but at least to get us on the same page notationally, the kinds of models that I'm going to be looking at here are what are called conditional exponential family models. Basically, the idea is that each review I'm going to score as the dot product of a feature vector and a weight vector. Now, the feature vector in this case is going to be very simple. It's going to be just the bag of words, bigrams, and trigrams. What this means is that it's going to be very high dimensional. Each dimension in my feature vector corresponds to a single word or bigram, so there will be millions of dimensions, but any particular document will be quite sparse. So, for example, if the word excellent occurs three times, then I give it a three in my feature vector. Great would get a value of one in my feature vector if it occurs once. Fascinating, two if it occurs twice. Similarly, the parameter vector is also high dimensional, and each entry in this parameter vector basically corresponds to the propensity of a particular word or bigram to indicate positivity or negativity.
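To make the scoring concrete, here is a minimal Python sketch of the sparse dot-product scoring being described. The feature extraction and the weights are illustrative assumptions only, not the actual learned model, though the toy counts and weights happen to match the worked example that follows.

```python
# Minimal sketch of the sparse linear (conditional exponential family) scoring
# described above. Feature names and weights are made up for illustration; a
# real model would have millions of unigram/bigram/trigram dimensions.

from collections import Counter

def extract_features(tokens, max_n=3):
    """Bag of words, bigrams, and trigrams as a sparse count vector."""
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

def score(feats, weights):
    """Dot product of sparse feature counts with the weight vector.
    Words absent from `weights` implicitly get weight zero -- exactly
    the unknown-word problem discussed in the talk."""
    return sum(count * weights.get(f, 0.0) for f, count in feats.items())

# Hypothetical weights learned from labeled book reviews.
weights = {"excellent": 1.0, "great": 2.0, "fascinating": 1.2}

review = "excellent excellent excellent great fascinating fascinating read".split()
feats = extract_features(review)
print(score(feats, weights))  # 3*1.0 + 1*2.0 + 2*1.2 = 7.4 -> predict positive
```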
So, for example, I might say that, you know, excellent has a weight of one, great has a weight of two, and so on. And taking the dot product of them gives me roughly a score that will indicate whether or not this is positive or negative. So in this case, looking at the dot product, it's three plus two plus 2.4, which is 7.4, and I say okay, this is a positive review. And the only thing that I want to actually bring up, though, is that in terms of this particular linear modeling paradigm, the thing that we're concerned with is words that have zeroes in our parameter vector, right. So if I build a model on books and I come to a word like sturdy, I've never seen sturdy used to describe a book before. Then I don't really know whether this is positive or negative, although, you know, as humans, we know that sturdy is a positive word for kitchen appliances. Yeah? >>: Wouldn't you think once in a while someone might say this sturdy prose reminds me of -- do you have that many real zeroes? [inaudible] where words would come up like sturdy or reliable or, you know, on fire. Do you think you might have the word show up, but just very rarely, in a different context? >> John Blitzer: Yeah, I think that's a fair characterization. There are kind of always, in language, these heavy-tail phenomena. In particular in reviews, people tend to take creative license, so you're actually -- I mean, Amazon has a lot of reviews, hundreds of thousands, and there are still going to be these hapax legomena, words that appear only once, that actually are still there. But you're right that words like sturdy are probably not a fair characterization of things that are unique. But even for low frequency things, it's not exactly clear, once you have a million features, how to attribute particular weight to something that's seen only once or only twice, all right? So okay. So let's back up for a second and see exactly what the setup is. So I have words like fascinating and boring that are unique to books, and defective and sturdy that are unique or almost unique to kitchen appliances. And the thing that I want to point out, though, is that I also have words and phrases that are shared across the two domains. So I can say that a book is fantastic and a blender is fantastic, and both of these are ways of expressing positivity about books and blenders. Similarly, I can say a book is a waste of money, and a deep fryer is a waste of money, and both of these can be used to describe books and deep fryers. So the representation I'm going to focus on is a real valued representation, and what I'm going to do is use these words, these purple words here, to link up words that are unique to each domain. So the idea is that a word like fascinating can be linked to a word like sturdy via the bigram highly recommended. Similarly, a word like boring can be linked to the word defective via the phrase waste of money. And the idea is that we're going to use these to map these otherwise domain-unique words to similar areas of the low dimensional space. Okay. So this, again, is part of my thesis, and I call these pivot words or, more generally, across all problems, pivot features. And this is again work that I did with Fernando Pereira and other grad students at Penn. Okay. So how are we actually going to exploit these pivots?
One thing we can say is that, well, actually, if we've seen a pivot before -- you know, I come to a document where I see the phrase do not buy, and even if it's a kitchen appliance, I know that I've seen do not buy this book, and I can still do a good job here, right? Similarly, if I've seen an absolutely great purchase, well, this indicates that this should be positive. So I don't actually need to exploit any unique information here when I have the pivot words. It's the cases where someone says, oh, this is a sturdy deep fryer, but I've never seen the word sturdy before in book reviews. And what I'm actually going to try to do here is, if I expand the context of a particular review, I can see that actually I get things like: do not buy the Shark portable steamer, the trigger mechanism is defective. Similarly, if I see an absolutely great purchase, this blender is incredibly sturdy. So I notice that these pivot words actually do co-occur with the domain-unique words. So in particular here, how am I going to try to exploit this? Well, one thing I can do is say I'd like to predict the presence of a particular pivot. So I'd say, I want to predict whether or not great appears in this document, using all the other words in this document as context. All right? So in particular, I've written here another exponential family model. The feature vector is the same. Of course, I'm going to delete the word great from the feature vector. For the weight vector, I've replaced theta with this W parameter here, and W is basically unique to each pivot. So in particular, I have a separate weight vector for each pivot; I want to predict separately for each pivot whether or not it appears in this document. And the thing to notice is that, well, if sturdy appears a lot of times with great, then it's going to get high weight, kind of automatically, at an intuitive level. And because of this, I should be able to say -- kind of the high level intuition is: sturdy co-occurs with great, and great is a good positive predictor, so maybe sturdy should be as well. The last thing I want to say is I showed you kind of five example pivots. In fact, we can construct these automatically from unlabeled books and kitchen appliance reviews. Yeah? >>: So the phrase do not buy contains buy, and does not work well contains work well. So how do you deal with that? >> John Blitzer: You're right that you have to be careful when constructing these predictors. And I want to go through exactly how we construct the final representation and then maybe come back to that question afterward. Yeah, Chris? >>: So [inaudible] never co-occurs with one of those pivots. >> John Blitzer: Yeah. >>: [indiscernible]. >> John Blitzer: Yeah, you are right. And that's why, in particular -- it's kind of always better to use more and more pivots, right. The more you can get your hands on, the better. On the other hand, there are situations, and I won't talk about it here, but there are situations where you can imagine kind of completely disjoint areas of feature space, right, where there is just nothing you can do, right. Maybe there's a set of kitchen appliances where no one ever uses anything in common with books, right. And this is something that, at least in theory, happens.
Now, I've never seen it empirically, but there certainly is the case that you might just not be able to do perfectly on kitchen appliance reviews. >>: [inaudible] feel for the fraction of time this occurs? >> John Blitzer: So in our experience -- let's see. The experiments are in my thesis. But it's certainly below five percent of the instances, right. And the idea is just that, you know, even if you construct enough of these pivots, you might still get the review wrong, right. So these are kind of cartoon pictures, and there are many reasons why you could get a review wrong. But in general, co-occurrence does -- if you select enough pivots, you know, five thousand or ten thousand, eventually you can saturate almost all of this space. So okay. So yeah? >>: Why do you need this notion of pivot if you can just look at the conditional distribution? Say, conditioned on the class, just look at the entire dictionary, and then you automatically have a notion of pivotness for a given word, and you don't have to have a cut-off for certain things being pivots or not pivots. >> John Blitzer: Yeah, that's right. It's mainly for computational reasons. You'll see I want to train a bunch of pivot predictors, and it's easier to train 5,000 than 5 million. >>: [inaudible]. >> John Blitzer: Yeah, and in fact that's the way we actually -- that's a good question and I'll answer it. That's the way we actually do it. You can train, for example, an L1-regularized predictor. You get out a sparse predictor. And for this sparse predictor, you take the active features and use those as pivots, right. So that's one example that tends to work well. Okay. So I've trained up these predictors, and maybe there are, you know, 5,000 of them. And I can write down a matrix here, big W, where each column in the matrix corresponds to a single predictor. So for example, maybe this column is the predictor for highly recommended, right? Whether or not highly recommended appears. And the reason I want to write it down this way in particular is that I can actually predict, using this weight matrix, the presence of all the pivots in a particular document, right. So I see a document, and I want to predict what's the chance that highly recommended appears, that horrible appears, that great appears, and so on for all of the different pivots, right. And at a high level, we're almost done here, right. So if I have 5,000 pivots, I can generate a representation which is essentially 5,000 new features, each of which is a prediction about something that I know is shared with my source domain. And this in particular almost answers our question, right. And the reason I say it doesn't is that, actually related to your question, a lot of the predictors are going to capture information that we don't quite want, right. The unwelcome, non-sentiment information. So, you know, not buy is a good example. But here's another one. You could say, well, I've drawn a kind of cartoon picture here, where each axis corresponds to a single pivot, and each of these points is kind of the weight of a feature in that particular pivot's predictor, right? So one of the high weight features for highly recommended will be I. When someone highly recommends something, they usually say I highly recommend this book, right? But the word I has nothing to do with sentiment in particular, right? It's a purely syntactic phenomenon.
And what I want to be able to do here is now distill from this the correct representation for actually predicting sentiment, right? The idea is there are words like I, and there are maybe a lot of words like I, but there are still some, like wonderful, that are predictive of both highly recommend and great. So what I'm going to look for is a sub-space of best fit to the space which is spanned by the columns of the matrix W. All right. So the idea is that this is kind of the best low dimensional sub-space in terms of the error to the full pivot predictor space. And you can think of this as almost a kind of de-noising, right, where the sub-space will capture what I want for sentiment, and the syntactic phenomena, which aren't shared across many pivots, will not be in the top eigenvectors of W W transposed -- the top singular vectors of W. So this is, in practice, what we do. This psi here is actually just the projection onto the left singular vectors of W, or the top eigenvectors of W W transposed, right? And this is, in a squared-loss sense, the best sub-space for the space of pivot predictors, or the space of natural parameters for all of these different conditional exponential families. >>: Because W knows nothing about [inaudible]. I mean, I could be creating a classifier on, I don't know, you know, overuse of long vocabulary words, right? >> John Blitzer: That's right. So W -- remember, though, that it's not quite true that W knows nothing about sentiment, because we select the pivots in a particular way. Right. So I guess, as Misha asked, one question might be -- I trained a classifier to predict on the source domain, right. For books, I know what I'm looking for, right, and I can use that to bias what I consider to be a pivot. Right? And, in fact, I bias it in two ways. This is actually crucial for the structure of W. One way is I bias it to be predictive of the target classification I'm trying to do, which is sentiment, and I can do that by looking at the source label data. The other is I make sure that the pivots are shared, and I can do that by just looking at data from both domains and saying, well, you're not a pivot unless you occur in both source and target documents, right? But your question actually is right. So that's intuitively why it's true. It turns out that in order to kind of prove that this will work, that a method like this will work, there are a lot of subtleties in the structure of W. So, I mean, it's actually a problem that I'm working on right now, but I think it's -- like in order to characterize when this will work, there are actually a lot of subtleties in the structure of the distribution and how W is constructed. So okay. So psi actually is our low dimensional representation, and in particular, here, I'm looking at psi times X for a particular input document X. And the thing to notice is that this actually does map from the high dimensional feature space into a low dimensional shared space, precisely because we force W to have that structure. And these are the top singular vectors of W. And so the only thing now left to tell you is how I train my final model, right? So before, I was constructing my features from words and bigrams. Now I'm going to construct them from the projection of this document onto the low dimensional shared sub-space.
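Here is a rough sketch, in Python, of the pipeline just described: train one predictor per pivot on unlabeled data from both domains, stack the weight vectors into a matrix W, take its top left singular vectors as the projection psi, and train the final sentiment classifier on the projected features. The data, pivot list, and dimensionalities are toy assumptions; the actual thesis implementation differs in its choice of losses, pivot selection, and other details.

```python
# Rough sketch of the pivot-predictor + SVD projection pipeline described
# above. Toy data and hyperparameters are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_projection(X_unlabeled, feature_names, pivots, k=50):
    """X_unlabeled: dense (num_docs, num_features) count matrix drawn from
    BOTH domains. For each pivot, train a predictor for its presence from
    all the other features, stack the weight vectors into W, and return the
    top-k left singular vectors of W as the projection psi."""
    name_to_col = {f: j for j, f in enumerate(feature_names)}
    columns = []
    for pivot in pivots:
        j = name_to_col[pivot]
        y = (X_unlabeled[:, j] > 0).astype(int)   # does the pivot occur?
        if y.min() == y.max():
            continue                              # pivot never (or always) occurs; skip
        X_masked = X_unlabeled.copy()
        X_masked[:, j] = 0                        # hide the pivot itself
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_masked, y)
        columns.append(clf.coef_.ravel())
    W = np.stack(columns, axis=1)                 # num_features x num_pivots
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    k = min(k, U.shape[1])
    return U[:, :k].T                             # psi: k x num_features

# Usage sketch: project labeled source documents onto the shared sub-space
# and train the final sentiment classifier there.
# psi = build_projection(X_unlabeled, feature_names, pivots, k=50)
# clf = LogisticRegression().fit(X_source @ psi.T, y_source)
# preds = clf.predict(X_target @ psi.T)
```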
And basically, the idea here is that by constructing features on the projection, I'm going to have something that generalizes across domains. All right. So I want to show you guys briefly some results. So here I lied -- I actually have more than two domains. I crawled all the different categories of Amazon, and we have reviews from several different kinds of products. So what I'm showing you here is the accuracy on reviews of kitchen appliances when I train on reviews from these other separate domains. So the first gold bar here is kind of our gold standard: if we actually had a lot of kitchen appliance reviews to train on, how well could we do? And in this case, this is 87.7. If we now train up a support vector machine just looking at each of these domains separately, we get the following set of results. And the only thing that's interesting here is just that electronics reviews tend to use a lot of the same language as reviews of kitchen appliances, so you can do a lot better training on those. The last set of results here is the method I just described, where I train, instead of on the high dimensional unigram and bigram feature vector, on the projection onto the shared sub-space, and, in fact, you do see a big improvement, even for electronics, but certainly for DVDs and books. And in general, if you look at all the pairs of domains across all the data that I have, you can close this gap by about 36% between the green and the gold bars -- sort of between adaptation and what you could do if you had in-domain data -- using this technique. Mm-hmm? >>: This may be the next slide. Did you try combining the others, the books and the electronics, the DVDs? >> John Blitzer: I did try that. >>: To see if you could somehow improve the results [inaudible] because you get more data. >> John Blitzer: Right. So I did try that; I don't have a slide for it, and you're right that you can. Of course, there is kind of a ceiling -- so to think of it coarsely in terms of bias and variance, the predictor that you learn for books is biased with respect to the Bayes optimal predictor for kitchen. So if I saw an infinite number of book reviews, I still wouldn't be able to predict kitchen appliances as accurately. And there is kind of a point -- after I see some number of kitchen appliance reviews, you just can't do any better no matter how many book reviews you add. >>: The question is, do you already have [inaudible] -- >> John Blitzer: No. >>: -- are so small [inaudible]. >> John Blitzer: If I crawled now, I might. At the time, which was two years ago, Amazon just didn't have very many kitchen appliance reviews, and you could still do better, because even in 2007 -- I'm sure there are more now -- we crawled millions of book reviews, but for kitchen appliance reviews, we only had like 10,000. You still run into this. Yeah. >>: [inaudible] binary. >> John Blitzer: Yeah. >>: Five stars? >> John Blitzer: That's right. >>: How did you make that split? >> John Blitzer: In this experiment, we threw out threes. So it turns out you can actually treat this as a regression problem, either as ordinal regression or by considering these to be real-valued predictions. And the same sort of projection into a low dimensional sub-space also works well for those. >>: This is following up a little bit on Robert's question.
I think what you can also show, though, is that in practice, if you just combine more and more domains and you have one held-out domain, the more domains you combine, the better you get on the held-out domain, until you've pretty much reached the same or similar results. So for kitchen appliances: if you throw in all the electronics data you have, the book data you have, throw in the movie data set from Pang and Lee and the DVDs, you actually get close to -- it's a brute force approach. Let's use all the data that we have from all the domains that we have. And if we have the luxury of having an assortment of domains where we have [inaudible] data, which is not the case in all domain adaptation problems, in that case you can actually do fairly well. >> John Blitzer: Yeah, I mean, I guess that depends again on the structure of the particular domains, right. So as long as -- I mean, there actually are theoretical results that if, let's say, kitchen is kind of a mixture of electronics and DVDs or something, then you can do perfectly, right. But if there's any sort of unique part of kitchen appliances, then you're always going to miss something. So yeah, you're right. And kind of it depends on the structure of these domains. Yeah? >>: [inaudible] how do you quantify the distributional difference? >> John Blitzer: Yeah, let me do my last slide of this section and then I'll tell you. Okay. So the last thing I want to give is some intuition for this low dimensional sub-space. And what I'm showing here, in the top left, is words that are unique to books, but are negative under this single projection. So these are things like number of pages -- so when I say projection, sorry, I'm showing you one row of the matrix psi, which is kind of a linear projection from the space of features onto the real line. And here, what I'm showing is: if I mention the plot of the book, I probably don't like it. Nobody likes books that are predictable. If I say it had, you know, 586 pages, it probably means I'm starting a diatribe about how long and boring it was. Similarly, for kitchen appliances, these are words that don't occur at all in books. So if I didn't train on kitchen appliances, I wouldn't be able to get them. But they still are negative under this projection -- words like the plastic or poorly designed, and these are words that are unique to kitchen appliances. So here, positive words. Fascinating, engaging, must-read. A little embarrassingly, the most positive unique word for books is Grisham. People just tend to love John Grisham novels on Amazon. >>: Maybe he should make appliances. >> John Blitzer: Well, okay. Actually, for appliances, you see things like I've been using this for years now. This deep fryer is a breeze. Espresso. Espresso turns out to be the John Grisham of kitchen appliances, basically. Everyone just likes espresso machines, and people, when they write about espresso machines, tend to give them high reviews. I guess you pay a lot for espresso machines. Must be good. >>: On the point about words that didn't show up at all -- wouldn't there still be some things where the meaning tends to change, or the word -- >> John Blitzer: Okay. >>: -- one or the other. Even things like number of pages. Maybe I like lots of pages when it comes to the manuals for my electronic appliances, and I hate the fact -- if I mention that, it's a good thing.
>> John Blitzer: Let me tease apart the separate questions. The first question was: is there ever a word which is truly unique? And in this case, for the data set we have, these are. The bigram poorly designed isn't in the books domain. That's not to say that if I crawled Amazon now, I wouldn't see the character was poorly designed or something like that. So I agree with you that there's always this question. On the other hand, the more data I see, kind of the more bigrams I have, and you can imagine a non-parametric version of this where I basically increase the length of the n-gram with the amount of data I have. >>: Maybe the opposite. There are many words that weren't unique, that already have shown up, but the meaning of them wasn't properly incorporated in your -- >> John Blitzer: So you're right. This does happen. You're absolutely right. It may not be something like number of pages. For cell phones, it's really nice to be small, but it's not good for hotel rooms to be small. And that turns out to be extremely hard to deal with, sort of in the most general setting. If I don't have any labeled data from kitchen appliance reviews, and I have kind of arbitrary polarity switching of features, then I'm basically hosed. There's nothing I can do. >>: [inaudible] three or four times in book reviews, and they just happened to all be negative or positive, but that's not very much evidence. You notice, gosh, espresso shows up a lot in this new corpus, and I only had a few bits of information. Seems like this hard-edge thing -- it seems fragile. >> John Blitzer: That's one reason I'm trying to separate the qualitative and quantitative results. I don't actually exploit the hard-edge thing in the quantitative section. Those are the classification performance numbers, and it doesn't matter whether or not a word showed up zero, one, ten, or 500 times. That's just what the results are. But you're right. For these qualitative results, all these things are unique. And you're right, that phenomenon is true. It doesn't seem to affect us too much empirically here, but I could certainly envision places where it would. Yeah? >>: If you go back a page, these results might [inaudible] pure dimensionality reduction. >> John Blitzer: Yeah. >>: What happens if you do LSI with the same number of dimensions? Do you get any sort of comparable lift? >> John Blitzer: You get lift, you get lift, absolutely. But you don't get comparable lift. Again, either LSI or PLSI or some variant of that was in my thesis. But in general, if you don't somehow control the structure of W or, in particular, if you just use the instances directly, you get kind of halfway between these two. Yeah? >>: You [inaudible] refer to [indiscernible]. >> John Blitzer: Uh-huh. >>: [inaudible] reviews tended to be more positive than negative. Have you tried training across [inaudible] domains, like books versus cars? >> John Blitzer: No, I haven't tried that. That's interesting. That really was a passing reference. I actually don't know -- I don't have any statistics on that phenomenon. That's interesting. >>: [inaudible]. >> John Blitzer: Yeah. >>: I mean, I think it's a great point. >> John Blitzer: Yeah, I couldn't say. I mean, it seems intuitive that that would be true.
And, of course, again, without knowing a priori kind of what -- just to say at a high level, to answer Misha's question: the constraints you need on the structure of your distribution fall into what's often called covariate shift, which basically means that if you think about a joint distribution on X and Y, and you're looking at the performance of a conditional model of Y given X, you basically assume that the conditional of Y given X is the same across domains, and only the marginal on X changes. So once you start playing with that, you really need some extra information beyond just, you know, I can see some unlabeled data, right. Because you can think of an adversarial setting where you get to do whatever you want on the books domain and as much unlabeled kitchen appliance data as you have, and then I get to look at your predictor and change the output of my classifier, right? There's nothing you can do. But, I mean, if you know some relationship, you can constrain your model using that. Yeah? >>: So I'm wondering how the sub-space psi changes according to the rating. I guess this is kind of a follow-up to my last question and Robert's last question, the changing of the meaning of words. For example, the bigram not work contains the unigram work. So if you have both of these on your axes, then for a negative review, you will see both, right? >> John Blitzer: Yeah. >>: So if you were to just find the sub-space on the negative reviews, you would see the X equals Y line, whereas if it were a positive review, you would just see work, where you would get this Y equals zero line. But if you put those two together and find the sub-space that tries to, you know, work for both kinds of reviews, you find something in the middle. And I wonder -- I mean, on the one hand, sub-space projection kind of gets rid of noise because it projects things. On the other hand, it also blurs the differences. And I'm wondering what's your insight on the effect of learning the same sub-space for all different scores, which is what you're doing. >> John Blitzer: Yes, you're absolutely right. >>: Across domains, but for all scores. >> John Blitzer: For all scores. >>: You're assuming that the sub-space is the same. >> John Blitzer: That's right, yeah. I guess I'll answer that in two ways. First, this is a good question. In detail, we actually do handle that case explicitly. We look at the bigram and we don't allow you to use its unigrams to predict it. So for that particular case -- and that actually does make a difference. You can still do well without it, but it does make a difference in the final performance. The second is that we can't deal with everything that way, right. So you can deal with that, but you can't really deal with I and highly recommend, for example, right, because there are certain things that are just difficult to deal with in general -- you always expect some correlation. And that's kind of where the projection helps. Now, for the washing out, remember that we learn separate weights for each dimension of the projection, right. So it's linear in the projected sub-space. But that means that there are many, many dimensions which actually aren't predictive at all. I showed you one that is.
But there are plenty which are just dumb, or distinguish between topics in the books domain. Religious books versus sci-fi books, for example, is one. And that's not useful for sentiment, but that's okay, because as long as there's continuity across domains, we can learn that from just looking at labeled books data. You look at the books domain, you see, oh, well, this thing isn't very predictive, and therefore I just don't assign it any weight in my predictor, right. >>: [inaudible]. >> John Blitzer: Yeah. >>: You said you have a feature for don't work and another feature for work and you don't populate those. >> John Blitzer: Yeah, so suppose -- >>: [inaudible]. >> John Blitzer: It solves that problem, yeah, but there are other kinds of subtle problems, just general syntactic phenomena, right. Like, I guess, I highly recommend is something where we might not have the trigram I highly recommend in there, but you still -- you really would wish that you could say, oh, well, never predict highly recommend using I. >>: I don't understand that. It seems like if it turns out that people using the personal pronoun are giving positive reviews [inaudible], then you're discovering that fact. >> John Blitzer: Yeah, that's true. And it turns out for that one that, you know, people also say I hate this, and so I actually isn't predictive. But yeah, that actually is a real noise case. But you're right, sometimes that happens. Part of the reason is that you just build as much intuitive structure as you can into the model and kind of empirically see whether it works or not. You can't characterize all of human language in the structure of your model. So actually, there is another half to the talk, but maybe I -- >> Silviu-Petru Cucerzan: Maybe we could save some questions for the end. >> John Blitzer: So how much time is left? >> Silviu-Petru Cucerzan: Half hour. Half hour for your presentation and the questions. So it's good to be interactive, but I know how -- >> John Blitzer: Well, okay. So I'll finish this. The next half -- it's not actually half, it's more like the next third of the talk. >> Silviu-Petru Cucerzan: Feel free to manage the time. >> John Blitzer: It's till 11:00; is that right? 10:45. >> Silviu-Petru Cucerzan: 12. >> John Blitzer: Oh, till 12, yeah. So the next part of the talk is going to be about projecting information across languages for web search. All right. So I have my two queries, right, salmonella and [speaking Chinese]. And I have with that a bunch of English and Chinese documents that I've retrieved. I have this actually for many, many queries, right. I'll explain in a bit how I get them. But another one might be British history and [speaking foreign language]. And the basic idea here is -- in addition to that, you can almost think of it as I have my ranker's output for the English as well, right. So I know actually how to rank the English documents. So my goal here: where I write E1, this is the best English document; where I write E2, this is the second best for that particular query. What I don't know, though, is how to rank the Chinese documents, right? This is what I want to output. I want to say, oh, well, you know, the first document in my unordered list is actually ranked 15, and so on. This is what I want to output.
For those of you who know -- I guess people are roughly familiar with cross-lingual IR -- people kind of know the setup here. The setup in cross-lingual IR is: I see a Chinese query, and I want to rank Chinese and English documents. For the English documents that rank high, I want to translate back into Chinese and show you Chinese output. The reason I'm trying to avoid this is -- I guess with all due respect to people who work on machine translation -- we're not quite there yet, right? And I want to say that right now, we can still give you a better ranking, a purely monolingual ranking, for the Chinese documents without ever showing you translated English output. I want to emphasize that. The user never has to deal with machine translated output. Okay. So one question you might ask is, well, okay, you're going to do this for bilingual queries, but how many queries are really bilingual? There are at least two kinds of phenomena that we can't deal with. So one is phenomena like [speaking Chinese], which is, you know, the Chinese translation of overview of learning to rank. And you might say, okay, if I had a really good dictionary, I should be able to look this up, and English should help me out here. But the real truth is we can't get this, just because it's not common enough and we can't really identify that this Chinese query corresponds to an English query which we could do really well on. The second is kind of the opposite, where I can actually translate it right. So for this query, [speaking Chinese] -- Changhong is probably one of the biggest electronics makers in China -- I can translate this just fine into Changhong TV set, but if I search for it in English, it's not going to be very helpful in ranking the Chinese documents. All we do here is something really simple. We take an automatically mined dictionary, we threshold the query logs at some number, and then we just look things up there. And it turns out that, you know, not a huge number, but a significant number -- so here in this case, for the Chinese query log, 2.3% of the queries are actually in the English query log. And there are many, like this one, which we hope we could get eventually, like overview of learning to rank, but we can't get yet. We could get it if we had better machine translation, maybe. >>: [inaudible]. >> John Blitzer: Say that again. >>: By queries or by volume? >> John Blitzer: This is, I think, by volume. I'm not 100% sure. So okay. So in order to train and test -- I guess I'll go through this pretty quickly. So at training time, I'm going to see some bilingual queries. I'm going to get both the English and Chinese ranking. My goal here now -- there are going to be several steps. I want to take this and construct a ranking on pairs of documents, right. So initially, I had two monolingual rankings; now I'm going to construct a bilingual ranking on pairs. And from that, I'm going to learn a ranking function for these pairs. And this will use kind of standard machine learning techniques. Now when I see a new query, a new bilingual query, I will run this through my ranking function, get out my hypothesized ranking on pairs, and now I need to convert the Chinese side back into a monolingual ranking on Chinese documents. Okay.
So basically, there are these three steps: constructing a joint ranking from monolingual rankings, learning a ranking function on pairs, and reconstructing the monolingual ranking from the bilingual ranking. Okay. So constructing the bilingual ranking actually turns out to be quite simple. There are many ways you could consider doing it. Here we force the bilingual ranking to be absolutely consistent with the monolingual rankings. What I mean by that is that I only rank a pair, English 1, Chinese 1, higher than English 2, Chinese 2, if English 1 is higher than English 2 and Chinese 1 is no lower than Chinese 2, or vice versa. So what I have then is I can look at a monolingual ranking and construct a bilingual ranking which is consistent with it. The second thing I need to do is learn this joint ranking function. And here we use a standard rank-SVM style objective. So basically, when I write this, for a particular query I want to look at all bilingual pairs, and I introduce basically a hinge loss penalty for each pair that's ranked incorrectly, and again I have a feature vector on pairs of documents and their query. Okay. So this is a pretty standard setup. The only interesting thing is kind of what features I can introduce now that I have pairs. So I have all the standard features, and I'm sure you guys know much more about what those are monolingually than I do. That's kind of top secret. They don't let visiting researchers know that. But one thing we can introduce is just bilingual dictionary similarity and machine translation similarity, and kind of weighted versions of these two, as well as URL similarity -- so I might say, oh, well, airbus.com and airbus.com.cn are kind of similar -- and I can introduce all these features that are generated from the pair, rather than from any single monolingual document. And really, this is what we expect to help us, right. This is what we hope will actually improve performance. So the final thing is how do I convert back. So I can now build a new ranking on these pairs. And I've written here on the left -- again, when I write EC1, I mean this is the best pair according to my ranking function. And now, I see a Chinese document and I say, okay, well, where should I insert this Chinese document in my final Chinese ranking? In this case, maybe it's occurred in position 1 and position 23. And it turns out that -- so there's no heuristic that's going to be completely consistent. Before, remember, we generated the training pairs to be completely consistent with our monolingual rankings. But now, and I won't go into details, the ranking on pairs might not be consistent with any monolingual ranking. We have to do something. What we do is just rank a Chinese document by averaging its position across the pairs in which it appears. So in this case, we say, oh, well, it should be ranked 12 here. And there are more sophisticated variants of this that you might consider. In particular, I guess, people here must be familiar with this area called rank aggregation, where the idea is you see multiple rankings and you want to aggregate across them. So my co-authors now have results showing that with rank aggregation -- you can kind of aggregate across all the possible Chinese rankings implied by these pairs -- you can do even better.
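Here is a minimal sketch of the two bookend steps just described: generating training preferences over bilingual pairs that are consistent with both monolingual rankings, and recovering a Chinese ranking from a ranked list of (English, Chinese) pairs by averaging the positions at which each Chinese document appears. The data structures here are illustrative assumptions; the actual system learns the pair ranking in between with a rank-SVM over query/document-pair features.

```python
# Sketch of (a) building bilingual training preferences consistent with the
# two monolingual rankings and (b) recovering a Chinese ranking from a ranked
# list of (English, Chinese) pairs by averaging positions. Illustrative only.

from collections import defaultdict
from itertools import product

def consistent_preferences(english_ranked, chinese_ranked):
    """english_ranked / chinese_ranked: doc ids in ranked order.
    Yield ((e1, c1), (e2, c2)) where the first bilingual pair should rank
    above the second: e1 strictly above e2 with c1 no lower than c2, or the
    symmetric case."""
    e_pos = {d: i for i, d in enumerate(english_ranked)}
    c_pos = {d: i for i, d in enumerate(chinese_ranked)}
    pairs = list(product(english_ranked, chinese_ranked))
    for (e1, c1), (e2, c2) in product(pairs, pairs):
        if e_pos[e1] < e_pos[e2] and c_pos[c1] <= c_pos[c2]:
            yield (e1, c1), (e2, c2)
        elif c_pos[c1] < c_pos[c2] and e_pos[e1] <= e_pos[e2]:
            yield (e1, c1), (e2, c2)

def chinese_ranking_from_pairs(ranked_pairs):
    """ranked_pairs: (english, chinese) pairs in the order predicted by the
    pair ranker. Score each Chinese document by the average position of the
    pairs it appears in, then sort ascending (positions 1 and 23 -> 12)."""
    positions = defaultdict(list)
    for pos, (_, c) in enumerate(ranked_pairs, start=1):
        positions[c].append(pos)
    avg = {c: sum(p) / len(p) for c, p in positions.items()}
    return sorted(avg, key=avg.get)
```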
But this actually tends to work already quite a bit better than the monolingual ranking. So for what I'm showing you here, I have to be honest now: we actually finished this project after we left Microsoft. So there's a problem with that, in that we wanted to actually report NDCG, but we can't, right, because that's all Microsoft internal stuff. So what we did was we took the click-through, and now we're reporting, for the top queries, how we compare with ranking based on click-through alone. And you can still do significantly better bilingually than you can monolingually on this. So basically -- I mean, we have lots of queries, right. So it shouldn't be surprising that this difference is statistically significant. We have, I don't know, 10 or 20 thousand queries to test on. And so we can actually get a big gain by combining these two. Okay. The last thing I want to show you is which queries are most improved -- again, the qualitative version of this. Obviously, British history and salmonella -- I picked them for a reason. Political cartoons actually is one that might be a bit controversial, but it turns out to be one of the most improved Chinese queries. It turns out that if you can search for English political cartoons, that helps you a lot in finding political cartoons in Chinese. Just for fun, we actually tried this in the opposite direction, right. We tried to rank English, and it turns out that -- not that I'm advocating this, but for any of you guys who actually do watch pirated TV, you'll notice that almost all of the best English sites for pirated TV are actually routed through China. It turns out this is actually the number one most improved English query. If you search for free online TV, the actual best results are these. The others are less interesting, except perhaps for aniston. It turns out that, at least during the period we sampled the query log, Friends was hugely popular in China, and Jennifer Aniston was also very popular, and her Chinese name doesn't include her first name. So this is just aniston, which is the transliteration of her last name. And basically, the idea there is just that maybe in English, if you forgot her first name -- what is it, Jennifer Aniston or Jessica Aniston -- and you just searched for aniston, which people do do, you can actually get a significant performance increase by looking at the Chinese results. Okay. So I guess I'll end with this. So there are a lot of things that I'm interested in. I showed you guys these two. All of them kind of fall under this idea of building robust models across multiple domains. So one that came up a little bit is how we can characterize theoretically when algorithms like the ones I described here work well. This is work I did with Shai Ben-David and Koby Crammer and Jenn Wortman and Alex Kulesza, and I'm continuing this work as well now with other people. But I think it's very interesting to consider: I give you two distributions, and what conditions on these distributions do I need such that an algorithm which learns on one and observes nothing about the other does well? Or observes a few instances of the other and does well, right.
The second -- and actually I'm going to talk about this on Friday; I don't want to compel you guys to come, but I'm going to talk to the NLP group about using some of these techniques in machine translation. So it turns out, just at a high level, there are all these different components to machine translation, and they're trained on all different domains. These domains might not be comparable to the ones you want to translate. Again, there's this idea of: can I use all of these different components together to train a better joint model on the domain I care about? So other things I'm interested in fall under the general rubric of -- I don't know what to call this -- natural supervision. So, for example, there's been recent work on face and object recognition that's really good. But now, if I know kind of the syntactic structure, can I recognize verbs? For example, here's a picture of the Yankees pitcher who won the World Series, holding up his World Series trophy. Can I look at a bunch of examples of the verb holding, and if I knew the nouns, could I learn something about holding in general? And another thing that I'm interested in, that I think people here have already done good work on, but where I think there's still a lot of interesting depth, is: can I build a model for search or ads -- let's say, serving ads on the Wall Street Journal -- and now I want to serve ads on WordPress or something, right. Can you do this without any labeled data or without any click-through on a new blog? Can I serve the right ads, right? And, you know, sort of questions in this vein. Okay. So that's it. Thanks. [Applause]. >>: We have a few minutes for questions. >> John Blitzer: Um-hmm. >>: What were the numbers of the initial [indiscernible] when you were showing books and DVDs and electronics? You had a varying number of -- >> John Blitzer: Yeah, it turns out that just more is better, always. >>: Was it the case that you had a lot more pivots in electronics and DVDs than kitchen appliances? >> John Blitzer: Oh, I see. Yeah, you're right. You're right about that. You can actually construct more if the domains are closer. That is true. And in practice, we always use the same number. We always used 2,000. >>: Project onto the same -- >> John Blitzer: That's right, but that is a good question. You can construct more from more similar domains. We haven't done that experiment. I suspect that you would be able to do better. Part of the thing is just that the amount of vocabulary overlap really helps you. So even with the same number of pivots you can get, I guess, obviously better results. >>: Could someone generalize this without looking at the second domain? In other words, like building some sort of -- finding ways to make the model more general. Have you thought of that? In other words, could you improve your performance on [inaudible] looking at it [inaudible]? >> John Blitzer: That -- I mean, I guess I haven't come up with any good ideas for that. In some sense, that, in general, seems very hard. Like, I mean, consider not knowing anything about the reviews that you want, right. So if you make some assumption about, like, you know, oh, I guess common words in books are more appropriate, but -- >>: [inaudible] in your model, you assume some really powerful word like, you know, page-turner. Say that was a great thing. You assume that -- you're going to learn how to predict that, rather than actually use it.
>>: Could someone generalize this without looking at the second domain? In other words, building some sort of -- finding ways to make the model more general. Have you thought of that? In other words, could you improve your performance on [inaudible] looking at it [inaudible]?
>> John Blitzer: I guess I haven't come up with any good ideas for that. In some sense, that in general seems very hard. I mean, consider not knowing anything about the reviews that you want, right. So if you make some assumption like, oh, I guess common words in books are more appropriate, but --
>>: [inaudible] in your model, you assume some really powerful word like, you know, "page-turner". Say that was a great thing. You assume that you're going to learn how to predict that, rather than actually use it. You might assume that the general things that help you predict it would still be predictive in the next model.
>> John Blitzer: So I actually --
>>: [inaudible].
>> John Blitzer: I guess you're right, although that still depends on you knowing that, you know, "a breeze" is equivalent to "page-turner" when you see it, right?
>>: I was saying, I guess my thought would be something along the lines of: anything that I find is really super strong in my training corpus, I shouldn't trust. I should move one step away from it and use the things that predict it, and hope that those things [inaudible] my space, those predictors, so that I'll do better when I get to the next one. I just assume some fraction of my strong things -- "great buy" probably doesn't show up as much in one of the others.
>> John Blitzer: I see. So one variant of that is just to ignore the kitchen appliances data and run exactly this algorithm with only books.
>>: That's what I was wondering.
>> John Blitzer: We did do that. It does help a little bit, but it's not nearly -- I mean, if you think about the intuition of, oh, I've just never seen that word before, right --
>>: You still wouldn't be able to use words you haven't seen.
>> John Blitzer: Right. Those turn out to be --
>>: The words surrounding it would be more helpful. Like your "I" example. If, in fact, it turned out that "I" was always a positive word or generally tended to be a positive word, that's still going to show up in your new corpus.
>> John Blitzer: Right.
>>: And you can still be okay. So as long as you tend to trust things that were less -- I guess one of the things -- we did spam filtering stuff, and we had the exact same problem. Since it was spam, you'd think they would stop using the words you found most [inaudible]. They never did. Obviously, they stopped using the [inaudible]. And the model compensates by looking at things like "best price" or things like all exclamation points.
>> John Blitzer: Yeah, I mean, I guess --
>>: It would be nice to build a more generalized model without having to see the future.
>> John Blitzer: Right. Let me actually say something to that. There are actually algorithms which are online, right, even in the unsupervised case. These are bootstrapping algorithms where you see a new example, you absorb the new vocabulary from that example as you learn, and you kind of fuzz out over that, right. So, in your example -- okay, suppose I see "a breeze" right away. Then you can say, well, I've never seen "a breeze" before, but I fuzz it out upon seeing it. So I can't get this one right, but the next time I see it, I've absorbed it in an online fashion. And in that sense, you can be adaptable. I guess spam is particularly tricky, right, because it really is an adversarial situation.
>>: It's a limited adversarial situation, but still, a human has to read the final --
>> John Blitzer: Yes and no, right, because the spammers will read mail too and mark their own messages as not spam, right?
>>: [inaudible] access to the rating system, if you will. [inaudible].
>> John Blitzer: The ratings? Oh, you actually pay people to do --
>>: We don't. We get volunteers. Volunteers who [inaudible]. So since you can assume your ratings are --
>> John Blitzer: You can detect when a spammer is using Hotmail.
>>: Spammers don't use it enough. We can't get enough traction. Assume the ratings are fine.
What they can do is [inaudible] try to alter it so --
>> John Blitzer: I'm saying, at least -- I mean, take this with a grain of salt, because it comes from conversation with the people at Yahoo, but my impression is that most people on Yahoo mail are spammers. So you have to have some way of cleaning up your labels just initially, right. So every spammer -- I mean, he was describing the problem this way. A spammer sends out a mail -- I mean, a botnet sends out mail, right, including to a lot of Yahoo addresses that he owns. And he automatically logs in and marks that mail as not spam, over and over again. Then eventually, he'll send it out kind of --
>>: This is off the subject of your topic. The way we do it in Hotmail, we have two separate ways to get spam. One is the voluntary "I'd like to mark this mail" one way or the other. But the primary way is actually something called the feedback, where we opted in about one-tenth of one percent of our users and send them a random sampling of their own mail once a day -- one mail each day that says, please mark this as spam or not.
>> John Blitzer: I see.
>>: Volunteers who we asked to join, and they were already users for three months or something like that.
>> John Blitzer: Right, I see.
>>: So from looking at the data, it doesn't look like these users have been substantially -- I didn't find any value from trying to clean those users out when I tried to find the bad guys.
>> John Blitzer: Yeah, that's interesting. I mean, I can actually -- it's not my work, but I can send you pointers to kind of --
>>: [inaudible].
>> John Blitzer: Although there are newer online algorithms, in the purely supervised setting, that try to deal with this basically by saying, oh, well, if I got something wrong, then I should adjust the weights of the features I haven't seen more than the ones I have seen. Oftentimes, with the gradient-based methods, you take a pure gradient step, and what that basically means is that all the features are treated equally, right -- it's linear, you linearize the function around that particular point, and all the features look the same. You can think of things that are more complicated, where you keep track of some, let's say, second-order information online, and then when you see an instance you can update based on which features are there or not. One thing that's interesting to me -- I don't have any great ideas yet -- is that if I see an instance and I make a prediction about it, and I trust the prediction I made, then can I somehow adjust the weights of the new features based on my own prediction?
>>: [inaudible] ongoing algorithm. Also, I think there's always the case that there's some split like the one you're talking about, like the electronics versus the books -- that's the real world. We never have the same test data that we had in training. The world is always shifting under your feet.
>> John Blitzer: Yeah, that's right.
>>: So if you can make your model generalize better across two domains, [inaudible] the second one, then you would have something that would probably just do better in the real world, when you're talking about normally categorizing things and then all of a sudden somebody shoots somebody else and all those arguments [inaudible] or whatever it is.
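A tiny sketch of the kind of per-feature, second-order-flavored online update being described above, assuming an AdaGrad-style accumulator; the confidence-weighted style algorithms from that literature keep richer statistics, and the function name here is just illustrative.

    import numpy as np

    def online_step(w, G, x, y, lr=0.1, eps=1e-8):
        """One online update of a linear classifier with hinge loss, where each
        feature keeps its own accumulated squared-gradient statistic G, so
        rarely seen features take larger steps than frequently seen ones.
        w: weights, G: per-feature accumulator, x: feature vector, y in {-1, +1}."""
        if y * np.dot(w, x) < 1.0:             # hinge loss is active on this example
            grad = -y * x                      # gradient of the hinge loss
            G += grad ** 2                     # per-feature second-order information
            w -= lr * grad / (np.sqrt(G) + eps)
        return w, G

The division by sqrt(G) is exactly the contrast drawn above: a plain gradient step moves every feature at the same rate, while this moves features you have rarely seen more aggressively.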
>> John Blitzer: So let me give you an easier version of your problem, which I think is maybe more feasible: I don't actually know what the new domain is, but I know that I'm going to be exposed to it eventually, right. Because I really feel that you can't -- you can't really build an algorithm that's general enough such that -- I mean, maybe if you have a lot of problem-specific knowledge. But in general, you build an algorithm and you can't expect it to cover a new domain well without ever seeing it, right. But I do feel that your intuition is right. To be more extreme, I introduced kind of an artificial version of this problem, where I had electronics data, and I knew, oh, this is called electronics and I have a lot of it, and here it is, and go. But in practice, of course, you might know, well, I'm going to have to apply my model somewhere that doesn't look like training, but I don't know that it's called electronics -- and maybe I'm deploying versions of my model simultaneously, and I want each of them to be adaptive to the data that it's looking at, and I'm exposed to that data over time.
>>: [inaudible] I just wonder, almost any kind of -- if there's anything you can generalize from the fact that -- I don't want to repeat my question.
>> John Blitzer: Well, let me say one thing, because I do think this is really interesting.
>>: [inaudible].
>> John Blitzer: Okay. Well, there's this hot-start/cold-start issue if you think about all these recommender systems, where typically, you know, you see a new user, and it's: do I cold-start this user, right? But in reality, you really can't do well on a user cold. Or anything like speech, right. You say, I wish I could build a general speech recognizer that worked on everyone regardless of accent, right? I mean, even I can't do that, right? If I go and meet someone from -- you know, actually, the cab driver this morning was from Russia. After about a minute or two, I could talk to him fine, but it just takes a little while. You've got to be exposed to something, I think. Unless you can actually get data from all the world's languages offline, then maybe. But again, there are new domains being created all the time, this kind of thing. So anyway, it's really interesting. We'll talk later.
>>: A simpler problem in that respect would be what Michael was suggesting again. Could you actually use the data, the information you get from -- you have a lot of books, and then now you have the DVDs, and then you have the electronics. Can you learn from that how to apply the book data to a new domain, right? I mean, you have a distribution from which you have a lot of data, [inaudible] data, and then you have these other distributions from which you have just tiny amounts of data. Can you still learn what's general enough in my initial distribution that I could use on any new domain?
>>: Yeah, just like when you go to a new country. [inaudible] English differently [indiscernible] like using big expressive hand gestures lets nobody know if you're happy or sad. Or the other way around -- I've learned over time, if I can't share my words, use facial expressions --
>> John Blitzer: This is interesting.
I don't think it will get at the whole problem, but this is a studied problem in machine learning. It's called multitask learning. Basically, if you have labeled data from electronics, right, and you want to say, well, I don't know where or when I'm going to see my next domain, but I know it will be a new domain -- now learn something that I expect to generalize to my next problem.
>>: Exactly. I'm assuming there have got to be some things in books, or something you could use to figure out what's different. Maybe just assume anything that's really of high value is probably specific to my domain.
>> John Blitzer: That's right, yeah, that's right. So I think --
>>: Like, what's the author you had?
>> John Blitzer: Grisham.
>>: You might just automatically distrust features that are too strong, and try to generalize any features that are popular but not strong.
>> John Blitzer: Yeah, that kind of heuristic, actually, unfortunately tends not to work well.
>>: I never tried that, but again, [inaudible] the mail -- if you learn the names, if you [inaudible] the names of the people who are in your training set and they never show up in the test set, they just make your model worse. You never try to remove them [inaudible] users. When you have only 100 users, each of those people's last names was just a [inaudible] feature to the [inaudible].
>> John Blitzer: Yeah, yeah. I mean, I guess there is this set of literature, the multitask literature, but one version of the problem that I don't think is studied yet is: I see unlabeled data from one domain, but I know I'm going to test on a different domain, right. So that could be something, right. Like, I know I'm going to France, and I watch a video about France, but then I go to Russia, right. Something like this; I don't know what the real-world analogy is. But yeah, I agree with that. That is interesting. I mean, there's this whole sub-field of machine learning, a kind of cottage industry, called transfer learning, where basically the idea kind of absorbs everything I did, and also where you actually do see some labels as well. And I think in the case where you start to see labels, you start to learn some things. But -- and you're right, you would like to learn what from books is general, right. That is true. And, regardless of where you're going, there's some core of -- I don't know -- what does it mean to be sentiment?
>>: I wouldn't expect you to ever do as well as actually being able to look at the appliance data first. But the question would be, could you do better than you would --
>> John Blitzer: Yeah, I think that's right. It turns out that the heuristics you'd actually suspect would work, at least in all the problems I've looked at, don't work so well. Basically, the one you suggested is one of the first things you look at: oh, well, I should ignore things that are too good on books. Or, if I have unlabeled data from kitchen appliances, I could drop all features which don't appear in kitchen appliances, right? That kind of thing always works no better, and maybe sometimes a little worse --
>>: Something along the lines of what you did -- learning to predict the high-value features from [inaudible].
>> John Blitzer: Yeah, except --
>>: I know that --
>> John Blitzer: Except that now you're still going to have gaps, right, because the corpus you're actually interested in, you've never seen before. But yeah, I think all of those things are -- I mean, it is a fruitful area. I guess I feel like the online setting is probably the most compelling to me, because then you can start in the scenario you're talking about and slowly move to one where you actually are seeing the data that you really want to deal with.
>>: I guess I'm really interested --
>> Silviu-Petru Cucerzan: Let's end here and continue the discussion later. We're going to set the record for a talk. Let's thank the speaker again.