>> Chris Burges: So we're delighted to have Yisong Yue here today with us. Yisong was my intern in 2007. He was awarded a Microsoft Research fellowship after that. He's finishing up his PhD with Thorsten Joachims at Cornell. And today he's going to talk about a new learning framework for information retrieval. >> Yisong Yue: Thanks, Chris. And thank you all for having me. Today I'll be talking about new learning frameworks for information retrieval. And it's joint work with my advisor Thorsten Joachims at Cornell. So we are living in an increasingly data driven and data dependent society, and as a result, having the ability to efficiently and effectively manage, organize, and leverage this growing sea of information is becoming correspondingly more important over time. And although when we think of information systems, many of us first think of things like web search services, in reality many of the services that we use and interact with today provide, amongst other things, a way for us to manage, leverage, and retrieve information of different sorts. And although this is by no means an exhaustive list, I think it's clear that information systems are useful and can be applied to any domain that needs to manage a growing amount of digital content. And I would argue that that is every domain that is important to society today. So this is a machine learning talk, and I am a machine learning researcher. And one of the reasons I'm even here giving a talk is the rising popularity of applying machine learning to designing models for information systems. The benefit is that machine learning allows us to optimize models with an enormous number of parameters, ranging from thousands to millions, to sometimes even billions of parameters. And these techniques have been successfully employed in developing many of the information systems that we use today. However, the standard techniques and the conventional learning algorithms that people typically use are limited, and here are two ways that I think are pretty important. First of all, conventional machine learning techniques make restrictive independence assumptions. So, for example, when training a model for a ranking function, you assume that the ranking function will compute the relevance or quality of individual results independently of other results. The second limitation is that deploying these machine learning algorithms typically requires a fairly expensive pre-processing step of gathering a labeled dataset, and this often requires fairly extensive and expensive human annotation. And this inherently limits the scope and scale at which these algorithms can be deployed across many different ranges of applications. So in this talk, I'll be discussing my research, which addresses both of these issues. In the first half of this talk, I will be describing a structured prediction approach for modeling interdependencies. You could think of this as relational data. In the second half of this talk, I'll be describing an interactive learning approach, where the idea is that you want the system to learn interactively with its environment, in this case a population of users, using feedback from the environment, so feedback provided by the users. And these algorithms and methods were designed with information systems in mind.
And I think that information retrieval is an important application area. But I think you'll see that these algorithms also motivate new applications and are themselves fundamentally new machine learning models. All right. So here is a quick three slide introduction to structured prediction by way of examples. Perhaps the simplest interesting example of structured prediction is first order sequence labeling, or part-of-speech tagging in this case. So here the idea is that given a sequence of words X, we want to predict the sequence of tags Y, which are the part-of-speech labels. So, for example, for the sentence the rain wet the cat, we want to predict determiner noun verb determiner noun. And here the dependencies come from the transitions between tags. So we care not only about how often, say, the word rain maps to noun as opposed to verb, but also how often a determiner is followed by a noun in an English sentence. A more complicated example in natural language processing is, given a sequence of words X, to predict the entire parse tree Y. So here, for the sentence the dog chased the cat, we want to predict this entire parse tree, where the structural constraints come from how often the sentence node decomposes into a noun phrase and verb phrase, how often that noun phrase decomposes into determiner noun, and so on and so forth. In information retrieval, given some query X, we want to predict a set of retrieval results Y, typically expressed as a ranking. And here we can think about dependencies between the results. So for example, we might want to model the redundancy between documents that we're considering to retrieve, in order to reduce that redundancy. And that's what I'll be talking about today, diversified retrieval. Now to motivate diversified retrieval, I want to quickly tell you a really short story about Misha Bilenko, the curious high school student. So, you see, Misha has heard about this magical field of machine learning, but he doesn't know too much about what it is, and so he's really curious. And so what he's going to do is go on his favorite search engine, type in the query words machine learning, and maybe just read the top five documents to get an idea of what machine learning is all about. So here are the top five I pulled from my favorite search engine, which shall remain nameless. And, you know, they are relevant, right? I mean, they're all individually relevant. But they're all kind of the same, kind of redundant, right? Because we have two results for machine learning textbooks, we have a result for the machine learning journal, which is really just an obnoxious precursor to a textbook, and we have this AI topics machine learning link, which is very similar in spirit to the Wikipedia article. So it doesn't provide a very comprehensive and cohesive view of all the concepts of machine learning on the Internet. So what if instead, when Misha typed in machine learning, he saw these results? Well, cool. We have the Wikipedia article link. We have a link to the International Machine Learning Society, which is the premier society for machine learning researchers. We have a link to an industry lab doing machine learning. We have one link to a textbook. And if Misha was interested in looking at video lectures, we have a link to the video lectures page as well.
So I'm not claiming this is the perfect set of five results -- I don't even know what perfect exactly means here -- but it clearly provides a more comprehensive view of all the different subtopics or concepts related to machine learning that you can discover on the Internet. All right. So there's been a lot of interest in diversified retrieval, ranging from the seminal paper in 1998 by Carbonell and Goldstein onward, and in all this previous work the researchers who conducted these studies noted that in order to optimize for diversity you need to model inter-document dependencies. And this is, again, impossible with the standard independence assumptions of the models we typically train in machine learning. And so the solution I'll be presenting today is a coverage based approach, and I'll present a structured prediction learning algorithm for training this model. All right. So here is an abstract depiction of the prediction problem. You could think of this entire region as the space of information relevant to some query. These shaded regions are the different documents for this query. The area covered by a region is the relevance of the document, so bigger is better -- more relevant. And the overlaps between documents are the inter-document redundancies, the redundant information covered by both documents. So the conventional model, which evaluates the relevance of each document independently, would first select D3, because it's the most relevant. Then it would select D4, even though, conditioned on selecting D3, it's almost completely redundant, because it's the next most relevant independently. And then it would select D1. In the coverage based solution, we would first select D3; then, conditioned on selecting D3, we would select the next most relevant document, which would be D1. And conditioned on both D1 and D3, we would then select D5, because it covers the most remaining uncovered relevant information. So this is a coverage based approach. So the idea here is to view diversified retrieval as a coverage problem. And if we had the quote/unquote perfect representation of information, then we could just deploy the greedy algorithm I described on the previous slide to compute a good coverage solution. The challenge here is to discover what a good representation of information is. And that's the learning problem. But the good news is that once we can learn this good coverage representation, we can make good predictions on new test examples. And this is a supervised learning approach, because we'll be requiring manually labeled subtopics as training data. So it doesn't address the second limitation that I talked about at the beginning, but I'll get into that in the second half of this talk. And I'll be presenting an algorithm with theoretical guarantees and also a software package that you can play with yourself if you're interested. Okay. So how do we represent information? Well, there are many different ways. Perhaps the first thing you could think of is just all the words in the candidate documents we're considering to retrieve. That's sort of the lowest level or rawest form of information. We could get a little bit fancier and think about things like the title words, the anchor text, meta text, and so on and so forth. We could think about representing information at a somewhat higher level by thinking about cluster memberships of the documents, how they cluster.
So maybe topic models or dimensionality reduction techniques like LSI. We could even use external taxonomy memberships like the ODP hierarchy. In this talk we'll just be focusing on the words, the lowest level, rawest representation of information. Okay. So intuitively speaking, if we selected a set of documents that covered as many distinct words as possible, then we're covering as much information as possible. Now, of course, not all words should be weighted equally. Some are obviously more important than others. So if this is our representation, then we need a good weighting function. But suppose we had a good weighting function. Then we could formulate the prediction task as selecting K documents which collectively cover as many distinct weighted words as possible. We could again deploy the greedy algorithm, and these types of coverage based problems are known as submodular optimization problems, so the greedy algorithm actually has good performance guarantees. Again, the learning goal in this specific formulation is to learn a good weighting function. Okay. So how do we weight words? Well, surely not all words are created equal, right? The word the is probably almost close to meaningless -- yeah? >>: I have a question. You were saying that one [inaudible] approximation. So this means that the number of documents you need to cover all the words would be -- >> Yisong Yue: No, no. So the objective function is the total number of weighted words you've covered. That's the objective function. >>: With a fixed K? >> Yisong Yue: With a fixed K. So the 1 minus 1 over e approximation means the greedy algorithm is competitive, with a 1 minus 1 over e bound, against the optimal set of K documents. Yeah? >>: So your notion of overlap is overlap on words [inaudible]. >> Yisong Yue: That's how we represent the information. So yeah. >>: So what if I [inaudible] take your example and Misha wanted to know all the textbooks [inaudible]. There's going to be significant overlap. So you're not taking into account intent, you're assuming that a diverse result [inaudible] is the best? >> Yisong Yue: Right. So it's certainly not the case that you want to diversify in this fashion on all possible queries. What we're exploring here is: suppose you did, right? Suppose you are interested in designing models that retrieve based on diversity over the set of relevant documents. >>: [inaudible]. >> Yisong Yue: I'm sorry? >>: Okay. >> Yisong Yue: And then suppose we wanted to build models to do that. How do we think about designing these models? This is one way to do so. And if Misha had typed in the query machine learning textbooks, you don't want to return the same kind of textbook, right? Maybe you want to diversify on textbooks. So how do we weight the words? Well, not all words are created equal. So for example, the word the is probably not that informative for almost all queries. It's also conditional on the query, right? So the word computer is normally fairly informative about a certain subtopic for most queries, like maybe education, but not for the query ACM, because it probably appears in basically all the candidate documents for the query ACM. So we really need a weighting function that's based on the candidate set of documents.
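To make the prediction task concrete, here is a minimal sketch of the greedy selection just described, written in Python purely for illustration. The word_weight dictionary is a stand-in for whatever weighting function you end up with, learned or hand-designed, and documents are treated simply as sets of words.

def greedy_coverage(candidates, word_weight, k=5):
    # candidates: dict mapping doc_id -> set of words in that document
    # word_weight: dict mapping word -> non-negative benefit of covering it
    # Returns a list of k doc_ids chosen greedily by marginal coverage gain.
    selected, covered = [], set()
    for _ in range(k):
        best_doc, best_gain = None, -1.0
        for doc_id, words in candidates.items():
            if doc_id in selected:
                continue
            # Marginal gain: total weight of the words this document adds
            # beyond what is already covered (the submodular part).
            gain = sum(word_weight.get(w, 0.0) for w in words - covered)
            if gain > best_gain:
                best_doc, best_gain = doc_id, gain
        if best_doc is None:
            break
        selected.append(best_doc)
        covered |= candidates[best_doc]
    return selected

Because the objective, the total weight of distinct covered words, is monotone submodular, this greedy loop is exactly the kind of procedure that carries the 1 minus 1 over e guarantee mentioned above.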
Now, there has been some prior work in this area. It's called essential pages, and it was actually done here at Microsoft. What they did was use a fixed function for computing word benefit, the benefit of covering a word, and it depends on the word frequency in the candidate set. So for example, if a word appears in 100 percent of the documents in the candidate set, then it has zero benefit of being covered, because it's not useful in teasing apart the different subtopics or different concepts. If a word appears very rarely in the candidate set, then it also has low weight, because it's not very representative. You could think of this, if you're familiar with information retrieval, as a local version of TF-IDF. And what I'll be presenting is basically a way to learn a much more complicated and more general weighting function. So here are the features that I'll be using. Let boldface X denote the set of candidate documents and let V be the variable that denotes an individual word. Then we can use features that describe the frequency of the word V within different aspects of the set of candidate documents. So for example, the first feature might be active if the word V appears in at least 10 percent of the documents in X, the second feature might be active if the word V appears in at least 20 percent of the titles of the documents in X, or 15 percent of the anchor text of documents pointing to X, or 25 percent of the meta information in X, and so on and so forth. And in practice, in our experiments, we'll be using thousands of such features. And after learning a model, which we'll represent as a weight vector W, we can think of the benefit of covering a word V as the product of our model vector W with this feature representation. Again, I should emphasize that this is local. It's dependent on the candidate set. >>: [inaudible]. >> Yisong Yue: Yes? >>: So you kind of do it in a two step basis. The first step you actually retrieve say a couple thousand documents. >> Yisong Yue: So there's -- >>: [inaudible]. >> Yisong Yue: That's a good point. So for this machine learning approach, you make the standard machine learning assumptions. You assume that the candidate sets are provided to you according to some distribution. In practice, when we implement these types of algorithms, you probably want to do a two stage approach where you first select the candidate set and then you apply an algorithm like this one. So, yeah. Okay. So here's the structured prediction formulation for maximizing coverage. Let boldface X denote the set of candidate documents, let boldface Y denote the prediction, the subset of X of size K, and let V of Y denote the union of all the words that appear in our prediction. Then we can model the quality of our prediction, given a learned model, using what's called a discriminant function. And I'll walk you through what that means. We're basically summing, over every word that appears in our prediction, the benefit of that word. Okay. So what are some properties of this? >>: [inaudible]. >> Yisong Yue: So you could do normalization as a feature if you want. That's sort of the standard approach to discounting the length of the document. And we do do that in our experiments. All right. So what are some properties of this formulation? Well, first of all, it does not reward redundancy, which is the thing that we care about the most. Because if a word appears in two documents in our prediction, it's only counted once. This is a sum over a set, not a multi-set.
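As a rough sketch of how such a learned benefit might be wired up -- the specific thresholds and document fields below are invented for illustration, and the real system uses thousands of such features -- note that the discriminant sums each word's benefit once, over the union of words in the selected documents:

import numpy as np

def word_features(v, candidate_docs):
    # phi(v, x): features of word v relative to the candidate set x.
    # candidate_docs: list of dicts with hypothetical fields like
    # 'body' and 'title', each holding the set of words in that field.
    n = float(len(candidate_docs))
    body_frac = sum(v in d['body'] for d in candidate_docs) / n
    title_frac = sum(v in d['title'] for d in candidate_docs) / n
    # Threshold features of the kind described above (10%, 20%, ...).
    return np.array([body_frac >= 0.10, body_frac >= 0.20,
                     title_frac >= 0.20, title_frac >= 0.25], dtype=float)

def discriminant(selected_docs, candidate_docs, w):
    # F(x, y; w): total benefit of the distinct words covered by the
    # prediction y. Each word is counted once no matter how many selected
    # documents contain it -- a sum over a set, not a multi-set.
    covered = set().union(*(d['body'] for d in selected_docs))
    return sum(float(w @ word_features(v, candidate_docs)) for v in covered)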
And that is the structural assumption of the structured prediction model. So, for example, if we are representing this coverage based prediction problem where the space is represented using words, then if two documents share a word, the benefit is counted only once. Second of all, this is still a submodular optimization problem, so greedy has the same approximation bound. And we'll be using a more sophisticated structure in our experiments. So one limitation of this simple formulation is that if a document contains even one occurrence of a word, that word is considered to be covered, and you really want to have graded levels of coverage. Some document might cover a word better, say, if it appears in the title. And you can expand this feature representation in a stacked form to account for that. I'll be happy to talk about this offline with anyone who's interested. But we do use a more sophisticated structure of the same flavor in our experiments, to account for the fact that different documents cover words to different degrees. Yeah? >>: So I missed something. So is X the set of [inaudible] -- X is the set of all the -- >> Yisong Yue: Candidate documents. >>: Candidate documents. >> Yisong Yue: Uh-huh. >>: So can you repeat how you're using this entire set in defining this function? I'm asking because, I mean, you know, in real life you don't really have this, they kind of arrive one by one in this streaming fashion -- >> Yisong Yue: They arrive one by one in a streaming fashion? >>: Yeah. I mean when you do retrieval, you never really get a full set and then rank it, you kind of have a streaming mechanism where you get them gradually and you have to calculate their values on the fly. So that's what typically happens in a real system. >> Yisong Yue: So you might sort of let a hundred stream in and then rerank the top hundred? >>: Well, the way these things work now is they just come one by one and you have to kind of score them on the fly. So pretty much. But I mean you could do something -- I'm just trying to understand your model. >> Yisong Yue: Yeah. Okay. I understand. So right. From that perspective, I guess you would have to wait until you've collected a large enough pool of candidate documents in order to apply this model. >>: And so how would you use the set? >> Yisong Yue: It's used in defining the feature representations, because all these features are defined relative to the candidate set of documents. Again, it goes back to this idea that the importance of a word is conditional on the candidate set -- for example, the word computer is not as useful for teasing apart subtopics for the query ACM. So this is also a coverage based problem, so it's also a submodular optimization problem, and greedy has the same approximation bound. And we'll be using a more sophisticated structure in our experiments. And the goal here is to learn this model vector W with the intention of making these types of predictions. Right? This is not a binary classification problem, this is a coverage problem, and so we want to train a W with the intention of doing well on this coverage problem. And I'll present a structural SVM approach for learning this W. And it will have empirical risk and generalization bounds. >>: So just [inaudible]. >> Yisong Yue: You can model it if you like. In our experiments we choose not to. >>: So the query [inaudible] is very implicit in these.
>> Yisong Yue: In this particular model, yes. Although you can think of defining this feature representation also conditioned on the query explicitly. Okay, so here's the formulation that we use. It's a structural SVM formulation. Let boldface X denote a structured input, in our case the set of candidate documents. Let boldface Y denote the structured output, in this case the prediction, the subset of size K. Our goal is to minimize the standard SVM objective, which is a tradeoff of model complexity, measured as the square of the two norm, and an upper bound measure of the loss of our structured prediction problem. Now, this is a quadratic minimization problem, so if it were unconstrained it would have a trivial solution at zero. But there are constraints, and I'll describe what the constraints mean. Let Y of I be the computed optimal solution for the Ith training sample. And this is something that we assume we can compute given the supervised labels. Then for every other label, which is in this case suboptimal, we want the benefit function -- remember, this is the benefit of predicting Y given an X and a W -- we want the benefit of our optimal prediction to be at least as large as the benefit of any suboptimal prediction, plus the structured prediction loss, in this case a subtopic loss that I'll get into later, minus the slack. And so here the slack variable, in order for these constraints to be satisfied, exactly upper bounds the structured prediction loss. Yeah? >>: What do you take Y of I to be, the Ith [inaudible]? >> Yisong Yue: In the training set, which I'll describe in more detail later, each training example is a collection of candidate documents and their subtopic labels. So I'll get into that in a few slides. >>: [inaudible]. >> Yisong Yue: Deltas are all non-negative, that's right. >>: [inaudible]. >> Yisong Yue: Well, okay, so suppose that our algorithm makes a mistake and this score is higher than this score. Then if you subtract this from both sides, this side is negative, which means this has to be larger than this. That's why it's a minus. Okay. So this is still a convex optimization problem, so it should be pretty simple to optimize. Unfortunately there is typically an extremely high number of constraints of this form, usually intractably many. So using the naive approach, it's actually intractable even to enumerate this optimization problem, let alone optimize it. So we're using a cutting plane approach to optimize this objective. So let the left hand side depict the original optimization problem, where the color gradient denotes the SVM objective. This is the good region right here. And let the linear planes denote the constraints. And so the globally optimal solution, subject to the constraints, is right here. In the cutting plane approach we first start with no constraints. So we just solve this objective unconstrained, and the solution is at zero. Starting from that solution, we then select the constraint that is the most violated -- in this case, it would be this constraint, because the corresponding slack variable is the highest for this constraint. So we add this constraint and then we re-solve the optimization problem with a single constraint, and we find this solution. And we continue to do so, iteratively finding the most violated constraint, until our solution here is an epsilon approximation to the solution here.
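Here is a hedged sketch of that cutting plane loop. The two helpers are assumptions made for illustration: solve_qp stands in for whatever quadratic program solver you run over the current working set of constraints (assumed to return the current weight vector and per-example slack values), and find_most_violated stands in for the greedy 1 minus 1 over e search described next.

def cutting_plane_train(train_set, epsilon, C):
    # train_set: list of (x, y_star) pairs, where y_star is the labeled
    # optimal subset for candidate set x. Returns the learned weights w.
    working_set = []                         # constraints added so far
    w, slack = solve_qp(working_set, C)      # unconstrained: w starts at zero
    while True:
        added = False
        for i, (x, y_star) in enumerate(train_set):
            # Greedily search for the prediction y_hat whose margin
            # constraint is violated the most for training example i.
            y_hat, violation = find_most_violated(w, x, y_star)
            if violation > slack[i] + epsilon:
                working_set.append((i, x, y_star, y_hat))
                added = True
        if not added:                        # all constraints hold up to epsilon
            return w
        w, slack = solve_qp(working_set, C)  # re-solve with the new constraints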
And by epsilon I mean that no other constraint is violated by more than some small tolerance epsilon. And this is guaranteed to terminate in linear time, linear in one over epsilon. And in practice, it's quite fast. And it's guaranteed to be a very good approximation in a way that preserves all the same generalization bounds of a normal SVM. Okay. So one key problem here is finding the most violated constraint, right? The naive approach would be to enumerate over all the constraints to find the one that's most violated, and of course we assume that this is intractable. The good news is that we can formulate finding the most violated constraint in the same coverage formulation as the prediction problem. We can treat it as a coverage problem. So we have a one minus one over e approximation for finding the most violated constraint, and all the theoretical guarantees still hold if we have a constant factor approximation for finding the optimal cutting plane -- albeit the guarantees are a little bit weaker. And this performs pretty well in practice, as you'll see. Okay. So here's the training data that we will use in evaluating our model. >>: Yisong? >> Yisong Yue: Yes? >>: That trick, the trick, the coverage problem, wasn't it in -- I can't pronounce his name. >>: [inaudible]. >>: What he said. Marginal paper. That's your hack on his paper? Because I don't remember the -- that the original -- >> Yisong Yue: That's the -- this is the key technical contribution of the paper. >>: That's right. But -- >> Yisong Yue: No. It's -- it's subdivided. So this is generally in ICML 08, and this is sort of the key, I guess, technical contribution of that paper. Okay. So here's the training data. We're going to use data from the interactive track in TREC. And in this dataset, queries have specific labeled subtopics. So for example, for one query, use of robots in the world today, human judges label documents as relevant to different subtopics: for example, nano robots, space mission robots, underwater robots, and so on and so forth. You could think of this as a manual partitioning of the total information regarding this query. And our goal is to discover a good representation that helps us partition the information for this query. So here are the results. We're retrieving five documents here. We're comparing against random, which basically selects five documents from the candidate set at random. >>: How big is the [inaudible]? >> Yisong Yue: It varies from query to query. On average it's about 40. And so this is the missing subtopics error rate, so lower is better. We're comparing against Okapi, which is just a standard retrieval model in information retrieval that does not consider diversity. We're comparing against unweighted, which is the baseline for our model, where it just tries to select as many distinct words as possible, unweighted. We're comparing against essential pages, which is the prior work that I mentioned, developed here at Microsoft. And our method, called SVM-div. Only essential pages and our method do noticeably better than random, and SVM-div outperforms the rest with [inaudible] significance. >>: So the training set took the subtopics and [inaudible] maximum coverage [inaudible]. What was the training [inaudible]. >> Yisong Yue: You take all the documents that are labeled as relevant to at least one of the subtopics. So there are about 40.
And then the training -- the optimal label is the set of five documents that minimizes subtopic loss, so the one that covers as many subtopics as possible. And different queries have different subtopics; they're all query dependent. >>: What happens if you do a query independent TF-IDF [inaudible]? >> Yisong Yue: I'm sorry? >>: What happens -- so your thing does a query dependent kind of TF-IDF thing. What if you do just a global TF-IDF which is query independent, as a preprocessing step? >> Yisong Yue: That's like Okapi as a preprocessing step? >>: No, no, you apply your algorithm, but you define these weights with respect to -- I mean, you do a preprocessing step, which is TF-IDF but globally across everything. You lose the query dependent aspect of getting this entire candidate set ahead of time. Have you tried that? >> Yisong Yue: No. If you were to do that -- >>: I mean, that's what you need to compare to in order to claim that the query dependent normalization is giving you something, right? >> Yisong Yue: What we're comparing against is essential pages. >>: But that's a completely different algorithm also, right? >> Yisong Yue: It's -- it's the same -- I mean, the prediction task between essential pages and SVM-div is more or less the same. It's the model that's different. So essential pages -- >>: I understand. You're doing better. >> Yisong Yue: Okay. >>: Is it because you're using an SVM, is it because it's query dependent normalization? I mean, it could be a number of things. I just was -- >> Yisong Yue: I guess you could try to model the importance of words globally rather than locally and apply the same coverage problem. So that could be an interesting baseline. We did not compare against that baseline. The two baselines we did compare against are essential pages and totally unweighted. So we did not compare against weighting with global TF-IDF. Yes? >>: So that random fraction [inaudible] seems to imply there are a lot of subtopics. Do you remember how many subtopics there were on average? It seemed like there would have to be many more than five to get random .47. >> Yisong Yue: Again it varies because they're all query dependent. But it's about 15 subtopics per query. >>: Oh, so you might be at the best you can even do, even if you had the oracle. >> Yisong Yue: The best is about .27. The oracle is about .27. >>: Oh, okay. Okay. Okay. >> Yisong Yue: So some documents are good. They cover more than one subtopic. >>: Okay. >> Yisong Yue: So if documents were disjoint in the subtopics they covered, then there would be no redundancy of the kind that we care about. >>: That is true. >>: So one way you can do that without labeling data is that you can [inaudible] the local IDF and you get the largest coverage of the [inaudible]. >> Yisong Yue: The local? >>: Yeah, the local IDF. So if you do that, then what kind of performance -- >> Yisong Yue: That is actually what essential pages does. >>: I see. >> Yisong Yue: So essential pages, you know, makes the assumption that the importance of different terms corresponds to this TF-IDF-style function. And this is a discriminative training algorithm that says we have the same prediction task, but we're going to learn new benefits with respect to our training labels. Right?
So it fits a linear model to the training labels. Okay. So that's the approach. More abstractly, what I have described in the first half of this talk is a way of learning coverage representations. So suppose we had a training set with some sort of gold standard representation of information, in this case the subtopic labels, and we want to predict a good covering that minimizes subtopic loss. The goal here is to learn an automatic representation such that it does not require the gold standard labels, and then we can maximize coverage on new problem instances. You could think of this as the inverse of the prediction problem. There's been a lot of work in optimization research on these different structured optimization problems, and maximizing coverage is one such instance. So the prediction problem is: given the gold standard formalization of the optimization problem, predict a good solution, in this case a good covering. The inverse problem is: learn an automatic representation that agrees with the gold standard solution on the prediction. So we want to learn an automatic representation such that our coverage solution in this representation agrees with the coverage solution in the gold standard representation. That's the learning problem. So what are some other kinds of coverage problems? Well, you know, they're pretty endemic in many kinds of recommender systems -- such as recommending schedules, so maybe helping secretaries create schedules, recommending different products for different commercial services, or recommending scholarly papers if you're interested in browsing the different types of research work that's out there on some scholarly dataset, and so on and so forth. Again, there's an ambiguity in what users want -- there's an ambiguity in what I want, even if I could formulate the query perfectly -- and there's an ambiguity in how the service interprets your request, because there isn't a perfect formulation. In a different context, suppose we wanted to place sensors to detect different outbreaks or intrusions in a physical environment, or we wanted to figure out which blog feeds to monitor in order to collect the most news stories or information cascades as quickly as possible. Again there's ambiguity in where the events occur and what events we care about. Yeah? >>: [inaudible] coverage problem? >> Yisong Yue: I'm sorry? >>: Is recommendation really a coverage problem? Because if I'm watching movies I don't really care about watching movies on all possible topics, I have a strong bias towards only the movies that I like. And so it might be a coverage but -- >> Yisong Yue: Suppose I'm looking for Mother's Day gifts. So it's certainly true that diversity is not essential in all possible applications within recommender systems. That's certainly true. If you know what you want, you know what you want. And if the system knows what you want, then that's perfect. But suppose I was searching for Mother's Day gifts, right? I have no idea. And so there it might behoove the system to hedge its bets, in which case you could formulate this as a coverage problem. >>: I have a question [inaudible].
Do you think that diversity and coverage are typically a primary objective in most retrieval problems, or is it more commonly, for example, a regularizer or a constraint, where you would have the primary objective being some sort of [inaudible] of accuracy? And do you think that the solutions -- if you state the problem where diversity is a regularizer or constraint -- would be very different from where you have this as the primary objective? >> Yisong Yue: So when you think about implementing these systems in practice, I guess it depends on the properties of the problem, right? I can imagine scenarios where I do a pretty generic query on a research paper dataset, where I'm kind of looking for new ideas, maybe looking for related work to cite for a paper I'm writing, not really sure what I'm looking for. And there, relevance is a little bit harder to judge, first of all. And second of all, lots of papers talk about the same thing -- I might only want to cite one paper from this pool of papers. And so there it would behoove a system, when it's making a list of recommendations, to diversify its recommendations. So it depends on how much redundancy there is, and it depends on how clear your notion of relevance is. Those are different properties that you need to examine. But I think there are certainly applications where you do want diversity as a coverage problem. >>: Is the primary -- you think it's actually [inaudible] to have it as a primary objective? >> Yisong Yue: So okay. There are ways of combining this coverage based problem with a more traditional one dimensional relevance objective. I haven't done any specific work on it, but there are ways to balance the two in a combined optimization problem. So it would be interesting to think about. >>: Sorry. Another follow-up to that question. So say that we do care about coverage and we do want one document from each topic, but in your example of, you know, citing papers, I want the most authoritative example from this group. So I agree that coverage is important, but given that coverage is important, how do you then kind of -- >> Yisong Yue: Trade off -- I understand your question. You can formulate this problem more generally as a utility function, right? You want to maximize user utility. And the gained utility is higher when you present users with more authoritative results, but the gained utility suffers from diminishing returns as you retrieve more and more redundant results. >>: Okay. >> Yisong Yue: So it's still a coverage based problem, but it's a little bit more complicated than the model I presented. Yeah? >>: Just a comment. I think it -- [inaudible] coverage problem is like a [inaudible] summarization. So you can [inaudible] you can construct the sentences from the document, try to [inaudible] -- >>: So especially in this slide, you're telling us that diversity is important, and [inaudible] intuitively we kind of agree that having diverse things is good, but I think that you can make a much stronger statement. I think you said -- the most important thing you said was hedging the bets, right?
So essentially define the problem such that your prediction task becomes a risk averse prediction task, much like portfolio selection in computational finance, where, you know, they want to make the most money with the portfolio but still you want to diversify just to guarantee that you don't lose everything. And there the need to diversify just comes as a natural consequence of your formulation of the model, and maybe that's what we're missing [inaudible]. [laughter]. >> Yisong Yue: That's a [inaudible] way of putting it. >>: [inaudible] has done that where he increases [inaudible]. [brief talking over]. >> Yisong Yue: All right. So I think we've beaten the first half to death. Let's move on to the second half on interactive learning. How am I doing on time? >> Chris Burges: [inaudible] 15 minutes or so until the hour is up, but we have until 12. >> Yisong Yue: Okay. Great. All right. So the idea here is that we want to build systems that can learn interactively from feedback from their environment. So, for example, suppose you wanted to build search systems for corporate intranet search. These companies are very private about their data. They just want a black box system that is installed on their network, and they just want it to work. So you want the system to adapt to the particulars of the network structure of the internal corporate network, the language models of their documents, the query formulation patterns of their users, the click behavior of their users, and different things of that nature. The standard techniques people typically employ in machine learning are limited in the sense that they require a fairly labor intensive labeling pre-processing step in order to collect a sufficiently representative amount of training data before these machine learning algorithms can be applied. So for example, you might be asking a human labeler to say whether or not this document is relevant to this query, for some document query pair. This is expensive and inefficient to obtain, and it doesn't completely generalize to other domains, right? There's definitely a difficulty in generalizing. So a dataset about patents, whether or not these patents are relevant to these types of patent queries, might not generalize that well to designing a search system over a medical research dataset. So in light of this, a lot of people are starting to look at online learning as a way of modeling this problem. The idea here is to learn on the fly. And in particular, I'll be describing an extension of the multi-armed bandit problem. This is pretty broadly applicable, because many systems interact with the environment on the fly, so they learn and they try to provide good service simultaneously. One of the key technical challenges and theoretical questions in this line of research is how we analyze performance. What people usually do is measure the utility of the strategies chosen by our algorithm as it's learning on the fly versus the utility of the best possible strategy known only in hindsight. And this difference is often called regret. So there's been a lot of work done on multi-armed bandit problems. I just want to quickly illustrate with this simple example some of the difficulties or disconnects in modeling a lot of real world problems using the standard language of the multi-armed bandit framework. So here we have a refreshment service provider.
It's at a party, and it's trying to optimally provide refreshment services to the guests at the party. And it has two strategies in this simple example: to give Pepsi or to give Coke. And its goal, amongst other things, is to figure out whether this population of users on average prefers Pepsi more or prefers Coke more, and then to exploit that knowledge to optimally satisfy the population of guests. So in the language of the standard multi-armed bandit framework, here's how the scenario would play out. Our system would go to the first guest, ask that guest to drink some Coke, then ask that guest to rate Coke from one to five, or zero to one, whatever you prefer, and then update its average Coke rating. It would go to the next guest, ask that guest to drink Pepsi and then rate that Pepsi from one to five, and update the average Pepsi rating. And it would proceed in this fashion until it has some idea that maybe Pepsi probably has a higher average than Coke, and then it would start to exploit this knowledge and start serving the guests Pepsi more often in the future. So what's wrong with this scenario? Well, two things that I can think of. First of all, it assumes that all the users -- scratch that, all the guests -- are calibrated on the same scale. So if I give a rating of three to Pepsi, it means the same thing as when Misha gives a rating of three to Coke. Probably not the case. Second of all, it assumes the users are able and willing to provide explicit feedback. So here, imagine that I'm searching on a search service on my company intranet, and every time I do a search the search service gives me a popup that says please rate the relevance of this document for this query on a scale of one to five. I'm just not going to do it. It's not worth my time. So here's an alternative approach that I've been working on: instead of absolute feedback, we focus on relative feedback; instead of explicit feedback, we focus on implicit feedback. So this is how the scenario would play out in the language of the framework that I've been working on. The system goes to the first guest -- let's say it has some fixed volume budget K. It has two cups. It pours some amount of Pepsi in one cup, some amount of Coke in the other cup. It gives both cups to the first guest, and it simply observes which cup the guest drinks more out of. And if the first guest drinks more out of the cup of Coke, then it would make the inference that this guest prefers Coke to Pepsi. It would repeat this process for all subsequent guests. And if the second guest were to drink more from the Pepsi cup than the Coke cup, then the system would make the inference that this user prefers Pepsi to Coke. And it would proceed in this fashion with all the users until it has an estimate that maybe this population prefers Coke more on average than Pepsi. So this addresses both of the issues that I pointed out. First of all, it's making relative comparisons as opposed to eliciting an absolute reward -- is A better than B? And this is something that has been found, across a wide range of fields, to be more reliable feedback to gather. Second of all, it no longer assumes that users are providing explicit feedback. We're simply making inferences from natural user behavior as they're interacting with our system. >>: I know this is a [inaudible] example, but I'm assuming the match with the real stuff. [inaudible].
>> Yisong Yue: Right. So you could imagine -- you could [inaudible] right. You can imagine that the users can't be bothered to be asked what they would prefer. Or that they would know once they see it, but they don't know beforehand. Yeah? >>: So in this setting, what's the equivalent of, you know, once it determines that Coke is the winner, it starts giving Coke to every guest -- but in this scenario does it just give more Coke and less Pepsi? >> Yisong Yue: There are a couple of ways. You could just pour Coke in both cups. You could just give one cup. I mean, there are different ways of formulating this scenario. But you could just pour Coke in both cups. Yeah? >>: What would be the extension to more than two choices? Is that a distribution [inaudible]. >> Yisong Yue: No, you just -- you choose -- in the two cups? Suppose you have K choices. For each user you give them two choices that you choose. >>: [inaudible]. >> Yisong Yue: I'm sorry? >>: [inaudible] from some distribution? >> Yisong Yue: Well, you could sample from a distribution. The algorithm could decide by sampling from a distribution. But the algorithm decides which two to give to users. >>: It doesn't have to be random, right, I mean, you may actually choose to sample with certain -- >> Yisong Yue: Yeah. I mean, you could design an algorithm to make choices based on a distribution, but you don't have to. With the K choices in the standard multi-armed bandit setting, you give the user one choice and ask that user to rate that choice. In this setting, you give the user two choices of your choosing and you ask the user to compare the two -- well, you infer which one the user prefers by seeing how they interact with the two choices. Okay. So the contribution here is a new online learning framework which we call the dueling bandits problem. And the idea here is that we want the algorithm to learn via making pairwise comparisons. I'll present a new learning objective, or a new notion of regret, and I'll also present an algorithm with theoretical results. And to my knowledge, this is the first online learning framework tailored towards relative feedback. So before I present the formulation of the problem, I want to quickly describe a real world comparison oracle for search applications. This method is called team-game interleaving. And the idea here is: suppose I had two retrieval functions F1 and F2, and I wanted to know which one is preferred by some user, say Thorsten, for a query, say SVM. How this works is we would first precompute the rankings of both retrieval functions, and we would show Thorsten an interleaving of the two rankings. The results are color coded here, such that the interleaving preserves some notion of rank fairness, and the user does not know that one result came from one search engine and the other result came from the other. And from this interleaved ranking that the user sees, we can then simply count clicks as votes of preference for one of the two retrieval functions. So if Thorsten were to click more on the red results, which is the case here, then we would make the inference that F1 is preferred to F2.
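As a rough sketch of that click based inference, here is an illustrative Python version. The interleaving policy below simply randomizes which ranking contributes first at each step; it is a simplification standing in for the rank fair interleaving scheme described in the talk.

import random

def interleave_and_infer(ranking_a, ranking_b, clicked_urls):
    # Interleave two rankings, remember which ranking contributed each
    # result, then credit clicks as votes. Returns 'A', 'B', or 'tie'.
    a, b = list(ranking_a), list(ranking_b)
    interleaved, team = [], {}
    while a or b:
        order = [(a, 'A'), (b, 'B')]
        if random.random() < 0.5:
            order.reverse()                  # randomize who contributes first
        for lst, name in order:
            while lst and lst[0] in team:
                lst.pop(0)                   # skip results already shown
            if lst:
                url = lst.pop(0)
                team[url] = name
                interleaved.append(url)
    votes = {'A': 0, 'B': 0}
    for url in clicked_urls:
        if url in team:
            votes[team[url]] += 1
    if votes['A'] == votes['B']:
        return 'tie'
    return 'A' if votes['A'] > votes['B'] else 'B'

So if the user clicks mostly on results contributed by F1's ranking, the duel is credited to F1.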
Okay. So here's the dueling bandits problem. Given a space of bandits F, which we can think of as a parameter space of retrieval functions, or weight vectors, the algorithm proceeds in a series of time steps, where at each time step it can compare two bandits, or two retrieval functions. So for example, this can be implemented using the interleaving test for search applications. Each comparison is modeled, in our case, as being noisy and independent of other comparisons. The goal is to minimize this notion of regret, and I'll walk you through what that means. So at time step T our algorithm chooses two bandits, F sub T and F prime sub T. And the regret is the sum, over time, of the probability that F star, which is the best bandit or the best retrieval function known only in hindsight, beats F sub T, plus the probability that F star beats F prime sub T, minus one -- the minus one is for normalization purposes. So you can interpret this as the fraction of users who at time step T would have preferred the best retrieval function over the ones selected by our algorithm at time step T. So it's a dissatisfaction notion of regret. Okay. So here are some examples to illustrate how this problem plays out in practice. I have three examples where we're comparing two bandits in the space of bandits, and the color gradient represents the quality of the bandit. Lighter is better, so the best bandits are up here. In the first example we're comparing two pretty poor bandits. And so for a noisy comparison between these two, we incur pretty high regret, because users would have preferred bandits from up here. And again, we don't know a priori what this regret is. That's by assumption. In the second example we're comparing two mediocre bandits, and for a noisy comparison between these two, we incur noticeably lower regret. And in the last example, we're comparing two pretty good bandits, and for this comparison we incur almost zero regret. And so the goal then is for the algorithm to make a sequence of comparisons such that this regret formulation is minimized over time, up to some time horizon T. And our algorithm does not know a priori what this space looks like or what F star is. So in our analysis we make a few modeling assumptions. We assume that each bandit F has an intrinsic value V of F. This value function is never observed directly. And for analysis purposes, we'll assume that V is concave, which implies a unique F star. If V is not concave, you have to deal with local optima. We assume that comparisons are modeled based on this value function. So for example, the probability of F beating F prime could be modeled as some transfer function applied to the difference of the values of F and F prime. And we make some smoothness assumptions on this probability function. A common one is the Lipschitz -- I'm sorry, the logistic transfer function. But in general, any S shaped curve works. Yeah? >>: You're assuming that the intrinsic value is the same for all the queries? >> Yisong Yue: Right. So it's like the average value. You would have to do some averaging.
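For reference, one hedged way to write down the comparison model and regret just described, with sigma standing for the S shaped transfer function, v for the hidden value, and f star for the best bandit in hindsight:

P(f \succ f') = \sigma\big(v(f) - v(f')\big), \qquad
R_T = \sum_{t=1}^{T} \Big[ P(f^{*} \succ f_t) + P(f^{*} \succ f'_t) - 1 \Big]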
Okay. So I'll present the algorithm by an example. Our algorithm is called dueling bandit gradient descent. It's pretty simple. It has two parameters: an exploration step size delta and an exploitation step size gamma. It maintains a current estimate of what it thinks is the best point. And then at each time step it chooses another point in this space, within some exploration radius, and it compares the two. In this case, this bandit is worse, so it's likely to lose the comparison. And in this case it did lose the comparison, so our algorithm keeps its current point stationary. In the next time step we're comparing against this point, which is again sampled from our exploration radius. And in this case, the proposed candidate bandit wins the comparison, so we make an update in this direction. Notice that our update step is smaller -- our exploitation step is smaller than our exploration step. So our algorithm is conservative, and this falls out of the analysis. And we continue proceeding in this fashion, making this sequence of comparisons and updating if our proposed candidate wins the comparison. Now, these comparisons are random, so sometimes a better candidate bandit will lose the comparison, in which case we stay stationary, and sometimes a worse candidate bandit will win the comparison, in which case we make a step in the quote/unquote wrong direction. But the idea is that on average we're doing something like gradient descent, and we'll be converging towards the good region in a way that minimizes regret. So here's the analysis, a sketch of it on one slide. It's built upon the fact that convex functions satisfy this inequality: for a convex function C, the difference between C of X and the minimal point C of X star is less than or equal to the gradient of C at X times the vector difference of the two points. Now, for our formulation, first of all, it's not convex, because it's applied through a logistic transfer function. And second of all, we don't actually know what the gradient is; we need to estimate it. So this introduces both additive and multiplicative errors to this inequality. And in particular, how bad the error is depends on our exploration step size delta. The main analytical contribution of this work is a bound on the multiplicative error, which allows us to reason about how aggressively we can explore versus how bad the error in our estimate is. The good news is that we can do so, and by doing so we can actually set the parameters in a way that gives sublinear regret with respect to the time horizon T. When an algorithm has sublinear regret, what that means is that the average regret shrinks over time, so in the limit we do as well as knowing the best bandit in hindsight. Yeah? >>: So is your objective function [inaudible] you're adding [inaudible]. >> Yisong Yue: Is this convex? >>: Quasi convex. [inaudible]. >> Yisong Yue: Yes, it's quasi convex. You could read the paper for more details, but it's convex in this region. >>: I wonder if there's [inaudible] in the performance of the quasi convex -- >> Yisong Yue: Maybe. Actually I don't know. So, I mean, if the function were completely convex, then you could simply take a noisy estimate of this gradient, and you could show that that would do well. Because it's not completely convex, it introduces error, and that's the contribution of the paper. There might be some approaches in work that analyzes quasi convex functions. I don't know. >>: I kind of lost you here.
You're saying that you're using this property of convex functions. So what's the convex function you are applying? What is C? >> Yisong Yue: C is the sum. >>: Is what -- >> Yisong Yue: Is the sum right here. This probability. This probability function is C. >>: And is it convex? >> Yisong Yue: No, it's not. But -- >>: [inaudible]. >> Yisong Yue: It is convex for all bandits that are better than X. So the idea here is that this function, although it's not convex, satisfies this property up to some error. And if you can bound that error, then you can reason about how this function behaves as a function of delta versus this inequality. >>: [inaudible] regions of the -- for example, if that -- if your [inaudible]. >> Yisong Yue: No, because the F function is always guaranteed to be at the inflection point. Because relative -- well, okay, it's a bit complicated. I'll be happy to talk with you offline. You could describe an equivalent formulation where the F function is always guaranteed to be at the inflection point, where it goes between convexity and non-convexity. Okay. So here are some web search simulation results. We took a web search dataset provided courtesy of Microsoft, and we did the following simulation. There are 1,000 queries in this dataset. We simulated a user issuing a query by sampling from this dataset at random, and then for two different retrieval functions, two different rankings, the user would probabilistically choose one over the other based on NDCG. So NDCG is the hidden value of the two rankings, and the user would probabilistically choose one over the other. And we tried three versions of our algorithm: one which sampled one query before making an update, versus up to 100 -- in that case we're averaging, so we have a less noisy estimate before making an update decision. And this is NDCG on the Y axis. We compare against ranking SVM, which is a pretty standard supervised learning algorithm for this problem. We don't do as well as the ranking SVM, because the metaphor here is that the ranking SVM has direct access to the labels, which in this case are the labels inside the users' minds, whereas we only make inferences based off observed user behavior. So all our feedback in some sense is provided for free, and we do reasonably well. >>: What's the X axis? >> Yisong Yue: Number of comparisons. >>: [inaudible] ranking functions? >> Yisong Yue: It's a parameter space. There are 367 features, so it's a continuous parameter space. So in some sense there's an infinite number; we're just exploring different points in this parameter space. Now, if we're interested in only optimizing over a discrete set of K different retrieval functions, then we have stronger regret bounds. In this case, it's information theoretically optimal at log T, where the idea is that we have K different retrieval functions and you want to make a sequence of comparisons between these K retrieval functions in a way that minimizes regret. Now, the dueling bandit gradient descent algorithm I presented is pretty simple, and I think that's a strength, because [inaudible] it has reasonable theoretical guarantees and it's easy to extend because it's simple.
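Here is a minimal sketch of dueling bandit gradient descent as just described, in Python. The comparison oracle duel is an assumption standing in for something like interleaving; it returns True when the challenger wins the noisy comparison.

import numpy as np

def dbgd(duel, dim, delta, gamma, horizon):
    # duel(w, w_prime) -> True if w_prime wins the noisy comparison.
    # delta: exploration step size; gamma: exploitation step size
    # (gamma is typically smaller than delta, so updates are conservative).
    w = np.zeros(dim)                        # current candidate bandit
    for _ in range(horizon):
        u = np.random.randn(dim)
        u /= np.linalg.norm(u)               # random unit direction
        w_prime = w + delta * u              # challenger within the exploration radius
        if duel(w, w_prime):                 # challenger won the duel
            w = w + gamma * u                # small exploitation step toward it
        # otherwise stay put; on average this behaves like gradient descent
    return w

In the search setting, duel would be implemented by interleaving the rankings induced by w and w_prime for a sampled query and counting which side receives more clicks.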
And so you could think about incorporating domain knowledge, or leveraging the types of structured prediction algorithms that I described in the first half of this talk, in a way that makes sense for the application. Yeah? >>: I'm just trying to understand the non convex function that you're optimizing. So would it be possible to upper bound this with a convex function that has the same gradient at X? Because if you could do that, then you could apply the algorithm [inaudible] without a gradient algorithm, which just [inaudible] get the same rates. >> Yisong Yue: We actually do use some of their results in our results. >>: You see what I mean? If you could -- >> Yisong Yue: I see what you mean. I see what you mean. >>: Use their algorithm with just one point [inaudible]. >> Yisong Yue: I see what you mean. I think the answer is yes. I'm not sure if you do better than what we have. We do use some of the results from [inaudible]. We had to extend the results because their estimate of the gradient is different from our estimate of the gradient. >>: [inaudible] in terms of from the one point, two point [inaudible] and two point, one point you go back to the gradient -- approximate gradient. >> Yisong Yue: So I should say that in the [inaudible] approach, where they assume that they have knowledge of C -- so they observe the value of the probability function, not just a sample of the comparison -- we get the same regret bound as they do. It's just that their estimate of the gradient has less variance than ours does. Okay. So I think that this line of approach is very interesting and potentially very useful. And I think it's also important to think about ways in which we could design ever more practical approaches, and maybe think about ways of evaluating on real information systems from various domains and user communities. And I think this will also shed insight and guide future development. So I want to quickly wrap up by briefly talking about some of the related research that I've been doing in this regime. In structured prediction, I've also worked on optimizing for multivariate ranking measures, such as average precision. Within interactive learning, I've worked not only on modeling this exploration versus exploitation tradeoff but also on ways of interpreting user feedback. So here, in the exploration exploitation tradeoff work, we assume that we have a comparison oracle that will give us unbiased feedback via team-game interleaving. But there are ways in which we can think of eliciting even stronger feedback than what we already have. All right. So that's the end of my talk. If you're interested in any of the technical details or in playing around with any of the software, it's all available from my website. I invite you to take a look. And thank you for your attention. [applause]. >> Chris Burges: So any questions? All right. >>: Just one question about -- in the regret [inaudible] you said that -- you mentioned a [inaudible] out of your function or [inaudible]. >> Yisong Yue: So right. So okay. Here's our reference bandit, and in this one dimensional case we either compare against this point or this point. And so in expectation it gives you an estimate of the gradient at this point, right? But that has an error. Also, this point falls outside of the convex region of this function. So that gives you a different error.
So the estimate of the gradient gives you an additive error. The fact that this point falls outside of the convex region gives you a multiplicative error. Okay. >> Chris Burges: Thank you. >> Yisong Yue: Okay. Thank you.