>> Ofer Dekel: Okay, welcome, everybody. Thanks for coming. It's our pleasure today to have Manik Varma give us a talk on extreme classification, a new paradigm for ranking and recommendation. Thank you, Manik. >> Manik Varma: Cool. >> Ofer Dekel: Oh, wait, let me … I'm sorry; I have to make the announcement. For those of you who are watching from your offices, I'll be monitoring questions on the tablet, so feel free to ask, and I'll relay them to Manik. Thank you. >> Manik Varma: Cool. Cool, okay, good to go? So thanks very much, Ofer. Right, so I'm Manik Varma from Microsoft Research India, and I'll be talking about extreme classification, which might provide a new paradigm for thinking about core problems in machine learning, such as ranking, recommendation, possibly structured output prediction, and so on. Yeah? Now, many of you might not have heard the term extreme classification before, so let me start by giving some context. In classification, the complexity of the learning task has grown from binary classification, where we learn to pick a single item from amongst two labels, to multi-class classification, where we learn to pick an item from amongst L labels—with L being larger than two—to multi-label classification, where we learn to pick the most relevant subset of these L labels. At the same time, the complexity of the learning task has also grown in terms of the number of labels being considered. So we've moved from working with two labels for binary classification to tens, hundreds, and thousands of labels for multi-label learning. Yeah? And if you looked at the state of the art about three years ago, the largest multi-label learning dataset had about five thousand labels, so the size of the output space was two to the power five thousand, which was—well—considered to be large. And so it was thought that going beyond that would be very hard. Then, two years ago, we exploded the number of labels to ten million; the application was to build a classifier that could predict the subset of millions of Bing queries that might potentially lead to a click on a new ad or on a new webpage. So the input to the algorithm would be an ad, such as this ad for Geico car insurance, and the output of the algorithm would be the subset of queries that might lead to a click on the ad, such as "cheap car insurance" or www.geico.com. Using a tool such as this, an advertiser could figure out most of the queries that might lead to a click on his ad, and he could then go to a search engine, such as Bing or Google, and say, "Hey—you know—anytime somebody asks this particular query, please show them this ad, and if the user clicks on the ad, I'll give you a dollar." Now, as you can well imagine from the application, predicting phrases from webpages is a very important problem, both from a commercial and a research perspective, and so many sophisticated NLP techniques have been developed in the literature.
However, the way we decided to address the problem was to bypass all these NLP techniques and simply state that we're going to take the top ten million queries in Bing, treat each of them as a separate label, and learn a multi-label random forest classifier—which I will be referring to as MLRF for the rest of the talk—that will take this ad as a test point, extract bag-of-words features from the raw HTML that lies behind the ad, and then simply classify that feature vector into these ten million labels and predict the corresponding queries. So it took us about two years to build MLRF, but when all the results came in, and all the performance evaluation was carried out, it turned out that MLRF had a couple of advantages over the state-of-the-art NLP techniques at that point in time. >>: Hey, Manik? Question. >> Manik Varma: Rich? >>: This is Rich. >> Manik Varma: Hi. >>: Why don't you view it as a binary classification that goes the other way—that tries to classify which was … yeah. >> Manik Varma: So that was the standard approach to it, and if you want, I'll come back to that towards the end of the talk, but—as I was just going to say—one of the big advantages we had, as compared to the binary classification approach, was that we managed to push coverage from somewhere around sixty percent to ninety-eight percent. So coverage is the percentage of ads for which an algorithm makes nontrivial recommendations, and just to show you over here—right—if you take the binary classifier approach that Rich was mentioning, what you're essentially going to do is train a binary classifier with a sliding window; it's going to go over every phrase on this webpage and try to predict whether or not it is a possible candidate for a bid phrase, right? So what happens is: when you actually run this on the page, you're limited to whatever phrases are there on the page, and that turns out to be a problem in this particular case, because all of this beautiful-looking text is actually embedded images, right? And most ads are very text-impoverished, so that binary classification approach doesn't work all that well, and if you look at all the predictions that we are making over here, none of them are actually phrases that are present on the webpage, 'kay? So this was one of the big advantages: we managed to push coverage up from sixty to ninety-eight percent, and the product team really cared a lot about that, so there was a big win right over there. The second advantage was that, even if you focus on just the subset of covered pages—which means the subset of ads for which the NLP techniques could actually make predictions—our predictions were significantly more accurate. So if you measured something such as precision at ten, then MLRF's precision at ten was about five percent higher. So that was helpful as well. Sorry, by the way, guys; I don't see very well, so if there are any questions, it's good to, like, shout out, or do a hula hoop dance, or something. [laughs] If you just raise your hands, I'll never see you. Right. >>: Quick question. >> Manik Varma: Yeah? >>: What does "might lead to a click" mean? >> Manik Varma: So it's trying to go beyond relevance in the sense that it's likely that, in the past, we've seen users ask this query and click on a very similar ad. So that's what that is trying to capture—that if somebody asks this query, and you showed them this ad, they might click on it. There isn't a more formal definition at the moment.
>>: But so when you're evaluating, how good are these predictions? Is it … like a minimum click rate? >> Manik Varma: Oh, I see. No, so evaluation is something that I'd like to come back to towards the end. At the moment, this was an advertiser-facing tool, so the way the product group was looking to evaluate it would be: how many of your recommendations were actually adopted by the advertiser? >>: Oh, okay. Good. >> Manik Varma: Yeah. >>: Question? >> Manik Varma: Yes? >>: So maybe I missed something, but the labels on the right-hand side—where are they from? Are they [indiscernible] >> Manik Varma: These are a subset of the top ten million queries in Bing. >>: Provided by [indiscernible] >> Manik Varma: Provided by Bing. >>: By Bing. >> Manik Varma: Yeah, you go and look at the Bing logs; you see which of the queries are most frequent, or most popular, or generate the most revenue; you sort those, take the top ten million; and that's what you train on. >>: I see. >> Manik Varma: Any other questions? >>: Yeah, I have a question. Why ten million, then? Why not … did you try different things, and that worked well? >> Manik Varma: Yeah, so that covered the bulk of the revenue-generating queries. So it covered enough for the product group to be kind of satisfied. You could have gone larger; internally, we actually went much larger than that, but the returns are not that great, so yeah. Anything else? Okay. So where I was going with this was that MLRF was published at WWW 2013—so two years ago—and since then, many interesting research questions have arisen in this new area of learning with millions of labels, which we refer to as extreme classification. And I think two of the most interesting questions are related to applications and performance evaluation, which is what someone asked over here. So what I'd like to do is start by discussing applications, and then, if there's time towards the end of the talk, I'll touch upon performance evaluation. Okay, so ten million is a really large number, and I think one of the most interesting questions is: when or where in the world do we actually have ten million labels to choose from? So I think there are a couple of high-impact applications that do exist at this scale, even though ten million is very large, and one of them is people. So there are millions of people who are uploading selfies of themselves every day to Facebook, and there are millions of people who are standing in front of Kinect cameras, so we could potentially use all this data to train classifiers to recognize people, and then ask, "Which subset of Facebook users is present in this selfie?" And this might have important applications in social network analysis, security, surveillance, and so on. Another interesting application could be Wikipedia. So if you scroll down to the bottom of any Wikipedia page, you'll find a subset of Wikipedia categories that have been assigned to that page by Wikipedia's editors. Now, the total number of Wikipedia labels has crossed over into the millions today, and wouldn't it be great if you could build a classifier that could take every document, every webpage, every tweet, every query, every image, every video, and annotate it with a subset of relevant Wikipedia categories? There would be so many applications that would get enabled if we could do that successfully.
So in particular, we can think of building these really massive-scale knowledge graphs with billions of nodes and millions of properties—right—you could go and stamp all the pages on the web with their set of Wikipedia categories, and now you know: okay, this person is a VP at Microsoft, or this person is an AI researcher, or this person was born in 1952; and this might help you in building these massive knowledge graphs. You can also try and use this for text featurization. So just as we do with deep learning, where we take a deep network, chop off the last layer, and use the intermediate representation as features, you could play a very similar game over here as well. So we're trying to see whether we can use this for text featurization. But apart from all of these applications, we figured out that one can also go back to core problems in machine learning and reformulate them as extreme classification tasks. In particular, we can think about ranking or recommending millions of items and ask whether we can treat that as an extreme classification task. So the way that would work is: we can treat each item to be ranked or recommended as a separate label, learn an extreme multi-label classifier, and use it to predict the subset of items that should be recommended to each user. Thinking about ranking or recommendation in this way might have a significant impact in terms of performance in some applications, 'kay? And that's similar to what we saw with the phrase prediction and the NLP techniques earlier, okay? So let's get into some technical details about how we can tackle applications such as Wikipedia, or how we might be able to reformulate problems such as recommendation. Our algorithm is going to be called FastXML, which stands for a fast, accurate, and stable tree-classifier for extreme multi-label learning, and it was developed jointly with my PhD student at IIT Delhi, Yashoteja Prabhu. Okay, now before I get into technical details, let me quickly give you an overview of FastXML so that you know what's coming for the rest of the talk. So it turns out that, at this scale, almost all real-world applications will require us to make predictions in milliseconds. Extreme classifiers should therefore have prediction costs which grow at most logarithmically with the number of categories or the number of labels. FastXML ensures this by employing a tree-structured architecture and learning very highly balanced trees. The second noteworthy point about FastXML is that its prediction accuracy can be significantly higher as compared to the state of the art—by up to twenty to twenty-five percent, in some cases. FastXML achieves this by optimizing a rank-sensitive loss function known as nDCG, which turns out to be a marked improvement over traditional tree-growing loss functions, such as the Gini index, or entropy, or the Hamming loss. Finally, FastXML can also be up to a thousand times faster to train. So two years ago, I needed a large production cluster with almost a thousand cores in order to train MLRF; today, I can train FastXML on some of the very same problems on a single core of a standard desktop, and that's thanks to a new optimization technique based on alternating minimization, which comes with provable guarantees. So let's start by considering the first bullet point and seeing how we can architect FastXML so as to make predictions in milliseconds.
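For concreteness, here is a minimal sketch in Python of the prediction path this architecture implies: each test point walks one root-to-leaf path per tree, so the cost grows with tree depth rather than with the number of labels, and the leaf label distributions are aggregated across the ensemble. The class, function, and variable names below are illustrative assumptions, not FastXML's actual implementation.

import numpy as np

class Node:
    """One node of a FastXML-style tree (illustrative data structure)."""
    def __init__(self, w=None, left=None, right=None, label_scores=None):
        self.w = w                        # hyperplane normal at an internal node
        self.left, self.right = left, right
        self.label_scores = label_scores  # {label: score} at a leaf, else None

def predict_tree(node, x):
    # Walk a single tree from root to leaf; cost is proportional to tree depth,
    # i.e. logarithmic in the number of nodes for a balanced tree.
    while node.label_scores is None:
        node = node.left if np.dot(node.w, x) < 0 else node.right
    return node.label_scores

def predict_ensemble(trees, x, k=5):
    # Aggregate the leaf distributions over all trees and return the top-k labels.
    totals = {}
    for tree in trees:
        for label, score in predict_tree(tree, x).items():
            totals[label] = totals.get(label, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:k]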
So the way I'm going to formulate the problem is that there will be a space of users, X, and a space of items, Y, and what we'd like to do is learn a multi-label classifier, f, that takes a point in the space of users and maps it to a set of points in the space of items, so that when a user comes in, we can simply apply our extreme classifier, see which labels get predicted, and recommend the corresponding items. 'Kay? Now, the space of items might be very large, and even if it takes us just one second to determine whether or not to recommend an item, it will take us almost twelve days to go through a list of one million items and almost four months to go through a list of ten million. Extreme classifiers therefore face the daunting challenge of having to reduce prediction time by almost ten orders of magnitude—from four months down to about one millisecond. Some extreme classifiers address this challenge by learning a tree where each child receives only about half of its parent's items, 'kay? So when a user comes in, he starts off at the root node, which contains all the items, but then quickly—in logarithmic time—traverses the tree and ends up at a leaf node, which contains only a few items, and these are the items that are then recommended back to the user. We can also reformulate ranking problems in this fashion and return a ranked list of items to the user by sorting the items according to their probabilities. So this sounds like a reasonable game plan but for the fact that learning hierarchies is notoriously hard, and any single learned tree would very likely be suboptimal. FastXML therefore learns an entire ensemble of trees and simply aggregates the individual predictions in order to return a final ranked list of items to the user, 'kay? So this is the very same architecture that MLRF used two years ago as well, eh? The only thing that's going to be different today about FastXML is the way that these trees are going to be learned. So let's move on to the second bullet point and see how we can formulate the learning problem so as to make much more accurate predictions. The key technical challenge that we need to address is figuring out how to take a node and split it into a left and a right child, because once we know how to do that, we can start at the root node of all the trees and keep applying this procedure recursively until all the trees are fully grown. Yeah? Now, note that our training data comprises historical information about which users liked which items, and even though there might be millions of items to choose from, each user will typically like only a small number of items. It is, therefore, far more important to correctly predict the items that are going to be liked by a user, and to ensure that they're highly ranked, than it is to predict the disliked items—this is our key insight. And FastXML therefore learns to partition a node by optimizing a rank-sensitive loss function known as nDCG rather than traditional tree-growing functions, such as the Gini index, or the entropy, or the Hamming loss, because these traditional loss functions don't have any concept of ranking built into them, and they place an equal emphasis on predicting the liked and the disliked items. So let me try and illustrate what's going on. When we start out, all the users will be present in the root node, but what we'll do is quickly partition the users into a left and a right child.
Yep, and the reason we're going to do that is because a partitioning of users is going to induce not only a clustering over the items, but also a ranking over the items. So if you look at all the users on the left, they all like oranges, pomegranates, and grapes, so all three of these items should be clustered together and sent to the left, but bananas should not be sent to the left, because nobody likes them there. Furthermore, if you look at all the users on the left, we see that four like oranges, four like pomegranates, but only two like grapes, so oranges and pomegranates should be ranked higher than grapes. Now, there are many different ways in which we can partition the users, and each of these different partitions will induce a different ranking over the items, and what we'd like to do is, through nDCG, choose that particular partition where each user's items are ranked as highly as possible, 'kay? So the way we will partition a node is by learning a hyperplane with normal w in the space of user features, X, such that w transpose x is less than zero for all the users who've been assigned to the left and greater than zero for all the users who've been assigned to the right. And we learn the sparsest possible hyperplane that optimizes nDCG, and we show in the results that this leads to significantly more accurate predictions. But before we get to the results, let me quickly talk about optimizing nDCG. So there are going to be some formulae flying around—I apologize in advance for that—but you can try and ignore them if you like; I'll explain everything intuitively. The key take-home message from the next set of slides is: we have a very efficient way of optimizing nDCG, which will allow us to grow our FastXML trees in minutes, okay? So coming here and talking about nDCG is like preaching to the choir, right? But there might be one or two of you who don't know nDCG, so please bear with me, the rest of you, while I explain it briefly, and then I'll talk about the optimization, eh? So it turns out that nDCG is incredibly easy to define; what it does is measure the quality of a given ranking for a particular user, 'kay? It's a number between zero and one, and larger values mean better rankings, 'kay? So the way you compute nDCG—or define nDCG—is: you take your ranking; you look at the top-ranked item; and if the user likes it, you add a one to your score; otherwise, you do nothing; then you move on to the second-ranked item, and if he likes it, you add a one by log two; otherwise, you do nothing; third item, one by log three if he likes it; otherwise, nothing; and so on and so forth, 'kay? So nDCG is incredibly easy to define, but it turns out that it's also really hard to optimize. You can see that there's a sort function buried here inside nDCG, and this means that not only is nDCG not convex, but it is also not differentiable, which means we can't directly apply any of the large-scale, gradient-based techniques that we've developed for efficient optimization. Furthermore, nDCG might also behave very erratically with respect to the hyperplane that we're trying to learn. So large changes in the hyperplane might have no impact on the induced rankings, and so nDCG will stay flat in large regions of space, but there will also be situations when just a small change in the hyperplane will be enough to send a few users from the left to the right, and this will change the induced item rankings, and so nDCG will suddenly jump up, yeah?
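A minimal sketch in Python of the nDCG computation just described: walk down the ranked list, add a discounted gain for every item the user actually likes, and normalize by the best achievable score. The standard 1/log2(1+rank) discount is used here; the exact constants in the spoken description are informal, and the fruit example is only illustrative.

import math

def dcg(ranked_items, liked, k=None):
    k = k or len(ranked_items)
    return sum(1.0 / math.log2(1 + i)
               for i, item in enumerate(ranked_items[:k], start=1)
               if item in liked)

def ndcg(ranked_items, liked, k=None):
    # Normalize by the DCG of the best possible reordering (liked items first).
    ideal = dcg(sorted(ranked_items, key=lambda it: it not in liked), liked, k)
    return dcg(ranked_items, liked, k) / ideal if ideal > 0 else 0.0

# Example: a user who likes oranges and grapes; a ranking that puts them
# first scores 1.0, a ranking that buries them scores less.
print(ndcg(["orange", "grape", "banana"], {"orange", "grape"}))   # 1.0
print(ndcg(["banana", "orange", "grape"], {"orange", "grape"}))   # about 0.69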
So nDCG has this nasty property that it's either flat everywhere—which means you don't know which direction to go in if you want to optimize it—or, wherever anything interesting happens, it's discontinuous, right? And this is a very well-known loss function in the learning-to-rank literature, and Ofer and Chris have spent a lot of time optimizing it. In fact, Chris won a test-of-time award at ICML for the work he's done on optimizing nDCG over the last ten years. So based on all this—because nDCG is such a hard function to optimize—there's a very real danger that FastXML might actually be harder to train than MLRF. But it turns out that there's something special we can do over here, which at first is going to sound counterintuitive. What I'm going to do is make the problem even more complex by adding an extra hundred million variables into the optimization. Now, this is going to sound bizarre but for the fact that each of these extra variables can be optimized in microseconds, and then the remaining problem will turn out to be a simple binary classification problem in the hyperplane, which is our bread-and-butter task, and which we know how to solve very efficiently, 'kay? So here's what I'm going to do: I'm going to take each user and add an extra variable for him or her, which I'll call delta, and delta can be either minus one or plus one, depending on whether the user has been assigned to the left partition or to the right partition; at the same time, I'll introduce two extra variables, r minus and r plus, for each item, and these will specify the rank of the item in the left and the right child, respectively, 'kay? And at the top, you can see how I've modified the objective function to bind together the hyperplane normal, w, the item-ranking variables, r minus and r plus, and the user partition variables, delta. Now, there might be a hundred million users and ten million items, so we'll have added an extra hundred and twenty million variables, but here is how we can optimize them very efficiently. We'll start by simply assigning each of the users to a left or a right partition, completely at random—'kay, so there's going to be a random initialization—but because I know which items each user likes, this will induce not only a clustering over the items, but also a ranking over the items on the left and the right. Now, you can see that this is not a very good partition, because the induced item ranking is not compact, and so nDCG will be low, but what we can now do is apply iterations of this alternating minimization procedure, where we first freeze the item rankings and optimize over the user variables, and then freeze the user variables and optimize over the item rankings. And we keep doing this until we converge, okay? So the way we start is we freeze the item rankings, and now we optimize all the user variables by going to each user in turn—let's say we start with this particular user—and asking him, "Hey, are you better off sticking in your current partition, or would you actually like to switch partitions?" So if you look at this user, we see he's been assigned to the right partition based on the random initialization, but his items are actually ranked higher on the left than they are on the right, so he would be better off switching partitions.
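Here is a minimal sketch in Python of one round of this alternating scheme, under simplifying assumptions: with the two item rankings frozen, each user is assigned to whichever side ranks his liked items higher, and then, with the assignments frozen, each side's item ranking is recomputed from that side's votes. The data structures and names (users_liked, rank_left, and so on) are illustrative, not the actual FastXML implementation.

import math
from collections import Counter

def user_ndcg(liked_items, rank_of):
    # nDCG of a user's liked items under one side's ranking (rank_of: item -> 1-based rank).
    dcg = sum(1.0 / math.log2(1 + rank_of[i]) for i in liked_items if i in rank_of)
    ideal = sum(1.0 / math.log2(1 + r) for r in range(1, len(liked_items) + 1))
    return dcg / ideal if ideal > 0 else 0.0

def alternating_round(users_liked, deltas, rank_left, rank_right):
    # Step 1: freeze the item rankings, update each user's partition variable delta.
    for u, liked in users_liked.items():
        better_left = user_ndcg(liked, rank_left) >= user_ndcg(liked, rank_right)
        deltas[u] = -1 if better_left else +1
    # Step 2: freeze the deltas, re-rank each side's items by counting its users' votes.
    votes = {-1: Counter(), +1: Counter()}
    for u, liked in users_liked.items():
        votes[deltas[u]].update(liked)
    rank_left = {item: r for r, (item, _) in enumerate(votes[-1].most_common(), start=1)}
    rank_right = {item: r for r, (item, _) in enumerate(votes[+1].most_common(), start=1)}
    return deltas, rank_left, rank_right

# Iterating alternating_round until no delta changes gives the stable partition;
# a sparse binary classifier (the hyperplane w) is then trained to separate the two sides.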
And we can do this for each user in turn and get a repartitioning of the set of users. Now note that optimizing each user's delta can be done in microseconds, because all you need to do is see which items the user liked—that's a couple of table look-ups—see their ranks on the left and their ranks on the right, compute nDCG, and assign the user to whichever side is higher. So this is done in microseconds. The next step in the optimization is to freeze the user variables and recalculate the item rankings, right? But again, that's very efficient, because you just need to make one pass over your ratings matrix, see which users have been assigned to each side, count how many votes there were for each item, and sort, 'kay? So that's just going through the ratings matrix and doing a sort, which can also be done very quickly. And now you can see that this ranking is slightly better than the ranking we started out with, but there's still room for improvement. So we can keep applying iterations of this alternating minimization—first freeze the item rankings and optimize the user variables, then freeze the users and optimize the item rankings—and we can prove that there will soon come a time when no user will want to switch partitions any further, at which point we'll have converged to a stable partition and a stable ranking. I'm not going to go into the details of that proof—you can come and talk to me afterwards or look at our KDD paper—but essentially, in a very small number of iterations, you'll have reached here, 'kay? The only problem left now is to separate the users on the left from the users on the right, but that's a simple binary classification problem, and you can use your favorite machine learning toolkit—TLC, VW, Azure ML, whatever you like—to optimize that L1-regularized log loss objective and learn a binary classifier, 'kay? >>: Alright, quick question—this is Chris—I missed how the xi's encode the users; you mentioned how that was—is it just a one-of-n encoding? >> Manik Varma: Oh, no, the xi's are the user feature vectors, right? So his age, his gender, his IP address, whatever else we know about him—that's his feature vector. And the ratings matrix is a zero-one encoding of which items he liked, right? So what I wanted to say was that this entire operation can be carried out very efficiently. Optimizing each delta and r takes only microseconds, and this hyperplane can be learnt in a matter of seconds, right? And once we've learnt the hyperplane, we know how to partition a node into a left and a right child, so now we know how to grow the tree, and we can simply start at the top and keep applying the procedure recursively until the entire FastXML tree has been learnt in minutes, eh? So let's finally get to some results. We benchmarked FastXML on a bunch of small, medium, and large datasets. The advantage of the small datasets is that they're all publicly available, and we can compare FastXML's performance to a number of techniques that have been proposed in the literature. For the medium-scale datasets, we tried Wikipedia—the challenge version of Wikipedia; it has about three hundred and twenty-five thousand labels, and not many algorithms will scale to it—but it's publicly available.
Then the Ads datasets are all proprietary to Microsoft, and the largest Ads dataset has about seventy million training points, twenty million test points, nine million labels, and two million dimensions, 'kay? So the results I'm going to show you now are from KDD; we just learned that our NIPS paper—our paper on embeddings—got accepted, so the complexion of the results will change slightly at NIPS, but for now, let's just go with the KDD results, 'kay? So I'm going to start by showing you results on the small datasets. Here is precision at one, precision at three, and precision at five for a bunch of algorithms on the small datasets. Now, because these datasets are small, we don't really care about training time, so we decided to focus on prediction accuracy, and we did a very fine sweep over the hyper-parameters of all the algorithms except for FastXML. So FastXML also has a few hyper-parameters, such as the number of trees or the maximum depth of a tree, but we decided to keep these fixed and set them to default values across all datasets, both small and large, because ultimately we will care a lot about FastXML's training time, and I don't want to spend time retraining FastXML again and again with different hyper-parameter settings when we get to the large datasets. So of course, this gives all of the other algorithms a slightly unfair advantage over FastXML, but that's alright, because even with one hand tied behind its back, FastXML can still equal or outperform all of the other algorithms that have been published in the literature. So if you start by looking at all the low-rank matrix factorization, collaborative-filtering, and embedding techniques that have been published, starting from the compressed sensing—or CS—work of John Langford and Sham Kakade, going all the way up to the absolute state of the art in the field—at least until the coming NIPS—which is the LEML algorithm of Inderjit Dhillon and Prateek Jain, we see that FastXML is considerably better than these. Even if these techniques were taken to their limiting case, and we learned a full-rank matrix rather than a low-rank matrix—as is done in the one-versus-all baseline—these models might still not be able to outperform FastXML. Finally, FastXML might be better than other tree-based methods as well, particularly as compared to MLRF—which is the multi-label random forest that we had proposed two years ago—as well as LPSR, which is a technique that Jason Weston had proposed when he was still at Google, and which was shown to give lifts in CTR while recommending videos on YouTube. So it would appear that, thanks to nDCG, even untuned FastXML can equal or outperform highly tuned versions of all the algorithms that have been presented in the literature. However, this is not the scale for which FastXML was designed. So if you move to a slightly larger dataset, this message gets even more strongly reinforced. There are a couple of things to notice. The first is that most of the algorithms can no longer scale to Wikipedia if you restrict training time to something reasonable—let's say up to one full day on a single core of a standard desktop. Of the algorithms that do scale, FastXML's top-ranked prediction can be significantly more accurate as compared to both LEML and LPSR; as compared to LEML, there's about a thirty percent improvement, and about twenty-five percent as compared to LPSR.
The second thing to note is that, two years ago, it took me about five hours to train MLRF, but that was on a thousand-node cluster in Cosmos; today, it takes me about five hours to train FastXML on Wikipedia, but that's on a single core of a standard desktop, and that's thanks to the new optimization based on alternating minimization. But again, this is still not the scale for which FastXML was designed; if you just wanted to stick at this scale, there are a bunch of other tricks we could have played which would have pushed this up to nearly sixty percent. So if we move to slightly larger datasets, we see the trend is still the same. FastXML is more accurate in prediction than both LEML and LPSR on the Ads-430K dataset, and then LEML can no longer scale to the Ads one million dataset, and neither LEML nor LPSR can scale to the Ads nine million dataset. And notice that, throughout, our prediction time is always either less than one millisecond, or around one millisecond in the largest case. So this thing can actually be used in real-world applications if you like. Now, the reason I was fixating on training on a single core is because that's what my student had back home at IIT Delhi, but modern-day machines have more than one core, and FastXML can exploit that trivially by parallelizing—by growing different trees on different cores. So here is how my training time varies with the number of cores; you can see that if I have access to sixteen cores, then my training time is less than half an hour on both Ads one million and Wikipedia, and I can train on Ads-430K in less than five minutes. My training time on Ads nine million is still very large—it takes me about ten hours to train thirty trees and seventeen hours to train fifty trees—but that's much better than the two days it was taking me two years ago on a thousand-node cluster. However, if there are any experts out there in large-scale learning on GPUs, I would love to discuss this more with you and see if we can speed this up even further. >>: Manik, I have a question. I missed something; what's the difference between two trees? I understand how you grow one tree, but you're growing two trees in parallel? Is it just that the random initialization is different, or …? >> Manik Varma: The random initialization is different, and then we're using an L1-regularized log loss, so it's not strongly convex, so depending on the initialization, you get different results. >>: Alright. >> Manik Varma: Sorry. We tried randomly sampling the data points or the features, but that didn't help; that wasn't a very good idea. >>: Okay. Is there any way to use something like gradient boosting here, as opposed to just random forests? >> Manik Varma: Maybe. It would slow your training down a lot, and I really, really wanted to make training very fast over here. What happened was my cluster got taken away from me. [laughter] Right? They said I was contributing too much to global warming. [laughter] So I said, "What's the one thing they can't take away from me? It's my desktop. I really, really want to train on one desktop." [laughs] So I was really fixated on making this thing train very quickly. Sure, if you had the luxury, you could do that; I'll get to what appropriate loss functions might be on which you might want to compute the gradient.
So actually, that's going to come just next, so let's have that discussion in five minutes, if you don't mind. >>: I have a quick question for you. I understand why it's faster—that's clear—but what's your intuition as to why it's more accurate than the other methods? >> Manik Varma: So, more accurate as compared to what? If you look at all the low-rank matrix factorization work that's been developed—right—I mean, if you look at recommendation, almost everybody has done collaborative filtering or low-rank factorizations. But when you come to this scale, if you look at the ratings matrix, there is just no way it can be low-rank. What happens is: you'll have a hundred million users; each of the users in the tail will like two or three items, but they'll be a different two or three items. So one row will have three nonzero entries; the next guy will have a different three entries nonzero; the next guy, another different three entries nonzero. So there's no way any row can be written as a linear combination of the other rows. >>: But even on the small problems, it was better. >> Manik Varma: It's not really that much better there. The real lift comes here—at Wikipedia—and when we went to Wikipedia and looked at the low-rank assumption, we did an SVD and kept the top five hundred singular vectors, and we saw that that captured only ten percent of the matrix. And five hundred is a number that's much larger than what everybody else uses; everybody else goes with between twenty and fifty embedding dimensions, because otherwise the computational cost is just too high—prediction, training, both, right? So only ten percent of the matrix is being captured. So that's what our NIPS paper is about; we're trying to break free of the low-rank assumption and see if we can do something which has a nonlinear embedding that preserves local distances, and that apparently works much better. So that's the reason for the performance improvement over low-rank methods. As compared to other tree-based methods, well, I think we're much better than LPSR because all Jason does is k-means—he clusters the users into two sets and goes with that—so there's no regard for the loss function there. As compared to MLRF, I think the main reason we're better is because we're not taking single features and learning splits on them; we're learning a full hyperplane. What was happening with MLRF is that when you have ten million features—or two million features—your features become super selective, which means that if you were to take a split on any single feature, then even for the best features, you would have, let's say, ten thousand or a hundred thousand documents going left and then the rest of the hundred million going right. So you learn these imbalanced trees, and you don't get very good generalization. You can try and force MLRF to use multiple features or to learn balanced trees, but then your training time goes up even further, and your accuracy—in some cases—comes down because of the regularizer. So I think that's why this is better. And then, with nDCG, you want to do some kind of ranking in any case; that's how you're going to measure performance—right—these are more like ranking and recommendation tasks, so it turns out to be a better thing to do.
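As a concrete illustration of that kind of check, here is a minimal sketch in Python that measures how much of a sparse user-item (or document-label) matrix is captured by a rank-k truncated SVD. The random matrix and the choice of k below are stand-ins, not the Wikipedia data from the talk.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def captured_fraction(Y, k):
    # Fraction of Y's squared Frobenius norm captured by its top-k singular values.
    _, s, _ = svds(Y.astype(float), k=k)
    total = Y.multiply(Y).sum()
    return float(np.sum(s ** 2) / total)

# Illustrative use: a very sparse 0/1 matrix with a handful of liked items per user.
Y = sp.random(20000, 5000, density=0.001, format="csr")
Y.data[:] = 1.0
print(captured_fraction(Y, k=100))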
The only thing I'm not sure about—and I would love to get some help from you guys on this—is what the objective function should be at the root node, because ultimately the predictions are going to be made by the leaves, right? So if I'm measuring something like precision at K, I want to optimize precision at K at the leaves, and if I'm measuring nDCG at K, I want to optimize nDCG at K at the leaves, but doing that at the root node would be too myopic, right? If I just wanted to optimize, let's say, nDCG at five at the root node, all that would ask is: what are the five best items on the left, and what are the five best items on the right? And I don't care about anything else. So on Wikipedia, where you have, let's say, half a million items, there's no way that will work well. So we decided to go with nDCG over the entire set of items, and one of the things that you pointed out—which was very helpful—is that there's a lot of information in the ranking over the rest of the tail, so trying to learn a function over all L items, rather than just the top five, can help you in any case. So we found that not being too myopic towards the top—ranking all the items rather than just the top K—seemed to work well. Sorry, that was a huge monologue. [laughs] >>: That's good. >> Manik Varma: So to conclude this part of the talk, there are only two take-home messages. The first is that extreme classification is a new area in machine learning which will allow us not only to tackle classification problems at web scale, but which might also allow us to go back to other problems in machine learning and try to reformulate them as extreme classification tasks; the second take-home message is that FastXML is a new algorithm for extreme classification which can make significantly more accurate predictions as compared to other methods, and which you can train on your desktop. So if you're interested in the code, just e-mail me; I'd be very happy to share the code or the data or, as I said, anything else. >>: Manik. >> Manik Varma: Anandan? >>: Yeah. >> Manik Varma: Hey. >>: I wanted to ask: what's the one thing you did differently from, say, a year ago? Is it the alternating minimization? >> Manik Varma: So it's both the objective function and the optimization. Earlier on, I was trying to optimize the Gini index; now, I'm trying to optimize nDCG; and earlier on, I was doing the standard random forest optimization, which is the brute-force search and thresholding; now, I'm doing the alternating minimization. Those are the only two changes, and both are important. I'll show you results later on about that—just wait for five minutes, and I'll show you those results. >>: Do they both contribute to performance as well as accuracy, or just one …? >> Manik Varma: Yes, they both do, yes, yes. So I'll show you results in five minutes if you don't mind. The reason I'm saying five minutes is that I want to finish off this last portion of the talk, where I want to go beyond discussing the specifics of a particular algorithm and return to the general area of extreme classification, right? Why did we need to come up with a new area? Why did we need to come up with a new name? Why couldn't we just have done whatever people were doing earlier on? Right? So what I'd like to do is discuss how extreme classification could be different from traditional classification.
And I think that not only are the scaling and computational aspects different, but the statistics are also different, and this is perhaps best highlighted by considering how we might evaluate the performance of an extreme classification algorithm as compared to a standard, traditional classification algorithm. So this goes back to the performance evaluation question that somebody asked right at the beginning, 'kay? It turns out that a number of loss or gain functions have been proposed for evaluating traditional multi-label learners—such as the Hamming loss, the subset 0/1 loss, the Jaccard distance, precision, recall, F-score, and so on and so forth—but I think that these might not directly apply in the extreme setting. So I've made a toy example to convey that, and the US presidential elections are very much in the news, which is why I chose this particular example. I'm going to give you the Wikipedia pages for five US presidents, and the task is to label each of these pages with five Wikipedia labels, 'kay? And I'm showing you the results of three algorithms over here that attempt this task. The first algorithm is a constant algorithm, which means that no matter what the input is, it's going to output the same list of labels. So it looks at some of the most popular labels on Wikipedia, sorts them by popularity, and just outputs that list. So that's algorithm one; algorithm two does the opposite of one. Instead of looking at the head, it completely ignores the head and focuses just on the tail. So it's trying to predict labels that are not very common, but because this is a very difficult task, it often makes mistakes, and I'm showing that by these dashed lines over here. I actually put this algorithm in because of Leon's keynote at ICML, where he was discussing metrics a lot, and one of the metrics that he was proposing was coverage, right? So if you look at coverage, I've put this algorithm in for him over here. And then the third algorithm is a mix of one and two, right? It has some head labels, but then it also has some tail labels. So what I'd like to do is a quick show of hands; I want to see which algorithms you like. People who think algorithm one is the best should raise one hand; people who think algorithm two is better than one and three should raise both hands; and people who like algorithm three should raise, like, two hands and a leg or something. [laughter] Or don't do anything. Hey, Anandan, do a hand count for me please. >>: See, I'm useful. I get—yeah—I get four two hands, and one one hand, and no leg. >> Manik Varma: Four two hands? [laughter] >>: [indiscernible] yeah. >>: 'Kay, three legs. >> Manik Varma: I see. >>: Your legs would have to be … >> Manik Varma: [laughs] So I think most people would prefer the third algorithm—I can't see over here … well, here, right? Most people, by and large … sorry? >>: You should use yeas and nays next time if you want to count. >> Manik Varma: Oh, okay. [laughs] That's true. But if you look at these three algorithms—right—algorithm one is the one that maximizes precision. So no other algorithm can have higher precision than one, but typically, we don't tend to like it very much. And yet I've been measuring performance in terms of precision all this while.
If you look at algorithm two, this is the one that maximizes coverage; as I said, coverage is the number of unique labels or categories that you got right; so this has the most correct predictions across labels, but many people don't tend to like this one either, right? Most people tend to like three, but three actually doesn't optimize any of the metrics I listed on the previous slide. And if you look at why that is the case, it becomes very clear why traditional loss functions might not work so well. One of the reasons is something I've already alluded to, and that is the statistics of the positive versus the negative labels. If you look at all of these datasets, on most data points, the number of positive labels is completely dwarfed by the number of negative labels—positive means relevant, and negative means irrelevant, eh? In fact, in many cases, you have fewer than eight positive labels per data point and then hundreds of thousands or millions of negatives. So traditional loss functions—such as the Hamming loss—which place an equal emphasis on predicting the positive and the negative labels, cannot work well in this setting. Now, it turns out that there's also another, more subtle reason why loss functions computed on the negative labels don't work that well, and that's because many of the negative labels aren't really negative, 'kay? Traditionally, in supervised learning, we've been working in a paradigm where we suppose that there exists an expert or an annotator somewhere out there who, whenever we give him a test point, can tell us which labels are relevant to that test point, and so performance evaluation is easy. However, in the extreme setting, there cannot exist an expert or an annotator who can go through a list of ten million labels and mark out the exact relevant subset. So even if you look at Wikipedia and trust Wikipedia's editors—here's Jeannette's page; it has some labels, but you can see that many labels are missing, right? For example, it doesn't mention that she's a VP at Microsoft, or that she was a director at DARPA; it says very little about her research and her contributions there, so there's nothing about type theory and so on, right? So a loss function which would penalize you for predicting that type theory is relevant would actually not be a very good loss function over here, right? So I think we should try and move away from loss functions that look at both positives and negatives, such as the Hamming loss, and work only with loss functions that focus on the positive labels. But that doesn't solve the entire problem—right—because those will still be biased, and it also turns out that all the traditional loss functions that work on just the positives treat each label as being equal; however, that is definitely not the case in the extreme scenario. If you look at the distribution of labels, there are a few labels that have a lot of training data—that occur very frequently—so it's easy to train on those labels, and it's also relatively easy to predict them; however, the vast majority of labels in an extreme scenario occur very infrequently, eh? In many cases, they occur only once in some of the datasets, and as I'm showing you over here, many of the labels occur fewer than five times, eh?
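For reference, here is a minimal sketch in Python of the two metrics being contrasted in this toy example: precision at k, which the constant head-predictor maximizes, and label coverage—the number of distinct labels ever predicted correctly—which the tail-only predictor maximizes. The list and set representations are assumed for illustration; nothing here comes from an actual evaluation in the talk.

def precision_at_k(predicted, relevant, k=5):
    # Fraction of the top-k predicted labels that are actually relevant.
    return sum(1 for label in predicted[:k] if label in relevant) / k

def coverage(all_predicted, all_relevant):
    # Number of unique labels that were correctly predicted at least once
    # across the whole test set.
    hits = set()
    for predicted, relevant in zip(all_predicted, all_relevant):
        hits.update(label for label in predicted if label in relevant)
    return len(hits)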
So these rare labels don't occur very frequently, they're very hard to train on, and they're extremely difficult to predict correctly; however, in many scenarios, you can get much more reward for predicting these rare labels than the popular labels. So again, on Jeannette's page, the single most popular label on Wikipedia is "living person," right? But there's very little information gain in predicting that Jeannette is alive, right? It's much better to predict from the tail—predict type theory, or VP at Microsoft, et cetera, 'kay? And this might not be such a problem if all we had to do was predict all the positive labels, right? Then it's fine; you just predict all of them, and you're done. But in many cases, the number of slots that we have available for prediction is limited. In particular, if you wanted to do recommendation, you might have only five slots on your ads page, or only five slots on your Amazon page, and you want to figure out: what are the five best labels I should show? And particularly in recommendation, recommending items that are popular and common—and that you might already know about—might not be very helpful; what works best is recommending those rare items that really delight and surprise you, right? "I had no idea this existed, but it's relevant to me; it's perfect for what I want; and thank you very much for that recommendation." So I think if we were to design a loss function for the extreme scale, we would need to focus, of course, on accuracy, right? But we should do this in an unbiased way; we should try and come up with loss functions where, as your test set becomes larger and larger, the value that you compute for your metric converges to the value the metric would have had on the fully observed ground truth. And there are tools in statistics which can help us do that—in particular, propensity scoring or importance sampling—and we've started looking at some of those. But in addition to focusing on just accuracy, I think loss functions in the extreme scenario should also reward predicting rare or novel items, and that will work well in some applications; in other applications, you might want diversity. So in some applications, it's important to have a little bit of the head in there as well, right? Because that's where you generate most of your revenue, or in some situations, you might want to have a mix of head and tail. And then finally—particularly in the recommendation scenario—you might want to reward algorithms that are explainable; if they can explain why they made a certain set of recommendations to you, you might trust them more. So I think this is a fascinating research area; there are lots of open research questions over here that might not exist in the traditional setting, and I'd love to discuss these with you, or, if there's any interest, share code or data or talk to you more about them. So that's my pitch, and thank you very much. [applause] I'm happy to take questions. Let me just answer Anandan's point quickly about whether the optimization is more important or the loss function is more important. So one of these slides is on variants of FastXML, I think, yeah.
So you … the optimization is not going to be the alternating minimization; it’s going to be the standard random forest im … optimization, where you sample features randomly, do a sweep, pick the single best feature. So you can see that in most of the results, MLRF optimized … optimizing nDCG does not do well. At the same time, you also said, “Okay, what happens if you take FastXML and restrict it to optimizing precision at five or nDCG at five?” Which I was mentioning to Chris earlier. That also doesn’t do well. So I would think that not only is optimizing nDCG important—rather than Gini index or entropy—but it is also important to optimize it, learn a full hyperplane, and do the alternating minimization. Yeah, other questions? >>: I had a—I mean—a quest, or a comment, or—I don’t know—just a topic that I’m … so I think that … so first of all, I think it’s … the work is great. I think you glanced over one important thing, which is kind of subtle, and maybe that’s why, but you kind of passed over it as if it doesn’t exist, so we … so I want to just ask you: what do you think about it? So when you talk about this setting, you can talk about one of two scenarios: one is where you have a very big, but fixed, set of labels; then all the training set asymptotes to infinity, so now we can think about it as a standard machine learning question, only with the constant number of labels being very, very big; but all of our theory, all of our understanding, all of our intuition still works out well. And I think the example you gave—as you said, “Well, let’s look at the labels given in Wikipedia, and let’s use those to label the entire internet.” So that whole … we’ll freeze Wikipedia at one point in time; we have many, many labels, but it’s fixed; and now, the web keeps growing and growing; it can asymptotes to infinity; and we can apply our standard statistical tools and understanding of machine. If you let—given a slightly different example—if you had said, “Listen, let’s use the labels in Wikipedia and label new Wikipedia pages that come in,” now, you would have fallen into the trap, because now, you have two things going to infinity: one is the number of labels, and the other is the dataset size, and they are going together to infinity. So every new Wikipedia page appears with new labels, so the label set grows almost at—you know—perhaps at the same rate as your training data goes, so the nu … so again, sample size and number of labels both grow the same rate. And now, all existing machine learning theory goes out of the window, and many things that you kind of took for granted here—you know, you were focusing on doing it faster, doing it better, so on—now, all bets are off; you have to start from scratch; and it seems like—I don’t know—maybe you were saying that this would work in both scenarios. >> Manik Varma: Yes, so … >>: There’s … I see no reason why it should work in the latter. >> Manik Varma: So I’m not a theory guy. So theoretically, things might really go down in a basket; in fact, that’s a great way for you to come in and help peep proved some theorems about what is happening, right? Practically, we’ve observed two things. One is: as new labels come in, both in the tree setting and in the embedding setting—which is for our NIPS paper—you can always take new labels and add them to the leaf nodes of the trees or embed them into the common space that you’re … that the CCS space—the low-rank matrix factorizations. 
In either case, new labels, as they come in, simply get inserted into the problem, and now you can predict them. And the guys at Google—Samy Bengio, Jason Weston—have been doing this for a number of years now. So if you go back to the zero-shot learning setting, when a new label comes in, you just propagate it down your tree, figure out which leaf node it lands in, and update your leaf node distribution. In the extreme multi-class setting, John Langford has been working on these logarithmic-time trees—so it's online learning; points are coming in in a streaming fashion—and the trees are grown as the data points come in, and again, you can apply similar kinds of things. So John actually has theorems showing—at least for some of these—that his trees will be balanced and so on; I don't think he has statistical guarantees. And that's actually a really interesting theory question, right? How do you even give statistical guarantees at this scale? What does it mean for me to have a million labels? What is the complexity of this space? I could take one label and replicate it a million times, but that doesn't mean I have a million-label problem, right? And as you were pointing out, the way the label space goes to infinity might be really interesting. So I think there are really new theoretical research questions over here: what is the dimensionality of the ambient label space? How do I give statistical guarantees? Those I would love for theory people to work on. Some theory work has recently started coming out; at ICML, there was a workshop on this, which I'm told by the organizers was very successful. They said that it was the second most-attended workshop after deep learning. I think there's a big gap, though—right—six hundred for deep learning and a hundred or two hundred for this one, yeah. So what they were saying was: let's say I'm going to take a Hamming loss and optimize it using one-versus-all or whatever standard techniques I have, but now suppose I have to do efficient prediction, so I'm going to construct a tree; well, how much worse will my Hamming loss be? Can I give a bound on that as compared to the Hamming loss of the full one-versus-all approach? So I think there are lots of new questions coming out, and I should just put them up on a slide. If people are interested in working on them, I'd be very happy to chat and discuss, and so … >> Ofer Dekel: Let's thank the speaker again. [applause] >> Manik Varma: Thanks, all.