
>> Ofer Dekel: Okay, welcome, everybody. Thanks for coming. It’s our pleasure today to have Manik
Varma give us a talk on extreme classification, a new paradigm for ranking and recommendation. Thank
you, Manik.
>> Manik Varma: Cool.
>> Ofer Dekel: Oh, wait, let me … I’m sorry; I have to make an announcement. So for those of you
that are watching from your offices, I’ll be monitoring questions on the tablet, so feel free
to ask, and I’ll relay them to Manik. Thank you.
>> Manik Varma: Cool. Cool, okay, good to go? So thanks very much, Ofer. Right, so I’m Manik Varma
from Microsoft Research India, and I’ll be talking about extreme classification, which might provide a
new paradigm for thinking about core problems in machine learning, such as ranking, recommendation,
possibly structured output prediction, and so on. Yeah? Now, many of you might not have heard the
term extreme classification before, so let me start by giving some context. In classification, the
complexity of the learning task has grown from binary classification, where we learn to pick a single item
from amongst two labels, to multi-class classification, where we learn to pick an item from amongst L
labels—with L being larger than two—to multi-label classification, where we learn to pick the most
relevant subset of these L labels. At the same time, the complexity of the learning task has also grown in
terms of the number of labels being considered. So we’ve moved from working with two labels for
binary classification to tens to hundreds to thousands of labels for multi-label learning. Yeah? And if
you looked at the state of the art about three years ago, then the largest multi-label learning dataset had about five thousand labels, so the size of the output space was two to the
power five thousand, which was—well—considered to be large. And so it was thought that going
beyond that would be very hard.
Then, two years ago, we exploded the number of labels to ten million; the application was to build a
classifier that could predict the subset of millions of Bing queries that might potentially lead to a click on
a new ad or on a new webpage. So the input of the algorithm would be an ad, such as this ad for Geico
car insurance, and the output of the algorithm was the subset of queries that might lead to a click on the
ad, such as “cheap car insurance” or www.geico.com. Using a tool such as this, an
advertiser could figure out most of the queries that might lead to a click on his ad, and he could then go
to a search engine, such as Bing or Google, and say, “Hey—you know—anytime somebody asks this
particular query, please show them this ad, and if the user clicks on the ad, I’ll give you a dollar.” Now,
as you can well imagine from the application, predicting phrases from webpages is a very important
problem, both from a commercial and a research perspective, and so many sophisticated NLP
techniques have been developed in the literature. However, the way we decided to address the
problem was to bypass all these NLP techniques and simply state that we’re going to take the top ten
million queries in Bing, treat each of them as a separate label, and learn a multi-label random forest
classifier—which I will be referring to as MLRF for the rest of the talk—that will take this ad as a test
point, extract bag-of-words features from the raw HTML that lies behind this ad, and then
simply classify that feature vector into these ten million labels and predict the
corresponding queries.
So it took us about two years to build MLRF, but when all the results came in, and all the performance
evaluation was carried out, it turned out that MLRF had a couple of advantages over the state-of-the-art NLP
techniques at that point in time.
>>: Hey, Manik? Question.
>> Manik Varma: Rich?
>>: This is Rich.
>> Manik Varma: Hi.
>>: Why don’t you view it as a binary classification that goes the other way—that tries to classify which
was … yeah.
>> Manik Varma: So that was the standard approach to it, and if you want, I’ll come back to that
towards the end of the talk, but I—as I was just going to say—one of the big advantages we had, as
compared to the binary classification problem, was that we managed to push coverage from somewhere
around sixty percent to ninety-eight percent. So coverage is the percentage of ads for which an
algorithm makes nontrivial recommendations, and just to show you over here—right—so if you take the
binary classifier approach that Rich was mentioning, what you’re essentially going to do is:
you’ll train a binary classifier that’s going to have a sliding window; it’s going to go over every phrase in
this webpage and try and predict whether it is a possible candidate for a bid phrase or not, right? So
what happens is: when you actually run this on the page, you’re only limited to whatever phrases are
there on the page, and that turns out to be a problem in this particular case, because all of this
beautiful-looking text is actually embedded images, right? And most ads are very text-impoverished, so
that binary classification approach doesn’t work all that well, and if you look at all the predictions that
we are making over here, none of them are actually phrases that are present on the webpage, ‘kay?
So this was one of the big advantages: we managed to push coverage up from sixty to ninety-eight
percent, and the product team really cared a lot about that, so there was a big win right over there. The
second advantage was that, even if you focus on just the subset of covered pages—which means the
subset of pages or ads for which the NLP techniques could actually make predictions—our predictions
were significantly more accurate. So if you measured something such as a precision at ten, then MLRF’s
precision at ten was about five percent higher. So that was helpful as well. Sorry, by the way, guys; I
don’t see very well, so if there are any questions, it’s good to, like, shout out, or do a hula hoop dance, or
something. [laughs] If you just raise your hands, I’ll never see you. Right.
>>: Quick question.
>> Manik Varma: Yeah?
>>: What does “might lead to a click” mean?
>> Manik Varma: So it’s trying to go beyond relevance in the sense that it’s likely that, in the past, we’ve
seen users ask this query and click on a very similar ad. So that’s what that is trying to capture—that if
somebody asks this query, and you showed them this ad, they might click on it. There isn’t a more
formal definition at the moment.
>>: But so when … if you’re evaluating: how good are these predictions? It’s some … it’s a … like a
minimum click rate?
>> Manik Varma: Oh, I see. No, so evaluation is something that I’d like to come back to towards the
end. At the moment, this was an advertiser-facing tool, so the way you—or the product group—was
looking to evaluate it would be: how many of your recommendations were actually adopted by the
advertiser?
>>: Oh, okay. Good.
>> Manik Varma: Yeah.
>>: Question?
>> Manik Varma: Yes?
>>: So maybe I missed something, but so on the labels on the right-hand side, are they … where are
they from? Are they [indiscernible]
>> Manik Varma: These are a subset of the top ten million queries in Bing.
>>: Provided by [indiscernible]
>> Manik Varma: Provided by Bing.
>>: By Bing.
>> Manik Varma: Yeah, you go and look at the Bing logs; you see which queries are most
frequent, or most popular, or generate the most revenue; you sort those, take the top ten million; and
that’s what you train on.
>>: I see.
>> Manik Varma: Any other questions?
>>: Yeah, I have a question. Why ten million then? Why not … did you try different things, and that
worked well?
>> Manik Varma: Yeah, so that covered the bulk of the revenue-generating queries. So it covered
enough for the product group to be kind of satisfied. You could have gone larger;
we—internally—we actually went much larger than that; the returns are not that great, so yeah.
Anything else? Okay. So where I was going with this was that MLRF was published at WWW
2013—so two years ago—and since then, many interesting research questions have arisen in this new
area of learning with millions of labels, which we refer to as extreme classification. And I think two of
the most interesting questions are related to applications and performance evaluation, which is what
someone asked over here. So what I’d like to do is start by discussing applications, and then, if there’s
time towards the end of the talk, I’ll touch upon performance evaluation.
Okay, so ten million is a really large number, and I think one of the most interesting questions is: when
or where in the world do we actually have ten million labels to choose from? So I think there are a
couple of applications—high-impact applications—that do exist at this scale, even though ten million is
very large, and one of them is people. So there are millions of people who are uploading selfies of
themselves every day to Facebook, and there’re millions of people who are standing in front of Kinect
cameras, so we could potentially use all this data to train classifiers to recognize people, and then ask,
“Which subset of Facebook users is present in this selfie?” And this might have important applications
in social network analysis, security, surveillance, and so on.
Another interesting application could be Wikipedia. So if you browse … if you scroll down to the bottom
of any Wikipedia page, you’ll find a subset of Wikipedia labels that have been assigned to that page by
Wikipedia’s editors. Now, the total number of Wikipedia labels has crossed over into the millions today,
and wouldn’t it be great if you could build a classifier that could take every document, every webpage,
every tweet, every query, every image, every video, and annotate it with a subset of relevant Wikipedia
categories? There would be so many applications that would get enabled if we could do that
successfully. So in particular, we can think of building these really massive-scale knowledge graphs with
billions of nodes and millions of properties—right—you could go and stamp all the pages on the web
with the set of Wikipedia categories, and now you know: okay, this person is a VP at Microsoft, or this
person is an AI researcher; this person was born in 1952; and this might help you in building these
massive knowledge graphs. You can also try and use these for text featurization. So just as we do with
deep learning, where we take a deep neural network and chop off the last layer and use the intermediate
representation as features, you could play a very similar game over here as well. So we’re trying to see
whether we can use this for text featurization.
But apart from all of these applications, we figured out that what one can also do is go back to core
problems in machine learning and reformulate them as extreme classification tasks. In particular, we
can think about ranking or recommending millions of items and think about whether we can treat that
as an extreme classification task. So the way that would work is: we can treat each label to be—sorry—
each item to be ranked or recommended as a separate label, learn an extreme multi-label classifier, and
use it to predict the subset of items that should be recommended to each user. Thinking about ranking
or recommendation in this way might have a significant impact in terms of performance in some
applications, ‘kay? And that’s similar to what we saw with the phrase prediction and NLP techniques
that we saw earlier, okay?
So let’s get into some technical details about how we can tackle applications such as Wikipedia, or how
we might be able to reformulate problems such as recommendation. And our algorithm is going to be
called FastXML, which stands for a fast, accurate, and stable tree-classifier for extreme multi-label
learning, and it was developed jointly with my PhD student at IIT Delhi, Yashoteja Prabhu. Okay, now
before I get into technical details, let me quickly give you an overview of FastXML so that you know
what’s coming for the rest of the talk. So it turns out that, at this scale, almost all real-world
applications will require us to make predictions in milliseconds. Extreme classifiers should therefore
have prediction costs which grow at most logarithmically with the number of categories or the number
of labels. FastXML ensures this by employing a tree-structured architecture and learning very highly-balanced trees. The second noteworthy point about FastXML is that its prediction accuracy can be
significantly higher, as compared to the state of the art—by up to twenty to twenty-five percent, in
some cases. FastXML achieves this by optimizing a rank-sensitive loss function known as nDCG, which
turns out to be a marked improvement over traditional tree-growing loss functions, such as the Gini
index, or entropy, or the Hamming loss. Finally, FastXML can also be up to a thousand times faster to
train. So two years ago, I needed a large production cluster with almost a thousand cores in order to
train MLRF; today, I can train FastXML on some of the very same problems on a single core of a standard
desktop; and that’s thanks to a new optimization technique based on alternating minimization, which
comes with provable guarantees.
So let’s start by considering the first bullet point and seeing how we can architect FastXML so as to make
predictions in milliseconds. So the way I’m going to formulate the problem is that there will be a space
of users, X, and a space of items, Y, and what we’d like to do is learn a multi-label classifier, f, that is
gonna take a point in the space of users and map it to a set of points in the space of items so
that when a user comes in, we can simply apply our extreme classifier, see which labels get predicted,
and recommend the corresponding items. ‘Kay? Now, the space of items might be very large, and even
if it just takes us one second to determine whether to recommend an item or not, it will take us almost
twelve days to go through a list of one million items and almost four months to go through a list of ten
million. Extreme classifiers therefore face the daunting challenge of having to reduce prediction time by
almost ten orders of magnitude—from four months down to about one millisecond.
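(For concreteness, the back-of-the-envelope arithmetic: at one second per item, one million items take about 10^6 s ≈ 11.6 days, and ten million items take about 10^7 s ≈ 116 days, roughly four months; getting from 10^7 s down to about 10^-3 s is a factor of 10^10, i.e. ten orders of magnitude.)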
Some extreme classifiers address this challenge by learning a tree where each child receives only about
half of its parent’s items, ‘kay? So when a user comes in, he starts off at the root node, which contains
all the items, but then quickly—in logarithmic time—traverses the tree and ends up at a leaf node,
which contains only a few items, and these are the items that are then recommended back to the user.
We can also reformulate ranking problems in this fashion, and return a ranked list of items to the user
by sorting the items according to their probabilities. So this sounds like a reasonable game plan but for
the fact that learning hierarchies is notoriously hard, and any single learned tree would very likely have
been suboptimal. FastXML therefore learns an entire ensemble of trees and simply aggregates the
individual predictions in order to return a final ranked list of items to the user, ‘kay? So this is the very
same architecture that MLRF used two years ago as well, eh? The only thing that’s going to be different
today about FastXML is the way that these trees are going to be learned.
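To make the prediction-time architecture concrete, here is a minimal Python sketch of how an ensemble of such trees could be used at test time; the node layout (a “w” vector at internal nodes, a label-score dictionary at leaves) and the averaging rule are illustrative assumptions, not taken verbatim from the talk or the paper.

    import numpy as np
    from collections import defaultdict

    def predict_top_k(trees, x, top_k=5):
        """Walk each tree from the root to a leaf in logarithmic time, average the
        label scores stored in the leaves, and return the top-ranked labels.
        Assumed node layout (for illustration only):
          internal node: {"w": weight vector, "left": node, "right": node}
          leaf node:     {"scores": {label: probability}}
        """
        totals = defaultdict(float)
        for tree in trees:
            node = tree
            while "scores" not in node:                      # descend to a leaf
                node = node["left"] if np.dot(node["w"], x) < 0 else node["right"]
            for label, p in node["scores"].items():
                totals[label] += p / len(trees)              # aggregate across trees
        return sorted(totals, key=totals.get, reverse=True)[:top_k]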
So let’s move on to the second bullet point and see how we can formulate the learning problem so as to
make much more accurate predictions. The key technical challenge that we need to address is: we need
to figure out how to take a node and split it into a left and a right child, because once we know how to
do that, then we can start at the root node of all the trees and keep applying this procedure recursively
until all the trees are fully grown. Yeah? Now, note that our training data comprises historical
information about which users liked which items, and even though there might be millions of items to
choose from, each user will typically like only a small number of items. It is, therefore, far more
important to correctly predict the items that are going to be liked by a user and to ensure they’re
highly ranked than it is to predict the disliked items—this is our key insight. And FastXML therefore
learns to partition a node by optimizing a rank-sensitive loss function known as nDCG rather than
traditional tree-growing functions, such as the Gini index, or the entropy, or the Hamming loss, because
these traditional loss functions don’t have any concept of ranking built into them, and they place an
equal emphasis on predicting the liked and the disliked items.
So let me try and illustrate what’s going on. When we start out, all the users will be present in the root
node, but what we’ll do is we’ll quickly partition the users into a left and a right child. Yeah? And the
reason we’re going to do that is because a partitioning of users is going to induce not only a clustering
over the items, but also a ranking over the items. So if you look at all the users on the left, they all like
oranges, pomegranates, and bananas … or—sorry—grapes, so all of these three items should be
clustered together and sent to the left, but bananas should not be sent to the left, because nobody likes
them there. Furthermore, if you look at all the users on the left, then we see that four like oranges, four
like pomegranates, but only two like grapes, so oranges and pomegranates should be ranked higher
than grapes. Now, there’re many different ways in which we can partition users, and each of these
different partitions will induce a different ranking over the items, and what we’d like to do is, through
nDCG, choose that particular partition where each user’s items are ranked as highly as possible, ‘kay?
So the way we will partition a node is by learning a hyperplane with normal, w, in the space of user
features, X, such that w transpose x is less than zero for all the users who’ve been assigned to the left
and greater than zero for all the users who’ve been assigned to the right. And we learn the sparsest
possible hyperplane that optimizes nDCG and show that, in the results, this leads to significantly more
accurate predictions.
But before we get on to the results, let me quickly talk about optimizing nDCG. So there’re going to be
some formulae flying around—I apologize in advance for that—but you can try and ignore that if you
like; I’ll explain everything intuitively. The key take-home message from the next set of slides is: we
have a very efficient way of optimizing nDCG, which will allow us to grow our FastXML trees in minutes,
okay? So coming here and talking about nDCG is like preaching to the choir, right? But there might be
one or two of you who don’t know nDCG, so please bear with me, the rest of you, while I just explain
briefly, and then I’ll talk about the optimization, eh? So it turns out that nDCG is incredibly easy to
define; what it does is it measures the quality of a given ranking for a particular user, ‘kay? It’s a
number between zero and one, and larger values mean better rankings, ‘kay? So the way you compute
nDCG—or define nDCG—is: you take your ranking; you look at the top-ranked item; and if the user likes
it, you add a one to your score; otherwise, you do nothing; and then you move on to the second-ranked
item, and if he likes it, you add a one by log two; otherwise, you do nothing; third item, one by log three
if he likes it; otherwise, nothing; and so on and so forth, ‘kay?
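For reference, here is a minimal Python sketch of the DCG/nDCG computation just described; it uses the common 1/log2(rank + 1) discount, so the exact constants may differ slightly from the talk’s informal “one, one by log two, one by log three” description.

    import math

    def dcg_at_k(ranking, liked, k):
        # Sum a discounted gain for every liked item that appears in the top k.
        return sum(1.0 / math.log2(pos + 2)            # pos = 0 -> discount 1
                   for pos, item in enumerate(ranking[:k])
                   if item in liked)

    def ndcg_at_k(ranking, liked, k):
        # Normalize by the best achievable DCG so the score lies between 0 and 1.
        ideal = dcg_at_k(list(liked), liked, min(k, len(liked)))
        return dcg_at_k(ranking, liked, k) / ideal if ideal > 0 else 0.0

    # Example: the user likes items 3 and 7; our ranking puts 7 first and 3 fourth.
    print(ndcg_at_k([7, 1, 2, 3, 5], {3, 7}, k=5))     # ~0.88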
So nDCG is incredibly easy to define, but it turns out that it’s also really hard to optimize. So you can see
that there’s a sort function buried here inside nDCG, and this means that not only is nDCG not convex,
but it is also not differentiable, which means we can’t directly apply any of the large-scale, gradient-based
techniques that we’ve developed for efficient optimization. Furthermore, nDCG might also behave
very erratically with respect to the hyperplane that we’re trying to learn. So
large changes in the hyperplane might have no impact on the induced rankings, and so nDCG will stay
flat in large regions of space, but there will also be situations when just a small change in the
hyperplane will be enough to send a few users from the left to the right, and this will change the
induced item rankings, and so nDCG will suddenly jump up, yeah? So nDCG has this nasty property that
it’s either flat everywhere—which means you don’t know which direction to go in if you want to
optimize it—or wherever anything interesting happens, it’s discontinuous, right? And this is a
very well-known loss function in the learning-to-rank literature, and Ofer and Chris have spent a lot of
time optimizing this. In fact, Chris—at ICML—won a test of time award for the work he’s done on
optimizing nDCG for the last ten years.
So it turns out, like, based on all this, because nDCG is such a hard function to optimize, there’s a very
real danger that our FastXML might actually be harder to train than MLRF. But it turns out that there’s
something special we can do over here, which at first, is actually going to sound counterintuitive. So
what I’m going to do is: I’m going to make the problem even more complex by adding in an extra
hundred million variables into the optimization. Now, this is going to sound bizarre but for the fact that
each of these extra variables that I’m going to add can be optimized in microseconds, and then the
remaining problem that I’ll have left will turn out to be a simple binary classification problem for learning the
hyperplane, which is our bread-and-butter task, and which we know how to solve very efficiently, ‘kay?
So here’s what I’m going to do: I’m going to take each user and add an extra variable for him or her,
which I’ll call delta, and delta can be either minus one or plus one, depending on whether the user has
been assigned to the left partition or to the right partition; at the same time, I’ll introduce two extra
variables, r minus and r plus, for each item, and these will specify the rank of the item on the left and
the right child, respectively, ‘kay? And at the top, you can see how I’ve modified the objective function
to bind together the hyperplane normal, w, to the item-ranking variables, r minus and r plus, to the user
partition variables, delta.
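For readers who want the formula, the node-splitting objective in the KDD paper is roughly of the following form (a schematic reconstruction—constants and exact notation may differ from the actual slide and paper):

    minimize over w, delta, r+, r-:
        ||w||_1
        + C_delta * sum_i log(1 + exp(-delta_i * (w^T x_i)))
        - C_r * sum_i [ (1 + delta_i)/2 * nDCG(r+, y_i) + (1 - delta_i)/2 * nDCG(r-, y_i) ]

    where delta_i in {-1, +1} assigns user i to the left or right child, r+ and r- are
    the item rankings in the two children, y_i is the set of items user i likes, and
    the L1 term on w encourages the sparsest possible separating hyperplane.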
Now, there might be a hundred million users and ten million items, so we’ll have added an extra
hundred and twenty million variables, but here is how we can optimize them very efficiently. So we’ll
start by simply assigning each of the users to a left or a right partition, completely at random—‘kay, so
there’s going to be a random initialization—but because I know which items each user likes, this will
induce not only a clustering over the items, but also a ranking over the items on the left and the right.
Now, you can see that this is not a very good partition, because the induced item ranking is not
compact, and so nDCG will be low, but what we can now do is apply iterations of this alternating
minimization procedure where we are first going to freeze the item rankings and then optimize over the
user variables, and then freeze the user variables and optimize over the item rankings. And we want to
keep doing this until we converge, okay?
So the way we’re going to start is we’ll freeze the item rankings, and now we’re going to optimize all the
user variables by going to each user in turn—let’s say we start with this particular user—and we’re going
to go to each user in turn and ask him, “Hey, are you better off sticking in your current partition, or
would you actually like to switch partitions?” So if you look at this user, we see he’s been assigned to
the right partition based on the random initialization, but his items are actually ranked higher on the left
than they are on the right, so he would be better off switching partitions. And we can do this for each
user in turn and get a repartitioning of the set of users. Now note that each of these—we’re optimizing
each delta for the user—can be done in microseconds, because all you need to do is go and see which
items the user liked—that’s a couple of table look-ups—you see their ranks on the left and their ranks
on the right, compute nDCG, and whichever is higher, you just assign the user there. So this is done in
microseconds.
The next step in the optimization is to freeze the user variables and recalculate the item rankings, right?
But again, that’s very efficient, right? Because you just need to make one pass over your ratings matrix,
see which users have been assigned to the right, and simply see how many votes there were for each
item, and sort, ‘kay? So that’s just going through the ratings matrix and doing a sort. So that can be
also done very quickly in microseconds, ‘kay? And now, you can see that this ranking is slightly better
than the ranking we started out with, but there’s still room for improvement. So we can keep applying
iterations of this alternating minimization—first freeze the item rankings and then optimize the user
variables, then freeze the users, optimize the item rankings—and we can prove that there will soon
come a time where no user will want to switch partitions any further, at which point of time, we’ll have
converged to a stable partition and a stable ranking. So I’m not going to go into the details of that
proof—you can come and talk to me afterwards or look at our KDD paper—but essentially, in a
very small number of iterations, you’ll have reached here, ‘kay?
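Here is a minimal Python sketch of the alternating procedure just described: re-rank the items on each side by votes, then move each user to whichever side ranks his or her liked items higher, and repeat until nobody wants to switch. The data layout is an assumption for illustration, and the real objective also couples in the hyperplane/log-loss term, which this sketch omits.

    import math
    import random
    from collections import defaultdict

    def rank_items(users, likes):
        # Rank items by how many of the given users like them (most votes first);
        # return a dict mapping item -> rank (1-based).
        votes = defaultdict(int)
        for u in users:
            for item in likes[u]:
                votes[item] += 1
        ordered = sorted(votes, key=votes.get, reverse=True)
        return {item: pos + 1 for pos, item in enumerate(ordered)}

    def user_dcg(u, rank_of, likes):
        # DCG of this user's liked items under the given ranking; the normalizer is
        # the same on both sides, so comparing DCG is enough to compare nDCG.
        return sum(1.0 / math.log2(rank_of[item] + 1)
                   for item in likes[u] if item in rank_of)

    def alternating_partition(users, likes, max_iters=100, seed=0):
        rng = random.Random(seed)
        side = {u: rng.choice((-1, +1)) for u in users}      # random initialization
        for _ in range(max_iters):
            left  = rank_items([u for u in users if side[u] < 0], likes)   # freeze rankings
            right = rank_items([u for u in users if side[u] > 0], likes)
            changed = False
            for u in users:                                   # per-user delta update
                best = -1 if user_dcg(u, left, likes) >= user_dcg(u, right, likes) else +1
                if best != side[u]:
                    side[u], changed = best, True
            if not changed:                                   # converged: nobody switches
                break
        return side

    # Tiny usage example with made-up users and items:
    likes = {"u1": {"orange", "pomegranate"}, "u2": {"orange", "grape"}, "u3": {"banana"}}
    print(alternating_partition(["u1", "u2", "u3"], likes))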
The only problem that is left now is to separate the users on the left from the users on the right, but
that’s a simple binary classification problem, and you can use your favorite
machine learning toolkit—TLC, VW, Azure ML, whatever you like—that will just optimize that L1 log loss
objective and learn a binary classifier, ‘kay?
>>: Alright, quick question—this is Chris—so I missed how the xi’s encode the users; you mentioned
how that was done—is it just a one-of-n encoding?
>> Manik Varma: Oh, no, so the xi’s are the user feature vector, right? So the … his age, his gender, his
IP address, whatever else we know about him—so that’s his feature vector. And the ratings
matrix is a zero-one encoding of which items he liked, right? So this entire … what I wanted to say was
that this entire operation can be carried out very efficiently. It … optimizing each delta and r takes only
microseconds, and this hyperplane can be learnt in a matter of seconds, right? And once we’ve learnt
the hyperplane, then we know how to partition a node into a left and a right child, so now we know how
to grow the tree, and we can simply start at the top and keep applying the procedure recursively until
the entire FastXML tree has been learnt in minutes, eh?
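As a concrete stand-in for the “favorite toolkit” step, here is one way to learn the sparse separating hyperplane with scikit-learn’s L1-regularized logistic regression; scikit-learn is not one of the tools named in the talk, and the features and labels below are placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder inputs: X holds the feature vectors of the users in this node,
    # and y[i] in {-1, +1} is the partition found by the alternating minimization.
    X = np.random.randn(1000, 50)
    y = np.random.choice([-1, 1], size=1000)

    # L1-regularized log loss gives a sparse hyperplane normal w; at prediction
    # time a user goes left if w . x < 0 and right otherwise.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, y)
    w = clf.coef_.ravel()
    print("nonzero weights:", np.count_nonzero(w), "of", w.size)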
So let’s finally get to some results. We benchmarked FastXML on a bunch of small, medium, and large
datasets. The advantage of the small datasets is that they’re all publicly available, and we can
compare FastXML’s performance to a number of techniques that have been proposed in the literature.
Of the medium-scale datasets, we tried Wikipedia—so this is the challenge version of Wikipedia; it has
about three hundred and twenty-five thousand labels, but not many algorithms will scale to it—but it’s
publicly available. Then the Ads datasets are all proprietary to Microsoft, and the largest Ads dataset
has about seventy million training points, twenty million test points, nine million labels, and two million
dimensions, ‘kay? So the results I’m going to show you now are from KDD; we just learned that our NIPS
paper got accepted—so our paper on embeddings got accepted to NIPS—so the complexion of results
will change slightly at NIPS, but for now, let’s just go with the KDD results, ‘kay?
So I’m going to start by showing you results on the small datasets. So here is precision at one, precision
at three, and precision at five of a bunch of algorithms on the small datasets. Now, because these
datasets are small, we don’t really care about training time, so we decided to focus on prediction
accuracy, and we did a very fine sweep over the hyper-parameters of all the algorithms except for
FastXML. So FastXML also has a few hyper-parameters, such as the number of trees or the maximum
depth of a tree, but we decided to keep these fixed and set them to default values across all datasets,
both small and large, because ultimately, we will care a lot about FastXML’s training time, and I don’t
want to spend time retraining FastXML again and again and again with different hyper-parameter
settings when we get to the large datasets. So of course, this gives all of the other algorithms a slightly
unfair advantage over FastXML, but that’s alright, because even with one hand tied behind its back,
FastXML can still equal or outperform all of the other algorithms that have been published in the
literature.
So if you start by looking at all the low-rank matrix factorization collaborative-filtering embedding
techniques that have been published, starting from the compressed sensing—or CS—work of John Langford and
Sham Kakade, going all the way up to the absolute state of the art in the field—at least, until the coming
NIPS—which is the LEML algorithm of Inderjit Dhillon and Prateek Jain—we see that FastXML is much better
than these—considerably better. Even if these techniques were taken to their limiting case, and we
learned a full-rank matrix rather than a low-rank matrix—as is done in the one-versus-all baseline—
these models might still not be able to outperform FastXML. Finally, FastXML might be better than
other tree-based methods as well, particularly as compared to MLRF—which is the multi-label random
forest that we had proposed two years ago—as well as LPSR, which is a technique that Jason Weston
had proposed when he was still at Google, and which was shown to give lifts in CTR while
recommending videos on YouTube.
So it would appear that, thanks to nDCG, even untuned FastXML can equal or outperform the
highly-tuned versions of all the algorithms that have been presented in the literature. However, this is not
the scale for which FastXML was designed. So if you move to a slightly larger dataset, this message gets
even more strongly reinforced. There’re a couple of things to notice; the first is that most of the
algorithms can no longer scale to Wikipedia if you restrict training to be something reasonable—let’s say
up to one full day on a single core of a standard desktop. Of the algorithms that do
scale, FastXML’s top-ranked prediction can be significantly more accurate as compared to both LEML
and LPSR; so as compared to LEML, there’s about a thirty percent improvement, and about twenty-five
percent as compared to LPSR. The second thing to note is that, two years ago, it took me about five
hours to train MLRF, but that was on a thousand-node cluster in Cosmos; today, I can train FastXML on
Wikipedia in—so it took me four hours earlier—now, it takes me about five hours, but that’s on a single
core of a standard desktop; and that’s thanks to the new optimization based on alternating
minimization. But again, this is still not the scale for which FastXML was designed; if you just wanted
to stick at this scale, there’re a bunch of other tricks we could have played which would
have pushed this up to nearly sixty percent.
So if you move to slightly larger datasets, we see the trend is still the same. So FastXML is more
accurate in prediction than both LEML and LPSR on the Ads-430K dataset, and then LEML can no
longer scale to the Ads one million dataset, and neither LEML nor LPSR can scale to the Ads nine million
dataset. And if you notice throughout, our prediction time is always either less than one millisecond or
around one millisecond in the largest case. So this thing can actually be used in real-world applications
if you like. Now, the reason I was fixating on training on a single core is because that’s what my student
had back home at IIT Delhi, but modern-day machines have more than one core, and FastXML can
exploit that trivially by parallelizing: growing different trees on different cores. So here is how my
training time varies with the different number of cores; you can see that if I have access to sixteen cores,
then my training time is less than half an hour on both Ads one million and on Wikipedia, and I can train
in … on Ads-430K in less than five minutes. My training time on Ads nine million is still very large—it
takes me about ten hours to train thirty trees and seventeen hours to train fifty trees—but that’s much
better than the two days it was taking me two years ago on a thousand-node cluster. However, if there are any
experts out there in terms of large-scale learning on GPUs, I would love to discuss this more with you
and see if we can speed this up even further.
>>: Manik, I have a question. I missed something; so what’s the difference between two trees? I
understand how you grow one tree, but you’re growing two trees in parallel? What … I mean, is just the
random initialization different, or …?
>> Manik Varma: The random initialization is different, and then we’re using an L1 log loss, so that’s not
strongly convex, so depending on the initialization, you get different results for that.
>>: Alright.
>> Manik Varma: Sorry. We tried randomly sampling the data points or the features, but that didn’t
help; that wasn’t a very good idea.
>>: Okay. Is there any way to use here something like gradient boosting as opposed to just random
forests?
>> Manik Varma: Maybe. It would slow your training down a lot, and I really, really wanted to make
training very fast over here. What happened was my cluster got taken away from me. [laughter] Right?
They said I was contributing too much to global warming. [laughter] So I said, “What’s the one thing they
can’t take away from me—right—it’s my desktop. I really, really want to train on one desktop.” [laughs]
So I was really fixated on making this thing train very quickly. Sure, if you had the luxury, you could do
that; I’ll get to what appropriate loss functions might be on which you might want to compute the
gradient. So actually, that’s going to come just next, so let’s have that discussion in five minutes, if you
don’t mind.
>>: I have a quick question for you. I … so I understand why it’s faster—it’s clear—what’s your intuition
as to why it’s more accurate than the other methods?
>> Manik Varma: So more accurate as compared to what? If you look at all the low-rank matrix
factorization work that’s been developed—right—I mean, if you look at recommendation, almost
everybody’s done collaborative filtering or low-rank factorizations. But when you come to this scale, if
you look at the ratings matrix, there is just no way it can be low-rank, so it’ll have … what’ll happen is:
you’ll have a hundred million users; each of these users in the tail will like two or three items, but they’ll
be a different two or three items. So there’ll be a row with, like, three nonzero entries; the
next guy will have a different three entries nonzero; the next guy, another different three entries nonzero. So
there’s no way any row can be written as a linear combination of the other rows.
>>: But even on the small problems, it was better.
>> Manik Varma: It’s not really that much better. The real lift comes here—at
Wikipedia—and when we went to Wikipedia, and we looked at the ratings matrix, we did an SVD
onto the top five hundred singular vectors, and we saw that that captured only ten percent of the matrix.
And five hundred is a number that’s much larger than what everybody else uses; everybody else goes
between twenty to fifty embedding dimensions, because otherwise, the computational cost is just too
high—prediction, training, both, right? So only ten percent of the matrix is being captured. So that’s
what our NIPS paper is about; we’re trying to break free of the low-rank assumption and see if we can
do something which has a nonlinear embedding that preserves local distances; and that apparently
works much better. So that’s the reason for the performance improvement over low-rank methods.
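The low-rank check mentioned above can be reproduced on any sparse ratings (or label) matrix with a truncated SVD; the matrix below is a random placeholder, not the Wikipedia data from the talk.

    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    # Placeholder users-by-items 0/1 matrix; substitute the real label matrix here.
    R = sp.random(20000, 5000, density=0.001, format="csr", random_state=0)
    R.data[:] = 1.0

    k = 500                                            # number of singular values to keep
    _, s, _ = svds(R, k=k)
    captured = (s ** 2).sum() / R.multiply(R).sum()    # fraction of squared Frobenius norm
    print(f"top-{k} singular values capture {100 * captured:.1f}% of the matrix")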
As compared to other tree-based methods, well, I think we’re much better than LPSR because all Jason
does is k-means: he clusters the users into two sets and goes with that, and so there’s no regard
for the loss function there. As compared to MLRF, I think the main reason we’re better is because we’re
not taking single features and learning splits on that; we’re learning a full hyperplane; and what was
happening with MLRF is that: when you have ten million features, your—or two million features—your
features become super selective, which means that if you were to take a split on any single feature, then
even for the best features, you would have—let’s say—a ten thousand or a hundred thousand
documents going left and then the rest of the hundred million going right. So you learn these
imbalanced trees, and you don’t get very good generalization. You can try and force MLRF to use
multiple features or learn balanced trees, and then your training time goes up even further, and your
accuracy—in some cases—comes down because of the regularizer. So I think that’s why this is better.
And then, nDCG is kind of … you want to do some kind of ranking in any case; that’s how you’re going to
measure performance—right—these are more like ranking and recommendation tasks, so it turns out to
be a better thing to do.
The only thing I’m not sure about—which I would love to get some help from you guys on—is: I don’t know
what the objective function should be at the root node, because ultimately, the predictions are going to
be made by the leaves, right? So if I’m taking something like precision at K, I want to optimize precision
at K at the leaves, and I want to optimize nDCG at K for the leaves, but doing that at the root node
would be too myopic, right? So if I just wanted to optimize—let’s say—nDCG at five at the root node, all
that would say is: what are the five best items on the left? And what are the five best items on the
right? And I don’t care about anything else. So if you have Wikipedia, where you have—let’s say—half a
million items, there’s no way that will work well. So we decided to go with nDCG over the entire set of
items, and one of the things that you pointed out—which was very helpful—is that ranking over the rest
of the tail … there’s a lot of information over there, so trying to learn a function over all L items,
rather than just the top five, can help you in any case. So we found that not trying to be too myopic
towards the top—looking at ranking all the items rather than just the top K—seemed to work well.
Sorry, that was a huge monologue. [laughs]
>>: That’s good.
>> Manik Varma: So the conclusion for this part of the talk … well, there are only two take-home
messages; the first is that extreme classification is a new area in machine learning, which will allow us
not only to tackle classification problems at web scale, but which might also allow us to go back to other
problems in machine learning and try and reformulate them as extreme classification tasks; the second
take-home message is that FastXML is a new algorithm for extreme classification which can make
significantly more accurate predictions as compared to other methods, and which you can train on your
desktops. So if you’re interested in code, just e-mail me; I’d be very happy to share the code or the data
or—as I said—anything else.
>>: Manik.
>> Manik Varma: Anandan?
>>: Yeah.
>> Manik Varma: Hey.
>>: No, I was just trying to think about: what’s the one thing you did differently from, say, a year ago? Is it the
alternating minimization?
>> Manik Varma: So it’s both the objective function and the optimization. So earlier on, I was trying to optimize the Gini index;
now, I’m trying to optimize nDCG; and earlier on, I was doing the standard random forest optimization,
which is the brute-force search and thresholding; and now, I’m doing the alternating minimization.
Those are the only two changes, and both are important. I’ll show you results later on about how that …
so just wait for five minutes, and I’ll show you those results.
>>: Do they both contribute to performance as well as accuracy, or one …?
>> Manik Varma: Yes, they both do, yes, yes. So I’ll show you results in five minutes if you don’t mind.
But … so I’m saying five minutes, because I want to just finish off this last portion of the talk, where what
I want to do is just go beyond discussing the specifics of a particular algorithm and return to the general
area of extreme classification, right? Why did we need to come up with a new area? Why did we need
to come up with a new name? Why couldn’t we just have done whatever people were doing earlier on?
Right? So what I’d like to do is discuss how extreme classification could be different from traditional
classification. And I think that not only are the scaling and computational aspects different, but I think
the statistics are also different, and this is perhaps best highlighted by considering how we might
evaluate the performance of an extreme classification algorithm as compared to a standard, traditional
classification algorithm. So this’ll go back to the performance evaluation question that somebody asked
right at the beginning, ‘kay?
So it turns out that a number of loss or gain functions have been proposed for evaluating traditional
multi-label learners—such as the Hamming loss, or the Subset 0/1 loss, or Jaccard distance, precision,
recall, F-score, and so on, and so forth—but I think that these might not directly apply in the extreme
setting. So I’ve made a toy example to convey that, and the US presidential elections are very much in
the news, which is why I chose this particular example. So I’m going to give you the Wikipedia pages for
five US presidents, and the task is to label each of these pages with five Wikipedia labels, ‘kay? And I’m
showing you the results of three algorithms over here, which attempt this task. The first algorithm is a
constant algorithm, which means that no matter what the input is, it’s going to output the same list of
labels. So it looks at some of the most popular labels on Wikipedia, sorts them by popularity, and just
outputs that list. So that’s algorithm one; algorithm two is something that does the opposite of one.
Instead of looking at the head, it completely ignores the head and focuses just on the tail. So
it’s trying to predict these labels that are not very common, but because this is a very difficult task, it
often makes mistakes, and I’m showing that by these dashed lines over here. So I actually put this
algorithm in because of Leon’s keynote at ICML, where he was discussing metrics a lot, and one of the
metrics that he was proposing was coverage, right? So if you look at coverage, I’ve put this algorithm for
him over here. And then the third algorithm is something that is a mix of one and two, right? It has
some head labels, but then it also has some tail labels. So what I’d like to do is just do a quick show of
hands; I want to see which algorithms you like; so people who think algorithm one is the best should
raise one hand; people who think algorithm two is better than one and three should raise both hands;
and then people who like algorithm three should, like, two hands and a leg or something. [laughter] Or
don’t do anything. Hey, Anandan, do a hand count for me please. See …
>>: See, I’m useful. I get—yeah—I get four two hands, and one one hand, and no leg.
>> Manik Varma: Four two hands? [laughter]
>>: [indiscernible] yeah.
>>: ‘Kay, three legs.
>> Manik Varma: I see.
>>: Your legs would have to be …
>> Manik Varma: [laughs] So I think most people would prefer the third algorithm—I can’t see over here
… well here, right? Most people by and large … sorry?
>>: You should do use yeas and nays next time if you want to count.
>> Manik Varma: Oh, okay. [laughs] That’s true. But if you look at all these three algorithms—right—
algorithm one is the one that maximizes precision. So no other algorithm can have higher precision than
one, but typically, we don’t tend to like it very much. And I’ve been measuring performance in terms of
precision all this while. If you look at algorithm two, this is the one that maximizes coverage; so as I said,
coverage is the number of labels or categories—unique labels or categories—that you got right; so this
has the most number of correct predictions across labels, but many people don’t tend to like this as
well, right? Most people tend to like three, but three actually doesn’t optimize any of the raw metrics I
listed in the previous page. And if you look at why that is the case, then it’s very clear why traditional
loss functions might not be working so well. So I think one of the reasons is something I’ve already
alluded to, and that depends on the statistics of the positive versus the negative labels. So if you look at all
of these datasets, on most data points, the average number of positive labels is completely dwarfed by
the number of negative labels—so positives means relevant, and negatives means irrelevant, eh? In
fact, in many cases, you have less than eight positive labels per data point and then hundreds of
thousands or millions of negatives. So traditional loss functions—such as the Hamming loss—which
place an equal emphasis on predicting the positive and the negative labels, cannot work well in this
setting.
Now, it turns out that there’s also another, more subtle reason why loss functions computed on the
negative labels don’t work that well, and that’s because many of the negative labels aren’t really
negative, ‘kay? So traditionally, we—in supervised learning—we’ve been working in this paradigm
where we suppose that there exists an expert somewhere out there—an annotator or an expert—who,
whenever we give him a test point, can tell us what are the labels that are relevant to that test point,
and so performance evaluation is easy. However, in the extreme setting, there cannot exist an expert or
an annotator who can go through a list of ten million labels and mark out the exact relevant subset. So
even if you look at Wikipedia and trust Wikipedia’s editors—here’s Jeannette’s page; it has some labels,
but you can see that many labels are missing, right? So for example, it doesn’t mention that she’s a VP at
Microsoft, that she was a director at DARPA; it has very little mention of her research and
her contributions there, so there’s nothing about type theory and so on, right? So a loss function which
would penalize you for predicting that type theory is relevant would actually not be a very good loss
function over here, right?
So I think we should try and move away from loss functions that look at both positives and negatives,
such as the Hamming loss, and work only with those loss functions that focus
on the positive labels; but that doesn’t solve the entire problem—right—because those will still be
biased. It also turns out that all traditional loss functions that work on just the
positives treat each label as being equal; however, that is definitely not the case in the extreme
scenario. So if you look at the distribution of labels, there are few labels that have a lot of training
data—that occur very frequently—and so it’s easy to train on those labels, and it’s also
relatively easy to predict them; however, the vast majority of labels in an extreme scenario occur not
very frequently, eh? In many cases, they occur only once in some of the datasets, and as I’m showing
you over here, many of the labels occur less than five times. So these labels don’t
occur very frequently, and they’re very hard to train on, and they’re extremely difficult to predict
correctly; however, in many scenarios, you can get much more reward for predicting these rare labels
than the popular labels.
So again, like on Jeannette’s webpage, the single most popular label on Wikipedia is living person, right?
But there’s very little information gain in predicting that Jeannette is alive, right? It’s much better to
predict from the tail and say … the … like, predict about type theory, or a VP at Microsoft, et cetera,
‘kay? And this might not be such a problem if all we had to do was predict all the positive labels, right?
Then it’s fine; you just predict all of them, and you’re done. But in many cases, the number of slots that
we have available for prediction is limited. So in particular, if you wanted to do recommendation, then
you might have only five slots on your ads page, or only five slots on your Amazon page, and you
want to figure out: what are the five best labels I should show? And particularly in recommendation,
recommending items that are popular and common—and that you might already know about—might
not be very helpful; what’s best is recommending these rare labels that
really delight and surprise you, right? “I had no idea this existed, but it’s relevant to me; it’s perfect for
what I want; and thank you very much for that recommendation.”
So I think if you were to design a loss function for the extreme scale, we will need to focus, of course, on
accuracy, right? But we should do this in an unbiased way; we should try and come up with loss
functions for which, as your test set becomes larger and larger, the value that you compute for your metric
should converge to the value of the metric had it been computed on the fully observed ground truth.
And there’re tools in statistics which can help us do that—in particular, propensity scoring or
importance sampling—and we’ve started looking at some of those, but in addition to focusing on just
accuracy, I think loss functions in the extreme scenario should also reward
predicting rare or novel items, and that will work well in some applications, but in some applications,
you might want diversity. So in some applications, it’s important to have a little bit of the head in there
as well, right? Because that’s where you generate most of your revenue, or in some situations, you
might want to have a mix of your head and tail. And then finally—at least, particularly in the
recommendation scenario—you might want to reward algorithms that are explainable: if they can
explain why they made a certain set of recommendations to you, you might trust them
more.
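In the spirit of the propensity-scoring idea mentioned above, here is a minimal sketch of a propensity-weighted precision at k; the estimator and the example propensities are illustrative assumptions, not a formulation given in the talk, and in practice such a score is often normalized by the best achievable value so it lies in [0, 1].

    def propensity_scored_precision_at_k(predicted, observed_relevant, propensity, k=5):
        # Each correct prediction is up-weighted by the inverse of its propensity
        # (the chance that a truly relevant label was actually annotated), so rare,
        # under-annotated tail labels count for more than obvious head labels.
        top = predicted[:k]
        return sum(1.0 / propensity[label] for label in top if label in observed_relevant) / k

    # Hypothetical example: "living person" is almost always annotated, "type theory" rarely is.
    propensity = {"living person": 0.95, "type theory": 0.10, "VP at Microsoft": 0.20}
    preds = ["type theory", "living person", "VP at Microsoft"]
    truth = {"living person", "type theory"}
    print(propensity_scored_precision_at_k(preds, truth, propensity, k=3))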
So I think this is a fascinating research area; there are a lot of open research questions over here that
might not exist in the traditional setting, and I’d love to discuss these with you, or if there’s any
interest, share code or data or talk to you more about them. So that’s my pitch, and thank you very
much. [applause] So I’m happy to take questions. Let me just answer Anandan’s point quickly about
whether the optimization is more important or the loss function is more important. So one of these is
variants of FastXML, I think, yeah.
>>: Here, but it’s on small datasets.
>> Manik Varma: Right. So if you look at this—right—what we did is … we can take nDCG and put it in
MLRF, right? So the optimization is not going to be the alternating minimization; it’s going to be
the standard random forest optimization, where you sample features randomly, do a sweep, pick
the single best feature. So you can see that in most of the results, MLRF optimizing nDCG
does not do well. At the same time, we also asked, “Okay, what happens if you take FastXML and restrict
it to optimizing precision at five or nDCG at five?” Which I was mentioning to Chris earlier. That also
doesn’t do well. So I would think that not only is optimizing nDCG important—rather than Gini index or
entropy—but it is also important to optimize it, learn a full hyperplane, and do the alternating
minimization. Yeah, other questions?
>>: I had a—I mean—a question, or a comment, or—I don’t know—just a topic that I’m … so I think that …
so first of all, I think the work is great. I think you glossed over one important thing, which is kind
of subtle, and maybe that’s why, but you kind of passed over it as if it doesn’t exist, so we … so I want to
just ask you: what do you think about it? So when you talk about this setting, you can talk about one of
two scenarios: one is where you have a very big, but fixed, set of labels; then the training set
asymptotes to infinity, so now we can think about it as a standard machine learning question, only with
the constant number of labels being very, very big; but all of our theory, all of our understanding, all of
our intuition still works out well. And I think the example you gave—as you said, “Well, let’s look at the
labels given in Wikipedia, and let’s use those to label the entire internet.” So that whole … we’ll freeze
Wikipedia at one point in time; we have many, many labels, but it’s fixed; and now, the web keeps
growing and growing; it asymptotes to infinity; and we can apply our standard statistical tools and
understanding of machine learning. If you had—given a slightly different example—said, “Listen, let’s
use the labels in Wikipedia and label new Wikipedia pages that come in,” now, you would have fallen
into the trap, because now, you have two things going to infinity: one is the number of labels, and the
other is the dataset size, and they are going together to infinity. So every new Wikipedia page appears
with new labels, so the label set grows almost at—you know—perhaps at the same rate as your training
data grows, so again, sample size and number of labels both grow at the same rate. And now,
all existing machine learning theory goes out of the window, and many things that you kind of took for
granted here—you know, you were focusing on doing it faster, doing it better, so on—now, all bets are
off; you have to start from scratch; and it seems like—I don’t know—maybe you were saying that this
would work in both scenarios.
>> Manik Varma: Yes, so …
>>: There’s … I see no reason why it should work in the latter.
>> Manik Varma: So I’m not a theory guy. So theoretically, things might really go down in a basket; in
fact, that’s a great way for you to come in and help people prove some theorems about what is
happening, right? Practically, we’ve observed two things. One is: as new labels come in, both in the
tree setting and in the embedding setting—which is for our NIPS paper—you can always take new labels
and add them to the leaf nodes of the trees or embed them into the common space that you learn—the
space of the low-rank matrix factorizations. In either case, new labels, as they come in, simply
get inserted into the problem, and now you can predict them. And the guys at Google—Samy Bengio,
Jason Weston—have been doing this for a number of years now. So if you go back to the zero-shot
learning setting, when a new label comes in, you just propagate it down your tree, figure out
which leaf node it lands in, and you update your leaf node distribution. In the extreme multi-class setting,
John Langford has been working on these logarithmic-time trees—so it’s online learning; points are coming in
in a streaming fashion—and the trees are grown as the data points come in, and again, you can apply similar kinds of
things. So John actually has theorems showing—at least some—that his trees will be balanced and stuff;
I don’t think he has, like, statistical guarantees.
And that’s actually a really interesting theory question, right? How do you even give statistical
guarantees at this scale? What does it mean for me to have a million labels? Like, what is the
complexity of this space? I could take one label and replicate it a million times, but that doesn’t
mean I have a million-label problem, right? And as you were pointing out, the way the label space goes
to infinity might be really interesting. So I think there’re really new theoretical research questions over
here: what is the dimensionality of the ambient label space? How do I give statistical guarantees? That I
would love for theory people to work on. Some theory work has recently started coming out; so at
ICML, there was a workshop on this, which I’m told by the organizers was very successful. They said that
it was the second most-attended workshop after deep learning. I think there’s a big gap—right—six
hundred for deep learning and a hundred or two hundred for this one, yeah. So what they were saying was: let’s
say I’m going to take a Hamming loss and optimize that using one versus all or whatever standard
techniques I have, but now suppose I have to do efficient prediction, so I’m going to construct the tree;
well, how much worse will my Hamming loss be? Can I give a bound on that as compared to the
Hamming loss over the entire one versus all set? So I think there’re lots of new questions coming out,
and I should just put them up on the slide. If people are interested in working on them, I’d be very happy
to chat and discuss, and so …
>> Ofer Dekel: Let’s thank the speaker again. [applause]
>> Manik Varma: Thanks, all.