>> Christian Konig: It's my great pleasure to introduce Ping Li from Cornell University. He's been
here before; he's done a couple of highly successful internships. And he will be talking to us
about boosted decision trees, which is interesting from a personal perspective because they've
been used in a number of click-through prediction applications very successfully.
And with that, Ping Li.
>> Ping Li: Thank you. Yeah, it's really nice to be back. So this talk is about multi-class
classification. My first slide is just to tell you what multi-class classification is. This example
is a 10-class classification problem, and everybody in this room knows about it.
So the purpose of this slide is just to tell you what my zip code is. So let me go to the second
slide. So this is a 10-class classification problem. The next slide is a brief review of what we did
in 2006 and 2007.
It says that when we do ranking, we can use classification to do ranking. For example,
McRank learns to rank using multi-class classification. This was done with Chris Burges and Qiang Wu.
So to do classification we need an X and a Y. Y is the response and X is a feature vector.
How do we get feature vectors? The feature vectors are generated by combining the query words with web pages.
So, for example, the first feature vector is query one plus URL 1. The second feature vector could be
the same query plus URL 2. Then we generate many, many feature vectors, and human judges
rate the relevance of each vector into one of five classes.
So, zero to four. What we did is first solve the five-class classification problem and then rank
the URLs according to expected relevance. We get the class probabilities; each class probability
is multiplied by its class label, and summing gives a relevance score. That allows us to rank
the URLs, because it gives you a real number.
So we can use that real number to rank the web pages. Otherwise, if you just do the
brute-force classification, you're going to get lots of ties, and that's not going to be good enough to
rank the web pages.
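[Editor's sketch: a minimal Python illustration of the expected-relevance trick just described; the five grades 0 to 4 match the talk, but the probability values are made up for illustration.]

    import numpy as np

    # Class probabilities for one (query, URL) pair over the five
    # relevance grades 0..4, e.g. from a trained classifier.
    p = np.array([0.05, 0.10, 0.20, 0.40, 0.25])   # illustrative values

    grades = np.arange(5)                 # labels 0, 1, 2, 3, 4
    relevance_score = np.dot(grades, p)   # expected relevance = sum_k k * p_k
    # Sorting URLs by this real-valued score, instead of by the argmax
    # label, avoids the many ties a hard classification would produce.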
So this is a nice trick to use. Okay. So that's how we use classification to solve ranking
problems; one fairly successful example. A slightly more formal definition of multi-class
classification: we are given a training data set of X and Y, where X is a feature vector, say in D
dimensions, and Y ranges from zero to K minus 1.
So we have a K-class classification problem. And the task -- you did not miss anything -- the
task is to learn a function which predicts the class label Y from the X's.
When K is equal to 2, this is a binary classification problem. When K is larger than 2, this is a
multi-class classification problem. And this talk is about K larger than 2.
So we're talking about multi-class classification. There are many, many strategies for
multi-class classification. One strategy is to first learn the class probabilities.
P_k is the class probability, which says: given the feature vector X, what is the probability that
the response Y belongs to class k?
We know the probabilities have to sum to 1. Therefore, instead of K degrees of freedom, we
only have K minus 1 degrees of freedom.
Both hands have to be busy -- this is interesting. Normally they have the laser pointer here,
right? But this one doesn't have it.
So a simple rule is, once you learn the class probabilities P_k, you just use the class that has
the maximum class probability as the class label. That's a very simple rule.
And, as we said, in many applications the class probabilities are more important than the class
labels. Not in every application, but at least in some important applications we care about class
probabilities.
Again, this is just one of the many successful strategies for multi-class classification. Now,
in order to learn the class probabilities we need a model, and this is one very successfully
used model, the multinomial logit probability model: we first learn a function value F_k for
each of the K classes.
So little k ranges from zero to K minus 1. Once we learn the function values F, we can
compute the class probabilities P_k by the logistic transformation.
In logistic regression, this function value F is a linear combination of the features, and the
task is to learn the betas. Of course, in boosting we're going to learn more complicated models
than this linear model.
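[Editor's sketch: the logistic (softmax) transformation from function values to class probabilities, plus the max-probability rule mentioned earlier; a minimal Python illustration, not the speaker's code, with made-up function values.]

    import numpy as np

    def class_probabilities(F):
        """Multinomial logit: p_k = exp(F_k) / sum_j exp(F_j)."""
        F = F - np.max(F)        # subtract a constant for numerical
        e = np.exp(F)            # stability; allowed because any constant
        return e / e.sum()       # shift cancels in the ratio

    F = np.array([1.2, -0.3, 0.7])        # illustrative values, K = 3
    p = class_probabilities(F)            # probabilities sum to 1
    predicted_label = int(np.argmax(p))   # simple max-probability rule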
So this is the multinomial logit probability model. And we often use a constraint, because we
only have K minus 1 degrees of freedom. The natural constraint people like to impose is that
the function values sum to 0, because we really only have K minus 1 degrees of freedom.
To understand this constraint a little bit better, suppose we add the same constant C to every
function value F_k.
After this transformation we can see that the constant C does not really matter, because it's
going to be cancelled.
Therefore the function values F are not uniquely identified, so for identifiability we have to
impose some kind of constraint. Otherwise we're going to have identifiability issues.
One popular and very natural choice is to assume that the sum of the K function values equals
a constant, which is equivalent to saying the sum of the K function values equals 0, because
you can always subtract the mean and get the sum-to-zero constraint. So this is a very
natural constraint, and it's very commonly used. If you read many classification papers, you
can see this constraint is almost always used -- that was until a very nice
discussion with Chris about why we need this constraint.
Maybe there are more profound reasons that people can explore. But as a matter of fact,
this constraint is almost always used, if you read any multi-class classification papers.
So we're not inventing anything. We just use that.
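[Editor's sketch: a quick numerical check, in Python with made-up values, that adding the same constant C to every F_k leaves the probabilities unchanged, which is exactly why some constraint such as sum-to-zero is needed for identifiability.]

    import numpy as np

    def probs(F):
        e = np.exp(F - F.max())
        return e / e.sum()

    F = np.array([1.2, -0.3, 0.7])
    C = 5.0
    print(np.allclose(probs(F), probs(F + C)))   # True: C cancels out

    # Centering enforces the sum-to-zero constraint without changing p:
    F_centered = F - F.mean()
    print(np.isclose(F_centered.sum(), 0.0))     # True
    print(np.allclose(probs(F), probs(F_centered)))  # True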
Now we have the probability model. The next step is to learn the function values F. We can
learn the function values F by maximizing the multinomial likelihood.
Because we have a multinomial model, we need to maximize the likelihood. For each
observation the multinomial has only one trial: one cell has count one and
the rest of the cells have count zero. Therefore the likelihood is proportional to
the probability of that particular class.
So there's only one term here. Equivalently, we can maximize the log-likelihood, which is
more convenient; or, in machine learning, we like to talk about loss functions instead of maximizing
the likelihood, so we use the negative log-likelihood loss, which just puts a minus sign in front. That
gives us the negative log-likelihood loss function. This loss function is very commonly used. There's
nothing magic about it; it's just the likelihood function.
Minimizing it is equivalent to doing maximum likelihood estimation. So what we're
doing is maximum likelihood estimation.
Here, in logistic regression, the function value is assumed to be a linear function, and the task
is to learn the betas.
Of course, we'll do more sophisticated things than these linear models. For each
individual sample point -- we have N observations -- we have a loss, and the
total loss is just the summation over the losses.
It's usually more convenient to write it in summation form using indicator notation: if Y
belongs to a particular class k, we have this r vector where only one entry has value one and
all the others have value zero, so the summation really only has one
term. This is just for convenience.
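[Editor's sketch: the negative log-likelihood loss in this indicator form, in Python with made-up probabilities; for each sample only the term of the true class survives, so the total loss is the sum of -log p over the true classes.]

    import numpy as np

    def neg_log_likelihood(P, y):
        """P: n x K matrix of class probabilities; y: length-n labels in 0..K-1.
        L = -sum_i sum_k r_ik log p_ik, where r_ik = 1 iff y_i == k,
        so only one term per sample survives."""
        n = len(y)
        return -np.sum(np.log(P[np.arange(n), y]))

    P = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])    # illustrative probabilities
    y = np.array([0, 1])
    print(neg_log_likelihood(P, y))    # -(log 0.7 + log 0.8)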
So this is the total loss function. And again, to help us understand the sum-to-zero constraint,
let's look at the Hessian of this model. It's actually a singular model if you do not consider the
constraint.
For example, without assuming the constraint -- just assuming there are no relationships among
the F values -- we can derive the first and second derivatives, and we can see that, for example,
when K is equal to three, the Hessian has this form, and its determinant is actually zero, which
can be verified by algebra.
We should not be surprised: if we only have K minus 1 degrees of freedom, this has to be a
singular problem. So what Friedman and Trevor Hastie have done is an approximation of the Hessian.
They start with the full matrix but only use the diagonal part of it, and they include a factor,
which is like a heuristic factor -- I can show why it's a heuristic factor -- but they use this
factor to roughly account for the fact that we only have K minus 1 degrees of freedom. So this
is the diagonal approximation of the Hessian for the K equal to three case in Friedman's and
Hastie's papers.
So, again, without the constraint we've got a singular problem. Of course, once you've
done the diagonal approximation it's no longer singular. But you keep the
heuristic factor (K-1)/K to take into account the fact that we only have K minus 1 degrees
of freedom.
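[Editor's note: for reference, the standard MART quantities this part of the talk refers to, written out in LaTeX as a sketch consistent with Friedman's paper; r_{ik} is the 0/1 indicator of the true class.]

    L_i = -\sum_{k=0}^{K-1} r_{ik} \log p_{ik}, \qquad
    p_{ik} = \frac{e^{F_{ik}}}{\sum_{l=0}^{K-1} e^{F_{il}}}

    \frac{\partial L_i}{\partial F_{ik}} = -(r_{ik} - p_{ik}), \qquad
    \frac{\partial^2 L_i}{\partial F_{ik}^2} = p_{ik}\,(1 - p_{ik})

    \text{diagonal Newton step with the heuristic factor:}\quad
    \gamma_{jk} = \frac{K-1}{K}\cdot
    \frac{\sum_{x_i \in R_{jk}} (r_{ik} - p_{ik})}
         {\sum_{x_i \in R_{jk}} p_{ik}\,(1 - p_{ik})}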
Then Friedman has this algorithm, gradient boosting, which is much more flexible and
accurate than logistic regression. We have a dataset with N training samples and a differentiable
loss function -- in this case the logistic loss. Friedman takes a greedy stagewise approach to build
an additive model.
So instead of a linear model, they have this additive model, which is a sum of capital-M
terms; M may be a thousand or 10,000. There are two things: first a weak learner,
called h, which is parameterized by a, and then rho, which is the coefficient. So it's a linear
combination of weak learners.
The weak learner can be a regression tree, for example. At each stage, Friedman does a
greedy approximation: at every step they try to minimize the
loss.
So this is a greedy approach. However, even this is a difficult problem, because rho and a --
and a can be a vector --
if you want to learn them simultaneously, it's a very difficult problem. So what Jerry did is to
approximate steepest descent in function space by solving a least-squares problem.
So they can first solve for a. We have two sets of parameters to work
with:
first solve for a, and then solve for rho. Solving this least-squares problem can be
viewed, approximately, as steepest descent in function space. To do
steepest descent we need the derivatives -- the derivatives of the loss
function with respect to the function values.
And then for rho, what he did is a line search. So that's how he solved the complicated
problem with two simple optimization problems. And because this is a least-squares
problem, it can be easily solved by trees: of course you can do plain least
squares, but you can also do trees.
So this is the generic gradient boosting algorithm. We do capital-M
iterations. At every iteration we compute the gradient, evaluating it at the previous
function values. Then we fit the gradient by solving the least-squares
problem, and then do a line search after computing the coefficients a.
Then we add this model to the model F. That's how the learning is conducted. This
is the generic algorithm, and MART is a particular
implementation of this algorithm that combines gradient boosting with regression trees.
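[Editor's sketch: the generic gradient-boosting loop in Python, under some simplifying assumptions. scikit-learn's DecisionTreeRegressor stands in for the least-squares weak learner, the line search is replaced by the shrinkage constant for brevity, and the example uses squared-error loss rather than the talk's logistic loss.]

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, neg_gradient, M=100, nu=0.1, J=8):
        """Generic gradient boosting: at each stage, fit a J-terminal-node
        regression tree to the pointwise negative gradient of the loss,
        then add it with shrinkage nu.  `neg_gradient(y, F)` supplies
        -dL/dF evaluated at the current function values F."""
        F = np.zeros(len(y))                  # start from F = 0 for simplicity
        trees = []
        for m in range(M):
            residual = neg_gradient(y, F)     # steepest-descent direction
            tree = DecisionTreeRegressor(max_leaf_nodes=J)
            tree.fit(X, residual)             # the least-squares fit
            F += nu * tree.predict(X)         # shrunken update
            trees.append(tree)
        return trees

    # Example with squared-error loss, where -dL/dF = y - F:
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X[:, 0] + 0.1 * rng.normal(size=200)
    model = gradient_boost(X, y, lambda y, F: y - F)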
So this is MART: multiple additive regression trees. Basically, in this implementation of the
generic algorithm, they use regression trees to solve the least-squares problem. The
regression tree is a J-terminal-node regression tree. Just in case you're not familiar with
regression trees: terminal node means [writing on board] -- so this is one, two, three, four, up
to J terminal nodes. In the two-dimensional case, you basically have feature one and feature two.
And what the tree does is partition the space: it recursively partitions the samples into these
kinds of regions. So that's one, two, three -- I drew too many -- up to J. That's basically how
the tree algorithm works.
So they fix the tree size. And notice that the response we need to fit is actually the
derivatives, so they need to compute the first derivatives and use those to learn the trees.
And for the line search, what Jerry did, instead of doing the full line search, is a one-step
Newton update. To do a Newton update we need the first and second derivatives: this is the
first derivative and this is the second derivative. So this is a very simple algorithm for
implementing the generic boosting algorithm.
Also notice that for every class they build a tree. Being able to do this means we are actually
using a diagonal approximation; that's why you can do a Newton update for each region
separately. And notice this additional parameter mu. The parameter mu is a very small
factor, say .1, to avoid overfitting, because when you do Newton updates you can overshoot.
So the mu is for protective purposes. Strictly speaking it doesn't have to be there, but in
practice it is, for convenience and protection.
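[Editor's sketch: the one-step Newton update for a single terminal node, in Python; the names are mine, and r and p are the 0/1 indicators and current probabilities of class k for the samples that fall in that node.]

    import numpy as np

    def terminal_node_value(r, p, K, nu=0.1):
        """One-step Newton update for one region and one class k in MART:
        first derivatives give the numerator, second derivatives the
        denominator, (K-1)/K is the heuristic degrees-of-freedom factor,
        and nu is the shrinkage guarding against overshooting."""
        numerator = np.sum(r - p)              # sum of -dL/dF over the node
        denominator = np.sum(p * (1.0 - p))    # sum of d2L/dF2 (diagonal)
        gamma = (K - 1) / K * numerator / max(denominator, 1e-12)
        return nu * gamma                      # added to F for every sample
                                               # in this terminal node

    r = np.array([1, 0, 1, 1])                 # illustrative indicators
    p = np.array([0.6, 0.3, 0.5, 0.8])         # current probabilities
    print(terminal_node_value(r, p, K=3))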
That's how Jerry did it. So this algorithm is very easy to understand: they use the first
derivatives to learn the structure of the trees, and they use the second derivatives to determine
the values of the terminal nodes. There's a heuristic factor to account for the only K minus 1
degrees of freedom, and shrinkage to avoid overfitting. It's a remarkable algorithm, as some of
you already know: very reliable performance can be obtained from it, and it's not
very sensitive to the parameters.
There are three parameters: the number of iterations M, the number of terminal nodes J, and
the shrinkage mu. The number of terminal nodes is the main parameter; the other parameters
are not too important. The number of terminal nodes determines the capacity of the base
learner and is the most important in terms of performance.
But even so, the performance is not too sensitive to J as long as it falls into some reasonable
region.
So --
>>: [indiscernible].
>> Ping Li: Sorry?
>>: [indiscernible].
>> Ping Li: Mu?
>>: Yeah.
>> Ping Li: He suggested .1, because when you do Newton updates you only do one step, so
anything can happen. That's why they pick a value like .1, just for protective purposes. And
that's what I think. Of course, you can also think of it as shrinkage: you don't want to be
too greedy, so you use a small value.
I don't quite believe that; I really believe it's for numerical purposes. And, yeah, you were
asking what the optimal value is. The optimal value -- I don't know. I always try .1, and that
usually works. In some cases .2 actually works better. But when you write papers you try to
keep it simple, so just .1.
And smaller values don't necessarily help. If these values are too small, you don't make
enough progress at each step; so if they're too small, the convergence is too slow and the
testing performance is bad.
So it's not necessarily true that a smaller value gives better performance. Actually, I believe
larger values, if possible, are better. If you can do it, use larger values. That's
what I would suggest.
>>: You are able to take the first and second derivatives -- why does the method use a single
gradient step this way? What about more extensive Newton-type methods?
>> Ping Li: Okay. You mean why do they do this? So this is for convenience. A good thing
about trees is that every region has the same constant value -- one value per region. If you
do a one-step update, you start from zero, right? Then you update to one particular value.
That's much more -- well, let's see.
If you want to continue iterating, it's actually not as convenient, because the function value is
constant within the region. It's more convenient if you do one step.
And my experience is -- also because we already do the diagonal approximation -- that more
Newton iterations, I don't believe, are going to help, because you already start from an
approximation.
But this diagonal approximation brings huge simplifications, because it takes advantage of the
fact that the function value is a constant within each region.
And notice that this diagonal approximation allows you to combine the two steps into one:
when they build a tree, you already have a function value, but when doing the line search you
would have another multiplication of the function values. The one-step Newton update lets
you get the values in one step instead of two. I think it's a very clever idea.
And, yeah, all right. So now let's look at this constraint again, the sum-to-zero constraint.
We start with this loss function, there's this probability model, and we have the sum-to-zero
constraint. Without loss of generality, we can assume class k equal to zero is the base class --
a reference class, a baseline class.
That means we can represent the function value of class zero as the sum of the other function
values with a negative sign in front. That's the constraint. And then we get a different set of
derivatives if we view class zero as the base class.
You get a different set of derivatives -- they look different from the earlier derivatives, and I'll
explain why. Because we use class zero as the reference class, or base class, the derivatives
now have that information in them.
So let's do a little bit of math to show that these are indeed the derivatives.
This is the only math I'm going to do for this paper and this talk, so let's do it; otherwise I won't
make it to 40 minutes. We start with the probability model, and we can represent the function
value F_0 as the negative sum of the rest.
So that's easy. Now let's do the derivatives. How do we do the derivatives? We have a ratio
of two terms. The derivative of the numerator is the same as before.
Now we take the derivative of the denominator: by the quotient rule the denominator gets
squared, and we need the derivative with respect to F_k inside. Only one term contains F_k
directly, so it stays; and the F_0 term also contains F_k, so you still get the F_0 term, and by
the chain rule you get a negative sign there. That's how you get the derivatives. Then by
simplification, because P_k is represented in this form, you can see that eventually we get
these derivatives.
This is just to help you build confidence in the calculation I did, which is very simple
algebra. So these are the first derivatives of the probabilities with respect to the function values.
Now we have the loss function, which is a sum of K terms, and it can be represented in three
parts: the first part sums over the classes other than k, the second part is the class k term,
and then there is the base class term.
We apply the chain rule and the previous results, and you can see that you get the first
derivatives in this form. The second derivative is even easier, because now you further take
derivatives -- you only need the derivatives of P_i and P_k with respect to the function values.
So you can get the first and second derivatives very easily.
So this is just to convince you that these are the true derivatives, if you believe in the model.
Now the next question is: which reference class, or base class, should we use?
We know those are the true derivatives, except we need to choose one base class. Which
class to choose? So this is the idea of adaptive base class boosting: we can do multi-class
boosting by fixing a base class -- we don't have to train for the base class, because of the
sum-to-zero constraint -- and at each boosting step we choose the base class adaptively. We
boost for many steps anyway, so we choose it adaptively, and that's why we call it adaptive
base class boost.
Now, the next question: how do we adaptively choose the base class? We should choose the
base class according to performance -- the training loss. This is one of many ideas; I
implemented many kinds of ideas.
This idea is the least clever, but it gives me the least criticism: exhaustively try all base
classes and choose the base class that leads to the best performance, the smallest training
loss. You could use a validation loss instead, but I believe it won't essentially make a difference.
The only criticism is that this is computationally expensive -- but not too bad, unless K is really
large. For search engines K is 5, so it's not too big a deal, and good performance can be
achieved. And there are many other ideas, which I'll tell you about if I come back next year.
So ABC-MART: this is the algorithm. ABC-MART is one particular implementation of ABC-Boost,
combining MART with ABC-Boost. This is the MART pseudo code -- I call it pseudo code
because it's not really the code -- and here is ABC-MART. What do we need to change in
MART? We replace the first derivative with the true first derivative, and the second derivative
with the true second derivative, and then we don't need the heuristic term, because it's going
to be recovered naturally.
And we need an additional for loop: we try every class as the base, from zero to K minus 1.
Of course, inside the loop, we do not train the class that is currently the base, because it can
be inferred from the other classes.
That's the only change we need. Of course, we need to do a little bit more here; that's why I
call this pseudo-pseudo code. Yeah. So the change is minimal.
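[Editor's sketch: the extra loop ABC-MART wraps around a MART boosting step, in Python; `train_step` and `loss_fn` are hypothetical stand-ins for the existing MART machinery, not functions from the talk.]

    def abc_boosting_step(F, X, y, K, train_step, loss_fn):
        """Try every class as the base class for this boosting iteration
        and keep whichever yields the smallest training loss.
        `train_step(F, X, y, base_class)` performs one MART-style step
        using the ABC derivatives (the base class itself is not trained;
        its F follows from the sum-to-zero constraint); `loss_fn(F, y)`
        evaluates the training loss.  Both are hypothetical helpers."""
        best_loss, best_F, best_base = None, None, None
        for b in range(K):                     # exhaustive search over bases
            F_trial = train_step(F, X, y, b)
            loss = loss_fn(F_trial, y)
            if best_loss is None or loss < best_loss:
                best_loss, best_F, best_base = loss, F_trial, b
        return best_F, best_base               # base can change every step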
>>: It looks like, for [indiscernible], you have the base class, but shouldn't you refer to
something which is the most reliable class in the training -- the one with the largest training
component? [indiscernible] the loss will be, the most for the loss. [indiscernible].
>> Ping Li: Yeah, as I said, there are many ideas for how to implement this. Exhaustive search
seems to me the most natural thing to do, right? Because the best you can do is find
the one with the smallest loss. And most reliable also means -- well, I believe there are many
ideas, many better ideas than exhaustive search. It's something you can always try.
As I said, the change is minimal if you already have the code. For an experienced
programmer at Microsoft, you'd need maybe one hour to do it -- maybe less than an hour.
>>: Do you have any intuition about what would make a good base class? More common
classes, or classes that are more uniform in the features, or anything like that?
>>: You said you did have some approximation that's faster than the --
>>: Any qualities you would expect that would make a good base class? I don't see why you
would favor one over another.
>> Ping Li: For example, the one with the largest number of labels -- the most common class --
is a good base class at the beginning. If you do the experiment, you will see it. But after a
while you want to change that. Because if you use the most common class as the base class,
after a while the optimization -- I mean, you have exhausted all the advantages of using that
as the base class. So you want to change after a while. If you only stick with the most
common class, you're not going to get good performance eventually. But at the beginning it's
a good starting point. So that's very good intuition.
>>: So would this loop change the base class as time goes on? Because you put it --
>> Ping Li: No, this changes as time goes on, uh-huh.
>>: You're saying in the very beginning you want to use the [inaudible], and as you learn more
you start to get [indiscernible].
>> Ping Li: Yeah, you can think of it that way. At the beginning we almost always go to the
class with the most labels, yeah.
>>: I just had a question about MART. [Indiscernible] but this shrinkage [indiscernible] always
seemed mysterious to me. It's a step size, but if you set it equal to one, typically in
experiments the training just runs really fast -- it just fits a few trees to the training set, so it
does kind of overfit. It's not clear why it's overfitting: if it's a correct step size and the Hessian
is a Newton approximation, it should work.
So [indiscernible] pointed out recently that you can view mu simply as a way of scaling the Fs.
If you think of the two-class case, the shape of that sigmoid is controlled by mu [indiscernible]
because it occurs in every F.
>> Ping Li: Yeah.
>>: So what you're doing really is regularizing just by changing that shape; looking at mu that
way makes a lot more sense. I never understood it the other way.
>> Ping Li: Yes. So that's -- maybe we should do more experiments on that. One thing is that,
because of the one-step Newton updates, eventually you will be very close to singular at some
point. You'll get close because P goes to zero; at some point you are going to reach pure
nodes.
That means some P, in some regions, equals zero or is close to zero, and then you're going to
have problems, because the update becomes very large. Of course, you can always put in
larger values.
>>: Mu is fixed for every iteration, though.
>> Ping Li: Yes, but P changes. The P values can become very small; at some point, at least
in some cases, what's going to happen is that this value becomes really large.
>>: But to address that, I would want to detect that it's happening and do something else,
right, rather than fix mu for everything.
>>: I just wanted to point it out, because it's mysterious to me.
>> Ping Li: It is.
>>: It makes more sense to view it as [indiscernible] rather than one -- choosing a differently
shaped function by that.
>>: Did you compare it with ordinary MART?
>>: I'm talking about the original MART.
>>: Still use the same --
>> Ping Li: I still use mu here.
>>: Mu is fixed during training.
>> Ping Li: So, yeah, let's look at the experiments first. I tried many datasets; these are
datasets that people like to use -- except for the first one, Covertype, where people like to use
subsets of it because it's too large. They're all from UCI. The first one has about 600,000
samples, and I just use half for training and half for testing. For all the rest, people have run
lots of experiments, with lots of results on these datasets. The Letter dataset is a famous one:
16,000 for training and 4,000 for testing. We also basically swap training and testing, so we
train on only the 4,000; that way, because we only use 4,000 for training, the test error is going
to be fairly large,
and it's more obvious to see the performance differences. In the Letter2k case we only use
half of that training set for training, so we have a larger testing set with a much smaller training
set. For all the rest, we use the very standard partition of training and testing. That's why we
get these funny numbers; those numbers were not specified by me, they're just there.
So the number of features ranges from 54 and 16 up to 256 and 617. These are standard
datasets that people like to use. Let's look at the performance. This is MART: the number of
misclassification errors. If you want the error rate, divide the number of errors by the number
of test samples; for example, the error rate on Letter would be 99 divided by 4,000, less than
2.5%. So this is the MART number of errors, and this is the number of errors for ABC-MART.
We can see the relative improvements -- that is, the errors of MART minus the errors of
ABC-MART, normalized by the errors. We get relative improvements of about 10 to 20 percent.
It may not sound too exciting, but it's hard to get this kind of improvement nowadays, after
machine learning has been developed for so many years. I also report the P-values, and the
P-values are very small; when I report zero it's just a very small number. So this is the
summary of multi-class classification errors.
>>: Did you use similar trees for each?
>> Ping Li: I see. I want to do everything fairly, and there are lots of questions. One question
is whether the improvement is due to the particular choice of parameters. So let's report
experiments over a range of parameters: mu from .04, because Jerry said mu should be less
than .1 -- and I cannot afford to compute too many, so let's use .04 to .1; four values, to fit into
the paper, I guess. For the number of terminal nodes, Jerry suggested this number should be
something like six or eight; I just try from 4 to 20. J equal to 2, the decision stump with two
terminal nodes, is just not good -- unless it's a very special kind of dataset, it doesn't give
good performance -- so we should not use it. So I try terminal nodes from 4 to 20. I don't
know why I used even numbers; we could use odd numbers, but even numbers look better.
And for the number of iterations, I try 10,000 at most, meaning I just run until machine
accuracy is reached. For industry data you'll never see machine accuracy reached, because
that would take too many iterations; for some datasets, like Letter, machine accuracy is
reached before M equals 10,000.
Besides my own implementation of MART, we also got results from Jerry Friedman's MART
program and compared them.
On the computational efficiency issue: it's true that training is slower, but testing is faster.
Why is testing faster? Because we only have to evaluate K minus 1 trees instead of K trees.
Testing is also faster because the convergence is faster, so we may be able to stop at an
earlier stage of training; therefore testing is actually faster.
And that's actually the most important thing in many cases. Training is slower -- it may take a
little more time -- but I guess in some cases you can afford it: just let the computer run, come
back and see the results, or go to Hawaii or something, and maybe the training will be done.
So that's another issue, but still just an issue. Here is a more complete set of experiments on
these datasets.
So I tried four different learning rates, or shrinkages: mu from .04 to .1. And many,
many tree sizes, from 4 to 20.
And I report the minimum values of the test error. For MART, what I got is, for example, 129,
and Friedman's program gets 143. But overall they're very close. In some cases, for example,
Friedman's program gets 144 and mine 145. You cannot expect them to be identical, because
of implementation details. Can someone tell me why they could be different?
>>: Because they're both wrong?
>> Ping Li: Yeah -- my implementation versus Friedman's implementation.
I guess the most obvious place where they could differ is the splits: when you split the data,
you have to choose a point in the region where there's no data. You have to make an
arbitrary, maybe clever, choice, but any clever choice will sometimes lead to undesirable
performance.
So I think it's understandable why the performance can be slightly different. I always report
my own implementation rather than Friedman's implementation; this is just for comparison. A
nice thing about MART is that you can see the performance is fairly stable across many
choices of parameters. This is very nice. Unless the base learner is too weak to produce
good results, the other choices seem stable -- around 130, 140. That's very nice. We like
those kinds of algorithms: very stable, not sensitive to the choice of parameters.
If a program works very well, but when you change the parameters a little bit the performance
degrades, then that's not good, at least for industry applications. Maybe very good for papers.
But I think industry would prefer algorithms that are very stable with respect to the particular
choice of parameters.
So that was for MART; now ABC-MART. We report the same experiments using the exact
same implementation. This is the percentage improvement -- about 25 percent relative
improvement. This is the improvement with respect to this number, not that one; otherwise
the improvement would look bigger, which would be tempting, but I report it against this
number. 25 percent improvement.
So we can see the improvements, and they're also very stable: the errors range from about
100 to 110. These are interesting improvements.
>>: [Indiscernible] each time.
>> Ping Li: Very good question -- you always ask very good questions, several steps ahead of
me. Let's look at the training loss and the testing loss. I always train until machine accuracy is
reached, or up to 10,000 iterations. So this is the training loss. It's interesting that MART can
reach at most about 10 to the minus 14th -- that's what I observed. Once the numbers get that
small, there are lots of things that can affect them; that's why there's some odd behavior after
reaching 10 to the minus 16 or 14, which I can explain. But in general we can see that
ABC-MART converges faster at every iteration -- converges faster than MART -- and it reaches
machine accuracy earlier than MART does.
So, again, that's what I've done for training. And for testing: these are the test errors across
all the iterations.
The number I report is the lowest point -- like this point. For ABC-MART it's almost always the
lowest point. But for MART, because of this behavior, after a while, once the training loss gets
close to machine accuracy, it cannot go down any further.
So the error actually goes up -- but I don't report that number; I report this number.
It's not at the same number of iterations, but that's why I show this figure; I think it's obvious.
Of course, we could stop at any earlier place and we'd still see the improvements.
So, yeah, I try to do all kinds of things to answer any possible questions. This is a different
dataset where we only use 2 K samples. We can see similar things: Friedman's
implementation gives very similar results.
And it's very stable across many choices of parameters. This is for MART; for ABC-MART the
improvement is about 10 percent in this case -- but with 10 percent, because the test sample
size is very large, it's statistically very significant.
And, again, the improvements are very stable. This is the testing curve; I always report this
point, which is almost the same as this point, versus this point, for MART and ABC-MART.
And this is a different dataset -- 10 digits. We see similar things: very stable, similar
performance across different implementations of MART, and like 10 or 20 percent
improvements for ABC-MART. We can see the improvements for different tree sizes --
obvious improvements for this dataset. Again, I always report the lowest point. However you
look at it, you end up with similar results.
This is the Isolet dataset, which is high dimensional. I only filled in a partial table -- I'm waiting
for a grant proposal to fill in the whole table. So it's a partial table, but we can still see the
improvements, like 10 or 20 percent, for this high-dimensional data. And you can see some
interesting things.
One interesting thing is that it's no longer true -- it's not necessarily true -- that using larger
trees leads to better performance. For example, with 20 nodes the performance may not be
as good as with only six nodes.
But my point is to show the improvements compared with MART side by side, not for particular
parameters. And I believe that eventually, if you choose J to be maybe 100 or something, the
performance will be the same regardless of the base class; so at some point ABC-MART is not
going to improve any more. But at least for these reasonable parameters, the improvement is
pretty obvious.
A couple more slides and I'm done. This is another experiment, on Covertype, because it's a
very large dataset. I ran experiments for 5,000 steps and also report the results at 1,000 or
2,000 iterations. We get only about 8 percent improvement at the end, but if we stop earlier
we're going to get like 20 percent improvement.
This is at the same number of iterations. It's a fairly large dataset, so if we stop earlier we can
get much better improvements, like 10 or 20 percent, on this dataset.
So, almost done. Some more insights here: when K -- any questions about the experiments first?
>>: [indiscernible].
>> Ping Li: Yeah, that's hard.
>>: [indiscernible].
>> Ping Li: No, it's not too high.
>>: [indiscernible].
>> Ping Li: Two or three hundred, and 700, 600 -- like in this case, 600.
>>: In these datasets you don't have a development set that could be used to tune the
parameters like mu and J. [indiscernible].
>> Ping Li: Yeah, you're talking about a validation set, right?
>>: Do they have validation sets in this case?
>> Ping Li: They don't have validation sets. But the purpose is to compare side by side:
under my implementation, MART and ABC-MART do the same thing -- the implementation is
the same -- so this is a fair comparison. Because whatever parameters we use, ABC-MART
shows the same improvement, I think this is a fair comparison. But that's a good point: if I
wanted to compare this with other algorithms, then I would need to do something different.
There are some more insights about this. When K is equal to 2, ABC-MART actually recovers
MART, which is kind of [indiscernible]: the first derivative of ABC-MART becomes twice the
first derivative of MART, and the second derivative of ABC-MART becomes four times the
second derivative of MART, so the factor (K-1)/K -- which is one half in this case -- is
recovered, not because of the one half directly, but because of the 2 over 4.
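[Editor's note: the K = 2 check written out in LaTeX. With base class 0, r_0 - p_0 = -(r_1 - p_1), so the ABC first derivative is twice MART's, the second derivative is four times MART's, and the ratio reproduces MART's Newton step, including its factor (K-1)/K = 1/2.]

    \frac{\partial L}{\partial F_1}
      = (r_0 - p_0) - (r_1 - p_1) = -2\,(r_1 - p_1)

    \frac{\partial^2 L}{\partial F_1^2}
      = p_0(1-p_0) + p_1(1-p_1) + 2\,p_0 p_1 = 4\,p_1(1 - p_1)

    \text{Newton step:}\quad
    \frac{2\,(r_1 - p_1)}{4\,p_1(1-p_1)}
      = \frac{1}{2}\cdot\frac{r_1 - p_1}{p_1(1-p_1)}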
When K is larger than 2, the MART derivatives act like averaged ABC-MART derivatives. This
is what I get: if you sum the ABC-MART first derivative over all base classes, it's exactly K
times the MART first derivative. For the second derivatives, if you sum over all base classes,
the sum is larger than K plus 2 times the MART second derivative; it's an inequality, which
becomes an equality only when K equals 2. So this magic factor (K-1)/K might reasonably be
replaced by K over K plus 2, or even smaller.
What this also means is that the mu in my ABC-MART program is effectively smaller than the
mu used in MART.
But again, I tried many, many mus, so at least that should not be a criticism.
So, in conclusion: ABC-Boost rests on two key ideas -- we can boost with a fixed base class,
relying on the sum-to-zero constraint, and adaptively select the base class at each iteration.
Here I only reported one particular implementation, ABC-MART; I have several other
implementations ongoing, and I call them different names.
But this is one particular implementation, and the improvements on these datasets are about
10 or 20 percent. So thank you very much.
[applause].
>> Christian Konig: Any questions?
>>: You had the logit model, that's very similar.
>> Ping Li: That's exactly --
>>: You use the constraint that the F's sum to one.
>> Ping Li: Sum to zero.
>>: Sum to zero. You actually mentioned this in the talk. I can't see why that is a problem
there. The objective is [indiscernible].
>> Ping Li: Uh-huh.
>>: [indiscernible] the identifiability problem. Whatever you select, you have plus C there, the
same as [indiscernible], so effectively the constraint helps you solve the optimization problem.
>> Ping Li: Well, if you believe that constraint -- if you think there should be some relation
among the F's -- then that constraint is one natural choice. So, if you believe the constraint
makes sense --
>>: But why? In any application, all I care about is getting the [indiscernible] to one.
>> Ping Li: What you care about is one thing; what the model should be is a different thing.
If you believe there should be a constraint -- because the functions are related, they're not
unrelated -- then once you consider the constraint, you actually cut your search space, right?
And once you cut your search space, you can do better at every iteration. Basically, from the
maximum likelihood estimation point of view: if you start with the true model -- if you start with
the true derivatives -- you're going to get closer to the maximum likelihood, and that's why you
get faster convergence at every iteration. Of course, this is a greedy algorithm, so there's no
guarantee it's faster globally. But locally, at each step, it's faster, and that leads to better
performance. Yeah.
>>: So it helps with the optimization.
>> Ping Li: Yeah, it has to do with the optimization. And of course, yeah, so that's the point.
>>: But unconstrained optimization is easier than constrained, isn't it?
>> Ping Li: But this is an equality constraint -- an equality constraint is an easy constraint.
>>: It's easier than inequality, but it's easier still if you don't have any.
>> Ping Li: It's still easy, right? If you --
>>: With maximum likelihood, you could use a regular [indiscernible] regularization term. You
still have to identify mu. I think it might even be something that handles the sum-to-zero
constraint automatically. Constrained over --
>> Ping Li: I know [indiscernible].
>>: [indiscernible].
>> Ping Li: Yes -- for example, you can write down the loss function plus lambda times the
square of the sum; you can write that down. If lambda is large enough, that forces the sum to
zero, and you could probably do something like that. That's actually an interesting
perspective. Except that --
>>: On trees?
>> Ping Li: If you do it on trees, I don't know how you would do the Newton updates for each
region. The diagonal approximation and Newton updates are convenient because you can
handle each region separately -- that's a huge advantage. Doing the optimization separately
converges faster locally, because it's more greedy. But if you formulate this as a constrained
optimization problem and try to do a global Newton update, it's not going to do as well,
because it does the same thing for the whole tree without considering each individual region,
and then the convergence is slow. I tried that last year -- for the whole year, I mean for a
period of time. That's what led to this algorithm. It's a very simple observation, a very simple
discovery, but it came only after many hours of working unsuccessfully on unrelated things.
>>: So given all the [indiscernible], all the features, all you learn is the beta, the linear weights?
>> Ping Li: Not linear weights.
>>: So you --
>> Ping Li: We learn the structure of the trees and the values of the trees.
>>: I see, okay.
>> Ping Li: The structure of the trees, and the values.
>>: Because [indiscernible] of X. Do you do beta, or the whole graph?
>> Ping Li: I don't do that here. That's the logistic regression, but that's not done here.
>> Christian Konig: All right. Let's thank the speaker again.
[applause]