
>> Ofer Dekel: It is our great pleasure today to have Elad Hazan from the
Technion. I have known Elad almost 20 years, maybe 19, soon to be 20. So we
are old buddies. Elad started in Tel Aviv University and finished his PhD at
Princeton under Sanjeev Arora. And then he moved to IBM research and from
there to the Technion where he is a faculty member. And now he is visiting
us this summer and is soon to finish a very successful visit.
So thank you Elad.
>> Elad Hazan: Okay, thanks Ofer. And it’s a great pleasure to be here and
to visit the machine learning group here. And so I am Elad Hazan and based
at the Technion. This is based on work with colleagues from my time at IBM,
Ken Clarkson and David Woodruff. And some students at Technion, Garaber and
[inaudible] who is a colleague at TTI now. And I will talk about linear
classification.
So this is a relatively small audience so feel free to interrupt me and ask
questions and we can make it interactive. Like if you want to hear more
about something, then feel free to let me know.
So I will start with the basics and the basic problem I want to talk about is
called linear classification which I assume all of you have heard about. And
basically we have a data set which is represented by points [indiscernible]
space. And some are labeled blue and some are labeled red and we seek a
hyperplane that separates them such as the one in the picture.
So the [indiscernible] example is that you have, let's say, e-mails. And these
e-mails are represented as vectors simply by taking every dimension to be
a word, and then there is a bit of 0 or 1 for whether the [indiscernible] belongs
to the [indiscernible] or not. And then there is a label, whether the e-mail
is spam or not, and you are trying to classify it.
Now in this picture there is a hyperplane that separates all the blue points
from the red points. And of course that’s not the case in practice. Many
times there is no hyperplane that can separate these two sets. And I will
get to what can be done later on, but I want to start with the most basic
question. There is a hyperplane that separates the two sets and we are
trying to find it.
So to be a little bit more formal we have n vectors in d dimensions and I
will refer to them as A1 up to An in Rd. And then we have labels which are
whether the points are red or blue, so plus or minus 1. And we seek to find
the vector such that the sign of the inner product between --.
Hello, you missed nothing so far.
So the sign of the inner product between the vector that we seek and all these
examples is correct, assuming that there is such a vector. Okay. So again
we are only talking now about the separable case. There is such a vector and
I am trying to find it.
So this is a fundamental machine learning primitive and it lies at the basis
of many other, more sophisticated methods. So if you Google or Bing linear
classification versus linear programming, linear classification gets more
hits. It’s extremely popular recently.
Now those of you actually familiar with the subject actually know that these
two problems are equivalent mathematically so they should get the same number
of hits and they don’t. But Bing is doing a little bit better in that
respect. And this is a very widespread routine that is used commonly in
many internet applications and so on, such as spam detection.
And if we are talking about text applications, what happens is that the number
of examples, which is denoted by n, is usually very, very large. And the
dimension is also very, very large because you have the
dictionary, right.
So I heard a talk by a Google employee who said, well, I can actually let you
try to guess: how many words are there in all dictionaries in all languages in
the world? Can anyone guess?
>>: And what do you mean by word, of course --.
>> Elad Hazan: Yeah, so I include in it also let’s say like small expressions
that include two words maybe.
>>: It’s oh my god, billions.
>> Elad Hazan: So it’s actually something like --.
Sorry?
>>: [inaudible]
[laughter]
>> Elad Hazan: So actually you are talking about something like between two
and ten million depending on how many expressions you use, so it’s large.
It's not incredibly large, but it's a large number.
Yeah. So the problems are very, very large. And of course it is important
to be very efficient when trying to come up with algorithms for finding such
a classifier.
Here is a very old, one of the oldest, algorithms invented in artificial
intelligence. It may be the oldest one. It is called the perceptron and it
is very effective for linear classification in the separable case.
So this was invented by Rosenblatt in '57 and some classical papers were
written about it in the context of [indiscernible] networks. So this
algorithm can be thought of as a very simple, the simplest [indiscernible]
because there is only norm.
And the way it works is the following: it starts off with some arbitrary
hyperplane, it doesn’t even matter what you start off with. And I assume
here that it is normalized, you see that there is a circle here. The reason is
that I assume that the norm of this hyperplane is one. That of course
makes absolutely no difference because we are talking about the sign of the
inner product.
And then what [indiscernible] it does is that it finds --. It finds a point
which is misclassified and adds it. So it moves the hyperplane in the direction
of the misclassified point, very intuitive. Very intuitive and indeed, so
this is the formal presentation. [indiscernible] you find the vector for
which the sign is incorrect. And you add it to the current hyperplane. And
I didn't write here the normalization, but you can normalize also, if you
want. You don't have to normalize.
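For readers following along, here is a minimal sketch of the perceptron update just described; it is an illustrative reconstruction (variable names and the iteration cap are mine), not code from the talk.

```python
import numpy as np

def perceptron(A, y, max_iters=10000):
    """Separable-case perceptron: A is an (n, d) array of unit-norm
    examples, y holds labels in {-1, +1}. Returns a direction x whose
    sign pattern separates the data (normalization is optional)."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(max_iters):
        margins = y * (A @ x)
        bad = np.where(margins <= 0)[0]   # misclassified points
        if len(bad) == 0:
            return x                      # everything is on the correct side
        i = bad[0]
        x = x + y[i] * A[i]               # move toward the mistaken point
    return x
```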
So it is very simple. And to analyze the complexity of this very basic
algorithm we need to talk about a quantity called the margin, which
basically is the distance between the hyperplane and the closest
point to it. It is a property of the hyperplane, but you can also talk
about the margin of an instance, which is the maximum possible margin over all
possible hyperplanes. And of course the larger the margin is, the easier the
problem is right? You have more wiggle room to find the hyperplane.
So what Novikoff proved in 1962 is that the perceptron algorithm returns an
epsilon approximate solution in 1 over epsilon squared iterations. And I am
pretty sure all of you have seen this theorem before. And what I mean by
epsilon approximate is that it is epsilon close to the optimal margin. So I
used epsilon here for two different quantities, I am sorry.
But basically if you take epsilon to be the margin over 2 it means that after
1 over margin squared iterations you will find the separating hyperplane.
Okay. All right. So yeah, the proof is so easy I can even sketch it here in
one slide, which might be instructive because we might use it. So this is a
very simple proof, the original one. Let [indiscernible] be the optimal
hyperplane, the one that classifies everything correctly.
So that means that the inner product with any other vector up to the sign is
bigger than epsilon. And then what happens iteratively? Well iteratively we
add the point, which is misclassified right. So the inner product --.
Sorry?
>>: It must be the y is missing, so y.
>> Elad Hazan: That’s right. So I should note that actually in linear
classification you can assume that all the points are labeled blue without
loss of generality. Now you are just trying to put them all on one side.
Why, because --.
>>: [inaudible].
>> Elad Hazan: Why, because you can take a red point and then take the minus
of the vector, the negated red point, and say the label is blue. So yeah, that's right; I
assumed that in this slide. Otherwise you just have the y term here. So
right, so the optimal hyperplane has an epsilon margin. So this is bigger;
the inner product with the optimum grows additively by epsilon every
iteration.
And on the other hand when you look at the norm of the hyperplane, the square
norm, well you just open it up. And this was misclassified right, so it’s
negative. And hence the squared norm grows by at most 1 every iteration. I
assume that all the data points are normalized to be one. So if you look at
the inner product between the optimum and the normalized hyperplane this is
obviously less than 1 because you have 2 unit vectors.
And the numerator grows by epsilon every iteration at least. The denominator
squared is at most t. So you get some blow up here. The ratio is at least root
t times epsilon, and hence this whole process can repeat at most 1 over epsilon squared
iterations. So, that's the entire proof, very simple.
And now if we analyze the running time of this algorithm we have n vectors in
d dimensions. And we repeat this whole process 1 over epsilon squared
iterations. And every time we have to find a vector whose inner product
is negative, right.
So it takes n d time to go over the data and search, for a total time of n d over
epsilon squared, which is great because this is linear. So just to represent
the data on the computer will take you n times d space. Right, so if the
margin is a constant, this whole thing is linear in the data
representation. And this, so this was pretty much the state of the art for
a long time.
And I would like to start by describing another, a new algorithm which
improves upon this running time by instead of taking n times d you have n
plus d. So the important thing to note here is that this running time, by
the way I am hiding here a logarithmic factor in n, which I didn’t write.
And when you have n plus d over epsilon squared it is sublinear, because if
epsilon is much, much larger than 1 over n and 1 over d, then this is much
smaller than n times d, right. So potentially
it is the square root of the original running time if epsilon is, say, a
constant.
Okay. So why is this surprising? At least I think it is surprising because your data might look
like this. So it might be that most of your data is very, very easy to
classify, but the actual optimal hyperplane is determined by very few points,
in this case 4 points. And indeed I mentioned that you can have d or 2 d
points that will determine the optimum exactly, right.
So it seems as if you have to go over the entire data set at least once to
figure out what these points are. Okay. And so this will be the basic
result that I show. This was joint work with Ken and David. And from that
time we have extended these results to other problems.
So for example if you have, if you want to perform non-linear
classification, so instead of using the linear hyperplane you use some
quadratic polynomial, or a polynomial of degree q. Then we get an algorithm
that increases the running time by a factor of q compared to the n plus d. But
this is still sublinear in the data size.
And a more recent result is using these ideas we can apply some algorithms to
semi-definite programming which is another mathematical optimization problem
and get sort of sublinear running times. Here m is the number of constraints
and n is the dimension. To write a semi-definite program you need space
which is m times n squared. So this is sort of a similar behavior. So
instead of m times n squared you get m plus n squared. And I will try to get to this
work.
I should mention that in contrast to previous algorithms here you can prove
lower bounds on the running time. And the reason we can do it is because we
do not even see the data once. So we can use information theory and say that any
algorithm that sees less data than n plus d over epsilon squared
cannot be certain that the answer is such and such. Okay.
So we can prove lower bounds on the running time, which does not go into the
realm of computational complexity and hence they are easy, relatively easy
and they are nearly tight, both for linear classification, did I say that?
Let's see, I didn't say it, but this is tight actually up to [indiscernible]
factors. You must see at least n plus d over epsilon squared. Hence you
must perform at least that much computation.
Okay. So I will try to do as much as I can from these topics, but let me
start with describing the new sublinear algorithm unless there are questions
about the results first which I will be happy to answer.
>>: So --.
>> Elad Hazan: Yeah.
>>: You were mentioning in the first slide the difference between linear
program and linear classification. So why aren’t you presenting this as a
result in approximation algorithms, or linear programming or for semidefinite programming?
>> Elad Hazan: That’s a good question.
So it’s an excellent question.
So the reason I present it in this way is because in linear programming usually
the outcome is, so linear programming tries to optimize a linear function over
[indiscernible]. And you would like to find the optimal vertex.
Now to find the optimal vertex usually you need very high precision. So it
is known that for linear programming, to find the optimal vertex your answer needs to be
precise up to an epsilon, the same epsilon that I denoted the margin, which
is exponentially small in n and d. Okay.
So these algorithms are not actually polynomial algorithms because epsilon
can be exponentially small in n and d. So they do not make sense for linear
programming. They are exponential algorithms for linear programming, but for
classification they make a lot of sense because in the real world we have
noise and then the margin is usually constant, 1 percent or whatever. It’s
not going to be exponentially small in anything.
So that’s why this makes much more sense in terms of machine learning and
statistical learning in general.
Yeah, thanks.
So any other questions?
Actually even for semi-definite programming, optimization people
who would see this running time would --. I gave this talk to an
optimization [indiscernible] and even faculty members in my department,
and they were horrified that this is not a polynomial time algorithm, okay, but
if you assume that the world is noisy then it does make sense.
Oh, sorry, okay.
So let me go over the algorithm.
And the way I will describe, it is a randomized algorithm I should mention.
It doesn’t always give the right answer, but it gives the right answer with
high probability. Now, any sublinear algorithm must be randomized.
You cannot deterministically know the answer without seeing all the data.
And the way I am going to present the algorithm is I am going to first
present a slower algorithm which is deterministic and is very easy to, it
will be very easy for me to convince you that it is correct. And then apply
all the randomization tricks to actually show you how we get the fast running
time. But to immediately give you the whole randomized algorithm would be
non-intuitive.
So here is a slow deterministic algorithm that we can, that I will try to
convince you is correct. It’s a primal-dual algorithm. It’s similar to the
perceptron, but it also has a dual component. So this algorithm starts
with an arbitrary hyperplane. And it updates it according to a weighted
combination of the points. So the perceptron [indiscernible] found some
misclassified point and then updated.
So here we have a convex combination over the points and we update our
classifier according to it. And what is this convex combination? Well, it
starts off being the uniform distribution. And we update it according to the
importance of the examples. Okay.
So according to the inner product of the hyperplane, with the examples there,
essentially the more misclassified they are, the smaller this inner product
is, the higher weight they get.
>>: [inaudible].
>> Elad Hazan: No, no, but almost. So this eta will be 1 over root little t,
so very easy to code up. And this eta will be square root of log n
divided by root t, so square root of log n over t, so both of these are easy to
code up. Yeah?
>>: So pt plus 1, don’t you normalize it?
[indiscernible]
>> Elad Hazan: Absolutely, yes.
>>: So how different is it from boosting, AdaBoost?
>> Elad Hazan: It is very close to AdaBoost. In fact this is the hedge
algorithm. So this whole primal-dual optimization algorithm has two
components. One of them is something that looks like gradient descent,
right. And the other is a hedge, a multiplicative update. So it is exactly
AdaBoost if you want to think about it this way.
And that’s why it works. That’s why I want to convince, that’s why it’s an
easy algorithm to convince you that it works because you have two learning
algorithms playing one against the other.
Any other questions on this?
Okay. Let me show you an illustration because it’s always easier to grasp
this way. And this is a primal-dual perceptron. You have some weights over
the examples and what we tentatively do is take a convex combination
according to the weights which is the green point here, move in this
direction and update the weights. So for points that became closer to the
hyperplane the weights increase; for points that are now easy the weights
decrease.
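For concreteness, here is a minimal sketch of the slow, deterministic primal-dual perceptron just described, with a gradient-style primal step of size roughly 1/sqrt(t) and a hedge-style dual step of size roughly sqrt(log n / t); the exact constants, normalizations, and names are my assumptions, not the speaker's.

```python
import numpy as np

def primal_dual_perceptron(A, y, T):
    """Deterministic primal-dual variant: A is (n, d) with unit-norm rows,
    y in {-1, +1}. Maintains a hyperplane x and a distribution p over
    examples; step sizes follow the talk's eta ~ 1/sqrt(t) and
    eta' ~ sqrt(log n / t), but the constants here are guesses."""
    n, d = A.shape
    B = A * y[:, None]                           # fold labels into the points
    x = np.zeros(d)
    p = np.ones(n) / n                           # starts uniform over examples
    for t in range(1, T + 1):
        # primal: step toward the p-weighted combination, stay in the unit ball
        x = x + (1.0 / np.sqrt(t)) * (B.T @ p)
        x = x / max(1.0, np.linalg.norm(x))
        # dual: hedge / exponentiated-gradient update, upweight small margins
        margins = B @ x
        eta = np.sqrt(np.log(n) / t)
        p = p * np.exp(-eta * margins)
        p = p / p.sum()
    return x / max(np.linalg.norm(x), 1e-12), p
```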
Okay. Sorry. So how to analyze a primal-dual algorithm and this is like
what Ron was saying. A very easy way and this is like standard methodology.
There is nothing new that we have invented here. So a standard methodology
in optimization is to say the following: take your optimization problem and
reduce it to a zero-sum game and once you, so that’s a way to view the
primal-dual formulation, as a zero-sum game.
And now, how do you solve your zero-sum game? Take two smart players and play them one
against the other. And if they are indeed smart they will converge to an
[indiscernible] which corresponds to a solution to the original optimization
problem.
Now what do we mean by smart players? So in the machine learning or game
theory community we say that they play low regret algorithms. So
they play strategies that in the long run converge to the optimal strategy in
hindsight. And that's exactly what we do here.
So a [indiscernible] theorem, again this is not something new that we proved.
But the [indiscernible] theorem says that if you perform this kind of
reduction, you take an optimization problem, reduce it to a game and then
play low regret algorithms one against the other, then they will converge to
an optimum of the zero-sum game, and of the optimization problem, at a rate which is
bounded by the average regret of both players, the sum of the average regrets of
both players.
So when this average regret is smaller than epsilon then you get an epsilon
approximate solution to the original optimization problem. And indeed low
regret algorithms have the property, that’s how they are defined. Their
average regret converges to zero. So this will indeed happen.
Now if you apply this reduction the total running time will be the number of
iterations multiplied by time-per-iteration to compute all the low regret
strategies. And so the reason we apply this reduction is because --. So
first of all it's a very easy framework to apply, and second, these low
regret algorithms are very easy to randomize.
So usually all regret bounds, those of you who are experts know what I am
talking about, talk about expected regret. And if you are talking about
expected regret it’s very easy to apply randomization tricks and retain the
same expected regret and not harm anything else in the process here. Okay.
Okay. So let’s go back to the primal-dual perceptron. We have the primal
player and the dual player. And as I am sure many of you notice this
algorithm is the hedge algorithm. Okay.
So the hedge algorithm is a low regret algorithm. It has a regret which is
bounded by the square root of the number of iterations times log n, a standard bound for
AdaBoost, or hedge, or whatever. And this algorithm is a gradient descent
algorithm, which has regret bounded by root t.
Now when I say gradient descent and multiplicative updates, what is the
objective function that we are talking about? It is essentially the saddle
point formulation of the entire program, which is given by the maximum over x,
the hyperplane, and the minimum over the distribution over examples, of [indiscernible].
That's the --. So the gradient of it is exactly what is written here and the
regret with respect to p is exactly what is written here. And these are the
two algorithms; I should actually say not hedge, but exponentiated gradient.
That's sort of the real formal name of this algorithm. And because of these
well known algorithms the number of iterations is logarithmic in n divided by
epsilon squared.
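Written out, the saddle-point formulation being referred to is, in a standard reconstruction with the labels folded into the a_i:

```latex
\max_{\|x\|_2 \le 1} \; \min_{p \in \Delta_n} \; \sum_{i=1}^{n} p_i \,\langle a_i, x \rangle
\;=\; \max_{\|x\|_2 \le 1} \; \min_{i \in [n]} \; \langle a_i, x \rangle ,
```

whose value is the margin; the primal player ascends in x and the dual player runs exponentiated gradient on p.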
So what did we get?
We have two primal-dual algorithm --.
Yeah?
>>: So would this formulation be different from simple gradient descent on the
exponential loss? It would be the exact same equation right?
>> Elad Hazan: How would you define exponential loss, on each example
separately?
>>: Yeah, and then you will optimize the mean exponential loss on your
[inaudible].
>> Elad Hazan: The mean exponential loss? You can view it actually this
way: soft min, soft max or whatever. That's
right. You would take the soft max function, right, the logarithm of a
sum of e to the something, and do gradient descent on top of that.
This is correct; there actually is a paper about this duality. You can view
it this way, but then I will not be able to apply the nice randomization
tricks which I want to apply and get a faster algorithm. But indeed, this is
correct.
Okay. So what did we get? We have an algorithm which I think I convinced
you it is correct because of all of this primal-dual and so on. And the
running time is, every iteration we need to do this update and this update,
which you can convince yourself takes n d time to go over all examples,
take the convex combination, bla, bla, bla.
And with log n over epsilon squared iterations we come up with this running
time, which is even worse than what I started off with, but that's fine
because we are not done yet, right. But at least we know that this is a
correct algorithm. And now we are going to speed it up and use
randomization.
So one piece of randomization is easy and that will be not to look at the
convex combination of examples, but to pick one example according to the
distribution. So if I go back for a minute, instead of taking the convex
combination, if we sample according to p that already reduces the running
time of this step from n d to just d, okay.
>>: [inaudible].
>> Elad Hazan: d.
>>: [inaudible].
>> Elad Hazan: Sure, yeah, but that's also okay right. You can even do
smarter things, but I don't mind paying d plus n. So that will be d plus n
and then the main difficulty is how do we implement the multiplicative
updates efficiently? Okay. I do not want to spend n d time over there.
Now this is a much more subtle problem, and the reason is that, as those of you
who are familiar with the hedge algorithm or the AdaBoost algorithm know, they do
not work well with randomization.
So the regret of these multiplicative update algorithms depends on the
l-infinity norm of the gradient. And the l-infinity norm of the gradient is
the largest magnitude of any number in your entire vector. And if you randomize
there it could blow up very easily right. And so that's the main difficulty.
And here we had to work much harder and come up with a new multiplicative
update algorithm which is not sensitive to the magnitude, but to the variance of
these random numbers. Okay.
So let me tell you how to sample inner products of two vectors, because
recall that, let me go back for a second, in this dual step we have to update
the distribution according to inner products of the examples and the current
vector. So I want to sample these numbers very efficiently, get some
estimate here, and replace the multiplicative update with something which
is not sensitive to the magnitude. Okay.
>>: Is the problem coming from like, because of the exponential effect of
some loss [inaudible]. But you can start with the different flavor boosting
which is not as --. So instead of AdaBoost you could use [inaudible] or
different, softer --.
>> Elad Hazan: Right, correct, but the problem is if you try to use something
that is different than an exponential update, let’s say you apply a gradient
descent algorithm right, then what will happen is you will get a dependence
which is not logarithmic in the number of examples, but linear or square root
in the number of examples, which is already very bad. So the number of
iterations will increase drastically. The only algorithm which has this
logarithmic nice property in the number of examples is this multiplicative
update. Good question.
Okay. So how do we estimate inner products of two vectors? Well, you could
just sample a coordinate at random and return the product of the corresponding
coordinates. And that is a correct way to do it, but it will have a variance
which depends on the number of coordinates, which is too large. Here is
another way which comes from the streaming literature. It's called
[indiscernible] sampling. And there you have two vectors, v and u.
So assume they are unit vectors. What you do is sample a coordinate with
probability proportional to the squared value of that coordinate in one of the
vectors, and return the ratio between the corresponding coordinates of u and v. All right.
So these sum up to 1 right, because I assume it’s a unit vector. This is a
valid distribution and you can easily convince yourself that the expectation
of the random variable I have defined is correct. It is indeed the inner
product between v and u.
And the nice part about this random variable is that its variance is small:
because these are two unit vectors the variance is 1, unlike the
trivial method of sampling a coordinate uniformly at random, where the variance
corresponds to the dimension. Here the variance is 1. So the variance is
1. That's a good property, but the magnitude can be very, very large. It
can be unbounded right. And that's where the multiplicative update problem
comes in.
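A minimal sketch of this one-sample estimator (the routine name is mine; the talk only specifies the distribution and the ratio):

```python
import numpy as np

def l2_sample_inner_product(u, v, rng=np.random.default_rng()):
    """Unbiased one-sample estimate of <u, v> for a unit vector v:
    pick coordinate j with probability v[j]**2, return u[j] / v[j].
    E[u[j]/v[j]] = sum_j v[j]**2 * (u[j]/v[j]) = <u, v>; the second
    moment is sum_j u[j]**2 = ||u||**2, so the variance is at most 1
    for unit-norm u, even though a single ratio can be huge."""
    probs = v ** 2
    probs = probs / probs.sum()          # guard against rounding error
    j = rng.choice(len(v), p=probs)
    return u[j] / v[j]
```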
Okay. So let's see how the new algorithm will look. So we added
randomization to the primal-dual algorithm: instead of taking the convex
combination of examples we pick one at random, and we update the distribution
according not to the real inner product, but to a one-point sample of the inner
product. Okay. We sample a coordinate according to this sampling thing and
plug it into the multiplicative update.
Now I said that this multiplicative update doesn't work. It's too
sensitive to noise. So we replace it by a polynomial update, which
is not too complicated; it's just a quadratic Taylor expansion of the
exponential.
And we can show that this quadratic, this second order update retains the
logarithmic dependence on the number of experts, but also the regret of this
update relates to the variance of these samples, rather than the magnitude.
Okay.
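One plausible shape of such a second-order update, paraphrasing the quadratic-expansion idea; the clipping threshold and the exact form are my assumptions, not the rule from the paper:

```python
import numpy as np

def second_order_mw_update(p, v_est, eta):
    """Multiplicative-weights step driven by noisy estimates v_est of the
    margins, using the quadratic expansion 1 - z + z**2 in place of
    exp(-z); the factor is always positive, and the regret of such an
    update can be bounded by the variance of the estimates rather than
    their magnitude. Clipping at 1/eta is a guess."""
    z = eta * np.clip(v_est, -1.0 / eta, 1.0 / eta)
    w = p * (1.0 - z + z ** 2)
    return w / w.sum()
```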
All right. So an important point: let's analyze the running time, right.
This is fine, you can just sample from the distribution vector. To do
this we need to estimate all of these inner products; we need to sample
n inner products, we need to estimate all n inner products, right.
So for that we need to pre-process x only once. This is very important;
otherwise you would pre-process x every time and that will take
too long. So we sample one coordinate from x and, using that coordinate, take
all corresponding coordinates of the examples.
And hence the total running time is n plus d over epsilon squared. And again
I omitted the logarithmic factor in n. There is only one log here, nothing
more than that, but one logarithmic factor of n in the running time.
>>: [inaudible]. You chose one dimension once.
>> Elad Hazan: That’s right.
>>: [inaudible].
>> Elad Hazan: That's right, that's right. This will work if you want to get
a low variance in the actual probability estimate. You can actually sample
a different coordinate for each example, but you have to do it cleverly,
otherwise you will end up paying d every time.
So there is a technique of pre-processing a vector and sampling from it again
and again with an additional cost of O(1). There is a way of doing it.
[indiscernible] is very easy using some binary tree. You can even do it in
O(1) if you use really state of the art things.
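A minimal sketch of that preprocessing idea using cumulative sums and binary search, which gives the easy logarithmic-time-per-sample version mentioned; the O(1)-per-sample methods are not shown, and the class name is mine:

```python
import numpy as np

class L2Sampler:
    """Preprocess a vector x once, then repeatedly sample coordinate j
    with probability x[j]**2 / ||x||**2 in O(log d) per draw."""
    def __init__(self, x):
        self.x = np.asarray(x, dtype=float)
        self.cdf = np.cumsum(self.x ** 2)     # one pass over x
        self.total = self.cdf[-1]

    def sample(self, rng=np.random.default_rng()):
        r = rng.random() * self.total
        return int(np.searchsorted(self.cdf, r, side="right"))

# usage sketch: preprocess x once, then draw a coordinate and reuse it
# across all n examples when estimating the n inner products
# sampler = L2Sampler(x)
# j = sampler.sample()
```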
Yeah?
>>: If the vectors are very sparse and you know that this sparsity has strong
[indiscernible], in the sense that some of them will have full dimensions and
the rest will be just [inaudible], can you leverage that to actually improve
it?
>> Elad Hazan: Absolutely.
>>: And my other question is also normalization. Wouldn't normalization be a
problem also?
>> Elad Hazan: Okay. Both of the questions are very good.
So let me answer the first one. First, what if the vectors are sparse? And
indeed in the real world the vectors are sparse, right. So n times d is very
pessimistic. You don't really pay 10 million times whatever the number of
documents on the internet is, because there is effective sparsity, whatever
[indiscernible].
Everything that I have said can be made to work if you replace d by
[indiscernible]. So again we get exactly the same speedup. Instead of n
times [indiscernible] you have n plus [indiscernible].
>>: But what I am saying is would it be even better if you know that
[indiscernible] has structure in the sense that some chunk of [indiscernible]
is frequent. And think of frequent words and then there are going to be some
rare words in every document and then [indiscernible].
>> Elad Hazan: Yeah. That's right. That's one of the properties that I do not yet know how
to exploit, but it is a good open question to try to exploit further
structure. Currently we do not know how to do it. And then you asked
something else which now I forgot.
>>: The normalization.
>> Elad Hazan: Normalization. Right. We need to normalize both here and
here. And both of these take, this will take d time and this will take n
time. So it doesn’t increase the running time by anything. It doesn’t cause
problems in other applications such as minimum enclosing ball. And so, but I
am not going to get into it. Okay. Any other questions about this?
Okay, so --.
>>: So this can be, the, the estimates could be dominated by some
coordinates. Like let's say some features and dimensions are more important
than others and it's very easy to miss them sampling in this way right?
>> Elad Hazan: Yeah, yeah.
>>: And some problems are like that right?
Where you are looking for --.
>> Elad Hazan: Yeah, it's exactly the same --. That's why it sounds like it
shouldn't work, because there could be very few features which are important
in the whole thing. But it does work, because you keep on adding, right. So
your distribution does focus on the more important examples, and then the
vector x is just a linear combination of all these examples that you have
seen. And hence it will pick them up. The distribution will be concentrated on
those features that occur in these examples. So --.
>>: Oh, okay. So you have actually two distributions right? One is on the
features and one is on the examples.
>> Elad Hazan: That's right, that's right.
>>: Just to summarize. So you mentioned that [inaudible] features which are
just absolutely zero at all times right. So when I sample the features
actually I have probability 1 for observing or close to 1 of observing one of
these redundant features.
>> Elad Hazan: No, you have probability zero. Well, in the first iteration --.
>>: [inaudible].
>> Elad Hazan: In the first iteration you will. You will do nothing in the
first iteration, but from that point on you will have probability one of the
--.
>>: [inaudible] because I will, because [inaudible].
>> Elad Hazan: That’s right.
>>: The problem is more interesting, it seems, when you have features that
you add that are high variance, but unrelated to your class target. And
then you are going to sample those features with your sampling algorithm. If
you have lots of those features you will get no signal. And so you need to
pre-process your data to normalize.
>>: But then you will have low margin.
>> Elad Hazan: Exactly, that’s exactly the answer.
>>: Then you will have low margins.
>>: Okay.
>> Elad Hazan: That's exactly the answer. So, if that is the case then
anyway you are shooting for something which is very hard.
All right. So a few notes about what, so basically I concluded with a linear
classification example. Okay. So this I already said.
So the overall algorithm succeeds with probability 1/2, not any higher than
that. And in fact even verifying a solution, if I give you the hyperplane
now please tell me if it is an epsilon approximate classifier or not, you
have to spend n d time which is longer than what we used to train right.
So how do you boost the probability and so on? And it turns out that our
algorithm produces a primal and a dual pair and hence you can use duality to
estimate how good it is. And you can also do some simple tricks, some
other simple tricks to actually verify probabilistically whether a given
hyperplane is an epsilon approximate hyperplane or not.
And then you can boost the probability of success up to 1 minus delta and add
the log 1 over delta to the running time. So that’s the standard thing. So
we don’t increase by more than log 1 over delta.
And this whole thing is tight. You cannot run in time which is less than n
plus d over epsilon squared. The logarithmic factor we do not know if it’s
tight, but otherwise it is tight.
All right.
So I have I guess, I don’t know ten minutes or so right?
Yeah, so maybe I will sort of sketch other problems that occur.
So, minimum enclosing ball is a similar problem that is also called margin
estimation. And then with this problem you are trying to find the centroid
of the set of points. So the mathematical programming formulation is that
you are trying to find a point which minimizes the maximum distance to a
given set of points.
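In symbols, the formulation just described is (my transcription; squaring the distance is a convention):

```latex
\min_{x \in \mathbb{R}^d} \; \max_{1 \le i \le n} \; \| x - a_i \|^2 .
```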
So basically we are trying to find this centroid of the body. And yeah, so
what happens here is that this function is mathematically speaking strongly
convex. Because it is strongly convex we can apply pretty much the same
technique that I have discussed so far, but the gradient descent version. I
am now talking a little bit more to those who know and are familiar with
convex optimization. The gradient descent variant has lower regret. Because
it has lower regret we can reduce the running time.
So actually the running time we get for this problem is n over epsilon
squared plus d over epsilon. So one of the n or d factors, I forget
which, is divided by epsilon, not epsilon squared. And that's an artifact of
the strong convexity. And that we can also show to be tight, but here there
are some problems with the normalization as I said.
You asked, [indiscernible], about normalization. So here we need to assume
that all the points are in the unit ball or some kind of stronger assumption.
And maybe the much more interesting case is that of [indiscernible] non-linear classifiers, so sometimes your data might look like this, right. And
here there is no linear classifier that can classify all the points
correctly, but there is a non-linear classifier.
Now here is a video, so this ball here classifies correctly and it is non-linear, but actually it is linear in a higher dimensional space.
So here is a video which I borrowed from [indiscernible] I don’t know if he
showed it here or not. He might have shown it here, but it is a very good
illustration by Udi Aharoni. Let me play this, so it’s really short.
Right, so this is the data set and then this circle, so if you lift all these
points from 2D to 3D then you can find a separating hyperplane
in 3D, which is this hyperplane that will appear here.
Okay.
All right.
Will it work?
Yeah, good.
So mathematically speaking, instead of taking a linear function what we have is a
quadratic function. A quadratic function in low
dimensions you can represent simply by listing all monomials in a higher
dimensional space. And if you have a polynomial of degree q, then how
many dimensions do you need? Well, if you started off with n variables and
you are talking about polynomials of degree q, you need n to the q, so n to
the q dimensions, right. So it behaves exponentially in q.
But you do not want to pay the price of your running time being n
to the q, and that's what the whole kernel methodology is about. You can
actually compute inner products in time n, not n to the q, simply by the
observation that the inner product of two such lifted vectors is given by the
regular inner product to the power q. And the perceptron, and even the
sublinear perceptron that I presented, is a kernelizable algorithm. It
only uses inner products; it uses nothing else.
So we can take this sublinear perceptron which I have described. So you add
an example and then sample and kernelize it. How would you kernelize it?
Well you can think of your hyperplane as living in the higher dimensional
space and you add to it, not the example, but the lifting of the example to
the higher dimension. And you need to update the distribution. How would
you update the distribution? This is the only delicate part.
Well, you need to update it according to the inner product of two mapped
vectors, not two vectors in the low dimensional space, but two vectors in the big space. And
how would you do that? Well, I am only talking about the polynomial kernel
now. This inner product is equal to a_i dot x to the power q, right. How would
you get an unbiased estimator of this quantity? Just estimate a_i dot x q times
independently and take the product of all of these, right, so.
And that's it, so it increases your running time by a factor of q, because
every time you want to estimate this inner product you pick q coordinates, rather than 1.
But otherwise everything else remains exactly the same and the total running
time is q times n plus d over epsilon squared, which is very reasonable
because q is usually not very large in applications.
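A sketch of that kernelized estimate: since the lifted inner product equals (a_i dot x) to the power q for the degree-q polynomial kernel, multiplying q independent one-sample estimates of a_i dot x gives an unbiased estimate of the lifted inner product (function name and details are mine):

```python
import numpy as np

def poly_kernel_estimate(a, x, q, rng=np.random.default_rng()):
    """Unbiased estimate of (<a, x>)**q for unit-norm x: multiply q
    independent l2-sampling estimates of <a, x>.  Independence makes the
    expectation of the product equal the product of the expectations."""
    probs = x ** 2 / np.sum(x ** 2)
    est = 1.0
    for _ in range(q):
        j = rng.choice(len(x), p=probs)
        est *= a[j] / x[j]
    return est
```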
Yeah?
>>: So is there [inaudible].
>> Elad Hazan: Yeah, so the reason, so --. The easiest way to think about it
is that a Gaussian kernel, I am talking now theoretically, practically
maybe it will not work very well. But theoretically, for the Gaussian kernel you
have some parameter there which is when does the Gaussian begin to shrink.
And essentially you can take the Taylor expansion of the Gaussian and
cut it off at that parameter. It's very, very close to the Gaussian and you
can apply that.
Practically speaking this may not be satisfactory, right, because you don't
really want to use the polynomial, you want to really use the Gaussian. But
this is a good question and I list it as an open problem in the end. How do
you treat kernels generically? This is very specific to polynomials. How do
you do something more generic?
Yeah?
>>: I was working on the --. So the variance of this estimator is the original
variance to the power of q right?
>> Elad Hazan: It’s even smaller than the variance we started off with,
because --.
So we take the product of independent random variables, right. So it's
actually even smaller. The magnitude is less than 1. So if you take the product
of q such things it's 2 to the minus q or something. It gets even better
than before, which indicates actually that maybe there is something we can do
here, but we do not really know how to exploit this fact so far.
Okay. Yeah, like maybe you don't really need to sample i.i.d., maybe you can do
something better than sampling i.i.d. and reduce the running time.
Yeah.
So that I said.
So I will just close by saying something small about maybe the most
interesting variant of this whole thing. So far everything was separable.
What happens if things are not separable such as here? And then there is the
soft-margin formulation, right, which I think is sort of the state of the
art in support vector machines. You do not try to minimize the number of
misclassifications, because that's going to be hard. We try to minimize the --.
Do I have an illustration? I don't. So you try to minimize the sum of
distances of misclassified points to your hyperplane. And that is a convex
problem which you can solve.
Now here is the soft-margin formulation and it is given by finding the
hyperplane which minimizes, well this just says that it’s in a ball and here
we are trying to minimize the sum of distances measured according to some
hinge loss, not according to [indiscernible] loss, but you can place any
other loss here.
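Up to the exact scaling of the hinge, the soft-margin objective being described can be written as (a standard reconstruction):

```latex
\min_{\|x\|_2 \le 1} \; \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \langle a_i, x \rangle\bigr).
```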
And then you try to minimize the average of this loss. And this is a very
successful paradigm. It turns out that this is an easier optimization
problem than the separable case. This is not equivalent to linear programming,
the original one is, and this is an easier problem. It's an unconstrained
optimization problem of a single strongly convex and very nice function. So
actually it's an easier problem; it's pretty amazing when you think about it.
And using just classic gradient descent, or [indiscernible], which has been
known since the 50s, you can get epsilon approximate solutions in time d over
epsilon squared. Okay. You can get very, very quick solutions to this
optimization problem. And there is no need for any primal-dual or anything
like that.
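A minimal sketch of that classic stochastic gradient approach on the hinge loss (step sizes, the projection, and the lack of iterate averaging are illustrative choices, not the speaker's exact method):

```python
import numpy as np

def sgd_soft_margin(A, y, T, rng=np.random.default_rng()):
    """Projected stochastic subgradient descent on the average hinge loss
    over the unit ball; each step touches one d-dimensional example, so
    roughly d / epsilon**2 total work yields an epsilon-accurate point."""
    n, d = A.shape
    x = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        if y[i] * (A[i] @ x) < 1.0:          # hinge subgradient is -y_i a_i
            x = x + (1.0 / np.sqrt(t)) * y[i] * A[i]
        x = x / max(1.0, np.linalg.norm(x))  # project onto the unit ball
    return x
```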
And in fact this is tight in the example model. So you can prove a lower
bound that says, if all you get to see is examples, then you cannot do any
better than that. You have to see d over epsilon squared examples and each
one you [indiscernible]; nothing can be done.
But actually, so the whole idea of this work is to look inside the
examples, not just use them as a black box, right. We sample inside and see
the actual features rather than taking them as a box. And in this random access
model we can actually get faster running times for this optimization problem,
if you measure it in terms of how fast you get a certain generalization error.
So this is actually much more intricate and I do not have time to go into the
details, but the basic ideas are exactly what I have discussed. And you need
to add some tweaks with respect to the actual objective. The objective is no
longer an optimization problem; it is a generalization error objective.
So we implemented the soft-margin SVM algorithm and we implemented the
[indiscernible], which is essentially a successful variant of gradient
descent for soft-margin SVM. And we measured not the running time, because
the running time of our implementation so far is pretty slow. It's a more
complicated algorithm than the [indiscernible]; theoretically it's better,
but in practice it's not any better so far.
But we measured how many accesses to the data does the algorithm incur. And
there indeed we get much, much better --. The blue line is the sublinear
implementation and we get much, much better convergence to the optimal error.
>>: [inaudible]
>> Elad Hazan: Sorry?
>>: Is the blue line [inaudible]?
>> Elad Hazan: Yes, this is an intriguing feature which I have no idea how to
explain, that there is some kind of overfitting going on. So they both
converge to the same error. This is the optimal error of the optimization
problem, but the sublinear algorithm reaches some better point first and then
it deteriorates. And this is consistent through many data sets and many
experiments that you do.
And there is something going on here. I suspect it has to do with the fact
that you can actually describe the generalization error, not via support
vectors and so on like the usual trick, but in terms of features. You have a
[indiscernible] hypothesis that the algorithm is learning. And at some point
it overfits. So you know that you really want me to get the optimal
solution and we are going to forget about the whole [indiscernible] solution
and get [inaudible].
But this is extremely intriguing. I think there is something to find out
here theoretically: what is the class of hypotheses which is actually
smaller and has the same kind of generalization error? And that's what is
happening here in the data.
>>: A more aggressive learning [indiscernible] does not help? [inaudible].
I mean this is optimal in terms of the best solution to get with [inaudible].
>> Elad Hazan: Yeah, it’s hard to say optimal, but I think yeah. So my
student [indiscernible] he implemented this and he did most of the work in
this paper and I trust him completely that he ran comprehensive experiments
and this is very consistent, across data sets, and across different learning
rates and across everything, so yeah.
>>: Is it possible that its regularizing due to just the way that the
algorithm looks at sort of sub-sets of the features at a time? It seems like
there might be something there where it’s kind of moving kind of
conservatively.
>> Elad Hazan: Yeah, yeah, no, no, I think there is something of this flavor
of which I don’t know what it is yet. But I intend to find out and would be
very happy if someone else found it out before me. But there is something
very interesting going on here, definitely.
Yeah, I don't think I have time to go into semidefinite programming, but
basically similar ideas can be used to solve semidefinite programs. And you
get sublinear running times. Here there is some gap between the upper and
lower bounds, which is due to [indiscernible] computations. So there is a
bottleneck here which seems much harder than the whole, than our entire work.
So this has to do with [indiscernible] computations, which no one knows how
to do better than some specific running time, but the [indiscernible]. But
anyway it's nearly tight.
I should finish. Maybe I will say one word about lower bounds. So the whole
idea of the lower bounds comes from the following fact: let's say I give you
an array and I tell you either this array is completely empty or there is a
blue ball inside of it. How many accesses to the array do you need to tell
me, to distinguish between these two cases? How many?
>>: Random results for [inaudible].
>> Elad Hazan: For [indiscernible] you will need to see the whole thing.
Randomized, let's say you want to succeed with probability half.
>>: [inaudible].
>> Elad Hazan: You need to see half of the array. You are not going to get
away without seeing half of the array. So this is very intuitive and that’s
exactly what we use for the lower bound. So we take two instances, whether
of linear classification or semi-definite programming it doesn't
really matter. The higher value just comes from the bottom line.
And then so we take two instances of semi-definite programming, let's say,
with the property that one of them is going to have a solution with
margin epsilon and the other is not, because I hid in the relevant coordinate
a 0 instead of, say, 1 or epsilon, whatever the value is. And there is no
way you are going to be able to distinguish whether this SDP has an
epsilon approximate solution of 1 or 0 unless you go over the entire data, or
over half of the data using a randomized algorithm.
So that's exactly the flavor of the lower bounds. And I am not going to
elaborate more, but you can see that this is not a difficult --. It's not a
computational lower bound, it's a much, much easier information theoretic
lower bound, and that gives us a tight lower bound.
So let me just summarize. So I have presented some sublinear algorithms for
linear classification and talked a little bit about some variants. So an
open question is, like [indiscernible] was asking, can we handle kernels
generically rather than just using the polynomial kernel and then a [indiscernible]
approximation by a polynomial of whatever function you want? Then someone
else, basically all these questions someone here in the audience already
asked, so I have an easy job.
So are there assumptions on the data that would permit faster optimization
such as some kind of probabilistic generation or something of the same
flavor? What if we do allow one pass over the data? So these algorithms are
strictly sublinear. They run in time which is proportional to the square
root of the data size.
Now if you allow one pass, which may be reasonable in applications right,
maybe just to allow one pass just to write the data onto a disc. Then no
lower bounds are known, because you have entered the regime of computational
complexity, and no one knows how to prove anything in computational complexity in
this kind of regime.
So everything is possible and I conjecture that you should be able to prove a
running time of the form linear n d plus polynomial in n and d and polynomial
in 1 over epsilon. And that would be very, I think, very significant in
practice.
There is a theoretical framework by Dunagan-Vempala which takes an
approximate --. So everything I have presented is an approximate
optimization right. It’s not polynomial time. But they can take these rough
approximate algorithms and convert them into real polynomial time, boost them
into real polynomial time algorithms. And there is potential here to improve
the running time of these polynomial time algorithms via what I have
presented.
And something that John asked me, so can you exploit computer architecture
and get these things actually to work really well in practice rather than
just a theoretical improvement?
So unless there are questions, that's basically it, I will conclude here. So
thank you for your attention, it was a pleasure.
[clapping]
>>: Any more questions?
>>: I just want to ask, there is a feature selection algorithm called
[indiscernible] that was introduced about 10 years ago --.
>> Elad Hazan: Okay. Yeah, so there is and it's --. My kids were into the
Lion King at the time of writing it, so --.
>>: [indiscernible].
>> Elad Hazan: Yeah, I imagine yeah.
[laughter].
Okay.
Thanks.