>> Ofer Dekel: It is our great pleasure today to have Elad Hazan from the Technion. I have known Elad almost 20 years, maybe 19, soon to be 20. So we are old buddies. Elad started in Tel Aviv University and finished his PhD at Princeton under Sanjeev Arora. And then he moved to IBM Research and from there to the Technion where he is a faculty member. And now he is visiting us this summer and is soon to finish a very successful visit. So thank you Elad. >> Elad Hazan: Okay, thanks Ofer. It's a great pleasure to be here and to visit the machine learning group. So I am Elad Hazan and I am based at the Technion. This is based on work with colleagues from my time at IBM, Ken Clarkson and David Woodruff, and some students at the Technion, Garber and [inaudible] who is a colleague at TTI now. And I will talk about linear classification. This is a relatively small audience, so feel free to interrupt me and ask questions and we can make it interactive. If you want to hear more about something then feel free to let me know. So I will start with the basics, and the basic problem I want to talk about is called linear classification, which I assume all of you have heard about. Basically we have a data set which is represented by points in [indiscernible] space. Some are labeled blue and some are labeled red, and we seek a hyperplane that separates them, such as the one in the picture. So the [indiscernible] example is that you have, let's say, e-mails. These e-mails are represented as vectors simply by taking every dimension to be a word, and then there is a bit of 0 or 1 for whether the [indiscernible] belongs to the [indiscernible] or not. And then there is a label, whether the e-mail is spam or not, and you are trying to classify it. Now in this picture there is a hyperplane that separates all the blue points from the red points. Of course that's not the case in practice; many times there is no hyperplane that can separate these two sets. I will get to what can be done later on, but I want to start with the most basic question: there is a hyperplane that separates the two sets and we are trying to find it. So to be a little bit more formal, we have n vectors in d dimensions and I will refer to them as A1 up to An in Rd. And then we have labels, which are whether the points are red or blue, so plus or minus 1. And we seek to find a vector such that the sign of the inner product between --. Hello, you missed nothing so far. So the sign of the inner product between the vector that we seek and all these examples is correct, assuming that there is such a vector. Okay. So again, we are only talking now about the separable case: there is such a vector and we are trying to find it. So this is a fundamental machine learning primitive and it lies at the basis of many other more sophisticated methods. If you Google or Bing linear classification versus linear programming, linear classification gets more hits. It's extremely popular recently. Now those of you familiar with the subject actually know that these two problems are equivalent mathematically, so they should get the same number of hits, and they don't. But Bing is doing a little bit better in that respect. And this is a very widespread routine that is used commonly in many internet applications and so on, such as spam detection. And if we are talking about text applications, what happens is that the number of examples is usually very, very large.
And the number of points, which is denoted by n, and also the dimension are very, very large, because you have the dictionary, right. So I heard a talk by a Google employee who said, well, I can actually let you try to guess: how many words are there in all dictionaries in all languages in the world? Can anyone guess? >>: And what do you mean by word of course --. >> Elad Hazan: Yeah, so I include in it also, let's say, small expressions that include two words maybe. >>: It's oh my god, billions. >> Elad Hazan: So it's actually something like --. Sorry? >>: [inaudible] [laughter] >> Elad Hazan: So actually you are talking about something like between two and ten million, depending on how many expressions you use, so it's large. It's not incredibly large, but it's a large number. Yeah. So the problems are very, very large. And of course it is important to be very efficient when trying to come up with algorithms for finding such a classifier. Here is a very old, one of the oldest, algorithms invented in artificial intelligence. It may be the oldest one. It is called the perceptron and it is very effective for linear classification in the separable case. So this was invented by Rosenblatt in 57 and some classical papers were written about it in the context of [indiscernible] networks. So this algorithm can be thought of as a very simple, the simplest [indiscernible], because there is only one neuron. And the way it works is the following: it starts off with some arbitrary hyperplane, it doesn't even matter what you start off with. And I assume here that it is normalized, you see that there is a circle here. The reason is that I assume that the norm of this hyperplane is one. That of course makes absolutely no difference because we are talking about the sign of the inner product. And then what [indiscernible] does is that it finds --. It finds a point which is misclassified and adds it. So it moves the hyperplane in the direction of the misclassified point, very intuitive. Very intuitive, and indeed, so this is a formal presentation. [indiscernible] you find the vector for which the sign is incorrect, and you add it to the current hyperplane. And I didn't write here the normalization, but you can normalize also, if you want. You don't have to normalize. So it is very simple. And to analyze the complexity of this very basic algorithm we need to talk about a quantity called the margin, which basically is the distance between the hyperplane and the closest point to it. It is a property of the hyperplane, but you can also talk about the margin of an instance, which is the maximum possible margin over all possible hyperplanes. And of course the larger the margin is, the easier the problem is, right? You have more wiggle room to find the hyperplane. So what Novikoff proved in 1962 is that the perceptron algorithm returns an epsilon approximate solution in 1 over epsilon squared iterations. And I am pretty sure all of you have seen this theorem before. What I mean by epsilon approximate is that it is epsilon close to the optimal margin. So I used epsilon here [indiscernible] for two different quantities, I am sorry. But basically if you take epsilon to be the margin over 2 it means that after 1 over margin squared iterations you will find the separating hyperplane. Okay. All right. So yeah, the proof is so easy I can even sketch it here in one slide, which might be instructive because we might use it. So this is a very simple proof.
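Written out compactly, the argument that the next few sentences sketch is roughly the following (a sketch under the assumptions stated above: unit-norm examples, a unit-norm optimal hyperplane, and labels folded into the points):

```latex
% There is a unit vector w* with margin eps:  <w*, a_i> >= eps  for all i.
% Perceptron update on a misclassified point a_i, i.e. <w_t, a_i> <= 0:
\[
  w_{t+1} = w_t + a_i .
\]
% Progress of the numerator and growth of the denominator:
\[
  \langle w^*, w_{t+1}\rangle \ge \langle w^*, w_t\rangle + \varepsilon
  \;\Rightarrow\; \langle w^*, w_t\rangle \ge t\varepsilon,
  \qquad
  \lVert w_{t+1}\rVert^2 \le \lVert w_t\rVert^2 + 1
  \;\Rightarrow\; \lVert w_t\rVert \le \sqrt{t}.
\]
% Combining via Cauchy-Schwarz:
\[
  t\varepsilon \le \langle w^*, w_t\rangle \le \lVert w^*\rVert\,\lVert w_t\rVert \le \sqrt{t}
  \quad\Longrightarrow\quad t \le 1/\varepsilon^2 .
\]
```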
Let [indiscernible] be the optimal hyperplane, the one that classifies everything correctly. So that means that its inner product with any example, up to the sign, is bigger than epsilon. And then what happens iteratively? Well, iteratively we add the point which is misclassified, right. So the inner product --. Sorry? >>: It must be the y is missing, so y. >> Elad Hazan: That's right. So I should note that actually in linear classification you can assume that all the points are labeled blue without loss of generality. Now you are just trying to put them all on one side. Why, because --. >>: [inaudible]. >> Elad Hazan: Why, because you can take a red point and then take the minus of the vector instead, and say the label is blue. So yeah, that's right; I assumed that in this slide. Otherwise you just have the y_i here. So right, so the optimal hyperplane has an epsilon margin. So this is bigger than epsilon, and the inner product with the optimum grows by an additive epsilon every iteration. On the other hand, when you look at the norm of the hyperplane, the squared norm, well, you just open it up. And this point was misclassified, right, so this term is negative. And hence the squared norm grows by at most 1 every iteration; I assume that all the data points are normalized to have norm one. So if you look at the inner product between the optimum and the normalized hyperplane, this is obviously at most 1 because you have two unit vectors. The numerator grows by at least epsilon every iteration, and the denominator squared is at most t. So you get some blow up here: this ratio is at least root t times epsilon, and hence this whole process can repeat at most 1 over epsilon squared iterations. So, that's the entire proof, very simple. And now if we analyze the running time of this algorithm: we have n vectors in d dimensions, and we repeat this whole process 1 over epsilon squared iterations. And every time we have to find a vector whose inner product is negative, right. So it takes nd time to go over the data, for a total time of nd over epsilon squared, which is great because this is linear. Just to represent the data on the computer will take you n times d space. Right, so if the margin is a constant, this whole thing is linear in the data representation. And this, so this was pretty much the state of the art for a long time. And I would like to start by describing another, a new algorithm which improves upon this running time: instead of n times d you have n plus d. So the important thing to note here is that this running time, and by the way I am hiding here a logarithmic factor in n which I didn't write, when you have n plus d over epsilon squared it is sublinear, because if epsilon is much, much larger than 1 over n and 1 over d then this is sublinear. It's now smaller than n times d actually, right. So potentially it is the square root of the original running time if epsilon is assumed to be a constant. Okay. So why is it surprising? At least I think it is surprising because your data might look like this. It might be that most of your data is very, very easy to classify, but the actual optimal hyperplane is determined by very few points, in this case 4 points. And indeed, as I mentioned, you can have d or 2d points that will determine the optimum exactly, right. So it seems as if you have to go over the entire data set at least once to figure out what these points are. Okay.
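For reference, here is a minimal sketch of the classical perceptron loop whose running time was just analyzed, before moving to the new sublinear result (illustrative code, not from the talk; the names and the explicit iteration cap are my own assumptions):

```python
import numpy as np

def perceptron(A, y, max_updates=10000):
    """Classical perceptron for the separable case.

    A: (n, d) array of examples, assumed normalized to unit norm.
    y: length-n array of labels in {-1, +1}.
    Returns a separating hyperplane w if one is found within max_updates.
    """
    n, d = A.shape
    w = np.zeros(d)                    # arbitrary starting hyperplane
    for _ in range(max_updates):
        margins = y * (A @ w)          # O(n*d) per pass to find a violator
        bad = np.flatnonzero(margins <= 0)
        if bad.size == 0:
            return w                   # every point is classified correctly
        i = bad[0]
        w += y[i] * A[i]               # move the hyperplane toward the point
    return w

# Novikoff's bound: with margin eps (unit-norm points) there are at most
# 1/eps^2 updates, so the total time is O(n*d/eps^2), linear in the data size.
```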
And so this will be the basic result that I show. This was joint work with Ken and David. And from that time we have extended these results to other problems. So for example if you want to perform non-linear classification, so instead of using a linear hyperplane you use some quadratic polynomial, or a polynomial of degree q, then we get an algorithm that increases the running time by a factor of q compared to the n plus d. But this is still sublinear in the data size. And a more recent result is that using these ideas we can apply these algorithms to semi-definite programming, which is another mathematical optimization problem, and get sort of sublinear running times. Here m is the number of constraints and n is the dimension. To write a semi-definite program you need space which is m times n squared. So this is sort of a similar behavior: instead of m times n squared you get m plus n squared. And I will try to get to this work. I should mention that in contrast to previous algorithms, here you can prove lower bounds on the running time. And the reason we can do it is because we do not see the data even once, so we can use information theory and say that any algorithm that sees less data than, whatever, n plus d over epsilon squared cannot be certain the answer is such and such. Okay. So we can prove lower bounds on the running time, which does not go into the realm of computational complexity and hence they are relatively easy, and they are nearly tight, both for linear classification, did I say that? Let's see, I didn't say it, but this is actually tight up to [indiscernible] factors. You must see at least n plus d over epsilon squared entries of the data, hence you must perform at least that much computation. Okay. So I will try to do as much as I can from these topics, but let me start with describing the new sublinear algorithm, unless there are questions about the results first, which I will be happy to answer. >>: So --. >> Elad Hazan: Yeah. >>: You were mentioning in the first slide the difference between linear programming and linear classification. So why aren't you presenting this as a result in approximation algorithms, or linear programming, or semidefinite programming? >> Elad Hazan: That's a good question. It's an excellent question. So the reason I present it in this way is because in linear programming usually, so linear programming tries to optimize a linear function over [indiscernible], and you would like to find the optimal vertex. Now to find the optimal vertex usually you need very high precision. It is known that to find the optimal vertex your answer needs to be precise up to an epsilon, the same epsilon that I noted in the margin, which is exponentially small in n and d. Okay. So these algorithms are not actually polynomial algorithms, because epsilon can be exponentially small in n and d. So they do not make sense for linear programming; they are exponential algorithms for linear programming. But for classification they make a lot of sense, because in the real world we have noise and then the margin is usually constant, 1 percent or whatever; it's not going to be exponentially small in anything. So that's why this makes much more sense in terms of machine learning and statistical learning in general. Yeah, thanks. So any other questions? Actually even for semi-definite programming, optimization people who would see this running time would --.
I gave this talk to optimization [indiscernible] and even faculty members in my department, and they were horrified that this is not a polynomial time algorithm, okay, but if you assume that the world is noisy then it does make sense. Oh, sorry, okay. So let me go over the algorithm. The way I will describe it, it is a randomized algorithm I should mention. It doesn't always give the right answer, but it gives the right answer with high probability. Now this is necessary: any sub-linear algorithm must be randomized. You cannot deterministically know the answer without seeing all the data. And the way I am going to present the algorithm is I am going to first present a slower algorithm which is deterministic, and it will be very easy for me to convince you that it is correct. And then apply all the randomization tricks to actually show you how we get the fast running time. But to immediately give you the whole randomized algorithm would be non-intuitive. So here is a slow deterministic algorithm that I will try to convince you is correct. It's a primal-dual algorithm. It's similar to the perceptron, but it also has a dual component. So this algorithm starts with an arbitrary hyperplane, and it updates it according to a weighted combination of the points. So the perceptron [indiscernible] found some misclassified point and then updated. Here we have a convex combination over the points and we update our classifier according to it. And what is this convex combination? Well, it starts off being the uniform distribution, and we update it according to the importance of the examples. Okay. So according to the inner products of the hyperplane with the examples: essentially the more misclassified they are, the smaller this inner product is, the higher weight they get. >>: [inaudible]. >> Elad Hazan: No, no, but almost. So this eta will be 1 over root t, so very easy to code up. And this eta will be the square root of log n divided by root t, so both of these are easy to code up. Yeah? >>: So p t plus 1, don't you normalize it? [indiscernible] >> Elad Hazan: Absolutely, yes. >>: So how different is it from boosting, AdaBoost? >> Elad Hazan: It is very close to AdaBoost. In fact this is the hedge algorithm. So this whole primal-dual optimization algorithm has two components. One of them is something that looks like gradient descent, right. And the other is hedge, a multiplicative update. So it is exactly AdaBoost if you want to think about it this way. And that's why it works. That's why it's an easy algorithm to convince you that it works, because you have two learning algorithms playing one against the other. Any other questions on this? Okay. Let me show you an illustration, because it's always easier to grasp this way. This is the primal-dual perceptron. You have some weights over the examples, and what we iteratively do is take a convex combination according to the weights, which is the green point here, move in this direction and update the weights. So for points that became closer to the hyperplane the weights increase; for points that are now easy the weights decrease. Okay. Sorry. So how do you analyze a primal-dual algorithm? And this is like what Ron was saying, there is a very easy way, and this is standard methodology. There is nothing new that we have invented here.
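To make the slow deterministic version concrete, here is a rough sketch of the primal-dual iteration as described above (illustrative only; the exact step sizes, normalizations and projections in the actual algorithm may differ from this guess):

```python
import numpy as np

def primal_dual_perceptron(A, T):
    """Slow deterministic primal-dual variant (sketch).

    Primal player: gradient-type step on the hyperplane x toward the
    p-weighted combination of examples.
    Dual player: multiplicative (hedge / exponentiated gradient) update of a
    distribution p over the n examples, upweighting misclassified points.
    A: (n, d) array with labels folded into the points, rows in the unit ball.
    """
    n, d = A.shape
    x = np.zeros(d)
    p = np.ones(n) / n                       # start from the uniform distribution
    eta_dual = np.sqrt(np.log(n) / T)
    for t in range(1, T + 1):
        x = x + (1.0 / np.sqrt(t)) * (A.T @ p)   # primal step
        nx = np.linalg.norm(x)
        if nx > 1.0:
            x = x / nx                       # keep the hyperplane normalized
        # dual step: the smaller <x, a_i> (the more misclassified example i is),
        # the larger its weight becomes
        p = p * np.exp(-eta_dual * (A @ x))
        p = p / p.sum()                      # renormalize to a distribution
    return x, p
```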
So a standard methodology in optimization is to say the following: take your optimization problem and reduce it to a zero-sum game, so that's a way to view the primal-dual formulation, as a zero-sum game. And now how do you solve your zero-sum game? You take two smart players and play them one against the other, and if they are indeed smart they will converge to an [indiscernible] which corresponds to a solution to the original optimization problem. Now what do we mean by smart players? In the machine learning or game theory community we say that they play low regret algorithms: they play strategies that in the long run converge to the optimal strategy in hindsight. And that's exactly what we do here. So a [indiscernible] theorem, again this is not something new that we proved, but the [indiscernible] theorem says that if you perform this kind of reduction, you take an optimization problem, reduce it to a game and then play low regret algorithms one against the other, then they will converge to an optimum of the zero-sum game, of the optimization, at a rate which is bounded by the average regret of both players, the sum of the average regrets of both players. So when this average regret is smaller than epsilon, you get an epsilon approximate solution to the original optimization problem. And indeed low regret algorithms have the property, that's how they are defined, that their average regret converges to zero. So this will indeed happen. Now if you apply this reduction, the total running time will be the number of iterations multiplied by the time per iteration to compute all the low regret strategies. And the reason we apply this reduction is because --. First of all it's a very easy framework to apply, and second, these low regret algorithms are very easy to randomize. So usually all regret bounds, those of you who are experts know what I am talking about, talk about expected regret. And if you are talking about expected regret it's very easy to apply randomization tricks and retain the same expected regret, and not harm anything else in the process. Okay. Okay. So let's go back to the primal-dual perceptron. We have the primal player and the dual player. And as I am sure many of you noticed, this algorithm is the hedge algorithm. Okay. The hedge algorithm is a low regret algorithm; it has a regret which is bounded by the square root of the number of iterations times log n, a standard bound for AdaBoost, or hedge, or whatever. And this algorithm is a gradient descent algorithm, which has regret bounded by root t. Now when I say gradient descent or multiplicative updates, what is the objective function that we are talking about? It is essentially the saddle point formulation of the entire program, which is given by the maximum over x, the hyperplane, and the minimum over distributions over examples of [indiscernible]. That's the --. So the gradient of it is exactly what is written here, and the regret with respect to p is exactly what is written here. And these are the two algorithms; I should actually say not hedge, but exponentiated gradient, that's sort of the real formal name of this algorithm. And because of these well known algorithms the number of iterations is logarithmic in n divided by epsilon squared. So what did we get? We have this primal-dual algorithm --. Yeah? >>: So would this formulation be different from simple gradient descent on the exponential loss? It would be the exact same equation right? >> Elad Hazan: How would you define the exponential loss, on each example separately?
>>: Yeah, and then you will optimize the mean exponential loss on your [inaudible]. >> Elad Hazan: The mean exponential loss? You can view it actually this way: soft min, soft max or whatever. That's right. You would take the soft max function, right, the logarithm of a sum of e to the something, and do gradient descent on top of that. This is correct; there actually is a paper about this duality. You can view it this way, but then I will not be able to apply the nice randomization tricks which I want to apply to get a faster algorithm. But indeed, this is correct. Okay. So what did we get? We have an algorithm which I think I convinced you is correct, because of all of this primal-dual reasoning and so on. And the running time is: every iteration we need to do this update and this update, which you can convince yourself takes nd time, to go over all examples, take the convex combination, bla, bla, bla. And with log n over epsilon squared iterations we come up with this running time, which is even worse than what I started off with, but that's fine because we are not done yet, right. At least we know that this is a correct algorithm. And now we are going to speed it up using randomization. So one piece of randomization is easy, and that will be not to look at the convex combination of examples, but to pick one example according to the distribution. So if I go back for a minute, instead of taking the convex combination, if we sample according to p, that already reduces the running time of this step from nd to just d, okay. >>: [inaudible]. >> Elad Hazan: d. >>: [inaudible]. >> Elad Hazan: Sure, yeah, but that's also okay, right. You can even do smarter things, but I don't mind paying d plus n. So that will be d plus n, and then the main difficulty is how do we implement the multiplicative updates efficiently? Okay. I do not want to spend nd time over there. Now this is a much more subtle problem, and the reason is, those of you who are familiar with the hedge algorithm or the AdaBoost algorithm know that they do not work well with randomization. The regret of these multiplicative update algorithms depends on the l-infinity norm of the gradient, and the l-infinity norm of the gradient is the largest magnitude of any entry in your vector. And if you randomize there it could blow up very easily, right. So that's the main difficulty. And here we had to work much harder and come up with a new multiplicative update algorithm which is not sensitive to the magnitude, but only to the variance of these random numbers. Okay. So let me tell you how to sample inner products of two vectors, because recall that, let me go back for a second, in this dual step we have to update the distribution according to inner products of the examples with the current vector. So I want to sample these numbers very efficiently, get some estimate here, and replace the multiplicative update with something which is not sensitive to the magnitude. Okay. >>: Is the problem coming from, like, the exponential effect of the loss [inaudible]? But you could start with a different flavor of boosting which is not as --. So instead of AdaBoost you could use [inaudible] or a different, softer --.
>> Elad Hazan: Right, correct, but the problem is if you try to use something that is different than an exponential update, let's say you apply a gradient descent algorithm, right, then what will happen is you will get a dependence which is not logarithmic in the number of examples, but linear or square root in the number of examples, which is already very bad. So the number of iterations will increase drastically. The only algorithm which has this nice logarithmic property in the number of examples is this multiplicative update. Good question. Okay. So how do we estimate the inner product of two vectors? Well, you could just sample a coordinate at random and return the product of the corresponding coordinates. That is a correct way to do it, but it will have a variance which depends on the number of coordinates, which is too large. Here is another way which comes from the streaming literature. It's called [indiscernible] sampling. There you have two vectors, v and u, and assume they are unit vectors. What you do is sample a coordinate with probability proportional to the squared entries of one of the vectors, and return the ratio between the corresponding coordinates of u and v. All right. So these sum up to 1, right, because I assume it's a unit vector. This is a valid distribution, and you can easily convince yourself that the expectation of the random variable I have defined is correct; it is indeed the inner product between v and u. And the nice part about this random variable is that its variance is small: because these are two unit vectors the variance is 1, unlike the trivial method of sampling a coordinate uniformly at random, where the variance scales with the dimension. Here the variance is 1. So the variance is 1; that's a good property. But the magnitude can be very, very large, essentially unbounded, right. And that's where the multiplicative update problem comes in. Okay. So let's see how the new algorithm will look. We added randomization to the primal-dual algorithm: instead of taking a convex combination of examples we pick one at random, and we update the distribution according not to the real inner product, but to a one-point sample of the inner product. Okay. We sample a coordinate according to this sampling scheme and plug it into the multiplicative update. Now I said that this multiplicative update doesn't work; it's too sensitive to noise. So we replace it by a polynomial update, which is not too complicated, it's just a quadratic Taylor expansion of the exponential. And we can show that this quadratic, this second order update retains the logarithmic dependence on the number of experts, but the regret of this update relates to the variance of these samples, rather than the magnitude. Okay. All right. So an important point, let's analyze the running time, right. This part is fine, you can just sample from the distribution over the examples. To do this we need to estimate all of these inner products, we need to estimate all n inner products, right. For that we need to pre-process x only once; this is very important. You can then sample; otherwise you would pre-process x every time and that would take too long. So we sample one coordinate from x and, using that coordinate, take all the corresponding coordinates of the examples. And hence the total running time is n plus d over epsilon squared. And again I omitted the logarithmic factor in n. There is only one log here, nothing more than that, but one logarithmic factor of n in the running time.
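Here is a small sketch of the inner-product sampling just described, together with the kind of one-time preprocessing that lets you draw repeated samples cheaply, in the spirit of the binary-search trick mentioned below (illustrative code, not the paper's implementation; the O(log d) sampler and the usage at the end are my own assumptions):

```python
import numpy as np

def preprocess(x):
    """One-time preprocessing of x: cumulative distribution over coordinates,
    where coordinate i gets probability proportional to x_i^2."""
    probs = x ** 2 / np.dot(x, x)
    return np.cumsum(probs)

def sample_coordinate(cdf, rng):
    """Draw a coordinate with probability proportional to x_i^2 in O(log d)."""
    i = np.searchsorted(cdf, rng.random())
    return min(i, cdf.size - 1)              # guard against float round-off

def estimate_inner_product(a, x, cdf, rng):
    """Unbiased one-sample estimate of <a, x> for a unit-norm x.

    Sample i with probability x_i^2 and return a_i / x_i.  The expectation is
    sum_i x_i^2 * (a_i / x_i) = <a, x>, and for unit vectors the variance is
    at most 1, even though a single sample can have huge magnitude (which is
    exactly why the plain multiplicative update has to be replaced).
    """
    i = sample_coordinate(cdf, rng)
    return a[i] / x[i]

# Usage sketch: preprocess x once, then estimate all n inner products <a_j, x>
# from cheaply sampled coordinates of x.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000); x /= np.linalg.norm(x)
a = rng.standard_normal(1000); a /= np.linalg.norm(a)
cdf = preprocess(x)
avg = np.mean([estimate_inner_product(a, x, cdf, rng) for _ in range(5000)])
# avg is close to the true inner product a @ x, up to sampling noise.
```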
>>: [inaudible]. You chose one dimension once. >> Elad Hazan: That's right. >>: [inaudible]. >> Elad Hazan: That's right, that's right. This will work if you want to get a low variance in the actual probability estimate. You can actually sample a different coordinate for each example, but you have to do it cleverly, otherwise you will end up paying d every time. So there is a technique of pre-processing a vector and then sampling from it again and again with an additional cost of O of 1. There is a way of doing it. [indiscernible] is very easy using some binary tree. You can even do it in O of 1 if you use really state of the art things. Yeah? >>: If the vectors are very sparse and you know that this sparsity has strong [indiscernible], in the sense that some of them will have full dimensions and the rest will be just [inaudible], can you leverage that to actually improve it? >> Elad Hazan: Absolutely. >>: And my other question is also normalization. Wouldn't normalization be a problem also? >> Elad Hazan: Okay. Both of the questions are very good. So let me answer the first one. First, what if the vectors are sparse? And indeed in the real world the vectors are sparse, right. So n times d is very pessimistic. You don't really pay 10 million times whatever number of documents on the internet, because there is effective sparsity, whatever [indiscernible]. Everything that I have said can be made to work if you replace d by [indiscernible]. So again we get exactly the same speedup. Instead of n times [indiscernible] you have n plus [indiscernible]. >>: But what I am saying is, would it be even better if you know that [indiscernible] has structure, in the sense that some chunk of [indiscernible] is frequent? Think of frequent words, and then there are going to be some rare words in every document and then [indiscernible]. >> Elad Hazan: Yeah. That's right. That's one of the properties that I do not yet know how to exploit, but it is a good open question to try to exploit further structure. Currently we do not know how to do it. And then you asked something else which now I forgot. >>: The normalization. >> Elad Hazan: Normalization. Right. We need to normalize both here and here, and both of these take, this will take d time and this will take n time. So it doesn't increase the running time by anything. It does cause problems in other applications such as minimum enclosing ball, but I am not going to get into it. Okay. Any other questions about this? Okay, so --. >>: So this can be, the, the estimate could be dominated by some coordinates. Like let's say some features and dimensions are more important than others, and it's very easy to miss them sampling in this way, right? >> Elad Hazan: Yeah, yeah. >>: And some problems are like that right? Where you are looking for --. >> Elad Hazan: Yeah, it's exactly the same --. That's why it sounds like it shouldn't work, because there could be very few features which are important in the whole thing. But it does work, because you keep on adding, right. So your distribution does focus on the more important examples, and then the vector x is just a linear combination of all these examples that you have seen. And hence it will pick them up. The distribution will be concentrated on those features that occur in these examples. So --. >>: Oh, okay. So you have actually two distributions, right? One is on the features and one is on the examples. >> Elad Hazan: That's right, that's right. >>: Just to summarize.
So you mentioned that [inaudible] features which are just absolutely zero at all times, right. So when I sample the features actually I have probability 1, or close to 1, of observing one of these redundant features. >> Elad Hazan: No, you have probability zero. Well, in the first iteration --. >>: [inaudible]. >> Elad Hazan: In the first iteration you will. You will do nothing in the first iteration, but from that point on you will have probability one of the --. >>: [inaudible] because I will, because [inaudible]. >> Elad Hazan: That's right. >>: The problem is more interesting, it seems like, when you have features that you add that are high variance, but unrelated to your class target. And then you are going to sample those features with your sampling algorithm. If you have lots of those features you will get no signal. And so you need to pre-process your data to normalize. >>: But then you will have low margin. >> Elad Hazan: Exactly, that's exactly the answer. >>: Then you will have low margins. >>: Okay. >> Elad Hazan: That's exactly the answer. So if that is the case then anyway you are shooting for something which is very hard. All right. So a few notes; basically I have concluded the linear classification example. Okay. So this I already said: the overall algorithm succeeds with probability 1/2, not any higher than that. And in fact even verifying a solution, if I give you the hyperplane now and ask you to tell me if it is an epsilon approximate classifier or not, you have to spend nd time, which is longer than the time we used to train, right. So how do you boost the probability and so on? It turns out that our algorithm produces a primal and a dual pair, and hence you can use duality to estimate how good it is. And you can also do some other simple tricks to actually verify probabilistically whether a given hyperplane is an epsilon approximate hyperplane or not. And then you can boost the probability of success up to 1 minus delta and add a log 1 over delta factor to the running time. So that's the standard thing; we don't increase by more than log 1 over delta. And this whole thing is tight. You cannot run in time which is less than n plus d over epsilon squared. The logarithmic factor we do not know if it's tight, but otherwise it is tight. All right. So I have, I guess, I don't know, ten minutes or so, right? Yeah, so maybe I will sort of sketch the other problems that occur. So, minimum enclosing ball is a similar problem that is also called margin estimation. In this problem you are trying to find the center of a set of points. The mathematical programming formulation is that you are trying to find a point which minimizes the maximum distance to a given set of points. So basically we are trying to find this center of the body. And what happens here is that this function is, mathematically speaking, strongly convex. Because it is strongly convex we can apply pretty much the same technique that I have discussed so far, but the gradient descent version. I am now talking a little bit more to those who know and are familiar with convex optimization. The gradient descent variant has lower regret. Because it has lower regret we can reduce the running time. So actually the running time we get for this problem is n over epsilon squared plus d over epsilon. So one of the n or d factors, I forget which, is divided by epsilon, not epsilon squared. And that's an artifact of the strong convexity.
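In symbols, a sketch of the minimum enclosing ball objective just described (my notation, reusing the points a_i from the classification setup; the exact normalization assumptions are glossed over):

```latex
\[
  \min_{x \in \mathbb{R}^d} \; \max_{1 \le i \le n} \; \lVert x - a_i \rVert^2 .
\]
% Each term \|x - a_i\|^2 is strongly convex in x, which is what lets the
% gradient-descent-flavored primal player get lower regret and yields the
% roughly  n/\varepsilon^2 + d/\varepsilon  running time mentioned above.
```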
And that we can also show to be tight, but here there are some problems with the normalization, as I said. You asked, [indiscernible], about normalization. So here we need to assume that all the points are in the unit ball, or some kind of stronger assumption. And maybe the much more interesting case is that of [indiscernible] nonlinear classifiers. So sometimes your data might look like this, right. Here there is no linear classifier that can classify all the points correctly, but there is a non-linear classifier. Now here is a video, so this ball here classifies correctly and it is nonlinear, but actually it is linear in a higher dimensional space. So here is a video which I borrowed from [indiscernible], I don't know if he showed it here or not. He might have shown it here, but it is a very good illustration by Udi Aharoni. Let me play this, it's really short. Right, so this is the data set, and then this circle, so if you lift all these points from 2D to 3D then you can find a separating hyperplane in 3D, which is this hyperplane that will appear here. Okay. All right. Will it work? Yeah, good. So mathematically speaking, instead of taking a linear function what we have is a quadratic function. A quadratic function in low dimension you can represent simply by listing all the monomials in a higher dimensional space. And if you have a polynomial of degree q, then how many dimensions do you need? Well, if you started off with n variables and you are talking about polynomials of degree q you need n to the q, so n to the q dimensions, right. So it behaves exponentially in q. But you do not want to pay the price of your running time being n to the q, and that's what the whole kernel methodology is about. You can actually compute inner products in time n, not n to the q, simply by the observation that the inner product of two such lifted vectors is given by the regular inner product to the power q. And the perceptron, and even the sub-linear perceptron that I presented, can be kernelized. It only uses inner products; it uses nothing else. So we can take this sublinear perceptron which I have described, where you add an example and then sample, and kernelize it. How would you kernelize it? Well, you can think of your hyperplane as living in the higher dimensional space, and you add to it not the example, but the lifting of the example to the higher dimension. And you need to update the distribution. How would you update the distribution? This is the only delicate part. Well, you need to update it according to the inner product of two mapped vectors, not two vectors in the low dimensional space, but two vectors in the big space. And how would you do that? Well, I am only talking about the polynomial kernel now. This inner product is equal to a_i dot x to the power q, right. How would you get an unbiased estimator of this quantity? Just estimate a_i dot x q times independently and take the product of all of these, right. And that's it, so it increases your running time by a factor of q, because every time you want to estimate this inner product you pick q samples rather than 1. But otherwise everything else remains exactly the same, and the total running time is q times n plus d over epsilon squared, which is very reasonable because q is usually not very large in applications. Yeah? >>: So is there [inaudible]. >> Elad Hazan: Yeah, so the reason, so --.
The easiest way to think about it is the Gaussian kernel, and I am talking now about theory, practically maybe it will not work very well. But theoretically, for the Gaussian kernel you have some parameter there which is when does the Gaussian begin to shrink, and essentially you can take the Taylor expansion of the Gaussian and cut it off at that parameter. It's very, very close to the Gaussian and you can apply that. Practically speaking this may not be satisfactory, right, because you don't really want to use the polynomial, you want to really use the Gaussian. But this is a good question and I list it as an open problem in the end. How do you treat kernels generically? This is very specific to polynomials. How do you do something more generic? Yeah? >>: I was working on the --. So the variance of the estimator is the original variance to the power of q, right? >> Elad Hazan: It's even smaller than the variance we started off with, because --. So we take the product of independent random variables, right. So it's actually even smaller. The magnitude is less than 1, so if you take the product of q such things it's 2 to the minus q or something. It gets even better than before, which indicates actually that maybe there is something we can do here, but we do not really know how to exploit this fact so far. Okay. Yeah, like maybe you don't really need to sample i.i.d., maybe you can do something better than sampling i.i.d. and reduce the running time. Yeah. So that I said. So I will just close by saying something small about maybe the most interesting variant of this whole thing. So far everything was separable. What happens if things are not separable, such as here? Then there is the soft-margin formulation, right, which I think is sort of the state of the art in support vector machines. You do not try to minimize the number of misclassifications, because that's going to be hard. We try to minimize the --. Do I have an illustration? I don't, so you try to minimize the sum of distances of misclassified points to your hyperplane. And that is a convex problem which you can solve. Now here is the soft-margin formulation, and it is given by finding the hyperplane which minimizes, well, this just says that it's in a ball, and here we are trying to minimize the sum of distances measured according to some hinge loss, not according to [indiscernible] loss, but you can place any other loss here. And then you try to minimize the average of this loss. And this is a very successful paradigm. It turns out that this is an easier optimization problem than the separable case. This is not equivalent to linear programming; the original one is, and this is an easier problem. It's an unconstrained optimization problem of a single strongly convex and very nice function. So actually it's an easier problem, it's pretty amazing when you think about it. And using just classic gradient descent or [indiscernible], this has been known since the 50s, you can get epsilon approximate solutions in time d over epsilon squared. Okay. You can get very, very quick solutions to this optimization problem, and there is no need for any primal-dual or anything like that. And in fact this is tight in the example model. So you can prove a lower bound that says if all you get to see is examples then you cannot do any better than that. You have to see d over epsilon squared examples, and each one you [indiscernible], nothing can be done.
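For contrast, a minimal sketch of the classic stochastic subgradient baseline for the soft-margin objective just written down, which is the kind of method that achieves the d over epsilon squared rate in the example model (illustrative; this is not the new sublinear algorithm, and the step size and projection details are assumptions):

```python
import numpy as np

def soft_margin_sgd(A, y, T, seed=0):
    """Stochastic subgradient descent on the soft-margin objective
        min_{||w|| <= 1}  (1/n) * sum_i max(0, 1 - y_i * <w, a_i>).

    One random example per step, so each step costs O(d); on the order of
    d / eps^2 steps suffice for an eps-approximate solution.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                            # draw one example
        if y[i] * (A[i] @ w) < 1.0:                    # hinge loss is active here
            w = w + (1.0 / np.sqrt(t)) * y[i] * A[i]   # subgradient step
        nw = np.linalg.norm(w)
        if nw > 1.0:
            w = w / nw                                 # project back onto the unit ball
    return w
```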
But actually, the whole idea of this work is to look inside the examples, not just use them as a black box, right. We sample inside and see the actual features, rather than taking them as a black box. And in this random access model we can actually get faster running times for this optimization problem, if you measure it in terms of how fast you reach a certain generalization error. So this is actually much more intricate and I do not have time to go into the details, but the basic ideas are exactly what I have discussed. And you need to add some tweaks with respect to the actual objective; the objective is no longer an optimization objective, it is a generalization error objective. So we implemented our soft-margin SVM algorithm and we implemented [indiscernible], which is essentially a successful variant of gradient descent for soft-margin SVM. And we measured not the running time, because the running time of our implementation so far is pretty slow. It's a more complicated algorithm than the [indiscernible]; theoretically it's better, but in practice it's not any better so far. But we measured how many accesses to the data the algorithm makes. And there indeed we get much, much better --. The blue line is the sublinear implementation, and we get much, much better convergence to the optimal error. >>: [inaudible] >> Elad Hazan: Sorry? >>: Is the blue line [inaudible]? >> Elad Hazan: Yes, this is an intriguing feature which I have no idea how to explain, that there is some kind of overfitting going on. So they both converge to the same error; this is the optimal error of the optimization problem, but the sublinear algorithm reaches some better point first and then it deteriorates. And this is consistent through many data sets and many experiments that you do. There is something going on here. I suspect it has to do with the fact that you can actually describe the generalization error not via support vectors and so on, like the usual trick, but in terms of features. You have a [indiscernible] hypothesis that the algorithm is learning, and at some point it overfits. So you know that you really want me to get the optimal solution and we are going to forget about the whole [indiscernible] solution and get [inaudible]. But this is extremely intriguing. I think there is something to find out here theoretically: what is the class of hypotheses which is actually smaller and has the same kind of generalization error? That's what is happening here in the data. >>: A more aggressive learning [indiscernible] does not help? [inaudible]. I mean this is optimal in terms of the best solution to get with [inaudible]. >> Elad Hazan: Yeah, it's hard to say optimal, but I think yeah. So my student [indiscernible] implemented this and did most of the work in this paper, and I trust him completely that he ran comprehensive experiments, and this is very consistent, across data sets, across different learning rates and across everything, so yeah. >>: Is it possible that it's regularizing due to just the way that the algorithm looks at sort of sub-sets of the features at a time? It seems like there might be something there where it's kind of moving conservatively. >> Elad Hazan: Yeah, yeah, I think there is something of this flavor, though I don't know what it is yet. But I intend to find out, and I would be very happy if someone else found it out before me. There is something very interesting going on here, definitely.
Yeah, I don't think I have time to go into semidefinite programming, but basically similar ideas can be used to solve semidefinite programs, and you get sublinear running times. Here there is some gap between the upper and lower bounds, which is due to [indiscernible] computations. So there is a bottleneck here which seems much harder than our entire work. It has to do with [indiscernible] computations, which no one knows how to do better than some specific running time, but the [indiscernible]. But anyway it's nearly tight. I should finish. Maybe I will say one word about lower bounds. So the whole idea of the lower bounds comes from the following fact: let's say I give you an array and I tell you that either this array is completely empty or there is a blue ball inside of it. How many accesses to the array do you need to tell me, to distinguish between these two cases? How many? >>: Randomized or deterministic [inaudible]? >> Elad Hazan: For [indiscernible] you will need to see the whole thing. Randomized, let's say you want to succeed with probability half. >>: [inaudible]. >> Elad Hazan: You need to see half of the array. You are not going to get away without seeing half of the array. So this is very intuitive, and that's exactly what we use for the lower bound. So we take two instances, whether of linear classification or semi-definite programming it doesn't really matter, the higher value just came from the bottom line. So we take two instances of semi-definite programming, let's say, with the property that one of them is going to have a solution with margin epsilon and the other is not, because I hid in the relevant coordinate a 0 instead of, say, a 1 or an epsilon, whatever the value is. And there is no way you are going to be able to distinguish whether this SDP has an epsilon approximate solution or not, unless you go over the entire data, or half of the data using a randomized algorithm. So that's exactly the flavor of the lower bounds. And I am not going to elaborate more, but you can see that this is not a difficult --. It's not a computational lower bound, it's a much, much easier information theoretic lower bound, and that gives us a tight lower bound. So let me just summarize. I have presented some sublinear algorithms for linear classification and talked a little bit about some variants. So an open question is, like [indiscernible] was asking, can we handle kernels generically, rather than just using the polynomial kernel and then a [indiscernible] approximation by a polynomial of whatever function you want? Then someone else, basically all these questions someone here in the audience already asked, so I have an easy job. Are there assumptions on the data that would permit faster optimization, such as some kind of probabilistic generation or something of the same flavor? What if we do allow one pass over the data? These algorithms are strictly sub-linear; they run in time which is roughly proportional to the square root of the data size. Now if you allow one pass, which may be reasonable in applications, right, maybe just allow one pass just to write the data onto a disk, then no lower bounds are known, because you have entered the regime of computational complexity, and no one knows how to prove anything in computational complexity in this kind of regime. So everything is possible, and I conjecture that you should be able to prove a running time of the form: linear in n d, plus a polynomial in n and d and in 1 over epsilon.
And that would be, I think, very significant in practice. There is a theoretical framework by Dunagan-Vempala which takes an approximate --. So everything I have presented is approximate optimization, right, it's not polynomial time. But they can take these rough approximate algorithms and convert them into real polynomial time, boost them into real polynomial time algorithms. And there is potential here to improve the running time of these polynomial time algorithms via what I have presented. And something that John asked me: can you exploit computer architecture and get these things actually to work really well in practice, rather than just a theoretical improvement? So unless there are questions, that's basically it, I will conclude here. So thank you for your attention, it was a pleasure. [clapping] >>: Any more questions? >>: I just want to know, there is a feature selection algorithm that's called [indiscernible] that was introduced about 10 years ago --. >> Elad Hazan: Okay. Yeah, so there is and it's --. My kids were into the Lion King at the time of writing it, so --. >>: [indiscernible]. >> Elad Hazan: Yeah, I imagine, yeah. [laughter]. Okay. Thanks.