>> Ming-Wei Chang: Hi, it's my pleasure to introduce Kai-Wei here today. Kai-Wei is currently a graduate student at the University of Illinois at Urbana-Champaign. He has done several impressive works on solving large-scale linear classification models, like SVM, and he is also one of the main contributors to the widely used package LIBLINEAR. Today, he is going to talk about his work on training large-scale SVM where the data cannot fit into memory. Thank you.

>> Kai-Wei Chang: Okay. Thanks, Ming-Wei, for the introduction. So today I would like to talk about how to solve large-scale linear classification when the data cannot fit in memory. Part of this work was done with [indiscernible] at National Taiwan University, and another part was done with my advisor, a professor at UIUC.

So let me first give you some motivation for this work, then I will describe the approach we use for this problem, and then I will describe another method to solve the same problem.

So here is the motivation. As we know, in recent years the size of the data has become larger and larger. Here is an example. If we look at the data mining challenge, KDD Cup, you can see that in 2004 the largest dataset was only 40 megabytes. In 2009, the dataset became 1.9 gigabytes. And in 2010, the largest dataset was even larger; it became 10 gigabytes. Also, significantly large amounts of data appear in several domains, such as spam filtering, data stream mining, and web mining. And there are reports from several internet companies that they have large data needs. Usually, if you use more examples and more expressive features, you can get better performance, so that's why people care about large data.

Among the methods that deal with large data, linear classifiers are one of the popular ones. Here, by linear, we mean that we solve the model in the original feature space, and we do not map the data into a high-dimensional feature space. For certain problems, sometimes the linear model can achieve accuracy as good as a nonlinear model, but the training time and testing time are much smaller. This is especially true in the situation where the number of features is large.

So here we show an example where we train a linear SVM and an SVM with RBF kernel on a number of datasets. You can see that although for some datasets there is an accuracy gap between the linear model and the nonlinear model, for those large datasets the accuracy gap between the linear model and the nonlinear model is very small. However, the training time for the linear model is much, much smaller than for the nonlinear model. And the time here, I only show the time assuming the data is already in memory. So it is [indiscernible] what I want to talk about in this work.

>>: So can you get accuracy of 99.7?

>> Kai-Wei Chang: Which one?

>>: The final column, mnist 38, the first row. 99.7. I've never seen that result before.

>> Kai-Wei Chang: You mean the [indiscernible]?

>>: No, no. In the first row. To the right.

>> Kai-Wei Chang: Oh, the 38. This is a binary classification only on the categories 3 and 8. I guess you are talking about the [indiscernible] situation. All of these are binary classification results.

>>: Is linear constrained to only use the current variables, and not try any combinations or feature induction?

>> Kai-Wei Chang: Oh, here I'm just using the original feature space.
So --

>>: Do you know what the number would be with feature induction involved, where you do linear on the induced features?

>> Kai-Wei Chang: You mean using a degree-two polynomial expansion or something like that?

>>: Yeah, something.

>> Kai-Wei Chang: Some combination of the features?

>>: Whatever your favorite induction technique is.

>> Kai-Wei Chang: Yeah, actually, we have a paper that describes that. You can use a polynomial expansion, and then you can train the model using the linear model on those induced features. So we have --

>>: [inaudible] biggest losers that you have right now, which looks like the covtype and webspam, did those recover quite a bit?

>> Kai-Wei Chang: Oh, yeah. So for example, if you use the trigrams of webspam, then you can get about 99.3, even using a linear model on webspam with trigrams.

>>: This is using trigrams only?

>> Kai-Wei Chang: No, these are using unigrams only. "Uni" means unigram.

>>: So the numbers are not using bigrams?

>> Kai-Wei Chang: Yeah, these numbers are not using bigrams. This is unigram. I think your comment is true, that if you do some feature induction here, then you can get a better result, in between the linear model and the nonlinear model. And the training time can still be smaller than using a nonlinear model. That's true.

Okay. So another observation here is that if we assume the data is already in memory, the training time here is pretty small. Let's say that the solvers for the linear model are already well developed, so there are [indiscernible] for the linear model. For example, this one proposed using [indiscernible], this one uses the [indiscernible] method, and we also proposed a method that is used in LIBLINEAR. However, if you directly use these methods in the situation where the data cannot fit in memory, then the training time will be very, very large because of disk swapping.

So here I show you a figure. The X axis is the size of the dataset, and the Y axis is the training time. We train the model on a machine with only one gigabyte of memory, and the green line here shows the actual memory that we can use on that machine. You can see that when the size of the data gets close to the memory we can use, the training time increases rapidly, and this is because of the disk swapping.

>>: [inaudible] initial loading of the data from the disk, or is it assumed that you had a warm start, or what is the setup here?

>> Kai-Wei Chang: So here we just run LIBLINEAR on the data, and we don't do anything special. The system takes care of loading the data and also the disk swapping [indiscernible].

>>: Does LIBLINEAR use threads to do additional loading of data in the background?

>> Kai-Wei Chang: No, no, no. It's just one thread. We're just using one thread.

Okay. So here, we model the training time as two parts. The first part is that we need to access the data from the disk into memory, and the second part is that we need to update the model using the data in memory. In the situation where the data can be stored in memory, the previous works assume that the data is already loaded, so they only focus on the second term and ignore the first. However, we argue that in this situation the first part can dominate the training time.

So here is an example. We run LIBLINEAR on the [indiscernible] dataset. This dataset has about 500,000 instances, and it's about one gigabyte.
And in that situation, it takes about one minute to load the data into memory, but it takes only about five seconds to solve the model. And people usually only report that they solve the model in five seconds. In the situation where the data cannot fit in memory, you cannot just load the data into memory once; you might need to load only a portion of the data at a time. In that situation, the cost of disk access can be even higher.

Okay. So our goal here is to construct a large-scale linear classifier that handles the situation where the data size is larger than the memory capacity. Here we only focus on training on one machine. We make assumptions about our data: we assume that the data size is larger than the memory capacity, but it can be stored on the disk of one machine. We also assume that sub-sampling the data would cause a loss of accuracy. And we will show in the experiments that the data we are using are [indiscernible] by these two assumptions.

So here are the conditions for a viable method, in my mind. First, because the data can only be stored on disk, we cannot afford random access to the disk. Therefore, the method should only load a contiguous chunk of the data from disk at a time. Second, we are looking for an exact solution of the linear model, so we require the optimization procedure to converge toward the optimum; this is not about just getting an approximate solution of the model. Third, because we know that disk access is very expensive, we should reduce the number of times we access the data from the disk, so the number of [indiscernible] should be small. And also, we are trying to find a simple solution for this situation, because we want to support several functionalities, such as multiclass classification using a one-versus-all strategy, parameter selection, and other functionalities such as incremental and decremental learning.

>>: [inaudible] does it mean like mini-batch methods? You're assuming that mini-batch methods would have low accuracy?

>> Kai-Wei Chang: No, I'm assuming that if you just randomly select a subset from the whole dataset and then train the model on this random sample, you will get a lower accuracy than training on the entire dataset.

Okay. So in this talk, I focus on training the linear SVM, although our method can also apply to [indiscernible] and other similar formulations, such as the multiclass formulation by [indiscernible]. Here, we are given the training data (y_i, x_i), where x_i is an n-dimensional vector and y_i is the label, which can be positive one or negative one. Here n is the number of features and l is the number of data points.

So in the SVM formulation, the variable here is w, so the number of variables is equal to the number of features. The first term people call the regularization term, the second term is the loss function, and C is the parameter that balances these two terms. And it's well known that this primal problem is equivalent to the dual SVM formulation. In the dual, the variables are the alphas, and the number of variables is equal to the number of instances; each variable alpha_i corresponds to one data point x_i. The objective function of the dual SVM is a quadratic problem, and the constraints here are bound constraints.
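The formulas on the slides are not reproduced in the transcript. In standard notation, what is being described verbally here is roughly the following (a reconstruction following the usual LIBLINEAR conventions, not a copy of the slide): the primal problem is

\[ \min_{w}\ \tfrac{1}{2}\, w^T w \;+\; C \sum_{i=1}^{l} \max\big(0,\ 1 - y_i\, w^T x_i\big), \]

and its dual is

\[ \min_{\alpha}\ \tfrac{1}{2}\,\alpha^T Q\,\alpha \;-\; e^T \alpha \qquad \text{subject to} \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \]

where e is the vector of ones and Q is defined from the data as described next.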
And the Q matrix here is given by Q_ij equal to y_i y_j times x_i^T x_j, so the (i, j) entry of Q is related to the i-th instance and the j-th instance of the data. And there is a relationship between the optimal solutions of the primal and the dual: at the optimum, w is equal to a linear combination of the data points.

Okay. So let me talk about the first method we proposed. The idea here is pretty simple. Because we cannot load all the data into memory, we split the data into several chunks, and at each time we load only one data chunk into memory. So this is our algorithm: we first split the data into several blocks and store them in several files accordingly. Then we start from an initial model, and we loop until the model converges. Each time, we load one data block from the disk into memory and conduct some operations on this data to update the model, and then we do it again until the model converges. I haven't talked about the details yet; I will talk about the details later.

This algorithm is related to the block minimization method, which is a classical optimization method used in the data mining and machine learning areas. In block minimization, they consider a block of variables at a time, and it has been widely used to solve many formulations, such as linear SVM. Here, in this situation, the data is larger than the memory, so unlike previous work, we cannot afford random access to the disk, and we cannot do the [indiscernible] variable selection. Here, we just use a fixed partition of the variables.

So there are two remaining issues here. First, we need to decide the block size that we are using. And second, we need to specify which operations we use to update the model. So let me go to the first issue first.

>>: Can you go back two slides? One more slide. Over here, for the Q -- one slide back here. I think according to what is right there, the main problem is the Q, the Q square.

>> Kai-Wei Chang: Yeah, yeah.

>>: So but that problem is already solved, you assume. So the problem is the data in memory, or the square of the data that's [inaudible].

>> Kai-Wei Chang: No, assuming the data does not fit in memory, the Q matrix also cannot fit in memory.

>>: But you assume [indiscernible].

>> Kai-Wei Chang: For the Q? Yeah, for the Q, there are several methods, using [indiscernible], using the [indiscernible] method, and in the linear situation we have a dual [indiscernible] that basically only considers one [indiscernible] at a time.

>>: So the Q might be solved.

>> Kai-Wei Chang: Yeah, yeah. And I will talk about this later also.

>>: Okay.

>> Kai-Wei Chang: So basically, we don't compute all the Q entries and store them in memory. We compute an entry of Q at the time we need it, yeah.

Okay, let's go to the first issue. Here, we make an assumption that each block has a similar size. And here is our conclusion: the size of a block cannot be too large, because each block of data needs to fit in memory; and the size of a block cannot be too small either, otherwise we might need a lot of time to load the data. Here is our analysis. We use a simple approximation of the cost, so we only consider the cost of one outer iteration. The cost of one outer iteration is this formulation, and it can be divided into two parts.
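The slide formula referred to here is not captured in the transcript; a reconstruction consistent with the explanation that follows (the symbols are introduced only for illustration) is

\[ T_{\text{outer}} \;\approx\; \sum_{j=1}^{m} \Big( T_{\text{load}}(B_j) \;+\; T_{\text{update}}(B_j) \Big), \]

where B_1, ..., B_m are the data blocks.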
The first part is the time to load each block into memory, and the second part is the time of the operations on the data in memory, summed over the number of blocks we have. In the literature, when people analyze this cost, they only consider the second term. In that situation, because the cost of the operations on a block of data usually grows more than linearly with the block size, they reach the conclusion that using a smaller block to update is better. That is why in LIBLINEAR they choose the block size to be one, and in [indiscernible] they choose a small block size.

However, here the story is different, because we also need to consider the first term, and moreover, the first term almost always dominates the second term. In this situation, we know that to access a block of data from the disk, there is some initial seek cost, and there is another term related to reading the data from the disk, which is proportional to the size of the data. For an outer iteration, the running time then becomes the initial access cost times the number of blocks, plus a term proportional to the total size of the data. The second term is not related to the block size; only the first term is. In this situation, you can see that using a larger block size is better. So we get the conclusion that when the data cannot fit in memory, we should use a block size as large as possible. We will also show this in our experiments to confirm that it is empirically true.

The second issue is that we need to specify how to update the model using a block of data, and here we propose two methods. The first one is to solve the dual SVM by using LIBLINEAR on each block, and the second one is to solve the primal SVM using stochastic gradient descent on each block.

Okay. So for the first method, if we look at the dual function, as we said, each dual variable corresponds to one data point, so there is a natural correspondence between a block of data and a block of variables. So here, we solve a sub-problem such that we update the variables corresponding to the block and fix the rest. In this sub-problem, keeping the current model, we try to find an update such that the variables corresponding to data not in memory are fixed, and the update also needs to satisfy the bound constraints. If you substitute this into the formulation, then you get this sub-problem.

So next, we want to verify that solving this sub-problem only needs the data that is already in memory. Let's look at the sub-problem. As a reminder, the (i, j) entry of Q is related to the i-th instance and the j-th instance. So what is the first term? The first term is basically this quadratic form over the block, a sub-matrix of Q, and you can see that, looking at the two indices, these entries only involve the data already in memory. So this is good. And the last term here is basically a constant, so we can also ignore it. The problem is the second term, because in the second term we have the product between rows of Q and the current alpha, so basically this one requires access to all the data. However, we find that we can use a trick to maintain a temporary vector so that this computation only requires the data in memory.
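In symbols, the sub-problem and the trick being described can be written as follows (a reconstruction in the notation above, not the slide itself). Restricting the update d to the current block B and fixing it to zero elsewhere, the sub-problem is

\[ \min_{d}\ \tfrac{1}{2}(\alpha + d)^T Q\,(\alpha + d) - e^T(\alpha + d) \quad \text{subject to} \quad 0 \le \alpha_i + d_i \le C\ (i \in B), \qquad d_i = 0\ (i \notin B), \]

and the troublesome linear term can be evaluated through a maintained vector

\[ w = \sum_{j=1}^{l} y_j\, \alpha_j\, x_j, \qquad (Q\alpha)_i = y_i\, w^T x_i, \]

so that only the x_i currently in memory need to be touched.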
So here are some details. Basically, we use a technique proposed in our paper: by maintaining this temporary vector w, you can find that (Q alpha)_i is actually equal to y_i times the inner product between w and x_i.

>>: Does that trick apply to nonlinear SVM too?

>> Kai-Wei Chang: Actually not, because we need to use the property that [indiscernible] -- yeah.

So the idea is that if you just substitute this into this one, then you get this equation. Now, if you look at this, updating this quantity actually only involves the r-th instance, so computing it only involves the data already in memory.

>>: So you don't have any solution for nonlinear?

>> Kai-Wei Chang: Yeah, we don't have that, no. If we had it, [indiscernible] classification, yeah.

Also, there is a good property here: because we only update the alphas corresponding to data already in memory, updating this w only involves the data that is already in memory. So basically, after this slide, we can safely say that to solve this sub-problem, we only need the data already in memory.

So now we can inject this into our previous algorithm. For solving the dual SVM, after we load a block of data into memory, we approximately solve the sub-problem to obtain the update direction. Then we use this direction to update our model and also update the temporary vector w.

>>: So I think the thing here is, no matter how you block the data, the final result will be the same, because it's [indiscernible].

>> Kai-Wei Chang: Yeah, yeah, yeah. Also, we have proved the convergence.

>>: But does it take longer to converge if you make the blocks too small? Here, you just assume [indiscernible] access time [inaudible].

>> Kai-Wei Chang: Like you --

>>: If you have more blocks there, it may take longer.

>> Kai-Wei Chang: Theoretically, the convergence rate is the same. However, in practice, if you make the blocks too small, then the convergence -- I mean, you will have a lot of blocks, and the data in each block are not communicating with the other blocks, so the convergence will be slower. That's true.

Okay. So here, our sub-problem can be solved by any bound-constrained optimization method, and we use LIBLINEAR to solve it; LIBLINEAR implements a coordinate descent method. In practice, we usually only find an approximate solution of the sub-problem, so here we need to say something about the stopping condition and also about the convergence.

We propose two approaches for the stopping condition when solving the sub-problem. The first is to use a fixed number of passes over the data in memory. For example, you can say we go through all the data in memory five times or ten times, updating the model using each of the instances one at a time. The second is to define some gradient-based stopping condition. For example, we can use the one in LIBLINEAR, which uses the norm of the projected gradient: if the norm of the projected gradient is less than 0.1, then it stops. And, of course, you can use some combination of these two stopping conditions. We can prove that convergence holds for both of these approaches.

Okay. So next I want to talk about how to solve the primal SVM in this framework. In the primal SVM, as we know, each variable corresponds to one feature, so there is no direct relationship between a variable and a particular instance.
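For reference, the per-coordinate update that LIBLINEAR's dual coordinate descent applies inside each in-memory block, as discussed above, has the standard form for the L1-loss dual (written out here for clarity rather than copied from the slides):

\[ \alpha_i \leftarrow \min\!\Big(\max\Big(\alpha_i - \frac{y_i\, w^T x_i - 1}{x_i^T x_i},\ 0\Big),\ C\Big), \qquad w \leftarrow w + (\alpha_i^{\text{new}} - \alpha_i^{\text{old}})\, y_i\, x_i, \]

so each step touches only one cached instance while keeping w consistent with all the alphas.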
For the primal, however, we can use stochastic gradient descent, because stochastic gradient descent only uses a small set of data to update the model, and it is very closely related to online learning methods. Here we consider using Pegasos for the experiments. In Pegasos, there are two modes: one updates the model using a block of data, and the other updates the model using only one data point.

>>: This is primal?

>> Kai-Wei Chang: This is primal.

>>: Why do you want to use primal? [indiscernible] is it better to use primal?

>> Kai-Wei Chang: No.

>>: That's what I thought.

>> Kai-Wei Chang: The reason I show this slide is that people doing online learning usually think we can do this in the primal using some online learning method. So in this comparison, we want to show that the dual method we propose is actually faster than this primal method.

>>: The dual method can also be used with online learning.

>>: You mentioned it can be used in online learning. [inaudible].

>> Kai-Wei Chang: Yes, this can be used online, yes, that's true. I mean, it can have an online version, but the -- yeah. That's right.

>>: This is just for comparison.

>> Kai-Wei Chang: Yeah, it's for comparison, yeah.

>>: [inaudible].

>> Kai-Wei Chang: That's true.

>>: Although there are some papers that I read saying there's a great advantage to using the primal. Maybe in the kernel, nonlinear version. [inaudible] the second chapter, they talk about how the primal method may be better.

>> Kai-Wei Chang: I think for different datasets they might be different; you might get different conclusions. For example, we have an observation that if the number of data points is very large, then you will have a large number of variables in the dual form. In that situation, sometimes using a primal method will be better.

>>: Although you might be --

>> Kai-Wei Chang: Oh, yeah. Yeah, it's only [indiscernible] consideration.

>>: Using the primal method, actually you don't need the blocks. You can stream the data from the file and [indiscernible]. If you use [indiscernible], then actually your [indiscernible] will be the speed of the disk.

>> Kai-Wei Chang: Yeah, that's true. Yeah.

>>: Can you talk about that?

>> Kai-Wei Chang: Yeah, I haven't shown you the result using multiple threads. But because the bottleneck is the disk, we kind of argue that in the primal method, if you use those online methods, they only consider one data point at a time. So you might load the data from disk once, update the model only once with each sample, and then you need to throw the sample away. In this situation, you need multiple data accesses from disk.

>>: [inaudible].

>> Kai-Wei Chang: It's not a theoretical result. I mean, in practice this will be slower because you need multiple accesses from disk.

>>: So you need to go through the data more than once?

>> Kai-Wei Chang: Yeah, yeah. If you want to get -- actually, at the end of this, I will show a result where we go through the data only once, and the method I propose gets almost as accurate as going through the data many times, but the primal method gets a suboptimal solution.

>>: Using the typical effects like [indiscernible], wouldn't you converge fairly fast just because the learning rate will become small enough that you don't need to pass anymore?

>> Kai-Wei Chang: No, depending on --

>>: [inaudible].

>>: Memory, right?

>>: Yeah, the memory --

>>: This is like 30 gigs plus, right? So this is the --

>>: It's supposed to have a learning rate of [indiscernible]. So it depends whether you're running out of memory because of the number of features or the number of examples. If you're running out because of the number of features --
So this is the -- >>: It's supposed to have a learning rate of [indiscernible]. So depending if you're running out of the memory, because of the number of features or the number of results. If you're running because of number of features -- 14 >>: But if it's sparse, then you likely -- >> Kai-Wei Chang: >>: Of course, it also depend on the data. [inaudible]. >>: The [indiscernible] of Pegasos is the [inaudible]. So if you have giant disk where it's [inaudible], learning rate of [indiscernible] you can't get too far from [indiscernible]. So the only thing that would happen is your dataset is very large, but this is due to the fact that every example takes a lot of memory. So I mention every example takes one meg. Then even if you have 1G, it's just [indiscernible]. So at this point, you really, really want to use every bit of information that you have. But this is not the case, I think the theory suggests that it would be very, very close to [indiscernible]. >> Kai-Wei Chang: Yeah, and also, you know, I see in our paper that we use the dual [indiscernible] method, and we also compare the convergence rate between our method and the Pegasos. So the difference that if you use the old [indiscernible] method, because it's a [indiscernible] method and the convergence rate will depend on the size of data. And in Pegasos, the convergent rate does not depend on the size of data. But that if you consider the convergence rate in terms of the accuracy solution -- I mean the [indiscernible] between the current model and the optimum model, in [indiscernible] value, then the convergence of the [indiscernible] method is faster than the Pegasos. >>: How did you find the distance from the optimal objective? [indiscernible]? >> Kai-Wei Chang: by -- Oh, yeah. That is a theoretical analysis. Did you do a What do you mean >>: [inaudible] trying to find the distance from the [indiscernible], how did you choose your optimum? How did you know what the optimum is? >> Kai-Wei Chang: I didn't get. >>: I mean, when you find the absolute distance of your [indiscernible], how do you calculate the [indiscernible]? 15 >> Kai-Wei Chang: >>: You mean in practical or in theory? When you actually performed the experiments. >> Kai-Wei Chang: Oh, so we basically run the algorithm on the data several times until the duality [indiscernible] is small enough. Okay. So we're arguing they're because that for the algorithm that Pegasos, when you load the neg [indiscernible] into memory, you can only perform one update using one data point in the data in memory. You cannot perform several update on those data. Because those method can't really assume that the data is uniformly [indiscernible] on the whole dataset. So if you just update the data in memory several, a lot of time and get converged, then you will converge to the model that only turn on this part of the data. But in the dual method, because the data points is corresponds to the variables, so you can exactly solve this problem. So that is the main difference here. Okay. So here are some implementation issue. The first one is we find that if you compress the data and then store the data in a compressed data in a disk, then the loading time can be smaller. And this has been shown in several package that in [indiscernible] they also do these kind of things. >>: These are in binary or in plain text? >> Kai-Wei Chang: In binary. Because we want to several space. 
And the second issue is that we find that for some datasets the data is ordered by label. So if you just split the data sequentially, you will get blocks in which all the data have the same label, and that will kill the convergence. So we say a random split at the beginning is needed. So we need an algorithm to do the random split and the data compression in the situation where the data is larger than memory, and we show one such method in our paper. We also show in the paper that our method is simple enough that you can support several functionalities, such as cross validation, multiclass classification, and incremental and decremental learning.

So let me show you some experimental results. These are the datasets we experiment on. The largest dataset here is webspam, which is about 16 gigabytes, and we use it on a machine with one gigabyte of memory, so the largest dataset is about 16 times larger than the memory. Some people will say that now we have much larger machines, but we are trying to reproduce this situation on a machine with only one gigabyte of memory.

So here we show that on these three datasets, if you just do some random subsampling, you will get suboptimal results. Here, the X axis is the percentage of the whole dataset we used, and the Y axis is the accuracy difference to the best accuracy on this dataset. And you can see that --

>>: I have seen many algorithms that allow you to select data optimally.

>> Kai-Wei Chang: Yeah, yeah. That's true.

>>: So that's what you would compare to rather than --

>> Kai-Wei Chang: Actually, the second part of my talk will relate to this. Basically, here I just want to show you that if you just take a subset of the data that can fit into the memory, then you will get a suboptimal result.

And we compared these four methods. The first one is the block minimization method solving the dual using LIBLINEAR, where we use LIBLINEAR to go through all the data in memory ten rounds as the stopping condition. The second one is Pegasos, where each time Pegasos picks one data point in memory and does an update. LIBLINEAR is the one where we just use the standard LIBLINEAR to train the model. And we also compared to VW, which is a well-known online learning package; we used version 5.1.

>>: And when you go through all the data points, do you randomly choose?

>> Kai-Wei Chang: We go through them one data point at a time, yeah.

>>: [inaudible] Pegasos?

>> Kai-Wei Chang: Yeah, yeah, yeah.

>>: Does it help?

>> Kai-Wei Chang: Yeah, okay. I don't know. Sorry. I just use the package.

>>: It costs a lot, and the option does not help.

>> Kai-Wei Chang: It costs a lot?

>>: It's expensive, and it's known to not actually help much in many applications. It costs a lot in terms of cycles, I'm saying.

>> Kai-Wei Chang: Oh, okay.

>>: Once you get a decent [indiscernible], it doesn't matter.

>> Kai-Wei Chang: Yeah, yeah. Disk is very -- that's true.

Okay. So here is the result. In this figure, we show the function value reduction. The X axis is the training time, and the Y axis is the relative function value difference to the optimal solution. The reference model is obtained by going over the entire data several times until convergence, so then we have the optimal function value. Both the X axis and the Y axis here are in log scale. You can see that if you just use LIBLINEAR to solve this model, it becomes very, very slow because of disk swapping.
And for the lines here, we include all the time, including the time for the initial split and compression of the data, to make the comparison fair. The black line here is the block minimization method with Pegasos, and the blue line is the block minimization method with LIBLINEAR. So we are trying to argue that solving in the dual can be faster.

>>: [inaudible] did you get to the disk speed area?

>> Kai-Wei Chang: Disk speed? Sorry?

>>: Yeah, did you get to the disk speed area? So, you know, once you compare time, it's also a matter of implementation. You can take a measure for that, which is the speed at which you read the data from the disk.

>> Kai-Wei Chang: Oh, yeah, yeah.

>>: For example, in Pegasos, was the disk always at 100 percent, or was it at 50 percent?

>> Kai-Wei Chang: Yeah, I forget the exact number for the disk speed, but we show it in the paper. And also, again, the machine we used is pretty old, from 2009. Now we can have a machine with a faster disk.

>>: You used the same machine?

>> Kai-Wei Chang: Yeah, we used the same machine for all these experiments.

>>: So your variable is the implementation?

>> Kai-Wei Chang: Yeah, yeah. This is kind of -- yeah.

So here we also compare the result to the online learning package VW. VW has a very good implementation of the stochastic gradient descent kind of method, the online learning method. And we also show that in this situation, our model can reach the final accuracy faster than VW.

>>: Was it a vanilla [indiscernible], or was it VW with all the VW tricks like the adaptive learning rates and so on?

>> Kai-Wei Chang: Yeah, we used the VW with, as you say, the adaptive learning rate and those things, yeah. We also played with it a little bit, and also [indiscernible] to make the experiment more [indiscernible].

So here we show some other experimental results to confirm our theory. In this one, we show that if you accidentally get one block of data where everything has the same label, you basically kill the convergence; you converge very, very slowly. In this one, we show the difference between block sizes. Here, the X axis is the training time and the Y axis is the relative function value difference to the optimal solution. The red line here is where we separate the data into a thousand blocks; the green line is where we separate the data into 400, and this is 200, and this is 40. So this confirms that using a smaller number of blocks, or a larger data block size, gets a better result.

So in conclusion, we have proposed a method that can handle data 16 times larger than the memory. This paper won the best paper award at KDD 2010, and we have a more complete version in TKDD. Then we found this framework can be further extended, and that is why we propose another algorithm for the same task. So after doing this work -- oh, sorry.

>>: [inaudible]. So is it possible to extend it to [indiscernible], so if you have -- if you want to load the different blocks on different machines?
>> Kai-Wei Chang: Yeah, so I finished a paper this year arguing that if you have data larger than the memory, then you should put it onto several machines, and then you can do some communication, and this can avoid accessing the data from disk several times, because each machine can store its part approximately in memory. But yeah, that's true, for some situations that might be better. We still argue that sometimes the communication cost might also be high, and also --

>>: [inaudible] do you think you can extend it in some way to allow it to work on a [indiscernible]?

>> Kai-Wei Chang: Yeah, that's possible. Actually, I am thinking about that, but I don't have a solution now. You can also think that if you have very big data but only a limited number of machines, then maybe the data on each machine still cannot fit into memory.

>>: [inaudible].

>> Kai-Wei Chang: Sorry?

>>: Would asynchronous updates based on w work with this method?

>> Kai-Wei Chang: Yeah, yeah.

>>: Because you could --

>> Kai-Wei Chang: Yeah, you could do that in each situation, basically.

>>: I mean [indiscernible].

>> Kai-Wei Chang: Yeah. Do you have [inaudible]?

>>: Likely, yes. If you're very sparse, then asynchronous [indiscernible], yeah.

>> Kai-Wei Chang: So we are still thinking about how to reduce this access time. One way is that you can apply some compression to lessen the loading time, which we did before, and you can maybe also consider some compression like feature hashing to reduce the data size. But another way of doing this is that we find that if you can better utilize the memory during learning, then you can possibly reduce the number of iterations needed to get an accurate model, and then you use less time accessing the data. Our algorithm mainly focuses on this second one.

So the idea is that if a sample is likely to be important, then we want to cache it in memory so that we can spend more time and more effort on those samples. If you are familiar with large margin methods, the intuition is that we want to cache the support vectors in memory, so that if we succeed, you just need to do one pass of updates and you are done.

So based on this intuition, we proposed a selective block minimization method, which we abbreviate as SBM. We can show that SBM has several good properties. The first one is that SBM can save disk accesses by using a smaller number of iterations. For example, if we use SBM on the webspam dataset, you only need to load the data from disk once and do the updates, and you can get a result about as accurate as solving the model exactly. As a result, the method is very efficient in this scenario. The second property is that although our method caches data in memory, so it selects samples non-uniformly, we can still prove that the method converges to the optimum.

Okay. So again, we use the linear SVM as an example, and again we can only use this method to solve the dual SVM, because we need the relationship between the dual variables and the data. So what goes wrong in the block minimization method? Because we do the random split in the initial phase, you can consider that the important data are spread over several blocks, and each block has only a small portion of the important data and a large portion of the unimportant data. So then we waste time and memory on the unimportant data. If instead we collect all the important data into a cache, then we can do better updates.
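The reason caching the important samples can be enough is the optimality relation already shown for the dual, restated here in symbols as a reminder:

\[ w^{*} \;=\; \sum_{i:\ \alpha_i^{*} > 0} \alpha_i^{*}\, y_i\, x_i, \]

so only the support vectors, the examples with nonzero alpha, contribute to the final model; if the cache eventually holds them, the remaining data mostly just confirms that their alphas stay at zero.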
However, the problem is that we don't know which samples are important before we solve the model. So we need a way to find and cache those samples during the training process. Our solution is that we split the memory into two parts. Into one part, we load a new data block. The other part we use to cache samples. Then we can train our model on both of these data and get a better model.

So here, I use an animation to show this algorithm. Suppose we are given a large set of training data, and we are trying to find a separator to separate the circles and the triangles. Here, we assume the data points close to the final separator [indiscernible] are more important, and we [indiscernible]. On the left-hand side, we show the memory usage of the block minimization method; on the right-hand side, we show the memory usage of the selective block minimization method. In the first step, both methods load a block of data into memory. For a fair comparison, we assume the data block in SBM is smaller than in BM, because we need to reserve some memory for the cache. Then both methods train a binary classifier, and for SBM, you take those samples close to the current margin and cache them in memory. Then we do it again and again, so you cache more and more samples, and you also remove from the cache those samples the model no longer thinks are important. So after a few iterations, most of the important samples are already in the cache.

>>: [inaudible].

>> Kai-Wei Chang: Yeah, yeah, basically, that's true. Yeah. And at this point, when you train the model using the data in memory, SBM can update the model using more of the important data, so it converges much faster. So this is the key algorithm here; I highlight the difference between SBM and the block minimization method.

>>: [indiscernible].

>> Kai-Wei Chang: Yeah, we --

>>: [inaudible]. Yeah, but in nonlinear SVM, right?

>> Kai-Wei Chang: Yes. Yeah, in nonlinear SVM, they already have a similar technique called [indiscernible], and yeah, it's kind of different, but these two methods are very related, yes.

>>: So the stuff that is in the cached set, does it ever go out of the cached set, or is it going to remain there?

>> Kai-Wei Chang: No, it will go out. You mean if the -- we will remove that sample from the cached set. Yeah, we will remove it.

>>: [inaudible].

>> Kai-Wei Chang: Yeah, yeah. So basically, after each iteration, we use our model to choose the samples from both parts of the memory. Yeah.

So we are given the initial model, and at the beginning the cache is empty, and then we do this until convergence: at each time, we load a block of data, train the model on both the block of data and the cache, and then we update our model and update the cache based on the current model. Again, the sub-problem can be solved by any bound-constrained method, and here we use LIBLINEAR, which implements a coordinate descent method. And we can prove that the convergence holds no matter how you select the cached samples. That is, you can even just do some random selection of the cached samples, and the method still converges.

>>: [inaudible] can't you drag the [indiscernible]?

>> Kai-Wei Chang: Sorry?

>>: If you choose the cache in an adversarial way, can you prove the convergence if the cache is chosen by [indiscernible]?
>> Kai-Wei Chang: Yeah, we can prove that, because you can consider that the cache is -- I mean, choosing the cache is kind of similar to what you do with variable selection in nonlinear SVM.

>>: Can't you [indiscernible] those samples by margin? Would it affect the current weights? Is that --

>> Kai-Wei Chang: It doesn't matter, actually.

>>: That seems --

>>: Yeah, that seems like --

>> Kai-Wei Chang: No, sorry. That's because when you choose the data, it's kind of like choosing the variables to update. Because this is a coordinate descent method, the choice of variable is just [indiscernible], so you can choose different variables to update. But in the end, as long as you eventually go through all the variables to do the updates, you will converge.

>>: It actually changes the distribution that you see. So I can decide to take -- I can take some outliers and just put them in the cache and freeze the cache, and I have it. Then you always want to see these outliers; therefore, the distribution that you see will be skewed.

>> Kai-Wei Chang: No, because you update. So in that situation, if you just put them in again and again, then the [indiscernible] will get fixed and not be updated. So then you will --

>>: [inaudible].

>>: So I see; the main confusion is that this is not an online algorithm. It will go over the data many iterations. Because they keep the [indiscernible], eventually they will figure out that those examples are not important, so they will assign them zero weight. So basically, the distribution doesn't change if they add zero weight.

>>: Because the best [indiscernible] is associated to [indiscernible]. So you will figure out that the alpha is close to zero. So basically, even though they are there, you would not update your weight.

>>: [indiscernible].

>> Kai-Wei Chang: You won't update your weight, but the update will --

>>: Sorry, but the update will be zero.

>> Kai-Wei Chang: So regarding the cache, you can actually define a scoring function and just keep the samples with the higher score. And here we define the scoring function using the distance between the sample and the margin.

So here are some implementation issues. One is that because we want to remove some unimportant samples from the cache, this can be done by just copying the samples to another piece of memory and copying back, but then you need some extra memory. In the paper, we propose a way where you don't need any extra memory to perform this operation. Another issue is that if the number of instances is too large, then the number of alphas will also be very large, and you cannot afford to store all the alphas in memory. In this situation, we can use a sparse implementation of the alphas and store the alphas that correspond to data not in memory in some hash table or on disk, so that we can deal with larger datasets.

Okay. So the method can also be extended to solve several other linear models. Let me briefly discuss the relationship between this method and other methods. Here is a selective sampling algorithm proposed in the early 2000s. The way they do it is that they first just select a subset of the data and train the model on that subset of data.
Then they get a model, and then they go through the entire dataset to choose the data that is close to the current margin and put it into memory. Then they use this data to update the model again, and they repeat this process. So this method is very related to our method. However, in SBM, we load the data into memory not only to select data, but also to use it for training, so we can avoid the overhead of the selection pass. Also, this method is kind of a heuristic, while our method can be proved to converge.

Another relationship is between this method and online learning methods. Online learning is a popular way to deal with large data, but online learning algorithms usually only perform one update on a single instance. Because the update is simple, they may need a large number of iterations. Here are some online learning algorithms related to SBM. For example, Pegasos can also update on a block of data, but as I said before, Pegasos can only do one update on that block; otherwise, the convergence is not guaranteed. There are also some methods that try to use this caching heuristic. For example, in this method, when they try to solve the [indiscernible] perceptron, they also do some caching. Again, they don't have a guarantee of convergence.

So let me show you some experimental results. In our paper, we show experiments on these two big binary classification datasets, and another one on a multiclass classification dataset. Here we use a machine with two gigabytes of memory.

We compared these five methods. The first one is the proposed SBM method. The second one is, just out of interest, SBM with a random cache, where we do not select the cache. The third one is block minimization solving the dual form, which we showed in the first part of the talk. And we also compared to another two methods that only use one sample at a time: one is the block minimization method with Pegasos, and the other one is VW.

So again, here we show the convergence to the optimum. The Y axis is the relative function value difference to the optimum, and the X axis is the time. You can see that even if you do a random cache, our method converges faster than the previous block minimization method, and if you select the important samples, then the convergence is much faster.

In terms of accuracy, we show on the Y axis the difference to the best accuracy we can get on this dataset. This is VW and this is Pegasos; those methods only update on one sample. If you update on a block of samples, you can get an accurate model faster. And if you use our method, then you can get an almost accurate model even at the first iteration.

Then there is a result where we analyze the effect of the cache size and the number of inner iterations on the convergence. The Y axis here is again the relative function value difference, and the X axis is the time. The different lines show the results with different sizes of the cache. This one uses no cache, so it is basically the block minimization method. And this one uses a very small cache of only ten megabytes, then 100 megabytes, one gigabyte, and two gigabytes. Here, you can observe that using a bigger cache usually gets better results. But if you use a cache that is too large, then the method starts to cache some unimportant data, so the result is a little bit slower. And here is a result using three inner iterations.
This one goes [indiscernible] inner iterations. And you can see that if you solve it too tightly, the convergence also becomes a little bit slower.

>>: Does that correspond to like one gig? How many examples is that?

>> Kai-Wei Chang: Yeah, that is about 280,000 examples. And also, on this data, we show that one gigabyte is better because in this dataset the size of the support vectors is about one gigabyte. So if you choose the size of the cache to be close to the size of the set of support vectors, you will get a better result. But, of course, you cannot know that before you train your model.

Yeah, and we also used our method in a streaming situation. For some datasets, if the data is too large, then it is difficult to process it several times, so people try to treat it as a stream and can only do one pass over the data. Existing online learning methods then update on one sample point at a time, and we argue that in this situation a single update per sample is typically not enough. However, if we apply SBM here, we can get a better result. The advantage is that, first, SBM can specify the size of the cache and also the time spent solving the inner iterations, so depending on your needs you can define the amount of memory and the learning time you want to spend, and fully utilize the available resources. And second, we argue the performance of this method is more stable, because you are updating not only on one sample but on a set of samples.

So here, we show the time when we only go through the dataset once, and we assume the data is stored in ASCII format. The I/O time is the time to load the data from an ASCII file, and the learning time is the time to learn the model using the data in memory; the total time is basically just the sum of the first two. Because VW has a very good implementation that loads the data and learns at the same time, there is no way to separate the I/O time and the learning time, so we only show its total time. We get a reference model by running LIBLINEAR on the data several times until it converges, and the final accuracy is about 99.55.

These two methods only take one update on one instance at a time, so the learning time is very small; however, the I/O time is very large. We argue that you spend the time to do the I/O, but you do not utilize that time to do the learning, and they get a suboptimal result. With SBM, we take more time to do the learning, so we can get a better result, even though it loads the data from disk only once.

Okay. So in conclusion, we propose the SBM method for training large-scale linear classification. This method can be extended to several other formulations, such as the Crammer and Singer formulation. We released the experiment code here, and we also implemented this method in a branch of LIBLINEAR, which can be downloaded here.

So let me also talk a little bit about my current work at UIUC. Here I have been looking at distributed training strategies for binary and structural SVMs, and I am also doing an application on coreference resolution. This is the coreference resolution problem: given an article, we try to find the noun phrases that refer to the same entity. So here, "President Bill Clinton", "his", and "the president" are co-referent. We are trying to use structured learning and latent structured learning methods to solve these tasks.
So this is my talk. Thank you very much.

>>: You talked mostly about the SVM. Regression problems, [inaudible]?

>> Kai-Wei Chang: So recently, my previous advisor proposed a method to use a dual [indiscernible] method to solve [indiscernible] regression. I haven't done this experiment, but I guess this method can also apply to --

>>: In the previous case, you won't be able to [inaudible].

>> Kai-Wei Chang: Yeah.

>>: It's not always hard to do that. [inaudible].

>> Kai-Wei Chang: Yeah, it's not. It's not constant [inaudible]. Yeah.

>> Kai-Wei Chang: In regression, if you use [indiscernible], there is also some notion of a support vector. You also have support vectors there.

>>: [inaudible].

>> Kai-Wei Chang: Oh, yeah, yeah. I don't have the concept on --

>>: [inaudible].

>>: If I understand correctly, you are basically only using [indiscernible] to solve a dual problem. So basically, there is like an outer iteration, which works on a bigger [indiscernible] set, and an inner iteration, which is coordinate descent. So you say you can prove convergence. Do you assume any number of inner iterations in order to guarantee the convergence, or how accurately do you need to solve the inner problem in order to guarantee the convergence?

>> Kai-Wei Chang: Yeah, so before I say that: one thing is that you probably need it to stop at some point. If you can guarantee that the inner solver stops within some number of iterations, then basically you can prove the convergence. The idea is that you cannot solve the sub-problem forever; otherwise, you will not see the samples in the other blocks. So if you can guarantee that within a given maximum number of iterations the inner solver stops, then you can prove the convergence.

>>: So if you have a maximal number of iterations for the inner problems, can you solve the inner problem exactly within that number of iterations?

>> Kai-Wei Chang: No, no, I'm not claiming that.

>>: So you need to get an approximation?

>>: You need to get an approximate [indiscernible], and you say, I stop here because I need to look at the other samples.

>>: As long as you can solve the inner problem accurately enough, you can guarantee convergence?

>> Kai-Wei Chang: Yeah, yeah.

>>: So it's just a guarantee of the convergence. You don't know anything about the convergence rate?

>> Kai-Wei Chang: Oh, no. The convergence rate will also be linear. The asymptotic convergence rate --

>>: Linear convergence?

>> Kai-Wei Chang: Yeah, linear convergence.

>>: For the linear convergence rate, though, you need to solve the sub-problems exactly, right?

>> Kai-Wei Chang: No, you don't need to do that. Okay, so you can consider that if the data is all in memory, then you can use the [indiscernible] method, and this method has been proven to converge linearly to the optimum.

>>: Which paper?

>> Kai-Wei Chang: The ICML paper we had in 2008.

>>: [indiscernible] rate, right? You use the [indiscernible] analysis, basically?

>> Kai-Wei Chang: Huh?

>>: You've seen that after a certain number of iterations it lands inside [indiscernible]?

>> Kai-Wei Chang: Yeah, it is an asymptotic convergence rate, not a [indiscernible], yeah. In our ICML 2008 paper.

>>: [inaudible].

>> Kai-Wei Chang: No, I mean, they say that --

>>: [inaudible] optimization standard. That's the standard.
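To make the loop structure being discussed in these questions concrete, here is a compact, self-contained sketch of the selective block minimization procedure as described in the talk. It is written from the verbal description only; the function and parameter names, and the exact margin-based scoring rule, are illustrative assumptions rather than the released LIBLINEAR branch code.

    import numpy as np

    def sbm_train(blocks, C=1.0, cache_size=1000, outer_iters=10, inner_passes=3):
        # `blocks` is a list of (X, y) pairs standing in for the blocks that would be
        # read from disk one at a time; y entries are +1/-1 and X rows are dense vectors.
        n = blocks[0][0].shape[1]
        w = np.zeros(n)
        alpha = [np.zeros(len(y)) for _, y in blocks]   # dual variables, one per example
        cache = []                                      # cached examples as (block_id, row_id)
        for _ in range(outer_iters):
            for b, (X, y) in enumerate(blocks):         # "load" one block at a time
                work = sorted(set([(b, i) for i in range(len(y))] + cache))
                for _ in range(inner_passes):           # dual coordinate descent sweeps
                    for bb, i in work:                  # over the new block plus the cache
                        x_i, y_i = blocks[bb][0][i], blocks[bb][1][i]
                        denom = x_i.dot(x_i)
                        if denom == 0:
                            continue
                        g = y_i * w.dot(x_i) - 1.0
                        a_new = min(max(alpha[bb][i] - g / denom, 0.0), C)
                        w += (a_new - alpha[bb][i]) * y_i * x_i
                        alpha[bb][i] = a_new
                # keep in the cache the examples currently closest to the margin
                work.sort(key=lambda t: abs(1.0 - blocks[t[0]][1][t[1]]
                                            * w.dot(blocks[t[0]][0][t[1]])))
                cache = work[:cache_size]
        return w

In an actual out-of-core run the cached rows would be kept in memory while the rest of each block is discarded after the sweep, which is the point of the cache; here everything is in memory only so that the sketch stays runnable.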
>>: It could be the data and this cache will be updated next time. different from the traditional block calling? >>: So it's Basically, not overlap on the block. >> Kai-Wei Chang: So, for instance [indiscernible], you do the parameter, you do the [indiscernible]. So you can select [indiscernible] update. And in that situation, you also may be choose the same variable several times during your update. And you can become [indiscernible]. So he's just saying here you just consider a [indiscernible] for a variable [indiscernible] to select a variable from this part of data so the convergence can occur. That is the big idea. If you want to see the details, you can see our paper. >>: From shrinking, the idea is you end up in the shrink set. have the shrink set, you go on ->> Kai-Wei Chang: Yeah. >>: So do you have something similar for your cache? you continuously load the second part as well? >> Kai-Wei Chang: And you once From what I understood, Yeah, yeah. >>: Do you stop doing that at some point in time and just work on the cache, or do you continue, on a reiteration, you will load some block from disk on to the second part and then yule update your cache and go on with it? >> Kai-Wei Chang: >>: Yes, it's the second one, yes. So it's not exactly [indiscernible]. >> Kai-Wei Chang: Yeah. So the thing is that so, actually, at first time, I wanted to do shrinking in this scenario. Then it's very hard to do that. And I come out this [indiscernible], basically it's doing the shrinking on another 32 way. So shrinking is basically remove the data from consideration. add the data into consideration. Yeah. Here, we >>: So I had a question about one of the graphs that you had back earlier showing like the progress of the SVM and it looks like based on if you were to look at time, the distance between your optimal solution is monotonically decreasing. Is that true? And if so, can you explain why it monotonically decreases? Because if you have like what they were talking about earlier with -- this slide. It looks like it's monotonically decreases. >> Kai-Wei Chang: >>: Uh-huh. What if you select bad examples in your cache at first? >> Kai-Wei Chang: So here, I'm showing the dual [indiscernible] function value. So this is proven that the result will monotonically decreasing. Because at each time, if you consider that yeah, usually we're solving the problem and the problem is the minimal, the problem. So the T equal to TL is one solution. And we find the meaning of this sub-problem so you always reduce a dual [indiscernible] function. So in the case that if you use [indiscernible] example in the cache, then if it's selected, then the alpha will become a fixed number. So you will not affect the opportunity function any more after a few iteration. >> Ming-Wei Chang: Thank you.