>> Dengyong Zhou: I'm very happy to have Eric Xing as our invited speaker today. Eric is a faculty member in the CMU computer science department, and his research spans a wide range of machine learning, from fundamental machine learning research to computational biology and even [indiscernible] large scale [indiscernible] learning.

>> Eric Xing: All right. I want to begin by thanking Denny for inviting me here. I'm very glad to share with you some of my recent work on system and algorithmic foundations for big learning. It's a fancy topic, but hopefully you don't get too disappointed after just seeing the title. I've been working on distributed systems and large scale learning only very recently, and honestly, this is almost all for a selfish reason: in my group, I've spent many, many years developing different machine learning models for many specific applications, to the point that I was embarrassed that I myself couldn't scale up my own algorithms. Especially when I did my sabbatical at Facebook for a year, I failed to deliver on my promise to boost their revenue by a certain percentage. So I went home and decided I should really catch up on this line of research. This is a big project, supported and joined by many of my colleagues and students, whose pictures I'm going to show you at the end of the talk.

So here is an opening statement. Everybody is talking about large data, and what do we want to get out of large data? In many contexts, people want to make good predictions based on large data, and sometimes one can argue that if your goal is only prediction, you may not need fancy models. Some very simple models, such as a regression, or maybe a black box model like a deep network, can probably do the job. But in scientific research and in many other domains, people actually want to get information out of data, in addition to making predictions, and this picture reflects that kind of goal. It is a picture from more than ten years ago, when people sequenced a human genome for the first time: they wanted to piece together the puzzle in the genome to find out the regulatory programs and other biological stories behind it. And I just want to [indiscernible] that kind of objective: if you are in a forest, you may not know a tree is falling unless you see it, because knowledge that is not perceived by a person just sits in the data, and you cannot really make use of it. Therefore, I think part of the goal in all of big learning is not only to make use of big data to deliver certain specific goals, but also to look into the data and find additional information in it. And for that, I personally feel that machine learning is a promising tool to provide that kind of capability. Ideally, we want machine learning to behave like this: you have an automatic system which can consume any sort of data in large quantity, and at the end of the day it can spin out predictions, hypotheses, and all those kinds of things. But unfortunately, I don't think this very rosy view is happening at this point, for many, many reasons. Personally, I ran into the following problems. First of all, everybody is talking about the same problem.
We have too large an amount of data, to the point that we really don't have an effective system to deal with it, especially when we want to use some fancy analytical tools. Personally, I'm very interested in looking into what is inside the Facebook network. The network is very big. Nowadays it has about a billion nodes, plus all the other meta information and other semantic information in the network, and that creates a big computational challenge. That challenge is well known. Many people are talking about data being so big that we need big storage systems and big computing systems to deal with it. But recently, people have also started to realize another challenge, the so-called big model problem. I put here a model that is known to everyone in this room. This is a very big deep neural network, and it really does provide a very generic and powerful capability to understand a lot of complicated information, especially latent semantic information in data. But these models have been built very big to make them really powerful. For example, the recent Google Brain network has about 1 billion nodes and 1 billion parameters, and that kind of model really cannot be put into the memory of a single machine. Therefore, you also need to deal with some very complicated issues in making distributed computing and parallel inference possible. A third challenge, which I also heard about recently through a Microsoft-involved workshop [indiscernible], is the so-called huge cognitive space or huge task space, where, even for a simple task like classification, what if [indiscernible] there are a million or maybe even a billion possible labels in the classification problem, on top of a big data set? That kind of problem was really unheard of in the machine learning community many years ago, because we were very used to studying binary classification using SVMs, or regression, or maybe a more [indiscernible] classification, and it turns out that kind of machinery may not be so effective in solving problems at this scale. So I want to inspire my group to hopefully find a good solution that can be used to address not only one but many of these problems. And this is a very, very difficult goal to accomplish, because what we see now is mostly like this picture, right? We have a lot of hardware development, with different fancy machine configurations. And on top of that, we have a lot of different machine learning models. Typically, the working style is that if a researcher or a group identifies a problem and decides to use one model, they are going to build a team and do a special purpose development or implementation of that model, especially when they want to make it very big. And then you have this one line connecting the system, the model and the problem. And if you have many problems, you probably run into this kind of thing, which is sometimes doable in large companies like here or at Google, but it is a very expensive investment. It is also basically pushing university researchers out of this [indiscernible], because we cannot afford to put together a team like that and spend money like that to build this kind of solution.
So we asked the following question: is there a more elegant, maybe more economical way of addressing the challenges of big data, big models and big tasks on the available hardware, using a kind of thin-waist style solution, where we have this bottleneck that provides general purpose algorithmic and system building blocks, so that you can, with a very modest amount of effort, customize it to the specific problem you want to solve? The talk I'm going to present today is about our initial effort and some results toward this goal. There are some existing solutions already directed toward this goal, and here I just want to provide a very brief context and show you where we stand in this bigger picture. So this is the bigger picture as I see it so far. In machine learning, many machine learning researchers are already aware of the big data, big learning challenge, and their solution is typically of this style: they push for smarter algorithmic development, to the point that they develop very elegant algorithms which have the correct [indiscernible] guarantee of convergence and which can efficiently make use of an ideal parallel system to spread the computing tasks out into a distributed environment. And then you have all these papers talking about great reductions in computational cost and some well designed experiments. But if you read the papers very carefully, you will see two typical patterns that I find very interesting. One is in the algorithm or in the [indiscernible], where they identify a key step, such as an update of a gradient or maybe a message passing step, which they claim to be parallelizable, and then they parallelize it based on not necessarily a real system but, very often, an imaginary system which incurs no communication cost, no loss from asynchrony, and so on, and then they say that if that holds, my algorithm will converge at a provable speed. And the experiment is sometimes a simulated one: they basically simulate a parallel system sequentially and show you a very good answer. But if you put the whole thing into a real system, chances are you either need to reimplement the whole thing or maybe the whole algorithm isn't really working. So that's basically the algorithmic side. On the system side, there is also a great movement pushing for parallel and distributed computing. And there, the focus is really to push up the [indiscernible] throughput as best as you can by pushing the computing onto different machines, with maybe limited communication cost, or maybe even a no-cost, asynchronous algorithm, and then hope for the best. Usually, such papers do not provide a theoretical guarantee on the behavior of the algorithm, but sometimes, in the empirical experiments, they will show you good answers. That's pretty much where we stand in the current research paradigm, and there are some limited efforts in the middle which show you a system that provides some guarantee, or an algorithm that is mostly [indiscernible] on a system's configuration.
So that's basically where the current research effort is directed, and in my opinion, we don't yet have a good solution that can offer a more generic or even universal answer to the large scale, big learning challenges I mentioned above. So here I'm going to present some recent work from my group, in collaboration with some friends, on developing a new framework for the big learning problems I talked about. We call the system Petuum, for a reason we can discuss offline, but it contains the following building blocks. It has a big machine learning system architecture which supports data parallelism, model parallelism, and fault tolerance, and handles many other intricate issues that you have to deal with in running a big distributed system. On top of that, we provide a programming model which allows users at different levels, researchers, practitioners or experts, to make use of the parallel design with, hopefully, a good reward in reduced computational cost. And we also have toolboxes and APIs supplying ready-made implementations of major models, such as deep networks or topic models and things like that. What is behind this system is not a monolithic structure in the system architecture, or a single [indiscernible], or just a single combination of tools. In fact, we've been trying to make use of a lot of resources at CMU, in particular in my group, to push this research from multiple angles. For example, we do put a lot of effort into pushing the limits in designing smarter algorithms, good models and good representations to enable as much scalability as we can, even on a single machine. Going beyond that, we migrate such elements into a real parallel system and actually plunge into the dirty work, doing some plumbing to make the system better adapted to the kind of algorithms it is supposed to carry. And finally, we also have theoreticians in the group doing analysis, demonstrating that such development may actually provide you a good guarantee on consistency and convergence. So there are a lot of small pieces of work existing in different places in this whole landscape, but I'm going to organize this talk around three main aspects. One is about how we do big data parallelism, using an example problem that we've been working on, analyzing a large network. Then I'm going to talk about how we deal with model parallelism, decomposing big models across different machines. And if I have time, I'm going to talk a little bit about task parallelism as well. So let me begin with the first one. The problem really originates from my own terrible experience at Facebook, where I was asked to analyze the whole social network that they have in their system, and here is the problem. There are many ways of doing network analysis, as you know. In fact, many of the analyses are more scientific ones, say, I want to tell a story about how connected a network is and what the degree characteristics of the network are. That is interesting, but those are not problems that Facebook should care about, right? They actually care more about actionable knowledge on networks, such as personal interest inference or trend prediction, things like that. And specifically, this might be a problem they are interested in.
What if I want to turn every person into a pie chart, in which every slice indicates the probability that he is interested in certain things, or maybe reflects a certain social activity? For example, this guy can have a multi-faceted social trace, such as his profession being a doctor; he has a family; and he also has some regular playmates to entertain himself at other times. This is a typical new way of doing social community inference, in which we don't simply group every person into one cluster. Instead, we want to estimate a so-called mixed membership for every individual in the social network. And obviously, this is an interesting query, because it allows you to make many interesting predictions like this.

So the way we begin is, let me first talk about the model before I talk about the system, because the model itself can be a very unique one. First of all, if you think about the size of the problem, on a network with a billion nodes, we probably need to start from scratch and think about how we actually formulate the problem in the first place, because the [indiscernible] matrix that carries the whole network is no longer [indiscernible] to be stored or even read through. It has a billion to the power of two entries.

>>: It's very sparse?

>> Eric Xing: It's very sparse, but there are still very many entries, and many of them are zero. A typical model for this kind of data has to model all the zeroes as well. We ask whether we need to do that or not. So we choose to use a different representation of the features, which is called the triangular feature, and which has a long history in the scientific literature of being more reflective of the behavior of every social individual. And here is how it looks. Suppose that now, for every node, instead of just counting how many one or zero edges it is involved in, we actually count how many times it is involved in different types of triangles [indiscernible]. For example, this node is in one full triangle, it is in another full triangle, it is in a half triangle, and so on and so forth. So this is just another way of characterizing social network features in a big network, and it also reflects the mixed membership involvement of every individual in the network. I'm not going to argue how good it is; there are plenty of papers and empirical studies showing the benefit of that. I'm going to focus on the computational issues of solving a model built on this representation. So the model we employ is called a mixed membership triangular model, which is very much like a topic model, but in this case topics are sampled for every individual to generate a triangle that three people are involved in. Okay. So here, every individual has their pie chart, which is a topic vector, a mixed membership vector. Then, based on who he is going to interact with in the triangle, he instantiates a particular social role. Say he wants to be a professor; this one is also a professor; that one is a student; then two professors plus one student may like to form a triangle, things like that. So it's a typical probabilistic program that people are aware of. But here we have some [indiscernible] inference problems. The main inference problems are twofold. Here is the data we actually observe, which is the triangular sufficient statistics in the network.
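A minimal sketch, in plain Python/NumPy, of the generative story just described. The symmetric Dirichlet prior alpha, the role-compatibility tensor B, and the Bernoulli choice of triangle type are simplifying assumptions for illustration, not the exact MMTM specification from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5          # number of social roles (the slices of each pie chart)
N = 1000       # number of individuals
alpha = 0.1    # assumed Dirichlet concentration for the mixed memberships

# Each person gets a "pie chart": a mixed membership vector over K roles.
theta = rng.dirichlet(alpha * np.ones(K), size=N)

# B[a, b, c] = probability that a triple whose members take roles (a, b, c)
# forms a full triangle; an assumed stand-in for role-compatibility parameters.
B = rng.beta(1.0, 5.0, size=(K, K, K))

def generate_triangle(i, j, k):
    """For a candidate triple (i, j, k), each person instantiates a role from
    their own pie chart, and the roles of the triple decide whether a full
    triangle is generated."""
    a = rng.choice(K, p=theta[i])
    b = rng.choice(K, p=theta[j])
    c = rng.choice(K, p=theta[k])
    is_full_triangle = rng.random() < B[a, b, c]
    return (a, b, c), is_full_triangle
```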
But we need to infer, for every individual, a pie chart, which is the mixed membership vector, and also this matrix, which captures the social interaction between different communities or different social roles when they are put into a triplet. Typically, on the algorithmic side, there are roughly two main lines of algorithmic direction. One is based on Monte Carlo, Markov Chain Monte Carlo, and the other is based on Stochastic Variational Inference. Here I use the topic model as an example to illustrate the typical operations. In each of these algorithmic prototypes, you basically visit every hidden random variable cyclically, and whenever you visit one, you draw a sample based on a conditional distribution defined on the Markov blanket of that node, which gives you the conditional probability of this particular hidden variable taking a certain configuration or not. So in the MCMC case, I toss a die and draw a sample. In the SVI case, I compute a deterministic approximation to that posterior probability. But the computation can be very heavy, because this whole step has to happen iteratively on every hidden variable, and over multiple iterations until convergence.

>>: So for this type of [indiscernible] top down model, how scalable can it go?

>> Eric Xing: Yeah, that's what I'm going to show. In fact, in the [indiscernible], they are hardly scalable. If you look at a typical MCMC or SVI paper for a network, they can go to about a few hundred [indiscernible] nodes. And that's exactly why we are interested in addressing this problem. So there are multiple ways of scaling up. If you look at, for example, papers from, say, David Blei's group, he is very interested in scalable versions of LDA, and his approach is to make the algorithm smarter and smarter. For example, you will see ideas such as subsampling and stochastic estimation. And, I think, we also contribute some new ideas in parsimonious modeling to reduce the size of the model. So here we have a number of steps pushing up the scalability on a single machine, in a shared memory environment. If you are interested, I can go into the details of each of these algorithmic ideas, but the good thing is that it has been well established in the community that these are convergent and give reliable results. And we have some even newer ideas recently to make this acceleration even more aggressive, by reducing the variance of the stochastic updates, by controlling the learning rates automatically, and so on and so forth. So all these ideas are quite valuable, and they push up the performance quite a lot. For example, here I want to show you that just by doing delta subsampling, meaning that I don't need to look at all of my one billion minus one neighbors when generating a [indiscernible], I can subsample a neighborhood depending on how much degree I have in my neighborhood. And with different degrees of subsampling, I can turn the inference complexity from quadratic down to linear, and down to linear with a very small slope, to the point that we can actually deal with a pretty sizable network here.
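A rough illustration of the neighborhood subsampling idea just described, written as a single Gibbs-style role update. The specific subsampling rule (a power of the node's degree) and the local_conditional placeholder are assumptions for illustration only, not the exact rule used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_role(i, neighbors, theta, delta, local_conditional):
    """Update node i's role using only a subsample of its neighborhood.

    Instead of touching all deg(i) neighbors (which makes a full sweep
    close to quadratic in dense regions), draw a subsample whose size grows
    sublinearly with the degree, so a sweep over all nodes stays near linear.
    `delta` controls how aggressive the subsampling is."""
    deg = len(neighbors)
    m = max(1, int(deg ** delta))              # e.g. delta = 0.5 -> sqrt(deg)
    sub = rng.choice(neighbors, size=min(m, deg), replace=False)

    # Conditional over roles given only the subsampled neighbors;
    # local_conditional is assumed to return an unnormalized K-vector.
    p = local_conditional(i, sub, theta)
    p = p / p.sum()
    return rng.choice(len(p), p=p)
```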
That network is the Stanford web graph, which has roughly a quarter million nodes, and we can pretty much finish the whole job in 18 hours for five social roles. This is academically quite admirable, because it's the first time that such a network could even be processed under a serious Bayesian graphical model. But, of course

>>: So in that case, the [indiscernible] in the graph correspond to what entity?

>> Eric Xing: Oh, here? Here is, all right, the visualization of the mixed membership. This is a five dimensional social simplex, okay.

>>: Does it normally come out from the earlier model?

>> Eric Xing: What I'm doing here is actually plotting this data, plotting the pie charts. Okay. Every pie chart corresponds to a point in here, which is a five dimensional coordinate, and the size corresponds to the degree of that node. So here you have a visualization of the community, of the mixed membership communities in a bigger social network. You can see some absolute communities and some mixed communities in the middle, and so on and so forth. But with the Stochastic Variational Inference idea, we can further speed it up, and I can give some impressive numbers here. For example, for that Stanford web network, which is a quarter million nodes, we can now finish the whole run in about ten minutes. That's actually a quite impressive number. But, of course, this ten minutes is for only five social roles, which is a little bit unrealistic; a quarter million people with only five roles is probably too few. So how about increasing it to five hundred roles? We can actually do that in six hours, which is still quite good. All right. So this basically

>>: Operational method better or

>> Eric Xing: [indiscernible] so the previous one was MCMC with data subsampling, and here it is Stochastic Variational Inference. So up to here, I think we've really been reaching the limit of algorithmic acceleration. We worked very hard on pushing the algorithms to be smarter and smarter. But at this point we hit a wall. Basically, the network is still bigger than what we can handle. For example, I just showed a network of a quarter million nodes, but what if I want to handle, say, a hundred million, which is still only a fraction of the Facebook network? And even more challenging, what if I want the number of roles to be more realistic? Because for a billion people, or even a hundred million people, it's easy to imagine they have ten thousand different roles or even more. And with a very simple calculation, you can figure out that even the memory needed to store such information takes a few terabytes of RAM, which is not feasible for a single machine. And if you run any of the algorithms that we used before, it would take a few years to finish the whole thing. So even a scholarly sound and, publication-wise, very elegant idea really cannot deliver the practical industrial solution that the companies really need. That pushed us into a corner, and therefore we had to start thinking about how to do the dirty part, the parallel inference part. It is actually not quite trivial to scale this algorithm to a parallel setting. Here is why. Suppose we want to do a very direct, data-distributed sampling. We just put the data onto different machines and do the sampling on each client separately.
And usually, when we write our code or do our analysis, we will say, okay, we can replace that update step with a parallel update where I wait for the local inference information to be collected from the different machines. But in reality, this step doesn't happen in such a nice way, because you have all sorts of problems in the machines, in the system, which prevent you from getting all the messages back at the same time.

>>: Now here, the T is the data

>> Eric Xing: Which?

>>: The T variable.

>> Eric Xing: No, this is the iteration count. I'm actually using an extremely simple [indiscernible] model which suggests that this parallelization can happen automatically, no matter how I do it, basically ideal parallelization. So the whole point is that if you write it this way, all the computational complexity is swept under the rug. You don't see it, but it actually happens and bothers you a lot in a real system.

>>: I see. The basic concept of a stochastic algorithm, you don't really have the iteration concept? It's a [indiscernible]. So what [indiscernible].

>> Eric Xing: Yeah, okay. A stochastic inference algorithm, by nature, would be even harder to parallelize, because the whole correctness guarantee is built on a sequential execution of the stochastic updates, right. Here, I want to parallelize. I haven't really said what these steps are. Just imagine there is a step that is parallelizable already; I'm saying that even in that setting, where the algorithm is proven parallelizable, you still need to worry about how the parallelization can happen, because the step itself can [indiscernible] inference

>>: So [indiscernible] neural network [indiscernible].

>> Eric Xing: Yeah, yeah. You will see. Actually, I will have neural network stories later, in the second part. So to come back to this: because this communication happens over the network, you have to set up a barrier, and then you need to cache the [indiscernible] and wait for the synchronization, and that's actually quite non-trivial. So here is a very simple illustration of how serious the problem can be. As we know, the network can have very little bandwidth; therefore, you cannot pass many messages in a short time. And secondly, all these different machines may perform unevenly, because you are not the owner of the whole machine. Other processes are going on in there, and variation, [indiscernible] uniformity, and other things can also affect the [indiscernible] performance. And at the end of the day, here is what we actually observed. We ran LDA on a modest cluster, in an ideal environment where we only ran this one program, no other programs were running, and we owned the whole system, and we still spent a good 80 percent of the time communicating; only the remaining fraction was actual computing. That's a big waste. And if we allow the program to run that way, the kind of task I talked about, the billion node network inference, is not going to be solved in a short amount of time.

>>: So in that example, is everything converging on a single sort of master machine that correlates [indiscernible].

>> Eric Xing: Yes, you have to. For machine learning algorithms, if you use a [indiscernible] server idea or any other idea, you have to have a collection step, right, to gather all the sufficient statistics.

>>: So all that waiting time is on the [indiscernible].
>> Eric Xing: Yeah, yeah. You have a question there?

>>: Were you using [indiscernible] or is it every node?

>> Eric Xing: Oh, okay. That's another layer of clustering. Mini-batching is a secondary layer, where on each particular client, when you compute the update on the sample on that machine, you can ask whether to do mini-batching or use the whole thing, right. Here, I'm abstracting that away, basically. I'm talking about generic algorithmic behavior, regardless of whether you use mini-batches or not.

>>: This timing will be affected by [indiscernible].

>> Eric Xing: I'm not doing mini-batching here yet. Using mini-batching would be even worse, because with mini-batching, your computing time for each iteration will be reduced, but you need more iterations. In fact, the time spent on communication can be even worse.

>>: [indiscernible].

>> Eric Xing: Yeah, so the idea of mini-batching is that for every update, I compute the update on a smaller subset of the data. Therefore, I can compute a gradient even faster, but I need more iterations to reach convergence, right. And more iterations mean more time spent on communication. So I'm talking about the ideal case, where you can do a perfect gradient at every step, at a bounded cost. But even then, your time for communication can be very significant.

>>: So this applies to the [indiscernible] learning for [indiscernible].

>> Eric Xing: It applies to, yeah. The statement here isn't actually about a particular algorithm. It's a generic [indiscernible], yeah.

>>: The basic question is that for [indiscernible] learning, by default, it's batch learning. If you look at your description, [indiscernible] never had anything [indiscernible].

>> Eric Xing: That's a heuristic arrangement. If you want to prove correctness of the [indiscernible], for instance, you start with a non-batching idea. And if you use batches, you need to further prove that it is [indiscernible], which is not always true, in fact. So here, I'm always talking about a convergent, correctly behaving algorithm instead of a heuristic. A heuristic can do anything. For example, you can do [indiscernible] parallel LDA, just parallelize it, and it will still run. How correct it is, I don't know. So here, I first want to be sure it is going to be correct, and then we ask what cost you need to pay. And here is the cost you need to pay: to make things correct, you need to communicate often enough to make sure every update is consistent, and that means you need tens of thousands of communications every second in the parallel distributed process. And this is impossible for a system like Hadoop, because the iteration cost is just too high; it would totally overwhelm the network. So we ask, can you do something better to reduce the communication, or maybe rebalance the time spent on communication versus computing? That brings us to the idea in the first part of the talk, which is a better synchronization procedure for parallel computing. Nowadays, there are two major classes of synchronization schemes in parallel computing. One is bulk synchronous update. Synchronization, by the way, is [indiscernible] if you care about the correctness of the algorithm. Otherwise, you can just go async.
So if you do sync, then you need to set up a barrier, and ideally, hopefully, before the barrier, every process finishes at about the same time, then everything at the barrier gets collected, and then you enter the next update. But this ideal case never happens, because of the many unpredictable, unexpected defects in a networked, clustered system. What you actually see is here: different processes reach the end at different time points, and then in the collection step, you need to do varying amounts of computing because of this random delay coming from different places. Therefore, a lot of the space that I pointed out here, the white and red regions, is just time wasted waiting on communication. It is not doing any useful computing. And this fraction can be very big, to the point that it takes 80 percent of the time, even in an ideal setting. The other solution to parallelism is to ignore synchronization altogether. You go totally asynchronous, and then, hopefully, in the ideal case, the different processes will not be out of sync by too much. Therefore, you may still get, at any point, a correct kind of update up to some bounded error. But this is again a very dangerous assumption, because in the extreme case the amount of asynchrony can be very large, to the point that some threads may be several iterations ahead of others. Therefore, their updates cannot be correctly integrated with the other updates. Because what if one gradient tells you to go that way, and after a few steps it tells you to go this way? Then if you average them, which way will you go? You cannot get a correct direction. So these are the typical artifacts [indiscernible] face in the current [indiscernible] solutions to big learning problems. And we want to explore a middle ground: can we reduce the amount of communication in BSP but still get the speed of an async implementation with the correctness of BSP? That's the question we want to ask. And that leads to the work we presented at NIPS this year, which is called Stale Synchronous Parallelism, in which we set up a clock for every process, for every thread, which allows it to monitor how much it is ahead of or behind the other threads in the parallel environment. And we want to enforce the following update rule. First of all, there is a parameter server, so that every thread can independently update the parameter server, which leads to a learning step in the parallel machine learning. On the other hand, when a thread needs to go to the next step, it can choose to read the information from the parameter server, which is the global solution, or read the information from its local cache; every thread has a local cache holding a local version of the parameter values. And it reads from the local cache if it is not too far ahead of the other processes, meaning that my current version is not much different from the global version, therefore I can read locally. That saves some communication. But if I'm too far away from the others, then I need to stop, update myself from the server, and wait. That creates a very interesting behavior, which allows the slower threads to do a lot of updates without reading too often from the global server, while the very fast ones will stop themselves at some point.
Or, if they keep going, every time they update, they have to read from the global server, okay. So with that kind of balance, you can see this one will spend more time on computing and less time on communication, and that one will spend more time on communication and less time on computing. And eventually, all these different threads will reach the goal at roughly the same time. And here is a rough [indiscernible] interface of the SSP parameter server. Every machine has its own connection to this centralized place through a rather simple read and write interface, which is not much different from a single machine interface, which means that you don't have to rewrite your program. You can pretty much use the same program, and the detailed implementation is taken care of in our low level infrastructure. And as a result, you can see the system does provide a very flexible way of controlling the trade-off between very active synchronizing communication and a more economical [indiscernible] synchronization where I don't go to the network very often. The amount of staleness basically specifies the number of iteration cycles by which the threads are allowed to deviate from each other. You can see that as the staleness number becomes bigger, the amount of communication goes down, and that gives you a better trade-off between communication and computing. And this idea can be applied not just to a single model such as LDA; in fact, it applies to a wide variety of machine learning models, such as the ones I listed here: topic models, matrix factorization, regression. As long as you have a setup where you need to maintain a globally shared parameter to be estimated from the data and [indiscernible], you can always distribute the data onto different machines, put the parameters into the SSP table, and then do the updates in this asynchronous fashion.

>>: [indiscernible].

>> Eric Xing: This is a data parallel model at this point, and we assume, of course, that the model itself can be stored in this SSP table, although I haven't really been specific about how this can be [indiscernible]. In fact, this can also be a parallel system, which increases the power of its storage. So here I have some performance evidence for this system. We tried a carefully chosen spectrum of different models. LDA stands for a typical probabilistic graphical model inference problem. LASSO stands for a typical convex optimization kind of problem. And matrix factorization is by itself a very popular operation in machine learning. You can see that in each case, this one, the Stale Synchronous Parallelism, gives you the best convergence behavior, and it is not only fast but also converges to a better spot in the given amount of time. Faster and more accurate. And the thing people ask about and care about most is how scalable it really is. You can show an interesting curve on a small system, but what if I have an even bigger problem and I need more machines?
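A minimal sketch of the bounded-staleness read-and-update rule described above, with a toy single-process stand-in for the central parameter server. The class and method names (ToyServer, SSPWorker, pull, push, min_clock) are illustrative assumptions, not the actual Petuum/SSP interface.

```python
import threading
from collections import defaultdict

class ToyServer:
    """Toy, single-process stand-in for the central parameter server."""

    def __init__(self, n_workers):
        self.values = defaultdict(float)   # shared parameters
        self.clocks = [0] * n_workers      # last reported clock per worker
        self.lock = threading.Lock()

    def min_clock(self):                   # clock of the slowest worker
        return min(self.clocks)

    def pull(self):                        # fresh copy of all parameters
        with self.lock:
            return dict(self.values)

    def push(self, worker_id, updates, clock):
        with self.lock:
            for key, delta in updates.items():
                self.values[key] += delta
            self.clocks[worker_id] = clock


class SSPWorker:
    """One worker thread under stale synchronous parallelism (sketch).

    The worker computes against a local cached copy of the parameters for as
    long as its own clock is at most `staleness` iterations ahead of the
    slowest worker; once it gets too far ahead, it refreshes from (and, in a
    real deployment, blocks on) the central server."""

    def __init__(self, worker_id, server, staleness):
        self.worker_id = worker_id
        self.server = server
        self.staleness = staleness     # S: allowed clock gap between workers
        self.clock = 0
        self.cache = server.pull()     # local copy of the parameters
        self.pending = {}              # buffered updates not yet pushed

    def read(self, key):
        # Fast path: keep using the local cache while within the bound.
        if self.clock - self.server.min_clock() > self.staleness:
            # Too far ahead of the slowest worker: in a real system this is
            # where the thread would wait; here we simply refresh the cache.
            self.cache = self.server.pull()
        return self.cache.get(key, 0.0)

    def update(self, key, delta):
        self.pending[key] = self.pending.get(key, 0.0) + delta

    def advance_clock(self):
        # End of one local iteration: push buffered updates, move the clock.
        self.clock += 1
        self.server.push(self.worker_id, self.pending, self.clock)
        self.pending = {}
```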
We did an experiment here which shows that, at least up to the limit of 32 machines with about 300 cores, we still have a pretty nice near-linear scale-up, in which every time you double the machines, you get about 78 percent additional scale-up, which is pretty decent behavior in scalability. And lastly, I want to emphasize that this SSP server doesn't just heuristically divide the data and do distributed updates. It actually has a convergence guarantee that the resulting estimate is going to converge to the true result if there is one. And the intuition is very simple. Take stochastic gradient descent as an example: ideally, if you don't have any parallelism, you get an accurate gradient, and therefore you move in the [indiscernible] direction. But now we have SSP, which allows different threads to deviate from each other by a certain number of iterations, and that translates into a small amount of error due to inconsistency between iterations. On the other hand, our staleness is bounded; therefore, the amount of error is also bounded. So every time you may deviate away from the optimal direction a little bit, but these are random deviations, and cumulatively they will still lead to convergence.

>>: So you scale the number of machines, this grows longer, the slow machines, right? Is there some point at which the convergence goes slower because you've added too many machines and the staleness is too long?

>> Eric Xing: The staleness number is a constant that you can set, which does not have to depend on the number of machines.

>>: Really? Because then wouldn't the fraction of network time start adding up to be more than

>> Eric Xing: No. If you have more machines, the amount of communication will be longer. But the staleness iteration number isn't actually controlling the amount of time you need to spend. I just check how many iterations I'm away from the others. So with more machines, I think it is, how should I say? If you have more machines with a constant staleness, you are eventually going to waste more time on communication. Therefore, the effective progress you make will be slower.

>>: Do you know how many machines [indiscernible].

>> Eric Xing: So here, actually, we have a theory which shows you the relationship. That's actually the beauty of the [indiscernible]. It basically says that your updates will converge to the true one at this rate. You have the F and L, which are the typical quantities used to prove things as a function of the data behavior, right: the Lipschitz constant, the smoothness, and the [indiscernible] of the data, things like that. But here we have two other things: S is the staleness and P is the [indiscernible] parallelism. So of course, more P makes the bound bigger and the quality lower. But you also have a T there, and if you increase it, you can erase that kind of effect.

>>: T is time?

>> Eric Xing: T is the number of iterations. Iterations, yeah. And again, the intuition, as I said, is that you have this bounded staleness, which allows you to quantify the amount of inconsistency, the error introduced through the asynchronous updates. And that carries through the analysis.

>>: If I look at the bound, like I had a certain number of machines, then is that P?

>> Eric Xing: Yeah, it's P.
Well, it's P, the number of threads.

>>: Right, right. But [indiscernible] number of machines. But now set S to zero, maybe. Then I would get the best convergence rate, right? The smallest upper bound happens when S is small. So why would I ever set, according to this theory, S to any number

>> Eric Xing: This is the bound.

>>: Yeah, yeah. The bigger, the worse. So I want to minimize the bound, to set S to zero.

>> Eric Xing: Well, you know, this bound is more of a qualitative guidance on the behavior rather than, say, an [indiscernible] SVM bound; you don't actually tune things to achieve the [indiscernible] bound. It gives you the guarantee that a bound exists. But pushing the bound down or up isn't quite meaningful here, because you don't actually know how loose the bound is. There are some other constants in here.

>>: [indiscernible] number of iterations, not wall clock time. You set S to zero, it's the number of iterations. If you set S to zero, each of those iterations is going to take a while.

>> Eric Xing: Yeah.

>>: [indiscernible] so can we get a bound, for equal risk, where the parameters that we're interested in are the risk and the time? We want somehow to tie them together, in the sense of: if I have [indiscernible] and I want this amount of risk and I'm willing to take [indiscernible] up to that point, what would be the best configuration? Does it scale with the time [indiscernible] to zero with the number of machines, or is there some optimum? How do all these numbers come in?

>> Eric Xing: Well, it does [indiscernible] in this equation if you keep one thing constant. For example, if you raise the number of machines, you are going to get the bound pushed up in a [indiscernible] way, because there are more machines. If you don't change the number of iterations and don't change the staleness bound, it's going to create more inconsistency; therefore, the bound is worse. That's the flavor in which you want to look at all of these proofs.

>>: How does the [indiscernible] time change with respect to the number of machines?

>>: Yeah. It's almost like, if you use

>>: [indiscernible] distribution or something, maybe you could start to come up with some kind of relationship between the two, for a really large collection of machines or something.

>>: Right.

>> Eric Xing: So I'm not sure I answered the question. Quite a few questions came in to me, so which one should I

>>: I think we're talking about just [indiscernible] rates based on the number of iterations and the number of machines, but not talking about the time span.

>> Eric Xing: Not yet.

>>: But that is the more important factor.

>> Eric Xing: Yeah, that has to happen. First of all, this analysis is the first of its kind, but it's a typical convention in convergence [indiscernible] to worry about iterations, because wall clock time is very hard to characterize; it depends on your configuration, your implementation, right. So here we're talking in a more abstract way.

>>: So there is another factor here. If each [indiscernible] computation is very, very fast, then the communication cost compared to that computation can be very large. In that case, when you increase the number of machines, you probably don't gain a lot, right, especially if you want to enforce some staleness.

>> Eric Xing: No, not necessarily. The computation isn't necessarily that fast. You know, for a

>>: [indiscernible] use a smaller mini-batch size or use a GPU.
>> Eric Xing: That's a different idea. You are now already starting to mix in a few different ideas that we haven't investigated. Once you put in mini-batches, the correctness was not established yet. That falls into the heuristic region; you can do that, but I don't think an analysis can be easily generated.

>>: If you have a very small task and you want to use many machines, what [indiscernible].

>> Eric Xing: A small task?

>>: Do you see a time increase or a decrease?

>> Eric Xing: That's a good question. I don't actually have a clear answer at this point. Here, I'm basically on the more pragmatic side. I'm not playing a game where I build a big system and run some small task on it; that becomes more of a gaming style investigation. I'm saying that you really have to do this in a large company. You have so much data already available, and it leaves you something to do, and I'm going to guarantee you that it is not terribly [indiscernible]. That's the mentality of this analysis, okay. So by nature, the theory, like in some of the [indiscernible], is there to give you the trend but not the actual [indiscernible] of, say, compute time to convergence. That never happens, right. So I don't know how to answer, because [indiscernible] not knowing. I'm just giving you the trend.

>>: Just speaking empirically, you've been going, it looks like, maybe up to 32 nodes or something, and it looked like you were still getting some nontrivial speedup when you add a node, like 5.6, something like that. Have you encountered situations where, essentially, and I think this was the spirit of the original question, hey, it could even get slower?

>> Eric Xing: Oh, yeah, that's definitely true. That's definitely true. For example, if I did it right now, I could almost predict the result: if I put 10,000 machines here and run the same thing, it is going to be slower.

>>: Okay.

>> Eric Xing: That's for sure, because there are other kinds of communication complexity that need to be resolved. For example, if the switch isn't big enough to take all the messages at the same time, what do you do? You have to wait. And that kind of constraint is not accounted for here yet. So here we still assume, for example, that the messages are not competing, not blocking, not clogging, that kind of thing. But I'm saying that if you have the capacity to support nice communication behavior, then this is what we expect. In a real case, I'm sure you cannot infinitely increase the number of machines. Yeah.

>>: Actually, we have some analysis on [indiscernible] machines, in which case even [indiscernible] is not fast enough.

>> Eric Xing: Yeah. A GPU is a very special kind of configuration, right. It has a lot of cores, but for everything you do [indiscernible] you need to have a big connection. That's why [indiscernible] together on the thing. If you have distributed GPUs, I'm not even sure whether there are good implementations for distributed GPUs yet, because that communication is not easily realizable across different machines. So, yeah, I haven't studied that kind of configuration. Here it's more like a classical, traditional CPU based cluster configuration. If you bear with me, I have a lot more stories to tell. I want to maybe pass over this, but tell the other stories that you may also find interesting.
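For reference, the flavor of the bound being discussed is, in our reading, roughly the following for SGD-style updates under SSP, with constants omitted. Here $F$ and $L$ are the problem-dependent quantities mentioned above (a bound on the objective/iterate region and the Lipschitz constant), $S$ the staleness, $P$ the number of parallel threads, $T$ the number of iterations, $\tilde{x}_t$ the stale iterates and $x^{*}$ the optimum; this is a paraphrase of the published SSP analysis, and the exact constants and symbols are assumptions here:

\[
\frac{1}{T}\sum_{t=1}^{T} f_t(\tilde{x}_t) \;-\; f(x^{*}) \;\le\; O\!\left( F\,L\,\sqrt{\frac{(S+1)\,P}{T}} \right)
\]

so increasing $P$ or $S$ loosens the bound, while increasing $T$ tightens it, which matches the verbal discussion above.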
But yeah, on the theory side, I can also show some [indiscernible] that not only is the process converging, but the variance is also bounded in a sense, so you have [indiscernible] better quality in the convergence. But these are just some add-on qualifications. And finally, again, I'm doing this really for real-life computation and real tasks. Here I want to show you a result we achieved by using this system; otherwise, we could not have [indiscernible] that. So on that network, if you remember, we have a really [indiscernible] network under the MMTM model. We tried to run our system in competition with the best system known so far; that's always something we want to try. And here is a network which contains four million nodes. That's perhaps the largest network that can be run by any other system, and that implementation, by David Blei's group, took 24 hours; in our case, we took three hours to finish the whole MMSB analysis. And on another network, now with 40 million nodes, which is really not trivial, we can finish the computation in 14 hours, and the other algorithm isn't deployable because it crashed the machine. Okay. So that's the data parallel part.

>>: So in those cases, were the programs [indiscernible] changes were showing.

>> Eric Xing: For our program?

>>: The version that you ran on your architecture, did you have to rewrite the algorithm?

>> Eric Xing: Our algorithm was [indiscernible].

>>: So if I already have [indiscernible], do I need to make any changes to use your framework?

>> Eric Xing: Okay, good question. You pretty much don't need to make a lot of changes. This is not like GraphLab or Spark, where you need to turn it into a vertex program. We are actually using native programming, almost like Matlab. At least that's the goal. So far, I don't know whether your program will run on ours, but at least the way we program our models is no different from how we program other systems, other than changing two lines of code into this parameter server code. And again, we haven't really built the full [indiscernible] interface yet, but that's the goal of our system, at least.

>>: It kind of feels like you're writing code for one machine.

>> Eric Xing: For one machine, yeah, exactly, yes.

>>: Okay.

>> Eric Xing: Yeah, when you write the code, you don't feel the low level parallelism that you would otherwise have to take care of, okay. And again, I can just repeat here the one take-home message, which is that in this case, the design of the low level system is not blind to the characteristics of the machine learning algorithm itself. We actively make use of the machine learning algorithm's properties, in particular the iterative convergence property, so that it can be consistent and resistant to the presence of small errors. That is the key spirit of our design, because much of current distributed computing puts a lot of emphasis on serializability and sequentialization, which is sometimes unnecessary and incurs a big cost without much gain. All right. So let me move on to the next part, because I'm running out of time, which is about model parallelism. Again, it's a very familiar problem. Here I have some examples motivating this kind of problem, but you guys have plenty more such problems, in which you really need to solve, say, a convex [indiscernible] problem in which the size of the parameter is very big. You have, say, billions of parameters.
And how do you actually make this [indiscernible]? In this case, you may even have small data. You can argue, I don't really have [indiscernible] the problem, but unfortunately, the model is very big. You can still have a need for parallel computing. And here, our approach is again divided into two road maps. One is algorithmic innovation, really pushing for a very fast sequential algorithm as much as we can. Again, this is not the main focus of this talk, but I'll just give you some names here for typical convex optimization problems, such as using an ADMM type of idea to decouple the overlapping coupling between different random variables so that they can support running simple [indiscernible] algorithms. For non-convex losses, you can make them smooth and differentiable using a smoothing proximal [indiscernible] type of approach. And for constraints, which exist in large numbers, you can theoretically organize those constraints and then use a systematic thresholding approach to resolve every consistency constraint. And again, each of these ideas is [indiscernible] a little bit to make them more and more efficient on a single machine, until you reach the regime where your model cannot be stored on a single machine. So here I want to share with you another idea, which is not published yet. It's called structure-aware parallelization of big models. How does it connect to existing work? You have heard about the shotgun algorithm [indiscernible]. The key idea there is: I have a high dimensional regression problem, and I want to make the updates parallel across dimensions. I'm going to distribute different dimensions onto different machines. And then the proof says that if the different dimensions are not strongly correlated, it will converge. But if they do correlate, then there is no guarantee. The truth is that in many social media and genetic applications, you are almost guaranteed to have highly correlated dimensions across the different [indiscernible] dimensions. Therefore, you don't actually see the shotgun-style algorithm converge very easily on very high dimensional problems. So this is where we want to position our work. We propose an approach which does parallelization based on knowledge of the structure. And the knowledge of the structure can be dynamic, because the structure can change during the execution of the optimization. So here is the idea. We have this system called STRADS, standing for structure aware dynamic scheduler. It constantly examines any emergent structures in the high dimensional space. Structure is defined in a generic way: it could be correlations between different dimensions, it could be co-updates across different dimensions, or any other kind of behavior that ties multiple dimensions together. And then once you discover this

>>: Is the model here just regression?

>> Eric Xing: In this case, it is regression, but it could also be neural network models. Essentially, it focuses on the behavior of the coefficients.

>>: The raw data?

>> Eric Xing: No, it is not the raw data. It is the coefficients, which are the current estimates of the parameters. So, yeah, that's another difference.
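A rough sketch of one scheduling round of the idea just described, assuming a hypothetical coupling(i, j) score computed from the current coefficient estimates (for example, from recent co-updates). The grouping rule and names here are illustrative assumptions, not the actual STRADS implementation.

```python
from itertools import combinations

def schedule(active_dims, coupling, n_workers, threshold=0.1):
    """One round of structure-aware scheduling, sketched.

    `active_dims`: dimensions whose coefficients are currently non-zero
    (in practice the scheduler would also subsample these).
    `coupling(i, j)`: hypothetical score of how strongly two dimensions
    interfere with each other; the real criterion may differ.

    Strongly coupled dimensions are grouped together so they are updated
    sequentially on one worker; weakly coupled groups go to different
    workers and can safely be updated in parallel."""
    # Union-find grouping of strongly coupled dimensions.
    parent = {d: d for d in active_dims}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]   # path halving
            d = parent[d]
        return d

    for i, j in combinations(active_dims, 2):
        if coupling(i, j) > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for d in active_dims:
        groups.setdefault(find(d), []).append(d)

    # Round-robin the decorrelated groups across workers; this assignment
    # is recomputed every iteration (or every few) as the structure changes.
    assignment = [[] for _ in range(n_workers)]
    for g_idx, dims in enumerate(groups.values()):
        assignment[g_idx % n_workers].extend(dims)
    return assignment
```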
We don't prescan the data and discover structure beforehand, because that is not possible if you have really big data. We want to do a bootstrap kind of thing: you start out with nothing, you start to have some estimates, you start to examine whether there is structure, you distribute the dimensions accordingly, and you also resolve any conflicting structures dynamically. So as a result, you have this dynamically created clustering of coefficients, and they get distributed to different workers. Within each worker, you have highly correlated ones which should be [indiscernible] together, and across different workers you have decorrelated ones that can be updated in parallel. And this structure is adjusted either every iteration or every couple of iterations to make sure that you have the best load balance across the whole process. Just to show you the behavior of this on LASSO, you can see that even for a very high dimensional problem, even with modest parallelization, shotgun does not really work well. For the numbers here, we tried running shotgun on two machines. It is still kind of converging, but slower than our dynamic scheduler. But if you run it on four machines, which means a greater degree of parallelization, we couldn't observe convergence at all; the line just flies away. And this graph is, hopefully, a stronger illustration, where we actually simulated the data in such a way that shotgun can at least still converge; the dimensions are not too strongly correlated, okay, but you can still make use of the [indiscernible] structure to hopefully inspire a better distribution of the tasks and better parallelism. And you can see that with our STRADS scheduler, once you increase the number of cores to promote a greater degree of parallelism, we actually see a very interesting increase of the convergence rate with the increase of the number of cores. In particular, I found this phenomenon very interesting. If you look at the shotgun curves, or the curves with lower parallelism, they are kind of smooth. They seem to have a constant convergence path to follow. But once you increase the amount of parallelism, you actually see that some curves originally converge at a particular rate and then suddenly drop down to a different rate and then converge. Why is that happening? I suspect it's happening because of the dynamic scheduling: you have a distribution, but you're not committal, and the next time, once the structure changes, you actually redistribute the tasks, and then they jump to a different convergence path, which is hopefully even better. Therefore, you can very quickly find the optimal convergence path and then reach convergence very rapidly.

>>: So what does this redistribute, which ones?

>> Eric Xing: Redistribution means that you have the dimensions: you put ten dimensions here, another hundred dimensions there. You are grouping different dimensions to allow them to be updated within a machine in a correct fashion, right.

>>: So [indiscernible] you have to actually somehow get these dimensions to the machine. More data.

>> Eric Xing: Yes.

>>: How do you handle that?

>> Eric Xing: There is a central scheduler to do the task distribution.

>>: [indiscernible] my dimension, then wouldn't this traffic of getting the dimensions to the nodes be time consuming?

>> Eric Xing: Yeah. That's true.
But imagine that if you do a shotgun type of random distribution, it is the same, because they also need to collect information back from every dimension. So this added [indiscernible] isn't substantial traffic.

>>: Can you give an example of the kind of dataset to which you think this kind of [indiscernible]?

>> Eric Xing: Here I'm showing LASSO, which basically [indiscernible]. So imagine that in every iteration, I have ten machines, okay, and I have a million dimensions. Basically I make an assignment in which ten percent of the dimensions goes to this machine and ten percent to that machine.

>>: In every iteration, you redistribute the data to a different machine?

>> Eric Xing: Yes. It is a parameter. There is no distribution of the data in this case.

>>: Only potentially if

>>: I'm missing something here.

>> Eric Xing: So yeah, what are we talking about?

>>: You mentioned one million. So if I'm a worker and I'm assigned to work on certain coordinates, I need to have the data for these coordinates.

>> Eric Xing: Yes, yes, good question. Let me tell you the secret. The truth is, if you look at the Google [indiscernible] project, okay, they have a vast partition of the big network into multiple pieces, and they keep the data [indiscernible] with different machines. Therefore, they all have the same data, okay. That's basically the setting here. Now, if my data is really big and I need to do a further partition, then we actually are going to [indiscernible] the data on that worker machine as well. Therefore, the update will use only the partial data. So joint data [indiscernible] and joint data-model [indiscernible] is one more step beyond this. We actually have a result on that as well, which I'm going to talk about if I have time, but that requires

>>: Google [indiscernible]?

>> Eric Xing: No, I think they replicate the data. If they do both, then you can always do a heuristic thing and do both, but you lose the guarantee, [indiscernible] or not.

>>: The truth is that can you generate [indiscernible]?

>> Eric Xing: Actually, the good news is that there are some guarantees even if you do that. We actually proved new results for that.

>>: So in this case, you're talking about a million parameters on ten machines, which means each machine has about a hundred thousand parameters.

>> Eric Xing: Something like that.

>>: So at every update, every time you do a check, you have to send a hundred thousand parameters each to the central server, and it has to do this massive comparison to find the correlation between them?

>> Eric Xing: Very good question. Now we go into the details. Remember, this is a sparse regression problem, and what if you only care about the non-zero ones? There are actually not too many. The full set of dimensions is there for you to check, but when you actually want to send the workload, it's not too many, in fact.

>>: Can you compute the correlations correctly from only the non-zero entries? Okay, yeah.

>> Eric Xing: The scheduler actually has the whole picture, so it has no problem. In fact, what it does is subsample the coefficients and do the comparison on that.

>>: Are these [indiscernible] kind of indicative of when it had to start redistributing and sort of found the right

>> Eric Xing: Not really. We actually have a [indiscernible] which I didn't show. It gives you a picture of the trade-off between the computing time and the communication time. They're roughly the same, okay.
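Purely to illustrate the communication argument being made here (this is not code from the system, and the function names are invented), the sketch below sends only the non-zero coefficients to the central scheduler, so the message size tracks the sparsity of the estimate rather than the nominal million-dimensional size.

    import numpy as np

    def pack_sparse(beta, tol=1e-12):
        """Pack only the non-zero coefficients as (index, value) pairs for sending."""
        idx = np.flatnonzero(np.abs(beta) > tol)
        return idx, beta[idx]

    def unpack_sparse(idx, values, dim):
        """Rebuild the dense coefficient vector on the receiving side."""
        beta = np.zeros(dim)
        beta[idx] = values
        return beta

    # Example: a million-dimensional LASSO estimate with ~1,000 non-zeros.
    dim = 1_000_000
    rng = np.random.default_rng(0)
    beta = np.zeros(dim)
    beta[rng.choice(dim, size=1000, replace=False)] = rng.normal(size=1000)

    idx, values = pack_sparse(beta)
    print(idx.size, "coefficients sent instead of", dim)   # 1000 instead of 1000000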
Because the computing is [indiscernible] efficient, all of these are [indiscernible]. You don't actually wait sequentially, right. You are doing the discovery of the dynamic structure while the computing is happening on the client side. So these times are not actually sequential.

>>: Okay.

>> Eric Xing: Yeah.

>>: One more question. When you decide you need to redistribute across machines, is it up to the algorithm how to recombine the parameters?

>> Eric Xing: It is not up to the algorithms. The algorithms don't see that. It is inside the scheduler. You can, of course, override the scheduler's default policy and tell it how to distribute. That's actually an API we provide. But the scheduler itself has its own simplest way of doing that. The simplest way is [indiscernible] just run and distribute. That's basically the [indiscernible] implementation right now.

>>: Okay.

>> Eric Xing: Yeah. But we just want to sell this concept: the active search for dynamic structure is more beneficial than just randomly distributing, and it is also more correct. And this system basically is an implementation that supports that kind of operation, without you doing the actual scheduling.

>>: [indiscernible].

>> Eric Xing: Yeah, I need to run fast. I still have slides left.

>>: You have another meeting.

>> Eric Xing: Let me finish up in about five minutes, okay, and I can skip the third story because that's very simple. So again, without further ado, hopefully you will just trust me that there is also a guarantee on the correctness of this distribution on LASSO for convergence, which is not actually true for the shotgun algorithm. But here is a kind of very [indiscernible] comparison. We ran the experiment on very high dimensional LASSO, to the point that we reached 10 million dimensions. And you can see that we compared shotgun, which is basically the GraphLab implementation, with ours. At ten million dimensions, with different amounts of non-zero inputs, we are actually converging, though a little bit slower, I think, than GraphLab. But the truth is that when we get to even bigger dimensions, we still converge, and the other algorithm is not even converging at that scale. And the graph was meant to show you that this is a real scale problem, [indiscernible] dimensions, which is kind of nontrivial.

We also have a very recent readout on a preliminary implementation of the DNN. And here, I just want to show that we get the expected speed-up by adding more machines, because the model is chunked across different machines. Performance wise, distributed inference is never as correct as [indiscernible] inference, but measured by the predictive error, we are not far from the state-of-the-art deep [indiscernible] paper, and this is a [indiscernible].

>>: So this is a classic you are talking about the [indiscernible] task, right?

>> Eric Xing: Yeah.

>>: But these results are

>> Eric Xing: Oh, this is all the [indiscernible] task. This is the [indiscernible].

>>: But according to the description, it's [indiscernible] classification. Are you using a hidden Markov model?

>> Eric Xing: No, it's not.

>>: You are not using a hidden Markov model?

>> Eric Xing: No, no.

>>: Then it's a classification. And the classification task is easier than the recognition task, for which you could get [indiscernible] results.

>> Eric Xing: Yeah, yeah, so I called [indiscernible] last night.
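The answer above mentions an API through which the scheduler's default policy can be overridden. The real Petuum/STRADS interface is not shown in the talk; the sketch below is only a guess, with invented names, at what such a pluggable policy hook could look like, to make concrete the idea that the algorithm never sees the scheduling but a user can still override it.

    from abc import ABC, abstractmethod
    import random

    class SchedulingPolicy(ABC):
        """Hypothetical plug-in point for a STRADS-like scheduler (not the real API)."""

        @abstractmethod
        def assign(self, active_coords, num_workers, stats):
            """Return `num_workers` lists of coordinate indices.

            `stats` carries whatever the scheduler has observed so far,
            e.g. current coefficient values or correlation estimates.
            """

    class RandomPolicy(SchedulingPolicy):
        """Shotgun-style baseline: distribute coordinates uniformly at random."""

        def assign(self, active_coords, num_workers, stats):
            coords = list(active_coords)
            random.shuffle(coords)
            return [coords[w::num_workers] for w in range(num_workers)]

    class PriorityPolicy(SchedulingPolicy):
        """Example override: schedule recently fast-changing coordinates first."""

        def assign(self, active_coords, num_workers, stats):
            deltas = stats.get("last_update_magnitude", {})
            coords = sorted(active_coords, key=lambda j: -deltas.get(j, 0.0))
            return [coords[w::num_workers] for w in range(num_workers)]

A scheduler built this way would call policy.assign(...) at the start of each round; the default could be the random, shotgun-like policy, and a user could swap in a structure-aware or priority-based one without touching the update code.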
The point here isn't to show [indiscernible]; it's that we get a computational speed-up that is close to linear without losing much. That's all I wanted to show. It's basically that this system hosts a DNN implementation. We're not trying to invent a DNN algorithm or trying to boost performance at this point yet.

>>: But [indiscernible].

>>: Typically, if you want classification results

>>: The question is [indiscernible].

>> Eric Xing: Yeah. Just, yeah. So here it's basically benchmarking the runnability of different models. That's the whole point. I'm not trying to say this is a [indiscernible] result ready to deliver.

>>: Are you using a stochastic gradient algorithm or not?

>> Eric Xing: [indiscernible]. I think it is a stochastic gradient algorithm.

>>: Okay. So in this case, after each mini batch, do you

>> Eric Xing: Can we take this this is a detailed question at a low level. We can take it offline, and let me finish delivering the whole message. So the insight here, again: I'm not trying to say which algorithm and how, that is model specific. I'm trying to say that there is another dimension in machine learning you can exploit to build highly [indiscernible]: there exists [indiscernible] in the big model, and if you know how to use that to distribute the task, you really get a good gain out of it. And this claim does not depend on what algorithm you run or what model you run, okay.

And this third story, which I think I'm going to skip, tells you how to distribute, say, a million classification tasks, or even more, onto a distributed system in a very effective way, which uses a tricky [indiscernible] partition of the task space based on a coding idea. But I guess I've run out of time.

So let me maybe conclude with the following observations. I think there is a lot of opportunity in doing scalable machine learning if you are willing to do algorithmic development and [indiscernible] development together in the same place, and let them cross-benefit and cross-inform each other on how to write the best parallel algorithm and how to build the best system supporting such an algorithm. In particular, our system made very explicit use of this iterative-convergent behavior, in which we pay the cost of introducing error but gain by speeding up the whole iteration, and to a great degree. And secondly, we discover structures in a complex model and then parallelize the components of the model accordingly. By using these ideas, which apply to a wide class of machine learning algorithms, it is very likely that you can get a lot of benefit in distributed computing.

And just to wrap up, I want to return to this big picture. I think we are in the process of building this Petuum system, which represents an attempt at connecting the algorithmic needs and the system resources through a kind of generic and universal solution, with system building blocks and algorithm building blocks. And I think the results are promising. We are aware of a number of other existing groups making the same effort, and we want, of course, to be in the competition. And for that, I'll also show you a very recent result, just to compare on a typical job that people always use to exercise different systems, which is topic model inference, the LDA one.
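The third story about huge label spaces was skipped, so its details are not in the talk. Purely as an illustration of what "partitioning the task space with a coding idea" can mean in general, here is a sketch of classic error-correcting output codes: each class gets a short binary codeword, each bit becomes an independent binary classifier that could be trained on its own machine, and prediction decodes by nearest codeword. This is a standard technique used for illustration only, not necessarily the method referred to in the talk.

    import numpy as np

    def make_codebook(num_classes, code_length, seed=0):
        """One random binary codeword (row) per class."""
        rng = np.random.default_rng(seed)
        return rng.integers(0, 2, size=(num_classes, code_length))

    def decode(bit_predictions, codebook):
        """Map predicted bit vectors to class labels by nearest codeword (Hamming distance)."""
        dist = np.abs(bit_predictions[:, None, :] - codebook[None, :, :]).sum(axis=2)
        return dist.argmin(axis=1)

    # Tiny example: 1,000 classes encoded with 30-bit codewords means training
    # 30 binary classifiers instead of 1,000 one-vs-rest classifiers.
    codebook = make_codebook(num_classes=1000, code_length=30)
    true_labels = np.array([3, 999, 42])
    rng = np.random.default_rng(1)
    noisy_bits = codebook[true_labels] ^ (rng.random((3, 30)) < 0.05)  # simulate bit errors
    print(decode(noisy_bits.astype(int), codebook))  # should recover [3 999 42]

With a million classes and, say, a few hundred code bits, the number of binary sub-problems, and hence the number of machines, stays manageable, which is the flavor of benefit a coding-based partition of the task space is after.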
So we [indiscernible] the model on a pretty substantial dataset with seven billion tokens and 80 million documents, but we want to make the point that we push the [indiscernible] setting out of the idealized, academic kind of environment where you focus your effort on a small number of topics because your memory can only hold that many topics. So here, if you run a hundred topics, this red bar is Petuum and this is GraphLab. The number shows you the throughput in millions of tokens per second. So we are comparable, although we are about 50 percent better. But what you really want out of that many documents is not a hundred topics; you really want a lot more topics. And people don't do that, not because they don't like more topics; it's because if you increase the number of topics, your memory cannot hold the model. In our case, our system can already support inference with ten thousand topics, and we still reach a decent word-processing throughput, while GraphLab cannot even [indiscernible], because that many topics just blows the whole thing up.

>>: [indiscernible].

>> Eric Xing: Oh, the most modern version.

>>: Can it compete with that in terms of [indiscernible] or just a specific [indiscernible]?

>> Eric Xing: What do you mean by [indiscernible]?

>>: [indiscernible].

>> Eric Xing: Yeah, yeah, we just take their tool and run our comparison on top of that. And we even compared with the Google version of LDA, and you can see that we are kind of comparable; at least we can plot the whole thing on the same chart. But we want to emphasize that, yeah, there is a whole team of people building that in a very specialized fashion, and in our case, it's just a regular implementation on the Petuum system, and we still reach a pretty good speed. And here again, we didn't compare at the larger topic count, because for the 10,000-topic case, I don't have any results to compare against.

So that kind of showcases the direction we are driving toward for Petuum. It is to support really large scale, data-intensive and model-heavy kinds of inference tasks, with a reasonable amount of theoretical analysis and [indiscernible] interfaces that people can actually make use of down the road. So with that, I want to close. Sorry for dragging you so long into the talk. It is an effort involving many of my students in the group, whom I circled here, and also collaborators who are experts in operating systems and [indiscernible] languages. I don't want to read their names anymore, just to save time. If you are interested, you can email me or talk to me offline to find out more details.