>> Ofer Dekel: Great. Without further ado I'd like to introduce the first speaker, Carlos Guestrin, and he'll be talking about machine learning for big data in the cloud. >> Carlos Guestrin: Thank you. What a thrill to be here. There are lots of really cool faces here, people working on machine learning. This is really exciting, and exciting to also be in the Seattle area. And I'm going to be talking about work that is really by my students, Yucheng Low and Joey Gonzalez, who are here, you should stand up, Aapo Kyrola, Jay Gu, our post-doc Danny Bickson, and other, newer people from my group: Swelen, a new post-doc, Tyler, a new graduate student, and a visiting student from Sony. So you can hang out with them; they've been working on related things, too. And the work we've been doing is on GraphLab, a system for large-scale machine learning on big data. And the question of big data is pretty obvious these days; we're all familiar with it. And my issue with it is well characterized by an article in the New York Times from a couple of months ago, where they said that data has become an economic asset like currency or gold. And I don't know if I believe that is true. I think it is only true if we don't just store data but get some value out of it, if we do some interesting processing. And I think this is where machine learning comes in. And so part of the reason we're here is to do machine learning on big data and extract some really deep understanding. And we have to do that in the context of new computer architectures. And people like me who have been programming machine learning for a while, or, if I'm honest, really my graduate students, are not so familiar with the ins and outs of programming these large-scale parallel systems, and we end up spending a lot of time on things like race conditions, distributing state, race conditions, communication, race conditions, things that I don't really know much about. And we end up with code that nobody can use and nobody can maintain. And because of that, in our field people have been moving to higher-level abstractions for programming parallel machines. And MapReduce is one of the standard abstractions, where Hadoop is a public-domain, or open-source, implementation, and these are good for basically independent subproblems. So, for example, if I have a large set of images and I want to extract faces from these images, I can independently send them to each processor. So if I have a large number of independent subproblems, MapReduce is quite good. And you can step back and say, okay, machine learning things like feature extraction are well done with this kind of abstraction. But the question that we posed a few years ago is whether there's more to machine learning: what are the types of problems that don't break up so easily into a set of independent subproblems? And if I step back and think about machine learning, I think the power of machine learning is about discovering dependencies in data and discovering structure. And graphs characterize a lot of this structure in machine learning, and they're not well suited for MapReduce, in our opinion. So let me give a couple of very high-level examples of that. Let's say that I have a large database of personal images, and I label this image and this face as being my grandmother, and I want to propagate this information throughout this database of images.
And what can happen is that there might be other images where my grandmother appears, but the similarity between this pair of images might not be strong enough to make a good prediction. What graph-type algorithms can do is connect faces across different images that are similar, and also exploit the fact that people who co-occur in an image with other people tend to be the same set of folks; families co-occur together. If I somehow propagate this information over the graph, using graphical models or other techniques, I can make better predictions. This is an example of a graph of dependencies where it's hard to think about how to break the problem into independent subproblems. Let me give another example that we're very familiar with. This is collaborative filtering. Let's say I wanted to make movie recommendations, and I have this user whose favorite movie is Women on the Verge of a Nervous Breakdown, followed by The Celebration, City of God, and Wild Strawberries, and you can step back and ask, what other movies do I recommend to this user? It might be hard to think about what to do. But there are other, similar users who have liked the same movies and maybe liked another related movie, La Dolce Vita, so you might say, let me recommend La Dolce Vita to this user. And this idea of exploiting the graph of dependencies, maybe by doing matrix factorization or other techniques, is another way we can extract value from this data. Again, this is true for many other machine learning models: if you do topic modeling, for example LDA-type models, then the graph you're dealing with is documents and words, that is, the words that appear in each document, and you're using this graph to discover topics. So, again, stepping back and thinking about the pipeline of machine learning, we might have a lot of data; maybe these are images, maybe these are documents. Typically we do some initial preprocessing, we extract some features like faces, and we form a graph over them based on some dependencies, and that gets fed to some kind of structured machine learning algorithm like graphical models, LDA, matrix factorization and so on, and that gives us what I think about as the value of the data. If I think about this pipeline, the first two steps, the first two blue boxes, what we call graph ingress, are well done with data-parallel methods; things like Hadoop and MapReduce are quite good at doing that kind of thing. But the purple box with structured machine learning is where it's hard, and this is where we'll be focusing our efforts. So if we step back and think about what more there is to machine learning that does not fit into the Hadoop model so well, here's what I think about these problems that I've been mentioning, things like supervised learning, collaborative filtering, graph analysis and so on: we call these graph-parallel algorithms. So let me give you a running example that we're going to use throughout the talk of a graph-parallel algorithm. This is the standard PageRank example, which has been applied to many domains, but let's say we think about it in the context of a social network, where I want to figure out who to advertise to, and that depends on the rank of that user, the social rank. So I ask, what is the rank of this user? It depends on the rank of the users who follow her. And what are their ranks? They depend on the ranks of the users who follow them. You can imagine this is a loopy graph, and you have to somehow iterate this idea to convergence: you say something like, my rank is the weighted average of my neighbors' ranks, and I iterate that formula until I reach convergence.
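A rough sketch of that iteration, in plain Python rather than GraphLab, with a toy follower graph and the usual PageRank weighting (one over each follower's out-degree) standing in for the weighted average; everything here is made up for illustration:

    # Illustrative rank iteration on a tiny made-up "who follows whom" graph.
    # Each user's rank is repeatedly recomputed from the ranks of the users
    # who follow them, until the ranks stop changing.
    follows = [("bob", "alice"), ("carol", "alice"), ("carol", "bob"), ("alice", "carol")]

    users = {u for edge in follows for u in edge}
    in_neighbors = {u: [] for u in users}   # who follows u
    out_degree = {u: 0 for u in users}      # how many people u follows
    for follower, followee in follows:
        in_neighbors[followee].append(follower)
        out_degree[follower] += 1

    damping, tolerance = 0.85, 1e-6
    rank = {u: 1.0 for u in users}

    while True:
        new_rank = {u: (1 - damping) + damping *
                       sum(rank[f] / out_degree[f] for f in in_neighbors[u])
                    for u in users}
        change = max(abs(new_rank[u] - rank[u]) for u in users)
        rank = new_rank
        if change < tolerance:              # iterate until convergence
            break

    print(rank)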
To fit this into the model, you have a graph of dependencies, like the social network; you have some local function you perform on it, like my rank is the weighted average of my neighbors' ranks; and you iterate that until you reach some convergence point. It turns out that, from a theoretical and practical perspective, MapReduce is not a good abstraction for this problem. And this begs the need for what we call graph-parallel abstractions, and that was the impetus for the start of the GraphLab project in our group. And the idea, or the dream, for us was: we come in, we know how to solve a problem in Matlab on one machine, somehow I'm going to give it to GraphLab and put it on, say, Amazon's EC2 cloud, and I get efficient parallel predictions. That's our goal; that's where we started. Now, let me just give a quick overview of GraphLab 1, the first one we started with. You start with a graph, for example a social network graph, where vertices have some data, like users' profiles, and edges have some information, like the similarity between users, and you want to perform some computation on this graph. How do we compute on graphs? How do you think about that? We do it by thinking like a vertex. And how does a vertex think? Well, a vertex only gets to see itself and its neighbors. If I'm the red vertex here, I can only read or modify data on neighboring edges and neighboring vertices. So let's go back to the PageRank example. I get to read the current ranks of my neighbors, I get to update my rank as the weighted average of my neighbors' ranks, and if my rank has changed sufficiently, I get to tell my neighbors: you should redo your computation. That's the dynamics. GraphLab is about writing simple programs like this, and they get automatically parallelized for you. So they get pushed to the cloud, and all these other issues you'd be worried about get addressed for you automatically. So let me give you an example of those issues. I mentioned race conditions in the beginning. Let's say I'm executing these red vertices in parallel, and their updates modify data in the neighboring vertices; if at the same time the blue vertices get executed and the scopes overlap, they might modify the same data at the same time, and you don't know what happens then. And when we started this project, people told me, oh, you're doing machine learning, these are all statistical methods, you should totally ignore this, you should just go and hope for the best. What you're telling me is that if you have no consistency, you can have higher throughput, that is, more updates per second. One way to think about this, by the way: if you have a big family all sitting at the dinner table, and everybody talks and nobody listens to each other, you have more throughput, more people talking per second, but you might not really understand each other. The same thing happens to many machine learning algorithms. So even though you can do more updates per second, you might have possibly slower convergence or even really bad behavior. For example, this graph shows Netflix data; on the X axis I'm showing you the updates as they happen over time, the Y axis is training error, and this is just eight cores, so there's not really a lot of potential conflict here, a small dinner table. If you have inconsistent updates, this is the kind of behavior you get: it doesn't converge, it oscillates, it can be quite problematic. If you guarantee consistent updates, this is the kind of behavior you get: you quickly converge to the right answer.
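One simple way to picture how consistent updates can be enforced, purely as an illustrative sketch and not how GraphLab's engine is actually implemented, is to have each worker lock a vertex's whole scope, the vertex plus its neighbors, in a fixed global order before running the update:

    import threading

    # Illustrative scope locking on a made-up graph: before an update runs on a
    # vertex, the worker locks that vertex and its neighbors, so two overlapping
    # scopes cannot modify the same data at the same time.
    neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    data = {v: 1.0 for v in neighbors}
    locks = {v: threading.Lock() for v in neighbors}

    def run_update(v, update_fn):
        scope = sorted([v] + neighbors[v])      # fixed global order avoids deadlock
        for u in scope:
            locks[u].acquire()
        try:
            update_fn(v)                        # safe: overlapping scopes are serialized
        finally:
            for u in reversed(scope):
                locks[u].release()

    def example_update(v):
        # e.g. set my value to the average of my neighbors' values
        data[v] = sum(data[u] for u in neighbors[v]) / len(neighbors[v])

    threads = [threading.Thread(target=run_update, args=(v, example_update))
               for v in neighbors]
    for t in threads: t.start()
    for t in threads: t.join()
    print(data)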
And the nice thing is that if you're doing standard parallel programming, you have to deal with all these race condition issues yourself, but if you use GraphLab you don't: GraphLab automatically takes care of it in a user-tunable way, and you don't have to worry about these issues. So we've been working on this abstraction where you provide the graph and the update function, you get to choose the consistency model, and it gets automatically parallelized for you. And at that point, this was about a year ago, a lot of algorithms had been implemented on top of GraphLab, and it had been picked up in industry quite a bit. We were feeling quite good about ourselves. Tom Mitchell's group out of CMU wanted to solve a very large NLP problem using an algorithm called CoEM. They couldn't run it on one machine, so they used Hadoop, and with Hadoop they could solve the problem in about seven hours. They were feeling good, at least they could solve the problem, but then they tried out GraphLab, and with 32 machines we could solve the problem in 80 seconds. That's about 0.3 percent of the Hadoop time. So with this kind of result, we thought, okay, we're doing well, we're feeling good, let's try something bigger. Let's try a problem where there were no published running time results. And this is a problem of about seven billion edges, the AltaVista Web graph from 2002. And when we tried that, GraphLab failed miserably. It didn't work. And we had to step back and try to understand: why did it not work? Why is it giving us bad performance? And the reason it was giving us bad performance is that this is what's called a natural graph. Most graph abstractions, like GraphLab 1, Pregel and others, have assumed kind of idealized graphs, for example graphs that have low vertex degrees, which are easy to partition across machines. Natural graphs are not like that: they have many vertices with very high degrees, and they're very, very hard to partition. To give you a sense of this, if you look at the Web graph that I mentioned, the top one percent of the vertices touch about 50 percent of the edges. And this is not just a problem on the Web graph. If you think about a social network, you might have a popular person connected to many others in the social network; in movie recommendations you can have movies that lots of people watched; in machine learning you have these things called hyperparameters that are connected to potentially every variable in the model; and if you do text analysis you can have a very popular word that appears in many, many documents. And high-degree vertices can be problematic. For example, if I try to partition them across machines, then I end up cutting a lot of edges, and the amount of communication you have to do is linear in the number of edges that you cut. So this can be very bad. In fact, for natural graphs, even if you could solve the NP-hard cutting problem, the cuts that you get would not be cheap. So even if you could solve that problem, you'd still have a ton of communication. And so understanding this issue of natural graphs led us to design GraphLab 2, where we introduce a new type of partitioning for our data. We take these high-degree vertices, or vertices in general, and we split them across machines. And this type of partitioning is a natural consequence of the new abstraction that we're using. From the perspective of the user, you still program, still think, like a vertex; you're still programming like on the left, but it gets executed in a new kind of distributed way, like on the right. In just two slides I'll give you a sense of how things are different here. If we step back and look at things like PageRank and other machine learning problems, often when you're writing an update function it can be split into phases: first I gather information about my neighbors, for example their ranks; then I change something about myself, like taking the weighted average of my neighbors' ranks; and then I go and tell my neighbors something, for example, you should go and redo your computation. This pattern, where I gather information about my neighbors, I change something about myself, and I go back and tell my neighbors something, is a pretty general abstraction for what's happening in machine learning. So we define this GAS decomposition, where you first gather information about your neighbors in a data-parallel way, then you change something about yourself in the apply phase, and then you scatter information, in a data-parallel sense, to your neighbors. So this is our new abstraction. I'm going through it kind of quickly, but let me just tell you that a lot of the machine learning problems that we want to solve fall into this model: things like inference in graphical models, collaborative filtering, matrix factorization, clustering, and LDA can all be expressed in this way.
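To make the three phases concrete, here is a hedged sketch of the same toy PageRank-style update phrased in gather/apply/scatter form; the function names and the little scheduler are invented for this example and are not GraphLab's actual API:

    # Illustrative gather/apply/scatter phrasing of the toy rank update from
    # before. Only the three-phase shape mirrors the abstraction; the names,
    # graph, and scheduler are made up for this sketch.
    in_neighbors = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": ["alice"]}
    out_degree = {"alice": 1, "bob": 1, "carol": 2}
    damping, tolerance = 0.85, 1e-6
    rank = {v: 1.0 for v in in_neighbors}

    def gather(v):
        # Gather: combine information from my neighbors (summed contributions).
        return sum(rank[u] / out_degree[u] for u in in_neighbors[v])

    def apply(v, total):
        # Apply: change something about myself using the gathered result.
        old, rank[v] = rank[v], (1 - damping) + damping * total
        return abs(rank[v] - old)

    def scatter(v, change, schedule):
        # Scatter: tell my neighbors to redo their computation, but only
        # if my rank moved by more than the tolerance.
        if change > tolerance:
            schedule.update(u for u, ins in in_neighbors.items() if v in ins)

    schedule = set(in_neighbors)            # start with every vertex scheduled
    while schedule:
        v = schedule.pop()
        scatter(v, apply(v, gather(v)), schedule)

    print(rank)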
And since we're taking high-degree vertices, or any vertices, and splitting them across machines, the communication problem is now linear in the number of machines a vertex lives on. This is called the vertex cut problem. And percolation theory suggests that it's possible to get low-cost vertex cuts in natural graphs. So unlike edge cuts, with vertex cuts it's possible to do this well, and GraphLab 2 implements a number of online algorithms, with some theoretical guarantees, for computing these types of cuts as you read the graph in. So this was a fast tour of GraphLab, our system, and of the GraphLab abstractions. Let me assure you that GraphLab is an actual system you can use, built on top of standard infrastructure for the cloud, like HDFS. It provides a bunch of functionality that you don't have to worry about, because you can program in terms of the abstraction; all the stuff that goes underneath it can be totally ignored. Or, for those interested, you can use one of the toolkits that we already provide on top of it, for things like graph analysis, clustering, matrix factorization, and so on. So let me just give you a couple of examples of performance. One of the analysis problems that people do on large social networks is counting the number of triangles, where a triangle is a three-way relationship that indicates that a person is part of a strong community. And if you look at the Twitter graph from 2010, there are about 38 billion triangles to be counted.
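As a toy illustration of the computation itself, nothing like the distributed implementation: counting triangles amounts to intersecting the neighbor sets of the two endpoints of every edge (the small graph below is made up):

    # Toy triangle counting by neighbor-set intersection: for each edge (u, v),
    # every common neighbor of u and v closes a triangle. Each triangle is seen
    # once per edge, so the raw count is divided by 3.
    edges = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]

    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    count = sum(len(adjacency[u] & adjacency[v]) for u, v in edges)
    print(count // 3)   # 2 triangles here: (0, 1, 2) and (1, 2, 3)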
Last year there was a paper that used Hadoop for this problem, and that's the current state of the art for that system: with about a thousand machines, it took them 400 minutes to solve this problem on the Twitter 2010 graph. With GraphLab 2, on 64 machines, it takes us a minute and a half. And now we can ask ourselves, is this because Hadoop is not implemented as well as GraphLab? That might be part of the issue, but the main issue here is that MapReduce is just the wrong abstraction for this problem: you end up with too much communication because of the issues that I mentioned earlier. Now, this is about going faster, a lot faster, than Hadoop. But that might not be the only measure of productivity, especially if you're in industry. Another measure of productivity is programmer time, or thinking time. So let me give you an example of that. Let's say I want to do LDA on Wikipedia, the whole of Wikipedia. This is the type of thing that companies like Yahoo! are very interested in, because they want to use it for recommendations of content. And Alex Smola at Yahoo! built a very cool system for doing very large-scale LDA. With about 100 machines they can process 150 million tokens per second, which is very impressive. With GraphLab 2, with 64 machines, we can process about 100 million tokens per second. It's pretty comparable. The difference is that the Yahoo! system was built specifically for that task and took a long time to build. For the GraphLab version, Joey spent about four hours and wrote 200 lines of code. So this is the productivity difference. And finally, I mentioned this AltaVista graph that was [inaudible] in the beginning; now we can run it on something like 64 machines on Amazon, over a thousand cores, four terabytes of RAM, and do a whole iteration over the graph in seven seconds. So that's a billion links processed per second, and there were only 30 lines of code written here. So at this point, around this time, my student [inaudible] walks into my office and says: buy me a Mac mini. And I thought he wanted to watch TV. But he said, no, I want to show you that we can solve Web-scale problems on a small machine. And so, there's always a genesis story for the names of things that may or may not be true. A lot of cloud-based systems have animal mascots; Hadoop has an elephant, and GraphLab has the Labrador dog. Aapo said, I want to build the graph chihuahua, or GraphChi, and by exploiting hard drives or SSDs he can solve very large problems. The challenge of using hard drives or SSDs is random accesses: if your data lives randomly on the disk and you have to read it from different parts of the disk, you end up spending a lot of time on reading and writing, on I/O. And what he has is a new parallel sliding windows method that minimizes the number of random accesses. I won't have time to get into it, but as a teaser of the results: if you go back to the triangle counting problem, where Hadoop on a thousand machines took 400 minutes and GraphLab on 64 machines took a minute and a half, with GraphChi on just a Mac mini he can solve this problem in 59 minutes. And GraphChi is a first endeavor on the next step for the GraphLab project, which is dealing with streaming data. So rather than a batch setting with all the data available in one place, we can deal with data arriving and modifying the graph over time. For example, on a Mac mini he can have 100,000 graph updates happening per second at the same time as he computes 200,000 vertex computations per second. So the GraphLab project is about providing a novel system and a novel abstraction for programming large-scale machines, and it's really meant for some of the challenges that we face in machine learning. The project is available under the Apache 2 license, and at GraphLab.org you can find releases of both GraphLab and GraphChi. Thank you. [applause] >> Ofer Dekel: So we have time for a couple of brief questions.
>>: Very interesting. The GAS abstraction seems sort of unsatisfying because the gather and scatter seem redundant. Are there classes of problems for which you can just get rid of one or the other and do a random stochastic selection of vertices? >> Carlos Guestrin: There are like three or four questions in that first question, so let me just try to separate them. The last question was, can you have a stochastic selection over vertices, do you have to touch every neighbor every time? And the answer is clearly no. The way that we've looked at it is through some caching schemes: you can just look at which vertices have changed. But you could also put in some randomization, which we haven't, but it's easily possible. The other question is whether gather is redundant with scatter. It turns out that for some algorithms you want to broadcast information to your neighbors, and that's what scatter does, and for some algorithms you want to aggregate information about your neighbors, and that's what gather does. So you get to choose; you can say you want one or the other, but we've noticed both patterns were common. Any other questions? Yes. >>: You mentioned algorithms that run nicely under this framework; are you running into any algorithms that you would like to fit nicely but don't? Of the ones you talked about, which are the most problematic, or what don't you support that you would like to? >> Carlos Guestrin: With the gather-apply-scatter model? >>: With the model in general. >> Carlos Guestrin: So I talked about the basic version of GraphLab. We've added some functionality to deal with things that don't fit so well in the abstraction. For example, if you want to keep track of a convergence rate or a global gradient, you need to do some kind of global aggregation on top of this, or have shared parameters across machines, for example for parameter sharing. That doesn't fit so neatly in the abstraction, so we have extra functionality on top to do that. As for the gather-apply-scatter model, it turns out you can write the original GraphLab 1 abstraction in this model with some loss, where in the gather you just keep accumulating state from all your neighbors. So it is as representative as the earlier one, but perhaps slightly less efficient. And we have some examples where it would be bad, and I'm happy to go through them with you. >>: Thanks. >> Carlos Guestrin: I see a little pressure on the side here. So --