>> Ofer Dekel: Great. Without further ado I'd like to introduce the
first speaker Carlos Guestrin and he'll be talking about machine
learning for big data in the cloud.
>> Carlos Guestrin: Thank you. What a thrill to be here. There are
lots of really cool faces here, people working on machine learning. This
is really exciting, and it's exciting to also be in the Seattle area.
And I'm going to be talking about work that is really the work of my
students, Yucheng Low and Joey Gonzalez, who are here -- you should
stand up -- Aapo Kyrola, Jay Gu, and post-doc Danny Bickson. And other
new people from my group: Swelen, a new post-doc, Tyler, a new graduate
student, and a visiting researcher from Sony. So you can hang out with
them; they've been working on related things, too.
And the work we've been doing is on GraphLab a system for large-scale
machine learning on big data. And the question of big data is pretty
obvious these days. We're all familiar with it. And my issue with it
is well characterized by this article in the New York Times from a
couple of months ago, where they said that data has become an economic
asset like currency or gold. And I don't know if I believe that is
true. I think it is only true if we don't just store data but if we
get some value out of it, if we do some interesting processing. And I
think this is where machine learning comes in. And so part of the
reason we're here is to do machine learning on big data and extract
some really deep understanding. And we have to do that in the context
of new computer architectures. And people like me who have been
programming machine learning for a while -- or, if I'm honest, really my
graduate students -- are not so familiar with the ins and outs of
programming these large-scale parallel systems, and we end up
spending a lot of time on things like race conditions, distributing
state, race conditions, communication, race conditions, and these are
things that I don't really know much about. And we end up with
code that nobody can use and nobody can maintain.
And because of that in our field people have been moving to higher
level abstractions for programming parallel machines. And MapReduce is
one of the standard abstractions, where Hadoop is a public domain or
open source implementation, and they're good for basically independent
subproblems. So, for example, if we have a large set of images and I
want to extract faces from these images, I can independently send them
to each processor.
So if I have a large number of independent subproblems, MapReduce
is quite good. And you can step back and say, okay, machine learning
things like feature extraction are well done with this kind of
abstraction.
But the question that we posed a few years ago is whether there's more
to machine learning. What are the types of problems that don't break
up so easily into a set of independent subproblems?
And if I step back and think about machine learning, I think the power
of machine learning is about discovering dependencies in data and
discovering structure. And graphs characterize a lot of this structure
in machine learning, and these problems are not well suited for
MapReduce, in our opinion.
So let me give a couple of very high level examples of that. So let's
say that I have a large database of personal images and I label this
image and this face as being my grandmother. And I want to propagate
this information throughout this database of images. And what can
happen is that there might be other images where my grandmother appears
but the similarity between this pair of images might not be strong
enough to make a good prediction. What graph-type algorithms can do is
connect faces across different images that are similar, and also
exploit the fact that the people who co-occur in images with other
people tend to be the same set of folks -- families co-occur together.
If I somehow propagate this information through the graph, using
graphical models or other techniques, I can make better predictions.
This is an example of a graph of dependencies where it's hard to think
about how to break the problem into independent subproblems.
Let me give another example that we're very familiar with. This is
collaborative filtering. So let's say I wanted to make movie
recommendations, and I have this user whose favorite movie is Women on
the Verge of a Nervous Breakdown, followed by The Celebration, City of
God, and Wild Strawberries, and you can step back and ask, what other
movies do I recommend to this user? It might be hard to think about
what to do.
But there are other users who are similar, who have liked the same
movies and maybe liked another related movie, La Dolce Vita, and you
might say, let me recommend La Dolce Vita to this user. And this idea
of exploiting this graph of dependencies, maybe by doing matrix
factorization or other techniques, is another way we can extract value
from this data. Again, this is true for many machine learning models.
If you do topic modeling, for example LDA-type models, then the graph
you're dealing with connects documents and words. These are the words
that appear in each document, and you're using this graph to discover
topics.
So, again, stepping back and thinking about the pipeline of machine
learning, we might have a lot of data. Maybe these are images, maybe
these are documents. And typically we do some initial preprocessing:
we extract some features, like faces, and we form a graph of them based
on some dependencies, and that gets fed to some kind of structured
machine learning algorithm like graphical models, LDA, matrix
factorization, and so on, and that gives us what I think about as the
value of the data.
If I think about this pipeline, the first two steps, the first two blue
boxes, what we call graph ingress, are well done with data-parallel
methods; things like Hadoop and MapReduce are quite good at doing that
kind of thing. But the purple box with structured machine learning is
where it's hard, and this is where we'll be focusing our efforts. So if
we step back and think about what is more to machine learning that does
not fit into the Hadoop model so well, here's how I think about the
problems I've been mentioning, things like supervised learning,
collaborative filtering, graph analysis, and so on. We call these
graph-parallel algorithms. And so let me give you a running example
that we're going to use throughout the talk of a graph-parallel
algorithm.
And this is the standard PageRank example, which has been applied to
many domains, but let's say we think about it in the context of a
social network, where we want to figure out who to advertise to, and
that depends on the rank of that user, the social rank. So I ask, what
is the rank of this user? It depends on the rank of the users who
follow her. And what are their ranks? They depend on the ranks of the
users who follow them. And you can imagine this is a loopy graph, so
you have to somehow iterate this idea until convergence: you say
something like, my rank is the weighted average of my neighbors' ranks,
and I iterate that formula until I reach some fixed point. And to fit
into this model you have a graph of dependencies, like the social
network, you have some local function you perform on it, like my rank
is the weighted average of my neighbors' ranks, and you iterate that
until you reach some convergence point.
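Written as a formula, the update being described is roughly the
standard PageRank iteration (the damping factor d and the edge weights
w_{uv} are the usual conventions, assumed here rather than spelled out
above):

```latex
% Hedged restatement of the "weighted average of my neighbors' ranks" update;
% d and w_{uv} are standard PageRank conventions, not details given in the talk.
R(v) \;\leftarrow\; (1 - d) \;+\; d \sum_{u \in \mathrm{in}(v)} w_{uv}\, R(u)
```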
It turns out that from a theoretical and practical perspective,
MapReduce is not a good abstraction for this problem. And this begs
the need for what we call graph parallel abstractions. And that was
the impetus of the start of the GraphLab project in our group. And the
idea or the dream for us was: we come in, we know how to solve a problem
in Matlab on one machine; somehow I'm going to give it to GraphLab and
put it on, say, Amazon's EC2 cloud, and I get efficient parallel
predictions. That's our goal. That's where we started.
Now, this was -- let me just give you a quick overview of GraphLab 1,
the first one we started with. You start with a graph, for example a
social network graph, where vertices have some data, like users'
profiles, and edges have some information, like the similarity between
users, and you want to perform some computation on this graph. How do
we compute on graphs? How do you think about that? We do it by
thinking like a vertex. And how does a vertex think? Well, a vertex
only gets to see itself and its neighbors. If I'm the red vertex here,
I can only read or modify data in neighboring edges and neighboring
vertices. So let's go back to the PageRank example. I get to read the
current ranks of my neighbors, I get to update my rank as the weighted
average of my neighbors' ranks, and if my rank has changed sufficiently
I get to tell my neighbors, you should redo your computation. These
are the dynamics. It's about writing simple programs like this, and
they get automatically parallelized for you. So they get pushed to the
cloud, and all these other issues you're worried about get addressed
for you automatically.
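A rough sketch of what such a vertex program might look like --
illustrative Python with hypothetical names like `graph`, `vertex`, and
`schedule`, not the actual GraphLab C++ API -- is:

```python
# Illustrative "think like a vertex" PageRank update (not the real GraphLab
# API). The runtime would call this once per scheduled vertex.
DAMPING = 0.85
TOLERANCE = 1e-4

def pagerank_update(vertex, graph, schedule):
    # Read the current ranks of my in-neighbors (my local scope only).
    total = sum(graph.rank[u] * graph.weight[u, vertex]
                for u in graph.in_neighbors(vertex))
    new_rank = (1 - DAMPING) + DAMPING * total
    changed = abs(new_rank - graph.rank[vertex])
    graph.rank[vertex] = new_rank
    # If my rank moved enough, tell my neighbors to redo their computation.
    if changed > TOLERANCE:
        for v in graph.out_neighbors(vertex):
            schedule(v)
```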
So let me give you an example of those issues. I mentioned race
conditions in the beginning. Let's say I'm executing these red vertices
in parallel, and each gets to modify data in its neighboring vertices.
If at the same time the blue vertices get executed and the scopes
overlap, they might modify the same data at the same time, and you
don't know what happens then. And when we started this project,
people told me, oh, you're doing machine learning, these are all
statistical methods. You should totally ignore this. You should just
go and hope for the best. What they were telling me is that if you have
no consistency, you can have higher throughput -- more updates per
second. One way to think about this, by the way: if you have a big
family all sitting at a dinner table, and everybody talks and nobody
listens to each other, you have more throughput -- more people talk per
second -- but you might not really understand each other. The same
thing happens to many machine learning algorithms. So even though you
can do more updates per second, you might have possibly slower
convergence or even really bad behavior. So, for example, in this
graph -- this is Netflix data -- on the X axis I'm showing you the
updates as they happen over time, and the Y axis is training error.
This is just eight cores, so there's not really a lot of potential
conflict here; a small dinner table. If you have inconsistent updates,
this is the kind of behavior you get: it doesn't converge, it
oscillates, it can be quite problematic. If you guarantee consistent
updates, this is the kind of behavior you get: you quickly converge to
the right answer. And the nice thing is that if you're doing standard
parallel programming you have to deal with all these race condition
issues, but if you use GraphLab you don't -- GraphLab automatically
takes care of it in a user-tunable way. You don't have to worry about
these issues.
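To make the consistency idea concrete, here is a rough sketch of one
way consistent updates can be enforced -- a simplified illustration
with hypothetical names, not GraphLab's actual engine, which offers
several tunable consistency levels and distributed locking:

```python
import threading

def run_consistent(vertex, graph, locks, update, schedule):
    # `locks` maps every vertex id to a threading.Lock (one lock per vertex).
    # Acquire locks over the vertex's whole scope in a fixed global order so
    # two overlapping updates cannot deadlock or interleave their writes.
    scope = sorted({vertex, *graph.neighbors(vertex)})
    for v in scope:
        locks[v].acquire()
    try:
        update(vertex, graph, schedule)  # e.g. the PageRank update sketched earlier
    finally:
        for v in reversed(scope):
            locks[v].release()

# Example setup: locks = {v: threading.Lock() for v in graph.vertices()}
```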
So we've been working on this abstraction where you provide a graph
and an update function, you get to choose a consistency model, and it
gets automatically parallelized for you. And at that point -- this is
about a year ago -- a lot of algorithms had been implemented on top of
GraphLab and it was picked up in industry quite a bit. We were feeling
quite good about ourselves. Tom Mitchell's group out of CMU wanted to
solve a very large NLP problem using an algorithm called CoEM. They
couldn't run it on one machine, so they used Hadoop, and with Hadoop
they could solve the problem in about seven hours. They were feeling
good -- at least they could solve the problem -- but then they tried
out GraphLab, and with 32 machines we could solve the problem in 80
seconds.
That's about 0.3 percent of Hadoop time. So with this kind of result,
we thought, okay, we're doing well. We're feeling good. Let's try
something bigger. Let's try a problem where there were no published
running time results. And this is a problem of about seven billion
edges, dealing with the AltaVista Web graph from 2002. And when we
tried that, GraphLab failed miserably. It didn't work. And we had to
step back and try to understand why it did not work. Why is it giving
us bad performance? And the reason it was giving us bad performance is
that this graph is what's called a natural graph. Most graph
abstractions, like GraphLab 1, Pregel, and others, have assumed kind of
idealized graphs -- for example, graphs that have low vertex degrees,
where they're easy to partition across machines. Natural graphs are
not like that. They have many vertices with very high degrees, and
they're very, very hard to partition. To give you a sense of this, if
you look at the Web graph that I mentioned, the top one percent of the
vertices are touching 50 percent of the edges.
And this is not just a problem on the Web graph. If you think about a
social network, you might have a popular person connected to many
others in the network. In movie recommendations you can have movies
that lots of people watched. In machine learning you have these things
called hyperparameters that are connected to potentially every variable
in the model. If you do text analysis you can have a very popular word
that appears in many, many documents.
And high degree vertices can be problematic. So, for example, if I try
to partition them across machines, then I end up cutting a lot of
edges. And the amount of communication you have to do is linear in the
number of edges that you cut, so this can be very bad. In fact, for
natural graphs, even if you could solve the NP-hard cutting problem,
the cuts that you get will not be cheap. So even if you could solve
that problem, you'd still have a ton of communication.
And understanding this issue of natural graphs led us to design
GraphLab 2, where we introduced a new type of partitioning for our
data: we take these high-degree vertices -- or vertices in general --
and we split them across machines.
And this type of partitioning is a natural consequence of the new
abstraction that we're using. From the perspective of the user, you
still program by thinking like a vertex -- you're still programming
like on the left -- but it gets executed in a new kind of distributed
way, on the right.
Just in two slides I'll give you a sense of how things are different
here. If we step back and look at things like page rank and other
machine learning problems, often when you're writing update functions
they can be split into phases: first I gather information about my
neighbors, for example their ranks; then I change something about
myself, like setting my rank to the weighted average of my neighbors'
ranks; and then I go and tell my neighbors something, for example, you
should go and redo your computation. This pattern -- gather
information about my neighbors, change something about myself, go back
and tell my neighbors something -- is a pretty general abstraction for
what's happening in machine learning.
So we define this GAS decomposition, where you first gather information
about your neighbors in a data-parallel way, then you change something
about yourself in the apply phase, and then you scatter information, in
a data-parallel sense, to your neighbors.
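A minimal sketch of how that decomposition might look for PageRank --
again illustrative Python with hypothetical names, not the real
GraphLab 2 interface -- is:

```python
# Illustrative gather-apply-scatter (GAS) decomposition of PageRank.
# An engine (not shown) runs gather over a vertex's in-edges in parallel,
# merges partial results with the combiner, applies, then scatters.
DAMPING = 0.85
TOLERANCE = 1e-4

def gather(src, edge, dst):
    # Per-in-edge contribution, computed in a data-parallel way.
    return src.rank * edge.weight

def combine(left, right):
    # Commutative, associative merge, so partial gathers from the machines
    # holding pieces of a split vertex can be combined cheaply.
    return left + right

def apply(vertex, total):
    new_rank = (1 - DAMPING) + DAMPING * total
    vertex.delta = abs(new_rank - vertex.rank)
    vertex.rank = new_rank

def scatter(src, edge, dst, schedule):
    # Tell a neighbor to recompute only if my rank moved enough.
    if src.delta > TOLERANCE:
        schedule(dst)
```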
So this is our new abstraction. I'm going through it kind of quickly.
But let me just tell you that a lot of the machine learning problems
that we want to solve fall into this model. Things like inference in
graphical models, collaborative filtering, matrix factorization,
clustering, and LDA can be expressed in this way. And since we're
taking high-degree vertices -- or any vertices -- and splitting them
across machines, the communication cost is now linear in the number of
machines a vertex lives on. This is called the vertex cut problem.
And percolation theory suggests it's possible to get low-cost vertex
cuts in natural graphs. So unlike edge cuts, with vertex cuts it's
possible to do this well. And GraphLab 2 implements a number of online
algorithms, with some theoretical guarantees, for computing these types
of cuts for the graphs that you read in.
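As a toy illustration of a vertex cut -- an illustrative sketch, not
GraphLab 2's actual placement code, which uses smarter greedy online
heuristics than the random hashing shown here -- edges are assigned to
machines and a vertex is replicated wherever one of its edges lands, so
communication grows with the replication factor rather than with the
number of cut edges:

```python
from collections import defaultdict

def random_vertex_cut(edges, num_machines):
    replicas = defaultdict(set)           # vertex -> machines holding a copy
    placement = {}                         # edge -> machine
    for (u, v) in edges:
        m = hash((u, v)) % num_machines    # random-ish edge placement
        placement[(u, v)] = m
        replicas[u].add(m)
        replicas[v].add(m)
    # Average replication factor: a proxy for per-vertex communication cost.
    rep_factor = sum(len(ms) for ms in replicas.values()) / len(replicas)
    return placement, rep_factor

# Example: place a tiny graph's edges on 4 machines.
print(random_vertex_cut([(1, 2), (1, 3), (1, 4), (2, 3)], 4)[1])
```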
So this was a fast tour of GraphLab, our system, and of the GraphLab
abstraction. Let me assure you that GraphLab is an actual system you
can use, built on top of standard infrastructure for the cloud like
HDFS. It provides a bunch of functionality that you don't have to
worry about, because you can program in terms of the abstraction; all
the stuff that goes under it can be totally ignored.
Or, for those interested, you can use one of the toolkits that we
already provide on top of it -- things like graph analysis, clustering,
matrix factorization, and so on.
So let me just give you a couple of examples of performance. One of
the analysis problems that people do on large social networks is
counting the number of triangles. A triangle is a three-way
relationship that indicates that a person is part of a strong
community.
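As a quick illustration of what that computation involves -- a
standalone sketch; in GraphLab the per-edge work would be expressed
through the abstraction rather than this simple loop:

```python
def count_triangles(adj):
    """adj maps each vertex to the set of its neighbors (undirected graph)."""
    per_edge = 0
    for u, neighbors in adj.items():
        for v in neighbors:
            if u < v:  # visit each undirected edge once
                # Common neighbors of u and v each close one triangle.
                per_edge += len(adj[u] & adj[v])
    # Every triangle is counted once for each of its three edges.
    return per_edge // 3

# Tiny example: a single triangle among vertices 1, 2, 3.
print(count_triangles({1: {2, 3}, 2: {1, 3}, 3: {1, 2}}))  # -> 1
```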
And if you look at the Twitter graph from 2010 there's about 38 billion
triangles to be counted. And last year there was a paper that used
Hadoop for this problem -- the current state of the art for that
system -- and with about a thousand machines it took them 400 minutes
to solve this problem on the Twitter 2010 graph. With GraphLab 2 on 64
machines, it takes us about a minute and a half.
And now we can ask ourselves is this because Hadoop is not implemented
as well as GraphLab. That might be part of the issue, but the main
issue here is that MapReduce is just the wrong abstraction for this
problem. You end up with too much communication because of the issues
that I mentioned earlier.
Now, this is about going faster, a lot faster than Hadoop. But this
might not be the only measure of productivity. Especially if you're in
industry. Another measure of productivity is programmer's time or
thinking time.
So let me give you an example of that. Let's say I want to do LDA on
Wikipedia, the whole of Wikipedia. This is the type of thing that
companies like Yahoo! are very interested in, because they want to use
it for recommendations of content. And Alex Smola at Yahoo! built this
very cool system for doing very large-scale LDA. With about 100
machines they can process 150 million tokens per second, which is very
impressive. With GraphLab 2, with 64 machines, we can process about
100 million tokens per second. It's pretty comparable. The difference
is the Yahoo! system was built specifically for that task and took a
long time to build. For the GraphLab system, Joey spent about four
hours and wrote 200 lines of code.
So this is the productivity difference. And finally, I mentioned this
AltaVista graph that was [inaudible] in the beginning, and now we can
run it on something like 64 machines on Amazon -- over a thousand
cores, four terabytes of RAM -- and do a whole iteration over the graph
in seven seconds. So that's a billion links processed per second, and
there were only 30 lines of code written here.
So at this point, around this time, my student [inaudible] walks into
my office and says: buy me a Mac mini. And I thought he wanted to
watch TV. But he said, no, I want to show you that we can solve
Web-scale problems on a small machine. And so there's always a genesis
story for names of things that may or may not be true. A lot of
cloud-based systems have animal mascots -- Hadoop has an elephant,
GraphLab has the Labrador dog -- and Aapo said, I want to build the
graph Chihuahua, or GraphChi, and by exploiting hard drives or SSDs he
can solve very large problems. And the challenge of using hard drives
or SSDs is random accesses. If your data lives randomly on the disk
and you have to read it from different parts of the disk, you end up
spending a lot of time reading and writing, on IO. And what he has is
a new parallel sliding windows method that minimizes the number of
random accesses. I won't have time to get into the details.
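Roughly, the idea -- shown here as a simplified illustrative sketch
with hypothetical names, not GraphChi's actual code -- is that edges
are pre-sorted into shards by destination interval and, within each
shard, by source, so that processing one interval of vertices needs one
full shard plus a single sequential window of every other shard:

```python
# Simplified sketch of parallel sliding windows (PSW). Shard j stores the
# edges whose destination falls in intervals[j], sorted by source vertex, so
# the out-edges of any interval sit in one contiguous window of every shard.
def run_one_pass(shards, intervals, update):
    for i, interval in enumerate(intervals):
        in_edges = shards[i].load()                 # all in-edges of this interval
        windows = [s.window(sources=interval)       # out-edges of this interval,
                   for s in shards]                 # each read sequentially
        for v in interval:
            update(v,
                   in_edges.for_vertex(v),               # incoming edge values
                   [w.for_vertex(v) for w in windows])   # outgoing edge values
        for w in windows:
            w.write_back()                          # persist modified edge data
```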
Here's a teaser result: if you go back to the triangle counting
problem, with Hadoop and a thousand machines it took 400 minutes, and
with GraphLab and 64 machines a minute and a half; with GraphChi, on
just a Mac mini, he can solve this problem in 59 minutes. And GraphChi
is a first endeavor on the next step for the GraphLab project, which is
dealing with streaming data. So rather than a batch setting with all
the data available in one place, we can deal with data arriving and
modifying the graph over time. So, for example, on a Mac mini he can
have 100,000 graph updates happening per second at the same time as he
computes 200,000 vertex computations per second.
So the GraphLab project is about providing a novel system and a novel
abstraction for programming large scale machines and it's really meant
for some of the challenges that we face in machine learning. The
project is available under the Apache 2 license, and at GraphLab.org
you can find releases of both GraphLab and GraphChi. Thank you.
[applause]
>> Ofer Dekel: So we have time for a couple of brief questions.
>>: Very interesting. The GAS abstraction seems sort of unsatisfying
because the gather and scatter seem redundant. Are there classes of
problems for which you can just get rid of one or the other, and can
you do a random stochastic selection of vertices?
>> Carlos Guestrin: There are like three or four questions in that
first question, so let me just try to separate them. The last question
was, can you have a stochastic selection over vertices? Do you have to
touch every neighbor every time? And the answer is clearly no. The
way that we've looked at it is through some caching scheme: you can
just look at what vertices have changed. But you can also put in some
randomization, which we haven't, but it's easily possible. The other
question is whether gather is redundant with scatter. It turns out
that for some algorithms you want to broadcast information to your
neighbors, and that's what scatter does. For some algorithms you want
to aggregate information about your neighbors, and that's what gather
does. So you get to choose. You can use one or the other, but we've
noticed both patterns are common.
Any other questions?
Yes.
>>: You mentioned what runs nicely under this framework. Are you
running into any algorithms you would like to fit nicely but don't?
Of the ones you talked about, which are the most problematic, or what
don't you support that you would like to?
>> Carlos Guestrin: With the gather-apply-scatter?
>>: With the model in general.
>> Carlos Guestrin: So I talked about the basic version of GraphLab.
We've added some functionality to deal with things that don't fit so
well in the abstraction. For example, if you want to keep track of
convergence rate or global gradient, you need to do some kind of global
aggregation on top of this or have shared parameters across machines.
For example, the parameter sharing. And so that doesn't fit so neatly
in the abstraction. So we have extra functionality on top to do that.
For the gather-apply-scatter model, it turns out you can write the
original GraphLab 1 abstraction in the GAS model with some loss, where
in the gather you just keep accumulating state from all your neighbors.
So it is representative of the earlier one, but perhaps slightly less
efficient. And we have some examples where it would be bad, and I'm
happy to go through them with you.
>>: Thanks.
>> Carlos Guestrin: I see a little pressure on the side here. So --