>> Dengyong Zhou: I'm very happy to have Eric Xing as our invited
speaker today. Eric is a faculty member in the CMU computer
science department, and his research spans a very broad range of
machine learning, from fundamental machine learning research to
computational biology and even to [indiscernible] large
scale [indiscernible] learning.
>> Eric Xing: All right. I want to begin by thanking Denny for
inviting me here. I'm very glad to share with you some of my
recent work on system and algorithmic foundations for big
learning. So it's a fancy topic. But hopefully, you don't get
too disappointed after just seeing the topic.
So I've been working on distributed systems and large scale
learning only very recently. And honestly, this is for an almost
entirely selfish reason, because in my group, I've been spending
many, many years developing different machine learning models for
many specific applications, to the point that I was embarrassed
that I myself couldn't scale up my algorithms. Especially when I
did my sabbatical at Facebook for a year, I failed to deliver on
my promise to boost their revenue by a certain percentage. So I
went home and decided I should really catch up on this line of
research.
So this is a big project, you know, supported and joined by many
of my colleagues and students, whose pictures I'm going to show
you at the end of the talk. So here is an opening statement. So,
you know, everybody's talking about large data, and what do we
want to get out of large data. So in many contexts, people want
to make, you know, good predictions based on large data, and
sometimes one can argue, well, if your goal is only prediction,
you may not need fancy models. You know, some very, you know,
simple models such as a regression or maybe a black box model
like a deep network can probably do the job.
But in scientific research and in many other domains, people
actually want to get information out of data, in addition to
making predictions, and that's actually a picture reflecting that
kind of goal, right. It was a picture from more than ten years
ago, when people sequenced a human genome for the first time;
they wanted to piece together the puzzle in the genome to find
out the regulation programs and other biological stories about
it. And I just want to [indiscernible] that kind of an
objective, which is: if you are in a forest, you may not
necessarily know a tree is falling unless you see it, because
the knowledge, if not perceived by a person, is just in or out
of the data, and you cannot really make use of it. Therefore, I
think part of the goal in all of big learning is not only to
make use of big data for delivering a certain specific goal but
also to, you know, look into the data and find out additional
information about it.
And for that, I personally felt that machine learning is a
promising tool to provide that kind of capability. And ideally,
we want machine learning to behave like that, right. You have
an automatic system, you know, which can consume any sort of data
in large quantity. And at the end of the day, it can spin out
predictions, hypotheses, and all that kind of thing.
But unfortunately, I don't think this, you know, very rosy view
is happening at this point, for many, many reasons. And
personally, I ran into the following problems. First of all,
everybody is talking about the same problem. We have, you know,
too large an amount of data, to the point that we really don't
have an effective system to deal with this data, especially when
we want to use some fancy, you know, analytical tools. And
personally, I'm very interested in, you know, looking into what
is inside the Facebook network. The network's very big.
Nowadays, it has about a billion nodes, plus all the other meta
information and other, you know, semantic information in the
network, and that creates a big computational challenge.
And that challenge is well known. Many people are talking about,
you know, data being so big that we need to have big storage
systems, big computing systems to deal with it.
But recently, people also started to realize another challenge,
which is this so called big model problem, right. I put here,
you know, a model that is known to everyone here. This is a very
big deep neural network. And now, you know, it is really
providing, you know, a very generic and powerful capability to
understand, you know, a lot of, you know, complicated
information, especially latent semantic information in data.
But these models have been built very big to make them really
powerful. For example, the recent Google Brain network has
about one billion nodes, one billion parameters, and that kind of
model really cannot be put into the memory of a single machine.
Therefore, you also need to deal with some very complicated
issues of making distributed computing or parallel inference
possible.
A third challenge, which I also heard recently through a
Microsoft involved workshop [indiscernible], is this so called
huge cognitive space or huge task space where, say, even for a
simple task like classification, what if you wanted to
[indiscernible], say, there are a million or maybe even a billion
possible labels in that classification problem on top of a big
data set. And that kind of problem was really unheard of in the
machine learning community many, many years ago, because we are
very used to studying binary classification using SVMs or
regression or maybe a more [indiscernible] classification, and it
turns out that kind of machinery may not be so effective in
solving problems of this scale.
So we start, you know, I want to basically, you know, inspire my
group to hopefully find a good solution that can be used for
addressing not only one but many of these problems. And this is
a very, very difficult goal to accomplish, because what we see
now is mostly like this picture, right? We have a lot of
hardware development with different fancy machine configurations
in hardware. And on top of that, we have a lot of different
machine learning models.
And typically, the working style is that if a researcher or a
group identifies a problem and decides to use one model, they're
going to build a team and do a special purpose development or
implementation of that model, especially when they want to make
it very big.
And then you have this, you know, one line connecting the system
and the model and the problem. And if you have many problems,
you probably run into this kind of thing, right, which is
sometimes doable in large companies like here or Google. But
it is a very expensive investment. And it is also, basically,
pushing university researchers out of this [indiscernible],
because we cannot afford, you know, putting together a team like
that and spending money like that to build these kinds of
solutions.
So we asked the following question: Is there a more elegant,
maybe more economical way of addressing the challenges of big
data, big models and big tasks based on the available hardware
systems, using a kind of thin waist style solution where we have
this bottleneck which provides general purpose algorithmic and
system building blocks, so that you can, you know, with a very
modest amount of effort, you know, somehow customize it to the
specific problems you want to solve?
So the talk I'm going to present today is about our initial
effort and some results towards this goal. There are some
existing solutions already directed toward this goal, and here I
just want to provide a very, very brief context and to show you
where we stand in this bigger context.
So this is the bigger picture I see so far. In machine learning,
many, many of the machine learning researchers actually are
already aware of the big data, big learning challenge. And
their solution is typically of this style. They're going to push
for smarter algorithmic development, to the point that they
develop very elegant algorithms which have the correct
[indiscernible] guarantee of convergence, which can, you know,
efficiently make use of an ideal parallel system to spread out
the computing tasks into, you know, a distributed environment,
and then, you know, you have all these papers talking about, you
know, great reduction of computational cost, and some, you know,
well designed experiments.
But if you really read into the papers very carefully, you will
see two typical kinds of patterns that I find very interesting.
One is this micro view of the algorithm or of the
[indiscernible], where they identify a key step, such as an
update of a gradient or maybe a message passing step, which they
claim to be parallelizable, and then they're going to parallelize
it based on not necessarily a real system. Very often, it is an
imaginary system which incurs no communication cost, no
asynchrony loss and other things, and then they will say, if
that happens, my algorithm will converge at a verified speed.
And this experiment sometimes is a simulated one. They basically
sequentially simulate a parallel system and show you a very good
answer.
But if you put the whole thing into a real system, chances are
you either need to reimplement the whole thing or maybe the whole
algorithm isn't really working. So that's basically the
algorithmic side.
On the system side, there is also a great movement in pushing for
parallel and distributed computing. And there, the focus is
really to try to push up, you know, the [indiscernible]
throughput as best as you can by pushing the computing onto
different machines, and there is maybe limited communication cost
and maybe even no cost, you know, asynchronous algorithms, and
then hope for the best.
Usually, such papers do not provide a theoretical guarantee on
the behavior of the algorithm. But sometimes, in the empirical
experiments, they will show, you know, good answers and
algorithms. That's pretty much where we stand, you know, in the
current research paradigm, and there are some limited efforts in
the middle which show you, you know, a system that provides some
guarantees or some algorithms which are mostly [indiscernible] on
a system's configurations.
So that's basically where the current research effort is
directed, and in my opinion, we don't yet have a good
solution that can offer, you know, a more generic or even
universal solution to the large scale and big learning
challenges that I mentioned above.
So here I'm going to present, you know, some recent work from my
group in collaboration with some friends on developing a new
framework for the big learning problems I talked about. And we
call the system Petuum, for a reason we can discuss offline, but
it contains the following building blocks, right.
It has a big, you know, machine learning system architecture
which allows data parallelism, model parallelism, and fault
tolerance and handles many other, you know, intricate issues that
you have to deal with in running a big distributed system.
And on top of that, we provide a programming model which allows
users at different levels, researchers, practitioners or experts,
to, you know, make use of the parallel design, you know, with
hopefully a good reward in reduction of computational cost. And
also, we have toolboxes and APIs supplying, you know, ready made
implementations of major models such as deep networks or topic
models and stuff like that.
So what is behind this system is not a monolithic kind of
structure in the system architecture or a single [indiscernible]
or just a single combination of tools. In fact, we've been
trying to make use of a lot of resources at CMU, in particular in
my group, to push on the research from multiple angles. For
example, we actually do put a lot of effort into pushing the
limit in designing smarter algorithms, good models and good
representations to enable scalability as much as we can, even on
a single machine.
Then, going beyond that, we're going to migrate such elements
into a real parallel system and actually plunge into the dirty
work to actually do some plumbing and make the system more kind
of adapted to the kind of algorithm that it's supposed to carry.
And finally, we actually also have theoreticians in the group
showing analyses demonstrating that such developments actually
may provide you a good guarantee on consistency and convergence.
So there are a lot of, you know, tiny kind of pieces of work, you
know, existing in different places in this whole landscape, but
I'm going to organize this talk into three main kinds of aspects.
One is about how we do big data parallelism, you know, using an
example problem that we've been working on, on doing large
network analysis.
Then I'm going to also talk about how we deal with model
parallelism in decomposing big models onto different machines.
If I have time, I'm going to talk a little bit about task
parallelism as well. So let me begin with the first one.
So the problem really originated from my own terrible experience
at Facebook, where I was asked to actually analyze the whole
social network that they have in their system, and so here is the
problem. There are many ways of doing network analysis, as you
know. And, in fact, many of the analyses are more of a
scientific kind; say, I want to tell a story about how connected
a network is and what the degree characteristics of the network
are. And that is interesting, but those are not problems that
Facebook should care about, right? They actually care more about
actionable kinds of knowledge on networks, such as personal
interest inference or trend prediction, stuff like that.
And specifically, this might be a problem they are interested in.
What if I want to turn every person into a pie chart in which
every slice indicates the probability that he is interested in
certain things, or maybe reflects a certain social activity of
his. For example, this guy can have a multi faceted social
trace, such as his profession being a doctor, and he could have a
family, and he also has some regular playmates to entertain
himself with in his other time.
And this is a typical new way of doing, you know, social
community inference in which we don't simply just group every
person into one cluster. In fact, we want to estimate a so
called mixed membership of every individual in a social network.
And obviously, this is an interesting query, because it allows
you to make many interesting predictions like this.
So the way we begin with that is, let me first talk about the
model before I talk about the system, because the model itself
can be a very unique one. First of all, if you think about the
size of the problem on a network with a billion nodes, you know,
we probably need to start from scratch to think about how we
actually formulate the problem in the first place. Because the
[indiscernible] matrix that carries the whole network is no
longer [indiscernible] to be stored or even read through. It
has, you know, a billion to the power of two number of entries.
>>: It's very sparse?
>> Eric Xing: It's very sparse, but there are still very many
entries, and many of them are zero. A typical model for this
kind of data has to model all the zeroes. We ask, do we need to
do that or not? So we choose to use a different representation
of the features, which is called the triangular feature, which,
you know, has a long history in the scientific literature of
being more reflective of the behavior of every social individual.
And here is how it looks.
Suppose that now we look at every node and, instead of just
counting how many one or zero edges it is involved with, we
actually count how many times it is involved in different types
of triangles [indiscernible]. For example, this node is in one
full triangle, it is in another full triangle, it is in a half
triangle, and so on and so forth. So this is just another way
of characterizing, you know, social network features in the big
network. And it also reflects the mixed membership involvement
of every individual in the network.
I'm not going to argue how good it is. There are plenty of, you
know, papers and empirical studies showing the benefit of that,
but I'm going to focus on the computational issues of solving a
model built on this representation.
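Just to make the representation concrete, here is a minimal
sketch (my own illustration, not code from the talk) of computing
per-node triangular features, counting closed versus open
triangles a node participates in:

```python
# Toy triangular-feature extraction for an undirected graph given as an
# adjacency dictionary: node -> set of neighbors.
from itertools import combinations

def triangle_features(adj):
    """For each node, count (full, open) triangles it is involved in."""
    feats = {}
    for v, nbrs in adj.items():
        full = open_ = 0
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                full += 1        # v, a, b form a closed triangle
            else:
                open_ += 1       # open (half) triangle centered at v
        feats[v] = (full, open_)
    return feats

# Toy graph: nodes 0, 1, 2 form a triangle; node 3 hangs off node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(triangle_features(adj))    # node 0 -> (1 full, 2 open)
```

Note that this per-node loop is quadratic in the node's degree,
which is exactly the cost that the subsampling discussed later is
meant to reduce.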
So the model we employ is so called, is called a mixed
membership triangular model, which is very much like a topic
model. But in this case, a topic is sampled for every individual
to create or to generate a triangle that three people are
involved with. Okay. So here, every individual has their pie
chart, which is a topic vector, a mixed membership vector. And
then, based on who he is going to interact with in the triangle,
he is going to instantiate a particular social role. Say he
wants to be a professor. This one is also a professor. That
one's a student; then two professors plus one student may be
likely to form a triangle, things like that.
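And as a loose paraphrase of that generative story (the exact
MMTM specification is not spelled out in the talk, so the
Dirichlet prior and the role-triplet tensor B below are
assumptions for illustration only):

```python
# Sketch of an MMTM-like generative process: each person has a mixed
# membership vector; three people each draw a role, and a role-triplet
# parameter gives the probability that their triangle closes.
import numpy as np

rng = np.random.default_rng(0)
N, K = 10, 3                              # people, social roles
theta = rng.dirichlet(np.ones(K), size=N) # per-person pie charts
B = rng.beta(1, 1, size=(K, K, K))        # P(closed triangle | role triplet), assumed

def generate_triangle(i, j, k):
    zi = rng.choice(K, p=theta[i])        # role person i plays in this interaction
    zj = rng.choice(K, p=theta[j])
    zk = rng.choice(K, p=theta[k])
    closed = rng.random() < B[zi, zj, zk] # does the triangle close?
    return (zi, zj, zk), closed

print(generate_triangle(0, 1, 2))
```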
So it's a typical probabilistic program, you know, that people
are aware of. But in here, we have some [indiscernible]
inference problems. The main inference problems are twofold.
Here is the data we actually observe, which is the triangular
sufficient statistics in the network. But we need to infer, for
every individual, a pie chart, which is the mixed membership
vector, and also this matrix, which captures the social
interaction between different communities or different social
roles, you know, if they are put into a triplet.
And typically, you know, on the algorithmic side, there are, you
know, roughly two main lines of algorithmic direction. One is
based on, you know, Monte Carlo, Markov Chain Monte Carlo, and
the other is based on Stochastic Variational Inference. And here
I use the topic model as an example to illustrate the typical
operations. Basically, each of these algorithmic prototypes is
going to visit every random variable, every hidden random
variable, cyclically. And whenever I visit one, I am going to
draw a sample based on a conditional distribution defined on the
Markov blanket of this node, which basically gives you the
conditional probability of this particular hidden variable, you
know, taking a certain configuration or not.
So in the MCMC case, I'm going to toss a die and draw a sample.
In the SVI case, I'm going to compute a deterministic
approximation to that posterior probability. But the computation
can be very heavy, because this whole step has to happen
iteratively on every hidden variable and happen over multiple
iterations until convergence.
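To make that "visit every hidden variable and resample it from
its conditional" pattern concrete, here is a minimal collapsed
Gibbs sampler for plain LDA; it is a standard textbook sketch
used only to illustrate the operation being described, not the
MMTM sampler itself:

```python
# Collapsed Gibbs sampling for LDA on toy word-id documents.
import numpy as np

def gibbs_lda(docs, K, V, iters=50, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    z = [rng.integers(0, K, size=len(d)) for d in docs]   # topic of each token
    ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):            # cyclic sweep over every
            for i, w in enumerate(doc):           # hidden variable
                k = z[d][i]; ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional given the Markov blanket (all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())  # toss the die
                z[d][i] = k; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw

docs = [[0, 1, 2, 1], [2, 3, 3, 0]]               # toy documents of word ids
print(gibbs_lda(docs, K=2, V=4, iters=10)[1])
```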
>>: So for this type of [indiscernible] top down model, how
scalable can it go?
>> Eric Xing: Yeah, that's what I'm going to show. In fact, in
the [indiscernible], they are hardly scalable. If you look at a
typical, you know, MCMC or SVI paper for a network, they can go
to about a few hundred [indiscernible] nodes. And that's exactly
why we are interested in addressing this problem.
So there are multiple ways of scaling up. If you look at, for
example, papers from, say, David Blei's group, he is very
obsessed with scalable models of LDA, and his approach is to make
the algorithm smarter and smarter, right. For example, you will
see ideas such as subsampling, such as stochastic estimation.
And on that point, I think we also contributed some new ideas in
parsimonious modeling to reduce the size of the model.
So here we have a number of steps actually pushing up the ideas,
pushing up the scalability on a single machine, in a shared
memory environment. If you are interested, I can go down into
the details of each of these algorithmic ideas, but the good
thing is that it has been well established in the community that
these are convergent and reliable results. And we have some even
newer ideas recently to make this speedup or acceleration even
more aggressive, by reducing the variance of stochastic updates,
by controlling the learning rates automatically, and so on, so
forth.
So all these ideas actually are quite valuable, and they actually
push up the performance quite a lot. For example, here I want to
show you, just by doing data subsampling, meaning that I don't
need to look at all of my one billion minus one neighborhood in
terms of generating a [indiscernible]. I can subsample the
neighborhood depending on how much degree I have in my
neighborhood. And with a different, you know, degree of
subsampling, I can actually turn the inference complexity, you
know, from a quadratic one down to a linear one, and down to a
linear one with a very small slope.
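A minimal sketch of the neighborhood subsampling idea (my own toy
version; the fixed cap is an assumption, whereas the scheme
described chooses the subsample size as a function of the node's
degree):

```python
# Instead of iterating over a node's full neighborhood, draw a capped random
# subset, turning per-node work from O(degree) toward O(cap).
import numpy as np

def subsampled_neighbors(adj, v, cap, rng):
    """adj: dict node -> list of neighbors. Returns at most `cap` neighbors of v."""
    nbrs = adj[v]
    if len(nbrs) <= cap:
        return nbrs
    idx = rng.choice(len(nbrs), size=cap, replace=False)
    return [nbrs[i] for i in idx]

rng = np.random.default_rng(0)
adj = {0: list(range(1, 1001)), 1: [0]}              # node 0 has degree 1000
print(len(subsampled_neighbors(adj, 0, cap=50, rng=rng)))   # -> 50
```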
And to the point that we can actually deal with a pretty sizable
network here. It is a network, the Stanford web graph, which has
roughly a quarter million nodes, and we can pretty much finish
the whole job in 18 hours for five social roles. This is
academically quite admirable, because it's the first time that
such a network can actually even be processed under a serious
Bayesian graphical model. But, of course
>>: So in that case, the [indiscernible] in the graph correspond
to what entity?
>> Eric Xing: Oh, here? Here is, all right, the visualization of
the mixed membership. This is a five dimensional social simplex,
okay.
>>: Does it normally come up from the earlier model?
>> Eric Xing: What I'm doing here is actually I'm plotting this
data, plotting the pie charts. Okay. Every pie chart corresponds
to a point in here, which is a five dimensional coordinate, and
their size actually corresponds to the degree of that node. So
here, you can have a visualization of the community, of the mixed
membership community in a bigger social network. You can see
probably some, you know, absolute communities and some mixed
communities in the middle, so on, so forth.
But with the Stochastic Variational Inference idea, we actually
can further speed it up, and, you know, I can show some
impressive numbers here. For example, for that Stanford web
network, which is a quarter million nodes, we now can actually
finish the whole run in about ten minutes, okay. That's actually
a quite impressive number. But, of course, this ten minutes is
for only five social roles, which is a little bit unrealistic. A
quarter million people with only five roles, that's probably too
few. So how about increasing to five hundred roles, okay? We
actually can do that in six hours, which is still quite good.
All right. So this basically...
>>: Was the variational method better, or...
>> Eric Xing: [indiscernible] so the previous one was MCMC with
data subsampling, and here it is the Stochastic Variational
Inference. So up to here, I think we've really been reaching the
limit of algorithmic kind of acceleration. We worked very hard
at pushing our way to be smarter and smarter. But up to here, we
hit a wall.
Basically, the network is still bigger than what we can handle.
For example, I just showed a network of a quarter million, but
what if I want to handle, say, a hundred million, which is still
a fraction of the Facebook network. And even more challenging
is, what if I want the number of roles to be even more
reflective, because for a billion people, or for even a hundred
million people, it's easy to imagine they have ten thousand
different roles or even more than that.
And with a very simple calculation, you can figure out that even
the memory needed for storing such information is a few terabytes
of RAM, which is not feasible for a single machine. And if you
run any of these algorithms that we used before, it would take a
few years to finish the whole thing.
So even a scholarly sound and, you know, publication wise very
elegant idea really cannot deliver a practical industrial
solution that the companies really need. So that actually pushed
us into a corner. And therefore, we had to start thinking about
how to do the dirty part, the parallel inference part.
It is actually not quite trivial to scale this algorithm to a
parallel setting. Here is why. Suppose, you know, we want to do
a, you know, very direct data distributed sampling. So we just
put the data onto different machines and do sampling on each
client separately. And usually, you know, when we write our
code or do our analysis, we will say, okay, we can, you know,
replace that update step with a parallel update where I'm going
to wait for, you know, local inference information to be
collected from different machines.
But in reality, this step doesn't happen in such a nice way,
because you have all sorts of problems in the machines, in the
system, which prevent you from getting all the messages back at
the same time.
>>: Now here, the T is the data...
>> Eric Xing: Which?
>>: The T variable.
>> Eric Xing: No, this is the iteration count. I'm actually
using an extremely simple [indiscernible] model which suggests
that this parallelization can happen automatically, no matter how
I do it, basically an ideal parallelization. So the whole point
is that if you write it this way, all the computational
complexity is swept under the rug. You don't see it, but it
actually happens and bothers you a lot in a real system.
>>: I see. In the basic concept of a stochastic algorithm, you
don't really have the iteration concept? It's a [indiscernible].
So what [indiscernible].
>> Eric Xing: Yeah, okay. A stochastic inference algorithm, by
nature, cannot... it would be even harder to parallelize, because
the whole correctness guarantee is built on a sequential
execution of the stochastic updates, right. Here, I want to
parallelize. Therefore, here I haven't really said what these
steps are. Just imagine there is a step that is parallelizable
already, and I'm saying even in that setting where the algorithm
is proven parallelizable, you still need to worry about how the
parallelization can happen, because that last step itself can...
>>: So [indiscernible] neural network [indiscernible].
>> Eric Xing: Yeah, yeah. You will see. Actually, I will have
neural network stories later, in the second part. So to actually
come back to this issue: when this happens over the network, you
have to set up a barrier, and then you need to cache the
[indiscernible] and wait until the synchronization, and that's
actually quite non trivial.
So here is a very simple illustration of how serious the problem
can be. So as we now know, you know, the network can have very
little bandwidth. Therefore, you cannot pass many messages in a
short time. And secondly, all these different machines may
perform unevenly, because you are not the owner of the whole
machine. Other processes are going on in there, and variation,
[indiscernible] uniformity and other things can also affect the
course of [indiscernible] performance. And at the end of the
day, here is what we actually observed.
We found that, running LDA on a, you know, modest cluster, you
know, even in an ideal environment where we only run this one
program, no other programs are running, and we own the whole
system, we spend a good 80 percent of the time communicating, and
only the other fraction on actual computing. That's a big waste.
And if we allow the program to run that way, the kind of task
that I talked about for the billion node network inference is not
going to be solved in a short amount of time.
>>: So in that example, is everything converging on a single
sort of master machine that correlates [indiscernible]?
>> Eric Xing: Yes, you have to. You know, for machine learning
algorithms, if you use a [indiscernible] server idea or any other
ideas, you have to have a collection step, right, to gather all
the sufficient statistics.
>>: So all that waiting time is on the [indiscernible]?
>> Eric Xing: Yeah, yeah. You have a question there?
>>: Were you using [indiscernible] or is it every node?
>> Eric Xing: Oh, okay. That's another layer of clustering,
okay. Mini batching, you know, is a secondary layer of
clustering where, on each particular client, when you compute the
update on the sample on that machine, you can ask whether I do
mini batching or I use the whole thing, right. Here, I'm
abstracting that away, basically. I'm talking about generic
algorithmic behavior, regardless of whether you use mini batches
or not.
>>: This timing will be affected by [indiscernible].
>> Eric Xing: I'm not doing mini batching here yet. Using mini
batching would be even worse, because with mini batching, your
iteration, you know, your computing time for each iteration will
be reduced, but you need more iterations. In fact, the time
spent on communication can be even worse.
>>: [indiscernible].
>> Eric Xing: Yeah, so the idea of mini batching is that on every
update, I'm going to compute the update on a smaller subset of
data. Therefore, I can actually compute a gradient even faster,
but I need more iterations to convergence, right. And more
iterations mean more time spent on communication, right. So I'm
talking about the ideal case, where you can actually do a perfect
gradient at every step, at a bounded cost. But still, your time
for communication can be very significant.
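For concreteness, here is a small toy sketch (not from the talk)
of the trade-off being described: a full-batch gradient loop
versus a mini-batch loop on least squares. Each mini-batch step
is cheaper, but many more steps, and hence many more
synchronization rounds in a distributed setting, are needed:

```python
# Full-batch gradient descent vs. mini-batch SGD on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
w_true = rng.normal(size=50)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

def grad(w, Xb, yb):
    # Gradient of the mean squared error 0.5 * ||Xb w - yb||^2 / len(yb)
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full batch: one expensive gradient (one communication round) per iteration.
w = np.zeros(50)
for _ in range(100):
    w -= 0.1 * grad(w, X, y)

# Mini batch: each step touches only `batch` rows, so per-iteration compute
# drops, but many more iterations -- and synchronization rounds -- are run.
w_mb, batch = np.zeros(50), 100
for _ in range(5_000):
    idx = rng.integers(0, len(y), size=batch)
    w_mb -= 0.05 * grad(w_mb, X[idx], y[idx])

print(np.linalg.norm(w - w_true), np.linalg.norm(w_mb - w_true))
```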
>>: So this applies to the [indiscernible] learning for
[indiscernible].
>> Eric Xing: It applies to... yeah. The statement here isn't
actually about a particular algorithm. It's actually a generic
[indiscernible], yeah.
>>: The basic question is that [indiscernible] learning is, by
default, batch learning; if you look at your description, you
never had anything [indiscernible].
>> Eric Xing: That's a heuristic arrangement. If you want to
prove correctness of the [indiscernible], for instance, you start
with a non batching idea. And if you use batches, you need to
further prove it is [indiscernible], which is not always true, in
fact, okay. So here, I'm always talking about a convergent,
correctly behaving algorithm instead of a heuristic. A heuristic
can do anything. For example, you can do the [indiscernible]
parallel LDA, just parallelize it and it will still run. How
correct it is, I don't know.
So here, I want to first be sure it is going to be correct, okay.
Then we ask what cost you need to pay. And here is the cost you
need to pay. Basically, to make things correct, you need to
communicate often enough to make sure every update is consistent,
and that means, you know, you need to have tens of thousands of
communications every second, you know, in the parallel
distributed process. And this is impossible for a system like
Hadoop, because the iteration cost is just too high; it will
totally overwhelm the network.
So we ask, can you actually do something better to reduce the
communication, or maybe to rebalance the time spent on
communication and computing? That basically brings us to the
first part of the talk, which is a better synchronization
procedure for parallel computing.
Nowadays, there are two major classes of synchronization schemes
in parallel computing. One is the bulk synchronous update,
right. Synchronization, by the way, is [indiscernible] if you
care about the correctness of the algorithm. Otherwise, you can
just go async. So if you do sync, then you need to set up a
barrier and, ideally, hopefully, before the barrier, every
process will finish at about the same time, and then the updates
at this barrier get collapsed, and then you enter the next
update.
But this ideal case never happens, because of many unpredictable,
you know, unexpected defects in a network system, in a cluster
system. What you actually see is this. Different processes will
reach the end at different time points, and then in the
collapsing step, I need to do varying amounts of waiting because
of this random delay coming from different places. Therefore, a
lot of the space which I pointed out here, the white and red
ones, is actually just time wasted on waiting, on communication.
It is not doing any useful computing. And this fraction can be
very big, to the point that it can take 80 percent of the time,
even in an ideal setting.
The other solution is to ignore synchronization altogether. You
do a totally asynchronous one and then, you know, hopefully, in
the ideal case, the different processes will not be out of sync
by too much. Therefore, you may still get, you know, at any
point, a correct kind of update up to some bounded error. But
this is again a very dangerous assumption, because in the extreme
case, the amount of asynchrony can be very extreme, to the point
that some threads may be several iterations ahead of others.
Therefore, their updates cannot be correctly integrated with
other updates. Because what if one gradient tells you to go that
way, and after a few steps, it asks you to go this way. Then if
you average them, which way will you go? You cannot get a
correct direction.
So this is a typical kind of artifact [indiscernible] faced in
the current [indiscernible] solutions to big learning problems.
And we want to actually explore a middle ground: can we actually
reduce the amount of communication in BSP but still get the
speed of an async implementation with the correctness of BSP?
That's the question we want to ask.
And that leads to the work we presented at NIPS this year, which
is called Stale Synchronous Parallelism, in which we actually,
you know, set up a timer, actually, for every process, for every
thread, which allows it to monitor how much it is ahead of or
behind other threads in a parallel environment. And we want to
enforce the following update law.
First of all, there is a parameter server so that, you know,
every thread can independently, you know, update the parameter
server, which actually leads to a learning step in the parallel
machine learning. But on the other hand, if it needs to go to
the next step, it can choose to read the information from the
parameter server, which is actually the global solution, or read
information from its local server, okay. Every thread has a
local cache with a local version of the parameter values.
And it reads from the local server if it is not too far ahead of
other processes. Meaning that, you know, my current version is
not much different from the global version. Therefore, I can
read locally. That saves some communication. But if I'm too far
ahead of others, then I need to update by reading from the
server; I need to stop, actually update myself, and wait.
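Here is a minimal, single-process sketch of that bounded-staleness
rule (the class and method names are hypothetical, not the Petuum
API): a worker serves reads from its local cache while it is
within the staleness bound of the slowest worker, and otherwise
refreshes from, and in a real system would block on, the
parameter server:

```python
# Toy SSP client: cached reads while within the staleness bound, fresh reads otherwise.
class ParamServer:
    def __init__(self, n_workers):
        self.table = {}                        # global parameter values
        self.clocks = [0] * n_workers          # last reported clock per worker

    def min_clock(self):
        return min(self.clocks)

class SspWorker:
    def __init__(self, wid, server, staleness):
        self.wid, self.server, self.staleness = wid, server, staleness
        self.clock = 0
        self.cache = {}                        # key -> (value, clock when cached)

    def get(self, key):
        value, seen_at = self.cache.get(key, (0.0, -1))
        if self.clock - self.server.min_clock() <= self.staleness and seen_at >= 0:
            return value                       # not too far ahead: use the cache
        value = self.server.table.get(key, 0.0)   # too far ahead: refresh (would block)
        self.cache[key] = (value, self.clock)
        return value

    def inc(self, key, delta):
        self.server.table[key] = self.server.table.get(key, 0.0) + delta

    def clock_tick(self):
        self.clock += 1
        self.server.clocks[self.wid] = self.clock

# Worker 0 races ahead of worker 1, which never advances its clock.
ps = ParamServer(n_workers=2)
w0, w1 = SspWorker(0, ps, staleness=2), SspWorker(1, ps, staleness=2)
for _ in range(5):
    w0.get("w"); w0.inc("w", 0.1); w0.clock_tick()
# w0 is now 5 clocks past the slowest worker, beyond staleness 2,
# so this read bypasses the cache and sees the server's current value.
print(w0.get("w"))
```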
So that creates, actually, a very interesting behavior, which
basically allows the slower [indiscernible] threads to actually
do a lot of, you know, updates without actually reading too often
from the global server. But the very fast ones will stop
themselves at some point. Or if one goes on, it needs to actually
update; every time it updates, it has to read from the global
server, okay. So that's the kind of balance: you can see this
one will spend more time on computing and less time on
communication, and that one will spend more time on communication
and less time on computing. And eventually, all these different
threads will reach the goal at roughly the same time point.
And here is a rough [indiscernible] interface of the SSP
parameter server, okay. So every machine will have its own
connection to this centralized place through a rather simple read
and write interface, which is very much, it's not much different
from a single machine, you know, interface, which means that you
don't have to rewrite your parallel, your program. You can
pretty much use the same program, and the detailed implementation
is taken care of in our low level infrastructure implementation.
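To illustrate why the client code can look like single-machine
code, here is a hedged sketch of a worker loop written against
such a table; the get/inc/clock names are assumed for
illustration rather than taken from the actual interface:

```python
# Sequential SGD and its SSP version differ only in where parameters
# are read from and written to.

# Sequential version, for comparison:
#   for x, y in data:
#       g = gradient(w, x, y)
#       w -= lr * g

def ssp_sgd_worker(table, data_shard, gradient, lr, epochs):
    """Same loop, but reads/writes go through a shared SSP table."""
    for _ in range(epochs):
        w = table.get("w")            # may be served from the local cache
        for x, y in data_shard:
            g = gradient(w, x, y)
            table.inc("w", -lr * g)   # push the update to the parameter server
            w = table.get("w")        # bounded-staleness read
        table.clock()                 # advance this worker's iteration clock
```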
And as a result, you can actually see, you know, the system, you
know, did provide a very flexible way of controlling the trade
off between, you know, very active synchronizing communication
and a more kind of economical [indiscernible] synchronization
where I don't actually go to the network very often. And the
amount of staleness basically specifies a number of iteration
cycles by which the threads can deviate from each other. You can
see that as the staleness number becomes bigger, the amount of
communication actually did go down, and then it gives you a
better trade off between communication and computing.
And this idea can be applied to not just a single model, such as
LDA. In fact, it applies to a wide variety of machine learning
models, such as what I listed here: topic models, matrix
factorization, regression. As long as you have a setup where you
need to maintain a shared, globally shared parameter to be
estimated from the data and that [indiscernible], you can always
basically distribute the data onto different machines and then
put the parameters into the SSP table and then do the updates in
this asynchronous fashion.
>>: [indiscernible].
>> Eric Xing: This is a data parallel model at this point, and
we assume, of course, the model itself can be stored in this SSP
table, although I haven't really been specific about how this can
be [indiscernible]. In fact, this one can also be a parallel
system, which actually increases the power of its storage.
So here I have some performance evidence for this system. You
know, we tried a carefully chosen kind of spectrum of different
models. LDA stands for a typical probabilistic graphical model
inference problem. LDA... no, LASSO stands for a typical, you
know, convex optimization kind of problem. And here, matrix
factorization is by itself a very popular operation in machine
learning.
And you can see in each case, the SSP, the Stale Synchronous
Parallelism, which is actually this one, is going to give you the
best kind of convergence behavior, and it's not only fast but
also converges to a better spot in the given amount of time.
Faster and more accurate.
And one thing people ask about and care about most is how
scalable it really is. You can show an interesting, you know,
curve on a small system, but what if I have an even bigger
problem where I need more machines? We did an experiment here
which shows that, at least, you know, up to the limit of thirty
two machines with about 300 cores, we still have a pretty nice
linear scale up, in which every time you double the machines, you
get about a 78 percent scale up, which is a pretty nice, decent
behavior in scalability.
And lastly, I want to emphasize that this SSP server does not
just heuristically divide the data and do distributed updates.
It actually has a convergence guarantee that the resulting
estimate is going to converge to the true result if there is one.
And the intuition is very, very simple, because, you know, taking
stochastic gradient descent as an example, you know, ideally, if
you don't have any parallelism, you are going to get an accurate
gradient; therefore, you are going to move in the [indiscernible]
direction.
But now we have the SSP, which basically allows different threads
to deviate from each other by a certain number of iterations, and
that translates to a small amount of error, okay, due to
inconsistency between iterations. But on the other hand, the
staleness is bounded. Therefore, the amount of error is also
bounded. Therefore, every time you are going to deviate away
from the optimal direction a little bit, but these will be random
deviations, and cumulatively, they will still lead to
convergence.
>>: So as you scale the number of machines, this grows longer,
the slow machines, right? Is there some point at which the
convergence now goes slower because you've added too many
machines and the staleness is too long?
>> Eric Xing: The staleness number is a constant that you can
set, which does not have to be dependent on any more machines.
>>: Really? Because then wouldn't the fraction of network time
start adding up to be more than...
>> Eric Xing: No. If you have more machines, the amount of
communication will be larger. But the staleness iteration number
isn't actually controlling the amount of time you need to spend.
It just checks how many iterations I'm away from others. So with
more machines, I think it is, how should I say? If you have more
machines with a constant staleness, you are going to eventually
waste more time on communication. Therefore, the effective
progress you make will be slower.
>>: Do you know how many machines [indiscernible]?
>> Eric Xing: So here, actually, we have a theory which actually
shows you, you know, the relationship of this. That's actually
the beauty of the [indiscernible]. It basically says that your
updates, you know, will converge to the true one at this rate.
You have the F and L, which is the typical rate to prove as a
function of the data behavior, right. The Lipschitz, the
smoothness of the Lipschitz constant and the [indiscernible] of
the data, stuff like that. But here, we have two other things.
S is the staleness number and P is the [indiscernible]
parallelism. So if you have a P, of course, more P makes the
bound bigger and the quality lower. But you are going to
actually get, you know, a T there. And if you increase it, you
can smooth, you can erase that kind of effect.
>>: T is time?
>> Eric Xing: T is the number of iterations. Iterations, yeah.
And again, the intuition, as I said, is that you have this
bounded staleness, which allows you to quantify the amount of
inconsistency, of the error introduced through the asynchronous
updates. And that actually carries through the analysis.
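For reference, the kind of bound being described can be written
out as follows; this is a reconstruction matching the SSP regret
bound published in the NIPS 2013 paper, assuming each component
f_t is convex and L-Lipschitz, the feasible region has diameter
bounded by F squared, and an appropriately decaying step size,
with s the staleness, P the number of threads, and T the number
of iterations:

```latex
R[X] \;:=\; \sum_{t=1}^{T} f_t(\tilde{x}_t) \;-\; \min_{x} \sum_{t=1}^{T} f_t(x)
\;\;\le\;\; 4 F L \sqrt{2\,(s+1)\,P\,T},
\qquad\text{so}\qquad
\frac{R[X]}{T} \;=\; O\!\left(\sqrt{\frac{(s+1)\,P}{T}}\right).
```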
>>: If I look at the bound, like I have a certain number of
machines, then is that P?
>> Eric Xing: Yeah, it's P. P, the number of threads.
>>: Right, right. But [indiscernible] number of machines. But
now set S to zero, maybe. Then I would get the best convergence
rate, right? The smallest upper bound happens when S is
smallest. So why would I ever set, according to this theory, S
to any number...
>> Eric Xing: This is the bound.
>>: Yeah, yeah. The bigger, the worse. So I want to minimize the
bound, to set S to zero.
>> Eric Xing: Well, you know, this bound is more like a
qualitative kind of guidance on the behavior rather than, you
know, a [indiscernible] SVM bound; you don't actually tune the
thing to actually give a [indiscernible] bound. It gives you the
guarantee that a bound exists. But pushing the bound down or up
isn't quite meaningful here, because you don't actually know how
loose the bound is. There are some other constants in here.
>>: [indiscernible] number of iterations, not wall clock time.
You set S to zero... it's the number of iterations. It's the
number of iterations. If you set S to zero, each of those
iterations is going to take a while.
>> Eric Xing: Yeah.
>>: [indiscernible] so can we get a bound for equal risk, so that
we have the parameters that we're interested in, the risk and the
time? We want somehow to tie them together, in terms of: if I
have [indiscernible] and I want this amount of risk and I'm
willing to take [indiscernible] up to that point, what would be
the best configuration? Does it scale up with the time
[indiscernible] to zero with the number of machines, or is there
some optimum? So how do all these numbers come in?
>> Eric Xing: Well, it does [indiscernible] in this equation,
where you keep one thing constant. If, for example, you raise
the number of machines, you are going to, you know, basically get
the bound pushed up in a [indiscernible] way because of more
machines. If you don't change the iteration time and don't
change the staleness bound, it's going to create more
inconsistency. Therefore, the bound is worse. That's the kind
of flavor with which you want to look at all of this proof.
>>: How does the [indiscernible] time change with respect to the
number of machines?
>>: Yeah. It's almost like if you use [indiscernible]
distribution or something, maybe you could start to come up with
some kind of relationship between the two, for a really large
collection of machines or something.
>>: Right.
>> Eric Xing: So I'm not sure I answered the question. Quite a
few questions came in to me, so which one should I...
>>: I think you're talking about just [indiscernible] rates based
on the number of iterations and the number of machines, but not
talking about the time span.
>> Eric Xing: Not yet.
>>: But that is the more important factor.
>> Eric Xing: Yeah, that has yet to happen. First of all, this
analysis is the first of its kind, but it's a typical kind of
convention in convergence analysis [indiscernible] to worry about
iterations. But the wall clock time is very hard to
characterize; it depends on your configuration, how your
implementation is, right. So here we're talking in a more
abstract way.
>>: So there is another factor here. If each [indiscernible]
computation is very, very fast, then the communication cost
compared to that computation can be very big. In that case, when
you increase the number of machines, you probably don't gain a
lot, right, especially if you want to enforce some staleness.
>> Eric Xing: No, not necessarily. The computation isn't
necessarily that fast. You know, for a...
>>: [indiscernible] use a smaller mini batch size or use a GPU.
>> Eric Xing: That's a different idea. So you are now already
suggesting a few different ideas that we haven't investigated.
Once you put mini batching in, the correctness of it is not
established yet. So it becomes a heuristic region; you can do
that, but I don't think an analysis can be easily generated.
>>: If you have a very small task and you want to use many
machines, what [indiscernible]?
>> Eric Xing: A small task?
>>: Do you see a time increase or a decrease?
>> Eric Xing: That's a good question. I don't, I don't actually
have a clear answer at this point. Here, I'm basically on the
more pragmatic side. I'm not playing a game, saying I want to
build a big system and run some small task. That's becoming more
kind of a gaming style investigation. I'm saying that you really
have this situation at a large company. You have so much data
already available, and it leaves you something to do, and I'm
going to guarantee you that it is not terribly [indiscernible].
That's kind of the mentality of this analysis, okay. So it is by
nature not a... the theory, like in some of the [indiscernible],
is to give you the trend but not the actual kind of
[indiscernible] of, say, compute time to convergence. That's
never happened, right. So I don't know how to answer, because of
[indiscernible] not knowing. I'm just giving you the trend.
>>: Just speaking empirically, you've been doing, it looks like,
maybe up to 32 nodes or something. Have you encountered a point
where basically... because it looked like you were still getting
some nontrivial speedup when you add a node, like 5.6, something
like this. Have you encountered situations where essentially it
turns out... I think the sort of spirit of the original question
was, hey, it could even get slower.
>> Eric Xing: Oh, yeah, that's definitely true. That's
definitely true. For example, there is [indiscernible]; if I do
it right now, I could almost predict the result. If I put 10,000
machines here and run the same thing, I'm going to be slower.
>>: Okay.
>> Eric Xing: That's for sure, because there are some other kinds
of communication complexity that need to be resolved. For
example, the switch isn't big enough to take all the messages at
the same time. What do you do? You have to wait. And that kind
of constraint is not taken into account here yet. So here, we
still assume, for example, that the messages are not competing,
are not blocking, are not clogging, that kind of thing. But I'm
saying that if you have the power to support a nice kind of
communication behavior, then this is something we expect. But in
the real case, I'm sure, you know, you cannot infinitely increase
the number of machines. Yeah.
>>: Actually, we have some analysis on [indiscernible] machines
in which case even [indiscernible] is not fast enough.
>> Eric Xing: Yeah. GPU is a very, very special kind of
configuration, right. It has a lot of cores, but everything you
do [indiscernible] verify you need to have a big connection.
That's why [indiscernible] together on the thing. If you have
distributed GPUs, I'm not even sure whether there are good
implementations for distributed GPUs yet, because that
communication is not easily realizable, you know, across
different machines.
So yeah, I haven't studied that kind of configuration. Here it's
more like a classical, traditional CPU based cluster
configuration. If you bear with me, I have a lot more stories to
tell. I want to maybe pass this, but tell the other stories that
you may also find interesting.
But yeah, so even on the theory side, I can have some
[indiscernible] to show that not only is the process converging,
but the variance is also bounded in a sense, so you have
[indiscernible] better quality in the convergence. But these are
just some add on qualifications. And finally, again, I'm doing
this really for real life computation and real tasks. Here I
want to show you just a result we achieved by using this system.
But otherwise, we cannot be [indiscernible] that.
So on that network, if you remember, we have a really
[indiscernible] network under the MMTM model. And so we tried to
run our system in competition with the best system known so far.
That's always something we want to try. And here is a network
which contains four million nodes. That's perhaps the largest
network that can be run by any other system, and in this
implementation by David Blei's group, it was done in 24 hours; in
our case we took three hours to finish the whole MMSB analysis.
And on another network, now with 40 million nodes, which is
really not trivial, we can actually finish the computation in 14
hours, and that other algorithm isn't deployable because it
crashed the machine.
Okay.
So that's the data parallel part.
>>: So in those cases, were the programs [indiscernible] changes
you were showing?
>> Eric Xing: For our program?
>>: The version that you ran on your architecture, did you have
to rewrite the algorithm?
>> Eric Xing: Our algorithm was [indiscernible].
>>: So if I already have [indiscernible], do I need to make any
change to use your framework?
>> Eric Xing: Okay, good question. You pretty much don't need to
make a lot of changes. This is not like GraphLab or Spark, where
you need to turn it into a vertex program. We actually are using
native programming, almost like Matlab. At least that's the
goal. So far, I don't know whether your program will run on us,
but at least the way we program our models is no different than
how we program other systems, other than changing the two lines
of code into this parameter server code. And again, we haven't
really built the kind of full [indiscernible] interface yet, but
that's the goal in our system, at least.
>>: Kind of feels like you're writing code for one machine.
>> Eric Xing: For one machine, yeah, exactly, yes.
>>: Okay.
>> Eric Xing: Yeah, in your writing of the code, you don't feel a
strong kind of low level parallelism that you have to take care
of, okay. And again, you know, I can just repeat here, you know,
the one take home message: in this case, the design of the low
level system is not blind to the behavior, the characteristics,
of the machine learning algorithm itself. We actually are making
active use of the machine learning algorithm's properties, namely
the iterative convergence property, so that, you know, it can be
consistent and resistant to the presence of small errors. And
that actually is the key spirit in our design, because many of
the current distributed computing systems put a lot of emphasis
on serializability and sequentialization. That's actually
sometimes unnecessary and incurs a big cost without much gain out
of it.
All right. So let me move on to the next part, because I'm
running out of time, which is about model parallelism.
Again, it's a very familiar problem. Here I have some examples
motivating this kind of problem. But you guys have plenty more
such problems in which you really need to solve, you know, maybe
a convex [indiscernible] problem in which the size of the
parameter is very big. You have, say, billions of parameters.
And how do you actually make this [indiscernible]? In this case,
you may even have small data. You can argue, I don't really have
[indiscernible] the problem, but unfortunately, the model is very
big. You can still have a need for, you know, parallel
computing. And here, you know, our approach is again divided
into two different kinds of road maps. One is about doing
algorithmic kinds of innovation to really push for very fast
kinds of sequential algorithms, as much as we can. And again, I
think this is not the main focus of this talk. But I'll just
give you some names here for typical convex optimization
problems, such as using a kind of, you know, ADMM type of idea
to, you know, decouple the overlapping coupling between different
variables so that you can run simple [indiscernible] kinds of
[indiscernible] algorithms. And for non convex losses, you can
make them smooth and differentiable using a smooth proximal
[indiscernible] type of approach. And for constraints, which
exist in large quantity, you can systematically organize those
constraints and then do, you know, a systematic, you know,
thresholding approach to resolve every inconsistency and
constraint. And again, each of these ideas is [indiscernible] a
little bit to make them more and more efficient on a single
machine, to the point that you actually reach the
[indiscernible] where your model cannot be stored in a single
machine.
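As one concrete instance of the ADMM-style decoupling just
mentioned, here is a generic consensus-form LASSO sketch (my own
illustration, not the group's implementation): a smooth quadratic
step, a soft-threshold step for the L1 part, and a dual update
that enforces consistency:

```python
# ADMM for LASSO: minimize 0.5*||Ax - b||^2 + lam*||z||_1 subject to x = z.
import numpy as np

def lasso_admm(A, b, lam, rho=1.0, iters=200):
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA_rhoI = A.T @ A + rho * np.eye(n)     # would be factored once in practice
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(AtA_rhoI, Atb + rho * (z - u))            # smooth quadratic step
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0) # soft-threshold (L1)
        u = u + x - z                                                 # dual / consistency update
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20); x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.normal(size=100)
print(np.round(lasso_admm(A, b, lam=0.1), 2)[:5])
```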
So here I want to share with you another idea, which is not
published yet. It's called structure aware parallelization of
big models. So how does this connect to the existing work? You
heard about the [indiscernible] algorithm [indiscernible]. The
key idea is, I have a large dimensional regression problem, and I
want to make the updates parallel across dimensions. I'm going
to distribute different dimensions onto different machines. And
then the proof says that if the different dimensions are not
strongly correlated, I'm going to converge. But if they do
correlate, then I don't have any guarantee. The truth is that in
many social media and genetic applications, you are almost
guaranteed to have highly correlated dimensions across different
[indiscernible] dimensions. Therefore, you don't actually see a
[indiscernible] algorithm converge very easily on very high
dimensional problems. So this is actually where we want to study
our problem.
We actually propose an approach which does parallelization based
on knowledge of the structure. And the knowledge of the
structure can be dynamic, because the structure can change during
the execution of the optimization. So here is the idea. I'm
going to have this system called STRADS, standing for structure
aware dynamic scheduler. It is going to constantly examine, you
know, any emergent structures in the high dimensional space. The
structure is defined in a generic way. It could be correlations
between different dimensions. It could be co-updates across
different dimensions, or any other kind of possible behavior that
is kind of tying multiple dimensions together. And then once you
discover this...
>>: This is... the model is just regression?
>> Eric Xing: In this case, it is regression, but it could also
be neural network models. Essentially, it's focusing on the
behavior of the coefficients.
>>: The raw data?
>> Eric Xing: No, it is not the raw data. It is the
coefficients, which are the estimates of the parameters. So,
yeah, that's another difference. We don't prescan the data and
discover structure, because that is not possible if you have
really big data. We really want to do a bootstrap thing; you
come out of nothing, you start to have some estimates, you start
to examine whether there are structures, and you distribute them
accordingly, and then you also dissolve any conflicting
structures in that dynamic thing. So as a result, you have this
dynamically created clustering of coefficients, and they get
distributed onto different workers. And within each worker, you
have highly correlated ones which should be [indiscernible]
together, and across different workers you have decorrelated ones
that can be updated in parallel.
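Here is a toy rendering of that scheduling idea (my own sketch,
not STRADS itself, using pairwise column correlation as a
stand-in for the dynamically discovered structure): prioritize
coefficients that are still moving, keep strongly coupled
coordinates out of the same parallel round, and reshuffle every
round:

```python
# Structure-aware-style scheduling for parallel coordinate descent on LASSO.
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def scheduled_lasso(X, y, lam, rounds=200, width=4, corr_cap=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    col_sq = (X ** 2).sum(axis=0)
    w = np.zeros(d)
    delta = np.ones(d)                            # recent movement of each coefficient
    for _ in range(rounds):
        # "Dynamic structure": favor coordinates that are still changing.
        prio = delta + 1e-6
        cand = rng.choice(d, size=min(3 * width, d), replace=False, p=prio / prio.sum())
        chosen = []
        for j in cand:                            # greedily keep a decorrelated set
            ok = all(abs(X[:, j] @ X[:, k]) / np.sqrt(col_sq[j] * col_sq[k]) < corr_cap
                     for k in chosen)
            if ok:
                chosen.append(j)
            if len(chosen) == width:
                break
        r = y - X @ w                             # shared residual for this round
        for j in chosen:                          # these updates could run in parallel
            rho = X[:, j] @ r + col_sq[j] * w[j]
            new = soft(rho, lam) / col_sq[j]
            delta[j] = abs(new - w[j])
            w[j] = new
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50); w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.05 * rng.normal(size=200)
print(np.round(scheduled_lasso(X, y, lam=5.0), 2)[:6])
```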
And such structure will be adjusted either every iteration or
every couple of iterations to make sure that you have the best
load balance across the whole process. And just to show you the
behavior of this on LASSO, you can see that even for a very high
dimensional problem, even for modest parallelization, shotgun is
really not working quite well. For the numbers here, we tried to
run shotgun on two machines. It is kind of still converging, but
slower than our dynamic scheduler. But if you run them on four
machines, which means a greater degree of parallelization,
actually we couldn't observe convergence. The line just flies
away.
And this graph is, kind of, hopefully a stronger illustration,
where we actually simulate data, okay, and we simulate it in such
a way that shotgun at least can still converge. They are not too
strongly correlated, okay, but you can still make use of the
[indiscernible] structure to hopefully inspire a better
distribution of the task, to inspire parallelism, and you can see
that with our STRADS scheduler, you know, once you increase the
number of cores to, you know, promote a greater degree of
parallelism, we actually see a very interesting, you know,
dropping of the convergence, an increase in the convergence rate,
with the increase in the number of cores.
And in particular, I found this phenomenon very interesting. If
you look at, you know, the shotgun curve, or a curve with lower
parallelism, they're kind of smooth. They seem to follow a
constant convergence path. But once you increase the amount of
parallelism, you actually see that some curves originally
converge at a particular rate and then suddenly drop down to a
different rate and then converge. So why is that happening? I
suspect it's because of the dynamic scheduling: you have a
distribution, but you're not committed to it, and the next time,
once the structure changes, you actually redistribute the tasks,
and then they jump onto a different convergence path, which is
hopefully even better. Therefore, you can very quickly find a
good convergence path and then reach convergence very rapidly.
>>: So what does this redistribute -- which ones?
>> Eric Xing: Redistribution means that you have dimensions.
You put ten dimensions here, another hundred dimensions there.
You are grouping different dimensions to allow them to be
updated within a machine in a correct fashion, right.
>>: So [indiscernible] you have to actually somehow get these
dimensions to the machines. More data.
>> Eric Xing: Yes.
>>: How do you handle that?
>> Eric Xing: There is a central scheduler to do task
distribution.
>>: [indiscernible] my dimension -- then wouldn't this traffic of
getting the dimensions to the nodes be time consuming?
>> Eric Xing: Yeah, that's true. But imagine if you do a
shotgun type of random distribution -- it's the same, because
they also need to collect information back from every dimension.
So this added [indiscernible] isn't substantial traffic.
>>: Can you give an example of the kind of dataset to which you
think this kind of [indiscernible].
>> Eric Xing: Here I'm showing LASSO, which basically, you know,
is doing the [indiscernible]. So imagine in every iteration I
have ten machines, okay, and I have a million dimensions.
Basically, depending on the assignment, ten percent goes to this
machine and ten percent to that machine.
>>: In every iteration, you redistribute the data to a different
machine?
>> Eric Xing: Yes. It is the parameters -- no distribution of
the data in this case.
>>: Only potentially if -- I'm missing something here.
>> Eric Xing: So the --
>>: Yeah, talking about -- you mentioned one million. So if I'm
a worker and I'm assigned to work on seven coordinates, I need to
have the data for these coordinates.
>> Eric Xing: Yes, yes. Good question. Let me tell you the
secret. The truth is, if you look at the Google [indiscernible]
project, okay, they have a vast, you know, partitioning of the
big network into multiple pieces, and they keep the data
[indiscernible] with the different machines. Therefore, they all
have the same data, okay. That's basically the setting here.
Okay. Now, if my data is really big and I need to do a further
partition, then we'll say we're actually going to [indiscernible]
the data on that worker machine as well. Therefore, the update
will be using only the partial data. So joint data
[indiscernible] and joint data-model [indiscernible] is one more
step beyond this. We actually have a result for that as well,
which I'm going to talk about if I have time, but that requires --
>>: Google [indiscernible].
>> Eric Xing: No, I think they replicate the data. If they do
both, then it is -- you can always do a heuristic thing and do
both, but you lose the guarantee, [indiscernible] or not.
>>: The truth is that -- can you still guarantee
[indiscernible]?
>> Eric Xing: Actually, the good news is that there are some
guarantees even if you do that. We actually prove new results
for that.
>>: So in this case, you're talking about a million parameters
on ten machines -- it means each machine has about a hundred
thousand parameters.
>> Eric Xing: Something like that.
>>: So every update, every time you do a check, you have to
send a hundred thousand parameters each to the central server,
and it has to do this massive comparison to find the correlations
between them?
>> Eric Xing: Very good question. Now we go through the
details. Remember, this is a sparse regression question, and
what if you only care about the non-zero ones? Okay. There are
actually not too many. The vast number of dimensions is there
for you to check, but when you actually want to send the
workload, it's not too many, in fact.
>>: You can compare sparse correlations correctly without having
all the -- not only the zeroes -- or -- okay, yeah.
>> Eric Xing: You have the global -- the scheduler actually has
the whole picture. It has no problem -- in fact, what it does is
it will subsample the coefficients and do the comparison on that.
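(A hedged sketch of that subsampling step, with invented names and an
arbitrary threshold -- the scheduler restricts attention to the currently
non-zero coefficients and subsamples them before checking pairwise
correlation, so the check stays cheap even at millions of dimensions.)

    import numpy as np

    def flag_correlated_pairs(X, beta, max_check=1000, corr_threshold=0.6, seed=0):
        """Subsample the active (non-zero) coefficients and flag highly
        correlated pairs; the scheduler avoids putting such pairs on
        different workers."""
        rng = np.random.default_rng(seed)
        active = np.flatnonzero(beta)
        if active.size > max_check:               # keep the check cheap
            active = rng.choice(active, size=max_check, replace=False)
        if active.size < 2:
            return []
        C = np.abs(np.corrcoef(X[:, active], rowvar=False))
        return [(int(active[i]), int(active[j]))
                for i in range(len(active))
                for j in range(i + 1, len(active))
                if C[i, j] > corr_threshold]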
>>: Are these [indiscernible] kind of indicative of when it had
to start redistributing and sort of found the right --
>> Eric Xing: Not really -- we actually have a [indiscernible]
which I didn't show. It gives you a picture of the trade-off
between the computing time and the communication time. They're
roughly the same as -- well, okay. Because the computing is
[indiscernible] efficient, all of these are [indiscernible]. You
don't actually [indiscernible] sequentially wait, right. You are
actually doing the discovery of the dynamic structure while the
computing is happening on the client side. So these times are
not actually sequential.
>>: Okay.
>> Eric Xing: Yeah.
>>: One more question. When you decide you need to redistribute
across machines, is it up to the algorithm how to recombine the
parameters?
>> Eric Xing: It is not up to the algorithms. The algorithms
don't see that. It is inside the scheduler. You can, of course,
override the scheduler's default policy and tell it how to
distribute. That's actually an API we provide. But the
scheduler itself has its own simplest way of doing that. The
simplest way is in the [indiscernible] -- just run and
distribute. That's basically the [indiscernible] implementation
right now.
>>: Okay.
>> Eric Xing: Yeah. But we just want to sell this concept: that
the active search for dynamic structure is more beneficial than
just randomly distributing, and it's also more correct. And this
system is basically an implementation that supports that kind of
operation, without you having to do the actual scheduling
yourself.
>>: [indiscernible].
>> Eric Xing: Yeah, I need to run fast. I still have some
slides.
>>: You have another meeting.
>> Eric Xing: Let me finish up in about five minutes, okay, and
I can skip the third story because that's very simple. So again,
without further ado, I'll just say -- hopefully you'll trust me --
that there is also a guarantee on the correctness of this
distribution for LASSO convergence, which is not actually true
for the shotgun algorithm.
But here is a kind of [indiscernible] comparison. We ran the
experiment on very high dimensional LASSO, to the point that we
reached 10 million dimensions. And you can see that we compared
shotgun, which is basically the GraphLab implementation, with
ours. At ten million dimensions, with different amounts of
non-zero inputs, we are actually converging, but a little bit
slower, I think, than GraphLab. But the truth is that when we
get to an even bigger dimension, we still converge, and the other
algorithm is not even converging for that.
And the graph was meant to show you that this is a real-scale
problem, with a [indiscernible] dimension, which is kind of
nontrivial.
We also have a very recent readout on a preliminary
implementation of the DNN. And here I just want to show that we
get the expected speedup by adding more machines, because the
model is chunked across different machines. Performance-wise,
distributed inference is never quite as accurate as
[indiscernible] inference, but measured by predictive error, we
are not far from the state-of-the-art deep [indiscernible] paper,
and this is [indiscernible].
>>: So this is a classic -- you are talking about the
[indiscernible] task, right?
>> Eric Xing: Yeah.
>>: But these results are --
>> Eric Xing: Oh, this is all [indiscernible]. This is the
[indiscernible] task.
>>: But according to the description, it's the [indiscernible]
classification. Are you using a hidden Markov model?
>> Eric Xing: No, it's not.
>>: You are not using a hidden Markov model?
>> Eric Xing: No, no.
>>: Then it's a classification. So a classification task is
easier than the recognition task, for which you could get a
[indiscernible] result.
>> Eric Xing: Yeah, yeah -- so I [indiscernible] last night.
The point here is just to show that [indiscernible] the
computation speeds up nearly linearly without losing much.
That's all I wanted to show -- basically that the system supports
a DNN implementation. We're not trying to invent a DNN algorithm
or trying to boost performance at this point yet.
>>: But [indiscernible].
>>: Typically, if you want classification results --
>>: The question is [indiscernible].
>> Eric Xing: Yeah, just -- yeah. So here it's basically
benchmarking, you know, the runnability of different models.
That's the whole point. I'm not trying to [indiscernible] a
result ready to deliver.
>>: Because -- are you using a stochastic gradient algorithm or
not?
>> Eric Xing: [indiscernible]. I think it is a stochastic
gradient algorithm.
>>: Okay. So in this case, after each mini-batch, do you --
>> Eric Xing: Can we take -- this is a detailed question at a
low level. We can take it offline, and let me finish delivering
the whole message.
So the insight here, again -- I'm not trying to say which
algorithm or how model-specific it is. I'm trying to say that,
you know, there is another dimension in machine learning you can
exploit to build highly [indiscernible]: there exist
[indiscernible] in the big model, and if you know how to use that
to distribute the tasks, you really get a good gain out of it.
And this claim is not dependent on what algorithm you run or what
model you run, okay.
And this third one, which I think I'm going to skip, tells you
how to actually, again, distribute, say, a million classification
tasks or even more into a distributed system in a very effective
way, using a tricky [indiscernible] partition of the task space
based on a coding idea. But I guess I've run out of time.
So let me conclude with the following observations. I think, you
know, there are a lot of opportunities in doing scalable machine
learning if you are willing to do algorithmic development and
[indiscernible] development together in the same place, and let
them cross-benefit and cross-inform each other: how to write the
best parallel algorithm, and how to build the best system
supporting such an algorithm.
And in particular, our system made very explicit use of this
iterative-convergent behavior, in which, you know, we pay the
cost of introducing error but gain by speeding up the whole
iteration, to a great degree. And secondly, we, you know,
discover structures in a complex model and then parallelize the
components of the model accordingly.
And by using these ideas, which apply to a wide class of machine
learning algorithms, it is very likely that you will get a lot of
benefit in distributed computing.
And just to wrap up, I want to return to this big picture. I
think, you know, we are in the process of building this Petuum
system, which represents a way of connecting the algorithmic
needs and the system resources through a kind of generic and
universal solution of system building blocks and algorithm
building blocks.
And I think the results are promising. And there are a number of
other existing groups making the same effort, and we want, of
course, to be seen in that context. And for that we also show
you a very recent result, just to compare a typical job, you
know, that people always play with in different systems, which is
the topic model inference effort, the LDA one.
So we [indiscernible] the model on a pretty, you know,
substantial dataset with seven billion tokens and 80 million
documents, but we want to make the point that we push the
[indiscernible] setting out of the idealized, scholarly, academic
kind of environment where you focus your effort on a small number
of topics, because your memory can only hold that many topics.
So here, if you run a hundred topics, this red bar is Petuum, and
this is GraphLab. The number shows you the throughput in terms
of million tokens per second. So we're comparable, although we
are about 50 percent better. But what you really want out of
that number of documents is not about a hundred topics. You
really want a lot more topics. And people don't do that not
because they don't like more topics -- it's because if you
increase the number of topics, your memory cannot hold the model.
In our case, our system can already support the inference of ten
thousand topics, and we still reach, you know, a decent
throughput in word processing, and GraphLab cannot even
[indiscernible] because that many topics just blow the whole
thing up.
>>: [indiscernible].
>> Eric Xing: Oh, the most modern version.
>>: Can you compare with that in terms of [indiscernible] or
just the specific [indiscernible]?
>> Eric Xing: What do you mean by [indiscernible]?
>>: [indiscernible].
>> Eric Xing: Yeah, yeah, we just take their tool and the
implementation they built on it. And we even compared with the
Google version of LDA, and you can see that, you know, we are
kind of comparable -- at least we can plot the whole thing on the
same chart. But we want to emphasize that, yeah, there is a
whole team of people building that graph in a very specialized
fashion, and in our case, it's just a regular implementation on
the Petuum system, and we still reach a pretty good speed. And
here again, we didn't compare at the larger topic count, because
for the 10,000-topic case I don't have any results to compare
with.
So that kind of showcases the direction we are driving toward for
Petuum: to support real large-scale, you know, data-intensive and
model-heavy inference tasks, with a reasonable amount of
theoretical analysis and with [indiscernible] interfaces that
people can actually make use of down the road. So with that, I
want to close. Sorry for dragging the talk on so long. This is
an effort involving many of my students in the group, whom I've
circled here, and also collaborators who are, you know, experts
in operating systems and [indiscernible] languages. I don't want
to read their names, just to save time. If you are interested,
you can email me or talk to me offline to find out more details.