>> Yuval Peres: All right. We're very happy that before leaving Tobias agreed to give us a
second version of the talk he gave at the University of Washington earlier. So let's hear about
quasirandom load balancing.
>> Tobias Friedrich: Thank you for the introduction. I've been here for two weeks. It's my last
day here at MSR and I'm going to talk about quasirandom load balancing. I hope the title will somehow explain itself later.
Okay. Since I have seen too many conference talks where I didn't get the problem on the first
slide, I tried to make a visual start to the problem of load balancing. We are given some compute
cluster, which is some network, and we have a number of tokens.
You can call them load or chips, somehow arbitrarily distributed over your network. Your aim is to
balance them out. These are the jobs you have to handle, and you want to balance them out such
that every processor, every node of this network, has roughly the same number of tokens. To make
it easy, one can assume that load is divisible, which means you can split it up into arbitrarily
small pieces. And then this becomes kind of simple.
So let's look, for example, at what happens here if these four tokens divide: they split, they
split, they split. Here you see they split in half, and half of the difference is sent to the
neighbors.
This was like one time step where you balance out with your neighbors. In the next time step, the
pieces get smaller and smaller, so we get some distribution like this one: one and a half tokens
here, one and three-quarters here, two and a half here, and so on. After another step, just to
finish it up, you might have a distribution like that, and after, say, a logarithmic number of time
steps (N is the number of nodes), if this was an expander, meaning that you have some constant
eigenvalue gap, then this is perfectly balanced, meaning the discrepancy, the difference between
the maximum and the minimum load, is a constant.
>>: How exactly do you balance out the load at each step here?
>> Tobias Friedrich: The formal definition comes on the next slide. So this is the problem we are
dealing with. You have some arbitrary distribution of the load on your network, and you want to
balance it out equally, up to some small constant.
Okay. Here comes the definition Yuval was asking for. You assume you have a connected undirected
graph. To make the talk easier I'll say the graph is regular; otherwise this is just the maximum
degree. And this is the diffusion matrix, telling us that we keep half of the load to ourselves
and the rest we split evenly and send to our neighbors.
So if we have degree 2, then a quarter goes to each of the neighbors and we keep half to ourselves.
Okay. We stay in the idealized case, where everything is nicely divisible, and then we say: at
every time step we have some distribution of the load, a vector whose dimension is the number of
vertices, and in one time step the new load vector is the old load vector times the diffusion
matrix, which means that at time T we have the initial load times P to the T.
Okay. So that's a very easy chain if everything is nicely divisible. And what are we actually
sending on a fixed edge, from vertex I to vertex J? Well, the amount we are sending from I to J,
depending on the load at I, minus the amount sent from J to I. So that's the flow we're sending on
an edge I, J.
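As a concrete illustration (my own sketch in Python, not code from the talk), one idealized diffusion step on a small example graph might look like this:

```python
import numpy as np

def diffusion_step(x, adj, d):
    """One round of the idealized (divisible) diffusion: every node
    keeps half of its load and sends (x_i - x_j) / (2d) along each
    edge {i, j}.  adj is a 0/1 adjacency matrix, d the maximum degree."""
    P = adj / (2.0 * d)
    np.fill_diagonal(P, 1.0 - adj.sum(axis=1) / (2.0 * d))
    return x @ P

# 4-cycle (degree 2) with all 4 tokens on one node; repeated steps
# converge to the balanced vector (1, 1, 1, 1).
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
x = np.array([4.0, 0.0, 0.0, 0.0])
for _ in range(50):
    x = diffusion_step(x, adj, 2)
print(np.round(x, 6))  # [1. 1. 1. 1.]
```

The laziness (the diagonal of P) is what makes this converge even on a bipartite graph like the cycle.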
Okay. A small example so everybody knows the model. We have two vertices of degree 4 with loads 16
and 2. Here we want to send 1.75, since the load difference is 14, divided by 2 is 7, divided by
the degree 4 is 1.75. So that's what we want to send here.
And if everything is nicely divisible then we send exactly this. That's the model. And now we come
to the more realistic case, since in the real world you can't split up your jobs into arbitrarily
small pieces. In the real world your jobs are discrete, which means you either solve this problem
or you don't. You can't split it up.
In the real world you can't split up your tokens. Then it becomes a bit more interesting. So the
question is: how do we discretize this divisible flow, the flow we used to send in the continuous
case from one vertex to another?
Well, the first thing you would come up with, I suppose, is this approach of rounding toward 0,
meaning just rounding down. This was suggested by Rabani, Sinclair, and Wanka more than six years
ago -- about 12 years ago -- and it looks as follows. This is what you want to send in the
continuous case; you just round it down.
So this is the integral flow on an edge. Let's look at our example again. We wanted to send 1.75;
you round it down and just send 1. That's the --
>>: You assume this is the direction where it's positive?
>> Tobias Friedrich: Yeah. And we can now describe our process again with equations, which means
the new load is the old load minus what we are sending out, the flow from I to all its neighbors
J. And we use this EIJ -- this is the last notation I'm introducing -- which is the difference
between what we are sending in the continuous case and what we're sending in the discrete case.
This is kind of the excess we have because of the rounding.
So because we can't split our tokens arbitrarily, we make this kind of mistake. This is the excess
load allocated because of the rounding. And with this notation, well, apparently here the excess
is three-quarters: we round down by three-quarters. And we can describe our process as what we
would have done in the continuous case, if we could have split the load, plus the sum of
everything we didn't send because of the rounding. That's a simple equation.
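In code, the round-down discretization and its excess can be sketched as follows (a hypothetical illustration; the function name is my own):

```python
import math

def round_down_flow(cont_flow):
    """RSW-style discretization: round the continuous flow toward zero.
    Returns (integral flow, excess e = continuous - discrete)."""
    disc = math.floor(abs(cont_flow))
    disc = disc if cont_flow >= 0 else -disc
    return disc, cont_flow - disc

print(round_down_flow(1.75))   # (1, 0.75): send 1, excess three-quarters
print(round_down_flow(-1.75))  # (-1, -0.75): same in the other direction
```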
You can analyze this. Now comes the part where randomness enters the algorithm design, something
like that. The obvious question is: rounding down each time we have 1.75 to send feels like a
waste. If we actually want to send 1.99 from here to here, rounding it down -- come on, this can't
be the fastest way. So the natural randomization would be randomized rounding: we choose the
integral flow as the randomized rounding of the continuous flow. So not just always rounding down,
but doing randomized rounding. Randomized rounding means, for our running example of 1.75,
rounding up with probability three-quarters and rounding down with probability one-quarter. So the
rounding depends on the fractional part.
Okay. And this has some nice properties. We see that the expected flow is exactly the continuous
flow. So in expectation we're doing exactly what you do in the idealized model, which is like the
fastest and quickest thing you can do, splitting everything up as finely as you can. And the
expected excess, what we do wrong in expectation, is zero as well. So in expectation we're doing
the optimal thing.
Well, if you use randomization, first, it's harder to analyze. And second, in this case it could
happen that you send out more than you actually have. If on every edge you want to send out 1.5 in
the continuous case, and you round independently at random, then it can happen that you round up
on every edge and send out more than you have.
So you might end up with negative loads in the middle. But that's not too bad; you can handle
that. I mean, there are some issues with this randomization, but it's a nice model anyhow, I
believe.
But now, after Sinclair's talk, we get back to Jim's talk about quasirandomness. So we have a
random process where we do randomized rounding of the continuous flow.
What's the crucial property here? The crucial property is that the expected excess, the expected
rounding error I make, is zero or small.
So what we actually want to achieve with our randomized rounding is that these accumulated
excesses remain small: when we round, we don't always round down and always send less than we
actually should. So let's look at the following derandomization, the following model, bounded
error diffusion.
We say we choose the integral flow such that the accumulated rounding errors -- summing up these
Es over time -- are small, which means I look at an edge and I make sure that I'm not rounding in
one direction every time. If the roundings drift too much in one direction, then next time I
should round in the other direction. So we want to bound these errors.
And then of course I can define these excesses, continuous versus discrete.
In our running example, we want to round the 1.75 as we are used to, and we want to keep these
accumulated rounding errors below, say, 1. So now we have to decide whether we round this 1.75 up
or down.
Well, if we round it up, we have an error of a quarter. If we round it down, it's an error of
three-quarters. So we can say we round it down, which means we choose to round it down and get an
error of three-fourths.
And in the next step, again, we want to send one and a half in the continuous case. But now we
have already rounded down by three-quarters. If we rounded down again, we would add up a bias in
that direction, accumulating these excesses up to 1.25. So we have to round in the other
direction, round up this time, so that we get a negative excess.
Then the sum is just .25 and not 1.25, and it stays smaller than 1. So that's the rule. Okay.
That's a very general framework, because I didn't tell you how I actually choose how to round. And
I guess you already have an idea how you can do that. I call it quasirandom diffusion; we can
discuss after the talk whether you like the title or not. The easiest way is to say we round up or
down such that this accumulated rounding error is minimized. And I call this quasirandom because
it deterministically imitates the property of the random process of having a small accumulated
rounding error.
This corresponds to a bounded error diffusion from the previous slide with the error kept below
one-half, and it can be implemented with log of the degree storage per edge. And that's why I call
it quasirandom, as I was about to explain.
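The minimizing rule can be sketched like this (my own code; note that on the running example it rounds the 1.75 up first, keeping the accumulated error within one half, rather than down as on the previous slide where the bound was 1):

```python
import math

def quasirandom_round(cont_flow, acc_error):
    """Bounded-error rounding: choose floor or ceil of the continuous
    flow so that the accumulated rounding error (continuous minus
    discrete, summed over time on this edge) stays as small as possible."""
    lo, hi = math.floor(cont_flow), math.ceil(cont_flow)
    err_lo = acc_error + cont_flow - lo   # accumulated error if we round down
    err_hi = acc_error + cont_flow - hi   # accumulated error if we round up
    return (lo, err_lo) if abs(err_lo) <= abs(err_hi) else (hi, err_hi)

# Running example: continuous flows 1.75 then 1.5 on the same edge.
acc = 0.0
f1, acc = quasirandom_round(1.75, acc)  # rounds up to 2, error -0.25
f2, acc = quasirandom_round(1.5, acc)   # rounds down to 1, error +0.25
print(f1, f2, acc)  # 2 1 0.25
```

The accumulated error never leaves [-1/2, 1/2], which is the bounded-error property the analysis needs.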
Now to the results. It's slide 22, a bit late for that. Well, I want to show the models first. We
just look at two graph classes. The first one is the hypercube; you know what a hypercube is. And
for the hypercube it's known that this idealized process, where you can split your jobs into
arbitrarily small pieces, reduces a discrepancy of K down to a constant within a log times log
number of steps.
And what you usually do is compare how much the discretization makes things worse, makes them
slower. So we want to compare the deviation, at all times and at all vertices, between the
idealized process and the discrete process.
And it's known that for this deterministic rounding down, this deviation is of order log cubed.
>>: This discrepancy [inaudible], can you remind us how we measure it?
>> Tobias Friedrich: This is the difference between the minimum and the maximum load. Okay.
So we see that our deterministic process is pretty close to the continuous process. That's what we
want to compare. Now we introduce this randomness, which means we don't round down all the time,
but sometimes round up as well.
Then, with high probability, we get a bound of order log squared. So by introducing this
randomness, we get slightly better. And now comes the title: if you remove the randomness again,
then you get even better. Here you see the standard deterministic rounding down, this is the
randomized log squared, and if you do this bounded error diffusion, or quasirandom rounding,
keeping the rounding error small, you get it down to a log.
>>: How can it be faster?
>> Tobias Friedrich: This is the difference. So this slide is measuring two things. This is the
number of steps, which you need in any case -- I mean, you can't avoid it, it takes at least as
long. And here we just compare the two processes.
So I just added this first item so that one can match it: after that time you get that discrepancy.
>>: As long as the final answer is an integer, why do the intermediate-step fractions matter? Your
intermediate steps had negative numbers, which are not obviously better than having fractions.
>> Tobias Friedrich: Yeah, but if you have an actual job being sent around in your network, what
does it mean to send half of a job?
>>: It's a computational step. I thought you compute until the job goes where it goes and then you
start --
>> Tobias Friedrich: No, I actually send it. In each step it's really sent, and then the loads are
different. And I want this process, which is somehow guided by this local view of the rounding
errors on my edges, to converge fast globally.
>>: Then you could end up with a negative number of jobs.
>> Tobias Friedrich: This negative thing is not so bad, since you can just add some virtual tokens
which have no real load, and you stay close enough to the idealized process that you don't get
into the negative range. People seem not to bother too much about this negative problem. But of
course I had to raise it, since I'm about to sell the last one. Is that clear?
>>: So can I offer other suggestions?
>> Tobias Friedrich: Of course, there are various other ways of randomization and various other
ways of quasirandomization. For example, round down on all your edges, then look at how many
tokens remain and distribute those at random, or in some cyclic order; there are loads of ways one
can think of. But you still have to be able to analyze it. That's the other thing.
I mean, we really can discuss other models. Okay. So this was the behavior --
>>: Sorry. So this log N was kind of -- so that's nice. So is that why this is the right
particular way of doing it?
>> Tobias Friedrich: Thanks for asking this. In fact, this is theta for this algorithm, and I
don't think you can do better. And since I'm an honest person, I'll also tell you where the lower
bounds stand. It looks really nice, O of log squared versus O of log, but actually the only thing
we could prove here is a [inaudible] log lower bound, and at least here you can prove a big gap,
an omega of log squared.
>>: The intuition is that when you do the idealized thing, it's like the distribution of a random walk.
>> Tobias Friedrich: Uh-huh.
>>: So N here is -- not the dimension of the cube, it's the number of vertices?
>> Tobias Friedrich: Yes. So this is basically D cubed, D squared, where D is the dimension. I
want to avoid this D because it's somehow ambiguous whether it's the dimension or the degree.
>>: So it looks like the ideal thing is just the distribution of a random walk.
>> Tobias Friedrich: Yes.
>>: And somehow the random walk on the hypercube mixes in log N log-log N. That doesn't seem to
correspond to what we have there.
>> Tobias Friedrich: Well, you have this initial discrepancy as well. And as long as it is
polynomial, you have log squared as a bound on the time for the continuous process to reach a
constant difference between the maximum and minimum load.
So that's classic, and I'm just working on the deviation. I mean, whatever bound you have on the
idealized process, I can tell you: well, I have the same plus this deviation.
>>: But this one is sharp, the one that --
>> Tobias Friedrich: Yes.
>>: Why doesn't K get into the question for the quasirandom one? I take K to be N to the N to the N.
>>: Because it's rounding, so it doesn't scale; you always have an integer. Why does that matter?
>>: The error between the idealized and the rounding.
>>: The idealized has a K in the expression. Right. But the error is the difference between them.
>> Tobias Friedrich: We are talking here about the discrepancy, which is here K versus the
constant. This runtime doesn't appear here at all; I don't care about it. I just measure the
difference at all times between the two processes.
>>: A lot of things here are not times.
>> Tobias Friedrich: Just discrepancy, which means the difference between the minimum and the
maximum load.
>>: I'd like to suggest a number of processes by --
>> Tobias Friedrich: I mean, I don't know how we should handle that. [laughter] I'd like to
discuss other processes, but --
>>: Let me try to do it quickly. You have two nearby vertices with loads 16 and 2. You average and
you say 1.75 goes left to right. How about keeping it separate: the 16 sends, say, two to each
neighbor, so that's two from left to right, and the 2 must send .25 from right to left. Or with
2.5, for example, doing it round robin, one at a time, and so on. And then you take the difference.
>> Tobias Friedrich: That's a very nice protocol that I'm going to analyze in the summer or
something, but the problem is that for all these protocols it's kind of crucial that you have an
edge-based view.
>>: Edge-based?
>> Tobias Friedrich: An edge-based view. What you're describing is a vertex-based view: the vertex
decides I'm sending this amount, he's sending this amount, and then the edge just sends the
difference. But these techniques are somehow bound to an edge-based analysis.
Okay. But I wanted to mention another graph as well, which is similarly simple: the Torus, the
D-dimensional Torus with D constant. It's known how fast the idealized process gets down to a
constant, and again we are only interested in the deviation between the idealized and the discrete
process. It's known that this deviation is polynomial for always rounding down.
It's also polynomial for this randomized rounding. However, it's a constant for this quasirandom
approach, and this constant of course is sharp. Here you can prove a lower bound of some polylog,
and this is sharp as well.
So here you have a proven gap between these two guys -- well, not a proven gap between these two,
but at least there is a huge gap between constant and not constant.
Okay. So now we have hopefully understood the model and the problem, and now we want to see how
you prove something for these models.
First I want to present my favorite lemma. It's very simple, very easy to state, easy to prove,
and very handy. I think I've had four or five papers where I used this lemma in some form.
So what does it say? Well, you have some unimodal function mapping to R, which means it goes up,
then goes down. And you have these points X1 to X14, some fixed points, and you have some sign
sequence, plus, minus, plus, minus, assigning each of these points a plus or a minus. Then it's
easy to see that this signed sum can be bounded by twice the maximum.
Why? If you split this unimodal function at its extremum, then on each side the sum telescopes:
this minus this, so I just add up these stripes, and the whole sum on each side is bounded by the
maximum of the function. Left and right, and this is why you get two times the maximum. It seems
to be a well-known lemma; it's somehow not used that often. But you can generalize it a little.
But if you have a good citation for it, I would be very open to hear it. Okay. You can generalize
it a little to L-modal functions: if it has just L extrema, it's basically the same. And you can
also generalize it to not having this plus-minus sign sequence but having some sequence of
coefficients whose partial sums are small.
Then, if these partial sums are bounded by lambda, you can bound the sum by the maximum. And this
is kind of nice, because if you know that your function is unimodal, and you have these
coefficients with small partial sums, then the sum becomes a maximum. And the sum usually is much
larger than a maximum. So this is a very good bound if your function is unimodal.
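The lemma is easy to check numerically. A small sketch (my own code, using a smooth unimodal test function):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """A unimodal test function: increases up to 7, then decreases."""
    return np.exp(-(x - 7.0) ** 2 / 8.0)

# Alternating +/- signs on sorted points: the signed sum telescopes on
# each side of the peak, so it never exceeds 2 * max f = 2,
# no matter how many points we use.
for _ in range(1000):
    xs = np.sort(rng.uniform(0.0, 14.0, size=rng.integers(2, 40)))
    eps = (-1.0) ** np.arange(len(xs))
    assert abs(np.sum(eps * f(xs))) <= 2.0 + 1e-12
print("bound |sum eps_i f(x_i)| <= 2 max f held in 1000 random trials")
```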
Okay. That's my favorite lemma I want to sell. Now let's see how to apply it.
>>: You go with some information [inaudible].
>> Tobias Friedrich: Okay. And now we want to see how to apply that for the hypercube. We have
some discrete process, we have these accumulated rounding errors bounded by some lambda, and we
want to prove that the deviation is at most lambda times log N.
We can describe this deviation between the idealized and the discrete process by this sum; I'm not
going to tell you why. We're summing over time, summing over the edges, then here we have this
excess, and then this is the probability of going from zero to I minus the probability of going
from zero to J. And you can bound this by two times this sum. And now we see something which looks
familiar.
To this sum we actually want to apply our favorite lemma, because we know that the partial sums of
these coefficients are small by assumption. And if we also knew that this guy here is unimodal,
then we could say this sum becomes the maximum, and then some combinatorics would give the lambda
times log bound; I have to know the maximum of this probability, but I can bound that. The main
thing to show is that this probability here is a unimodal function, which sounds like it should be
known. We will discuss this issue.
Okay, so it remains to prove that the probability of going from zero to I in this hypercube is
unimodal in time. We can describe this as follows. We know PIJ of T, the probability of going from
I to J in T steps, and I'll introduce FIJ as the first passage probability: the probability that a
random walk started at I is at J for the first time at step T.
Then we can describe this PIJ as a convolution: to go from I to J is the same as going from I to
J, being at J for the first time, times doing an arbitrary number of loops at J.
That's an easy observation. And this guy is obviously unimodal: going from J to J is unimodal in
time. Why? It's monotone, since in zero steps the probability of going from J to J is 1, and in
more steps the probability gets smaller and smaller.
>>: So this is just for the lazy --
>> Tobias Friedrich: Yes, that's important. Now, if this guy here were log concave: the
convolution of a unimodal and a log concave function is unimodal again. So it remains to prove
that this first passage probability, going from I to J and being at J for the first time, is log
concave in time.
Now we can observe the following. So far we looked at the random walk on the hypercube, but a
hypercube just consists of certain layers, depending on the number of 1s in the vertex. So you can
describe a random walk on a hypercube as a random walk on a path, just with adjusted probabilities
of going backwards and forwards. This is basically the same as a random walk on a path with
appropriate probabilities.
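This projection is the (lazy) Ehrenfest chain. A sketch of the resulting birth-and-death transition matrix (my own illustrative code):

```python
import numpy as np
from math import comb

def layer_chain(D):
    """Lazy random walk on the D-dimensional hypercube projected onto
    the number of ones: a birth-and-death chain on {0, ..., D} with
    P(k -> k+1) = (D-k)/(2D), P(k -> k-1) = k/(2D), P(k -> k) = 1/2."""
    P = np.zeros((D + 1, D + 1))
    for k in range(D + 1):
        P[k, k] = 0.5
        if k < D:
            P[k, k + 1] = (D - k) / (2 * D)
        if k > 0:
            P[k, k - 1] = k / (2 * D)
    return P

D = 5
P = layer_chain(D)
# The binomial distribution is stationary, as it must be: the layer
# sizes of the hypercube are the binomial coefficients.
pi = np.array([comb(D, k) for k in range(D + 1)]) / 2 ** D
print(np.allclose(pi @ P, pi))  # True
```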
So it remains to prove -- we heard that a couple of times -- that this first passage probability
on a path is log concave. Okay. Then we have a very classic result of Karlin and McGregor --
McGregor being, by the way, the father of Colin from UW -- who proved it for the continuous case;
only recently was it proved for the discrete case as well. So one knows that the first passage
time is distributed as a sum of independent geometric random variables. These are all log concave,
and a convolution of log concave distributions is log concave again. So that's done, kind of.
Well, we can prove this ourselves as well, because that paper came out just two weeks before the
other work. So let me sketch -- I made it a bit shorter -- how one would prove that this first
passage probability is log concave. First we observe that going from 1 to D on the path is the
same as going the first time from 0 to 1, then the first time from 1 to 2, and so on. Then we can
do a z-transform and get a product of these probabilities. Now we can look at these probabilities
and see where their roots are, and we get an expression which is some constant times this guy. And
if you look at this guy long enough, going backwards gives a geometric distribution with the right
parameter. Now we have a convolution of geometric distributions, each of which is log concave; a
convolution of log concave distributions is log concave, so this probability is log concave.
This was the proof for the hypercube.
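One can check the log-concavity claim numerically for a small path. A sketch (my own code; it computes first-passage probabilities by removing absorbed mass and tests f_t^2 >= f_{t-1} f_{t+1}):

```python
import numpy as np

def first_passage(P, start, target, T):
    """f[t] = Pr[a walk started at `start` is at `target` for the
    first time at step t], computed by removing mass that has arrived."""
    n = P.shape[0]
    p = np.zeros(n)
    p[start] = 1.0
    f = np.zeros(T)
    for t in range(1, T):
        p = p @ P
        f[t] = p[target]
        p[target] = 0.0  # absorbed mass must not be counted again
    return f

# Lazy (loop probability 1/2) walk on a path with 6 vertices.
n = 6
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    if i > 0:
        P[i, i - 1] = 0.25
    if i < n - 1:
        P[i, i + 1] = 0.25
P[0, 0] = 0.75          # reflect at the left end
P[n - 1, n - 1] = 0.75  # reflect at the right end
f = first_passage(P, 0, n - 1, 200)
print(all(f[t] ** 2 >= f[t - 1] * f[t + 1] - 1e-18
          for t in range(1, 199)))  # True: log concave in time
```

The laziness matters here too: it makes all eigenvalues nonnegative, which is what the sum-of-geometrics decomposition needs.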
>>: So you can't write a formula for PIT?
>> Tobias Friedrich: Well, you can, but you don't see the interesting properties. [laughter]
That's what one might think, though. So one has to do a bit more. And I thought it must have been
known before that this probability is log concave. But I can come back to this question later.
We've seen how it works on the hypercube. And now we want to see --
>>: [Inaudible].
>> Tobias Friedrich: Wait for my last slide. How would you do the same for the Torus? Just a quick
sketch. Again we want to prove that if the accumulated rounding errors are bounded, then the
deviation between the continuous and the discrete process is small.
So again we start with the sum of this excess times the probability. And now we have a random walk
on a Torus with loops, a lazy random walk.
We can describe this lazy random walk with loop probability one-half on the Torus as a random walk
without loops on a 2D-dimensional grid. So if this is your Torus, you mirror the Torus in all
directions and you get a grid, and the extra D dimensions do the job of the loop probability
one-half: you just map them to the same point.
So you can describe a lazy random walk on a Torus as a non-lazy random walk on a grid of twice the
dimension. Now we have these probabilities P bar for the 2D-dimensional grid.
>>: The Torus wraps itself around the grid to a degree, doesn't it?
>> Tobias Friedrich: The grid is infinite. We simulate this on the grid by wrapping the Torus
again and again. So if the grid walk is here, then on the Torus it's the same point.
So the probabilities on the Torus are sums over positions in the grid.
>>: Are you -- the Torus is D-dimensional?
>> Tobias Friedrich: The Torus is D-dimensional. I need a 2D-dimensional grid to have this loop
probability of one-half incorporated. These extra dimensions I just ignore, which doubles the
probability of staying put. Okay. So we can basically bound our deviation now by this sum, but
this is just a non-lazy random walk on the grid, which is much easier. And there's this beautiful
book by Lawler -- he also has an updated version on his home page, a very, very good book -- with
nice local central limit theorems that say: this probability of a random walk going from zero to I
on a D-dimensional, or in this case 2D-dimensional, grid is roughly this multivariate normal
distribution, up to some small errors.
And this means this expression is basically this multivariate normal distribution plus some small
error. And then this nice proof by Cooper and Spencer for these deterministic random walks comes
in handy; they used a similar approach, and they already showed -- so we didn't have to do it --
that this difference actually has just six local extrema. Now the favorite lemma comes into play:
we apply our lemma to bound this guy here; we also have to bound this small error term, but one
can do this, and we're finished with the Torus proof.
Okay. To sum up, my open problem for you: for which other graphs is the transition probability
between two fixed vertices unimodal in time? For safety's sake, I mean the lazy random walk, and
as we have seen for the Torus, some approximate unimodality suffices, meaning the function is so
close to a unimodal function that the sum of the errors is still small. This holds for the
hypercube and for the Torus, and maybe one can show similar things for random graphs or random
regular graphs. I tried that, but so far no success.
I believe it's something more general.
>>: You did show us it's unimodal on the Torus?
>> Tobias Friedrich: No. Well, it's somehow hidden in the approximation. It's K-modal, for K
constant.
>>: Limited.
>>: But then the error is six --
>>: [Inaudible] unimodal.
>> Tobias Friedrich: Yes.
>>: Because it's smaller.
>> Tobias Friedrich: Yes. This is what I mean by approximately. And --
>>: Do you know whether it's actually unimodal on the Torus?
>> Tobias Friedrich: I don't. I don't. I know this also holds on the regular tree.
>>: But for the continuous-time random walk [inaudible] it's more natural.
>> Tobias Friedrich: I mean this as an open problem. It always feels to me that it should be known
for other graphs, but it was apparently hard enough to show it for the hypercube, and
approximately for the Torus. It would immediately give bounds for the quasirandom load balancing,
but it's an interesting question.
>>: But you showed us a little more, this log concavity; the unimodality alone is easier,
[inaudible] a couple of years ago from [inaudible].
>> Tobias Friedrich: Okay.
>>: The unimodality -- the unimodality for birth and death chains, and then [inaudible].
>> Tobias Friedrich: For the hypercube that applies, yeah.
>>: But there we didn't -- we didn't discuss log concavity.
>> Tobias Friedrich: Because that just used unimodality, whereas we needed the log concavity to
have the convolution.
>>: This was in directive of P.
>> Tobias Friedrich: Because the convolution of a log concave and a unimodal function is unimodal,
but the convolution of two unimodal functions is not necessarily unimodal. But thanks. [applause]
>> Tobias Friedrich: Jim, you had another question?
>>: Yes. So you said "another." Is that like a hint?
>> Tobias Friedrich: You didn't.
>>: So one important feature of the Cooper-Spencer results is that they're limited to bipartite
graphs, and that's at an important technical level, something I've never understood. But here it
doesn't look like you're using the bipartition. In fact, you're doing the lazy walk, which
suggests it's not important. But all your examples are bipartite.
>> Tobias Friedrich: That's a big difference from these deterministic random walks, because there
you have a problem with the bipartiteness; we can discuss that later. But here it can be solved by
the laziness. The lazy part saves the day here, and that's important. Otherwise, on a bipartite
graph, the probability alternates, one then larger than zero then zero; with a normal, non-lazy
random walk you're lost because of that parity of the probability.
But one could still use part of their results.
>>: But this should apply, as far as you can tell, to graphs that are non-bipartite?
>> Tobias Friedrich: Yes. The other examples were just too simple to include non-bipartite graphs.
Okay. Other questions? If not, then thanks again. [applause]