>> Dave Maltz: So it's a great pleasure to introduce George. George is an old
friend -- I've known George for years. He's -- I think you'll find -- a master
expositor and a fantastic teacher. And he wrote an amazing book that I recommend
to everybody called Network Algorithmics. It's organized in a really special way,
around principles, and it's not like any other networking book out there that I've
ever seen.
And, you know, George is known for lots of mechanisms that are implemented in
every switch, like deficit round robin. He founded a company, NetSift, which was
acquired by Cisco and did the worm detection and stuff; it went into Cisco's
gear. So George has had a huge impact on the networking industry, both in
education and in real products. This particular problem he's going to talk about is
super dear to our hearts here at Microsoft.
This performance isolation is something that Azure absolutely has to solve, and
any enterprise datacenter has to solve as well. There's work on it at MSR,
there's work on it at UCSD, so it's super timely. And we're really keen to find out
how George attacks the problem. Thanks, George.
>> George Varghese: Thanks. Thank you. So I'm afraid I have only 15 to 20
slides -- I should probably have made a few more, but I ran out of time with all
the things I was trying to do. I think the basic ideas will be clear, and where I've
forgotten the experiments I have the paper here, so I'll go back and ask, now,
what were the parameters? I'll be groping at the graphs and saying, now, what
was that? So we'll figure it out together. Okay?
So I guess the problem is very simple conceptually, right? We'd like to virtualize
datacenter networks across services. The setting that we address is somewhat
different from the Azure one, so I want to make sure you understand it: our
setting is enterprise datacenters, right? So enterprise datacenters -- what people
sometimes call private clouds, as opposed to public clouds, okay? A private
cloud -- let's say a FedEx or a Pfizer -- typically has been sold the vision that,
instead of keeping separate servers for all your departments, like engineering and
accounting, you should be able to save a lot of money by consolidating your
servers onto VMs, right? So that's the trend that we know from VMware and the
VMware [inaudible].
So that's happening, right? And as that happens, storage is also being somewhat
virtualized, because a given physical disk can always be partitioned and broken
up among people. But now you're beginning to see things. If you look a little
carefully into the trade press, you'll see people saying that during VM backups,
or when strange things are happening, these VMs are interfering with each other,
right?
So in some sense the network -- you know, virtualization means many things,
and you have to define what it means, right? But one kind of instinctive meaning
is separation. You would like these things to be separate. And so really the goal
we are trying to address here is to actually build what people have called a
virtual datacenter, right? So what's a datacenter? You have resources like CPU,
disk, and memory, and then you also have a network, right? In some sense people
have made good solutions for memory, disk, and processing, and so we would like
to really define what it might mean to share a network.
And so part of this talk -- and this is all very early work. Balaji [phonetic] has
been doing work; Albert, Srikanth [phonetic], and Ming have been doing work. So
I think everybody is groping for a definition. So don't take any of this very
seriously. You might come up with a better definition. But at least you have to
start somewhere. So that's the context for this.
All right. So definitely modern datacenters are built at very large scale,
thousands of servers, and they execute a large number of applications. And so
people are already virtualizing resources to reduce cost: rather than having a
datacenter for engineering and a datacenter for accounting, you consolidate. And
people really like this because you also get agility, because you can move your
VMs around. Okay. So that is kind of nice.
And so you have all these existing technologies which we've talked about: Xen
for compute, SANs for storage, and people are fooling around with memory too.
And really a lot of this is about resources on a single machine. Right?
So the interesting thing that we'd like to look at is bandwidth. And now what you
have to do is divide it across a set of machines. Consider VMware, right? You
take your single server and -- I think Amazon has some number like, say, 10 VMs
allocated per server. It's pretty easy to see the model, right? In the worst case,
if all 10 VMs are actually active -- which is very unlikely -- you'll get one-tenth
of the CPU.
If none of them are active, you could get all of the CPU. But actually providers
tend to be a little alarmed by you getting all of the CPU and playing games, so
they tend to have some limit on how much CPU you get. They won't give you all of
it. So they have a min guarantee and a max guarantee. So that's some kind of
guidance as to the way people do it -- but that's on one machine, right?
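A minimal sketch, not from the talk, of the min/max model just described: each of
the active VMs gets an equal share, idle VMs free up capacity, and a configured
max cap keeps any one VM from taking everything. The function and parameter names
are illustrative.

```python
# Sketch of per-machine min/max CPU sharing (illustrative, not from the talk).
def cpu_shares(active_vms, max_share=0.5):
    """active_vms: ids of currently busy VMs; returns vm -> CPU fraction."""
    if not active_vms:
        return {}
    fair = 1.0 / len(active_vms)          # idle VMs' shares are redistributed
    return {vm: min(fair, max_share) for vm in active_vms}

print(cpu_shares(list(range(10))))  # all 10 active: 0.1 each (the min guarantee)
print(cpu_shares([0]))              # alone: capped at 0.5, not the whole CPU
```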
Imagine now you had to do it across multiple machines. You have to kind of
define what it means. So is bandwidth a bottleneck? These are numbers from VL2,
I think -- I don't know where this should have been credited, but these are some
slides that Terry made up. And so there's lots and lots of traffic between the
servers in the datacenter, right? There is stuff leaving and entering. We are
really worried about the bandwidth within the datacenter, okay? We're not talking
about ISPs, we're not talking about the outside stuff. And so the network may be
a bottleneck for computation -- at least that's an assumption that we are going to
hold throughout here. Okay? And certainly, with a little bit of Googling, you'll
find these disturbing signs of people saying, oh, I virtualized everything and my
app was doing just fine, and suddenly the other guy's VMs started backing up and
I took a hit. So people are beginning to see this. All right.
So basically we would like to have this notion of bandwidth virtualization where --
and I think you can think in terms of Microsoft, for example, right? You have a
bunch of properties. You have things like Bing, you have, you know, like -- what
do you have? You have -- [laughter] e-mail. What is it called? MSN. Hotmail.
I'm sorry. I don't use these things. [laughter]. I'm on record too here. Victor's
looking at me.
You know, I'm so used to using one of your competitor's e-mail that I had to, you
know, really keep groping for the names. Right. So you have a bunch of
properties. And so you would like to give each property -- each application -- the
illusion of owning a virtual datacenter: separate CPU, done; disk, done; memory,
done; and the network is [inaudible], okay?
Now, one of the most important things we believe in -- and it's a little arguable,
right -- is that you would like to have statistical multiplexing on the network
bandwidth. If you don't want it, it's really easy, right? Go and reserve bandwidth
on everything. And to some extent reservations are possible. People know how to
do it, right? You can just reserve bandwidth at every link. But when you start
trying to statistically multiplex, the game becomes more challenging, right?
So we kind of know how to do it on a single link. There's something called
[inaudible]. But to do it across multiple links, you need a definition first,
right? Yes?
>>: [inaudible] resources out here that need to be virtualized [inaudible] on that
list?
>> George Varghese: That's not the complete list at all. Good point. Right? It's
a [inaudible]. Because you might be using an equal amount of CPU as another app
but using a lot more power; it's not at all clear those two track each other.
Generally I think the assumption -- and it may be a simplifying assumption -- is
that if you use the same amount of disk and memory and CPU, you're probably using
the same [inaudible]. Maybe that simplifying assumption doesn't hold; if power
can't be derived from these other measures, then maybe not. But you're right,
this is not a complete list. It's the first-cut list that one would pick, right?
And power is a very interesting one -- how do you divide your power budget?
So I guess we really would like statistical multiplexing. And if you look at
disks, clearly you don't statistically multiplex: you can't have somebody
[inaudible] somebody else's disk -- if they are using it and you write into their
bits, that's not possible, right? But certainly it's true for VMs, right? When
one VM is not running, the other VM can get at least some of the CPU that the
other one isn't using. Right? So, go ahead.
>>: So the reason [inaudible] right, I mean unless you're using [inaudible] so if
you do, right, if you [inaudible] so they're not completely made for that.
>> George Varghese: I agree. I agree. But what we'd like to do is give each
property a certain minimum bandwidth, and maybe, if they're lucky, they'll get up
to a certain max.
So okay. Okay. So with all of this, what's our -- why is bandwidth virtualization
hard? Because now you have to do it across multiple links. Okay? So now you
have all these QoS mechanisms, right? And they look like they kind of solve the
problem. There's something called router scheduling, where you have mechanisms by
which you can give somebody a fair share of one link, right?
So you take the link. And today, I think many routers will basically allow you to
write a classifier and map traffic onto a certain number of buckets -- maybe a
hundred, maybe 10 or 20 -- and all those guys can be given a share of a link,
right? And you can give them weights, so you could have two-fifths of the link
and I could have one-fifth of the link. But now you have to do it across multiple
links. And it's really confusing, because there's tons of work, right? There's
traffic engineering -- but what does traffic engineering do? We certainly spent a
lot of time reading a lot of papers, including his.
And as far as I understand it, it does a better job of routing admitted traffic,
right? It doesn't really have this notion of allocating across multiple things.
So, for example, if there is a DoS attack that comes into your network, traffic
engineering will do its best to route it on the less utilized paths, but it
doesn't prevent it from taking over the share of other people. So yes?
>>: [inaudible] statically allocated slice or dynamically --
>> George Varghese: Good question -- that's where statistical multiplexing comes
in, right? We come to the model in a second. So after all [inaudible] right?
Okay. So let's start with a model, right? So what is our model? Let's just draw
this network, right, which is a physical network. Very simple, right? A two-tier
network, where there are four switches at the bottom and there's one core switch.
We could make it more complicated, but let's start with this.
But let's also have these different properties, which are colored. So you have
A1, which is the green property, which has a VM here and a VM here; A2 has a VM
here, here, and here; and A3 has a VM. So they overlap in some strange ways. And
I'm going to assume that whenever I show something like this, all pairs want to
communicate.
So on this machine -- by the way, these are the switches; I haven't drawn the
machines -- it means a machine here running A1 wants to communicate with the
machine here running A1. But in a case like A2, you have one connection between
these two machines and one connection between these two machines. Does that make
sense? Is this picture clear? Because I haven't drawn the hosts, to keep the
picture simple. I've only drawn the edge switches and the core switch. Right?
So now, though, I would like to have some kind of manager decide some kind of
weights: the green guy is four times as important as the others, and the blue and
red are equally important. Now, how do you do that? Okay. So I think weights are
very important, because if you talk about properties, a very natural way to
allocate bandwidth is by revenue, right? And if you don't have properties -- if
you just have engineering and accounting -- companies are very familiar with cost
accounting, right? It's a very natural way to say, okay, you earn more, you pay
more, in an internal cost accounting sense. So you need some kind of lever,
right? And we're looking for the simplest possible lever. We don't want a lot of
bells and whistles, just a simple weight. But this is weights across the network,
right? So what does it mean? It's not entirely clear how these guys share. So
let's take an example, right? Imagine that A1 and all of them want to send at
full throttle, right? They just want to dump into the network. All connections
have enough data to send; they're all trying to complete, you know, some big
MapReduce.
Well, let's start with this link over here, right? On this link, the only
competitors are A1 trying to send here and A2 trying to send here, correct?
Right? But what happens is -- oh, I'm sorry, I screwed up actually. This should
have been the -- right. Okay. So maybe this is right. So when they compete,
they share in the ratio 4 to 1, so therefore this guy should get 8, right? So
what we'd like to do is decompose this picture into a green network, a purple
network, and a red network. And we'd like to give bandwidth labels to each.
That's a simple model. It's not the best model, because we're not hiding
topology. Okay. But we are exposing locality.
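As a small worked sketch (illustrative names, not from the slides), the per-link
decomposition is just a weighted share of capacity:

```python
# Weighted share of one congested link (illustrative, not from the talk).
def link_shares(capacity_gbps, weights):
    """weights: property -> weight for the properties active on this link."""
    total = sum(weights.values())
    return {p: capacity_gbps * w / total for p, w in weights.items()}

# Weights 4:1 on a 10G link reproduce the 8 / 2 split in the example.
print(link_shares(10, {"green": 4, "purple": 1}))  # {'green': 8.0, 'purple': 2.0}
```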
And at least currently in datacenters there's a huge difference between local and
remote bandwidth, so maybe it's an advantage to actually keep a slightly more
complex model. Now, people don't like to actually expose topology to customers,
right? But remember, we are not talking about the Azure environment yet; we're
talking about enterprise datacenters. So it's not as unreasonable to expose your
topology within an enterprise datacenter. In the Azure environment you might
think: do you want customers to have the exact view of your topology? Because,
you know, they could attack it and stuff. But for now let's just finesse that.
Right. Go ahead.
>>: The question I have is so we hear a lot about [inaudible] networks and
customers [inaudible] from rack and customers being aware that racks are on the
same X switch or different parts of the tree. So are we assuming some
[inaudible] network here where these things go away or --
>> George Varghese: No, no. I think it's somewhat [inaudible] all of that, but
we're definitely not assuming that. We would like to work regardless of the
underlying network model. So for example, if there is much more bandwidth in the
rack than there is elsewhere, then we want to expose that, right? But we want to
expose your share of that, so that you see it independently of other people.
Yes?
>>: [inaudible] tremendous amount of non-uniformity in how much bandwidth each
particular instance of an application or member of a [inaudible] would get, based
on whose [inaudible]. So if you look at the purple in the middle, that guy gets
2G -- that VM might have 2G of connectivity -- whereas an equivalent A2 VM
sitting with A3 only gets 1G, just because of who else happened to be placed on
the [inaudible].
>> George Varghese: Right. I think the general thing I would do to avoid that
would be: if there are a few properties, you are guaranteed, on every link, at
least this share of your bandwidth. So for example, if the weights are 4, 1, 1
and you're the weight-4 person, you're guaranteed at least 4/6 of any link if
there are only 3 applications, regardless [inaudible].
So if you want to move people around, you have some kind of floor that is
actually totally independent of [inaudible].
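Stated as a formula (a restatement of the floor guarantee, not from the slides):
on any link of capacity C, application i with weight w_i is guaranteed at least

```latex
% Floor guarantee, independent of VM placement:
\[
  \text{share}_i \;\ge\; \frac{w_i}{\sum_j w_j}\, C
\]
% e.g., weights 4, 1, 1 guarantee the weight-4 application at least 4/6 of C.
```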
>>: I see.
>> George Varghese: Yes?
>>: So this model [inaudible] from each instance of the VM on net based
system?
>> George Varghese: No. The weights are assigned to the application. Think of it
like: the property search gets weight 4 and e-mail gets weight 1. The
applications don't even know any of this, except that there's some identifier in
a packet, like a port number or something, that identifies the application, and
so that can be mapped to weights.
>>: [inaudible].
>> George Varghese: They could migrate and that will be fine.
>>: And that will mean that you have to reconverge your [inaudible].
>> George Varghese: We will reconverge the bandwidth allocations. But there's a
certain minimum bandwidth you get regardless of where you are. Right? In some
sense you get a certain proportion of every link. But the actual numbers -- now,
the statistical multiplexing is going to vary tremendously depending on where
you're co-located and who's active. That's true anyway, right? Even if these
guys are not co-located, it really depends on how much this guy is using it,
right? If this guy is empty, you get all of the [inaudible].
So we know that anyway. So migration is just another piece of dynamism that
affects the statistical multiplexing, right? So you have a minimum and a maximum,
right?
>>: [inaudible] true for compute virtualization or storage virtualization?
>> George Varghese: For compute there is, but not for storage, right? In compute,
for example, if the other VMs are not active, you can go up to the top.
Now, what we really think we will also have to do in such a model is enforce a
max. So even though, for example, on this 10 gig link, if the purple was not
there, this guy could get the entire 10 -- that's what statistical multiplexing
gives you -- you probably want to allow the administrator of the network to
configure a lower max, because other people play games.
Maybe not in the enterprise, but at least in the Azure environment you probably
want some max, which is easy -- it's just a rate control. All right. But I haven't
really talked through the model yet. Okay. So now let's assume that this guy --
this purple guy -- sorry, the green guy -- let's see. So the purple guy, on this
link, basically gets two gig, because when he's sharing this link with this other
guy, he has a four to one ratio, so he gets two. But now it's a little more
complicated, right?
He has two connections, one going like this and one going like this, right?
Because I assume that there's a pair of connections between these machines. So
the assumption that we are going to make is that whenever you have multiple
connections within the same application, they share equally. Okay? This is an
important distinction, because what you're trying to say is that even if an app
opens up many, many connections, it gets the same share -- the share is
completely independent of the number of connections. And that's really important,
because today we know that in Hadoop, by simply increasing the number of slaves
or the number of masters, you can get a tremendous number of TCP connections and
you'll get an unfair share of bandwidth. So that's not happening here, right?
First we're going to take off the top based on the weights, but then within that,
we don't want to differentiate your connections. It's just too much work. We
could if you wanted to, but then you have to explicitly say, for each of your TCP
connections, what the weights are. And you could add that to the model, but it's
messy -- messy not from implementation but from specification, because you have
to specify every pair of VMs and what they're trying to do. And we're trying to
avoid all of that in the model. Okay.
So this kind of gets [inaudible]. And now here's an interesting thing that
happens. When this guy gets one gig over here, if you look at this link over
here, the red guy is sharing it with this, but he's only sharing it with one of
these connections, and that connection is limited to one gig, so the red guy can
go all the way to 9. This is a famous thing that many of you will have heard of.
It's called max-min fair sharing, which means: if you're bottlenecked somewhere,
like this guy -- even though on this link you and the red guy have equal weight
and you would think you should be given five -- it's like, why? I'm limited to
one elsewhere, so why not be generous -- don't be a dog in the manger -- and give
it to the other guy. And so this guy gets 9.
Okay. So there is actually a slightly recursive calculation where you have to
find the bottleneck, share that one, and then that suggests other bottlenecks,
and so on -- but it can be done automatically. It's very well known in the
literature. However, this is what we would call hierarchical max-min fair
sharing, based on the properties -- the colors. Then, once that has been decided,
you do max-min fair sharing within the connections, assuming equal weights. So
it's a two-step process. Yes?
>>: [inaudible].
>> George Varghese: This will have to be recalculated. Although there's a
certain minimum bandwidth, as I was telling Dave, that will happen regardless. So
you will be guaranteed that regardless of the way people move. Because
fundamentally, if you take your weight and divide it by the sum of the weights of
everybody, there is a certain ratio you're guaranteed. So there's a floor you
have. So you can be completely [inaudible] with it. Yes?
>>: So going back to what Albert was asking. Thinking about [inaudible] and
thinking about this [inaudible] independent of what you do with the compute -- so
if you don't have the same sort of model for compute, you might get -- so let's
take an extreme case: you allocate 4, 1, 1 here and you allocate something like
two, four -- two, three, one, right?
>> George Varghese: Assuming the allocation is networkwide --
>>: So [inaudible] what I mean is, if your computes are not matched with the
network bandwidth you've actually allocated, then you could get into a situation
where you're actually not exploiting the benefits of --
>> George Varghese: Yes. So I think there are a number of design questions once
you have this bunch of mechanisms. How you use them is always -- given a certain
compute requirement, how do you map it onto a certain weight? But remember, the
weights are across the network. We're trying to keep the model as simple as
possible. Because generally our experience with things is that Cisco weighted
RED, per this thing, per app -- nobody uses those knobs. It's just too
complicated, right?
So, you know, a lot of us keep putting in more and more knobs, but the
administrators don't trust them for the most part. So we just try to give a very
simple knob, right? You have like, you know, 10 properties -- just give me the
weights.
>>: [inaudible] sort of wondering, to benefit from that, you actually have to
have --
>> George Varghese: You have to have a design thing --
>>: All across, not just --
>> George Varghese: I agree. Yeah. And then you might ask the questions: how
should I be migrating my VMs, what is the right way? But those are questions
that come on top of this. So first you have to give some mechanism, I think, to
start -- some form of control. Yes?
>>: Even in the network [inaudible] you could have multiplexing between
[inaudible] such if you have [inaudible].
>> George Varghese: Yes.
>>: So bursts actually can come [inaudible] you should be able to give
[inaudible] is actually not using that [inaudible].
>> George Varghese: Right. And that's actually the next slide. Right. That's
exactly the next slide. So the next slide is: what happens when this guy is not
bursting -- he's not using his full share, he's only using six gig, right? What
happens is this guy gets four gig on this link by the allocation -- that's what
we desire; we haven't shown you the mechanism yet. And so therefore two gig
here, and therefore this guy goes down to eight. But there is a recursive sort
of coupling. Yes?
>>: I'm trying to understand, in the blue network, why the two left nodes have
different allocations.
>> George Varghese: Well, this is the --
>>: [inaudible] identical.
>> George Varghese: Yeah. Yeah. This is the sum of both of them, right. So
you're right. This is the sum of both connections, right. So there are two
connections, one going here and one going here, both of 2G.
>>: But there are also two from the left one, going from the middle to the right.
>>: That, the [inaudible].
[brief talking over].
>> George Varghese: I probably didn't calculate it right -- I may not have done
the calculations right. I was trying to do this a few minutes before, and, right,
I think I'm not assuming a connection from here to here. If there was, I would
have to change these numbers. Sorry. You're right.
So I think they worked out correctly in the paper, but every time I do these
examples I get them wrong. So I apologize. Yes?
>>: So that was the single [inaudible] figure out these caps pairwise and could
be --
>> George Varghese: No, no. What happens is -- the way it works is you first
find the bottleneck link, the one where the capacity divided by the weight is the
smallest. And once that happens, that restricts certain flows, right, but then
those flows cause new bottlenecks, and so you have to keep going -- it's this
very standard algorithm. I can tell you about it. It's classical, and it's been
studied for 20 years now.
So it's called a water-filling algorithm, where you start trying to do this: go
to the bottleneck, find that, and --
>>: [inaudible].
>> George Varghese: No, you can't. You have to actually -- in the worst case it
can be E sequential steps, where E is the number of edges. But in practice it's
the diameter -- in practice there are only a few bottlenecks. It doesn't work
otherwise; we've done it in centralized fashion. Yes.
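A minimal sketch of that weighted water-filling, with hypothetical data
structures, not the paper's code (this is the flat, per-flow version; the
hierarchical variant in the talk would run it over properties first, then within
each property):

```python
# Weighted max-min via water-filling (sketch; not the paper's code).
def max_min_rates(links, flow_paths, weights):
    """links: link -> capacity; flow_paths: flow -> links it crosses;
    weights: flow -> weight. Returns flow -> rate."""
    cap = dict(links)
    rate, active = {}, set(flow_paths)
    while active:
        # weight crossing each link from flows not yet frozen
        load = {l: sum(weights[f] for f in active if l in flow_paths[f])
                for l in cap}
        candidates = [l for l in cap if load[l] > 0]
        if not candidates:
            break
        # bottleneck = link with the smallest capacity per unit of weight
        bottleneck = min(candidates, key=lambda l: cap[l] / load[l])
        fair = cap[bottleneck] / load[bottleneck]
        for f in [f for f in active if bottleneck in flow_paths[f]]:
            rate[f] = fair * weights[f]       # freeze at weighted fair share
            for l in flow_paths[f]:
                cap[l] -= rate[f]             # consume capacity along the path
            active.discard(f)
    return rate

# f1 is bottlenecked to 1G on link A, so f2 rises to 9G on link B:
print(max_min_rates({"A": 1, "B": 10},
                    {"f1": ["A", "B"], "f2": ["B"]},
                    {"f1": 1, "f2": 1}))      # {'f1': 1.0, 'f2': 9.0}
```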
>>: So every time [inaudible].
>> George Varghese: Yes.
>>: [inaudible].
>> George Varghese: Yes.
>>: [inaudible].
>> George Varghese: We'll talk about all the trade-offs. Once we get to the
mechanisms, we'll see the various ones have different delays. Okay.
So now with this -- I guess the first thing is, let's talk about the three
mechanisms we're going to use, right? One of them is going to be something
called group allocation. So let's talk about our goals, right. What can we
change?
Now, I think if you're Microsoft -- and hopefully this is a willing audience to
hear this -- you really can't go around assuming that the routers will change.
Because, you know, that's Cisco, and yes, you might be able to convince them, and
maybe you have friends in Broadcom, but it still takes a while. So ideally you
would like to do this with no router changes, right? Now, Balaji [phonetic],
when he came here, talked about a mechanism that is sort of a modified QCN.
There he's assuming that he can change the switches.
So, number one, a clear point of distinction between his work and ours is that
we're not going to assume that the routers or switches change. Okay. He also
would like not to add software to the hosts. Some of our proposals are going to
add software. Okay. But Microsoft's probably okay with that, right? But certain
other people would find that hard.
So our first mechanism is called group allocation, and it's going to leverage
TCP's behavior. It requires no software or hardware changes, but it does require
configuration in the routers -- and even that was hard for the Azure folks,
right? And it's going to converge very fast; it's going to converge [inaudible],
basically TCP, right? Then there's another mechanism called rate throttling,
because the first one works only with TCP, and I have to tell you all of this.
If you want to handle the rest, you're going to have to do some kind of
measurement and some kind of [inaudible], and you have to add software on the
hosts, and this is going to take a few milliseconds just to measure that there's
a problem. Okay?
And finally, a much weirder thing: a centralized allocator, where -- like RCP and
centralized routing control platforms -- we're going to do everything, including
bandwidth allocation, in centralized fashion. And this is going to be slow,
because in general it has to see things probably on the order of hundreds of
milliseconds, or tens. So this is a set of tradeoffs, and we'll try to show you
that this first one is the easiest and fastest, but it requires some assumptions
and it only gives you one definition.
The second one can handle UDP as well; the third one can handle everything and
the kitchen sink, but it's complicated -- you have to do a certain amount of
work. So there's a set of tradeoffs [inaudible]. So let's start with this
picture, okay? So maybe some of this stuff about the definitions will become a
little clearer.
So now I've forgotten all of the datacenter topology -- I'm not going to bother
to draw fancy datacenter topologies, just switches and simple topologies that I
can understand, right, and not every pair. So there's a host 1 and a host 4,
there's an edge switch, a core switch. You can, if you want, mentally twist all
this into the right shape. But I can do it this way. And there's a host 1 and a
host 2 and a host 4 and a host 3. So the idea is very simple: there is a famous
paper, which some of you may have read, but it's very old, so I'm not sure the
new grad students have read it. I'm sure [inaudible] has read it. Basically
there's a paper by Ellen Hahne from many years ago -- do you remember when this
was? She was Gallager's student, right? And she basically said, look -- well,
first of all, why doesn't window flow control, why doesn't just simple fair
queuing, work? So let me see if I can do an example that I hope I'll remember.
So what happens is: imagine that I have one property which is using -- and this
is a 10 megabit link -- suppose one guy here has a weight of 4, right, and
another guy here has a weight of 1, and there's somebody over here who wants to
come in over here, right? So there are three properties. One has one connection
here, one has one connection going this way, and a third one going here.
So imagine this property has the weight-1 share against the weight of 4. That
limits this end-to-end flow to a rate of 2, right? Because out of this 10 you
only get one-fifth of the rate. So you should normally be limited to two here,
and therefore this other guy should get 8, correct? That's the desired approach.
Now, suppose you just used router configuration and DRR, and you said of these
guys, this guy gets four times the other one. You indeed get the right division
on this link: you'll get 2 and 8. But if you don't do anything else, this guy is
going to send way too fast on this other link and force an even five-and-five
split, even though all those packets are going to be dropped. So simple
router-based deficit round robin or fair queuing works, but it's wasteful.
Because what could happen is the sender could say, you know, forget it, I'm just
going to send at maximum rate; I'll get throttled at every link, and I'll finally
get throttled at the bottleneck link, but I'm going to waste bandwidth in all the
preceding links.
Now, some of you will be thinking: but that can't happen if the sender is TCP.
And you're right. That's really Hahne's intuition. Hahne basically said: if you
take any window-based flow control and fair queuing at the routers, you get
max-min fair share -- under some assumptions, with a hard proof. But it's not
hierarchical max-min fair share, so we can't use it directly. So basically all we
are trying to say is that if you just do this, you actually do get the hierarchy:
if you do fair queuing not by individual connections but by the properties at
each router, then you get exactly the definition we want. You get hierarchical
max-min. So you sort of get it for free. You don't have to do anything.
It's almost an embarrassing result, because it says: do nothing, use existing
stuff and TCP, and you get the right answer. Right? So it's appealing from an
engineering standpoint -- but it's not very appealing when you write a paper,
right? It's like, the reviewers say, all right, then leave, you know. So okay.
So let's start with A1 here.
>>: [inaudible].
>> George Varghese: Yes.
>>: It's kind of the very same thing.
>> George Varghese: Yes.
>>: Exactly the same thing. So if [inaudible] flows to [inaudible] -- for
example, cars are coming on different streets and they're all going in one
direction [inaudible] at the very end it's going to be totally [inaudible].
>> George Varghese: Right.
>>: Unless you [inaudible].
>> George Varghese: Right.
>>: That's [inaudible].
>> George Varghese: These are very well studied problems. The only twist here --
I said this theorem is like 20 years old, right? The twist is that we are moving
it to a hierarchical setting, and the only thing we are saying is: make sure you
don't do the fair queuing here on a per-connection basis; you do it on a
per-property basis. So if you have 10 connections coming in from search, all of
them will be treated in the same DRR queue. That's the only thing -- if you do
that, you'll get the -- okay. So what happens is, let's see, there are two
properties here, and let's assume they have equal weights, I guess. And there is
this guy. And so there's a 4 to 1 ratio. Now, let's look at the time scale in
which this happens. If both these guys start -- the greens and the reds start at
the same time -- as long as you have a DRR-like scheduler at the router here,
which they do, it is basically giving 4 times the packets to the one versus the
other going out of the southbound link, right? Immediately this guy's going to
get hit. Does that make sense? There's no time constant at all. This is even
less than round-trip delays. It's microseconds, right? So that's done.
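A minimal sketch of that scheduling discipline -- deficit round robin keyed by
property rather than by connection (illustrative Python, obviously not router
firmware); 10 search connections still share search's one weighted queue:

```python
from collections import deque

# DRR with one queue per property (sketch, not actual switch code).
class PropertyDRR:
    def __init__(self, weights, base_quantum=1500):
        self.queues = {p: deque() for p in weights}     # property -> packets
        self.quantum = {p: w * base_quantum for p, w in weights.items()}
        self.deficit = {p: 0 for p in weights}

    def enqueue(self, prop, pkt_bytes):
        self.queues[prop].append(pkt_bytes)

    def service_round(self):
        """One DRR round: each backlogged property earns its quantum and
        drains packets while its deficit counter covers them."""
        sent = []
        for p, q in self.queues.items():
            if not q:
                continue
            self.deficit[p] += self.quantum[p]
            while q and q[0] <= self.deficit[p]:
                pkt = q.popleft()
                self.deficit[p] -= pkt
                sent.append((p, pkt))
            if not q:
                self.deficit[p] = 0   # empty queues forfeit leftover credit
        return sent

# search (weight 4) gets roughly 4x the bytes of mail (weight 1) per round,
# no matter how many connections each property opens.
sched = PropertyDRR({"search": 4, "mail": 1})
for _ in range(10):
    sched.enqueue("search", 1500)
    sched.enqueue("mail", 1500)
print(sched.service_round())  # 4 search packets, then 1 mail packet
```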
But when that's done, what happens is that at some point this TCP is going to
sense that there is less bandwidth. It's going to back off -- but it's also
sharing that two with this other TCP, the H2-to-H4 TCP, and so they're actually
going to go down to 1 and 1. Right? So the right result automatically happens
and you don't have to do anything. What's happening now? What did I do wrong?
Okay. Well. Go ahead.
>>: [inaudible] on the flows, right?
>> George Varghese: Not really. Because this only takes a few round-trip delays
to happen.
>>: But I mean, there have been plenty of results that say that TCP takes a long
time to converge when you have multiple -- I mean, in this case I guess there's
one bottleneck.
>> George Varghese: It depends on the number of bottlenecks. I agree. I agree.
It turns out that the regular max-min fair share calculation moves from
bottleneck to bottleneck, so there's a factor which is the number of bottlenecks.
Generally you don't assume there are that many, especially in the datacenter -- I
would say there are the uplinks and maybe -- it shouldn't be a problem, right? A
few round-trip delays. Sorry. You had a question?
>>: So this still has the same kind of weird problem that if there are two links
coming down from H2, then they're each going to get one-half.
>> George Varghese: No. Okay. If there are two links coming down from H2 to
this router, no. Because you're doing it based on green versus red. Not based --
so if there are two connections -- you mean off the green share?
>>: Correct. Yes.
>> George Varghese: Definitely.
>>: Would they each get one-third or would you have one, one-half, one-half?
>> George Varghese: No. Okay. So let's see. You want two links coming down.
And is there another TCP connection coming down? So H2 is opening up a second
connection to H4, right? First, the router scheduling mechanism is still going
to take 8 for this guy, so 2 is left over for the green. Now, if there are three
connections he opens up, he gets two-thirds, two-thirds, two-thirds. So your
connections are not going to affect the other properties. That's important. If
you open up more connections, they divide -- that's your problem. You can argue
that in a VM world you might want to do further limiting on a VM basis.
>>: And you may have -- you may have also wanted to control the relative rates
of your [inaudible].
>> George Varghese: You may want to control the relative rates of your own
flows, right. So we have simple extensions by which we can do that. But it's
extra complexity, and we're always afraid of anything that complicates things --
so you could say between certain pairs of hosts you want to give more weight,
right? And in fact, some of the next mechanisms will allow us to do that.
So far, though, this problem is certainly going to be true for UDP, right?
Because he doesn't care. He's going to just dump over here and steal away the 5.
You can argue maybe it doesn't matter, because nobody deserved it anyway -- you
know, they paid for their shares -- but there's something, you know, a bit
ghastly about a UDP sender whose packets are going to be dropped later, and not
doing something about it.
Now, there's tons of work on TCP-friendly mechanisms, but we tried to do
something very simple, right? So this is actually the idea from [inaudible], and
it's a very simple idea. Basically what he said was: you do the same thing, and
so it's 8, and the problem is, what if A2, the green guy, is UDP? If you don't
do it right, what he's going to do is transmit as fast as possible and take five
out of this link, as opposed to -- he's only getting one here, right? He should
be sending at one, but he's going to take five, and all his packets are going to
be dropped at this point. So how do you prevent that kind of behavior, right?
So in order to do that, the idea would be that at the receiver, you have to put
some kind of shim layer here. So where would it be? It would have to be
somewhere between the network and the UDP layer -- something that is intercepting
the packets and measuring the rate, right? And so basically he's going to
measure the rate and he's going to feed it back to this guy. And so the
intuition is: this guy must be sending at five to cause trouble here, but if the
receiver is measuring one, why isn't he going at one? You know. So if you could
enforce that, then -- that's the intuition. Go ahead.
>>: [inaudible] flashbacks here.
>>: Yes. Me too.
>>: So why are we using [inaudible] between the routers?
>> George Varghese: Because we don't want to change the routers. Right? So
this is important. The constraints are heavy, right? If you could change
anything -- lots of [inaudible] change the router. We have a very constrained
playing field, right? Cisco routers. Who knows when they're going to change.
And, you know, for something like this, if you can do it without any changes,
that's the best.
>>: Back pressure works in a line, it doesn't work very well in [inaudible].
>>: Yeah. Yeah. This is a very --
>> George Varghese: Well, it turns out that, you know, [inaudible] and all these
guys, they have methods, right? But it is complicated, though. But I'd rather
finesse that argument for now by saying, look, the rules of the game are that we
can't touch those switches, right? You may be able to touch the L2 switches, but
you can't touch the L3 switches, and the congestion could happen anywhere. So,
you know, let's try and do it without changing the internal network.
>>: [inaudible] wireless routing is done in software because the [inaudible]
speeds are so low. This kind of routing is being done in hardware ASICs because
it's [inaudible].
>>: [inaudible].
>> George Varghese: So that's very important. Because our experience with a
vendor is, you know -- it finally took six months of arguing before they agreed
to put a feature in, right? Then it takes two years for somebody like Cisco to
build an ASIC. Why? Because one and a half years is spent on design and half a
year is spent on testing -- because an ASIC is so expensive you can't afford to
get it wrong. So now, two years later, you think it's all done, but it's not.
Then they decide to put it on a board, right? And once they put it -- the real
number is like five years for any new thing. It's shocking. It's terrifying. By
then you lose interest in a feature, right? Then it might come in, right?
So it's really scary. You're talking about doing nothing for five years. And
think of the effort of socializing this with not just Cisco but Juniper, Extreme,
Foundry, you know, all these other guys.
>>: But couldn't you -- if traffic patterns change a little bit slower than the
per-packet case [inaudible], I mean, couldn't you be doing these things a lot in
software, and a lot faster [inaudible] change on the millisecond order, or --
>> George Varghese: You could. You could. But it turns out that the hardware
does have hooks to measure things, and so, in fact, we might leverage some of
those. But even then, even changing the router software -- although it's not a
five-year period [inaudible] it's still two and a half years. But everything is
[inaudible] -- Microsoft is probably not much faster, right? It's not that easy
to get a product out there, and it's really shocking when you see the real
numbers.
And so right now, nothing changes, right? Now, this one, though, is going to
require you to actually add some kind of layer over here, which does require a
loadable kind of module in your kernel or something like that -- which in Linux
we know how to do, and we assume you can do it in Windows too, right? And so
what is this thing?
>>: [inaudible] you said you more or less have less bandwidth [inaudible] but
there is a reverse problem there, which is: you made the assumption that you
know which property wants to get what ratios. And if you don't have those sorts
of things -- I mean, you have to adapt very quickly as to what [inaudible]
otherwise you will run out of -- either you starve things or you just [inaudible]
links which are not being utilized, so I mean it's sort of a --
>> George Varghese: But I think fundamentally, though --
>>: The [inaudible] just saying where do you -- which are the things [inaudible].
>> George Varghese: Right.
>>: [inaudible].
>> George Varghese: Do you feel that it's hard for -- I mean, people don't want
to provide even some simple guidance to the network, as to "this property is more
important than this one"? If you don't do that, the network has no basis for
doing anything, right? It could have one guy completely consuming the network,
and it's fine as far as you're concerned, because you didn't give me any sense of
importance.
So the point is, you have to give me some information as to your relative intent
for the use of the network.
>>: Well, it sort of depends on what you care about. For example, if you think
of [inaudible] then yes, it definitely needs more priority, but [inaudible] also
datacenters are running quite a few of them. You can't do this, right, because
then how are you selecting which is the right one, unless you put money into the
equation or [inaudible] something else.
>> George Varghese: So what you might do is do it first with the big ones,
right, and then everything else falls into one bucket. It's still better,
because you get predictable service for the people who are making revenue, and
yet you're sharing the same network. And your competitor Google has all the same
[inaudible] -- even they like to run on one physical network. And they're
beginning to find --
>> George Varghese: They do, though. Lots of people do. Not always.
>>: Yeah. I mean, data mining versus ads versus the oracles versus --
>>: So within the [inaudible].
[brief talking over].
>> George Varghese: Imagine the CFO has certain queries, right?
>>: Yeah.
>> George Varghese: Or there are backups going on, where people have two
backups. And, you know, you probably want those at low priority, but you want to
make sure they get some bandwidth.
Okay. So the idea is very simple here, right? You go in and measure the rate
and you feed it back to H1. And now you need a little bit of damping [inaudible]
-- you need to be a little careful, right? So what happens is: this guy is
pumping at five and dropping packets, but this guy measures one and feeds it
back. Now, what should this guy be allowed to send? It turns out that you can't
let him send at exactly one. If you do, he'll never grow, right? Because if the
network bandwidth did increase, you'd want to allow him to grow.
If he sends at one, he's only going to receive at one, and that will be
maintained forever. So in order to avoid that, you let him go at a slightly
higher rate, like 20 percent higher, right? So he goes at 1.2, and if nothing
changes you're just going to see 1.2, 1.2. But if the network bandwidth goes up,
he's going to suddenly measure himself at 1.2, and in that case he goes up to
about 1.4, and so he keeps climbing.
So you want an ability for him to grow, right? So the rule is that whatever the
bandwidth received is, you actually go down to something like 20 percent above
that. Right? And there's a little bit of care you need to take to make sure that
the right thing happens and things don't -- but those are all in the paper. So I
don't want to talk about the mechanism in detail. Yes.
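A minimal sketch of that feedback rule (hypothetical shim API, not the paper's
code): cap the sender at a small headroom, here 20 percent, above the rate the
receiver actually measured, so a misbehaving UDP source collapses toward its real
share but can still probe upward:

```python
# Receiver-feedback rate throttling (sketch; names are illustrative).
HEADROOM = 1.2   # allow 20% above the measured delivery rate

def new_send_limit(measured_rx_gbps):
    """Applied at the sender when the receiver shim's feedback arrives."""
    return HEADROOM * measured_rx_gbps

limit = 5.0                          # UDP source blasting at 5G
for _ in range(6):
    delivered = min(limit, 1.0)      # pretend the bottleneck allows only 1G
    limit = new_send_limit(delivered)
print(round(limit, 2))               # settles near 1.2, not 5; if the
                                     # bottleneck lifts, the 20% headroom
                                     # lets the rate climb again
```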
>>: What if a new property comes in, or [inaudible]?
>> George Varghese: So a new property comes in -- here is the configuration
required, right? Ideally what you would like is some kind of OpenFlow or some
other software where it's centrally done: it has to go to every router and
basically add that new class to the DRR weights. And it can be done. But right
now we do it by going to each one and physically configuring it, which is more
painful.
But it seems to me that that level of management should happen in the future.
And if that happens, then you can do it from one place. So it's not that hard
conceptually, but today it's painful because you have to log into every console
and do it.
>>: [inaudible].
>> George Varghese: Exactly. To every router. So right now we don't even want
to know where you're using it. Maybe you could probably gain by saying only do
it in these routers, but right now we think the simplest way is just to go to
every router and do it, right? So you don't even bother, right? This is a very
simple [inaudible], but even this level of configuration -- we talked to the
Azure guys: no, no, no, you know, you're not touching it. And so it's
interesting, you know -- we thought it wasn't such a big deal, but for real
running networks even this is a problem, you know.
The rate limit is actually slightly over one. And then you have to allow for
[inaudible]. Okay.
So then we started saying: all right, we've done all of these things, and they
kind of work, they give us the policy we want, they handle UDP. But maybe this
definition, this hierarchical max-min fair sharing, has a number of things that
are not as flexible as we would like. For example, it treats every connection
alike -- like some of you said, well, maybe you want some servers to have more,
right? And we can think of even weirder policies, okay?
So here are, like, two flows, right? And the idea would be -- imagine what you
want normally: if you have the same weights, the bandwidths are shared -- we need
three for this. Is there a third one? Okay. So imagine there are three flows,
right? Normally, if all three are active, each of them gets a third, 3.33. Does
that make sense? Right. Okay. But it also means that if one of them is totally
idle, the other guys share it equally, 5 and 5. So in some sense there's a
minimum that is specified by one weight, but the excess is also specified by the
same weight. Maybe you want two different weights. Because some applications
require a guaranteed fixed amount of bandwidth, but maybe they don't require any
excess bandwidth at all. So maybe you want two weights: one weight for sharing
the bandwidth if everybody is using it, which gives you a sort of min bandwidth;
but if any excess comes around, you share it with a different set of weights.
You can think of reasons why that might be interesting. Certain apps want very
predictable bandwidth -- why give them any excess, right? Or backups: you want
to make sure that they have a definite amount but no more. Backups is the wrong
example -- probably you want them to get less.
So if in this case they're all equal, it will be 3.3 each. But if you actually
had excess weights of two and one, then the numbers would change, and the guy
who had the bigger excess weight -- when the green was not sending -- would get
a little more, because he had a higher excess weight. He would get 5.5 and this
guy would only get 4.4. It still adds up to 10, but they use it in a different
proportion.
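A minimal sketch of the two-weight computation (illustrative, not from the
paper): the min weights divide the link when everyone is active, and the
separate excess weights divide whatever the idle properties leave behind:

```python
# Min-weight plus excess-weight sharing on one link (sketch).
def two_weight_shares(capacity, min_w, excess_w, active):
    total_min = sum(min_w.values())
    base = {p: capacity * min_w[p] / total_min for p in min_w}   # min shares
    spare = sum(base[p] for p in min_w if p not in active)       # idle leftovers
    total_ex = sum(excess_w[p] for p in active)
    return {p: base[p] + spare * excess_w[p] / total_ex for p in active}

# Three equal min weights (3.33 each when all active); green idle, excess
# weights 2:1 give roughly the 5.5 / 4.4 split from the example.
print(two_weight_shares(10, {"red": 1, "blue": 1, "green": 1},
                        {"red": 2, "blue": 1, "green": 1},
                        {"red", "blue"}))
```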
So we're just trying to push the model a little bit and say: if you push it a
little bit, can we get more flexibility? Right now max-min fair sharing sort of
pushes your bandwidth into certain regions, and we would like to expand that
space, okay? Once we did that, we quickly realized that we can't rely on TCP,
because TCP has no flexibility to handle these two weights, and we can't rely on
UDP, right? And so we realized that we needed a centralized bandwidth allocator.
So how does a centralized bandwidth allocator work? It's interesting -- so the
manager assigns weights, right? Now what happens is we have to somehow measure
the traffic matrix. We have to see, for each property, how much it is trying to
send to everybody else. And that's where we might be able to commandeer the
access switches, because they do have a certain amount of rate measurement in
the hardware. If not, you have to do it in software at the [inaudible], or at
the end servers.
Once you get that, you send this traffic matrix periodically -- probably not any
sooner than every 100 milliseconds, because this is a lot of work -- to your
centralized QoS computation engine, and based on that, this guy predicts your
demands for the next interval, because he has to predict -- you might be
growing, right? And then you compute this, or anything you want, you know,
fancier policies, from the weighted bisection bandwidths, and now you go ahead
and send back the rates to be used for the next interval. And now you have to
rely on [inaudible] at the routers to make sure that if somebody -- a UDP flow --
has been given only one, it stays at one.
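As a single-link sketch of one allocation round (all names hypothetical, not the
paper's code): inflate the measured demand a little as a crude predictor, compute
each property's weighted share, and grant the smaller of the two:

```python
# One round of the centralized allocator, on one link (sketch).
def allocate(measured, weights, link_capacity, growth=1.2):
    """measured: property -> measured rate; returns property -> rate limit."""
    predicted = {p: growth * r for p, r in measured.items()}   # allow growth
    total_w = sum(weights[p] for p in predicted)
    share = {p: link_capacity * weights[p] / total_w for p in predicted}
    # a fuller allocator would redistribute leftovers (water-filling again)
    return {p: min(predicted[p], share[p]) for p in predicted}

print(allocate({"search": 9.0, "mail": 0.5}, {"search": 4, "mail": 1}, 10))
# {'search': 8.0, 'mail': 0.6} -- installed as rate limits for the next interval
```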
So the only advantage of this compared to all the other stuff is that it allows
much more general policies. Now you're free to do anything you want. You can
give some servers more, you can use different weights. And it looks weird, a
centralized bandwidth allocator, because bandwidth changes so fast. But we're
hoping to track only somewhat coarser time-scale changes in bandwidth, right?
And if you think of a lot of what's going on in networking research, they've
moved to centralized routing control platforms. So this is another step in that
direction.
Okay. So now there's lots of questions and I'll take all of them.
>>: [inaudible].
>> George Varghese: So we just really need to get it to one guy, so I'm not sure
we need broadcast as much. It sounds like an incast, right? They're all going
to one guy. All the traffic measured at the edges is going to one centralized
computation. Oh, you could do it in a distributed fashion instead, is that
right? But then everybody has to do that computation. And that computation --
you haven't done it -- it takes a few hundred milliseconds, so we would rather
do it in one place and [inaudible], yeah.
For the simple model there is nothing being sent anyway. It's just automatically
leveraging TCP, except for the [inaudible], so there's no [inaudible] -- we can
talk about that. But that's a very well studied problem, right? It's 20 years
old. So -- yes, go ahead.
>>: I was going to ask you a question [inaudible] so there was a [inaudible]
paper recently that was very similar [inaudible].
>> George Varghese: Eric? I don't know. [inaudible] paper?
>>: Yeah. I mean we did something somewhat similar and we were basically
[inaudible].
>> George Varghese: I'm glad. So we should definitely cite that.
>>: [inaudible] I had a lot of work in [inaudible] because [inaudible] this
situation for years actually, and the problem is -- I was trying to think about
the assumptions you're making versus the assumptions that [inaudible], and there
are differences there. Okay, so one question I have for you is that in order to
validate these [inaudible] you really have to look at the traffic, and you really
have to look at those properties and how far those last and how long [inaudible].
Without that, this is not really useful, right? You have to [inaudible] because
you said there is some sort of [inaudible] routers or how much [inaudible] how
fast you can [inaudible].
>> George Varghese: I think for the statistical multiplexing, I agree -- it
depends on the real traffic. But the fact that you can guarantee certain
minimums, right, that can be established without any regard to --
>>: That's true.
>> George Varghese: Right now you don't have that. You don't have any floors at
all. I think that's the first thing to establish. And that can be done without
changing all these routers.
>>: Well, but you were saying [inaudible] -- you were saying that [inaudible]
for example, you don't give one, you give 1.2 or something, right?
>> George Varghese: Right.
>>: To make sure it has room to expand.
>> George Varghese: Right.
>>: And the question to me was, well, [inaudible] fine, what's that doing to
TCP?
>> George Varghese: It's not. It's just basically taking a little bit more than
its actual allocation [inaudible] -- this is a common thing that is used a lot,
so it's not such a big deal, and DRR does it, so [inaudible] that one is not. I
agree that ideally we would have to have a big dataset to see how this works, to
measure the statistical multiplexing. I would love to do that. It requires, you
know, help from people who are actually doing it. So --
>>: One of the things you [inaudible] push it back down.
>> George Varghese: I agree. So that's a good point. So the timing of that one,
that is scary, right? It really depends on prediction, smoothness, and all those
things. In fact it's actually scary. VL2 says that things change so fast that
it's actually depressing for this kind of thing.
So nevertheless I think we just put it in for completeness, right? So if you
don't like that, the first two are very fast. Sorry. I think --
>>: So I want to just push back [inaudible] datacenters probably the number of
[inaudible] the number of hosts that a centralized controller has to [inaudible]
such times, and there seems to be no good way of splicing it --
>> George Varghese: I actually did some algorithmic work there. We talked about
it. So what we did was we basically did splice it, right, on an edge-switch to
edge-switch basis, and then we did the calculation sort of recursively on top
for all the guys sharing. That helped a lot with the complexity. But you're
right: if you take it host to host, the number of pairs is massive, right? It's
roughly an N squared kind of algorithm. So we did a lot of work to make this
fast. But, you're right, fundamentally you have this distributed thing happening
automatically, and so the speed of this is another [inaudible].
>>: So just to clarify when you spliced it [inaudible].
>> George Varghese: It's not distributed [inaudible]. We did just a centralized
thing, but we didn't splice it at the [inaudible]. We basically divided the --
there is an N squared in the complexity, right? If you make N the number of
hosts, it becomes very large. So we tried to reduce that to order M, where M is
the number of edge-switch pairs.
>>: You compute trunks between them.
>> George Varghese: Yes. Exactly. And then the last thing, on the trunk, is
easy, right? That's an easy computation. So we kind of just break it up into
equivalents -- these are just methods of reducing the centralized complexity.
Sorry.
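A minimal sketch (all names assumed) of the splicing idea George describes:
aggregate host-pair demands into edge-switch-to-edge-switch trunks, so the
expensive step runs over M trunk pairs instead of N squared host pairs. Only the
easy per-trunk max-min split is shown; the recursive allocation across trunks is
elided.

```python
# Hypothetical decomposition: host pairs -> trunks -> per-trunk max-min.
from collections import defaultdict

def per_trunk_shares(host_demands, edge_switch_of, trunk_capacity):
    # host_demands:   {(src_host, dst_host): demand in Mbps}
    # edge_switch_of: {host: its edge switch}
    # trunk_capacity: {(src_switch, dst_switch): capacity in Mbps}
    trunks = defaultdict(list)
    for (s, d), dem in host_demands.items():
        trunks[(edge_switch_of[s], edge_switch_of[d])].append(((s, d), dem))

    shares = {}
    for trunk, flows in trunks.items():
        cap = trunk_capacity[trunk]
        pending = sorted(flows, key=lambda f: f[1])  # by demand, ascending
        while pending:
            level = cap / len(pending)  # water-filling level
            pair, dem = pending[0]
            if dem <= level:            # demand satisfied: grant and continue
                shares[pair] = dem
                cap -= dem
                pending.pop(0)
            else:                       # everyone left is bottlenecked
                for pair, _ in pending:
                    shares[pair] = level
                break
    return shares
```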
>>: So I guess what we're thinking is that [inaudible] and it's as if it's
centralized, it's as if you [inaudible]. We all love that. And all we need is
TCP -- and so it's hard to give that up now. I mean, why can't we have a new
instance to base [inaudible] that doesn't require the --
>> George Varghese: You could. But now you would have to change all the TCPs and
all the other [inaudible] like that to do it. Oh, you mean -- right. So
personally, right, I wouldn't do the centralized one. We did it in the reverse
way: we thought of that first, and then we came down to the simpler methods, and
we felt hard pressed to give it up, and I'd like to [inaudible] so we put it in
anyway. And we just wrote it up. But I think I agree. If I was [inaudible]
there's no way I would do it, right? It's way too complicated.
>>: So [inaudible].
>> George Varghese: Sorry?
>>: [inaudible].
>> George Varghese: I shouldn't be saying things like this here. I'll get into
trouble with the paper reviewers.
>>: Oh, no, no. Okay. So talking about [inaudible], this has been the central
problem in traffic engineering for 20 years, right, and basically what it's all
come down to is predictability. You don't need admission control. And if you can
predict the required share, then you have all the time in the world and the
centralized algorithm is the right way to do it. And there were studies done --
you get, I think, something like 30 percent better than the [inaudible] version
in how close you get to optimal, right?
>>: But then it's arguable what your definition of optimum is, right? Because
for an ISP, the standard definitions are that the least-used link should be --
and it's not clear even that is right for the datacenter. Is that the right
[inaudible]? There are so many things like that in the end, right?
>>: [inaudible] asked about the evaluation metric. So, you know, when we think
about these things we sort of get stuck in exactly that element, that we have to
have loads and loads of traffic that we can look at [inaudible].
>> George Varghese: That's fair. And we did.
>>: So from that perspective I was asking -- I mean [inaudible] but I was asking
now. How did you evaluate?
>> George Varghese: So we'll talk [inaudible], so basically we built a testbed.
It's small, right? So we definitely did not have the advantage of large-scale
real traffic, and so we had no handle on the traffic -- on the statistical
model. But we were able to show the isolation, right? And also the important
thing is we took real routers -- we took Fulcrum switches -- and we built a real
testbed. So it was kind of nice that the hypothesis, that you only had to
configure the thing, was correct. So that's really what we did.
Plus we started up certain parties with lots and lots of connections and we
showed it didn't matter. Once you set the weights, they could open all the
connections they wanted but they couldn't get more. So that's what we did. Would
it be better to run it on a real thing? Yes. But, you know, that's a different
set of -- we did talk to Srikanth briefly about getting it deployed, trying to
do this on it, but it didn't happen. Yes, so go ahead.
>>: Not only do bandwidth allocation problems end up being about whether you
slice the bandwidth correctly, but then you have the problem that per-connection
response times vary so much that no one even studies that, right? I mean,
sometimes for certain connections they [inaudible] in the other case -- I mean,
was one of the metrics response time [inaudible]?
>> George Varghese: Yes. So for response time you're trying to see how fast the
[inaudible] go up, right? And maybe we were restricting ourselves in our apps --
we do need a richer set of apps. We tend to study a lot of Hadoop-like apps,
right? And even they are very sensitive. They have certain phases, a reduce
phase and a map phase. And the sort phase, for example, is much more
bandwidth-intensive; the other phases really didn't matter. So we did actually
see that sometimes when we give twice the bandwidth weight to one app or the
other, it doesn't complete its job twice as fast, simply because there are other
phases where it's not bandwidth-bound, right? So there's a lot of that in the
paper, where we're actually trying to see the effect on application-level
performance with all of these things. But what are the statistical multiplexing
gains across large crowds? We don't have any [inaudible]. Sorry. Turn this off.
Okay. Go ahead.
>>: [inaudible].
>> George Varghese: Yeah?
>>: I believe that the number of queues that you can [inaudible], so do you have
something to do when the number of parties you have is [inaudible]?
>> George Varghese: So I think Srikanth and Ming are definitely interested in
Azure, right, where the number is more like 10,000. So there are two differences
in what they are doing. I think one is that we can no longer hide under this
thing of, well, you have 100 -- roughly 100 queues I think you can get now;
Fulcrum had 100, right? And so then you have to find some way to extend it when
you don't have enough. That's number one. And number two, you have to worry
about more adversarial people, right?
[inaudible] actually we're not assuming that, and the reason why it matters
whether you assume nonadversarial behavior is: if one guy notices that his
bandwidth is being stolen when he's not using it, right, and it takes a little
bit of time, however little -- milliseconds -- to get it back, then he might
game the system by saying I'll just send idle traffic, right?
So certainly in an ISP world you would worry about that, and in the Azure world
you would worry about that, right? But here in an enterprise setting maybe that
was [inaudible], so you have a different set of problems. You have to worry
about adversariality, and you have to worry about scaling to larger numbers. And
that's [inaudible]. Okay.
So [inaudible] right. So although we have this max-min fair notion, if you talk
about applications it's not really clear what it means for applications to share
the network. So we are not entirely sure, right, how we actually define this
notion of application sharing -- should we think of it as in supercomputing? And
what about multipath? There's a ton of work on [inaudible] fair share with
[inaudible], and it's all messy, right? They come out with NP-complete problems,
it's a [inaudible]. And one of the nice things of the evaluation we did, which
was surprising to us, is that just for the heck of it we ran this mechanism with
a simple amount of multipath. And it seemed to do the right thing. It seemed to
take the aggregate bisection bandwidth and share it proportionally to
[inaudible], and that was nice, because that's not exactly obvious given the
theoretical results.
So again, we don't have any explanation for it, but at least in simple cases --
simple datacenter cases, maybe it's the topology -- it seemed to work well. So I
thought I had some slides on evaluation. Oh, yes, I do.
So here is a quick summary, right. The group allocation, the TCP thing: you just
do configuration at switches. It's very fast, right, but it's only TCP flows and
it's only hierarchical max-min. The rate throttling basically allows UDP too,
but the overhead goes up because you have to measure the UDP rate -- and it's
still only hierarchical max-min. And the centralized allocation is slower, but
it supports more general allocation policies; it's not clear how scalable it is,
though you really only have to worry about that when you're going to large
scale.
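A rough sketch of the rate-throttling mechanism just mentioned, not the actual
implementation: periodically measure each party's rate at the edge and cap it
near its computed share, so UDP senders that don't back off on their own still
respect the allocation. Here measure_rate, compute_shares, and set_rate_limit
are assumed hooks, and the 1.2x headroom echoes the "give 1.2 instead of 1" idea
mentioned earlier in the talk.

```python
# Hypothetical edge-side throttling loop; all hook names are assumed.
import time

def throttle_loop(parties, measure_rate, compute_shares, set_rate_limit,
                  period_s=0.1):
    while True:
        demands = {p: measure_rate(p) for p in parties}   # measured Mbps
        shares = compute_shares(demands)                  # max-min shares
        for p in parties:
            # Slight headroom lets a party probe upward and reclaim
            # bandwidth quickly as its demand grows.
            set_rate_limit(p, 1.2 * shares[p])
        time.sleep(period_s)
```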
So how did we implement this? We had a Fulcrum 10-gig switch, and we were able
to configure it as lots of smaller switches, so we could emulate subswitches and
emulate topologies. And this was a topology we were much more interested in
because it is multipath. It is very simple, right? We had to do certain things
-- it's all in the paper how we actually made this experiment work. But
fundamentally, let me try to read what we did here. So what we did was we had
two applications, a red and a green. Let's take this one, for example; this is
the simplest one. The red and the green were both Hadoop applications, both
doing sorts, except that one of them had eight slaves and the other one had
four. I should have had this on the slide. One of them had a total of 96 maps,
eight per slave, and 96 reducers, while the other one used a smaller number of
reducers, four per slave. So it turns out that if you don't use this, right, the
party with the most slaves opens more TCP connections on all the bottleneck
links and gets a much bigger share of the bandwidth, right?
And it's pretty straightforward that since you are actually allocating based on
the application and not on the connections, you will get this kind of behavior.
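A toy calculation of that effect (numbers illustrative, not from the testbed):
per-connection TCP fairness rewards whoever opens more connections, while
per-application weighted fairness does not.

```python
# Illustrative bottleneck-link arithmetic; all numbers are assumed.
link = 1000.0                     # bottleneck capacity, Mbps
red_conns, green_conns = 96, 48   # assumed connection counts

# Per-connection fairness: share tracks connection count.
red_tcp = link * red_conns / (red_conns + green_conns)      # ~666.7
green_tcp = link * green_conns / (red_conns + green_conns)  # ~333.3

# Per-application fairness with equal weights: counts are irrelevant.
red_app = green_app = link / 2                              # 500 each
print(red_tcp, green_tcp, red_app, green_app)
```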
And there's a lot more to it -- there's a lot more description of whether the
applications complete faster. In the end, even if you give them equal bandwidth,
they don't all -- but that's the spirit of some of those results. But the most
interesting thing is that roughly the same thing happened in the multipath case.
There were a few edge effects that we are still trying to figure out. But in the
multipath case, if you go back to the picture, right, what we would really like
is -- in some sense the bisection bandwidth has gone up to two gig, because
there are two sets of paths.
And we'd like this two gig to be shared among these two applications, right, in
proportion to their weights. And that seems to happen. And that's the thing that
we are most pleased with. Because ECMP is alive and you can't wait for theory to
catch up -- and Hahne's result and all these don't really apply to all these
cases. So maybe there's some new theory to be done. But we are hopeful that
maybe in simple datacenter topologies the simple thing will work. So that's
roughly where we are.
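Numerically, the multipath observation amounts to the following; the weights are
assumed here for illustration only.

```python
# Toy numbers: two sets of 1-gig paths give 2 Gbps of bisection
# bandwidth, and with ECMP the mechanism appeared to split that
# aggregate in proportion to the parties' weights.
bisection = 2.0          # Gbps, two sets of paths
w_red, w_green = 2, 1    # illustrative weights
red = bisection * w_red / (w_red + w_green)      # ~1.33 Gbps
green = bisection * w_green / (w_red + w_green)  # ~0.67 Gbps
```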
And I think if we had to summarize what we are trying to do: we really are just
trying to figure this out, you know. What is the right notion? It seems like a
pressing problem. We need to define it; somebody needs to define it. And we're
taking a first definition, hierarchical max-min fair sharing. So maybe there's
some contribution in just defining what you mean, right? If you don't know what
you mean, right, you can't argue that one thing is better than another.
And then we are taking the simplest possible mechanisms, right, that can do what
we mean: one with a generalization of Hahne's result, right, doing it per party
as opposed to per connection; and then we fix it for UDP; and then maybe
generalize it too, right?
So that's really where we are at. We're not satisfied with the whole thing; we
think there's more to be done. But it's our first cut at this. And it turns out
in the Azure environment -- we talked to the Azure guys -- there are a number of
issues, right? Even the simple thing, like no change to existing routers, just
configuration -- they don't want that, right? And so maybe there are some new
things to be created, and we've talked about some of those things.
Balaji's approach, which is going ahead and sending these QCN messages, requires
modification, and we're not sure it will actually work in the near future,
right? And these guys have a totally different -- go ahead.
>>: [inaudible].
>> George Varghese: In the implementation?
>>: Yes.
>> George Varghese: I'd have to check. I don't remember all of these things. The
-- what was the X axis?
>>: The X axis was in ten-thousandths of a second.
>> George Varghese: No, no, no. So this one -- this wouldn't tell you anything
about convergence. For convergence you have to measure the [laughter], right, to
see how fast the [inaudible] got to some group. Because the bandwidth doesn't
tell you anything, right? And these measurements -- I think even the bandwidth
measurements were done in aggregate, over maybe a hundred milliseconds, so
that's way too coarse to tell you how fast [inaudible].
>>: [inaudible].
>> George Varghese: No, that's a totally separate experiment, right. But
generally it's roughly a few TCP round-trips, kind of, because it's a very
simple topology. You could argue that in more complex topologies it would be
more complex. But that's not shown by this, because these are just coarse
bandwidth measurements -- we managed to put something into the switch to measure
this, but it's very coarse.
>>: I mean it does look like a [inaudible] behavior going on. So there's some
dynamics that [inaudible].
>> George Varghese: So there are some explanations of why some of those happen.
Part of the behavior is because the application itself changes phases, right? It
goes from sort to -- remember, we have a real application running; it's not like
[inaudible] running. So definitely there's a big change in dynamics. And if you
see that, right, that will explain why these things change: because they
actually switch gears from sort to map and so on. So there is something that has
to be explained. So that's all I have. And I'm just happy to open it up for
anybody's ideas, because we're still trying to figure this out.
And Albert, do you want to start the discussion? You have five minutes.
>> Dave Maltz: Should we thank the speaker first?
>>: Yes, let's do.
[applause].
>>: I think the most interesting part is [inaudible].
>> George Varghese: Requirements.
>>: For solution.
>> George Varghese: Yes.
>>: And it seems like you require.
>> George Varghese: In the end, yes. Because we expose the locality in some
sense.
>>: Yeah. And so your neighbors influence what you get.
>> George Varghese: Yes. So it's very different from your kind of model, right,
the VL2 model, where you basically get rid of all locality, right? But it seems
to me that if you had a VL2 model, then you could really have, very nicely, a
model of just -- you could give everybody the model that they have a big switch.
And now there's totally no locality, because it's burnt out of the system by
your mechanisms, and now it's much simpler, because the model is you have just a
big switch and you have certain, you know, bandwidths that are equal everywhere.
>>: We still have to do your rate [inaudible] because basically the results
[inaudible] model basically [inaudible] hose model, which is exactly what your
[inaudible] -- you're not pushing more traffic into the network than can
possibly come out of it.
>>: Even with the VL2 network there will still be -- you would still depend on
where [inaudible] co-located because your bandwidth [inaudible] switch.
>> George Varghese: Will depend on that.
>>: Yes.
>>: [inaudible].
>>: [inaudible] VM and you're co-located on the server [inaudible].
>>: It matters inside of a single [inaudible] because there is no [inaudible].
If you happen to be co-located with very many bad [inaudible] at the same place,
then you're stuck.
>>: Okay. But that part you can fix [inaudible] on that one.
>> George Varghese: But I think that they're thinking of fixing that with another
level of flow control, right, which is like -- it's VM based. Each VM gets a certain
amount of bandwidth.
>>: Right. Right.
>> George Varghese: Okay. So I think Windows is putting that in, right?
>>: Yes.
>>: [inaudible].
>>: No. I mean only the problem that the congestion is at the network.
>> George Varghese: We are really bothered about --
>>: That co-location at the same place.
>> George Varghese: Sorry, Albert. You had a thought. You want to finish it?
>>: Yes. The other thing is, we talk about [inaudible] you don't know how much
[inaudible]. I mean, the whole idea of why this [inaudible] going to be cheap is
filling valleys, [inaudible] stuff like that -- not allocating for the max. So
even the big guys don't know quite how to [inaudible], and if you force -- if
you don't try to [inaudible]. So you know, say you buy 20 percent more of a huge
number.
>> George Varghese: In your paper -- and I've tried to understand the relation
-- in the hose model, you are also doing some kind of multiplexing, where -- in
the original paper, the AT&T paper, right? So how does -- I mean, do some of
those mechanisms apply?
>>: That was the [inaudible].
>> George Varghese: Yeah.
>>: And it was admission control. Yeah. So maybe --
>> George Varghese: Maybe it's similar in the end, yeah.
>>: Could we go back to [inaudible].
>> George Varghese: Yes.
>>: [inaudible] only one over the sum [inaudible].
>> George Varghese: Exactly.
>>: Times --
>> George Varghese: Times the bandwidth of the -- times your weight times the
bandwidth.
>>: That's the share.
>> George Varghese: Exactly. The [inaudible] is guaranteed for you regardless of
[inaudible], which I find helpful because it gives you a certain floor, right?
And now I can migrate and do everything. Now, on top of that, you can start
saying: if locality is visible and you have a network, now I have an
optimization problem -- where should I locate my VMs to get the -- go ahead.
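The guarantee being sketched in this exchange can be written out as follows,
where $w_i$ is party $i$'s weight and $C$ the link bandwidth (notation assumed):

$$\text{share}_i \;\ge\; \frac{w_i}{\sum_j w_j}\, C,$$

and the floor holds no matter how many connections or how much traffic the other
parties offer.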
>>: I'm trying to figure out how much that does solve some of your problems. So
today, I mean, I was [inaudible] not where I am. I'm playing all these tricks to
figure out how to get to be in the right places.
>> George Varghese: Yeah.
>>: With your system I'm still going to play all these tricks to figure out
whether I want to [inaudible] bigger there, right?
>> George Varghese: Right. But I think it's still nice to be able to be sure
that you have a certain bandwidth that works regardless of what other people are
doing. Regardless of [inaudible], to me that's reassuring, because it gives you
a certain minimum level of performance that you can count on regardless of other
people's gaming.
>>: And the other thing is that oftentimes when you look at these job studies,
the time you complete the job is defined by the outliers, right? So you've got
good locality for a bunch of your compute nodes and your data storage nodes, but
there's someone way far out there who you need data from because you couldn't
get [inaudible], so having a good bound on job time depends on having good
communication.
>>: But on that note, the outliers, unless you have a phase and [inaudible], are
the ones that are hurting you, right? [inaudible] a machine that has other kinds
of applications on it. I mean, your fraction will be proportional to that and it
will be quite low.
>> George Varghese: No, no. I think the assumption here is that you have a worst
case thing, right? We're not talking about a lot of parties -- you're talking
about lots and lots of parties. But if you're talking about a certain number,
regardless, there's a certain minimum, which assumes that everybody is on your
machine and on your link at the same time. That's your floor.
>>: [inaudible].
>> George Varghese: Every other party is there at the same time, right. So
that's like -- it may not be. It depends on how big you are and how much revenue
you get. So if you're half the revenue of the company, you get half the
bandwidth, always. You don't have to worry about an experimental person coming
in and destroying your stuff. But -- and various other things.
>>: But it does sound like if you get paid twice [inaudible] increase your
ratio, or you could open up 10 VMs and throw out the 9 worst ones. That second
approach might give you a much better reward.
>>: [inaudible].
>>: And so you shouldn't play all those other games.
>> George Varghese: Maybe.
>>: Yeah. I mean, without [inaudible] oversubscription you're going to
[inaudible] there's going to be upside to finding those.
>>: We were wondering, is there a market for oversubscribed networks -- if you
market it in a sense [inaudible], will it pay?
>> George Varghese: I think in the future will they pay for oversubscribed
networks or will they pay for this mechanism [inaudible].
>>: I was wondering if it's more general than that.
>> George Varghese: I think it's more general than that. I think they always
have [inaudible] networks. Even if it's not oversubscribed you still --
[inaudible].
>>: You know what you're getting.
>> George Varghese: Right. I think it's the predictability that I'm hoping is
something that [inaudible] will want. So even with the VM, I mean, to some
extent you have some notion of a certain minimum amount of performance, right.
And that I think is comforting, right, although most of the time you get a lot
more. But I think having --
>>: These models are still being experimented with. I mean, with Amazon VMs
there's no guarantee. It's like a one gigahertz processor, but it's not.
>> George Varghese: Actually because of [inaudible] or some other strange
thing.
>>: They do a lot better than [inaudible] but they don't do fair share. There
are games you can play to get more than your fair share on EC2, although you do
a lot better than if you bought a, say, [inaudible] out of the box -- a machine
that is horrible in terms of how it does fair share [inaudible].
>> George Varghese: Really?
>>: But three or four to one difference from fair share usually.
>> George Varghese: And what about EC2?
>>: EC2 is a lot better than that.
>> George Varghese: Two to one maybe?
>>: [inaudible].
[brief talking over].
>>: I think it ends up you can push it statistically but I don't think [inaudible] I
think it's [inaudible] for fair share.
>> George Varghese: Again, I think once you go to the Azure model it is a
different setting; you have people playing games. Here I think it's accounting
and engineering, and they're all reporting to one CEO, I'm hoping. And so it's
reasonable then -- whereas with paying customers it's very hard, right? If I'm
not using my hundred megabits and I paid for it, you mean you're using it? And
actually the argument is, when you're not using it, I'll get yours too -- but
that's still a little hairy. Can you imagine an ISP --
But in the company setting I think that's a little more plausible, right? So
we're still feeling this out. But thank you for your comments. And yes.
>>: [inaudible] even if you have bandwidth allocation you will never have a
stable system, right, even if [inaudible] drop things. So I mean, what your
assumption really is, by doing this [inaudible] allocation you sort of make the
assumption that things are stable, that I can guarantee this end-to-end path
from end point to end point. But if things -- for example [inaudible],
essentially things can get bad even within a single router [inaudible]. Now,
when you don't have stability in the system, then even if you [inaudible],
things can still go [inaudible] constant influx.
>> George Varghese: So what we like about this -- actually, that's an
interesting point. This is a point I think I made. But the nice thing about
this, the floor, is: normally, imagine that you had alternate paths, right, and
you had backup paths. What would happen is, if you want to do reservation, you
would have to reserve on the alternate paths as well, because you never know if
you'll use them. And that is totally wasteful.
Here, you go ahead and put these kinds of weights on these alternate paths, but
if you're not using them, other people are free to use them, right? Which is
kind of nice, because in some sense you're less wasteful. So [inaudible] you
don't have to propagate anything, because the weight's already there,
everywhere. You go to the alternate path and it's waiting for you; it's already
preallocated. So I'm not quite answering your question, but it did suggest
something: even if the network is churning a lot, changing a lot, right, from
one path to another, this stuff is totally independent, because, at least in the
simple model, it's configured -- it doesn't change. It has its weight and it's
ready for you. You switch from one path to another and it automatically
allocates, because you just configured every router when you started.
>>: [inaudible] for instance?
>>: I think that's [inaudible].
>>: [inaudible] so essentially, do you not become sensitive, because there's
[inaudible] stability in the system [inaudible]? And so even if you have these
weights, if there is some [inaudible] --
>> George Varghese: What would your ideal be here? What would you like?
Because until I know that, I don't know how to compare it to what it's doing.
>>: Well, I guess the sense really is that, you know, [inaudible] it's basically
[inaudible] engineering on the links and [inaudible] stability is [inaudible] so those
[inaudible].
>>: [inaudible] first of all I have to apologize, but that's exactly my
[inaudible] what we want, and that's a great pushback. And then adapt a
mechanism for that.
>> George Varghese: I think the big question is what the question really is.
Really, that is [inaudible], because this is a whole new area and, you know,
everybody has this intuitive notion that something has to be done. But the
question is what needs to be done, right? And what would users want?
>>: Simpler question [inaudible] since you're saying, in terms of robustness of
your system, so if you have some notion of [inaudible] but if you have --
>> George Varghese: I would say the first one is very robust, right. The second
one maybe a little less. The third one is probably not robust at all. You know,
if you have tremendous changes in traffic, it's just going to be -- it's
centralized, it takes a long time, and it's going to rely on predictability. So
that's my [inaudible].
>>: [inaudible].
>>: [inaudible].
>>: [inaudible] helps a lot with that. It turns out if you just turn on DRR with
[inaudible] DRR, that like the intuitive -- and this is [inaudible] -- which is
to say that you don't end up with a small flow. You only screw up people in your
own DRR bin, and so you can cause real nasty things to happen to people in your
own DRR bin, but if you've set that to be yourself, then that problem mostly
goes away, is my understanding. That's certainly what my experiments with DRR
turned on show; it helps a lot.
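For reference, a minimal deficit round robin sketch (class and method names
assumed); it illustrates the isolation point above, since each bin earns a fixed
quantum per round and a misbehaving flow can only hurt flows sharing its own
bin.

```python
# Hypothetical DRR sketch; packet lengths stand in for real packets.
from collections import deque

class DRRScheduler:
    def __init__(self, quanta):
        self.queues = [deque() for _ in quanta]   # per-bin FIFO of pkt sizes
        self.quanta = list(quanta)                # per-bin quantum (weight)
        self.deficits = [0] * len(quanta)
        self.turn = 0                             # rotating service pointer

    def enqueue(self, bin_idx, packet_len):
        self.queues[bin_idx].append(packet_len)

    def dequeue_round(self):
        # Serve at most one bin per call, cycling in round-robin order.
        n = len(self.queues)
        for _ in range(n):
            i = self.turn
            self.turn = (self.turn + 1) % n
            q = self.queues[i]
            if not q:
                self.deficits[i] = 0              # idle bins accrue no credit
                continue
            self.deficits[i] += self.quanta[i]    # earn this round's quantum
            sent = []
            while q and q[0] <= self.deficits[i]:
                pkt = q.popleft()
                self.deficits[i] -= pkt
                sent.append(pkt)
            if not q:
                self.deficits[i] = 0              # emptied: forfeit leftover
            if sent:
                return i, sent
        return None
```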
>>: [inaudible] the most robust [inaudible].
>>: [inaudible].
>> George Varghese: It's not like a control system, that part. That's why I said
the first one is. The second and third, they're measuring, they're going to -- I
hear you, I feel your pain there. But not in the first one. Right. So thanks.
>>: It's a type of [inaudible] [laughter].
>>: Yeah.
>> George Varghese: Yeah. [inaudible].
>>: What's the [inaudible]