>> Dave Maltz: So it's a great pleasure to introduce George. George is an old friend. George is, I think you'll find, a master expositor and a fantastic teacher. And he wrote an amazing book that I recommend to everybody called Network Algorithmics. It's organized in a really special way around principles, and it's not like any other networking book out there that I've ever seen. And, you know, George is known for lots of mechanisms that are implemented in every switch, like deficit round robin. He founded a company, NetSift, which was acquired by Cisco and did worm detection and such; it went into Cisco's gear. So George has had a huge impact on the networking industry, both in education and in real products. This particular problem he's going to talk about is super dear to our hearts here at Microsoft. Performance isolation is something that Azure absolutely has to solve, and any enterprise datacenter has to solve as well. There's work on it at MSR, there's work on it at UCSD, so it's super timely, and we're really keen to find out how George attacks the problem. Thanks, George. >> George Varghese: Thanks. Thank you. So I'm afraid I have only 15 to 20 slides; I should probably have made a few more, but I ran out of time with all the things I was trying to do. I think the basic ideas will be clear, and where I forgot the experiments I have the paper here, so I'll go back and say, remember what the parameters were. I'll be groping at the graphs and saying, now, what was that. So we'll figure it out together. Okay? So I guess the problem is very simple conceptually, right? We'd like to virtualize datacenter networks across services. So the setting that we address, which is somewhat different from the Azure one, so I want to make sure you understand our setting, is enterprise datacenters, right? What people sometimes call private clouds, as opposed to public clouds, okay? So a private cloud, like let's say a FedEx or a Pfizer, typically has been sold the vision that instead of keeping separate servers for all your departments -- engineering, accounting -- because of VMs you should be able to save a lot of money by consolidating your servers, right? So that's the trend that we know from VMware and the VMware [inaudible]. So that's happening, right? And alongside that, storage is also being somewhat virtualized, because a given physical disk can always be partitioned and broken up among people. But now you're beginning to see things. If you look a little carefully into the trade press, you'll see people saying that during times of VM backups, or when strange things are happening, these VMs are interfering with each other, right? So in some sense the network -- you know, virtualization means many things, and you have to define what it means, right? But one kind of instinctive meaning is separation. You would like these things to be separate. And so really the goal we are trying to address here is to actually build what people have called a virtual datacenter, right? So what's a datacenter? You have resources like CPU, disk, and memory, and then you also have a network. Right?
In some sense people have good solutions for sharing memory, disk, and processing, right? And so we would like to define what it might mean to share a network. And this is all very early work. Balaji [phonetic] has been doing work, Albert and Srikan [phonetic] and Ming have been doing work. So I think everybody is groping for a definition. So don't take any of this very seriously. You might come up with a better definition. But at least you have to start somewhere. So that's the context for this. All right. So definitely modern datacenters are built at very large scale, thousands of servers, and they execute a large number of applications. And so people are already virtualizing resources to reduce cost, rather than having a datacenter for engineering and a datacenter for accounting. And they really like this because you also get agility, because you can move your VMs around. Okay. So that is kind of nice. And so you have all these existing technologies we've talked about: Xen, SANs for storage, and people are fooling around with memory too. And really a lot of this is about resources on a single machine. Right? So the interesting thing that we'd like to look at is bandwidth, and now what you have to do is divide it across a set of machines. Take VMware, right? If you take a single server, I think Amazon has some number, like say 10 VMs allocated. And it's pretty easy to see the model, right? In the worst case, if all 10 VMs are actually active, which is very unlikely, you'll get one-tenth of the CPU. If none of the others are active, you could in principle get all of the CPU. But actually they tend to be a little alarmed by you getting all of the CPU and playing games, so they tend to have some limit on how much CPU you get. So they won't give you all of it. So they have a min guarantee and a max guarantee. So that's some kind of guidance as to the way people do it. But that's on one machine, right? Imagine now you had to do it across multiple machines. You have to kind of define what that means. So, is bandwidth a bottleneck? These are numbers from VL2, I think -- I don't know; this should have been credited, but these are some slides that Terry made up. And so there's lots and lots of traffic between the servers in the datacenter, right? There is stuff leaving and entering. We are really worried about the bandwidth within the datacenter, okay? So the outside stuff -- we're not talking about ISPs, we're not talking about the outside stuff. And so the network may be a bottleneck for computation. At least that's an assumption that we are going to hold to here. Okay? And certainly with a little bit of Googling you'll find these slightly disturbing examples of people saying, oh, I virtualized everything and my app was doing just fine, and suddenly the other guy's VMs started backing up and I took a hit. So people are beginning to see this. So all right. So basically we would like to have this notion of bandwidth virtualization. And I think you can think in terms of Microsoft, for example, right? You have a bunch of properties. You have things like Bing, you have -- what do you have? You have -- [laughter] e-mail. What is it called? MSN. Hotmail. I'm sorry. I don't use these things. [laughter]. I'm on record too here. Victor's looking at me.
You know, I'm so used to using one of your competitors' products that I had to really keep groping for the names. Right. So you have a bunch of properties. And you would like to give each property, each application, the illusion of owning a virtual datacenter: separate CPU, done; disk, done; memory, done; and the network is [inaudible] okay. Now, one of the most important things we believe in, and it's a little arguable, is that you would like to have statistical multiplexing on the network bandwidth. If you don't want it, it's really easy, right? Go and reserve everything. And to some extent reservations are possible. People know how to do it: you can just reserve bandwidth at every link. But when you start trying to statistically multiplex, the game becomes more challenging, right? We kind of know how to do it on a single link. There's something called [inaudible]. But to do it across multiple links you need a definition first, right? Yes? >>: [inaudible] resources out here that need to be virtualized [inaudible] on that list? >> George Varghese: That's not the complete list at all. Good point. Right? It's a [inaudible]. Because you might be using an equal amount of CPU as another app, but you might be using a lot more power. It's not at all clear that those two are the same. Generally I think the assumption, and it may be a simplifying assumption, is that if you use the same amount of disk and memory and CPU, you're probably using about the same [inaudible]. Maybe that simplifying assumption is wrong; if power doesn't derive from some of these things, then it's a separate measure. But you're right, this is not a complete list. It's the first-cut list that one would pick, right? So power is a very interesting one -- how do you divide your battery. So I guess we really would like statistical multiplexing. And if you look at disks, clearly you don't statistically multiplex. You can't have part of somebody else's disk -- they are using it; you'd be writing into their bits. That's not possible, right? But certainly it's true for VMs, right? When one VM is not running, the other VM can get at least some of the CPU. Right? So, go ahead. >>: So the reason [inaudible] right, I mean unless you're using [inaudible] so if you do, right, if you [inaudible] so they're not completely made for that. >> George Varghese: I agree. I agree. But what we'd like to do is give the property a certain minimum on bandwidth, and maybe, if they're lucky, they'll get up to a certain max. Okay. So with all of this, why is bandwidth virtualization hard? Because now you have to do it across multiple links. Okay? Now, you have all these QoS mechanisms, right? And they look like they kind of solve the problem. There are things called router scheduling mechanisms by which you can give somebody a fair share of one link, right? So you take the link, and today I think many routers will basically allow you to write a classifier and map traffic onto a certain number of buckets -- maybe a hundred, maybe 10, 20 -- and all those guys can be given a share of a link. And you can give them weights, so you could have two-fifths of the link, I could have one-fifth of the link.
But now you have to do it across multiple links, and it turns out that it's really confusing, because there's tons of work, right? There's traffic engineering. But what does traffic engineering do? We certainly spent a lot of time reading a lot of papers, including his. And as far as I understand it, traffic engineering does a better job of routing admitted traffic, right? It doesn't really have this notion of allocating across multiple parties. So, for example, if a DoS attack comes into your network, traffic engineering will do its best to route it on the less utilized paths, but it doesn't prevent it from taking over the share of other people. So yes? >>: [inaudible] statically allocated slice or dynamically -- >> George Varghese: Good question -- that's where statistical multiplexing comes in, right. We'll come to the model in a second. So after all [inaudible] right? Okay. So let's start with a model, right? So let's just draw this network, which is a physical network. Very simple, right? A two-tier network, where there are four switches at the bottom and there's one core switch. We could make it more complicated, but let's start with this. But let's also have these different properties, which are colored. So you have A1, which is the green property, which has a VM here and a VM here; A2 has a VM here, here, and here; and A3 has a VM. So they kind of overlap in some strange ways. And I'm going to assume that whenever I show something like this, all pairs want to communicate. So -- by the way, these are the switches; I haven't drawn the machines. But it means a machine here running A1 wants to communicate with the machine here running A1. And in the case of A2, you have one connection between these two machines and one connection between these two machines. Does that make sense? Is this picture clear? Because I haven't drawn the hosts, to keep it simple. I've only drawn the edge switches and the core switches. Right? So now I would like to have some kind of manager decide some kind of weights: the green guy is four times as important as the others, and the blue and red are equally important. Now, how do you do that? Okay. So I think weights are very important, because if you talk about properties, a very natural way to allocate bandwidth is by revenue, right? And if you don't have properties, if you have engineering and accounting, companies are very familiar with cost accounting, right? It's a very natural way to say, okay, you want more, you pay more, in an internal cost accounting sense. So you need some kind of lever, right? And we're looking for the simplest possible lever. We don't want a lot of bells and whistles, just a simple weight. But this is a weight across the network, right? So what does it mean? It's not entirely clear how these guys share. So let's take an example, right? Imagine that A1 and all of them want to send at full throttle; they just want to dump into the network. All connections have enough data to send. They're all trying to complete, you know, some big MapReduce. Well, let's start with this link over here, right? On this link, the only competitors are A1 trying to send one connection here and A2 trying to send one here, correct? Right?
But what happens is that -- oh, I'm sorry, I screwed up actually. This should have been the -- right. Okay. So maybe this is right. So when they compete, they share in the ratio 4 to 1, so therefore this guy should get 8, right? So what we'd like to do is decompose this picture into a green network, a purple network, and a red network, and we'd like to give bandwidth labels to each. That's a simple model. It's not the best model, because we're not hiding topology. Okay. But we are exposing locality. And at least currently in datacenters there's a huge difference between local bandwidth and the rest, so maybe it's an advantage to actually keep a slightly more complex model. Now, people don't like to actually expose topology to customers, right? But remember, we are not talking about the Azure environment yet. We're talking about enterprise datacenters. So it's not as unreasonable to expose your topology in enterprise datacenters. In the Azure environment you might think, do you want customers to have the exact view of your topology, because, you know, they could attack it and stuff. But for now let's just finesse that. Right. Go ahead. >>: The question I have is, so we hear a lot about [inaudible] networks and customers [inaudible] from rack, and customers being aware that racks are on the same edge switch or different parts of the tree. So are we assuming some [inaudible] network here where these things go away, or -- >> George Varghese: No, no. I think it's somewhat [inaudible] all of that, but definitely not assuming that. We would like to work regardless of the underlying network model. So for example, if there's much more bandwidth in a rack than there is elsewhere, then we want to expose that, right? But we want to expose your share of that, so that you see it independently of other people. Yes? >>: [inaudible] tremendous amount of non-uniformity in how much bandwidth each particular instance of an application or member of a [inaudible] would get based on whose [inaudible] so if you look at the purple in the middle, that guy gets 2G, that VM might have 2G of connectivity, whereas an equivalent A2 VM sitting elsewhere only gets 1G just because of who else happened to be placed on the [inaudible]. >> George Varghese: Right. The general thing I would do to avoid that would be: if there are a few properties, you are guaranteed on every link at least this share of your bandwidth. So for example if the weights are 4, 1, 1, and you're the weight-4 property, you're guaranteed at least 4/6 of every link if there are only 3 applications, regardless [inaudible]. So if you want to move people around, you have some kind of floor that is actually totally independent of [inaudible]. >>: I see. >> George Varghese: Yes? >>: So this model [inaudible] from each instance of the VM on net based system? >> George Varghese: No. The weights are assigned to the application. Think of it like the property: search gets weight 4 and e-mail gets weight 1. The applications don't even know any of this, except that there's some identifier in a packet, like a port number or something, that identifies the application, and that can be mapped to weights. >>: [inaudible]. >> George Varghese: They could migrate and that will be fine. >>: And that will mean that you have to reconverge your [inaudible]. >> George Varghese: We will reconverge the bandwidth allocations.
But there's a certain minimum bandwidth you get regardless of where you are. Right? In some sense you get a certain proportion of every link. But the actual numbers -- the statistical multiplexing is going to vary tremendously depending on where you're co-located and who's active. That's true anyway, right? Even if these guys are not co-located, it really depends on how much this guy is using it. If this guy is empty, you get all of the [inaudible]. So we know that anyway. So migration is just another piece of dynamism that affects the statistical multiplexing, right? So you have a minimum and a maximum, right? >>: [inaudible] true for compute virtualization or storage virtualization? >> George Varghese: For compute there is, but not for storage, right? With compute, for example, if the other VMs are not active you can go up to the top. Now, what we think we will also have to do in such a model is enforce a max. So even though, for example, on this 10 gig link, if the purple was not there, this guy could get the entire 10, right -- this is what statistical multiplexing gives you -- you probably want to allow the administrator of the network to configure a lower max. Because other people play games. Maybe not in the enterprise, but at least in the Azure environment you probably want some max, which is easy; it's just a rate control. All right. But I haven't really talked about the model. Okay. So now, let's look at the purple guy. The purple guy, on this link, basically gets two gig, because when he's sharing this link with this other guy there's a four to one ratio against him, so he gets two. But now it's a little more complicated, right? He has two connections, one going like this and one going like this. Because I assume that there's a connection between each pair of machines. So the assumption that we are going to make is that whenever you have multiple connections within the same application, they share equally. Okay? So this is an important distinction, because what you're trying to say is that even if an app opens up many, many connections, it gets the same share; its share is completely independent of that. And that's really important, because today we know that in Hadoop, by simply increasing the number of slaves or the number of masters, you can get a tremendous number of TCP connections and you'll get an unfair share of bandwidth. So that's not happening here, right? First we're going to take off the top based on the weights, but then within that, we don't want to differentiate your connections. It's just too much work. We could if you wanted to. But then you'd have to explicitly say, for each of your TCP connections, what the weight is. And you could add that to the model, but it's messy -- messy not in implementation but in specification, because you have to specify every pair of VMs and what they're trying to do. And we're trying to avoid all of that in the model. Okay. So this kind of gets [inaudible]. And now here's an interesting thing that happens. When this guy gets one gig over here, if you look at this link over here, the red guy is sharing it with this, but he's only sharing it with one of these connections, and that connection is limited to one gig, so the red guy can go all the way to 9.
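To make that arithmetic concrete, here is a minimal sketch of the two-level split just described; this is illustration, not the actual system code, and the names (link_shares, conns_per_tenant) are invented. Tenants split a link in proportion to their weights, and connections within a tenant split the tenant's share equally:

```python
# Sketch of the two-level split on one link (illustrative only).
# Level 1: properties share the link in proportion to their weights.
# Level 2: connections within a property split its share equally.

def link_shares(link_capacity, weights, conns_per_tenant):
    total_weight = sum(weights.values())
    shares = {}
    for tenant, w in weights.items():
        tenant_share = link_capacity * w / total_weight
        n = conns_per_tenant.get(tenant, 1)
        shares[tenant] = (tenant_share, tenant_share / n)
    return shares

# The 10G link from the example: green has weight 4, purple weight 1.
# Purple has two connections crossing this link, so each gets 1G.
print(link_shares(10, {"green": 4, "purple": 1},
                  {"green": 1, "purple": 2}))
# {'green': (8.0, 8.0), 'purple': (2.0, 1.0)}
```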
So this behavior -- the red guy going to 9 because the other one is bottlenecked -- is a famous thing that many of you will have heard of. It's called max-min fair sharing. Max-min means that if you're bottlenecked somewhere, like this guy, then even though on this link you and the red guy have equal weight and you'd think you should be given five -- why? I'm limited to one elsewhere, so why not be generous and give it to the other guy; don't be a dog in the manger. And so this guy gets 9. Okay. So there's actually a slightly recursive calculation where you have to find the bottleneck, share that one out, and then that suggests other bottlenecks, and so on. But it's automatically done. It's very well known in the literature. What we have here, though, is what we would call hierarchical max-min fair sharing based on the properties, the colors. Once that has been decided, then you do max-min fair sharing within the connections, assuming equal weights. So it's a two-step process. Yes? >>: [inaudible]. >> George Varghese: This will have to be recalculated. Although there's a certain minimum bandwidth, as I was telling Dave, that will happen regardless. So you will be guaranteed that regardless of the way people move. Because fundamentally, if you take your weight divided by the sum of the weights of everybody, there is a certain ratio you're guaranteed. So there's a floor you have. So you can be completely [inaudible] with it. Yes? >>: So going back to what Albert was asking. Thinking about [inaudible] and thinking about this [inaudible] independent of what you do with the compute. So if you don't have the same sort of model for compute, you might get -- so let's take the extreme case: you allocate 4, 1, 1 here and you allocate something like 2, 3, 1 there, right? >> George Varghese: Assuming the allocation is network-wide -- >>: So [inaudible] so what I mean is, if your compute allocations are not matched with the network bandwidth you've actually allocated, then you could get into a situation where you're actually not exploiting the benefits of -- >> George Varghese: Yes. So there are a number of design questions once you have this. This is a bunch of mechanisms. How you use them is a separate question: given a certain compute requirement, how do you map it onto a certain weight? But remember, the weights are across the network. We're trying to keep the model as simple as possible. Because generally our experience is that Cisco has weighted RED, per this, per that, per app, and nobody uses those knobs. It's just too complicated, right? So, you know, a lot of us keep putting in more and more knobs, but the administrators don't trust them for the most part. So we just try to give a very simple knob: you have, you know, 10 properties, and just give me the weights. >>: [inaudible] sort of wondering, to benefit from that, you actually have to have -- >> George Varghese: You have to have a design thing -- >>: All across, not just -- >> George Varghese: I agree. Yeah. And then you might ask the questions, how should I be migrating my VMs, what is the right way. But those are questions that come on top of this. So first you have to give some mechanism, some form of control, to start. Yes? >>: Even in the network [inaudible] you could have multiplexing between [inaudible] such that if you have [inaudible]. >> George Varghese: Yes.
>>: So bursts actually can come [inaudible] you should be able to give [inaudible] is actually not using that [inaudible]. >> George Varghese: Right. And that's actually the next slide. Exactly the next slide. So the next slide is, what happens when this guy is not bursting -- he's not using the full thing, he's only using six gig, right? What happens is this guy gets four gig on this link by the allocation -- that's what we desire; we haven't shown you the mechanism yet. And therefore two gig here, and therefore this guy goes down to eight. So there is a recursive sort of coupling. Yes? >>: I'm trying to understand, in the blue network, why the two left nodes have different allocations. >> George Varghese: Well, this is the -- >>: [inaudible] identical. >> George Varghese: Yeah. Yeah. This is the sum of both of them, right. So you're right. This is the sum of both connections: there are two connections, one going here and one going here, both of 2G. >>: But also two from the left one, going from the middle to the right. >>: That, the [inaudible]. [brief talking over]. >> George Varghese: I probably didn't calculate it right -- I may not have done the calculations right. I was trying to do this a few minutes before. I think I'm not assuming a connection from here to here, right. If there was, I would have to change these numbers. Sorry. You're right. I think they worked out correctly in the paper, but every time I do these examples I get them wrong. So I apologize. Yes? >>: So is there a single [inaudible] to figure out these caps pairwise, or could it be -- >> George Varghese: No, no. The way it works is you first find the bottleneck link, the one where the capacity divided by the total weight is the smallest. And once that happens, that restricts certain flows, but then those flows create new bottlenecks, so you have to keep going. It's a very standard algorithm; I can tell you about it. It's classical and it's been studied for 20 years now. It's called a water-filling algorithm, where you go to the bottleneck, fix that, and so on -- there's a small sketch of this at the end of this exchange. >>: [inaudible]. >> George Varghese: No, you can't. You have to actually -- in the worst case it can be E sequential steps, where E is the number of edges. But in practice it's more like the diameter; in practice there are only a few bottlenecks. And we've done it in centralized fashion. Yes. >>: So every time [inaudible]. >> George Varghese: Yes. >>: [inaudible]. >> George Varghese: Yes. >>: [inaudible]. >> George Varghese: We'll talk about all the tradeoffs. Once we get to the mechanisms we'll see the various ones have different delays. Okay. So now with this -- so I guess the first thing is, let's talk about the three mechanisms we're going to use, and one of them is going to be something called group allocation. So let's talk about our goals, right. What can we change? Now, if you're Microsoft -- and hopefully this is a willing audience to hear this -- you really can't go around assuming that the routers will change. Because, you know, that's Cisco, and yes, you might be able to convince them, and maybe you have friends in Broadcom, but it still takes a while. So ideally you would like to do this with no router changes, right?
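Coming back to the water-filling calculation mentioned a moment ago, here is a compact sketch of the classical version, assuming equal weights and one fixed path per flow; the names (max_min_fair, the link labels A and B) are made up for this illustration. Find the most constrained link, freeze the flows crossing it at their fair share there, subtract, and repeat:

```python
# Minimal water-filling sketch for max-min fair share (equal weights).
# flows: flow_id -> links it crosses; capacity: link_id -> Gb/s.
def max_min_fair(flows, capacity):
    cap = dict(capacity)
    active = {f: set(links) for f, links in flows.items()}
    alloc = {}
    while active:
        # Fair share on a link = remaining capacity / number of active flows.
        def share(link):
            n = sum(1 for links in active.values() if link in links)
            return cap[link] / n
        bottleneck = min({l for links in active.values() for l in links},
                         key=share)
        s = share(bottleneck)
        # Freeze every flow crossing the bottleneck at that share.
        for f in [f for f in active if bottleneck in active[f]]:
            alloc[f] = s
            for l in active[f]:
                cap[l] -= s
            del active[f]
    return alloc

# The earlier example: "green" is bottlenecked to 1 on link B, so
# "red" gets the remaining 9 on link A rather than an equal 5.
print(max_min_fair({"green": ["A", "B"], "red": ["A"]},
                   {"A": 10, "B": 1}))   # {'green': 1.0, 'red': 9.0}
```

The hierarchical version the talk wants would first run this over properties using weighted shares, then split each property's allocation equally among its connections.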
Now, Balaji [phonetic], when he came here, talked about a mechanism that is sort of a modified QCN. There he's assuming that he can change the switches. So, number one, a clear point of distinction between his work and ours is that we're not going to assume the routers or switches change. Okay. He also would like not to add software to the hosts. Some of our proposals are going to add software. Okay. Microsoft's probably okay with that, but certain other people would find that hard. So our first mechanism is called group allocation, and it's going to leverage TCP's behavior. It requires no software or hardware changes, but it does require configuration in the routers -- and even that was hard for the Azure folks, right? And it's going to converge very fast; it's going to converge [inaudible] basically TCP, right. Then there's another mechanism called rate throttling, because the first one works only with TCP -- I'll get to all of this. For that one you're going to have to do some kind of measurement and some kind of [inaudible], and you have to add software on the host, and it's going to take a few milliseconds just to measure that there's a problem. Okay? And finally a much weirder thing. This is a centralized allocator where, like RCP and centralized route control platforms, we do everything, including bandwidth allocation, in centralized fashion. And this is going to be slow, because in general you're probably looking at the order of hundreds of milliseconds, or tens. So this is a set of tradeoffs, and we'll try to show you that the first one is the easiest and fastest, but it rests on some assumptions and it only gives you one definition. The second one can handle UDP as well; the third one can handle everything and the kitchen sink, but it's complicated -- you have to do a certain amount of work. So there's a set of tradeoffs that [inaudible]. So let's start with this picture, okay. Maybe some of this stuff about the definitions will become a little clearer. Now I've dropped all the datacenter topology; I'm not bothering to draw anything but switches and a simple topology that I can understand, and not every pair. So there's a host 1 and a host 4, there's an edge switch, a core switch. You can, if you want, mentally twist all this into the right shape. But I'll do it this way. And there's a host 1 and a host 2 and a host 4 and a host 3. So the idea is very simple. There is a famous paper which some of you may have read, but it's very old, so I'm not sure the new grad students have read it. I'm sure [inaudible] has. It's a paper by Ellen Hahne from many years ago -- do you remember when this was? She was Gallager's student, right? And she basically asked, first of all, why don't window flow control and simple fair queueing just work? So let me see if I can do an example that I hope I'll remember. Imagine that I have -- and this is a 10 megabit link -- suppose one guy has a weight of 4, and another guy here, and there's somebody over here who wants to come in over here, right? So there are three colors, three properties. One has one connection here, one has one connection going this way, and a third one going here. So imagine this property has a weight of 4.
So that limits this end-to-end flow to a rate of 2. Right? Because out of this 10, the weight-4 guy gets four-fifths. So you should normally be limited to two here, and therefore this other guy over here should get 8, correct? That's the desired outcome. Now, suppose you just used router configuration, used DRR, and said that of these guys, this one gets four times the other one. You indeed get the right division on this link: you'll get 2 and 8. But if you don't do anything else, this guy is going to send way too fast on this other link and actually take five out of five, although all those packets will later be dropped. So simple router-based deficit round robin or fair queuing works, but it's wasteful. Because what could happen is the sender could say, you know, forget it, I'm just going to send at maximum rate and I'll get throttled at every link; at the bottleneck I'll finally get throttled to my right rate, but I'm going to waste bandwidth on all the preceding links. Now, some of you will be thinking, but that can't happen if the sender is TCP. And you're right. And that's really Hahne's intuition. Hahne basically showed that if you take any window-based flow control and fair queuing at the routers, you get max-min fair share -- under a bunch of assumptions; it's a hard proof. But it's not hierarchical max-min fair share, so we can't use it directly. So basically all we are saying is that if you do fair queuing not by individual connections but by the properties at each router, then you will get exactly the definition we want. You'll get hierarchical max-min. So you sort of get it for free. You don't have to do anything. It's almost an embarrassing result, because it says: do nothing, use existing stuff and TCP, and you get the right answer. So it's appealing from an engineering standpoint, but it's not very appealing when you write a paper, right? It's like people say, all right, then leave, you know. So okay. So let's start with A1 here. >>: [inaudible]. >> George Varghese: Yes. >>: It's kind of the very same thing. >> George Varghese: Yes. >>: Exactly the same thing. So if [inaudible] flows to [inaudible] for example, cars are coming on different streets and they're all going in one direction [inaudible] at the very end it's going to be totally [inaudible]. >> George Varghese: Right. >>: Unless you [inaudible]. >> George Varghese: Right. >>: That's [inaudible]. >> George Varghese: These are very well studied problems. The only twist here -- I said this theorem is like 20 years old, right? The twist is that we are moving it to a hierarchical setting, and the only thing we are saying is, make sure you don't do the fair queuing on a per-connection basis but on a per-property basis. So if you have 10 connections coming in from search, all of them will be treated in the same DRR queue. If you do that, you'll get the right result. Okay. So let's see: there are two properties here, and let's assume they have equal weights, I guess. And there is this guy. And so there's a 4 to 1 ratio. Now, let's look at the time scale on which this happens.
So if both these guys start at the same time -- the greens and the reds -- as long as you have a DRR-like scheduler at the router here, which they do, that is basically giving one of them 4 times the packets of the other going out of the southbound link, and immediately this guy's going to get hit. Does that make sense? There's no time constant at all. This is even less than round trip delays; it's microseconds, right? So that's done. But when that's done, at some point this TCP is going to sense that there is less bandwidth and back off. But it's also sharing that 2 with this other TCP, the H2-to-H4 TCP, so they're actually going to go down to 1 and 1. Right? So just the right result automatically happens and you don't have to do anything. What's happening now? What did I do wrong? Okay. Well. Go ahead. >>: [inaudible] on the flows, right? >> George Varghese: Not really. Because it only takes a few round trip delays for this to happen. >>: But I mean, there have been plenty of results that say that TCP takes a long time to converge when you have multiple -- I mean, in this case I guess there's one bottleneck. >> George Varghese: So it depends on the number of bottlenecks. I agree. I agree. It turns out that the regular max-min fair share calculation moves from bottleneck to bottleneck, so there's a factor which is the number of bottlenecks. Generally you don't assume there are that many, especially in the datacenter. I would say there's the uplinks and maybe -- it shouldn't be a problem, right? A few round trip delays. Sorry, you had a question? >>: So this still has the same kind of weird problem that if there are two connections coming down from H2, then they're each going to get one-half. >> George Varghese: No. Okay. If there are two connections coming down from H2 to this router, no. Because you're doing it based on green versus red. Not based -- so if there are two connections -- you mean off the green share? >>: Correct. Yes. >> George Varghese: Definitely. >>: Would they each get one-third, or would you have one, one-half, one-half? >> George Varghese: No. Okay. So let's start. You want two connections coming down. And is there another TCP connection coming down? So H2 is opening up a second connection to H4. Right? So first, the router scheduling mechanism is still going to take 8 for this guy, so 2 is left over for the green again. Now, if he opens up three connections, he gets two-thirds, two-thirds, two-thirds. So your connections are not going to affect the other properties. That's important. If you open up more connections, they divide your share; that's your problem. Now, you can argue that in a VM world you might want to do further limiting on a VM basis. >>: And you may have also wanted to control the relative rates of your [inaudible]. >> George Varghese: You may want to control the relative rates of your own flows, right. So we have simple extensions by which we can do that. But it's extra complexity, and we're always afraid of anything that complicates the model. So you could say between certain pairs of hosts you want to give more weight, right? And in fact, some of the next things we do will allow that.
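As an aside, here is a minimal sketch of deficit round robin keyed by property rather than by connection, as just described. The class and parameter names (PropertyDRR, base_quantum) are invented for illustration; a real switch does this in hardware per output port:

```python
from collections import deque

# Deficit round robin keyed by property, not by connection: all of a
# property's connections land in one queue, so opening more TCP
# connections cannot steal bandwidth from another property.
class PropertyDRR:
    def __init__(self, weights, base_quantum=1500):
        # Quantum proportional to weight: a weight-4 property drains
        # 4x the bytes of a weight-1 property per round.
        self.quantum = {p: w * base_quantum for p, w in weights.items()}
        self.deficit = {p: 0 for p in weights}
        self.queues = {p: deque() for p in weights}

    def enqueue(self, prop, size, pkt):
        self.queues[prop].append((size, pkt))

    def next_round(self):
        """One DRR round: yields the packets to transmit, in order."""
        for prop, q in self.queues.items():
            if not q:
                self.deficit[prop] = 0   # idle queues accumulate no credit
                continue
            self.deficit[prop] += self.quantum[prop]
            while q and q[0][0] <= self.deficit[prop]:
                size, pkt = q.popleft()
                self.deficit[prop] -= size
                yield pkt
```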
Now, so far, though, this problem is certainly going to be true for UDP, because a UDP sender doesn't care. He's going to just dump over here and steal away the 5. You can argue maybe it doesn't matter, because nobody deserved it anyway -- they got their paid-for shares. But there's something a bit ghastly about a UDP sender whose packets are going to be dropped later grabbing bandwidth anyway. Now, there's tons of work on TCP-friendly mechanisms, but we try to do something very simple. This is actually the idea from [inaudible], and it's a very simple idea. Basically what he said was: you do the same thing, and now it's 8, and the problem is, what if A2, the green guy, is UDP? If you don't do anything, what he's going to do is transmit as fast as possible and take five out of this link, as opposed to the one he's actually getting here. He should be sending at one, but he's going to take five, and all his packets are going to be dropped at this point. So how do you prevent that kind of behavior? The idea is that at the receiver you put some kind of shim layer. Where would it be? It would have to be somewhere between the network layer and the UDP layer -- something that is intercepting the packets and measuring the rate. Right? So basically he's going to measure the rate and feed it back to this guy. And the intuition is: this guy must be sending at five to cause trouble here, but if the receiver is measuring one, why isn't he sending at one? If you could enforce that -- that's the intuition. Go ahead. >>: [inaudible] flashbacks here. >>: Yes. Me too. >>: So why aren't we using [inaudible] between the routers? >> George Varghese: Because we don't want to change the routers. Right? So this is important. The constraints are heavy, right? If you could change things, there are lots of [inaudible] that change the router. We have a very constrained playing field -- Cisco routers. Who knows when they're going to change. And for something like this, if you can do it without any changes, that's the best. >>: Back pressure works in a line; it doesn't work very well in [inaudible]. >>: Yeah. Yeah. This is a very -- >> George Varghese: Well, it turns out that [inaudible] and all these guys have methods, right? But it is complicated, though. But I'd rather finesse that argument for now by saying, look, the rule of the game is we can't touch those switches, right? You may be able to touch the L2 switches, but you can't touch the L3 switches, and the congestion could happen anywhere. So let's try to do it without changing the internal network. >>: [inaudible] wireless routing is done in software because the [inaudible] speeds are so low. This kind of routing is being done in hardware ASICs because it's [inaudible]. >>: [inaudible]. >> George Varghese: So that's very important. Because our experience with a vendor is, you know, it took six months of arguing before they agreed to put a feature in, right? Then it takes two years for somebody like Cisco to build an ASIC. Why? Because a year and a half is spent on design and half a year is spent on testing.
Because an ASIC is so expensive you can't afford to get it wrong. So now two years later you think it's all done, but it's not. Then they decide to put it on a board, right? And once they do, the real number is like five years for any new thing. It's shocking. It's terrifying. You may have lost interest in the feature by the time it comes in, right? So it's really scary. You're looking at doing nothing for five years. And think of the effort of socializing this with not just Cisco but Juniper, Extreme, Foundry, you know, all these other guys. >>: But I mean, couldn't you -- if traffic patterns change a little bit slower than per-packet time scales and a morphing [inaudible] I mean, couldn't you be doing these things in software, and a lot faster [inaudible] change on the millisecond order or -- >> George Varghese: You could. You could. And it turns out the hardware does have hooks to measure things, and in fact we might leverage some of those. But even changing the router software, although it's not a five-year period, [inaudible] still two and a half years. Probably Microsoft is not much faster, right? It's not that easy to get a product out there, and it's really shocking when you see the real numbers. So right now, nothing changes, right? Now, this one, though, is going to require you to actually add some kind of layer over here, which does require a loadable kind of module in your kernel or something like that. Which in Linux we know how to do; we assume you can do it in Windows too, right? And so what is this thing? >>: [inaudible] you said you more or less have less bandwidth [inaudible] but there is a reverse problem there, which is, you made the assumption that you know which property wants what ratios. And if you don't have that -- I mean, you have to adapt very quickly as to what [inaudible] otherwise either you starve things or you just [inaudible] links which are not being utilized. So I mean it's sort of a -- >> George Varghese: But I think fundamentally though -- >>: The [inaudible] just saying where do you -- which are the things [inaudible]. >> George Varghese: Right. >>: [inaudible]. >> George Varghese: Do you feel that it's hard for -- I mean, don't people want to provide some simple guidance to the network as to whether this property is more important than that one? If you don't do that, the network has no basis for doing anything, right? It could have one guy completely consuming the network, and that's fine as far as it's concerned, because you didn't give it any sense of importance. So the point is you have to give me some information as to your relative intent for the use of the network. >>: Well, it sort of depends on the case. For example, if you think of [inaudible], yes, it definitely needs more priority, but [inaudible] also datacenters are running quite a few of them. You can't do this, right, because then how are you selecting which is the right one, unless you put money into the equation or [inaudible] something else. >> George Varghese: So what you might do is do it first with at least the big ones, and then everything else falls into one bucket. That's still better, because you get predictable service for the people who are making revenue, and yet you're sharing the same network.
And your competitor Google has all the same [inaudible]; even they really like to run on one physical network. And they're beginning to find -- >>: [inaudible] not even sure why you would share it with anybody else. >> George Varghese: They do, though. Lots of people do. Not always. >>: Yeah. I mean, data mining versus ads versus the oracles versus -- >>: So within the [inaudible]. [brief talking over]. >> George Varghese: Imagine the CFO has certain queries, right? >>: Yeah. >> George Varghese: Or there are backups going on. And, you know, you probably want backups at low priority, but you want to make sure they get some bandwidth. Okay. So the idea is very simple here, right. You go in and measure the rate and you feed it back to H1. Now, you need a little bit of [inaudible] -- you need to be a little careful, right? So what happens is, this guy is pumping at five, his packets are being dropped, but this guy measures one and feeds it back. Now, what should the sender be allowed to send? It turns out that you can't let him send at exactly one. If you do, then he'll never grow, right? Because if the network bandwidth did increase, you want to allow him to grow; but if he sends at one, he's only going to receive at one, and that will be maintained forever. So you let him go at a slightly higher rate, like 20 percent higher. So he goes at 1.2. Now, if bandwidth suddenly opens up, he's going to measure himself at 1.2, so he goes up to 1.44, and so he keeps growing. So you want an ability for him to grow, right? So the rule is: whatever the bandwidth received is, you send at something like 20 percent above that. And there's a little bit of care you need to take to make sure that the right thing happens, but that's all in the paper, so I don't want to belabor the mechanism. Yes. >>: What if a new property comes in, or [inaudible]. >> George Varghese: So if a new property comes in, here is the configuration required. Ideally what you would like is some kind of OpenFlow or other software where it's centrally done, where it goes to every router and adds that new class to the DRR weights. And it can be done. But right now we do it by going to each one and physically configuring it, which is more painful. But it seems to me that that level of management should happen in the future, and if it does, then you can do it from one place. So it's not that hard conceptually, but today it's painful because you have to log into every console and do it. >>: [inaudible]. >> George Varghese: Exactly. To every router. Right now we don't even bother to track where each property is being used. Maybe you could gain by saying only do it in these routers. But right now we think the simplest thing is just to go to every router and do it. This is a very simple [inaudible], but even this level of configuration -- we talked to the Azure guys: no, no, no, you're not touching it. And so it's interesting, you know; we thought it wasn't such a big deal, but for real running networks even this is a problem. The rate limit is actually slightly over one.
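That rule -- send at roughly 20 percent above the measured receive rate -- looks like the following as a sketch. This is illustration only, not the system's code; the names (ReceiverShim, HEADROOM, sender_rate_limit) are invented, and the extra care mentioned above is in the paper:

```python
import time

HEADROOM = 1.2   # allow ~20 percent above the measured rate, so the
                 # sender can discover bandwidth when the network opens up

class ReceiverShim:
    """Shim between the network and UDP layers: measures receive rate."""
    def __init__(self):
        self.bytes, self.start = 0, time.time()

    def on_packet(self, size):
        self.bytes += size

    def report(self):
        # Called periodically; the result is fed back to the sender.
        now = time.time()
        rate = self.bytes / max(now - self.start, 1e-9)
        self.bytes, self.start = 0, now
        return rate

def sender_rate_limit(measured_rate):
    # A sender pumping 5 but delivering 1 gets pulled down toward 1.2;
    # if bandwidth opens up, the measured rate climbs round by round
    # (1.0 -> 1.2 -> 1.44 -> ...), so the sender can grow.
    return measured_rate * HEADROOM
```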
And then you have to allow for [inaudible]. Okay. So then we said, all right, we've done all of these things and they kind of work: they give us the policy we want, they handle UDP. But this definition, hierarchical max-min fair sharing, has a number of things that are not as flexible as we would like. For example, it treats every connection alike -- like I think some of you said, you might want some servers to have more, right? And we can think of even weirder policies, okay. So here are two flows, right? And the idea would be -- imagine normally, if you have the same weights, the bandwidths are shared -- we need three for this. Is there a third one? Okay. So imagine there are three flows, right? Normally it means if all three are active, each of them gets a third, 3.33. Does that make sense? Right. Okay. But it also means that if one of them is totally idle, the other guys share it equally, 5 and 5. So in some sense there's a minimum that is specified by one weight, but the excess is also specified by the same weight. Maybe you want two different weights. Because some applications require a guaranteed fixed amount of bandwidth, but maybe they don't require any excess bandwidth at all. So maybe you want two weights: one weight for sharing the bandwidth if everybody is using it, which gives you a sort of min bandwidth; but if any excess comes around, you share it by a different set of weights. You can think of reasons why that might be interesting. Certain apps want very predictable bandwidth, so why give them any excess? Or backups: you want to make sure they have a definite amount but no more. Backups is the wrong example; probably you want them to have less. So in this case, if they're all equal, it will be 3.3 each. But if you actually had excess weights of two and one, then the numbers would change: when the green was not sending, the red would get a little more because he had a higher excess weight. He would get 5.5 and this guy would only get 4.4. It still adds up to 10, but they use it in a different proportion. So we're just trying to push the model a little bit and say, if you push it a little, can we get more flexibility? Right now max-min fair sharing pushes your bandwidth into certain regions, and we'd like to expand that space, okay? Once you do that, we quickly realized that we can't rely on TCP, because TCP has no flexibility to handle these two weights. We can't rely on UDP either. And so we realized that we needed a centralized bandwidth allocator -- there's a small sketch of the two-weight arithmetic below. So how does a centralized bandwidth allocator work? The network manager sends the weights, right? And now what happens is we have to measure the traffic matrix. We have to see, for each property, how much it is trying to send to everybody else. And that's where we might be able to commandeer the access switches, because they do have a certain amount of rate measurement in the hardware. If not, you have to do it in software at the [inaudible], or at the end servers.
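Here is that two-weight arithmetic as a toy, single-link sketch: min weights divide the guaranteed share, excess weights divide whatever demand leaves unclaimed. Everything here (allocate, the tenant names, the tolerances) is invented for illustration, and it deliberately ignores the multi-link, water-filling part:

```python
# Toy single-link allocation with separate min and excess weights.
def allocate(link_capacity, demands, min_w, excess_w):
    total_min = sum(min_w.values())
    # Guaranteed share, capped by what each tenant actually demands.
    alloc = {t: min(demands[t], link_capacity * min_w[t] / total_min)
             for t in demands}
    leftover = link_capacity - sum(alloc.values())
    hungry = [t for t in demands if demands[t] > alloc[t] + 1e-9]
    while leftover > 1e-9 and hungry:
        # Divide the leftover by excess weights, capped by demand.
        tw = sum(excess_w[t] for t in hungry)
        for t in hungry:
            extra = min(leftover * excess_w[t] / tw,
                        demands[t] - alloc[t])
            alloc[t] += extra
        leftover = link_capacity - sum(alloc.values())
        hungry = [t for t in demands if demands[t] > alloc[t] + 1e-9]
    return alloc

# Three flows, equal min weights: all active -> 3.33 each. Green idle,
# excess weights red=2, blue=1 -> red ~5.56, blue ~4.44 (the 5.5 / 4.4
# split from the example; it still adds up to 10).
print(allocate(10, {"green": 0, "red": 10, "blue": 10},
               {"green": 1, "red": 1, "blue": 1},
               {"green": 1, "red": 2, "blue": 1}))
```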
So once you have that traffic matrix, you send it periodically -- probably not any sooner than every 100 milliseconds, because this is a lot of work -- to your centralized QoS computation engine. Based on that, this guy predicts your demands for the next period, because he has to predict; you might be growing, right? And then you compute this, or any fancier policies you want, from the weighted bisection bandwidths, and you send back the rates to be used for the next period. And now you have to rely on [inaudible] the routers to make sure that if somebody, say a UDP flow, has been given only one, it stays at one. So the only advantage of this compared to all the other stuff is that it allows much more general policies. Now you're free to do anything you want. You can give some servers more, you can use different weights. And it looks weird, centralized bandwidth allocation, because bandwidth changes so fast. But we're hoping to track only coarser time scale changes in bandwidth, right? And if you think of a lot of what's going on in networking research, they've moved to centralized routing control platforms, and this is another step in that direction. Okay. So now there's lots of questions and I'll take all of them. >>: [inaudible]. >> George Varghese: So we just really need to get it to one guy. So I'm not sure we need broadcast as much. It sounds like an incast, right? They're all going to one guy: all the traffic measured at the edges is going to one centralized computation. Oh, you could do it in a distributed fashion instead, is that right? But then everybody has to do that computation. And that computation -- you haven't done it -- takes a few hundred milliseconds, so we would rather do it in one place and [inaudible] yeah. For the simple model there is nothing being sent anyway; it's just automatically leveraging TCP, except for the [inaudible], so there's no [inaudible]. We can talk about that. But that's a very well studied problem, right? It's 20 years old. So -- yes, go ahead. >>: I was going to ask you a question [inaudible] so there was a [inaudible] paper recently that was very similar [inaudible]. >> George Varghese: Eric? I don't know. [inaudible] paper? >>: Yeah. I mean, we did something somewhat similar and we were basically [inaudible]. >> George Varghese: I'm glad. So we should definitely cite that. >>: [inaudible] I had a lot of work in [inaudible] because [inaudible] this situation for years actually, and the problem is, I was trying to think about the assumptions you're making versus the assumptions that -- and there are differences there. So one question I have for you is that in order to validate these [inaudible] you really have to look at the traffic, and you really have to look at those properties and how long those last [inaudible] without that, this is not really useful, right? You have to [inaudible] because you said there is some sort of [inaudible] routers or how much [inaudible] how fast you can [inaudible]. >> George Varghese: On the statistical multiplexing, I agree; it depends on the real traffic. But the fact that you can guarantee sort of minimums, right, that can be established without any regard to -- >>: That's true. >> George Varghese: Right now you don't have that. You don't have any floors at all. I think that's the first thing to establish.
And that can be done without changing all these routers. >>: Well, but you were saying [inaudible] you were saying that [inaudible] for example I don't give one, I give 1.2 or something, right? >> George Varghese: Right. >>: To make sure it has room to expand. >> George Varghese: Right. >>: And the question to me was, well [inaudible] fine, how is it interacting with TCP? >> George Varghese: It's not. It's just basically taking a little bit more than its actual allocation [inaudible]. This is a common thing that is used a lot, so it's not such a big deal, and [inaudible] does it too. I agree that ideally we would have to have a big dataset to measure the statistical multiplexing. I would love to do that. It requires help from people who are actually doing it. >>: One of the things you [inaudible] push it back down. >> George Varghese: I agree. So that's a good point. The timing of that is scary, right? It really depends on prediction, smoothness, and all those things. In fact, the VL2 numbers say that things change very fast, which is actually depressing for this kind of mechanism. Nevertheless, we put it in for completeness, right? If you don't like it, the first two are very fast. Sorry, I think -- >>: So I want to just push back [inaudible] datacenters, probably the number of [inaudible] the number of hosts that a centralized controller has to [inaudible] such times, and there seems to be no good way of slicing it -- >> George Varghese: We actually did some algorithmic work there. We talked about it. What we did was we basically did slice it, on an edge-switch-to-edge-switch basis, and then we did the calculation recursively on top for all the guys sharing. That helped a lot with the complexity. But you're right: if you take host to host, the number of pairs is massive. It's roughly an N squared kind of algorithm. So we did a lot of work to make this fast. But you're right; fundamentally, with the distributed thing it happens automatically, and the speed of this is another [inaudible]. >>: So just to clarify, when you sliced it [inaudible]. >> George Varghese: It's not distributed [inaudible]. We did just a centralized thing, but there is an N squared in the complexity, right? If you make N the number of hosts, it becomes very large. So we tried to reduce that to order M, where M is the number of edge-switch pairs. >>: Compute trunks between them. >> George Varghese: Yes. Exactly. And then the last thing, on the trunk, is easy; that's an easy computation. So we kind of just break it up into equivalents. These are just methods of reducing the centralized complexity. Sorry. >>: So I guess what we're thinking is that [inaudible] and it's as if it's centralized, it's as if you [inaudible]. We all love that. And all we need is TCP. And so it's hard to give that up now. I mean, why can't we have a new instance that [inaudible] doesn't require the -- >> George Varghese: You could. But now you would have to change all the TCPs and all the other [inaudible] like that to do it. Oh, you mean -- right.
>>: So I guess what we're thinking is that [inaudible] and it's as if it's centralized, it's as if you [inaudible]. We all love that. And all we need is TCP -- and so it's hard to give that up now. I mean, why can't we have a new instance that [inaudible] doesn't require that? >> George Varghese: You could. But now you would have to change all the TCPs and all the other [inaudible] like that to do it. Oh, you mean -- right. And personally, right, I wouldn't do the centralized one. We did it in the reverse way: we thought of that first, and then we came down to the simpler methods, and we felt hard pressed to give it up, and I'd like to [inaudible], so we put it in anyway. And we just wrote it up. But I think I agree. If I was [inaudible] there's no way I would do it, right? It's way too complicated. >>: So [inaudible]. >> George Varghese: Sorry? >>: [inaudible]. >> George Varghese: I shouldn't be saying things like this here. I'll get into trouble with the paper reviewers. >>: Oh, no, no. Okay. So talking about [inaudible], this has been the central problem in traffic engineering for 20 years, right, and basically what it all comes down to is predictability. You don't need admission control. And if you can predict the required share, then you have all the time in the world, and the centralized algorithm is the right way to do it. And there were studies done -- you get, I think, something like 30 percent better than the [inaudible] version in how close you get to optimal, right? >>: But then it's arguable what your definition of optimal is, right, because for the ISP the standard definition is the least used link should be -- and it's not clear even that is right for the datacenter, is that the right [inaudible] -- so many things like that in the end, right? >>: [inaudible] asked you about the evaluation metric. So, you know, when we think about these things we sort of get stuck on exactly that element, that we have to have loads and loads of traffic that we can look at [inaudible]. >> George Varghese: That's fair. And we did. >>: So from that perspective I was asking -- I mean [inaudible] but I was asking now. How did you evaluate? >> George Varghese: So we'll talk [inaudible] so basically we built a testbed. It's small, right? So we definitely did not have the advantage of large scale real traffic. And so we had no handle on the traffic -- on the statistical model. But we were able to show the isolation, right? And also the important thing is we took real routers, we took Fulcrum switches, we built a real testbed. So it was kind of nice that this hypothesis -- that you only had to configure the thing -- was correct. So that's really what we did. Plus we started up certain properties with lots and lots of connections and we showed it didn't matter. Once you set the weights, they could do jack -- you couldn't, you couldn't -- so that's what we did. Would it be better to run it on a real thing? Yes. But, you know, that's a different set of -- we did talk to Srikan briefly about getting it, trying to do this on it, but it didn't happen. Yes, so go ahead. >>: Not only do bandwidth allocation problems end up with whether you slice the bandwidth correctly, but then you have the problem that per-connection response times vary so much that no one even studies that, right [inaudible] -- I mean sometimes for certain connections they [inaudible] in the other case. I mean, was one of the metrics response time [inaudible]? >> George Varghese: Yes. So for response time you're trying to see how fast the [inaudible] go up, right, and maybe we were too restricted in our apps -- we do need a richer set of apps. We tend to study a lot of Hadoop-like apps, right? So they generally -- and even they -- they're very sensitive. They have certain phases, a reduce phase and a map phase. And the sort phase, for example, is much more bandwidth intensive.
The other phases really didn't matter. So we did actually see that sometimes when you give twice the bandwidth weight to one app or the other, it doesn't complete its job twice as fast, simply because there are other phases where it's not bandwidth bound, right? So there's a lot of that in the paper, where we're actually trying to see the effect on application level performance with all of these things. But what are the statistical multiplexing gains across large crowds? We don't have any [inaudible]. Sorry. Turn this off. Okay. Go ahead. >>: [inaudible]. >> George Varghese: Yeah? >>: I believe that the number of queues that you can [inaudible] so do you have something to do when the number of properties you have is [inaudible]. >> George Varghese: So I think Srikan and Ming are definitely interested in Azure, right, where the number is more like 10,000. So there are two differences in what they are doing. I think one is that we can no longer hide under this thing like, well, you have roughly 100 queues, I think, you can get now -- Fulcrum had 100, right? And so then you have to find some way to extend it when you don't have it. That's number one. And number two, you have to worry about more adversarial people, right? [inaudible] actually we're not assuming that, and the reason adversarial behavior matters is that if one guy notices that his bandwidth is being stolen when he's not using it, right, and it takes a little bit of time, however little, right, milliseconds, to get it back, then he might game the system by saying I'll just send idle traffic, right? So certainly in an ISP world you would worry about that, and in an Azure world you would worry about that, right? But here in an enterprise setting maybe that was [inaudible], so you have a different set of problems. You have to worry about adversariality, you have to worry about scaling to larger numbers. And that's [inaudible]. Okay. So [inaudible] right. So although we had this max-min fair notion, if you talk about applications it's not really clear what it means for applications to share the network. So we are not entirely sure, right, how we actually define this notion of application sharing -- should we think of it as in supercomputing? And what about multipath? There's a ton of work on [inaudible] fair share with [inaudible] it's all messy, right? They come out with NP-complete algorithms, it's a [inaudible]. And one of the nice things of the evaluation which we did, which was surprising to us, is that just for the heck of it we ran this mechanism with a simple amount of multipath. And it seemed to do the right thing. It seemed to take the aggregate bisection bandwidth and share it proportional to [inaudible], and that was nice, because that's not exactly obvious given the theoretical results. So again, we don't have any explanation for it, but at least in simple cases, simple datacenter cases -- maybe it's the topology -- it seemed to work well. So I thought I had some slides on evaluation. Oh, yes, I do. So here is a quick thing, right. So the group allocation, the TCP thing, you know, you just do configuration at the switches. It's very fast, right, but it's only TCP flows and it's only hierarchical max-min. The rate throttling basically allows UDP, but the cost now goes up because you have to measure the UDP rate, and it's only hierarchical max-min. And the centralized allocation is slower, and it probably allows more general allocation policies, and it's not clear how scalable it is. But it's true, you only really have to worry about that when you're going to large scale.
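All three mechanisms in the summary above are approximating hierarchical max-min fair sharing. As a reference point, here is a textbook weighted max-min water-filling on a single link; the real scheme is hierarchical and network-wide, so this single-link version is only an illustration.

```python
# Textbook weighted max-min water-filling on one link (illustration only;
# the actual allocation is hierarchical and network-wide).

def weighted_max_min(demands, weights, capacity):
    """demands/weights keyed by property; returns per-property rates."""
    alloc = {p: 0.0 for p in demands}
    active = set(demands)
    while active and capacity > 1e-9:
        total_w = sum(weights[p] for p in active)
        satisfied = {p for p in active
                     if demands[p] <= capacity * weights[p] / total_w}
        if not satisfied:                     # everyone is bottlenecked here
            for p in active:
                alloc[p] = capacity * weights[p] / total_w
            break
        for p in satisfied:                   # freeze them at their demand
            alloc[p] = demands[p]
            capacity -= demands[p]
        active -= satisfied
    return alloc

# weighted_max_min({'red': 0.9, 'green': 0.2}, {'red': 1, 'green': 1}, 1.0)
# -> green keeps its 0.2, red gets the remaining 0.8
```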
So how did we implement this? We basically had a Fulcrum 10-gig switch, and we were able to configure it as lots of smaller subswitches and emulate topologies. And this was a topology we were much more interested in, because it is multipath. It is very simple, right? And we had to do certain things -- it's all in the paper how we actually made all these experiments work. But fundamentally, let me try to read what we did here. So what we did was we had two applications, a red and a green. Let's take this one, for example; it's the simplest one. The red and the green were both Hadoop applications, both doing sorts, except that one of them had eight slaves and the other one had four, right? I should have had this on the slide. And so one of them had a total of 96 maps and 96 reducers, while the other one used a smaller number of reducers, four per slave. So it turns out that if you don't use this, right, the person with the most slaves opens more TCP connections on all the bottleneck links and gets a much bigger share of the bandwidth, right? And it's pretty straightforward that, since you are actually allocating based on the application and not on the connections, you will get this kind of behavior (there's a worked illustration below). And there's a lot more to it -- there's a lot more description on whether the applications complete faster. In the end, even if you give them equal bandwidth, they don't all -- but that's the kind of spirit of some of those. But the most interesting thing is that roughly the same thing happened in the multipath case. There were a few edge effects that we are still trying to figure out. But in the multipath case, if you go back to the picture, right, what we would really like is, in some sense -- the bisection bandwidth has gone up to two gig, because there are two sets of paths, and we'd like this two gig to be shared among these two applications, right, in proportion to their weights. And that seems to happen. And that's the thing that we are most pleased with. Because ECMP is alive and you can't wait for theory to catch up -- Hahne's result and all these don't really apply to all these cases. So maybe there's some new theory to be done. But we are hopeful that maybe in simple datacenter topologies the simple thing will -- so that's roughly where we are.
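A worked illustration of the red/green point above, with hypothetical connection counts: under per-connection TCP fairness, the application that opens more connections on a bottleneck gets proportionally more bandwidth, while per-application weights make the split track the configured weights instead.

```python
# Per-connection vs per-application sharing on a 1 Gb/s bottleneck
# (connection counts are hypothetical, for illustration).

def per_connection_shares(conn_counts, capacity):
    """TCP-style fairness: share follows the number of connections."""
    total = sum(conn_counts.values())
    return {app: capacity * n / total for app, n in conn_counts.items()}

def per_application_shares(app_weights, capacity):
    """Property-based allocation: share follows the configured weights."""
    total = sum(app_weights.values())
    return {app: capacity * w / total for app, w in app_weights.items()}

conns = {'red': 96, 'green': 16}               # red has many more reducers
print(per_connection_shares(conns, 1.0))        # red ~0.857, green ~0.143
print(per_application_shares({'red': 1, 'green': 1}, 1.0))  # 0.5 each
```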
And I think if we had to summarize what we are trying to do: we're really just trying to figure this out, you know. What is the right notion? It seems like a pressing problem; somebody needs to define it. And we're taking a first definition, hierarchical max-min fair sharing. So maybe there's some contribution in just defining what you mean, right? If you don't know what you mean, right, you can't argue which is better than the other. And then we are taking the simplest possible mechanisms, right, that can do what we mean, one with a generalization of Hahne's result, right, to doing it per property as opposed to per connection, and then we fix it for UDP, and maybe generalize it too, right. So that's really where we are at. We're not satisfied with the whole thing. We think there's more to be done. But it's our first cut at this. And it turns out in the Azure environment -- we talked to the Azure guys -- you know, there are a number of issues, right; even the simple thing, no change to existing routers, just configuration -- they don't want that, right? And so maybe there are some new creative things, and we've talked about some of those things. Balgi's approach, which is going ahead and sending these QCN messages, requires modifications that we're not sure will actually happen in the near future, right? And these guys have a totally different -- go ahead. >>: [inaudible]. >> George Varghese: In the implementation? >>: Yes. >> George Varghese: I'd have to -- I don't remember all of these things. What was the x axis? >>: The x axis was in ten-thousandths of a second. >> George Varghese: No, no, no. So this one, they would not do anything for convergence. For convergence you have to measure the [laughter], right, to see how fast the [inaudible] got to some group. Because the bandwidth doesn't tell you anything, right? And these measurements -- I think even the bandwidth measurements were done in aggregate, so like maybe every hundred milliseconds, so that's way too -- the measurements from servers to core should tell you how fast [inaudible]. >>: [inaudible]. >> George Varghese: No. That's a totally separate experiment, right. So generally it's roughly a few TCP round trips, kind of -- because it's a very simple topology. You could argue in more complex topologies it would be more complex. But that's not shown by this, because these are just coarse bandwidth measurements -- we managed to put in something to measure this, but it's very coarse. >>: I mean it does look like a [inaudible] behavior going on. So there's some dynamics that [inaudible]. >> George Varghese: And there are some explanations of why some of those happen. Part of the behavior is because the application itself changes phases, right? It goes from sort to -- remember, we have a real application running. It's not like [inaudible] running. So definitely there's a big change in dynamics. And if you see that, right, that will explain why, you know, these things change -- because they actually switch gears from sort to map and so on. So there is something that has to be explained. So that's all I have. And I'm, you know, just happy to open it up for any, you know, anybody's ideas. Because we're still trying to figure this out. And Albert, do you want to start the discussion? You have five minutes. >> Dave Maltz: Should we thank the speaker first? >>: Yes, let's do. [applause]. >>: I think the most interesting part is [inaudible]. >> George Varghese: Requirements. >>: For solution. >> George Varghese: Yes. >>: And it seems like you require. >> George Varghese: In the end, yes. Because we expose the locality in some sense. >>: Yeah. And so your neighbors influence what you get. >> George Varghese: Yes. So it's very different from your kind of model, right, the VL2 model, where you basically get rid of all locality, right? But it seems to me that if you had a VL2 model, then you could really have, very nicely, a model of just -- you could give the model to everybody that they have a big switch. And now there's totally no locality, because it's burned out of the system by your mechanisms, and now it's much simpler, because the model is you have just a big switch and you have certain, you know, bandwidths that are equal everywhere.
>>: We still have to do your rate [inaudible] because basically the results [inaudible] model basically [inaudible] current model, which is exactly what your [inaudible] -- you're not pushing more traffic into the network than can possibly come out of it. >>: Even with the VL2 network there will still be -- you would still depend on where [inaudible] co-located, because your bandwidth [inaudible] switch. >> George Varghese: Will depend on that. >>: Yes. >>: [inaudible]. >>: [inaudible] VM and you're co-located on the server [inaudible]. >>: It matters inside of a single [inaudible] because there is no [inaudible]. If you happen to be co-located with very many bad VMs at the same place, then you're stuck. >>: Okay. But that part you can fix [inaudible] on that one. >> George Varghese: But I think they're thinking of fixing that with another level of flow control, right, which is VM based. Each VM gets a certain amount of bandwidth. >>: Right. Right. >> George Varghese: Okay. So I think Windows is putting that in, right? >>: Yes. >>: [inaudible]. >>: No. I mean only if the problem is that the congestion is in the network. >> George Varghese: We are really bothered about -- >>: That co-location at the same place. >> George Varghese: Sorry, Albert. You had a thought. You want to finish it? >>: Yes. The other thing is, we talk about [inaudible] you don't know how much [inaudible] I mean the whole idea of why this [inaudible] going to be cheap is filling valleys [inaudible] stuff like that, not allocating for the max. So even the big guys don't know quite how to [inaudible] and if you force -- if you don't try to [inaudible]. So you know, say you buy 20 percent more of a huge number. >> George Varghese: In your paper -- and I've tried to understand the relation -- in the hose model, you are also doing some kind of multiplexing, where -- in the original paper, the AT&T paper, right? So how does -- I mean, do some of those mechanisms apply? >>: That was the [inaudible]. >> George Varghese: Yeah. >>: And it was admission control. Yeah. So maybe -- >> George Varghese: Maybe it's similar in the end, yeah. >>: Could we go back to [inaudible]. >> George Varghese: Yes. >>: [inaudible] only one over the sum [inaudible]. >> George Varghese: Exactly. >>: Times -- >> George Varghese: Times your weight times the bandwidth. >>: That's the share. >> George Varghese: Exactly. That [inaudible] is guaranteed for you regardless of [inaudible], which I find helpful because it gives you a certain floor, right? And now I can migrate and do everything. Now, on top of that, you can start saying, if locality is visible and you have a network, now I can have an optimization problem: where should I locate my VMs to get the -- go ahead. >>: I'm trying to figure out how much that does solve some of my problems. So today, I mean, so I was [inaudible] not where I am. I'm playing all these tricks to figure out how to get to be in the right places. >> George Varghese: Yeah. >>: With your system I'm still going to play all these tricks to figure out whether I want to [inaudible] bigger there, right? >> George Varghese: Right. But I think it's still nice to be able to be sure that you have a certain bandwidth that works regardless of what other people are doing. Regardless of [inaudible], to me that's reassuring, because it gives you a certain minimum level of performance that you can count on regardless of other people's gaming.
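The floor being described here is just the weight ratio times the bandwidth; a minimal sketch, with illustrative names:

```python
# The guaranteed floor: your weight over the sum of all weights,
# times the bandwidth (names illustrative).

def bandwidth_floor(my_weight, all_weights, link_bw):
    return link_bw * my_weight / sum(all_weights)

# A property holding half the total weight on a 10 Gb/s link:
# bandwidth_floor(5, [5, 3, 2], 10.0) -> 5.0, regardless of the others
```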
>>: And the other thing is that oftentimes when you look at these job studies, the time to complete the job is defined by the outliers, right? So you've got good locality for a bunch of your compute nodes and your data storage nodes, but there's someone way far out there who you need data from because you couldn't get [inaudible], so having a good bound on job time depends on having good communication. >>: But on that note, the outliers, unless you have a phase and [inaudible], are the ones that are hurting you, right? [inaudible] a machine that has other kinds of applications on it. I mean, your fraction will be proportional to that and it will be quite low. >> George Varghese: No, no. I think the assumption here is that you have a worst case thing, right? We're not talking about a lot of properties. You're talking about lots and lots of properties. But if you're talking about a certain number, regardless -- there's a certain minimum which assumes that everybody is on your machine and on your link at the same time. That's your floor. >>: [inaudible]. >> George Varghese: Every other property is there at the same time, right. So that's like -- may not be. Depends on how big you are and how much revenue you get. So if you're half the revenue of the company, you get half the bandwidth, always. It's like you don't have to worry about an experimental person coming in and destroying your stuff. But -- and various other things. >>: But it does sound like if you paid twice [inaudible] to increase your ratio, or you could open up 10 VMs and throw out the 9 worst ones. That second approach might give you a much better reward. >>: [inaudible]. >>: And so you shouldn't play all those other games. >> George Varghese: Maybe. >>: Yeah. I mean, without a -- without a [inaudible] subscription you're going to [inaudible] there's going to be upside to finding those. >>: We were wondering, is there a market for oversubscribed networks -- if you market it in a sense [inaudible] will it pay? >> George Varghese: I think in the future, will they pay for oversubscribed networks or will they pay for this mechanism [inaudible]. >>: I was wondering if it's more general than that. >> George Varghese: I think it's more general than that. I think they always have [inaudible] networks. Even if it's not oversubscribed you still -- [inaudible]. >>: You know what you're getting. >> George Varghese: Right. I think it's the predictability that I'm hoping is something that [inaudible] will want. So even with the VM, I mean, to some extent you have some notion of a certain minimum amount of performance, right. And that I think is comforting, right, although most of the time you get a lot more. But I think having -- >>: These models are still being experimented with. I mean, with Amazon VMs there's no guarantee. It's like a one gigahertz processor, but it's not. >> George Varghese: Actually because of [inaudible] or some other strange thing. >>: They do a lot better than [inaudible] but they don't do fair share. There are games you can play to get more than your fair share on EC2, although you do a lot better than if you bought, say, a box -- a machine that is horrible in terms of fair share [inaudible]. >> George Varghese: Really? >>: But a three or four to one difference from fair share usually. >> George Varghese: And what about EC2? >>: EC2 is a lot better than that. >> George Varghese: Two to one maybe? >>: [inaudible]. [brief talking over].
>>: I think it ends up you can push it statistically, but I don't think [inaudible] I think it's [inaudible] for fair share. >> George Varghese: Again, I think once you go to the Azure model it is a different setting. You have people playing games. Here I think it's accounting and engineering and, you know, they're all reporting to one CEO, I'm hoping. And so it's reasonable there. Like, with customers it's very hard, right: if I'm not using my hundred megabits and I paid for it, you mean you're using it? And actually the argument is, when you're not using it, I'll get yours too -- but that's still a little hairy. Can you imagine an ISP -- but in the company setting I think that's a little more plausible, right? So we're still feeling this out. But thank you for your comments. And yes. >>: [inaudible] even if you have bandwidth allocation you will never have a stable system, right, even if [inaudible] drop things. So I mean, what your assumption really is, by doing this [inaudible] allocation, you sort of make the assumption, I think, that things are stable, that I can guarantee this end-to-end path from endpoint to endpoint, but if things -- for example [inaudible] essentially things can get bad even within a single router [inaudible]; now, when you don't have stability in the system, then even if you [inaudible] things can still go [inaudible] constant flux. >> George Varghese: So what we like about this -- actually, that's an interesting point. So this is a point I think I made. But the nice thing about this, the floor, is -- normally, imagine that you had alternate paths, right, and you had backup paths. What would happen is, if you want to do reservation, you would have to reserve on the alternate paths as well, because you never know if you'll use them. And this is totally wasteful. Here, you go ahead and configure these kinds of weights on the alternate paths, but while you're not using them, other people are free to use them, right? Which is kind of nice, because in some sense you're less wasteful (there's a sketch of this below). So [inaudible] you don't have to propagate anything, because the weight's already there, everywhere. You go to the alternate path, it's waiting for you, it's already preallocated. So I'm not quite answering your question, but it did suggest something: that, you know, even if the network is churning a lot, it's changing a lot, right, from one path to another, this stuff is totally independent, because, you know, at least in the simple model it's configured; it just doesn't change. It has its weight and it's ready for you. You switch from one path to another, it automatically allocates, because you just configured every router when you started. >>: [inaudible] for instance? >>: I think that's [inaudible]. >>: [inaudible] so essentially do you not become sensitive because there's [inaudible] stability in the system [inaudible] and so even if you -- you have these weights, but if there is some [inaudible]. >> George Varghese: What would your ideal be here? What would you like? Because until I know that, I don't know how to compare it to what it's doing. >>: Well, I guess the sense really is that, you know, [inaudible] it's basically [inaudible] engineering on the links and [inaudible] stability is [inaudible] so those [inaudible]. >>: [inaudible] first of all I have to go and apologize, but that's exactly my [inaudible] what we want, and that's a great pushback. And then adapt a mechanism for that.
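A sketch of the preconfigured-weights point above, under assumptions: the same per-property weights are installed on every switch once at startup, so failover to an alternate path needs no signaling, and a work-conserving scheduler lets other traffic borrow an idle property's share in the meantime. The topology and names are illustrative.

```python
# Illustrative sketch: one-time weight configuration on every switch,
# plus a work-conserving split so idle preallocations aren't wasted.

SWITCHES = ['edge1', 'edge2', 'agg1', 'agg2']   # hypothetical topology

def install_weights(weights):
    """Run once at startup: identical per-property weights everywhere,
    primary and alternate paths alike -- failover needs no signaling."""
    return {sw: dict(weights) for sw in SWITCHES}

def active_shares(weights, active, capacity):
    """Only currently active properties divide the link, so a property's
    unused share on a backup path is free for others to borrow."""
    total = sum(weights[p] for p in active)
    return {p: capacity * weights[p] / total for p in active}

cfg = install_weights({'red': 2, 'green': 1})
print(active_shares(cfg['agg2'], {'green'}, 1.0))          # green alone: 1.0
print(active_shares(cfg['agg2'], {'red', 'green'}, 1.0))   # 2/3 and 1/3
```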
>> George Varghese: I think the big question is, what is the question, really? Really, that is [inaudible] because this is a whole new area and, you know, everybody has this intuitive notion that something has to be done. But the question is what needs to be done, right? And what would users want? >>: Simpler question [inaudible] since you're saying, in terms of the robustness of your system, so if you have some notion of [inaudible] but if you have -- >> George Varghese: I would say the first two are not -- the first one is very robust, right. The second one maybe a little less. The third one is probably not robust at all. You know, if you have tremendous changes in traffic, it's just going to be -- it's centralized, it takes too long, and it's going to rely on predictability. So that's my [inaudible]. >>: [inaudible]. >>: [inaudible]. >>: [inaudible] helps a lot with that. It turns out if you just turn on DRR with [inaudible] DRR -- like the intuitive -- and this is [inaudible] -- which is to say that you don't end up with a small flow. You only screw up people in your own DRR bin, and so you can cause real nasty things to happen to people in your own DRR bin, but if you've set that to be yourself, then that problem mostly goes away, is my understanding. Certainly in my experiments with DRR turned on, it helps a lot. >>: [inaudible] the most robust [inaudible]. >>: [inaudible]. >> George Varghese: That part is not like a control system. That's why I said the first one is robust. The second and third, they're measuring, they're going to -- I hear you. I feel your pain there. But not in the first one. Right. So thanks. >>: It's a type of [inaudible] [laughter]. >>: Yeah. >> George Varghese: Yeah. [inaudible]. >>: What's the [inaudible]
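As a footnote to the DRR exchange above: a minimal textbook deficit round robin sketch (not the exact switch implementation), showing why a misbehaving flow can only delay packets that share its own bin -- each property gets its own queue and quantum.

```python
# Textbook deficit round robin (not the exact switch implementation):
# each property gets its own queue and quantum, so a misbehaving flow
# only delays packets sharing its own bin.
from collections import deque

class DRR:
    def __init__(self, quanta):                 # quanta: {prop: bytes/round}
        self.quanta = quanta
        self.queues = {p: deque() for p in quanta}
        self.deficit = {p: 0 for p in quanta}

    def enqueue(self, prop, pkt_bytes):
        self.queues[prop].append(pkt_bytes)

    def next_round(self):
        """Serve each backlogged queue up to its banked quantum."""
        sent = []
        for p, q in self.queues.items():
            if not q:
                self.deficit[p] = 0             # empty bins don't bank credit
                continue
            self.deficit[p] += self.quanta[p]
            while q and q[0] <= self.deficit[p]:
                self.deficit[p] -= q[0]
                sent.append((p, q.popleft()))
        return sent
```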