>> Ratul Mahajan: It's my pleasure to welcome Mahesh Balakrishnan. He's a Ph.D. student from Cornell University, and he's going to talk about reliable communication in data centers.
>> Mahesh Balakrishnan: Thanks, Ratul. As he said, I'm Mahesh Balakrishnan, and I'm from Cornell University. That's a very rare picture of Cornell; you can see it's not snowing. Today I'm going to be talking about reliable communication for data centers. So commodity data centers emerged in the mid '90s. Internet service companies found they could scale to millions of users by using farms of inexpensive machines. The technical insight behind that is simple: first-generation Internet services were extremely parallel. Economically, though, commodity data centers were revolutionary. They allowed companies to acquire and maintain massive infrastructure on a shoestring budget. Since then, practically everyone has jumped onto the data center bandwagon, and data centers have supplanted big-iron computing as the de facto platform for any kind of scalable application. However, today's data centers are different from their predecessors in two ways. Firstly, data centers no longer act in isolation. They interact with other data centers that are geographically remote, and the links between these data centers are typically high-speed optical networks. Companies are deploying such networks because they give them immense value: they can mirror data to remote locations for locality to clients, for disaster tolerance, for locality to outsourcing hubs. And this is increasingly what the network map of any modern corporation looks like. Secondly, data centers are now real time. Now, "real time" is a loaded word and it means different things to different people. In one sense data centers are real time because you're running applications traditionally considered real time: domains such as finance and aerospace, which are traditionally real time, are moving to commodity platforms. However, in a completely different sense, we are going towards a world where people are storing all data and functionality within data centers. If I open my laptop and I store some data on it, I write a blog, I send an e-mail to someone, all of that is being funneled through a remote data center. The data center is the computer -- almost an extension of Sun's slogan from the '90s, "the network is the computer." The interesting corollary is that if, when I open my machine, all of what I do on it is being funneled into a remote data center, then I expect the same responsiveness from that remote data center that I do from my local machine. So data centers are real-time infrastructures: they have to be extremely responsive. Now, Gartner did a survey of data center operators where it defined a real-time infrastructure as a data center that recovers from failures in seconds or minutes. They asked people: do you think this is important, and do you have it? And those were the answers. Most people thought it was important; most people did not have any kind of real-time recovery capability. And that leads us to the big challenge: how do we build a real-time data center, one that can recover from faults and other disruptive events within seconds? And this is a huge deal, right? This is almost a grand challenge. It involves dealing with literally dozens of failure modes at every level of the software stack. And it's kind of the backdrop for my Ph.D. thesis.
I looked at various failure modes within data centers and between them, and in fact built several systems that tackled different faults. Now, this talk is going to be about mechanisms at the network level. If you want to build a system that recovers within seconds from failures, you need to provide it with network primitives that recover in milliseconds from bad stuff that happens in the network. Specifically, we want to recover from packet loss in the network, and we want to recover extremely rapidly. Now, the insight here is that existing protocols are reactive. When something goes wrong, they react to it, and often there's a cost of reaction and a latency of reaction, and both can be very high. We're going to introduce proactive protocols, where you inject the overhead a priori and you get immediate recovery when anything goes wrong. In this talk I'm going to be talking about two systems. One is called Maelstrom; it's a protocol for reliable communication between data centers over these high-speed optical links, thousands of miles long. Second, I'm going to be talking about Ricochet, which is a reliable multicast protocol for use within data centers. So, moving on to Maelstrom. Today the only option you have for reliable communication between data centers is commodity TCP/IP. If you're running a commodity data center, you'll be running Windows or Linux; they have commodity stacks. So what happens when you open a connection between here and New York? Well, TCP has three problems. Problem number one is throughput collapse due to loss in the network. TCP was historically designed for short, congested Internet paths. It does not work well when you see loss on long, high-speed networks. It's egregious: if you lose 0.1 percent of your packets on the link, throughput is less than 10 Mbps on what is essentially a 40-gigabit link. Now, this is a problem that's extremely well known in academia. It's been studied for 10 years, and people have come up with all kinds of new protocols -- I know you guys are building Compound TCP, and Linux has its own newer variants. However, today the most commonly deployed protocol in the wild is still Reno, and that's a protocol over 10 years old. Problem number two is that for high-rate traffic over long links, TCP requires really large buffers at the end hosts. Again, there are new variants that deal with this, but the end host may not have enough memory capacity to buffer that much data. So right now the way companies deal with it is they hire people to go around tuning TCP buffers. Problem number three is that TCP intrinsically has a feedback loop in it. When you lose a packet, the receiver goes back to the sender to recover the lost packet. So for TCP it really does matter whether your data centers are next to each other, on different continents, or on different planets. So what do people do about this? Well, solution number one is you just sit and twiddle your thumbs and you wait for the operating system companies to deploy new, better versions. What do companies do today when they're still running XP or some older variant of Windows or Linux? They do multiple things, and they tend to be quite ad hoc. They rewrite applications to use multiple flows instead of one. They go around tuning TCP buffers. But the most common solution to this problem is they throw money at it, or at least they try to. You pay an ISP a million dollars and the ISP will give you the perfect network, the network that does not drop packets. This is done today.
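As a rough sanity check on the throughput-collapse claim above (this back-of-the-envelope is illustrative only; the 100 ms coast-to-coast RTT, 1500-byte segments, and use of the classic Reno approximation are assumptions, not numbers from the talk):

```latex
% Mathis et al. approximation for steady-state TCP Reno throughput under random loss p:
%   T \approx \frac{MSS}{RTT}\cdot\frac{1.22}{\sqrt{p}}
% Assumed: MSS = 1500\ \text{bytes} = 12\,000\ \text{bits},\quad RTT = 100\ \text{ms},\quad p = 10^{-3}
T \approx \frac{12\,000\ \text{bits}}{0.1\ \text{s}} \cdot \frac{1.22}{\sqrt{10^{-3}}}
  \approx 120\,000 \cdot 38.6\ \text{bits/s} \approx 4.6\ \text{Mbps}
```

Well under 10 Mbps, regardless of the 40 Gbps link capacity, which is consistent with the flat-lining described above.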
ISPs will give you a service level agreement that says the network will not lose more than X percent of your traffic, and if it does, we will compensate you financially. That's the model today: throw money at it, you get a perfect network, and you stop worrying about what's wrong with TCP. The question is, what if I don't have a million dollars? What if I want intercontinental connectivity and I want it for $50,000 or whatever? We have one data point for such a network. It's called TeraGrid. It's a grid of supercomputers in the U.S., with multiple high-profile sites, including NCSA and San Diego. This is a fairly heavily used network, used on a day-to-day basis by e-science researchers. Notice that, firstly, the links between sites are 10 to 40 gigabits per second, and they're not really loaded -- over a three-month period we never saw link utilization of more than 70 percent. Secondly, the path between any two of these sites has multiple hops in it. This is a fairly standard kind of topology that's being deployed out there for these kinds of networks. Now, TeraGrid has loss, even though it's not congested. It has a monitoring framework where sites pick other sites, send a blast of UDP traffic, and note down the associated loss rate. We saw something extremely surprising: 24 percent of all these measurements had a loss rate of greater than .01 percent -- enough to flat-line any kind of TCP running on top of it. We dug into this data and tried to find out what was going on, and we found two reasons. Reason one for loss was that the path between two nodes in different data centers was cluttered with multiple active devices, and these active devices would start dropping packets for no good reason. We found faulty rate limiters; we found bad NICs that were dropping packets. And because these guys were not paying someone a million bucks to go and pull that card and replace it, you saw persistent loss. In an academic setting, nobody cares if you're losing UDP packets; they're just using blast protocols, making do with blasting packets across the pipe, so they didn't care. Talking to people, we saw that another source of loss on these networks is the fiber itself. That's not true for TeraGrid, but apparently a lot of the fiber in the ground in the U.S. has a certain spec: it's designed for 1 Gbps or 10 Gbps. When you try to run a laser that can multiplex 100 gigs on top of this, you see noise. That translates into a higher bit error rate, and the bit error rate translates into a higher packet loss rate. So we're in this world where you have essentially infinite bandwidth but you have loss -- a noisy, high-bandwidth medium -- and we want to run unmodified TCP/IP on top of these networks. So how do we do this? Here's how. This shows the data path between two data centers, and you have loss occurring in the middle. By the way, people don't ask questions in the first five or 10 minutes, but now it's open, now you can -- go on.
>>: I have a question. So part of your motivation, it seems, is that ten megabits per second is not a good upper limit for a given TCP flow. Yet it seems like in each data center there are lots of hosts that want to communicate with hosts in other data centers. In aggregate, there would seem to be many, many flows, each one of which is capped at 10 megabits per second. If that's the case, does the aggregate saturate a 40 gigabit per second link?
>> Mahesh Balakrishnan: I think between two data centers you're going to see all kinds of traffic. You'll have a session that's interactive, file transfers with one big chunk of data, a thousand low-rate flows.
Essentially, I don't want to think about it. I do not want to reason about what kind of traffic there is. I want one solution that fixes every problem that TCP has, and I want to show you we can get to that.
>>: In the experiments, how do you make sure the loss was due to these two reasons and not to --
>> Mahesh Balakrishnan: Right now we are deploying, essentially -- so there's a real network, and when we suggested we should have access to the routers inside it, people were not very happy. So a lot of the experiments are on test beds where we inject loss artificially. Right now we're doing something interesting: we're using a cooperating site on that network to inject a routing loop so we can send traffic out and measure it. But that involves a lot of talking to network admins, and that's something we're doing now.
>>: My question is, the SONET layer will give you errors in seconds; did you look at that?
>> Mahesh Balakrishnan: These are essentially gigabit Ethernet, pretty much. In some sense this goes back to the problem motivation. You could fix all the problems I'm talking about by fiddling around with the electronics in the network, changing the optics. We don't want to. I want the ability to just ask the ISP: give me bandwidth. I'm not asking for anything else; I don't want you to go around pulling cards out and replacing them on the fly. Just give me the bandwidth.
>>: So low loss.
>> Mahesh Balakrishnan: It could be random; it could be low or high. We want a solution that fixes all of that, with the one assumption that bandwidth is not a limited resource. Maybe let me just describe the solution; I think it becomes clearer. So I'm going to drop a box between the data center and the wide-area link, and this box is a passive device -- that's very important. If it fails, nothing goes wrong; you still get routing between the two data centers. And it's completely transparent: end hosts are still running TCP, Reno, what have you. The network is not modified; it's what the ISP gave you. It could be dropping packets at varying rates with varying correlation between packets. The way we do this is by using an almost time-honored technique: forward error correction. What's FEC? It's been around for decades, used in link and satellite communications, and in fact on CDs for encoding against errors. And it's a simple idea: you have a communication channel, and the sender injects redundancy into it that the receiver can use to recover lost data. This example shows one of the more common codes: for every five data packets the sender sends out, it generates three error-correction packets and injects them into the link. The receiver can then recover from the loss of any three packets. Now FEC is very nice because it has this property that recovery latency does not depend on the round-trip time. So it does not matter anymore if your data center is in Asia or anywhere else. The other nice property it has is that at any given point you know how much of your network is overhead. You can tune FEC to say: for every five data packets, generate one error-correction packet. And that's exactly how much overhead you have in the network. No reactive timers, no traffic going back as acknowledgments. In the last 10 years people have proposed packet-level FEC; Microsoft proposed that. And this is a nice idea because it's inexpensive: you can just run it on your machine, and you don't need extra hardware. A minimal sketch of this kind of packet-level encoding follows.
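To make the idea concrete, here is a minimal sketch of packet-level FEC using a single XOR parity packet per group of five data packets, so it can repair only one loss per group; the stronger codes discussed in the talk (Reed-Solomon style, with several repair packets per group) can survive more. The group size, function names, and framing here are illustrative assumptions, not Maelstrom's actual implementation.

```python
# Illustrative packet-level FEC: one XOR parity packet per group of K data packets.
# A real deployment would use a stronger code (e.g. Reed-Solomon) to survive
# multiple losses per group; this sketch repairs at most one loss.

K = 5  # data packets per FEC group (assumed, not from the talk)

def xor_bytes(packets):
    """XOR a list of equal-length payloads together."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

def encode_group(data_packets):
    """Return the repair (parity) packet for one group of K data packets."""
    assert len(data_packets) == K
    return xor_bytes(data_packets)

def recover(received, repair):
    """received: dict seqno -> payload for seqnos 0..K-1, possibly missing one entry.
    Returns the completed group if at most one packet was lost."""
    missing = [s for s in range(K) if s not in received]
    if not missing:
        return received                      # nothing to do
    if len(missing) > 1:
        raise ValueError("XOR parity can only repair a single loss")
    lost = missing[0]
    received[lost] = xor_bytes(list(received.values()) + [repair])
    return received

# Tiny usage example with fixed-size payloads.
group = [bytes([i]) * 8 for i in range(K)]
parity = encode_group(group)
got = {s: p for s, p in enumerate(group) if s != 2}   # packet 2 was lost
assert recover(got, parity)[2] == group[2]
```

The appliance version described later in the talk applies the same principle, but over the aggregated traffic leaving a data center rather than over a single end-to-end flow.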
However, two things have stopped packet-level FEC at the end host from being deployed. One is that you can't go around changing end-host stacks; that, in some sense, you can get around. The bigger, more fundamental problem is that you don't get this RTT-independence for free. You get it by trading off: recovery latency now depends not on the RTT but on the data rate in the channel. So if you're running FEC end to end between two machines in two different data centers, and it's a session where you send a packet and wait, you can't encode fast enough. You can see why: in this example the sender does not create any error-correction traffic until it has enough data. So if the sender is sending a packet every second, then it has to wait five seconds before generating any error-correction packet, and the receiver has to wait that much time before recovering any lost packet. How are we going to get around this? We can't do it end to end -- even though end to end is a great idea and covers any possible kind of loss -- because the data rates are not high enough machine to machine. So what are we going to do? We're going to push FEC into the network, where we can actually aggregate traffic. Now, this is a general idea, but the simplest instantiation of it is a box: you do FEC in a middle box. The middle box sits between the end hosts in the data center and the wide-area link, and it does FEC for you. That's the simplest possible instantiation, and later in this talk I'll show you the same idea in a much more sophisticated implementation.
>>: Can we inject some packets if --
>> Mahesh Balakrishnan: Because you have thousands of channels in your data centers. It may be that your machine is sending one packet, but the other machines could be sending blasts of packets.
>>: I understand what you're saying. The other choice is, for a single host, if there's not enough traffic, you pad it artificially.
>> Mahesh Balakrishnan: I understand that, but then you lose the property, which to me is very important, that your network link at any given point carries X percent overhead.
>>: That's the problem you see anywhere, right?
>> Mahesh Balakrishnan: If you're aggregating in the box at a gigabit per second, you don't ever have to pad -- you know exactly how much you're sending into the network. That's a very nice property, because you can go to your ISP and say: I want X bandwidth. You know exactly how much goodput you can send on the network now.
>>: I have a question, which is -- I understand your qualitative answer to Ray Dong's question. Do you also have a quantitative one later?
>> Mahesh Balakrishnan: Yeah. In terms of -- I don't understand the objection. If I have a high enough data rate, packets coming in, and I never do padding, then I know exactly how many error-correction packets I'm generating per data packet. I know exactly how much of my network is overhead due to the FEC.
>>: But on that --
>>: Can I express Ray Dong's objection in a different way? I think when you've got a host sending, if you were to graph the traffic a host is sending, we're looking at the tail, where you're going to be generating these extra packets. So you're taking a small amount of traffic and doubling it, which shouldn't really matter much. So there are two ways you can justify this argument. One is that it works if you're aggregating it all together. But if you want to throw away the idea that it would work if machines are doing it individually, you need to show that taking that thin tail and doubling it really hurts you.
>> Mahesh Balakrishnan: I get your point. If you have 100 nodes in each data center and arbitrary channels between them, the number of channels is high, so any individual channel could be extremely thin. If I have five sessions open to different machines in a different data center, and everyone in my office does the same, you get a problem where everyone is just waiting, or padding. Maybe you don't buy that; we can probably take it offline. But that's my assumption: you could have a lot of variability, and I don't want to do padding.
>>: Do you have data to support that, that the data center does not have --
>> Mahesh Balakrishnan: But it belongs to --
>>: I would love to get my hands on that data. It's the U.S., not -- they had sessions.
>> Mahesh Balakrishnan: I would just say we don't have that data, so we're trying to solve the general case, even without enough traffic per flow. I guess let's just go on. So exactly how does this work? On the top you have the send-side data center; on the bottom you have the receive-side data center. The send-side appliance is snooping on traffic leaving the data center, creating what you can think of for now as XORs: for every five data packets it creates an XOR and dumps it into the channel. The receive-side appliance picks up the XORs and uses them to recover lost packets. Note that the receive-side client, which would be on the other end of that arrow on the bottom, does not see any loss. It would otherwise see out-of-order arrival, and hence we need to recover packets extremely fast; it's important we recover packets immediately, otherwise TCP/IP will perceive out-of-order arrivals. So I said we're using XORs; actually we're using something a little more sophisticated. What happens if you have bursty and correlated loss? Forward error correction has always had this other problem -- the elephant in the room, in some sense -- that it's not really good at handling burst loss. If you lose 10 packets in a row, all bets are off. For traditional FEC encodings, the recovery latency you get from a code depends on the maximum burst size it can tolerate. The way they handle bursts is by interleaving encodings: they split one big stream into a lot of little streams, encode separately over these streams, and hence get a proportional increase in burst tolerance. But it means you have to wait longer and longer for recovery to happen, because now you're encoding over lower-rate channels -- remember, I said recovery latency depends on the data rate in the channel. We came up with a new code that has a graceful degradation property. The goal is: if I lose a single packet, I want to recover it immediately; if I lose 10 in a row, I want to recover them in 10 milliseconds; if I lose 100 in a row, I want them recovered in 100 milliseconds. The way we do this is fairly simple: we just create XORs at different interleaves. We create one layer of XORs from consecutive data packets, another layer from every 10th packet, a third layer from every 100th, and so on. That means that for constant overhead we get a graceful degradation in latency that no other FEC code has. We don't get it for free: we trade off on the recovery power of the code. This slide compares our code to Reed-Solomon, which is the canonical FEC code. You can see we lose a little bit in recovery power, but not that much, and in return we get much better latency properties. A small sketch of this layered interleaving idea follows.
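A minimal sketch of the layered-interleaving idea just described: one repair layer is computed over consecutive packets, a second over every 10th packet, a third over every 100th, so an isolated loss is repaired by the densest layer almost immediately while longer bursts fall back to the sparser layers. The strides, group size, fixed-size payloads, and plain XOR repair are illustrative assumptions; the actual code and its recovery power are as described in the talk, not reproduced here.

```python
# Illustrative layered interleaving: repair packets generated at several strides.
# A layer with stride r keeps r rotating XOR accumulators; accumulator (seq % r)
# absorbs packet seq and is emitted (then reset) once it has covered K packets.
# A burst of up to r consecutive losses hits each stride-r accumulator at most once,
# so that layer can still repair it -- at the cost of waiting roughly K*r packets.
# Payloads are assumed fixed-size for simplicity.

K = 5                    # data packets per repair packet (assumed)
STRIDES = [1, 10, 100]   # interleaves, as in the talk: every packet, every 10th, every 100th

class Layer:
    def __init__(self, stride):
        self.stride = stride
        self.acc = [None] * stride     # one XOR accumulator per offset
        self.count = [0] * stride

    def add(self, seq, payload):
        """Absorb a data packet; return a (stride, offset, repair) tuple when one is due."""
        i = seq % self.stride
        if self.acc[i] is None:
            self.acc[i] = bytearray(len(payload))
        for j, b in enumerate(payload):
            self.acc[i][j] ^= b
        self.count[i] += 1
        if self.count[i] == K:
            repair = bytes(self.acc[i])
            self.acc[i], self.count[i] = None, 0
            return (self.stride, i, repair)
        return None

layers = [Layer(r) for r in STRIDES]

def on_data_packet(seq, payload):
    """Feed one outgoing data packet to every layer; return any repair packets due."""
    repairs = []
    for layer in layers:
        r = layer.add(seq, payload)
        if r is not None:
            repairs.append(r)
    return repairs
```

Because every layer sees every data packet, the total repair overhead stays constant; only the latency at which a given burst size becomes repairable changes, which is the graceful degradation property claimed above.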
>>: (Inaudible).
>> Mahesh Balakrishnan: So I spent a lot of time convincing people of this. If they're not convinced, I say: look, if you have a better code, I've given you a way to deploy it. So either they agree with me, or if they don't, I say: you think you have a better code, deploy it.
>>: (Inaudible).
>> Mahesh Balakrishnan: It has constant overhead. That's the key for me. A lot of the newer codes do not give you this property that you get exactly this overhead. For example, the rateless codes Michael was mentioning -- and it's true for something recent called Growth Codes -- the argument there is that you change the overhead to get burst tolerance. We don't want to do that, because I want the guarantee that a fixed fraction of my link bandwidth is pure overhead. I don't want to go around tweaking that. That's one set of things to optimize against.
>>: Why do you want -- will not exceed the network capacity.
>> Mahesh Balakrishnan: The thing there is that a lot of protocols failed, in some sense, because they were reactive: something happens on the network and they react to it. And this is true for things like multicast protocols, where you have broadcast storms and then you have NAK storms and stuff like that. So we were trying to come up with a protocol that is essentially the most passive thing on earth: it doesn't react, it just injects this much overhead. Hence you can predictively plan around it. That may not be a goal you think is important, but that's kind of the assumption behind this. So Maelstrom operates at the IP level. It can do two things with TCP. It can ignore it, in which case it's just working at the IP level, TCP's end-to-end semantics are maintained, and it's a passive device -- if it fails, nothing goes wrong. That's the version I'll talk about. However, that doesn't solve all the problems of TCP: it eliminates loss, but it doesn't handle the buffering problem. So to solve that problem we can also put the Maelstrom appliance in the critical path, where we make it an active device that intercepts and breaks the TCP flow control. That's nice for some applications that want to handle buffering outside the end host, but it does turn it into an active device. So we implemented this, and we were able to build a prototype that runs on a commodity box at a gigabit per second, limited by the outbound link. With a 10-gig link we'd probably be able to keep it at around 3 gigabits per second. So how does this work? Here is a graph that shows what happens to TCP when loss occurs in the network. On top, you have TCP without loss as we stretch the link out, and on the bottom you have TCP with loss. You can see it just collapses completely when you have a loss rate. Now we introduce Maelstrom in the middle, and we get three lines that are indistinguishable at different loss rates; they're very close to the TCP no-loss line.
>>: How many flows were you running?
>> Mahesh Balakrishnan: I think we were running somewhere between 10 and 20 flows in parallel. So we did expect that a number of them would overlap.
>>: (Inaudible).
>> Mahesh Balakrishnan: But the effective observed loss rate per flow is still high enough to completely flat-line it. We found these results with Reno, but we also ran it on springer (phonetic), and very surprisingly we got almost the same curve. We're trying to figure out exactly why that is. Maybe it's just an artifact of the loss model.
>>: (Inaudible) TCP and they claim they do it much better than other --
>> Mahesh Balakrishnan: So most of the new protocols either use a different congestion control or they use delay.
I think Compound TCP does both. In either case -- I mean, I don't want to comment on the newer work, because I have not run against every protocol out there. It's possible that for this loss model some of them would behave much better. The strongest claim I can make is: look, for the most commonly deployed protocols out there, this does happen. And maybe you can fix it if you play around with it; I do believe that. So you can see that at 25 milliseconds we don't quite track TCP. That's because we're limited by the outbound NIC; we have around 650 Mbps of goodput.
>>: Is this on the actual network or --
>> Mahesh Balakrishnan: Actually an evaluation on a test bed. It's not the real network.
>>: Loss is being injected.
>> Mahesh Balakrishnan: Loss is being injected. We're doing it artificially at the network level. It turns out to not be very complicated; we just drop the packet.
>>: These are random or these are --
>> Mahesh Balakrishnan: These are random. I have both, actually.
>>: A couple of slides ago you mentioned this leads to more production. Does that mean we have to change the --
>> Mahesh Balakrishnan: Very standard. People have been doing performance-enhancing proxies. The idea is that if I try to send a packet from here, there's a box that intercepts it, breaks the connection, and does the buffering for me. So the other interesting thing in this graph is that as you stretch the link out, your throughput is going down even when no loss occurs.
>>: The gap between 50 and 900, and the closing of that gap is?
>> Mahesh Balakrishnan: So the box can handle a gigabit per second of traffic going out, and we're adding FEC. So FEC plus data is around a gigabit per second; we're adding maybe 25 percent overhead in these runs. If we had a bigger NIC we'd be running at a gig. But obviously the moment you hit the bandwidth ceiling, you have this problem that puts us lower. So throughput is going down with the length of the link. That's the buffering problem I talked about: as you stretch the link out you need bigger buffers at the end host, and if you don't change the buffers, this is the phenomenon you observe. If we change Maelstrom into an active device -- now it's in the critical path and things can go wrong if it fails -- we essentially get throughput that's independent of the link length. The third claim is that Maelstrom fixes the delivery latency problem for TCP. On the left you can see that when you have a loss rate, TCP has these massive delivery latency spikes. One reason for the spikes is that the receiver buffers all incoming data, because TCP supports in-sequence delivery: if I lose a packet, the receiver has to buffer incoming packets until the missing one is recovered. That's one source of (inaudible). The other source is congestion control kicking in; another is the buffering aspect. On the right you can see that we eliminate all loss. That's a real observation. More interestingly, you can see that, A, for normal data packets we're not adding any latency, and B, when we do lose a data packet we're recovering it extremely fast. There's almost no difference between normal and recovered packets -- maybe a few milliseconds. How does the code work -- the layered interleaving code I talked about? We have these histograms of packet recovery latency. On the left, we show that if you have completely random loss, most packets are being recovered almost immediately; you just have one big bar at the beginning.
As we make losses burstier and burstier -- here's a point where you're losing 20 packets in a row, 40 packets in a row -- you can see the histogram shifting to the right. So we're getting this property that recovery latency depends on the burst size. Before I go on to the next part of this talk I'll give you an opportunity to ask questions, but first I want to mention other work we're doing on top of this. The reason we started looking at this problem of inter-data-center communication was that we were approached by financial institutions that needed mirroring solutions for disaster tolerance. The state of the art right now is that banks in New York place mirrors in New Jersey for disaster tolerance; that's as far as they can go. What they ideally want is to place them somewhere in Kentucky. And the reason they can't is that there are only two ways you can do mirroring right now. You can do it synchronously, right at the file system: the file system sends the data to the mirror and waits for an acknowledgment. That's extremely safe and extremely slow. The other option is you just return to the application immediately and assume that the packet gets there. Now we're exploring a middle ground where we add so much overhead -- so much redundancy -- into the network that we can push the reliability level of a piece of data to the point where it is as safe as if we wrote it to a local disk. So we built a file system that does this. Essentially, it returns to the user once the probability of the data being lost in the network is low enough. So I'm going to move on to the Ricochet protocol, but do you have any questions at this point?
>>: Is there really much purpose in handling large burst losses? Because at those recovery times TCP is going to notice.
>> Mahesh Balakrishnan: You're right. For small bursts we can make it transparent to TCP, but for large ones we can't. But the argument -- one of the nice things about this work -- is that it works at the IP level, transparently, so arguably it still has value.
>>: It seems like you can't avoid a semantics difference between waiting for it or not. The network (inaudible) is going to save it.
>> Mahesh Balakrishnan: It's a model, you're right. It works only if the reason the packet doesn't get across is loss in the network, not a disconnect or something.
>>: Or a failure, for instance, if you're active.
>> Mahesh Balakrishnan: Right, that's the model; I do agree with that. But we think it provides value to people who don't need complete safety but don't want to go all the way the other way either.
So that was Maelstrom: reliability between data centers. Now I'm going to show you that the same kind of technique -- using forward error correction in interesting ways -- can have a lot of value within data centers as well. So what's the connection? Well, in the inter-data-center case, one of the main reasons we wanted to use FEC was that you can't afford a feedback loop from the receiver going back to the sender; the length of the link is simply too high. Hence we wanted this property that all communication is one way. Now, within the data center the RTTs are not very high, but the standard mode of communication is multicast. And if you have a lot of receivers for a multicast, you again have the same problem: the receivers are not very far away, but there are just too many of them. So if you wrote an acknowledgment-based protocol, which has a very tight feedback loop, then for every packet the sender sends out it's going to be swamped by incoming acknowledgments.
So you can't have a feedback loop in either case, and we're going to use the same kind of techniques in this setting. But before I talk about that: is multicast used in the data center, why is it used, and how is it used? Within a data center, data is often sprayed across multiple nodes -- partitioned, replicated -- and people use a whole number of communication paradigms to do this. They use publish-subscribe, and you can imagine caching and invalidation groups. The example I'm going to give you is from a financial setting where you have nodes that are subscribed to portfolios of equities. So you have a node that's tracking a number of stocks, and the way it's done today is that each equity is mapped to a multicast group. If I'm interested in an equity, I join the multicast group and I get the updates for that group. Now, this means that each node, if it's interested in hundreds of equities, is going to belong to many different groups, which means that the granularity of a single group is going to be fine, which means that the data rate in a single group could be fairly low. You're not going to get a thousand updates per second for the Microsoft stock; you're going to get one every other second or something like that. However, even though a node is in a lot of low-rate groups, it can easily get overloaded. It's tracking 100 equities, maybe 1,000. If there's any kind of traffic spike, the node gets overloaded. Multicast protocols are UDP-based, and hence overloaded nodes can drop packets at the kernel buffer. When we looked at this two to three years back, the dominant model was for nodes to be very thin: you have thin blades talking on a very fast network, and they can easily get overwhelmed.
>>: (Inaudible).
>> Mahesh Balakrishnan: They're already on top of multicast, but the reliability layer, at least with current technologies, is always at the application level. You're right. So what happens when nodes get overloaded? They drop packets. We found it's ridiculously easy to overload a blade server. On top you have a blade server that's getting more data than it can handle; on the bottom, you have one that's getting less data than it can handle. They're in the same groups, but the one on top is in an extra group, and they're behind the same switching segment. What this graph is saying is that, A, loss is occurring at the end host -- it's independent across nodes, it's not in the network -- and B, it's bursty, because it's happening due to kernel buffer overflows. So how do we deal with this problem? Reliable multicast is a very well-studied problem. It's been around for years, and it's been solved: scalable multicast is a solved problem. However, when people say "scalability," they mean scalability in the number of receivers. There's been all this work that looks at how to scale multicast to hundreds, thousands, millions of receivers. In a data center, though, you want multicast where you can have lots of receivers, sure, but also lots of senders to a group. What if a node is aggregating data from many different senders? That's something nobody has looked at before. Also, you want to support a lot of overlapping groups -- each node is in hundreds of groups. So I want to support a system where multicast is used at very fine granularity, at the item level or object level. So how do we do this today? Well, there are only a certain number of protocols you can write to solve this problem. Number one is you just use acknowledgments -- TCP-style reliability -- for multicast.
This does not work because you get ACK implosion; I already mentioned that. Number two is you use negative acknowledgments, and this is how all commercial multicast technology is built. There was a protocol called SRM in '95, '96 (inaudible), and that's been translated by companies, including Microsoft, into protocols that essentially use negative acknowledgments. I'll come back to what the problem is with this in a minute. Number three is you can just inject FEC at the sender: the sender creates XORs and injects them into the channel, and nodes recover from loss. Now, the problem here is that if you have hundreds or thousands of low-rate groups, you get the same issue: you can't use padding, and FEC doesn't work very well, because latency depends on the data rate in the channel. And this artifact applies to negative acknowledgment protocols as well: the receiver does not know it has dropped a packet until it receives the next packet from the same sender to the same group. In a setting where you have hundreds of senders and hundreds of groups, this could be seconds. So what do we do? We're going to apply the same technique we did in the last set of work: we'll move FEC to a location where we can aggregate it over high-rate data channels. We're going to have receivers generate XORs, and we're going to have receivers pass around these XORs, which can then be used to recover lost data. Remember, I showed you that loss is occurring at the end host and is independent across end hosts; hence this actually works. The receiver on top misses data; it gets an XOR containing that data and can recover the missing packet. How exactly does this work? Each receiver generates XORs over incoming data packets. It creates an XOR, picks c other nodes in the system, and sends the XOR to them. Now, this gives you an essentially fixed forward error correction rate: you know that your system has X percent overhead. And it has an interesting property: you've just removed the sender from the reliability protocol, and hence you no longer care whether you have one sender or a thousand senders. You're completely oblivious to the number of senders to the group; it's a completely receiver-driven protocol. So we have property number two. We already had scalability in the number of receivers -- that's the easy one -- and now we have scalability in the number of senders to a group. What about the next step? Can we get scalability in the number of groups in the system? Well, here's how we do it. This shows how the protocol I described on the last slide works. You have two receivers in two different groups, sending each other XORs. Receiver R1 sends R2 two XORs, one from data in group A and the other from data in group B, and it does these independently -- essentially running two instances of the single-group protocol. But we can do better, right? We can aggregate data across groups: R1 can send R2 an XOR that combines data from groups A and B, and recovery happens much, much faster. Now, this works in the simple case. What if you have hundreds of overlapping multicast groups? Are there any questions at this point? What if you have hundreds of overlapping multicast groups in arbitrary patterns? This is how we handle it. A receiver belongs to multiple groups -- say, in this case, N1 belongs to three groups, A, B and C -- and it essentially decomposes these groups into regions of disjoint overlap, as in the sketch below.
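A minimal sketch of that decomposition step, under the assumption that each node simply has a group-to-members map from a standard membership service (the talk says such a service is used; the data structures and names here are illustrative): for a node n, every peer it shares at least one group with is bucketed by the exact subset of n's groups the two have in common, and each distinct subset is one region of disjoint overlap.

```python
# Illustrative decomposition of a node's groups into regions of disjoint overlap.
# membership: group name -> set of member nodes (assumed to come from a standard
# membership service, as described in the talk).

from collections import defaultdict

def regions_for(node, membership):
    """Map each region (frozenset of groups shared with `node`) to the peers in it."""
    my_groups = {g for g, members in membership.items() if node in members}
    regions = defaultdict(set)
    for g in my_groups:
        for other in membership[g]:
            if other == node:
                continue
            shared = frozenset(h for h in my_groups if other in membership[h])
            regions[shared].add(other)
    return regions

# Example with three groups A, B, C, as on the slide.
membership = {
    "A": {"N1", "N2", "N3"},
    "B": {"N1", "N3", "N4"},
    "C": {"N1", "N4", "N5"},
}
for region, nodes in regions_for("N1", membership).items():
    print(sorted(region), sorted(nodes))
# -> ['A'] ['N2'], ['A', 'B'] ['N3'], ['B', 'C'] ['N4'], ['C'] ['N5']
```

Note that the number of distinct regions produced this way is bounded by the number of peers the node shares a group with, which is exactly the scalability point made next.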
Now, this is scalable. It looks unscalable, but it's not, and the reason is: A, nodes are not interested in groups they're not part of, so if you ran a traditional group-based protocol, it would have exactly the same overhead; and B, the group membership service, whether it's gossip-based or centralized, is the standard membership service you're already using for everything else. All this disjoint-region decomposition is occurring at the end host. Another point to be made: it looks like there could be an exponential number of regions, but this is bounded by the number of nodes that N1 shares a group with, so in reality you won't have an exponential number of regions. Now, what do we do with this? N1 does all this magic and breaks its groups into regions. Remember that the basic operation in the receiver-based, single-group protocol I showed you was that each receiver generates an XOR from incoming data packets, selects c random nodes from the group, and sends the XOR to them -- in this case it's picking five nodes from the group and sending the XOR to them. Now we want to compose the XOR we send to a receiver based on the groups we share with it. If I'm in three different groups with the receiver, I want to aggregate data across those three groups. And if I know what the regions of disjoint overlap are, then I know that if I'm sending an XOR to a node in region ABC, I should aggregate data from A, B and C; and if I'm sending it to a node that's only in A, not interested in B and C, I should aggregate only from group A. So we select targets for the XORs not from groups but from regions. This allows us to selectively compose each XOR at the fastest rate possible. And this means we get scalability in the number of groups, because within each intersection we're running the FEC protocol as fast as we can; the definition of a channel here is now an intersection of multiple groups that you share with some other node. How does this work?
>>: I didn't understand that last slide.
>> Mahesh Balakrishnan: Okay. The basic operation of receiver-based FEC is: I get a bunch of data in a single group, I create an XOR from it, and I pick other nodes from the group. So if we are all in one big group, I pick five of you and send you an XOR of data in group A. Now, the point is that if you and I are in five different groups together, I could have sent you an XOR with data from all of those groups. So to enable that, I first have to know which groups I share with you. Say you're in intersection A/B, which means you are interested in data from groups A and B but not group C. Now, instead of picking five random nodes from the group, I'm going to pick random nodes from these intersections -- from each region, a fraction of the five that's proportional to the size of the region compared to the size of the group. It's a kind of regional sampling: I'm dividing this room into smaller chunks and picking smaller fractions of the sample set from each chunk. It's sampling within regions instead of groups, as in the sketch below.
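A minimal sketch of that per-region selection and composition, building on the regions_for helper sketched earlier. The repair fan-out C, the proportional rounding, the fixed-size payloads, and the transport hook are all illustrative assumptions, not Ricochet's actual parameters.

```python
# Illustrative per-region target selection and XOR composition.
# In each repair round a node picks roughly C targets, sampling each region in
# proportion to its share of the node's peers, and sends each target an XOR built
# only from groups that target actually belongs to.
import random

C = 5  # repair fan-out per round (assumed)

def xor_payloads(payloads):
    """XOR payloads together, padding to the longest one."""
    out = bytearray(max(len(p) for p in payloads))
    for p in payloads:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

def send_repairs(node, membership, recent, send):
    """recent: group -> list of recently received payloads in that group.
    send(target, groups, xor_payload) is the transport hook (assumed)."""
    regions = regions_for(node, membership)          # region (frozenset of groups) -> peers
    total_peers = sum(len(peers) for peers in regions.values())
    if total_peers == 0:
        return
    for region, peers in regions.items():
        # Proportional share of the fan-out for this region.
        share = round(C * len(peers) / total_peers)
        targets = random.sample(sorted(peers), min(share, len(peers)))
        # Compose one XOR per region from the groups its members share with us.
        payloads = [p for g in region for p in recent.get(g, [])]
        if not payloads or not targets:
            continue
        repair = xor_payloads(payloads)
        for t in targets:
            send(t, sorted(region), repair)
```

A receiver that is missing exactly one of the packets covered by an incoming XOR, and has the rest, can XOR them back out to recover it, exactly as in the single-group case; composing per region just lets each XOR cover data from every group the sender and target share, so it fills up faster.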
>>: What defines the group membership?
>> Mahesh Balakrishnan: The group membership, exactly -- the commonality in the groups I share with you. Go on.
>>: What are the black dots and the green dots?
>> Mahesh Balakrishnan: The green dots are the nodes I happen to select, the five of them; the black dots are the nodes that didn't get selected. I want this kind of behavior where I'm picking five green dots from group A, but I want to do it on a per-region basis. So instead of picking five nodes from group A, I'm picking one from A/B and one from -- and it all adds up to five.
>>: The green dots -- nodes or packets?
>> Mahesh Balakrishnan: They're nodes. They're nodes.
>>: Do you do this selection every time?
>> Mahesh Balakrishnan: It's a little complicated. In the first version of the protocol I create an XOR and then I select; in this version I select and then I create the XOR based on who I selected.
>>: (Inaudible) reselect the nodes.
>> Mahesh Balakrishnan: At the granularity of a single XOR. So if I know that 25 percent of my traffic is XORs, there's a certain number of XORs I'm creating, and based on that, for each XOR, I end up selecting a node.
>>: Does membership churn (inaudible) performance?
>> Mahesh Balakrishnan: It would, except that in a data center you don't have that much churn -- that's the assumption, and that's why you can use something like this. It would still work, because it's gossip-based in some sense; if you had a lot of churn there would be some kind of lag, but it would still work, just not as well.
>>: Why do you only do one packet -- create one XOR and send it five times?
>> Mahesh Balakrishnan: That's what we do in the initial case. We can't do it here because different nodes need different XORs.
>>: You select targets for the XORs, but you don't send them the same --
>> Mahesh Balakrishnan: The nature of the XOR depends on the target, because we want to compose the XORs as fast as possible for each target. If I'm in 10 different groups with you, I want to create an XOR from those 10 groups.
>>: Then why are you selecting all the targets as part of a single step? Why not -- it sounds like it's iterative.
>> Mahesh Balakrishnan: In practice it ends up -- this is how I'm showing it, but in practice we're treating each of the regions separately and we are doing what you're saying.
>>: So it does decompose into: each round I select a new node who is my target.
>> Mahesh Balakrishnan: Very much, yes.
>>: It amounts to random selection.
>> Mahesh Balakrishnan: Exactly. It amounts to random selection, except you're doing it from regions. So how does this work? Well, on the X axis we have the number of groups each node is in. Of the two graphs, the one on the left shows the percentage of packets we're successfully recovering with this mechanism, and on the right we have the latency we're recovering them with. The point of this graph is that as you make the groups finer and finer -- as you take your system and decompose it into finer and finer groups -- your performance is not affected. For every other multicast protocol out there, including the ones developed by Microsoft, which are essentially the industry standard, the latency shoots up as you refine the groups. To show the extent of this, we actually ran one and found that with 128 groups it's 400 times slower. And this is -- I'll have a slide later where I talk about this -- one reason why multicast hasn't caught on in data centers: the reliability mechanisms have been so incredibly broken. They were designed for the wide area; they don't work well in data centers. So this shows that we get scalability in the number of groups; you can tolerate a lot of low-rate groups. And this shows the recovery histogram: at different loss rates we're recovering most traffic in the initial segment using forward error correction. Because we don't have TCP/IP running on top, this protocol is the final line of defense.
So we need a negative acknowledgment layer to catch the fraction of packets that FEC can't recover; the bump in the middle is this extra reactive layer. Now, what does this mean? It means that you can say, "I want 20 percent overhead in my system," and that will be all you pay for most packets; only if the loss rate is extremely high do you get a little reactive traffic on top. And we tried it with bursty losses. We found that because we have so much diversity in the encoding, it's extremely resilient to bursty losses: we can lose 100 packets in a row and still recover most of them at the same average latency. So one of the questions I get when I present this work is: hold on, nobody uses multicast in data centers. As of 2008, that's perfectly true. Nobody does. Microsoft, I'm pretty sure, doesn't, and I know that other companies don't. And there are three major reasons why. One is that the reliable multicast mechanisms in use were incredibly bad. Between 2000 and 2003 -- in 2000 practically every company that had a data center, mostly financial institutions, was using IP multicast -- they had a series of essentially brownouts and blackouts, and people started phasing multicast out. There's a long list of companies that just kicked IP multicast out of their data centers. One reason was that the reliability mechanisms were broken, and we think we've fixed that. But there were other issues. One is that if you have 1,000 groups in your data center, your routers and your NICs are not able to handle it. They can handle 100 groups; beyond that, everyone gets all the traffic, essentially. And the other problem is that multicast is intrinsically dangerous: if you're all in a group, I can join the group, start sending data at my highest rate, and bring the entire system down. So there seem to be fundamental reasons why no one uses multicast. And the insight is this: you want a logical multicast capability, because everyone uses that within data centers, but you want the ability to control it. So you have a layer of indirection in the kernel where you convert logical IP multicast addresses into a set of unicast addresses, or maybe one unicast address and a couple of IP multicast addresses, and so on. You know that your routing system can handle, say, 100 IP multicast addresses, so you use them for the high-rate traffic. And that's something we're working on right now.
>>: I didn't understand that slide either.
>> Mahesh Balakrishnan: Okay. We're talking --
>>: I got the first part. I didn't understand the solution. How is it that now multicast will not be such a dangerous thing to use?
>> Mahesh Balakrishnan: Here's what I'm going to do. The application still sees an IP multicast group: it joins it, sends to it, all that stuff. We're going to intercept that and convert it into unicast in the kernel. If you do that, you automatically get one thing: your routers are not unscalable anymore, right? You're just using unicast; the network is not seeing any IP multicast. The next thing we can do is still use IP multicast in the network -- we just can't have thousands of groups; we can have maybe 100. So if there are logical groups in the system that carry a lot of traffic, we use the scarce IP multicast addresses for those groups, and everyone else uses unicast. So that's one step, and it kind of solves the routing problem -- the scalability problem -- to some extent, as in the sketch below.
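A minimal sketch of that indirection layer, purely illustrative: a translation table maps each logical group either to a list of unicast destinations or to one of a small budget of physical IP multicast addresses reserved for the hottest groups. The table structure, the promotion rule, and the send hooks are assumptions made for the sketch, not the actual design.

```python
# Illustrative kernel-level indirection: logical multicast groups map to unicast
# fan-out by default; only the hottest groups get one of the few physical IP
# multicast addresses the routers and NICs can actually handle.

PHYSICAL_BUDGET = 100  # how many real IP multicast addresses we allow (assumed)

class GroupMapper:
    def __init__(self, udp_send, mcast_send):
        self.members = {}        # logical group -> set of unicast addresses
        self.physical = {}       # logical group -> physical multicast address
        self.rate = {}           # logical group -> messages seen (crude rate proxy)
        self.udp_send = udp_send
        self.mcast_send = mcast_send

    def join(self, group, addr):
        self.members.setdefault(group, set()).add(addr)

    def send(self, group, payload):
        """What the application sees as 'send to multicast group'."""
        self.rate[group] = self.rate.get(group, 0) + 1
        if group in self.physical:
            self.mcast_send(self.physical[group], payload)    # real IP multicast
        else:
            for addr in self.members.get(group, ()):           # kernel-level fan-out
                self.udp_send(addr, payload)

    def promote_hot_groups(self, free_mcast_addrs):
        """Give the highest-rate logical groups the scarce physical addresses."""
        hot = sorted(self.rate, key=self.rate.get, reverse=True)
        for group in hot[:PHYSICAL_BUDGET]:
            if group not in self.physical and free_mcast_addrs:
                self.physical[group] = free_mcast_addrs.pop()
```

Because every send now passes through this layer, it is also a natural place to enforce who may send to which group, which is the admission-control point made next.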
Now, if you have a kernel layer that's doing this, you can also have things like intelligent admission control. You can mandate who gets to send to which multicast group, so I can't just join a secure group and start spamming it. You have the opportunity now to add a security layer.
>>: I find this very plausible, but it feels like you're giving up the efficiency benefits of doing the multicast in the network layer, which makes me think: why would you stop here rather than moving further towards application-layer multicast?
>> Mahesh Balakrishnan: I think in a data center the main sources of latency and overhead are in the end host. For example, I don't think overlays are a good idea, because for most applications, if you have multiple hops in the data center, you're already in trouble: you go into the stack, you come out of the stack -- massive latency. The other point is that a major source of overhead here is: why do you need IP multicast? Why can't you just do multiple sends? Because every time you do a send you're going into the kernel. If you can get rid of that, that's already a huge step in the right direction. So it's not really network efficiency, it's end-host efficiency that we're looking at.
>>: Let me make sure I understand what you're saying. You're saying that instead of multicast groups backed by a physical multicast instance, there's some sort of kernel layer, or bus or something, where you're doing multiple sends somehow?
>> Mahesh Balakrishnan: Yes, exactly. The kernel is doing unicast multiple sends.
>>: A finer point on John's question. As long as you've addressed all the efficiency problems -- now we're not taking the packet all the way up to user space and sending it N times, we have a nice kernel-level bus -- why not use it for all the multicast traffic?
>> Mahesh Balakrishnan: You mean eliminate IP multicast entirely?
>>: Right. If I understand your design correctly, you already have this mechanism in place for doing essentially non-network multicast.
>> Mahesh Balakrishnan: Right, we do that. I don't know; I haven't built the system yet. I have a feeling that at the end of the day, if you have extremely popular groups, there would be value in having an IP multicast mechanism that's even faster than the kernel multi-send. But I don't have a feel for the numbers yet. You have a point; maybe it would be pointless going there.
>>: Maybe my question is: you've done such a great job, why aren't you using it for everything?
>> Mahesh Balakrishnan: Well, I'll invent it and then I'll measure it, and then I'll know.
>>: The key question is where the break-even point is -- the crossover point where the efficiency you get from actual multicast is so much greater that it's worth using the real physical resource of an IP multicast group. At that break point you should be able to see it in the data.
That sounds like a collegiate thing to say because distributed systems have this long history of failed abstractions. Do we really want to come up with more? But I think there's something fundamental here. We are not building the bridge before people want to cross over. And sometimes you're seeing all the people swim across and we're figuring out that we need a bridge. We need new abstractions. To some extent the community has caught on and they are building new abstractions. Some of it is happening at Microsoft with Triad. But people are dealing with a very specific datacenter application. How do you do data mining or data-based style functionality over very large data sets? Can't we go one step further? And here's what I think it should look like. This is something I saw in a paper by Jim Gray first, an entire number of co-authors from Microsoft. And the idea was that you know there is a conical building paradigm system for datacenters. Partition the service across multiple nodes and you get scalability. Replicate it, you get fault tolerance and reliability. This is a very simple picture. All the people I've talked to, every service I've seen running within a datacenter is built exactly this way. And each time they rebuild it from strach. They take multiple operating systems running on individual nodes. They write load balancing layer. They start doing -- they build intelligent partitioning functionality in the load balancing layers. Why can't we build an operating system that just takes care of this? Now the phrase the datacenter, the computer, I thought I came up with it, but I saw it in an article by David Patterson like three months back in the Communication of the ACM. And he's a computer architect. He says can we come up with a new instruction set for datacenters, what would an add be? Well, I don't think we need to go that far. But can we look at things, like what a process set, what a process is, what a thread is, what prediction boundaries are, what does a socket look like, and what's a location into a service now that your service is spread out across dozens of machines. So can we come up with new abstractions? That essentially is my goal over the next few years. I believe that now we have enough experience with datacenters as a community to actually look at these problems and come up with new abstractions and that they will be used. And if we do that, all these things we can do, then, you have concerns like power and privacy. For example, if the datacenter OS knew what a partition was it could mandate that data does not flow across partitions. And then you could pass third-party developers the rights that automatically manages privacy. That's just one example. But I think you build out abstractions, you can then hide a lot of complexity behind the abstractions. So to conclude I present a picture of a real time datacenter that has to recover from disruptive events within seconds. And I talked about things you can do at the network level to enable this. Specifically, network load protocols recovered from loss packets in milliseconds. I think they're the tip of the iceberg. I think there's a long distance to go before we get anywhere near this goal. Thank you. (Applause). >>: Any more questions? Code availability? >> Mahesh Balakrishnan: These are -- Ricochet is up for download. In fact, we rewrote it in C, the two versions, and reporting it in the Red Hat stack in fact. Red Hat is building something called Cupid, clustering platform they're building right now. 
We're trying to get them to use Ricochet as the clustering layer, and for that we need to do a complete rewrite. It's up on SourceForge, actually.
>>: I'm more interested in Maelstrom.
>> Mahesh Balakrishnan: Sorry?
>>: What about Maelstrom? That's what I'm more interested in -- Ricochet not so much.
>> Mahesh Balakrishnan: Maelstrom -- can you download it? Yes. It's a kernel module. I don't know if there's a public link right now, but if you send me an e-mail I'll be happy to send you the source.
>>: Make us believe that your test bed is actually something we should trust, that the results you are getting --
>> Mahesh Balakrishnan: You should believe it. That's a question I get a lot, which is why right now, at least for the wide-area work, we're doing our best to set up a real optical network test bed. It's a lot of work, because these are fairly expensive resources and they don't want computer science researchers near them; they're typically owned by the scientific community. But we're very close. If you see a journal version of this paper, that's exactly what it will add to this whole thing: actual validation on a real optical test bed.
>>: Where is the test bed?
>> Mahesh Balakrishnan: The test bed is a combination of things. We set up a delaying layer on local clusters -- we essentially just put a lot of delay between two things. We were able to do that because the interconnect was big enough, but that's why we're limited to things like 1 Gbps; we can't run at 10 gigs because we just can't buffer and delay the packets long enough. The other test bed we tried to emulate in the lab, which is the same kind of story: it has enough capacity for us to do it, but only to a point. What we're doing right now -- and we already have it running to a degree -- is setting up some kind of clever routing loop, so we send packets out into TeraGrid, but it's not going to hurt anyone; they just boomerang back onto us. That may not be as real as we want, but it's one step in the right direction. There's a huge problem here, though: I think in academia we just don't have the resources to study these kinds of networks, and it's hard to just muscle into a working optical network, because it's an extremely expensive resource, and either the physicists or the chemists are doing real work on it, as opposed to computer science research.
>>: You could make the case to the chemists that their work will go 10 times faster on the network.
>> Mahesh Balakrishnan: Right -- incentivize them, absolutely. The problem there is you get stuck on the very specific thing that they want you to do. I guess it's a trade-off.
>>: Which grant is the money coming from?
(Laughter) (Applause)