>> Ratul Mahajan: It's my pleasure to welcome Mahesh Balakrishnan. He's a student from Cornell University, and he's going to talk about reliable communication in data centers.
>> Mahesh Balakrishnan: Thanks, Ratul. As he said, I'm Mahesh
Balakrishnan. And I'm from Cornell University. That's a very rare
picture of Cornell. You can see it's not snowing.
Today I'm going to be talking about reliable communication for data
centers. So commodity data centers emerged in the mid '90s. Internet service companies found they could scale to millions of users by using farms of inexpensive machines.
Now the technical insight behind that is simple: first-generation Internet services were embarrassingly parallel. Economically, though,
commodity data centers were revolutionary. They allowed companies to
acquire and maintain massive infrastructure on a shoestring budget.
Since then, practically everyone has jumped onto the data center bandwagon, and data centers have supplanted big-iron computing as the de facto platform for any kind of scalable application. However,
today's data centers are different from their predecessors in two ways.
Firstly, data centers no longer act in isolation. They interact with
other data centers that are geographically remote. The links between
these data centers are typically high speed optical networks.
Companies are deploying such networks because they give them immense value: they allow them to mirror data to remote locations for locality to clients, for disaster tolerance, for locality to outsourcing hubs. And this is increasingly what the network map of any modern corporation looks like.
Also data centers are now real time. Now real time is a loaded word
and it means different things to different people.
In one sense data centers are real time because you're running
applications traditionally considered real time. Domains such as
finance and aerospace which are traditionally real time are moving to
commodity platforms. However, in a completely different sense, we are
going towards a world where people are storing all data and
functionality within data centers.
If I open my laptop and I store some data on it, I write a blog post, I send an e-mail to someone, all of that is being funneled through a remote data center. The data center is the computer. It's almost an extension of the Sun slogan of the '90s, "the network is the computer."
The interesting corollary is that if, when I open my machine, all of what I do on it is being funneled into a remote data center, then I expect the same responsiveness from that remote datacenter that I do from my local machine. So datacenters are real time in this sense too. They have to be extremely responsive.
Now Gartner did a survey of datacenter operators where it asked: let's define a real-time infrastructure as a datacenter that recovers from failures in seconds or minutes. They asked people, do you think this is important, and do you have it? And those were the answers. Most people thought it was important. Most people did not have any kind of real-time recovery capability.
And that leads us to the big challenge, how do we build a real time
datacenter, one that can recover from faults and other disruptive
events within seconds. And this is a huge deal, right?
This is almost a grand challenge. It involves dealing with literally dozens of failure modes at every level of the software stack. And it's the backdrop for my Ph.D. thesis: I looked at various failure modes within datacenters and between them, and in fact built several systems that tackled different faults.
Now, this talk is going to be about mechanisms at the network level. If you want to build a system that recovers within seconds from failures, you need to provide it with network primitives that recover in milliseconds from bad stuff that happens in the network.
Specifically, we want to recover from packet loss in the network, and we want to recover extremely rapidly. Now, the insight here is that existing protocols are reactive. When something wrong happens, they react to it, and there's a cost of reaction and a latency of reaction. Both can often be very high.
Now, we're going to introduce proactive protocols, where you inject the overhead a priori and you get immediate recovery when anything goes wrong.
In this talk I'm going to be talking about two systems. One is called Maelstrom. It's a protocol for reliable communication between datacenters over these high-speed optical links, thousands of miles long. Second, I'm going to be talking about Ricochet, which is a reliable multi-cast protocol within datacenters.
So moving on to Maelstrom. Today the only option you have for reliable communication between datacenters is commodity TCP/IP. If you're running a commodity datacenter, you'll be running Windows or Linux. They have commodity stacks. So what happens when you open a connection between here and New York? Well, TCP has three problems. Problem number one is throughput collapse due to loss in the network. TCP was historically designed for short, congested links. It does not work well on long high-speed networks. It's egregious: if you lose 0.1 percent of your packets on the link, throughput is less than 10 Mbps on what is essentially a 40-gig link.
Now, this is a problem that's extremely well known in academia. It's been studied for 10 years. People have come up with all kinds of new protocols. I know you guys are building Compound TCP; Linux has its own variants. However, today the most commonly deployed protocol in the wild is still Reno, and that's a protocol over 10 years old.
Problem number two with TCP is that it requires really large buffers at the end host for high-rate traffic. Again, there are new variants that deal with this problem. But, again, the end host may not have enough memory capacity to buffer that much data. So right now the way that companies deal with it, they hire people to go around tuning TCP buffers.
Problem number three is that TCP intrinsically has a feedback loop in
it. When you lose a packet, the receiver goes back to the sender to
recover the lost packet.
So TCP behaves the same whether your datacenters are next to each other, on different continents, or on different planets: you always wait for that feedback loop.
So what do people do about this? Well, solution number one is you just sit and twiddle your thumbs and wait for the operating system companies to deploy new, better versions.
What do companies do today when they're still running on XP or some older variant of Windows or Linux? They do multiple things, and they tend to be quite ad hoc. They rewrite applications to use multiple flows instead of one. They go around tuning TCP buffers. But the most common solution to this problem is they throw money at it, or at least they try to.
You could pay an ISP a million dollars and the ISP will give you the
perfect network, the network that does not drop packets.
This is done today. ISPs give you a service level agreement that says
the network will not lose more than X percent of your traffic. If it
does we will compensate you financially. That's the model today.
Throw money at it. You get a perfect network. You stop worrying about what's wrong with TCP.
The question is, what if I don't have a million dollars? What if I want intercontinental connectivity and want it for $50,000 or whatever? We have one data point for such a network. It's called TeraGrid. It's a grid of supercomputers in the U.S.
It has multiple high-profile sites, including NCSA and SDSC. This is a fairly heavily used network, used on a day-to-day basis by e-science researchers. Notice that, firstly, the links between sites are 10 to 40 gigabits per second. They're not really loaded: over a three-month period we never saw link utilization of more than 70 percent. Secondly, the path between any two of these sites has multiple hops in it. This is a fairly standard kind of topology that's being deployed out there for these kinds of networks.
Now TeraGrid has loss, even though it's not congested. It has a monitoring framework where sites probe other sites with a blast of UDP traffic and note down the associated loss rate. We saw something extremely surprising. We found that 24% of all these measurements had a loss rate greater than 0.01 percent: enough to flatline any kind of TCP running on top of it.
We dug into this data and tried to find out what was going on. We found two reasons. Reason one for loss was that the path between two nodes in different data centers was cluttered with multiple active devices. These active devices would start dropping packets for no reason. We found faulty rate limiters and bad NICs that were dropping packets. And because these guys were not paying someone a million bucks to go and pull that card and replace it, you saw persistent loss.
In an academic setting, nobody cares if you're losing UDP packets; people are just using blast protocols, making do with blasting packets across the pipe, so they didn't care. Talking to people, we saw another source of loss on these networks was the fiber itself. That's not true for TeraGrid, but apparently a lot of the fiber in the ground in the U.S. has a certain spec. It's designed for 1 Gbps or 10 Gbps. When you try to run a laser that can multiplex 100 gigs on top of this, you see noise. That gives you a bit error rate, and the bit error rate translates into a packet loss rate. So we're in this world where you have essentially infinite bandwidth, but you have loss: a noisy, high-bandwidth medium. We want to run unmodified TCP/IP on top of these networks.
So how do we do this? Here's how. This shows the data path between two datacenters, and you have loss occurring in the middle. By the way, people don't ask questions in the first five or 10 minutes, but now it's open, you can -- go on.
>>: I have a question. So part of your motivation, it seems, is that ten megabits per second is a bad upper limit for a given TCP flow. Yet it seems like in each datacenter there are lots of hosts that want to communicate with hosts in other datacenters. In aggregate, there are many, many flows, each one of which is capped at 10 megabits per second. If that's the case, then can't one in aggregate saturate a 40 gigabit per second link?
>> Mahesh Balakrishnan: I think between two datacenters you're going to see all kinds of traffic. You'll have sessions with interactive traffic, file transfers with one big chunk of data, a thousand low-rate flows. Essentially, I don't want to think about it. I do not want to reason about what kind of traffic there is. I want one solution that fixes every problem that TCP has. I want to show you we can get to that.
>>: In the experiments, how do you make sure the loss was due to these two reasons and not to --
>> Mahesh Balakrishnan: Right now we are deploying essentially -- so there's a real network. And when we suggested we should have root access inside it, people were not very happy. So a lot of the experiments are on test beds where we inject loss artificially. Right now we're doing something interesting: we're using the cooperating sites on that network to inject a routing loop, so we can send traffic out and measure it. But that involves a lot of talking to network admins, and that's something we're doing now.
>>: My question: the SONET layer will give you errored seconds. Did you look at that?
>> Mahesh Balakrishnan: These are essentially gigabit Ethernet, pretty much.
In some sense that's part of the problem motivation. You could fix all the problems I'm talking about by fiddling around with the electronics on the network, changing the optics. We don't want to. I want the ability to just ask the ISP: give me bandwidth. I'm not asking for anything fancy. I don't want you to go around pulling the cards out and replacing them on the fly. Just give me the bandwidth.
>>: So low loss.
>> Mahesh Balakrishnan: It could be random. Could be low or high. We
want a solution that fixes all of that. With the one assumption that
bandwidth is not a limited resource. Maybe let me just describe the
solution, I think it becomes clearer.
So I'm going to drop a box between the datacenter and the wide-area link, and this box is a passive device, which is very important. If it fails, nothing goes wrong; you're still getting routing between the two datacenters. And it's completely transparent. End hosts are still running TCP, Reno, what have you. The network is not modified. It's what the ISP gave you. It could be dropping packets at varying rates with varying correlations. The way we do this is by using an almost time-honored technique: we'll use forward error correction. What's FEC? It's been around for decades, used in link-layer and satellite communications, in fact on CDs, for encoding against errors. And it's a simple idea: you have a communication channel, and the sender injects redundancy into it that the receiver can use to recover lost data.
This example shows one of the codes, one of the more common ones. For every five packets the sender sends out, it generates three error-correction packets and injects them into the link. The receiver can recover from the loss of any three packets. Now FEC is very nice because it has this property that recovery latency does not depend on the round-trip time. So it does not matter anymore if your datacenter is in Asia or anywhere else. The other nice property it has: at any given point you know how much of your network is overhead. You can tune it to say, for every five data packets generate one error-correction packet, and that's exactly how much overhead you have in the network. No reactive timers, no traffic going back for acknowledgments.
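To make the encoding concrete, here is a minimal sketch in Python of packet-level FEC with a single XOR parity per group. It is purely illustrative, not Maelstrom's code: one XOR repair recovers exactly one lost packet per group, whereas a code like the one in the example above recovers up to three. The group size and packet framing are assumptions.

```python
# Minimal sketch of packet-level FEC with one XOR parity per group.
# Illustrative only: a single XOR repair recovers exactly one lost
# packet per group; stronger codes like Reed-Solomon recover more.

K = 5  # data packets per FEC group (assumed)

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_repair(group):
    """XOR of all K equal-length data packets in the group."""
    repair = group[0]
    for pkt in group[1:]:
        repair = xor_bytes(repair, pkt)
    return repair

def recover(received, repair):
    """Rebuild a single missing packet from the XOR repair."""
    missing = [i for i in range(K) if i not in received]
    if len(missing) == 1:
        pkt = repair
        for data in received.values():
            pkt = xor_bytes(pkt, data)
        received[missing[0]] = pkt
    return received

# Packet 2 is lost in transit and rebuilt from the parity.
group = [bytes([i]) * 8 for i in range(K)]
arrived = {i: p for i, p in enumerate(group) if i != 2}
assert recover(arrived, make_repair(group))[2] == group[2]
```

Note how nothing here depends on the round-trip time: once the repair packet arrives, recovery is immediate, which is exactly the property being claimed.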
In the last 10 years people have proposed packet-level FEC; Microsoft proposed that. And this is a nice idea because it's inexpensive. You can just run it on your machine. You don't need extra hardware. However, two things have stopped packet-level FEC at the end host from being deployed. One is that you can't go around changing end-host stacks. That, in some sense, you can get around. The bigger, more fundamental problem is that you don't get this RTT-independence for free. You get it by trading off: now recovery latency depends not on the RTT but on the data rate in the channel. So if you're running FEC end to end between two machines in two different datacenters, it's going to be slow: you send a packet and wait, because you can't encode fast enough. You can see why.
In this example the sender does not create any error-correction traffic until it has enough data. So if the sender is sending a packet every second, then it has to wait for five seconds before generating any error-correction packet, and the receiver has to wait that much time before recovering any lost packet.
How are we going to get around that? We can't do FEC end to end, even though it's a great idea (end to end covers any possible kind of loss), because the data rates are not high enough machine to machine.
So what are we going to do? We're going to push FEC into the network, where we can actually aggregate traffic. Now this is a general idea, but the simplest instantiation of it is a box. You do FEC in a middle box. The middle box sits between the end hosts in the datacenter and the wide-area link, and does the FEC for you.
That's the simplest possible instantiation. Later in this talk I'll show you the same idea in a much more sophisticated implementation.
>>: Can we inject some packets if --
>> Mahesh Balakrishnan: Because you have thousands of channels in your datacenters. It may be that your machine is sending one packet, but the other machines could be sending blasts of packets.
>>: I understand what you're saying. The other choice, for a single host, if there's not enough traffic, is you pad it artificially.
>> Mahesh Balakrishnan: I understand that, but then you lose the property, which to me is very important, that your network link at any given point has X percent overhead.
>>: That's the problem you see anywhere, right?
>> Mahesh Balakrishnan: If you're aggregating in the box, a gigabit per second, you don't ever have to -- you know exactly how much you're sending into the network. That's a very nice property, because you can go to your ISP and say, I want X bandwidth, and you know exactly how much goodput you can send on the network.
>>: I have a question, which is just -- I understand your qualitative answer to Ray Dong's question. Do you also later have a quantitative one?
>> Mahesh Balakrishnan: Yeah. In terms of -- I don't understand the objection. If I have enough data rate, packets coming in, and I never do padding, then I know exactly how many repair packets I'm generating per data packet. I know exactly how much of my network is overhead due to the FEC.
>>: But on that --
>>: Can I express Ray Dong's objection in a different way? I think when you've got a host sending, if you were to do a graph of host traffic, we're looking at the tail, where you're going to be generating these extra packets. So you're taking a small amount of traffic and doubling it, which shouldn't really matter much. So there are two ways you can justify this: one is that it will work if you're aggregating them all together. But if you want to throw away the idea that it would work if machines are doing it individually, you need to show that taking that thin tail and doubling it really hurts you.
>> Mahesh Balakrishnan: I get your point now. You would have, I think, if you have 100 nodes in each datacenter and they're interacting in arbitrary patterns, the number of channels you have would be extremely high. So you would already lose this, potentially.
If I have five sessions open to different machines in a different datacenter, and everyone in my office does too, you get a problem where everyone is just waiting, or padding.
Maybe you don't buy that. We can probably take it offline. But that's my assumption: you could have a lot of low-rate channels, and I don't want to do padding.
>>: Do you have data to support that, that the data center does not have --
>> Mahesh Balakrishnan: I would love to get my hands on that data. But it belongs --
>>: It's the U.S., not -- they had sessions.
>> Mahesh Balakrishnan: I would just say we don't have that data, so we're trying to solve for any possible case, including not having enough traffic. I guess I'll just go on.
So exactly how does this work? On the top you have the send-side data center. On the bottom you have the receive-side data center. The send-side appliance is snooping traffic leaving the data center, creating what you can think of as XORs: for every five data packets it's creating an XOR and dumping it into the channel. The receive-side appliance picks up the XORs and uses them to recover lost packets. Note that the receive-side client, which would be on the other end of that arrow on the bottom, does not see any loss. It would see out-of-order arrival, though, and hence we need to recover packets extremely fast. It's important we recover packets immediately, otherwise TCP will perceive out-of-order arrivals.
So I said we're using XORs. Actually we're using something a little more sophisticated. What happens if you have bursty and correlated loss? Forward error correction has always had this other problem; the elephant in the room, in some sense, is that it's not really good at handling burst loss. If you lose 10 packets in a row, all bets are off. For traditional FEC encodings, the recovery latency you get from a code depends on the maximum burst size it can tolerate. The way they handle this, they interleave encodings. They split one big stream into a lot of little streams, encode separately over these streams, and hence get a proportional increase in burst tolerance. But it means you have to wait longer and longer for recovery to happen.
Because now you're encoding over low-rate channels, and remember I said recovery latency depends on the data rate in the channel. We came up with a new code that has graceful degradation. The goal is: if I lose a single packet, I want to recover it immediately. If I lose 10 in a row, I want to recover them in 10 milliseconds. If I lose 100 in a row, I want them recovered in 100 milliseconds. The way we do this is fairly simple. We just create XORs at different interleaves. We create one layer of XORs from consecutive data packets, another layer from every 10th packet, a third layer from every 100th, and so on. That means for constant overhead, given an FEC code, we get a graceful degradation in latency that no other FEC code has. We don't get it for free; we trade off some of the recovery power of the code. So this compares our code to Reed-Solomon, which is the canonical FEC code. You can see we lose a little bit in recovery power, but not that much. In return we get much better latency properties.
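To make the layered scheme concrete, here is a minimal sketch in Python. The group size and the interleave list follow the description above (consecutive, every 10th, every 100th) but are illustrative assumptions, not Maelstrom's actual parameters.

```python
# Sketch of layered interleaving: layer 0 XORs consecutive packets,
# layer 1 XORs every 10th, layer 2 every 100th. K and the interleave
# list are illustrative assumptions, not Maelstrom's real parameters.

K = 5                       # data packets folded into each XOR
INTERLEAVES = [1, 10, 100]  # stride of each repair layer

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class Layer:
    def __init__(self, stride):
        self.stride = stride
        self.acc = [(b"", 0)] * stride  # (xor so far, count) per sub-stream

    def add(self, seq, pkt):
        """Fold a data packet into its sub-stream; emit a repair when full."""
        i = seq % self.stride
        xored, n = self.acc[i]
        xored = pkt if n == 0 else xor_bytes(xored, pkt)
        if n + 1 == K:
            self.acc[i] = (b"", 0)
            return xored  # repair covering K packets spaced `stride` apart
        self.acc[i] = (xored, n + 1)
        return None

layers = [Layer(s) for s in INTERLEAVES]
for seq in range(1000):
    pkt = seq.to_bytes(8, "big")  # stand-in for a sniffed data packet
    for layer in layers:
        repair = layer.add(seq, pkt)
        # a real appliance would inject `repair` onto the link here
```

A burst of 10 consecutive losses lands in 10 different sub-streams of the stride-10 layer, so each lost packet is the only loss in its XOR group, which is where the graceful degradation comes from.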
>>: (Inaudible).
>> Mahesh Balakrishnan: So I spent a lot of time convincing people of this. If they're not convinced -- look, if you have a better code, I've given you a way to deploy it. So either they agree with me, or if they don't, I say: you think you have a better code, deploy it.
>>: (Inaudible).
>> Mahesh Balakrishnan: It has constant overhead. That's the key for me. A lot of the newer codes do not give you this property that you get exactly this overhead. For example, the codes Michael was mentioning at NSDI, and it's true for something recent called Growth Codes: the argument there is that you change the overhead to get burst tolerance. We don't want to do that, because I want the guarantee that a fixed percentage of my outgoing bandwidth is pure overhead. I don't want to go around tweaking that.
That's one set of things to optimize against.
>>: Why do you want the guarantee that it will not exceed the network capacity?
>> Mahesh Balakrishnan: The thing there is that a lot of protocols that failed, failed in some sense because they were reactive: something happens on the network and they react to it. And this is true for things like multi-cast protocols, where there are broadcast storms, and then you have NAK storms and stuff like that. So we were trying to come up with a protocol that is essentially the most passive thing on earth. It doesn't react; it just injects this much overhead. Hence, you can predictively plan around it. That may not be a goal that you think is important, but that's kind of the assumption behind this.
So Maelstrom operates at the IP level. It can do two things with TCP. It can ignore it, in which case it's just working at the IP level, TCP end-to-end semantics are maintained, and it's a passive device: if nothing goes wrong, you never notice it. That's the version I've talked about. However, that doesn't solve all the problems of TCP. It eliminates loss; it doesn't handle the buffering problem. So to solve that problem we can also use the Maelstrom appliance in the critical path, where we make it an active device. It intercepts and breaks the TCP flow control.
And that's nice for applications that want to handle buffering outside the end host. But it does turn it into an active device.
So we implemented this, and we were able to build a prototype that runs on a commodity box at a gigabit per second, limited by the outbound link. With a 10-gig NIC, we'd probably be able to keep it around 3 Gbps (phonetic). So how does this work? Here is a graph that shows what happens to TCP when loss occurs in the network. On top, you have TCP without loss, and we're stretching the link out. On the bottom you have TCP with loss. You can see it just collapses completely when you have a loss rate. Now we introduce Maelstrom in the middle. We get three lines that are indistinguishable at different loss rates, and they're very close to TCP without loss.
>>: How much were you running?
>> Mahesh Balakrishnan: I think we were running somewhere between 10 and 20 flows in parallel. So we did expect that the loss would be spread across a number of flows, not one.
>>: (Inaudible).
>> Mahesh Balakrishnan: But the effective observed loss per flow is still high enough to completely flatline it.
We found these results with Reno, but we also ran it on springer (phonetic). Very surprisingly, we got almost the same curve. We're trying to figure out exactly why that is. Maybe it's just an artifact of the loss model.
>>: (Inaudible) TCP and they claim they do it much better than other --
>> Mahesh Balakrishnan: So most of the new protocols either use a different control law or they use delay; I think Compound TCP does both. In either case, I don't want to comment on the newer work, because I have not run against every protocol out there. It's possible that for this loss model some of them would behave much better. But the strongest claim I can make is: for the most commonly deployed protocols out there, this does happen, and here you can fix it without playing around with the end hosts. I do believe that.
So you can see that at 25 milliseconds we don't track TCP. That's because we're limited by the outbound NIC. We have around 650 Mbps of goodput.
>>: Is that an analytical calculation or --
>> Mahesh Balakrishnan: Actual evaluation on a test bed. It's not analytical.
>>: Loss is being injected?
>> Mahesh Balakrishnan: Loss is being injected. Let's see. We're doing
it artificially at the network level. Turns out to not be very
complicated. We just drop the packet.
>>: These are random or these are --
>> Mahesh Balakrishnan: These are random. We inject at both sites.
>>: A couple of slides ago you mentioned this proxying mode. Does that mean we have to change the --
>> Mahesh Balakrishnan: It's very standard. People have been building performance-enhancing proxies. And the idea is, if I try to send a packet from here, there's a box that intercepts it, breaks the connection, and does the buffering for me.
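For illustration, here is a minimal sketch in Python of that split-connection idea: a proxy that terminates the local connection, opens its own connection across the wide-area link, and takes over the buffering. The addresses, port, and buffer size are hypothetical; a real performance-enhancing proxy would live in the kernel or on an appliance.

```python
# Minimal sketch of a split-connection proxy: terminate the end host's
# TCP connection locally and relay over a second connection across the
# WAN, so the proxy, not the end host, holds the big buffers.
# REMOTE and the buffer size are hypothetical values for illustration.
import socket
import threading

REMOTE = ("peer-proxy.example.net", 9000)  # hypothetical far-side proxy
BUF = 1 << 20                              # 1 MB reads

def relay(src, dst):
    """Copy bytes one way until the source side closes."""
    while True:
        data = src.recv(BUF)
        if not data:
            dst.close()
            return
        dst.sendall(data)  # the proxy, not the end host, absorbs WAN delay

def serve(port=8000):
    srv = socket.socket()
    srv.bind(("", port))
    srv.listen()
    while True:
        local, _ = srv.accept()  # connection from a local end host
        wan = socket.create_connection(REMOTE)
        threading.Thread(target=relay, args=(local, wan), daemon=True).start()
        threading.Thread(target=relay, args=(wan, local), daemon=True).start()
```

The trade-off discussed above is visible here: the proxy is now in the critical path, so its failure breaks connections, which is exactly what makes it an active device.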
So the other interesting thing in this graph is that as you pull the link out, your throughput is going down even when no loss occurs.
>>: The gap between 50 and 900, and the closing of that gap is?
>> Mahesh Balakrishnan: So the box can handle a gigabit per second of traffic going out, and we're adding FEC. So FEC plus data is around a gigabit per second; we're adding maybe 25 percent overhead in these runs. If we had a bigger NIC we'd be running at a gig. But obviously the moment you hit the bandwidth ceiling, you will have this problem that puts us lower. So throughput is going down with the length of the link. That's the problem with buffering that I talked about. As you stretch the link out you need bigger buffers at the end host. If you don't change the buffers, this is the phenomenon you observe. If you change Maelstrom into an active device, it's now in the critical path and things can go wrong, but we essentially get this property that the throughput is independent of the link length. The third claim is that Maelstrom handles the delivery latency problem for TCP.
On the left you can see that when you have a loss rate, TCP has these massive delivery spikes. And one reason for the spikes is that the receiver buffers all incoming data, because TCP supports in-order delivery: if I lose a packet, the receiver has to buffer all incoming packets until the missing one is recovered.
That's one source of (inaudible). The other sources are congestion control kicking in, and the buffering aspect. On the right you can see that we eliminate all loss. That's a real observation. More interestingly, you can see that for normal data packets we're not adding any latency, and when we do lose a data packet, we're recovering it extremely fast. There's almost no difference between recovered packets and normal ones, maybe a few milliseconds.
How does the code work? This is the layered interleaving code I talked about. We have these histograms of packet recovery latency. On the left, we show that if you have completely random loss, most packets are recovered almost immediately. You just have one big bar at the beginning. As we make losses burstier and burstier -- here's a point where you're losing 20 packets in a row, 40 packets in a row -- you can see the histogram shifting to the right.
So we're getting this property that the recovery latency depends on the burst length.
Before I go on to the next part of this talk I'll give you an opportunity to ask questions, but first I want to mention other work we're doing on top of this. The reason we started looking at this problem of inter-datacenter communication was that we were approached by financial institutions that needed mirroring solutions for disaster tolerance. The state of the art right now is that banks in New York place mirrors in New Jersey for disaster tolerance. That's as far as they can go. What they ideally want to do is place them somewhere in Kentucky. And the reason they can't is that there are only two ways you can do mirroring right now. You can go right to the file system: the file system sends the data to the mirror and waits for an acknowledgement. It's extremely safe and extremely slow. The other option is you just return to the application immediately and assume that the packet gets there. Now we're exploring a middle ground where we add so much redundancy into the network that we can push the reliability level of a piece of data to the point where it is as safe as if we wrote it to a local disk. So we built a file system that does this. Essentially it returns to the user once the probability of the data being lost in the network is low enough. So I'm going to move on to the Ricochet protocol. But do you have any questions at this point?
>>: Is there really much purpose in handling large burst losses? Because at those lengths of recovery time, TCP is going to notice.
>> Mahesh Balakrishnan: You're right. For small bursts we can make the loss transparent to TCP; beyond that we can't. But the argument -- one of the nice things about the work is that it works at the IP level, transparent to everything above it. So arguably it still has value.
>>: Seems like you do have a semantics difference between waiting for it or not. The network (inaudible) is going to save it.
>> Mahesh Balakrishnan: It's a model, you're right. It works only if the reason the packet doesn't get across is loss in the network, not a disconnect or something. You're right.
>>: Or a failure, for instance, if you're active.
>> Mahesh Balakrishnan: The model, I do agree with that. But we think it provides value to people who don't need complete safety but don't want to go all the way the other way either.
>> Mahesh Balakrishnan: So that was Maelstrom. It's reliability between data centers. Now I'm going to show you that the same kind of technique, using forward error correction in interesting ways, can have a lot of value within data centers as well.
So what's the connection? Well, in the inter-datacenter case, one of the main reasons we didn't want to use TCP was that you can't have a feedback loop from the receiver going back to the sender; the length of the link is simply too high. Hence, we wanted this property that all communication is just one way. Now, within the data center the RTTs are not very high, but the standard mode of communication is multi-cast.
And if you have a lot of receivers for multi-cast, you again have the same problem: the receivers are not very far away, but there are just too many of them. So if you wrote an acknowledgment-based protocol, which has a very tight feedback loop, then for every packet the sender sends out it's going to be swamped by incoming acknowledgments. So you can't have a feedback loop in either case. And we're going to use the same kind of techniques in this setting now.
But before I talk about that: is multi-cast used in datacenters, why is it used, and how is it used?
Within a datacenter, data is often sprayed across multiple nodes, partitioned, replicated. People use a whole number of communication paradigms to do this. They use pub-sub, and you can imagine caching and invalidation groups. The example I'm going to give you is from a financial setting, where you have nodes that are subscribed to portfolios of equities.
So you have a node that's tracking a number of stocks. And the way
that it's done today is each equity is mapped to a group, a multi-cast
group. So if I'm interested in an equity, I join the multi-cast group
and I get the update for that group.
Now, this means that each node, if it's interested in hundreds of
equities, it's going to belong to many different groups, which means
that the granularity of a single group is going to be fine, which means
that the data rate in a single group could be fairly low. You're not
going to get a thousand updates for the Microsoft stock. You're going
to get one every other second or something like that.
However, even though a node is in a lot of low-rate groups, it can easily get overloaded. It's tracking 100 equities, maybe 1,000. If there's any kind of traffic spike, the node gets overloaded. Multi-cast protocols are UDP-based, and hence nodes can get overloaded and drop packets at the buffer. When we looked at this two to three years back, the dominant model was for nodes to be very thin: you have thin blades on a very fast network. They can easily get overwhelmed.
>>: (Inaudible).
>> Mahesh Balakrishnan: There are reliability layers on top of multi-cast, but the reliability layer, at least with current technologies, is always at the application level. You're right.
So what happens when nodes get overloaded? They drop packets. We found it's ridiculously easy to overload a blade server. On top you have a blade server that's getting more data than it can handle. On the bottom, you have one that's getting less data than it can handle. They're in the same groups, but the one on top is in an extra group, and they're behind the same switching segment. What this graph is saying is that, A, loss is occurring at the end host; it's independent across nodes; it's not in the network. And it's bursty. It's happening because of kernel buffer overflows.
So how do we deal with this problem? Reliable multi-cast is a very
well-studied problem. It's been around for years, and it's been
solved. Scalable multi-cast is a solved problem. However, when people
say "scalability," they mean scalability in the number of receivers.
There's been all this work that looks at how to scale multi-cast to
hundreds, thousands, millions of receivers.
In a datacenter, though, you want multi-cast where you can have lots of receivers, sure, but also lots of senders to a group. What if a node is aggregating data from many different senders? That's something nobody's looked at before. Also, you want to support a lot of overlapping groups; each node is in hundreds of groups.
So I want to support a system where multi-cast is used at very fine granularity, where you can use it at the item level or object level.
So how do we do this today? Well, there are only a certain number of protocols you can write to solve this problem. Number one is you use acknowledgments: you use TCP-style reliability for multi-cast. This does not work because you get ACK implosion. I already mentioned that.
Number two is you use negative acknowledgments, and this is how all commercial multi-cast technology is built. There's a protocol called SRM from '95 and '96 (inaudible), and that's been translated by companies, including Microsoft, into better protocols that essentially use negative acknowledgments. Now, I'll come back to what the problem is with this in a minute.
Number three is you can just inject FEC at the sender. The sender creates XORs and injects them into the channel, and nodes recover from loss. Now, the problem here is that if you have hundreds or thousands of low-rate groups, you get the same issue: you can't use padding, and FEC doesn't work very well, because latency depends on the data rate in the channel. And this artifact applies to negative acknowledgment protocols as well: the receiver does not know it's dropped a packet until it receives the next packet from the same sender to the same group.
In a setting where you have hundreds of senders and hundreds of groups, this could be seconds. So what do we do? We're going to apply the same technique we did in the last set of work. We'll move FEC to a location where we can aggregate it over high-rate data channels. We're going to have receivers generate XORs, and we're going to have receivers pass around these XORs, which can then be used to recover lost data. Remember, I showed you that loss is occurring at the end host and is independent across end hosts; hence, this actually works. The receiver on top misses data. It gets an XOR with that data in it, and it can recover the missing packet.
How exactly does this work? Each receiver is generating XORs over incoming data packets. It creates an XOR, picks C other nodes in the system, and sends the XOR to them. Now, this gives you essentially a fixed forward error correction rate: you know that your system has X percent overhead.
And it has an interesting property: you've just removed the sender from the reliability protocol, and hence you don't care anymore if you have one sender or a thousand senders. You're completely oblivious to the number of senders to the group. It's a completely receiver-driven protocol. So we have property number two. We already had scalability in the number of receivers; that's the easy one. Now we have scalability in the number of senders to the group.
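A minimal sketch of this receiver-driven scheme, with assumed constants (five data packets per XOR, three targets per XOR); the peer list and the send path are placeholders, not Ricochet's actual interface:

```python
# Sketch of receiver-driven repair: each receiver XORs the data packets
# it gets and unicasts the XOR to C random peers. K, C, the peer list,
# and the send path are illustrative assumptions.
import random

K = 5  # data packets per XOR
C = 3  # peers that receive each XOR

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class Receiver:
    def __init__(self, peers):
        self.peers = peers  # other group members
        self.pending = []   # packets awaiting the next XOR

    def on_data(self, pkt):
        self.pending.append(pkt)
        if len(self.pending) == K:
            repair = self.pending[0]
            for p in self.pending[1:]:
                repair = xor_bytes(repair, p)
            # note: the sender plays no part in this exchange
            for peer in random.sample(self.peers, C):
                self.send(peer, repair)
            self.pending.clear()

    def send(self, peer, repair):
        pass  # stand-in for a UDP send in a real implementation
```

The sender never appears in this loop, which is the sense in which the protocol is oblivious to the number of senders.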
What about the next step? Can we get scalability in the number of groups in the system? Well, here's how we do it. This shows how the protocol I described on the last slide works. You have two receivers in two different groups, sending each other XORs. So receiver R1 sends R2 two XORs, one from data in group A and the other from data in group B, and it does these independently, running two instances of the last protocol. But we can do better, right? We can aggregate data across groups. R1 can send R2 an XOR that combines data in groups A and B, and recovery happens much, much faster.
Now, this works in the simple case. What if you have hundreds of overlapping multi-cast groups in arbitrary patterns? Are there any questions at this point? This is how we handle it. A receiver belongs to multiple groups. Say, in this case, N1 belongs to three groups, A, B and C. And it essentially decomposes these groups into regions of disjoint overlap.
Now, this looks unscalable, but it's not. The reason is, A, nodes are not interested in groups they're not part of, so if you ran a traditional group-based protocol, it would have exactly the same overhead. And B, the group membership service, whether it's gossip-based or centralized, is the standard membership service that you're using for everything else. All the disjoint-region decomposition is occurring at the end host. Another point to be made: it looks like there are an exponential number of regions, but the number is bounded by the number of nodes that N1 shares a group with.
So in reality you won't have an exponential number of regions.
Now what do we do with this? N1 does all this magic and breaks groups into regions. Now, remember that the basic operation in the receiver-based, single-group protocol I showed you was that each receiver generates an XOR from incoming data packets, selects C random nodes from the group, and sends the XOR to them. In this case it's picking five nodes from the group and sending the XOR to them. Now we want to compose the XOR we send to a receiver based on the groups we share with it. If I'm in three different groups with the receiver, I want to aggregate data across those three groups. And if I know what the regions of disjoint overlap are, then I know that if I'm sending an XOR to a node in region ABC, I should aggregate data from A, B and C. And if I'm sending it to a node that's only in A, not interested in B and C, I combine data only from group A. So we select targets for the XOR not from groups but from regions. This allows us to selectively compose each XOR at the fastest rate possible.
Now, this means that we get scalability in the number of groups, because within each intersection we're running the FEC protocol as fast as we can. The definition of a channel here is now an intersection of multiple groups that you share with some other node.
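Here is a minimal sketch of the decomposition and the per-region target selection, assuming a membership map from group names to node sets; the constants, names, and proportional-sampling rule are illustrative assumptions, not Ricochet's actual interface:

```python
# Sketch of region decomposition and per-region sampling: break a
# node's groups into regions of disjoint overlap, then draw XOR targets
# from each region in proportion to its size.
import random
from collections import defaultdict

def regions_for(me, membership):
    """Map each region (a frozenset of my groups) to its nodes."""
    my_groups = {g for g, nodes in membership.items() if me in nodes}
    region_nodes = defaultdict(list)
    for node in set().union(*membership.values()) - {me}:
        region = frozenset(g for g in my_groups if node in membership[g])
        if region:  # skip nodes sharing no group with me
            region_nodes[region].append(node)
    # region count is bounded by the number of nodes I share groups with
    return region_nodes

def pick_targets(region_nodes, c=5):
    """Sample about c targets overall, proportionally per region."""
    total = sum(len(v) for v in region_nodes.values())
    targets = {}
    for region, nodes in region_nodes.items():
        k = max(1, round(c * len(nodes) / total))
        targets[region] = random.sample(nodes, min(k, len(nodes)))
    return targets  # each XOR is composed only from its region's groups

membership = {"A": {"n1", "n2", "n3"}, "B": {"n1", "n2"}, "C": {"n1", "n4"}}
print(pick_targets(regions_for("n1", membership)))
```

Because the regions are built from actual peers, the bound mentioned above falls out directly: you can never have more regions than nodes you share a group with.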
How does this work?
>>: I didn't understand that last slide.
>> Mahesh Balakrishnan: Okay. So the basic operation of receiver-based FEC is: I get a bunch of data in a single group, I create an XOR from it, and I pick other nodes from the group. So if we are all in one big group, I pick five of you and send you an XOR for data in group A.
Now, the point is, if you and I are in five different groups together, I could have sent you an XOR with data from all those groups. So to enable that, I first have to know which groups I share with you. So in this case let's say you're in intersection A/B, which means you are interested in data from groups A and B but not group C. Now, instead of picking five random nodes from the group, I'm going to pick random nodes from these intersections.
So I'm going to pick you with a proportional fraction of five, based on the size of the region compared to the size of the group. It's a kind of regional sampling. I'm dividing this room into smaller chunks and picking smaller fractions of the sample set from each. It's like sampling within regions instead of groups. So it's like --
>>: What defines the group membership?
>> Mahesh Balakrishnan: The group membership, exactly: the commonality in the groups I share with you.
Go on.
>>: What are the black dots and the green dots?
>> Mahesh Balakrishnan: The green dots are the nodes I happen to
select. So the five of them. The black dots are the nodes that didn't
get selected. I want this kind of behavior where I'm just picking five
green dots from group A, but I want to do it on a per-region basis.
So instead of picking five guys from group A I'm picking one guy from
A/B and one guy from -- and it all adds up to five.
>>: The green dots, nodes or packets?
>> Mahesh Balakrishnan: They're nodes. They're nodes.
>>: Do you do this action every time?
>> Mahesh Balakrishnan: It's a little complicated. In the first
version of the protocol I create an XR and I select. In this version I
select and then I create the XR based on who I selected.
>>: (Inaudible) reselect the nodes?
>> Mahesh Balakrishnan: At the granularity of a single XOR. So if I know 25 percent of my traffic is XORs, there's a certain number of XORs I'm creating, and based on that, for each XOR, I end up selecting a node.
>>: Does membership churn (inaudible) performance?
>> Mahesh Balakrishnan: It would, except in a datacenter you don't have that much churn; that's kind of an assumption. It would still work, because it's gossip-based in some sense. If you had churn, there would be some kind of lag. It would still work, just not as well.
>>: Why do you only do one packet -- create one XOR and send it five times?
>> Mahesh Balakrishnan: That's what we do in the initial case. We can't do it here because different nodes need different XORs.
>>: You select targets for the XORs, but you don't send them the same --
>> Mahesh Balakrishnan: The nature of the XOR depends on the target, because we want to compose the XORs faster based on the target. If I'm in 10 different groups with you, I want to create an XOR from those 10 groups.
>>: Then why are you selecting all the targets as part of a single step? Right? Why not -- it sounds like it's iterative.
>> Mahesh Balakrishnan: In practice it ends up -- this is how I'm
showing it but in practice we're treating each of the regions
separately and we are doing what you're saying.
>>: So it does decompose into each round I select a new node who is my
target.
>> Mahesh Balakrishnan: Very much. Yes.
>>: Amounts to random selection.
>> Mahesh Balakrishnan: Exactly. Amounts to random selection except
you're doing it from regions.
So how does this work? Well, on the X axis we have the number of groups each node is in. Of the two graphs, the one on the left shows the percentage of packets we're successfully recovering with this mechanism, and the one on the right shows the latency we're recovering them with. Now, the point of this graph is that as you make the groups finer and finer, as you take your system and decompose it into finer and finer groups, your performance is not affected.
For every other multi-cast protocol out there, including the ones developed by Microsoft, which are essentially the industry standard, the latency shoots up as you refine the groups. To the extent that we actually ran it and found that with 128 groups it's 400 times slower. And this is one reason, and I'll have a slide later where I talk about this, why multi-cast hasn't caught on in datacenters: the reliability mechanisms have been so incredibly broken. They were designed for the wide area. They don't work well in datacenters.
So this shows that we get scalability in the number of groups; you can tolerate a lot of low-rate groups. And this shows the recovery histogram.
So at different loss rates we're recovering most traffic in the initial segment using the forward error correction. Because we don't have TCP/IP running on top, this is the final layer, so we need a negative acknowledgment layer to catch the fraction of packets that FEC can't recover. The bump in the middle is this extra reactive layer. Now what does this mean? It means that you can say, I want 20% overhead in my system, and that will be all you get for most packets. Only when the loss rate is extremely high do you have slight reactive traffic.
And we tried it with bursty losses. We found that because we have so much diversity in the encoding, it's extremely resilient to bursty losses. We can lose 100 packets in a row and still recover most of them at the same average latency.
So one of the questions I get when I present this work is, hold on,
nobody uses multi-cast in datacenters. As of 2008, that's perfectly
true. Nobody does. Microsoft, I'm pretty sure, doesn't. And I know
that other companies don't.
And there are three major reasons why. One is that the reliable multi-cast mechanisms used were incredibly bad. They brought systems down. Between 2000 and 2003 -- in 2000 practically every company that had a datacenter, mostly financial institutions, was using IP multi-cast. And then there was a spate of essentially brownouts and blackouts, and people started phasing multi-cast out.
There's a long list of companies that just kicked IP multi-cast out of their datacenters. One reason was that the reliability mechanisms were broken, and we fixed that; we think we fixed that. But there were other issues. One is that if you have 1,000 groups in your datacenter, your routers and your NICs are not able to handle it. They can handle 100 groups. Beyond that, everyone gets all traffic, essentially.
And the other problem is that multi-cast is intrinsically dangerous. If you're all in a group, I can join the group, start sending data at my highest rate, and bring the entire system down. So there seem to be fundamental reasons why no one uses multi-cast. And the insight is this: you want logical multi-cast capability, because everyone uses that within datacenters, but you want the ability to control it. So you have a layer of indirection in the kernel where you convert logical IP multi-cast addresses into a set of unicast addresses, or maybe one unicast address and a couple of IP multi-cast addresses, and so on.
So if you know that your routing system can handle, say, 100 IP multi-cast addresses, you use them for the high-rate traffic. And that's something we're working on right now.
>>: I didn't understand that slide either.
>> Mahesh Balakrishnan: Okay. We're talking --
>>: I got the first part. I didn't understand the solution. How is it that now multi-cast will not be such a dangerous thing to use?
>> Mahesh Balakrishnan: Here's what I'm going to do. The application still sees an IP multi-cast group. It joins it, sends to it, all that stuff. We're going to intercept that, and we're going to convert it into unicast in the kernel.
Now, if you do that, you automatically get one thing: your routers are not unscalable anymore, right? You're just using unicast. The network is not seeing any IP multi-cast.
Now, the next thing we can do is we can have IP multi-cast in the network; we just can't have thousands of groups. We can have maybe 100. So if there are logical groups in the system that have a lot of traffic, we use the scarce IP multi-cast addresses for those groups, and everyone else is using unicast. So that's one step.
And that kind of solves the routing problem, the scalability problem, to some extent. Now, if you have a kernel layer that's doing this, you can also have things like intelligent admission control. You can mandate who gets to send to what multi-cast group, so I can't just join a group and start spamming it. You have the opportunity now to add a security layer.
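A minimal sketch of what such an indirection layer might look like, under the assumptions above (roughly 100 usable physical addresses, an admission list per group). The system was still being built at the time of the talk, so every name and number here is illustrative:

```python
# Sketch of a kernel-style indirection layer: logical multi-cast groups
# map either to one of the scarce physical IP multi-cast addresses (hot
# groups) or to plain unicast fan-out, with an admission-control hook.

PHYSICAL_ADDRS = ["239.0.0.%d" % i for i in range(1, 101)]  # ~100 usable

class MulticastMapper:
    def __init__(self):
        self.members = {}  # logical group -> set of unicast addresses
        self.hot = {}      # logical group -> physical multicast address
        self.allowed = {}  # logical group -> set of admitted senders

    def join(self, group, addr):
        self.members.setdefault(group, set()).add(addr)

    def promote(self, group):
        """Give a high-rate logical group a real IP multi-cast address."""
        if PHYSICAL_ADDRS and group not in self.hot:
            self.hot[group] = PHYSICAL_ADDRS.pop()

    def send(self, group, sender, payload):
        admitted = self.allowed.get(group)
        if admitted is not None and sender not in admitted:
            raise PermissionError("sender not admitted to this group")
        if group in self.hot:
            self._udp_send(self.hot[group], payload)   # one physical send
        else:
            for addr in self.members.get(group, ()):   # unicast fan-out
                self._udp_send(addr, payload)

    def _udp_send(self, addr, payload):
        pass  # stand-in for the actual kernel send path
```

The application-facing interface is unchanged; only the mapping below it decides between the scarce physical addresses and unicast fan-out.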
>>: I find this very plausible, but it feels like you're giving up the efficiency benefits of doing the multi-cast in the network layer, which makes me think: why would you stop here rather than moving further, towards application-layer multi-cast?
>> Mahesh Balakrishnan: I think in a datacenter, the main sources of latency and overhead are in the end host. For example, I don't think overlays are a good idea, because for most applications, if you have multiple hops in the datacenter you're already in trouble: you go into the stack, you come out of the stack, massive latency.
The other point would be that a major source of the overhead here is, why do you need IP multi-cast? Why can't you just do multiple sends? Because every time you do, you're going into the kernel. If you can get rid of that, that's already a huge step in the right direction. So it's not really network efficiency, it's end-host efficiency that we're looking at.
>>: Let me make sure I understand what you're saying. You're saying that for multi-cast groups that are not on a physical multi-cast instance, there's some sort of kernel layer, or bus or something, where you're doing multiple sends somehow?
>> Mahesh Balakrishnan: Yes, exactly. The kernel is doing multiple unicast sends.
>>: A finer point on John's question. As long as you've addressed all the efficiency problems -- now we're not taking the packet all the way up to user space and sending it N times, we have a nice kernel-level bus -- why not use it for all the multi-cast traffic?
>> Mahesh Balakrishnan: You mean eliminate IP multi-cast entirely?
>>: Right. If I understand correctly, you already have this mechanism in place for doing essentially non-network multi-cast.
>> Mahesh Balakrishnan: Right. We could do that. I don't know; I haven't built the system yet. I have a feeling that at the end of the day, if you have extremely popular groups, there would be value in having an IP multi-cast mechanism that's even faster than the kernel multi-send. But I don't have a feel for the numbers. But you have a point. Maybe it would be pointless going there.
>>: Maybe my question is, you've done such a great job, why aren't you
using it for everything?
>> Mahesh Balakrishnan: Well, I'll invent it and then I'll measure it
and then I'll know.
>>: The key question is where the break-even point is, the crossover point where the efficiency you get from having actual multi-cast is so much greater that it's worth using the real physical resource of the IP multi-cast group. At that break point you should be able to see it in the data.
>> Mahesh Balakrishnan: That's pretty much true.
So that's the technical part of this talk. I've shown you a number of things. Most of them have this feel to them: they tackle a particular protocol or failure mode to solve a problem. And, frankly, at the end of the day it's a lot of duct tape for data centers: find performance problems where people are using retrofitted protocols within datacenters, and just fix them.
But, really, you know, we've had 10 years of experience with datacenters or more. Can't we come up with new abstractions? That sounds like a dangerous thing to say, because distributed systems have this long history of failed abstractions. Do we really want to come up with more? But I think there's something fundamental here. We are not building the bridge before people want to cross; we're seeing all the people swim across and figuring out that we need a bridge.
We need new abstractions. To some extent the community has caught on, and they are building new abstractions. Some of it is happening at Microsoft with Dryad. But people are dealing with a very specific datacenter application: how do you do data mining or database-style functionality over very large data sets? Can't we go one step further?
And here's what I think it should look like. This is something I first saw in a paper by Jim Gray, with a whole number of co-authors from Microsoft. And the idea was that there is a canonical paradigm for building systems for datacenters: partition the service across multiple nodes and you get scalability; replicate it and you get fault tolerance and reliability. This is a very simple picture. All the people I've talked to, every service I've seen running within a datacenter, is built exactly this way.
And each time they rebuild it from scratch. They take multiple operating systems running on individual nodes. They write a load balancing layer. They build intelligent partitioning functionality into the load balancing layer. Why can't we build an operating system that just takes care of this?
Now, the phrase "the datacenter is the computer": I thought I came up with it, but I saw it in an article by David Patterson like three months back in the Communications of the ACM. And he's a computer architect. He says, can we come up with a new instruction set for datacenters; what would an add be? Well, I don't think we need to go that far. But can we look at things like what a process is, what a thread is, what protection boundaries are, what a socket looks like, and what an invocation into a service is, now that your service is spread out across dozens of machines? So can we come up with new abstractions? That essentially is my goal over the next few years.
I believe that now we have enough experience with datacenters as a
community to actually look at these problems and come up with new
abstractions and that they will be used.
And if we do that, all these things we can do. Then you have concerns like power and privacy. For example, if the datacenter OS knew what a partition was, it could mandate that data does not flow across partitions. And then you could give third-party developers an environment that automatically manages privacy.
That's just one example. But I think you build out abstractions, you
can then hide a lot of complexity behind the abstractions.
So to conclude: I presented a picture of a real-time datacenter that has to recover from disruptive events within seconds, and I talked about things you can do at the network level to enable this. Specifically, network-level protocols that recover from lost packets in milliseconds. I think they're the tip of the iceberg. I think there's a long distance to go before we get anywhere near this goal. Thank you.
(Applause).
>>: Any more questions? Code availability?
>> Mahesh Balakrishnan: These are -- Ricochet is up for download. In fact, we rewrote it in C -- there are two versions -- and we're porting it into the Red Hat stack. Red Hat is building something called Qpid (phonetic), a clustering platform they're building right now. We're trying to get them to use Ricochet as the clustering layer, and for that we needed to do a complete rewrite. It's up on SourceForge, actually.
>>: I'm more interested in Maelstrom.
>> Mahesh Balakrishnan: Sorry?
>>: What about Maelstrom? That's what I'm more interested in. Ricochet not so much.
>> Mahesh Balakrishnan: Maelstrom -- can you download it? Yes. It's a kernel module. I don't know if there's a public link right now, but if you send me an e-mail I'll be happy to send you the source.
>>: Make us believe that your test bed is actually something we should trust, that the results we are getting --
>> Mahesh Balakrishnan: You should believe it. That's a question I get a lot, which is why right now, at least for the wide-area work, we're doing our best to actually set up a real optical network test bed.
It's a lot of work, because these are fairly expensive resources and they don't want computer science researchers near them. They're owned typically by the scientific community. But we're very close. If you see a journal version of this paper, that's exactly what it will add to this whole thing: actual validation on an optical test bed.
>>: Where is the test bed?
>> Mahesh Balakrishnan: The test bed is a combination of things. We set up a delay layer on local clusters. So we have a bunch of -- I mean, we essentially just put a lot of delay between two things. We were able to do that because the interconnect was big enough. But that's why we're limited to things like 1 Gbps; we can't run at 10 gigs, because we just can't buffer and delay the packets long enough. The other test bed we tried to emulate in the lab, which is the same kind of story: it has enough capacity for us to do it, but only to a point. What we're doing right now, and we already have it running to a degree, is setting up some kind of a clever routing grid. So we send packets out into TeraGrid -- it's not going to hurt anyone -- and they just boomerang back to us. That may not be as real as we want, but it's one step in the right direction. But there's a huge problem here.
I think in academia we just don't have the resources to study these kinds of networks. And it's hard to just muscle in on a working optical network, because it's an extremely expensive resource, and either the physicists or the chemists are doing real work on it, as opposed to computer science research.
>>: You can make a future business of making the chemists go 10 times faster on the network.
>> Mahesh Balakrishnan: Right. Incentivize them, absolutely. The problem there is you get stuck on the very specific thing that they want you to do. I guess it's a trade-off.
>>: Which grant is the money coming from?
(Laughter)
(Applause)