Jitu Padhye: All right. It's my pleasure to welcome Costin. Costin works at the
University of Bucharest and he has done a number of things, most important of
which is multipath TCP. This work has won a number of awards. It's getting deployed pretty heavily in Europe, and hopefully it will arrive here soon.
>> Costin Raiciu: Thanks. Okay. This work started [inaudible] my PhD and is joint work with a bunch of people. I've bolded the ones that are most important to this work, but there are many, many others who contributed.
So all right. The context is that networks are really becoming multipath these days. We've got mobile devices that have multiple wireless interfaces, each of them with different coverage properties, different energy consumption, different throughputs.
Data centers, you know, I'm showing this now, they've got many servers. Within any given rack you've got servers connected to the top-of-rack switch, then you have many racks, and then a redundant topology connecting these racks together, so between any two servers there are many paths that you could use to communicate.
Finally online providers such as Microsoft, Google, whatever, they multi-home in
the Internet to get better redundancy, better performance, and so forth. So -- and
then the client can connect to it by using any of these network connections.
So the networks are multipath, but we still use TCP today. And, I mean, 90 percent of applications use TCP, so it's basically almost a monopoly. And the reason people use TCP is that it offers this very nice byte-stream abstraction, you know, reliable delivery, and it also matches offered load to the capacity of the network. And that's quite nice.
The trouble with TCP is that it's fundamentally single-path, right? It was designed to bind the connection to two IP addresses. If either of those addresses changes, then the connection goes down. Even in the network you cannot really take a TCP connection and place it on multiple paths, because the ends get really confused and the performance drops significantly.
So there's a fundamental mismatch between the multipath networks and the
single-path transport, and this creates problems. And let me just show you two
instances of problems that it creates.
So here's me commuting from -- sorry, that's not the Microsoft phone -- commuting from home to work, listening to my Internet radio. And I'm using my 3G connection, and then as I get to work there's WiFi. So I would like my phone to switch to using WiFi because, well, it's cheaper economically, it's cheaper from an energy point of view, and so forth.
Now, the trouble is that today when I do this switch all of my connections will die, and that's not nice. Another problem is in data centers. So I'm showing here a FatTree data center where the servers are at the bottom. Each rack has only two servers; this is just for illustrative purposes. And then in this topology, if the red servers want to communicate they can choose any of the four paths they have available. And let's say they choose the path through the third topmost switch. They could have chosen any of the other paths.
And normally the way this is done is by using a protocol called ECMP, equal-cost multipath, that will literally place each flow randomly on one of the available paths.
Now, let's say that the black servers want to communicate too. They will also choose a random path, and there's a probability that these two paths will collide. And if you have a collision the effect is pretty bad: each of the connections will get half of the throughput it should be getting, while there is idle capacity elsewhere in the network.
And this happens in a network that is provisioned to behave like a full
non-blocking switch, right? So the theory is that each of these connections
should be getting one gigabit, and in this example they are not.
And you might be wondering, well, yeah, but this is a contrived example, does this really matter? So we set up a very simple experiment. The experiment is this: every server chooses a single other server at random to send to, so each server has one outgoing connection and a single incoming connection. We call this a permutation.
So we did the simulation, and here are the results. On the X axis I'm showing the flow IDs ranked by throughput, and on the Y axis I have the throughput. The theory says we should be getting one gigabit for all of these flows. The practice is rather different, okay? On average, flows get something like 400 megabits. There are some flows that are lucky and get close to one gigabit, but there are some flows that are really unlucky and get 100 megabits. And this is because of the collisions.
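As a rough illustration of why random placement alone hurts, here is a toy model of my own, not the simulator used for the results above: hash each flow onto one of a few equal-capacity paths and split each path's capacity evenly among the flows that land on it. Even this crude model, with made-up parameters, shows a large gap between lucky and unlucky flows.

```python
import random

def toy_permutation(num_flows=8, num_paths=8, link_gbps=1.0, trials=20000):
    """Hash each flow onto a random path; split each path's capacity evenly."""
    rates = []
    for _ in range(trials):
        choice = [random.randrange(num_paths) for _ in range(num_flows)]
        load = [choice.count(p) for p in range(num_paths)]   # flows per path
        rates.extend(link_gbps / load[c] for c in choice)    # fair share on chosen path
    rates.sort()
    return sum(rates) / len(rates), rates[len(rates) // 100]  # mean and ~1st percentile

if __name__ == "__main__":
    mean, unlucky = toy_permutation()
    print(f"mean ~{mean:.2f} Gbit/s, unlucky flows ~{unlucky:.2f} Gbit/s")
```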
All right. So as you might have guessed from the title of this talk, the solution to
all of these problems is multipath TCP. And what it is is really just an evolution of TCP that allows us to use multiple paths within a single transport connection. Before I go on and tell you what we did and what multipath TCP looks like and so forth, let me just briefly discuss some related work.
So I've shown you a few point problems. The first impulse we have when we see point problems is to come up with point solutions. For instance, we can solve mobility differently, right? You can go about it at a different layer by using Mobile IP, which was standardized a while ago but didn't get any traction. And you could solve it at the application layer, for instance by using HTTP range requests and changing your application to deal with mobility. The problem with changing applications is that it's a lot of effort. If every application needs to have mobility built in, then that's a lot of complexity to put in the application.
At the IP level you don't have the context of the transport, so it turns out that the best level to do this is the transport layer. And there have been two proposals to this end. [inaudible] proposed Migrate TCP, which is basically what it says: it can just move a connection from one IP address to the other. And SCTP has this capability too.
Now, all of these come with the mind-set that mobility should be about fast handover: when I have a new interface I should quickly hand over to that interface and I'm good, right? What you'll see later is that we think a better solution is to do something like a slow handover. When you have multiple interfaces you should use them all the time, because their performance might fluctuate quite dramatically, and if you do a fast handover you might suffer.
Okay. Now, you can solve data center collisions differently, for instance by having a centralized controller that knows all of your switches and all of your flows. If you know all of your flows then you can compute a placement of flows to paths that doesn't have collisions and then implement it. And this is what Hedera proposes. The problem with this is that it doesn't really scale that well. You need a really tight control loop to do this in real time and get the benefit.
And finally, okay, multipath TCP is not a new idea. It has been proposed at least half a dozen times, all right? I think the first one to propose it was Christian Huitema in 1995 at the IETF; I think he was with Microsoft. And there have been, as I said, half a dozen proposals ever since. So what's different in this work from the previous work?
Two things, really. First, the context. You know, in the past four or five years we've really seen multipath networks: smartphones took off and data centers took off. The second is our goal. We set out from the beginning to have a deployable multipath TCP. What we really want is to take existing applications and run them over this new protocol, but the applications should not be changed at all. They shouldn't even be recompiled, right? They should speak the same socket interface to the stack, and the stack should somehow do multipath underneath, okay?
Now, that's all fine. We also want it to work over today's networks. If your network has to change, that's a show stopper, right? Nobody's going to change the routers just because you have a new protocol. So we really want to be able to run through today's networks.
And finally, if there's a path where TCP would work, we want this new protocol to
work also because if it didn't then users will complain, you know, the Internet has
gone down, what's happening here, right?
So these are very sensible goals. But it turns out it's not -- yeah?
>>: I'm curious. One of the [inaudible] mobile scenario that you're describing,
one of the goals that would be nice would be to deploy these new versions of the
TCP without having to change both sides, so only changing one side.
>> Costin Raiciu: Yeah, that's also feasible. Actually, when we started out this effort it was within an EU project called Trilogy, and one of the partners started it exactly with this idea that you would only change a single end. It turns out that you're quite limited in what you can do if you change a single end. And that thing died for some reason. I mean, we can take --
>>: [inaudible].
>> Costin Raiciu: It's a hard problem. I mean, you're really very limited in what you can do.
Okay. So, you know, fast forward. We have a Linux TCP implementation that implements the current protocol draft. The protocol draft is in the final phase of standardization at the IETF, so it should be an RFC soon. So this thing is already happening. It supports legacy apps, it works over today's networks, and we've tested it, right? But, you know, I'm getting a bit ahead of myself.
So here's multipath TCP. There are a lot of components to it. I don't have time to talk about them all, so I'll just leave out these two parts: flow control and encoding of control information. If you want, we can talk about them afterwards.
Okay. So when we started off this work, neither I nor, I think, Mark Handley realized how difficult it would be. He came in 2007 and said, I think we should do multipath TCP, it's like a good idea. And, you know, the reason he started this work is that in theory there are a few results teaching us how to do congestion control, and that was like the big sort of breakthrough. But to be able to do the congestion control you really need a vehicle to carry the bytes. That's the protocol, really. So he said, why don't you think about the protocol. In the beginning I thought, well, this will be just one month, I'll get it done, and then I'll move on to the interesting stuff like the congestion control.
Well, it turns out it took five years. And the reason it took five years is not because I was stupid, but because we were designing in an Internet architecture that nobody really understands. So here's what we teach our students.
You know, you have this very nice protocol layering. You've got link-layer addresses that are visible only on one link. On top of that, IP works end to end and routers forward packets based on the destination addresses. On top of that, TCP does reliable retransmission and, you know, ensures [inaudible], and finally the applications use this interface. That's really nice, except it's mostly fiction, okay?
The reason it is fiction is because of middleboxes. And I'll use this pirate skull to denote middleboxes. So how big is the problem anyway? We did a study that we presented at IMC last year to understand exactly what the scope of the problem is.
So here's the IP header at the top and the TCP header at the bottom. The theory says that as packets go through the Internet two fields should change, the TTL and the header checksum, and that's it. Okay? Now, we know that there are NATs, right? Those change the source IP and the source port for outgoing packets and the destination IP and the destination port for incoming packets. So we know that these get changed, okay?
Now, there are firewalls that randomize the initial sequence number, because some stacks are weak and choose a predictable sequence number. What this means is that on outgoing packets the sequence numbers get changed, and on incoming packets the ACKs get changed, right? So these two fields also get changed in the Internet.
Now, it turns out that for any particular field you find in here, there will be middleboxes out there deployed that change it, right? So the real picture looks something like this. And you'll be glad to see there's some white space in there. I only left it white because it contrasts nicely. You know, it's not true; those fields get changed too. [laughter].
So in this context when you're designing a new protocol you have to be really
defensive, because otherwise it can really bite you back.
>>: [inaudible].
>> Costin Raiciu: Everyone -- I mean, middleboxes will set it to zero. Even Linux will set it to zero when it comes in, right? So you really don't want to put any urgent data in there and expect it to get pushed. Someone decided at some point that it was a security risk and just decided to, you know, zero the urgent pointer.
All right. Let's start with what the protocol looks like. So a multipath TCP
connection starts like a regular TCP connection with a SYN packet except it
carries a new option that's called MP capable. And this option also carries a
unique connection identifier designated by the one that's sending the SYN.
Now, the logic at the passive opener -- the server in this case -- is: if the SYN has MP capable and I'm doing multipath, then I'll just enable multipath TCP for this connection. So what it does is it gets the SYN, replies with the SYN/ACK with an MP capable option and its own local unique identifier for this connection, and it says okay, now we're enabled. We're good.
The client, the active opener, has the same logic: if the SYN/ACK has MP capable then I will enable multipath TCP, and then I send the third ACK. And what I've shown here is your regular way of negotiating a new TCP option, right? All the TCP options we know, this is how they're negotiated, all right?
And after you've done this, then what you have is a subflow set up within a
multipath TCP connection. So both ends will basically know that this subflow is
part of a connection and they will have a local identifier that's unique for that
connection.
Now, at any point in time, both ends can add new subflows to this connection,
right? The subflow can come from the same IP address or a different IP
address. It doesn't matter. The only requirement is that the five-tuples of the different subflows differ, so at least the port number should differ across different subflows.
Okay. So now if I want to add another subflow, I send again a SYN, like a regular TCP, except the option in the SYN is different. It's called join. And this tells the server that this is an existing multipath connection we're adding a subflow to, not a new connection. The join will contain the server's unique identifier for this connection, and this allows the server to demultiplex this request and attach it to the right connection.
Now, these unique identifiers are actually used for security purposes, but I will not cover this in this talk. Okay.
So the SYN/ACK comes back with a join, then the third ACK, and finally we have a new subflow. And this can go on as many times as you want. You can FIN subflows if you need to. Subflows can die and just time out, and it's not a big deal. The connection keeps going as long as there is a single subflow that has not timed out yet. Okay? When the last subflow has timed out, the connection finally dies. Or when you do an explicit teardown. All right. Now, that was pretty easy, right?
But it was too easy. The reality is a bit different. So say we're in this case where
the SYN has come in and now we're sending back the SYN/ACK with the
multipath capable and the server thinks we have enabled multipath for this
connection.
Now, we can have a nice middlebox in there. Our study at IMC shows that 6% of access networks will actually remove options they don't know. And if you look at port 80, 14% of these networks will remove unknown options.
So if it happens that on the way forward the option got through but on the way
back the path was asymmetric and it didn't get through, then what you get is the
SYN/ACK coming without the option, and now the client thinks multipath is
disabled for this connection and the server thinks it's enabled.
And what happens after this is really complicated. I will not go through this. But
it's just not good. Okay? So you want to fix this? And the fix is pretty obvious.
You know, what you want is in the third ACK if you enable multipath TCP, you
want to carry some multipath specific option telling the server that, yes, I did get
multipath TCP enabled.
So the logic changes at the server. It says: if the SYN has MP capable and the third ACK has this multipath-specific option, then you enable multipath TCP. In this particular case the third ACK will not contain the multipath option, so the two endpoints are in agreement: we're not doing multipath TCP. Okay?
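As a minimal sketch of that revised negotiation logic, here is some pseudocode-style Python of my own, not the Linux implementation; the packet and connection objects and the helper for picking a connection identifier are made up, and the option name follows the talk.

```python
import secrets
from dataclasses import dataclass, field

@dataclass
class Packet:
    options: dict = field(default_factory=dict)

@dataclass
class Connection:
    offered_mptcp: bool = False
    mode: str = "tcp"

def new_local_connection_id() -> int:
    return secrets.randbits(32)          # stand-in for the local connection identifier

def server_on_syn(conn: Connection, syn: Packet, server_supports_mptcp: bool) -> dict:
    """Decide which options go into the SYN/ACK; nothing is committed yet."""
    if server_supports_mptcp and "MP_CAPABLE" in syn.options:
        conn.offered_mptcp = True
        return {"MP_CAPABLE": new_local_connection_id()}    # tentatively offer multipath
    return {}                                               # plain TCP SYN/ACK

def server_on_third_ack(conn: Connection, ack: Packet) -> None:
    """Enable multipath only if the third ACK proves the options survived both ways."""
    if conn.offered_mptcp and "MP_CAPABLE" in ack.options:
        conn.mode = "mptcp"     # both directions carried the option
    else:
        conn.mode = "tcp"       # some middlebox stripped it: silently fall back to TCP
```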
So this really shows what you need to do if you want your connection not to break in today's Internet, right? To achieve goal three, which was that if TCP works through a connection then multipath TCP should work, you really need to fall back to TCP if something goes wrong.
In this particular case we fall back to TCP in the negotiation itself. But multipath TCP can fall back to TCP at any time during the lifetime of the connection. So if at some point the multipath options don't get through anymore because, say, the path changed, the multipath connection does fall back to TCP. And from that point on it doesn't go back to multipath; it just stays TCP, but it doesn't break the connection, okay?
All right. So the lesson is: it used to be that we negotiated new protocol options between two endpoints. Nowadays you're negotiating between two endpoints and an unknown number of intermediaries, and unless you take this into account, the negotiation will actually fail. And this applies not only to multipath but to any extension to TCP you might want to do.
Okay. So now we have multiple subflows. How do we actually send data on these subflows, right? Let's start with a primer on TCP. I mean, you all know this, but it's just to contrast with what you do with multipath. So TCP gives sequence numbers to every byte and then places segments on the wire. In this example I'm showing packet sequence numbers for simplicity, but, you know, there's no difference in concept.
So, okay, the sequence numbers help the receiver, first of all, pass data in order to the application, right? And also detect if there are holes and so forth. This allows the receiver to implement the TCP contract, which is reliable in-order byte delivery. As the packets come in, the receiver generates ACKs, and the ACKs tell the sender, yes, the packet got there. And if it didn't, then you can just retransmit and so forth, right? I mean, we all know this. Okay. So the easiest way we can think of implementing multipath TCP is just taking all of these segments that TCP creates and placing them on different paths. That's like the straw man design, right? That's what everyone thinks of when you think of doing multipath. You know, we put a segment on the top path, we put a segment -- oh, yeah, middleboxes -- we put a segment on the bottom path and so forth. Okay? Now, what you will see is that this path only sees segments 2 and 4 and this path only sees segments 1 and 3. Okay? So they see something that looks like a TCP connection with holes in it, right?
Now, the forward segments will get there fine; that's not a big deal. It's when the ACKs come back that things start to get messy, okay? So ACK 1 will be generated and then ACK 2 will be generated, right? The problem is that this path has not seen segment 1, because it only saw 2 and 4, all right? So now it sees ACK 2, which cumulatively ACKs both 1 and 2, and it will be upset. All right.
It turns out that a third of the paths we measured in our IMC paper will actually correct these ACKs or drop them or reset the connection. One of these three actions will happen, right? So in this particular case let's say it corrects the ACK. What will happen is it will correct it to ACK 0, because that's the cumulative ACK for what it has seen. And on the top path ACK 3 will be corrected to ACK 1. Okay?
So although all the segments got to the receiver, the sender is not aware of this
because the paths are correcting the ACKs, right? And this [inaudible] we're
stuck now. We don't know how to make progress. Okay. So what does work?
We really need a sequence space for each subflow to make each subflow look like a TCP connection over the wire with no gaps, okay? And if we have that sequence number, then we can use it to do retransmissions and to detect losses, right? But we still have the problem that different paths will have different delays, and you get reordering, right? So to deal with reordering you need a separate sequence space at the connection level, right, and this will be used by the receiver to put packets back in order. We also need a data acknowledgement for this sequence space at the connection level and so forth.
All right. So this is what the multipath TCP packet header looks like. The ports and the sequence numbers relate to the subflow of the multipath TCP connection, and this makes the subflow look like a regular TCP connection on the wire. Except there are some options that belong to multipath TCP that current middleboxes don't understand, and these options allow the receiver to reorder data. And multipath TCP has a single receive window per connection, and this is coded relative to the data ACK that's carried as an option here. I will not go deeper into flow control, as I said.
Okay. So here's how it works. I again want to send segment one, but this will have data sequence number one, right? And when I map it onto the red path it will receive a subflow sequence number, in this case 100, okay? Now, I send the second segment here; it will receive a subflow sequence number on the blue path. And finally I send the third segment there. As you see, the subflow sequence numbers are increasing continuously, but the data sequence numbers -- it doesn't matter, they can have holes, okay?
Now, what will happen if this path fails and this packet never gets there? Well, what will happen is that on this subflow we will get a timeout. And when there is a timeout we reinject all the segments that are outstanding on this path onto the other paths that are working. Okay? So here's what happens. As you see, we have the same data sequence number, but it's not on the red subflow. And in this way multipath TCP can make progress.
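Here is a rough sender-side sketch of the two sequence spaces and of the reinjection-on-timeout behavior just described. The class and field names are made up, and this is a simplification, not the real protocol machinery.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    data_seq: int                        # connection-level (data) sequence number
    subflow_seq: int                     # subflow-level sequence number
    payload: bytes

@dataclass
class Subflow:
    name: str
    next_seq: int                        # subflow sequence space stays gap-free on the wire
    alive: bool = True
    outstanding: List[Segment] = field(default_factory=list)

class MptcpSender:
    def __init__(self, subflows: List[Subflow]):
        self.subflows = subflows
        self.next_data_seq = 1

    def send(self, payload: bytes, subflow: Subflow) -> Segment:
        seg = Segment(self.next_data_seq, subflow.next_seq, payload)
        self.next_data_seq += 1          # shared data sequence space
        subflow.next_seq += 1            # per-subflow sequence space, no holes
        subflow.outstanding.append(seg)
        return seg

    def on_timeout(self, dead: Subflow) -> None:
        """Reinject everything outstanding on the dead subflow onto a working one."""
        dead.alive = False
        alive = next(sf for sf in self.subflows if sf.alive)   # assumes one is still up
        for seg in dead.outstanding:
            # same data sequence number, fresh subflow sequence number on the new path
            reinjected = Segment(seg.data_seq, alive.next_seq, seg.payload)
            alive.next_seq += 1
            alive.outstanding.append(reinjected)
        dead.outstanding.clear()
```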
>>: [inaudible].
>> Costin Raiciu: Well, the ACKs are -- so the regular ACKs just ACK the subflow sequence numbers, and then you have a cumulative ACK for the data. And so --
>>: [inaudible].
>> Costin Raiciu: Yes. Yes. Yes.
>>: All right.
>> Costin Raiciu: Okay. So we started out with what looked like an incredibly open design space for the protocol. It turns out that in practice, because of a lot of constraints, there was not much room for maneuver in the decisions we took. So it turns out that a lot of the decisions were more or less forced, right? Anyone starting from the same goals in the same Internet would end up with pretty much the one possible solution. There was not much room for maneuver.
>>: [inaudible] one option is to just build it out of a bunch of ordinary TCP connections and put your extra stuff, instead of in options, just in the TCP data.
>> Costin Raiciu: Absolutely. So --
>>: [inaudible].
>> Costin Raiciu: So I did not cover that part about encoding. It turns out at the IETF we had like a six-month-long discussion about where to encode control information. You can put it in options or in the data. Okay? It turns out -- so there's a chain of logical implications. First of all, you need the data ACK. If you don't have a data ACK, you cannot do flow control properly. Then, if you put the data ACK in the payload -- I was just chatting with Jitu about this earlier -- you can get deadlocks. Because the payload is subject to flow control, okay, and congestion control. But flow control is the problem.
So what will happen is now if -- I should probably not do this, but I can
[inaudible].
>>: [inaudible].
>> Costin Raiciu: Okay. I can show you some slides that show the deadlock.
>>: [inaudible].
>> Costin Raiciu: All right. Okay. So you can get a deadlock basically. That's --
>>: [inaudible].
>> Costin Raiciu: Now, that's it for the protocol itself, right? Let's switch to congestion control, okay? And I'll start off with a very high-level slide. Now, this is back in the '60s. We used to have circuit-switched networks, and what this means is whenever you have a connection you have to set up a circuit through the network. And I'm showing here a single link and two flows, you know, the blue flow and the red flow.
Now, the problem with this picture, as you are probably well aware, is that this flow is bursty. It could use all the capacity, but it's not allowed to by the very static reservation of capacity on this link. And the same here. So the result with circuit-switched networks is that you get great isolation but very poor utilization, all right? Now, fast forward ten years to the '70s. With packet-switched networks you can do this very nicely. What we're really doing with packet switching is pooling these circuits together to get greater utilization.
Now, fast forward 40 more years to today. This is what we have today: we have two separate links in the network with flows, and this looks strikingly similar to the circuit image before, right? So the next step is obviously this one, where you take multiple separate links and pool them together such that each flow can burst and use underutilized capacity elsewhere in the network. Okay?
And this is really what multipath tries to do, right? The problem is that when you go from here to here you've just lost isolation, so you need somehow to manage the sharing of capacity. And that's what TCP does: TCP congestion control really decides who gets what. The question is, in the multipath context, how do you split the capacity across the flows? What is the equivalent of TCP congestion control in the multipath context? And this is multipath congestion control.
Now, as I said, when we started this work, this was the very interesting research question that sort of drove the whole project. And it turns out that the answer to this question came when we defined the goals we want to achieve. Once we had the goals, which with hindsight look very, very obvious, it was not very difficult to design a congestion control that achieves those goals. So here are the goals.
The first goal is the most obvious one. If you have a single multipath TCP connection with many subflows sharing a bottleneck link with TCP, then you don't want multipath TCP to beat up TCP. You want it to get the same capacity as TCP, right? I mean, if you tell people, I will give you multipath TCP and it will just kill TCP throughput, they will say okay, we're not deploying this. This is obvious.
>>: [inaudible] it's more than your share.
>>: That's right. [inaudible] [laughter].
>> Costin Raiciu: Well, not the TC -- you know.
>>: [inaudible] TCP connection is it?
>> Costin Raiciu: Sorry?
>>: It would just be like opening more than one TCP connection and trying to get --
>> Costin Raiciu: Exactly. People say that's not fair, right? I mean, don't go to the IETF telling them that we're going to beat up TCP, because they're not like [inaudible]. If you go to a specific company and say your products will be beating up other products, then that's a different way of thinking, but okay.
So the second goal is more subtle. The second goal is that we should use the paths we have efficiently. And here's what I mean. In this particular case I have three links, each with the same capacity of 12 megabits, and a single multipath flow that has two subflows. One of them has a single-hop path and the other has a two-hop path.
In this particular case it's obvious that this flow should get 24 megabits. It should use both paths and get 24 megabits. Now, to make this more interesting, I'll add another flow here, again with a 1-hop path and a 2-hop path. And finally I'll add another flow here for symmetry, okay? So you have all of these flows that have a 1-hop path and a 2-hop path, and the network is pretty well utilized, okay?
So the question is, let's say these guys here are sending; they have a choice of how much traffic to put on each path. They can decide, I put X on this and X on that, or 2X on this and less on that, and so forth.
So how should they split the traffic? It turns out that the way they split the traffic really affects the total throughput of the network, okay? If they split it equally, each of them putting equal weight on both paths, each gets eight megabits, and this is completely counterintuitive. Why are they getting 8 megabits? They should be getting 12 megabits, no? Because the capacity in the network is 36 megabits; if you divide by 3 it should be 12.
Well, the reason they're getting 8 megabits is that they're putting a lot of traffic on these 2-hop paths, which are less efficient and basically use two resources instead of one. On the 1-hop path you use one resource; on the 2-hop path you use two resources. Now, if you put more weight on the shorter paths, for instance 2 to 1, you get 9 megabits. At 4 to 1 you get 10 megabits. At infinity to 1 you get 12 megabits. Okay? So in this particular case the optimal allocation of bandwidth is: you should push all of your traffic onto the direct path and no traffic onto the other paths, okay?
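A quick check of those numbers, under the symmetric setup on the slide: each 12-megabit link carries one 1-hop subflow and two 2-hop subflows, so with a w-to-1 split each link carries w*x + 2x = 12 and each flow gets (w+1)*x in total. The short sketch below just evaluates that.

```python
# Three 12 Mbit/s links, three flows, each with a 1-hop and a 2-hop subflow.
def per_flow_throughput(weight, capacity=12.0):
    x = capacity / (weight + 2)          # solve w*x + 2*x = capacity on each link
    return weight * x + x                # flow total = 1-hop rate + 2-hop rate

for w in (1, 2, 4, 1000):
    print(f"{w}:1 split -> {per_flow_throughput(w):.1f} Mbit/s per flow")
# prints roughly 8, 9, 10, and ~12 Mbit/s, matching the numbers above
```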
Now, okay, does this really mean anything? Well, first of all, the theory says there is a way to get to this outcome just by doing congestion control. The theory says this: each sender should look at the congestion it sees on each of these paths, and it should push traffic away from the more congested paths. That's what theory says. And, I mean, there are proofs that this can be done in a stable way; it will not oscillate. So in this particular case what we see is -- let's look at this guy. This guy will have a loss rate of, say, P on this path, okay? But on this path it will see a loss rate of P plus P; it will see a much higher loss rate. And that's why it pushes all of the traffic away from the higher-loss-rate path, right?
And the same goes for this one. In the example I showed before, where the flow is alone, there is no higher loss rate on the 2-hop path, so in that case you get the benefit, okay?
So basically the theory says you should do congestion control that pushes traffic away from the congested paths, and that's all very nice. The problem with that is, well, here's a real network. We have a 3G path with very low loss and high RTT: it gets a loss every five minutes, right, and it has an RTT of up to five seconds depending on what carrier you use. And here's a WiFi path that gets a loss every second or even more often, but it actually gives you 10 megabits compared to one megabit, all right? So if I just take the goal I set before -- always prefer your lower-loss paths -- that means where you have 3G and WiFi, always use 3G and put no traffic on WiFi. That's clearly a bad way to do it, because people will say, yeah, but WiFi was giving me 10 megabits and you're giving me 1 megabit and you're saying it's much better for the network. You know, I don't buy that.
So you really need a counterbalance for that design principle, which basically says: in any given configuration, if multipath TCP is giving you less throughput than TCP on the best path, then you're doing something wrong. So really the goal here would be, look, in this case you should get at least 10 megabits per second, okay? However you get the throughput -- maybe you push more here -- I should get at least 10 megabits, because otherwise nobody will deploy this thing. All right?
So how do we achieve all of this? Well, this is TCP congestion control, and we all know this. I'm just showing it to contrast with what multipath TCP does, okay? So TCP maintains a congestion window for each connection, and as it is getting ACKs back, for each ACK it gets it increases the window by 1 over W. Basically this means that in one RTT it increases the congestion window by one packet, okay? And then if it gets a drop, it just halves the window. And this is pretty simple; I mean, everyone knows this. Now, multipath TCP does this. You have a congestion window for each path, all right? When you get a loss on path r you just halve the congestion window of that particular path, all right? So it's exactly the same as TCP here. The difference is the increase part. This is where all the smarts of multipath TCP are.
So what I have here is the sum of the windows across all of the paths, and you can look at this as a constant for now, right? So let's say I have two paths: a path with window 1 and another path with window 100. This would be 101 here, all right? Now, the path with window 1 will increase on every RTT by this much. The path with window 100 will increase by 100 times this much; it will increase much more, okay? So what this does is really push the increase toward the path that has the better loss rate, the bigger congestion window. All right?
So it actually does this linearly. If you have two subflows that have the same congestion window, they will increase by the same amount. All right? Now, if you say this alpha is 1 and you consider that the RTTs are the same for all the subflows, you realize that in aggregate multipath TCP increases by one packet per RTT across all of its subflows, all right? And this gives you the fairness to TCP: if all of my subflows are going through the same bottleneck, then I will be fair to TCP. So, okay, to get goal 2, that is, move traffic away from congestion, this is what I'm doing; this helps me get goal 2. To get goals 1 and 3, what we are doing is dynamically adjusting this value of alpha to do the following thing.
So multipath TCP has loss rate and RTT estimates for each of its paths. Once I have those I can just plug them into an equation and ask what TCP would get on this path, and I get some throughput. And then what I basically do is change this alpha such that in aggregate multipath TCP gets the same throughput. That's the idea.
Okay. So, I mean, this is the mechanism. It's much easier to see what the emergent behavior is. Okay. And this is the real formula, but I'm not going to discuss it.
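To make the mechanism concrete, here is a hedged sketch in the spirit of the coupled rule described above: the decrease is plain TCP per subflow, the increase is shared through the sum of the windows, and alpha is the knob that is tuned so the aggregate matches one TCP flow on the best path. The alpha update shown is a placeholder illustration, not the published formula.

```python
class CoupledCC:
    def __init__(self, num_subflows):
        self.cwnd = [1.0] * num_subflows   # one congestion window per subflow (packets)
        self.alpha = 1.0                   # tuned so the aggregate matches one TCP flow

    def on_ack(self, r):
        total = sum(self.cwnd)
        # Coupled increase: per ACK on subflow r, grow by at most alpha/total,
        # and never by more than the 1/cwnd[r] a regular TCP would add.
        self.cwnd[r] += min(self.alpha / total, 1.0 / self.cwnd[r])

    def on_loss(self, r):
        # Decrease is exactly TCP's: halve only the window of the path that lost.
        self.cwnd[r] = max(self.cwnd[r] / 2.0, 1.0)

    def update_alpha(self, best_tcp_estimate, aggregate_estimate):
        # Placeholder for the real computation: nudge alpha so the aggregate
        # MPTCP throughput tracks what TCP would get on the best path.
        if aggregate_estimate > 0:
            self.alpha *= best_tcp_estimate / aggregate_estimate
```

With alpha = 1 and equal RTTs, the per-RTT increase summed over subflows is roughly one packet, which is the fairness property mentioned above; subflows with larger windows (better loss rates) take most of that increase.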
All right. So let's say we have a web server with two 100-megabit links. And it happens that two clients are using the top link and four clients are using the bottom link, so clearly the bottom link is much more congested. Now, a multipath TCP connection comes in, using both links. Because the bottom link is much more congested, it will push all of its traffic to the top link, right, and push very little traffic through here, mostly just to probe the capacity of this link. All right? And the flow throughputs are 33 megabits for the flows using the top link, including the multipath flow, and 25 for the bottom flows.
Now, if I add another multipath TCP connection, again it pushes all of its traffic to the top link and leaves no traffic on the bottom link. And in this case you'll see that all of the flows in the setup have exactly the same throughput. Okay? If I add one more connection, I start pushing on the bottom path too, right? If I only pushed on the top path, then I would have more congestion there, so basically I have to balance out the congestion. So what is really happening is that multipath TCP is trying to equalize congestion between these two links, okay? And the net effect is that these two links start behaving like a single higher-capacity link, with their capacity shared between all the flows, okay? So if I keep adding flows, then that's what happens.
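For concreteness, here is a small idealized computation of this two-link example, using my own simplification: 100-megabit links, 2 TCP flows pinned to the top link, 4 pinned to the bottom, and n multipath flows that either crowd onto the less congested top link or, once that stops paying off, equalize the per-flow share across both links as one 200-megabit pool.

```python
def shares(mptcp_flows, top_tcp=2, bottom_tcp=4, capacity=100.0):
    """Idealized per-flow throughputs (Mbit/s): top TCP, bottom TCP, MPTCP."""
    pooled = 2 * capacity / (top_tcp + bottom_tcp + mptcp_flows)
    # If the pooled rate would still fit with every MPTCP flow on the top link,
    # the MPTCP flows simply crowd onto the less congested (top) link instead.
    if (top_tcp + mptcp_flows) * pooled <= capacity:
        top_share = capacity / (top_tcp + mptcp_flows)
        return top_share, capacity / bottom_tcp, top_share
    return pooled, pooled, pooled

for n in range(1, 5):
    t, b, m = shares(n)
    print(f"{n} MPTCP flow(s): top TCP {t:.0f}, bottom TCP {b:.0f}, MPTCP {m:.0f}")
# 1 MPTCP flow gives 33/25/33, matching the numbers above; from 2 flows on,
# every flow converges to the pooled fair share of the two links.
```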
So this is the theory; this is what the theory says. And this is the practice. Okay? So this is a practical experiment. I start with, I think, five flows on the top link and 15 flows on the bottom link, and this is the average throughput for the TCP flows. And then I start 10 multipath TCP flows, which are shown in this color, in orange. What you see is that multipath TCP does something close to the theory. It's not exactly perfect. What it's doing is mostly pushing traffic to the top link. As you see, the top TCPs decrease; the bottom TCPs also decrease a little bit because of the probing it does. And multipath roughly matches the rate of the TCP flows. So the practice is not exactly the same as the theory, but it's pretty close.
Okay. So the takeaway is that multipath TCP can make a collection of links behave like a single pooled resource. It's as if you take those links and put them together, and they become a single higher-capacity link that everyone can draw traffic from. And this gives you better fairness and better utilization. That's the intuition.
Okay. The other application of multipath TCP is mobile devices, as I mentioned in the beginning. So again I have the same scenario here. Now, with multipath TCP there is absolutely no problem with opening a new subflow using the WiFi connection and just offloading the connection from 3G to WiFi, right? So I can do this make-before-break scenario without any problem. All right?
The problem is, of course, in this case if I move away from the WiFi coverage immediately, then I lose my connection, right? So I have to reconnect to 3G. So a better way to do this is to actually keep both connections open at the same time and do something like a slow handover, right? Instead of quickly switching, you know, use all of your connections, and that's going to give you the best throughput and the best performance. Now, you will rightly argue that, well, if you do that today your battery will die not in one day but in half a day, okay? So that's pretty bad. That completely ignores the energy argument, but from a performance and robustness point of view this is the best thing you can do.
Now, if in the future the batteries get better, then you can do this. If not, you can do it for a limited amount of time while the networks are flaky and so forth. So whether you can do this in practice is an open question, but from the networking point of view this is the better solution.
Okay. So here's an experiment we did. This is our code running; we're using WiFi and 3G -- real 3G, real WiFi. And we're comparing two things. We're comparing against a modified wget application that monitors the interface status. If it notices that the interface it is using is down, it reopens the connection using the other working interface and resumes the download with the HTTP Range header. Okay?
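For comparison, here is a rough sketch of what that application-level approach amounts to. This is not the authors' modified wget; the URL is a placeholder, and it assumes the server honors Range requests.

```python
import urllib.request

def resumable_download(url, dest_path):
    offset = 0
    while True:
        req = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
        try:
            with urllib.request.urlopen(req) as resp, open(dest_path, "ab") as out:
                while True:
                    chunk = resp.read(64 * 1024)
                    if not chunk:
                        return                      # download finished
                    out.write(chunk)
                    offset += len(chunk)            # remember where to resume from
        except OSError:
            # Interface went down (e.g. WiFi lost): the new connection gets routed
            # over whatever interface is still up; resume from `offset`.
            continue

# resumable_download("http://example.com/big.file", "big.file")
```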
And what you see here is basically the connection goes down. It takes some
time to detect it. And then once you start the new connection over 3G it takes
time for 3G to ramp up. Normally it takes like two or three seconds to ramp up.
Yeah?
>>: So [inaudible] multipath TCP do you mean without?
>> Costin Raiciu: No, this is without. This is regular TCP.
>>: [inaudible] fix the --
>> Costin Raiciu: Oh, no. This is the [inaudible] multipath.
>>: Oh, okay.
>> Costin Raiciu: Sorry. Okay. So basically with multipath TCP there is really no disruption. I mean, the takeaway here is not necessarily that the application-level handover sucks; I mean, it's just two seconds, you can cope with that. The problem is that all applications would need to be able to do this mobility inside the application, which is very tricky, okay? So, I mean, as an extreme case, what we did is we actually took Skype and forced it to go over multipath TCP by closing all the other ports available. So we have an HTTP proxy that runs multipath TCP. On the client running multipath TCP we started the regular Skype client, and we killed, you know, all the other ports and traffic. Skype is so good at finding a way out of the machine that it actually goes over HTTP.
So while Skype was going over HTTP, we actually switched between WiFi and 3G and did a handover with Skype while it was playing some audio stream. It's funny to see what happens. I mean, I'm not showing it here, but basically you get a small period of silence of one or two seconds and then you see the audio stream being played at a higher rate because Skype gets all the packets. Okay? So, I mean, that's an extreme example; you really don't want to do voice over TCP. But the handover multipath TCP provides does work for unmodified applications. That's really the takeaway.
>>: [inaudible] advantages you see, which is when you have a notification of a WiFi loss and an application [inaudible] -- maybe like 10 applications, they all want to jump on the CPU and reconnect, and so you can have a sudden flood of activity on the device where they're all fighting for CPU and they're fighting for the network. And this might -- [inaudible].
>>: And they probably all wanted the same thing.
>>: Right. They all wanted --
>> Costin Raiciu: Yeah. Yeah. I mean, there's a lot of applications.
Okay. So the last thing I will -- okay, I'm a bit over. The last thing I will discuss is how multipath TCP can be used in data centers. So I've mentioned this thing where you take a collection of links and make them behave like a single resource, and this is the intuition for why multipath should work in data centers. So currently, today, you have a TCP connection; it randomly picks a path and it sticks to that path. With multipath TCP you do exactly the same thing, but instead of having a single subflow per connection you have many subflows, and each of those subflows gets placed randomly by ECMP on different paths.
And the subflows, as I said, will have different five-tuples -- they will have different ports -- and that's why they look like different TCP connections. All right? And if you happen to have collisions on any of your subflows, then you can basically just move traffic away from the congested link onto the uncongested links. Okay? Visually this looks like this. So this is the case I was showing you before, where there was a collision. Let's say that this black flow is multipath. It will start another subflow that will most likely happen to hit the idle path.
Now, the congestion control will detect this and will push traffic away from the congested path, okay, to the uncongested path. It will get one gigabit on this path. And the effect is that this red flow is happy again, because it's getting one gigabit, right, despite the collision.
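Here is a toy illustration of why the extra subflows help under ECMP. The hash is a stand-in of my own, not what any real switch or the kernel does: the point is only that subflows which differ just in source port can hash onto different core paths, and the coupled congestion controller then shifts traffic to whichever subflow avoided the collision.

```python
import zlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="tcp", num_paths=4):
    """Stand-in for a switch's ECMP hash: pick a path from the five-tuple."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_paths

# One MPTCP connection, several subflows that differ only in source port:
for src_port in (40001, 40002, 40003, 40004):
    print(src_port, "->", ecmp_path("10.0.1.2", "10.0.2.3", src_port, 80))
```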
So, you know, just to show you the example I showed earlier: this was the TCP throughput you would be getting, and in blue we have the multipath TCP throughput. Now, clearly this is not perfect. It's not exactly the hundred percent you would get in theory, but it gets pretty close: the average is close to 90 percent utilization, and even in the worst case you get 600 megabits. So, okay, this is all simulation. So what we did is we actually went on Amazon EC2 to see if we can get some of these benefits in practice. So, you know, infrastructure as a service; we rented some virtual machines. And it turns out that Amazon EC2 has a multipath topology. When we started out in 2010, when we were testing this, I think only one of their availability zones had multipath.
Retesting a few months ago, all of the availability zones had multipath. So I think they actually upgraded the network in the older availability zones to have multipath networks.
Okay. So we took 40 medium instances running our multipath TCP kernel, and the experiment we did was very simple. We just ran iperf from every machine to every other machine, periodically, in sequence, using either TCP or multipath with 2 subflows or 4 subflows. And here are the numbers. Again, I'm showing the flow rank here and the throughputs on the Y axis. So this is TCP, and multipath with 2 or 4 subflows. You clearly see you get a good benefit in this case. And these flows here are mostly within the same rack.
So of course the EC2 network is a black box to us. We have no idea exactly why these benefits appear, right? It could be because there is indeed a FatTree network in there, or some multipath network, with a lot of cross traffic and collisions, where moving traffic away really helps you. Or it could be because they are doing per-flow shaping. So we don't really know, okay. So this is just qualifying the results. Okay. Yes?
>>: So these actually select the different paths?
>> Costin Raiciu: So you don't.
>>: Just by virtue of different five-tuples they --
>> Costin Raiciu: Uh-huh.
>>: -- chose different paths?
>> Costin Raiciu: Exactly. So you don't do anything different in the stack. You just open many subflows, each of them with a different five-tuple, and then you hope that they'll get placed on different paths. That's basically --
>>: [inaudible] multiple interfaces, in which case --
>> Costin Raiciu: Yeah. If you have multiple interfaces then it's a different story. But if you want to use sort of network-based multipath, then you just use different five-tuples.
>>: So there would be some kind of logic on each end to say [inaudible] subflows, different groups --
>> Costin Raiciu: Absolutely. So, I mean, these are [inaudible] that we haven't really worked on too much. In the data center, the way we envision that you do this is, you have a sysctl that tells the stack, you know, you should open this many subflows, maybe, you know, this much time after the connection starts -- you know, if that connection lasts more than, let's say, 50 milliseconds or something, and it's not just a quick transfer, then it makes sense to open them, okay? But I'm just throwing 50 milliseconds out there as a number. I don't know if it's a good number, or whether it should be a hundred or 200. Yeah?
>>: I'm just wondering how you avoid the wave effect, if everybody gets synchronized into saying, okay, I see congestion, stop sending there, and they all move at the same time to the uncongested path and create this back and forth --
>> Costin Raiciu: So this was like the biggest problem with load-dependent routing, and that's why nobody does, let's say, traffic engineering in real time trying to balance congestion, because it's like the holy grail.
The theory of multipath congestion control really shows that this is stable. So there are proofs in theory showing it is stable, and our controller is sort of based on that theory. So we haven't seen any oscillations in practice.
The problem is that the controllers shown stable in theory have a different behavior -- I mean, they're basically exponential controllers. They're like Scalable TCP; they increase multiplicatively every time. The problem is that with small windows you increase very aggressively. So if you increase from 1 to 2, that's a very aggressive increase if you have like a thousand subflows -- a thousand flows -- because you double the load, more or less, right?
But what stops you from oscillating in that particular case is the timeouts, right? You'll be timing out a lot of the time and, you know, statistically it evens out. So the answer is: the theory says it should be stable, and in practice we've seen that it's stable. It could only be messy in cases where you have very, very small windows, but in that case, as I said, the probabilistic nature of timeouts does help you a little bit. So we think it's pretty safe. It shouldn't be a big deal.
If you do it in the network, you don't really know how long the control loop of the flows is. So if you make a change, how long should you wait until you change again? Right? Here you're making the change on the natural control loop of the path, which is basically the round-trip time, right? So that's why it intuitively should be easier to get stability if you're doing endpoint congestion control rather than doing it in the network.
Okay. So designing any multipath TCP isn't difficult; I mean, it's pretty easy to do something that works. But designing a deployable one is much, much more difficult, and that actually took the biggest part of this work, just because our Internet architecture is evolving all the time, right? And what this means is that you need to put in defensive mechanisms to detect when things go wrong, and you need to fall back to TCP. So let me give you another example of a defensive mechanism.
There are middleboxes out there that will modify your payload, your TCP payload. For instance, an FTP middlebox will rewrite the IP address in the control channel. If it's a NAT doing FTP, it will rewrite the IP address written by the source to be the IP address of the NAT, right? And the IP address is written in ASCII, which means the length can change, so you can lose or gain a few bytes. You don't know.
So that could really mess up a multipath TCP connection, because I've sent a few segments here but one of them got shrunk or got bigger. So then when I put them in order, I will send garbage to the application; I don't know how to put that in order. So what do you do in this case? What we did is basically, well, you really have to checksum the data, right? So the data comes with a checksum that's carried as an option, and when multipath TCP detects a checksum failure you basically have two options. If you have other working subflows, then you just reset the subflow that has the checksum failure and never use it again.
If that's your only subflow, then we have a handshake that allows you to fall back to TCP. But you never come back to multipath again. So basically, from then on we're doing TCP and we'll let the middlebox do its changes, but we're not doing multipath TCP anymore, we just do TCP. Okay. And this is not just multipath TCP: if you want to make deployable changes to TCP, you need to take all of these into account. That's basically the lesson we learned the hard way.
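Here is a rough sketch of that defensive logic, with made-up names and CRC32 standing in for the real checksum; the actual protocol carries the checksum in an option alongside the data mapping and uses a handshake for the single-subflow fallback case.

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Subflow:
    name: str
    alive: bool = True

@dataclass
class Connection:
    subflows: List[Subflow] = field(default_factory=list)
    mode: str = "mptcp"

    def on_data(self, subflow: Subflow, payload: bytes,
                claimed_checksum: int) -> Optional[bytes]:
        if zlib.crc32(payload) == claimed_checksum:     # mapping is intact
            return payload                              # deliver as usual
        # A middlebox rewrote the payload on this path, so the data sequence
        # mapping can no longer be trusted there.
        others = [sf for sf in self.subflows if sf is not subflow and sf.alive]
        if others:
            subflow.alive = False                       # reset just this subflow, keep going
        else:
            self.mode = "tcp"                           # last subflow: fall back to plain TCP
        return None
```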
And I'll finish with sort of an advertising slide, which is: multipath topologies really need multipath transport. And multipath TCP can be used by unchanged applications on today's networks. You can try it out. And the biggest sort of theoretical breakthrough, and the reason it's such a nice thing, is this ability to move traffic away from congestion and take multiple links and make them look like a single shared link. And that's why -- I mean, the protocol itself is the engineering part that was really difficult, but this is the smart part, right? This is why it gives you such nice benefits.
And with that, I'm done. Thanks.
[applause].
>> Costin Raiciu: Yeah?
>>: [inaudible].
>> Costin Raiciu: So, I mean, I've been talking to the Google guys, to the Facebook guys and so forth. The Google guys made it pretty clear that they are reluctant to touch it until it's in the mainline kernel. And the code is maintained by our colleagues from Belgium -- they are on the same project and they did most of the protocol implementation. So the push in the next couple of years -- we have funding for this -- the push is to just try to get it into the mainline.
So currently we have one of the stack committers tutoring us on how to restructure the big patch we have. We have like a 10,000-line patch which touches all of the TCP stack, so it's a big change.
So my guess is that once it's in the mainline, people will start playing with it. Before that, yeah, they can play with it in a simple setup, you know; they can just see whether there are benefits. But I'm not exactly sure that they are willing to sign up to -- so if you use an experimental kernel like the one we have, the problem is that the mainline kernel changes all the time, so you have to port the patches, and that's a huge effort. I mean, that's a huge effort. So realistically speaking, I think we will see it in data centers when we have it in the mainline -- or, no, after we have it in the mainline. And it's probably a bit optimistic to say that [inaudible] in the mainline [inaudible].
But on mobiles I think the story is a bit different, because the push seems to be much stronger on mobiles, for mobility -- you know, there's a bunch of companies interested, trying it out, doing experiments and stuff. So in that particular case it might be sooner.
>>: [inaudible] that you either -- you want to do one or the other and rarely both.
You know, 3G is good when I walk outside and it's a great connection. But if I
had WiFi I would never use [inaudible].
>> Costin Raiciu: Absolutely. Yeah. So, I mean, you can use it -- we have a paper in a SIGCOMM workshop this year, and basically this is what we looked at, you know. We looked at what happens if you use them both at the same time, if you just use one, and so forth. If you just use one, you'll have some glitches when you switch from one to the other, because, for instance, if you're using WiFi and you switch to 3G, then you have a startup time for 3G; even if 3G is connected, until you get the radio allocation it takes like a few seconds.
So that's no problem, you know; I mean, the implementation currently allows you to say this interface is in backup mode and this is the primary interface. So you can just instruct the kernel to prefer one path over the other. So that's not a big deal. You can do it. Yeah?
>>: Do you have Android? Sorry.
>>: So to avoid [inaudible] have you considered using something like SSL
extensions to handle [inaudible].
>> Costin Raiciu: So, I mean, yeah. I think it's a tricky path to go down. I mean, if you look at how the whole Internet architecture has been evolving, it's been a story of encapsulation, right? And encapsulating multipath TCP over something like SSL is the ultimate in encapsulation, because if that becomes the sort of standard then everything is encapsulated over SSL. And I'm one of the skeptics of this evolution. A lot of the people I talk to say, yeah, but that's the end, you know, once you encrypt it they're done. And I don't think so. I think it's just a step. And they say -- whatever. The government comes and says, you install this key in your browser or else you don't get Internet. Okay? So once you install this key in your browser, what will happen is they'll break the SSL connection, they'll terminate it, they'll have a middlebox that looks very carefully at your traffic and then forwards it. And then nothing's stopping them from doing everything they are doing now from doing it there, right? So I think it's really a race to the bottom, all of this encapsulation. I mean, my good friend Lucien says that we should use HTTP as a [inaudible]; I mean, that's even scarier. But for certain deployments it makes a lot of sense. In the long term I think we will lose. I don't know. Yeah?
>>: So [inaudible] no change to the application. I'm curious. What if you take an application like Skype that's already reacting to change notifications regarding Internet connections appearing and disappearing -- do they suddenly start behaving properly with MPTCP, or do they start inappropriately tearing down connections that would have worked?
>> Costin Raiciu: So our experience with Skype -- I mean, I don't know, I can't give you firsthand information because I didn't run the experiment. Our experience has been that it was no problem, like killing one interface and just forcing Skype to move to the other. I mean, not forcing -- so the thing is, if your interface dies, then normally your kernel will tell you your socket has died; you get an abort from the socket. That doesn't happen with multipath TCP. Okay? So basically the kernel just keeps feeding you and you're happy, right? I mean, if an application on top of that actually just does [inaudible] over whatever, then it could actually get confused by what's going on, right? But for Skype in particular, it just worked. I mean, as I said, we had the flow working over WiFi, you kill WiFi and move to 3G, and nothing happens. I mean --
>>: [inaudible] vast majority of well-behaved TCP. We have seen programs that listen to IP change or route notifications, and if they hear a route change notification that they think should kill their connection, they'll just tear it down.
>> Costin Raiciu: Yeah. I think in that particular case it should probably still work correctly. If they kill a connection, nothing happens, right? It's just that they're taking it into their own hands, right? I mean, yeah. Yeah.
>>: So [inaudible].
>> Costin Raiciu: How much [inaudible] sorry?
>>: You can help latencies.
>> Costin Raiciu: Latencies.
>>: Like RTTs. Not only [inaudible].
>> Costin Raiciu: So there are two things. I think multipath TCP mostly benefits you if you want to do bulk transfer, right? That's where I think most of the benefits are. Because of the mechanisms we have with the [inaudible] sequence numbers and so forth, you could imagine using this to get better per-packet latency. But in general, per-packet latency will be higher with multipath TCP, because one path can have a higher delay, and then a packet cannot get passed up to the application until, you know, the higher-delay packet gets there, because they need to be passed up in order.
We were chatting with some NetApp guys at some point, and they were saying, well, I think we might use this to, for instance, send the same [inaudible] over multiple subflows for redundancy, okay, and then make sure that, whatever, you know, the first one gets there. I mean, yeah, that's completely fine; the protocol allows that, you can easily do that. I just have no idea how people want to use it. But in general, if you really care about packet-level delay and so forth, this is probably -- I mean, and for short transactions this is not for you, because by the time you set up the second subflow, you know, the connection is gone.
And this is actually what happens in practice: you start the connection, the stack of course starts pushing packets on the initial subflow, and then at some point, I think when you [inaudible] or something, the second subflow gets created, right? So by that time, in a data center for instance, most connections will be gone anyway. So --
>> Jitu Padhye: All right. Let's thank the speaker.
>> Costin Raiciu: Thank you.
[applause]