Jitu Padhye: All right. It's my pleasure to welcome Costin. Costin works at the University of Bucharest and he has done a number of things, most important of which is multipath TCP. This work has won a number of awards. It's getting adopted pretty heavily in Europe, and hopefully it will arrive here soon. >> Costin Raiciu: Thanks. Okay. This started [inaudible] my PhD and is joint work with a bunch of people. I've bolded the ones that are most important to this work, but there are many, many others who contributed. So, all right. The context is that networks are really becoming multipath these days. We've got mobile devices that have multiple wireless interfaces, each of them with different coverage properties, different energy consumption, different throughputs. Data centers -- you know, I'm showing this now -- they've got many servers. Within any given rack you've got servers connected to the top-of-rack switch, then you have many racks and a redundant topology connecting these racks together, so between any two servers there are many paths that you could use to communicate. Finally, online providers such as Microsoft, Google, whatever, multi-home in the Internet to get better redundancy, better performance, and so forth. And then the client can connect to them using any of these network connections. So the networks are multipath, but we still use TCP today. And, I mean, 90 percent of applications use TCP, so it's basically almost a monopoly. And the reason people use TCP is because it offers this very nice byte-stream abstraction, you know, reliable delivery, and it also matches offered load to the capacity of the network. And that's quite nice. The trouble with TCP is that it's fundamentally single-path, right? It was designed to bind the connection to two IP addresses. If either of those addresses changes, then the connection goes down.
Even in the network you cannot really take a TCP connection and spread it over multiple paths, because the ends get really confused and the performance drops significantly. So there's a fundamental mismatch between multipath networks and single-path transport, and this creates problems. Let me just show you two instances of the problems it creates. So here's me commuting from -- sorry, that's not the Microsoft phone -- commuting from home to work listening to my Internet radio. I'm using my 3G connection, and then as I get to work there's WiFi. So I would like my phone to switch to using WiFi because, well, it's cheaper economically, it's cheaper from an energy point of view, and so forth. Now, the trouble is that today when I do this switch all of my connections will die, and that's not nice. Another problem is in the data center. I'm showing here a FatTree data center where the servers are at the bottom. Each rack has only two servers; this is just for illustrative purposes. In this topology, if the red servers want to communicate they can choose any of the four paths they have available. And let's say they choose this path through the third topmost switch. They could have chosen any of the other paths. Normally the way this is done is by using a mechanism called ECMP that will literally place a flow at random on one of the available paths. Now, let's say that the black servers want to communicate too. They will also choose a random path. And there's a probability that these two paths will collide. If you have a collision, the effect is pretty bad: each of the connections will get half of the throughput it should be getting, while there is idle capacity elsewhere in the network. And this happens in a network that is provisioned to behave like a full non-blocking switch, right? So the theory is that each of these connections should be getting one gigabit, and in this example they are not.
And you might be wondering, well, yeah, this is a contrived example -- does this really matter? So we set up a very simple experiment. The experiment is this: every server chooses a single other server at random to send to, so each server has one outgoing connection and a single incoming connection. We call this a permutation. So we ran the simulation, and here are the results. On the X axis I'm showing the flow IDs ranked by throughput, and on the Y axis I have the throughput. The theory says we should be getting one gigabit for all of these flows. The practice is rather different, okay? On average, flows get something like 400 megabits. There are some flows that are lucky, and they get close to one gigabit. But there are some flows that are really unlucky, and they get 100 megabits. And this is because of the collisions. All right. So as you might have guessed from the title of this talk, the solution to all of these problems is multipath TCP. What it is, really, is just an evolution of TCP that allows us to use multiple paths within a single transport connection. Before I go on and tell you what we did and what multipath TCP looks like and so forth, let me briefly discuss some related work. I've shown you a few point problems. The first impulse we have when we see point problems is to come up with point solutions. For instance, we can solve mobility differently, right? You can go about it in a different layer by using Mobile IP, which was standardized a while ago but didn't get any traction. Or you could solve it at the application layer, for instance by using HTTP range requests, and change your application to deal with mobility. The problem with changing applications is that it's a lot of effort. If every application needs to have mobility built in, then that's a lot of complexity to put in the application.
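The collision effect in the permutation experiment is easy to reproduce with a toy model. This is only a sketch, not the simulator used in the talk: it hashes each flow of a permutation onto one of several equal-cost core links (ECMP-style) and splits each link's capacity equally among colliding flows, ignoring TCP dynamics and multi-stage topology.

```python
import random

def permutation_ecmp_throughput(n_flows=128, n_core_links=128, link_gbps=1.0, seed=0):
    """Toy ECMP model: each flow of a permutation traffic matrix is hashed
    onto one random core link; flows colliding on a link share its
    capacity equally.  Returns per-flow throughputs, sorted ascending."""
    rng = random.Random(seed)
    load = [0] * n_core_links                      # flows per core link
    choice = [rng.randrange(n_core_links) for _ in range(n_flows)]
    for c in choice:
        load[c] += 1
    return sorted(link_gbps / load[c] for c in choice)

tputs = permutation_ecmp_throughput()
avg = sum(tputs) / len(tputs)
print(f"average {avg:.2f} Gb/s, worst {tputs[0]:.2f} Gb/s, best {tputs[-1]:.2f} Gb/s")
```

Even with as many core links as flows, random placement makes collisions near-certain, so the average lands well below the one gigabit the fabric is provisioned for, matching the shape of the curve on the slide.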
So the best place to change this is the transport -- at the IP level you don't have the context of the transport, so it turns out that the best layer to do this is the transport layer. And there have been two proposals to this end. So [inaudible] proposed TCP Migrate, which is basically what it says: it can just move a connection from one IP address to another. And SCTP has this capability too. Now, all of these come with the mind-set that mobility should be about fast handover: when I have a new interface, I should quickly hand over to that interface and I'm good, right? What you'll see later is that we think a better solution is to do something like a slow handover. When you have multiple interfaces you should use them all the time, because their performance might fluctuate quite dramatically, and if you do a fast handover you might suffer. Okay. Now, you can solve data center collisions differently too, for instance by having a centralized controller that knows all of your switches and all of your flows. If you know all of your flows, then you can compute a placement of flows to paths that doesn't have collisions and implement it. This is what Hedera proposes. But the problem is that it doesn't really scale that well; you need a really tight control loop to do this in real time and get the benefit. And finally, okay, multipath TCP is not a new idea. It has been proposed at least half a dozen times, all right? I think the first one to propose it was Christian Huitema in 1995 at the IETF -- I think he was with Microsoft. And there have been, as I said, half a dozen proposals ever since. So what's different in this work from the previous work? Two things, really. First, the context: in the past four or five years we've really seen multipath networks -- smartphones took off and data centers took off. Second, our goal: we set out from the beginning to have a deployable multipath TCP.
So what we really want is to take existing applications and run them over this new protocol, but the applications should not be changed at all. They shouldn't even be recompiled, right? They should speak the same socket interface to the stack, and the stack should somehow do multipath underneath, okay? Now, that's all fine. We also want it to work over today's networks. If your solution requires network changes, that's a show stopper, right? Nobody's going to change the routers just because you have a new protocol. So we really want to be able to run through today's networks. And finally, if there's a path where TCP would work, we want this new protocol to work also, because if it didn't then users would complain -- you know, the Internet has gone down, what's happening here, right? So these are very sensible goals, but it turns out it's not -- yeah? >>: I'm curious. In the [inaudible] mobile scenario that you're describing, one of the goals that would be nice would be to deploy these new versions of TCP without having to change both sides -- so only changing one side. >> Costin Raiciu: Yeah, that's also feasible. Actually, when we started this effort, it was within an EU project called Trilogy, and one of the partners started out with exactly this idea, that you would change only a single end. It turns out that you're quite limited in what you can do if you change a single end, and that effort died for some reason. I mean, we can take -- >>: [inaudible]. >> Costin Raiciu: It's a hard problem. I mean, you're really very limited in what you can do. Okay. So, you know, fast forward: we have a Linux implementation of the current protocol draft. The protocol draft is in the final phase of standardization at the IETF, so it should be an RFC soon. So this thing is already happening. It supports legacy apps, it works over today's networks, and we've tested it, right? But, you know, I'm getting a bit ahead of myself. So here's multipath TCP.
There are a lot of components to it. I don't have time to talk about them all, so I'll just leave out these two parts: flow control and encoding control information. If you want, we can talk about them afterwards. Okay. So when we started off this work, I don't think Mark Handley or I realized how difficult it would be. He came in 2007 and said, I think we should do multipath TCP, it's a good idea. And, you know, the reason he started that work is because in theory there are a few results telling us how to do the congestion control, and that was like the big sort of breakthrough. But to be able to do the congestion control you really need a vehicle to carry the bytes, and that's the protocol, really. So he said, why don't you think about the protocol. In the beginning I thought, well, this will take just one month, I'll get it done, and then I'll move on to the interesting stuff like the congestion control. Well, it turns out it took five years. And the reason it took five years is not because I was stupid, but because we were designing in an Internet architecture that nobody really understands. So here's what we teach our students: you have this very nice protocol layering. You've got link-layer addresses that are visible only on one link; on top of that, IP works end to end and routers forward packets based on destination addresses; on top of that, TCP does reliable retransmission and, you know, ensures [inaudible]; and finally the applications use this interface. That's really nice, except it's mostly fiction, okay? The reason it is fiction is middleboxes. And I'll use this pirate skull to denote middleboxes. So how big is the problem, anyway? We did a study that we presented at IMC last year to understand exactly what the scope of the problem is. So here's the IP header at the top and the TCP header at the bottom.
And the theory says that as packets go through the Internet, two fields should change -- the TTL and the header checksum -- and that's it. Okay? Now, we know that there are NATs, right? Those change the source IP and the source port for outgoing packets and the destination IP and the destination port for incoming packets. So we know that these get changed, okay? Now, there are firewalls that randomize the initial sequence number, because some stacks are weak and choose a predictable sequence number. What this means is that on outgoing packets the sequence numbers get changed, and on incoming packets the ACKs get changed, right? So these two fields also get changed in the Internet. In fact, for just about any particular field you find in here, there will be a middlebox out there deployed that changes it, right? So the real picture looks something like this. And you'll be glad to see there's some white space in there -- I only left it because it contrasts nicely. You know, it's not true; those fields get changed too. [laughter]. So in this context, when you're designing a new protocol you have to be really defensive, because otherwise it can really bite you back. >>: [inaudible]. >> Costin Raiciu: Everyone -- I mean, middleboxes will set it to zero. Even Linux will set it to zero when it comes in, right? So you really don't want to put any urgent data in there. Someone decided at some point that it was a security hazard and just zeroed the urgent pointer. All right. Let's start with what the protocol looks like. So a multipath TCP connection starts like a regular TCP connection, with a SYN packet, except the SYN carries a new option called MP_CAPABLE. This option also carries a unique connection identifier chosen by the end that's sending the SYN.
Now, the logic at the passive opener -- the server, in this case -- is: if the SYN has MP_CAPABLE and I'm doing multipath, then I'll just enable multipath TCP for this connection. So it gets the SYN and replies with a SYN/ACK carrying MP_CAPABLE and its own local unique identifier for this connection, and it says, okay, now we're enabled, we're good. The client -- the active opener -- has the same logic: if the SYN/ACK has MP_CAPABLE, then I will enable multipath TCP, and then I send the third ACK. What I've shown here is your regular way of negotiating a new TCP option, right? All the TCP options we know are negotiated this way. And after you've done this, what you have is a subflow set up within a multipath TCP connection. Both ends will basically know that this subflow is part of a connection, and they will have a local identifier that's unique for that connection. Now, at any point in time, both ends can add new subflows to this connection, right? The subflow can come from the same IP address or a different IP address; it doesn't matter. The only requirement is that the five-tuples of the different subflows differ, so at least the port number should differ across subflows. Okay. So now if I want to add another subflow, I again send a SYN, like regular TCP, except the option in the SYN is different. It's called JOIN, and it tells the server that this is an existing multipath connection we're adding a subflow to, not a new connection. The JOIN contains the server's unique identifier for this connection, and this allows the server to demultiplex the request and attach it to the right connection. Now, these unique identifiers are actually used for security purposes, but I will not cover that in this talk. Okay. So the SYN/ACK comes back with a JOIN, then the third ACK, and finally we have a new subflow.
And this can go on as many times as you want. You can FIN subflows if you need to. Subflows can die and just time out, and it's not a big deal; the connection keeps going as long as there is a single subflow that hasn't timed out yet. Okay? When the last subflow has timed out, the connection finally dies -- or when you do an explicit teardown. All right. Now, that was pretty easy, right? But it was too easy. The reality is a bit different. Say we're in this case where the SYN has come in, we're sending back the SYN/ACK with MP_CAPABLE, and the server thinks we have enabled multipath for this connection. Now, we can have a nice middlebox in there. Our study at IMC shows that 6 percent of access networks will actually remove options they don't know. And if you look at port 80, 14 percent of these networks will remove unknown options. So if it happens that on the forward path the option got through, but the return path was asymmetric and the option didn't get through, then what you get is the SYN/ACK coming back without the option. Now the client thinks multipath is disabled for this connection and the server thinks it's enabled. What happens after this is really complicated -- I will not go through it, but it's just not good. Okay? So you want to fix this. And the fix is pretty obvious: in the third ACK, if you enabled multipath TCP, you want to carry some multipath-specific option telling the server, yes, I did get multipath TCP enabled. So the logic changes at the server. It says: if the SYN has MP_CAPABLE and the third ACK has this multipath-specific option, then you enable multipath TCP. In the case I showed, the third ACK will not contain the multipath option, so the two endpoints are in agreement: we're not doing multipath TCP. Okay? So this really shows what you need to do if you want your connection not to break in today's Internet, right?
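The fixed negotiation can be sketched as a tiny model. This is illustrative only -- the option names are abbreviations and the middlebox is a stand-in for the option-stripping behavior measured in the IMC study, not real kernel code:

```python
def strip_unknown(options, known=frozenset({"MSS", "WSCALE"})):
    """An option-stripping middlebox: drops TCP options it doesn't know."""
    return options & known

def handshake(fwd_mbox, rev_mbox):
    """Returns (client_mp, server_mp) under the fixed negotiation: the
    server enables multipath only if MP_CAPABLE arrives in *both* the
    SYN and the third ACK, so stripping in either direction forces a
    clean fallback to plain TCP at both ends."""
    syn = fwd_mbox({"MP_CAPABLE"})                       # client -> server
    synack = rev_mbox({"MP_CAPABLE"} if "MP_CAPABLE" in syn else set())
    client_mp = "MP_CAPABLE" in synack                   # client's decision
    third_ack = fwd_mbox({"MP_CAPABLE"} if client_mp else set())
    server_mp = "MP_CAPABLE" in syn and "MP_CAPABLE" in third_ack
    return client_mp, server_mp

ident = lambda opts: opts                                # transparent path
# Asymmetric path: only the reverse direction strips unknown options.
print(handshake(ident, strip_unknown))   # -> (False, False): both fall back
print(handshake(ident, ident))           # -> (True, True): multipath enabled
```

With the original logic the asymmetric case would have left the server believing multipath was on while the client disagreed; requiring the option in the third ACK makes the two ends always agree.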
To achieve goal three, which was: if TCP works through a connection, multipath TCP should work -- you really need to fall back to TCP if something goes wrong. In this particular case we fall back to TCP in the negotiation itself, but multipath TCP can fall back to TCP at any time during the lifetime of the connection. So if at some point the multipath options don't get through anymore -- because, say, the path changed -- the multipath connection does fall back to TCP. And from that point on it doesn't go back to multipath; it just stays TCP, but it doesn't break the connection, okay? All right. So the lesson is: it used to be that we negotiated new protocol options between two endpoints. Nowadays you're negotiating between two endpoints and an unknown number of intermediaries, and unless you take this into account, your negotiation will actually fail. And this applies not only to multipath, but to any extension to TCP you might want to do. Okay. So now we have multiple subflows. How do we actually send data on these subflows, right? Let's start with a primer on TCP. I mean, you all know this, but it's just to contrast with what you do with multipath. So TCP gives sequence numbers to every byte and then places segments on the wire. In this example I'm showing packet sequence numbers for simplicity, but, you know, there's no difference in concept. So, okay, the sequence numbers help the receiver, first of all, pass data in order to the application, right? And also detect if there are holes and so forth. So this allows the receiver to implement the TCP contract, which is reliable in-order byte delivery. As the packets are coming in, the receiver generates ACKs, and the ACKs tell the sender, yes, the packet got there. And if it didn't, then you can just retransmit and so forth, right? I mean, we all know this. Okay.
So the easiest way we can think of implementing multipath TCP is just taking all of these segments that TCP creates and placing them on different paths. That's the straw-man design, right? That's what everyone thinks of when you think of multipath: we put a segment on the top path, we put a segment -- oh, yeah. Middleboxes. We put a segment on the bottom path and so forth. Okay? Now, what you will see is that this path only sees segments 2 and 4 and this path only sees segments 1 and 3. So each of them sees something that looks like a TCP connection with holes in it, right? Now, the forward segments will get there fine; it's not a big deal. It's when the ACKs come back that things start to get messy, okay? So ACK 1 will be generated and then ACK 2 will be generated, right? The problem is this path has not seen segment 1, because it saw 2 and 4. So now it sees ACK 2, which cumulatively ACKs both 1 and 2, and it will be upset. It turns out that a third of the paths we measured in our IMC paper will actually correct these ACKs, or drop them, or reset the connection -- one of those three things will happen, right? In this particular case let's say it corrects the ACK. What will happen is that it will correct it to ACK 0, because that's the cumulative ACK for what it has seen. And on the top path, ACK 3 will be corrected to ACK 1. Okay? So although all the segments got to the receiver, the sender is not aware of this, because the paths are correcting the ACKs, right? And at this point we're stuck; we don't know how to make progress. Okay. So what does work?
We really need a sequence space for each subflow, to make each subflow look like a TCP connection on the wire with no gaps, okay? And if we have that sequence number, then we can use it to do retransmissions and to detect losses, right? But we still have the problem that different paths will have different delays, and you get reordering, right? So to deal with reordering you need a separate sequence space at the connection level, and this will be used by the receiver to put packets back in order. We also need a data acknowledgement for this connection-level sequence space, and so forth. All right. So this is how the multipath TCP packet header looks. The ports and the sequence numbers relate to the subflow of the multipath TCP connection, and this makes the subflow look like a regular TCP connection on the wire. Except there are some options that belong to multipath TCP that current middleboxes don't understand, and these options allow the receiver to reorder data. Multipath TCP has a single receive window per connection, and this window is relative to the data ACK that's carried as an option here. I will not go deeper into flow control, as I said. Okay. So here's how it works. I again want to send segment one, but this will be data sequence number one, right? And when I map it onto the red path it will receive a subflow sequence number -- in this case 100, okay? Now, I send the second segment here, and it will receive a subflow sequence number on the blue path. And finally I send the third segment there. As you see, the subflow sequence numbers increase continuously, but the data sequence numbers can have holes; it doesn't matter. Okay? Now, what will happen if this path fails and this packet never gets there? Well, what will happen is that on this subflow we will get a timeout.
And when there is a timeout, we reinject all the segments that are outstanding on this path onto the other paths that are working. Okay? So here's what happens: as you see, we have the same data sequence number, but it's no longer on the red subflow. And in this way multipath TCP can make progress. >>: [inaudible]. >> Costin Raiciu: Well, the ACKs are -- so the regular ACKs just ACK the subflow sequence numbers, and then you have a cumulative ACK for the data. And so -- >>: [inaudible]. >> Costin Raiciu: Yes. Yes. Yes. >>: All right. >> Costin Raiciu: Okay. So we started out with what looked like an incredibly open design space for the protocol. It turns out that in practice, because of a lot of constraints, there was not much room for maneuver in the decisions we took. A lot of the decisions were essentially forced: anyone starting from the same goals in the same Internet would end up with pretty much the one possible solution. >>: [inaudible] one option is to just build it out of a bunch of ordinary TCP connections and put your extra stuff, instead of in options, just in the TCP data. >> Costin Raiciu: Absolutely. So -- >>: [inaudible]. >> Costin Raiciu: So I did not cover that part about encoding. It turns out at the IETF we had a six-month-long discussion about where to encode control information. You can put it in options or in the data. Okay? It turns out -- so there's a chain of logical implications. First of all, you need the data ACK; if you don't have a data ACK, you cannot do flow control properly. Then, if you put the data ACK in the payload -- I was just chatting with Jitu earlier -- you can get deadlocks, because the payload is subject to flow control and congestion control. And flow control is the problem. So what will happen is -- I should probably not do this, but I can [inaudible].
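The two sequence spaces and the reinjection on timeout can be sketched as a toy model. This is illustrative only: it assumes one data sequence number per segment, whereas the real protocol maps byte ranges to subflow sequence numbers via a TCP option.

```python
class Subflow:
    """Toy subflow: its own gap-free sequence space, plus a map from
    subflow sequence numbers to connection-level data sequence numbers."""
    def __init__(self, start_seq):
        self.next_seq = start_seq
        self.outstanding = {}            # subflow seq -> data seq

    def send(self, data_seq):
        s = self.next_seq
        self.outstanding[s] = data_seq   # mapping carried as a TCP option
        self.next_seq += 1               # subflow space stays contiguous
        return s

def reinject_on_timeout(dead, alive):
    """On a timeout, re-send every data segment still outstanding on the
    dead subflow over a working one: data sequence numbers are reused,
    subflow sequence numbers are fresh on the new path."""
    for data_seq in dead.outstanding.values():
        alive.send(data_seq)

red, blue = Subflow(100), Subflow(5000)
red.send(1); blue.send(2); red.send(3)    # data seqs interleaved across paths
reinject_on_timeout(red, blue)            # red path fails: 1 and 3 move to blue
print(sorted(blue.outstanding.values()))  # -> [1, 2, 3]
```

Each path sees only its own contiguous subflow sequence numbers, so middleboxes stay happy, while the connection-level data sequence numbers let the receiver reassemble the stream and let the sender move data between paths.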
>>: [inaudible]. >> Costin Raiciu: Okay. I can show you some slides that show the deadlock. >>: [inaudible]. >> Costin Raiciu: All right. Okay. So you can get a deadlock, basically. That's -- >>: [inaudible]. >> Costin Raiciu: Now, that's it for the protocol itself, right? Let's switch to congestion control, okay? I'll start off with a very high-level slide. Back in the '60s we used to have circuit-switched networks, which means that whenever you have a connection you have to set up a circuit through the network. I'm showing here a single link and two flows, the blue flow and the red flow. Now, the problem with this picture, as you're probably well aware, is that this flow is bursty: it could use all the capacity, but it's not allowed to by the very static reservation of capacity on this link. And the same here. So the result with circuit-switched networks is that you get great isolation but very poor utilization, all right? Fast forward ten years to the '70s: packet-switched networks fix this very nicely. What we're really doing with packet switching is pooling these circuits together to get greater utilization. Now, fast forward 40 more years to today. This is what we have: two separate links in the network with flows, and it looks strikingly similar to the circuit picture from before, right? So the next step is obviously this one, where you take multiple separate links and pool them together such that each flow can burst and use underutilized capacity elsewhere in the network. Okay? And this is really what multipath tries to do. The problem is, when you go from here to here, you've just lost isolation. So you somehow need to manage the sharing of capacity, and that's what TCP does: TCP congestion control really decides who gets what.
The question is, in the multipath context, how do you split the capacity across the flows? What is the equivalent of TCP congestion control in the multipath context? And this is multipath congestion control. Now, as I said, when we started this work, this was the very interesting research question that sort of drove the whole project. And it turns out that the answer came when we defined the goals we wanted to achieve. Once we had the goals -- which with hindsight look very obvious -- it was not very difficult to design a congestion control that achieves them. So here are the goals. The first goal is the most obvious one: if you have a single multipath TCP connection with many subflows sharing a bottleneck link with TCP, then you don't want multipath TCP to beat up TCP. You want it to get the same capacity as TCP, right? I mean, if you tell people, I will give you multipath TCP and it will just kill TCP throughput, they will say, okay, we're not deploying this. This is obvious. >>: [inaudible] it's more than your share. >>: That's right. [inaudible] [laughter]. >> Costin Raiciu: Well, not the TC -- you know. >>: [inaudible] TCP connection, is it? >> Costin Raiciu: Sorry? >>: It would just be like opening more than one TCP connection and trying to get -- >> Costin Raiciu: Exactly -- people say that's not fair, right? I mean, don't go to the IETF telling them that we're going to beat up TCP, because they're not like [inaudible]. If you go to a specific company and say your products will be beating up other products, then that's a different way of thinking, but okay. So the second goal is more subtle: we should use the paths we have efficiently. And here's what I mean. In this particular case I have three links, each with the same capacity of 12 megabits, and a single multipath flow that has two subflows.
One of them has a 1-hop path and the other has a 2-hop path. In this particular case it's obvious that this flow should use both paths and get 24 megabits. Now, to make this more interesting, I'll add another flow here, again with a 1-hop path and a 2-hop path. And finally I'll add another flow here, for symmetry, okay? So you have all of these flows with a 1-hop path and a 2-hop path, and the network is pretty well utilized, okay? Now, these senders have a choice of how much traffic to put on each path: I can put X on this and X on this, or 2X on this and less on that, and so forth. So how should they split the traffic? It turns out that the way they split the traffic really affects the total throughput of the network, okay? If they split it equally -- each of them puts equal weight on both paths -- each gets 8 megabits, and this is completely counterintuitive. Why are they getting 8 megabits? They should be getting 12 megabits, no? Because the capacity is 36 megabits in the network; if you divide by 3, it should be 12. Well, the reason they're getting 8 megabits is because they're putting a lot of traffic on these 2-hop paths, which are less efficient: they use two resources instead of one. On the 1-hop path you use one resource; on the 2-hop path you use two. Now, if you put more weight on the shorter paths -- for instance 2 to 1 -- you get 9 megabits. 4 to 1, you get 10 megabits. Infinity to 1, you get 12 megabits. Okay? So in this particular case the optimal allocation of bandwidth is to push all of your traffic through the direct path and no traffic through the other paths, okay? Now, okay, does this really mean anything? Well, first of all, the theory says there is a way to get to this outcome just by doing congestion control. Right? So the theory says this.
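These numbers follow from a simple link-capacity constraint. The sketch below assumes the symmetric topology from the slide -- three 12 Mb/s links and three flows, each with a 1-hop subflow on one link and a 2-hop subflow crossing the other two -- so every link carries one 1-hop subflow and two 2-hop subflows:

```python
def flow_throughput(weight_short, weight_long, link_mbps=12.0):
    """If each flow sends a on its 1-hop path and b on its 2-hop path,
    every link carries a + 2b = capacity, and the flow gets a + b.
    The weights fix the split ratio a : b."""
    b = link_mbps / (weight_short / weight_long + 2)   # 2-hop subflow rate
    a = link_mbps - 2 * b                              # 1-hop subflow rate
    return a + b

for w in (1, 2, 4, 1000):
    print(f"{w}:1 split -> {flow_throughput(w, 1):.1f} Mb/s")
```

This reproduces the progression in the talk: an equal split gives 8 Mb/s per flow, 2:1 gives 9, 4:1 gives 10, and pushing essentially everything onto the direct path approaches 12.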
Each sender should look at the congestion it sees on each of its paths and push traffic away from the more congested paths. That's what the theory says. And there are proofs that this can be done in a stable way; it will not oscillate. So in this particular case, look at this guy. This guy will see a loss rate of, say, P on this path, okay? But on this path it will see a loss rate of P plus P -- a much higher loss rate. And that's why it pushes all of its traffic away from the higher-loss-rate path, right? And the same goes for this one. In the example I showed before, where the flow is alone, there would be no higher loss rate on the 2-hop path, so in that case you get the benefit, okay? So basically the theory says you should do congestion control that pushes traffic away from the congested paths, and that's all very nice. The problem is, well, here's a real network. We have a 3G path with very low loss and high RTT -- it gets a loss every five minutes, right? And it has an RTT of five seconds, depending on what carrier you use. And here's a WiFi path that gets a loss every second or even more often, but it actually gives you 10 megabits compared to one megabit, all right? So if I just take the goal I set before -- always prefer your lower-loss paths -- that means where you have 3G and WiFi, always use 3G, put no traffic on WiFi. That's clearly a bad way to do it, because people will say, yeah, but WiFi was giving me 10 megabits, and you're giving me 1 megabit and telling me it's much better for the network -- you know, I don't buy that. So you really need a counterbalance to that design principle, which basically says: in any given configuration, if multipath TCP is giving you less throughput than TCP on the best path, then you're doing something wrong.
So really the goal here would be: look, in this case you should get at least 10 megabits per second, okay? However you get the throughput — maybe you push more here — I should get at least 10 megabits, because otherwise nobody will deploy this thing. All right? So how do we achieve all of this? Well, this is TCP congestion control, and we all know this; I'm just showing it to contrast it with what multipath TCP does, okay? So TCP maintains a congestion window for each connection, and as it gets ACKs back, for each ACK it increases the window by 1 over w. Basically this means that in one RTT it increases the congestion window by one packet, okay? And if it gets a drop, it just halves the window. Pretty simple — everyone knows this. Now, multipath TCP does this: you have a congestion window for each path, all right? When you get a loss on path r, you just halve the congestion window of that particular path — exactly the same as TCP. The difference is the increase part. This is where all the smarts of multipath TCP are. So what I have here is the sum of the windows across all of the paths, and you can look at this as a constant for now, right? So let's say I have two paths: a path with window 1 and another path with window 100. The sum would be 101. Now, the path with window 1 will increase on every RTT by this much; the path with window 100 will increase by 100 times this much — it will increase much more, okay? So what this does is push the increase toward the path that has the better loss rate, the lower congestion. All right? And it does this linearly: if you have two subflows that have the same congestion window, they will increase by the same amount. All right?
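The coupled increase/decrease rules just described can be sketched as follows. This is a simplification with alpha fixed at 1; it also includes the cap from the standardized version of the algorithm (RFC 6356), which limits each subflow's per-ACK increase to TCP's own 1/w so that no subflow is ever more aggressive than a plain TCP.

```python
# Sketch of multipath TCP's coupled congestion control as described
# in the talk (standardized later as RFC 6356).  One congestion
# window per subflow; decreases are per-subflow like TCP, but the
# increase on subflow r is alpha / (sum of all windows) per ACK,
# capped at TCP's 1/w_r.
class CoupledMPTCP:
    def __init__(self, n_subflows, alpha=1.0):
        self.w = [1.0] * n_subflows   # one cwnd per subflow, in packets
        self.alpha = alpha

    def on_ack(self, r):
        total = sum(self.w)
        # Per RTT this gives subflow r roughly w[r] * alpha / total
        # extra packets: bigger windows (less congested paths) soak up
        # most of the aggregate one-packet-per-RTT increase.
        self.w[r] += min(self.alpha / total, 1.0 / self.w[r])

    def on_loss(self, r):
        self.w[r] = max(self.w[r] / 2, 1.0)  # halve, like TCP
```

In the real algorithm, alpha is recomputed from the per-path loss and RTT estimates so that the aggregate throughput matches what a single TCP would get on the best path — the mechanism the talk turns to next.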
Now, if you set this alpha to 1, and you consider that the RTTs are the same for all the subflows, you realize that in aggregate multipath TCP increases by one packet per RTT across all of its subflows, all right? And this gives you fairness to TCP: if all of my subflows are going through the same bottleneck, then I will be fair to TCP. So, okay. Goal 2 — move traffic away from congestion — that's what this mechanism gets me. To get goals 1 and 3, what we do is dynamically adjust this value of alpha, as follows. Multipath TCP has loss-rate and RTT estimates for each of its paths. Once I have those, I can plug them into an equation and ask: what would TCP get on this path? And I get some throughput. Then what I basically do is change alpha such that, in aggregate, multipath TCP gets the same throughput. That's the idea. Okay — so that's the mechanism; it's much easier to see what the emergent behavior is. And this is the real formula, but I'm not going to discuss it. All right. So let's say we have a web server with two 100-megabit links, and it happens that two clients are using the top link and four clients are using the bottom link. So clearly the bottom link is much more congested. Now a multipath TCP connection comes along, using both links. Because the bottom link is much more congested, it will push all of its traffic to the top link, right, leaving very little traffic here — mostly just to probe the capacity of this link. All right? And the flow throughputs are 33 megabits for the flows using the top link, including multipath, and 25 for the bottom flows. Now, if I add another multipath TCP connection, again it pushes all of its traffic to the top link and leaves no traffic on the bottom link. And in this case you'll see that all of the flows in the setup have exactly the same throughput. Okay? If I add one more connection, I start pushing into the bottom path too, right?
If I only pushed into the top path, then I would have more congestion there. So basically I have to balance out congestion. What is really happening is that multipath TCP is trying to equalize congestion between these two links, okay? And the net effect is that these two links start behaving like a single higher-capacity link, with their capacity shared between all the flows, okay? So if I keep adding flows, that's what happens. So this is the theory — this is what the theory says — and this is the practice, okay? This is a practical experiment. I start with, I think, five flows on the top link and 15 flows on the bottom link, and this is the average throughput for the TCP flows. Then I start 10 multipath TCP flows, which are shown in this color, in orange. What you see is that multipath TCP does something close to the theory. It's not exactly perfect. What it's doing is mostly pushing traffic to the top link. As you see from the decrease of the top TCPs, the bottom TCPs also decrease a little bit because of the probing it does. And multipath roughly matches the rate of the flows. So the practice is not exactly the same as the theory, but it's pretty close. Okay. So the takeaway is that multipath TCP can make a collection of links behave like a single pooled resource. It's as if you take those links, put them together, and they become a single higher-capacity link that everyone can draw traffic from. And this gives you better fairness and better utilization. That's the intuition. Okay. The other application of multipath TCP is mobile devices, as I mentioned in the beginning. So again I have the same scenario here. Now, with multipath TCP there is absolutely no problem opening a new subflow over the WiFi connection and just offloading the connection from 3G to WiFi, right? So I can do this make-before-break scenario without any problem. All right?
The problem, of course, is that if I immediately move out of WiFi coverage, then I lose my connection, right? So I have to reconnect to 3G. So a better way to do this is to actually keep both connections open at the same time and do a smooth handover, right? Instead of quickly switching, use all of your connections — that's going to give you the best throughput and the best performance. Now, you will rightfully argue that if you do that today, your battery will die not in one day but in half a day, okay? So that's pretty bad. Setting the energy argument aside, from a performance and robustness point of view this is the best thing you can do. Now, if in the future the batteries get better, then you can do this. If not, you can do it for a limited amount of time while the networks are flaky, and so forth. So whether you can do this in practice is an open question, but from the networking point of view this is the better solution. Okay. So here's an experiment we did. This is our code running, using WiFi and 3G — real 3G, real WiFi. And we're comparing two things. We're comparing a modified wget application that monitors the interface status: if it notices that the interface it is using is down, it reopens the connection over the other working interface and resumes the download with an HTTP Range header. Okay? And what you see here is basically the connection goes down, it takes some time to detect it, and then once you start the new connection over 3G it takes time for 3G to ramp up — normally two or three seconds. Yeah? >>: So [inaudible] multipath TCP, do you mean without? >> Costin Raiciu: No, this is without. This is regular TCP. >>: [inaudible] fix the -- >> Costin Raiciu: Oh, no. This is the [inaudible] multipath. >>: Oh, okay. >> Costin Raiciu: Sorry. Okay.
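As a rough illustration of that application-level baseline — not the actual modified wget — here is a sketch of resuming a transfer over a freshly opened connection with an HTTP Range header, using only Python's standard library; the URL and file names are placeholders.

```python
# Sketch of the application-level handover baseline: on interface
# failure, reconnect over the surviving interface and resume the
# transfer with an HTTP Range header, as the modified wget in the
# experiment did.
import urllib.request

def range_header(already_have):
    """Range header value asking for the rest of the object."""
    return "bytes=%d-" % already_have

def resume_download(url, dest, already_have):
    req = urllib.request.Request(url)
    if already_have > 0:
        req.add_header("Range", range_header(already_have))
    # A 206 Partial Content reply means the server honored the Range;
    # append the remainder to what was already downloaded.
    with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
        while True:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            out.write(chunk)
```

Real wget does the equivalent with `wget -c`. The point of the comparison in the talk is that with multipath TCP none of this is needed: the kernel moves the existing connection onto the surviving interface underneath the unmodified application.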
So basically with multipath TCP there is really no disruption. I mean, the takeaway here is not necessarily that the application-level handover sucks — it's just two seconds; you can cope with that. The problem is that every application would need to implement this mobility inside the application, which is very tricky, okay? So, as an extreme case, we actually took Skype and forced it to go over multipath TCP by closing all the other ports available. We have an HTTP proxy that runs multipath TCP; on the client running multipath TCP we started the regular Skype client, and we blocked everything except the HTTP port. Skype is so good at finding a way out of the machine that it actually falls back to HTTP. So while Skype was going over HTTP, we killed WiFi and 3G in turn and did handover under Skype while it was playing some audio. It's funny to see what happens. I'm not showing it here, but basically you get a small period of silence of one or two seconds and then you hear the audio stream being played at a higher rate as Skype catches up on all the packets. Okay? So that's an extreme example — you really don't want to do voice over TCP — but the handover multipath TCP provides does work for unmodified applications. That's really the takeaway. >>: [inaudible] disadvantages, say when you have a notification of a WiFi loss and you have like 10 applications: they all want to jump on the CPU and reconnect, so you can have a sudden flood of activity on the device where they're all fighting for CPU and fighting for the network. And this might -- [inaudible]. >>: And they probably all wanted the same thing. >>: Right. They all wanted -- >> Costin Raiciu: Yeah. Yeah. I mean, there's lots of applications. Okay. So the last thing I will discuss — okay, I'm a bit over — is how multipath TCP can be used in data centers.
So I've mentioned this idea that you take a collection of links and make them behave like a single resource, and this is the intuition for why multipath should work in data centers. Currently, a TCP connection randomly picks a path and sticks to that path. With multipath TCP you do exactly the same thing, but instead of having a single subflow per connection you have many subflows, and each of those subflows gets placed randomly by equal-cost multipath (ECMP) on different paths. The subflows, as I said, have different five-tuples — they have different ports — and that's why they look like different TCP connections. All right? And if you happen to have collisions on any of your subflows, you can just move traffic away from the congested link onto the uncongested links, okay? So visually this looks like this. This is the case I was showing you before, where I was showing you a collision. Let's say that this black flow is multipath. It will start another subflow, which will most likely happen to hit the idle path. The congestion control will detect this and push traffic away from the congested path onto the uncongested path, where it will get one gigabit. And the effect is that this red flow is happy again because it's getting one gigabit, right, despite the collision. So, just to revisit the example I showed earlier: this was the TCP throughput you would be getting, and in blue we have the multipath TCP throughput. Now, clearly this is not perfect — it's not exactly a hundred percent like you get in theory — but it gets pretty close: the average is close to 90 percent utilization, and even in the worst case you get 600 megabits. So, okay, this is all simulation. So what we did is we actually went on Amazon EC2 to see if we could get some of these benefits in practice.
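The collision argument can be illustrated with a toy Monte Carlo model (mine, not from the talk): treat each ECMP placement as an independent uniform choice among the equal-cost paths — a reasonable stand-in, since each subflow's fresh ephemeral port gives a fresh five-tuple hash — and ask how often a connection fails to find a path free of a competing flow.

```python
# Toy model of ECMP collisions: each flow/subflow is placed uniformly
# at random on one of n_paths equal-cost paths.  A connection is fully
# "collided" only if every one of its subflows lands on the path
# already occupied by a competing single-flow connection.
import random

def all_collide_prob(n_paths, k_subflows, trials=100_000, seed=1):
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        occupied = random.randrange(n_paths)   # competitor's ECMP choice
        if all(random.randrange(n_paths) == occupied
               for _ in range(k_subflows)):
            hits += 1
    return hits / trials

# Four equal-cost paths, as in the talk's example:
print(all_collide_prob(4, 1))  # ~0.25: plain TCP collides 1 time in 4
print(all_collide_prob(4, 2))  # ~0.06: both subflows collide ~1 in 16
print(all_collide_prob(4, 4))  # ~0.004
```

Analytically the probability that all k subflows share the competitor's path falls as (1/n_paths)^k, which is why even two subflows remove most of the worst-case collisions — the congestion controller then shifts the traffic onto whichever subflow found an idle path.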
So, you know, infrastructure as a service — we rented some virtual machines. And it turns out that Amazon EC2 has a multipath topology. When we started out testing this in 2010, I think only one of their availability zones had multipath. Retesting a few months ago, all of the availability zones had multipath. So I think they actually upgraded the network in the older availability zones to multipath. Okay. So we took 40 instances running our multipath TCP kernel, and the experiment we did was very simple: we just iperf'd from every machine to every other machine, periodically, in sequence, using either TCP, multipath with 2 subflows, or multipath with 4 subflows. And here are the numbers. Again, I'm showing the flow rank here and the throughputs on the Y axis. So this is TCP, and multipath with 2 or 4 subflows. You clearly see you get a good benefit in this case; the flows that don't benefit are mostly within the same rack. Of course, the EC2 network is a black box to us — we have no idea exactly why these benefits appear, right? They could be because there is a fat-tree or some other multipath network in there with a lot of cross traffic, and collisions really matter. Or they could be because Amazon is doing per-flow shaping. So we don't really know, okay? This is just qualifying the results. Okay. Yes? >>: How do these actually select the different paths? >> Costin Raiciu: So you don't. >>: Just by virtue of different five-tuples they -- >> Costin Raiciu: Uh-huh. >>: -- chose different paths? >> Costin Raiciu: Exactly. So you don't do anything different in the stack. You just open many subflows, each with a different five-tuple, and then you hope that they'll get placed on different paths. That's basically -- >>: [inaudible] interfaces in which case -- >> Costin Raiciu: Yeah. If you have multiple interfaces then it's a different story.
But if you want to use network-based multipath, then you just use different five-tuples. >>: So there would be some kind of logic on each end to say [inaudible] second subflows, different groups -- >> Costin Raiciu: Absolutely. So these are details that we haven't really worked on too much. In the data center, the way we envision you do this is: you have a sysctl that tells the stack, you know, open this many subflows, maybe this much time after the connection starts — if the connection lasts more than, let's say, 50 milliseconds, so it's not just a quick transfer, then it makes sense to open them, okay? But I'm just throwing 50 milliseconds out there as a number; I don't know if it's a good number, or whether it should be a hundred or 200. Yeah? >>: I'm just wondering how you avoid the wave effect — if everybody gets synchronized into saying, okay, I see congestion, stop sending there, and they all move at the same time to the uncongested path and create this back and forth -- >> Costin Raiciu: So this was like the biggest problem with load-dependent routing, and that's why nobody does, let's say, traffic engineering in real time trying to balance congestion — it's like the holy grail. The theory of multipath congestion control shows this is stable: there are proofs showing it is stable, and our controller is based on that theory. So we haven't seen any oscillations in practice. The caveat is that the controllers shown stable in theory have a different behavior — they're basically exponential controllers, like Scalable TCP; they increase multiplicatively every time. The problem with that style of control is that with small windows it increases very aggressively: if you increase from 1 to 2, that's a very aggressive increase if you have like a thousand flows, because you double the load, more or less, right?
But what stops you from oscillating in that particular case is the timeout, right? You'll be timing out a lot of the time, and statistically it evens out. So the answer is: the theory says it should be stable, and in practice we've seen it's stable. It could only be messy in cases where you have very, very small windows, but there, as I said, the probabilistic nature of timeouts helps you a little bit. So we think it's pretty safe; it shouldn't be a big deal. If you do it in the network instead, it's harder, because you don't know how long the control loop of the flows is. If you make a change, how long should you wait until you change again? Right? Here you're making changes on the natural control loop of the path, which is basically the round-trip time, right? So that's why, intuitively, it should be easier to get stability if you're doing endpoint congestion control rather than doing it in the network. Okay. So designing a multipath TCP isn't difficult — it's pretty easy to do something that works. But designing a deployable one is much, much more difficult, and that actually took the biggest part of this work, just because our Internet architecture is evolving all the time, right? What this means is that you need to put in defensive mechanisms to detect when things go wrong, and you need to fall back to TCP. So let me give you an example of a defensive mechanism. There are middleboxes out there that will modify your TCP payload. For instance, an FTP middlebox will rewrite the IP address in the control channel: if it's a NAT doing FTP, it will rewrite the IP address written by the source to be the IP address of the NAT, right? And the IP address is written in ASCII, which means the length changes — so you can lose or add a few bytes. You don't know.
Now, that could really mess up a multipath TCP connection, because I've sent a few segments here but one of them got shrunk or got bigger, so when I put them in order, I will send garbage to the application — I don't know how to put them in order. So what do you do in this case? What we did is basically: well, you really have to checksum the data, right? So the data comes with a checksum that's carried as an option, and when multipath TCP detects a checksum failure, you have two options. If you have other working subflows, you just reset the subflow that has the checksum failure and never use it again. If that's your only subflow, then we have a handshake that allows you to fall back to TCP — but you never come back to multipath again. So from then on we're doing plain TCP and we let the middlebox make its changes; we're not doing multipath TCP anymore, just TCP. Okay. And this is not just multipath TCP: if you want to make deployable changes to TCP, you need to take all of this into account. That's basically the lesson we learned the hard way. And I'll finish with sort of an advertising slide, which is: multipath topologies really need multipath transport. And multipath TCP can be used by unchanged applications on today's networks — you can try it out. And the biggest sort of theoretical breakthrough, and the reason it's such a nice thing, is this ability to move traffic away from congestion and take multiple links and make them look like a single shared link. The protocol itself is the engineering part that was really difficult, but this is the smart part, right? This is why it gives you such nice benefits. And with that, I'm done. Thanks. [applause]. >> Costin Raiciu: Yeah? >>: [inaudible]. >> Costin Raiciu: So, I mean, I've been talking to the Google guys, to the Facebook guys and so forth.
The Google guys made it pretty clear that they are reluctant to touch it until it's in the mainline kernel. The code is maintained by our colleagues from Belgium — they are on the same project and they did most of the protocol implementation. We have funding for this for the next couple of years, so the push is to just try to get it into the mainline. Currently we have one of the stack committers tutoring us on how to restructure the big patch we have — it's like a 10,000-line patch which touches all of the TCP stack, so it's a big change. My guess is that once it's in the mainline, people will start playing with it. Before that, they can play with it in a simple way — they can just see whether there are benefits — but I'm not sure they're willing to sign up for more than that: if you use an experimental kernel like the one we have, the problem is that the mainline kernel changes all the time, so you have to keep porting the patches. And that's a huge effort. I mean, a huge effort. So realistically speaking, I think we will see it in data centers when we have it in the mainline — or rather, after we have it in the mainline. And it's probably a bit optimistic to say that [inaudible] in the mainline [inaudible]. But on mobiles I think the story is a bit different, because the push seems to be much stronger there — for mobility, there's a bunch of companies interested, trying it out, doing experiments and stuff. So in that particular case it might be sooner. >>: [inaudible] that you either want to do one or the other and rarely both. You know, 3G is good when I walk outside and it's a great connection. But if I had WiFi I would never use [inaudible]. >> Costin Raiciu: Absolutely. Yeah. So you can use it that way — we have a paper in a SIGCOMM workshop this year, and basically this is what we looked at.
We look at what happens if you use both at the same time, if you just use one, and so forth. If you just use one, you'll have some glitches when you switch from one to the other, because, for instance, if you're using WiFi and you switch to 3G then you have a startup time for 3G: even if 3G is connected, until you get the radio allocation it takes a few seconds. But that's no problem — the implementation currently allows you to say this interface is in backup mode and this is the primary interface. So you instruct the kernel to prefer one path over the other. That's not a big deal; you can do it. Yeah? >>: Do you have Android? Sorry. >>: So, to avoid [inaudible], have you considered using something like SSL extensions to handle [inaudible]? >> Costin Raiciu: So, yeah, I think that's a tricky path to go down. If you look at how the whole Internet architecture has been evolving, it's been a story of encapsulation, right? And encapsulating multipath TCP over something like SSL is the ultimate in encapsulation, because if multipath TCP becomes the standard, then everything is encapsulated over SSL. And I'm one of the skeptics of this evolution. A lot of the people I talk to say: yeah, but that's the end — once you encrypt it, they're done. And I don't think so; I think it's just a step. Say the government comes and says: you install this key in your browser, or else you don't get Internet. Okay? Once you install this key in your browser, what will happen is they'll break the SSL connection, they'll terminate it, they'll have a middlebox that looks very carefully at your traffic, and then they'll forward it. And then nothing's stopping them from doing there everything they are doing now, right? So I think it's really a race to the bottom, all of this encapsulation.
I mean, my good friend Lucian says that we should use HTTP as a [inaudible] — that's even scarier. For certain deployments it makes a lot of sense; in the long term I think we will lose. I don't know. Yeah? >>: So [inaudible] no change to the application. I'm curious: what if you take an application like Skype that's already reacting to change notifications about Internet connections appearing and disappearing — do they suddenly start behaving properly with MPTCP, or do they start inappropriately tearing down connections that would have worked? >> Costin Raiciu: So, our experience with Skype — I can't give you firsthand information because I didn't run the experiment — has been that it was no problem killing one interface and letting Skype move to the other. Not even forcing it. The thing is, normally if your interface dies, the kernel will tell you your socket has died — you get an abort from the socket. That doesn't happen with multipath TCP, okay? The kernel just keeps feeding you and you're happy, right? I mean, if an application out there actually does [inaudible] over whatever, then it could get confused by what's going on, right? But for Skype in particular, it just worked. As I said, we had the flow working over WiFi; you kill WiFi, it moves to 3G, and nothing happens. I mean -- >>: [inaudible] vast majority of well-behaved TCP. We have seen programs that listen to IP-change or route notifications, and if they hear a route-change notification that they think should kill their connection, they'll just tear it down. >> Costin Raiciu: Yeah. In that particular case it should probably still work correctly. If they kill a connection, nothing bad happens, right? It's just that they're taking it into their own hands. I mean, yeah. Yeah. >>: So [inaudible].
>> Costin Raiciu: How much [inaudible], sorry? >>: Can it help latencies? >> Costin Raiciu: Latencies. >>: Like RTTs. Not only [inaudible]. >> Costin Raiciu: So there's two things. I think multipath TCP mostly benefits you if you want to do bulk transfer, right? That's where I think most of the benefits are. Because of the mechanisms we have, with the [inaudible] sequence numbers and so forth, you could imagine using this to get better per-packet latency. In general, though, per-packet latency will be higher with multipath TCP, because one path can have a higher delay, and then a packet cannot be passed up to the application until the higher-delay packet gets there, because they need to be passed up in order. I was chatting with some NetApp guys at some point and they were saying, well, we might use this to, for instance, send the same data over multiple subflows for redundancy, okay, and make sure that, whatever, the first one gets there. And that's completely fine — the protocol allows that; you can easily do it. I just have no idea how people will want to use it. But in general, if you really care about packet-level delay, this is probably not for you — and for short transactions it's not for you either, because by the time you set up the second subflow, the connection is gone. And this is actually what happens in practice: you start the connection, the stack of course starts pushing packets on the initial subflow, and then at some point — I think when you [inaudible] or something — the second subflow gets created, right? By that time, in a data center for instance, most connections will be gone anyway. So -- >> Jitu Padhye: All right. Let's thank the speaker. >> Costin Raiciu: Thank you. [applause]