Multipath TCP

Costin Raiciu (University Politehnica of Bucharest), Christoph Paasch (Université catholique de Louvain)
Joint work with: Mark Handley, Damon Wischik (University College London), Olivier Bonaventure, Sébastien Barré (Université catholique de Louvain), and many, many others.

Networks are becoming multipath
• Mobile devices have multiple wireless connections.
• Datacenters have redundant topologies.
• Servers are multi-homed.

How do we use these networks? TCP: used by most applications, it offers byte-oriented reliable delivery and adjusts its load to network conditions [Labovitz et al. – Internet Inter-domain Traffic – SIGCOMM 2010].

TCP is single path
A TCP connection:
• uses a single path in the network, regardless of network topology;
• is tied to the source and destination addresses of the endpoints.
This mismatch between network and transport creates problems.

Poor Performance for Mobile Users
A device connected through a 3G cell tower can offload to WiFi, but when it does, all ongoing TCP connections die.

Collisions in datacenters
In redundant datacenter topologies [Al-Fares et al. – A Scalable, Commodity Data Center Network Architecture – SIGCOMM 2008], single-path TCP flows can collide on the same link, and these collisions reduce throughput [Raiciu et al. – SIGCOMM 2011].

Multipath TCP
Multipath TCP (MPTCP) is an evolution of TCP that can effectively use multiple paths within a single transport connection.
• Supports unmodified applications
• Works over today's networks
• Standardized at the IETF (almost there)

Multipath TCP components
• Connection setup
• Sending data over multiple paths
• Encoding control information
• Dealing with (many) middleboxes [Raiciu et al. – NSDI 2012]
• Congestion control [Wischik et al. – NSDI 2011]

MPTCP Connection Management
An MPTCP connection (FLOW Y) consists of one or more subflows. Each subflow keeps its own state: a congestion window (CWND) and send and receive sequence numbers (Snd.SEQNO, Rcv.SEQNO). [Figure: a connection that starts on subflow 1 and later adds subflow 2.]

TCP Packet Header
[Figure: the 20-byte TCP header with source port, destination port, sequence number, acknowledgment number, header length, reserved, code bits, receive window, checksum, urgent pointer, followed by 0–40 bytes of options and then the data.]

Sequence Numbers
Packets go over multiple paths:
– we need sequence numbers to put them back in sequence;
– we need sequence numbers to infer loss on a single path.
Options:
– one sequence space shared across all paths?
– one sequence space per path, plus an extra one to put data back in the correct order at the receiver?

Sequence Numbers
• One sequence space per path is preferable.
– Loss inference is more reliable.
– Some firewalls/proxies expect to see all the sequence numbers on a path.
• The outer TCP header holds the subflow sequence numbers.
– Where do we put the data sequence numbers?
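This split into per-subflow sequence numbers plus a connection-level data sequence space is enough for the receiver to rebuild the byte stream, wherever the data sequence numbers end up being carried (addressed next). A minimal sketch in Python; the class and names are mine, purely illustrative, not the kernel implementation:

```python
# Each subflow delivers bytes in order using its own subflow sequence numbers;
# the data sequence numbers (DSNs) then let the receiver stitch the stream
# back together across subflows.
class MptcpReceiveBuffer:
    def __init__(self):
        self.out_of_order = {}   # DSN -> payload delivered by some subflow
        self.next_dsn = 0        # next connection-level byte we can release
        self.stream = bytearray()

    def on_subflow_delivery(self, dsn, payload):
        """Called once a subflow has delivered this segment in subflow order."""
        self.out_of_order[dsn] = payload
        # Release every byte that is now contiguous at the connection level.
        while self.next_dsn in self.out_of_order:
            data = self.out_of_order.pop(self.next_dsn)
            self.stream += data
            self.next_dsn += len(data)

rx = MptcpReceiveBuffer()
rx.on_subflow_delivery(1000, b"second")    # arrives first, via subflow 2
rx.on_subflow_delivery(0, b"x" * 1000)     # fills the gap, via subflow 1
assert bytes(rx.stream) == b"x" * 1000 + b"second"
```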
MPTCP Packet Header
[Figure: the MPTCP header keeps the regular 20-byte TCP layout (subflow source port, subflow destination port, subflow sequence number, subflow acknowledgment number, header length, reserved, code bits, receive window, checksum, urgent pointer); the data sequence number and the Data ACK are carried inside the TCP options (0–40 bytes), ahead of the data.]

MPTCP Operation
[Figure: data flowing over both subflows of FLOW Y. A segment on subflow 1 carries subflow SEQ 1000 and, in its options, data sequence number DSEQ 10000; the next segment goes out on subflow 2 with subflow SEQ 5000 and DSEQ 11000. The acknowledgment returning on subflow 1 carries both the subflow ACK (2000) and the Data ACK (11000). The data first sent on subflow 2 (DSEQ 11000) can later be sent on subflow 1 as well, there with subflow SEQ 2000.]

Multipath TCP Congestion Control
Packet switching 'pools' circuits. Multipath 'pools' links. TCP controls how a link is shared; how should a pool be shared? [Figure: two circuits vs. a link; two separate links vs. a pool of links.]

Design goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks.
When a Multipath TCP flow with two subflows shares a bottleneck with a regular TCP flow, Multipath TCP should, to be fair, take as much capacity as TCP at the bottleneck link, no matter how many subflows it is using.

Design goal 2: MPTCP should use efficient paths.
Three flows share three 12 Mb/s links; each flow has a choice of a 1-hop and a 2-hop path. How should each flow split its traffic?
• If each flow splits its traffic 1:1 between the two paths, each gets 8 Mb/s.
• If each flow splits its traffic 2:1, each gets 9 Mb/s.
• If each flow splits its traffic 4:1, each gets 10 Mb/s.
• If each flow splits its traffic ∞:1 (all traffic on the 1-hop path), each gets 12 Mb/s.
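These numbers follow from a little arithmetic: reading the topology as symmetric, each link carries one flow's 1-hop traffic plus the 2-hop traffic of two other flows, so if every flow sends rate a on its 1-hop path and b on its 2-hop path, then a + 2b = 12 Mb/s per link and the per-flow throughput is a + b. A quick check in Python (the function name is mine):

```python
# Per-flow throughput in the three-link example: each 12 Mb/s link carries
# a + 2b, where a is a flow's 1-hop rate and b its 2-hop rate.
def per_flow_throughput(split):      # split = a/b, the 1-hop:2-hop ratio
    b = 12 / (split + 2)
    return split * b + b

for split in (1, 2, 4, 1000):        # 1000:1 approximates the "infinity:1" case
    print(f"{split}:1 split -> {per_flow_throughput(split):.1f} Mb/s per flow")
# prints 8.0, 9.0, 10.0 and ~12.0 Mb/s, matching the slides
```

The more traffic each flow shifts onto its uncongested 1-hop path, the better off every flow is, which is exactly what design goal 2 asks the congestion controller to do automatically.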
Design goal 3: MPTCP should get at least as much as TCP on the best path.
Example: a WiFi path with high loss but small RTT, and a 3G path with low loss but high RTT. Design goal 2 says to send all your traffic on the least congested path, in this case 3G. But the 3G path has a high RTT, hence it would give low throughput.

How does TCP congestion control work?
Maintain a congestion window w.
• Increase w for each ACK, by 1/w.
• Decrease w for each drop, by w/2.

How does MPTCP congestion control work?
Maintain a congestion window w_r, one window for each path, where r ∊ R ranges over the set of available paths.
• Increase w_r for each ACK on path r, by min(a/w_total, 1/w_r), where w_total is the sum of the windows over all paths. The a/w_total term shifts traffic away from congested paths (goal 2); the choice of a and the 1/w_r cap ensure fairness at shared bottlenecks and at least the throughput of TCP on the best path (goals 1 and 3).
• Decrease w_r for each drop on path r, by w_r/2.

Applications of Multipath TCP

At a multihomed web server, MPTCP tries to share the 'pooled access capacity' fairly.
Setup: a server with two 100 Mb/s access links, with 2 regular TCP flows on one link and 4 on the other; MPTCP flows use both links.
• 0 MPTCP flows: 2 TCPs @ 50 Mb/s, 4 TCPs @ 25 Mb/s.
• 1 MPTCP flow: 2 TCPs @ 33 Mb/s, 1 MPTCP @ 33 Mb/s, 4 TCPs @ 25 Mb/s.
• 2 MPTCP flows: 2 TCPs @ 25 Mb/s, 2 MPTCPs @ 25 Mb/s, 4 TCPs @ 25 Mb/s. The total capacity, 200 Mb/s, is shared out evenly between all 8 flows.
• 3 MPTCP flows: 2 TCPs @ 22 Mb/s, 3 MPTCPs @ 22 Mb/s, 4 TCPs @ 22 Mb/s; 200 Mb/s shared evenly between all 9 flows.
• 4 MPTCP flows: 2 TCPs @ 20 Mb/s, 4 MPTCPs @ 20 Mb/s, 4 TCPs @ 20 Mb/s; 200 Mb/s shared evenly between all 10 flows.
It's as if they were all sharing a single 200 Mb/s link: the two links can be said to form a 200 Mb/s pool.

We confirmed in experiments that MPTCP nearly manages to pool the capacity of the two access links. Setup: two 100 Mb/s access links, 10 ms delay; 5 TCPs on one link, 15 TCPs on the other, and first 0, then 10 MPTCP flows (first 20 flows in total, then 30). [Figure: throughput per flow over time.]

MPTCP makes a collection of links behave like a single large pool of capacity: if the total capacity is C and there are n flows, each flow gets throughput C/n.

Multipath TCP can pool datacenter networks
Instead of using one path for each flow, use many random paths. Don't worry about collisions; just don't send (much) traffic on colliding paths.
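To see how these window rules implement "don't send (much) traffic on colliding paths", here is a small simulation sketch in Python. It is only qualitative: it assumes the coupled increase given above with a fixed a = 1 (in practice a is adapted, see the backup slides) and it ignores RTT differences. The subflow that sees more loss, for instance because its path collides with other flows, ends up with the smaller window, so most traffic moves to the other path.

```python
import random

# Two subflows with different loss rates, driven by the coupled window rule:
# per ACK on subflow r, w[r] += min(a / sum(w), 1 / w[r]); per drop, w[r] /= 2.
def simulate(loss=(0.01, 0.05), rounds=20000, a=1.0, seed=1):
    random.seed(seed)
    w = [10.0, 10.0]                          # one congestion window per subflow
    for _ in range(rounds):
        for r in range(2):
            if random.random() < loss[r]:     # a drop on subflow r
                w[r] = max(w[r] / 2, 1.0)
            else:                             # an ACK on subflow r
                w[r] += min(a / sum(w), 1.0 / w[r])
    return w

print(simulate())    # the low-loss subflow keeps a much larger window
```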
Just don’t send (much) traffic on colliding paths Multipath TCP in data centers Multipath TCP in data centers MPTCP better utilizes the FatTree network MPTCP on EC2 • Amazon EC2: infrastructure as a service – We can borrow virtual machines by the hour – These run in Amazon data centers worldwide – We can boot our own kernel • A few availability zones have multipath topologies – 2-8 paths available between hosts not on the same machine or in the same rack – Available via ECMP Amazon EC2 Experiment • 40 medium CPU instances running MPTCP • For 12 hours, we sequentially ran all-to-all iperf cycling through: – TCP – MPTCP (2 and 4 subflows) MPTCP improves performance on EC2 Same Rack Implementing Multipath TCP in the Linux Kernel Linux Kernel MPTCP About 10000 lines of code in the Linux Kernel Initially started by Sébastien Barré Now, 3 actively working on Linux Kernel MPTCP Christoph Paasch Fabien Duchêne Gregory Detal Freely available at http://mptcp.info.ucl.ac.be MPTCP-session creation Application creates regular TCPsockets MPTCP-session creation The Kernel creates the Meta-socket MPTCP creating new subflows The Kernel handles the different MPTCP subflows MPTCP Performance with apache 100 simultaneous HTTP-Requests, total of 100000 MPTCP Performance with apache 100 simultaneous HTTP-Requests, total of 100000 MPTCP Performance with apache 100 simultaneous HTTP-Requests, total of 100000 MPTCP on multicore architectures Flow-to-core affinity steers all packets from one TCP-flow to the same core. MPTCP has lots of L1/L2 cache-misses because the individual subflows are steered to different CPU-cores MPTCP on multicore architectures MPTCP on multicore architectures Solution: Send all packets from the same MPTCP-session to the same CPU-core Based on Receive-Flow-Steering implementation in Linux (Author: Tom Herbert from Google) MPTCP on multicore architectures Multipath TCP on Mobile Devices MPTCP over WiFi/3G TCP over WiFi/3G MPTCP over WiFi/3G MPTCP over WiFi/3G MPTCP over WiFi/3G MPTCP over WiFi/3G MPTCP over WiFi/3G MPTCP over WiFi/3G MPTCP over WiFi/3G WiFi to 3G handover with Multipath TCP A mobile node may lose its WiFi connection. Regular TCP will break! Some applications support recovering from a broken TCP (HTTP-Header Range) Thanks to the REMOVE_ADDR-option, MPTCP is able to handle this without the need for application support. WiFi to 3G handover with Multipath TCP WiFi to 3G handover with Multipath TCP WiFi to 3G handover with Multipath TCP WiFi to 3G handover with Multipath TCP WiFi to 3G handover with Multipath TCP WiFi to 3G handover with Multipath TCP WiFi to 3G handover with Multipath TCP Related Work Multipath TCP has been proposed many times before – First by Huitema (1995),CMT, pTCP, M-TCP, … You can solve mobility differently – At different layer: Mobile IP, HTTP range – At transport layer: Migrate TCP, SCTP You can deal with datacenter collisions differently – Hedera (Openflow + centralized scheduling) Multipath topologies need multipath transport Multipath TCP can be used by unchanged applications over today’s networks MPTCP moves traffic away from congestion, making a collection of links behave like a single pooled resource Backup Slides Packet-level ECMP in datacenters How does MPTCP congestion control work? 107 Maintain a congestion window wr, one window for each path, where r ∊ R ranges over the set of available paths. 
Backup Slides

Packet-level ECMP in datacenters
[Figure.]

How does MPTCP congestion control work? (backup)
Maintain a congestion window w_r, one window for each path, where r ∊ R ranges over the set of available paths.
• Increase w_r for each ACK on path r, by min(a/w_total, 1/w_r).
– Design goals 1 and 3: at any potential bottleneck S that path r might be in, look at the best that a single-path TCP could get, and compare it to what I'm getting.
– Design goal 2: we want to shift traffic away from congestion. To achieve this, we increase windows in proportion to their size.
• Decrease w_r for each drop on path r, by w_r/2.
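The slides leave the increase parameter a implicit; it is the quantity that falls out of the "compare with the best single-path TCP" argument above. For reference (this is not spelled out in the slides), the choice described in [Wischik et al. – NSDI 2011] and later standardized in RFC 6356 is, writing RTT_r for the round-trip time of path r:

```latex
a \;=\; w_{\text{total}} \cdot
        \frac{\max_r \left( w_r / \mathrm{RTT}_r^{2} \right)}
             {\left( \sum_r w_r / \mathrm{RTT}_r \right)^{2}}
```

With this choice the aggregate rate of all subflows is pushed up to roughly what a single TCP would achieve on the best of the available paths, while the min(a/w_total, 1/w_r) cap keeps any one subflow from being more aggressive than a regular TCP on its path.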