Solving the TCP-incast Problem with Application-Level Scheduling
Maxim Podlesny, University of Waterloo
Carey Williamson, University of Calgary

Motivation
• Emerging IT paradigms
  – Data centers, grid computing, HPC, multi-core
  – Cluster-based storage systems, SAN, NAS
  – Large-scale data management "in the cloud"
  – Data manipulation via "services-oriented computing"
• Cost and efficiency advantages from IT trends, economies of scale, and a specialization marketplace
• Performance advantages from parallelism
  – Partition/aggregation, MapReduce, BigTable, Hadoop
  – Think RAID at Internet scale! (1000x)

Problem Statement
How to provide high goodput for data center applications?
• High-speed, low-latency network (RTT ≤ 0.1 ms)
• Highly-multiplexed link (e.g., 1000 flows)
• Highly-synchronized flows on the bottleneck link
• Limited switch buffer size (e.g., 32 KB)
[Figure: N synchronized senders overflow the switch buffer, leading to TCP retransmission timeouts and TCP throughput degradation]

Related Work
• E. Krevat et al., "On Application-based Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems", Proceedings of SuperComputing 2007
• A. Phanishayee et al., "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", Proceedings of FAST 2008
• Y. Chen et al., "Understanding TCP Incast Throughput Collapse in Datacenter Networks", Proceedings of WREN 2009
• V. Vasudevan et al., "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication", Proceedings of ACM SIGCOMM 2009
• M. Alizadeh et al., "Data Center TCP", Proceedings of ACM SIGCOMM 2010
• A. Shpiner et al., "A Switch-based Approach to Throughput Collapse and Starvation in Data Centers", Proceedings of IWQoS 2010

Summary of Related Work
• Data centers have specific network characteristics
• The TCP-incast throughput collapse problem emerges
• Possible solutions:
  – Tweak TCP timers and/or parameters for this environment
  – Redesign (or replace!) TCP in this environment
  – Rewrite applications for this environment (Facebook)
  – Increase switch buffer sizes (extra queueing delay!)
  – Smart edge coordination for uploads/downloads

Data Center System Model
[Figure: N servers (1, 2, 3, …, N) connect through a switch with a small buffer B to a client over a link of capacity C; a logical data block of size S (e.g., 1 MB) is fetched as per-server Server Request Units (SRUs), e.g., 32 KB each, carried in data packets of size S_DATA]
A small configuration sketch of this system model follows below.
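As an illustration only (not part of the original slides), the system model above can be captured as a small configuration record. The field names are assumptions; the numeric values in the example instance are the illustrative figures quoted on the Data Center System Model and Performance Comparisons slides.

```python
from dataclasses import dataclass


@dataclass
class DataCenterModel:
    """Parameters of the data center system model (illustrative field names)."""
    num_servers: int       # N: servers answering one client request
    block_size: int        # S: logical data block in bytes (e.g., 1 MB)
    sru_size: int          # Server Request Unit in bytes (e.g., 32 KB)
    buffer_size: int       # B: shared switch buffer in bytes (e.g., 32 KB)
    link_capacity: float   # C: bottleneck link capacity in bit/s (e.g., 1 Gbps)
    rtt: float             # round-trip time in seconds (e.g., 0.1 ms)
    packet_size: int       # S_DATA: data packet size in bytes


# Example instance using the illustrative numbers from the slides.
model = DataCenterModel(num_servers=1000, block_size=1_000_000,
                        sru_size=32_768, buffer_size=32_768,
                        link_capacity=1e9, rtt=1e-4, packet_size=1_000)
```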
Performance Comparisons
Internet vs. data center network:
• Internet propagation delay: 10-100 ms
• Data center propagation delay: 0.1 ms
• With a packet size of 1 KB and a link capacity of 1 Gbps, the packet transmission time is ≈ 0.01 ms

Analysis Overview (1 of 2)
• Determine the maximum TCP flow concurrency (n) that can be supported without any packet loss
• Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling

Analysis Overview (2 of 2)
• Determine the maximum TCP flow concurrency (n) that can be supported without any packet loss
  – Determine the flow size in packets (based on SRU and MSS)
  – Determine the maximum outstanding packets per flow (Wmax)
  – Determine the maximum flow concurrency (based on B and Wmax)
• Arrange the servers into k groups of (at most) n servers each, by staggering the group scheduling

Determining Wmax
• Recall TCP slow-start dynamics:
  – The initial TCP congestion window (cwnd) is 1 packet
  – ACKs cause cwnd to double every RTT (1, 2, 4, 8, 16, …)
• Consider a TCP transfer of an arbitrary SRU (e.g., 21 packets)
• Determine the peak power-of-two cwnd value (WA)
• Determine the "residual window" for the last RTT (WB)
• Wmax depends on both WA and WB (e.g., WA + WB/2)

Scheduling Overview
[Figure: the N servers are partitioned into groups of n servers each; the groups respond one after another]

Scheduling Details
Using lossless scheduling of server responses: at most n servers respond simultaneously, with k groups of responding servers scheduled one after another.
Server i (1 ≤ i ≤ N) starts responding at: [start-time formula not reproduced]

Theoretical Results
The maximum goodput of an application in a data center with lossless scheduling is:

  g = S / (T̃·(k − 1) + T + d_max)

where:
• S – size of a logical data block
• T – actual completion time of an SRU
• T̃ – SRU completion time used for scheduling
• k – number of groups of servers used
• d_max – real-system scheduling variance
(A code sketch combining the Wmax, scheduling, and goodput calculations appears at the end of this deck.)

Results (graphs not reproduced in this text version):
• Analytical Model Results
• Results for 10 KB Fixed SRU Size (1 of 2)
• Results for 10 KB Fixed SRU Size (2 of 2)
• Results for Varied SRU Size (1 MB / N)
• Effect of TCP Timer Granularity

Summary and Conclusion
• Application-level scheduling addresses TCP-incast throughput collapse
• Main idea: schedule the responses of servers so that there are no losses
• Derived the maximum goodput achievable with lossless scheduling
• Goodput is non-monotonic and highly sensitive to network configuration parameters

Future Work
• Implementing and testing our solution in real data centers
• Evaluating our solution for different application traffic scenarios
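To make the analysis steps concrete, here is a minimal Python sketch that strings together the quantities named on the Analysis Overview, Determining Wmax, Scheduling Details, and Theoretical Results slides. The slides do not give explicit rules for WA, WB, n, or the per-server start times, so the choices below (largest full power-of-two slow-start window, n = ⌊B / (Wmax·MSS)⌋, start time = group index × T̃) are assumptions for illustration only; the goodput expression follows the formula on the Theoretical Results slide.

```python
import math


def slow_start_windows(flow_size_pkts: int):
    """Peak power-of-two cwnd (WA) and residual last-RTT window (WB) for a
    slow-start transfer of flow_size_pkts packets (one plausible reading of
    the "Determining Wmax" slide)."""
    sent, cwnd = 0, 1
    while sent + cwnd <= flow_size_pkts:
        sent += cwnd
        cwnd *= 2
    wa = cwnd // 2 if sent > 0 else 0   # last full power-of-two window
    wb = flow_size_pkts - sent          # residual packets sent in the last RTT
    return wa, wb


def max_outstanding_packets(sru_bytes: int, mss_bytes: int) -> float:
    """Wmax, using the example combination WA + WB/2 quoted on the slide."""
    flow_size = math.ceil(sru_bytes / mss_bytes)
    wa, wb = slow_start_windows(flow_size)
    return wa + wb / 2


def start_times(num_servers: int, sru_bytes: int, mss_bytes: int,
                buffer_bytes: int, t_sched: float):
    """Start times for servers 1..N under staggered group scheduling.
    The concurrency bound n = floor(B / (Wmax * MSS)) and the rule
    t_i = T~ * group_index are assumptions; the slides only state that at
    most n servers respond at once, in k groups."""
    wmax = max_outstanding_packets(sru_bytes, mss_bytes)
    n = max(1, int(buffer_bytes // (wmax * mss_bytes)))  # flows fitting in buffer B
    return [t_sched * ((i - 1) // n) for i in range(1, num_servers + 1)]


def max_goodput(block_bytes: float, k: int, t_sched: float,
                t_actual: float, d_max: float) -> float:
    """Maximum goodput g = S / (T~ * (k - 1) + T + d_max) from the
    Theoretical Results slide."""
    return block_bytes / (t_sched * (k - 1) + t_actual + d_max)
```

As a usage example under these assumptions, a 32 KB SRU with a 1 KB MSS gives a flow size of 32 packets, WA = 16, WB = 1, and Wmax = 16.5; with a 32 KB switch buffer the concurrency bound works out to n = 1, consistent with the deck's point that very small switch buffers force heavily staggered responses.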