Tail Latency: Networking

The story thus far
• Tail latency is bad
• Causes:
  – Resource contention with background jobs
  – Device failure
  – Uneven split of data between tasks
  – Network congestion for reducers

Ways to address tail latency
• Clone all tasks
• Clone slow tasks
• Copy intermediate data
• Remove/replace frequently failing machines
• Spread out reducers

What is missing from this picture?
• Networking: spreading out reducers is not sufficient
• The network is extremely crucial
  – Studies on Facebook traces show that [Orchestra]:
    • in 26% of jobs, shuffle is 50% of the runtime
    • in 16% of jobs, shuffle is more than 70% of the runtime
    • 42% of tasks spend over 50% of their time writing to HDFS

User-facing online services
• Request-response workflow: a request fans out from a top-level aggregator through mid-level aggregators to workers, and responses are combined on the way back up
• Application SLAs cascade: a 200 ms page SLA decomposes into SLAs for the components at each level of the hierarchy
[Figure: aggregator/worker tree annotated with per-component latency budgets (roughly 4-45 ms each) within the 200 ms page SLA.]
• Network SLAs: deadlines on communication between components
• Flow deadlines: a flow is useful if and only if it satisfies its deadline
• Today's transport protocols are deadline-agnostic and strive for fairness

What is the Network's Impact?
• [DCTCP] analyzed the RTT distribution from data center measurements:
  – The median RTT is 334 μs, but 6% of RTTs take over 2 ms
  – RTTs can be as high as 14 ms
• Network delays alone can consume a data retrieval's entire time budget
  (Source: Data Center TCP (DCTCP), SIGCOMM 2010)

What is the Impact of the Network?
• Assume a 10 ms deadline for tasks [DCTCP]
• Simulate job completion times based on the distribution of task completion times, focusing on the 99.9th percentile (a simulation sketch follows the scalability slide below)
• For 40 tasks, about 4 tasks (14%) fail; for 400 tasks, about 14 tasks (3%) fail

Impact on Page Creation
• Under the measured RTT distribution, 150 data retrievals take 200 ms (ignoring computation time)
• As Facebook is already at 130 data retrievals per page, it needs to address network delays

Other implication of the Network: It Limits Scalability
• The scalability of a Netflix-like recommendation system is bottlenecked by communication
• It did not scale beyond 60 nodes
  – Communication time increased faster than computation time decreased
[Figure: iteration time (s), split into communication and computation, vs. number of machines (10, 30, 60, 90).]
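To make the "simulate job completion times" step above concrete, here is a minimal Monte Carlo sketch. The task-latency distribution below is a made-up long-tailed stand-in loosely inspired by the DCTCP measurements (sub-millisecond median, a small fraction of multi-millisecond outliers), so the printed miss rates illustrate the trend (more fan-out means more deadline misses), not the paper's exact numbers.

```python
import random

# Made-up long-tailed task latency (ms), loosely inspired by the DCTCP
# measurements (median RTT ~0.334 ms, ~6% of RTTs above 2 ms, up to ~14 ms).
def task_latency_ms():
    if random.random() < 0.06:             # rare slow path: queueing delays
        return random.uniform(2.0, 14.0)
    return random.uniform(0.1, 1.0)        # common fast path

def job_miss_rate(num_tasks, deadline_ms=10.0, trials=10_000):
    """A job completes when its slowest parallel task completes, so the
    job-level miss rate is driven by the per-task tail."""
    misses = sum(
        max(task_latency_ms() for _ in range(num_tasks)) > deadline_ms
        for _ in range(trials)
    )
    return misses / trials

for n in (40, 400):
    print(f"{n:4d} parallel tasks -> ~{job_miss_rate(n):.0%} of jobs miss 10 ms")
```

Even a per-task tail probability of a fraction of a percent compounds quickly: a job over N tasks misses its deadline unless all N tasks beat it.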
What Causes this Variation in Network Transfer Times?
• First, let's look at the types of traffic in the network
• Background traffic
  – Latency-sensitive short control messages, e.g., heartbeats, job status
  – Large files, e.g., HDFS replication, loading of new data
• Map-reduce jobs
  – Small RPC requests/responses with tight deadlines
  – HDFS reads or writes with tight deadlines

What Causes this Variation in Network Transfer Times?
• No notion of priority
  – Latency-sensitive and non-latency-sensitive traffic share the network equally
• Uneven load balancing
  – ECMP doesn't schedule flows evenly across all paths
  – It assumes long and short flows are the same
• Bursts of traffic
  – Networks have buffers, which reduce loss but introduce latency (time spent waiting in a buffer is variable)
  – Kernel optimizations introduce burstiness

Ways to Eliminate Variation and Improve Tail Latency
• Make the network faster
  – HULL, DeTail, DCTCP
  – Faster networks == smaller tail
• Optimize how applications use the network
  – Orchestra, CoFlows
  – Big-data transfers follow specific patterns; optimize the patterns to reduce transfer time
• Make the network aware of deadlines
  – D3, PDQ
  – Tasks have deadlines; there is no point doing any work if the deadline won't be met
  – Try to prioritize flows and schedule them based on their deadlines

Fair Sharing or Deadline-Based Sharing?
• Fair sharing (status quo)
  – Everyone plays nice, but some deadlines can be missed
• Deadline-based sharing
  – Deadlines are met, but it may require a non-trivial implementation
• Two ways to do deadline-based sharing
  – Earliest deadline first (PDQ)
  – Make bandwidth reservations for each flow
    • Flow rate = flow size / flow deadline
    • Flow size and deadline are known a priori

Limitations of Fair Sharing
• Case for unfair sharing: flow f1 has a 20 ms deadline and flow f2 a 40 ms deadline. With fair sharing (the status quo), both flows progress at half rate and f1 misses its deadline; with deadline-aware scheduling, f1 is served first and both deadlines are met.
• Case for flow quenching: six flows share a link, each with a 30 ms deadline, and there is insufficient bandwidth to satisfy all deadlines. With fair sharing, all flows miss their deadlines (empty response); with deadline awareness, one flow can be quenched so that all remaining flows make their deadlines (partial response).
[Figure: flow completion timelines under the status quo vs. deadline-aware sharing for both cases.]
(The two sketches below work through these cases numerically.)
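A small sketch of the first case, assuming a single bottleneck link with capacity 1 data-unit/ms and hypothetical flow sizes (12 and 16 units); the 20 ms and 40 ms deadlines are the ones from the slide. Fair sharing (modeled as processor sharing) misses f1's deadline, while an idealized earliest-deadline-first schedule, in the spirit of PDQ, meets both.

```python
# One bottleneck link, capacity 1 data-unit per ms (assumed).
# Flow sizes (12, 16) are illustrative; deadlines (20, 40 ms) are from the slide.
FLOWS = [("f1", 12, 20), ("f2", 16, 40)]   # (name, size, deadline_ms)

def fair_share(flows, capacity=1.0):
    """Processor sharing: active flows split the link equally, and the
    survivors speed up whenever a flow finishes."""
    remaining = {name: size for name, size, _ in flows}
    finish, now = {}, 0.0
    while remaining:
        rate = capacity / len(remaining)
        name = min(remaining, key=remaining.get)    # next flow to finish
        step = remaining[name] / rate
        now += step
        for n in remaining:
            remaining[n] -= rate * step
        del remaining[name]
        finish[name] = now
    return finish

def edf(flows, capacity=1.0):
    """Earliest deadline first: run one flow at a time at full line rate,
    in deadline order (an idealized version of PDQ's scheduling)."""
    finish, now = {}, 0.0
    for name, size, _ in sorted(flows, key=lambda f: f[2]):
        now += size / capacity
        finish[name] = now
    return finish

for sched in (fair_share, edf):
    done = sched(FLOWS)
    verdict = {n: ("met" if done[n] <= dl else "MISSED") for n, _, dl in FLOWS}
    print(sched.__name__, done, verdict)
# fair_share: f1 finishes at 24 ms (misses 20 ms); f2 at 28 ms (met)
# edf:        f1 finishes at 12 ms (met);          f2 at 28 ms (met)
```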
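The second case, flow quenching, using the reservation flavor of deadline-based sharing (rate = size / deadline, both known a priori). The six 30 ms deadlines are from the slide; the 6-unit flow sizes and the 1 data-unit/ms link capacity are assumptions chosen so that the deadlines are collectively infeasible.

```python
# Six flows, each 6 data-units (assumed), each with a 30 ms deadline (from
# the slide), on a 1 data-unit/ms link: total demand is 36 units in 30 ms,
# so not all deadlines can be met.
FLOWS = [(f"g{i}", 6.0, 30.0) for i in range(6)]
CAPACITY = 1.0

def reserve_with_quenching(flows, capacity):
    """Reserve rate = size / deadline for each flow; quench whatever does
    not fit. (With equal deadlines the choice of victim is arbitrary.)"""
    admitted, quenched, used = [], [], 0.0
    for name, size, deadline in sorted(flows, key=lambda f: f[2]):
        rate = size / deadline              # rate needed to just meet it
        if used + rate <= capacity + 1e-9:
            admitted.append(name)
            used += rate
        else:
            quenched.append(name)
    return admitted, quenched

# Fair sharing: each flow gets capacity/6, so every flow finishes at
# 6 / (1/6) = 36 ms, and all six miss the 30 ms deadline (empty response).
admitted, quenched = reserve_with_quenching(FLOWS, CAPACITY)
print("admitted:", admitted)   # five flows at 0.2 units/ms each meet 30 ms
print("quenched:", quenched)   # one flow sacrificed (partial response)
```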
Issues with Deadline-Based Scheduling
• Implications for non-deadline jobs
  – Starvation? Poor completion times?
• Implementation issues
  – Deadlines must be assigned to flows, not packets
  – Reservation approach
    • Requires a reservation for each flow
    • Big-data flows can be small and have small RTTs, so the control loop must be extremely fast
  – Earliest deadline first
    • Requires coordination between switches and servers
    • Servers specify flow deadlines; switches prioritize flows and determine rates
    • May require complex switch mechanisms

How Do You Make the Network Faster?
• Throw more hardware at the problem
  – Fat-Tree, VL2, BCube, Dragonfly
  – Increases bandwidth (throughput) but not necessarily latency

So, How Do You Reduce Latency?
• Trade bandwidth for latency
  – Buffering adds variation (unpredictability)
  – Eliminate network buffering and bursts
• Optimize the network stack
  – Use link-level information to detect congestion
  – Inform the application so it can adapt, e.g., by using a different path

HULL: Trading Bandwidth for Latency
• Buffering introduces latency
  – Buffers accommodate bursts
  – and let congestion control achieve good throughput
• Removing buffers means
  – Lower throughput for large flows
  – The network can't handle bursts
  – But predictable, low latency

Why Do Bursts Exist?
• Systems review:
  – The NIC (network interface card) informs the OS of received packets via interrupts
    • Interrupts consume CPU
    • With one interrupt per packet, the CPU would be overwhelmed
  – Optimization (interrupt coalescing): batch packets up before raising an interrupt
    • The size of the batch is the size of the burst

Impact of Interrupt Coalescing
[Table: interrupt-coalescing setting vs. receiver CPU (%), throughput (Gbps), and burst size (KB).]
• More interrupt coalescing gives lower CPU utilization and higher throughput
• ...but more burstiness

Why Does Congestion Control Need Buffers?
• Congestion control (i.e., TCP)
  – Detects the bottleneck link's capacity through packet loss
  – On a loss, it halves its sending rate
• Buffers help keep the network busy
  – Important for when TCP halves its sending rate
  – Essentially, the network must appear to have double the capacity for TCP to work well; buffers allow this doubling

TCP Review
• Bandwidth-delay product rule of thumb: a single flow needs B = C × RTT of buffering for 100% throughput
  – B ≥ C × RTT: 100% throughput
  – B < C × RTT: less than 100% throughput

Key Ideas Behind HULL
• Eliminate bursts
  – Add a token bucket (Pacer) into the network
  – The Pacer must sit in the network so that it acts after the system optimizations that cause bursts
• Eliminate buffering
  – Send congestion notification messages before the link is fully utilized
    • Make applications believe the link is full while there is still spare capacity
  – Key idea: associate congestion with link utilization, not buffer occupancy (Gibbens & Kelly 1999; Kunniyur & Srikant 2001)
  – TCP's congestion control reacts poorly to such early signals, so replace it with DCTCP
(The two sketches below illustrate the pacing and early-marking mechanisms.)
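A minimal software sketch of the token-bucket pacing idea. The parameters (1 Gbps rate, a two-packet bucket, per-packet-time polling) are illustrative assumptions; HULL's actual Pacer is a hardware element placed after the sources of burstiness, so this shows only the smoothing logic, not the implementation.

```python
import collections

class TokenBucketPacer:
    """Sketch of token-bucket pacing: queued packets leave only as byte
    tokens accumulate at the target rate, so an instantaneous burst is
    spread out over time. Parameters are illustrative, not HULL's."""
    def __init__(self, rate_bps, bucket_bytes):
        self.rate = rate_bps / 8.0          # token fill rate, bytes/sec
        self.depth = bucket_bytes           # max tokens = max residual burst
        self.tokens = bucket_bytes
        self.last = 0.0
        self.queue = collections.deque()    # packet sizes awaiting release

    def enqueue(self, pkt_bytes):
        self.queue.append(pkt_bytes)

    def dequeue(self, now):
        """Release whatever the tokens accumulated by `now` allow."""
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        sent = []
        while self.queue and self.queue[0] <= self.tokens:
            self.tokens -= self.queue[0]
            sent.append(self.queue.popleft())
        return sent

# A 10-packet burst arrives all at once; the pacer releases it at ~1 Gbps.
pacer = TokenBucketPacer(rate_bps=1e9, bucket_bytes=3000)   # 2-packet bucket
for _ in range(10):
    pacer.enqueue(1500)
t = 0.0
while pacer.queue:
    for _ in pacer.dequeue(t):
        print(f"t = {t * 1e6:5.1f} us: packet released")
    t += 12e-6              # poll every packet-time (1500 B at 1 Gbps)
```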
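The early-marking half can be sketched as a virtual counter that drains slightly slower than the real link (what HULL calls a phantom queue), so ECN marks appear while the real buffer is still empty; congestion is tied to utilization, not buffer occupancy. The drain factor (0.9) and marking threshold here are illustrative assumptions.

```python
class PhantomQueue:
    """Utilization-based marking sketch: a virtual counter drains at a
    fraction (gamma) of the true link rate. If the counter exceeds a
    threshold, arriving packets are ECN-marked even though the real
    queue may be empty. gamma and mark_bytes are assumptions."""
    def __init__(self, link_bps, gamma=0.9, mark_bytes=3000):
        self.drain = gamma * link_bps / 8.0   # bytes/sec the counter drains
        self.mark_bytes = mark_bytes
        self.counter = 0.0
        self.last = 0.0

    def on_packet(self, now, pkt_bytes):
        # Drain the virtual counter for the elapsed time, then add the packet.
        self.counter = max(0.0, self.counter - (now - self.last) * self.drain)
        self.last = now
        self.counter += pkt_bytes
        return self.counter > self.mark_bytes   # True => set ECN mark

# At 100% of line rate, marking kicks in; at 80% (below gamma), it never does.
for load in (1.0, 0.8):
    pq, t, marked = PhantomQueue(link_bps=1e9), 0.0, 0
    for _ in range(1000):
        t += 1500 * 8 / (load * 1e9)     # inter-arrival gap at this load
        marked += pq.on_packet(t, 1500)
    print(f"{load:.0%} load: {marked} of 1000 packets marked")
```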
Orchestra: Managing Data Transfers in Computer Clusters
• Group all flows belonging to a stage into a transfer
• Perform inter-transfer coordination
• Optimize at the level of transfers rather than individual flows

Transfer Patterns
• Transfer: the set of all flows transporting data between two stages of a job
  – Examples: broadcast (e.g., HDFS to the map stage), shuffle (maps to reduces), incast
• A transfer acts as a barrier
• Completion time: the time for the last receiver to finish

Orchestra Architecture
• Inter-Transfer Controller (ITC)
  – Coordinates transfers with a pluggable policy: fair sharing, FIFO, or priority
  – Implements weighted fair sharing between transfers
• One Transfer Controller (TC) per transfer
  – Broadcast TC: Cornet (cooperative broadcast); infers and utilizes topology information (baseline: HDFS tree)
  – Shuffle TC: Weighted Shuffle Scheduling (WSS); assigns flow rates to optimize shuffle completion time (baseline: Hadoop shuffle)
[Figure: the ITC sitting above the per-transfer TCs; end-to-end performance broken down into a shuffle and two broadcasts.]

Cornet: Cooperative Broadcast
• Broadcast the same data to every receiver
  – Fast, scalable, adaptive to bandwidth, and resilient
• A BitTorrent-like peer-to-peer mechanism optimized for cooperative environments
• Observations → design decisions:
  1. High-bandwidth, low-latency network → large block size (4-16 MB)
  2. No selfish or malicious peers → no need for incentives (e.g., tit-for-tat) and no (un)choking; everyone stays till the end
  3. Topology matters → topology-aware broadcast

Topology-Aware Cornet
• Many data center networks employ tree topologies
• Each rack should receive exactly one copy of the broadcast
  – Minimizes cross-rack communication
• Topology information reduces cross-rack data transfer
  – Cornet fits a mixture of spherical Gaussians to infer the network topology

Status Quo in Shuffle
• [Figure: receivers r1 and r2 fetch from senders s1-s5; r1 needs 1 unit each from s1 and s2 plus 2 units from s3, while r2 needs 2 units from s3 plus 1 unit each from s4 and s5.]
• With fair sharing, the links into r1 and r2 are the bottleneck for 3 time units; afterwards the link out of s3 (which serves both receivers) is the bottleneck for 2 more
• Completion time: 5 time units

Weighted Shuffle Scheduling (WSS)
• Allocate rates to each flow using weighted fair sharing, where the weight of the flow between a sender-receiver pair is proportional to the total amount of data to be sent
• In the example above, the flows from s3 get twice the weight of the others, so every flow finishes at the same time
• Completion time: 4 time units, up to a 1.5× improvement

Faster Spam Classification
• Communication reduced from 42% to 28% of the iteration time
• Overall, a 22% reduction in iteration time

Summary
• Discussed tail latency in the network
  – Types of traffic in the network
  – Implications for jobs
  – Causes of tail latency
• Discussed HULL
  – Trades bandwidth for latency
  – Penalizes huge flows
  – Eliminates bursts and buffering
• Discussed Orchestra
  – Optimizes transfers instead of individual flows
  – Utilizes knowledge of application semantics

http://www.mosharaf.com/