Orchestra: Managing Data Transfers in Computer Clusters
Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica
UC Berkeley

Moving Data is Expensive
Typical MapReduce jobs at Facebook spend 33% of their running time in large data transfers
An application that trains a spam classifier on Twitter data spends 40% of its time in communication

Limits Scalability
Scalability of a Netflix-like recommendation system is bottlenecked by communication
[Figure: iteration time (s), split into communication and computation, for 10 to 90 machines]
Did not scale beyond 60 nodes
» Communication time increased faster than computation time decreased

Transfer Patterns
Transfer: the set of all flows transporting data between two stages of a job
[Figure: map and reduce stages connected by broadcast, shuffle, and incast* patterns]
» Acts as a barrier
Completion time: time for the last receiver to finish

Contributions
1. Optimize at the level of transfers instead of individual flows
2. Inter-transfer coordination

Orchestra
[Architecture diagram: an Inter-Transfer Controller (ITC) applying a cross-transfer policy (fair sharing, FIFO, or priority) sits above per-transfer Transfer Controllers (TCs): a shuffle TC choosing between Hadoop shuffle and WSS, and broadcast TCs choosing among HDFS, a distribution tree, and Cornet for broadcast 1 and broadcast 2]

Outline
Cooperative broadcast (Cornet)
» Infer and utilize topology information
Weighted Shuffle Scheduling (WSS)
» Assign flow rates to optimize shuffle completion time
Inter-Transfer Controller (ITC)
» Implement weighted fair sharing between transfers
End-to-end performance

Cornet: Cooperative broadcast
Broadcast the same data to every receiver
» Fast, scalable, adaptive to bandwidth, and resilient
Peer-to-peer mechanism optimized for cooperative environments
Observations → Cornet design decisions:
1. High-bandwidth, low-latency network → large block size (4-16 MB)
2. No selfish or malicious peers → no need for incentives (e.g., TFT), no (un)choking, everyone stays till the end
3. Topology matters → topology-aware broadcast
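To make these design decisions concrete, here is a minimal, illustrative sketch of the Cornet idea, not Orchestra's actual code: the broadcast data is split into large blocks, and receivers fetch missing blocks from any peer that already holds them, with no choking and with every node continuing to serve until all receivers finish. The round-based exchange model, the BLOCK_SIZE_MB constant, and the one-upload-per-peer-per-round limit are assumptions made for this sketch.

```python
import random

BLOCK_SIZE_MB = 4  # Cornet favors large blocks (4-16 MB) on fast datacenter networks


def cornet_rounds(data_mb, n_receivers, uploads_per_round=1, seed=0):
    """Toy round-based simulation of block dissemination without choking.

    Node 0 is the source and starts with every block; receivers start empty.
    Each round, every node that holds blocks uploads up to `uploads_per_round`
    blocks, each to a peer that is still missing something it has.
    Returns the number of rounds until every receiver holds all blocks.
    """
    rng = random.Random(seed)
    n_blocks = max(1, data_mb // BLOCK_SIZE_MB)
    have = {0: set(range(n_blocks))}                 # the source
    have.update({i: set() for i in range(1, n_receivers + 1)})

    rounds = 0
    while any(len(have[i]) < n_blocks for i in range(1, n_receivers + 1)):
        rounds += 1
        uploads = []
        for src, blocks in have.items():
            if not blocks:
                continue                             # nothing to serve yet
            needy = [d for d in have if d != src and blocks - have[d]]
            rng.shuffle(needy)
            for dst in needy[:uploads_per_round]:
                uploads.append((dst, rng.choice(sorted(blocks - have[dst]))))
        for dst, blk in uploads:                     # deliveries land at end of round
            have[dst].add(blk)
    return rounds


# Rough analogue of the slides' setting: a 1 GB broadcast to 100 receivers.
print(cornet_rounds(data_mb=1024, n_receivers=100))
```

The observations map directly onto the sketch: there is no choking or tit-for-tat logic, blocks are large so per-block overhead stays small, and receivers that finish keep uploading, which is what lets the cooperative approach scale with the number of receivers.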
Cornet performance
1 GB of data to 100 receivers on EC2
[Figure: broadcast completion times for the status quo mechanisms vs. Cornet]
4.5x to 5x improvement over the status quo

Topology-aware Cornet
Many data center networks employ tree topologies
Each rack should receive exactly one copy of the broadcast
» Minimize cross-rack communication
Topology information reduces cross-rack data transfer
» Mixture of spherical Gaussians to infer the network topology

Topology-aware Cornet
200 MB of data to 30 receivers on DETER
[Figure: 3 inferred clusters]
~2x faster than vanilla Cornet

Status quo in Shuffle
[Figure: senders s1-s5 shuffling data to receivers r1 and r2]
Links to r1 and r2 are full: 3 time units
Link from s3 is full: 2 time units
Completion time: 5 time units

Weighted Shuffle Scheduling (WSS)
Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent
[Figure: the same shuffle with flow weights 1, 1, 2, 2, 1, 1]
Completion time: 4 time units
Up to 1.5x improvement
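Below is a minimal sketch of the idea behind WSS and the status quo comparison above, not Orchestra's implementation. It assumes every sender uplink and receiver downlink has unit capacity, treats the status quo as per-flow max-min fair sharing (an approximation of what TCP achieves), and recomputes weighted max-min fair rates by progressive filling whenever a flow finishes. With equal weights it reproduces the 5 time units above; with weights proportional to flow sizes it reproduces WSS's 4 time units.

```python
def max_min_rates(flows, weights, cap=1.0):
    """Weighted max-min fair rates via progressive filling.

    flows: {flow_id: (sender, receiver)}; weights: {flow_id: weight}.
    Every sender uplink and receiver downlink is assumed to have capacity `cap`.
    """
    links = {}
    for fid, (s, r) in flows.items():
        links.setdefault(("out", s), set()).add(fid)
        links.setdefault(("in", r), set()).add(fid)
    remaining = {link: float(cap) for link in links}
    rates = {fid: 0.0 for fid in flows}
    active = set(flows)
    while active:
        # Fill level at which each link carrying unfrozen flows would saturate.
        alpha, bottleneck = min(
            (remaining[l] / sum(weights[f] for f in fs & active), l)
            for l, fs in links.items() if fs & active
        )
        for fid in active:
            inc = alpha * weights[fid]
            rates[fid] += inc
            s, r = flows[fid]
            remaining[("out", s)] -= inc
            remaining[("in", r)] -= inc
        active -= links[bottleneck]          # flows on the saturated link are frozen
    return rates


def completion_time(flows, sizes, weights):
    """Simulate the shuffle, recomputing rates whenever some flow completes."""
    left = {fid: float(d) for fid, d in sizes.items()}
    t = 0.0
    while left:
        rates = max_min_rates({f: flows[f] for f in left}, weights)
        dt = min(left[f] / rates[f] for f in left)   # time until the next flow finishes
        t += dt
        for f in list(left):
            left[f] -= rates[f] * dt
            if left[f] <= 1e-9:
                del left[f]
    return t


# The example from the slides: s1, s2 send 1 unit each to r1; s3 sends 2 units
# to each of r1 and r2; s4, s5 send 1 unit each to r2.
flows = {"s1>r1": ("s1", "r1"), "s2>r1": ("s2", "r1"),
         "s3>r1": ("s3", "r1"), "s3>r2": ("s3", "r2"),
         "s4>r2": ("s4", "r2"), "s5>r2": ("s5", "r2")}
sizes = {"s1>r1": 1, "s2>r1": 1, "s3>r1": 2, "s3>r2": 2, "s4>r2": 1, "s5>r2": 1}

fair = {f: 1.0 for f in flows}               # status quo: every flow weighted equally
wss = {f: float(sizes[f]) for f in flows}    # WSS: weight proportional to data size

print(round(completion_time(flows, sizes, fair), 3))  # 5.0 time units
print(round(completion_time(flows, sizes, wss), 3))   # 4.0 time units
```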
Inter-Transfer Controller, aka Conductor
Weighted fair sharing
» Each transfer is assigned a weight
» Congested links are shared proportionally to transfers' weights
Implementation: Weighted Flow Assignment (WFA)
» Each transfer gets a number of TCP connections proportional to its weight
» Requires no changes in the network or in end-host OSes

Benefits of the ITC
Shuffles using 30 nodes on EC2
Two priority classes
» FIFO within each class
Low-priority transfer
» 2 GB per reducer
High-priority transfers
» 250 MB per reducer
[Figures: fraction of active flows over time for one low-priority job and three high-priority jobs, without inter-transfer scheduling vs. with priority scheduling in Conductor]
43% reduction for the high-priority transfers
6% increase for the low-priority transfer

End-to-end evaluation
Developed in the context of Spark, an iterative, in-memory MapReduce-like framework
Evaluated using two iterative applications developed by ML researchers at UC Berkeley
» Training a spam classifier on Twitter data
» Recommendation system for the Netflix challenge

Faster spam classification
Communication reduced from 42% to 28% of the iteration time
Overall 22% reduction in iteration time

Scalable recommendation system
[Figures: iteration time (s), split into communication and computation, for 10 to 90 machines, before and after]
1.9x faster at 90 nodes

Related work
DCN architectures (VL2, Fat-tree, etc.)
» Mechanisms for a faster network, not policies for better sharing
Schedulers for data-intensive applications (Hadoop scheduler, Quincy, Mesos, etc.)
» Schedule CPU, memory, and disk across the cluster
Hedera
» Transfer-unaware flow scheduling
Seawall
» Performance isolation among cloud tenants

Summary
Optimize transfers instead of individual flows
» Utilize knowledge about application semantics
Coordinate transfers
» Orchestra enables policy-based transfer management
» Cornet performs up to 4.5x better than the status quo
» WSS can outperform default solutions by 1.5x
No changes in the network or in end-host OSes
http://www.mosharaf.com/

BACKUP SLIDES

MapReduce logs
Weeklong trace of 188,000 MapReduce jobs from a 3000-node cluster
[Figure: CDF of the fraction of job lifetime spent in the shuffle phase]
Maximum number of concurrent transfers is in the several hundreds
33% of time spent in shuffle on average

Monarch (Oakland '11)
Real-time spam classification from 345,000 tweets with URLs
» Logistic regression
» Written in Spark
Spends 42% of the iteration time in transfers
» 30% broadcast
» 12% shuffle
100 iterations to converge

Collaborative Filtering
Netflix challenge
» Predict users' ratings for movies they haven't seen based on their ratings for other movies
Does not scale beyond 60 nodes
[Figure: iteration time (s), split into communication and computation, for 10 to 90 machines]
385 MB of data broadcast in each iteration

Cornet performance
1 GB of data to 100 receivers on EC2
4.5x to 6.5x improvement

Shuffle bottlenecks
A shuffle can bottleneck at a sender, at a receiver, or in the network
An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer

Current implementations
[Figure: completion times when shuffling 1 GB to 30 reducers on EC2 under existing implementations]
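As a companion to the "Shuffle bottlenecks" note above, here is a small illustrative calculation of the completion-time lower bound it implies, under the assumption (made for this sketch) that only sender uplinks and receiver downlinks can be bottlenecks and that each runs at rate `cap`: no schedule can finish before the most heavily loaded sender or receiver link has drained. On the shuffle example from the WSS slides, the bound is 4 time units, which WSS meets and the status quo misses.

```python
def shuffle_lower_bound(data, cap=1.0):
    """Lower bound on shuffle completion time.

    data[s][r] = units that sender s must ship to receiver r; every sender
    uplink and receiver downlink is assumed to have rate `cap`.
    """
    sent = {s: sum(row.values()) for s, row in data.items()}   # load per uplink
    recv = {}                                                  # load per downlink
    for row in data.values():
        for r, d in row.items():
            recv[r] = recv.get(r, 0) + d
    return max(max(sent.values()), max(recv.values())) / cap


# The WSS example: r1 and r2 each receive 4 units and s3 sends 4 units.
data = {"s1": {"r1": 1}, "s2": {"r1": 1},
        "s3": {"r1": 2, "r2": 2},
        "s4": {"r2": 1}, "s5": {"r2": 1}}
print(shuffle_lower_bound(data))   # 4.0: WSS meets this bound; the status quo takes 5
```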