SDN + Storage Outline
• Measurement of storage traffic
• Network-aware placement
• Control of resources
• SDN + resource allocation
  – Predicting resource utilization
  – Bringing it all together

HDFS Storage Patterns – Maps read from HDFS
• Local read versus non-local read
• Rack locality or not
• (figures: read locality and the resulting cross-rack traffic, annotated "Locality!!! 80%" and "Cross-rack Traffic 80%")

HDFS Storage Patterns – Reducers write to HDFS
• 3 copies of the file are written to HDFS
• 2 rack-local and 1 non-rack-local
• Fault tolerance and good performance
• THERE MUST BE CROSS-RACK TRAFFIC

Ideal Goal: Minimize Congestion

Real-Life Traces
• Analyze Facebook traces:
  – 33% of time is spent in the network
  – Network links are highly utilized; why?
  – Determine the causes of network traffic:
    1. Job output
    2. Job input
    3. Pre-processing

Current Ways to Improve HDFS Transfers
• Change network paths
  – Hedera, MicroTE, c-Through, Helios
• Change network rates
  – Orchestra, D3
• Increase network capacity
  – VL2, PortLand (fat-tree)

The Case for Flexible Endpoints
• The traffic matrix limits the benefits
  – of techniques that change paths
  – of techniques that change network rates
• The ability to change the matrix is important

Flexible Endpoints in HDFS
• Recall the constraints placed by HDFS:
  – 3 replicas
  – 2 fault domains
  – It doesn't matter where the replicas go, as long as the constraints are met
• The source of the transfer is fixed!
  – However, the destination (the location of the 3 replicas) is not fixed

Sinbad
• Determine the placement of each block replica
  – Place replicas to avoid hotspots
  – Constraints:
    • 3 copies
    • Spread across 2 fault domains
• Benefits
  – Faster writes
  – Faster transfers

Sinbad: Ideal Algorithm
• Input:
  – Blocks of different sizes
  – Links of different capacities
• Objective:
  – Minimize write time (transfer time)
• Challenge: lack of future knowledge
  – Location and duration of hotspots
  – Sizes and arrival times of new replicas

Sinbad Heuristic
• Assumptions
  – Link utilizations are stable
    • True for 5-10 seconds
  – All blocks have the same size
    • Fixed-size large blocks
• Heuristic (see the sketch below):
  – Pick the least-loaded link/path
  – Send a block from the file with the least amount left to send

Sinbad Architecture
• Recall: the original DFS is a master-slave architecture
• Sinbad has a similar design
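To make the heuristic concrete, here is a minimal Python sketch of Sinbad-style greedy replica placement. This is not the actual Sinbad implementation: the names (Rack, WriteRequest, pick_destinations), the single per-rack utilization number, and the "two copies in the least-loaded rack, one in the next" rule are simplifying assumptions made only for illustration.

```python
"""Minimal sketch of Sinbad-style replica placement (illustrative only).

Simplifying assumptions, not taken from the Sinbad paper or code: racks are
the fault domains, each rack is summarized by one downlink utilization, all
blocks have the same size, and utilizations stay stable for a few seconds.
"""
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    utilization: float          # fraction of rack downlink capacity in use, 0.0-1.0


@dataclass
class WriteRequest:
    file_name: str
    bytes_remaining: int        # how much of this file still has to be written


def next_block_to_place(pending):
    """Heuristic, part 1: serve the file with the least data left to send."""
    return min(pending, key=lambda req: req.bytes_remaining)


def pick_destinations(racks, replicas=3):
    """Heuristic, part 2: put replicas on the least-loaded racks while keeping
    the HDFS constraint of spreading the copies across two fault domains
    (here: two replicas in one rack, one replica in another)."""
    least_loaded = sorted(racks, key=lambda r: r.utilization)
    primary, secondary = least_loaded[0], least_loaded[1]
    # Two copies go to the least-loaded rack and one to the next least-loaded,
    # so the write avoids current hotspots but still survives a rack failure.
    return [primary, primary, secondary][:replicas]


if __name__ == "__main__":
    racks = [Rack("rack-a", 0.80), Rack("rack-b", 0.20), Rack("rack-c", 0.55)]
    pending = [WriteRequest("job1.out", 512 << 20), WriteRequest("job2.out", 128 << 20)]
    block = next_block_to_place(pending)        # job2.out: least data remaining
    targets = pick_destinations(racks)          # rack-b, rack-b, rack-c
    print(block.file_name, [r.name for r in targets])
```

The sketch also ignores the usual HDFS detail that the first replica tends to land on the writer's own node; the point it illustrates is that the replica destinations are exactly the flexible endpoints argued for above.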
Orchestrating the Entire Cluster
• How do we control compute, network, and storage together?
• Challenges carried over from Sinbad:
  – How do we determine future replica demands?
    • You can't control job arrival
    • You can control task scheduling
    • If you can predict job characteristics, you can determine future demand
  – How do we determine future hotspots?
    • Control all network traffic (SDN)
    • Use knowledge of future demand

Ideal Centralized Entity
• Controls:
  – Storage, CPU, network
• Determines:
  – Which task to run
  – Where to run the task
  – When to start a network transfer
    • What rate to transfer at
    • Which network path to use

Predicting Job Characteristics
• To predict the resources that a job needs to complete, what do you need?
  – The job's DAG (from the job's trace history)
  – The computation time for each node
  – The data transfer size between nodes
  – The transfer time between nodes

Things you absolutely know!
• Input data
  – Size of the input data
  – Location of all replicas
  – Split of the input data
• The job's DAG
  – # of Map tasks
  – # of Reduce tasks
• (figure: 200 GB input in HDFS, read by 3 Map tasks, feeding 2 Reduce tasks that write back to HDFS)

Approaches to Prediction: Input/Intermediate/Output Data
• Assumption:
  – Map and Reduce run the same code over and over
  – The code gives the same reduction ratio every run
    • E.g., 50% reduction from Map input to intermediate data
    • E.g., 90% reduction from intermediate data to output
• Implication:
  – Given the size of the input, you can determine the sizes of future transfers (a small sketch at the end of these notes makes this concrete)
• Problem:
  – Not always true!
• (figure: 200 GB input → 100 GB intermediate → 10 GB output)

Approaches to Prediction: Task Run Time
• Assumption:
  – A task is dominated by reading its input
  – The time to run a task is essentially the time to read its input
    • If local: time to read from disk
    • If non-local: time to read across the network
• Implication:
  – If you can model read time, you can determine task run time
• Problems:
  – How do you model disk I/O?
  – How do you model I/O interrupt contention?

Predict Job Runs
• Given:
  – Predictions of the tasks, the transfers, and the DAG
• Can you predict job completion time?
  – How do you account for interleaving between jobs?
  – How do you determine the optimal # of slots?
  – How do you determine the optimal network bandwidth?
• (figure: example timelines for the 200 GB job under different slot assignments)
• Really easy, right?
  – But what happens if only 2 slots are available?
    • You can't run the maps in parallel
• Which tasks do you run, and in which order?
• How many slots do you assign?

Approaches to Prediction: Job Run Times
• Assumption:
  – Job runtime is a function of the # of slots
• Implication:
  – Given N slots, you can predict the completion time
• Jockey approach [EuroSys'12]
  – Track job progress: the fraction of completed tasks
  – Build a map from {% done, # of slots} to time-to-completion
  – Use a simulator to build the map
  – Iterate through all possible combinations of # of slots and % done
• Problems:
  – Ignores network transfers: network congestion
  – Cross-job contention on a server can impact completion time
  – Not all tasks are equal: the # of tasks done isn't a good representation of progress

Open Questions
• What about background traffic?
  – Control messages
  – Other bulk transfers
• What about unexpected events?
  – Failures?
  – Loss of data?
• What about protocol inefficiencies?
  – Hadoop scheduling
  – TCP inefficiencies
  – Server scheduling
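As a closing illustration of the two prediction approaches sketched above (ratio-based size prediction and the read-time model of task run time), here is a small Python sketch. The 200 GB input and the 50%/90% reduction ratios are the example numbers from the slides; the disk and network throughputs and every function name are placeholder assumptions, not measurements.

```python
"""Illustrative sketch of the prediction approaches discussed above."""
GB = 1024 ** 3

# Ratio-based data-size prediction: assume Map and Reduce shrink their input
# by a fixed, previously observed factor (the example ratios from the slides).
MAP_REDUCTION = 0.50        # 50% reduction from Map input to intermediate data
REDUCE_REDUCTION = 0.90     # 90% reduction from intermediate data to output


def predict_sizes(input_bytes):
    """Predict intermediate and output sizes from the (known) input size."""
    intermediate = input_bytes * (1 - MAP_REDUCTION)
    output = intermediate * (1 - REDUCE_REDUCTION)
    return intermediate, output


# Read-time model of task run time: a task is assumed to be dominated by
# reading its input. Throughputs below are placeholder assumptions.
LOCAL_DISK_BPS = 100e6      # ~100 MB/s local disk read (assumed)
CROSS_RACK_BPS = 50e6       # ~50 MB/s effective cross-rack read (assumed)


def predict_task_time(split_bytes, local):
    """Task run time ~= time to read the task's input split."""
    return split_bytes / (LOCAL_DISK_BPS if local else CROSS_RACK_BPS)


if __name__ == "__main__":
    intermediate, output = predict_sizes(200 * GB)   # 200 GB input, as in the slides
    print(f"intermediate ~ {intermediate / GB:.0f} GB, output ~ {output / GB:.0f} GB")
    split = 200 * GB / 3                             # 3 Map tasks, one split each
    print(f"local map ~ {predict_task_time(split, True):.0f} s, "
          f"non-local map ~ {predict_task_time(split, False):.0f} s")
```

A Jockey-style predictor goes one step further and maps {fraction of tasks done, number of slots} to remaining time via a simulator, which is exactly where the problems listed above (ignored network congestion, cross-job contention, unequal tasks) start to matter.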