Next Generation Job Management System for Extreme Scales

• SLURM++: A distributed job launch prototype for extreme-scale ensemble computing (IPDPS14 submission)
• MATRIX: A distributed Many-Task Computing execution fabric designed for exascale (CCGRID14 submission)

• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work

• Today (June 2013): 34 Petaflops, O(100K) nodes, O(1M) cores
• Near future (~2020): Exaflop computing, ~1M nodes, ~1B processor cores/threads

• Ensemble computing
• Over-decomposition
• Many-Task Computing
• Jobs/tasks are finer-grained
• Requirements: high availability, extremely high throughput (1M tasks/sec), low latency

• Batch-scheduled HPC workloads lack support for ensemble workloads
• Centralized design: poor scalability, single point of failure
• SLURM achieves a maximum throughput of about 500 jobs/sec
• A decentralized design is demanded

• Architect and design job management systems for exascale ensemble computing
• Identify the challenges and solutions towards supporting job management systems at extreme scales
• Evaluate and compare different design choices at large scale

• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work

• Proposed a distributed architecture for job management systems, and identified the challenges and solutions towards supporting job management systems at extreme scales
• Designed and developed a novel distributed resource stealing algorithm for efficient HPC job launch
• Designed and implemented SLURM++, a distributed job launch prototype for extreme scales, by leveraging SLURM and ZHT
• Evaluated SLURM and SLURM++ up to 500 nodes with micro-benchmarks of different job sizes, with excellent results of up to 10X higher throughput

[Architecture diagram: fully connected controllers, each paired with a data server and managing a partition of compute daemons (cd)]
1. Controllers are fully connected
2. The controller-to-node ratio and partition size are configurable for HPC and MTC
3. Data servers are also fully connected

Key: controller id
  Value: number of free nodes, free node list
  Description: the free (available) nodes in the partition managed by the corresponding controller
Key: job id
  Value: original controller id
  Description: the original controller that is responsible for a submitted job
Key: job id + original controller id
  Value: involved controller list
  Description: the controllers that participate in launching a job
Key: job id + original controller id + involved controller id
  Value: participated node list
  Description: the nodes in a partition that are involved in launching a job

Resource stealing: the procedure of stealing resources from other partitions
• Why do we need to steal resources?
• When to steal resources?
• Where and how to steal resources?
• What if there are no resources to be stolen?
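For concreteness, here is a minimal C sketch of the key/value records described in the table above, roughly as a controller might hold them before serializing them into ZHT; the struct and field names are illustrative assumptions, not SLURM++'s actual code:

    /* Illustrative layout of the ZHT records from the table above (not SLURM++'s actual code). */
    #define MAX_NODES 1024
    #define MAX_CTLS  64
    #define NAME_LEN  32

    /* key = controller id */
    typedef struct {
        int  num_free_nodes;                       /* free node count in this partition */
        char free_node_list[MAX_NODES][NAME_LEN];  /* names of the free nodes */
    } free_resource_record;

    /* key = job id */
    typedef struct {
        char original_controller_id[NAME_LEN];     /* controller responsible for the job */
    } job_origin_record;

    /* key = job id + original controller id */
    typedef struct {
        int  num_involved;
        char involved_controller_list[MAX_CTLS][NAME_LEN];  /* controllers launching the job */
    } involved_controllers_record;

    /* key = job id + original controller id + involved controller id */
    typedef struct {
        int  num_participated;
        char participated_node_list[MAX_NODES][NAME_LEN];   /* nodes contributed by that partition */
    } participated_nodes_record;

In practice ZHT stores plain strings, so each record would be flattened into a delimited string (the algorithm below tokenizes the looked-up free-resource value with strtok).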
ALGORITHM 1. Resource Stealing
Input: number of nodes required (num_node_req), number of controllers (num_ctl), controller membership list (ctl_id[num_ctl]), sleep length (sleep_length), number of retries (num_retry).
Output: list of involved controller ids (ctl_id_inv), participated nodes (par_node[]).

num_node_allocated = 0; num_try = 0; num_ctl_inv = 0;
while num_node_allocated < num_node_req do
    remote_ctl_idx = Random(num_ctl);
    remote_ctl_id = ctl_id[remote_ctl_idx];
again:
    remote_free_resource = lookup(remote_ctl_id);
    if (remote_free_resource == NULL) then
        continue;
    else
        remote_num_free_node = strtok(remote_free_resource);
        if (remote_num_free_node > 0) then
            num_try = 0;
            remote_num_node_allocated = remote_num_free_node > (num_node_req − num_node_allocated) ?
                (num_node_req − num_node_allocated) : remote_num_free_node;
            if (allocating the nodes succeeds) then
                num_node_allocated += remote_num_node_allocated;
                par_node[num_ctl_inv++] = allocated node list;
                strcat(ctl_id_inv, remote_ctl_id);
            else
                goto again;
            end
        else
            sleep(sleep_length);
            num_try++;
            if (num_try > num_retry) then
                release all the allocated nodes;
                do resource stealing again;
            end
        end
    end
end
return ctl_id_inv, par_node;

• Different controllers may try to allocate the same resources
• A naive way to solve the problem is to add a lock for each queried key in the DKVS
• Instead, an atomic compare-and-swap operation in the DKVS tells the controllers whether the resource allocation succeeds

[Data server operation sequence diagram (message exchange flow): Client 1 and Client 2 both look up the same key (lookup_1, lookup_2) and each receives the current value; Client 1's cswap_1 returns true with the value; Client 2's cswap_2 returns false with the updated value; Client 2 retries cswap_2 and it returns true.]

ALGORITHM 2. Compare and Swap
Input: key (key), value seen before (seen_value), new value intended to insert (new_value), and the storage hash map (map).
Output: a Boolean value indicating success (TRUE) or failure (FALSE), and the current actual value (current_value).

current_value = map.get(key);
if (!strcmp(current_value, seen_value)) then
    map.put(key, new_value);
    return TRUE;
else
    return FALSE;
end

• A controller needs to wait on a specific state change before moving on
• Having the client keep polling the server is inefficient
• Instead, the server offers a blocking state change callback operation

ALGORITHM 3. State Change Callback
Input: key (key), value expected (expect_value), blocking timeout (time_out), and the storage hash map (map).
Output: a Boolean value indicating success (TRUE) or failure (FALSE).

current_value = map.get(key);
start_time = gettimeofday();
result = TRUE;
while (strcmp(current_value, expect_value)) do
    end_time = gettimeofday();
    if (end_time − start_time < time_out) then
        sleep(1);
        current_value = map.get(key);
    else
        result = FALSE;
        break;
    end
done
return result;

• SLURM description
• Light-weight controller implemented as a ZHT client
• Job launching runs as a separate thread
• Implements the resource stealing algorithm
• Developed in C: 3K lines of code, on top of SLURM (50K lines of code) and ZHT (8K lines of code)

• LANL Kodiak machine, up to 500 nodes
• Each node has two 64-bit AMD Opteron processors at 2.6GHz and 8GB of memory
• SLURM version 2.5.3
• Partition size of 50
• Metrics: efficiency, number of ZHT messages
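The efficiency metric is not defined on the slides; a common definition in this line of work, used here as an assumption rather than the authors' exact formula, is the ratio of ideal to measured running time for a workload at a given scale:

    efficiency = (ideal running time / measured running time) x 100%

where the ideal running time assumes zero job launch and scheduling overhead.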
Conclusions:
1. SLURM throughput remains almost constant, while SLURM++ shows an increasing trend with scale
2. The MTC configuration is preferable to the HPC configuration for MTC workloads

Conclusions:
1. SLURM throughput stays constant, while SLURM++ shows a linearly increasing trend with scale
2. The medium-job workload introduces a small amount of resource stealing overhead

Conclusions:
1. SLURM is about to saturate at large scales, while SLURM++ keeps an increasing trend with scale
2. The more partitions there are, the better the chance that a controller can steal resources

• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work

• Design and implement a distributed MTC task execution fabric (MATRIX) that uses an adaptive work stealing technique for distributed load balancing and employs a DKVS to store task metadata
• Explore the parameter space of work stealing as applied to exascale-class systems, through a lightweight job scheduling system simulator (SimMatrix), up to millions of nodes, billions of cores, and trillions of tasks
• Evaluate and compare MATRIX with other task schedulers (Falkon, Sparrow, and CloudKon) on an IBM Blue Gene/P machine and the Amazon Cloud, using both micro-benchmarks and real workload traces

• Work stealing: a distributed load balancing technique
• An idle scheduler steals tasks from an overloaded one
• Tunable parameters (summarized in the configuration sketch after the neighbor-selection algorithm below):
• Number of tasks to steal
• Number of static neighbors
• Number of dynamic neighbors
• Poll interval

• Select neighbors

ALGORITHM 1. Dynamic Multi-Random Neighbor Selection (DYN-MUL-SEL)
Input: node id (node_id), number of neighbors (num_neigh), number of nodes (num_node), and the node array (nodes).
Output: a collection of neighbors (neigh).

bool selected[num_node];
for each i in 0 to num_node−1 do
    selected[i] = FALSE;
end
selected[node_id] = TRUE;
Node neigh[num_neigh];
index = −1;
for each i in 0 to num_neigh−1 do
    repeat
        index = Random() % num_node;
    until !selected[index];
    selected[index] = TRUE;
    neigh[i] = nodes[index];
end
return neigh;
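The tunable work stealing parameters referenced above can be grouped into a single configuration record. Below is a minimal C sketch; the names and the defaults (steal half of the victim's tasks, use a square-root number of dynamic neighbors, double the poll interval on failure) reflect the SimMatrix results later in the deck, but the struct itself is an illustrative assumption, not MATRIX's actual API:

    /* Illustrative work stealing configuration (not MATRIX's actual data structure). */
    #include <math.h>

    typedef struct {
        int steal_fraction_divisor;       /* steal load/divisor tasks; 2 = steal half               */
        int num_static_neighbors;         /* size of a fixed neighbor set, if one is used           */
        int num_dynamic_neighbors;        /* neighbors re-chosen at random on each steal attempt    */
        int poll_interval_ms;             /* initial poll interval; doubled after each failed steal */
        int poll_interval_upper_bound_ms; /* back-off cap; stop stealing once exceeded              */
    } work_stealing_config;

    /* Example defaults suggested by the simulation results (illustrative values). */
    static work_stealing_config default_work_stealing_config(int num_node) {
        work_stealing_config cfg;
        cfg.steal_fraction_divisor       = 2;                          /* steal half           */
        cfg.num_static_neighbors         = 0;                          /* rely on dynamic ones */
        cfg.num_dynamic_neighbors        = (int)sqrt((double)num_node);
        cfg.poll_interval_ms             = 1;
        cfg.poll_interval_upper_bound_ms = 1000;                       /* assumed cap          */
        return cfg;
    }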
• Dynamic poll interval

ALGORITHM 2. Adaptive Work Stealing (ADA-WORK-STEALING)
Input: node id (node_id), number of neighbors (num_neigh), number of nodes (num_node), the node array (nodes), the initial poll interval (poll_interval), and the poll interval upper bound (upper_bound).
Output: NULL.

neigh = DYN-MUL-SEL(node_id, num_neigh, num_node, nodes);
most_load_node = neigh[0];
for each i in 1 to num_neigh−1 do
    if (most_load_node.load < neigh[i].load) then
        most_load_node = neigh[i];
    end
end
if (most_load_node.load == 0) then
    sleep(poll_interval);
    poll_interval = poll_interval × 2;
    if (poll_interval < upper_bound) then
        ADA-WORK-STEALING(node_id, num_neigh, num_node, nodes, poll_interval, upper_bound);
    else
        return;
    end
else
    num_task_steal = number of tasks to steal from most_load_node;
    if (num_task_steal == 0) then
        sleep(poll_interval);
        poll_interval = poll_interval × 2;
        if (poll_interval < upper_bound) then
            ADA-WORK-STEALING(node_id, num_neigh, num_node, nodes, poll_interval, upper_bound);
        else
            return;
        end
    else
        poll_interval = 1;
    end
end

[MATRIX architecture diagram: each compute node runs a scheduler, an executor, and a ZHT server; the schedulers are fully connected and balance load via work stealing. Client interaction: (1) submit tasks, (2) lookup task status, (3) return task status. Work stealing between schedulers: (4) request load, (5) send load, (6) request task, (7) send task.]

• Worst case: all tasks are submitted to one scheduler
• Best case: all tasks are evenly distributed to all schedulers

• The client does not have to stay alive
• A monitoring program polls the task execution progress periodically
• Logs about the system state are recorded for visualization

• Evaluation focuses on the Amazon Cloud environment
• "m1.medium" instances: 1 virtual CPU, 2 compute units, 3.7GB memory, 410GB hard disk
• MATRIX, Sparrow, and CloudKon run on up to 64 instances
• The number of executing threads is 2 for all of the systems

Reasons:
1. MATRIX has near-perfect load balancing when submitting tasks
2. Sparrow needs to send probe messages to push tasks
3. The CloudKon clients need to push tasks to and pull tasks from SQS

Conclusions:
1. SQS and DynamoDB are designed specifically for the Cloud
2. The probing and pushing method in Sparrow scales poorly for heterogeneous workloads
3. The network communication layer of MATRIX needs to be improved significantly

• Explored through SimMatrix up to millions of nodes, billions of cores, and trillions of tasks:
• Number of tasks to steal
• Number of static neighbors
• Number of dynamic neighbors

[Figure: efficiency (0–100%) vs. scale (1 to 8192 nodes) for steal_1, steal_2, steal_log, steal_sqrt, and steal_half.]
Conclusions:
1. Steal half of the tasks
2. A small number of static neighbors is not sufficient
3. The more tasks (up to half) that are stolen, the better the performance

[Figure: efficiency (0–100%) vs. scale (1 to 8192 nodes) for nb_2, nb_log, nb_sqrt, nb_eighth, nb_quar, and nb_half static neighbors.]
Conclusions:
1. The optimal number of static neighbors is an eighth of all nodes
2. This is impossible at exascale, where it would mean 128K neighbors per node
3. Dynamic neighbors are needed to reduce the neighbor count
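A quick sanity check on those neighbor counts, assuming an exascale system of roughly one million (2^20) nodes as in the motivation slide:

    an eighth of all nodes:   2^20 / 8   = 2^17 = 131,072 ≈ 128K neighbors per node
    square root of all nodes: sqrt(2^20) = 2^10 = 1,024   ≈ 1K neighbors per node

which is why the square-root number of dynamic neighbors evaluated next is considered practical at exascale.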
[Figure: efficiency (0–100%) vs. scale (1 to 1M nodes) for nb_1, nb_2, nb_log, and nb_sqrt dynamic neighbors.]
Conclusions:
1. A square-root number of dynamic neighbors is optimal
2. This is reasonable at exascale (1K neighbors per node)

• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work

• Applications for exascale computing are becoming ensemble-based and finer-grained
• Task schedulers for exascale computing need to be distributed and scalable
• SLURM++ should be integrated with MATRIX

• Re-implement MATRIX
• Large scale
• Integration of SLURM++ and MATRIX
• Workflow integration
• Data-aware scheduling
• Distributed MapReduce framework support

• Xiaobing Zhou
• Hao Chen
• Kiran Ramamurthy
• Iman Sadooghi
• Michael Lang
• Ioan Raicu

• More information: http://datasys.cs.iit.edu/~kewang/
• Contact: kwang22@hawk.iit.edu
• Questions?