Next Generation Job Management System for Extreme Scales

• SLURM++: a distributed job launch prototype for extreme-scale ensemble computing (IPDPS14 submission)
• MATRIX: a distributed Many-Task Computing execution fabric designed for exascale (CCGRID14 submission)
• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work
• Today (June 2013): 34 Petaflops
  – O(100K) nodes
  – O(1M) cores
• Near future (~2020): exaflop computing
  – ~1M nodes
  – ~1B processor cores/threads
• Ensemble Computing
• Over-decomposition
• Many-Task Computing
• Jobs/tasks are finer-grained
• Requirements:
  – high availability
  – extremely high throughput (1M tasks/sec)
  – low latency
• Batch-scheduled HPC workloads
• Lack of support for ensemble workloads
• Centralized design:
  – poor scalability
  – single point of failure
• SLURM maximum throughput of 500 jobs/sec
• A decentralized design is needed
• Architect and design job management systems for exascale ensemble computing
• Identify the challenges and solutions towards supporting job management systems at extreme scales
• Evaluate and compare different design choices at large scale
• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work
• Proposed a distributed architecture for job management systems, and identified the challenges and solutions towards supporting job management systems at extreme scales
• Designed and developed a novel distributed resource stealing algorithm for efficient HPC job launch
• Designed and implemented a distributed job launch prototype, SLURM++, for extreme scales by leveraging SLURM and ZHT
• Evaluated SLURM and SLURM++ up to 500 nodes with various micro-benchmarks of different job sizes, with excellent results of up to 10X higher throughput
[Architecture figure: fully-connected controllers and data servers; each controller manages a partition of compute nodes (cd).]
1. Controllers are fully connected
2. Ratio and partition size are configurable for HPC and MTC
3. Data servers are also fully connected
Key: controller id
Value: number of free nodes, free node list
Description: the free (available) nodes in the partition managed by the corresponding controller

Key: job id
Value: original controller id
Description: the original controller that is responsible for a submitted job

Key: job id + original controller id
Value: involved controller list
Description: the controllers that participate in launching a job

Key: job id + original controller id + involved controller id
Value: participated node list
Description: the nodes in a partition that are involved in launching a job
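As an illustration of this key-value layout, here is a minimal C sketch of how a controller might publish its records in a ZHT-like store. The zht_put helper, the "." key separator, and the string encodings are assumptions made for the sketch, not the actual SLURM++ code.

#include <stdio.h>

/* Hypothetical DKVS client call (stand-in for the real ZHT API). */
int zht_put(const char *key, const char *value);

/* Record 1: key = controller id, value = "<number of free nodes>:<free node list>". */
static int publish_free_nodes(const char *ctl_id, int num_free, const char *free_node_list) {
    char value[4096];
    snprintf(value, sizeof(value), "%d:%s", num_free, free_node_list);
    return zht_put(ctl_id, value);
}

/* Record 3: key = "<job id>.<original controller id>", value = involved controller list. */
static int publish_involved_controllers(const char *job_id, const char *orig_ctl,
                                        const char *ctl_list) {
    char key[256];
    snprintf(key, sizeof(key), "%s.%s", job_id, orig_ctl);
    return zht_put(key, ctl_list);
}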
ALGORITHM 1. Resource Stealing
• The procedure of stealing resources from other partitions
• Why do we need to steal resources?
• When to steal resources?
• Where and how to steal resources?
• What if there are no resources to steal?
Input: number of nodes required (num_node_req), number of controllers (num_ctl), controller membership list (ctl_id[num_ctl]), sleep length (sleep_length), number of retries (num_retry).
Output: list of involved controller ids (ctl_id_inv), participated nodes (par_node[]).
num_node_allocated = 0; num_try = 0; num_ctl_inv = 0;
while num_node_allocated < num_node_req do
    remote_ctl_idx = Random(num_ctl);
    remote_ctl_id = ctl_id[remote_ctl_idx];
again:
    remote_free_resource = lookup(remote_ctl_id);
    if (remote_free_resource == NULL) then
        continue;
    else
        remote_num_free_node = strtok(remote_free_resource);
        if (remote_num_free_node > 0) then
            num_try = 0;
            remote_num_node_allocated = remote_num_free_node > (num_node_req – num_node_allocated) ? (num_node_req – num_node_allocated) : remote_num_free_node;
            if (allocate nodes succeeds) then
                num_node_allocated += remote_num_node_allocated;
                par_node[num_ctl_inv++] = allocated node list;
                strcat(ctl_id_inv, remote_ctl_id);
            else
                goto again;
            end
        else
            sleep(sleep_length);
            num_try++;
            if (num_try > num_retry) then
                release all the allocated nodes;
                do Resource Stealing again;
            end
        end
    end
end
return ctl_id_inv, par_node;
• When different controllers try to allocate the same resources, conflicts can arise
• A naive way to solve the problem is to add a global lock for each queried key in the DKVS
• An atomic compare-and-swap operation in the DKVS can tell the controllers whether the resource allocation succeeds

[Figure: data server operation sequence (message exchange flow between Client 1, the data server, and Client 2): both clients look up the same key (lookup_1, lookup_2) and receive the value; Client 1's cswap_1 returns true with the new value; Client 2's cswap_2 returns false with the current value, and succeeds when retried (cswap_2 again).]

ALGORITHM 2. Compare and Swap
Input: key (key), value seen before (seen_value), new value intended to insert (new_value), and the storage hash map (map).
Output: a Boolean value indicating success (TRUE) or failure (FALSE), and the current actual value (current_value).
current_value = map.get(key);
if (!strcmp(current_value, seen_value)) then
    map.put(key, new_value);
    return TRUE;
else
    return FALSE;
end
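For illustration only, a minimal C sketch of how a controller could use such a compare-and-swap operation to atomically claim free nodes from a remote partition (the "allocate nodes" step in Algorithm 1). The zht_lookup/zht_compare_swap helpers and the value encoding are assumptions, not the actual SLURM++ or ZHT API.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical DKVS client calls (stand-ins for the real ZHT API). */
int  zht_lookup(const char *key, char *value, size_t len);
bool zht_compare_swap(const char *key, const char *seen_value,
                      const char *new_value, char *current_value, size_t len);

/* Try to atomically take up to 'want' nodes from the partition owned by remote_ctl.
 * Returns the number of nodes claimed, or 0 if none were free or the record changed
 * between the lookup and the swap (the caller then retries, as in Algorithm 1). */
static int try_allocate(const char *remote_ctl, int want) {
    char seen[4096], updated[4096], current[4096];
    if (zht_lookup(remote_ctl, seen, sizeof(seen)) != 0)
        return 0;
    int free_nodes = atoi(seen);                 /* record starts with the free-node count */
    int take = free_nodes < want ? free_nodes : want;
    if (take <= 0)
        return 0;
    /* New record: decremented count (the updated free-node list is elided here). */
    snprintf(updated, sizeof(updated), "%d:...", free_nodes - take);
    if (zht_compare_swap(remote_ctl, seen, updated, current, sizeof(current)))
        return take;
    return 0;
}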
• A controller needs to wait on a specific state change before moving on
• Inefficient when the client keeps polling the server
• The server offers a blocking state change callback operation

ALGORITHM 3. State Change Callback
Input: key (key), value expected (expect_value), blocking timeout (time_out), and the storage hash map (map).
Output: a Boolean value indicating success (TRUE) or failure (FALSE).
current_value = map.get(key);
start_time = gettimeofday();
result = TRUE;
while (strcmp(current_value, expect_value)) do
    end_time = gettimeofday();
    if (end_time – start_time < time_out) then
        sleep(1);
        current_value = map.get(key);
    else
        result = FALSE;
        break;
    end
end
return result;
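Likewise, a minimal sketch of how the original controller might block until the involved-controller record for a job reaches its expected value, assuming a hypothetical zht_state_change_callback client call; the key layout follows the metadata table above, but the API and names are illustrative.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical blocking call: returns TRUE once the value stored under 'key'
 * equals 'expect_value', or FALSE after 'timeout' seconds (Algorithm 3 on the server). */
bool zht_state_change_callback(const char *key, const char *expect_value, int timeout);

/* Wait until the "<job id>.<original controller id>" record lists the expected
 * involved controllers before the job launch proceeds. */
static bool wait_for_involved_controllers(const char *job_id, const char *orig_ctl,
                                          const char *expected_ctl_list, int timeout) {
    char key[256];
    snprintf(key, sizeof(key), "%s.%s", job_id, orig_ctl);
    return zht_state_change_callback(key, expected_ctl_list, timeout);
}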
• SLURM description
• Light-weight controller as a ZHT client
• Job launching as a separate thread
• Implement the resource stealing algorithm
• Developed in C
• 3K lines of code + SLURM 50K lines of code + ZHT 8K lines of code
• LANL Kodiak machine, up to 500 nodes
• Each node has two 64-bit AMD Opteron processors at 2.6GHz and 8GB of memory
• SLURM version 2.5.3
• Partition size of 50
• Metrics:
  – efficiency
  – number of ZHT messages
Conclusions:
1. SLURM remains almost constant, while SLURM++ shows an increasing trend with respect to scale
2. The MTC configuration is preferable to the HPC configuration for MTC workloads
Conclusion:
1. SLURM stays constant, while SLURM++ shows a linearly increasing trend with respect to scale
2. The medium-job workload introduces a small amount of resource stealing overhead
Conclusion:
1. SLURM begins to saturate at large scales, while SLURM++ keeps an increasing trend with respect to scale
2. The more partitions we have, the better the chance that a controller can steal resources
• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work
• Design and implement a distributed MTC task execution fabric (MATRIX) that uses an adaptive work stealing technique for distributed load balancing, and employs a DKVS to store task metadata
• Explore the parameter space of the work stealing technique as applied to exascale-class systems, through a light-weight job scheduling system simulator, SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks
• Evaluate and compare MATRIX with various task schedulers (Falkon, Sparrow, and CloudKon) tested on an IBM Blue Gene/P machine and the Amazon Cloud, using both micro-benchmarks and real workload traces
• Distributed load balancing technique
• An idle scheduler steals tasks from overloaded ones
• Tunable parameters (sketched below):
  – number of tasks to steal
  – number of static neighbors
  – number of dynamic neighbors
  – poll interval
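For reference while reading the following algorithms, a small sketch of these tunables as a configuration struct; the field names and comments are illustrative assumptions, not MATRIX's actual configuration interface.

#include <stddef.h>

/* Work stealing tunables (illustrative names only). */
struct work_steal_config {
    size_t num_tasks_to_steal;   /* how many tasks to take from the victim's queue */
    size_t num_static_neigh;     /* neighbors fixed at startup */
    size_t num_dyn_neigh;        /* neighbors re-selected at random on every attempt */
    double poll_interval;        /* doubled after each failed attempt, reset on success */
    double poll_upper_bound;     /* stop stealing once the interval exceeds this bound */
};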
• Select Neighbors
ALGORITHM 1. Dynamic Multi-Random Neighbor Selection (DYN-MUL-SEL)
Input: node id (node_id), number of neighbors (num_neigh), number of nodes (num_node), and the node array (nodes).
Output: a collection of neighbors (neigh).
bool selected[num_node];
for each i in 0 to num_node−1 do
    selected[i] = FALSE;
end
selected[node_id] = TRUE;
Node neigh[num_neigh];
index = −1;
for each i in 0 to num_neigh−1 do
    repeat
        index = Random() % num_node;
    until !selected[index];
    selected[index] = TRUE;
    neigh[i] = nodes[index];
end
return neigh;
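A runnable C version of the same selection, for illustration; it assumes nodes are identified by integer ids, that the caller has seeded rand(), and that num_neigh < num_node.

#include <stdbool.h>
#include <stdlib.h>

/* Pick num_neigh distinct random neighbor ids for node_id, never selecting node_id itself.
 * The caller frees the returned array. Assumes num_neigh < num_node. */
int *select_neighbors(int node_id, int num_neigh, int num_node) {
    bool *selected = calloc(num_node, sizeof(bool));
    int *neigh = malloc(num_neigh * sizeof(int));
    selected[node_id] = true;
    for (int i = 0; i < num_neigh; i++) {
        int index;
        do {
            index = rand() % num_node;   /* re-draw until an unselected node is found */
        } while (selected[index]);
        selected[index] = true;
        neigh[i] = index;
    }
    free(selected);
    return neigh;
}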
• Dynamic Poll Interval
ALGORITHM 2. Adaptive Work Stealing Algorithm (ADA-WORK-STEALING)
Input: node id (node_id), number of neighbors (num_neigh), number of nodes (num_node), the node array (nodes), the initial poll interval (poll_interval), and the upper bound (upper_bound).
Output: NULL
neigh = DYN-MUL-SEL(node_id, num_neigh, num_node, nodes);
most_load_node = neigh[0];
for each i in 1 to num_neigh−1 do
    if (most_load_node.load < neigh[i].load) then
        most_load_node = neigh[i];
    end
end
if (most_load_node.load == 0) then
    sleep(poll_interval);
    poll_interval = poll_interval × 2;
    if (poll_interval < upper_bound) then
        ADA-WORK-STEALING(node_id, num_neigh, num_node, nodes, poll_interval, upper_bound);
    else
        return;
    end
else
    num_task_steal = number of tasks to steal from most_load_node;
    if (num_task_steal == 0) then
        sleep(poll_interval);
        poll_interval = poll_interval × 2;
        if (poll_interval < upper_bound) then
            ADA-WORK-STEALING(node_id, num_neigh, num_node, nodes, poll_interval, upper_bound);
        else
            return;
        end
    else
        steal num_task_steal tasks from most_load_node;
        poll_interval = 1;
    end
end
[Figure: MATRIX architecture with one scheduler, executor, and ZHT server per compute node; schedulers are fully connected. Client interaction: (1) clients submit tasks to a scheduler; (2) clients look up task status in ZHT; (3) ZHT returns the task status. Work stealing between schedulers: (4) request load; (5) send load; (6) request task; (7) send task.]
• Worst case: all tasks are submitted to one scheduler
• Best case: all tasks are evenly distributed to all schedulers
• The client doesn't have to stay alive
• A monitoring program polls the task execution progress periodically (sketched below)
• It records logs about the system state for visualization
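A minimal sketch of such a monitor, assuming a hypothetical zht_lookup call and a hypothetical num_tasks_done counter updated by the schedulers; it illustrates the polling idea rather than the actual MATRIX monitor.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical ZHT client call (stand-in for the real API). */
int zht_lookup(const char *key, char *value, size_t len);

/* Periodically poll the finished-task counter and log progress until all tasks are done. */
void monitor_progress(long total_tasks, unsigned poll_sec) {
    char value[64];
    long done = 0;
    while (done < total_tasks) {
        if (zht_lookup("num_tasks_done", value, sizeof(value)) == 0)
            done = atol(value);
        fprintf(stderr, "progress: %ld/%ld tasks finished\n", done, total_tasks);
        sleep(poll_sec);
    }
}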
• Focus on the Amazon Cloud environment
• The “m1.medium” instance:
  – 1 virtual CPU, 2 compute units
  – 3.7GB memory
  – 410GB hard disk
• Run MATRIX, Sparrow, and CloudKon on up to 64 instances
• The number of executing threads is 2 for all of the systems
Reason:
1. MATRIX achieves near-perfect load balancing when submitting tasks
2. Sparrow needs to send probe messages to push tasks
3. The clients of CloudKon need to push tasks to and pull tasks from SQS
Conclusion:
1. SQS and DynamoDB are designed specifically for the Cloud
2. The probing and pushing method in Sparrow has poor scalability for heterogeneous workloads
3. The network communication layer of MATRIX needs to be improved significantly
• Through SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks
• Number of tasks to steal
• Number of static neighbors
• Number of dynamic neighbors
[Figure: efficiency vs. scale (1 to 8192 nodes) for the number of tasks to steal: steal_1, steal_2, steal_log, steal_sqrt, steal_half.]
Conclusion:
1. Steal half of the tasks
2. A small number of static neighbors is not sufficient
3. The more tasks (up to half) to steal, the better the performance
[Figure: efficiency vs. scale (1 to 8192 nodes) for the number of static neighbors: nb_2, nb_log, nb_sqrt, nb_eighth, nb_quar, nb_half.]
Conclusion:
1. The optimal number of static neighbors is an eighth of all nodes
2. Impossible at exascale with 128K neighbors
3. Need dynamic neighbors to reduce the neighbor count
[Figure: efficiency vs. scale (1 to 1M nodes) for the number of dynamic neighbors: nb_1, nb_2, nb_log, nb_sqrt.]
Conclusion:
1. A square root number of dynamic neighbors is optimal
2. Reasonable at exascale (1K neighbors)
• Introduction & Motivation
• SLURM++
• MATRIX
• Conclusion & Future Work
• Applications for exascale computing are becoming ensemble-based and finer-grained
• Task schedulers for exascale computing need to be distributed and scalable
• SLURM++ should be integrated with MATRIX
• Re-implement MATRIX
• Large scale
• Integration of SLURM++ and MATRIX
• Workflow integration
• Data-aware scheduling
• Distributed Map-Reduce framework support
• Xiaobing Zhou
• Hao Chen
• Kiran Ramamurthy
• Iman Sadooghi
• Michael Lang
• Ioan Raicu
• More information:
– http://datasys.cs.iit.edu/~kewang/
• Contact:
– kwang22@hawk.iit.edu
• Questions?