SOME WORKLOAD SCHEDULING ALTERNATIVES IN THE HIGH PERFORMANCE COMPUTING ENVIRONMENT
Tyler A. Simon, University of Maryland, Baltimore County
Jim McGalliard, FEDSIM
CMG Southern Region, Raleigh, October 4, 2013
TOPICS
• HPC WORKLOADS
• BACKFILL
• MAPREDUCE & HADOOP
• HADOOP WORKLOAD SCHEDULING
• ALTERNATIVE PRIORITIZATIONS
• ALTERNATIVE SCHEDULES
• DYNAMIC PRIORITIZATION
• SOME DYNAMIC PRIORITIZATION RESULTS
• CONCLUSIONS
HPC Workloads
• Current generation HPCs have many CPUs – say, 1,000 or 10,000. Economies of scale made clusters of commodity processors far cheaper, in price/performance terms, than custom-designed processors.
• Typical HPC applications map some system – e.g., a physical system like the Earth's atmosphere – into a matrix.
HPC Workloads
• e.g., the Cubed Sphere…
[Figure: the cubed-sphere grid, an example of mapping the Earth's atmosphere onto a matrix]
HPC Workloads
• Mathematical systems, such as systems of linear equations
• Physical systems, such as particle or molecular physics
• Logical systems, including more recently web activity, of which more later
HPC Workloads
• The application software simulates the behavior or dynamics of the system represented by assigning parts of the system to nodes of the cluster. After processing, final results are collected from the nodes.
• Often, applications can represent the behavior of the system of interest more accurately by using more nodes.
• Mainframes optimize the use of the expensive processor resource by dispatching new work on it when the current workload no longer requires it, e.g., at the start of an I/O operation.
HPC Workloads
• In contrast to a traditional mainframe workload, HPC jobs may use hundreds or thousands of processor nodes simultaneously.
• Halting job execution on 1,000 CPUs so that a single CPU can start an I/O operation is not efficient.
• Some HPCs have checkpoint/restart capabilities that could allow job interruption and also easier recovery from a system failure. Most do not.
• On an HPC system, typically, once a job is dispatched, it is allowed to run uninterrupted until completion. (More on that later.)
Backfill
• HPCs are usually oversubscribed.
• Backfill is commonly used to increase processor utilization in an HPC environment.
[Figure: No Backfill (FCFS) – jobs J1–J4 plotted as processors vs. time]
[Figure: Strict Backfill – jobs J1–J4 plotted as processors vs. time]
[Figure: Relaxed Backfill – jobs J1–J4 plotted as processors vs. time]
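The backfill decision in the figures above can be sketched in code. Below is a minimal, illustrative sketch – not the scheduler behind the results later in this deck. It assumes a job is a dict with procs and est_runtime fields and that running jobs are (end_time, procs) pairs; relax = 1.0 approximates strict backfill, and relax > 1.0 approximates the relaxed variant studied in [Ward].

    # Illustrative backfill sketch; job model and names are assumptions.

    def earliest_start(head, free, running, now):
        """Earliest time the queue-head job fits, releasing processors
        as running jobs (end_time, procs) complete."""
        t = now
        for end_time, procs in sorted(running):
            if free >= head["procs"]:
                break
            free += procs
            t = end_time
        return t

    def pick_backfill(queue, free, running, now, relax=1.0):
        """Choose later jobs to start now, when the head does not fit.
        With relax = 1.0, a backfilled job may not push the head past
        its current earliest start (strict backfill); relax > 1.0
        tolerates a bounded delay (relaxed backfill). Simplified: it
        ignores jobs that fit beside the head's future allocation."""
        head, started = queue[0], []
        shadow = earliest_start(head, free, running, now)
        deadline = now + relax * (shadow - now)
        for job in queue[1:]:
            if job["procs"] <= free and now + job["est_runtime"] <= deadline:
                started.append(job)      # fits now and respects the deadline
                free -= job["procs"]
        return started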
MapReduce
• Amdahl's Law explains why it is important in an HPC environment for the application to be coded to exploit parallelism.
• Doing so – programming for parallelism – is a historically difficult problem.
• MapReduce is a response to this problem.
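For reference, Amdahl's Law: if P is the fraction of a program that can run in parallel, the speedup on N processors is bounded by

    Speedup(N) = 1 / ((1 − P) + P / N)

Even with P = 0.95, 1,000 processors yield a speedup of only about 19.6; the serial fraction dominates, which is why coding for parallelism matters so much.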
MapReduce
• MapReduce is a simple method for implementing parallelism in a program. The programmer inserts map() and reduce() calls in the code; the compiler, dispatcher, and operating system take care of distributing the code and data onto multiple nodes.
• The map() function distributes the input file data onto many disks and the code onto many processors. The code then processes this data in parallel – for example, weather parameters in discrete patches of the surface of the globe.
MapReduce
• The reduce() function gathers and combines the data from many nodes into one and generates the output data set.
• MapReduce makes it easier for programmers to implement parallel versions of their applications by taking care of the distribution, management, and shuffling of code and data to and from the many processors and their associated storage.
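As an illustration of the pattern – a toy, single-machine stand-in, not Hadoop's actual API – here is the classic word-count example: map() emits per-split counts in parallel and reduce() gathers and combines them.

    from collections import defaultdict
    from multiprocessing import Pool

    def map_phase(split):
        """map(): emit per-word counts for one input split."""
        counts = defaultdict(int)
        for word in split.split():
            counts[word] += 1
        return counts

    def reduce_phase(partials):
        """reduce(): gather the per-split results and combine them."""
        total = defaultdict(int)
        for partial in partials:
            for word, n in partial.items():
                total[word] += n
        return dict(total)

    if __name__ == "__main__":
        splits = ["the quick brown fox", "the lazy dog", "the fox"]
        with Pool() as pool:                        # stand-in "cluster"
            partials = pool.map(map_phase, splits)  # parallel map phase
        print(reduce_phase(partials))               # e.g., {'the': 3, ...}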
MapReduce
• MapReduce also improves reliability by distributing the data redundantly.
• It takes care of load balancing and performance optimization by distributing the code and data fairly among the many nodes in the cluster.
• MapReduce may also be used to implement prioritization by controlling the number of map and reduce slots created and allocated to various users, classes of users, or classes of jobs – the more slots allocated, the faster that user or job will run. (A sketch follows.)
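As a hypothetical sketch of that slot-based prioritization, in the spirit of Hadoop's proportional-share schedulers (cf. [Sandholm2010]) – the user names, weights, and the allocate_slots helper are all illustrative, not part of any real API:

    def allocate_slots(total_slots, weights):
        """Divide map/reduce slots among users in proportion to weight
        (integer truncation; remainder handling omitted for brevity)."""
        total_weight = sum(weights.values())
        return {user: int(total_slots * w / total_weight)
                for user, w in weights.items()}

    # e.g., 100 map slots split 3:1 between production and ad-hoc users:
    # allocate_slots(100, {"prod": 3, "adhoc": 1}) -> {"prod": 75, "adhoc": 25}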
MapReduce
[Figure: MapReduce data flow – image licensed by the GNU Free Software Foundation]
MapReduce
• MapReduce was originally developed for use by small teams, where FIFO or "social scheduling" was sufficient for workload scheduling.
• It has grown to be used for very large problems, such as sorting, searching, and indexing large data sets.
• Google uses MapReduce to index the world wide web and holds a patent on the method.
• MapReduce is not the only framework for parallel processing and has opponents as well as advocates.
HADOOP
• Hadoop is a specific open-source implementation of the MapReduce framework, written in Java and licensed by Apache.
• It includes a distributed file system (HDFS).
• Designed for very large (thousands of processors) systems using commodity processors, including grid systems.
• Implements data replication for checkpoint (failover) and locality.
• Locality of processors with their data conserves network bandwidth.
Applications of Hadoop
• Data-intensive applications
• Applications needing fault tolerance
• Not a DBMS
• Yahoo; Facebook; LinkedIn; Amazon
• Watson, the IBM system that won on Jeopardy!, used Hadoop
Hadoop MapReduce & Distributed File System Layers
[Figure: Hadoop's MapReduce engine layered over HDFS – source: Wikipedia]
Hadoop Workload Scheduling
• Hadoop attempts to dispatch tasks on nodes near the data the application requires (e.g., in the same rack); Hadoop's HDFS is "rack aware."
• Default Hadoop scheduling is FCFS (FIFO).
• Scheduling to optimize data locality and scheduling to optimize processor utilization are in conflict.
• There are various alternatives to HDFS and to the default Hadoop scheduler.
Some Prioritization Alternatives
• Wait Time
• Run Time
• Number of Processors
• Queue
• Composite Priorities
• Dynamic Priorities
Some Scheduling Alternatives
• Global vs. Local Scheduling
• Resource-Aware Scheduling
• Phase-Based Scheduling
• Delay Scheduling (sketched below)
• Copy-Compute Splitting
• Preemption and Interruption
• Social Scheduling
• Variable Budget Scheduling
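As one example from the list above, delay scheduling ([Zaharia2010]) trades a little waiting for data locality. A minimal sketch, assuming each job records the nodes holding its data and a skip counter (both field names are illustrative, and this is not the Hadoop implementation):

    def pick_job(queue, free_node, max_skips):
        """Offer a free node to jobs in queue order, preferring data
        locality. A job that misses locality for more than max_skips
        scheduling opportunities is launched non-locally anyway."""
        for job in queue:
            if free_node in job["data_nodes"]:   # data-local: launch here
                job["skips"] = 0
                return job, True
            job["skips"] += 1                    # skipped this opportunity
            if job["skips"] > max_skips:         # waited long enough:
                return job, False                # launch without locality
        return None, False                       # leave the node idle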
Dynamic Prioritization
• Bill Ward and Tyler Simon, among others, have proposed setting job priorities as the product of several factors, each raised to a tunable exponent:

    Priority = (CPUs Requested)^α × (Wait Time)^β × (Est'd Processing Time)^γ × Queue Factor

• The exponents α, β, and γ (the CPU, wait-time, and processing-time parameters) are the values swept in the algorithms and results that follow.
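A worked example may help. The priority() helper below is illustrative, not the authors' code, and follows the (α, β, γ) convention used in the results slides, under which the classic policies fall out as special cases:

    def priority(cpus, wait, runtime, alpha, beta, gamma, queue_factor=1.0):
        """Dynamic priority as a product of powers (illustrative)."""
        return (cpus ** alpha) * (wait ** beta) * (runtime ** gamma) * queue_factor

    # A 64-CPU job that has waited 10 time units, estimated to run 100:
    # priority(64, 10, 100, 0, 1, 0)   -> 10.0  (reduces to wait time: FCFS)
    # priority(64, 10, 100, 0, 0, -1)  -> 0.01  (reduces to 1/runtime: SJF)
    # priority(64, 10, 100, -0.1, -0.3, -0.4)   (the dynamic blend tested below)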
Algorithm – Prioritization

Data: job file, system size
Result: schedule performance

read job input file;
for α, β, γ = −1 → 1, step 0.1 do
    while jobs are either running or queued do
        calculate job priorities;
        for every job do
            if job is running and has time remaining then
                update_running_job();
            else
                for all waiting jobs do
                    pull jobs from the priority queue and start them based on best fit;
                    if a job cannot be started then
                        increment its wait time
                    end
                end
            end
        end
    end
    print results;
end
Algorithm – Cost Evaluation

Require: C, the capacity of the knapsack; n, the number of tasks; a, an array of tasks of size n
Ensure: a cost vector containing Ef for each job class: Cost(A, B, C, D, WRT)

i = 0
for α = −2 to 2, step 0.1 do
    for β = −2 to 2, step 0.1 do
        for γ = −2 to 2, step 0.1 do
            Cost[i] = schedule(α, β, γ)
            {for optimization:}
            if Cost[i] < bestSoFar then
                bestSoFar = Cost[i]
            end if
            i = i + 1
        end for
    end for
end for
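The sweep above is easy to express directly. A runnable sketch, under the assumption that schedule(α, β, γ) is a stand-in for the simulator on the previous slide and returns a single scalar cost:

    import itertools

    def sweep(schedule, lo=-2.0, hi=2.0, step=0.1):
        """Exhaustive search over (alpha, beta, gamma): 41 values per
        axis here, so roughly 69,000 schedule() evaluations in all."""
        grid = [round(lo + i * step, 10)
                for i in range(int(round((hi - lo) / step)) + 1)]
        best_cost, best_params = float("inf"), None
        for a, b, g in itertools.product(grid, repeat=3):
            cost = schedule(a, b, g)      # one full schedule simulation
            if cost < best_cost:          # keep the best setting so far
                best_cost, best_params = cost, (a, b, g)
        return best_params, best_cost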
Some Results (0, 1, 0) = FCFS
[Figure: simulated schedule for (α, β, γ) = (0, 1, 0)]
Some Results (0, 0, −1) = SJF
[Figure: simulated schedule for (α, β, γ) = (0, 0, −1)]
Some Results (−0.1, −0.3, −0.4)
[Figure: simulated schedule for (α, β, γ) = (−0.1, −0.3, −0.4)]
Some Results

Policy        α     β     γ    Wait   Utiliz (%)   Total Wait
----------  ----  ----  ----   ----   ----------   ----------
LJF            0     0     1    165           84         4820
LargestJF      1     0     0    146           95         5380
SJF            0     0    -1    150           92         3750
SmallestJF    -1     0     0    165           84         4570
FCFS           0     1     0    156           88         4970
Dynamic     -0.1  -0.3  -0.4    145           95         3830
Conclusions
• Where data center management wants to maximize some calculable objective, an exhaustive search of the parameter space may make it possible to tune the system continually and deliver near-optimal performance in terms of that objective function.
Conclusions
• We expect new capabilities, e.g., commodity cluster computers and Hadoop, to continue to inspire new applications. The proliferation of workload scheduling alternatives may be the result of (1) the challenges of parallel programming, (2) the popularity of open source platforms that are easy to customize, and (3) the brief lifespan of MapReduce, which has not yet had the chance to mature.
Bibliography (from National CMG Paper)

[Calzolari] Calzolari, Federico, and Volpe, Silvia. "A New Job Migration Algorithm to Improve Data Center Efficiency." Proceedings of Science, International Symposium on Grids and Clouds and the Open Grid Forum, Taipei, March 2011.

[Feitelson1998] Feitelson, Dror, and Rudolph, Larry. "Metrics and Benchmarking for Parallel Job Scheduling." Job Scheduling Strategies for Parallel Processing '98, LNCS 1459, Springer-Verlag, Berlin, 1998.

[Feitelson1999] Feitelson, Dror, and Naaman, Michael. "Self-Tuning Systems." IEEE Software, March/April 1999.

[Glassbrook] Glassbrook, Richard, and McGalliard, James. "Performance Management at an Earth Science Supercomputer Center." CMG 2003.

[Heger] Heger, Dominique. "Hadoop Performance Tuning – A Pragmatic & Iterative Approach." www.cmg.org/measureit/issues/mit97/m_97_3.pdf

[Hennessy] Hennessy, J., and Patterson, D. Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, San Mateo, California.

[Herodotou] Herodotou, Herodotos, and Babu, Shivnath. "Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs." Proceedings of the 37th International Conference on Very Large Data Bases, Vol. 4, No. 11. VLDB Endowment, Seattle, ©2011.

[Hovestadt] Hovestadt, Matthias, and others. "Scheduling in HPC Resource Management Systems: Queuing vs. Planning." Job Scheduling Strategies for Parallel Processing. Springer, Berlin/Heidelberg, 2003.

[Nguyen] Nguyen, Phuong; Simon, Tyler; and others. "A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment." IEEE/ACM Fifth International Conference on Utility and Cloud Computing. IEEE Computer Society, ©2012.

[Rao] Rao, B. Thirumala, and others. "Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing." Global Journal of Computer Science and Technology, Volume XI, Issue VIII, May 2011.

[Sandholm2009] Sandholm, Thomas, and Lai, Kevin. "MapReduce Optimization Using Regulated Dynamic Prioritization." SIGMETRICS/Performance '09. ACM, Seattle, ©2009.

[Sandholm2010] Sandholm, Thomas, and Lai, Kevin. "Dynamic Proportional Share Scheduling in Hadoop." Job Scheduling Strategies for Parallel Processing 2010. Springer-Verlag, Berlin, 2010.

[Scavlex] Scavlex. http://compprog.wordpress.com/2007/11/20/the-fractional-knapsack-problem

[Sherwani] Sherwani, Jahanzeb, and others. "Libra: A Computational Economy-based Job Scheduling System for Clusters." Software: Practice and Experience 34.6, 2004.

[Simon] Simon, Tyler, and others. "Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization." HPC 2013, Spring Simulation Conference, The Society for Modeling & Simulation International, 2013.

[Spear] Spear, Carrie, and McGalliard, James. "A Queue Simulation Tool for a High Performance Scientific Computing Center." CMG 2007, San Diego, 2007.

[Streit] Streit, Achim. "On Job Scheduling for HPC-Clusters and the dynP Scheduler." Paderborn Center for Parallel Computing, Paderborn, Germany.

[Ward] Ward, William A., and others. "Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy." Revised Papers from the 8th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP '02, London, Springer-Verlag, 2002.

[Zaharia2009] Zaharia, Matei, and others. "Job Scheduling for Multi-User MapReduce Clusters." Technical Report UCB/EECS-2009-55, University of California at Berkeley, 2009.

[Zaharia2010] Zaharia, Matei, and others. "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling." EuroSys '10, ACM, Paris, ©2010.