SOME WORKLOAD SCHEDULING ALTERNATIVES IN THE HIGH PERFORMANCE COMPUTING ENVIRONMENT

Jim McGalliard, FEDSIM
Tyler Simon, University of Maryland Baltimore County

CMG Southern Region, Richmond, April 25, 2013
TOPICS

• MAINFRAME VS. HPC WORKLOADS
• HPC WORKLOAD SCHEDULING
• BACKFILL
• DYNAMIC PRIORITIZATION
• MAPREDUCE FRAMEWORK & HADOOP
• HADOOP WORKLOAD SCHEDULING
• DYNAMIC PRIORITIZATION IN HADOOP
• SOME RESULTS
HPC vs. Mainframe Workload Scheduling

• CMG's historical context is MVS on S/360.
• In a traditional mainframe environment, a workload runs on a single CPU or perhaps a few CPUs.
• Processor utilization is optimized on one level by dispatching a new job on a processor when the current job idles that processor, e.g., by starting an I/O.
• Possible to achieve very high reported processor utilization (e.g., 103%) due to a variety of optimizations, of which job swapping is one.
• Not how it's done on HPC systems.
HPC vs. Mainframe Workload Scheduling

• Current generation HPCs have many CPUs – say, tens of thousands. Due to scale economies, clusters of commodity processors are the most common design. Custom-designed processors (e.g., Cray for many years) are too expensive in terms of price vs. performance.
• Many HPC applications map some system – typically, a physical system (e.g., Earth's atmosphere) – into a matrix or network of many similar nodes.
HPC vs. Mainframe Workload Scheduling

• e.g., the Cubed Sphere…

[Figure: the cubed-sphere grid]
HPC vs. Mainframe Workload Scheduling

• The application software simulates the behavior or dynamics of the represented system by assigning parts of the system to nodes of the cluster. After processing, final results are collected from the nodes.
• Often, using more nodes can represent the behavior of the system more accurately.
HPC Workload Scheduling

• So, in contrast to a traditional mainframe workload, HPC jobs may use hundreds or thousands of processor nodes simultaneously.
• Halting job execution on 1,000 CPUs so that a single CPU can start an I/O operation is not efficient.
• Some HPCs have checkpoint/restart capabilities that could allow job interruption and also easier recovery from a system failure. Most do not.
• On an HPC system, typically, once a job is dispatched, it is allowed to run uninterrupted until completion.
HPC Workload Scheduling

• The use of queues and priority classification is common in HPC environments, as it is with mainframes.
• A job class might be categorized by the maximum number of processors a job might require, the estimated length of processing time, or the need for specialized resources.
• Users or user groups may be allocated budgets of CPU time or of other resources.
• A consequence of such queue structures is that users must estimate their jobs' processing times. These estimates may not be accurate.
HPC Workload Scheduling

• Cluster HPCs often have very low actual processor utilization – say, 10 or 20%.
• High reported utilization may measure CPU allocation rather than CPU-busy time.
• Workload optimization by the sys admin is done at the job rather than the CPU level – deciding which jobs will be dispatched on which processors in what sequence.
• The ability of an individual job to use its allocated processors is of interest and subject to Amdahl's Law, among other constraints, but is outside the scope of this presentation.
The Knapsack Problem

• Finding the optimum sequence of jobs, such that, e.g., utilization or throughput is maximized or expansion is minimized, is an example of the Knapsack Problem.
• Finding the optimum in a Knapsack Problem is NP-hard and often computationally infeasible in practice.
• Some variation of the "greedy algorithm," in which jobs are sorted on their value, is often implemented – see the sketch below.
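A minimal sketch of the greedy idea, assuming each job carries a priority value ("value") and a CPU request ("weight"), with the knapsack being the currently free processors. The names and the value-density heuristic are illustrative, not a specific site's policy.

```python
# Greedy knapsack-style job selection: sort jobs by value density
# (priority per CPU requested) and take whatever still fits.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: float   # the job's "value"
    cpus: int         # the job's "weight"

def greedy_select(jobs, free_cpus):
    """Pick jobs in order of value density until no remaining job fits
    in the free processors."""
    selected = []
    for job in sorted(jobs, key=lambda j: j.priority / j.cpus, reverse=True):
        if job.cpus <= free_cpus:
            selected.append(job)
            free_cpus -= job.cpus
    return selected

jobs = [Job("J1", 10.0, 512), Job("J2", 8.0, 64), Job("J3", 3.0, 128)]
print([j.name for j in greedy_select(jobs, 600)])  # ['J2', 'J3']
```

Greedy selection is fast but not optimal; the wide, high-value J1 is skipped here because the two denser jobs were taken first.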
Backfill

• HPCs are usually oversubscribed.
• Backfill is commonly used to increase processor utilization in an HPC environment.
No Backfill (FCFS)

[Chart: jobs J1–J4 on a Processors-vs.-Time grid, scheduled first-come, first-served, with idle gaps]

Strict Backfill

[Chart: the same jobs, with later, smaller jobs backfilled into the idle processor gaps]

Relaxed Backfill

[Chart: a relaxed backfill schedule of the same jobs]
Backfill

• Backfill is a common HPC workload scheduling optimization at the job level.
• Note, it depends on the programmer's estimate of the execution time, which is likely to be influenced by the priority queue structure for that data center and machine. A sketch of the idea follows.
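A minimal conceptual sketch of strict backfill, assuming jobs carry user-supplied runtime estimates and that `shadow_time` is when enough processors free up to start the job at the head of the queue. Names are illustrative; production schedulers (e.g., EASY-style backfill) differ in detail.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cpus: int
    est_runtime: float  # the user's estimate, which may be inaccurate

def dispatch(queue, free_cpus, now, shadow_time):
    """Run head-of-queue jobs while they fit; once the head is blocked,
    backfill later jobs that fit in the free CPUs and are estimated to
    finish before shadow_time, so the head job's start is not delayed."""
    started = []
    while queue and queue[0].cpus <= free_cpus:
        job = queue.pop(0)              # plain FCFS while the head fits
        free_cpus -= job.cpus
        started.append(job)
    for job in list(queue):             # head is blocked: try to backfill
        fits = job.cpus <= free_cpus
        ends_in_time = now + job.est_runtime <= shadow_time
        if fits and ends_in_time:       # strict backfill condition
            queue.remove(job)
            free_cpus -= job.cpus
            started.append(job)
    return started

queue = [Job("J1", 512, 3600), Job("J2", 64, 600), Job("J3", 128, 900)]
print([j.name for j in dispatch(queue, 400, now=0, shadow_time=1000)])
# ['J2', 'J3'] – J1 must wait, but J2 and J3 finish before its start
```

A relaxed variant loosens the `ends_in_time` test, allowing the head job's start to slip by a bounded amount in exchange for higher utilization.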
Dynamic, Multi-Factor Prioritization

• Bill Ward and Tyler Simon, among others, have proposed setting job priorities as the product of several factors, each raised to a tunable parameter:

  Priority = (Wait Time)^(Wait Parameter) × (Est'd Proc Time)^(Proc Parameter) × (CPUs Requested)^(CPU Parameter) × Queue
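A minimal sketch of this priority product. The exponent values and the `queue_weight` factor shown here are illustrative assumptions; in practice the system administrator chooses them to express site policy.

```python
def priority(wait_time, est_proc_time, cpus_requested, queue_weight,
             wait_param=1.0, proc_param=-0.5, cpu_param=-0.25):
    """Priority = wait^w * est_time^p * cpus^c * queue_weight.
    The negative exponents (an assumed policy) favor shorter, narrower
    jobs, while accumulating wait time raises priority over time."""
    return (wait_time ** wait_param
            * est_proc_time ** proc_param
            * cpus_requested ** cpu_param
            * queue_weight)

# Example: a job waiting 3600 s, estimated at 600 s, asking for 64 CPUs.
print(priority(3600.0, 600.0, 64, queue_weight=1.0))  # ~52.0
```

Because wait time grows while a job sits in the queue, recomputing this product at each scheduling interval makes priorities dynamic rather than fixed at submission.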
Dynamic Priorities with Budgets

• Prioritization can also include a user allocation or budget.
• Users can increase or decrease the priorities of their jobs by bidding dollars or some measure of their allocation.
• Higher bids mean faster results.
• Lower bids mean more jobs can be run, at a cost of waiting longer for results.
• Similar to 1970s time-sharing charge-back systems – see the toy sketch below.
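A toy sketch of the bidding idea, in the spirit of proportional-share scheduling (cf. Sandholm and Lai in the bibliography): each user's share of the machine is proportional to their bid. The function name and the simple rounding are illustrative assumptions.

```python
def proportional_shares(total_slots, bids):
    """bids: {user: spending rate}. Slots are split pro rata by bid,
    so a higher bid buys a larger share and faster results."""
    total_bid = sum(bids.values())
    return {user: round(total_slots * bid / total_bid)
            for user, bid in bids.items()}

print(proportional_shares(100, {"alice": 5.0, "bob": 15.0}))
# {'alice': 25, 'bob': 75}
```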
MapReduce

• As mentioned, Amdahl's Law explains why it is important in an HPC environment for the application to be coded to exploit parallelism.
• Doing so – programming for parallelism – is a historically difficult problem.
• MapReduce is a response to this problem.
MapReduce

• MapReduce is a method for simple implementation of parallelism in a program. The programmer inserts map() and reduce() calls in the code; the compiler, dispatcher, and operating system take care of distributing the code and data onto multiple nodes.
• The map() function distributes the input file data onto many disks and the code onto many processors. The code processes this data in parallel – for example, weather parameters in discrete patches of the surface of the globe.
MapReduce

• The reduce() function gathers and combines the data from many nodes and generates the output data set.
• MapReduce makes it easier for programmers to implement parallel versions of their applications by taking care of the distribution, management, shuffling, and return of work to and from the many processors and their associated data storage. A toy sketch follows.
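A toy single-process sketch of the MapReduce programming model (word count). A real framework would run the map and reduce tasks on many nodes and shuffle the intermediate pairs between them; here the shuffle is just an in-memory grouping.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (key, value) pairs for one input record."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)

def mapreduce(records):
    # Shuffle: group intermediate pairs by key, as the framework would
    # do between the map and reduce phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

The programmer writes only map_fn and reduce_fn; everything inside mapreduce() is what the framework does on the programmer's behalf.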
MapReduce

• MapReduce can also take care of reliability by distributing the data redundantly.
• It also takes care of load balancing and performance optimization by distributing the code and data fairly among the many nodes in the cluster.
• MapReduce may also be used to implement prioritization by controlling the number of map and reduce slots created and allocated to various users, classes of users, or classes of jobs – the more slots allocated, the faster that user or job will run.
MapReduce

[Diagram: MapReduce data flow. Image licensed by the GNU Free Software Foundation.]
MapReduce

• MapReduce was originally developed for use by small teams where FIFO or "social scheduling" was sufficient for workload scheduling.
• It has grown to be used for very large problems, such as sorting, searching, and indexing large data sets.
• Google uses MapReduce to index the world wide web and holds a patent on the method.
• MapReduce is not the only framework for parallel processing and has opponents as well as advocates.
HADOOP

• Hadoop is a specific open source implementation of the MapReduce framework, written in Java and licensed by Apache.
• Hadoop also includes a distributed file system.
• Designed for very large (thousands of processors) systems using commodity processors, including grid systems.
• Implements data replication for checkpoint (failover) and locality.
• Locality of processors with their data helps conserve network bandwidth.
Applications of Hadoop

• Data intensive applications
• Applications needing fault tolerance
• Not a DBMS
• Yahoo; Facebook; LinkedIn; Amazon
• Watson, the IBM system that won on Jeopardy, used Hadoop
Hadoop MapReduce & Distributed File System Layers

[Diagram: Hadoop MapReduce and HDFS layers. Source: Wikipedia]
Hadoop Workload Scheduling

• The Task Tracker attempts to dispatch tasks on nodes near the data the application requires (e.g., the same rack). Hadoop HDFS is "rack aware" – see the sketch below.
• Default Hadoop scheduling is FCFS = FIFO.
• Workload scheduling to optimize data locality and workload scheduling to optimize processor utilization are in conflict.
• There are various alternatives to HDFS and the default Hadoop scheduler.
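A toy sketch of the rack-aware placement preference: data-local first, rack-local second, remote last. The ranking function and data structures are illustrative assumptions, not Hadoop's actual API.

```python
def locality_rank(block_nodes, candidate, rack_of):
    """Lower is better: 0 = candidate holds the data block (data-local),
    1 = candidate shares a rack with a block holder (rack-local),
    2 = remote placement, which costs the most network bandwidth."""
    if candidate in block_nodes:
        return 0
    if rack_of[candidate] in {rack_of[n] for n in block_nodes}:
        return 1
    return 2

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(locality_rank({"n1"}, "n2", rack_of))  # 1 (same rack as n1)
print(locality_rank({"n1"}, "n3", rack_of))  # 2 (different rack)
```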
Hadoop Scheduling Alternatives: Fair-Share Scheduler

• Fair-Share is a workload scheduler designed to replace FCFS
• Organized into pools (a default of one per user)
• Goal: ensure fair resource use
• Each pool is entitled to a minimum share of the cluster
• Can be FIFO or Fair-Share within a pool
• Shares map() and reduce() slots
Fair-Share

• If excess capacity is available, it can be given to a pool that has already received its minimum share
• Allows a maximum number of jobs per user
• Allows a maximum number of jobs per pool
• Allows pre-emption of jobs
• Can use size (# of tasks) or priorities – see the sketch below
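A minimal sketch of fair-share slot allocation across pools, assuming one pool per user. It illustrates the idea (guaranteed minimum shares, with excess capacity handed out evenly); it is not the Apache Fair Scheduler implementation.

```python
def allocate_slots(total_slots, pools):
    """pools: {name: min_share}. Guarantee each pool its minimum share,
    then distribute any excess capacity round-robin across the pools."""
    shares = dict(pools)
    excess = total_slots - sum(pools.values())
    while excess > 0:                 # hand out the leftover slots
        for name in shares:
            if excess == 0:
                break
            shares[name] += 1
            excess -= 1
    return shares

# Example: 100 slots, three user pools with guaranteed minimums.
print(allocate_slots(100, {"alice": 20, "bob": 20, "carol": 30}))
# {'alice': 30, 'bob': 30, 'carol': 40}
```

The real scheduler is more elaborate (deficit-based fairness, per-pool FIFO or fair ordering, pre-emption), but the minimum-plus-excess structure is the core of it.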
Hadoop Scheduling Alternatives: Capacity Scheduler

• Organized into queues
• Queues share a percentage of the cluster
• Goal: maximize utilization & throughput, e.g., fast response time for small jobs and guaranteed response time for production jobs
• FIFO within queues
• Uses pre-emption
• Queues can be started and stopped at run time
• If excess capacity is available, it can be given to any queue
Dynamic Prioritization in Hadoop

• Tyler Simon, a collaborator of mine in prior Southern CMG presentations, has studied the application of dynamic prioritization in the Hadoop environment.
• An additional consideration in his study is fractional job allocation: dispatching a job on less than the total number of CPUs requested by the user. Massively parallel jobs can tolerate this.
Dynamic Prioritization in Hadoop

• Priorities are calculated based on the same factors as before: actual wait time, estimated processing time, and number of CPUs requested.
• Queues are typically structured around those same factors. Using a single queue and reflecting the significance of those factors in the priority calculation is more flexible than fixed queues.
• The system administrator must still set parameters (the relative importance of each factor).
Dynamic Prioritization in Hadoop vs. Fair-Share

• Backfill
• No pre-emption
• Fixed shares are optional
• Improved workload response time
• Improved utilization
• Priorities are dynamic – recalculated at each scheduling interval
Dynamic Prioritization in Hadoop vs. Ward

• Compare to Ward's proposed dynamic prioritization
• Use of parameters is the same
• The SysAdmin implements policy by selecting parameters in the same way, but…
• Only one queue
• Includes the fractional Knapsack – a fraction of a job may be dispatched to fill the Knapsack, as in the sketch below
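A minimal sketch of fractional-knapsack dispatch, assuming a massively parallel job can usefully run on fewer CPUs than it requested. The names and structure are illustrative, not Simon's implementation.

```python
def fractional_dispatch(jobs, free_cpus):
    """jobs: list of (name, priority, cpus_requested), pre-sorted by
    priority, highest first. Whole jobs are dispatched while they fit;
    a job that does not fit is granted the remaining CPUs as a
    fractional allocation, so the knapsack (free CPUs) is filled."""
    plan = []
    for name, _priority, cpus in jobs:
        if free_cpus == 0:
            break
        grant = min(cpus, free_cpus)   # fractional grant if cpus > free
        plan.append((name, grant, cpus))
        free_cpus -= grant
    return plan

jobs = [("J1", 9.0, 256), ("J2", 7.0, 512), ("J3", 4.0, 128)]
for name, grant, asked in fractional_dispatch(jobs, 400):
    print(f"{name}: granted {grant} of {asked} CPUs")
# J1: granted 256 of 256 CPUs
# J2: granted 144 of 512 CPUs
```

Unlike the 0-1 knapsack, the fractional variant is solved exactly by this greedy pass, which is what makes it attractive for fast scheduling decisions.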
Dynamic Prioritization in Hadoop vs. Ward

• Option: the system can select its own parameters – self-tuning: run the optimization globally, then calculate priorities when a job completes (or is submitted, or on a schedule)
• Or, run the optimization whenever a job completes, selecting the job(s) to run next to yield near-optimum performance for the jobs currently in the queue
• Can be extended to prioritize on other parameters – e.g., time of day, day of week, reliability, power consumption, etc.
Algorithm – Prioritization

[Figure: prioritization algorithm. Source: "Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization," Tyler Simon, Phuong Nguyen, & Milton Halem]

Algorithm – Cost Evaluation

[Figure: cost evaluation algorithm, from the same source]
Some Results

[Charts: experimental results from the dynamic prioritization study, four slides]
In Conclusion…

• Tyler's results indicate about a 35% reduction in workload response time (batch turnaround time) compared to the default Hadoop scheduler
• Many ideas have been published about workload scheduling and optimization in both HPC and Hadoop
• Not clear who will come out on top
• A lot is outside the scope of this presentation
Bibliography

• Glassbrook, Richard, and McGalliard, James. "Performance Management at an Earth Science Supercomputer Center." CMG 2003.
• Heger, Dominique. "Hadoop Performance Tuning – A Pragmatic & Iterative Approach." www.cmg.org/measureit/issues/mit97/m_97_3.pdf
• Kumar, Rajeev, and Banerjee, Nilanjan. "Analysis of a Multiobjective Evolutionary Algorithm on the 0-1 Knapsack Problem." Theoretical Computer Science 358 (2006).
• Nguyen, Phuong; Simon, Tyler A.; Halem, Milton; Chapman, David; and Le, Quang. "A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment." 5th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2012), pp. 161–167, Chicago, IL, Nov. 5–8, 2012.
• Rao, B. Thirumala, and others. "Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing." Global Journal of Computer Science and Technology, Volume XI, Issue VII, May 2011.
• Sandholm, Thomas, and Lai, Kevin. "Dynamic Proportional Share Scheduling in Hadoop." Proceedings of the 15th International Conference on Job Scheduling Strategies for Parallel Processing, JSSPP '10. Berlin: Springer-Verlag.
• Scavlex. http://compprog.wordpress.com/2007/11/20/the-fractional-knapsack-problem
• Simon, Tyler, and McGalliard, James. "Multi-Core Processor Memory Contention Benchmark Analysis Case Study." CMG 2009, Dallas.
• Simon, Tyler, and others. "Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization." HPC 2013, Spring Simulation Conference, The Society for Modeling & Simulation International.
• Spear, Carrie, and McGalliard, James. "A Queue Simulation Tool for a High Performance Scientific Computing Center." CMG 2007, San Diego.
• Ward, William A., and others. "Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy." Revised Papers from the 8th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP '02, London: Springer-Verlag.