High Performance Computing - Louisiana State University

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
February 1, 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
CAPACITY COMPUTING
Topics
• Key terms and concepts
• Basic definitions
• Models of parallelism
• Speedup and Overhead
• Capacity Computing & Unix utilities
• Condor: Overview
• Condor: Useful commands
• Performance Issues in Capacity Computing
• Material for Test
Key Terms and Concepts
Serial (sequential) execution: conventional execution in which the problem is represented as a series of instructions executed one after another by a single CPU.

Parallel execution: execution in which the problem is partitioned into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.

Parallel computing takes advantage of concurrency to:
• Solve larger problems within bounded time
• Save on wall-clock time
• Overcome memory constraints
• Utilize non-local resources

[Figure: a serial problem feeds one instruction stream to a single CPU; a parallel problem is partitioned into tasks, each feeding its own instruction stream to a separate CPU.]
Key Terms and Concepts
• Scalable Speedup: relative reduction of execution time of a fixed-size workload through parallel execution

$$\text{Speedup} = \frac{\text{execution time on one processor}}{\text{execution time on } N \text{ processors}}$$

• Scalable Efficiency: ratio of the actual performance to the best possible performance

$$\text{Efficiency} = \frac{\text{execution time on one processor}}{\text{execution time on } N \text{ processors} \times N}$$
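As a quick illustration (not from the original slides; the timing values are made up), both metrics can be computed directly from measured run times:

def speedup(t1, tn):
    """Speedup: time on one processor divided by time on N processors."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Efficiency: speedup normalized by the number of processors."""
    return t1 / (n * tn)

# Hypothetical measurements: 1000 s on 1 processor, 40 s on 32 processors.
t1, t32 = 1000.0, 40.0
print(speedup(t1, t32))         # 25.0
print(efficiency(t1, t32, 32))  # 0.78125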
Defining the 3 C’s …
• Main classes of computing:
– High capacity parallel computing: a strategy for employing distributed computing resources to achieve high-throughput processing among decoupled tasks. Aggregate performance of the total system is high if sufficient tasks are available to be carried out concurrently on all separate processing elements. No single task is accelerated. Uses increased workload size of multiple tasks with increased system scale.
– High capability parallel computing: a strategy for employing tightly coupled structures of computing resources to achieve reduced execution time of a given application through partitioning into concurrently executable tasks. Uses fixed workload size with increased system scale.
– Cooperative computing: a strategy for employing a moderately coupled ensemble of computing resources to increase the size of the data set of a user application while limiting its execution time. Uses a workload of a single task of increased data set size with increased system scale.
Strong Scaling vs. Weak Scaling

[Figure: work per task (granularity, size/node) plotted against machine scale (# of nodes: 1, 2, 4, 8). Weak scaling holds work per task constant as nodes are added; strong scaling shrinks it.]

[Figure: total problem size plotted against machine scale (# of nodes). Strong scaling holds the total problem size constant; weak scaling grows it with the machine.]
Defining the 3 C’s …
• High capacity computing systems emphasize the overall work performed over a fixed time period. Work is defined as the aggregate amount of computation performed across all functional units, all threads, all cores, all chips, all coprocessors, and all network interface cards in the system.
• High capability computing systems emphasize improvement (reduction) in execution time of a single user application program of fixed data set size.
• Cooperative computing systems emphasize single-application weak scaling:
– performance increase through increase in problem size (usually data set size and # of task partitions) with increase in system scale.

Adapted from: S. Chaudhry, P. Caprioli, S. Yip, M. Tremblay, "High-performance throughput computing," IEEE Micro, 2005. doi.ieeecomputersociety.org
Strong Scaling, Weak Scaling

Capability
• Primary scaling is decrease in response time proportional to increase in resources applied
• Single job, constant size; goal: response-time scaling proportional to machine size
• Tightly coupled concurrent tasks making up a single job

Cooperative
• Single job (different nodes working on different partitions of the same job)
• Job size scales proportional to the machine
• Granularity per node is fixed over the range of system scale
• Loosely coupled concurrent tasks making up a single job

Capacity
• Primary scaling is increase in throughput proportional to increase in resources applied
• Decoupled concurrent tasks, each a separate job, increasing in number of instances; scaling proportional to the machine

[Diagram: capability and cooperative computing run a single job, capability via strong scaling and cooperative via weak scaling (workload size scaling); capacity computing scales by multiplying independent jobs.]
Models of Parallel Processing
• Conventional models of parallel processing:
– Decoupled work queue (covered in segment 1 of the course)
– Communicating Sequential Processes (CSP message passing; covered in segment 2)
– Shared-memory multiple-thread (covered in segment 3)
• Some alternative models of parallel processing:
– SIMD: single instruction stream, multiple data stream processor array
– Vector machines: hardware execution of value sequences to exploit pipelining
– Systolic: an interconnection of basic arithmetic units to match the algorithm
– Dataflow: data-precedence-constrained, self-synchronizing fine-grain execution units supporting functional (single-assignment) execution
Shared Memory Multiple Thread
• Static or dynamic
• Fine grained
• OpenMP
• Distributed shared memory systems
• Covered in Segment 3

[Figure: Symmetric Multi-Processor (SMP, usually cache coherent), with CPUs connected through a network to a shared pool of memory banks; and Distributed Shared Memory (DSM, usually cache coherent), with each CPU paired with local memory and linked by a network. Inset photo: Orion, JPL/NASA.]
Communicating Sequential Processes
• One process is assigned to each processor
• Work done by the processor is performed on the local data
• Data values are exchanged by messages
• Synchronization constructs for interprocess coordination
• Distributed memory
• Coarse grained
• MPI application programming interface
• Commodity clusters and MPP
– MPP is an acronym for "Massively Parallel Processor"
• Covered in Segment 2

[Figure: distributed memory architecture (DM, often not cache coherent), with each CPU owning its memory and communicating over a network. Inset photo: QueenBee.]
Decoupled Work Queue Model
• Concurrent disjoint tasks
– Job stream parallelism
– Parametric studies
• SPMD (single program, multiple data)
• Very coarse grained
• Example software package: Condor
• Processor farms and commodity clusters
• This lecture covers this model of parallelism (a minimal sketch follows below)
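As an illustration only (not part of the original slides), the decoupled work queue can be sketched as a pool of workers pulling independent tasks from a shared queue; the task body and parameter values here are hypothetical stand-ins:

from multiprocessing import Pool

def run_task(x):
    """One decoupled task: no communication with any other task."""
    return sum(i * i for i in range(x))   # stand-in for real work

if __name__ == "__main__":
    tasks = [100_000 + i for i in range(32)]   # hypothetical parametric study
    with Pool(processes=4) as pool:            # 4 independent workers
        results = pool.map(run_task, tasks)    # work-queue scheduling
    print(len(results), "tasks completed")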
Ideal Speedup Issues
• W is the total workload measured in elemental pieces of work (e.g., operations, instructions, subtasks, tasks, etc.)
• T(p) is the total execution time measured in elemental time steps (e.g., clock cycles), where p is the # of execution sites (e.g., processors, threads)
• wi is the work for a given task i, measured in operations
• Example: divide a "million"-operation workload (really mega: W = 2^20 operations) into 2^10 = 1024 tasks, w1 to w1024, each of 2^10 = 1K operations
• Assume 2^8 = 256 processors performing the workload in parallel
• T(256) = 4096 steps, speedup = 256, efficiency = 1
Ideal Speedup Example

[Figure: a workload of W = 2^20 operation steps split into 2^10 tasks w1 through w_{2^10} of 2^10 steps each, spread across processors P1 through P_{2^8}.]

$$W = \sum_i w_i = 2^{20}, \qquad T(1) = 2^{20}, \qquad T(2^8) = 2^{12}$$

$$\text{Speedup} = \frac{2^{20}}{2^{12}} = 2^{8}, \qquad \text{Efficiency} = \frac{2^{20}}{2^{12} \times 2^{8}} = 1$$
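A two-line check of this arithmetic (illustrative only):

W, P = 2**20, 2**8
T1, TP = W, W // P      # ideal case: no overhead, perfect division
print(T1 / TP)          # speedup: 256.0
print(T1 / (P * TP))    # efficiency: 1.0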
Granularities in Parallelism

Overhead
• The additional work that must be performed in order to manage the parallel resources and concurrent abstract tasks, and that lies in the critical time path.

Coarse grained
• Decompose the problem into large independent tasks; usually there is no communication between the tasks. Also defined as a class of parallelism where "relatively large amounts of computational work are done between communication events."

Fine grained
• Decompose the problem into smaller interdependent tasks; usually these tasks are communication intensive. Also defined as a class of parallelism where "relatively small amounts of computational work are done between communication events." (www.llnl.gov/computing/tutorials/parallel_comp)

[Figure: timelines contrasting coarse-grained execution (long computation stretches, little overhead) with fine-grained execution (short computation bursts interleaved with frequent overhead).]
Images adapted from: http://www.mhpcc.edu/training/workshop/parallel_intro/
Overhead
• Overhead: additional critical-path (in time) work required to manage parallel resources and concurrent tasks that would not be necessary for purely sequential execution
• V is the total overhead of workload execution
• vi is the overhead for individual task wi
• Each task takes vi + wi time steps to complete
• Overhead imposes an upper bound on scalability
Overhead

Assumption: the workload is infinitely divisible, so each of the P processors receives one task of equal size:

$$W = \sum_{i=1}^{P} w_i, \qquad w_i = \frac{W}{P}$$

Each processor pays overhead v before its share of the work:

$$T_P = v + \frac{W}{P}$$

$$S = \frac{T_1}{T_P} = \frac{W + v}{v + \frac{W}{P}} \approx \frac{W}{v + \frac{W}{P}} = \frac{P}{1 + \frac{Pv}{W}}$$

where v = overhead per task, V = total overhead, w = work unit, W = total work, Ti = execution time with i processors, and P = # of processors.

[Figure: four tasks, each a block of work w preceded by its overhead v, so that V + W = 4v + 4w.]
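To see the scalability bound this implies, the sketch below (with made-up values for W and v) evaluates S for growing P; speedup saturates near W/v no matter how many processors are added:

W, v = 2**20, 2**6      # hypothetical total work and per-task overhead

def speedup(P):
    """S = (W + v) / (v + W/P), from the derivation above."""
    return (W + v) / (v + W / P)

for P in (1, 16, 256, 4096, 65536):
    print(f"P={P:6d}  S={speedup(P):10.1f}")
# As P grows, S approaches W/v = 16384: overhead caps scalability.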
Scalability and Overhead for Fixed-Size Work Tasks
• W is divided into J tasks of size wg
• Each task requires v overhead work to manage
• For P processors there are approximately J/P tasks to be performed in sequence, so
• TP = J(wg + v)/P
• Note that S = T1 / TP
• So S = P / (1 + v/wg)
Scalability & Overhead

$$J = \#\,\text{tasks} = \left\lceil \frac{W}{w_g} \right\rceil \approx \frac{W}{w_g}$$

$$T_1 = W + vJ \approx W \quad \text{when } W \gg vJ$$

$$T_P = \frac{J}{P} \times (w_g + v) = \frac{W}{P w_g} \times (w_g + v) = \frac{W}{P}\left(1 + \frac{v}{w_g}\right)$$

$$S = \frac{T_1}{T_P} \approx \frac{W}{\frac{W}{P}\left(1 + \frac{v}{w_g}\right)} = \frac{P}{1 + \frac{v}{w_g}}$$

(in the ideal, overhead-free case, $T_P = W/P$)

where v = overhead, wg = work unit (task granularity), W = total work, Ti = execution time with i processors, P = # of processors, and J = # of tasks.
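The ratio v/wg is what matters. The illustrative sketch below (made-up numbers) evaluates S = P / (1 + v/wg) to show how efficiency falls as tasks get finer:

P, v = 256, 10.0                       # processors, overhead per task (hypothetical)

for wg in (10_000, 1_000, 100, 10):    # shrinking task granularity
    S = P / (1 + v / wg)
    print(f"wg={wg:6d}  S={S:7.1f}  efficiency={S/P:5.2f}")
# Coarse tasks (wg >> v) keep efficiency near 1; fine tasks waste time on overhead.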
Capacity Computing with basic Unix tools
• A combination of common Unix utilities such as ssh, scp, rsh, and rcp can be used to create jobs remotely (for more information about these commands, try man ssh, man scp, man rsh, or man rcp in any Unix shell)
• For small workloads it can be convenient to wrap the execution of the program in a simple script, as sketched below
• Relying on simple Unix utilities alone poses several application management constraints for cases such as:
– Aborting started jobs
– Querying for free machines
– Querying for job status
– Retrieving job results
– etc.
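A minimal sketch of such a launcher in Python, illustrative only: it assumes passwordless ssh access to hypothetical hosts node01 through node04 and a program ./myjob already installed on each of them:

import subprocess

hosts = ["node01", "node02", "node03", "node04"]    # hypothetical machine names
params = range(8)                                   # one job per parameter value

procs = []
for i, p in enumerate(params):
    host = hosts[i % len(hosts)]                    # naive round-robin placement
    remote_cmd = f"./myjob {p} > job_{p}.out 2>&1"  # run remotely, capture output
    procs.append(subprocess.Popen(["ssh", host, remote_cmd]))

for proc in procs:
    proc.wait()                                     # block until every job exits

Everything in the constraint list above (aborting jobs, finding free machines, querying status, retrieving results) is left to the user; that gap is precisely what a workload manager such as Condor fills.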
BOINC, SETI@home
• BOINC (Berkeley Open Infrastructure for Network Computing)
• Open-source software that enables distributed coarse-grained computations over the Internet
• Follows the master-worker model; in BOINC, no communication takes place among the worker nodes
• Example projects: SETI@home, Einstein@Home, climate prediction, and many more
Management Middleware : Condor
• Designed, developed, and maintained at the University of Wisconsin-Madison by a team led by Miron Livny
• Condor is a versatile workload management system for managing a pool of distributed computing resources to provide high capacity computing
• Assists distributed job management by providing mechanisms for job queuing, scheduling, and priority management, plus tools that facilitate utilization of resources across Condor pools
• Condor also enables resource management by providing monitoring utilities, authentication & authorization mechanisms, Condor pool management utilities, and support for Grid computing middleware such as Globus

Condor components:
• ClassAds
• Matchmaker
• Problem solvers
Management Middleware : Condor
Condor Components: ClassAds
• The ClassAds (Classified Advertisements) concept is very similar to newspaper classifieds, where buyers and sellers advertise their products using abstract yet uniquely defining named expressions (example: used car sales)
• The ClassAds language in Condor provides a well-defined means of describing the user job and the end resources (storage/computational) so that the Condor MatchMaker can match the job with the appropriate pool of resources

Src: Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April 2005. http://www.cs.wisc.edu/condor/doc/condor-practice.pdf
Job ClassAd & Machine ClassAd
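(The original slide showed a sample job ClassAd and machine ClassAd as images, which did not survive extraction.) The pair sketched below is illustrative only; the attribute values are invented, and the syntax follows the ClassAd examples in the Thain/Tannenbaum/Livny paper cited above:

Job ClassAd (hypothetical):
MyType = "Job"
TargetType = "Machine"
Owner = "cdekate"
Cmd = "/home/cdekate/fib"
Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Memory >= 512)
Rank = KFlops

Machine ClassAd (hypothetical):
MyType = "Machine"
TargetType = "Job"
Name = "vm1@compute-0"
Arch = "X86_64"
OpSys = "LINUX"
Memory = 1964
LoadAvg = 0.01
Requirements = (LoadAvg <= 0.3)

The MatchMaker pairs a job with a machine when each ad's Requirements expression evaluates to true against the other ad's attributes.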
Management Middleware: Condor
Condor MatchMaker
• The MatchMaker, a crucial part of the Condor architecture, uses the job description ClassAd provided by the user and matches the job to the best resource based on the machine description ClassAd
• MatchMaking in Condor is performed in 4 steps:
1. Job agents (A) and resources (R) advertise themselves.
2. The Matchmaker (M) processes the known ClassAds and generates pairs that best match resources and jobs.
3. The Matchmaker informs each party of the job-resource pair of their prospective match.
4. The job agent and resource establish a connection for further processing. (The Matchmaker plays no role in this step, thus ensuring separation between selection of resources and subsequent activities.)

Src: Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April 2005. http://www.cs.wisc.edu/condor/doc/condor-practice.pdf
Management Middleware: Condor
Condor Problem Solvers
• Master-Worker (MW) is a problem-solving system useful for solving coarse-grained problems of indeterminate size, such as parameter sweeps. The MW solver in Condor consists of 3 main components: a work list, a tracking module, and a steering module. The work list keeps track of all pending work the master needs done. The tracking module monitors progress of work currently in progress on the worker nodes. The steering module directs the computation based on results gathered and the pending work list, and communicates with the matchmaker to obtain additional worker processes.
• DAGMan is used to execute multiple jobs that have dependencies represented as a Directed Acyclic Graph (DAG), where the nodes correspond to jobs and the edges correspond to the dependencies between the jobs. DAGMan provides various functionalities for job monitoring and fault tolerance via creation of rescue DAGs.

[Figure: a master process dispatching work units w1 through wN to worker processes.]
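DAGMan's input format is worth a glance. A hypothetical two-stage "diamond" workflow might be described as follows; the four submit files named here are assumptions, not files from the course:

# diamond.dag (hypothetical)
JOB A prepare.submit
JOB B compute_left.submit
JOB C compute_right.submit
JOB D collect.submit
PARENT A CHILD B C
PARENT B C CHILD D

Submitted with "condor_submit_dag diamond.dag", DAGMan runs A first, releases B and C to run concurrently once A succeeds, and runs D only after both complete; if a node fails, DAGMan writes a rescue DAG from which the run can be resumed.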
Management Middleware: Condor
In-depth coverage: http://www.cs.wisc.edu/condor/publications.html
Recommended reading:
Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience," Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April 2005.
Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny, "Condor - A Distributed Job Scheduler," in Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002. ISBN: 0-262-69274-0
Core components of Condor
• condor_master: this program runs constantly and ensures that all other parts of Condor are running. If they hang or crash, it restarts them.
• condor_collector: part of the Condor central manager. It collects information about all computers in the pool as well as which users want to run jobs. It is what normally responds to the condor_status command. It runs not on your computer but on the main Condor pool host (the Arete head node).
• condor_negotiator: part of the Condor central manager. It decides which jobs should be run where. It runs not on your computer but on the main Condor pool host (the Arete head node).
• condor_startd: if this program is running, it allows jobs to be started up on this computer; that is, the computer is an "execute machine". It advertises the machine to the central manager so that the pool knows about it, and it starts up the jobs that run.
• condor_schedd: if this program is running, it allows jobs to be submitted from this computer; that is, the computer is a "submit machine". It advertises jobs to the central manager so that it knows about them, and it contacts a condor_startd on the execute machines for each job that needs to be started.
• condor_shadow: for each job that has been submitted from this computer, there is one condor_shadow running. It watches over the job as it runs remotely and in some cases provides assistance. You may or may not see any condor_shadow processes running, depending on what is happening on the computer when you try it out.

Source: http://www.cs.wisc.edu/condor/tutorials/cw2005-condor/intro.html
Condor: A Walkthrough of Condor Commands
• condor_status: provides current pool status
• condor_q: provides the current job queue
• condor_submit: submit a job to the Condor pool
• condor_rm: delete a job from the job queue
What machines are available? (condor_status)
condor_status queries resource information sources and provides the current status of the Condor pool of resources.

Some common condor_status command-line options:
• -help: displays usage information
• -avail: queries condor_startd ads and prints information about available resources
• -claimed: queries condor_startd ads and prints information about claimed resources
• -ckptsrvr: queries condor_ckpt_server ads and displays checkpoint server attributes
• -pool hostname: queries the specified central manager (by default queries $COLLECTOR_HOST)
• -verbose: displays entire ClassAds
• For more options and what they do, run "condor_status -help"
condor_status : Resource States
• Owner: the machine is currently being utilized by its user and is unavailable for jobs submitted to Condor until the current user's job completes.
• Claimed: Condor has selected the machine for use by other users.
• Unclaimed: the machine is unused and is available for selection by Condor.
• Matched: the machine is in a transition state between unclaimed and claimed.
• Preempting: the machine is currently vacating the resource to make it available to Condor.
Example: condor_status

[cdekate@celeritas ~]$ condor_status
Name            OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
vm1@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:23
vm2@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:24
vm3@compute-0   LINUX  X86_64  Unclaimed  Idle      0.010   1964  0+00:45:06
vm4@compute-0   LINUX  X86_64  Owner      Idle      1.000   1964  0+00:00:07
vm1@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:25
vm2@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  1+09:05:58
vm3@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:37:27
vm4@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  0+00:05:07
…
vm3@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:33
vm4@compute-0   LINUX  X86_64  Unclaimed  Idle      0.000   1964  3+13:42:34

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
 X86_64/LINUX    32      3        0         29        0           0         0
        Total    32      3        0         29        0           0         0
What jobs are currently in the queue? (condor_q)
• condor_q provides a list of jobs that have been submitted to the Condor pool
• Provides details about jobs, including which cluster the job is running on, the owner of the job, memory consumption, the name of the executable being processed, the current state of the job, when the job was submitted, and how long the job has been running

Some common condor_q command-line options:
• -global: queries all job queues in the pool
• -name: queries based on the schedd name; provides a queue listing of the named schedd
• -claimed: queries condor_startd ads and prints information about claimed resources
• -goodput: displays job goodput statistics ("goodput is the allocation time when an application uses a remote workstation to make forward progress" - Condor Manual)
• -cputime: displays the remote CPU time accumulated by the job to date
• For more options run "condor_q -help"
Example: condor_q

[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:40472> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
30.0   cdekate  1/23 07:52   0+00:01:13 R  0   9.8  fib 100
30.1   cdekate  1/23 07:52   0+00:01:09 R  0   9.8  fib 100
30.2   cdekate  1/23 07:52   0+00:01:07 R  0   9.8  fib 100
30.3   cdekate  1/23 07:52   0+00:01:11 R  0   9.8  fib 100
30.4   cdekate  1/23 07:52   0+00:01:05 R  0   9.8  fib 100

5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$
How to submit your job? (condor_submit)
• Create a job ClassAd (Condor submit file) that contains Condor keywords and user-configured values for the keywords
• Submit the job ClassAd using "condor_submit"
Example: condor_submit matrix.submit
• condor_submit -h lists additional flags

[cdekate@celeritas NPB3.2-MPI]$ condor_submit -h
Usage: condor_submit [options] [cmdfile]
Valid options:
  -verbose              verbose output
  -name <name>          submit to the specified schedd
  -remote <name>        submit to the specified remote schedd (implies -spool)
  -append <line>        add line to submit file before processing
                        (overrides submit file; multiple -a lines ok)
  -disable              disable file permission checks
  -spool                spool all files to the schedd
  -password <password>  specify password to MyProxy server
  -pool <host>          Use host as the central manager to query
If [cmdfile] is omitted, input is read from stdin
condor_submit: Example

[cdekate@celeritas ~]$ condor_submit fib.submit
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 35.
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
35.0   cdekate  1/24 15:06   0+00:00:00 I  0   9.8  fib 10
35.1   cdekate  1/24 15:06   0+00:00:00 I  0   9.8  fib 15
35.2   cdekate  1/24 15:06   0+00:00:00 I  0   9.8  fib 20
35.3   cdekate  1/24 15:06   0+00:00:00 I  0   9.8  fib 25
35.4   cdekate  1/24 15:06   0+00:00:00 I  0   9.8  fib 30

5 jobs; 5 idle, 0 running, 0 held
[cdekate@celeritas ~]$
How to delete a submitted job? (condor_rm)
• condor_rm deletes one or more jobs from the Condor job pool. If a particular Condor pool is specified as one of the arguments, then the condor_schedd matching the specification is contacted for job deletion; otherwise the local condor_schedd is contacted.

[cdekate@celeritas ~]$ condor_rm -h
Usage: condor_rm [options] [constraints]
 where [options] is zero or more of:
  -help              Display this message and exit
  -version           Display version information and exit
  -name schedd_name  Connect to the given schedd
  -pool hostname     Use the given central manager to find daemons
  -addr <ip:port>    Connect directly to the given "sinful string"
  -reason reason     Use the given RemoveReason
  -forcex            Force the immediate local removal of jobs in the X state
                     (only affects jobs already being removed)
 and where [constraints] is one or more of:
  cluster.proc       Remove the given job
  cluster            Remove the given cluster of jobs
  user               Remove all jobs owned by user
  -constraint expr   Remove all jobs matching the boolean expression
  -all               Remove all jobs (cannot be used with other constraints)
[cdekate@celeritas ~]$
condor_rm: Example

[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID    OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
41.0   cdekate  1/24 15:43   0+00:00:03 R  0   9.8  fib 100
41.1   cdekate  1/24 15:43   0+00:00:01 R  0   9.8  fib 150
41.2   cdekate  1/24 15:43   0+00:00:00 R  0   9.8  fib 200
41.3   cdekate  1/24 15:43   0+00:00:00 R  0   9.8  fib 250
41.4   cdekate  1/24 15:43   0+00:00:00 R  0   9.8  fib 300

5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$ condor_rm 41.4
Job 41.4 marked for removal
[cdekate@celeritas ~]$ condor_rm 41
Cluster 41 has been marked for removal.
[cdekate@celeritas ~]$
Creating a Condor Submit File (a Job ClassAd)
• A Condor submit file contains key-value pairs that help describe the application to Condor
• Condor submit files are job ClassAds
• Some of the common descriptions found in job ClassAds are:
executable = (path to the executable to run on Condor)
input = (file providing standard input)
output = (file capturing standard output)
log = (file capturing Condor's log of job events)
arguments = (command-line arguments supplied to the executable)
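Putting these keywords together, a plausible fib.submit for the five-job fib demo shown in the condor_submit example earlier might read as follows; the executable name and argument values match that example, while the vanilla universe and the output/error/log file names are assumptions:

universe   = vanilla
executable = fib
output     = fib.$(Process).out
error      = fib.$(Process).err
log        = fib.log
arguments  = 10
queue
arguments  = 15
queue
arguments  = 20
queue
arguments  = 25
queue
arguments  = 30
queue

Each "queue" statement enqueues one job; all five land in a single cluster (cluster 35 in the earlier example), numbered 35.0 through 35.4.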
DEMO: Steps Involved in Running a Job on Condor
1. Creating a Condor submit file
2. Submitting the Condor submit file to a Condor pool
3. Checking the current state of a submitted job
4. Job status notification
Condor Usage Statistics
[Figure: Condor pool usage statistics; chart not reproduced.]
Montage Workload Implemented and Executed Using Condor (Source: Dr. Dan Katz)
• Mosaicking astronomical images: powerful telescopes taking high-resolution (and highest-zoom) pictures of the sky can cover only a small region over time
• The problem being solved in this project is "stitching" these images together to make a high-resolution, zoomed-in snapshot of the sky
• Aggregate requirements of 140,000 CPU-hours (~16 years on a single machine), with output on the order of 6 terabytes

[Figure: example DAG for 10 input files, with compute stages mProject, mDiff, mFitPlane, mConcatFit, mBgModel, mBackground, and mAdd, plus data stage-in, data stage-out, and registration nodes. Pegasus (http://pegasus.isi.edu/) maps the abstract workflow to an executable form, drawing on Grid information systems for information about available resources and data location; Condor DAGMan executes the resulting workflow; MyProxy holds the user's grid credentials for the Grid.]
Montage Use by IPHAS: The INT/WFC Photometric H-alpha Survey of the Northern Galactic Plane (Source: Dr. Dan Katz)
• Goal: study extreme phases of stellar evolution that involve very large mass loss
[Images: nebulosity in the vicinity of HII region IC 1396B in Cepheus; the Crescent Nebula NGC 6888; supernova remnant S147.]
Capacity Computing Performance Issues
• Throughput computing
• Performance measured as total workload performed over the time to complete it
• Overhead factors:
– Start-up time
– Input data distribution
– Output result data collection
– Termination time
– Inter-task coordination overhead (no task coupling)
• Starvation:
– Insufficient work to keep all processors busy
– Inadequate parallelism of coarse-grained task parallelism
– Poor or uneven load distribution
Summary: Material for the Test
• Key terms & concepts
• Decoupled work-queue model
• Ideal speedup
• Overhead and scalability
• The Condor concepts detailed in this lecture (components, ClassAds, MatchMaker, problem solvers, core daemons, commands)
• Capacity computing performance issues
• Required reading materials:
– http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf
– Specific pages to focus on: 3-16