Prof. Thomas Sterling
Department of Computer Science, Louisiana State University
February 1, 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
CAPACITY COMPUTING
CSC 7600 Lecture 5 : Capacity Computing, Spring 2011

Topics
• Key terms and concepts
• Basic definitions
• Models of parallelism
• Speedup and Overhead
• Capacity Computing & Unix utilities
• Condor : Overview
• Condor : Useful commands
• Performance Issues in Capacity Computing
• Material for Test

Key Terms and Concepts
Conventional serial (also sequential) execution represents the problem as a single series of instructions executed by one CPU. Parallel execution partitions the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency, with tasks spread across multiple CPUs.
[Figure: a serial problem as one instruction stream on one CPU vs. a parallel problem decomposed into tasks running on several CPUs]
Parallel computing takes advantage of concurrency to:
• Solve larger problems within bounded time
• Save on wall clock time
• Overcome memory constraints
• Utilize non-local resources

• Scalable Speedup : Relative reduction of execution time of a fixed size workload through parallel execution:
  Speedup = execution_time_on_one_processor / execution_time_on_N_processors
• Scalable Efficiency : Ratio of the actual performance to the best possible performance.
  Efficiency = execution_time_on_one_processor / (execution_time_on_N_processors × N)

Defining the 3 C's
Main classes of computing:
- High capacity parallel computing : A strategy for employing distributed computing resources to achieve high throughput processing among decoupled tasks. Aggregate performance of the total system is high if sufficient tasks are available to be carried out concurrently on all separate processing elements. No single task is accelerated. Uses increased workload size of multiple tasks with increased system scale.
- High capability parallel computing : A strategy for employing tightly coupled structures of computing resources to achieve reduced execution time of a given application through partitioning into concurrently executable tasks. Uses fixed workload size with increased system scale.
- Cooperative computing : A strategy for employing a moderately coupled ensemble of computing resources to increase the size of the data set of a user application while limiting its execution time. Uses a workload of a single task of increased data set size with increased system scale.

Strong Scaling vs. Weak Scaling
[Figure: work per task (granularity per node) and total problem size plotted against machine scale (# of nodes: 1, 2, 4, 8). Under strong scaling the total problem size is fixed, so work per task falls as nodes are added; under weak scaling work per task is fixed, so total problem size grows with the machine.]

• High capacity computing systems emphasize the overall work performed over a fixed time period. Work is defined as the aggregate amount of computation performed across all functional units, all threads, all cores, all chips, all coprocessors and network interface cards in the system.
• High capability computing systems emphasize improvement (reduction) in execution time of a single user application program of fixed data set size.
• Cooperative computing systems emphasize single-application weak scaling: performance increase through increase in problem size (usually data set size and # of task partitions) with increase in system scale.
Adapted from : S. Chaudhry, P. Caprioli, S. Yip, M. Tremblay, "High-Performance Throughput Computing", IEEE Micro, 2005.

Strong Scaling, Weak Scaling
Capability
• Primary scaling is decrease in response time proportional to increase in resources applied
• Single job, constant size; goal: response-time scaling proportional to machine size
• Tightly coupled concurrent tasks making up a single job
Cooperative
• Single job (different nodes working on different partitions of the same job)
• Job size scales proportional to machine
• Granularity per node is fixed over the range of system scale
• Loosely coupled concurrent tasks making up a single job
Capacity
• Primary scaling is increase in throughput proportional to increase in resources applied
• Decoupled concurrent tasks, each a separate job, increasing in number of instances; scaling proportional to machine.
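To make the speedup and efficiency definitions concrete, here is a small Python sketch; the timing numbers are purely illustrative assumptions, not measurements from the course systems:

```python
def speedup(t1, tn):
    """Relative reduction of execution time of a fixed-size workload on n processors."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Ratio of actual performance to the best possible performance on n processors."""
    return t1 / (tn * n)

# Illustrative timings: a 1000-second serial run that takes 140 s on 8 processors.
t1, t8 = 1000.0, 140.0
print(speedup(t1, t8))        # ~7.14: close to, but below, the ideal speedup of 8
print(efficiency(t1, t8, 8))  # ~0.89: fraction of the ideal achieved
```

Capability computing tries to keep this efficiency high as the machine grows (strong scaling); capacity computing instead grows the number of independent jobs with the machine.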
[Figure: Capability and Cooperative each run a single job, with strong and weak scaling of workload size respectively; Capacity scales workload size across many independent jobs.]

Models of Parallel Processing
• Conventional models of parallel processing
- Decoupled Work Queue (covered in segment 1 of the course)
- Communicating Sequential Processes (CSP message passing) (covered in segment 2)
- Shared memory multiple thread (covered in segment 3)
• Some alternative models of parallel processing
- SIMD : single instruction stream, multiple data stream processor array
- Vector machines : hardware execution of value sequences to exploit pipelining
- Systolic : an interconnection of basic arithmetic units to match the algorithm
- Data flow : data-precedence-constrained, self-synchronizing fine grain execution units supporting functional (single assignment) execution

Shared Memory Multiple Thread
• Static or dynamic
• Fine grained
• OpenMP
• Distributed shared memory systems
• Covered in Segment 3
[Figure: a Symmetric Multi-Processor (SMP, usually cache coherent) with CPUs sharing memory through a network, and a Distributed Shared Memory system (DSM, usually cache coherent) with memory attached per CPU; example machine: Orion, JPL NASA]

Communicating Sequential Processes
• One process is assigned to each processor
• Work done by the processor is performed on the local data
• Data values are exchanged by messages
• Synchronization constructs for inter-process coordination
• Distributed memory
• Coarse grained
• MPI application programming interface
• Commodity clusters and MPPs (MPP is an acronym for "Massively Parallel Processor")
• Covered in Segment 2
[Figure: distributed memory organization (DM, often not cache coherent): each CPU with private memory, connected by a network; example machine: QueenBee]

Decoupled Work Queue Model
• Concurrent disjoint tasks
- Job stream parallelism
- Parametric studies
• SPMD (single program multiple data)
• Very coarse grained
• Example software package : Condor
• Processor farms and commodity clusters
• This lecture covers this model of parallelism

Ideal Speedup Issues
• W is total workload measured in elemental pieces of work (e.g. operations, instructions, subtasks, tasks, etc.)
• T(p) is total execution time measured in elemental time steps (e.g. clock cycles) where p is # of execution sites (e.g. processors, threads)
• wi is work for a given task i, measured in operations
• Example: divide a 2^20 (Mega) operation workload W into 2^10 = 1024 tasks, w1 to w1024, each of 2^10 (1 K) operations
• Assume 256 processors performing the workload in parallel
• T(256) = 4096 steps, speedup = 256, Eff = 1

Ideal Speedup Example
W = Σi wi = 2^20 operations, divided into 2^10 tasks of 2^10 operations each.
T(1) = 2^20 steps on one processor.
With P = 2^8 processors, T(2^8) = 2^12 steps.
Speedup = 2^20 / 2^12 = 2^8
Efficiency = 2^20 / (2^12 × 2^8) = 2^0 = 1

Granularities in Parallelism
Overhead
• The additional work, in the critical time path, that must be performed to manage the parallel resources and concurrent abstract tasks.
Coarse grained
• Decompose the problem into large independent tasks.
Usually there is no communication between the tasks. Also defined as a class of parallelism where "relatively large amounts of computational work are done between communication events".
Fine grained
• Decompose the problem into smaller interdependent tasks. Usually these tasks are communication intensive. Also defined as a class of parallelism where "relatively small amounts of computational work are done between communication events".
- www.llnl.gov/computing/tutorials/parallel_comp
Images adapted from : http://www.mhpcc.edu/training/workshop/parallel_intro/

Overhead
• Overhead: additional critical-path (in time) work required to manage parallel resources and concurrent tasks that would not be necessary for purely sequential execution
• V is total overhead of workload execution
• vi is overhead for individual task wi
• Each task takes vi + wi time steps to complete
• Overhead imposes an upper bound on scalability

Assumption: the workload is infinitely divisible, so W = Σ_{i=1..P} wi with wi = W/P, and each processor incurs overhead v (v = overhead, V = total overhead, w = work unit, W = total work, Ti = execution time with i processors, P = # processors):
  TP = v + W/P
  S = T1/TP = W / (v + W/P) = P / (1 + P·v/W)
[Figure: four tasks, each of work w plus overhead v, so V + W = 4v + 4w]

Scalability and Overhead for Fixed-Size Work Tasks
• W is divided into J tasks of size wg
• Each task requires v overhead work to manage
• For P processors there are approximately J/P tasks to be performed in sequence, so TP = J(wg + v)/P
• Note that S = T1/TP
• So S = P / (1 + v/wg)

In detail (v = overhead, wg = work unit, W = total work, Ti = execution time with i processors, P = # processors, J = # tasks):
  J = ⌈W/wg⌉ ≈ W/wg
  T1 = W + Jv ≈ W when W >> Jv
  TP = (J/P) × (wg + v) = (W/(P·wg)) × (wg + v) = (W/P) × (1 + v/wg)
  S = T1/TP ≈ W / ((W/P) × (1 + v/wg)) = P / (1 + v/wg)
(For v = 0 this reduces to the ideal TP = W/P.)

Capacity Computing with Basic Unix Tools
• A combination of common Unix utilities such as ssh, scp, rsh, and rcp can be used to create jobs remotely (for more information about these commands try man ssh, man scp, man rsh, man rcp in any Unix shell).
• For small workloads it can be convenient to translate the execution of the program into a simple shell script.
• Relying on simple Unix utilities poses several application management constraints for cases such as:
- Aborting started jobs
- Querying for free machines
- Querying for job status
- Retrieving job results
- etc.

BOINC, SETI@Home
• BOINC (Berkeley Open Infrastructure for Network Computing)
• Open-source software that enables distributed coarse-grained computations over the internet.
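The ideal speedup example and the overhead-limited speedup formula above can both be checked with a few lines of Python; the only numbers used are the ones given on the slides, plus assumed overhead ratios in the final loop:

```python
def bounded_speedup(P, v, wg):
    """Overhead-limited speedup of fixed-size tasks: S = P / (1 + v/wg)."""
    return P / (1 + v / wg)

# Ideal-speedup example: W = 2**20 ops in 2**10 tasks of wg = 2**10 ops,
# run on P = 2**8 processors with no overhead (v = 0).
W, wg, P = 2**20, 2**10, 2**8
J = W // wg                # 1024 tasks
TP = (J // P) * wg         # 4 tasks in sequence per processor -> 4096 steps
print(TP, W / TP)          # 4096 256.0 : speedup equals P, efficiency 1
assert bounded_speedup(P, 0.0, wg) == P

# With overhead, speedup degrades as task granularity shrinks (illustrative ratios):
for v_over_wg in (0.01, 0.1, 1.0):
    print(v_over_wg, bounded_speedup(P, v_over_wg * wg, wg))
# v/wg = 1 halves the achievable speedup: 256 -> 128
```

This is why very coarse grained tasks suit the decoupled work queue model: keeping v/wg small keeps S close to P.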
• BOINC follows the master-worker model; no communication takes place among the worker nodes.
• SETI@Home
• Einstein@Home
• Climate prediction
• And many more…

Management Middleware : Condor
• Designed, developed and maintained at the University of Wisconsin-Madison by a team led by Miron Livny.
• Condor is a versatile workload management system for managing a pool of distributed computing resources to provide high capacity computing.
• It assists distributed job management by providing mechanisms for job queuing, scheduling, and priority management, and tools that facilitate utilization of resources across Condor pools.
• Condor also enables resource management by providing monitoring utilities, authentication & authorization mechanisms, Condor pool management utilities, and support for Grid computing middleware such as Globus.
Condor components:
• ClassAds
• Matchmaker
• Problem solvers

Condor Components : ClassAds
• The ClassAds (Classified Advertisements) concept is very similar to newspaper classifieds, where buyers and sellers advertise their products using abstract yet uniquely defining named expressions (example : used car sales).
• The ClassAds language in Condor provides a well defined means of describing the user job and the end resources (storage / computational) so that the Condor MatchMaker can match the job with the appropriate pool of resources.
Src : Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pp. 323-356, February-April 2005. http://www.cs.wisc.edu/condor/doc/condor-practice.pdf

[Figure: example Job ClassAd & Machine ClassAd]

Condor MatchMaker
• MatchMaker, a crucial part of the Condor architecture, uses the job description ClassAd provided by the user and matches the job to the best resource based on the machine description ClassAd.
• MatchMaking in Condor is performed in 4 steps:
1. Job agents (A) and resources (R) advertise themselves.
2. The Matchmaker (M) processes the known ClassAds and generates pairs that best match resources and jobs.
3. The Matchmaker informs each party of the job-resource pair of their prospective match.
4. The job agent and resource establish a connection for further processing. (The Matchmaker plays no role in this step, thus ensuring separation between selection of resources and subsequent activities.)

Condor Problem Solvers
• Master-Worker (MW) is a problem solving system that is useful for solving a coarse grained problem of indeterminate size, such as a parameter sweep.
• The MW solver in Condor consists of 3 main components : a work-list, a tracking module, and a steering module. The work-list keeps track of all pending work that the master needs done.
The tracking module monitors progress of work currently in progress on the worker nodes. The steering module directs the computation based on results gathered and the pending work-list, and communicates with the matchmaker to obtain additional worker processes.
• DAGMan is used to execute multiple jobs that have dependencies represented as a Directed Acyclic Graph, where nodes correspond to jobs and edges to the dependencies between jobs. DAGMan provides various functionalities for job monitoring and fault tolerance via creation of rescue DAGs.
[Figure: a master process coordinating workers w1 … wN]

In-depth coverage : http://www.cs.wisc.edu/condor/publications.html
Recommended reading :
Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pp. 323-356, February-April 2005.
Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny, "Condor - A Distributed Job Scheduler", in Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002. ISBN 0-262-69274-0.

Core Components of Condor
• condor_master : runs constantly and ensures that all other parts of Condor are running; if they hang or crash, it restarts them.
• condor_collector : part of the Condor central manager. It collects information about all computers in the pool as well as which users want to run jobs, and is what normally responds to the condor_status command. It runs not on your computer but on the main Condor pool host (the Arete head node).
• condor_negotiator : part of the Condor central manager. It decides which jobs should be run where. It runs not on your computer but on the main Condor pool host (the Arete head node).
• condor_startd : if this program is running, jobs can be started on this computer, that is, Arete is an "execute machine". It advertises Arete to the central manager so that the manager knows about this computer, and it starts up the jobs that run.
• condor_schedd : if this program is running, jobs can be submitted from this computer, that is, desktron is a "submit machine". It advertises jobs to the central manager so that the manager knows about them, and it contacts a condor_startd on other execute machines for each job that needs to be started.
• condor_shadow : for each job that has been submitted from this computer (e.g., desktron), there is one condor_shadow running. It watches over the job as it runs remotely and in some cases provides assistance. You may or may not see condor_shadow processes running, depending on what is happening on the computer when you try it out.
Source : http://www.cs.wisc.edu/condor/tutorials/cw2005-condor/intro.html

A Walkthrough of Condor Commands
• condor_status : provides current pool status
• condor_q : provides the current job queue
• condor_submit : submits a job to the Condor pool
• condor_rm : deletes a job from the job queue

What Machines Are Available? (condor_status)
condor_status queries resource information sources and provides the current status of the Condor pool of resources.
Some common condor_status command line options :
-help : displays usage information
-avail : queries condor_startd ads and prints information about available resources
-claimed : queries condor_startd ads and prints information about claimed resources
-ckptsrvr : queries condor_ckpt_server ads and displays checkpoint server attributes
-pool <hostname> : queries the specified central manager (by default queries $COLLECTOR_HOST)
-verbose : displays entire ClassAds
For more options and what they do, run "condor_status -help".

condor_status : Resource States
• Owner : the machine is currently being utilized by a user and is unavailable for jobs submitted by Condor until the current user job completes.
• Claimed : Condor has selected the machine for use by other users.
• Unclaimed : the machine is unused and available for selection by Condor.
• Matched : the machine is in a transition state between unclaimed and claimed.
• Preempting : the machine is currently vacating the resource to make it available to Condor.
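The per-state totals that condor_status prints in its summary line are a simple tally over the slots. As a sketch, using assumed counts that mirror the 32-slot example pool (3 Owner, 29 Unclaimed):

```python
from collections import Counter

# Tally slot states the way condor_status summarizes them.
# The state list here is assumed data, not live condor_status output.
states = ["Owner"] * 3 + ["Unclaimed"] * 29
summary = Counter(states)

# A Counter returns 0 for states with no slots (e.g. Claimed, Matched).
print(summary["Owner"], summary["Claimed"], summary["Unclaimed"])  # 3 0 29
```

The same idea extends to any of the five resource states listed above.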
Example : condor_status

[cdekate@celeritas ~]$ condor_status
Name          OpSys  Arch    State     Activity LoadAv Mem  ActvtyTime
vm1@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 3+13:42:23
vm2@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 3+13:42:24
vm3@compute-0 LINUX  X86_64  Unclaimed Idle     0.010  1964 0+00:45:06
vm4@compute-0 LINUX  X86_64  Owner     Idle     1.000  1964 0+00:00:07
vm1@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 3+13:42:25
vm2@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 1+09:05:58
vm3@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 3+13:37:27
vm4@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 0+00:05:07
…
vm3@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 3+13:42:33
vm4@compute-0 LINUX  X86_64  Unclaimed Idle     0.000  1964 3+13:42:34

             Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX    32     3       0        29       0          0        0
       Total    32     3       0        29       0          0        0

What Jobs Are Currently in the Queue? (condor_q)
• condor_q provides a list of jobs that have been submitted to the Condor pool.
• It provides details about jobs, including which cluster the job is running on, the owner of the job, memory consumption, the name of the executable being processed, the current state of the job, when the job was submitted, and how long the job has been running.
Some common condor_q command line options :
-global : queries all job queues in the pool
-name <schedd name> : provides a queue listing of the named schedd
-claimed : queries condor_startd ads and prints information about claimed resources
-goodput : displays job goodput statistics ("goodput is the allocation time when an application uses a remote workstation to make forward progress" - Condor Manual)
-cputime : displays the remote CPU time accumulated by the job to date...
For more options run : "condor_q -help".

Example : condor_q

[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:40472> : celeritas.cct.lsu.edu
 ID   OWNER   SUBMITTED   RUN_TIME   ST PRI SIZE CMD
30.0  cdekate 1/23 07:52  0+00:01:13 R  0   9.8  fib 100
30.1  cdekate 1/23 07:52  0+00:01:09 R  0   9.8  fib 100
30.2  cdekate 1/23 07:52  0+00:01:07 R  0   9.8  fib 100
30.3  cdekate 1/23 07:52  0+00:01:11 R  0   9.8  fib 100
30.4  cdekate 1/23 07:52  0+00:01:05 R  0   9.8  fib 100
5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$

How to Submit Your Job (condor_submit)
• Create a job ClassAd (Condor submit file) that contains Condor keywords and user-configured values for the keywords.
• Submit the job ClassAd using condor_submit, e.g. : condor_submit matrix.submit
• condor_submit -h provides additional flags :

[cdekate@celeritas NPB3.2-MPI]$ condor_submit -h
Usage: condor_submit [options] [cmdfile]
Valid options:
-verbose : verbose output
-name <name> : submit to the specified schedd
-remote <name> : submit to the specified remote schedd (implies -spool)
-append <line> : add line to submit file before processing (overrides submit file; multiple -a lines ok)
-disable : disable file permission checks
-spool : spool all files to the schedd
-password <password> : specify password to MyProxy server
-pool <host> : use host as the central manager to query
If [cmdfile] is omitted, input is read from stdin.

condor_submit : Example
[cdekate@celeritas ~]$ condor_submit fib.submit
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 35.
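The contents of fib.submit are not shown in the lecture. As a sketch only, a submit file that queues five instances of fib with different arguments could look like the following; the universe choice and all file names are hypothetical:

```
# Hypothetical sketch of a submit file such as fib.submit (not shown in the lecture).
universe    = vanilla
executable  = fib
output      = fib.$(Cluster).$(Process).out
error       = fib.$(Cluster).$(Process).err
log         = fib.log

arguments   = 10
queue
arguments   = 15
queue
arguments   = 20
queue
arguments   = 25
queue
arguments   = 30
queue
```

Each "queue" statement adds one job to the cluster, so a single condor_submit produces the whole batch.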
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID   OWNER   SUBMITTED   RUN_TIME   ST PRI SIZE CMD
35.0  cdekate 1/24 15:06  0+00:00:00 I  0   9.8  fib 10
35.1  cdekate 1/24 15:06  0+00:00:00 I  0   9.8  fib 15
35.2  cdekate 1/24 15:06  0+00:00:00 I  0   9.8  fib 20
35.3  cdekate 1/24 15:06  0+00:00:00 I  0   9.8  fib 25
35.4  cdekate 1/24 15:06  0+00:00:00 I  0   9.8  fib 30
5 jobs; 5 idle, 0 running, 0 held
[cdekate@celeritas ~]$

How to Delete a Submitted Job (condor_rm)
• condor_rm deletes one or more jobs from the Condor job pool. If a particular Condor pool is specified as one of the arguments, the condor_schedd matching the specification is contacted for job deletion; otherwise the local condor_schedd is contacted.

[cdekate@celeritas ~]$ condor_rm -h
Usage: condor_rm [options] [constraints]
where [options] is zero or more of:
-help : display this message and exit
-version : display version information and exit
-name schedd_name : connect to the given schedd
-pool hostname : use the given central manager to find daemons
-addr <ip:port> : connect directly to the given "sinful string"
-reason reason : use the given RemoveReason
-forcex : force the immediate local removal of jobs in the X state (only affects jobs already being removed)
and where [constraints] is one or more of:
cluster.proc : remove the given job
cluster : remove the given cluster of jobs
user : remove all jobs owned by user
-constraint expr : remove all jobs matching the boolean expression
-all : remove all jobs (cannot be used with other constraints)
[cdekate@celeritas ~]$

condor_rm : Example
[cdekate@celeritas ~]$ condor_q
-- Submitter: celeritas.cct.lsu.edu : <130.39.128.68:51675> : celeritas.cct.lsu.edu
 ID   OWNER   SUBMITTED   RUN_TIME   ST PRI SIZE CMD
41.0  cdekate 1/24 15:43  0+00:00:03 R  0   9.8  fib 100
41.1  cdekate 1/24 15:43  0+00:00:01 R  0   9.8  fib 150
41.2  cdekate 1/24 15:43  0+00:00:00 R  0   9.8  fib 200
41.3  cdekate 1/24 15:43  0+00:00:00 R  0   9.8  fib 250
41.4  cdekate 1/24 15:43  0+00:00:00 R  0   9.8  fib 300
5 jobs; 0 idle, 5 running, 0 held
[cdekate@celeritas ~]$ condor_rm 41.4
Job 41.4 marked for removal
[cdekate@celeritas ~]$ condor_rm 41
Cluster 41 has been marked for removal.
[cdekate@celeritas ~]$

Creating a Condor Submit File (a Job ClassAd)
• A Condor submit file contains key-value pairs that help describe the application to Condor.
• Condor submit files are job ClassAds.
• Some of the common descriptions found in job ClassAds are :
executable = (path to the executable to run on Condor)
input = (standard input provided as a file)
output = (standard output stored in a file)
log = (output to log file)
arguments = (command-line arguments to be supplied to the executable)

DEMO : Steps Involved in Running a Job on Condor
1. Creating a Condor submit file
2. Submitting the Condor submit file to a Condor pool
3. Checking the current state of a submitted job
4. Job status notification

Condor Usage Statistics
[Figure: Condor usage statistics]

Montage Workload Implemented and Executed Using Condor (Source : Dr. Dan Katz)
• Mosaicking astronomical images : powerful telescopes taking high resolution (and highest zoom) pictures of the sky can cover only a small region over time.
• The problem being solved in this project is "stitching" these images together to make a high-resolution, zoomed-in snapshot of the sky.
• Aggregate requirements of 140,000 CPU hours (~16 years on a single machine), with output on the order of 6 terabytes.
[Figure: example DAG for 10 input files. Pegasus (http://pegasus.isi.edu/) maps the abstract workflow to an executable form over the Montage stages (mProject, mDiff, mFitPlane, mConcatFit, mBgModel, mBackground, mAdd); Grid information systems provide information about available resources and data location; Condor DAGMan executes the workflow through data stage-in nodes, Montage compute nodes, data stage-out nodes, and registration nodes; MyProxy supplies the user's grid credentials.]

Montage Use by IPHAS : The INT/WFC Photometric H-alpha Survey of the Northern Galactic Plane (Source : Dr. Dan Katz)
[Figure: nebulosity in the vicinity of HII region IC 1396B in Cepheus; Crescent Nebula NGC 6888; supernova remnant S147. The survey studies extreme phases of stellar evolution that involve very large mass loss.]

Capacity Computing Performance Issues
• Throughput computing
• Performance measured as total workload performed over time to complete
• Overhead factors :
- Start up time
- Input data distribution
- Output result data collection
- Terminate time
- Inter-task coordination overhead (no task coupling)
• Starvation :
- Insufficient work to keep all processors busy
- Inadequate parallelism of coarse grained task parallelism
- Poor or uneven load distribution

Summary : Material for the Test
• Key terms & concepts (slides 4, 5, 7, 8, 9, 10, 11)
• Decoupled work-queue model (slide 16)
• Ideal speedup (slides 18, 19)
• Overhead and scalability (slides 20-24)
• Condor concepts detailed in slides 30, 31, 32, 34, 35, 36, 37
• Capacity computing performance issues (slide 53)
• Required reading materials :
- http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf
- Specific pages to focus on : 3-16