An Introduction To Condor High Throughput Computing Week 2007

An Introduction To Condor High Throughput Computing Week 2007 Computer Sciences Department University of Wisconsin-Madison http://www.condorproject.org The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students. http://www.cs.wisc.edu/condor 2 The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students who:  face software engineering challenges in a distributed UNIX/Linux/NT environment  are involved in national and international grid collaborations,  actively interact with academic and commercial users,  maintain and support large distributed production environments,  and educate and train students. http://www.cs.wisc.edu/condor 3 A Multifaceted Project › Harnessing the power of clusters – dedicated and/or opportunistic (Condor) › Job management services for Grid applications (Condor-G, Stork) › Fabric management services for Grid resources (Condor, › › › › › GlideIns, NeST) Distributed I/O technology (Parrot, Kangaroo, NeST) Job-flow management (DAGMan, Condor, Hawk) Distributed monitoring and management (HawkEye) Technology for Distributed Systems (ClassAD, MW) Packaging and Integration (NMI, VDT) http://www.cs.wisc.edu/condor 4 Some free software produced by the Condor Project › Condor System › VDT › MW › NeST › Metronome › ClassAd Library › DAGMan › Fault Tolerant Shell › › (FTSH) Hawkeye GCB › › › › Data! Stork Parrot VDT And others… all as open source http://www.cs.wisc.edu/condor 5 Full featured system › Flexible scheduling policy engine via ClassAds  Preemption, suspension, requirements, preferences, groups, quotas, settable fair-share, system hold… › Facilities to manage BOTH dedicated CPUs (clusters) and non› › › › › › › › dedicated resources (desktops) Transparent Checkpoint/Migration for many types of serial jobs No shared file-system required Federate clusters w/ a wide array of Grid Middleware Workflow management (inter-dependencies) Support for many job types – serial, parallel, etc. Fault-tolerant: can survive crashes, network outages, no single point of failure. Development APIs: via SOAP / web services, DRMAA (C), Perl package, GAHP, flexible command-line tools, MW Platforms: Linux i386/IA64, Windows 2k/XP, MacOS, Solaris, IRIX, HP-UX, Compaq Tru64, … lots. http://www.cs.wisc.edu/condor 6 Who uses Condor? › Commercial  Oracle, Micron, Hartford Life Insurance, CORE, Xerox, Exxon/Mobil, Shell, Alterra, Texas Instruments, JP Morgan, Leica, … › Research Community  Universities, Gov Labs  Bundles / Products: NMI, VDT, RedHat  Grid Communities: OSG, TeraGrid, EGEE/LCG/gLite, CDF @ Fermi, USCMS, LIGO, iVDGL, NSF Middleware Initiative Build and Test Lab, … http://www.cs.wisc.edu/condor 7 Condor in a nutshell › Condor is a set of distributed services  A scheduler service (schedd) • job policy  A resource manager service (startd) • resource policy  A matchmaker service • administrator policy › These services use ClassAds as the lingua franca http://www.cs.wisc.edu/condor 8 ClassAds and Matchmaking: Finding Machines For Jobs Finding Jobs for Machines http://www.cs.wisc.edu/condor 9 Condor Computers… …And jobs …AndTakes matches them I need a Mac! Desktop Computers Dedicated Clusters Matchmaker I need a Linux box with 2GB RAM! http://www.cs.wisc.edu/condor 10 Quick Terminology › Cluster: A dedicated set of computers not for interactive use › Pool: A collection of computers used by Condor  May be dedicated  May be interactive http://www.cs.wisc.edu/condor 11 Matchmaking › Matchmaking is fundamental to Condor › Matchmaking is two-way  Job describes what it requires: I need Linux && 2 GB of RAM  Machine describes what it requires: I will only run jobs from the Physics department › Matchmaking allows preferences  I need Linux, and I prefer machines with more memory but will run on any machine you provide me http://www.cs.wisc.edu/condor 12 Why Two-way Matching? › Condor conceptually divides people into three groups:  Job submitters  Machine owners  Pool (cluster) administrator } May or may not be the same people › All three of these groups have preferences http://www.cs.wisc.edu/condor 13 › › › › › Machine owner preferences I prefer jobs from the physics group I will only run jobs between 8pm and 4am I will only run certain types of jobs Jobs can be preempted if something better comes along (or not) Remember BNL’s “START” expression?  It expressed policies like this  People use complex policies to control resource usage—Condor supports them http://www.cs.wisc.edu/condor 14 System Admin Prefs › When can jobs preempt other jobs? › Which users have higher priority? http://www.cs.wisc.edu/condor 15 ClassAds › ClassAds state facts  My job’s executable is analysis.exe  My machine’s load average is 5.6 › ClassAds state preferences  I require a computer with Linux http://www.cs.wisc.edu/condor 16 ClassAds • ClassAds are: – – – – semi-structured user-extensible schema-free Attribute = Expression Example: MyType = "Job" String TargetType = "Machine" Number ClusterId = 1377 Owner = "roy“ Cmd = “analysis.exe“ Requirements = (Arch == "INTEL") Boolean && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024)>=ImageSize) … http://www.cs.wisc.edu/condor 17 Schema-free ClassAds › Condor imposes some schema  Owner is a string, ClusterID is a number… › But users can extend it however they like, for jobs or machines  AnalysisJobType = “simulation”  HasJava_1_4 = TRUE  ShoeLength = 7 › Matchmaking can use these attributes  Requirements = OpSys == "LINUX" && HasJava_1_4 == TRUE http://www.cs.wisc.edu/condor 18 Submitting jobs › Users submit jobs from a computer  Jobs described as ClassAds  Each submission computer has a queue  Queues are not centralized  Submission computer watches over queue  Can have multiple submission computers  Submission handled by condor_schedd Condor_schedd Queue  b  b 2  4ac x 2a http://www.cs.wisc.edu/condor 19 Advertising computers › Machine owners describe computers  Configuration file extends ClassAd  ClassAd has dynamic features • Load Average • Free Memory •…  ClassAds are sent to Matchmaker ClassAd Type = “Machine” Requirements = “…” Matchmaker (Collector) http://www.cs.wisc.edu/condor 20 Matchmaking › Negotiator collects list of computers › Negotiator contacts each schedd  What jobs do you have to run? › Negotiator compares each job to each computer  Evaluate requirements of job & machine  Evaluate in context of both ClassAds  If both evaluate to true, there is a match › Upon match, schedd contacts execution computer http://www.cs.wisc.edu/condor 21 Matchmaking diagram Matchmaker Matchmaking Service 2 Negotiator condor_schedd Collector Information service 1 3 Job queue service Queue http://www.cs.wisc.edu/condor 22 Running a Job Matchmaker condor_submit condor_negotiator condor_collector Manages Machine condor_schedd Queue condor_shadow Manages Remote Job condor_startd Manages condor_starter Local Job Job http://www.cs.wisc.edu/condor 23 Condor processes › › › › › › › Master: Takes care of other processes Collector: Stores ClassAds Negotiator: Performs matchmaking Schedd: Manages job queue Shadow: Manages job (submit side) Startd: Manages computer Starter: Manages job (execution side) http://www.cs.wisc.edu/condor 24 Some notes › › › › One negotiator/collector per pool Can have many schedds (submitters) Can have many startds (computers) A machine can have any combination  Dedicated cluster: maybe just startds  Shared workstations: schedd + startd  Personal Condor: everything http://www.cs.wisc.edu/condor 25 Layout of the Condor Pool = Process Spawned = ClassAd Communication Pathway Cluster Node Central Manager master master startd schedd negotiator collector startd Cluster Node master startd Desktop master startd schedd Desktop master startd schedd http://www.cs.wisc.edu/condor 26 Our Condor Pool › Each lab computer has  Schedd (queue)  Startd › One central manager  Collector  Negotiator › When you get access to the computers:  Run: condor_status http://www.cs.wisc.edu/condor 27 Running a Condor Job http://www.cs.wisc.edu/condor 28 Getting Condor › Available as a free download from http://www.cs.wisc.edu/condor › Download Condor for your operating system  Available for many UNIX platforms: • Linux, Solaris, Mac OS X, HPUX, AIX…  Also for Windows http://www.cs.wisc.edu/condor 29 Condor Releases › Naming scheme similar to the Linux Kernel… › Major.minor.release  Stable: Minor is even (a.b.c) • Examples: 6.6.3, 6.8.4, 6.8.5 • Very stable, mostly bug fixes  Developer: Minor is odd (a.b.c) • New features, may have some bugs • Examples: 6.9.4, 6.9.5 › Today’s releases:  Stable: 6.8.6  Development: 6.9.5  Hopefully soon Stable: 7.0 http://www.cs.wisc.edu/condor 30 Try out Condor: Use a Personal Condor › Condor:  on your own workstation  no root access required  no system administrator intervention needed  can use as “Condor-G” to submit to grid sites • But I won’t talk about this for the moment. http://www.cs.wisc.edu/condor 31 Personal Condor?! What’s the benefit of a Condor Pool with just one user and one machine? http://www.cs.wisc.edu/condor 32 Your Personal Condor will ... › … keep an eye on your jobs and will keep you › › › › posted on their progress … implement your policy on the execution order of the jobs … keep a log of your job activities … add fault tolerance to your jobs … implement your policy on when the jobs can run on your workstation http://www.cs.wisc.edu/condor 33 After Personal Condor… › When a Personal Condor pool works for you…  Convince your co-workers to add their computers to the pool  Add dedicated hardware to the pool http://www.cs.wisc.edu/condor 34 Four Steps to Run a Job 1. 2. 3. 4. Choose a Universe for your job Make your job batch-ready Create a submit description file Run condor_submit http://www.cs.wisc.edu/condor 35 1. Choose a Universe › There are many choices  Vanilla: any old job  Standard: checkpointing & remote I/O  Java: better for Java jobs  MPI: Run parallel MPI jobs  Virtual Machine: Run a virtual machine as job  … › For now, we’ll just consider vanilla › We’ll use Java universe in exercises: it is an extension of the Vanilla universe http://www.cs.wisc.edu/condor 36 2. Make your job batch-ready › Must be able to run in the background: no interactive input, windows, GUI, etc. › Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices › Organize data files http://www.cs.wisc.edu/condor 37 3. Create a Submit Description File › A plain ASCII text file  Not a ClassAd  But condor_submit will make a ClassAd from it › Condor does not care about file extensions › Tells Condor about your job:       Which executable, Which universe, Input, output and error files to use, Command-line arguments, Environment variables, Any special requirements or preferences http://www.cs.wisc.edu/condor 38 Simple Submit Description File # Simple condor_submit input file # (Lines beginning with # are comments) # NOTE: the words on the left side are not # case sensitive, but filenames are! Universe = vanilla Executable = analysis Log = my_job.log Queue http://www.cs.wisc.edu/condor 39 4. Run condor_submit › You give condor_submit the name of the submit file you have created: condor_submit my_job.submit › condor_submit parses the submit file, checks for it errors, and creates a ClassAd that describes your job. http://www.cs.wisc.edu/condor 40 The Job Queue › condor_submit sends your job’s ClassAd to the schedd  Manages the local job queue  Stores the job in the job queue • Atomic operation, two-phase commit • “Like money in the bank” › View the queue with condor_q http://www.cs.wisc.edu/condor 41 An example submission % condor_submit my_job.submit % condor_submit my_job.submit Submitting job(s). Submitting job(s). job(s) submitted to cluster 1. 1 job(s) 1 submitted to cluster 1. % condor_q % condor_q -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> : --IDSubmitter: perdita.cs.wisc.edu : SIZE CMD OWNER SUBMITTED RUN_TIME ST PRI 1.0 roy 7/6 06:52 : 0+00:00:00 I 0 0.0 analysis <128.105.165.34:1027> 1 idle,SUBMITTED 0 running, 0 held ID1 jobs; OWNER RUN_TIME ST PRI SIZE CMD 1.0 roy 7/6 06:52 0+00:00:00 I 0 0.0 foo % 1 jobs; 1 idle, 0 running, 0 held http://www.cs.wisc.edu/condor 42 Sample Condor User Log 000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816> Job submitted from host: <128.105.146.14:1816> ... 001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026> Job executing on host: <128.105.146.14:1026> ... 005 (0001.000.000) 05/25 19:13:06 Job terminated. Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:37, Sys 0 00:00:00 - Run Remote Usage (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:05 - Run Local Usage Usr Usr00:00:37, 00:00:00 - Usage Run Remote Usage 0 00:00:37, Sys 0Sys 00:00:00 - Total Remote 0 00:00:00, Sys 0 00:00:05 - Total Local Usage Usr Usr00:00:00, Sys 00:00:05 - Run Local Usage 9624 - Run Bytes Sent By Job Usr Sys By00:00:00 - Total Remote Usage 714615900:00:37, - Run Bytes Received Job 9624 - Total Bytes Sent By Job Usr 00:00:00, Sys 00:00:05 - Total Local Usage 7146159 9624 ... - 7146159 9624 Total Bytes Received By Job - Run Bytes Sent By Job - Run Bytes Received By Job http://www.cs.wisc.edu/condor Total Bytes Sent By Job 43 Another Submit Description File # Example condor_submit input file # (Lines beginning with # are comments) # NOTE: the words on the left side are not # case sensitive, but filenames are! Universe = vanilla Executable = /home/wright/condor/my_job.condor Input = my_job.stdin Output = my_job.stdout Error = my_job.stderr Arguments = -arg1 -arg2 InitialDir = /home/wright/condor/run_1 Queue http://www.cs.wisc.edu/condor 44 “Clusters” and “Processes” › If your submit file describes multiple jobs, we › › › › call this a “cluster” Each job within a cluster is called a “process” or “proc” If you only specify one job, you still get a cluster, but it has only one process A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”) Process numbers always start at 0 http://www.cs.wisc.edu/condor 45 Example Submit Description File for a Cluster # Example condor_submit input file that defines # a cluster of two jobs with different iwd Universe = vanilla Executable = my_job Arguments = -arg1 -arg2 InitialDir = run_0 Queue  Becomes job 2.0 InitialDir = run_1 Queue  Becomes job 2.1 http://www.cs.wisc.edu/condor 46 % condor_submit my_job.submit-file Submitting job(s). 2 job(s) submitted to cluster 2. % condor_q -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> : ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 frieda 6/16 06:52 0+00:02:11 R 0 0.0 my_job 2.0 frieda 6/16 06:56 0+00:00:00 I 0 0.0 my_job 2.1 frieda 6/16 06:56 0+00:00:00 I 0 0.0 my_job 3 jobs; 2 idle, 1 running, 0 held % http://www.cs.wisc.edu/condor 47 Submit Description File for a BIG Cluster of Jobs › The initial directory for each job is specified › › with the $(Process) macro, and instead of submitting a single job, we use “Queue 600” to submit 600 jobs at once $(Process) will be expanded to the process number for each job in the cluster (from 0 up to 599 in this case), so we’ll have “run_0”, “run_1”, … “run_599” directories All the input/output files will be in different directories! http://www.cs.wisc.edu/condor 48 Submit Description File for a BIG Cluster of Jobs # Example condor_submit input file that defines # a cluster of 600 jobs with different iwd Universe = vanilla Executable = my_job Arguments = -arg1 –arg2 InitialDir = run_$(Process) Queue 600 http://www.cs.wisc.edu/condor 49 Job submit files: Match Ad Substitution macros › A substitution macro takes the value of the expression from the resource (machine ClassAd) after a match has been made. These macros look like: $$(attribute) › An alternate form looks like: $$(attribute:string_if_attribute_undefined) http://www.cs.wisc.edu/condor 50 Example Submit File # Run on Solaris or Linux Universe = vanilla Requirements = Opsys==“Linux” || Opsys == “Solaris” Executable = foo.$$(opsys) Arguments = -textures $$(texturedir:/usr/local/alias/textures) queue http://www.cs.wisc.edu/condor 51 Details › Lots of options available in the submit file › Commands to watch the queue, the state of your pool, and lots more › You’ll see much of this in the hands-on exercises. http://www.cs.wisc.edu/condor 52 General User Commands › › › › › › › › › condor_status condor_q condor_submit condor_rm condor_prio condor_history condor_submit_dag condor_checkpoint condor_compile View Pool Status View Job Queue Submit new Jobs Remove Jobs Intra-User Prios Completed Job Info Specify Dependencies Force a checkpoint Link Condor library http://www.cs.wisc.edu/condor 53 Workflows? › I want to have dependencies between jobs. › I want my workflow to be managed even in the face of failures. http://www.cs.wisc.edu/condor 54 Simple Bourne script… #!/bin/sh cd /work/foo rm –rf data cp -r /fresh/data . What if ‘/work/foo’ is unavailable?? http://www.cs.wisc.edu/condor 55 Getting production run ready… #!/bin/sh for attempt in 1 2 3 cd /work/foo if [ ! $? ] then echo "cd failed, trying again..." sleep 5 else break fi done if [ ! $? ] then echo "couldn't cd, giving up..." return 1 fi http://www.cs.wisc.edu/condor 56 Want other Scheduling possibilities? Use the Scheduler Universe › In addition to VANILLA, another job universe is the Scheduler Universe. › Scheduler Universe jobs run on the submitting machine and serve as a meta-scheduler. › DAGMan meta-scheduler included http://www.cs.wisc.edu/condor 57 DAGMan › Directed Acyclic Graph Manager › DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. › (e.g., “Don’t run job “B” until job “A” has completed successfully.”) http://www.cs.wisc.edu/condor 58 What is a DAG? › A DAG is the data structure used by DAGMan to represent these dependencies. Job A › Each job is a “node” in the DAG. › Each node can have any number Job B Job C of “parent” or “children” nodes – as long as there are no loops! Job D http://www.cs.wisc.edu/condor 59 Defining a DAG › A DAG is defined by a .dag file, listing each of its nodes and their dependencies: # diamond.dag Job A a.sub Job B b.sub Job C c.sub Job D d.sub Parent A Child B C Parent B C Child D Job A Job B Job C Job D › each node will run the Condor job specified by its accompanying Condor submit file http://www.cs.wisc.edu/condor 60 Submitting a DAG › To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon which to begin running your jobs: % condor_submit_dag diamond.dag › condor_submit_dag submits a Scheduler Universe Job with DAGMan as the executable. › Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it. http://www.cs.wisc.edu/condor 61 Running a DAG › DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. A Condor A Job Queue B C .dag File DAGMan D http://www.cs.wisc.edu/condor 62 Running a DAG (cont’d) › DAGMan holds & submits jobs to the Condor queue at the appropriate times. A Condor B Job C Queue B C DAGMan D http://www.cs.wisc.edu/condor 63 Running a DAG (cont’d) › In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. A Condor Job Queue B X Rescue File DAGMan D http://www.cs.wisc.edu/condor 64 Recovering a DAG › Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. A Condor Job C Queue B C Rescue File DAGMan D http://www.cs.wisc.edu/condor 65 Recovering a DAG (cont’d) › Once that job completes, DAGMan will continue the DAG as if the failure never happened. A Condor Job D Queue B C DAGMan D http://www.cs.wisc.edu/condor 66 Finishing a DAG › Once the DAG is complete, the DAGMan job itself is finished, and exits. A Condor Job Queue B C DAGMan D http://www.cs.wisc.edu/condor 67 Additional DAGMan Features › Provides other handy features for job management…  nodes can have PRE & POST scripts  failed nodes can be automatically re-tried a configurable number of times  job submission can be “throttled” http://www.cs.wisc.edu/condor 68 Another sample DAGMan submit file # Filename: diamond.dag Job A A.condor Job B B.condor Job C C.condor Job B Job D D.condor Script PRE A top_pre.csh Script PRE B mid_pre.perl $JOB Script POST B mid_post.perl $JOB $RETURN Script PRE C mid_pre.perl $JOB Script POST C mid_post.perl $JOB $RETURN Script PRE D bot_pre.csh PARENT A CHILD B C PARENT B C CHILD D Retry C 3 Job A Job C Job D http://www.cs.wisc.edu/condor 69 How can my jobs access their data files? http://www.cs.wisc.edu/condor 70 Access to Data in Condor › Use shared filesystem if available  In today’s exercises, we won’t use our shared filesystem › No shared filesystem?  Condor can transfer files • • • • Can automatically send back changed files Atomic transfer of multiple files Can be encrypted over the wire This is what we’ll do in the exercises  Remote I/O Socket  Standard Universe can use remote system calls (more on this later) http://www.cs.wisc.edu/condor 71 Condor File Transfer › ShouldTransferFiles = YES  Always transfer files to execution site › ShouldTransferFiles = NO  Rely on a shared filesystem › ShouldTransferFiles = IF_NEEDED  Will automatically transfer the files if the submit and execute machine are not in the same FileSystemDomain Universe = vanilla Executable = my_job Log = my_job.log ShouldTransferFiles = IF_NEEDED Transfer_input_files = dataset$(Process), common.data Queue 600 http://www.cs.wisc.edu/condor 72 Some of the machines in the Pool do not have enough memory or scratch disk space to run my job! http://www.cs.wisc.edu/condor 73 Specify Requirements › An expression (syntax similar to C or Java) › Must evaluate to True for a match to be made Universe Executable Log InitialDir Requirements Queue 600 = = = = = vanilla my_job my_job.log run_$(Process) Memory >= 256 && Disk > 10000 http://www.cs.wisc.edu/condor 74 Beyond Requirements › Condor has extensive policies  What does a job require?  What does a job prefer?  When should you kill a job?  When should you try again? › All powered by ClassAds http://www.cs.wisc.edu/condor 75 We’ve seen how Condor can: … keeps an eye on your jobs and will keep you posted on their progress … implements your policy on the execution order of the jobs … keeps a log of your job activities http://www.cs.wisc.edu/condor 76 My jobs run for 20 days… › What happens when they get pre-empted? › How can I add fault tolerance to my jobs? http://www.cs.wisc.edu/condor 77 Condor’s Standard Universe to the rescue! › Condor can support various combinations of › features/environments in different “Universes” Different Universes provide different functionality for your job:  Vanilla:  Scheduler:  Standard: Run any serial job Plug in a scheduler Support for transparent process checkpoint and restart http://www.cs.wisc.edu/condor 78 Process Checkpointing › Condor’s process checkpointing mechanism saves the entire state of a process into a checkpoint file  Memory, CPU, I/O, etc. › The process can then be restarted from › right where it left off Typically no changes to your job’s source code needed—however, your job must be relinked with Condor’s Standard Universe support library http://www.cs.wisc.edu/condor 79 Relinking Your Job for Standard Universe To do this, just place “condor_compile” in front of the command you normally use to link your job: % condor_compile gcc -o myjob myjob.c - OR - % condor_compile f77 -o myjob filea.f fileb.f http://www.cs.wisc.edu/condor 80 Limitations of the Standard Universe › Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not:  fork()  Use kernel threads  Use some forms of IPC, such as pipes and shared memory › Many typical scientific jobs are OK › Must be same gcc as Condor was built with http://www.cs.wisc.edu/condor 81 When will Condor checkpoint your job? › Periodically, if desired (for fault tolerance) › When your job is preempted by a higher › › priority job When your job is vacated because the execution machine becomes busy When you explicitly run:  condor_checkpoint  condor_vacate  condor_off  condor_restart http://www.cs.wisc.edu/condor 82 Remote System Calls › I/O system calls are trapped and sent back › to submit machine Allows transparent migration across administrative domains  Checkpoint on machine A, restart on B › No source code changes required › Language independent › Opportunities for application steering http://www.cs.wisc.edu/condor 83 Remote I/O condor_schedd condor_startd condor_shadow condor_starter File I/O Job Lib http://www.cs.wisc.edu/condor 84 Java Universe Job universe executable jar_files input output arguments queue = = = = = = java Main.class MyLibrary.jar infile outfile Main 1 2 3 http://www.cs.wisc.edu/condor 85 Why not use Vanilla Universe for Java jobs? › Java Universe provides more than just inserting “java” at the start of the execute line  Knows which machines have a JVM installed  Knows the location, version, and performance of JVM on each machine  Can differentiate JVM exit code from program exit code  Can report Java exceptions http://www.cs.wisc.edu/condor 86 The Grid Universe Most useful when 1. We want to send a job off to a far away machine 2. We want to hand a job to another batch processing system on the local machine 3. We want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine http://www.cs.wisc.edu/condor 87 The Grid Universe › All handled in the submit description file › Supports several back end types:  Globus: GT2, GT3, GT4  NorduGrid  UNICORE  Condor  PBS  LSF http://www.cs.wisc.edu/condor 88 PBS and the Submit Description File › Details of the PBS installation in $(GLITE_LOCATION)/etc/batch_gahp.config universe = grid input = job6.input output = job6.result log = job6.log grid_resource = pbs queue http://www.cs.wisc.edu/condor 89 Submit Description File For gt4: XXX is one of: Fork Condor PBS LSF SGE universe = grid input = job3.input output = job3.result log = job3.log grid_resource = gt4 https://198.51.254.40:8080/wsrf/ service/ManagedJobFactoryService XXX queue IP address:Port number OR Host name:Port number http://www.cs.wisc.edu/condor 90 Some additional job universe types › Local › Parallel › VM (Virtual Machine)  VMWare Server  Xen  Libvirt (kvm, …) › In the works:  Amazon Elastic Compute Cloud  ??? http://www.cs.wisc.edu/condor 91 › › › › › › › › › › › › I could also talk lots about… GCB: Living with firewalls & private networks Federated Grids/Clusters APIs and Portals MW Database Support (Quill) High Availability Fail-over Compute On-Demand (COD) Dynamic Pool Creation (“Glide-in”) Role-based prioritization and accounting Strong security, incl privilege separation Data movement scheduling in workflows … http://www.cs.wisc.edu/condor 92 But I won’t › After break: Practical exercises › Please ask me questions, now or later http://www.cs.wisc.edu/condor 93

An Introduction To Condor High Throughput Computing Week 2007

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib