Dr Steven Newhouse
Technical Director
London e-Science Centre
Department of Computing, Imperial College London

• Overview
• Capability
• Specifying policy through ClassAds
• Condor-G & GlideIn
• Master & Worker
• Condor DAGMan

• High Throughput Computing (HTC) infrastructure: performance measured in CPU cycles/year
• Uses ClassAds to represent jobs & resources during matchmaking
• Inherent fault-tolerance
• Available on Windows 2000 (limited support) & UNIX

• One user & one machine (no root access)
• Provides a persistent execution environment that will run jobs when your policy allows them to run
• Jobs need to run without any user interaction

Condor daemons
• Master
  – Starts up, manages, and deploys new daemons
• Startd (represents a machine to the Condor system)
  – Responsible for starting, suspending, and stopping jobs
  – Enforces the wishes of the machine owner (the owner's "policy"… more on this soon)
• Schedd (represents users to the Condor system)
  – Maintains the persistent queue of jobs
  – Responsible for contacting available machines and sending them jobs
• Collector (represents the state of the Condor pool)
  – Each daemon periodically provides an update via a ClassAd
• Negotiator (matches jobs to resources through ClassAds)

Condor universes
• Vanilla
  – No checkpointing, migration, restart or remote calls
• Standard
  – Relink with Condor 'C' library to provide checkpointing, migration & restart
  – Redirect remote system calls to submitting machine
  – No support for fork, shared memory, or IPC system calls
• Globus
  – Specify the resource job manager
• Scheduler (e.g. DAGMan)
  – Runs meta-scheduler on submitting machine
• MPI
• PVM

• I/O system calls trapped and sent back to submit machine
• Allows transparent migration across administrative domains
• Language independent
• No source code changes required
• Condor can transfer files
  – Automatically send back changed files
  – Atomic transfer of multiple files

[Figure labels: Schedd, Shadow, Startd, Starter, Submit, Customer Job, Condor Syscall Lib]

• Looking for the solution to the NUG30 quadratic assignment problem
• An informal collaboration of mathematicians and computer scientists
• Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) in U.S. and Italy (8 sites)

[Figure: Submit Machine (Schedd) and three Central Managers, each with a Collector and a Negotiator: Pool-Foo, Pool-Bar and the local Central Manager (CONDOR_HOST)]

[Figure labels: saturn (Solaris E6800), viking-i00 (Linux Xeon)]

• A Requirement expression must evaluate to True for a match to be made
• All matches which meet the requirements can be sorted by preference with a Rank expression.
• The higher the Rank, the better the match

[Figure label: mir (Linux x86)]

Universe = vanilla
Executable = my_job
Arguments = -arg1 -arg2
InitialDir = run_$(Process)
Requirements = Memory >= 256 && Disk > 10000
Rank = (KFLOPS*10000) + Memory
Queue 600

START - When is this machine willing to start a job
RANK - Job preferences
SUSPEND - When to suspend a job
CONTINUE - When to continue a suspended job
PREEMPT - When to nicely stop running a job
KILL - When to immediately kill a preempting job

• START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low
• SUSPEND jobs as soon as activity is detected
• PREEMPT jobs if the activity continues for 5 minutes or more
• KILL jobs if they take more than 5 minutes to preempt

NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad = 0.3
HighLoad = 0.5
KeyboardBusy = (KeyboardIdle < 10)
# Assumed definition (not shown on the slide); CPU_Idle is used below and
# BackgroundLoad is otherwise unused.
CPU_Idle = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
START = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
KILL = $(ActivityTimer) > 300

• A ClassAd maps attributes to expressions
• Expressions
  – Constants: strings, numbers, etc.
  – Expressions: other.Memory > 600M
  – Lists: { "roy", "pfc", "melski" }
  – Other ClassAds
• Powerful tool for grid computing
  – Semi-structured (you pick your structure)
  – Matchmaking
• Open Source: http://www.cs.wisc.edu/condor/classad

[
  Type = "Job";
  Owner = "roy";
  Universe = "Standard";
  Requirements = (other.OpSys == "Linux" && other.DiskSpace > 140M);
  Rank = (other.DiskSpace > 300M ? 10 : 1);
]
[
  Type = "Machine";
  OpSys = "Linux";
  DiskSpace = 500M;
  AllowedUsers = {"roy", "melski", "pfc"};
  Requirements = (IsMember(other.Owner, AllowedUsers));
]
(To see a real ClassAd, try: condor_q -l or condor_status -l)

• Condor:
  – Used to describe jobs & machines
  – Used by the matchmaker to place jobs on machines
• European Data Grid:
  – JDL: ClassAd-derived schema for jobs & machines
  – Used by the resource broker to match jobs to machines (see later)
• Future:
  – XML representation

Condor-G
• Use Condor to run jobs on Grid resources
• Use the Globus Toolkit to provide:
  – GRAM (submit a remote job)
  – GASS (transfer job's files)
• Use Condor to provide:
  – Job management
  – Fault tolerance
  – Credential management
• Two components: Globus Universe & GlideIn
• Disadvantages
  – No remote syscalls, checkpoint/migration, or dynamic resource selection

[Figure sequence: Condor-G (Schedd) and a Grid Resource (LSF); successive slides add a GridManager on the Condor-G side, then a JobManager on the Grid Resource, then the User Job]

• Create your own personal Condor pool from temporarily-acquired Grid resources
• Brings the full power of Condor to the Grid
• Run a Condor startd on a Grid resource
• Startd reports back to your machine and runs Vanilla and Standard Universe jobs

[Figure: Job Submission Machine and Job Execution Site. Labels: End User Requests, Persistent Job Queue, Condor-G Scheduler, Condor-G GridManager, GASS Server, Globus GateKeeper, Globus JobManager (two instances), Site Job Scheduler (PBS, Condor, LSF, LoadLeveler, NQE, etc.), Job X, Job Y; arrows labelled Fork and Submit]

[Figure sequence: Condor-G (Schedd, Collector) and a Grid Resource (LSF); successive slides add a GridManager, then a JobManager, then a Startd on the Grid Resource, then the User Job]

[Figure: Job Submission Machine and Job Execution Site. Labels: End User Requests, Persistent Job Queue, Condor-G Scheduler, Condor-G GridManager, Condor-G Collector, GASS Server, Globus Daemons + Local Site Scheduler [See Figure 1], Condor Daemons; arrows labelled Fork and Transfer Job X]

• Master-Worker style parallel applications
  – Large problem partitioned into small pieces (tasks);
  – The master manages tasks and resources (worker pool);
  – Each worker gets a task, executes it, sends the result back, and repeats until all tasks are done;
  – Examples: ray-tracing, optimization problems, etc.
• On Condor (PVM, Globus, …)
  – Many opportunities!
  – Issues (in a distributed opportunistic environment):
    • Resource management, communication, portability;
    • Fault-tolerance, dealing with runtime pool changes.
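The master-worker loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: local threads stand in for Condor workers, and the names `run_master`, `WorkerLost` and `make_flaky_square` are hypothetical, not part of Condor or the MW framework. The master hands out tasks, collects results, and re-queues any task whose worker disappeared, which is the fault-tolerance issue named in the last bullet.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading


class WorkerLost(Exception):
    """Hypothetical signal that a worker vanished before returning a result."""


def make_flaky_square(fail_once):
    """Build a task function that squares n but fails the first time for n in fail_once."""
    seen = set()
    lock = threading.Lock()

    def square(n):
        with lock:
            first_time = n not in seen
            seen.add(n)
        if first_time and n in fail_once:
            raise WorkerLost(n)  # simulate a machine leaving the pool
        return n * n

    return square


def run_master(task_fn, tasks, n_workers=4, max_attempts=3):
    """Master: farm tasks out to workers and re-queue tasks whose worker was lost."""
    tasks = list(tasks)
    attempts = {t: 0 for t in tasks}
    results = {}
    pending = tasks
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while pending:
            futures = {pool.submit(task_fn, t): t for t in pending}
            pending = []
            for fut in as_completed(futures):
                task = futures[fut]
                attempts[task] += 1
                try:
                    results[task] = fut.result()
                except WorkerLost:
                    if attempts[task] >= max_attempts:
                        raise  # give up on this task
                    pending.append(task)  # hand it to another worker next round
    return results


if __name__ == "__main__":
    print(run_master(make_flaky_square({2, 5}), range(8)))
```

The user-supplied part is only the task function; the scheduling and recovery live in the master, which mirrors the division of labour the slide describes.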
[Figure labels: Fork, Condor Shadow Process for Job X, Job, Resource Information, Redirected System Call Data, Job X, Condor System Call Trapping & Checkpoint Library]

• An OO framework with simple interfaces
  – 3 classes to extend, a few virtual functions to fill;
  – Scientists can focus on their algorithms.
• Lots of functionality
  – Handles all the issues in a meta-computing environment;
  – Provides sufficient info. to make smart decisions.
• Many choices without changing user code
  – Multiple resource managers: Condor, PVM, …
  – Multiple communication interfaces: PVM, File, Socket, …

• Nug30 solved in 7 days by MW-QAP
  – Quadratic assignment problem outstanding for 30 years
  – Utilized 2500 machines from 10 sites
    • NCSA, ANL, UWisc, Gatech, INFN@Italy, …
    • 1009 workers at peak, 11 CPU years
  – http://www-unix.mcs.anl.gov/metaneos/nug30/
• STORM (flight scheduling)
  – Stochastic programming problem (1000M row X 13000M col)
  – 2K times larger than the best sequential program can do
  – 556 workers at peak, 1 CPU year
  – http://www.cs.wisc.edu/~swright/stochastic/atr/

• Directed Acyclic Graph Manager
• Specify dependencies between your Condor jobs, so it can manage them automatically for you, e.g. don't run "B" until "A" has completed successfully.
• Most real science involves complex sequences of tasks on many resources at many sites, e.g. move data, compute, check, move back, etc.
• Failures are a certainty, so recoverability of the sequence, not just of the jobs, is crucial.

• A DAG is the data structure used by DAGMan to represent these dependencies.
• Each job is a "node" in the DAG.
• Each node can have any number of "parent" or "children" nodes, as long as there are no loops!

[Figure: nodes Job A, Job B, Job C, Job D]

• DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor based on the DAG dependencies.
• DAGMan holds & submits jobs to the Condor queue at the appropriate times.
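To make the "hold until the parents are done" idea concrete, here is a minimal Python sketch (Python 3.9+ for `graphlib`). It is not DAGMan's implementation; the function name `submission_waves` and the diamond-shaped example graph (A before B and C, both before D) are assumptions chosen for illustration. It groups nodes into waves that could be submitted together once every parent has completed, and it rejects graphs that contain a loop.

```python
from graphlib import CycleError, TopologicalSorter

# Assumed example graph: each node maps to the set of parents that must finish first.
DIAMOND = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def submission_waves(parents):
    """Return lists of nodes that become runnable together, in dependency order."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises CycleError if the graph contains a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave completed successfully
    return waves


if __name__ == "__main__":
    print(submission_waves(DIAMOND))
    try:
        submission_waves({"A": {"B"}, "B": {"A"}})
    except CycleError:
        print("not a DAG: the graph contains a loop")
```

A real meta-scheduler would mark each node done individually as its job leaves the queue, rather than a wave at a time; the wave form is used here only to keep the output easy to read.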
[Figures: .dag File, DAGMan and the Condor Job Queue, with nodes A, B, C and D]

• In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
• Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.

[Figures: DAGMan, Condor Job Queue and Rescue File, with nodes A, B, C and D and the failed node marked X]

• Once the DAG is complete, the DAGMan job itself is finished, and exits.

• A PRE or POST script executes locally on the submit host before or after job submission
• PRE/POST scripts are part of the node
• Example:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Script PRE A prepare-A.sh
Script POST D double-check.sh
Parent A Child B C
Parent B C Child D

[Figure: PRE, Job A, Job B, Job C, Job D, POST]

• RETRY tells DAGMan to re-run a node multiple times if necessary
• Example:

# diamond.dag
Job A a.sub
Job B b.sub
RETRY B 5
Job C c.sub
RETRY C 5
Job D d.sub
Parent A Child B C
Parent B C Child D

[Figure: Job A, Job B, Job C, Job D]

• Condor provides a reliable execution infrastructure
• Enables the federation of resources
• Provides a reliable Grid submission system
• Distributed dependent workflow (no control flow) with failure detection and recovery
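The retry and rescue behaviour from the DAGMan slides can be sketched as a small Python simulation. It is an illustration only, not DAGMan code: `run_dag` and the `run_node` callback are hypothetical names, the set of finished nodes stands in for the rescue file, and a retry count of N is assumed to mean one initial attempt plus up to N retries.

```python
# Diamond-shaped graph from the example: node -> set of parents.
DIAMOND = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def run_dag(parents, run_node, retries=None, done=()):
    """Run every node whose parents are done; return (done, failed) sets.

    run_node(name) is a caller-supplied function returning True on success.
    Passing a previous `done` set resumes the run, like a rescue file would.
    """
    retries = retries or {}
    done = set(done)
    failed = set()
    progress = True
    while progress:  # keep going until no further node can be started
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if not parents[node] <= done:
                continue  # a parent is unfinished or has failed
            attempts = 1 + retries.get(node, 0)  # assumed meaning of RETRY
            if any(run_node(node) for _ in range(attempts)):
                done.add(node)
            else:
                failed.add(node)
            progress = True
    return done, failed


if __name__ == "__main__":
    # First run: B always fails, so D can never start.
    done, failed = run_dag(DIAMOND, lambda n: n != "B", retries={"B": 5, "C": 5})
    print(sorted(done), sorted(failed))
    # Later run: B has been fixed; only the unfinished nodes are executed.
    done, failed = run_dag(DIAMOND, lambda n: True, done=done)
    print(sorted(done), sorted(failed))
```

The second call shows why saving the state of the whole sequence matters: nodes that already succeeded are not repeated when the workflow is resumed.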