Condor’s grid tools and the eMinerals project
Mark Calleja

Background
• Model the atomistic processes involved in environmental issues (radioactive waste disposal, pollution, weathering).
• Jobs can last minutes to weeks and require a few MB to a few GB of memory.
• Generally not data intensive.
• The project has 12 postdocs spread over a number of sites: Bath, Cambridge, Daresbury, Reading, RI and UCL.
• Experience of using Condor pools within eMinerals was addressed yesterday by John Brodholt; I’ll concentrate on Condor’s client tools, especially grid submission and DAGMan.

Minigrid Resources
• Two Condor pools: a large one at UCL (~930 Windows boxes) and a small one at Cambridge (~25 heterogeneous nodes).
• Three Linux clusters, each with one master + 16 nodes running under PBS queues.
• An IBM pSeries platform with 24 processors under LoadLeveler.
• A number of Storage Resource Broker (SRB) instances, providing ~3 TB of distributed, transparent storage.
• An application server, including the SRB Metadata Catalogue (MCAT), and a database cluster at Daresbury.
• All accessed via Globus.

Storage Resource Broker (SRB)
• Developed at the San Diego Supercomputer Center.
• SRB is client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets.
• In conjunction with the Metadata Catalog (MCAT), it provides a way to access data sets and resources based on their attributes and/or logical names rather than their physical names or locations.
• Provides a number of user interfaces: command line (useful for scripting), Jargon (Java toolkit), InQ (Windows GUI) and MySRB (web browser).

Typical work process
• Start by uploading input data into the SRB using one of the three client tools (a minimal S-commands sketch is given after the my_condor_submit slide):
  a) S-commands (command-line tools)
  b) InQ (Windows)
  c) MySRB (web browser)
• Data in the SRB can be annotated using the Metadata Editor and then searched using the CCLRC DataPortal. This is especially useful for the output data.
• Construct the relevant Condor/DAGMan submit script/workflow.
• Launch onto the minigrid using Condor-G client tools.

Job workflow
[Diagram: gatekeeper, SRB, Condor pool, and the Condor and PBS jobmanagers]
1) On the remote gatekeeper, run a jobmanager-fork job to create a temporary directory and extract the input files from the SRB.
2) Submit the next node in the workflow to the relevant jobmanager, e.g. PBS or Condor, to actually perform the required computational job.
3) On completion of the job, run another jobmanager-fork job on the relevant gatekeeper to ingest the output data into the SRB and clean up the temporary working area.
(A DAGMan sketch of this three-stage workflow is given after the my_condor_submit slide.)

my_condor_submit
• Ideally, we would like to see this SRB functionality absorbed into condor_submit.
• Stork looks neat, BUT:
  1) The user needs Condor-, DAGMan- and Stork-specific scripts.
  2) Currently a “limited beta version for RH 7.2” (dynamically linked).
• In the meantime, we’ve provided our own wrapper for these workflows, called my_condor_submit.
• This takes as its argument an ordinary Condor, or Condor-G, submit script, but also recognises some SRB-specific extensions.
• Limitation: the SRB extensions currently can’t make use of Condor macros, e.g. job.$$(OpSys).
• We are also currently developing a job submission portal.
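To make the first step of the typical work process concrete, a minimal S-commands session for staging input data into the SRB might look like the sketch below. The file and collection names (input.dat, test) are made up for illustration; the commands themselves are the standard SRB command-line clients.

  Sinit            # authenticate to the SRB/MCAT using the settings in ~/.srb/.MdasEnv
  Smkdir test      # create a collection for this run (hypothetical name)
  Scd test         # make it the current collection
  Sput input.dat   # upload the (hypothetical) input file from local disk
  Sls              # check that the file has been ingested
  Sexit            # close the session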
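For reference, the three-stage workflow that my_condor_submit wraps (and that the Job workflow slide describes) could be written by hand as a DAGMan input file along the following lines. This is only a sketch: the node and submit-file names are invented, and each .sub file would be an ordinary Condor-G submit script targeting the jobmanager named in the comment.

  # Three-node DAG (hypothetical file names): SRB stage-in, compute, SRB stage-out.
  # stagein.sub  : Condor-G job to jobmanager-fork -- make a temporary directory and Sget the input files
  # compute.sub  : Condor-G job to jobmanager-condor or jobmanager-pbs -- run the application
  # stageout.sub : Condor-G job to jobmanager-fork -- Sput the results into the SRB and clean up
  JOB stagein  stagein.sub
  JOB compute  compute.sub
  JOB stageout stageout.sub
  PARENT stagein CHILD compute
  PARENT compute CHILD stageout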
my_condor_submit (continued)
# Example submit script for a remote Condor pool
Universe        = globus
Globusscheduler = lake.esc.cam.ac.uk/jobmanager-condor-INTEL-LINUX
Executable      = add.pl
Notification    = NEVER
GlobusRSL       = (condorsubmit=(transfer_files ALWAYS)(universe vanilla)(transfer_input_files A, B))(arguments=A B res)
Sget            = A, B         # Or just “Sget = *”
Sput            = res          # Or just “Sput = *”
Sdir            = test
Sforce          = true
Output          = job.out
Log             = job.log
Error           = job.error
Queue

# To turn this into a PBS job replace with:
#
# Globusscheduler = lake.esc.cam.ac.uk/jobmanager-pbs
# GlobusRSL       = (arguments=A B res)(job_type=single)

Recursive DAGs
• PCs in the UCL pool can be switched off after a few hours to make students leave the teaching rooms.
• We can get round this by submitting recursive DAGs that run for “short” times (~5 hours).
• Hence rescheduled jobs lose at most 5 hours of CPU time.
• Possible gotcha: the POST script in the DAG kicks off the next node, but there will be contention for the DAG log files. Solve this by making the forked process sleep a few seconds (a sketch is given at the end of these slides).

Bottom line…
• eMinerals makes extensive use of Condor’s client tools: condor_submit is the weapon of choice.
• DAGMan has been particularly useful.
• The integration of Condor-G and the SRB with the job-execution components of the minigrid has provided the most obvious added value to the project.

But we’d still like to see…
• Support for SRB functionality in condor_submit files.
• The ability for users to monitor files produced by jobs in the vanilla universe: “Is my week-old simulation still behaving itself?”. Something like condor_fetchlog for users? Can Chirp help?
• Is parallel, rather than sequential, flocking feasible?
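Finally, a sketch of the recursive-DAG trick from the Recursive DAGs slide. The file names and the length of the sleep are illustrative rather than the project’s actual scripts; the idea is simply that the POST script forks a child which waits long enough for the current DAGMan to exit and release its log files before resubmitting the same DAG.

job.dag:
  # job.dag -- hypothetical recursive DAG: run one “short” (~5 hour) chunk, then resubmit itself.
  # chunk.sub is an ordinary (Condor-G) submit file for the computational job.
  JOB chunk chunk.sub
  # The POST script kicks off the next iteration of the same DAG.
  SCRIPT POST chunk resubmit.sh job.dag

resubmit.sh:
  #!/bin/sh
  # resubmit.sh -- POST script for the final node (hypothetical).
  # Fork a child that sleeps a few seconds, giving the current DAGMan time to
  # exit and release its log files, then submits the next copy of the DAG.
  ( sleep 10; condor_submit_dag -f "$1" ) &
  exit 0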