Condor’s grid tools and the eMinerals project
Mark Calleja
Background
• Model the atomistic processes involved in environmental
issues (radioactive waste disposal, pollution,
weathering).
• Jobs can last from minutes to weeks and require a few MB to a
few GB of memory.
• Generally not data intensive.
• The project has 12 postdocs and is spread over a number of
sites: Bath, Cambridge, Daresbury, Reading, RI and
UCL.
• Experience of using Condor pools within eMinerals
was addressed yesterday by John Brodholt: I’ll concentrate
on Condor’s client tools, especially grid submission and
DAGMan.
Minigrid Resources
• Two Condor pools: large one at UCL (~930 Windows
boxes) and a small one at Cambridge (~25
heterogeneous nodes).
• Three Linux clusters, each with one master + 16 nodes
running under PBS queues.
• An IBM pSeries platform with 24 processors under
LoadLeveler.
• A number of Storage Resource Broker (SRB) instances,
providing ~3TB of distributed, transparent storage.
• Application server, including SRB Metadata Catalogue
(MCAT), and database cluster at Daresbury.
• All accessed via Globus.
Storage Resource Broker (SRB)
• Developed at the San Diego Supercomputer Center (SDSC).
• SRB is client-server middleware that provides a uniform
interface for connecting to heterogeneous data
resources over a network and accessing replicated data
sets.
• In conjunction with the Metadata Catalog (MCAT),
provides a way to access data sets and resources based
on their attributes and/or logical names rather than their
names or physical locations.
• Provides a number of user interfaces: command line
(useful for scripting), Jargon (java toolkit), InQ (Windows
GUI) and MySRB (web browser).
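For scripting, an S-command session might look like the sketch below. The file and collection names follow the submit-script example later in this talk; this is illustrative only, not the exact eMinerals recipe.

Sinit                  # authenticate to the SRB
Smkdir test            # create a collection for this run
Sput A B test          # upload the input files A and B
Sls test               # check that they arrived
...                    # run the job (see the workflow below)
Sget test/res .        # retrieve the output file res
Sexit                  # close the session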
Typical work process
• Start by uploading input data into the SRB using one of the three
client tools:
a) S-commands (command line tools)
b) InQ (Windows)
c) MySRB (web browser)
Data in the SRB can be annotated using the Metadata Editor, and then
searched using the CCLRC DataPortal. This is especially useful for the
output data.
• Construct the relevant Condor/DAGMan submit script/workflow.
• Launch onto minigrid using Condor-G client tools.
Job workflow
[Diagram: gatekeeper with Condor and PBS jobmanagers, the SRB, and a Condor pool.]
1) On the remote gatekeeper, run a jobmanager-fork job to create a temporary directory and extract the input files from the SRB.
2) Submit the next node in the workflow to the relevant jobmanager, e.g. PBS or Condor, to actually perform the required computational job.
3) On completion of the job, run another jobmanager-fork job on the relevant gatekeeper to ingest the output data into the SRB and clean up the temporary working area.
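The three stages above map naturally onto a three-node DAG; a minimal sketch follows. The node, script and file names (stagein.sub, stage_in.sh, etc.) are hypothetical, and the gatekeeper name is borrowed from the example submit script later in this talk.

# workflow.dag (sketch): the three workflow stages as a DAGMan file.
# stagein:  jobmanager-fork job that makes the temp dir and Sgets the inputs
# compute:  jobmanager-condor or jobmanager-pbs job that runs the code
# stageout: jobmanager-fork job that Sputs the outputs and cleans up
JOB stagein stagein.sub
JOB compute compute.sub
JOB stageout stageout.sub
PARENT stagein CHILD compute
PARENT compute CHILD stageout

Each node is an ordinary Condor-G submit file, e.g. for the stage-in node:

# stagein.sub (sketch): Condor-G submission of the fork job
Universe            = globus
Globusscheduler     = lake.esc.cam.ac.uk/jobmanager-fork
Executable          = stage_in.sh
Transfer_Executable = true
Notification        = NEVER
Output              = stagein.out
Error               = stagein.err
Log                 = workflow.log
Queue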
my_condor_submit
• Ideally, would like to see this SRB functionality
absorbed into condor_submit.
• Stork looks neat, BUT:
  1) User needs Condor-, DAGMan- and Stork-specific scripts.
  2) Currently a “limited beta version for RH 7.2” (dynamically linked).
• In the meantime, we’ve provided our own wrapper to
these workflows, called my_condor_submit.
• This takes as its argument an ordinary Condor, or
Condor-G, submit script, but also recognises some
SRB-specific extensions.
• Limitations: the SRB extensions currently can’t make
use of Condor macros, e.g. job.$$(OpSys).
• Currently also developing a job submission portal.
my_condor_submit
# Example submit script for a remote Condor pool
Universe        = globus
Globusscheduler = lake.esc.cam.ac.uk/jobmanager-condor-INTEL-LINUX
Executable      = add.pl
Notification    = NEVER
GlobusRSL       = (condorsubmit=(transfer_files ALWAYS)(universe vanilla)(transfer_input_files A,B))(arguments=A B res)
Sget            = A, B
# Or just “Sget = *”
Sput            = res
# Or just “Sput = *”
Sdir            = test
Sforce          = true
Output          = job.out
Log             = job.log
Error           = job.error
Queue

# To turn into a PBS job replace with:
#
# Globusscheduler = lake.esc.cam.ac.uk/jobmanager-pbs
# GlobusRSL       = (arguments=A B res)(job_type=single)
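In use, the wrapper simply stands in for condor_submit; assuming the script above is saved as job.submit (name illustrative):

my_condor_submit job.submit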
Recursive DAGs
• PCs in the UCL pool can be switched off after a few hours to
make students leave the teaching rooms.
• Can get round this by submitting recursive DAGs that
run for “short” times (~5 hours).
• Hence rescheduled jobs lose at most ~5 hours of
CPU time.
• Possible gotcha: the POST script in the DAG kicks off
the next node, but there will be contention for the DAG log
files. Solve by making the forked process sleep for a few
seconds (see the sketch below).
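A minimal sketch of one way such a recursive DAG might be wired up; the script name and sleep interval are illustrative, not the project’s actual implementation.

# recursive.dag (sketch): one “short” compute node plus a POST script
# that resubmits the whole DAG for the next chunk of work.
JOB run job.submit
SCRIPT POST run resubmit.sh

with resubmit.sh along the lines of:

#!/bin/sh
# Fork the resubmission and sleep briefly so that the finishing DAGMan
# instance releases its log files before the next DAG starts up.
(sleep 10; condor_submit_dag -f recursive.dag) &
exit 0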
Bottom line…
• eMinerals makes extensive use of Condor’s
client tools: condor_submit is the weapon of choice.
• DAGMan has been particularly useful.
• The integration of Condor-G and the SRB with
the job-execution components of the minigrid
has provided the most obvious added value to the
project.
But we’d still like to see…
• Support for SRB functionality in condor_submit
files.
• Ability for users to monitor files produced by
jobs in the vanilla universe: “Is my week-old
simulation still behaving itself?” Something like
condor_fetchlog for users? Can chirp help?
• Is parallel, rather than sequential, flocking feasible?
Download