CamGrid
Mark Calleja
Cambridge eScience Centre
What is it?
• Ten like-minded groups and departments, each running their own Condor pool(s) (12 pools in all), which federate their resources.
• Coordinated by the Cambridge eScience Centre (CeSC), but with no overall central control.
• Has been running for ~2.5 years, with 70+ users.
• Currently ~950 processors/cores available.
• “All” Linux (various distributions), mostly x86_64, running 24/7.
• Mostly Dell PowerEdge 1950s (like the HPCF): four cores with 8 GB RAM.
• Around 2M CPU hours delivered to date.
Some details
• Pools run the latest stable version of Condor
(currently 6.8.6).
• All machines get an (extra) IP address in a CUDN-only routable range for Condor.
• Each pool sets its own policies, but these must
be visible to other users of CamGrid.
• Currently we see vanilla, standard and parallel
(MPI) universe jobs.
• Users get accounts on a machine in their local pool; jobs are then distributed around the grid by Condor using its flocking mechanism (see the config sketch below).
• MPI jobs on single SMP machines have proved
very useful.
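A minimal sketch of how flocking is wired together in condor_config; the hostnames here are hypothetical, not CamGrid’s real machines:

    ## Submit-side config: remote central managers we may flock to
    FLOCK_TO   = cm.chem.cam.example, cm.phy.cam.example
    ## Remote pool's config: submit hosts allowed to flock in
    FLOCK_FROM = submit.escience.cam.example
    ## Each pool's policy stays local but is published in its machine
    ## ClassAds, e.g. only start jobs once the keyboard has been idle:
    START = (KeyboardIdle > 15 * $(MINUTE))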
NTE (negative thermal expansion) of Ag3[Co(CN)6] with SMP/MPI sweep
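The sweep above used MPI on single SMP machines; a hedged sketch of the kind of submit file involved (wrapper and binary names are hypothetical, and the scheduling-group trick assumes execute nodes set ParallelSchedulingGroup = "$(FULL_HOSTNAME)" in their config):

    universe       = parallel
    executable     = lamscript        # site wrapper that invokes mpirun
    arguments      = nte_sweep        # the actual MPI binary
    machine_count  = 4
    ## Ask Condor to place all four slots in one scheduling group,
    ## i.e. on one four-core machine
    +WantParallelSchedulingGroups = True
    queue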
Monitoring Tools
• A number of web-based tools are provided to monitor the state of the grid and of jobs.
• CamGrid is based on trust, so we must make sure that machines are fairly configured.
• The university gave us £450k (~$950k) to buy new hardware; we need to ensure that it’s online as promised (see below).
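Condor’s command-line tools give a quick cross-check of the web pages; two typical invocations (pool hostname hypothetical):

    ## Slot totals for one federated pool
    condor_status -pool cm.chem.cam.example -total
    ## One line per machine: name and current state, x86_64 only
    condor_status -constraint 'Arch == "X86_64"' \
                  -format "%s " Machine -format "%s\n" State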
CamGrid’s file viewer
• The standard universe uses RPCs to echo I/O operations back to the submit host.
• What about other universes? How can I check the health of my long-running simulation?
• We’ve provided our own facility: an agent installed on each execute node, accessed via a web interface.
• Works with vanilla and parallel (MPI) jobs.
• Requires local sysadmins to install and run it.
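For comparison, the standard universe’s remote I/O comes for free once a code is relinked with condor_compile; a minimal sketch (file names hypothetical):

    $ condor_compile gcc -o sim sim.c    # relink against Condor's libraries
    ## Submit file: sim.out grows on the submit host as the job runs
    universe   = standard
    executable = sim
    output     = sim.out
    error      = sim.err
    log        = sim.log
    queue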
Checkpointable vanilla universe
• The standard universe is fine, if you can link to Condor’s libraries (Pete Keller: “getting harder”).
• We’re investigating BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for Linux.
• BLCR uses kernel resources, and can thus restore resources that user-level libraries cannot.
• It’s supported by some flavours of MPI (late LAM, Open MPI).
• The idea was to use Parrot’s user-space filesystem to wrap a vanilla job and save the job’s state on a chirp server.
• However, Parrot currently breaks some BLCR functionality (see the sketch below).
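In outline, the BLCR flow we’re experimenting with looks like this (job and file names hypothetical; the Parrot nesting at the end is the part that currently breaks):

    cr_run ./sim input.dat &       # launch under BLCR's preloaded library
    cr_checkpoint -f sim.ckpt $!   # write a checkpoint of that PID to a file
    cr_restart sim.ckpt            # resume later, even after a reboot
    ## Intended combination, with Parrot redirecting I/O to a chirp server:
    parrot_run cr_run ./sim input.dat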
What doesn’t work so well…
• Each pool is run by local sysadmin(s), but these are of
variable quality/commitment.
• We’ve set up mailing lists for users and sysadmins:
hardly ever used (don’t want to advertise ignorance?).
• Some pools have used SRIF hardware to redeploy
machines committed earlier. Naughty…
• Don’t get me started on merger with UCS’s central
resource (~400 nodes).
But generally we’re happy bunnies
• “CamGrid was an invaluable tool allowing us to reliably sample the
large parameter space in a reasonable amount of time. A half-year's
worth of CPU running was collected in a week."
-- Dr. Ben Allanach
• “CamGrid was essential in order for us to be able to run the different
codes in real time.”
-- Prof. Fernando Quevedo
• “I needed to run simulations that took a couple of weeks each. Without
access to the processors on CamGrid, it would have taken a couple of
years to get enough results for a publication.”
-- Dr. Karen Lipkow
Current issues
• Protecting resources on execute nodes: Condor seems lax at this, e.g. memory and disk space (see the policy sketch below).
• Increasingly interested in VMs (i.e. Xen). Some pools run it, but not in a concerted way (effects on SMP MPI jobs?).
• Green issues: will we be forced to buy WoL (Wake-on-LAN) cards in the near future?
• Altruistic computing: a recent wave of interest in BOINC/backfill jobs for medical research, protein folding, etc., but who runs the jobs? Audit trail?
• How do we interact with outsiders? Ideally keep it to Condor (some Globus; we’ve toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.
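A hedged condor_config sketch of the sort of per-node guard we’d like (thresholds illustrative; ImageSize and DiskUsage are job attributes in KiB, Memory is the slot’s RAM in MiB, Disk its scratch space in KiB):

    ## Evict a job whose memory image outgrows the slot's RAM
    MEMORY_EXCEEDED = ((ImageSize / 1024) > Memory)
    ## Likewise if it fills the slot's scratch disk
    DISK_EXCEEDED   = (DiskUsage > Disk)
    PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED)) || ($(DISK_EXCEEDED))
    WANT_SUSPEND = False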
Finally…
• CamGrid:
http://www.escience.cam.ac.uk/projects/camgrid/
• Contact:
mc321@cam.ac.uk
Questions?