CamGrid
Mark Calleja
Cambridge eScience Centre

What is it?
• A number of like-minded groups and departments (10), each running their own Condor pool(s), which federate their resources (12 pools in all).
• Coordinated by the Cambridge eScience Centre (CeSC), but with no overall control.
• Been running for ~2.5 years now, with 70+ users.
• Currently ~950 processors/cores available.
• “All” Linux (various distributions), mostly x86_64, running 24/7.
• Mostly Dell PowerEdge 1950 machines (like the HPCF): four cores with 8 GB RAM.
• Around 2M CPU hours delivered to date.

Some details
• Pools run the latest stable version of Condor (currently 6.8.6).
• All machines get an (extra) IP address in a CUDN-only (Cambridge University Data Network) routable range for Condor.
• Each pool sets its own policies, but these must be visible to other users of CamGrid.
• Currently we see vanilla, standard and parallel (MPI) universe jobs.
• Users get accounts on a machine in their local pool; jobs are then distributed around the grid by Condor's flocking mechanism (see the configuration sketch below).
• MPI jobs on single SMP machines have proved very useful (a submit-file sketch also follows below).

NTE of Ag3[Co(CN)6] with SMP/MPI sweep
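As a rough illustration of the flocking mechanism mentioned under "Some details", the sketch below shows the kind of Condor configuration involved. The hostnames are hypothetical placeholders, not CamGrid's actual machines, and real pools would also need matching security settings on their execute nodes.

    ## Flocking sketch between two hypothetical pools (not CamGrid's real hosts).
    # On a submit host in pool A: central managers of remote pools to try
    # when the local pool cannot run a queued job.
    FLOCK_TO = central-manager.pool-b.cam.ac.uk

    # On pool B's central manager: submit machines allowed to flock jobs
    # into this pool, plus write access so their schedd can claim resources.
    FLOCK_FROM = submit-host.pool-a.cam.ac.uk
    HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), submit-host.pool-a.cam.ac.uk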
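For the MPI-on-a-single-SMP-machine jobs noted above, a submit description file along the following lines is one plausible shape. It is a hedged sketch rather than CamGrid's actual recipe: run_mpi.sh is a made-up wrapper that would call mpirun across the local cores of whatever machine the job lands on, and whether the whole node is free for the job is a matter of local pool policy.

    # Vanilla universe sketch: run a small MPI job on one multi-core node.
    # run_mpi.sh is a hypothetical wrapper, e.g. "mpirun -np 4 ./simulation".
    universe     = vanilla
    executable   = run_mpi.sh
    output       = sim.$(Cluster).out
    error        = sim.$(Cluster).err
    log          = sim.$(Cluster).log
    requirements = (Arch == "X86_64") && (OpSys == "LINUX")
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = simulation
    queue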
Monitoring Tools
• A number of web-based tools are provided to monitor the state of the grid and of jobs.
• CamGrid is based on trust, so we must make sure that machines are fairly configured.
• The university gave us £450k (~$950k) to buy new hardware; we need to ensure that it's online as promised.

CamGrid's file viewer
• The standard universe uses RPCs to echo I/O operations back to the submit host.
• What about the other universes? How can I check the health of my long-running simulation?
• We've provided our own facility: an agent installed on each execute node and accessed via a web interface.
• Works with vanilla and parallel (MPI) jobs.
• Requires local sysadmins to install and run it.

Checkpointable vanilla universe
• The standard universe is fine, if you can link against Condor's libraries (Pete Keller: "getting harder").
• Investigating the BLCR (Berkeley Lab Checkpoint/Restart) kernel modules for Linux.
• BLCR works at the kernel level, and can thus restore resources that user-level libraries cannot.
• Supported by some flavours of MPI (late versions of LAM, Open MPI).
• The idea was to use Parrot's user-space filesystem to wrap a vanilla job and save the job's state on a chirp server.
• However, Parrot currently breaks some BLCR functionality.

What doesn't work so well…
• Each pool is run by local sysadmin(s), but these are of variable quality/commitment.
• We've set up mailing lists for users and sysadmins: hardly ever used (don't want to advertise ignorance?).
• Some pools have used SRIF hardware to redeploy machines committed earlier. Naughty…
• Don't get me started on the merger with UCS's central resource (~400 nodes).

But generally we're happy bunnies
• “CamGrid was an invaluable tool allowing us to reliably sample the large parameter space in a reasonable amount of time. A half-year's worth of CPU running was collected in a week.” -- Dr. Ben Allanach
• “CamGrid was essential in order for us to be able to run the different codes in real time.” -- Prof. Fernando Quevedo
• “I needed to run simulations that took a couple of weeks each. Without access to the processors on CamGrid, it would have taken a couple of years to get enough results for a publication.” -- Dr. Karen Lipkow

Current issues
• Protecting resources on execute nodes: Condor seems lax about this, e.g. memory and disk space.
• Increasingly interested in VMs (i.e. Xen). Some pools run it, but not in a concerted way (what are the effects on SMP MPI jobs?).
• Green issues: will we be forced to buy WoL (Wake-on-LAN) cards in the near future?
• Altruistic computing: a recent wave of interest in BOINC/backfill jobs for medical research, protein folding, etc., but who runs the jobs? What about the audit trail?
• How do we interact with outsiders? Ideally we keep it to Condor (some Globus; we've toyed with VPNs). Most CamGrid stakeholders just dish out conventional, ssh-accessible accounts.

Finally…
• CamGrid: http://www.escience.cam.ac.uk/projects/camgrid/
• Contact: mc321@cam.ac.uk

Questions?