US CMS Testbed
Alan De Smet
Computer Sciences Department
University of Wisconsin-Madison
adesmet@cs.wisc.edu
http://www.cs.wisc.edu/condor
Large Hadron Collider
› Supercollider on the French-Swiss border
› Under construction, completion expected in 2006
(Based on a slide by Scott Koranda at NCSA)
Compact Muon Solenoid
› Detector/experiment for the LHC
› Search for the Higgs boson and other fundamental forces
Still Under Development
› Developing software to process the enormous amount of data generated
› For testing and prototyping, the detector is being simulated now
Simulating events (particle collisions)
› We’re involved in the United States portion of the effort
Storage and Computational Requirements
› Simulating and reconstructing millions of events per year, in batches of around 150,000 (about 10 CPU-months per batch)
› Each event requires about 3 minutes of processor time
150,000 events × 3 CPU-minutes ≈ 7,500 CPU-hours ≈ 10 CPU-months
› A single run generates about 300 GB of data
Before Condor-G and Globus
› Runs are hand-assigned to individual sites
Manpower-intensive to organize run distribution and collect results
› Each site has staff managing its runs
Manpower-intensive to monitor jobs, CPU availability, disk space, etc.
Before Condor-G and Globus
› Use an existing tool (MCRunJob) to manage tasks
Not “Grid-Aware”
Expects a reliable batch system
UW High Energy Physics: A special case
› Was a site being assigned runs
› Modified its local configuration to flock to the UW Computer Science Condor pool (see sketch below)
When possible, used the standard universe to increase the number of available computers
During one week, used 30,000 CPU hours
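As a rough sketch of what that flocking setup involves (the pool name and job details below are hypothetical placeholders, not the actual UW configuration):

  # Condor configuration on the HEP submit machines:
  # let idle jobs flock to the Computer Science pool
  # (the CS pool must also list this pool in its FLOCK_FROM)
  FLOCK_TO = condor.cs.wisc.edu

  # Submit description using the standard universe, which adds
  # checkpointing (the executable must be relinked with condor_compile)
  universe   = standard
  executable = cmsim          # placeholder name
  log        = cmsim.log
  queue

The standard universe matters here because checkpointing lets jobs safely use opportunistic machines in the other pool and migrate off them when they are reclaimed.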
Our Goal
› Move the work onto “the Grid” using Globus and Condor-G
Why the Grid?
› Centralize management of simulation work
› Reduce manpower at individual sites
Why Condor-G?
› Monitors and manages tasks
› Reliability in an unreliable world
Lessons Learned
› The grid will fail
› Design for recovery
The Grid Will Fail
› The grid is complex
› The grid is new and untested
Often beta, alpha, or prototype.
› The public Internet is out of your control
› Remote sites are out of your control
The Grid is Complex
› Our system has 16 layers
› A minimal Globus/Condor-G system has 9 layers
Most layers are stable and transparent
› MCRunJob > Impala > MOP > condor_schedd > DAGMan > condor_schedd > condor_gridmanager > gahp_server > globus-gatekeeper > globus-job-manager > globus-job-manager-script.pl > local batch system submit > local batch system execute > MOP wrapper > Impala wrapper > actual job
Design for Recovery
› Provide recovery at multiple levels to minimize lost work
› Be able to start a particular task over from scratch if necessary
› Never assume that a particular step will succeed
› Allocate lots of debugging time
Now
› A single master site sends jobs to distributed worker sites
› Individual sites provide a configured Globus node and batch system
› 300+ CPUs across a dozen sites
› Condor-G acts as a reliable batch system and Grid front end (see the submit sketch below)
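To make the Condor-G front end concrete, here is a minimal sketch of a globus-universe submit description of the era; the gatekeeper host, jobmanager, and file names are hypothetical placeholders, not the testbed’s actual settings:

  # Condor-G submit description (hypothetical names)
  universe        = globus
  # remote Globus gatekeeper; jobmanager-condor hands the job
  # to that site's Condor batch system
  globusscheduler = gatekeeper.worker-site.example.edu/jobmanager-condor
  executable      = mop_wrapper.sh
  output          = job.out
  error           = job.err
  log             = job.log
  queue

Condor-G keeps the job in its local queue, so the usual condor_q and condor_rm tools work even though the job runs at a remote site.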
How? MOP.
› Monte Carlo Distributed Production System
› Pretends to be the local batch system for MCRunJob
› Repackages jobs to run on a remote site
CMS Testbed Big Picture
[Diagram: the master site runs MCRunJob, MOP, DAGMan, and Condor-G; jobs flow through Globus to each worker site, where Condor runs the real work]
DAGMan, Condor-G, Globus, Condor
› DAGMan - Manages dependencies between tasks (see the DAG sketch below)
› Condor-G - Monitors the job on the master site
› Globus - Sends jobs to the remote site
› Condor - Manages jobs and computers at the remote site
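As a sketch of how those dependencies are expressed, a DAGMan input file names each task’s Condor submit file and the ordering between them; the node and file names here are hypothetical, not the actual production workflow:

  # run.dag -- hypothetical node and submit-file names
  JOB  stage_in   stage_in.sub
  JOB  simulate   simulate.sub
  JOB  stage_out  stage_out.sub
  PARENT stage_in  CHILD simulate
  PARENT simulate  CHILD stage_out

The DAG is submitted with condor_submit_dag run.dag, and DAGMan hands each node to Condor-G only after its parents have succeeded.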
Recovery: Condor
› Automatically recovers from machine and network problems on the execute cluster
Recovery: Condor-G
› Automatically monitors for and retries a number of possibly transient errors (see the sketch below)
› Recovers from a down master site, down worker sites, and a down network
› After a network outage, can reconnect to still-running jobs
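When an error persists, Condor-G typically puts the affected job on hold rather than discarding it; a sketch of the operator workflow using standard Condor command-line tools (the job ID is a placeholder):

  # list held jobs along with the reason each was held
  condor_q -hold
  # after fixing the underlying problem, release the job so it is retried
  condor_release 1234.0
  # or remove it if it should not run again
  condor_rm 1234.0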
Recovery: DAGMan
› If a particular task fails permanently, DAGMan notes it and allows easy retry
It can retry automatically; we don’t (see the sketch below)
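A sketch of both options, using the hypothetical DAG from earlier (the rescue file name is illustrative; DAGMan writes it automatically after a failed run, recording which nodes already finished):

  # automatic retries of a flaky node, if desired (we do not use this):
  RETRY simulate 3

  # manual retry: resubmit the rescue DAG so only the failed work reruns
  condor_submit_dag run.dag.rescue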
Globus
› Globus software is under rapid development
Use old software and miss important updates
Use new software and deal with version incompatibilities
Fall of 2002: First Test
› Our first run gave us two weeks to do about 10 days of work (given the CPUs available at the time)
› We had problems
A power outage (several hours), network outages (up to eleven hours), worker site failures, full disks, Globus failures
It Worked!
› The system recovered automatically from many problems
› Relatively low human intervention
Approximately one full-time person
Since Then
› Improved automatic recovery for more situations
› Generated 1.5 million events (about 30 CPU-years) in just a few months
› Currently gearing up for even larger runs starting this summer
Future Work
› Expanding the grid with more machines
› Use Condor-G’s scheduling capabilities to automatically assign jobs to sites (see the sketch after this list)
› Officially replace the previous system this summer
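One way Condor-G supports that kind of automatic assignment is ClassAd matchmaking: each site advertises its gatekeeper in a resource ad, and the submit description matches against those ads. A minimal sketch under that assumption; the attribute name gatekeeper_url and the requirements expression are illustrative, not the testbed’s actual configuration:

  # Condor-G submit description using matchmaking (illustrative attribute name)
  universe        = globus
  # $$() substitutes the value from whichever resource ad this job matches
  globusscheduler = $$(gatekeeper_url)
  requirements    = TARGET.gatekeeper_url =!= UNDEFINED
  executable      = mop_wrapper.sh
  log             = job.log
  queue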
Thank You!
› http://www.cs.wisc.edu/condor
› adesmet@cs.wisc.edu