US CMS Testbed
Alan De Smet
Computer Sciences Department
University of Wisconsin-Madison
adesmet@cs.wisc.edu
http://www.cs.wisc.edu/condor
Large Hadron Collider
› Supercollider on the French-Swiss border
› Under construction, completion expected in 2006
(Based on a slide by Scott Koranda at NCSA)
Compact Muon Solenoid
› Detector/experiment for the LHC
› Search for the Higgs boson and other fundamental forces
Still Under Development
› Developing software to process the enormous amount of data generated
› For testing and prototyping, the detector is being simulated now
Simulating events (particle collisions)
› We’re involved in the United States portion of the effort
Storage and Computational Requirements
› Simulating and reconstructing millions of events per year, in batches of around 150,000 (about 10 CPU-months per batch)
› Each event requires about 3 minutes of processor time
150,000 events × 3 CPU-minutes ≈ 7,500 CPU-hours ≈ 10 CPU-months
› A single run generates about 300 GB of data
Before Condor-G and Globus
› Runs are hand-assigned to individual sites
Manpower-intensive to organize run distribution and collect results
› Each site has staff managing its runs
Manpower-intensive to monitor jobs, CPU availability, disk space, etc.
Before Condor-G and Globus
› Use an existing tool (MCRunJob) to manage tasks
Not “Grid-Aware”
Expects a reliable batch system
UW High Energy Physics: A special case
› Was a site being assigned runs
› Modified its local configuration to flock to the UW Computer Science Condor pool (see sketch below)
When possible, used the standard universe to increase the number of available computers
During one week, used 30,000 CPU hours
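As a rough sketch of what that flocking setup involves (the pool name and job details below are hypothetical placeholders, not the actual UW configuration):

  # Condor configuration on the HEP submit machines:
  # let idle jobs flock to the Computer Science pool
  # (the CS pool must also list this pool in its FLOCK_FROM)
  FLOCK_TO = condor.cs.wisc.edu

  # Submit description using the standard universe, which adds
  # checkpointing (the executable must be relinked with condor_compile)
  universe   = standard
  executable = cmsim          # placeholder name
  log        = cmsim.log
  queue

The standard universe matters here because checkpointing lets jobs safely use opportunistic machines in the other pool and migrate off them when they are reclaimed.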
Our Goal
› Move the work onto “the Grid” using Globus and Condor-G
Why the Grid?
› Centralize management of simulation work
› Reduce manpower at individual sites
Why Condor-G?
› Monitors and manages tasks
› Reliability in an unreliable world
Lessons Learned
› The grid will fail
› Design for recovery
The Grid Will Fail
› The grid is complex
› The grid is new and untested
Often beta, alpha, or prototype.
› The public Internet is out of your control
› Remote sites are out of your control
The Grid is Complex
› Our system has 16 layers
› A minimal Globus/Condor-G system has 9 layers
Most layers are stable and transparent
› MCRunJob > Impala > MOP > condor_schedd > DAGMan > condor_schedd > condor_gridmanager > gahp_server > globus-gatekeeper > globus-job-manager > globus-job-manager-script.pl > local batch system submit > local batch system execute > MOP wrapper > Impala wrapper > actual job
Design for Recovery
› Provide recovery at multiple levels to minimize lost work
› Be able to start a particular task over from scratch if necessary
› Never assume that a particular step will succeed
› Allocate lots of debugging time
Now
› A single master site sends jobs to distributed worker sites
› Individual sites provide a configured Globus node and batch system
› 300+ CPUs across a dozen sites
› Condor-G acts as a reliable batch system and Grid front end (see the submit sketch below)
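To make the Condor-G front end concrete, here is a minimal sketch of a globus-universe submit description of the era; the gatekeeper host, jobmanager, and file names are hypothetical placeholders, not the testbed’s actual settings:

  # Condor-G submit description (hypothetical names)
  universe        = globus
  # remote Globus gatekeeper; jobmanager-condor hands the job
  # to that site's Condor batch system
  globusscheduler = gatekeeper.worker-site.example.edu/jobmanager-condor
  executable      = mop_wrapper.sh
  output          = job.out
  error           = job.err
  log             = job.log
  queue

Condor-G keeps the job in its local queue, so the usual condor_q and condor_rm tools work even though the job runs at a remote site.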
How? MOP.
› Monte Carlo Distributed Production System
› Pretends to be the local batch system for MCRunJob
› Repackages jobs to run on a remote site
CMS Testbed Big Picture
[Diagram: the master site runs MCRunJob, MOP, DAGMan, and Condor-G; jobs flow through Globus to each worker site, where Condor runs the real work]
DAGMan, Condor-G, Globus, Condor
› DAGMan - Manages dependencies between tasks (see the DAG sketch below)
› Condor-G - Monitors the job on the master site
› Globus - Sends jobs to the remote site
› Condor - Manages jobs and computers at the remote site
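As a sketch of how those dependencies are expressed, a DAGMan input file names each task’s Condor submit file and the ordering between them; the node and file names here are hypothetical, not the actual production workflow:

  # run.dag -- hypothetical node and submit-file names
  JOB  stage_in   stage_in.sub
  JOB  simulate   simulate.sub
  JOB  stage_out  stage_out.sub
  PARENT stage_in  CHILD simulate
  PARENT simulate  CHILD stage_out

The DAG is submitted with condor_submit_dag run.dag, and DAGMan hands each node to Condor-G only after its parents have succeeded.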
Recovery: Condor
› Automatically recovers from machine and network problems on the execute cluster
Recovery: Condor-G
› Automatically monitors for and retries a number of possibly transient errors (see the sketch below)
› Recovers from a down master site, down worker sites, and a down network
› After a network outage, can reconnect to still-running jobs
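When an error persists, Condor-G typically puts the affected job on hold rather than discarding it; a sketch of the operator workflow using standard Condor command-line tools (the job ID is a placeholder):

  # list held jobs along with the reason each was held
  condor_q -hold
  # after fixing the underlying problem, release the job so it is retried
  condor_release 1234.0
  # or remove it if it should not run again
  condor_rm 1234.0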
Recovery: DAGMan
› If a particular task fails permanently, DAGMan notes it and allows easy retry
It can retry automatically; we don’t (see the sketch below)
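A sketch of both options, using the hypothetical DAG from earlier (the rescue file name is illustrative; DAGMan writes it automatically after a failed run, recording which nodes already finished):

  # automatic retries of a flaky node, if desired (we do not use this):
  RETRY simulate 3

  # manual retry: resubmit the rescue DAG so only the failed work reruns
  condor_submit_dag run.dag.rescue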
Globus
› Globus software is under rapid development
Use old software and miss important updates
Use new software and deal with version incompatibilities
Fall of 2002: First Test
› Our first run gave us two weeks to do about 10 days of work (given the CPUs available at the time)
› We had problems
A power outage (several hours), network outages (up to eleven hours), worker site failures, full disks, Globus failures
It Worked!
› The system recovered automatically from many problems
› Relatively low human intervention
Approximately one full-time person
Since Then
› Improved automatic recovery for more situations
› Generated 1.5 million events (about 30 CPU-years) in just a few months
› Currently gearing up for even larger runs starting this summer
Future Work
› Expanding the grid with more machines
› Use Condor-G’s scheduling capabilities to automatically assign jobs to sites (see the sketch after this list)
› Officially replace the previous system this summer
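One way Condor-G supports that kind of automatic assignment is ClassAd matchmaking: each site advertises its gatekeeper in a resource ad, and the submit description matches against those ads. A minimal sketch under that assumption; the attribute name gatekeeper_url and the requirements expression are illustrative, not the testbed’s actual configuration:

  # Condor-G submit description using matchmaking (illustrative attribute name)
  universe        = globus
  # $$() substitutes the value from whichever resource ad this job matches
  globusscheduler = $$(gatekeeper_url)
  requirements    = TARGET.gatekeeper_url =!= UNDEFINED
  executable      = mop_wrapper.sh
  log             = job.log
  queue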
Thank You!
› http://www.cs.wisc.edu/condor
› adesmet@cs.wisc.edu