US CMS Testbed
A Grid Computing Case Study
Alan De Smet
Condor Project
University of Wisconsin at Madison
adesmet@cs.wisc.edu
http://www.cs.wisc.edu/~adesmet/
Trust No One
• The grid will fail
• Design for recovery
The Grid Will Fail
• The grid is complex
• The grid is relatively new and untested
– Much of it is best described as prototypes or
alpha versions
• The public Internet is out of your
control
• Remote sites are out of your control
Design for Recovery
• Provide recovery at multiple levels to
minimize lost work
• Be able to start a particular task over
from scratch if necessary
• Never assume that a particular step will
succeed
• Allocate lots of debugging time
Some Background
Compact Muon Solenoid
Detector
• The Compact Muon
Solenoid (CMS)
detector at the Large
Hadron Collider will
probe fundamental
forces in our Universe
and search for the
yet-undetected Higgs
Boson.
(Based on slide by Scott Koranda at NCSA)
Compact Muon Solenoid
(Based on slide by Scott Koranda at NCSA)
CMS - Now and the Future
• The CMS detector is expected to come
online in 2006
• Software to analyze the enormous
amount of data from the detector is
being developed now.
• For testing and prototyping, the
detector is being simulated now.
What We’re Doing Now
• Our runs are divided into two phases
– Monte Carlo detector response simulation
– Physics reconstruction
• The testbed currently only does
simulation, but is moving toward
reconstruction.
Storage and Computational
Requirements
• Simulating and reconstructing millions
of events per year
• Each event requires about 3 minutes of
processor time
• Events are generally processed in runs of
about 150,000 events
• The simulation step of a single run will
generate about 150 GB of data
– Reconstruction has similar requirements
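• At about 3 minutes per event, a single run of 150,000 events works out to roughly 450,000 CPU-minutes, or about 7,500 CPU-hours of simulation time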
Existing CMS Production
• Runs are assigned to individual sites
• Each site has staff managing their runs
– Manpower intensive to monitor jobs, CPU
availability, disk space
• Each local site uses Impala (old way) or
MCRunJob (new way) to manage jobs
running on its local batch system.
Testbed CMS Production
• What I work on
• Designed to allow a single master site
to manage jobs scattered to many
worker sites
CMS Testbed Workers
Site                                    CPUs
University of Wisconsin - Madison          5
Fermi National Accelerator Laboratory     12
California Institute of Technology         8
University of Florida                     42
University of California – San Diego       3
As we move from testbed to full production,
we will add more sites and hundreds of
CPUs.
CMS Testbed Big Picture
[Diagram: on the master site, Impala hands jobs to MOP, which submits them through DAGMan and Condor-G; Globus delivers them to Condor at each worker site, where the real work runs.]
Impala
• Tool used in current production
• Assembles jobs to be run
• Sends jobs out
• Collects results
• Minimal recovery mechanism
• Expects to hand jobs off to a local batch system
– Assumes local file system
MOP
• Monte Carlo Distributed Production
System
– It could have been MonteDistPro (as in
The Count of…)
• Pretends to be a local batch system for
Impala
• Repackages jobs to run on a remote
site
MOP Repackaging
• Impala hands MOP a list of input files,
output files, and a script to run.
• Binds site-specific information to the script
– Path to binaries, location of scratch space,
staging location, etc.
– Impala is given locations like
_path_to_gdmp_dir_, which MOP rewrites
(see the sketch below)
• Breaks jobs into five-step DAGs
• Hands jobs off to DAGMan/Condor-G
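As a rough illustration of the placeholder rewriting (this is not MOP's actual code; the second placeholder and the site paths are made up):

    # Hypothetical sketch of MOP-style rewriting of an Impala script
    sed -e 's|_path_to_gdmp_dir_|/opt/gdmp|g' \
        -e 's|_scratch_dir_|/scratch/cms|g' \
        impala_job.sh > site_specific_job.sh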
MOP Job Stages
• Stage-in - Move input data and program to remote site
• Run - Execute the program
• Stage-out - Retrieve program logs
• Publish - Retrieve program output
• Cleanup - Delete files
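A minimal sketch of one group's five-step DAG, assuming a simple linear chain and made-up submit file names (the DAGs MOP actually generates may differ):

    JOB stagein   stagein.submit
    JOB run       run.submit
    JOB stageout  stageout.submit
    JOB publish   publish.submit
    JOB cleanup   cleanup.submit
    PARENT stagein  CHILD run
    PARENT run      CHILD stageout
    PARENT stageout CHILD publish
    PARENT publish  CHILD cleanup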
MOP Job Stages
Combined DAG
• A MOP “run” collects
multiple groups into a
single DAG which is
submitted to DAGMan
DAGMan, Condor-G, Globus,
Condor
• DAGMan - Manages dependencies
• Condor-G - Monitors the job on master
site
• Globus - Sends jobs to remote site
• Condor - Manages job and computers
at remote site
Typical Network
Configuration
[Diagram: the MOP Master machine reaches each worker site's head node over the public Internet; the compute nodes sit behind the head node on a private network.]
Network Configuration
• Some sites make compute nodes visible
to the public Internet, but many do not.
– Private networks will scale better as sites
add dozens or hundreds of machines
– As a result, any stage handling data
transfer to or from the MOP Master must
run on the head node; no other node can
address the MOP Master.
• This is a scalability issue. We haven’t hit
the limit yet.
When Things Go Wrong
• How recovery is handled
Recovery - DAGMan
• Remembers current status
– When restarted, determines current
progress and continues.
• Notes failed jobs for resubmission
– Can automatically retry, but we don’t
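If we did want automatic retries, DAGMan's RETRY directive would do it; a sketch with a hypothetical node name:

    # Resubmit the node up to 3 times before declaring it failed
    JOB run_042 run_042.submit
    RETRY run_042 3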
Recovery - Condor-G
• Remembers current status
– When restarted, reconnects jobs to remote
sites and updates status
– Also runs DAGMan; when Condor-G is
restarted, it restarts DAGMan
• Retries in certain failure cases
• Holds jobs in other failure cases
Recovery - Condor
• Remembers current status
• Running on remote site
• Recovers job state and restarts jobs on
machine failure
globus-url-copy
• Used for file transfer
• Client process can hang under some
circumstances
• Wrapped in a shell script that gives the
transfer a maximum duration. If the run
exceeds the duration, the job is killed and
restarted.
• The script is written in ftsh, Doug Thain's
Fault Tolerant Shell.
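The real wrapper is written in ftsh; as a plain /bin/sh sketch of the same idea (the two-hour limit and the argument handling are illustrative, not the testbed's actual values):

    #!/bin/sh
    # Run globus-url-copy with a hard time limit; a nonzero exit lets the
    # surrounding machinery treat the transfer as failed and resubmit it.
    LIMIT=7200                          # maximum transfer time, in seconds

    globus-url-copy "$1" "$2" &         # source and destination URLs
    copy=$!
    ( sleep "$LIMIT"; kill -9 "$copy" ) 2>/dev/null &
    watchdog=$!

    wait "$copy"; status=$?
    kill "$watchdog" 2>/dev/null
    exit "$status"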
Human Involvement in
Failure Recovery
• Condor-G places some problem jobs on
hold
– By placing them on hold, we prevent the
jobs from failing and provide an opportunity
to recover.
• Usually Globus problems: expired
certificates, jobmanager
misconfiguration, bugs in the
jobmanager
Human Involvement in
Failure Recovery
• A human diagnoses the jobs placed on
hold
– Is the problem transient? condor_release the job.
– Otherwise fix the problem, then release the
job.
– Can the problem not be fixed? Reset the
GlobusContactString and release the job,
forcing it to restart.
• condor_qedit <clusterid>
GlobusContactString X
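A sketch of the operator's session (the job IDs are made up):

    condor_q -hold                             # list held jobs and their hold reasons
    condor_release 1234.0                      # transient problem: just release
    condor_qedit 1234 GlobusContactString X    # unfixable: force a fresh submission
    condor_release 1234.0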
Human Involvement in
Failure Recovery
• Sometimes tasks themselves fail
• A variety of problems, typically
external: disk full, network outage
– DAGMan notes the failure. When every node
that can still run has finished or failed, a
rescue DAG file is generated.
– Submitting this rescue DAG will retry all
failed nodes.
Doing Real Work
CMS Production Job 1828
• The US CMS Testbed was asked to help with
real CMS production
• Given 150,000 events to simulate in two
weeks.
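• For scale: 150,000 events at about 3 CPU-minutes each is roughly 450,000 CPU-minutes; finishing in two weeks means keeping roughly 22 CPUs busy around the clock, well within the testbed's ~70 CPUs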
What Went Wrong
• Power outage
• Network outages
• Worker site failures
• Globus failures
• DAGMan failure
• Unsolved mysteries
Power Outage
• A power outage at the UW took out the
master site and the UW worker site for
several hours
• During the outage, the other worker sites
continued running assigned tasks, but as
they exhausted their queues we could not
send additional tasks
• File transfers sending data back failed
• System recovered well
Network Outages
• Several outages, most less than an
hour, one for eleven hours
• Worker sites continued running
assigned tasks
• Master site was unable to report status
until network was restored
• File transfers failed
• System recovered well
Worker Site Failures
• One site had a configuration change go
bad, causing the Condor jobs to fail
– Condor-G placed problem tasks on hold.
When the situation was resolved, we
released the jobs and they succeeded.
• Another site was incompletely upgraded
during the run.
– Jobs were held, released when fixed.
Worker Site Failure / Globus
Failure
• At one site, Condor jobs were removed
from the pool using condor_rm,
probably by accident
• The previous Globus interface to Condor
wasn’t prepared for that possibility and
erroneously reported the job as still
running
– Fixed in newest Globus
• Job’s contact string was reset.
Globus Failures
• globus-job-manager would sometimes
stop checking the status of a job,
reporting the last status forever
• When a job was taking unusually long,
this was usually the problem
• Killing the globus-job-manager caused
a new one to be started, solving the
problem
– Has to be done on the remote site
• (Or via globus-job-run)
Globus Failures
• globus-job-manager would sometimes
corrupt state files
• The Wisconsin team debugged the problem
and distributed a patched program
• Failed jobs had their
GlobusContactStrings reset.
Globus Failures
• Some globus-job-managers would
report problems accessing input files
– The reason has not been diagnosed.
• Affected jobs had their
GlobusContactStrings reset.
DAGMan failure
• In one instance a DAGMan managing 50
groups of jobs crashed.
• The DAG file was tweaked by hand to
mark completed jobs as such and
resubmitted
– Finished jobs in a DAG simply have DONE
added to the end of their entry
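The hand edit for a finished node looks like this (hypothetical node name):

    # Before:
    JOB stagein_017 stagein_017.submit
    # After the hand edit, DAGMan treats the node as already complete:
    JOB stagein_017 stagein_017.submit DONE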
Problems Previously
Encountered
• We’ve been doing test runs for ~9
months. We’ve encountered and
resolved many other issues.
• Consider building your own copy of the
Globus tools out of CVS to stay on top
of bugfixes.
• Monitor http://bugzilla.globus.org/
and the Globus mailing lists.
The Future
Future Improvements
• Currently our run stage runs as a
vanilla universe Condor job on the
worker site. If there is a problem, the
job must be restarted from scratch.
Switching to the standard universe
would allow jobs to recover and
continue aborted runs.
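A sketch of the change in the Condor submit description (the executable name is made up, and the standard universe also requires relinking the program with condor_compile):

    # universe = vanilla               # today: no checkpointing, restart from scratch
    universe   = standard              # checkpoint/restart, so interrupted runs continue
    executable = cmsim.condor_linked   # program relinked with condor_compile
    queue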
Future Improvements
• Data transfer jobs are run as Globus
fork jobs. They are completely
unmanaged on the remote site. If the
remote site has an outage, there is no
information on the jobs.
– Running these under Condor (Scheduler
universe) would ensure that status was not
lost.
– Also looking at using the DaP Scheduler
Future Improvements
• Jobs are assigned to specific sites by an
operator
• Once assigned, changing the assigned
site is nearly impossible
• Working to support “grid scheduling”:
automatic assignment of jobs to sites
and changing site assignment