Condor in High Energy Physics
David Colling
Imperial College London/GridPP
Outline
• What characterises HEP experiments
Focus on two experiments, one from the current
generation (DØ) and one from the next
generation (CMS).
• How HEP experiments use Condor
Describe (some of) the ways that these
experiments use Condor.
• Will not be talking about EDG/LCG/EGEE
These do use Condor matchmaking, Condor-G,
DAGMan etc., but Francesco Prelz will be talking
about this on Friday
Thanks are due to….
From CMS
Dan Bradley, Claudio Grandi, Craig Prescott
From SAM-Grid
Gabriele Garzoglio, Igor Terekhov, Gavin
Davies, Wyatt Merritt
I have used many of their slides
Characteristics of a High Energy
Physics Experiment:
• Expensive… in fact so expensive that no
individual country can afford them
• Large international collaborations, with
collaborators spread around the world
• The experiments take many years to design
/build and then take data for many years … so
technology changes during the lifetime of an
experiment
The DØ detector
Data size for the D0 Experiment
• Detector Data
– 1,000,000 channels
– Event size 250 KB
– Event rate ~50 Hz
– On-line data rate 12 MB/s
– 100 TB/year
• Total data
– detector, reconstructed,
simulated
– 400 TB/year
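(For reference, the on-line rate is simply the event size times the event rate: 250 KB/event × ~50 Hz ≈ 12.5 MB/s, consistent with the ~12 MB/s quoted above.)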
A Truly International Collaboration
Totals:
– 18 countries
Europe, Asia, and
North, Central and
South America
– 73 institutions & labs
33 US
40 non-US
– 646 physicists
334 US
312 non-US
All these sites have (at least some) computational resources
CMS … the next generation
On the LHC being built at
CERN near Geneva
Due to start taking data in
2007… eventually ~10PB/year
Each event ~1MB
Although the experiment is not due to start until 2007, there
is a very large simulation requirement now.
So what do these experiments need to do…
Data Analysis
• All data has to be reconstructed before it can be
used
• This reconstruction is re-done as the
understanding of the detector improves
• The reconstructed events are then used by individual
physicists
Monte Carlo Simulation
• HEP experiments rely heavily on MC simulation, first in
designing the detector and then in
understanding the data that it produces. Typically as
much MC is produced as real data
• MC data is generally treated (reconstructed &
analysed) in exactly the same way as real data
How do these experiments use Condor?
- As a batch system
Both for large scale clusters and (occasionally) for
cycle stealing
- Through use of Condor-G to submit to “Grid” resources
This has sometimes involved working closely with
the Condor team who have modified Condor-G to
satisfy the needs of these communities
The SAM-Grid project—the D0 solution
• Strategy: enhance the distributed data handling system of
the experiments (SAM), incorporating standard Grid tools
and protocols, and developing new solutions for Grid
computing (JIM)
• History: SAM from 1997, JIM from end of 2001
• Funds: the Particle Physics Data Grid (US) and GridPP
(UK)
• Mission: enable fully distributed computing for D0 and
CDF
• People: Computer scientists and Physicists from Fermilab
and the collaborating Institutions
Technology Choices made in 2001
• Low level resource management: Globus GRAM.
Clearly not enough...
• Condor-G: right components and functionalities, but
not enough in 2001...
• D0 and the Condor Team have been collaborating
since, under the auspices of PPDG, to address the
requirements of a large distributed system with
distributively owned and shared resources.
Condor-G: added functionalities I
• Use of the Condor Match Making Service as
Grid Resource Selector
– Advertisement of grid site capabilities to the
MMS
– Dynamic $$(gatekeeper) selection for jobs
specifying requirements on grid sites (see the
sketch below)
• Concurrent submission of multiple jobs to
the same grid resource
– at any given moment, a grid site is capable of
accepting up to N jobs
– the MMS was modified to push up to N jobs to
the same site in the same negotiation cycle
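As a rough illustration of what such a submission looks like (a minimal sketch, not the actual SAM-Grid submit wrapper; the executable name and the requirements shown are only examples, and the attribute names follow the site-classad example later in these slides):

    # Illustrative Condor-G submit description: the Match Making Service picks a
    # grid-site classad satisfying the requirements, and $$() substitutes the
    # chosen site's gatekeeper into the job at submission time.
    universe        = globus
    globusscheduler = $$(gatekeeper_url_)
    executable      = d0_mc_wrapper.sh
    arguments       = --requestId=11866
    requirements    = (TARGET.station_experiment_ == "d0") && (TARGET.station_universe_ == "prd")
    log             = d0_mc_$(Cluster).log
    queue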
Condor-G: added functionalities II
• Flexible Match Making logic
– the job/resource match criteria should be
arbitrarily complex (based on more info
than what fits in the classad), stateful
(remembers match history), “pluggable”
(by administrators and users)
– Example: send the job where most of the
data are. The MMS contacts the site data
handling service to rank a job/site match
(a rough illustration follows below)
– This leads to a very thin and flexible “grid
broker”
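In plain classad terms (and only as an illustration; the attribute below is hypothetical, since SAM-Grid actually ranks matches by querying the data handling service through a pluggable module rather than through a statically advertised value), the “send the job where the data are” idea corresponds to something like:

    # Hypothetical: prefer sites advertising a larger cached fraction of the
    # job's dataset; the real SAM-Grid ranking goes through an external plugin.
    rank         = TARGET.cached_fraction_of_dataset_
    requirements = (TARGET.station_experiment_ == "d0")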
Condor-G: added functionalities III
• Light clients
– A user should be able to submit a job
from a laptop and then turn it off
– Client software (condor_submit, etc.) and
queuing service (condor_schedd) should
be on different machines
– This leads to a three-tier architecture for
Condor-G: client, queuing, execution
sites. Security was implemented via
X509.
Condor-G: added functionalities IV
• Resubmission/Rematching logic
– If the MMS matched a job to a site which
cannot accept it after N submission
attempts, the job should be
rematched to a different site
– Flexible penalization of already failed
matches
Architecture
(Architecture diagram, showing the flow of jobs, data and meta-data. User Interfaces feed the Submission services (Grid Client, Global Job Queue), which talk to the Resource Selector (Match Making, Info Gatherer, Info Collector, Resource Optimizer) and to the Global Data Handling services (SAM Naming Server, SAM Log Server). Each Site provides a Grid Gateway, local job handling (CAF, D0MC, BS, ...), a SAM Station and SAM Stager(s) with distributed file system, cache and MSS, Worker Nodes, a SAM DB Server, the RC / MetaData Catalog and Bookkeeping Service, and grid monitoring and information services (Info Manager, JIM Advertise, MDS, Web Server, Info Providers, XML DB server, site configuration, global/local JID map), with AAA controlling access.)
User Tools
(Diagram of the user tools and job management chain: a User Interface and Submission Client with its queuing system, the Match Making Service / Broker (with external ranking logic) and the Information Collector, dispatching jobs to Execution Site #1 ... Execution Site #n, each offering Data Handling, Computing Elements, Storage Elements and Grid Sensors.
Overlaid on the diagram is a worked example of the matchmaking: a JIM job description
  job_type = montecarlo, station_name = ccin2p3-analysis,
  runjob_requestid = 11866, runjob_numevts = 10000,
  d0_release_version = p14.05.01, jobfiles_dataset = san_jobset2,
  minbias_dataset = ccin2p3_minbias_dataset, sam_experiment = d0,
  sam_universe = prd, group = test, instances = 1
is matched against the classad advertised by a grid site
  MyType = "Machine", Name = "ccin2p3-analysis.d0.prd.jobmanager-runjob",
  gatekeeper_url_ = "ccd0.in2p3.fr:2119/jobmanager-runjob",
  DbURL = "http://ccd0.in2p3.fr:7080/Xindice",
  station_name_ = "ccin2p3-analysis", station_experiment_ = "d0",
  station_universe_ = "prd", cluster_name_ = "LyonsGrid",
  cluster_architecture_ = "Linux+2.4", local_storage_path_ = "/samgrid/disk",
  site_name_ = "ccin2p3", schema_version_ = "1_1"
and is translated into a Condor-G job classad
  MyType = "Job", TargetType = "Machine", ClusterId = 304,
  JobType = "montecarlo", GlobusResource = "$$(gatekeeper_url_)",
  DbURL = "$$(DbURL)", RequestId = "11866",
  ProjectId = "sam_ccd0_012457_25321_0",
  Requirements = (TARGET.station_name_ == "ccin2p3-analysis" && ...),
  Rank = 0.000000, the user's X509 cert_subject,
  Env = "MATCH_RESOURCE_NAME=$$(name); SAM_STATION=$$(station_name_); SAM_USER_NAME=aditya; ...",
  Args = "--requestId=11866 --gridId=sam_ccd0_012457 ...")
SAMGrid plots
JIM: Active execution sites: 11 DØ, 1 CDF in testing
http://samgrid.fnal.gov:8080/
SAMGrid plots
Monte Carlo produced
DØ – Production - MC
• All DØ MC always produced off-site
• SAMGrid now default (went into production in March 2004)
– Based on request system and jobmanager-mc_runjob
– MC software package retrieved via SAM
– Currently running at (multiple) sites in Cz, Fr, UK, USA
(10 in total + FNAL)
• more on the way, including the central farm
– Average production efficiency ~90%
– Average inefficiency due to grid infrastructure ~1-5%
DØ – Production - Reprocessing
• Early next year, DØ will reprocess ~1,000,000,000
events (~250TB)
• This will all involve SAMGrid
• By then it will be installed at ~20 sites in ~8 countries
SAMGrid is now an essential part of the DØ
computing setup and it has Condor-G at its centre
CMS MC production (millions of events)

RC                              Generation   CMSIM    OSCAR    ooHit
Bristol/RAL                          2.402    3.885    0        0
CERN                                14.82     5.204    4.834   12.443
CIEMAT (Madrid)                      3.17     2.318    0.1      0.75
CSIC (Spain)                         0.3      3.124    0        1.622
Forschungszentrum (Karlsruhe)        2.37     4.019    0        2.149
Imperial College                    10.12     4.219    2.276    0.598
IN2P3                                4.19     3.973    0        1.48
INFN                                 9.65     9.65     0        6.761
CMS-LCG                              0.5      1.486    0.598    0
Moscow                               0.24     1.282    0        0.58
Pakistan                             0.673    0        0        0
USMOP                                0.38     2.271   14.153    0
Wisconsin                           28.75     8.732    0.96     0.15
“USMOP” is the USCMS Monte Carlo
Production system
- Used to run MC simulation on the Grid3 sites
- The master (submission) node prepares a DAG that will run
at the execution site (this pattern is sketched below)
- Condor-G is then used to submit these to the Grid3
sites. Normally the execution site is specified, but it can
do matchmaking
- At each site the DAG stages in the required files, runs
the job, stages out the output and finally cleans up
- One person (Craig Prescott), submitting from 2 sites, can
produce a vast amount of MC data!
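A minimal sketch of the DAG pattern described above (the node and submit-file names are invented for illustration; the real USMOP DAGs carry CMS-specific configuration and scripts):

    # stage-in -> run the simulation job -> stage-out -> clean up
    JOB  stagein   stagein.sub
    JOB  simulate  simulate.sub
    JOB  stageout  stageout.sub
    JOB  cleanup   cleanup.sub
    PARENT stagein  CHILD simulate
    PARENT simulate CHILD stageout
    PARENT stageout CHILD cleanup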
Craig also points out…
Condor is the most popular local batch system on
Grid3
Underlying Batch System     Number of Sites
condor                      17
pbs                         10
lsf                          1
fbsng                        2
CMS MC production at Wisconsin
… Dan Bradley et al
Caveat
Clearly, there are people here who know more about this
than I do, so my comments are going to be very general.
However, the volume of data produced and the efficiency
with which the resources are used mean that it needs some
comment…
CMS MC production at Wisconsin
… Dan Bradley et al
The basic points are:
- They were able to flock between several different
Condor pools, owned by different users in physically
different locations (a minimal configuration sketch
follows after this list)
- For the Fortran versions they were able to condor_compile,
however they could not for the C++ versions…
making OSCAR running much less efficient than CMSIM
- They developed a Python/MySQL based job tracking
and management system
- Also note that Wisconsin’s Grid3 site also participates
in USMOP
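A minimal sketch of the flocking configuration implied above (host names are invented): the submitting machine lists the central managers of the pools it is allowed to flock to, and each of those pools in turn names the machines it accepts flocked jobs from.

    # On the submitting machine's Condor configuration:
    FLOCK_TO = cm.pool-b.example.edu, cm.pool-c.example.edu
    # On the central manager of each remote pool:
    FLOCK_FROM = submit.pool-a.example.edu
    # (each remote pool must also authorise the submit machine
    #  in its HOSTALLOW_WRITE / security settings)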
Conclusions…
- Condor and Condor-G are used by many HEP
experiments, of which I have described only two
- Condor is used in two basic ways (with variations)…
As a batch system … with all the potential
advantages that Condor brings
Via Condor-G to submit to Grid resources
- These uses of Condor are central to several projects.