ATLAS Grid Computing and Data Challenges

Nurcan Ozturk
University of Texas at Arlington
Recent Progresses in High Energy Physics
Bolu, Turkey. June 23-25, 2004
Outline
• Introduction
• ATLAS Experiment
• ATLAS Computing System
• ATLAS Computing Timeline
• ATLAS Data Challenges
• DC2 Event Samples
• Data Production Scenario
• ATLAS New Production System
• Grid Flavors in Production System
• Windmill-Supervisor
• An Example of XML Messages
• Windmill-Capone Screenshots
• Grid Tools
• Conclusions
Introduction
• Why Grid Computing?
  • Scientific research is becoming more and more complex, and international teams of scientists are growing larger and larger.
  • Grid technologies enable scientists to use remote computers and data storage systems around the world to retrieve and analyze data.
  • Grid computing power will be a key to the success of the LHC experiments.
  • Grid computing is a challenge not only for particle physics experiments but also for biologists, astrophysicists and gravitational-wave researchers.
ATLAS Experiment
• The ATLAS (A Toroidal LHC ApparatuS) experiment at the Large Hadron Collider at CERN will start taking data in 2007: proton-proton collisions at a 14 TeV center-of-mass energy.
• ATLAS will study:
  • the SM Higgs boson
  • SUSY states
  • SM QCD, EW and heavy-quark physics
  • new physics?
• Total amount of "raw" data: ~1 PB/year.
• ATLAS needs the GRID to reconstruct and analyze this data, via a complex "Worldwide Computing Model" and "Event Data Model":
  • raw data at CERN
  • reconstructed data "distributed"
  • all members of the collaboration must have access to "ALL" public copies of the data
• ~2000 collaborators, ~150 institutes, 34 countries.
ATLAS Computing System
(R. Jones)
[Figure: the ATLAS tiered computing model. The Event Builder reads the detector at ~PB/s and feeds the Event Filter (~159 kSI2k) at ~10 GB/s; the Event Filter ships ~300 MB/s of raw data per experiment to the Tier 0 centre at CERN (~5 MSI2k), while some data for calibration and monitoring goes directly to institutes (~450 Mb/s) and calibrations flow back. Tier 0 serves the Tier 1 regional centres (US, Italian, French, UK (RAL), ...) over ~622 Mb/s links; each Tier 1 provides ~7.7 MSI2k, stores a few PB/year (the figure quotes ~9 PB/year and ~2 PB/year per Tier 1) and does no simulation. Tier 1s feed Tier 2 centres of ~200 kSI2k and ~200 TB/year each (e.g. a "Northern Tier" of Lancaster, Liverpool, Manchester and Sheffield, ~0.25 TIPS), which serve physics data caches and desktop workstations at 100-1000 MB/s. Scale: 1 PC (2004) = ~1 kSpecInt2k.]

• Each Tier 2 has ~25 physicists working on one or more channels.
• Each Tier 2 should have the full AOD, TAG and relevant Physics Group summary data.
• Tier 2s do the bulk of the simulation.
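As a rough cross-check of the figures above, the quoted data rates can be converted into yearly volumes. A minimal sketch in Python; the ~10^7 seconds of effective data-taking per year is an assumption commonly used in LHC planning, not a number taken from the slide.

# Back-of-the-envelope conversion of the rates quoted in the tier diagram into yearly volumes.
SECONDS_PER_YEAR = 365 * 24 * 3600   # wall-clock seconds in a year
EFFECTIVE_SECONDS = 1.0e7            # assumed effective data-taking time per year (not from the slide)

def pb_per_year(rate_mb_per_s, seconds):
    """Convert a sustained rate in MB/s into PB accumulated over 'seconds'."""
    return rate_mb_per_s * seconds / 1.0e9   # 1 PB = 1e9 MB

print("~300 MB/s into Tier 0, continuous: %.1f PB/year" % pb_per_year(300, SECONDS_PER_YEAR))
print("~300 MB/s into Tier 0, 1e7 s live: %.1f PB/year" % pb_per_year(300, EFFECTIVE_SECONDS))
print("622 Mb/s Tier 0 -> Tier 1 link, fully used: %.1f PB/year" % pb_per_year(622 / 8.0, SECONDS_PER_YEAR))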
ATLAS Computing Timeline
(D. Barberis)
2003
• POOL/SEAL release (done)
• ATLAS release 7 (with POOL persistency) (done)
• LCG-1 deployment (done)
2004
• ATLAS complete Geant4 validation (done)
NOW -->
• ATLAS release 8 (done)
• DC2 Phase 1: simulation production
2005
• DC2 Phase 2: intensive reconstruction (the real challenge!)
• Combined test beams (barrel wedge)
• Computing Model paper
2006
• Computing Memorandum of Understanding
• ATLAS Computing TDR and LCG TDR
• DC3: produce data for PRR and test LCG-n
2007
• Physics Readiness Report
• Start commissioning run
• GO!
ATLAS Data Challenges
Data Challenges --> generate and analyze simulated data of increasing scale and complexity, using the Grid as much as possible.
• Goals:
  • validate the Computing Model, the software and the data model, and ensure the correctness of the technical choices to be made
  • provide simulated data to design and optimize the detector
  • the experience gained in these Data Challenges will be used to formulate the ATLAS Computing Technical Design Report
• Status:
  • DC0 (December 2001 - June 2002) and DC1 (July 2002 - March 2003) completed
  • DC2 ongoing
  • DC3 and DC4 planned (one per year)
DC2 Event Samples
(G. Poulard)
[Table: the DC2 event samples, listing for each channel its decay mode, filter cuts, and event counts (in units of 10^6) before and after filtering. About 30 samples are defined, grouped as A0-A11, B1-B5, H1-H9 and M1: top (with ideal and mis-aligned geometry), Z -> ee / mumu / tautau, W -> leptons, QCD jets and dijets with various pT cuts, gamma + jet, Z + jet, W + 4 jets, bb -> B, minimum bias, SM Higgs samples at masses between 115 and 180 GeV (gamma-gamma, 4-lepton, WW and tautau decays), MSSM Higgs (bbA(300) and bbA(115)), SUSY, and the DC1 SUSY sample. Individual samples range from 0.015 x 10^6 to 1 x 10^6 events; the total is 9.435 x 10^6 events.]
Data Production Scenario
(G. Poulard)

Event generation
  Input:    none
  Output:   generated events
  Comments: < 2 GB files

G4 simulation
  Input:    generated events ("part of" a file)
  Output:   hits + MCTruth
  Comments: < 2 GB files; job duration limited to 24 h; ~2000 jobs/day, ~500 GB/day, ~5 MB/s

Detector response (digitization)
  Input:    hits + MCTruth (1 file)
  Output:   digits + MCTruth = RDO (or BS)
  Comments: no MCTruth if BS; ~2000 jobs/day

Pile-up
  Input:    hits "signal" + MCTruth, hits "min. bias" (generated events); ~10 GB/job, ~10 TB/day, ~150 MB/s
  Output:   digits + MCTruth = RDO (or BS), 1 (or a few) files

Byte-stream production
  Input:    "pile-up" data (RDO)
  Output:   BS
  Comments: still some work needed

Event mixing
  Input:    RDO or BS, several files
  Output:   BS
  Comments: still some work needed

Reconstruction
  Input:    RDO or BS
  Output:   ESD (1 file)

AOD production
  Input:    ESD
  Output:   AOD (several tens of files)
  Comments: streaming?
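The throughput figures in the table follow from simple arithmetic on the quoted job counts, volumes and a 24-hour day; a minimal check in Python (the per-job output size is derived here, not stated on the slide):

# Rough consistency check of the DC2 production rates quoted above.
SECONDS_PER_DAY = 86400

jobs_per_day = 2000
sim_output_gb_per_day = 500.0                       # ~500 GB/day of simulation output
print("simulation output per job: ~%.0f MB" % (sim_output_gb_per_day * 1000 / jobs_per_day))
print("simulation output rate:    ~%.1f MB/s" % (sim_output_gb_per_day * 1000 / SECONDS_PER_DAY))   # slide quotes ~5 MB/s

pileup_input_tb_per_day = 10.0                      # ~10 TB/day of pile-up input
print("pile-up input rate:        ~%.0f MB/s" % (pileup_input_tb_per_day * 1e6 / SECONDS_PER_DAY))  # slide quotes ~150 MB/s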
ATLAS New Production System
[Diagram: architecture of the new ATLAS production system. A central production database (prodDB) and the AMI metadata catalog feed Windmill supervisor instances, and the Don Quijote data management system (dms) spans all grids. Each supervisor talks over Jabber or SOAP to a grid-specific executor: Lexor for LCG, Dulcinea for NorduGrid (NG), Capone for Grid3 (G3), plus an executor for LSF batch; each grid has its own replica location service (RLS).]
http://www.nordugrid.org/applications/prodsys/
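To make the supervisor/executor split in the diagram concrete, here is a minimal Python sketch of supervisors driving grid-specific executors. Only the executor names (Capone, Dulcinea, Lexor) and their target grids come from the slide; the class layout, method names and toy job list are hypothetical simplifications, not the real Windmill/prodsys code.

# Toy illustration of the supervisor/executor split (not the real production system code).
class Executor:
    """Translates abstract job definitions into submissions on one grid flavor."""
    grid = "unspecified"
    def submit(self, job):
        print("submitting %s to %s" % (job, self.grid))

class Capone(Executor):   grid = "Grid3"        # Grid3 executor
class Dulcinea(Executor): grid = "NorduGrid"    # NorduGrid executor
class Lexor(Executor):    grid = "LCG"          # LCG executor

class Supervisor:
    """Pulls job definitions (here a plain list standing in for prodDB) and dispatches them."""
    def __init__(self, executor, jobs):
        self.executor, self.jobs = executor, jobs
    def run_once(self):
        for job in self.jobs:
            self.executor.submit(job)

jobs = ["dc2.simul.job1", "dc2.simul.job2"]       # hypothetical job names
for executor in (Capone(), Dulcinea(), Lexor()):  # one supervisor instance per grid
    Supervisor(executor, jobs).run_once()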
Grid Flavors in Production System
(L. Perini, 07-May-04)
• LCG: LHC Computing Grid, > 40 sites
• Grid3: USA Grid, 27 sites
• NorduGrid: Denmark, Sweden, Norway, Finland, Germany, Estonia, Slovenia, Slovakia, Australia, Switzerland; 35 sites

Regional centres connected to the LCG Grid (** = not yet in LCG-2):
• Austria: UIBK
• Canada: TRIUMF (Vancouver), Univ. Montreal, Univ. Alberta
• Czech Republic: CESNET (Prague), University of Prague
• France: IN2P3 (Lyon)**
• Germany: FZK (Karlsruhe), DESY, University of Aachen, University of Wuppertal
• Greece: GRNET (Athens)
• Holland: NIKHEF (Amsterdam)
• Hungary: KFKI (Budapest)
• Israel: Tel Aviv University**, Weizmann Institute
• Italy: CNAF (Bologna), INFN Torino, INFN Milano, INFN Roma, INFN Legnaro
• Japan: ICEPP (Tokyo)**
• Poland: Cyfronet (Krakow)
• Portugal: LIP (Lisbon)
• Russia: SINP (Moscow)
• Spain: PIC (Barcelona), IFIC (Valencia), IFCA (Santander), University of Barcelona, Uni. Santiago de Compostela, CIEMAT (Madrid), UAM (Madrid)
• Switzerland: CERN, CSCS (Manno)**
• Taiwan: Academia Sinica (Taipei), NCU (Taipei)
• UK: RAL, Cavendish (Cambridge), Imperial (London), Lancaster University, Manchester University, Sheffield University, QMUL (London)
• USA: FNAL, BNL**

Centres in the process of being connected:
• China: IHEP (Beijing)
• India: TIFR (Mumbai)
• Pakistan: NCP (Islamabad)
• Hewlett Packard to provide "Tier 2-like" services for LCG, initially in Puerto Rico
Windmill-Supervisor
• Supervisor development team at UTA: Kaushik De, Nurcan Ozturk, Mark Sosebee.
• Supervisor-executor communication is via the Jabber protocol, developed for instant messaging.
• XML (Extensible Markup Language) messages are passed between supervisor and executor.
• Supervisor-executor interaction (see the sketch after this slide):
  • numJobsWanted
  • executeJobs
  • getExecutorData
  • getStatus
  • fixJob
  • killJob
• Final verification of jobs is done by the supervisor.
• Windmill webpage: http://www-hep.uta.edu

[Diagram: the supervisor, steered by the production manager, sits between the production database (prodDB), the data management system and the replica catalog on one side and the executors on the other.]
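A minimal sketch of the supervisor-executor cycle built around the six requests listed above. The method names mirror the slide, but the stub below replaces the real XML-over-Jabber transport with plain function calls, and its internals are hypothetical, not actual Windmill code.

# Simplified stand-in for an executor answering the six supervisor requests.
class ExecutorStub:
    def __init__(self):
        self.jobs = {}                                   # job name -> state
    def numJobsWanted(self, minimum_resources):
        return 5                                         # offer to run 5 jobs
    def executeJobs(self, job_definitions):
        for job in job_definitions:
            self.jobs[job] = "running"
    def getStatus(self, job):
        return self.jobs.get(job, "unknown")
    def getExecutorData(self, job):
        return {"job": job, "logfile": job + ".log"}     # hypothetical payload
    def fixJob(self, job):
        self.jobs[job] = "resubmitted"                   # retry a failed job
    def killJob(self, job):
        self.jobs[job] = "killed"

# supervisor side of the cycle: negotiate, dispatch, poll
executor = ExecutorStub()
n = executor.numJobsWanted({"minimumRAM": "256 MB"})
executor.executeJobs(["job%03d" % i for i in range(n)])
print(executor.getStatus("job000"))                      # -> "running"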
An Example of XML Messages
numJobsWanted: supervisor-executor negotiation of the number of jobs to process.

Supervisor's request:

<?xml version="1.0" ?>
<windmill type="request" user="supervisor" version="0.6">
  <numJobsWanted>
    <minimumResources>
      <transUses>JobTransforms-8.0.1.2 Atlas-8.0.1</transUses>  <!-- software version -->
      <cpuConsumption>
        <count>100000</count>              <!-- minimum CPU required for a production job -->
        <unit>specint2000seconds</unit>    <!-- unit of CPU usage -->
      </cpuConsumption>
      <diskConsumption>
        <count>500</count>                 <!-- maximum output file size -->
        <unit>MB</unit>
      </diskConsumption>
      <ipConnectivity>no</ipConnectivity>  <!-- IP connection required from the CE? -->
      <minimumRAM>
        <count>256</count>                 <!-- minimum physical memory requirement -->
        <unit>MB</unit>
      </minimumRAM>
    </minimumResources>
  </numJobsWanted>
</windmill>

Executor's response:

<?xml version="1.0" ?>
<windmill type="respond" user="executor" version="0.8">
  <numJobsWanted>
    <availableResources>
      <jobCount>5</jobCount>
      <cpuMax>
        <count>100000</count>
        <unit>specint2000</unit>
      </cpuMax>
    </availableResources>
  </numJobsWanted>
</windmill>
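To illustrate how such messages can be handled programmatically, here is a short Python sketch using the standard xml.etree.ElementTree module; it parses a trimmed-down version of the request above and builds a reply of the same shape. It only illustrates the message format and is not code from the production system.

import xml.etree.ElementTree as ET

# a trimmed-down supervisor request with only one requirement
request_xml = """<windmill type="request" user="supervisor" version="0.6">
  <numJobsWanted>
    <minimumResources>
      <minimumRAM><count>256</count><unit>MB</unit></minimumRAM>
    </minimumResources>
  </numJobsWanted>
</windmill>"""

request = ET.fromstring(request_xml)
ram = request.find("numJobsWanted/minimumResources/minimumRAM/count").text
print("supervisor requires at least %s MB of RAM" % ram)

# build the executor's reply offering 5 jobs
reply = ET.Element("windmill", {"type": "respond", "user": "executor", "version": "0.8"})
available = ET.SubElement(ET.SubElement(reply, "numJobsWanted"), "availableResources")
ET.SubElement(available, "jobCount").text = "5"
print(ET.tostring(reply).decode())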
Windmill-Capone Screenshots
Grid Tools
What tools are needed for a Grid site? An example: Grid3, the USA Grid.
• Joint project of USATLAS, USCMS, iVDGL, PPDG and GriPhyN.
• Components:
  • VDT based
  • Classic SE (GridFTP)
  • Monitoring: Grid site catalog, Ganglia, MonALISA
  • Two RLS servers and a VOMS server for ATLAS
• Installation:
  • pacman -get iVDGL:Grid3
  • takes ~4 hours to bring up a site from scratch
• VDT (Virtual Data Toolkit) version 1.1.14 provides:
  • Virtual Data System 1.2.3
  • Class Ads 0.9.5
  • Condor 6.6.1
  • EDG CRL Update 1.2.5
  • EDG Make Gridmap 2.1.0
  • Fault Tolerant Shell (ftsh) 2.0.0
  • Globus 2.4.3 plus patches
  • GLUE Information Providers
  • GLUE Schema 1.1, extended version 1
  • GPT 3.1
  • GSI-Enabled OpenSSH 3.0
  • Java SDK 1.4.1
  • KX509 2031111
  • MonALISA 0.95
  • MyProxy 1.11
  • Netlogger 2.2
  • PyGlobus 1.0
  • PyGlobus URL Copy 1.1.2.11
  • RLS 2.1.4
  • UberFTP 1.3
Conclusions
• The Grid paradigm works: opportunistic use of existing resources; run anywhere, from anywhere, by anyone...
• Grid computing is a challenge and needs worldwide collaboration.
• Data production using the Grid is possible and has been successful so far.
• Data Challenges are the way to test the ATLAS computing model before the real experiment starts.
• Data Challenges also provide data for the physics groups.
• The Data Challenges are a learning and improving experience.