physics_analysis_and_databases_mlimper_MINOR_27March

LHC Physics Analysis and Databases
Maaike Limper
Introduction
• New Oracle-sponsored CERN openlab fellow: Maaike Limper, started January 2012
• Project outline: investigate the possibility of doing LHC-scale data reconstruction and/or physics analysis within an Oracle database

MINI-CV: Maaike Limper has a master's degree in Physics from the University of Amsterdam and
a PhD in Particle Physics completed at Nikhef (The Dutch National Institute for
Particle Physics). As part of her PhD and subsequent post-doc she worked on the
ATLAS experiment, one of the experiments that measure events produced by the
Large Hadron Collider at CERN. During her work as an ATLAS physicist, Maaike
worked on the construction of the silicon tracker, developed track and vertex
reconstruction algorithms and participated in the analysis of the first LHC data
recorded with the ATLAS detector. Maaike was also involved in developing some of the
ATLAS data-taking conditions databases for the pixel detector. As Prompt
Reconstruction Coordinator for ATLAS, she was responsible for the
reconstruction of all data recorded by the ATLAS detector. As of January 2012 she works
full-time as an Oracle-funded openlab fellow for the CERN IT department.
LHC physics analysis and databases - M. Limper
Large Hadron Collider at CERN
• Four main experiments record events produced by the Large Hadron Collider: ATLAS, CMS, LHCb and ALICE
• Examples of LHC-scale data processing are taken from my experience with the ATLAS experiment
[Figure labels: Lake Geneva, LHC, ATLAS, CERN]
LHC-scale data processing
[Diagram: data flow. Event data taking (data acquisition) → raw data; event simulation (generation → simulation → digitization) → simulated raw data; event reconstruction → event summary data; event analysis → analysis objects extracted per physics topic (ntuple1, ntuple2, …, ntupleN); interactive physics analysis (thousands of users!)]
Event data taking
[Diagram: event data taking (data acquisition) → raw data]
In 2011 the LHC delivered 5.61 fb⁻¹ of p-p collision data:
• ~20 thousand events that produced a Standard Model Higgs with a mass of 150 GeV
• ~300 billion inelastic proton-proton interactions
ATLAS uses a flexible trigger menu to determine which events are interesting enough to record…
ATLAS recorded 1.6 billion events in 2011
ATLAS data reconstruction
Raw data (detector hits, energy depositions etc.) reconstructed by the ATLAS (C++) software framework
Reconstruction task examples:
• Fit particle trajectories from hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in muon spectrometer
• Combine track information to determine muon candidate from interaction point
ATLAS data analysis
Reconstruction focuses on creating physics objects from the information measured in the detector
Analysis focuses on interpreting information from the reconstructed objects to determine what type of event took place
Example: apply quality criteria to muon candidates and calculate the invariant mass from the sum of the muon 4-momenta to find a Z-boson candidate
[Figure: ATLAS real physics event example: Z→μμ candidate, m_μμ=93.4 GeV]
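The invariant-mass step in the example above can be sketched in a few lines. This is a minimal illustration, assuming natural units (c = 1) and 4-momenta given as (E, px, py, pz) tuples; the function name and the two muon momenta are hypothetical and are not taken from the ATLAS software.

```python
import math

def invariant_mass(p4_a, p4_b):
    """Invariant mass of a two-particle system from (E, px, py, pz) tuples."""
    # Sum the two 4-momenta component by component.
    e = p4_a[0] + p4_b[0]
    px = p4_a[1] + p4_b[1]
    py = p4_a[2] + p4_b[2]
    pz = p4_a[3] + p4_b[3]
    # m^2 = E^2 - |p|^2; clamp tiny negative values caused by rounding.
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))

# Hypothetical back-to-back muons with |p| = 45.6 GeV each (muon mass neglected).
mu_plus = (45.6, 45.6, 0.0, 0.0)
mu_minus = (45.6, -45.6, 0.0, 0.0)
mass = invariant_mass(mu_plus, mu_minus)
```

For back-to-back muons the momenta cancel, so the mass equals the summed energy (91.2 GeV in this made-up case).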
ATLAS data reconstruction
ATLAS uses flexible trigger menus to reduce data-taking rate to
~300 recorded events/second (2011 rate)
Tier-0 computing center at CERN has ~3000 CPUs available to
reconstruct ATLAS data while it is recorded
Initial data gets reconstructed twice:
• “Express reconstruction” during data-taking
• “Bulk reconstruction” 36 hours after end of run, using
beamspot and calibration constants derived from express
[Figure labels: express reconstruction, bulk reco; number of CPUs increased during busy periods]
Reprocessing campaigns every 2-3 months re-reconstruct all data with the latest version of the reconstruction software
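As a rough feel for the numbers quoted above (~300 recorded events/second, ~3000 CPUs), the sketch below computes the CPU time available per event if reconstruction has to keep pace with recording. It assumes every CPU is used for a single reconstruction pass, which is a simplification made for illustration only; the function name is hypothetical.

```python
def cpu_seconds_per_event(n_cpus: int, events_per_second: float) -> float:
    """CPU-seconds available per event when processing keeps up with recording."""
    return n_cpus / events_per_second

# Figures quoted on the slide: ~3000 CPUs at Tier-0, ~300 recorded events/second.
budget = cpu_seconds_per_event(3000, 300.0)
```

Under that assumption the budget comes out at about 10 CPU-seconds per event.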
Event Simulation
In addition to the real physics events, physicists require
simulated (MC) events to compare/test/understand their results
Each physics group requests sets of signal and background samples
~100 million simulated events requested
Generation → Simulation → Digitization → Reconstruction
During each ATLAS reprocessing campaign, new versions of all simulation samples are provided
[Diagram: event simulation (generation → simulation → digitization) → simulated raw data]
Data analysis
Physics analysis at LHC is mainly done with ROOT
• C++
• analysis tools (plotting/fitting/statistical analysis)
ROOT-ntuples are centrally produced by physics groups from previously
reconstructed event summary data
Each physics group determines specific content of ntuple
• Physics objects to include
• Level of detail to be stored per physics object
• Event filter and/or pre-analysis steps
[Diagram: event summary data → ntuple1, ntuple2, …, ntupleN]
Variables stored for each event in the form of:
• scalar (example: missing energy, number of reconstructed muons)
• vectors (example: energy, direction, momentum of reconstructed muons)
• vector-of-vectors (example: position of hits on reconstructed muons)
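The three variable shapes listed above can be pictured as fields of a single per-event record. The sketch below is only an illustration in plain Python; the class and field names are hypothetical and do not correspond to actual ntuple branch names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NtupleEvent:
    # Scalars: one value per event.
    missing_energy: float
    n_muons: int
    # Vectors: one entry per reconstructed muon.
    muon_pt: List[float] = field(default_factory=list)
    muon_eta: List[float] = field(default_factory=list)
    # Vector-of-vectors: per muon, a variable-length list of hit positions.
    muon_hit_z: List[List[float]] = field(default_factory=list)

    def is_consistent(self) -> bool:
        """Check that every per-muon container has one entry per muon."""
        lengths = (len(self.muon_pt), len(self.muon_eta), len(self.muon_hit_z))
        return all(n == self.n_muons for n in lengths)

# Made-up event with two muons and a different number of hits on each.
event = NtupleEvent(
    missing_energy=12.5,
    n_muons=2,
    muon_pt=[41.0, 38.2],
    muon_eta=[0.3, -1.1],
    muon_hit_z=[[1.0, 2.0, 3.0], [1.5, 2.5]],
)
```

The variable-length inner lists are what make this layout awkward to map onto a single fixed-width table.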
Physics Analysis from DB
Benchmark physics analysis in an Oracle DB:
• Simplified version of the HZ→bbll analysis (search for Standard Model Higgs boson produced in association with a Z-boson)
• Select muon candidates to reconstruct Z-peak
• Select b-jet candidates to reconstruct Higgs-peak
• Signal sample: 29887 events (3 ntuples)
• Background sample (Z→μμ+jets): 289916 events (30 ntuples)
• Use ntuple defined by ATLAS Top Physics Group: “NTUP_TOP”
• 4212 physics attributes per event
Initial challenges:
Large number of attributes, many of which are vector-type, difficult to
implement in a single table, so I divided data over multiple tables
Need to select data with SQL-queries instead of C++ code containing a
loop over all events in the file
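One way to picture the split over multiple tables is to flatten each event's vector-type attributes into one row per object, keyed by run and event number. The sketch below assumes a dictionary-based event with hypothetical field names; it illustrates the idea and is not the loading code used for the benchmark.

```python
def flatten_event(event):
    """Split one ntuple-style event into an event row and per-muon rows."""
    key = (event["RunNumber"], event["EventNumber"])
    # Scalars go into a single row of the event-level table.
    event_row = key + (event["missing_energy"],)
    # Each entry of the muon vectors becomes its own row in the muon table.
    muon_rows = [
        (muon_id,) + key + (pt, eta)
        for muon_id, (pt, eta) in enumerate(zip(event["muon_pt"], event["muon_eta"]))
    ]
    return event_row, muon_rows

# Made-up event used only to exercise the function.
event = {
    "RunNumber": 1,
    "EventNumber": 100,
    "missing_energy": 12.5,
    "muon_pt": [41.0, 38.2],
    "muon_eta": [0.3, -1.1],
}
event_row, muon_rows = flatten_event(event)
```

An event with N muons yields one event row and N muon rows, so the number of rows per table depends on the object multiplicity.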
Physics DB
Initial DB implementation holds 695 out of 4212 variables (16.5%):
• “EventData”-table: 184 columns
  184 event-related variables (scalar),
  primaryKey=(RunNumber,EventNumber)
• “muon”-table: 271 columns
  268 muon-related variables (muon-vector content),
  primaryKey=(muonId,RunNumber,EventNumber),
  foreignKey=(RunNumber,EventNumber)
• “jet”-table: 193 columns
  190 jet-related variables (jet-vector content),
  primaryKey=(jetId,RunNumber,EventNumber),
  foreignKey=(RunNumber,EventNumber)
• “MET”-table: 55 columns
  53 MET (Missing Transverse Energy)-related variables (scalar),
  primaryKey=(RunNumber,EventNumber),
  foreignKey=(RunNumber,EventNumber)
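The key layout listed above can be tried out on a small scale. The sketch below uses SQLite (through Python's standard sqlite3 module) purely as a stand-in for Oracle, and keeps only two of the tables with one payload column each; the payload columns n_vertices and pt are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE EventData (
    RunNumber   INTEGER NOT NULL,
    EventNumber INTEGER NOT NULL,
    n_vertices  INTEGER,               -- hypothetical scalar variable
    PRIMARY KEY (RunNumber, EventNumber)
);
CREATE TABLE muon (
    muonId      INTEGER NOT NULL,
    RunNumber   INTEGER NOT NULL,
    EventNumber INTEGER NOT NULL,
    pt          REAL,                  -- hypothetical per-muon variable
    PRIMARY KEY (muonId, RunNumber, EventNumber),
    FOREIGN KEY (RunNumber, EventNumber)
        REFERENCES EventData (RunNumber, EventNumber)
);
""")
conn.execute("INSERT INTO EventData VALUES (1, 100, 5)")
conn.executemany(
    "INSERT INTO muon VALUES (?, ?, ?, ?)",
    [(0, 1, 100, 41000.0), (1, 1, 100, 38200.0)],
)

# A muon row that points at a non-existent event violates the foreign key.
try:
    conn.execute("INSERT INTO muon VALUES (0, 1, 999, 25000.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

muons_per_event = conn.execute(
    "SELECT RunNumber, EventNumber, COUNT(*) FROM muon "
    "GROUP BY RunNumber, EventNumber"
).fetchall()
```

The composite foreign key ties every muon row to exactly one event row, which is what later allows per-event selections to be expressed as joins and GROUP BY clauses.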
ROOT-ntuple size is 880 MB
Current DB-size per stored ntuple (16.5% of contents) is ~200 MB
Full DB-size would be ~1.2 GB (or ~2.6 GB) per ntuple
Physics Analysis
Simplified version of the HZ→bbll analysis:
• muon selection: require the “IsMuon”-function to return TRUE; this includes the requirements pT>20 GeV and |η|<2.4 plus several requirements on hits and holes on tracks
• Require exactly 2 selected muons per event
• b-jet selection: transverse momentum pT>25 GeV, |η|<2.5 and “flavour_weight_Comb”>1.55 (to select b-jets)
• Require exactly 2 selected b-jets per event
• Require 1 of the 2 b-jets to have pT>45 GeV
• Plot “invariant mass” of muons (Z-peak) and of b-jets (Higgs-peak)
Two versions of this analysis:
• Standard ntuple-analysis in ROOT (C++) using locally stored ntuples
• Analysis from Oracle Physics DB, running on the same machine as the DB and using functions implemented in the “PHYSANALYSIS” PL/SQL-package: “IsMuon”, “InvariantMassLeptons”, “InvariantMassJets”
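The event selection listed above can be written as a per-event function, which is the style a loop-over-events analysis would use. This is a simplified sketch: momenta are assumed to be in GeV, the boolean is_muon field is a hypothetical stand-in for the hit and hole requirements, and "1 of the 2 b-jets" is read here as "at least one".

```python
def select_event(muons, jets):
    """Return True if an event passes the simplified selection (momenta in GeV)."""
    good_muons = [
        m for m in muons
        if m["is_muon"] and m["pt"] > 20.0 and abs(m["eta"]) < 2.4
    ]
    if len(good_muons) != 2:
        return False
    b_jets = [
        j for j in jets
        if j["pt"] > 25.0 and abs(j["eta"]) < 2.5 and j["flavour_weight_Comb"] > 1.55
    ]
    if len(b_jets) != 2:
        return False
    # At least one of the two selected b-jets must be above 45 GeV.
    return any(j["pt"] > 45.0 for j in b_jets)

# Made-up objects used only to exercise the cuts.
muons = [
    {"is_muon": True, "pt": 41.0, "eta": 0.3},
    {"is_muon": True, "pt": 38.2, "eta": -1.1},
]
hard_jets = [
    {"pt": 60.0, "eta": 0.5, "flavour_weight_Comb": 3.2},
    {"pt": 30.0, "eta": -2.0, "flavour_weight_Comb": 2.1},
    {"pt": 80.0, "eta": 1.0, "flavour_weight_Comb": 0.2},  # fails the b-tag cut
]
soft_jets = [
    {"pt": 30.0, "eta": 0.5, "flavour_weight_Comb": 3.2},
    {"pt": 40.0, "eta": -2.0, "flavour_weight_Comb": 2.1},
]
passes = select_event(muons, hard_jets)
fails = select_event(muons, soft_jets)
```

In a file-based analysis this function would be called once per event inside the loop; the database version has to express the same counting with GROUP BY and HAVING.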
Physics Analysis in SQL
Done using an SQL query that builds temporary tables (WITH clauses) for the different selections and joins data from different tables via “EventNumber”:
with selectedmuon as (
  select "muon_i","EventNumber","RunNumber","E","px","py","pz" from "muon"
  where MLIMPER.PHYSANALYSIS.IS_MUON("muon_i", "pt", "eta", "phi", "E",
    "me_qoverp_exPV", "id_qoverp_exPV", "me_theta_exPV", "id_theta_exPV", "id_theta",
    "isCombinedMuon", "isLowPtReconstructedMuon", "tight", "expectBLayerHit",
    "nBLHits", "nPixHits", "nPixelDeadSensors", "nPixHoles", "nSCTHits",
    "nSCTDeadSensors", "nSCTHoles", "nTRTHits", "nTRTOutliers", 0, 20000., 2.4) = 1
),
selectedeventsmuon as (
  select "EventNumber", COUNT(*) as "mu_sel_n" from selectedmuon
  group by "EventNumber" having COUNT(*)=2
),
selectedbjet as (
  select "jet_i","EventNumber","RunNumber","E","pt","phi","eta" from "jet"
  inner join selectedeventsmuon using ("EventNumber")
  where "pt"/1000>25 and abs("eta")<2.5 and "fl_w_Comb">1.55
),
selectedevents as (
  select "EventNumber", COUNT(*) as "jet_sel_n" from selectedbjet
  group by "EventNumber" having COUNT(*)=2
)
select MLIMPER.PHYSANALYSIS.INV_MASS_LEPTONS(mu1."E", mu2."E", mu1."px", mu2."px",
         mu1."py", mu2."py", mu1."pz", mu2."pz")/1000. as "DiMuonMass",
       MLIMPER.PHYSANALYSIS.INV_MASS_JETS(jet1."E", jet2."E", jet1."pt", jet2."pt",
         jet1."phi", jet2."phi", jet1."eta", jet2."eta")/1000. as "DiJetMass"
from selectedmuon mu1, selectedmuon mu2, selectedbjet jet1, selectedbjet jet2,
     selectedevents evSel
where mu1."muon_i"<mu2."muon_i"
  and mu1."EventNumber"=evSel."EventNumber" and mu2."EventNumber"=evSel."EventNumber"
  and jet1."jet_i"<jet2."jet_i"
  and jet1."EventNumber"=evSel."EventNumber" and jet2."EventNumber"=evSel."EventNumber"
  and jet1."pt"/1000.>45.
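A scaled-down version of the query pattern above can be run in SQLite, used here only as a stand-in for Oracle. In this sketch a Python function registered with sqlite3's create_function plays the role of the PL/SQL invariant-mass function, the muon table is cut down to a handful of columns, and the muon selection is reduced to a pT cut; the data rows are made up and all values are in MeV, as in the query above.

```python
import math
import sqlite3

def inv_mass(e1, e2, px1, px2, py1, py2, pz1, pz2):
    """Invariant mass of two particles; stand-in for the PL/SQL function."""
    e, px, py, pz = e1 + e2, px1 + px2, py1 + py2, pz1 + pz2
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

conn = sqlite3.connect(":memory:")
conn.create_function("INV_MASS", 8, inv_mass)  # expose the Python function to SQL
conn.execute(
    "CREATE TABLE muon (muon_i INTEGER, EventNumber INTEGER, "
    "E REAL, px REAL, py REAL, pz REAL, pt REAL)"
)
conn.executemany(
    "INSERT INTO muon VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (0, 1, 45600.0, 45600.0, 0.0, 0.0, 45600.0),
        (1, 1, 45600.0, -45600.0, 0.0, 0.0, 45600.0),
        (0, 2, 30000.0, 30000.0, 0.0, 0.0, 30000.0),  # event 2 has only one muon
    ],
)

query = """
WITH selectedmuon AS (
    SELECT muon_i, EventNumber, E, px, py, pz FROM muon WHERE pt > 20000.0
),
selectedevents AS (
    SELECT EventNumber FROM selectedmuon GROUP BY EventNumber HAVING COUNT(*) = 2
)
SELECT mu1.EventNumber,
       INV_MASS(mu1.E, mu2.E, mu1.px, mu2.px,
                mu1.py, mu2.py, mu1.pz, mu2.pz) / 1000.0 AS DiMuonMass
FROM selectedmuon mu1
JOIN selectedmuon mu2
  ON mu1.EventNumber = mu2.EventNumber AND mu1.muon_i < mu2.muon_i
JOIN selectedevents ev ON ev.EventNumber = mu1.EventNumber
"""
result = conn.execute(query).fetchall()
```

The condition mu1.muon_i < mu2.muon_i keeps each muon pair once instead of twice, and the HAVING clause replaces the "exactly 2 selected muons" check that a C++ event loop would do with a counter.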
Physics Analysis benchmark
Output of the SQL query is sent to ROOT to produce standard ROOT histograms
A ROOT macro using the original ntuples as input produces identical histograms
Physics Analysis in DB
Time to produce plots from physics DB vs from ntuple-files
Sample     #events   #sel. events   time from DB   time from ntuple
HZ→bbll    9993      421            5 s ??         7 s
HZ→bbll    29987     1326           95 s           14 s
Z+2jets    289916    170            85 s           110 s
Both tests done on CERN virtual machine, 2 GB RAM, 2 CPU, SLC5 64-bit
Average time of analysis measured after reboot of virtual machine
Speed from ntuple depends on:
• number of files
• number of used ntuple-branches (=physics-attributes)
DB-speed depends on:
• clever implementation of SQL-query: I’m not (yet) an SQL-guru…
• table-size: select from jet-table much slower than from muon-table, as
more jet-objects stored per event
Physics Analysis in DB
First implementation of Physics Analysis in Oracle DB
• Data in multiple tables
• An SQL query can reproduce the selection done in a loop over events
• Analysis from DB has speed similar to the original ntuple-analysis, but many complexities are still not implemented…
Possible space-gain for DB version of analysis data:
Each physics group optimized their own ntuple-size based on the physics and level of detail required for their analysis, but the sum of the different ntuples contains duplicate info
Physics Analysis DB would contain all physics objects, divided over multiple tables; each physics group can choose which tables to use
[Diagram: event summary data → analysis objects (extracted per physics topic): ntuple1, ntuple2, …, ntupleN]
Physics Analysis in DB
First implementation of Physics Analysis in Oracle DB
• Data in multiple tables
• An SQL query can reproduce the selection done in a loop over events
• Analysis from DB is slower than the original ntuple-analysis, and many complexities are still not implemented…
Possible space-gain for DB version of analysis data:
Each physics group optimized their own ntuple-size based on the physics and level of detail required for their analysis, but the sum of the different ntuples contains duplicate info
Physics Analysis DB would contain all physics objects, divided over multiple tables; each physics group can choose which tables to use
[Diagram: event summary data → analysis objects stored in database (physicsDB)]
Physics Analysis in DB
Space requirement for realistic physics DB (order of magnitude):
~1.2 GB (or ~2.6 GB) per ntuple (10k events)
~2 billion events data+simulation in 2011 → ~240 TB (or ~520 TB) of analysis data
~10 revisions (reconstruction software versions) actively analyzed at a given time (more revisions may need to be archived) → ~2400 TB of analysis data
~factor 4 more data from LHC expected in 2012
Analysis DB would need to be accessible by thousands of users!
→ duplicates of DB at different analysis sites needed
[Diagram: event summary data → analysis objects stored in database (physicsDB)]
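The order-of-magnitude space estimate on this slide follows from simple scaling, sketched below. The sketch assumes decimal units (1 TB = 1000 GB) and that DB size grows linearly with both the fraction of variables stored and the number of events; the function name is hypothetical.

```python
def analysis_data_tb(n_events, events_per_ntuple, gb_per_ntuple, revisions=1):
    """Total analysis-data volume in TB for a given per-ntuple DB size."""
    n_ntuples = n_events / events_per_ntuple
    return n_ntuples * gb_per_ntuple * revisions / 1000.0

# Linear extrapolation of the prototype: ~200 MB holds 16.5% of the variables.
full_ntuple_gb = 0.2 / 0.165

# ~2 billion events in 2011, ~10k events per ntuple.
low_estimate_tb = analysis_data_tb(2e9, 1e4, 1.2)
high_estimate_tb = analysis_data_tb(2e9, 1e4, 2.6)
ten_revisions_tb = analysis_data_tb(2e9, 1e4, 1.2, revisions=10)
```

With these inputs the extrapolated size is about 1.2 GB per ntuple, the totals come out at about 240 TB and 520 TB for the two per-ntuple figures, and ten revisions at 1.2 GB per ntuple give about 2400 TB. By the same arithmetic, ten revisions at 2.6 GB per ntuple would be about 5200 TB.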
To be continued…
Reconstruction versus Analysis of LHC data
Analysis data is relatively easily organized in tables and columns
Reconstruction of data starts from raw-event data, which is less easily organized in tables and columns and likely to require data in blobs
Analysis uses many select-type arguments and functions that can be implemented in PL/SQL
Reconstruction of data in the DB will require use of an external agent to run the experiment’s C++ reconstruction software at the database
Analysis data will need to be accessible by many users doing many different analyses at once
Reconstruction tasks are centrally organized; data is reconstructed during data-taking and re-reconstructed during re-processing campaigns
How to optimize the use of Oracle services
for LHC-scale data processing?