LHC Physics Analysis and Databases
Maaike Limper

Introduction

New Oracle-sponsored CERN OpenLab fellow: Maaike Limper, started January 2012

Project outline: Investigate the possibility of doing LHC-scale data reconstruction and/or physics analysis within an Oracle database

MINI-CV: Maaike Limper has a master's degree in Physics from the University of Amsterdam, and a PhD in Particle Physics completed with Nikhef (the Dutch National Institute for Particle Physics). As part of her PhD and subsequent post-doc she worked on the ATLAS experiment, one of the experiments that measure events produced by the Large Hadron Collider at CERN. During her work as an ATLAS physicist, Maaike worked on the construction of the silicon tracker, developed track and vertex reconstruction algorithms, and participated in the analysis of the first LHC data recorded with the ATLAS detector. Maaike was also involved in developing some of the ATLAS data-taking conditions databases for the pixel detector. As Prompt Reconstruction Coordinator for ATLAS, she was responsible for the reconstruction of all data recorded by the ATLAS detector. As of January 2012 she works full-time as an Oracle-funded OpenLab fellow for the CERN IT department.

LHC physics analysis and databases - M. Limper

Large Hadron Collider at CERN

Four main experiments recording events produced by the Large Hadron Collider: ATLAS, CMS, LHCb and ALICE

Examples of LHC-scale data processing from my experience with the ATLAS experiment

[Aerial photo with labels: Lake Geneva, LHC, ATLAS, CERN]

LHC-scale data processing

[Diagram: data flow for LHC-scale data processing
 • event data taking: data acquisition -> raw data
 • event simulation: generation -> simulation -> digitization -> simulated raw data
 • event reconstruction: reconstruction -> event summary data
 • event analysis: analysis -> analysis objects (extracted per physics topic): ntuple1, ntuple2, ..., ntupleN
 • interactive physics analysis (thousands of users!)]

Event data taking

In 2011 LHC delivered 5.61 fb^-1 of p-p collision data:
• ~20 thousand events that produced a Standard Model Higgs with a mass of 150 GeV
• ~300 billion inelastic proton-proton interactions

ATLAS uses a flexible trigger menu to determine which events are interesting enough to record…

ATLAS recorded 1.6 billion events in 2011

[Diagram: event data taking: data acquisition -> raw data]

ATLAS data reconstruction

Raw data (detector hits, energy depositions, etc.) are reconstructed by the ATLAS (C++) software framework

Reconstruction task examples:
• Fit particle trajectories from hits measured in the inner detector
• Cluster energy deposits measured in the calorimeter to reconstruct “jets” of particles
• Fit trajectory from hits in the muon spectrometer
• Combine track information to determine a muon candidate from the interaction point

ATLAS data analysis

Reconstruction focuses on creating physics objects from the information measured in the detector

Analysis focuses on interpreting information from the reconstructed objects to determine what type of event took place

Example: apply quality criteria on muon candidates and calculate the invariant mass from the sum of the muon 4-momenta to find a Z-boson candidate

[Event display: ATLAS real physics event example, Z->μμ candidate, m_μμ = 93.4 GeV]
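
The invariant-mass calculation in the Z-boson example can be sketched in a few lines of Python. This is an illustration only, not the ATLAS code: the function names and the choice of (pT, η, φ) as inputs are assumptions, and all quantities are in GeV.

```python
import math

MUON_MASS_GEV = 0.10566  # muon rest mass in GeV


def four_momentum(pt, eta, phi, mass=MUON_MASS_GEV):
    """Build (E, px, py, pz) from transverse momentum, pseudorapidity and azimuth."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def invariant_mass(p1, p2):
    """Invariant mass of the sum of two 4-momenta: m^2 = E^2 - |p|^2."""
    energy, px, py, pz = (a + b for a, b in zip(p1, p2))
    m2 = energy * energy - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # guard against small negative rounding errors


# Two hypothetical back-to-back muon candidates with pT = 45 GeV each
mu1 = four_momentum(45.0, 0.0, 0.0)
mu2 = four_momentum(45.0, 0.0, math.pi)
dimuon_mass = invariant_mass(mu1, mu2)
print(f"di-muon invariant mass: {dimuon_mass:.2f} GeV")
```

A pair whose mass falls near the Z-boson mass would then be kept as a Z candidate; the quality criteria on the muon candidates are applied before this step.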

ATLAS data reconstruction

ATLAS uses flexible trigger menus to reduce the data-taking rate to ~300 recorded events/second (2011 rate)

Tier-0 computing center at CERN has ~3000 CPUs available to reconstruct ATLAS data while it is recorded

Initial data gets reconstructed twice:
• “Express reconstruction” during data-taking
• “Bulk reconstruction” 36 hours after end of run, using beamspot and calibration constants derived from express reconstruction

Reprocessing campaigns every 2-3 months re-reconstruct all data with the latest version of the reconstruction software

[Plot annotations: express reconstruction, bulk reco; number of CPUs increased during busy periods]

Event Simulation

In addition to the real physics events, physicists require simulated (MC) events to compare/test/understand their results

Each physics group requests sets of signal and background samples

~100 million simulated events requested

Generation -> Simulation -> Digitization -> Reconstruction

During each ATLAS reprocessing campaign new versions of all simulation samples are provided

[Diagram: event simulation: generation -> simulation -> digitization -> simulated raw data]

Data analysis

Physics analysis at LHC is mainly done with ROOT
• C++
• analysis tools (plotting/fitting/statistical analysis)

ROOT-ntuples are centrally produced by physics groups from previously reconstructed event summary data

Each physics group determines the specific content of its ntuple
• Physics objects to include
• Level of detail to be stored per physics object
• Event filter and/or pre-analysis steps

Variables are stored for each event in the form of:
• scalar (example: missing energy, number of reconstructed muons)
• vectors (example: energy, direction, momentum of reconstructed muons)
• vector-of-vectors (example: position of hits on reconstructed muons)

[Diagram: event summary data -> ntuple1, ntuple2, ..., ntupleN]

Physics Analysis from DB

Benchmark Physics Analysis in an Oracle DB:
• Simplified version of HZ->bbll analysis (search for Standard Model Higgs boson produced in association with a Z-boson)
• Select muon-candidates to reconstruct Z-peak
• Select b-jet-candidates to reconstruct Higgs-peak
• Signal sample: 29887 events (3 ntuples)
• Background sample (Z->mumu+jets): 289916 events (30 ntuples)
• Use ntuple defined by ATLAS Top Physics Group: “NTUP_TOP”
• 4212 physics attributes per event

Initial challenges:
• Large number of attributes, many of which are vector-type and difficult to implement in a single table, so I divided the data over multiple tables
• Need to select data with SQL-queries instead of C++ code containing a loop over all events in the file

Physics DB

Initial DB implementation holds 695 out of 4212 variables (16.5%):
• “EventData”-table: 184 columns, 184 event-related variables (scalar), primaryKey=(RunNumber,EventNumber)
• “muon”-table: 271 columns, 268 muon-related variables (muon-vector content), primaryKey=(muonId,RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)
• “jet”-table: 193 columns, 190 jet-related variables (jet-vector content), primaryKey=(jetId,RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)
• “MET”-table: 55 columns, 53 MET (Missing Transverse Energy)-related variables (scalar), primaryKey=(RunNumber,EventNumber), foreignKey=(RunNumber,EventNumber)

ROOT-ntuple size is 880 MB

Current DB-size per stored ntuple (16.5% of contents) is ~200 MB

Full DB-size would be ~1.2 GB per ntuple (second figure given on the slide: ~2.6 GB per ntuple)
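
The table layout described above (one row per event for scalars, one row per object for vector-type variables, linked by RunNumber and EventNumber) can be sketched on a toy scale. This is an illustration using SQLite from Python rather than Oracle; the key columns follow the slide, while the helper `store_event`, the column "mu_n" and the reduced set of muon columns are assumptions made for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE "EventData" (
    "RunNumber"   INTEGER,
    "EventNumber" INTEGER,
    "mu_n"        INTEGER,          -- hypothetical scalar: number of muons
    PRIMARY KEY ("RunNumber", "EventNumber")
);
CREATE TABLE "muon" (
    "muonId"      INTEGER,          -- position of the muon in the per-event vector
    "RunNumber"   INTEGER,
    "EventNumber" INTEGER,
    "pt"          REAL,
    "eta"         REAL,
    PRIMARY KEY ("muonId", "RunNumber", "EventNumber"),
    FOREIGN KEY ("RunNumber", "EventNumber")
        REFERENCES "EventData" ("RunNumber", "EventNumber")
);
"""


def store_event(conn, event):
    """Flatten one ntuple-style event (scalars plus vectors) into table rows."""
    run, evt = event["RunNumber"], event["EventNumber"]
    conn.execute('INSERT INTO "EventData" VALUES (?, ?, ?)',
                 (run, evt, len(event["mu_pt"])))
    # Each element of the muon vectors becomes its own row in the muon table
    for muon_id, (pt, eta) in enumerate(zip(event["mu_pt"], event["mu_eta"])):
        conn.execute('INSERT INTO "muon" VALUES (?, ?, ?, ?, ?)',
                     (muon_id, run, evt, pt, eta))


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Hypothetical event with two reconstructed muons (pt in MeV)
store_event(conn, {"RunNumber": 1, "EventNumber": 42,
                   "mu_pt": [31000.0, 25000.0], "mu_eta": [0.5, -1.2]})
n_muon_rows = conn.execute('SELECT COUNT(*) FROM "muon"').fetchone()[0]
print(n_muon_rows)
```

Splitting vector content into its own table keeps every column scalar, at the cost of a join on (RunNumber, EventNumber) whenever event-level and object-level quantities are needed together.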

Physics Analysis

Simplified version of HZ->bbll analysis:
• muon selection: “IsMuon”-function must return TRUE; it includes the requirements pT>20 GeV and |η|<2.4 plus several requirements on hits and holes on tracks
• Require exactly 2 selected muons per event
• b-jet selection: transverse momentum pT>25 GeV, |η|<2.5 and “flavour_weight_Comb”>1.55 (to select b-jets)
• Require exactly 2 selected b-jets per event
• Require 1 of the 2 b-jets to have pT>45 GeV
• Plot “invariant mass” of muons (Z-peak) and of b-jets (Higgs-peak)

Two versions of this analysis:
• Standard ntuple-analysis in ROOT (C++) using locally stored ntuples
• Analysis from Oracle Physics DB running on the same machine as the DB and using functions implemented in the “PHYSANALYSIS” PL/SQL-package: “IsMuon”, “InvariantMassLeptons”, “InvariantMassJets”

Physics Analysis in SQL

Done using an SQL-query that makes temporary tables for the different selections and joins data from different tables via “EventNumber”:

  with selectedmuon as (
    select "muon_i","EventNumber","RunNumber","E","px","py","pz"
    from "muon"
    where MLIMPER.PHYSANALYSIS.IS_MUON("muon_i", "pt", "eta", "phi", "E",
            "me_qoverp_exPV", "id_qoverp_exPV", "me_theta_exPV", "id_theta_exPV", "id_theta",
            "isCombinedMuon", "isLowPtReconstructedMuon", "tight", "expectBLayerHit",
            "nBLHits", "nPixHits", "nPixelDeadSensors", "nPixHoles",
            "nSCTHits", "nSCTDeadSensors", "nSCTHoles", "nTRTHits", "nTRTOutliers",
            0, 20000., 2.4) = 1 ),
  selectedeventsmuon as (
    select "EventNumber", COUNT(*) as "mu_sel_n"
    from selectedmuon
    group by "EventNumber" HAVING COUNT(*)=2 ),
  selectedbjet as (
    select "jet_i","EventNumber","RunNumber","E","pt","phi","eta"
    from "jet" INNER JOIN selectedeventsmuon USING("EventNumber")
    where "pt"/1000>25 and abs("eta")<2.5 and "fl_w_Comb">1.55 ),
  selectedevents as (
    select "EventNumber", COUNT(*) as "jet_sel_n"
    from selectedbjet
    group by "EventNumber" HAVING COUNT(*)=2 )
  select
    MLIMPER.PHYSANALYSIS.INV_MASS_LEPTONS(mu1."E",mu2."E",mu1."px",mu2."px",
        mu1."py",mu2."py",mu1."pz",mu2."pz")/1000. as "DiMuonMass",
    MLIMPER.PHYSANALYSIS.INV_MASS_JETS(jet1."E",jet2."E",jet1."pt",jet2."pt",
        jet1."phi",jet2."phi",jet1."eta",jet2."eta")/1000. as "DiJetMass"
  from selectedmuon mu1, selectedmuon mu2,
       selectedbjet jet1, selectedbjet jet2,
       selectedevents evSel
  where mu1."muon_i"<mu2."muon_i"
    and mu1."EventNumber"=evSel."EventNumber"
    and mu2."EventNumber"=evSel."EventNumber"
    and jet1."jet_i"<jet2."jet_i"
    and jet1."EventNumber"=evSel."EventNumber"
    and jet2."EventNumber"=evSel."EventNumber"
    and jet1."pt"/1000.>45.

Physics Analysis benchmark

Output of the SQL-query is sent to ROOT to produce standard ROOT-histograms:

[Figure: histograms]

ROOT-macro using the original ntuples as input produces identical histograms:

[Figure: histograms]

Physics Analysis in DB

Time to produce plots from physics DB vs from ntuple-files:

  Sample     #events   #sel. events   time from DB   time from ntuple
  HZbbll     9993      421            5 s ??         7 s
  HZbbll     29987     1326           95 s           14 s
  Z+2jets    289916    170            85 s           110 s

Both tests were done on a CERN virtual machine, 2 GB RAM, 2 CPU, SLC5 64-bit

Average time of analysis measured after reboot of the virtual machine

Speed from ntuple depends on:
• number of files
• number of used ntuple-branches (=physics-attributes)

DB-speed depends on:
• clever implementation of SQL-query: I’m not (yet) an SQL-guru…
• table-size: select from jet-table is much slower than from muon-table, as more jet-objects are stored per event
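
The structure of the SQL analysis (a selection per object, a count per event, then a self-join to pair the selected objects) can be tried out on a toy table. The sketch below uses SQLite from Python instead of Oracle, covers only the muon half of the query, and replaces the PL/SQL “IsMuon” function by a plain pT and |η| cut; the registered function `INV_MASS` and the sample rows are assumptions made for the illustration.

```python
import math
import sqlite3


def inv_mass(e1, e2, px1, px2, py1, py2, pz1, pz2):
    """Invariant mass of two particles from their summed 4-momenta."""
    e, px, py, pz = e1 + e2, px1 + px2, py1 + py2, pz1 + pz2
    return math.sqrt(max(e * e - (px * px + py * py + pz * pz), 0.0))


conn = sqlite3.connect(":memory:")
conn.create_function("INV_MASS", 8, inv_mass)  # stand-in for the PL/SQL function
conn.execute("""CREATE TABLE muon (EventNumber INTEGER, muon_i INTEGER,
                E REAL, px REAL, py REAL, pz REAL, pt REAL, eta REAL)""")
conn.executemany("INSERT INTO muon VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    (1, 0, 45000.0,  45000.0, 0.0, 0.0, 45000.0, 0.0),  # event 1: two good muons
    (1, 1, 45000.0, -45000.0, 0.0, 0.0, 45000.0, 0.0),
    (2, 0, 30000.0,  30000.0, 0.0, 0.0, 30000.0, 0.0),  # event 2: only one muon
    (3, 0, 45000.0,  45000.0, 0.0, 0.0, 45000.0, 0.0),  # event 3: second muon
    (3, 1, 10000.0, -10000.0, 0.0, 0.0, 10000.0, 0.0),  #          fails the pT cut
])

QUERY = """
WITH selectedmuon AS (
    SELECT EventNumber, muon_i, E, px, py, pz
    FROM muon
    WHERE pt > 20000. AND abs(eta) < 2.4          -- simplified muon selection
),
selectedevents AS (
    SELECT EventNumber FROM selectedmuon
    GROUP BY EventNumber HAVING COUNT(*) = 2      -- exactly two selected muons
)
SELECT mu1.EventNumber,
       INV_MASS(mu1.E, mu2.E, mu1.px, mu2.px,
                mu1.py, mu2.py, mu1.pz, mu2.pz) / 1000. AS DiMuonMass
FROM selectedmuon mu1
JOIN selectedmuon mu2
  ON mu1.EventNumber = mu2.EventNumber AND mu1.muon_i < mu2.muon_i
JOIN selectedevents ev ON ev.EventNumber = mu1.EventNumber
"""
result = conn.execute(QUERY).fetchall()
print(result)
```

The condition `mu1.muon_i < mu2.muon_i` ensures each pair is counted once, mirroring the role it plays in the Oracle query.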

Physics Analysis in DB

First implementation of Physics Analysis in Oracle DB
• Data in multiple tables
• SQL-query can reproduce selection in loop over events
• Analysis from DB speed similar to original ntuple-analysis, but many complexities still not implemented…

Possible space-gain for DB version of analysis data:
• Each physics group optimized their own ntuple-size based on the physics and level of detail required for their analysis, but the sum of the different ntuples contains duplicate info
• Physics Analysis DB would contain all physics objects, divided over multiple tables; each physics group can choose which tables to use

[Diagram: event summary data -> analysis objects (extracted per physics topic): ntuple1, ntuple2, ..., ntupleN]

Physics Analysis in DB

First implementation of Physics Analysis in Oracle DB
• Data in multiple tables
• SQL-query can reproduce selection in loop over events
• Analysis from DB slower than original ntuple-analysis, and many complexities still not implemented…

Possible space-gain for DB version of analysis data:
• Each physics group optimized their own ntuple-size based on the physics and level of detail required for their analysis, but the sum of the different ntuples contains duplicate info
• Physics Analysis DB would contain all physics objects, divided over multiple tables; each physics group can choose which tables to use

[Diagram: event summary data -> analysis objects stored in database: physicsDB]

Physics Analysis in DB

Space requirement for realistic physics DB (order of magnitude):
• ~1.2 GB per ntuple (10k events); second figure given on the slide: ~2.6 GB
• ~2 billion events data+simulation in 2011 -> ~240 TB of analysis data (second figure given on the slide: ~520 TB)
• ~10 revisions (reconstruction software versions) actively analyzed at a given time (more revisions may need to be archived) -> ~2400 TB of analysis data
• ~factor 4 more data from LHC expected in 2012

Analysis DB would need to be accessible by thousands of users! -> duplicates of DB at different analysis sites needed

[Diagram: event summary data -> analysis objects stored in database: physicsDB]

To be continued…

Reconstruction versus Analysis of LHC data

Data organization:
• Analysis data is relatively easily organized in tables and columns
• Reconstruction of data starts from raw-event data, which is less easily organized in tables and columns and likely to require data in blobs

Processing:
• Analysis uses many select-type arguments and functions that can be implemented in PL/SQL
• Reconstruction of data in the DB will require use of an external agent to run the experiment’s C++ reconstruction software at the database

Access:
• Analysis data will need to be accessible by many users doing many different analyses at once
• Reconstruction tasks are centrally organized; data is reconstructed during data-taking and re-reconstructed during re-processing campaigns

How to optimize the use of Oracle services for LHC-scale data processing?
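
The order-of-magnitude space estimate from the “Physics Analysis in DB” slide can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide; the function name and the convention 1 TB = 1000 GB are assumptions.

```python
EVENTS_PER_NTUPLE = 10_000       # "10k events" per ntuple
TOTAL_EVENTS = 2_000_000_000     # ~2 billion events (data + simulation, 2011)
ACTIVE_REVISIONS = 10            # ~10 reconstruction software versions in use
GB_PER_TB = 1000                 # assumed decimal convention


def analysis_data_tb(gb_per_ntuple, revisions=1):
    """Total analysis data in TB for a given DB size per ntuple."""
    n_ntuples = TOTAL_EVENTS / EVENTS_PER_NTUPLE
    return n_ntuples * gb_per_ntuple * revisions / GB_PER_TB


for size_gb in (1.2, 2.6):  # both per-ntuple figures quoted on the slide
    print(f"{size_gb} GB/ntuple: {analysis_data_tb(size_gb):.0f} TB per revision")
print(f"{analysis_data_tb(1.2, ACTIVE_REVISIONS):.0f} TB for {ACTIVE_REVISIONS} revisions")
```

With 1.2 GB per ntuple this gives 240 TB per revision and 2400 TB for ten revisions; with 2.6 GB per ntuple it gives 520 TB per revision, matching the figures on the slide.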