Data Analysis in Experimental Particle Physics
Lectures at the CERN-CLAF School, 13-14 May 2001, Itacuruça, Brazil
Prof. Manuel Delfino, CERN Information Technology Division*
* Permanent address: Departamento de Física, Universidad Autónoma de Barcelona, España

Outline of Lecture 1
- Characteristics of data from particle experiments
- From DAQ data to Event Records: Event Building
- From hits to tracks and clusters
- From tracks and clusters to "particles": correlating sub-detector information
- Uncertainties and resolution
- Data reconstruction and "production": Data Summary "Tapes"
- Personal data analysis: n-tuples

Outline of Lecture 2
- Monte Carlo simulation
- Statistics and error analysis
- Hypothesis testing
- Simulation of particle production and interactions with the detector
- Digital representations of event data
- Monitoring and Calibration
- Why physicists don't (yet) use Excel and Oracle for their daily analysis
- The challenge of analysis for the LHC experiments
- The challenge of computing for the LHC
- Solving the LHC computing challenge

Characteristics of data from particle experiments
- Most data comes from digitized information from sensors activated by particles crossing them.
- We call the data resulting from the observation of a particle collision an event.
- Over hours, days, weeks, months, years or even decades, we observe many events. We group them according to the time-varying experimental conditions into runs.
- Calibration and environmental information is also stored, usually in a periodic fashion.
- For practical reasons, this data is stored in data files of many events (a sketch of this run/event/calibration structure appears after the tracking slides below).
- Almost always, events are independent of each other.

Characteristics of data from particle experiments
[Figure: "The Experimental Particle Physics Data Worm". A stream of events (e.g. event number 31896) grouped into runs 137-140 and stored across data files 418 and 419, with calibration records interleaved.]

From DAQ data to Event Records: "Event Building"

From hits to tracks and clusters
- Occupancy and point resolution are related to ambiguities in track finding.
- Calibration, monitoring and software are needed to resolve these ambiguities.
- What you see is not always what there was! [Event display: a nuclear interaction.]
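To make the tracking slides above a little more concrete, here is a minimal sketch of how hits become a track: a weighted least-squares straight-line fit, where the point resolution enters through the weights. This is purely illustrative (the function name fit_track, the hit coordinates and the resolutions are all invented), not any experiment's reconstruction code, and it skips the pattern-recognition step of deciding which hits belong together.

```python
import numpy as np

# Illustrative only: fit a straight track y = a + b*x through detector hits,
# weighting each hit by its point resolution sigma. Real track finding must
# first resolve the ambiguity of which hits belong to which track.
def fit_track(x, y, sigma):
    w = 1.0 / np.asarray(sigma) ** 2             # weights from point resolution
    A = np.vstack([np.ones_like(x), x]).T        # design matrix for y = a + b*x
    cov = np.linalg.inv(A.T @ (w[:, None] * A))  # parameter covariance matrix
    a, b = cov @ A.T @ (w * y)                   # weighted least-squares solution
    return (a, b), cov

# Hypothetical hits: positions in cm, 100-micron point resolution
x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([1.02, 2.01, 2.98, 4.03])
sigma = np.full(4, 0.01)

(a, b), cov = fit_track(x, y, sigma)
print(f"intercept = {a:.3f} +- {np.sqrt(cov[0, 0]):.3f}")
print(f"slope     = {b:.4f} +- {np.sqrt(cov[1, 1]):.4f}")
```

Curved tracks in a magnetic field, multiple scattering and shared hits make the real problem much richer; the covariance matrix computed here is what propagates into the resolution functions discussed later.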
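And here is the promised sketch of the "Data Worm" organization: events grouped into runs, with calibration records interleaved. The class names and fields are hypothetical; real experiments of this era used FORTRAN memory managers or C++/Java object models, as discussed later in the lecture.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative data model for the "Data Worm": events grouped into runs,
# runs spanning data files, calibration records stored periodically.
@dataclass
class Event:
    number: int
    raw_hits: List[tuple] = field(default_factory=list)   # digitized sensor data

@dataclass
class CalibrationRecord:
    valid_from_event: int
    constants: dict = field(default_factory=dict)          # e.g. pedestals, gains

@dataclass
class Run:
    number: int                         # one set of experimental conditions
    events: List[Event] = field(default_factory=list)
    calibrations: List[CalibrationRecord] = field(default_factory=list)

# A data file holds many events and may span run boundaries:
run_137 = Run(137, events=[Event(31896)], calibrations=[CalibrationRecord(0)])
```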
Monitoring and Calibration
- Particles deposit energy in sensors; sensors give voltages, currents, charges. The spatial position of each sensor is known.
- On-detector Analog-to-Digital Converters turn these into numbers representing these or other quantities (for example, clock ticks between voltage pulses).
- Calibration establishes the relationship between the ADC units and the physical units (eV, {x,y,z}, ns):
  - in the laboratory, using controlled conditions;
  - in the field, using known physical processes.
- The calibration can depend on the environment or drift due to uncontrolled parameters: Monitoring. (A minimal calibration sketch appears at the end of this part.)

From tracks and clusters to "particles": correlating sub-detector information
[Event display: tracks and clusters matched across sub-detectors, with particle labels such as m and e.]

Uncertainties and resolution
- Each measurement or hit has some uncertainty, due to alignment and the characteristics of the sensor.
- These uncertainties get propagated, often in a nonlinear manner, into resolution functions for the physics quantities used in analysis.
- Resolution has various consequences: directly on measurements, signal-background confusion, combinatorics.
[Figure: particle identification with dE/dx in the TPC; note the different scales.]

Data reconstruction and "production": Data Summary "Tapes"
- Reconstruction turns hits + calibration + geometry into particle hypotheses.
- Reconstruction is time-consuming and must be done coherently: centrally organized production.
- Output is one or more levels of so-called Data Summary Tapes (DST), which are used as input to personal analysis.
- In practice, there is a lot of utility software to organize these data for easy analysis (bookkeeping).
- Programming of complicated event structures:
  - Old: FORTRAN with home-made memory managers
  - Today: object-oriented design using C++ or Java

Personal data analysis
- Most modern detectors can address multiple physics topics, with hundreds or thousands of professors and students distributed around the world.
- Modern experimental collaborations are an early example of virtual communities.
- Historical enablers for virtual communities:
  - fellowship and exchange programmes
  - telegraph, telex, telephone and telefax
  - national and international laboratories
  - reasonably priced airline tickets
  - computer inter-networking, e-mail and ftp
  - the World Wide Web
  - multi-media applications on the Internet

Personal data analysis
- Today, physics analysis topics are increasingly tackled by virtual teams within these virtual communities.
- Coherency of data and algorithms must be maintained within the virtual team.
- "Production" for a modern detector is very complex and consumes many resources.
- DSTs contain all imagined reconstruction objects for all foreseen analyses, so they are big.
- Handling a DST often requires installing special software libraries and writing code in the "reconstruction dialect".
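As promised on the Monitoring and Calibration slide, here is a minimal sketch of the calibration idea: converting raw ADC counts into physical units with a pedestal and a gain, where the constants come from laboratory or in-situ calibration and may drift with time. All names and numbers are illustrative, not any experiment's actual scheme.

```python
# Illustrative calibration: energy[eV] = gain * (adc_counts - pedestal).
# Pedestal and gain come from calibration runs; monitoring tracks their
# drift (e.g. with temperature) so the right constants apply to each event.

class ChannelCalibration:
    def __init__(self, pedestal, gain_ev_per_count):
        self.pedestal = pedestal        # ADC counts with no signal present
        self.gain = gain_ev_per_count   # eV per ADC count

    def energy_ev(self, adc_counts):
        return self.gain * (adc_counts - self.pedestal)

# Hypothetical channel: pedestal of 52 counts, 12.5 eV per count
calib = ChannelCalibration(pedestal=52.0, gain_ev_per_count=12.5)
print(calib.energy_ev(420))   # -> 4600.0 eV for a raw reading of 420 counts
```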
Personal data analysis
- Solution: each virtual team develops code to extract a common analysis dataset for a given topic, which is written and manipulated using a "lingua franca": n-tuples and the Physics Analysis Workstation (PAW).
- The physicist's version of business data mining with Excel.
- Iterative process (time-scale of weeks or months):
  1. The team agrees on complex algorithms to be coded in the extraction program.
  2. Algorithms are coded and tested; extraction from the DST.
  3. The n-tuple file is rapidly distributed via computer network.
  4. The n-tuple is analyzed using non-compiled, platform-independent code (PAW macros today, Java in the future?) that is easily modified and shared by e-mail.
  5. Eventually limitations are reached; go back to step 1.

Personal data analysis
- PAW was the "killer application" for physics in the 90s:
  - interactive, just as powerful workstations became available;
  - platform-independent, in a very diverse workstation world;
  - graphical, just as X-windows gave graphics over the network;
  - simple to write analysis macros, just as the complexity of FORTRAN programming required in experiments decoupled most collaborators from the experiment's code.
- In summary, PAW was like going from DOS to Macintosh.
- One major limitation of PAW is the lack of variable-length structures or, more generally, data objects. ROOT overcomes these limitations while keeping a philosophy similar to PAW's. Java Analysis Studio tries to go further with "agents".

Personal data analysis
- Which will be the "killer application" for LHC analysis? Is a Mac Classic on AppleTalk enough, or do we need the conceptual leap equivalent of the Web + Java-enabled browser?
- Will the personal n-tuple model work for the LHC?
- Do we need, and can we afford, to support our own interactive data analysis tool?
- Will one of the newer tools, such as Java Analysis Studio, go exponential in the open source world?
- Many questions, one simple answer: it will be young people like you who will make the next step happen.

Monte Carlo simulation
- Monte Carlo simulation uses random numbers (see mathematics textbooks).
- Try the following:
  - Find a source of random numbers in the interval [0,1] (calculator, Excel, etc.).
  - Take a function that you want to simulate (e.g. y = x²) and normalize it to fit in the interval [0,1] for both x and y.
  - Find graph paper to histogram values of x.
  - Repeat this at least 20 times:
    - Throw two random numbers. Use the first as the value for x.
    - Evaluate the function y and compare its value to the second random number:
      - if the second random number is less than the function value, add a count to the histogram in the correct bin for x;
      - if it is more, forget it.
  - Compare your histogram to the shape of the function.

Monte Carlo simulation
- If you don't know how to program, you can pick up an Excel file from http://cern.ch/Manuel.Delfino/Brazil
[Figure: histogram of accepted x values from a Monte Carlo simulation of y = x*x for 100 trials, in bins of 0.1 in x. There are 30 entries, so the "efficiency" is 30%; note the statistical fluctuations.]
- Homework: how is the normalization done?
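The exercise above is rejection sampling. Here is a minimal Python sketch of the same procedure (the slide's Excel file does this by hand); the function and trial count match the slide's example, while the helper name sample_shape and the binning are invented for illustration.

```python
import random

# Rejection sampling of a density shaped like f(x) = x**2 on [0, 1].
# Throw (x, u) uniformly in the unit square; keep x when u falls below f(x).
def sample_shape(f, trials, nbins=10):
    hist = [0] * nbins
    accepted = 0
    for _ in range(trials):
        x, u = random.random(), random.random()
        if u < f(x):                       # accept: the point lies under the curve
            hist[min(int(x * nbins), nbins - 1)] += 1
            accepted += 1
    return hist, accepted

hist, accepted = sample_shape(lambda x: x * x, trials=100)
print(hist)                                # counts rise roughly like x**2
print(f"efficiency = {accepted}%")         # ~33% on average for 100 trials
```

This also hints at the homework: the function must fit inside the unit square, and the acceptance probability, here the integral of x² over [0,1] = 1/3, is what relates the raw bin counts to the normalized shape; the slide's 30 entries out of 100 trials is one statistical fluctuation around that value.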
Statistics and error analysis
- Analysis involves selecting, counting and normalizing.
- Things are easier when you actually have a signal:
  - understand the underlying statistics: Poisson, Binomial, Multinomial, etc.;
  - if measuring a differential distribution, understand the relation between the normalization of binned counts and total counts;
  - understand selection biases and their impact on observed distributions.
- Things are a lot harder when you place limits.
- Two observations:
  - If you cannot make an analytical estimate of the uncertainties, I won't believe your result.
  - The expression "n-sigma effect" should be banned.

Hypothesis testing
- You must understand Bayes' theorem. And every time you think you understand it, you must make a big effort to understand it better! (A small worked example appears at the end of this part.)
- Compare differential distributions of data with predictions of a "theory" or "model": different theories, or different parameters for the same model.
- Setting up the statistical test is often straightforward, which is why it is surprising that most people do it wrong.
- Taking account of resolution and systematic uncertainties is hard: make the simulation look like the data to get your answers, even if the graphics look better the other way around!

Simulation of particle production and interactions with the detector
- For particle production, combine Monte Carlo with:
  - detailed particle properties;
  - detailed cross-sections predicted by theory or phenomenology;
  - computation of phase space.
- Output consists of event records containing simulated particles (often called 4-vectors by experimentalists).
- For simulating the detector, combine Monte Carlo with:
  - a detailed description of the detector;
  - detailed cross-sections for interactions with detector materials;
  - detailed phenomenology of the mechanism producing the signal;
  - transport (ray-tracing) algorithms, including B fields;
  - a digitization model mapping {x,y,z} to read-out channels.

Simulation of particle production and interactions with the detector
[Figure: a small part of the design of GEANT4. There is a reference to Jackson's textbook in the documentation!]

Digital representations of event data
- In principle, representing event data digitally should be very simple, except:
  - everything comes in variable numbers: hits, tracks, clusters;
  - ambiguities lead to multiple relations;
  - particle identification may depend on the analysis hypothesis; etc.
- In simple terms, events don't look like bank-account data; they look like collections of objects (a code sketch of this appears at the end of this part).
- You can do a reasonable representation using relational tables, but using such data structures from Fortran programs is still cumbersome.
- Object-oriented programming is a better match, but C++ does not resolve all problems: frameworks.
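As flagged under Hypothesis testing, here is a tiny worked example of Bayes' theorem in a particle-physics setting. The numbers are invented purely for illustration: a selection that is 90% efficient for signal, fires on 1% of background, applied to a sample where only 1 event in 1000 is signal.

```python
# Bayes' theorem: P(S | sel) = P(sel | S) * P(S) / P(sel)
# Illustrative numbers only: eff = P(sel | signal), fake = P(sel | background).
eff, fake, prior_s = 0.90, 0.01, 0.001

p_sel = eff * prior_s + fake * (1.0 - prior_s)   # total selection probability
p_signal_given_sel = eff * prior_s / p_sel

print(f"P(signal | selected) = {p_signal_given_sel:.3f}")
# -> about 0.083: even a "90% efficient, 1% fake" cut leaves a background-
#    dominated sample when the signal is rare. This is why the prior matters.
```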
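The "collections of objects" point above, and the idealized/reality diagrams that follow, can be sketched directly in code. This is a hedged illustration with hypothetical class names, not any experiment's actual framework: variable-length collections and shared references are exactly what flat relational tables handle poorly.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrackerHit:
    position: tuple
    response: float

@dataclass
class Track:
    origin: tuple
    curvature: float
    hits: List[TrackerHit] = field(default_factory=list)   # variable length

@dataclass
class Cluster:
    position: tuple
    energy: float

@dataclass
class ParticleHypothesis:
    mass: float
    charge: int
    # In reality the relations are many-to-many: an ambiguous hit or cluster
    # may be shared between several hypotheses, which fixed relational keys
    # defined a priori cannot express comfortably.
    tracks: List[Track] = field(default_factory=list)
    clusters: List[Cluster] = field(default_factory=list)
```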
Why physicists don't (yet) use Excel and Oracle for their daily analysis
- Spreadsheets like Excel and relational databases like Oracle have a very "square" view of data. This is not a good match to the Data Worm.
- "Normal" people (banks and insurance companies) can define a priori the quantities they will select on (the keys of the database).
- We usually derive selection criteria a posteriori, using quantities calculated from the stored data. We like (need?) to express queries as individualistic, detailed, low-level computer code. This is difficult to support in a database.
- But this is changing very rapidly due to data mining: businesses are interested in analyzing their raw data in unpredictable ways. Example: cash register tickets used to choose sale items.
- Support for this requires a more "organic" view of data, for example object-relational databases.

Why physicists don't (yet) use Excel and Oracle for their daily analysis
- Idealized: a simple relation.
  - Particle hypothesis (mass, charge, momentum, origin)
    - one-to-many → Cluster (position, width, depth, energy, number of hits)
      - one-to-many → Calorimeter hit (position, response)
    - one-to-many → Track (origin, curvature, extrapolation, number of hits)
      - one-to-many → Tracker hit (position, response)

Why physicists don't (yet) use Excel and Oracle for their daily analysis
- Reality: the same structure, but a complicated algorithmic relation, many-to-many at every level.
  - Particle hypothesis (mass, charge, momentum, origin)
    - many-to-many → Cluster (position, width, depth, energy, number of hits)
      - many-to-many → Calorimeter hit (position, response)
    - many-to-many → Track (origin, curvature, extrapolation, number of hits)
      - many-to-many → Tracker hit (position, response)

The challenge of analysis for the LHC experiments
[Figure: event selection at the LHC. Online selection keeps about 1 event in 10^7; analysis selects a further 1 in 10^5; overall, the signal must be dug out of backgrounds at the 1:10^12 level.]

The challenge of analysis for the LHC experiments
[Diagram: data handling for one LHC experiment. Detector output of 0.1 to 1 GB/sec feeds the Event Filter (selection & reconstruction, 35K SI95); raw data is recorded at ~100 MB/sec, accumulating 1 PB/year; Event Reconstruction (250K SI95) produces Event Summary Data (500 TB, ~200 MB/sec); Batch Physics Analysis (350K SI95, with aggregate data access of 64 GB/sec) produces analysis objects, complemented by Event Simulation; the results serve thousands of scientists distributed around the planet.]

The challenge of computing for the LHC
[Chart: long-term tape storage estimates, 1995-2006, in terabytes (vertical axis up to 14,000). Current experiments and COMPASS account for the early growth; the LHC drives the steep rise at the end of the range. Annotations: accumulation of 10 PB/year; signal/background up to 1:10^12.]

The challenge of computing for the LHC
[Chart: estimated CPU capacity required at CERN, 1998-2010, in K SI95 (vertical axis up to 5,000). The LHC requirement climbs far above the Moore's-law curve, where Moore's law is taken as some measure of the capacity that technology advances provide for a constant number of processors or investment. Installed capacity in January 2000: 3.5K SI95. A back-of-envelope version of this comparison follows at the end of this part.]

The challenge of computing for the LHC
- Continued innovation. [Figure]
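To make the CPU chart above concrete, here is a back-of-envelope sketch of the Moore's-law comparison. The January 2000 starting point of 3.5K SI95 is from the slide; the 18-month doubling time is an assumption introduced here for illustration, not a number from the lecture.

```python
# Back-of-envelope: capacity of a constant-investment installation growing
# with an assumed Moore's-law doubling time of 18 months, starting from the
# 3.5K SI95 installed at CERN in January 2000.
start_ksi95, doubling_months = 3.5, 18.0

for year in range(2000, 2011):
    months = (year - 2000) * 12
    capacity = start_ksi95 * 2 ** (months / doubling_months)
    print(f"{year}: {capacity:7.1f}K SI95")
# By 2006 this reaches only ~56K SI95 at constant investment, far below the
# thousands of K SI95 the chart projects for the LHC: hence a distributed,
# Grid-based solution rather than a single ever-bigger machine.
```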
Solving the LHC Computing Challenge: Technology Development Domains
[Diagram: three technology development domains, Application, Grid and Fabric, shown from both the developer view and the user view.]

Solving the LHC Computing Challenge
[Diagram: the computing fabric at CERN (2006). Thousands of dual-CPU boxes on a farm network of multi-Gigabit Ethernet switches; a storage network connecting thousands of disk units and hundreds of tape drives; LAN-WAN routers and a Grid interface; real-time detector data coming in. Data rates are annotated in Gbps, ranging from below 1 up to 960.]

Solving the LHC Computing Challenge: Data-Intensive Grid Research
- The Grid protocol architecture, set side by side with the Internet protocol architecture (Application, Transport, Internet, Link):
  - Application: "specialized services", user- or application-specific distributed services
  - Collective: "managing multiple resources", ubiquitous infrastructure services
  - Resource: "sharing single resources", negotiating access, controlling use
  - Connectivity: "talking to things", communication (Internet protocols) & security
  - Fabric: "controlling things locally", access to, and control of, resources

Acknowledgements
- Many of the figures in this talk are from the Web sites of ATLAS, CMS, Aleph and Delphi.
- Thanks to Markus Elsing for the Delphi displays of tracking and the nuclear interaction.
- The GEANT4 design diagram is from its documentation.
- Thanks to Les Robertson for the LHC computing diagrams.
- The Grid architecture diagram is adapted from Ian Foster.