Data Analysis in Experimental Particle Physics

CERN Information Technology Division
Lectures at the CERN-CLAF School, 13-14 May 2001, Itacuruça, Brazil
Prof. Manuel Delfino, CERN Information Technology Division*

* Permanent address: Departamento de Física, Universidad Autónoma de Barcelona, España
Outline of Lecture 1
- Characteristics of data from particle experiments
- From DAQ data to Event Records: Event Building
- From hits to tracks and clusters
- From tracks and clusters to “particles”: correlating sub-detector information
- Uncertainties and resolution
- Data reconstruction and “production”: Data Summary “Tapes”
- Personal data analysis: n-tuples
Outline of Lecture 2
- Monte Carlo simulation
- Statistics and error analysis
- Hypothesis testing
- Simulation of particle production and interactions with the detector
- Digital representations of event data
- Monitoring and Calibration
- Why physicists don’t (yet) use Excel and Oracle for their daily analysis
- The challenge of analysis for the LHC experiments
- The challenge of computing for the LHC
- Solving the LHC computing challenge
Characteristics of data from particle experiments
- Most data comes from digitized information from sensors activated by particles crossing them.
- We call the data resulting from the observation of a particle collision an event.
- During hours, days, weeks, months, years or even decades, we observe many events. We group them into runs according to the time-varying experimental conditions.
- Calibration and environmental information is also stored, usually in a periodic fashion.
- For practical reasons, these data are stored in data files of many events.
- Almost always, events are independent of each other.
[Figure: “The Experimental Particle Physics Data Worm”: a continuous stream of event records (e.g. event number 31896) interleaved with calibration records, grouped into runs (Run 137 to Run 140) and split across data files (data file 418, data file 419).]
From DAQ data to Event Records: “Event Building”
From hits to tracks and clusters
[Event display] Occupancy and point resolution are related to ambiguities in track finding.
[Event display] Calibration, monitoring and software are needed to resolve these ambiguities.
[Event display: a nuclear interaction] What you see is not always what there was!
Monitoring and Calibration
- Particles deposit energy in sensors.
- Sensors give voltages, currents, charges.
- The spatial position of each sensor is known.
- On-detector Analog-to-Digital Converters (ADCs) change these into numbers representing these or other quantities (for example, clock ticks between voltage pulses).
- Calibration establishes the relationship between ADC units and physical units (eV, {x,y,z}, ns); a minimal sketch follows this list:
  - In the laboratory, using controlled conditions
  - In the field, using known physical processes
- The calibration can depend on the environment or drift due to uncontrolled parameters: Monitoring.
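As a concrete illustration of the calibration step, here is a minimal sketch, not from the lecture; the pedestal-and-gain model and all constants are assumptions for illustration only.

// Minimal sketch of a linear calibration: map raw ADC counts to
// deposited energy. The constants are illustrative, not from any
// real detector.
struct ChannelCalibration {
    double pedestal;  // ADC counts with no signal (measured in the lab)
    double gain;      // MeV per ADC count (from known physics processes)
};

double adcToEnergyMeV(int adcCounts, const ChannelCalibration& c) {
    // Subtract the pedestal, then scale. Monitoring re-measures the
    // pedestal and gain periodically as conditions drift.
    return (adcCounts - c.pedestal) * c.gain;
}

Monitoring then amounts to re-deriving such constants at regular intervals and storing them with the run, like the calibration records of the Data Worm.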
From tracks and clusters to “particles”: correlating sub-detector information

[Event display with particle labels (μ, e): tracks and calorimeter clusters are combined into particle hypotheses.]
Uncertainties and resolution

- Each measurement or hit has some uncertainty, due to alignment and to the characteristics of the sensor.
- These uncertainties get propagated, often in a non-linear manner, into resolution functions for the physics quantities used in analysis (a minimal propagation sketch follows this list).
- Resolution has various consequences:
  - Directly on measurements
  - Signal-background confusion
  - Combinatorics

[Figure: particle identification with dE/dx in the TPC; note the different scales.]
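A minimal sketch of the propagation just mentioned, not from the lecture: the transverse momentum of a track measured from its curvature in a solenoidal field, pT ≈ 0.3·B/κ (pT in GeV/c, B in tesla, curvature κ in 1/m). All numbers are invented for illustration.

#include <cstdio>

// Minimal sketch of first-order error propagation for pT = 0.3 * B / kappa,
// where kappa = 1/R is the measured track curvature in a solenoid of field B.
// Since pT is proportional to 1/kappa, sigma_pT/pT = sigma_kappa/kappa, so
// the relative momentum resolution degrades linearly with rising pT.
int main() {
    const double B          = 1.5;    // field in tesla (illustrative)
    const double kappa      = 0.01;   // curvature in 1/m, i.e. R = 100 m
    const double sigmaKappa = 5e-4;   // curvature uncertainty (illustrative)

    double pT      = 0.3 * B / kappa;            // GeV/c
    double sigmaPT = pT * (sigmaKappa / kappa);  // first-order propagation

    std::printf("pT = %.1f +- %.1f GeV/c\n", pT, sigmaPT);
    return 0;
}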
Data reconstruction and “production”: Data Summary “Tapes”

- Reconstruction turns hits + calibration + geometry into particle hypotheses.
- Reconstruction is time-consuming and must be done coherently → centrally organized “production”.
- The output is one or more levels of so-called Data Summary Tapes (DSTs), which are used as input to personal analysis.
- In practice, there is a lot of utility software to organize these data for easy analysis (bookkeeping).
- Programming of complicated event structures (see the sketch below):
  - Old: FORTRAN with home-made memory managers
  - Today: Object-Oriented design using C++ or Java
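As a hedged illustration of that last contrast (the type names are invented, not taken from any experiment's framework), an object-oriented event record holds variable-length collections directly, where FORTRAN needed home-made memory managers:

#include <vector>

// Sketch of an object-oriented event record (invented names). Every
// collection is variable-length; std::vector does the memory management
// that FORTRAN banks required home-made managers for.
struct Hit     { float x, y, z, response; };
struct Track   { float curvature; int nHits; };
struct Cluster { float energy, width, depth; int nHits; };

struct Event {
    long runNumber;
    long eventNumber;
    std::vector<Hit>     hits;      // however many this event produced
    std::vector<Track>   tracks;
    std::vector<Cluster> clusters;
};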
Personal data analysis

- Most modern detectors can address multiple physics topics.
- Hundreds or thousands of professors and students are distributed around the world.
- Modern experimental collaborations are an early example of virtual communities.
- Historical enablers for virtual communities:
  - Fellowship and exchange programmes
  - Telegraph, telex, telephone and telefax
  - National and international laboratories
  - Reasonably priced airline tickets
  - Computer inter-networking, e-mail and ftp
  - The World Wide Web
  - Multi-media applications on the Internet
- Today, physics analysis topics are increasingly tackled by virtual teams within these virtual communities.
- Coherency of data and algorithms must be maintained within the virtual team.
- “Production” for a modern detector is very complex and consumes many resources.
- DSTs contain all imagined reconstruction objects for all foreseen analyses, so they are big.
- Handling a DST often requires installing special software libraries and writing code in the “reconstruction dialect”.
- Solution: each virtual team develops code to extract a common analysis dataset for a given topic, which is written and manipulated using a “lingua franca”: n-tuples and the Physics Analysis Workstation (PAW).
- This is the physicist’s version of business data mining with Excel.
- Iterative process (time-scale of weeks or months); a minimal sketch of the extraction step follows this list:
  1. The team agrees on the complex algorithms to be coded in the extraction program.
  2. The algorithms are coded and tested, and the extraction is run over the DST.
  3. The n-tuple file is rapidly distributed via the computer network.
  4. The n-tuple is analyzed using non-compiled, platform-independent code (PAW macros today, Java in the future?) that is easily modified and shared by e-mail.
  5. Eventually limitations are reached; go back to step 1.
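A minimal sketch of step 2, the extraction of a flat n-tuple. It uses ROOT (discussed on the next slide) rather than PAW's HBOOK format, and the variable names and selection cut are invented for illustration.

#include "TFile.h"
#include "TNtuple.h"

// Minimal sketch (invented variables): extract a flat n-tuple of
// per-event summary quantities from the DST, using ROOT.
void extract() {
    TFile out("summary.root", "RECREATE");
    TNtuple nt("nt", "team analysis summary", "nTracks:eTotal:mInv");

    // ... loop over DST events, running the team-agreed algorithms ...
    float nTracks = 12, eTotal = 87.3f, mInv = 91.1f;  // one fake event
    if (eTotal > 50)                 // team-agreed selection cut
        nt.Fill(nTracks, eTotal, mInv);

    out.Write();   // the small file is then shared over the network
}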
- PAW was the “killer application” for physics in the 90s:
  - Interactive, just as powerful workstations became available
  - Platform-independent, in a very diverse workstation world
  - Graphical, just as X-windows brought graphics over the network
  - Simple for writing analysis macros, just as the complexity of the FORTRAN programming required in experiments decoupled most collaborators from the experiment’s code
- In summary, PAW was like going from DOS to Macintosh.
- One major limitation of PAW is the lack of variable-length structures or, more generally, of data objects.
- ROOT overcomes these limitations while keeping a philosophy similar to PAW’s.
- Java Analysis Studio tries to go further with “agents”.
- Which will be the “killer application” for LHC analysis?
- Is a Mac Classic on AppleTalk enough, or do we need a conceptual leap equivalent to the Web plus a Java-enabled browser?
- Will the personal n-tuple model work for the LHC?
- Do we need, and can we afford, to support our own interactive data analysis tool?
- Will one of the newer tools, such as Java Analysis Studio, go exponential in the open-source world?
- Many questions, one simple answer: it will be young people like you who will make the next step happen.
Monte Carlo simulation
- Monte Carlo simulation uses random numbers (see mathematics textbooks).
- Try the following (a coded version of the recipe follows this list):
  - Find a source of random numbers in the interval [0,1] (calculator, Excel, etc.).
  - Take a function that you want to simulate (e.g. y = x²) and normalize it to fit in the interval [0,1] for both x and y.
  - Find graph paper to histogram values of x.
  - Repeat this at least 20 times:
    - Throw two random numbers. Use the first as the value for x.
    - Evaluate the function y at x and compare its value to the second random number:
      - If the second random number is less than the function value, add a count to the histogram in the correct bin for x.
      - If it is greater, forget it.
  - Compare your histogram to the shape of the function.
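Here is a minimal, self-contained C++ version of the recipe (my sketch, not the lecturer's Excel file); it should reproduce an acceptance near 1/3, matching the 30% “efficiency” noted on the next slide.

#include <cstdio>
#include <cstdlib>

// Minimal sketch of the recipe above: keep x when a second uniform
// random number falls below y = x*x; the accepted entries then
// histogram the shape of the function.
int main() {
    const int nBins = 10, nTrials = 100;
    int histogram[nBins] = {0};
    int accepted = 0;

    for (int i = 0; i < nTrials; ++i) {
        double x = std::rand() / (RAND_MAX + 1.0);  // first random number
        double u = std::rand() / (RAND_MAX + 1.0);  // second random number
        if (u < x * x) {                            // below the curve: count it
            ++histogram[(int)(x * nBins)];
            ++accepted;
        }
    }

    // The acceptance approximates the area under y = x*x on [0,1], i.e. 1/3.
    std::printf("accepted %d of %d trials\n", accepted, nTrials);
    for (int b = 0; b < nBins; ++b)
        std::printf("bin %d: %d\n", b, histogram[b]);
    return 0;
}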
- If you don’t know how to program, you can pick up an Excel file from http://cern.ch/Manuel.Delfino/Brazil
- Here is the result:

[Figure: Excel chart “Example of Monte Carlo simulation of y=x*x for 100 trials”, histogramming the accepted sample of x values on [0, 1].]

- Note there are 30 entries, so the “efficiency” is 30%.
- Note the statistical fluctuations.
- Homework: How is the normalization done?
Statistics and error analysis
- Analysis involves selecting, counting and normalizing.
- Things are easier when you actually have a signal:
  - Understand the underlying statistics: Poisson, binomial, multinomial, etc. (a small Poisson sketch follows this list).
  - If measuring a differential distribution, understand the relation between the normalization of binned counts and of total counts.
  - Understand selection biases and their impact on observed distributions.
- Things are a lot harder when you place limits.
- Two observations:
  - If you cannot make an analytical estimate of the uncertainties, I won’t believe your result.
  - The expression “n-sigma effect” should be banned.
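As a small sketch of the first point (the expected count is invented), the Poisson distribution P(n; μ) = μⁿ e^(−μ) / n! governs the counts in a counting experiment; evaluating it in log space avoids overflow for large n:

#include <cmath>
#include <cstdio>

// Poisson probability P(n; mu) = mu^n * exp(-mu) / n!, evaluated in
// log space to stay numerically safe; lgamma(n + 1) = log(n!).
double poisson(int n, double mu) {
    return std::exp(n * std::log(mu) - mu - std::lgamma(n + 1.0));
}

int main() {
    const double mu = 4.2;  // expected count (invented for illustration)
    for (int n = 0; n <= 12; ++n)
        std::printf("P(n = %2d; mu = %.1f) = %.4f\n", n, mu, poisson(n, mu));
    return 0;
}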
Hypothesis testing
- You must understand Bayes’ theorem. And every time you think you understand it, you must make a big effort to understand it better! (A worked example follows this list.)
- Compare differential distributions of the data with the predictions of a “theory” or “model”:
  - Different theories
  - Different parameters for the same model
- Setting up the statistical test is often straightforward, which is why it is surprising that most people do it wrong.
- Taking account of resolution and systematic uncertainties is hard.
- Make the simulation look like the data to get your answers, even if the graphics look better the other way around!
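A worked Bayes' theorem example in an analysis setting (all numbers are invented for illustration): a particle-ID cut that is 90% efficient for kaons and fires on only 5% of pions still delivers a badly contaminated sample when pions are 20 times more abundant.

#include <cstdio>

// Worked Bayes' theorem example (invented numbers). A particle-ID cut
// fires with P(ID|K) = 0.90 and P(ID|pi) = 0.05, but pions outnumber
// kaons 20 to 1. What fraction of the tracks passing the cut are kaons?
int main() {
    const double pK  = 1.0 / 21.0;   // prior: kaon abundance
    const double pPi = 20.0 / 21.0;  // prior: pion abundance
    const double pIDgivenK  = 0.90;  // likelihoods of passing the cut
    const double pIDgivenPi = 0.05;

    double pID       = pIDgivenK * pK + pIDgivenPi * pPi;  // total probability
    double pKgivenID = pIDgivenK * pK / pID;               // Bayes' theorem

    std::printf("P(K | ID) = %.2f\n", pKgivenID);  // about 0.47, not 0.90!
    return 0;
}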
Simulation of particle production and interactions with the detector

- For particle production, combine Monte Carlo with:
  - Detailed particle properties
  - Detailed cross-sections predicted by theory or phenomenology
  - Computation of phase space
- The output consists of event records containing simulated particles (often called “4-vectors” by experimentalists).
- For simulating the detector, combine Monte Carlo with:
  - A detailed description of the detector
  - Detailed cross-sections for interactions with the detector materials
  - Detailed phenomenology of the mechanism producing the signal
  - Transport (ray-tracing) algorithms, including B fields
  - A digitization model mapping {x,y,z} to read-out channels (a minimal sketch follows this list)
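A minimal sketch of such a digitization model (the geometry and constants are invented): a simulated energy deposit at a given position is mapped to a read-out channel and an ADC response, which is exactly the mapping that calibration later inverts.

#include <cmath>

// Minimal digitization sketch (invented geometry and constants): map a
// simulated energy deposit at position x onto the read-out channel of a
// detector with regular strip pitch, and model the ADC response.
struct Digit { int channel; int adcCounts; };

Digit digitize(double xMM, double energyMeV) {
    const double pitchMM     = 0.5;   // strip pitch (assumption)
    const double mevPerCount = 0.01;  // inverse gain (assumption)
    Digit d;
    d.channel   = (int)std::floor(xMM / pitchMM);   // which strip was hit
    d.adcCounts = (int)(energyMeV / mevPerCount);   // sensor + ADC response
    return d;
}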
[Figure: a small part of the design of GEANT4. Its documentation even contains a reference to Jackson’s textbook!]
Digital representations of event data
- In principle, representing event data digitally should be very simple, except that:
  - everything comes in variable numbers: hits, tracks, clusters;
  - ambiguities lead to multiple relations;
  - particle identification may depend on the analysis hypothesis;
  - etc.
- In simple terms, events don’t look like bank-account data; they look like collections of objects.
- You can make a reasonable representation using relational tables, but actually using such data structures from Fortran programs is still cumbersome.
- Object-Oriented Programming is a better match, but C++ does not resolve all problems → frameworks.
Why physicists don’t (yet) use Excel and Oracle for their daily analysis

- Spreadsheets like Excel and relational databases like Oracle have a very “square” view of data. This is not a good match to the Data Worm.
- “Normal” people (banks and insurance companies) can define a priori the quantities that they will select on (the keys of the database). We usually derive selection criteria a posteriori, using quantities calculated from the stored data.
- We like (need?) to express queries as individualistic, detailed, low-level computer code. This is difficult to support in a database.
- But this is changing very rapidly due to data mining: businesses are interested in analyzing their raw data in unpredictable ways. Example: using cash-register tickets to choose sale items.
- Support for this requires a more “organic” view of data, for example object-relational databases.
Idealized view:

  Particle hypothesis (mass, charge, momentum, origin)
    one to many -> Cluster (position, width, depth, energy, number of hits)
                     one to many -> Calorimeter hit (position, response)
    one to many -> Track (origin, curvature, extrapolation, number of hits)
                     one to many -> Tracker hit (position, response)

  All of these are simple relations.
Reality:

  Particle hypothesis (mass, charge, momentum, origin)
    many to many -> Cluster (position, width, depth, energy, number of hits)
                      many to many -> Calorimeter hit (position, response)
    many to many -> Track (origin, curvature, extrapolation, number of hits)
                      many to many -> Tracker hit (position, response)

  The relations are complicated and algorithmic.
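A hedged sketch of how this many-to-many reality looks in an object model (the names are invented): the associations become index lists on both sides, filled by the reconstruction algorithms, which is precisely what a key-based relational schema copes with badly.

#include <vector>

// Sketch of many-to-many relations (invented names): a cluster may be
// shared by several particle hypotheses and a hypothesis may use several
// clusters, so both sides carry algorithmically-filled index lists.
struct CalorimeterCluster {
    float energy;
    std::vector<int> particleIndices;  // hypotheses this cluster feeds
};
struct ParticleHypothesis {
    float mass, charge;
    std::vector<int> clusterIndices;   // may overlap with other hypotheses
    std::vector<int> trackIndices;
};
struct Event {
    std::vector<CalorimeterCluster> clusters;
    std::vector<ParticleHypothesis> particles;
};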
The challenge of analysis for the LHC experiments
[Figure: event selection at the LHC. The online trigger keeps roughly 1 in 10^7 collisions and the offline analysis roughly 1 in 10^5 of those, for an overall selection of about 1 in 10^12.]
[Figure: offline data flow for one LHC experiment, with indicative rates and capacities: detector output of 0.1 to 1 GB/s into the Event Filter (selection and reconstruction, 35K SI95); raw data written at ~100 MB/s and reconstruction output at ~200 MB/s; Event Reconstruction (250K SI95) producing Event Summary Data; Batch Physics Analysis (350K SI95) reading at up to 64 GB/s; stores of order 500 TB to 1 PB/year; analysis objects and Event Simulation serving thousands of scientists distributed around the planet.]
The challenge of computing for the LHC
[Figure: “Long Term Tape Storage Estimates” at CERN in terabytes, 1995-2006. Current experiments and COMPASS account for the lower band; the LHC dominates from about 2005, pushing the scale towards 14,000 TB.]
[The same figure, annotated: LHC data accumulate at 10 PB/year, with signal-to-background ratios of up to 1:10^12.]
[Figure: “Estimated CPU Capacity required at CERN” in K SI95, 1998-2010. The LHC requirement rises to several thousand K SI95, far above the Moore’s-law curve (some measure of the capacity that technology advances provide for a constant number of processors or investment). For scale, the installed capacity in January 2000 was 3.5K SI95.]
[Figure: “Continued innovation” in computing technology.]
Solving the LHC Computing Challenge: Technology Development Domains

[Diagram: three technology development domains (APPLICATION, GRID, FABRIC), each seen from both a developer view and a user view.]
Solving the LHC Computing Challenge: the computing fabric at CERN (2006)

[Diagram: ten thousand dual-CPU boxes on a farm network of multi-Gigabit Ethernet switches; ten thousand disk units on a storage network; hundreds of tape drives; LAN-WAN routers and a Grid interface; plus real-time detector data coming in. Indicated data rates range from 0.8 to 960 Gbps.]
Solving the LHC Computing Challenge: Data-Intensive Grid Research

Grid Protocol Architecture (adapted from Ian Foster), set against the Internet Protocol Architecture:

  Application  - “Specialized services”: user- or application-specific distributed services
  Collective   - “Managing multiple resources”: ubiquitous infrastructure services
  Resource     - “Sharing single resources”: negotiating access, controlling use
  Connectivity - “Talking to things”: communication (Internet protocols) & security
  Fabric       - “Controlling things locally”: access to, and control of, resources

  The Internet Protocol Architecture (Application, Transport, Internet, Link) is shown alongside for comparison: Transport and Internet correspond roughly to Connectivity, and Link to Fabric.
Acknowledgements
- Many of the figures in this talk are from the Web sites of ATLAS, CMS, Aleph and Delphi.
- Thanks to Markus Elsing for the Delphi displays of tracking and of the nuclear interaction.
- The GEANT4 design diagram is from its documentation.
- Thanks to Les Robertson for the LHC computing diagrams.
- The Grid architecture diagram is adapted from Ian Foster.