Quantitative data analysis:
concepts, practices and tools
Domenico Giordano
Andrea Valassi
(CERN IT-SDC)
Analytics WG Meeting, 11th March 2015
D. Giordano and A. Valassi – Quantitative Data Analysis
Analytics WG – 11 Mar 2015
1
Motivation and foreword
• This is a condensed version of the 1.5h IT-SDC White Area
Lecture on February 18: https://indico.cern.ch/event/351669
– Shorter presentation with fewer slides, shorter practical demo
– Some slides of original WA can be found here in “Backup slides”
• Motivation of the original White Area Lecture(s)
– Data (and error) analysis is what HEP physicists do all the time
– “Big data” is ubiquitous, especially for IT professionals
– Recent discussions in IT-SDC meetings over data analysis/presentation
• A second White Area lecture is foreseen (date to be fixed)
– After the academic training (7-9 Apr) on statistics by H. Prosper
– And after collecting feedback and suggestions from SDC and others
– Tentative subject: correlations, combining, fitting and more
– Please let us know if you have any feedback from the analytics WG!
Scope of this presentation
• What it is – four interwoven threads
– Concepts
• In probability and statistics
– Best practices and tools
• For analysing data, designing experiments, presenting results
– Examples
• From physics, IT and other domains
• What it is not
– A complete course in 40’ (we suggest references, or use Google!)
– A universal recipe for all problems (you must use your common sense)
Outline
• Measurements and errors
– Probability and distributions, mean and standard deviation
• Populations and samples
– What is statistics and what do we do with it?
– The Law of Large Numbers and the Central Limit Theorem
• Designing experiments
• Presenting results
– Error bars: which ones?
– Displaying distributions: histograms or smoothing?
• Conclusions and references
• Introduction to tools and demo
Measurements and errors
• Measurement
– “Process of experimentally obtaining one or more quantity values that
can reasonably be attributed to a quantity” [Int. Voc. Metrology, 2012]
• In metrology and statistics, “error” does not mean “mistake”!
– “Error” means our uncertainty (how different result is from “true” value)
• We cannot avoid all sources of errors – we must instead:
– Design experiments to minimize errors (within costs)
– Estimate and report our uncertainty
Accuracy and precision
• Every measurement is subject to errors!
• Repeated measurements differ due to random “statistical” effects
– Precision: consistency of repeated measurements with one another
• Imperfect instruments shift all values by the same “systematic” bias
– Accuracy: absence of systematic bias
The first of many
distributions in this talk!...
Reporting errors – basics
• Measurement not complete without a statement of uncertainty
– Always quote an error (unless obvious or implicit, “I am 187cm tall”)
• Report results as “value ± error”
– Error range normally indicates a “confidence interval”, e.g. 68%
• Higgs: mH = 125.7 ± 0.4 GeV [Particle Data Group, Sep 2014]
– Different sources may be quoted separately (e.g. statistical / systematic)
• Rounding and significant figures
– Central value and error always quoted with the same decimal places
• 125.7 ± 0.4 (neither 126 ± 0.4, nor 125.7 ± 1)
– Error generally rounded to 1 or 2 significant figures, use common sense
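The rounding rule above can be sketched as a small Python helper (a hypothetical function, not part of the talk; the name and interface are ours):

```python
import math

def round_measurement(value, error, sig_figs=1):
    """Round the error to sig_figs significant figures and the central
    value to the same decimal place (illustrative sketch of the rule
    on this slide, not a general-purpose formatter)."""
    if error <= 0:
        return value, error
    exponent = math.floor(math.log10(abs(error)))   # place of the error's leading digit
    decimals = sig_figs - 1 - exponent              # decimal places to keep
    return round(value, decimals), round(error, decimals)

print(round_measurement(125.7432, 0.448))   # -> (125.7, 0.4), as in "125.7 ± 0.4"
```

Note that the central value inherits its decimal places from the error, which is the point of the slide.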
Probability and distributions
• Interpretations of probability
– Frequentist interpretation: objective, limit frequency over repeated trials
– Bayesian interpretation (not for this talk!): subjective, degree of belief
• Formal probability theory (Kolmogorov axioms)
– P ≥ 0 for one event; normalized, ΣP = 1; adds up for mutually exclusive events
• Probability distributions for discrete values xᵢ
– Probability that x equals xᵢ is P(xᵢ) ≥ 0
– Normalization is Σᵢ P(xᵢ) = 1
• Probability density functions (p.d.f.) for continuous values x
– Probability that x lies in [x₀, x₀ + dx] is f(x₀) dx ≥ 0
– Normalization is ∫ f(x) dx = 1
Expectation values for distributions
• Mean (or average) of f(x) – “centre of gravity” of distribution
– Expectation value of x, first-order moment around 0
– For continuous 1D variables: E[x] = ∫ x f(x) dx ≡ μ
– For discrete 1D variables: E[x] = Σᵢ xᵢ P(xᵢ) ≡ μ
• Standard deviation of f(x) – “width” of distribution
– Square root of variance V[x] = E[(x−μ)²] = E[x²] − μ² ≡ σ²
– Variance: expectation value of (x−μ)², second-order moment around μ
– For continuous 1D variables: V[x] = ∫ (x−μ)² f(x) dx ≡ σ²
– For discrete 1D variables: V[x] = Σᵢ (xᵢ−μ)² P(xᵢ) ≡ σ²
Poisson processes and distributions
• Independent events with a fixed average “rate” per unit “time”:
– #decays of a fixed amount of radioactive material in a fixed time interval
– Queuing theory: “memoryless” Poisson process in Erlang’s M/D/1 model
• Probability ๐‘ƒ ๐‘›; ๐œ† = (๏ฌ
๐‘›
– Asymmetric for low ๏ฌ
– More symmetric for high ๏ฌ
−๏ฌ
)
๐‘’
๐‘›!
• Expectation values
– E[n] = ๏ญ ≡ ๏ฌ and V[n] = ๐œŽ 2 = ๏ฌ
– i.e. ๐œŽ = ๏ฌ (width ~ ๐‘ !)
Gaussian (normal) distributions
• P.d.f. is f(x; μ, σ) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))
– “N(μ,σ²)” for mean μ and variance σ²
– “Standard normal” N(0,1) for μ=0, σ²=1
– The most classical “bell-shaped” curve
• Many properties that make it nice and easy to do math with!
– Error propagation: if x1, x2 independent and normal, x1+x2 is normal
– Parameter estimation: max likelihood and other estimators coincide
• Too often abused (as too nice!), but does describe real cases
– Good model for Brownian motion of many independent particles
– Generally relevant for large-number systems (→ central limit theorem)
NB: 68.3%, 95.4%, 99.7% confidence limits (1-2-3 σ) are only for Gaussians!
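The 68.3% / 95.4% / 99.7% figures in the NB can be reproduced with scipy (a minimal sketch, assuming scipy is installed as in the demo setup):

```python
from scipy.stats import norm

# Fraction of a Gaussian within +-1, 2, 3 sigma of its mean
for k in (1, 2, 3):
    frac = norm.cdf(k) - norm.cdf(-k)
    print(k, round(frac, 4))    # 0.6827, 0.9545, 0.9973
```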
Statistics – populations and samples
• Population: data set of all possible outcomes of an experiment
– The 7,162,119,434 (±?!) humans in the world in 2013 [UN DESA]
– The theoretically possible (“gedanken”) infinite collision events at LHC
• Sample: data set of outcomes of the few experiments we did
– The people who happen to be sitting in this room in this very moment
– The actual collision events in 4.9 fb⁻¹ for the 2014 LHC mtop measurement [arXiv]
• Population and sample are fundamental concepts in statistics
– Statistics is the study of collection, analysis and presentation of data
– We generally collect, analyse and present results from limited data
samples, assuming they come from an underlying population or model
What do we do with statistics?
• Two main methodologies in statistics
– Descriptive statistics – describe properties of collected data sample
– Inferential statistics – deduce properties of underlying population
• Statistical inference is based on probability theory
• In HEP we use inferential statistics for a variety of purposes:
– Parameter estimation
• Combining results, fitting model parameters
• Determining errors and confidence limits
– Statistical tests
• Goodness of fit – are results compatible with each other and with models?
• Hypothesis testing – which of two models is more favored by experiments?
– Design and optimization of experimental measurements
Challenges in statistical inference
• Difficult to make general claims from few specific observations
– Sample may not be “representative” of the population...
– Population may be a “moving target” that varies in time...
– Often we lack a priori knowledge of the shape of the population distribution...
– Or there would simply be too many parameters in realistic models...
• High-energy physicists are lucky!
– Any LHC data-taking sample is “representative” of the Laws of Nature!
– These Laws of Nature (for most practical purposes) do not vary in time!
– We can compute distributions using theoretical models for these Laws
because quantum mechanics is probabilistic (Monte Carlo simulations)
– And in most cases these models have a limited number of parameters
→ Result: spectacular agreement of experiments and statistical predictions
Others are not so lucky! (IT people included?…)
NOT EVERYTHING IS GAUSSIAN:
BEWARE OF THE LONG TAILS
OF YOUR DISTRIBUTIONS!!!
• “Outliers”? “Black swans”? How are data distributed? What is 1 “sigma”?
– “By any historical standard, the financial crisis of the past 18 months has been
extraordinary. Some suggested it is the worst since the early 1970s; others, the worst
since the Great Depression; others still, the worst in human history. […] Back in August
2007, the CFO of Goldman Sachs commented to the FT «We are seeing things that
were 25-standard deviation moves, several days in a row». […] A 25-sigma event
would be expected to occur once every 6×10¹²⁴ lives of the universe. That is quite a lot
of human histories. […] Fortunately, there is a simpler explanation – the model was
wrong.” [A. Haldane, Bank of England, 2009, “Why banks failed the stress test”]
The simplest example in parameter estimation
• Sample “statistics” are estimators of population “parameters”
– Parameter: population property, computed as function of population data
– Statistic: sample property, computed as function of sample data
• The two simplest population properties we are interested in:
– Population mean μ = E[x] = ∫ x f(x) dx
– Population variance σ² = V[x] = ∫ (x−μ)² f(x) dx
• We estimate them using the equivalent sample properties:
– Sample mean x̄ = Σᵢ xᵢ / N : estimator of population mean μ
– Sample variance s² = Σᵢ (xᵢ−x̄)² / (N−1) : estimator of population variance σ²
• The basic question is: how “good” is x̄ as an estimator of μ?
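These two estimators are one-liners in numpy (a sketch, assuming the numpy stack from the demo setup; note the N−1 in the sample variance):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=1000)   # sample from N(10, 2^2)

xbar = sample.mean()          # sample mean, estimator of mu = 10
s2 = sample.var(ddof=1)       # sample variance with N-1 (ddof=1),
                              # the unbiased estimator of sigma^2 = 4
print(xbar, s2)
```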
The Law of Large Numbers
Yes, winning a dice game once
is a matter of luck...
G. de La Tour – Les joueurs de dés (1651)
... but is there any such thing as
an optimal strategy for winning
“on average” in the “long-term”?
[Cardano, Fermat, Pascal, J. Bernoulli, Khinchin, Kolmogorov...]
Three fundamental results for N → ∞
• The Law of Large Numbers (for a sample of N independent variables xᵢ
identically distributed with finite population mean μ)
– Sample mean x̄ = Σᵢ xᵢ / N “converges” to population mean μ for N → ∞
– In practice: even if my dice game strategy is only winning 51% of the times,
“in the long term” it will make me a millionaire! (Yes, but how fast? See CLT!)
• The Central Limit Theorem (for a sample of N independent variables xᵢ
identically distributed with finite population mean μ and finite s.d. σ)
– Distribution of x̄ “converges” to Gaussian around μ with s.d. σ/√N for N → ∞
– Hence we use the standard error s/√N as uncertainty on x̄ as estimator of μ
• The Central Limit Theorem (for a sample of N independent variables xᵢ
non-identically distributed with finite population means μᵢ and finite s.d. σᵢ)
– Distribution of x̄ “converges” to Gaussian for N → ∞, irrespective of xᵢ shapes
– This is how we justify why we so often assume Gaussian errors on any
measurement: the net result of many independent effects is Gaussian
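Both results can be seen in a short simulation (a sketch with numpy; the exponential population and the sample size are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100            # sample size
M = 10000          # repeat the "experiment" M times

# Exponential population (decidedly non-Gaussian), mean 1, s.d. 1
means = rng.exponential(scale=1.0, size=(M, N)).mean(axis=1)

# LLN: the sample means cluster around mu = 1
# CLT: their spread shrinks like sigma/sqrt(N) = 1/sqrt(100) = 0.1
print(means.mean(), means.std())
```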
Designing experiments?
• In practice, we suggest retaining 3 common-sense concepts
• 1. Aim for reproducibility
– Even if observations are not reproducible (LHC), the method should be!
• 2. Try to use controlled environments if possible
– Remove external factors – failing which, must assign an error to them
• IT example: single out I/O timing and remove uninteresting CPU timing
– Understand if you are doing relative or absolute comparisons
• 3. Measurements are an iterative process
– The more you control one error, the more you must work on the next one
– In HEP, some systematics can be reduced by using larger data samples
Presenting errors – misunderstandings
• Statistics is relatively simple but is used in many different ways
– Different “common practices” and buzzwords in different fields (HEP,
astronomy, economics, medicine, psychology, neuroscience, biology, IT...)
– Different types of error bars are appropriate for different needs… (next slide)
• General rule: ensure that people are clear about what you are doing!
– “Many leading researchers… do not adequately distinguish CIs and SE bars.”
[Belia, Fidler, Williams, Cumming, Psychological Methods, 2005, “Researchers
Misunderstand Confidence Intervals and Standard Error Bars” ]
– “Experimental biologists are often unsure how error bars should be used and
interpreted.” [Cumming, Fidler, Vaux, JCB, 2007, “Error bars in experimental biology”]
– “The meaning of error bars is often misinterpreted, as is the statistical
significance of their overlap.” [Krzywinski, Altman, Nature Methods, 2013, “Error bars”]
– Read these last two articles to understand the different types of error bars!
There are different types of error bars!
• Two categories that can be called “descriptive” or “inferential” error bars
– [Cumming, Fidler, Vaux, JCB, 2007, “Error bars in experimental biology”]
• 1. Error bars describing the observed sample width and range of values
– Standard deviation of the sample
• Represents width of both sample and population (estimate of s.d. of the population)
– Box plots for the sample
• Representing median, quartiles/deciles/percentiles, outliers…
• 2. Error bars describing the inferred range for the population mean
– Standard error on the population mean
– Confidence intervals for the population mean (based on Gaussian assumption!)
• Use different types for different needs – and say which ones you use!
– See a specific IT example on one of the next slides
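The two categories can be computed side by side from one sample (a sketch, assuming scipy; the Gaussian/t assumption of the CI is the one flagged above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=5.0, size=30)   # one measured sample

# 1. Descriptive error bar: spread of the data themselves
sd = sample.std(ddof=1)

# 2. Inferential error bars: uncertainty on the estimated population mean
sem = sd / np.sqrt(len(sample))                     # standard error
ci_lo, ci_hi = stats.t.interval(0.95, df=len(sample) - 1,
                                loc=sample.mean(), scale=sem)

print(sd, sem, (ci_lo, ci_hi))   # sd is much larger than sem for N = 30
```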
Error bars – descriptive or inferential?
Ben Podgursky
• Example: which programming language
“earns you the highest salary”?!
– One post (based on git commit data and Rapleaf
API) was very popular in August 2013
– By popular request, it was later updated to
include error bars computed as 95% CIs
• Are CIs the most relevant error bars here?
– It depends on what you are interested in!
• Do you really care to claim “I work with the number 1 best paid language”?
– If you do, then yes: CIs (and/or standard errors) are relevant to you…
– You want to estimate the population means and be able to rank them
• Do you only care to know how much you can expect to earn in each case?
– In this case, then no: standard deviations or boxplots are much more relevant to you!
• Different questions require different tools to answer them!
Quartiles, deciles, percentiles, median…
• Height distribution in a sample with 20 persons
Median, quartile, decile, percentile require a linear ordering (mean and mode do not)
[Figure: height distribution of the sample, annotated with the 1st decile,
1st quartile, median (2nd quartile), sample mean, mode (most frequent height),
3rd quartile and 95th percentile. There can be many modes: many global/local maxima.]
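These order statistics are direct numpy calls (a sketch; the 20 heights below are made up for illustration, not the sample from the slide):

```python
import numpy as np

# Hypothetical heights (cm) for a sample of 20 persons
heights = np.array([158, 160, 162, 165, 165, 167, 168, 170, 170, 171,
                    172, 173, 174, 175, 176, 178, 180, 182, 185, 193])

q1, median, q3 = np.percentile(heights, [25, 50, 75])
d1 = np.percentile(heights, 10)      # 1st decile
p95 = np.percentile(heights, 95)     # 95th percentile
print(q1, median, q3, d1, p95)       # all require a linear ordering of the data
```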
Deciles and percentiles – real examples
“top decile” = tail of distribution with 10% highest income
“One thing that Piketty and his colleagues Emmanuel Saez
and Anthony Atkinson have done is to popularize the use of
simple charts that are easier to understand. In particular,
they present pictures showing the shares of over-all income
and wealth taken by various groups over time, including the
top decile of the income distribution and the top percentile
(respectively, the top ten per cent and those we call ‘the one
per cent’).” [J. Cassidy, The New Yorker, 2014]
Boxplots*
• Boxplot
– Edges (left/right): Q1/Q3 quartiles
• Q3-Q1 = IQR (inter quartile range)
– Line inside: median (Q2 quartile)
• “Whiskers”
– Extend to the most extreme data points
within Q1 − 1.5·IQR and Q3 + 1.5·IQR
– Outliers are plotted individually
• Other slightly different styles exist*
• Above: compare boxplots with mean and standard errors for three Gaussian N(0,1) samples
*Nice reference: [Krzywinski, Altman, Nature Methods, 2013, “Visualizing samples with box plots”]
Plots and distributions – basics
• Label the horizontal and vertical axes!
– Show which variables are plotted in x and y and in which units
– For histograms, you may use a label “#entries per x.yz <units> bin”
• Choose the x and y ranges wisely
– If one axis is Δtime, do you need to show negative values or start at 0?
– Use the same range across many plots if shown next to each other
• Or add a reference marker to make it visually easier to compare plots
• In a nutshell: explain what you are doing, make sure everything is clear!
– On the plot itself (label, legends, title) – especially in a presentation
– In a caption below the plot (and/or in the text) – in an article or thesis
Distributions - histograms or smoothing
• Histograms: entries per bin
– Advantages:
• Granularity (bin width) is obvious!
• Error per bin (~ √n) is obvious!
– Disadvantages:
• Not smooth (is it a disadvantage?)
• Depends on bin width
• Smoothing techniques
– Advantages:
• Smooth (is it an advantage?)
– Disadvantages:
• Also depend on “resolution” effect
• Unclear if shown alone (unclear x
resolution, unclear y error)
– Example here: kernel density
estimator (KDE) with N(0,1) kernel
f̂(x) = (1/(nh)) Σᵢ₌₁ⁿ K((x − xᵢ)/h)
• i.e. 1/n times a sum of Gaussians N(xᵢ, h²)
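A Gaussian KDE of exactly this form is available in scipy (a sketch; the bandwidth 0.3 and the grid are our own illustrative choices):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = rng.normal(size=500)                # sample to be smoothed

# Gaussian KDE: (1/(n h)) * sum_i K((x - x_i)/h); with a scalar
# bw_method the bandwidth h is that factor times the data s.d.
kde = gaussian_kde(data, bw_method=0.3)
grid = np.linspace(-4.0, 4.0, 81)
density = kde(grid)                        # estimated p.d.f. on the grid

# Like any p.d.f., it integrates to ~1 (rectangle rule on the grid)
print(float(density.sum() * (grid[1] - grid[0])))
```

Changing the bandwidth changes the result, which is the "resolution" caveat on this slide.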
Conclusions: take-away messages?
• Do use errors and error bars!
– When quoting measurements and errors, check your significant figures!
– Different types of error bars for different needs! Say which ones you are using!
• Descriptive, width of distributions – standard deviations σ, box plots…
• Inferential, population mean estimate uncertainty – standard errors σ/√n, CIs…
• [Why do we use σ/√n? Because of the Central Limit Theorem!]
• [Ask yourself: are you describing a sample or inferring population properties?]
• Beware of long tails and of outliers!
– More generally: we all love Gaussians but reality is often different!
• [Why do we love Gaussians? Because maths becomes so much easier with them!]
• [Why do we feel ok to abuse Gaussians? Because of the Central Limit Theorem!]
• Before analyzing data, design your experiment!
– Aim for reproducibility, reduce external factors – and it is an iterative process
• Make your plots understandable and consistent with one another
– Label your axes and use similar ranges and styles across different plots
– Be aware of binning effects (do you really prefer smoothing to histograms?)
References – books
• Our favorites
– J. Taylor, Introduction to Error Analysis – a great introductory classic
– A. v. d. Bos, Parameter Estimation for Scientists and Engineers – advanced
– J. O. Weatherall, The Physics of Finance – an enjoyable history of statistics
• Many other useful reads…
– F. James, Statistical Methods in Experimental Physics
– G. Cowan, Statistical Data Analysis
– R. Willink, Measurement Uncertainty and Probability
– I. Hughes, T. Hase, Measurements and their Uncertainties
– S. Brandt, Data Analysis
– R. A. Fisher, Statistical Methods, Experimental Design and Scientific Inference
References – on the web
• “Community” (to varying degrees) sites about maths/stats
– wikipedia.org (see description by planetmath)
– planetmath.org (see description by wikipedia)
– mathworld.wolfram.com (see description by wikipedia)
• Lecture courses on statistics
– For CERN Summer Students (Lyons 2014, Cranmer 2013, Cowan 2011)
– Chris Blake’s Statistics Lectures for astronomers
• Regular articles about statistics
– Statistics section in Particle Data Group’s Review of Particle Physics
– Nature “Points of Significance” open access column (“for biologists”)
• Launched in 2013, the International Year of Statistics!
• Tools: ROOT, R, iPython notebook – more links in “Demo” slides
Understood? Beware of outliers! ;-)
http://xkcd.com
Questions?
Demo
Data Analysis
“Data Analysis is the process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision-making.” [http://en.wikipedia.org/wiki/Data_analysis]
Data Analysis Tools
• There is a huge choice of tools for data analysis! Including...
– ROOT – essential HEP component for analysis, plots, fits, I/O...
– R – a “different implementation of S”, but as GNU Free Software
– Python scientific ecosystem – booming in all fields, from science to
finance (see for instance the iPython 2013 Progress Report)
• We chose to use Python for these WA lectures
– Most people in IT-SDC use Python already
– Widely used outside HEP and CERN, great web documentation
– Integration to some level with ROOT (PyROOT) will be shown
iPython and its Notebook
• iPython provides a rich architecture for interactive computing
– Interactive shells (terminal and Qt-based) and data visualization
– Easy to use, high performance tools for parallel computing
– A browser-based notebook with support for code, text, mathematical
expressions, inline plots and other rich media
• Extensive documentation on the web
– Entry point: http://ipython.org
– Tutorials, videos, talks, book, mailing lists, chat room
– A lot of documentation is written in iPython and available in GitHub
• Shared through nbviewer
– Suggested lectures on scientific computing with Python
– Mining the social web
– Probabilistic programming and Bayesian methods for hackers
Quick start – our typical setup
• Server: any CC7 (CERN CentOS7) node
– Python 2.7 (native)
– iPython, numpy, scipy, matplotlib, pandas (via easy_install)
– ROOT (from AFS)
• Client: any web browser
– Either on the local server where iPython is running
• ipython notebook
– Or with port forwarding via ssh/putty on a remote client
• ssh -Y -L 8888:localhost:8888 your_server
• ipython notebook --no-browser
• Further reading on how to configure the notebook here
Backup slides
Example – why do we need errors?
• I want to buy a dishwasher! I saw one that I like!
– Producer specs: “Width: 59.5 cm”
– I assume this means “between 59.4 and 59.6 cm”
• I measured the space in my kitchen and it is “roughly” 55 cm
– 55 ± 5 cm: must measure this better! may fit?
– 55 ± 1 cm: bad luck, I need a “norme Suisse”!
We imply an “error” (uncertainty, precision, accuracy)
even if we do not explicitly quote it!
Binomial distribution
• Experiment with only two outcomes (Bernoulli trial)
– Probability of success is p, probability of failure is q=1-p
– Typical example: throwing dice, only “6” is success with probability 1/6
• What is the probability of k successes in n independent trials?
– Pₚ(k; n) = [n!/(k!(n−k)!)] pᵏ (1−p)ⁿ⁻ᵏ
• Order not important: n!/(k!(n−k)!) combinations of k successes in n trials
– Mean (expected #successes) is Σₖ k Pₚ(k; n) = μ = np, as expected!
– Variance is E[k²] − μ² = σ² = np(1−p)
• Other examples
– #entries in one bin of a histogram given N entries in total
– decay channels of unstable particles (branching ratios)
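The mean and variance formulas can be verified with scipy for the dice example above (a sketch, assuming scipy as in the demo setup):

```python
from scipy.stats import binom

n, p = 60, 1 / 6          # 60 dice throws, "6" counts as a success
mean, var = binom.stats(n, p, moments='mv')
print(float(mean), float(var))    # np = 10, np(1-p) = 50/6

print(binom.pmf(10, n, p))        # probability of exactly 10 successes
```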
Poisson processes and distributions
• Independent events with a fixed average “rate” per unit “time”:
– #decays of a fixed amount of radioactive material in a fixed time interval
– #events in HEP process (of given cross-section) for a fixed “luminosity”
• #entries in one bin of a histogram for fixed “luminosity” (total is Poisson too)
– queuing theory: “memoryless” Poisson process in Erlang’s M/D/1 model
• Limit of binomial distribution for N → ∞, p → 0 with μ = Np fixed
• Probability P(n; λ) = λⁿ e^(−λ) / n!
– asymmetric for low λ
– more symmetric for high λ
• Expectation values
– E[n] = μ ≡ λ and V[n] = σ² = λ
– i.e. σ = √λ (width ~ √N!)
Gaussian (normal) distributions
• P.d.f. is f(x; μ, σ) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))
– “N(μ,σ²)” for mean μ and variance σ²
– “Standard normal” N(0,1) for μ=0, σ²=1
– The most classical “bell-shaped” curve
• Many properties that make it nice and easy to do math with!
– Error propagation: if x1, x2 independent and normal, x1+x2 is normal
– Parameter estimation: max likelihood and other estimators coincide
– χ² tests: the χ² distribution is derived from the normal distribution
• Too often abused (as too nice!), but does describe real cases
– Limit of Poisson distribution if population mean is a large number
– Good model for Brownian motion of many independent particles
– Generally relevant for large-number systems (→ central limit theorem)
NB: 68.3%, 95.4%, 99.7% confidence limits (1-2-3 σ) are only for Gaussians!
Many other distributions...
• Cauchy (or Lorentz)
– No mean! (integral undefined)
– HEP: non-relativistic Breit-Wigner
• Pareto
– Finite mean, infinite variance (α = 2)
• Exponential
What do we do with statistics in HEP?
[Figure: four example plots]
• Precision measurements!
– Measurement combinations (parameter estimation) [LEP EW WG March 2012]
– Standard Model tests (goodness of fit) [Gfitter Group Sep 2012]
• Searches... and discoveries!
– 5σ* signal significance (hypothesis testing) [ATLAS July 2012]
– 95% CL for exclusion (hypothesis testing) [FCC kick-off meeting, G. Rolandi]
*[3 × 10⁻⁷ one-sided Gaussian]
Reporting errors – basics
• Measurement not complete without a statement of uncertainty
– Always quote an error (unless obvious, “59.5 cm”)
• Report results as “value ± error”
– Error range normally indicates a “confidence interval”, e.g. 68%
• Higgs: mH = 125.7 ± 0.4 GeV [Particle Data Group, Sep 2014]
– May use asymmetric errors and separately quote different error sources
• mH = 124.3 +0.6/−0.5 (stat.) +0.5/−0.3 (syst.) GeV [ATLAS (ZZ* → 4l), Jul 2014]
• Rounding and significant figures
– Central value and error always quoted with the same decimal places
• 124.3 ± 0.6 (neither 124 ± 0.6, nor 124.3 ± 1)
– Error generally rounded to 1 or 2 significant figures, use common sense
• Keep more significant figures to combine several measurements
Others are not so lucky! (IT people included…)
[Images: newspaper front pages – “THE NEWS: PANIC!”, “THE POST: WALL ST. BLOODBATH”]
• Beware of your assumptions! (And beware of “magic” tools!)
– “The Black-Scholes equation was the mathematical justification for the trading that
plunged the world's banks into catastrophe [….] On 19 October 1987, Black Monday,
the world's stock markets lost more than 20% of their value within a few hours. An
event this extreme is virtually impossible under the model's assumptions […] Large
fluctuations in the stock market are far more common than Brownian motion predicts.
The reason is unrealistic assumptions – ignoring potential black swans. But usually the
model performed very well, so as time passed and confidence grew, many bankers
and traders forgot the model had limitations. They used the equation as a kind of
talisman, a bit of mathematical magic to protect them against criticism if anything went
wrong.” [I. Stewart, The Guardian, 2012]
“Outliers” and “black swans”
IN SUMMARY AND IN PRACTICE:
BEWARE OF THE LONG TAILS
OF YOUR DISTRIBUTIONS!!!
• How are your data distributed? What is one “sigma”?
– “By any historical standard, the financial crisis of the past 18 months has been
extraordinary. Some suggested it is the worst since the early 1970s; others, the worst
since the Great Depression; others still, the worst in human history. […] Back in August
2007, the CFO of Goldman Sachs commented to the FT «We are seeing things that
were 25-standard deviation moves, several days in a row». […] A 25-sigma event
would be expected to occur once every 6×10¹²⁴ lives of the universe. That is quite a lot
of human histories. […] Fortunately, there is a simpler explanation – the model was
wrong.” [A. Haldane, Bank of England, 2009, “Why banks failed the stress test”]
Understood? Beware of outliers! ;-)
http://xkcd.com
Box plot! We’ll get there...
XKCD comics-style plots in iPython? Click here.
The Law of Large Numbers (1)
Yes, winning a dice game once
is a matter of luck...
G. de La Tour – Les joueurs de dés (1651)
... but is there any such thing as
an optimal strategy for winning
“on average” in the “long-term”?
[Cardano, Fermat, Pascal, J. Bernoulli, Khinchin, Kolmogorov...]
The Law of Large Numbers (2)
• Given a sample of N independent variables xi = {x1,... xN}
identically distributed with finite population mean μ
– the law of large numbers says that the sample mean x̄ = (Σi xi)/N
“converges” to the population mean μ for N → ∞
– generalizable to non-identically distributed xi with finite and “gentle” σi
– holds irrespective of the shapes of the xi distributions
• In practice:
– Even if my dice game strategy only wins 51% of the time, “in the
long term” it will make me a millionaire!
– The LLN does not say yet how fast this will happen...
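The 51% game above can be sketched in a few lines of Python (the payoff of ±1 per round is an illustrative assumption, not from the slides): a strategy winning each independent round with probability 0.51 has an expected gain of 2·0.51 − 1 = 0.02 per round, and the average gain per round converges to that value.

```python
import random

random.seed(42)

p_win = 0.51                     # probability of winning one round
rounds = 100_000
gains = [1 if random.random() < p_win else -1 for _ in range(rounds)]

# Law of large numbers: the sample mean of the per-round gains
# approaches the expected gain 1*p - 1*(1-p) = 2*p - 1 = 0.02.
mean_gain = sum(gains) / rounds
```

After 100,000 rounds the average gain sits close to 0.02 – the LLN guarantees the convergence, but says nothing yet about its speed.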
The Central Limit Theorem (1)
• Given a sample of N independent variables xi = {x1,... xN}
identically distributed with population mean μ and finite s.d. σ
– the central limit theorem essentially says that the distribution of the
sample mean x̄ = (Σi xi)/N “converges” to a Gaussian around the
population mean μ with standard deviation σ/√N for N → ∞
– again true irrespective of the shape of the xi distribution
• In practice this tells us how fast the LLN converges: as 1/√N
– We use the standard error s/√N as the uncertainty on x̄ as an estimator of μ
– We’ll now see some examples of this in the demo
• And even more interesting is... the next slide...
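The kind of demo mentioned above can be sketched as follows (the choice of uniform variables is an illustrative assumption, not necessarily the distribution used in the actual demo): the sample means of N uniform variables cluster in a roughly Gaussian peak of width σ/√N around the population mean.

```python
import random
import statistics

random.seed(1)

N = 100        # sample size
trials = 2000  # number of independent sample means

# Each xi is Uniform(0,1): population mean 0.5, sigma = 1/sqrt(12).
means = [sum(random.random() for _ in range(N)) / N for _ in range(trials)]

sigma = (1 / 12) ** 0.5
expected_se = sigma / N ** 0.5          # CLT prediction: sigma/sqrt(N)
observed_se = statistics.pstdev(means)  # empirical spread of the sample means
```

The observed spread of the 2000 sample means agrees with the σ/√N prediction, even though each individual xi is flat, not Gaussian.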
The Central Limit Theorem (2)
• Given N independent variables xi = {x1,... xN} non-identically
distributed with population means μi and finite s.d. σi
– the central limit theorem essentially says that the distribution of the
mean (Σi xi)/N “converges” to a Gaussian around (Σi μi)/N with standard
deviation √(Σi σi²)/N for N → ∞
– again true irrespective of the shapes of the xi distributions
• as long as they are “nicely behaved” (~ the σi are not wildly different)
• In practice:
– This is how we justify why we so often assume Gaussian errors on any
measurement: the net result of many independent effects is Gaussian!
• The other reason we like Gaussians is still that they make the maths easier
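The non-identical case can be sketched in the same spirit (the mix of uniform distributions with different widths is an illustrative assumption): averaging N variables with different μi and σi, the spread of the average matches the √(Σi σi²)/N prediction.

```python
import random
import statistics

random.seed(7)

# N components, each Uniform(0, w) with different widths w:
# population mean w/2 and variance w^2/12 for each component.
widths = [1 + i % 3 for i in range(50)]
N = len(widths)

mu_pred = sum(w / 2 for w in widths) / N             # (sum mu_i)/N
se_pred = sum(w**2 / 12 for w in widths) ** 0.5 / N  # sqrt(sum sigma_i^2)/N

trials = 5000
avgs = [sum(random.uniform(0, w) for w in widths) / N for _ in range(trials)]

mean_obs = statistics.fmean(avgs)  # clusters around mu_pred
se_obs = statistics.pstdev(avgs)   # clusters around se_pred
```

This is the mechanism behind “the net result of many independent effects is Gaussian”: none of the components is Gaussian, yet their average is.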
External factors?
LEP energy calibration
• Effect of the Moon on ELEP ~ 120 ppm [CERN SL 94/07]
• Effect of TGVs on ELEP ~ a few 10 ppm [CERN SL 97/47]
– e.g. the departure of the 16:50 Geneva–Paris TGV on 13 Nov 1995
• LEP energy calibration
– Systematic error on combined LEP measurement of the Z boson mass
reduced from 7 MeV [Phys. Lett. B 1993] to 2 MeV [Phys. Rep. 2006]
– Again: aim for reproducibility, control external factors, iterative process...
Correlation does not imply causation
http://xkcd.com
Presenting results – and their errors
You can
– either show what you observed
– or show what you inferred from the observations
Should you always show error bars?
• Personal opinion: no, not really
– If the intrinsic limitations of your measurement and method are reasonably clear
– But it is really a question of common sense and case-by-case choices
• Example: Andrea’s Oracle performance plots for COOL query scalability
– Controlled environment, low external variations – private clients, low server load
– What really matters is the lower bound and it is obvious – many points
• Most often measure the fastest case – fluctuations are only on the slow side
– Measurements are relative – only interested in slopes, not in absolute values
– Fast and repeatable measurements – just retry if a plot looks weird
– Plots initially used to understand the issue, now just as a control – no fits
• The real analysis is on the Oracle server-side trace files
(The y-axis range is zoomed in the left plot – is this clear? No, probably a bad choice.)
Conclusions
• Statistics has implications at many levels and in many fields
– Daily needs, global economy, HEP, formal mathematics and more
– Different fields may have different buzzwords for similar concepts
• We reviewed a few basic concepts
• And we suggested a few tools and practices
Questions?