Quantitative data analysis: concepts, practices and tools
Domenico Giordano, Andrea Valassi (CERN IT-SDC)
Analytics WG Meeting, 11th March 2015

D. Giordano and A. Valassi – Quantitative Data Analysis – Analytics WG, 11 Mar 2015

Motivation and foreword
• This is a condensed version of the 1.5h IT-SDC White Area Lecture of February 18: https://indico.cern.ch/event/351669
  – Shorter presentation with fewer slides, shorter practical demo
  – Some slides of the original WA lecture can be found here in “Backup slides”
• Motivation of the original White Area Lecture(s)
  – Data (and error) analysis is what HEP physicists do all the time
  – “Big data” is ubiquitous, especially for IT professionals
  – Recent discussions in IT-SDC meetings on data analysis/presentation
• A second White Area lecture is foreseen (date to be fixed)
  – After the academic training (7-9 Apr) on statistics by H. Prosper
  – And after collecting feedback and suggestions from SDC and others
  – Tentative subject: correlations, combining, fitting and more
  – Please let us know if you have any feedback from the analytics WG!

Scope of this presentation
• What it is – four interwoven threads
  – Concepts
    • In probability and statistics
  – Best practices and tools
    • For analysing data, designing experiments, presenting results
  – Examples
    • From physics, IT and other domains
• What it is not
  – A complete course in 40’ (we suggest references, or use Google!)
  – A universal recipe for all problems (you must use your common sense)

Outline
• Measurements and errors
  – Probability and distributions, mean and standard deviation
• Populations and samples
  – What is statistics and what do we do with it?
  – The Law of Large Numbers and the Central Limit Theorem
• Designing experiments
• Presenting results
  – Error bars: which ones?
  – Displaying distributions: histograms or smoothing?
• Conclusions and references
• Introduction to tools and demo

Measurements and errors
• Measurement
  – “Process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity” [Int. Voc. Metrology, 2012]
• In metrology and statistics, “error” does not mean “mistake”!
  – “Error” means our uncertainty (how different the result is from the “true” value)
• We cannot avoid all sources of error – we must instead:
  – Design experiments to minimize errors (within costs)
  – Estimate and report our uncertainty

Accuracy and precision
• Every measurement is subject to errors!
• Repeated measurements differ due to random “statistical” effects
  – Precision: consistency of repeated measurements with one another
• Imperfect instruments shift all values by the same “systematic” bias
  – Accuracy: absence of systematic bias
The first of many distributions in this talk!...

Reporting errors – basics
• A measurement is not complete without a statement of uncertainty
  – Always quote an error (unless obvious or implicit: “I am 187 cm tall”)
• Report results as “value ± error”
  – The error range normally indicates a “confidence interval”, e.g. 68%
    • Higgs: mH = 125.7 ± 0.4 GeV [Particle Data Group, Sep 2014]
  – Different sources may be quoted separately (e.g. statistical / systematic)
• Rounding and significant figures
  – Central value and error always quoted with the same number of decimal places
    • 125.7 ± 0.4 (neither 126 ± 0.4, nor 125.7 ± 1)
  – Error generally rounded to 1 or 2 significant figures; use common sense
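The rounding rule above (round the error to one significant figure, then quote the central value with the same number of decimal places) can be sketched in Python 3. This is a hypothetical helper we wrote for illustration, not code from the talk:

```python
import math

def format_result(value, error, sig=1):
    """Quote 'value ± error' with the error rounded to `sig`
    significant figures and the value to the same decimals."""
    if error <= 0:
        raise ValueError("error must be positive")
    # decimal exponent of the leading significant digit of the error
    exponent = math.floor(math.log10(error))
    decimals = max(0, sig - 1 - exponent)
    return f"{value:.{decimals}f} ± {error:.{decimals}f}"
```

For example, `format_result(125.7432, 0.4)` reproduces the PDG-style "125.7 ± 0.4", while a large error such as 12.3 collapses the central value to "126 ± 12".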
Probability and distributions
• Interpretations of probability
  – Frequentist interpretation: objective, limit of the frequency over repeated trials
  – Bayesian interpretation (not for this talk!): subjective, degree of belief
• Formal probability theory (Kolmogorov axioms)
  – P ≥ 0 for any event; normalized, ΣP = 1; probabilities add up for mutually exclusive events
• Probability distributions for discrete values xi
  – Probability that x equals xi is P(xi) ≥ 0
  – Normalization is Σi P(xi) = 1
• Probability density functions (p.d.f.) for continuous values x
  – Probability that x lies in [x0, x0+dx] is f(x0) dx ≥ 0
  – Normalization is ∫ f(x) dx = 1

Expectation values for distributions
• Mean (or average) of f(x) – the “centre of gravity” of the distribution
  – Expectation value of x, first-order moment around 0
  – For continuous 1D variables: E[x] = ∫ x f(x) dx ≡ μ
  – For discrete 1D variables: E[x] = Σi xi P(xi) ≡ μ
• Standard deviation of f(x) – the “width” of the distribution
  – Square root of the variance: V[x] = E[(x−μ)²] = E[x²] − μ² ≡ σ²
  – Variance: expectation value of (x−μ)², second-order moment around μ
  – For continuous 1D variables: V[x] = ∫ (x−μ)² f(x) dx ≡ σ²
  – For discrete 1D variables: V[x] = Σi (xi−μ)² P(xi) ≡ σ²

Poisson processes and distributions
• Independent events with a fixed average “rate” per unit “time”:
  – #decays of a fixed amount of radioactive material in a fixed time interval
  – Queuing theory: “memoryless” Poisson process in Erlang’s M/D/1 model
• Probability P(n; λ) = λⁿ e^(−λ) / n!
  – Asymmetric for low λ
  – More symmetric for high λ
• Expectation values
  – E[n] = μ ≡ λ and V[n] = σ² = λ
  – i.e. σ = √λ (width ~ √μ!)
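The Poisson expectation values quoted above, E[n] = λ and V[n] = λ, are easy to check by simulation. A minimal sketch assuming a modern numpy (as in the demo environment):

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 4.0                       # average rate λ
n = rng.poisson(lam, size=200_000)

mean = n.mean()                 # should approach λ
var = n.var()                   # should also approach λ, so σ = √λ
```

With 200k draws both estimates agree with λ to well within a percent, illustrating the "width ~ √μ" rule.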
Gaussian (normal) distributions
• P.d.f. is f(x; μ, σ) = 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²))
  – “N(μ,σ²)” for mean μ and variance σ²
  – “Standard normal” N(0,1) for μ=0, σ²=1
  – The most classical “bell-shaped” curve
• Many properties that make it nice and easy to do math with!
  – Error propagation: if x1, x2 are independent and normal, x1+x2 is normal
  – Parameter estimation: maximum likelihood and other estimators coincide
• Too often abused (as it is too nice!), but it does describe real cases
  – Good model for the Brownian motion of many independent particles
  – Generally relevant for large-number systems (→ central limit theorem)
NB: the 68.3%, 95.4%, 99.7% confidence limits (1-2-3 σ) hold only for Gaussians!

Statistics – populations and samples
• Population: the data set of all possible outcomes of an experiment
  – The 7,162,119,434 (±?!) humans in the world in 2013 [UN DESA]
  – The theoretically possible (“gedanken”) infinite collision events at the LHC
• Sample: the data set of outcomes of the few experiments we did
  – The people who happen to be sitting in this room at this very moment
  – The actual collision events in 4.9 fb⁻¹ for the 2014 LHC mtop [arXiv]
• Population and sample are fundamental concepts in statistics
  – Statistics is the study of the collection, analysis and presentation of data
  – We generally collect, analyse and present results from limited data samples, assuming they come from an underlying population or model

What do we do with statistics?
• Two main methodologies in statistics
  – Descriptive statistics – describe properties of the collected data sample
  – Inferential statistics – deduce properties of the underlying population
• Statistical inference is based on probability theory
• In HEP we use inferential statistics for a variety of purposes:
  – Parameter estimation
    • Combining results, fitting model parameters
    • Determining errors and confidence limits
  – Statistical tests
    • Goodness of fit – are results compatible with each other and with models?
    • Hypothesis testing – which of two models is more favored by experiments?
  – Design and optimization of experimental measurements

Challenges in statistical inference
• It is difficult to make general claims from few specific observations
  – The sample may not be “representative” of the population...
  – The population may be a “moving target” that varies in time...
  – We often lack a priori knowledge of the shape of the population distribution...
  – Or there would simply be too many parameters in realistic models...
• High-energy physicists are lucky!
  – Any LHC data-taking sample is “representative” of the Laws of Nature!
  – These Laws of Nature (for most practical purposes) do not vary in time!
  – We can compute distributions using theoretical models for these Laws because quantum mechanics is probabilistic (Monte Carlo simulations)
  – And in most cases these models have a limited number of parameters
→ Result: spectacular agreement of experiments and statistical predictions

Others are not so lucky! (IT people included?…)
NOT EVERYTHING IS GAUSSIAN: BEWARE OF THE LONG TAILS OF YOUR DISTRIBUTIONS!!!
• “Outliers”? “Black swans”? How are data distributed? What is 1 “sigma”?
  – “By any historical standard, the financial crisis of the past 18 months has been extraordinary.
Some suggested it is the worst since the early 1970s; others, the worst since the Great Depression; others still, the worst in human history. […] Back in August 2007, the CFO of Goldman Sachs commented to the FT «We are seeing things that were 25-standard deviation moves, several days in a row». […] A 25-sigma event would be expected to occur once every 6x10^124 lives of the universe. That is quite a lot of human histories. […] Fortunately, there is a simpler explanation – the model was wrong.” [A. Haldane, Bank of England, 2009, “Why banks failed the stress test”]

The simplest example in parameter estimation
• Sample “statistics” are estimators of population “parameters”
  – Parameter: a population property, computed as a function of the population data
  – Statistic: a sample property, computed as a function of the sample data
• The two simplest population properties we are interested in:
  – Population mean μ = E[x] = ∫ x f(x) dx
  – Population variance σ² = V[x] = ∫ (x−μ)² f(x) dx
• We estimate them using the equivalent sample properties:
  – Sample mean x̄ = (Σ xi)/N : estimator of the population mean μ
  – Sample variance s² = Σ(xi−x̄)²/(N−1) : estimator of the population variance σ²
• The basic question is: how “good” is x̄ as an estimator of μ?

The Law of Large Numbers
Yes, winning a dice game once is a matter of luck...
G. de La Tour – Les joueurs de dés (1651)
... but is there any such thing as an optimal strategy for winning “on average” in the “long term”?
[Cardano, Fermat, Pascal, J. Bernoulli, Khinchin, Kolmogorov...]
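The sample mean and sample variance estimators defined above (note the N−1 Bessel correction in s²) map directly onto numpy's `ddof` argument. A minimal sketch with made-up numbers, not data from the talk:

```python
import numpy as np

sample = np.array([2.1, 1.9, 2.4, 2.0, 2.2])

xbar = sample.mean()              # sample mean x̄, estimator of μ
s2 = sample.var(ddof=1)           # sample variance with N-1 (unbiased)
s2_biased = sample.var(ddof=0)    # population formula: biased low for a sample
```

The `ddof=1` version is always larger than the `ddof=0` one by the factor N/(N−1), which is exactly what corrects the bias of estimating μ by x̄ from the same data.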
Three fundamental results for N → ∞
• The Law of Large Numbers (for a sample of N independent variables xi identically distributed with finite population mean μ)
  – The sample mean x̄ = (Σ xi)/N “converges” to the population mean μ for N → ∞
  – In practice: even if my dice game strategy only wins 51% of the time, “in the long term” it will make me a millionaire! (Yes, but how fast? See the CLT!)
• The Central Limit Theorem (for a sample of N independent variables xi identically distributed with finite population mean μ and finite s.d. σ)
  – The distribution of x̄ “converges” to a Gaussian around μ with s.d. σ/√N for N → ∞
  – Hence we use the standard error σ/√N as the uncertainty on x̄ as an estimator of μ
• The Central Limit Theorem (for a sample of N independent variables xi non-identically distributed with finite population means μi and finite s.d. σi)
  – The distribution of x̄ “converges” to a Gaussian for N → ∞, irrespective of the shapes of the xi
  – This is how we justify why we so often assume Gaussian errors on any measurement: the net result of many independent effects is Gaussian

Designing experiments?
• In practice, we suggest retaining 3 common-sense concepts
• 1. Aim for reproducibility
  – Even if observations are not reproducible (LHC), the method should be!
• 2. Try to use controlled environments if possible
  – Remove external factors – failing which, you must assign an error to them
    • IT example: single out I/O timing and remove uninteresting CPU timing
  – Understand whether you are doing relative or absolute comparisons
• 3. Measurements are an iterative process
  – The more you control one error, the more you must work on the next one
  – In HEP, some systematics can be reduced by using larger data samples
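The σ/√N standard error claimed by the Central Limit Theorem above can be checked by simulation, even for a clearly non-Gaussian population. A sketch assuming numpy, using an exponential population (μ = σ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 20_000

# exponential with scale 1: mean μ = 1, s.d. σ = 1, strongly asymmetric
samples = rng.exponential(scale=1.0, size=(trials, N))
means = samples.mean(axis=1)      # one sample mean per trial

observed_spread = means.std()     # empirical s.d. of the sample means
predicted = 1.0 / np.sqrt(N)      # σ/√N from the CLT
```

The distribution of `means` is already close to Gaussian at N = 100, and its spread matches σ/√N = 0.1 to a few per mille, even though each xi is exponential.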
Presenting errors – misunderstandings
• Statistics is relatively simple but is used in many different ways
  – Different “common practices” and buzzwords in different fields (HEP, astronomy, economics, medicine, psychology, neuroscience, biology, IT...)
  – Different types of error bars are appropriate for different needs… (next slide)
• General rule: ensure that people are clear about what you are doing!
  – “Many leading researchers… do not adequately distinguish CIs and SE bars.” [Belia, Fidler, Williams, Cumming, Psychological Methods, 2005, “Researchers Misunderstand Confidence Intervals and Standard Error Bars”]
  – “Experimental biologists are often unsure how error bars should be used and interpreted.” [Cumming, Fidler, Vaux, JCB, 2007, “Error bars in experimental biology”]
  – “The meaning of error bars is often misinterpreted, as is the statistical significance of their overlap.” [Krzywinski, Altman, Nature Methods, 2013, “Error bars”]
  – Read the last two articles to understand the different types of error bars!

There are different types of error bars!
• Two categories that can be called “descriptive” and “inferential” error bars
  – [Cumming, Fidler, Vaux, JCB, 2007, “Error bars in experimental biology”]
• 1. Error bars describing the observed sample width and range of values
  – Standard deviation of the sample
    • Represents the width of both sample and population (estimate of the s.d. of the population)
  – Box plots for the sample
    • Representing median, quartiles/deciles/percentiles, outliers…
• 2. Error bars describing the inferred range for the population mean
  – Standard error on the population mean
  – Confidence intervals for the population mean (based on a Gaussian assumption!)
• Use different types for different needs – and say which ones you use!
  – See a specific IT example on one of the next slides
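The two categories above differ by a factor √N: the standard deviation describes the sample, the standard error the inferred mean. A minimal numpy sketch computing both plus a Gaussian 95% CI (1.96 is the two-sided Gaussian 95% quantile):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=400)   # simulated sample

sd = x.std(ddof=1)                  # descriptive: width of the sample
se = sd / np.sqrt(len(x))           # inferential: uncertainty on the mean
ci95 = (x.mean() - 1.96 * se,       # Gaussian 95% confidence interval
        x.mean() + 1.96 * se)       # for the population mean
```

Here sd ≈ 2 stays fixed as you collect more data, while se shrinks as 1/√N; quoting one when you mean the other is exactly the confusion the articles above document.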
Error bars – descriptive or inferential?
• Example (Ben Podgursky): which programming language “earns you the highest salary”?!
  – One post (based on git commit data and the Rapleaf API) was very popular in August 2013
  – On popular request, it was later updated to include error bars computed as 95% CIs
• Are CIs the most relevant error bars here?
  – It depends on what you are interested in!
• Do you really care to claim “I work with the number 1 best paid language”?
  – If you do, then yes: CIs (and/or standard errors) are relevant to you…
  – You want to estimate the population means and be able to rank them
• Do you only care to know how much you can expect to earn in each case?
  – In this case, then no: standard deviations or boxplots are much more relevant to you!
• Different questions require different tools to answer them!

Quartiles, deciles, percentiles, median…
• Height distribution in a sample of 20 persons
  – Median, quartiles, deciles and percentiles require a linear ordering (the mean and the mode do not)
  – [Plot: 1st decile, 1st quartile, median (2nd quartile), 3rd quartile, 95th percentile, sample mean, and mode (the most frequent height)]
  – There can be many modes! Many global/local maxima.

Deciles and percentiles – real examples
• “Top decile” = tail of the distribution with the 10% highest incomes
  – “One thing that Piketty and his colleagues Emmanuel Saez and Anthony Atkinson have done is to popularize the use of simple charts that are easier to understand. In particular, they present pictures showing the shares of over-all income and wealth taken by various groups over time, including the top decile of the income distribution and the top percentile (respectively, the top ten per cent and those we call ‘the one per cent’).” [J. Cassidy, The New Yorker, 2014]
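Quartiles, percentiles and the median from the slides above are one-liners in numpy. The 20 heights below are made-up illustrative values, not the sample from the slide:

```python
import numpy as np

heights = np.array([158, 162, 165, 167, 168, 170, 171, 172, 173, 174,
                    175, 176, 177, 178, 179, 181, 183, 185, 188, 193])

# quartiles and an arbitrary percentile (linear interpolation between ranks)
q1, median, q3 = np.percentile(heights, [25, 50, 75])
p95 = np.percentile(heights, 95)
iqr = q3 - q1   # inter-quartile range, used later for boxplot whiskers
```

Unlike the mean, all of these only require the sorted ordering of the data, which is why they stay meaningful for long-tailed distributions such as incomes.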
Boxplots*
• Boxplot
  – Edges (left/right): Q1/Q3 quartiles
    • Q3−Q1 = IQR (inter-quartile range)
  – Line inside: median (Q2 quartile)
• “Whiskers”
  – Extend to the most extreme data points within median ± 1.5 IQR
  – Outliers are plotted individually
• Other slightly different styles exist*
• Above: compare boxplots with mean and standard errors for three Gaussian N(0,1) samples
*Nice reference: [Krzywinski, Altman, Nature Methods, 2013, “Visualizing samples with box plots”]

Plots and distributions – basics
• Label the horizontal and vertical axes!
  – Show which variables are plotted in x and y, and in which units
  – For histograms, you may use a label like “#entries per x.yz <units> bin”
• Choose the x and y ranges wisely
  – If one axis is, e.g., a time, do you need to show negative values, or can you start at 0?
  – Use the same range across many plots if they are shown next to each other
    • Or add a reference marker to make it visually easier to compare plots
• In a nutshell: explain what you are doing, make sure everything is clear!
  – On the plot itself (labels, legends, title) – especially in a presentation
  – In a caption below the plot (and/or in the text) – in an article or thesis

Distributions – histograms or smoothing?
• Histograms: entries per bin
  – Advantages:
    • Granularity (bin width) is obvious!
    • Error per bin (~ √n) is obvious!
  – Disadvantages:
    • Not smooth (is it a disadvantage?)
    • Depends on the bin width
• Smoothing techniques
  – Advantages:
    • Smooth (is it an advantage?)
  – Disadvantages:
    • Also depend on a “resolution” effect
    • Unclear if shown alone (unclear x resolution, unclear y error)
  – Example here: kernel density estimator (KDE) with an N(0,1) kernel
    • f̂(x) = 1/(nh) Σi K((x−xi)/h), i.e. 1/n times a sum of Gaussians of width h centred at the xi
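The KDE formula above is short enough to implement directly. A sketch assuming numpy, where `h` is the kernel bandwidth (the smoothing "resolution" the slide warns about):

```python
import numpy as np

def kde(x_grid, data, h):
    """f̂(x) = 1/(n h) * Σi K((x - x_i)/h) with a standard normal kernel K."""
    u = (x_grid[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=1000)     # simulated N(0,1) sample
grid = np.linspace(-4, 4, 161)
density = kde(grid, data, h=0.3)           # smooth estimate of the p.d.f.
```

Like a histogram's bin width, `h` is a free choice: too small and the curve is spiky, too large and real features are washed out. The difference is that `h` is invisible unless you report it.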
Conclusions: take-away messages?
• Do use errors and error bars!
  – When quoting measurements and errors, check your significant figures!
  – Different types of error bars for different needs! Say which ones you are using!
    • Descriptive, width of distributions – standard deviations σ, box plots…
    • Inferential, uncertainty on the population mean estimate – standard errors σ/√N, CIs…
    • [Why do we use σ/√N? Because of the Central Limit Theorem!]
    • [Ask yourself: are you describing a sample or inferring population properties?]
• Beware of long tails and of outliers!
  – More generally: we all love Gaussians, but reality is often different!
    • [Why do we love Gaussians? Because maths becomes so much easier with them!]
    • [Why do we feel ok to abuse Gaussians? Because of the Central Limit Theorem!]
• Before analyzing data, design your experiment!
  – Aim for reproducibility, reduce external factors – and it is an iterative process
• Make your plots understandable and consistent with one another
  – Label your axes and use similar ranges and styles across different plots
  – Be aware of binning effects (do you really prefer smoothing to histograms?)

References – books
• Our favorites
  – J. Taylor, Introduction to Error Analysis – a great introductory classic
  – A. v. d. Bos, Parameter Estimation for Scientists and Engineers – advanced
  – J. O. Weatherall, The Physics of Finance – an enjoyable history of statistics
• Many other useful reads…
  – F. James, Statistical Methods in Experimental Physics
  – G. Cowan, Statistical Data Analysis
  – R. Willink, Measurement Uncertainty and Probability
  – I. Hughes, T. Hase, Measurements and their Uncertainties
  – S. Brandt, Data Analysis
  – R. A. Fisher, Statistical Methods, Experimental Design and Scientific Inference
References – on the web
• “Community” (to varying degrees) sites about maths/stats
  – wikipedia.org (see description by planetmath)
  – planetmath.org (see description by wikipedia)
  – mathworld.wolfram.com (see description by wikipedia)
• Lecture courses on statistics
  – For CERN Summer Students (Lyons 2014, Cranmer 2013, Cowan 2011)
  – Chris Blake’s statistics lectures for astronomers
• Regular articles about statistics
  – Statistics section in the Particle Data Group’s Review of Particle Physics
  – Nature “Points of Significance” open-access column (“for biologists”)
    • Launched in 2013, the International Year of Statistics!
• Tools: ROOT, R, iPython notebook – more links in the “Demo” slides

Understood? Beware of outliers! ;-)
http://xkcd.com
Questions?

Demo

Data Analysis
“Data Analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.” [http://en.wikipedia.org/wiki/Data_analysis]

Data Analysis Tools
• There is a huge choice of tools for data analysis! Including...
  – ROOT – essential HEP component for analysis, plots, fits, I/O...
  – R – a “different implementation of S”, but as GNU Free Software
  – The Python scientific ecosystem – booming in all fields, from science to finance (see for instance the iPython 2013 Progress Report)
• We chose to use Python for these WA lectures
  – Most people in IT-SDC use Python already
  – Widely used outside HEP and CERN, great web documentation
  – Integration to some level with ROOT (PyROOT) will be shown
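As a small taste of the Python ecosystem mentioned above, pandas gives a one-line descriptive summary (count, mean, std, quartiles) of a sample. The data here are simulated for illustration, not taken from the demo:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# a long-tailed "latency" sample, the kind IT monitoring often produces
df = pd.DataFrame({"latency_ms": rng.exponential(scale=20.0, size=1000)})

summary = df["latency_ms"].describe()   # count, mean, std, min, quartiles, max
```

Note how for this long-tailed sample the mean sits well above the median, a quick first warning that a "value ± σ" summary alone would be misleading.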
iPython and its Notebook
• iPython provides a rich architecture for interactive computing
  – Interactive shells (terminal and Qt-based) and data visualization
  – Easy-to-use, high-performance tools for parallel computing
  – A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media
• Extensive documentation on the web
  – Entry point: http://ipython.org
  – Tutorials, videos, talks, a book, mailing lists, a chat room
  – A lot of documentation is written in iPython and available on GitHub
    • Shared through nbviewer
  – Suggested lectures on scientific computing with Python
  – Mining the social web
  – Probabilistic programming and Bayesian methods for hackers

Quick start – our typical setup
• Server: any CC7 (CERN CentOS7) node
  – Python 2.7 (native)
  – iPython, numpy, scipy, matplotlib, pandas (via easy_install)
  – ROOT (from AFS)
• Client: any web browser
  – Either on the local server where iPython is running
    • ipython notebook
  – Or with port forwarding via ssh/putty on a remote client
    • ssh -Y -L 8888:localhost:8888 your_server
    • ipython notebook --no-browser
• Further reading on how to configure the notebook here

Backup slides

Example – why do we need errors?
• I want to buy a dishwasher! I saw one that I like!
  – Producer specs: “Width: 59.5 cm”
  – I assume this means “between 59.4 and 59.6 cm”
• I measured the space in my kitchen and it is “roughly” 55 cm
  – 55 ± 5 cm: must measure this better! may fit?
  – 55 ± 1 cm: bad luck, I need a “norme Suisse”!
• We imply an “error” (uncertainty, precision, accuracy) even if we do not explicitly quote it!
Binomial distribution
• Experiment with only two outcomes (Bernoulli trial)
  – Probability of success is p, probability of failure is q = 1−p
  – Typical example: throwing a die, where only “6” is a success, with probability 1/6
• What is the probability of k successes in n independent trials?
  – P(k; n, p) = n!/(k!(n−k)!) · p^k (1−p)^(n−k)
    • Order is not important: there are n!/(k!(n−k)!) permutations
  – Mean (expected #successes) is μ = Σk k P(k; n, p) = np, as expected!
  – Variance is E[k²] − μ² = σ² = np(1−p)
• Other examples
  – #entries in one bin of a histogram, given N entries in total
  – decay channels of unstable particles (branching ratios)

Poisson processes and distributions
• Independent events with a fixed average “rate” per unit “time”:
  – #decays of a fixed amount of radioactive material in a fixed time interval
  – #events in a HEP process (of given cross-section) for a fixed “luminosity”
    • #entries in one bin of a histogram for fixed “luminosity” (the total is Poisson too)
  – queuing theory: “memoryless” Poisson process in Erlang’s M/D/1 model
• Limit of the binomial distribution for N → ∞, p → 0 with μ = Np fixed
• Probability P(n; λ) = λⁿ e^(−λ) / n!
  – asymmetric for low λ
  – more symmetric for high λ
• Expectation values
  – E[n] = μ ≡ λ and V[n] = σ² = λ
  – i.e. σ = √λ (width ~ √μ!)

Gaussian (normal) distributions
• P.d.f. is f(x; μ, σ) = 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²))
  – “N(μ,σ²)” for mean μ and variance σ²
  – “Standard normal” N(0,1) for μ=0, σ²=1
  – The most classical “bell-shaped” curve
• Many properties that make it nice and easy to do math with!
  – Error propagation: if x1, x2 are independent and normal, x1+x2 is normal
  – Parameter estimation: maximum likelihood and other estimators coincide
  – χ² tests: the χ² distribution is derived from the normal distribution
• Too often abused (as it is too nice!), but it does describe real cases
  – Limit of the Poisson distribution when the population mean is a large number
  – Good model for the Brownian motion of many independent particles
  – Generally relevant for large-number systems (→ central limit theorem)
NB: the 68.3%, 95.4%, 99.7% confidence limits (1-2-3 σ) hold only for Gaussians!

Many other distributions...
• Cauchy (or Lorentz)
  – No mean! (the integral is undefined)
  – HEP: non-relativistic Breit-Wigner
• Pareto
  – Finite mean, infinite variance (e.g. for α=2)
• Exponential

What do we do with statistics in HEP?
• Measurement combinations and precision measurements! (parameter estimation) [LEP EW WG, March 2012]
• Standard Model tests (goodness of fit) [Gfitter Group, Sep 2012]
• 5σ signal significance for discovery (hypothesis testing; 5σ = 3x10⁻⁷ one-sided Gaussian) [ATLAS, July 2012]
• 95% CL for exclusion (hypothesis testing) – searches... and discoveries! [FCC kick-off meeting, G. Rolandi]

Reporting errors – basics
• A measurement is not complete without a statement of uncertainty
  – Always quote an error (unless obvious: “59.5 cm”)
• Report results as “value ± error”
  – The error range normally indicates a “confidence interval”, e.g. 68%
    • Higgs: mH = 125.7 ± 0.4 GeV [Particle Data Group, Sep 2014]
  – May use asymmetric errors and separately quote different error sources
    • mH = 124.3 +0.6 −0.5 (stat.) +0.5 −0.3 (syst.)
GeV [ATLAS (ZZ*→4l), Jul 2014]
• Rounding and significant figures
  – Central value and error always quoted with the same number of decimal places
    • 124.3 ± 0.6 (neither 124 ± 0.6, nor 124.3 ± 1)
  – Error generally rounded to 1 or 2 significant figures; use common sense
    • Keep more significant figures when combining several measurements

Others are not so lucky! (IT people included…)
• Beware of your assumptions! (And beware of “magic” tools!)
  – “The Black-Scholes equation was the mathematical justification for the trading that plunged the world's banks into catastrophe [….] On 19 October 1987, Black Monday, the world's stock markets lost more than 20% of their value within a few hours. An event this extreme is virtually impossible under the model's assumptions […] Large fluctuations in the stock market are far more common than Brownian motion predicts. The reason is unrealistic assumptions – ignoring potential black swans. But usually the model performed very well, so as time passed and confidence grew, many bankers and traders forgot the model had limitations. They used the equation as a kind of talisman, a bit of mathematical magic to protect them against criticism if anything went wrong.” [I. Stewart, The Guardian, 2012]

“Outliers” and “black swans”
IN SUMMARY AND IN PRACTICE: BEWARE OF THE LONG TAILS OF YOUR DISTRIBUTIONS!!!
• How are your data distributed? What is one “sigma”?
  – “By any historical standard, the financial crisis of the past 18 months has been extraordinary. Some suggested it is the worst since the early 1970s; others, the worst since the Great Depression; others still, the worst in human history. […] Back in August 2007, the CFO of Goldman Sachs commented to the FT «We are seeing things that were 25-standard deviation moves, several days in a row».
[…] A 25-sigma event would be expected to occur once every 6x10^124 lives of the universe. That is quite a lot of human histories. […] Fortunately, there is a simpler explanation – the model was wrong.” [A. Haldane, Bank of England, 2009, “Why banks failed the stress test”]

Understood? Beware of outliers! ;-)
http://xkcd.com
Box plot! We’ll get there...
XKCD comics-style plots in iPython? Click here.

The Law of Large Numbers (1)
Yes, winning a dice game once is a matter of luck...
G. de La Tour – Les joueurs de dés (1651)
... but is there any such thing as an optimal strategy for winning “on average” in the “long term”?
[Cardano, Fermat, Pascal, J. Bernoulli, Khinchin, Kolmogorov...]

The Law of Large Numbers (2)
• Given a sample of N independent variables xi = {x1,... xN} identically distributed with finite population mean μ
  – the law of large numbers says that the sample mean x̄ = (Σ xi)/N “converges” to the population mean μ for N → ∞
  – generalizable to non-identically distributed xi with finite and “gentle” σi
  – holds irrespective of the shapes of the xi distributions
• In practice:
  – Even if my dice game strategy only wins 51% of the time, “in the long term” it will make me a millionaire!
  – The LLN does not yet say how fast this will happen...

The Central Limit Theorem (1)
• Given a sample of N independent variables xi = {x1,... xN} identically distributed with population mean μ and finite s.d.
σ
  – the central limit theorem essentially says that the distribution of the sample mean x̄ = (Σ xi)/N “converges” to a Gaussian around the population mean μ with standard deviation σ/√N for N → ∞
  – again true irrespective of the shape of the xi distribution
• In practice this tells us how fast the LLN converges: as 1/√N
  – We use the standard error σ/√N as the uncertainty on x̄ as an estimator of μ
  – We’ll now see some examples of this in the demo
• And even more interesting is... the next slide...

The Central Limit Theorem (2)
• Given N independent variables xi = {x1,... xN} non-identically distributed with population means μi and finite s.d. σi
  – the central limit theorem essentially says that the distribution of the mean (Σ xi)/N “converges” to a Gaussian around (Σ μi)/N with standard deviation √(Σ σi²)/N for N → ∞
  – again true irrespective of the shapes of the xi distributions
    • as long as they are “nicely behaved” (~ the σi are not wildly different)
• In practice:
  – This is how we justify why we so often assume Gaussian errors on any measurement: the net result of many independent effects is Gaussian!
    • The reason why we like Gaussians still being that it is easier to do maths

External factors? LEP energy calibration
• Effect of the Moon on E_LEP ~ 120 ppm [CERN SL 94/07]
• Effect of TGVs on E_LEP ~ a few 10 ppm [CERN SL 97/47]
  – Departure of the 16:50 Geneva-Paris TGV on 13 Nov 1995
• LEP energy calibration
  – Systematic error on the combined LEP measurement of the Z boson mass reduced from 7 MeV [Phys. Lett. B 1993] to 2 MeV [Phys. Rep. 2006]
  – Again: aim for reproducibility, control external factors, iterative process...

Correlation does not imply causation
http://xkcd.com
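The 51%-winning dice strategy from the LLN backup slides above can be simulated directly: the running win fraction settles onto 0.51 just as the Law of Large Numbers promises. A sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(5)
p_win = 0.51                        # a strategy that wins 51% of the time

# simulate one million games: True where the strategy wins
games = rng.random(1_000_000) < p_win

# running fraction of wins after each game: noisy at first,
# then converging to 0.51 at the CLT rate ~ 1/√N
running_mean = np.cumsum(games) / np.arange(1, len(games) + 1)
```

Plotting `running_mean` against the game index shows exactly the LLN/CLT picture: wild early fluctuations, then a band around 0.51 that narrows as 1/√N.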
Presenting results – and their errors
You can:
- either show what you observed,
- or show what you inferred from the observations.

Should you always show error bars?
• Personal opinion: no, not really
  – If the intrinsic limitations of your measurement and method are reasonably clear
  – But it is really a question of common sense and case-by-case choices
• Example: Andrea’s Oracle performance plots for COOL query scalability
  – Controlled environment, low external variations – private clients, low server load
  – What really matters is the lower bound, and it is obvious – many points
    • Most often measure the fastest case – fluctuations are only on the slow side
  – Measurements are relative – only interested in slopes, not in face values
  – Fast and repeatable measurements – just retry if a plot looks weird
  – Plots initially used to understand the issue, now just as a control – no fits
    • The real analysis is on the Oracle server-side trace files
  – The y-axis range is zoomed in the left plot – is this clear? no, probably a bad choice

Conclusions
• Statistics has implications at many levels and in many fields
  – Daily needs, the global economy, HEP, formal mathematics and more
  – Different fields may have different buzzwords for similar concepts
• We recalled a few basic concepts
• And we suggested a few tools and practices

Questions?