Quantitative data analysis: concepts, practices and tools
Domenico Giordano, Andrea Valassi (CERN IT-SDC)
White Area Lecture, 18 February 2015

Lecture scope

• What it is – four interwoven threads
  – Concepts: in probability and statistics
  – Best practices and tools: for analysing data, designing experiments, presenting results
  – Examples: from physics, IT and other domains
• What it is not
  – a complete course in 2h (we suggest references, or use Google!)
  – a universal recipe for all problems (you must use your common sense)
• Two lectures
  – Lecture 1 (Feb 18): foundations of concepts, tools and practices
  – Lecture 2 (Mar 4?): correlations, combining, fitting and more
• Postpone to after the academic training (7-9 Apr) on statistics by H. Prosper?

Motivation

• Why these talks?
  – Data (and error) analysis is what HEP physicists do all the time!
  – "Big data" is ubiquitous, especially for IT professionals
• Why us? Why as IT-SDC WA lectures?
  – Experience in HEP analysis (precision measurements, combination...)
  – Recent discussions in IT-SDC meetings over data analysis and presentation

Outline (WA #1)

• Measurements and errors
  – Probability and distributions, mean and standard deviation
  – Introduction to tools and first demo
• Populations and samples
  – What is statistics and what do we do with it?
  – The Law of Large Numbers and the Central Limit Theorem
  – Second demo
• Designing experiments
• Presenting results
  – Error bars: which ones?
  – Displaying distributions: histograms or smoothing?
• Conclusions and references

Errors – probability and distributions

Example – why do we need errors?

• I want to buy a dishwasher! I saw one that I like!
  – Producer specs: "Width: 59.5 cm"
  – I assume this means "between 59.4 and 59.6 cm"
• I measured the space in my kitchen and it is "roughly" 55 cm
  – 55 ± 5 cm: may fit? I must measure this better!
  – 55 ± 1 cm: bad luck, I need a "norme Suisse"!
• We imply an "error" (uncertainty, precision, accuracy) even if we do not explicitly quote it!

Measurements and errors

• Measurement: "process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity" [Int. Voc. Metrology, 2012]
• In metrology and statistics, "error" does not mean "mistake"!
  – "error" means our uncertainty (how different the result is from the "true" value)
• Every measurement is subject to errors
  – repeated measurements differ due to random "statistical" effects
  – imperfect instruments shift all values by the same "systematic" bias
• We cannot avoid all sources of errors – we must instead:
  – design experiments to minimize errors (within costs)
  – estimate and report our uncertainty
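Not part of the original slides: a minimal numpy sketch of the two kinds of error just described, using a hypothetical "true" width, a random statistical scatter and a constant systematic bias of a mis-calibrated instrument (all numbers are assumptions for illustration).

```python
import numpy as np

np.random.seed(42)

true_value = 59.5        # cm, the quantity being measured (hypothetical)
statistical_sigma = 0.2  # cm, random scatter of each individual reading (assumed)
systematic_bias = 0.3    # cm, constant offset of a mis-calibrated instrument (assumed)

# 100 repeated measurements: each reading = true value + bias + random noise
measurements = true_value + systematic_bias + np.random.normal(0.0, statistical_sigma, 100)

print("mean of repeated measurements: %.2f cm" % measurements.mean())       # ~59.8, shifted by the bias
print("std  of repeated measurements: %.2f cm" % measurements.std(ddof=1))  # ~0.2, the statistical scatter

# Averaging more readings shrinks the statistical uncertainty on the mean,
# but no amount of averaging removes the systematic bias.
```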
Accuracy and precision

[Figure: four measured-value distributions around the true value, illustrating low/high accuracy combined with low/high precision – the first of many distributions in this talk!]

• Related to systematic and statistical errors
  – Accuracy: absence of systematic bias
  – Precision: consistency of repeated measurements with one another

Reporting errors – basics

• A measurement is not complete without a statement of uncertainty
  – Always quote an error (unless obvious, as in "59.5 cm")
• Report results as "value ± error"
  – The error range normally indicates a "confidence interval", e.g. 68%
    • Higgs: $m_H = 125.7 \pm 0.4$ GeV [Particle Data Group, Sep 2014]
  – May use asymmetric errors and separately quote different error sources
    • $m_H = 124.3\,^{+0.6}_{-0.5}\,\mathrm{(stat.)}\,^{+0.5}_{-0.3}\,\mathrm{(syst.)}$ GeV [ATLAS (ZZ*→4l), Jul 2014]
• Rounding and significant figures
  – Central value and error are always quoted with the same number of decimal places
    • 124.3 ± 0.6 (neither 124 ± 0.6, nor 124.3 ± 1)
  – Error generally rounded to 1 or 2 significant figures; use common sense
    • Keep more significant figures when combining several measurements

Probability and distributions

• Probability
  – Formal branch of mathematics: probability theory (Kolmogorov axioms)
    • Probability is ≥ 0 for any event; it is normalized (the total is 1); it adds up for mutually exclusive events
  – Frequentist interpretation: objective, limit of the frequency over repeated trials
  – Bayesian interpretation (not for this WA!): subjective, degree of belief
• Probability distributions for discrete values $x_i$
  – Normalization: $\sum_i P(x_i) = 1$
• Probability distributions for continuous values $x$
  – Probability density function (p.d.f.): $f(x)$
    • $P(x \in [x_0, x_0+dx]) = f(x_0)\,dx$
  – Normalization: $\int f(x)\,dx = 1$

Expectation values for distributions

• Mean (or average) of f(x) – the "centre of gravity" of the distribution
  – $E[x] = \int x\,f(x)\,dx \equiv \mu$
    • Expectation value of x: first-order "moment" around 0
  – Discrete case: $E[x] = \sum_i x_i\,P(x_i) \equiv \mu$
• Standard deviation of f(x) – the "width" of the distribution
  – Square root of the variance $V[x] = E[(x-\mu)^2] = \int (x-\mu)^2\,f(x)\,dx \equiv \sigma^2$
    • Expectation value of (x−μ)²: second-order "moment" around μ
    • Also equal to $V[x] = E[x^2] - \mu^2 = \sigma^2$
  – Discrete case: $V[x] = E[(x-\mu)^2] = \sum_i (x_i-\mu)^2\,P(x_i) \equiv \sigma^2$

Binomial distribution

• Experiment with only two outcomes (Bernoulli trial)
  – Probability of success is p, probability of failure is q = 1−p
  – Typical example: throwing dice, where only a "6" is a success, with probability 1/6
• What is the probability of k successes in n independent trials?
  – $P_k(n;p) = \frac{n!}{k!\,(n-k)!}\,p^k\,(1-p)^{n-k}$
    • Order is not important: $\frac{n!}{k!\,(n-k)!}$ permutations
  – Mean (expected number of successes) is $\sum_k k\,P_k(n;p) = \mu = np$, as expected!
  – Variance is $E[x^2] - \mu^2 = \sigma^2 = np(1-p)$
• Other examples
  – #entries in one bin of a histogram, given N entries in total
  – decay channels of unstable particles (branching ratios)
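A hedged illustration (ours, not from the slides) of the binomial numbers above with scipy.stats: ten dice throws with p = 1/6, checking the pmf and the np and np(1−p) formulas.

```python
from scipy.stats import binom

n, p = 10, 1.0 / 6.0   # 10 dice throws, a "6" counts as a success

# Probability of exactly k successes, k = 0..10
for k in range(n + 1):
    print("P(k=%2d) = %.4f" % (k, binom.pmf(k, n, p)))

print("mean      np       = %.3f" % binom.mean(n, p))  # 10/6  ~ 1.667
print("variance  np(1-p)  = %.3f" % binom.var(n, p))   # 50/36 ~ 1.389
```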
Poisson processes and distributions

• Independent events with a fixed average "rate" per unit "time":
  – #decays of a fixed amount of radioactive material in a fixed time interval
  – #events in a HEP process (of given cross-section) for a fixed "luminosity"
    • #entries in one bin of a histogram for fixed "luminosity" (the total is Poisson too)
  – queuing theory: "memoryless" Poisson process in Erlang's M/D/1 model
• Limit of the binomial distribution for N→∞, p→0 with μ = Np fixed
• Probability $P(n;\lambda) = \frac{\lambda^n e^{-\lambda}}{n!}$
  – asymmetric for low λ
  – more symmetric for high λ
• Expectation values
  – $E[n] = \mu \equiv \lambda$ and $V[n] = \sigma^2 = \lambda$
  – i.e. $\sigma = \sqrt{\lambda}$ (width ~ $\sqrt{n}$!)

Gaussian (normal) distributions

• P.d.f. is $f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  – "N(μ,σ²)" for mean μ and variance σ²
  – "Standard normal" N(0,1) for μ=0, σ²=1
  – The most classical "bell-shaped" curve
• Many properties that make it nice and easy to do math with!
  – Error propagation: if x1, x2 are independent and normal, x1+x2 is normal
  – Parameter estimation: max likelihood and other estimators coincide
  – χ² tests: the χ² distribution is derived from the normal distribution
• Too often abused (as too nice!), but it does describe real cases
  – Limit of the Poisson distribution if the population mean is a large number
  – Good model for the Brownian motion of many independent particles
  – Generally relevant for large-number systems (→ central limit theorem)
• NB: the 68.3%, 95.4%, 99.7% confidence limits (1-2-3 σ) are only for Gaussians!

Many other distributions...

• Cauchy (or Lorentz)
  – No mean! (the integral is undefined)
  – HEP: non-relativistic Breit-Wigner
• Pareto
  – Finite mean, infinite variance (α = 2)
• Exponential

Demo #1

Data Analysis

"Data Analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making." [http://en.wikipedia.org/wiki/Data_analysis]

Data Analysis Tools

• There is a huge choice of tools for data analysis! Including...
  – ROOT – essential HEP component for analysis, plots, fits, I/O...
  – R – a "different implementation of S", but as GNU Free Software
  – the Python scientific ecosystem – booming in all fields, from science to finance (see for instance the iPython 2013 Progress Report)
• We chose to use Python for these WA lectures
  – Most people in IT-SDC use Python already
  – Widely used outside HEP and CERN, great web documentation
  – Integration to some level with ROOT (PyROOT) will be shown
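As a small taste of the scipy.stats interface these lectures rely on (our own sketch, not the actual Demo #1 notebook), here is the Poisson-to-Gaussian point made above, checked numerically for a low and a high λ.

```python
import numpy as np
from scipy.stats import poisson, norm

for lam in (2.0, 100.0):
    k = np.arange(0, int(lam + 5 * np.sqrt(lam)) + 1)
    p_poisson = poisson.pmf(k, lam)
    # Gaussian with the same mean and variance (sigma = sqrt(lambda))
    p_gauss = norm.pdf(k, loc=lam, scale=np.sqrt(lam))
    print("lambda = %5.1f   max |Poisson - Gaussian| = %.4f"
          % (lam, np.abs(p_poisson - p_gauss).max()))

# The difference shrinks as lambda grows: the low-lambda Poisson is visibly
# asymmetric, while the high-lambda one is well described by N(lambda, lambda).
```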
iPython and its Notebook

• iPython provides a rich architecture for interactive computing
  – Interactive shells (terminal and Qt-based) and data visualization
  – Easy-to-use, high-performance tools for parallel computing
  – A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media
• Extensive documentation on the web
  – Entry point: http://ipython.org
  – Tutorials, videos, talks, book, mailing lists, chat room
  – A lot of documentation is written in iPython and available on GitHub, shared through nbviewer:
    • Suggested lectures on scientific computing with Python
    • Mining the social web
    • Probabilistic programming and Bayesian methods for hackers

Quick start – our typical setup

• Server: any CC7 (CERN CentOS 7) node
  – Python 2.7 (native)
  – iPython, numpy, scipy, matplotlib, pandas (via easy_install)
  – ROOT (from AFS)
• Client: any web browser
  – Either on the local server where iPython is running
    • ipython notebook
  – Or with port forwarding via ssh/putty on a remote client
    • ssh -Y -L 8888:localhost:8888 your_server
    • ipython notebook --no-browser
• Further reading on how to configure the notebook here

Populations and samples – statistics!

What is statistics?

• Statistics: the study of the collection, analysis and presentation of data
  – Descriptive statistics – describe the properties of the collected data sample
  – Inferential statistics – deduce the properties of the underlying population
    • Statistical inference is based on probability theory
• What do we do with statistics?
  – Parameter estimation
    • Combining results, fitting model parameters
    • Determining errors and confidence limits
  – Statistical tests
    • Goodness of fit – are results compatible with each other and with models?
    • Hypothesis testing – which of two models is more favoured by experiments?
  – Design and optimization of experimental measurements, sampling...

What do we do with statistics in HEP?

[Figure: HEP examples]
  – Measurement combinations and precision measurements (parameter estimation) – LEP EW WG, March 2012
  – Standard Model tests (goodness of fit) – Gfitter Group, Sep 2012
  – 5σ* signal significance (hypothesis testing) – ATLAS, July 2012
  – 95% CL for exclusion (hypothesis testing) – FCC kick-off meeting, G. Rolandi
  – Searches... and discoveries!
  *[5σ = 3 × 10⁻⁷ one-sided Gaussian]

Population vs. Sample

• Population: the set of all possible outcomes of an experiment
  – The 7,162,119,434 (±?!) humans in the world in 2013 [UN DESA]
  – The theoretically possible ("gedanken") infinite collision events at the LHC
• Sample: the set of outcomes in the few experiments we actually did
  – The 2,513 CERN staff members in 2013 [CERN HR]
  – The actual collision events in 4.9 fb⁻¹ for the 2014 LHC mtop measurement [arXiv]
• Sample properties are used to infer population properties
  – estimate parameters (with uncertainties) of population distributions
  – and make (probabilistic) predictions about what you can expect
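A short numpy sketch (hypothetical numbers, not from the slides) of the inference step just described: draw a small sample from a large simulated "population" and use it to estimate the population mean, with the sample mean and N−1 sample standard deviation that the next slide defines as estimators, plus the standard error σ/√N introduced later with the Central Limit Theorem.

```python
import numpy as np

np.random.seed(1)

# Hypothetical "population": heights (cm) of a very large group of people
population = np.random.normal(172.0, 8.0, 1000000)

# We only ever get to see a small sample of it
sample = np.random.choice(population, size=50, replace=False)

sample_mean = sample.mean()
sample_std = sample.std(ddof=1)                      # ddof=1: the N-1 sample variance
standard_error = sample_std / np.sqrt(len(sample))   # uncertainty on the estimated mean

print("population mean           : %.2f cm" % population.mean())
print("sample mean (estimator)   : %.2f cm" % sample_mean)
print("standard error on the mean: %.2f cm" % standard_error)
```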
Parameter estimation

• Estimators: functions ("statistics") of the measurements in a sample
  – Sample mean $\bar{x} = \frac{\sum_i x_i}{N}$: estimator of the population mean $\mu = E[x]$
  – Sample variance $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{N-1}$: estimator of the population variance $\sigma^2 = V[x]$
  – Plus many estimation methods to fit or combine measurements
• Good estimators (e.g. $\bar{x}$ as estimator of $\mu$)
  – Accuracy – absence of bias: $E[\bar{x}] = \mu$
  – Precision – low variance of $\bar{x}$ around $E[\bar{x}]$

What claims can I make from a sample?

• You can use sample information in different ways
  – can describe what you observed
  – can quote inferred parameter estimates and their uncertainties
    • to some extent, can understand the shape of the population distribution
  – can make predictions ("confidence intervals") from the estimates
    • if sure about the distribution shapes (1σ is 68% only for Gaussians!)
• It is difficult to make general claims from a few specific observations
  – the sample may not be "representative" of the population (bias)...
  – the population may be a "moving target" that varies in time...
  – you need a priori knowledge of the shape of the population distribution...

High-energy physicists are lucky! (or spoilt?!)

• We aim to measure/discover Laws of Nature that exist a priori!
• We can (more or less) assume that these Laws do not vary in time!!
• We can compute distributions for theoretical models because quantum mechanics is probabilistic! (Monte Carlo simulations)
  – Statistical errors are truly Poissonian, hence Gaussian for large numbers
  – (Although we end up treating systematics as Gaussian too... argh?!)

Others are not so lucky! (IT people included...)

[Figure: newspaper front pages – "PANIC!", "WALL ST. BLOODBATH"]

• Beware of your assumptions! (And beware of "magic" tools!)
  – "The Black-Scholes equation was the mathematical justification for the trading that plunged the world's banks into catastrophe [...] On 19 October 1987, Black Monday, the world's stock markets lost more than 20% of their value within a few hours. An event this extreme is virtually impossible under the model's assumptions [...] Large fluctuations in the stock market are far more common than Brownian motion predicts. The reason is unrealistic assumptions – ignoring potential black swans. But usually the model performed very well, so as time passed and confidence grew, many bankers and traders forgot the model had limitations. They used the equation as a kind of talisman, a bit of mathematical magic to protect them against criticism if anything went wrong." [I. Stewart, The Guardian, 2012]
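Not in the slides: a quick scipy.stats illustration of how strongly tail probabilities depend on the assumed distribution. A "25-sigma" move is essentially impossible under a Gaussian model, but far less so under a heavy-tailed one (a Student-t with 3 degrees of freedom is used here purely as an arbitrary example of a fat-tailed alternative).

```python
from scipy.stats import norm, t

threshold = 25.0   # "25 standard deviations"

p_gauss = norm.sf(threshold)      # one-sided Gaussian tail probability beyond 25 sigma
p_heavy = t.sf(threshold, df=3)   # same threshold for a heavy-tailed Student-t (df=3)

print("P(> 25 sigma), Gaussian    : %.3e" % p_gauss)   # ~ 3e-138: "never happens"
print("P(> 25 sigma), Student-t(3): %.3e" % p_heavy)   # ~ 7e-5 : rare, but not unthinkable
```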
"Outliers" and "black swans"

IN SUMMARY AND IN PRACTICE: BEWARE OF THE LONG TAILS OF YOUR DISTRIBUTIONS!!!

• How are your data distributed? What is one "sigma"?
  – "By any historical standard, the financial crisis of the past 18 months has been extraordinary. Some suggested it is the worst since the early 1970s; others, the worst since the Great Depression; others still, the worst in human history. [...] Back in August 2007, the CFO of Goldman Sachs commented to the FT «We are seeing things that were 25-standard deviation moves, several days in a row». [...] A 25-sigma event would be expected to occur once every 6×10¹²⁴ lives of the universe. That is quite a lot of human histories. [...] Fortunately, there is a simpler explanation – the model was wrong." [A. Haldane, Bank of England, 2009, "Why banks failed the stress test"]

Understood? Beware of outliers! ;-)

[Figure: xkcd cartoon (http://xkcd.com) – it contains a box plot! We'll get there...]
• XKCD comics-style plots in iPython? Click here.

The Law of Large Numbers (1)

• Yes, winning a dice game once is a matter of luck...
  [G. de La Tour – Les joueurs de dés (1651)]
• ... but is there any such thing as an optimal strategy for winning "on average" in the "long term"?
  [Cardano, Fermat, Pascal, J. Bernoulli, Khinchin, Kolmogorov...]

The Law of Large Numbers (2)

• Given a sample of N independent variables xi = {x1, ..., xN}, identically distributed with finite population mean μ
  – the law of large numbers says that the sample mean $\bar{x} = \frac{\sum_i x_i}{N}$ "converges" to the population mean μ for $N \to \infty$
  – generalizable to non-identically distributed xi with finite and "gentle" σi
  – it holds irrespective of the shapes of the xi distributions
• In practice:
  – Even if my dice game strategy only wins 51% of the time, "in the long term" it will make me a millionaire!
  – The LLN does not yet say how fast this will happen...

The Central Limit Theorem (1)

• Given a sample of N independent variables xi = {x1, ..., xN}, identically distributed with population mean μ and finite s.d. σ
  – the central limit theorem essentially says that the distribution of the sample mean $\bar{x} = \frac{\sum_i x_i}{N}$ "converges" to a Gaussian around the population mean μ with standard deviation $\sigma/\sqrt{N}$ for $N \to \infty$
  – again true irrespective of the shape of the xi distribution
• In practice this tells us how fast the LLN converges: as $1/\sqrt{N}$
  – We use the standard error $\sigma/\sqrt{N}$ as the uncertainty on $\bar{x}$ as an estimator of μ
  – We'll now see some examples of this in the demo
• And even more interesting is... the next slide...

The Central Limit Theorem (2)

• Given N independent variables xi = {x1, ..., xN}, non-identically distributed with population means μi and finite s.d. σi
  – the central limit theorem essentially says that the distribution of the mean $\frac{\sum_i x_i}{N}$ "converges" to a Gaussian around $\frac{\sum_i \mu_i}{N}$ with standard deviation $\frac{\sqrt{\sum_i \sigma_i^2}}{N}$ for $N \to \infty$
  – again true irrespective of the shapes of the xi distributions
    • as long as they are "nicely behaved" (~ the σi are not wildly different)
• In practice:
  – This is how we justify why we so often assume Gaussian errors on any measurement: the net result of many independent effects is Gaussian!
    • The reason why we like Gaussians still being that it is easier to do the maths

Demo #2
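A minimal sketch in the spirit of Demo #2 (the actual notebook is not reproduced here, so this is our own example): the sample mean of a very non-Gaussian, exponentially distributed variable becomes approximately Gaussian with width σ/√N as N grows.

```python
import numpy as np

np.random.seed(7)

tau = 2.0              # exponential distribution: mean = tau, sigma = tau
n_experiments = 10000  # number of independent "experiments", each averaging N draws

for N in (1, 10, 100):
    samples = np.random.exponential(tau, size=(n_experiments, N))
    means = samples.mean(axis=1)   # one sample mean per "experiment"
    print("N = %4d   spread of sample means = %.3f   (CLT predicts sigma/sqrt(N) = %.3f)"
          % (N, means.std(ddof=1), tau / np.sqrt(N)))

# The spread of the sample mean shrinks like 1/sqrt(N), and a histogram of
# `means` looks increasingly Gaussian, even though each individual
# exponential draw is very asymmetric.
```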
"If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts, he shall end in certainties." [F. Bacon, "Of the Proficience and Advancement of Learning, Divine and Human", 1605]

Designing experiments

Designing experiments?

• In practice, we suggest to retain 3 common-sense concepts
  – 1. aim for reproducibility
    • Even if the observations are not reproducible (LHC), the method should be!
  – 2. try to use controlled environments if possible
    • Removing external factors – failing which, you must assign an error to them
      – IT example: single out I/O timing and remove uninteresting CPU timing
    • Relative vs absolute comparisons – also "internal" vs "external" validity
  – 3. measurements are an iterative process
    • The more you control one error, the more you must work on the next one
    • In HEP, some systematics can be reduced by using larger data samples
• You can get much more formal (or philosophical) than this...
  – Science is about having doubts! (Descartes, Galileo, Popper...)
    • Statistics is a quantitative tool to address these doubts, and to test hypotheses!
  – Statistics also helps in the formal design of experiments (Fisher)
  – External factors in "non-fundamental" sciences (ceteris paribus)

External factors? LEP energy calibration

[Figure: effect of the Moon on E_LEP ~ 120 ppm (CERN SL 94/07); effect of TGVs on E_LEP ~ a few 10 ppm (CERN SL 97/47); departure of the 16:50 Geneva-Paris TGV on 13 Nov 1995]

• LEP energy calibration
  – Systematic error on the combined LEP measurement of the Z boson mass reduced from 7 MeV [Phys. Lett. B 1993] to 2 MeV [Phys. Rep. 2006]
  – Again: aim for reproducibility, control external factors, iterative process...

Correlation does not imply causation

[Figure: xkcd cartoon, http://xkcd.com]

Presenting results – and their errors

You can:
  – either show what you observed,
  – or show what you inferred from the observations.

Misunderstandings

• Statistics is relatively simple but is used in many different ways
  – Different "common practices" and buzzwords in different fields (HEP, astronomy, economics, medicine, psychology, neuroscience, biology, IT...)
  – Different types of error bars are appropriate for different needs... (next slide)
• General rule: ensure that people are clear about what you are doing!
  – "Many leading researchers... do not adequately distinguish CIs and SE bars." [Belia, Fidler, Williams, Cumming, Psychological Methods, 2005, "Researchers Misunderstand Confidence Intervals and Standard Error Bars"]
  – "Experimental biologists are often unsure how error bars should be used and interpreted." [Cumming, Fidler, Vaux, JCB, 2007, "Error bars in experimental biology"]
  – "The meaning of error bars is often misinterpreted, as is the statistical significance of their overlap." [Krzywinski, Altman, Nature Methods, 2013, "Error bars"]
  – Read the last two articles to understand the different types of error bars!

There are different types of error bars!

• Error bars describing the observed sample width and range of values
  – Standard deviation of the sample
    • Represents the width of both the sample and the population (estimate of the s.d. of the population)
  – Box plots for the sample
    • Representing the median, quartiles/deciles/percentiles, outliers...
• Error bars describing the inferred range for the population mean
  – Standard error on the population mean
  – Confidence intervals for the population mean (based on a Gaussian assumption!)
• These two categories can be called "descriptive" and "inferential" error bars
  – [Cumming, Fidler, Vaux, JCB, 2007, "Error bars in experimental biology"]
• Use different types for different needs – and say which ones you use!
  – See a specific IT example on one of the next slides
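To make the distinction concrete, a short sketch with made-up numbers (not from the slides): the descriptive standard deviation versus the inferential standard error and 95% confidence interval, computed for one Gaussian sample with scipy.

```python
import numpy as np
from scipy import stats

np.random.seed(3)
sample = np.random.normal(100.0, 15.0, 25)   # one sample of 25 measurements (made up)

mean = sample.mean()
sd = sample.std(ddof=1)                       # descriptive: spread of the single values
sem = stats.sem(sample)                       # inferential: sd / sqrt(n)
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print("mean                : %.1f" % mean)
print("standard deviation  : %.1f  (describes the spread of single measurements)" % sd)
print("standard error      : %.1f  (uncertainty on the estimated population mean)" % sem)
print("95%% CI for the mean : [%.1f, %.1f]" % (ci_low, ci_high))
```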
Error bars – descriptive or inferential? [Ben Podgursky]

• Example: which programming language "earns you the highest salary"?!
  – One post (based on git commit data and the Rapleaf API) was very popular in August 2013
  – On popular request, it was later updated to include error bars computed as 95% CIs
• Are CIs the most relevant error bars here?
  – It depends on what you are interested in!
• Do you really care to claim "I work with the number 1 best paid language"?
  – If you do, then yes: CIs (and/or standard errors) are relevant to you...
  – You want to estimate the population means and be able to rank them
• Do you only care to know how much you can expect to earn in each case?
  – In this case, no: standard deviations or box plots are much more relevant to you!
• Different questions require different tools to answer them!

Quartiles, deciles, percentiles, median...

• Height distribution in a sample with 20 persons
  [Figure: the sample distribution annotated with the 1st decile, 1st quartile, median (2nd quartile), 3rd quartile, 95th percentile, sample mean and mode (most frequent height)]
• Median, quartiles, deciles and percentiles require a linear ordering (the mean and the mode do not)
• There can be many modes! Many global/local maxima.

Deciles and percentiles – real examples

• "top decile" = tail of the distribution with the 10% highest income
• "One thing that Piketty and his colleagues Emmanuel Saez and Anthony Atkinson have done is to popularize the use of simple charts that are easier to understand. In particular, they present pictures showing the shares of over-all income and wealth taken by various groups over time, including the top decile of the income distribution and the top percentile (respectively, the top ten per cent and those we call 'the one per cent')." [J. Cassidy, The New Yorker, 2014]

Boxplots*

• Boxplot
  – Edges (left/right): Q1/Q3 quartiles
    • Q3 − Q1 = IQR (inter-quartile range)
  – Line inside: median (Q2 quartile)
• "Whiskers"
  – Extend to the most extreme data points within median ± 1.5 IQR
  – Outliers are plotted individually
• Other slightly different styles exist*
• Above: compare boxplots with mean and standard errors for three Gaussian N(0,1) samples
*Nice reference: [Krzywinski, Altman, Nature Methods, 2013, "Visualizing samples with box plots"]
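A hedged matplotlib sketch along the lines of the comparison mentioned on the slide (our own version, not the original figure): box plots of three Gaussian samples next to their means with standard-error bars.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(5)
samples = [np.random.normal(0.0, 1.0, n) for n in (20, 100, 1000)]
labels = ['n=20', 'n=100', 'n=1000']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Left: descriptive view - box plots (median, quartiles, whiskers, outliers)
ax1.boxplot(samples)
ax1.set_xticklabels(labels)
ax1.set_title('box plots (sample spread)')

# Right: inferential view - means with standard-error bars
means = [s.mean() for s in samples]
sems = [s.std(ddof=1) / np.sqrt(len(s)) for s in samples]
ax2.errorbar([1, 2, 3], means, yerr=sems, fmt='o')
ax2.set_xticks([1, 2, 3])
ax2.set_xticklabels(labels)
ax2.set_title('mean +/- standard error')

plt.tight_layout()
plt.show()
```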
Should you always show error bars?

• Personal opinion: no, not really
  – If the intrinsic limitations of your measurement and method are reasonably clear
  – But it is really a question of common sense and case-by-case choices
• Example: Andrea's Oracle performance plots for COOL query scalability
  – Controlled environment, low external variations – private clients, low server load
  – What really matters is the lower bound and it is obvious – many points
    • Most often we measure the fastest case – fluctuations are only on the slow side
  – Measurements are relative – only interested in slopes, not in face values
  – Fast and repeatable measurements – just retry if a plot looks weird
  – Plots initially used to understand the issue, now just as a control – no fits
    • The real analysis is on the Oracle server-side trace files
  [Figure note: the y-axis range is zoomed in the left plot – is this clear? No, probably a bad choice]

Plots and distributions – basics

• Label the horizontal and vertical axes!
  – Show which variables are plotted in x and y, and in which units
  – For histograms, you may use a label like "#entries per x.yz <units> bin"
• Choose the x and y ranges wisely
  – If one axis is a Δtime, do you need to show negative values or start at 0?
  – Use the same range across many plots if they are shown next to each other
    • Or add a reference marker to make it visually easier to compare plots
• In a nutshell: explain what you are doing, make sure everything is clear!
  – On the plot itself (labels, legends, title) – especially in a presentation
  – In a caption below the plot (and/or in the text) – in an article or thesis

Distributions – histograms or smoothing

• Histograms: entries per bin
  – Advantages:
    • The granularity (bin width) is obvious!
    • The error per bin (~ √n) is obvious!
  – Disadvantages:
    • Not smooth (is that a disadvantage?)
    • Depends on the bin width
• Smoothing techniques
  – Advantages:
    • Smooth (is that an advantage?)
  – Disadvantages:
    • Also depend on a "resolution" effect
    • Unclear if shown alone (unclear x resolution, unclear y error)
  – Example here: kernel density estimator (KDE) with an N(0,1) kernel
    • $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h}\right)$
    • i.e. 1/n times a sum of Gaussians N(xi, h)
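A small sketch (not from the slides) comparing the two options on the same sample, using matplotlib's histogram and scipy's Gaussian kernel density estimator; the bandwidth h is chosen automatically here (Scott's rule).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

np.random.seed(11)
data = np.random.normal(0.0, 1.0, 200)

kde = gaussian_kde(data)           # Gaussian kernel, automatic bandwidth (Scott's rule)
x = np.linspace(-4, 4, 400)

# density=True normalizes the histogram to unit area
# (older matplotlib releases call this option normed=True)
plt.hist(data, bins=20, density=True, alpha=0.5, label='histogram (20 bins)')
plt.plot(x, kde(x), label='Gaussian KDE')
plt.xlabel('x')
plt.ylabel('probability density')
plt.legend()
plt.show()
```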
Conclusions (WA #1)

• Statistics has implications at many levels and in many fields
  – Daily needs, the global economy, HEP, formal mathematics and more
  – Different fields may have different buzzwords for similar concepts
• We recalled a few basic concepts
• And we suggested a few tools and practices

Take-away messages? (WA #1)

• Do use errors and error bars!
  – When quoting measurements and errors, check your significant figures!
  – Different types of error bars for different needs! Say which ones you are using!
    • Descriptive, width of distributions – standard deviations σ, box plots...
    • Inferential, uncertainty on the population mean estimate – standard errors σ/√n, CIs...
    • [Why do we use σ/√n? Because of the Central Limit Theorem!]
    • [Ask yourself: are you describing a sample or inferring population properties?]
• Beware of long tails and of outliers!
  – More generally: we all love Gaussians but reality is often different!
    • [Why do we love Gaussians? Because the maths becomes so much easier with them!]
    • [Why do we feel OK abusing Gaussians? Because of the Central Limit Theorem!]
• Before analyzing data, design your experiment!
  – Aim for reproducibility, reduce external factors – and it is an iterative process
• Make your plots understandable and consistent with one another
  – Label your axes and use similar ranges and styles across different plots
  – Be aware of binning effects (do you really prefer smoothing to histograms?)

References – books

• Our favorites
  – J. Taylor, Introduction to Error Analysis – a great introductory classic
  – A. van den Bos, Parameter Estimation for Scientists and Engineers – advanced
  – J. O. Weatherall, The Physics of Finance – an enjoyable history of statistics
• Many other useful reads...
  – F. James, Statistical Methods in Experimental Physics
  – G. Cowan, Statistical Data Analysis
  – R. Willink, Measurement Uncertainty and Probability
  – I. Hughes, T. Hase, Measurements and their Uncertainties
  – S. Brandt, Data Analysis
  – R. A. Fisher, Statistical Methods, Experimental Design and Scientific Inference

References – on the web

• "Community" (to varying degrees) sites about maths/stats
  – wikipedia.org (see the description by planetmath)
  – planetmath.org (see the description by wikipedia)
  – mathworld.wolfram.com (see the description by wikipedia)
• Lecture courses on statistics
  – for CERN Summer Students (Lyons 2014, Cranmer 2013, Cowan 2011)
  – Chris Blake's Statistics Lectures for astronomers
• Regular articles about statistics
  – Statistics section in the Particle Data Group's Review of Particle Physics
  – Nature "Points of Significance" open-access column ("for biologists")
    • Launched in 2013, the International Year of Statistics!
• Tools: ROOT, R, iPython notebook – more links in the "Demo" slides

Questions?

Backup slides

[Caravaggio – I bari (1594)]