Laboratory Statistics & Graphics with EXCEL® Tutorial book Dietmar Stöckl Dietmar@stt-consulting.com With thanks to • Linda M Thienpont Linda.thienpont@ugent.be • Kristian Linnet, MD, PhD Linnet@post7.tele.dk • Per Hyltoft Petersen, MSc Per.hyltoft.petersen@ouh.fyns-amt.dk • Sverre Sandberg, MD, PhD Sverre.sandberg@isf.uib.no STT Consulting Dietmar Stöckl, PhD Abraham Hansstraat 11 B-9667 Horebeke, Belgium e-mail: dietmar@stt-consulting.com Statistics & graphics for the laboratory 2 Content Content overview How to use this book How to use the EXCEL-files Univariate Statistics • Data, data presentation, and data description • Gaussian (or Normal) distribution • Tests for normality & calculations with logarithms • Sampling statistics: Confidence intervals • Estimation and hypothesis testing (F-test, Chi2-test, t-tests, outliers) • Analysis of variance (ANOVA) • Statistical power concept and sample size Bivariate Statistics • Graphical techniques • Combined graphical/statistical techniques • Correlation • Regression Annex Statistics & graphics for the laboratory 3 Content Content overview EXCEL-files General • DataGeneration • StatFunctions&Tables ANOVA • Cochran&Bartlett • ANOVA Data, -presentation & -description • Datasets • Data&DataPresentation • Graphs-EXCEL • ProbPlots Power & sample size • Power Gaussian distribution • Exercises-BasicStats • Gaussian Distribution • NormalRankitPlot Graphical techniques • GraphBivariate-EXCEL Sampling statistics • SamplingStatistics • CI-Calculator Combined graphical/statistical techniques • Bland&Altman Estimation & hypothesis testing and confidence intervals Correlation & regression • CI&NHST-EXCEL • CI&NHST • CI&NHST-Exercise • Grubbs: (http://www.graphpad.com/ articles/outlier.htm) • Correlation&Regression • CorrRegr-EXCEL Statistics & graphics for the laboratory 4 Content Detailed content Univariate statistics Data • Types of data & types of statistics • Exemplary laboratory data – "Repeated weighing"-experiment – Adult 
serum triacylglycerides Data presentation (univariate data) • Importance of digits • Table • Graphics with EXCEL® – Dot-plot – Histogram – Frequency polygon – Dynamic histogram – Box and whisker plot • Time-indexed plots Data description • Descriptive statistics – Location Mean, median, mode – Dispersion Range, variance, standard deviation, coefficient of variation • Equations • Descriptive statistics with EXCEL® • Importance of digits Gaussian (or Normal) distribution • "Bell-shaped" (similar to a histogram) • Cumulated: "S-shaped" • Cumulated & linearized • 2-sided and 1-sided probabilities • Inside/outside probabilities • Probabilities at selected s (z) values • Deviation from normality (skewness & kurtosis) Tests for normality Calculations with logarithms Statistics & graphics for the laboratory 5 Content Detailed content Sampling statistics: Confidence intervals • t-distribution (distribution of means) – Confidence interval of a mean – Confidence interval of the "1.96 s-limit" • Chi2(2)-distribution (distribution of variances) – Confidence interval of a standard deviation • Interpretation of confidence limits Estimation and hypothesis testing (F-test, Chi2-test, t-tests) • Introduction • t-tests • Outlier tests (k • SD, Grubbs, Dixon's Q) • F-test, 2 (=Chi2)-test • Tests and confidence limits Analysis of Variance (ANOVA) • Introduction • Model I ANOVA Performance strategy – Testing of outliers – Testing of variances (Cochran "C", Bartlett) • Model II ANOVA – Applications Power and sample size Bivariate Statistics Graphical techniques • Scatter-plot • Difference-plot • Residual-plot • Krouwer-plot • Influences on the plots (data-range; subgroups; outliers; scaling) • Influences of random- and systematic errors on the plots • Linearity • Specifications in plots Statistics & graphics for the laboratory 6 Content Detailed content Bivariate statistics (ctd.) 
Combined graphical/statistical techniques • The Bland & Altman approach Correlation • The statistical model • Correlation in method comparison • Non-parametric correlation Regression • Ordinary linear regression (OLR) • Deming regression • Passing-Bablok regression (non-parametric) • Weighted regression • Regression & method comparison • Regression & calibration Annex Statistics with EXCEL® • EXCEL® installation requirements • Tips for EXCEL®-graphics Statistical resources • Web resources – Glossary of statistical terms – Interesting educational resources • Statistical software – General – "Laboratory statistics" • Books Statistical tables Presenter's publications & courses related to the topic Statistics & graphics for the laboratory 7 Overview Basics Data Quantitative Categorical [Importance] Digits Statistics Exploratory Data Analysis Parametric Non-parametric Bayesian Data presentation (univariate) Table Dot-plot Histogram/Frequency polygon Frequency cumulated Normal probability plot (Rankit) Krouwer (mountain) plot Box & whisker plot Time-indexed plot Data presentation (bivariate) 2 x 2 Table Scatter plot Residual plot Bias plot Bland & Altman plot/approach Data transformation Logarithms Other Variance pooling Variance propagation Total error calculations Statistics & graphics for the laboratory 8 Overview Descriptive statistics Location Mean Median Mode Dispersion Minimum/Maxiumum Range; quartile/quantile Variance Standard deviation (SD, or s) Coefficient of variation (CV) z-value Gaussian distribution Graphics >Data presentation (univariate) Probabilities 2-sided 1-sided Inside Outside At selected z-values Deviations Skewness Kurtosis Sampling statistics Confidence intervals Central limit theorem t-distribution Conf. interval of a mean Conf. interval of a centile (1.96) Chi-square (2) distribution Conf. 
interval of a SD Statistics & graphics for the laboratory 9 Overview Significance testing Means (n>2: ANOVA) Non-parametric 1-sample t-test Wilcoxon signed rank t-test Non-parametric Mann-Whitney U Paired t-test Non-parametric SDs Wilcoxon signed rank 1-sample F-test (Chi-square~) F-test Distribution Chi-square Non-parametric Kolmogorov Smirnov Anderson Darling Outlier Grubbs test Dixon’s Q-test Variances (n > 2) Cochran "C" Bartlett Power Sample size calculation ANOVA One-way Model I: Significance testing (means n > 2) Model II: Variance estimation Non-parametric Kruskal-Wallis Model I versus Model II Correlation (r) Pearson Non-parametric Regression Spearman, Kendall Ordinary linear regression Deming regression Non-parametric Passing Bablok regression Weighted regression forms Statistics & graphics for the laboratory 10 Overview General considerations to approach data Frequent statistical questions • Which kind of data? • Which kind of distribution? • Is there a difference? • Was there a change? • Is there an association? • What is the probability? 
Kind of data Which kind of data • Quantitative –Measured –Counted • Categorical –Ordinal –Nominal Appropriate statistic Approach for quantitative, measured data Data collection/ Kind of experiment Which question Plot data • Retrospective • Experience • Statistically planned • Sufficient digits • Sample size calculations • Description • Difference • Change • Association • Prediction • Sample size • Selection of plot • Selection of test • Selection of probability • Outliers • Distribution (n > 20) • Parametric direct or –Remove outliers –Transform data • Non-parametric –[Remove outliers] Statistics & graphics for the laboratory 11 Overview Summary of significance tests Problem Parametric Non-parametric Graphic Outlier Grubbs Dixon’s Q Distribution CHI2 Anderson-Darling (recommended) Kolmogorov-Smirnov Normal probability plot (=Rankit-plot) Mean$ vs target t-test, 1-sample Wilcoxon signed rank Confidence interval (CI) 2 Means$ t-test (equal & unequal variances): perform Ftest before Mann-Whitney U CI Paired means$ (Change) Paired t-test Wilcoxon signed rank CI SD:VAR vs target CHI2 2 SDs/VAR F-test Siegel-Tukey >2 means$ ANOVA Kruskal-Wallis >2 variances Cochran’s C Bartlett Association Pearson Correlation Spearman or Kendall Prediction Regression Passing-Bablok regression Dot-plot CI CI Rankit-plot with CHI2function $: or median; SD = standard deviation; VAR = variance; vs = versus Summary of graphics Univariate data Bivariate data Dot-plot Scatter plot Histogram/Frequency polygon Difference (bias) plots Cumulated frequency plot Residual plot Krouwer plot (folded cum. 
frequency) [Contingency tables] Normal probability (Rankit) plot Box & Whisker plot Run-sequence plot (Control charts) Lag-plot Statistics & graphics for the laboratory 12 Overview Selected analytical problems and associated statistics Analytical problem Associated statistics$ Method evaluation/validation (in-house) General Basic statistics Outlier tests (e.g., Grubbs) Imprecision F-test; CHI2-test (#), ANOVA Limit of detection Probability & Power Linearity Regression, ANOVA Calibration Regression & correlation Sample trueness/bias (recovery) t-tests (#) Accuracy (uncertainty) of result Variance propagation Method comparison Regression & correlation Trouble-shooting Power (sample size calculations) $Pure measurement variation, usually, is assumed Gaussian #Alternative: confidence intervals Collaborative trials (n >2 laboratories) Imprecision Cochran C; Bartlett Bias ANOVA, Model I Estimation of variance ANOVA, Model II Interpretation of analytical results Depends on the problem Various (see above) Tests for distribution Data transformation Bayesian statistics Statistics & graphics for the laboratory 13 How to use this book & the EXCEL-files How to use this book This book is an introductory text to basic statistical and graphical techniques used in the analytical laboratory. It is accompanied by EXCEL-files that should facilitate self-education by -demonstrating the statistical & graphical possibilities of EXCEL -explaining statistical concepts with dynamic worksheets -providing examples for creating user-specific templates The use of the EXCEL-files is indicated by the following icons: The general layout is shown in the figure below. Statistics & graphics for the laboratory 14 How to use this book & the EXCEL-files How to use the EXCEL-files EXCEL-Settings The files have been tested with EXCEL 2000 & EXCEL XP. Activate the AddIns: Analysis ToolPak & Analysis ToolPak -VBA. Macro security: Medium or low. 
When opening the files, choose "Enable Macros". The files are best viewed in "Full Screen" mode.
Note: Make a back-up under a different name; do not save changes.

Features of the EXCEL-files
-Easy "click-through" navigation between the worksheets
-Information icon: gives information about the purpose of the file
-Note icon: draws attention to particular EXCEL or other issues
-Exercise icon: gives instructions for interactive worksheets; in addition, the worksheets explain in detail how to perform certain exercises
-Comment-cell: contains important information on specific topics
-Many files contain dynamic elements for user interaction

Caution
Please close other applications. During extensive use with Windows 98, it may be necessary to delete the Windows>Temp files every now and then (otherwise, EXCEL may shut down).

The EXCEL-files guide the user through the statistical functions of EXCEL that are available through the fx-icon and the "Data Analysis" AddIn. A summary of the statistical functions of EXCEL is given in the file EXCEL-StatFunctions. The file StatTables-EXCEL contains statistical tables that are created with the EXCEL-functions
• NORMSINV (z-table)
• FINV (F-table)
• TINV (t-table)
• CHIINV (Chi2-table)

The EXCEL-files are of
• tutorial nature (explaining the statistical concepts)
• practical nature (templates for use)

Legal notice
The EXCEL-files are for educational purposes. They should not be regarded as commercial software. They have been prepared with the utmost care, but it cannot be excluded that they contain errors. The author is not liable for errors.
Data, data presentation & data description

Data
• Types of data & types of statistics
• Exemplary laboratory data
– "Repeated weighing"-experiment
– Adult serum triacylglycerides
Data presentation (univariate data)
• Importance of digits
• Table
• Graphics with EXCEL®
– Dot-plot
– Histogram
– Frequency polygon
– Dynamic histogram
– Box and whisker plot
• Time-indexed plots
EXCEL-files: Datasets; Data&DataPresentation

Types of data & types of statistics

Types of data
To apply statistical techniques correctly, we first have to understand the type of data we are dealing with (see Table below).

QUANTITATIVE ("numerical")
  Measured ("continuous")$
  • Blood pressure
  • Height
  • Weight
  Counted ("discrete"): number of
  • … children in a family
  • … cases of AIDS in a city
CATEGORICAL
  Ordinal ("ranked") (ordered categories, usually based on a measure)
  • Grade of cancer
  • Better, same, worse
  • Disagree, neutral, agree
  Nominal (unordered categories)
  • Sex (male/female)
  • Alive/Dead
  • Blood group
$May be converted to nominal by "cutoffs" (normotension; hypertension)
Source: Statistics at square one (10th ed). Swinscow, Campbell. BMJ Books, 2002

Types of statistics
When we know which type of data we are dealing with, we still have to know (or make assumptions) about the probability distribution of the data to apply the correct type of statistics (>Parametric-/>Non-parametric statistics; >Bayesian statistics). The type of distribution can be identified with graphical techniques (>Exploratory Data Analysis) and formal statistical testing.

Parametric statistics
Parametric methods for statistical hypothesis testing assume that the distributions of the variables being assessed have certain characteristics (usually, a "normal" distribution is assumed).
The assumption of normality rests on the idea that the dispersion is caused by many independent, additive factors. Parametric techniques usually involve squared measures, e.g., the standard deviation is computed from sums of squared deviations from the mean. Squaring is based on properties of the normal distribution that render it the optimal (most effective) estimation technique. With real distributions, the squaring principle makes parametric approaches sensitive to the presence of outliers. Testing for outliers should always be considered in parametric testing.

Non-parametric statistics
Non-parametric (or distribution-free) methods for statistical hypothesis testing make no assumptions about the frequency distributions of the variables being assessed.

Bayesian statistics
Statistics that incorporate prior knowledge and accumulated experience into probability calculations. They use subjective probability as a starting point for assessing a subsequent probability.

Exploratory Data Analysis
Exploratory data analysis is a term used to describe a group of techniques (largely graphical in nature) that shed light on the structure of the data. Without this knowledge the scientist, or anyone else, cannot be sure they are using the correct form of statistical evaluation. Before applying statistics, data should be plotted.

Data types & typical statistics
Often, certain types of data are related to certain types of statistics. Some of the most common cases are presented below.
Quantitative continuous data (parametric)
• Descriptive statistics and confidence interval of a mean.
• Confidence interval of a standard deviation.
• Grubbs' test to detect an outlier.
• t-test to compare two means.
• Analysis of variance (ANOVA).
Ranked data (non-parametric)
• Mann-Whitney U
• Kruskal-Wallis one-way ANOVA
• Wilcoxon signed ranks
• Sign test
• Kolmogorov-Smirnov
Categorical data
• Chi-square (compare observed and expected frequencies).
• Binomial and sign test (compare observed and expected proportions).
• Fisher's and chi-square (analyze a 2x2 contingency table).
• Predictive values from sensitivity, specificity, and prevalence.

Exemplary laboratory data
In the laboratory, we mainly deal with measured data. The distribution of these data can be dominated by the laboratory manipulation itself (e.g., pipetting, repeated measurement of the same sample) or by the analyte (e.g., biological variation). In the first part of the course, 2 data-sets are given that represent these 2 cases (weighing; biological variation of serum triacylglycerides).

Data-set 1 (weighing): Gravimetric control of a pipetted volume
The experiment
• Pipette: 20-200 µl, variable
• Pipetted nominal volume: 100 µl
• 21 pipettings (n = 21)
• Balance: readability 0.01 mg

Data-set 2: Serum triacylglycerides in adult males

Other data-sets
Other data-sets can be found in the file "Datasets". They are used at other places in the book.

Creation of data-sets
The file "DataGeneration" explains how to generate
• Random, univariate data
• Bivariate data with constant SD
• Bivariate data with constant CV
• log-Normal distributed data
It should be used after the book has been worked through.

Importance of digits
It is important to know that the quality of our data may depend on the number of reported digits.
Adapting digits with EXCEL®
1st option
• The decrease/increase decimals icon (.0 / .00)
Decreases the decimals visually, but still uses them for calculations.
2nd option
• Tools>Options>Calculation>Precision as displayed
• Then decrease decimals
The data are rounded and the decimals are lost! Afterwards, deactivate the field again!

Data&DataPresentation (Worksheet "Digits")
We round the data of the weighing experiment. Follow the instructions given in the EXCEL-sheet "Digits":
1. Tools>Options>Calculation>Precision as displayed
2. Select the gray cells and reduce by 2 digits
Afterwards, UNCHECK IT!

Weighing   Weight (mg)   Weight (mg), rounded
1          99.92         100
2          100.23        100
3          99.50         100
…          …             …
10         100.39        100
11         100.23        100
12         99.25         99
13         100.28        100
14         100.25        100
…          …             …
19         99.83         100
20         100.05        100
21         100.22        100

[Figure: histogram of the weighing data; x-axis: Bin (99.00 to 101.00), y-axis: Frequency]

The rounded data no longer reflect the spread of the original data.
• Report your data with sufficient digits, adapted to the measurement precision!

Data presentation
When we have created data, it is important to present them in easily comprehensible forms (>Tables; >Graphs). We use the weighing data for exercise.

Table (weighing experiment)
• Try to describe the data (center, maximum, minimum, etc.).
We note: Tables are difficult to "read"! Sorting may help, but keep the sample number & the result together!
Original order              Sorted by weight
Weighing   Weight (mg)      Weighing   Weight (mg)
1          99.92            12         99.25
2          100.23           3          99.50
3          99.50            16         99.50
4          100.27           9          99.60
5          100.22           19         99.83
6          100.01           1          99.92
7          100.18           6          100.01
8          100.04           8          100.04
9          99.60            17         100.04
10         100.39           20         100.05
11         100.23           18         100.13
12         99.25            7          100.18
13         100.28           5          100.22
14         100.25           21         100.22
15         100.44           2          100.23
16         99.50            11         100.23
17         100.04           14         100.25
18         100.13           4          100.27
19         99.83            13         100.28
20         100.05           10         100.39
21         100.22           15         100.44

Remark
"Single column data" (such as the weighing data) are also called univariate data.

Data&DataPresentation (Worksheet "Dataset")
Sorting by weight
1. Select gray cells
2. Data>Sort: Follow "Print Screen"

We have seen that tables "are difficult to read". Try a picture (graph).

Graphs (Exploratory Data Analysis – "EDA")
Graphs are particularly useful for presenting data in summarized form and for "shedding light" onto the structure of the data. Most useful for univariate data are >Dot-plots, >Histograms, >Box-and-whisker plots, and the >"Normal probability" plots. First, these types of plots will be described, followed by the EXCEL-exercises.

Plots for univariate data
Note: can also be "derived data" from bivariate data (e.g., differences)

Dot plot
The dot plot presents the distribution of a variable (Yi) (usually in y-axis) in a category (usually x-axis). Data point coordinates are [Category; Yi]. Equal values are usually visualized by an offset.
Use: Visual summary of data and data distribution. The dot plot can show: center (i.e., the location) of the data; spread (i.e., the scale) of the data; skewness of the data; presence of outliers; and presence of multiple modes in the data.
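The advice given with the table (sort, but keep the sample number and the result together) can also be illustrated outside EXCEL. The following Python fragment is only a sketch; the variable names are assumptions made for this illustration and are not part of the EXCEL-files. It pairs each weight with its weighing number before sorting.

```python
# Weighing data (mg) from the table above, in the order they were obtained.
weights = [99.92, 100.23, 99.50, 100.27, 100.22, 100.01, 100.18,
           100.04, 99.60, 100.39, 100.23, 99.25, 100.28, 100.25,
           100.44, 99.50, 100.04, 100.13, 99.83, 100.05, 100.22]

# Pair every result with its weighing number (1-based) BEFORE sorting,
# so that number and result cannot be separated.
pairs = list(enumerate(weights, start=1))

# sorted() is stable: equal weights keep their original weighing order.
sorted_pairs = sorted(pairs, key=lambda pair: pair[1])

for number, weight in sorted_pairs:
    print(f"{number:>2}  {weight:.2f}")
```

The first printed row is weighing 12 (99.25 mg) and the last is weighing 15 (100.44 mg), as in the sorted table.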
Plots for univariate data

Histogram (Frequency polygon#)
Histograms (the term was first used by Pearson, 1895) present the frequency distribution of a variable in columns drawn over class intervals (bins*). The heights of the columns are proportional to the class frequencies. Coordinates are [Bin center Xi; frequency Yi].
[Figure: example histogram; x-axis: Value-Bin (55 to 135), y-axis: Frequency (0 to 8)]
*Bin example: Midpoint: 85; Range: 80 – 90; Results in range: 2.
Use: Visual summary of data and data distribution. The histogram can show: center (i.e., the location) of the data; spread (i.e., the scale) of the data; skewness of the data; presence of outliers; and presence of multiple modes in the data.
#Frequency polygon: the midpoints of the top of the columns are connected by a line (columns are not shown). Coordinates example: [55;1], [65;0], [75;0], [85;2], [95;7], etc.

Box & Whisker plot
In box plots (this term was first used by Tukey, 1970), the central tendency (e.g., median or mean), and the range or variation statistics (e.g., quartiles) are computed and presented as a "box". The whiskers outside of the box represent a selected range (e.g., 10% & 90%; here: the full range). Outlier data points can also be plotted. Coordinates are [Category; Particular Y].
Use: Visual summary of data distribution. The box and whisker plot can show: center (i.e., the location) of the data; spread (i.e., the scale) of the data; skewness of the data; presence of outliers. It is particularly useful for detecting and illustrating location and variation changes between different groups of data.

Relative cumulative probability plot
Values of a distribution are ordered and the relative cumulative probability of all values up to a certain value is plotted versus that value.
Coordinates are [Value i; relative cumulative probability up to & including Value i].
Use: Visual test for Normal distribution: comparison of data polygon line with cumulated Gaussian line calculated with data SD.

Krouwer plot ("folded cumulated probability")
Cumulated probability plot with y-axis "folded" at probability P = 0.5, or 50% (P = 0.5 or 50% is the maximum y-value). Up to 50%, coordinates are [Value i; cumulative relative probability up to & including Value i]. Above 50%, coordinates are [Value i; 100% minus cumulative relative probability up to & including Value i].
[Figure: Krouwer plot; x-axis: Multiple of sigma (-5 to 5), y-axis: Cumulative frequency (0 to 0.5)]
Use: Visual test for Normal distribution: comparison of data polygon line with a "folded" cumulated Gaussian line calculated with data SD. Similar to a histogram: visual summary of data and data distribution. The Krouwer plot can show: center (i.e., the location) of the data; spread (i.e., the scale) of the data; skewness of the data; presence of outliers.
Special application: Method comparison (note: concentration information is lost).

Normal probability plot
Cumulated probability plot with y-axis normalized to the Gaussian (or Normal) distribution. Coordinates are [Value i; z-value at Value i].
Use: Visual test for Normal distribution: data should fit a line.
Special application: Reference intervals.

Graphics with EXCEL®

Dot-plots
Data&DataPresentation (Worksheet "Dot Plot 1")
Construct the figure below with EXCEL. Follow the instructions on the sheet. Adapting the layout requires general knowledge about "Charts". This will not be explained further. Some guidance on "Charts" is given in the Annex of the book.
Note
The Worksheet "Dot Plot 2" contains a more advanced version of the Dot-plot. Its construction, however, is relatively complicated and requires some deeper understanding of EXCEL.
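As a complement to the plot descriptions above, the coordinates of the relative cumulative probability plot and of the Krouwer ("folded") plot can be computed in a few lines. The sketch below is an illustration in Python, not the method used in the EXCEL-files. It assumes the simple plotting position (rank / n) for the cumulative probability; the function names and the ten example values are chosen for this illustration only.

```python
def cumulative_coordinates(values):
    """Return [value, relative cumulative probability up to & including value].

    Assumption: the cumulative probability of the i-th ordered value is i / n.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def krouwer_coordinates(values):
    """Fold the cumulative probability at P = 0.5 (the maximum y-value)."""
    folded = []
    for v, p in cumulative_coordinates(values):
        # Up to 50 %: P itself; above 50 %: 100 % minus P.
        folded.append((v, p if p <= 0.5 else 1.0 - p))
    return folded


# Illustrative values (mg), not a complete data set.
example = [99.25, 99.50, 99.60, 99.83, 100.01,
           100.04, 100.13, 100.22, 100.28, 100.44]

for value, p in krouwer_coordinates(example):
    print(f"{value:.2f}  {p:.2f}")
```

Plotting the second column against the first gives the "mountain" shape with its peak at P = 0.5.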
Histogram with EXCEL®
Data&DataPresentation (Worksheet "Histogram")
Construct a histogram with the weighing data by
• Tools>Data Analysis>Histogram
• Follow the guidance in the "Print Screen"
The unmodified EXCEL® figure looks like the one below.
Disadvantages
• Layout not attractive (but can be modified)
• "Strange" data classification ("Bins")
• The "More"-bin
• "Static" = does not adapt when data change
We can modify the histogram by use of the general EXCEL-commands (see Annex).
Difficulty with histograms
No general rule can be given for the definition of the bin-width.

Frequency polygon with EXCEL®
Data&DataPresentation (Worksheet "FrequPolygon")
[Figure: frequency polygon of the weighing data; x-axis: Bin (99.00 to 101.00), y-axis: Frequency (0 to 10)]
Construct the frequency polygon from the histogram
• Copy the histogram into the FrequPolygon sheet
• [Left] Click on the histogram
• Click the chart wizard
• Choose this figure type
• Go to series
• Finalize the figure
• [when necessary, delete series 1]

Dynamic histogram
The dynamic histogram is an elegant form of presenting your data. It adapts automatically when data are added or changed. It uses the "array formula" Frequency in EXCEL®.
You have to
• Define "Bins"
• Select all cells of the "Frequency Range"
• Type: =Frequency(Data-cell1:Data-celln;Bin-cell1:Bin-celln)
– Note: whether the arguments are separated by ; or , depends on the "List Separator" (Control Panel>Regional Settings>Number)
• Press: "SHIFT" & "CONTROL", hold them, and press "ENTER"
Data&DataPresentation (Worksheet "DynHistogram")

Box and whisker plot with EXCEL®
Data&DataPresentation (Worksheet "Box Plot")
Construct the Box plot according to the instructions in the Worksheet.
(from: http://www.mis.coventry.ac.uk/~nhunt/boxplot.htm)
The construction uses the EXCEL-functions Median, Quartile, Minimum, and Maximum. The Box-plot can be constructed by putting them in the presented order. The Figure must be finalized with some special "Figure-commands" (see Worksheet explanations & "Print-Screen" at the right).

Summary: graphs for univariate data
• The dot-plot is a robust graph for small and large data sets.
• The histogram is more suitable for larger data sets; however, the bin-width must be chosen adequately.
• The box and whisker plot is particularly useful for larger data sets; however, it already contains some calculated statistics: it is not a purely graphical method.
Graphs are important tools for the investigation of data distribution (outliers, type of distribution).

Time-indexed plots

Run-sequence plot
The run-sequence plot presents data (Yi) along one axis in the time sequence they were obtained. Coordinates are [Time or event#; Yi].
Use: Presentation and investigation of time series (drift, shift, outlier).
Special application: quality control.
The figures below show 3 situations where randomness is violated (remember: during sorting, keep the sample number & the result together).

Lag-plot (a lag is a fixed time displacement)
A lag plot checks whether a data set or time series is random or not. Random data should not exhibit any identifiable structure in the lag plot. Non-random structure in the lag plot indicates that the underlying data are not random. In the Lag-plot, Y(i-n) (usually n = 1) is plotted on the x-axis and Yi is plotted on the y-axis.
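The lag-plot coordinates described above are easy to generate from any data sequence. The following Python sketch is an illustration only: the function name `lag_pairs` and the sinusoidal example sequence are assumptions made here and do not come from the EXCEL-files.

```python
import math


def lag_pairs(series, lag=1):
    """Return the lag-plot coordinates [Y(i-lag); Y(i)] for lag >= 1."""
    return list(zip(series[:-lag], series[lag:]))


# Hypothetical example: a sinusoidal, i.e. clearly non-random, sequence.
sinus = [math.sin(2 * math.pi * i / 20) for i in range(40)]

pairs = lag_pairs(sinus, lag=1)
for x, y in pairs[:5]:
    print(f"Y(i-1) = {x:6.3f}   Y(i) = {y:6.3f}")
```

Plotted as a scatter, pairs from random data show no structure, whereas the pairs from the sinusoidal sequence fall on a closed loop.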
[Figure: lag plot of a sinusoidal data sequence (underlying data structure); x-axis: Yi-1, y-axis: Yi]

Exploratory data analysis
A wealth of information about Exploratory Data Analysis can be found in the NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/
The most basic set of graphics for the investigation of a data set is the so-called "4-plot".

"4-plot"
The "4-plot" consists of a
• run sequence plot;
• lag plot;
• histogram;
• normal probability plot (see later >Normal distribution).
Investigate data for
• Location
• Variation
• Distribution
• Outliers

Data description
• Descriptive statistics
– Location
– Dispersion
• Equations
• Descriptive statistics with EXCEL®
• Importance of digits
EXCEL-file: GaussianDistribution

Introduction
After we have plotted our data, we need to characterize them quantitatively. We use, for that purpose, several different measures that are related to the location (or central tendency) and the dispersion (or variability) of the data.
Note
We have to distinguish in the following between parameters and their statistical estimates (or "statistics").

Parameter
A parameter is a numerical quantity measuring some aspect of a population. For example, the mean is a measure of central tendency. Greek letters are used to designate parameters. Parameters are rarely known and are usually estimated by statistics computed in samples. To the right of each Greek symbol is the symbol for the associated statistic used to estimate it from a sample.
Quantity             Parameter   Statistic
Mean                 μ           M (or Xbar)
Standard deviation   σ           s
Proportion           π           p
Correlation          ρ           r

Descriptive statistics

Location
Measures for the location (or central tendency) of data are the mean (average), the median, and the mode.
Mean
• Sum of all values divided by the number of data
Median
• Odd number of data: value in the center
• Even number of data: mean of the 2 values in the center
Mode
• Value that is observed most frequently
For symmetric distributions, the mean, median, and mode are found at the same value. For skewed distributions, those 3 are found at different values.
Notes
In symmetric distributions, the mean is a good location measure. In skewed distributions, the median is a better location measure than the mean. The mode is the only location measure that can be used with nominal data.

Dispersion
Measures for the dispersion (or variability) of data are the range, quartiles/quantiles, the variance, the standard deviation, the coefficient of variation, and the z-value.
Range
• Maximum minus minimum
Quartiles
• The lower and upper quartiles (or 0.25 and 0.75 quantiles) are the 25th and 75th percentiles of the distribution. The 25th percentile of a variable is a value such that 25% of the values of the variable fall below that value.
Variance
• Sum of the squared differences of the values from the mean, divided by the number of data minus 1 (n - 1)!
Standard deviation (SD, or s: both are used in the course)
• Square root of the variance (see also: from duplicates)
Coefficient of variation
• = relative SD in %; = 100 • [SD/mean] (%)
z-value (Normalized, or normal, standard deviate)
• $z = (y - \mu)/\sigma$, or $z = (x_i - \bar{x})/s$

Equations
The measures defined above, written as equations ($\bar{x}$ = mean, $n$ = number of data):
$\bar{x} = \frac{\sum x_i}{n}$
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
$s = \sqrt{s^2}$
$CV = 100 \cdot \frac{s}{\bar{x}}$ (%)
From k duplicates ($d_i$ = difference between the 2 results of duplicate i):
$s = \sqrt{\frac{\sum d_i^2}{2k}}$

Descriptive statistics with EXCEL®
GaussianDistribution (Worksheet "DescrStats")
Tools>Data Analysis>Descriptive Statistics
• Follow the guidance in the "Print Screen"

Weight (mg)                Descriptive statistics   Single formula
Mean                       100.0276                 100.0276
Standard Error             0.070017                 0.070017
Median                     100.13                   100.13
Mode                       100.23                   100.23
Standard Deviation         0.320857                 0.320857
Sample Variance            0.102949                 0.102949
Kurtosis                   0.433566                 0.433566
Skewness                   -1.09862                 -1.09862
Range                      1.19                     1.19
Minimum                    99.25                    99.25
Maximum                    100.44                   100.44
Sum                        2100.58                  2100.58
Count                      21                       21
Confidence Level (95.0%)   0.146052                 0.146052

The descriptive statistics function in EXCEL calculates all the measures we have seen (and more: those will be explained later). Alternatively, the measures can be calculated individually by EXCEL using the fx icon.

Importance of digits
Coming back to the rounded weighing data, we look at the mean and the SD of the original data set and the rounded data set:

Weighing   Weight (mg)   Weight (mg), rounded
1          99.92         100
2          100.23        100
3          99.50         100
…          …             …
19         99.83         100
20         100.05        100
21         100.22        100
Mean       100.03        99.95
SD         0.321         0.218

We observe that the rounded data give different mean and SD values! Too few digits give erroneous statistical estimates!

More data
Compare the distributions when you acquire more data: Which distribution do you recognize?
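The descriptive statistics of the weighing data, and the effect of rounding, can be cross-checked outside EXCEL. The sketch below uses the Python standard library; the function name `describe` and the choice to round half up to whole mg (intended to mimic the "Precision as displayed" exercise) are assumptions made for this illustration.

```python
import math
import statistics

# Weighing data (mg), 21 pipettings.
weights = [99.92, 100.23, 99.50, 100.27, 100.22, 100.01, 100.18,
           100.04, 99.60, 100.39, 100.23, 99.25, 100.28, 100.25,
           100.44, 99.50, 100.04, 100.13, 99.83, 100.05, 100.22]


def describe(data):
    """Location and dispersion measures as defined in the text."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample SD: n - 1 in the denominator
    return {
        "mean": mean,
        "median": statistics.median(data),
        "variance": statistics.variance(data),
        "sd": sd,
        "cv_percent": 100 * sd / mean,
        "range": max(data) - min(data),
        "standard_error": sd / math.sqrt(len(data)),
    }


# Assumption: round half up to whole mg (too few digits on purpose).
rounded = [math.floor(w + 0.5) for w in weights]

original_stats = describe(weights)
rounded_stats = describe(rounded)

for name in ("mean", "sd"):
    print(f"{name}: original = {original_stats[name]:.3f}, "
          f"rounded = {rounded_stats[name]:.3f}")
```

With the original data the mean and SD agree with the EXCEL output (100.03 and 0.321 mg); with the rounded data they shift to about 99.95 and 0.218 mg, which illustrates why too few digits distort the estimates.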
Gaussian (or Normal) distribution
• "Bell-shaped" (similar to a histogram)
• Cumulated: "S-shaped"
• Cumulated & linearized
• 2-sided and 1-sided probabilities
• Inside/outside probabilities
• Probabilities at selected s (z) values
• Deviation from normality (skewness & kurtosis)
EXCEL-files: Datasets; GaussianDistribution; NormalRankitPlot

The Gaussian (or "Normal") distribution
The Gaussian distribution is of utmost importance in analytical chemistry. The statistics involved with that distribution are called "parametric" statistics.
If we repeated the weighing an infinite number of times (∞), we would expect a distribution as represented by the line.
The normal distribution is defined by its mean and standard deviation. The standard normal distribution has a mean of 0 and a standard deviation of 1.

IMPORTANT NOTE: For the "infinite" distributions, specific symbols (= "Parameters") are used:
• Mean = µ
• Standard deviation = σ
• Normalized (or normal) standard deviate = z

GaussianDistribution (Worksheet "Random")
EXCEL has a function that can simulate Gaussian distributed data. The function can be accessed with TOOLS>Data Analysis>Random number generation.
The worksheet "Random"
• explains how to generate random numbers
• presents the result in a dynamic histogram
• allows the comparison of the "requested" mean and SD with the simulated "sample" mean and SD.
Please note that the requested and the sample values may differ, in particular when the sample size is small.
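The comparison of "requested" and "sample" values can be tried outside EXCEL as well. The sketch below is an assumption-based Python analogue of the "Random" worksheet (not the worksheet itself). It draws Gaussian random numbers with `random.gauss`; the function name `simulate`, the seed, and the requested mean and SD are illustrative choices.

```python
import random
import statistics


def simulate(requested_mean, requested_sd, n, seed=1):
    """Draw n Gaussian random numbers and return the sample mean and sample SD."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    sample = [rng.gauss(requested_mean, requested_sd) for _ in range(n)]
    return statistics.mean(sample), statistics.stdev(sample)


# Requested: mean = 100.0 mg, SD = 0.3 mg (hypothetical values)
for n in (5, 20, 1000):
    m, s = simulate(100.0, 0.3, n)
    print(f"n = {n:4d}: sample mean = {m:.3f}, sample SD = {s:.3f}")
```

With small n, the sample mean and SD typically deviate noticeably from the requested values; with large n they come close to them.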
Graphical presentation of the Gaussian distribution
The Gaussian distribution can be presented
• In the normal way: "Bell-shaped" (similar to a histogram)
• Cumulated: "S-shaped"
• Cumulated & linearized = Normal probability plot
  EXCEL® template from P Hyltoft Petersen (note: not available in EXCEL® itself)

GaussianDistribution (Worksheets "GaussBell"; "GaussCumul")
These worksheets use the EXCEL NORMDIST function. The "Print Screens" guide you through their application. The graphs will appear automatically.
Note: The Normal Probability Plot will be demonstrated later.

Gaussian distributions – Probabilities

IMPORTANT NOTE
When data are Gaussian distributed, we can predict the frequencies (or probabilities) of their occurrence within or outside certain distances (s, or z-values) from the mean (see also the figures above). These probabilities are used in parametric statistical calculations. They are listed in tables, but they can also be calculated with EXCEL®.
Of particular importance are the probabilities that are used in statistical tests (95%, 99% probabilities).

2-sided and 1-sided probabilities
Statistics distinguishes between 2-sided and 1-sided probabilities
• 2-sided probabilities: the question is "Is A different from B?"
• 1-sided probabilities: the question is "Is A > B?" (or "Is A < B?")
Of practical importance are the probabilities "Inside" & "Outside"
• Outside probabilities, for example, are important in internal quality control.
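The inside and outside probabilities described above can be calculated from the cumulative standard normal distribution. The following sketch uses `statistics.NormalDist` from the Python standard library as a stand-in for the EXCEL NORMDIST function; the function names are illustrative assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # standard normal: mean 0, SD 1


def inside_two_sided(z):
    """Probability of a value between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)


def outside_two_sided(z):
    """Probability of a value below -z or above +z."""
    return 1.0 - inside_two_sided(z)


def inside_one_sided(z):
    """Probability of a value below +z."""
    return std_normal.cdf(z)


def outside_one_sided(z):
    """Probability of a value above +z."""
    return 1.0 - std_normal.cdf(z)


for z in (1.65, 1.96, 2.0, 2.33, 2.58, 3.0):
    print(f"z = {z:4.2f}: inside 1-sided {100 * inside_one_sided(z):6.2f} %, "
          f"inside 2-sided {100 * inside_two_sided(z):6.2f} %, "
          f"outside 1-sided {100 * outside_one_sided(z):5.2f} %, "
          f"outside 2-sided {100 * outside_two_sided(z):5.2f} %")

# The reverse question: which z leaves 2.5 % outside on one side?
z_975 = std_normal.inv_cdf(0.975)
```

The inverse function `inv_cdf` answers the reverse question: the z-value that belongs to a chosen probability.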
Probabilities at selected s (z) values

              INSIDE                  OUTSIDE
z          1-sided    2-sided     1-sided    2-sided
1.28 s     90%
1.65 s     95%        [90%]       5%         [10%]
1.96 s     97.5%      95%         2.5%       5%
2.0 s      97.7%      95.5%       2.3%       4.5%
2.33 s     99%        98%         1.0%       2.0%
2.58 s     99.5%      99%         0.5%       1.0%
3.0 s      99.87%     99.7%       0.13%      0.3%

1-sided probabilities
1-sided probabilities can be expected in the presence of considerable systematic error.
[Figure: frequency distributions for SE/RE = 0 and SE/RE = 1 with the -1.96 and +1.96 limits marked (x-axis: Value; y-axis: Frequency)]
[Figure: z-multiplier (1.6 to 2.0) versus SE/RE (0.00 to 1.00)]
Stöckl D, Thienpont LM. About the z-multiplier in total error calculations. Clin Chem Lab Med 2008;46:1648–9.
At an SE/RE of 0.75, the probabilities become practically 1-sided (see Figure).

GaussianDistribution (Worksheets "Probability")
These worksheets demonstrate the modulation of the Gaussian distribution and the observation of probabilities:

Outside 3s
This template demonstrates how the probabilities outside the original 3s limits (original population SD = 1) change when the population mean and/or SD are modulated. Modulation is achieved by simply clicking on the "Spinners".

Outside 1.96s
This template demonstrates how the probabilities outside the original 1.96s limits (original population SD = 1) change when the population mean and/or SD are modulated. Modulation is achieved by simply clicking on the "Spinners".

1-sided probabilities
This template demonstrates the concept of 1-sided probabilities. The "Spinners" allow the movement of the z-value. 1-sided probabilities are displayed at the top of the figure. Mean and SD are fixed in this example.

"Inside"-probabilities
This template shows the probabilities within certain distances (z-values) of the population mean. The z-value can be modulated with the "Spinners". Mean and SD are fixed in this example.

"Outside"-probabilities
This template shows the probabilities outside certain distances (z-values) of the population mean. The z-value can be modulated with the "Spinners". Mean and SD are fixed in this example.

Deviation from normality

Skewness and Kurtosis
The Gaussian distribution is characterized by specific frequencies of values within certain distances of the mean, and it is symmetric about its mean.
Distributions observed in practice may deviate from the ideal Gaussian distribution because of:
• Skewness (left skew; right skew)
  – Too many data on one side (left or right)
• Kurtosis (too many, or too few data in the center)
  – Platykurtic (too few data in the center)
  – Leptokurtic (too many data in the center)
These situations are shown in the figures below, together with the respective numbers calculated by EXCEL®.

Coefficient of skewness: $C_{skew} = \frac{\sum (x_i - x_m)^3 / N}{SD^3}$
Zero: symmetric distribution; Positive: skewed to the right; Negative: skewed to the left.

Coefficient of kurtosis: $C_{kurt} = \frac{\sum (x_i - x_m)^4 / N}{SD^4} - 3$
Zero: Normal distribution; Positive: peaked distribution; Negative: flat distribution.

Both together are used in significance tests for normality. Some of the most commonly used tests are listed on the next page.

Testing normality

Statistical significance tests for deviation from normality
• Chi-square
• Kolmogorov-Smirnov
• Anderson-Darling
• D’Agostino-Pearson
The preferred tests are Anderson-Darling and D’Agostino-Pearson!
Unfortunately, EXCEL has no built-in test for normality.
Statistical tests for normality are usually not useful below sample sizes of 20 to 30.

Graphical test for deviation from normality
Normal Probability Plot (Courtesy of Per Hyltoft Petersen)
The Normal Probability Plot/Rankit Plot allows visual assessment of the data distribution. When data are NORMAL distributed, they should lie on a LINE.
Deviations from the line indicate other distributions (e.g., skewed ones).

NormalRankitPlot
• A maximum of 1000 values can be entered. Please SORT the data, if necessary.
• The template provides for the transformation of the data into logarithms (the natural logarithm, ln, is used). The 1st cell (E6) already contains the formula. After sorting of the original data, the 1st cell should be copied down to the last entry.
• The graphics are automatically produced on the other sheets.
• The plots have 2 y-axes. The left y-axis is in units of z and is linear in terms of z. The right y-axis is in units of probability (%) and is non-linear in terms of probability!
The plot shows:
• The distribution of the data
• The Normal model of the data with its confidence intervals
• The -/+1.96 s limits of the data, corresponding to the 2.5th and 97.5th percentiles.
The cumulated percentage of the data can be read from the right y-scale.
Note: The % scale is represented by a picture and the tick-marks are created by separate data series. If necessary, adapt the location of the tick-marks by changing the value in the yellow cell (D3).

Assessment of normality
Compare the distribution with the model.

NormalRankitPlot: Triacylglyceride example

Calculations with logarithms

Data transformation: Logarithms
When the data are not normally distributed, one can try a transformation. Because, in nature, data are often log-normally distributed, logarithmic transformation of the data can make them normally distributed.

Test for normality: Triglycerides (See: Datasets.xls)
n = 282; Lowest value: 0.3 mmol/L; Highest value: 3.2 mmol/L; Median: 0.92 mmol/L.
CBstat
• Anderson-Darling test: P < 0.01 → data not normally distributed
• Anderson-Darling test after logarithmic (natural) transformation: P = 0.13 → data log-normally distributed
Normal Probability Plot (ln-transformed data)
• Data are "on a line" → data are ln-Normal distributed

Working with logarithms

Calculate the reference interval of a logarithmic distribution (Triglycerides)
1. Transform the original data to ln
2. Calculate the mean of the ln(x_i) values
3. Take the anti-ln of the mean of ln(x_i)
This equals the geometric mean of the original population, which is close to its median.
The anti-ln of the mean of the logged values, $e^{-0.0689}$, is equal to the geometric mean of the original distribution, where the latter is given by $(x_1 \cdot x_2 \cdots x_n)^{1/n}$.
The anti-ln of the SD is meaningless.

Number            mmol/l    ln
1                 0.3       -1.204
2                 0.32      -1.139
3                 0.34      -1.079
4                 0.38      -0.968
5                 0.4       -0.916
6                 0.4       -0.916
…                 …         …
282               3.2       1.163
Median            0.92
Mean, ln                    -0.069
Anti-ln (e^x)     0.933               EXCEL: EXP(x)
Geometric mean    0.933               EXCEL: GEOMEAN

Calculation of the 2.5th and 97.5th percentiles
Mean (ln transformed)               -0.0689
SD (ln transformed)                 0.395
2.5th percentile                    -0.0689 – 1.96 • 0.395 = -0.843
97.5th percentile                   -0.0689 + 1.96 • 0.395 = 0.7053
Anti-ln of 2.5th & 97.5th perc.     0.43 – 2.02
Reference interval = 0.43 – 2.02 mmol/l

Alternative
Alternatively, a non-parametric approach to the data could have been chosen. Non-parametric reference intervals can be calculated with the CBstat-software.

Notes

CAVE log-transformation
Introduction of non-linearity by data transformation in method comparison and commutability studies. Stöckl D, Thienpont LM. Clin Chem Lab Med 2008;46:1784-5.
[Figure: four scatter plots of routine method versus reference method, in AU (0 to 350) and in lnAU (1 to 6). Fitted equations shown in the panels: y = 1.0994x - 0.3849; y = 1.0113x + 0.0339; y = 0.9995x + 14.65; y = -0.0108x^3 + 0.21x^2 - 0.376x + 3.075]

Sampling statistics & Confidence intervals
t-distribution (sampling distribution of means)
• Confidence interval of a mean
• Confidence interval of the "1.96" s-point
Chi² (χ²)-distribution (sampling distribution of variances)
• Confidence interval of a standard deviation
Interpretation of confidence limits
EXCEL-files: SamplingStatistics; CI-Calculator

Introduction
We have characterized the Normal distribution on the basis of infinite sample size. In practice, we are only able to take a finite sample. The smaller our sample size, the more uncertain our estimates will be.
All experimental estimates have an uncertainty. The "true" value lies, with a stated probability, within a certain confidence interval around our estimate!
We investigate the sampling distribution of
• Means → t-distribution
• Variances → χ²-distribution
These distributions are the basis for the calculation of confidence intervals (CI's) of experimentally determined means and variances (standard deviations).

t-distribution (sampling distribution of means)
The t-distribution forms the basis for the statistical treatment of means. As with the Normal distribution for single results, the t-distribution allows us to predict (with a certain probability) the location of a true population mean (µ) within a certain distance (confidence interval, CI) of the experimental mean (see the Note below). The probabilities can be viewed 1-sided and 2-sided.
The formula for the calculation of the CI is:
$\mu = \bar{x} \pm t_{(\nu,\alpha)} \cdot \frac{s}{\sqrt{n}}$
Note: The term $s/\sqrt{n}$ is called the standard error of the mean (SEM).
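As a worked illustration of the CI formula, the sketch below recomputes the 95% confidence interval of the mean for the weighing example (mean 100.0276 mg, SD 0.320857 mg, n = 21, taken from the descriptive statistics table). The function name is an assumption, and the t-value is passed in by hand: 2.086 is the tabulated 2-sided 95% value for 20 degrees of freedom, because the Python standard library has no t-distribution.

```python
import math


def confidence_interval_mean(mean, sd, n, t_value):
    """Return (lower CL, upper CL) of a mean: mean -/+ t * s / sqrt(n)."""
    sem = sd / math.sqrt(n)       # standard error of the mean
    half_width = t_value * sem    # distance from the mean to each confidence limit
    return mean - half_width, mean + half_width


# Weighing example; t = 2.086 (2-sided 95%, 20 degrees of freedom, from a t-table)
lower, upper = confidence_interval_mean(100.0276, 0.320857, 21, t_value=2.086)
half_width = (upper - lower) / 2

print(f"SEM        = {0.320857 / math.sqrt(21):.6f}")
print(f"half-width = {half_width:.4f}")
print(f"95% CI     = {lower:.3f} to {upper:.3f} mg")
```

The half-width should agree, within rounding of the t-value, with the "Confidence Level(95,0%)" entry of 0,146052 in the EXCEL descriptive statistics output.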
Note (infinite measurements or known σ): If x is normally distributed with mean µ and standard deviation σ:
• 95% of the x observations are within $\mu \pm 1.96\,\sigma$
• 95% of the $\bar{x}$ values (means of n observations) are within $\mu \pm 1.96\,\sigma/\sqrt{n}$
When σ is known, one can use the z-value instead of the t-value.

Characteristics of the t-distribution (see also figure below)
• The shape of the t-distribution(s) depends on n (on the degrees of freedom, ν).
• The t-distribution equals the normal distribution for ν = ∞.
• t-distributions are more peaked than the normal and have wider tails.
[Figure: t-distributions for ν = 1 and ν = 4 compared with the Gaussian distribution (ν = ∞)]

Remark
The means of independent observations tend to be normally distributed irrespective of the primary type of distribution (Central limit theorem).

Confidence interval/limits of the mean

Relationship confidence interval/confidence limit
The confidence interval spans from the lower to the upper confidence limit (CL): lower CL < mean < upper CL
• CI = ± t • s/√n
• Lower CL = mean – t • s/√n
• Upper CL = mean + t • s/√n
The CI/CL of the mean depends
• on the probability level, α
• on the sort of tail (1-/2-tailed, also called 1-sided, or 2-sided)
• on n (ν, respectively)
  α, ν, and the "sort of tail" determine the magnitude of t
• on the standard deviation s (also denoted SD in the book)
The expression t/√n can be summarized by a factor k. Then, a CL can be calculated as k • SD. A table of k-factors is given below, as well as a graphical presentation.

Relationship between confidence limit and sample size: k-factors for the 2-sided 95% confidence limit of a mean

n      k (x SD)
4      1,591
5      1,242
6      1,049
10     0,715
15     0,554
20     0,468
21     0,455
30     0,373
50     0,284
100    0,198

[Figure: 2-sided 95% CL of the mean (in SD units, 0,0 to 1,6) versus n, from n = 4 to n = 100]

Confidence limit of the 1.96 s point ("centile")
As for the mean, CL's can be calculated for any other point of the Normal distribution, e.g., for the 1.96 s point.
The standard error (SE) for the 1.96 s point is:
$SE_{1.96s} = \sqrt{\frac{s^2}{n} + \frac{1.96^2\, s^2}{2n}}$
The CL of the 1.96 s point is calculated as: CL(1.96 s) = 1.71 • CL(mean)
The CL of the 1.96 s point is important for
• Reference intervals
• The Bland & Altman interpretation of method comparison studies.

Chi² (χ²)-distribution (sampling distribution of variances)
The χ²-distribution allows us to predict the location of a true population standard deviation (σ) within a certain distance (CI) of the experimental SD. The probabilities can be viewed 1-sided and 2-sided. The distribution is used to calculate CI's/CL's of SD's.

Characteristics of the Chi² (χ²)-distribution (see also figure below)
• The shape of the function(s) depends on n (ν)
• The function(s) are highly asymmetric → 95%-CIs of SDs become asymmetric

Confidence interval/limits of s (SD)
The CI/CL of s (SD) depends:
• on the probability level, α (1-sided, or 2-sided, also called 1-/2-tailed)
• on n (ν, respectively)
Calculation (2-sided; α/2 = 0.025, [1 – α/2] = 0.975)
Lower CL = SD • $[(n-1)/\chi^2_{0.025}(n-1)]^{0.5}$
Upper CL = SD • $[(n-1)/\chi^2_{0.975}(n-1)]^{0.5}$
Here, $\chi^2_{p}(n-1)$ denotes the χ² value for n – 1 degrees of freedom that is exceeded with probability p (the value returned by the EXCEL function CHIINV).

Relationship between confidence limit and sample size: Factors for the 2-sided 95% confidence limit of s (SD)

       Limits (x SD)
n      Lower    Upper
4      0,566    3,729
5      0,599    2,874
6      0,624    2,453
10     0,688    1,826
15     0,732    1,577
20     0,760    1,461
21     0,765    1,444
30     0,796    1,344
50     0,835    1,246
100    0,878    1,162

[Figure: lower and upper 2-sided 95% CL of the SD (in SD units, 0,0 to 4,0) versus n, from n = 4 to n = 100]

Interpretation of 95%-confidence limits

Confidence limits and quality specifications
The figure below shows a graphical interpretation of 95%-confidence limits versus a predefined quality specification: "10".
Note
When comparing an estimate with a specification, the confidence limits are usually constructed 1-sided.
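The two sampling calculations of this section (the factor for the 1.96 s point and the confidence limits of an SD) can be checked numerically. The sketch below is a minimal example under stated assumptions: the function names are illustrative, and the χ² values for 20 degrees of freedom (34.170 exceeded with probability 0.025, 9.591 exceeded with probability 0.975) are entered by hand from a χ² table, because the Python standard library has no χ²-distribution. The SD of 0.320857 mg with n = 21 is the weighing example.

```python
import math


def factor_196_point():
    """Ratio SE(1.96 s point) / SE(mean) = sqrt(1 + 1.96^2 / 2)."""
    return math.sqrt(1.0 + 1.96 ** 2 / 2.0)


def confidence_limits_sd(sd, n, chi2_p0025, chi2_p0975):
    """2-sided 95% confidence limits of an SD.

    chi2_p0025: chi-square value (n - 1 df) exceeded with probability 0.025
    chi2_p0975: chi-square value (n - 1 df) exceeded with probability 0.975
    """
    df = n - 1
    lower = sd * math.sqrt(df / chi2_p0025)
    upper = sd * math.sqrt(df / chi2_p0975)
    return lower, upper


# Weighing example: SD = 0.320857 mg, n = 21 (20 degrees of freedom)
sd = 0.320857
lower, upper = confidence_limits_sd(sd, 21, chi2_p0025=34.170, chi2_p0975=9.591)

print(f"factor for the 1.96 s point: {factor_196_point():.3f}")
print(f"95% CL of the SD: {lower:.3f} to {upper:.3f} mg")
print(f"as factors of the SD: {lower / sd:.3f} and {upper / sd:.3f}")
```

The last line should reproduce the n = 21 row of the factor table (0,765 and 1,444), and it shows that the interval is asymmetric around the SD.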
[Figure: cases A to D with their 95%-confidence limits plotted against the specification "10"; the specification is read either as (1) a limit or as (2) a typical performance]

1. Interpretation of the cases A – D when the specification is a limit
A: "In": the specification is satisfied with 95% probability.
B: Not "In" with 95% probability
• More data may help
C: Not "In" with 95% probability, but also not "Out" with 95% probability.
D: "Out"

2. Interpretation when the number characterizes a stable process
If the "number" is the typical performance of a stable process, situation C can still be accepted.
C: Look at the lower limit: Not "Out" with 95% probability.
This approach is applied in the EP 5 protocol to investigate whether the user CV is different from the typical manufacturer CV.

Exercises

SamplingStatistics
This tutorial contains interactive exercises that demonstrate:
• Sampling: The general effect of sample size on the distribution of mean & SD (single repetitions).
• Var, SD, Mean: The expected distribution of variance, SD, and mean for different sample sizes (high number of repetitions).
• Central Limit: The "Central Limit Theorem".
• t-Distr: The effect of the degrees of freedom on the t-distribution.
• Chi-square: The effect of the degrees of freedom on the Chi-square-distribution.
The worksheets
• Conf-Interval
• CI 1.96 centile
• CI interpretation
contain similar information as this text.

CI-Calculator
This file allows:
• The calculation of 1- and 2-sided 95% confidence intervals for the mean and the coefficient of variation (CV).
• The comparison of an experimentally observed mean and CV with a target value.
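The comparison of an estimate with a specification (cases A to D) can be written down as a small decision rule. The sketch below is an interpretation, not the CI-Calculator itself. It assumes that the specification is an upper limit (smaller is better), that the confidence limits have already been calculated, and that cases B and C differ in whether the estimate itself lies below or above the specification; the function name and the example numbers are hypothetical.

```python
def interpret(lower_cl, estimate, upper_cl, specification):
    """Classify an estimate against an upper specification limit (cases A to D)."""
    if upper_cl <= specification:
        return "A"  # "In": whole confidence interval at or below the specification
    if lower_cl > specification:
        return "D"  # "Out": whole confidence interval above the specification
    if estimate <= specification:
        return "B"  # estimate meets the specification, but not with 95% probability
    return "C"      # estimate exceeds the specification, but not "Out" with 95% probability


# Hypothetical CVs (%) with confidence limits, compared with the specification "10"
specification = 10.0
cases = [
    (6.0, 7.0, 8.5),
    (8.0, 9.5, 11.5),
    (9.0, 10.5, 12.5),
    (11.0, 12.0, 13.5),
]
for lower_cl, estimate, upper_cl in cases:
    case = interpret(lower_cl, estimate, upper_cl, specification)
    print(f"estimate {estimate:5.1f} ({lower_cl:.1f} to {upper_cl:.1f}): case {case}")
```

When the number is the typical performance of a stable process, case C would still be accepted under this rule, because its lower limit lies below the specification.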