Licence

This presentation is © 2010-11, Anne Segonds-Pichon.
This presentation is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0 licence. This means that you are free:
• to copy, distribute, display, and perform the work
• to make derivative works
Under the following conditions:
• Attribution. You must give the original author credit.
• Non-Commercial. You may not use this work for commercial purposes.
• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.
Please note that:
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this licence impairs or restricts the author's moral rights.
Full details of this licence can be found at http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode

Introduction to statistics with GraphPad Prism 5
Anne Segonds-Pichon (anne.segonds-pichon@babraham.ac.uk)

Experimental design, power analysis, sample size

Experimental design

Think stats!!
• Translate your biological question into stats.
• What type of data are you going to collect?
  – e.g. categorical or quantitative?
• Very important: the difference between technical and biological replicates.
  – Technical replicates involve taking one sample from one tube and analysing it across multiple conditions.
  – Biological replicates are different samples measured across multiple conditions.

Good experimental design relies on 2 principles:
• replication: the more times something is repeated, the greater the confidence of ending up with a genuine result.
• randomization: experimental subjects must be allocated to treatment groups at random.

Common errors in the design of experiments:
• experiments done on an ad hoc basis
  – e.g. very different group sizes or time variability not taken into account
  – control and treatment done on different days
• inappropriate choice of treatment group
  – e.g. different age groups within genotypes
• experiment too large or too small
  – e.g. groups so small that no statistical analysis can be done, or only a non-parametric one

Power analysis

• Typically used to estimate a sufficient sample size
• Definition of power: probability of detecting the specified effect at the specified significance level.
• Also: the power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is false.
• By convention (the threshold is arbitrary), accepted power: 80% to 90%

The power analysis depends on the relationship between 6 variables:
• the effect size of biological interest
• the standard deviation
• the significance level
• the desired power of the experiment
• the sample size
• the alternative hypothesis (i.e. one- or two-sided test)
Fix any five of these and a mathematical relationship can be used to estimate the sixth.

1 The effect size of biological interest
• the larger the effect size, the smaller the experiment will need to be to detect it.
• measure of effect size: Cohen's d, with d = (Mean 1 – Mean 2) / pooled SD
• effect size conventions:
  – d = 0.20: small
  – d = 0.50: medium
  – d = 0.80: large

2 The standard deviation
• ideally: pilot study
• if not: difficult to estimate
• solutions:
  – literature
  – previous experiments
  – best and worst case based on the lowest and highest of the available estimates

3 The significance level
• usually 5% (p<0.05)
• the p-value is 'the probability that a result at least as extreme as the one actually found could have been found if the null hypothesis were true'.
  – e.g. a difference between 2 means (treatment and control) as 'big' as the one actually found in the experiment could be found even if the treatment had no effect.
• Don't throw away a p-value=0.051!

4 The desired power of the experiment
• ~80%

5 The sample size
• that's the whole point!
• Home Office: 3 Rs: replacement, refinement, reduction
[Diagram: effect size, standard deviation, sample size]

6 The alternative hypothesis
• is it a one- or two-sided test?

Good news: G*Power can do it for you …
… if you know all the parameters.
• with no prior knowledge, the trickiest parameters are effect size and standard deviation.

Power Analysis
[Figure: two panels plotting Variable (axis 7.0 to 9.0) for Sample 1 and Sample 2]

Alternative: the resource equation method
• more appropriate in biological than in medical experiments
• more about the existence of a difference than the size of it
E = N – T, where:
  E is the error degrees of freedom
  N is the total number of experimental units
  T is the number of treatment combinations
Example: 5 treatment groups and 6 mice in each group: E = 30 – 5 = 25
Rule of thumb: E should be between 10 and 20 (so here E = 25 lies above the range)
To be handled with care!

Common errors in the statistical analysis of experiments:
• failure to do any statistical analysis on numerical data
• failure to screen raw data for errors
• inappropriate standardisation
• misinterpretation of p-values
• inappropriate or incorrect statistical analysis, in particular of the Student's t-test

Remember: Stats are all about understanding and controlling variation.
[Diagram: signal / noise]
• If the noise is low then the signal is detectable … = statistical significance
• … but if the noise (i.e. interindividual variation) is large then the same signal will not be detected = no statistical significance
In a statistical test, the ratio of signal to noise determines the significance.

"To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of." R.A. Fisher, 1938

Qualitative data

• = not numerical
• = the values taken are usually names (also called nominal)
  – e.g. variable sex: male or female
• Values can be numbers but not numerical
  – e.g. group number = numerical label but not a unit of measurement
• Qualitative variable with an intrinsic order in its categories = ordinal
• Particular case: qualitative variable with 2 categories: binary or dichotomous
  – e.g. alive/dead or male/female

Analysis of qualitative data

Example of data (cats and dogs.xlsx):
• Cats and dogs trained to line dance
• 2 different rewards: food or affection
• Is there a difference between the rewards?
• Is there a significant relationship between my 2 variables?
  – are the animals rewarded by food more likely to line dance than the ones rewarded by affection?
• To answer this question:
  – a Chi-square test

Chi-square test

• In a chi-square test, the observed frequencies for two or more groups are compared with the frequencies expected by chance.
  – With observed frequency = collected data
• Example with the cats and dogs.xlsx

Chi-square test (2)

Expected frequency = (row total)*(column total)/grand total

Did they dance? * Type of Training * Animal Crosstabulation

Animal  Did they dance?                      Food as Reward  Affection as Reward  Total
Cat     Yes    Count                         26              6                    32
               % within Did they dance?      81.3%           18.8%                100.0%
        No     Count                         6               30                   36
               % within Did they dance?      16.7%           83.3%                100.0%
        Total  Count                         32              36                   68
               % within Did they dance?      47.1%           52.9%                100.0%
Dog     Yes    Count                         23              24                   47
               % within Did they dance?      48.9%           51.1%                100.0%
        No     Count                         9               10                   19
               % within Did they dance?      47.4%           52.6%                100.0%
        Total  Count                         32              34                   66
               % within Did they dance?      48.5%           51.5%                100.0%

Example: expected frequency of cats line dancing after having received food as a reward:
  Probability of line dancing: 32/68
  Probability of receiving food: 32/68
  Expected frequency: (32/68)*(32/68) = 0.22, and 22% of 68 = 15.1

Did they dance? * Type of Training * Animal Crosstabulation

Animal  Did they dance?            Food as Reward  Affection as Reward  Total
Cat     Yes    Count               26              6                    32
               Expected Count      15.1            16.9                 32.0
        No     Count               6               30                   36
               Expected Count      16.9            19.1                 36.0
        Total  Count               32              36                   68
               Expected Count      32.0            36.0                 68.0
Dog     Yes    Count               23              24                   47
               Expected Count      22.8            24.2                 47.0
        No     Count               9               10                   19
               Expected Count      9.2             9.8                  19.0
        Total  Count               32              34                   66
               Expected Count      32.0            34.0                 66.0

For the cats:
Chi² = (26-15.1)²/15.1 + (6-16.9)²/16.9 + (6-16.9)²/16.9 + (30-19.1)²/19.1 = 28.4
Is 28.4 big enough for the test to be significant?

The Null hypothesis and the error types

• The null hypothesis (H0): H0 = no effect
  – e.g. the animals rewarded by food are as likely to line dance as the ones rewarded by affection
• The aim of a statistical test is to reject or not to reject H0.

Statistical decision   True state of H0: H0 True          True state of H0: H0 False
Reject H0              Type I error (False Positive)      Correct (True Positive)
Do not reject H0       Correct (True Negative)            Type II error (False Negative)

• Traditionally, a test or a difference is said to be "significant" if the probability of a type I error is: α ≤ 0.05
• High specificity = low False Positives = low Type I error
• High sensitivity = low False Negatives = low Type II error

Chi-square test: results

[Figure: counts of animals that danced (Yes/No) by reward (Food, Affection), one panel for Dog and one for Cat]
• In our example: cats are more likely to line dance if they are given food as a reward than affection (p<0.0001) whereas dogs don't mind (p=0.908).
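The expected counts, the Chi² value and the two p-values above can be checked with a few lines of code. What follows is a minimal sketch in Python using only the standard library, not the procedure Prism runs internally: the function name chi_square_2x2 is illustrative, no continuity correction is applied, and the p-value shortcut is valid only for a 2x2 table (1 degree of freedom).

```python
import math


def chi_square_2x2(observed):
    """Pearson chi-square for a 2x2 table of counts (no continuity correction)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected frequency = (row total) * (column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value


# Rows: danced Yes / No; columns: Food / Affection (counts from the crosstabulation)
cats = [[26, 6], [6, 30]]
dogs = [[23, 24], [9, 10]]

cat_expected, cat_chi2, cat_p = chi_square_2x2(cats)
dog_expected, dog_chi2, dog_p = chi_square_2x2(dogs)

print(f"Cats: chi2 = {cat_chi2:.1f}, p = {cat_p:.2g}")
print(f"Dogs: chi2 = {dog_chi2:.3f}, p = {dog_p:.3f}")
```

Run on the counts from the crosstabulation, this should reproduce Chi² = 28.4 for the cats and p = 0.908 for the dogs.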
Quantitative data

• They take numerical values (units of measurement)
• They can be discrete (values vary by finite specific steps) or continuous (any values)
• They can be described by a series of parameters:
  – Mean, variance, standard deviation, standard error and confidence interval

The mean

• Definition: average of all values in a column
• It can be considered as a model because it summarises the data
  – Example: a group of 5 persons: number of friends of each member of the group: 1, 2, 3, 3 and 4
    • Mean: (1+2+3+3+4)/5 = 2.6 friends per person
  – Clearly a hypothetical value
• How can we know that it is an accurate model?
  – Difference between the real data and the model created

The mean (2)

• Calculate the magnitude of the differences between each data point and the mean:
  – Total error = sum of differences = 0 (figure from Field, 2000)
• No errors!
  – Positive and negative: they cancel each other out.

Sum of Squared errors (SS)

• To avoid the problem of the direction of the error: we square them
  – Instead of sum of errors: sum of squared errors (SS): SS = Σ (xᵢ – mean)²
• SS gives a good measure of the accuracy of the model
• But: dependent upon the amount of data: the more data, the higher the SS.
• Solution: to divide the SS by the number of observations (N)
• As we are interested in measuring the error in the sample to estimate the one in the population, we divide the SS by N-1 instead of N and we get the variance: s² = SS/(N-1)

Variance and standard deviation

• Problem with variance: measured in squared units
  – For more convenience, the square root of the variance is taken to obtain a measure in the same unit as the original measure:
    • the standard deviation
  – S.D. = √(SS/(N-1)) = √(s²) = s
• The standard deviation is a measure of how well the mean represents the data

Standard deviation

• Small S.D.: data close to the mean: the mean is a good fit of the data
• Large S.D.: data distant from the mean: the mean is not an accurate representation

SD and SEM (SEM = SD/√N)

• Many scientists are confused about the difference between the standard deviation (SD) and the standard error of the mean (SEM).
  – The SD quantifies how much the values vary from one another (scatter or spread).
    • The SD does not change predictably as you acquire more data.
  – The SEM tells you how much the sample mean varies across samples from the same population.
    • The SEM gets smaller as your samples get larger,
      – the mean of a large sample is likely to be closer to the true mean than is the mean of a small sample.

SD and SEM

• The SD quantifies the scatter of the data.
• The SEM quantifies how precisely the sample mean estimates the true population mean.

SD or SEM?

• If the scatter is caused by biological variability, it is important to show the variation.
  – Report the SD rather than the SEM.
    • Better still, show a graph of all data points, or perhaps report the largest and smallest values: there is no reason to report only the mean and SD.
• If you are using an in vitro system with no biological variability, the scatter can only result from experimental imprecision (no biological meaning).
  – The SD is less useful here.
    • Instead, report the SEM to give your readers a sense of how well you have determined the mean.

Confidence interval

• In a normal distribution, 95% of observations lie within +/- 1.96 SD of the mean; likewise, 95% of sample means lie within +/- 1.96 SEM of the population mean
  – So limits of 95% CI: [Mean - 1.96 SEM; Mean + 1.96 SEM]
  – SEM = SD/√N

Error bars

Error bar                                  Type         Description
Standard deviation (SD)                    Descriptive  Typical or average difference between the data points and their mean.
Standard error (SEM)                       Inferential  A measure of how variable the mean will be, if you repeat the whole study many times.
Confidence interval (CI), usually 95% CI   Inferential  A range of values you can be 95% confident contains the true mean.

[Figure: eight panels plotting a dependent variable for groups A and B, relating the gap between SE bars, or the overlap of CI bars, to the p-value. Panel annotations:]
• SE gap ~ 2, n=3: ~ 2 x SE: p~0.05
• SE gap ~ 4.5, n=3: ~ 4.5 x SE: p~0.01
• SE gap ~ 1, n>=10: ~ 1 x SE: p~0.05
• SE gap ~ 2, n>=10: ~ 2 x SE: p~0.01
• CI overlap ~ 1, n=3: ~ 1 x CI: p~0.05
• CI overlap ~ 0.5, n=3: ~ 0.5 x CI: p~0.05
• CI overlap ~ 0.5, n>=10: ~ 0.5 x CI: p~0.05
• CI overlap ~ 0, n>=10: ~ 0 x CI: p~0.01

Analysis of quantitative data

• Check for normality
• Choose the correct statistical test to answer your question:
  – There are 2 types of statistical tests:
    • Parametric tests, with 4 assumptions to be met by the data,
    • Non-parametric tests, with no or few assumptions (e.g. Mann-Whitney test) and/or for qualitative data (e.g. χ² test).

Assumptions of Parametric Data

• All parametric tests have 4 basic assumptions that must be met for the test to be accurate.
1) Normally distributed data
  – Normal shape, bell shape, Gaussian shape
• Transformations can be made to make data suitable for parametric analysis

Assumptions of Parametric Data (2)

• Frequent departures from normality:
  – Skewness: lack of symmetry of a distribution
  – Kurtosis: measure of the degree of peakedness in the distribution
• The two distributions shown on the slide have the same variance and approximately the same skew, but differ markedly in kurtosis.
[Figure: two distributions differing in kurtosis]
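The chain from the mean to the confidence interval described in the quantitative data slides can be followed step by step on the "number of friends" example (1, 2, 3, 3, 4). This is a minimal sketch in Python (standard library only); the variable names are illustrative, and the 1.96 multiplier is the normal approximation used on the slide, which is optimistic for a sample of only 5 values.

```python
import math
import statistics

friends = [1, 2, 3, 3, 4]          # number of friends of each of the 5 persons
n = len(friends)

mean = statistics.mean(friends)     # the "model": 2.6 friends per person

# Raw differences cancel out, so their sum is (numerically) zero
total_error = sum(x - mean for x in friends)

# Squaring removes the sign problem: sum of squared errors (SS)
ss = sum((x - mean) ** 2 for x in friends)

variance = ss / (n - 1)             # s^2 = SS / (N - 1)
sd = math.sqrt(variance)            # standard deviation, same unit as the data
sem = sd / math.sqrt(n)             # SEM = SD / sqrt(N)

# 95% confidence interval using the normal approximation (1.96 x SEM)
ci_low = mean - 1.96 * sem
ci_high = mean + 1.96 * sem

print(f"mean = {mean:.2f}, SS = {ss:.2f}, variance = {variance:.2f}")
print(f"SD = {sd:.2f}, SEM = {sem:.2f}, 95% CI = [{ci_low:.2f}; {ci_high:.2f}]")
```

The hand-computed variance agrees with statistics.variance(friends), which also divides by N-1.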
Assumptions of Parametric Data (3)

2) Homogeneity in variance
• The variance should not change systematically throughout the data
3) Interval data
• The distance between points of the scale should be equal at all parts along the scale
4) Independence
• Data from different subjects are independent
  – Values corresponding to one subject do not influence the values corresponding to another subject.
  – Important in repeated measures experiments

Analysis of quantitative data

• Is there a difference between my groups regarding the variable I am measuring?
  – e.g. are the mice in group A heavier than the ones in group B?
  • Tests with 2 groups:
    – Parametric: t-test
    – Non-parametric: Mann-Whitney/Wilcoxon rank sum test
  • Tests with more than 2 groups:
    – Parametric: Analysis of variance (one-way ANOVA)
    – Non-parametric: Kruskal-Wallis
• Is there a relationship between my 2 (continuous) variables?
  – e.g. is there a relationship between the daily intake in calories and an increase in body weight?
  • Test: Correlation (parametric or non-parametric)

Remember: in a statistical test, the ratio of signal to noise determines the significance.

Comparison between 2 groups: t-test

• Basic idea:
  – When we are looking at the differences between scores for 2 groups, we have to judge the difference between their means relative to the spread or variability of their scores
  • e.g. comparison of 2 groups: control and treatment

t-test (4)

• 3 types:
  – Independent t-test
    • it compares means for two independent groups of cases.
  – Paired t-test
    • it looks at the difference between two variables for a single group:
      – the second sample is the same as the first after some treatment has been applied
  – One-Sample t-test
    • it tests whether the mean of a single variable differs from a specified constant (often 0)

Example: coyote.xlsx

• Question: are male coyotes bigger than females?
• First step: what do my data look like?
  – 4 assumptions for parametric tests
  – Plot the data

Assumptions for parametric tests

Normality
[Figure: histograms of coyote body length for females and males (counts against bin centre), drawn with three different bin widths]
• Counts are OK here, but if several groups have different sizes, go for percentages.

[Figure: box plot of coyote length (cm) for males and females, annotated with:]
• Maximum
• Upper Quartile (Q3): 75th percentile
• Interquartile Range (IQR)
• Median
• Lower Quartile (Q1): 25th percentile
• Smallest data value > lower cutoff, with cutoff = Q1 – 1.5*IQR
• Outlier

Independent t-test: example coyote.pzf

[Figure: coyote body length (cm) for females and males, plotted with three kinds of error bars: standard error, standard deviation and confidence interval]

Independent t-test: results coyote.xlsx

• Homogeneity in variance
• Males tend to be longer than females but not significantly so (p=0.1045).
• What about the power of the analysis?
  – You would need a sample 3 times bigger to reach the accepted power of 80%.
  – But is a 2.3 cm difference between the sexes biologically relevant?
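The link between effect size, significance level, power and sample size discussed in the power analysis slides can be sketched with the usual normal-approximation formula for comparing two means, n per group ≈ 2 (z for alpha + z for power)² / d². The Python below (standard library only) is an illustration under that assumption, not what G*Power computes: G*Power works with the t distribution and typically returns slightly larger numbers. The function name n_per_group is illustrative.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per group for comparing two means.

    d is Cohen's d = (Mean 1 - Mean 2) / pooled SD.
    Normal approximation: n = 2 * (z_alpha + z_power)^2 / d^2, rounded up.
    """
    standard_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha   # alternative hypothesis: 1 or 2 sides
    z_alpha = standard_normal.inv_cdf(1 - tail)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Cohen's conventions for small, medium and large effects
for label, d in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    print(f"{label:6s} d = {d:.2f}: about {n_per_group(d)} per group")
```

Because the sample size scales with 1/d², halving the effect size of interest roughly quadruples the number of animals needed, which is why the effect size and the standard deviation are the parameters worth estimating carefully.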
Another example of t-test: height husband wife.xlsx

Normality
[Figure: height (cm) of husbands and wives, and the husband/wife difference in height]

Dependent t-test: example height husband wife.xlsx

[Figure: height (cm) of husbands and wives, and the difference in height]
• Paired t-test = one-sample t-test on the differences

Results
• No test for homogeneity of variances

Comparison of more than 2 means

• Why can't we do several t-tests?
  – Because it increases the familywise error rate.
• What is the familywise error rate?
  – The error rate across tests conducted on the same experimental data.

Familywise error rate

• Example: if you want to compare 3 groups and you carry out 3 t-tests, each with a 5% level of significance
• The probability of not making a type I error in one test is 95% (= 1 – 0.05)
  – the overall probability of no type I errors is: 0.95 * 0.95 * 0.95 = 0.857
  – So the probability of making at least one type I error is 1 - 0.857 = 0.143 or 14.3%
  – The probability has increased from 5% to 14.3%!
  – If you compare 5 groups instead of 3, the familywise error rate is 40%! (= 1 - 0.95^n, with n = 10 pairwise comparisons)
• Solution for multiple comparisons: Analysis of variance

Analysis of variance

• Extension of the 2-group comparison of a t-test but with a slightly different logic:
  – If you want to compare 5 means, for example, you can compare each mean with another
    • It gives you 10 possible 2-group comparisons
  – Complicated! So, the logic of the t-test cannot be directly transferred to the analysis of variance (= ANOVA)
• Instead the ANOVA compares variances:
  – If variance between the 5 means > variance within the 5 groups (random error)
    • then the means must be more spread out than they would have been by chance.

Analysis of variance

• The statistic for ANOVA is the F ratio.
• F = Variance between the groups / Variance within the groups (individual variability)
• F = Variation explained by the model (= systematic) / Variation explained by unsystematic factors (= random variation)
• If the variance amongst sample means is greater than the error/random variance, then F>1
  – In an ANOVA, you test whether F is significantly higher than 1 or not.

Analysis of variance

Source of variation   Sum of Squares   df   Mean Square   F       p-value
Between Groups        2.665            4    0.6663        8.423   <0.0001
Within Groups         5.775            73   0.0791
Total                 8.44             77

• Variance (= SS/(N-1)) is the mean square (= SS/df)
  – df: degrees of freedom, with df = N-1

[Figure: hypothetical model showing between groups variability, within groups variability and the total sum of squares]

Parametric tests assumptions
• Normality

Analysis of variance: Post hoc tests

• The ANOVA is an "omnibus" test: it tells you that there is (or not) a difference between your means but not exactly which means are significantly different from which other ones.
  – To find out, you need to apply post hoc tests.
  – These post hoc tests should only be used when the ANOVA finds a significant effect.
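The arithmetic behind the familywise error rate and the ANOVA table above can be checked in a few lines. This is a minimal Python sketch (standard library only) with illustrative function names; it reproduces the numbers from the slides and does not compute the p-value, which requires the F distribution.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Probability of at least one type I error when every pair of groups
    is compared with its own test at level alpha (tests treated as independent)."""
    n_comparisons = comb(n_groups, 2)       # e.g. 5 groups -> 10 pairwise tests
    return 1 - (1 - alpha) ** n_comparisons


def f_ratio(ss_between, df_between, ss_within, df_within):
    """Mean squares (SS / df) and F = MS between / MS within."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


print(f"3 groups: {familywise_error_rate(3):.1%}")
print(f"5 groups: {familywise_error_rate(5):.1%}")

# Values taken from the ANOVA table
ms_between, ms_within, f_value = f_ratio(2.665, 4, 5.775, 73)
print(f"MS between = {ms_between:.4f}, MS within = {ms_within:.4f}, F = {f_value:.2f}")
```

The F obtained this way (about 8.42) differs from the tabulated 8.423 only in the last digit, because the table reports rounded sums of squares.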
Analysis of variance: example protein expression.xlsx

• Homogeneity in variance
• F = 0.6702/0.07896 = 8.49
• Post hoc tests
[Figure: protein expression for cell groups A to E, plotted on linear and logarithmic scales]

Correlation

• A correlation coefficient is an index number that measures:
  – The magnitude and the direction of the relation between 2 variables
  – It is designed to range in value between -1 and +1

Correlation

• Most widely-used correlation coefficient:
  – Pearson product-moment correlation coefficient "r"
    • The 2 variables do not have to be measured in the same units but they have to be proportional (meaning linearly related)
  – Coefficient of determination:
    • r is the correlation between X and Y
    • r² is the coefficient of determination:
      – It gives you the proportion of variance in Y that can be explained by X, in percentage.

Correlation: example roe deer.xlsx

• Is there a relationship between parasite burden and body mass in roe deer?
[Figure: body mass against digestive parasites for male and female roe deer]

Correlation: example roe deer.xlsx

There is a negative correlation between parasite load and fitness but this relationship is only significant for the males (p=0.0049 vs. females: p=0.2940).

Exercises

• Arachnophobia
  – Is it as scary to look at the picture of a spider as at a real one?
• Cane toad
  – Is the proportion of cane toads infected by intestinal parasites the same in 3 different areas of Queensland?
• Colorectal cancer (CRC)
  – Are DNA scores higher in people with colorectal cancer?
• Migration of neutrophils
  – Do neutrophils go further depending on which inhibitor you use?
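As a companion to the correlation slides, here is a minimal Python sketch (standard library only) of how Pearson's r and the coefficient of determination r² are computed. The numbers are made up for illustration and are NOT the roe deer data; the function name pearson_r is also illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between two samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values, for illustration only (not taken from roe deer.xlsx)
parasite_load = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
body_mass = [26.0, 25.0, 22.0, 23.0, 19.0, 18.0]

r = pearson_r(parasite_load, body_mass)
r_squared = r ** 2     # proportion of variance in Y explained by X

print(f"r = {r:.3f}, r squared = {r_squared:.1%}")
```

A negative r indicates that body mass tends to decrease as parasite load increases; whether that relationship is significant is a separate question answered by the p-value.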
Arachnophobia

[Figure: differences in scores, ranging from about -20 to 30]
Answer: If you are an arachnophobe, it is scarier to look at a real spider than at the picture of one (p=0.0310).

Cane toad

[Figure: number of infected and uninfected toads in Rockhampton, Bowen and Mackay]
Answer: The proportion of cane toads infected by intestinal parasites varies significantly between the 3 different areas of Queensland (p=0.0359), the animals being more likely to be parasitized in Rockhampton and Mackay than in Bowen.

Colorectal cancer

[Figure: frequency histograms of DNA scores (bin centres 0 to 35) for CRC and no CRC]
Answer: Higher DNA scores appear to be associated with a greater likelihood of CRC (p=0.0007).

Migration of neutrophils

[Figure: migrated neutrophils (% of total) for groups A, B and C]
Answer: There is a significant difference between the 3 inhibitors, with inhibitor C being the most effective and inhibitor A the least (p<0.0001).

Choosing the Correct Statistical Test
(adapted from "Choosing the correct statistics" developed by James D. Leeper, Ph.D)
* Outcome ** Predictor