Appendix 16. Normally Distributed Data, Sampling, Averages, Standard Error, and Significant Differences

If you take 20 jelly beans from a huge jar in which half of the beans are known to be red, getting exactly 10 red beans is the single most likely outcome, but it is far from certain. Perhaps you will get only 7 red beans, or maybe 15. If you take several 20-bean samples, though, the average (mean) number of red beans in those samples should be very close to 10. And the more such samples you take, the more closely their average proportion of red beans will approximate the true proportion of red beans in the whole jar.

To relate this to a mycological experiment, suppose you measure the growth rate of four randomly chosen hyphae from fungal species A and compare those rates to four hyphae from species B (Case 1). Meanwhile, another student is comparing the growth rates of hyphae of species C and D (Case 2). In each pair, one average is 40 µm/min and the other is 45 µm/min. In each case it seems one species has a higher growth rate than the other, but how much confidence can we have in that conclusion? Are the differences "significant"? Obviously, it depends on the numbers that went into those averages. Consider these data (growth rates in µm/min):

Case 1         Hypha 1   Hypha 2   Hypha 3   Hypha 4
  Species A       40        35        36        49
  Species B       40        45        39        56

Case 2         Hypha 1   Hypha 2   Hypha 3   Hypha 4
  Species C       39        41        40        40
  Species D       45        47        44        44

In both cases, the averages are 40 and 45 µm/min. The growth rates in Case 1 are more variable than those in Case 2. Is this due to poor experimental reproducibility, or to high natural variability? The reasons behind the wide "scatter" of the numbers in Case 1 are not obvious, and the high variation makes our assessment of the average less certain. In contrast, the rates in Case 2 are more consistent. Was that student more careful in taking measurements, or is the natural variation in those species simply lower? Based on these data alone, the differences between A and B are less likely to be "significant" than those between C and D. Here, "significance" is a statistical judgement about whether the differences could simply reflect the random choice of individuals from a variable population.

Few real cases are so clear cut, and even here we have uncertainties, so we need an impartial way of making a judgement. Statistics provides sophisticated tools for deciding when two (or more) sets of measurements are or are not significantly different. Unless you have already taken a statistics course (in which case, use those tools), here are statistical rules for making an objective decision about significant differences between data sets. In any sample you take (e.g., the values 40, 45, 39, and 56 in species B), you can be 95% confident that the true mean lies somewhere within two standard errors (SE) above and below the mean of the sample. This is called the 95% confidence interval. Standard error and confidence interval will be explained in more detail below. Two samples are considered significantly different only if their 95% confidence intervals do not overlap.

Clearly, everything depends on the standard error (SE). Even if your calculator can do this with a few keystrokes, it is worth going through the exercise explicitly at least once, and then checking your arithmetic against your calculator's program as an internal check to make sure you are using your calculator properly. At the end of this section, work through the calculations for samples C and D.
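Before working through the arithmetic, it may help to see the jelly-bean idea numerically. The short Python sketch below is not part of the original appendix; it simply simulates drawing repeated 20-bean samples from a jar that is half red, and shows that the average proportion of red beans across many samples settles close to the true value of 0.5.

```python
import random

random.seed(1)          # fixed seed so the demonstration is repeatable
TRUE_PROPORTION = 0.5   # half of the beans in the jar are red
BEANS_PER_SAMPLE = 20

def red_count(n_beans):
    """Draw one sample of n_beans and count how many are red."""
    return sum(random.random() < TRUE_PROPORTION for _ in range(n_beans))

# A single 20-bean sample rarely contains exactly 10 red beans...
print("one sample:", red_count(BEANS_PER_SAMPLE), "red beans out of", BEANS_PER_SAMPLE)

# ...but the mean proportion over many samples converges toward 0.5.
for n_samples in (5, 50, 500, 5000):
    counts = [red_count(BEANS_PER_SAMPLE) for _ in range(n_samples)]
    mean_proportion = sum(counts) / (n_samples * BEANS_PER_SAMPLE)
    print(f"{n_samples:5d} samples: mean proportion red = {mean_proportion:.3f}")
```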
Calculating standard error is a two-step process. First, calculate the standard deviation (SD), according to the formula:

SD = [ Σ (X − xi)² / (n − 1) ]^1/2, where the sum (Σ) runs over all n sample values (i = 1 to n)

which translates into: standard deviation equals the square root of {the sum of the squared differences from the mean, divided by the "degrees of freedom"}. An exponent of 1/2 means "the square root of" that value. Do the calculation like this:

1) Find the average of your sample values by adding them together and dividing by the number of samples (n). The average of a sample is sometimes called "x bar" and is written as an x with a bar over it (difficult to create in a word processor, so I'll use a capital X).
2) Take each individual sample value, subtract the mean, and square the result. It will not matter whether the difference is positive or negative once the value is squared.
3) Add the squared differences together to get the "sum of square differences".
4) Divide the sum of square differences by one less than the number of samples (n − 1), called the degrees of freedom and written as the Greek letter nu (ν), which looks like a small v. The idea is that once the mean of a set of n numbers is fixed, only n − 1 of them are free to vary; the last one is determined by the others.
5) Take the square root of the result; that is the standard deviation.

Sample standard error calculation

Species A   Species B   Species C   Species D
   40          40          39          45
   35          45          41          47
   36          39          40          44
   49          56          40          44

Species A
Mean = 40
Sum of square differences = (40−40)² + (35−40)² + (36−40)² + (49−40)²
                          = (0)² + (−5)² + (−4)² + (9)²
                          = 0 + 25 + 16 + 81 = 122
Degrees of freedom of the sample = 3
Sum of square differences / degrees of freedom = 122 / 3 = 40.67
Standard deviation = square root of (sum of squares / degrees of freedom) = 6.38

Species B
Mean = 45
Sum of square differences = (40−45)² + (45−45)² + (39−45)² + (56−45)²
                          = (−5)² + (0)² + (−6)² + (11)²
                          = 25 + 0 + 36 + 121 = 182
Degrees of freedom of the sample = 3
Sum of square differences / degrees of freedom = 182 / 3 = 60.67
Standard deviation = square root of (sum of squares / degrees of freedom) = 7.79

Repeat these calculations for species C and D.

Standard deviations are useful in many statistical procedures. Here we need them to get the standard error (SE), also called the standard error of the mean, which is the standard deviation divided by the square root of the number of samples. In each of our samples n = 4, and the square root of 4 is 2. Therefore, the means, standard deviations, and standard errors of these samples are:

            Mean     SD     SE
Species A    40     6.38   3.19
Species B    45     7.79   3.90
Species C    40     0.82   0.41
Species D    45     1.41   0.70

The confidence limits for the means of our samples lie within two standard errors above and below the mean of each sample:

Species A   33.6 - 46.4
Species B   37.2 - 52.8
Species C   39.2 - 40.8
Species D   43.6 - 46.4

Therefore, for sample B, we can be 95% confident that the true mean growth rate is between 37.2 and 52.8 µm/min. To be precise, the SE values should be multiplied by 1.96 rather than 2, and even 1.96 is valid only for large samples. For samples of more typical size, the relevant multiplier is a statistical value called "t", which is found by looking in a table of t-values, later in this appendix.

Recall that sample means are considered significantly different only if their confidence intervals do not overlap. Therefore the mean growth rates of species A and B are not significantly different, while the mean growth rate of species C is significantly different from that of species D.
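If you would rather check the arithmetic above with a short program as well as a calculator, the Python sketch below reproduces the mean, standard deviation, standard error, and rough 95% confidence interval (mean ± 2 SE) for each species, using only the formulas given in this appendix (the function and variable names are mine).

```python
import math

samples = {
    "A": [40, 35, 36, 49],
    "B": [40, 45, 39, 56],
    "C": [39, 41, 40, 40],
    "D": [45, 47, 44, 44],
}

def summarize(values):
    """Return mean, standard deviation, and standard error of a sample."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)   # sum of square differences
    sd = math.sqrt(ss / (n - 1))                # divide by degrees of freedom
    se = sd / math.sqrt(n)                      # standard error of the mean
    return mean, sd, se

print("Species   mean     SD     SE    ~95% CI (mean +/- 2 SE)")
for name, values in samples.items():
    mean, sd, se = summarize(values)
    low, high = mean - 2 * se, mean + 2 * se
    print(f"{name:7s} {mean:6.2f} {sd:6.2f} {se:6.2f}   {low:5.1f} - {high:5.1f}")
```

Intervals that do not overlap (species C and D here) correspond to the "significantly different" judgement described above.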
A warning, however: growth rates can change depending on how the specimens are treated. Even seemingly stable characteristics like hyphal diameter depend on where the diameter is measured relative to the tip: two hyphae may have very similar diameters a long way back from the tip, yet differ closer to the tip. This is worth remembering in the context of the measurements you will be making in Lab 1.

To compare sample means with statistical rigour, we can use a "t-test", also called Student's t-test, which was devised by a statistician (William Sealy Gosset, 1876-1937) who felt he was a "student" of mathematics and published under that pseudonym. This is an excellent and rigorous test under the proper conditions.

Because we do not know whether A or B really has the higher mean, the statistical question we have to ask is whether we might simply have two samples taken at random from the same population. This is called a two-tailed t-test. For species A and B, the test calculations look like this:

                              Species A         Species B
Measurements                  40, 35, 36, 49    40, 45, 39, 56
Number of measurements (n)    n1 = 4            n2 = 4
Degrees of freedom (ν)        ν1 = 3            ν2 = 3
Mean                          X1 = 40           X2 = 45
Sum of square differences     SS1 = 122         SS2 = 182

Pooled variance:
s²p = (SS1 + SS2) / (ν1 + ν2) = (122 + 182) / (3 + 3) = 50.67

Standard error of the difference between the means (this general form is needed when n1 ≠ n2; when n1 = n2, as here, it simplifies to the square root of 2s²p/n):
sX1−X2 = (s²p/n1 + s²p/n2)^1/2 = (50.67/4 + 50.67/4)^1/2 = (25.33)^1/2 = 5.03

In the following, |X| means "the absolute value of X":
t = |X1 − X2| / sX1−X2 = |40 − 45| / 5.03 = 5 / 5.03 = 0.99

The critical t value is t0.05, (n1+n2−2) = t0.05, 6 = 1.943 (from a statistical table, part of which is reproduced below). Since our calculated value of t is less than the critical value, we cannot conclude that the mean growth rates of species A and B are significantly different.

The calculated t value is compared to a standard table, with critical values arranged by degrees of freedom (the total for both samples, n1 + n2 − 2) and by significance level (the probability, α, of being wrong when you claim a real difference; the corresponding confidence level is written as 1 − α). Being 95% sure of being correct corresponds to a significance level of 0.05, because you will be wrong in that judgement 5% of the time. Interpolate (estimate) if your degrees of freedom fall between two of the values given in the abbreviated table below.

df      t0.10    t0.05    t0.025    t0.01    t0.005
1       3.078    6.314    12.706    31.821   63.657
2       1.886    2.920    4.303     6.965    9.925
3       1.638    2.353    3.182     4.541    5.841
4       1.533    2.132    2.776     3.747    4.604
6       1.440    1.943    2.447     3.143    3.707
8       1.397    1.860    2.306     2.896    3.355
10      1.372    1.812    2.228     2.764    3.169
15      1.341    1.753    2.131     2.602    2.947
20      1.325    1.725    2.086     2.528    2.845
25      1.316    1.708    2.060     2.485    2.787
∞       1.282    1.645    1.960     2.326    2.576

Repeat the t-test for C vs. D. My calculation of t for these data is about 6.1, which exceeds the critical value of 1.943, meaning that the difference between the average growth rates of these two species is statistically significant.
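The same pooled-variance arithmetic can be written out in a few lines of Python. The sketch below is a check on the hand calculation above, not a substitute for a proper statistics package, and it prints t for A vs. B and for C vs. D using the species data from this appendix.

```python
import math

def pooled_t(sample1, sample2):
    """Two-sample t statistic with a pooled variance, as in the worked example."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    ss1 = sum((x - mean1) ** 2 for x in sample1)   # sum of square differences
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    pooled_var = (ss1 + ss2) / ((n1 - 1) + (n2 - 1))
    se_diff = math.sqrt(pooled_var / n1 + pooled_var / n2)
    return abs(mean1 - mean2) / se_diff

species = {
    "A": [40, 35, 36, 49],
    "B": [40, 45, 39, 56],
    "C": [39, 41, 40, 40],
    "D": [45, 47, 44, 44],
}

print("A vs B: t =", round(pooled_t(species["A"], species["B"]), 2))
print("C vs D: t =", round(pooled_t(species["C"], species["D"]), 2))
# Compare each t to the critical value from the table for
# n1 + n2 - 2 = 6 degrees of freedom.
```

If scipy is available, scipy.stats.ttest_ind(species["A"], species["B"]) should give the same statistic (up to sign) along with a two-tailed p-value.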
Chi-Square Test: Goodness of Fit

In the preceding example, the growth rate data were a "continuous" variable. Like height and weight of people, we cannot measure such parameters exactly. Perhaps (like growth rate) they are changing with time; perhaps (like weight) there are technical limitations to the accuracy of our measurements.

Another type of data is "categorical" frequency data. This is particularly clear cut in genetical studies such as Mendel's pea experiments, where each individual either showed the dominant phenotype or did not. In such studies, the distribution of phenotypes in a real population is compared to a Mendelian ratio like 1-to-1, 3-to-1, etc. If 100 pea plants consist of 70 tall individuals and 30 short individuals, is that closer to a 1-to-1 or to a 3-to-1 ratio? And whichever ratio it is closer to, is it close enough that you can be 95% confident in the fit? A statistical tool for resolving these kinds of questions is the Chi-square (χ²) test, also called a "goodness of fit" test. Here, the χ² statistic is calculated as

χ² = the sum of (observed − expected)² / expected

over each phenotype class. In our example, imagine that we cannot decide between 1:1 and 3:1.

For the 1:1 ratio:
χ² = (70−50)²/50 + (30−50)²/50 = 400/50 + 400/50 = 8 + 8 = 16

For the 3:1 ratio:
χ² = (70−75)²/75 + (30−25)²/25 = 25/75 + 25/25 = 0.33 + 1 = 1.33

Here is the χ² table for one and two degrees of freedom. Remember, if you have two possible classes, you have one degree of freedom.

        α = 0.10   0.05    0.025   0.01    0.005    0.001
ν = 1     2.706    3.841   5.024   6.635   7.879    10.828
ν = 2     4.605    5.991   7.378   9.210   10.597   13.816

The smaller the value of χ², the better the fit between the observed ratio and the expected ratio. Clearly the 70:30 distribution of tall and short plants is closer to a 3:1 ratio than to a 1:1 ratio. You could see that even before doing the test, of course, but if the problem had involved multiple phenotypes, a χ² test might have been the only way to tell. Now, is the observed ratio close enough to a 3:1 ratio that we can be 95% confident that its deviation from the ideal is due simply to ordinary random variation? To answer that, we compare our calculated χ² values to statistically determined critical values. In our example, with one degree of freedom, the χ² calculated for the 3:1 hypothesis (1.33) is smaller than the 0.05 critical value (3.841), while the χ² calculated for the 1:1 hypothesis (16) is not. Therefore, we are 95% confident that our ratio fits a 3:1 distribution, and we can reject the 1:1 hypothesis.
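A χ² goodness-of-fit calculation is just as easy to script. The Python sketch below repeats the 70:30 pea-plant example for both candidate ratios; the function and variable names are mine, not part of the appendix.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [70, 30]                 # 70 tall and 30 short plants

# Expected counts for each candidate ratio, out of 100 plants.
print("1:1 ratio: chi-square =", round(chi_square(observed, [50, 50]), 2))
print("3:1 ratio: chi-square =", round(chi_square(observed, [75, 25]), 2))
# With 2 classes there is 1 degree of freedom, so compare each value
# to the 0.05 critical value of 3.841 from the table above.
```

If scipy is available, scipy.stats.chisquare([70, 30], f_exp=[75, 25]) should return the same statistic together with a p-value.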
What follows is for more sophisticated statistical problems than you'll encounter in Biol 342, but is worth keeping in mind for other courses.

1) Statistical tests can be broadly classified into two main types: parametric and non-parametric. Parametric tests like the t-test assume that the variable you measure has a normal (bell-curve) distribution and that your sample size is relatively large. Non-parametric tests like the Chi-square do not make these assumptions. For now, you don't need to worry about distinguishing the two types of tests, but more rigorous statistical methods require first testing whether your (continuous) variable follows a normal distribution.

2) Is your dependent (measured) variable continuous or categorical? Continuous variables like length or weight can be measured on a continuous numerical scale (e.g. with a ruler). Growth rate of hyphae is an example of a continuous variable. Categorical data involve classifying each individual in your sample into a category, e.g. "tall" vs. "short" or "blue" vs. "red", as in the pea-plant genetics example. Note that in that case the height of the plant was not actually measured, only classified. With categorical data, you end up with frequencies (counts) of individuals in the different categories.

3) Is your independent variable continuous or categorical?

4) How many categories are being compared?

__________________________________________________________________________
Dependent variable    Independent variable    Statistical test
__________________________________________________________________________
continuous            continuous              Regression
categorical           categorical             Chi-square
categorical           continuous              Logistic regression
continuous            categorical             t-test (for comparing 2 groups);
                                              1-way ANOVA (for 3 or more groups)
continuous            2 category types        2-way ANOVA
__________________________________________________________________________

Examples

1) Regression: the relationship between a person's weight and height (both continuous variables). You must test whether the slope of the fitted regression line is significantly different from zero. Merely plotting a line through a scatterplot of the data and calculating an r² value IS NOT the statistical test.

2) Chi-square: comparing observed plant height categories to expected height categories, as in the previous example. This test can also be used to ask whether the frequencies in the height categories differ between 2 species of plants.

3) Logistic regression: a rather rare and complex test. For example, you could test whether the level of some nutrient (continuous independent variable) is related to mortality of the plant (mortality is categorical because it is "yes" or "no"). You need a computer program to calculate this efficiently.

4) t-test: as in our first example, growth rate is continuous, and you are comparing 2 categories (Species A vs. B).

5) 1-way ANOVA: same as above, but for comparing 3 or more species (e.g. Species A vs. B vs. C). Typically calculated with a computer program.

6) 2-way ANOVA: e.g. comparing growth rates (continuous dependent variable) between two types of categories; each type of category is usually called a "factor". Factor 1 = species (A vs. B); Factor 2 = temperature ("hot" vs. "cold"). It does not matter how many divisions there are within each factor; for example, you could compare species A vs. B vs. C grown at "hot" or "cold" temperatures. 2-way ANOVAs require a computer program.
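For courses where a computer is available, most of the tests in the table above have ready-made counterparts in scipy.stats. The sketch below shows the correspondence using the species data from this appendix plus made-up toy numbers for the regression; it is meant as a signpost, not as an analysis of real data (two-way ANOVA and logistic regression usually need a more specialised package such as statsmodels).

```python
from scipy import stats

# Regression: continuous dependent vs. continuous independent variable.
heights = [150, 160, 165, 172, 180, 185]   # toy data, for illustration only
weights = [55, 62, 66, 70, 78, 83]
fit = stats.linregress(heights, weights)
print("regression slope:", round(fit.slope, 2), " p =", round(fit.pvalue, 4))

# t-test: continuous dependent variable, 2 categories.
species_a = [40, 35, 36, 49]
species_b = [40, 45, 39, 56]
print("t-test:", stats.ttest_ind(species_a, species_b))

# 1-way ANOVA: continuous dependent variable, 3 or more categories.
species_c = [39, 41, 40, 40]
print("ANOVA:", stats.f_oneway(species_a, species_b, species_c))

# Chi-square goodness of fit: categorical counts vs. expected counts.
print("chi-square:", stats.chisquare([70, 30], f_exp=[75, 25]))
```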