Appendix 16: Statistics
Normally Distributed Data, Sampling, Averages, Standard Error, and Significant Differences
If you take 20 jelly beans from a huge jar in which half of the beans are known to be red, getting
exactly 10 red beans is the most likely single outcome, but it is far from guaranteed. Perhaps you will get
only 7 red beans, or maybe 15. If you take several 20-bean samples, though, the average (mean)
number of red beans in those samples should be very close to 10. And the more such samples
you take, the more closely their average proportion of red beans will approximate the true
proportion of red beans in the whole jar.
To relate this to a mycological experiment, suppose you measure the growth rate of four
randomly chosen hyphae from fungal species A and compare those rates to four hyphae from
species B (Case 1). Meanwhile, another student is comparing the growth rates of hyphae of
species C and D (Case 2). In each pair, one average is 40 µm/min, and the other is 45 µm/min.
In each case it seems one species has a higher growth rate than the other, but how much
confidence can we have in that conclusion? Are the differences “significant”? Obviously, it
depends on the numbers that went into those averages. Consider these data:
Case 1 (growth rates, µm/min)
                 Hypha 1   Hypha 2   Hypha 3   Hypha 4
Species A           40        35        36        49
Species B           40        45        39        56

Case 2 (growth rates, µm/min)
                 Hypha 1   Hypha 2   Hypha 3   Hypha 4
Species C           39        41        40        40
Species D           45        47        44        44
In both cases, the averages are 40 and 45 µm/min. The growth rates in Case 1 are more variable
than those in Case 2. Is this due to poor experimental reproducibility, or to high natural
variability? The reasons behind the wide "scatter" of the numbers in Case 1 are not obvious, and
the high variation makes our assessment of the average less certain. In contrast, the rates in Case
2 are more consistent. Was that student more careful in taking measurements, or are those species
simply less variable? Based on these data alone, the differences between A and B are less likely to
be "significant" than those between C and D. Here, "significance" is a statistical judgement about
whether the differences might simply reflect the random choice of individuals from a variable
population. Few real cases are so clear cut, and even here we have uncertainties, so we need an
impartial way of making a judgement. Statistics provides sophisticated tools for deciding when two
(or more) sets of measurements are or are not significantly different. Unless you have already taken
a statistics course (in which case, use those tools), here are statistical rules for making an
objective decision about significant differences between data sets.
In any sample you take (e.g., the values 40, 45, 39, and 56 for species B), you can be 95% confident
that the true mean lies somewhere within two standard errors (SE) above and below the mean of the
sample. This is called the 95% confidence interval. Standard error and the confidence interval are
explained in more detail below.
Two samples are considered significantly different only if their 95% confidence intervals do not
overlap. Clearly, everything depends on the standard error (SE). Even if your calculator can do this
with a few keystrokes, it is worth going through this exercise at least once, explicitly, and then
checking your arithmetic against your calculator’s program as an internal check to make sure you
are using your calculator properly. At the end of this section, work through the calculations for
samples C and D. Calculating standard error is a two-step process. First, calculate the standard
deviation (SD), according to the formula:
SD = [ Σi=1..n (xi - X)² / (n - 1) ]^1/2
which translates into:
Standard deviation equals the square root of {the sum of the squared differences from the mean,
divided by the "degrees of freedom"}. Raising a value to the exponent 1/2 means taking "the square
root of" that value. Do the calculation like this:




- Find the average of your sample values by adding them together and dividing by the number of
  measurements (n). The average of a sample is sometimes called "x bar", and is written as x with
  a bar over it (difficult to create on a word processor, so I'll use a capital X).
- Take each individual sample value, subtract the mean, and square the result. It will not matter
  whether the difference is positive or negative once the value is squared.
- Add the squared differences together; this is the "sum of square differences".
- Divide the "sum of square differences" by one less than the number of measurements. This divisor
  is called the degrees of freedom, written as the Greek letter nu (ν), which looks like a small v.
  Imagine that the mean of a set of numbers is already fixed: you are then free to choose only one
  less than the total number of values, because the last value is determined by the others.
- Finally, take the square root of the result; this is the standard deviation. (A short worked
  sketch of these steps in Python follows this list.)
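If you want to check the arithmetic by machine, here is a minimal sketch of the same steps in
Python, using only the standard library (the function name sample_sd is just for illustration;
the data are the Species A measurements used below):

from math import sqrt

def sample_sd(values):
    n = len(values)
    mean = sum(values) / n                          # the sample average, "X"
    sum_sq = sum((x - mean) ** 2 for x in values)   # the "sum of square differences"
    return sqrt(sum_sq / (n - 1))                   # divide by degrees of freedom, then take the square root

species_a = [40, 35, 36, 49]
print(round(sample_sd(species_a), 2))               # 6.38, matching the worked example below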
Sample standard error calculation

Species A   Species B   Species C   Species D
   40          40          39          45
   35          45          41          47
   36          39          40          44
   49          56          40          44
Species A
Mean = 40
Sum of square differences = (40-40)² + (35-40)² + (36-40)² + (49-40)²
  = (0)² + (-5)² + (-4)² + (9)²
  = 0 + 25 + 16 + 81 = 122
Degrees of freedom of the sample = 3
Sum of square differences / degrees of freedom = 122/3 = 40.67
Standard deviation = square root of (sum of squares / degrees of freedom) = 6.38
Species B
Mean = 45
Sum of square differences = (40-45)² + (45-45)² + (39-45)² + (56-45)²
  = (-5)² + (0)² + (-6)² + (11)²
  = 25 + 0 + 36 + 121 = 182
Degrees of freedom of the sample = 3
Sum of square differences / degrees of freedom = 182/3 = 60.67
Standard deviation = square root of (sum of squares / degrees of freedom) = 7.79
Repeat these calculations for species C and D.
Standard deviations are useful in many statistical procedures. Here we need them to get the
standard error (SE), also called the standard error of the mean, which is the standard deviation
divided by the square root of the number of measurements. In each of our samples, n = 4, and the
square root of 4 is 2. Therefore, the SE values for these samples are:
Species     Mean    SD      SE
A            40     6.38    3.19
B            45     7.79    3.90
C            40     0.82    0.41
D            45     1.4     0.70
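As a cross-check, the same table can be reproduced with Python's standard statistics module
(a minimal sketch; small differences in the last decimal place are just rounding):

from statistics import mean, stdev
from math import sqrt

samples = {
    "A": [40, 35, 36, 49],
    "B": [40, 45, 39, 56],
    "C": [39, 41, 40, 40],
    "D": [45, 47, 44, 44],
}

for name, values in samples.items():
    sd = stdev(values)              # sample standard deviation (n - 1 in the denominator)
    se = sd / sqrt(len(values))     # standard error of the mean
    print(name, mean(values), round(sd, 2), round(se, 2))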
The 95% confidence limits for the means in our samples lie two standard errors above and below the
sample mean:

Species A: 33.6 - 46.4
Species B: 37.2 - 52.8
Species C: 39.2 - 40.8
Species D: 43.6 - 46.4

Therefore, for sample B, we can be 95% confident that the true mean growth rate is between 37.2 and
52.8 µm/min.
To be precise, the SE values should be multiplied by 1.96, rather than 2, and even 1.96 is valid
only for large samples. For samples of more typical size, the relevant number is a statistical
value called "t", which is found by looking in a table of t-values, later in this appendix.
Recall that sample means are considered significantly different only if their confidence intervals
do not overlap. Therefore the mean growth rates of species A and B are not significantly different,
whereas the mean growth rate of species C is significantly different from that of species D.
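A short sketch of this overlap rule, assuming the same mean ± 2 SE intervals as above (the helper
function names are illustrative, not a standard library API):

from statistics import mean, stdev
from math import sqrt

def conf_interval(values):
    se = stdev(values) / sqrt(len(values))
    m = mean(values)
    return (m - 2 * se, m + 2 * se)          # approximate 95% confidence interval

def intervals_overlap(ci1, ci2):
    return ci1[0] <= ci2[1] and ci2[0] <= ci1[1]

a, b = [40, 35, 36, 49], [40, 45, 39, 56]
c, d = [39, 41, 40, 40], [45, 47, 44, 44]

print(intervals_overlap(conf_interval(a), conf_interval(b)))   # True: A and B overlap, not significantly different
print(intervals_overlap(conf_interval(c), conf_interval(d)))   # False: C and D do not overlap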
A warning, however: growth rates can change depending on how the specimens are treated. Even
seemingly stable characteristics like hyphal diameter depend on where the diameter is measured
relative to the tip. In the accompanying figure, the diameters of the two hyphae are similar a long
way back from the tip, but not closer to the tip. This is worth remembering in the context of the
measurements you will be making in Lab 1.
In order to compare sample means with statistical rigor, we can use a "t-test", also called
Student's t-test, which was designed by a statistician (William Sealy Gosset, 1876-1937) who felt
he was a "student" of mathematics and published under that pseudonym. This is an excellent and
rigorous test under the proper conditions.
Because we do not know if A or B really has the higher mean, the statistical question we have to
ask is whether we might have two samples taken at random from the same population. This is
called a two-tailed t-test. For species A and B, the test calculations will look like this:
                              Species A          Species B
Measurements                   40, 35, 36, 49     40, 45, 39, 56
Number of measurements (n)     n1 = 4             n2 = 4
Degrees of freedom (ν)         ν1 = 3             ν2 = 3
Mean                           X1 = 40            X2 = 45
Sum of square differences      SS1 = 122          SS2 = 182
Pooled variance s²p = (SS1 + SS2) / (ν1 + ν2) = (122 + 182) / (3 + 3) = 50.67
The next formula is written for the general case, which is needed when n1 ≠ n2; when the samples
are the same size (as in this example) it simplifies to (2s²p/n)^1/2:

sX1-X2 = (s²p/n1 + s²p/n2)^1/2 = [50.67/4 + 50.67/4]^1/2 = (25.33)^1/2 = 5.03
In the following, |X| means "the absolute value of X":

t = |X1 - X2| / sX1-X2 = |40 - 45| / 5.03 = 5 / 5.03 = 0.99
The critical t value is t 0.05 (n1+n2-2) = t 0.05, 6 = 1.943 (from a statistical table, part of which is
reproduced below)
Since our calculated value of t is less than the critical value, we cannot conclude that the mean
growth rates of species A and B differ significantly. The calculated t-value is compared to a
standard table, whose critical values depend on the degrees of freedom (the total for both samples,
n1 + n2 - 2) and on the significance level α, the probability of being wrong when you claim a real
difference (the corresponding confidence level is 1 - α). Being 95% sure of being correct
corresponds to α = 0.05, because you will be wrong in your claim 5% of the time. Interpolate
(estimate) if your degrees of freedom fall between two of the values given in the abbreviated table
below.
df       t0.10     t0.05     t0.025    t0.01     t0.005
1        3.078     6.314     12.706    31.821    63.657
2        1.886     2.920     4.303     6.965     9.925
3        1.638     2.353     3.182     4.541     5.841
4        1.533     2.132     2.776     3.747     4.604
6        1.440     1.943     2.447     3.143     3.707
8        1.397     1.860     2.306     2.896     3.355
10       1.372     1.812     2.228     2.764     3.169
15       1.341     1.753     2.131     2.602     2.947
20       1.325     1.725     2.086     2.528     2.845
25       1.316     1.708     2.060     2.485     2.787
∞        1.282     1.645     1.960     2.326     2.576
Repeat the t-test for C vs D. The calculation for these data gives a t of about 6.1, which exceeds
the critical value of 1.943, meaning that the difference between the average growth rates of these
two species is statistically significant.
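The same pooled-variance (Student's) t statistic can be computed in a few lines of Python; this
sketch follows the calculation above (the function name pooled_t is illustrative; a statistics
package such as scipy.stats.ttest_ind gives the same result):

from statistics import mean
from math import sqrt

def pooled_t(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean(sample1), mean(sample2)
    ss1 = sum((x - m1) ** 2 for x in sample1)      # sum of square differences, sample 1
    ss2 = sum((x - m2) ** 2 for x in sample2)      # sum of square differences, sample 2
    s2p = (ss1 + ss2) / (n1 - 1 + n2 - 1)          # pooled variance
    se_diff = sqrt(s2p / n1 + s2p / n2)            # standard error of the difference between means
    return abs(m1 - m2) / se_diff

print(round(pooled_t([40, 35, 36, 49], [40, 45, 39, 56]), 2))   # about 0.99 (species A vs B)
print(round(pooled_t([39, 41, 40, 40], [45, 47, 44, 44]), 2))   # about 6.12 (species C vs D)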
Chi-Square Test: Goodness of Fit
In the preceding example, the growth rate data were values of a "continuous" variable. Like the
height and weight of people, such parameters cannot be measured exactly. Perhaps (like growth rate)
they are changing with time; perhaps (like weight) there are technical limitations to the accuracy
of our measurements. Another type of data is "categorical" frequency data. This is particularly
clear-cut for genetical studies such as Mendel's pea experiments, where each individual either
showed the dominant phenotype or did not.
For example, in genetical studies, the distribution of phenotypes in a real population is compared
to a Mendelian ratio like 1-to-1, 3-to-1, etc. If 100 pea plants consist of 70 tall individuals and
30 short individuals, is that closer to a 1-to-1 or to a 3-to-1 ratio? If it is closer to 1:1, is it close
enough that you can be 95% confident in the fit? A statistical tool to resolve these kinds of
questions is a Chi-square (χ²) test, also called a "goodness of fit" test.
Here, the χ² statistic is calculated as the sum of (observed - expected)² / expected, taken over
each phenotype class. In our example, imagine that we cannot decide between 1:1 and 3:1.

For the 1:1 ratio:  χ² = (70-50)²/50 + (30-50)²/50 = 400/50 + 400/50 = 8 + 8 = 16
For the 3:1 ratio:  χ² = (70-75)²/75 + (30-25)²/25 = 25/75 + 25/25 = 0.33 + 1 = 1.33

Here is the χ² table for one and two degrees of freedom. Remember, if you have two possibilities,
you have one degree of freedom.
=
0.10
0.05
0.025 0.01
0.005 0.001
=1
2.706 3.841 5.024 6.635 7.879 10.828
=2
4.605 5.991 7.378 9.210 10.597 13.816
The smaller the value of χ², the better the fit between the observed ratio and the expected ratio.
Clearly the 70:30 distribution of tall and short plants is closer to a 3:1 ratio than to a 1:1
ratio. You could see that even before doing the test, of course, but if it had been a problem
involving multiple phenotypes, then a χ² test might have been the only way to tell. Now, is the
observed ratio close enough to a 3:1 ratio that we can be 95% confident that its deviation from the
ideal is due simply to ordinary random variation? To answer that, we must compare our calculated χ²
values to statistically determined critical values.
In our example, for one degree of freedom, the χ² calculated for the 3:1 hypothesis (1.33) is
smaller than the 0.05 critical value (3.841), whereas the χ² calculated for the 1:1 hypothesis (16)
is not. Therefore, we are 95% confident that our ratio fits a 3:1 distribution.
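The goodness-of-fit arithmetic is easy to sketch in Python as well (the function name chi_square is
illustrative; the observed counts and expected ratios are those of the pea example above):

def chi_square(observed, expected_ratio):
    total = sum(observed)
    expected = [total * r for r in expected_ratio]             # expected counts under the hypothesis
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [70, 30]                          # 70 tall and 30 short pea plants
print(chi_square(observed, [0.5, 0.5]))      # 16.0 for the 1:1 hypothesis
print(chi_square(observed, [0.75, 0.25]))    # about 1.33 for the 3:1 hypothesis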
What follows is for more sophisticated statistical problems than you’ll encounter in
Biol 342, but is worth keeping in mind for other courses.
1) Statistical tests can be broadly classified into two main types: parametric and non-parametric.
Parametric tests like the t-test assume the variable you measure has a normal (bell-curve)
distribution and that your sample size is relatively large. Non-parametric tests like the
Chi-square do not make these assumptions. For now, you don’t need to worry about
distinguishing the two types of tests, but more rigorous statistical methods require first
testing whether your (continuous) variable follows a normal distribution.
2) Is your dependent (measured) variable continuous or categorical? Continuous variables like
length or weight can be measured on a continuous numerical scale (e.g. a ruler). Growth
rate of hyphae is an example of a continuous variable. Categorical data involve
classifying your sample into a category, e.g. “Tall” vs. “short” or “blue” vs. “red” as in
the pea-plant genetics example. Note that in this case, the length of the plant was not
actually measured, but only classified. With categorical data, you will end up with
frequencies (or counts) of individuals in the different categories.
3) Is your independent variable continuous or categorical?
4) How many categories are being compared?
__________________________________________________________________________
Dependent variable     Independent variable     Statistical test
__________________________________________________________________________
continuous             continuous               Regression
categorical            categorical              Chi-square
categorical            continuous               Logistic regression
continuous             categorical              t-test (for comparing 2 groups);
                                                1-way ANOVA (for 3+ groups)
continuous             2 category types         2-way ANOVA
__________________________________________________________________________
Examples
1) Regression: the relationship between a person’s weight and height (both continuous
variables). You must test whether the slope of the fitted regression line is significantly
different from zero. Merely plotting a line through a scatterplot of the data and
calculating an r2 value IS NOT the statistical test.
2) Chi-square: comparing observed plant height categories to expected height categories as in
the previous example. This test can also be used to see whether the frequencies in the
height categories differ for 2 species of plants.
3) Logistic regression: a rather rare and complex test. For example, you could test whether the
level of some nutrient (continuous independent variable) is related to mortality of the
plant (mortality is categorical because it is “yes” or “no”). You need a computer program
to calculate this efficiently.
4) t-test: as in your first example, growth rate is continuous, and you were comparing 2
categories (Species A vs. B).
5) 1-way ANOVA: same as above, but if you were comparing 3 or more species (e.g. Species A
vs. B vs. C). Typically uses a computer program to calculate.
6) 2-way ANOVA: e.g. if you were comparing growth rates (continuous dependent variable) between
two types of categories; each type of category is usually called a "factor". Factor 1 = species
(A vs. B); Factor 2 = temperature ("hot" vs. "cold"). It does not matter how many divisions there
are within each factor. For example, you could compare species A vs. B vs. C grown under "hot" or
"cold" temperatures. 2-way ANOVAs require a computer program.