Statistics in Biology

Measures of Central Tendency
The center of a distribution of data can be measured by the mean, median, or mode.
Mean
The mean is a calculation in which all of the data points are summed and then that sum is divided by the total
number of data points. The sample mean (below) is an estimate of the mean of the entire population (μ).
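As a quick sketch of the calculation in Python (the heights here are made up for illustration, not the Table 2 values):

```python
from statistics import mean

# Hypothetical plant heights in mm (illustrative only, not the Table 2 data)
heights = [112, 98, 105, 120, 88, 101, 110, 95]

# Sum all of the data points, then divide by the number of data points
x_bar = sum(heights) / len(heights)

# The statistics module computes the same value
assert x_bar == mean(heights)
print(x_bar)
```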
Students in a biology class planted eight bean seeds in separate plastic cups and placed them under a bank of
fluorescent lights. Fourteen days later, the students measured the height of the bean plants that grew from
those seeds and recorded their results in Table 2.
Calculate the mean of these values in the space below.
What can you determine from your sample mean of 8 plants compared to the center of the distribution of a
population of 200 plants?
When would you use the mean?
When might you not use the mean?
All examples included are from the Howard Hughes Medical Institute Biointeractive Teacher Guide: Math and Statistics
Median
The median is the middle point of the data. If you lined up your data from largest to smallest you could find the
middle value and that would be your median.
The median is valuable when your data shows a large range or when there are data points that are extremely
large or small. It is also valuable when your data set is small.
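A short Python sketch (with made-up maze times, not the Table 3 data) shows why the median resists an extreme value while the mean does not:

```python
from statistics import mean, median

# Hypothetical maze times in seconds; the last mouse is an extreme outlier
times = [31, 33, 35, 36, 38, 40, 41, 42, 44, 45, 47, 49, 300]

# Line the data up and take the middle value
middle = sorted(times)[len(times) // 2]
assert middle == median(times)

# The outlier drags the mean far above the median
print(median(times), round(mean(times), 1))
```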
A researcher studying mouse behavior recorded in Table 3 the time (in seconds) it took 13 different mice to
locate food in a maze.
Calculate the mean of the values.
Calculate the median of the values.
How do these two values compare in looking at the time it takes for a mouse to find food in a maze?
Justify which measure of central tendency you would use to explain the “average” time for the mouse to run the
maze and find the food.
Mode
The mode is a measure of how often a value occurs in your data set. The value of the mode is less about an
average and more about where your data clusters and how it is distributed. Below are examples of a modal
distribution (Figure 1) and a bimodal distribution (Figure 2).
If the mean or median had been calculated from the raw data, explain whether the modal patterns in each
example would have been revealed as they are with the mode.
Why might a scientist be interested in the modal distribution (where data is clustered)?
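A small Python sketch (with made-up ant body lengths, echoing the bimodal example in Figure 2) shows how to find where data cluster:

```python
from collections import Counter
from statistics import multimode

# Hypothetical ant body lengths in mm; the values form two clusters
lengths = [4, 4, 5, 5, 5, 5, 6, 6, 8, 9, 9, 9, 9, 10]

print(Counter(lengths))    # how often each value occurs
print(multimode(lengths))  # a bimodal data set has two modes
```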
1 Figure 1 – Oliveria, Fabiola (2005). Lognormal abundance distribution of woody species in a cerrado fragment (São
Carlos, southeastern Brazil). Brazilian Journal of Botany, vol. 28, no. 1.
2 Figure 2 – Graph of Body Lengths of Weaver Ant Workers (reproduced from
http://en.wikipedia.org/wiki/File:BimodalAnts.png).
Measures of Variability
The variability in a set of data can be measured by the range, standard deviation, and variance.
Range
The range is calculated by subtracting the smallest value in the data from the largest value. This is a simple
measure of variability: a large value indicates relatively large variability in the data, and a small value
indicates low variability.
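The calculation can be sketched in a line of Python (the leaf widths here are hypothetical, not the Table 4 values):

```python
# Hypothetical leaf widths in cm (illustrative only, not the Table 4 data)
widths = [5.2, 6.1, 4.8, 7.0, 12.4, 5.5, 6.3, 5.9]

# Range: largest value minus smallest value
data_range = round(max(widths) - min(widths), 1)
print(data_range)
```

Note how the single unusually wide leaf (12.4 cm) dominates the range, which is exactly the issue raised in the questions below.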
Students in a biology class measured the width in centimeters of eight leaves from eight different maple trees
and recorded their results in Table 4.
Show your calculation of the range below.
What does your range value say about the variability of leaf width for maple trees?
Notice the value of leaf number 5. How does that value affect your confidence in the range as a summary of all
maple leaves on a tree?
Standard Deviation
Standard deviation is the most common measure of how data vary around the mean. You should become
intimately familiar with this calculation. The sample standard deviation (s) reflects the average deviation of
each sample value from the mean; in other words, how different each value is from the mean. If we sum the
squared deviations, divide by one less than the sample size (n − 1), and take the square root, we get the
standard deviation of the data set. The higher the standard deviation, the further your values are spread from
the mean.
In a normally distributed sample, the interval from the mean to 1 standard deviation above it contains 34.1% of
the values. If we count both sides of the mean (± 1 standard deviation), we should expect 68.3% of our
samples to fall within that portion of the normal curve. If we bump that out to ± 2 standard deviations, we
should expect 95.4% of our values to be found within that portion of the normal curve.
You are interested in knowing how tall bean plants (Phaseolus vulgaris) grow in two weeks after planting. You
plant a sample of 20 seeds (n = 20) in separate pots and give them equal amounts of water and light. After two
weeks, 17 of the seeds have germinated and have grown into small seedlings (now n = 17). You measure
each plant from the tips of the roots to the top of the tallest stem. You record the measurements in Table 5,
along with the steps for calculating the standard deviation.
Table 5. Plant Measurements and Steps for Calculating the Standard Deviation

Plant Number    Plant Height (mm)
1               112
2               102
3               106
4               120
5               98
6               106
7               80
8               105
9               106
10              110
11              95
12              98
13              74
14              112
15              115
16              109
17              100

mean = ________ mm
variance (s²) = ________
standard deviation (s) = ________ mm
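The steps in Table 5 can be checked with a short Python sketch using the plant heights above (dividing the summed squared deviations by n − 1 for a sample):

```python
from statistics import mean, stdev, variance

# Plant heights (mm) from Table 5
heights = [112, 102, 106, 120, 98, 106, 80, 105, 106, 110,
           95, 98, 74, 112, 115, 109, 100]

n = len(heights)                             # n = 17
x_bar = mean(heights)
ss = sum((x - x_bar) ** 2 for x in heights)  # summed squared deviations
s2 = ss / (n - 1)                            # sample variance
s = s2 ** 0.5                                # sample standard deviation

# The statistics module agrees with the step-by-step calculation
assert abs(s2 - variance(heights)) < 1e-9
assert abs(s - stdev(heights)) < 1e-9
print(round(x_bar, 1), round(s, 1))
```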
Visualizing the Variability
Calculate the mean +/- 1 standard deviation
Mean + 1s = ___________
Mean - 1s = ___________
Calculate the mean +/- 2 standard deviations
Mean + 2s = ___________
Mean - 2s = ___________
Now that you have calculated the standard deviation, you can look at your data and see that:
68.3% of the measurements fall between ______________mm and 95.4% fall between ____________ mm.
Another way to visualize the standard deviation relative to the mean is to create standard deviation bars. This
is simply done by plotting the mean on a graph, either as a single point or as a column, and adding a T-shaped
bar above and below the mean value.
For your data on plant height, plot below the mean and representations of +/- 1 standard deviation and +/- 2
standard deviations.
How do these calculations change with sample size?
Measures of Confidence
While the standard deviation tells us how spread out our data is from the mean, a different statistic can help us
figure out the uncertainty of our mean calculation in the first place.
How do sample means vary from the entire population?
Take a 20-sided die (d20) and roll it 100 times. Record your values below.
Mean of all 100 samples = ________________
Use a random number generator to select at random 8 sub-samples of 5 values from the 100 you collected.
Calculate the mean of each and write it in the bold box below.
SS1
SS2
SS3
SS4
SS5
SS6
SS7
SS8
For each mean in the box, how confident are you that it represents the mean of the entire population? Color
each box red (not confident), yellow (somewhat confident), or green (confident).
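The dice exercise can also be simulated in Python (a sketch; the seed and sub-sample sizes are arbitrary choices):

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

# Roll a 20-sided die 100 times
rolls = [random.randint(1, 20) for _ in range(100)]
overall_mean = sum(rolls) / len(rolls)

# Pick 8 random sub-samples of 5 rolls and compute each sub-sample mean
sub_means = [sum(random.sample(rolls, 5)) / 5 for _ in range(8)]

# Small sub-samples scatter widely around the overall mean
print(round(overall_mean, 2), [round(m, 1) for m in sub_means])
```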
Calculating Uncertainty
The relationship between the variability of a sample mean and the global (population) mean can be expressed
by calculating the standard error of the mean (abbreviated SE(x̄) or SEM).
The standard error of the mean represents the standard deviation of the distribution of sample means and
estimates how close the sample mean is to the population mean.
The less deviation your samples have from the mean, the smaller the SE. Also, the greater each sample size
(i.e., 50 roll values rather than 5), the more closely the sample mean will estimate the population mean, and
therefore the smaller the standard error of the mean becomes.
What the standard error of the mean tells you is that about two-thirds (68.3%) of the sample means would be
within ±1 standard error of the population mean and 95.4% would be within ±2 standard errors.
Another, more precise measure of the uncertainty in the mean is the 95% confidence interval (95% CI). For
large sample sizes, the 95% CI can be calculated as 1.96s/√n, which is typically rounded to 2s/√n for ease of
calculation. In other words, the 95% CI is about twice the standard error of the mean.
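As a sketch, reusing the Table 5 plant heights from the previous section, the SEM and 95% CI follow directly from the standard deviation:

```python
from statistics import stdev

# Plant heights (mm) from Table 5
sample = [112, 102, 106, 120, 98, 106, 80, 105, 106, 110,
          95, 98, 74, 112, 115, 109, 100]

n = len(sample)
s = stdev(sample)     # sample standard deviation
sem = s / n ** 0.5    # standard error of the mean: s / sqrt(n)
ci95 = 1.96 * sem     # 95% CI half-width, roughly 2 x SEM

print(round(sem, 2), round(ci95, 2))
```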
SEM Error Bars
Many bar graphs include error bars, which may represent standard deviation, SEM, or 95% CI. When the bars
represent SEM, you know that if you took many samples only about two-thirds of the error bars would include
the population mean.
This is very different from standard deviation bars, which show how much variation there is among individual
observations in a sample. When the error bars represent 95% confidence intervals in a graph, you know that in
about 95% of cases the error bars include the population mean.
If a graph shows error bars that represent SEM, you can estimate the 95% confidence interval by making the
bars twice as big—this is a fairly accurate approximation for large sample sizes, but for small samples the 95%
confidence intervals are actually more than twice as big as the SEMs.
If we are trying to tell whether two or more samples are significantly different from each other, we can look for
error bar overlap. Error bars for different columns that overlap may indicate that the two samples are not
significantly different; error bars that do not overlap may indicate that they are. To be sure, another statistical
test, the t-test, can be performed.
Seeds of many weed species germinate best in recently disturbed soil that lacks a light-blocking canopy of
vegetation. Students in a biology class hypothesized that weed seeds germinate best when exposed to light.
To test this hypothesis, the students placed a seed from crofton weed (Ageratina adenophora, an invasive
species on several continents) in each of 20 petri dishes and covered the seeds with distilled water. They
placed half the petri dishes in the dark and half in the light. After one week, the students measured the
combined lengths in millimeters of the radicles and shoots extending from the seeds in each dish.
Table 6. Combined Lengths of Crofton Weed Radicles and Shoots after One Week in the Dark and the Light

Petri Dish   Dark (x1) (mm)   Light (x2) (mm)   Dark (x − x̄)²   Light (x − x̄)²
1 and 2           12               18                 5.8              0.16
3 and 4            8               22                 2.6             12.96
5 and 6           15               17                29.1              1.96
7 and 8           13               23                11.5             21.16
9 and 10           6               16                13.0              5.76
11 and 12          4               18                31.4              0.16
13 and 14         13               22                11.6             12.96
15 and 16         14               12                19.3             40.96
17 and 18          5               19                21.1              0.36
19 and 20          6               17                13.0              1.96
Σ(x − x̄)²                                          158.4             98.4

Mean = 9.6 mm (Dark)           Mean = 18.4 mm (Light)
s = ________ mm (Dark)         s = ________ mm (Light)
Standard Error = ________      Standard Error = ________
95% CI = ________              95% CI = ________
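One way to check your calculations is a short Python sketch over the Table 6 data:

```python
from statistics import mean, stdev

# Table 6: combined radicle and shoot lengths (mm) after one week
dark = [12, 8, 15, 13, 6, 4, 13, 14, 5, 6]
light = [18, 22, 17, 23, 16, 18, 22, 12, 19, 17]

results = {}
for name, data in (("dark", dark), ("light", light)):
    n = len(data)
    s = stdev(data)     # sample standard deviation
    sem = s / n ** 0.5  # standard error of the mean
    ci95 = 1.96 * sem   # 95% confidence interval half-width
    results[name] = (round(mean(data), 1), round(s, 1),
                     round(sem, 2), round(ci95, 2))

print(results)  # (mean, s, SEM, 95% CI) per treatment
```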
Use columns to graph your data for both treatments below, adding SE and 95% CI error bars.
Remember the initial hypothesis, “that weed seeds germinate best when exposed to light”.
With the statistical evidence you collected, explain whether the hypothesis is rejected or fails to be rejected.
Explain how the SEM error bars support your answer above. You may use either 1 SEM or 95% CI as support.
Based on the error bars, explain whether you can conclude that the population means (and thus the
treatments) are different. Can you extend your thinking to whether the difference is purely due to chance or
whether it is statistically significant?
Proceed to Calculating Descriptive Statistics
Inferential Statistics: T-Test
Statistical hypotheses are different from experimental hypotheses. In experimental hypotheses you are
measuring whether one variable has an effect on a process. Statistics evaluates a statistical null hypothesis.
This null hypothesis states that when comparing groups, the experimental effect had no impact on the process
and any change is due to chance alone. The null hypothesis is denoted H0.
Suppose you grow 10 bean plants in soil with added nitrogen and 10 bean plants in soil without added
nitrogen, and you find that the means of these two samples are 13.2 centimeters and 11.9 centimeters,
respectively.
● Does this result indicate that there is a difference between the two populations and that nitrogen might
promote plant growth?
● Or is the difference in the two means merely due to chance?
A statistical test is required to discriminate between these possibilities.
How do we define “chance”?
The significance level is the probability of getting a test statistic rare enough that you are comfortable rejecting
the null hypothesis (H0). The widely accepted significance level in biology is 0.05. If the probability (p) value is
less than 0.05, you reject the null hypothesis; if p is greater than or equal to 0.05, you don’t reject the null
hypothesis.
Comparing Means
Remember back when we calculated and graphed SEM bars to see if light and dark treatments showed any
difference in growth? We mentioned that to see if this was a statistically significant difference that we could
hang our hat on, we would need a method of comparing the means. Enter the t-test.
The t-test assesses the probability of getting the observed result if the null statistical hypothesis (H0) is true.
Typically, the null statistical hypothesis in a t-test is that the mean of sample 1 is equal to the mean of sample 2
(x̄1 = x̄2). Rejecting H0 supports the alternative hypothesis, H1, that the means are significantly different
(x̄1 ≠ x̄2). In the plant example, the t-test determines whether any observed differences between the means
of the two groups of plants are statistically significant or have likely occurred simply by chance.
Calculating the T-Test
Using the data in Table 6 (Combined Lengths of Crofton Weed Radicles and Shoots after One Week in the
Dark and the Light), follow these steps:
1. State your null hypothesis (H0):
2. Calculate the t-test SE value: divide the variance (s²) of each treatment by its sample size, add the two
results, and take the square root.
3. Subtract mean 1 − mean 2 and take the absolute value.
4. Divide the answer in step 3 by the answer in step 2. This is the tobs (t observed) value.
5. Find the degrees of freedom: the total number of data points in the experiment minus 2. Use this row of
the t-value critical table at p = 0.05.
6. Compare the tcrit (t-critical) value to the tobs value. If the calculated t-value is greater than the appropriate
critical t-value, this indicates that the means of the two samples are significantly different at the
probability value listed (in this case, 0.05). If the calculated t is smaller, then you cannot reject the null
hypothesis that there is no significant difference.
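Steps 2 through 6 can be sketched in Python with the Table 6 data (the critical value for 18 degrees of freedom at p = 0.05 comes from a standard t table):

```python
from statistics import mean, variance

# Table 6: combined radicle and shoot lengths (mm)
dark = [12, 8, 15, 13, 6, 4, 13, 14, 5, 6]
light = [18, 22, 17, 23, 16, 18, 22, 12, 19, 17]
n1, n2 = len(dark), len(light)

# Step 2: divide each variance by its sample size, add, take the square root
se = (variance(dark) / n1 + variance(light) / n2) ** 0.5

# Steps 3-4: absolute difference of the means divided by the SE value
t_obs = abs(mean(dark) - mean(light)) / se

# Step 5: degrees of freedom = total data points minus 2
df = n1 + n2 - 2

# Step 6: compare t_obs to the critical value (df = 18, p = 0.05)
t_crit = 2.101
print(round(t_obs, 2), df, t_obs > t_crit)
```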
Explain how your tobs (t observed) value compared to the tcrit (t critical) value.
Was your null hypothesis rejected, or did it fail to be rejected?
What does this say about the significance of the difference between your two treatments?
If you were to answer the hypothesis based on your statistical evidence, what would you write?
Proceed to part C, Calculating t-Test Statistics
Inferential Statistics: Chi Square
The t-test is used to compare the sample means of two sets of data. The chi-square test is used to determine
how the observed results compare to an expected or theoretical result.
● Chi-square should be done with raw counts, not percentages or frequencies.
● Chi-square should also be calculated from a large sample size.
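As an illustrative sketch (using hypothetical counts from a monohybrid cross expecting a 3:1 ratio, not the Sickle Cell data), the chi-square statistic compares observed and expected counts:

```python
# Hypothetical raw counts from a monohybrid cross (expected ratio 3:1)
observed = [290, 110]
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]  # 300 and 100

# Chi-square: sum of (observed - expected)^2 / expected over all classes
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For 1 degree of freedom the critical value at p = 0.05 is 3.841
print(round(chi2, 2), chi2 < 3.841)  # below the critical value: fail to reject H0
```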
For this example we will use the Sickle Cell Anemia exercise on pages 9-12.
Establishing Relationships Between Sets of Data
Correlation and causation are two relationships that are often confused. Does the fact that two sets of data are
related to each other mean that one causes the other? The answer is no. In fact, we would first need to
establish correlation. One way to do this is with the correlation coefficient (r). This statistic measures how
related two variables are, yielding a value between −1 and 1; the closer the value is to 0, the weaker the
correlation.
● The correlation coefficient (r) establishes the relatedness between variables X and Y. A correlation
greater than 0.8 is generally described as strong, whereas a correlation less than 0.5 is generally
described as weak.
● An r² value of 0.0 means that knowing X does not help you predict Y: there is no linear relationship
between X and Y, and the best-fit line is a horizontal line through the mean of all Y values. When
r² equals 1.0, all points lie exactly on a straight line with no scatter, and knowing X lets you predict Y
perfectly.
● The r² value tells the strength of the relationship between variables X and Y.
● For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be
explained by the linear relationship between x and y (as described by the regression equation). The
other 15% of the total variation in y remains unexplained.
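A sketch of the correlation coefficient calculation in Python (with made-up paired data; any two numeric variables work):

```python
# Hypothetical paired measurements for variables X and Y
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson correlation coefficient r: sum of deviation products over the
# product of the root sums of squared deviations
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
r = num / den

# r squared: fraction of the variation in y explained by the linear fit
print(round(r, 3), round(r ** 2, 3))
```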
If you want to see correlated data that does not prove causation, see Spurious Correlations by Tyler Vigen,
where you can experiment with his data.
(Document created by Bob Kuhn)