Statistics for AP Biology
Populations of organisms vary in almost every characteristic (e.g., size, weight, strength,
intelligence, growth rate, response to stimulus, longevity). Statistics allow scientists to
summarize characteristics of large populations quantitatively by making measurements on a
small portion of the individuals (i.e., a sample) of that population. Using statistics we can also
determine whether observed differences between characteristics of two populations are
“statistically different.”
In order to make inferences about entire populations, samples must be taken at random
(without bias). The size of the samples and the amount of variability in the measurements taken
will affect our ability to describe or to make predictions about the populations.
Descriptive Statistics
A parameter is a measure that describes or characterizes an entire population of data. It
is typically not possible to collect all data in a population, and therefore, we must estimate
parameters by computing statistics. Thus, a statistic describes or characterizes a sample of the
population. We begin describing characteristics of populations by using measures of central
tendency.
Three Measures of Central Tendency
Mean: The most commonly used parameter describing the central tendency of a population is
the mean (sometimes called the average). The mean of the entire population is represented by
the Greek symbol µ (pronounced mew). The sample mean that estimates the population mean is
represented by X̄ (pronounced x-bar). The sample mean is calculated as follows:
X̄ = ΣX / n
(Equation 5.1)
where X represents an individual datum (e.g., tree height, weight, etc.) and n is the number of
individuals in the sample. As long as the sample is a random sample of the population, we may
assert that the sample mean is a good estimate of the population mean. How precise that
estimate is will be shown later.
Median: Another parameter describing the central tendency of a population is the sample
median. The median is the middle measurement in a ranked listing of the data (i.e., there are an
equal number of individuals with values greater than the median and of individuals with values
less than the median). If there is an even number of individuals in the sample, the median is the
mean of the middle two measurements. Eleven is the median of Sample 1 and 11.5 is the median
of Sample 2 below.
Sample 1: 5, 5, 6, 7, 7, 7, 8, 8, 10, 11, 12, 13, 15, 15, 15, 16, 16, 17, 18
Sample 2: 5, 5, 6, 7, 7, 7, 8, 8, 10, 11, 12, 13, 15, 15, 15, 16, 16, 17, 18, 18
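The mean and median calculations can be checked with a short Python sketch. It uses the two samples listed above and the standard-library `statistics` module; the function name `sample_mean` is only illustrative.

```python
from statistics import median

# Samples 1 and 2 from the median example above.
sample_1 = [5, 5, 6, 7, 7, 7, 8, 8, 10, 11, 12, 13, 15, 15, 15, 16, 16, 17, 18]
sample_2 = sample_1 + [18]  # Sample 2 is Sample 1 plus one more 18


def sample_mean(values):
    """Equation 5.1: the sum of the data divided by the number of data."""
    return sum(values) / len(values)


median_1 = median(sample_1)  # odd n: the middle value of the ranked data
median_2 = median(sample_2)  # even n: mean of the two middle values
mean_1 = sample_mean(sample_1)

print(median_1, median_2)  # 11 11.5
```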
Variability: Calculating the mean, or any other measure of central tendency, provides only a
partial description of a set of data. For example, look at the following two data sets. Both
samples have the same mean; however, the data of Sample 1 are much more variable than the
data of Sample 2.
Sample 1: 9, 22, 16, 13, 11, 19 (Mean = 15)
Sample 2: 15, 16, 17, 15, 13, 14 (Mean = 15)
Another way to visualize variability is using histograms. Data from samples 1 and 2 are
plotted below using histograms. Note the dispersion of measurements around the mean (15).
[Figure: side-by-side histograms of Sample 1 and Sample 2. X-axis: Measurement (8 to 23); Y-axis: Frequency (0 to 3).]
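The frequencies behind such histograms can be tallied with a few lines of Python. This is a sketch that assumes each bar covers a single measurement value (bin width 1), which may not match the binning of the original figure; `frequency_table` and `text_histogram` are illustrative names.

```python
from collections import Counter

sample_1 = [9, 22, 16, 13, 11, 19]
sample_2 = [15, 16, 17, 15, 13, 14]


def frequency_table(values):
    """Map each distinct measurement to the number of times it occurs."""
    return dict(sorted(Counter(values).items()))


def text_histogram(values, low=8, high=23):
    """Return one line per measurement value, with one '#' per observation."""
    counts = Counter(values)
    return "\n".join(f"{x:>2} | {'#' * counts[x]}" for x in range(low, high + 1))


freq_1 = frequency_table(sample_1)
freq_2 = frequency_table(sample_2)

print(text_histogram(sample_1))
print(text_histogram(sample_2))
```

In Sample 1 every value occurs once and the marks are spread from 9 to 22, while in Sample 2 they cluster between 13 and 17.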
Range: One measure of data variability is the range, the difference between the largest and
smallest measurements (e.g., range = 22 – 9 = 13 for Sample 1 and 17 – 13 = 4 for Sample 2).
However, the sample range almost always underestimates the actual population range and does an
incomplete job of describing variability.
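As a quick check, the range of each six-value sample can be computed as the largest value minus the smallest. The helper name `sample_range` is only illustrative.

```python
sample_1 = [9, 22, 16, 13, 11, 19]
sample_2 = [15, 16, 17, 15, 13, 14]


def sample_range(values):
    """Difference between the largest and smallest measurement."""
    return max(values) - min(values)


range_1 = sample_range(sample_1)  # 22 - 9
range_2 = sample_range(sample_2)  # 17 - 13

print(range_1, range_2)  # 13 4
```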
Sum of Squares: More applicable in statistical analysis is the measure of data variability based
on the deviation from the sample mean. We define this quantity as the sum of squared deviations
from the mean or sum of squares (abbreviated SS). The sum of squares is calculated as follows:
SS = Σ(X – X̄)²
(Equation 5.2)
Thus, for the data in Sample 1,
SS = (9 – 15)² + (22 – 15)² + (16 – 15)² + (13 – 15)² + (11 – 15)² + (19 – 15)²
= 122
Sample Variance: Sum of squares can be used to calculate a better measure of dispersion or
variability called the sample variance (abbreviated s²):
s² = SS/DF
(Equation 5.3)
where DF is the degrees of freedom, defined as
DF = n – 1.
Therefore, for the data from Sample 1, DF = 6 – 1 = 5.
Thus, s² = 122/5 = 24.4.
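The sum of squares and sample variance for Sample 1 can be reproduced with the sketch below. The function names are illustrative; the standard-library `statistics.variance` uses the same n – 1 denominator and serves as a cross-check.

```python
from statistics import variance

sample_1 = [9, 22, 16, 13, 11, 19]


def sum_of_squares(values):
    """Equation 5.2: sum of squared deviations from the sample mean."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values)


def sample_variance(values):
    """Equation 5.3: SS divided by the degrees of freedom (n - 1)."""
    df = len(values) - 1
    return sum_of_squares(values) / df


ss_1 = sum_of_squares(sample_1)
var_1 = sample_variance(sample_1)

print(ss_1, var_1)  # 122.0 24.4
```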
The sample variance is a good estimate of the population variance (σ²). In other words,
s² is a good estimate of the variance we would compute if we were able to sample the entire
population of interest.
Standard Deviation: Rather than the sample variance, we usually report the standard deviation
(s or SD),
s or SD = √(s²) = √(SS/DF)
(Equation 5.4)
For Sample 1, SD = √24.4 = 4.94
The sample mean and standard deviation are often used to describe and compare
characteristics of populations. We report this information as the sample mean ± the standard
deviation (X̄ ± SD). For Sample 1, we would report: X̄ = 15 ± 4.94.
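A short sketch of the standard deviation calculation for Sample 1, taking SD as the square root of SS divided by n – 1 and formatting the result as mean ± SD. The variable names are illustrative, and `statistics.stdev` is included only as a cross-check.

```python
import math
from statistics import stdev

sample_1 = [9, 22, 16, 13, 11, 19]

n = len(sample_1)
x_bar = sum(sample_1) / n
ss = sum((x - x_bar) ** 2 for x in sample_1)  # sum of squares
sd = math.sqrt(ss / (n - 1))                  # square root of the sample variance

report = f"{x_bar:.0f} ± {sd:.2f}"
print(report)  # 15 ± 4.94
```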
The standard deviation is a measure of the dispersion, or scatter, of the data. For instance, if a
surgeon collects data for 20 patients with soft tissue sarcoma and the average tumor size in the
sample is 7.4 cm, the average does not provide a good idea of the individual sizes in the sample.
It could be that the sizes in the sample are similar and lie between 7 and 9 cm or that the sizes are
dissimilar with some tumors being very small and others very large. In the former case, size
likely will play little role in the differences in outcome between patients, whereas in the latter
case tumor size could be an important factor (confounding variable) explaining differences in
outcome between patients or relating to other variables such as surgical margins. Further, having
an estimate of the scatter of the data is useful when comparing different studies, as even with
similar averages, samples may differ greatly. It therefore is important to report the variability in
the sample and this is done with the standard deviation of the sample.
Statistics Practice
1. Using the data from Table 5.1, make a table that lists the population name (cave) and the
sample statistics for each population, including the mean number of parasites, median, DF, SS,
s², and SD. Next, try your hand at creating a histogram.
Table 5.1 (Sample Data for practice). The number of parasites on randomly selected bats located
in 4 caves in the Southern United States during the summer of 2013.
Coronado cave
Manitou cave
Onyx cave
Crystal cave
3
6
3
11
2
5
5
10
6
11
3
7
7
3
2
6
5
9
1
7
2
8
1
5
9
6
4
2
1
28
7
1
4
Estimating Population Means
When we estimate a population mean from a sample mean, we may wonder how close
the estimate of the population mean is to the actual population mean. This is answered by
considering the fact that repeated samples from the same population will likely have somewhat
different means. The variability of these possible sample means is known as the standard error
(SE) of the mean and is calculated as follows:
SE = s / √n
(Equation 6.1)
Using this equation, the standard error for Sample 1 is
SE = 4.94 / √6
= 2.02
It is important to differentiate between SD and SE. The SD is an estimate of the variability
in the actual characteristic measurements. In contrast, the SE is a measure of the variability
among the sample means that repeated sampling would produce, and it can be calculated from a
single sample.
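The standard error for Sample 1 can be sketched in Python by taking SE as the sample standard deviation divided by the square root of n, which reproduces the 2.02 reported above. The function name `standard_error` is illustrative.

```python
import math
from statistics import stdev

sample_1 = [9, 22, 16, 13, 11, 19]


def standard_error(values):
    """Sample standard deviation divided by the square root of n."""
    return stdev(values) / math.sqrt(len(values))


se_1 = standard_error(sample_1)
print(round(se_1, 2))  # 2.02
```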
We can use the standard error to calculate a confidence interval (CI), a range of values
that, with a stated level of confidence, will include the actual population mean, µ.
There are many statistics that can be used to calculate a confidence interval. One of the
simplest and most common approaches uses a statistical distribution known as Student’s t
(see the attached Table 6.2). In this table, DF is the degrees of freedom (n – 1) and α is the
significance level. We can calculate a confidence interval using the following equation.
(1 – α) CI for µ = X̄ ± t(SE)
(Equation 6.2)
where t is the value obtained from Table 6.2 and SE is the standard error.
A significance level of 5% is the most frequently used level in biological sciences. Using
α = 0.05 allows us to compute a 95% (1 - α) confidence interval for Sample 1 as follows:
t(DF=5, α=0.05) = 2.57
95% CI = 15 ± 2.57(2.02)
= 15 ± 5.19.
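The confidence interval calculation can be sketched as follows. The critical value 2.57 is the one quoted above for DF = 5 and α = 0.05 and is hard-coded here rather than looked up; `confidence_interval` and `T_CRITICAL` are illustrative names.

```python
import math
from statistics import mean, stdev

sample_1 = [9, 22, 16, 13, 11, 19]

# Critical t value quoted in the text for DF = 5, alpha = 0.05 (assumed, not computed).
T_CRITICAL = 2.57


def confidence_interval(values, t_critical):
    """Equation 6.2: sample mean plus or minus t times the standard error."""
    x_bar = mean(values)
    se = stdev(values) / math.sqrt(len(values))
    margin = t_critical * se
    return x_bar - margin, x_bar + margin


low, high = confidence_interval(sample_1, T_CRITICAL)
margin = (high - low) / 2
print(round(low, 2), round(high, 2))
```

Carried at full precision, the margin comes out near 5.18 rather than 5.19, because the hand calculation rounds SE to 2.02 before multiplying by t.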
Interpreting confidence intervals
A 95% confidence interval may be interpreted in the following manner for Sample 1:
if the experiment were repeated many times, about 95 out of every 100 intervals calculated this
way would include µ, the actual (unknowable) mean of the entire population. For Sample 1, the
interval runs from 9.81 (15 – 5.19) to 20.19 (15 + 5.19).
For this exercise, we will assume that when comparing population means, if confidence
intervals do NOT overlap, the means are significantly different. See Figure 6.1 on p. 4 for a
good way to display sample means and their confidence intervals.
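The non-overlap rule can be expressed as a small check. The sketch below applies it to the two six-value samples used earlier in this handout (both with DF = 5, so the quoted t value of 2.57 is assumed for each); the function names are illustrative.

```python
import math
from statistics import mean, stdev

sample_1 = [9, 22, 16, 13, 11, 19]
sample_2 = [15, 16, 17, 15, 13, 14]
T_CRITICAL = 2.57  # value quoted in the text for DF = 5, alpha = 0.05


def confidence_interval(values, t_critical):
    """Return (lower, upper) limits: mean -/+ t * SE."""
    margin = t_critical * stdev(values) / math.sqrt(len(values))
    return mean(values) - margin, mean(values) + margin


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


ci_1 = confidence_interval(sample_1, T_CRITICAL)
ci_2 = confidence_interval(sample_2, T_CRITICAL)
significantly_different = not intervals_overlap(ci_1, ci_2)
print(significantly_different)  # False
```

Both samples have a mean of 15, so their intervals overlap and the rule reports no significant difference, even though the interval for Sample 1 is much wider.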
This demonstration shows us two important relationships between the measurements we
make and the size of the sample we choose to measure. Measurements with lower variability
(smaller SD) provide us with a narrower confidence interval within which µ is expected. Larger
sample sizes (n) also lead to narrower confidence intervals because the denominator in our
calculation of SE will be larger. Note that you can’t control the variability of measurements
because that is inherent in whatever you measure. You can, however, control your sample size,
and it is almost always important to make it as large as possible.
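The effect of sample size on SE can be illustrated with a brief sketch. It assumes the standard deviation stays at the 4.94 reported for Sample 1 while n grows; the sample sizes 24 and 96 are hypothetical. The t value also shrinks as DF increases, which is not modelled here.

```python
import math

SD = 4.94  # standard deviation reported for Sample 1


def standard_error(sd, n):
    """SE = SD divided by the square root of the sample size."""
    return sd / math.sqrt(n)


# 6 is the size of Sample 1; 24 and 96 are hypothetical larger samples.
for n in (6, 24, 96):
    print(n, round(standard_error(SD, n), 2))
```

Because n sits under a square root, quadrupling the sample size halves the standard error.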
More statistics practice
Using the data from Table 6.1 (below), make a table that lists the population name (Lake) and
the sample statistics for each population including mean ducklings fledged, median, DF, SS, s²,
SD, SE, and t95(SE) for each of the populations. In addition, make a bar graph that illustrates the
mean with the 95% CI represented as “error bars.” Are there any significant differences between
the lakes?
Table 6.1 (Sample Data for homework). The number of ducklings that fledged (survived) from
randomly located nests at 5 prairie pothole lakes in southern Minnesota during the summer of
2009.
Swan Lake: 4, 5, 7, 7, 2, 3, 4, 11
Eagle Lake: 6, 7, 7, 6, 8, 12, 3, 11, 9
Minn. Lake: 2, 1, 0, 4, 1, 0, 1, 2, 3
Loon Lake: 11, 13, 11, 17, 12, 10, 14
Middle Lake: 3, 3, 5, 4, 4, 6, 3, 5, 4, 6