Variation Lab
by Ray L. Winstead
A major characteristic inherent in biological systems is the presence of
VARIATION or differences that exist among individual organisms in their structures,
functions, and behavior. As you have learned, or will learn, in the lecture part of the
course, these differences are important in the process of evolution. The concept of variation
will be illustrated in this lab by measuring some human characteristic and by examining
the differences that exist within and between two different groups. The characteristic (e.g.,
height, length of a finger bone) and groupings (e.g., gender) may be chosen by your
professor or by you, as directed by your lab professor.
In biological research, observation values (i.e., numerical measurements or data
points) are usually obtained to describe and give information about a particular
characteristic within a group of subjects. Consider the following way to represent such
measurements:
Measurements: $Y_1, Y_2, Y_3, \ldots, Y_i, \ldots, Y_n$

$Y_1$ represents the first measurement, $Y_2$ represents the second measurement, $Y_i$
represents the generalized "ith" measurement, and $Y_n$ represents the last measurement,
where $n$ is the total number of measurements in the group.

$\bar{Y} = \dfrac{\sum Y}{n}$
Whenever you have any group of measurements, especially a very large group, it is
convenient and helpful to have an intuitive value denoting the "center" of the group. The
common average or "mean" provides this handy, intuitive measure and is simply the sum
of all the measurements divided by the number of measurements in the group (as
illustrated above).
Average = Mean = $\bar{Y} = \dfrac{\sum Y}{n}$
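The calculation of the mean can be sketched in a few lines of Python. The measurement values below are hypothetical finger-bone lengths chosen only for illustration, and the function name `mean_of` is an assumption, not part of the lab.

```python
# Hypothetical finger-bone lengths in cm (illustrative values, not lab data).
measurements = [7.0, 7.5, 8.0, 8.5, 9.0]


def mean_of(values):
    """Return the sum of all measurements divided by the number of measurements."""
    return sum(values) / len(values)


average = mean_of(measurements)
print(average)
```

Here the sum of the five values is 40.0, so the mean is 40.0 / 5 = 8.0.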
However, just knowing the average of the measurements is not enough information
to have a good intuitive description of the characteristic. Information about the variation
within the group, i.e., the spread of the measurements around the mean is also helpful and
necessary. Consider the following two groups (Figure 1) where the averages are exactly the
same in both groups but the spread of the measurements around the mean is very different.
For each group the 12 tick marks along the horizontal axis indicate observation
measurement values on some scale appropriate for the measurement.
Figure 1
Note that the 12 values in the first group are clustered around the mean but, in general, are
distributed farther away from the mean than in the second group. Note further that the 12
values in the second group are also clustered around the mean and are, in general, closer to
the mean than in the first group. This shows that just having information about the mean is
not enough information to have a clear understanding of the measurements of that
characteristic.
Consider another representation below (Figure 2) of the two groups, including
further information about the distribution of measurement values in the two groups. The
horizontal axis represents a scale of observation values as before, while the vertical axis
represents the additional information of frequency, i.e., the number of observation values.
The curved line represents the change in the number of observation values in the group as
the actual value changes from a low value to a high value. Note that in both cases, as in
many biological distributions, the curve approximates a "Normal" distribution with more
values occurring in the center of the group near the mean and then becoming fewer in
number away from the mean.
Figure 2 (horizontal axis: Observation Values; vertical axis: frequency)
Consider the following two representations (Figure 3) of the difference between one
individual observation measurement and the mean as a linear distance. In the first case the
difference is represented on the horizontal axis, while in the second case the same distance
is represented within the graph above the horizontal axis. In this way all the differences can
be represented on the same graph and visualized more easily, without having to
superimpose the different values all on top of each other on the horizontal axis.
Figure 3
In Figure 4 below note that the lines representing the differences between the
individual observations and the mean are, in general, longer in the first graph than the
lines in the second graph.
Figure 4
Here is where we can get a second index (the first being the mean) that will give us an
intuitive idea of the characteristics of the measurements in the group. This second index
will tell us, in general, how spread out the measurement values are away from the mean.
Especially note intuitively from Figure 4 that if we simply take an AVERAGE OF THE
DISTANCES representing differences between individual observation measurements and
the mean, then in the first case the average of the distances will be large and in the second
case the average of the distances will be small. So, all we have to do is find the average of
those distances for each group and that will give us our desired intuitive index of the
spread of the observation values surrounding the mean for each group. If the index is
small, it will mean that the values are close to the mean; if the index is large, it will
mean, in general, that the observation values are spread farther out from the mean (even
though there are still more of them in the center of the distribution). In this way we can use
the new index not only as a measure of variation within a group but also for comparative
purposes among different groups. In particular, the second index we wish to construct is
simply (a modified) average of the DIFFERENCES between the individual observations in
a group and the group mean and is, therefore, an index of the VARIATION in that group.
A large index indicates a large amount of variation (differences) within a group and a small
index will indicate the existence of only small differences among the measurements. In this
lab you will assess the variation WITHIN each group of measurements and also compare
the variability BETWEEN two groups.
The goal now is to calculate the index of the spread of the values around the mean,
i.e., the average of the differences between individual observation measurements and the
mean. First of all calculate the difference between each observation value and the mean:
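Measurements      Difference
$Y_1$             $(Y_1 - \bar{Y})$
$Y_2$             $(Y_2 - \bar{Y})$
$Y_3$             $(Y_3 - \bar{Y})$
...               ...
$Y_i$             $(Y_i - \bar{Y})$
...               ...
$Y_n$             $(Y_n - \bar{Y})$

Average = Mean = $\bar{Y} = \dfrac{\sum Y}{n}$
Average of the differences = $\dfrac{\sum (Y - \bar{Y})}{n}$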
Taking the average of the differences above will apparently give us our desired index of the
average of the differences, and it first appears that we are finished. However, since some of the
values are greater than the mean, some of the differences will be positive values, while some
of the differences will be negative values, since some of the values are less than the mean.
This means that our original calculated average of the differences above will ALWAYS be
equal to zero, given the inherent nature of the mean being in the middle of the group and
the positive and negative differences ALWAYS canceling out each other. Having a zero as
our index for all cases is not helpful! However, our original logic of calculating an average
of the distances was correct, so now we need to calculate the average of the differences after
first ignoring whether or not the difference is positive or negative, i.e., ignoring whether or
not the value is either smaller than or larger than the mean. We are really only interested
in the magnitude of the difference away from the mean, not its direction. The historical
way of solving this problem was simply to take the absolute value of each difference (e.g.,
|+5| = +5 and |-5| = +5) and then to take the average of those resulting positive absolute
values. That did in fact give the desired, useful index where a small average of the
differences indicated that the values are close to the mean and a large average of the
differences indicated that the observation values are spread farther out from the mean.
However, as the field of statistics grew, more powerful, but complicated, procedures were
introduced, and using this particular index for variation using absolute values proved to be
very cumbersome. Therefore, an alternative solution to the "getting rid of the signs"
problem was adopted that gives comparable results for a measure of the average
difference, i.e., a measure of the spread of the values around the mean. Instead of taking
the absolute value to get rid of the sign, each difference is squared, and the result is always
positive. So, now we have:
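Measurements      Difference              Difference Squared
$Y_1$             $(Y_1 - \bar{Y})$       $(Y_1 - \bar{Y})^2$
$Y_2$             $(Y_2 - \bar{Y})$       $(Y_2 - \bar{Y})^2$
$Y_3$             $(Y_3 - \bar{Y})$       $(Y_3 - \bar{Y})^2$
...               ...                     ...
$Y_i$             $(Y_i - \bar{Y})$       $(Y_i - \bar{Y})^2$
...               ...                     ...
$Y_n$             $(Y_n - \bar{Y})$       $(Y_n - \bar{Y})^2$

Averages:
$\bar{Y} = \dfrac{\sum Y}{n}$      $\dfrac{\sum (Y - \bar{Y})}{n}$      $\dfrac{\sum (Y - \bar{Y})^2}{n}$

Taking the average of the squared differences gives us $\dfrac{\sum (Y - \bar{Y})^2}{n}$.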
However, our original goal was the average of the differences (not squared), not the
average of the squared differences. Therefore, in order to account for squaring each
difference (to get rid of the sign), we now must take the square root of our calculated
average of the squared differences. We have as our answer our desired index of the spread
of the values around the mean as
Average Difference = $\sqrt{\dfrac{\sum (Y - \bar{Y})^2}{n}}$, where $\bar{Y}$ = Average = Mean.
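The chain of reasoning above (signed differences cancel, absolute values were the historical fix, squaring followed by a square root is the adopted fix) can be traced numerically. The following is a minimal sketch using hypothetical values; the variable names are illustrative assumptions.

```python
import math

# Hypothetical measurements (illustrative values only).
values = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(values)
mean = sum(values) / n

# Signed differences from the mean: positives and negatives cancel out.
differences = [y - mean for y in values]
average_signed = sum(differences) / n

# Historical fix: average the absolute values of the differences.
average_absolute = sum(abs(d) for d in differences) / n

# Adopted fix: square each difference, average, then take the square root.
average_squared = sum(d ** 2 for d in differences) / n
average_difference = math.sqrt(average_squared)

print(average_signed, average_absolute, average_squared, average_difference)
```

With these values the mean is 5.0 and the differences are -3, -1, -1, +1, and +4, so the signed average is zero, the average absolute difference is 2.0, and the average squared difference is 5.6. The two indexes (2.0 and the square root of 5.6) are comparable in size but not identical.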
This index gives us the information we need as a measure of the spread of the values within
a group around the mean, i.e., a measure of variation. If the index is small, it will mean that
the values are close to the mean; if the index is large, it will mean, in general, that
the observation values are spread farther out from the mean. Another term for
"difference" is "deviation," so the Average Difference may also be called the Average
Deviation. Another term often used for "average" is "standard," so the Average Difference
is more often called the "Standard Deviation." THE STANDARD DEVIATION IS JUST
THE AVERAGE DIFFERENCE BETWEEN INDIVIDUAL OBSERVATION
MEASUREMENTS AND THE MEAN. Note that the formula above for the standard
deviation is not just a mysterious formula but a shorthand presentation of a logical way to
achieve a meaningful measure of variation. The average of the squared differences is also a
legitimate index of variation, since as the index goes up this also means that the spread of
the values around the mean is greater. The average of the squared differences is usually
called the "Variance."
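As an illustration of why the mean alone is not enough, the sketch below compares two hypothetical groups that share the same mean but differ in spread. It uses Python's standard `statistics` module, whose `pvariance` and `pstdev` functions divide by n, matching the whole-group formulas described so far. The data are invented for illustration.

```python
import statistics

# Two hypothetical groups with the same mean but different spread.
group_a = [2.0, 4.0, 6.0, 8.0, 10.0]
group_b = [5.0, 5.5, 6.0, 6.5, 7.0]

for name, group in (("A", group_a), ("B", group_b)):
    mean = statistics.mean(group)
    variance = statistics.pvariance(group)   # average of the squared differences
    std_dev = statistics.pstdev(group)       # square root of the variance
    print(name, mean, variance, std_dev)
```

Both groups have a mean of 6.0, but the variance of group A (8.0) is much larger than that of group B (0.5), which reflects the wider spread of its values.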
The calculation of the standard deviation above assumes that ALL of the subjects of
interest within a group are available and have been measured. This is rarely the case. Most
of the time in biological research only a small sample of a larger group of interest is
actually measured. The assumption is then made that the sample accurately reflects the
characteristics of the larger group. Any conclusions drawn from the sample are only as
accurate as the accuracy of the sample representing the larger group. Although the
equation above is accurate for calculating the standard deviation for an entire group, it is
not an accurate representation of variation for the entire group if calculated from a smaller
sample. Since the calculation above is "biased" (systematically distorted) when calculated
from a sample, a correction factor is needed when estimating the true standard deviation of
a larger group from a smaller sample. Mathematical statisticians have shown that instead
of dividing by the number of cases n, dividing by n - 1 will provide a more accurate
ESTIMATE of the true standard deviation when calculated from a sample
from the larger group. (Strictly speaking, it is the variance calculated with n - 1 that is
"unbiased"; its square root is the standard deviation estimate in common use.) Therefore,
the following formula for standard deviation is the one
most often used and the one we will use during this lab.
Standard Deviation = $\sqrt{\dfrac{\sum (Y - \bar{Y})^2}{n - 1}}$
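The effect of dividing by n - 1 instead of n can be checked with a short sketch. The function name `sample_standard_deviation` and the sample values are assumptions for illustration; the result is compared against `statistics.stdev` (which divides by n - 1) and `statistics.pstdev` (which divides by n) from Python's standard library.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Standard deviation estimated from a sample, dividing by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(values) / n
    sum_of_squares = sum((y - mean) ** 2 for y in values)
    return math.sqrt(sum_of_squares / (n - 1))


# Hypothetical sample (illustrative values only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(sample_standard_deviation(sample))  # divides by n - 1
print(statistics.stdev(sample))           # library version, also n - 1
print(statistics.pstdev(sample))          # divides by n, always a bit smaller
```

For any sample with some variation, the n - 1 version is slightly larger than the n version, and the gap shrinks as the sample size grows.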
In this lab you will assess the variation WITHIN each group of measurements and also
compare the variability BETWEEN two groups.
Lab Procedure
Unless otherwise instructed by your lab professor, follow the procedure below.
Work in groups of three or four students at a lab table. Having half the class (e.g., students
at three lab tables) acquire data for one group (e.g., men) and the other half of the class
acquire data for the other group (e.g., women) works well. Leave the lab room and go out
to the Oak Grove (or other nearby place on campus) to select twelve (12) subjects
(students) to measure the chosen characteristic in each of the two groups of interest. Each
table is responsible for acquiring four data points, so that the class will have a total of 24
measurements. Record the measurements for one group (e.g., men) in the first column (1)
in Table 1 and record the information for the other group (e.g., women) in the first column
of Table 2. (Numbers in parentheses refer to the steps outlined in the Tables.) It will be
helpful to first list the 12 measurements in each group on the chalkboard before copying
this information into the tables, so that everyone will be working with the same data in the
same order. In this way finding errors in upcoming calculations will be easier. Note that the
measurements are not all the same, but are different from each other, indicating variation
in the measurements. The amount of this variation or variability within your sampled
group can be measured, as well as the difference that may exist between the two groups.
Next, add the measurements in each group together (2) and divide by the number of
measurements you have (n = 12) to give you the average (3) of the measured characteristic
for that group. Unless otherwise told by your lab professor, for steps (1) through (4) round
off your values to one decimal place, e.g., 7.6; for steps (5) through (8), however, round off to
two decimal places, e.g., 1.21. After EVERY calculation always ask yourself the question:
Does the answer make sense? For example, is the average of the measurements close to
your intuitive guess of the middle of the recorded measurements? If not, then look for a
possible error in your calculations.
Since we are interested in differences as a measure of the variability among
individuals, calculate the difference $(Y - \bar{Y})$ between each individual measurement $Y$ (1) and
the mean $\bar{Y}$ (3) and record these differences in column (4) of the Table. These differences
are also called deviations from the mean. Note that some of these numbers are negative.
Although our objective is to obtain an average of the differences we cannot just
simply calculate this average in the usual way, because, recall, the sum of the differences
will always be zero (because some of the measurements are less than the mean and some
are more than the mean, i.e., because of the changes in sign). This would mean that the
average of the differences would also always be zero, and therefore, this would be a totally
useless index to compare different groups. To account for this fact and to get rid of the
negative signs, the deviations are squared. Square each one of the deviations (4) and record
these in the third column (5).
Add the squared deviations (5) together to give you the entry for step (6). This sum
of the squared deviations is sometimes referred to as simply the "sum of squares."
The group of individuals we measured is a sample from a larger population (e.g., all
students at IUP) and this sample is assumed to have basically the same characteristics as
the larger population. In order to get the best estimate of the spread of values (i.e., the
variation) around the mean of the larger population from our small sample, recall that
mathematical statisticians have shown that the sum of the squared deviations must be
divided by the number of observations minus one (n - 1), instead of just by the number of
observations (n). (Dividing by n will indeed give us the average of the squared deviations
for the sample; however, we wish our estimate to be generalized to the larger group.) For
step (7) in the Table, divide the sum of the squared deviations (6) by n - 1 (i.e., divide by 11
in our case). This value is also a legitimate index of the variability in the group and is
known as the variance.
To complete the calculation of the standard deviation (8) take the square root of the
calculated variance in step (7) and record this in the Table. The standard deviation will
make more intuitive sense to you if you think of it as an average deviation or the average
difference between the observed values and the mean of the sample. Either the standard
deviation or the variance may be used as the index of the variability of the measured
characteristic.
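Steps (1) through (8) can be sketched as a single function that mirrors the columns of the Table. The twelve heights below are hypothetical, and the function name `variation_table` is an assumption. Note one deliberate difference from the hand calculation: this sketch keeps full precision throughout and rounds only when printing, so its results may differ slightly from values obtained by rounding at each step.

```python
import math


def variation_table(measurements):
    """Follow steps (1)-(8) of the lab Table for one group of measurements."""
    n = len(measurements)                             # (1) the recorded values
    total = sum(measurements)                         # (2) sum of measurements
    mean = total / n                                  # (3) average
    deviations = [y - mean for y in measurements]     # (4) differences from mean
    squared = [d ** 2 for d in deviations]            # (5) squared deviations
    sum_of_squares = sum(squared)                     # (6) sum of squares
    variance = sum_of_squares / (n - 1)               # (7) variance
    standard_deviation = math.sqrt(variance)          # (8) standard deviation
    return {
        "n": n,
        "mean": mean,
        "deviations": deviations,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "standard_deviation": standard_deviation,
    }


# Hypothetical heights in cm for one group of 12 subjects.
heights_cm = [160, 162, 165, 167, 168, 170, 171, 173, 175, 176, 178, 181]

result = variation_table(heights_cm)
print(round(result["mean"], 1))
print(round(result["variance"], 2))
print(round(result["standard_deviation"], 2))
```

Running the same function on the second group's measurements gives the values needed for Table 2.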
Go through the same procedure with the measurements from the second group
recorded in Table 2 to obtain the mean, variance, and standard deviation for the chosen
characteristic for this second group.
Using the information from the two samples, you will now be able to more formally
compare the difference observed between the two groups. Certainly the means and
variances that you have calculated from the two groups will vary some; the
question, however, is whether or not the observed differences in the means and variances
between the two groups are due to the chance sampling of unrepresentative individuals for
that particular characteristic or, in fact, indicate a real difference between the two larger
groups from which the smaller samples were drawn. Small observed differences could exist
between the two smaller samples even if the true means of the two populations are the
same. The standard statistical test called the "t test" will allow us to draw conclusions
about a possible, real difference between the two larger populations using the means of the
two small samples. Larger sample sizes would give us more accurate results about the
characteristics of the larger population.
In order to calculate the t test statistic, we first need to calculate the standard error
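of the estimate of differences. (The standard error is a standard deviation of an estimate,
e.g., a mean.) Information from both groups is used to calculate the needed standard
error:
S.E. = $\sqrt{\dfrac{\text{Variance}_{\text{first group}} + \text{Variance}_{\text{second group}}}{n}}$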
When the sample sizes for both groups are 12, the value of n in the formula above is equal
to 12, not 24. Using the information from both groups, calculate the standard error for our
study:
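A minimal sketch of the standard error calculation follows. It assumes the usual equal-sample-size form, S.E. = sqrt((variance of first group + variance of second group) / n), which is consistent with the statement that n is 12 rather than 24. The function name and the variance values are hypothetical.

```python
import math


def standard_error(variance1, variance2, n):
    """Standard error of the difference between two means.

    Assumes both groups have the same sample size n (the per-group size).
    """
    return math.sqrt((variance1 + variance2) / n)


# Hypothetical variances for two groups of 12 subjects each.
se = standard_error(6.0, 6.0, 12)
print(se)
```

With variances of 6.0 in each group and n = 12, the quantity under the square root is 12 / 12 = 1, so the standard error is 1.0.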
The t test statistic is the appropriate decision maker to decide if the means of the
chosen characteristic in the two populations are indeed really different from each other.
$t = \dfrac{\bar{Y}_{\text{first group}} - \bar{Y}_{\text{second group}}}{\text{S.E.}}$
Based on the information in the two samples you have, calculate the t test statistic to
compare the means of the two groups:
If the value of the calculated t above is greater than 2.07 or less than - 2.07, then our
conclusion is that there is probably a real, "significant" difference between the means of
the two larger groups from which the samples came. If the value of the calculated t falls
between - 2.07 and + 2.07, then the observed difference between the two means is probably
just due to chance sampling and there is probably not a real difference between the two
groups. (This could also mean that our sample sizes are too small to pick up a possible real,
but small, difference between the means of the two populations. The critical value of 2.07 is
determined by the sample size of 12 in each of the two groups. Furthermore, the form of
the t test above is appropriate for the conditions of this study; however, more complex
forms would be necessary under other conditions.) What is your conclusion about the
difference in our case of the means of the chosen characteristic between the two groups?
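The t test and the decision rule can be sketched as follows. This assumes equal sample sizes and the standard error form sqrt((variance1 + variance2) / n); the function names and both data sets are hypothetical. The critical value 2.07 is the one the lab gives for two groups of 12 and would not apply to other sample sizes.

```python
import math
import statistics

CRITICAL_T = 2.07  # value given in the lab for two groups of 12 subjects


def t_statistic(group1, group2):
    """t = (mean of first group - mean of second group) / S.E."""
    if len(group1) != len(group2):
        raise ValueError("this form of the t test assumes equal sample sizes")
    n = len(group1)
    # statistics.variance divides by n - 1, as in step (7) of the Table.
    se = math.sqrt((statistics.variance(group1) + statistics.variance(group2)) / n)
    return (statistics.mean(group1) - statistics.mean(group2)) / se


def is_significant(t, critical=CRITICAL_T):
    """True if t falls outside the range -critical to +critical."""
    return t > critical or t < -critical


# Hypothetical heights in cm, 12 subjects per group.
first_group = [170, 172, 174, 175, 176, 178, 179, 180, 181, 183, 185, 188]
second_group = [158, 160, 161, 163, 164, 165, 166, 168, 169, 170, 172, 175]

t = t_statistic(first_group, second_group)
print(round(t, 2), is_significant(t))
```

Swapping the order of the two groups changes only the sign of t, which is why the decision rule checks both tails.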
The t test determines whether or not there is a difference between the means of the
two groups. The variance, however, is an index of variability WITHIN a group and this
variability may be equal to or not equal to the variability within another group, regardless
of whether or not their means are the same. To test whether or not the variability WITHIN
the first group is different from the variability WITHIN the second group, compare the
calculated variances of each group using the standard F test:
$F = \dfrac{\text{Highest variance of the two groups}}{\text{Lowest variance of the two groups}}$
Calculate the F test statistic for the variances found in the two samples:
If the value of the calculated F above is greater than 2.81, then our conclusion is that there
is a real, significant difference between the variability found WITHIN each group with
respect to the chosen characteristic. If the calculated value is less than 2.81, then the
observed difference in the variances is probably due to chance sampling and there is
probably no real difference between the two groups with respect to the variability around
their respective means.
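The F test comparison can be sketched in the same way. The function names and the variance values are hypothetical, and the critical value 2.81 is the one the lab gives for two groups of 12.

```python
CRITICAL_F = 2.81  # value given in the lab for two groups of 12 subjects


def f_statistic(variance1, variance2):
    """Ratio of the higher variance to the lower variance."""
    high = max(variance1, variance2)
    low = min(variance1, variance2)
    if low <= 0:
        raise ValueError("variances must be positive")
    return high / low


def variability_differs(f, critical=CRITICAL_F):
    """True if the calculated F exceeds the critical value."""
    return f > critical


# Hypothetical variances from the two samples.
f = f_statistic(2.0, 6.0)
print(f, variability_differs(f))
```

Because the larger variance is always placed in the numerator, F is never less than 1, and the order in which the two groups are supplied does not matter.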
Table 1 for one group:
Table 2 for the second group: