Chi-square analysis

CHI-SQUARE EXERCISE
There are many methods of statistically testing scientific data--each test applicable under different
conditions. The Chi-square (X2) test, a measure of the discrepancy between the observed results
and some hypothetically expected results, is well suited to the field of genetics. You will use this
test in the analysis of your data.
Before you actually apply the test, however, you must understand certain basics of statistical
testing. We all know by simple observation that nature is variable. When Mendel counted his unit
characters, he did not find exact 3:1 or 9:3:3:1 ratios. He obtained very close approximations to
the ratios, but values which varied slightly from the theoretical ones. The variability he saw can be
accounted for by mistakes in counting or the natural variability in living systems. The question we
want to answer is, "When is the deviation from the expected value due to chance alone, and when
is it due to an incorrect hypothesis or inappropriate testing of the hypothesis?"
How much deviation from expected values will you accept before you start to question the
accuracy of your underlying assumptions or the reliability of your sampling (testing)
technique? This question can be answered only in an arbitrary manner. Suppose that every day at
lunch you and a friend toss a coin to determine who will buy the coffee. Assuming your friend and
the coin are honest, you expect to buy the coffee about 50% of the time. If you find that you are
buying coffee 55% of the time, you may not be bothered too much, but simply consider it bad luck.
If, however, you find you are buying coffee 60, 70 or 80% of the time you might begin to wonder
about the honesty of your friend, the coin, or both.
But when would you begin to wonder? By convention the cutoff point has been taken at the 0.05
(5%) level of probability. That is, if the probability of obtaining your result is one in twenty or
less, the deviation you observed from the expected value is generally considered to be too great to
have occurred by chance alone.
What should also be apparent at this point is that the rejection or acceptance of an observed ratio is
highly dependent on the number of events you have counted. Reconsider tossing the coin for
coffee. If you have performed this exercise ten times and have only won three times, you may not
be too concerned (although you certainly couldn't be sure the coin was fair). However, if you have
played 100 times and won only 30 times (still a 7:3 ratio), you might have cause for alarm. In
other words, sample size (or number of repetitions) can and should affect your decision.
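To put rough numbers on the coffee example, the chance of winning so rarely with a fair coin can be computed from the binomial distribution. This is only an illustrative sketch: the function name `prob_at_most` is invented here, and the calculation assumes independent tosses of a fair coin.

```python
from math import comb

def prob_at_most(wins, tosses, p=0.5):
    """Probability of winning `wins` times or fewer in `tosses` independent tosses."""
    return sum(comb(tosses, k) * p**k * (1 - p)**(tosses - k)
               for k in range(wins + 1))

small = prob_at_most(3, 10)     # 3 wins in 10 tosses
large = prob_at_most(30, 100)   # 30 wins in 100 tosses, the same 7:3 ratio

print(f"3 or fewer wins in 10 tosses:   {small:.3f}")
print(f"30 or fewer wins in 100 tosses: {large:.6f}")
```

The first probability is about 0.17, well above the 0.05 cutoff, while the second is far below it, even though the ratio of losses to wins is the same in both cases.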
How then do you carry out the Chi Square test? The mathematical process involved is given in the
formula:
$$ X^2 = \sum \frac{(O - E)^2}{E} $$
where O = the observed value
and E = the expected value
For each class of observations, calculate the deviation of the observed value from the expected
value. The deviation is then squared and divided by the expected value. Then all of the values are
summed. It should be immediately obvious that the smaller the differences between the observed
and the expected values, the smaller the value of X2 will be and, conversely, the greater the
difference the greater the value of X2.
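The calculation just described can be sketched in a few lines of Python. The function name `chi_square` is an assumption made for this illustration, and the counts are the coin-toss figures used earlier (30 wins in 100 tosses against an expected 50:50 split).

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over all classes of observations."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one value per class")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 30 wins and 70 losses in 100 tosses, compared with the expected 50:50 split.
unlucky = chi_square([30, 70], [50, 50])

# A result that matches the expectation exactly gives zero.
perfect = chi_square([50, 50], [50, 50])

print(unlucky)   # 16.0
print(perfect)   # 0.0
```

The larger the gap between observed and expected counts, the larger the result, as the text states.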
In order to understand and use the Chi-square test properly, you should know the meaning of null
hypothesis (Ho), degrees of freedom (df) and probability (P). A null hypothesis is simply a
general description of what we expect to happen according to a standard hypothesis; e.g., you
would expect to get a 1:1 ratio of heads to tails if you tossed a coin 1000 times. Strictly stated, the
“null hypothesis” is the hypothesis that the difference between the observed results and the
predicted results could arise by chance, rather than by some specific process. There is always one
degree of freedom fewer than the total number of classes into which the data fall. You can have
heads or tails, so there are two classes and df = 2 - 1 = 1. Why is this so? If you have a shoe in
each hand, you can drop the shoe in the left hand or the right hand--you have a choice. But, after
you drop one shoe, you have no choice as to which shoe to drop next. In this case, as with the
coins, you have one (1) degree of freedom. Probability (P) is the percent of the time that a
deviation at least as large as the observed one would occur by chance in a situation where the
null hypothesis is really true.
Perhaps all of the above can be made more obvious by using examples.
EXAMPLE 1
Let us examine Mendel's F2 data for the Round/wrinkled and Yellow/green dihybrid cross. He
counted a total of 556 peas with this observed ratio: 315:108:101:32. These values can be seen in
Row 1 of Table 1. We will use this table to test Mendel's data by the X2 method.
Now we must find what the expected values are. Based on Mendel's second law we expect a
9:3:3:1 ratio or 9/16 of the total (313 out of 556) to fall into the Round Yellow class; 3/16 or
104/556 should be Round green; 3/16 or 104/556 should be wrinkled Yellow; and finally 1/16 or
35/556 should be wrinkled green. These calculated values are put in Row 2 of Table 1. We now
have the values Mendel observed and the expected values based on our null hypothesis. In Row 3
we subtract the expected values from the observed values to determine the deviation between
them. You can ignore the sign of these values because in Row 4 they will be squared. In Row 5
each one of the squared values is divided by its respective expected value from Row 2.
Finally all of the Row 5 values are summed in the lower right hand corner of the table. This value
is X2. In this particular instance it is equal to 0.510. Now you have to decide whether this X2
value could have occurred easily by chance or not. Fortunately, the chances of getting various
values of X2 by chance have been calculated by mathematicians--see Table 3.
However, before you can check your value of X2, you must determine the degrees of freedom (df)
involved. In Mendel's F2 data there are four classes. Therefore, there must be three (3) degrees of
freedom--df = 4 - 1 = 3.
Table 1
Phenotype of Seeds      Round    Round    wrinkled  wrinkled
                        Yellow   green    Yellow    green      Total
=====================================================================
Observed numbers        315      108      101       32         556
Expected numbers        313      104      104       35         556
Deviation (O-E)         2        4        3         3
Deviation squared
  (O-E)^2               4        16       9         9
(O-E)^2 / E             0.013    0.154    0.086     0.257      X2 = 0.510
The probability (P) that the difference between the observed and expected values can be
accounted for by chance alone can now be determined, since df and X2 are known. Now look at
Table 3 and find the appropriate df (3) and determine P for this X2 value (0.510). Since 0.58 is
closest to 0.510, the probability of obtaining a deviation this large by chance if a 9:3:3:1 process
is actually operating is close to 90% (or 9 in 10). In other words, about 90% of experiments of this
size would deviate from the expected values at least this much through chance alone.
An accepted practice is to reject a null hypothesis if the probability of the results occurring by
chance is 0.05 (5%) or less. Otherwise, all we can say is that we failed to reject the null
hypothesis. A more sensitive experiment might allow you to reject the null hypothesis, but for
now you aren't able to do so. The important point is that statistics can be used to reject a null
hypothesis but never to prove one. However in this case the data give us no reason to reject (or
even question) the null hypothesis that Mendel's data should fall in a 9:3:3:1 ratio.
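The arithmetic of Example 1 can be checked with a short Python sketch. The variable names are chosen for this illustration only. The first calculation uses the whole-number expected values from Table 1; the second uses the unrounded sixteenths of 556, which is an added comparison and not part of the original exercise.

```python
observed = [315, 108, 101, 32]          # Row 1 of Table 1
expected_table = [313, 104, 104, 35]    # Row 2 of Table 1, rounded to whole seeds

# Unrounded expectations: 9/16, 3/16, 3/16 and 1/16 of 556 seeds.
expected_exact = [556 * r / 16 for r in (9, 3, 3, 1)]

def chi_square(obs, exp):
    """Sum (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

chi_table = chi_square(observed, expected_table)
chi_exact = chi_square(observed, expected_exact)
df = len(observed) - 1

print(f"X2 with rounded expected values:   {chi_table:.3f}")   # 0.510
print(f"X2 with unrounded expected values: {chi_exact:.3f}")   # 0.470
print(f"degrees of freedom: {df}")                             # 3
```

Both values lie below 0.58, the smallest entry for df = 3 in Table 3, so the conclusion is the same either way.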
EXAMPLE 2
Al Blop received 800 F2 tulip seeds from N. T. Careful, which he planted in a nice warm (37°C),
moist place. Of the 800, only 652 germinated, grew and eventually produced flowers. Careful had
told Blop that the parental types were Red flowers/Smooth petal edges and yellow flowers/ruffled
petal edges--Red dominant over yellow and Smooth over ruffled--so he could expect a 9:3:3:1
ratio in the F2's he was planting. Blop, however, got the following results:
440 Red/Smooth; 50 Red/ruffled; 147 yellow/Smooth; 15 yellow/ruffled
He checked the null hypothesis (that he should have a 9:3:3:1 ratio) by using the X2 test.
Blop found X2 to be 78.6 (see Table 2).
Table 2
Phenotype of Seeds      Red      Red      yellow    yellow
                        Smooth   ruffled  Smooth    ruffled    Total
=====================================================================
Observed numbers        440      50       147       15         652
Expected numbers        367      122      122       41         652
Deviation (O-E)         73       72       25        26
Deviation squared
  (O-E)^2               5329     5184     625       676
(O-E)^2 / E             14.5     42.5     5.1       16.5       X2 = 78.6
Since there are 4 classes of data, the df is 3 (df = 4 - 1). Looking at Table 3, we can see that
P < 0.001. This means that the probability of his results occurring by chance is much less than
0.001 and, therefore, the null hypothesis should be rejected.
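Blop's calculation can be reproduced with the sketch below. The variable names are illustrative, and the expected values are rounded to whole plants to match Table 2. The comparison value 16.27 is the entry for df = 3 and P = .001 in Table 3.

```python
total = 652
ratio = [9, 3, 3, 1]                 # Red/Smooth, Red/ruffled, yellow/Smooth, yellow/ruffled
observed = [440, 50, 147, 15]

# Expected counts under the 9:3:3:1 null hypothesis, rounded to whole plants.
expected = [round(total * r / sum(ratio)) for r in ratio]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

print(expected)              # [367, 122, 122, 41]
print(f"{chi_sq:.1f}")       # 78.6
print(chi_sq > 16.27)        # True, so P < 0.001 for df = 3
```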
But--Blop wondered whether the 9:3:3:1 hypothesis was incorrect or whether some other factor
was at work. While carefully studying his results he realized that if there had been 800 total plants
(rather than the 652 he had), he would have expected the smooth-petaled plants to be about the
numbers that he actually obtained, but there would be too few with ruffled petals. (Try this.)
Could the ruffled plants be unhealthy?
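One way to try Blop's check is sketched below. It assumes, as Blop did, that all 800 planted seeds count toward the expected numbers; the dictionary names are made up for this illustration.

```python
planted = 800
ratio = {"Red/Smooth": 9, "Red/ruffled": 3, "yellow/Smooth": 3, "yellow/ruffled": 1}
observed = {"Red/Smooth": 440, "Red/ruffled": 50, "yellow/Smooth": 147, "yellow/ruffled": 15}

parts = sum(ratio.values())   # 16 parts in a 9:3:3:1 ratio
expected = {name: planted * share / parts for name, share in ratio.items()}

for name in ratio:
    print(f"{name:15s} expected {expected[name]:5.0f}   observed {observed[name]:4d}")
```

The two Smooth classes (450 and 150 expected) are close to what Blop counted, while the two ruffled classes (150 and 50 expected) fall far short.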
Blop wrote Careful about his results and concerns. Careful telephoned to say that the tulip seeds
should have been grown at 25°C since the ones with ruffled edges are temperature sensitive and
do not germinate well at 37°C.
Thus the apparent need to reject the null hypothesis was explained. There was something other
than random chance that made the observed results differ from those expected from a 9:3:3:1 ratio.
The use of X2 gave Blop the incentive to ask questions about his data, and to seek answers to those
questions.
In summary, to determine and use the Probability Value:
1. Determine df.
2. Determine X2.
3. In Table 3, find the appropriate df and follow the numbers horizontally, matching your
calculated X2 as closely as possible to the value given on that line.
4. Read the probability at the top of that column.
5. Reject the null hypothesis if the probability is equal to or less than 0.05 (5%) or fail to reject
the null hypothesis if the probability is greater than 0.05.
Table 3. CHI-SQUARE VALUES
                              Probabilities (P)
___________________________________________________________________
df    .90    .70    .50    .30    .20    .10    .05    .01    .001
___________________________________________________________________
 1    .016   .15    .46    1.07   1.64   2.71   3.84   6.64   10.83
 2    .21    .71    1.39   2.41   3.22   4.61   5.99   9.21   13.82
 3    .58    1.42   2.37   3.67   4.64   6.25   7.82   11.35  16.27
 4    1.06   2.20   3.36   4.88   5.99   7.78   9.49   13.28  18.47
 5    1.61   3.00   4.35   6.06   7.29   9.24   11.07  15.09  20.52
 6    2.20   3.83   5.35   7.23   8.56   10.65  12.59  16.81  22.46
 7    2.83   4.67   6.35   8.38   9.80   12.02  14.07  18.48  24.32
 8    3.49   5.53   7.34   9.52   11.03  13.36  15.51  20.09  26.13
 9    4.17   6.39   8.34   10.66  12.24  14.68  16.92  21.67  27.88
10    4.87   7.27   9.34   11.78  13.44  15.99  18.31  23.21  29.59
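The table lookup can also be sketched in Python. Note one deliberate difference from the steps above: instead of matching the closest entry, this sketch brackets P between the two neighboring columns, which avoids ambiguity near the 0.05 cutoff. The names `p_bracket` and `reject_null` are invented for this illustration; the numbers are copied from Table 3.

```python
P_LEVELS = [0.90, 0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.01, 0.001]
TABLE_3 = {
    1:  [0.016, 0.15, 0.46, 1.07, 1.64, 2.71, 3.84, 6.64, 10.83],
    2:  [0.21, 0.71, 1.39, 2.41, 3.22, 4.61, 5.99, 9.21, 13.82],
    3:  [0.58, 1.42, 2.37, 3.67, 4.64, 6.25, 7.82, 11.35, 16.27],
    4:  [1.06, 2.20, 3.36, 4.88, 5.99, 7.78, 9.49, 13.28, 18.47],
    5:  [1.61, 3.00, 4.35, 6.06, 7.29, 9.24, 11.07, 15.09, 20.52],
    6:  [2.20, 3.83, 5.35, 7.23, 8.56, 10.65, 12.59, 16.81, 22.46],
    7:  [2.83, 4.67, 6.35, 8.38, 9.80, 12.02, 14.07, 18.48, 24.32],
    8:  [3.49, 5.53, 7.34, 9.52, 11.03, 13.36, 15.51, 20.09, 26.13],
    9:  [4.17, 6.39, 8.34, 10.66, 12.24, 14.68, 16.92, 21.67, 27.88],
    10: [4.87, 7.27, 9.34, 11.78, 13.44, 15.99, 18.31, 23.21, 29.59],
}

def p_bracket(chi_sq, df):
    """Return (low, high) bounds on P from Table 3; None means off the table."""
    row = TABLE_3[df]
    passed = sum(1 for value in row if chi_sq >= value)   # columns reached
    high = P_LEVELS[passed - 1] if passed > 0 else None   # P is at most this
    low = P_LEVELS[passed] if passed < len(row) else None # P is more than this
    return low, high

def reject_null(chi_sq, df):
    """Reject when X2 reaches the .05 column, i.e. P is 0.05 or less."""
    return chi_sq >= TABLE_3[df][P_LEVELS.index(0.05)]

print(p_bracket(0.510, 3), reject_null(0.510, 3))   # (0.9, None) False
print(p_bracket(78.6, 3), reject_null(78.6, 3))     # (None, 0.001) True
```

For Example 1 the result reads as P greater than 0.90; for Example 2 it reads as P less than 0.001.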
This test should never be used on data expressed as percentages or ratios, nor should it be
used when the expected values in one or more classes fall below five.
The following exercise is designed to familiarize you with Chi-square analysis and to show you
how the total sample size can affect your calculated X2 value.
The ear of corn shown above has seeds of two different colors; therefore, you have a case of
phenotypic segregation. If you make the assumption that the "parents" of this ear of corn were
heterozygous for a single trait, seed color, you can test this hypothesis using X2. Of course, to test
this hypothesis you must count the kernels on the ear. Before you begin, look at the ear to
determine whether there are more dark or light kernels.
Which? _____________ The trait in the majority should be the dominant one. Once you have
determined the dominant trait, go through the graduated series of countings and tests described on
the next page. This exercise should demonstrate the importance of a large sample size in all
genetic testing.
Count the number of kernels in a single row on your ear and classify them as to color. Use the
dominance relationship determined previously and assume a 3:1 ratio from your hypothesis as to
the type of cross which gave rise to this ear. Test your data using X2 and the table below.
Table for Calculating Chi-Square (X2)---One Row
__________________________________________________________________
Phenotype                  __________   __________   Total
=========================================================
Observed numbers           __________   __________   __________
Expected numbers           __________   __________   __________
Deviations                 __________   __________
Deviations squared         __________   __________
Deviations squared,
divided by expected
numbers                    __________   __________
X2 = __________
Probability = __________
If you get a probability of 5% or less, you may reject the null hypothesis. If, however, you get a
probability greater than 5%, you cannot reject the null hypothesis; i.e., by convention your data are
considered not to be inconsistent with the expected 3:1 ratio. Because you got that result, and
because lots of other scientists have obtained it in the past, we tend to believe the 3:1 theory.
Do your data fit a 3:1 ratio?____________________________________
Now count three more rows and repeat the Chi-square test on the total for the four rows you have
counted. Then do the same for all the "kernels" found in all sixteen (16) rows.
Table for Calculating Chi-Square (X2)---Four Rows
__________________________________________________________________
Phenotype                  __________   __________   Total
=========================================================
Observed numbers           __________   __________   __________
Expected numbers           __________   __________   __________
Deviations                 __________   __________
Deviations squared         __________   __________
Deviations squared,
divided by expected
numbers                    __________   __________
X2 = __________
Probability = __________
Table for Calculating Chi-Square (X2)---Sixteen Rows
__________________________________________________________________
Phenotype                  __________   __________   Total
=========================================================
Observed numbers           __________   __________   __________
Expected numbers           __________   __________   __________
Deviations                 __________   __________
Deviations squared         __________   __________
Deviations squared,
divided by expected
numbers                    __________   __________
X2 = __________
Probability = __________
How do X2 and its associated probability vary as your sample size gets larger?
Remember that a bigger sample size should get you closer to the "truth," whether "truth" is
the null hypothesis or "truth" is some deviation from the null hypothesis.
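The effect of sample size can be illustrated with made-up numbers. In the sketch below the kernel counts are hypothetical (they are not data from any real ear of corn): each sample shows the same 70:30 split of dark to light kernels and is tested against a 3:1 expectation. The value 3.84 is the df = 1, P = .05 entry from Table 3.

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

CRITICAL_05_DF1 = 3.84   # Table 3: df = 1, P = .05

# Hypothetical counts (dark, light), all in a 70:30 proportion.
samples = [(14, 6), (56, 24), (224, 96)]

results = []
for dark, light in samples:
    total = dark + light
    expected = [total * 3 / 4, total * 1 / 4]   # 3:1 null hypothesis
    value = chi_square([dark, light], expected)
    results.append(value)
    verdict = "reject" if value >= CRITICAL_05_DF1 else "fail to reject"
    print(f"n = {total:3d}   X2 = {value:.2f}   {verdict}")
```

With the proportions held fixed, X2 grows in step with the sample size, so a departure from 3:1 that cannot be distinguished from chance in 20 or 80 kernels becomes significant at 320.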
Tables for Calculating Chi-Square (X2) for my Cross, which is ___________
My Data:
Total
Phenotypes:
Observed Numbers
Expected Numbers
Deviations
Deviations squared
Deviations squared,
divided by expected
numbers
X2 =
Probability =
Section’s Data:
Total
Phenotypes:
Observed Numbers
Expected Numbers
Deviations
Deviations squared
Deviations squared,
divided by expected
numbers
X2 =
Probability =
Class Data:
Total
Phenotypes:
Observed Numbers
Expected Numbers
Deviations
Deviations squared
Deviations squared,
divided by expected
numbers
X2 =
Probability =
How does the accumulation of larger numbers of offspring affect the evaluation of your null
hypothesis?
Do you reject or fail to reject your null hypothesis? Explain.