FISHER’S EXACT TEST FOR THE TWO-BY-TWO TABLE

Suppose that X ~ Bin(m, p1) and Y ~ Bin(n, p2). Further suppose that X and Y are
independent. The data layout can be summarized in this two-by-two table:
Group     Successes      Failures        Total
1         X = A          m - X = B       m = A + B
2         Y = C          n - Y = D       n = C + D
Total     A + C          B + D           N = A + B + C + D
The display is overlaid with the A-B-C-D notation. We’ll refer to this notation again
below.
We want to test the null hypothesis H0: p1 = p2. For the moment, we will be
intentionally vague about whether H1 is p1 > p2 or p1 < p2 or p1 ≠ p2. If m and n are
reasonably large, then we can use the chi-squared statistic

\[
\chi^2 \;=\; N \, \frac{(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}
\]
Here “reasonably large” means big enough to use the normal
approximation. This would seem to require that m ≥ 30 and
n ≥ 30. The approximation also requires that X not be too close to
0 or to m, and that Y not be too close to 0 or to n; in such
cases a Poisson approximation would be more appropriate.
There is now some question about how to proceed if m and/or n is small. Let’s use the
symbol T = X + Y as the total of the first column. In terms of the random variables, let’s
write the table as
Group     Successes      Failures        Total
1         X                              m
2         Y                              n
Total     T                              N
The joint distribution, under H0, of (X, Y) is

\[
f(x, y) \;=\; \binom{m}{x} p^x (1-p)^{m-x} \,\binom{n}{y} p^y (1-p)^{n-y}
\]
This uses p as the common value for p1 and p2. Let’s make the transformation
(X, Y) → (X, T)
Since these are all discrete random variables, we do not need to worry about the Jacobian
of a multivariable transformation. We need only replace y by t - x. The transformed
likelihood is
\[
f(x, t) \;=\; \binom{m}{x} p^x (1-p)^{m-x} \,\binom{n}{t-x} p^{t-x} (1-p)^{n-(t-x)}
\]
We could also be extra careful and note the ranges of the variables. The
likelihood above could carry the indicator product I(0 ≤ x ≤ m) I(0 ≤ y ≤ n). This
would be transformed to I(0 ≤ x ≤ m) I(0 ≤ t - x ≤ n). We’ll have a little more to
say about this below.
Since T = X + Y, its distribution under the null hypothesis would be binomial (m+n, p).
That is, the probability law of T is
\[
f(t) \;=\; \binom{m+n}{t} p^t (1-p)^{m+n-t}
\]
Let’s now find the conditional probability law of X, given T = t. This is
m x
m x
 x  p 1  p 
f ( x, t )
f(x | t) =
=  
f t 
m  n
 t 


=
 n  tx
n t  x
 t  x  p 1  p 


p t 1  p 
m  n t
m  n 
 x  t  x
  

m

n


 t 


It’s quite amazing that all the factors in p cancel.
This is the conditional probability law of X, given a total number of successes t. This
conditional law is hypergeometric: the number of successes in group 1 behaves like the
number of successes in m random draws from a set of m + n values, of which t are successes.
This hypergeometric forms the basis for Fisher’s exact test. No approximations are
needed! However, the hypergeometric distribution is a little “gritty” in the sense to be
noted below, and we may have trouble using it.
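For readers who like to check algebra numerically, here is a small sketch using scipy;
the values of m, n, p, and t are arbitrary illustrative choices, and the two printed
columns should agree.

```python
# A sketch checking the derivation numerically: conditioning two
# independent binomials on their total gives a hypergeometric law.
from scipy.stats import binom, hypergeom

m, n, p, t = 10, 12, 0.3, 10   # arbitrary illustrative values

for x in range(max(0, t - n), min(t, m) + 1):
    joint = binom.pmf(x, m, p) * binom.pmf(t - x, n, p)   # f(x, t)
    marginal = binom.pmf(t, m + n, p)                     # f(t)
    # scipy's hypergeom(M, n, N): population M, n marked items, N draws
    print(x, joint / marginal, hypergeom.pmf(x, m + n, m, t))
```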
Here is an example which shows Fisher’s exact test in action. Suppose that we want to
compare the success rates for two different brands of arthritis medication. Brand Q was
tested on 10 subjects, and brand R was tested on 12 subjects. Each subject was asked
whether the treatment provided any relief. The responses were these:
Brand     Successes     Failures     Total
Q         2             8            10
R         8             4            12
Total     10            12           22
The obvious null hypothesis is H0: pQ = pR. We want α = 0.05 as the level of
significance. We are tempted to use the chi-squared statistic, but the sample sizes are too
small to make the approximate distribution believable. Just for the record, the value
would be χ² ≈ 4.7911.
Suppose that the alternative hypothesis were H1: pQ > pR. That is, suppose that the
experiment was set up by someone expecting to show that brand Q was better. An
immediate look at the data shows that product Q was worse. There is no way that H1 has
any chance at all. With no serious computation, we simply accept H0 and move on.
Suppose that the alternative hypothesis were H1: pQ < pR, and that the experiment was set
up by someone expecting to show that brand R was better. Since p̂Q = 2/10 = 0.20 and
p̂R = 8/12 ≈ 0.67, the data suggest H1 and we must investigate whether this apparent
superior performance is distinguishable from mere chance. If the null hypothesis holds,
and if T = 10 is the total number of successes, then the conditional distribution of X, the
value in the upper left box, is hypergeometric with parameters (22; 10, 10). These three
parameters are (grand total; row total, column total). Thus
\[
P(X = x \mid T = 10) \;=\; \frac{\binom{10}{x}\binom{12}{10-x}}{\binom{22}{10}}
\;=\; \frac{\binom{\text{row total}}{x}\binom{\text{grand total} - \text{row total}}{\text{column total} - x}}
           {\binom{\text{grand total}}{\text{column total}}}
\]
We could invoke this reasoning for any of the four cells of the table. For this table, the
first row has a smaller total than the second row (10 vs 12), and the first column has a
smaller total than the second column (10 vs 12). It’s convenient to work with the cell
corresponding to the smaller of the two row totals and also to the smaller of the two
column totals. For this table, it’s the upper-left cell. If this strategy is used, the
conditional hypergeometric distribution will start at zero, and the clerical issues are a
little simpler.
Once t is fixed in the indicator product I(0 ≤ x ≤ m) I(0 ≤ t - x ≤ n), the range of
x’s is given by max{ 0, t - n } ≤ x ≤ min{ t, m }. It’s really easiest if the lower
limit is zero.
Here, with the help of Minitab, is this distribution:
 x      Prob       CumProb
 0    0.000102     0.00010
 1    0.003402     0.00350
 2    0.034447     0.03795
 3    0.146974     0.18492
 4    0.300071     0.48500
 5    0.308645     0.79364
 6    0.160753     0.95439
 7    0.040826     0.99522
 8    0.004593     0.99981
 9    0.000186     1.00000
10    0.000002     1.00000
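The same numbers can be reproduced without Minitab; a sketch using scipy’s
hypergeometric routines:

```python
# A sketch reproducing the table above with scipy in place of Minitab.
from scipy.stats import hypergeom

# Parameters (grand total; row total, column total) = (22; 10, 10).
dist = hypergeom(22, 10, 10)
for x in range(11):
    print(f"{x:2d}   {dist.pmf(x):.6f}   {dist.cdf(x):.5f}")
```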
The actual event was { X = 2 }. We define the p-value as P[ X ≤ 2 ] = 0.03795. Since
this is less than 0.05, we would reject H0 and conclude that brand R is better. The event
{ X ≤ 2 } would be described as “an outcome as extreme, or even more extreme, in
support of H1 than the outcome actually observed.”
If we had been asked, before seeing the data, to formulate a 5% rejection rule, that rule
would have to be “Reject H0 if X ≤ 2.” However, this would not truly be a 5% rule, as
the real probability of Type I error would be only 3.795%. The grittiness of a discrete
distribution prevents us from hitting the 5% target.
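For completeness, scipy also packages this one-tailed calculation; a minimal sketch,
entering the table with brands as rows as in the display above:

```python
# A sketch of the same one-tailed test via scipy's built-in routine.
from scipy.stats import fisher_exact

table = [[2, 8],   # brand Q: 2 successes, 8 failures
         [8, 4]]   # brand R: 8 successes, 4 failures

# alternative='less' matches H1: pQ < pR, and the p-value is the
# hypergeometric tail P[X <= 2] computed above.
oddsratio, p_value = fisher_exact(table, alternative='less')
print(p_value)  # about 0.03795
```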
Finally, suppose that the alternative hypothesis were H1: pQ ≠ pR. This would correspond
to an experiment set up with no particular prejudice. The null hypothesis distribution is
still the one above, but it’s not obvious how to compute a p-value. Here are two common
methods; both are sketched in code after this discussion.
(1)  Conduct the test as two separate one-tail tests, each at level α/2.
     Equivalently, double the one-tail p-value and compare to α. In our case,
     this would be 2 × 0.03795 = 0.0759. As this is bigger than 0.05, we would
     accept H0. This is called the Clopper-Pearson method.

     It can sometimes happen that the most extreme event in one of the tails has
     probability > α/2. In that case, the Clopper-Pearson method reverts to a
     one-tail test at level α. This is not an issue in our case, because the
     low-end event { X = 0 } has probability 0.000102 < 0.025 and also the
     upper-end event { X = 10 } has probability 0.000002 < 0.025.
(2)  Form the rejection set by collecting the outcomes { X = x } in increasing
     order of probability. For our data this would give the following:
 x      Prob        CumProb
10    0.000002     0.000002
 0    0.000102     0.000104
 9    0.000186     0.000290
 1    0.003402     0.003692
 8    0.004593     0.008285
 2    0.034447     0.042732
 7    0.040826     0.083558
 3    0.146974     0.230532
 6    0.160753     0.391285
 4    0.300071     0.691356
 5    0.308645     1.000001
Our problem had { X = 2 }, and the cumulative through this value is
0.042732. As this is below 0.05, we would reject H0.
This is the Wilson-Sterne rule. Other methods have been proposed as well. The
Wilson-Sterne method is more powerful, meaning that it rejects H0 more often, but it has
other problems.
Problem 1: The Wilson-Sterne method is more complicated than the
Clopper-Pearson method.
Problem 2: One can invert hypothesis tests to get confidence intervals. The
phrase “invert hypothesis tests” refers to finding the set { θ0 | with actual
data x, the hypothesis H0: θ = θ0 is accepted }. The Wilson-Sterne rule
can sometimes lead to disconnected confidence intervals!
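Here, as promised, is a sketch of both two-sided computations for our data:

```python
# A sketch of both two-sided rules for the arthritis data.
from scipy.stats import hypergeom

dist = hypergeom(22, 10, 10)    # (grand total; row total, column total)
probs = {x: dist.pmf(x) for x in range(11)}

# (1) Clopper-Pearson: double the one-tail p-value and compare to alpha.
print(2 * dist.cdf(2))          # 2 x 0.03795 = 0.0759 > 0.05: accept H0

# (2) Wilson-Sterne: sum every outcome no more likely than the observed one.
print(sum(p for p in probs.values() if p <= probs[2]))  # 0.042732 < 0.05: reject H0
```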
Fisher’s exact test was illustrated with focus on the upper-left cell. The procedure could
have been carried out with any of the four cells of the table. All conclusions would be
logically and numerically consistent. As a clerical guideline, it’s usually easiest to focus
on the cell which has the smaller row total and also the smaller column total.
This discussion would not be complete without mention of Fisher’s “lady tasting tea”
experiment. A certain aristocratic lady has taste so refined that she can tell whether the
milk or the tea was placed first in her teacup. This unusual skill is to be put to a test.
Eight teacups are set out. Four cups are selected at random, and for these selected cups,
the milk is poured first. For the remaining cups, the tea is poured first. The lady, who
has been discreetly kept away from the preparation, is now asked for her judgments. She
is aware that there are exactly four cups in each category, and she will try to identify the
four cups in which the milk was poured first. That is, she will supply data for this table:
                          Lady’s judgment
Actual               Milk poured first   Tea poured first   Total
Milk poured first                                             4
Tea poured first                                              4
Total                        4                   4            8
It happens that the lady identifies three out of four correctly. If we use the 5% level of
significance, how should we appraise her skill?
Note that she has filled out the table in this fashion:
                          Lady’s judgment
Actual               Milk poured first   Tea poured first   Total
Milk poured first            3                   1            4
Tea poured first             1                   3            4
Total                        4                   4            8
The null hypothesis is that her guessing is random, and the alternative is that she has
some skill. The alternative would support large numbers in the upper left cell. Let’s
associate this cell with the random variable X. The null hypothesis distribution is
hypergeometric, with these probabilities:
x      Prob       CumProb
0    0.014286     0.01429
1    0.228571     0.24286
2    0.514286     0.75714
3    0.228571     0.98571
4    0.014286     1.00000
The result was { X = 3 }. The event { X ≥ 3 } would be described as “an outcome as
extreme, or even more extreme, in support of H1 than the outcome actually observed.”
Then P[ X ≥ 3 ] = 1 - P[ X ≤ 2 ] = 0.24286 gives us the p-value. This is well in excess of
0.05, so we would have to accept the null hypothesis that the lady is guessing. She
would have to correctly identify all four of the milk-first teacups in order to be
convincing!
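A sketch of this last calculation, for completeness:

```python
# A sketch of the tea-tasting arithmetic: X is hypergeometric with
# grand total 8, row total 4, column total 4.
from scipy.stats import hypergeom

dist = hypergeom(8, 4, 4)
print(1 - dist.cdf(2))   # P[X >= 3] = 0.24286, the p-value
print(dist.pmf(4))       # P[X = 4] = 0.014286 < 0.05: only a perfect
                         # identification would be convincing
```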