BIOINF 2118
15 - Categorical data and contingency tables
Categorical Data and the Chi-Square Test
The Pearson Chi-Square Test is a great tool for testing goodness-of-fit in contingency tables.
There is one and only one formula for an unlimited number of situations:

$$Q = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \;\dot\sim\; \chi^2_{\nu}$$

where O_i = the Observed count in table cell #i;
E_i = the Expected count in table cell #i, if the null hypothesis H0 is true; derived from a model with p parameters;
ν = the “degrees of freedom” = the number of constraints in the model H0
  = (# parameters in the “saturated model” HA) − (# parameters in the model H0)
  = (# table cells) − (# parameters in the model H0)
  = K − p.
Note the “dot” over the ~. That means: “The statistic Q is approximately chi-square, if the sample size n = O_1 + ... + O_K is large enough.”
The “null model” H0 has some constraint: Ê_i = f(O_1, ..., O_K, θ) for some function f.
For H0, the # parameters = dim(θ) = p.
The “saturated model” HA is the extreme case: the no-constraint model.
Under the saturated HA, Ê_i = O_i (a perfect fit), and the # parameters = the number of table cells, K.
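The single formula above can be sketched in a few lines (Python here, rather than the course's R; the counts are made up purely to show the arithmetic):

```python
# Pearson's goodness-of-fit statistic Q = sum_i (O_i - E_i)^2 / E_i.
# The counts below are hypothetical, purely to illustrate the formula.
def pearson_q(observed, expected):
    """Sum of (O - E)^2 / E over the K table cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

O = [10, 20, 30]          # observed counts in K = 3 cells
E = [20.0, 20.0, 20.0]    # expected counts under some H0
print(pearson_q(O, E))    # (100 + 0 + 100) / 20 = 10.0
```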
Where does this whole idea come from?
Suppose
$$O_i \sim \mathrm{Poisson}(E_i).$$
Then
$$E(O_i) = E_i, \qquad \mathrm{var}(O_i) = E_i.$$
So under a normal approximation,
$$Z_i = \frac{O_i - E_i}{\sqrt{E_i}} \;\dot\sim\; N(0, 1).$$
Square each Z_i to get a χ²₁. Now add together all K of them, to get Q = Z₁² + ... + Z_K².
If there are K of them, why does Q have the distribution χ²_{K−p} instead of χ²_K?
Because... the Z_i are not independent, but are functionally dependent through the model f.
Under H0, they are constrained to a smaller subspace, of dimension p.
If the distribution is not Poisson but, for example, multinomial, that's OK, because as the sample size increases, the sampling mechanism matters less and less; the likelihood function takes over.
Application #1: Testing equality of probabilities of discrete outcomes.
Suppose
$$(N_1, ..., N_K) \sim \mathrm{multinomial}(n, (p_1, ..., p_K)),$$
and H0 is
$$p_1 = ... = p_K.$$
The full model has dimension K − 1 (since the p's sum to 1). The null hypothesis has dimension zero. It is fully specified:
$$p_i = 1/K.$$
The expectations are
$$E_i = n p_i^0 = n/K.$$
Example: See R code, “14-Rcode for chisquare tests.R”.
NOTES: (1) This works EXACTLY the same for any other null-hypothesis probability vector p₁⁰, ..., p_K⁰.
(2) The “rule of 5” says that, if any of the E's are less than 5, then the chi-square approximation may be too inaccurate.
Then one should do an alternative: randomization test, Fisher “exact” test, ... (discussed later).
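A minimal sketch of this test with hypothetical counts (the course's own examples are in the R file cited above; this is Python). For even ν, the chi-square upper-tail probability has a simple closed form, so no library is needed:

```python
import math

def pearson_q(observed, expected):
    """Q = sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_even(x, df):
    """P(chisq_df > x) for EVEN df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    h = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= h / k
        total += term
    return math.exp(-h) * total

# Hypothetical data: n = 100 observations over K = 5 equally likely categories.
N = [18, 22, 21, 25, 14]
n, K = sum(N), len(N)
E = [n / K] * K               # E_i = n/K = 20 under H0: p_i = 1/K
Q = pearson_q(N, E)           # 70/20 = 3.5
p = chi2_sf_even(Q, K - 1)    # nu = K - 1 = 4 since H0 is fully specified
```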
Application #2: Testing a composite hypothesis
Suppose
$$p_1 = \theta^2, \qquad p_2 = 2\theta(1 - \theta), \qquad p_3 = (1 - \theta)^2.$$
This is a “Hardy-Weinberg law”. If a gene has two alleles, A and a, and the frequency of A equals θ, then this model results IF there is no selection pressure and no “assortative mating” and no measurement error.
Q: (If there is no “measurement error”, what kind of error is there?)
It is a composite null hypothesis; as θ grows from 0 to 1, it traces out a curve (which is 1-dimensional) inside the 2-dimensional simplex defined by
$$p_1 + p_2 + p_3 = 1 \qquad (\text{all } p_i \ge 0).$$
To use the test, we maximize the likelihood to get the MLE θ̂, then replace θ in the E's. Differentiate the log-likelihood and set it to zero, to get
$$\hat\theta = \frac{2N_1 + N_2}{2n}.$$
Then substitute to get
$$\hat E_1 = n\hat\theta^2, \qquad \hat E_2 = 2n\hat\theta(1 - \hat\theta), \qquad \hat E_3 = n(1 - \hat\theta)^2.$$
The unconstrained parameter space has dimension = 2, the null hypothesis has dimension = 1. So
$$\nu = 2 - 1 = 1.$$
Example: See HWEexample.R
(For the record, there are better tests.)
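A sketch with made-up genotype counts (the course example is HWEexample.R; this is Python). The MLE has the closed form θ̂ = (2N₁ + N₂)/(2n), and for ν = 1 the chi-square tail is erfc(√(Q/2)):

```python
import math

# Hypothetical genotype counts (AA, Aa, aa); the real example is in HWEexample.R.
N1, N2, N3 = 50, 30, 20
n = N1 + N2 + N3

theta = (2 * N1 + N2) / (2 * n)      # closed-form MLE of the A-allele frequency
E = [n * theta ** 2,                 # expected AA count under Hardy-Weinberg
     n * 2 * theta * (1 - theta),    # expected Aa count
     n * (1 - theta) ** 2]           # expected aa count

Q = sum((o - e) ** 2 / e for o, e in zip([N1, N2, N3], E))
p = math.erfc(math.sqrt(Q / 2))      # chi-square upper tail with nu = 1
```

With these hypothetical counts Q is large, so the Hardy-Weinberg model would be rejected.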
Application #3: Testing whether a distribution is normal (Gaussian)
Given a sample of X’s, calculate the mean and variance.
Divide the real line into K sub-intervals. Count N_i = # points in each sub-interval [a_i, b_i].
So the likelihood of the COUNT data is
$$L(\mu, \sigma^2) \propto \prod_{i=1}^{K} p_i(\mu, \sigma^2)^{N_i}, \qquad p_i(\mu, \sigma^2) = \Phi\!\left(\frac{b_i - \mu}{\sigma}\right) - \Phi\!\left(\frac{a_i - \mu}{\sigma}\right).$$
By the theory, if we maximize this likelihood to get the MLEs μ̂ and σ̂², then substitute to get
$$\hat E_i = n\, p_i(\hat\mu, \hat\sigma^2),$$
and then form Q as before,
$$\nu = (K - 1) - 2 = K - 3.$$
But that's a hard maximization. Much easier: use the ordinary sample mean and variance. But then the theory is wrong. Actually,
$$K - 3 < \nu < K - 1.$$
DRAWBACKS:
If K is too big, the counts N are small, but if K is too small, it’s not a very sensitive test.
Also, it’s dependent on how you choose the specific a’s and b’s.
BETTER: Anderson-Darling test.
See gofNORMtest() in the package nsRFA.
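A rough sketch of the binned test with a made-up sample and cutpoints (Python), using the “easier” plug-in of the sample mean and standard deviation, so, as noted above, the true ν lies between K − 3 and K − 1:

```python
import statistics

def binned_normal_q(xs, cuts):
    """Pearson Q comparing bin counts to Normal(mean, sd) bin probabilities."""
    n = len(xs)
    dist = statistics.NormalDist(statistics.fmean(xs), statistics.stdev(xs))
    edges = [float("-inf")] + list(cuts) + [float("inf")]
    q = 0.0
    for a, b in zip(edges, edges[1:]):
        observed = sum(a < x <= b for x in xs)          # N_i for this bin
        expected = n * (dist.cdf(b) - dist.cdf(a))      # n * p_i(mean, sd)
        q += (observed - expected) ** 2 / expected
    return q

# Hypothetical sample and cutpoints giving K = 4 bins.
xs = list(range(1, 21))
q = binned_normal_q(xs, [5.5, 10.5, 15.5])
```

Note that the value of q changes if the cutpoints move, which is exactly the dependence on the a's and b's mentioned in the drawbacks.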
Application #4: Testing whether two categorical variables are independent
The data is arranged in a table, of the form
{Nij : i = 1,..., R, j = 1,...,C } .
For the observations, define row sums N_{i+} = Σ_j N_{ij} and column sums N_{+j} = Σ_i N_{ij}.
For the true probabilities p_{ij}, define row sums p_{i+} = Σ_j p_{ij} and column sums p_{+j} = Σ_i p_{ij}.
The null hypothesis is independence:
$$H_0: \; p_{ij} = p_{i+}\, p_{+j} \quad \text{for } i = 1, ..., R, \; j = 1, ..., C.$$
Under H0, the expectations are
$$\hat E_{ij} = n\, \hat p_{i+}\, \hat p_{+j} = \frac{N_{i+}\, N_{+j}}{n}.$$
Under H0, the dimension of the model (degrees of freedom) is R + C − 2, because the number of unknown parameters is R + C and there are two constraints:
$$\sum_{i} p_{i+} = 1, \qquad \sum_{j} p_{+j} = 1.$$
Under HA, there is no constraint except Σ_{ij} p_{ij} = 1, so the dimension is R × C − 1. The difference in dimensions between the two models is (R × C − 1) − (R + C − 2) = (R − 1) × (C − 1).
Example: Voter preferences versus academic department (DeGroot-Schervish 9.3.1)
Observations:
{Oij : i = 1,...,R, j = 1,...,C}
                            Prefers Mr. Jones   Prefers Ms. Smith   Undecided   TOTAL
Eng and science                    24                  23               12        59
Humanities, soc sciences           24                  14               10        48
Fine arts                          17                   8               13        38
Industr, public admin              27                  19                9        55
TOTAL                              92                  64               44       200
Does voter preference vary with department?
Estimated expectations {Êij : i = 1,...,R, j = 1,...,C} under H0 (independence):
                            Prefers Mr. Jones   Prefers Ms. Smith   Undecided   TOTAL
Eng and science                  27.14               18.88            12.98      59
Humanities, soc sciences         22.08               15.36            10.56      48
Fine arts                        17.48               12.16             8.36      38
Industr, public admin            25.30               17.60            12.10      55
TOTAL                            92                  64               44        200
(Notice that the row and column totals for the E's agree with the original data.)
Then Q = 6.68, and ν = (R − 1) × (C − 1) = (4 − 1) × (3 − 1) = 6.
> 1 - pchisq(6.68, 6)
[1] 0.3514567
See “chisq-voting-table.R”. Learn about loglin( ) and loglm( ).
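The arithmetic in this example is easy to check by hand; here is a sketch in Python (the course's script is in R):

```python
import math

counts = [[24, 23, 12],    # Eng and science
          [24, 14, 10],    # Humanities, soc sciences
          [17,  8, 13],    # Fine arts
          [27, 19,  9]]    # Industr, public admin

n = sum(map(sum, counts))
row = [sum(r) for r in counts]          # row totals: [59, 48, 38, 55]
col = [sum(c) for c in zip(*counts)]    # column totals: [92, 64, 44]

# Q = sum of (O - E)^2 / E with E_ij = (row total)(column total)/n
Q = sum((counts[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
        for i in range(len(row)) for j in range(len(col)))

def chi2_sf_even(x, df):
    """Chi-square upper tail for EVEN df (closed-form Poisson sum)."""
    h = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= h / k
        total += term
    return math.exp(-h) * total

df = (len(row) - 1) * (len(col) - 1)   # (4 - 1)(3 - 1) = 6
p = chi2_sf_even(Q, df)                # about 0.351, matching 1 - pchisq(6.68, 6)
```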
“Tips and Tricks” with the Chi-square test
• “Warning: expected cell count less than 5”
• Yates's “continuity correction”
• chisq.test(simulate.p.value=TRUE)
Comparing pairs of nested models
Suppose you have two models H1: E = f(θ₁) and H2: E = g(θ₂). If the parameter spaces are NESTED, Θ₁ ⊂ Θ₂, with different dimensions p₁ and p₂ with p₁ < p₂, then
$$Q_2 \;\dot\sim\; \chi^2_{K - p_2} \qquad \text{and} \qquad Q_1 - Q_2 \;\dot\sim\; \chi^2_{p_2 - p_1}$$
(the latter under the smaller model H1).
Example: in the Prisoners’ Picnic, we can ask about nested hypotheses like these:
H1: Ate, Drank, and Sick are independent.
H2: Ate and Drank are independent conditional on Sick.
H3: Ate, Drank, and Sick may be pairwise associated, but there is no “3-way interaction”.
H1 ⊂ H2 ⊂ H3.
Θ1 ⊂ Θ2 ⊂ Θ3.
p1=3, p2=5, p3=6, K=7.
See prisonersPicnic- Chi-square tests and fisher tests.R.
(We will return to this data set with prisonersPicnic-modeling.R.)
The likelihood ratio test is very similar to the chi-square test.
Both require a large enough sample size.
It is based on the multinomial likelihood function:
$$L(p_1, ..., p_K) \propto \prod_{i=1}^{K} p_i^{N_i}.$$
(Here I've used N_i instead of O_i for clarity.) Then the log-likelihood-ratio test statistic is
$$G^2 = 2 \sum_{i=1}^{K} N_i \log\frac{\hat E_i^{(1)}}{\hat E_i^{(0)}},$$
where Ê_i^{(1)} is the fitted MLE for model 1 (HA) and Ê_i^{(0)} is the fitted MLE for model 0 (H0).
See “chisq-voting-table.R” for details.
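As a sketch (Python rather than the course's R): for the voter-preference table, take model 1 to be saturated (so Ê⁽¹⁾ = O) and model 0 to be the independence fit. The likelihood-ratio statistic then comes out close to the Pearson Q of 6.68:

```python
import math

counts = [[24, 23, 12],
          [24, 14, 10],
          [17,  8, 13],
          [27, 19,  9]]

n = sum(map(sum, counts))
row = [sum(r) for r in counts]
col = [sum(c) for c in zip(*counts)]

g2 = 0.0   # likelihood-ratio statistic: 2 * sum N * log(E1_hat / E0_hat)
q = 0.0    # Pearson statistic, for comparison
for i, r in enumerate(counts):
    for j, o in enumerate(r):
        e0 = row[i] * col[j] / n        # independence fit (model 0)
        g2 += 2 * o * math.log(o / e0)  # saturated fit: E1_hat = O
        q += (o - e0) ** 2 / e0
```

Both statistics refer to the same chi-square distribution with (R − 1)(C − 1) = 6 degrees of freedom, and here they differ only slightly.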