BIOINF 2118, Lecture 15: Categorical Data and Contingency Tables

Categorical Data and the Chi-Square Test

The Pearson chi-square test is a great tool for testing goodness-of-fit in contingency tables. There is one and only one formula for an unlimited number of situations:

$$Q = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \;\dot{\sim}\; \chi^2_{K-p},$$

where

  O_i = the Observed count in table cell #i,
  E_i = the Expected count in table cell #i if the null hypothesis H0 is true, derived from a model with p parameters,

and K - p is the "degrees of freedom", the number of constraints in the model H0:

  K - p = (# parameters in the "saturated model" HA) - (# parameters in the model H0)
        = (# table cells) - (# parameters in the model H0).

Note the "dot" over the ~. That means: "The statistic Q is approximately chi-square, if the sample size n = O_1 + ... + O_K is large enough."

The "null model" H0 has some constraint: $\hat{E}_i = f(O_1, \dots, O_K, \hat\theta)$ for some function f. For H0, the # parameters = dim(θ) = p.

The "saturated model" HA is the extreme case: the no-constraint model. Under the saturated HA, $\hat{E}_i = O_i$ (a perfect fit), and the # parameters = the number of table cells, K.

Where does this whole idea come from?

Suppose $O_i \sim \text{Poisson}(E_i)$. Then $E(O_i) = E_i$ and $\text{var}(O_i) = E_i$. So under a normal approximation,

$$Z_i = \frac{O_i - E_i}{\sqrt{E_i}} \;\dot{\sim}\; N(0, 1).$$

Square each $Z_i$ to get a $\chi^2_1$. Now add together all K of them, to get $Q = Z_1^2 + \dots + Z_K^2$.

If there are K of them, why does Q have the $\chi^2_{K-p}$ distribution instead of $\chi^2_K$? Because the $Z_i$ are not independent, but are functionally dependent through the model f. Under H0, they are constrained to a smaller subspace of dimension p.

If the distribution is not Poisson (for example, multinomial), that's OK, because as the sample size increases the sampling mechanism matters less and less; the likelihood function takes over.

Application #1: Testing equality of probabilities of discrete outcomes

Suppose $(N_1, \dots, N_K) \sim \text{Multinomial}(n, (p_1, \dots, p_K))$, and H0 is $p_1 = \dots = p_K$. The full model has dimension K - 1 (since the p's sum to 1). The null hypothesis has dimension zero; it is fully specified: $p_i = 1/K$. The expectations are $E_i = n p_i^0 = n/K$, and the degrees of freedom are (K - 1) - 0 = K - 1.

Example: See the R code "14-Rcode for chisquare tests.R"; a minimal sketch also appears below.

NOTES:
(1) This works EXACTLY the same for any other fully specified null hypothesis probability vector $(p_1^0, \dots, p_K^0)$.
(2) The "rule of 5" says that if any of the E's are less than 5, then the chi-square approximation may be too inaccurate. Then one should use an alternative: a randomization test, Fisher's "exact" test, ... (discussed later).

Application #2: Testing a composite hypothesis

Suppose

$$p_1 = \theta^2, \qquad p_2 = 2\theta(1-\theta), \qquad p_3 = (1-\theta)^2.$$

This is a "Hardy-Weinberg law". If a gene has two alleles, A and a, and the frequency of A equals θ, then this model results IF there is no selection pressure, no "assortative mating", and no measurement error. Q: (If there is no "measurement error", what kind of error is there?)

It is a composite null hypothesis; as θ grows from 0 to 1, it traces out a curve (which is 1-dimensional) inside the 2-dimensional simplex defined by $p_1 + p_2 + p_3 = 1$ (all $p_i \ge 0$).

To use the test, we maximize the likelihood

$$L(\theta) \propto (\theta^2)^{N_1} \, \big(2\theta(1-\theta)\big)^{N_2} \, \big((1-\theta)^2\big)^{N_3}$$

to get the MLE $\hat\theta$, then replace θ in the E's. Differentiate the log likelihood and set it to zero, to get

$$\hat\theta = \frac{2N_1 + N_2}{2n}.$$

Then substitute to get $\hat{E}_1 = n\hat\theta^2$, $\hat{E}_2 = 2n\hat\theta(1-\hat\theta)$, $\hat{E}_3 = n(1-\hat\theta)^2$.

The unconstrained parameter space has dimension 2; the null hypothesis has dimension 1. So df = 2 - 1 = 1.

Example: See HWEexample.R; a minimal sketch follows below. (For the record, there are better tests.)
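Here is a minimal R sketch of Application #1, by hand and with chisq.test(). The counts are hypothetical; the course version is in "14-Rcode for chisquare tests.R":

O <- c(22, 18, 25, 15, 20)          # hypothetical observed counts, K = 5 cells
n <- sum(O)
E <- rep(n / length(O), length(O))  # E_i = n/K under H0: p_i = 1/K
Q <- sum((O - E)^2 / E)             # Pearson chi-square statistic
1 - pchisq(Q, df = length(O) - 1)   # df = (K - 1) - 0 = K - 1
chisq.test(O)                       # same test; the default null is equal probabilities

For any other fully specified null vector p0, pass it explicitly: chisq.test(O, p = p0).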
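And a minimal sketch of the Hardy-Weinberg test of Application #2, again with hypothetical genotype counts (HWEexample.R is the course version):

N <- c(AA = 42, Aa = 50, aa = 8)               # hypothetical genotype counts
n <- sum(N)
theta.hat <- (2 * N[[1]] + N[[2]]) / (2 * n)   # MLE of the frequency of allele A
p.hat <- c(theta.hat^2, 2 * theta.hat * (1 - theta.hat), (1 - theta.hat)^2)
E <- n * p.hat                                 # fitted expected counts under H0
Q <- sum((N - E)^2 / E)
1 - pchisq(Q, df = 1)                          # df = 2 - 1 = 1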
Application #3: Testing whether a distribution is normal (Gaussian)

Given a sample of X's, calculate the mean and variance. Divide the real line into K sub-intervals, and count $N_i$ = # points in each sub-interval $[a_i, b_i]$. So the likelihood of the COUNT data is

$$L(\mu, \sigma^2) \propto \prod_{i=1}^{K} p_i(\mu, \sigma^2)^{N_i}, \qquad p_i(\mu, \sigma^2) = \Phi\!\left(\frac{b_i - \mu}{\sigma}\right) - \Phi\!\left(\frac{a_i - \mu}{\sigma}\right).$$

By the theory, if we maximize this likelihood to get the MLEs $(\hat\mu, \hat\sigma^2)$, then substitute to get $\hat{E}_i = n \hat p_i$ with $\hat p_i = \Phi((b_i - \hat\mu)/\hat\sigma) - \Phi((a_i - \hat\mu)/\hat\sigma)$, and then form Q as before, the degrees of freedom are (K - 1) - 2 = K - 3.

But that's a hard maximization. Much easier: use the ordinary sample mean $\bar{X}$ and sample variance $s^2$ instead. But then the theory is wrong. Actually, K - 3 < df < K - 1. (A minimal sketch of this binned test appears after the nested-models discussion below.)

DRAWBACKS: If K is too big, the counts N are small; but if K is too small, it's not a very sensitive test. Also, it depends on how you choose the specific a's and b's.

BETTER: the Anderson-Darling test. See gofNORMtest() in the package nsRFA.

Application #4: Testing whether two categorical variables are independent

The data are arranged in a table of the form $\{N_{ij} : i = 1, \dots, R,\ j = 1, \dots, C\}$. For the observations, define row sums $N_{i+} = \sum_{j} N_{ij}$ and column sums $N_{+j} = \sum_{i} N_{ij}$. For the true probabilities $p_{ij}$, define row sums $p_{i+} = \sum_{j} p_{ij}$ and column sums $p_{+j} = \sum_{i} p_{ij}$.

The null hypothesis is independence:

$$H_0:\ p_{ij} = p_{i+} \, p_{+j} \quad \text{for } i = 1, \dots, R,\ j = 1, \dots, C.$$

Under H0, the expectations are

$$\hat{E}_{ij} = \frac{N_{i+} N_{+j}}{n}.$$

Under H0, the dimension of the model is R + C - 2, because the number of unknown parameters $(p_{1+}, \dots, p_{R+}, p_{+1}, \dots, p_{+C})$ is R + C and there are two constraints: $\sum_i p_{i+} = 1$ and $\sum_j p_{+j} = 1$.

Under HA, there is no constraint except $\sum_{ij} p_{ij} = 1$, so the dimension is RC - 1.

The difference in dimensions between the two models is (RC - 1) - (R + C - 2) = (R - 1)(C - 1).

Example: Voter preferences versus academic department (DeGroot-Schervish 9.3.1)

Observed counts $\{O_{ij}\}$:

                              Prefers Jones   Prefers Smith   Undecided   TOTAL
  Eng. and science                 24              23             12        59
  Humanities, soc. sciences        24              14             10        48
  Fine arts                        17               8             13        38
  Industr., public admin.          27              19              9        55
  TOTAL                            92              64             44       200

Does candidate preference vary with department?

Estimated expectations $\{\hat{E}_{ij}\}$ under H0 (independence):

                              Prefers Jones   Prefers Smith   Undecided   TOTAL
  Eng. and science                27.14           18.88          12.98      59
  Humanities, soc. sciences       22.08           15.36          10.56      48
  Fine arts                       17.48           12.16           8.36      38
  Industr., public admin.         25.30           17.60          12.10      55
  TOTAL                           92              64             44        200

(Notice that the row and column totals for the E's agree with the original data.)

Then Q = 6.68, with df = (R - 1) × (C - 1) = (4 - 1) × (3 - 1) = 6:

> 1 - pchisq(6.68, 6)
[1] 0.3514567

See "chisq-voting-table.R". Learn about loglin() and loglm().

"Tips and Tricks" with the Chi-square test

- "Warning: expected cell count less than 5": the rule of 5 again.
- Yates's "continuity correction".
- chisq.test(simulate.p.value=TRUE)

Comparing pairs of nested models

Suppose you have two models, H1: $E = f(\theta_1)$ and H2: $E = g(\theta_2)$. If the parameter spaces are NESTED, $\Theta_1 \subset \Theta_2$, with different dimensions $p_1 < p_2$, then

$$Q_1 \;\dot{\sim}\; \chi^2_{K-p_1} \quad \text{and} \quad Q_2 \;\dot{\sim}\; \chi^2_{K-p_2},$$

and the difference $Q_1 - Q_2 \;\dot{\sim}\; \chi^2_{p_2 - p_1}$ under the smaller model H1.

Example: in the Prisoners' Picnic, we can ask about nested hypotheses like these:

H1: Ate, Drank, and Sick are independent.
H2: Ate and Drank are independent conditional on Sick.
H3: Ate, Drank, and Sick may be pairwise associated, but there is no "3-way interaction".

H1 ⊂ H2 ⊂ H3, i.e. Θ1 ⊂ Θ2 ⊂ Θ3, with p1 = 3, p2 = 5, p3 = 6, K = 7.

See "prisonersPicnic- Chi-square tests and fisher tests.R". (We will return to this data set with prisonersPicnic-modeling.R.)
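As promised in Application #3, here is a minimal R sketch of the binned chi-square test for normality, on simulated data (not part of the course scripts), using the easy plug-in $(\bar{X}, s^2)$. Since the true df lies between K - 3 and K - 1, the two p-values below should bracket the correct one:

set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)                   # simulated data, for illustration
K <- 8
breaks <- quantile(x, probs = seq(0, 1, length.out = K + 1))
breaks[1] <- -Inf                                    # stretch the end bins to cover
breaks[K + 1] <- Inf                                 #   the whole real line
N <- as.vector(table(cut(x, breaks)))                # counts N_i in the K sub-intervals
p.hat <- diff(pnorm(breaks, mean = mean(x), sd = sd(x)))  # plug-in cell probabilities
E <- length(x) * p.hat
Q <- sum((N - E)^2 / E)
c(1 - pchisq(Q, K - 3), 1 - pchisq(Q, K - 1))        # the true p-value is in between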
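The voting example above can be checked in one call to chisq.test(); this is a minimal check, and the course version is in "chisq-voting-table.R":

votes <- matrix(c(24, 23, 12,
                  24, 14, 10,
                  17,  8, 13,
                  27, 19,  9),
                nrow = 4, byrow = TRUE,
                dimnames = list(dept = c("Eng/sci", "Hum/soc", "Fine arts", "Ind/admin"),
                                pref = c("Jones", "Smith", "Undecided")))
chisq.test(votes)    # X-squared = 6.68 on df = 6, p = 0.35, as above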
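For the Prisoners' Picnic chain H1 ⊂ H2 ⊂ H3, loglin() fits each log-linear model by naming which margins it preserves. Here is a sketch with hypothetical 2 x 2 x 2 counts (the real data are in the course script); differences of the fitted statistics are referred to a chi-square with df equal to the difference in dimensions:

picnic <- array(c(20, 10, 12, 15, 4, 18, 6, 35),     # hypothetical counts
                dim = c(2, 2, 2),
                dimnames = list(Ate   = c("no", "yes"),
                                Drank = c("no", "yes"),
                                Sick  = c("no", "yes")))
m1 <- loglin(picnic, margin = list(1, 2, 3), print = FALSE)            # H1: mutual independence
m2 <- loglin(picnic, margin = list(c(1, 3), c(2, 3)), print = FALSE)   # H2: indep. given Sick
m3 <- loglin(picnic, margin = list(c(1, 2), c(1, 3), c(2, 3)),
             print = FALSE)                                            # H3: no 3-way interaction
1 - pchisq(m1$lrt - m2$lrt, df = m1$df - m2$df)      # H1 vs H2, df = p2 - p1 = 2
1 - pchisq(m2$lrt - m3$lrt, df = m2$df - m3$df)      # H2 vs H3, df = p3 - p2 = 1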
The likelihood ratio test is very similar to the chi-square test. Both require a big enough sample size. It is based on the multinomial likelihood function:

$$L(p_1, \dots, p_K) \propto \prod_{i=1}^{K} p_i^{N_i}.$$

(Here I've used $N_i$ instead of $O_i$ for clarity.) Then the log-likelihood-ratio test statistic is

$$G^2 = 2 \sum_{i=1}^{K} N_i \log\!\left(\frac{\hat{E}_i^{(1)}}{\hat{E}_i^{(0)}}\right) \;\dot{\sim}\; \chi^2_{\dim(H_A) - \dim(H_0)},$$

where $\hat{E}_i^{(1)}$ is the fitted MLE for model 1 (HA) and $\hat{E}_i^{(0)}$ is the fitted MLE for model 0 (H0). See "chisq-voting-table.R" for details.
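A minimal sketch (not the course script) computing $G^2$ for the voting table, taking HA to be the saturated model so that $\hat{E}_{ij}^{(1)} = N_{ij}$:

votes <- matrix(c(24, 23, 12,
                  24, 14, 10,
                  17,  8, 13,
                  27, 19,  9), nrow = 4, byrow = TRUE)
E0 <- outer(rowSums(votes), colSums(votes)) / sum(votes)   # fitted E's under H0 (independence)
G2 <- 2 * sum(votes * log(votes / E0))                     # E^(1) = N under the saturated HA
1 - pchisq(G2, df = (4 - 1) * (3 - 1))                     # close to the Pearson result above

$G^2$ comes out close to the Pearson Q = 6.68; the two tests usually agree closely when the sample size is large.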