Chi-square, Goodness of fit, and Contingency Tables

What is the χ² distribution?

Basically the distribution of a sum of squared
(standardized) differences
Useful for detecting differences between
categorical counts



Calculate the χ² test statistic =
Σ (observed − expected)² / expected,
summed over all categories
Degrees of freedom = number of
categories − 1
Look up the χ² value for that number of degrees
of freedom and the chosen alpha value. If the
test statistic > table value, the result is
significant
1. Two-sided test:
   1. Find the column corresponding to α/2 in the table of
      upper critical values and reject the null hypothesis if the
      test statistic is greater than the tabled value.
   2. Use 1 − α/2 in the table of lower critical values and
      reject the null hypothesis if the test statistic is less
      than the tabled value.
2. Upper one-sided test: find the column corresponding to α in
   the table of upper critical values. If the test statistic is
   greater, reject.
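As a rough sketch of the upper one-sided procedure, the snippet below computes the statistic and compares it with a table value. The function names, the short table of α = 0.05 critical values, and the die-roll counts are illustrative assumptions, not part of the lecture.

```python
# Upper critical values of the chi-square distribution at alpha = 0.05,
# keyed by degrees of freedom (standard table values, df 1 to 5 only).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def is_significant(observed, expected):
    """Upper one-sided test at alpha = 0.05 with df = categories - 1."""
    df = len(observed) - 1
    return chi_square_statistic(observed, expected) > CRITICAL_05[df]


# Hypothetical data: 60 rolls of a die, 10 expected per face if it is fair.
rolls = [8, 12, 9, 11, 6, 14]
fair = [10] * 6

print(chi_square_statistic(rolls, fair))  # 4.2
print(is_significant(rolls, fair))        # False (4.2 < 11.070, df = 5)
```

With six categories there are 5 degrees of freedom, so the statistic is compared against 11.070.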
Also useful for model fitting



Assume you have fit a model to some
data and have some residual errors left
over.
You want to check whether the residuals are
normally distributed. You bin them in a
histogram.
Estimate the proportion of residuals expected in
each bin under a normal distribution, and
compare to the actual counts
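A minimal sketch of this residual check, assuming the residuals are available as a list of numbers. The bin edges, the simulated residuals, and the function names are hypothetical. Note also that when the mean and standard deviation are estimated from the residuals themselves, the usual convention is to subtract one further degree of freedom per estimated parameter.

```python
import math
import random
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative probability of a normal(mu, sigma) variable at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def binned_chi_square(residuals, edges):
    """Compare binned residual counts with normal expectations.

    edges: interior bin edges in ascending order; the two outer bins
    are open-ended so that the expected counts sum to n.
    """
    n = len(residuals)
    mu = statistics.fmean(residuals)
    sigma = statistics.stdev(residuals)
    bounds = [-math.inf] + list(edges) + [math.inf]
    observed, expected = [], []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        observed.append(sum(1 for r in residuals if lo <= r < hi))
        expected.append(n * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)))
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, stat


# Hypothetical residuals, simulated here only so the sketch runs.
random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(200)]
observed, expected, stat = binned_chi_square(residuals, edges=[-1.0, 0.0, 1.0])
print(observed, [round(e, 1) for e in expected], round(stat, 3))
```

A large statistic relative to the table value would suggest the residuals are not normally distributed.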
Model Fitting Example



Consider a classic genetics experiment.
The offspring of a cross between the F1 brassicas were 53 dark green and 11
yellow.
If the plants are heterozygous for color, a ratio of 3 dark green to 1 yellow
would be expected.
              Observed      Expected      O-E    (O-E)²   (O-E)²/E
              numbers (O)   numbers (E)
Dark Green    53            48            5      25       25/48 = 0.52
Yellow        11            16            -5     25       25/16 = 1.56
Total         64            64            0               2.08
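The brassica calculation can be reproduced in a few lines. This is a sketch rather than a definitive implementation: the variable names are made up, and the p-value shortcut used here (the complementary error function) is valid only for 1 degree of freedom.

```python
import math

observed = {"dark green": 53, "yellow": 11}
ratio = {"dark green": 3, "yellow": 1}  # expected 3:1 ratio

total = sum(observed.values())
ratio_sum = sum(ratio.values())
expected = {k: total * ratio[k] / ratio_sum for k in observed}

chi_square = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1

# For df = 1 only, the chi-square upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(expected)               # {'dark green': 48.0, 'yellow': 16.0}
print(round(chi_square, 2))   # 2.08
print(chi_square > 3.841)     # False: below the df = 1, alpha = 0.05 table value
```

Because 2.08 is below 3.841, the observed counts are consistent with the 3:1 ratio at α = 0.05.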
Compound Hypotheses and Directionality




With multiple categories, compound
hypotheses are possible
H0: Pr(cat 1) = 0.25, Pr(cat 2) = 0.50
and Pr(cat 3) = 0.25
HA: at least one of the above is not the case
Where there are 2 categories, a
“directional alternative” is possible
Directional Alternatives


Only in the case of “dichotomous
variables” – two categories, effectively.
Step 1: Check whether the observed trend is in
the direction specified by HA
If not, the P-value is > 0.5 by necessity
If so, proceed to step 2
Step 2: The P-value is half what it would be if
HA were nondirectional
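The two steps can be sketched as below for the two-category (df = 1) case. The function names are hypothetical, and returning 1 − p/2 when the trend runs against HA is one common convention, assumed here; the slides say only that the P-value exceeds 0.5 in that case.

```python
import math


def chi_square_p_df1(stat):
    """Nondirectional P-value for a chi-square statistic with df = 1."""
    return math.erfc(math.sqrt(stat / 2.0))


def directional_p_value(stat, trend_matches_alternative):
    """Directional P-value for a dichotomous variable.

    trend_matches_alternative: True if the observed deviation is in the
    direction that HA predicts (step 1).
    """
    p_nondirectional = chi_square_p_df1(stat)
    if trend_matches_alternative:
        # Step 2: half the nondirectional P-value.
        return p_nondirectional / 2.0
    # Trend is in the wrong direction: P-value is above 0.5 (assumed convention).
    return 1.0 - p_nondirectional / 2.0


print(round(directional_p_value(3.841, True), 3))   # 0.025
print(directional_p_value(3.841, False) > 0.5)      # True
```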
Directional Alternative Example




Two football teams' records are compared against the
average number of wins by an NFL team per year, 9.
Team 1 won 14 games this year and several players
were caught doping with HGF.
Team 2 won 11 games this year and tested clean.
Is there evidence that doping increased the number
of wins by Team 1?
Contingency Tables



Use the χ² test statistic as above, but
calculate the expected value for each cell in the
table from E = (row total) * (column
total) / (grand total)
For a 2x2 table, df = 1
2x2 Contingency Tables

Can indicate either:
1. Two independent samples with a dichotomous observed variable
2. One sample with two dichotomous observed variables
               Female   Male   Total
HIV test       9        8      17
No HIV test    52       51     103
Total          61       59     120
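A short sketch of the expected-value formula applied to the HIV test table. The function names are illustrative assumptions; the code works for any table given as a list of rows.

```python
def expected_counts(table):
    """E = (row total) * (column total) / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def contingency_chi_square(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: HIV test, no HIV test. Columns: female, male.
hiv_table = [[9, 8], [52, 51]]
stat = contingency_chi_square(hiv_table)

print([[round(e, 2) for e in row] for row in expected_counts(hiv_table)])
print(round(stat, 3))  # 0.035
```

With df = 1, a statistic of about 0.035 is far below the α = 0.05 table value of 3.841, so these counts give no evidence of an association.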
Relation to Independence of Data




You can interpret contingency tables in
terms of conditional probabilities
Pr(HIV test | female) = 9/61
Pr(female | HIV test) = 9/17
The test becomes H0: the likelihood of taking an
HIV test is independent of sex
               Female   Male   Total
HIV test       9        8      17
No HIV test    52       51     103
Total          61       59     120
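The conditional probabilities can be read off the table directly. The sketch below uses made-up variable names; the counts are those in the table.

```python
tested = {"female": 9, "male": 8}
not_tested = {"female": 52, "male": 51}

n_female = tested["female"] + not_tested["female"]   # 61
n_male = tested["male"] + not_tested["male"]         # 59
n_tested = sum(tested.values())                      # 17
n_total = n_female + n_male                          # 120

p_test_given_female = tested["female"] / n_female    # 9/61
p_test_given_male = tested["male"] / n_male          # 8/59
p_female_given_test = tested["female"] / n_tested    # 9/17
p_test = n_tested / n_total                          # 17/120

# Under H0 (independence), Pr(HIV test | female) and Pr(HIV test | male)
# would both equal Pr(HIV test) in the population.
print(round(p_test_given_female, 3))  # 0.148
print(round(p_test_given_male, 3))    # 0.136
print(round(p_test, 3))               # 0.142
```

The two conditional proportions are close to each other and to the overall proportion, which is what independence would predict.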
r x k Contingency Tables

Same as above, but for a table with r rows and k columns,
degrees of freedom = (r-1)*(k-1).
Corrections to the Chi-Squared Test



It is a requirement that a chi-squared test be applied to discrete data. Counts
are appropriate; continuous measurements are not. However, the χ² distribution
itself is continuous, and approximating discrete counts with it distorts the
p-value and may make false positives more likely.
Frank Yates proposed a correction to the chi-squared formula: subtract 0.5 from
each absolute difference |O − E| before squaring. This tends to increase the
p-value and makes the test more conservative, making false positives less likely.
However, the test may now be *too* conservative.
Additionally, the chi-squared test should not be used when the expected value in a
cell is < 5. It is sometimes acceptable to pad an empty cell with a small value,
on the reasoning that the result would only be more significant with no value
there.
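To illustrate the effect of the Yates correction, the sketch below applies it to the brassica counts from the model fitting example (a df = 1 case). The function name and the clipping of the 0.5 adjustment, so that it never exceeds |O − E|, are assumptions of this sketch.

```python
def chi_square(observed, expected, yates=False):
    """Chi-square statistic, optionally with Yates' continuity correction."""
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            # Subtract 0.5, but never more than the difference itself
            # (the clipping is an assumption of this sketch).
            diff -= min(0.5, diff)
        total += diff ** 2 / e
    return total


observed = [53, 11]  # dark green, yellow
expected = [48, 16]

plain = chi_square(observed, expected)
corrected = chi_square(observed, expected, yates=True)
print(round(plain, 4), round(corrected, 4))  # 2.0833 1.6875
```

The corrected statistic is smaller, so its p-value is larger, which is the more conservative behavior described above.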