The χ² test for independence in a contingency table
MEI Conference 2012
Periodontal disease status   Mild   Moderate   Severe
CAD positive                 11     35         40
CAD negative                 15     28         18
X² = 21.53, DF = 3, P-value = 0.000
• Periodontal disease affects the gums and bones supporting the teeth.
• CAD is Coronary Artery Disease and is a major cause of death worldwide.
• This study looked at 100 adults with CAD and 100 adults without CAD.
• A dentist examined the teeth of all 200 without knowing whether they had CAD or not.
• What do you notice?
Some data from medical research
Periodontal disease status   None   Mild   Moderate   Severe   TOTAL
CAD positive                 0      0      42         58       100
CAD negative                 53     26     21         0        100
TOTAL                        53     26     63         58       200

Periodontal disease status   None   Mild   Moderate   Severe   TOTAL
CAD positive                 26     13     32         29       100
CAD negative                 27     13     31         29       100
TOTAL                        53     26     63         58       200
Periodontal disease status   CAD positive   CAD negative   TOTAL
None                         14             39             53
Mild                         11             15             26
Moderate                     35             28             63
Severe                       40             18             58
TOTAL                        100            100            200
Is there an association between periodontal
disease and coronary artery disease?
Testing whether there is an association
Basic idea:
• Start by assuming that there is no association.
• The technical way of saying this is that the null hypothesis is that there is no association between periodontal disease and CAD.
• Calculate the frequencies you would expect if there was no association.
• Compare observed and expected frequencies.
The hypotheses
• H0: no association between periodontal disease and CAD. (This is the null hypothesis)
• H1: some association between periodontal disease and CAD. (This is the alternative hypothesis)
• Assuming the null hypothesis is true gives us a way of calculating expected frequencies.
Working out the expected frequencies
Periodontal disease status   None   Mild   Moderate   Severe   TOTAL
CAD positive                                                   100
CAD negative                                                   100
TOTAL                        53     26     63         58       200
The CAD positive and CAD negative totals were fixed at 100 each by the researchers. Assume that the totals for each level of periodontal disease are representative of levels of periodontal disease in the general population. If people with each level of periodontal disease are equally likely to be CAD positive or CAD negative, what do you think the expected frequencies are?
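One way to check an answer to this question is the short Python sketch below. It uses the observed counts given elsewhere in these slides and takes each expected frequency to be (level total × CAD group total) / grand total. The names observed and expected_frequencies are illustrative only and do not come from the slides.

```python
# Observed counts from the slides: (CAD positive, CAD negative) for each level.
observed = {
    "None": (14, 39),
    "Mild": (11, 15),
    "Moderate": (35, 28),
    "Severe": (40, 18),
}


def expected_frequencies(table):
    """Expected count = (level total x CAD group total) / grand total."""
    group_totals = [sum(row[j] for row in table.values()) for j in range(2)]
    grand_total = sum(group_totals)
    return {
        level: tuple(sum(row) * g / grand_total for g in group_totals)
        for level, row in table.items()
    }


expected = expected_frequencies(observed)
for level, (pos, neg) in expected.items():
    print(f"{level:<9}{pos:>6.1f}{neg:>6.1f}")
```

Because both CAD groups contain 100 people, each expected frequency is simply half of the total for that level of periodontal disease.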
Differences between observed and expected frequencies
Observed - Expected   CAD positive   CAD negative
No PD
Mild PD
Moderate PD
Severe PD
• What do you notice?
• To get an idea of whether there is a lot of
difference overall, it would be handy to add
all the differences. What happens if you do
this?
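The question above can be explored with a small Python sketch. It assumes the expected frequencies are half of each level's total (because the two CAD groups are the same size); the variable names are illustrative, not part of the slides.

```python
# Observed counts from the slides: (CAD positive, CAD negative) for each level.
observed = {
    "None": (14, 39),
    "Mild": (11, 15),
    "Moderate": (35, 28),
    "Severe": (40, 18),
}

# Assumption: with 100 people in each CAD group, the expected count in each
# cell is half of that level's total.
expected = {level: (sum(row) / 2, sum(row) / 2) for level, row in observed.items()}

# Observed - Expected for every cell.
differences = {
    level: tuple(o - e for o, e in zip(observed[level], expected[level]))
    for level in observed
}

total_difference = sum(d for row in differences.values() for d in row)

for level, (pos, neg) in differences.items():
    print(f"{level:<9}{pos:>7.1f}{neg:>7.1f}")
print("Sum of all differences:", total_difference)
```

The positive and negative differences cancel, so the plain sum is always zero and cannot measure how far apart the two tables are.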
Deciding whether the test statistic
is big
• To decide whether the test statistic is big,
we need something to compare it to.
• If the null hypothesis was true (and there
was no association between the
characteristics of interest), it is possible to
work out how likely different values of the
test statistic would be.
Comparing observed and expected frequencies
Observed      CAD positive   CAD negative
No PD         14             39
Mild PD       11             15
Moderate PD   35             28
Severe PD     40             18
• The obvious thing to do when comparing the observed and expected frequencies is to look at differences (subtracting).
• What happens when you do this?
Expected      CAD positive   CAD negative
No PD
Mild PD
Moderate PD
Severe PD
Getting a test statistic
(Observed - Expected)² / Expected   CAD positive   CAD negative
No PD                               5.896          5.896
Mild PD                             0.308          0.308
Moderate PD                         0.389          0.389
Severe PD                           4.172          4.172
• Adding all these numbers up gives a single measure of difference.
• This is called the test statistic.
• For this table it comes to 21.53. This was labelled X² in the research paper.
• If the test statistic is big then this means that there was a lot of
difference between expected and observed frequencies – this leads
us to doubt the null hypothesis that there was no association.
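The calculation described above can be reproduced with a short Python sketch. Each cell contributes (Observed - Expected)² / Expected, and the contributions are added to give the test statistic. The function name chi_squared_contributions is hypothetical, and the expected frequencies are taken to be half of each level's total, as the equal group sizes suggest.

```python
# Observed counts from the slides: (CAD positive, CAD negative) for each level.
observed = {
    "None": (14, 39),
    "Mild": (11, 15),
    "Moderate": (35, 28),
    "Severe": (40, 18),
}


def chi_squared_contributions(table):
    """Return (O - E)^2 / E for every cell, assuming two equal-sized groups."""
    contributions = {}
    for level, row in table.items():
        e = sum(row) / 2  # expected count in each cell of this level
        contributions[level] = tuple((o - e) ** 2 / e for o in row)
    return contributions


contributions = chi_squared_contributions(observed)
test_statistic = sum(c for row in contributions.values() for c in row)

for level, (pos, neg) in contributions.items():
    print(f"{level:<9}{pos:>7.3f}{neg:>7.3f}")
print(f"Test statistic: {test_statistic:.2f}")  # 21.53
```

Squaring stops positive and negative differences cancelling, and dividing by the expected frequency puts each difference on a scale relative to the size of the cell.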
The probability distribution of the test statistic
• The horizontal axis shows possible values of the test statistic.
• The vertical axis measures how likely the value is if the null hypothesis is true.
• This type of probability distribution is called a chi squared (χ²) distribution.
[Figure: Chi squared probability distribution, plotting f(x) (0 to 0.3) against x (0 to 40)]
How big is “big”?
• The area under
the graph
represents
probability.
• The probability
that the test
statistic is more
than 7.81 is 5%
(IF the null
hypothesis is
true).
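The 5% figure can be checked with a small Python sketch. It assumes the distribution has 3 degrees of freedom (the DF = 3 reported with the research data) and uses a closed-form expression for the upper tail that holds only for that case. The function name chi2_upper_tail_3df is illustrative and is not taken from the slides.

```python
import math


def chi2_upper_tail_3df(x):
    """P(test statistic > x) for a chi-squared distribution with 3 degrees of freedom.

    Closed form valid for 3 degrees of freedom only:
    erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2)
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


print(f"{chi2_upper_tail_3df(7.81):.3f}")   # 0.050
print(f"{chi2_upper_tail_3df(21.53):.3f}")  # 0.000
```

The second line shows why a test statistic of 21.53 is reported with a P-value of 0.000: a value that large would be very unlikely if the null hypothesis were true.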
The shaded area shows the total probability that the test statistic is more than 7.81.
Degrees of freedom (ν)
Periodontal disease status   None   Mild   Moderate   Severe   TOTAL
CAD positive                                                   100
CAD negative                                                   100
TOTAL                        53     26     63         58       200
• Suppose the totals are fixed and you can put what you
like in the other cells (as long as the totals are correct).
• How many numbers can you choose freely?
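The counting argument can be sketched in Python. With all totals fixed, choosing three of the CAD positive counts determines every other cell, which agrees with the standard rule (rows - 1) × (columns - 1) for a contingency table. That rule is not stated on the slides, and the function name complete_table is hypothetical.

```python
# Fixed totals from the slides.
level_totals = {"None": 53, "Mild": 26, "Moderate": 63, "Severe": 58}
group_total = 100  # size of each CAD group


def complete_table(free_choices):
    """Fill the whole table from three freely chosen CAD positive counts."""
    levels = list(level_totals)
    positive = dict(zip(levels[:-1], free_choices))
    # The last CAD positive count is forced by the group total.
    positive[levels[-1]] = group_total - sum(free_choices)
    # Every CAD negative count is forced by the total for its level.
    return {
        level: (positive[level], level_totals[level] - positive[level])
        for level in levels
    }


degrees_of_freedom = (2 - 1) * (len(level_totals) - 1)
table = complete_table([14, 11, 35])

print("Degrees of freedom:", degrees_of_freedom)
for level, (pos, neg) in table.items():
    print(f"{level:<9}{pos:>5}{neg:>5}")
```

Choosing 14, 11 and 35 recovers the observed table, because the remaining five cells follow from the totals.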
Interpreting contributions to the test statistic
Once the test for association has been done, then if there is evidence
of association, the contributions to the test statistic can be examined.
• Look for large contributions first – these are where observed and
expected frequencies are far apart. Then look at whether the
observed frequency of each cell is larger or smaller than the
expected frequency of that cell.
• Remember that expected frequencies were calculated on the
assumption that there was no association between the variables so
large contributions show where the association is strongest.
• Look for small contributions – these are where observed and
expected frequencies are close together.
• Sometimes students may be asked to comment on how observed
compares with expected for each category. They should refer to
contributions to the test statistic as well as to observed and
expected frequencies.
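As an illustration of this advice, the Python sketch below ranks the contributions for the periodontal disease data and notes whether each observed frequency is above or below its expected frequency. The wording of the direction labels and the variable names are illustrative only, and the expected frequencies are again taken to be half of each level's total.

```python
# Observed counts from the slides: (CAD positive, CAD negative) for each level.
observed = {
    "None": (14, 39),
    "Mild": (11, 15),
    "Moderate": (35, 28),
    "Severe": (40, 18),
}

rows = []
for level, counts in observed.items():
    expected = sum(counts) / 2  # assumes two equal-sized CAD groups
    for group, obs in zip(("CAD positive", "CAD negative"), counts):
        contribution = (obs - expected) ** 2 / expected
        if obs > expected:
            direction = "more than expected"
        elif obs < expected:
            direction = "fewer than expected"
        else:
            direction = "as expected"
        rows.append((contribution, level, group, direction))

# Largest contributions first; ties keep their original order.
rows.sort(key=lambda r: r[0], reverse=True)

for contribution, level, group, direction in rows:
    print(f"{contribution:6.3f}  {level:<9}{group:<14}{direction}")
```

For these data the largest contributions come from the None and Severe levels: fewer CAD positive people than expected have no periodontal disease, and more than expected have severe disease.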