The χ² test for independence in a contingency table
MEI Conference 2012

Some data from medical research

Periodontal disease status
                None   Mild   Moderate   Severe
CAD positive     14     11       35        40
CAD negative     39     15       28        18
X² = 21.53, DF = 3, P-value = 0.000

• Periodontal disease affects the gums and bones supporting the teeth.
• CAD is Coronary Artery Disease and is a major cause of death worldwide.
• This study looked at 100 adults with CAD and 100 adults without CAD.
• A dentist examined the teeth of all 200 without knowing whether they had CAD or not.
• What do you notice?

Periodontal disease status
                None   Mild   Moderate   Severe   TOTAL
CAD positive      0      0       42        58      100
CAD negative     53     26       21         0      100
TOTAL            53     26       63        58      200

Periodontal disease status
                None   Mild   Moderate   Severe   TOTAL
CAD positive     26     13       32        29      100
CAD negative     27     13       31        29      100
TOTAL            53     26       63        58      200

Is there an association between periodontal disease and coronary artery disease?

Periodontal disease status   CAD positive   CAD negative   TOTAL
None                              14             39           53
Mild                              11             15           26
Moderate                          35             28           63
Severe                            40             18           58
TOTAL                            100            100          200

Testing whether there is an association

Basic idea:
• Start by assuming that there is no association.
• The technical way of saying this is that the null hypothesis is that there is no association between periodontal disease and CAD.
• Calculate the expected frequencies you would expect if there was no association.
• Compare observed and expected frequencies.

The hypotheses

• H₀: no association between periodontal disease and CAD. (This is the null hypothesis.)
• H₁: some association between periodontal disease and CAD. (This is the alternative hypothesis.)
• Assuming the null hypothesis is true gives us a way of calculating expected frequencies.

Working out the expected frequencies

Periodontal disease status   CAD positive   CAD negative   TOTAL
None                                                          53
Mild                                                          26
Moderate                                                      63
Severe                                                        58
TOTAL                            100            100          200

The column totals were fixed at 100 by the researchers.
Assume that the row totals are representative of levels of periodontal disease in the general population. If people with each level of periodontal disease are equally likely to be CAD positive or CAD negative, what do you think the expected frequencies are?

Comparing observed and expected frequencies

Observed
              CAD positive   CAD negative
No PD              14             39
Mild PD            11             15
Moderate PD        35             28
Severe PD          40             18

Expected
              CAD positive   CAD negative
No PD
Mild PD
Moderate PD
Severe PD

• The obvious thing to do when comparing the observed and expected frequencies is to look at differences (subtracting).
• What happens when you do this?

Differences between observed and expected frequencies

Observed − Expected
              CAD positive   CAD negative
No PD
Mild PD
Moderate PD
Severe PD

• What do you notice?
• To get an idea of whether there is a lot of difference overall, it would be handy to add all the differences. What happens if you do this?

Getting a test statistic

(Observed − Expected)² / Expected
              CAD positive   CAD negative
No PD             5.896          5.896
Mild PD           0.308          0.308
Moderate PD       0.389          0.389
Severe PD         4.172          4.172

• Adding all these numbers up gives a single measure of difference.
• This is called the test statistic.
• For this table it comes to 21.53. This was labelled X² in the research paper.
• If the test statistic is big then this means that there was a lot of difference between expected and observed frequencies – this leads us to doubt the null hypothesis that there was no association.

Deciding whether the test statistic is big

• To decide whether the test statistic is big, we need something to compare it to.
• If the null hypothesis was true (and there was no association between the characteristics of interest), it is possible to work out how likely different values of the test statistic would be.

The probability distribution of the test statistic

[Figure: Chi squared probability distribution, f(x) plotted against x for x from 0 to 40]

• The horizontal axis shows possible values of the test statistic.
• The vertical axis measures how likely the value is if the null hypothesis is true.
• This type of probability distribution is called a chi squared (χ²) distribution.

How big is “big”?

• The area under the graph represents probability.
• The probability that the test statistic is more than 7.81 is 5% (IF the null hypothesis is true).

The shaded area shows the total probability that the test statistic is more than 7.81.

Degrees of freedom (ν)

Periodontal disease status
                None   Mild   Moderate   Severe   TOTAL
CAD positive                                       100
CAD negative                                       100
TOTAL            53     26       63        58      200

• Suppose the totals are fixed and you can put what you like in the other cells (as long as the totals are correct).
• How many numbers can you choose freely?

Interpreting contributions to the test statistic

Once the test for association has been done, then if there is evidence of association, the contributions to the test statistic can be examined.
• Look for large contributions first – these are where observed and expected frequencies are far apart. Then look at whether the observed frequency of each cell is larger or smaller than the expected frequency of that cell.
• Remember that expected frequencies were calculated on the assumption that there was no association between the variables so large contributions show where the association is strongest.
• Look for small contributions – these are where observed and expected frequencies are close together.
• Sometimes students may be asked to comment on how observed compares with expected for each category. They should refer to contributions to the test statistic as well as to observed and expected frequencies.
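
The calculation walked through in these slides can be checked with a short script. What follows is a minimal sketch in Python using only the standard library, not part of the original presentation. The names `observed`, `expected_row` and `chi2_upper_tail_df3` are illustrative assumptions. Expected frequencies are taken as row total × column total ÷ grand total (the usual rule under the null hypothesis of no association), and the tail probability uses a closed form that is valid only for 3 degrees of freedom.

```python
import math

# Observed frequencies from the slides: rows are periodontal disease (PD)
# status, columns are (CAD positive, CAD negative).
observed = {
    "No PD": (14, 39),
    "Mild PD": (11, 15),
    "Moderate PD": (35, 28),
    "Severe PD": (40, 18),
}
columns = ("CAD positive", "CAD negative")

col_totals = [sum(row[j] for row in observed.values()) for j in range(len(columns))]
grand_total = sum(col_totals)


def expected_row(row):
    """Expected counts if there is no association:
    row total * column total / grand total."""
    row_total = sum(row)
    return tuple(row_total * c / grand_total for c in col_totals)


expected = {status: expected_row(row) for status, row in observed.items()}

# Contribution of each cell: (Observed - Expected)^2 / Expected.
contributions = {
    status: tuple(
        (o - e) ** 2 / e for o, e in zip(observed[status], expected[status])
    )
    for status in observed
}

# The test statistic is the sum of all the contributions.
test_statistic = sum(sum(c) for c in contributions.values())

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(columns) - 1)


def chi2_upper_tail_df3(x):
    """P(X > x) for a chi squared distribution with exactly 3 degrees of
    freedom (closed form; an assumption that does not generalise to other df)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_upper_tail_df3(test_statistic)

print(f"test statistic = {test_statistic:.2f}, df = {degrees_of_freedom}")
print(f"P(statistic > 7.81) = {chi2_upper_tail_df3(7.81):.3f}")
print(f"p-value = {p_value:.3f}")

# List cells from largest to smallest contribution, noting whether the
# observed count is above or below the expected count.
cells = sorted(
    (
        (status, col, contributions[status][j], observed[status][j] - expected[status][j])
        for status in observed
        for j, col in enumerate(columns)
    ),
    key=lambda cell: cell[2],
    reverse=True,
)
for status, col, contrib, diff in cells:
    direction = "more" if diff > 0 else "fewer"
    print(f"{status:12s} {col:12s} {contrib:.3f} ({direction} than expected)")
```

Run as written, the script reports a test statistic of 21.53 with 3 degrees of freedom, agreeing with the values quoted from the research paper. The largest contributions come from the No PD and Severe PD rows, which is where the slides suggest looking first.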