ASSOCIATIONS IN CATEGORICAL DATA
(Statistics Course Notes by Alan Pickering)

Conventions Used in My Statistics Handouts

Text which appears in unshaded boxes can be treated as an aside. It may define a concept being used or provide a mathematical justification for something. Some of these asides relate to very important statistical ideas with which you SHOULD be familiar.

The names of SPSS menus will be written in shadow bold font (e.g., Analyze). The options on a menu will be called procedures, and their names will be written in bold small capitals font (e.g., DESCRIPTIVE STATISTICS). Procedures sometimes offer a family of related options; the name of the one selected will appear in italic small capitals font (e.g., CROSSTABS). The resulting window will contain boxes to be filled or checked, and will offer buttons to access subwindows. Subwindow names will appear in italic font (e.g., Statistics). Subwindows also have boxes to be filled or checked. Thus the full path to the CROSSTABS procedure will be written:

Analyze > DESCRIPTIVE STATISTICS >> CROSSTABS

Questions to be completed will appear in shaded boxes at various points.

PART I – BASICS AND BACKGROUND

What Are Categorical Data?

Categorical or nominal data are those in which the classes within each variable do not have any meaningful numerical value. Common examples are: gender; presence or absence (of some sign, symptom, disease, or behaviour); ethnic group; etc. It is sometimes useful to recode a numerical variable into a small number of categories, and sometimes this classification will retain ordinal information (e.g., carrying out a median split on a personality scale to create high and low trait subgroups, or grouping subjects into broad age bands). The data can then be analysed using the techniques described below, but one has to be aware that these analyses usually have considerably reduced power relative to using the full range of values on the variable.

Contingency Tables

One can carry out analyses on a single categorical variable to check whether the frequencies occurring for each level of that category are as predicted (check out the SPSS procedure: Analyze > NON-PARAMETRIC TESTS >> CHI-SQUARE). Much more commonly, however, the basic data structure for categorical variables is the contingency (or classification) table, or crosstabulation, formed from the variables concerned. These are described as n-way contingency tables, where n refers to the number of variables involved. The contingency table documents the frequencies (i.e., counts) of data for each combination of the variables concerned; hence contingency tables are also referred to as frequency tables. The small-scale example below is a 2-way table formed from the variables PDstatus (i.e., Parkinson's Disease status: does the participant have Parkinson's Disease?) and Smokehis (i.e., smoking history: has the participant smoked regularly at some point during their life?). Each variable in the table has two levels (yes vs. no).

                                   PDstatus
                          yes (=1)       no (=2)       Row Totals
Smokehis   yes (=1)       3  (5.73)      11 (8.27)           14
           no  (=2)       6  (3.27)       2 (4.73)            8
Column Totals                 9              13       Grand Total = 22

Table 1. Observed frequency counts of current Parkinson's disease status by smoking history. Expected frequencies under an independence model are in parentheses.

In SPSS, some of the analysis procedures reviewed below require that the categorical variables have numerical values, so it is recommended that categorical variables are always coded in this way.
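As an aside (and not part of the SPSS workflow described in this handout), the short Python sketch below shows one way the Table 1 crosstabulation could be built from numerically coded variables. The variable names smokehis and pdstatus, and the reconstructed raw data, are assumptions made purely for illustration; they simply reproduce the cell counts of Table 1.

    # Minimal sketch: building the Table 1 crosstabulation in Python/pandas
    # from numerically coded categorical variables (1 = yes, 2 = no).
    import pandas as pd

    # Raw data reconstructed from the Table 1 cell counts (22 subjects).
    data = pd.DataFrame({
        "smokehis": [1]*3 + [1]*11 + [2]*6 + [2]*2,
        "pdstatus": [1]*3 + [2]*11 + [1]*6 + [2]*2,
    })

    table = pd.crosstab(data["smokehis"], data["pdstatus"],
                        margins=True, margins_name="Total")
    print(table)
    # pdstatus   1   2  Total
    # smokehis
    # 1          3  11     14
    # 2          6   2      8
    # Total      9  13     22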
Remember that it is completely arbitrary how these values are assigned to the categories¹. To aid memory, and to produce clearer analysis printouts, one should always use the variable labelling options for such variables and include verbal value labels for each numerically-coded categorical variable.

¹ As we shall see later, changing the way the numerical values for categories are assigned can sometimes alter the value of a particular statistic (e.g., from a positive number to a negative number), but this does not alter the statistical significance of the statistic. It is also sometimes helpful (for interpreting analyses) to choose one particular assignment of numbers rather than others.

The Names of the Techniques

Introductory stats courses familiarise students with a specific technique for analysing 2-way contingency tables: Pearson's χ² test. More general methods are required for analysing higher-order contingency tables (involving 3 or more variables), and some of these analytic methods are therefore described (e.g., in Tabachnick and Fidell) as multiway frequency table analyses (or MFA). However, there are a large number of alternative names for a family of closely-related procedures. Here are just some of the other names which one may come across (e.g., in the procedure names available in SPSS): multinomial (or binary) logistic regression; logit regression; logistic analysis; analysis of multinomial variance; and (hierarchical) loglinear modelling. As these names imply, the techniques are often comparable to the procedures used for handling numerical data, in particular multiple linear regression and analysis of variance (ANOVA). A reasonable simplification is that the categorical data analyses below can be thought of as extending the Pearson χ² test to multiway tables. The techniques also share, with the χ² test, the fact that the calculated statistics are tested for statistical significance against the χ² distribution (just as multiple linear regression and ANOVA involve testing their statistics against the F distribution).

Association in Contingency Tables

There are many techniques in statistics for detecting an association between 2 variables. A significant association simply means that the values of one variable vary systematically (i.e., at a level greater than chance) with the values of the other variable. The most well-known measures of association are probably the (various types of) correlation coefficient between two variables. Correlation coefficients can reveal the extent to which the score (or rank) on one variable is linearly related to the score (or rank) on another. In contingency tables, the values within each category have no intrinsic numerical value, but associations can still be detected. An association means that the distribution of frequencies across the levels of one category differs depending upon the particular level of another category. When there is no association between variables, they are described as being independent. Thus, independence in a 2-way table means that there is no association between the row and column variables.

It is a much-noted statistical fact that finding significant associations between variables, in itself, tells you nothing about the causal relationships at work. In association analyses one may therefore have no logical reason to treat variables as either dependent or independent. Sometimes research is entirely exploratory and, when significant associations are found, the search for a causal connection between the variables begins.
In such research with categorical variables one might typically take a single sample of subjects and record the values on the variables of interest. For example, one might explore whether the political orientation reported by a subject (left; centre; right) was associated with the newspaper he or she reads. Neither variable here is obviously the dependent variable (DV), as the causal relationship could go in either direction.

Very often, however, one has a causal model in mind. For example, one might be interested in whether a subject's gender is associated with political orientation. Here, political orientation is the DV and the subject's gender is the independent variable (IV), as it is not possible for political orientation to affect gender. The research here will usually adopt a different sampling scheme, by controlling the sampling of the subjects in terms of the IVs. In this example, two samples of subjects (males and females) would be tested, recording the DV (political orientation) in each sample. It would be typical to arrange for equal-sized samples of males and females in such research. In the example data of Table 1, we are interested in finding variables that predict whether a subject will develop Parkinson's Disease (PD). Thus PD status is the DV and the IV (or predictor) is smoking history.

For contingency table data, the distinction between (exploratory) analyses, where all variables have a similar status, and analyses involving both IVs and DVs is important. We will see below that it affects the name of the analysis and the statistical procedure that one uses. In this handout, where the categorical analyses have both IVs and DVs, we will adopt the convention that the DV will be shown as the column variable.

It follows from the above that a pair of competing hypotheses (H1 = independence; H2 = association) may be applied to a contingency table. In order to decide between these hypotheses, one can calculate a statistic that reflects the discrepancy between the actual frequencies obtained and the frequencies that would be expected under the independence model. If the discrepancies are within the limits of chance (i.e., the statistic is nonsignificant), then one cannot reject the hypothesis of independence. If the discrepancies are not within chance limits (i.e., the statistic is significant), then one can safely reject the independence hypothesis, which implies association between the variables in the table.

Estimating Expected Frequencies Under the Independence Model

We will use the data from Table 1 as an example. If the variables PDstatus and Smokehis are independent, then the proportion of "Smokehis=yes" subjects with PD should be equal to the proportion of "Smokehis=no" subjects with PD, and both should be equal to the proportion of subjects who have PD overall. The overall proportion with PD is equal to 0.41 (i.e., 9/22). Therefore, the expected frequency with PD in the Smokehis=yes group should be 0.41 times the total number of subjects in the Smokehis=yes group; i.e., 0.41*14 (=5.73). This gives the expected frequency without PD in the Smokehis=yes group by subtraction (=14-5.73=8.27). Similar calculations give the expected frequencies for PD (0.41*8=3.27) and no PD (=8-3.27=4.73) in the Smokehis=no group. Another way to get the expected frequency for a cell in row R and column C is to multiply the row total for row R by the column total for column C and divide the result by the grand total for the whole table (e.g., [9*14]/22=5.73 for row 1 and column 1). This approach is easy to use with tables that have more than 2 variables (where the rows represent one variable, columns another, and separate subtables are used for other variables).
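As an aside, the short Python sketch below applies the row-total * column-total / grand-total rule to the Table 1 counts; it is purely illustrative and reproduces the expected frequencies shown in parentheses in Table 1.

    # Minimal sketch: expected frequencies under independence for Table 1.
    import numpy as np

    observed = np.array([[3, 11],    # Smokehis = yes: PD yes, PD no
                         [6,  2]])   # Smokehis = no : PD yes, PD no

    row_totals = observed.sum(axis=1)    # [14, 8]
    col_totals = observed.sum(axis=0)    # [9, 13]
    grand_total = observed.sum()         # 22

    # Expected cell (R, C) = row total R * column total C / grand total
    expected = np.outer(row_totals, col_totals) / grand_total
    print(expected.round(2))
    # [[5.73 8.27]
    #  [3.27 4.73]]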
Testing Associations in 2-way Contingency Tables

There are several statistics that one can compute to test for association vs. independence in a 2-way contingency table (such statistics are thus sometimes referred to as "indices of association"). Three such statistics (described below) are worthy of attention, and two of them (G²; OR) are of particular significance for logistic regression analyses. In this section we shall consider a 2-way table with R rows and C columns; this is therefore referred to as an RxC table. There are m cells in the table, where m = R * C. The actual frequency in cell number i of the table is denoted by the symbol fᵢ and the expected frequency (under the independence model) is denoted by eᵢ.

(i) Pearson's χ² statistic

How to Compute Using SPSS: Select the following procedure:

Analyze > DESCRIPTIVE STATISTICS >> CROSSTABS

Click on the Statistics button to access the Statistics subwindow and then check the Chi-square box. (The Cells subwindow is also useful, as it lets you display things other than just the actual frequencies in the contingency table.)

Key SPSS Output: Conducting a Pearson χ² analysis on the data in Table 1 (which is available on the J drive as the dataset small parks data) produces the following SPSS output:

Is/Was subject a smoker? * Has got Parkinson's disease? Crosstabulation (Count)

                                    Has got Parkinson's disease?
                                      yes        no        Total
Is/Was subject a smoker?    yes        3         11          14
                            no         6          2           8
Total                                  9         13          22

Chi-Square Tests

                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             6.044(b)   1      .014
Continuity Correction(a)       4.031      1      .045
Likelihood Ratio               6.222      1      .013
Fisher's Exact Test                                            .026         .016
Linear-by-Linear Association   5.769      1      .022
N of Valid Cases                 22

a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 3.27.

What Do These Results Mean?: The results of this analysis show that the χ² test statistic was significantly greater than the tabulated value (which is approximately the expected value for the statistic, assuming independence between the variables). Therefore, the independence model can be rejected and one concludes that the variables PDstatus and Smokehis are associated. If the p-value associated with the χ² test statistic had been nonsignificant, then one would not have been able to reject the hypothesis that PDstatus and Smokehis are independent.

Formula²:

χ² = Σ (i=1 to m) [ (fᵢ – eᵢ)² / eᵢ ]

² The expression Σ (i=1 to m) (xᵢ) is used in many statistical formulae. It means: calculate the sum of a set of numbers x₁, x₂, ..., up to xₘ.

Degrees of Freedom (df): df = (R-1)*(C-1)

Testing Significance: Under the independence model, the χ² test statistic has a distribution which follows the χ² distribution with the degrees of freedom given above. (Pearson could have been more helpful and given his statistic a name that differed from that of the distribution against which it is tested.) It has been shown that the χ² distribution can be used to test the Pearson χ² statistic as long as none of the expected frequencies is lower than 3.
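For readers who want to check results outside SPSS, the hedged Python sketch below reproduces the Pearson χ² value reported above using SciPy; it is not part of the handout's SPSS procedure and simply verifies the figures in the Chi-Square Tests table.

    # Minimal sketch: Pearson chi-square for the Table 1 counts via SciPy.
    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[3, 11],
                         [6,  2]])

    # correction=False gives the ordinary Pearson statistic; SPSS's
    # "Continuity Correction" row corresponds to correction=True.
    chi2, p, df, expected = chi2_contingency(observed, correction=False)
    print(round(chi2, 3), df, round(p, 3))   # 6.044 1 0.014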
(ii) Likelihood ratio statistic (usually abbreviated G² or Y²)

"How to Compute Using SPSS", "Key SPSS Output", "What Do These Results Mean?", df, and "Testing Significance" are the same as for Pearson χ². Because of the mathematical relationship between the two formulae, the two values are approximately equal under many circumstances.

Formula³:

G² = 2 * Σ (i=1 to m) [ fᵢ * log(fᵢ / eᵢ) ]

³ The log in the formula refers to natural logarithms, also written as logₑ or ln.

(iii) Odds ratio (OR)

(Note: this applies only to a 2x2 table, or to a 2x2 comparison within a larger table.)

How to Compute Using SPSS: This is also available via SPSS CROSSTABS. Follow the procedure for χ² and G², but now check the "Cochran's and Mantel-Haenszel Statistics" box in the Statistics subwindow.

Key SPSS Output: Conducting an OR analysis for the data in Table 1 using SPSS CROSSTABS gives the following additional output:

Mantel-Haenszel Common Odds Ratio Estimate

Estimate                                                     .091
ln(Estimate)                                               -2.398
Std. Error of ln(Estimate)                                  1.044
Asymp. Sig. (2-sided)                                        .022
Asymp. 95%            Common Odds Ratio      Lower Bound     .012
Confidence Interval                          Upper Bound     .704
                      ln(Common Odds Ratio)  Lower Bound   -4.445
                                             Upper Bound    -.351

The Mantel-Haenszel common odds ratio estimate is asymptotically normally distributed under the assumption that the common odds ratio equals 1.000. So is the natural log of the estimate.

What Do These Results Mean?: The OR, as with χ² and G², tests whether the independence hypothesis can be rejected. The significance test above reveals a p-value of 0.022, and so the row and column variables (smoking history and Parkinson's Disease status) are not independent. The OR can range from 0 to plus infinity, and a value of 1 corresponds to perfect independence. The value calculated for the Table 1 data lies well below one (=0.091; the p-value shows this to be significantly different from 1). This means that the odds of having Parkinson's Disease (PD) if you are a smoker (row 1) are 0.091 times the odds of having PD if you are a nonsmoker. Smoking in these data is significantly protective with respect to PD. If one had predicted the direction of this relationship (based on the existing literature demonstrating a PD-protective effect for smoking), then one could justifiably make a one-tailed test: the p-value for the OR of 0.091 in this case would be 0.022/2 (=0.011).

In the SPSS output, one sees that the natural logarithm of the odds ratio is also reported. For various reasons, this statistic is more useful in contingency table analyses than the raw OR. This use of log-transformed statistics explains why contingency table analyses are often described as logistic analyses, logistic regressions or loglinear modelling. The output also reports confidence intervals for the OR and log(OR) statistics. The next 2 boxes give a few basic facts about logarithms and confidence intervals.

Formula: Assuming the cells of a 2x2 table contain the frequencies a, b, c, d as follows:

a   b
c   d

the formula is:

OR = (a * d) / (c * b)

If any of the frequencies (a to d) are zero, then one usually replaces the zero with 0.5.

Testing Significance: As the calculated ln(OR) is normally distributed under the independence hypothesis, the estimate can be converted to a Z-value and tested against the value in a normal probability table: Z(log[OR]) = log(OR) / SE(log[OR]). For the data in Table 1, the value is -2.30. In a standard normal (i.e., Z) distribution, the two-tailed probability of finding a value that is as far (or further) above or below 0 as -2.3 is 0.022.
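Again as an aside, the sketch below computes G², the OR, and the Z test of ln(OR) directly from the Table 1 counts in Python, reproducing the values quoted above (G² = 6.222, OR = 0.091, Z roughly -2.30, two-tailed p roughly .022); it is purely illustrative.

    # Minimal sketch: likelihood ratio statistic G^2 and odds ratio for Table 1.
    import numpy as np
    from scipy.stats import norm

    observed = np.array([[3, 11],
                         [6,  2]], dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # G^2 = 2 * sum of f * ln(f / e) over all cells
    g2 = 2 * np.sum(observed * np.log(observed / expected))
    print(round(g2, 3))                                  # 6.222

    # Odds ratio and Z test of ln(OR)
    a, b = observed[0]
    c, d = observed[1]
    odds_ratio = (a * d) / (c * b)
    log_or = np.log(odds_ratio)
    se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)
    z = log_or / se_log_or
    p_two_tailed = 2 * norm.cdf(-abs(z))
    print(round(odds_ratio, 3), round(z, 2), round(p_two_tailed, 3))
    # 0.091 -2.3 0.022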
Key Facts About Logarithms

Taking the logarithm of a number is the inverse mathematical operation to raising a number to a power (or exponentiating). If we write the following equation:

x^y = z

then we can define the "log to the base x of z" thus:

logₓ(z) = y

Try it with x=10, y=2 and z=100. The log (to base 10) of 100 is 2; i.e., the log of a number is the power to which you have to raise the base to obtain the original number. Natural logarithms employ the number e as their base (e is approximately 2.718). In statistical theory, natural logs are always used.

The convenient thing about logs is that they turn multiplicative (or reciprocal) relationships into additive (or subtractive) ones. Because

e^a * e^b = e^(a + b)  and  e^a / e^b = e^(a - b)

the following are true (and indeed they are true for logs to any base):

logₑ(a*b) = logₑ(a) + logₑ(b)
logₑ(a/b) = logₑ(a) - logₑ(b)

In order for x^y = 0 to be generally true, y must be minus infinity (-∞). Thus logₑ(0) could be regarded as -∞, but convention dictates that logₑ(0) is undefined. Any number raised to the power zero is 1, and so logₑ(1) = 0. Thus we have the following ranges:

logₑ(x) > 0  if 1 < x < +∞
logₑ(x) = 0  if x = 1
logₑ(x) < 0  if 0 < x < 1

Probabilities (or likelihoods) have values between 0 and 1. As the statistics in this handout involve taking the logs of probabilities, it follows from the above that the resulting values are negative.

What Are Confidence Intervals?

When we calculate the value of a statistic for a sample of data, we are attempting to measure the true value of that statistic for the populations under study. Even if we have removed (most of) the sources of systematic bias from our experiment, the sample used to calculate the statistic will be subject to several sources of random error or noise. Therefore, although the value we calculate for the statistic is our best estimate of its true value in the population, we might prefer to give a range of values within which we feel the true value is likely to fall. A confidence interval (CI) for the calculated value of a statistic is just such a range (error bars on graphs serve a similar purpose). If we estimate a statistic to have a value of 10 with a 95% confidence interval of plus or minus 6, then we are saying that we are 95% certain that the true value of the statistic lies somewhere between 4 and 16 ([10 - 6] to [10 + 6]). If the statistic would be expected to have a value of 0 according to some hypothesis, then the value and associated CI in the above example imply that such a hypothesis should be rejected. Journals increasingly demand that CIs are calculated.

In the SPSS output for the OR analysis of the data in Table 1, the log(OR) had a value of -2.398, with a standard error (SE) for the log(OR) of 1.044. The SE is given by the square root of the sum of the reciprocals of the frequencies used in the OR calculation (i.e., √(1/3 + 1/11 + 1/6 + 1/2) = 1.044). With sufficient subjects in the whole table, the distribution of possible values for log(OR) is distributed normally around the value actually obtained, and 95% of the distribution will lie within 1.96*SE either side of the estimated value. Thus, the 95% CI for the calculated value of ln(OR) lies between (-2.398 - 1.96*1.044) and (-2.398 + 1.96*1.044); i.e., between -4.44 and -0.35. To calculate the corresponding CI for the OR, we simply compute e^(-4.44) (=0.01) and e^(-0.35) (=0.70). For the data in Table 1, our best guess of the true OR is 0.09, and we are 95% certain that the true value lies between 0.01 and 0.70. As these 95% CIs do not include 1.0 (the value expected if there were independence between PDstatus and Smokehis), the independence hypothesis can be rejected.
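As a final aside, the following Python sketch reproduces the 95% CI arithmetic just described; the values for ln(OR) and its standard error are taken from the SPSS output above, and the calculation is illustrative only.

    # Minimal sketch: 95% CI for ln(OR) and OR, using the values quoted above.
    import numpy as np

    log_or, se_log_or = -2.398, 1.044
    lower_ln = log_or - 1.96 * se_log_or
    upper_ln = log_or + 1.96 * se_log_or
    lower_or, upper_or = np.exp(lower_ln), np.exp(upper_ln)
    print(round(lower_ln, 2), round(upper_ln, 2))   # -4.44 -0.35
    print(round(lower_or, 2), round(upper_or, 2))   # 0.01 0.7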
Question 1

What happens if you calculate χ², G², and OR statistics for the contingency table formed from the variables Smokehis and PDstatr in the small parks dataset? (PDstatr is the same variable as PDstatus, only it has been recoded so that 1=no and 2=yes.)