Logistic Regression and Logistic Analysis (Alan Pickering, 2nd December 2003)

To understand these notes you should first have understood the material in the notes on “Associations in two-way categorical data” (henceforth referred to as the A2WCD notes), which are available electronically in the usual places.

MORE GENERAL PROCEDURES FOR ANALYSING CATEGORICAL DATA

There are a number of techniques that are more general than the Pearson chi-squared test (χ²), likelihood ratio chi-squared test (G²), and odds-ratio (OR) methods reviewed in the A2WCD notes. In particular, these general methods can be used to analyse contingency tables with more than 2 variables. Next we consider the range of these procedures available in SPSS.

Choosing A Procedure In SPSS

There are many different ways to analyse contingency tables and categorical dependent variables within SPSS. Each of the following procedures can sometimes be used:

Analyze > REGRESSION >> BINARY LOGISTIC
Analyze > REGRESSION >> MULTINOMIAL LOGISTIC
Analyze > LOGLINEAR >> GENERAL
Analyze > LOGLINEAR >> LOGIT
Analyze > LOGLINEAR >> MODEL SELECTION

Each procedure works best for a particular type of statistical question, although the procedures can often be “forced” to carry out analyses for which they were not specifically designed. The outputs of each procedure look quite different, even though many of the results, buried within the printout, will be identical. However, some of the output information is unique to each procedure. My general advice is that the Multinomial Logistic Regression procedure is by far the most user-friendly and will deal with the most common data analyses of this type that we wish to carry out. The SPSS Help menu gives advice on choosing between procedures (select Help > TOPICS and then select the Index tab). Unfortunately, SPSS uses slightly different names for the procedures in its Help menu (and on printed output) than those that appear as the options in the Analyze menu.
The following table will clarify the names that SPSS uses:

Procedure Name in Analyze Menu        Name in Help Menu                     Name in Output
REGRESSION >> BINARY LOGISTIC         Logistic Regression                   Logistic Regression
REGRESSION >> MULTINOMIAL LOGISTIC    Multinomial Logistic Regression       Nominal Regression
LOGLINEAR >> GENERAL                  General Loglinear Analysis            General Loglinear
LOGLINEAR >> LOGIT                    Logit Loglinear Analysis              General Loglinear (Analysis)
LOGLINEAR >> MODEL SELECTION          Model Selection Loglinear Analysis    HiLog (Hierarchical Log Linear)

Table 1. The varying names used by SPSS to describe its categorical data analysis procedures

Finally, note that the SPSS printed output, for the various types of categorical data analysis, is fairly confusing because it contains a lot of technical detail and jargon. That is why it is important to have a clear understanding of some of the basic issues covered below.

Three Types of Analysis

Logistic (or Logit) Regression: describes a general procedure in which one attempts to predict a categorical dependent variable (DV) from a group of predictors (IVs). These predictors can be categorical or numerical variables (the latter are referred to in SPSS as covariates). The DV can have two or more levels (binary or multinomial, respectively). This analysis can be thought of as analogous to (multiple) linear regression, but with categorical DVs. It is most easily carried out in SPSS using the following procedures:

Analyze > REGRESSION >> BINARY LOGISTIC
Analyze > REGRESSION >> MULTINOMIAL LOGISTIC

Logistic (or Logit) Analysis: describes a special case of logistic regression in which all the predictor variables are categorical, and these analyses often include interaction terms formed from the predictor variables. This analysis can be thought of as analogous to ANOVA, but with categorical DVs.
It is most easily carried out in SPSS using the following procedures:

Analyze > REGRESSION >> BINARY LOGISTIC
Analyze > REGRESSION >> MULTINOMIAL LOGISTIC
Analyze > LOGLINEAR >> LOGIT

(Hierarchical) Loglinear Modelling, Loglinear Analysis, or Multiway Frequency Table Analysis: describes a procedure in which there is no separation into DVs and predictors, and one is concerned with the interrelationships between all the categorical variables in the table. It is most easily carried out in SPSS using the following procedures:

Analyze > LOGLINEAR >> GENERAL
Analyze > LOGLINEAR >> MODEL SELECTION

Example of Logistic Analysis Using SPSS

Logistic analysis may be the most straightforward place to start looking at the more general contingency table analysis methods available in SPSS. This is because: (a) logistic analysis resembles ANOVA (which is familiar to psychologists); (b) categorical data from psychological experiments are probably most often in a form requiring logistic analysis (i.e., there is a DV and one or more categorical IVs); and (c) the relevant SPSS procedures are probably easier to execute and interpret than the other types of contingency table analysis methods. Before analysing multiway tables, we start with an example of a two-way analysis.

Logistic Analysis Example: small parks data

The data concern subjects with Parkinson’s disease (PD), and the dataset contains disease status (PDstatus: 1=has disease; 2=no disease) and smoking history (Smokehis: 1=is or was a smoker; 2=never smoked). Table 2 is a key contingency table:

                            PDstatus
                      yes (=1)   no (=2)   Row Totals
Smokehis   yes (=1)       3         11         14
           no  (=2)       6          2          8
Column Totals             9         13    Grand Total = 22

Table 2. Observed frequency counts of current Parkinson’s disease status by smoking history.

The data were analysed using the Analyze > REGRESSION >> MULTINOMIAL LOGISTIC procedure.
The presence or absence of PD (PDstatus) was selected into the “Dependent variable” box and cigarette smoking history (Smokehis) was selected into the “Factor(s)” box. The Statistics button was selected and, in the resulting subwindow, the “Likelihood ratio test” and “Parameter estimates” options were checked. The key resulting printed output was as follows:

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only        11.308
Final                  5.087            6.222      1   .013

Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                 5.087                      .000       0     .
SMOKEHIS                 11.308                     6.222       1   .013

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

Parameter Estimates (for “Has got Parkinson's disease? yes”)

                                                               95% CI for Exp(B)
Parameter        B      Std. Error   Wald    df   Sig.  Exp(B)    Lower     Upper
Intercept      1.099       .816      1.810    1   .178
[SMOKEHIS=1]  -2.398      1.044      5.271    1   .022  9.09E-02  1.17E-02  .704
[SMOKEHIS=2]   0(a)         0          .      0    .

a. This parameter is set to zero because it is redundant.

Understanding the Printout

(The key jargon from the printout is highlighted in bold below.)

In this analysis, there is only one effect of interest: the effect of smoking history Smokehis on PDstatus. As a result, the first two output tables (“Model Fitting Information”; “Likelihood Ratio Tests”) are completely redundant. Later, when we look at a 3-way example, these two tables provide different information. The final model is a model containing all the possible effects. As will be explained in more detail later, this model has two free parameters (parameters are just the independent components in the mathematical formula for the model).
The final model proposes that the probability of PD is different in each of the two samples with differing smoking histories (i.e., differing values of Smokehis). Therefore, the final model needs 2 parameters: effectively these parameters correspond to the probability of having PD in each of the 2 samples. As there are only 2 outcomes (has PD vs. doesn’t have PD), we do not need to specify the probability of not having PD, because that is simply 1 minus the probability of having PD.

The final model has a likelihood which we can denote with the symbol Lfinal. This likelihood is just the probability that exactly the observed data would have been obtained if the final model were true. These analyses use the natural logarithms (loge) of various values. Those who are not very familiar with logarithms should review the quick tutorial on logarithms that was given in the A2WCD notes. The value of -2*loge(Lfinal) is 5.087 (see “Model Fitting Information” in the above printout). The likelihood of getting exactly the data in Table 2, if the final model were true, is therefore given by e^(-5.087/2) (=0.08). Although this probability may seem low, the final model is the best possible model one could specify for these data. Later in these notes we consider how these likelihoods are calculated.

The analysis also produces a reduced model that is simpler than the final model. The reduced model is called an intercept only model in the Model Fitting Information output table. Because the reduced model is formed from the final model by removing the effect of Smokehis on PDstatus, the reduced model appears in the row labelled SMOKEHIS in the table of Likelihood Ratio Tests. (The row labelled Intercept in this table should be ignored.) The reduced model has only a single parameter because it proposes only one rate of PD occurrence (i.e., the rate is the same for both samples differing in smoking history under this model).
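These two -2*loge(L) values can be checked directly using the binomial likelihood formula introduced later in these notes. The sketch below (Python rather than SPSS; the function name is mine) computes Lfinal from the observed within-group PD rates (3/14 for smokers, 6/8 for never-smokers) and Lreduced from the pooled rate (9/22):

```python
import math

def binom_lik(m, n, p):
    """Likelihood of exactly m 'successes' in n independent trials,
    each success having probability p: nCm * p^m * (1-p)^(n-m)."""
    return math.comb(n, m) * p**m * (1 - p)**(n - m)

# Final model: each smoking-history group keeps its own PD probability
# (the observed rates: 3/14 for smokers, 6/8 for never-smokers).
L_final = binom_lik(3, 14, 3/14) * binom_lik(6, 8, 6/8)

# Reduced (intercept-only) model: one pooled PD probability, 9/22.
L_reduced = binom_lik(3, 14, 9/22) * binom_lik(6, 8, 9/22)

print(round(-2 * math.log(L_final), 3))    # ~5.087, as in the printout
print(round(-2 * math.log(L_reduced), 3))  # ~11.308
```

The difference between the two printed values is the 6.222 chi-square statistic reported in the Likelihood Ratio Tests table.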
The reduced model has a likelihood that can be represented Lreduced. From the printout shown above, the value of -2*loge(Lreduced) is 11.308, which corresponds to a likelihood of e^(-11.308/2) (=0.0035). Given the data, this model is less likely than the final model. This will be the case for any reduced model (because reduced models have fewer parameters than more complete models). In general, scientific modelling attempts to find the simplest model (i.e., the one with the fewest parameters) that provides an adequate fit to the data. The key decision in this kind of analysis is therefore whether the lower likelihood of the reduced model is a statistically acceptable “trade” for the reduced number of parameters involved in the reduced model.

The likelihood ratio test compares the likelihoods of the two models in order to make this decision. If the reduced model were true then it turns out that a function of the ratio of the two likelihoods (specifically, -2*loge[Lreduced/Lfinal]) would have a distribution that is approximated by the χ² distribution, with degrees of freedom given by the difference in the number of free parameters between the final and reduced models (2-1=1 in this case). From the properties of logs (see the A2WCD notes) we know that the log likelihood ratio statistic given above is equivalent to: 2*loge(Lfinal) - 2*loge(Lreduced). The value of the statistic in the present example is (-5.087 - [-11.308]) = 6.222. Note that this value appears under the “Chi-Square” column heading in the table of Likelihood Ratio Tests. The above notes (and the footnote on the SPSS output) should have made it clear why it is called a likelihood ratio test statistic and why it is tested against the chi-squared distribution. The value obtained (6.222) is considerably greater than the critical value for χ² with df=1 (for p=0.05, this is just under 4) and so the reduced model can be rejected in favour of the final model.
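Looking ahead briefly: the Parameter Estimates table in the printout above can also be reproduced from the four cell counts of Table 2. The B value for [SMOKEHIS=1] is the log of the odds-ratio, and the standard error used below is the usual large-sample formula for a log odds-ratio, the square root of the summed reciprocals of the cell counts (an assumption on my part; the notes themselves obtain these values from SPSS). A Python sketch:

```python
import math

# Cell counts from Table 2.
a, b = 3, 11   # smokers: has PD, no PD
c, d = 6, 2    # never-smokers: has PD, no PD

B = math.log((a / b) / (c / d))        # log odds-ratio
se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # large-sample SE of B
wald = (B / se) ** 2                   # Wald chi-square statistic

# 95% CI for Exp(B): exponentiate B +/- 1.96 standard errors.
lower = math.exp(B - 1.96 * se)
upper = math.exp(B + 1.96 * se)

print(round(B, 3), round(se, 3), round(wald, 3))  # -2.398 1.044 5.271
print(round(math.exp(B), 4))                      # 0.0909 (i.e., 9.09E-02)
print(round(lower, 4), round(upper, 4))           # ~0.0117 and ~0.7042
```

These match the B, Std. Error, Wald, Exp(B), and confidence-interval entries for [SMOKEHIS=1] in the Parameter Estimates output.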
This result means that Smokehis and PDstatus are not independent in this dataset: i.e., there is a significant effect of Smokehis on PDstatus (p=0.013).

We can compare the value of the likelihood ratio test statistic in this analysis with the value for the likelihood ratio statistic (G²) for two-way tables, which we can obtain using the SPSS CROSSTABS procedure (see A2WCD notes). The value is identical. The G² statistic is a special case of logistic analysis when there are only 2 variables in the contingency table.

We should also note that the Parameter Estimates output table shows estimates for the two parameters of the final model (B values). Note that the B parameter with a value of -2.398 represents the natural logarithm of the odds-ratio {loge(OR)}. Once again, we can calculate this for a 2x2 table using CROSSTABS (see A2WCD notes). The value Exp(B) “undoes” the effect of taking the logarithm, because the exponentiation function {Exp()} is the inverse operation to taking a logarithm (in much the same way as dividing is the inverse of multiplying). Exp(B) therefore gives us the odds ratio itself (OR) and its associated 95% confidence interval (CI; also available from CROSSTABS). Later on in these notes, we will discuss how these parameter estimates are constructed, and thus how they can be interpreted. When we move on to tables bigger than 2x2, the B parameter values shown in the output will each be different log(OR) values calculated from 4 cells within the larger table.

PART II – UNDERSTANDING THE STATISTICAL MODELLING TECHNIQUES USED IN LOGISTIC REGRESSION

Sample Estimates and Population Probabilities

This section is quite simple conceptually (and the technical bits are in boxes). If you grasp what’s going on, even if only roughly, then it will really help you to understand: (a) the process of executing logistic regression; (b) what to look at in the printed output; and (c) the jargon used in the printout.
Imagine we tested a random sample of 100 female subjects in their twenties in a dart-throwing experiment. We got each subject to take one throw at the board. We scored the data very simply: did the subject hit the scoring portion of the board? This generated a categorical variable hitboard with values: 1=yes; 2=no.

hitboard:   1=yes   2=no   Total
              60     40     100

Table 3. The summary data for the hitboard variable in the dart-throwing data

So the overall probability of hitting the dartboard was 0.6 (60/100; let’s call that probability q). The measured value of 0.6, from this particular measurement sample, might be used to estimate a particular population probability that we are interested in (e.g., the probability with which women in their twenties would hit a dartboard given a single throw; this population probability will be denoted by the letter p). We might ask what is the most likely value of the population probability given the sample value (q=0.6) that we obtained. For any hypothetical value of p, we can easily calculate the likelihood of getting exactly 60 hits from 100 women using probability theory. The underlying theory and mechanics of the calculation are described in the box below.

Using Probability Theory To Derive Likelihoods

Take tossing a coin as an easy example, which involves all the processes we are interested in. What is the likelihood of getting exactly 2 heads in 3 tosses of a completely fair coin? This is a binomial problem as there are just two outcomes for each trial (Heads, H; Tails, T). We can count the answer. There are 8 (i.e., 2³) possible sequences of 3 tosses which are all equally likely:

TTT; TTH; THT; HTT; HHT; HTH; THH; HHH

Only 3 of the sequences have exactly 2 Heads (THH; HTH; HHT), so the likelihood is 3/8 (=0.375). It is important that the outcome on each toss is independent of the outcome on every other toss.
Independence therefore means, for example, that tossing a H on one trial does not change the chance of getting a H on the next trial. In this way the 8 possible sequences shown above are equally likely.

This binomial problem, for 2 possible outcomes, can be described generally as trying to find the likelihood, L, of getting exactly m occurrences of outcome 1 in a total of N independent trials, when the probability of outcome 1 on a single trial is p. For our example above: N=3; m=2; p=0.5. (The value of p is 0.5 because it is a fair coin.) The general formula is:

L = NCm * p^m * (1 - p)^(N-m)

where NCm is the number of ways (combinations) that m outcomes of a particular type can be arranged in a series of N outcomes (answer=3 in our example). NCm is itself given by the following formula:

NCm = N! / (m! * [N-m]!)

where the symbol ! means the factorial function: X! = X*(X-1)*(X-2)*...*2*1. Thus, for example, 3! = 3*2*1. Check that the above formulae generate 0.375 as the answer to our coin-tossing problem. The multinomial formulae are an extension of the above to deal with cases where there are more than 2 types of outcome.

Applying Probability Theory to the Dart-Throwing Data

For our darts data, the values to plug into the formula are thus: there are 100 trials (i.e., N=100); and we are interested in the case where we obtained exactly 60 “hitboard=yes” outcomes (i.e., m=60). We can allow the value of p to vary in small steps from 0.05 to 0.95 and calculate a likelihood for each value of p. Putting these values into the formulae, we get the likelihoods that are shown in the following graph:

[Graph: likelihood (y-axis, 0 to 0.09) plotted against values of the p parameter (x-axis, 0 to 1); the curve peaks at around 0.08 when p = 0.6.]

It is fairly clear from the graph that the likelihood of getting the result q=0.6 is at a maximum for the value p=0.6.
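The formulae in the box are easy to check numerically. A Python sketch (function and variable names are mine) that verifies the coin-tossing answer and reproduces the sweep of p values behind the graph:

```python
import math

def binom_lik(m, n, p):
    """Likelihood of exactly m occurrences of outcome 1 in n
    independent trials: nCm * p^m * (1-p)^(n-m)."""
    return math.comb(n, m) * p**m * (1 - p)**(n - m)

# The coin-tossing check: 2 heads in 3 tosses of a fair coin.
print(binom_lik(2, 3, 0.5))  # 0.375

# The darts sweep: likelihood of exactly 60 hits in 100 throws,
# for values of p stepping from 0.05 to 0.95.
sweep = {p / 100: binom_lik(60, 100, p / 100) for p in range(5, 100, 5)}
best_p = max(sweep, key=sweep.get)
print(best_p, round(sweep[best_p], 3))  # peaks at p = 0.6, at about 0.08
```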
In fact, this might seem intuitively obvious: if the true value of p were, say, 0.5 then to get a sample estimate of 0.6 must mean that the random sample used was slightly better than expected. This seems less likely to occur than getting a sample which performs exactly as expected. A coin-tossing example may help: I think many people would intuitively “know” that the likelihood of getting 10 heads in 20 tosses of a fair coin is greater than the likelihood of getting 8 (or 12) heads (and much greater than the likelihood of getting 2 or 18 heads). This means that our sample value (q=0.6) is the best estimate for p we can make, given the data.

Another, possibly surprising, point that one might notice from the graph is that the likelihoods are all quite low. Even the maximum likelihood (for p=0.6) is only around 0.08. Even if the population probability really were 0.6, we would get sample values which differ from 0.6 some 92% of the time. The low values arise because here we are talking about the likelihood of getting exactly 60 hits out of 100 and, in psychology, we are more used to giving ranges of values. For example, we might more usefully give the 95% confidence intervals (CIs) around our sample estimate of q=0.6. These CIs give a range of values which, with 95% confidence, would be expected to contain the true value of p. (How to calculate such CIs is not discussed here.)

Maximum Likelihood Estimation

In general, if one has frequency data of this kind, and an underlying hypothesis (or model) that can be expressed in terms of particular probabilities, then one can create a computer program to estimate the values of those probabilities which are associated with the maximum likelihood of leading to the data values obtained. This is the process called maximum likelihood estimation. In the darts example, we would therefore say that 0.6 is the maximum likelihood estimate (MLE) of the underlying population probability parameter (p), given the data obtained.
It can also be said that the value p=0.6 provides the best fit to the data obtained in the experiment. For the simple dart-throwing example it was possible to work out the MLE for p by logic/intuition. For a more complex model, with several probabilities, numerical estimation by computer is often the only way to derive MLEs. Statistical packages, such as SPSS, use numerical methods to generate MLEs in several different kinds of analyses, including those involved in logistic regression.

Comparing Likelihoods

Recall that the likelihood was about 0.08 that the true value of p is 0.6, given our sample estimate. Statisticians do not apply any conventional likelihood values in order to draw conclusions about the real value of p. We do not apply the 0.05 convention used in hypothesis testing to evaluate likelihoods in these situations. (CIs, as described earlier, show how the 0.05 convention can be used in this situation.) Instead, statistical modelling works by comparing the likelihoods under two different hypotheses. Let us suppose that we have calculated the (maximum) likelihood under hypothesis 1 and a (maximum) likelihood under hypothesis 2. We can denote these hypotheses as H1 and H2, and the associated likelihoods as L1 and L2. As already noted, it has been found that -2 times the natural logarithm of the ratio between these two likelihoods (i.e., -2*loge[L1/L2]) has approximately the χ² distribution. Thus, we can use the log likelihood ratio and the χ² distribution to test whether H1 is significantly less likely than H2.

In analysing frequency data, this approach is typically used when there is a hierarchy of hypotheses of increasing complexity. Hence, loglinear modelling of frequency data is often referred to as hierarchical loglinear modelling. Analysis proceeds by finding the simplest hypothesis which is able to account for the observed data with a likelihood that is not significantly lower than the next most complex hypothesis in the hierarchy.
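The claim that -2*loge[L1/L2] behaves approximately like χ² can itself be checked by simulation. The sketch below (pure Python; the design and all names are mine, not from the notes) repeatedly generates data from two samples of 50 Bernoulli trials under a true single-probability hypothesis, computes the statistic for each simulated dataset, and confirms that its average is close to 1, the mean of the χ² distribution with 1 df:

```python
import math
import random

random.seed(1)

def xlogy(x, y):
    """x * ln(y), with the conventional limit of 0 when x is 0."""
    return 0.0 if x == 0 else x * math.log(y)

def lr_stat(m1, m2, n=50):
    """-2*loge(L1/L2) for two samples of n trials with m1 and m2
    successes (the binomial coefficients cancel in the ratio)."""
    p = (m1 + m2) / (2 * n)   # single-parameter estimate (H1)
    p1, p2 = m1 / n, m2 / n   # two-parameter estimates (H2)
    return 2 * (xlogy(m1, p1 / p) + xlogy(n - m1, (1 - p1) / (1 - p))
                + xlogy(m2, p2 / p) + xlogy(n - m2, (1 - p2) / (1 - p)))

# Simulate 2000 experiments in which H1 is true (p = 0.6 for both samples).
stats = []
for _ in range(2000):
    m1 = sum(random.random() < 0.6 for _ in range(50))
    m2 = sum(random.random() < 0.6 for _ in range(50))
    stats.append(lr_stat(m1, m2))

print(round(sum(stats) / len(stats), 2))  # close to 1, the chi-squared(1) mean
```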
We will illustrate this rather abstract and wordy statement with a concrete worked example.

Comparing Likelihoods in the Darts Data

In fact the darts data used above were not collected from 100 women but from a single individual. She was a right-handed woman with no previous experience of darts. The data reflect 100 throws at the board using her right and left hand on alternate throws. (The data from this experiment are on the J drive as darts study.sav.) The full contingency table is illustrated below.

                                hitboard
                          1=yes     2=no     Row Totals
Throwing    Right (=1)      40       10          50
Hand        Left  (=2)      20       30          50
Column Totals               60       40    Grand Total = 100

Table 4. The overall contingency data for the dart-throwing data

We might generate a hierarchy of two simple hypotheses about the subject’s performance:

H1: her ability to hit the board is unaffected by the hand she uses to throw
H2: her ability to hit the board is affected by the hand she uses to throw [1]

For the full contingency table 2 independent probability values were measured in the experiment: the probability of hitting the board measured for her right hand and the probability measured for her left hand. We will denote the measured sample probability values by qL and qR for the left and right hand respectively. (The probability values for missing the board are not independent of the probabilities for hitting the board: the probability for a hit plus that for a miss must add up to 1.)

We can represent the hypotheses H1 and H2 in terms of underlying population probabilities. H1 is an independence hypothesis (“throwing ability is independent of hand used”). According to H1 the true probability for a hit with the right hand (denoted pR) equals that for the left hand (pL). Because pR = pL we can replace these probabilities with a single value (denoted p; p = pR = pL). Hypotheses have parameters (probabilities in this case) and degrees of freedom. H1 is thus a “single-parameter” hypothesis (as it specifies only one probability value; i.e., p).
The degrees of freedom (df) for a hypothesis are given by the number of independent data points (the independent probabilities measured in this experiment; 2 in this case) minus the number of freely varying parameters of the hypothesis. Thus, for H1, df=(2-1)=1. H2 is a more complex hypothesis than H1 because it has two parameters. H2 says that the probabilities for a hit with the left and right hands are not equal; i.e., pR ≠ pL. Thus, 2 separate probabilities are needed to specify the hypothesis. It also follows that H2 has df=0.

A hypothesis produces a specific model when values are provided for the parameters of the hypothesis. A hypothesis with df=0 can always generate a model that is described as saturated. Saturated models are not very interesting because they describe (or fit) the data perfectly (in the sense that there is no discrepancy between the observed frequencies and those expected according to the model). The saturated model under hypothesis H2 would have the following values: pR=0.8 and pL=0.4. This model is the best-fitting version of hypothesis H2, given the data obtained: it is the version of H2 that has the maximum likelihood of generating the observed data.

[1] This is a nondirectional hypothesis. Given that the subject is right-handed, we might have had a directional hypothesis specifying that her performance is better with her right, than her left, hand.

Question 1
Can you explain why the best-fitting parameter values for H2 are pR=0.8 and pL=0.4? By looking at the formulae given in textbooks (or the A2WCD notes), work out what the values of the χ² and G² statistics would be under the expected frequencies generated by these probability values. (Hint: you do not need a calculator to do this, as long as you remember what log(1) is.)

From the probability values given above one can work out the likelihood of the data for the sample of right hand throws (=0.140) and the likelihood of the data for the left hand throws (=0.115).
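These two likelihood values can be verified with the binomial formula from the earlier box. A quick Python sketch (the function name is mine):

```python
import math

def binom_lik(m, n, p):
    """Binomial likelihood: nCm * p^m * (1-p)^(n-m)."""
    return math.comb(n, m) * p**m * (1 - p)**(n - m)

# Under the saturated H2 model: pR = 0.8 and pL = 0.4.
L_right = binom_lik(40, 50, 0.8)  # 40 hits in 50 right-hand throws
L_left = binom_lik(20, 50, 0.4)   # 20 hits in 50 left-hand throws

print(round(L_right, 3))  # ~0.140
print(round(L_left, 3))   # ~0.115
```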
Assuming that the observed probabilities in the two samples of throws are independent of one another (i.e., the success or failure of the throwing trials for the left hand does not influence the success or failure of the trials for the right hand, nor vice versa), then the overall likelihood across both samples can be worked out by multiplying the likelihoods for the two separate samples [2]. The overall likelihood is therefore 0.016 (=0.140*0.115). This independence assumption must be met in order to apply any kind of analysis of categorical data (from simple χ² tests to logistic regression). In addition, as noted earlier, the probability of hitting the board with any throw (with left or right hand) must be independent of the probability of hitting the board with any other throw; if this independence assumption is violated then the maximum likelihood estimation process (described above) will not estimate the true likelihoods, and the test statistics will not follow a chi-squared distribution.

Question 2
(The questions in this box illustrate the fact that the use of a particular statistical analysis technique may inform the choice between similar, but subtly different, designs for the same experiment.) Is the assumption of independence between the left hand and right hand darts data samples justified? Would it have been more or less justified if the subject had taken all her right hand throws first, followed by all her left hand throws? Is the probability of success with each throw of the dart likely to be independent of the probability of success of any other throw? From the independence point of view, would a better design have been to test 100 separate right-handed women for one throw each, with half of them (selected at random) being asked to use their left hand?

In general, a more complex hypothesis (such as H2) will be able to fit a set of data better than a simpler hypothesis with fewer parameters (such as H1).
Using the log-likelihood ratio technique, outlined earlier, we can see if the best-fitting version of the simpler hypothesis (H1) can fit the observed darts data with a likelihood that is not significantly lower than the likelihood calculated for H2. If the likelihood for the best-fitting H1 model is not significantly lower than that for the best-fitting H2 model, then we adopt H1 as the best-fitting hypothesis and conclude that dart-throwing accuracy was independent of the hand used. However, if the fit of the H1 model is significantly poorer than that of H2 (i.e., the likelihood of H1 is significantly lower than that of H2) then we can reject H1 and conclude that throwing accuracy was not independent of the hand used. (The details of the likelihood ratio calculation are given below and then checked using SPSS.)

[2] A basic axiom of probability theory states that if event A (occurring with probability pA) and event B (occurring with probability pB) are independent, then the probability of the occurrence of both A and B is given by (pA * pB).

To emphasise the analogy with ANOVA, one can think of the likelihood ratio statistic as testing the interaction between the variables Hand and hitboard. A significant interaction would simply mean that the probability of hitting the board was affected by the hand used (i.e., supporting H2); the lack of a significant interaction therefore supports H1. This way of thinking of the data is particularly helpful when we later analyse tables with more than two variables. Note also that we are usually not interested in the main effects under such analyses. The main effects in the darts data (i.e., for Hand and hitboard) would correspond to questions about the distributions of categories in the row and column totals. Specifically, the main effect for hitboard would tell us whether the ratio of hitboard=yes : hitboard=no responses, across the whole experiment, deviated from 50:50. This is not something of particular interest.
Because we sampled the data such that there were equal numbers of left hand and right hand throws, the Hand main effect is completely meaningless. When contingency table data are sampled with a clear separation between DVs and IVs (and thus are suitable for logistic analysis) it will generally be the case that the main effects of the IVs will be meaningless.

Calculating Log-Likelihood Ratios for the Darts Data

We already calculated that the likelihood for the best-fitting model under hypothesis H2 was 0.016. We denote this value by L2. The corresponding log-likelihood is -4.135. This model has 2 independent parameters (i.e., 2 probabilities). Hypothesis H1 has only a single parameter, the probability of hitting the board (independent of hand used). It turns out that the best estimate we have for this probability is the overall probability of hitting the board in Table 4 (i.e., 60/100 = 0.6). We can use the likelihood formulae given earlier to calculate the likelihood of getting 40 hits out of 50 with the right hand if the true probability were 0.6. This likelihood is 0.0014. Similarly, the likelihood of getting 20 hits out of 50 for the left hand (if the true probability were 0.6) is 0.002. The overall likelihood (L1) for the table is therefore (0.0014*0.002), i.e. 2.9 x 10^-6 (about 2.9 in a million). The corresponding log-likelihood is -12.764.

The ratio of the likelihood for the simpler model to the likelihood of the more complex model is thus L1/L2. We already noted that, if the simpler model were true, then the statistic -2*loge(L1/L2) would be distributed approximately as χ², with df equal to the difference in the number of parameters for the two models (here H2 has 2 parameters and H1 has 1; df = 1). But -2*loge(L1/L2) = (-2*loge[L1]) - (-2*loge[L2]). Therefore, the test statistic for the darts data is (-2 * -12.764) - (-2 * -4.135) = 25.528 - 8.270 = 17.258.
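The whole calculation can be reproduced in a few lines. A Python sketch (names are mine) that rebuilds L1 and L2 from the binomial formula and forms the test statistic:

```python
import math

def binom_lik(m, n, p):
    """Binomial likelihood: nCm * p^m * (1-p)^(n-m)."""
    return math.comb(n, m) * p**m * (1 - p)**(n - m)

# H2 (two parameters): separate hit probabilities, pR = 0.8 and pL = 0.4.
L2 = binom_lik(40, 50, 0.8) * binom_lik(20, 50, 0.4)

# H1 (one parameter): a single hit probability, p = 0.6, for both hands.
L1 = binom_lik(40, 50, 0.6) * binom_lik(20, 50, 0.6)

# Likelihood ratio test statistic: -2*loge(L1/L2).
stat = -2 * math.log(L1 / L2)
print(round(-2 * math.log(L1), 3))  # ~25.529
print(round(-2 * math.log(L2), 3))  # ~8.268
print(round(stat, 3))               # ~17.261
```

Carrying full precision through the calculation gives 17.261 rather than the hand-rounded 17.258.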
This is very much greater than the critical value for χ² with 1 df and so we can reject H1 in favour of H2. There is a highly significant effect of Hand on ability to hit the dartboard.

Checking The Result With SPSS

We can run a logistic regression on the darts data using the Analyze > REGRESSION >> MULTINOMIAL LOGISTIC procedure in SPSS. The key part of the printed output is shown below. The final model corresponds to the best-fitting probabilities under hypothesis H2. This model is found to have a -2*log-likelihood (-2LL) of 8.268. When the simpler model (H1) is fitted to the data, this reduced model corresponds to omitting the effect of the Hand variable from the full model. The reduced model (with Hand omitted) is found to have a -2LL of 25.529. The likelihood ratio test involves subtracting the -2LL value for the full model from the -2LL value for the reduced model. The resulting value (in this case 17.261) is tested against the χ² distribution, with df equal to the difference in the number of parameters between the two models (1 in this case). The result is highly significant. The values are the same as we got by hand earlier (within rounding errors).

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only        25.529
Final                  8.268           17.261      1   .000

Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                 8.268                      .000       0     .
HAND                     25.529                    17.261       1   .000

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

PART III – EXTENDING THE ANALYSES TO MULTIWAY DESIGNS

The analysis of larger contingency tables will be illustrated using a logistic analysis of a 3-way table.

3-Way Logistic Analysis Example: more parks data

The data relate to a larger study of Parkinson’s Disease (PD) and smoking history.
The data are in the file called more parks data.sav. The Smokehis and PDstatus variables, familiar from the small parks dataset, each now have 3 levels. The Sex of each subject is also recorded.

PDstatus: Clinicians have rated the presence or absence of the disease, or they called patients “borderline” if they did not quite meet the full clinical criteria [3].

Smokehis: Subjects are classified into those who have never smoked, those who gave up smoking more than 20 years ago, and those who have smoked in the last 20 years (including current smokers).

The output from the CROSSTABS procedure (using the “layer” option) is as follows:

smoking history * Parkinson's Disease Status * SEX Crosstabulation (Count)

                                                Parkinson's Disease Status
SEX      smoking history                      has disease  borderline  no disease  Total
female   never smoked                              9            3           5        17
         used to smoke (> 20 years ago)            4            2          12        18
         smokes or smoked in last 20 years         1            3          11        15
         Total                                    14            8          28        50
male     never smoked                              8            2           6        16
         used to smoke (> 20 years ago)            3            4          12        19
         smokes or smoked in last 20 years         3            2          10        15
         Total                                    14            8          28        50

[3] Note that we might regard the 3 values of PDstatus as having a natural order. Logistic regression analyses, such as the logistic analysis conducted in these notes, completely ignore this information.

In a logistic analysis the following effects of IVs on PDstatus need to be explored: Smokehis; Sex; and Smokehis*Sex. We noted earlier that each of these effects is really an interaction with PDstatus, and also noted that we are not interested in the true main effects of Smokehis or Sex. However, in this type of analysis, the Smokehis*PDstatus and Sex*PDstatus effects are conventionally referred to as main effects. The analyses reported below use a hierarchical approach to try to find the simplest model that provides an adequate model for the data. The underlying maths is just an extension of what we have already seen for 2-way tables.
Executing Step 1 of the Analysis

The data were analysed using the Analyze > REGRESSION >> MULTINOMIAL LOGISTIC procedure. In the first step of the analysis, the goal is to test whether it is necessary to include the highest order IV effect (Smokehis*Sex) to adequately explain the data. To do this, PDstatus was selected into the “Dependent variable” box and Smokehis and Sex were selected into the “Factor(s)” box. The Statistics button was selected and, in the resulting subwindow, only the “Likelihood ratio test” and “Goodness of fit chi-square statistics” options were checked. The Model button was selected and, in the resulting subwindow, the “Main effects model” option should be checked (this is the default). The selection of the main effects model means that the model includes only the main effects of Smokehis and Sex on PDstatus. The selected statistics options allow us to see a likelihood ratio test comparing the main effects model with the more complete model including the Smokehis*Sex effect.

The SPSS Output for Step 1 and Its Interpretation

The key printed output resulting from the first step of the analysis was as follows:

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only        49.361
Final                 35.013           14.348      6   .026

The first part of the output (above) compares the main effects model (termed the “Final” model on the output) with the simplest possible (“Intercept Only”) model. The intercept only model includes neither the Smokehis nor Sex effects. The Smokehis effect on PDstatus involves a total of 4 free parameters and the Sex effect on PDstatus involves a total of 2 free parameters [4]. The intercept only model does not include either of these effects and therefore has 6 more degrees of freedom than the final model. The difference in –2*log-likelihoods for the two models is 14.348 (labelled Chi-Square on the output). When compared with the χ2 distribution with 6 df, the likelihood ratio test statistic is significant (p=0.026).
This means that the best-fitting model cannot afford to drop the combination of the Smokehis and Sex effects, although it may be the case that we do not need to include both of these two effects in our ultimate model. We find out the answer to this question later. The next output table shows two Goodness-of-Fit indices (GFIs). These GFIs compare the fit of the model requested in the analysis (i.e., the main effects model) with a saturated model formed from all the possible IV effects (i.e., a model including Smokehis, Sex and Smokehis*Sex in this case). The Deviance GFI is the one to consult as it is a likelihood ratio test statistic comparing these two models. The 4 df arise because the effect which differentiates the two models (i.e., Smokehis*Sex) has four parameters in it. (When calculating the number of parameters, one must remember that this effect is really Smokehis*Sex*PDstatus.) This test statistic (again labelled Chi-Square in the output) has a value of 2.406, which does not even approach significance (p=0.662) when compared with the χ2 distribution with 4 df. This tells us that the interaction of the IVs Smokehis and Sex does not affect PDstatus.

Goodness-of-Fit

           Chi-Square   df   Sig.
Pearson      2.345       4   .673
Deviance     2.406       4   .662

The final output table presents two further likelihood ratio tests. Each test compares the final model (containing both Smokehis and Sex) with a simpler reduced model formed by leaving out one of the two effects in the final model. Leaving out Smokehis from the model is associated with a likelihood ratio statistic of 14.348, which is highly significant (p=0.006) when compared with the χ2 distribution with 4 df (df=4 because 4 parameters are needed to specify the Smokehis effect on PDstatus). We must therefore include the IV Smokehis in the model which captures the data in the fewest parameters.
Leaving out Sex from the model is associated with a likelihood ratio statistic of 0.007, which does not approach significance (p=0.997) when compared with the χ2 distribution with 2 df (df=2 because 2 parameters are needed to specify the Sex effect on PDstatus). We are therefore able to drop the IV Sex from the model which captures the data in the fewest parameters.

[4] A simple way of working out the number of parameters for an IV*DV effect is as follows: if A is the number of categories in the IV and B is the number of categories in the DV, then number of parameters = (A-1)*(B-1). For the IV Smokehis there are 3 category levels (=A), and for the DV PDstatus there are 3 category levels (=B); thus number of parameters = 4.

Likelihood Ratio Tests

Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                 35.013                     .000      0    .
SMOKEHIS                  49.361                   14.348      4   .006
SEX                       35.019                     .007      2   .997

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.

Step 2 of the Analysis

The preceding step confirms that a model including only an effect of Smokehis is the best model for the present data. The key contingency table is therefore given in Table 5 (i.e., it is collapsed across both sexes). We can then fit this model and look at the parameter estimates, as these can allow us to investigate which cells in Table 5 contribute to the Smokehis effect on PDstatus. We might be interested to know whether the two groups of smokers (quit long ago vs. smoked more recently) differ in their rates of PD, and which of these differ from nonsmokers. We might also be interested in whether any effects of smoking history are confined to the definite presence or absence of PD, or whether there might be an effect for cases with milder, more borderline symptoms.
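Table 5 is simply the 3-way crosstabulation summed over the two sexes. The collapsing step can be sketched as follows (an illustrative check with our own variable names, using the cell counts from the CROSSTABS output above):

```python
# Cell counts ordered [has disease, borderline, no disease], from the CROSSTABS output
female = {"never": [9, 3, 5], "quit>20yr": [4, 2, 12], "recent": [1, 3, 11]}
male   = {"never": [8, 2, 6], "quit>20yr": [3, 4, 12], "recent": [3, 2, 10]}

# Collapse across Sex by summing the corresponding cells
collapsed = {row: [f + m for f, m in zip(female[row], male[row])] for row in female}
row_totals = {row: sum(cells) for row, cells in collapsed.items()}
```

The collapsed counts (17, 5, 11; 7, 6, 24; 4, 5, 21) and row totals (33, 37, 30) are exactly the entries of Table 5.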
Smokehis                         has disease (=0)  borderline (=1)  no disease (=2)  Row Totals
never smoked (=0)                      17                 5               11              33
last smoked >20 years ago (=1)          7                 6               24              37
smoked in last 20 years (=2)            4                 5               21              30
Column Totals                          28                16               56      Grand Total = 100

Table 5. The Smokehis by PDstatus contingency data for the more parks data dataset

For step 2, once again we use the Analyze > REGRESSION >> MULTINOMIAL LOGISTIC procedure. PDstatus was selected into the “Dependent variable” box and Smokehis was selected into the “Factor(s)” box. The Statistics button was selected and, in the resulting subwindow, only the “Parameter Estimates” option was checked. (One doesn’t need to worry about the Model button here, because there is only one factor in our model.) The key output is as follows:

Parameter Estimates

                                                                                     95% CI for Exp(B)
PD Status     Parameter       B       Std. Error   Wald    df   Sig.   Exp(B)    Lower Bound  Upper Bound
has disease   Intercept     -1.658       .546      9.239    1   .002
              [SMOKEHIS=0]   2.094       .669      9.798    1   .002    8.114       2.187        30.098
              [SMOKEHIS=1]    .426       .694       .377    1   .539    1.531        .393         5.972
              [SMOKEHIS=2]   0(a)        0           .      0    .
borderline    Intercept     -1.435       .498      8.317    1   .004
              [SMOKEHIS=0]    .647       .734       .776    1   .378    1.909        .453         8.044
              [SMOKEHIS=1]    .0488      .675       .005    1   .942    1.050        .280         3.944
              [SMOKEHIS=2]   0(a)        0           .      0    .

(a) This parameter is set to zero because it is redundant.

As noted earlier the parameter estimates (B) are logs of odds ratios (ORs). Each parameter is therefore based on a comparison of 4 cells. The Parameter Estimates printed output table gives you enough information to work out how each B parameter is calculated. One can ignore the intercept parameters; they are not of interest. Notice that only 2 of the 3 categories of Parkinson’s Disease status are shown in the table (“has disease”=0 and “borderline”=1). This is because the other value (“no disease”=2) acts as the reference category.
Thus the parameters in the “has disease” part of the table refer to the odds of having the disease relative to not having the disease. The parameters in the “borderline” part of the table refer to the odds of being borderline, once again relative to not having the disease. Notice also that there are parameters against only 2 of the Smokehis categories: Smokehis=0 (i.e., never smoked) and Smokehis=1 (i.e., smoked more than 20 years ago). Once again this is because the missing category (recent smokers) acts as a reference. The parameters listed against Smokehis=0 refer to ratios of odds for never smokers divided by odds for the reference category (recent smokers). Similarly, the parameters listed against Smokehis=1 refer to ratios of odds for long ago quitters divided by odds for recent smokers. Consider the B parameter for “has disease” and Smokehis=0. This parameter is the natural logarithm of the following OR: the odds of having the disease (relative to not having the disease) amongst never smokers divided by the odds of having the disease (relative to not having the disease) amongst recent smokers. This is an important comparison for this research and involves 4 cells of Table 5 (the frequencies 17 and 11 for never smokers, and 4 and 21 for recent smokers). We can, therefore, calculate the OR concerned using the usual odds ratio formula given in textbooks (and in the A2WCD notes). The value is given by (17*21)/(4*11)=8.114. This checks with the value of Exp(B) given in the output, and loge(8.114) also agrees with the value for B (=2.094). As noted in the A2WCD notes, the standard error (SE) of the estimate for loge(OR) is given by the square root of the sum of the reciprocals of the 4 cell frequencies used to calculate the odds ratio. In this case the value is: √(1/17 + 1/21 + 1/4 + 1/11) = 0.669 (this checks with the value given by SPSS). This SE is used to produce the 95% CI around the estimated OR.
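The whole chain of calculations just described (OR, B = loge[OR], its SE, the 95% CI, and the Wald statistic that the output also reports) can be reproduced from the four cell frequencies. A sketch, with our own function name:

```python
import math

def odds_ratio_stats(a, b, c, d):
    """OR = (a*d)/(b*c) for a 2x2 subtable, plus loge(OR), its SE,
    the 95% CI for the OR, and the Wald statistic (B/SE)^2."""
    or_est = (a * d) / (b * c)
    b_param = math.log(or_est)                 # B in the SPSS output
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)      # SE of loge(OR)
    ci = (math.exp(b_param - 1.96 * se), math.exp(b_param + 1.96 * se))
    wald = (b_param / se) ** 2                 # compare with chi-squared, 1 df
    return or_est, b_param, se, ci, wald

# Never smokers vs recent smokers; has disease vs no disease (cells 17, 11, 4, 21 of Table 5)
or_est, b_param, se, ci, wald = odds_ratio_stats(17, 11, 4, 21)
```

This gives OR = 8.114, B = 2.094, SE = 0.669, a CI of roughly (2.19, 30.1) and a Wald statistic of 9.798, matching the [SMOKEHIS=0], “has disease” row of the Parameter Estimates output.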
Note that for this parameter our best estimate of the OR is 8.114 and we are 95% confident that it lies between 2.2 and 30.1. This means that we can confidently reject the hypothesis that the OR in question is 1 (the value expected if there were no effect). In this particular case, we conclude that never smokers are 8 times more likely than recent smokers to have PD (compared to not having PD), and that this increased risk is significant. The SE(loge[OR]) is also used to calculate a Wald statistic that is used to test whether the OR differs significantly from 1. The Wald statistic is simply given by (B/SE[B])^2. If the OR in question were 1, then the Wald statistic would have a distribution approximated by the χ2 distribution with 1 df. In this case the value of 9.798 is comfortably greater than the critical value for χ2, allowing us to reject the hypothesis that the OR concerned is 1 (p=0.002). The other parameters in the Parameter Estimates table tell us that long ago smokers and recent smokers do not differ from one another in their odds of PD relative to no PD (OR=1.53, ns), nor in their odds of borderline PD relative to no PD (OR=1.05, ns). Similarly, never smokers and recent smokers do not differ from one another in their odds of borderline PD relative to no PD (OR=1.91, ns).

Question 3

We might well want to know whether long ago smokers and never smokers differed in their odds of getting PD (or borderline PD). Can you recode the Smokehis variable in the dataset to give the relevant ORs in the parameter estimates table? Hint: SPSS’s MULTINOMIAL LOGISTIC procedure always uses the highest numbered category as the reference category when calculating parameters in these analyses.

A Warning About Parameter Estimates

The same logistic analysis model can be represented mathematically in many different (but essentially identical) ways.
Although the overall likelihood ratios and p-values are not affected, the “alternative parametrisations” naturally lead to differing parameter estimates with different interpretations. (This is similar to the issue of dummy and effect coding in multiple linear regression.) The parametrisation used by SPSS MULTINOMIAL LOGISTIC recodes the IVs in the model by using indicator variables, which take values of 0 and 1. Thus, for an IV with 3 categories (e.g., Smokehis in the above example), 2 indicator variables are needed. Indicator 1 will be 1 for the first category and 0 for the others; indicator 2 will be 1 for the second category and 0 for the others. As we have seen, the third category, given a zero in both indicator variables, will act as the reference category in the parameter estimates table (so indicator coding here is the same as dummy coding in multiple linear regression). Although indicator variable parametrisation is common, and leads to easy-to-interpret parameters, other statistical packages (and indeed other procedures within SPSS) use alternative methods. The parameter estimates obtained under packages or procedures with different parametrisations will differ from those obtained by SPSS MULTINOMIAL LOGISTIC. Logistic Regression The approach illustrated in these notes has been for logistic analysis (LA). Recall that LA is a special case of logistic regression (LR): all the IVs in LA are categorical whereas some (or all) of the IVs in LR can be continuous numerical variables. The relationship between LA and LR is identical to the relationship between ANOVA and multiple linear regression. To carry out an LR analysis including some continuous IVs, all that one needs to do when running the MULTINOMIAL LOGISTIC procedure is to enter the continuous IVs as “covariates” and the categorical IVs as “factors”. 
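To make the indicator-variable parametrisation described above concrete, here is a small sketch (the function and variable names are ours, not SPSS's) showing how a 3-category IV such as Smokehis is recoded:

```python
def indicator_code(category, n_levels):
    """Recode a 0-based category as n_levels-1 indicator (dummy) variables.
    The highest-numbered category gets all zeros and acts as the reference."""
    return [1 if category == i else 0 for i in range(n_levels - 1)]

# Smokehis: 0 = never smoked, 1 = quit >20 years ago, 2 = recent smoker (reference)
codes = {level: indicator_code(level, 3) for level in (0, 1, 2)}
```

Category 0 becomes (1, 0), category 1 becomes (0, 1), and category 2 becomes (0, 0), which is why the highest-numbered category serves as the reference in the parameter estimates table.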
The Computer Class for week 10 will explore a LR analysis with a mixture of categorical and continuous IVs, and this should illustrate all the minor differences between LR and LA.

Question 4

Can you use the MULTINOMIAL LOGISTIC procedure with the small sample data on predicting a skier's fall from the difficulty of the run and the season of the year (this example is from chapter 12 of Tabachnik and Fidell)? These data are on the J drive under the file name tab and fid log reg example.sav. Hint: to get the same parameter estimates as them you will have to use the recoded fall and season variables fallrev and seasonr respectively. (N.B. Although the likelihood ratio test statistics and p-values come out the same as in the book, the absolute log-likelihoods differ slightly owing to a minor difference in how they are calculated.)

A few brief comments about LR are warranted. Firstly, the OR parameters obtained for a continuous IV reflect the increase (or decrease) in odds of a particular category in the DV (relative to a reference category) for a one unit increase in the continuous IV. In other words, the odds ratios compare the odds of a particular DV category for subjects with a continuous IV score of (x+1) with the odds of the DV category for subjects with a continuous IV score of x. Secondly, it is well-known that contingency table analyses using statistics such as the Pearson χ2 test need adequate expected frequencies in all or most cells of the table. A related recommendation is that few (or none) of the cells in the table should have zero observed frequencies. If these conditions are not met, the statistics calculated may not be well approximated by the χ2 distribution and so the resulting p-values may be inaccurate. The same issue applies to logistic regression, and indeed the SPSS MULTINOMIAL LOGISTIC procedure prints a warning about zero observed frequency cells (when you have the Goodness-of-fit option selected under the Statistics button).
When one has a continuous IV in a LR, that IV can potentially take many differing values across the whole dataset. The LR analysis is based on a so-called “covariate pattern”, which is formed by crossing each observed value of each continuous IV (and each value of each factor) in the model with every value of the categorical DV in the model. Any cell in this covariate pattern which does not have at least one observation in it is a zero observed frequency cell in terms of logistic regression. There are not likely to be (m)any zero frequency cells in the covariate pattern if you have very large samples and/or continuous IVs with a small range of possible values. However, in other cases, it should be obvious that using continuous IVs will often produce empty cells -- particular values of the continuous variable might be rare or unique in the dataset and thus will not occur with each value of the DV. One can minimise the zero-cell problem by recoding the continuous IV scores into a small number of ordered values (e.g., quartile scores). For a particular IV, and quartile scoring, all subjects with scores in the lowest 25% of the sample are given a value of 1 for the recoded IV; subjects in the next 25% are given a score of 2, and so on. When this kind of recoded continuous IV is entered as a covariate in a LR, this preserves the ordinal and interval nature of subjects’ (recoded) scores on that IV. This contrasts with a categorical IV (factor), where the category levels have no particular numerical relationship to one another. We will carry out this kind of recoding when analysing the LR example in the computer class. Thirdly, a “full factorial” model in these analyses is conventionally considered to include all the main effects and interactions formed by the categorical IVs (factors) plus all the main effects of the continuous IVs (covariates).
Interactions between covariates or between covariates and factors are not included in such a specification, and so if one wants to explore these effects one will have to use the “Custom model” option in SPSS. Finally, the nature of the saturated model in logistic regression may not be obvious. (Recall that saturated models are uninteresting models which are perfectly able to capture the observed DV frequencies. They can do this because they have the same number of parameters as there are independent data points and so have df=0.) The way that continuous IVs (covariates) are specified in the model is very efficient: we use only (m-1) parameters across the whole range of covariate values, where m is the number of categories of the DV. Hence, the df of likelihood ratio tests relating to the removal of a covariate from a model will be (m-1). As already noted, for a covariate and a particular pair of categories of the DV, the parameter specifies the increase in odds of one of the DV categories (relative to the other) for each unit increase in covariate score. The saturated model, by contrast, requires a different parameter for each value of the covariate but one [5] (as if it were being treated like a categorical factor). So, if there are k different observed values of the covariate in your dataset, the saturated model will include (m-1)*(k-1) parameters for each covariate. (Note that k is not necessarily the same as the number of possible values for the covariate, as there may well be no subjects in your data who score at particular covariate values.) The saturated model will also include (m-1)*(k-1)*(j-1) parameters for each interaction with each of the other effects in the model, where j is the number of observed values in the dataset for the other covariate or factor (there would also be parameters for higher order interaction terms as well, if present).
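These counting rules can be illustrated numerically. The sketch below is ours (the function names and the example values m=3, k=40, j=2 are invented for illustration):

```python
def covariate_params(m):
    """Parameters for one covariate in the efficient (linear) specification."""
    return m - 1

def saturated_covariate_params(m, k):
    """Parameters the saturated model needs for a covariate with k observed values."""
    return (m - 1) * (k - 1)

def saturated_interaction_params(m, k, j):
    """Parameters for that covariate's interaction with an effect having j observed values."""
    return (m - 1) * (k - 1) * (j - 1)

# A 3-category DV and a covariate taking 40 distinct observed values:
linear = covariate_params(3)                          # 2 parameters
saturated = saturated_covariate_params(3, 40)         # 78 parameters
interaction = saturated_interaction_params(3, 40, 2)  # 78 more for a 2-level factor
```

So a covariate that costs only 2 parameters in the fitted model costs 78 in the saturated model, plus 78 more for each interaction with a 2-level factor.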
We can see that the number of parameters for saturated models with covariates can rapidly become very large if the covariates are not recoded into a small number of values. The (deviance) goodness-of-fit tests for a specific model are log-likelihood ratio tests which compare the log-likelihood of obtaining the observed data based on the specific model with the log-likelihood based on the saturated model [6]. The df for this test are the difference between the number of parameters needed for the specific model and the larger number of parameters needed for the saturated model. Without very large samples of subjects, these tests will be unreliable if the covariate scores are not recoded into a small number of possible values.

[5] This is the usual “minus one” rule for df. If we have k different values of the covariate we need parameters to specify the (conditional) probability of a particular DV outcome at k-1 of these values. The probability for the final (kth) value of the covariate is not free to vary because the probabilities must sum to 1.

[6] The likelihood of the observed data given the saturated model is 1 and so the log-likelihood is 0.
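The quartile recoding recommended earlier can be sketched as follows (the scores shown are invented for illustration and are not from any of the course datasets):

```python
import statistics

# Hypothetical continuous IV scores (illustration only)
scores = [3, 7, 8, 12, 15, 18, 21, 22, 25, 30, 31, 40]

# Three cut points dividing the sample into quartiles
cuts = statistics.quantiles(scores, n=4)

def quartile_score(x):
    """Recode a raw score as 1-4 according to its sample quartile."""
    return 1 + sum(x > c for c in cuts)

recoded = [quartile_score(x) for x in scores]
```

Entering the recoded scores as a covariate (rather than the raw scores) keeps the number of distinct covariate values, and hence the number of potential zero-frequency cells, small, while preserving the ordering of subjects' scores.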