Regression modelling with a categorical outcome: logistic regression

Logistic regression is similar to linear regression: the right hand side of the regression model below works in the same way. It is the outcome on the left hand side that is different. The standard multiple regression model

Y = a + b1x1 + b2x2 + … + e

where Y is a continuous outcome, requires a linear relationship between the outcome and the explanatory variables. Where Y is a binary variable (one that can take only two values, e.g. dead/alive, yes/no; also known as dichotomous) this relationship is not possible, so a mathematical function, known as the 'link' function, must be applied to the binary outcome. This is a transformation of Y that creates a linear relationship with the explanatory variables. These models are known as generalized linear models (GLMs).

Generalized linear models can also be used to model other types of outcome:
- Binary outcome from an RCT or case-control study: logistic regression
- Event rate or count: Poisson regression
- Binary outcome from a matched case-control study: conditional logistic regression
- Categorical outcome with more than 2 categories: multinomial regression
- Time to event data: exponential or Weibull models

Model fitting

Given that p is the probability of a person having the event of interest, the function modelled as the outcome in a logistic regression is:

logit(p) = ln(p / (1 - p))

where p / (1 - p) = (probability of the event occurring) / (probability of the event not occurring) = the odds of the event.

The model is therefore

logit(p) = a + b1x1 + b2x2 + …

where a is the estimated constant and the b's are the estimated regression coefficients. Unlike linear regression there is no separate residual term on this scale: the random variation is captured by the binomial distribution assumed for the outcome.

The parameters of this model are estimated using a different method from linear regression, which uses least squares. Logistic regression uses maximum likelihood estimation, an iterative procedure that finds the regression coefficients which maximise the likelihood of the observed results, assuming an underlying binomial distribution for the data. For a good explanation of maximum likelihood and regression modelling in general see 'Statistics at Square Two' by Michael Campbell (2nd edition, BMJ Books, Blackwell).

The significance of each variable in the model is assessed using the Wald statistic, which is based on the estimated regression coefficient b divided by its standard error. The null hypothesis being tested is that b is zero.

The estimated regression coefficients are used to calculate the odds ratio, which is the result most commonly reported from a logistic regression model and used to interpret the results. e^b (the exponential of the coefficient) gives an estimate of the odds ratio: the estimated change in the odds of the outcome associated with a particular variable in the model, treating all other variables as fixed. For a categorical variable it gives the odds in one group relative to the other (or relative to a baseline group if there are more than 2 levels of the variable). For a continuous variable it gives the increase in odds associated with a one unit increase in that variable.

For example, suppose the estimated regression equation is:

logit(p) = -21.18 + 0.075 age - 0.77 diabetes

For age, e^0.075 = 1.078, indicating that an increase in age of 1 year increases the estimated odds of the event by 7.8%. For diabetes, e^-0.77 = 0.463, indicating that the odds of the event for people with diabetes are just under half the odds of those without. Equivalently, the odds of the event for people without diabetes are 1/0.463 = 2.16 times those of people with diabetes.
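To make this concrete, here is a minimal sketch of fitting such a model, not in SPSS (which this handout describes) but in Python using the statsmodels package; the data file "cohort.csv" and the column names 'age', 'diabetes' and 'event' are hypothetical. Exponentiating the fitted coefficients gives the odds ratios discussed above.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data set: 'event' is the binary outcome (1 = event occurred),
    # 'age' is continuous and 'diabetes' is coded 0/1.
    df = pd.read_csv("cohort.csv")

    X = sm.add_constant(df[["age", "diabetes"]])
    fit = sm.Logit(df["event"], X).fit()   # maximum likelihood estimation (iterative)

    print(fit.summary())                   # coefficients, standard errors and Wald (b / SE) tests

    odds_ratios = np.exp(fit.params)       # e^b for each term in the model
    or_ci = np.exp(fit.conf_int())         # 95% confidence intervals on the odds ratio scale
    print(odds_ratios)
    print(or_ci)

The odds ratio for age is then interpreted per one unit (one year) increase, and the odds ratio for diabetes relative to the group without diabetes, exactly as in the worked example.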
An odds ratio of 1 corresponds to the null hypothesis that there is no difference between the two groups.
An odds ratio > 1 indicates increased odds of having the event.
An odds ratio < 1 indicates decreased odds of having the event.

The estimated regression equation can also be used to estimate the predicted probability of an event for each observation in the model. For each individual, if we calculate the value of the right hand side of the regression equation for their particular set of values as Z = a + b1x1 + b2x2, then the predicted probability of the event is

p = e^Z / (1 + e^Z)

Model selection

For multiple logistic regression models the same methods of variable selection apply as for multiple linear regression models. Stepwise, backward, and forward selection are all available in SPSS. As for linear regression, these automatic methods should be used with caution as they are no substitute for thinking about the data and developing models by hand. Backward selection has been recommended as the best of these methods. As for linear regression there is a rule of thumb about sample size, but in logistic regression it is that you need a minimum of ten occurrences of the 'event' for each variable included in the model. SPSS provides Enter (where variables are placed in the model by the investigator) and forward and backward selection (using different criteria for selecting variables: Wald, likelihood ratio, and conditional).

In logistic regression the observed and predicted values are used to assess the fit of the model. The measure used is the log-likelihood. This is similar to the residual sum of squares in linear regression in that it measures how much unexplained variation remains after the model has been fitted; software packages usually report -2 times the log-likelihood (-2LL), and large values of -2LL indicate a poorly fitting model. The -2LL statistic is used to compare different models: the change in -2LL between the basic model, which includes only the constant term, and a model including one or more variables is compared to the chi-squared distribution with k degrees of freedom (where k is the change in the degrees of freedom between the two models). This method is recommended for assessing the change in model fit when new variables are added to the model. However, it can be laborious in some software, and looking at the significance of variables in the model using the Wald statistics is likely to lead to the same conclusions. In SPSS output, the table 'Omnibus tests of model coefficients' reports the model chi-square statistic, which is the change in -2LL when variables are added to the model. If this is significant at p < 0.05 it tells us that a model including these variables is significantly better than a model including the constant alone. SPSS also calculates two approximations of R-squared; these are not as useful as R-squared is in linear regression. An alternative test of fit is the Hosmer-Lemeshow test, which divides the observations into ten groups based on their predicted probabilities and compares the predicted and observed numbers of events in each group. A significant p-value indicates that the model is a poor fit to the data.

Predicted probabilities/residuals

Our model enables us to calculate the predicted probability of an event for each participant. SPSS outputs the predicted probability for each participant and their predicted group membership, so we can compare how well the model predictions match the actual events.
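Continuing the hypothetical Python/statsmodels sketch above (again only an illustration of the calculations, not the SPSS procedure this handout describes), the predicted probabilities, the change in -2LL relative to the constant-only model, and a simple comparison of predicted against observed events could be obtained like this:

    import numpy as np
    import pandas as pd

    # 'fit', 'X' and 'df' are taken from the earlier sketch.
    Z = X.to_numpy() @ fit.params.to_numpy()     # Z = a + b1*x1 + b2*x2 for each individual
    predicted_p = np.exp(Z) / (1 + np.exp(Z))    # p = e^Z / (1 + e^Z); same as fit.predict(X)

    # Change in -2LL between the constant-only model and the fitted model,
    # compared to a chi-squared distribution (the 'omnibus' test of the model).
    change_in_minus_2ll = 2 * (fit.llf - fit.llnull)
    print(change_in_minus_2ll, fit.llr_pvalue)

    # Predicted group membership (cut-off 0.5) cross-tabulated against observed events.
    predicted_group = (predicted_p >= 0.5).astype(int)
    print(pd.crosstab(df["event"], predicted_group))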
SPSS also produces a classification table and plot summarising these predictions.

We can compute and inspect the same residuals and diagnostic measures as for linear regression. The main purpose of examining these is to identify points where the model fits poorly and points which exert a large influence on the results. SPSS gives us studentized, standardized, and deviance residuals; any absolute values above 3 are cause for concern. Plots of Cook's statistic and leverage against patient identification number can also identify outlying observations (as for standard linear regression). Look for large values that lie apart from the rest of the body of the data (particularly values of Cook's statistic > 1). Values of DFBeta greater than 1 also indicate potentially influential observations.

References

For guides to conducting analyses in SPSS:
Kinnear, P. and Gray, C. SPSS for Windows Made Simple: Release 10. Psychology Press.
Field, A. Discovering Statistics Using SPSS. 2nd edition, Sage. (Very user-friendly and clear.)