Regression modelling with a categorical outcome: logistic regression

Logistic regression is similar to linear regression: the right-hand side of the regression model below works in the same way. It is the outcome on the left-hand side that is different. The standard multiple regression model,
Y = a + b1x1 + b2x2 + … + e,
where Y is a continuous outcome, requires a linear relationship between the outcome and the explanatory variables. When Y is a binary variable (one that can take one of two values, e.g. dead/alive, yes/no; also known as dichotomous), this relationship is not possible, so some form of mathematical function, known as the 'link' function, needs to be applied to the binary outcome. This transformation of Y creates a linear relationship with the explanatory variables. These models are known as generalized linear models (GLMs).
Generalized linear models can also be used to model other types of outcome:
Binary outcome from an RCT or case-control study: logistic regression
Event rate or count: Poisson regression
Binary outcome from a matched case-control study: conditional logistic regression
Categorical outcome with >2 categories: multinomial regression
Time to event data: exponential or Weibull models
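As a rough illustration of how two of these outcome types map onto GLM families, here is a minimal sketch in Python's statsmodels package (not SPSS, which is used elsewhere in this text); the data are simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(100, 2)))  # two explanatory variables

# Binary outcome -> logistic regression (Binomial family, logit link)
y_binary = rng.integers(0, 2, 100)
logit_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()

# Count outcome -> Poisson regression (log link)
y_count = rng.poisson(2.0, 100)
poisson_fit = sm.GLM(y_count, X, family=sm.families.Poisson()).fit()

print(logit_fit.params, poisson_fit.params)
```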
Model fitting
Given that p is the probability of a person having the event of interest, the function modelled as the outcome in a logistic regression is:
logit(p) = ln(p / (1 - p))
where p / (1 - p) = (probability of the event occurring) / (probability of the event not occurring) = the odds of the event.
The model is therefore:
logit(p) = a + b1x1 + b2x2 + … + e
where a is the estimated constant, the b's are the estimated regression coefficients, and e is the residual term.
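As a quick numeric illustration of the logit transformation (the value p = 0.2 is arbitrary):

```python
import numpy as np

p = 0.2                                       # probability of the event
odds = p / (1 - p)                            # 0.25
logit = np.log(odds)                          # ln(0.25) ≈ -1.39

# The inverse transformation recovers the probability
p_back = np.exp(logit) / (1 + np.exp(logit))  # 0.2
print(odds, logit, p_back)
```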
The parameters of this model are estimated using a different method from linear regression, which uses least squares: logistic regression uses maximum likelihood estimation. This is an iterative procedure that gives the regression coefficients which maximise the likelihood of our results, assuming an underlying binomial distribution for the data. For a good explanation of maximum likelihood and regression modelling in general see 'Statistics at Square Two' by Michael Campbell (2nd edition, BMJ Books, Blackwell).
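To make the fitting step concrete, here is a minimal sketch in Python's statsmodels, which also fits logistic regression by maximum likelihood; the data are simulated and the coefficients are chosen for illustration only.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data for illustration only
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(40, 80, n)
diabetes = rng.integers(0, 2, n)
true_logit = -5 + 0.075 * age - 0.77 * diabetes
event = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Coefficients estimated by maximum likelihood (iterative Newton-type optimisation)
X = sm.add_constant(np.column_stack([age, diabetes]))
fit = sm.Logit(event, X).fit()   # prints the iteration log by default
print(fit.summary())
```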
The significance of each variable in the model is assessed using the Wald statistic, which is the estimate of the regression coefficient b divided by its standard error. The null hypothesis being tested is that b is zero.
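For example, with a hypothetical coefficient and standard error, the Wald statistic and its two-sided p-value can be computed as follows:

```python
from scipy.stats import norm

b, se = 0.075, 0.021            # hypothetical coefficient and standard error
z = b / se                      # Wald statistic
p_value = 2 * norm.sf(abs(z))   # two-sided test of H0: b = 0
print(z, p_value)
```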
The estimated regression coefficients are used to calculate the odds ratio, which is the
result most commonly reported from a logistic regression model and used to interpret the
results.
e^b (the exponential of the coefficient) gives us an estimate of the odds ratio.
This is the estimate of the odds of the outcome for a particular variable in the model,
treating all other variables as fixed. For a categorical variable it gives the odds in one group
relative to the other (or a baseline group if there are more than 2 levels of the variable). For
a continuous variable it gives the increase in odds associated with a one unit increase in
that variable. For example, suppose the estimated regression equation is:
logit(p) = -21.18 + 0.075 age - 0.77 diabetes
For age, e^0.075 = 1.078, indicating that an increase in age of 1 year increases the estimated odds of the event by 7.8%.
For diabetes, e^-0.77 = 0.463, indicating that the odds of the event for people with diabetes are just under half the odds of those without. Alternatively, the odds of the event for people without diabetes are 1/0.463 = 2.16 times those of people with diabetes.
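The same arithmetic from the worked example, done explicitly:

```python
import numpy as np

# Coefficients from the worked example above
b_age, b_diabetes = 0.075, -0.77

print(np.exp(b_age))            # ≈ 1.078: odds rise 7.8% per year of age
print(np.exp(b_diabetes))       # ≈ 0.463: lower odds with diabetes
print(1 / np.exp(b_diabetes))   # ≈ 2.16: odds without diabetes vs with
```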
An odds ratio of 1 corresponds to the null hypothesis that there is no difference between the two groups.
An odds ratio > 1 indicates increased odds of having the event.
An odds ratio < 1 indicates decreased odds of having the event.
The estimated regression equation can be used to calculate the predicted probability of the event for each observation in the model. For each individual, if we calculate the value of the right-hand side of the regression equation for their particular set of values as Z = a + b1x1 + b2x2 + …, then the predicted probability of the event is:
P = e^Z / (1 + e^Z)
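Using the worked equation above, the predicted probability for any combination of age and diabetes status can be computed directly:

```python
import numpy as np

def predicted_probability(age, diabetes):
    # Linear predictor Z from the worked example above
    z = -21.18 + 0.075 * age - 0.77 * diabetes
    return np.exp(z) / (1 + np.exp(z))

print(predicted_probability(age=65, diabetes=0))
print(predicted_probability(age=65, diabetes=1))
```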
Model selection
For multiple logistic regression models the same methods of variable selection apply as for multiple linear regression models. Stepwise, backward, and forward selection are all available in SPSS. As for linear regression, these automatic methods should be used with caution, as they are no substitute for thinking about the data and developing models by hand. Backward selection has been recommended as the best method. As for linear regression, a rule of thumb is a minimum of ten outcomes per variable included in the model; in the case of logistic regression this means ten occurrences of the event of interest per variable.
SPSS provides Enter (where variables are placed in the model by the investigator) and forward and backward selection (using different methods of selecting variables: Wald, likelihood ratio, and conditional).
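Outside SPSS, backward selection is easy to sketch by hand. The function below is a rough illustration only: it repeatedly drops the variable with the largest Wald p-value. The column names, simulated data, and the 0.05 threshold are assumptions, and the same caveats about automatic selection apply.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_select(df, outcome, candidates, threshold=0.05):
    """Backward elimination on Wald p-values: drop the least
    significant variable until all remaining ones have p < threshold.
    A crude sketch; no substitute for thinking about the data."""
    selected = list(candidates)
    while selected:
        X = sm.add_constant(df[selected])
        fit = sm.Logit(df[outcome], X).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        if pvals.max() < threshold:
            return fit
        selected.remove(pvals.idxmax())
    return None

# Simulated data: 'noise' is unrelated to the outcome and should be dropped
rng = np.random.default_rng(2)
df = pd.DataFrame({"age": rng.uniform(40, 80, 300),
                   "diabetes": rng.integers(0, 2, 300),
                   "noise": rng.normal(size=300)})
p = 1 / (1 + np.exp(-(-5 + 0.075 * df["age"] - 0.77 * df["diabetes"])))
df["event"] = rng.binomial(1, p)

result = backward_select(df, "event", ["age", "diabetes", "noise"])
print(result.summary() if result is not None else "no variables retained")
```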
In logistic regression the observed and predicted values are used to assess the fit of the model. The measure used is the log-likelihood. This is similar to the residual sum of squares in linear regression in that it measures how much unexplained variation remains after the model has been fitted. Software packages usually report -2LL (minus twice the log-likelihood); large values of -2LL indicate a poorly fitting model. The -2LL statistic is used to compare different models: the difference in -2LL between the basic model, which includes only the constant term, and a model including one or more variables is compared to the chi-squared distribution with k degrees of freedom (where k is the change in the number of parameters between the two models). This method is recommended for assessing the change in model fit when new variables are added to the model. However, it can be laborious in some software, and looking at the significance of variables in the model using the Wald statistics is likely to lead to the same conclusions.
In the SPSS output, the table "Omnibus tests of model coefficients" gives the model chi-square statistic, which is the change in -2LL when variables are added to the model. If this is significant at p < 0.05, then a model including these variables is significantly better than a model including the constant alone.
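A sketch of the same likelihood-ratio comparison outside SPSS, using simulated data; 2 × (difference in log-likelihoods) equals the change in -2LL:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(3)
n = 300
age = rng.uniform(40, 80, n)
diabetes = rng.integers(0, 2, n)
event = rng.binomial(1, 1 / (1 + np.exp(-(-5 + 0.075 * age - 0.77 * diabetes))))

# Constant-only model versus the model with both variables
null_fit = sm.Logit(event, np.ones((n, 1))).fit(disp=0)
full_fit = sm.Logit(event, sm.add_constant(np.column_stack([age, diabetes]))).fit(disp=0)

# Change in -2LL, compared to chi-squared with k = 2 degrees of freedom
lr_stat = 2 * (full_fit.llf - null_fit.llf)
print(lr_stat, chi2.sf(lr_stat, df=2))
```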
SPSS also calculates two approximations to R². These are not as useful as R² is in linear regression. An alternative test of fit is the Hosmer-Lemeshow test. This divides the observations into ten groups based on their predicted probabilities from the model and compares the predicted and observed numbers of events in each group. A significant p-value indicates that the model is a poor fit to the data.
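The Hosmer-Lemeshow statistic itself is straightforward to sketch: group observations into deciles of predicted probability and compare observed with expected event counts (the conventional g - 2 degrees of freedom are used here).

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, n_groups=10):
    """Group observations by deciles of predicted probability and
    compare observed with expected event counts in each group."""
    data = pd.DataFrame({"y": y, "p": p_hat})
    data["g"] = pd.qcut(data["p"], n_groups, labels=False, duplicates="drop")
    stat = 0.0
    for _, grp in data.groupby("g"):
        observed = grp["y"].sum()
        expected = grp["p"].sum()
        n, pbar = len(grp), grp["p"].mean()
        stat += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
    k = data["g"].nunique()
    return stat, chi2.sf(stat, df=k - 2)

rng = np.random.default_rng(4)
p_hat = rng.uniform(0.05, 0.95, 500)   # predicted probabilities
y = rng.binomial(1, p_hat)             # outcomes consistent with them
print(hosmer_lemeshow(y, p_hat))       # should not be significant here
```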
Predicted probabilities/residuals
Our model enables us to calculate the predicted probability of the event for each participant. SPSS outputs the predicted probability for each participant and their predicted group membership, so we can compare how well the model's predictions match the actual events. SPSS also produces a classification table and plot summarising these predictions.
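A classification table like the one SPSS prints can be reproduced by cross-tabulating predicted against observed group membership (the 0.5 cut-off below is the usual default, an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
p_hat = rng.uniform(0, 1, 200)            # predicted probabilities
observed = rng.binomial(1, p_hat)         # actual events
predicted = (p_hat >= 0.5).astype(int)    # predicted group membership

print(pd.crosstab(pd.Series(observed, name="observed"),
                  pd.Series(predicted, name="predicted")))
print("overall accuracy:", (observed == predicted).mean())
```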
We can compute and inspect the same residuals and diagnostic measures as for linear regression. The main purpose of examining these is to identify points where the model fits poorly and points that exert a large influence on the results. SPSS gives us studentized, standardized, and deviance residuals; any values above 3 are cause for concern.
Plots of Cook's statistic and leverage against patient identification number can also identify outlying observations (as for standard linear regression). Look for large values that lie apart from the rest of the body of the data (particularly values of Cook's statistic > 1). Values of DFBeta greater than 1 also indicate potentially influential observations.
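Outside SPSS, similar diagnostics can be extracted; the sketch below uses statsmodels' GLM influence measures on simulated data (attribute names follow recent statsmodels versions, so treat them as assumptions if your version differs).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 300
age = rng.uniform(40, 80, n)
diabetes = rng.integers(0, 2, n)
event = rng.binomial(1, 1 / (1 + np.exp(-(-5 + 0.075 * age - 0.77 * diabetes))))

X = sm.add_constant(np.column_stack([age, diabetes]))
fit = sm.GLM(event, X, family=sm.families.Binomial()).fit()

infl = fit.get_influence()
cooks_d = infl.cooks_distance[0]   # Cook's statistic per observation
leverage = infl.hat_matrix_diag    # leverage values
dfbetas = infl.dfbetas             # influence on each coefficient

print("largest Cook's statistic:", cooks_d.max())
print("rows with any |DFBeta| > 1:", np.unique(np.where(np.abs(dfbetas) > 1)[0]))
```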
References
For guides to conducting analyses in SPSS:
Kinnear P, Gray C. SPSS for Windows Made Simple: Release 10. Psychology Press.
Field A. Discovering Statistics Using SPSS. 2nd edition. Sage. (Very user-friendly and clear.)