SC504/HS927: WEEK 24: 14TH MARCH 2008: HANDOUT: LOGISTIC REGRESSION

You can now build models with a number of explanatory variables. However, multiple linear regression constrains you by only allowing an interval variable as your dependent variable. Often in social science you will want to look at the relationship between various independent variables and a categorical response variable. Logistic regression enables you to use a categorical dependent variable with two categories, 0 and 1 (a binary variable). It looks at the probability of being in a particular state (where the variable has the value 1) and the effect on this probability of various factors – your explanatory variables. Most categorical variables can effectively be turned into binary variables if you choose to focus upon one particular category.

In this session we will be looking at a file 'stats jan98', a data set of low-income families, and we will be looking at the factors associated with receipt of Income Support (IS) or income-based Jobseeker's Allowance (JSA(IB)), which are means-tested subsistence benefits for people where they and their partners (if any) are not working. Receipt of IS/JSA is shown in the variable isind, a binary variable where 1 = in receipt and 0 = not in receipt. We are interested in seeing which factors make IS/JSA receipt more likely among these poor families and whether we can identify protective factors. We have information in the data set on single and couple parents with children and on single people and couples without children. We also have information on number of children, ethnic group of claimant, tenure, and area of residence (1 = area of high concentration of poor people; 2 = area of middle concentration of poor people; 3 = area of low concentration of poor people). We are assuming that having a partner makes IS/JSA receipt less likely and that having children makes receipt of IS/JSA more likely – so we will be testing and controlling for these factors; but we want to look at the impact of ethnic group, and we also want to investigate whether certain types of tenure and area reduce the likelihood of an IS/JSA claim, all other things being equal.

Like multiple linear regression, logistic regression can include a number of variables in a model, both interval and categorical. For categorical variables it uses dummies, but it is easier to include dummy variables in logistic regression than in multiple linear regression because the regression dialogue will create them for you. Also like linear regression, it takes the form of an equation in which the estimate is constructed from an intercept and coefficients multiplied by the explanatory variables. However, as the dependent variable is not a continuum of values but a single state or its opposite, you want to estimate the probability of being in that state as opposed to not being in it.

Starting with the overall probability of being in receipt of IS/JSA(IB): the probability or proportion of a binary variable is the same as its mean. For example, in 'stats jan98' the proportion of its 8,437 cases which are in receipt of IS/JSA is .773, which is also its mean. Obviously a probability can only have values between 0 (nil probability) and 1 (100% probability), and an equation adding up coefficients multiplied by the explanatory variables will soon produce values outside this possible range as the values of X increase. Therefore the probability of being, say, on IS/JSA is translated into odds.
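To make the link between probability, odds and log-odds concrete, here is a minimal sketch in Python. This is purely an illustration and not part of the SPSS procedure in this handout; the value .773 is the proportion quoted above.

import numpy as np

p = 0.773                   # proportion of cases in receipt of IS/JSA (the mean of isind)

odds = p / (1 - p)          # roughly 3.4: a case is 3.4 times as likely to be in receipt as not
log_odds = np.log(odds)     # the log-odds, which can range from -infinity to +infinity

# the inverse (logistic) transformation takes a log-odds value back to a probability
p_back = np.exp(log_odds) / (1 + np.exp(log_odds))

print(odds, log_odds, p_back)   # roughly 3.4, 1.2 and 0.773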
The odds of being in receipt of IS/JSA are .773/(1 - .773) = 3.4: 77.3% of cases are in receipt of IS/JSA, so any individual case is 3.4 times as likely to be in receipt of IS/JSA as not. A logistic regression investigates whether these odds vary according to certain characteristics of the cases. Odds can take values from 0 to infinity, where odds of 1 correspond to a 50:50 probability. However, it is possible that the explanatory variables will have a negative relationship with the probability of being on IS/JSA, so the logistic regression takes the (natural) log of the odds as its estimate, which can have values between -infinity and +infinity. This means that when the model coefficients are multiplied by the full range of values of the explanatory variables they cannot go beyond the bounds of possibility. The final equation therefore looks like this:

log(P/(1 - P)) = a + β1X1 + β2X2 + β3X3 + ... + βnXn

In order to understand how to interpret the coefficients, we will first create a model. We will test the hypothesis that younger people and people with children are more likely to be in receipt of IS/JSA, when the other factor is controlled for. 'claimage' is a continuous variable of the claimant's age; 'kidind' is a binary variable showing the presence (or not) of children in the family. Therefore they can be entered as they are, with no need for dummies.

Go to:
Analyze
  Regression
    Binary Logistic
      Dependent: insert 'isind'
      Covariates: insert 'claimage', 'kidind'
      OK

At the bottom of the output the coefficients will be given as follows:

Variables in the Equation

                       B      S.E.      Wald    df    Sig.    Exp(B)
Step 1(a)  claimage  -.009    .002    14.329     1    .000      .991
           kidind     .053    .055      .919     1    .338     1.054
           Constant  1.556    .114   186.016     1    .000     4.740

a Variable(s) entered on step 1: claimage, kidind.

The coefficients are given under B and their significance (or the significance of the Wald statistic derived from them) is given under Sig. As you can see, the coefficients suggest that each increasing year of age of the claimant has a negative effect on the log-odds of being on IS/JSA and that the presence of children has a positive (though not significant) effect on the log-odds. But what does it mean to have an effect on the log-odds? It is easier to understand the effect if we take the anti-log (or exponent) of the coefficient, which gives us the effect of our explanatory variables on the odds of being on IS/JSA. This exponent is given in the final column under Exp(B). As it is expressed as an odds ratio, it has values between 0 and infinity, and a value of 1 represents equal odds, or no effect. A value of less than 1 therefore represents a reduced probability of being on IS/JSA and a value of more than 1 represents an increased probability of being on IS/JSA.

You can also create confidence intervals for the exponents of the coefficients, as you did with linear regression. Within the regression dialogue click on Options and then tick CI for exp(B) 95%. If the confidence limits include the value of 1 (which will mean that the confidence limits of the coefficient include the value of 0) then you cannot reject the null hypothesis (and the variable will not be significant at the 95% level).

Wald Statistic

The Wald statistic, which gives you the significance for the estimate of the coefficient, is produced by squaring the ratio of the estimate to its standard error (SE). The resulting figure has a chi-square distribution, from which the significance is deduced.
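If you want to see the same quantities produced outside SPSS, the sketch below fits an equivalent model in Python with statsmodels. This is only an illustration, not part of the course procedure: the small synthetic data set stands in for 'stats jan98', so its numbers will not match the output above.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# synthetic stand-in for 'stats jan98', so that the example runs on its own
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "claimage": rng.integers(18, 80, size=n),
    "kidind": rng.integers(0, 2, size=n),
    "isind": rng.integers(0, 2, size=n),
})

X = sm.add_constant(df[["claimage", "kidind"]])
result = sm.Logit(df["isind"], X).fit(disp=False)

print(result.params)              # B: the coefficients on the log-odds scale
print(np.exp(result.params))      # Exp(B): the multiplicative effect on the odds
print(np.exp(result.conf_int()))  # 95% confidence intervals for Exp(B)

# the Wald statistic is the squared ratio of each coefficient to its standard error
print((result.params / result.bse) ** 2)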
Note, however, that when the coefficient is large the Wald statistic can become unreliable: it might suggest the estimate is not significant even when it is.

However, as the effect of children is not statistically significant, we do not like this model. If we return to our original hypothesis, we thought that couples were less likely to be in receipt of Income Support/Jobseeker's Allowance and people with children more likely. As children will be living in both couple and lone-parent families, we cannot expect to see an effect from them without controlling for couple status. We therefore run another model, this time including 'coupind' (another binary variable), and get the following results:

Variables in the Equation

                       B      S.E.      Wald    df    Sig.    Exp(B)
Step 1(a)  claimage  -.003    .002     1.741     1    .187      .997
           kidind     .240    .059    16.709     1    .000     1.271
           coupind   -.591    .059   100.668     1    .000      .554
           Constant  1.403    .115   149.772     1    .000     4.069

a Variable(s) entered on step 1: claimage, kidind, coupind.

This time, age no longer has a significant effect (presumably related to the fact that couple claimants tend to be older than single claimants, so that age does not have an effect once being part of a couple is controlled for). Having children now significantly increases the chances of being on Income Support; and being in a couple, as we expected, decreases the chances of being on IS/JSA.

To make the results easier still to interpret, we could try to organise our results so that the effects are positive – that is, where possible the Exp(B)s are greater rather than less than one. With binary data this is easy: we can simply look at the effect of being single rather than being in a couple and how that magnifies the chances of being on IS. In the data there is already a variable, loneind, that is the inverse of coupind.

Exercise - A

Re-run the regression as above, replacing coupind with loneind, and interpret the coefficients.

Predicted Probabilities

As with multiple linear regression, you might want to investigate the estimated probability of being on IS for a particular combination of characteristics. Algebra allows you to get from the estimate of log(P/(1 - P)) to the estimate of P, and gives you the equation

P = exp(a + b1X1 + b2X2 + b3X3) / (1 + exp(a + b1X1 + b2X2 + b3X3))

(where the bs are the original coefficients in the first column of the output and the Xs are the explanatory variables claimage, kidind and loneind). However, you don't have to attempt (or remember) this equation yourself, as SPSS will calculate the individual probabilities for all the cases in your dataset. To do this, you go through the steps to run the regression again, but when you are in the regression dialogue box you click on Save and then tick the Probabilities box. In the data file there will now be a new variable, Pre_1, which holds the individual probability of being on IS for each of the cases in the file. If you want to look at a particular combination of characteristics you can use the Select cases command and then look in Descriptive Statistics at the Frequencies of Pre_1. For example, we could look within the current model at the probability of being on IS/JSA(IB) for lone parents.
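You can also check this equation by hand. The sketch below is an illustration in Python, not part of the SPSS procedure: it plugs the B coefficients from the claimage/kidind/coupind output above into the formula for a lone parent with children (kidind = 1, coupind = 0), with 40 chosen as a purely illustrative claimant age. SPSS does the same calculation for every case in the file when you tick Probabilities under Save, as in the steps that follow.

import numpy as np

def predicted_probability(a, b, x):
    # P = exp(a + b1*X1 + ... + bn*Xn) / (1 + exp(a + b1*X1 + ... + bn*Xn))
    eta = a + np.dot(b, x)
    return np.exp(eta) / (1 + np.exp(eta))

a = 1.403                              # constant from the model above
b = np.array([-0.003, 0.240, -0.591])  # B for claimage, kidind, coupind
x = np.array([40, 1, 0])               # a 40-year-old lone parent with children

print(predicted_probability(a, b, x))  # roughly .82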
Go to:
Analyze
  Regression
    Binary Logistic
      Dependent: insert 'isind'
      Covariates: insert 'claimage', 'kidind', 'loneind'
      Save: tick Probabilities
      OK

Then go to:
Select cases
  If: insert 'loneind=1 and kidind=1'
  Continue
  make sure unselected cases are Filtered
  OK

Then go to:
Analyze
  Descriptive statistics
    Frequencies: insert Pre_1
      Statistics: tick range, min, max, mean
      OK

The Frequencies will give you the estimated probabilities across the range of claimant ages within the data, so where you are not interested in a particular age you can either take the mean of the predicted probabilities or quote the range. Here you can see the range of predicted probabilities (of IS/JSA receipt) for lone parents of different ages according to this particular model, as well as the minimum, the maximum and the mean. The mean, at .822, is higher than the overall probability of .773 for the whole data set, as you might expect.

Exercise - B

Look at the probability, within your current model, of being on IS/JSA(IB) for couples with no children where the claimant is aged 40; and for single people without children. REMOVE YOUR FILTER BEFORE CONTINUING!

Dummy Variables

We are now going to remove claimage from the model (as it was no longer significant), leaving loneind and kidind, and we will test the effect of different types of tenure (ten) on the probability of receiving IS/JSA. There are three tenure types: local authority (coded 1), owner occupation (coded 2) and private rented (coded 3). Following the same principles as for multiple regression, we therefore need to create two dummy variables, but this time we can create them within the regression.

Go to:
Analyze
  Regression
    Binary Logistic
      Dependent: insert isind
      Covariates: insert kidind, loneind, ten
      Categorical: highlight ten and move it across to Categorical Covariates

At this point you have to choose which of the categories you want to serve as the baseline/reference category. You are able to choose either the first (in this case 1 = local authority) or the last (in this case 3 = private rented). In this case we will choose local authority, the first category. You can also change the type of contrast between the categories at this point using the Change Contrast box, but we will stick with Indicator, which creates dummy variables in the same way as you did for multiple regression. So therefore:

Tick First
Click Change (if you don't click on this it will ignore your instructions and use the default, which is Last)
Continue
OK

The output will confirm what your base category is for a categorical variable and how it has labelled the other values, as follows:

Categorical Variables Codings

                                                                Parameter coding
                                                     Frequency    (1)      (2)
Tenure  Local Authority/Housing Association Housing      4510    .000     .000
        Owner Occupier                                    1714   1.000     .000
        Private Tenant                                    2213    .000    1.000

Thus you can see that owner occupiers will have a value of 1 for TEN(1) and private tenants will have a value of 1 for TEN(2). The coefficient for TEN(1) will therefore be the effect of owner occupation on the log-odds of receipt of IS/JSA compared to being in local authority housing, holding all other variables constant. Similarly, the coefficient for TEN(2) will be the effect of private tenancy on the log-odds of receipt of IS/JSA compared to being in local authority housing, holding all other variables constant.
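If you ever need to build such dummies yourself (for example, outside SPSS), the sketch below shows the same indicator coding in Python, with local authority housing as the reference category. This is illustrative only: the new column names and the file name are assumptions, not SPSS's own labels.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("stats_jan98.csv")   # hypothetical export of the SPSS data file

# ten is coded 1 = local authority, 2 = owner occupier, 3 = private rented;
# with local authority as the reference category, the two dummies are:
df["ten_owner"] = (df["ten"] == 2).astype(int)     # plays the role of TEN(1)
df["ten_private"] = (df["ten"] == 3).astype(int)   # plays the role of TEN(2)

X = sm.add_constant(df[["kidind", "loneind", "ten_owner", "ten_private"]])
result = sm.Logit(df["isind"], X).fit(disp=False)

# each tenure coefficient is the effect relative to local authority tenants,
# holding the other variables in the model constant
print(result.params)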
Variables in the Equation

                       B      S.E.      Wald    df    Sig.    Exp(B)
Step 1(a)  kidind     .260    .055    22.386     1    .000     1.297
           ten                         7.562     2    .023
           ten(1)    -.184    .068     7.409     1    .006      .832
           ten(2)    -.075    .063     1.423     1    .233      .927
           loneind    .574    .059    94.070     1    .000     1.775
           Constant   .745    .068   118.207     1    .000     2.106

a Variable(s) entered on step 1: kidind, ten, loneind.

From the coefficients output we can therefore see that owner occupation reduces the odds of being in receipt of IS compared to being in local authority housing, and that being a private tenant does not have a significant effect compared to being in local authority housing.

Model Fit

Just as linear regression is based on the principle of least squares – producing a line which results in the smallest sum of squared differences – logistic regression is based on the principle of maximum likelihood, which involves selecting, out of all possible values, the estimates for the parameters which make the observed results most likely. This is known as the maximum likelihood method.

To measure the fit of the model, we want to know just how likely the results provided by the model actually are, given the values in the sample. To do this SPSS takes the likelihood given by the model and, to put it on a more convenient scale for comparison, calculates -2 times the log of the likelihood (-2LL); it then compares this with the -2LL of a model with no parameters. Where the model fits perfectly, -2LL will be equal to 0, as the likelihood will equal 1 (there will be no deviation of observed from predicted values). A smaller value of -2LL therefore indicates a better fit. The -2LL does not itself have a significance value; and for big data sets such as this it will remain quite large, as there will necessarily be a considerable amount of difference between the predicted probabilities and the true values, which are either 0 or 1. To measure the significance of the improvement in fit (the reduction in the -2LL), the change is referred to a chi-square distribution, and the significance is calculated automatically.

To make this a bit clearer, let's repeat the first two models we looked at for 'stats jan98', but in successive blocks of the same regression. In the first model we tested the association of claimant's age and of having children with Income Support receipt. In the second model we added couple v. single status.

Go to:
Analyze
  Regression
    Binary Logistic
      Dependent: insert isind
      Covariates: insert claimage, kidind
      Next
      Add coupind (or loneind) to the Covariates box
      OK

Block 0 simply summarises a model with the constant only included. Then Block 1 gives (under Model Summary) the -2 Log Likelihood for the first model (with claimage and kidind only included):

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      9004.420(a)         .002                   .003

a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

In the preceding bit of output it also tells you how much this -2 log likelihood has reduced as a result of incorporating the independent variables you have added to your model, and the significance of that reduction for the degrees of freedom used (i.e. the number of variables, or dummy variables, added), according to a chi-square distribution. For this model the values for Block and Model are the same because this is the first block, and so the full model includes all and only the variables you have added in this block. (Step is the same as Block unless you are using a step-wise approach, which we are not covering here.)
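As a side note before looking at that output: the -2LL in the Model Summary is simply -2 times the log-likelihood of the fitted model, and it can be reproduced from any fitted logistic regression. A minimal sketch in Python (illustrative only; the file name is a hypothetical export of 'stats jan98'):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("stats_jan98.csv")   # hypothetical export of the SPSS data file

# Block 1 of the example: claimage and kidind only
X = sm.add_constant(df[["claimage", "kidind"]])
block1 = sm.Logit(df["isind"], X).fit(disp=False)

print(-2 * block1.llf)     # -2 log likelihood of the fitted model (about 9004 in the handout)
print(-2 * block1.llnull)  # -2 log likelihood of the constant-only model it is compared with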
Omnibus Tests of Model Coefficients

                 Chi-square    df    Sig.
Step 1   Step      19.317      2     .000
         Block     19.317      2     .000
         Model     19.317      2     .000

As you can see, the difference between the model -2LL and the -2LL where no parameters are included is 19.3. SPSS calculates the significance of this value by reference to a chi-square distribution for the number of parameters (df). As we can see, this improvement in model fit of 19.3 is highly significant. We can therefore say that this model (despite its limitations) provides considerably better estimates of the probability of being in receipt of Income Support than simply taking the mean (or overall proportion) for each case.

We can then compare this model with the model including coupind (or loneind), added in 'Block 2'. Here you can see that the addition of coupind (or loneind) has reduced the -2LL further, to around 8905. That is, it is around 100 smaller than for the model without coupind (or loneind) in it (Block 1), which, as you remember, had a -2LL of 9004.

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      8905.295(a)         .014                   .021

a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Again, to see if this improves the fit of the model overall, we inspect the Omnibus Tests:

Omnibus Tests of Model Coefficients

                 Chi-square    df    Sig.
Step 1   Step      99.124      1     .000
         Block     99.124      1     .000
         Model    118.441      3     .000

In this case the values for Block and Model are different. The value for Model gives the overall improvement in fit of the full model (i.e. including all three variables) compared to a model with no parameters. The Block value gives you the improvement of the new model, with the one additional variable, over the previous model, which had just 'claimage' and 'kidind' in it. Thus we see that the overall model is a significant improvement on no model (118 for 3 df), and also that adding 'coupind' (or loneind) to the previous model resulted in a substantial and highly significant improvement in the fit of the model (99 for 1 df). (Again, Step and Block are the same, and will be as long as we construct models in this way and don't perform a stepwise model.)

As the -2LL still remains very large at 8905.3, there is still scope for entering a number of additional variables to improve the fit – as long as they continue to contribute a significant change in the chi-square value.

Exercise - C

Try a model similar to the one created previously (in the Dummy Variables section above) with isind as the dependent variable and kidind and ten as the covariates (remember to define ten as categorical), but this time add eth2 in a second block. Compare the model fit of the two models.
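If you want to check the Omnibus Tests figures by hand – for the example above or for Exercise C – the sketch below reproduces the Model and Block chi-squares as differences in -2LL, with the significance read from a chi-square distribution. Again this is illustrative Python, not part of the SPSS procedure, and the file name is a hypothetical export of the data.

import pandas as pd
import statsmodels.api as sm
from scipy import stats

df = pd.read_csv("stats_jan98.csv")   # hypothetical export of the SPSS data file
y = df["isind"]

block1 = sm.Logit(y, sm.add_constant(df[["claimage", "kidind"]])).fit(disp=False)
block2 = sm.Logit(y, sm.add_constant(df[["claimage", "kidind", "coupind"]])).fit(disp=False)

# 'Model' chi-square: the full model against the constant-only model (3 df)
model_chi2 = -2 * block2.llnull - (-2 * block2.llf)
# 'Block' chi-square: the improvement over the previous block from adding coupind (1 df)
block_chi2 = -2 * block1.llf - (-2 * block2.llf)

print(model_chi2, stats.chi2.sf(model_chi2, df=3))
print(block_chi2, stats.chi2.sf(block_chi2, df=1))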