Logistic Regression

SC504/HS927:
WEEK 24: 14TH MARCH 2008:
HANDOUT: LOGISTIC REGRESSION
You can now build models with a number of explanatory variables. However, multiple linear regression constrains you by only allowing an interval variable as your dependent variable. Often in social science you will want to look at the relationship between various independent variables and a categorical response variable. Logistic regression enables you to use a categorical variable with two categories, coded 0 and 1 (a binary variable), as your dependent variable. It looks at the probability of being in a particular state (where the variable has the value 1) and the effect on this probability of various factors – your explanatory variables. Most categorical variables can effectively be turned into binary variables if you choose to focus upon one particular category.
In this session we will be looking at the file ‘stats jan98’, a data set of low-income families, and at the factors associated with receipt of Income Support (IS) or income-based Jobseeker’s Allowance (JSA(IB)), which are means-tested subsistence benefits for people who, together with their partners (if any), are not working. Receipt of IS/JSA is shown in the variable isind, a binary variable where 1 = in receipt and 0 = not in receipt. We are interested in seeing which factors make IS/JSA receipt more likely among these low-income families and whether we can identify protective factors. The data set contains single and couple parents with children and single people and couples without children. We have information within the data on number of children, ethnic group of claimant, tenure, and area of residence (1 = area of high concentration of poor people; 2 = area of middle concentration of poor people; and 3 = area of low concentration of poor people). We are assuming that having a partner makes IS/JSA receipt less likely and having children makes it more likely – so we will be testing and controlling for these factors; but we also want to look at the impact of ethnic group and to investigate whether certain types of tenure and area reduce the likelihood of an IS/JSA claim, all other things being equal.
Like multiple linear regression, logistic regression can include a number of variables in a model, both interval and categorical; for categorical variables it uses dummies, but these are easier to include in logistic regression than in multiple linear regression because the regression dialogue will create them for you. Also like linear
regression, it takes the form of an equation where the estimate is constructed by a
combination of an intercept and coefficients multiplied by the explanatory variables.
However, as the dependent variable is not a continuum of values but a single state or
its opposite, you want to estimate the probability of being in that state as opposed to
not being in it.
Starting with the overall probability of being in receipt of IS/JSA(IB): the probability or proportion of a binary variable is the same as its mean. For example, in ‘stats jan98’ the proportion of its 8,437 cases which are in receipt of IS/JSA is .773, which is also its mean. Obviously a probability can only have values between 0 (nil probability) and 1 (100% probability), and therefore an equation in which coefficients are multiplied by the values of the explanatory variables will soon fall outside this range of possible values as the values of X increase. Therefore the probability of being, say, on IS/JSA is translated into odds. The odds of being in receipt of IS/JSA are .773/(1-.773) = 3.4: 77% of cases are in receipt of IS/JSA, and any individual case is 3.4 times as likely to be in receipt of IS/JSA as not. A logistic regression investigates whether these odds vary according to certain characteristics of the cases.
Odds can have values from 0 to infinity, where 1 represents a 50:50 probability. However, it is possible that the explanatory variables will have a negative relationship with the probability of being on IS/JSA, so the logistic regression takes the (natural) log of the odds as its estimate, which can have values between –infinity and +infinity. This means that when the model coefficients are multiplied by the full range of values of the explanatory variables they cannot go beyond the bounds of possibility. The final equation therefore looks like this:
log(P/(1-P)) = a + ß1X1 + ß2X2 + ß3X3 + ... + ßnXn
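As a quick check of this arithmetic, here is a short Python snippet (purely illustrative, and not part of the SPSS workflow) that converts the overall proportion into odds and then into the log of the odds:

    import math

    p = 0.773                  # overall proportion of cases in receipt of IS/JSA
    odds = p / (1 - p)         # odds of receipt versus non-receipt: about 3.4
    log_odds = math.log(odds)  # natural log of the odds: about 1.23
    print(odds, log_odds)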
In order to understand how to interpret the coefficients, we will first create a model.
We will test the hypothesis that younger people and people with children are more
likely to be in receipt of IS/JSA, when the other factor is controlled for. ‘Claimage’ is a continuous variable giving the claimant’s age; ‘kidind’ is a binary variable showing the presence (or not) of children in the family. Therefore they can be entered as they are
with no need for dummies. Go to:
Analyze
Regression
Binary Logistic
Dependent insert ‘isind’
Covariates insert ‘claimage’ ‘kidind’
OK
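(As an aside, the same kind of model can be fitted outside SPSS. A minimal sketch using Python’s statsmodels is given below; it assumes the data have been exported to a CSV file – the file name ‘stats_jan98.csv’ is hypothetical – with columns isind, claimage and kidind. The SPSS output is what we work with in the rest of this handout.)

    import pandas as pd
    import statsmodels.formula.api as smf

    # hypothetical export of the SPSS data to CSV with columns isind, claimage and kidind
    df = pd.read_csv("stats_jan98.csv")

    # binary logistic regression of IS/JSA receipt on claimant age and presence of children
    model = smf.logit("isind ~ claimage + kidind", data=df).fit()
    print(model.summary())  # the coefficients correspond to the B column in the SPSS output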
At the bottom of the output the coefficients will be given as follows:
Variables in the Equation

                 B      S.E.      Wald    df   Sig.   Exp(B)
Step 1(a)
  claimage    -.009     .002    14.329     1   .000     .991
  kidind       .053     .055      .919     1   .338    1.054
  Constant    1.556     .114   186.016     1   .000    4.740

a Variable(s) entered on step 1: claimage, kidind.
Coefficient and Exponent of the coefficient
The coefficients are given under B and their significance (or rather the significance of the Wald statistic derived from them) is given under Sig. As you can see, the coefficients suggest that each additional year of the claimant’s age has a negative effect on the log odds of being on IS/JSA and that the presence of children has a positive (though not significant) effect on the log odds. But what does it mean to have an effect on the log odds? It is easier to understand the effect if we take the anti-log (or exponent) of the coefficient, which gives us the effect of our explanatory variables on the odds of being on IS/JSA. This exponent is given in the final column under Exp(B). As it is expressed as an odds ratio, it has values between 0 and infinity and a value of 1 represents equal odds or no effect. A value of less than 1 therefore represents reduced odds (and hence a reduced probability) of being on IS/JSA and a value of more than 1 represents increased odds (and an increased probability) of being on IS/JSA.
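To see where the Exp(B) column comes from, here is a small illustrative check in Python using the B values from the output above:

    import math

    b_claimage, b_kidind, b_constant = -0.009, 0.053, 1.556  # B values from the output
    print(math.exp(b_claimage))  # about .991  - Exp(B) for claimage
    print(math.exp(b_kidind))    # about 1.054 - Exp(B) for kidind
    print(math.exp(b_constant))  # about 4.740 - Exp(B) for the constant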
You can also create confidence intervals for the exponents of the coefficients as you did with linear regression. Within the regression dialogue click on Options and then tick CI for exp(B) 95%. If the confidence limits include the value of 1 (which will mean that the confidence interval of the coefficient itself includes the value 0) then you cannot reject the null hypothesis (and the variable will not be significant at the 95% level).
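The 95% limits that SPSS reports correspond (approximately) to exp(B ± 1.96 × S.E.). An illustrative check in Python for the kidind coefficient from the model above:

    import math

    b, se = 0.053, 0.055             # coefficient and standard error for kidind
    lower = math.exp(b - 1.96 * se)  # about 0.95
    upper = math.exp(b + 1.96 * se)  # about 1.17
    print(lower, upper)              # the interval includes 1, so kidind is not significant at the 95% level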
Wald Statistic
The Wald statistic, which gives you the significance for the estimate of the coefficient, is produced by squaring the ratio of the estimate to its standard error (S.E.). The resulting figure has a chi-square distribution, from which the significance is derived. Note, however, that when the coefficient is large, the Wald statistic can become unreliable, as it might suggest the estimate is not significant even when it may be.
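For example, the Wald statistic for the constant in the model above can be reproduced as follows (small differences from the printed value are due to rounding of B and S.E.):

    b, se = 1.556, 0.114  # coefficient and standard error for the constant
    wald = (b / se) ** 2  # about 186, matching the Wald value in the output
    print(wald)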
However, as the effect of children is not statistically significant, we do not like this model. If we return to our original hypothesis, we thought that couples were less likely to be in receipt of Income Support/Jobseeker’s Allowance and people with children more likely. As children will be living in both couple and lone-parent families, we cannot expect to see an effect from them without controlling for couple status.
We therefore run another model, this time including ‘coupind’ (another binary
variable) and get the following results:
Variables in the Equation

                 B      S.E.      Wald    df   Sig.   Exp(B)
Step 1(a)
  claimage    -.003     .002     1.741     1   .187     .997
  kidind       .240     .059    16.709     1   .000    1.271
  coupind     -.591     .059   100.668     1   .000     .554
  Constant    1.403     .115   149.772     1   .000    4.069

a Variable(s) entered on step 1: claimage, kidind, coupind.
This time, age no longer has a significant effect (presumably related to the fact that couple claimants tend to be older than single claimants, so that age does not have an effect once being part of a couple is controlled for). Having children now significantly increases the chances of being on Income Support; and being in a couple, as we expected, decreases the chances of being on IS/JSA.
To make the results easier still to interpret, we could try to organise our results so that the effects are positive – so that, where possible, the Exp(B)s are greater rather than less than one. With binary data this is easy: we can simply look at the effect of being single rather than being in a couple and how that magnifies the chances of being on IS. In the data there is already a variable, loneind, that is the inverse of coupind.
Exercise - A
Re-run the regression as above, replacing coupind with loneind and interpret the
coefficients.
Predicted Probabilities
As with multiple linear regression you might want to investigate the estimated
probability for a particular combination of characteristics of being on IS. Algebra
allows you to get from the estimate of the log P/1-P to the estimate of P and gives you
the equation

P = exp(a + b1X1 + b2X2 + b3X3) / (1 + exp(a + b1X1 + b2X2 + b3X3))
(where the bs are the original coefficients in the first column of the output and the Xs
are the explanatory variables claimage, kidind and loneind). However, you don’t
have to attempt (or remember) this equation yourself as SPSS will calculate the
individual probabilities for all the cases in your dataset. To do this, you go through
the steps to run the regression again, but when you are in the regression dialogue box
you click on Save and then tick the Probabilities box. In the data file there will now
be a new variable Pre_1, which has the individual probabilities of being on IS for
each of the cases in the file. If you want to look at a particular combination of
characteristics you can use the Select cases command and then look in Descriptive
Statistics at the Frequencies of Pre_1.
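As an illustration of the equation above, the following Python sketch computes a predicted probability by hand. The coefficient values shown are placeholders only – you would substitute the a and b values from your own output for the model with claimage, kidind and loneind:

    import math

    # placeholder coefficients (a, b1, b2, b3) - replace with the B values from your own output
    a, b_claimage, b_kidind, b_loneind = 0.8, -0.003, 0.24, 0.59

    # characteristics of the case of interest: a lone parent aged 30 with children
    claimage, kidind, loneind = 30, 1, 1

    linear = a + b_claimage * claimage + b_kidind * kidind + b_loneind * loneind
    p = math.exp(linear) / (1 + math.exp(linear))  # estimated probability of IS/JSA receipt
    print(p)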
For example, we could look within the current model at the probability of being on
IS/JSA(IB) for lone parents. Go to:
Analyze
Regression
Binary Logistic
Dependent insert ‘isind’
Covariates insert ‘claimage’ ‘kidind’ ‘loneind’
Save Tick Probabilities
OK
Then go to:
Select cases
If insert ‘loneind=1 and kidind=1’
Continue make sure unselected cases are Filtered
OK
Then go to:
Analyze
Descriptive statistics
Frequencies insert Pre_1
Statistics tick range min max mean
OK
The Frequencies will give you the estimated probabilities for the range of claimant ages within the data, and so, where you are not interested in a particular age, you can either take the mean of the predicted probabilities or quote the range.
Here you can see the range of predicted probabilities (of IS/JSA receipt) for lone parents of different ages according to this particular model, as well as the minimum, the maximum and the mean. The mean, at .822, is higher than the overall probability for the whole data set (derived from its mean), as you might expect.
Exercise - B
Look at the probability, within your current model, of being on IS/JSA(IB)
- for couples with no children where the claimant is aged 40; and
- for single people without children.
REMOVE YOUR FILTER BEFORE CONTINUING!
Dummy Variables
We are now going to remove claimage from the model (as it was no longer
significant) leaving loneind and kidind; and we will test the effect of different types
of tenure (ten) on the probability of receiving IS/JSA. There are three tenure types:
local authority (coded 1), owner occupation (coded 2) and private rented (coded 3).
Following the same principles as for multiple regression, we therefore need to create
two dummy variables, but this time we can create them within the regression.
Go to:
Analyze
Regression
Binary Logistic
Dependent insert isind
Covariates insert kidind, loneind, ten
Categorical highlight ten and move it across to
categorical covariates
At this point you have to choose which of the categories you want to serve as the
baseline/reference category. You are able to choose either the first (in this case 1 =
local authority) or the last (in this case 3 = private rented). In this case we will choose
local authority, the first category. You can also change the type of contrast between
the categories at this point using the change contrast box, but we will stick with
indicator, which creates dummy variables in the same way as you did for multiple
regression. So therefore:
Tick First
Change (If you don’t click on this it will
ignore your instructions and will use the
default which is Last)
Continue
OK
The output will confirm what your base category is for a categorical variable and how
it has labelled the other values, as follows:
Categorical Variables Codings

                                                    Parameter coding
                                      Frequency       (1)      (2)
Tenure
  Local Authority/Housing
  Association Housing                    4510         .000     .000
  Owner Occupier                         1714        1.000     .000
  Private Tenant                         2213         .000    1.000
Thus you can see that owner occupier will have a value of 1 for TEN(1) and private
tenant will have a value of 1 for TEN(2). The coefficient for TEN(1) will therefore be
the effect of owner occupation on the log-odds of receipt of IS/JSA compared to
being in local authority housing, and holding all other variables constant. And
similarly the coefficient for TEN(2) will be the effect of private tenancy on the log-odds of receipt of IS/JSA compared to being in local authority housing, and holding
all other variables constant.
Variables in the Equation

                 B      S.E.      Wald    df   Sig.   Exp(B)
Step 1(a)
  kidind       .260     .055    22.386     1   .000    1.297
  ten                            7.562     2   .023
  ten(1)      -.184     .068     7.409     1   .006     .832
  ten(2)      -.075     .063     1.423     1   .233     .927
  loneind      .574     .059    94.070     1   .000    1.775
  Constant     .745     .068   118.207     1   .000    2.106

a Variable(s) entered on step 1: kidind, ten, loneind.
From the coefficients output we can therefore see that owner occupation reduces the
odds of being in receipt of IS compared to being in local authority housing and that
being a private tenant does not have a significant effect compared to being in local
authority housing.
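For comparison, the indicator coding that SPSS produces here can be reproduced by hand. Below is a minimal sketch in Python/pandas with made-up tenure values (1 = local authority, 2 = owner occupier, 3 = private tenant), purely to illustrate how the two dummy variables are formed:

    import pandas as pd

    # hypothetical tenure values for a handful of cases
    ten = pd.Series([1, 2, 3, 1, 3], name="ten")

    # indicator (dummy) coding, dropping the first category so that
    # local authority housing acts as the reference category
    dummies = pd.get_dummies(ten, prefix="ten").drop(columns="ten_1")
    print(dummies)  # ten_2 corresponds to TEN(1), ten_3 to TEN(2) in the SPSS output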
Model Fit
Just as linear regression is based on the principle of least squares – producing a line
which results in the smallest sum of squared differences – logistic regression is based
on the principle of maximum likelihood, which involves selecting the estimates for
the parameters out of all possible values which make the observed results most likely.
This is known as the maximum likelihood method.
To measure the fit of the model, we want to know just how likely the results provided by the model actually are, given the values in the sample. To do this, SPSS takes the likelihood given by the model and, to put the numbers on a scale suitable for comparison, calculates –2 times the log of the likelihood (–2LL), and compares this with the –2LL of a model with no parameters. Where the model fits perfectly, –2LL will be equal to 0, as the likelihood will equal 1 (there will be no deviation of observed from predicted values). A smaller value of –2LL therefore indicates a better fit. The –2LL does not itself have a significance value; and for big data sets, such as this one, it will remain quite large, as there will necessarily be a considerable amount of difference between the predicted probabilities and the true values, which equal either 0 or 1. To measure the significance of the improvement in fit (the reduction in –2LL), the chi-square distribution of the change is used and the significance is calculated automatically.
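To make the –2LL idea concrete, here is a small illustrative Python calculation with made-up observed values and predicted probabilities:

    import numpy as np

    y = np.array([1, 0, 1, 1, 0])            # hypothetical observed 0/1 outcomes
    p = np.array([0.8, 0.3, 0.6, 0.9, 0.2])  # hypothetical predicted probabilities
    log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(-2 * log_likelihood)               # -2LL: smaller values indicate a better fit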
To make this a bit clearer, let’s repeat the first two models we looked at for ‘stats
jan98’ but in successive blocks of the same regression. In the first model we tested
the association between claimant’s age and having children on Income Support
receipt. In the second model we added couple v. single status.
Go to:
Analyze
Regression
Binary Logistic
Dependent insert isind
Covariates insert claimage kidind
Next Add coupind (or loneind) to the covariates box
OK
Block 0 simply summarises a model with only the constant included.
Then Block 1 gives (under Model Summary) the –2 Log Likelihood for the first model (with claimage and kidind only included):
Model Summary

 Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
 1          9004.420(a)               .002                    .003

a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
The preceding piece of output also tells you how much this –2 log likelihood has been reduced as a result of incorporating the independent variables you have added to your model, and the significance of that reduction for the degrees of freedom used (i.e. the number of variables, or dummy variables, added), according to a chi-squared distribution. For this model the values for Block and Model are the same, because this is the first block and so the full model includes all and only the variables you have added in this block. (Step is the same as Block unless you are using a step-wise approach, which we are not covering here.)
Omnibus Tests of Model Coefficients

            Chi-square    df    Sig.
 Step 1
   Step       19.317       2    .000
   Block      19.317       2    .000
   Model      19.317       2    .000
As you can see, the difference between the model –2LL and the –2LL where no parameters are included is 19.3. SPSS calculates the significance for this value by reference to a chi-square distribution for the number of parameters (df). As we can see, this improvement in model fit of 19.3 is highly significant. We can therefore say that this model (despite its limitations) provides considerably better estimates of the probability of being in receipt of Income Support than simply taking the mean (or overall proportion) for each case.
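The significance that SPSS attaches to this improvement can be checked against a chi-square distribution directly. An illustrative calculation in Python (using scipy):

    from scipy.stats import chi2

    change_in_2ll = 19.317  # reduction in -2LL when claimage and kidind are added
    df = 2                  # two parameters added
    print(chi2.sf(change_in_2ll, df))  # p-value well below .001, i.e. highly significant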
We can then compare this model with the model including coupind (or loneind), added in ‘Block 2’. Here you can see that the addition of coupind (or loneind) has reduced the –2LL further, to around 8905. That is, it is around 100 smaller than for the model without coupind (or loneind) (Block 1), which, as you will remember, had a –2LL of 9004.
Model Summary

 Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
 1          8905.295(a)               .014                    .021

a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
Again to see if this improves the fit of the model overall we inspect the Omnibus
Tests:
Omnibus Tests of Model Coefficients

            Chi-square    df    Sig.
 Step 1
   Step       99.124       1    .000
   Block      99.124       1    .000
   Model     118.441       3    .000
In this case the values for Block and Model are different. The value for Model gives the overall improvement in fit of the full model (i.e. including all three variables) compared to a model with no parameters. The Block value gives you the improvement of the new model, with the one additional variable, on the previous model, which had just ‘claimage’ and ‘kidind’ in it. Thus we see that the overall model is a significant improvement on no model (118 for 3 df), and also that adding ‘coupind’ (or loneind) to the previous model resulted in a substantial and highly significant improvement in the fit of the model (99 for 1 df). (Again, Step and Block are the same, and will be as long as we construct models in this way and do not perform a stepwise model.)
As the –2LL still remains very large at 8905.3, there is still scope for entering a number of additional variables to improve the fit – as long as they continue to contribute to a significant change in the chi-square value.
Exercise - C
- Try a model similar to the one created previously (in the Dummy Variables section above) with isind as the dependent variable and kidind and ten as the covariates (remember to define ten as categorical), but this time add in eth2 for the second block. Compare the model fit of the two models.