
Center for Teaching, Research and Learning
Research Support Group
American University, Washington, D.C.
Hurst Hall 203
rsg@american.edu
(202) 885-3862
Advanced SPSS: Logit and Probit Regression
Workshop Objective
This workshop is designed to give a basic understanding of how to perform Logistic Unit
(logit) and Probability Unit (probit) Regressions in SPSS, which are standard ways of running
regressions with discrete dependent variables. We will use a combination of the SPSS
point-and-click interface and syntax coding.
Learning Outcomes
1. Understand Logit and Probit Models
2. Run preliminary analyses and data management
3. Use Logit and Probit Models
4. Interpret results
Scenario:
We are interested in predicting employment status using predictors such as gender, age,
marital status, number of children, and race. Employment status starts out with five levels,
but we will convert it to two levels, and then explore some options for running analyses
with a dichotomous dependent variable.
I. Understanding Logit and Probit Models
A traditional regression (OLS) is designed to predict the value of a dependent variable
using the values of one or more independent variables, on the assumption that the variables are
linearly related. Depending on the values of the independent variables that you enter into the
equation, the resulting predictions can come out as fractions, and can be very large or small.
Typically this is no problem, but sometimes you are dealing with data where the outcome cannot
be a fraction and where there are upper and lower limits on acceptable answers. For example, if you
have a model where “0” is the answer “No” and “1” is the answer “Yes”, what would you make
of it if your model predicted that someone would answer “.64”, or “2.4”, or “-3”? There are
certainly ways to make sense of those results, but the model itself seems non-ideal if it is giving
you those sorts of predictions, because no one will ever actually produce those responses.
Logit and probit are two generalizations of regression analysis designed to deal with
discrete dependent variables, for which fractional values do not make sense and there are a
limited number of outcomes. Traditionally, logit and probit are performed on data with binary
outcomes (e.g., Yes, No), but the methods can be generalized to cases with more outcomes
(multinomial logit or probit, e.g., Republican, Democrat, Libertarian, Green, Other), including
cases in which the outcomes have ordinal properties (ordered multinomial logit or probit e.g.,
unemployed, employed part time, employed full time).
In any of these cases, where a traditional regression would try to predict the most likely
value of a dependent variable, logit and probit try to predict the probability that a specific value
of the dependent variable will be found. These differences are captured in the following graph.
Notice that the linear model continues above and below the possible y values, whereas the
logistic model switches relatively abruptly from predicting a “0” to predicting a “1”.
Why would you want to do this? It might make more sense when you see it with some
data. Compare the following graphs; the first has continuous data that is fairly linear, the second
has dichotomous data with a value of either 0 or 1:
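The same contrast can also be seen numerically. The following Python sketch (an illustration only; the workshop itself uses SPSS, and the toy data here are made up) fits a straight OLS line to dichotomous data and compares its prediction with a logistic curve, which is bounded between 0 and 1:

```python
import numpy as np

# Toy dichotomous outcome that flips from 0 to 1 as x grows.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# OLS straight line: nothing stops its predictions leaving [0, 1].
slope, intercept = np.polyfit(x, y, 1)
linear_pred = slope * 20 + intercept  # extrapolating to x = 20

# A logistic curve maps any input into (0, 1).
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

logit_pred = logistic(2.0 * (20 - 4.5))  # steep curve centered at x = 4.5

print(linear_pred > 1, 0 < logit_pred < 1)  # True True
```

The straight line happily predicts values well above 1, while the logistic prediction saturates just below 1, matching the "relatively abrupt switch" described above.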
Logit and probit differ in terms of A) how their results are traditionally displayed, B) the
assumed underlying distribution, and C) the parameter estimates, which can often differ slightly
because the assumed distributions differ. Taking those one at a time: as we will discuss below,
logit traditionally produces log-odds of a given outcome, which are then converted to odds, while
probit traditionally produces coefficients on a z-score scale, which map to probabilities through
the normal curve. As for the assumptions, logit assumes a logistic probability distribution, while
probit assumes a normal distribution.
Which type of model should you use? In most cases, there is no obvious right or wrong
answer. The results tend to be very similar, and preference tends to vary by discipline.
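One reason the results are so similar is that the two assumed distributions trace nearly the same S-curve. A quick Python sketch (illustrative only; not part of the SPSS workflow) compares the logistic CDF, rescaled by the standard constant 1.702, against the normal CDF:

```python
import numpy as np
from scipy.stats import logistic, norm

z = np.linspace(-3, 3, 121)

# Probit assumes a normal curve; logit assumes a logistic curve.
# Rescaling the logistic input by ~1.702 (a classic approximation
# constant) lines the two S-curves up almost exactly.
p_probit = norm.cdf(z)
p_logit = logistic.cdf(1.702 * z)

max_gap = np.max(np.abs(p_probit - p_logit))
print(max_gap < 0.01)  # the curves are nearly indistinguishable
```

The maximum gap between the two curves is under one percentage point, which is why logit and probit conclusions rarely disagree in practice.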
II. Preliminary Tests
We covered descriptive statistics in the introductory and intermediate courses. You always
want to run descriptive statistics before you perform more advanced analyses. In this case we
have included the code necessary for those analyses, but we will not spend time on them here.
First, run the “Get”, “Descriptives”, and “Frequencies” commands from the syntax file.
Note in the last frequency chart produced that our metric for employment “empstat” has
many levels, which will make our analysis difficult. The following two commands recode that
variable to have either two or three levels.
RECODE empstat (10 thru 12=1) (21 thru 22=2) (ELSE=SYSMIS) INTO employed2.
VARIABLE LABELS employed2 'Employment Status 2 Groups'.
EXECUTE.
Value Labels employed2 1 'Employed'
2 'Unemployed'.
RECODE empstat (30=3) (10 thru 12=1) (21 thru 22=2) (ELSE=SYSMIS) INTO employed3.
VARIABLE LABELS employed3 'Employment Status 3 Groups'.
EXECUTE.
Value Labels employed3 1 'Employed'
2 'Unemployed'
3 'Not in Labor Force'.
You can run the two “Crosstabs” commands to ensure that the recoding worked as desired.
CROSSTABS
/TABLES=empstat BY employed2
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.
CROSSTABS
/TABLES=empstat BY employed3
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.
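For readers who want to try the same recode-and-verify pattern outside SPSS, here is a rough pandas equivalent (a sketch only; the DataFrame and the small set of empstat codes are hypothetical stand-ins for the workshop dataset):

```python
import pandas as pd

# Hypothetical extract of empstat codes like those recoded above.
df = pd.DataFrame({"empstat": [10, 11, 12, 21, 22, 30, 99]})

def recode2(code):
    # Mirrors: RECODE empstat (10 thru 12=1) (21 thru 22=2) (ELSE=SYSMIS).
    if 10 <= code <= 12:
        return 1  # Employed
    if 21 <= code <= 22:
        return 2  # Unemployed
    return None   # treated as missing, like SYSMIS

df["employed2"] = df["empstat"].apply(recode2)

# The crosstab plays the role of SPSS's CROSSTABS sanity check:
# every original code should land in exactly one recoded column.
print(pd.crosstab(df["empstat"], df["employed2"]))
```

As in SPSS, the crosstab lets you confirm at a glance that no original code was mapped to the wrong group.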
There is a non-linear effect of age on employment. We can see this effect if we run the “Graph”
command, then double-click on the resulting graph and add a quadratic fit line. Alas, there is no
elegant way to do this all in the code. We can then create a new variable called “agesq” to
account for that non-linear effect.
GRAPH
/SCATTERPLOT(BIVAR)=age WITH employed2
/MISSING=LISTWISE.
COMPUTE agesq=age * age.
EXECUTE.
III. Using Logit and Probit Models
When you are finished with your preliminary analyses, it is time to run the models.
Running these analyses in SPSS can be a bit tricky, because there are several commands that
allow you to do the analyses, and each works in slightly different ways. Thus we will use a
variety of pull-down menus and commands.
LOGIT
To perform a logistic regression using the menus:
• Go to ANALYZE then REGRESSION then BINARY LOGISTIC
• To make it easier to select variables, right-click on the list of variables and choose “Display
Variable Names”, then right-click again and choose “Sort Alphabetically”.
• In the “Dependent variable” field, enter employed2
• In the “Covariates” field, enter age, agesq, female, educ_cat, married, black
• Press the “Categorical” button and in the pop-up window move all the categorical
independent variables to the “Categorical Covariates” field.
• Select each of the categorical variables by clicking once on it; the “Change Contrast”
panel is now active and you can change the “Reference Category” from “Last” to “First”.
(This simply tells SPSS to treat the first category of each categorical variable as the base
against which the other categories of the variable are compared.)
• Continue. OK.
You could also perform this analysis using the “LOGISTIC REGRESSION” command in the
syntax.
The first big chunk of the output, “Block 0: Beginning Block”, shows the analysis without
including any of your variables; it is not very interesting. You are interested in “Block 1:
Method = Enter”. In that block, the first thing you see is an “Omnibus” test that uses the χ2
(chi-square) statistic; it is significant, so we can move on to look at the rest of the model.
Notice in particular the last box, which shows “Variables in the Equation”. The “B” values in
this chart are the log of the odds per unit change in the independent variable, which are very hard
to interpret. Luckily, by default SPSS gives us the last column, “Exp(B)”, which gives us the
odds. If the odds are exactly 1 (read as 1:1), that means that the independent variable has no
effect on the dependent variable. When the odds are above 1, there is a positive relationship
between the variables; when the odds are below 1, there is a negative relationship. For example,
the odds of being employed if you did not finish high school vs. if you finished college are
“1 : 3.241”, meaning the odds of being employed are about three times higher if you finished
college.
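The arithmetic behind reading “Exp(B)” can be checked by hand. A short Python sketch (the 3.241 is the value quoted above; everything else is just illustrative arithmetic):

```python
import math

# A log-odds coefficient B from a logit model...
B = math.log(3.241)

# ...exponentiates back to the odds ratio SPSS reports as Exp(B).
odds_ratio = math.exp(B)

# Odds convert to probabilities via p = odds / (1 + odds).
# E.g., at odds of 3.241:1, the implied probability is about 76%.
p = 3.241 / (1 + 3.241)

print(round(odds_ratio, 3), round(p, 3))
```

This is why an odds ratio of 1 means “no effect”: exp(0) = 1, so a zero log-odds coefficient leaves the odds unchanged.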
Multinomial Logit
A logistic regression with more than two levels of the dependent variable is a multinomial
logistic regression. We created the “Employed 3” variable above for this purpose. To perform a
multinomial logistic regression using the menus, follow the instructions below. In the next-to-last
step, as an added bonus, we will have SPSS create new variables with the estimated probability of
an individual falling into each category, the category the model predicts is most likely, and the
probability of that particular prediction.
• ANALYZE then REGRESSION then MULTINOMIAL LOGISTIC
• Move employed3 to the “Dependent” field. Under the “Reference Category” button,
set the reference category to “First”, so that “Employed” (the first category) is our
comparison group.
• In the “Factor(s)” field, enter female, educ_cat, married, black.
• In the “Covariate(s)” field, enter age, agesq.
• Under the “Save” tab, check “Estimated response probabilities”, “Predicted category”,
and “Predicted category probability”. Continue.
• OK.
You could also run this analysis using the “NOMREG” syntax. Note that in the syntax
categorical variables come after “BY”, while continuous variables come after “WITH”.
The output here is a little more straightforward. Under the “Likelihood Ratio Tests” we
can see that we have a significant effect of sex, education level, and marital status, but not of
race. If we continue on to the “Parameter Estimates”, you can see that “Not in Labor
Force” and “Unemployed” have each been separately compared with our reference group,
“Employed”. Each chart can be read on its own just like the charts above (though notice the
“Exp(B)” column has moved), and notice that the higher value of each factor is now used as the
default reference. You can also compare between the groups.
PROBIT
Probit is not quite as elegant through the pull-down menus, but it works pretty much the
same in the code. To use the pull-down menus:
• ANALYZE, then REGRESSION, then ORDINAL REGRESSION, followed by
selecting “Probit” from the “Link” menu under “Options”.
• In the “Dependent” field, enter employed3
• In the “Factor(s)” field, enter female, educ_cat, married, black
• In the “Covariate(s)” field, enter age, agesq.
• OK.
Note that all categorical variables here use the highest value of the variable as the contrast
condition, as they did in the Multinomial Logit. You could also run this analysis using the
“PLUM” command in the syntax file.
The output here is pretty similar to the logit outputs above, but it is much more difficult to
interpret. The “Parameter Estimates”, in the last part of the output, show the amount that a unit
change in the independent variable affects the z-score associated with the probability of the
dependent variable. Using the output from the code that includes age:
For every year you are older, the z-score increases by “.078” (modified by -.001 times the
square of your age). Because you are using the coefficient to modify a z-score, the effect of any
change depends on the values you started with. For example, if your variables add up to a
z-score of 1, the probability of being employed is 84.13%. Adding another year of age (ignoring
the effect of squared age) will increase your probability to 85.77%, an increase of 1.64%. However,
if your z-score started at 2, an additional year of age would only increase your probability by
.49%, because you are further out on the tail of the normal curve. (I looked up the probabilities
for the z-scores in the back of a stats textbook.) In the case of the discrete variables, note that
the coefficients are all in comparison to the highest value in the category (there is no option in
SPSS’s probit command to change the comparison group).
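The z-table lookups in the example above can be reproduced with the normal CDF in Python (exact values differ slightly from rounded table entries, but the pattern is the same):

```python
from scipy.stats import norm

# Probability of employment at a z-score of 1, as in the example.
p1 = norm.cdf(1.0)  # about 0.8413, i.e., 84.13%

# One more year of age adds the .078 coefficient to the z-score.
p2 = norm.cdf(1.0 + 0.078)

# The same .078 step starting from z = 2 buys much less, because
# the normal curve has flattened out in the tail.
gain_from_1 = p2 - p1
gain_from_2 = norm.cdf(2.0 + 0.078) - norm.cdf(2.0)

print(round(gain_from_1, 4), round(gain_from_2, 4))
```

This is the key point of the probit interpretation: the same coefficient moves the probability by different amounts depending on where on the curve you start.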
Additional Notes to Consider
a) Empty or incomplete variables: You should check for empty or incomplete cells by
doing a crosstab between categorical predictors and the outcome variable. If a
variable has very few cases, the model may become unstable or it might not run at all.
b) Separation or quasi-separation (also called perfect prediction): a condition in which
the outcome does not vary at some levels of the independent variables, which can
prevent the estimates from converging.
c) Sample size: Both logit and probit models require more cases than OLS regression
because they use maximum likelihood estimation techniques.