When Dependent Variables are Categorical
Chi-square analysis is frequently used.
Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?
Dependent variable is Death: No (0) vs. Yes (1).
So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.
Comments on Chi-square analyses
What’s good?
1. The analysis is appropriate. It hasn’t been supplanted by something else.
2. The results are usually easy to communicate, especially to lay audiences.
3. A DV with a few more than 2 categories can be easily analyzed.
4. An IV with only a few more than 2 categories can be easily analyzed.
What’s bad?
1. Incorporating more than one independent variable is awkward, requiring multiple tables.
2. Certain tests, such as tests of interactions, can't be performed easily when you have more than one IV.
3. Chi-square analyses can’t be done when you have continuous IVs unless you categorize the continuous IVs, which goes against recommendations to NOT categorize continuous variables because you lose power.
Alternatives to the Chi-square test.
We’ll focus on Dichotomous (two-valued) DVs.
1. Linear Regression techniques
   a. Multiple Linear Regression: stick your head in the sand, pretend that your DV is continuous, and regress the (dichotomous) DV onto the mix of IVs.
   b. Discriminant Analysis (equivalent to multiple regression when the DV is dichotomous).
Problems with regression-based methods, when the dependent variable is dichotomous and the independent variable is continuous.
1. Assumption is that underlying relationship between Y and X is linear.
But when Y has only two values, how can that be?
2. Linear techniques assume that variability about the regression line is homogenous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.
3. Residuals will probably not be normally distributed.
4. Regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction resulting in Y-hats that are impossible values.
2. Logistic Regression
3. Probit analysis
Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We’ll focus on it.
The Logistic Regression Equation
Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.
Conceptualizing Y-hat.
When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We'll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1. The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we're conceptualizing Y-hat as the probability that Y is 1.
The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression) is

Y-hat = P(Y=1) = 1 / (1 + e^-(B0 + B1*X)) = e^(B0 + B1*X) / (e^(B0 + B1*X) + 1)
The logistic regression equation defines an S-shaped (ogive) curve that rises from 0 to 1. P(Y=1) is never negative and never larger than 1.
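If you want to see the boundedness numerically, here is a minimal sketch in Python (not part of the SPSS workflow used in this handout; B0 and B1 are arbitrary illustrative values):

import numpy as np

def logistic_phat(x, b0, b1):
    # P(Y=1) = 1/(1 + e^-(B0 + B1*X)); always between 0 and 1
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

x = np.linspace(-5, 5, 11)
print(logistic_phat(x, b0=0.0, b1=1.0))   # rises from about .007 to about .993 in an S shape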
The curve of the equation . . .

B0: B0 is analogous to the linear regression "constant", i.e., the intercept parameter. Although B0 defines the "height" of the curve at a given X, it should be noted that the curve as a whole moves to the right as B0 decreases. For the graphs below, B1 = 1 and X ranged from -5 to +5. For equations for which B1 is the same, changing B0 only changes the location of the curve over the range of X-axis values. The "slope" of the curve remains the same.

[Figure: P(Y=1) vs. X for B0 = +1, B0 = 0, and B0 = -1, with B1 = 1.]
B1: B1 is analogous to the slope of the linear regression line. B1 defines the "steepness" of the curve. It is sometimes called a discrimination parameter. The larger the value of B1, the "steeper" the curve, i.e., the more quickly it goes from 0 to 1. B0 = 0 for the graph.

[Figure: P(Y=1) vs. X for B1 = 1, B1 = 2, and B1 = 4, with B0 = 0.]
Note that there is a MAJOR difference between the linear regression and logistic regression curves - - -
The logistic regression lines asymptote at 0 and 1. They’re bounded by 0 and 1.
But the linear regression lines extend below 0 on the left and above 1 on the right.
If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.
Example

P(Y) = .09090909: Odds of Y = .09090909/.90909091 = .1. Y is 1/10th as likely to occur as to not occur.
P(Y) = .50: Odds of Y = .5/.5 = 1. Y is as likely to occur as to not occur.
P(Y) = .8: Odds of Y = .8/.2 = 4. Y is 4 times more likely to occur than to not occur.
P(Y) = .99: Odds of Y = .99/.01 = 99. Y is 99 times more likely to occur than to not occur.
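The same probability-to-odds conversion as a small Python sketch (the four probabilities are the ones in the example above):

def odds(p):
    # odds that Y occurs = P(Y = 1) / P(Y = 0)
    return p / (1.0 - p)

for p in (0.09090909, 0.50, 0.80, 0.99):
    print(p, "->", round(odds(p), 4))   # prints .1, 1, 4, and 99, matching the example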
Here’s a perfectly nice linear relationship between score values, from a recent study.
This relationship is of ACT Comp scores to Wonderlic scores. It shows that as intelligence gets higher, ACT scores get larger.
[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav
Here’s the relationship when ACT Comp has been dichotomized at 23, into Low vs. High.
When proportions of High scores are plotted vs. WPT value, we get the following.
So, to fit the above curve relating proportions of persons with
High ACT scores to WPT, we need a model that is ogival.
This is where the logistic regression function comes into play.
This means that even if the “underlying” true values are linearly related, proportions based on the dichotomized values will not be linearly related to the independent variable.
The FFROSH data.
The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables: first semester GPA excluding the seminar course, and whether a student continued into the 2nd semester.

The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who retained to the immediately following spring semester and 0 for those who did not.

The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.

After examining the distribution of the times students registered prior to the first day of class, we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG, for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150-day value was chosen after inspection of the 1st semester GPA data.)

So the analysis that follows examines the relationship of RETAINED to EARLIREG, retention to the 2nd semester to early registration.
The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.
First, univariate analyses . . .
GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.
fre var=retained earlireg.

retained
              Frequency   Percent   Valid Percent   Cumulative Percent
Valid  .00         552      11.6          11.6              11.6
       1.00       4201      88.4          88.4             100.0
       Total      4753     100.0         100.0

earlireg
              Frequency   Percent   Valid Percent   Cumulative Percent
Valid  .00        2316      48.7          48.7              48.7
       1.00       2437      51.3          51.3             100.0
       Total      4753     100.0         100.0
crosstabs retained by earlireg /cells=cou col /sta=chisq.
Case Processing Summary
                        Cases
                        Valid            Missing          Total
                        N      Percent   N      Percent   N      Percent
RETAINED * EARLIREG     4753   100.0%    0      .0%       4753   100.0%

RETAINED * EARLIREG Crosstabulation
                                     EARLIREG
                                     .00       1.00      Total
RETAINED  .00   Count                 367       185        552
                % within EARLIREG    15.8%      7.6%      11.6%
          1.00  Count                1949      2252       4201
                % within EARLIREG    84.2%     92.4%      88.4%
Total           Count                2316      2437       4753
                % within EARLIREG   100.0%    100.0%     100.0%
So, 92.4% of those who registered early sustained, compared to
84.2% of those who registered late.
Chi-Square Tests
                                Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              78.832 b   1    .000
Continuity Correction a         78.030     1    .000
Likelihood Ratio                79.937     1    .000
Fisher's Exact Test                                           .000         .000
Linear-by-Linear Association    78.815     1    .000
N of Valid Cases                4753
a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 268.97.
The same analysis using Logistic Regression

logistic regression retained WITH earlireg.
Analyze -> Regression -> Binary Logistic
Case Processing Summary
Unweighted Cases a                         N      Percent
Selected Cases    Included in Analysis     4753   100.0
                  Missing Cases            0      .0
                  Total                    4753   100.0
Unselected Cases                           0      .0
Total                                      4753   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1
The display to the left is a valuable check to make sure that your “1” is the same as the Logistic Regression procedure’s “1”.
Do whatever you can to make
Logistic’s 1s be the same cases as your 1s. Trust me.
The Logistic Regression procedure applies the logistic regression model to the data. It estimates the parameters of the logistic regression equation. That equation is

P(Y) = 1 / (1 + e^-(B0 + B1*X))

It performs the estimation in two stages. The first stage estimates only B0, so the model fit to the data in the first stage is simply

P(Y) = 1 / (1 + e^-(B0))

SPSS labels the various stages of the estimation procedure “Blocks”. In Block 0, a model with only B0 is estimated.
Classification Table a,b
                               Predicted
                               RETAINED              Percentage
Step 0   Observed              .00       1.00        Correct
         RETAINED   .00          0        552          .0
                    1.00         0       4201        100.0
         Overall Percentage                           88.4
a. Constant is included in the model.
b. The cut value is .500
Explanation of the above table: The program estimated B0 = 2.030. The resulting P(Y=1) = .8839. The program computes Y-hat = .8839 for each case using the logistic regression formula with the estimate of B0. If Y-hat is less than or equal to a predetermined cut value of 0.500, that case is recorded as a predicted 0. If Y-hat is greater than 0.5, the program records that case as a predicted 1. It then creates the above table of numbers of actual 1’s and 0’s vs. predicted 1’s and 0’s.
Variables in the Equation
                    B       S.E.    Wald       df   Sig.   Exp(B)
Step 0   Constant   2.030   .045    2009.624   1    .000   7.611
The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). (The value 2.030 is shown in the “Variables in the Equation” table above.) Recall that B1 is not yet in the equation. This means that Y-hat is a constant, equal to .8839 for each case. (I got this by entering the prediction equation into a calculator.) Since Y-hat for each case is greater than 0.5, all predictions in the above Classification Table are 1, which is why the above table has only predicted 1’s. Sometimes this table is more useful than it was in this case. It’s typically most useful when the equation includes continuous predictors.
The above “Variables in the Equation” box is the Logistic Regression equivalent of the “Coefficients Box” in regular regression analysis.
The test statistic in that table is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/SE)^2. So (2.030/.045)^2 = 2,035, which would be 2009.624 if the two coefficients were represented with greater precision.
Exp(B) is the odds ratio: e^2.030. It is the ratio of the odds of P(Y=1) when the predictor equals 1 to the odds of P(Y=1) when the predictor equals 0. It’s an indicator of strength of relationship to the predictor. Means nothing here.
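A quick arithmetic check of the three quantities just discussed, using the rounded values from the table (Python; the output differs slightly from SPSS because of rounding):

import math

b0, se = 2.030, 0.045
p_hat = 1 / (1 + math.exp(-b0))       # constant-only Y-hat, about .8839 for every case
wald = (b0 / se) ** 2                 # about 2035 at this rounding; SPSS reports 2009.624
exp_b = math.exp(b0)                  # Exp(B), about 7.61
print(round(p_hat, 4), round(wald), round(exp_b, 2))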
Variables not in the Equation
                                Score     df   Sig.
Step 0   Variables   EARLIREG   78.832    1    .000
         Overall Statistics     78.832    1    .000
The “Variables not in the Equation” gives information on each independent variable that is not in the equation.
Specifically , it tells you whether or not the variable would be “significant” if it were added to the equation.
In this case, it’s telling us that EARLIREG would contribute significantly to the equation if it were added to the equation, which is what SPSS does next . . .
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     79.937       1    .000
         Block    79.937       1    .000
         Model    79.937       1    .000

Note that the chi-square value is almost the same value as the chi-square value from the CROSSTABS analysis. Whew – three chi-square statistics.

“Step”: Compared to previous step in a stepwise regression. Ignore for now since this regression had only 1 step.
“Block”: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block. Note that the chi-square is identical to the Likelihood Ratio chi-square printed in the Chi-Square Box in the CROSSTABS output.
“Model”: Ignore for now.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      3334.212            .017                   .033
The value under
“-2 Log likelihood”
is a measure of how well the model fit the data in an absolute sense.
Values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to “percent of variance accounted for”. All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.
Classification Table a
                               Predicted
                               RETAINED              Percentage
Step 1   Observed              .00       1.00        Correct
         RETAINED   .00          0        552          .0
                    1.00         0       4201        100.0
         Overall Percentage                           88.4
a. The cut value is .500
The above table is the revised version of the table presented in Block 0.
Note that since X is a dichotomous variable here, there are only two Y-hat values. They are

P(Y) = 1 / (1 + e^-(B0 + B1*0)) = .842 (see below)

and

P(Y) = 1 / (1 + e^-(B0 + B1*1)) = .924 (see below)

In both cases, the Y-hat was greater than .5, so predicted Y in the table was 1 for all cases.
Variables in the Equation
                      B       S.E.   Wald      df   Sig.   Exp(B)
Step 1 a   EARLIREG   .830    .095   75.719    1    .000   2.292
           Constant   1.670   .057   861.036   1    .000   5.311
a. Variable(s) entered on step 1: EARLIREG.
The prediction equation is Y-hat = P(Y=1) = 1 / (1 + e^-(1.670 + .830*EARLIREG)). Since EARLIREG has only two values, those students who registered early will have predicted RETAINED value of 1/(1 + e^-(1.670 + .830*1)) = .924. Those who registered late will have predicted RETAINED value of 1/(1 + e^-(1.670 + .830*0)) = 1/(1 + e^-1.670) = .842. Since both predicted values are above .5, this is why all the cases were predicted to be retained in the table on the previous page.
Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0. Recall that the odds of 1 are P(Y=1)/(1-P(Y=1)). The odds ratio is

Odds ratio = (Odds when X=1) / (Odds when X=0) = [.924/(1-.924)] / [.842/(1-.842)] = 12.158/5.329 = 2.29.

So a person who registered early had odds of being retained that were 2.29 times the odds of a person registering late being retained. An odds ratio of 1 means that the DV is not related to the predictor.
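The same odds-ratio arithmetic in Python, starting from the Block 1 coefficients above (values rounded as reported):

import math

b0, b1 = 1.670, 0.830
p = lambda x: 1 / (1 + math.exp(-(b0 + b1 * x)))

p_early, p_late = p(1), p(0)                               # about .924 and .842
odds_ratio = (p_early / (1 - p_early)) / (p_late / (1 - p_late))
print(round(p_early, 3), round(p_late, 3), round(odds_ratio, 2), round(math.exp(b1), 2))
# the ratio of the two odds equals Exp(B1), about 2.29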
Graphical representation of what we’ve just found.
The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of
X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur.)
The curve is analogous to the straight line plot in a regular regression analysis.
[Figure: P(Y=1) (Y-hat) plotted against X (EARLIREG), with X shown over a hypothetical range from -6 to 4. The two plotted points are the predicted values for the two possible values of EARLIREG (0 and 1); the curve through them is the theoretical logistic relationship.]
Discussion
1. When there is only one dichotomous predictor, the CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information.
BUT as mentioned above . . .
2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.
3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X’s, but the analysis IS rudimentary and is laborious.
No tests of interactions are possible . The analysis involves inspection and comparison of multiple tables.
4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IV’s.
5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.
6. So CROSSTABS should be considered for the very simplest situations involving one categorical predictor.
But LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors and when there are one or more continuous predictors.
The data analyzed here represent the relationship of Pancreatitis Diagnosis to measures of Amylase and Lipase.
Both Amylase and Lipase levels are tests that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis.
The objective here is to determine 1) which alone is the better predictor of the condition, and 2) whether both are needed.
Because the distributions of both predictors were positively skewed, logarithms of the actual Amylase and
Lipase values were used for this handout and for some of the following handouts.
This handout illustrates the analysis of the relationship of Pancreatitis diagnosis to Amylase only . Note that since Amylase is a continuous independent variable, chi-square analysis would not be appropriate.
The name of the dependent variable is PANCGRP. It is 1 if the person is diagnosed with Pancreatitis. It is 0 otherwise. This forces us to use a technique appropriate for dichotomous dependent variables.
Distributions of logamy and loglip – still somewhat positively skewed even though logarithms were taken.
[Figures: Histograms of logamy (Mean = 2.0267, Std. Dev. = 0.50269, N = 306) and loglip (Mean = 2.3851, Std. Dev. = 0.82634, N = 306).]
The logamy and loglip scores are highly positively correlated. For that reason, it may be that once either is in the equation, adding the other won’t significantly increase the fit of the model. We’ll test that hypothesis later.
Relationship of Pancreatitis Diagnosis to log(Amylase)

[Figure: Scatterplot of Pancreatitis Diagnosis (0 or 1) vs. LOGAMY, with the linear line of best fit. This graph is of individual cases: Y values are 0 or 1; X values are continuous.]
This graph represents a primary problem with visualizing results when the dependent variable is a dichotomy.
It is difficult to see the relationship that may very well be represented by the data. One can see from this graph, however, that when log amylase is low, there are more 0’s (no Pancreatitis) and when log amylase is high there are more 1’s (presence of Pancreatitis).
The line through the scatterplot is the linear line of best fit. It was easy to generate. It represents the relationship of probability of Pancreatitis to log amylase that would be assumed if a linear regression were conducted. So the line is what we would predict based on linear regression.
But, the logistic regression analysis assumes that the relationship between probability of Pancreatitis to log amylase is different. The relationship assumed by the logistic regression analysis would be an S-shaped curve, called an ogive, shown below.
Below are the same data, this time with the line of best fit generated by the logistic regression analysis through it. While neither line fits the observed individual case points well in the middle, it’s easy to see that the logistic line fits better at small and at large values of log amylase.
[Figure: The same scatterplot of Pancreatitis Diagnosis vs. LOGAMY, this time with the logistic (ogival) line of best fit, i.e., the predicted probability curve.]
This is not a required part of logistic regression analysis but a way of presenting the data to help understand what’s going on.
The plots above were plots of individual cases. Each point represented the DV value of a case (0 or 1) vs. that person’s IV value (log amylase value). The problem was that the plot didn’t really show the relationship because the DV could take on only two values - 0 and 1.
When the DV is a dichotomy, it will be profitable to form groups of cases with similar X values and plot the proportion of 1’s within each group vs. the X value for that group.
To illustrate this, groups were formed for every .2 increase in log amylase. That is, the values 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, and 3.8 were used as group midpoints. Each case was assigned to a group based on how close that case's log amylase value was to the group midpoint. So, for example, all cases with log amylase between 1.5 and 1.7 were assigned to the 1.6 group.
SPSS Syntax: compute logamygp = rnd(logamy,.2).
Then the proportion of 1’s within each group was computed. (When the data are 0s and 1s, the mean of all the scores is equal to the proportion of 1s.) The figure below is a plot of the proportion of 1’s within each group vs. the groups midpoints. Note that the points form a curve, quite a bit like the ogival form from the logistic regression analysis shown on the previous page.
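Here is a sketch of the same grouping idea in Python/pandas. The data are simulated (the study's data file isn't reproduced here), but the variable names mirror the handout and the rounding mimics SPSS's rnd(logamy,.2):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
logamy = rng.normal(2.0, 0.5, 300)                   # stand-in for the real log amylase values
p_true = 1 / (1 + np.exp(-(-16.0 + 6.9 * logamy)))   # roughly the coefficients reported later
pancgrp = rng.binomial(1, p_true)

df = pd.DataFrame({"logamy": logamy, "pancgrp": pancgrp})
df["logamygp"] = (df["logamy"] / 0.2).round() * 0.2   # same idea as compute logamygp = rnd(logamy,.2)
prop_ones = df.groupby("logamygp")["pancgrp"].mean()  # proportion of 1s within each group
print(prop_ones)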
[Figure, left: Original graph of 0s and 1s vs. LOGAMY (individual cases).]

[Figure, right: Proportion of 1s within each group vs. LOGAMYGP (group midpoints).]

Note that the plot of proportion of 1s within groups is not linear. The proportions increase in an approximately ogival (S-shaped) fashion, with asymptotes at 0 and 1. This, of course, is a violation of the assumption of a linear relationship which is required when performing linear regression.
The analyses that follow illustrate the application of both linear and logistic regression to the data.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT pancgrp
/METHOD=ENTER logamy
/SCATTERPLOT=(*ZRESID ,*ZPRED )
/RESIDUALS HIST(ZRESID) NORM(ZRESID) .
Excerpt from the data file
Variables Entered/Removed b
Model   Variables Entered   Variables Removed   Method
1       LOGAMY a            .                   Enter

Model Summary b
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .755 a   .570       .568                .2569
a. Predictors: (Constant), LOGAMY
b. Dependent Variable: PANCGRP
ANOVA b
Model           Sum of Squares   df    Mean Square   F         Sig.
1   Regression  22.230           1     22.230        336.706   .000 a
    Residual    16.770           254   6.602E-02
    Total       39.000           255
a. Predictors: (Constant), LOGAMY
b. Dependent Variable: PANCGRP
Coefficients a
                 Unstandardized Coefficients   Standardized Coefficients
Model            B         Std. Error          Beta                        t         Sig.
1   (Constant)   -1.043    .069                                            -15.125   .000
    LOGAMY        .635     .035                .755                         18.350   .000
a. Dependent Variable: PANCGRP

The linear relationship of pancdiag to logamy is strong – R-squared = .570. But as we'll see, the logistic relationship is even stronger.
Thus, the predicted linear relationship of probability of Pancreatitis to log amylase is
Predicted probability of Pancreatitis = -1.043 + 0.635 * logamy.
The following are the usual linear regression diagnostics.
Casewise Diagnostics a
Case Number   Std. Residual   PANCGRP
54            3.016           1.00
77            3.343           1.00
85            3.419           1.00
97            3.218           1.00
a. Dependent Variable: PANCGRP

Residuals Statistics a
                        Minimum   Maximum   Mean          Std. Deviation   N
Predicted Value         -.1044    1.4256    .1875         .2953            256
Residual                -.5998    .8786     -1.3848E-16   .2564            256
Std. Predicted Value    -.989     4.193     .000          1.000            256
Std. Residual           -2.334    3.419     .000          .998             256
a. Dependent Variable: PANCGRP
[Figure: Histogram of regression standardized residuals. Dependent Variable: PANCGRP. Std. Dev = 1.00, Mean = 0.00, N = 256.00.]

Nothing particularly unusual here. Or here.
Normal P-P Plot

[Figure: Normal P-P plot of regression standardized residuals (Observed Cum Prob vs. Expected Cum Prob).]

The histogram of residuals is not particularly unusual. Although there is a clear bend from the expected linear line, this is not particularly diagnostic.
Scatterplot

[Figure: Regression standardized residuals vs. regression standardized predicted values. Dependent Variable: PANCGRP.]

This is an indicator that there is something amiss. The plot of residuals vs. predicted values is supposed to form a classic 0-correlation scatterplot, with no unusual shape. This is clearly unusual.
Computation of y-hats for the groups.
I had SPSS compute the Y-hat for each of the group mid-points discussed on page 3. I then plotted both the observed group proportion of 1’s that was shown on the previous page and the Y-hat for each group.
Of course, the Y-hats are in a linear relationship with log amylase. Note that the solid points don't really represent the relationship shown by the open symbols. Note also that the solid points extend above 1 and below 0. But the observed proportions are bound by 1 and 0.

compute mrgpyhat = -1.043 + .635*logamyvalue.
execute.
GRAPH
/SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc mrgpyhat (PAIR)
/MISSING=LISTWISE .
[Figure: Overlay scatterplot of MRGPYHAT vs. LOGAMYGP (solid points: predicted proportion of Pancreatitis diagnoses within groups; note that predictions extend below 0 and above 1) and PROBPANC vs. LOGAMYGP (open points: observed proportion of Pancreatitis diagnoses within groups).]
Remember we’re here to determine if there is a significant relationship of pancreatitis diagnosis to log amylase.

logistic regression pancgrp with logamy.
Case Processing Summary
Unweighted Cases a                         N     Percent
Selected Cases    Included in Analysis     256   83.7
                  Missing Cases            50    16.3
                  Total                    306   100.0
Unselected Cases                           0     .0
Total                                      306   100.0
a. If weight is in effect, see classification table for the total number of cases.
Dependent Variable Encoding
Original Value           Internal Value
.00  No Pancreatitis     0
1.00 Pancreatitis        1
SPSS’s Logistic regression procedure always performs the analysis in at least two steps, which it calls Blocks.
Recall the logistic prediction formula is

P(Y) = 1 / (1 + e^-(B0 + B1*X))

In the first block, labeled Block 0, only B0 is entered into the equation. In this B0-only equation, the probability of a 1 is a constant, equal to the overall proportion of 1's for the whole sample.

Obviously this model doesn't make sense when your main interest is in whether or not the probability increases as X increases. But SPSS forces us to consider (or delete) the results of a B0-only model. This model does serve as a useful baseline against which to assess subsequent models, all of which do assume that the probability of a 1 increases as the IV increases.
For each block the Logistic Regression procedure automatically prints a 2x2 table of predicted and observed 1's and 0's. For all of these tables, a case is classified as a predicted 1 if its Y-hat (predicted probability) exceeds 0.5. Otherwise it's classified as a predicted 0. Since only the constant is estimated here, the predicted probability for every case is 1/(1 + exp(-(-1.466))) = .1875. It happens that this is simply the proportion of 1's in the sample, which is 48/256 = 0.1875. Since that's less than 0.5, every case is predicted to be a 0 for this constant-only model.
A case is classified as a Predicted 0 if the Y-hat for that case is less than or equal to .5. A case is classified as a Predicted 1 if the Y-hat for that case is larger than .5.
Classification Table a,b
                                            Predicted
                                            Pancreatitis Diagnosis (DV)
                                            No                           Percentage
Step 0   Observed                           Pancreatitis   Pancreatitis  Correct
         Pancreatitis     No Pancreatitis   208            0             100.0   <- Specificity
         Diagnosis (DV)   Pancreatitis      48             0             .0      <- Sensitivity
         Overall Percentage                                              81.3
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
                    B        S.E.   Wald     df   Sig.   Exp(B)
Step 0   Constant   -1.466   .160   83.852   1    .000   .231
The test that is recommended is the Wald test. The p-value of .000 says that the value of B0 is significantly different from 0.

The predicted probability of a 1 here is

P(1) = 1 / (1 + e^-(-1.466)) = 0.1875, the observed proportion of 1's.
Variables not in the Equation
                              Score     df   Sig.
Step 0   Variables   LOGAMY   145.884   1    .000
         Overall Statistics   145.884   1    .000

The “Variables not in the Equation” box says that if log amylase were added to the equation, it would be significant.
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     151.643      1    .000
         Block    151.643      1    .000
         Model    151.643      1    .000

Step: The procedure can perform stepwise regression from a set of covariates. The Step chi-square tests the significance of the increase in fit of the current set of covariates vs. those in the previous set.
Block: The significance of the increase in fit of the current model vs. the last Block. We'll focus on this.
Model: Tests the significance of the increase in fit of the current model vs. the “B0 only” model.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      95.436              .447                   .722

The linear regression R-squared was .570.
In the following classification table, for each case, the predicted probability of 1 is evaluated and compared with 0.5. If that probability is > 0.5, the case is a predicted 1, otherwise it's a predicted 0.

Classification Table a
                                            Predicted
                                            Pancreatitis Diagnosis (DV)
                                            No                           Percentage
Step 1   Observed                           Pancreatitis   Pancreatitis  Correct
         Pancreatitis     No Pancreatitis   200            8             96.2   <- Specificity
         Diagnosis (DV)   Pancreatitis      14             34            70.8   <- Sensitivity (power)
         Overall Percentage                                              91.4
a. The cut value is .500
Specificity: Percentage of all cases without the disease who were predicted to not have it.
(Percentage of correct predictions for those people who don’t have the disease.)
Sensitivity: Percentage of all cases with the disease who were predicted to have it.
(Percentage of correct predictions for those people who did have the disease.)
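Computed from the counts in the classification table above (a small Python sketch):

# Counts from the Step 1 classification table (rows = observed, columns = predicted).
tn, fp = 200, 8     # observed No Pancreatitis
fn, tp = 14, 34     # observed Pancreatitis

specificity = tn / (tn + fp)                  # about .962
sensitivity = tp / (tp + fn)                  # about .708
overall = (tn + tp) / (tn + fp + fn + tp)     # about .914
print(round(specificity, 3), round(sensitivity, 3), round(overall, 3))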
Variables in the Equation
                      B         S.E.    Wald     df   Sig.   Exp(B)
Step 1 a   LOGAMY     6.898     1.017   45.972   1    .000   990.114
           Constant   -16.020   2.227   51.744   1    .000   .000
a. Variable(s) entered on step 1: LOGAMY.

This box is analogous to the “Coefficients” box in Regression. These are the coefficients for the equation:

y-hat = 1 / (1 + e^-(-16.0203 + 6.8978*LOGAMY))
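For readers who want to reproduce this kind of fit outside SPSS, here is a hedged sketch using Python's statsmodels; the data are simulated stand-ins, so the estimates will only roughly resemble the ones above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
logamy = rng.normal(2.0, 0.5, 306)                     # simulated predictor
p_true = 1 / (1 + np.exp(-(-16.0 + 6.9 * logamy)))
pancgrp = rng.binomial(1, p_true)                      # simulated 0/1 diagnosis

X = sm.add_constant(logamy)                            # adds the intercept (B0) column
fit = sm.Logit(pancgrp, X).fit(disp=0)
print(fit.params)                                      # B0 and B1
print(np.exp(fit.params))                              # Exp(B), the odds ratios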
To show that the relationship assumed by the logistic regression analysis is a better representation of the relationship than the linear, I computed the probability of 1 for each of the group midpoints discussed earlier. The figure below is a plot of those probabilities and the observed proportion of 1's vs. the group midpoints. Compare this figure with the linear-regression version above to see how much better the logistic regression relationship fits the data than does the linear relationship.

compute lrgpyhat = 1/(1+exp(-(-16.0203 + 6.8978*logamygp))) .
GRAPH
  /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc lrgpyhat (PAIR)
  /MISSING=LISTWISE .
[Figure: Overlay scatterplot of LRGPYHAT vs. LOGAMYGP (predicted proportions, most of which coincide precisely with the observed proportions) and PROBPANC vs. LOGAMYGP (observed proportions within groups). One group's observed proportion departs from its prediction; could it be that there were coding errors for that group?]
Compare this graph with the one immediately below from the linear regression analysis.
Note that the predicted proportions correspond much more closely to the observed proportions here.
Note the diverging predictions for all groups with proportions = 0 or 1.
The linear regression and the logistic regression analyses yield roughly the same predictions for “interior” points. But they diverge for “extreme” points – points with extremely small or extremely large values of X.
I computed residuals for all cases. Recall that a residual is Y – Y-hat. For these data, Y’s were either 1 or 0.
Y-hats are probabilities.
First, I computed Y-hats for all cases, using both the linear equation and the logistic equation.
compute mryhat = -1.043 + .635*logamy.
compute lryhat = 1/(1+exp(-(-16.0203 + 6.8978*logamy))).

Now the residuals are computed.

compute mrresid = pancdiag - mryhat.
compute lrresid = pancdiag - lryhat.
frequencies variables = mrresid lrresid /histogram /format=notable.
[Figure: Histogram of MRRESID, the residuals from the linear multiple regression. Std. Dev = .26, Mean = .00, N = 256.00. This is the distribution of residuals for the linear multiple regression. It's like the standardized-residual histogram above, except these are actual residuals, not Z's of residuals. Note that there are many large residuals, large negative and large positive.]

[Figure: PANCGRP vs. LOGAMY with the linear line of best fit; points above the line have positive residuals, points below the line have negative residuals.]
The residuals above are simply distances of the observed points from the best fitting line, in this case a straight line.
[Figure: Histogram of LRRESID, the residuals from the logistic regression. Std. Dev = .24, Mean = .00, N = 256.00. This is the distribution of residuals for the logistic regression. Note that most of them are virtually 0.]

The residuals above are simply distances of the observed points from the best fitting line, in this case a logistic line.

[Figure: PANCGRP and the logistic predicted values vs. LOGAMY; the circled points are those with near-0 residuals.]
What these two sets of figures show is that the vast majority of residuals from the logistic regression analysis were virtually 0, while for the linear regression, there were many residuals that were substantially different from
0. So the logistic regression analysis has modeled the Y’s better than the linear regression.
logistic regression pancgrp with logamy loglip

LOGISTIC REGRESSION VAR=pancgrp
/METHOD=ENTER logamy loglip
/CLASSPLOT
/CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
Case Processing Summary
Unweighted Cases a                         N     Percent
Selected Cases    Included in Analysis     256   83.7
                  Missing Cases            50    16.3
                  Total                    306   100.0
Unselected Cases                           0     .0
Total                                      306   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value           Internal Value
.00  No Pancreatitis     0
1.00 Pancreatitis        1
Classification Table a,b
                                            Predicted
                                            Pancreatitis Diagnosis (DV)
                                            No                           Percentage
Step 0   Observed                           Pancreatitis   Pancreatitis  Correct
         Pancreatitis     No Pancreatitis   208            0             100.0
         Diagnosis (DV)   Pancreatitis      48             0             .0
         Overall Percentage                                              81.3
a. Constant is included in the model.
b. The cut value is .500
The following assumes a model with only the constant, B0, in the equation.

Variables in the Equation
                    B        S.E.   Wald     df   Sig.   Exp(B)
Step 0   Constant   -1.466   .160   83.852   1    .000   .231

Variables not in the Equation
                              Score     df   Sig.
Step 0   Variables   LOGAMY   145.884   1    .000
                     LOGLIP   161.169   1    .000
         Overall Statistics   165.256   2    .000

Each p-value tells you whether or not the variable would be significant if entered BY ITSELF. That is, each of the above p-values should be interpreted on the assumption that only 1 of the variables would be entered.
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     170.852      2    .000
         Block    170.852      2    .000
         Model    170.852      2    .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      76.228              .487                   .787
Classification Table a
                                            Predicted
                                            Pancreatitis Diagnosis (DV)
                                            No                           Percentage
Step 1   Observed                           Pancreatitis   Pancreatitis  Correct
         Pancreatitis     No Pancreatitis   204            4             98.1   <- Specificity: 204/208
         Diagnosis (DV)   Pancreatitis      10             38            79.2   <- Sensitivity: 38/48
         Overall Percentage                                              94.5
a. The cut value is .500

Specificity is the ability to identify cases who do NOT have the disease. Among those without the disease, .981 were correctly identified.
Sensitivity is the ability to identify cases who do have the disease. Among those with the disease, .792 were correctly identified.
Variables in the Equation
                      B         S.E.    Wald     df   Sig.   Exp(B)
Step 1 a   LOGAMY     2.659     1.418   3.518    1    .061   14.286
           LOGLIP     2.998     .844    12.628   1    .000   20.043
           Constant   -14.573   2.251   41.907   1    .000   .000
a. Variable(s) entered on step 1: LOGAMY, LOGLIP.

Note that LOGAMY does not officially increase predictability over that afforded by LOGLIP.
Interpretation of the coefficients . . .
Bs: Not easily interpretable on a raw probability scale. Expected increase in log odds for a one-unit increase in
IV. If the p-value is <= .05, we can say that the inclusion of the predictor resulted in a significant change in probability of Y=1, increase if Bi > 0, decrease if Bi < 0. We just cannot give a simple quantitative prediction of the amount of change in probability of Y=1.
SEs: Standard error of the estimate of Bi.
Wald: Test statistic.
Sig: p-value associated with test statistic.
Note that LOGAMY does NOT (officially) add significantly to prediction over and above the prediction afforded by LOGLIP.
Exp(B): Odds ratio for a one-unit increase in IV among persons equal on the other IV.
A person one unit higher on an IV will have Exp(B) times the odds of having Pancreatitis. So a person one unit higher on LOGLIP will have 20.04 times the odds of having Pancreatitis.
The Exp(B) column is mostly useful for dichotomous predictors – 0 = Absent; 1 = present.
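The Exp(B) values above are just e raised to the B's; a two-line check in Python:

import math

b_logamy, b_loglip = 2.659, 2.998                    # B's from the table above
print(math.exp(b_logamy), math.exp(b_loglip))        # close to the 14.286 and 20.043 reported above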
Step number: 1
Observed Groups and Predicted Probabilities
80 ┼ ┼
│N │
│N │
F │N │
R 60 ┼N ┼
E │N │
Q │N │
U │NN │
E 40 ┼NN ┼
N │NN │
C │NNN │
Y │NNN │
20 ┼NNN ┼
│NNN P │
│NNN NN P │
│NNNNNNNNNNN P N P PP PP │
Predicted ─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────
Prob: 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
Group: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP
Y-HAT
Predicted Probability is of Membership for Pancreatitis
The Cut Value is .50
Symbols: N - No Pancreatitis
P - Pancreatitis
Each Symbol Represents 5 Cases.
One aspect of the above plot is misleading because many cases are not represented in it. Only those cases which happened to be so close to other cases that a group of 5 cases could be formed are represented. So, for example, those relatively few cases whose y-hats were close to .5 are not seen in the above plot, because there were not enough of them to make a group of 5 cases.
Classification Plots using dot plots.
Here’s the same information gotten as dot plots of Y-hats with PANCGRP as a Row Panel Variable.
For the most part, the patients who did not get Pancreatitis had small predicted probabilities while the patients who did get it had high predicted probabilities, as you would expect.
There were, however, a few patients who did get Pancreatitis who had small values of Y-hat.
Those patients are dragging down the sensitivity of the test.
Note that these patients don’t show up on the CASEPLOT produced by the LOGISTIC
REGRESSION procedure.
Classification Plots using Histograms in EXPLORE
Here’s another equivalent representation of what the authors of the program were trying to show.
[Figure: Histogram of predicted probabilities for pancgrp = No Pancreatitis. Mean = 0.0515201, Std. Dev. = 0.12420309, N = 208.]

[Figure: Histogram of predicted probabilities for pancgrp = Pancreatitis. Mean = 0.7767463, Std. Dev. = 0.33120602, N = 48.]
Visualizing the equation with two predictors
(Mike – use this as an opportunity to whine about SPSS’s horrible 3-D graphing capability.)
With one predictor, a simple scatterplot of YHATs vs. X will show the relationship between Y and X implied by the model.
For two predictor models, a 3-D scatterplot is required. Here’s how the graph below was produced.
Graphs -> Interactive -> Scatterplot. . .
The graph shows the general ogival relationship of YHAT on the vertical to LOGLIP and
LOGAMY. But the relationships really aren’t apparent until the graph is rotated.
Don’t ask me to demonstrate rotation. SPSS now does not offer the ability to rotate the graph interactively. It used to offer such a capability, but it’s been removed. Shame on SPSS.
The same graph but with Linear Regression Y-hats plotted vs. loglip and logamy.
Representing Relationships with a Table – the PowerPoint slides

compute logamygp2 = rnd(logamy,.5).    <- Rounds logamy to the nearest .5.
compute loglipgp2 = rnd(loglip,.5).

LOGAMY and LOGLIP groups were created by rounding values of LOGAMY and LOGLIP to the nearest .5.

logamygp2
              Frequency   Percent   Valid Percent   Cumulative Percent
Valid  1.50        123      40.2          40.2              40.2
       2.00        105      34.3          34.3              74.5
       2.50         46      15.0          15.0              89.5
       3.00         21       6.9           6.9              96.4
       3.50         10       3.3           3.3              99.7
       4.00          1        .3            .3             100.0
       Total       306     100.0         100.0

Here's the LOGLIP grouping in which the values were rounded to the nearest .5.

loglipgp2
              Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .50          1        .3            .3                .3
       1.00          6       2.0           2.0               2.3
       1.50         45      14.7          14.7              17.0
       2.00        125      40.8          40.8              57.8
       2.50         49      16.0          16.0              73.9
       3.00         30       9.8           9.8              83.7
       3.50         20       6.5           6.5              90.2
       4.00         20       6.5           6.5              96.7
       4.50          8       2.6           2.6              99.3
       5.00          2        .7            .7             100.0
       Total       306     100.0         100.0

means pancgrp yhatamylip by logamygp2 by loglipgp2.

This produces a very long two-way table of mean Y-hat values for each combination of logamy group and loglip group. Below, this table is "prettified".
The above MEANS output, put into a 2-way table in Word

The entry in each cell is the expected probability (Y-hat) of contracting Pancreatitis at the combination of logamy and loglip represented by the cell.

[Table: Mean Y-hat by LOGAMY group (rows: 1.5, 2, 2.5, 3, 3.5, 4) and LOGLIP group (columns: .5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5). The probabilities rise from essentially .00 in the lower left (low LOGAMY, low LOGLIP) to 1.00 in the upper right (high LOGAMY, high LOGLIP); for example, at LOGAMY = 2.5 the probability is .03 when LOGLIP = 1.5 and .97 when LOGLIP = 4.0.]
This table shows the joint relationship of predicted Y to LOGAMY and LOGLIP. Move from the lower left of the table to the upper right.
It also shows the partial relationships of each.
Partial Relationship of YHAT to LOGLIP – Move across any row.
So, for example, if your log amylase were 2.5, your chances of having Pancreatitis would be only .03 if your log lipase were 1.5. But at the same 2.5 value of log amylase, your chances would be .97 if your log lipase value were 4.0.
Partial Relationship of YHAT to LOGAMY – Move up any column.
Empty cells show that there are certain combinations of LOGAMY and LOGLIP that are very unlikely.
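A sketch of how such a two-way table of mean Y-hats can be built in Python/pandas (simulated predictor values; the coefficients are the ones from the two-predictor equation above):

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"logamy": rng.normal(2.0, 0.5, 306),
                   "loglip": rng.normal(2.4, 0.8, 306)})
df["yhatamylip"] = 1 / (1 + np.exp(-(-14.573 + 2.659 * df["logamy"] + 2.998 * df["loglip"])))

df["logamygp2"] = (df["logamy"] / 0.5).round() * 0.5   # rnd(logamy, .5)
df["loglipgp2"] = (df["loglip"] / 0.5).round() * 0.5   # rnd(loglip, .5)
table = df.pivot_table(values="yhatamylip", index="logamygp2",
                       columns="loglipgp2", aggfunc="mean")
print(table.round(2))        # rows = LOGAMY groups, columns = LOGLIP groups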
The data here are the FFROSH data – freshmen from 1987-1992.
The dependent variable is RETAINED – whether a student went directly to the 2nd semester.
The independent variable is NRACE – the ethnic group recorded for the student. It has three values:
1: White; 2: African American; 3: Asian-American
Recall that ALL independent variables are called covariates in LOGISTIC REGRESSION.
We know that categorical independent variables with 3 or more categories must be represented by group coding variables. LOGISTIC REGRESSION allows us to do that internally.
Indicator coding is dummy coding. Here, Category 1 (White) is used as the reference category.
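For comparison, here is what that indicator (dummy) coding looks like if you build it yourself in Python/pandas, dropping the White column so that White is the reference category (illustrative only; SPSS does this internally):

import pandas as pd

nrace = pd.Series([1, 2, 3, 1, 2], name="nrace")   # 1 = White, 2 = African American, 3 = Asian-American
gcv = pd.get_dummies(nrace, prefix="nrace").drop(columns="nrace_1").astype(int)
print(gcv)   # nrace_2 compares African Americans with Whites; nrace_3 compares Asian-Americans with Whites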
LOGISTIC REGRESSION retained
/METHOD = ENTER nrace
/CONTRAST (nrace)=Indicator(1)
/CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
This is the syntax generated by the above menus.

Case Processing Summary
Unweighted Cases a                         N      Percent
Selected Cases    Included in Analysis     4697   98.8
                  Missing Cases            56     1.2
                  Total                    4753   100.0
Unselected Cases                           0      .0
Total                                      4753   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1

Categorical Variables Codings
                                                     Parameter coding
nrace NUMERIC WHITE/BLACK/      Frequency      (1)      (2)
ORIENTAL RACE CODE
1.00 WHITE                      3987           .000     .000
2.00 BLACK                      626            1.000    .000
3.00 ORIENTAL                   84             .000     1.000

SPSS's coding of the independent variable here is important. Note that Whites are the 0,0 group. The first group coding variable compares Blacks with Whites. The 2nd compares Asian-Americans with Whites.
Classification Table a,b
                               Predicted
                               retained              Percentage
Step 0   Observed              .00       1.00        Correct
         retained   .00          0        545          .0
                    1.00         0       4152        100.0
         Overall Percentage                           88.4
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                    B       S.E.   Wald       df   Sig.   Exp(B)
Step 0   Constant   2.031   .046   1986.391   1    .000   7.618
Variables not in the Equation
                                Score   df   Sig.
Step 0   Variables   nrace      6.680   2    .035
                     nrace(1)   2.433   1    .119
                     nrace(2)   3.903   1    .048
         Overall Statistics     6.680   2    .035

SPSS first prints p-value information for the collection of group coding variables representing the categorical factor. Then it prints p-value information for each GCV separately. None of the information about categorical variables in this “Variables not in the Equation” box is too useful.
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     7.748        2    .021
         Block    7.748        2    .021
         Model    7.748        2    .021

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      3364.160 a          .002                   .003
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table a
                               Predicted
                               retained              Percentage
Step 1   Observed              .00       1.00        Correct
         retained   .00          0        545          .0
                    1.00         0       4152        100.0
         Overall Percentage                           88.4
a. The cut value is .500
Variables in the Equation
                      B       S.E.   Wald       df   Sig.   Exp(B)
Step 1 a   nrace                     6.368      2    .041
           nrace(1)   .237    .143   2.741      1    .098   1.268
           nrace(2)   1.007   .515   3.829      1    .050   2.737
           Constant   1.989   .049   1669.869   1    .000   7.306
a. Variable(s) entered on step 1: nrace.
So the bottom line is that
0) There are significant differences in likelihood of retention to the 2nd semester between the groups (p=.041).
Specifically . . .
1) Blacks are not significantly more likely to sustain than Whites, although the difference approaches significance. (p=.098).
2) Asian-Americans are significantly more likely to sustain than Whites (p=.050).
The data used for this are data on freshmen from 1987-1992. Start here on 10/7/15
The dependent variable is RETAINED – whether a student went directly into the 2nd semester or not.
Predictors (covariates in logistic regression) are HSGPA , ACT composite , and Overall attempted hours in the first semester, excluding the freshman seminar course.
GET FILE='E:\MdbR\FFROSH\ffrosh.sav'.
logistic regression retained with hsgpa actcomp oatthrs1.
Case Processing Summary
Unweighted Cases a                         N      Percent
Selected Cases    Included in Analysis     4852   100.0
                  Missing Cases            0      .0
                  Total                    4852   100.0
Unselected Cases                           0      .0
Total                                      4852   100.0
a. If weight is in effect, see classification table for the total number of cases.
Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1

Classification Table a,b
                               Predicted
                               RETAINED              Percentage
Step 0   Observed              .00       1.00        Correct
         RETAINED   .00          0        620          .0     <- Specificity
                    1.00         0       4232        100.0    <- Sensitivity
         Overall Percentage                           87.2
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                    B       S.E.   Wald       df   Sig.   Exp(B)
Step 0   Constant   1.921   .043   1994.988   1    .000   6.826

Variables not in the Equation
                                Score     df   Sig.
Step 0   Variables   HSGPA      225.908   1    .000
                     ACTCOMP    44.653    1    .000
                     OATTHRS1   274.898   1    .000
         Overall Statistics     385.437   3    .000
Recall that the p-values are those that would be obtained if a variable were put BY ITSELF into the equation.
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     381.011      3    .000
         Block    381.011      3    .000
         Model    381.011      3    .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      3327.365            .076                   .141

Classification Table a
                               Predicted
                               RETAINED              Percentage
Step 1   Observed              .00       1.00        Correct
         RETAINED   .00         35        585          5.6    <- Specificity
                    1.00        16       4216         99.6    <- Sensitivity
         Overall Percentage                            87.6
a. The cut value is .500

Variables in the Equation
                      B        S.E.   Wald      df   Sig.   Exp(B)
Step 1 a   HSGPA      1.077    .101   112.767   1    .000   2.935
           ACTCOMP    -.022    .014   2.637     1    .104   .978
           OATTHRS1   .148     .012   146.487   1    .000   1.160
           Constant   -2.225   .308   52.362    1    .000   .108
a. Variable(s) entered on step 1: HSGPA, ACTCOMP, OATTHRS1.
From the report to the faculty – Output from SPSS for the Macintosh Version 6 .
---------------------- Variables in the Equation -----------------------
Variable B S.E. Wald df Sig R Exp(B)
AGE -.0950 .0532 3.1935 1 .0739 -.0180 .9094
NSEX .2714 .0988 7.5486 1 .0060 .0388 1.3118
After adjusting for differences associated with the other variables, Males were more likely to enroll in the second semester .
NRACE1 -.4738 .1578 9.0088 1 .0027 -.0436 .6227
After adjusting for differences associated with the other variables, Whites were less likely to enroll in the second semester.
NRACE2 .1168 .1773 .4342 1 .5099 .0000 1.1239
HSGPA .8802 .1222 51.8438 1 .0000 .1162 2.4114
After adjusting for differences associated with the other variables, those with higher high school GPA's were more likely to enroll in the second semester.
ACTCOMP -.0239 .0161 2.1929 1 .1387 -.0072 .9764
OATTHRS1 .1588 .0124 164.4041 1 .0000 .2098 1.1721
After adjusting for differences associated with the other variables, those with higher attempted hours were more likely to enroll in the second semester.
EARLIREG .2917 .1011 8.3266 1 .0039 .0414 1.3387
After adjusting for differences associated with the other variables, those who registered six months or more before the first day of school were more likely to enroll in the second semester.
NADMSTAT -.2431 .1226 3.9330 1 .0473 -.0229 .7842
POSTSEM -.1092 .0675 2.6206 1 .1055 -.0130 .8965
PREYEAR2 -.0461 .0853 .2924 1 .5887 .0000 .9549
PREYEAR3 .1918 .0915 4.3952 1 .0360 .0255 1.2114
After adjusting for differences associated with the other variables, those who enrolled in 1991 were more likely to enroll in the second semester than others enrolled before 1990. What???
POSYEAR2 -.0845 .0977 .7467 1 .3875 .0000 .9190
POSYEAR3 -.1397 .0998 1.9585 1 .1617 .0000 .8696
HAVEF101 .4828 .1543 9.7876 1 .0018 .0459 1.6206
After adjusting for differences associated with the other variables, those who took the freshman seminar were more likely to enroll in second semester than those who did not.
Constant -.1075 1.1949 .0081 1 .9283
Variables in the Equation
                        B       S.E.    Wald      df   Sig.   Exp(B)
Step 1 a   age          -.099   .053    3.461     1    .063   .905
           nsex         .257    .099    6.726     1    .010   1.294
           nrace                        19.394    2    .000
           nrace(1)     -.944   .487    3.749     1    .053   .389
           nrace(2)     -.337   .504    .446      1    .504   .714
           hsgpa        .852    .123    48.204    1    .000   2.344
           actcomp      -.021   .016    1.676     1    .195   .979
           oatthrs1     .159    .012    163.499   1    .000   1.173
           earlireg     .316    .102    9.640     1    .002   1.372
           admstat(1)   .253    .123    4.222     1    .040   1.288
           postsem      -.115   .068    2.880     1    .090   .891
           y1988        -.048   .086    .306      1    .580   .954
           y1989        .177    .092    3.737     1    .053   1.194
           y1991        -.078   .098    .633      1    .426   .925
           y1992        -.124   .101    1.511     1    .219   .884
           havef101     .967    .152    40.364    1    .000   2.629
           Constant     -.032   1.228   .001      1    .979   .968
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.
This is from SPSS V15.
There are slight differences in the numbers, not due to changes in the program but due to slight differences in the data. I believe some cases were dropped between when the V6 and V15 analyses were performed.
NRACE was coded differently in the V15 analysis.
The similarity is a tribute to the statisticians who developed logistic regression.
The full FFROSH Analysis in Version 15 of SPSS

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat
postsem y1988 y1989 y1991 y1992 havef101
/categorical nrace admstat.
[DataSet1] G:\MdbR\FFROSH\ffrosh.sav
Case Processing Summary
Unweighted Cases a                         N      Percent
Selected Cases    Included in Analysis     4781   98.5
                  Missing Cases            71     1.5
                  Total                    4852   100.0
Unselected Cases                           0      .0
Total                                      4852   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
.00               0
1.00              1

Categorical Variables Codings a
                                                           Parameter coding
                                             Frequency     (1)      (2)
nrace NUMERIC WHITE/BLACK/    1.00 WHITE     4060          1.000    .000
ORIENTAL RACE CODE            2.00 BLACK     636           .000     1.000
                              3.00 ORIENTAL  85            .000     .000
admstat NUMERIC               AP             3292          1.000
ADMISSION STATUS CODE         CD             1489          .000
a. This coding results in indicator coefficients.

Block 0 output skipped
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     494.704      15   .000
         Block    494.704      15   .000
         Model    494.704      15   .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      3155.842 a          .098                   .184
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table a
                               Predicted
                               retained              Percentage
Step 1   Observed              .00       1.00        Correct
         retained   .00         79        531         13.0
                    1.00        33       4138         99.2
         Overall Percentage                            88.2
a. The cut value is .500
Variables in the Equation
                        B       S.E.    Wald      df   Sig.   Exp(B)
Step 1 a   age          -.099   .053    3.461     1    .063   .905
           nsex         .257    .099    6.726     1    .010   1.294
           nrace                        19.394    2    .000
           nrace(1)     -.944   .487    3.749     1    .053   .389
           nrace(2)     -.337   .504    .446      1    .504   .714
           hsgpa        .852    .123    48.204    1    .000   2.344
           actcomp      -.021   .016    1.676     1    .195   .979
           oatthrs1     .159    .012    163.499   1    .000   1.173
           earlireg     .316    .102    9.640     1    .002   1.372
           admstat(1)   .253    .123    4.222     1    .040   1.288
           postsem      -.115   .068    2.880     1    .090   .891
           y1988        -.048   .086    .306      1    .580   .954
           y1989        .177    .092    3.737     1    .053   1.194
           y1991        -.078   .098    .633      1    .426   .925
           y1992        -.124   .101    1.511     1    .219   .884
           havef101     .967    .152    40.364    1    .000   2.629
           Constant     -.032   1.228   .001      1    .979   .968
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.
The absence of a relationship to ACTCOMP is very interesting. It could be the foundation for a theory of retention.
Logistic Regression Lecture - 42 4/13/2020
From Pedhazur, p. 762, Problem 4. Messages attributed to either an Economist, a Labor Leader, or a Politician were given to participants. The message was about the effects of NAFTA (the North American Free Trade Agreement). Participants rated each message as Biased or Unbiased.
Half the participants were told that the source was male; the other half were told the source was female. The data are

                                Economist   Labor Leader   Politician
Male Source     Rated Biased        7            13            19
                Rated Unbiased     18            12             6
Female Source   Rated Biased        5            17            20
                Rated Unbiased     20             8             5

The data were entered into SPSS as summary data, one row per cell of the table, with columns gender, source, judgment, and freq (the cell count).
These are summary data, not individual data. For example, the first summary line (1 1 1 7) represents 7 individual cases. If the data had been entered as individual data, the first lines of the data editor would have been repeated rows of 1 1 1 under the columns gender, source, and judgment, one row for each of those 7 cases.
To get SPSS to "expand" the summary data into all of the individual cases being summarized, I used Data -> Weight Cases . . . and weighted the cases by freq.
All analyses after the Weight Cases dialog involve the expanded data of 150 cases.
Logistic Regression Lecture - 43 4/13/2020
If you're interested, the syntax that will do all of the above is . . .

DATASET ACTIVATE DataSet5.
Data list free / gender source judgment freq.
Begin data.
1 1 1 7
1 1 0 18
1 2 1 13
1 2 0 12
1 3 1 19
1 3 0 6
2 1 1 5
2 1 0 20
2 2 1 17
2 2 0 8
2 3 1 20
2 3 0 5
end data.
weight by freq.
value labels gender 1 "Male" 2 "Female"
  /source 1 "Economist" 2 "Labor Leader" 3 "Politician"
  /judgment 1 "Biased" 0 "Unbiased".

This syntax reads the frequency counts and, with the weighting in effect, analyzes them as if they were individual respondent data.
The logistic regression dialogs . . . Analyze -> Regression -> Binary Logistic . . .

The syntax to invoke the Logistic Regression command is

logistic regression judgment /categorical = gender source
  /enter source gender
  /enter source by gender
  /save = pred(predicte).

Note how the second /enter subcommand tells the LOGISTIC REGRESSION procedure to analyze the interaction between the two factors: source by gender.
Logistic Regression Lecture - 44 4/13/2020
Logistic Regression output
Case Processing Summary

Unweighted Cases(a)                          N     Percent
Selected Cases   Included in Analysis       12       100.0
                 Missing Cases               0          .0
                 Total                      12       100.0
Unselected Cases                             0          .0
Total                                       12       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value   Internal Value
.00 Unbiased     0
1.00 Biased      1
Categorical Variables Codings(a)
                               Frequency   Parameter coding
                                             (1)      (2)
source   1.00 Economist            4        1.000     .000
         2.00 Labor Leader         4         .000    1.000
         3.00 Politician           4         .000     .000
gender   1.00 Male                 6        1.000
         2.00 Female               6         .000
a. This coding results in indicator coefficients.
Note that the Ns here are "incorrect": we told SPSS to "expand" the summary data into 150 individual cases, but this part of the LOGISTIC REGRESSION output does not acknowledge that expansion except in the footnote, so it reports the 12 unweighted summary lines.

This table also tells us about the group-coding variables for source and gender. It's dummy (indicator) variable coding, with Politician as the reference group for source and Female as the reference group for gender:
Source(1) compares Economists with Politicians.
Source(2) compares Labor Leaders with Politicians.
Gender(1) compares Males with Females.
Thanks, Logistic.
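To make the coding concrete, here is the model that Block 1 will estimate, written in log-odds form (a sketch using the indicator variables just described; B0, B1, B2, and B3 are the coefficients SPSS will label Constant, source(1), source(2), and gender(1)):

   log odds(judgment = Biased) = B0 + B1*source(1) + B2*source(2) + B3*gender(1)

A message attributed to a Politician with a female source has all three indicators equal to 0, so its log odds are just B0.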
Logistic Regression Lecture - 45 4/13/2020
Block 0: Beginning Block
(I generally ignore the Block 0 output. Not much of interest here except to logistic regression aficionados.)

Classification Table(a,b)
                                        Predicted
                                        judgment                        Percentage
Observed                                .00 Unbiased   1.00 Biased      Correct
Step 0   judgment   .00 Unbiased            0              69              .0
                    1.00 Biased             0              81           100.0
         Overall Percentage                                              54.0
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
                       B      S.E.    Wald    df   Sig.   Exp(B)
Step 0   Constant    .160    .164     .958     1   .328    1.174
Variables not in the Equation
                                   Score    df   Sig.
Step 0   Variables   source       30.435     2   .000
                     source(1)    27.174     1   .000
                     source(2)     1.087     1   .297
                     gender(1)      .242     1   .623
         Overall Statistics       30.676     3   .000
Logistic Regression Lecture - 46 4/13/2020
Block 1: Method = Enter
Omnibus Tests of Model Coefficients
                  Chi-square    df   Sig.
Step 1   Step         32.186     3   .000
         Block        32.186     3   .000
         Model        32.186     3   .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      174.797(a)          .193                   .258
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
Classification Table(a)
                                        Predicted
                                        judgment                        Percentage
Observed                                .00 Unbiased   1.00 Biased      Correct
Step 1   judgment   .00 Unbiased           38              31            55.1
                    1.00 Biased            12              69            85.2
         Overall Percentage                                              71.3
a. The cut value is .500
Variables in the Equation
                          B       S.E.      Wald    df   Sig.   Exp(B)
Step 1(a)   source                        26.944     2   .000
            source(1)  -2.424     .476    25.880     1   .000     .089
            source(2)   -.862     .448     3.709     1   .054     .422
            gender(1)   -.202     .368      .303     1   .582     .817
            Constant    1.370     .393    12.143     1   .000    3.934
a. Variable(s) entered on step 1: source, gender.
(Recall the coding: judgment is 1 = Biased, 0 = Unbiased; source(1) is 1 = Economist, 0 = Politician.)
Source: There are overall differences across the 3 source groups in the probability of rating the passage as biased.
Source(1): The probability of rating the passage as biased was lowest when respondents were told that the message was from an Economist rather than a Politician.
Source(2): No officially significant difference in the probability of rating the passage as biased when it was attributed to a Labor Leader vs. a Politician.
Gender(1): No difference in the probability of rating the passage as biased between male and female sources.
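As a quick check on what Exp(B) means here, the Block 1 coefficients above can be turned into odds directly (a worked sketch, using a female source so that gender(1) = 0):

   Odds(Biased | Politician) = e^(1.370) ≈ 3.93
   Odds(Biased | Economist)  = e^(1.370 + (-2.424)) ≈ 0.35
   Odds ratio = 0.35 / 3.93 ≈ .089, which is Exp(B) for source(1).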
Logistic Regression Lecture - 47 4/13/2020
Block 2: Method = Enter    (This block adds the Source x Gender interaction. No change in the results.)

Omnibus Tests of Model Coefficients
                  Chi-square    df   Sig.
Step 1   Step          1.594     2   .451
         Block         1.594     2   .451
         Model        33.780     5   .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      173.203(a)          .202                   .269
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a)
                                        Predicted
                                        judgment                        Percentage
Observed                                .00 Unbiased   1.00 Biased      Correct
Step 1   judgment   .00 Unbiased           38              31            55.1
                    1.00 Biased            12              69            85.2
         Overall Percentage                                              71.3
a. The cut value is .500
Variables in the Equation
                                       B       S.E.      Wald    df   Sig.   Exp(B)
Step 1(a)   source                                      17.214     2   .000
            source(1)               -2.773     .707     15.374     1   .000     .063
            source(2)                -.633     .659       .922     1   .337     .531
            gender(1)                -.234     .685       .116     1   .733     .792
            source * gender                              1.573     2   .455
            source(1) by gender(1)    .675     .958       .497     1   .481    1.965
            source(2) by gender(1)   -.440     .902       .238     1   .626     .644
            Constant                 1.386     .500      7.687     1   .006    4.000
a. Variable(s) entered on step 1: source * gender.
Since the interaction was not significant, we don't have to interpret it.
The main conclusion is that respondents rated passages attributed to politicians as more biased than passages attributed to economists.
Logistic Regression Lecture - 48 4/13/2020
** Just for kicks, I analyzed this as if the DV were continuous, using GLM.
Since GLM does not automatically test group-coding variables, I created dummy codes for source within GLM . . .
Note that dummy codes are called Simple contrasts in GLM.
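The point-and-click steps aren't reproduced here, but a syntax sketch that should request the same analysis is below (the Univariate dialog pastes the UNIANOVA command; SIMPLE contrasts use the last category of source, Politician, as the reference by default, and the weighted 150-case data are assumed to still be active):

* Simple contrasts compare each source level with the last level (Politician).
UNIANOVA judgment BY gender source
  /CONTRAST(source)=SIMPLE
  /DESIGN=gender source gender*source.

With /CONTRAST specified, UNIANOVA should produce a Contrast Results (K Matrix) table like the one shown below.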
Univariate Analysis of Variance

Between-Subjects Factors
                     Value Label       N
gender   1.00        Male             75
         2.00        Female           75
source   1.00        Economist        50
         2.00        Labor Leader     50
         3.00        Politician       50
Tests of Between-Subjects Effects
Dependent Variable: judgment
Source              Type III Sum of Squares    df    Mean Square        F     Sig.
Corrected Model              7.980(a)           5        1.596        7.849   .000
Intercept                   43.740              1       43.740      215.115   .000
gender                        .060              1         .060         .295   .588
source                       7.560              2        3.780       18.590   .000
gender * source               .360              2         .180         .885   .415
Error                       29.280            144         .203
Total                       81.000            150
Corrected Total             37.260            149
a. R Squared = .214 (Adjusted R Squared = .187)

Contrast Results (K Matrix)
source Simple Contrast(a)                                      Dependent Variable: judgment
Level 1 vs. Level 3   Contrast Estimate                                 -.540
                      Hypothesized Value                                    0
                      Difference (Estimate - Hypothesized)              -.540
                      Std. Error                                         .090
                      Sig.                                               .000
                      95% Confidence Interval for Difference   Lower    -.718
                                                               Upper    -.362
Level 2 vs. Level 3   Contrast Estimate                                 -.180
                      Hypothesized Value                                    0
                      Difference (Estimate - Hypothesized)              -.180
                      Std. Error                                         .090
                      Sig.                                               .048
                      95% Confidence Interval for Difference   Lower    -.358
                                                               Upper    -.002
a. Reference category = 3
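As a quick arithmetic check, these contrast estimates are just differences in cell proportions from the frequency table at the start of this example: P(Biased | Economist) = (7 + 5)/50 = .24, P(Biased | Labor Leader) = (13 + 17)/50 = .60, and P(Biased | Politician) = (19 + 20)/50 = .78. So Level 1 vs. Level 3 = .24 - .78 = -.540 and Level 2 vs. Level 3 = .60 - .78 = -.180, matching the Contrast Estimates above.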
Logistic Regression Lecture - 49 4/13/2020
Logistic Regression Lecture - 50 4/13/2020
Situation: A dichotomous state (e.g., illness, termination, death, success) is to be predicted.
You have a continuous predictor.
The relationship between the dichotomous dependent variable and the continuous predictor is significant.
The continuous predictor can consist of y-hats from the combination of multiple predictors.
It’ll be called y-hat from now on.
By setting a cutoff on the values of y-hat, Predicted 1s and Predicted 0s can be defined as is done in the
Logistic Regression Classification Table.
Predicted 1: Every case whose y-hat is above the cutoff.
Predicted 0: Every case whose y-hat is <= the cutoff.
Once predicted 1s and predicted 0s have been defined, the classification table printed by Logistic Regression can be created.
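A minimal SPSS sketch of that step, assuming the y-hats have been saved in a variable named yhat and the dichotomous DV is named y (both names are placeholders, not variables from the examples that follow):

* Flag cases whose y-hat is above the cutoff (here .50) as Predicted 1s.
COMPUTE pred1 = (yhat > .50).
EXECUTE.
* Row percentages give the percentage of actual 0s and actual 1s classified correctly.
CROSSTABS /TABLES=y BY pred1 /CELLS=COUNT ROW.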
Some issues: 1) Which cutoff should be used? 2) How should predictability be measured? 3) How does predictability relate to the cutoff that is employed?
ROC Curve
The Receiver Operating Characteristic curve provides an approach to understanding the relationship between a dichotomous dependent variable and a continuous predictor.
The ROC curve is a plot of Sensitivity vs. 1-Specificity .
Sensitivity: Percent of Actual 1s predicted correctly.
Specificity: Percent of Actual 0s predicted correctly.
1-Specificity: Percent of Actual 0s predicted incorrectly to be 1s.
In ROC terminology, Sensitivity is called the Hit or true positive rate.
1-Specificity is called the False Alarm or false positive rate.
So the ROC curve is a plot of the proportion of successful identifications vs. proportion of false identifications of some phenomenon.
Logistic Regression Lecture - 51 4/13/2020
Example: The Logamy, LogLip data revisited.
GET
FILE='G:\MDBT\InClassDatasets\amylip.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
LOGISTIC REGRESSION VARIABLES pancgrp
/METHOD=ENTER logamy loglip
/SAVE=PRED
/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
Logistic Regression
[DataSet1] G:\MDBT\InClassDatasets\amylip.sav
Case Processing Summary

Unweighted Cases(a)                           N     Percent
Selected Cases   Included in Analysis       256        83.7
                 Missing Cases               50        16.3
                 Total                      306       100.0
Unselected Cases                              0          .0
Total                                       306       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value          Internal Value
.00 No Pancreatitis     0
1.00 Pancreatitis       1
Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                  Chi-square    df   Sig.
Step 1   Step        170.852     2   .000
         Block       170.852     2   .000
         Model       170.852     2   .000

Note that I requested that the y-hats be saved in the Data Editor.
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      76.228(a)           .487                   .787
a. Estimation terminated at iteration number 7 because parameter estimates changed by less than .001.

Classification Table(a)
                                              Predicted
                                              pancgrp Pancreatitis Diagnosis (DV)       Percentage
Observed                                      .00 No Pancreatitis   1.00 Pancreatitis   Correct
Step 1   pancgrp   .00 No Pancreatitis             204                     4              98.1
                   1.00 Pancreatitis                10                    38              79.2
         Overall Percentage                                                                94.5
a. The cut value is .500
Specificity = proportion of the 208 0s that were predicted to be 0s = 204/208 = .981.
Sensitivity= proportion of the 48 1s that were predicted to be 1s: 38/48 = .792.
False alarm rate = Proportion of the 208 0s that were predicted to be 1s = 1-.981 = .019.
Hit rate = Sensitivity = .792.
Note that these specificity, sensitivity values are for only 1 cutoff = .500.
Variables in the Equation
                        B        S.E.      Wald    df   Sig.    Exp(B)
Step 1(a)   logamy     2.659    1.418     3.518     1   .061    14.286
            loglip     2.998     .844    12.628     1   .000    20.043
            Constant -14.573    2.251    41.907     1   .000      .000
a. Variable(s) entered on step 1: logamy, loglip.

The y-hats were saved and renamed LogRegYhatAmyLip.
Logistic Regression Lecture - 52 4/13/2020
ROC LogRegYhatAmyLip BY pancgrp (1)
  /PLOT=CURVE(REFERENCE)
  /PRINT=COORDINATES
  /CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
  /MISSING=EXCLUDE.
Logistic Regression Lecture - 53 4/13/2020
ROC Curve
[DataSet1] G:\MDBT\InClassDatasets\amylip.sav
Case Processing Summary
pancgrp Pancreatitis Diagnosis (DV)    Valid N (listwise)
Positive(a)                                    48
Negative                                      208
Missing                                        50
Larger values of the test result variable(s) indicate stronger evidence for a positive actual state.
a. The positive actual state is 1.00 Pancreatitis.
The ROC procedure employs as many cutoff values as possible (not just the single cutoff of .500 used in the LOGISTIC REGRESSION classification table). It computes Sensitivity and 1-Specificity for each cutoff value and plots Sensitivity vs. 1-Specificity (the False Alarm rate) for each cutoff. That plot is the blue curve.

[ROC curve figure: Sensitivity vs. 1-Specificity, with the diagonal reference line.]

Area Under the Curve
Test Result Variable(s): Yhat Predicted probability
Area = .980

You can think of the ROC curve as a "running classification table," with Sensitivity / 1-Specificity values for all possible cutoff values.
The graph does not show small differences in percentages, so the Specificity of .981 at the cutoff of .500 in the classification table (a False Alarm rate of .019) is indistinguishable from 1.00 (a False Alarm rate of 0) in this graph.
Often the area under the blue curve is used as a measure of overall predictability. That area ranges from 0.5 (the blue curve coincides with the green diagonal line; no predictability) to 1.0 (the blue curve follows the left and top edges of the graph). In this instance, overall predictability would be considered quite high.
Logistic Regression Lecture - 54 4/13/2020
Example 2 – the FFROSH data. Predicting retention.
The basic analyses, from the previous lecture . . . (Block 0 output omitted.)

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat
   postsem y1988 y1989 y1991 y1992 havef101
   /categorical nrace admstat / criteria = cut(.5) /save=pred.
Logistic Regression
[DataSet4] G:\MDBR\FFROSH\Ffroshnm.sav
Case Processing Summary                      (Note: y-hats saved.)

Unweighted Cases(a)                            N     Percent
Selected Cases   Included in Analysis       4697        98.8
                 Missing Cases                56         1.2
                 Total                      4753       100.0
Unselected Cases                               0          .0
Total                                       4753       100.0
a. If weight is in effect, see classification table for the total number of cases.
Block 1: Method = Enter
Classification Table(a)
                                  Predicted
                                  retained               Percentage
Observed                          .00         1.00       Correct
Step 1   retained   .00            16          529         2.9
                    1.00            7         4145        99.8
         Overall Percentage                                88.6
a. The cut value is .500
Variables in the Equation
                           B       S.E.      Wald    df   Sig.   Exp(B)
Step 1(a)  age           -.113    .053      4.467     1   .035    .893
           nsex           .268    .102      6.891     1   .009   1.307
           nrace                            16.513     2   .000
           nrace(1)     -1.033    .526      3.854     1   .050    .356
           nrace(2)      -.473    .543       .758     1   .384    .623
           hsgpa          .969    .128     57.301     1   .000   2.636
           actcomp       -.009    .017       .283     1   .595    .991
           oatthrs1       .105    .015     46.284     1   .000   1.111
           earlireg       .351    .105     11.144     1   .001   1.421
           admstat(1)     .229    .127      3.255     1   .071   1.257
           postsem       -.140    .071      3.897     1   .048    .870
           y1988         -.081    .088       .833     1   .362    .922
           y1989          .203    .097      4.426     1   .035   1.225
           y1991         -.062    .100       .385     1   .535    .940
           y1992         -.142    .102      1.925     1   .165    .868
           havef101       .851    .159     28.769     1   .000   2.341
           Constant       .364   1.257       .084     1   .772   1.438
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.
Logistic Regression Lecture - 55 4/13/2020
For this example, I reran the analysis several times, each time specifying a different cutoff value.
Here are the Classification Tables for the different cutoff values, plotted as points in ROC space. Reading the (False Alarm rate, Sensitivity) pairs from the tables: (.996, 1.000), (.994, 1.000), (.971, .998), (.804, .965), and (.281, .598).

[Figure: the individual Classification Tables and the ROC plot of these five points.]

Note that the location of the points on the ROC curve is inversely related to the value of the cutoff: larger cutoff values move points toward the lower left.
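Each rerun needs only a different CUT value on the /criteria subcommand of the command shown above; a sketch, with .25 as an arbitrary illustrative cutoff:

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat
   postsem y1988 y1989 y1991 y1992 havef101
   /categorical nrace admstat / criteria = cut(.25).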
Logistic Regression Lecture - 56 4/13/2020
Advantages of ROC curve analysis
1. It forces you to realize that, for a given prediction system, an increase in sensitivity is invariably accompanied by a concomitant increase in false alarms. For a given selection system, points differ only along the lower-left to upper-right direction: moving toward the upper right increases sensitivity but also increases false alarms, and moving toward the lower left decreases false alarms but also decreases sensitivity.
2. It shows that to make a prediction (I-O types, read "selection") system better, you must increase sensitivity while at the same time decreasing false alarms, moving points toward the upper left of the ROC space (better selection).
3. It enables you to graphically disentangle issues of bias (the value of the cutoff) from issues of predictability (the area under the curve).
Logistic Regression Lecture - 57 4/13/2020