Analyses of Categorical Dependent Variables


Analyses Involving Categorical Dependent Variables

When Dependent Variables are Categorical

Chi-square analysis is frequently used.

Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?

Dependent variable is Death: No (0) vs. Yes (1).

Crosstabs

[DataSet0]

So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.


Comments on Chi-square analyses

What’s good?

1. The analysis is appropriate. It hasn’t been supplanted by something else.

2. The results are usually easy to communicate, especially to lay audiences.

3. A DV with a few more than 2 categories can be easily analyzed.

4. An IV with only a few more than 2 categories can be easily analyzed.

What’s bad?

1. Incorporating more than one independent variable is awkward, requiring multiple tables.

2. Certain tests, such as tests of interactions, can't be performed easily when you have more than one IV.

3. Chi-square analyses can’t be done when you have continuous IVs unless you categorize the continuous IVs, which goes against recommendations to NOT categorize continuous variables because you lose power.

Alternatives to the Chi-square test.

We’ll focus on Dichotomous (two-valued) DVs.

1. Linear Regression techniques

a. Multiple Linear Regression. Stick your head in the sand and pretend that your DV is continuous and regress the (dichotomous) DV onto the mix of IVs.

b. Discriminant Analysis (equivalent to MR when the DV is dichotomous).

Problems with regression-based methods, when the dependent variable is dichotomous and the independent variable is continuous.

1. Assumption is that underlying relationship between Y and X is linear.

But when Y has only two values, how can that be?

2. Linear techniques assume that variability about the regression line is homogenous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.

3. Residuals will probably not be normally distributed.

4. Regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction resulting in Y-hats that are impossible values.

2. Logistic Regression

3. Probit analysis

Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We’ll focus on it.


The Logistic Regression Equation

Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.

Conceptualizing Y-hat.

When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We'll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1.

The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we're conceptualizing Y-hat as the probability that Y is 1.

The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression):

                      e^(B0 + B1*X)                1
Y-hat = P(Y=1) = --------------------- = ----------------------
                    e^(B0 + B1*X) + 1       1 + e^-(B0 + B1*X)

The logistic regression equation defines an S-shaped (Ogive) curve that rises from 0 to 1. P(Y=1) is never negative and never larger than 1.

The curve of the equation . . .

B0: B0 is analogous to the linear regression "constant", i.e., the intercept parameter. Although B0 defines the "height" of the curve at a given X, it should be noted that the curve as a whole moves to the right as B0 decreases. For the graphs below, B1 = 1 and X ranged from -5 to +5.

[Figure: logistic curves of P(Y=1) vs. X for B0 = +1, B0 = 0, and B0 = -1.]

For equations for which B1 is the same, changing B0 only changes the location of the curve over the range of X-axis values. The "slope" of the curve remains the same.


B1: B1 is analogous to the slope of the linear regression line. B1 defines the "steepness" of the curve. It is sometimes called a discrimination parameter. The larger the value of B1, the "steeper" the curve, i.e., the more quickly it goes from 0 to 1. B0 = 0 for the graph.

[Figure: logistic curves of P(Y=1) vs. X for B1 = 4, B1 = 2, and B1 = 1.]

Note that there is a MAJOR difference between the linear regression and logistic regression curves - - -

The logistic regression lines asymptote at 0 and 1. They’re bounded by 0 and 1.

But the linear regression lines extend below 0 on the left and above 1 on the right.

If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.
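To make the boundedness concrete, here is a small Python sketch (mine, not part of the original handout; the intercept and slope are arbitrary illustrative values) that evaluates a logistic and a linear prediction at the same X values. The logistic Y-hats stay strictly between 0 and 1; the linear ones do not.

import math

b0, b1 = 0.0, 1.0                      # arbitrary illustrative coefficients

def logistic_yhat(x):
    # P(Y=1) from the logistic equation: always between 0 and 1
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def linear_yhat(x):
    # predicted Y from a straight line: not bounded by 0 and 1
    return b0 + b1 * x

for x in (-5, -2, 0, 2, 5):
    print(f"x={x:+d}  logistic={logistic_yhat(x):.3f}  linear={linear_yhat(x):+.3f}")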


Example

P(Y) = .09090909   Odds of Y = .09090909/.90909091 = .1   Y is 1/10th as likely to occur as to not occur.
P(Y) = .50         Odds of Y = .5/.5 = 1                  Y is as likely to occur as to not occur.
P(Y) = .80         Odds of Y = .8/.2 = 4                  Y is 4 times more likely to occur than to not occur.
P(Y) = .99         Odds of Y = .99/.01 = 99               Y is 99 times more likely to occur than to not occur.


So logistic regression is logistic in probability but linear in log odds.
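That statement can be checked with a few lines of arithmetic. The Python sketch below (mine; the coefficients are arbitrary illustrative values) computes P(Y=1), the odds, and the log odds at equally spaced X values: the probability follows the S-shaped curve, but the log odds increase by exactly B1 for every one-unit step in X, i.e., they are linear in X.

import math

b0, b1 = -2.0, 0.5                     # arbitrary illustrative coefficients

for x in range(5):
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))   # probability: S-shaped in x
    odds = p / (1.0 - p)                          # odds that Y = 1
    print(f"x={x}  P(Y=1)={p:.3f}  odds={odds:.3f}  log odds={math.log(odds):+.3f}")
# the log odds column goes -2.0, -1.5, -1.0, ... : up by b1 = 0.5 per unit of x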


Why we must fit ogival-shaped curves – the curse of categorization

Here’s a perfectly nice linear relationship between score values, from a recent study.

This relationship is of ACT Comp scores to Wonderlic scores. It shows that as intelligence gets higher, ACT scores get larger.

[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav

Here's the relationship when ACT Comp has been dichotomized at 23, into Low vs. High.

When proportions of High scores are plotted vs. WPT value, we get the following . . .

So, to fit the above curve relating proportions of persons with High ACT scores to WPT, we need a model that is ogival. This is where the logistic regression function comes into play.

This means that even if the "underlying" true values are linearly related, proportions based on the dichotomized values will not be linearly related to the independent variable.



Crosstabs and Logistic Regression

Applied to the same 2x2 situation

The FFROSH data.

The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables – first semester GPA excluding the seminar course and whether a student continued into the 2nd semester.

The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who retained to the immediately following spring semester and 0 for those who did not.

The analysis reported here was a serendipitous finding regarding the time at which students register for school.

It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.

After examining the distribution of the times students registered prior to the first day of class, we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG – for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150-day value was chosen after inspection of the 1st semester GPA data.)

So the analysis that follows examines the relationship of RETAINED to EARLIREG, retention to the 2nd semester to early registration.

The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.

First, univariate analyses . . .

GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.

fre var=retained earlireg.

retained
              Frequency    Percent    Valid Percent    Cumulative Percent
Valid   .00        552       11.6         11.6               11.6
       1.00       4201       88.4         88.4              100.0
       Total      4753      100.0        100.0

earlireg
              Frequency    Percent    Valid Percent    Cumulative Percent
Valid   .00       2316       48.7         48.7               48.7
       1.00       2437       51.3         51.3              100.0
       Total      4753      100.0        100.0


crosstabs retained by earlireg /cells=cou col /sta=chisq.

Crosstabs

Case Processing Summary
                                        Cases
                         Valid             Missing            Total
                        N     Percent     N     Percent     N       Percent
RETAINED * EARLIREG   4753    100.0%      0       .0%     4753      100.0%

RETAINED * EARLIREG Crosstabulation
                                           EARLIREG
                                      .00        1.00        Total
RETAINED   .00    Count                367         185          552
                  % within EARLIREG   15.8%        7.6%        11.6%
          1.00    Count               1949        2252         4201
                  % within EARLIREG   84.2%       92.4%        88.4%
Total             Count               2316        2437         4753
                  % within EARLIREG  100.0%      100.0%       100.0%

So, 92.4% of those who registered early sustained, compared to 84.2% of those who registered late.

Chi-Square Tests
                                    Value      df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                                      (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square                78.832(b)     1       .000
Continuity Correction(a)          78.030        1       .000
Likelihood Ratio                  79.937        1       .000
Fisher's Exact Test                                                    .000           .000
Linear-by-Linear Association      78.815        1       .000
N of Valid Cases                    4753
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 268.97.


The same analysis using Logistic Regression

logistic regression retained WITH earlireg.

Logistic Regression

Analyze -> Regression -> Binary Logistic

Case Processing Summary
Unweighted Cases(a)                            N       Percent
Selected Cases    Included in Analysis       4753       100.0
                  Missing Cases                 0          .0
                  Total                      4753       100.0
Unselected Cases                                0          .0
Total                                        4753       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
 .00                   0
1.00                   1

The display to the left is a valuable check to make sure that your "1" is the same as the Logistic Regression procedure's "1". Do whatever you can to make Logistic's 1s be the same cases as your 1s. Trust me.

The Logistic Regression procedure applies the logistic regression model to the data. It estimates the parameters of the logistic regression equation.

That equation is

                     1
P(Y) = --------------------------
           1 + e^-(B0 + B1*X)

It performs the estimation in two stages. The first stage estimates only B0. So the model fit to the data in the first stage is simply

                 1
P(Y) = -------------------
           1 + e^-(B0)

SPSS labels the various stages of the estimation procedure "Blocks". In Block 0, a model with only B0 is estimated.


Block 0: Beginning Block (estimating only B0)

Classification Table(a,b)
                                      Predicted
                               RETAINED
Step 0   Observed                .00     1.00     Percentage Correct
         RETAINED    .00           0      552            .0
                    1.00           0     4201          100.0
         Overall Percentage                              88.4
a. Constant is included in the model.
b. The cut value is .500

Explanation of the above table: The program estimated B0 = 2.030. The resulting P(Y=1) = .8839.

The program computes Y-hat = .8839 for each case using the logistic regression formula with the estimate of B0. If Y-hat is less than or equal to a predetermined cut value of 0.500, that case is recorded as a predicted 0. If Y-hat is greater than 0.5, the program records that case as a predicted 1. It then creates the above table of number of actual 1's and 0's vs. predicted 1's and 0's.

Variables in the Equation
                      B      S.E.      Wald       df    Sig.    Exp(B)
Step 0   Constant   2.030    .045    2009.624      1    .000     7.611

The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). (The value 2.030 is shown in the "Variables in the Equation" table above.) Recall that B1 is not yet in the equation. This means that Y-hat is a constant, equal to .8839 for each case. (I got this by entering the prediction equation into a calculator.) Since Y-hat for each case is greater than 0.5, all predictions in the above Classification Table are 1, which is why the above table has only predicted 1's. Sometimes this table is more useful than it was in this case. It's typically most useful when the equation includes continuous predictors.

The above “Variables in the Equation” box is the Logistic Regression equivalent of the “Coefficients Box” in regular regression analysis.

The test statistic in that table is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/S.E.)^2. So (2.030/.045)^2 = 2,035, which would be 2009.624 if the two quantities were represented with greater precision.

Exp(B) is the odds ratio: e^2.030. It is the ratio of the odds of P(Y=1) when the predictor equals 1 to the odds of P(Y=1) when the predictor equals 0. It's an indicator of strength of relationship to the predictor. It means nothing here, since no predictor is in the equation yet.
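If you want to verify those numbers without a calculator, here is a short Python check (mine, using the values printed in the Block 0 output above).

import math

b0, se = 2.030, 0.045                  # constant and its S.E. from the Block 0 table

print(round(1.0 / (1.0 + math.exp(-b0)), 4))   # Y-hat = .8839, so every case is a predicted 1
print(round((b0 / se) ** 2, 1))                # Wald = (B/S.E.)^2, about 2035 with these rounded values
print(round(math.exp(b0), 3))                  # Exp(B): e^2.030 is about 7.61, the odds .8839/.1161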

Variables not in the Equation
                                  Score     df    Sig.
Step 0   Variables   EARLIREG    78.832      1    .000
         Overall Statistics      78.832      1    .000

The "Variables not in the Equation" box gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be "significant" if it were added to the equation. In this case, it's telling us that EARLIREG would contribute significantly to the equation if it were added, which is what SPSS does next . . .


Block 1: Method = Enter (Adding estimation of B1 to the equation)

Omnibus Tests of Model Coefficients
                    Chi-square    df    Sig.
Step 1   Step         79.937       1    .000
         Block        79.937       1    .000
         Model        79.937       1    .000

Note that the chi-square value is almost the same value as the chi-square value from the CROSSTABS analysis.

Whew – three chi-square statistics.

“Step”: Compared to previous step in a stepwise regression. Ignore for now since this regression had only 1 step..

“Block”: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block. Note that the chi-square is identical to the Likelihood ratio chi-square printed in the Chi-square Box in the CROSSTABS output.

“Model”: Ignore for now

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1           3334.212               .017                     .033

The value under "-2 Log likelihood" is a measure of how well the model fit the data in an absolute sense. Values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to "percent of variance accounted for". All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.

Classification Table(a)
                                      Predicted
                               RETAINED
Step 1   Observed                .00     1.00     Percentage Correct
         RETAINED    .00           0      552            .0
                    1.00           0     4201          100.0
         Overall Percentage                              88.4
a. The cut value is .500

The above table is the revised version of the table presented in Block 0.

Note that since X is a dichotomous variable here, there are only two Y-hat values. They are

                    1
P(Y) = ------------------------- = .842 (see below)
          1 + e^-(B0 + B1*0)

and

                    1
P(Y) = ------------------------- = .924 (see below)
          1 + e^-(B0 + B1*1)

In both cases, the y-hat was greater than .5, so predicted Y in the table was 1 for all cases.


Variables in the Equation
                         B       S.E.      Wald      df    Sig.    Exp(B)
Step 1(a)   EARLIREG    .830     .095     75.719      1    .000     2.292
            Constant   1.670     .057    861.036      1    .000     5.311
a. Variable(s) entered on step 1: EARLIREG.

The prediction equation is Y-hat = P(Y=1) = 1 / (1 + e^-(1.670 + .830*EARLIREG)).

Since EARLIREG has only two values, those students who registered early will have predicted RETAINED value of 1/(1 + e^-(1.670 + .830*1)) = .924. Those who registered late will have predicted RETAINED value of 1/(1 + e^-(1.670 + .830*0)) = 1/(1 + e^-1.670) = .842. Since both predicted values are above .5, this is why all the cases were predicted to be retained in the table on the previous page.

Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0. Recall that the odds of 1 are P(Y=1)/(1-P(Y=1)). The odds ratio is

                   Odds when X=1     .924/(1-.924)     12.158
Odds ratio = ------------------- = ----------------- = -------- = 2.29.
                   Odds when X=0     .842/(1-.842)      5.329

So a person who registered early had odds of being retained that were 2.29 times the odds of a person registering late being retained. Odds ratio of 1 means that the DV is not related to the predictor.
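Here is a small Python check of that arithmetic (mine, using the Block 1 coefficients above). It also shows the shortcut: the odds ratio is simply e raised to the EARLIREG coefficient.

import math

b0, b1 = 1.670, 0.830                  # Constant and EARLIREG coefficients from Block 1

def p_retained(earlireg):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * earlireg)))

p_early, p_late = p_retained(1), p_retained(0)        # about .924 and .842
odds_ratio = (p_early / (1 - p_early)) / (p_late / (1 - p_late))
print(round(odds_ratio, 2))                           # about 2.29
print(round(math.exp(b1), 2))                         # same thing: Exp(B) = e^.830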

Graphical representation of what we’ve just found.

The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of Y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur).

The curve is analogous to the straight line plot in a regular regression analysis.

[Figure: plot of Y-hat vs. X (EARLIREG), with the theoretical logistic curve drawn over a wide range of X values. The two points are the predicted values for the two possible values of EARLIREG.]


Discussion

1. When there is only one dichotomous predictor, the CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information.

BUT as mentioned above . . .

2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.

3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X's, but the analysis IS rudimentary and is laborious. No tests of interactions are possible. The analysis involves inspection and comparison of multiple tables.

4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IV’s.

5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.

6. So CROSSTABS should be considered for the very simplest situations involving one categorical predictor. But LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors and when there are one or more continuous predictors.


Logistic Regression Example 1: One Continuous Predictor

The data analyzed here represent the relationship of Pancreatitis Diagnosis to measures of Amylase and Lipase.

Both Amylase and Lipase levels are tests that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis.

The objective here is to determine 1) which alone is the better predictor of the condition and 2) whether both are needed.

Because the distributions of both predictors were positively skewed, logarithms of the actual Amylase and Lipase values were used for this handout and for some of the following handouts.

This handout illustrates the analysis of the relationship of Pancreatitis diagnosis to Amylase only. Note that since Amylase is a continuous independent variable, chi-square analysis would not be appropriate.

The name of the dependent variable is PANCGRP. It is 1 if the person is diagnosed with Pancreatitis. It is 0 otherwise. This forces us to use a technique appropriate for dichotomous dependent variables.

Distributions of logamy and loglip – still somewhat positively skewed even though logarithms were taken.

[Figure: histogram of logamy (Mean = 2.0267, Std. Dev. = 0.50269, N = 306) and histogram of loglip (Mean = 2.3851, Std. Dev. = 0.82634, N = 306).]

The logamy and loglip scores are highly positively correlated. For that reason, it may be that once either is in the equation, adding the other won’t significantly increase the fit of the model. We’ll test that hypothesis later.



1. Scatterplots with individual cases – not terribly informative.

Relationship of Pancreatitis Diagnosis to log(Amylase)

[Figure: scatterplot of individual cases; Y values are 0 or 1, X values (LOGAMY) are continuous. A straight line of best fit is drawn through the points.]

This graph represents a primary problem with visualizing results when the dependent variable is a dichotomy.

It is difficult to see the relationship that may very well be represented by the data. One can see from this graph, however, that when log amylase is low, there are more 0’s (no Pancreatitis) and when log amylase is high there are more 1’s (presence of Pancreatitis).

The line through the scatterplot is the linear line of best fit. It was easy to generate. It represents the relationship of probability of Pancreatitis to log amylase that would be assumed if a linear regression were conducted. So the line is what we would predict based on linear regression.

But, the logistic regression analysis assumes that the relationship between probability of Pancreatitis to log amylase is different. The relationship assumed by the logistic regression analysis would be an S-shaped curve, called an ogive, shown below.

Below are the same data, this time with the line of best fit generated by the logistic regression analysis through it. While neither line fits the observed individual case points well in the middle, it’s easy to see that the logistic line fits better at small and at large values of log amylase.

[Figure: the same scatterplot of Pancreatitis Diagnosis vs. LOGAMY, with the predicted-probability curve from the logistic regression drawn through it.]


2. Grouping cases to show a relationship when the DV is a dichotomy.

This is not a required part of logistic regression analysis but a way of presenting the data to help understand what’s going on.

The plots above were plots of individual cases. Each point represented the DV value of a case (0 or 1) vs. that person’s IV value (log amylase value). The problem was that the plot didn’t really show the relationship because the DV could take on only two values - 0 and 1.

When the DV is a dichotomy, it will be profitable to form groups of cases with similar X values and plot the proportion of 1’s within each group vs. the X value for that group.

To illustrate this, groups were formed for every .2 increase in log amylase. That is, the values 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, and 3.8 were used as group midpoints. Each case was assigned to a group based on how close that case's log amylase value was to the group midpoint. So, for example, all cases with log amylase between 1.5 and 1.7 were assigned to the 1.6 group.

SPSS Syntax: compute logamygp = rnd(logamy,.2).

Then the proportion of 1’s within each group was computed. (When the data are 0s and 1s, the mean of all the scores is equal to the proportion of 1s.) The figure below is a plot of the proportion of 1’s within each group vs. the groups midpoints. Note that the points form a curve, quite a bit like the ogival form from the logistic regression analysis shown on the previous page.
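For readers who want to reproduce this grouping-and-averaging step outside SPSS, here is a minimal Python sketch (mine; the ten (logamy, diagnosis) pairs are made up for illustration and are not cases from the data file). It mirrors compute logamygp = rnd(logamy,.2) and then takes the mean of the 0/1 DV within each group.

from collections import defaultdict

cases = [(1.45, 0), (1.52, 0), (1.61, 0), (2.05, 0), (2.12, 1),
         (2.38, 0), (2.44, 1), (2.97, 1), (3.02, 1), (3.18, 1)]   # hypothetical (logamy, 0/1) pairs

groups = defaultdict(list)
for x, y in cases:
    midpoint = round(round(x / 0.2) * 0.2, 1)      # same idea as SPSS rnd(logamy, .2)
    groups[midpoint].append(y)

for midpoint in sorted(groups):
    ys = groups[midpoint]
    print(f"group {midpoint:.1f}: n={len(ys)}  proportion of 1s = {sum(ys) / len(ys):.2f}")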

[Figure: left panel – the original graph of 0s and 1s vs. LOGAMY; right panel – the proportion of 1s within each group vs. LOGAMYGP.]

Note that the plot of proportions of 1s within groups is not linear. The proportions increase in an approximately ogival (S-shaped) fashion, with asymptotes at 0 and 1. This, of course, is a violation of the assumption of a linear relationship which is required when performing linear regression.

The analyses that follow illustrate the application of both linear and logistic regression to the data.


3. Linear Regression analysis of the logamy data, just for old time’s sake.

REGRESSION

/MISSING LISTWISE

/STATISTICS COEFF OUTS R ANOVA

/CRITERIA=PIN(.05) POUT(.10)

/NOORIGIN

/DEPENDENT pancgrp

/METHOD=ENTER logamy

/SCATTERPLOT=(*ZRESID ,*ZPRED )

/RESIDUALS HIST(ZRESID) NORM(ZRESID) .

Excerpt from the data file

Regression

Variables Entered/Removed(b)
Model    Variables Entered    Variables Removed    Method
1        LOGAMY(a)            .                    Enter

Model Summary(b)
Model      R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .755(a)      .570            .568                    .2569
a. Predictors: (Constant), LOGAMY
b. Dependent Variable: PANCGRP

ANOVA(b)
Model               Sum of Squares     df     Mean Square       F        Sig.
1   Regression          22.230           1      22.230       336.706    .000(a)
    Residual            16.770         254       6.602E-02
    Total               39.000         255
a. Predictors: (Constant), LOGAMY
b. Dependent Variable: PANCGRP

Coefficients(a)

The linear relationship of pancdiag to logamy is strong – R-squared = .570. But as we'll see, the logistic relationship is even stronger.

                   Unstandardized Coefficients    Standardized Coefficients
Model                 B        Std. Error               Beta                 t        Sig.
1   (Constant)     -1.043        .069                                     -15.125     .000
    LOGAMY           .635        .035                    .755              18.350     .000
a. Dependent Variable: PANCGRP

Thus, the predicted linear relationship of probability of Pancreatitis to log amylase is

Predicted probability of Pancreatitis = -1.043 + 0.635 * logamy.


The following are the usual linear regression diagnostics.

Casewise Diagnostics(a)
Case Number    Std. Residual    PANCGRP
    54             3.016          1.00
    77             3.343          1.00
    85             3.419          1.00
    97             3.218          1.00
a. Dependent Variable: PANCGRP

Residuals Statistics(a)
                           Minimum     Maximum      Mean     Std. Deviation     N
Predicted Value             -.1044      1.4256      .1875        .2953         256
Residual                    -.5998       .8786      .000         .2564         256
Std. Predicted Value         -.989       4.193      .000        1.000          256
Std. Residual               -2.334       3.419      .000         .998          256
a. Dependent Variable: PANCGRP

[Figure: histogram of regression standardized residuals (Dependent Variable: PANCGRP; Std. Dev = 1.00, Mean = 0.00, N = 256.00). Nothing particularly unusual here.]

Normal P-P Plot

[Figure: Normal P-P plot of expected vs. observed cumulative probability of the standardized residuals. Nothing particularly unusual here either. Although there is a clear bend from the expected linear line, this is not particularly diagnostic.]


Scatterplot

[Figure: regression standardized residuals vs. regression standardized predicted values (Dependent Variable: PANCGRP).]

This is an indicator that there is something amiss. The plot of residuals vs. predicted values is supposed to form a classic zero-correlation scatterplot, with no unusual shape. This is clearly unusual.

Computation of Y-hats for the groups.

I had SPSS compute the Y-hat for each of the group mid-points discussed on page 3. I then plotted both the observed group proportion of 1's that was shown on the previous page and the Y-hat for each group.

Of course, the Y-hats are in a linear relationship with log amylase. Note that the solid points don't really represent the relationship shown by the open symbols. Note also that the solid points extend above 1 and below 0. But the observed proportions are bound by 1 and 0.

compute mrgpyhat = -1.043 + .635*logamyvalue.
execute.

GRAPH
  /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc mrgpyhat (PAIR)
  /MISSING=LISTWISE.

Graph

[Figure: overlay scatterplot vs. LOGAMYGP showing MRGPYHAT, the predicted proportion of Pancreatitis diagnoses within groups (note that the predictions extend below 0 and above 1), and PROBPANC, the observed proportion of Pancreatitis diagnoses within groups.]


4. Logistic Regression Analysis of logamy data

Remember we're here to determine if there is a significant relationship of pancreatitis diagnosis to log amylase.

logistic regression pancgrp with logamy.

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                            N       Percent
Selected Cases    Included in Analysis        256        83.7
                  Missing Cases                50        16.3
                  Total                       306       100.0
Unselected Cases                                0          .0
Total                                         306       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value             Internal Value
 .00 No Pancreatitis            0
1.00 Pancreatitis               1

SPSS’s Logistic regression procedure always performs the analysis in at least two steps, which it calls Blocks.

Recall the Logistic prediction formula is

                     1
P(Y) = --------------------------
           1 + e^-(B0 + B1*X)

In the first block, labeled Block 0, only B0 is entered into the equation. In this B0-only equation, the probability of a 1 is a constant, equal to the overall proportion of 1's for the whole sample.

Obviously this model doesn't make sense when your main interest is in whether or not the probability increases as X increases. But SPSS forces us to consider (or delete) the results of a B0-only model.

This model does serve as a useful baseline against which to assess subsequent models, all of which do assume that the probability of a 1 increases as the IV increases.


For each block the Logistic Regression procedure automatically prints a 2x2 table of predicted and observed 1's and 0's. For all of these tables, a case is classified as a predicted 1 if its Y-hat (predicted probability) exceeds 0.5. Otherwise it's classified as a predicted 0. Since only the constant is estimated here, the predicted probability for every case is 1/(1 + exp(-(-1.466))) = .1875. It happens that this is simply the proportion of 1's in the sample, which is 48/256 = 0.1875. Since that's less than 0.5, every case is predicted to be a 0 for this constant-only model.

Block 0: Beginning Block

A case is classified as a Predicted 0 if the Y-hat for that case is less than or equal to .5. A case is classified as a Predicted 1 if the Y-hat for that case is larger than .5.

Classification Table(a,b)
                                                     Predicted
                                          Pancreatitis Diagnosis (DV)
Step 0   Observed                          No Pancreatitis    Pancreatitis    Percentage Correct
Pancreatitis       No Pancreatitis              208                0               100.0   <- Specificity
Diagnosis (DV)     Pancreatitis                  48                0                  .0   <- Sensitivity
Overall Percentage                                                                   81.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                       B       S.E.     Wald      df    Sig.    Exp(B)
Step 0   Constant    -1.466    .160    83.852      1    .000     .231

The test that is recommended is the Wald test. The p-value of .000 says that the value of B0 is significantly different from 0.

The predicted probability of 1 here is

               1                  1             1
P(1) = ----------------- = --------------- = ------- = 0.1875, the observed proportion of 1's.
         1 + e^-(-1.466)      1 + 4.332       5.332

Variables not in the Equation
                                  Score      df    Sig.
Step 0   Variables   LOGAMY     145.884       1    .000
         Overall Statistics     145.884       1    .000

The "Variables not in the Equation" box says that if log amylase were added to the equation, it would be significant.


Block 1: Method = Enter

In this block, log amylase is added to the equation.

Omnibus Tests of Model Coefficients
                    Chi-square    df    Sig.
Step 1   Step        151.643       1    .000
         Block       151.643       1    .000
         Model       151.643       1    .000

Step: The procedure can perform stepwise regression from a set of covariates. The Chi-square step tests the significance of the increase in fit of the current set of covariates vs. those in the previous set.

Block: The significance of the increase in fit of the current model vs. the last Block. We’ll focus on this.

Model: Tests the significance of the increase in fit of the current model vs. the "B0 only" model.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1            95.436                .447                      .722

The Linear Regression R-squared was .570.

In the following classification table, for each case, the predicted probability of 1 is evaluated and compared with 0.5. If that probability is > 0.5, the case is a predicted 1; otherwise it's a predicted 0.

Classification Table(a)
                                                     Predicted
                                          Pancreatitis Diagnosis (DV)
Step 1   Observed                          No Pancreatitis    Pancreatitis    Percentage Correct
Pancreatitis       No Pancreatitis              200                8                96.2   <- Specificity
Diagnosis (DV)     Pancreatitis                  14               34                70.8   <- Sensitivity (power)
Overall Percentage                                                                  91.4
a. The cut value is .500

Specificity: Percentage of all cases without the disease who were predicted to not have it.

(Percentage of correct predictions for those people who don’t have the disease.)

Sensitivity: Percentage of all cases with the disease who were predicted to have it.

(Percentage of correct predictions for those people who did have the disease.)
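Those two percentages (and the overall percentage correct) can be recomputed directly from the four cells of the classification table. A quick Python check (mine, using the counts printed above):

true_neg, false_pos = 200, 8        # No Pancreatitis cases: predicted No / predicted Yes
false_neg, true_pos = 14, 34        # Pancreatitis cases:    predicted No / predicted Yes

specificity = true_neg / (true_neg + false_pos)            # 200/208 = .962
sensitivity = true_pos / (false_neg + true_pos)            # 34/48   = .708
overall = (true_neg + true_pos) / 256                      # 234/256 = .914
print(round(specificity, 3), round(sensitivity, 3), round(overall, 3))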

Variables in the Equation
                         B        S.E.      Wald      df    Sig.     Exp(B)
Step 1(a)   LOGAMY      6.898     1.017    45.972      1    .000    990.114
            Constant  -16.020     2.227    51.744      1    .000       .000
a. Variable(s) entered on step 1: LOGAMY.

This box is analogous to the "Coefficients" box in Regression. These are the coefficients for the equation:

                              1
Y-hat = ----------------------------------------
           1 + e^-(-16.0203 + 6.8978*LOGAMY)


5. Post Analysis processing:

Computing Predicted proportions for the groups defined on page 3.

To show that the relationship assumed by the logistic regression analysis is a better representation of the relationship than the linear, I computed probability of 1 for each of the group midpoints from page 3. The figure below is a plot of those probabilities and the observed proportion of 1’s vs. the group midpoints.

Compare this figure with that on page 6 to see how much better the logistic regression relationship fits the data than does the linear relationship.

compute lrgpyhat = 1/(1+exp(-(-16.0203 + 6.8978*logamygp))).

GRAPH
  /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc lrgpyhat (PAIR)
  /MISSING=LISTWISE.

Graph

[Figure: overlay scatterplot vs. LOGAMYGP showing LRGPYHAT, the predicted proportions from the logistic regression, most of which coincide precisely with the observed proportions, and PROBPANC, the observed proportions. (Could it be that there were coding errors for the one discrepant group?)]

Compare this graph with the one immediately below from the linear regression analysis.

Note that the predicted proportions correspond much more closely to the observed proportions here.

Note the diverging predictions for all groups with proportions = 0 or 1.

The linear regression and the logistic regression analyses yield roughly the same predictions for “interior” points. But they diverge for “extreme” points – points with extremely small or extremely large values of X.


6. Discussion: Using residuals to distinguish between logistic and linear regression.

I computed residuals for all cases. Recall that a residual is Y – Y-hat. For these data, Y's were either 1 or 0. Y-hats are probabilities.

First, I computed Y-hats for all cases, using both the linear equation and the logistic equation.

compute mryhat = -1.043 + .635*logamy.
compute lryhat = 1/(1+exp(-(-16.0203 + 6.8978*logamy))).

Now residuals are computed.

compute mrresid = pancdiag - mryhat.
compute lrresid = pancdiag - lryhat.
frequencies variables = mrresid lrresid /histogram /format=notable.

Frequencies

[Figure: histogram of MRRESID, the residuals from the linear multiple regression (Std. Dev = .26, Mean = .00, N = 256.00). It's like the plot on page 3, except these are actual residuals, not Z's of residuals. Note that there are many large residuals – large negative and large positive.]

[Figure: scatterplot of PANCGRP vs. LOGAMY with the linear line of best fit; a positive residual and a negative residual are marked.]


The residuals above are simply distances of the observed points from the best fitting line, in this case a straight line.


[Figure: histogram of LRRESID, the residuals from the logistic regression (Std. Dev = .24, Mean = .00, N = 256.00). Note that most of them are virtually 0.]

[Figure: scatterplot of PANCGRP and predicted values vs. LOGAMY with the logistic curve. The residuals are simply distances of the observed points from the best fitting line, in this case a logistic line. The points which are circled are those with near-0 residuals.]

What these two sets of figures show is that the vast majority of residuals from the logistic regression analysis were virtually 0, while for the linear regression, there were many residuals that were substantially different from 0. So the logistic regression analysis has modeled the Y's better than the linear regression.
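As a concrete illustration of that point, the sketch below applies both fitted equations (the linear one from the Coefficients box and the logistic one from the Variables in the Equation box) to a handful of hypothetical cases and prints the two residuals side by side. The logamy values and outcomes are made up for illustration; they are not cases from the data file.

import math

def mryhat(logamy):                    # linear regression prediction
    return -1.043 + 0.635 * logamy

def lryhat(logamy):                    # logistic regression prediction
    return 1.0 / (1.0 + math.exp(-(-16.0203 + 6.8978 * logamy)))

for x, y in [(1.5, 0), (2.0, 0), (2.4, 1), (3.0, 1), (3.5, 1)]:   # hypothetical (logamy, Y) pairs
    print(f"logamy={x}  y={y}  linear residual={y - mryhat(x):+.3f}  logistic residual={y - lryhat(x):+.3f}")
# at the low and high ends the logistic residuals are essentially 0; the linear ones are not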


Logistic Regression Example 2: Two Continuous predictors

LOGISTIC REGRESSION VAR=pancgrp
  /METHOD=ENTER logamy loglip
  /CLASSPLOT
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

(Equivalently: logistic regression pancgrp with logamy loglip.)

[DataSet3] G:\MdbT\InClassDatasets\amylip.sav

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                            N       Percent
Selected Cases    Included in Analysis        256        83.7
                  Missing Cases                50        16.3
                  Total                       306       100.0
Unselected Cases                                0          .0
Total                                         306       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value             Internal Value
 .00 No Pancreatitis            0
1.00 Pancreatitis               1

Block 0: Beginning Block

Classification Table(a,b)
                                                     Predicted
                                          Pancreatitis Diagnosis (DV)
Step 0   Observed                          No Pancreatitis    Pancreatitis    Percentage Correct
Pancreatitis       No Pancreatitis              208                0               100.0
Diagnosis (DV)     Pancreatitis                  48                0                  .0
Overall Percentage                                                                   81.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                       B       S.E.     Wald      df    Sig.    Exp(B)
Step 0   Constant    -1.466    .160    83.852      1    .000     .231

The following assumes a model with only the constant, B0, in the equation.

Variables not in the Equation
                                  Score      df    Sig.
Step 0   Variables   LOGAMY     145.884       1    .000
                     LOGLIP     161.169       1    .000
         Overall Statistics     165.256       2    .000

Each p-value tells you whether or not the variable would be significant if entered BY ITSELF . That is, each of the above p-values should be interpreted on the assumption that only 1 of the variables would be entered.


Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                    Chi-square    df    Sig.
Step 1   Step        170.852       2    .000
         Block       170.852       2    .000
         Model       170.852       2    .000

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1            76.228                .487                      .787

Classification Table(a)
                                                     Predicted
                                          Pancreatitis Diagnosis (DV)
Step 1   Observed                          No Pancreatitis    Pancreatitis    Percentage Correct
Pancreatitis       No Pancreatitis              204                4                98.1
Diagnosis (DV)     Pancreatitis                  10               38                79.2
Overall Percentage                                                                  94.5
a. The cut value is .500

Specificity: 204/208. Specificity is the ability to identify cases who do NOT have the disease. Among those without the disease, .981 were correctly identified.

Sensitivity: 38/48. Sensitivity is the ability to identify cases who do have the disease. Among those with the disease, .792 were correctly identified.

Variables in the Equation
                         B        S.E.      Wald      df    Sig.    Exp(B)
Step 1(a)   LOGAMY      2.659     1.418     3.518      1    .061    14.286
            LOGLIP      2.998      .844    12.628      1    .000    20.043
            Constant  -14.573     2.251    41.907      1    .000      .000
a. Variable(s) entered on step 1: LOGAMY, LOGLIP.

Note that LOGAMY does not officially increase predictability over that afforded by LOGLIP.

Interpretation of the coefficients . . .

Bs: Not easily interpretable on a raw probability scale. Expected increase in log odds for a one-unit increase in the IV. If the p-value is <= .05, we can say that the inclusion of the predictor resulted in a significant change in probability of Y=1, an increase if Bi > 0, a decrease if Bi < 0. We just cannot give a simple quantitative prediction of the amount of change in probability of Y=1.

SEs: Standard error of the estimate of Bi.

Wald: Test statistic.

Sig: p-value associated with test statistic.

Note that LOGAMY does NOT (officially) add significantly to prediction over and above the prediction afforded by LOGLIP.

Exp(B): Odds ratio for a one-unit increase in IV among persons equal on the other IV.

Person one unit higher on IV will have Exp(B) greater odds of having Pancreatitis.

So a person one unit higher on LOGLIP will have 20.04 times greater odds of having Pancreatitis.

The Exp(B) column is mostly useful for dichotomous predictors – 0 = Absent; 1 = present.
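To see where the 20.04 comes from, here is a small Python check (mine, using the coefficients in the table above). Holding LOGAMY fixed at any value and raising LOGLIP by one unit multiplies the odds of Pancreatitis by e^B for LOGLIP, i.e., by Exp(B).

import math

b0, b_amy, b_lip = -14.573, 2.659, 2.998      # coefficients from the Variables in the Equation box

def odds(logamy, loglip):
    p = 1.0 / (1.0 + math.exp(-(b0 + b_amy * logamy + b_lip * loglip)))
    return p / (1.0 - p)

ratio = odds(2.0, 2.5) / odds(2.0, 1.5)       # one-unit increase in LOGLIP, LOGAMY held at 2.0
print(round(ratio, 2), round(math.exp(b_lip), 2))   # both are about 20.04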


Classification Plots – a frequency distribution of predicted probabilities with different symbols representing actual classification

Step number: 1

Observed Groups and Predicted Probabilities

80 ┼ ┼

│N │

│N │

F │N │

R 60 ┼N ┼

E │N │

Q │N │

U │NN │

E 40 ┼NN ┼

N │NN │

C │NNN │

Y │NNN │

20 ┼NNN ┼

│NNN P │

│NNN NN P │

│NNNNNNNNNNN P N P PP PP │

Predicted ─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────

Prob: 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1

Group: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP

Y-HAT

Predicted Probability is of Membership for Pancreatitis

The Cut Value is .50

Symbols: N - No Pancreatitis

P - Pancreatitis

Each Symbol Represents 5 Cases.

One aspect of the above plot is misleading because many cases are not represented in it. Only those cases which happened to be so close to other cases that a group of 5 cases could be formed are represented. So, for example, those relatively few cases whose Y-hats were close to .5 are not seen in the above plot, because there were not enough of them to make a group of 5 cases.

Classification Plots using dot plots.

Here’s the same information gotten as dot plots of Y-hats with PANCGRP as a Row Panel Variable.

For the most part, the patients who did not get Pancreatitis had small predicted probabilities while the patients who did get it had high predicted probabilities, as you would expect.

There were, however, a few patients who did get Pancreatitis who had small values of Y-hat.

Those patients are dragging down the sensitivity of the test.

Note that these patients don't show up on the CASEPLOT produced by the LOGISTIC REGRESSION procedure.


Classification Plots using Histograms in EXPLORE

Here’s another equivalent representation of what the authors of the program were trying to show.

[Figure: histogram of predicted probability for pancgrp = No Pancreatitis (Mean = 0.0515201, Std. Dev. = 0.12420309, N = 208) and histogram of predicted probability for pancgrp = Pancreatitis (Mean = 0.7767463, Std. Dev. = 0.33120602, N = 48).]


Visualizing the equation with two predictors

(Mike – use this as an opportunity to whine about SPSS’s horrible 3-D graphing capability.)

With one predictor, a simple scatterplot of YHATs vs. X will show the relationship between Y and X implied by the model.

For two predictor models, a 3-D scatterplot is required. Here’s how the graph below was produced.

Graphs -> Interactive -> Scatterplot . . .

[Figure: 3-D scatterplot of YHAT (vertical axis) vs. LOGLIP and LOGAMY.]

The graph shows the general ogival relationship of YHAT on the vertical to LOGLIP and LOGAMY. But the relationships really aren't apparent until the graph is rotated.

Don't ask me to demonstrate rotation. SPSS now does not offer the ability to rotate the graph interactively. It used to offer such a capability, but it's been removed. Shame on SPSS.

The same graph but with Linear Regression Y-hats plotted vs. loglip and logamy.

[Figure: 3-D scatterplot of the linear regression Y-hats vs. LOGLIP and LOGAMY.]


Representing Relationships with a Table – the PowerPoint slides

compute logamygp2 = rnd(logamy,.5).     <- Rounds logamy to the nearest .5.

logamygp2
               Frequency    Percent    Valid Percent    Cumulative Percent
Valid  1.50        123        40.2          40.2               40.2
       2.00        105        34.3          34.3               74.5
       2.50         46        15.0          15.0               89.5
       3.00         21         6.9           6.9               96.4
       3.50         10         3.3           3.3               99.7
       4.00          1          .3            .3              100.0
       Total       306       100.0         100.0

compute loglipgp2 = rnd(loglip,.5).

LOGAMY and LOGLIP groups were created by rounding values of LOGAMY and LOGLIP to the nearest .5.

loglipgp2
               Frequency    Percent    Valid Percent    Cumulative Percent
Valid   .50          1          .3            .3                 .3
       1.00          6         2.0           2.0                2.3
       1.50         45        14.7          14.7               17.0
       2.00        125        40.8          40.8               57.8
       2.50         49        16.0          16.0               73.9
       3.00         30         9.8           9.8               83.7
       3.50         20         6.5           6.5               90.2
       4.00         20         6.5           6.5               96.7
       4.50          8         2.6           2.6               99.3
       5.00          2          .7            .7              100.0
       Total       306       100.0         100.0

means pancgrp yhatamylip by logamygp2 by loglipgp2.

Here’s the LOGLIP grouping in which the values were rounded to the nearest .5.

Here’s the top of a very long two way table of mean Y-hat values for each combination of logamy group and loglip group.

Below, this table is “prettified”.


The above MEANS output, put into a 2 way table in Word

The entry in each cell is the expected probability (Y-hat) of contracting Pancreatitis at the combination of logamy and loglip represented by the cell.

[Table: mean Y-hat for each combination of LOGAMY group (rows, 1.5 up to 4) and LOGLIP group (columns, .5 up to 4.5). The entries rise from .00 in the lower left (low LOGAMY, low LOGLIP) to 1.00 in the upper right (high LOGAMY, high LOGLIP); cells for combinations that did not occur are empty.]

This table shows the joint relationship of predicted Y to LOGAMY and LOGLIP. Move from the lower left of the table to the upper right.

It also shows the partial relationships of each.

Partial Relationship of YHAT to LOGLIP – Move across any row.

So, for example, if your log amylase were 2.5, your chances of having Pancreatitis would be only .03 if your log lipase were 1.5. But at the same 2.5 value of log amylase, your chances would be .97 if your log lipase value were 4.0.

Partial Relationship of YHAT to LOGAMY – Move up any column.

Empty cells show that there are certain combinations of LOGAMY and LOGLIP that are very unlikely.
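A table like this can also be generated directly from the fitted two-predictor equation. The Python sketch below (mine) evaluates Y-hat at each combination of group midpoints. Because the handout's cells are means of individual Y-hats within each group, its entries differ slightly from these direct midpoint evaluations, but the pattern – rising from the lower left to the upper right – is the same.

import math

b0, b_amy, b_lip = -14.573, 2.659, 2.998      # two-predictor coefficients from Example 2

def yhat(logamy, loglip):
    return 1.0 / (1.0 + math.exp(-(b0 + b_amy * logamy + b_lip * loglip)))

lip_values = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
print("LOGAMY  " + "  ".join(f"{v:5.1f}" for v in lip_values) + "   (columns are LOGLIP)")
for amy in [3.5, 3.0, 2.5, 2.0, 1.5]:          # rows, high LOGAMY at the top
    print(f"{amy:6.1f}  " + "  ".join(f"{yhat(amy, lip):5.2f}" for lip in lip_values))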


Logistic Regression Example 3: One Categorical IV with 3 categories

The data here are the FFROSH data – freshmen from 1987-1992.

The dependent variable is RETAINED – whether a student went directly to the 2nd semester.

The independent variable is NRACE – the ethnic group recorded for the student. It has three values: 1 = White; 2 = African American; 3 = Asian-American.

Recall that ALL independent variables are called covariates in LOGISTIC REGRESSION.

We know that categorical independent variables with 3 or more categories must be represented by group coding variables. LOGISTIC REGRESSION allows us to do that internally.

Indicator coding is dummy coding. Here, Category 1 (White) is used as the reference category.
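Outside SPSS you would create those group coding variables yourself. Here is a minimal Python sketch of indicator (dummy) coding with category 1 (White) as the reference category, which is what /CONTRAST (nrace)=Indicator(1) requests. The function name is mine.

def indicator_code(nrace):
    # returns (nrace1, nrace2); White (code 1) is the reference group, coded (0, 0)
    nrace1 = 1 if nrace == 2 else 0    # 1 for African American, else 0
    nrace2 = 1 if nrace == 3 else 0    # 1 for Asian-American, else 0
    return nrace1, nrace2

for code, label in [(1, "White"), (2, "African American"), (3, "Asian-American")]:
    print(label, indicator_code(code))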


LOGISTIC REGRESSION retained

/METHOD = ENTER nrace

/CONTRAST (nrace)=Indicator(1)

/CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                            N       Percent
Selected Cases    Included in Analysis       4697        98.8
                  Missing Cases                56         1.2
                  Total                      4753       100.0
Unselected Cases                                0          .0
Total                                        4753       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
 .00                   0
1.00                   1

Categorical Variables Codings
nrace NUMERIC WHITE/BLACK/ORIENTAL RACE CODE      Frequency     Parameter coding
                                                                  (1)       (2)
1.00 WHITE                                           3987         .000      .000
2.00 BLACK                                            626        1.000      .000
3.00 ORIENTAL                                           84        .000     1.000

This is the syntax generated by the above menus.

SPSS's coding of the independent variable here is important. Note that Whites are the 0,0 group. The first group coding variable compares Blacks with Whites. The 2nd compares Asian-Americans with Whites.

Block 0: Beginning Block

Classification Table(a,b)
                                      Predicted
                               retained
Step 0   Observed                .00     1.00     Percentage Correct
         retained    .00           0      545            .0
                    1.00           0     4152          100.0
         Overall Percentage                              88.4
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                      B      S.E.      Wald       df    Sig.    Exp(B)
Step 0   Constant   2.031    .046    1986.391      1    .000     7.618

Variables not in the Equation
                                  Score     df    Sig.
Step 0   Variables   nrace       6.680       2    .035
                     nrace(1)    2.433       1    .119
                     nrace(2)    3.903       1    .048
         Overall Statistics      6.680       2    .035

SPSS first prints p-value information for the collection of group coding variables representing the categorical factor. Then it prints p-value information for each GCV separately. None of the information about categorical variables in this “Variables not in the Equation” box is too useful.


Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                    Chi-square    df    Sig.
Step 1   Step          7.748       2    .021
         Block         7.748       2    .021
         Model         7.748       2    .021

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1          3364.160(a)             .002                     .003
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table(a)
                                      Predicted
                               retained
Step 1   Observed                .00     1.00     Percentage Correct
         retained    .00           0      545            .0
                    1.00           0     4152          100.0
         Overall Percentage                              88.4
a. The cut value is .500

Variables in the Equation
                         B       S.E.       Wald       df    Sig.    Exp(B)
Step 1(a)   nrace                           6.368       2    .041
            nrace(1)    .237     .143       2.741       1    .098     1.268
            nrace(2)   1.007     .515       3.829       1    .050     2.737
            Constant   1.989     .049    1669.869       1    .000     7.306
a. Variable(s) entered on step 1: nrace.

Note that for a categorical variable, SPSS first prints “overall” information on the variable. Then it prints information for each specific group coding variable.

So the bottom line is that

0) There are significant differences in likelihood of retention to the 2nd semester between the groups (p=.041).

Specifically . . .

1) Blacks are not significantly more likely to sustain than Whites, although the difference approaches significance. (p=.098).

2) Asian-Americans are significantly more likely to sustain than Whites (p=.050).


Logistic Regression Example 4: Three Continuous predictors – FFROSH Data

The data used for this are data on freshmen from 1987-1992. Start here on 10/7/15

The dependent variable is RETAINED – whether a student went directly into the 2nd semester or not.

Predictors (covariates in logistic regression) are HSGPA, ACT composite, and Overall attempted hours in the first semester, excluding the freshman seminar course.

GET FILE='E:\MdbR\FFROSH\ffrosh.sav'.
logistic regression retained with hsgpa actcomp oatthrs1.

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                            N       Percent
Selected Cases    Included in Analysis       4852       100.0
                  Missing Cases                 0          .0
                  Total                      4852       100.0
Unselected Cases                                0          .0
Total                                        4852       100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
 .00                   0
1.00                   1

Block 0: Beginning Block

Classification Table(a,b)
                                      Predicted
                               RETAINED
Step 0   Observed                .00     1.00     Percentage Correct
         RETAINED    .00           0      620            .0    <- Specificity
                    1.00           0     4232          100.0   <- Sensitivity
         Overall Percentage                              87.2
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                      B      S.E.      Wald       df    Sig.    Exp(B)
Step 0   Constant   1.921    .043    1994.988      1    .000     6.826

Variables not in the Equation
                                  Score      df    Sig.
Step 0   Variables   HSGPA       225.908      1    .000
                     ACTCOMP      44.653      1    .000
                     OATTHRS1    274.898      1    .000
         Overall Statistics      385.437      3    .000

Recall that the p-values are those that would be obtained if a variable were put BY ITSELF into the equation.


Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                    Chi-square    df    Sig.
Step 1   Step        381.011       3    .000
         Block       381.011       3    .000
         Model       381.011       3    .000

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1          3327.365                .076                     .141

Classification Table(a)
                                      Predicted
                               RETAINED
Step 1   Observed                .00     1.00     Percentage Correct
         RETAINED    .00          35      585           5.6    <- Specificity
                    1.00          16     4216          99.6    <- Sensitivity
         Overall Percentage                              87.6
a. The cut value is .500

Variables in the Equation
                         B       S.E.      Wald       df    Sig.    Exp(B)
Step 1(a)   HSGPA       1.077    .101    112.767       1    .000     2.935
            ACTCOMP     -.022    .014      2.637       1    .104      .978
            OATTHRS1     .148    .012    146.487       1    .000     1.160
            Constant   -2.225    .308     52.362       1    .000      .108
a. Variable(s) entered on step 1: HSGPA, ACTCOMP, OATTHRS1.

Note that while ACTCOMP would have been significant by itself without controlling for HSGPA and OATTHRS1, when controlling for those two variables, it's not significant.

So, the bottom line is that

1) Among persons equal on ACTCOMP and OATTHRS1, those with larger HSGPAs were more likely to go directly into the 2nd semester.

2) Among persons equal on HSGPA and OATTHRS1, there was no significant relationship of likelihood of sustaining to ACTCOMP. Among persons equal on HSGPA and OATTHRS1, those with higher ACTCOMP were not significantly more likely to sustain than those with lower ACTCOMP. Note that there are other variables that could be controlled for and that this relationship might "become" significant when those variables are controlled. (But it didn't.)

3) Among persons equal on HSGPA and ACTCOMP, those who took more hours in the first semester were more likely to go directly to the 2nd semester. What does this mean???? These were more likely to be full-time students??


Logistic Regression Example 5: The FFROSH Full Analysis

From the report to the faculty – Output from SPSS for the Macintosh Version 6 .

---------------------- Variables in the Equation -----------------------

Variable B S.E. Wald df Sig R Exp(B)

AGE -.0950 .0532 3.1935 1 .0739 -.0180 .9094

NSEX .2714 .0988 7.5486 1 .0060 .0388 1.3118

After adjusting for differences associated with the other variables, Males were more likely to enroll in the second semester .

NRACE1 -.4738 .1578 9.0088 1 .0027 -.0436 .6227

After adjusting for differences associated with the other variables, Whites were less likely to enroll in the second semester.

NRACE2 .1168 .1773 .4342 1 .5099 .0000 1.1239

HSGPA .8802 .1222 51.8438 1 .0000 .1162 2.4114

After adjusting for differences associated with the other variables, those with higher high school GPA's were more likely to enroll in the second semester.

ACTCOMP -.0239 .0161 2.1929 1 .1387 -.0072 .9764

OATTHRS1 .1588 .0124 164.4041 1 .0000 .2098 1.1721

After adjusting for differences associated with the other variables, those with higher attempted hours were more likely to enroll in the second semester.

EARLIREG .2917 .1011 8.3266 1 .0039 .0414 1.3387

After adjusting for differences associated with the other variables, those who registered six months or more before the first day of school were more likely to enroll in the second semester.

NADMSTAT -.2431 .1226 3.9330 1 .0473 -.0229 .7842

POSTSEM -.1092 .0675 2.6206 1 .1055 -.0130 .8965

PREYEAR2 -.0461 .0853 .2924 1 .5887 .0000 .9549

PREYEAR3 .1918 .0915 4.3952 1 .0360 .0255 1.2114

After adjusting for differences associated with the other variables, those who enrolled in 1991 were more likely to enroll in the second semester than others enrolled before 1990. What???

POSYEAR2 -.0845 .0977 .7467 1 .3875 .0000 .9190

POSYEAR3 -.1397 .0998 1.9585 1 .1617 .0000 .8696

HAVEF101 .4828 .1543 9.7876 1 .0018 .0459 1.6206

After adjusting for differences associated with the other variables, those who took the freshman seminar were more likely to enroll in second semester than those who did not.

Constant -.1075 1.1949 .0081 1 .9283
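A note on reading Exp(B): it is simply e raised to the B. For HSGPA, e^.8802 = 2.41, so each additional point of high school GPA multiplied the odds of enrolling in the second semester by about 2.4, holding the other variables constant. For OATTHRS1, e^.1588 = 1.17, roughly a 17% increase in the odds per attempted hour.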

Variables in the Equation
                         B        S.E.     Wald       df    Sig.    Exp(B)
Step 1 a   age           -.099    .053       3.461     1    .063     .905
           nsex           .257    .099       6.726     1    .010    1.294
           nrace                             19.394     2    .000
           nrace(1)      -.944    .487       3.749     1    .053     .389
           nrace(2)      -.337    .504        .446     1    .504     .714
           hsgpa          .852    .123      48.204     1    .000    2.344
           actcomp       -.021    .016       1.676     1    .195     .979
           oatthrs1       .159    .012     163.499     1    .000    1.173
           earlireg       .316    .102       9.640     1    .002    1.372
           admstat(1)     .253    .123       4.222     1    .040    1.288
           postsem       -.115    .068       2.880     1    .090     .891
           y1988         -.048    .086        .306     1    .580     .954
           y1989          .177    .092       3.737     1    .053    1.194
           y1991         -.078    .098        .633     1    .426     .925
           y1992         -.124    .101       1.511     1    .219     .884
           havef101       .967    .152      40.364     1    .000    2.629
           Constant      -.032   1.228        .001     1    .979     .968
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.

Logistic Regression Lecture - 40 4/13/2020

This is from SPSS V15.

There are slight differences in the numbers, not due to changes in the program but due to slight differences in the data. I believe some cases were dropped between when the V6 and V15 analyses were performed.

NRACE was coded differently in the V15 analysis.

The similarity is a tribute to the statisticians who developed logistic regression.


The full FFROSH Analysis in Version 15 of SPSS

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat
  postsem y1988 y1989 y1991 y1992 havef101
  /categorical nrace admstat.

Logistic Regression

[DataSet1] G:\MdbR\FFROSH\ffrosh.sav

Case Processing Summary
Unweighted Cases a                           N       Percent
Selected Cases    Included in Analysis      4781       98.5
                  Missing Cases               71        1.5
                  Total                     4852      100.0
Unselected Cases                               0         .0
Total                                       4852      100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value    Internal Value
.00                     0
1.00                    1

Categorical Variables Codings a
                                                                     Parameter coding
                                                        Frequency     (1)       (2)
nrace    NUMERIC WHITE/BLACK/ORIENTAL    1.00 WHITE         4060     1.000      .000
         RACE CODE                       2.00 BLACK          636      .000     1.000
                                         3.00 ORIENTAL        85      .000      .000
admstat  NUMERIC ADMISSION STATUS CODE   AP                 3292     1.000
                                         CD                 1489      .000
a. This coding results in indicator coefficients.

Block 0: Beginning Block

Block 0 output skipped

Logistic Regression Lecture - 41 4/13/2020

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step       494.704     15    .000
         Block      494.704     15    .000
         Model      494.704     15    .000

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1            3155.842 a             .098                   .184
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table a
                                        Predicted
                                  retained              Percentage
Observed                          .00       1.00        Correct
Step 1   retained   .00            79        531          13.0
                    1.00           33       4138          99.2
         Overall Percentage                                88.2
a. The cut value is .500

Variables in the Equation
                         B        S.E.     Wald       df    Sig.    Exp(B)
Step 1 a   age           -.099    .053       3.461     1    .063     .905
           nsex           .257    .099       6.726     1    .010    1.294
           nrace                             19.394     2    .000
           nrace(1)      -.944    .487       3.749     1    .053     .389
           nrace(2)      -.337    .504        .446     1    .504     .714
           hsgpa          .852    .123      48.204     1    .000    2.344
           actcomp       -.021    .016       1.676     1    .195     .979
           oatthrs1       .159    .012     163.499     1    .000    1.173
           earlireg       .316    .102       9.640     1    .002    1.372
           admstat(1)     .253    .123       4.222     1    .040    1.288
           postsem       -.115    .068       2.880     1    .090     .891
           y1988         -.048    .086        .306     1    .580     .954
           y1989          .177    .092       3.737     1    .053    1.194
           y1991         -.078    .098        .633     1    .426     .925
           y1992         -.124    .101       1.511     1    .219     .884
           havef101       .967    .152      40.364     1    .000    2.629
           Constant      -.032   1.228        .001     1    .979     .968
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.

The absence of a relationship to ACTCOMP is very interesting. It could be the foundation for a theory of retention.

Logistic Regression Lecture - 42 4/13/2020

Logistic Regression Example 6: A 2 x 3 Factorial

From Pedhazur, p. 762, Problem 4. Messages attributed to either an Economist, a Labor Leader, or a Politician are given to participants. The message is about the effects of NAFTA (the North American Free Trade Agreement). The participants rated each message as Biased or Unbiased.

Half the participants were told that the source was male. The other half were told the source was female. The data are

                               Economist    Labor Leader    Politician
Male Source     Rated Biased          7            13             19
                Rated Unbiased       18            12              6
Female Source   Rated Biased          5            17             20
                Rated Unbiased       20             8              5

The data were entered into SPSS as follows: one row per cell of the table above, with columns gender, source, judgment, and freq (the Data Editor screenshot is omitted here).

These are summary data, not individual data.

For example, line 1: 1 1 1 7, represents 7 individual cases. If the data had been entered as individual data,

the first lines of the data editor would have been

gender   source   judgment
   1        1         1
   1        1         1
   1        1         1
   1        1         1
   1        1         1
   1        1         1
   1        1         1

(seven rows of 1 1 1 – one for each of the 7 Male/Economist/Biased cases – and so on for the remaining cells).

To get SPSS to “expand” the summary data into all of the individual cases it’s summarizing, I did the following

Data -> Weight Cases . . .

All analyses after the Weight Cases dialog will involve the expanded data of 150 cases.

Logistic Regression Lecture - 43 4/13/2020

If you’re interested, the syntax that will do all of the above is . . .

DATASET ACTIVATE DataSet5.
Data list free / gender source judgment freq.
Begin data.
1 1 1 7
1 1 0 18
1 2 1 13
1 2 0 12
1 3 1 19
1 3 0 6
2 1 1 5
2 1 0 20
2 2 1 17
2 2 0 8
2 3 1 20
2 3 0 5
end data.
weight by freq.
value labels gender 1 "Male" 2 "Female"
  /source 1 "Economist" 2 "Labor Leader" 3 "Politician"
  /judgment 1 "Biased" 0 "Unbiased".

This syntax reads frequency counts and analyzes them as if they were individual respondent data.

The logistic regression dialogs . . .

Analyze -> Regression -> Binary Logistic . . .

The syntax to invoke the Logistic Regression command is

logistic regression judgment /categorical = gender source
  /enter source gender
  /enter source by gender
  /save = pred(predicte).

Note how to tell the LOGISTIC REGRESSION procedure to analyze the interaction between two factors.

Logistic Regression Lecture - 44 4/13/2020

Logistic Regression output

Case Processing Summary
Unweighted Cases a                           N       Percent
Selected Cases    Included in Analysis        12      100.0
                  Missing Cases                0         .0
                  Total                       12      100.0
Unselected Cases                               0         .0
Total                                         12      100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value      Internal Value
.00 Unbiased              0
1.00 Biased               1

Categorical Variables Codings a
                                          Parameter coding
                            Frequency      (1)       (2)
source   1.00 Economist          4        1.000      .000
         2.00 Labor Leader       4         .000     1.000
         3.00 Politician         4         .000      .000
gender   1.00 Male               6        1.000
         2.00 Female             6         .000
a. This coding results in indicator coefficients.

Note that the N is incorrect. We told SPSS to “expand” the summary data into 150 individual cases, but this part of the Logistic Regression output does not acknowledge that expansion except in the footnote.

The Categorical Variables Codings table tells us about the group coding variables for source and gender. It’s dummy (indicator) variable coding, with Politician as the reference group for source and Female as the reference group for gender.

Source(1) compares Economists with Politicians.
Source(2) compares Labor Leaders with Politicians.
Gender(1) compares Male sources with Female sources.

Thanks, Logistic.
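Putting those codings into the logistic regression equation (my own restatement, not part of the SPSS output), the model fit in Block 1 below is

   Log odds of a Biased rating = B0 + B1*Source(1) + B2*Source(2) + B3*Gender(1)

so the constant, B0, is the log odds of a Biased rating in the reference cell – a message attributed to a female Politician.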

Logistic Regression Lecture - 45 4/13/2020

Block 0: Beginning Block

(I generally ignore the Block 0 output. Not much of interest here except to logistic regression aficionados.)

Classification Table a,b
                                               Predicted
                                     judgment                        Percentage
Observed                             .00 Unbiased    1.00 Biased     Correct
Step 0   judgment  .00 Unbiased            0              69             .0
                   1.00 Biased             0              81           100.0
         Overall Percentage                                             54.0
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                    B      S.E.    Wald    df    Sig.    Exp(B)
Step 0   Constant   .160   .164    .958     1    .328    1.174

Variables not in the Equation
                                  Score     df    Sig.
Step 0   Variables   source       30.435     2    .000
                     source(1)    27.174     1    .000
                     source(2)     1.087     1    .297
                     gender(1)      .242     1    .623
         Overall Statistics       30.676     3    .000

Logistic Regression Lecture - 46 4/13/2020

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step        32.186      3    .000
         Block       32.186      3    .000
         Model       32.186      3    .000

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1             174.797 a             .193                   .258
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table a
                                               Predicted
                                     judgment                        Percentage
Observed                             .00 Unbiased    1.00 Biased     Correct
Step 1   judgment  .00 Unbiased           38              31            55.1
                   1.00 Biased            12              69            85.2
         Overall Percentage                                             71.3
a. The cut value is .500

Variables in the Equation
                       B        S.E.     Wald      df    Sig.    Exp(B)
Step 1 a   source                        26.944     2    .000
           source(1)   -2.424   .476     25.880     1    .000     .089
           source(2)    -.862   .448      3.709     1    .054     .422
           gender(1)    -.202   .368       .303     1    .582     .817
           Constant     1.370   .393     12.143     1    .000    3.934
a. Variable(s) entered on step 1: source, gender.

Source: There are overall differences in the probability of rating the passage as biased across the 3 source groups.

Source(1): The probability of rating the passage as biased was lowest when respondents were told the message was from an Economist (vs. a Politician).

Source(2): No officially significant difference in the probability of rating the passage as biased when it was attributed to a Labor Leader vs. a Politician.

Gender(1): No difference in the probability of rating the passage as biased between messages attributed to male and female sources.
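A quick check of Source(1) against the raw frequencies above (my own arithmetic): collapsing over gender, the odds of a Biased rating were 12/38 = .32 for Economist sources and 39/11 = 3.55 for Politician sources, and .32/3.55 ≈ .09, essentially the Exp(B) of .089 printed for Source(1). Likewise, Exp(B) for the constant, 3.934, is close to the observed odds of 20/5 = 4.0 in the female/Politician reference cell.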

Logistic Regression Lecture - 47 4/13/2020

Block 2: Method = Enter This block adds the interaction of SourceXGender. No change in results.

Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step         1.594      2    .451
         Block        1.594      2    .451
         Model       33.780      5    .000

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1             173.203 a             .202                   .269
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table a
                                               Predicted
                                     judgment                        Percentage
Observed                             .00 Unbiased    1.00 Biased     Correct
Step 1   judgment  .00 Unbiased           38              31            55.1
                   1.00 Biased            12              69            85.2
         Overall Percentage                                             71.3
a. The cut value is .500

Variables in the Equation
                                       B        S.E.     Wald      df    Sig.    Exp(B)
Step 1 a   source                                        17.214     2    .000
           source(1)                  -2.773    .707     15.374     1    .000     .063
           source(2)                   -.633    .659       .922     1    .337     .531
           gender(1)                   -.234    .685       .116     1    .733     .792
           source * gender                                1.573     2    .455
           source(1) by gender(1)       .675    .958       .497     1    .481    1.965
           source(2) by gender(1)      -.440    .902       .238     1    .626     .644
           Constant                    1.386    .500      7.687     1    .006    4.000
a. Variable(s) entered on step 1: source * gender.

Since the interaction was not significant, we don’t have to interpret it.

The main conclusion is that respondents rated passages from politicians as more biased than from economists.

Logistic Regression Lecture - 48 4/13/2020

** Just for kicks, I analyzed this as if the DV were continuous.

Since GLM does not automatically test group coding variables, I created dummy codes for source within GLM . . . Note that dummy codes are called Simple in GLM.
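For reference, here is a minimal UNIANOVA sketch of that analysis (my own reconstruction, not the author's saved syntax); the SIMPLE contrast uses the last level of source, Politician, as the reference category, which matches the contrast table below.

UNIANOVA judgment BY gender source
  /CONTRAST(source)=SIMPLE
  /PRINT=PARAMETER
  /DESIGN=gender source gender*source.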

Univariate Analysis of Variance

Between-Subjects Factors
                   Value Label       N
gender   1.00      Male             75
         2.00      Female           75
source   1.00      Economist        50
         2.00      Labor Leader     50
         3.00      Politician       50

Tests of Between-Subjects Effects
Dependent Variable: judgment
Source              Type III Sum of Squares    df    Mean Square        F       Sig.
Corrected Model            7.980 a              5        1.596         7.849    .000
Intercept                 43.740                1       43.740       215.115    .000
gender                      .060                1         .060          .295    .588
source                     7.560                2        3.780        18.590    .000
gender * source             .360                2         .180          .885    .415
Error                     29.280              144         .203
Total                     81.000              150
Corrected Total           37.260              149
a. R Squared = .214 (Adjusted R Squared = .187)

Contrast Results (K Matrix)
                                                                        Dependent Variable
source Simple Contrast a                                                     judgment
Level 1 vs. Level 3   Contrast Estimate                                        -.540
                      Hypothesized Value                                           0
                      Difference (Estimate - Hypothesized)                     -.540
                      Std. Error                                                .090
                      Sig.                                                      .000
                      95% Confidence Interval for Difference   Lower Bound     -.718
                                                               Upper Bound     -.362
Level 2 vs. Level 3   Contrast Estimate                                        -.180
                      Hypothesized Value                                           0
                      Difference (Estimate - Hypothesized)                     -.180
                      Std. Error                                                .090
                      Sig.                                                      .048
                      95% Confidence Interval for Difference   Lower Bound     -.358
                                                               Upper Bound     -.002
a. Reference category = 3

Logistic Regression Lecture - 49 4/13/2020

Logistic Regression Lecture - 50 4/13/2020

ROC (Receiver Operating Characteristic) Curve analysis

Situation: A dichotomous state (e.g., illness, termination, death, success) is to be predicted.

You have a continuous predictor.

The relationship between the dichotomous dependent variable and the continuous predictor is significant.

The continuous predictor can consist of y-hats from the combination of multiple predictors.

It’ll be called y-hat from now on.

By setting a cutoff on the values of y-hat, Predicted 1s and Predicted 0s can be defined as is done in the

Logistic Regression Classification Table.

Predicted 1: Every case whose y-hat is above the cutoff.

Predicted 0: Every case whose y-hat is <= the cutoff.

Once predicted 1s and predicted 0s have been defined, the classification table printed by Logistic Regression can be created.

Some issues: 1) Which cutoff should be used? 2) How should predictability be measured? 3) How does predictability relate to the cutoff that is employed?

ROC Curve

The Receiver Operating Characteristic curve provides an approach to understanding the relationship between a dichotomous dependent variable and a continuous predictor.

The ROC curve is a plot of Sensitivity vs. 1-Specificity .

Sensitivity: Percent of Actual 1s predicted correctly.

Specificity: Percent of Actual 0s predicted correctly.

1-Specificity: Percent of Actual 0s predicted incorrectly to be 1s.

In ROC terminology, Sensitivity is called the Hit or true positive rate.

1-Specificity is called the False Alarm or false positive rate.

So the ROC curve is a plot of the proportion of successful identifications vs. proportion of false identifications of some phenomenon.
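For example (illustrative numbers only, not from any output below): if 80 of 100 actual 1s fall above the cutoff and 30 of 200 actual 0s do, then Sensitivity = 80/100 = .80 and 1 - Specificity = 30/200 = .15, and that pair is one point on the ROC curve.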

Logistic Regression Lecture - 51 4/13/2020

Example: The Logamy, LogLip data revisited.

GET

FILE='G:\MDBT\InClassDatasets\amylip.sav'.

DATASET NAME DataSet1 WINDOW=FRONT.

LOGISTIC REGRESSION VARIABLES pancgrp

/METHOD=ENTER logamy loglip

/SAVE=PRED

/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

Logistic Regression

[DataSet1] G:\MDBT\InClassDatasets\amylip.sav

Case Processing Summary
Unweighted Cases a                           N       Percent
Selected Cases    Included in Analysis       256       83.7
                  Missing Cases               50       16.3
                  Total                      306      100.0
Unselected Cases                               0         .0
Total                                        306      100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value           Internal Value
.00 No Pancreatitis            0
1.00 Pancreatitis              1

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                  Chi-square    df    Sig.
Step 1   Step       170.852      2    .000
         Block      170.852      2    .000
         Model      170.852      2    .000

Note that I requested that the y-hats be saved in the Data Editor.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1              76.228 a             .487                   .787
a. Estimation terminated at iteration number 7 because parameter estimates changed by less than .001.

Classification Table a
                                               Predicted
                                     .00 No            1.00             Percentage
Observed (pancgrp)                   Pancreatitis      Pancreatitis     Correct
Step 1   .00 No Pancreatitis             204                 4              98.1
         1.00 Pancreatitis                10                38              79.2
         Overall Percentage                                                 94.5
a. The cut value is .500

Specificity = proportion of the 208 0s that were predicted to be 0s = 204/208 = .981.
Sensitivity = proportion of the 48 1s that were predicted to be 1s = 38/48 = .792.
False alarm rate = proportion of the 208 0s that were predicted to be 1s = 1 - .981 = .019.
Hit rate = Sensitivity = .792.
Note that these specificity and sensitivity values are for only one cutoff, .500.

Variables in the Equation
                       B         S.E.     Wald      df    Sig.    Exp(B)
Step 1 a   logamy       2.659    1.418     3.518     1    .061    14.286
           loglip       2.998     .844    12.628     1    .000    20.043
           Constant   -14.573    2.251    41.907     1    .000      .000
a. Variable(s) entered on step 1: logamy, loglip.

The y-hats were saved and renamed LogRegYhatAmyLip.
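To see how the classification would change at some other cutoff, here is a minimal sketch of my own (not output from the original handout); it assumes the saved y-hats are in the variable LogRegYhatAmyLip mentioned above, and pred80 is just an illustrative name.

* Sketch: classify cases as predicted 1s when the saved y-hat is above an illustrative cutoff of .80.
COMPUTE pred80 = (LogRegYhatAmyLip > .80).
EXECUTE.
* Row percentages: the Pancreatitis row gives the sensitivity; the No Pancreatitis row gives the false alarm rate.
CROSSTABS /TABLES=pancgrp BY pred80 /CELLS=COUNT ROW.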

Logistic Regression Lecture - 52 4/13/2020


ROC Analysis

ROC LogRegYhatAmyLip BY pancgrp (1)

/PLOT=CURVE( REFERENCE )

/PRINT= COORDINATES

/CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)

/MISSING=EXCLUDE.

Logistic Regression Lecture - 53 4/13/2020

ROC Curve

[DataSet1] G:\MDBT\InClassDatasets\amylip.sav

Case Processing Summary
pancgrp Pancreatitis Diagnosis (DV)     Valid N (listwise)
Positive a                                      48
Negative                                       208
Missing                                         50
Larger values of the test result variable(s) indicate stronger evidence for a positive actual state.
a. The positive actual state is 1.00 Pancreatitis.

The ROC procedure employs as many cutoff values as possible (not just one = .500 as in Logistic Regression). It computes Sensitivity and 1-Specificity for each cutoff value and plots Sensitivity vs. 1-Specificity for each cutoff. That plot is the blue curve.

[ROC curve: Sensitivity plotted against 1 - Specificity (the False Alarm rate), with the diagonal reference line.]

Area Under the Curve
Test Result Variable(s): Predicted probability (y-hat)
Area = .980

You can think of the ROC curve as a “running classification table” with sensitivity/1-specificity values for all possible cutoff values.

The above graph does not show small differences in percentages; for example, the specificity of .981 at the cutoff of .500 in the table (a false alarm rate of .019) appears as essentially 0 on the horizontal axis of this graph.

Often, the area under the blue curve is used as a measure of overall predictability.

That area ranges from 0.5 (when the blue line coincides with the green diagonal line – no predictability) to 1.0 (when the blue line follows the upper and left borders of the graph). In this instance, overall predictability would be considered quite high.
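Another way to read that area: an area of .98 means that if one actual Pancreatitis case and one actual No Pancreatitis case are picked at random, the Pancreatitis case will have the larger y-hat about 98% of the time.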

Logistic Regression Lecture - 54 4/13/2020

Example 2 – the FFROSH data. Predicting sustaining.

The basic analyses, from the previous lecture . . . (Block 0 output omitted.)

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat
  postsem y1988 y1989 y1991 y1992 havef101
  /categorical nrace admstat / criteria = cut(.5) /save=pred.

Logistic Regression

[DataSet4] G:\MDBR\FFROSH\Ffroshnm.sav

Case Processing Summary
Unweighted Cases a                           N       Percent
Selected Cases    Included in Analysis      4697       98.8
                  Missing Cases               56        1.2
                  Total                     4753      100.0
Unselected Cases                               0         .0
Total                                       4753      100.0
a. If weight is in effect, see classification table for the total number of cases.

Note – y-hats saved.

Block 1: Method = Enter

Classification Table a
                                        Predicted
                                  retained              Percentage
Observed                          .00       1.00        Correct
Step 1   retained   .00            16        529           2.9
                    1.00            7       4145          99.8
         Overall Percentage                                88.6
a. The cut value is .500

Variables in the Equation
                         B        S.E.     Wald       df    Sig.    Exp(B)
Step 1 a   age           -.113    .053       4.467     1    .035     .893
           nsex           .268    .102       6.891     1    .009    1.307
           nrace                             16.513     2    .000
           nrace(1)     -1.033    .526       3.854     1    .050     .356
           nrace(2)      -.473    .543        .758     1    .384     .623
           hsgpa          .969    .128      57.301     1    .000    2.636
           actcomp       -.009    .017        .283     1    .595     .991
           oatthrs1       .105    .015      46.284     1    .000    1.111
           earlireg       .351    .105      11.144     1    .001    1.421
           admstat(1)     .229    .127       3.255     1    .071    1.257
           postsem       -.140    .071       3.897     1    .048     .870
           y1988         -.081    .088        .833     1    .362     .922
           y1989          .203    .097       4.426     1    .035    1.225
           y1991         -.062    .100        .385     1    .535     .940
           y1992         -.142    .102       1.925     1    .165     .868
           havef101       .851    .159      28.769     1    .000    2.341
           Constant       .364   1.257        .084     1    .772    1.438
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.

Logistic Regression Lecture - 55 4/13/2020

For this example, I reran the analysis several times, each time specifying a different cutoff value.

Here are the sensitivity and false alarm rates from the Classification Tables for the different cutoff values (the individual tables are omitted):

False Alarm rate    Sensitivity
      .996             1.000
      .994             1.000
      .971              .998
      .804              .965
      .281              .598

Note that the location of the points on the ROC curve is the inverse of the values of the cutoff – the higher the cutoff, the farther the point lies toward the lower left of the curve.

Logistic Regression Lecture - 56 4/13/2020

Advantages of ROC curve analysis

1. Forces you to realize that when using many prediction systems, an increase in sensitivity is invariably accompanied by a concomitant increase in false alarms . For a given selection system, points will only differ in the lower-left to upper-right direction. Moving to the upper right increases sensitivity but it also increases false alarms. Moving to the lower left decreases false alarms but also decreases sensitivity.

2. Shows that to make a prediction (I-O types, read selection) system better, you must increase sensitivity while at the same time decreasing false alarms – moving points toward the upper left of the ROC space (the direction of better selection).

3. Enables you to graphically disentangle issues of bias (the value of the cutoff) from issues of predictability (the area under the curve).

Logistic Regression Lecture - 57 4/13/2020
