
19. Logistic Regression

– Logistic regression is an alternative to multiple regression
– Used to predict an outcome variable that is a categorical dichotomy from a set of categorical or continuous predictor variables
– Used because a dichotomous outcome variable violates the assumption of linearity in normal regression
– Logistic regression emphasizes the probability of a particular outcome for each case
– The dependent variable must be non-metric/categorical (nominal or ordinal scale)
– The independent variables can be a combination of metric and/or non-metric variables
– Logistic regression requires fewer assumptions than discriminant analysis
– It requires neither multivariate normality nor homogeneity of the variance-covariance matrices
Kinnear, P. R. (2011)
– The outcome variable (Ŷ) is the probability of having one outcome or another based on the best linear combination of predictors, estimated using maximum-likelihood estimation
– Probability of Y is calculated based on the following formula:

$P(Y) = \hat{Y} = \dfrac{e^u}{1 + e^u}$  (Formula 1)

where
$u = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$
$e$ = the base of natural logarithms
$p$ = the number of predictor variables

– With one predictor variable, the formula becomes:

$P(Y) = \hat{Y}_i = \dfrac{e^{b_0 + b_1 X_1}}{1 + e^{b_0 + b_1 X_1}}$

– With multiple predictor variables (p), the formula becomes:

$P(Y) = \hat{Y}_i = \dfrac{e^{b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p}}{1 + e^{b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p}}$
– The resulting value from the above computation (a probability) ranges between 0 and 1
:: A value close to 0 means Y is very unlikely to occur
:: A value close to 1 means Y is very likely to occur
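For concreteness, here is a minimal Python sketch of Formula 1; the coefficient and predictor values are made up for illustration and are not from any data set in these notes:

```python
import numpy as np

def predicted_probability(b0, b, X):
    """Formula 1: P(Y) = e^u / (1 + e^u), with u = b0 + b1*X1 + ... + bp*Xp."""
    u = b0 + np.dot(b, X)              # linear combination of the predictors
    return np.exp(u) / (1 + np.exp(u))

# Illustrative coefficients and predictor values (assumed, not estimated)
p_hat = predicted_probability(b0=-1.5, b=np.array([0.8, 0.3]), X=np.array([2.0, 1.0]))
print(round(p_hat, 3))                 # near 1 => Y likely; near 0 => Y unlikely
```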
– Can outcome be predicted from a set of predictor
variables?
– Which predictor variables predict the outcome?
– How strong is the relationship between outcome
and the predictor variables?
Assessing the Model
Assessing the Predictor
Relationship between Predictors and Outcome
Odds Ratio
Classification of Cases
– Use the observed and predicted values of the outcome to assess the fit of the model.
– The statistic used to measure the fit of the model is
called log-likelihood:
$\text{Log-likelihood} = \sum_{i=1}^{N} \left[ Y_i \ln(\hat{Y}_i) + (1 - Y_i) \ln(1 - \hat{Y}_i) \right]$  (Formula 2)
– The log-likelihood is the summation of probabilities
associated with the predicted and actual outcomes
– This log-likelihood statistic is comparable to residual
sum of squares (SSE) in multiple regression
– Log-likelihood will be calculated for two different
models (bigger and smaller)
– The two models are compared by computing the difference in their log-likelihoods using a chi-square (χ²) test:

$\chi^2 = 2\left[ LL(B) - LL(0) \right]$  (Formula 3)
– LL(B) is log-likelihood for the bigger model which
includes all the predictors
– LL(0) is log-likelihood for the smaller model which
includes only the intercept
– degrees of freedom (df) = $k_B - k_0$, where k is the number of parameters
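A small Python sketch of Formulas 2 and 3 (using scipy for the χ² p-value); the outcomes and fitted probabilities below are made up for illustration:

```python
import numpy as np
from scipy import stats

def log_likelihood(y, y_hat):
    """Formula 2: sum over all N cases of Y*ln(Yhat) + (1 - Y)*ln(1 - Yhat)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# LL(0): the intercept-only model predicts the base rate for every case
y = np.array([1, 1, 0, 1, 0])
ll_0 = log_likelihood(y, np.full(y.shape, y.mean()))

# LL(B): illustrative fitted probabilities from a bigger model (assumed)
ll_b = log_likelihood(y, np.array([0.85, 0.70, 0.20, 0.75, 0.40]))

# Formula 3: chi-square test of the improvement; df = kB - k0 (one predictor here)
chi2 = 2 * (ll_b - ll_0)
print(f"chi2 = {chi2:.3f}, p = {stats.chi2.sf(chi2, df=1):.3f}")
```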
– Test the null hypothesis that βi = 0
– Test the individual contribution of predictor variables
using Wald statistic
– The Wald statistic is comparable to the t-test in multiple regression
– Wald statistic is the squared ratio of the
unstandardized logistic coefficient to its standard
error.
– The Wald statistic and its corresponding p probability
level is part of SPSS output in the "Variables in the
Equation" table.
– A number of statistics can be used as measures of
association between predictors and outcome
– The measures include:
1. R-Statistic
2. Cox and Snell R2
3. Nagelkerke R2
4. Hosmer and Lemeshow’s R2
– R-statistic is comparable to multiple correlation
coefficient
– Formula:

$R = \sqrt{\dfrac{\text{Wald} - (2 \times df)}{-2\,LL(0)}}$  (Formula 4)
– The R-statistic ranges between −1 and +1
– A positive value indicates that as the predictor increases, the likelihood of the outcome occurring increases; a negative value indicates the reverse
– $R^2_{CS}$ is comparable to R² in multiple regression
– The value is displayed in the SPSS Logistic Regression output
– Formula:

$R^2_{CS} = 1 - e^{\left[-\frac{2}{n}\left(LL(B) - LL(0)\right)\right]}$  (Formula 5)
– However, the value of $R^2_{CS}$ never reaches its theoretical maximum of 1
– Nagelkerke suggested an amendment to $R^2_{CS}$
– The value is displayed in the SPSS Logistic Regression output
– Formula:

$R^2_N = \dfrac{R^2_{CS}}{1 - e^{\left[\frac{2\,LL(0)}{n}\right]}}$  (Formula 6)
– Formula to calculate Hosmer and Lemeshow's $R^2_L$:

$R^2_L = \dfrac{-2\,LL(B)}{-2\,LL(0)}$  (Formula 7)
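The two R² measures from Formulas 5 and 6 can be sketched in a few lines of Python; the log-likelihoods and n below are illustrative values chosen to land near the ski-run results reported later in these notes (R²_CS ≈ .165, R²_N ≈ .22):

```python
import numpy as np

def cox_snell_r2(ll_b, ll_0, n):
    """Formula 5: R2_CS = 1 - exp(-(2/n) * (LL(B) - LL(0)))."""
    return 1 - np.exp(-(2 / n) * (ll_b - ll_0))

def nagelkerke_r2(ll_b, ll_0, n):
    """Formula 6: rescale R2_CS by its maximum so the result can reach 1."""
    return cox_snell_r2(ll_b, ll_0, n) / (1 - np.exp((2 / n) * ll_0))

ll_b, ll_0, n = -8.74, -10.10, 15              # illustrative values
print(round(cox_snell_r2(ll_b, ll_0, n), 3))   # ~0.166
print(round(nagelkerke_r2(ll_b, ll_0, n), 3))  # ~0.224
```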
– Odds ratio is an indicator of the change in odds
resulting from a unit change in the predictor
– The odds ratio is the increase (or decrease if the
ratio is less than 1) in odds of being in one outcome
category when the predictor increases by one unit.
– It is similar to b-coefficient but is easier to interpret
(it does not involve logarithmic transformation)
– The odds of an event occurring are defined as the probability of the event occurring divided by the probability of the event not occurring
$\text{Odds} = \dfrac{P(\text{event})}{P(\text{no event})}$  (Formula 8)
– The coefficients (b) are the natural logs of the odds
ratio, thus odds ratio can be calculated using the
following formula:
$\text{Odds ratio} = e^b$  (Formula 9)
– Odds ratio indicates the change in odds resulting
from a unit change in the predictor
– Odds ratio > 1
Predictor ↑, Probability of outcome occurring ↑
– Odds ratio < 1
Predictor ↑, Probability of outcome occurring ↓
Example
Dummy variable: Gender (1=Male, 0=Female)
– If the odds ratio = 1.25
1.25 − 1.0 = .25
Males have 25% higher odds than Females
– If the odds ratio = .80
.80 − 1.0 = −.20
Odds for Males are 20% lower than for Females
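The gender example can be checked with Formula 9; the coefficient values below are assumed, chosen so that e^b reproduces the odds ratios in the example:

```python
import numpy as np

# Formula 9: odds ratio = e^b, where b is the logistic coefficient for the dummy
b = 0.223                      # assumed coefficient for Gender
print(round(np.exp(b), 2))     # 1.25 -> Males have 25% higher odds

b = -0.223
print(round(np.exp(b), 2))     # 0.80 -> odds for Males are 20% lower
```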
– One method of assessing the success of a model is to evaluate its ability to predict the outcome correctly
– The cut-off value for classification is .50
– A case is assigned to category “1” if the model predicts an outcome probability greater than .5
i.e. Y = 1 if Ŷ > .5
Y = 0 if Ŷ ≤ .5
– SPSS provides:
1. Percentage of correctly classified category “1” cases
2. Percentage of correctly classified category “0” cases
3. Overall percentage correct
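A minimal sketch of the classification step, producing the same three percentages SPSS reports; the observed outcomes and predicted probabilities are made up:

```python
import numpy as np

def classification_summary(y, y_hat, cutoff=0.5):
    """Classify at the .50 cut-off: Y = 1 if Yhat > .5, otherwise Y = 0."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    pred = (y_hat > cutoff).astype(int)
    pct_1 = 100 * np.mean(pred[y == 1] == 1)   # correctly classified "1" cases
    pct_0 = 100 * np.mean(pred[y == 0] == 0)   # correctly classified "0" cases
    overall = 100 * np.mean(pred == y)         # overall percentage correct
    return pct_1, pct_0, overall

# Made-up observed outcomes and model probabilities
print(classification_summary(y=[1, 0, 1, 1, 0], y_hat=[0.81, 0.32, 0.44, 0.66, 0.58]))
```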
1. Enter
– All variables entered simultaneously
2. Sequential/Hierarchical
– Variables entered in blocks
– Blocks should be based on past research or
theory being tested
3. Stepwise
– Variables entered on the basis of statistical criteria (their relative contribution to predicting the outcome)
– Should be employed only for exploratory analysis
(From Tabachnick)
The following data set includes three variables:
1. FALL
0 - Not falling
1 - Falling
2. DIFFICULTY
Rated on 1 to 3 scale
3. SEASON
1 - autumn
2 - winter
3 - spring
Data set:
Fall  Difficulty  Season
1     3           1
1     1           1
0     1           3
1     2           3
1     3           2
0     2           2
0     1           2
1     3           1
1     2           3
1     2           1
0     2           2
0     2           3
1     3           2
1     2           2
0     3           1
Data: Logistic Regression Tabachnick SKI
1. Model Fit
1.776(1.010)( DIFF )(0.928)( SEAS1)(0.418)( SEAS 2)
e
Prob(Fall )  Yˆi 
1 e 1.776(1.010)( DIFF )(0.928)( SEAS1)(0.418)( SEAS 2)

Log - likelihood  
N

Yi ln (Yˆi )  (1Yi ) ln (1Yi )
Formula 1
Formula 2
i1
 2  2LL(B)  LL(0)
Formula 3
Excel Computation
2. Significance of Predictors and Odds Ratio

$\text{Wald} = \left(\dfrac{b}{SE(b)}\right)^2$
Excel Computation
3. Relationship between Predictors and Outcome

$R^2_{CS} = 1 - e^{\left[-\frac{2}{n}\left(LL(B) - LL(0)\right)\right]}$  (Formula 5)

$R^2_N = \dfrac{R^2_{CS}}{1 - e^{\left[\frac{2\,LL(0)}{n}\right]}}$  (Formula 6)
4. Classification of Cases
Table 1: Logistic Regression Analysis of Falling on a Ski
Run as a Function of Difficulty of Run and Season
Variables    B        Wald Test   p      Odds ratio
Constant     -1.776   0.88        .347   .169
Difficulty   1.010    1.27        .259   2.747
Season(1)    .927     0.34        .560   2.527
Season(2)    -.418    0.09        .763   .658

Note: R² = .165 (Cox & Snell), .227 (Nagelkerke)
Model χ²(3) = 2.710, p = .439
One may also want to report CIs for the odds ratios
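For readers working outside SPSS, the ski-run analysis can be approximated in Python with statsmodels; the indicator coding of SEASON below (spring as the reference category) is an assumption made to match the SEAS1/SEAS2 dummies in the fitted equation, so the coefficients should come out close to Table 1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# The 15 cases from the ski data set above
df = pd.DataFrame({
    "FALL":   [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
    "DIFF":   [3, 1, 1, 2, 3, 2, 1, 3, 2, 2, 2, 2, 3, 2, 3],
    "SEASON": [1, 1, 3, 3, 2, 2, 2, 1, 3, 1, 2, 3, 2, 2, 1],
})
# Dummy-code SEASON with spring (3) as the assumed reference category
df["SEAS1"] = (df["SEASON"] == 1).astype(int)   # autumn
df["SEAS2"] = (df["SEASON"] == 2).astype(int)   # winter

X = sm.add_constant(df[["DIFF", "SEAS1", "SEAS2"]])
model = sm.Logit(df["FALL"], X).fit(disp=0)     # maximum-likelihood fit
print(model.summary())                          # compare B and Wald to Table 1
print(np.exp(model.params))                     # odds ratios (Formula 9)
```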
Data: Logistic Regression PERFORM
(Adapted from Andy)
Variable    Label/Value
PERFORM     Performance in Subject (1 = No, 2 = Yes)
INTEREST    Interest in the Subject (1 = No, 2 = Yes)
AGE         Age in years
Andy Field (2005). Discovering Statistics Using SPSS. 2nd Edition. London: Sage Publications Ltd.
Table 2: Logistic Regression Analysis of Performance
as a Function of Interest and Age
Variables   B   Wald Test   p   Odds ratio
Constant
Interest
Age

Note: R² = ___ (Cox & Snell), ___ (Nagelkerke)
Model χ²(_) = ___, p = ___
(From Tabachnick)
Variable    Label                               Value
WorkStatus  Work status                         1 = Working, 2 = Housewives
Children    Presence of children                1 = No, 2 = Yes
Control     Locus of control
AttMar      Attitudes toward current marriage
AttHouse    Attitudes toward housework
AttRole     Attitudes toward role of women
Age         Age group
Educ        Years of education
Data: Logistic Regression Tabachnick WORK STATUS
Table 3: Logistic Regression Analysis of Work Status
as a Function of Attitudinal Variables
Variables                         B   Wald Test   p   Odds ratio
Constant
Locus of control
Attitude towards marital status
Attitude towards role of women
Attitude towards housework

Note: R² = ___ (Cox & Snell), ___ (Nagelkerke)
Model χ²(_) = ___, p = ___
Table 4: Logistic Regression Analysis of Work Status as
a Function of Attitudinal Variables and Children

Variables                         B   Wald Test   p   Odds ratio
Constant
Presence of children
Locus of control
Attitude towards marital status
Attitude towards role of women
Attitude towards housework

Note: R² = ___ (Cox & Snell), ___ (Nagelkerke)
Model χ²(_) = ___, p = ___