Logistic Regression 2
Sociology 8811 Lecture 7
Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission
Stata Notes: Logistic Regression
• Stata has two commands: “logit” & “logistic”
– Logit, by default, produces raw coefficients
– Logistic, by default, produces odds ratios
• It exponentiates all coefficients for you!
• Note: Both yield identical results
– The following pairs of commands are identical
– For raw coefficients:
• logit gun male educ income south liberal
• logistic gun male educ income south liberal, coef
– And for odds ratios:
• logit gun male educ income south liberal, nocoef
• logistic gun male educ income south liberal
Review: Interpreting Coefficients
• Raw Coefficients: Change in log odds per
unit change in X
• Show direction
• Magnitude is hard to interpret
• Odds Ratios: Multiplicative change in odds
per unit change in X
• OR > 1 = positive effect, OR < 1 = negative
• Operates multiplicatively. Effect of 2-point change is
found by multiplying twice
• Percentage change in odds per unit change
• (OR − 1) × 100%
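As a quick illustration of these rules (using a hypothetical coefficient b = 0.69, not one from the lecture's model), the odds-ratio arithmetic can be checked directly:

```python
import math

# Hypothetical raw logit coefficient, chosen for illustration only
b = 0.69

# Odds ratio: multiplicative change in odds per 1-unit change in X
odds_ratio = math.exp(b)

# A 2-point change in X multiplies the odds twice over
two_unit_change = odds_ratio * odds_ratio   # same as math.exp(2 * b)

# Percentage change in odds per unit change: (OR - 1) * 100%
pct_change = (odds_ratio - 1) * 100

print(round(odds_ratio, 2))       # 1.99: odds roughly double per unit of X
print(round(two_unit_change, 2))  # 3.97: two units roughly quadruple the odds
print(round(pct_change, 1))       # 99.4: about a 99% increase in odds
```

Note that the two-unit effect multiplies (never adds): doubling the odds twice quadruples them.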
Review: Interpreting Results
• Important point: Substantive effect of a
variable on predicted probability differs
depending on values of other variables
• If probability is already high for a given case, additional
increases may not have much effect
– Suppose a 1-point change in X doubles the odds…
• Effect isn’t substantively consequential if probability
(Y=1) is already very high
– Ex: 20:1 odds = .95 probability; 40:1 odds = .975 probability
– Change in probability is only .025
• Effect matters a lot for cases with probabilities near .5
– 1:1 odds = .5 probability. 2:1 odds = .67 probability
– Change in probability is nearly .2!
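The odds-to-probability conversions above are easy to verify, using p = odds / (1 + odds):

```python
def odds_to_prob(odds):
    # Convert odds to a probability: p = odds / (1 + odds)
    return odds / (1 + odds)

# Doubling already-long odds barely moves the probability...
print(round(odds_to_prob(20), 3))  # 0.952
print(round(odds_to_prob(40), 3))  # 0.976

# ...but doubling 1:1 odds moves it by nearly .17
print(round(odds_to_prob(1), 3))   # 0.5
print(round(odds_to_prob(2), 3))   # 0.667
```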
Review: Interpreting Results
• Predicted values for real (or hypothetical) cases
can vividly illustrate findings
• Stata “Adjust” command is very useful
• Example: Probabilities for men/women
. adjust, pr by(male)
-----------------------------------------------------------------
     Dependent variable: gun     Command: logistic
     Variables left as is: educ, income, south, liberal
-----------------------------------------------------------------

----------------------
     male |         pr
----------+-----------
        0 |    .225814
        1 |    .417045
----------------------

Note that the predicted probability for men is nearly twice as high as for women.
Stata Notes: Adjust Command
• Stata “adjust” command can be tricky
– 1. By default it uses the entire sample, not just
cases in your prior analysis
• Best to specify prior sample:
• adjust if e(sample), pr by(male)
– 2. For non-specified variables, Stata uses group
means (defined by the “by()” option)
• Don’t assume it pegs cases to overall sample mean
• Variables “left as is” take on mean for subgroups
– 3. It doesn’t take into account weighted data
• Use “lincom” if you have weighted data
Predicted Probabilities: Stata
• Effect of pol views & gender for PhD students
. adjust south=0 income=4 educ=20, pr by(liberal male)
------------------------------------------------------------
     Dependent variable: gun     Command: logistic
     Covariates set to value: south = 0, income = 4, educ = 20
------------------------------------------------------------

----------------------------
          |       male
  liberal |       0        1
----------+-----------------
        1 | .046588  .096652
        2 | .039818  .083241
        3 | .033996  .071544
        4 |     .029   .06138
        5 | .024719  .052578
        6 | .021057  .044978
        7 | .017927  .038433
----------------------------

Note that independent variables are set to values of interest (or can be set to means).
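These adjusted probabilities can be reproduced by hand: a sketch that plugs the covariate values into p = 1 / (1 + e^(−Xb)), using the raw coefficients reported in the Stata output later in this lecture:

```python
import math

# Raw logit coefficients from the fitted gun-ownership model
b = {"_cons": -2.28572, "male": 0.7837017, "educ": -0.0767763,
     "income": 0.2416647, "south": 0.7363169, "liberal": -0.1641107}

def prob_gun(male, liberal, educ=20, income=4, south=0):
    """Predicted probability of gun ownership: p = 1 / (1 + exp(-Xb))."""
    xb = (b["_cons"] + b["male"] * male + b["educ"] * educ +
          b["income"] * income + b["south"] * south + b["liberal"] * liberal)
    return 1 / (1 + math.exp(-xb))

# Matches the first row of the adjust table (liberal = 1):
print(round(prob_gun(male=0, liberal=1), 4))  # 0.0466
print(round(prob_gun(male=1, liberal=1), 4))  # 0.0967
```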
Graphing Predicted Probabilities
• P(Y=1) for Women & Men by Liberal
• scatter Women Men Liberal, c(l l)

[Figure: predicted probability (.02 to .1) plotted against Liberal (0 to 8), with connected lines for Women and Men]
Did model categorize cases correctly?
• We can choose a criterion, e.g. predicted P > .5:

. estat clas

              -------- True --------
Classified |         D           ~D |      Total
-----------+----------------------+-----------
         + |        64           48 |        112
         - |       229          509 |        738
-----------+----------------------+-----------
     Total |       293          557 |        850

Classified + if predicted Pr(D) >= .5
True D defined as gun != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   21.84%
Specificity                     Pr( -|~D)   91.38%
Positive predictive value       Pr( D| +)   57.14%
Negative predictive value       Pr(~D| -)   68.97%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    8.62%
False - rate for true D         Pr( -| D)   78.16%
False + rate for classified +   Pr(~D| +)   42.86%
False - rate for classified -   Pr( D| -)   31.03%
--------------------------------------------------
Correctly classified                        67.41%

The model yields predicted p > .5 for 112 people; only 64 of them actually have guns. Overall, this simple model doesn’t offer extremely accurate predictions: 67% of people are correctly classified. Note: Results change if you use a different criterion (e.g., p > .6).
Sensitivity / Specificity of Prediction
• Sensitivity: Of gun owners, what proportion
were correctly predicted to own a gun?
• Specificity: Of non-gun owners, what
proportion did we correctly predict?
• Choosing a different probability cutoff affects
those values
• If we reduce the cutoff to P > .4, we’ll catch a higher
proportion of gun owners (higher sensitivity)
• But we’ll also incorrectly classify more non-gun owners
as owners, i.e., more false positives (lower specificity).
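For reference, sensitivity and specificity are simple functions of the classification-table cell counts; a quick check using the counts from the estat clas output above:

```python
# Cell counts from the estat clas table (cutoff P >= .5)
tp, fp = 64, 48    # classified +: true gun owners, non-owners
fn, tn = 229, 509  # classified -: gun owners missed, non-owners correct

sensitivity = tp / (tp + fn)                # Pr(+ | D)
specificity = tn / (tn + fp)                # Pr(- | ~D)
correct = (tp + tn) / (tp + fp + fn + tn)   # overall hit rate

print(f"{sensitivity:.2%}")  # 21.84%
print(f"{specificity:.2%}")  # 91.38%
print(f"{correct:.2%}")      # 67.41%
```

Lowering the cutoff moves cases from the “−” row to the “+” row, raising sensitivity and lowering specificity.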
Sensitivity / Specificity of Prediction
• Stata can produce a plot showing how
predictions change as we vary the “P” cutoff
• Stata command: lsens

[Figure: sensitivity and specificity (0.00 to 1.00) plotted against probability cutoff (0.00 to 1.00)]
Hypothesis tests
• Testing hypotheses using logistic regression
• H0: There is no effect of year in grad program on coffee
drinking
• H1: Year in grad school is associated with coffee
– Or, one-tail test: Year in school increases probability of coffee
– MLE estimation yields standard errors… like OLS
– Test statistic: 2 options; both yield same results
• z = b/SE… just like t in OLS regression (Stata labels it z)
• Wald test (chi-square, 1 df); essentially the square of z
– Reject H0 if Wald or z exceeds the critical value
• Or if p-value less than alpha (usually .05).
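A quick check of the b/SE test statistic, using the “male” coefficient and standard error from the gun-ownership model shown later in this lecture:

```python
from statistics import NormalDist

# Coefficient and SE for "male" from the fitted gun model
b, se = 0.7837017, 0.156764

z = b / se      # test statistic, analogous to t in OLS
wald = z ** 2   # Wald chi-square (1 df) is the square of z

# Two-tailed p-value from the standard normal distribution
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2))     # 5.0, matching the Stata output
print(round(wald, 1))  # 25.0
print(p < .05)         # True: reject H0
```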
Model Fit: Likelihood Ratio Tests
• MLE computes a likelihood for the model
• “Better” models have higher likelihoods
• Log likelihood is typically a negative value, so “better”
means a less negative value… -100 > -1000
• Log likelihood ratio test: Allows comparison of
any two nested models
• One model must be a subset of vars in other model
– You can’t compare totally unrelated models!
• Models must use the exact same sample.
Model Fit: Likelihood Ratio Tests
• Default LR test comparison: Current model
versus “null model”
• Null model = only a constant; no covariates; K=0
• Also useful: Compare small & large model
• Do added variables (as a group) fit the data better?
– Ex: Suppose a theory suggests 4 psychological
variables will have an important effect…
• We could use LR test to compare “base model” to model
with 4 additional variables.
• STATA: Run first model; “store” estimates; run second
model; use stata command “lrtest” to compare models
Model Fit: Likelihood Ratio Tests
• Likelihood ratio test is based on the G-square
• Chi-square distributed; df = K1 – K0
• K = # variables; K1 = full model, K0 = simpler model
• L1 = likelihood for full model; L0 = simpler model
G² = −2 ln(L0 / L1) = (−2 ln L0) − (−2 ln L1)
• Significant likelihood ratio test indicates that
the larger model (L1) is an improvement
• G2 > critical value; or p-value < .05.
Model Fit: Likelihood Ratio Tests
• Stata’s default LR test; compares to null model
. logistic gun male educ income south liberal, coef
Logistic regression                             Number of obs   =        850
                                                LR chi2(5)      =      89.53
                                                Prob > chi2     =     0.0000
Log likelihood = -502.7251                      Pseudo R2       =     0.0818

------------------------------------------------------------------------------
         gun |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .7837017    .156764     5.00   0.000     .4764499    1.090954
        educ |  -.0767763   .0254047    -3.02   0.003    -.1265686     -.026984
      income |   .2416647   .0493794     4.89   0.000     .1448828    .3384466
       south |   .7363169   .1979038     3.72   0.000     .3484327    1.124201
     liberal |  -.1641107   .0578167    -2.84   0.005    -.2774294    -.0507921
       _cons |   -2.28572   .6200443    -3.69   0.000    -3.500984   -1.070455
------------------------------------------------------------------------------

Model log likelihood = -502.7; the null model has a lower (more negative) value.
LR chi2(5) is the G-square, with 5 degrees of freedom.
Prob > chi2 is a p-value; p < .05 indicates a significantly better model.
Model Fit: Likelihood Ratio Tests
• Example: Null model log likelihood: -547.5;
Full model: -502.7
• 5 new variables, so K1 – K0 = 5.
G² = −2 ln(L0 / L1) = (−2 ln L0) − (−2 ln L1)

G² = (−2 × −547.5) − (−2 × −502.7) = 89.6 (Stata’s unrounded value is 89.53)

• According to the χ² table (df = 5, α = .05), critical value = 11.07
• Since 89.5 greatly exceeds 11.07, we are confident that
the full model is an improvement
• Also, observed p-value in STATA output is .000!
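A sketch of the same calculation (note that the rounded log likelihoods give 89.6, while Stata’s unrounded values give 89.53):

```python
ll_null = -547.5   # log likelihood, constant-only model (from the example)
ll_full = -502.7   # log likelihood, model with 5 added predictors

# G-square = -2 ln(L0/L1) = (-2 ln L0) - (-2 ln L1)
g2 = (-2 * ll_null) - (-2 * ll_full)
print(round(g2, 1))  # 89.6

crit = 11.07  # chi-square critical value, df = 5, alpha = .05
print(g2 > crit)  # True: the full model is a significant improvement
```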
Model Fit: Pseudo R-Square
• Pseudo R-square
• “A descriptive measure that indicates roughly the
proportion of observed variation accounted for by the…
predictors.” (Knoke et al., p. 313)
Logistic regression                             Number of obs   =        850
                                                LR chi2(5)      =      89.53
                                                Prob > chi2     =     0.0000
Log likelihood = -502.7251                      Pseudo R2       =     0.0818

------------------------------------------------------------------------------
         gun | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   2.189562   .3432446     5.00   0.000     1.610347    2.977112
        educ |    .926097   .0235272    -3.02   0.003     .8811137    .9733768
      income |   1.273367   .0628781     4.89   0.000     1.155904    1.402767
       south |    2.08823   .4132686     3.72   0.000     1.416845    3.077757
     liberal |    .848648    .049066    -2.84   0.005     .7577291    .9504762
------------------------------------------------------------------------------
Model explains roughly 8% of variation in Y
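Stata’s Pseudo R2 here is McFadden’s measure, 1 − (ln L1 / ln L0). It can be reproduced from the output above, recovering the null log likelihood from the reported LR chi2:

```python
ll_full = -502.7251          # Log likelihood from the output
lr_chi2 = 89.53              # LR chi2(5) from the output

# LR chi2 = 2 * (ll_full - ll_null), so ll_null can be recovered:
ll_null = ll_full - lr_chi2 / 2   # about -547.49

# McFadden's pseudo R-square
pseudo_r2 = 1 - ll_full / ll_null
print(round(pseudo_r2, 4))  # 0.0818
```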
Assumptions & Problems
• Assumption: Independent random sample
• Serial correlation or clustering violate assumptions; bias
SE estimates and hypothesis tests
• We will discuss possible remedies in the future
• Multicollinearity: High correlation among
independent variables causes problems
• Unstable, inefficient estimates
• Watch for coefficient instability, check VIF/tolerance
• Remove unneeded variables or create indexes of related
variables.
Assumptions & Problems
• Outliers/Influential cases
• Unusual/extreme cases can distort results, just like OLS
– Logistic requires different influence statistics
• Example: dbeta – very similar to OLS “Cooks D”
– Outlier diagnostics are available in STATA
• After model: “predict outliervar, dbeta”
• Lists & graphs of residuals & dbetas can identify
influential cases.
Plotting Residuals by Casenumber
• predict sresid, rstandard
• gen casenum = _n
• scatter sresid casenum

[Figure: standardized residuals (−2 to 3) plotted against casenum (0 to 3000)]
Assumptions & Problems
• Insufficient variance: You need cases for both
values of the dependent variable
• Extremely rare (or common) events can be a problem
• Suppose N=1000, but only 3 are coded Y=1
• Estimates won’t be great
– Also: Maximum likelihood estimates cannot be
computed if any independent variable perfectly
predicts the outcome (Y=1)
• Ex: Suppose Soc 8811 drives all students to drink
coffee... So there is no variation…
– In that case, you cannot include a dummy variable for taking
Soc 8811 in the model.
Assumptions & Problems
• Model specification / Omitted variable bias
• Just like any regression model, it is critical to include
appropriate variables in the model
• Omission of important factors or ‘controls’ will lead to
misleading results.
Real World Example: Coups
• Issue: Many countries face the threat of a
coup d’etat – violent overthrow of the regime
• What factors affect whether a country will have a coup?
• Paper Handout: Belkin and Schofer (2005)
• What are the basic findings?
• How much do the odds of a coup differ for
military regimes vs. civilian governments?
– b = 1.74; (e^1.74 − 1) × 100% = +470%
• What about a 2-point increase in log GDP?
– b = −.233; ((e^−.233 × e^−.233) − 1) × 100% = −37%
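These two slide calculations can be verified directly:

```python
import math

# Military regime dummy: b = 1.74
pct_military = (math.exp(1.74) - 1) * 100
print(round(pct_military))  # 470: odds of a coup about 470% higher

# Two-point increase in log GDP: b = -.233 per point, applied twice
pct_gdp = (math.exp(-0.233) * math.exp(-0.233) - 1) * 100
print(round(pct_gdp))  # -37: odds about 37% lower
```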
Real World Example
• Goyette, Kimberly and Yu Xie. 1999.
“Educational Expectations of Asian American
Youths: Determinants and Ethnic Differences.”
Sociology of Education, 72, 1:22-36.
• What was the paper about?
• What was the analysis?
• Dependent variable? Key independent variables?
• Findings?
• Issues / comments / criticisms?