Logistic Regression 2
Sociology 8811 Lecture 7
Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission

Stata Notes: Logistic Regression
• Stata has two commands: “logit” & “logistic”
  – Logit, by default, produces raw coefficients
  – Logistic, by default, produces odds ratios
    • It exponentiates all coefficients for you!
• Note: Both yield identical results
  – The following pairs of commands are identical
  – For raw coefficients:
    • logit gun male educ income south liberal
    • logistic gun male educ income south liberal, coef
  – And for odds ratios:
    • logit gun male educ income south liberal, or
    • logistic gun male educ income south liberal

Review: Interpreting Coefficients
• Raw Coefficients: Change in log odds per unit change in X
  • Show direction
  • Magnitude is hard to interpret
• Odds Ratios: Multiplicative change in odds per unit change in X
  • OR > 1 = positive effect, OR < 1 = negative
  • Operates multiplicatively. Effect of a 2-point change is found by multiplying the odds by the OR twice (OR squared)
• Percentage change in odds per unit change
  • (OR-1)*100%

Review: Interpreting Results
• Important point: Substantive effect of a variable on predicted probability differs depending on values of other variables
  • If probability is already high for a given case, additional increases may not have much effect
  – Suppose a 1-point change in X doubles the odds…
    • Effect isn’t substantively consequential if probability (Y=1) is already very high
      – Ex: 20:1 odds = .95 probability; 40:1 odds = .975 probability
      – Change in probability is only about .02
    • Effect matters a lot for cases with probabilities near .5
      – 1:1 odds = .5 probability. 2:1 odds = .67 probability
      – Change in probability is about .17!

Review: Interpreting Results
• Predicted values of real (or hypothetical) cases can vividly illustrate findings
  • Stata “adjust” command is very useful
  • Example: Probabilities for men/women

  . adjust, pr by(male)

  Dependent variable: gun     Command: logistic
  Variables left as is: educ, income, south, liberal

  ----------+-----------
       male |        pr
  ----------+-----------
          0 |   .225814
          1 |   .417045
  ----------------------

  Note that the predicted probability for men is nearly twice as high as for women.

Stata Notes: Adjust Command
• Stata “adjust” command can be tricky
  – 1. By default it uses the entire sample, not just cases in your prior analysis
    • Best to specify prior sample:
    • adjust if e(sample), pr by(male)
  – 2. For non-specified variables, Stata uses group means (groups defined by the “by” option)
    • Don’t assume it pegs cases to overall sample mean
    • Variables “left as is” take on mean for subgroups
  – 3. It doesn’t take into account weighted data
    • Use “lincom” if you have weighted data

Predicted Probabilities: Stata
• Effect of pol views & gender for PhD students

  . adjust south=0 income=4 educ=20, pr by(liberal male)

  Dependent variable: gun     Command: logistic
  Covariates set to value: south = 0, income = 4, educ = 20

  ----------+----------------------
            |        male
    liberal |        0          1
  ----------+----------------------
          1 |  .046588    .096652
          2 |  .039818    .083241
          3 |  .033996    .071544
          4 |     .029     .06138
          5 |  .024719    .052578
          6 |  .021057    .044978
          7 |  .017927    .038433
  ---------------------------------

  Note that independent variables are set to values of interest. (Or can be set to mean).

Graphing Predicted Probabilities
• P(Y=1) for Women & Men by Liberal
• scatter Women Men Liberal, c(l l)

  [Figure: predicted probability (y-axis, .02 to .1) by Liberal (x-axis, 0 to 8), with connected lines for Women and Men]

Did model categorize cases correctly?
• We can choose a criterion: predicted P > .5:

  . estat clas

               -------- True --------
  Classified |         D            ~D  |      Total
  -----------+--------------------------+-----------
       +     |        64            48  |        112
       -     |       229           509  |        738
  -----------+--------------------------+-----------
     Total   |       293           557  |        850

  Classified + if predicted Pr(D) >= .5
  True D defined as gun != 0
  --------------------------------------------------
  Sensitivity                     Pr( +| D)   21.84%
  Specificity                     Pr( -|~D)   91.38%
  Positive predictive value       Pr( D| +)   57.14%
  Negative predictive value       Pr(~D| -)   68.97%
  --------------------------------------------------
  False + rate for true ~D        Pr( +|~D)    8.62%
  False - rate for true D         Pr( -| D)   78.16%
  False + rate for classified +   Pr(~D| +)   42.86%
  False - rate for classified -   Pr( D| -)   31.03%
  --------------------------------------------------
  Correctly classified                        67.41%

  The model yields predicted p>.5 for 112 people; only 64 of them actually have guns
  Overall, this simple model doesn’t offer extremely accurate predictions… 67% of people are correctly classified
  Note: Results change if you use a different criterion (e.g., p>.6)

Sensitivity / Specificity of Prediction
• Sensitivity: Of gun owners, what proportion were correctly predicted to own a gun?
• Specificity: Of non-gun owners, what proportion did we correctly predict?
• Choosing a different probability cutoff affects those values
  • If we reduce the cutoff to P > .4, we’ll catch a higher proportion of gun owners (higher sensitivity)
  • But we’ll incorrectly classify more non-gun owners as owners
  • That is, we’ll have more false positives (lower specificity).
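The percentages reported by estat clas follow directly from the four cell counts in the classification table. A minimal Python sketch is given here purely as an arithmetic check; the function name classification_stats is hypothetical and is not a Stata command.

```python
def classification_stats(tp, fp, fn, tn):
    """Return classification statistics (in percent) from 2x2 cell counts.

    tp: classified +, true D      fp: classified +, true ~D
    fn: classified -, true D      tn: classified -, true ~D
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": 100 * tp / (tp + fn),   # Pr(+ | D)
        "specificity": 100 * tn / (tn + fp),   # Pr(- | ~D)
        "ppv": 100 * tp / (tp + fp),           # Pr(D | +)
        "npv": 100 * tn / (tn + fn),           # Pr(~D | -)
        "correct": 100 * (tp + tn) / total,    # correctly classified
    }

# Cell counts copied from the estat clas table (cutoff .5).
stats = classification_stats(tp=64, fp=48, fn=229, tn=509)
for name, value in stats.items():
    print(f"{name}: {value:.2f}%")
```

Re-running the function with the counts from a different cutoff (for example, after estat clas with another cutoff value) shows the trade-off described in the slide: sensitivity rises as specificity falls.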
Sensitivity / Specificity of Prediction
• Stata can produce a plot showing how predictions will change if we vary “P” cutoff:
• Stata command: lsens

  [Figure: Sensitivity and Specificity (y-axis, 0.00 to 1.00) plotted against Probability cutoff (x-axis, 0.00 to 1.00)]

Hypothesis tests
• Testing hypotheses using logistic regression
  • H0: There is no effect of year in grad program on coffee drinking
  • H1: Year in grad school is associated with coffee
    – Or, one-tail test: Year in school increases probability of coffee
  – MLE estimation yields standard errors… like OLS
  – Test statistic: 2 options; both yield same results
    • t = b/SE… just like OLS regression (Stata labels this statistic z)
    • Wald test (Chi-square, 1df); essentially the square of t
  – Reject H0 if Wald or t > critical value
    • Or if p-value less than alpha (usually .05).

Model Fit: Likelihood Ratio Tests
• MLE computes a likelihood for the model
  • “Better” models have higher likelihoods
  • Log likelihood is typically a negative value, so “better” means a less negative value… -100 > -1000
• Log likelihood ratio test: Allows comparison of any two nested models
  • One model must be a subset of vars in other model
    – You can’t compare totally unrelated models!
  • Models must use the exact same sample.

Model Fit: Likelihood Ratio Tests
• Default LR test comparison: Current model versus “null model”
  • Null model = only a constant; no covariates; K=0
• Also useful: Compare small & large model
  • Do added variables (as a group) fit the data better?
    – Ex: Suppose a theory suggests 4 psychological variables will have an important effect…
      • We could use LR test to compare “base model” to model with 4 additional variables.
      • STATA: Run first model; “store” estimates; run second model; use Stata command “lrtest” to compare models

Model Fit: Likelihood Ratio Tests
• Likelihood ratio test is based on the G-square
  • Chi-square distributed; df = K1 – K0
  • K = # variables; K1 = full model, K0 = simpler model
  • L1 = likelihood for full model; L0 = simpler model

  $$G^2 = -2 \ln\left(\frac{L_0}{L_1}\right) = -2 \ln L_0 + 2 \ln L_1$$

• Significant likelihood ratio test indicates that the larger model (L1) is an improvement
  • G-square > critical value; or p-value < .05.

Model Fit: Likelihood Ratio Tests
• Stata’s default LR test; compares to null model

  . logistic gun male educ income south liberal, coef

  Logistic regression                       Number of obs   =        850
                                            LR chi2(5)      =      89.53
                                            Prob > chi2     =     0.0000
  Log likelihood = -502.7251                Pseudo R2       =     0.0818

  ------------------------------------------------------------------------------
           gun |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
          male |   .7837017    .156764     5.00   0.000     .4764499    1.090954
          educ |  -.0767763   .0254047    -3.02   0.003    -.1265686    -.026984
        income |   .2416647   .0493794     4.89   0.000     .1448828    .3384466
         south |   .7363169   .1979038     3.72   0.000     .3484327    1.124201
       liberal |  -.1641107   .0578167    -2.84   0.005    -.2774294   -.0507921
         _cons |   -2.28572   .6200443    -3.69   0.000    -3.500984   -1.070455
  ------------------------------------------------------------------------------

  Model log likelihood = -502.7; the null model’s is a lower value (more negative)
  LR chi2(5) is the G-square, with 5 degrees of freedom
  Prob > chi2 is a p-value. p < .05 indicates a significantly better model

Model Fit: Likelihood Ratio Tests
• Example: Null model log likelihood: -547.5; Full model: -502.7
  • 5 new variables, so K1 – K0 = 5.

  $$G^2 = -2 \ln\left(\frac{L_0}{L_1}\right) = -2 \ln L_0 + 2 \ln L_1$$

  $$G^2 = -2(-547.5) + 2(-502.7) = 1095.0 - 1005.4 = 89.6$$

  (Stata reports 89.53, computed from the unrounded log likelihoods.)

• According to the chi-square table (df = 5, alpha = .05), crit value = 11.07
• Since G-square of about 89.5 greatly exceeds 11.07, we are confident that the full model is an improvement
• Also, observed p-value in STATA output is .000 (that is, p < .001)!
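The likelihood ratio arithmetic in the example can be reproduced in a few lines. This is a minimal Python sketch for checking the numbers, not a substitute for Stata’s lrtest; the function name lr_statistic is hypothetical, and the critical value is the 11.07 quoted in the slide rather than something computed here. Because the inputs are the rounded log likelihoods, the result is 89.6 rather than Stata’s 89.53.

```python
def lr_statistic(ll_null, ll_full):
    """G-square = -2 ln L0 + 2 ln L1, computed from the two log likelihoods."""
    return -2.0 * ll_null + 2.0 * ll_full

ll_null = -547.5         # null model log likelihood (rounded, from the example)
ll_full = -502.7         # full model log likelihood (rounded, from the example)
df = 5                   # K1 - K0: five variables added to the null model
critical_value = 11.07   # chi-square critical value for df = 5, alpha = .05

g2 = lr_statistic(ll_null, ll_full)
print(f"G-square = {g2:.1f} on {df} df")
print("Reject H0 (full model fits better):", g2 > critical_value)
```

The same function applies to any pair of nested models estimated on the same sample: pass the log likelihood of the smaller model first and set df to the number of added variables.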
Model Fit: Pseudo R-Square
• Pseudo R-square
  • “A descriptive measure that indicates roughly the proportion of observed variation accounted for by the… predictors.” Knoke et al, p. 313

  Logistic regression                       Number of obs   =        850
                                            LR chi2(5)      =      89.53
                                            Prob > chi2     =     0.0000
  Log likelihood = -502.7251                Pseudo R2       =     0.0818

  ------------------------------------------------------------------------------
           gun | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
          male |   2.189562   .3432446     5.00   0.000     1.610347    2.977112
          educ |    .926097   .0235272    -3.02   0.003     .8811137    .9733768
        income |   1.273367   .0628781     4.89   0.000     1.155904    1.402767
         south |    2.08823   .4132686     3.72   0.000     1.416845    3.077757
       liberal |    .848648    .049066    -2.84   0.005     .7577291    .9504762
  ------------------------------------------------------------------------------

  Model explains roughly 8% of variation in Y

Assumptions & Problems
• Assumption: Independent random sample
  • Serial correlation or clustering violate assumptions; bias SE estimates and hypothesis tests
  • We will discuss possible remedies in the future
• Multicollinearity: High correlation among independent variables causes problems
  • Unstable, inefficient estimates
  • Watch for coefficient instability, check VIF/tolerance
  • Remove unneeded variables or create indexes of related variables.

Assumptions & Problems
• Outliers/Influential cases
  • Unusual/extreme cases can distort results, just like OLS
    – Logistic requires different influence statistics
      • Example: dbeta, very similar to OLS “Cook’s D”
    – Outlier diagnostics are available in STATA
      • After model: “predict outliervar, dbeta”
      • Lists & graphs of residuals & dbetas can identify influential cases.
Plotting Residuals by Casenumber
• predict sresid, rstandard
• gen casenum = _n
• scatter sresid casenum

  [Figure: standardized residuals (y-axis, -2 to 3) plotted against casenum (x-axis, 0 to 3000)]

Assumptions & Problems
• Insufficient variance: You need cases for both values of the dependent variable
  • Extremely rare (or common) events can be a problem
  • Suppose N=1000, but only 3 are coded Y=1
  • Estimates won’t be great
  – Also: Maximum likelihood estimates cannot be computed if any independent variable perfectly predicts the outcome (Y=1)
    • Ex: Suppose Soc 8811 drives all students to drink coffee... So there is no variation…
      – In that case, you cannot include a dummy variable for taking Soc 8811 in the model.

Assumptions & Problems
• Model specification / Omitted variable bias
  • Just like any regression model, it is critical to include appropriate variables in the model
  • Omission of important factors or ‘controls’ will lead to misleading results.

Real World Example: Coups
• Issue: Many countries face the threat of a coup d’etat (violent overthrow of the regime)
  • What factors affect whether a country will have a coup?
• Paper Handout: Belkin and Schofer (2005)
  • What are the basic findings?
  • How much do the odds of a coup differ for military regimes vs. civilian governments?
    – b=1.74; (e^1.74 - 1)*100% = +470%
  • What about a 2-point increase in log GDP?
    – b=-.233; ((e^-.233 * e^-.233) - 1)*100% = -37%

Real World Example
• Goyette, Kimberly and Yu Xie. 1999. “Educational Expectations of Asian American Youths: Determinants and Ethnic Differences.” Sociology of Education, 72, 1:22-36.
• What was the paper about?
• What was the analysis?
• Dependent variable? Key independent variables?
• Findings?
• Issues / comments / criticisms?
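
The percentage-change calculations in the coup example (b = 1.74 for military regimes, b = -.233 for log GDP) can be checked with a short Python sketch. The function name pct_change_in_odds and its units argument are hypothetical, introduced only for this illustration; the coefficients are the ones quoted in the slide.

```python
import math

def pct_change_in_odds(b, units=1):
    """Percentage change in the odds for a `units`-point change in X.

    exp(b) is the odds ratio. A change of several units multiplies the
    odds by that ratio once per unit, which equals exp(b * units).
    """
    return (math.exp(b * units) - 1) * 100

military = pct_change_in_odds(1.74)            # military vs. civilian regime
log_gdp = pct_change_in_odds(-0.233, units=2)  # 2-point increase in log GDP

print(f"Military regime: {military:+.0f}% change in odds")
print(f"Log GDP (+2):    {log_gdp:+.0f}% change in odds")
```

Multiplying the exponent by the number of units is the same as multiplying the odds ratio by itself, which is why the 2-point effect in the slide is written as e^-.233 * e^-.233.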