Logistic Regression
stat 557, Heike Hofmann

Outline
• Deviance residuals
• Logistic regression:
  • interpretation of estimates
  • inference & confidence intervals
  • model checking by grouping
  • scores

Deviance
• Deviance is used for model comparisons: -2 * the difference in log-likelihood of two models.
• Model M vs. the saturated model: the deviance of M, the lack of fit of model M.
• Model M vs. the null model (intercept only): the null deviance, the ‘improvement’ of model M.
• Models M1 and M2, where M1 is nested within M2: is M1 sufficient?

For models M1 and M2, the difference in deviance is

  D(M_1) - D(M_2) = -2\left[ L(\hat\theta_1; y) - L(\hat\theta_2; y) \right]
                  = 2\sum_i \left[ y_i(\hat\theta_{i2} - \hat\theta_{i1}) + b(\hat\theta_{i1}) - b(\hat\theta_{i2}) \right] / a(\phi).

For many models, a(φ) has the form φ/ω_i, yielding for the deviance of two models:

  D(M_1) - D(M_2) = 2\sum_i \omega_i \left[ y_i(\hat\theta_{i2} - \hat\theta_{i1}) + b(\hat\theta_{i1}) - b(\hat\theta_{i2}) \right] / \phi.

Using the full (saturated) model for M2, we get residuals for y_i by taking D = \sum_i d_i, where the individual deviance contribution is

  d_i = 2\omega_i \left[ y_i(\hat\theta_{i2} - \hat\theta_{i1}) + b(\hat\theta_{i1}) - b(\hat\theta_{i2}) \right].

The deviance residual for observation i is \sqrt{d_i}\,\mathrm{sign}(y_i - \hat{y}_i).

Residuals
• Deviance residuals: \sqrt{d_i}\,\mathrm{sign}(y_i - \hat\mu_i)
• Pearson residuals: e_i = (y_i - \hat\mu_i) / \sqrt{\mathrm{Var}(y_i)}
• Both sets of residuals under-estimate the variance.

Logistic Regression
• Y is binary (with levels 0, 1): E[Y] = π, Var[Y] = π(1 − π).
• For ease of discussion: p = 1, X is the covariate.
• g(E[Y_i]) = α + β x_i with link g(π) = log(π / (1 − π)), i.e.
  π(x) = exp(α + β x) / (1 + exp(α + β x)).
• We already know: for binary x, β is the log odds ratio.

[Figure: logistic curves π(x) for β > 0 (increasing) and β < 0 (decreasing)]

Summary of previous findings (a simulated sketch tying these quantities together follows below):
• For β > 0, π(x) increases in x; for β < 0, π(x) decreases in x.
• Since π(x)/(1 − π(x)) = exp(α + βx) = exp(α) exp(βx), the odds multiply by exp(β) for every unit increase in X, i.e. a (exp(β) − 1) · 100% change in the odds.
• exp(β) is an odds ratio: the ratio of the odds at x + 1 versus the odds at x.
• The curve has its steepest slope at π = 0.5, i.e. at the inflection point x = −α/β; that slope is β/4.
• x = −α/β is called the median effective level EL50 or the median lethal dose LD50.
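A minimal sketch of these quantities (simulated data; all variable names here are illustrative, not from the lecture): fit a logistic regression, read off exp(β) as an odds ratio, locate the inflection point −α/β, and extract both kinds of residuals.

set.seed(557)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-0.5 + 1.2 * x))  # true alpha = -0.5, beta = 1.2
fit <- glm(y ~ x, family = binomial())

exp(coef(fit)["x"])            # odds ratio: odds multiply by this for every unit of x
-coef(fit)[1] / coef(fit)[2]   # inflection point -alpha/beta (EL50 / LD50)

r.dev  <- residuals(fit, type = "deviance")  # sqrt(d_i) * sign(y_i - mu.hat_i)
r.pear <- residuals(fit, type = "pearson")   # (y_i - mu.hat_i) / sqrt(Var(y_i))
all.equal(sum(r.dev^2), deviance(fit))       # the deviance is the sum of the d_i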
Inference for β
In the regression model logit π(x) = α + βx, the parameter of interest is usually β, in a hypothesis test of H0: β = 0. Three approaches, each with a test statistic:

• Wald test:
  \left( \hat\beta / SE_\beta \right)^2 \sim \chi^2_1

• Likelihood ratio test:
  2\left( L(\hat\beta; y) - L(0; y) \right) \sim \chi^2_1

• Score test (evaluated at the null value β0 = 0):
  \frac{u(\beta_0)^2}{\iota(\beta_0)} = \frac{\left( \frac{\partial}{\partial\beta} L \right)^2}{-E\left[ \frac{\partial^2}{\partial\beta^2} L \right]} \sim \chi^2_1

Confidence intervals for β are then found by backwards calculation; e.g. the (1 − α) · 100% Wald CI is

  \hat\beta \pm z_{\alpha/2} SE_\beta.

Inference for π(x)
Confidence intervals for π(x0) are based on confidence intervals for logit π(x0), i.e. they are built on the scale of the link function and back-transformed. On the logit scale,

  \mathrm{var}\left( \log\frac{\hat\pi(x_0)}{1 - \hat\pi(x_0)} \right) = \mathrm{var}(\hat\alpha + x_0\hat\beta) = \mathrm{var}\,\hat\alpha + x_0^2\,\mathrm{var}\,\hat\beta + 2 x_0\,\mathrm{cov}(\hat\alpha, \hat\beta).

The Wald CI for logit π(x0) is

  \hat\alpha + x_0\hat\beta \pm z_{\alpha/2} \sqrt{ \mathrm{var}\left( \log\frac{\hat\pi(x_0)}{1 - \hat\pi(x_0)} \right) } =: (l, u),

and the (1 − α) · 100% Wald CI for π(x0) is

  \left( \frac{e^l}{1 + e^l}, \frac{e^u}{1 + e^u} \right).
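A sketch of the backwards calculation (reusing the simulated fit from the residuals sketch above; x0 = 1 is an arbitrary illustrative value):

# 95% Wald CI for beta
se <- coef(summary(fit))["x", "Std. Error"]
coef(fit)["x"] + c(-1, 1) * qnorm(0.975) * se

# CI for pi(x0): build the interval on the logit scale, then back-transform
x0 <- 1
lp <- predict(fit, newdata = data.frame(x = x0), se.fit = TRUE)  # link scale
ci <- lp$fit + c(-1, 1) * qnorm(0.975) * lp$se.fit               # (l, u)
plogis(ci)  # (e^l/(1+e^l), e^u/(1+e^u))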
Example: Happiness Data

> summary(happy)
           happy                marital           year           age            sex
 not too happy: 5629   divorced     : 6131   Min.   :1972   Min.   :18.00   female:28581
 pretty happy :25874   married      :27998   1st Qu.:1982   1st Qu.:31.00   male  :22439
 very happy   :14800   never married:10064   Median :1990   Median :43.00
 NA's         : 4717   separated    : 1781   Mean   :1990   Mean   :45.43
                       widowed      : 5032   3rd Qu.:2000   3rd Qu.:58.00
                       NA's         :   14   Max.   :2006   Max.   :89.00
                                                            NA's   :184
            degree                    finrela            health
 bachelor      : 6918   above average    : 8536   excellent:11951
 graduate      : 3253   average          :23363   fair     : 7149
 high school   :26307   below average    :10909   good     :17227
 junior college: 2601   far above average:  898   poor     : 2164
 lt high school:11777   far below average: 2438   NA's     :12529
 NA's          :  164   NA's             : 4876

Only consider the extremes: very happy and not too happy individuals.

prodplot(data=happy, ~ happy+sex, c("vspine", "hspine"), na.rm=T, subset=level==2)

[Figure: spine plot of happy by sex, almost perfect independence]

# try a model
happy.sex <- glm(happy~sex, family=binomial(), data=happy)
summary(happy.sex)

Call:
glm(formula = happy ~ sex, family = binomial(), data = happy)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6060  -1.6054   0.8027   0.8031   0.8031

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.96613    0.02075  46.551   <2e-16 ***
sexmale      0.00130    0.03162   0.041    0.967
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24053  on 20428  degrees of freedom
Residual deviance: 24053  on 20427  degrees of freedom
AIC: 24057

Number of Fisher Scoring iterations: 4

> anova(happy.sex)
Analysis of Deviance Table

Model: binomial, link: logit

Response: happy

Terms added sequentially (first to last)

     Df  Deviance Resid. Df Resid. Dev
NULL                  20428      24053
sex   1 0.0016906     20427      24053

> confint(happy.sex)
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept)  0.92557962 1.00693875
sexmale     -0.06064378 0.06332427

• The deviance difference is asymptotically χ² distributed.
• The null hypothesis of independence cannot be rejected.

Age and Happiness

qplot(age, geom="histogram", fill=happy, binwidth=1, data=happy)
qplot(age, geom="histogram", fill=happy, binwidth=1, position="fill", data=happy)

[Figures: stacked and filled histograms of age, colored by happiness level]

# research paper claims that happiness is u-shaped
happy.age <- glm(happy~poly(age,2), family=binomial(),
                 data=na.omit(happy[,c("age","happy")]))

> summary(happy.age)

Call:
glm(formula = happy ~ poly(age, 2), family = binomial(),
    data = na.omit(happy[, c("age", "happy")]))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6400  -1.5480   0.7841   0.8061   0.8707

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.96850    0.01571  61.660  < 2e-16 ***
poly(age, 2)1  6.41183    2.22171   2.886  0.00390 **
poly(age, 2)2 -7.81568    2.21981  -3.521  0.00043 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23957  on 20351  degrees of freedom
Residual deviance: 23936  on 20349  degrees of freedom
AIC: 23942

Number of Fisher Scoring iterations: 4

# effect of age
X <- data.frame(cbind(age=20:85))
X$pred <- predict(happy.age, newdata=X, type="response")
qplot(age, pred, data=X) + ylim(c(0,1))

[Figure: predicted probability against age, 20 to 85]

> anova(happy.age)
Analysis of Deviance Table

Model: binomial, link: logit

Response: happy

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev
NULL                         20351      23957
poly(age, 2)  2   20.739     20349      23936

Problems with Deviance
• If X is continuous, the deviance no longer has a χ² distribution. The assumptions are violated two-fold:
• If we regard X as categorical (with lots of categories), we might end up with a contingency table that has lots of small cells, which means that the χ² approximation does not hold.
• Increases in sample size will most likely also increase the number of distinct values of X. The corresponding contingency table then changes size, so an asymptotic distribution for the smaller contingency table does not exist.
(A code sketch of what remains valid follows below.)
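As a sketch of what remains valid (assuming the happy.age fit above is available): the difference in deviance between nested models is still asymptotically χ², because the number of extra parameters stays fixed; it is the residual deviance used as a goodness-of-fit statistic that breaks down for continuous X.

anova(happy.age, test = "Chisq")  # valid: deviance difference vs chi-square
# NOT a valid goodness-of-fit p-value when X is continuous:
pchisq(deviance(happy.age), df.residual(happy.age), lower.tail = FALSE)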
> xtabs(~happy+age, data=happy)
               age
happy            18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
  not too happy  19  93  95 105 120 119 127 123 120 126 116  91 127 102 115  91 112
  very happy     36 155 185 202 226 268 260 310 286 336 342 303 322 291 350 333 329
               age
happy            35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
  not too happy 107 123 108 110 101 107 123  94 117 104 101  85  93  87 102  94  87
  very happy    307 322 320 309 273 287 271 279 276 251 243 257 230 248 242 251 229
               age
happy            52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
  not too happy 100  86  99  95  85  70  86  83  72  70  60  71  58  66  65  72  63
  very happy    214 232 226 176 204 209 221 228 195 194 213 216 189 197 194 188 214
               age
happy            69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
  not too happy  53  43  58  48  59  56  41  57  49  38  35  27  31  20  26  24  17
  very happy    188 203 153 162 162 147 132 135 109 106  85  71  69  75  66  45  40
               age
happy            86  87  88  89
  not too happy  16  15  16  31
  very happy     32  32  20  76

Grouping might be better:

happy$agec <- cut(happy$age, breaks=c(15,10*2:9))

Model Checking by Grouping
• To get around the problems with the distribution of the deviance, group the data along the estimates, e.g. such that the groups are approximately equal in size.
• Partition the smallest n1 estimates into group 1, the second smallest batch of n2 estimates into group 2, ...
• If we assume g groups, we get the Hosmer-Lemeshow test statistic (a hand-rolled sketch follows below):

  \sum_{i=1}^{g} \frac{ \left( \sum_{j=1}^{n_i} y_{ij} - \sum_{j=1}^{n_i} \hat\pi_{ij} \right)^2 }{ \left( \sum_{j=1}^{n_i} \hat\pi_{ij} \right) \left( 1 - \sum_{j=1}^{n_i} \hat\pi_{ij} / n_i \right) } \sim \chi^2_{g-2}.

Problems with Grouping
• Different groupings might (and will) lead to different decisions w.r.t. model fit.
• Hosmer et al (1997): “A comparison of goodness-of-fit tests for the logistic regression model” (on Blackboard)
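A hand-rolled sketch of the Hosmer-Lemeshow statistic (our own helper, not a package function; y is a binary response, pi.hat are fitted probabilities, and groups are formed from quantiles of the estimates, assumed distinct):

hosmer_lemeshow <- function(y, pi.hat, g = 10) {
  brks <- quantile(pi.hat, probs = seq(0, 1, length.out = g + 1))
  grp  <- cut(pi.hat, breaks = brks, include.lowest = TRUE)  # ~equal-sized groups
  O  <- tapply(y, grp, sum)        # observed successes per group
  E  <- tapply(pi.hat, grp, sum)   # expected successes per group
  n  <- tapply(y, grp, length)     # group sizes n_i
  X2 <- sum((O - E)^2 / (E * (1 - E / n)))
  c(statistic = X2, df = g - 2,
    p.value = pchisq(X2, df = g - 2, lower.tail = FALSE))
}

# e.g. with the simulated fit from the first sketch:
# hosmer_lemeshow(y, fitted(fit), g = 10)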
Example: Alcohol and Malformation

For a categorical covariate X, the model is

  \log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta_i,

where β_i is the effect of the ith category of X on the log odds, i.e. there is one effect for each category. This means the model is overparameterized: the “last” category can be explained by the others. To make the solution unique, we have to use an additional constraint. Whenever one of the effects is fixed at zero, this is called contrast coding, a comparison of all the other effects to the baseline effect; in R this is the default. For effect coding the constraint is on the sum of the effects of a variable, Σ_i β_i = 0; in a binary variable the two effects are then the negatives of each other. Predictions and inference are independent of the specific coding and are not affected by changes in it: two models that differ only in the coding of a variable are the same in terms of statistical behavior, with exactly the same deviance and predictions, and only the individual estimates change.

Alcohol during pregnancy is believed to be associated with congenital malformation. The following data come from an observational study:
• at 3 months of pregnancy, expectant mothers were asked about their average daily alcohol consumption;
• at birth, each infant was checked for malformation.

  Alcohol  malformed  absent  P(malformed)
1  0              48   17066        0.0028
2  <1             38   14464        0.0026
3  1-2             5     788        0.0063
4  3-5             1     126        0.0079
5  ≥6              1      37        0.0263

Saturated Model

glm(formula = cbind(malformed, absent) ~ Alcohol, family = binomial())

Deviance Residuals:
[1]  0  0  0  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.87364    0.14454 -40.637   <2e-16 ***
Alcohol<1   -0.06819    0.21743  -0.314   0.7538
Alcohol1-2   0.81358    0.47134   1.726   0.0843 .
Alcohol3-5   1.03736    1.01431   1.023   0.3064
Alcohol>=6   2.26272    1.02368   2.210   0.0271 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  6.2020e+00  on 4  degrees of freedom
Residual deviance: -3.0775e-13  on 0  degrees of freedom
AIC: 28.627

Number of Fisher Scoring iterations: 4

‘Linear’ Effect, scores 1, 2, 3, 4, 5

glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals:
      1        2        3        4        5
 0.7302  -1.1983   0.9636   0.4272   1.1692

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -6.2089     0.2873 -21.612   <2e-16 ***
as.numeric(Alcohol)   0.2278     0.1683   1.353    0.176
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 4.4473  on 3  degrees of freedom
AIC: 27.074

Number of Fisher Scoring iterations: 5

‘Linear’ Effect, scores 0, 0.5, 1.5, 4, 7

glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals:
      1        2        3        4        5
 0.5921  -0.8801   0.8865  -0.1449   0.1291

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -5.9605     0.1154 -51.637   <2e-16 ***
as.numeric(Alcohol)   0.3166     0.1254   2.523   0.0116 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom
AIC: 24.576

Number of Fisher Scoring iterations: 4

Scores
• Scores of categorical variables critically influence a model (a sketch follows below).
• Usually, scores will be given by data experts.
• Various choices: e.g. midpoints of interval variables.
• Default scores are the values 1 to n.

Next:
• Logit models for nominal and ordinal response
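A sketch of how the choice of scores changes the fit, reconstructing the malformation table above and refitting both ‘linear’ models (variable names are ours):

# counts from the malformation table
malformed <- c(48, 38, 5, 1, 1)
absent    <- c(17066, 14464, 788, 126, 37)

scores1 <- 1:5                   # default scores 1 to n
scores2 <- c(0, 0.5, 1.5, 4, 7)  # scores chosen as (rough) interval midpoints

m1 <- glm(cbind(malformed, absent) ~ scores1, family = binomial())
m2 <- glm(cbind(malformed, absent) ~ scores2, family = binomial())
deviance(m1)  # 4.4473
deviance(m2)  # 1.9487: same data, but a very different quality of fit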