Binary and Multinomial Logistic Regression
stat 557
Heike Hofmann

Outline
• Logistic Regression:
  • model checking by grouping
  • model selection
  • scores
• Intro to Multinomial Regression

Example: Happiness Data

> summary(happy)
           happy                 marital            year
 not too happy: 5629   divorced     : 6131   Min.   :1972
 pretty happy :25874   married      :27998   1st Qu.:1982
 very happy   :14800   never married:10064   Median :1990
 NA's         : 4717   separated    : 1781   Mean   :1990
                       widowed      : 5032   3rd Qu.:2000
                       NA's         :   14   Max.   :2006

      age              sex                   degree
 Min.   : 18.00   female:28581   bachelor      : 6918
 1st Qu.: 31.00   male  :22439   graduate      : 3253
 Median : 43.00                  high school   :26307
 Mean   : 45.43                  junior college: 2601
 3rd Qu.: 58.00                  lt high school:11777
 Max.   : 89.00                  NA's          :  164
 NA's   :184.00

              finrela              health
 above average    : 8536   excellent:11951
 average          :23363   fair     : 7149
 below average    :10909   good     :17227
 far above average:  898   poor     : 2164
 far below average: 2438   NA's     :12529
 NA's             : 4876

Only consider the extremes: very happy and not too happy individuals.

prodplot(data=happy, ~ happy+sex, c("vspine", "hspine"), na.rm=T, subset=level==2)
# almost perfect independence
# try a model
happy.sex <- glm(happy~sex, family=binomial(), data=happy)
summary(happy.sex)

Call:
glm(formula = happy ~ sex, family = binomial(), data = happy)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6060  -1.6054   0.8027   0.8031   0.8031

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.96613    0.02075  46.551   <2e-16 ***
sexmale      0.00130    0.03162   0.041    0.967
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24053 on 20428 degrees of freedom
Residual deviance: 24053 on 20427 degrees of freedom
AIC: 24057

Number of Fisher Scoring iterations: 4

[Figure: spine plot of happy by sex (female, male)]

> confint(happy.sex)
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept)  0.92557962 1.00693875
sexmale     -0.06064378 0.06332427

> anova(happy.sex)
Analysis of Deviance Table
Model: binomial, link: logit
Response: happy
Terms added sequentially (first to last)

     Df  Deviance Resid. Df Resid. Dev
NULL                  20428      24053
sex   1 0.0016906     20427      24053

• Deviance difference is asymptotically χ2 distributed
• Null hypothesis of independence cannot be rejected

Age and Happiness

qplot(age, geom="histogram", fill=happy, binwidth=1, data=happy)
[Figure: histogram of age (x-axis: age, y-axis: count), filled by happy (not too happy, very happy)]

qplot(age, geom="histogram", fill=happy, binwidth=1, position="fill", data=happy)
[Figure: the same histogram with position="fill" (y-axis from 0 to 1), filled by happy (not too happy, very happy)]

# research paper claims that happiness is u-shaped
happy.age <- glm(happy~poly(age,2), family=binomial(),
                 data=na.omit(happy[,c("age","happy")]))

> summary(happy.age)

Call:
glm(formula = happy ~ poly(age, 2), family = binomial(),
    data = na.omit(happy[, c("age", "happy")]))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6400  -1.5480   0.7841   0.8061   0.8707

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.96850    0.01571  61.660  < 2e-16 ***
poly(age, 2)1   6.41183    2.22171   2.886  0.00390 **
poly(age, 2)2  -7.81568    2.21981  -3.521  0.00043 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23957 on 20351 degrees of freedom
Residual deviance: 23936 on 20349 degrees of freedom
AIC: 23942

Number of Fisher Scoring iterations: 4

# effect of age
X <- data.frame(cbind(age=20:85))
X$pred <- predict(happy.age, newdata=X, type="response")
qplot(age, pred, data=X) + ylim(c(0,1))
[Figure: pred against age (20 to 80), y-axis from 0 to 1]

> anova(happy.age)
Analysis of Deviance Table
Model: binomial, link: logit
Response: happy
Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev
NULL                         20351      23957
poly(age, 2)  2   20.739     20349      23936

# effect of age
# (pred2, limits2 and happy.age.df are not defined on this slide)
X <- data.frame(expand.grid(age=20:85, sex=c("female","male")))
preds <- predict(happy.age, newdata=X, type="response", se.fit=T)
X$pred <- preds$fit
X$pred.se <- preds$se.fit
limits <- aes(ymax = pred + pred.se, ymin=pred - pred.se)
qplot(age, pred, data=X, size=I(1)) + ylim(c(0,1)) +
  geom_point(aes(age, pred2), size=1, colour="blue") +
  geom_errorbar(limits) +
  geom_errorbar(limits2, colour="blue") +
  geom_point(aes(x=age, y=happy/(happy+not), colour=sex), data=happy.age.df)

> anova(midlife.sex)
Analysis of Deviance Table
Model: binomial, link: logit
Response: happy
Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev
NULL                             20351      23957
poly(age, 4)      4   59.021     20347      23898
sex               1    0.000     20346      23898
poly(age, 4):sex  4   37.554     20342      23860

[Figure: pred3 against age (20 to 80), y-axis from 0 to 1, coloured by sex (female, male)]

Problems with Deviance
• If X is continuous, the deviance no longer has a χ2 distribution. The assumptions are violated two-fold:
• Even if we regard X as categorical (with lots of categories), we might end up with a contingency table that has lots of small cells, which means that the χ2 approximation does not hold.
• Increasing the sample size most likely also increases the number of distinct values of X, so the corresponding contingency table changes size (an asymptotic distribution for the smaller contingency table does not exist).
... but
• Differences in deviance between models that are only a few degrees of freedom apart are still asymptotically χ2 distributed.

Model Checking by Grouping
• To get around the problems with the distribution assumption of the deviance, group the data along the estimates, e.g. by partitioning on the estimates such that the groups are approximately equal in size.
• If the partitioning is done by size of the estimates, we put the smallest n1 estimates into group 1, the second smallest batch of n2 estimates into group 2, ...
• If we assume g groups, we get the Hosmer-Lemeshow test statistic:

\[
\sum_{i=1}^{g}
\frac{\left( \sum_{j=1}^{n_i} y_{ij} - \sum_{j=1}^{n_i} \hat{\pi}_{ij} \right)^2}
     {\left( \sum_{j=1}^{n_i} \hat{\pi}_{ij} \right) \left( 1 - \sum_{j=1}^{n_i} \hat{\pi}_{ij} / n_i \right)}
\;\sim\; \chi^2_{g-2} .
\]

Problems with Grouping
• Different groupings might (and will) lead to different decisions w.r.t. model fit
• Hosmer et al. (1997): "A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model" (on Blackboard)

Model Selection?
• Ideal Situation: the theory for the relationship between response and covariates is well developed; the model is fitted because we want to fine-tune the dependency structure.
• Exploratory Modelling: after an initial data check, visually inspect the relationship between the response and potential covariates. Include the strongest covariates first, build up from there, and check whether additions are significant improvements.
• Stepwise Modelling (not recommended by itself): include/exclude variables based on goodness-of-fit criteria such as AIC, adjusted R2, ...
• In practice: a combination of all three methods.

(Forward) Selection
• Results are often not easy to interpret - questionable value?
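A stepwise search of this kind can be run with step() from base R's stats package. The following is only a minimal sketch on simulated data: the data frame d, its columns x1, x2, g, y and the chosen scope are assumptions made for illustration and are not the happiness data or the model used in the lecture.

```r
# Simulated stand-in data (hypothetical; NOT the happiness data)
set.seed(1)
n <- 500
d <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
# Only x1 truly affects the response in this simulation
d$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * d$x1))

# Start from the intercept-only model ...
null <- glm(y ~ 1, family = binomial(), data = d)

# ... and let step() add main effects and two-way interactions by AIC.
# direction = "both" would also consider dropping terms at each step.
fwd <- step(null,
            scope = list(lower = ~ 1, upper = ~ (x1 + x2 + g)^2),
            direction = "forward", trace = 0)

formula(fwd)            # terms chosen by the search
c(AIC(null), AIC(fwd))  # AIC never increases along a forward search
```

Setting trace = 1 prints one "Step: AIC=..." table per iteration, listing each candidate term with its Df, Deviance and AIC. A term is only added when it lowers the AIC, which is why the search stops once `<none>` is at the top of the table.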
Step: AIC=18176
cbind(happy, not) ~ sex + poly(age, 4) + marital + degree + finrela +
    degree:finrela + poly(age, 4):degree + poly(age, 4):finrela +
    sex:finrela + sex:degree

                       Df Deviance   AIC
<none>                       16714 18176
+ sex:marital           4    16707 18177
+ marital:degree       16    16688 18182
+ poly(age, 4):marital 16    16688 18182
+ sex:poly(age, 4)      4    16714 18184
+ marital:finrela      16    16693 18187

(Forward) Selection

Step: AIC=18176
cbind(happy, not) ~ sex + poly(age, 4) + marital + degree + finrela +
    degree:finrela + poly(age, 4):degree + poly(age, 4):finrela +
    sex:finrela + sex:degree

                       Df Deviance   AIC
<none>                       16714 18176
- sex:degree            4    16722 18176
+ sex:marital           4    16707 18177
- sex:finrela           4    16724 18178
+ marital:degree       16    16688 18182
+ poly(age, 4):marital 16    16688 18182
+ sex:poly(age, 4)      4    16714 18184
+ marital:finrela      16    16693 18187
- poly(age, 4):finrela 16    16759 18189
- poly(age, 4):degree  16    16766 18196
- degree:finrela       16    16774 18204
- marital               4    18232 19686

Investigate Interactions
• Financial Relation / Gender

prodplot(happy, ~happy+sex+finrela, c("vspine","hspine","hspine"), subset=level==3)
[Figure: spine plot of happy by sex (female, male) within finrela (far below average, below average, average, above average, far above average, NA)]

prodplot(happy, ~happy+finrela+sex, c("vspine","hspine","hspine"), subset=level==3)
[Figure: spine plot of happy by finrela within sex (female, male)]

Effect plots
[Figure: pred (0.2 to 0.8) against age (20 to 90); panels by degree (bachelor, graduate, high school, junior college, lt high school), marital status (divorced, married, never married, separated, widowed) and sex (male, female); colour: finrela (far below average, below average, average, above average, far above average)]

Standardized Residuals
• Standardize by dividing by a function of the leverage values h_i; the standard form for a Pearson residual e_i is
  \[ r_i = \frac{e_i}{\sqrt{1 - h_i}} \]
• The hat matrix is the result of the iteratively weighted fitting,
  \[ H = W^{1/2} X (X' W X)^{-1} X' W^{1/2} , \]
• with the weights determined by the link; for the logit link,
  \[ W = \mathrm{diag}\left( n_i \hat{\pi}_i (1 - \hat{\pi}_i) \right) . \]

Diagnostics
• Residual Plots
• Predictive Power (corresponds to R2)
• Deletion Statistics (Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982)): dfbeta, dffits, covratio, cooks.distance

Logit model for a categorical X:

\[ \log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta_i , \]

where βi is the effect of the ith category of X on the log odds, i.e. there is one effect for each category. This means that the above model is overparameterized (the "last" category can be explained by the others). To make the solution unique again, we have to use an additional constraint [...]. Whenever one of the effects is fixed to be zero, this is called a contrast coding; it gives a comparison of all the other effects to the baseline effect. For effect coding the constraint is on the sum of the effects of a variable: \( \sum_i \beta_i = 0 \). In a binary variable the effects are then the negatives of each other. Predictions and inference are independent of the specific coding used and are not affected by changes in the coding.

Example: Alcohol during pregnancy
Alcohol during pregnancy is believed to be associated with congenital malformation. The following data are from an observational study:
• at 3 months of pregnancy, expectant mothers were asked for their average daily alcohol consumption (average number of alcoholic beverages)
• the infant was checked for malformation at birth

  Alcohol malformed absent P(malformed)
1 0              48  17066       0.0028
2 <1             38  14464       0.0026
3 1-2             5    788       0.0063
4 3-5             1    126       0.0079
5 ≥6              1     37       0.0263

Models m1 and m2 are the same in terms of statistical behavior: deviance, predictions and [...] the same numbers.
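The claim that the coding does not change the fit can be checked directly on the alcohol table. The sketch below is an illustration only: the object names alc, m.treat and m.sum are hypothetical (they are not the lecture's m1 and m2, whose definitions are not shown), and the counts are copied from the table.

```r
# Counts copied from the alcohol / malformation table
alc <- data.frame(
  Alcohol   = factor(c("0", "<1", "1-2", "3-5", ">=6"),
                     levels = c("0", "<1", "1-2", "3-5", ">=6")),
  malformed = c(48, 38, 5, 1, 1),
  absent    = c(17066, 14464, 788, 126, 37)
)

# Contrast (treatment) coding: effect of the first level fixed at zero
m.treat <- glm(cbind(malformed, absent) ~ Alcohol, family = binomial(),
               data = alc, contrasts = list(Alcohol = "contr.treatment"))

# Effect (sum-to-zero) coding: effects of all levels sum to zero
m.sum <- glm(cbind(malformed, absent) ~ Alcohol, family = binomial(),
             data = alc, contrasts = list(Alcohol = "contr.sum"))

# Coefficients differ between the two codings ...
coef(m.treat)
coef(m.sum)

# ... but deviance and fitted probabilities agree
c(deviance(m.treat), deviance(m.sum))
max(abs(fitted(m.treat) - fitted(m.sum)))
```

Under treatment coding the intercept is the log odds of malformation in the baseline group (alcohol consumption 0), and every other coefficient is a difference in log odds relative to that group.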
The variable Alcohol is recoded for the second model, giving differ[...]

Saturated Model

glm(formula = cbind(malformed, absent) ~ Alcohol, family = binomial())

Deviance Residuals:
[1] 0 0 0 0 0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.87364    0.14454 -40.637   <2e-16 ***
Alcohol<1   -0.06819    0.21743  -0.314   0.7538
Alcohol1-2   0.81358    0.47134   1.726   0.0843 .
Alcohol3-5   1.03736    1.01431   1.023   0.3064
Alcohol>=6   2.26272    1.02368   2.210   0.0271 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  6.2020e+00 on 4 degrees of freedom
Residual deviance: -3.0775e-13 on 0 degrees of freedom
AIC: 28.627

Number of Fisher Scoring iterations: 4

‘Linear’ Effect (levels: 1,2,3,4,5)

glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals:
      1        2        3        4        5
 0.7302  -1.1983   0.9636   0.4272   1.1692

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -6.2089     0.2873 -21.612   <2e-16 ***
as.numeric(Alcohol)   0.2278     0.1683   1.353    0.176
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020 on 4 degrees of freedom
Residual deviance: 4.4473 on 3 degrees of freedom
AIC: 27.074

Number of Fisher Scoring iterations: 5

‘Linear’ Effect (levels: 0,0.5,1.5,4,7)

glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals:
      1        2        3        4        5
 0.5921  -0.8801   0.8865  -0.1449   0.1291

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -5.9605     0.1154 -51.637   <2e-16 ***
as.numeric(Alcohol)   0.3166     0.1254   2.523   0.0116 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020 on 4 degrees of freedom
Residual deviance: 1.9487 on 3 degrees of freedom
AIC: 24.576

Number of Fisher Scoring iterations: 4

Scores
• Scores of categorical variables critically influence a model
• usually, scores will be given by data experts
• various choices: e.g. midpoints of interval variables,
• by default, scores are assumed to be the values 1 to n

Multinomial Models
• Response Y is categorical (nominal) with J > 2 categories
• define πj(x) = P(Y=j | X=x)
• Baseline Category Logit Model: pick one reference category i (e.g. i = 1, i = J, or the largest category) and express each logit with respect to this reference:

\[ \log \frac{\pi_j(x)}{\pi_i(x)} = \alpha_j + \beta_j' x \quad \text{for all } j = 1, \ldots, J . \]

Multinomial Model
• Choices for baseline: the largest category gives the most stable results
• R picks the first level

Example: Alligator Food
• 219 alligators from four lakes in Florida were examined with respect to their primary food choice: fish, invertebrates, birds, reptiles, other.
• Additionally, size of alligators (≤2.3m, >2.3m) and gender were recorded.

> summary(alligator)
       ID            food        size     gender        lake
 Min.   :  1.0   bird  :13   <2.3:124   f: 89   george  :63
 1st Qu.: 55.5   fish  :94   >2.3: 95   m:130   hancock :55
 Median :110.0   invert:61                      oklawaha:48
 Mean   :110.0   other :32                      trafford:53
 3rd Qu.:164.5   rep   :19
 Max.   :219.0

[Figure: food (fish, bird, invert, other, rep) by size (<2.3, >2.3)]

xtabs(~lake + food, data = alligator)
[Figure: food (fish, bird, invert, other, rep) by lake (george, hancock, oklawaha, trafford)]

xtabs(~gender + food, data = alligator)
[Figure: food (fish, bird, invert, other, rep) by gender (f, m)]

library(nnet)
• Brian Ripley’s nnet package allows fitting multinomial models:

library(nnet)
alli.main <- multinom(food~lake+size+gender, data=alligator)

> summary(alli.main)
Call:
multinom(formula = food ~ lake + size + gender, data = alligator)

Coefficients:
       (Intercept) lakehancock lakeoklawaha laketrafford   size>2.3    genderm
bird    -2.4321397   0.5754699  -0.55020075     1.237216  0.7300740 -0.6064035
invert   0.1690702  -1.7805555   0.91304120     1.155722 -1.3361658 -0.4629388
other   -1.4309095   0.7667093   0.02603021     1.557820 -0.2905697 -0.2524299
rep     -3.4161432   1.1296426   2.53024945     3.061087  0.5571846 -0.6276217

Std. Errors:
       (Intercept) lakehancock lakeoklawaha laketrafford   size>2.3   genderm
bird     0.7706720   0.7952303    1.2098680    0.8661052  0.6522657 0.6888385
invert   0.3787475   0.6232075    0.4761068    0.4927795  0.4111827 0.3955162
other    0.5381162   0.5685673    0.7777958    0.6256868  0.4599317 0.4663546
rep      1.0851582   1.1928075    1.1221413    1.1297557  0.6466092 0.6852750

Residual Deviance: 537.8655
AIC: 585.8655
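
The coefficient table can be turned into fitted probabilities by inverting the baseline-category logits. The sketch below makes two assumptions: that fish is the baseline (it is the only food category without a coefficient row), and that the reference levels are lake george, size <2.3 and gender f (the levels without a coefficient column). The matrix B and the helper food_probs() are hypothetical names introduced for illustration; the numbers are copied from the summary(alli.main) output.

```r
# Coefficients copied from summary(alli.main); rows are logits vs. fish
B <- rbind(
  bird   = c(-2.4321397,  0.5754699, -0.55020075, 1.237216,  0.7300740, -0.6064035),
  invert = c( 0.1690702, -1.7805555,  0.91304120, 1.155722, -1.3361658, -0.4629388),
  other  = c(-1.4309095,  0.7667093,  0.02603021, 1.557820, -0.2905697, -0.2524299),
  rep    = c(-3.4161432,  1.1296426,  2.53024945, 3.061087,  0.5571846, -0.6276217)
)
colnames(B) <- c("(Intercept)", "lakehancock", "lakeoklawaha",
                 "laketrafford", "size>2.3", "genderm")

# Hypothetical helper: x is a 0/1 design vector in the column order of B.
# pi_j = exp(eta_j) / sum_k exp(eta_k), with eta = 0 for the baseline.
food_probs <- function(x) {
  eta <- c(fish = 0, drop(B %*% x))
  exp(eta) / sum(exp(eta))
}

# Small female alligator in lake george (all reference levels)
p.ref <- food_probs(c(1, 0, 0, 0, 0, 0))
# Large male alligator in lake hancock
p.big <- food_probs(c(1, 1, 0, 0, 1, 1))

round(rbind(p.ref, p.big), 3)
```

With a fitted model object available, predict(alli.main, newdata, type = "probs") from nnet should return the same kind of probabilities without the manual step.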