Multinomial Logistic Regression
stat 557
Heike Hofmann

Outline
• Ordinal Co-variates
• Baseline Categorical Model
• Proportional Odds Logistic Regression

For a categorical covariate X the logistic regression model is

\[ \log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta_i, \]

where $\beta_i$ is the effect of the ith category in X on the log odds, i.e. each category has its own effect. This means that the above model is overparameterized (the "last" category can be explained by the others). To make the solution unique again, we have to use an additional constraint. In R, the default fixes the effect of the first category at zero. Whenever one of the effects is fixed to be zero, this is called a contrast coding, since it amounts to a comparison of all the other effects to the baseline effect. For effect coding the constraint is on the sum of the effects of a variable: $\sum_i \beta_i = 0$. In a binary variable the effects are then the negatives of each other. Predictions and inference are independent of the specific coding used and are not affected by changes in the coding.

Example: Alcohol during pregnancy
Alcohol during pregnancy is believed to be associated with congenital malformation. The following data come from an observational study:
• at 3 months of pregnancy, expectant mothers were asked for their average daily alcohol consumption (number of alcoholic beverages)
• the infant was checked for malformation at birth

   Alcohol  malformed  absent  P(malformed)
1  0               48   17066        0.0028
2  <1              38   14464        0.0026
3  1-2              5     788        0.0063
4  3-5              1     126        0.0079
5  ≥6               1      37        0.0263

Models m1 and m2 are the same in terms of statistical behavior: deviance and predictions give the same numbers. The variable Alcohol is recoded for the second model, giving different coefficients.

Saturated Model

glm(formula = cbind(malformed, absent) ~ Alcohol, family = binomial())

Deviance Residuals:
[1]  0  0  0  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.87364    0.14454 -40.637   <2e-16 ***
Alcohol<1   -0.06819    0.21743  -0.314   0.7538
Alcohol1-2   0.81358    0.47134   1.726   0.0843 .
Alcohol3-5   1.03736    1.01431   1.023   0.3064
Alcohol>=6   2.26272    1.02368   2.210   0.0271 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  6.2020e+00  on 4  degrees of freedom
Residual deviance: -3.0775e-13  on 0  degrees of freedom
AIC: 28.627

Number of Fisher Scoring iterations: 4

‘Linear’ Effect (levels: 1, 2, 3, 4, 5)

glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals:
      1        2        3        4        5
 0.7302  -1.1983   0.9636   0.4272   1.1692

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -6.2089     0.2873 -21.612   <2e-16 ***
as.numeric(Alcohol)   0.2278     0.1683   1.353    0.176
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 4.4473  on 3  degrees of freedom
AIC: 27.074

Number of Fisher Scoring iterations: 5

‘Linear’ Effect (levels: 0, 0.5, 1.5, 4, 7)

glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals:
      1        2        3        4        5
 0.5921  -0.8801   0.8865  -0.1449   0.1291

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -5.9605     0.1154 -51.637   <2e-16 ***
as.numeric(Alcohol)   0.3166     0.1254   2.523   0.0116 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom
AIC: 24.576

Number of Fisher Scoring iterations: 4

Ordinal X
• Scores of categorical variables critically influence a model
• usually, scores will be given by data experts
• various choices: e.g. midpoints of interval variables; the default scores are the values 1 to n
• Linear changes to scores do not affect the overall model (predictions, goodness of fit)

Example: Alligator Food
• 219 alligators from four lakes in Florida were examined with respect to their primary food choice: fish, invertebrate, birds, reptile, other.
• Additionally, size of alligators (≤2.3m, >2.3m) and gender were recorded.

> summary(alligator)
       ID            food       size       gender       lake
 Min.   :  1.0   bird  :13   <2.3:124   f: 89   george  :63
 1st Qu.: 55.5   fish  :94   >2.3: 95   m:130   hancock :55
 Median :110.0   invert:61                      oklawaha:48
 Mean   :110.0   other :32                      trafford:53
 3rd Qu.:164.5   rep   :19
 Max.   :219.0

[Figure: food choice of the alligators; legend: rep, other, invert, bird, fish]

Baseline Categorical Model
• Response Y is categorical (nominal) with J > 2 categories
• define $\pi_j(x) = P(Y = j \mid X = x)$
• Baseline Categorical Model: pick one reference ("normal") category i, e.g. i = 1 or i = J or the largest category, and express the logit with respect to this reference:

\[ \log \frac{\pi_j(x)}{\pi_i(x)} = \alpha_j + \beta_j' x \quad \text{for all } j = 1, \dots, J \]

Multinomial Model
• Choices for baseline:
  • largest category gives most stable results
  • R picks first level
• Haberman: $G^2$ and $X^2$ are $\chi^2$ distributed if the data is categorical and not sparse; for sparse or continuous data, deviance differences between nested models are still $\chi^2$ distributed if the models differ in few parameters.
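The two baseline choices listed above (R's first level versus the largest category) can be sketched in a few lines of R. This is only an illustration: the `food` factor is rebuilt from the counts in the `summary(alligator)` output rather than read from the original data, and the variable names `counts`, `largest` and `food_fish` are made up for this sketch.

```r
# Food counts copied from the summary(alligator) output (assumption: the
# factor is rebuilt from counts, not taken from the original data set).
counts <- c(bird = 13, fish = 94, invert = 61, other = 32, rep = 19)
food <- factor(rep(names(counts), times = counts))

# The first factor level is what R would use as the baseline category.
levels(food)[1]

# Make the largest category the baseline instead.
largest <- names(which.max(table(food)))
food_fish <- relevel(food, ref = largest)
levels(food_fish)[1]
```

Releveling changes only which category the logits are expressed against; as with the codings discussed earlier, predictions are not affected.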
library(nnet)
• Brian Ripley’s nnet package allows to fit multinomial models:

library(nnet)
alli.main <- multinom(food~lake+size+gender, data=alligator)

> summary(alli.main)
Call:
multinom(formula = food ~ lake + size + gender, data = alligator)

Coefficients:
       (Intercept) lakehancock lakeoklawaha laketrafford   size>2.3
bird    -2.4321397   0.5754699  -0.55020075     1.237216  0.7300740
invert   0.1690702  -1.7805555   0.91304120     1.155722 -1.3361658
other   -1.4309095   0.7667093   0.02603021     1.557820 -0.2905697
rep     -3.4161432   1.1296426   2.53024945     3.061087  0.5571846
          genderm
bird   -0.6064035
invert -0.4629388
other  -0.2524299
rep    -0.6276217

Std. Errors:
       (Intercept) lakehancock lakeoklawaha laketrafford  size>2.3
bird     0.7706720   0.7952303    1.2098680    0.8661052 0.6522657
invert   0.3787475   0.6232075    0.4761068    0.4927795 0.4111827
other    0.5381162   0.5685673    0.7777958    0.6256868 0.4599317
rep      1.0851582   1.1928075    1.1221413    1.1297557 0.6466092
         genderm
bird   0.6888385
invert 0.3955162
other  0.4663546
rep    0.6852750

Residual Deviance: 537.8655
AIC: 585.8655

The logit between any two categories a and b follows from their baseline logits:

\[ \log \frac{\pi_a(x)}{\pi_b(x)} = \log \frac{\pi_a(x)}{\pi_i(x)} - \log \frac{\pi_b(x)}{\pi_i(x)} = (\alpha_a - \alpha_b) + (\beta_a - \beta_b)' x. \]
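The identity above can be applied directly to the printed coefficients. The following sketch types the coefficient table from the `summary(alli.main)` output into a matrix (baseline category: fish) and computes the log odds of invert versus other; the covariate pattern `x` is a hypothetical example chosen for illustration, not one singled out in the slides.

```r
# Coefficients copied from the summary(alli.main) output; baseline is fish.
coefs <- rbind(
  bird   = c(-2.4321397,  0.5754699, -0.55020075, 1.237216,  0.7300740, -0.6064035),
  invert = c( 0.1690702, -1.7805555,  0.91304120, 1.155722, -1.3361658, -0.4629388),
  other  = c(-1.4309095,  0.7667093,  0.02603021, 1.557820, -0.2905697, -0.2524299),
  rep    = c(-3.4161432,  1.1296426,  2.53024945, 3.061087,  0.5571846, -0.6276217)
)
colnames(coefs) <- c("(Intercept)", "lakehancock", "lakeoklawaha",
                     "laketrafford", "size>2.3", "genderm")

# Log odds of invert versus other: difference of the two baseline logits.
invert_vs_other <- coefs["invert", ] - coefs["other", ]

# Hypothetical covariate pattern: large male alligator in lake hancock.
x <- c("(Intercept)" = 1, lakehancock = 1, lakeoklawaha = 0,
       laketrafford = 0, "size>2.3" = 1, genderm = 1)
log_odds <- sum(invert_vs_other * x)
```

With a fitted object at hand the same differences could be taken from `coef(alli.main)`; the hard-coded matrix only keeps the sketch independent of the data set.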
Alligator
• The full model has the form

\[ \log \frac{\pi_j(x)}{\pi_F(x)} = \alpha_j + \beta^{L}_{lj} + \beta^{S}_{sj} + \beta^{G}_{gj} + \beta^{LS}_{lsj} + \beta^{LG}_{lgj} + \beta^{SG}_{sgj} + \beta^{LSG}_{lsgj}, \quad \text{for } j = 1, \dots, 4, \]

  where F is the baseline category (fish).
• The number of parameters we estimate is then (in the above order): (1 + 3 + 1 + 1 + 3 + 3 + 1 + 3) · 4 = 16 · 4 = 64
• find a suitable sub-model

The full model has 0 degrees of freedom:

library(nnet)
options(contrasts=c("contr.treatment","contr.poly"))
… <- multinom(food~lake*size*gender, data=table.7.1)
# weights:  85 (64 variable)
initial  value 352.466903
…0 value 261.200857
# saturated model

Alligator
• Corner-stone Models

  Model         Deviance   df
  Full          487.6018    0
  Two-way       489.5426   12
  Main Effects  537.8655   40
  Null          604.3629   60

• A suitable model lies ‘around’ the main effects model and the model with all two-way interactions

> anova(alli.full, alli.twoway, alli.main, alli.null)
                                    Model Resid. df Resid. Dev
1                                       1       872   604.3629
2                    lake + size + gender       852   537.8655
3 size * gender * lake - size:gender:lake       824   489.5426
4                    size * gender * lake       812   487.6018
    Test    Df  LR stat.      Pr(Chi)
1           NA        NA           NA
2 1 vs 2    20 66.497442 6.723394e-07
3 2 vs 3    28 48.322909 9.889238e-03
4 3 vs 4    12  1.940776 9.994914e-01

Estimated Response Probabilities
• For the model

\[ \log \frac{\pi_j(x)}{\pi_i(x)} = \alpha_j + \beta_j' x \quad \text{for all } j = 1, \dots, J \text{ and all } x, \]

  with $\alpha_i = 0$ and $\beta_i = 0$ for the baseline category i,
• the estimated response probabilities are given as

\[ \hat{\pi}_j(x) = \frac{\exp(\alpha_j + \beta_j' x)}{1 + \sum_{k \neq i} \exp(\alpha_k + \beta_k' x)} \quad \text{for all } j = 1, \dots, J. \]

Model Diagnostics

Fitted Values
  size lake     bird fish invert other rep
1 <2.3 george    1.2 18.5   16.9   3.8 0.5
2 >2.3 george    1.8 14.5    3.1   2.2 0.5
3 <2.3 hancock   2.7 20.9    3.6   9.9 1.9
4 >2.3 hancock   2.3  9.1    0.4   3.1 1.1
5 <2.3 oklawaha  0.2  5.2   12.0   1.1 1.5
6 >2.3 oklawaha  0.8 12.8    7.0   1.9 5.5
7 <2.3 trafford  0.9  4.4   12.4   4.2 2.1
8 >2.3 trafford  3.1  8.6    5.6   5.8 5.9

Observed Values
  lake     size fish bird invert other rep
1 george   <2.3   16    2     19     3   1
2 george   >2.3   17    1      1     3   0
3 hancock  <2.3   23    2      4     8   2
4 hancock  >2.3    7    3      0     5   1
5 oklawaha <2.3    5    0     11     3   1
6 oklawaha >2.3   13    1      8     0   6
7 trafford <2.3    5    1     11     5   2
8 trafford >2.3    8    3      7     5   6

Differences (fitted − observed)
  size lake     bird fish invert other  rep
1 <2.3 george   -0.8  2.5   -2.1   0.8 -0.5
2 >2.3 george    0.8 -2.5    2.1  -0.8  0.5
3 <2.3 hancock   0.7 -2.1   -0.4   1.9 -0.1
4 >2.3 hancock  -0.7  2.1    0.4  -1.9  0.1
5 <2.3 oklawaha  0.2  0.2    1.0  -1.9  0.5
6 >2.3 oklawaha -0.2 -0.2   -1.0   1.9 -0.5
7 <2.3 trafford -0.1 -0.6    1.4  -0.8  0.1
8 >2.3 trafford  0.1  0.6   -1.4   0.8 -0.1

Pearson Residuals
  size lake     bird fish invert other  rep
1 <2.3 george   -0.6  0.1   -0.1   0.2 -1.1
2 >2.3 george    0.4 -0.2    0.7  -0.4  1.0
3 <2.3 hancock   0.3 -0.1   -0.1   0.2 -0.1
4 >2.3 hancock  -0.3  0.2    1.0  -0.6  0.1
5 <2.3 oklawaha  1.0  0.0    0.1  -1.8  0.4
6 >2.3 oklawaha -0.2  0.0   -0.1   1.0 -0.1
7 <2.3 trafford -0.2 -0.1    0.1  -0.2  0.1
8 >2.3 trafford  0.0  0.1   -0.3   0.1  0.0

Note: the values tabulated under this heading equal (fitted − observed)/fitted, i.e. the differences relative to the fitted count; they are not divided by the square root of the fitted count as in the definition that follows.

Comparing observed and fitted cell counts gives an idea of the sign of the residuals. For a closer inspection we will use the same residuals as before, e.g. Pearson residuals

\[ \frac{o_{ij} - e_{ij}}{\sqrt{e_{ij}}}, \]

which have known asymptotic distributions.

Proportional Odds Logistic Regression

Ordinal Response
• Y is a categorical variable with J > 2 levels that have a natural ordering
• Assume $y_1 < y_2 < \dots < y_J$
• If the response variable Y is ordinal, we can take a different approach to modeling it: based on the cumulative probability $P(Y \le j \mid x)$ for j = 1, ..., J − 1 we define the cumulative log odds as

\[ \log \frac{P(Y \le j \mid x)}{1 - P(Y \le j \mid x)} = \log \frac{\pi_1(x) + \dots + \pi_j(x)}{\pi_{j+1}(x) + \dots + \pi_J(x)} \]

Proportional Odds Model
• The cumulative odds model or proportional odds model is then given as the logistic regression

\[ \log \frac{P(Y \le j \mid x)}{1 - P(Y \le j \mid x)} = \alpha_j + \beta' x, \quad \text{for } j = 1, \dots, J - 1, \]

  with the same slope $\beta$ for every j (for j = J the cumulative probability is 1, so no logit is defined).
• The values $\alpha_j$ are ordered, i.e. $\alpha_{j_1} \le \alpha_{j_2}$ for $j_1 < j_2$: for $j_1 < j_2$ the cumulative probabilities satisfy $P(Y \le j_1 \mid x) \le P(Y \le j_2 \mid x)$. Since the logit is a monotone increasing function, the intercepts inherit this ordering.
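How the cumulative logits translate into category probabilities can be sketched as follows. All numbers here (`alpha`, `beta`, `x1`, `x2`) are hypothetical values chosen only for illustration, not estimates from the slides; the sketch assumes J = 3 categories and a single covariate.

```r
# Hypothetical parameters for J = 3 ordered categories (illustration only).
alpha <- c(-1.0, 0.5)   # ordered intercepts alpha_1 <= alpha_2
beta  <- 0.8            # common slope, shared by all cumulative logits
x1    <- 1.5
x2    <- 2.5

# Cumulative probabilities P(Y <= j | x) for j = 1, 2 via the inverse logit.
cum_prob <- plogis(alpha + beta * x1)

# Category probabilities P(Y = j | x) as successive differences.
cat_prob <- diff(c(0, cum_prob, 1))

# Proportional odds: the change in cumulative log odds between x1 and x2
# is beta * (x2 - x1), whatever the category j.
log_or <- qlogis(plogis(alpha + beta * x2)) - qlogis(plogis(alpha + beta * x1))
```

Because the intercepts are ordered and the slope is shared, the differences in `cat_prob` are guaranteed to be non-negative; with category-specific slopes this would not hold for all x.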
Happiness

library(MASS)  # provides polr()
happy.age <- polr(happy~poly(age,4)*sex,
                  data=na.omit(happy[,c("happy","age","sex")]))

[Figure: estimated probabilities (0 to 1) against age (20 to 80) for the responses not.too.happy, pretty.happy and very.happy, by sex (female, male)]
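The happy data used on the slide is not reproduced here, so the following sketch fits a proportional odds model to the `housing` data that ships with MASS instead; the choice of data set and the object names `fit` and `probs` are assumptions made for illustration. Note that `polr()` parameterizes the model as logit P(Y ≤ j | x) = ζ_j − β'x, so its slope coefficients have the opposite sign to β in the formula on the slides.

```r
library(MASS)  # provides polr() and the housing data

# Proportional odds model for housing satisfaction (Low < Medium < High),
# using the frequency column as case weights.
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# Estimated intercepts (cut points); they increase with j.
fit$zeta

# Estimated response probabilities for the first covariate pattern.
probs <- predict(fit, newdata = housing[1, ], type = "probs")
```

Curves like the ones in the happiness figure could be obtained the same way, by calling `predict(..., type = "probs")` on a grid of covariate values.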