Generalized Linear Models
Stat 557
Heike Hofmann

Outline
• GLM for binomial response
• Model fitting: Newton-Raphson / Fisher scoring
• Deviance
• Residuals

Binomial Distribution
P(X = x) = (n choose x) π^x (1 − π)^(n−x)

With θ = log(π/(1 − π)) we have log(1 − π) = −log(1 + e^θ), so the pmf can be written in exponential-family form (with φ = 1):
P(X = x; θ, φ = 1) = exp( xθ − n log(1 + e^θ) + log (n choose x) )

For comparison, the normal density has the same structure:
f(x; µ, σ²) = 1/√(2πσ²) · exp( −(x − µ)²/(2σ²) )
            = exp( (−x² + 2µx − µ²)/(2σ²) − ½ log(2πσ²) )
With θ = µ and φ = σ²:
f(x; θ, φ) = exp( (θx − θ²/2)/φ − x²/(2φ) − ½ log(2πφ) )

General exponential-family form:
f(x; θ, φ) = exp( (θx − b(θ))/a(φ) + c(x, φ) )

Binomial Response / Binomial GLM
n_i Y_i ~ B(n_i, π_i)
Then E[Y_i] = π_i, Var[Y_i] = π_i(1 − π_i)/n_i, and η_i = Σ_j x_ij β_j.
For the logit link: logit(π_i) = η_i = Σ_j x_ij β_j.

Some probability link functions
• logit (canonical link):  g(π_i) = log( π_i / (1 − π_i) )
• probit:                  g(π_i) = Φ⁻¹(π_i)
• complementary log-log:   g(π_i) = log( −log(1 − π_i) )
• identity:                g(π_i) = π_i

[Figure: the logit, probit, cloglog and cauchit transformations of p, plotted on a common scale]
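The link functions listed above are easy to evaluate directly. A minimal sketch in Python (the course's own computations are done in R; `NormalDist.inv_cdf` supplies Φ⁻¹):

```python
# The four binomial link functions from the slide above.
import math
from statistics import NormalDist

def logit(p):
    # canonical link: g(p) = log(p / (1 - p))
    return math.log(p / (1 - p))

def probit(p):
    # g(p) = Phi^{-1}(p), the standard normal quantile function
    return NormalDist().inv_cdf(p)

def cloglog(p):
    # complementary log-log: g(p) = log(-log(1 - p))
    return math.log(-math.log(1 - p))

def identity(p):
    return p

# All of these map (0, 1) into the real line and are increasing in p;
# logit and probit are symmetric about p = 0.5:
print(logit(0.5))   # 0.0
print(probit(0.5))  # 0.0
```

The canonical (logit) link is singled out in the next slides because it simplifies the likelihood equations.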
Likelihood Equations
With µ_i = E[Y_i], the likelihood equations are
∂L/∂β_j = Σ_i (y_i − E[Y_i]) / Var(Y_i) · [g′(π_i)]⁻¹ · x_ij = 0

Derivatives of the link functions:
• identity:   g(π_i) = π_i,                  g′(π_i) = 1
• logit:      g(π_i) = log(π_i/(1 − π_i)),   g′(π_i) = 1/(π_i(1 − π_i))
• cloglog:    g(π_i) = log(−log(1 − π_i)),   g′(π_i) = −1/((1 − π_i) log(1 − π_i))
• probit:     g(π_i) = Φ⁻¹(π_i),             g′(π_i) = 1/ϕ(Φ⁻¹(π_i)), with ϕ the standard normal density

For the logit link: logit(π_i) = η_i = Σ_j x_ij β_j. Then
∂η_i/∂µ_i = ∂η_i/∂π_i = 1/π_i + 1/(1 − π_i) = 1/(π_i(1 − π_i))
and the likelihood equations become
∂L/∂β_j = Σ_i (y_i − b′(θ_i))/Var(y_i) · ∂µ_i/∂η_i · x_ij
        = Σ_i (y_i − π_i)/(π_i(1 − π_i)/n_i) · π_i(1 − π_i) · x_ij
        = Σ_i n_i (y_i − π_i) x_ij = 0

Stat 557 (Fall 2008)  Intro to GLMs

Choice of Link Function?
• range of E[Y]?
• the canonical link has nicer mathematical properties (it cancels Var(Y_i) from the denominator)

Binomial GLM
• For a 2 × 2 table with π_x = P(Y = 1 | X = x):

            X=0    X=1
      Y=0   π00    π10
      Y=1   π01    π11

• GLM: g(π_x) = α + β x
• β is the effect of X: β = g(π_1) − g(π_0)
• choice of link g:
  • identity link => β is the difference of proportions
  • log link      => β is the log relative risk
  • logit link    => β is the log odds ratio

Moth Data
Frozen dead moths of two colors are placed on trees at locations of increasing distance from Liverpool, England. This species of moth rests during the day on tree trunks and is active at night.
Trees near Liverpool are darkened by smoke to a greater extent than those farther away in the Welsh countryside. At each location, the number of moths of each color that are placed, and the number removed 24 hours later, are recorded. One might expect that lighter moths are more likely to be removed near Liverpool and that darker moths are more likely to be removed farther away, since trees whose color is closer to the moth's provide better camouflage.

Moth Data
> head(moth)
  Xmorph distance placed removed      location
1  light      0.0     56      17   Sefton Park
2   dark      0.0     56      14   Sefton Park
3  light      7.2     80      28 Eastham Ferry
4   dark      7.2     80      20 Eastham Ferry
5  light     24.1     52      18      Hawarden
6   dark     24.1     52      22      Hawarden

qplot(distance, removed/placed, data=moth, size=I(3),
      colour=Xmorph) + geom_smooth(method="lm")

[Figure: removed/placed (0.2–0.5) against distance (0–50), colored by Xmorph (dark, light), with linear smooths]

Moth Data

glm(formula = cbind(removed, placed) ~ Xmorph * distance,
    family = binomial(link = "identity"), data = moth)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.76275  -0.28737   0.02419   0.50867   0.90052

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)           0.1954860  0.0318781   6.132 8.66e-10 ***
Xmorphlight           0.0502263  0.0455106   1.104   0.2698
distance              0.0023054  0.0009614   2.398   0.0165 *
Xmorphlight:distance -0.0034066  0.0013472  -2.529   0.0114 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 18.3994  on 13  degrees of freedom
Residual deviance:  7.1539  on 10  degrees of freedom
AIC: 79.414

Number of Fisher Scoring iterations: 4

Moth Data

glm(formula = cbind(removed, placed) ~ Xmorph * distance,
    family = binomial(link = "logit"), data = moth)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7494  -0.3058   0.0076   0.4970   0.9412

Coefficients:
                      Estimate Std.
Error z value Pr(>|z|)
(Intercept)          -1.396032   0.189529  -7.366 1.76e-13 ***
Xmorphlight           0.282320   0.261476   1.080   0.2803
distance              0.012127   0.005296   2.290   0.0220 *
Xmorphlight:distance -0.018771   0.007650  -2.454   0.0141 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 18.3994  on 13  degrees of freedom
Residual deviance:  7.1739  on 10  degrees of freedom
AIC: 79.434

Number of Fisher Scoring iterations: 4

Model Fitting: Newton-Raphson
Objective: find the value of β that maximizes the log-likelihood:
argmax_β L(β) = β̂
General idea:
1. guess a value for a solution,
2. approximate the function of interest at the current value by a (2nd-degree) polynomial,
3. find the maximum of the approximation,
4. replace the current value by that maximum,
5. repeat steps 2–4 until the new and the old value are sufficiently close.
Score vector:      u(β) = ( ∂L/∂β_1, ..., ∂L/∂β_p )′
Hessian matrix of L:  H(β) = ( ∂²L/∂β_i ∂β_j )_{1≤i,j≤p}
Update step:       β^(t+1) = β^(t) − H(β^(t))⁻¹ u(β^(t))

Newton-Raphson – properties
• Generally well-behaved, but starting values are crucial and can be problematic in case of
  • multi-modality, or
  • very flat functions (convergence problematic)
• 2nd-order convergence
• Alternative – Fisher scoring: substitute the Hessian matrix by the expected Fisher information (identical for the canonical link; often ‘closer’ to the data and simpler to compute in many situations)

Deviance
• Let L(µ; y) denote the log-likelihood expressed in the means µ = (µ_1, µ_2, ..., µ_N) for observations y = (y_1, ..., y_N)
• L(µ̂; y) is the maximized log-likelihood
• L(y; y) is the maximal achievable log-likelihood, obtained under the fully saturated model
• Deviance of model M: D := −2 (L(µ̂; y) − L(y; y))
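The Fisher-scoring iteration and the deviance just defined can be illustrated together. Below is a minimal pure-Python sketch for a one-covariate grouped binomial logit model; the data values are hypothetical (not the moth data), and the course's own fits use R's glm():

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def fisher_scoring_logit(x, y, n, steps=25):
    """Fit logit(pi_i) = b0 + b1*x_i to grouped binomial data
    (y_i successes out of n_i trials) by Fisher scoring.  For the
    canonical logit link the expected information equals the negative
    Hessian, so this coincides with Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(steps):
        # score:       u(b) = sum_i (y_i - n_i pi_i) (1, x_i)'
        # information: J(b) = sum_i n_i pi_i (1 - pi_i) (1, x_i)(1, x_i)'
        u0 = u1 = j00 = j01 = j11 = 0.0
        for xi, yi, ni in zip(x, y, n):
            pi = expit(b0 + b1 * xi)
            r = yi - ni * pi
            w = ni * pi * (1.0 - pi)
            u0 += r
            u1 += r * xi
            j00 += w
            j01 += w * xi
            j11 += w * xi * xi
        # update step b_new = b + J^{-1} u, with the 2x2 solve by hand
        det = j00 * j11 - j01 * j01
        b0 += (j11 * u0 - j01 * u1) / det
        b1 += (j00 * u1 - j01 * u0) / det
    return b0, b1

def binomial_deviance(x, y, n, b0, b1):
    """D = -2 (L(muhat; y) - L(y; y)) for grouped binomial data."""
    D = 0.0
    for xi, yi, ni in zip(x, y, n):
        mu = ni * expit(b0 + b1 * xi)   # fitted count
        if yi > 0:
            D += 2.0 * yi * math.log(yi / mu)
        if yi < ni:
            D += 2.0 * (ni - yi) * math.log((ni - yi) / (ni - mu))
    return D

# Hypothetical grouped data: 50 trials per group, success counts y.
x = [0.0, 1.0, 2.0, 3.0]
y = [10, 15, 21, 30]
n = [50, 50, 50, 50]
b0, b1 = fisher_scoring_logit(x, y, n)
D = binomial_deviance(x, y, n, b0, b1)
```

At convergence the likelihood equations Σ_i n_i(y_i − π_i)x_ij = 0 from the earlier slide hold (here y_i are counts, so the score is Σ_i (y_i − n_i π̂_i)x_ij ≈ 0), and D matches what glm() reports as the residual deviance.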
• The deviance is used for model checking and for inferential comparisons of models.
• For Poisson and binomial GLMs, the deviance is asymptotically χ²-distributed with N − p degrees of freedom, if the model has p parameters.
• Model comparisons: for model M1 nested within model M2 (i.e. M1 is a simplification of M2), the maximized log-likelihoods satisfy
  L(µ̂_1; y) ≤ L(µ̂_2; y),
  and the deviance difference
  D(y; µ̂_1) − D(y; µ̂_2) = −2 (L(µ̂_1; y) − L(µ̂_2; y))
  has an asymptotic χ² distribution, with degrees of freedom equal to the difference in the number of parameters of M1 and M2.

Null Deviance
• L(ȳ; y) is the likelihood of the ‘null’ model, i.e. the model consisting of the intercept only. It is nested within all more complex models.
• Null deviance: D_0 = −2 (L(ȳ; y) − L(y; y))
• The difference D_0 − D is asymptotically χ² with p − 1 degrees of freedom.

Residuals
• Deviance residuals:
  sign(y_i − µ̂_i) √d_i, where the d_i are the contributions of the individual observations to the deviance, D = Σ_i d_i
• Pearson residuals:
  e_i = (y_i − µ̂_i) / √Var(y_i)
• Both sets of residuals under-estimate the variance.

Next time: Poisson Model
P(X = x; λ) = λ^x/x! · e^(−λ)
P(X = x; θ = log λ, φ = 1) = exp( xθ − e^θ − log(x!) )

and, for comparison, the binomial again:
P(X = x) = (n choose x) π^x (1 − π)^(n−x)
log(1 − π) = −log(1 + e^θ)
P(X = x; θ = log(π/(1 − π)), φ = 1) = exp( xθ − n log(1 + e^θ) + log (n choose x) )
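The Poisson rewrite previewed above can be checked numerically: the ordinary pmf and its exponential-family form are the same function. A small Python sketch (λ = 2.5 is an arbitrary choice):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x; lambda) = lambda^x / x! * e^(-lambda)"""
    return lam ** x / math.factorial(x) * math.exp(-lam)

def poisson_expfam(x, lam):
    """The same pmf in exponential-family form:
    exp(x*theta - e^theta - log(x!)) with theta = log(lambda), phi = 1."""
    theta = math.log(lam)
    return math.exp(x * theta - math.exp(theta) - math.log(math.factorial(x)))

# the two expressions agree (up to floating point) for any x, lambda > 0:
for xv in range(6):
    assert abs(poisson_pmf(xv, 2.5) - poisson_expfam(xv, 2.5)) < 1e-12
```

Here b(θ) = e^θ, so b′(θ) = e^θ = λ = E[X] = Var[X], which is the starting point for the Poisson GLM next time.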