Generalized Linear Models: a subclass of nonlinear models with a "linear" component.

Data: $(Y_1, X_1^T), (Y_2, X_2^T), \ldots, (Y_n, X_n^T)$, where $Y_j$ is the response variable and the explanatory, or regressor, variables $X_j^T = (X_{1j}, X_{2j}, \ldots, X_{pj})$ describe the conditions under which the response $Y_j$ was obtained.

Basic Features:

1. $Y_1, Y_2, \ldots, Y_n$ are independent.

2. $Y_j$ has a distribution in the exponential family of distributions, $j = 1, 2, \ldots, n$. The joint likelihood has the canonical form
\[
L(\theta, \phi; \mathbf{Y}) = \prod_{j=1}^n f(Y_j; \theta_j, \phi),
\qquad
f(Y_j; \theta_j, \phi) = \exp\left\{ \frac{Y_j\theta_j - b(\theta_j)}{a(\phi)} + C(Y_j, \phi) \right\}
\]
for some functions $a(\cdot)$, $b(\cdot)$, and $C(\cdot)$. Here $\theta_j$ is called a canonical parameter.

3. (Systematic part of the model) There is a link function $h(\cdot)$ that links the conditional mean of $Y_j$ given $X_j = (X_{1j}, \ldots, X_{pj})^T$ to a linear function of unknown parameters, e.g.,
\[
h(E(Y_j \mid X_j)) = X_j^T\beta = \beta_1 X_{1j} + \cdots + \beta_p X_{pj}.
\]

Members of the exponential family of distributions are available in S-PLUS through the glm() function and in SAS through PROC GENMOD.

Normal distribution: $Y \sim N(\mu, \sigma^2)$
\[
f(Y; \theta, \phi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(Y-\mu)^2}{2\sigma^2} \right\}
= \exp\left\{ \frac{Y\mu - \mu^2/2}{\sigma^2} \underbrace{- \frac{1}{2}\left[ \frac{Y^2}{\sigma^2} + \log(2\pi\sigma^2) \right]}_{C(Y,\phi)} \right\}
\]
Here $E(Y) = \mu = \theta$ and $\mathrm{Var}(Y) = \sigma^2 = \phi$, with
\[
b(\theta) = \frac{\theta^2}{2}, \qquad a(\phi) = \phi.
\]

Binomial distribution: $Y \sim \mathrm{Bin}(n, \pi)$
\[
f(Y; \theta, \phi) = \binom{n}{Y}\pi^Y(1-\pi)^{n-Y}
= \left(\frac{\pi}{1-\pi}\right)^Y (1-\pi)^n \binom{n}{Y}
= \exp\left\{ Y \log\left(\frac{\pi}{1-\pi}\right) + n\log(1-\pi) + \underbrace{\log\binom{n}{Y}}_{C(Y,\phi)} \right\}
\]
Here
\[
\theta = \log\left(\frac{\pi}{1-\pi}\right) \;\Rightarrow\; \pi = \frac{e^\theta}{1+e^\theta}
\qquad\text{and}\qquad
b(\theta) = -n\log(1-\pi) = n\log(1+e^\theta).
\]
There is no dispersion parameter, so we take $\phi \equiv 1$ and $a(\phi) = 1$.

Poisson distribution: $Y \sim \mathrm{Poisson}(\lambda)$
\[
f(Y; \theta, \phi) = \frac{\lambda^Y e^{-\lambda}}{Y!}
= \exp\{ Y\log(\lambda) - \lambda \underbrace{- \log(Y!)}_{C(Y,\phi)} \}
\]
Here
\[
\theta = \log(\lambda) \;\Rightarrow\; \lambda = \exp(\theta)
\qquad\text{and}\qquad
b(\theta) = \lambda = \exp(\theta),
\]
and there is no additional dispersion parameter; we take $\phi \equiv 1$ and $a(\phi) = 1$. Then
\[
f(Y; \theta, \phi) = \exp\{ Y\theta - \exp(\theta) - \log(\Gamma(Y+1)) \}.
\]

Two-parameter gamma distribution: $Y \sim \mathrm{gamma}(\alpha, \beta)$
\[
f(Y; \alpha, \beta) = \frac{Y^{\alpha-1}\exp(-Y/\beta)}{\Gamma(\alpha)\beta^\alpha}
= \exp\left\{ -\frac{Y}{\beta} - \alpha\log(\beta) + (\alpha-1)\log(Y) - \log(\Gamma(\alpha)) \right\}
\]
\[
= \exp\left\{ \alpha\left[ Y\left(-\frac{1}{\alpha\beta}\right) + \log\left(\frac{1}{\alpha\beta}\right) \right]
+ \underbrace{(\alpha-1)\log(Y) + \alpha\log(\alpha) - \log(\Gamma(\alpha))}_{C(Y,\phi)} \right\}
\]
Here
\[
\theta = -\frac{1}{\alpha\beta}, \qquad \phi = \frac{1}{\alpha}, \qquad
b(\theta) = -\log(-\theta), \qquad a(\phi) = \phi.
\]

Inverse Gaussian distribution: $Y \sim \mathrm{IG}(\mu, \sigma^2)$
\[
f(Y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2 Y^3}} \exp\left\{ -\frac{(Y-\mu)^2}{2\mu^2\sigma^2 Y} \right\}
\quad\text{for } Y > 0,\ \mu > 0,\ \sigma^2 > 0;
\]
also $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2\mu^3$. In canonical form,
\[
f(Y; \theta, \phi) = \exp\left\{ \frac{Y\left(-\frac{1}{2\mu^2}\right) + \frac{1}{\mu}}{\sigma^2}
\underbrace{- \frac{1}{2\sigma^2 Y} - \frac{1}{2}\log(2\pi\sigma^2 Y^3)}_{C(Y,\phi)} \right\}
\]
Here
\[
\theta = -\frac{1}{2\mu^2}, \qquad b(\theta) = -\frac{1}{\mu} = -\sqrt{-2\theta},
\qquad \phi = \sigma^2 \text{ (the "dispersion parameter")}, \qquad a(\phi) = \phi = \sigma^2.
\]

Canonical form of the log-likelihood:
\[
\ell(\theta, \phi; Y) = \log(f(Y; \theta, \phi)) = \frac{Y\theta - b(\theta)}{a(\phi)} + C(Y, \phi)
\]
Partial derivatives:
\[
\text{(i)}\quad \frac{\partial\ell(\theta,\phi;Y)}{\partial\theta} = \frac{Y - b'(\theta)}{a(\phi)},
\qquad
\text{(ii)}\quad \frac{\partial^2\ell(\theta,\phi;Y)}{\partial\theta^2} = -\frac{b''(\theta)}{a(\phi)}.
\]
Apply the following results:
\[
E\left( \frac{\partial\ell(\theta,\phi;Y)}{\partial\theta} \right) = 0,
\qquad
E\left( \frac{\partial^2\ell(\theta,\phi;Y)}{\partial\theta^2} \right)
+ E\left[ \left(\frac{\partial\ell(\theta,\phi;Y)}{\partial\theta}\right)^2 \right] = 0
\]
to obtain
\[
0 = E\left( \frac{\partial\ell(\theta,\phi;Y)}{\partial\theta} \right) = \frac{E(Y) - b'(\theta)}{a(\phi)},
\]
which implies $E(Y) = b'(\theta)$. Furthermore, from (i), (ii), and the second result we have
\[
0 = E\left( \frac{\partial^2\ell(\theta,\phi;Y)}{\partial\theta^2} \right)
+ E\left( \frac{\partial\ell(\theta,\phi;Y)}{\partial\theta} \right)^2
= -\frac{b''(\theta)}{a(\phi)} + \mathrm{Var}\left( \frac{Y - b'(\theta)}{a(\phi)} \right)
= -\frac{b''(\theta)}{a(\phi)} + \frac{\mathrm{Var}(Y)}{[a(\phi)]^2},
\]
which implies
\[
\mathrm{Var}(Y) = a(\phi)\, b''(\theta),
\]
where $a(\phi)$ is a function of the dispersion parameter and $b''(\theta)$ is called the variance function.

Normal distribution: $Y \sim N(\mu, \sigma^2)$
\[
b(\theta) = \frac{\theta^2}{2}, \quad b'(\theta) = \theta, \quad b''(\theta) = 1, \quad a(\phi) = \phi = \sigma^2
\]
\[
\Rightarrow\; E(Y) = \theta = \mu
\qquad\text{and}\qquad
\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = (1)\phi = \sigma^2.
\]
The variance does not depend on the parameter that defines the mean.

Binomial distribution: $Y \sim \mathrm{Bin}(n, \pi)$, $\theta = \log(\pi/(1-\pi))$
\[
b(\theta) = n\log(1+e^\theta), \qquad
b'(\theta) = \frac{ne^\theta}{1+e^\theta} \;\Rightarrow\; E(Y) = \frac{ne^\theta}{1+e^\theta} = n\pi,
\]
\[
b''(\theta) = \frac{ne^\theta}{[1+e^\theta]^2}, \quad a(\phi) \equiv 1
\;\Rightarrow\;
\mathrm{Var}(Y) = (1)\frac{ne^\theta}{[1+e^\theta]^2}
= n\,\frac{e^\theta}{1+e^\theta}\cdot\frac{1}{1+e^\theta} = n\pi(1-\pi).
\]
Here $\mathrm{Var}(Y)$ is completely determined by the parameter that defines the mean.
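As a quick numerical sketch (not part of the original notes), the moment identities $E(Y) = b'(\theta)$ and $\mathrm{Var}(Y) = a(\phi)\,b''(\theta)$ can be checked in S-PLUS/R by differentiating $b(\theta)$ numerically for the binomial family; the values $n = 10$ and $\pi = 0.3$ below are arbitrary choices for illustration.

    # Check E(Y) = b'(theta) and Var(Y) = a(phi) b''(theta) for Y ~ Bin(n, pi)
    n <- 10
    p <- 0.3
    theta <- log(p / (1 - p))                # canonical parameter
    b <- function(t) n * log(1 + exp(t))     # b(theta) for the binomial family
    eps <- 1e-4
    b1 <- (b(theta + eps) - b(theta - eps)) / (2 * eps)             # numerical b'
    b2 <- (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps^2  # numerical b''
    c(n * p, b1)            # E(Y) = n pi = 3; b'(theta) agrees
    c(n * p * (1 - p), b2)  # Var(Y) = n pi (1 - pi) = 2.1; a(phi) = 1, so b'' agrees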
Poisson distribution: $Y \sim \mathrm{Poisson}(\lambda)$
\[
b(\theta) = \exp(\theta), \qquad
b'(\theta) = \exp(\theta) = \lambda \;\Rightarrow\; E(Y) = \lambda,
\]
\[
b''(\theta) = \exp(\theta), \quad a(\phi) = 1
\;\Rightarrow\;
\mathrm{Var}(Y) = \exp(\theta) = \lambda = E(Y).
\]

Gamma distribution:
\[
b(\theta) = -\log(-\theta), \qquad
b'(\theta) = -\frac{1}{\theta} \;\Rightarrow\; E(Y) = -\frac{1}{\theta} = \alpha\beta,
\]
\[
b''(\theta) = \frac{1}{\theta^2}, \quad a(\phi) = \phi = \frac{1}{\alpha}
\;\Rightarrow\;
\mathrm{Var}(Y) = a(\phi)\,b''(\theta) = \frac{1}{\alpha}(\alpha\beta)^2 = \alpha\beta^2 = \frac{[E(Y)]^2}{\alpha}.
\]

Canonical link function: the link function $h(\mu) = X^T\beta$ that corresponds to $\theta = X^T\beta$, where
\[
E(Y) = \mu = b'(\theta) = \frac{\partial b(\theta)}{\partial\theta}.
\]

Normal (Gaussian) distribution: with $\theta = X^T\beta$,
\[
E(Y) = \mu = b'(\theta) = \theta = X^T\beta,
\]
and the canonical link function is the identity function, $h(\mu) = \mu = X^T\beta$.

Binomial distribution: with $\theta = X^T\beta$,
\[
E(Y) = n\pi = b'(\theta) = \frac{ne^\theta}{1+e^\theta}
\;\Rightarrow\;
\theta = \log\left(\frac{\pi}{1-\pi}\right) = X^T\beta.
\]
Hence the canonical (default) link is the logit link function,
\[
h(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = X^T\beta.
\]
Other available link functions:
Probit: $h(\pi) = \Phi^{-1}(\pi) = X^T\beta$, where
\[
\pi = \Phi(X^T\beta) = \int_{-\infty}^{X^T\beta} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du.
\]
Complementary log-log link (cloglog): $h(\pi) = \log[-\log(1-\pi)] = X^T\beta$. The inverse of this link function is
\[
\pi = 1 - \exp[-\exp(X^T\beta)].
\]

The glm() function:
glm(model, family, data, weights, control)
    family = binomial
    family = binomial(link=logit)
    family = binomial(link=probit)
    family = binomial(link=cloglog)

Poisson regression (log-linear models): with $\theta = X^T\beta$,
\[
\mu = E(Y) = b'(\theta) = \exp(\theta),
\]
and the canonical link function is
\[
\theta = \log(\mu) = X^T\beta.
\]
This is called the log link function. Other available link functions:
identity link: $h(\mu) = \mu = X^T\beta$
square root link: $h(\mu) = \sqrt{\mu} = X^T\beta$
glm(model, family, data, weights, control)
    family = poisson
    family = poisson(link=log)
    family = poisson(link=identity)
    family = poisson(link=sqrt)

Gamma distribution: with $\theta = X^T\beta$,
\[
E(Y) = \mu = -\frac{1}{\theta},
\]
so the canonical link function is
\[
h(\mu) = \frac{1}{\mu} = X^T\beta
\]
(the sign is absorbed into the coefficients). This is called the "inverse" link.
glm(model, family, data, weights, control)
    family = Gamma
    family = Gamma(link=inverse)
    family = Gamma(link=identity)
    family = Gamma(link=log)

Inverse Gaussian distribution: with $\theta = X^T\beta$,
\[
\mu = E(Y) = b'(\theta) = \frac{1}{\sqrt{-2\theta}},
\]
so the canonical link is
\[
h(\mu) = \frac{1}{\mu^2} = X^T\beta
\]
(equal to $-2\theta$, the sign and constant being absorbed into the coefficients). This is the only built-in link function for the inverse Gaussian distribution.
glm(model, family, data, weights, control)
    family = inverse.gaussian
    family = inverse.gaussian(link=1/mu^2)

Normal-theory Gauss-Markov model: $\mathbf{Y} \sim N(X\beta, \sigma^2 I)$, or $Y_1, Y_2, \ldots, Y_n$ independent with $Y_j \sim N(X_j^T\beta, \sigma^2)$. Then
\[
f(Y_j; \theta_j, \phi) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{(Y_j - X_j^T\beta)^2}{2\sigma^2} \right\}
= \exp\left\{ \frac{Y_j(X_j^T\beta) - (X_j^T\beta)^2/2}{\sigma^2}
\underbrace{- \frac{1}{2}\left[ \frac{Y_j^2}{\sigma^2} + \log(2\pi\sigma^2) \right]}_{C(Y_j,\phi)} \right\}
\]
Here
\[
\theta_j = X_j^T\beta, \qquad b(\theta_j) = \frac{(X_j^T\beta)^2}{2}, \qquad \phi = \sigma^2, \qquad a(\phi) = \phi,
\]
and $E(Y_j) = \mu_j = X_j^T\beta$. The link function is the identity function, $h(\mu_j) = \mu_j = X_j^T\beta$.

The class of generalized linear models includes the class of linear models:
(i) Systematic part of the model (identity link function):
\[
\mu_j = E(Y_j) = X_j^T\beta = \beta_1X_{1j} + \cdots + \beta_pX_{pj}
\]
(ii) Stochastic part of the model: $Y_j \sim N(\mu_j, \sigma^2)$ and $Y_1, Y_2, \ldots, Y_n$ are independent.
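The link and variance functions tabulated above are carried by the family objects that glm() uses, so the correspondences can be inspected directly. A small sketch (R/S-PLUS syntax, with the output noted in comments):

    binomial()$link                # "logit" -- the canonical binomial link
    poisson()$variance(2)          # variance function V(mu) = mu; returns 2
    Gamma()$link                   # "inverse" -- the canonical gamma link
    inverse.gaussian()$link        # "1/mu^2" -- the canonical inverse Gaussian link
    # A non-canonical link and its inverse, e.g. the complementary log-log:
    cll <- binomial(link = "cloglog")
    cll$linkfun(0.5)               # log(-log(1 - 0.5))
    cll$linkinv(cll$linkfun(0.5))  # recovers 0.5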
Example 13.1: Mortality of a certain species of beetle after 5 hours of exposure to gaseous carbon disulfide (Bliss, 1935).

    Dose (mg/l)   Log(Dose) X_j   Killed Y_j   Survivors n_j - Y_j   Exposed n_j
    49.057        3.893            6            53                    59
    52.991        3.970           13            47                    60
    56.911        4.041           18            44                    62
    60.842        4.108           28            28                    56
    64.759        4.171           52            11                    63
    68.691        4.230           53             6                    59
    72.611        4.258           61             1                    62
    76.542        4.338           60             0                    60

Logistic Regression: $Y_1, Y_2, \ldots, Y_8$ are independent random counts with $Y_j \sim \mathrm{Bin}(n_j, \pi_j)$ and
\[
\log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0 + \beta_1 X_j.
\]
Comments:
* Here the logit link function is the canonical link function.
* Use of the binomial distribution follows from:
  (i) each beetle exposed to log-dose $X_j$ responds independently;
  (ii) each beetle exposed to log-dose $X_j$ has probability $\pi_j$ of dying.

Maximum Likelihood Estimation: the joint likelihood for $Y_1, Y_2, \ldots, Y_k$ is
\[
L(\beta; \mathbf{Y}, X) = \prod_{j=1}^k f(Y_j; \theta_j, \phi)
= \prod_{j=1}^k \exp\left\{ \frac{Y_j\theta_j - b(\theta_j)}{a(\phi)} + C(Y_j, \phi) \right\}
= \prod_{j=1}^k \exp\left\{ Y_j(\beta_0 + \beta_1X_j) - n_j\log(1+e^{\beta_0+\beta_1X_j}) + \log\binom{n_j}{Y_j} \right\}
\]
since
\[
\phi \equiv 1,\quad a(\phi) = 1, \qquad
\theta_j = \log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0 + \beta_1X_j,
\]
\[
b(\theta_j) = n_j\log(1+e^{\beta_0+\beta_1X_j}), \qquad
C(Y_j, \phi) = \log\binom{n_j}{Y_j}.
\]

Log-likelihood function:
\[
\ell(\beta) = \sum_{j=1}^k\left[ \frac{Y_j\theta_j - b(\theta_j)}{a(\phi)} + C(Y_j,\phi) \right]
= \sum_{j=1}^k\left[ Y_j(\beta_0+\beta_1X_j) - n_j\log(1+e^{\beta_0+\beta_1X_j}) + \log\binom{n_j}{Y_j} \right]
\]
\[
= \beta_0\sum_{j=1}^k Y_j + \beta_1\sum_{j=1}^k Y_jX_j
- \sum_{j=1}^k n_j\log(1+e^{\beta_0+\beta_1X_j}) + \sum_{j=1}^k \log\binom{n_j}{Y_j}.
\]
To maximize the log-likelihood, solve the system of equations obtained by setting the first partial derivatives equal to zero.

Estimating equations (likelihood equations):
\[
0 = \frac{\partial\ell(\hat\beta)}{\partial\hat\beta_0}
= \sum_{j=1}^8 Y_j - \sum_{j=1}^8 n_j\,\frac{\exp(\hat\beta_0+\hat\beta_1X_j)}{1+\exp(\hat\beta_0+\hat\beta_1X_j)}
= \sum_{j=1}^8 (Y_j - n_j\hat\pi_j)
\]
\[
0 = \frac{\partial\ell(\hat\beta)}{\partial\hat\beta_1}
= \sum_{j=1}^8 X_jY_j - \sum_{j=1}^8 n_jX_j\,\frac{\exp(\hat\beta_0+\hat\beta_1X_j)}{1+\exp(\hat\beta_0+\hat\beta_1X_j)}
= \sum_{j=1}^8 X_j(Y_j - n_j\hat\pi_j)
\]
Generally there is no closed-form expression for the solution. If a solution exists in the interior of the parameter space ($0 < \pi_j < 1$ for all $j = 1, \ldots, k$), then it is unique.

The likelihood equations (or score function) can be expressed as
\[
\begin{bmatrix} 0 \\ 0 \end{bmatrix} = X^T(\mathbf{Y} - \mathbf{m}) = Q
\quad\text{(the score function)},
\]
where
\[
X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_k \end{bmatrix}, \qquad
\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix}, \qquad
\mathbf{m} = \begin{bmatrix} n_1\pi_1 \\ n_2\pi_2 \\ \vdots \\ n_k\pi_k \end{bmatrix}
\quad (k = 8 \text{ here}),
\]
and from the inverse link function
\[
\pi_j = \frac{\exp(\beta_0+\beta_1X_j)}{1+\exp(\beta_0+\beta_1X_j)}.
\]
We will use $\hat Q$, $\hat{\mathbf{m}}$, and $\hat\pi_j$ to indicate these quantities evaluated at $\beta = \hat\beta$.

Second partial derivatives of the log-likelihood function (the Hessian matrix):
\[
H = \begin{bmatrix}
\frac{\partial^2\ell(\beta)}{\partial\beta_0^2} & \frac{\partial^2\ell(\beta)}{\partial\beta_0\partial\beta_1} \\
\frac{\partial^2\ell(\beta)}{\partial\beta_0\partial\beta_1} & \frac{\partial^2\ell(\beta)}{\partial\beta_1^2}
\end{bmatrix} = -X^TVX,
\]
where
\[
V = \begin{bmatrix} n_1\pi_1(1-\pi_1) & & \\ & \ddots & \\ & & n_k\pi_k(1-\pi_k) \end{bmatrix}
= \mathrm{Var}(\mathbf{Y}).
\]
We will use $\hat H = -X^T\hat VX$ and $\hat V$ to denote $H$ and $V$ evaluated at $\beta = \hat\beta$.

Fisher information matrix:
\[
I = \begin{bmatrix}
-E\frac{\partial^2\ell(\beta)}{\partial\beta_0^2} & -E\frac{\partial^2\ell(\beta)}{\partial\beta_0\partial\beta_1} \\
-E\frac{\partial^2\ell(\beta)}{\partial\beta_0\partial\beta_1} & -E\frac{\partial^2\ell(\beta)}{\partial\beta_1^2}
\end{bmatrix}
= E(X^TVX) = X^TVX = -H,
\]
since the second partial derivatives do not depend on $\mathbf{Y} = (Y_1, \ldots, Y_k)^T$.

Iterative procedures for solving the likelihood equations: start with some initial values
\[
\hat\beta^{(0)} = \begin{bmatrix} \hat\beta_0^{(0)} \\ \hat\beta_1^{(0)} \end{bmatrix}
= \begin{bmatrix} \log\left(\frac{\bar P}{1-\bar P}\right) \\ 0 \end{bmatrix},
\qquad\text{where}\qquad
\bar P = \frac{\sum_{j=1}^k Y_j}{\sum_{j=1}^k n_j}.
\]

Newton-Raphson algorithm:
\[
\hat\beta^{(s+1)} = \hat\beta^{(s)} - \hat H^{-1}\hat Q,
\]
where $\hat H$ and $\hat Q$ are evaluated at $\hat\beta^{(s)}$.

Fisher scoring algorithm:
\[
\hat\beta^{(s+1)} = \hat\beta^{(s)} + \hat I^{-1}\hat Q.
\]
For the logistic regression model with binomial counts, $\hat I = -\hat H$, so the two algorithms coincide.

Modification:
- Check whether the log-likelihood is larger at $\hat\beta^{(s+1)}$ than at $\hat\beta^{(s)}$.
- If it is, go to the next iteration.
- If not, invoke a halving step:
\[
\hat\beta^{(s+1)}_{\mathrm{new}} = \frac{1}{2}\left( \hat\beta^{(s)} + \hat\beta^{(s+1)}_{\mathrm{old}} \right).
\]
A Fisher scoring sketch for this fit appears below.
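As referenced above, here is a from-scratch Fisher scoring sketch for the beetle logit fit, using the Example 13.1 data, with glm() shown at the end as a check. This is an illustration of the algorithm, not code from the original notes.

    logdose <- c(3.893, 3.970, 4.041, 4.108, 4.171, 4.230, 4.258, 4.338)
    killed  <- c(6, 13, 18, 28, 52, 53, 61, 60)
    exposed <- c(59, 60, 62, 56, 63, 59, 62, 60)

    X    <- cbind(1, logdose)                     # model matrix
    Pbar <- sum(killed) / sum(exposed)            # pooled proportion killed
    beta <- c(log(Pbar / (1 - Pbar)), 0)          # starting values from above
    for (s in 1:25) {
      eta  <- as.vector(X %*% beta)
      p    <- exp(eta) / (1 + exp(eta))           # pi_j at the current beta
      Q    <- t(X) %*% (killed - exposed * p)     # score X'(Y - m)
      info <- t(X) %*% (exposed * p * (1 - p) * X) # information X'VX
      beta <- beta + solve(info, Q)               # Fisher scoring step
    }
    round(beta, 3)                                # approximately (-60.717, 14.883)

    # The same fit with glm():
    fit <- glm(cbind(killed, exposed - killed) ~ logdose, family = binomial)
    coef(fit)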
Large-sample inference (see Results 9.1-9.3): From Result 9.2,
\[
\hat\beta \;\dot\sim\; N\left(\beta,\ (X^TVX)^{-1}\right)
\]
when $n_1, n_2, \ldots, n_k$ are all large; here "large" means $n_j\pi_j > 5$ and $n_j(1-\pi_j) > 5$ for each $j$. Estimate the covariance matrix as
\[
\widehat{\mathrm{Var}}(\hat\beta) = (X^T\hat VX)^{-1},
\]
where $\hat V$ is $V$ evaluated at $\hat\beta$.

For the Bliss beetle data:
\[
\hat\beta = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix}
= \begin{bmatrix} -60.717 \\ 14.883 \end{bmatrix}
\qquad\text{with}\qquad
\widehat{\mathrm{Var}}(\hat\beta)
= \begin{bmatrix} S^2_{\hat\beta_0} & S_{\hat\beta_0,\hat\beta_1} \\ S_{\hat\beta_0,\hat\beta_1} & S^2_{\hat\beta_1} \end{bmatrix}
= \begin{bmatrix} 26.84 & -6.55 \\ -6.55 & 1.60 \end{bmatrix}.
\]

Wald tests: Test $H_0: \beta_0 = 0$ vs. $H_A: \beta_0 \neq 0$:
\[
X^2 = \left(\frac{\hat\beta_0 - 0}{S_{\hat\beta_0}}\right)^2 = \left(\frac{-60.717}{5.18}\right)^2 = 137.36,
\qquad \text{p-value} = \Pr\{\chi^2_{(1)} > 137.36\} < .0001.
\]
Test $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$:
\[
X^2 = \left(\frac{\hat\beta_1 - 0}{S_{\hat\beta_1}}\right)^2 = \left(\frac{14.883}{1.265}\right)^2 = 138.49,
\qquad \text{p-value} = \Pr\{\chi^2_{(1)} > 138.49\} < .0001.
\]

Approximate $(1-\alpha)\times 100\%$ confidence intervals:
\[
\hat\beta_0 \pm Z_{\alpha/2}S_{\hat\beta_0}, \qquad \hat\beta_1 \pm Z_{\alpha/2}S_{\hat\beta_1},
\]
where $Z_{\alpha/2}$ is from the standard normal distribution, e.g. $Z_{.025} = 1.96$.

Likelihood ratio (deviance) tests:
Model A: $\log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0$ and $Y_j \sim \mathrm{Bin}(n_j, \pi_j)$
Model B: $\log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0 + \beta_1X_j$ and $Y_j \sim \mathrm{Bin}(n_j, \pi_j)$
Model C: $\log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0 + \delta_j$ and $Y_j \sim \mathrm{Bin}(n_j, \pi_j)$
Model C simply says that $0 < \pi_j < 1$, $j = 1, \ldots, k$.

Null deviance:
\[
\mathrm{Deviance}_{\mathrm{Null}}
= -2\left[ \text{log-likelihood}_{\text{Model A}} - \text{log-likelihood}_{\text{Model C}} \right]
= 2\sum_{j=1}^k\left[ Y_j\log\left(\frac{P_j}{\bar P}\right) + (n_j - Y_j)\log\left(\frac{1-P_j}{1-\bar P}\right) \right]
\]
with $(k-1)$ d.f., where
\[
P_j = \frac{Y_j}{n_j},\ j = 1, \ldots, k,
\qquad
\bar P = \frac{\sum_{j=1}^k n_jP_j}{\sum_{j=1}^k n_j} = \frac{\sum_{j=1}^k Y_j}{\sum_{j=1}^k n_j}
\]
are the m.l.e.'s of $\pi_j$ under Models C and A, respectively.

Log-dose deviance (the deviance explained by adding log-dose to the model):
\[
\mathrm{Deviance}_{\text{log-dose}}
= -2\left[ \text{log-likelihood}_{\text{Model A}} - \text{log-likelihood}_{\text{Model B}} \right]
= 2\sum_{j=1}^k\left[ Y_j\log\left(\frac{\hat\pi_j}{\bar P}\right) + (n_j - Y_j)\log\left(\frac{1-\hat\pi_j}{1-\bar P}\right) \right]
= 272.7803
\]
with $(2-1) = 1$ d.f., where
\[
\hat\pi_j = \frac{\exp(\hat\beta_0+\hat\beta_1X_j)}{1+\exp(\hat\beta_0+\hat\beta_1X_j)}
\qquad\text{and}\qquad
\bar P = \frac{\sum_{j=1}^k Y_j}{\sum_{j=1}^k n_j}.
\]

Lack-of-fit test:
\[
H_0: \log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0 + \beta_1X_j \ \text{(Model B)}
\qquad\text{vs.}\qquad
H_A: \log\left(\frac{\pi_j}{1-\pi_j}\right) = \beta_0 + \delta_j \ \text{(Model C)}
\]
\[
G^2 = \mathrm{deviance}_{\text{lack-of-fit}}
= \mathrm{deviance}_{\mathrm{Null}} - \mathrm{deviance}_{\text{log-dose}}
= 2\sum_{j=1}^k\left[ Y_j\log\left(\frac{P_j}{\hat\pi_j}\right) + (n_j-Y_j)\log\left(\frac{1-P_j}{1-\hat\pi_j}\right) \right]
= 11.42
\]
on $[(k-1) - 1] = k - 2$ d.f. ($= 8 - 2 = 6$ d.f.), p-value $= 0.076$.

Pearson chi-square test:
\[
X^2 = \sum_{j=1}^k \frac{(Y_j - n_j\hat\pi_j)^2}{n_j\hat\pi_j}
+ \sum_{j=1}^k \frac{(n_j - Y_j - n_j(1-\hat\pi_j))^2}{n_j(1-\hat\pi_j)}
= \sum_{j=1}^k n_j\frac{(P_j - \hat\pi_j)^2}{\hat\pi_j}
+ \sum_{j=1}^k n_j\frac{([1-P_j] - [1-\hat\pi_j])^2}{1-\hat\pi_j}
= 10.223
\]
with d.f. $= k - 2 = 6$, p-value $= .1155$.

Estimates of expected counts under Model B:

    j    n_j pihat_j    n_j (1 - pihat_j)
    1        3.43            55.57
    2        9.78            50.22
    3       22.66            39.34
    4       33.83            22.17
    5       50.05            12.95
    6       53.27             5.73
    7       59.22             2.78
    8       58.74             1.26

Cochran's rule: all expected counts should be larger than one, and at least 80% should be larger than 5.

Estimate mortality rates:
\[
\hat\pi_j = \frac{\exp(\hat\beta_0+\hat\beta_1X_j)}{1+\exp(\hat\beta_0+\hat\beta_1X_j)}.
\]
Use the delta method to compute an approximate standard error:
\[
\frac{\partial\hat\pi_j}{\partial\hat\beta_0}
= \frac{\exp(\hat\beta_0+\hat\beta_1X_j)}{[1+\exp(\hat\beta_0+\hat\beta_1X_j)]^2} = \hat\pi_j(1-\hat\pi_j),
\qquad
\frac{\partial\hat\pi_j}{\partial\hat\beta_1}
= \frac{X_j\exp(\hat\beta_0+\hat\beta_1X_j)}{[1+\exp(\hat\beta_0+\hat\beta_1X_j)]^2} = X_j\hat\pi_j(1-\hat\pi_j).
\]
Define
\[
G_j = \left[ \frac{\partial\hat\pi_j}{\partial\hat\beta_0}\ \ \frac{\partial\hat\pi_j}{\partial\hat\beta_1} \right]
= \left[ \hat\pi_j(1-\hat\pi_j)\ \ \ X_j\hat\pi_j(1-\hat\pi_j) \right].
\]
Then
\[
S^2_{\hat\pi_j} = G_j\left[\widehat{\mathrm{Var}}(\hat\beta)\right]G_j^T,
\]
and an approximate standard error is
\[
S_{\hat\pi_j} = \sqrt{ G_j\left[\widehat{\mathrm{Var}}(\hat\beta)\right]G_j^T }.
\]
An approximate $(1-\alpha)\times 100\%$ confidence interval for $\pi_j$ is $\hat\pi_j \pm Z_{\alpha/2}S_{\hat\pi_j}$.

Estimate the log-dose that provides a certain mortality rate, say $\pi_0$:
\[
\log\left(\frac{\pi_0}{1-\pi_0}\right) = \beta_0 + \beta_1X_0
\;\Rightarrow\;
X_0 = \frac{\log\left(\frac{\pi_0}{1-\pi_0}\right) - \beta_0}{\beta_1}.
\]
The m.l.e. for $X_0$ is
\[
\hat X_0 = \frac{\log\left(\frac{\pi_0}{1-\pi_0}\right) - \hat\beta_0}{\hat\beta_1}.
\]
Use the delta method to obtain an approximate standard error:
\[
\frac{\partial\hat X_0}{\partial\hat\beta_0} = -\frac{1}{\hat\beta_1},
\qquad
\frac{\partial\hat X_0}{\partial\hat\beta_1}
= -\frac{\log\left(\frac{\pi_0}{1-\pi_0}\right) - \hat\beta_0}{\hat\beta_1^2}
= -\frac{\hat X_0}{\hat\beta_1}.
\]
Define $\hat G = \left[ \frac{\partial\hat X_0}{\partial\hat\beta_0}\ \ \frac{\partial\hat X_0}{\partial\hat\beta_1} \right]$ and compute
\[
S^2_{\hat X_0} = \hat G\left[\widehat{\mathrm{Var}}(\hat\beta)\right]\hat G^T.
\]
The standard error is
\[
S_{\hat X_0} = \sqrt{ \hat G\left[\widehat{\mathrm{Var}}(\hat\beta)\right]\hat G^T },
\]
and an approximate $(1-\alpha)\times 100\%$ confidence interval for $X_0$ is $\hat X_0 \pm Z_{\alpha/2}S_{\hat X_0}$, as sketched in code below.
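A sketch of this delta-method computation for the log-dose attaining a given mortality rate, continuing from the fit object in the earlier sketch; here $\pi_0 = 0.5$ (the LD50) is chosen for illustration.

    b   <- coef(fit)                    # (beta0hat, beta1hat) from the beetle fit
    V   <- vcov(fit)                    # estimated Var(betahat)
    pi0 <- 0.5
    X0  <- unname((log(pi0 / (1 - pi0)) - b[1]) / b[2])  # m.l.e. of X0
    G   <- c(-1 / b[2], -X0 / b[2])                      # delta-method gradient
    seX0 <- sqrt(as.numeric(t(G) %*% V %*% G))
    c(estimate = X0, lower = X0 - 1.96 * seX0, upper = X0 + 1.96 * seX0)
    # On the dose scale: exp(X0), with standard error exp(X0) * seX0.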
Suppose you want a 95% confidence interval for the dose that provides a mortality rate of $\pi_0$. Since $X_0 = \log(\mathrm{dose}_0)$, we have $\mathrm{dose}_0 = \exp(X_0)$, and the m.l.e. for the dose is
\[
\widehat{\mathrm{dose}}_0 = \exp(\hat X_0).
\]
Apply the delta method:
\[
\frac{\partial\exp(\hat X_0)}{\partial\hat X_0} = \exp(\hat X_0).
\]
Then
\[
S^2_{\widehat{\mathrm{dose}}_0} = [\exp(\hat X_0)]^2\,\widehat{\mathrm{Var}}(\hat X_0),
\qquad
S_{\widehat{\mathrm{dose}}_0} = [\exp(\hat X_0)]\,S_{\hat X_0} = \widehat{\mathrm{dose}}_0\, S_{\hat X_0},
\]
and an approximate $(1-\alpha)\times 100\%$ confidence interval for $\mathrm{dose}_0$ is
\[
\exp(\hat X_0) \pm Z_{\alpha/2}[\exp(\hat X_0)]S_{\hat X_0}.
\]

Fit a probit model to the Bliss beetle data:
\[
\Phi^{-1}(\pi_j) = \beta_0 + \beta_1X_j,
\]
where $X_j = \log(\mathrm{dose})$ and $Y_j \sim \mathrm{Bin}(n_j, \pi_j)$, $j = 1, 2, \ldots, k$. Note that
\[
\pi_j = \int_{-\infty}^{\beta_0+\beta_1X_j} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt.
\]

Fit the complementary log-log model to the Bliss beetle data:
\[
\log[-\log(1-\pi_j)] = \beta_0 + \beta_1X_j,
\]
where $X_j = \log(\mathrm{dose})$ and $Y_j \sim \mathrm{Bin}(n_j, \pi_j)$. Note that
\[
\pi_j = 1 - \exp[-e^{\beta_0+\beta_1X_j}], \qquad j = 1, 2, \ldots, k.
\]

Poisson Regression (log-linear models)

Example 13.2: $Y_{jk}$ = number of TA98 salmonella colonies on the $k$-th plate at the $j$-th dosage level of quinoline; $X_j = \log_{10}(\text{dose of quinoline }(\mu\mathrm{g}))$ with
\[
X_1 = 0, \quad X_2 = 1.0, \quad X_3 = 1.5, \quad X_4 = 2.0, \quad X_5 = 2.5,
\]
and $k = 1, 2, 3$ plates at each dose level.

Independent Poisson counts: $Y_{jk} \sim \mathrm{Poisson}(m_j)$, $j = 1, \ldots, 5$, $k = 1, \ldots, 3$, where
\[
\Pr\{Y_{jk} = y\} = \frac{m_j^y e^{-m_j}}{y!} \quad\text{for } y = 0, 1, 2, \ldots,
\]
and $E(Y_{jk}) = m_j = \mathrm{Var}(Y_{jk})$. Log-link function:
\[
\log(m_j) = \beta_0 + \beta_1X_j.
\]

Joint likelihood function:
\[
L(\beta) = \prod_{j=1}^5\prod_{k=1}^3 \frac{m_j^{Y_{jk}}e^{-m_j}}{Y_{jk}!}
\]
Log-likelihood function:
\[
\ell(\beta) = \sum_{j=1}^5\sum_{k=1}^3\left[ Y_{jk}\log(m_j) - m_j - \log(Y_{jk}!) \right]
= \sum_{j=1}^5\sum_{k=1}^3\left[ Y_{jk}(\beta_0+\beta_1X_j) - e^{\beta_0+\beta_1X_j} - \log(Y_{jk}!) \right]
\]
\[
= \beta_0Y_{\cdot\cdot} + \beta_1\sum_{j=1}^5 X_jY_{j\cdot}
- 3\sum_{j=1}^5 e^{\beta_0+\beta_1X_j} - \sum_{j=1}^5\sum_{k=1}^3\log(Y_{jk}!).
\]

Likelihood equations:
\[
0 = \frac{\partial\ell(\beta)}{\partial\beta_0} = Y_{\cdot\cdot} - 3\sum_{j=1}^5 e^{\beta_0+\beta_1X_j},
\qquad
0 = \frac{\partial\ell(\beta)}{\partial\beta_1} = \sum_{j=1}^5 X_jY_{j\cdot} - 3\sum_{j=1}^5 X_je^{\beta_0+\beta_1X_j}.
\]
Use the Fisher scoring algorithm, with
\[
\mathbf{Y} = \begin{bmatrix} Y_{11} \\ Y_{12} \\ \vdots \\ Y_{53} \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & X_1 \\ 1 & X_1 \\ \vdots & \vdots \\ 1 & X_5 \end{bmatrix}, \qquad
Q = \begin{bmatrix} \frac{\partial\ell(\beta)}{\partial\beta_0} \\ \frac{\partial\ell(\beta)}{\partial\beta_1} \end{bmatrix}
= X^T(\mathbf{Y} - \mathbf{m}),
\]
where
\[
\mathbf{m} = \begin{bmatrix} m_1 \\ m_1 \\ m_1 \\ m_2 \\ \vdots \\ m_5 \end{bmatrix} = \exp(X\beta).
\]

Second partial derivatives of $\ell(\beta)$:
\[
\frac{\partial^2\ell(\beta)}{\partial\beta_0^2} = -3\sum_{j=1}^5 e^{\beta_0+\beta_1X_j} = -\sum_j\sum_k m_j,
\]
\[
\frac{\partial^2\ell(\beta)}{\partial\beta_0\partial\beta_1} = -3\sum_{j=1}^5 X_je^{\beta_0+\beta_1X_j} = -\sum_j\sum_k X_jm_j,
\]
\[
\frac{\partial^2\ell(\beta)}{\partial\beta_1^2} = -3\sum_{j=1}^5 X_j^2e^{\beta_0+\beta_1X_j} = -\sum_j\sum_k X_j^2m_j.
\]

Fisher information matrix:
\[
I = \begin{bmatrix}
-E\frac{\partial^2\ell(\beta)}{\partial\beta_0^2} & -E\frac{\partial^2\ell(\beta)}{\partial\beta_0\partial\beta_1} \\
-E\frac{\partial^2\ell(\beta)}{\partial\beta_0\partial\beta_1} & -E\frac{\partial^2\ell(\beta)}{\partial\beta_1^2}
\end{bmatrix}
= \begin{bmatrix}
\sum_j\sum_k m_j & \sum_j\sum_k X_jm_j \\
\sum_j\sum_k X_jm_j & \sum_j\sum_k X_j^2m_j
\end{bmatrix}
= X^T\,\mathrm{diag}(\mathbf{m})\,X,
\]
where
\[
\mathrm{diag}(\mathbf{m}) = \mathrm{diag}(m_1, m_1, m_1, m_2, \ldots, m_5) = \mathrm{Var}(\mathbf{Y}).
\]

Iterations:
\[
\hat\beta^{(s+1)} = \hat\beta^{(s)} + \hat I^{-1}\hat Q,
\]
using halving steps as needed. Starting values: use the m.l.e.'s for the model with $\beta_1 = 0$, i.e., $Y_{jk} \sim \mathrm{Poisson}(m)$ i.i.d. with $m = e^{\beta_0}$. Then
\[
\hat m^{(0)} = \bar Y_{\cdot\cdot}, \qquad \hat\beta_0^{(0)} = \log(\bar Y_{\cdot\cdot}), \qquad \hat\beta_1^{(0)} = 0.
\]

Large-sample normal approximation for the distribution of $\hat\beta$:
\[
\hat\beta \;\dot\sim\; N(\beta,\ I^{-1});
\]
estimate the covariance matrix by substituting $\hat\beta$ for $\beta$ in $I$ to obtain $\hat I^{-1}$.

Estimate the mean number of colonies for a specific $\log_{10}(\text{dose})$ of quinoline, say $X$. The m.l.e. is
\[
\hat m = e^{\hat\beta_0+\hat\beta_1X}.
\]
Apply the delta method:
\[
\frac{\partial\hat m}{\partial\hat\beta_0} = e^{\hat\beta_0+\hat\beta_1X} = \hat m,
\qquad
\frac{\partial\hat m}{\partial\hat\beta_1} = Xe^{\hat\beta_0+\hat\beta_1X} = X\hat m.
\]
Define
\[
\hat G = \left[ \frac{\partial\hat m}{\partial\hat\beta_0}\ \ \frac{\partial\hat m}{\partial\hat\beta_1} \right]
= [\hat m\ \ \ X\hat m].
\]
Then the estimated variance of $\hat m = e^{\hat\beta_0+\hat\beta_1X}$ is
\[
S^2_{\hat m} = \hat G\hat I^{-1}\hat G^T = \hat G\left[ X^T\mathrm{diag}(\hat{\mathbf{m}})X \right]^{-1}\hat G^T.
\]
The standard error is $S_{\hat m} = \sqrt{S^2_{\hat m}}$, and an approximate $(1-\alpha)\times 100\%$ confidence interval is $\hat m \pm Z_{\alpha/2}S_{\hat m}$.

Four kinds of residuals (a code sketch extracting each type follows below):

Response residuals:
\[
Y_i - \hat\mu_i
\qquad\left(\text{for the binomial family on the proportion scale: } \frac{Y_i}{n_i} - \hat\pi_i\right)
\]

Pearson residuals:
\[
\frac{Y_i - \hat\mu_i}{\sqrt{\widehat{\mathrm{Var}}(Y_i)}}
= \frac{Y_i - n_i\hat\pi_i}{\sqrt{n_i\hat\pi_i(1-\hat\pi_i)}} \ \text{(binomial)},
\qquad
= \frac{Y_i - \hat\lambda_i}{\sqrt{\hat\lambda_i}} \ \text{(Poisson)}.
\]

Working residuals:
\[
(Y_i - \hat\mu_i)\frac{\partial\hat\eta_i}{\partial\hat\mu_i}
= \frac{Y_i - n_i\hat\pi_i}{n_i\hat\pi_i(1-\hat\pi_i)} \ \text{(binomial)},
\qquad
= \frac{Y_i - \hat\lambda_i}{\hat\lambda_i} \ \text{(Poisson)}.
\]
For the binomial family with logit link,
\[
\eta_i = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \log\left(\frac{n_i\pi_i}{n_i(1-\pi_i)}\right) = X_i^T\beta,
\qquad
\mu_i = n_i\pi_i = n_i\,\frac{\exp(\eta_i)}{1+\exp(\eta_i)} = n_i\,\frac{\exp(X_i^T\beta)}{1+\exp(X_i^T\beta)},
\]
\[
\frac{\partial\mu_i}{\partial\eta_i} = \frac{n_i\exp(\eta_i)}{[1+\exp(\eta_i)]^2} = n_i\pi_i(1-\pi_i).
\]

Deviance residuals:
\[
e_i = \mathrm{sign}(Y_i - \hat\mu_i)\sqrt{|d_i|},
\]
where $d_i$ is the contribution of the $i$-th case to the deviance. For the binomial family,
\[
G^2 = 2\sum_{i=1}^k\left[ Y_i\log\left(\frac{Y_i}{n_i\hat\pi_i}\right)
+ (n_i-Y_i)\log\left(\frac{n_i-Y_i}{n_i(1-\hat\pi_i)}\right) \right]
\]
and
\[
e_i = \mathrm{sign}(Y_i - n_i\hat\pi_i)
\sqrt{ 2\left[ Y_i\log\left(\frac{Y_i}{n_i\hat\pi_i}\right)
+ (n_i-Y_i)\log\left(\frac{n_i-Y_i}{n_i(1-\hat\pi_i)}\right) \right] }.
\]
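As noted above, all four residual types are available from a fitted glm object via residuals(); a sketch using the beetle fit from the earlier code:

    residuals(fit, type = "response")  # Y_i/n_i - pihat_i (proportion scale for this fit)
    residuals(fit, type = "pearson")   # (Y_i - n_i pihat_i)/sqrt(n_i pihat_i (1 - pihat_i))
    residuals(fit, type = "working")   # (Y_i - mu_i) d(eta)/d(mu), used inside the IRLS fit
    residuals(fit, type = "deviance")  # sign(Y_i - n_i pihat_i) sqrt(|d_i|); the default type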
For the Poisson family, with
\[
\log(\lambda_i) = \beta_0 + \beta_1X_{1i} + \cdots
\qquad\text{and}\qquad
\hat\lambda_i = \exp(\hat\beta_0 + \hat\beta_1X_{1i} + \cdots),
\]
the deviance is
\[
G^2 = 2\sum_{i=1}^k Y_i\log\left(\frac{Y_i}{\hat\lambda_i}\right),
\]
and the deviance residuals are
\[
e_i = \mathrm{sign}(Y_i - \hat\lambda_i)\sqrt{ 2Y_i\log\left(\frac{Y_i}{\hat\lambda_i}\right) }.
\]
(When the model contains an intercept, $\sum_i(Y_i - \hat\lambda_i) = 0$, so the term $-2\sum_i(Y_i - \hat\lambda_i)$ drops out of the total deviance.)
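A closing sketch of the Example 13.2 Poisson fit in glm(). The colony counts below are placeholders (hypothetical; the notes do not reproduce the data), so the output is illustrative only. Note also that R computes each per-case deviance contribution as $d_i = 2[Y_i\log(Y_i/\hat\lambda_i) - (Y_i - \hat\lambda_i)]$; as noted above, the $(Y_i - \hat\lambda_i)$ terms sum to zero for a model with an intercept, recovering the $G^2$ formula.

    logdose  <- rep(c(0, 1.0, 1.5, 2.0, 2.5), each = 3)  # X_j, 3 plates per dose
    colonies <- c(15, 16, 14, 20, 18, 21, 24, 22, 25,    # hypothetical counts,
                  27, 30, 26, 33, 31, 35)                 # for illustration only
    pfit <- glm(colonies ~ logdose, family = poisson)     # log link is the default
    coef(pfit)                                 # (beta0hat, beta1hat)
    residuals(pfit, type = "deviance")         # e_i as defined above
    sum(residuals(pfit, type = "deviance")^2)  # equals deviance(pfit)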