18.650 – Fundamentals of Statistics
7. Generalized linear models

Linear model

A linear model assumes

    Y | X = x ∼ N(µ(x), σ²I)

and

    E[Y | X = x] = µ(x) = xᵀβ.¹

¹ Throughout we drop the boldface notation for vectors.

Components of a linear model

The two model components (that we are going to relax) are:
1. Random component: the response variable Y is continuous and Y | X = x is Gaussian with mean µ(x).
2. Regression function: µ(x) = xᵀβ.

Kyphosis

The Kyphosis data consist of measurements on 81 children following corrective spinal surgery. The binary response variable, Y, indicates the presence or absence of a postoperative deformity. The three covariates are:
- X⁽¹⁾: age of the child in months,
- X⁽²⁾: number of vertebrae involved in the operation, and
- X⁽³⁾: start of the range of vertebrae involved.
Write X = (1, X⁽¹⁾, X⁽²⁾, X⁽³⁾)ᵀ ∈ ℝ⁴.

Kyphosis

- The response variable is binary, so there is no choice: Y | X = x is Bernoulli with expected value µ(x) = E[Y | X = x] ∈ (0, 1).
- We cannot write µ(x) = xᵀβ, because the right-hand side ranges through all of ℝ.
- We need an invertible function f such that f(xᵀβ) ∈ (0, 1).

Generalization

A generalized linear model (GLM) generalizes normal linear regression models in the following directions.
1. Random component: Y | X = x ∼ some distribution (e.g. Bernoulli, exponential, Poisson).
2. Regression function:

    g(µ(x)) = xᵀβ,

where g is called the link function and µ(x) = E[Y | X = x] is the conditional mean.

Predator/Prey

Consider the following model for the number of prey Y that a predator (a hawk) catches per day, given a number X of prey (mice) in its hunting territory.

Random component: Y ≥ 0, and the variance of the capture rate is known to be approximately equal to its expectation, so we propose the following model:

    Y | X = x ∼ Poisson(µ(x)),

where µ(x) = E[Y | X = x].

Regression function: we assume

    µ(x) = mx / (h + x),  for some unknown m, h > 0,

where:
- m is the maximum expected number of prey per day that the predator can cope with,
- h is the number of prey such that µ(h) = m/2.

[Figure: the regression function µ(x) for m = h = 10.]

Example 2: Prey Capture Rate

Obviously µ(x) is not linear, but using the reciprocal link g(x) = 1/x, the right-hand side can be made linear in the parameters:

    g(µ(x)) = 1/µ(x) = (h + x)/(mx) = β₀ + β₁ · (1/x),

with β₀ = 1/m and β₁ = h/m.

Exponential Family

A family of distributions {P_θ : θ ∈ Θ}, Θ ⊂ ℝᵏ, is said to be a k-parameter exponential family on ℝ^q if there exist real-valued functions
- η₁, η₂, ..., η_k and B of θ,
- T₁, T₂, ..., T_k and h of y ∈ ℝ^q,
such that the density function (pmf or pdf) of P_θ can be written as

    f_θ(y) = exp[ Σᵢ₌₁ᵏ ηᵢ(θ) Tᵢ(y) − B(θ) ] · h(y).

Normal distribution example

- Consider Y ∼ N(µ, σ²), θ = (µ, σ²). The density is

    f_θ(y) = exp( (µ/σ²) y − (1/(2σ²)) y² − µ²/(2σ²) ) · 1/(√(2π) σ),

  which forms a two-parameter exponential family with

    η₁ = µ/σ², η₂ = −1/(2σ²), T₁(y) = y, T₂(y) = y²,
    B(θ) = µ²/(2σ²) + log(√(2π) σ), h(y) = 1.

- When σ² is known, it becomes a one-parameter exponential family on ℝ:

    η = µ/σ², T(y) = y, B(θ) = µ²/(2σ²), h(y) = e^{−y²/(2σ²)} / (√(2π) σ).

Examples of discrete distributions

The following distributions form discrete exponential families of distributions with pmf
- Bernoulli(p): p^y (1 − p)^{1−y}, y ∈ {0, 1},
- Poisson(λ): λ^y e^{−λ} / y!, y = 0, 1, ....
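As a quick numerical sanity check (not part of the original slides), the following Python sketch verifies that these two pmfs match the exponential-family form exp(η(θ)T(y) − B(θ)) · h(y); the helper names are illustrative only.

```python
import math

def bernoulli_pmf(y, p):
    return p**y * (1 - p)**(1 - y)

def bernoulli_expfam(y, p):
    # eta(p) = log(p/(1-p)), T(y) = y, B(p) = -log(1-p), h(y) = 1
    eta = math.log(p / (1 - p))
    B = -math.log(1 - p)
    return math.exp(eta * y - B)

def poisson_pmf(y, lam):
    return lam**y * math.exp(-lam) / math.factorial(y)

def poisson_expfam(y, lam):
    # eta(lam) = log(lam), T(y) = y, B(lam) = lam, h(y) = 1/y!
    return math.exp(math.log(lam) * y - lam) / math.factorial(y)

for y in (0, 1):
    assert abs(bernoulli_pmf(y, 0.3) - bernoulli_expfam(y, 0.3)) < 1e-12
for y in range(6):
    assert abs(poisson_pmf(y, 2.5) - poisson_expfam(y, 2.5)) < 1e-12
print("exponential-family forms agree with the standard pmfs")
```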
Examples of continuous distributions

The following distributions form continuous exponential families of distributions with pdf:
- Gamma(a, b): y^{a−1} e^{−y/b} / (Γ(a) bᵃ);
  - a: shape parameter, b: scale parameter;
  - reparametrize with µ = ab (mean parameter): (a/µ)ᵃ y^{a−1} e^{−ay/µ} / Γ(a).
- Inverse Gamma(α, β): (β^α / Γ(α)) y^{−α−1} e^{−β/y}.
- Inverse Gaussian(µ, σ²): √(σ² / (2πy³)) e^{−σ²(y−µ)² / (2µ²y)}.
Others: Chi-square, Beta, Binomial, Negative Binomial distributions.

One-parameter canonical exponential family

- Canonical exponential family for k = 1, y ∈ ℝ:

    f_θ(y) = exp( (yθ − b(θ))/φ + c(y, φ) )

  for some known functions b(·) and c(·, ·).
- If φ is known, this is a one-parameter exponential family with θ being the canonical parameter.
- If φ is unknown, this may or may not be a two-parameter exponential family.
- φ is called the dispersion parameter.
- In this class, we always assume that φ is known.

Normal distribution example

- Consider the following normal density function with known variance σ²:

    f_θ(y) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}
           = exp{ (yµ − µ²/2)/σ² − (1/2)(y²/σ² + log(2πσ²)) }.

- Therefore θ = µ, φ = σ², b(θ) = θ²/2, and c(y, φ) = −(1/2)(y²/φ + log(2πφ)).

Other distributions

Table 1: Exponential family notation.

    Distribution       Range of y     φ     b(θ)            c(y, φ)
    Normal N(µ, σ²)    (−∞, ∞)        σ²    θ²/2            −(1/2)(y²/φ + log(2πφ))
    Poisson P(µ)       {0, 1, 2, …}   1     e^θ             −log(y!)
    Bernoulli B(p)     {0, 1}         1     log(1 + e^θ)    0

Likelihood

Let ℓ(θ) = log f_θ(Y) denote the log-likelihood function. The mean E[Y] and the variance var(Y) can be derived from the following identities.
- First identity:

    E[∂ℓ/∂θ] = 0.

- Second identity:

    E[∂²ℓ/∂θ²] + E[(∂ℓ/∂θ)²] = 0.

Expected value

Note that

    ℓ(θ) = (Yθ − b(θ))/φ + c(Y, φ).

Therefore

    ∂ℓ/∂θ = (Y − b′(θ))/φ.

It yields

    0 = E[∂ℓ/∂θ] = (E[Y] − b′(θ))/φ,

which leads to E[Y] = b′(θ).

Variance

On the other hand we have

    ∂²ℓ/∂θ² + (∂ℓ/∂θ)² = −b″(θ)/φ + ( (Y − b′(θ))/φ )²,

and from the previous result,

    Y − b′(θ) = Y − E[Y].

Together with the second identity, this yields

    0 = −b″(θ)/φ + var(Y)/φ²,

which leads to var(Y) = φ b″(θ).

Example: Poisson distribution

Consider a Poisson likelihood,

    f(y) = e^{−µ} µ^y / y! = exp( y log µ − µ − log(y!) ).

Thus

    θ = log µ,  b(θ) = µ,  φ = 1,  c(y, φ) = −log(y!).

So µ = e^θ, b(θ) = e^θ, and b′(θ) = b″(θ) = e^θ, consistent with E[Y] = var(Y) = µ.

Link function

- β is the parameter of interest, and it needs to appear somehow in the likelihood function for us to use maximum likelihood.
- A link function g relates the linear predictor Xᵀβ to the mean parameter µ:

    Xᵀβ = g(µ).

- g is required to be monotone increasing and differentiable, so that

    µ = g⁻¹(Xᵀβ).

Examples of link functions

- For the LM, g(·) = identity.
- Poisson data. Suppose Y | X ∼ Poisson(µ(X)).
  - µ(X) > 0;
  - log(µ(X)) = Xᵀβ;
  - in general, a link function for count data should map (0, +∞) to ℝ;
  - the log link is a natural one.
- Bernoulli/Binomial data.
  - 0 < µ < 1;
  - g should map (0, 1) to ℝ;
  - 3 choices:
    1. logit: log( µ(X)/(1 − µ(X)) ) = Xᵀβ;
    2. probit: Φ⁻¹(µ(X)) = Xᵀβ, where Φ(·) is the standard normal cdf;
    3. complementary log-log: log(−log(1 − µ(X))) = Xᵀβ.
  - The logit link is the natural choice.

Examples of link functions for Bernoulli response

[Figure: in blue, g₁(x) = f₁⁻¹(x) = log( x/(1 − x) ) (logit link); in red, g₂(x) = f₂⁻¹(x) = Φ⁻¹(x) (probit link).]

Examples of link functions for Bernoulli response

[Figure: in blue, f₁(x) = eˣ/(1 + eˣ); in red, f₂(x) = Φ(x) (Gaussian cdf).]

Canonical link

- The function g that links the mean µ to the canonical parameter θ is called the canonical link:

    g(µ) = θ.

- Since µ = b′(θ), the canonical link is given by g(µ) = (b′)⁻¹(µ).
- If φ > 0, the canonical link function is strictly increasing. Why? Because b″(θ) = var(Y)/φ > 0, so b′ is strictly increasing and therefore so is its inverse.
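The identities E[Y] = b′(θ) and var(Y) = φ·b″(θ), which underlie this monotonicity, are easy to check numerically. A minimal simulation sketch for the Poisson case (b(θ) = e^θ, φ = 1); the seed and sample size are illustrative choices of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.9                        # canonical parameter
mu = np.exp(theta)                 # b'(theta) = e^theta for the Poisson family
samples = rng.poisson(mu, size=1_000_000)

# E[Y] should match b'(theta) = e^theta, and var(Y) should match
# phi * b''(theta) = e^theta (phi = 1 for the Poisson).
print(samples.mean(), mu)          # both approximately 2.4596
print(samples.var(), mu)           # both approximately 2.4596
```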
Example: the Bernoulli distribution

- We can check that b(θ) = log(1 + e^θ).
- Hence we solve

    b′(θ) = exp(θ)/(1 + exp(θ)) = µ  ⟺  θ = log( µ/(1 − µ) ).

- The canonical link for the Bernoulli distribution is the logit link.

Other examples

    Distribution    b(θ)             g(µ)
    Normal          θ²/2             µ
    Poisson         exp(θ)           log µ
    Bernoulli       log(1 + e^θ)     log( µ/(1 − µ) )
    Gamma           −log(−θ)         −1/µ

Model and notation

- Let (Xᵢ, Yᵢ) ∈ ℝᵖ × ℝ, i = 1, ..., n, be independent random pairs such that the conditional distribution of Yᵢ given Xᵢ = xᵢ has density in the canonical exponential family:

    f_{θᵢ}(yᵢ) = exp( (yᵢθᵢ − b(θᵢ))/φ + c(yᵢ, φ) ).

- Write Y = (Y₁, ..., Yₙ)ᵀ, X = (X₁, ..., Xₙ)ᵀ.
- Here the mean µᵢ = E[Yᵢ | Xᵢ] is related to the canonical parameter θᵢ via µᵢ = b′(θᵢ),
- and µᵢ depends linearly on the covariates through a link function g: g(µᵢ) = Xᵢᵀβ.

Back to β

- Given a link function g, note the following relationship between β and θᵢ:

    θᵢ = (b′)⁻¹(µᵢ) = (b′)⁻¹( g⁻¹(Xᵢᵀβ) ) =: h(Xᵢᵀβ),

  where h is defined as h = (b′)⁻¹ ∘ g⁻¹ = (g ∘ b′)⁻¹.
- Remark: if g is the canonical link function, h is the identity.

Log-likelihood

- The log-likelihood is given by

    ℓₙ(Y, X, β) = Σᵢ (Yᵢθᵢ − b(θᵢ))/φ = Σᵢ (Yᵢ h(Xᵢᵀβ) − b(h(Xᵢᵀβ)))/φ,

  up to a constant term.
- Note that when we use the canonical link function, we obtain the simpler expression

    ℓₙ(Y, X, β) = Σᵢ (Yᵢ Xᵢᵀβ − b(Xᵢᵀβ))/φ.

Strict concavity

- The log-likelihood ℓₙ(β) is strictly concave under the canonical link when φ > 0. Why? Because b″(θ) = var(Y)/φ > 0, so b is strictly convex and each term Yᵢ Xᵢᵀβ − b(Xᵢᵀβ) is concave in β (strictly, provided the design matrix has full column rank).
- As a consequence the maximum likelihood estimator is unique (when it exists).
- On the other hand, if another parameterization is used, the likelihood function may not be strictly concave, leading to several local maxima.

Concluding remarks

- Maximum likelihood for Bernoulli Y and the logit link is called logistic regression.
- In general, there is no closed form for the MLE, and we have to use iterative numerical methods (see the sketch below).
- The asymptotic normality of the MLE also applies to GLMs.
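To illustrate the last two remarks, here is a minimal sketch of computing the logistic-regression MLE by Newton–Raphson on the strictly concave canonical-link log-likelihood (equivalent to iteratively reweighted least squares). The data are synthetic and the implementation is illustrative, not a reference:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson for the canonical-link Bernoulli GLM (logistic regression)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted means b'(X_i^T beta)
        grad = X.T @ (y - mu)                  # score: X^T (Y - mu)
        W = mu * (1.0 - mu)                    # b''(theta_i) = var(Y_i), the IRLS weights
        hess = -X.T @ (X * W[:, None])         # Hessian: -X^T W X (negative definite)
        beta -= np.linalg.solve(hess, grad)    # Newton step
    return beta

# Synthetic example: n = 500 observations, true beta = (0.5, -1.0, 2.0).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_beta = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
print(fit_logistic(X, y))  # should be close to true_beta
```

Because the canonical-link log-likelihood is strictly concave, Newton's method converges to the unique maximizer from essentially any starting point here; with a non-canonical link, a line search or multiple starts may be needed.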