STAT 405 - Biostatistics
Handout 18 – Poisson Regression

Recall that the Poisson distribution is given by

$$P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \quad \text{and} \quad \mu > 0.$$

The response $Y$ is a discrete random variable that represents the number of occurrences per unit of time or space. In Poisson regression we seek a model for the mean of the response ($\mu$) as a function of terms based upon a set of predictors $x_1, x_2, \ldots, x_p$. For a Poisson random variable the mean and variance are both $\mu$, so traditional OLS will not be adequate: the variance changes whenever the mean does, which violates the constant error variance assumption.

The logistic regression model that we have been studying is one member of a broader class of models called Generalized Linear Models. Generalized linear models are an extension of regular linear models that allow:
(i) the mean of a population to depend on a linear function of terms through a nonlinear link function, and
(ii) the response probability distribution to be any member of a special class of distributions referred to as the exponential family.
The exponential family contains the normal distribution (OLS), the binomial distribution (logistic) and the Poisson distribution.

The link function $g$ is the function that relates the mean of the response, $\mu_i = E(Y_i)$, to a linear function of terms based on the explanatory variables or predictors: $g(\mu_i)$ is linear in those terms.
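The equal mean and variance property mentioned above can be checked numerically. The following is a minimal sketch in base R; the rate mu = 3 is a hypothetical value chosen only for illustration and does not come from any data set in this handout.

```r
# Hypothetical Poisson mean, chosen only for illustration
mu <- 3
y  <- 0:10

# Probabilities computed directly from P(Y = y) = exp(-mu) * mu^y / y!
pmf_formula <- exp(-mu) * mu^y / factorial(y)

# The same probabilities from R's built-in Poisson density
pmf_builtin <- dpois(y, lambda = mu)
max_abs_diff <- max(abs(pmf_formula - pmf_builtin))

# Mean and variance computed from the pmf over a support long enough
# that the omitted tail probability is negligible
support   <- 0:200
p         <- dpois(support, lambda = mu)
pois_mean <- sum(support * p)
pois_var  <- sum((support - pois_mean)^2 * p)

print(c(mean = pois_mean, variance = pois_var))
```

Both printed values should agree with mu up to rounding error, which is the reason a constant-variance model is a poor match for count data whose mean changes with the predictors.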
OLS Regression
For a normally distributed response the link function is the identity function, $g(\mu) = \mu$, thus

$$g(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$

or, as we typically write the model for the mean,

$$E(Y \mid X) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Logistic Regression
For a binomial response we know that

$$g(\pi) = \ln\left(\frac{\pi}{1-\pi}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$

which we expressed as

$$\ln\left(\frac{\pi(\mathbf{x})}{1-\pi(\mathbf{x})}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Poisson Regression
For a Poisson distributed response the link function is $g(\mu) = \ln(\mu)$, so

$$\ln(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$

thus

$$\mu = \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}).$$

Mathematics of Parameter Estimation

Interpretation of Coefficients in the Poisson Regression Model
The interpretation of the coefficients in the Poisson regression model is as follows. Assume that we change one of the explanatory terms, for example the first one, by one unit from $u_1 = u$ to $u_1 = u + 1$ while holding all other terms fixed. This change affects the mean of the Poisson response by

$$100 \times \frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1}) - \exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = 100[\exp(\eta_1) - 1]\%,$$

i.e. a percent increase or decrease in the mean response. Alternatively we can simply take the ratio

$$\frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = e^{\eta_1},$$

which says that the mean of the response is multiplied by $e^{\eta_1}$ for each one-unit increase in the term $u_1$.
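The two forms of the interpretation can be verified with a few lines of R. This is only a sketch for a one-term model; the coefficients eta0 and eta1 and the starting value u are hypothetical numbers, not estimates from any model in this handout.

```r
# Hypothetical coefficients for the model ln(mu) = eta0 + eta1 * u
eta0 <- 0.5
eta1 <- 0.2

# Mean of the Poisson response as a function of the term u
mu <- function(u) exp(eta0 + eta1 * u)

u <- 4  # arbitrary starting value of the term

# Ratio of means for a one-unit increase in u; should equal exp(eta1)
ratio <- mu(u + 1) / mu(u)

# Percent change in the mean; should equal 100 * (exp(eta1) - 1)
pct_change <- 100 * (mu(u + 1) - mu(u)) / mu(u)

print(c(ratio = ratio, exp_eta1 = exp(eta1)))
print(c(pct_change = pct_change, formula = 100 * (exp(eta1) - 1)))
```

Changing u leaves both quantities unchanged, which illustrates that under the log link the effect of a term is multiplicative and does not depend on where the one-unit change starts.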
Wald Intervals and Tests for Parameters

95% CI for $\eta_i$: $\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)$

Therefore a CI for the multiplicative increase in the response is:

95% CI for $e^{\eta_i}$: $\exp\left(\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)\right)$

For testing $H_0: \eta_i = 0$ versus $H_a: \eta_i \neq 0$, the large sample test for significance of the "slope" parameter ($\eta_i$) is

$$z = \frac{\hat{\eta}_i}{SE(\hat{\eta}_i)} \approx N(0,1), \qquad \text{equivalently} \qquad z^2 \sim \chi^2_1 .$$

General Chi-Square Test
Consider comparing two rival models, where the alternative hypothesis (full) model adds the terms in $x_2$ to the reduced model:

$H_0: \log(\mu) = \eta_1^T x_1$ (reduced model OK)
$H_1: \log(\mu) = \eta_1^T x_1 + \eta_2^T x_2$ (full model needed)

General Chi-Square Statistic

$$\chi^2 = (\text{residual deviance of reduced model}) - (\text{residual deviance of full model}) = D(\text{model without the terms in } x_2) - D(\text{model with the terms in } x_2) \sim \chi^2_{\Delta df}$$

where $\Delta df$ is the difference in residual degrees of freedom between the two models. If the full model is needed, $\chi^2$ is BIG and the associated p-value $= P(\chi^2_{\Delta df} > \chi^2)$ is small.

For a Poisson regression model that includes an intercept, the residual deviance is

$$D = 2 \sum_{i=1}^{n} y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right)$$

(terms with $y_i = 0$ contribute 0) and can also be approximated by

$$D \cong \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}.$$

Residuals
As with logistic regression there are two types of residuals, which are related to the two forms of model deviance.

• Deviance residuals

$$d_i = \text{sign}(y_i - \hat{\mu}_i)\sqrt{2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]}$$

• Pearson residuals

$$r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$$

Example 1: Mating Success of African Elephants
In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful matings during the 8 years were recorded. The objective was to learn whether older animals are more successful at mating or whether they have diminished success after reaching a certain age.

Y = Number of matings in the 8 year follow-up period
X = Age (yrs.) of elephant at the start of the study

OLS Regression

> ele.lm = lm(Matings~Age, data=Elephants)
> summary(ele.lm)

Call:
lm(formula = Matings ~ Age, data = Elephants)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1158 -1.3087 -0.1082  0.8892  4.8842

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.50589    1.61899  -2.783  0.00826 **
Age          0.20050    0.04443   4.513 5.75e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.849 on 39 degrees of freedom
Multiple R-squared:  0.343,  Adjusted R-squared:  0.3262
F-statistic: 20.36 on 1 and 39 DF,  p-value: 5.749e-05

> Resplot(ele.lm)

Do these plots suggest any violations of the OLS regression assumptions?

> ncv.plot(ele.lm)

One approach to correcting the problem is to transform the response using a variance stabilizing transformation, which is found using the delta method. The delta method says that if $Y \sim N(\mu, \sigma^2)$ then $g(Y)$ is approximately normally distributed with mean $g(\mu)$ and variance $[g'(\mu)]^2 \sigma^2$. For a Poisson response $\sigma^2 = \mu$, so taking $g(Y) = \sqrt{Y}$ gives $g'(\mu) = 1/(2\sqrt{\mu})$ and an approximate variance of $1/4$, which no longer depends on the mean.

> elesq.lm = lm(sqrt(Matings)~Age,data=Elephants)
> summary(elesq.lm)

Call:
lm(formula = sqrt(Matings) ~ Age, data = Elephants)

Residuals:
     Min       1Q   Median       3Q      Max
-1.90532 -0.33654  0.07767  0.45871  1.09468

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81220    0.56867  -1.428 0.161187
Age          0.06320    0.01561   4.049 0.000236 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6493 on 39 degrees of freedom
Multiple R-squared:  0.296,  Adjusted R-squared:  0.2779
F-statistic: 16.4 on 1 and 39 DF,  p-value: 0.0002362

While this may seem like a satisfactory model, interpretation of the model coefficients is difficult because the response is now on the square root scale. As the number of matings per 8 year period is likely to be well modeled using a Poisson distribution, we will now consider Poisson regression. (The calls below reference Matings and Age without data=, so the Elephants data frame is assumed to be attached.)

> ele.glm = glm(Matings~Age,family="poisson")
> summary(ele.glm)

Call:
glm(formula = Matings ~ Age, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.80798  -0.86137  -0.08629   0.60087   2.17777

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201    0.54462  -2.905  0.00368 **
Age          0.06869    0.01375   4.997 5.81e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 51.012  on 39  degrees of freedom
AIC: 156.46

Number of Fisher Scoring iterations: 5

> par(mfrow=c(2,2))
> plot(ele.glm)
> par(mfrow=c(1,1))

> plot(Age,Matings,xlab="Age of Elephant",ylab="Num. of Matings")
> lines(Age,fitted(ele.glm))
> title(main="Plot of Matings vs. Age of Elephant w/ Poisson Fit")

Given the curvilinear appearance of the scatterplot, perhaps adding a squared term for Age (the variable Age2 below) would improve the model.

> elesq.glm = glm(Matings~Age+Age2,family="poisson")
> summary(elesq.glm)

Call:
glm(formula = Matings ~ Age + Age2, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8470  -0.8848  -0.1122   0.6580   2.1134

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.8574060  3.0356383  -0.941    0.347
Age          0.1359544  0.1580095   0.860    0.390
Age2        -0.0008595  0.0020124  -0.427    0.669

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 50.826  on 38  degrees of freedom
AIC: 158.27

Number of Fisher Scoring iterations: 5

The Wald test for Age2 (z = -0.427, p = 0.669) indicates that adding Age2 to the model was not necessary. The General Chi-square Test leads to the same conclusion:

> 1 - pchisq(51.012-50.826,1)
[1] 0.6662668

Interpretation of the estimated coefficient for Age in the first model
The estimated coefficient for Age in the Poisson model ele.glm is $\hat{\eta}_1 = 0.0687$. Thus for each additional year of age at the start of the study we estimate a $100[e^{0.0687} - 1] = 7.11\%$ increase in the mean number of matings in the 8 year period. Expressed as a multiplicative increase this would be 1.0711.

For a 5 year difference in initial age we would expect a $100[e^{5 \times 0.0687} - 1] = 41.0\%$ increase in the mean number of matings in the following 8 year period. Expressed as a multiplicative increase this would be 1.410.
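The Wald and deviance calculations for the elephant models can be reproduced from the printed summaries alone. The sketch below assumes only the numbers shown in the summary(ele.glm) and summary(elesq.glm) output (the data set itself is not needed); the variable names are illustrative.

```r
# Values copied from the summary(ele.glm) output
est <- 0.06869   # estimated coefficient for Age
se  <- 0.01375   # its standard error

# Wald z statistic and two-sided p-value
z      <- est / se
p_wald <- 2 * pnorm(-abs(z))

# 95% Wald interval for the coefficient, then exponentiated to give an
# interval for the multiplicative change in the mean per year of age
ci_log   <- est + c(-1, 1) * qnorm(0.975) * se
ci_ratio <- exp(ci_log)

# General chi-square test for adding Age2: difference in residual
# deviances, referred to a chi-square with (39 - 38) = 1 df
dev_reduced <- 51.012
dev_full    <- 50.826
chisq <- dev_reduced - dev_full
p_dev <- pchisq(chisq, df = 39 - 38, lower.tail = FALSE)

print(c(z = z, p_wald = p_wald))
print(ci_ratio)
print(c(chisq = chisq, p_dev = p_dev))
```

Because the inputs are rounded values from the printed output, the z statistic differs slightly in the third decimal place from the one R reports. With the fitted objects available, anova(ele.glm, elesq.glm, test = "Chisq") would carry out the same deviance comparison.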
Find a 95% CI for the 5-year Age Effect

Example 2: Reproduction of Ceriodaphnia Organisms
In this study the number of Ceriodaphnia organisms is counted in a controlled environment in which reproduction occurs among the organisms. Two different strains of organisms are involved, and the environment is changed by adding varying amounts of a chemical component intended to impair reproduction. Initial population sizes are the same.

> head(Ceriodaph)
  Cerio Conc Strain
1    82  0.0      0
2   106  0.0      0
3    63  0.0      0
4    99  0.0      0
5   101  0.0      0
6    45  0.5      0
…

> cerio.glm = glm(Cerio~Conc+Strain,family="poisson",data=Ceriodaph)
> summary(cerio.glm)

Call:
glm(formula = Cerio ~ Conc + Strain, family = "poisson", data = Ceriodaph)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6800  -0.6766   0.1528   0.6787   2.0774

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.45464    0.03914 113.819  < 2e-16 ***
Conc        -1.54308    0.04660 -33.111  < 2e-16 ***
Strain1     -0.27497    0.04837  -5.684 1.31e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1359.381  on 69  degrees of freedom
Residual deviance:   86.376  on 67  degrees of freedom
AIC: 415.95

Number of Fisher Scoring iterations: 4

Interpret the coefficients:

> par(mfrow=c(2,2))
> plot(cerio.glm)
> par(mfrow=c(1,1))

The plotting commands below reference Conc, Cerio and Strain directly, so the Ceriodaph data frame is assumed to be attached.

> plot(Conc[Strain=="0"],Cerio[Strain=="0"],col="blue",xlab="Concentration",
       ylab="Ceriodaphnia Count")
> lines(Conc[Strain=="0"],fitted(cerio.glm)[Strain=="0"],col="blue")
> points(Conc[Strain=="1"],Cerio[Strain=="1"],pch="X",col="red")
> lines(Conc[Strain=="1"],fitted(cerio.glm)[Strain=="1"],col="red",lty=2)
> legend(locator(),legend=c("Strain 0","Strain 1"),col=c("blue","red"),pch="oX",lty=1:2)

> mmps(cerio.glm)

Consider adding an interaction term, although there is no visual evidence to suggest that it is needed.

> cerio2.glm = glm(Cerio~Conc*Strain,family="poisson",data=Ceriodaph)
> summary(cerio2.glm)

Call:
glm(formula = Cerio ~ Conc * Strain, family = "poisson", data = Ceriodaph)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.84251  -0.64872   0.01169   0.70636   1.82195

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.48110    0.04350 103.008  < 2e-16 ***
Conc         -1.59787    0.06244 -25.592  < 2e-16 ***
Strain1      -0.33667    0.06704  -5.022 5.11e-07 ***
Conc:Strain1  0.12534    0.09385   1.336    0.182
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1359.381  on 69  degrees of freedom
Residual deviance:   84.596  on 66  degrees of freedom
AIC: 416.17

Number of Fisher Scoring iterations: 4

Poisson Regression in JMP

Example 1: Age of Elephants and Number of Matings (Elephants.JMP)
Select Analyze > Fit Model, then change the options in the dialog box as shown below:

Comments:

Generalized Linear Model Fit
Overdispersion parameter estimated by Pearson Chisq/DF
Response: Matings
Distribution: Poisson
Link: Log
Estimation Method: Maximum Likelihood
Observations (or Sum Wgts) = 41

[Figure: Poisson Regression Plot]

Whole Model Test
Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
Difference       10.5242328         21.0485    1      <.0001*
Full             65.8659248
Reduced          76.3901576

Goodness Of Fit Statistic   ChiSquare   DF   Prob>ChiSq   Overdispersion
Pearson                       45.1360   39       0.2309           1.1573
Deviance                      51.0116   39       0.0943

AICc 138.3805

Effect Tests
Source   DF   L-R ChiSquare   Prob>ChiSq
Age       1       21.048466      <.0001*

Parameter Estimates
Term         Estimate   Std Error   L-R ChiSquare   Prob>ChiSq    Lower CL    Upper CL
Intercept   -1.582008   0.5859008       7.5795099      0.0059*   -2.750408   -0.450132
Age         0.0686928   0.0147876       21.048466      <.0001*    0.039621   0.0976833

Note that the parameter estimates agree with the R fit, but the standard errors are larger. Because the overdispersion parameter was estimated (1.1573 = 45.1360/39), the JMP standard errors equal the R standard errors multiplied by $\sqrt{1.1573}$, and the L-R ChiSquare statistics are divided by 1.1573.

[Figure: Studentized Deviance Residual by Predicted]

Appendix: Code for some useful R functions for OLS Regression

NCV.test = function (model, var.formula, data = NULL, subset, na.action) {
    if (!is.null(weights(model)))
        stop("requires unweighted linear model")
    if ((!is.null(class(model$na.action))) && class(model$na.action) == "exclude")
        model <- update(model, na.action = na.omit)
    sumry <- summary(model)
    residuals <- residuals(model)
    # ML estimate of the error variance (RSS / n)
    S.sq <- df.residual(model) * (sumry$sigma)^2/sum(!is.na(residuals))
    # Scaled squared residuals
    U <- (residuals^2)/S.sq
    if (missing(var.formula)) {
        # Default: test for variance depending on the fitted values
        mod <- lm(U ~ fitted.values(model))
        varnames <- "fitted.values"
        var.formula <- ~fitted.values
        df <- 1
    } else {
        if (missing(na.action)) {
            # The original listing used class(mod$na.action) here, but mod is
            # not yet defined at this point; model is the fitted object.
            na.action <- if (is.null(model$na.action))
                options()$na.action
            else parse(text = paste("na.", class(model$na.action), sep = ""))
        }
        m <- match.call(expand.dots = FALSE)
        if (is.matrix(eval(m$data, sys.frame(sys.parent()))))
            m$data <- as.data.frame(data)
        m$formula <- var.formula
        m$var.formula <- m$model <- m$... <- NULL
        m[[1]] <- as.name("model.frame")
        mf <- eval(m, sys.frame(sys.parent()))
        response <- attr(attr(mf, "terms"), "response")
        if (response)
            stop(paste("Variance formula contains a response."))
        mf$U <- U
        .X <- model.matrix(as.formula(paste("U~", as.character(var.formula)[2], "-1")),
                           data = mf)
        mod <- lm(U ~ .X)
        df <- sum(!is.na(coefficients(mod))) - 1
    }
    # Score statistic: half the regression sum of squares of the auxiliary fit
    SS <- anova(mod)$"Sum Sq"
    RegSS <- sum(SS) - SS[length(SS)]
    Chisq <- RegSS/2
    result <- list(formula = var.formula, formula.name = "Variance",
                   ChiSquare = Chisq, Df = df, p = 1 - pchisq(Chisq, df),
                   test = "Non-constant Variance Score Test")
    class(result) <- "chisq.test"
    result
}

ncv.plot = function (fit) {
    temp <- NCV.test(fit)
    p <- temp$p
    e <- sqrt(abs(resid(fit)))
    yhat <- fitted(fit)
    plot(yhat, e, xlab = "Fitted Values", ylab = "Sqrt. Abs. Residuals",
         main = paste("Non-Constant Variance Plot ~ NCV Test (p=", signif(p, 4), ")"))
    lines(lowess(yhat, e), lty = 1, col = "Blue")
}

# Note: Studresid() is not defined in this handout. It is assumed to be a
# course function returning studentized residuals of the fitted model.
Resplot = function (lm1, lms = summary(lm1)) {
    par(mfrow = c(2, 2), pty = "m")
    y <- resid(lm1)
    # Panel 1: normal probability plot of the studentized residuals
    qqnorm(Studresid(lm1), main = "Normal Probability Plot", ylab = "Residuals")
    abline(0, sqrt(var(Studresid(lm1))))
    # Panel 2: studentized residuals vs. fitted values with lowess trend and bands
    plot(fitted(lm1), Studresid(lm1), xlab = "Fitted Values",
         ylab = "Studentized Residuals",
         main = "Plot of Studentized Residuals vs. Fitted", cex = 0.65)
    x <- fitted(lm1)
    y <- Studresid(lm1)
    f <- 0.5
    xs <- sort(x, index.return = TRUE)
    x <- xs$x
    ix <- xs$ix
    y <- y[ix]
    trend <- lowess(x, y, f)
    e2 <- (y - trend$y)^2
    scatter <- lowess(x, e2, f)
    uplim <- trend$y + sqrt(abs(scatter$y))
    lowlim <- trend$y - sqrt(abs(scatter$y))
    lines(trend$x, trend$y, col = "Blue")
    lines(scatter$x, uplim, col = "Red")
    lines(scatter$x, lowlim, col = "Red")
    abline(h = 0, lty = 2, col = 2)
    # Panel 3: square root of absolute studentized residuals vs. fitted values
    plot(fitted(lm1), sqrt(abs(Studresid(lm1))), main = "Loess Fit of Residuals",
         ylab = "Absolute Stud. Residuals", xlab = "Fitted Values", cex = 0.7)
    lines(lowess(fitted(lm1), sqrt(abs(Studresid(lm1)))), lty = 1, col = 3)
    abline(h = mean(sqrt(abs(Studresid(lm1)))), col = "blue", lty = 3)
    # Second page: centered fitted values and residuals on a common scale
    par(mfrow = c(1, 2))
    par(ask = TRUE)
    yl <- c(min(resid(lm1), fitted(lm1) - mean(fitted(lm1))),
            max(resid(lm1), fitted(lm1) - mean(fitted(lm1))))
    fit <- fitted(lm1)
    p <- sort(fit - mean(fit))
    pp <- ppoints(p)
    res <- resid(lm1)
    pr <- sort(res)
    ppr <- ppoints(pr)
    plot(pp, p, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
         ylab = "", main = "Fitted values", cex = 0.7)
    plot(ppr, pr, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
         ylab = "", main = "Residuals", cex = 0.7)
    par(mfrow = c(1, 1))
    par(ask = FALSE)
    invisible()
}
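When NCV.test is called without a variance formula, the calculation it performs reduces to a few lines. The sketch below repeats that default calculation on simulated data so that it can be run without the course data sets. The simulated variables x and y, the seed, and the coefficient values are all hypothetical and chosen only to produce counts whose variance grows with the mean.

```r
set.seed(405)  # arbitrary seed, used only for reproducibility

# Simulated (hypothetical) count data: Poisson response with a log-linear mean
x <- runif(60, 25, 55)
y <- rpois(60, lambda = exp(-1.5 + 0.07 * x))

# Ordinary least squares fit, as in the OLS part of Example 1
fit <- lm(y ~ x)

# Step 1: scale the squared residuals by the ML variance estimate RSS / n
e     <- resid(fit)
s2_ml <- sum(e^2) / length(e)
U     <- e^2 / s2_ml

# Step 2: regress the scaled squared residuals on the fitted values
aux <- lm(U ~ fitted(fit))

# Step 3: the score statistic is half the regression sum of squares,
# referred to a chi-square distribution with 1 degree of freedom
reg_ss  <- sum((fitted(aux) - mean(U))^2)
chisq   <- reg_ss / 2
p_value <- 1 - pchisq(chisq, df = 1)

print(c(ChiSquare = chisq, Df = 1, p = p_value))
```

A small p-value here would point to non-constant variance, which is the expected pattern when OLS is applied to Poisson counts with a changing mean.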