STAT 405 - Biostatistics
Handout 18 – Poisson Regression
Recall the Poisson distribution is given by
$$P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \text{ and } \mu > 0.$$
The response Y is a discrete random variable that represents the number of occurrences
per time or space unit. In Poisson regression we seek a model for the mean of the
response ($\mu$) as a function of terms based upon a set of predictors $x_1, x_2, \ldots, x_p$. For a
Poisson random variable the mean and variance are both $\mu$, so traditional OLS will not
be adequate because the constant error variance assumption would be violated.
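The formula and the mean-equals-variance property can be checked numerically. The following is a minimal sketch in R; the value of `mu`, the seed, and the sample size are assumptions chosen only for illustration.

```r
# Poisson probabilities computed directly from the formula, compared with dpois()
mu <- 3.5                      # illustrative mean (assumed value)
y  <- 0:10
p_formula <- exp(-mu) * mu^y / factorial(y)
p_builtin <- dpois(y, lambda = mu)
max_abs_diff <- max(abs(p_formula - p_builtin))
print(max_abs_diff)

# In a large simulated sample the mean and the variance should both be close to mu
set.seed(405)
draws <- rpois(100000, lambda = mu)
sample_mean <- mean(draws)
sample_var  <- var(draws)
print(c(mean = sample_mean, variance = sample_var))
```

Because the variance grows with the mean, the spread of the response changes along the fitted line, which is what breaks the constant-variance assumption of OLS.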
The logistic regression model that we have been studying is one type of a broader class
of models called Generalized Linear Models. Generalized linear models are an
extension of regular linear models that allow: (i) the mean of a population to depend on
a linear function of terms through a nonlinear link function and (ii) the response
probability distribution to be any member of a special class of distributions referred to
as the exponential family. The exponential family contains the normal distribution
(OLS), the binomial distribution (logistic) and the Poisson distribution.
The link function $g$ is a function of the mean of the response, $\mu_i = E(Y_i)$, chosen so that
$g(\mu_i)$ is a linear function of a set of terms based on the explanatory variables or predictors.
OLS Regression
For a normally distributed response the link function is the identity function, $g(\mu) = \mu$, thus
$$g(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$
or, as we typically write the model for the mean,
$$E(Y|\boldsymbol{X}) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$
Logistic Regression
For a binomial response the link function is the logit,
$$g(\mu) = \ln\left(\frac{\mu}{1-\mu}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$
which we expressed as
$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$
Poisson Regression
For a Poisson distributed response the link function is $g(\mu) = \ln(\mu)$, so
$$\ln(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}$$
thus,
$$\mu = \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})$$
Mathematics of Parameter Estimation
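The handout does not reproduce the estimation details here. As a hedged illustration only, the sketch below assumes the parameters are estimated by maximum likelihood (the method reported in the JMP output later in this handout) and maximizes the Poisson log-likelihood directly with `optim()`, then compares the result with `glm()`. The simulated data, the "true" coefficient values, and the function names `negloglik` and `neggrad` are all assumptions made for this example.

```r
set.seed(18)
n  <- 500
u1 <- runif(n, 0, 2)
y  <- rpois(n, exp(0.5 + 0.8 * u1))    # assumed "true" coefficients 0.5 and 0.8

# Negative Poisson log-likelihood under the log link: mu = exp(eta0 + eta1 * u1)
negloglik <- function(eta) {
  mu <- exp(eta[1] + eta[2] * u1)
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# Its gradient: minus the sums of (y - mu) and (y - mu) * u1
neggrad <- function(eta) {
  mu <- exp(eta[1] + eta[2] * u1)
  -c(sum(y - mu), sum((y - mu) * u1))
}

# Start from the intercept-only fit (log of the sample mean, zero slope)
ml  <- optim(c(log(mean(y)), 0), negloglik, gr = neggrad, method = "BFGS")
fit <- glm(y ~ u1, family = poisson(link = "log"))

print(rbind(optim = ml$par, glm = unname(coef(fit))))
```

The two rows should agree closely, since `glm()` maximizes the same likelihood (by Fisher scoring rather than BFGS).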
Interpretation of Coefficients in the Poisson Regression Model
The interpretation of the coefficients in the Poisson regression model is as follows.
Assume that we change one of the explanatory terms, for example, the first one, by one
unit from u to u+1 while holding all other terms fixed. This change affects the mean of
the Poisson response by
$$100 \times \frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1}) - \exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = 100[\exp(\eta_1) - 1]\%,$$
i.e. a percent increase or decrease in the mean response.
Alternatively we can simply take the ratio
$$\frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = e^{\eta_1},$$
which says the mean of the response is multiplied by $e^{\eta_1}$ for each one-unit
increase in the term $u_1$.
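A quick numerical check of this interpretation, as a minimal sketch in R: the coefficient values and the function name `mean_response` are hypothetical, chosen only to show that the ratio does not depend on the starting value of the term, and that a change of c units multiplies the mean by exp(c * eta1).

```r
# Hypothetical coefficients for a model with two terms (assumed values)
eta0 <- -1.0
eta1 <- 0.25
eta2 <- 0.10

mean_response <- function(u1, u2) exp(eta0 + eta1 * u1 + eta2 * u2)

u_start <- c(0, 2, 7.5)   # several starting values of the first term
ratio_1 <- mean_response(u_start + 1, u2 = 3) / mean_response(u_start, u2 = 3)
ratio_5 <- mean_response(u_start + 5, u2 = 3) / mean_response(u_start, u2 = 3)

print(rbind(ratio_1, exp_eta1 = exp(eta1)))       # one-unit change
print(rbind(ratio_5, exp_5eta1 = exp(5 * eta1)))  # five-unit change
print(100 * (exp(eta1) - 1))                      # percent change per unit
```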
Wald Intervals and Tests for Parameters
95% CI for $\eta_i$: $\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)$
Therefore a CI for the multiplicative increase in the response is:
95% CI for $e^{\eta_i}$: $\exp(\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i))$
For testing:
$H_0: \eta_i = 0$
$H_a: \eta_i \neq 0$
Large sample test for significance of "slope" parameter ($\eta_i$):
$$z = \frac{\hat{\eta}_i}{SE(\hat{\eta}_i)} \approx N(0,1)$$
or equivalently $z^2 \sim \chi^2_1$.
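The Wald interval and test above can be wrapped in a small helper. This is a minimal sketch in R; the function name `wald_summary` and the example estimate and standard error are made up for illustration and are not part of the handout.

```r
# Wald CI and large-sample test from an estimate and its standard error
wald_summary <- function(estimate, se, level = 0.95) {
  z_crit <- qnorm(1 - (1 - level) / 2)   # 1.96 for a 95% interval
  z <- estimate / se
  list(
    ci_coef = estimate + c(-1, 1) * z_crit * se,        # CI for eta_i
    ci_mult = exp(estimate + c(-1, 1) * z_crit * se),   # CI for exp(eta_i)
    z       = z,
    p_norm  = 2 * pnorm(-abs(z)),                        # two-sided normal p-value
    p_chisq = pchisq(z^2, df = 1, lower.tail = FALSE)    # same test via z^2 ~ chi-square(1)
  )
}

res <- wald_summary(estimate = 0.10, se = 0.04)   # made-up numbers
print(res)
```

The two p-values agree because the square of a standard normal variable has a chi-square distribution with 1 degree of freedom.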
General Chi-Square Test
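Consider comparing two rival models, where the alternative hypothesis model contains all
the terms of the null hypothesis model plus additional terms:
$H_0: \log(\mu) = \boldsymbol{\eta}_1^T \boldsymbol{x}_1$ (reduced model OK)
$H_1: \log(\mu) = \boldsymbol{\eta}_1^T \boldsymbol{x}_1 + \boldsymbol{\eta}_2^T \boldsymbol{x}_2$ (full model needed)
General Chi-Square Statistic
$\chi^2$ = (residual deviance of reduced model) - (residual deviance of full model)
$= D(\text{for model without the terms in } \boldsymbol{x}_2) - D(\text{for model with the terms in } \boldsymbol{x}_2) \sim \chi^2_{\Delta df}$,
where $\Delta df$ is the difference in residual degrees of freedom between the two models.
If the full model is needed $\chi^2$ is BIG and the associated p-value = $P(\chi^2_{\Delta df} > \chi^2)$ is small.
The residual deviance of a Poisson regression model that includes an intercept is
$$D = 2\sum_{i=1}^{n} y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right)$$
and can also be approximated by
$$D \cong \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$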
Residuals
As with logistic regression there are two types of residuals, which are related to the two
forms of model deviance.
• Deviance residuals
$$d_i = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]}$$
• Pearson residuals
$$r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$$
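The deviance, its Pearson approximation, the two residual types, and the general chi-square test can all be checked numerically. The following is a minimal sketch in R on simulated data (the data, seed, and coefficient values are assumptions made for illustration, not part of the handout). It computes the residuals from the formulas above and compares them with what `residuals()` and `deviance()` return.

```r
set.seed(405)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, exp(1 + 0.4 * x1))   # x2 has no effect in this simulation

reduced <- glm(y ~ x1,      family = poisson)
full    <- glm(y ~ x1 + x2, family = poisson)
mu_hat  <- fitted(full)

# y * ln(y / mu) is taken to be 0 when y = 0
ylogy <- ifelse(y == 0, 0, y * log(y / mu_hat))

# Deviance and Pearson residuals from the formulas
d <- sign(y - mu_hat) * sqrt(pmax(2 * (ylogy - (y - mu_hat)), 0))
r <- (y - mu_hat) / sqrt(mu_hat)

D_resid   <- sum(d^2)         # deviance as the sum of squared deviance residuals
D_formula <- 2 * sum(ylogy)   # deviance from the formula for D
X2        <- sum(r^2)         # Pearson chi-square approximation to D

# General chi-square test: reduced model versus full model
chisq_stat <- deviance(reduced) - deviance(full)
df_diff    <- df.residual(reduced) - df.residual(full)
p_value    <- 1 - pchisq(chisq_stat, df_diff)

print(c(D_resid = D_resid, D_formula = D_formula, glm = deviance(full), X2 = X2))
print(c(chisq = chisq_stat, df = df_diff, p = p_value))
```

The same comparison is available through `anova(reduced, full, test = "Chisq")`. The formula for D matches `deviance()` here because the model includes an intercept, which makes the residuals $y_i - \hat{\mu}_i$ sum to zero.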
Example 1: Mating Success of African Elephants
In this study, 41 male African elephants were followed over a
period of 8 years. The age of the elephant at the beginning of the
study and the number of successful matings during the 8 years
was recorded. The objective was to learn whether older animals
are more successful at mating or whether they have diminished
success after reaching a certain age.
Y = Number of matings in the 8 year follow-up period
X = Age (yrs.) of elephant at the start of the study
OLS Regression
> ele.lm = lm(Matings~Age, data=Elephants)
> summary(ele.lm)

Call:
lm(formula = Matings ~ Age, data = Elephants)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1158 -1.3087 -0.1082  0.8892  4.8842

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.50589    1.61899  -2.783  0.00826 **
Age          0.20050    0.04443   4.513 5.75e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.849 on 39 degrees of freedom
Multiple R-squared: 0.343,     Adjusted R-squared: 0.3262
F-statistic: 20.36 on 1 and 39 DF,  p-value: 5.749e-05
> Resplot(ele.lm)
Do these plots suggest any violations with OLS regression assumptions?
> ncv.plot(ele.lm)
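One approach to correcting the problem is to transform the response using a
variance stabilizing transformation, which is found using the delta method. The delta
method says that if $Y \sim N(\mu, \sigma^2)$ then $g(Y)$ is approximately normally distributed with mean
$g(\mu)$ and variance $[g'(\mu)]^2 \sigma^2$. For a Poisson response $\sigma^2 = \mu$, so taking $g(Y) = \sqrt{Y}$
gives variance $[1/(2\sqrt{\mu})]^2 \mu = 1/4$, which does not depend on $\mu$.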
> elesq.lm = lm(sqrt(Matings)~Age,data=Elephants)
> summary(elesq.lm)

Call:
lm(formula = sqrt(Matings) ~ Age, data = Elephants)

Residuals:
     Min       1Q   Median       3Q      Max
-1.90532 -0.33654  0.07767  0.45871  1.09468

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81220    0.56867  -1.428 0.161187
Age          0.06320    0.01561   4.049 0.000236 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6493 on 39 degrees of freedom
Multiple R-squared: 0.296,     Adjusted R-squared: 0.2779
F-statistic: 16.4 on 1 and 39 DF,  p-value: 0.0002362
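Why the square root in particular? For a Poisson response the variance equals the mean, and the delta method suggests that the variance of sqrt(Y) is roughly 1/4 whatever the mean. The following is a minimal simulation sketch in R; the means, seed, and number of draws are assumptions chosen for illustration.

```r
set.seed(7)
mus <- c(10, 25, 50)   # illustrative Poisson means (assumed values)

# Variance of the raw counts and of their square roots at each mean
var_raw  <- sapply(mus, function(m) var(rpois(100000, m)))
var_sqrt <- sapply(mus, function(m) var(sqrt(rpois(100000, m))))

print(round(rbind(mu = mus, var_raw = var_raw, var_sqrt = var_sqrt), 3))
```

The raw variances track the means, while the variances on the square root scale stay near 0.25. The approximation is less accurate for small means.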
While this may seem like a satisfactory model, interpretation of the model coefficients is
difficult and the response is now in the square root scale. As the number of matings per
8 year period is likely to be well modeled using a Poisson distribution, we will now
consider Poisson regression.
> ele.glm = glm(Matings~Age,family="poisson")
> summary(ele.glm)

Call:
glm(formula = Matings ~ Age, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.80798  -0.86137  -0.08629   0.60087   2.17777

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201    0.54462  -2.905  0.00368 **
Age          0.06869    0.01375   4.997 5.81e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 51.012  on 39  degrees of freedom
AIC: 156.46

Number of Fisher Scoring iterations: 5
> par(mfrow=c(2,2))
> plot(ele.glm)
> par(mfrow=c(1,1))
> plot(Age,Matings,xlab="Age of Elephant",ylab="Num. of Matings")
> lines(Age,fitted(ele.glm))
> title(main="Plot of Matings vs. Age of Elephant w/ Poisson Fit")
Given the curvilinear appearance of the scatterplot perhaps adding a squared term for
Age would improve the model.
> elesq.glm = glm(Matings~Age+Age2,family="poisson")
> summary(elesq.glm)

Call:
glm(formula = Matings ~ Age + Age2, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8470  -0.8848  -0.1122   0.6580   2.1134

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.8574060  3.0356383  -0.941    0.347
Age          0.1359544  0.1580095   0.860    0.390
Age2        -0.0008595  0.0020124  -0.427    0.669

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 50.826  on 38  degrees of freedom
AIC: 158.27

Number of Fisher Scoring iterations: 5
The Wald test for Age2 (p = 0.669) indicates that adding Age2 to the model was not necessary.
The General Chi-Square Test leads to the same conclusion:
> 1 - pchisq(51.012-50.826,1)
[1] 0.6662668
Interpretation of the estimated coefficient for Age in the first model
The estimated coefficient for Age in the Poisson model is $\hat{\eta}_1 = 0.0687$, so for each additional
year of age at the start of the study we estimate a $100[e^{0.0687} - 1] = 7.11\%$ increase in the
mean number of matings in the 8 year period. Expressed as a multiplicative increase this
is 1.0711.
For a 5 year difference in initial age we would expect a $100[e^{5 \times 0.0687} - 1] = 41.0\%$
increase in the mean number of matings in the following 8 year period. Expressed as a
multiplicative increase this is 1.410.
Find a 95% CI for the 5-year Age Effect
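One way to work this out, as a hedged sketch in R, is to scale the Wald interval for the Age coefficient by 5 and then exponentiate. The estimate and standard error are copied from the `summary(ele.glm)` output earlier in this example; the variable names are introduced here for illustration.

```r
eta1_hat <- 0.06869   # Age estimate from summary(ele.glm)
se_eta1  <- 0.01375   # its standard error
years    <- 5

# Wald interval for 5 * eta1, then back-transform to the multiplicative scale
ci_log  <- years * (eta1_hat + c(-1, 1) * 1.96 * se_eta1)
ci_mult <- exp(ci_log)
ci_pct  <- 100 * (ci_mult - 1)

print(round(ci_mult, 3))   # multiplicative effect of a 5-year age difference
print(round(ci_pct, 1))    # the same interval as a percent increase
```

This gives an interval of roughly 1.23 to 1.61 for the multiplicative effect, i.e. about a 23% to 61% increase in the mean number of matings. With the fitted object available, `exp(5 * confint.default(ele.glm)["Age", ])` should give essentially the same Wald interval.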
Example 2: Reproduction of Ceriodaphnia Organisms
In this study the number of Ceriodaphnia organisms is counted in a controlled
environment in which reproduction occurs among the organisms. Two different
strains of organisms are involved, and the environment is changed by adding varying
amounts of a chemical component intended to impair reproduction. Initial population
sizes are the same.
> head(Ceriodaph)
  Cerio Conc Strain
1    82  0.0      0
2   106  0.0      0
3    63  0.0      0
4    99  0.0      0
5   101  0.0      0
6    45  0.5      0
…     …    …      …
> cerio.glm = glm(Cerio~Conc+Strain,family="poisson",data=Ceriodaph)
> summary(cerio.glm)

Call:
glm(formula = Cerio ~ Conc + Strain, family = "poisson", data = Ceriodaph)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6800  -0.6766   0.1528   0.6787   2.0774

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.45464    0.03914 113.819  < 2e-16 ***
Conc        -1.54308    0.04660 -33.111  < 2e-16 ***
Strain1     -0.27497    0.04837  -5.684 1.31e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1359.381  on 69  degrees of freedom
Residual deviance:   86.376  on 67  degrees of freedom
AIC: 415.95

Number of Fisher Scoring iterations: 4
Interpret the coefficients:
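As a starting point, the estimates can be converted to multiplicative effects and percent changes. This is a minimal sketch in R using the estimates copied from the output above; the variable names are introduced here for illustration.

```r
est <- c(Conc = -1.54308, Strain1 = -0.27497)   # estimates from summary(cerio.glm)

mult <- exp(est)           # multiplicative change in the mean count
pct  <- 100 * (mult - 1)   # the same change expressed as a percent

print(round(cbind(multiplier = mult, percent_change = pct), 3))
```

On this reading, each one-unit increase in concentration multiplies the mean count by about 0.21 (roughly a 79% decrease) for a given strain, and strain 1 has a mean count about 24% lower than strain 0 at the same concentration.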
> par(mfrow=c(2,2))
> plot(cerio.glm)
> par(mfrow=c(1,1))
> plot(Conc[Strain=="0"],Cerio[Strain=="0"],col="blue",xlab="Concentration",
       ylab="Ceriodaphnia Count")
> lines(Conc[Strain=="0"],fitted(cerio.glm)[Strain=="0"],col="blue")
> points(Conc[Strain=="1"],Cerio[Strain=="1"],pch="X",col="red")
> lines(Conc[Strain=="1"],fitted(cerio.glm)[Strain=="1"],col="red",lty=2)
> legend(locator(),legend=c("Strain 0","Strain 1"),col=c("blue","red"),pch="oX",lty=1:2)
> mmps(cerio.glm)
Consider adding an interaction term, although there is no visual evidence to suggest it is
needed.
> cerio2.glm = glm(Cerio~Conc*Strain,family="poisson",data=Ceriodaph)
> summary(cerio2.glm)

Call:
glm(formula = Cerio ~ Conc * Strain, family = "poisson", data = Ceriodaph)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.84251  -0.64872   0.01169   0.70636   1.82195

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.48110    0.04350 103.008  < 2e-16 ***
Conc         -1.59787    0.06244 -25.592  < 2e-16 ***
Strain1      -0.33667    0.06704  -5.022 5.11e-07 ***
Conc:Strain1  0.12534    0.09385   1.336    0.182
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1359.381  on 69  degrees of freedom
Residual deviance:   84.596  on 66  degrees of freedom
AIC: 416.17

Number of Fisher Scoring iterations: 4
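The interaction can also be assessed with the General Chi-Square Test, in the same way as for the Age2 term in Example 1. The following is a minimal sketch in R using the residual deviances and degrees of freedom reported in the two Ceriodaphnia summaries; the variable names are introduced here for illustration.

```r
D_reduced <- 86.376; df_reduced <- 67   # Cerio ~ Conc + Strain
D_full    <- 84.596; df_full    <- 66   # Cerio ~ Conc * Strain

chisq_stat <- D_reduced - D_full        # drop in residual deviance
df_diff    <- df_reduced - df_full      # one extra term in the full model
p_value    <- 1 - pchisq(chisq_stat, df_diff)

print(c(chisq = chisq_stat, df = df_diff, p = round(p_value, 4)))
```

The p-value is about 0.18, in line with the Wald test for Conc:Strain1, so neither test gives evidence that the interaction term is needed.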
Poisson Regression in JMP
Example 1: Age of Elephants and Number of Matings (Elephants.JMP)
Select Analyze > Fit Model then change options in the dialog box as shown below:
Comments:
Generalized Linear Model Fit
Overdispersion parameter estimated by Pearson Chisq/DF
Response: Matings
Distribution: Poisson
Link: Log
Estimation Method: Maximum Likelihood
Observations (or Sum Wgts) = 41

Poisson Regression Plot

Whole Model Test
Model        -LogLikelihood  L-R ChiSquare  DF  Prob>ChiSq
Difference       10.5242328        21.0485   1     <.0001*
Full             65.8659248
Reduced          76.3901576

Goodness Of Fit
Statistic  ChiSquare  DF  Prob>ChiSq  Overdispersion
Pearson      45.1360  39      0.2309          1.1573
Deviance     51.0116  39      0.0943

AICc
138.3805

Effect Tests
Source  DF  L-R ChiSquare  Prob>ChiSq
Age      1      21.048466     <.0001*

Parameter Estimates
Term        Estimate   Std Error  L-R ChiSquare  Prob>ChiSq   Lower CL   Upper CL
Intercept  -1.582008   0.5859008      7.5795099     0.0059*  -2.750408  -0.450132
Age        0.0686928   0.0147876      21.048466     <.0001*   0.039621  0.0976833
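The JMP standard errors are slightly larger than those in the R output for the same model. A likely explanation, offered here as an assumption rather than something the handout states, is that JMP scales the standard errors by the square root of the overdispersion estimate (Pearson ChiSquare / DF). The sketch below checks this arithmetic in R with the numbers reported above; the variable names are introduced for illustration.

```r
phi  <- 45.1360 / 39                             # Pearson ChiSquare / DF from Goodness Of Fit
se_r <- c(Intercept = 0.54462, Age = 0.01375)    # standard errors from summary(ele.glm)

se_scaled <- se_r * sqrt(phi)                    # standard errors inflated for overdispersion

print(round(phi, 4))
print(round(se_scaled, 5))
```

The scaled values are close to the JMP standard errors (0.5859 and 0.01479). In R, fitting the same model with `family = quasipoisson` applies this kind of scaling, since that family estimates the dispersion from the Pearson statistic.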
Studentized Deviance Residual by Predicted
Appendix: Code for some useful R functions for OLS Regression
# Score test for non-constant error variance in an unweighted linear model.
NCV.test = function (model, var.formula, data = NULL, subset, na.action)
{
    if (!is.null(weights(model)))
        stop("requires unweighted linear model")
    if ((!is.null(class(model$na.action))) && class(model$na.action) ==
        "exclude")
        model <- update(model, na.action = na.omit)
    sumry <- summary(model)
    residuals <- residuals(model)
    # ML estimate of the error variance and the scaled squared residuals
    S.sq <- df.residual(model) * (sumry$sigma)^2/sum(!is.na(residuals))
    U <- (residuals^2)/S.sq
    if (missing(var.formula)) {
        # Default: variance modeled as a function of the fitted values
        mod <- lm(U ~ fitted.values(model))
        varnames <- "fitted.values"
        var.formula <- ~fitted.values
        df <- 1
    }
    else {
        if (missing(na.action)) {
            # Fixed: the original referred to mod$na.action before mod was defined
            na.action <- if (is.null(model$na.action))
                options()$na.action
            else parse(text = paste("na.", class(model$na.action),
                sep = ""))
        }
        m <- match.call(expand.dots = FALSE)
        if (is.matrix(eval(m$data, sys.frame(sys.parent()))))
            m$data <- as.data.frame(data)
        m$formula <- var.formula
        m$var.formula <- m$model <- m$... <- NULL
        m[[1]] <- as.name("model.frame")
        mf <- eval(m, sys.frame(sys.parent()))
        response <- attr(attr(mf, "terms"), "response")
        if (response)
            stop(paste("Variance formula contains a response."))
        mf$U <- U
        .X <- model.matrix(as.formula(paste("U~",
            as.character(var.formula)[2],
            "-1")), data = mf)
        mod <- lm(U ~ .X)
        df <- sum(!is.na(coefficients(mod))) - 1
    }
    # Test statistic: half the regression sum of squares of U on the variance terms
    SS <- anova(mod)$"Sum Sq"
    RegSS <- sum(SS) - SS[length(SS)]
    Chisq <- RegSS/2
    result <- list(formula = var.formula, formula.name = "Variance",
        ChiSquare = Chisq, Df = df, p = 1 - pchisq(Chisq, df),
        test = "Non-constant Variance Score Test")
    class(result) <- "chisq.test"
    result
}
# Plot of sqrt(|residuals|) vs. fitted values, with the NCV test p-value in the title.
ncv.plot = function (fit) {
    temp <- NCV.test(fit)
    p <- temp$p
    e <- sqrt(abs(resid(fit)))
    yhat <- fitted(fit)
    plot(yhat, e, xlab = "Fitted Values", ylab = "Sqrt. Abs. Residuals",
        main = paste("Non-Constant Variance Plot ~ NCV Test (p=",
            signif(p, 4), ")"))
    lines(lowess(yhat, e), lty = 1, col = "Blue")
}
# Note: Studresid() is not defined in this handout. The code below assumes it
# returns studentized residuals, e.g. Studresid = function(fit) rstudent(fit).
Resplot = function (lm1, lms = summary(lm1))
{
    par(mfrow = c(2, 2), pty = "m")
    y <- resid(lm1)
    # Normal probability plot of the studentized residuals
    qqnorm(Studresid(lm1), main = "Normal Probability Plot",
        ylab = "Residuals")
    abline(0, sqrt(var(Studresid(lm1))))
    # Studentized residuals vs. fitted values with lowess trend and spread bands
    plot(fitted(lm1), Studresid(lm1), xlab = "Fitted Values",
        ylab = "Studentized Residuals", main = "Plot of Studentized Residuals vs. Fitted",
        cex = 0.65)
    x <- fitted(lm1)
    y <- Studresid(lm1)
    f <- 0.5
    xs <- sort(x, index.return = TRUE)
    x <- xs$x
    ix <- xs$ix
    y <- y[ix]
    trend <- lowess(x, y, f)
    e2 <- (y - trend$y)^2
    scatter <- lowess(x, e2, f)
    uplim <- trend$y + sqrt(abs(scatter$y))
    lowlim <- trend$y - sqrt(abs(scatter$y))
    lines(trend$x, trend$y, col = "Blue")
    lines(scatter$x, uplim, col = "Red")
    lines(scatter$x, lowlim, col = "Red")
    abline(h = 0, lty = 2, col = 2)
    # Sqrt of absolute studentized residuals vs. fitted values
    plot(fitted(lm1), sqrt(abs(Studresid(lm1))), main = "Loess Fit of Residuals",
        ylab = "Absolute Stud. Residuals", xlab = "Fitted Values",
        cex = 0.7)
    lines(lowess(fitted(lm1), sqrt(abs(Studresid(lm1)))), lty = 1,
        col = 3)
    abline(h = mean(sqrt(abs(Studresid(lm1)))), col = "blue",
        lty = 3)
    # Second page: quantile plots of centered fitted values and of residuals
    par(mfrow = c(1, 2))
    par(ask = TRUE)
    yl <- c(min(resid(lm1), fitted(lm1) - mean(fitted(lm1))),
        max(resid(lm1), fitted(lm1) - mean(fitted(lm1))))
    fit <- fitted(lm1)
    p <- sort(fit - mean(fit))
    pp <- ppoints(p)
    res <- resid(lm1)
    pr <- sort(res)
    ppr <- ppoints(pr)
    plot(pp, p, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
        ylab = "", main = "Fitted values", cex = 0.7)
    plot(ppr, pr, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
        ylab = "", main = "Residuals", cex = 0.7)
    par(mfrow = c(1, 1))
    par(ask = FALSE)
    invisible()
}