STAT 510 Homework 12 Solutions Spring 2016

1. [15pts] Let $y_i$ represent the $i$th count ($i = 1, \ldots, 10$). We want to determine if
these counts could be an independent and identically distributed sample from one
Poisson distribution, i.e., $y_i \overset{iid}{\sim} \text{Poisson}(\lambda)$. This is equivalent to an
intercept-only GLM assuming $y_i \overset{ind}{\sim} \text{Poisson}(\lambda_i)$ with $\log(\lambda_i) = \beta_0$, since this
implies $y_i \overset{iid}{\sim} \text{Poisson}(\lambda = e^{\beta_0})$.
For this model, both the deviance statistic ($X^2 = 20.18$ on 9 df with $p = 0.017$) and
Pearson statistic ($X^2 = 19.76$ on 9 df with $p = 0.019$) suggest significant lack of fit.
Hence, these counts do not seem like an independent and identically distributed sample
from one Poisson distribution.
y <- c(15, 9, 15, 23, 14, 18, 5, 7, 12, 11)  # the ten observed counts
o <- glm(y ~ 1, family = poisson(link = "log"))
# Deviance statistic, residual df, and p-value.
d <- resid(o, type = "deviance")  # Deviance residuals.
sum(d^2); o$df.residual; pchisq(sum(d^2), o$df.residual, lower.tail = FALSE)
# Pearson statistic, residual df, and p-value.
r <- resid(o, type = "pearson")  # Pearson residuals.
sum(r^2); o$df.residual; pchisq(sum(r^2), o$df.residual, lower.tail = FALSE)
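As a cross-check on the residual-based code above, both statistics can be computed directly from their definitions. This is a minimal sketch, assuming the intercept-only fit, for which the fitted mean of every observation is the sample mean; the names `lambda.hat`, `pearson.stat`, `deviance.stat`, and `df.resid` are illustrative and not part of the original solution.

```r
# Counts from Problem 1.
y <- c(15, 9, 15, 23, 14, 18, 5, 7, 12, 11)

# Under the intercept-only Poisson GLM, the MLE of lambda is the sample mean.
lambda.hat <- mean(y)

# Pearson statistic: sum of (observed - fitted)^2 / fitted.
pearson.stat <- sum((y - lambda.hat)^2 / lambda.hat)

# Deviance statistic: 2 * sum of y * log(y / fitted). The (y - fitted) terms
# sum to zero for this fit, and every count is positive, so log() is defined.
deviance.stat <- 2 * sum(y * log(y / lambda.hat))

# Residual degrees of freedom: 10 observations minus 1 estimated parameter.
df.resid <- length(y) - 1

round(c(pearson = pearson.stat, deviance = deviance.stat), 2)
pchisq(c(pearson.stat, deviance.stat), df.resid, lower.tail = FALSE)
```

Both statistics should agree, up to rounding, with the 19.76 and 20.18 reported for this problem.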
2. [35pts] Let $y_{ij}$ denote the number of infected cells in the $i$th plant ($i = 1, \ldots, 8$) with
the $j$th genotype ($j = 1, 2, 3$). While the number of infected cells on each leaf cannot
technically be Poisson distributed because there is an upper bound (no more than the
total number of cells in the leaf can be infected), this bound is likely very large, so it is
reasonable to assume that $y_{ij} \overset{ind}{\sim} \text{Poisson}(\lambda_j)$. Since we want to determine if there are
differences in resistance among the genotypes, let $\log(\lambda_j) = \beta_j$, where $\beta_j$ is the fixed
effect for genotype $j$.
Upon fitting this model, notice that the deviance statistic ($X^2 = 35.57$ on 21 df with
$p = 0.024$) and Pearson statistic ($X^2 = 34.58$ on 21 df with $p = 0.031$) indicate
significant lack of fit. Looking at the sample mean and sample variance for each genotype
(Table 1), we see that the sample variance is much higher than the sample mean for
genotype C. Plotting the deviance residuals against the fitted values suggests the lack of
fit is not due to extreme outliers (Figure 1). Consequently, it is reasonable to attribute the
lack of fit to overdispersion, which we can account for using either (1) quasi-likelihood
(QL) or (2) leaf-specific random effects in a generalized linear mixed model (GLMM).
First, let's adopt a QL approach. The estimated dispersion parameter is $\hat\phi = 1.693$ or
$\hat\phi = 1.646$, based on deviance or Pearson, respectively. I'll use the Pearson estimate
for this analysis. There is strong evidence of a difference in resistance among the
genotypes using an approximate F-test ($F = 7.73$ on (2, 21) df with $p = 0.0031$). If you
didn't check for model fit and instead assumed $\phi = 1$, your inference would be
overly optimistic ($X^2 = 25.45$ on 2 df with $p = 0.000003 \ll 0.0031$). Proceeding to
test for pairwise differences, only genotypes A and B are significantly different at the
0.05 level (Table 2).
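For reference, the QL quantities in this paragraph are related as follows. This is a sketch that plugs in the rounded values reported in this solution, so the last digit may differ slightly from R's output.

$$
\hat\phi_{\text{Pearson}} = \frac{X^2_{\text{Pearson}}}{n - p},
\qquad
F = \frac{(D_{\text{reduced}} - D_{\text{full}})/q}{\hat\phi}
  = \frac{25.45/2}{1.646} \approx 7.73 .
$$

Here $n - p = 24 - 3 = 21$ is the residual df, $q = 2$ is the number of genotype parameters being tested, and $D_{\text{reduced}} - D_{\text{full}} = 25.45$ is the drop in deviance, the same quantity that the $\phi = 1$ test refers to a $\chi^2_2$ distribution.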
Table 1: Sample means and sample variances for the fungal resistance data.
Genotype              A       B       C
Sample Mean       34.00   20.88   26.75
Sample Variance   32.57   26.98   71.93
Table 2: Pairwise two-sided approximate t-tests for the fungal resistance data analyzed using
a QL approach (without adjustment for multiple comparisons).
Genotype Pair       t   df   p-value
A−B              3.87   21   0.0009
A−C              2.05   21   0.054
B−C             -1.87   21   0.075
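The t statistics in Table 2 can also be reproduced without matrix algebra. The sketch below assumes a standard property of the one-way Poisson log-linear model, namely that the estimated variance of a fitted log mean is $\hat\phi$ divided by the group total of the counts. Names such as `tot`, `pair`, and `t.stat` are illustrative and not part of the original solution.

```r
# Fungal resistance data: 8 counts per genotype.
y <- c(39, 31, 43, 31, 34, 36, 34, 24,   # genotype A
       23, 28, 24, 19, 16, 20, 25, 12,   # genotype B
       36, 38, 33, 22, 23, 17, 29, 16)   # genotype C
geno <- factor(rep(c("A", "B", "C"), each = 8))

# Group totals and means as plain named vectors.
tot <- sapply(split(y, geno), sum)
mu  <- sapply(split(y, geno), mean)

# Pearson-based dispersion estimate: Pearson statistic / residual df.
df.resid <- length(y) - nlevels(geno)
mu.i <- ave(y, geno)                     # fitted mean for each observation
phi  <- sum((y - mu.i)^2 / mu.i) / df.resid

# Pairwise comparisons on the log scale.
pair <- rbind(c("A", "B"), c("A", "C"), c("B", "C"))
est  <- unname(log(mu[pair[, 1]] / mu[pair[, 2]]))
se   <- unname(sqrt(phi * (1 / tot[pair[, 1]] + 1 / tot[pair[, 2]])))
t.stat <- est / se
p.val  <- 2 * pt(abs(t.stat), df.resid, lower.tail = FALSE)

data.frame(pair = paste(pair[, 1], pair[, 2], sep = "-"),
           t = round(t.stat, 2), df = df.resid, p = signif(p.val, 2))
```

The resulting t statistics should match Table 2 to two decimals, which is a quick way to confirm that the contrast matrix in the `glm`-based code is set up as intended.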
Instead of using a QL approach, you could use a GLMM with leaf-specific random
effects to account for overdispersion. Here, let $y_{ij} \overset{ind}{\sim} \text{Poisson}(\lambda_{ij})$ and
$\log(\lambda_{ij}) = \beta_j + \gamma_{ij}$, where $\gamma_{ij} \overset{iid}{\sim} N(0, \tau^2)$ for $i = 1, \ldots, 8$, $j = 1, 2, 3$.
Similar to the quasi-likelihood approach, there is strong evidence of a difference in resistance among the
genotypes using an asymptotic $\chi^2$-test ($X^2 = 13.26$ on 2 df with $p = 0.0013$). All
pairwise differences are significant at the 0.05 level (Table 3), but recall that the GLMM
may be more liberal than the QL approach for small datasets (slide 57 of set 29).
Table 3: Pairwise two-sided asymptotic z-tests for the fungal resistance data analyzed using
a GLMM (without adjustment for multiple comparisons).
Genotype Pair       z   p-value
A−B              4.18   0.00003
A−C              2.18   0.029
B−C             -2.03   0.042
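The p-values in Table 3 and the likelihood ratio test follow directly from the reported statistics. A small sketch, using the rounded z and $X^2$ values from this solution as inputs (so the results match only to rounding); the object names `p.z` and `p.lrt` are illustrative.

```r
# Rounded z statistics as reported in Table 3.
z <- c("A-B" = 4.18, "A-C" = 2.18, "B-C" = -2.03)

# Two-sided p-values from the standard normal reference distribution.
p.z <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Likelihood ratio test p-value: X^2 = 13.26 referred to chi-square with 2 df.
# For 2 df the upper tail probability has the closed form exp(-x / 2).
p.lrt <- pchisq(13.26, df = 2, lower.tail = FALSE)

signif(p.z, 2)
signif(p.lrt, 2)
```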
Figure 1: Deviance residuals (y-axis) versus fitted values (x-axis) for a Poisson GLM fit to
the fungal resistance data.
y <- c(39, 31, 43, 31, 34, 36, 34, 24,   # genotype A
       23, 28, 24, 19, 16, 20, 25, 12,   # genotype B
       36, 38, 33, 22, 23, 17, 29, 16)   # genotype C
geno <- factor(rep(c("A", "B", "C"), each = length(y) / 3))
fit <- glm(y ~ geno, family = poisson(link = "log"))
# Test for lack of fit: deviance statistic, residual df, and p-value.
d <- resid(fit, type = "deviance")
sum(d^2); fit$df.residual; pchisq(sum(d^2), fit$df.residual, lower.tail = FALSE)
# Same test using the Pearson statistic.
r <- resid(fit, type = "pearson")
sum(r^2); fit$df.residual; pchisq(sum(r^2), fit$df.residual, lower.tail = FALSE)
# Plot deviance residuals against fitted values (Figure 1).
plot(d ~ fitted(fit), xlab = "Fitted Values", ylab = "Deviance Residuals")
# Sample mean and variance by genotype (Table 1).
rbind(mean = tapply(y, geno, mean), var = tapply(y, geno, var))
# Estimated dispersion parameter (deviance-based).
sum(d^2) / fit$df.residual
# Very similar to the Pearson-based estimate.
sum(r^2) / fit$df.residual
##### QL for overdispersion.
fit.ql <- glm(y ~ geno, family = quasipoisson(link = "log"))
# Do a quasi-likelihood test for genotype effects.
anova(fit.ql, test = "F")
# Compare to a LRT with phi = 1: wrongly optimistic!
options(scipen = 999); anova(fit, test = "Chisq")
# Test for pairwise differences.
b <- coef(fit.ql)
v <- vcov(fit.ql)
# Each row of C is a contrast of (intercept, genoB, genoC); A is the baseline level.
C <- matrix(c(0, -1,  0,   # A vs. B
              0,  0, -1,   # A vs. C
              0,  1, -1),  # B vs. C
            nrow = 3, byrow = TRUE)
se <- sqrt(diag(C %*% v %*% t(C)))
t.stat <- drop((C %*% b) / se)  # named t.stat so it does not mask base::t()
p <- 2 * pt(abs(t.stat), fit.ql$df.residual, lower.tail = FALSE)
names(p) <- c("A vs B", "A vs C", "B vs C")
cbind(t = t.stat, df = rep(fit.ql$df.residual, 3), p)
##### GLMM for overdispersion.
library(lme4)
leaf <- factor(1:24)  # one random-effect level per leaf (per observation)
fit.glmm <- glmer(y ~ geno + (1 | leaf), family = poisson(link = "log"))
fit.glmm.red <- glmer(y ~ 1 + (1 | leaf), family = poisson(link = "log"))
anova(fit.glmm.red, fit.glmm)  # likelihood ratio test for genotype effects
# Test for pairwise differences (reuses the contrast matrix C defined above).
b.glmm <- fixef(fit.glmm)
v.glmm <- as.matrix(vcov(fit.glmm))  # convert to a base matrix before multiplying
se.glmm <- sqrt(diag(C %*% v.glmm %*% t(C)))
z <- drop((C %*% b.glmm) / se.glmm)
p.val <- 2 * pnorm(abs(z), 0, 1, lower.tail = FALSE)
names(p.val) <- c("A vs B", "A vs C", "B vs C")
cbind(z, p.val)
3. [15pts] Fundamentally, this may be an unsafe analysis strategy because it violates the
constant variance assumption in ANOVA, i.e., $\text{Var}(y_i) = \sigma^2$ for all $i$. Recall that the
variance of a single Binomial($m$, $p$) observation is $mp(1 - p)$. For this experiment, we
have $\text{Var}(y_i) = 5$ or $\text{Var}(y_i) = 0.95$, depending on which treatment observation $i$ received
(Table 4). Since the sample size is much larger for treatment C (50 experimental
units) than for A and B combined (10 + 10 = 20 experimental units), the MSE will be
weighted more heavily towards the smaller variance of those receiving treatment C,
shrinking the estimate of $\sigma^2$. Ultimately, this can cause p-values to be too small
and confidence intervals to be too narrow, in particular for comparing treatments A
and B (as these two have higher variability).
Table 4: True variance of an observation (one experimental unit) for each treatment.
Treatment   p      Variance of an Observation
A           0.5    20(0.5)(1 − 0.5) = 5.00
B           0.5    20(0.5)(1 − 0.5) = 5.00
C           0.95   20(0.95)(1 − 0.95) = 0.95
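To see the size of the effect, one can work out what the pooled MSE estimates on average. This is a sketch, assuming the usual pooled-variance form of the one-way ANOVA MSE, group sizes of 10, 10, and 50, and the true variances in Table 4:

$$
E(\text{MSE}) = \frac{\sum_j (n_j - 1)\,\sigma_j^2}{\sum_j (n_j - 1)}
= \frac{9(5.00) + 9(5.00) + 49(0.95)}{67}
= \frac{136.55}{67} \approx 2.04 .
$$

Under these assumptions the pooled estimate targets roughly 2.04 rather than 5, so a standard error for the A versus B comparison built from it is too small by a factor of about $\sqrt{2.04/5} \approx 0.64$.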
4. [35pts] Let $y_i$ denote the number of multiple-fatality plane crashes in week $i$ and
$x_i$ be the index of news coverage in the week prior to week $i$. A natural choice for
count data such as these is the Poisson distribution. Assume $y_i \overset{ind}{\sim} \text{Poisson}(\lambda_i)$ and
$\log(\lambda_i) = \beta_0 + \beta_1 x_i$. Fitting this GLM to these data, there is no indication of lack of
fit via the deviance statistic ($X^2 = 9.79$ on 15 df with $p = 0.83$) or Pearson statistic
($X^2 = 10.08$ on 15 df with $p = 0.81$). However, since the sample size is small and the
counts are small (all are less than ten), the asymptotic approximation of these tests
may be poor. A plot of the deviance residuals against the fitted values does not appear
to exhibit increasing variance or anything else problematic (Figure 2).
The estimated regression coefficient for news coverage index, $\hat\beta_1 = 0.00199$, is small yet
positive and significant at the 0.05 level ($X^2 = 5.501$ on 1 df with $p = 0.019$). Hence,
there is evidence of an association between the news coverage index and the number of
crashes in the following week. Since $\hat\beta_1$ is small, let's interpret it in terms of a 100-unit
increase in $x_i$ rather than the more customary one-unit increase: a 100-unit increase
in news coverage index is associated with a 22.1% increase in the average number of
crashes the following week.
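The 100-unit interpretation follows from the log link. A short derivation, using the rounded estimate $\hat\beta_1 = 0.00199$ (the 22.1% quoted in the text is presumably computed from the unrounded coefficient):

$$
\frac{\lambda(x + 100)}{\lambda(x)}
= \frac{e^{\beta_0 + \beta_1 (x + 100)}}{e^{\beta_0 + \beta_1 x}}
= e^{100 \beta_1},
\qquad
e^{100 \hat\beta_1} \approx e^{0.199} \approx 1.22 .
$$

The ratio does not depend on $x$, so the same multiplicative change in the mean applies at any starting value of the index.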
Figure 2: Deviance residuals (y-axis) versus fitted values (x-axis) for a Poisson GLM fit to
the plane crash data.
dat <- read.table("http://www.public.iastate.edu/~dnett/S510/PlaneCrashes.txt",
                  header = TRUE)
fit <- glm(crashes ~ index, data = dat, family = poisson(link = "log"))
# Test for lack of fit: deviance statistic, residual df, and p-value.
d <- resid(fit, type = "deviance")
sum(d^2); fit$df.residual; pchisq(sum(d^2), fit$df.residual, lower.tail = FALSE)
# Same test using the Pearson statistic.
r <- resid(fit, type = "pearson")
sum(r^2); fit$df.residual; pchisq(sum(r^2), fit$df.residual, lower.tail = FALSE)
# Plot deviance residuals against fitted values (Figure 2).
plot(d ~ fitted(fit), xlab = "Fitted Values", ylab = "Deviance Residuals")
# Test for effect of index.
anova(fit, test = "Chisq")
coef(fit)[2]  # beta.1 estimate.
exp(100 * coef(fit)[2])  # multiplicative change in mean crashes per 100-unit increase