Lecture 3: Analysis of Variance
Statistical Models
VU Amsterdam
1. ANCOVA
2. Lasso, ridge and elastic net
3. Multiple testing
ANCOVA
Analysis of covariance (ANCOVA) combines features of both ANOVA
and linear regression.
It augments the ANOVA model with one or more additional
quantitative variables, called covariates.
The covariates are included to reduce the variance in the error terms
and provide more precise measurement of the (treatment) effects.
ANCOVA is usually used to test the main and interaction effects of
the factors, while taking into account the effects of the covariates.
General ANCOVA model
In general, the ANCOVA model can be written as

Y = Zα + Xβ + e,   e ∼ N(0, σ²I).

Vector α contains µ and parameters such as αi, δj and γij representing factors and interactions,
Matrix Z (the design matrix of the ANOVA part) contains 0's and 1's,
Vector β contains the coefficients of the covariates,
Matrix X (the design matrix of the LR part) contains the covariate values.

Remark. Note that Zα here is the same as Xβ in the one-way and two-way ANOVAs considered earlier, whereas here we use Xβ to represent the covariates in the model.
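To see what Z and X look like concretely, here is a minimal sketch in R (simulated data; the names group and x are hypothetical). Note that model.matrix uses treatment contrasts, so the factor is coded by I − 1 dummy columns relative to the first level, rather than by the overparametrized (µ, α1, . . . , αI) form above:

> group=factor(rep(1:3,each=2)) # a factor with I=3 levels
> x=round(rnorm(6),1) # one covariate
> model.matrix(~group+x) # combined design matrix: intercept and group dummies (Z part), covariate (X part)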
Examples of ANCOVA models
One factor and one covariate: for i = 1, . . . , I, j = 1, . . . , ni,

Yij = µ + αi + βxij + eij,

Z is as in one-way ANOVA, α = (µ, α1, . . . , αI)^T and X = (x11, . . . , x1n1, x21, . . . , xI,nI)^T.

One factor (balanced) and q covariates: for i = 1, . . . , I, j = 1, . . . , N,

Yij = µ + αi + β1xij1 + · · · + βqxijq + eij.

Z and α are as before, X is an (IN × q)-matrix and β is a (q × 1)-vector, as in multiple linear regression.

Two factors (balanced) with one covariate: for i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , K,

Yijk = µ + αi + δj + γij + βxijk + eijk.

Z and α, δ and γ are as in two-way ANOVA, X = (x111, . . . , xIJK)^T.
ANCOVA model with one factor and one covariate
One factor and one covariate ANCOVA: for i = 1, . . . , I, j = 1, . . . , N,

Ω:   Yij = µ + αi + βxij + eij.

Regression lines have the same slopes, but possibly different intercepts per group. The total number of observations is n = IN.

Test for the main effect of Factor A (reduced model ωA):

HA : α1 = . . . = αI = 0.

Another interpretation of HA: "all regression lines have the same intercept" (and the same slope, this is already in the full model Ω).

Test for the presence of the covariate (reduced model ωβ):

Hβ : β = 0.

Another interpretation of Hβ: "all regression lines are horizontal".
Testing in ANCOVA with one factor and one covariate
Testing HA is like in ANOVA (degrees of freedom may be different):

SΩ/σ² ∼ χ²_{n−I−1},   SSA/σ² = (SωA − SΩ)/σ² ∼ χ²_{I−1} (under ωA).

Since SSA and SΩ are independent, we have under HA:

FA = [SSA/(I − 1)] / [SΩ/(n − I − 1)] ∼ F_{I−1, n−I−1}.
Testing Hβ: SΩ/σ² ∼ χ²_{n−I−1} and SSβ/σ² = (Sωβ − SΩ)/σ² ∼ χ²_1 (under ωβ). Since SSβ and SΩ are independent, we have under Hβ:

Fβ = SSβ / [SΩ/(n − I − 1)] ∼ F_{1, n−I−1}.
We have SΩ/σ² ∼ χ²_{n−(I+1)} as the rank of the design matrix is I + 1 under the full model Ω (since dim(Ω) = I + 1), and SωA/σ² ∼ χ²_{n−2} since the rank of the design matrix is 2 under the reduced model ωA, implying SSA/σ² ∼ χ²_{(n−2)−(n−(I+1))} = χ²_{I−1}. Similarly, Sωβ/σ² ∼ χ²_{n−I} since the rank of the design matrix is I under the reduced model ωβ, implying SSβ/σ² ∼ χ²_{(n−I)−(n−(I+1))} = χ²_1.
AN(C)OVA table: the order of variables in the R formula matters

For testing HA, the factor goes second in the R formula for the model:
model1=lm(y~covariate+factor); anova(model1).

For testing Hβ, the covariate goes second in the R formula for the model:
model2=lm(y~factor+covariate); anova(model2).

The partitioning of the sum of squares is not the same for model1 and model2. The difference is whether the covariate is added to a model already containing the factor, or vice versa.
For testing HA, the AN(C)OVA table in R looks as follows:

            Df   Sum Sq   Mean Sq   F value   Pr(>F)
covariate
factor
Residuals
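To convince yourself that the order matters, here is a minimal sketch with simulated data (all names hypothetical); because the covariate is correlated with the factor, the Sum Sq lines of anova(model1) and anova(model2) differ:

> set.seed(1)
> factorA=factor(rep(1:3,each=20)) # I=3 groups, N=20 each
> covariate=rnorm(60)+as.numeric(factorA) # covariate correlated with the factor
> y=1+c(0,2,4)[factorA]+0.5*covariate+rnorm(60)
> model1=lm(y~covariate+factorA); anova(model1) # the factorA line tests HA
> model2=lm(y~factorA+covariate); anova(model2) # the covariate line tests Hbeta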
Testing for homogeneity of slopes in ANCOVA with one factor and one covariate
The tests of HA and Hβ assume a common slope for all I groups. We can test the assumption of equal slopes in the groups:

HAβ : β1 = . . . = βI, where βi is the slope in the i-th group.

In effect, HAβ states that the I regression lines are parallel. In a way, HAβ says that no interaction between the factor and the covariate is present.

The full model Ω allowing for different slopes becomes:

Ω:   Yij = µ + αi + βixij + eij,   i = 1, . . . , I, j = 1, . . . , N.

Notice that this is a special case of ANCOVA with one factor (with I levels) and I covariates (some of which are dummy).

Reduced model ωAβ:   Yij = µ + αi + βxij + eij,   i = 1, . . . , I, j = 1, . . . , N.
Testing HAβ is like in ANOVA (with different degrees of freedom):
SΩ/σ² ∼ χ²_{n−2I},   SSAβ/σ² = (SωAβ − SΩ)/σ² ∼ χ²_{I−1} (under ωAβ).

Since SSAβ and SΩ are independent, we have under HAβ:

FAβ = [SSAβ/(I − 1)] / [SΩ/(n − 2I)] ∼ F_{I−1, n−2I}.
We have SΩ/σ² ∼ χ²_{n−2I} because the rank of the design matrix is 2I under the full model Ω, and SωAβ/σ² ∼ χ²_{n−I−1} because the rank of the design matrix is I + 1 under the reduced model ωAβ, implying SSAβ/σ² ∼ χ²_{(n−I−1)−(n−2I)} = χ²_{I−1}.
ANOVA table for testing homogeneity in the ANCOVA model with one factor and one covariate
For testing HAβ, you can use the R commands:
model=lm(y~covariate*factor); anova(model).
The corresponding ANOVA table in R looks as follows:

                   Df   Sum Sq   Mean Sq   F value   Pr(>F)
covariate
factor
covariate:factor
Residuals
Only the line covariate:factor is relevant for testing HAβ .
Lasso, ridge and elastic net methods: motivation
So far we had only a few variables to choose from, so we could identify the significant ones by inspecting the corresponding p-values. This quickly becomes infeasible when the number of predictors is large. Can we find an algorithm that automatically shrinks the coefficients of the insignificant variables or (better!) sets them to zero altogether? This is precisely what the lasso and its close cousin, ridge regression, do.
Regularization by the penalty term
Lasso and ridge regularization work by adding a penalty term λP(β) to the mean residual sum of squares

(1/n) Σ_{i=1}^n (Yi − (β0 + β1Xi,1 + . . . + βpXi,p))² = ‖Y − Xβ‖²/n

and minimizing the resulting sum (1/n)‖Y − Xβ‖² + λP(β) (2n can be used instead of n) with respect to β = (β0, β1, . . . , βp) ∈ R^{p+1}:

(1/n)‖Y − Xβ‖² + λP(β) → min_β.
Lasso and ridge methods
Lasso method: P(β) = ‖β‖1 = Σ_{k=0}^p |βk|, i.e.,

min_β { ‖Y − Xβ‖²/n + λ‖β‖1 } = min_β { ‖Y − Xβ‖²/n + λ Σ_{k=0}^p |βk| }.

Ridge method: P(β) = ‖β‖2² = Σ_{k=0}^p βk², i.e.,

min_β { ‖Y − Xβ‖²/n + λ‖β‖2² } = min_β { ‖Y − Xβ‖²/n + λ Σ_{k=0}^p βk² }.
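For ridge, the minimization above can be solved explicitly: setting the gradient to zero gives β̂ridge = (XᵀX + nλI)⁻¹XᵀY. A minimal sketch verifying this on simulated data (all names hypothetical; glmnet will not reproduce these numbers exactly, since it standardizes the predictors and does not penalize the intercept):

> set.seed(1)
> n=50; p=5
> X=cbind(1,matrix(rnorm(n*p),n,p)) # design matrix, first column for the intercept
> y=X%*%c(1,2,0,0,-1,0)+rnorm(n)
> lambda=0.1
> beta.ridge=solve(t(X)%*%X+n*lambda*diag(p+1),t(X)%*%y) # (X'X+n*lambda*I)^{-1} X'y
> beta.ridge # shrunken towards zero compared to the lm coefficients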
Elastic net method
Elastic net method: P(β) = α‖β‖1 + (1 − α)‖β‖2² (0 ≤ α ≤ 1 controls the "mix" of ridge and lasso regularization, with α = 1 being "pure" lasso and α = 0 being "pure" ridge), i.e.,

min_β { ‖Y − Xβ‖²/n + λ(α‖β‖1 + (1 − α)‖β‖2²) }.
In all three procedures (lasso, ridge and elastic net), λ ≥ 0 is a free parameter, usually selected by a method called cross-validation.
Lasso, ridge and elastic net methods: some remarks
Ridge regression shrinks the β coefficients towards zero, but does not set them exactly to zero. That is, it will not get rid of irrelevant features but rather minimizes their impact on the trained model.

The lasso method overcomes this disadvantage of ridge regression by not only penalizing high values of the coefficients β but actually setting them to zero if they are not relevant. One usually ends up with fewer features in the model than one started with, which is an advantage.

The R package glmnet implements the elastic net method (for any 0 ≤ α ≤ 1) via the R function glmnet, with ridge (α = 0) and lasso (α = 1) as particular cases.

The choice of λ is done by cross-validation, implemented by the R function cv.glmnet.
Lasso (ridge and elastic net) analysis in R: generic code
Suppose we have a data frame named data, whose first column is the response variable and whose remaining columns are the features to select from.
>library(glmnet)
>x=as.matrix(data[,-1]) # remove the response variable
>y=as.double(as.matrix(data[,1])) # only the response variable
>train=sample(1:nrow(x),0.67*nrow(x)) # train by using 2/3 of the data
>x.train=x[train,]; y.train=y[train] # data to train
>x.test=x[-train,]; y.test=y[-train] # data to test the prediction quality
>lasso.mod=glmnet(x.train,y.train,alpha=1)
>cv.lasso=cv.glmnet(x.train,y.train,alpha=1,type.measure="mse")
>plot(lasso.mod,label=T,xvar="lambda") # have a look at the lasso path
>plot(cv.lasso) # the best lambda by cross-validation
>plot(cv.lasso$glmnet.fit,xvar="lambda",label=T)
>lambda.min=cv.lasso$lambda.min; lambda.1se=cv.lasso$lambda.1se
>coef(lasso.mod,s=lambda.min) # beta's for the best lambda
>y.pred=predict(lasso.mod,s=lambda.min,newx=x.test) # predict for the test set
>mse.lasso=mean((y.test-y.pred)^2) # mse for the predicted test rows
Remark. lambda.min is the value of λ that gives minimum mean cross-validated error. The
other λ saved is lambda.1se, which gives the most regularized model such that error is within
one standard error of the minimum.
Pairwise comparisons in ANOVA, multiple testing
In the one-way ANOVA, if H0 : α1 = · · · = αI is rejected, it is of interest to investigate whether αi = αj (equivalently, αi − αj = 0) for each 1 ≤ i < j ≤ I. (Or, investigate whether αi ≤ αj, equivalently, αi − αj ≤ 0.)

To test H0 : αi − αj = 0 against H1 : αi − αj ≠ 0, the test statistic is

T = (α̂i − α̂j) / (σ̂ √(ni⁻¹ + nj⁻¹)),   where σ̂² = SΩ/(n − I).

Under H0, T ∼ t_{n−I}, and we reject H0 at significance level α if |T| > t_{n−I;1−α/2}. A (1 − α)-CI for αi − αj is α̂i − α̂j ± t_{n−I;1−α/2} σ̂ √(ni⁻¹ + nj⁻¹).

These are called pairwise comparisons. What if we test (or construct CIs) simultaneously for all H0,ij : αi = αj, 1 ≤ i < j ≤ I? Many testing problems ⇒ the multiple testing problem.
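In R, such unadjusted pairwise comparisons can be obtained with the base function pairwise.t.test, which uses the pooled σ̂ by default; a minimal sketch (simulated data, hypothetical names):

> set.seed(1)
> g=factor(rep(1:4,each=10)) # I=4 groups
> y=rnorm(40)+c(0,0,1,2)[g]
> pairwise.t.test(y,g,p.adjust.method="none") # raw (individual) p-values for all pairs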
Multiple testing
Each H0 is falsely rejected (type I error) with probability at most αind (= 0.05, say).

Given 2 null hypotheses, there are 2 possibilities to make such an error. The probability of at least 1 error is between 0.05 and 0.05 + 0.05 = 0.1.

For m arbitrary null hypotheses H0,1, . . . , H0,m, this error can grow to αtot = m·αind. Indeed, under all H0,i, i = 1, . . . , m,

αtot = P(at least one H0,i is rejected) ≤ Σ_{i=1}^m P(H0,i is rejected) ≤ m·αind.

For example, to ensure αtot = 0.05 for m = 100, we would have to take αind ≤ αtot/m = 0.0005, which is way too strict. Hardly any of the H0,i, i = 1, . . . , m, will be rejected at αind = 0.0005.
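A quick numerical sketch of how fast this error grows: assuming the m tests are independent and each is carried out at level αind = 0.05, the probability of at least one false rejection is exactly 1 − (1 − αind)^m:

> m=c(1,2,10,100)
> 1-(1-0.05)^m # 0.050 0.098 0.401 0.994: almost certain for m=100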
Multiple testing, family-wise error rate (FWER)
We have just tried to control P(at least one H0,i is rejected), which is called the FWER (family-wise error rate).

To guarantee αtot ≤ 0.05 we need to impose αind,i ≤ 0.05/m for all H0,i's. Thus, a simple way to obtain an overall level αtot is to carry out each individual test with αind,i = αtot/m, known as the Bonferroni correction. This is the same as comparing the individual p-values pind,i to αtot/m.

Simultaneous p-values psim,i for H0,1, . . . , H0,m are such that if every H0,i with psim,i ≤ αtot is rejected, then under all H0,i, i = 1, . . . , m,

P(at least one true H0,i is rejected) ≤ αtot.

In R, the simultaneous p-values are called adjusted p-values for multiple comparisons, and are computed by p.adjust. They are different for different methods.

Adjusted p-values according to the Bonferroni correction are psim,i = m·pind,i (capped at 1).
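A minimal sketch of this correspondence with hypothetical p-values:

> p.ind=c(0.001,0.01,0.04,0.2); m=length(p.ind)
> pmin(1,m*p.ind) # Bonferroni by hand: 0.004 0.040 0.160 0.800
> p.adjust(p.ind,method="bonferroni") # the same values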
Multiple testing procedures for controlling FWER
Multiple testing arises when there are many parameters of interest, e.g., when investigating all differences αi − αi′ of a set of effects αi in ANOVA. The latter is the so-called "a-posteriori testing", performed following rejection of a composite hypothesis of the type H0 : αi = 0, i = 1, . . . , I.

The Bonferroni correction is not the only method to control the FWER. Alternatives:
Sidak correction (under an independence assumption, slightly better than Bonferroni),
Holm-Bonferroni method, better than Bonferroni (making it obsolete),
Tukey's procedure (use library(multcomp), only for pairwise comparisons),
Hochberg's step-up procedure and Hommel's method (both under some assumptions),
some extensions of the methods mentioned above.

Similarly, one designs simultaneous confidence intervals for a set of parameters that have an overall confidence level of 1 − αtot.

To implement these methods in R: fed with the individual p-values pind,i and a specified method, p.adjust gives the adjusted p-values psim,i (not for the Sidak and Tukey procedures), which should be compared to a specified significance level αtot. The corresponding method rejects those hypotheses for which psim,i ≤ αtot.
More on Tukey’s method
The Tukey method is used to test all pairwise differences in AN(C)OVA. For example, in the one-way ANOVA: test H0 : αi − αj = 0 against H1 : αi − αj ≠ 0, for all pairwise comparisons 1 ≤ i < j ≤ I, at level α, with r observations per level (balanced design), so that the total number of observations is n = rI. Then reject all H0,ij for which

|α̂i − α̂j| / (σ̂/√r) > q_{I,n−I;1−α},

where q_{I,n−I;1−α} denotes the (1 − α)-quantile of the Studentized range distribution with parameters I and n − I. The quantiles of this distribution can be obtained from reference books or statistical software packages.

When the sample sizes are unequal, the Tukey-Kramer method replaces σ̂/√r above with the quantity σ̂ √((ni⁻¹ + nj⁻¹)/2).

(1 − α) Tukey SCIs for all pairwise differences αi − αj have the form

α̂i − α̂j ± q_{I,n−I;1−α} σ̂/√r.
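Besides glht from multcomp (see below), base R offers the same procedure via TukeyHSD applied to an aov fit. A minimal sketch with simulated data (hypothetical names):

> set.seed(1)
> g=factor(rep(1:3,each=10)) # I=3 levels, r=10 observations per level
> y=rnorm(30)+c(0,1,3)[g]
> TukeyHSD(aov(y~g),conf.level=0.95) # simultaneous CIs and adjusted p-values for all pairs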
More on Holm-Bonferroni method
Let p1, . . . , pm be the individual p-values for testing H0,1, . . . , H0,m, and let α be an overall significance level.

Order the p-values: p(1) ≤ . . . ≤ p(m);
let k be the minimal index such that p(k) > α/(m + 1 − k);
reject the null hypotheses H0,(1), . . . , H0,(k−1) and do not reject the rest;
if k = 1, do not reject any of the null hypotheses H0,i;
if no such k exists, reject all H0,i.

Notice that the Holm-Bonferroni method can equivalently be realized by computing adjusted p-values psim,(i) = p(i)·(m + 1 − i) (up to an enforced monotonicity, this is what R does in p.adjust), which can then be compared with the overall significance level α.
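A minimal sketch of this equivalence with hypothetical p-values (the cummax step enforces the monotonicity mentioned above):

> p=c(0.001,0.01,0.04,0.2); m=length(p); i=1:m # p is already ordered here
> pmin(1,cummax((m+1-i)*p)) # Holm by hand: 0.004 0.030 0.080 0.200
> p.adjust(p,method="holm") # the same values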
Individual p-values obtained in ANOVA
Recall our dataset coffee with two factors location and strategy.
> model2=lm(salesincr~location*strategy,coffee)
> anova(model2); summary(model2)
[ some output deleted ]
Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                 1938.757     48.854  39.684  < 2e-16 ***
locationDen Haag           -2087.925     69.091 -30.220  < 2e-16 ***
locationHaarlem             -908.266     69.091 -13.146  < 2e-16 ***
locationRotterdam          -1859.759     69.091 -26.918  < 2e-16 ***
locationUtrecht             1097.922     69.091  15.891  < 2e-16 ***
strategy2                   -118.416     69.091  -1.714  0.09170 .
strategy3                   -289.456     69.091  -4.190 9.31e-05 ***
locationDen Haag:strategy2   224.789     97.709   2.301  0.02491 *
[ some output deleted ]
The p-values produced above are not simultaneous. The p-values in the lines locationCity are for the hypotheses H0 : α2 = α1, . . . , H0 : α5 = α1, for the levels αi of the main factor location.
Multiple testing in R by Tukey’s method
> library(multcomp) # an alternative: TukeyHSD() from the base stats package
> coffee.mult=glht(model2,linfct=mcp(location="Tukey")); summary(coffee.mult)
                            Estimate Std. Error t value Pr(>|t|)
Den Haag - Amsterdam == 0   -2087.92      69.09 -30.220   <0.001 ***
Haarlem - Amsterdam == 0     -908.27      69.09 -13.146   <0.001 ***
Rotterdam - Amsterdam == 0  -1859.76      69.09 -26.918   <0.001 ***
Utrecht - Amsterdam == 0     1097.92      69.09  15.891   <0.001 ***
Haarlem - Den Haag == 0      1179.66      69.09  17.074   <0.001 ***
Rotterdam - Den Haag == 0     228.17      69.09   3.302   0.0135 *
Utrecht - Den Haag == 0      3185.85      69.09  46.111   <0.001 ***
Rotterdam - Haarlem == 0     -951.49      69.09 -13.772   <0.001 ***
Utrecht - Haarlem == 0       2006.19      69.09  29.037   <0.001 ***
Utrecht - Rotterdam == 0     2957.68      69.09  42.809   <0.001 ***
Simultaneous p-values for the null hypotheses H0 : α2 = α1 , H0 : α3 = α1 , . . . , H0 : α5 = α4 ,
where αi is the main effect of the i-th level of location. We can “safely” say that all
differences with p-value < 0.05 are nonzero.
False Discovery Rate (FDR)
Procedures that control the FWER are considered too conservative for most cases of multiple testing (they lead to a substantial loss of power).

A better (and less stringent) quantity to control is the False Discovery Rate (FDR), introduced by Benjamini and Hochberg (1995): the expected proportion of falsely rejected null hypotheses among the rejected hypotheses.

Possible outcomes when testing m hypotheses simultaneously:

                                H0 is true   H1 is true   Total
Procedure rejects H0                 V            S          R
Procedure does not reject H0         U            T        m − R
Total                               m0         m − m0        m

V is the number of false positives; T is the number of false negatives.

FDR = E(V/R), where we set V/R = 0 if R = 0 (then also V = 0).
BH and BY procedures to control FDR
The Benjamini-Hochberg (BH) procedure ensures that its FDR is at most α:

Order the p-values p(1) ≤ p(2) ≤ . . . ≤ p(m) and the null hypotheses H0,(1), H0,(2), . . . , H0,(m) correspondingly;
If kmax = max{k : p(k) ≤ αk/m} exists, reject H0,(1), . . . , H0,(kmax); otherwise reject nothing.

The BH procedure is valid when the m tests are independent.

The Benjamini-Yekutieli (BY) procedure is the generalization of the BH procedure (for arbitrary dependence): instead of m one takes m·c(m), where c(m) = Σ_{i=1}^m 1/i, so the BY procedure is a bit more conservative.

Notice that kmax = max{k : p(k) ≤ αk/m} = max{k : m·p(k)/k ≤ α}.

The R command p.adjust gives the adjusted ordered p-values psim,(k) = m·p(k)/k, to be compared to α. Similarly for the BY procedure.
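A minimal sketch of the BH rule with hypothetical (ordered) p-values; the rev(cummin(...)) step enforces the same monotonicity that p.adjust applies:

> p=c(0.001,0.015,0.029,0.035,0.2); m=length(p); k=1:m
> alpha=0.05
> max(which(p<=alpha*k/m)) # kmax=4: reject the four smallest
> rev(cummin(rev(m*p/k))) # BH-adjusted p-values by hand: 0.005 0.0375 0.04375 0.04375 0.2
> p.adjust(p,method="BH") # the same values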
Multiple testing in R for the coffee data
> p.raw=summary(model2)$coef[,4] # vector of individual (raw) p-values
> p.raw=p.raw[order(p.raw)] # order the p-values
> p.val=as.data.frame(p.raw)
> p.val$Bonferroni=p.adjust(p.val$p.raw,method="bonferroni")
> p.val$Holm=p.adjust(p.val$p.raw,method="holm")
> p.val$Hochberg=p.adjust(p.val$p.raw,method="hochberg")
> p.val$BH=p.adjust(p.val$p.raw,method="BH")
> p.val$BY=p.adjust(p.val$p.raw,method="BY"); round(p.val,3)
                           p.raw Bonferroni  Holm Hochberg    BH    BY
(Intercept)                0.000      0.000 0.000    0.000 0.000 0.000
locationDen Haag           0.000      0.000 0.000    0.000 0.000 0.000
[ some output deleted ]
locationUtrecht:strategy3  0.000      0.001 0.001    0.001 0.000 0.000
strategy3                  0.000      0.001 0.001    0.001 0.000 0.001
locationDen Haag:strategy3 0.001      0.017 0.008    0.007 0.002 0.006
locationHaarlem:strategy2  0.001      0.018 0.008    0.007 0.002 0.006
locationDen Haag:strategy2 0.025      0.374 0.125    0.125 0.034 0.113
strategy2                  0.092      1.000 0.367    0.367 0.115 0.380
locationHaarlem:strategy3  0.147      1.000 0.440    0.440 0.169 0.562
[ some output deleted ]
Notice that all these methods are general and can be applied to any multiple testing problem as
long as one has the individual p-values.