Lecture 3: Analysis of Variance
Statistical Models, VU Amsterdam

Outline
1. ANCOVA
2. Lasso, ridge and elastic net
3. Multiple testing


ANCOVA

Analysis of covariance (ANCOVA) combines features of both ANOVA and linear regression. It augments the ANOVA model with one or more additional quantitative variables, called covariates. The covariates are included to reduce the variance of the error terms and to provide a more precise measurement of the (treatment) effects. ANCOVA is usually used to test the main and interaction effects of the factors, while taking the effects of the covariates into account.


General ANCOVA model

In general, the ANCOVA model can be written as

  Y = Zα + Xβ + e,   e ∼ N(0, σ²I).

- The vector α contains µ and parameters such as α_i, δ_j and γ_ij representing the factors and their interactions.
- The matrix Z (design matrix of the ANOVA part) contains 0's and 1's.
- The vector β contains the coefficients of the covariates.
- The matrix X (design matrix of the regression part) contains the covariate values.

Remark. The term Zα plays the role that Xβ played in the one-way and two-way ANOVA models considered earlier; here Xβ represents the covariate part of the model.


Examples of ANCOVA models

One factor and one covariate: for i = 1, …, I, j = 1, …, n_i,
  Y_ij = µ + α_i + β x_ij + e_ij,
where Z is as in one-way ANOVA, α = (µ, α_1, …, α_I)^T and X = (x_11, …, x_{1,n_1}, x_21, …, x_{I,n_I})^T.

One factor (balanced) and q covariates: for i = 1, …, I, j = 1, …, N,
  Y_ij = µ + α_i + β_1 x_ij1 + ··· + β_q x_ijq + e_ij.
Z and α are as before, X is an (IN × q)-matrix and β is a (q × 1)-vector, as in multiple linear regression.

Two factors (balanced) with one covariate: for i = 1, …, I, j = 1, …, J, k = 1, …, K,
  Y_ijk = µ + α_i + δ_j + γ_ij + β x_ijk + e_ijk.
Z and α (containing the δ's and γ's) are as in two-way ANOVA, X = (x_111, …, x_IJK)^T.


ANCOVA model with one factor and one covariate

One factor and one covariate ANCOVA: for i = 1, …, I, j = 1, …, N,

  Ω: Y_ij = µ + α_i + β x_ij + e_ij.

The regression lines have the same slope, but possibly different intercepts per group. The total number of observations is n = IN.

Test for the main effect of factor A (reduced model ω_A): H_A: α_1 = … = α_I = 0.
Another interpretation of H_A: "all regression lines have the same intercept" (and the same slope, which is already imposed by the full model Ω).

Test for the presence of the covariate (reduced model ω_β): H_β: β = 0.
Another interpretation of H_β: "all regression lines are horizontal".


Testing in ANCOVA with one factor and one covariate

Testing H_A is as in ANOVA (only the degrees of freedom may differ):

  S_Ω/σ² ∼ χ²_{n−I−1},   SS_A/σ² = (S_{ω_A} − S_Ω)/σ² ∼ χ²_{I−1} (under ω_A).

Since SS_A and S_Ω are independent, we have under H_A:

  F_A = [SS_A/(I−1)] / [S_Ω/(n−I−1)] ∼ F_{I−1, n−I−1}.

Testing H_β: S_Ω/σ² ∼ χ²_{n−I−1} and SS_β/σ² = (S_{ω_β} − S_Ω)/σ² ∼ χ²_1 (under ω_β). Since SS_β and S_Ω are independent, we have under H_β:

  F_β = SS_β / [S_Ω/(n−I−1)] ∼ F_{1, n−I−1}.

We have S_Ω/σ² ∼ χ²_{n−(I+1)} because the rank of the design matrix is I+1 under the full model Ω, and S_{ω_A}/σ² ∼ χ²_{n−2} because the rank of the design matrix is 2 under the reduced model ω_A, implying SS_A/σ² ∼ χ²_{(n−2)−(n−(I+1))} = χ²_{I−1}. Similarly, S_{ω_β}/σ² ∼ χ²_{n−I} because the rank of the design matrix is I under the reduced model ω_β, implying SS_β/σ² ∼ χ²_{(n−I)−(n−(I+1))} = χ²_1.
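A minimal sketch of these two tests in R (all data and variable names below are made up for illustration): the F-statistics F_A and F_β can be obtained by comparing the full ANCOVA model with each reduced model via anova().

set.seed(1)
I=3; N=10
d=data.frame(group=factor(rep(1:I,each=N)), x=runif(I*N))
d$y=2+rep(c(0,1,-1),each=N)+0.5*d$x+rnorm(I*N,sd=0.3) # simulated response
full=lm(y~group+x,data=d)   # full model Omega
redA=lm(y~x,data=d)         # reduced model omega_A (factor removed)
redB=lm(y~group,data=d)     # reduced model omega_beta (covariate removed)
anova(redA,full)            # F-test of H_A: alpha_1 = ... = alpha_I = 0
anova(redB,full)            # F-test of H_beta: beta = 0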
AN(C)OVA table: the order of variables in the R formula matters

- For testing H_A, the factor comes second in the R formula for the model: model1=lm(y~covariate+factor); anova(model1).
- For testing H_β, the covariate comes second in the R formula for the model: model2=lm(y~factor+covariate); anova(model2).
- The partitioning of the sums of squares is not the same for model1 and model2. The difference is whether the covariate is added to a model that already contains the factor, or vice versa.

For testing H_A, the AN(C)OVA table in R looks as follows:

             Df  Sum Sq  Mean Sq  F value  Pr(>F)
  covariate
  factor
  Residuals


Testing for homogeneity of slopes in ANCOVA with one factor and one covariate

The tests of H_A and H_β assume a common slope for all I groups. We can test this assumption of equal slopes in the groups:

  H_Aβ: β_1 = … = β_I,

where β_i is the slope in the i-th group. In effect, H_Aβ states that the I regression lines are parallel; in other words, H_Aβ says that there is no interaction between the factor and the covariate.

The full model Ω allowing for different slopes becomes:

  Ω: Y_ij = µ + α_i + β_i x_ij + e_ij,   i = 1, …, I, j = 1, …, N.

Notice that this is a special case of ANCOVA with one factor (with I levels) and I covariates (some of which are dummy). The reduced model is

  ω_Aβ: Y_ij = µ + α_i + β x_ij + e_ij,   i = 1, …, I, j = 1, …, N.

Testing H_Aβ is as in ANOVA (with different degrees of freedom):

  S_Ω/σ² ∼ χ²_{n−2I},   SS_Aβ/σ² = (S_{ω_Aβ} − S_Ω)/σ² ∼ χ²_{I−1} (under ω_Aβ).

Since SS_Aβ and S_Ω are independent, we have under H_Aβ:

  F_Aβ = [SS_Aβ/(I−1)] / [S_Ω/(n−2I)] ∼ F_{I−1, n−2I}.

We have S_Ω/σ² ∼ χ²_{n−2I} because the rank of the design matrix is 2I under the full model Ω, and S_{ω_Aβ}/σ² ∼ χ²_{n−I−1} because the rank of the design matrix is I+1 under the reduced model ω_Aβ, implying SS_Aβ/σ² ∼ χ²_{(n−I−1)−(n−2I)} = χ²_{I−1}.


ANOVA table for testing homogeneity of slopes in the ANCOVA model with one factor and one covariate

For testing H_Aβ, you can use the R commands: model=lm(y~covariate*factor); anova(model).
The corresponding ANOVA table in R looks as follows:

                    Df  Sum Sq  Mean Sq  F value  Pr(>F)
  covariate
  factor
  covariate:factor
  Residuals

Only the line covariate:factor is relevant for testing H_Aβ.


Lasso, ridge and elastic net methods: motivation

So far we had only a few variables to choose from, so we were able to identify the significant ones by inspecting the corresponding p-values. This quickly becomes infeasible when the number of predictors is large. Is there an algorithm that could automatically shrink the coefficients of the insignificant variables or (better!) set them to zero altogether? This is precisely what the lasso and its close cousin, ridge regression, do.


Regularization by the penalty term

Lasso and ridge regularization work by adding a penalty term λP(β) to the mean residual sum of squares

  (1/n) Σ_{i=1}^n (Y_i − (β_0 + β_1 X_{i,1} + … + β_p X_{i,p}))² = ‖Y − Xβ‖²/n

and minimizing the resulting sum ‖Y − Xβ‖²/n + λP(β) (2n can be used instead of n) with respect to β = (β_0, β_1, …, β_p) ∈ R^{p+1}:

  ‖Y − Xβ‖²/n + λP(β) → min over β.


Lasso and ridge methods

Lasso method: P(β) = ‖β‖_1 = Σ_{k=0}^p |β_k|, i.e.,

  min_β { ‖Y − Xβ‖²/n + λ‖β‖_1 } = min_β { ‖Y − Xβ‖²/n + λ Σ_{k=0}^p |β_k| }.

Ridge method: P(β) = ‖β‖_2² = Σ_{k=0}^p β_k², i.e.,

  min_β { ‖Y − Xβ‖²/n + λ‖β‖_2² } = min_β { ‖Y − Xβ‖²/n + λ Σ_{k=0}^p β_k² }.
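A minimal sketch contrasting the two penalties on simulated data with glmnet (the data and the choice λ = 0.5 are made up for illustration): the lasso sets coefficients of irrelevant predictors exactly to zero, while ridge only shrinks them.

library(glmnet)
set.seed(1)
n=100; p=10
x=matrix(rnorm(n*p),n,p)        # 10 candidate predictors
y=1+2*x[,1]-1.5*x[,2]+rnorm(n)  # only the first two are relevant
ridge=glmnet(x,y,alpha=0)       # ridge fit (whole lambda path)
lasso=glmnet(x,y,alpha=1)       # lasso fit (whole lambda path)
coef(ridge,s=0.5) # all coefficients shrunk towards zero, typically none exactly zero
coef(lasso,s=0.5) # coefficients of the irrelevant predictors set exactly to zero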
Elastic net method

Elastic net method: P(β) = α‖β‖_1 + (1−α)‖β‖_2², where 0 ≤ α ≤ 1 controls the "mix" of ridge and lasso regularization, with α = 1 being "pure" lasso and α = 0 being "pure" ridge, i.e.,

  min_β { ‖Y − Xβ‖²/n + λ(α‖β‖_1 + (1−α)‖β‖_2²) }.

The parameter λ ≥ 0 (in all three procedures: lasso, ridge and elastic net) is a free parameter which is usually selected by a method called cross-validation.


Lasso, ridge and elastic net methods: some remarks

- Ridge regression forces the coefficients β to be smaller, but it does not force them to be zero. That is, it will not get rid of irrelevant features, but rather minimizes their impact on the trained model.
- The lasso method overcomes this disadvantage of ridge regression: it not only penalizes high values of the coefficients β, but actually sets them to zero if they are not relevant. One usually ends up with fewer features in the model than one started with, which is an advantage.
- The R package glmnet implements the elastic net method (for any 0 ≤ α ≤ 1) via the function glmnet, with ridge (α = 0) and lasso (α = 1) as particular cases. The choice of λ is made by cross-validation, implemented by the function cv.glmnet.


Lasso (ridge and elastic net) analysis in R: generic code

Suppose we have a data frame named data, with its first column being the response variable and the remaining columns the features to select from.

>library(glmnet)
>x=as.matrix(data[,-1]) # remove the response variable
>y=as.double(as.matrix(data[,1])) # only the response variable
>train=sample(1:nrow(x),0.67*nrow(x)) # train on 2/3 of the data
>x.train=x[train,]; y.train=y[train] # data to train on
>x.test=x[-train,]; y.test=y[-train] # data to test the prediction quality
>lasso.mod=glmnet(x.train,y.train,alpha=1)
>cv.lasso=cv.glmnet(x.train,y.train,alpha=1,type.measure='mse')
>plot(lasso.mod,label=T,xvar="lambda") # have a look at the lasso path
>plot(cv.lasso) # the best lambda by cross-validation
>plot(cv.lasso$glmnet.fit,xvar="lambda",label=T)
>lambda.min=cv.lasso$lambda.min; lambda.1se=cv.lasso$lambda.1se
>coef(lasso.mod,s=lambda.min) # beta's for the best lambda
>y.pred=predict(lasso.mod,s=lambda.min,newx=x.test) # predictions for the test set
>mse.lasso=mean((y.test-y.pred)^2) # mse for the predicted test rows

Remark. lambda.min is the value of λ that gives the minimum mean cross-validated error. The other λ saved is lambda.1se, which gives the most regularized model such that the error is within one standard error of the minimum.


Pairwise comparisons in ANOVA, multiple testing

In the one-way ANOVA, if H_0: α_1 = ··· = α_I is rejected, it is of interest to investigate whether α_i = α_j (equivalently, α_i − α_j = 0) for each 1 ≤ i < j ≤ I. (Or, to investigate whether α_i ≤ α_j, equivalently, α_i − α_j ≤ 0.)

To test H_0: α_i − α_j = 0 against H_1: α_i − α_j ≠ 0, the test statistic is

  T = (α̂_i − α̂_j) / (σ̂ √(1/n_i + 1/n_j)),   where σ̂² = S_Ω/(n − I).

Under ω, T ∼ t_{n−I}, and we reject H_0 at significance level α if |T| > t_{n−I;1−α/2}. A (1−α)-CI for α_i − α_j is

  α̂_i − α̂_j ± t_{n−I;1−α/2} σ̂ √(1/n_i + 1/n_j).

These are called pairwise comparisons.

What if we test (construct CIs for) all H_{0,ij}: α_i = α_j, 1 ≤ i < j ≤ I, simultaneously? Many testing problems ⇒ multiple testing problem.
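A minimal sketch of these unadjusted pairwise comparisons in R (the data frame and all names below are made up): pairwise.t.test with p.adjust.method="none" and pool.sd=TRUE carries out the individual t-tests based on the pooled estimate σ̂.

set.seed(2)
d=data.frame(group=factor(rep(LETTERS[1:3],each=8)), y=rnorm(24,mean=rep(c(0,1,1),each=8)))
fit=lm(y~group,data=d)
anova(fit) # overall F-test of H_0: alpha_1 = ... = alpha_I
pairwise.t.test(d$y,d$group,p.adjust.method="none",pool.sd=TRUE) # unadjusted pairwise p-values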
Multiple testing

- An individual H_0 is falsely rejected (type I error) with probability at most α_ind (= 0.05).
- Given 2 null hypotheses, there are 2 possibilities to make such an error. The probability of at least 1 error is between 0.05 and 0.05 + 0.05 = 0.1.
- For m arbitrary null hypotheses H_{0,1}, …, H_{0,m}, this error can grow up to α_tot = m·α_ind. Indeed, under all H_{0,i}, i = 1, …, m,

    α_tot = P(at least one H_{0,i} is rejected) ≤ Σ_{i=1}^m P(H_{0,i} is rejected) ≤ m·α_ind.

- For example, to ensure α_tot = 0.05 for m = 100, we would have to take α_ind ≤ α_tot/m = 0.0005, which is far too strict: hardly any of the H_{0,i}, i = 1, …, m, will be rejected at α_ind = 0.0005.


Multiple testing, family-wise error rate (FWER)

We have just tried to control P(at least one true H_{0,i} is rejected), which is called the FWER (family-wise error rate).

- To guarantee α_tot ≤ 0.05 we need to impose α_ind,i ≤ 0.05/m for all H_{0,i}. Thus, a simple way to obtain an overall level α_tot is to carry out each individual test at level α_ind,i = α_tot/m, known as the Bonferroni correction. This is the same as comparing the individual p-values p_ind,i to α_tot/m.
- Simultaneous p-values p_sim,i for H_{0,1}, …, H_{0,m} are such that if every H_{0,i} with p_sim,i ≤ α_tot is rejected, then under all H_{0,i}, i = 1, …, m, P(at least one true H_{0,i} is rejected) ≤ α_tot.
- In R, the simultaneous p-values are called adjusted p-values for multiple comparisons and are computed by p.adjust. They differ per method. The adjusted p-values according to the Bonferroni correction are p_sim,i = m·p_ind,i (truncated at 1).


Multiple testing procedures for controlling FWER

Multiple testing arises when:
- there are many parameters of interest,
- we investigate all differences α_i − α_i' of a set of effects α_i in ANOVA.

The latter is the so-called "a-posteriori testing", performed following rejection of a composite hypothesis of the type H_0: α_i = 0, i = 1, …, I.

The Bonferroni correction is not the only method to control the FWER; alternatives are:
- the Sidak correction (under an independence assumption, slightly better than Bonferroni),
- the Holm-Bonferroni method, uniformly better than Bonferroni (making it obsolete),
- Tukey's procedure (use library(multcomp); only for pairwise comparisons),
- Hochberg's step-up procedure and Hommel's method (both under some assumptions),
- some extensions of the above.

Similarly, one designs simultaneous confidence intervals for a set of parameters that have an overall confidence level of 1 − α_tot.

To implement these methods in R: fed with the individual p-values p_ind,i and a specified method, p.adjust gives the adjusted p-values p_sim,i (not for the Sidak and Tukey procedures), which should be compared to a specified significance level α_tot. The corresponding method rejects those hypotheses for which p_sim,i ≤ α_tot.


More on Tukey's method

Tukey's method is used to test all pairwise differences in AN(C)OVA. For example, in one-way ANOVA: test H_0: α_i − α_j = 0 against H_1: α_i − α_j ≠ 0, for all pairwise comparisons 1 ≤ i < j ≤ I, at level α, with r observations per level (balanced design), so that the total number of observations is n = rI. Then reject all H_{0,ij} for which

  |α̂_i − α̂_j| / (σ̂/√r) > q_{I,n−I;1−α},

where q_{I,n−I;1−α} denotes the (1−α)-quantile of the Studentized range distribution with parameters I and n − I. The quantiles of this distribution can be obtained from reference books or statistical software packages. When the sample sizes are unequal, the Tukey-Kramer method replaces σ̂/√r above with the quantity σ̂ √((1/n_i + 1/n_j)/2).

The (1−α) Tukey simultaneous confidence intervals (SCIs) for all pairwise differences α_i − α_j have the form

  α̂_i − α̂_j ± q_{I,n−I;1−α} σ̂/√r.
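A minimal sketch in R (balanced, simulated data; all names made up): qtukey gives the Studentized range quantile q_{I,n−I;1−α}, and the base function TukeyHSD applies the procedure to an aov fit.

set.seed(3)
I=4; r=6; n=I*r
d=data.frame(group=factor(rep(1:I,each=r)), y=rnorm(n))
qtukey(0.95,nmeans=I,df=n-I)  # critical value q_{I,n-I;0.95}
TukeyHSD(aov(y~group,data=d)) # Tukey simultaneous CIs and adjusted p-values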
More on the Holm-Bonferroni method

Let p_1, …, p_m be the individual p-values for testing H_{0,1}, …, H_{0,m}, and let α be an overall significance level.

- Order the p-values: p_(1) ≤ … ≤ p_(m).
- Let k be the minimal index such that p_(k) > α/(m + 1 − k).
- Reject the null hypotheses H_{0,(1)}, …, H_{0,(k−1)} and do not reject the rest.
- If k = 1, do not reject any of the null hypotheses H_{0,i}; if no such k exists, reject all H_{0,i}.

Notice that the Holm-Bonferroni method can equivalently be realized by computing adjusted p-values p_sim,(i) = p_(i)·(m + 1 − i) (this is essentially what p.adjust does in R; to keep the rejections consistent with the step-down rule, p.adjust additionally takes running maxima, p_sim,(i) = max_{j ≤ i} p_(j)·(m + 1 − j), capped at 1), which are then compared with the overall significance level α.


Individual p-values obtained in ANOVA

Recall our dataset coffee with two factors location and strategy.

> model2=lm(salesincr~location*strategy,coffee)
> anova(model2); summary(model2)
[ some output deleted ]
Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                 1938.757     48.854  39.684  < 2e-16 ***
locationDen Haag           -2087.925     69.091 -30.220  < 2e-16 ***
locationHaarlem             -908.266     69.091 -13.146  < 2e-16 ***
locationRotterdam          -1859.759     69.091 -26.918  < 2e-16 ***
locationUtrecht             1097.922     69.091  15.891  < 2e-16 ***
strategy2                   -118.416     69.091  -1.714  0.09170 .
strategy3                   -289.456     69.091  -4.190 9.31e-05 ***
locationDen Haag:strategy2   224.789     97.709   2.301  0.02491 *
[ some output deleted ]

The p-values produced above are not simultaneous. The p-values in the lines locationCity are for the hypotheses H_0: α_2 = α_1, …, H_0: α_5 = α_1, for the levels α_i of the main factor location.


Multiple testing in R by Tukey's method

> library(multcomp) # an alternative: the base function TukeyHSD(), see also library(multcompView)
> coffee.mult=glht(model2,linfct=mcp(location="Tukey")); summary(coffee.mult)
                            Estimate Std. Error t value Pr(>|t|)
Den Haag - Amsterdam == 0   -2087.92      69.09 -30.220   <0.001 ***
Haarlem - Amsterdam == 0     -908.27      69.09 -13.146   <0.001 ***
Rotterdam - Amsterdam == 0  -1859.76      69.09 -26.918   <0.001 ***
Utrecht - Amsterdam == 0     1097.92      69.09  15.891   <0.001 ***
Haarlem - Den Haag == 0      1179.66      69.09  17.074   <0.001 ***
Rotterdam - Den Haag == 0     228.17      69.09   3.302   0.0135 *
Utrecht - Den Haag == 0      3185.85      69.09  46.111   <0.001 ***
Rotterdam - Haarlem == 0     -951.49      69.09 -13.772   <0.001 ***
Utrecht - Haarlem == 0       2006.19      69.09  29.037   <0.001 ***
Utrecht - Rotterdam == 0     2957.68      69.09  42.809   <0.001 ***

These are simultaneous p-values for the null hypotheses H_0: α_2 = α_1, H_0: α_3 = α_1, …, H_0: α_5 = α_4, where α_i is the main effect of the i-th level of location. We can "safely" say that all differences with p-value < 0.05 are nonzero.


False Discovery Rate (FDR)

Procedures that control the FWER are considered too conservative for most cases of multiple testing (they lead to a substantial loss of power). Better to control (and less stringent) is the False Discovery Rate (FDR), introduced by Benjamini and Hochberg (1995): the expected proportion of falsely rejected null hypotheses among the rejected hypotheses.

Possible outcomes when testing m hypotheses simultaneously:

                                  H_0 is true   H_1 is true   Total
  Procedure rejects H_0                V             S           R
  Procedure does not reject H_0        U             T         m − R
  Total                               m_0         m − m_0        m

V is the number of false positives; T is the number of false negatives.

FDR = E[V/R], where V/R is defined to be 0 if R = 0 (in that case also V = 0).
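A small simulated illustration of why this quantity matters (all numbers below are made up): with 900 true and 100 false null hypotheses, testing each hypothesis separately at level 0.05 yields a sizeable proportion of false discoveries among the rejections.

set.seed(4)
m=1000; m0=900                                              # 900 true nulls, 100 false nulls
p=c(runif(m0), pnorm(rnorm(m-m0,mean=3),lower.tail=FALSE))  # individual p-values
rejected=(p<=0.05)                                          # each test at level 0.05 on its own
V=sum(rejected[1:m0]); R=sum(rejected)                      # false positives and total rejections
V/max(R,1)                                                  # realized proportion of false discoveries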
BH and BY procedures to control the FDR

The Benjamini-Hochberg (BH) procedure ensures that its FDR is at most α:

- Order the p-values p_(1) ≤ p_(2) ≤ … ≤ p_(m) and the null hypotheses H_{0,(1)}, H_{0,(2)}, …, H_{0,(m)} correspondingly;
- If k_max = max{k : p_(k) ≤ αk/m} exists, reject H_{0,(1)}, …, H_{0,(k_max)}; otherwise reject nothing.

The BH procedure is valid when the m tests are independent. The Benjamini-Yekutieli (BY) procedure is the generalization of the BH procedure to arbitrary dependence: instead of m one takes m·c(m), with c(m) = Σ_{i=1}^m 1/i, so the BY procedure is a bit more conservative.

Notice that k_max = max{k : p_(k) ≤ αk/m} = max{k : m·p_(k)/k ≤ α}. The R command p.adjust gives the adjusted ordered p-values p_sim,(k) = m·p_(k)/k (with a monotonicity correction: running minima are taken from the largest p-value downwards, capped at 1), to be compared with α. Similarly for the BY procedure.


Multiple testing in R for the coffee data

> p.raw=summary(model2)$coef[,4] # vector of individual (raw) p-values
> p.raw=p.raw[order(p.raw)] # order the p-values
> p.val=as.data.frame(p.raw)
> p.val$Bonferroni=p.adjust(p.val$p.raw,method="bonferroni")
> p.val$Holm=p.adjust(p.val$p.raw,method="holm")
> p.val$Hochberg=p.adjust(p.val$p.raw,method="hochberg")
> p.val$BH=p.adjust(p.val$p.raw,method="BH")
> p.val$BY=p.adjust(p.val$p.raw,method="BY"); round(p.val,3)
                            p.raw Bonferroni  Holm Hochberg    BH    BY
(Intercept)                 0.000      0.000 0.000    0.000 0.000 0.000
locationDen Haag            0.000      0.000 0.000    0.000 0.000 0.000
[ some output deleted ]
locationUtrecht:strategy3   0.000      0.001 0.001    0.001 0.000 0.000
strategy3                   0.000      0.001 0.001    0.001 0.000 0.001
locationDen Haag:strategy3  0.001      0.017 0.008    0.007 0.002 0.006
locationHaarlem:strategy2   0.001      0.018 0.008    0.007 0.002 0.006
locationDen Haag:strategy2  0.025      0.374 0.125    0.125 0.034 0.113
strategy2                   0.092      1.000 0.367    0.367 0.115 0.380
locationHaarlem:strategy3   0.147      1.000 0.440    0.440 0.169 0.562
[ some output deleted ]

Notice that all these methods are general and can be applied to any multiple testing problem, as long as one has the individual p-values.
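To make the BH step-up rule concrete, here is a minimal sketch (with made-up p-values) that implements it by hand and checks that the decisions coincide with those based on p.adjust(...,method="BH").

set.seed(5)
p=c(runif(8), runif(4,max=0.01))      # made-up p-values: 8 "null-like", 4 small ones
m=length(p); alpha=0.05
o=order(p); p.sorted=p[o]
below=which(p.sorted<=alpha*(1:m)/m)  # indices k with p_(k) <= alpha*k/m
k.max=if(length(below)>0) max(below) else 0
rejected=rep(FALSE,m)
if(k.max>0) rejected[o[1:k.max]]=TRUE # reject H_(1), ..., H_(k.max)
all(rejected==(p.adjust(p,method="BH")<=alpha)) # same decisions as via adjusted p-values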