Lab 4: Mixed effects models

In this lab, we will apply mixed models to study the effects of beverages on gene expression, using the data set studied by Baty et al. (2006). The original purpose of that study was to measure the influence of beverages on blood gene expression, in order to explore the mechanisms underlying the cardioprotective effects of beverages.

Experiment Design

Six healthy individuals participated in a randomized controlled cross-over experiment. On 4 independent days they had 4 different beverages (500 mL each: grape juice, red wine, 40 g diluted ethanol, water). The drinks they had on each day are summarized in the following table:

         Individual 1  Individual 2  Individual 3  Individual 4  Individual 5  Individual 6
Day 1    Grape Juice   Water         Red Wine      Water         Grape Juice   Alcohol
Day 2    Red Wine      Grape Juice   Alcohol       Alcohol       Red Wine      Red Wine
Day 3    Alcohol       Red Wine      Water         Red Wine      Alcohol       Water
Day 4    Water         Alcohol       Grape Juice   Grape Juice   Water         Grape Juice

On each day, blood samples were taken at baseline (0 hours) and at 1, 2, 4 and 12 hours after the drink (taken together with standardized nutrition). Some individuals missed scheduled blood draws, which resulted in 12 missing samples. RNA from the remaining 108 samples was hybridized on Affymetrix microarrays, yielding gene expression data for 108 blood samples.
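As a quick check on the sample count, the design can be enumerated in R. This is a minimal sketch; the object names (`design`, `n_planned`, `n_missing`, `n_observed`) are illustrative and are not part of the lab data.

```r
# Enumerate the planned blood samples: 6 individuals x 4 days x 5 time points.
design <- expand.grid(individual = 1:6,
                      day = 1:4,
                      hour = c(0, 1, 2, 4, 12))
n_planned <- nrow(design)            # number of planned samples
n_missing <- 12                      # missed blood draws reported in the study
n_observed <- n_planned - n_missing  # samples that were actually hybridized
c(planned = n_planned, observed = n_observed)
```

The observed count should agree with the number of expression columns used later in the lab.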
Data set

The data set is contained in the file "Alldata.Rdata", which can be loaded into R with the following command (after setting the working directory to the folder where you saved the file):

    load(file="Alldata.Rdata")

Within the data set, "Alldata" is a list that includes the following objects:

    "originaldata" "trt1" "trt2" "trt3" "trt4"
    "time_h0" "time_h1" "time_h2" "time_h4" "time_h12"
    "ind1" "ind2" "ind3" "ind4" "ind5" "ind6"

The objects included in the data set are:

originaldata: all the gene expression data (transformed count data);
trt1: IDs of the observations in the Alcohol group;
trt2: IDs of the observations in the Grape juice group;
trt3: IDs of the observations in the Red wine group;
trt4: IDs of the observations in the Water group;
time_h0: observations measured at baseline;
time_h1: observations measured 1 hour after the drink;
time_h2: observations measured 2 hours after the drink;
time_h4: observations measured 4 hours after the drink;
time_h12: observations measured 12 hours after the drink;
ind1: observations obtained from individual 1;
ind2: observations obtained from individual 2;
ind3: observations obtained from individual 3;
ind4: observations obtained from individual 4;
ind5: observations obtained from individual 5;
ind6: observations obtained from individual 6.

You can access each object with the operator $. For example, to get the data contained in trt1, type

    Alldata$trt1

The originaldata object is a matrix containing 22283 rows and 130 columns. Each row corresponds to one gene, and columns 3 to 110 contain the gene expression data for the 108 samples. The remaining columns contain gene IDs or gene annotation information.
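The access pattern can be tried on a small mock list before loading the real file. The sketch below assumes a toy object `AlldataMock` with made-up sample IDs and values; only the `$` access and the `%in%` lookup mirror how `Alldata` is used in this lab.

```r
# Hypothetical stand-in for Alldata: two genes, three samples (toy values).
AlldataMock <- list(
  originaldata = data.frame(ID = c("g1", "g2"),
                            S1 = c(7.0, 4.9),
                            S2 = c(6.8, 4.6),
                            S3 = c(7.1, 4.5)),
  trt1    = c("S1", "S3"),   # sample IDs in the (mock) alcohol group
  time_h0 = c("S1", "S2"),   # sample IDs taken at baseline
  ind1    = c("S1")          # sample IDs from (mock) individual 1
)

names(AlldataMock)           # objects stored in the list
AlldataMock$trt1             # access one object with $

# Which expression columns belong to treatment 1?
sampleIDs <- names(AlldataMock$originaldata)[-1]   # drop the ID column
inTrt1 <- sampleIDs %in% AlldataMock$trt1
inTrt1
```

The same `%in%` test, applied to the real sample IDs, is what links each expression column to its beverage, time point and individual.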
The following is a very small part of the data:

       GSM87863  GSM87887  GSM87896  GSM87934  GSM87943  GSM87853
    1   6.96959   6.84646   6.99376   7.06780   7.07566   7.18618
    2   4.94771   4.63228   4.47609   4.41107   4.62490   4.61241
    3   7.38956   7.21881   7.62192   7.76446   7.43270   7.38960
    4   7.73394   7.73069   8.12781   7.92782   7.96697   7.96694
    5   3.10916   3.31460   3.42180   3.46084   3.29934   3.35648
    6   6.93594   7.23465   6.63625   6.72077   7.15000   7.07379

The first column in the above example corresponds to the observation ID GSM87863. This ID is contained in the variables Alldata$trt1, Alldata$time_h0 and Alldata$ind1. This means that the gene expressions in this column were obtained from individual 1 at time 0h under treatment 1 (Alcohol group). Each row contains the expression values of one gene. In the above example, rows 1-6 provide the gene expressions for the first 6 genes.

In today's lab, we will focus on a subset of genes that are related to immunity. The set of genes is defined by Gene Ontology (GO) terms. GO is a set of controlled, structured vocabularies that describe key domains of molecular biology, including gene product attributes and biological sequences. GO defines classes used to describe gene function and attributes of gene products in three non-overlapping domains of molecular biology. Specifically, we will focus on the genes within the GO term GO:0006955.

Q1: The last three columns (namely, columns 128 to 130) of the data set (originaldata) contain the GO information for each gene. Find the genes that belong to the GO term GO:0006955. Then create a new data set that contains the gene expression data (namely, columns 3-110 of originaldata) for the genes in the GO term GO:0006955.

Answer: To find the genes that belong to the GO term GO:0006955, we first create a matrix whose first column is the row number of each gene and whose second column is a GO term containing that gene. Then we find all the rows of this matrix that belong to the GO term GO:0006955.
Finally, we use these rows to extract the gene expression data for the genes in the GO term GO:0006955. Below is the R code:

    load(file="Alldata.Rdata")
    # GO annotation columns (128 to 130) of the full data set
    Goterms<-Alldata$originaldata[,128:130]
    mc<-dim(Goterms)[1]
    # Build a two-column matrix: (row number of the gene, GO term)
    GOTermlist<-NULL
    for (j in 1:mc)
    {
      termsj<-NULL
      for (k in 1:3)
      {
        GOTerm22283<-as.character(Goterms[j,k])
        # several GO terms in one cell are separated by "///"
        getGoTerms<-unlist(strsplit(GOTerm22283,"///"))
        repj<-rep(j,length(getGoTerms))
        newlist<-cbind(repj,getGoTerms)
        termsj<-rbind(termsj,newlist)
      }
      GOTermlist<-rbind(GOTermlist,termsj)
    }
    # Rows of originaldata annotated with GO:0006955
    GO0006955<-which(GOTermlist[,2]=="GO:0006955")
    rownums<-as.numeric(GOTermlist[GO0006955,1])
    # Expression data (columns 3 to 110) for those genes
    subsetGenes<-Alldata$originaldata[rownums,c(3:110)]

Q2: Several factors need to be considered in this data set: beverage effects, time effects and individual effects. To see the impact of these effects on gene expression, we need to define these factors as covariates. Please define three covariates that describe, respectively, the three factors (namely, beverage, time and individual effects).

Answer: The beverage factor has four levels: alcohol, grape juice, red wine and water. We will use 1-4 to represent them, respectively. The microarray gene expressions were measured five times, at 0, 1, 2, 4 and 12 hours. A total of six individuals participated in this study; they will be coded as 1-6, respectively.
The following R code could be used to define the above three factors:

    # sample IDs are the column names of the expression data
    samIDs<-colnames(subsetGenes)
    # beverage: 1 = alcohol, 2 = grape juice, 3 = red wine, 4 = water
    Beverages<-(samIDs%in%Alldata$trt1)*1+(samIDs%in%Alldata$trt2)*2+
      (samIDs%in%Alldata$trt3)*3+(samIDs%in%Alldata$trt4)*4
    # individual: 1 to 6
    Subject<-(samIDs%in%Alldata$ind1)*1+(samIDs%in%Alldata$ind2)*2+
      (samIDs%in%Alldata$ind3)*3+(samIDs%in%Alldata$ind4)*4+
      (samIDs%in%Alldata$ind5)*5+(samIDs%in%Alldata$ind6)*6
    # measurement time in hours: 0, 1, 2, 4 or 12
    hours<-(samIDs%in%Alldata$time_h0)*0+(samIDs%in%Alldata$time_h1)*1+
      (samIDs%in%Alldata$time_h2)*2+(samIDs%in%Alldata$time_h4)*4+
      (samIDs%in%Alldata$time_h12)*12

Q3: Consider a linear mixed model with the gene expressions of the first gene in the GO term GO:0006955 as the response and the three factors defined in Q2 as covariates. Specifically, in this question, treat the beverage effects as fixed effects, and treat the individual effects and time effects as random effects. What are the REML estimates of the variances of the individual and time random effects? Are the mean gene expressions for the alcohol and water groups significantly different? Similarly, compare the mean gene expression for the red wine group with that of the water group. Write down the statistical model, hypotheses, test statistics and the corresponding p-values. Please clearly define your notation.

Answer: Let \(Y_{ijk}\) be the j-th (j=1, 2, 3, 4, 5) repeated measurement of the expression of the first gene, measured from the k-th individual (k=1, ..., 6) in the i-th beverage group (i=1, 2, 3, 4). A linear mixed effects model can be written as

\[
Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \epsilon_{ijk},
\]

where the \(\alpha_i\)'s are beverage effects that are treated as fixed effects, the \(\beta_j\)'s are random time effects, the \(\gamma_k\)'s are random individual effects, and the \(\epsilon_{ijk}\) are IID random errors with mean 0 and variance \(\sigma^2\). Assume that the \(\beta_j\)'s are IID normal with mean 0 and variance \(\sigma_\beta^2\), and that the \(\gamma_k\)'s are IID normal with mean 0 and variance \(\sigma_\gamma^2\).
The above model could be fitted using the following R code:

    library(lme4)   # provides lmer
    BeverFac<-as.factor(Beverages)
    hourFac<-as.factor(hours)
    resp<-as.numeric(subsetGenes[1,])
    # fixed beverage effects, random intercepts for subject and for time point
    lmmd2<-lmer(resp~BeverFac+(1|Subject)+(1|hours))
    summary(lmmd2)

The output of the above R code is

    Linear mixed model fit by REML ['lmerMod']
    Formula: resp ~ BeverFac + (1 | Subject) + (1 | hours)

    REML criterion at convergence: 61.1

    Scaled residuals:
         Min       1Q   Median       3Q      Max
    -1.95114 -0.67473 -0.03279  0.69286  2.74099

    Random effects:
     Groups   Name        Variance Std.Dev.
     Subject  (Intercept) 0.089614 0.29936
     hours    (Intercept) 0.002754 0.05248
     Residual             0.078447 0.28008
    Number of obs: 108, groups:  Subject, 6; hours, 5

    Fixed effects:
                Estimate Std. Error t value
    (Intercept)  5.97213    0.13670   43.69
    BeverFac2    0.12216    0.07891    1.55
    BeverFac3    0.06217    0.07717    0.81
    BeverFac4    0.10770    0.07699    1.40

    Correlation of Fixed Effects:
              (Intr) BvrFc2 BvrFc3
    BeverFac2 -0.296
    BeverFac3 -0.300  0.519
    BeverFac4 -0.305  0.528  0.533

Based on the output of the model fitting, the REML estimate of \(\sigma_\beta^2\) (time) is 0.0028 and the REML estimate of \(\sigma_\gamma^2\) (individual) is 0.0896.

To test the equality of the mean gene expressions between the water and alcohol groups, we test \(H_0: \alpha_1 = \alpha_4\) versus \(H_1: \alpha_1 \neq \alpha_4\). In our R output, since the baseline is the alcohol group, the estimate of \(\alpha_4 - \alpha_1\) is 0.10770, with standard error 0.07699. The test statistic value is 1.40. Because lmer does not report p-values, we compare the test statistic with the standard normal distribution, which gives a p-value of 0.1618. Hence the mean gene expressions for the alcohol and water groups are not significantly different for the first gene. The p-value could be computed as follows:

    Zstat<-summary(lmmd2)$coefficients[4,3]
    pval<-2*(1-pnorm(abs(Zstat)))

To compare the mean gene expression between the water and red wine groups, we test \(H_0: \alpha_3 = \alpha_4\) versus \(H_1: \alpha_3 \neq \alpha_4\). An estimate of \(\alpha_3 - \alpha_4\) is 0.06217 - 0.10770 = -0.04553.
The variance of this estimate is

\[
\begin{aligned}
\mathrm{Var}(\hat\alpha_3 - \hat\alpha_4)
&= \mathrm{Var}\{(\hat\alpha_3 - \hat\alpha_1) - (\hat\alpha_4 - \hat\alpha_1)\} \\
&= \mathrm{Var}(\hat\alpha_3 - \hat\alpha_1) + \mathrm{Var}(\hat\alpha_4 - \hat\alpha_1) - 2\,\mathrm{Cov}\{(\hat\alpha_3 - \hat\alpha_1), (\hat\alpha_4 - \hat\alpha_1)\} \\
&= 0.07717^2 + 0.07699^2 - 2 \times 0.533 \times 0.07717 \times 0.07699 \\
&= 0.00555,
\end{aligned}
\]

where the covariance is obtained as the reported correlation (0.533) times the two standard errors. Therefore, the standard error is 0.074. The test statistic value is -0.611, and the corresponding p-value is 0.54. Hence the mean gene expressions for the red wine and water groups are not significantly different for the first gene.

Q4: For each gene in the GO term GO:0006955, conduct two hypothesis tests for the following group mean differences using the model in Q3: alcohol versus water group, and red wine versus water group. For each gene, compute the corresponding p-values. Draw histograms of the p-values for each hypothesis. Since we are conducting multiple hypothesis tests together, we need to account for the possibly inflated type I error. The most conservative procedure is based on the Bonferroni correction. Namely, we compare the p-values with 0.05/#genes, where #genes is the number of genes in the GO term GO:0006955. Any gene with a p-value less than 0.05/#genes is considered significant. Based on the Bonferroni correction, which genes are significant for each of the two hypothesis tests? Which genes have the smallest p-value?

Answer: For each gene in the GO term, we could perform the hypothesis tests as we did in Q3. The histograms for both tests are given below:

[Figure: histograms of the p-values for the two hypotheses]

For the hypothesis comparing the alcohol and water groups, we found that one gene (the 416th gene) has a p-value of 1.696569e-05, which is less than the significance level 8.460237e-05. The 416th gene also has the smallest p-value.

For the hypothesis comparing the red wine and water groups, none of the genes is significant at the significance level 8.460237e-05. The 119th gene has the smallest p-value, 0.00013.
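The Bonferroni rule described above can be sketched in a few lines of R. The p-values in `pvalsToy` are made up for illustration, apart from the two smallest values quoted in the answer; the gene count 591 is inferred from the reported threshold (0.05/591 = 8.460237e-05) and is an assumption, not a number stated in the lab.

```r
# Number of genes, inferred from the reported threshold (an assumption).
numGenes <- 591
alpha <- 0.05
threshold <- alpha / numGenes        # Bonferroni-corrected significance level

# Toy p-values: the two smallest values quoted in the answer plus made-up ones.
pvalsToy <- c(1.696569e-05, 0.00013, 0.04, 0.37, 0.81)

significant <- which(pvalsToy < threshold)   # indices passing the correction
smallest <- which.min(pvalsToy)              # index of the smallest p-value

# Equivalent check: Bonferroni-adjusted p-values compared with alpha.
adjusted <- p.adjust(pvalsToy, method = "bonferroni", n = numGenes)
```

In the actual analysis the same comparison would be applied to the full vectors of p-values, one per hypothesis.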
The R code for computing p-values is given below:

    library(lme4)
    numgenes<-dim(subsetGenes)[1]
    pvalset<-rep(0,numgenes)    # alcohol versus water
    pvalset2<-rep(0,numgenes)   # red wine versus water
    contr<-c(0,0,1,-1)          # contrast for alpha3 - alpha4
    for (i in 1:numgenes)
    {
      resp<-as.numeric(subsetGenes[i,])
      lmmd2<-lmer(resp~BeverFac+(1|Subject)+(1|hours))
      coefs<-summary(lmmd2)$coefficients
      # alcohol versus water: t value of BeverFac4
      Zstat<-coefs[4,3]
      pvalset[i]<-2*(1-pnorm(abs(Zstat)))
      # red wine versus water: contrast of BeverFac3 and BeverFac4
      betadiff<-coefs[3,1]-coefs[4,1]
      varbetadiff<-t(contr)%*%as.matrix(vcov(lmmd2))%*%contr
      Zstat2<-as.numeric(betadiff/sqrt(varbetadiff))
      pvalset2[i]<-2*(1-pnorm(abs(Zstat2)))
    }

In the following questions, let us focus on the gene with the smallest p-value for testing the mean difference between the gene expression of the alcohol group and that of the water group, based on the testing results in Q4.

Q5: Plot the gene expression versus the time (in hours) at which the observations were measured. Use curves to connect the measurements for every combination of individual and beverage. For different beverages, use lines with different colors to distinguish them. For different individuals, use different symbols (e.g., circles, triangles) to distinguish them. Describe your observations.

Answer: The plot is given below. Based on the plot, we observe that the gene expression changes over time, so we should model the time effects. The time effects are also non-linear, so we need a non-linear function of time to model them. The gene expression levels for the alcohol group seem lower than those of the water group.
The R code for producing the plot is given below:

    smallest<-which.min(pvalset)
    newresp<-as.numeric(subsetGenes[smallest,])
    plot(hours,newresp,type="n",xlab="Hours", ylab="gene expression")
    for (sub in 1:6)
      for (bever in 1:4)
      {
        # observations for this individual and beverage, ordered by time
        sel<-which((Subject==sub)&(Beverages==bever))
        sel<-sel[order(hours[sel])]
        points(hours[sel],newresp[sel],col=bever,pch=sub)
        lines(hours[sel],newresp[sel],col=bever)
      }

Q6: Let \(Y_{ijk}\) be the gene expression for the gene of interest obtained from the k-th individual at the j-th hour in the i-th beverage group. Let \(t_{ijk}\) be the corresponding time when the gene expression was measured.

(a) Fit the following linear mixed model

\[
Y_{ijk} = \mu + \alpha_i + \beta_1 t_{ijk} + \beta_2 t_{ijk}^2 + \beta_3 t_{ijk}^3 + \epsilon_{ijk}, \qquad (1)
\]

where the \(\alpha_i\) are fixed beverage group effects and \(\beta_1\), \(\beta_2\) and \(\beta_3\) are fixed coefficients. Define \(\epsilon_{ik} = (\epsilon_{i1k}, \epsilon_{i2k}, \cdots, \epsilon_{i n_{ik} k})'\) as the vector of random errors of the measurements obtained from the k-th individual in the i-th beverage group, where \(n_{ik}\) is the number of repeated measurements. Assume that \(\epsilon_{ik}\) has mean 0 and variance-covariance matrix \(\Sigma_{ik}\). Fit the above model (1) by choosing two different covariance structures for \(\Sigma_{ik}\): heterogeneous compound symmetry and heterogeneous AR(1). Which covariance structure is better? Based on the chosen model, are the time effects and beverage effects significant?

Answer: To compare the models with different covariance structures, we use the AIC and BIC values, which are given in the following table:

                           AIC        BIC
    Heterogeneous CS       -34.47362  49.21024
    Heterogeneous AR(1)    -22.54317  61.14069

Based on the above output, the heterogeneous CS structure has smaller AIC and BIC values. Therefore, heterogeneous CS is better for this data set.
Using the above chosen model, the coefficients and the corresponding p-values are given in the following table:

                 Value       Std.Error   t-value    p-value
    (Intercept)   4.321293   0.03653146  118.2896   0.0000
    BeverFac2     0.070319   0.04973108    1.41398  0.1604
    BeverFac3    -0.078728   0.05517753   -1.42681  0.1567
    BeverFac4     0.014508   0.06496000    0.22333  0.8237
    hours        -0.091845   0.02778397   -3.30568  0.0013
    hours2        0.018368   0.00866787    2.11910  0.0365
    hours3       -0.000918   0.00054679   -1.67910  0.0962

At the significance level 0.05, the individual beverage coefficients are not significant, but the linear and quadratic time terms are significant. To test for the significance of the time effects jointly, we use an asymptotic Wald-type test statistic. Specifically, we use the test statistic

\[
T_n = (\hat\beta_1, \hat\beta_2, \hat\beta_3)\, \Sigma_\beta^{-1}\, (\hat\beta_1, \hat\beta_2, \hat\beta_3)',
\]

where \(\Sigma_\beta\) is the variance-covariance matrix of the coefficient estimates \((\hat\beta_1, \hat\beta_2, \hat\beta_3)\). Under the null hypothesis \(\beta_1 = \beta_2 = \beta_3 = 0\), \(T_n\) is approximately chi-square distributed with 3 degrees of freedom. Using the R output, the above test statistic is 29.9068. The corresponding p-value is 1.44e-06, which is much smaller than 0.05. Hence, we conclude that the time effects are significant. The R code for fitting the models and computing the test statistic is given below:

    library(nlme)   # provides gls, corCompSymm, corAR1, varIdent
    resp<-newresp   # expression of the gene selected in Q5
    hours2<-hours^2
    hours3<-hours^3
    # one group per (individual, beverage) combination
    combfacs<-Subject*10+Beverages
    # heterogeneous compound symmetry
    glscsh<-gls(resp~BeverFac+hours+hours2+hours3,
      correlation=corCompSymm(form=~1|combfacs),
      weights=varIdent(form=~1|combfacs),method="REML")
    # heterogeneous AR(1)
    glsarh1<-gls(resp~BeverFac+hours+hours2+hours3,
      correlation=corAR1(form=~1|combfacs),
      weights=varIdent(form=~1|combfacs),method="REML")
    # Wald test for the three time coefficients (positions 5 to 7)
    Sigbeta<-vcov(glscsh)[5:7,5:7]
    Tn<-t(coef(glscsh)[5:7])%*%solve(Sigbeta)%*%coef(glscsh)[5:7]
    pvalTn<-pchisq(as.numeric(Tn),df=3,lower.tail=FALSE)

(b) A generalization of the above model is

\[
Y_{ijk} = \mu + \alpha_i + \beta_{1i} t_{ijk} + \beta_{2i} t_{ijk}^2 + \beta_{3i} t_{ijk}^3 + \epsilon_{ijk}. \qquad (2)
\]

Note that the coefficients \(\beta_{1i}\), \(\beta_{2i}\) and \(\beta_{3i}\) in model (2) depend on i, so they can differ between beverage groups. Again, fit model (2) by choosing the same covariance structures \(\Sigma_{ik}\) as those in part (a).
Are the time effects significant? How about the beverage effects?

Answer: To fit model (2), we use the following R code. We demonstrate it using the heterogeneous compound symmetry covariance structure. The other covariance structure can be fitted similarly.

    glscshM2<-gls(resp~BeverFac+BeverFac*hours+BeverFac*hours2+BeverFac*hours3,
      correlation=corCompSymm(form=~1|combfacs),
      weights=varIdent(form=~1|combfacs),method="REML")
    summary(glscshM2)

To check the time effect, we could use a test statistic similar to that in part (a). But now we need to test the significance of all \((\hat\beta_{1i}, \hat\beta_{2i}, \hat\beta_{3i})\) (i=1,2,3,4), that is, of 12 coefficients. The test statistic \(T_{n1}\) is 188.37, which has a p-value that is numerically 0. The R code is given below:

    # coefficients 5 to 16: time terms and their interactions with beverage
    Sigbeta1<-vcov(glscshM2)[5:16,5:16]
    Tn1<-t(coef(glscshM2)[5:16])%*%solve(Sigbeta1)%*%coef(glscshM2)[5:16]
    pvalTn1<-pchisq(as.numeric(Tn1),df=12,lower.tail=FALSE)

To check for the beverage effect, a similar test statistic can be used. The test statistic value is 34.82, which has p-value 1.330064e-07. Thus, the beverage effects are also significant. The R code is given below:

    # coefficients 2 to 4: beverage main effects
    Sigbeta0<-vcov(glscshM2)[2:4,2:4]
    Tn2<-t(coef(glscshM2)[2:4])%*%solve(Sigbeta0)%*%coef(glscshM2)[2:4]
    pvalTn2<-pchisq(as.numeric(Tn2),df=3,lower.tail=FALSE)

Q7: Now suppose we are interested in studying over-expression or under-expression due to beverage effects. Namely, we would like to see whether the beverages increased or decreased the expression levels compared with their baseline level. For this purpose, define the binary variable \(Z\) to be 1 if the expression level increased from the baseline; otherwise define \(Z\) to be 0. Fit a generalized linear mixed model with \(Z\) as the binary response, beverage effects and time effects as fixed factors, and a random intercept for subject. Please try the following approximation methods to find the maximum likelihood estimators: the Laplace approximation, adaptive Gauss-Hermite quadrature and the penalized quasi-likelihood method. Compare the estimates from the above three methods.
What are your observations?

Answer: We first define the binary response as follows:

    Combdata<-cbind(newresp,hours,Subject,Beverages,combfacs)
    # (individual, beverage) combinations that have a baseline measurement
    unicombfacs<-Combdata[Combdata[,2]==0,5]
    newcombdata<-NULL
    for (i in 1:length(unicombfacs))
    {
      subcomb<-Combdata[Combdata[,5]==unicombfacs[i],,drop=FALSE]
      Yvec<-subcomb[,1]
      Hvec<-subcomb[,2]
      # 1 if the expression is above the baseline value, 0 otherwise
      Zvec<-((Yvec[Hvec!=0]-Yvec[Hvec==0])>0)+0
      newcombdata<-rbind(newcombdata,cbind(Zvec,subcomb[Hvec!=0,c(2:5),drop=FALSE]))
    }
    newcombdata1<-list(Zvec=newcombdata[,1],hours=newcombdata[,2],
      Subject=newcombdata[,3],Beverages=newcombdata[,4],combfacs=newcombdata[,5])
    newcombdata1$Beverages<-as.factor(newcombdata1$Beverages)
    newcombdata1$hours<-as.factor(newcombdata1$hours)

To fit the generalized linear mixed model by maximum likelihood with the Laplace approximation, we could use either glmer (from lme4, where the Laplace approximation is the default) or glmmML:

    library(lme4)
    glmRandomIntcept1<-glmer(newcombdata1$Zvec~newcombdata1$Beverages+newcombdata1$hours+
      (1|newcombdata1$combfacs),family=binomial)
    summary(glmRandomIntcept1)

    library(glmmML)
    glmRandomIntcept2<-glmmML(newcombdata1$Zvec~newcombdata1$Beverages+newcombdata1$hours,
      family=binomial,cluster=newcombdata1$combfacs)
    summary(glmRandomIntcept2)

For the adaptive Gauss-Hermite quadrature approximation, we use the following code (nAGQ sets the number of quadrature points):

    glmRandomIntcept4<-glmer(newcombdata1$Zvec~newcombdata1$Beverages+newcombdata1$hours+
      (1|newcombdata1$combfacs),nAGQ=8,family=binomial)
    summary(glmRandomIntcept4)

For the penalized quasi-likelihood method, we could use the following code:

    library(MASS)
    newcombdata2<-data.frame(newcombdata1)
    glmRandomIntcept3<-glmmPQL(Zvec~Beverages+hours,random=~1|combfacs,
      family=binomial,data=newcombdata2)
    summary(glmRandomIntcept3)

The following table summarizes the coefficient estimates from the above approximations.
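                 Laplace      Laplace      AGH          PQL
    (Intercept)   2.1984334    2.1982606    2.2550309    2.0951310
    Beverages2   -2.7017778   -2.7018806   -2.7558244   -2.4573363
    Beverages3   -1.2715441   -1.2713280   -1.3029116   -1.1602524
    Beverages4   -2.4011942   -2.4011141   -2.4496964   -2.1760696
    hours2       -0.7097403   -0.7095266   -0.7364916   -0.7267728
    hours4       -2.3061768   -2.3063144   -2.3670837   -2.2640438
    hours12      -2.3386592   -2.3387839   -2.3965655   -2.2785129

The above table shows that the estimates depend on the approximation used. The two Laplace fits agree to about three decimal places. The AGH estimates are larger in magnitude than the Laplace estimates in every row, while the PQL estimates are smaller in magnitude than the Laplace estimates in all rows but one. In general, AGH is more accurate than the Laplace approximation since it uses more points to approximate the integral. The PQL method seems to be less accurate than the Laplace approximation.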
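To see why more quadrature points help, the random-intercept integral can be approximated directly. The sketch below is a base-R illustration, not the algorithm used inside glmer: it builds non-adaptive Gauss-Hermite nodes with the Golub-Welsch eigenvalue method and evaluates the marginal probability E[plogis(eta + sigma*b)] for b ~ N(0,1). The values eta = 2.2 and sigma = 1 are assumptions chosen for illustration, and the helper names `gaussHermite` and `marginalProb` are hypothetical.

```r
# Gauss-Hermite nodes and weights for the weight function exp(-x^2),
# computed from the eigen-decomposition of the Jacobi matrix (Golub-Welsch).
gaussHermite <- function(n) {
  J <- matrix(0, n, n)
  if (n > 1) {
    off <- sqrt(seq_len(n - 1) / 2)       # off-diagonal entries sqrt(k/2)
    J[cbind(1:(n - 1), 2:n)] <- off
    J[cbind(2:n, 1:(n - 1))] <- off
  }
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = sqrt(pi) * e$vectors[1, ]^2)
}

# Approximate E[plogis(eta + sigma * b)], b ~ N(0, 1), with n quadrature points.
marginalProb <- function(eta, sigma, n) {
  gh <- gaussHermite(n)
  sum(gh$weights * plogis(eta + sigma * sqrt(2) * gh$nodes)) / sqrt(pi)
}

eta <- 2.2     # assumed linear predictor (illustrative)
sigma <- 1     # assumed random-intercept standard deviation (illustrative)

# Reference value from numerical integration.
exact <- integrate(function(b) plogis(eta + sigma * b) * dnorm(b),
                   -Inf, Inf)$value

err1 <- abs(marginalProb(eta, sigma, 1) - exact)   # one quadrature point
err8 <- abs(marginalProb(eta, sigma, 8) - exact)   # eight quadrature points
c(err1 = err1, err8 = err8)
```

In glmer, the Laplace approximation corresponds to nAGQ=1, so the comparison of one point against eight points mirrors, in a simplified non-adaptive form, the Laplace and AGH columns of the table.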