Lab 4: Mixed effects models

In this lab, we will apply mixed models to study the effects of beverages on gene
expression, using the data set studied by Baty et al. (2006). The original purpose
of that study was to measure the influence of beverages on blood gene expression,
in order to explore the mechanisms underlying the cardioprotective
effects of beverages.
Experiment Design
Six healthy individuals participated in the randomized controlled cross-over
experiment. On 4 independent days, each individual had one of 4 different beverages
(500 mL each: grape juice, red wine, 40 g diluted ethanol, water). The drinks they
had on each day are summarized in the following table:
        Individual 1  Individual 2  Individual 3  Individual 4  Individual 5  Individual 6
Day 1   Grape Juice   Water         Red Wine      Water         Grape Juice   Alcohol
Day 2   Red Wine      Grape Juice   Alcohol       Alcohol       Red Wine      Red Wine
Day 3   Alcohol       Red Wine      Water         Red Wine      Alcohol       Water
Day 4   Water         Alcohol       Grape Juice   Grape Juice   Water         Grape Juice
On each day, blood samples were taken at baseline (0 hours) and at 1, 2, 4 and 12
hours after the drink, together with standardized nutrition. Some individuals missed
scheduled blood draws, which resulted in 12 missing samples out of the
6 x 4 x 5 = 120 planned. RNA from the remaining 108 samples was hybridized on
Affymetrix microarrays, so gene expression data are available for 108 blood samples.
Data set
The data set is contained in the file "Alldata.Rdata", which can be loaded into R
with the following command (after setting the working directory to the folder where
you saved the file):
load(file="Alldata.Rdata")
Within the data set, "Alldata" is a list, which includes the following objects:
"originaldata" "trt1" "trt2" "trt3" "trt4"
"time_h0" "time_h1" "time_h2" "time_h4" "time_h12"
"ind1" "ind2" "ind3" "ind4" "ind5" "ind6"
The objects included in the data set are
originaldata: all the gene expression data (transformed count data);
trt1: IDs of the observations in the Alcohol group;
trt2: IDs of the observations in the Grape juice group;
trt3: IDs of the observations in the Red wine group;
trt4: IDs of the observations in the Water group;
time_h0: IDs of the observations measured at baseline;
time_h1: IDs of the observations measured 1 hour after the drink;
time_h2: IDs of the observations measured 2 hours after the drink;
time_h4: IDs of the observations measured 4 hours after the drink;
time_h12: IDs of the observations measured 12 hours after the drink;
ind1: IDs of the observations obtained from individual 1;
ind2: IDs of the observations obtained from individual 2;
ind3: IDs of the observations obtained from individual 3;
ind4: IDs of the observations obtained from individual 4;
ind5: IDs of the observations obtained from individual 5;
ind6: IDs of the observations obtained from individual 6.
You can access each object with the operator $. For example, to get the
observation IDs contained in trt1, type the following:
Alldata$trt1
The object originaldata is a matrix with 22283 rows and 130 columns. Each row
corresponds to one gene, and columns 3-110 contain the gene expression
data. The remaining columns contain gene IDs or gene annotation information. A
very small part of the data is shown below:
   GSM87863 GSM87887 GSM87896 GSM87934 GSM87943 GSM87853
1   6.96959  6.84646  6.99376  7.06780  7.07566  7.18618
2   4.94771  4.63228  4.47609  4.41107  4.62490  4.61241
3   7.38956  7.21881  7.62192  7.76446  7.43270  7.38960
4   7.73394  7.73069  8.12781  7.92782  7.96697  7.96694
5   3.10916  3.31460  3.42180  3.46084  3.29934  3.35648
6   6.93594  7.23465  6.63625  6.72077  7.15000  7.07379
The first column in the above example corresponds to the observation ID
GSM87863. This ID is contained in Alldata$trt1, Alldata$time_h0 and
Alldata$ind1. This means that the gene expressions in this column were obtained
from individual 1 at time 0h on the day of treatment 1 (Alcohol
group). Each row contains the expression values of one gene. In the above
example, rows 1-6 provide the gene expressions for the first 6 genes.
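The lookup described above can be sketched in R. The list below is a tiny hypothetical stand-in for Alldata (all IDs other than GSM87863 are made up), and describe_sample is an illustrative helper, not an object in the data set.

```r
# Toy stand-in for Alldata: each element holds the observation IDs of one group.
# Only the structure mirrors the lab data; the membership here is invented.
toy <- list(
  trt1    = c("GSM87863", "GSM00001"),
  trt4    = c("GSM00002"),
  time_h0 = c("GSM87863", "GSM00002"),
  time_h1 = c("GSM00001"),
  ind1    = c("GSM87863", "GSM00001"),
  ind2    = c("GSM00002")
)

# Return the names of all list elements that contain a given observation ID.
describe_sample <- function(id, groups) {
  names(groups)[vapply(groups, function(ids) id %in% ids, logical(1))]
}

describe_sample("GSM87863", toy)
```

With the real list, one would probably drop the expression matrix first, for example Alldata[names(Alldata) != "originaldata"].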
In today's lab, we will focus on a subset of genes that are related to immunity.
The set of genes is defined by Gene Ontology (GO) terms. GO is a set of
controlled, structured vocabularies to describe key domains of molecular biology,
including gene product attributes and biological sequences. GO defines
classes used to describe gene function and attributes of gene products in three
non-overlapping domains of molecular biology. Specifically, we will focus on the
genes within the GO term GO:0006955.
Q1: The last three columns (namely, columns 128 to 130) of the data set
(originaldata) contain the GO information for each gene. Find the genes that
belong to the GO term GO:0006955. Then create a new data set that contains
the gene expression data (namely, columns 3-110 of originaldata) for the genes
in the GO term GO:0006955.
Answer: To find the genes that belong to the GO term GO:0006955, we first create a
matrix whose first column is the row number of each gene and whose second column
is a GO term containing the gene. Then we find all the rows that belong to the GO
term GO:0006955. Finally, we use these rows to extract the gene expression data
for the genes in the GO term GO:0006955. Below is the R code:
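load(file="Alldata.Rdata")
# GO annotation columns (one row per gene)
Goterms<-Alldata$originaldata[,128:130]
mc<-dim(Goterms)[1]
# Build a two-column table: row number of the gene, GO term
GOTermlist<-NULL
for (j in 1:mc)
{
  genelist<-NULL
  for (k in 1:3)
  {
    # Split the k-th annotation field of gene j into individual GO terms
    GOTerm22283<-as.character(Goterms[j,k])
    getGoTerms<-unlist(strsplit(GOTerm22283,"///"))
    repj<-rep(j,length(getGoTerms))
    newlist<-cbind(repj,getGoTerms)
    genelist<-rbind(genelist,newlist)
  }
  GOTermlist<-rbind(GOTermlist,genelist)
}
# Entries that carry the target term, and the corresponding gene rows
GO0006955<-which(GOTermlist[,2]=="GO:0006955")
rownums<-as.numeric(GOTermlist[GO0006955,1])
# Expression data (columns 3-110) for the selected genes
subsetGenes<-Alldata$originaldata[rownums,c(3:110)]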
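The same search can also be written without explicit loops. The sketch below runs on a small invented annotation table (go_toy) that only mimics the layout assumed by the code above (three GO columns, terms separated by "///"); has_term is a hypothetical helper name.

```r
# Toy annotation table: 4 genes, 3 GO columns, terms separated by "///".
# The values are invented; only the layout follows the lab code.
go_toy <- data.frame(
  bp = c("GO:0006955///GO:0008150", "GO:0007165", "", "GO:0006412"),
  cc = c("GO:0005576", "GO:0005886", "GO:0005634", ""),
  mf = c("GO:0005125", "GO:0006955", "GO:0003677", "GO:0003735"),
  stringsAsFactors = FALSE
)

# TRUE for every row in which any column lists the target term exactly.
has_term <- function(go, term) {
  hit_in_column <- function(col) {
    vapply(strsplit(as.character(col), "///", fixed = TRUE),
           function(terms) term %in% terms, logical(1))
  }
  Reduce(`|`, lapply(go, hit_in_column))
}

rows <- which(has_term(go_toy, "GO:0006955"))
rows
```

Unlike the double loop, this version returns each matching row at most once, even if the term appears in more than one annotation column.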
Q2: Several factors need to be considered in this data set. These factors
include beverage effects, time effects and individual effects. To see the impact of
these effects on gene expression, we need to define these factors as covariates.
Please define three covariates that, respectively, describe the three factors (namely,
beverage, time and individual effects).
Answer: The beverage factor has four levels: alcohol, grape juice, red
wine and water. We will use 1-4 to represent them, respectively. The microarray
gene expressions were measured five times, at 0, 1, 2, 4 and 12 hours.
A total of six individuals participated in this study; they will be coded as
1-6, respectively. The following R code could be used to define the above three
factors:
# Observation IDs are the column names of the expression data
samIDs<-colnames(subsetGenes)
# Beverage: 1 = alcohol, 2 = grape juice, 3 = red wine, 4 = water
Beverages<-(samIDs%in%Alldata$trt1)*1+(samIDs%in%Alldata$trt2)*2+
  (samIDs%in%Alldata$trt3)*3+(samIDs%in%Alldata$trt4)*4
# Individual: 1-6
Subject<-(samIDs%in%Alldata$ind1)*1+(samIDs%in%Alldata$ind2)*2+(samIDs%in%Alldata$ind3)*3+
  (samIDs%in%Alldata$ind4)*4+(samIDs%in%Alldata$ind5)*5+(samIDs%in%Alldata$ind6)*6
# Measurement time in hours: 0, 1, 2, 4 or 12
hours<-(samIDs%in%Alldata$time_h0)*0+(samIDs%in%Alldata$time_h1)*1+(samIDs%in%Alldata$time_h2)*2+
  (samIDs%in%Alldata$time_h4)*4+(samIDs%in%Alldata$time_h12)*12
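The coding trick used above (a weighted sum of membership indicators) can be checked on a toy example. The IDs and groups below are invented for illustration and stand in for samIDs and the trt lists.

```r
# Toy observation IDs and group membership (all invented).
ids  <- c("s1", "s2", "s3", "s4", "s5")
grpA <- c("s1", "s4")   # plays the role of Alldata$trt1
grpB <- c("s2", "s3")   # plays the role of Alldata$trt2

# Each ID belongs to at most one group, so the weighted sum of indicators
# yields the group code; an ID found in no group gets code 0.
code <- (ids %in% grpA) * 1 + (ids %in% grpB) * 2
code

# A nonzero count here flags observations that were not matched to any group.
sum(code == 0)
```

On the lab data, a cross-tabulation such as table(Subject, Beverages) would be a natural follow-up check; no Subject or Beverages code should be 0 if every observation ID was matched.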
Q3: Consider a linear mixed model with the gene expressions of the first gene in the
GO term GO:0006955 as response and the three factors defined in Q2 as covariates.
Specifically, in this question, consider beverage effects as fixed effects, but
consider individual effects and time effects as random effects. What are the REML
estimates of the variances of the individual and time random effects? Are the mean
gene expressions for the alcohol and water groups significantly different? Similarly,
compare the mean gene expression of the red wine group with that of the water
group. Write down the statistical model, hypotheses, test statistics and the
corresponding p-values. Please clearly define your notation.
Answer: Let $Y_{ijk}$ be the $j$-th ($j$ = 1, 2, 3, 4, 5) repeated measurement of the
expression of the first gene, measured on the $k$-th individual in the $i$-th
beverage group ($i$ = 1, 2, 3, 4). A linear mixed effects model can be written as
follows:
$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + \varepsilon_{ijk}$,
where the $\alpha_i$'s are beverage effects that are treated as fixed effects, the
$\beta_j$'s are random time effects, the $\gamma_k$'s are random individual effects,
and the $\varepsilon_{ijk}$ are IID random errors with mean 0 and variance $\sigma^2$.
Assume that the $\beta_j$'s are IID normal with mean 0 and variance $\sigma_\beta^2$,
and that the $\gamma_k$'s are IID normal with mean 0 and variance $\sigma_\gamma^2$.
The above model could be fitted using the following R code (lmer is provided by
the lme4 package):
library(lme4)
BeverFac<-as.factor(Beverages)
hourFac<-as.factor(hours)
resp<-as.numeric(subsetGenes[1,])
lmmd2<-lmer(resp~BeverFac+(1|Subject)+(1|hours))
summary(lmmd2)
The output of the above R code is
Linear mixed model fit by REML ['lmerMod']
Formula: resp ~ BeverFac + (1 | Subject) + (1 | hours)
REML criterion at convergence: 61.1
Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.95114 -0.67473 -0.03279  0.69286  2.74099
Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 0.089614 0.29936
 hours    (Intercept) 0.002754 0.05248
 Residual             0.078447 0.28008
Number of obs: 108, groups: Subject, 6; hours, 5
Fixed effects:
            Estimate Std. Error t value
(Intercept)  5.97213    0.13670   43.69
BeverFac2    0.12216    0.07891    1.55
BeverFac3    0.06217    0.07717    0.81
BeverFac4    0.10770    0.07699    1.40
Correlation of Fixed Effects:
          (Intr) BvrFc2 BvrFc3
BeverFac2 -0.296
BeverFac3 -0.300  0.519
BeverFac4 -0.305  0.528  0.533
Based on the output of the model fitting, the REML estimate of $\sigma_\beta^2$ is
0.0028 and the REML estimate of $\sigma_\gamma^2$ is 0.0896.
To test whether the mean gene expressions of the water and alcohol groups are
equal, we test $H_0: \alpha_1 = \alpha_4$ versus $H_1: \alpha_1 \neq \alpha_4$. In our R output,
since the baseline is the alcohol group, the estimate of $\alpha_4 - \alpha_1$ is 0.10770, with
standard error 0.07699. The test statistic value is 1.40. Comparing it with the
standard normal distribution gives a two-sided p-value of 0.1618. Hence the mean
gene expressions for the alcohol and water groups are not significantly different
for the first gene. The p-value can be computed as follows:
Zstat<-summary(lmmd2)$coefficients[4,3]
pval<-2*(1-pnorm(abs(Zstat)))
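As a quick check that does not require the fitted model, the same p-value can be recomputed from the estimate and standard error printed in the output. This is only a sketch using the rounded values shown above, and it assumes the standard normal reference distribution used in this lab.

```r
est <- 0.10770   # estimate of alpha_4 - alpha_1 (row BeverFac4)
se  <- 0.07699   # its standard error
z   <- est / se
pval <- 2 * pnorm(-abs(z))   # two-sided p-value under a N(0, 1) reference
round(c(z = z, p = pval), 4)
```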
To compare the mean gene expression between the water and the red wine
groups, we need to test $H_0: \alpha_3 = \alpha_4$ versus $H_1: \alpha_3 \neq \alpha_4$. An estimate of
$\alpha_3 - \alpha_4$ is 0.06217 - 0.10770 = -0.04553. The variance of this estimate is
$\mathrm{Var}(\hat\alpha_3 - \hat\alpha_4) = \mathrm{Var}\{(\hat\alpha_3 - \hat\alpha_1) - (\hat\alpha_4 - \hat\alpha_1)\}$
$= \mathrm{Var}(\hat\alpha_3 - \hat\alpha_1) + \mathrm{Var}(\hat\alpha_4 - \hat\alpha_1) - 2\,\mathrm{Cov}(\hat\alpha_3 - \hat\alpha_1, \hat\alpha_4 - \hat\alpha_1)$
$= 0.07717^2 + 0.07699^2 - 2 \times 0.533 \times 0.07717 \times 0.07699$
$= 0.00555.$
Therefore, the standard error is 0.074. The test statistic value is -0.611, and
the corresponding p-value is 0.54. Hence the mean gene expressions for the red
wine and water groups are not significantly different for the first gene.
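The contrast calculation above can be reproduced in a few lines of R. The sketch below rebuilds the 2 x 2 covariance matrix of the BeverFac3 and BeverFac4 estimates from the standard errors and the correlation printed in the output, so it only uses rounded values.

```r
se3 <- 0.07717; se4 <- 0.07699   # standard errors of BeverFac3 and BeverFac4
rho <- 0.533                     # their estimated correlation
V <- matrix(c(se3^2,           rho * se3 * se4,
              rho * se3 * se4, se4^2), nrow = 2)

cvec <- c(1, -1)                                # contrast for alpha_3 - alpha_4
est  <- sum(cvec * c(0.06217, 0.10770))         # estimated difference
se   <- sqrt(drop(t(cvec) %*% V %*% cvec))      # standard error of the contrast
z    <- est / se
pval <- 2 * pnorm(-abs(z))
round(c(estimate = est, variance = se^2, z = z, p = pval), 5)
```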
Q4: For each gene in the GO term GO:0006955, conduct two hypothesis tests
for the following group mean differences using the model in Q3: alcohol versus
water group, and red wine versus water group. For each gene, compute the
corresponding p-values. Draw histograms of the p-values for each hypothesis.
Since we are conducting multiple hypothesis tests together, we need to
account for the possibly inflated type I error. The most conservative procedure is
based on the Bonferroni correction. Namely, we compare the p-values with
0.05/#genes, where #genes represents the number of genes in the GO term
GO:0006955. Any gene with a p-value less than 0.05/#genes is considered to be
significant. Based on the Bonferroni correction, which genes are significant for the
above two hypothesis tests? Which genes have the smallest p-values?
Answer: For each gene in the GO term, we could perform the hypothesis tests as
we did in Q3. The histograms for both tests are given below:
For the hypothesis comparing the alcohol group with the water group, one gene
(the 416-th gene) has a p-value of 1.696569e-05, which is less than the significance
level 8.460237e-05. The 416-th gene has the smallest p-value.
For the hypothesis comparing the red wine group with the water group, none
of the genes is significant at the significance level 8.460237e-05. The 119-th gene
has the smallest p-value, 0.00013.
The R code for computing p-values is given below:
numgenes<-dim(subsetGenes)[1]
pvalset<-rep(0,numgenes)    # alcohol versus water
pvalset2<-rep(0,numgenes)   # red wine versus water
cvec<-c(0,0,1,-1)           # contrast for alpha_3 - alpha_4
for (i in 1:numgenes)
{
  resp<-as.numeric(subsetGenes[i,])
  lmmd2<-lmer(resp~BeverFac+(1|Subject)+(1|hours))
  coefs<-summary(lmmd2)$coefficients
  # Wald test for BeverFac4 (water minus alcohol)
  Zstat<-coefs[4,3]
  pvalset[i]<-2*(1-pnorm(abs(Zstat)))
  # Wald test for BeverFac3 - BeverFac4 (red wine minus water)
  betadiff<-coefs[3,1]-coefs[4,1]
  varbetadiff<-t(cvec)%*%as.matrix(vcov(lmmd2))%*%cvec
  Zstat2<-as.numeric(betadiff/sqrt(varbetadiff))
  pvalset2[i]<-2*(1-pnorm(abs(Zstat2)))
}
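Once the p-values are available, the histogram and the Bonferroni decision take only a few lines. The sketch below uses simulated p-values as a stand-in for pvalset, so it runs without the data. The number of genes, 591, is inferred from the quoted level (0.05/591 = 8.460237e-05), and the single small p-value is planted by hand to mimic the reported result.

```r
set.seed(1)
numgenes <- 591                                   # inferred from 0.05 / 8.460237e-05
pvals <- runif(numgenes, min = 0.001, max = 1)    # simulated stand-in for pvalset
pvals[416] <- 1.696569e-05                        # planted to mimic the reported gene

# Bonferroni: compare each p-value with 0.05 / (number of tests)
alpha_bonf  <- 0.05 / numgenes
significant <- which(pvals < alpha_bonf)
significant
which.min(pvals)

# Equivalent decision using adjusted p-values
same <- identical(significant, which(p.adjust(pvals, method = "bonferroni") < 0.05))

hist(pvals, breaks = 20, xlab = "p-value", main = "Alcohol versus water (simulated)")
```

Under the null hypothesis the p-values are roughly uniform, so a flat histogram with no spike near 0 suggests that few genes, if any, respond to the comparison.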
In the following questions, let us focus on the gene with the smallest p-value for
testing the mean difference between the gene expression of the alcohol group and
that of the water group, based on the testing results in Q4.
Q5: Plot the gene expression versus the time (in hours) at which the observations
were measured. Use curves to connect the measurements for every combination of
individual and beverage. For different beverages, use lines with different colors
to distinguish them. For different individuals, use different symbols (e.g., circles,
triangles) to distinguish them. Describe your observations.
Answer: The plot is given below. Based on the plot, we can observe
that the gene expression changes over time. Therefore, we should model the time
effects. The time effects also appear to be non-linear, so we will need a non-linear
function of time to model them. The gene expression levels for the alcohol group
seem lower than those of the water group.
The R-code for producing the plot is given below:
smallest<-which.min(pvalset)
newresp<-as.numeric(subsetGenes[smallest,])
plot(hours,newresp,type="n",xlab="Hours", ylab="gene expression")
for (sub in 1:6)
  for (bever in 1:4)
  {
    # Observations for this individual-beverage combination, ordered by time
    sel<-which((Subject==sub)&(Beverages==bever))
    sel<-sel[order(hours[sel])]
    # Color codes the beverage, plotting symbol codes the individual
    points(hours[sel],newresp[sel],col=bever,pch=sub)
    lines(hours[sel],newresp[sel],col=bever)
  }
Q6: Let $Y_{ijk}$ be the gene expression for the gene of interest obtained from the $k$-th
individual at the $j$-th hour in the $i$-th beverage group. Let $t_{ijk}$ be the corresponding
time when the gene expression was measured.
(a) Fit the following linear mixed model
$Y_{ijk} = \mu + \alpha_i + \beta_1 t_{ijk} + \beta_2 t_{ijk}^2 + \beta_3 t_{ijk}^3 + \varepsilon_{ijk}$,   (1)
where the $\alpha_i$ are fixed beverage group effects and $\beta_1$, $\beta_2$ and $\beta_3$ are fixed coefficients.
Define $\varepsilon_{ik} = (\varepsilon_{i1k}, \varepsilon_{i2k}, \cdots, \varepsilon_{i m_{ik} k})'$ as the vector of random errors of the measurements
obtained from the $k$-th individual in the $i$-th beverage group, where $m_{ik}$ is the
number of repeated measurements. Assume that $\varepsilon_{ik}$ has mean 0 and variance-covariance
matrix $V_{ik}$. Fit the above model (1) by choosing two different covariance
structures for $V_{ik}$: heterogeneous compound symmetry and heterogeneous AR(1).
Which covariance structure is better? Based on the chosen model, are the time
effects and beverage effects significant?
Answer: To compare the models with different covariance structures, we use the
AIC and BIC values, which are given in the following table:
                      AIC        BIC
Heterogeneous CS      -34.47362  49.21024
Heterogeneous AR(1)   -22.54317  61.14069
Based on the above output, the heterogeneous CS structure has smaller AIC
and BIC values. Therefore, heterogeneous CS is better for this data set.
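A comparison of this kind can be sketched with the nlme package on a data set that ships with it. Orthodont is used here purely as a stand-in for the lab data, and in this sketch the variances are allowed to differ by age rather than by individual-beverage combination, so the numbers have no connection to the table above.

```r
library(nlme)   # provides gls(), corCompSymm(), corAR1(), varIdent()

# Orthodont: repeated measurements of 'distance' at ages 8, 10, 12, 14
# on each Subject (stand-in data, not the lab data).
fit_cs  <- gls(distance ~ age, data = Orthodont,
               correlation = corCompSymm(form = ~ 1 | Subject),
               weights = varIdent(form = ~ 1 | age), method = "REML")
fit_ar1 <- gls(distance ~ age, data = Orthodont,
               correlation = corAR1(form = ~ 1 | Subject),
               weights = varIdent(form = ~ 1 | age), method = "REML")

# Both fits share the same fixed effects, so their REML-based
# AIC and BIC values can be compared directly; smaller is better.
ic <- AIC(fit_cs, fit_ar1)
ic$BIC <- BIC(fit_cs, fit_ar1)$BIC
ic
```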
Using the chosen model, the coefficients and the corresponding p-values
are given in the following table:
             Value      Std.Error   t-value   p-value
(Intercept)   4.321293  0.03653146  118.2896  0.0000
BeverFac2     0.070319  0.04973108    1.41398 0.1604
BeverFac3    -0.078728  0.05517753   -1.42681 0.1567
BeverFac4     0.014508  0.06496000    0.22333 0.8237
hours        -0.091845  0.02778397   -3.30568 0.0013
hours2        0.018368  0.00866787    2.11910 0.0365
hours3       -0.000918  0.00054679   -1.67910 0.0962
If the significance level 0.05 is used, the beverage effects are not significant, but
the linear and quadratic time terms are significant. To test the significance of the
time effects jointly, we use an asymptotic Wald-type test statistic.
Specifically, we use the following test statistic
$T_n = (\hat\beta_1, \hat\beta_2, \hat\beta_3)\, \Sigma_\beta^{-1}\, (\hat\beta_1, \hat\beta_2, \hat\beta_3)'$,
where $\Sigma_\beta$ is the variance-covariance matrix of the coefficient estimates $(\hat\beta_1, \hat\beta_2, \hat\beta_3)$.
Using the R output, the above test statistic is 29.9068. Compared with a chi-square
distribution with 3 degrees of freedom, the corresponding p-value
is 1.44e-06, which is much smaller than 0.05. Hence, we conclude that the time
effects are significant. The R code for fitting the models and computing the test
statistic is given below:
library(nlme)
resp<-newresp   # response: the gene of interest selected in Q5
hours2<-hours^2
hours3<-hours^3
# One group per individual-beverage combination
combfacs<-Subject*10+Beverages
# Heterogeneous compound symmetry
glscsh<-gls(resp~BeverFac+hours+hours2+hours3,
  correlation=corCompSymm(form=~1|combfacs),weights=varIdent(form=~1|combfacs),method="REML")
# Heterogeneous AR(1)
glsarh1<-gls(resp~BeverFac+hours+hours2+hours3,
  correlation=corAR1(form=~1|combfacs),weights=varIdent(form=~1|combfacs),method="REML")
# Wald statistic for the three time coefficients (positions 5 to 7)
Sigbeta<-vcov(glscsh)[5:7,5:7]
Tn<-t(coef(glscsh)[5:7])%*%solve(Sigbeta)%*%coef(glscsh)[5:7]
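The code above stops at the test statistic. A minimal sketch of the remaining step, using the reported value of the statistic and the chi-square reference distribution with 3 degrees of freedom (one per time coefficient), is:

```r
Tn <- 29.9068   # Wald statistic reported above
df <- 3         # three time coefficients are tested jointly
pval <- pchisq(Tn, df = df, lower.tail = FALSE)
pval
```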
(b) A generalization of the above model is
$Y_{ijk} = \mu + \alpha_i + \beta_{1i} t_{ijk} + \beta_{2i} t_{ijk}^2 + \beta_{3i} t_{ijk}^3 + \varepsilon_{ijk}$.   (2)
Note that the coefficients $\beta_{1i}$, $\beta_{2i}$ and $\beta_{3i}$ in the above model (2) depend on $i$,
so they can differ between beverage groups. Again, fit model (2) by
choosing the same covariance structures $V_{ik}$ as those in part (a). Are the time effects
significant? How about the beverage effects?
Answer: To fit model (2), we use the following R code. We demonstrate
it using the heterogeneous compound symmetry covariance structure. The other
covariance structure can be fitted similarly.
glscshM2<-gls(resp~BeverFac+BeverFac*hours+BeverFac*hours2+BeverFac*hours3,
correlation=corCompSymm(form=~1|combfacs),weights=varIdent(form=~1|combfacs),method="REML")
summary(glscshM2)
To check the time effect, we could use a test statistic similar to that in part (a).
But now we need to test the significance of all $(\hat\beta_{1i}, \hat\beta_{2i}, \hat\beta_{3i})$ ($i$ = 1, 2, 3, 4). The test
statistic $T_{n1}$ is 188.37, which has a p-value that is numerically 0, so the time
effects are significant. The R code is given below:
Sigbeta1<-vcov(glscshM2)[5:16,5:16]
Tn1<-t(summary(glscshM2)$coefficients[5:16])%*%solve(Sigbeta1)%*%summary(glscshM2)$coefficients[5:16]
To check for the beverage effect, a similar test statistic can be used. The test
statistic value is 34.82, which has a p-value of 1.330064e-07. Thus, the beverage
effects are also significant. The R code is given below:
Sigbeta0<-vcov(glscshM2)[2:4,2:4]
Tn2<-t(summary(glscshM2)$coefficients[2:4])%*%solve(Sigbeta0)%*%summary(glscshM2)$coefficients[2:4]
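The two p-values quoted in part (b) can be recovered from the reported statistics. The sketch below assumes chi-square reference distributions with 12 degrees of freedom for the time effects (coefficients 5 to 16) and 3 degrees of freedom for the beverage effects (coefficients 2 to 4), and it uses the rounded statistics from the text.

```r
Tn1 <- 188.37   # time effects: 12 coefficients tested jointly
Tn2 <- 34.82    # beverage effects: 3 coefficients tested jointly

p_time     <- pchisq(Tn1, df = 12, lower.tail = FALSE)
p_beverage <- pchisq(Tn2, df = 3,  lower.tail = FALSE)
c(time = p_time, beverage = p_beverage)
```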
Q7: Now suppose we are interested in studying the over-expression or
under-expression due to beverage effects. Namely, we would like to see if beverages
increased or decreased the expression levels when they are compared to their
baseline level. For this purpose, define the binary variable $Z$ to be 1 if the
expression level increased from the baseline; otherwise define $Z$ to be 0.
Fit a generalized linear mixed model with $Z$ as binary response, beverage effects
and time effects as fixed factors, and a random intercept for subject. Please try the
following approximation methods to find the maximum likelihood estimators:
the Laplace approximation, adaptive Gauss-Hermite quadrature and the
penalized quasi-likelihood method. Compare the estimates from the above three
methods. What are your observations?
Answer: We first define the binary response as follows:
Combdata<-cbind(newresp,hours,Subject,Beverages,combfacs)
# Individual-beverage combinations that have a baseline (hour 0) measurement
unicombfacs<-Combdata[Combdata[,2]==0,5]
newcombdata<-NULL
for (i in 1:length(unicombfacs))
{
  subcomb<-Combdata[Combdata[,5]==unicombfacs[i],,drop=FALSE]
  Yvec<-subcomb[,1]
  Hvec<-subcomb[,2]
  # Z = 1 if the expression is above the baseline of the same combination
  Zvec<-((Yvec[Hvec!=0]-Yvec[Hvec==0])>0)+0
  newcombdata<-rbind(newcombdata,cbind(Zvec,subcomb[Hvec!=0,c(2:5),drop=FALSE]))
}
newcombdata1<-list(Zvec=newcombdata[,1],hours=newcombdata[,2],
  Subject=newcombdata[,3],Beverages=newcombdata[,4],combfacs=newcombdata[,5])
# Treat beverage and measurement time as categorical factors
newcombdata1$Beverages<-as.factor(newcombdata1$Beverages)
newcombdata1$hours<-as.factor(newcombdata1$hours)
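The definition of the binary response can be illustrated on a toy example. The data frame below is invented (two individual-beverage combinations, three time points each) and only mimics the columns used above; the baseline is matched to each row by name instead of by looping.

```r
# Toy data for two individual-beverage combinations (values invented).
toy <- data.frame(
  combfacs = c(11, 11, 11, 12, 12, 12),
  hours    = c(0, 1, 2, 0, 1, 2),
  expr     = c(5.0, 5.3, 4.8, 6.1, 6.0, 6.4)
)

# Baseline (hour 0) expression of each combination, looked up for every row.
base <- toy$expr[toy$hours == 0]
names(base) <- toy$combfacs[toy$hours == 0]
toy$Z <- as.integer(toy$expr > base[as.character(toy$combfacs)])

# Keep only the post-baseline rows, as in the loop above.
toy[toy$hours != 0, ]
```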
To fit the generalized linear mixed model by maximum likelihood with the Laplace
approximation, we could use either glmer (lme4 package) or glmmML:
library(lme4)
glmRandomIntcept1<-glmer(newcombdata1$Zvec~newcombdata1$Beverages+newcombdata1$hours+
  (1|newcombdata1$combfacs),family=binomial)
summary(glmRandomIntcept1)
library(glmmML)
glmRandomIntcept2<-glmmML(newcombdata1$Zvec~newcombdata1$Beverages+newcombdata1$hours,
  family=binomial,cluster=newcombdata1$combfacs)
summary(glmRandomIntcept2)
If we use the adaptive Gauss-Hermite quadrature approximation (here with 8
quadrature points), we use the following code:
glmRandomIntcept4<-glmer(newcombdata1$Zvec~newcombdata1$Beverages+newcombdata1$hours+
  (1|newcombdata1$combfacs),nAGQ=8,family=binomial)
summary(glmRandomIntcept4)
If we use the penalized quasi-likelihood method, we could use the following code:
library(MASS)
newcombdata2<-data.frame(newcombdata1)
glmRandomIntcept3<-glmmPQL(Zvec~Beverages+hours,random=~1|combfacs,family=binomial,data=newcombdata2)
summary(glmRandomIntcept3)
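The following table summarizes the coefficient estimates from the above
approximations. The two Laplace columns are the glmer and glmmML fits. Rows
follow the order of the model terms, with beverage 1 (alcohol) and hour 1 as
reference levels.
             Laplace     Laplace     AGH         PQL
Intercept     2.1984334   2.1982606   2.2550309   2.0951310
Beverage 2   -2.7017778  -2.7018806  -2.7558244  -2.4573363
Beverage 3   -1.2715441  -1.2713280  -1.3029116  -1.1602524
Beverage 4   -2.4011942  -2.4011141  -2.4496964  -2.1760696
Hour 2       -0.7097403  -0.7095266  -0.7364916  -0.7267728
Hour 4       -2.3061768  -2.3063144  -2.3670837  -2.2640438
Hour 12      -2.3386592  -2.3387839  -2.3965655  -2.2785129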
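The above table shows that the estimates depend on the approximation used. The
two Laplace fits agree to about three decimal places, the AGH estimates are
slightly larger in magnitude (by roughly 2-4%), and the PQL estimates are smaller
in magnitude for six of the seven coefficients. In general, AGH is more accurate
than the Laplace approximation since it uses more quadrature points (the Laplace
approximation corresponds to a single point). PQL seems to be less accurate than
the Laplace approximation here: for six of the seven coefficients its estimates
are farther from the AGH estimates than the Laplace estimates are.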
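The comparison of the columns can be made concrete with a few summary numbers. The sketch below simply re-enters the table above as a matrix; the column names Laplace1 and Laplace2 are labels chosen here for the two Laplace columns in the order they appear.

```r
est <- matrix(c(
   2.1984334,  2.1982606,  2.2550309,  2.0951310,
  -2.7017778, -2.7018806, -2.7558244, -2.4573363,
  -1.2715441, -1.2713280, -1.3029116, -1.1602524,
  -2.4011942, -2.4011141, -2.4496964, -2.1760696,
  -0.7097403, -0.7095266, -0.7364916, -0.7267728,
  -2.3061768, -2.3063144, -2.3670837, -2.2640438,
  -2.3386592, -2.3387839, -2.3965655, -2.2785129),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("Laplace1", "Laplace2", "AGH", "PQL")))

# Largest disagreement between the two Laplace fits
max_laplace_gap <- max(abs(est[, "Laplace1"] - est[, "Laplace2"]))

# Size of the AGH and PQL estimates relative to the first Laplace fit
ratio_agh <- est[, "AGH"] / est[, "Laplace1"]
ratio_pql <- est[, "PQL"] / est[, "Laplace1"]

max_laplace_gap
range(ratio_agh)
range(ratio_pql)
```

Ratios above 1 mean an estimate is larger in magnitude than the Laplace estimate, and ratios below 1 mean it is shrunk toward zero.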