Longitudinal GWA analyses using Linear Mixed Effects Models: lme

advertisement
Longitudinal GWA analyses using Linear Mixed Effects Models: lme and lmer
Prepared by: Karolina Sikorska and Kelly Benke
This exercise is designed to provide you with R scripting to read in MaCH dosage and info
files, convert a ‘wide’ dataset to ‘long’ format, and perform linear mixed effects modeling.
We will use 2 packages to do this. The first is the lme function in the NLME package. This
function can be considered the ‘gold standard’ and will allow many options for modeling, as
well as produce p-values. The second is the lmer function in the LME4 package. This
function runs much faster, however, it will not automatically produce p-values and cannot
perform many of the options that the lme function can perform. Our goal is familiarize you
with linear mixed effects models for a fairly simple example, and to guide you toward
scripting to run many models as you would need to do for a GWA.
1. How many individuals are present in this study?
1269, this can be determined by looking at the dim function for the object bmi.
2. How many measurements are taken in the study, and how many measurements per
individual are there?
If we table the table of the id variables in long format, we can see that there
are 1269 individuals with 6 measurements. Thus, each person has 6
measurements. Note this only looks at the number of ids represented in the
dataset, and thus will not take into account missing response variables.
3. The scan function allows us to bring in the dose file efficiently, which is very useful
for an entire chromosome. What columns are ignored from the MaCH dose file to
create the object dose?
The first 2 ID columns are ignored so that the column number for a SNP in the
dose file is the same as the row number for a SNP in the information file.
4. How do the trajectories appear – are they linear or do you detect curvature? What
patterns do you observe by genotype?
The trajectories appear linear, with some measurement error. It is very
difficult to distinguish patterns by genotype, since the effect sizes are very
small. We can see that red lines represent those with 0 copies of the at-risk
variant, blue lines represent those with 1 copy of the at-risk variant, and green
lines represent those with 2 copies of the at-risk variant. Note we had to
round the MaCH dosage scores to get integer values in order to make this
graph.
5. Model 1 provides estimates for several parameters for a select SNP. Fill in the table
below using the summary information from this model:
Table of Results for LME
Parameter
Random intercept
variance
Random slope variance
Residual variance
ρ (ran int, ran slp)
β0
β Time
β SNP
β SNP by Time
Estimate
Pval (if
relevant)
0.4996865^2
na
0.2445878^2
0.2958741^2
0.386
24.281427
1.011827
0.05562
0.023003
na
na
0.0
0.0
0.0144
0.0269
6. Does the lmer function produce results similar to the lme function for this example?
Why?
The lme and lmer functions are very similar because there is little to no
remaining correlation structure for the within-individual errors after
incorporating a random intercept and slope.
Table of Results for LMER
Parameter
Random intercept
variance
Random slope variance
Residual variance
ρ (ran int, ran slp)
β0
β Time
β SNP
Estimate
Pval (if
relevant)
0.249683
na
0.059823
0.087542
0.386
24.28143
1.01183
0.05562
na
na
0
0
0.01426062
β SNP by Time
0.02300
0.02691075
7. How does a linear regression model compare with the output for the lme and lmer
models? Why?
The effect sizes for the fixed effects parameters are not too far off, but the
standard errors and therefore p-values are not trustworthy. The SNP effect is
not significant at all, and the SNP by time effect is more significant than it
should be. The intercept and time effects are far more significant than they
should be.
8. Please run the results for both the lme, lmer and lm functions and save the output
files. How many snps are there in these files?
There are 201 SNPs, some of which are in LD with each other. Due to chance,
SNP rs9939609 was simulated to be the causal SNP, but does not have the
lowest p-value.
9. We would like to see a plot of these results, and will use the online program
locusZoom. Open a web browser to the following URL:
https://statgen.sph.umich.edu/locuszoom/genform.php?type=yourdata
a. Select the output file for either the lme or the lmer results
b. For P-value column name, type either the SNP main effect: P_SNP, or the
SNP by Time effect: P_int.
c. For Marker column name, type: SNP
d. Select white space for the delimiter
e. Type in the SNP reference name: rs9939609
f. Allow other values to default
g. Click the ‘Plot your Data’ tab (upper left) and wait for the .pdf file to be made,
then save. Does this look like a real signal? What is already known about the
snps in this region?
This should look like a real signal, where the significance of the p-value
attenuates with attenuating LD with the ‘causal’ snp. The SNP rs9939609 has
been shown to be associated with BMI in previous studies, and has been used
as an instrument for mendelian randomization studies. We can see, especially
for the SNP by time effect, that other surrounding SNPs can be more significant
than the ‘true’ SNP. This becomes less of an issue at larger effect sizes, but
with these subtle effects, it can be difficult to detect the true SNP from those
SNPs in LD with the truth, as well as the true signals from the false signals.
Download