Longitudinal GWA analyses using Linear Mixed Effects Models: lme and lmer

Prepared by: Karolina Sikorska and Kelly Benke

This exercise provides R scripting to read in MaCH dosage and info files, convert a ‘wide’ dataset to ‘long’ format, and perform linear mixed effects modeling. We will use two functions from two packages to do this. The first is the lme function in the nlme package. This function can be considered the ‘gold standard’: it offers many modeling options and reports p-values. The second is the lmer function in the lme4 package. This function runs much faster; however, it does not automatically produce p-values and lacks many of the options that lme offers. Our goal is to familiarize you with linear mixed effects models through a fairly simple example, and to guide you toward scripting that runs many models, as you would need to do for a GWA.

1. How many individuals are present in this study?
1269. This can be determined by applying the dim function to the object bmi.

2. How many measurements are taken in the study, and how many measurements per individual are there?
If we tabulate the id variable in long format, we can see that there are 1269 individuals, each appearing 6 times. Thus, each person has 6 measurements. Note that this only counts how often each id appears in the dataset, and thus does not take missing response values into account.

3. The scan function allows us to bring in the dose file efficiently, which is very useful for an entire chromosome. What columns are ignored from the MaCH dose file to create the object dose?
The first 2 ID columns are ignored, so that the column number for a SNP in the dose file is the same as the row number for that SNP in the information file.

4. How do the trajectories appear: are they linear, or do you detect curvature? What patterns do you observe by genotype?
The trajectories appear linear, with some measurement error.
It is very difficult to distinguish patterns by genotype, since the effect sizes are very small. In the plot, red lines represent those with 0 copies of the at-risk variant, blue lines represent those with 1 copy, and green lines represent those with 2 copies. Note that we had to round the MaCH dosage scores to integer values in order to make this graph.

5. Model 1 provides estimates for several parameters for a select SNP. Fill in the table below using the summary information from this model:

Table of Results for LME

Parameter                    Estimate       Pval (if relevant)
Random intercept variance    0.4996865^2    na
Random slope variance        0.2445878^2    na
Residual variance            0.2958741^2    na
ρ (ran int, ran slp)         0.386          na
β0                           24.281427      0.0
β Time                       1.011827       0.0
β SNP                        0.05562        0.0144
β SNP by Time                0.023003       0.0269

6. Does the lmer function produce results similar to the lme function for this example? Why?
The lme and lmer results are very similar because there is little to no remaining correlation structure for the within-individual errors after incorporating a random intercept and slope.

Table of Results for LMER

Parameter                    Estimate       Pval (if relevant)
Random intercept variance    0.249683       na
Random slope variance        0.059823       na
Residual variance            0.087542       na
ρ (ran int, ran slp)         0.386          na
β0                           24.28143       0
β Time                       1.01183        0
β SNP                        0.05562        0.01426062
β SNP by Time                0.02300        0.02691075

7. How does a linear regression model compare with the output for the lme and lmer models? Why?
The effect sizes for the fixed effects parameters are not too far off, but the standard errors, and therefore the p-values, are not trustworthy. The SNP effect is not significant at all, and the SNP by time effect is more significant than it should be. The intercept and time effects are far more significant than they should be.

8. Please run the analysis with the lme, lmer and lm functions and save the output files. How many SNPs are there in these files?
There are 201 SNPs, some of which are in LD with each other. SNP rs9939609 was simulated to be the causal SNP but, due to chance, it does not have the lowest p-value.

9. We would like to see a plot of these results, and will use the online program LocusZoom. Open a web browser to the following URL: https://statgen.sph.umich.edu/locuszoom/genform.php?type=yourdata
   a. Select the output file for either the lme or the lmer results.
   b. For P-value column name, type either the SNP main effect: P_SNP, or the SNP by Time effect: P_int.
   c. For Marker column name, type: SNP
   d. Select white space for the delimiter.
   e. Type in the SNP reference name: rs9939609
   f. Allow other values to default.
   g. Click the ‘Plot your Data’ tab (upper left) and wait for the .pdf file to be made, then save.
Does this look like a real signal? What is already known about the SNPs in this region?
This should look like a real signal, where significance attenuates as LD with the ‘causal’ SNP decreases. The SNP rs9939609 has been shown to be associated with BMI in previous studies, and has been used as an instrument in Mendelian randomization studies. We can see, especially for the SNP by time effect, that surrounding SNPs can be more significant than the ‘true’ SNP. This becomes less of an issue at larger effect sizes, but with these subtle effects it can be difficult to distinguish the true SNP from SNPs in LD with it, and true signals from false signals.
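
The workflow behind questions 2, 5 and 8 (wide to long, one lme model per SNP, a results file with P_SNP and P_int columns) can be sketched roughly as follows. This is a minimal illustration on simulated data, not the exercise script itself: the objects wide, long and results, the column names bmi.0 to bmi.5, the SNP names snp1 to snp3, and the simulated dose matrix are all assumptions made so that the example runs on its own. Only the output column names SNP, P_SNP and P_int are taken from the LocusZoom instructions above.

```r
library(nlme)  # recommended package, shipped with standard R installations

set.seed(1)
n_id <- 200; n_time <- 6; n_snp <- 3   # small sizes, for illustration only

# Simulated stand-in for the wide phenotype data: one row per individual,
# one BMI column per visit (hypothetical column names bmi.0 ... bmi.5).
b0 <- rnorm(n_id, sd = 0.5)    # individual-specific intercept deviations
b1 <- rnorm(n_id, sd = 0.25)   # individual-specific slope deviations
wide <- data.frame(id = seq_len(n_id))
for (t in 0:(n_time - 1)) {
  wide[[paste0("bmi.", t)]] <- 24 + b0 + (1 + b1) * t + rnorm(n_id, sd = 0.3)
}

# Simulated stand-in for the MaCH dosages: rows = individuals,
# columns = SNPs, values between 0 and 2. No SNP has a true effect here.
dose <- matrix(runif(n_id * n_snp, 0, 2), nrow = n_id,
               dimnames = list(NULL, paste0("snp", seq_len(n_snp))))

# Wide -> long: one row per individual per visit.
long <- reshape(wide, direction = "long", idvar = "id",
                varying = paste0("bmi.", 0:(n_time - 1)),
                v.names = "bmi", timevar = "time", times = 0:(n_time - 1))
long <- long[order(long$id, long$time), ]

# Fit a random intercept and slope model for SNP j and keep the
# SNP main effect and the SNP by time interaction.
fit_one <- function(j) {
  long$snp <- dose[long$id, j]  # ids are 1..n_id here, so they index rows
  fit <- lme(bmi ~ time * snp, random = ~ time | id, data = long,
             na.action = na.omit, control = lmeControl(opt = "optim"))
  tt <- summary(fit)$tTable
  data.frame(SNP   = colnames(dose)[j],
             B_SNP = tt["snp", "Value"],
             P_SNP = tt["snp", "p-value"],
             B_int = tt["time:snp", "Value"],
             P_int = tt["time:snp", "p-value"])
}

results <- do.call(rbind, lapply(seq_len(n_snp), fit_one))
print(results)
```

With real data, the dosage rows would need to be matched to the phenotype ids (for example with match) rather than indexed directly, and the loop would run over all SNPs in the dose file. Calling write.table(results, file = "lme_results.txt", quote = FALSE, row.names = FALSE), where the file name is only an example, would give a whitespace-delimited file of the kind the LocusZoom steps expect.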