Hunting for Genes with Longitudinal Phenotype Data Using Stata
Stata Conference Boston 2010
July 15, 2010

John Charles “Chuck” Huber Jr, PhD
Assistant Professor of Biostatistics
Department of Epidemiology and Biostatistics
School of Rural Public Health
Texas A&M Health Science Center
jchuber@tamu.edu

Co-Authors
• Michael Hallman, PhD (Principal Investigator)
• Ron Harrist, PhD
• Victoria Friedel, MA
• Melissa Richard, MS
• Huandong Sun
All at University of Texas School of Public Health

Motivation – Project Heartbeat!
Reference: Fulton, JE, Dai, S, Grunbaum, JA, Boerwinkle, E, Labarthe, R (1999) Apolipoprotein E affects serial changes in total and low-density lipoprotein cholesterol in adolescent girls: Project Heartbeat!. Metabolism 48(3): 285-290

Motivation
• Human genetics studies in the 1990s tended to focus on family data
  – Project Heartbeat! was a population-based study (no relatives)
• Genetic studies of unrelated individuals became popular in the 2000s
• Genetic markers called Single Nucleotide Polymorphisms (SNPs) became cheap to ascertain on a very large scale

What is a SNP?
Hartl & Jones (1998) pg 9, Figure 1.5

What is a SNP?
Watson et al. (2004) pg 23, Figure 2.5

What is a SNP?
• A SNP is a single nucleotide polymorphism (the individual nucleotides are called alleles)

Person 1 – Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat
                                 SNP1                       SNP2

Motivation
Stored Genotype Data
Blood samples and DNA available for 131 African-American and 505 non-Hispanic white children between 8 and 17 years of age.

Motivation
Stored Phenotype Data
Longitudinal measurements of:
Body Mass Index
Total Cholesterol
HDL & LDL Cholesterol
Systolic and Diastolic BP
Much, much more…..

Motivation
Let’s go gene hunting!!!

Challenges
1. Longitudinal Data – PLINK or HelixTree?
2. Specialized genetic data analysis
3. Need to run a very large number of graphs and models
4. Multiple comparisons and replication
5. Scaling up to 100,000 SNP Chips

Longitudinal Phenotype Data
No PLINK…. No HelixTree…. No dice?

Longitudinal Phenotype Data
• Stata is well equipped for longitudinal data
  – xtreg
  – xtgee
  – gllamm
  – xtmixed

Challenges
1. Longitudinal Data – PLINK or HelixTree?
2. Specialized genetic data analysis
3. Need to run a very large number of graphs and models
4. Multiple comparisons and replication
5. Scaling up to 100,000 SNP Chips

Genetic Data Analysis
1. Genotype Frequencies
2. Allele Frequencies
3. Hardy-Weinberg Equilibrium
4. Haplotype Reconstruction
5. Linkage Disequilibrium
6. TagSNPs

Stata for Genetic Data Analysis
2007 UK Stata Users Group meeting: http://www.stata.com/meeting/13uk/
A brief introduction to genetic epidemiology using Stata
Neil Shephard, University of Sheffield
An overview of using Stata to perform candidate gene association analysis will be presented. Areas covered will include data manipulation, Hardy–Weinberg equilibrium, calculating and plotting linkage disequilibrium, estimating haplotypes, and interfacing with external programs.
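
Of the tasks on the Genetic Data Analysis list, the Hardy-Weinberg check is simple enough to sketch by hand. The lines below are only an illustration of the textbook Pearson chi-squared test, not the code behind any command mentioned in this talk. They use the SNP1 genotype counts (23 AA, 177 AG, 413 GG) from the snpsumm listing later in the deck, and every scalar name is made up for the example.

```stata
* Genotype counts for SNP1, taken from the snpsumm listing (AA, AG, GG)
scalar n_AA  = 23
scalar n_AG  = 177
scalar n_GG  = 413
scalar n_tot = n_AA + n_AG + n_GG

* Frequency of allele A: each AA child carries two copies, each AG child one
scalar p_A = (2*n_AA + n_AG) / (2*n_tot)
scalar q_G = 1 - p_A

* Expected genotype counts under Hardy-Weinberg equilibrium
scalar e_AA = n_tot * p_A^2
scalar e_AG = n_tot * 2 * p_A * q_G
scalar e_GG = n_tot * q_G^2

* Pearson chi-squared statistic, compared to a chi-squared with 1 df
scalar hw_stat = (n_AA - e_AA)^2/e_AA + (n_AG - e_AG)^2/e_AG + (n_GG - e_GG)^2/e_GG
scalar hw_pval = chi2tail(1, hw_stat)

display "HW chi-squared = " %5.2f hw_stat "   p-value = " %6.4f hw_pval
```

If the arithmetic is right, the statistic should land close to the hw_c2 value of 0.54 that snpsumm reports for SNP1. The likelihood-ratio and exact tests in that listing need more machinery and are not attempted here.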
User Written Genetics Commands
Programs written by David Clayton
• ginsheet - Read genotype data from text files.
• gloci - Make a list of loci.
• greshape - Reshape a file containing genotypes to a file of alleles.
• gtab - Tabulate allele frequencies within genotypes and generate indicators (performs Hardy-Weinberg Equilibrium testing).
• gtype - Create a single genotype variable from two allele variables.
• htype - Create a haplotype variable from allele variables.
• mltdt - Multiple locus TDT for haplotype tagging SNPs (htSNPs).
• origin - Analysis of parental origin effect in TDT trios.
• pseudocc - Create a pseudo-case-control study from case-parent trios.
• pscc - Experimental version of pseudocc in which there may be several groups of linked loci.
• pwld - Pairwise linkage disequilibrium measures.
• rclogit - Conditional logistic regression with robust standard errors.
• snp2hap - Infer haplotypes of 2-locus SNP markers.
• tdt - Classical TDT test.
• trios - Tabulate genotypes of parent-offspring trios.

User Written Genetics Commands
Programs written by Adrian Mander
• gipf - Graphical representation of log-linear models.
• hapipf - Haplotype frequency estimation using an EM algorithm and log-linear modelling.
• pedread - Reads a pedigree data file (in pre-Makeped LINKAGE format), similar to ginsheet.
• pedsumm - Summarises a pre-Makeped LINKAGE file that is currently in Stata's memory.
• pedraw - Draws one pedigree in the graphics window.
• plotmatrix - Produces LD heatmaps displaying graphically the strength of LD between markers.
• profhap - Calculates profile likelihood confidence intervals for results from hapipf.
• swblock - A step-wise hapipf routine to identify the parsimonious model to describe the haplotype block pattern.
• qhapipf - Analysis of quantitative traits using regression and log-linear modelling when phase is unknown.
• hapblock - Attempts to find the edge of areas containing high LD within a set of loci.

User Written Genetics Commands
Programs written by Mario Cleves
• gencc - Genetic case-control tests
• genhw - Hardy-Weinberg Equilibrium tests
• qtlsnp - A program for testing associations between SNPs and a quantitative trait.
Programs written by Catherine Saunders
• co_power - Power calculations for Case-only study designs.
• gei_matching
• geipower - Power calculations for Gene-Environment interactions.
• ggipower - Power calculations for Gene-Gene interactions.
• tdt_geipower - Power calculations for Gene-Environment interactions via TDT analysis.
• tdt_ggipower - Power calculations for Gene-Gene interactions via TDT analysis.
Programs written by Neil Shephard
• genass - Performs a number of statistical tests on your genotypic data and collates the results into a Stata formatted data set for browsing.

User Written Genetics Commands
Programs written by Roger Newson
• multproc – multiple comparison procedures and False Discovery Rates
Far too many others to list…….
Programs written by Chuck Huber
• Accepted by the Stata Journal
  – phaseout – export genotype data to PHASE
  – phasein – import haplotype data from PHASE
  – haploviewout – export haplotype data in HaploView format for estimating and visualizing LD and other tasks
• Forthcoming (when I have time to clean them up)
  – snpsumm – summarize allele/genotype frequencies and H-W equilibrium for large numbers of SNPs
  – manhattanplot – creates “Manhattan” plots from the results of a genome-wide association study (GWAS)

User Written Genetics Commands
Command: haplologit
Y. V. Marchenko, R. J. Carroll, D. Y. Lin, C. I. Amos, and R. G. Gutierrez (2008) Semiparametric analysis of case-control genetic data in the presence of environmental factors. The Stata Journal 8 (3): 305–333

User Written Genetics Commands
So Stata is well equipped for genetic data analysis!

Challenges
1. Longitudinal Data – PLINK or HelixTree?
2. Specialized genetic data analysis
3. Need to run a very large number of graphs and models
4. Multiple comparisons and replication
5. Scaling up to 100,000 SNP Chips

Looping Over Graphs and Models
Very simplistic structure of a model, for child i at measurement occasion t:

$$E(\text{Phenotype}_{it}) = \beta_0 + \beta_1\,\text{age}_{it} + \beta_2\,\text{SNP}_i + \beta_3\,(\text{age}_{it} \times \text{SNP}_i)$$

(12 Phenotypes) x (1753 SNPs) x (5 Candidate Models Each) = 105,180 Models!

Looping Over Graphs and Models
• Looping Over Lists
– Code:
    * LOOPING THROUGH A SINGLE LIST OF WORDS
    local SnpList "rs2239560 rs7524046 rs35610691"
    foreach snp of local SnpList {
        disp "Currently processing SNP `snp'"
    }
– Output:
    Currently processing SNP rs2239560
    Currently processing SNP rs7524046
    Currently processing SNP rs35610691

Looping Over Graphs and Models
– Code:
    * LOOPING THROUGH TWO LISTS OF WORDS
    local SnpList "rs2239560 rs7524046 rs35610691"
    local PhenotypeList "bmi sbp tc"
    foreach Phenotype of local PhenotypeList {
        foreach snp of local SnpList {
            disp "The outcome variable is `Phenotype' and the SNP is `snp'."
        }
    }
– Output:
    The outcome variable is bmi and the SNP is rs2239560.
    The outcome variable is bmi and the SNP is rs7524046.
    The outcome variable is bmi and the SNP is rs35610691.
    The outcome variable is sbp and the SNP is rs2239560.
    The outcome variable is sbp and the SNP is rs7524046.
    The outcome variable is sbp and the SNP is rs35610691.
    The outcome variable is tc and the SNP is rs2239560.
    The outcome variable is tc and the SNP is rs7524046.
    The outcome variable is tc and the SNP is rs35610691.
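
With 1753 SNPs, typing the SNP list by hand is impractical. One possible variation on the loops above is to let Stata expand a variable-name pattern instead of a typed list. This is only a sketch: it assumes the SNP variables are already in memory and that all of their names start with "rs", which is an assumption made for illustration rather than something stated in these slides. The counter ModelCount is likewise hypothetical.

```stata
* Assumption: every SNP variable name begins with "rs"; change the pattern to suit
local PhenotypeList "bmi sbp tc"
local ModelCount = 0

foreach Phenotype of local PhenotypeList {
    * "of varlist" expands rs* to the matching variables in the dataset
    foreach snp of varlist rs* {
        local ++ModelCount
        * a graph or model command for `Phenotype' and `snp' would go here
    }
}

display "Phenotype/SNP combinations visited: `ModelCount'"
```

The counter is a cheap sanity check: before launching thousands of models, it confirms that the loop visits the number of combinations you expect.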
Looping Over Graphs and Models
• Lowess Curves for each Phenotype/SNP Combination:
    * LOOPING THROUGH TWO LISTS OF WORDS
    local SnpList "rs2239560 rs7524046 rs35610691"
    local PhenotypeList "bmi sbp tc"
    foreach Phenotype of local PhenotypeList {
        foreach snp of local SnpList {
            * One lowess curve per genotype; /* */ joins the continuation lines
            twoway (lowess mean_`Phenotype' mean_age if `snp'=="AA", sort lcolor(red))   /*
                */ (lowess mean_`Phenotype' mean_age if `snp'=="AG", sort lcolor(green)) /*
                */ (lowess mean_`Phenotype' mean_age if `snp'=="GG", sort lcolor(blue))
            graph export Graph_`Phenotype'_`snp'.ps, as(ps) logo(off) replace
        }
    }
Note: Postscript files can be easily combined in Adobe Acrobat Professional

Looping Over Graphs and Models
• If we run many models, we need to be able to save the results to an output file.
• Commands for writing to data files
  – postfile: creates an output data file and describes its structure
  – post: writes data to the output data file
  – postclose: closes the output data file

Looping Over Graphs and Models
• Longitudinal model for each Phenotype/SNP Combination:
    * Assumes each SNP variable is numerically coded (for example with encode),
    * because factor-variable notation (i.`snp') does not accept string variables
    postfile Output str16 phenotype str16 snp chi2 using OutputFile.dta, replace
    local SnpList "rs2239560 rs7524046 rs35610691"
    local PhenotypeList "bmi sbp tc"
    foreach Phenotype of local PhenotypeList {
        foreach snp of local SnpList {
            xtmixed `Phenotype' age i.`snp' c.age#i.`snp' || Id: age, cov(unstruct)
            * Save the phenotype name, SNP name and model chi-squared as one row
            post Output ("`Phenotype'") ("`snp'") (e(chi2))
        }
    }
    postclose Output

Challenges
1. Longitudinal Data – PLINK or HelixTree?
2. Specialized genetic data analysis
3. Need to run a very large number of graphs and models
4. Multiple comparisons and replication
5. Scaling up to 100,000 SNP Chips

Multiple Comparisons
• In our study, we will be computing hundreds of thousands of p-values. How do we control for multiple comparisons?
  – False Discovery Rates
  – Replication in a second dataset

Multiple Comparisons
• False Discovery Rate methods are a family of multiple-comparison adjustments commonly used in large-scale genetics studies, where the number of p-values regularly exceeds 500,000.
• They yield a threshold p-value for declaring overall statistical significance, much as a Bonferroni correction does.

False Discovery Rates
Copied from Benjamini & Hochberg (1995) page 291

$$\text{False Discovery Rate} = E\left(\frac{V}{R} \,\middle|\, R > 0\right) P(R > 0)$$

where V is the number of falsely rejected null hypotheses and R is the total number of rejected null hypotheses.
Reference: Benjamini, Y. & Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289-300

Replication Data
• Bogalusa Heart Study
  – Similar longitudinal study
  – Included children in the 8-17 age range
  – 478 African-American participants
  – 1081 non-Hispanic White participants
  – Same phenotypes
  – Same genotypes (more or less)

Multiple Comparisons Strategy
1. Identify the SNPs in the Project Heartbeat! sample that meet the overall threshold for statistical significance using False Discovery Rates.
2. Run the significant SNPs with the Bogalusa data to check for replication of the results.

Challenges
1. Longitudinal Data – PLINK or HelixTree?
2. Specialized genetic data analysis
3. Need to run a very large number of graphs and models
4. Multiple comparisons and replication
5. Scaling up to 100,000 SNP Chips

Scaling up to 100,000 SNPs
HELP!

Scaling up to 100,000 SNPs
• Possible Strategies:
  – Read data from text files in “chunks” using the “infix” command.
  – Bribe Bill Gould with vast quantities of beer.
  – Other suggestions?

Actual Analysis
Disclaimer:
• Since this study is a work in progress, I have changed the gene and SNP names to protect the innocent.

Actual Analysis
• Example Data

. list id age bmi SNP1 SNP2 SNP3 if id==1, sepby(id)

     +--------------------------------------------+
     | id        age     bmi   SNP1   SNP2   SNP3 |
     |--------------------------------------------|
  1. |  1   14.33812   26.07     AG     GG     GG |
  2. |  1    14.6694   27.06     AG     GG     GG |
  3. |  1   15.00616   28.33     AG     GG     GG |
  4. |  1   15.40041   28.78     AG     GG     GG |
  5. |  1   15.66324   29.76     AG     GG     GG |
  6. |  1   15.97536   29.29     AG     GG     GG |
  7. |  1   16.33128   28.28     AG     GG     GG |
  8. |  1   16.65435   29.85     AG     GG     GG |
  9. |  1   17.01848   28.52     AG     GG     GG |
 10. |  1   17.30595   27.96     AG     GG     GG |
 11. |  1   17.63997   28.28     AG     GG     GG |
     +--------------------------------------------+

Actual Analysis
• Variable “Characteristics”
    * EXAMPLE OF HOW TO ADD CHARACTERISTICS TO A VARIABLE AND EXTRACT THEM TO A LOCAL MACRO
    char SNP1[chromosome] 7
    char SNP1[gene] Gene1
    char SNP1[position] 142702852
    local TempChromosome : char SNP1[chromosome]
    local TempGene : char SNP1[gene]
    local TempPosition : char SNP1[position]

. disp "SNP1 is on Chromosome `TempChromosome', in `TempGene' at position `TempPosition'"
SNP1 is on Chromosome 7, in Gene1 at position 142702852

Actual Analysis
Lowess curve of BMI over age

Actual Analysis
Data checking with the “snpsumm” command:

. snpsumm SNP*, listgeno

Genotype Information
=================================================================
gen1, gen2 and gen3 are the genotypes
gencou~1, gencou~2, gencou~3 are the counts of each genotype
genfreq1, genfreq2 and genfreq3 are the genotype frequencies
=================================================================

     +----------------------------------------------------------------------------------------------------------+
     | Marker   gen1   gencou~1   genfreq1   gen2   gencou~2   genfreq2   gen3   gencou~3   genfreq3   gentotal |
     |----------------------------------------------------------------------------------------------------------|
  1. |   SNP1     AA         23     0.0375     AG        177     0.2887     GG        413     0.6737        613 |
  2. |   SNP2     AA         37     0.0605     AG        200     0.3268     GG        375     0.6127        612 |
  3. |   SNP3     AG          1          .     GG        612          .                   .          .          . |
  4. |   SNP4     AA         35     0.0571     AG        201     0.3279     GG        377     0.6150        613 |
  5. |   SNP5     AA        203     0.3524     AG        259     0.4497     GG        114     0.1979        576 |
     |----------------------------------------------------------------------------------------------------------|
  6. |   SNP6     AA         55     0.0899     AG        251     0.4101     GG        306     0.5000        612 |
  7. |   SNP7     AG          1          .     GG        612          .                   .          .          . |
  8. |   SNP8     AA          8     0.0131     AG        124     0.2023     GG        481     0.7847        613 |
  9. |   SNP9     AA         41     0.0669     AG        204     0.3328     GG        368     0.6003        613 |
 10. |  SNP10     AA         51     0.0833     AC        247     0.4036     CC        314     0.5131        612 |
     |----------------------------------------------------------------------------------------------------------|
 11. |  SNP11     AA         30     0.0489     AG        208     0.3393     GG        375     0.6117        613 |
     +----------------------------------------------------------------------------------------------------------+

Actual Analysis
Data checking with the “snpsumm” command:

. snpsumm SNP*, listallele

Allele Information
=================================================================
a1 and a2 are the alleles
acount1 and acount2 are the counts of each allele
afreq1 and afreq2 are the frequencies of each allele
maf is the Minor Allele Frequency
=================================================================

     +--------------------------------------------------------------------------+
     | Marker   a1   acount1   afreq1   a2   acount2   afreq2   atotal      maf |
     |--------------------------------------------------------------------------|
  1. |   SNP1    A       223   0.1819    G      1003   0.8181     1226   0.1819 |
  2. |   SNP2    A       274   0.2239    G       950   0.7761     1224   0.2239 |
  3. |   SNP3    G       614        .    G         .        .        .        . |
  4. |   SNP4    A       271   0.2210    G       955   0.7790     1226   0.2210 |
  5. |   SNP5    A       665   0.5773    G       487   0.4227     1152   0.4227 |
     |--------------------------------------------------------------------------|
  6. |   SNP6    A       361   0.2949    G       863   0.7051     1224   0.2949 |
  7. |   SNP7    G       614        .    G         .        .        .        . |
  8. |   SNP8    A       140   0.1142    G      1086   0.8858     1226   0.1142 |
  9. |   SNP9    A       286   0.2333    G       940   0.7667     1226   0.2333 |
 10. |  SNP10    A       349   0.2851    C       875   0.7149     1224   0.2851 |
     |--------------------------------------------------------------------------|
 11. |  SNP11    A       268   0.2186    G       958   0.7814     1226   0.2186 |
     +--------------------------------------------------------------------------+

Actual Analysis
Data checking with the “snpsumm” command:

. snpsumm SNP*, listhw

Hardy-Weinberg Equilibrium Information
=================================================================
maf is the Minor Allele Frequency
hw_c2 is the Pearson Chi-squared
hw_c2p is the Pearson Chi-Squared p-value
hw_lr is the Likelihood Ratio Chi-squared
hw_lrp is the Likelihood Ratio Chi-Squared p-value
hw_ex is the Exact p-value
=================================================================

     +--------------------------------------------------------------+
     | Marker      maf    hw_c2   hw_c2p    hw_lr   hw_lrp    hw_ex |
     |--------------------------------------------------------------|
  1. |   SNP1   0.1819     0.54   0.4605     0.53   0.4662   0.4965 |
  2. |   SNP2   0.2239     2.17   0.1407     2.10   0.1470   0.1618 |
  3. |   SNP3        .        .        .        .        .        . |
  4. |   SNP4   0.2210     1.40   0.2363     1.37   0.2425   0.2410 |
  5. |   SNP5   0.4227     3.57   0.0589     3.56   0.0591   0.0605 |
     |--------------------------------------------------------------|
  6. |   SNP6   0.2949     0.12   0.7316     0.12   0.7321   0.7705 |
  7. |   SNP7        .        .        .        .        .        . |
  8. |   SNP8   0.1142     0.00   0.9979     0.00   0.9979   1.0000 |
  9. |   SNP9   0.2333     2.98   0.0844     2.88   0.0896   0.0901 |
 10. |  SNP10   0.2851     0.06   0.8050     0.06   0.8053   0.8427 |
     |--------------------------------------------------------------|
 11. |  SNP11   0.2186     0.03   0.8670     0.03   0.8673   0.9058 |
     +--------------------------------------------------------------+

Actual Analysis
• Reconstructing haplotypes using PHASE without leaving Stata!
    local PositionList "142702852 142736196 142747932 etc......."
    phaseout SNP*, idvar(id) filename("Gene1.inp") position(`PositionList')
    shell PHASE -S1234 Gene1.inp Gene1.out 100 1 100
    clear
    phasein Gene1.out, markers("MarkerList.txt") positions("PositionList.txt")

What is a Haplotype?
• A haplotype is the combination of one or more alleles found on the same chromosome
  – Person 1 has a “gc” haplotype and a “ca” haplotype
  – Person 2 has a “cc” haplotype and a “ga” haplotype

Person 1 – Chromosome 1   ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2   ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1   ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2   ataagtcgatactgatgcatagctagctgactgaagcgat
                                 SNP1                       SNP2

Actual Analysis
The resulting haplotypes are back in Stata:

. list id haplotype SNP1 SNP2 SNP3 in 1/10, sepby(id)

     +---------------------------------------+
     | id     haplotype   SNP1   SNP2   SNP3 |
     |---------------------------------------|
  1. |  1   AGGGAAGGGCG      A      G      G |
  2. |  1   GGGGAGGGGAG      G      G      G |
     |---------------------------------------|
  3. |  2   GGGGAGGGGCG      G      G      G |
  4. |  2   GGGGAGGGGCA      G      G      G |
     |---------------------------------------|
  5. |  3   GGGGAGGGGAG      G      G      G |
  6. |  3   GGGAGAGGGCA      G      G      G |
     |---------------------------------------|
  7. |  4   AGGGAGGGGAG      A      G      G |
  8. |  4   GGGAGGGAGCG      G      G      G |
     |---------------------------------------|
  9. |  5   GGGGAGGGGAG      G      G      G |
 10. |  5   GAGGGGGGACG      G      A      G |
     +---------------------------------------+

Actual Analysis
    haploviewout SNP*, idvariable(id) filename("Gene1") poslabel

Actual Analysis

. multproc, pval(pvalue) meth(simes) rank(FDR_rank) critical(FDR_critical) reject(FDR_reject)
. list Chromosome Position pvalue FDR_rank FDR_critical FDR_reject in 1/22, sepby(FDR_reject)

     +------------------------------------------------------------------+
     | Chromo~e   Position     pvalue   FDR_rank   FDR_cri~l   FDR_re~t |
     |------------------------------------------------------------------|
  1. |        3   1.60e+08   .0000372          1   .00003266        Yes |
  2. |        8   4.59e+07   .0000465          2   .00006532        Yes |
  3. |       12   9.73e+07   .0000529          3   .00009798        Yes |
  4. |        7   7.08e+07   .0000661          4   .00013063        Yes |
  5. |        3   2.02e+08   .0000701          5   .00016329        Yes |
  6. |       11   3.02e+07   .0001106          6   .00019595        Yes |
  7. |        2   2.15e+07   .0001391          7   .00022861        Yes |
  8. |        5   9.80e+07   .0001418          8   .00026127        Yes |
  9. |        4    9229619   .0002013          9   .00029393        Yes |
 10. |        2   2.02e+08   .0002179         10   .00032658        Yes |
 11. |        5   1.39e+08   .0002698         11   .00035924        Yes |
 12. |       18   5.69e+07   .0003339         12    .0003919        Yes |
 13. |       12   8.07e+07   .0003429         13   .00042456        Yes |
 14. |       16   5.66e+07   .0004299         14   .00045722        Yes |
 15. |        3    9249973   .0004815         15   .00048988        Yes |
 16. |        9   1.43e+08   .0005735         16   .00052253        Yes |
 17. |       19   4.66e+07   .0005778         17   .00055519        Yes |
 18. |        8   2.29e+08   .0006019         18   .00058785        Yes |
 19. |       13   4.65e+07   .0006124         19   .00062051        Yes |
     |------------------------------------------------------------------|
 20. |        1   4.39e+07   .0007301         20   .00065317         No |
 21. |        5   1.52e+08    .000731         21   .00068583         No |
 22. |        8   4.88e+07   .0007519         22   .00071848         No |
     +------------------------------------------------------------------+
This continues for all 1753 SNPs

Actual Analysis
    manhattanplot pvalue Chromosome Position, critical(`FDR_cutoff')
This is VERY heavily based on code by Stephen Turner and Will Bush of Vanderbilt University
http://gettinggeneticsdone.blogspot.com/2010/01/genome-wide-manhattan-plots-in-stata.html

Summary
Stata is a very useful platform for doing longitudinal genome-wide association studies!
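
A note on reading the multproc listing in the Actual Analysis slides: several SNPs are marked "Yes" even though their p-value is larger than their own critical value (rank 1, for example). That is what a step-up rule produces. As I understand the Simes/Benjamini-Hochberg procedure, you find the largest rank whose p-value is at or below its critical value and then reject every hypothesis up to that rank. The sketch below reproduces that logic by hand; it is not multproc's source code. It assumes one row per test, a variable named pvalue with no missing values, and an FDR level of 0.05, and all sk_ variable names are invented for the example.

```stata
* Assumptions: one row per test, no missing p-values, FDR level q = 0.05
local q = 0.05

* Rank the p-values from smallest to largest
sort pvalue
generate long   sk_rank     = _n
* Critical value for rank i out of m tests is (i/m)*q
generate double sk_critical = sk_rank / _N * `q'

* Find the largest rank whose p-value is at or below its critical value
generate byte sk_below = pvalue <= sk_critical
summarize sk_rank if sk_below, meanonly
local kmax = cond(r(N) > 0, r(max), 0)

* Step-up rule: reject every hypothesis ranked at or before kmax
generate byte sk_reject = sk_rank <= `kmax'

display "Number of rejected hypotheses: `kmax'"
```

In the listing above, rank 19 is the last row whose p-value falls below its critical value, which is consistent with ranks 1 through 19 all being rejected under this rule.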
Acknowledgements
• Grant 1-R01DK073618-02 from the National Institute of Diabetes and Digestive and Kidney Diseases
• Michael Hallman, PhD – Assistant Professor of Epidemiology, UTSPH-Houston
• Ron Harrist, PhD – Associate Professor of Biostatistics, UTSPH-Austin
• Eric Boerwinkle, PhD – Professor and Director of the Division of Epidemiology – Kozmetsky Family Chair in Human Genetics, UTSPH-Houston
• Darwin Labarthe, MD, PhD, MPH – Director of the Division for Heart Disease and Stroke Prevention, CDC-Atlanta

References
• Barrett, J., Fry, B., Maller, J., & Daly, M. (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263-265.
• Hartl, D.L., Jones, E.W. (1998) Genetics: Principles and Analysis, 4th Ed. Jones & Bartlett Publishers
• Stephens, M., & Donnelly, P. (2003). A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data. American Journal of Human Genetics, 73, 1162–1169.
• Stephens, M., Smith, N. J., & Donnelly, P. (2001). A New Statistical Method for Haplotype Reconstruction from Population Data. American Journal of Human Genetics, 68, 978–989.
• Watson, J.D., Baker, T.A., Bell, S.P., Gann, A., Levine, M., Losick, R. (2004) Molecular Biology of the Gene, 5th Ed. Benjamin Cummings