Genetics for Epidemiologists
National Human Genome Research Institute
National Institutes of Health
U.S. Department of Health and Human Services

Lecture 5: Analysis of Genetic Association Studies
Teri A. Manolio, M.D., Ph.D.
Director, Office of Population Genomics and Senior Advisor to the Director, NHGRI, for Population Genomics

Topics to be Covered
• Discrete traits and quantitative traits
• Measures of association
• Detecting/correcting for false positives
• Genotyping quality control
• Quantile-quantile (Q-Q) plots
• Odds ratios: allelic and genotypic
• Models of genetic transmission
• Interactions: gene-gene, gene-environment

Larson, G. The Complete Far Side. 2003.

Quantitative Genetics
“…concerned with the inheritance of those differences between individuals that are of degree rather than of kind…”

Quantitative:
• Continuous gradation among individuals from one extreme to other
• Effects of genes are small
• Usually many genes

Qualitative:
• Sharply demarcated types with little connection by intermediates
• Effects of genes are large
• Single genes inherited in Mendelian ratios?

Falconer and Mackay, Quantitative Genetics 1996.

Inheritance Models in Single Gene Trait
Alleles: A, a
Genotype groups: AA, Aa, aa
Models:
• A is Dominant
• A is Recessive
• A is Co-Dominant

Inheritance Models in Quantitative Trait
• A: x increase in height
• a: x decrease in height

Inheritance Models in Quantitative Trait
Figure: positions of genotypes aa, Aa, and AA on a trait axis running from -x through the population mean (0) to +x, under four models:
• A is Completely Dominant
• A is Partially Dominant
• A is Not (Co-) Dominant
• A is OverDominant

Quantitative Traits with Published GWA Studies (16 - 34)
• QT interval
• Lipids and lipoproteins
• Memory
• Nicotine dependence
• ORMDL3 expression
• YKL-40 levels
• Obesity, BMI, waist
• Insulin resistance
• Height
• Bone mineral density
• F-cell distribution
• Fetal hemoglobin levels
• C-Reactive protein
• 18 groups of Framingham traits
• Pigmentation
• Uric Acid Levels
• Recombination Rate

Association of Alleles and Genotypes of rs1333049 (‘3049) with Myocardial Infarction

Alleles      C N (%)         G N (%)         χ² (1df)   P-value
Cases        2,132 (55.4)    1,716 (44.6)    55.1       1.2 x 10^-13
Controls     2,783 (47.4)    3,089 (52.6)
Allelic Odds Ratio = 1.38

Genotypes    CC N (%)      CG N (%)        GG N (%)      χ² (2df)   P-value
Cases        586 (30.5)    960 (49.9)      378 (19.6)    59.7       1.1 x 10^-14
Controls     676 (23.0)    1,431 (48.7)    829 (28.2)
Heterozygote Odds Ratio = 1.47
Homozygote Odds Ratio = 1.90

Samani N et al, N Engl J Med 2007; 357:443-453.

-Log10 P Values for SNP Associations with Myocardial Infarction
Samani N et al, N Engl J Med 2007; 357:443-453.
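
The three odds ratios on the rs1333049 slide can be reproduced from the allele and genotype counts with a cross-product ratio. The following is a minimal sketch, not the study's analysis code: the helper name `odds_ratio` is hypothetical, and it assumes G (for alleles) and GG (for genotypes) are the reference categories, which is the choice that matches the slide's values.

```python
def odds_ratio(case_exposed, case_ref, control_exposed, control_ref):
    """Cross-product odds ratio for a 2x2 table (exposed vs. reference)."""
    return (case_exposed * control_ref) / (case_ref * control_exposed)

# Counts from the rs1333049 slide (Samani N et al, N Engl J Med 2007).
case_genotypes = {"CC": 586, "CG": 960, "GG": 378}
control_genotypes = {"CC": 676, "CG": 1431, "GG": 829}

def allele_counts(genotypes):
    """Each person carries two alleles: CC gives 2 C, CG gives 1 C and 1 G."""
    c = 2 * genotypes["CC"] + genotypes["CG"]
    g = 2 * genotypes["GG"] + genotypes["CG"]
    return {"C": c, "G": g}

case_alleles = allele_counts(case_genotypes)        # C: 2132, G: 1716
control_alleles = allele_counts(control_genotypes)  # C: 2783, G: 3089

# Allelic OR: C allele vs. G allele (assumed reference).
allelic_or = odds_ratio(case_alleles["C"], case_alleles["G"],
                        control_alleles["C"], control_alleles["G"])
# Genotypic ORs: CG vs. GG and CC vs. GG (GG assumed reference).
het_or = odds_ratio(case_genotypes["CG"], case_genotypes["GG"],
                    control_genotypes["CG"], control_genotypes["GG"])
hom_or = odds_ratio(case_genotypes["CC"], case_genotypes["GG"],
                    control_genotypes["CC"], control_genotypes["GG"])

print(f"Allelic OR      = {allelic_or:.2f}")
print(f"Heterozygote OR = {het_or:.2f}")
print(f"Homozygote OR   = {hom_or:.2f}")
```

Deriving the allele counts from the genotype counts also serves as a consistency check on the two tables: the derived totals equal the allele counts shown on the slide.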

Genome-Wide Scan for Type 2 Diabetes in a Scandinavian Cohort
http://www.broad.mit.edu/diabetes/scandinavs/type2.html

GWA Study of Serum Uric Acid Levels
• Linear regression of inverse normalized levels against number of alleles
• Additive model
• Sex, age, age² as covariates
Li S et al, PLoS Genet 2007; 3:e194.

Association of rs6855911 and Uric Acid Levels

                                Genotype Means (mg/dl)
Cohort       Additive Effect    AA             AG             GG
SardiNIA     -0.317             4.66 (1.51)    4.48 (1.59)    4.02 (1.63)
InCHIANTI    -0.397             5.27 (1.44)    4.94 (1.31)    4.33 (1.37)

Li S et al, PLoS Genet 2007; 3:e194.

Association Methods for Quantitative Traits
• Linear regression of multivariable adjusted residual against number of alleles (Kathiresan, Nat Genet 2008; 40:189-97)
• Linear regression of log transformed or centralized BMI against genotype (Frayling, Science 2007; 316:889-94)
• Variance components based Z-score analysis of quantile normalized height (Sanna, Nat Genet 2008; 40:198-203)

Ways of Dealing with Multiple Testing
• Control family-wise error rate (FWER): Bonferroni (α’ = α/n) or Šidák (α’ = 1 - [1 - α]^(1/n))
• False discovery rate: proportion of significant associations that are actually false positives
• False positive report probability: probability that the null hypothesis is true, given a statistically significant finding
• Bayes factors analysis: avoids need for assessing genome-wide error rates but must identify reasonable alternative model
Hoggart CJ et al, Genet Epidemiol 2008; 32:179-85.

Larson, G. The Complete Far Side. 2003.
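
The two family-wise error rate corrections on the multiple testing slide are simple enough to compute directly. The sketch below is illustrative only: the function names and the test counts are hypothetical, and the Šidák formula is evaluated with `log1p`/`expm1` purely to keep it numerically stable when the number of tests is large.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold alpha' = alpha / n."""
    return alpha / n_tests

def sidak_threshold(alpha, n_tests):
    """Per-test threshold alpha' = 1 - (1 - alpha)**(1/n).

    Written as -expm1(log1p(-alpha) / n), which is algebraically the same
    expression but avoids rounding loss for very large n.
    """
    return -math.expm1(math.log1p(-alpha) / n_tests)

alpha = 0.05
for n_tests in (1, 100, 500_000):  # hypothetical numbers of SNPs tested
    b = bonferroni_threshold(alpha, n_tests)
    s = sidak_threshold(alpha, n_tests)
    print(f"n = {n_tests:>7}: Bonferroni = {b:.3e}, Sidak = {s:.3e}")
```

For any number of tests the Šidák threshold is at least as large as the Bonferroni threshold, so it is marginally less conservative; at genome-wide scale the two are nearly identical.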

Quality Control of SNP Genotyping: Samples
• Identity with forensic markers (Identifiler)
• Blind duplicates
• Gender checks
• Cryptic relatedness or unsuspected twinning
• Degradation/fragmentation
• Call rate (> 80-90%)
• Heterozygosity: outliers
• Plate/batch calling effects
Chanock et al, Nature 2007; Manolio et al Nat Genet 2007

Quality Control of SNP Genotyping: SNPs
• Duplicate concordance (CEPH samples)
• Mendelian errors (typically < 1)
• Hardy-Weinberg errors (often > 10^-5)
• Heterozygosity (outliers)
• Call rate (typically > 98%)
• Minor allele frequency (often > 1%)
• Validation of most critical results on independent genotyping platform
Chanock et al, Nature 2007; Manolio et al Nat Genet 2007

Hardy-Weinberg Equilibrium
• The occurrences of the two alleles of a SNP in the same individual are two independent events
• Ideal conditions:
  – random mating
  – no selection (equal survival)
  – no migration
  – no mutation
  – no inbreeding
  – large population sizes
  – gene frequencies equal in males and females…
• If alleles A and a of SNP rs1234 have frequencies p and 1-p, expected frequencies of the three genotypes are:
  Freq AA = p²
  Freq Aa = 2p(1-p)
  Freq aa = (1-p)²
After G. Thomas, NCI

Coverage, Call Rates, and Concordance of Perlegen and Affymetrix Platforms on HapMap Phase II

Metric               Perlegen                         Affymetrix/Broad
Number of SNPs       480,744                          439,249
Coverage             Single Marker   Multi-Marker     Single Marker   Multi-Marker
  CEU                0.90            0.96             0.78            0.87
  CHB + JPT          0.87            0.93             0.78            0.86
  YRI                0.64            0.78             0.63            0.75
Average call rate    98.9%                            99.3%
Concordance (rows: Concordance, Homozygous genotypes, Heterozygous genotypes; values in extracted order): 99.8%  99.9%  99.8%  99.8%

GAIN Collaborative Group, Nat Genet 2007; 39:1045-51.
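
The Hardy-Weinberg expectations (p², 2p(1-p), (1-p)²) are what a per-SNP HWE check compares observed genotype counts against. The sketch below assumes a simple 1-df Pearson chi-squared goodness-of-fit test, which is one common choice (exact tests are also used); the function name and the example counts are hypothetical, and both alleles are assumed to be present in the sample.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Pearson chi-squared test of observed genotype counts against HWE.

    Returns (p, expected_counts, chi2, p_value), where p is the estimated
    frequency of allele A. Assumes 0 < p < 1 (both alleles observed).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # allele frequency of A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value

# Hypothetical counts: the first set is exactly in HWE, the second has a
# marked deficit of heterozygotes.
for counts in ((25, 50, 25), (40, 20, 40)):
    p, expected, chi2, p_value = hwe_chi_square(*counts)
    print(f"counts={counts}  p={p:.2f}  chi2={chi2:.2f}  P={p_value:.2e}")
```

In a QC pipeline a SNP would be flagged when this P value falls below the chosen threshold, such as the 10^-5 or 10^-6 cut-offs quoted on the slides.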

Sample and SNP QC Metrics for Affymetrix 5.0 and 6.0 Platforms in GAIN

Metric                 5.0        % fail    6.0        % fail
Total Samples          1,829      -         2,289      -
Passing QC             1,817      0.44      2,192      4.24
> 98% call rate        1,815      0.55      2,257      1.40
Total SNPs             457,645    -         906,660    -
Passing QC             429,309    6.19      845,814    6.70
MAF > 1%               457,466    0.04      888,234    2.03
> 98% call rate        419,810    8.27      821,942    9.34
> 95% call rate        439,272    4.01      873,856    3.61
HWE < 10^-6            455,899    0.38      904,275    0.26
< 1 Mendel error       417,722    8.72      899,721    0.01
< 1 Duplicate error    454,820    0.01      892,103    0.02

Courtesy, J Paschall, NCBI

Sample Heterozygosity in GAIN
Figure: histogram of frequency over the range 0.20 to 0.40, shown at full scale (0 to 2,500) and zoomed (0 to 100).
Courtesy, J Paschall, NCBI

Signal Intensity Plots for rs10801532 in AREDS
http://www.ncbi.nlm.nih.gov/sites/entrez

Signal Intensity Plots for rs4639796 in AREDS
http://www.ncbi.nlm.nih.gov/sites/entrez

Signal Intensity Plots for rs534399 in AREDS
http://www.ncbi.nlm.nih.gov/sites/entrez

Signal Intensity Plots for rs572515 in AREDS
http://www.ncbi.nlm.nih.gov/sites/entrez

Signal Intensity Plots for CD44 SNP rs9666607
Clayton DG et al, Nat Genet 2005; 37:1243-1246.

Principal Component Analysis of Structured Population: First to Third Components
Courtesy, G. Thomas, NCI

Principal Component Analysis of Structured Population: Fourth and Fifth Components
Courtesy, G. Thomas, NCI

Influence of Relatedness on Principal Component Analysis
Courtesy, G. Thomas, NCI

Summary Points: Genotyping Quality Control
• Sample checks for identity, gender error, cryptic relatedness
• Sample handling differences can introduce artifacts but probably can be adjusted for
• Association analysis is often quickest way to find genotyping errors
• Low MAF SNPs are most difficult to call
• Inspection of genotyping cluster plots is crucial!

Quantile-Quantile Plot for Test Statistics, 390 Breast Cancer Cases, 364 Controls
205,586 SNPs, λ = 1.03
Easton D et al, Nature 2007; 447:1087-1093.

Observed and Expected Associations after Stage 2 of Breast Cancer GWA

Significance        Observed    Observed Adjusted    Expected    Ratio
0.01 to 0.05        1,239       1,162                934         1.24
10^-3 to 10^-2      574         517                  348         1.49
10^-4 to 10^-3      112         88                   53          1.65
10^-5 to 10^-4      16          12                   7           1.71
< 10^-5             15          13                   1           13.5
All p < 0.05        1,956       1,792                1,343       1.33

Easton D et al, Nature 2007; 447:1087-93.

Q-Q Plot for Multiple Sclerosis; Effect of MHC
Hafler D et al, N Engl J Med 2007; 357:851-862.

Q-Q Plot for Prostate Cancer, all SNPs
Gudmundsson J et al, Nat Genet 2007; 39:977-983.

Q-Q Plot for Prostate Cancer, excluding Chromosome 8
Gudmundsson J et al, Nat Genet 2007; 39:977-983.

Q-Q Plot for Myocardial Infarction
Figure: observed chi-squared statistic plotted against expected chi-squared statistic.
Samani N et al, N Engl J Med 2007; 357:443-453.

-Log10 P Values for SNP Associations with Myocardial Infarction
Samani N et al, N Engl J Med 2007; 357:443-453.

SNP Associations with 1,928 MI Cases and 2,938 Controls from UK
Samani N et al, N Engl J Med 2007; 357:443-453.

Association Signal for Coronary Artery Disease on Chromosome 9 (’3049)
Samani N et al, N Engl J Med 2007; 357:443-453.

Winner’s Curse: Odds Ratios for CHD Associated with LTA Genotypes in Multiple Studies
Clarke et al, PLoS Genet 2006; 2:e107.
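
The breast cancer Q-Q slide reports λ = 1.03 without defining it. Assuming λ there is the usual genomic-control inflation factor, it can be estimated as the median of the observed 1-df chi-squared statistics divided by the median expected under the null (about 0.455). The sketch below uses simulated null statistics rather than any study data; the function name, the seed, and the sample size are all hypothetical.

```python
import random
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """Inflation factor: median observed chi-squared / null median (1 df)."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

# Simulated null statistics: the square of a standard normal is chi-squared
# with 1 df. These are NOT data from any of the cited studies.
rng = random.Random(2007)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(20_000)]
lam_null = genomic_inflation(null_stats)

# Uniformly inflating every statistic by 3% inflates lambda by the same factor.
inflated_stats = [1.03 * x for x in null_stats]
lam_inflated = genomic_inflation(inflated_stats)

print(f"lambda (null simulation)     = {lam_null:.3f}")
print(f"lambda (3% uniform inflation) = {lam_inflated:.3f}")
```

A value near 1 suggests little systematic inflation from population structure or genotyping artifacts, which is how the Q-Q plots in these slides are typically read.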

Genome-Wide Scan for Alzheimer’s Disease in 861 Cases and 550 Controls
Reiman E et al, Neuron 2007; 54:713-20.

Genome-Wide Scan for Alzheimer’s Disease in APOE*e4 Carriers
Reiman E et al, Neuron 2007; 54:713-20.

LOAD Odds Ratios Associated with rs2373115 GG by APOE*e4 Status

APOE*e4 Group    APOE*e4 OR [95% CI]    rs2373115 OR [95% CI]
APOE*e4 -                               1.12 [0.82-1.53]
APOE*e4 +                               2.88 [1.90-4.36]
All              6.07 [4.63-7.95]       1.34 [1.06-1.70]

Reiman et al, Neuron 2007; 54:713-720.

P Values of GWA Scan for Age-Related Macular Degeneration
Klein et al, Science 2005; 308:385-389.

Odds Ratios and Population Attributable Risks for AMD

Attribute (SNP)                     rs380390 (C/G)    rs1329428 (C/T)
Risk allele                         C                 C
Allelic association χ² P value      4.1 x 10^-8       1.4 x 10^-6
Odds ratio (dominant)               4.6 [2.0-11]      4.7 [1.0-22]
  Frequency in HapMap CEU           0.70              0.82
  Population Attributable Risk      70% [42-84%]      80% [0-96%]
Odds ratio (recessive)              7.4 [2.9-19]      6.2 [2.9-13]
  Frequency in HapMap CEU           0.23              0.41
  Population Attributable Risk      46% [31-57%]      61% [43-73%]

Klein et al, Science 2005; 308:385-389.

Risk of Developing AMD by CFH Y402H and Modifiable Risk Factors

                    CFH Y402H Genotype
Risk Factor         YY                  YH                  HH
BMI < 30 kg/m²      1.00                1.95 [1.42-2.67]    3.96 [2.69-5.82]
BMI > 30 kg/m²      1.98 [0.91-4.31]    2.19 [1.11-4.30]    12.28 [4.88-30.90]
Non-smoker          1.00                1.95 [1.41-2.71]    4.23 [2.86-6.27]
Current smoker      2.34 [1.20-4.55]    3.20 [1.85-5.55]    8.69 [3.86-19.57]

Schaumberg DA et al, Arch Ophthalmol 2007; 125:55-62.

Interaction: Is LIPC Genotype Related to HDL-C?
Figure with genotype labels CC, CT, TT.
Ordovas et al, Circulation 2002; 106:2315-2321.

Inverse Relation between Endotoxin Exposure and Allergic Sensitization by CD14 Genotype
Simpson A et al, Am J Respir Crit Care Med 2006; 174:386-392.

Challenges in Studying Gene-Environment Interactions

Challenge                       Genes          Environment
Ease of measure                 Pretty easy    Often hard
Variability over time           Low/none       High
Recall bias                     None           Possible
Temporal relation to disease    Easy           Hard

Larson, G. The Complete Far Side. 2003.
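
One common way to read a joint-effects table such as the CFH Y402H by BMI slide is to compare the observed odds ratio in the doubly exposed group with what the two single-exposure odds ratios would predict if they combined additively or multiplicatively. The sketch below applies that comparison to the slide's point estimates only. It is an illustration under stated assumptions, not the analysis from Schaumberg et al: the function name is hypothetical, the confidence intervals are ignored, and no formal interaction test is performed.

```python
def expected_joint_or(or_gene, or_env):
    """Expected joint odds ratio with no interaction on each scale."""
    return {
        "additive": or_gene + or_env - 1.0,   # excess risks add
        "multiplicative": or_gene * or_env,   # odds ratios multiply
    }

# Point estimates from the CFH Y402H slide (reference: YY, BMI < 30 kg/m2).
or_hh_only = 3.96      # HH genotype, BMI < 30
or_obese_only = 1.98   # YY genotype, BMI > 30
or_observed = 12.28    # HH genotype, BMI > 30

expected = expected_joint_or(or_hh_only, or_obese_only)
for scale, value in expected.items():
    print(f"Expected joint OR ({scale}): {value:.2f}")
print(f"Observed joint OR: {or_observed:.2f}")
```

On these point estimates the observed joint odds ratio exceeds both expectations, but the wide interval around 12.28 [4.88-30.90] means the comparison is suggestive at best.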