More Powerful Genome-wide Association Methods for Case-control Data
Robert C. Elston, PhD
Case Western Reserve University, Cleveland, Ohio

SINGLE-MARKER AND TWO-MARKER ASSOCIATION TESTS FOR UNPHASED CASE-CONTROL GENOTYPE DATA, WITH A POWER COMPARISON
Kim S, Morris NJ, Won S, Elston RC
Genetic Epidemiology, in press

Introduction
• A genome-wide association study with case-control data aims to localize disease susceptibility regions in the genome
• Single Nucleotide Polymorphism (SNP) markers, which are usually diallelic, have been used to cover the whole genome
• Two categories of tests have been applied to these data
  • single-marker association tests, which examine association between affection status and the SNP data one SNP at a time
  • multi-marker association tests, which examine association between affection status and multiple SNP data simultaneously

Information for association
[Figure: "Association Analysis" diagram of three sources of information (Allele, HWD, LD), with regions labeled a to g]
a. Allele frequency trend test
b. HWD trend test
c. LD contrast test
d. genotype frequency test
e. haplotype-based test with HWE
f. ???
g. phase-known genotype-based test

• The allele frequency, HWD and LD contrast tests are typically developed in what has been termed a retrospective context; i.e.
case-control status is considered fixed and the genotypes are considered random
• For case-control data, epidemiologists typically take advantage of the properties of the odds ratio and use the prospective logistic regression model, making case-control status the random variable dependent on the predictors
• Prospective modeling tends to allow greater flexibility, especially when adjusting for covariates
• It also provides a natural way to adjust for any correlations between the tests or other covariates, and can be extended to quantitative traits

Notation and Assumptions
• We suppose there are two diallelic SNP markers, A and B, having alleles {A1, A2} and {B1, B2}, respectively, where A1 and B1 are the minor alleles

$$X = \begin{cases} 1 & \text{for } A_1A_1 \\ 0 & \text{for } A_1A_2 \\ -1 & \text{for } A_2A_2 \end{cases}, \qquad Y = \begin{cases} 1 & \text{for } B_1B_1 \\ 0 & \text{for } B_1B_2 \\ -1 & \text{for } B_2B_2 \end{cases}$$

• $I_{case}$ and $I_{ctrl}$ denote the sets of cases and controls
• We make minimal assumptions about the general population sampled; in particular, we do not assume HWE in the population
• $\mu_X$, $\sigma_X^2$ and $\sigma_{XY}$ denote the expected value of X, the variance of X and the covariance of X and Y, respectively

• The HWD parameter for marker A is given by $d_A = p_{A_1A_1} - p_A^2$
• The HWD parameter can be expressed as $d_A = \tfrac{1}{2}\left(\sigma_X^2 - \sigma_{X|HWE}^2\right)$
• This means that the HWD parameter, $d_A$, is half the deviation of the variance from the variance expected under HWE
• The composite LD parameter for alleles A1 and B1 of markers A and B is

$$\Delta = 2g_{1,1} + g_{1,0} + g_{0,1} + \tfrac{1}{2}g_{0,0} - 2p_Ap_B = \tfrac{1}{2}\sigma_{XY}$$

Probabilities for unphased genotypes

$$2g_{1,1} + g_{1,0} + g_{0,1} + \tfrac{1}{2}g_{0,0} - 2p_Ap_B = \tfrac{1}{2}\sigma_{XY}$$

• The joint test of allele frequency and HWD contrasts between cases and controls tests the null hypothesis $H_0: (p_{A|case}, d_{A|case}) = (p_{A|ctrl}, d_{A|ctrl})$
• Let $Z_i = (X_i, X_i^2)'$; the sample mean $\bar{Z}$ is a sufficient statistic for $(p_A, d_A)'$
• The Allelic-HWD contrast test can be performed by comparing $\bar{Z}_{case}$ and $\bar{Z}_{ctrl}$.
The $T^2$ statistic for this test is

$$T^2 = \frac{n_{case}\,n_{ctrl}}{n_{case}+n_{ctrl}} \left(\bar{Z}_{case}-\bar{Z}_{ctrl}\right)^{\top} S^{+} \left(\bar{Z}_{case}-\bar{Z}_{ctrl}\right)$$

• Let $Z_i = (X_i, Y_i, X_iY_i)'$; $\bar{Z}$ is a sufficient statistic for $(p_A, p_B, \Delta)'$
• The Allelic-LD contrast test can be performed using a version of Hotelling's $T^2$
• The additional case-control differences can be captured by the HWD and LD contrast tests, given the allele frequency contrast(s)
• The Allelic-HWD-LD contrast test can be constructed in a similar manner by contrasting the mean vector of $Z_i = (X_i, Y_i, X_iY_i, X_i^2, Y_i^2)'$ between cases and controls

Single-marker and two-marker association tests with corresponding models and hypotheses
[The Model and Null hypothesis columns of this table were equations and are not recovered]

Test | Description
Single-marker association
Test 1-2 | Allelic-HWD contrast test (Genotypic test)
Test 1-1 | Allele frequency contrast test (Allelic test)
Two-marker association
Test 2-5 | Joint Allelic-HWD-LD contrast test
Test 2-4 | Joint Allelic-HWD contrast test
Test 2-3 | Joint Allelic-LD contrast test
Test 2-2 | Joint Allelic contrast test

Multistage Tests
• "Self-replication" if the tests are independent
• Sequential tests. E.g.
The HWD contrast test, adjusted for the allele frequency information used in the first stage, can be performed by testing $H_0: \beta_{X^2 \mid X} = 0$

Penetrance Model and True Marker Association Model
• Let D denote the disease genotype variable, coded as

$$D = \begin{cases} 1 & \text{for } D_1D_1 \\ 0 & \text{for } D_1D_2 \\ -1 & \text{for } D_2D_2 \end{cases}$$

• We write the penetrance model as: $P(\text{affected} \mid D) = \gamma_0 + \gamma_D D + \gamma_{D^2} D^2$

Constraints for disease models
[The Constraint column of this table was an equation column and is not recovered]
Disease Model | Constraint
Additive |
Dominant or Recessive |
Heterozygote (Dis)advantage |

• Given the true disease model and the LD structure, we can set up the true single-marker association model between the phenotype and single-marker data X:

$$P(\text{affected} \mid X) = \sum_{D=-1,0,1} P(\text{affected} \mid D)\,P(D \mid X) = \gamma_0 + \gamma_D\,E(D \mid X) + \gamma_{D^2}\,E(D^2 \mid X) = aX^2 + bX + c,$$

where a, b and c are functions of $p_A$, $p_D$ and $D_{XD}$
• This true association model has the same form as the penetrance model
• When $(1 - 2p_D) - \gamma_D/\gamma_{D^2} \neq 0$, the coefficient of the quadratic term generally approaches 0 faster than does that of the linear term

Power Computation
• The $T^2$ test in a retrospective model and the score test and LRT in a prospective logistic model are expected to perform similarly
• The noncentrality parameter of the $T^2$ test for test 2-5 is

$$\lambda = \frac{n_{case}\,n_{ctrl}}{n_{case}+n_{ctrl}} \left(\mu_{case}-\mu_{ctrl}\right)' \Sigma^{-1} \left(\mu_{case}-\mu_{ctrl}\right), \quad \text{where } \Sigma = \frac{n_{case}}{n_{case}+n_{ctrl}}\,\Sigma_{case} + \frac{n_{ctrl}}{n_{case}+n_{ctrl}}\,\Sigma_{ctrl}$$

• The noncentrality parameters for the other tests can be obtained by using the corresponding sub-matrices of $(\mu_{case} - \mu_{ctrl})$ and $(\Sigma_{case} + \Sigma_{ctrl})$
• Then $\text{Power} = 1 - F_{\chi^2}\left(\chi^2_{1-\alpha}\right)$

Comparisons of theoretical and empirical power of test 1-2

Disease model | Theoretical power, T² test | Empirical power, T² test | Empirical power, LRT | Empirical power, Score test
Additive | 0.532 | 0.533 | 0.527 | 0.523
Dominant | 0.366 | 0.366 | 0.361 | 0.359
Recessive | 0.734 | 0.741 | 0.736 | 0.708
Heterozygote Disadvantage | 0.284 | 0.283 | 0.277 | 0.275

For each of the four disease models, parameters were set as follows: pD = 0.2, pA = 0.3, K = 0.05, DXD = 0.048 (D' = 0.8), n = 2,000 (500 for recessive), α = 0.05/500,000
Empirical power is the ratio of the number of rejected replicates
to the total of 100,000 replicates.

Power comparisons of two-marker tests
[Column headers on the slide: Test 2-5, Test 2-4, Test 2-3, Test 2-2, Haplotype-based, LD contrast. Their left-to-right order is not recoverable from the source, so the six value columns below are left unlabeled and kept in their original order]

LD structure 1
Additive | 0.775 | 0.813 | 0.851 | 0.842 | 0.890 | 0.000
Dominant | 0.695 | 0.736 | 0.774 | 0.749 | 0.819 | 0.000
Recessive | 0.823 | 0.845 | 0.746 | 0.784 | 0.717 | 0.001
Heterozygote Disadvantage | 0.617 | 0.653 | 0.673 | 0.621 | 0.711 | 0.000

LD structure 2
Additive | 0.962 | 0.758 | 0.970 | 0.948 | 0.850 | 0.007
Dominant | 0.921 | 0.673 | 0.926 | 0.887 | 0.769 | 0.003
Recessive | 0.851 | 0.647 | 0.910 | 0.945 | 0.618 | 0.206
Heterozygote Disadvantage | 0.845 | 0.584 | 0.831 | 0.773 | 0.656 | 0.001

Power Comparisons on Real Data
• We estimated LD parameters and marker allele frequencies from the HapMap CEU population
• The data consist of 120 haplotypes estimated from 30 parent-offspring trios
• We split chromosome 11 into mutually exclusive consecutive regions containing 3 SNPs each
• For each region we estimated the LD and allele frequency parameters
• We excluded regions where the minor allele frequencies of three consecutive markers were less than 0.1, leaving 4,648 regions
• We chose the disease SNP to be the one with the smallest allele frequency
• Parameters other than the allele frequency and LD parameters were set to be the same as before

Mean of power over chromosome 11 of CEU HapMap data
Single-marker tests: Test 1-2, Test 1-1, HWD contrast. Two-marker tests: Test 2-5, Test 2-4, Test 2-3, Test 2-2, Haplotype-based, LD contrast.

Disease Model | Test 1-2 | Test 1-1 | HWD contrast | Test 2-5 | Test 2-4 | Test 2-3 | Test 2-2 | Haplotype-based | LD contrast
Additive | 0.423 | 0.457 | 0.000 | 0.575 | 0.586 | 0.604 | 0.632 | 0.625 | 0.019
Dominant | 0.361 | 0.347 | 0.001 | 0.505 | 0.513 | 0.518 | 0.505 | 0.488 | 0.003
Recessive | 0.519 | 0.415 | 0.255 | 0.687 | 0.677 | 0.672 | 0.572 | 0.624 | 0.278
Heterozygote Disadvantage | 0.423 | 0.241 | 0.163 | 0.587 | 0.580 | 0.546 | 0.367 | 0.344 | 0.058

Conclusions
• The best two-marker test always appears to be more powerful than either the best single-marker test or the haplotype-based test
• It should be possible, by examining the LD structure of the markers, to predict which will be the best two-marker test to perform
• We need to study
tests that use more than two markers

http://darwin.case.edu/
http://darwin.case.edu/sage.html
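
Illustration of the single-marker Allelic-HWD contrast test (test 1-2)
The following is a minimal sketch, not the authors' implementation, of how the $T^2$ statistic for $Z_i = (X_i, X_i^2)'$ might be computed from genotypes coded 1, 0, -1. It rests on assumptions that the slides do not state: S is taken to be the pooled sample covariance matrix of Z, $S^{+}$ is taken to be its ordinary inverse, and the p-value uses the large-sample chi-square reference distribution with 2 degrees of freedom. The function name `allelic_hwd_contrast` and the example genotype counts are hypothetical.

```python
import math


def _mean(rows):
    """Component-wise mean of a list of 2-vectors."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def _scatter(rows, m):
    """Sums of squares and cross-products of 2-vectors around the mean m."""
    s11 = sum((r[0] - m[0]) ** 2 for r in rows)
    s12 = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    s22 = sum((r[1] - m[1]) ** 2 for r in rows)
    return s11, s12, s22


def allelic_hwd_contrast(x_case, x_ctrl):
    """Return (T2, p_value) contrasting the mean of Z = (X, X^2) in cases vs controls.

    x_case, x_ctrl: genotypes coded 1 (A1A1), 0 (A1A2), -1 (A2A2).
    Assumption: S is the pooled sample covariance and S+ its ordinary inverse.
    """
    z_case = [(x, x * x) for x in x_case]
    z_ctrl = [(x, x * x) for x in x_ctrl]
    n1, n2 = len(z_case), len(z_ctrl)
    m1, m2 = _mean(z_case), _mean(z_ctrl)

    # Pooled covariance matrix of Z (2 x 2, symmetric).
    a, b = _scatter(z_case, m1), _scatter(z_ctrl, m2)
    dof = n1 + n2 - 2
    s11, s12, s22 = ((a[k] + b[k]) / dof for k in range(3))

    det = s11 * s22 - s12 * s12
    if det <= 1e-12:
        # Happens, for example, when only two genotypes are observed (X^2 is then linear in X).
        raise ValueError("pooled covariance matrix is singular")

    # Quadratic form d' S^{-1} d, using the closed-form inverse of a 2 x 2 matrix.
    d1, d2 = m1[0] - m2[0], m1[1] - m2[1]
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det

    t2 = n1 * n2 / (n1 + n2) * quad
    # The survival function of a chi-square with 2 df is exp(-t/2).
    p_value = math.exp(-t2 / 2.0)
    return t2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts, for illustration only.
    cases = [1] * 30 + [0] * 50 + [-1] * 20
    ctrls = [1] * 10 + [0] * 40 + [-1] * 50
    t2, p = allelic_hwd_contrast(cases, ctrls)
    print(f"T2 = {t2:.3f}, p = {p:.3g}")
```

The two-marker tests would follow the same pattern with a longer Z vector, for example $(X_i, Y_i, X_iY_i)'$ for the Allelic-LD contrast, at the cost of a general matrix inverse in place of the 2 x 2 closed form used here.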