More Powerful Genome-wide Association Methods for Case-control Data

Robert C. Elston, PhD
Case Western Reserve University
Cleveland, Ohio
SINGLE-MARKER AND TWO-MARKER ASSOCIATION TESTS
FOR UNPHASED CASE-CONTROL GENOTYPE DATA,
WITH A POWER COMPARISON
Kim S, Morris NJ, Won S, Elston RC
Genetic Epidemiology, in press
Introduction
• A genome-wide association study with case-control data
aims to localize disease susceptibility regions in the
genome
• Single Nucleotide Polymorphism (SNP) markers, which
are usually diallelic, have been used to cover the whole
genome
• Two categories of tests have been applied to these data
• single-marker association tests, which examine association
between affection status and the SNP data one SNP at a
time
• multi-marker association tests, which examine association
between affection status and multiple SNP data
simultaneously
Information for association
[Diagram: information from Allele, HWD and LD for association analysis, partitioned into regions a to g; the test that uses each region is listed below]
a. Allele frequency trend test
b. HWD trend test
c. LD contrast test
d. genotype frequency test
e. haplotype-based test with HWE
f. ???
g. phase-known genotype-based test
• The allele frequency, HWD and LD contrast tests are
typically developed in what has been termed a
retrospective context; i.e. case-control status is
considered fixed and the genotypes are considered
random
• For case-control data, epidemiologists typically take
advantage of the properties of the odds ratio and
use the prospective logistic regression model,
making the case-control status the random variable
dependent on the predictors
• Prospective modeling tends to allow for greater
flexibility, especially when adjusting for covariates
• It also provides a natural way to adjust for any
correlations between the tests or other covariates,
and can be extended to quantitative traits
Notation and Assumptions
• We suppose there are two diallelic SNP markers, A
and B having alleles {A1,A2} and {B1,B2}, respectively,
where A1 and B1 are the minor alleles
  X = 1 for A1A1, 0 for A1A2, -1 for A2A2
  Y = 1 for B1B1, 0 for B1B2, -1 for B2B2
• Icase and Ictrl denote the sets of cases and controls
• We make minimal assumptions about the general
population sampled; in particular, we do not assume HWE
in the population
• μX, σ²X and σXY denote the expected value of X, the variance
of X and the covariance of X and Y, respectively
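The coding above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the helper names `code_genotype` and `moments` and the five toy genotypes are hypothetical. With this coding the mean of X equals 2pA − 1, where pA is the frequency of allele A1.

```python
from statistics import fmean

def code_genotype(allele1, allele2, minor):
    """Return 1, 0 or -1 for two, one or zero copies of the minor allele."""
    return (allele1 == minor) + (allele2 == minor) - 1

def moments(x, y):
    """Sample mean of X, variance of X and covariance of X and Y (divisor n)."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    return mean_x, var_x, cov_xy

# Toy unphased genotypes for five individuals (made-up data).
marker_a = [("A1", "A1"), ("A1", "A2"), ("A2", "A2"), ("A2", "A1"), ("A2", "A2")]
marker_b = [("B1", "B2"), ("B1", "B1"), ("B2", "B2"), ("B2", "B2"), ("B1", "B2")]

x = [code_genotype(a1, a2, "A1") for a1, a2 in marker_a]
y = [code_genotype(b1, b2, "B1") for b1, b2 in marker_b]
mean_x, var_x, cov_xy = moments(x, y)

# Allele frequency of A1 recovered from the mean of X: p_A = (mean_x + 1) / 2.
p_a = (mean_x + 1) / 2
```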
• The HWD parameter for marker A is given by
  $d_A = p_{A_1A_1} - p_A^2$, where $p_A$ is the frequency of allele A1
• The HWD parameter can be expressed as
  $d_A = \tfrac{1}{2}\left(\sigma_X^2 - \sigma_{X|\mathrm{HWE}}^2\right)$
• This means that the HWD parameter, dA, is
half the deviation of the variance from the
variance expected under HWE
• The composite LD parameter for alleles A1 and
B1 of markers A and B is
  $\Delta = 2g_{1,1} + g_{1,0} + g_{0,1} + \tfrac{1}{2}g_{0,0} - 2p_Ap_B = \tfrac{1}{2}\sigma_{XY}$, where $g_{x,y} = P(X = x, Y = y)$
Probabilities for unphased genotypes
[Table: joint probabilities g(x,y) of the unphased two-marker genotypes; not recovered. The slide repeats the composite LD formula given above.]
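The two identities for dA and Δ can be checked numerically. The sketch below is illustrative only: the function names and the genotype probabilities are made up, and each function simply computes the parameter both ways so the two values can be compared.

```python
import math

def hwd_two_ways(p11, p12, p22):
    """HWD parameter d_A from genotype frequencies and from the variance of X."""
    p_a = p11 + p12 / 2                      # frequency of allele A1
    d_direct = p11 - p_a ** 2
    var_x = (p11 + p22) - (p11 - p22) ** 2   # E[X^2] - E[X]^2 under the 1/0/-1 coding
    var_hwe = 2 * p_a * (1 - p_a)            # variance of X expected under HWE
    return d_direct, 0.5 * (var_x - var_hwe)

def composite_ld_two_ways(g):
    """Composite LD from genotype probabilities g[(x, y)] and from cov(X, Y)."""
    p_a = sum(p * (x + 1) / 2 for (x, _), p in g.items())
    p_b = sum(p * (y + 1) / 2 for (_, y), p in g.items())
    delta_direct = (2 * g[(1, 1)] + g[(1, 0)] + g[(0, 1)]
                    + 0.5 * g[(0, 0)] - 2 * p_a * p_b)
    mean_x = sum(p * x for (x, _), p in g.items())
    mean_y = sum(p * y for (_, y), p in g.items())
    cov_xy = sum(p * x * y for (x, y), p in g.items()) - mean_x * mean_y
    return delta_direct, 0.5 * cov_xy

# Made-up joint genotype probabilities (they sum to 1).
g = {(1, 1): 0.10, (1, 0): 0.05, (1, -1): 0.02,
     (0, 1): 0.08, (0, 0): 0.25, (0, -1): 0.12,
     (-1, 1): 0.03, (-1, 0): 0.10, (-1, -1): 0.25}

d_direct, d_from_var = hwd_two_ways(0.30, 0.40, 0.30)
delta_direct, delta_from_cov = composite_ld_two_ways(g)
```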
• The joint test of allele frequency and HWD contrasts
between cases and controls tests the null hypothesis
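  $H_0: (p_{A|\mathrm{case}},\ d_{A|\mathrm{case}}) = (p_{A|\mathrm{ctrl}},\ d_{A|\mathrm{ctrl}})$
• Let $Z_i = (X_i,\ X_i^2)'$; the sample mean $\bar{Z}$ is a sufficient
statistic for $(p_A,\ d_A)'$
• The Allelic-HWD contrast test can be performed by
comparing $\bar{Z}_{\mathrm{case}}$ and $\bar{Z}_{\mathrm{ctrl}}$. The $T^2$ statistic for this test
is
  $T^2 = \frac{n_{\mathrm{case}}\, n_{\mathrm{ctrl}}}{n_{\mathrm{case}} + n_{\mathrm{ctrl}}} \left(\bar{Z}_{\mathrm{case}} - \bar{Z}_{\mathrm{ctrl}}\right)' S^{+} \left(\bar{Z}_{\mathrm{case}} - \bar{Z}_{\mathrm{ctrl}}\right)$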
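A minimal sketch of the Allelic-HWD contrast statistic for one marker follows. Two points are assumptions, not taken from the slide: S is taken to be the pooled within-group covariance matrix of Z, and it is inverted directly, so the sketch raises an error when S is singular (where a generalized inverse would be needed). The function names and toy genotypes are hypothetical.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of 2-vectors."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def scatter(rows, mean):
    """Sum of outer products of deviations from the mean (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        dev = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += dev[i] * dev[j]
    return s

def t2_allelic_hwd(x_case, x_ctrl):
    """T^2 contrast of Z = (X, X^2) between cases and controls."""
    z_case = [(x, x * x) for x in x_case]
    z_ctrl = [(x, x * x) for x in x_ctrl]
    n1, n0 = len(z_case), len(z_ctrl)
    m1, m0 = mean_vector(z_case), mean_vector(z_ctrl)
    s1, s0 = scatter(z_case, m1), scatter(z_ctrl, m0)
    # Assumption: S is the pooled within-group covariance matrix.
    s = [[(s1[i][j] + s0[i][j]) / (n1 + n0 - 2) for j in range(2)]
         for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if abs(det) < 1e-12:
        raise ValueError("S is singular; a generalized inverse would be needed")
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    # Quadratic form d' S^{-1} d written out for the 2 x 2 case.
    quad = (s[1][1] * d[0] ** 2 - 2 * s[0][1] * d[0] * d[1]
            + s[0][0] * d[1] ** 2) / det
    return n1 * n0 / (n1 + n0) * quad

# Made-up coded genotypes.
x_case = [1, 1, 0, 0, 0, -1, -1, 1]
x_ctrl = [0, -1, -1, 0, -1, -1, 0, 1]
t2 = t2_allelic_hwd(x_case, x_ctrl)
```

Under the null hypothesis the statistic would be referred to a chi-square distribution with 2 degrees of freedom in large samples.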
• Let $Z_i = (X_i,\ Y_i,\ X_iY_i)'$; $\bar{Z}$ is a sufficient statistic for
$(p_A,\ p_B,\ \Delta)'$
• The Allelic-LD contrast test can be performed
using a version of Hotelling’s $T^2$
• The additional case-control differences can be
captured by the HWD and LD contrast tests,
given the allele frequency contrast(s)
• The Allelic-HWD-LD contrast test can be
constructed in a similar manner by contrasting
the mean vector of $Z_i = (X_i,\ Y_i,\ X_iY_i,\ X_i^2,\ Y_i^2)'$
between cases and controls
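The Z vectors for the two-marker tests differ only in which terms they include, which the sketch below makes explicit. The vectors for the Allelic-LD and Allelic-HWD-LD tests follow the slides; the vectors shown for the joint allelic and joint Allelic-HWD tests are assumed by analogy. The test labels, `Z_BUILDERS` and `mean_contrast` are illustrative names.

```python
# Map each two-marker test to the Z vector built from coded genotypes (x, y).
Z_BUILDERS = {
    "2-2": lambda x, y: (x, y),                       # joint allelic (assumed form)
    "2-3": lambda x, y: (x, y, x * y),                # joint allelic-LD
    "2-4": lambda x, y: (x, y, x * x, y * y),         # joint allelic-HWD (assumed form)
    "2-5": lambda x, y: (x, y, x * y, x * x, y * y),  # joint allelic-HWD-LD
}

def mean_contrast(test, cases, controls):
    """Difference of mean Z vectors (cases minus controls) for coded (x, y) pairs."""
    build = Z_BUILDERS[test]

    def mean_z(pairs):
        rows = [build(x, y) for x, y in pairs]
        return [sum(col) / len(rows) for col in zip(*rows)]

    return [c - k for c, k in zip(mean_z(cases), mean_z(controls))]

# Made-up coded genotype pairs.
cases = [(1, 1), (0, 1), (0, 0), (-1, 0), (1, 0)]
controls = [(0, 0), (-1, -1), (-1, 0), (0, -1), (1, 1)]
contrast_25 = mean_contrast("2-5", cases, controls)
```

Each contrast vector would then enter a T² quadratic form of the same kind as in the single-marker case, with a covariance matrix of matching dimension.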
Single-marker and two-marker association tests with corresponding models and hypotheses
(The Model and Null hypothesis columns held equations on the slide and are not recovered here.)

Test | Test Description
Single-marker association
Test 1-2 | Allelic-HWD contrast test (Genotypic test)
Test 1-1 | Allele frequency contrast test (Allelic test)
Two-marker association
Test 2-5 | Joint Allelic-HWD-LD contrast test
Test 2-4 | Joint Allelic-HWD contrast test
Test 2-3 | Joint Allelic-LD contrast test
Test 2-2 | Joint Allelic contrast test
Multistage Tests
• “Self-replication” if the tests are independent
• Sequential tests
E.g., the HWD contrast test, adjusted for the allele
frequency information used in the first stage,
can be performed by testing
$H_0$: [case-control contrast in $\mu_{X^2}$ given $\mu_X$] $= 0$
Penetrance Model and
True Marker Association Model
• Let D denote the disease genotype variable
coded as
  D = 1 for D1D1, 0 for D1D2, -1 for D2D2
• We write the penetrance model as:
  $P(\mathrm{affected} \mid D) = \gamma_0 + \gamma_D D + \gamma_{D^2} D^2$
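Constraints for disease models
(The Constraint column held equations on the slide and is not recovered here.)

Disease Model | Constraint
Additive | [not recovered]
Dominant or Recessive | [not recovered]
Heterozygote (Dis)advantage | [not recovered]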
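As an illustration of how such constraints arise, the sketch below evaluates the penetrance model at the three genotype codes and classifies a parameter pair. The constraints used here are derived directly from the 1/0/−1 coding (equal penetrance steps give γD² = 0, a heterozygote matching one homozygote gives |γD| = |γD²|, matching homozygotes give γD = 0); they are an assumption of this sketch and may not be written the same way on the original slide. The function names and parameter values are hypothetical.

```python
import math

def penetrance(d, g0, g_d, g_d2):
    """P(affected | D = d) under the quadratic penetrance model."""
    return g0 + g_d * d + g_d2 * d * d

def classify(g_d, g_d2, tol=1e-12):
    """Name the disease model implied by (gamma_D, gamma_D2); constraints are derived, not quoted."""
    if abs(g_d2) < tol:
        return "additive"                     # heterozygote midway between homozygotes
    if abs(g_d) < tol:
        return "heterozygote (dis)advantage"  # both homozygotes share one penetrance
    if abs(abs(g_d) - abs(g_d2)) < tol:
        return "dominant or recessive"        # heterozygote equals one homozygote
    return "general"

# Example: gamma_D = gamma_D2, so D1D2 and D2D2 share the same penetrance.
g0, g_d, g_d2 = 0.04, 0.02, 0.02
pens = [penetrance(d, g0, g_d, g_d2) for d in (1, 0, -1)]
label = classify(g_d, g_d2)
```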
• Given the true disease model and the LD structure, we
can set up the true single-marker association model
between the phenotype and single-marker data X:
  $P(\mathrm{affected} \mid X) = \sum_{D=-1,0,1} P(\mathrm{affected} \mid D)\, P(D \mid X) = \gamma_0 + \gamma_D E(D \mid X) + \gamma_{D^2} E(D^2 \mid X)$
  $= aX^2 + bX + c$, where a, b and c are functions of $p_A$, $p_D$ and $D_{XD}$
• This true association model has the same form as the
penetrance model
• When $(1 - 2p_D) - \gamma_D / \gamma_{D^2} \neq 0$, the coefficient of the
quadratic term generally approaches 0 faster than
does that of the linear term
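The coefficients a, b and c can be computed numerically for given parameters. The sketch below assumes random pairing of two-locus haplotypes, which is an assumption of this illustration only and is not required by the slides; `hap_freqs`, `marker_model` and the parameter values are hypothetical. Under that assumption the quadratic coefficient works out to γD² times the square of D_XD / (pA(1 − pA)), so it shrinks with the square of the LD while the linear coefficient shrinks only linearly.

```python
import math
from itertools import product

def hap_freqs(p_a, p_d, ld):
    """Haplotype frequencies keyed by (A1 indicator, D1 indicator)."""
    return {(1, 1): p_a * p_d + ld, (1, 0): p_a * (1 - p_d) - ld,
            (0, 1): (1 - p_a) * p_d - ld, (0, 0): (1 - p_a) * (1 - p_d) + ld}

def marker_model(p_a, p_d, ld, g0, g_d, g_d2):
    """Coefficients (a, b, c) of P(affected | X) = a X^2 + b X + c."""
    h = hap_freqs(p_a, p_d, ld)
    joint = {}  # (x, d) -> probability, assuming random pairing of haplotypes
    for (a1, d1), (a2, d2) in product(h, repeat=2):
        key = (a1 + a2 - 1, d1 + d2 - 1)
        joint[key] = joint.get(key, 0.0) + h[(a1, d1)] * h[(a2, d2)]
    f = {}
    for x in (-1, 0, 1):
        p_x = sum(joint.get((x, d), 0.0) for d in (-1, 0, 1))
        f[x] = sum((g0 + g_d * d + g_d2 * d * d) * joint.get((x, d), 0.0)
                   for d in (-1, 0, 1)) / p_x
    # A quadratic through the three points x = -1, 0, 1 is exact.
    c = f[0]
    b = (f[1] - f[-1]) / 2
    a = (f[1] + f[-1]) / 2 - f[0]
    return a, b, c

p_a, p_d, ld = 0.3, 0.2, 0.048
a, b, c = marker_model(p_a, p_d, ld, g0=0.04, g_d=0.02, g_d2=0.01)
a0, b0, c0 = marker_model(p_a, p_d, 0.0, g0=0.04, g_d=0.02, g_d2=0.01)
```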
Power Computation
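• The $T^2$ test in a retrospective model and the score test
and LRT in a prospective logistic model are
expected to perform similarly
• The noncentrality parameter of the $T^2$ test for
test 2-5 is
  $\lambda = \frac{n_{\mathrm{case}}\, n_{\mathrm{ctrl}}}{n_{\mathrm{case}} + n_{\mathrm{ctrl}}} \left(\mu_{\mathrm{case}} - \mu_{\mathrm{ctrl}}\right)' \Sigma^{-1} \left(\mu_{\mathrm{case}} - \mu_{\mathrm{ctrl}}\right)$,
  where $\Sigma = \frac{n_{\mathrm{case}}}{n_{\mathrm{case}} + n_{\mathrm{ctrl}}} \Sigma_{\mathrm{case}} + \frac{n_{\mathrm{ctrl}}}{n_{\mathrm{case}} + n_{\mathrm{ctrl}}} \Sigma_{\mathrm{ctrl}}$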
• The noncentrality parameters for the other tests can
be obtained by using the corresponding sub-matrices
of (μcase – μctrl) and (Σcase + Σctrl)
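• Then
  $\mathrm{Power} = 1 - F_{\chi^2}\left(\chi^2_{1-\alpha};\ \lambda\right)$,
  where $F_{\chi^2}(\cdot\,;\lambda)$ is the noncentral chi-square distribution function with noncentrality parameter $\lambda$ and $\chi^2_{1-\alpha}$ is the $(1-\alpha)$ quantile of the central chi-square distribution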
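The power formula can be evaluated with the standard library alone. The sketch below is one possible implementation, not the authors' code: it computes the central chi-square distribution function from the incomplete gamma series, finds the critical value by bisection, and evaluates the noncentral distribution function as a Poisson mixture of central chi-squares. The degrees of freedom are taken to be the length of the Z vector for the test, which is an assumption of this sketch; all function names are hypothetical.

```python
import math

def chi2_cdf(x, df):
    """Central chi-square CDF via the series for the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > total * 1e-17:
        n += 1
        term *= t / (a + n)
        total += term
    return min(1.0, total * math.exp(a * math.log(t) - t - math.lgamma(a)))

def chi2_quantile(prob, df):
    """Quantile of the central chi-square distribution by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def ncx2_cdf(x, df, lam):
    """Noncentral chi-square CDF as a Poisson(lam / 2) mixture of central CDFs."""
    if lam == 0:
        return chi2_cdf(x, df)
    half = lam / 2.0
    j_max = int(half + 12 * math.sqrt(half) + 60)  # truncation far into the Poisson tail
    total = 0.0
    for j in range(j_max + 1):
        log_w = -half + j * math.log(half) - math.lgamma(j + 1)
        total += math.exp(log_w) * chi2_cdf(x, df + 2 * j)
    return total

def power(lam, df, alpha):
    """Power = 1 - F(critical value; lam) at significance level alpha."""
    crit = chi2_quantile(1 - alpha, df)
    return 1 - ncx2_cdf(crit, df, lam)

# Genome-wide level used in the slides: alpha = 0.05 / 500,000.
alpha_gw = 0.05 / 500_000
p_small = power(30.0, 2, alpha_gw)
p_large = power(60.0, 2, alpha_gw)
```

As a sanity check, a noncentrality of (1.96 + 0.8416)² with 1 degree of freedom at α = 0.05 corresponds to the familiar 80% power of a two-sided z-test.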
Comparisons of theoretical and empirical power of test 1-2

Disease model | Theoretical power: T² test | Empirical power: T² test | Empirical power: LRT | Empirical power: Score test
Additive | 0.532 | 0.533 | 0.527 | 0.523
Dominant | 0.366 | 0.366 | 0.361 | 0.359
Recessive | 0.734 | 0.741 | 0.736 | 0.708
Heterozygote Disadvantage | 0.284 | 0.283 | 0.277 | 0.275

For each of the four disease models, parameters were set as follows:
pD = 0.2, pA = 0.3, K = 0.05, D_XD = 0.048 (D’ = 0.8), n = 2,000 (500 for recessive), α = 0.05/500,000
Empirical power is the proportion of rejected replicates among 100,000 replicates.
Power comparisons of two-marker tests
(The column headers were scrambled in extraction; the positions of Haplotype-based and Test 2-2 are inferred from the nesting of the tests.)

Disease model | Test 2-5 | Test 2-4 | Test 2-3 | Haplotype-based | Test 2-2 | LD contrast
LD structure 1
Additive | 0.775 | 0.813 | 0.851 | 0.842 | 0.890 | 0.000
Dominant | 0.695 | 0.736 | 0.774 | 0.749 | 0.819 | 0.000
Recessive | 0.823 | 0.845 | 0.746 | 0.784 | 0.717 | 0.001
Heterozygote Disadvantage | 0.617 | 0.653 | 0.673 | 0.621 | 0.711 | 0.000
LD structure 2
Additive | 0.962 | 0.758 | 0.970 | 0.948 | 0.850 | 0.007
Dominant | 0.921 | 0.673 | 0.926 | 0.887 | 0.769 | 0.003
Recessive | 0.851 | 0.647 | 0.910 | 0.945 | 0.618 | 0.206
Heterozygote Disadvantage | 0.845 | 0.584 | 0.831 | 0.773 | 0.656 | 0.001
Power Comparisons on Real Data
• We estimated LD parameters and marker allele frequencies
from the HapMap CEU population
• The data consist of 120 haplotypes estimated from 30
parent-offspring trios
• We split chromosome 11 into mutually exclusive consecutive
regions containing 3 SNPs each
• For each region we estimated the LD and allele frequency
parameters
• We excluded regions where the minor allele frequencies of
three consecutive markers were less than 0.1, leaving 4,648
regions
• We chose the disease SNP to be the one with the smallest
allele frequency
• Parameters other than the allele frequency and LD
parameters were set to be the same as before
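The region construction described above can be sketched as follows. Several details are assumptions of this illustration: a region is dropped only when all three of its SNPs have minor allele frequency below 0.1 (the slide wording could also be read as "any"), a trailing incomplete region is discarded, and the two SNPs not chosen as the disease SNP are treated as the markers. `make_regions` and the frequencies are made up.

```python
def make_regions(mafs, size=3, min_maf=0.1):
    """Split SNPs (given by minor allele frequency) into consecutive regions."""
    regions = []
    for start in range(0, len(mafs) - size + 1, size):
        block = mafs[start:start + size]
        # Assumed exclusion rule: every SNP in the region is below the threshold.
        # Replacing all() with any() gives the stricter reading.
        if all(m < min_maf for m in block):
            continue
        # Disease SNP: the one with the smallest allele frequency in the region.
        disease = start + min(range(size), key=lambda k: block[k])
        markers = [start + k for k in range(size) if start + k != disease]
        regions.append({"disease": disease, "markers": markers})
    return regions

# Made-up minor allele frequencies for ten consecutive SNPs.
mafs = [0.20, 0.05, 0.30, 0.05, 0.02, 0.08, 0.40, 0.15, 0.25, 0.30]
regions = make_regions(mafs)
```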
Mean of power over chromosome 11 of CEU HapMap data

Single-marker tests
Disease Model | Test 1-2 | Test 1-1 | HWD contrast
Additive | 0.423 | 0.457 | 0.000
Dominant | 0.361 | 0.347 | 0.001
Recessive | 0.519 | 0.415 | 0.255
Heterozygote Disadvantage | 0.423 | 0.241 | 0.163

Two-marker tests
Disease Model | Test 2-5 | Test 2-4 | Test 2-3 | Test 2-2 | Haplotype-based | LD contrast
Additive | 0.575 | 0.586 | 0.604 | 0.632 | 0.625 | 0.019
Dominant | 0.505 | 0.513 | 0.518 | 0.505 | 0.488 | 0.003
Recessive | 0.687 | 0.677 | 0.672 | 0.572 | 0.624 | 0.278
Heterozygote Disadvantage | 0.587 | 0.580 | 0.546 | 0.367 | 0.344 | 0.058
Conclusions
• The best two-marker test always appears to be
more powerful than either the best single-marker test or the haplotype-based test
• It should be possible, by examining the LD
structure of the markers, to predict which will
be the best two-marker test to perform
• We need to study tests that use more than two markers
http://darwin.case.edu/
http://darwin.case.edu/sage.html