Rare Variants Association Analysis

Association Analysis of
Rare Genetic Variants
Qunyuan Zhang
Division of Statistical Genomics
Course M21-621
Computational Statistical Genetics
Rare Variants
Low allele frequency: usually less than 1%
Low power: for most analyses, because the observations show little variation
High false positive rate: for some model-based analyses, because sparsely distributed data lead to unstable or biased parameter estimates and inflated significance (p-values smaller than expected under the null)
An Example of Low Power
Jonathan C. Cohen, et al.
Science 305, 869 (2004)
An Example of High False Positive Rate
(Q-Q plots from GWAS data, unpublished)
[Figure: four Q-Q plot panels. Panel labels: N=~2500, MAF>0.03; N=~2500, MAF<0.03; N=~2500, MAF<0.03, permuted; N=50000, MAF<0.03, bootstrapped]
Three Levels of Rare Variant Data
Level 1: Individual-level
Level 2: Summarized over subjects
Level 3: Summarized over both subjects and variants
Level 1: Individual-level
Subject  V1  V2  V3  V4  Trait-1  Trait-2
1        1   0   0   0   90.1     1
2        0   1   0   .   99.2     1
3        0   0   0   0   105.9    0
4        0   0   0   0   89.5     0
5        0   .   0   0   97.6     0
6        0   0   0   0   110.5    0
7        0   0   1   0   88.8     0
8        0   0   0   1   95.4     1
Level 2: Summarized over subjects (by group)
Jonathan C. Cohen, et al.
Science 305, 869 (2004)
Level 3: Summarized over subjects (by group) and variants (usually by gene)
                 Variant allele number   Reference allele number   Total
Low-HDL group    20                      236                       256
High-HDL group   2                       254                       256
Total            22                      490                       512
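A pooled 2x2 table like the Low-HDL / High-HDL one can be tested directly. The sketch below assumes Fisher's exact test (two-sided) is an acceptable way to test the total variant frequency; the function name `fisher_exact_two_sided` is hypothetical and the slides do not state which test was actually used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x variant alleles falling in row 1
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Counts from the Level 3 table: variant / reference alleles in each group
p_tft = fisher_exact_two_sided(20, 236, 2, 254)
print(f"Total-frequency test (Fisher exact) p-value: {p_tft:.2e}")
```

Only the four pooled counts are needed, which is why this kind of test works on Level 3 data.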
Methods For Level 3 Data
Single-variant Test vs Total Freq. Test (TFT)
Jonathan C. Cohen, et al.
Science 305, 869 (2004)
What we have learned …
Single-variant test of rare variants has very low power for detecting association, due to extremely low frequency (usually < 0.01)
Testing the collective effect of a set of rare variants may increase the power (sum test, collective test, group test, collapsing test, burden test …)
Methods For Level 2 Data
Allowing different sample sizes for different variants
Different variants can be weighted differently
CAST: A cohort allelic sums test
Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56
Under H0: S(cases)/2N(cases) − S(controls)/2N(controls) = 0
S: variant number; N: sample size
T = S(cases) − S(controls) × N(cases)/N(controls)
  = S(cases) − S*(controls)
(S can be calculated variant by variant and can be weighted differently; the final T = sum(WiSi))
Z = T/SQRT(Var(T)) ~ N(0,1)
Var(T) = Var(S(cases) − S*(controls))
       = Var(S(cases)) + Var(S*(controls))
       = Var(S(cases)) + Var(S(controls)) × [N(cases)/N(controls)]^2
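The CAST statistic can be computed in a few lines. This is a minimal sketch, not the published implementation: the slide does not say how Var(S) is estimated, so the code assumes Poisson-like counts with Var(S) = S. The function name `cast_z` is hypothetical, and the example treats the Low-HDL group as cases with 128 subjects per group (256 alleles each, assuming diploid subjects).

```python
from math import erfc, sqrt

def cast_z(s_cases, s_controls, n_cases, n_controls):
    """Return (T, Z, two-sided normal p-value) for summed variant counts S."""
    ratio = n_cases / n_controls
    t = s_cases - s_controls * ratio              # T = S(cases) - S*(controls)
    # ASSUMPTION: Var(S) = S (Poisson-like counts); not stated on the slide
    var_t = s_cases + s_controls * ratio ** 2     # Var(S(cases)) + Var(S(controls)) * ratio^2
    z = t / sqrt(var_t)
    p_two_sided = erfc(abs(z) / sqrt(2))          # 2 * (1 - Phi(|z|))
    return t, z, p_two_sided

t_stat, z_stat, p_cast = cast_z(s_cases=20, s_controls=2, n_cases=128, n_controls=128)
print(f"T = {t_stat:.1f}, Z = {z_stat:.3f}, p = {p_cast:.2e}")
```

With equal group sizes the ratio is 1 and T reduces to the plain difference in variant counts.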
C-alpha
PLOS Genetics, 2011 | Volume 7 | Issue 3 | e1001322
Effect direction problem
C-alpha
QQ Plots of Existing Methods (under the null)
[Figure: Q-Q plot panels for EFT, CAST, TFT and C-alpha]
•EFT and C-alpha: inflated with false positives
•TFT and CAST: no inflation, but assume a single effect direction
•Objective: more general, powerful methods …
More Generalized Methods For Level 2 Data
Structure of Level 2 data
[Figure: one table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]
Strategy
Instead of testing total freq./number, we test the randomness of all tables.
Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution
$P_i = \dfrac{C(n_1^i, a_1^i)\, C(n_2^i, a_2^i)}{C(N^i, n_A^i)}$
2. Calculating the log-transformed joint probability (L) for all k tables
$L = \sum_{i=1}^{k} \log(P_i)$
3. Enumerating all possible tables and L scores
4. Calculating p-value P = Prob.( )
ASHG Meeting 1212, Zhang
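The four EPT steps can be sketched for a small number of variants by brute-force enumeration. The slide's p-value expression is incomplete, so this sketch assumes the p-value is the total probability of all table sets whose joint log-probability L is no larger than the observed one. The names `table_log_probs` and `ept_p_value` are hypothetical.

```python
from itertools import product
from math import comb, exp, log

def table_log_probs(n1, n2, n_a):
    """Map each possible a1 to the log hypergeometric probability of its table."""
    total = comb(n1 + n2, n_a)
    lo, hi = max(0, n_a - n2), min(n1, n_a)
    return {a1: log(comb(n1, a1) * comb(n2, n_a - a1) / total)
            for a1 in range(lo, hi + 1)}

def ept_p_value(tables, tol=1e-9):
    """tables: list of (a1, n1, a2, n2); a = variant alleles, n = alleles per group."""
    per_table = [table_log_probs(n1, n2, a1 + a2) for a1, n1, a2, n2 in tables]
    # Steps 1-2: observed joint log-probability L
    l_obs = sum(lp[a1] for lp, (a1, _, _, _) in zip(per_table, tables))
    # Steps 3-4: enumerate every combination of tables with the same margins
    # ASSUMPTION: p-value = Prob(L of a random table set <= observed L)
    p_value = 0.0
    for combo in product(*(lp.values() for lp in per_table)):
        l_score = sum(combo)
        if l_score <= l_obs + tol:
            p_value += exp(l_score)
    return l_obs, p_value

# Hypothetical counts: two variants enriched in group 1, one enriched in group 2
l_obs, p_ept = ept_p_value([(3, 40, 0, 40), (2, 40, 0, 40), (0, 40, 2, 40)])
print(f"L = {l_obs:.3f}, p = {p_ept:.4f}")
```

Because each table contributes through its own probability rather than a signed count, variants enriched in either group move L in the same direction. Full enumeration grows multiplicatively with k, so this form is only practical for a handful of variants.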
Likelihood Ratio Test (LRT)
Binomial distribution
$LR = -2 \log \dfrac{\prod_{i=1}^{k} \Pr(a_1^i, b_1^i, a_2^i, b_2^i \mid H_0: \theta_1^i = \theta_2^i)}{\prod_{i=1}^{k} \Pr(a_1^i, b_1^i, a_2^i, b_2^i \mid H_A: \theta_1^i \neq \theta_2^i)} \sim \chi^2_{df=k}$
ASHG Meeting 1212, Zhang
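A binomial likelihood ratio over k tables can be sketched as below, assuming the null model uses one pooled variant-allele frequency per variant and the alternative uses a separate frequency per group. The helper names (`binom_loglik`, `chi2_sf`, `lrt`) and the example counts are hypothetical. The chi-square tail is computed in closed form for integer degrees of freedom so that only the standard library is needed.

```python
from math import erfc, exp, gamma, log, sqrt

def binom_loglik(a, b, p):
    """Binomial log-likelihood kernel a*log(p) + b*log(1-p), with 0*log(0) = 0."""
    ll = 0.0
    if a:
        ll += a * log(p)
    if b:
        ll += b * log(1.0 - p)
    return ll

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    y = x / 2.0
    if df % 2:
        a, q = 0.5, erfc(sqrt(y))      # Q(1/2, y)
    else:
        a, q = 1.0, exp(-y)            # Q(1, y)
    while a < df / 2.0:                # recurrence Q(a+1, y) = Q(a, y) + y^a e^-y / Gamma(a+1)
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q

def lrt(tables):
    """tables: list of (a1, b1, a2, b2) allele counts, one tuple per variant."""
    lr = 0.0
    for a1, b1, a2, b2 in tables:
        p1, p2 = a1 / (a1 + b1), a2 / (a2 + b2)
        p0 = (a1 + a2) / (a1 + b1 + a2 + b2)
        ll_alt = binom_loglik(a1, b1, p1) + binom_loglik(a2, b2, p2)
        ll_null = binom_loglik(a1 + a2, b1 + b2, p0)
        lr += 2.0 * (ll_alt - ll_null)
    lr = max(lr, 0.0)                  # guard against tiny negative rounding error
    return lr, chi2_sf(lr, len(tables))

lr_stat, p_lrt = lrt([(5, 995, 0, 1000), (3, 997, 1, 999), (0, 1000, 4, 996)])
print(f"LR = {lr_stat:.3f}, df = 3, p = {p_lrt:.4f}")
```

The binomial coefficients cancel in the ratio, so only the kernel of the likelihood is evaluated.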
Q-Q Plots of EPT and LRT (under the null)
[Figure: four Q-Q plot panels: EPT, N=500; LRT, N=500; EPT, N=3000; LRT, N=3000]
Power Comparison
significance level=0.00001
Variant proportion: Neutral 20%, Positive causal 80%, Negative causal 0%
[Figure: Power plotted against Sample size]
Power Comparison
significance level=0.00001
Variant proportion: Neutral 20%, Positive causal 60%, Negative causal 20%
[Figure: Power plotted against Sample size]
Power Comparison
significance level=0.00001
Variant proportion: Neutral 20%, Positive causal 40%, Negative causal 40%
[Figure: Power plotted against Sample size]
Methods For Level 1 Data
•Including covariates
•Extended to quantitative traits
•Better control for population structure
•More sophisticated models
Collapsing (C) test
Li and Leal, The American Journal of Human Genetics 2008 (83): 311–321
Step 1
Step 2
logit(y)=a + b* X + e
(logistic regression)
Variant Collapsing
         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          1
2        0    1    0    0    1          1
3        0    0    0    0    0          0
4        0    0    0    0    0          0
5        0    0    0    0    0          0
6        0    0    0    0    0          0
7        0    0    1    0    1          0
8        0    0    0    1    1          1
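The collapsing step can be sketched as follows, using the genotypes and binary trait from the Variant Collapsing table. Treating a missing genotype as a non-carrier is an assumption made here for illustration, and the function name `collapse` is hypothetical.

```python
MISSING = None  # stands in for "." entries in individual-level data

def collapse(genotypes):
    """Return 1 if the subject carries at least one rare variant, else 0."""
    # ASSUMPTION: missing genotypes are treated as non-carriers
    return int(any(g is not MISSING and g > 0 for g in genotypes))

# Rows are subjects 1-8, columns are V1-V4 (from the Variant Collapsing table)
genotypes = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
]
trait = [1, 1, 0, 0, 0, 0, 0, 1]

collapsed = [collapse(row) for row in genotypes]

# Cross-tabulate carrier status against the binary trait
counts = {(c, y): 0 for c in (0, 1) for y in (0, 1)}
for c, y in zip(collapsed, trait):
    counts[(c, y)] += 1

print("Collapsed:", collapsed)
print("Carriers with / without trait:", counts[(1, 1)], counts[(1, 0)])
print("Non-carriers with / without trait:", counts[(0, 1)], counts[(0, 0)])
```

The collapsed indicator then plays the role of X in the regression step, logit(y) = a + b*X + e.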
WSS
Weighted Sum Test
$s = \sum_{i=1}^{m} w_i g_i$
Collapsing test (Li & Leal, 2008): wi = 1, and s = 1 if s > 1
Weighted-sum test (Madsen & Browning, 2009): wi calculated based on allele freq. in control group
aSum, adaptive sum test (Han & Pan, 2010): wi = -1 if b<0 and p<0.1, otherwise wi = 1
KBAC (Liu and Leal, 2010): wi = left tail p value
RBT (Ionita-Laza et al, 2011): wi = log scaled probability
PWST, p-value weighted sum test (Zhang et al., 2011): wi = rescaled left tail p value, incorporating both significance and directions
EREC (Lin et al, 2011): wi = estimated effect size
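The methods listed for the weighted sum test differ only in how the weights wi are chosen, which a short sketch can make concrete. The function names are hypothetical. The control-frequency weight follows the form usually attributed to Madsen & Browning (2009), q = (m + 1) / (2n + 2) and w = 1 / sqrt(N q (1 - q)); that formula is not on the slide and should be checked against the paper before use.

```python
from math import sqrt

def weighted_sum(genotypes, weights):
    """s = sum_i w_i * g_i for one subject."""
    return sum(w * g for w, g in zip(weights, genotypes))

def collapsing_score(genotypes):
    """Collapsing test: all weights 1, and the score is capped at 1."""
    return min(1, weighted_sum(genotypes, [1] * len(genotypes)))

def control_frequency_weight(variant_alleles_in_controls, n_controls, n_total):
    """Weight that grows as the variant gets rarer among controls.

    ASSUMPTION: form recalled from Madsen & Browning (2009), not from the slide.
    """
    q = (variant_alleles_in_controls + 1) / (2 * n_controls + 2)
    return 1.0 / sqrt(n_total * q * (1.0 - q))

subject = [1, 0, 2]  # hypothetical genotype counts at three rare variants
weights = [control_frequency_weight(m, 100, 200) for m in (0, 3, 10)]
print("Collapsing score:", collapsing_score(subject))
print("Weighted sum:", round(weighted_sum(subject, weights), 4))
```

Swapping in a different weight function (for example a rescaled p-value or an estimated effect size) gives the other methods in the list without changing `weighted_sum`.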
When there are only causal(+) variants …
         (+)  (+)
Subject  V1   V2   Collapsed  Trait
1        1    0    1          3.00
2        0    1    1          3.10
3        0    0    0          1.95
4        0    0    0          2.00
5        0    0    0          2.05
6        0    0    0          2.10
[Figure: Trait plotted against Collapsed Genotype (0, 1)]
Collapsing (Li & Leal, 2008) works well, power increased
When there are causal(+) and non-causal(.) variants …
         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          3.00
2        0    1    0    0    1          3.10
3        0    0    0    0    0          1.95
4        0    0    0    0    0          2.00
5        0    0    0    0    0          2.05
6        0    0    0    0    0          2.10
7        0    0    1    0    1          2.00
8        0    0    0    1    1          2.10
[Figure: Trait plotted against Collapsed Genotype (0, 1)]
Collapsing still works, power reduced
When there are causal(+), non-causal(.) and causal(-) variants …
         (+)  (+)  (.)  (.)  (-)  (-)
Subject  V1   V2   V3   V4   V5   V6   Collapsed  Trait
1        1    0    0    0    0    0    1          3.00
2        0    1    0    0    0    0    1          3.10
3        0    0    0    0    0    0    0          1.95
4        0    0    0    0    0    0    0          2.00
5        0    0    0    0    0    0    0          2.05
6        0    0    0    0    0    0    0          2.10
7        0    0    1    0    0    0    1          2.00
8        0    0    0    1    0    0    1          2.10
9        0    0    0    0    1    0    1          0.95
10       0    0    0    0    0    1    1          1.00
[Figure: Trait plotted against Collapsed Genotype (0, 1)]
Power of the collapsing test drops substantially
P-value Weighted Sum Test (PWST)
            (+)    (+)    (.)    (.)    (-)    (-)
Subject     V1     V2     V3     V4     V5     V6     Collapsed  pSum   Trait
1           1      0      0      0      0      0      1          0.86   3.00
2           0      1      0      0      0      0      1          0.90   3.10
3           0      0      0      0      0      0      0          0.00   1.95
4           0      0      0      0      0      0      0          0.00   2.00
5           0      0      0      0      0      0      0          0.00   2.05
6           0      0      0      0      0      0      0          0.00   2.10
7           0      0      1      0      0      0      1          -0.02  2.00
8           0      0      0      1      0      0      1          0.08   2.10
9           0      0      0      0      1      0      1          -0.90  0.95
10          0      0      0      0      0      1      1          -0.88  1.00
t           1.61   1.84   -0.04  0.11   -1.84  -1.72
p(x≤t)      0.93   0.95   0.49   0.54   0.05   0.06
2*(p-0.5)   0.86   0.90   -0.02  0.08   -0.90  -0.88
Rescaled left-tail p-value [-1,1] is used as weight
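The rescaling in the PWST table, weight = 2*(p - 0.5), and the resulting pSum column can be reproduced with a short sketch. The left-tail p-values are copied from the slide rather than recomputed from the t statistics, and the variable names are hypothetical.

```python
# Left-tail p-values p(x <= t) for V1..V6, as listed on the slide
left_tail_p = [0.93, 0.95, 0.49, 0.54, 0.05, 0.06]

# Rescale each p-value from [0, 1] to [-1, 1]; the sign carries the effect direction
weights = [2 * (p - 0.5) for p in left_tail_p]

# Genotypes for subjects 1-10 (rows) at V1..V6 (columns), from the slide
genotypes = [
    [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

# pSum for each subject: weighted sum of genotypes
p_sum = [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

for subject, value in enumerate(p_sum, start=1):
    print(subject, f"{value:.2f}")
```

Variants with opposite effect directions receive weights of opposite sign, so they no longer cancel each other as they do under plain collapsing.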
P-value Weighted Sum Test (PWST)
[Figure: Trait plotted against pSum, x-axis from -1.000 to 1.000]
Power of the collapsing test is retained even when there are bidirectional effects
PWST: Q-Q Plots Under the Null
Direct test: inflation of type I error
Corrected by permutation test (permutation of phenotype)
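Permutation of the phenotype can be sketched generically. This is an illustration under stated assumptions, not the procedure used for the plots: the statistic here is a simple absolute difference in mean trait between carriers and non-carriers, and the names `permutation_p_value` and `abs_mean_difference` are hypothetical. For a data-driven test such as PWST, the weights would have to be re-estimated inside `statistic` for every permuted phenotype.

```python
import random

def permutation_p_value(statistic, scores, trait, n_perm=2000, seed=0):
    """Empirical p-value: share of phenotype permutations with a statistic >= observed."""
    rng = random.Random(seed)
    observed = statistic(scores, trait)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)              # permute the phenotype, keep genotypes fixed
        if statistic(scores, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one correction avoids p = 0

def abs_mean_difference(carrier, trait):
    """Absolute difference in mean trait between carriers and non-carriers."""
    carriers = [y for c, y in zip(carrier, trait) if c]
    others = [y for c, y in zip(carrier, trait) if not c]
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))

# Six-subject example from the slides: two carriers with elevated trait values
carrier = [1, 1, 0, 0, 0, 0]
trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10]
p_perm = permutation_p_value(abs_mean_difference, carrier, trait)
print(f"Permutation p-value: {p_perm:.3f}")
```

Because the null distribution is built from the data themselves, the test does not rely on the asymptotic distribution that inflates the direct test.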
Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)
GLMM & WST
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon$
Y : quantitative trait or logit(binary trait)
α : intercept
β : regression coefficient of weighted sum
m : number of RVs to be collapsed
wi : weight of variant i
gi : genotype (recoded) of variant i
Σwigi : weighted sum (WS)
X : covariate(s), such as population structure variable(s)
τ : fixed effect(s) of X
Z : design matrix corresponding to γ
γ : random polygenic effects for individual subjects, ~N(0, G), G=2σ²K, where K is the kinship matrix and σ² the additive polygenic genetic variance
ε : residual
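The covariance of the random polygenic effects, G = 2σ²K, can be illustrated for a single nuclear family. The kinship values below assume unrelated, non-inbred parents (self-kinship 0.5, parent-child kinship 0.25); the variance value and the function name `polygenic_covariance` are hypothetical.

```python
def polygenic_covariance(kinship, sigma2):
    """Return G = 2 * sigma2 * K as a nested list."""
    return [[2.0 * sigma2 * k for k in row] for row in kinship]

# Kinship matrix K for father, mother, child
# ASSUMPTION: parents are unrelated and nobody is inbred
kinship_trio = [
    [0.50, 0.00, 0.25],
    [0.00, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]

G = polygenic_covariance(kinship_trio, sigma2=1.2)
for row in G:
    print([round(v, 2) for v in row])
```

The diagonal of G equals σ², each parent-child covariance is σ²/2, and the unrelated parents have zero covariance, which is how the model encodes relatedness among subjects.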
Weight
$\sum_{i=1}^{m} w_i g_i$
Based on allele frequency: binary (0,1) or continuous, fixed or variable threshold;
Based on functional annotation/prediction: SIFT, PolyPhen etc.;
Based on sequencing quality (coverage, mapping quality, genotyping quality etc.);
Data-driven, using both genotype and phenotype data: learning weights from data or adaptive selection, permutation test;
Any combination …
Application 1: Family Data
Adjusting for relatedness in family data for non-data-driven tests of rare variants.
Unadjusted:
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + \varepsilon$
Adjusted:
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$
γ ~ N(0, 2σ²K)
Q-Q Plots of –log10(P) under the Null
Li & Leal’s collapsing test,
ignoring family structure,
inflation of type-1 error
Li & Leal’s collapsing test, modeling
family structure via GLMM,
inflation is corrected
(From Zhang et al, 2011, BMC Proc.)
Application 2: Permuting Family Data
MMPT: Mixed Model-based Permutation Test
Adjusting relatedness in family data for data-driven
permutation test of rare variants.
Permuted
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$
γ ~ N(0, 2σ²K)
Non-permuted, subject IDs fixed
Q-Q Plots under the Null
Permutation test, ignoring family structure, inflation of type-1 error
[Figure: Q-Q plot panels for WSS, aSum, PWST and SPWST]
(From Zhang et al, 2011, IGES Meeting)
Q-Q Plots under the Null
Mixed model-based permutation test (MMPT), modeling family structure, inflation corrected
[Figure: Q-Q plot panels for WSS, aSum, PWST and SPWST]
(From Zhang et al, 2011, IGES Meeting)
Burden Test vs. Non-burden Test
Burden test
$Y = \alpha + \beta \sum_{i=1}^{k} (w_i x_i) + \dots + \varepsilon$
$H_0: \beta = 0$
Non-burden test
$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \dots + \varepsilon$
$H_0: \beta_i = 0 \ (\beta_1 = \beta_2 = \dots = \beta_k = 0)$
T-test, Likelihood Ratio Test, F-test, score test, …
SKAT: sequence kernel association test
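A burden test on a quantitative trait reduces to testing the slope β of a simple regression of Y on the weighted sum. The sketch below computes the ordinary least-squares slope and its t statistic for the ten-subject example from the slides, once with the collapsed indicator and once with pSum as the burden score. The function name `slope_t_statistic` is hypothetical and no covariates are included.

```python
from math import sqrt

def slope_t_statistic(x, y):
    """OLS slope of y on x and its t statistic (n - 2 degrees of freedom)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se_beta = sqrt(sse / (n - 2) / sxx)
    return beta, beta / se_beta

# Trait, Collapsed and pSum columns from the ten-subject slide example
trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]
collapsed = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
p_sum = [0.86, 0.90, 0.00, 0.00, 0.00, 0.00, -0.02, 0.08, -0.90, -0.88]

beta_c, t_c = slope_t_statistic(collapsed, trait)
beta_p, t_p = slope_t_statistic(p_sum, trait)
print(f"Collapsed: beta = {beta_c:.3f}, t = {t_c:.2f}")
print(f"pSum:      beta = {beta_p:.3f}, t = {t_p:.2f}")
```

In this example carriers and non-carriers have the same mean trait, so the collapsed score shows no signal, while the signed pSum score does. The pSum t statistic is optimistic as computed here because the weights were derived from the same trait values; the slides correct for this with a permutation test.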
SKAT: sequence kernel association test
$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$
$H_0: \beta_i = 0 \ (\beta_1 = \beta_2 = \dots = \beta_k = 0)$
Extension of SKAT to Family Data
[Equation not recoverable; labelled terms: kinship matrix, polygenic heritability of the trait, residual]
Han Chen et al., 2012, Genetic Epidemiology
Other problems
Missing genotypes & imputation
Genotyping errors & QC (family consistency, sequence review)
Population stratification
Inherited variants and de novo mutation
Family data & linkage information
Variant validation and association validation
Public databases
And more …