Slides

advertisement
FastANOVA: an Efficient Algorithm
for Genome-Wide Association Study
Xiang Zhang Fei Zou Wei Wang
University of North Carolina at Chapel Hill
Speaker: Xiang Zhang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Genotype-phenotype association study
• Goal: finding genetic factors causing phenotypic difference
Mouse genome
Phenotype variation
http://www.bcgsc.ca
http://www.jax.org/
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Genotype-phenotype association study
Chrom1
bp3,568,717
Chrom6
bp120,323,342
• Single Nucleotide Polymorphism…… A A A C G …… A A T C C ……
 Mutation of a single nucleotide
(A,C,T,G)
 The most abundant source of
genotypic variation
 Server as genetic markers of
locations in the genome
 High throughput genotyping -thousands to millions of SNPs
…… A A A C G …… A A T C C ……
…… A A A C G …… A A T C G ……
…… A A A C G …… A A T C G ……
…… A A A C G …… A A T C G ……
…… A A A C G …… A A T C G ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C G ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C C ……
…… A A T C G …… A A T C C ……
Thousands to millions of SNPs
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Genotype-phenotype association study
• Genotype
Phenotype
value
SNPs
 SNPs can be represented as binary
{0,1} (e.g. inbred mouse strains)
• Quantitative phenotypes
 Body weight, blood pressure, tumor
size, cancer susceptibility, ……
• Question
 Which SNPs are the most highly
associated with the phenotype?
…… 0
…… 0
…… 0
…… 0
…… 0
…… 0
…… 1
…… 1
…… 1
…… 1
…… 1
…… 1
0
0
1
1
1
1
0
0
1
0
0
0
0
0
1
0
0
0
1
0
1
0
0
1
1
0
0
0
1
0
1
0
1
1
1
1
0
0
0
1
0
0
1
1
1
0
0
0
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
8
7
12
11
9
13
6
4
2
5
0
3
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
A simple example: single marker
association study
• Partition individuals into groups
according to genotype of a SNP
• Do a statistic (t, ANOVA) test
• Repeat for each SNP
Phenotype
value
SNPs
…… 0
…… 0
…… 0
…… 0
…… 0
…… 0
…… 1
…… 1
…… 1
…… 1
…… 1
…… 1
0
0
1
1
1
1
0
0
1
0
0
0
0
0
1
0
0
0
1
0
1
0
0
1
1
0
0
0
1
0
1
0
1
1
1
1
0
0
0
1
0
0
1
1
1
0
0
0
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
1 ……
0 ……
8
7
12
11
9
13
6
4
2
5
0
3
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Two-locus association mapping
• Many phenotypes are complex traits
 Due to the joint effect of multiple genes
 Single marker approach may not suffice
• Consider SNP-SNP interactions
 Four possible genotype combinations for each SNP-pair:
00, 01, 10, 11
 Split mice into four groups according to the genotype of
each SNP-pair
 Do statistic test for each SNP-pair
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Statistical issue
• Multiple test problem
 , the family-wise error
  1  (1   )n
 Do n tests with Type I error
'
rate  is
'
• Example
 Performing 20 tests with Type I error=0.05, family-wise
error rate = 0.64
 64% probability to get at least one spurious result
• Solution
 permutation test
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Permutation test
• K permutations of
phenotype values
• For each permutation, find
the maximum test value
• Given Type I error α, the
critical value Fα is αK-th
largest value among K
maximum values
• SNP-pairs whose test
values are greater than Fα
are significant
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Genome-wide association study
• What’s GWA?
 Simple Idea: search for the associations in the
whole genome
• Hard to implement
 Enormous search space: 10,000 SNPs and
1,000 permutations, number of SNP-pairs
need to be tested: 5 ×1010
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Preliminary: ANOVA test and F-statistic
• ANOVA test
 To determine whether the group means
are significantly different
 Partition Total sum of squares into
Between-group sum of squares and
Within-group sum of squares
SST
SSB
SSW
• F-statistic




SNPs {X1, X2, …, XN},
a quantitative phenotype Y
Single SNP test -- F(Xi, Y)
SNP-pair test -- F(XiXj, Y)
SSB
F  C
SSW
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Problem Formalization
• Dataset: M individuals, N SNPs {X1, X2, …, XN}, a quantitative
phenotype Y, and its K permutations {Y1, Y2, …, Yk}.
• Maximum ANOVA test (F-statistic) value of permutation Yk
FYk = max {F(XiXj, Yk)|1≤i<j≤N}
• Problem 1: Given Type I error threshold α, find critical value
Fα, which is αK-th largest value among {FYk|1≤k≤K}
• Problem 2: Given the threshold Fα, find all significant SNPpairs such that F(XiXj, Y)≥ Fα
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Brute force approach
• Problem 1: Permutation test to find critical value
 For permutation Yk, test all SNP-pairs to find the
maximum test value FYk
 Repeat for all permutations
 Report αK-th largest value in {FYk|1≤k≤K}
• Problem 2: Finding significant SNP-pairs
 For phenotype Y, test all SNP-pairs and report the SNPpairs whose test values are above Fα
Problem 1 is more demanding due to large number of permutations
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Overview of FastANOVA
• Goal: Scale large permutation test to genome-wide
• Question: Do we have to perform ANOVA tests for
every SNP-pair and repeat for all permutations?
• Idea:
 Develop an upper bound: to filter out SNP-pairs having no
chance to become significant (all nodes on the same level
of the search tree, no sub-tree pruning, how?)
 Efficiently compute the upper bound: calculate the upper
bound for a group of SNP-pairs together (possible?)
 Identify redundant computations in the permutation tests
(reuse computations, how?)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
The upper bound
• For any SNP-pair (XiXj)
equivalent
SSB (XiXj, Y) ≥θ
F(XiXj, Y) ≥ Fα
• Bound on SSB
Fixed for given Fα
SSB ( X i X j , Y )  SSB ( X i , Y )  R1  R2
Need to be greater than θ for
(XiXj) to be significant
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
The upper bound
Given Xi ,Xj , and Y
Xi
Xj
0
0
0
0
0
1
0
1
0
1
0
1
1
0
1
0
1
1
1
0
1
0
1
0
na  min{#X j  1, # X j  0 | X i  0}
nb  min{# X j  1, # X j  0 | X i  1}
SSB ( X i X j , Y )  SSB ( X i , Y )  R1  R2
na  2
Constant
nb  1
f(na) f(nb)
Only depend on the
genotype of Xj
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Applying the upper bound
For a given Xi , let AP= {(XiXj)|i+1≤j≤N}.
Index the SNP-pairs in AP in the 2D space of (na , nb).
(X1X3)
(X1X5)
(X1X6)
X1 X2 X3 X4 X5 X6
0
0
0
1
0
1
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0
0
1
0
0
1
0
1
0
1
0
1
0
0
0
0
1
0
1
1
1
1
1
0
0
0
1
0
1
1
1
1
1
1
1
0
0
1
0
0
1
0
0
1
0
1
1
0
1
1
0
0
(1,3)
(3,3)
(2,1)
(X1X2)
(X1X4)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Key properties
SSB ( X i X j , Y )  SSB ( X i , Y )  R1  R2
f(na) f(nb)
• Maximum possible size: (M / 4  1)2
• Many SNP-pairs share the same entry
• All SNP-pairs in the same entry have the
same upper bound
• The indexing structure does not depend
on the phenotype permutations
Same upper bound value
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Schema of FastANOVA
(for permutation test)
• For each Xi , index the SNP-pairs
{(XiXj)|i+1≤j≤N} in the 2D space of (na, nb)
• For each permutation, find the candidate
SNP-pairs by accessing the indexing
structure
 Candidates are SNP-pairs whose upper bounds
are above the threshold.
 The dynamic threshold is the maximum test
value found so far.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Complexity of FastANOVA
• Time complexity
 FastANOVA: O(N2M + KNM2 +CM)
 Brute force: O(KN2M)
• Space complexity
 O((N+K)M)
N = # SNPs
M << N
M = # individuals
K = # permutations
C = # candidates
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Brute force v.s. FastANOVA
Two orders of magnitude faster than the brute force alternative
1000000
100000
100000
runtime (sec.)
runtime (sec.)
10000
1000
193
209
234
284
100
10000
1000
718
630
399
100
399
brute force approach
FastANOVA
brute force approach
FastANOVA
10
10
0.05
0.04
0.03
0.02
type I error threshold
0.01
10k
18k
26k
34k
number of SNPs
42k
#SNPs = 44k, #individuals = 26,
phenotype: metabolism (water intake)
SNP and phenotype data available at http://www.jax.org
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Pruning power of the bound
100.0%
99.9%
99.9%
99.8%
99.8%
99.7%
99.7%
99.6%
0.05
0.04
0.03
0.02
type I error threshold
0.01
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Runtime of each component
One time cost
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Future work
• Association study involving more than two SNPs
 Computationally much more demanding
 Three loci VS. two loci: in the order of number of SNPs
• Association study for heterozygous case
 SNPs are encoded as ternary variables {0, 1, 2}
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Thank You !
Questions?
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Download