FastANOVA: an Efficient Algorithm for Genome-Wide Association Study
Xiang Zhang, Fei Zou, Wei Wang
University of North Carolina at Chapel Hill
Speaker: Xiang Zhang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Genotype-phenotype association study
• Goal: find the genetic factors that cause phenotypic differences
[Figure: mouse genome and phenotype variation. Image sources: http://www.bcgsc.ca, http://www.jax.org/]

Genotype-phenotype association study
• Single Nucleotide Polymorphism (SNP)
  A mutation of a single nucleotide (A, C, T, G)
  The most abundant source of genotypic variation
  SNPs serve as genetic markers of locations in the genome
  High-throughput genotyping yields thousands to millions of SNPs
[Figure: aligned sequences from several individuals that differ at single positions, labeled Chrom1 bp 3,568,717 and Chrom6 bp 120,323,342, e.g. …… A A A C G …… A A T C C ……]

Genotype-phenotype association study
• Genotype
  SNPs can be represented as binary values {0, 1} (e.g. in inbred mouse strains)
• Quantitative phenotypes
  Body weight, blood pressure, tumor size, cancer susceptibility, ……
• Question
  Which SNPs are most highly associated with the phenotype?
[Figure: example data set with 12 individuals (rows), six of the SNPs, and the phenotype value of each individual]

        SNPs                      Phenotype value
  ……  0  0  0  1  0  1  ……         8
  ……  0  0  0  0  0  0  ……         7
  ……  0  1  1  0  0  1  ……        12
  ……  0  1  0  0  1  0  ……        11
  ……  0  1  0  1  0  1  ……         9
  ……  0  1  0  0  0  0  ……        13
  ……  1  0  1  1  1  1  ……         6
  ……  1  0  0  0  1  0  ……         4
  ……  1  1  1  1  1  1  ……         2
  ……  1  0  0  1  0  0  ……         5
  ……  1  0  0  1  0  1  ……         0
  ……  1  0  1  1  0  0  ……         3

A simple example: single marker association study
• Partition the individuals into groups according to their genotype at a SNP
• Run a statistical test (t-test, ANOVA) on the groups
• Repeat for each SNP
[Figure: the example data set shown above]

Two-locus association mapping
• Many phenotypes are complex traits
  They are due to the joint effect of multiple genes
  A single marker approach may not suffice
• Consider SNP-SNP interactions
  Four possible genotype combinations for each SNP-pair: 00, 01, 10, 11
  Split the mice into four groups according to their genotype at each SNP-pair
  Run a statistical test for each SNP-pair

Statistical issue
• Multiple testing problem
  Run $n$ independent tests, each with Type I error $\alpha$: the family-wise error rate is $1 - (1 - \alpha)^n$
• Example
  Performing 20 tests with Type I error 0.05 gives a family-wise error rate of $1 - 0.95^{20} \approx 0.64$
  That is a 64% probability of getting at least one spurious result
• Solution
  Permutation test

Permutation test
• Generate $K$ permutations of the phenotype values
• For each permutation, find the maximum test value
• Given Type I error $\alpha$, the critical value $F_\alpha$ is the $\alpha K$-th largest value among the $K$ maximum values
• SNP-pairs whose test values are greater than $F_\alpha$ are significant

Genome-wide association study
• What is a genome-wide association (GWA) study?
  Simple idea: search for associations across the whole genome
• Hard to implement
  Enormous search space: with 10,000 SNPs (about $5 \times 10^7$ SNP-pairs) and 1,000 permutations, the number of SNP-pair tests is about $5 \times 10^{10}$

Preliminary: ANOVA test and F-statistic
• ANOVA test
  Determines whether the group means are significantly different
  Partitions the total sum of squares into the between-group sum of squares and the within-group sum of squares: $SST = SSB + SSW$
• F-statistic
  SNPs $\{X_1, X_2, \ldots, X_N\}$, a quantitative phenotype $Y$
  Single-SNP test: $F(X_i, Y)$
  SNP-pair test: $F(X_iX_j, Y)$
  $F = C \cdot \frac{SSB}{SSW}$, where $C$ is a constant

Problem Formalization
• Dataset: $M$ individuals, $N$ SNPs $\{X_1, X_2, \ldots, X_N\}$, a quantitative phenotype $Y$, and its $K$ permutations $\{Y_1, Y_2, \ldots, Y_K\}$
• Maximum ANOVA test (F-statistic) value of permutation $Y_k$:
  $F_{Y_k} = \max \{F(X_iX_j, Y_k) \mid 1 \le i < j \le N\}$
• Problem 1: Given the Type I error threshold $\alpha$, find the critical value $F_\alpha$, which is the $\alpha K$-th largest value among $\{F_{Y_k} \mid 1 \le k \le K\}$
• Problem 2: Given the threshold $F_\alpha$, find all significant SNP-pairs, i.e. those with $F(X_iX_j, Y) \ge F_\alpha$

Brute force approach
• Problem 1: Permutation test to find the critical value
  For permutation $Y_k$, test all SNP-pairs to find the maximum test value $F_{Y_k}$
  Repeat for all permutations
  Report the $\alpha K$-th largest value in $\{F_{Y_k} \mid 1 \le k \le K\}$
• Problem 2: Finding significant SNP-pairs
  For phenotype $Y$, test all SNP-pairs and report those whose test values are at least $F_\alpha$
Problem 1 is the more demanding of the two because of the large number of permutations

Overview of FastANOVA
• Goal: scale large permutation tests to the genome-wide setting
• Question: Do we have to perform an ANOVA test for every SNP-pair and repeat it for every permutation?
• Idea:
  Develop an upper bound to filter out SNP-pairs that have no chance of becoming significant (all nodes are on the same level of the search tree, so there is no sub-tree pruning; how?)
  Compute the upper bound efficiently: calculate it for a group of SNP-pairs together (possible?)
  Identify redundant computations in the permutation tests (reuse computations; how?)

The upper bound
• For any SNP-pair $(X_iX_j)$, the condition $F(X_iX_j, Y) \ge F_\alpha$ is equivalent to $SSB(X_iX_j, Y) \ge \theta$, where $\theta$ is fixed for a given $F_\alpha$
• Bound on SSB:
  $SSB(X_iX_j, Y) \le SSB(X_i, Y) + R_1 + R_2$
  The right-hand side needs to be at least $\theta$ for $(X_iX_j)$ to be significant

The upper bound
Given $X_i$, $X_j$, and $Y$:

  Xi  Xj
   0   0
   0   0
   0   1
   0   1
   0   1
   0   1
   1   0
   1   0
   1   1
   1   0
   1   0
   1   0

  $n_a = \min\{\#(X_j = 1), \#(X_j = 0) \mid X_i = 0\}$ (here $n_a = 2$)
  $n_b = \min\{\#(X_j = 1), \#(X_j = 0) \mid X_i = 1\}$ (here $n_b = 1$)

  $SSB(X_iX_j, Y) \le SSB(X_i, Y) + R_1 + R_2$
  $SSB(X_i, Y)$ is constant for a given $X_i$ and $Y$; $R_1 = f(n_a)$ and $R_2 = f(n_b)$
  $n_a$ and $n_b$ depend only on the genotype of $X_j$ (given $X_i$), not on the phenotype

Applying the upper bound
For a given $X_i$, let $AP = \{(X_iX_j) \mid i+1 \le j \le N\}$. Index the SNP-pairs in $AP$ in the 2D space of $(n_a, n_b)$.

  X1  X2  X3  X4  X5  X6
   0   0   0   1   0   1
   0   0   0   0   0   0
   0   1   1   0   0   1
   0   1   0   0   1   0
   0   1   0   1   0   1
   0   1   0   0   0   0
   1   0   1   1   1   1
   1   0   0   0   1   0
   1   1   1   1   1   1
   1   0   0   1   0   0
   1   0   0   1   0   1
   1   0   1   1   0   0

  Entry (1,3): (X1X3), (X1X5)
  Entry (3,3): (X1X6)
  Entry (2,1): (X1X2), (X1X4)

Key properties
$SSB(X_iX_j, Y) \le SSB(X_i, Y) + R_1 + R_2$, with $R_1 = f(n_a)$ and $R_2 = f(n_b)$
• Maximum possible size of the index: $(M/4 + 1)^2$ entries
• Many SNP-pairs share the same entry
• All SNP-pairs in the same entry have the same upper bound
• The indexing structure does not depend on the phenotype permutations

Schema of FastANOVA (for permutation test)
• For each $X_i$, index the SNP-pairs $\{(X_iX_j) \mid i+1 \le j \le N\}$ in the 2D space of $(n_a, n_b)$
• For each permutation, find the candidate SNP-pairs by accessing the indexing structure
  Candidates are the SNP-pairs whose upper bounds are above the threshold
  The threshold is dynamic: it is the maximum test value found so far
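
Worked example (not part of the original slides): the indexing step can be sketched in a few lines of Python. The sketch below uses the 12-individual, 6-SNP genotype matrix from the "Applying the upper bound" slide and groups the pairs (X1Xj) by their (na, nb) entry. The names `GENOTYPES`, `index_key`, and `build_index` are hypothetical and chosen for illustration only. The functions f(na) and f(nb) behind R1 and R2 are not spelled out on the slides, so the sketch stops at building the index and does not compute the bound itself.

```python
from collections import defaultdict

# Genotype matrix from the "Applying the upper bound" slide:
# 12 individuals (rows) by 6 SNPs X1..X6 (columns).
GENOTYPES = [
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
]


def index_key(xi, xj):
    """Return (na, nb) for the SNP-pair (xi, xj), as defined on the slides."""
    # Genotypes of Xj among individuals with Xi = 0, and with Xi = 1.
    xj_given_0 = [b for a, b in zip(xi, xj) if a == 0]
    xj_given_1 = [b for a, b in zip(xi, xj) if a == 1]
    # na and nb are the sizes of the smaller Xj genotype group on each side.
    na = min(sum(xj_given_0), len(xj_given_0) - sum(xj_given_0))
    nb = min(sum(xj_given_1), len(xj_given_1) - sum(xj_given_1))
    return (na, nb)


def build_index(genotypes, i):
    """Group the pairs (Xi, Xj), j > i, by (na, nb); i is a 0-based column."""
    columns = list(zip(*genotypes))  # one tuple per SNP
    index = defaultdict(list)
    for j in range(i + 1, len(columns)):
        # Store the pair with 1-based SNP numbers to match the slides.
        index[index_key(columns[i], columns[j])].append((i + 1, j + 1))
    return dict(index)


index = build_index(GENOTYPES, 0)
for key in sorted(index):
    print(key, index[key])
# (1, 3) [(1, 3), (1, 5)]
# (2, 1) [(1, 2), (1, 4)]
# (3, 3) [(1, 6)]
```

The output matches the three entries shown on the slide. Because the index is built from genotypes alone, it could be built once per Xi and reused for every phenotype permutation, with the bound evaluated once per entry rather than once per SNP-pair.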

Complexity of FastANOVA
• Time complexity
  FastANOVA: $O(N^2M + KNM^2 + CM)$
  Brute force: $O(KN^2M)$
• Space complexity
  $O((N+K)M)$
  N = # SNPs, M = # individuals, K = # permutations, C = # candidates; $M \ll N$

Brute force vs. FastANOVA
• FastANOVA is two orders of magnitude faster than the brute force alternative
[Figure, left panel: runtime (sec., log scale) of the brute force approach and FastANOVA versus Type I error threshold (0.05 to 0.01)]
[Figure, right panel: runtime (sec., log scale) of the brute force approach and FastANOVA versus number of SNPs (10k to 42k)]
#SNPs = 44k, #individuals = 26, phenotype: metabolism (water intake)
SNP and phenotype data available at http://www.jax.org

Pruning power of the bound
[Figure: pruning power (y-axis from 99.6% to 100.0%) versus Type I error threshold (0.05 to 0.01)]

Runtime of each component
[Figure: runtime of each component; one component is marked as a one-time cost]

Future work
• Association study involving more than two SNPs
  Computationally much more demanding
  Three loci vs. two loci: the search space grows by a factor on the order of the number of SNPs
• Association study for the heterozygous case
  SNPs are encoded as ternary variables {0, 1, 2}

Thank you! Questions?