High throughput genome-wide scan for epistasis with implementation to Recombinant Inbred Lines (RIL) populations
Pavel Goldstein
Dr. Anat Reiner-Benaim
Prof. Abraham Korol

Outline
Problem description
Modeling epistasis: NOIA, the model for epistasis identification
Dimensionality:
  • Multi-trait complexes
  • Two-stage hypothesis testing
  • Hierarchical FDR control in eQTL analysis
Proposed algorithm for epistasis identification
Results:
  • Simulation study
  • Implementation on Arabidopsis data
Conclusions and discussion

eQTL analysis
The goal: find loci whose genotypic variation affects the quantitative trait of interest, using gene expression as the phenotype and molecular markers as the genotype information.

Problem description
Epistasis: nonadditivity in the contributions of several genes to a trait.
The number of tests involved is enormous
Error control

Statistical epistasis
[Figure panels: no epistasis; epistasis]

Natural and Orthogonal Interactions (NOIA) model (Alvarez-Castro and Carlborg, 2007) for RIL population
For loci A and B, trait t, loci-pair l and replicate i:
[Equation; labels: design matrix; gene expression; indicator of genotype combinations for two loci; vector of genetic effects; phenotypes]

The Weighted Gene Co-Expression Network Analysis (WGCNA) (Zhang and Horvath, 2005)
Top-down hierarchical clustering.
Dynamic Tree Cut algorithm: a branch-cutting method that detects gene modules according to branch shape.
Building meta-genes by taking the first principal component of the genes in each cluster.

Two-stage hypothesis testing
[Figure labels: framework marker; secondary markers]

False Discovery Rate (FDR) in eQTL analysis
FDR is the expected proportion of erroneously identified epistasis effects among all identified ones.
Hierarchical FDR control (Yekutieli, 2008):
Full-tree FDR: all epistasis discoveries, whether in framework or in secondary marker pairs.
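
The FDR definition above can be made concrete with a small sketch. The slides do not say which procedure is applied within each family of tests, so the Benjamini-Hochberg step-up rule is an assumption here, and the function name `benjamini_hochberg` and the p-values are purely illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    by the Benjamini-Hochberg step-up rule at FDR level q.

    Assumption: BH is used as a stand-in for the per-family test;
    the slides do not name the procedure.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= q * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Illustrative p-values (made up, not taken from the study).
example_p = [0.205, 0.001, 0.041, 0.008, 0.039, 0.042, 0.06, 0.074, 0.212, 0.216]
example_rejections = benjamini_hochberg(example_p, q=0.05)
```

In a hierarchical setting the same routine would be called once per family of hypotheses, with `q` replaced by the adjusted level used for that family.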

Hierarchical FDR control
A universal upper bound is derived for the full-tree FDR: [equation]
An upper bound for 𝜹* may be estimated using: [equation]
where R_t^{P_i=0} and R_t^{P_i=1} are the numbers of discoveries in τ_t, given that H_i is a true null hypothesis in τ_t and a false null hypothesis, respectively.

Simulation study
5 clusters of 10 traits each were simulated, with different forms of epistasis or no epistasis
Six configurations: effect size (1%, 2%, 3%) × two/four epistatic clusters
Replicated 1000 times
Heritability (effect size): [equation]

The WGCNA hierarchical clustering

Heritability gain

Power gain

Real Data of West et al., 2006
A sample of 210 RIL individuals was derived from a cross between two inbred Arabidopsis thaliana accessions, Bayreuth-0 (Bay-0) and Shahdara (Sha).
Genotype map consists of 579 markers
Genome-wide transcript (mRNA) levels were quantified using Affymetrix whole-genome microarrays
Total of 22,810 gene expressions from all five chromosomes.

Preprocessing
The Variance Stabilization Normalization
Gene expression filtering: 7,244 genes out of 22,810
Markers preprocessing

Two-stage hierarchical testing for epistasis
Identified 314 gene clusters (WGCNA)
47 sparse "framework" markers that are within 10 cM of each other
10-12 "secondary" markers related to each "framework" marker
First step: 1,081 marker pairs × 314 meta-genes = 339,434 tests

Hierarchical FDR control
A universal upper bound is derived for the full-tree FDR:
𝜹* = 1.015 (SE = 0.008)
q* = q/(2𝜹*), with q = 0.1; reported value q* = 0.0472

Two-stage hierarchical testing for epistasis
First stage: 11 significant epistatic areas
Second stage: 1,141 significant epistatic effects out of 1,673 (68%)

Epistasis detected, superimposed on the Arabidopsis markers map

Computational advantage
Using the two-stage algorithm on meta-genes, 341,107 hypotheses were tested
Naive analysis: 121,278 loci pairs for each of 7,244 traits, i.e. 878,537,832 tests, would have been performed
Reduction in the number of tests by a factor of about 2,575
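
The test counts on the "Computational advantage" slide can be re-derived with a few lines of arithmetic. This is a minimal sketch: it assumes the first-stage marker pairs are all unordered pairs of the 47 framework markers (which gives 1,081 pairs and reproduces the stated 339,434 first-stage tests) and that the 1,673 second-stage tests are added on top. The variable names are illustrative.

```python
from math import comb

# Figures taken from the slides.
n_framework_markers = 47
n_meta_genes = 314
second_stage_tests = 1673
naive_loci_pairs = 121278
n_filtered_genes = 7244

# Assumption: every unordered pair of framework markers is tested in stage one.
framework_pairs = comb(n_framework_markers, 2)
first_stage_tests = framework_pairs * n_meta_genes

# Total number of hypotheses tested by the two-stage procedure.
two_stage_total = first_stage_tests + second_stage_tests

# Exhaustive scan: every loci pair against every single-gene trait.
naive_total = naive_loci_pairs * n_filtered_genes

# Fold reduction in the number of tests.
reduction_factor = naive_total / two_stage_total

print(framework_pairs, first_stage_tests, two_stage_total, naive_total)
print(f"reduction factor: {reduction_factor:.1f}")
```

Under these assumptions the totals agree with the slide: 341,107 tests for the two-stage approach against 878,537,832 for the naive scan.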

Epistasis heritability: meta-genes vs single genes
[Figure panels: meta-genes; single genes]

Total heritability: meta-genes vs single genes
[Figure panels: meta-genes; single genes]

T-values of epistatic effects: meta-genes vs single genes
[Figure panels: meta-genes; single genes]

Further research
The method by which markers are chosen may take the genome-wide marker distribution into consideration.
Generalization of the NOIA model
Using GO for the validation of the approach

Acknowledgements
Dr. Anat Reiner-Benaim
Prof. Abraham Korol

Thank you