RECOMBINOMICS: Myth or Reality?
Laxmi Parida
IBM Watson Research, New York, USA
IBM Computational Biology Center

RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion

www.nationalgeographic.com/genographic

www.ibm.com/genographic

Five-year study, launched in April 2005, to address anthropological questions on a global scale using genetics as a tool.
Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth.
How did we, each of us, end up where we are? (the phylogeographic question)
Samples from all around the world are being collected, and the mtDNA and Y-chromosome are being sequenced and analyzed.

DNA material in use under unilinear transmission
16,000 bp (mtDNA); 58 million bp (Y-chromosome); 0.38%

Missing information in unilinear transmissions
[Figure: time axis running from past to present]

Paradigm Shift in Locus & Analysis
Using recombining DNA sequences
Why? Nonrecombining DNA gives a partial story:
1. represents only a small part of the genome
2. behaves as a single locus
3. unilinear (exclusively male or female) transmission
Recombining DNA: towards more complete information
Challenges:
o Computationally very complex
o How to comprehend complex reticulations?

RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion
L Parida, Pedigree History: A Reconstructability Perspective using Random-Graphs Framework, under preparation.

RoadMap
1. Motivation
2. Reconstructability (Random Graphs Framework)
3. Reconstruction Algorithm (DSR Algorithm)
4. Conclusion
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008
L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

INPUT: Chromosomes (haplotypes)
OUTPUT: Recombinational Landscape (Recotypes)

Our Approach
1. Granularity g
2. Acceptable p-value? (statistical) NO: back to step 1. YES: continue.
3. IRiS (combinatorial)
4. Analyze Results (statistical)
M Mele, A Javed, F Calafell, L Parida, J Bertranpetit and Genographic Consortium, Recombination-based genomics: a genetic variation analysis in human populations, under submission.

Preprocess: Dimension reduction via Clustering
[Figure: clustering diagram with nodes labeled 0 to 24]

Analysis Flow: Granularity g -> Acceptable p-value? (statistical; NO returns to Granularity g) -> IRiS (combinatorial) -> Analyze Results (statistical)

p-value Estimation

Comparison of the Randomization Schemes

SNP Blocks (granularity g=3)

Analysis Flow: Granularity g -> Acceptable p-value? (statistical; NO returns to Granularity g) -> IRiS (combinatorial) -> Analyze Results (statistical)

IRiS (Identifying Recombinations in Sequences)
1. Stage haplotypes: use SNP block patterns
2. Segment along the length: infer trees
3. Infer network (ARG)
biological insights / computational insights
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

Segmentation
[Figure: position ruler along the sequence, with consecutive positions assigned to segments 1 to 5 and the last few positions unassigned]

Consensus of Trees

Algorithm Design
1. Ensure compatibility of component trees
2. Parsimony model: minimize the number of recombinations
Theorem: The problem is NP-Hard.
Informally: no polynomial-time algorithm that guarantees optimality is known, and none exists unless P = NP.

DSR Scheme (Dominant-Subdominant-Recombinant)

DSR Scheme: Level 1

DSR Assignment Rules
1. Each row and each column has at most one D; if it has no D, it has at most one S.
2. A non-R can have other non-Rs either in its row or in its column, but NOT both.

DSR Scheme: Level 1
DSR Scheme: Level 2
DSR Scheme: Level 3
DSR Scheme: Level 4
DSR Scheme: Level 5

Mathematical Analysis: Approximation Factor
[Formulas: approximation factors of the Greedy and DSR schemes]
Z and Y are computable functions of the input.
L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009

Analysis Flow: Granularity g -> Acceptable p-value? (statistical; NO returns to Granularity g) -> IRiS (combinatorial) -> Analyze Results (statistical)

IRiS Output: RECOTYPE
Recombination vectors
      R1  R2  R3  R4  R5  R6  R7  R8  R9  R10 R11 R12 R13 R14
s1    1   0   0   0   1   1   1   1   0   0   0   0   1   0
s2    0   1   0   1   1   1   0   1   0   0   1   0   0   0
...

Quick Sanity Check: Ultrametric Network on RECOTYPES

IRiS (Identifying Recombinations in Sequences)
1. Stage haplotypes: use SNP block patterns
2. Segment along the length: infer trees
3. Infer network (ARG)
biological insights / computational insights
IRiS software will be released by the end of summer '09 (Asif Javed)
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008

What's in a name?
RECOMBIN-OMICS (Jaume Bertranpetit)
RECOMBIN-OMETRICS (Robert Elston)
1. Allele-frequency variation between populations is also reflected in the purely recombination-based variation
2. Detects subcontinental divide from short segments, based on population-level analysis
3. Detects populations from short segments, based on recombination-event analysis

Are we ready for the OMICS / OMETRICS?
1. Allele-frequency variation between populations is also reflected in the purely recombination-based variation
2. Detects subcontinental divide from short segments, based on population-level analysis
3. Detects populations from short segments, based on recombination-event analysis
o population-specific signals?
o other critical signals?
o anything we didn't already know?

Thank you!!
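
Supplementary sketch: DSR Assignment Rules as a validity check

The two DSR assignment rules from the talk (at most one D per row and column, else at most one S; a non-R may share its row or its column with other non-Rs, but not both) can be read as a check on a matrix of D/S/R labels. The Python sketch below follows that reading only. The function name `satisfies_dsr_rules` and the list-of-lists encoding are assumptions made for illustration; this is not the IRiS implementation, and it only validates a given assignment rather than constructing one.

```python
from typing import List


def satisfies_dsr_rules(grid: List[List[str]]) -> bool:
    """Check a matrix of 'D'/'S'/'R' labels against the two slide rules.

    Hypothetical helper: the encoding (one label per cell) is an
    assumption, not something specified in the talk.
    """
    rows = [list(r) for r in grid]
    cols = [list(c) for c in zip(*grid)]

    # Rule 1: every row and column has at most one D;
    # a row or column with no D has at most one S.
    for line in rows + cols:
        d_count = line.count("D")
        s_count = line.count("S")
        if d_count > 1 or (d_count == 0 and s_count > 1):
            return False

    # Rule 2: a non-R cell may have other non-R cells in its row
    # or in its column, but not in both.
    for i, row in enumerate(rows):
        for j, label in enumerate(row):
            if label == "R":
                continue
            others_in_row = sum(x != "R" for x in row) - 1
            others_in_col = sum(x != "R" for x in cols[j]) - 1
            if others_in_row > 0 and others_in_col > 0:
                return False
    return True


if __name__ == "__main__":
    print(satisfies_dsr_rules([["D", "R"], ["R", "S"]]))
    print(satisfies_dsr_rules([["D", "S"], ["S", "R"]]))
```

In the second call the top-left D has a non-R neighbour in both its row and its column, so the assignment is rejected under rule 2 even though rule 1 holds.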