Max-Planck-Institute for Molecular Genetics

Bioinformatics Pipeline for Fosmid-Based Molecular Haplotype Sequencing

Jorge Duitama (1,2), Thomas Huebsch (1), Gayle McEwen (1), Sabrina Schulz (1), Eun-Kyung Suk (1), Margret R. Hoehe (1)
1. Max Planck Institute for Molecular Genetics, Berlin, Germany
2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

MHC: Key Region for Common Diseases & Transplant Medicine
[Figure: map of the MHC region with boundary labels 29.74 | MHC class I | 31.59 | MHC class III | 32.34 | MHC class II | 33.21]

MHC: Variation amongst Haplotypes
[Figure: variation of MHC haplotypes against the PGF reference. Labels: PGF reference sequence, 7 further MHC haplotype sequences, HLA-DRB CNV, RCCX CNV, MHC class III, MHC class II]
Variation amongst 8 MHC Haplotypes:
• 37,451 Substitutions
• 7,093 Short Indels
Variation and annotation map for eight MHC haplotypes, Horton et al. Immunogenetics (2008) 60, 1-18

Experimental Approach
[Workflow figure, labels in order:]
• 100 Individuals, 100 Libraries
• 3x96-well = 288 fosmid pools
• One pool: 5000 fosmids, 40 kb haploid molecules
• SNP Mapping for Prioritization of MHC Informative Pools
• SOLiD NGS Platform: Targeted Enrichment or Complete Shotgunning of the complete Fosmid Pool
• Data Analysis Pipeline: Identification of 40 kb fosmid sequences
• Phasing molecular fosmid sequences (Haplotype A, Haplotype B)
• Contiguous MHC haplotype sequence

Data Analysis Pipeline
• Read Alignment against Genome
• Pairing
• Fosmid Detection Program
• Fosmid Specific Matching Algorithm
• Fosmid Sequences Based Phasing
• Consensus Calling
• SNP Analysis
• Visualization & MHC Database
Stages belong either to the SOLiD Standard Pipeline or to the In House Project Specific Analysis Pipeline.

Mapping real data
[Bar chart: mapped reads %, unique mapped reads % and multiple hits % (axis 0 to 70) for Bioscope classic, Bioscope local repeat 40.3 and Bioscope local repeat 45.3]
Pool of 15,000 fosmids, 22 million reads, 50 bp

SNP calls: Haploid fosmids vs. genomic DNA

gDNA
# cov   ref  consen  F3        coord
  335   C    Y       177/17    62511614
  3345  T    C       3191/56   62512095
  875   G    A       862/25    62513689
  1795  G    K       722/23    62513754
  707   C    S       528/13    62515375
  2643  C    Y       1391/20   62517737
  643   C    Y       417/23    62518998
  1074  A    R       554/21    62522445
  606   C    S       226/21    62524689
  639   A    M       167/15    62532474
  158   G    R       89/14     62533464
  1032  A    R       443/26    62534973
  7     A    G       7/4       62537153
  775   T    G       742/26    62540402
  10    G    C       10/5      62540465
  698   G    C       684/29    62541769
  40    C    T       40/4      62542550
  94    C    G       93/9      62542574
  286   C    T       283/16    62543011
  194   C    A       190/22    62543067

Fosmid
# cov   ref  consen  F3        coord
  595   C    T       572/91    62511614
  3418  T    C       3278/98   62512095
  2089  G    A       2048/98   62513689
  2238  G    T       2194/98   62513754
  1134  C    G       1107/73   62515375
  3104  C    T       2922/98   62517737
  1033  C    T       1014/83   62518998
  1799  A    G       1753/98   62522445
  1053  C    G       1049/83   62524689
  54    G    A       39/22     62527964
  32    A    C       27/23     62529870
  1374  A    C       1355/95   62532474
  973   G    A       946/97    62533464
  2850  A    G       2745/98   62534973
  49    A    G       48/33     62537153
  1888  T    G       1845/95   62540402
  37    G    C       36/20     62540465
  923   G    C       901/97    62541769
  8411  A    W       2006/78   62542258
  253   C    T       253/47    62542550

SNP Calling Accuracy in the MHC
– Affymetrix genotype information for 1583 SNP positions as reference standard:
  • Homozygous identical with reference: 957
  • Heterozygous: 562
  • Homozygous different from reference: 64
– Compared to variants called from the SOLiD sequenced genomic DNA sample (15x average read coverage)
– Percentage of error in genotype calling: 3.66%
– False positive rate: 0.1%
– False negative rate: 9.25%

Fosmids Detection
Fosmid Detection Algorithm
1. Assign each read to a single 1 kb long bin. Select bins with more than 5 reads.
2. Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls.
3. Cluster adjacent bins as belonging to the same fosmid if:
   i. The gap distance between them is less than 10 kb, and
   ii. There are no bins with heterozygous SNPs between them.
4. Keep fosmids with lengths between 3 kb and 60 kb.
UCSC Genome browser, http://genome.ucsc.edu/, Kent et al. 2002, Genome Res. 12(6):996-1006.
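The four steps above can be sketched in a few lines of Python. This is only an illustrative reading of the slide, not the pipeline's actual program: the function name, the input format (a list of read start coordinates on one chromosome) and the decision to treat heterozygous bins as separators that are left out of every cluster are all assumptions. Step 2 (allele calling) is assumed to happen elsewhere and is passed in as a set of bin indices.

```python
from collections import Counter

BIN_SIZE = 1_000                   # step 1: 1 kb bins
MIN_READS = 5                      # step 1: keep bins with MORE than 5 reads
MAX_GAP = 10_000                   # step 3i: gap between bins must be below 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000   # step 4: accepted fosmid lengths


def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of candidate fosmids on one chromosome.

    read_starts: read start coordinates (hypothetical input format).
    het_bins:    indices of bins marked as carrying heterozygous calls (step 2).
    """
    het_bins = set(het_bins)
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    selected = sorted(b for b, n in counts.items() if n > MIN_READS)

    clusters, current = [], []
    for b in selected:
        if b in het_bins:
            # Assumption: a heterozygous bin ends the current cluster
            # and is not assigned to any fosmid.
            if current:
                clusters.append(current)
            current = []
            continue
        if current:
            prev = current[-1]
            gap = (b - prev - 1) * BIN_SIZE
            het_between = any(prev < h < b for h in het_bins)
            if gap >= MAX_GAP or het_between:
                clusters.append(current)
                current = []
        current.append(b)
    if current:
        clusters.append(current)

    fosmids = []
    for cluster in clusters:
        start = cluster[0] * BIN_SIZE
        end = (cluster[-1] + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids
```

Calling `detect_fosmids(read_starts, het_bins={10})` splits a run of covered bins at bin 10, which reflects the idea that a haploid fosmid should not show heterozygous calls.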
Fosmids Detection
Size distribution of read-contigs
[Histogram: number of contigs (axis 0 to 3500) against contig length in kb (1 to 100); fosmid sized contigs are marked at 20 – 50 kb]

Haplotyping
Locus  Event      Alleles
1      SNV        C,T
2      Deletion   C,-
3      SNV        A,G
4      Insertion  -,GC
Hap 1 and Hap 2 each carry one allele of every pair (for example T and C at locus 1).
The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.

Single Individual Haplotyping
• Input: Matrix M of m fragments covering n loci

Locus  1  2  3  4  5  ...  n
f1     -  0  1  1  0  ...  0
f2     1  1  0  -  1  ...  1
f3     0  0  0  1  1  ...  -
...
fm     -  -  1  -  1  ...  1
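A minimal sketch of how such a fragment matrix could be held in code, assuming one string per fragment with "-" for loci the fragment does not cover. The rows are the f1, f2, f3 and fm rows shown on the slide, with the six visible columns (loci 1 to 5 and n) treated as six consecutive loci purely for illustration; the helper names are hypothetical.

```python
MISSING = "-"

# Rows f1, f2, f3 and fm from the slide; columns are loci 1..5 and n,
# treated here as six consecutive loci (an assumption for illustration).
M = [
    "-01100",  # f1
    "110-11",  # f2
    "00011-",  # f3
    "--1-11",  # fm
]


def coverage(matrix):
    """Number of fragments with a called allele at each locus."""
    n_loci = len(matrix[0])
    return [sum(row[j] != MISSING for row in matrix) for j in range(n_loci)]


def shared_loci(row_a, row_b):
    """0-based indices of loci where both fragments have a called allele."""
    return [
        j
        for j, (a, b) in enumerate(zip(row_a, row_b))
        if a != MISSING and b != MISSING
    ]
```

Fragments can only be phased relative to each other through loci they share, so `shared_loci(M[0], M[1])` is the part of f1 and f2 that carries phase information.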
ReFHap Problem Formulation
For two alleles a1, a2:
  s(a1, a2) = 1 if both are called and a1 ≠ a2; -1 if both are called and a1 = a2; 0 if either is missing (-)
For two rows i1, i2 of M:
  s(M, i1, i2) = sum over all loci j of s(M[i1, j], M[i2, j])
Example:
  f1      -   0   1   1   0
  f2      1   1   1   -   1
  Score   0   1  -1   0   1
  s(M,1,2) = 1

ReFHap Problem Formulation
For a cut I of rows of M:
  s(M, I) = sum of s(M, i1, i2) over all pairs of rows with i1 in I and i2 not in I

ReFHap Algorithm
• Reduce the problem to Max-Cut
• Solve Max-Cut
• Build haplotypes according to the cut
[Example figure: matrix of 4 fragments f1 to f4 over loci 1 to 5, the corresponding graph on vertices 1 to 4 with edge weights, and the resulting haplotypes h1 = 00110, h2 = 11001]

ReFHap Algorithm
1. Build G=(V,E,w) from M
2. Sort E from largest to smallest weight
3. Init I with a random subset of V
4. For each e in the first k edges
   a) I’ ← GreedyInit(G,e)
   b) I’ ← GreedyImprovement(G,I’)
   c) If s(M, I) < s(M, I’) then I ← I’

ReFHap Algorithm
• Classical greedy algorithm
[Figure: graph on vertices 1 to 4, before and after]

ReFHap Algorithm
• Edge flipping
[Figure: graph on vertices 1 to 4, before and after]

Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS Libraries

                          Mixed Diploid             Fosmid-Based           Ratio
Library                   Mate Pair & Paired End,   Paired End,
                          Genomic DNA               16 Barcoded Pools
Uniquely Mapped           47 Gb                     15 Gb                  1/3rd
Number of Blocks          407                       40                     1/10th
Av. Block Length          438 bp                    85 kb                  194 x
Max. Block Length         3.7 kb                    691 kb                 186 x
Total Length all Blocks   178 kb                    3.4 Mb                 19 x
% of Phased SNPs          12 %                      66 %                   5 x

Phasing MHC: Preliminary Results
• Number of blocks: 8
• N50 block length: 793 kb
• Maximum block length: 1.6 Mb
• Total extent of all blocks: 3.8 Mb
• Fraction of MHC phased into haplotype blocks: 95%
• Number of heterozygous SNPs: 8030
• Fraction of SNPs phased: 86%

Acknowledgements
Margret Hoehe, Anita Suk, Thomas Hübsch, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Roger Horton, Gayle McEwen
The Life Tech Team: Thank You!
Kevin McKernan, Clarence Lee, Jessica Spangler, Tristen Weaver, Tamara Gilbert, Alexander Sartori, Dustin Holloway, Heather Peckham, Stephen McLaughlin, Tim Harkins

Comparison Mapping algos
COX Haplotype simulated reads
[Bar chart: mapped reads %, unique mapped reads % and multiple hits % (axis 0 to 120) for Bioscope classic, Bioscope local iub, Bioscope classic iub, Bioscope local repeat schema, Bioscope local and Bfast]

Phasing MHC
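As a rough illustration of the phasing step, the ReFHap formulation from the earlier slides (pairwise fragment scores, a Max-Cut over the fragments, haplotypes built from the cut) can be sketched as follows. This is a simplified sketch under stated assumptions, not the ReFHap implementation: the Max-Cut step uses a plain single-vertex local search instead of the GreedyInit, GreedyImprovement and edge-flipping heuristics named on the slides, loci are assumed biallelic (alleles "0" and "1"), and the four example fragments are hypothetical, chosen only to be consistent with the haplotypes h1 = 00110 and h2 = 11001 shown on the slide.

```python
from itertools import combinations


def allele_score(a, b):
    """+1 if both alleles are called and differ, -1 if called and equal, 0 if either is missing."""
    if a == "-" or b == "-":
        return 0
    return 1 if a != b else -1


def fragment_score(f, g):
    """s(M, i1, i2): sum of allele scores over all loci."""
    return sum(allele_score(a, b) for a, b in zip(f, g))


def cut_score(frags, side):
    """s(M, I): total score of fragment pairs separated by the cut."""
    return sum(
        fragment_score(frags[i], frags[j])
        for i, j in combinations(range(len(frags)), 2)
        if (i in side) != (j in side)
    )


def local_search_cut(frags):
    """Single-vertex local search for Max-Cut (a stand-in for ReFHap's heuristic)."""
    side = set()
    while True:
        base = cut_score(frags, side)
        best_gain, best_v = 0, None
        for v in range(len(frags)):
            gain = cut_score(frags, side ^ {v}) - base
            if gain > best_gain:
                best_gain, best_v = gain, v
        if best_v is None:
            return side          # no single move improves the cut
        side = side ^ {best_v}   # move the best vertex across the cut


def build_haplotypes(frags, side):
    """Majority vote per locus; fragments outside `side` vote with flipped alleles."""
    first = []
    for j in range(len(frags[0])):
        votes = 0
        for i, f in enumerate(frags):
            if f[j] == "-":
                continue
            allele = int(f[j]) if i in side else 1 - int(f[j])
            votes += 1 if allele == 1 else -1
        first.append("1" if votes > 0 else "0")  # ties and uncovered loci default to "0"
    h_a = "".join(first)
    h_b = "".join("1" if c == "0" else "0" for c in h_a)
    return h_a, h_b


# Hypothetical fragments consistent with h1 = 00110 and h2 = 11001.
fragments = ["0011-", "110--", "--001", "-0110"]
cut = local_search_cut(fragments)
haplotypes = build_haplotypes(fragments, cut)
```

The score example from the slides can be checked with `fragment_score("-0110", "111-1")`, which gives 1, matching s(M,1,2) = 1. A single-vertex local search can stop in a local optimum, which is one reason a real implementation would use stronger heuristics and several starting points.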