CS 6293 Advanced Topics: Current Bioinformatics
Genome Assembly: a brief introduction
Slides adapted from Mihai Pop, Art Delcher, and Steven Salzberg

Homework #2
• #1: questions will be posted online before Monday's class
• #2: Form groups of 3
  – Each group reads two papers on one topic: short-read alignment or assembly
  – Present the papers and compare them
  – ~8-minute presentation
    • You can choose to go into some really cool details
    • Or give the main idea of each paper
  – Other teams (and I) will judge you
  – Send me the names in your group and, optionally, the papers you want to present
  – List of papers: http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

Genome sequencing
[Figure: a genome of 3 × 10^9 nucleotides (AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT) sampled as reads of ~500 nucleotides]
A big puzzle: ~60 million pieces

Computational Fragment Assembly
Introduced ~1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome

Shotgun DNA Sequencing (Technology)
1. SHEAR and SIZE SELECT the DNA target sample, e.g., 10Kbp ± 8% std. dev.
2. LIGATE & CLONE into a vector
3. SEQUENCE from the primer: end reads (mates) of 550bp

Whole Genome Shotgun Sequencing
• Collect 10X sequence in a 1-to-1 ratio of two types of read pairs, Short (2Kbp) and Long (10Kbp): ~35 million reads for Human.
• Collect another 20X in clone coverage of 50Kbp end sequence pairs: ~1.2 million pairs for Human.
• Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
[Figure labels: BAC 5', BAC 3']
+ single highly automated process
+ only three library constructions
– assembly is much more difficult

Sequencing Factory
Celera's Sequencing Factory (circa 2001)
300 ABI 3700 DNA Sequencers
50 Production Staff
20,000 sq. ft. of wet lab
20,000 sq. ft.
of sequencing space
800 tons of A/C (160,000 cfm)
$1 million / year for electrical service
$10 million / month for reagents

Human Data (April 2000)
Collected 27.27 million reads = 5.11X coverage
21.04 million are paired (77%) = 10.52 million pairs

Library   Pairs    True*   Std. dev.
2Kbp      5.045M   98.6%   <6%
10Kbp     4.401M   98.6%   <8%
50Kbp     1.071M   90.0%   <15%
* validated against finished Chrom. 21 sequence

The clones cover the genome 38.7X
Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)

Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and orientation are not known.
[Figure: reads tiled under a contig consensus (15-30Kbp); contig order and orientation marked "?"]
Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.
[Figure: scaffold of contigs linked by a 2-pair; mean & std. dev. of the gap is known]

Anatomy of a WGS Assembly
[Figure labels: Chromosome; STS; STS-mapped Scaffolds; Contig; Read pair (mates); Gap (mean & std. dev. known); Consensus; Reads (of several haplotypes); SNPs; External "Reads"]

Assembly gaps
[Figure labels: physical gaps, sequencing gaps]
• sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
• physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap

Assembly paradigms
• Overlap-layout-consensus
  – greedy (TIGR Assembler, phrap, CAP3...)
  – graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap
Greedy
• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done

[Figure] (A) Overlap between two reads; note that agreement within the overlapping region need not be perfect. (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D. (C) Assembly produced by the greedy approach. Pop M, Brief Bioinform 2009;10:354-366. © The Author 2009. Published by Oxford University Press.
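The greedy strategy on the TIGR Assembler/phrap slide can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not how TIGR Assembler or phrap are implemented: overlaps must be exact suffix-prefix matches, the "score" is simply the overlap length, and the names `overlap_len`, `greedy_assemble`, and `min_len` are made up for this example.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    (Assumption: exact matching; real assemblers tolerate mismatches.)
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two fragments with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        # "Rough map of overlaps": score every ordered pair of fragments.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no more merges can be done
        # Merge the best pair and put the merged fragment back in the pool.
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

For example, `greedy_assemble(["AGTCCATGG", "CATGGAATT", "GAATTCGC"])` merges the three toy reads into the single contig `AGTCCATGGAATTCGC`. When the genome contains repeats, the longest overlap can join reads from different repeat copies, which is the failure shown in panel (C) of the figure.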
Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: overlap graph over reads 1-9 and their layout; consensus example with the sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
• Hamiltonian path: visit each node (city) exactly once
[Figure: graph with nodes A-I and the corresponding placement of A-I along the genome]

Overlap between two sequences
        GGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGC
overlap (19 bases), overhang (6 bases)
% identity = 18/19 = 94.7%
• overlap: region of similarity between the two sequences
• overhang: unaligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size

All pairs alignment
• Needed by the assembler
• Try all pairs: must consider ~n^2 pairs
• Smarter solution: only n × coverage (e.g., 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from the k-mer table (single pass through the k-mer table)
[Figure: reads A-I linked through a shared k-mer]

BWT-based overlap detection
• Efficient construction of an assembly string graph using the FM-index, Jared T. Simpson and Richard Durbin, Bioinformatics, 26 (12): i367-i373 (2010)
• Read it yourself for more details
[Figure: BWT for multiple sequences, each terminated by $]

OVERLAP GRAPH
Edge types:
• Regular Dovetail
• Prefix Dovetail
• Suffix Dovetail
[Figure: relative placement of reads A and B for each edge type]
E.g.: edges are annotated with deltas of overlaps

The Unitig Reduction
1. Remove "Transitively Inferrable" Overlaps
[Figure: reads A, B, C before and after removing the transitive overlap]
2. Collapse "Unique Connector" Overlaps
[Figure: reads A and B before and after collapsing; edge labels 412, 352, 45]

Celera Assembly Pipeline
Trim & Screen
Find all overlaps of at least 40bp allowing 6% mismatch.
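The overlap criterion above (a minimum overlap length and a bounded mismatch rate) can be illustrated with a small Python sketch. It is a simplification and not the Celera overlapper: it only compares a suffix of one read with a prefix of the other, position by position, so insertions and deletions are not handled, and overhang screening is omitted. The function name `best_overlap` and its parameters are hypothetical.

```python
def best_overlap(a, b, min_len=40, max_mismatch=0.06):
    """Longest suffix of `a` that matches a prefix of `b` within the mismatch budget.

    Returns (overlap_length, identity), or None if no overlap of at least
    `min_len` bases has a mismatch fraction of at most `max_mismatch`.
    (Assumption: gap-free comparison; real overlappers align with indels.)
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        # Count positions where the suffix of a and the prefix of b disagree.
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatch * n:
            return n, (n - mismatches) / n
    return None
```

Applied to the two sequences from the "Overlap between two sequences" slide with a shorter minimum length, `best_overlap("CAGTACTTGGATGCGCTGACACGTAGC", "GGATGCGCGGACACGTAGCCAGGAC", min_len=10)` finds the 19-base overlap with identity 18/19. Trying every pair of reads this way is the ~n^2 approach; the k-mer table from the "All pairs alignment" slide exists to avoid it.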
Overlapper
  An overlap between A and B implies either a TRUE overlap OR a REPEAT-INDUCED overlap
Unitiger
  Compute all overlap-consistent sub-assemblies: Unitigs (Uniquely Assembled Contigs)
Scaffolder
  Scaffold U-unitigs with confirmed pairs (mated reads)
Repeat Rez I, II
  Fill repeat gaps with doubly anchored positive unitigs (Unitig > 0)

Handling repeats
1. Repeat detection
  – pre-assembly: find fragments that belong to repeats
    • statistically (most existing assemblers)
    • repeat database (RepeatMasker)
  – during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  – post-assembly: find repetitive regions and potential mis-assemblies
    • Reputer, RepeatMasker
    • "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
  – find DNA fragments belonging to the repeat
  – determine correct tiling across the repeat

Statistical repeat detection
Significant deviations from average coverage are flagged as repeats.
  – frequent k-mers are ignored
  – "arrival" rate of reads in contigs is compared with the theoretical value
Problem 1: the assumption of a uniform distribution of fragments leads to false positives
  – non-random libraries
  – regions of poor clonability
Problem 2: repeats with low copy number are missed, which leads to false negatives

Mis-assembled repeats
[Figure: three kinds of mis-assembly caused by repeats: collapsed tandem, excision, rearrangement; unique segments labeled a-f, repeat copies labeled I-IV]

Eulerian path-based assembly
• Break each read into k-mers (typically k >= 19)
• Construct a de Bruijn graph using the k-mers from all reads
  – Each k-mer is a node
  – v1 has a directed edge to v2 if v1 can be obtained by removing the last character of v2 and adding a new character at its beginning (equivalently, the last k-1 characters of v1 equal the first k-1 characters of v2). E.g.
v1 = acgtctgact
v2 = cgtctgactg
• Find an Eulerian path in the graph
  – visits each edge exactly once
[Figure: 1. Sequencing; 2. Constructing a de Bruijn graph; 3. Simplification; 4. Error removal]

Eulerian path-based assembly
• No need to compute pairwise overlaps
  – important for NGS data
• Eulerian paths are much easier to find than Hamiltonian paths
  – Catch: multiple Eulerian paths may exist
  – Loss of information
• Repeats appear as cycles in the graph
  – Less likely to cause mis-assembly
• More suitable for short-read assembly
  – Newbler
  – VELVET
  – EDENA
  – ABySS
  – See Flicek & Birney, Nat Methods, 2009

References
• Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6-S12 (2009)
• Genome assembly reborn: recent computational challenges, Mihai Pop, Briefings in Bioinformatics, 10(4): 354-366 (2009)
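As a closing illustration of the Eulerian path-based assembly slides, here is a minimal Python sketch of the two steps: building a de Bruijn graph whose nodes are k-mers, then walking an Eulerian path with Hierholzer's algorithm. It assumes error-free reads, a single strand, and a graph that actually has an Eulerian path. It also skips the simplification and error-removal steps, and it is not how Velvet or ABySS are implemented. The function names `de_bruijn`, `eulerian_path`, and `spell` are made up for this example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are k-mers; each distinct (k+1)-mer in the reads gives one edge
    from its first k characters to its last k characters."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add(read[i:i + k + 1])
    graph = defaultdict(list)
    for e in sorted(edges):
        graph[e[:-1]].append(e[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a path that uses every edge exactly once.

    Assumes such a path exists (at most one node with out-degree minus
    in-degree equal to 1, which is used as the start)."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for w in targets:
            in_deg[w] += 1
    nodes = set(graph) | set(in_deg)
    start = next(
        (v for v in sorted(nodes) if len(graph.get(v, [])) - in_deg[v] == 1),
        min(nodes),
    )
    remaining = {v: list(ws) for v, ws in graph.items()}  # unused edges
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node on the way back
    return path[::-1]


def spell(path):
    """Concatenate a path of overlapping k-mers back into one sequence."""
    return path[0] + "".join(v[-1] for v in path[1:])
```

With the toy reads `["AGTCCATGG", "CATGGAATT", "GAATTCGC"]` and `k=4`, `spell(eulerian_path(de_bruijn(reads, 4)))` reconstructs `AGTCCATGGAATTCGC` without computing any pairwise read overlaps. With the slide's example, `de_bruijn(["acgtctgactg"], 10)` yields the single edge from `acgtctgact` to `cgtctgactg`. When repeats create cycles, several Eulerian paths may exist and this sketch simply returns one of them.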