Slides 3: NGS short

CS 6293 Advanced Topics: Current Bioinformatics
Genome Assembly: a brief introduction
Slides adapted from Mihai Pop, Art Delcher, and Steven Salzberg

Homework #2
• #1: questions will be posted online before Monday class
• #2: Form groups of 3
  – Each group reads two papers on a topic: Short reads alignment or assembly
  – Present the papers and do some comparison
  – ~8 minutes presentation
    • You can choose to go to some really cool details
    • Or give the main idea of the paper
  – Other teams (and me) will judge you
  – Send me names in your group and optionally papers you want to present
  – List of papers: http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

Genome sequencing
[Figure: genome sequence AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT, 3 x 10^9 nucleotides; pieces of ~500 nucleotides]
A big puzzle: ~60 million pieces

Computational Fragment Assembly
• Introduced ~1980
• 1995: assemble up to 1,000,000 long DNA pieces
• 2000: assemble whole human genome

Shotgun DNA Sequencing (Technology)
[Figure: DNA target sample; steps SHEAR, SIZE SELECT (e.g., 10Kbp ± 8% std.dev.), LIGATE & CLONE, SEQUENCE; labels: Vector, Primer, End Reads (Mates) 550bp]

Whole Genome Shotgun Sequencing
– Collect 10x sequence in a 1-to-1 ratio of two types of read pairs, Short (2Kbp) and Long (10Kbp): ~35 million reads for Human.
– Collect another 20X in clone coverage of 50Kbp end sequence pairs (BAC 5’, BAC 3’): ~1.2 million pairs for Human.
– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

+ single highly automated process
+ only three library constructions
- assembly is much more difficult

Sequencing Factory
Celera’s Sequencing Factory (circa 2001)
• 300 ABI 3700 DNA Sequencers
• 50 Production Staff
• 20,000 sq. ft. of wet lab
• 20,000 sq. ft. of sequencing space
• 800 tons of A/C (160,000 cfm)
• $1 million / year for electrical service
• $10 million / month for reagents

Human Data (April 2000)
• Collected 27.27 Million reads = 5.11X coverage
• 21.04 Million are paired (77%) = 10.52 Million pairs

  Insert size   Pairs     True*    Std. dev.
  2Kbp          5.045M    98.6%    <6%
  10Kbp         4.401M    98.6%    <8%
  50Kbp         1.071M    90.0%    <15%
  * validated against finished Chrom. 21 sequence

• The clones cover the genome 38.7X times
• Data is from 5 individuals (roughly 3X, 4 others at .5X)

Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and orientation are not known.
[Figure labels: Contig; Consensus (15-30Kbp); Reads; ?]
Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.
[Figure labels: Scaffold; 2-pair; Mean & Std.Dev. is known]

Anatomy of a WGS Assembly
[Figure labels: Chromosome; STS; STS-mapped Scaffolds; Contig; Read pair (mates); Gap (mean & std. dev. known); Consensus; Reads (of several haplotypes); SNPs; External “Reads”]

Assembly gaps
[Figure labels: Physical gaps; Sequencing gaps]
sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Assembly paradigms
• Overlap-layout-consensus
  – greedy (TIGR Assembler, phrap, CAP3...)
  – graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)

TIGR Assembler/phrap
Greedy
• Build a rough map of fragment overlaps
• Pick the largest scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done

[Figure] (A) Overlap between two reads; note that agreement within overlapping region need not be perfect. (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D. (C) Assembly produced by the greedy approach.
Pop M, Brief Bioinform 2009;10:354-366. © The Author 2009. Published by Oxford University Press.
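The greedy steps listed above (map overlaps, pick the largest, merge, repeat) can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the slides: overlaps are exact suffix-prefix matches, the "score" of an overlap is simply its length, and reads contained in the middle of another read are not handled. The names `overlap_len` and `greedy_assemble` are hypothetical and do not reflect how TIGR Assembler or phrap actually score overlaps.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 when no overlap of at least min_len exists (assumption:
    exact matching only; real assemblers tolerate mismatches).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        # "Rough map" of overlaps: here simply all ordered pairs.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags  # no more merges can be done
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


if __name__ == "__main__":
    # Toy reads drawn from ACGTTGCATGGA; prints ['ACGTTGCATGGA']
    print(greedy_assemble(["ACGTTG", "TTGCAT", "CATGGA"]))
```

Because each merge is chosen locally, a sketch like this has the weakness shown in panel (C) of the figure: when repeats are present, the largest overlap is not necessarily the correct one.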
Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: overlap graph and layout of reads 1-9; aligned reads ACCTGA, ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
• Hamiltonian path: visit each node (city) exactly once
[Figure: graph with nodes A-I; the same nodes placed along the Genome]

Overlap between two sequences

          GGATGCGCGGACACGTAGCCAGGAC
  CAGTACTTGGATGCGCTGACACGTAGC

overlap (19 bases), overhang (6 bases)
% identity = 18/19 = 94.7%
overlap - region of similarity between regions
overhang - un-aligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size.

All pairs alignment
• Needed by the assembler
• Try all pairs – must consider ~ n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from k-mer table (single pass through k-mer table)
[Figure: reads A-I sharing a k-mer]

BWT-based overlap detection
• Efficient construction of an assembly string graph using the FM-index, Jared T. Simpson and Richard Durbin, Bioinformatics, 26 (12): i367-i373 (2010)
• Read it yourself for more details
[Figure: BWT for multiple sequences]

OVERLAP GRAPH
Edge Types:
• Regular Dovetail
• Prefix Dovetail
• Suffix Dovetail
[Figure: reads A and B drawn for each edge type]
E.g.: Edges are annotated with deltas of overlaps

The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps [Figure: reads A, B, C]
2. Collapse “Unique Connector” Overlaps [Figure: reads A, B; edge labels 412, 352, 45]

Celera Assembly Pipeline
Trim & Screen
Find all overlaps ≥ 40bp allowing 6% mismatch.
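The two ideas above, generating candidate pairs from a k-mer table and screening an overlap by length and mismatch rate, can be sketched as follows. This is a rough illustration, not the Celera overlapper: it assumes ungapped overlaps (no insertions or deletions), and the names `candidate_pairs` and `suffix_prefix_overlap` as well as their parameters are hypothetical. The default thresholds mirror the 40bp / 6% figures on the slide.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Pairs of read indices that share at least one k-mer.

    One pass over the reads builds the k-mer table; one pass over the
    table emits the pairs, so unrelated reads are never compared.
    """
    table = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(rid)
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def suffix_prefix_overlap(a, b, min_len=40, max_mismatch=0.06):
    """Longest ungapped overlap of a suffix of a with a prefix of b.

    Returns (overlap_length, identity) for the longest overlap whose
    mismatch fraction is within max_mismatch, or None if none qualifies.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatch * n:
            return n, 1 - mismatches / n
    return None


if __name__ == "__main__":
    # The two sequences from the "Overlap between two sequences" slide.
    s1 = "CAGTACTTGGATGCGCTGACACGTAGC"
    s2 = "GGATGCGCGGACACGTAGCCAGGAC"
    print(candidate_pairs([s1, s2, "TTTTTTTTTTTT"], k=8))  # {(0, 1)}
    length, identity = suffix_prefix_overlap(s1, s2, min_len=10)
    print(length, round(identity * 100, 1))  # 19 94.7
```

With the default 40bp minimum the toy sequences are too short to pass the screen, so the usage example lowers `min_len` to 10 purely for demonstration.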
Overlapper
Unitiger
Scaffolder
Repeat Rez I, II
[Figure: an overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED overlap]

Celera Assembly Pipeline (Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II)
• Unitiger: Compute all overlap consistent sub-assemblies: Unitigs (Uniquely Assembled Contig)
• Scaffolder: Scaffold U-unitigs with confirmed pairs (mated reads)
• Repeat Rez I, II: Fill repeat gaps with doubly anchored positive unitigs (Unitig>0)

Handling repeats
1. Repeat detection
  – pre-assembly: find fragments that belong to repeats
    • statistically (most existing assemblers)
    • repeat database (RepeatMasker)
  – during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  – post-assembly: find repetitive regions and potential mis-assemblies.
    • Reputer, RepeatMasker
    • "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
  – find DNA fragments belonging to the repeat
  – determine correct tiling across the repeat

Statistical repeat detection
Significant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical value
Problem 1: assumption of uniform distribution of fragments - leads to false positives
  • non-random libraries
  • poor clonability regions
Problem 2: repeats with low copy number are missed - leads to false negatives

Mis-assembled repeats
[Figure: three kinds of mis-assembly: excision, collapsed tandem, rearrangement]

Eulerian path-based assembly
• Break each read into k-mers (typically k >= 19)
• Construct a de Bruijn graph using the k-mers from all reads
  – Each k-mer is a node
  – v1 has a directed edge to v2 if v1 can be expressed by removing the last char from v2 and adding a new char at the beginning of v2, e.g. v1 = acgtctgact, v2 = cgtctgactg
• Find an Eulerian path in the graph
  – visits each edge exactly once
[Figure panels: 1. Sequencing; 2. Constructing a de Bruijn graph; 3. Simplification; 4. Error removal]

Eulerian path-based assembly
• No need to compute pairwise overlaps
  – important for NGS data
• Eulerian paths are much easier to find than Hamiltonian paths
  – Catch: multiple Eulerian paths may exist
  – Loss of information
• Repeats appear as cycles in the graph
  – Less likely to cause mis-assembly
• More suitable for short-reads assembly
  – Newbler
  – VELVET
  – EDENA
  – ABySS
  – See Flicek & Birney, Nat Methods, 2009

References
• Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009)
• Genome assembly reborn: recent computational challenges, Mihai Pop, Briefings in Bioinformatics, 10(4): 354-366 (2009)
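As a rough illustration of the Eulerian path-based assembly slides above, the sketch below builds a de Bruijn graph whose nodes are k-mers and then walks an Eulerian path with Hierholzer's algorithm. Several things here are assumptions rather than anything stated in the slides: an edge is added only when two k-mers are seen adjacent in some read, each distinct adjacency is kept once, the reads are error-free, and an Eulerian path is assumed to exist. The function names are hypothetical, and the toy value k = 3 is far below the k >= 19 the slides recommend.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are k-mers; one edge per distinct pair of adjacent k-mers."""
    edges = set()
    for r in reads:
        for i in range(len(r) - k):
            # r[i:i+k] followed by r[i+1:i+k+1]: they overlap by k-1 chars.
            edges.add((r[i:i + k], r[i + 1:i + k + 1]))
    graph = defaultdict(list)
    for u, v in sorted(edges):
        graph[u].append(v)
    return dict(graph)


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(iter(graph))
    for node, vs in graph.items():
        if len(vs) - in_deg[node] == 1:
            start = node
            break
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]


def spell(path):
    """Sequence spelled by a path of k-mers overlapping by k-1 chars."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


if __name__ == "__main__":
    reads = ["ATGGCG", "GCGTGC", "GTGCA"]
    print(spell(eulerian_path(de_bruijn(reads, 3))))  # ATGGCGTGCA
```

The toy genome has no repeated 3-mers, so the path is unique; with repeats the graph contains cycles and several Eulerian paths may exist, which is the "catch" the slides mention.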
