Computational Genomics: Genome assembly Andrey Kislyuk 25 January 2010 Why do we need to assemble genomes? DNA sequencing methods can’t sequence more than about 1000 nt at a time Sanger method (1975) - chain termination with labeled ddNTPs Maxam-Gilbert method (1976) - cleaving agents - Both require primers for DNA Polymerase, but we can’t make the primers until we know the sequence! - Both limited to ~1000 nt on the gel/capillary - Both require cloning and/or PCR amplification Large-scale sequencing: primer walking - Slow, costly, and error-prone: not practical beyond ~10Kbp 8 April 2015 · Computational Genomics Shotgun sequencing Whole genome shotgun sequencing (1995) - Hydroshearing: prepare several libraries of random fragments approx. 2, 5, 10, 50… Kbp long - Cloning: use bacterial plasmids to grow DNA – problems arise if DNA contains a gene harmful to the host bacteria - Picking, amplification - Sanger sequencing, capillary electrophoresis, read out fluorescent dyes with a laser – 4 different colors Result: lots of ~1000 nt Sanger reads - Assemble them with pairwise sequence alignment - Multiple coverage corrects errors Seems straightforward now, but many did not believe it could be done! 8 April 2015 · Computational Genomics How much shotgun sequencing? So, can we really sequence the whole genome with this? (No, we can’t.) Lander and Waterman (1988): - Assuming random distribution of reads and ignoring repeat resolution issues, Define: G = genome length N = number of reads sequenced L = length of a single read T = minimum overlap to align the reads Then overall coverage is C = LN/G Coverage for any given base obeys the Poisson distribution: The number of gaps (bases with 0 coverage) is: together C C x e C PC x y C x! G Ne 8 April 2015 · Computational Genomics LN T 1 G L How much shotgun sequencing? 8 April 2015 · Computational Genomics How much shotgun sequencing? 8 April 2015 · Computational Genomics Pioneers of sequence assembly J. Craig Venter’s group at TIGR (later JCVI) - Created TIGR Assembler, Celera Assembler, and associated tools Jim Kent (UC Santa Cruz) - Created GigAssembler - Allowed the Human Genome Project to compete with Celera Philip Green’s group at the University of Washington - Created Phred, Phrap and Consed tools Sequencing centers: JCVI, Sanger Institute, Whitehead/MIT, DOE JGI, Baylor HGSC, WUSTL 8 April 2015 · Computational Genomics Why do we need to assemble genomes? 2nd Generation sequencing methods - Cheaper and more processive (sequence more data), but shorter read length - 454 Pyrosequencing: 200-600 nt average read length - Illumina: 50-70 nt average read length - ABI SOLiD: 50 nt average read length - Same idea: randomly hydrosheared library Random reads from across the genome form a big puzzle 8 April 2015 · Computational Genomics Next Generation Sequencing Technologies Sequencing by synthesis Method Read length Paired-end Coverage reads Sanger 1000 nt Yes, 2x500nt 454 200 (gen 1) 500 (gen 2) Yes, 2x40nt 20-40X Solexa/ Illumina 50 nt 100X SOLiD 35 nt Yes, 2x25nt 100X 2nd generation - 454 Pyrosequencing - Solexa/Illumina - SOLiD 3rd generation - Single-molecule sequencing 5-10X - Nanopore sequencing 8 April 2015 · Computational Genomics 454 Pyrosequencing A + PCR Reagents + Emulsion Oil B Mix DNA library & capture beads (limited dilution) “Break micro-reactors” Isolate DNA containing beads Create “Water-in-oil” emulsion Perform emulsion PCR 8 April 2015 · Computational Genomics 454 Pyrosequencing Load enzyme beads Load beads into PicoTiter™Plate PicoTiter™Plate Diameter = 44 μm 44 μm Depth = 55 μm Well size = 75 pl Well density = 480 wells mm-2 1.6 million wells per slide 8 April 2015 · Computational Genomics 454 Pyrosequencing Sequencing by synthesis Photons generated are captured by CCD camera Reagent flow Margulies et al., 2005 Raw sequencer output Sanger: Trace (usually .ab1/.scf file format) 454: Flowgram (.sff file format) Flow Order 4-mer 3-mer T A C G KEY (TCAG) Measures the presence or absence of each nucleotide at any given position 2-mer 1-mer 8 April 2015 · Computational Genomics Assembly algorithms Paradigms - Overlap-Layout-Consensus - De Bruijn graphs 8 April 2015 · Computational Genomics Differences between an overlap graph and a de Bruijn graph Schatz M C et al. Genome Res. 2010;20:1165-1173 ©2010 by Cold Spring Harbor Laboratory Press Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT.. Credit: Serafim Batzoglou Overlap: A pairwise alignment problem Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring Credit: Serafim Batzoglou Overlapping Reads • Sort all k-mers in reads (k ~ 24) • Find pairs of reads sharing a k-mer • Extend to full alignment – throw away if not >95% similar TACA TAGATTACACAGATTAC T GA || ||||||||||||||||| | || TAGT TAGATTACACAGATTAC TAGA Credit: Serafim Batzoglou Overlapping Reads and Repeats A k-mer that appears N times initiates N2 comparisons For an Alu that appears 106 times 1012 comparisons – too much Solution: Discard all k-mers that appear more than t Coverage, (t ~ 10) Credit: Serafim Batzoglou Finding Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA Credit: Serafim Batzoglou, Masahiro Kasahara Finding Overlapping Reads (cont’d) • Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA C: C: T: C: C: 20 35 30 35 40 C: C: C: C: C: 20 35 0 35 40 A: A: A: A: 15 25 40 25 A: A: A: A: A: 15 25 0 40 25 • Score alignments • Accept alignments with good scores Credit: Serafim Batzoglou Layout Repeats are a major challenge Do two aligned fragments really overlap, or are they from two copies of a repeat? Credit: Serafim Batzoglou The k-mer uniqueness ratio Schatz M C et al. Genome Res. 2010;20:1165-1173 ©2010 by Cold Spring Harbor Laboratory Press Merge Reads into Contigs repeat region Merge reads up to potential repeat boundaries Credit: Serafim Batzoglou Merge Reads into Contigs (cont’d) repeat region Ignore non-maximal reads Merge only maximal reads into contigs Credit: Serafim Batzoglou Merge Reads into Contigs (cont’d) repeat boundary??? sequencing error b a Ignore “hanging” reads when detecting repeat boundaries Credit: Serafim Batzoglou Merge Reads into Contigs (cont’d) ????? Unambiguous • Insert non-maximal reads whenever unambiguous Credit: Serafim Batzoglou Link Contigs into Supercontigs (aka scaffolds) Normal density Too dense: Overcollapsed? (Myers et al. 2000) Inconsistent links: Overcollapsed? Credit: Serafim Batzoglou Link Contigs into Supercontigs (cont’d) Find all links between unique contigs Connect contigs incrementally, if 2 links Credit: Serafim Batzoglou Link Contigs into Supercontigs (cont’d) Fill gaps in supercontigs with paths of overcollapsed contigs Credit: Serafim Batzoglou Link Contigs into Supercontigs (cont’d) Contig A d ( A, B ) Contig B Define G = ( V, E ) V := contigs E := ( A, B ) such that d( A, B ) < C Reason to do so: Efficiency; full shortest paths cannot be computed Credit: Serafim Batzoglou Link Contigs into Supercontigs (cont’d) Contig A Contig B Define T: contigs linked to either A or B Fill gap between A and B if there is a path in G passing only from contigs in T Credit: Serafim Batzoglou Consensus A consensus sequence is derived from a profile of the assembled fragments A sufficient number of reads is required to ensure a statistically significant consensus Reading errors are corrected Credit: Serafim Batzoglou Derive Consensus Sequence TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive multiple alignment from pairwise read alignments Derive each consensus base by weighted voting Credit: Serafim Batzoglou Mate pairs and paired-end reads Mate pairs: Circularize and trim size-selected fragments during library preparation. Inserts can be approx. 1, 5, 10, 20 Kbp long. Paired-end reads: Sequence a short amplified fragment from both ends. Fragment length is more precise but limited to about 300 bp. 8 April 2015 · Computational Genomics Mate pairs/Paired-end reads 8 April 2015 · Computational Genomics Paired end reads (aka mate pairs) 8 April 2015 · Computational Genomics Credit: 454 Life Sciences Base Calling and Trimming Base Calling: the process of translating the raw sequencer output into 1. The most likely nucleotide sequence 2. Confidence scores for each position Trimming: the process of removing adapter, key, vector, and/or low quality sequence from a read 8 April 2015 · Computational Genomics Reference-based assembly Reference-based assembly - Replaces overlap detection with alignment against a similar genome - Also called mapping, mapped assembly 8 April 2015 · Computational Genomics Credit: M. Schatz Reference-based assembly Use a related genome to ease the layout task - Much faster computationally - Arranges reads with more confidence, so a better assembly is possible - Allows other types of analysis: somatic mutations, organismal SNPs, structural variation, RNA-Seq, … 8 April 2015 · Computational Genomics Assembly quality control QC/QA Metrics: Size, number of contigs, N50 Diagnostic procedures 8 April 2015 · Computational Genomics Genome size as predicted from the assembly 8 April 2015 · Computational Genomics Read length, paired-end reads, coverage - Read length and paired-end reads matter. - Long reads can span repeat regions - Paired-end reads can reach into repeat regions and bridge gaps - Combination of the two maximizes shotgun sequencing performance - Coverage also matters. - High coverage allows very high confidence in base calling - Can do repeat resolution based on coverage fitting - More likely that a read will span an ambiguous region 8 April 2015 · Computational Genomics Scaffolding If paired end reads are available, scaffolding is already done. If not (our case)… - Sequenced relatives may exist (our case) - Use reference-based assembly to predict scaffolding - No ready-made tools available for this - Can be inaccurate Assemblers can get confused by repeats or overlaps that are too short - May be able to join by hand - Manual gap fill - Automated gap fill (no tools exist yet) 8 April 2015 · Computational Genomics Finishing Finishing is the process of completing the chromosome sequence. - Close all gaps (usually by PCR, but large gaps in big genomes can be sent back to make BACs for resequencing) - Re-sequence areas with less than 2x, 3x, 5x coverage (depending on quality standard) – same procedure as gaps - Check and manually assemble unresolved repeat regions - Check for mis-assembly by analyzing the overlap graph - Lots of Consed work! This is the most expensive and time-consuming part of sequencing. - Lots of small projects omit finishing and work with draft genomes 8 April 2015 · Computational Genomics Assemblers we used and our results Strain Reads Avg. coverage Bases Avg. read length A_PH_M13220 197,000 47.6M 20× 240 B_OR_M10699 420,000 82M 35× 195 C_NYC_M15141 379,000 94.3M 40× 245 W135_BF_M9261 605,000 64M 27× 105 Strain Newbler Newbler + AMOS + manual gap fill Contigs Nt, max length Contigs Nt, max length % gapfilled A_PH_M13220 332 2.06M 68K 57 (41) 2.3M 398K 1.8% B_OR_M10699 169 2.10M 114K 40 2.18M 435K 1.1% C_NYC_M15141 176 2.05M 170K 50 (34) 2.28M 759K 2.0% W135_BF_M9261 205 2.0M 152K 27 (79) 2.21M 866K 1.6% 8 April 2015 · Computational Genomics Homopolymer errors Specific to 454 pyrosequencing Sequencing errors usually result in frameshifts! Flow Order 4-mer 3-mer T A C G KEY (TCAG) Measures the presence or absence of each nucleotide at any given position 2-mer 1-mer 8 April 2015 · Computational Genomics Visualization tools: Mauve 8 April 2015 · Computational Genomics More topics Currently popular assemblers - Newbler (demo) - Velvet - ALLPATHS 2 - ABYSS - SHRiMP - Celera/WGS - PHRAP Other visualization tools (Consed, MAQ, Prospector 2, ABySSExplorer…) Microread assembly (Solexa and SOLiD) de Bruijn graph assembly paradigm 8 April 2015 · Computational Genomics