DNA sequencing: Basic idea

Background: test tube DNA synthesis
• DNA polymerase (a natural enzyme) extends double-stranded DNA along a single-stranded template

      primer    5’ TTACAGGTCCATACTA -->  (primer extension by polymerase)
      template  3’ AATGTCCAGGTATGATACATAGG 5’

• You can buy DNA polymerase and do this in a tube.

cse587A/Bio 5747: L2 1/19/06

[Figures: DNA sequencing, cont. (animation)]

Modern Sanger sequencing

Dye terminator sequencing
• Fluorescent label on the terminator, not the primer
• Different colors for ddA, ddC, ddG, ddT
• Run all 4 reactions in a single lane
• Image under 4 colors of laser

Capillary electrophoresis
• Each sequence is sized through a separate, thin tube (capillary)
• Avoids lane-tracking errors

Automated readout: Phred

Limitations of technology
• Error-prone, especially at the beginning and end of a read
  – But Phred estimates the error probability
• Not useful beyond 500-800 bp

[Figures: whole chromatogram (trace); start of trace; end of trace]

Base calling, assembly, editing

Software tools
• PHRED calls bases from traces, producing reads
  – Estimates an error probability for each base (quality values)
• PHRAP assembles reads into a longer sequence
  – Uses quality values
  – Not intended for whole-genome assembly
• Research on assembly algorithms is ongoing

Sequencing Genomes
Michael Brent
Dept. of Computer Science
Washington University

Why sequence a genome?

Cool technology

Infrastructure for molecular science
• E.g. cloning & studying a gene of interest
• “Parts list for the human body”

Genome science
• Evolution and dynamics of genomes

Medicine
• Genomic causes of disease and health

Which genomes?
How can I sequence a genome?

Shotgun sequencing: simple version
1. Cut your DNA at random locations
2. Get ~700-800 bp of sequence from the end of each fragment: AAGTCGTGGG…
3. Use overlapping sequences to reassemble

Step 1: cutting & cloning
A. Cut/break the DNA
   • Physical shear (put it in a blender), or
   • Restriction digest
B. Separate fragments by size & select

1C. Clone selected fragments

2. Sequence random clones
• Pick a clone from the plate; each clone contains copies of a single insert
• Separate the plasmids from the cells
• Sequence the inserts using primers complementary to the vector

3. Assemble fragments

Idea
• Common end sequences may indicate overlap in the original sequence

      overlapping shotgun sequences
      …CTGACTAAGTCATGTTACAG
                     TTACAGCAGGTATGATA…

      assembled sequence
      …CTGACTAAGTCATGTTACAGCAGGTATGATA…

3. Assemble fragments

Problems
• Sequencing error may obscure a true overlap
• Common end sequences can occur by chance
• Repeats: DNA of higher eukaryotes contains many copies of nearly identical sequences
  – This means overlaps are often between different copies of the same repeat element
  – Repeats are the major issue in sequencing
• Polymorphism

Genome assembly

Challenge
• Sequencing reads can’t be assembled from overlapping ends that fall inside long repeats

      …CTGACTAAGTCATGTTACAG
                     TTACAGCAGGTATGATA

• Overlaps may come from different repeat copies
• This leads to large-scale misassembly
• Polymorphic mismatches may prevent good joins

Single-molecule sequencing
• Since ~2007, we can sequence individual molecules without cloning
1. Many molecules are attached to a surface and copied, each forming a cluster of identical templates
2. Reversible dye terminators are incorporated according to the template (1 base per cycle)
3. The slide is imaged sequentially under lasers of 4 colors, showing which dye was incorporated at each cluster
4. The terminator is cleaved off and 2nd-strand synthesis continues for the next cycle
• Each cycle is one position in the sequence
• 10^8 50-nt reads / 2-day run (Solexa)
• 10^6 400-nt reads / 5-day run (454)
• For Sanger, ~10^3 700-nt reads / day
• Read-length vs. throughput tradeoff
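
The read-length vs. throughput tradeoff can be made concrete with a little arithmetic. The sketch below is only an illustration: it takes the read counts, read lengths and run times quoted on the slide at face value and converts them to raw bases per day. The names PLATFORMS and bases_per_day are made up for this example, and the numbers ignore quality filtering, overlap between reads, and cost.

```python
# Figures as quoted on the slide: (reads per run, read length in nt, run length in days).
# Sanger is quoted per day, so its run length is taken to be 1 day (an assumption).
PLATFORMS = {
    "Solexa": (10**8, 50, 2),
    "454": (10**6, 400, 5),
    "Sanger": (10**3, 700, 1),
}


def bases_per_day(reads_per_run, read_length_nt, run_days):
    """Raw sequenced bases per day: reads x read length / run time."""
    return reads_per_run * read_length_nt / run_days


for name, spec in PLATFORMS.items():
    print(f"{name:>6}: {bases_per_day(*spec):.1e} bases/day")
# Solexa: 2.5e+09 bases/day
#    454: 8.0e+07 bases/day
# Sanger: 7.0e+05 bases/day
```

On these figures the shortest reads deliver by far the most bases per day, which is the tradeoff the last bullet points to: longer reads are easier to assemble, shorter reads are much cheaper to produce in bulk.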
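
The overlap idea from the "Assemble fragments" slides can be sketched in a few lines. This is a toy, not how PHRAP or any real assembler works: it uses exact suffix/prefix matching only, and the function names (longest_overlap, merge_reads) and the min_len threshold are hypothetical choices for illustration. The two reads are modeled on the slide's example.

```python
def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that exactly equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the longest overlap.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left, right, min_len=3):
    """Join two reads on their longest exact overlap, or return None if there is none."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]


left = "CTGACTAAGTCATGTTACAG"
right = "TTACAGCAGGTATGATA"
print(longest_overlap(left, right))  # 6  (the shared end "TTACAG")
print(merge_reads(left, right))      # CTGACTAAGTCATGTTACAGCAGGTATGATA
```

The problems listed on the slides show up directly in this sketch: a single miscalled base inside the overlap makes the exact match fail, a 6-base match can easily occur by chance, and two reads ending in copies of the same repeat would be joined even if they come from different places in the genome.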
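
For the quality values mentioned under "Base calling, assembly, editing": Phred-style quality values are conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below only converts between the two scales and sums per-base error probabilities; the function names and the example quality list are hypothetical, and nothing here reproduces how PHRED itself estimates P from a trace.

```python
import math


def error_prob_to_quality(p_error):
    """Quality value Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    return -10.0 * math.log10(p_error)


def quality_to_error_prob(q):
    """Error probability P = 10 ** (-Q / 10) for a quality value Q."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read: the sum of per-base error probabilities."""
    return sum(quality_to_error_prob(q) for q in qualities)


for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {quality_to_error_prob(q):.3g}")

# Hypothetical read: low quality at both ends, high quality in the middle,
# mirroring the "error-prone at beginning & end" pattern described on the slides.
example_qualities = [10, 10, 20, 40, 40, 40, 40, 20, 10, 10]
print(f"expected errors: {expected_errors(example_qualities):.2f}")
```

On this scale Q20 means a 1-in-100 chance of a wrong call and Q30 a 1-in-1000 chance, so the low-quality ends of a read dominate its expected error count.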