Bioinformatics and Comparative Genome Analysis
Monday, March 19th, 2007, Tunis

Molecular biology story: DNA, "the Queen molecule"
Odile Ozier-Kalogeropoulos
Institut Pasteur, Université Pierre et Marie Curie
E-mail: odozier@pasteur.fr

Introduction

Genomes: two views
- View of genomes for biologists
- View of genomes for computer scientists
http://www.pasteur.fr/externe
http://genetique.snv.jussieu.fr
Pasteur Genopole® Île-de-France, Plate-forme technologique 4

DNA molecule: two views
- View 1: James Watson and Francis Crick (1953)
- View 2: the two strands, labelled 5'-3' and 3'-5'

DNA sequence: one view

Sequencing DNA, "the Queen molecule"
Most sequencing methods are based on the natural systems that living cells use to copy and repair their own genomes.

Reminder! Cell DNA synthesis
The main role of DNA polymerase
[Figure: cell DNA synthesis by DNA polymerase, steps 1 to 4, with strand ends labelled 5' and 3']
http://www.snv.jussieu.fr/vie/dossiers/sequencage/sequence.htm

1 Foundation of the current state-of-the-art production genome sequencing
The Sanger method (1977): 30th year celebration!

The Sanger method
Four steps: DNA isolation, Sample preparation, Sequence production, Assembly and analysis
Focus on sequence production
[Figure: the Sanger sequencing reaction with DNA polymerase]
http://www.snv.jussieu.fr/vie/dossiers/sequencage/sequence.htm
Fragment separation by electrophoresis on acrylamide gel (resolution: 1 base)
Reading progression

2 Current state-of-the-art production genome sequencing in high-throughput sequencing centers
Sanger production-scale genome sequencing requires 4 successive steps:
1 DNA isolation (Laboratory)
2 Sample preparation (Laboratory)
3 Sequence production (Robots)
4 Assembly and analysis (Computers, Humans)
Chan E.Y.
(2005), Mutation res, 573, 13-40

DNA isolation and sample preparation: Laboratory
Sequence production: Sequencing robots
[Photos: Lab technician working with sequencing machines; Room filled with sequencing machines; Lab with sequencing machines; Close up of capillaries from a capillary sequencing machine. Courtesy of Celera Genomics]
Assembly and analysis: Computers
[Photo: Plate-forme Génomique, Institut Pasteur]

3 Sequencing statistics
http://www.genomesonline.org
[Chart: Bacteria, Archaea, Eukarya, Metagenomes]
High-throughput sequencing centers by country
[Chart: USA, UK, F, others]

4 Why continue sequencing?
- Comparative genomics
- Impact on biomedical research
- The personal genome project

Figure 1 | Evolutionary relationship between metazoans that are sequenced or due for sequencing. The simplified phylogenetic relationships between the metazoans for which the complete, or nearly complete, genome sequences are available or will be available soon. Evolutionary distances (in million years).
Abel Ureta-Vidal, Laurence Ettwiller & Ewan Birney (2003), Nature Rev. Genet., 4, pp 251-262

- International sequence databases: sequence fragments of 100 000 species
- Estimated number of species: at least 14 million
Shendure, 2004 and Wikipedia
[Chart: Number of sequences in GenBank (log scale); molluscs, worms...]
The phylogenetic sequence deficit for the Metazoa
Mark Blaxter, 2002

- Single Nucleotide Polymorphism (SNP)
HapMap Project
A freely available public resource to increase the power and efficiency of genetic association studies to medical traits.
High-density SNP genotyping across the genome provides information about:
- SNP validation, frequency, assay conditions
- correlation structure of alleles in the genome
Mark J. Daly, PhD

[Figure: Associated alleles reported and tag SNPs. Studies: Kirov 2004, Straub 2002, Van den Oord 2003, Williams 2004, Bray 2005, Schwab 2003, Van den Bogaert 2003, Funke 2004. Alleles: AGGCCA, AAGCCT, AGGCCT, AGGCCA, AGATTA, GGATCA. Mark J. Daly, PhD]

Sequencing of individual human genomes as a component of preventative medicine
The National Human Genome Research Institute (NHGRI) solicits grant applications to develop novel technologies that will enable extremely low-cost genomic DNA sequencing (2005-2006).
A genome: $1000
Revolutionary Genome Sequencing Technologies: The $1000 Genome, for 2015
[Chart: cost scale US$ 0.001, US$ 1, US$ 10 000, with a "Today" marker]
Chan E.Y.
(2005), Mutation res, 573, 13-40

5 Improvements of the Sanger method during these 30 years
Steps: DNA isolation, Sample preparation, Sequence production, Assembly and analysis
- Production of template DNA
- Labelling: radioactivity / fluorescent dyes
- Analysis of the DNA fragments produced: radioactivity detection / laser within an automated DNA sequencing machine
- Electrophoresis: acrylamide gel / capillaries

DNA isolation - Production of template DNA around 1985
Need for single-stranded DNA for sequencing
M13 is a filamentous bacteriophage specific to Escherichia coli
[Figure labels: (+) SS = (+) single strand; RF = (+/-) replicative form. Nick at a specific site in the (+) single strand. Synthesis of the (+) single strand by rolling circle replication.]
Replication of bacteriophage M13 DNA in infected bacteria
- Sequencing of pure single-stranded DNA from recombinant M13 particles

DNA isolation - Production of template DNA around 1990
Double-stranded DNA from recombinant plasmids or PCR products, denatured by heat or alkali for sequencing

DNA isolation - Recent improvement of template DNA production
Multiple displacement amplification
Phi29 DNA polymerase is the replicative polymerase from the Bacillus subtilis phage phi29.
DNA templates can be amplified 10 000-fold in a few hours.
(Blanco, L. and Salas, M. (1984) Proc. Natl. Acad. Sci. USA, 81, 5325-5329)

Principle: scheme for multiply-primed rolling circle amplification (Dean et al, 2001)
- Random oligonucleotide primers complementary to the amplification target circle
- DNA polymerase and deoxynucleoside triphosphates (dNTPs)
- Strand displacement DNA synthesis for more than 70 000 nucleotides without dissociating from the template
- Error rate: 1 in 10^6 to 10^7 nucleotides (in contrast to about 3 in 10^4 for PCR with Taq DNA polymerase)
Blanco, PNAS, 1989

DNA isolation - Applications of multiple displacement amplification
1. Whole human genome amplification using this method
2. Sequencing the genome of a single cell

1. Whole human genome amplification using this method
Phi29 DNA polymerase is able to amplify linear DNA (Dean et al, PNAS, 2002)
Cascading strand displacement: circular DNA, linear DNA
DNA amplification yield after MDA:
- 1-10 copies of human genomic DNA
- 20-30 µg product
- 18 hours at 30°C
For:
- Genome sequencing
- Genetic analysis on blood, microdissected tissues...
- Prenatal diagnosis
- Anthropological samples...

2. Sequencing the genome of a single cell (Zhang et al, Nature Biotech, 2006)
Single-cell genomics, Clyde A Hutchison III & J Craig Venter, Nature Biotechnology 24, 657-658 (2006), doi:10.1038/nbt0606-657
Phi29 DNA polymerase is the replicative polymerase from the Bacillus subtilis phage phi29. This polymerase has exceptional strand displacement and processive synthesis properties. The polymerase has an inherent 3'->5' proofreading exonuclease activity (Blanco, L. and Salas, M.
(1984) Proc. Natl. Acad. Sci. USA, 81, 5325-5329)

Figure 1. Sequencing the genome of a single cell. A single cell is isolated by dilution or by cell sorting. The cell is lysed and the chromosome is denatured by alkaline treatment. The cellular DNA is amplified >10^9-fold by multiple displacement amplification (MDA) using random primers. The hyperbranched DNA product is resolved by shearing and enzymatic treatments, then cloned and shotgun sequenced. Ideally, a complete genome sequence could be assembled from the data and then annotated.

A pioneering work and a new world: polymerase cloning, "ploning"
The authors refer to the DNA populations amplified from a single cell as polymerase clones, or "plones".
Two limitations in these first experiments:
- Bias in "plonable" amplification
- Chimeric plones (about 6%)
(Zhang et al, Nature Biotech, 2006)

Most of the diversity of the biosphere remains unsampled. The ability to sequence an entire genome from a single uncultured cell should make it possible to reveal this enormous biodiversity.
Metagenomics
(Zhang et al, Nature Biotech, 2006)

6 Alternatives to the Sanger method
Sequencing single molecules of DNA
Reminder!
The Sanger method is based on the analysis of populations of DNA molecules.
Sequence production - Analysis of the DNA fragments produced: radioactivity detection / laser within an automated DNA sequencing machine

Cycle extension method on single molecules
1- Template DNA is arrayed on a surface or in wells.
2- Sequencing reaction steps, including nucleotide incorporation and washes, are performed to identify each base pair.
3- The extended base pair is detected by fluorescence or luminescence.
[Figure: sequential base incorporation steps; template, primer, surface]
Chan E.Y. (2005), Mutation res, 573, 13-40

Main features of cycle extension methods compared to Sanger:
- Massive parallelism
- Short read lengths
- Potential for cost reduction

Pyrosequencing is the best-known cycle extension method
[Figure: pyrosequencing principle]
From Biotage, http://www.pyrosequencing.com

The two main problems of pyrosequencing
a, Read length distribution for the 306,178 high-quality reads of the M. genitalium sequencing run. This distribution reflects the base composition of individual sequencing templates. b, Average read accuracy, at the single read level, as a function of base position for the 238,066 mapped reads of the same run.

Pyrosequencing: massive parallelism
Genome sequencing in microfabricated high-density picolitre reactors
Margulies et al, 2005, Nature, 437, pp 376-380
- Genomic DNA is fragmented, ligated to adapters and separated into single strands.
- Fragments are bound to beads under conditions that favour one fragment per bead.
- The beads are captured in droplets of a PCR-reaction-mixture-in-oil emulsion.
- PCR amplification occurs within each droplet.
- At the end of the PCR reaction, each bead carries 10 million copies of a unique DNA template.
- The emulsion is broken, the DNA strands are denatured and the beads carrying single-stranded DNA clones are deposited into wells of a fibre-optic slide.
- Smaller beads carrying the immobilized enzymes required for pyrosequencing are deposited into each well.

Sequencing instrument (Margulies et al, 2005)
a) Fluidic assembly
b) The well-containing fibre-optic slide
c) Computer providing the user interface and the instrument control

De novo assembly of bacterial genomes (Margulies et al, 2005)
Test on Mycoplasma genitalium (580 000 bp)
Density of wells: 480 per mm2
Total number of wells on a slide: 1.6 million!
14 hours!

7 Sequencing or resequencing?
- Sequencing: for studies of genomes of unknown species, needing long read lengths
- Resequencing: for individual studies using a known genome as a guide

Comparison of sequencing methods: Sanger method (ABI 3730xl) and 454 technology
Adapted from Chan E.Y.
(2005), Mutation res, 573, 13-40

Choice of sequencing method: example of Neanderthal DNA
DNA from a fragment of a 38 000-year-old Neanderthal fossil found in 1980 in Vindija cave (Croatia)
Neanderthal DNA constraints:
- Rare, short DNA fragments
- Much contamination
Advantages of pyrosequencing:
- No bacterial cloning
- No template competition for amplification
- Read length about 200 bp
- Each sequenced product stems from just one original single-stranded template molecule of known orientation (unlike PCR)
Green R.E. et al, 2006

Principle
Lambert and Millar (2006), Green et al (2006)
http://WWW.454.COM/

Results
Analysis of one million base pairs of Neanderthal DNA
- Location on the human karyotype of Neanderthal DNA
- Schematic tree illustrating the number of nucleotide changes inferred to have occurred on hominoid lineages
Green et al (2006)

Conclusions
- Sequencing today is performed in big centers.
- The number of sequences is growing exponentially...
- But the bottleneck remains the analysis of sequences...
Precisely, the goal of the present course, "Bioinformatics and Comparative Genome Analysis", is to give you tools to contribute to improvements in this field...
So... Good work on the Queen molecule!
Thanks to the organizers!
And thanks for your attention!

Plan of the course
5 Improvements of the Sanger method during these 30 years
A. Generalities
1.
DNA isolation: Production of template DNA
2. Sequence production:
- Labelling: radioactivity / fluorescent dyes
- Analysis of the DNA fragments produced: radioactivity detection / laser within an automated DNA sequencing machine
- Electrophoresis: acrylamide gel / capillaries
B. Details
1. DNA isolation: Production of template DNA around 1985
- Need for single-stranded DNA for sequencing
- M13 is a filamentous bacteriophage specific to Escherichia coli
2. DNA isolation: Production of template DNA around 1990
- Double-stranded DNA from recombinant plasmids or PCR products, denatured by heat or alkali for sequencing
3. DNA isolation: Recent improvement of template DNA production
- Multiple displacement amplification: Phi29 DNA polymerase is the replicative polymerase from the Bacillus subtilis phage phi29
- Applications of multiple displacement amplification:
  - Whole human genome amplification using this method. For: genome sequencing, genetic analysis on blood, microdissected tissues..., prenatal diagnosis, anthropological samples...
  - Sequencing the genome of a single cell: polymerase cloning, "ploning". For: most of the diversity of the biosphere remains unsampled. The ability to sequence an entire genome from a single uncultured cell should make it possible to reveal this enormous biodiversity. Metagenomics

Plan of the course (continued)
6 Alternatives to the Sanger method: Sequencing single molecules of DNA
Reminder! The Sanger method is based on the analysis of populations of DNA molecules
1. Cycle extension method on single molecules
1- Template DNA is arrayed on a surface or in wells.
2- Sequencing reaction steps, including nucleotide incorporation and washes, are performed to identify each base pair.
3- The extended base pair is detected by fluorescence or luminescence.
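The three steps above can be sketched as a small simulation. This is only a toy model in Python, not the chemistry or software of any real instrument. It assumes a pyrosequencing-style variant in which one nucleotide type is offered per step and the signal is exactly proportional to the number of bases incorporated. The function name `cycle_extension_read`, the flow order "TACG" and the noise-free detector are illustrative assumptions.

```python
# Toy model of a cycle extension method (illustrative assumptions only).
# One nucleotide type is offered per step; the signal is the number of
# bases incorporated in that step (noise-free, unlike a real detector).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def cycle_extension_read(template, flow_order="TACG", max_cycles=100):
    """Return (flowgram, read) for a template written 3'->5'.

    The primer is assumed to end just before template[0], so the new
    strand grows 5'->3' as the complement of the template.
    """
    position = 0      # index of the next template base to be copied
    flowgram = []     # list of (nucleotide offered, signal) pairs
    read = []         # bases called from the signals
    for _ in range(max_cycles):
        for nucleotide in flow_order:
            # Step 2: offer one nucleotide type, then wash.
            signal = 0
            while (position < len(template)
                   and COMPLEMENT[template[position]] == nucleotide):
                signal += 1      # a run of identical bases extends in one step
                position += 1
            # Step 3: detection; the signal gives the number of bases added.
            flowgram.append((nucleotide, signal))
            read.append(nucleotide * signal)
            if position == len(template):
                return flowgram, "".join(read)
    return flowgram, "".join(read)


if __name__ == "__main__":
    flowgram, read = cycle_extension_read("ATGGC")
    print(read)       # TACCG
    print(flowgram)   # [('T', 1), ('A', 1), ('C', 2), ('G', 1)]
```

In this sketch the two consecutive G bases of the template are copied in a single step and show up as a signal of 2. A real detector has to estimate that number from light intensity, which is one reason why runs of identical bases are harder to call than isolated bases.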
- Main features of cycle extension methods compared to Sanger:
  - Massive parallelism
  - Short read lengths
  - Potential for cost reduction
- Pyrosequencing is the best-known cycle extension method
  - Principle
  - Two main difficulties
  - Pyrosequencing: massive parallelism. Genome sequencing in microfabricated high-density picolitre reactors
  - Instrumentation
  - Example: Mycoplasma genitalium (580 000 bp)
7 Sequencing or resequencing?
1. Sequencing: for studies of genomes of unknown species, needing long read lengths
2. Resequencing: for individual studies using a known genome as a guide
3. Comparison of sequencing methods
4. Choice of sequencing method: example of Neanderthal DNA
Conclusions
- Sequencing today is performed in big centers.
- The number of sequences is growing exponentially...
- But the bottleneck remains the analysis of sequences...