Next Generation Sequencing
The past, present, and future of DNA sequencing
*DNA sequencing: determining the number and order of nucleotides that make up a given molecule of DNA.
Alex V. Postma, PhD
Department of Anatomy, Embryology & Physiology
Academic Medical Center

(Relevant) Trivia
• How many base pairs (bp) are there in a human genome?
• How much did it cost to sequence the first human genome?
• How long did it take to sequence the first human genome?
• When was the first human genome sequence complete?
• Whose genome was it?

(Relevant) Trivia
• How many base pairs (bp) are there in a human genome? ~3 billion (haploid)
• How much did it cost to sequence the first human genome? ~$2.7 billion
• How long did it take to sequence the first human genome? ~13 years
• When was the first human genome sequence complete? 2000-2003

Genome Sequencing
• Goal: determine the order of nucleotides across a genome
• Problem: current DNA sequencing methods can handle only short stretches of DNA at once (<1-2 kbp)
• Solution: sequence, then use computers to assemble the small pieces

Genome Sequencing
[Figure: Genome, Short fragments of DNA, Short DNA sequences, Sequenced genome. Example short sequences: ACGTGGTAA, CGTATACAC, TAGGCCATA, GTAATGGCG, CACCCTTAG, TGGCGTATA, CATA…; assembled sequence: ACGTGGTAATGGCGTATACACCCTTAGGCCATA]

Sanger Sequencing
• Mix DNA with dNTPs and ddNTPs
• Amplify
• Run in gel
  – Fragments migrate a distance inversely related to their size (shorter fragments travel farther)

Sanger Sequencing
• Advantages
  – Long reads (~900 bp)
  – Suitable for small projects
• Disadvantages
  – Low throughput
  – Expensive

Sanger Sequencing
[Timeline, axis marked 1980, 1990, 2000]
• 1982: lambda virus, DNA stretches up to 30-40 kbp (Sanger et al.)
• 1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)
• 2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)
• 2007: Global Ocean Sampling Expedition, ~3,000 organisms, 7 Gbp (Venter et al.)

Next Generation Sequencing: Why Now?
• Motivation: HGP and its derivatives, personalized medicine
• Short-read applications: (re)sequencing, other methods (e.g. gene expression)
• Advancements in technology

High Parallelism is Achieved in Polony Sequencing
[Figure: Sanger versus Polony]

Generation of Polony array: DNA Beads (454, SOLiD)
DNA beads are generated using emulsion PCR

Generation of Polony array: DNA Beads (454, SOLiD)
DNA beads are placed in wells

Generation of Polony array: Bridge PCR (Solexa)
DNA fragments are attached to the array and used as PCR templates

Single Molecule Sequencing: HeliScope
• Direct sequencing of DNA molecules: no amplification stage
• DNA fragments are attached to the array
• Potential benefits: higher throughput, fewer errors

[Instruments shown: Genome Sequencer 20 (454), Ion Torrent, Genome Analyzer (Solexa), MinION]

Technology Summary
Technology   Read length   Sequencing technology   Throughput (per run)   Cost (per 1 Mbp)*
Sanger       ~800 bp       Sanger                  400 kbp                $500
454          ~400 bp       Polony                  500 Mbp                $60
Solexa       75 bp         Polony                  20 Gbp                 $2
SOLiD        75 bp         Polony                  60 Gbp                 $2
Helicos      30-35 bp      Single molecule         25 Gbp                 $1
*Source: Shendure & Ji, Nat Biotech, 2008

Comparing Different Technologies
Sanger Sequencing
Advantages
• Lowest error rate
• Long read length (~750 bp)
• Can target a primer
Disadvantages
• High cost per base
• Long time to generate data
• Need for cloning
• Small amount of data per run

Comparing Different Technologies
454 Sequencing
Advantages
• Low error rate
• Medium read length (~400-600 bp)
Disadvantages
• Relatively high cost per base
• Must run at large scale
• Medium/high startup costs

Comparing Different Technologies
Ion Torrent Sequencing
Advantages
• Low startup costs
• Scalable (10-1000 Mb of data per run)
• Medium/low cost per base
• Low error rate
• Fast runs (<3 hours)
Disadvantages
• New, developing technology
• Cost not as low as Illumina
• Read lengths only ~100-200 bp so far

Comparing Different Technologies
Illumina Sequencing
Advantages
• Low error rate
• Lowest cost per base
• Tons of data
Disadvantages
• Must run at very large scale
• Short read length (50-75 bp)
• Runs take multiple days
• High startup costs
• De novo assembly difficult

Comparing Different Technologies
PacBio Sequencing
Advantages
• Can use a single molecule as template
• Potential for very long reads (several kb+)
Disadvantages
• High error rate (~10-15%)
• Medium/high cost per base
• High startup costs

NGS Platforms Overview
• Differ in design and chemistries
• Fundamentally related: sequencing of thousands to millions of clonally amplified molecules in a massively parallel manner
• Orders of magnitude more information; will continue to evolve
• Attractive for clinical applications
  – Individual sequencing assays are costly and laborious: serial “gene by gene” analysis

Pacific Biosciences, Helicos Biosciences, NABsys, VisiGen Biotechnologies, Complete Genomics, Oxford Nanopore Technologies

What, When and Why
• Sanger: small projects (less than 1 Mbp)
• 454: de novo sequencing, metagenomics
• Solexa, SOLiD, HeliScope:
  – Gene expression, protein-DNA interactions
  – Resequencing

Sequencing the Human Genome
[Figure: Log10(price) versus Year, 2000-2010]
• 2001: Human Genome Project, $2.7B, 11 years
• 2001: Celera, $100M, 3 years
• 2007: 454, $1M, 3 months
• 2008: ABI SOLiD, $60K, 2 weeks
• 2009: Illumina, Helicos, $40-50K
• 2010: $5K, a few days?
• 2012: $100, <24 hrs?

Sequencing costs have fallen

Next Generation Sequencing Applications
• Mutation detection
• Foreign DNA detection
• Non-invasive diagnosis of aneuploidy
• Population characterization
• Cancer genetics
• Ancient DNA (Neanderthal)
• Expression analysis
• Transcription factor binding
• Chromosomal interaction
• Etc.

Exome Sequencing Identifies a Tibetan Adaptation
Yi et al. Science 2010
The widespread mutation in Tibetans is near a gene called EPAS1, a so-called “super athlete gene” identified several years ago and so named because some variants of the gene are associated with improved athletic performance.
The gene codes for a protein involved in sensing oxygen levels and perhaps balancing aerobic and anaerobic metabolism.

• Degraded state of the sample: mtDNA sequencing
• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)
• Problems: contamination with modern human DNA and co-isolation of bacterial DNA

NGS Application Examples: Inherited Conditions
• Discovery tool: single-gene disorders, e.g. autosomal dominant (AD) Kabuki syndrome (MLL2)
• Causative mutations for multigenic diseases: superior to the “one by one” approach of traditional sequencing
• Diagnostic advancements for diseases with overlapping symptoms and multiple possible syndromes/genes

Variant detection through next generation sequencing
Meyerson et al. NRG 2010

Inherited Conditions: Challenges and Opportunities
Challenges
Example: Monogenic disorders
• Novel missense mutations
• Germ line mosaicism
• Structural aberrations
• Imprinting effects
• Epigenetic factors
Opportunities
Example: Multifactorial disease
• Risk loci more often in non-coding or intergenic regions
• Pathogenicity of variants often unclear; less testing vs. monogenic disease
• Reference human genome cataloguing of variants = more test offerings

Sequencing of a Single Individual with Family Data
Lupski et al. NEJM 2010
[Figure panels: The First 8 Human Genomes; SNP Distribution in Proband; Nonsynonymous SNPs in Known Disease Genes]

NGS Application Examples: Neoplastic Conditions
• Cancer susceptibility genes
• Patient stratification
• Risk assessment
• Risk management
• Predictions of therapeutic response
• Personalized treatment
• Somatic/driver mutations
• Therapeutic monitoring
• Micro-RNAs
• Methylation
• Epigenetic changes
• Prognosis
• Alterations in gene expression
• Molecular profiling
• Tumor sub-typing

Exome Sequencing in Prostate Cancer
Barbieri et al. Nature Genetics 2012

Nonsynonymous Somatic Mutations in Neuroblastoma
Molenaar et al. Nature 2012

Mutation count associated with age, stage, and survival
Molenaar et al. Nature 2012

Next Generation Sequencing
• NGS diagnostics have shifted towards data analysis rather than the technical component
• NGS infrastructures must include appropriate expertise and computational hardware
• Unprecedented amounts of medical data and various processing algorithms necessitate adequate tools for:
  – Data management (alignment and assembly)
  – QC of image processing, base calling, filtering, alignment, SNP finding/application steps
  – Archiving

Considerations
• Evaluation of the variant positions “called” involves queries of all known relevant databases
• The lack of databases curated to clinical standards is likely the most significant challenge in managing and reporting genome sequencing data
• EHR considerations: test ordering, archiving of NGS reports, patient consent, data (reinterpretation?)

NGS Post-Analytical Considerations
• Expert interpretation and guidance: correlation of age, gender, clinical presentation, family history
• Team approach ideal: pathologists, geneticists, other providers
• Proficiency testing and alternative assessment are challenging
• Proficiency testing schemes based on NGS methods rather than specific genes are likely

Professional Considerations: Reimbursement and Gene Patents
• Challenging reimbursement issues
• Genome sequencing may potentially involve numerous patented gene sequences
• Development of an affordable system of common access to genes?
• What about mutations in known disease genes that are not evident from the patient's phenotype?
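
Worked Example: Assembling Short Reads
The Genome Sequencing slides describe sequencing short pieces and letting a computer assemble them. Below is a minimal sketch of that idea, assuming error-free reads and a simple greedy overlap-merge strategy. It is an illustration only, not the method of any particular assembler; production tools build overlap or de Bruijn graphs and must cope with sequencing errors and repeats. The names `overlap`, `greedy_assemble`, and the `min_len` threshold are hypothetical choices for this example. The input is the six complete short sequences from the Genome Sequencing figure; the truncated "CATA…" read is left out.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Assumes error-free reads; overlaps shorter than `min_len` are ignored
    (an arbitrary threshold chosen for this toy example).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Short sequences taken from the Genome Sequencing figure.
reads = ["ACGTGGTAA", "CGTATACAC", "TAGGCCATA",
         "GTAATGGCG", "CACCCTTAG", "TGGCGTATA"]
contigs = greedy_assemble(reads)
print(contigs)
```

On these six reads the sketch yields a single contig that matches the 33-base assembled sequence shown in the figure. Greedy merging can go wrong on repetitive genomes, which is one reason short reads make de novo assembly difficult.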