High Throughput Sequencing Technologies at YCGA
Shrikant Mane
Director, YCGA
Director, Keck Laboratory

Outline
First-generation sequencing technology
•Sanger sequencing
Current massively parallel sequencing strategies
“Second Generation”
•454
•Illumina
•Ion Torrent & Ion Proton
“Third Generation”
•Pacific Biosciences
•Oxford Nanopore
YCGA

Goals of biomedical investigation
•Understand normal, healthy and disease biology
•Enable prevention and early diagnosis of disease
•Enable new effective treatments

Utility of Next Generation Sequencing/genetics in medicine
•Unbiased approach to identify new pathways underlying basic physiology, health and disease

Evolution of genomic technologies
•Genetic mapping studies: discovery of genes for well-characterized Mendelian diseases.
•Dense SNP genotyping using microarray technology: GWAS for discovery of common variants in common disease.
•High throughput sequencing: discovery of rare variants in not previously recognized Mendelian diseases and common diseases.

In recent years there has been an explosion of research articles using next generation sequencing technologies
[Chart: Number of PubMed Articles; labels: Projected, HT Sequencing]
•DNA sequencing can provide a deeper understanding of DNA/RNA than any other technology
•Microarray technology revolutionized biomedical research, but has several limitations, which DNA sequencing may overcome
•As the cost of sequencing is rapidly decreasing, it is becoming affordable to perform sequencing at a genome level

First Generation: Sanger sequencing (1975-1977)
•1980 Nobel Prize in chemistry
•gels read by hand
•radiolabeled dideoxyNTPs
•one lane per nucleotide
•800 bp reads
•low throughput (several kb/gel)
•phi X 174: ~5300 bp

Second-generation sequencing
Massively parallel sequencing of millions of templates
•454/Roche
•Illumina
•Ion Torrent-Proton

Second Generation: Massively Parallel Sequencing.
•Throughput (24 hours): 2.8 Mb (Sanger); 60,000 Mb (HiSeq)
•Cost: $1500/Mb (Sanger); $0.06/Mb (HiSeq)
•Read lengths: ~800 bp (Sanger); ~100-600 bp (HiSeq-454)
•Error rates: <0.5% (Sanger); ~0.8-2% (HiSeq)

Illumina next generation sequencing platform
HiSeq 2500 Sequencing System
Fast turnaround and highest output in a single instrument
1 Instrument – 2 Run Modes (user configurable)
High Output Mode
•600 Gb in ~10.5 days
•Current v3 flow cell
•Current v3 reagents
•cBot required
•6 human genomes in 10.5 days
•Highest output
Rapid Run Mode
•120 Gb in ~1 day
•New 2-lane flow cell
•New reagents
•No cBot required
•1 human genome in a day
•Fastest turnaround

New sequencing platforms by Illumina
•HiSeq X Ten and HiSeq X Five: production-scale human whole genome sequencing; 18,000 genomes/year at $1,500 cost/genome
•HiSeq 3000/HiSeq 4000: up to 1.5 Tb/run; whole genome as well as other applications, including exome sequencing

Overall Illumina Sequencing Workflow
Sample Preparation
Sequencing Library Preparation
[Diagram: library construct with Adapter1, Sequencing Primer, Insert, Adapter2]
Cluster Generation
•Hybridizing library to flow cell
•Creating clusters from individual molecules
Sequencing by Synthesis
•Add all 4 bases with reversible terminators
•Image 4 colors
•Remove terminator, repeat

Genomic Sample Prep Workflow
Starting material: purified genomic DNA
1. Genomic DNA fragmentation: fragments of less than 800 bp
2. End-repair: blunt-ended fragments with 5’-phosphorylated ends
3. Klenow exo- with dATP: 3’-dA overhang
4. Adapter ligation: adapter-modified ends
5. Gel purification/bead: removal of unligated adapter
6. PCR: genomic DNA library
[Diagram: library construct with Adapter1, Sequencing Primer, Insert, Adapter2]

What is a Flow Cell?
•A flow cell is a thick glass slide with 8 channels or lanes
•Each lane is randomly coated with a lawn of oligos that are complementary to library adapters
[Diagram: P5 oligo and P7 oligo on the flow cell; library construct with Adapter1, Index, Adapter2, Sequencing Primer, Insert]

Reversible Terminator Seq Chemistry
•All 4 labeled nucleotides in 1 reaction (green, orange, red and blue)
Advantages of reversible terminators:
•Only one base is added at a time
•The fluor can be cleaved off after imaging. It therefore does not emit color at the next cycle, so only the newly added base (with its attached fluor) emits light
[Diagram: reversible terminator nucleotide showing the fluor, the cleavage site and the 3’ block; steps: Incorporation, Detection, Deblock and fluor removal (free 3’ OH end), Next cycle]

Illumina sequencing: Sequencing By Synthesis (SBS)
Cycle 1:
•Add sequencing reagents
•First base incorporated
•Remove unincorporated bases
•Detect signal/imaging
•Cleave off fluor and deblock
Cycle 2-n: Add sequencing reagents and repeat
•All four labeled nucleotides in one reaction
•High accuracy
•Base-by-base sequencing
•No problems with homopolymer repeats

Ion Torrent PGM and Proton
•Ion PGM™ Sequencer
•4 Ion Protons: coming soon
•First PostLight sequencing technology: instead of using light as an intermediary, PGM creates a direct connection between the chemical and the digital worlds.

The Chip is the Machine
•Uses semiconductor chips for sequencing.
•Ion PI chip: >165 million wells per chip; 8 to 10 Gb data per run
•Ion PII chips: ~100 Gb of data in ~4 hours

Base Calling
•When a nucleotide is incorporated into a strand of DNA, a hydrogen ion is released as a by-product.
•The hydrogen ion carries a charge, which the PGM’s ion sensor detects; that signal is used to call the base.
Ion Torrent technology video.
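The base-calling idea above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the instrument's actual algorithm: it assumes nucleotides are flowed over the chip one type at a time in a repeating order (the order "TACG" and the function names `flowgram` and `call_bases` are chosen here for illustration), and that the idealised signal in each flow equals the number of bases incorporated, so a run of identical bases gives one larger signal instead of several separate ones.

```python
def flowgram(read, flow_order="TACG", n_flows=8):
    """Idealised per-flow signal for the strand being synthesised.

    read       -- bases in the order the polymerase adds them
    flow_order -- assumed repeating order in which nucleotides are flowed
    n_flows    -- number of flows to simulate
    Returns a list of (nucleotide, incorporated_count) tuples.
    """
    pos = 0
    signals = []
    for i in range(n_flows):
        nuc = flow_order[i % len(flow_order)]
        count = 0
        # A homopolymer run is incorporated entirely within one flow,
        # releasing one hydrogen ion per base added.
        while pos < len(read) and read[pos] == nuc:
            count += 1
            pos += 1
        signals.append((nuc, count))
    return signals


def call_bases(signals):
    """Turn idealised (nucleotide, count) signals back into a sequence."""
    return "".join(nuc * count for nuc, count in signals)


signals = flowgram("TTACGGG", n_flows=4)
print(signals)
print(call_bases(signals))
```

In this idealised model the counts are exact integers. On a real instrument the signal is an analog measurement, and telling a run of, say, 6 identical bases from 7 is harder than telling 1 from 2, which is one way to read the "homopolymer detection" limitation listed for this platform.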
Advantages and Current Limitations
Advantages
•Low equipment cost
•Rapid run times: 3 to 4 hours
•Simple chemistry
Limitations
•Homopolymer detection
•Error rates
•Slow to introduce newer chips: overpromise
•PGM and Proton: two separate sequencing instruments
•Library prep: emulsion PCR/new protocols

Third generation sequencing: PacBio RS

The Third Generation Sequencing Platform: PacBio RS
•Pacific Biosciences has developed Single Molecule Real Time (SMRT™) DNA sequencing technology: PacBio RS.
•This technology enables, for the first time, the observation of natural DNA synthesis by a DNA polymerase as it occurs.
•It delivers long reads at the single-molecule level and a fast time to result, enabling a new paradigm in genomic analysis.

Pacific Biosciences SMRT® Technology
Technology Video

Key Applications for PacBio RS
Targeted sequencing
•SNP and structural variant detection
•Repetitive regions
•Full-length transcript profiling
De novo assembly and genome finishing
•Bacterial genomes
•Fungal genomes
•Gap-captured sequencing
•Targeted captured sequencing
Base modification detection
•Methylation
•DNA damage
**Projects at YCGA

YCGA PacBio RS

Comparisons Between PacBio RS and Illumina HiSeq
Sequencing chemistry
•PacBio RS (Third generation): Sequencing by synthesis (SBS), Single Molecule Real Time (SMRT)
•Illumina HiSeq (Second generation): Sequencing by synthesis (SBS)
Sequencing substrate
•PacBio RS: SMRT Cell made up of 150,000 ZMWs
•Illumina HiSeq: Flow cell made of 8 separate lanes
Data output per day
•PacBio RS: 1 to 2 billion bases/day, at $1.5/Mb
•Illumina HiSeq: 60 billion bases/day, at a cost of $0.06 per Mb
Read length
•PacBio RS: average up to 5 Kb
•Illumina HiSeq: 50 bp to 150 bp
Error rates
•PacBio RS: raw 10-15%; with 30x coverage, Q50 (< 0.01)
•Illumina HiSeq: 0.5 to 1%
Sample library
•PacBio RS: SMRT Bell template (single-strand circular DNA), 250 bp to 10 Kb insert
•Illumina HiSeq: dsDNA with adaptors (175 bp to 1 Kb)

Upcoming Technologies
[Diagram labels: Exonuclease; Protein nanopore (Alpha Hemolysin); Cyclodextrin; Electrically resistant lipid bilayer]
http://www.nanoporetech.com/news/movies#movie24-nanopore-dna-sequencing
•PromethION

Recent advances in nanopore sequencing
•Two types of nanopores: protein and synthetic (silicon nitride). Protein nanopores appear to be better at recognizing nucleotides.
•The rapid speed at which DNA strands pass through the tiny hole makes distinguishing bases more difficult. Currently an enzyme is used to control the rate.
•By shining a low-power green laser on a synthetic nanopore immersed in salt water, it is possible to manipulate DNA speed at will (Meller A. et al, Nat Biotech 2013). As the current increases, positive ions drag water molecules in the opposite direction of incoming DNA, acting as a brake and slowing its passage through the pore. As a result, nanoscale sensors in the pore would be able to read each nucleotide going into the pore more accurately.
•Using nanopores, long stretches of DNA can be zipped back and forth through the pore and read several times.
•Protein nanopores can also identify epigenetic changes.

Advantages
•Nanopores offer a label-free, electrical, single-molecule DNA sequencing method
•No costly fluorescent labeling reagents
•No need for expensive optical hardware and sophisticated instrumentation to detect DNA bases

Performance/Limitations?
•First data was released in Feb 2012. Since then, new data has been slow to appear.
•Very little data available for evaluation: high error rates (>5%)

The YCGA Laboratory at West Campus
YCGA was established in January 2009 through generous funding support and the strong commitment of Yale University and the School of Medicine
•Located in a newly renovated building.
•Approximately 7,000 Sq Ft laboratory and ~4,000 Sq Ft office space
•23 staff
[Photo: portion of the laboratory showing sequencing systems through the glass wall partition that separates the laboratory from the rest of the office and administrative area.]

Sequencing Platforms at YCGA
•11 Illumina HiSeqs (2000 and 2500)
•One PacBio RS
•One MiSeq
•Ion PGM™ Sequencer
YCGA has kept pace with cutting-edge sequencing technologies

Computer Infrastructure
•BulldogN: Dell Cluster with 200 Nodes/2,500 Cores
•Hitachi/BlueArc Scalable Storage: ~2.5 Petabytes

[Chart: GB Sequenced Quarterly, Q1 2010 to Q3 2014; y-axis 0 to 40,000 GB]

•204 PIs from 31 Yale Departments
•115 PIs from 81 Non-Yale Institutions from 14 countries

Types of samples processed and runs of sequence read lengths carried out at YCGA in a typical month.
[Pie chart: Sample Types (ChIP, gDNA, mRNA, Multiplex gDNA, Seq-cap)]
[Pie chart: Run Types (1x36, 1x76, 2x50, 2x76, 2x101)]

Need for strong R&D efforts for Next-Generation sequencing operation
• Optimization of sample preparation protocols for exome capture that have decreased the cost of a single human exome from $8,000 in 2009 to the current price of ~$500, while improving the quality of the data.
• Development of a highly efficient protocol to extract and repair DNA from formalin-fixed paraffin embedded blocks for exome analysis.
• Improved protocols for gDNA-seq, RNA-seq, and ChIP-seq that show higher data complexity than traditional protocols, allow users to start with less material, and cost less.
• Continuous improvements of various analysis pipelines

Whole-Genome vs.
Whole Exome Sequencing
•Protein coding genes (exome) constitute 1% of the human genome but harbor 85% of disease-causing mutations
•Significantly cheaper than sequencing the entire genome
• 2.1M probes cover ~300,000 exons of 19,000 genes
• Total covered bases: 44.1 Mb

Scientific and economic impact of high throughput sequencing at Yale

List of select publications resulting from next-generation sequencing at YCGA
•Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Bilguvar. Nature, v467, 2010
•A Novel miRNA Processing Pathway Independent of Dicer Requires Argonaute2 Activity. Cifuentes. Science, v328, 2010
•Mitotic recombination in ichthyosis causes reversion of dominant mutations in KRT10. Choate K. Science, v330, 2010
•Transcriptomic analysis of avian digits reveals conserved and derived digit identities in birds. Wang S. Nature, v477, 2011
•Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Lynch and Wagner. Nat Genet., v43, 2011
•K+ channel mutations in adrenal aldosterone-producing adenomas and hereditary hypertension. Choi M. Science, v331, 2011
•Recessive LAMC3 mutations cause malformations of occipital cortical development. Barak and Gunel. Nat Genet., v43, 2011
•Spatio-temporal transcriptome of the human brain. Kang and Sestan. Nature, v478, 2011
•Langerhans cells facilitate epithelial DNA damage and squamous cell carcinoma. Modi and Girardi. Science, v335, 2012
•Mutations in kelch-like 3 and cullin 3 cause hypertension and electrolyte abnormalities. Boyden et al. Nature, v482, 2012
•De novo point mutations, revealed by whole-exome sequencing, are strongly associated with Autism Spectrum Disorders. Sanders and State. Nature, v485, 2012
•Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Krauthammer. Nat Genet., v44, 2012
•Genomic Analysis of Non-NF2 Meningiomas Reveals Mutations in TRAF7, KLF4, AKT1, & SMO. Clark V et al. Science, v339, 2013
•De novo mutations in histone-modifying genes in congenital heart disease. Zaidi and Lifton. Nature, v498, 2013
•Recessive mutations in DGKE cause atypical hemolytic-uremic syndrome. Lemaire and Lifton. Nat Genet., v45, 2013
•Somatic and germline CACNA1D calcium channel mutations in aldosterone-producing adenomas and primary aldosteronism. Scholl and Lifton. Nat Genet., v45, 2013
•The evolution of lineage-specific regulatory activities in the human embryonic limb. Cotney and Noonan. Cell, v154, 2013
•Mutations in DSTYK and dominant urinary tract malformations. Sanna-Cherchi and Gharavi. N Engl J Med., 2013
•Nanog, Pou5f1 and SoxB1 activate zygotic gene expression during the maternal-to-zygotic transition. Lee et al. Nature, 2013
•Co-expression networks implicate human mid-fetal deep cortical projection neurons in the pathogenesis of autism. Willsey and State. Cell, 2013
•CLP1 Founder Mutation Links tRNA Splicing and Maturation to Cerebellar Development and Neurodegeneration. Schaffer AE and Gleeson JG. Cell, v157, 2014
•Exome sequencing links corticospinal motor neuron disease to common neurodegenerative disorders. Novarino G and Gleeson JG. Science, v343, 2014

Impact of High Throughput Sequencing: Grant Funding (partial list)
•Mendelian center grant, NIH: $12M (3y)
•Gilead cancer grant: $40M (4y)
•Brain tumor gift: $12M (4y)
•ARRA brain development (NIH): $3M (2y)
•ARRA kidney disease (NIH): $2M (2y)
•Simons autism sequencing: $4M (3y)
•Brain transcriptome (NIH): $10M (2y)
•Congenital heart disease (NIH): $5M (4y)
•Pediatric Cardiac Genomic Consortium: $2M (2y)
•Melanoma SPORE (NIH): $12M (5y)
•Biogen Inc. (PPMS): $2M
•VA: Schizophrenia/Bipolar disorder: $12M
•Yale Comprehensive Cancer Center: $14M
•Total: $128M

Use of genomics to tailor medical care to individuals based on their genetic makeup.
•Discovery: How and why? (elucidation of mechanism of cause; identification of cancer biomarkers; therapeutic targets)
•Diagnosis: Is it benign?
•Classification: Which class of cancer?
•Prognosis: What are my chances?
•Therapeutic Choice: Which treatment?

CLIA: The New Paradigm in Molecular Diagnostics
•Conventional molecular testing: gene by gene
•Genomic testing using exome analysis
•YCGA is carrying out clinical diagnostic work in collaboration with Dr. Allen Bale
•Over 1,000 exomes have been analyzed for various disorders

Challenges
•Sequencing a genome is simple; finding the cause of a disease is not
•The first clinical use of whole genome sequencing shows just how challenging it can be: a study of fraternal twins with a monogenic disorder
Genomes on prescription: Nature 2011
Bainbridge M, Sci Transl Med 2011

Acknowledgement
•Jim Noonan
•Yale University, School of Medicine and West Campus
•NHGRI: CMG
•YCGA staff

Questions?

Data Analysis Overview
•Primary Analysis
•Secondary Analysis
•Data Visualization

Primary and Secondary Analysis Overview
(Analysis type: software; outputs)
•Sequencing: ICS/RTA; Images/TIFF files
•Primary Analysis: ICS/RTA; Intensities, Base Calling
•Secondary Analysis: Alignments and Variant Detection

Cluster Generation: Amplification
Template hybridization and Initial Extension
•Grafted flowcell (P5 and P7 oligos)
•Template hybridization: >250-300 million single molecules hybridize to the lawn of primers
•Initial extension (3' extension)
•Denaturation: original template is washed away; single molecules remain bound to the flow cell in a random pattern

Cluster Generation: Amplification
•1st cycle denaturation
•1st cycle annealing: single strand flips over to hybridize to adjacent oligos to form a bridge
•1st cycle extension: hybridized primer is extended by polymerases
•2nd cycle denaturation: double-stranded bridge is denatured. Result: two copies of covalently bound single-stranded templates
•2nd cycle annealing
•2nd cycle extension
•n=35 cycles total

Cluster Generation: Linearization, Blocking and sequencing primer hybridization
•Cluster amplification
•P5 linearization: dsDNA bridges are denatured; complement strands are cleaved and washed away
•Block with ddNTPs: free 3’ ends are blocked to prevent unwanted DNA priming
•Denaturation and sequencing primer hybridization
•Sequencing

•Denaturation and hybridization
•Sequencing first read
•Denaturation and de-protection
•Resynthesis of P5 strand (15 cycles)
•P7 linearization
•Block with ddNTPs
•Denaturation and hybridization
•Sequencing second read
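The first-read and second-read steps above can be pictured with a toy example. This is a minimal sketch under simplifying assumptions, not Illumina's software: it ignores adapters, index reads and sequencing errors, and the names `reverse_complement` and `paired_end_reads` are made up for illustration. It relies only on the fact that the second read is taken from the opposite strand after resynthesis, so it corresponds to the reverse complement of the far end of the insert.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(insert, read_length):
    """Toy model of a paired-end run on one library insert.

    Read 1 covers the first `read_length` bases of the insert.
    Read 2 is read from the other strand, so it is the first
    `read_length` bases of the insert's reverse complement.
    """
    read1 = insert[:read_length]
    read2 = reverse_complement(insert)[:read_length]
    return read1, read2


insert = "ACGTTGCAAGGCTT"
read1, read2 = paired_end_reads(insert, read_length=5)
print(read1)  # ACGTT
print(read2)  # AAGCC
```

With a 2x101 run type, for example, `read_length` would be 101. When the insert is longer than twice the read length, the middle of the fragment is not sequenced, but the known distance between the two reads still helps place them during alignment.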