Current Sequencing Technologies and Data Generation

Corbin Jones & Piotr Mieczkowski
Department of Biology, College of Arts and Sciences, Carolina Center for Genome Sciences
Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill

NEXT-GENERATION SEQUENCING (DEEP SEQUENCING) PLATFORMS

o Short reads
  1. Genome Analyzer IIx (GAIIx), HiSeq 2000, HiSeq 2500, MiSeq – Illumina
  2. SOLiD 5500xl System – Applied Biosystems
  3. HeliScope™ Single Molecule Sequencer – Helicos
o Long reads
  1. Genome Sequencer FLX System (454) – Roche
  2. PacBio RS – Pacific Biosciences
  3. Personal Genome Machine, Ion Proton – Ion Torrent
  4. GridION – Oxford Nanopore
o Mapping sequences to large DNA fragments
  1. NABsys
  2. Bionanomatrix

UNC – HTSF

• 9 HiSeq 2000/2500
• 1 GA II
• PacBio
• Ion Torrent
• MiSeq (Jeff Dangl)

Liz Buda and Donghui Tan

Also on campus:
- 454 (Microbiome)
- 454 jr. (Viral genomics)
- MiSeq – Kevin Weeks

What type of sequencing should I choose for an Illumina sequencing project?

HiSeq 2000/2500 – 100-160 million single-end sequencing reads per lane.
- ChIPseq – Single End 50 cycles (2-3 human samples per lane)
- RNAseq – Single End 50 cycles (2-3 human samples per lane)
  If you are interested in splicing variants and fusion genes, both Single End 100 cycles and Paired End 2x50 cycles are better options.
- Whole Genome Sequencing – Paired End 2x100 cycles (2-3 lanes per genome)
- Exome Capture – Paired End 2x100 cycles (4 samples per lane)

MiSeq – 3-7 million single-end sequencing reads per lane. Custom projects, fast turnaround.
- Metagenomics, 16S profile – Paired End 2x150 cycles, up to 24 samples per lane
- Whole Microbial Genome Sequencing – Paired End 2x150 cycles

SHORT READ PLATFORMS at UNC

HiSeq 2000
- Initially capable of up to 600 Gb per run in 13 days.
- Cost of resequencing one human genome (30x coverage):
  Now for UNC PIs: about $6,000
  Now for users outside UNC: about $9,000

HiSeq 2500
- Initially capable of up to 100 Gb per run in 27 hours.
- Cost per genome: ???

MiSeq
- Small-capacity system. PE 2x150 cycles in 27 hours.
- PE 2x250 bp coming soon: error rate for read 1 less than 1%; read 2 about 1.2%.
- In preparation: PE 2x400 bp, error rate for read 1 about 2%; read 2 about 4%.
- In preparation: longer insert sizes possible (1.5 kb).

PacBio RS
Single molecule resolution in real time

• Short waiting time for result and simple workflow
  – Generate basecalls in <1 day
  – Polymerase speed ≥1 base per second
• No amplification required
  – Bias not introduced
  – More uniform coverage
• Direct observation
  – Distinguish heterogeneous samples
  – Simultaneous kinetic measurements
• Long reads
  – Identify repeats and structural variants
  – Less coverage required
• Information content
  – One assay, multiple applications
    • Genetic variation (SVs to SNPs)
    • Methylation
    • Enzymology

C2 chemistry – installed March 2012
- Long reads, 6-10 kb
- Median size of molecules 3 kb
- Still 15% error rate
- No strobe sequencing

Software focus on:
- De novo assembly
- High-quality CCS consensus reads

In preparation:
- Loading long molecules by magnetic beads
- Detection of modified nucleotides

PacBio RS – two sequencing modes

LS – long sequencing reads
Sample preparation: Standard
• Large insert sizes (2 kb-10 kb)
• Generates one pass on each molecule sequenced

CCS – high-quality sequencing reads
Sample preparation: Circular Consensus
• Small insert sizes (500 bp)
• Generates multiple passes on each molecule sequenced

Example Data: 1 SMRT Cell

                        Pre-Filter        Post-Filter
# of Bases              180,320,136 bp    165,424,592 bp
# of Reads              75153             52801
Mean Readlength         2399 bp           3133 bp
Mean Read Quality       0.624             0.827

% Adapter Dimer (0-10 bp)     1.94 %
% Short Insert (11-100 bp)    0.47 %

Personal Genome Machine – Ion Torrent (Life Technologies)

Three types of semiconductor chips:
- 314 – 20 Mb
- 316 – 200 Mb
- 318 – 1 Gb

Read length depends on base composition: 200-250 bp (200 cycles).
System is enabled for Paired End 2x100 cycles.
The fastest sequencing system on the market.

Recommendation: resequencing applications that require fast turnaround of samples
- Amplicons (PCR products)
- Small and medium-size genomes
- Custom DNA capture applications

How it works: an H+ ion is released during base incorporation. Individual polymerases attached to beads are positioned in tiny wells that rest on a tiny pH meter.

PGM/Ion Torrent Data
316 chip Thr.

Total Number of Bases [Mbp]      77.65
  ‣ Number of Q17 Bases [Mbp]    36.11
  ‣ Number of Q20 Bases [Mbp]    27.33
Total Number of Reads            368,860
Mean Length [bp]                 211
Longest Read [bp]                380

Library Preparation from Low Quantities of DNA or RNA
Microfluidics: stationary and portable systems

Mondrian SP System – NuGEN Technologies
- Human libraries from 5 ng of total DNA. Only 10-15% duplicate reads.
- Ultralow DNA library systems
Soon:
- Ultralow RNA library systems
- Libraries from total RNA with rRNA depletion

Advanced Liquid Logic from RTP

Emerging Sequencing Technologies
- Semiconductor sequencing chip
- Nanopore / Nanochannel sequencing

Ion Proton System – human genome in one day
- Cost of reagents: $1000 per run
- Error rate around 1.2%
- Human genome, RNAseq, ChIPseq
- Ion Proton Chip I – 10 Gb (whole exome capture experiments)
- Ion Proton Chip II – 100 Gb (whole human genome resequencing)

Oxford Nanopore – new view on sequencing
- Hemolysin pore: inner diameter of 1 nm, about 100,000 times smaller than the diameter of a human hair.
- Oxford Nanopore DNA sequencing: error rate 4%, prediction for end of the year 0.1-2%.
- Nanopore array

Oxford Nanopore – new concepts

MinION
- 150 Mb per run
- Tested 48 kb read length
- $900 per instrument
- 500 pores per device

GridION
- XXX Mb per run
- Tested 48 kb read length
- $XXX per instrument
- 2000 pores per device, soon 8000 pores
- Cost per human genome $1500
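
The slides above quote per-base error rates as percentages (about 1.2% for Ion Proton, 4% for Oxford Nanopore, 15% for PacBio C2 chemistry), while the PGM run summary reports Q17 and Q20 base counts. A minimal sketch of how the two scales relate, assuming the standard Phred definition Q = -10·log10(p); the function names and the dictionary are illustrative and are not part of any vendor software:

```python
import math

def error_rate_to_phred(p):
    """Convert a per-base error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)

def phred_to_error_rate(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)

# Error rates as quoted on the slides (vendor or early-access figures,
# not measurements made by this sketch).
quoted_error_rates = {
    "Ion Proton": 0.012,
    "Oxford Nanopore": 0.04,
    "PacBio RS (C2)": 0.15,
}

for platform, p in quoted_error_rates.items():
    print(f"{platform}: {p:.1%} error ~ Q{error_rate_to_phred(p):.1f}")

# The Q17 and Q20 thresholds used in the PGM summary, as error rates.
for q in (17, 20):
    print(f"Q{q} ~ {phred_to_error_rate(q):.2%} error per base")
```

On this scale Q20 corresponds to a 1% error rate and Q17 to roughly 2%, so a platform quoted at about 1.2% error sits just below Q20 on average.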

Oxford Nanopore – applications
- DNA sequencing
- Protein detection
- Protein-DNA interaction
- Small molecule detection
- 96-well plates for 96 samples
- Controlled time of sequencing

Intelligent BioSystems Mini20 System (manufactured by Azco Biotech)
• Amplification by rolony method
• Sequencing by synthesis with announced 100-base reads, but expected to compete with Sanger down the road
• Designed for clinical labs
• 20 independent flow cells, no queue for loading, run asynchronously
• 20M reads/flow cell, 4 GB/flow cell
• Potential problems with repeats
• System cost $120K, $150 per flow cell (disposable); full costs per sample not clear yet
• Entering early access now; commercial shipping expected late 2012

Genia Technologies
• Very early stage announcement
  – Backed by Life Technologies (at least 1 year away)
• Describe system as a cross between Ion Torrent and Oxford Nanopore
• Electronic “Active Control” technology enables highly efficient nanopore-membrane assembly and control of DNA movement through the channel
• Initially used α-hemolysin and claimed 98% raw accuracy with it, but are now using an undisclosed pore for further development
• Claim sensitivity 1-2 orders of magnitude greater than Oxford Nanopore
• Ramping up pore density to 100K pores/chip by end of 2012
• Plan to market a mobile reader for <$1K with per-sample costs <$100
• Plan early access in late 2012, commercial shipment in 2013

Basic RNAseq
• Type 1: Description of transcriptome
  – Assembly of transcripts/isoforms
  – Annotation of genes
• Type 2: “Paired”, e.g. treatment vs control
  – Differential expression
  – Differential transcription
• Type 3: Population
  – Elements of 1 and 2, but with “random effects”
  – TCGA roughly falls into this category

Strand Specific RNAseq
• Perkins et al 2009, Levin et al 2010
• Goal: To mark the RNA molecules in order to know the direction of transcription.
  – differentiate anti-sense transcripts, lncRNAs, mRNAs etc.
• Many methods; dUTP may be best; Illumina has a kit

End tagged RNAseq
• GOAL: Identify ends of transcripts by attaching adaptors to ends of mRNAs
  – can be used in strand specific protocols
  – can be used in annotation and assembly protocols
(Figure: mRNA with mG cap and AAAAAA tail)

Normalized RNAseq
• GOAL: To even the distribution of transcripts sequenced
  – Reduce the representation of high abundance transcripts and increase sensitivity to low abundance transcripts

Normalized RNAseq 2
• Methods
  – Kinetic (Patanjali et al 1991, Bonaldo et al 1996)
  – dsDNA nuclease (Zhulidov 2004)
  – Cap-Trapper (Carninci et al 2000)
• Results
  – Abundant transcripts reduced in proportion to their frequency
  – Coverage still proportional to expression
• Problems: bias, contamination with ncRNA

Total RNAseq
• Goal: Sequence every RNA molecule in the cell
  – Observe: unspliced RNAs, small RNAs, non-coding RNAs, tRNAs
  – Must remove rRNA!
  – Variants: nuclear only, cytoplasmic only, mRNA removal

small RNA
• GOAL: Small RNAs are important for gene regulation, synthesis, splicing, and immunity (miRNA/miR, snRNA, snoRNAs, scaRNAs)
  – Several protocols (e.g. Illumina, Morin et al 2010)
• All involve size selection, which can lead to bias
  – Produce short sequences that are then mapped back to the genome.
• Aside: small RNA counts seem more Poisson-like than other counts

RIPseq/CLIPseq/HITS-CLIP
• GOAL: Identify the sites on the RNA where RNA binding proteins are bound.
  – e.g. components of the spliceosome
  – protocol is similar to ChIPseq, except that there is a random hexamer ds-cDNA synthesis step
  – refs: Khalil et al 2009, Sanford 2009, Licatalosi 2008, Zhang and Darnell 2011
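
The aside on the small RNA slide, that these counts look more Poisson-like than other counts, can be checked informally with the variance-to-mean ratio (index of dispersion): for Poisson data the ratio is close to 1, and it grows when counts are overdispersed. A minimal sketch, assuming made-up replicate counts rather than data from this talk; `dispersion_index` is a hypothetical helper name:

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Return sample variance divided by mean for a list of replicate read counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; the index is undefined")
    return variance(counts) / m

# Hypothetical read counts for one feature across five replicate libraries.
poisson_like = [88, 112, 101, 95, 110]    # variance close to the mean
overdispersed = [40, 210, 95, 15, 140]    # variance far above the mean

print(f"Poisson-like:  {dispersion_index(poisson_like):.2f}")
print(f"Overdispersed: {dispersion_index(overdispersed):.2f}")
```

With these illustrative numbers the first ratio comes out near 1 and the second far above it. Five replicates give only a rough indication, so this is a quick diagnostic and not a formal test.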