Sequencing Technologies and Applications at JGI
Feng Chen, Ph.D.
05/14/2012
MGM Workshops

Outline
• Overview of sequencing technologies at JGI
• Pacific Biosciences potentials
• Highlights of application development

Staying State of the Art
[Timeline figure. Milestones shown: AB 3730 reduced; Megabace offline; 454 Titanium; PacBio; Solexa in production; 454 in production; 454 early access; Solexa early access; SOLiD early access; Illumina HiSeq 2000; Ion Torrent; Illumina MiSeq; Illumina GAIIx; 454 1K; ONT. Date ticks: 07/2005, 01/2007, 04/2007, 08/2007, 10/2007, 12/2007, 07/2008, 05/2009, 12/2009. The pairing of dates with milestones is not recoverable from the extracted text.]

Emerging Sequencing Technologies
• Illumina MiSeq (improvement)
• Ion Torrent PGM
• Illumina HiSeq 2500
• Ion Torrent Proton

Illumina Improvement
MiSeq
• Longer read length (250 bp)
• 3-fold more reads (15 M)
• Higher throughput (5-7 Gb)
• Faster run time
HiSeq 2500: two run configurations
• Fast run configuration finishes in 27 hours and produces 120 Gb
• Standard run configuration remains the same (600 Gb in 17 days)

Promises from Ion Torrent
[Figure; content not recoverable from the extracted text.]

Oxford Nanopore Technologies
• Long read length: > 50 kb
• High output: > 1 Gb/hr
• “Run until…”
• Cheap: ~$40/Gb
• Error rate: < 4%

Evolution of JGI Sequencing Platforms
Fiscal year   Units   Staff (FTEs)   Budget   Output
FY2009        49      24             $11M     1 Tb
FY2010        22      15             $11M     6 Tb
FY2011        15      9              $8M      29 Tb
[Chart: units by platform (ABI 3730xl, Roche/454, GAIIx, HiSeq), budget ($ millions), output (trillions of bases), and staff (FTE) for 2009-2011.]

JGI Current Sequencing Platforms
(M = million reads; Gb = billion bases)
                   Illumina HiSeq     PacBio RS            Illumina MiSeq    Illumina GAIIx     Roche/454 FLX-Ti
Category           Major              Major                Supplement        Being phased out   Being phased out
Units              8                  2                    2                 5                  2
Reads              1,400 M/flowcell   0.04 M/SMRT Cell     5 M/flowcell      210 M/flowcell     1 M/PTP
Avg. read length   150 bp             2,700 bp             150 bp            150 bp             450 bp
Total bases        325 Gb/flowcell    0.100 Gb/SMRT Cell   2.1 Gb/flowcell   75 Gb/flowcell     0.450 Gb/PTP
Run time           16.5 days          0.08 days (2 hours)  1 day             14 days            0.3 days (8 hours)
Applications
• Illumina HiSeq: primary sequence generator at JGI
• PacBio RS: de novo, cDNA, 16S ID, validation
• Illumina MiSeq: 16S, sample QC, R&D
• Illumina GAIIx: replaced by HiSeq
• Roche/454 FLX-Ti: 16S (replaced by MiSeq)

Portfolio of Library Capabilities
STANDARD
DNA De Novo and Reseq:
• Std frag 270 bp, 500 bp (amplified/unamplified)
• Tight insert 250 bp, 500 bp (amplified/unamplified)
• CLIP-PE 4 kb, 8 kb
Transcriptome Diversity/Counting:
• RNASeq stranded
• RNASeq with/without rRNA depletion (Prok and Meta)
• Small RNASeq
• PET RNASeq (5’ and 3’)
Environmental Diversity Profiling:
• 16S profiling
CUSTOM/R&D
DNA De Novo:
• CLIP-PE fosmid
• CLIP-PE 20 kb
• LFPE 4 kb, 8 kb
• Haplotype-resolved sequencing, single cells/fragments
• PacBio WGS
• PacBio amplicon sequencing
Functional Genomics:
• TSS prokaryotic RNAseq
• Tn insertion site profiling sequencing
• PacBio FL RNA
• PacBio methylation sequencing
• Pools of 96 fosmids, indexed libraries
• Bisulfite Seq
• Chromatin IP
• Nano RNAseq

Outline: Pacific Biosciences potentials

Pacific Biosciences Technology
Harnessing Single Molecules; Observing in Real Time
• Single Molecule
  – Sequence directly from the molecules in your sample, not the amplification product
• Real time
  – Direct observation of natural DNA synthesis in a continuous and processive manner
• Phospholinked Nucleotides
  – Fluorescent label is at the gamma-phosphate position
  – Naturally cleaved during incorporation
[Figure labels: SMRT Cell, ZMW]
Science, Vol 299, Jan 31 2003, pp 682-686; J. Appl. Phys. 103, 034301 (2008)

Pacific Biosciences Advantages
• Fast run time
• Long read length
• No amplification biases
• Able to measure DNA polymerase kinetics
  – Inter-pulse distance
  – Pulse duration
• Multiple sequencing modes
  – Standard
  – Strobe
  – Circular consensus
• Disadvantages: high error rate (indels), low throughput

Less GC Bias Than Newest Illumina Chemistry
[Figure: coverage on a 28% GC genome and a 73% GC genome, comparing V3 HiSeq, V2 HiSeq, V2 GAIIx, and PacBio.]

PacBio Data Improves Assembly
• Most improved genome: 53 / 71 (75%) gaps closed
• Least improved genomes started out in good shape
• 11% of gaps were closed incorrectly, with either errors in consensus or misassemblies
[Figure: number of gaps in assembly (y-axis, 0-100) for 20 microbes sorted by number of gaps closed (x-axis).]

PacBio Data Coverage
• Allpaths assembly (Illumina only)
• Illumina coverage: ~100x
• PacBio coverage is great in the gap region
[Figure: Illumina and PacBio coverage tracks across an assembly gap.]

PacBio Read Length
[Figure: read length (bp, 0-3,500) over time, Oct-10 to Dec-11. Annotations: "We started here"; software versions V 1.1.2, V 1.2, V 1.2.1, V 1.2.2; laser overpower; instrument fine tuning; successful upgrades; C2 chemistry.]

Transcriptome/FL-cDNA Sequencing
Goals: capture the 5’ and 3’ ends of the transcripts and splicing variants
[Figure: alignment before and after correction; tracks for coverage (0x to 800x) and annotation.]

Transcriptome Coverage
• 1/3 of the transcripts (1/2 of the transcripts hit by this dataset) are covered by at least one single PacBio subread
• There is NO ambiguity if splice variants are detected
[Figure, each panel shown against the annotated transcript: transcripts hit (73.3%); transcripts tiled (38.6%); transcripts covered by > 1 subread (36.5%).]

Error Correction Revealed Isoforms
[Figure; credit: J. Martin, Z. Wang]

Outline: Highlights of application development

Application Development
• Large-insert paired-end sequencing
  - 3-5 kb, 8-10 kb, and >20 kb insert size
  - CLIP-PE: developed in-house
• RNA sequencing
  - 5’ and 3’ end targeted and full-length sequencing
  - Metatranscriptome sequencing
• 16S rRNA profiling and identification
  - iTag on Illumina MiSeq and 16S ID on PacBio
• Haplotype-resolved sequencing
  - Single chromosome sequencing
• Functional genomics
  - Gene synthesis
  - Large-scale gene disruption

16S Tagging on MiSeq
Targeting the V4 region in the 16S gene (291 nt in length)
• Use a 3rd-read indexing strategy and a custom forward sequencing primer to maximize the use of Illumina’s limited read length
• 2x250 bp run to ensure read overlap
• 96 samples are pooled in one MiSeq run
• High-quality sequencing data were obtained from both reads
[Figure: V4 amplicon modification, Illumina vs. 454. Labels: Illumina adapter 1, barcode priming site, Read1 priming site, spacer, 16S-specific primer, 16S gene (HVR), Read2 priming site, Illumina adapter 2.]

Illumina MiSeq Suitable for 16S Tagging
• MiSeq data largely agree with 454 PyroTag data
• Major differences are in low-abundance clusters

Functional Genomics through “Transposon bombing”
• Random Tn insertion mutagenesis
• Cell growth at multiple conditions
• High-throughput insertion site sequencing
• Map insertion sites to the reference sequence for functional annotation
High-throughput sequencing reveals “essential” genes, which appear as transposon-free regions
[Figure: Illumina read depth (0-230) along the genome. Transposon insertions cover the non-essential genes on both sides; the insertion-free site is an essential gene, dihydroxy-acid dehydratase (required for biosynthesis of amino acids).]

Tn Insertion Reveals Essential Genes
Pseudomonas stutzeri RCH2
• Essential genes: 508 (12%)
• Non-essential genes: 3,542 (80%)
• Uncertain: 362 (8%)
[Figure: number of genes (y-axis, 0-400) against insertion index (number of insertions / gene length; x-axis, 0-0.4), showing the expected distribution from random insertions and the observed distribution of insertions.]

Single Chromosome Sequencing
Metaphase chromosomes → MM / MF / LCM → single chromosome in droplet or micro-well → MDA/PCR amplification
MM: micromanipulator
MF: microfluidics
LCM: laser capture microdissector

Thank you very much!
Questions?
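Backup note on the "JGI Current Sequencing Platforms" table: the total-bases and run-time rows can be combined into a rough daily output per run unit (flowcell, SMRT Cell, or PTP). The sketch below only divides the figures quoted on that slide. The dictionary layout and the function name `gb_per_day` are illustrative assumptions, and the result ignores the number of instruments and any parallel or back-to-back runs.

```python
# Figures copied from the platform table: (total bases in Gb per run unit, run time in days).
PLATFORMS = {
    "Illumina HiSeq": (325.0, 16.5),
    "PacBio RS": (0.100, 0.08),
    "Illumina MiSeq": (2.1, 1.0),
    "Illumina GAIIx": (75.0, 14.0),
    "Roche/454 FLX-Ti": (0.450, 0.3),
}


def gb_per_day(total_gb: float, run_days: float) -> float:
    """Average output in Gb per day for one run unit (hypothetical helper)."""
    if run_days <= 0:
        raise ValueError("run time must be positive")
    return total_gb / run_days


# Rounded to two decimals for display.
rates = {name: round(gb_per_day(gb, days), 2) for name, (gb, days) in PLATFORMS.items()}

if __name__ == "__main__":
    for name, rate in sorted(rates.items(), key=lambda item: -item[1]):
        print(f"{name:18s} {rate:6.2f} Gb/day")
```

On these numbers a HiSeq flowcell averages roughly 20 Gb per day, while a single PacBio SMRT Cell yields about 1.25 Gb per day-equivalent, which is consistent with the "low throughput" disadvantage listed for PacBio.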
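Backup note on the "16S Tagging on MiSeq" slide: the choice of a 2x250 bp run for a 291 nt V4 target can be checked with simple arithmetic. This is a minimal sketch that assumes both reads start exactly at the two ends of the 291 nt region and run their full length; the function name `paired_overlap` is hypothetical, and real overlap also depends on primers, spacers, and quality trimming.

```python
def paired_overlap(read_len: int, fragment_len: int) -> int:
    """Bases covered by both reads of a pair on a fragment of the given length.

    Assumes each read starts at one end of the fragment and is read_len long.
    """
    if read_len <= 0 or fragment_len <= 0:
        raise ValueError("lengths must be positive")
    # Two reads of read_len cover 2 * read_len bases in total; whatever exceeds
    # the fragment length is sequenced twice. The overlap cannot exceed the
    # fragment itself and cannot be negative.
    return max(0, min(fragment_len, 2 * read_len - fragment_len))


V4_LENGTH = 291  # nt, from the slide

overlap_2x250 = paired_overlap(250, V4_LENGTH)  # 2 * 250 - 291 = 209
overlap_2x150 = paired_overlap(150, V4_LENGTH)  # 2 * 150 - 291 = 9

if __name__ == "__main__":
    print(f"2x250 overlap: {overlap_2x250} nt")
    print(f"2x150 overlap: {overlap_2x150} nt")
```

Under these assumptions a 2x150 run leaves only 9 nt shared between the reads, while 2x250 gives 209 nt, which is why the longer run is needed to merge read pairs reliably.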
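Backup note on the "Tn Insertion Reveals Essential Genes" slide: the insertion index is defined there as the number of insertions divided by gene length, and genes are grouped as essential, non-essential, or uncertain. The sketch below shows one way such a three-way call could be made. The two cutoffs, the gene names, and the counts are invented for illustration; the slide does not state how the actual thresholds were derived from the expected and observed distributions.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    length_bp: int
    insertions: int  # number of mapped transposon insertion sites in the gene


def insertion_index(gene: Gene) -> float:
    """Insertion index as defined on the slide: insertions / gene length."""
    if gene.length_bp <= 0:
        raise ValueError("gene length must be positive")
    return gene.insertions / gene.length_bp


def classify(gene: Gene, essential_max: float = 0.01, nonessential_min: float = 0.05) -> str:
    """Three-way call using two cutoffs (both values are assumptions)."""
    idx = insertion_index(gene)
    if idx <= essential_max:
        return "essential"
    if idx >= nonessential_min:
        return "non-essential"
    return "uncertain"


# Hypothetical example genes, not data from the P. stutzeri RCH2 experiment.
genes = [
    Gene("geneA", 1800, 0),    # insertion-free region
    Gene("geneB", 1200, 120),  # densely hit
    Gene("geneC", 900, 27),    # in between
]
calls = {g.name: classify(g) for g in genes}

if __name__ == "__main__":
    for g in genes:
        print(f"{g.name}: index={insertion_index(g):.3f} -> {calls[g.name]}")
```

In practice the cutoffs would be chosen by comparing the observed distribution with the one expected from random insertion, as the histogram on the slide suggests.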