Sequencing Technologies and
Applications at JGI
Feng Chen, Ph.D.
05/14/2012
MGM Workshops
Outline
• Overview of sequencing technologies at JGI
• Pacific Biosciences potentials
• Highlights of application development
Staying State of the Art
[Figure: timeline of sequencing platforms at JGI. Milestones shown: AB 3730 reduced; Megabace offline; 454 early access; Solexa early access; SOLiD early access; 454 in production; Solexa in production; 454 Titanium; Illumina GAIIx; 454 1K; Illumina HiSeq 2000; Illumina MiSeq; Ion Torrent; PacBio; ONT. Dates marked: 07/2005, 01/2007, 04/2007, 08/2007, 10/2007, 12/2007, 07/2008, 05/2009, 12/2009.]
Emerging Sequencing Technologies
• Illumina MiSeq (improvement)
• Ion Torrent PGM
• Illumina HiSeq 2500
• Ion Torrent Proton
Illumina Improvement
• Longer read length (250 bp)
• 3-fold more reads (15 M)
• Higher throughput (5-7 Gb)
• Faster run time
• Two run configurations
– Fast run config can be done in 27 hours and produces 120 Gb
– Standard run config remains the same (600 Gb in 17 days)
Promises from Ion Torrent
Oxford Nanopore Technologies
• Long read length: > 50 kb
• High output: > 1 Gb/hr
• “Run until…”
• Cheap: ~$40/Gb
• Error rate: < 4%
Evolution of JGI Sequencing Platforms
Fiscal year  Units  Staff (FTE)  Budget ($M)  Output (Tb)
FY2009       49     24           11           1
FY2010       22     15           11           6
FY2011       15     9            8            29

[Figure: instrument units by type (ABI3730xl, Roche/454, GAIIx, HiSeq) for 2009, 2010, and 2011, plotted against budget ($ millions), output (trillions of bases), and staff (FTE).]
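The trend in the fiscal-year figures is easier to see as ratios. The sketch below divides budget by output and output by staff for each year. It uses only the numbers on this slide and assumes, for illustration, that each year's budget can be treated as the full cost of that year's output. The dictionary keys and function names are made up for this example.

```python
# Figures transcribed from the slide; key names are illustrative only.
FISCAL_YEARS = {
    "FY2009": {"units": 49, "ftes": 24, "budget_musd": 11, "output_tb": 1},
    "FY2010": {"units": 22, "ftes": 15, "budget_musd": 11, "output_tb": 6},
    "FY2011": {"units": 15, "ftes": 9, "budget_musd": 8, "output_tb": 29},
}


def cost_per_tb(year):
    """Budget in $M divided by output in Tb (assumes budget == total cost)."""
    fy = FISCAL_YEARS[year]
    return fy["budget_musd"] / fy["output_tb"]


def output_per_fte(year):
    """Output in Tb divided by staff in FTE."""
    fy = FISCAL_YEARS[year]
    return fy["output_tb"] / fy["ftes"]


for year in sorted(FISCAL_YEARS):
    print(f"{year}: {cost_per_tb(year):.2f} $M/Tb, "
          f"{output_per_fte(year):.2f} Tb/FTE")
```

Under that assumption the cost per terabase falls from $11M in FY2009 to well under $1M in FY2011, while output per FTE rises.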
JGI Current Sequencing Platforms
Platform                 Category          Units  Reads                        Avg. read length  Total bases                   Run time             Applications
Illumina HiSeq           Major             8      1,400 million per flowcell   150 bp            325 billion per flowcell      16.5 days            Primary sequence generator at JGI
Pacific Biosciences RS   Major             2      0.04 million per SMRT Cell   2,700 bp          0.100 billion per SMRT Cell   0.08 days (2 hours)  de novo, cDNA, 16S ID, validation
Illumina MiSeq           Supplement        2      5 million per flowcell       150 bp            2.1 billion per flowcell      1 day                16S, sample QC, R&D
Illumina GAIIx           Being phased out  5      210 million per flowcell     150 bp            75 billion per flowcell       14 days              Replaced by HiSeq
Roche/454 FLX-Ti         Being phased out  2      1 million per PTP            450 bp            0.450 billion per PTP         0.3 days (8 hours)   16S (replaced by MiSeq)
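One way to compare these platforms is bases produced per day of run time. The sketch below divides the per-run total bases by the run time listed for each platform. It is a rough comparison only: it ignores library preparation, instrument setup, and the number of flowcells or cells an instrument can run in parallel. The names `PLATFORMS` and `gb_per_day` are made up for this example.

```python
# Per-run figures transcribed from the slide:
# (total bases in billions, run time in days).
PLATFORMS = {
    "Illumina HiSeq": (325.0, 16.5),
    "PacBio RS": (0.100, 0.08),
    "Illumina MiSeq": (2.1, 1.0),
    "Illumina GAIIx": (75.0, 14.0),
    "Roche/454 FLX-Ti": (0.450, 0.3),
}


def gb_per_day(platform):
    """Billions of bases per day of run time for one flowcell, cell, or PTP."""
    total_gb, run_days = PLATFORMS[platform]
    return total_gb / run_days


# Print platforms from highest to lowest daily yield.
for name in sorted(PLATFORMS, key=gb_per_day, reverse=True):
    print(f"{name:18s} {gb_per_day(name):6.2f} Gb/day")
```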
Portfolio of Library Capabilities
STANDARD
DNA De Novo and Reseq:
Standard fragment 270 bp, 500 bp (amplified/unamplified)
Tight insert 250 bp, 500 bp (amplified/unamplified)
CLIP-PE 4kb, 8kb
Transcriptome Diversity/Counting:
RNASeq stranded
RNASeq with/without rRNA depletion
(Prok and Meta)
small RNASeq
PET RNASeq (5’ and 3')
Environmental Diversity Profiling:
16S Profiling
CUSTOM/R&D:
DNA De Novo:
CLIP-PE fosmid
CLIP-PE 20kb
LFPE 4kb, 8kb
Haplotype resolved sequencing
single cells/fragments
PacBio WGS
PacBio amplicon sequencing
Functional Genomics:
TSS prokaryotic RNAseq
Tn insertion site profiling sequencing
PacBio FL RNA
PacBio methylation sequencing
pools of 96 fosmids indexed libraries
Bisulfite Seq
chromatin IP
nano RNAseq
Pacific Biosciences Technology
• Single Molecule
– Sequence directly from the molecules in your sample, not
the amplification product
• Real time
– Direct observation of natural DNA synthesis in a
continuous and processive manner
• Phospholinked nucleotides
– Fluorescent label is at the gamma-phosphate position
– Naturally cleaved during incorporation
[Figure: harnessing single molecules and observing in real time. Labels: SMRT Cell, ZMW.]
Science, Vol 299, Jan 31 2003, pp. 682-686; J. Appl. Phys. 103, 034301 (2008)
Pacific Biosciences Advantages
• Fast run time
• Long read length
• No amplification biases
• Able to measure DNA polymerase kinetics
– Inter-pulse distance
– Pulse duration
• Multiple sequencing modes
– Standard
– Strobe
– Circular consensus
• Disadvantages: high error rate (indels), low throughput
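The two kinetic measurements named above can be illustrated with a small sketch. It assumes each incorporation event is recorded as a (start, end) time pair in seconds, which is a simplified, hypothetical representation and not the instrument's actual data format. Pulse duration is the time the signal stays on; inter-pulse distance is the gap between the end of one pulse and the start of the next.

```python
from typing import List, Tuple

# Hypothetical representation of one pulse: (start_time_s, end_time_s).
Pulse = Tuple[float, float]


def pulse_durations(pulses: List[Pulse]) -> List[float]:
    """How long each fluorescence pulse lasts."""
    return [end - start for start, end in pulses]


def inter_pulse_distances(pulses: List[Pulse]) -> List[float]:
    """Gap between the end of each pulse and the start of the next one."""
    return [pulses[i][0] - pulses[i - 1][1] for i in range(1, len(pulses))]


# Three made-up pulses from a single ZMW trace.
trace = [(0.0, 0.5), (1.0, 1.25), (3.0, 3.5)]
print(pulse_durations(trace))        # [0.5, 0.25, 0.5]
print(inter_pulse_distances(trace))  # [0.5, 1.75]
```

An unusually long inter-pulse distance at a given position, compared with an unmodified control, is the kind of signal that kinetics-based applications such as methylation sequencing look for.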
Less GC Bias Than Newest Illumina Chemistry
[Figure: GC bias comparison. Panels labeled 28% GC and 73% GC; series labeled V3 HiSeq, V2 HiSeq, and V2 GAIIx.]
PacBio Data Improves Assembly
[Figure: number of gaps in assembly (0 to 100) for 20 microbes, sorted by number of gaps closed.]
• Most improved genome: 53 / 71 (75%) gaps closed
• Least improved genomes started out in good shape
• 11% of gaps were closed incorrectly, with either errors in consensus or misassemblies
PacBio Data Coverage
[Figure: Allpaths assembly (Illumina only) shown with Illumina coverage (~100x) and PacBio coverage tracks.]
PacBio reads give good coverage in the gap region
PacBio Read Length
[Figure: read length (bp, 0 to 3500) over time, Oct-10 to Dec-11. Annotations: "We started from here"; V 1.1.2; V 1.2; V 1.2.1; V 1.2.2; C2 chemistry; laser overpower; instrument fine tuning; successful upgrades.]
Transcriptome/FL-cDNA Sequencing
Goal: capture the 5’ and 3’ ends of transcripts and their splice variants
[Figure: alignment before and after correction, with coverage (0x to 800x) and annotation tracks.]
Transcriptome Coverage
• 1/3 of the transcripts (1/2 of transcripts hit by this
dataset) are covered by at least one single PacBio
subread
• There is NO ambiguity if splice variants are detected
[Figure: PacBio subreads aligned to an annotated transcript in three cases: transcripts hit (73.3%), transcripts tiled (38.6%), transcripts covered by > 1 subread (36.5%).]
Error Correction revealed isoforms
J. Martin
Z. Wang
Application Development
• Large-insert paired-end sequencing
- 3-5 kb, 8-10 kb, and >20 kb insert size
- CLIP-PE: developed in-house
• RNA sequencing
- 5’ and 3’ end targeted and full-length sequencing
- Metatranscriptome sequencing
• 16S rRNA profiling and identification
- iTag on Illumina MiSeq and 16S ID on PacBio
• Haplotype-resolved sequencing
- Single chromosome sequencing
• Functional genomics:
- Gene synthesis
- Large scale gene disruption
16S Tagging on MiSeq
Targeting V4 region in 16S gene (291 nt in length)
• Use 3rd-read indexing strategy and custom forward sequencing
primer to maximize the use of Illumina’s limited read length
• 2x250 bp run to ensure read overlap
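The reason a 2x250 bp run ensures overlap can be checked with simple arithmetic. The sketch below assumes the 291 nt figure is the full length sequenced between the two read start points and ignores any spacer or primer bases. The function name `read_overlap` is made up for this example.

```python
def read_overlap(amplicon_len: int, read_len: int) -> int:
    """Bases covered by both reads of a pair sequenced from opposite ends."""
    if read_len >= amplicon_len:
        # Each read spans the whole amplicon, so the overlap is the amplicon.
        return amplicon_len
    # Otherwise the overlap is whatever the two reads cover beyond one pass.
    return max(0, 2 * read_len - amplicon_len)


print(read_overlap(291, 250))  # 209
print(read_overlap(291, 150))  # 9
```

With 250 bp reads the pair shares 209 nt, which leaves ample room for merging; with 150 bp reads only 9 nt would overlap under the same assumption.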
[Figure: amplicon modification, comparing Illumina and 454 amplicons over the V4 HVR of the 16S gene. Construct labels: Illumina adapter 1, Read1 priming site, spacer, 16S specific primer, 16S gene, barcode priming site, Read2 priming site, Illumina adapter 2.]
96 samples are pooled in one MiSeq run
High quality sequencing data were obtained from both reads
Illumina MiSeq Suitable for 16S Tagging
• MiSeq data largely agrees with 454 PyroTag data
• Major differences are in low abundance clusters
Functional Genomics through “Transposon bombing”
• Random Tn insertion mutagenesis
• Cell growth at multiple conditions
• High throughput insertion site sequencing
• Map insertion sites to reference sequence for functional annotation
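The last step, mapping insertion sites back to the reference for annotation, comes down to counting how many insertion coordinates fall inside each annotated gene. A minimal sketch follows, assuming insertion sites are already available as reference coordinates and genes as (name, start, end) tuples with inclusive ends. The gene names and coordinates are made up for illustration; this is not the pipeline used at JGI.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

# Hypothetical gene record: (name, start, end), 1-based inclusive coordinates.
Gene = Tuple[str, int, int]


def count_insertions(genes: List[Gene], sites: List[int]) -> Dict[str, int]:
    """Count insertion sites that fall within each gene's coordinates."""
    ordered = sorted(sites)
    counts = {}
    for name, start, end in genes:
        # Binary search gives the slice of sites inside [start, end].
        counts[name] = bisect_right(ordered, end) - bisect_left(ordered, start)
    return counts


genes = [("geneA", 100, 400), ("geneB", 500, 900)]
sites = [120, 130, 399, 400, 450, 950]
print(count_insertions(genes, sites))  # {'geneA': 4, 'geneB': 0}
```

A gene such as `geneB` here, with no insertions despite its length, is a candidate for an essential gene.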
High-throughput sequencing reveals “essential” genes, which appear as transposon-free regions
[Figure: Illumina read depth (0 to 230) of transposon insertions along a stretch of genes. Non-essential genes on either side carry insertions; the insertion-free site is the essential gene dihydroxy-acid dehydratase (required for biosynthesis of amino acids).]
Tn Insertion Reveals Essential Genes
Pseudomonas stutzeri RCH2
• Essential genes: 508 (12%)
• Non-essential genes: 3,542 (80%)
• Uncertain: 362 (8%)
[Figure: number of genes (0 to 400) against insertion index (number of insertions / gene length, 0 to 0.4), comparing the observed distribution of insertions with the expected distribution from random insertions.]
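The insertion index on the x-axis is the number of insertions divided by gene length. The sketch below computes it and sorts genes into the three classes named on the slide. The two cutoff values are placeholders chosen for illustration; the slide does not state the thresholds actually used, and in practice they would be derived from the observed and expected distributions.

```python
def insertion_index(n_insertions: int, gene_length_bp: int) -> float:
    """Number of insertions divided by gene length, as defined on the slide."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return n_insertions / gene_length_bp


def classify(index: float,
             essential_max: float = 0.01,      # placeholder cutoff
             nonessential_min: float = 0.05    # placeholder cutoff
             ) -> str:
    """Label a gene by insertion index using two hypothetical cutoffs."""
    if index <= essential_max:
        return "essential"
    if index >= nonessential_min:
        return "non-essential"
    return "uncertain"


# Made-up genes: (insertions, length in bp).
for n, length in [(0, 1800), (30, 1000), (120, 1200)]:
    idx = insertion_index(n, length)
    print(f"{n:4d} insertions / {length} bp = {idx:.3f} -> {classify(idx)}")
```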
Single Chromosome Sequencing
[Figure: metaphase chromosomes are isolated by MM, MF, or LCM; a single chromosome in a droplet or micro-well then undergoes MDA/PCR amplification.]
MM: micromanipulator
MF: microfluidics
LCM: laser capture microdissector
Thank you very much!
Questions?