Sequencing

High Throughput Sequencing
Technologies at YCGA
Shrikant Mane
Director, YCGA
Director, Keck Laboratory
Outline
 First-generation sequencing technology
•Sanger sequencing
 Current massively parallel sequencing strategies
“Second Generation”
•454
•Illumina
•Ion Torrent & Ion Proton
“Third Generation”
•Pacific Biosciences
•Oxford Nanopore
 YCGA
Goals of biomedical investigation
 Understand normal, healthy and disease biology
 Enable prevention and early diagnosis of disease
 Enable new effective treatments
Utility of Next Generation
Sequencing/genetics in medicine
Unbiased approach to identify new
pathways underlying basic physiology,
health and disease
Evolution of genomic technologies
 Genetic mapping studies: Discovery of genes for well
characterized Mendelian diseases.
 Dense SNP genotyping using microarray technology: GWAS
for discovery of common variants in common disease.
 High throughput sequencing: Discovery of rare variants in not
previously recognized Mendelian diseases and common
diseases.
In recent years there has been an explosion of
research articles using next generation
sequencing technologies
Number of PubMed Articles
 DNA sequencing can provide
a deeper understanding
about DNA/RNA than any
other technology
 Microarray Technology
revolutionized biomedical
research, but has several
limitations, which DNA
sequencing may overcome
 As the cost of sequencing is
rapidly decreasing, it is
becoming affordable to
perform sequencing at a
genome level
[Chart: Number of PubMed Articles on HT Sequencing per year, including projected values; y-axis 0 to 2000]
First Generation: Sanger sequencing
(1975-1977)
1980 Nobel Prize in chemistry
gels read by hand
•radiolabeled dideoxyNTPs
•one lane per nucleotide
•800 bp reads
•low throughput (several kb/gel)
phi X 174
~5300 bp
Second-generation sequencing
Massively parallel sequencing of millions of
templates
454/Roche
Illumina
Ion Torrent-Proton
Second Generation: Massively Parallel
Sequencing.
Throughput (24 hours):
2.8 Mb (Sanger)
60,000 Mb (HiSeq)
Cost:
$1500/Mb (Sanger)
$0.06 /Mb (HiSeq)
Read Lengths:
~800 bp (Sanger)
~100 – 600 bp (HiSeq – 454)
Error rates:
< 0.5 % (Sanger)
~0.8 – 2% (HiSeq)
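The fold differences implied by the throughput and cost figures above can be worked out directly. A minimal sketch, using only the numbers quoted on this slide; the dictionary and function names are illustrative, not part of any sequencing software:

```python
# Figures quoted on the slide for a 24-hour period (Sanger vs. HiSeq).
SANGER = {"throughput_mb": 2.8, "cost_per_mb": 1500.0}
HISEQ = {"throughput_mb": 60_000.0, "cost_per_mb": 0.06}


def fold_change(larger, smaller):
    """Return how many times `larger` exceeds `smaller`."""
    return larger / smaller


throughput_gain = fold_change(HISEQ["throughput_mb"], SANGER["throughput_mb"])
cost_reduction = fold_change(SANGER["cost_per_mb"], HISEQ["cost_per_mb"])

print(f"Throughput gain: about {throughput_gain:,.0f}-fold")
print(f"Cost reduction:  about {cost_reduction:,.0f}-fold")
```

Both ratios come out in the tens of thousands, which is the scale of change the slide is pointing at.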
Illumina next generation sequencing platform
HiSeq 2500 Sequencing System
Fast turnaround and highest output in a single instrument
1 Instrument – 2 Run Modes (user configurable)
High Output Mode (highest output)
•600 Gb in ~10.5 days
•Current v3 flow cell
•Current v3 reagents
•cBot required
•6 human genomes in 10.5 days
Rapid Run Mode (fastest turnaround)
•120 Gb in ~1 day
•New 2-lane flow cell
•New reagents
•No cBot required
•1 human genome in a day
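The two run modes trade total output against turnaround. A rough sketch of the arithmetic behind the slide's numbers follows; the genome size of 3.1 Gb is an assumption added here to estimate coverage and does not appear on the slide:

```python
# Run-mode figures taken from the slide.
MODES = {
    "High Output": {"output_gb": 600.0, "days": 10.5, "genomes": 6},
    "Rapid Run": {"output_gb": 120.0, "days": 1.0, "genomes": 1},
}

# Assumption (not from the slide): approximate haploid human genome size.
GENOME_GB = 3.1


def summarize(mode):
    """Return (Gb per day, Gb per genome, approximate fold coverage)."""
    gb_per_day = mode["output_gb"] / mode["days"]
    gb_per_genome = mode["output_gb"] / mode["genomes"]
    coverage = gb_per_genome / GENOME_GB
    return gb_per_day, gb_per_genome, coverage


for name, mode in MODES.items():
    per_day, per_genome, cov = summarize(mode)
    print(f"{name}: {per_day:.1f} Gb/day, {per_genome:.0f} Gb/genome, ~{cov:.0f}x")
```

Under that assumption, both modes budget roughly 30x to 40x coverage per human genome; Rapid Run mode delivers more data per day, while High Output mode delivers more per run.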
New sequencing platforms by Illumina
 HiSeq X Ten and HiSeq X Five:
 Production-scale human whole genome sequencing:
18,000 genomes/year at $ 1,500 cost/genome
 HiSeq 3000/HiSeq 4000:
 Up to 1.5 Tb/run.
 Whole genome as well as other applications including
exome sequencing
Overall Illumina Sequencing Workflow
Sample Preparation
Sequencing Library Preparation
Sequencing
Adapter1 Primer
Insert
Adapter2
Cluster Generation
•Hybridizing Library to Flow Cell
•Creating clusters from
individual molecules
Sequencing by Synthesis
•Add all 4 bases with Reversible Terminators
•Image 4 colors
•Remove Terminator, repeat
Genomic Sample Prep Workflow
Purified genomic DNA
1. Genomic DNA fragmentation
Fragments of less than 800 bp
2. End-repair
Blunt ended fragments with 5’-Phosphorylated ends
3. Klenow exo- with dATP
3’-dA overhang
4. Adapter ligation
Adapter modified ends
5. Gel purification/bead
Removal of unligated adapter
6. PCR
Genomic DNA Library
Sequencing
Adapter1 Primer
Insert
Adapter2
What is a Flow Cell?
A flow cell is a thick glass slide
with 8 channels or lanes
Each lane is randomly coated
with a lawn of oligos that are
complementary to library
adapters
P5 oligo
P7 Oligo
Adapter1
Index
Adapter2
Sequencing
Primer
Insert
Reversible Terminator Seq Chemistry
All 4 labeled nucleotides in 1 reaction (green, orange, red and blue)
Advantages of reversible terminators:
Only one base is added at a time
The fluor can be cleaved off after imaging, so it does not emit at the
next cycle; only the newly added base (with its attached fluor) emits light
[Figure: reversible terminator nucleotide attached to DNA (5’ to 3’), showing the fluor, its cleavage site, and the 3’ block. Cycle: Incorporation, Detection, Deblock and fluor removal (free 3’ OH end), Next cycle]
Illumina sequencing
Sequencing By Synthesis (SBS)
Cycle 1:
Add sequencing reagents
First base incorporated
Remove unincorporated bases
Detect signal/Imaging
Cleave off fluor and Deblock
Cycle 2-n: Add sequencing reagents and repeat
[Figure: template strands (3’ to 5’) with one labeled base incorporated per cycle]
All four labeled nucleotides in
one reaction
High accuracy
Base-by-base sequencing
No problems with homopolymer
repeats
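The cycle described above (incorporate one base, image, deblock, repeat) can be sketched as a toy simulation. This is an illustration of the base-by-base idea only, assuming an error-free template; the function name is hypothetical and it is not Illumina software:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sequence_by_synthesis(template_3_to_5, cycles):
    """Return the bases called over `cycles` cycles, one base per cycle."""
    read = []
    for cycle in range(min(cycles, len(template_3_to_5))):
        # Incorporation: the reversible terminator lets exactly one
        # complementary base be added in this cycle.
        base = COMPLEMENT[template_3_to_5[cycle]]
        # Detection: the fluor colour identifies the incorporated base.
        read.append(base)
        # Deblock: fluor and 3' block are removed before the next cycle.
    return "".join(read)


# A template with a homopolymer run (GGGG) is still read one base per cycle.
print(sequence_by_synthesis("TACGGGGA", cycles=6))
```

Because each cycle adds a single base, a run of identical bases takes as many cycles as it has bases, which is why homopolymer repeats are not a problem for this chemistry.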
Ion Torrent PGM and Proton
Ion PGM™ Sequencer
4 Ion Protons: coming soon
First PostLight sequencing technology:
Instead of
using light as an intermediary, PGM creates a direct connection
between the chemical and the digital worlds.
The Chip is the Machine
Uses semiconductor chips for sequencing.
 Ion PI chip: >165 million wells per chip: 8 to 10
Gb data per run
 Ion PII chips: ~100 Gb of data in ~4 hours
Base Calling
When a nucleotide is incorporated into a strand of DNA, a hydrogen
ion is released as a by-product. The hydrogen ion carries a charge, which the
PGM’s ion sensor detects and records as a base call.
Ion Torrent technology video.
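A toy model of this kind of base calling, where one nucleotide type is flowed over the chip at a time and the signal scales with the number of bases incorporated, is sketched below. The flow order "TACG", the function name, and the noise-free signal are assumptions for illustration, not the instrument's actual settings:

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def flowgram(template, flow_order="TACG", n_flows=8):
    """Return (nucleotide, incorporations) for each flow.

    `template` is read in the direction of synthesis. Each flow offers one
    nucleotide type; every matching position in a row is incorporated at
    once, releasing one hydrogen ion per base.
    """
    pos = 0
    flows = []
    for i in range(n_flows):
        nuc = flow_order[i % len(flow_order)]
        count = 0
        while pos < len(template) and COMPLEMENT[template[pos]] == nuc:
            count += 1
            pos += 1
        flows.append((nuc, count))
    return flows


flows = flowgram("AAACG")
read = "".join(nuc * count for nuc, count in flows)
print(flows)
print(read)
```

A homopolymer such as AAA is incorporated in a single flow, so the base caller has to tell a signal of 3 from a signal of 2 or 4, which is one way to see why homopolymer detection is listed among the limitations.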
Advantages and Current Limitations
Advantages
• Low equipment cost
• Rapid run times: 3 to 4 hours
• Simple Chemistry
Limitations
• Homopolymer detection
• Error rates
• Slow on introducing newer chips: Overpromise
• PGM and Proton: two separate sequencing instruments
• Library prep: Emulsion PCR/New protocols
Third generation sequencing
PacBio RS
The Third Generation Sequencing Platform: PacBio RS
 Pacific Biosciences has developed Single
Molecule Real Time (SMRT™) DNA
sequencing technology: PacBio RS.
 This technology enables, for the first time,
the observation of natural DNA synthesis
by a DNA polymerase as it occurs.
 This technology delivers long reads at
single molecule level and fast time to
result, enabling a new paradigm in
genomic analysis.
Pacific Biosciences SMRT® Technology
Technology Video
Key Applications for PacBio RS
 Targeted sequencing
• SNP and structural variant detection
• Repetitive regions
• Full-length transcript profiling
 De novo assembly and genome finishing
• Bacterial genomes
• Fungal genomes
• Gap-captured sequencing
• Targeted captured sequencing
 Base modification detection
• Methylation
• DNA damage
**Projects at YCGA
YCGA PacBio RS
Comparisons Between PacBio RS and Illumina HiSeq
Sequencing chemistry
• PacBio RS (Third generation): Single Molecule Real Time (SMRT) sequencing by synthesis (SBS)
• Illumina HiSeq (Second generation): Sequencing by synthesis (SBS)
Sequencing substrate
• PacBio RS: SMRT Cell made up of 150,000 ZMWs
• Illumina HiSeq: Flow cell made of 8 separate lanes
Data output per day
• PacBio RS: 1 to 2 billion bases/day at $1.5/Mb
• Illumina HiSeq: 60 billion bases/day at a cost of $0.06 per Mb
Read Length
• PacBio RS: Average up to 5 Kb
• Illumina HiSeq: 50 bp to 150 bp
Error rates
• PacBio RS: Raw: 10-15%. With 30x coverage: Q50 (< 0.01)
• Illumina HiSeq: 0.5 to 1%
Sample Library
• PacBio RS: SMRT Bell template (single-strand circular DNA), 250 bp to 10 Kb insert
• Illumina HiSeq: dsDNA with adaptors (175 bp to 1 Kb)
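The error rates above mix percentages with a Phred-style quality value (Q50). The standard conversion is Q = -10 log10(p), where p is the error probability. A small sketch, with function names chosen here for illustration:

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


print(f"Q50 corresponds to an error probability of {phred_to_error(50):.0e}")
for rate in (0.10, 0.15, 0.005, 0.01):
    print(f"{rate:.1%} error corresponds to about Q{error_to_phred(rate):.1f}")
```

On this scale, a raw error rate of 10 to 15% is roughly Q8 to Q10, and the consensus figure of Q50 is one error in 100,000 bases.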
Upcoming Technologies
[Figure: protein nanopore (alpha-hemolysin) with exonuclease and cyclodextrin, set in an electrically resistant lipid bilayer]
http://www.nanoporetech.com/news/movies#movie24-nanopore-dna-sequencing
• PromethION
Recent advances in nanopore sequencing
 Two types of nanopores: Protein and synthetic
(silicon nitride). Protein nanopores appear to be
better in recognizing nucleotides.
 The rapid speed at which DNA strands pass
through the tiny hole makes distinguishing bases
more difficult.
 Currently an enzyme is used to control the rate.
 By shining a low-power green laser on a synthetic nanopore immersed in salt
water, it is possible to manipulate DNA speed at will (Meller A. et al, Nat Biotech
2013). As the current increases, positive ions drag water molecules in
the opposite direction of incoming DNA, acting as a brake and slowing its
passage through the pore. As a result, nanoscale sensors in the pore would
be able to read each nucleotide going into the pore more accurately. Using
nanopores, long stretches of DNA can be zipped back and forth through the
pore and read several times.
 Protein nanopores can also identify epigenetic changes.
Advantages
Nanopores offer a label-free, electrical, single-molecule
DNA sequencing method
 No costly fluorescent labeling reagents
 No need for expensive optical hardware and sophisticated
instrumentation to detect DNA bases
Performance/Limitations…..?
 First data was released in Feb 2012. Since then
slow to release new data
 Very little data available for the evaluation: High
Error Rates - >5%
The YCGA Laboratory at West Campus
YCGA was established in January 2009 through generous
funding support and the strong commitment of Yale
University and the School of Medicine
•Located in a newly
renovated building.
•Approximately 7,000
Sq Ft laboratory and
~4,000 Sq Ft office
space
Portion of the laboratory showing sequencing
systems through the glass wall partition that
separates laboratory from the rest of office
and administrative area.
• 23 staff
Sequencing Platforms at YCGA
11 Illumina HiSeqs
(2000 and 2500)
One PacBio RS
One MiSeq
Ion PGM™ Sequencer
YCGA has kept pace with cutting-edge sequencing technologies
Computer Infrastructure
BulldogN: Dell Cluster with 200 Nodes/2,500 Cores
Hitachi/BlueArc Scalable Storage: ~2.5 Petabytes
[Chart: GB Sequenced Quarterly, Q1 2010 through Q3 2014; y-axis 0 to 40,000 GB]
204 PIs from 31 Yale Departments
115 PIs from 81 Non-Yale Institutions from 14 countries
Types of samples processed and runs of sequence read lengths carried out at YCGA in a typical month.
[Pie chart: Sample Types. Categories: ChIP, gDNA, mRNA, Multiplex gDNA, Seq-cap. Shares shown: 3%, 5%, 13%, 2%, 77% (category-to-share mapping not shown)]
[Pie chart: Run Types. Categories: 1x36, 1x76, 2x50, 2x76, 2x101. Shares shown: 4%, 4%, 18%, 2%, 72% (category-to-share mapping not shown)]
Need for strong R&D efforts for Next-Generation
sequencing operation
• Optimization of sample preparation protocols for exome capture
that have decreased the cost of a single human exome from
$8,000 in 2009 to the current price of ~$500, while improving the
quality of the data.
• Development of a highly efficient protocol to extract and repair
DNA from formalin-fixed paraffin embedded blocks for exome
analysis.
• Improved protocols for gDNA-seq, RNA-seq, and ChIP-seq that
show higher data complexity than traditional protocols, allow
users to start with less material, and cost less.
• Continuous improvements of various analysis pipelines
Whole-Genome vs. Whole-Exome Sequencing
 Protein coding genes
(exome) constitute 1% of
the human genome but
harbor 85 % of disease
causing mutations
 Significantly cheaper
than sequencing entire
genome
• 2.1M probes cover ~300,000
exons of 19,000 genes
• Total covered bases: 44.1Mb
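The capture design figures above can be put in proportion with some quick arithmetic. In this sketch the genome size of 3,100 Mb is an assumption added for the comparison; the other numbers come from the slide:

```python
# Assumption (not from the slide): approximate haploid human genome size in Mb.
GENOME_MB = 3100.0

# Capture design figures from the slide.
TARGET_MB = 44.1
PROBES = 2_100_000
EXONS = 300_000
GENES = 19_000

target_fraction = TARGET_MB / GENOME_MB
probes_per_exon = PROBES / EXONS
exons_per_gene = EXONS / GENES
mean_target_bp_per_exon = TARGET_MB * 1e6 / EXONS

print(f"Captured fraction of genome: {target_fraction:.1%}")
print(f"Probes per exon:             {probes_per_exon:.0f}")
print(f"Exons per gene:              {exons_per_gene:.1f}")
print(f"Mean target per exon:        {mean_target_bp_per_exon:.0f} bp")
```

Under that assumption the capture target is a little over 1% of the genome, in line with the figure quoted for the exome, which is why exome sequencing needs far less raw data per sample than whole-genome sequencing.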
Scientific and economic impact
of high throughput sequencing
at Yale
List of select publications resulting from next-generation sequencing at YCGA
Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Bilguvar
Nature, v467, 2010
A Novel miRNA Processing Pathway Independent of Dicer Requires Argonaute2 Activity. Cifuentes
Science, v328, 2010
Mitotic recombination in ichthyosis causes reversion of dominant mutations in KRT10. Choate K
Science, v330, 2010
Transcriptomic analysis of avian digits reveals conserved and derived digit identities in birds. Wang S.
Nature, v477, 2011
Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals.
Lynch and Wagner
Nat Genet., v43, 2011
K+ channel mutations in adrenal aldosterone-producing adenomas and hereditary hypertension. Choi M
Science, v331, 2011
Recessive LAMC3 mutations cause malformations of occipital cortical development. Barak and Gunel.
Nat Genet., V43, 2011
Spatio-temporal transcriptome of the human brain. Kang and Sestan
Nature, v478, 2011
Langerhans cells facilitate epithelial DNA damage and squamous cell carcinoma. Modi and Girardi
Science, v335, 2012
Mutations in kelch-like 3 and cullin 3 cause hypertension and electrolyte abnormalities. Boyden et al
Nature, v482, 2012
De novo point mutations, revealed by whole-exome sequencing, are strongly associated with Autism Spectrum
Disorders. Sanders and State
Nature, v485, 2012
Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Krauthammer
Nat Genet., v44, 2012
Genomic Analysis of Non-NF2 Meningiomas Reveals Mutations in TRAF7, KLF4, AKT1, & SMO. Clark V et al
Science, v339, 2013
De novo mutations in histone-modifying genes in congenital heart disease. Zaidi and Lifton
Nature, v498, 2013
Recessive mutations in DGKE cause atypical hemolytic-uremic syndrome. Lemaire and Lifton
Nat Genet., v45, 2013
Somatic and germline CACNA1D calcium channel mutations in aldosterone-producing adenomas and primary
aldosteronism. Scholl and Lifton
Nat Genet., v45, 2013
The evolution of lineage-specific regulatory activities in the human embryonic limb. Cotney and Noonan
Cell, v154, 2013
Mutations in DSTYK and dominant urinary tract malformations. Sanna-Cherchi and Gharavi
N Eng J Med., 2013
Nanog, Pou5f1 and SoxB1 activate zygotic gene expression during the maternal-to-zygotic transition. Lee et al
Nature, 2013
Co-expression networks implicate human mid-fetal deep cortical projection neurons in the pathogenesis of autism.
Willsey and State
Cell, 2013
CLP1 Founder Mutation Links tRNA Splicing and Maturation to Cerebellar Development and Neurodegeneration.
Schaffer AE and Gleeson JG.
Cell, v157, 2014
Exome sequencing links corticospinal motor neuron disease to common neurodegenerative disorders. Novarino G
and Gleeson JG.
Science, v343, 2014
Impact of High Throughput Sequencing:
Grant Funding (partial list)
 Mendelian center grant, NIH: $12M (3y)
 Gilead cancer grant: $40M (4y)
 Brain tumor gift: $12M (4y)
 ARRA brain development (NIH): $3M (2y)
 ARRA kidney disease (NIH): $2M (2y)
 Simons autism sequencing: $4M (3y)
 Brain transcriptome (NIH): $10M (2y)
 Congenital heart disease (NIH): $5M (4y)
 Pediatric Cardiac Genomic Consortium: $2M (2y)
 Melanoma SPORE (NIH): $12M (5y)
 Biogen Inc. (PPMS): $2M
 VA - Schizophrenia/Bipolar disorder: $12M
 Yale Comprehensive Cancer Center: $14M
Total: $128M
Use of genomics to tailor medical care to individuals based
on their genetic makeup.
Discovery: How and why?
Diagnosis: Is it benign?
Classification: Which class of cancer?
Prognosis: What are my chances?
Therapeutic Choice: Which treatment?
• Elucidation of mechanism of cause
• Identification of cancer biomarkers
• Therapeutic targets
CLIA: The New Paradigm in Molecular
Diagnostics
 Conventional molecular testing- gene by gene
 Genomic testing using Exome analysis

YCGA is carrying out clinical diagnostic work in
collaboration with Dr. Allen Bale
 Over 1,000 exomes are analyzed for various
disorders
Challenges
Sequencing a genome is simple;
finding a cause of a disease is not
First clinical use of whole genome sequencing shows
just how challenging it can be.
Study of fraternal twins with monogenic disorder
Genomes on prescription: Nature 2011
Bainbridge M, Sci Transl Med 2011
Acknowledgement
 Jim Noonan
 Yale University, School of Medicine
and West Campus
 NHGRI: CMG
 YCGA staff
Questions?
Data Analysis Overview
Primary
Analysis
Secondary
Analysis
Data
Visualization
Primary and Secondary Analysis Overview
Analysis Type: Sequencing. Software: ICS/RTA. Outputs: Images/TIFF files
Analysis Type: Primary Analysis. Software: ICS/RTA. Outputs: Intensities, Base Calling
Analysis Type: Secondary Analysis. Outputs: Alignments and Variant Detection
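To make the primary analysis step concrete, the sketch below turns per-cycle channel intensities into base calls by picking the brightest of the four channels. This is a deliberately simplified illustration with made-up intensity values and a hypothetical function name; it is not how ICS/RTA is implemented, and real base callers also apply corrections that are omitted here:

```python
CHANNELS = "ACGT"


def call_bases(intensities):
    """Call one base per cycle from (A, C, G, T) intensity tuples."""
    read = []
    for cycle in intensities:
        # Pick the index of the brightest channel in this cycle.
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical intensities for one cluster over four cycles.
example = [
    (910, 40, 35, 60),
    (30, 25, 880, 45),
    (20, 15, 50, 940),
    (35, 870, 30, 55),
]
print(call_bases(example))
```

Secondary analysis then takes reads like this one as input for alignment and variant detection.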
Cluster Generation: Amplification
Template hybridization and Initial Extension
Original template is washed away
3' extension
OH
OH
P7
P5
Grafted flowcell
Template
hybridization
>250-300 million single
molecules hybridize to the
lawn of primers
Initial extension
Denaturation
single molecules bound
to flow cell in
a random pattern
Cluster Generation: Amplification
1st cycle annealing: Single strand flips over to hybridize to adjacent oligos to form a bridge
1st cycle extension: Hybridized primer is extended by polymerases
1st cycle denaturation: Double-stranded bridge is denatured
Result: two copies of covalently bound single-stranded templates
2nd cycle annealing
2nd cycle extension
2nd cycle denaturation
n=35 total
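Since each amplification cycle can at most double the number of bound strands, the cycle count sets a theoretical ceiling on cluster size. A minimal sketch of that idealized doubling follows; it assumes every strand is copied in every cycle, which real clusters on a flow cell surface will not achieve, so treat the result as an upper bound rather than an actual cluster size:

```python
def max_copies(cycles, start=1):
    """Idealized upper bound on strands after `cycles` doublings."""
    return start * 2 ** cycles


for n in (1, 2, 10, 35):
    print(f"After {n:>2} cycles: at most {max_copies(n):,} strands")
```

The first line of output matches the slide's "two copies" after the first cycle; the later values show only how quickly the ceiling grows over 35 cycles.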
Cluster Generation: Linearization, Blocking and sequencing
primer hybridization
Cluster Amplification
P5 Linearization: dsDNA bridges are denatured; complement strands are cleaved and washed away
Block with ddNTPs: Free 3’ ends are blocked to prevent unwanted DNA priming
Denaturation and Sequencing Primer Hybridization
Sequencing
Denaturation and Hybridization
Sequencing First Read
Denaturation and De-Protection
Resynthesis of P5 Strand (15 Cycles)
P7 Linearization
Block with ddNTPs
Denaturation and Hybridization
Sequencing Second Read