The Past, Present, and Future of DNA Sequencing

Craig A. Praul
Co-Director
Genomics Core Facility
Huck Institutes of the Life Sciences
Penn State University
A very short history of DNA sequencing
I started from the conviction that, if different DNA species
exhibited different biological activities, there should also
exist chemically demonstrable differences between
deoxyribonucleic acids.
Erwin Chargaff
Milestones
• First isolation of DNA : 1869 (Friedrich Miescher)
• Composition of nucleic acids; tetranucleotide theory : 1909 - 1940 (Phoebus Levene)
• G=C and A=T; however, the G/C and A/T content of different organisms varies : 1950 (Erwin Chargaff)
• G/C content measured by annealing : 1968 (Mandel and Marmur)
• Maxam-Gilbert and Sanger sequencing : 1977
• Next-generation sequencing : 2005
Genomes Sequenced
• Virus – 3222 (Bacteriophage phi X 174, 5386 nt – 1977)
• Bacteria – 2289 (Haemophilus influenzae, 1.8 x 10^6 nt – 1995)
• Eukarya – 168 (S. cerevisiae, 1.2 x 10^7 nt – 1996; H. sapiens, 3 x 10^9 nt – 2001)
• Archaea – 152 (Methanococcus jannaschii, 1.7 x 10^6 nt – 1996)
Next-Generation Sequencing
Liu et al. Journal of Biomedicine and Biotechnology Volume 2012 (2012), Article ID 251364, 11 pages doi:10.1155/2012/251364
Changes in instrument capacity*
ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796
Sequencing Cost
Date      Cost per Mb    Cost per Genome
Sep-01    $5,292.39      $95,263,072
Sep-02    $3,413.80      $61,448,422
Oct-03    $2,230.98      $40,157,554
Oct-04    $1,028.85      $18,519,312
Oct-05    $766.73        $13,801,124
Oct-06    $581.92        $10,474,556
Oct-07    $397.09        $7,147,571
Oct-08    $3.81          $342,502
Oct-09    $0.78          $70,333
Oct-10    $0.32          $29,092
Oct-11    $0.09          $7,743
Oct-12    $0.07          $6,618
Jan-13    $0.06          $5,671
Source - NHGRI : http://www.genome.gov/sequencingcosts/
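To put the NHGRI figures above in perspective, the short Python sketch below computes the fold reduction in cost per genome between selected dates. The dictionary holds only a few of the listed values, and the function name `fold_reduction` is an illustrative choice, not part of any NHGRI tool.

```python
# Cost per genome (USD) for selected dates, copied from the NHGRI figures above.
cost_per_genome = {
    "Sep-01": 95_263_072,
    "Oct-07": 7_147_571,
    "Oct-08": 342_502,
    "Jan-13": 5_671,
}


def fold_reduction(start: str, end: str) -> float:
    """Return how many times cheaper a genome was at `end` than at `start`."""
    return cost_per_genome[start] / cost_per_genome[end]


# Whole period covered by the figures.
print(f"Sep-01 to Jan-13: {fold_reduction('Sep-01', 'Jan-13'):,.0f}-fold")
# The single year with the steepest drop in the figures.
print(f"Oct-07 to Oct-08: {fold_reduction('Oct-07', 'Oct-08'):.1f}-fold")
```

The Oct-07 to Oct-08 step alone accounts for roughly a 20-fold drop, which is far larger than any other year-to-year change in the figures.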
Central Dogma of Molecular Biology
James Watson version - 1965
DNA → RNA → Protein
So once we have the genomic DNA sequence of a
species we have all of the information there is?
Really?
• No, not really.
Illumina HiSeq and MiSeq
• Massively parallel
– HiSeq : 150 or 180 million reads per lane
– MiSeq : 15 million reads per run
• Intermediate read length
– HiSeq : 100 nt or 150 nt
– MiSeq : 250 nt
• High total output per run
– HiSeq : 90 Gb or 288 Gb
– MiSeq : 8 Gb
Sequencing Types
Single Read
Paired-end read
Mate-pair read
Library Types
• Many different library preps : DNA, mate-pair, mRNA, miRNA, ChIP
• Fragmentation
– DNA : 300 – 500 nt
– RNA : 150 – 200 nt
• Attachment of appropriate adapters
– Complex : flow cell binding, F & R sequencing, BC
– Custom : Avoid if possible
• Removal of dimers/small inserts
• Amplification (or not)
Applications
• De novo sequencing (genomes, transcriptomes)
• Resequencing (genomes, exomes, custom sequence capture)
• RNA-seq (mRNA, miRNA, degradome)
• ChIP-seq
• Methyl-seq
• RIP-seq
• Amplicon
de Novo Experimental Design
• Estimate of genome size
• Coverage (30 x – 100 x)
• Sequencing type (paired-end or mate-pair)
• Example : 100 Mb genome, 100 nt x 100 nt paired-end reads, 30 x coverage
– (100 Mb) x (30 x coverage) = 3 Gb
– 3 Gb / (200 nt for each pair of paired-end reads) = 15 million read pairs
• Replicates
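The worked example above can be sketched as a small Python function. This is a minimal illustration of the slide's arithmetic only; the function and parameter names are assumptions made for the example, and real projects may also need to budget for low-quality or duplicate reads, which is not modeled here.

```python
def read_pairs_needed(genome_size_nt: int, coverage: int, read_length_nt: int) -> int:
    """Estimate the paired-end read pairs needed to reach a target coverage."""
    total_nt = genome_size_nt * coverage   # total sequence to generate
    nt_per_pair = 2 * read_length_nt       # each pair contributes two reads
    # Ceiling division, so a fractional pair is rounded up to a whole pair.
    return -(-total_nt // nt_per_pair)


# The slide's example: 100-megabase genome, 30 x coverage, 100 nt paired-end reads.
pairs = read_pairs_needed(genome_size_nt=100_000_000, coverage=30, read_length_nt=100)
print(f"{pairs:,} read pairs")
```

With the slide's inputs this reproduces the 15 million read pairs quoted above; raising `coverage` to 100 scales the requirement proportionally.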
Resequencing : Sequence Capture
RNA-seq Experimental Design
• Estimate of transcriptome size (1-5% of genome ?)
• Coverage (30 x ?)
– mRNA or rRNA-depleted RNA
– Relative abundance of transcripts you are interested in
• Sequencing type (single read or paired-end)
– Simple transcriptome vs. complex transcriptome
– Splice variants
• Example : 3 Gb genome, 100 nt single reads
– (3 Gb genome) x (5% transcriptome) = 150 Mb transcriptome
– (150 Mb transcriptome) x (30 x coverage) = 4.5 Gb total sequence
– 4.5 Gb / (100 nt for each read) = 45 million reads
• Replicates : Yes!
– Biological, not technical
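The RNA-seq estimate follows the same pattern, with one extra step to shrink the genome to its transcribed fraction: 5% of a 3-gigabase genome works out to 150 megabases, which at 30 x coverage gives the 4.5 gigabases of total sequence used in the example. The Python sketch below is only an illustration of that arithmetic. The names are hypothetical, and the 5% transcriptome fraction and 30 x coverage are the slide's rough guesses rather than measured values.

```python
def rnaseq_reads_needed(
    genome_size_nt: int,
    transcriptome_percent: int,
    coverage: int,
    read_length_nt: int,
) -> int:
    """Estimate the single reads needed per sample for an RNA-seq experiment."""
    # Integer arithmetic keeps the result exact for whole-number percentages.
    transcriptome_nt = genome_size_nt * transcriptome_percent // 100
    total_nt = transcriptome_nt * coverage
    # Ceiling division, so a fractional read is rounded up to a whole read.
    return -(-total_nt // read_length_nt)


# The slide's example: 3-gigabase genome, 5% transcribed, 30 x coverage, 100 nt reads.
reads = rnaseq_reads_needed(
    genome_size_nt=3_000_000_000,
    transcriptome_percent=5,
    coverage=30,
    read_length_nt=100,
)
print(f"{reads:,} reads per sample")
```

This is a per-sample figure, so the total for an experiment is this number multiplied by the number of biological replicates and conditions.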
ChIP-Seq
Source : http://www.nature.com/nmeth/journal/v4/n8/images/nmeth0807-613-F1.gif
RIP-seq
Source : http://openi.nlm.nih.gov/imgs/rescaled512/3269675_ijms-13-00097f6.png
Methyl-seq
20 different types of base modifications in DNA are
known and there are perhaps 200 modifications of RNA
Experimental Space: Next-Gen Platform
• PacBio : 0.075 x 10^6 reads/sample, 1000 – 3000 nt
– Whole transcript
• Roche 454 FLX+ : 0.5 – 1 x 10^6 reads/sample, 800 – 1000 nt
– Small – medium genome de novo sequencing
– Long amplicon
– Transcriptome
• PGM : 1 – 2 x 10^6 reads per sample, 400 nt
– Small genome de novo
– Medium amplicon
• MiSeq : 1 – 2 x 10^6 reads per sample, 50 – 250 nt
– Small genome de novo
– Small amplicon
• HiSeq : 10 – 100 x 10^6 reads per sample, 50 – 150 nt
– Counting applications : RNA-seq, ChIP-seq, RIP-seq, Methyl-seq
– Large genome de novo and resequencing
Experimental Space: The Relevancy of “Classic” Techniques
Differential Gene Expression
• Northern blotting (1977) : 1 probe – 20 samples
• Dot blots (1987) : 100s of probes – 1 sample
• RT-PCR (1992) : 100s of probes – 10 – 100 samples
• Microarrays (1995) : 100,000s of probes – 1 sample
• Next-gen sequencing (2005) : 10 – 100 x 10^6 reads – 1 sample
The Future
• More Reads
• Longer Reads
• Faster Sequencing
• Cheaper Sequencing
• New Applications