A general introduction to next-generation sequencing platforms

Introduction to Next-Generation Sequencing Data
and Related Bioinformatic Analysis
Han Liang, Ph.D.
Department of Bioinformatics and
Computational Biology
3/25/2014 @ Rice University
Outline
• History
• NGS Platforms
• Applications
• Bioinformatics Analysis
• Challenges
Central Dogma
Sanger sequencing
• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags
‘Sanger sequencing’ has been the only DNA
sequencing method for 30 years but…
…hunger for even greater sequencing throughput
and more economical sequencing technology…
NGS has the ability to process millions of sequence
reads in parallel rather than 96 at a time (1/6 of
the cost)
Objections: fidelity, read length, infrastructure cost, handling large volumes of data
• Roche/454 FLX: 2004
• Illumina Solexa Genome Analyzer: 2006
• Applied Biosystems SOLiD™ System: 2007
• Helicos HeliScope™: recently available
• Pacific Biosciences SMRT: launching 2010
Quickly reduced Cost
Three Leading Sequencing Platforms
• Roche 454
• Illumina Solexa
• Applied Biosystems SOLiD
The general experimental procedure
Wang et al. Nature Reviews Genetics 2009
454
bead microreactor
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
Illumina
(Solexa)
Bridge amplification
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
SOLiD
color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
Comparison of existing methods
Real Data – nucleotide space
• Solexa
@SRR002051.1 :8:1:325:773 length=33
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+SRR002051.1 :8:1:325:773 length=33
IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)
@SRR002051.2 :8:1:409:432 length=33
AAGTTATGAAATTGTAATTCCAATATCGTAAGC
+SRR002051.2 :8:1:409:432 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
@SRR002051.3 :8:1:488:490 length=33
AATTTCTTACCATATTAGACAAGGCACTATCTT
+SRR002051.3 :8:1:488:490 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
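The Solexa reads above follow the four-line FASTQ layout: header, sequence, separator, and one quality character per base. The following is a minimal sketch of reading such records. It assumes the quality line uses the Phred+33 ASCII offset, which the slide does not state, and the names `parse_fastq` and `phred_scores` are hypothetical helpers introduced only for illustration.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; the slide does not say


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Keep only the accession part of the header as the read id
        yield header[1:].split()[0], seq, qual


def phred_scores(quality: str) -> List[int]:
    """Convert quality characters to integer scores under the assumed offset."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


# Third record from the slide; the quality string is written compactly,
# assuming 31 'I' characters to match the declared length of 33.
record = [
    "@SRR002051.3 :8:1:488:490 length=33",
    "AATTTCTTACCATATTAGACAAGGCACTATCTT",
    "+SRR002051.3 :8:1:488:490 length=33",
    "I" * 31 + "&I",
]

for read_id, seq, qual in parse_fastq(record):
    print(read_id, len(seq), min(phred_scores(qual)))
```

Under the assumed offset, the single `&` near the end of the read is a low-confidence call, while the `I` positions are high-confidence calls.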
Real Data – color space
• SOLiD Data
>1_24_47_F3
T1.1.23..0120230.320033300030030010022.00.0201.0201
>1_24_52_F3
T2.3.21..2122321.213110332101132321002.11.0111.1222
>1_24_836_F3
T0.2.22..2222222.010203032021102220200.01.2211.2211
>1_24_1404_F3
T2.3.30..2013222.222103131323012313233.22.2220.0213
>1_25_202_F3
T0.3213.111202312203021101111330201000313.121122211
>1_25_296_F3
T0.1130.100123202213120023121112113212121.013301210
Data output difference among the three platforms
• Nucleotide space vs. color space
• Length of short reads
454 (400~500 bp) > SOLiD (70 bp) ~ Solexa (36~120 bp)
Applications with “Digital output”
• De novo genome assembly
• Genome re-sequencing
• RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation)
• ChIP-Seq (protein-DNA interaction)
• Epigenetic profiling
• Degraded state of the sample → mtDNA sequencing
• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)
Problems: contamination with modern human DNA and co-isolation of bacterial DNA
• Key part in regulating gene expression
• ChIP: technique to study DNA-protein interactions
• Recently, genome-wide ChIP-based studies of DNA-protein interactions
• Readout of ChIP-derived DNA sequences on NGS platforms
• Insights into transcription factor/histone binding sites in the human genome
• Enhance our understanding of gene expression in the context of specific environmental stimuli
• ncRNA presence in a genome is difficult to predict with high certainty by computational methods because of evolutionary diversity
• Detecting expression level changes that correlate with changes in environmental factors, with disease onset and progression, or with complex disease subtype or severity
• Enhance the annotation of sequenced genomes (making the impact of mutations more interpretable)
• Characterizing the biodiversity found on Earth
• The growing number of sequenced genomes enables us to interpret partial sequences obtained by direct sampling of specific environmental niches.
• Examples: ocean, acid mine sites, soil, coral reefs, and the human microbiome, which may vary according to the health status of the individual
• Common variants have not yet completely explained complex disease genetics; rare alleles also contribute
• Also structural variants, large and small insertions and deletions
• Accelerating biomedical research
• Enable the study of genome-wide patterns of methylation and how these patterns change through the course of an organism’s development.
• Enhanced potential to combine the results of different experiments: for example, correlative analyses of genome-wide methylation, histone binding patterns, and gene expression.
Integrating Omics
Mutation discovery
Protein-DNA interaction
Copy number variation
mRNA expression
microRNA expression
Alternative Splicing
Kahvejian et al. 2008
Data Analysis Flow
SOLiD machine:
Raw data
Central Server
Basic processing
decoding, filter and mapping
Local Machine
Downstream analysis
Short Read Mapping
• DNA-Resequencing
BLAST-like approach
• RNA-Seq
Read length and pairing
ACTTAAGGCTGACTAGC
TCGTACCGATATGCTG
• Short reads are problematic, because short
sequences do not map uniquely to the genome.
• Solution #1: Get longer reads.
• Solution #2: Get paired reads.
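The ambiguity of short reads can be illustrated with exact string matching. The sketch below builds a toy "genome" by concatenating the two sequences shown on the slide (an arbitrary choice made only for this example) and counts where a short and a longer read occur. The helper name `find_all` is hypothetical, and real aligners use indexed, mismatch-tolerant search rather than this naive scan.

```python
from typing import List


def find_all(genome: str, read: str) -> List[int]:
    """Return every 0-based start position where read matches genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Step by one so that overlapping occurrences are also reported
        start = genome.find(read, start + 1)
    return positions


# Toy genome: the two slide sequences joined end to end (illustration only)
toy_genome = "ACTTAAGGCTGACTAGC" + "TCGTACCGATATGCTG"

short_read = "CTG"      # 3 bases
longer_read = "GGCTGA"  # 6 bases, contains the short read

print(short_read, find_all(toy_genome, short_read))
print(longer_read, find_all(toy_genome, longer_read))
```

The 3-base read matches two locations, while the 6-base read that contains it matches only one. This is the effect behind Solution #1. Paired reads achieve a similar disambiguation by requiring both mates to map at a consistent distance from each other.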
Post-alignment Analysis
• DNA-Seq
• SNP calling
• RNA-Seq
• Quantifying gene expression level
Concepts
The reference genome:
hg19 (GRCh37)
Main assembly: Chr1-22, X, and Y
3,095,677,412 bp
Target Region: exome
Ensembl: 85.3 million bp (2.94%)
RefSeq: 67.7 million bp (2.34%)
CCDS: 31,266,049 bp (1.08%) consisting of 185,446 nr exons
Target Coverage
[Figure: percentage of target bases covered (y-axis, 0% to 100%) versus depth (x-axis, 1 to 96)]
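A target-coverage curve like the one on this slide reports, for each depth threshold, the fraction of target bases covered by at least that many reads. The following is a minimal sketch of that calculation. It assumes per-base depths are already available as a list of integers, and both the function name `fraction_covered` and the toy depths are hypothetical values for illustration, not data from the talk.

```python
from typing import Dict, Sequence


def fraction_covered(depths: Sequence[int], max_depth: int) -> Dict[int, float]:
    """Map each threshold d in 1..max_depth to the fraction of bases with depth >= d."""
    total = len(depths)
    if total == 0:
        raise ValueError("no target bases supplied")
    return {
        d: sum(1 for x in depths if x >= d) / total
        for d in range(1, max_depth + 1)
    }


# Toy per-base depths over an 8-base target region (illustration only)
toy_depths = [0, 3, 5, 10, 10, 20, 0, 1]
curve = fraction_covered(toy_depths, max_depth=10)

for threshold in (1, 5, 10):
    print(f"depth >= {threshold}: {curve[threshold]:.1%}")
```

The curve can only stay flat or fall as the threshold rises, which is why such plots slope downward from left to right.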
SOLiD
color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
SNP calling
Array-based High-throughput Dataset
Limitations of hybridization-based approach
• Reliance on existing knowledge about genome sequence
• Background noise and a limited dynamic range of detection
• Cross-experiment comparison is difficult
• Requires complicated normalization methods
Wang et al. Nature Reviews Genetics 2009
Quantifying gene expression
using RNA-Seq data
RPKM: Reads Per Kilobase of exon length per Million mapped reads
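RPKM normalizes a gene's read count by both the exon length (in kilobases) and the sequencing depth (in millions of mapped reads), so that RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the gene's exons, N is the total number of mapped reads, and L is the total exon length in bp. A small sketch of the calculation follows. The function name `rpkm` and the example numbers are hypothetical.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical gene: 500 reads on 2,000 bp of exons, 10 million mapped reads
print(rpkm(500, 2_000, 10_000_000))
```

Dividing by length keeps long genes from looking more highly expressed merely because they collect more reads. Dividing by library size makes samples sequenced to different depths comparable.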
Large Dynamic Range
Mortazavi et al. Nature Methods 2008
High reproducibility
Mortazavi et al. Nature Methods 2008
High Accuracy
Wang et al. Nature 2008
Advantages of RNA-Seq
• Not limited to the existing genomic sequence
• Very low (if any) background signal
• Large dynamic range of detection
• High reproducibility
• High accuracy
• Less sample required
• Low cost per base
Wang et al. Nature Reviews Genetics 2009
Huge amount of data!
• For a typical RNA-Seq SOLiD run:
~2 TB of image files
~120 GB of text files for downstream analysis
~75 M short reads per sample
Efficient methods for data storage and management
Considerable sequencing error
High-quality image analysis for base calling
Genome alignment and assembly:
time consuming and memory demanding
• To perform genome mapping for SOLiD data:
32-Opteron HP DL785 with 128 GB of RAM
12~14 hours per sample
High-performance parallel computing
Bioinformatics Challenges
• Efficient methods to store, retrieve, and process huge amounts of data
• Reducing errors in image analysis and base calling
• Fast and accurate methods for genome alignment and assembly
• New algorithms for downstream analyses
Experimental Challenges
Library fragmentation
Strand specific
Wang et al. Nature Reviews Genetics 2009
Questions & Answers
Han Liang
E-mail: hliang1@mdanderson.org
Tel: 713-745-9815