High Throughput Sequencing: Microscope in a Big Data Era

advertisement
High Throughput Sequencing:
Microscope in the Big Data Era
David Tse
EASIT
Chinese University of Hong Kong
July 7, 2014
Research supported by NSF Center for Science of Information.
DNA sequencing
…ACGTGACTGAGGACCGTG
CGACTGAGACTGACTGGGT
CTAGCTAGACTACGTTTTA
TATATATATACGTCGTCGT
ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC
TGATTTTAAAAAAATATT…
High throughput sequencing revolution
tech. driver for communications
Shotgun sequencing
read
Technologies
Sequence Sanger
r
3730xl
454 GS
Mechanis
m
Dideoxy
chain
terminatio
n
Read
length
Ion
Torrent
SOLiDv4
Illumina Pac Bio
HiSeq
2000
Pyrose Detection
quencin of
g
hydrogen
ion
Ligation
and twobase
coding
Reversi Single
ble
molecule
Nucleoti real time
des
400-900
bp
700 bp
~400 bp
50 + 50
bp
100 bp
PE
1000~1000
0 bp
Error Rate
0.001%
0.1%
2%
0.1%
2%
10-15%
Output
data (per
run)
100 KB
1 GB
100 GB
100 GB
1 TB
10 GB
High throughput sequencing:
Microscope in the big data era
Genomic variations, 3-D structures, transcription, translation,
protein interaction, etc.
The quantities measured can be dynamic and vary spatially.
Example: RNA expression is different in different tissues and
at different times.
Computational problems
for high throughput data
measure
data
•
Assembly (de Novo)
• Variant calling
(reference-based
assembly)
manage
data
• Compression
utilize
data
• Genome wide
association studies
• Privacy
• Phylogenetic tree
reconstruction
• Pathogen detection
Scope of this tutorial
Assembly: three points of view
• Software engineering
• Computational complexity theoretic
• Information theoretic
Assembly as a software engineering problem
• A single sequencing experiment can generate 100’s
of millions of reads, 10’s to 100’s gigabytes of data.
• Primary concerns are to minimize time and memory
requirements.
• No guarantee on optimality of assembly quality and in
fact no optimality criterion at all.
Computational complexity view
• Formulate the assembly problem as a combinatorial
optimization problem:
– Shortest common superstring (Kececioglu-Myers 95)
– Maximum likelihood (Medvedev-Brudno 09)
– Hamiltonian path on overlap graph (Nagarajan-Pop 09)
• Typically NP-hard and even hard to approximate.
• Does not address the question of when the solution
reconstructs the ground truth.
Information theoretic view
Basic question:
What is the quality and quantity of read data needed to
reliably reconstruct?
Tutorial outline
I.
De Novo DNA assembly.
I.
De Novo RNA assembly
Themes
• Interplay between information and computational
complexity.
• Role of empirical data in driving theory and algorithm
development.
Part I:
De Novo DNA Assembly
Shotgun sequencing model
Basic model : uniformly sampled reads.
Assembly problem: reconstruct the genome given the
reads.
A Gigantic Jigsaw Puzzle
Challenges
Read errors
Long repeats
log(# of `-repeats)
16
15
16.5
16
14
15.5
12
15
10
10
14.5
8
14
13.5
6
5
13
4
12.5
2
12
0
11.5
0
0
20
2
40
60 100080
6
500 4
1001500
8 120
140
10
2000
160 12
Human Chr 22
repeat length histogram
`
Illumina read error profile
Two-step approach
• First, we assume the reads are noiseless
• Derive fundamental limits and near-optimal assembly
algorithms.
• Then, we add noise and see how things change.
Repeat statistics
harder jigsaw puzzle
easier jigsaw puzzle
How exactly do the fundamental limits depend on repeat statistics?
Lower bound: coverage
• Introduced by Lander-Waterman
in 1988.
• What is the number of reads needed to cover the entire
DNA sequence with probability 1-²?
• NLW only provides a lower bound on the number of reads
needed for reconstruction.
• NLW does not depend on the DNA repeat statistics!
Simple model: I.I.D. DNA, G ! 1
(Motahari, Bresler & Tse 12)
normalized # of reads
reconstructable
by greedy algorithm
coverage
1
many repeats
of length L
no repeats
of length L
no coverage
read length L
What about for finite real DNA?
I.I.D. DNA vs real DNA
(Bresler, Bresler & Tse 12)
Example: human chromosome 22 (build GRCh37, G = 35M)
log(# of `-repeats)
16
15
16.5
16
16
14
15
15.5
12
15
10
14
10
14.5
8
1314
13.5
6
12
5
13
4
11
12.5
2
12
10
0
11.5
0
0
i.i.d. fit
20
2 2
40 4
4
500
data
60
80
6 10006 8
100 8 120
10
1500
140
10
122000
160 12
14
`
Can we derive performance bounds on an individual sequence
basis?
Individual sequence
performance bounds
(Bresler, Bresler, Tse
BMC Bioinformatics 13)
Given a genome s
log(# of `-repeats)
GREEDY
DEBRUIJN
lower
bound
Lcritical
repeat
length
Human Chr 19
Build 37
SIMPLEBRIDGING
MULTIBRIDGING
Lander-Waterman
coverage
GAGE Benchmark Datasets
http://gage.cbcb.umd.edu/
Rhodobacter sphaeroides
lower
bound
Staphylococcus aureus
Human Chromosome14
G = 2,903,081
G = 88,289,540
G = 4,603,060
MULTIBRIDGING
lower
bound
MULTIBRIDGING
lower
bound
MULTIBRIDGING
Lower bound: Interleaved repeats
Necessary condition:
all interleaved repeats are bridged.
L
m
n
m
n
In particular: L > longest interleaved repeat length (Ukkonen)
Lower bound: Triple repeats
Necessary condition:
all triple repeats are bridged
L
In particular: L > longest triple repeat length (Ukkonen)
Individual sequence
performance bounds
(Bresler, Bresler, T.
BMC Bioinformatics 13)
log(# of `-repeats)
lower
bound
length
Human Chr 19
Build 37
Lander-Waterman
coverage
Greedy algorithm
(TIGR Assembler, phrap, CAP3...)
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads
2. Find two contigs with largest overlap and merge
them into a new contig
3. Repeat step 2 until only one contig remains
Greedy algorithm:
first error at overlap
repeat
contigs
bridging read already merged
A sufficient condition for reconstruction:
all repeats are bridged
L
Back to chromosome 19
lower bound
greedy
algorithm
log(# of `-repeats)
15
10
longest interleaved repeats
at length 2248
non-interleaved repeats
are resolvable!
5
0
0
1000
2000
GRCh37 Chr 19 (G = 55M)
3000
4000
longest repeat
at
Dense Read Model
• As the number of reads N increases, one can recover
exactly the L-spectrum of the genome.
• If there is at least one non-repeating L-mer on the
genome, this is equivalent information to having a
read at every starting position on the genome.
• Key question:
What is the minimum read length L for which the
genome is uniquely reconstructible from its L-spectrum?
de Bruijn graph
CCCT
CCTA
GCCC
(L = 5)
ATAGACCCTAGACGAT
AGCC
CTAG
TAGA
ATAG
AGAC
AGCG
GCGA
CGAT
1. Add a node for each (L-1)-mer on the genome.
2. Add k edges between two (L-1)-mers if their overlap has
length L-2 and the corresponding L-mer appears k times in
genome.
Eulerian path
CCCT
CCTA
GCCC
(L = 5)
ATAGACCCTAGACGAT
AGCC
CTAG
TAGA
ATAG
AGAC
AGCG
GCGA
CGAT
Theorem (Pevzner 95) :
If L > max(linterleaved, ltriple) , then the de Bruijn graph has a
unique Eulerian path which is the original genome.
Resolving non-interleaved repeats
Condensed sequence graph
non-interleaved repeat
Unique Eulerian path.
From dense reads to shotgun reads
[Idury-Waterman 95]
[Pevzner et al 01]
Idea: mimic the dense read scenario by looking at K-mers of
the length L reads
Construct the K-mer graph and find an Eulerian path.
Success if we have K-coverage of the genome and K > Lcritical
De Bruijn algorithm: performance
Loss of info. from the reads!
GREEDY
DEBRUIJN
lower
bound
length
Human Chr 19
Build 37
Lander-Waterman
coverage
Resolving bridged interleaved repeats
bridging read
interleaved repeat
Bridging read resolves one repeat and the unique Eulerian
path resolves the other.
Simple bridging: performance
GREEDY
DEBRUIJN
lower
bound
SIMPLEBRIDGING
length
Human Chr 19
Build 37
Lander-Waterman
coverage
Resolving triple repeats
all copies bridged
neighborhood of triple repeat
triple repeat
all copies bridged
resolve repeat locally
Triple Repeats: subtleties
Multibridging De-Brujin
Theorem: (Bresler,Bresler, Tse 13)
Original sequence is reconstructible if:
1. triple repeats are all-bridged
2. interleaved repeats are (single) bridged
3. coverage
Necessary conditions for ANY algorithm:
1. triple repeats are (single) bridged
1. interleaved repeats are (single) bridged.
2. coverage.
Multibridging: near optimality for Chr 19
GREEDY
DEBRUIJN
lower
bound
SIMPLEBRIDGING
length
MULTIBRIDGING
Human Chr 19
Build 37
Lander-Waterman
coverage
GAGE Benchmark Datasets
http://gage.cbcb.umd.edu/
Rhodobacter sphaeroides
Staphylococcus aureus
Human Chromosome14
G = 2,903,081
G = 88,289,540
G = 4,603,060
Lcritical = length of the longest triple or interleaved repeat.
Lcritical
Lcritical
Lcritical
lower
bound
MULTIBRIDGING
lower
bound
MULTIBRIDGING
lower
bound
MULTIBRIDGING
Gap
Sulfolobus islandicus.
G = 2,655,198
triple repeat
lower bound
interleaved repeat
lower bound
MULTIBRIDGING
algorithm
Complexity: Computational vs
Informational
• Complexity of MULTIBRIDGING
– For a G length genome, O(G2)
• Alternate formulations of Assembly
– Shortest Common Superstring: NP-Hard
– Greedy is O(G), but only a 4-approximation to SCS in the
worst case
– Maximum Likelihood: NP-Hard
• Key differences
– We are concerned only with instances when reads are
informationally sufficient to reconstruct the genome.
– Individual sequence formulation lets us focus on issues
arising only in real genomes.
Read Errors
ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT
TATA CTTA
Error rate and nature depends on sequencing technology:
Examples:
• Illumina: 0.1 – 2% substitution errors
• PacBio: 10 – 15% indel errors
We will focus on a simple substitution noise model with noise
parameter p.
Consistency
Basic question:
What is the impact of noise on Lcritical?
This question is equivalent to whether the L-spectrum is
exactly recoverable as the number of noisy reads
N -> 1.
Theorem (C.C. Wang 13):
Yes, for all p except p = ¾.
What about coverage depth?
Theorem (Motahari, Ramchandran,Tse, Ma 13):
Assume i.i.d. genome model. If read error rate p is
less than a threshold, then Lander-Waterman
coverage is sufficient for L > Lcritical
For uniform distr. on {A,G,C,T}, threshold is 19%.
A separation architecture is optimal:
error
correction
assembly
Why?
noise averaging
M
• Coverage means most positions are covered by
many reads.
• Multiple aligning overlapping noisy reads is possible if
• Assembly using noiseless reads is possible if
From theory to practice
Two issues:
1) Multiple alignment is performed by testing joint
typicality of M sequences, computationally too
expensive.
Solution: use the technique of finger printing.
2) Real genomes are not i.i.d.
Solution: replace greedy by multibridging.
Lam, Khalak, T.
Recomb-Seq 14
X-phased multibridging
Prochlorococcus marinus
Lcritical
Substitution errors of rate 1.5 %
More results
Prochlorococcus marinus
Lcritical
Methanococcus maripaludis
Lcritical
Helicobacter pylori
Lcritical
Mycoplasma agalactiae
Lcritical
A more careful look
Mycoplasma agalactiae
Lcritical-approx
Lcritical
Approximate repeat example:
Yersinia pestis
exact triple repeat, length 1662
5608
approximate triple repeat length
Acknowledgements
DNA Assembly
Abolfazl Motahari
Sharif
RNA Assembly
Soheil Mohajer
U. of Minnesota
Guy Bresler
MIT
Sreeram Kannan
Berkeley
Lior Pachter
Berkeley
Ma’ayan Bresler
Berkeley
Ka Kit Lam
Berkeley
Eren Sasoglu
Berkeley
Asif Khalak
Pacific Biosciences
Joseph Hui
Berkeley
Kayvon Mazooji
Berkeley
Download