Michael Brudno
CSC 2431 – Algorithms for HTS
University of Toronto
06/01/2010
High Throughput Sequencers
[Figure: sequence output (1 Mb to 100 Gb) versus read length (10 bp to 1,000 bp) for three classes of sequencer]
AB/SOLiDv3, Illumina/GAII short-read sequencers (10+ Gb in 50-100 bp reads, >100M reads, 4-8 days)
454 GS FLX pyrosequencer (100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours)
ABI capillary sequencer (0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours)
From Gabor Marth, BC
Sequencing chemistries
DNA base extension DNA ligation
Church, 2005
Massively parallel sequencing
Church, 2005
Features of HTS data
• Short (for now) sequence reads
–200-400bp: 454 (Roche)
–35-100bp: Solexa (Illumina), SOLiD (AB)
• Huge amount of sequence per run
–Up to 10s of gigabases per run
• Huge number of reads per run
–Up to hundreds of millions
• Higher error (compared with Sanger)
–Different error profile
The Raw Data
• Machine readouts differ between platforms
• Read length, accuracy, and error profiles are variable
• All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improve
454 Pyrosequencer error profile
• multiple bases in a homopolymeric run are incorporated in a single incorporation test, so the number of bases must be determined from a single scalar signal, and as a result the majority of errors are INDELs
• error rates are nucleotide-dependent
Illumina/Solexa base accuracy
• Error rate grows as a function of base position within the read
• A large fraction of the reads contains 1 or 2 errors
AB SOLiD System dibase sequencing
2-base, 4-color: 16 probe combinations

            2nd Base
            A  C  G  T
1st Base A  0  1  2  3
         C  1  0  3  2
         G  2  3  0  1
         T  3  2  1  0

[Figure: example probes drawn 3' to 5': N N N A T z z z, N N N G A z z z, N N N T G z z z]
● 4 dyes encode the 16 2-base combinations
● Detecting a single color indicates 4 combinations and eliminates 12
● Each color reflects position, not the base call
● Each base is interrogated by two probes
● Dual interrogation eases discrimination of errors (random or systematic) from SNPs (true polymorphisms)
Converting dibase (color) into letters
0 1 1 0 2 3 0 2 2
AA AC AC AA AG AT AA AG AG
CC CA CA CC CT CG CC CT CT
GG GT GT GG GA GC GG GA GA
TT TG TG TT TC TA TT TC TC
[Figure: dibase color-code matrix (1st base by 2nd base)]
0 1 1 0 2 3 0 2 2
A A C A A G C C T C
C C A C C T A A G A
G G T G G A T T C T
T T G T T C G G A G
4 possible sequences
The decoding matrix allows a sequence of transitions (colors) to be converted to a base sequence, as long as one of the bases is known.
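The decoding step can be sketched in a few lines. This is a minimal illustration, assuming the bases are indexed A=0, C=1, G=2, T=3, in which case the color of a base pair is the XOR of the two indices (this reproduces the matrix on the slide). The function names are illustrative and do not come from any SOLiD software.

```python
# Sketch: decode a color read into letters for each possible first base.
BASES = "ACGT"  # index order A=0, C=1, G=2, T=3


def next_base(prev_base, color):
    # With this index order, color = index(first) XOR index(second),
    # so the second base is recovered by XOR-ing the color back in.
    return BASES[BASES.index(prev_base) ^ color]


def decode(first_base, colors):
    seq = [first_base]
    for color in colors:
        seq.append(next_base(seq[-1], color))
    return "".join(seq)


# Color read from the slide; one translation per choice of first base.
colors = [0, 1, 1, 0, 2, 3, 0, 2, 2]
translations = {base: decode(base, colors) for base in BASES}
for base in BASES:
    print(base, translations[base])
```

Each of the four outputs corresponds to one row of the "4 possible sequences" table; knowing any single base selects one of them.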
SOLiD error checking code
A C G G T C G T C G T G T G C G T
A C G G T C G T C G T G T G C G T
No change
A C G G T C G C C G T G T G C G T
SNP
A C G G T C G T C G T G T G C G T
Measurement error
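The error-checking property can be illustrated with a small sketch: a SNP changes two adjacent colors in a consistent way, while a measurement error changes a single color. The code below assumes the XOR form of the color matrix (A=0, C=1, G=2, T=3); the classification rule and function names are illustrative assumptions, not the vendor's algorithm.

```python
# Sketch: tell a SNP from a single-color measurement error in color space.
BASES = "ACGT"


def encode(seq):
    # One color per adjacent base pair.
    idx = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(idx, idx[1:])]


def classify(ref_colors, read_colors):
    diffs = [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
    if not diffs:
        return "no change"
    if len(diffs) == 1:
        return "measurement error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i = diffs[0]
        # A real SNP leaves the flanking bases untouched, so the XOR of the
        # two affected colors must be the same in reference and read.
        if ref_colors[i] ^ ref_colors[i + 1] == read_colors[i] ^ read_colors[i + 1]:
            return "SNP"
    return "other"


reference = "ACGGTCGTCGTGTGCGT"  # sequence from the slide
with_snp = "ACGGTCGCCGTGTGCGT"   # T -> C at the eighth base

ref_colors = encode(reference)
one_bad_color = list(ref_colors)
one_bad_color[7] = (one_bad_color[7] + 1) % 4  # simulate a miscalled color

print(classify(ref_colors, encode(reference)))
print(classify(ref_colors, encode(with_snp)))
print(classify(ref_colors, one_bad_color))
```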
SOLiD Error rate & QVs
[Figure: error rate (0.00% to 10.00%) and quality values plotted against position on read]
Pacific Biosciences (PacBio)
Current and future application areas
Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
[Figure: reads aligned to a reference genome, showing a SNP and a deletion (DEL)]
De novo genome sequencing
Short-read sequencing will be (at least) an alternative to microarrays for:
• DNA-protein interaction analysis (ChIP-Seq)
• novel transcript discovery
• quantification of gene expression
• epigenetic analysis (methylation profiling)
What’s in it for us?
Image Management
VISION/Graphics
Base calling
Probabilistic Models
Machine Learning
Variant Calling
Data Storage
Systems
Cloud Computing
Data Management
Databases
Data Integrity
Read Mapping
String Algorithms
Assembly
Fundamental informatics challenges
1. Interpreting machine readouts – base calling, base error estimation
2. Alignment of billions of reads
3. Dealing with nonuniqueness in the genome: resequenceability
Informatics challenges (cont’d)
4. SNP and short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management
Questions?
SHRiMP: SHort Read Mapping Package
• Fast Mapping Algorithm
- Spaced seed hashing
- Vectored (very fast) Smith Waterman
- Handles micro insertions/deletions
• Specialized algorithm for aligning color-space (AB SOLiD) reads
• Computes p-values (and other statistics)
Regular Smith-Waterman
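[Figure: dynamic programming matrix for A C T A G A C T T G (columns) against C A G T T C (rows), marking the cell being computed and the previously computed cells]
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S(A_{i-1}, B_{j-1}) \\ M_{i-1,j} + \mathrm{gap} \\ M_{i,j-1} + \mathrm{gap} \end{cases}$$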
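The recurrence above translates directly into a row-by-row fill of the matrix. The sketch below is a plain, unvectorized version; the scoring values (match +1, mismatch -1, gap -2) and the floor at 0 that makes the alignment local are assumptions chosen for illustration, not values taken from the slides.

```python
# Sketch: score-only Smith-Waterman with a linear gap penalty.
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(
                0,                    # local alignment: never go below zero
                M[i - 1][j - 1] + s,  # diagonal: align a[i-1] with b[j-1]
                M[i - 1][j] + gap,    # gap in b
                M[i][j - 1] + gap,    # gap in a
            )
            best = max(best, M[i][j])
    return best


# Sequences from the slide's matrix.
print(smith_waterman("ACTAGACTTG", "CAGTTC"))
```

Each cell depends on its left, upper, and upper-left neighbours, which is what makes a naive row-wise vectorization difficult.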
Fast Local Alignment
BLAST (Altschul et al., 1990)
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
FASTA (Pearson, 1987)
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
SHRiMP Hashing
• SHRiMP uses spaced seeds
• Vectored Smith-Waterman
Genome
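A spaced seed samples only some positions of a window, so a mismatch at an unsampled position does not prevent a hit. The following is a generic sketch of spaced-seed hashing; the mask "1101101", the choice to index the genome rather than the reads, and all function names are assumptions for illustration and are not taken from SHRiMP's source.

```python
# Sketch: spaced-seed index and lookup.
from collections import defaultdict


def spaced_key(window, mask):
    # Keep only the characters where the mask has a '1'.
    return "".join(ch for ch, m in zip(window, mask) if m == "1")


def build_index(genome, mask):
    index = defaultdict(list)
    span = len(mask)
    for pos in range(len(genome) - span + 1):
        index[spaced_key(genome[pos:pos + span], mask)].append(pos)
    return index


def seed_hits(read, index, mask):
    span = len(mask)
    hits = set()
    for off in range(len(read) - span + 1):
        key = spaced_key(read[off:off + span], mask)
        for gpos in index.get(key, ()):
            hits.add(gpos - off)  # implied start of the read on the genome
    return hits


mask = "1101101"  # hypothetical seed: 5 sampled positions over a span of 7
genome = "AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA"
index = build_index(genome, mask)

# A read copied from genome position 10 with one substituted base.
read = genome[10:13] + "T" + genome[14:30]
print(sorted(seed_hits(read, index, mask)))
```

Candidate positions found this way would then be passed to the (more expensive) Smith-Waterman step.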
Vectored Instructions
• Modern processors can perform the same operation on several data elements at once (SIMD)
[Figure: element-wise addition and element-wise max applied to four-element vectors]
• Can we take advantage of vectorized instructions in Smith-Waterman?
Vectorizing Smith-Waterman (1st try)
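[Figure: dynamic programming matrix for A C T A G A C T T G (columns) against C A G T T C (rows), marking the cell being computed and the previously computed cells]
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S(A_{i-1}, B_{j-1}) \\ M_{i-1,j} + \mathrm{gap} \\ M_{i,j-1} + \mathrm{gap} \end{cases}$$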
Vectorizing Smith-Waterman (Wozniak)
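[Figure: dynamic programming matrix for A C T A G A C T T G (columns) against T C C A G T (rows), with the current, previous, and penultimate anti-diagonals marked]
$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S(A_{i-1}, B_{j-1}) \\ M_{i-1,j} + \mathrm{gap} \\ M_{i,j-1} + \mathrm{gap} \end{cases}$$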
Wozniak, 1997
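The anti-diagonal idea can be sketched without real SIMD instructions: all cells with the same i + j depend only on the two preceding anti-diagonals, so they can be computed together. The code below is an illustrative pure-Python version in which a list comprehension stands in for one vector operation; the scoring values and the floor at 0 are assumptions, and a full matrix is kept for clarity where a real implementation would keep only three diagonals.

```python
# Sketch: Smith-Waterman filled along anti-diagonals.
def sw_rowwise(a, b, match=1, mismatch=-1, gap=-2):
    # Reference implementation, row by row.
    M = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + s, M[i - 1][j] + gap, M[i][j - 1] + gap)
            best = max(best, M[i][j])
    return best


def sw_antidiagonal(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):  # d = i + j
        cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
        # Every cell on this diagonal reads only diagonals d-1 and d-2,
        # so the whole list could be computed as one vector operation.
        scores = [
            max(
                0,
                M[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch),
                M[i - 1][j] + gap,
                M[i][j - 1] + gap,
            )
            for i, j in cells
        ]
        for (i, j), s in zip(cells, scores):
            M[i][j] = s
        best = max(best, max(scores, default=0))
    return best


print(sw_antidiagonal("ACTAGACTTG", "TCCAGT"), sw_rowwise("ACTAGACTTG", "TCCAGT"))
```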
Vectorizing Smith-Waterman (SHRiMP)
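[Figure: dynamic programming matrix for A C T A G A C T T G (columns) against T C C A G T (rows), with the current, previous, and penultimate anti-diagonals marked and match (+) / mismatch (-) flags along the diagonal]
[Figure: A C T A G A C T T G compared element-wise with the reversed sequence T G A C C T, yielding the + flags]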
SHRiMP Speed
SW within SHRiMP while mapping 50,000 reads against a 4 Mb contig of C. savignyi

        Unvectored  Wozniak  Farrar  SHRiMP
Xeon    97          261      335     338
Core2   105         285      533     537
SHRiMP performance for mapping 11,200 AB SOLiD 25 bp reads to the 180 Mb Ciona savignyi genome

K-mer     (7,8)  (8,9)  (9,10)  (10,11)  (12,13)
% in SW   45%    25%    12%     7%       3%
Time (s)  2066   520    255     195      205
Color-space (dibase) Sequencing
A C G T
A 0 1 2 3
C 1 0 3 2
G 2 3 0 1
T 3 2 1 0
[Figure: color-space transition diagram: nodes A, C, G, T; edges labeled with colors 0-3]
Mapping reads in Color-space
SNPs
G: T TGAGTTATGGAT
     012210331023
R:   012 12 0331023
   T TGA C TTATGGAT

TGA G TT    12 21 0   (reference)
TGA C TT    12 12 0
TGA A TT    12 03 0
TGA T TT    12 30 0

INDELS
TGAGTTA     122103     (reference)
TGA TTA     12 -3 03   (deletion)
TGAGT T TA  1221 0 03  (insertion)
TGAGT A TA  1221 33 3  (insertion)
Mapping reads in Letter Space
G: TGACTTATGGAT
||||| |||||||
T TGAGT CGCAAGC
C CAGAC TATGGAT
R: 01221 2 331023
[Figure: color-space transition diagram: nodes A, C, G, T; edges labeled with colors 0-3]
SOLiD Translations
• Given the following read, there are 4 translations (we need an initial base):
0 1 2 2 3 3 1 0 2
A A C T C G C A A G
C C A G A T A C C T
G G T C T A T G G A
T T G A G C G T T C
SOLiD Translations
• Reads begin with a known primer (‘T’)
– The translation is: T T G A G C G T T C
0 1 2 2 3 3 1 0 2
A A C T C G C A A G
C C A G A T A C C T
G G T C T A T G G A
T T G A G C G T T C
SOLiD Translations
• What if we had a sequencing error?
– The right translation was: T T G A G C G T T C
0 1 0 2 3 3 1 0 2
A A C C T A T G G A
C C A A G C G T T C
G G T T C G C A A G
T T G G A T A C C T
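The effect of a single miscalled color on the translation can be checked directly. This sketch again assumes the XOR form of the color matrix (A=0, C=1, G=2, T=3); the variable names are illustrative.

```python
# Sketch: one wrong color corrupts every base after it.
BASES = "ACGT"


def decode(first_base, colors):
    seq = [first_base]
    for color in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ color])
    return "".join(seq)


correct_colors = [0, 1, 2, 2, 3, 3, 1, 0, 2]
error_colors = [0, 1, 0, 2, 3, 3, 1, 0, 2]  # third color miscalled (2 read as 0)

right = decode("T", correct_colors)
wrong = decode("T", error_colors)
mismatches = [i for i, (x, y) in enumerate(zip(right, wrong)) if x != y]

print(right)
print(wrong)
print(mismatches)
```

The translations agree up to the miscalled color and disagree at every base after it, which is why translating to letters first and aligning afterwards performs poorly on reads with color errors.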
Colour-space Smith-Waterman
• Think of 4 SW matrices stacked above one another
• If we have 1 read error, but otherwise perfect match, we’ll use 2 matrices
Genome
Read
Frame 1 Frame 2 Frame 3 Frame 4
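The stacked-matrix idea can be sketched in a simplified, gap-free form: keep one score per frame (one frame per possible translation of the read), and let the alignment hop between frames at the price of a color-error penalty. The code below is an illustration under stated assumptions, not SHRiMP's implementation: it handles no indels, aligns the read to a genome window of the same length, uses assumed scores (match +1, mismatch -1, color error -2), and names each frame by the first base of its translation.

```python
# Sketch: gap-free alignment of a color read over four translation frames.
BASES = "ACGT"


def decode(first_base, colors):
    seq = [first_base]
    for color in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ color])
    return "".join(seq)


def align_colors(genome, colors, primer="T", match=1, mismatch=-1, color_error=-2):
    if len(genome) != len(colors):
        raise ValueError("this sketch expects a genome window as long as the read")
    # frames[f][k] is the letter at read position k under translation f.
    frames = [decode(b, colors)[1:] for b in BASES]
    # Before the first color, only the primer's frame is possible.
    prev = [0 if b == primer else float("-inf") for b in BASES]
    back = []
    for k, g_base in enumerate(genome):
        cur, ptr = [], []
        for f in range(4):
            # Best frame to come from; changing frame means a color error here.
            src = max(range(4), key=lambda g: prev[g] + (0 if g == f else color_error))
            step = match if frames[f][k] == g_base else mismatch
            cur.append(prev[src] + (0 if src == f else color_error) + step)
            ptr.append(src)
        back.append(ptr)
        prev = cur
    f = max(range(4), key=lambda g: prev[g])
    best = prev[f]
    path = []
    for k in range(len(genome) - 1, -1, -1):
        path.append(BASES[f])
        f = back[k][f]
    return best, "".join(reversed(path))


# Read with one miscalled color against the true sequence (primer removed).
score, path = align_colors("TGAGCGTTC", [0, 1, 0, 2, 3, 3, 1, 0, 2])
print(score, path)
```

In this example the alignment starts in the primer's frame and moves to a second frame at the miscalled color, so one read error uses two frames, as the slide states. A SNP, by contrast, appears as a letter mismatch within a single frame.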
Combined Color/Letter Space SW
[Figure: color-space transition diagram (nodes A, C, G, T; edges labeled with colors 0-3) combined with letter-space alignment cells]
SHRiMP on Ciona savignyi
• C. savignyi is a chordate with a very large SNP rate (5%)
• Mapped 22 million AB SOLiD reads to the reference C. savignyi genome (6 hours on 200 CPUs).
              p<.05   p<.01
Reads mapped  20%     9%
SNP rate      .039    .024
Indel rate    .004    .003
Error rate    .024    .020
G: 1123724 TA-ACCACGGTCACACTTGCATCAC 1123701
|| |||||||||| |||X|||||||
T: TACACCACGGTCAGACTtGCATCAC
R: 0 T0311101130121221211313211 24
SHRiMP Summary
• Fast mapping of short reads to a genome
-- Handles indels & color-space reads
-- Easy to parallelize
-- Small memory footprint
• Computation of p-values & other statistics for hits
• Publicly available & free
Acknowledgments
UofT: Stephen Rumble
Stanford: Phil Lacroute, Anton Valouev, Arend Sidow
http://compbio.cs.toronto.edu/shrimp
FUNDING: NSERC, CFI, NIH
Why is color-space good?
• SNP discovery
• Error correction with letter & color reads (assembly)
R1: 0 TAGACCACGGTCACACTTGCATCAC 24
|| |||||||||| |||X|||||||
T: TACACCACGGTCAGACTtGCATCAC
R2: 0 T0311101130121221211313211 24
• Can fix errors without (explicit) overlap
T: TACACCACGGTCAGACTTGCATCAC
R1: T03111011301 2 1221211013211 24
R2: T211301 3 122121101321103111 24
R3: T2212110132110311121130131 24
• Don’t just do everything in color space!
What are structural variations?
Various examples of structural variations
Type of Structural Variations (1)
Insertion
A
REF
Type of Structural Variations (2)
Deletion
A
REF
Type of Structural Variations (3)
Inversion
[Figure: segment inverted in A relative to REF, with 5' and 3' strand ends marked]
Type of Structural Variations (4)
Translocation chr1 chr2
Clone-end Sequencing Approaches
1. “Fine-scale structural variation of the human genome”
[Tuzun et al, 2005]
• Mapping matepairs onto the reference genome
• If mappings of matepairs are not consistent, then there exist structural variations.
2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome”
[Korbel et al, 2007]
• Proposed high-throughput and massive paired end mapping technique
• Detailed types of structural variations
Motivation
Reads can map to many locations on the genome. How do we choose between them?
Tuzun et al. and Korbel et al. used scores that combine several factors (e.g. length, identity, quality of the sequences, concordance).
Probabilistic Framework (1)
We use p(Y) to describe our probabilistic framework.
p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes
Probabilistic Framework (2)
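Insertion
[Figure: density p(Y) with \mu_Y, (s+r), and \delta marked]
$$P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\, P(X_j \mid \mathrm{ins}=r)$$
$$P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$$
where $\delta = |\mu_Y - (s+r)|$, s = mapped distance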
Probabilistic Framework (3)
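Deletion
[Figure: density p(Y) with \mu_Y, (s-r), and \delta marked]
$$P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\, P(X_j \mid \mathrm{del}=r)$$
$$P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$$
where $\delta = |\mu_Y - (s-r)|$, s = mapped distance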
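The insertion and deletion probabilities can be evaluated numerically once a form for p(Y) is chosen. The sketch below assumes, purely for illustration, that p(Y) is a normal distribution with mean mu and standard deviation sigma; the slides describe p(Y) as an empirical distribution of mapped distances, so this is a stand-in. The function names and the numbers in the example are hypothetical.

```python
# Sketch: P(X_i | ins=r) and P(X_i | del=r) under a normal p(Y) (assumption).
import math


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def p_single(s, r, mu, sigma, kind):
    # s: mapped distance, r: size of the hypothesised event.
    if kind == "ins":
        implied = s + r
    elif kind == "del":
        implied = s - r
    else:
        raise ValueError("kind must be 'ins' or 'del'")
    delta = abs(mu - implied)
    inside = normal_cdf(mu + delta, mu, sigma) - normal_cdf(mu - delta, mu, sigma)
    return 1.0 - inside


def p_pair(s_i, s_j, r, mu, sigma, kind):
    # The two matepairs are treated as independent given the event.
    return p_single(s_i, r, mu, sigma, kind) * p_single(s_j, r, mu, sigma, kind)


# Hypothetical library: mean 40,000 with standard deviation 2,000.
perfect = p_single(38000, 2000, 40000, 2000, "ins")  # s + r equals the mean
two_sd = p_single(34000, 2000, 40000, 2000, "ins")   # s + r is 2 sigma away
print(perfect, two_sd, p_pair(38000, 34000, 2000, 40000, 2000, "ins"))
```

The probability is 1 when the event size exactly explains the mapped distance and falls toward 0 as the implied distance moves away from the mean.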
Probabilistic Framework (4)
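Inversion
[Figure: density p(|Y1-Y2|) with \mu_{|Y1-Y2|} and \delta marked]
c - d = s(X1) - s(X2)
$$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1-Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$$
where $\delta = |\mu_{|Y_1-Y_2|} - (c - d)|$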
Probabilistic Framework (5)
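Translocation
[Figure: density p(|Y1-Y2|) with \mu_{|Y1-Y2|} and \delta marked]
(c – a) – (d – b) = s(X1) - s(X2)
$$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1-Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$$
where $\delta = |\mu_{|Y_1-Y_2|} - (c - a) - (d - b)|$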
Flow of our Framework (1)
1. Preprocessing step
Discard concordant matepairs
Remove short mappings
Remove very similar mappings
Mask repeats
Get top K mappings
Remove invalid strands (-,+)
Make all possible combinations of mappings
Flow of our Framework (2)
2. Clustering
Do hierarchical clustering for each structural variation
(Insertion, Deletion, Inversion, Translocation)
3. Finding structural variations
Find initial configuration
Learn parameters for the objective function
Find a (locally) optimal configuration
Hierarchical Clustering (1)
[Figure: (ex) Insertion: matepairs X1 and X2 from A mapped onto REF, forming the cluster C = {X1, X2}]
• A cluster is a set of mapped locations explaining the same structural variant
• Linkage distance is D(X1, X2) = - ln P(X1, X2|C)
Hierarchical Clustering (2)
• Linkage distance is
• Find the two closest clusters; if D(C_u, C_v) < cutoff, merge.
[Figure: clusters C1 and C2 over mapped locations 1-5 of matepairs R1 and R2]
Find a Unique Mapped Location
[Figure: matepairs R1 and R2 with mapped locations 1-5 in clusters C1 and C2; candidate assignments M_{1,4}, M_{2,4} (C1) and M_{3,5} (C2)]
Assign matepairs to unique mapped locations
(and hence unique clusters).
Which Location is Best?
• We define an objective function J(ω)
– ƒ1 corresponds to BLAT hit scores
– ƒ2 corresponds to the probability
– ƒ3 corresponds to the size of clusters
Finding the “Best” Location
• Find the initial configuration greedily.
– Assign matepairs to clusters starting with those with fewest mapped locations
• Learn parameters for objective function J(ω).
– We used hill climbing search to maximize the log likelihood of P(ω|λ_i).
• Finally, find a configuration locally maximizing J(ω) using hill climbing search.
Clustering Results
We started with ~2,984,000 matepairs
• ~93% were uniquely mapped
• ~94% had a concordant position (mapped at ± 2
)
Through the clustering procedure we found (FDR 0.05)
• 795 Insertion clusters (691 had a uniquely mapped read)
• 1289 Deletion clusters (1120)
• ~200 Inversion clusters (~150)
• 164 Translocation (cross-chromosome) clusters
(all were required to have a uniquely mapped read)
Example Deletion
Agreement with Previous Results
Type       All         Tuzun       Levy          Korbel        DGV-All
Insertion  795(691)    50(36)/139  109(101)/319  1(1)/34       209(169)/2216
Deletion   1289(1120)  84(70)/102  194(188)/344  275(236)/742  539(446)/4697
Inversion  ~200(~150)  198(46)/56  N/A           67(55)/105    111(87)/164
All of the correlations (besides the one) are
Translocations
• 47% of the translocations were close to the centromeres
Distance to centromere  <10^6  (10^6, 4.5*10^6]  >4.5*10^6
<10^6                   38     36                3
(10^6, 4.5*10^6]               19                3
>4.5*10^6                                        65
• She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart
• These could also be mis-assemblies.
Summary (Structural Variation)
• Introduced a probabilistic framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions.
• Isolated hundreds of insertions, deletions, and inversions between the reference public human genome and the JCVI donor.
• These results show statistically significant correlation with previous variation studies
• About 2/3 of the structural variants we isolate are not found in the Database of Genomic Variants
What about Copy Number Variants?
• Copy Number Variants are the result of duplications and deletions of large genomic segments
• Currently mainly found using microarray technology (ROMA, CGH)
• There is no algorithm for CNV finding with short reads (?)
• Goal: predict the number of times a certain segment appears in the genome
A Little Bit of Math
Let C = #reads / length of genome
Let i be a read
Let x_i be the number of times it was sampled.
The assembled genome should contain every read about x_i / C times.
For example, let C = 3, x_i = 7, so read i should appear about 7/3 ≈ 2.3 times.
[Figure: probability (0 to 0.3) plotted over copy counts 0 to 6]
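The C = 3, x_i = 7 example can be worked numerically. The sketch below assumes a Poisson sampling model, in which a read present g times in the genome is sampled a Poisson(g * C) number of times; the slides do not state the exact distribution used, so this model and the function names are assumptions for illustration.

```python
# Sketch: likelihood of each copy count g for a read sampled x_i times.
import math


def poisson_pmf(x, lam):
    if lam == 0:
        return 1.0 if x == 0 else 0.0
    # Computed in log space to avoid large factorials.
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def copy_count_likelihoods(x_i, C, max_g=6):
    # Entry g is P(read sampled x_i times | read occurs g times).
    return [poisson_pmf(x_i, g * C) for g in range(max_g + 1)]


likes = copy_count_likelihoods(x_i=7, C=3)
best_g = max(range(len(likes)), key=likes.__getitem__)
for g, p in enumerate(likes):
    print(g, round(p, 4))
print("most likely copy count:", best_g)
```

Under this model the most likely integer copy count is the one whose expected sampling count g * C is closest, in likelihood terms, to the observed x_i.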
More formally
Let n = number of reads, N = length of the genome
• The probability P_i that read i was sampled x_i times given that it appears in the genome g_i times is
P_i
• We want to maximize the likelihood that all of the reads were sampled from the genome:
\prod_i P_i
• However there is an additional constraint
The additional constraint…
[Figure: three likelihood plots (0 to 0.3 over copy counts 0 to 6) shown with the sequence ATCGGCACTG]
g_1 + g_2 = g_3
Solving for all g_i … simultaneously!
Instead of maximizing the product, minimize the sum of the negative logs:
[Figure: three convex cost plots (0 to 12 over copy counts 0 to 6) shown with the sequence ATCGGCACTG]
This is just min-cost network flow with convex costs!
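For a toy instance, the constrained problem can be solved by brute force, which shows how the constraint g_1 + g_2 = g_3 changes the answer. This sketch is not the min-cost flow formulation; it enumerates all assignments for three reads under an assumed Poisson sampling model, and the sample counts (7, 5, 10) and function names are made up for illustration.

```python
# Sketch: joint copy-count estimation for three reads with g1 + g2 = g3.
import itertools
import math


def neg_log_poisson(x, lam):
    # Convex in lam for fixed x; this plays the role of the per-read cost.
    return lam - x * math.log(lam) + math.lgamma(x + 1)


def independent_estimates(x, C, max_g=6):
    # Each read optimised on its own, ignoring the constraint.
    return tuple(
        min(range(1, max_g + 1), key=lambda g: neg_log_poisson(xi, g * C))
        for xi in x
    )


def joint_estimate(x, C, max_g=6):
    best_cost, best_g = None, None
    for g1, g2 in itertools.product(range(1, max_g + 1), repeat=2):
        g = (g1, g2, g1 + g2)  # enforce g1 + g2 = g3
        cost = sum(neg_log_poisson(xi, gi * C) for xi, gi in zip(x, g))
        if best_cost is None or cost < best_cost:
            best_cost, best_g = cost, g
    return best_g


x = (7, 5, 10)  # hypothetical sample counts
C = 3
print(independent_estimates(x, C))
print(joint_estimate(x, C))
```

Solved independently, the three estimates violate the constraint; solved jointly, the third count is adjusted so that the copy counts are consistent. Enumeration is only feasible for tiny cases, which is why a network flow formulation matters at genome scale.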
Copy Count Prediction Results
• Simulated reads from E. coli bacteria (4.5 Mb)

       Copy-Count Error
C      -2   -1    0      +1   +2   +3
50X    4    397   3.9 M  170  18   6
75X    0    7     4.3 M  22   0    0
100X   0    2     4.5 M  6    0    0
200X   0    0     4.5 M  4    0    0

• How to scale this to Human???
Discovering Variation
• SHRiMP -- SHort Read Mapping Package
– Computes p-values & other statistics
– Specialized Color-space alignment
• Algorithm for Structural Variation Discovery
– Will it scale to short reads?
• A model for Copy Count Prediction
– Works well with reads from E. coli, but how to scale to Human?