Multiple Alignment of Genomic Sequences

High Throughput Sequencing:

Technologies & Applications

Michael Brudno

CSC 2431 – Algorithms for HTS

University of Toronto

06/01/2010

High Throughput Sequencers

[Figure: sequencing platforms plotted by output per run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp)]

• AB/SOLiDv3, Illumina/GAII short-read sequencers (10+ Gb in 50-100 bp reads, >100M reads, 4-8 days)

• 454 GS FLX pyrosequencer (100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours)

• ABI capillary sequencer (0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours)

From Gabor Marth, BC

Sequencing chemistries

[Figure panels: DNA base extension; DNA ligation]

Church, 2005

Massively parallel sequencing

Church, 2005

Features of HTS data

• Short (for now) sequence reads

– 200-400 bp: 454 (Roche)

– 35-100 bp: Solexa (Illumina), SOLiD (AB)

• Huge amount of sequence per run

–Up to 10s of gigabases per run

• Huge number of reads per run

–Up to 100’s of millions

• Higher error (compared with Sanger)

–Different error profile

The Raw Data

• Machine Readouts are different

• Read length, accuracy, and error profiles are variable.

• All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improve

454 Pyrosequencer error profile

• Multiple bases in a homopolymeric run are incorporated in a single incorporation test → the number of bases must be determined from a single scalar signal → the majority of errors are INDELs

• error rates are nucleotide-dependent
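A minimal sketch of why homopolymer runs produce INDELs, assuming each flow yields one scalar signal roughly proportional to the run length and that the caller simply rounds it. The flow order, the signal values, and the function name are hypothetical, not taken from any 454 software.

```python
def call_homopolymer_lengths(flow_order, signals):
    """Turn one scalar signal per nucleotide flow into bases by rounding."""
    bases = []
    for nucleotide, signal in zip(flow_order, signals):
        run_length = max(0, round(signal))  # run length estimated from a single number
        bases.append(nucleotide * run_length)
    return "".join(bases)


flow_order = "TACGTACG"                            # hypothetical flow order
clean = [1.0, 2.0, 0.0, 1.0, 0.0, 0.0, 3.0, 0.0]   # ideal signals
noisy = [1.1, 2.4, 0.1, 0.9, 0.0, 0.1, 3.6, 0.0]   # same template, noisier readout

print(call_homopolymer_lengths(flow_order, clean))  # TAAGCCC
print(call_homopolymer_lengths(flow_order, noisy))  # TAAGCCCC (one inserted C)
```

The noisy 3.6 signal rounds to 4, so the error shows up as an inserted base rather than a substitution.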

Illumina/Solexa base accuracy

• Error rate grows as a function of base position within the read

• A large fraction of the reads contains 1 or 2 errors

AB SOLiD System dibase sequencing

2-base, 4-color: 16 probe combinations

2nd Base

          A  C  G  T
1st   A   0  1  2  3
Base  C   1  0  3  2
      G   2  3  0  1
      T   3  2  1  0

[Figure: example probes, each drawn 3’ to 5’: N N N A T z z z, N N N G A z z z, N N N T G z z z]

● 4 dyes to encode 16 2-base combinations

● Detecting a single color indicates 4 combinations & eliminates 12

● Each color reflects position, not the base call

● Each base is interrogated by two probes

● Dual interrogation eases discrimination

– errors (random or systematic) vs. SNPs (true polymorphisms)
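The 2-base, 4-color encoding can be sketched as follows. This assumes the bases are numbered A=0, C=1, G=2, T=3, in which case the color of a dibase is the XOR of the two base codes; the function names are illustrative only.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(bases):
    """Color of each adjacent base pair: XOR of the two 2-bit base codes."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]


def dibases_for_color(color):
    """The 4 of 16 dibase combinations that a single color is compatible with."""
    return [a + b for a in "ACGT" for b in "ACGT"
            if BASE_CODE[a] ^ BASE_CODE[b] == color]


print(dibases_for_color(0))            # ['AA', 'CC', 'GG', 'TT']
print(encode_colors("TTGAGTTATGGAT"))  # [0, 1, 2, 2, 1, 0, 3, 3, 1, 0, 2, 3]
```

Each base takes part in two consecutive colors, which is the "dual interrogation" property.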

Converting dibase (color) into letters

0 1 1 0 2 3 0 2 2

AA AC AC AA AG AT AA AG AG

CC CA CA CC CT CG CC CT CT

GG GT GT GG GA GC GG GA GA

TT TG TG TT TC TA TT TC TC

Color read and its 4 possible sequences (one per choice of first base):

0 1 1 0 2 3 0 2 2

A A C A A G C C T C

C C A C C T A A G A

G G T G G A T T C T

T T G T T C G G A G

The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of the two bases of some dibase is known.
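A small sketch of the decoding step, assuming the same A=0, C=1, G=2, T=3 numbering so that the next base is the previous base XOR the color. The function names are illustrative.

```python
BASES = "ACGT"


def decode_colors(first_base, colors):
    """Translate a color read into letters, given the base it starts from."""
    sequence = [first_base]
    for color in colors:
        previous = BASES.index(sequence[-1])
        sequence.append(BASES[previous ^ color])  # next base = previous XOR color
    return "".join(sequence)


def all_translations(colors):
    """One candidate sequence per possible first base."""
    return [decode_colors(base, colors) for base in BASES]


for candidate in all_translations([0, 1, 1, 0, 2, 3, 0, 2, 2]):
    print(candidate)
# AACAAGCCTC
# CCACCTAAGA
# GGTGGATTCT
# TTGTTCGGAG
```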

SOLiD error checking code

Reference: A C G G T C G T C G T G T G C G T

No change: A C G G T C G T C G T G T G C G T

SNP (two adjacent colors change): A C G G T C G C C G T G T G C G T

Measurement error (a single color changes): A C G G T C G T C G T G T G C G T
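The check can be sketched for an ungapped comparison of read colors against reference colors. The rule used here is an assumption consistent with the encoding: a SNP changes two adjacent colors while leaving their XOR unchanged, and an isolated color mismatch is treated as a measurement error. The function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(bases):
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(bases, bases[1:])]


def classify_color_mismatches(ref_colors, read_colors):
    """Label each color mismatch as part of a SNP or as a measurement error."""
    events = []
    i, n = 0, len(ref_colors)
    while i < n:
        if ref_colors[i] == read_colors[i]:
            i += 1
            continue
        # A SNP alters two neighbouring colors but preserves their XOR,
        # because the bases flanking the SNP are unchanged.
        valid_pair = (
            i + 1 < n
            and ref_colors[i + 1] != read_colors[i + 1]
            and ref_colors[i] ^ ref_colors[i + 1] == read_colors[i] ^ read_colors[i + 1]
        )
        if valid_pair:
            events.append((i, "SNP"))
            i += 2
        else:
            events.append((i, "measurement error"))
            i += 1
    return events


reference = encode_colors("ACGGTCGTCGTGTGCGT")
snp_read = encode_colors("ACGGTCGCCGTGTGCGT")  # T -> C at base 7
error_read = list(reference)
error_read[6] = 2                              # one corrupted color

print(classify_color_mismatches(reference, reference))   # []
print(classify_color_mismatches(reference, snp_read))    # [(6, 'SNP')]
print(classify_color_mismatches(reference, error_read))  # [(6, 'measurement error')]
```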

SOLiD Error rate & QVs

[Figure: error rate (0.00% to 10.00%) and quality value (0 to 40) plotted against Position on Read (0 to 40)]

Pacific Biosciences (PacBio)

Current and future application areas

Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery

[Figure: reads aligned to a reference genome, with a SNP and a deletion (DEL) marked]

De novo genome sequencing

Short-read sequencing will be (at least) an alternative to microarrays for:

• DNA-protein interaction analysis (ChIP-Seq)

• novel transcript discovery

• quantification of gene expression

• epigenetic analysis (methylation profiling)

What’s in it for us?

Image Management

VISION/Graphics

Base calling

Probabilistic Models

Machine Learning

Variant Calling

Data Storage

Systems

Cloud Computing

Data Management

Databases

Data Integrity

Read Mapping

String Algorithms

Assembly

Fundamental informatics challenges

1. Interpreting machine readouts – base calling, base error estimation

2. Alignment of billions of reads

3. Dealing with nonuniqueness in the genome: resequenceability

Informatics challenges (cont’d)

4. SNP and short INDEL, and structural variation discovery

5. Data visualization

6. Data storage & management

Questions?

SHRiMP: SHort Read Mapping Package

• Fast Mapping Algorithm

- Spaced seed hashing

- Vectored (very fast) Smith Waterman

- Handles micro insertions/deletions

• Specialized algorithm for aligning color-space (AB SOLiD) reads

• Computes p-values (and other statistics)

Regular Smith-Waterman

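[Figure: dynamic programming matrix with A C T A G A C T T G along the top and C A G T T C down the side; legend: cell being computed, previously computed cells]

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S(A_{i-1}, B_{j-1}) \\ M_{i-1,j} + \mathrm{gap} \\ M_{i,j-1} + \mathrm{gap} \end{cases}$$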
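A plain scalar sketch of the recurrence on this slide. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions, and the extra 0 in the max is the usual Smith-Waterman floor for local alignment, added here even though it does not appear in the recurrence as shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a against b, filled row by row."""
    rows, cols = len(a) + 1, len(b) + 1
    M = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(
                0,                    # local alignment may restart anywhere
                M[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                M[i - 1][j] + gap,    # gap in b
                M[i][j - 1] + gap,    # gap in a
            )
            best = max(best, M[i][j])
    return best


print(smith_waterman_score("ACGT", "ACGT"))         # 8
print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6 (the shared ACG)
```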



Fast Local Alignment

BLAST (Altschul et al 1990)

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

FASTA (Pearson 1987)

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

SHRiMP Hashing

• SHRiMP uses spaced seeds

• Vectored Smith-Waterman

Genome
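A toy sketch of spaced-seed hashing. The mask "1101" (1 = position must match, 0 = ignored) and all function names are illustrative assumptions, not SHRiMP's actual seeds or code. Candidate genome offsets found this way would then be passed to Smith-Waterman.

```python
from collections import defaultdict


def spaced_key(window, mask):
    """Keep only the characters at the '1' positions of the seed mask."""
    return "".join(ch for ch, m in zip(window, mask) if m == "1")


def build_index(genome, mask):
    """Hash every genome window by its spaced key."""
    index = defaultdict(list)
    for pos in range(len(genome) - len(mask) + 1):
        index[spaced_key(genome[pos:pos + len(mask)], mask)].append(pos)
    return index


def candidate_offsets(read, index, mask):
    """Count seed hits per implied genome offset of the read start."""
    hits = defaultdict(int)
    for rpos in range(len(read) - len(mask) + 1):
        key = spaced_key(read[rpos:rpos + len(mask)], mask)
        for gpos in index.get(key, []):
            hits[gpos - rpos] += 1
    return dict(hits)


genome = "GATTACAGGCCTTAAGCGT"
mask = "1101"
read = "CATGCCTTAA"  # genome[5:15] with one mismatch at read position 2
offsets = candidate_offsets(read, build_index(genome, mask), mask)
print(max(offsets, key=offsets.get))  # 5
```

The window starting at read position 0 still hits because the mismatch falls on the ignored position of the mask, which is the point of spacing the seed.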

Vectored Instructions

• Modern computers can perform the same operation on several elements at once (SIMD)

[Figure: element-wise addition (+) and element-wise maximum (max) applied to 4-element vectors]

• Can we take advantage of vector instructions in Smith-Waterman?

Vectorizing Smith-Waterman (1st try)

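[Figure: dynamic programming matrix with A C T A G A C T T G along the top and C A G T T C down the side; legend: cell being computed, previously computed cells]

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S(A_{i-1}, B_{j-1}) \\ M_{i-1,j} + \mathrm{gap} \\ M_{i,j-1} + \mathrm{gap} \end{cases}$$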



Vectorizing Smith-Waterman (Wozniak)

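[Figure: dynamic programming matrix with A C T A G A C T T G along the top and T C C A G T down the side, processed along anti-diagonals labeled Current, Previous, and Penultimate]

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S(A_{i-1}, B_{j-1}) \\ M_{i-1,j} + \mathrm{gap} \\ M_{i,j-1} + \mathrm{gap} \end{cases}$$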

Wozniak, 1997
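A pure-Python sketch of the anti-diagonal ordering. Every cell on the current anti-diagonal depends only on the previous and penultimate ones, so the inner loop has no dependencies between iterations and is what a SIMD implementation would execute as vector operations. Scoring values and the 0 floor are assumptions, as in any standard local alignment; this is not Wozniak's or SHRiMP's code.

```python
def smith_waterman_antidiagonal(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score computed one anti-diagonal (i + j = d) at a time."""
    n, m = len(a), len(b)
    penultimate = [0] * (n + 1)  # anti-diagonal d-2, indexed by row i
    previous = [0] * (n + 1)     # anti-diagonal d-1, indexed by row i
    best = 0
    for d in range(2, n + m + 1):
        current = [0] * (n + 1)
        lo, hi = max(1, d - m), min(n, d - 1)
        for i in range(lo, hi + 1):  # cells here are independent of each other
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            current[i] = max(
                0,
                penultimate[i - 1] + s,  # M[i-1][j-1]
                previous[i - 1] + gap,   # M[i-1][j]
                previous[i] + gap,       # M[i][j-1]
            )
        best = max(best, max(current))
        penultimate, previous = previous, current
    return best


print(smith_waterman_antidiagonal("TTACGTTT", "GGACGGG"))  # 6
```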



Vectorizing Smith-Waterman (SHRiMP)

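[Figure: the same matrix (A C T A G A C T T G along the top, T C C A G T down the side) with anti-diagonals labeled Current, Previous, and Penultimate and cells marked + or -; below it, A C T A G A C T T G laid against the reversed sequence T G A C C T, with + marks under two positions]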

SHRiMP Speed

SW within SHRiMP while mapping 50,000 reads against a 4Mb contig of C. savignyi

        Unvectored  Wozniak  Farrar  SHRiMP
Xeon    97          261      335     338
Core2   105         285      533     537

SHRiMP performance for mapping 11,200 AB SOLiD 25 bp reads to the 180 Mb Ciona savignyi genome

K-mer     (7,8)  (8,9)  (9,10)  (10,11)  (12,13)
% in SW   45%    25%    12%     7%       3%
Time (s)  2066   520    255     195      205

Color-space (dibase) Sequencing

A C G T

A 0 1 2 3

C 1 0 3 2

G 2 3 0 1

T 3 2 1 0

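[Figure: the four bases A, C, G, T drawn as nodes of a graph; every transition between bases is labeled with its color 1, 2, or 3, and staying on the same base is color 0]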

Mapping reads in Color-space

SNPs

G: T TGAGTTATGGAT   012210331023

R: T TGACTTATGGAT   012 12 0331023

TGA G TT   12 21 0

TGA C TT   12 12 0

TGA A TT   12 03 0

TGA T TT   12 30 0

INDELS

TGAGTTA      122103

TGA TTA      12 -3 03

TGAGT T TA   1221 0 03

TGAGT A TA   1221 33 3

Mapping reads in Letter Space

G: TGACTTATGGAT

||||| |||||||

T TGAGT CGCAAGC

C CAGAC TATGGAT

R: 01221 2 331023

[Figure: color transition graph over the bases A, C, G, T]

SOLiD Translations

• Given the following read, there are 4 translations (we need an initial base):

0 1 2 2 3 3 1 0 2

A A C T C G C A A G

C C A G A T A C C T

G G T C T A T G G A

T T G A G C G T T C

SOLiD Translations

• Reads begin with a known primer (‘T’)

– The translation is: T T G A G C G T T C

0 1 2 2 3 3 1 0 2

A A C T C G C A A G

C C A G A T A C C T

G G T C T A T G G A

T T G A G C G T T C

SOLiD Translations

• What if we had a sequencing error?

– The right translation was: T T G A G C G T T C

0 1 0 2 3 3 1 0 2

A A C C T A T G G A

C C A A G C G T T C

G G T T C G C A A G

T T G G A T A C C T
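A short sketch of how one wrong color corrupts every later letter of the translation, using the reads from these slides and assuming the A=0, C=1, G=2, T=3 numbering (next base = previous base XOR color). The function name is illustrative.

```python
BASES = "ACGT"


def decode_colors(first_base, colors):
    """Translate colors to letters starting from a known first base."""
    sequence = [first_base]
    for color in colors:
        sequence.append(BASES[BASES.index(sequence[-1]) ^ color])
    return "".join(sequence)


correct = [0, 1, 2, 2, 3, 3, 1, 0, 2]
with_error = [0, 1, 0, 2, 3, 3, 1, 0, 2]  # third color misread

good = decode_colors("T", correct)
bad = decode_colors("T", with_error)
print(good)  # TTGAGCGTTC
print(bad)   # TTGGATACCT
print([i for i, (x, y) in enumerate(zip(good, bad)) if x != y])  # [3, 4, 5, 6, 7, 8, 9]

# After the error the true letters reappear in another of the 4 translations.
print(decode_colors("C", with_error)[4:] == good[4:])  # True
```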

Colour-space Smith-Waterman

• Think of 4 SW matrices stacked above one another

• If we have 1 read error, but otherwise perfect match, we’ll use 2 matrices

[Figure: four stacked Smith-Waterman matrices labeled Frame 1 to Frame 4, each with the Genome on one axis and the Read on the other]
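A heavily simplified sketch of the stacked-matrices idea, restricted to ungapped alignment so that each "matrix" collapses to one score per letter. The state is the current letter of the translation; following the read's color is free, and switching to a different translation costs a color-error penalty. The penalties and the function name are assumptions, and this is not SHRiMP's implementation.

```python
BASES = "ACGT"


def align_colors_ungapped(genome, first_base, colors,
                          match=1, mismatch=-1, color_error=-2):
    """Best-scoring letter translation of a color read against a genome segment.

    Requires len(genome) == len(colors) + 1.
    """
    neg_inf = float("-inf")
    score = [neg_inf] * 4
    start = BASES.index(first_base)
    score[start] = match if genome[0] == first_base else mismatch
    back = []
    for k, color in enumerate(colors, start=1):
        new_score, pointer = [neg_inf] * 4, [0] * 4
        for q in range(4):  # letter at position k
            emit = match if BASES[q] == genome[k] else mismatch
            for p in range(4):  # letter at position k-1
                step = 0 if p ^ q == color else color_error
                candidate = score[p] + step + emit
                if candidate > new_score[q]:
                    new_score[q], pointer[q] = candidate, p
        score = new_score
        back.append(pointer)
    q = max(range(4), key=lambda x: score[x])
    best, letters = score[q], [q]
    for pointer in reversed(back):  # trace the chosen letters backwards
        q = pointer[q]
        letters.append(q)
    return best, "".join(BASES[x] for x in reversed(letters))


# One color error: the path follows the genome and pays one color_error.
print(align_colors_ungapped("TTGAGCGTTC", "T", [0, 1, 0, 2, 3, 3, 1, 0, 2]))
# (8, 'TTGAGCGTTC')
```

With a true SNP the colors stay self-consistent, so the best path keeps the read's own translation and pays a single letter mismatch instead.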

Combined Color/Letter Space SW

[Figure: color transition graph over the bases A, C, G, T, shown together with letter pairs for colors 3 and 2]

SHRiMP on Ciona savignyi

• C. savignyi is a chordate with a very large SNP rate (5%)

• Mapped 22 million AB SOLiD reads to the reference C. savignyi genome (6 hours on 200 CPUs).

              p<.05  p<.01
Reads mapped  20%    9%
SNP rate      .039   .024
Indel rate    .004   .003
Error rate    .024   .020

G: 1123724 TA-ACCACGGTCACACTTGCATCAC 1123701

|| |||||||||| |||X|||||||

T: TACACCACGGTCAGACTtGCATCAC

R: 0 T0311101130121221211313211 24

SHRiMP Summary

• Fast mapping of short reads to a genome

-- Handles indels & color-space reads

-- Easy to parallelize

-- Small memory footprint

• Computation of p-values & other statistics for hits

• Publicly available & free

Acknowledgments

UofT: Stephen Rumble

Stanford: Phil Lacroute, Anton Valouev, Arend Sidow

http://compbio.cs.toronto.edu/shrimp

FUNDING: NSERC, CFI, NIH

Why is color-space good?

• SNP discovery

• Error correction with letter & color reads (assembly)

R1: 0 TAGACCACGGTCACACTTGCATCAC 24

|| |||||||||| |||X|||||||

T: TACACCACGGTCAGACTtGCATCAC

R2: 0 T0311101130121221211313211 24

• Can fix errors without (explicit) overlap

T: TACACCACGGTCAGACTTGCATCAC

R1: T03111011301 2 1221211013211 24

R2: T211301 3 122121101321103111 24

R3: T2212110132110311121130131 24

• Don’t just do everything in color space!

What are structural variations?

Various examples of structural variations

Type of Structural Variations (1)

Insertion

A

REF

Type of Structural Variations (2)

Deletion

A

REF

Type of Structural Variations (3)

Inversion

[Figure: sequence A and REF drawn with their 5’ and 3’ ends; the inverted segment appears in the opposite orientation]

Type of Structural Variations (4)

Translocation chr1 chr2

Clone-end Sequencing Approaches

1. “Fine-scale structural variation of the human genome”

[Tuzun et al, 2005]

• Mapping matepairs onto the reference genome

• If mappings of matepairs are not consistent, then there exist structural variations.

2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome”

[Korbel et al, 2007]

• Proposed high-throughput and massive paired end mapping technique

• Detailed types of structural variations

Motivation

Reads can map to many locations on the genome. How do we choose between them?

Tuzun & Korbel used scores that combine several factors (e.g. length, identity, quality of the sequences, concordance).

Probabilistic Framework (1)

We use p(Y) to describe our probabilistic framework.

p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes

Probabilistic Framework (2)

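Insertion

[Figure: distribution p(Y) with μ_Y, the point (s+r), and the distance δ marked]

$$P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\, P(X_j \mid \mathrm{ins}=r)$$

$$P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$$

where $\delta = |\mu_Y - (s+r)|$, s = mapped distance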
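A minimal sketch of the insertion probability, assuming p(Y) is represented by an empirical sample of mapped distances and μ_Y by the sample mean. The sample values and function names are made up for illustration.

```python
def two_sided_tail(distances, mu, implied_distance):
    """1 - P(mu - delta <= Y <= mu + delta) under the empirical distribution."""
    delta = abs(mu - implied_distance)
    inside = sum(1 for y in distances if mu - delta <= y <= mu + delta)
    return 1.0 - inside / len(distances)


def insertion_probability(distances, mapped_distance, r):
    """P(X_i | ins=r): how well an insertion of size r explains the matepair."""
    mu = sum(distances) / len(distances)
    return two_sided_tail(distances, mu, mapped_distance + r)


# Hypothetical distances of uniquely mapped matepairs (mean 100).
sample = [80, 90, 95, 105, 110, 120]

print(insertion_probability(sample, 60, 40))  # 1.0  (s + r lands on the mean)
print(insertion_probability(sample, 60, 30))  # about 0.33
print(insertion_probability(sample, 60, 0))   # 0.0  (no insertion: far in the tail)

# Two matepairs supporting the same insertion multiply.
joint = insertion_probability(sample, 60, 30) * insertion_probability(sample, 65, 30)
```

The deletion case would use the same helper with the implied distance s - r.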

Probabilistic Framework (3)

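Deletion

[Figure: distribution p(Y) with μ_Y, the point (s-r), and the distance δ marked]

$$P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\, P(X_j \mid \mathrm{del}=r)$$

$$P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$$

where $\delta = |\mu_Y - (s-r)|$, s = mapped distance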

Probabilistic Framework (4)

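Inversion

[Figure: distribution p(|Y1-Y2|) with μ_{|Y1-Y2|} and the distance δ marked]

c - d = s(X1) - s(X2)

$$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1-Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$$

where $\delta = |\mu_{|Y_1-Y_2|} - (c - d)|$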

Probabilistic Framework (5)

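Translocation

[Figure: distribution p(|Y1-Y2|) with μ_{|Y1-Y2|} and the distance δ marked]

(c – a) – (d – b) = s(X1) - s(X2)

$$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1-Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$$

where $\delta = |\mu_{|Y_1-Y_2|} - ((c - a) - (d - b))|$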

Flow of our Framework (1)

1. Preprocessing step

Discard concordant matepairs

Remove short mappings

Remove very similar mappings

Mask repeats

Get top K mappings

Remove invalid strands (-,+)

Make all possible combinations of mappings

Flow of our Framework (2)

2. Clustering

Do hierarchical clustering for each structural variation

(Insertion, Deletion, Inversion, Translocation)

3. Finding structural variations

Find initial configuration

Learn parameters for the objective function

Find a (locally) optimal configuration

Hierarchical Clustering (1)

(ex) Insertion: C = {X1, X2}

[Figure: matepairs X1 and X2 drawn on sequence A and on REF]

• A cluster is a set of mapped locations explaining the same structural variant

• Linkage distance is D(X1, X2) = - ln P(X1, X2|C)

Hierarchical Clustering (2)

• Linkage distance is

• Find the two closest clusters; if D(C_u, C_v) < cutoff, merge.

[Figure: clusters C1 and C2 over mapped locations 1 to 5 of reads R1 and R2]
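A generic sketch of the merge loop. The cluster-level linkage is not recoverable from the slide, so average linkage over pairwise distances is assumed here, and the absolute difference of implied variant sizes stands in for the pairwise distance -ln P(X1, X2|C). All names and numbers are illustrative.

```python
def agglomerate(items, distance, cutoff):
    """Repeatedly merge the two closest clusters while their linkage < cutoff."""
    clusters = [[x] for x in items]

    def linkage(cu, cv):
        # Assumed average linkage: mean pairwise distance between members.
        return sum(distance(a, b) for a in cu for b in cv) / (len(cu) * len(cv))

    while len(clusters) > 1:
        best = None
        for u in range(len(clusters)):
            for v in range(u + 1, len(clusters)):
                d = linkage(clusters[u], clusters[v])
                if best is None or d < best[0]:
                    best = (d, u, v)
        d, u, v = best
        if d >= cutoff:
            break  # closest pair is too far apart: stop merging
        clusters[u] = clusters[u] + clusters[v]
        del clusters[v]
    return clusters


implied_sizes = [100, 105, 98, 400, 410]  # hypothetical mapped locations
result = agglomerate(implied_sizes, lambda a, b: abs(a - b), cutoff=20)
print(sorted(sorted(c) for c in result))  # [[98, 100, 105], [400, 410]]
```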

Find a Unique Mapped Location

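[Figure: mapped locations 1 to 3 of read R1 and 4 to 5 of read R2; candidate mappings M_{1,4} and M_{2,4} shown with cluster C1 and M_{3,5} with cluster C2]

Assign matepairs to unique mapped locations (and hence unique clusters).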

Which Location is Best?

• We define an objective function J(ω)

– ƒ1 corresponds to BLAT hit scores

– ƒ2 corresponds to the probability

– ƒ3 corresponds to the size of clusters

Finding the “Best” Location

• Find the initial configuration greedily.

– Assign matepairs to clusters starting with those with fewest mapped locations

• Learn parameters for objective function J(ω).

– We used hill climbing search to maximize the log likelihood of P(ω|λ_i).

• Finally, find a configuration locally maximizing J(ω) using hill climbing search.

Clustering Results

We started with ~2,984,000 matepairs

• ~93% were uniquely mapped

• ~94% had a concordant position (mapped at μ ± 2σ)

Through the clustering procedure we found (FDR 0.05)

• 795 Insertion clusters (691 had a uniquely mapped read)

• 1289 Deletion clusters (1120)

• ~200 Inversion clusters (~150)

• 164 Translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)

Example Deletion

Agreement with Previous Results

Type       All         Tuzun       Levy          Korbel        DGV-All
Insertion  795(691)    50(36)/139  109(101)/319  1(1)/34       209(169)/2216
Deletion   1289(1120)  84(70)/102  194(188)/344  275(236)/742  539(446)/4697
Inversion  ~200(~150)  198(46)/56  N/A           67(55)/105    111(87)/164

All of the correlations (besides the one) are

Translocations

• 47% of the translocations were close to the centromeres

Distance to centromere   <10^6   (10^6, 4.5*10^6]   >4.5*10^6
<10^6                    38      36                 3
(10^6, 4.5*10^6]                 19                 3
>4.5*10^6                                           65

• She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart

• These could also be mis-assemblies.

Summary (Structural Variation)

• Introduced a probabilistic framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions.

• Isolated hundreds of insertions, deletions, and inversions between the reference public human genome and the

JCVI donor.

• These results show statistically significant correlation with previous variation studies

• About 2/3 of the structural variants we isolate are not found in the Database of Genomic Variants

What about Copy Number Variants?

• Copy Number Variants are the result of duplications and deletions of large genomic segments

• Currently mainly found using microarray technology (ROMA, CGH)

• There is no algorithm for CNV finding with short reads (?)

• Goal: predict the number of times a certain segment appears in the genome

A Little Bit of Math

Let C = #reads / length of genome

Let i be a read

Let x_i be the # of times it was sampled.

The assembled genome should contain every read about x_i / C times.

For example, let C = 3, x_i = 7

[Figure: plot with vertical axis 0 to 0.3 and horizontal axis 0 to 6]
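A small sketch of the example with C = 3 and x_i = 7. The sampling model is an assumption made for illustration: a read that occurs g times in the genome is taken to be sampled a Poisson-distributed number of times with mean C·g. The function names are hypothetical.

```python
import math


def copy_likelihood(x, g, C):
    """P(read sampled x times | it occurs g times), assumed Poisson with mean C*g."""
    lam = C * g
    if lam == 0:
        return 1.0 if x == 0 else 0.0
    return math.exp(-lam) * lam ** x / math.factorial(x)


def most_likely_copy_count(x, C, max_copies=6):
    """Integer copy count in 0..max_copies with the highest likelihood."""
    return max(range(max_copies + 1), key=lambda g: copy_likelihood(x, g, C))


C, x = 3, 7
print(x / C)                         # about 2.33 expected copies
print(most_likely_copy_count(x, C))  # 2
for g in range(7):
    print(g, round(copy_likelihood(x, g, C), 4))
```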

More formally

Let n = number of reads, N = length of the genome

• The probability P_i that read i was sampled x_i times, given that it appears in the genome g_i times, is

P_i

• We want to maximize the likelihood that all of the reads were sampled from the genome:

$$\prod_i P_i$$

• However, there is an additional constraint



The additional constraint…

[Figure: three plots (vertical axis 0 to 0.3, horizontal axis 0 to 6) around the sequence ATCGGCACTG]

$$g_1 + g_2 = g_3$$

Solving for all g_i … Simultaneously!

Instead of maximizing the product, minimize the sum of the negative logs:

[Figure: three plots (vertical axis 0 to 12, horizontal axis 0 to 6) around the sequence ATCGGCACTG]

This is just min-cost network flow with convex costs!

Copy Count Prediction Results

• Simulated reads from E. coli bacteria (4.5 Mb)

Copy-Count Error

C      -2   -1    0      +1   +2   +3
50X    4    397   3.9 M  170  18   6
75X    0    7     4.3 M  22   0    0
100X   0    2     4.5 M  6    0    0
200X   0    0     4.5 M  4    0    0

• How to scale this to Human???

Discovering Variation

• SHRiMP -- SHort Read Mapping Package

– Computes p-values & other statistics

– Specialized Color-space alignment

• Algorithm for Structural Variation Discovery

– Will it scale to short reads?

• A model for Copy Count Prediction

– Works well with reads from E. coli, but how to scale to Human?
