HarvardBiophysics101-11

advertisement
Programming
To sequence analysis
Shotgun sequence assembly
Multiple alignments
Dispersed & tandem repeats
Bird song alignments
Gene Expression time-warping
3D-structure alignment
Through HMMs
RNA gene search & structure prediction
Distant protein homologies
Speech recognition
1
Alignments & Scores
Global (e.g. haplotype)
ACCACACA
::xx::x:
ACACCATA
Score= 5(+1) + 3(-1) = 2
Local (motif)
ACCACACA
::::
ACACCATA
Score= 4(+1) = 4
Suffix (shotgun assembly)
ACCACACA
:::
ACACCATA
Score= 3(+1) =3
2
Increasingly complex
(accurate) searches
Exact (StringSearch)
Regular expression (PrositeSearch)
CGCG
CGN{0-9}CG = CGAACG
Substitution matrix (BlastN)
Profile matrix (PSI-blast)
CGCG ~= CACG
CGc(g/a) ~ = CACG
Gaps (Gap-Blast)
Dynamic Programming (NW, SM)
CGCG ~= CGAACG
CGCG ~= CAGACG
Hidden Markov Models (HMMER)
3
"Hardness" of
(multi-) sequence alignment
Align 2 sequences of length N allowing gaps.
ACCAC-ACA
::x::x:x:
AC-ACCATA
,
ACCACACA
:xxxxxx:
A-----CACCATA
, etc.
2N gap positions, gap lengths of 0 to N each:
A naïve algorithm might scale by O(N2N).
For N= 3x109 this is rather large.
Now, what about k>2 sequences?
or rearrangements other than gaps?
4
Testing search & classification algorithms
Separate Training set and Testing sets
Need databases of non-redundant sets.
Need evaluation criteria (programs)
Sensistivity and Specificity (false negatives & positives)
sensitivity (true_predicted/true)
specificity (true_predicted/all_predicted)
Where do training sets come from?
More expensive experiments: crystallography, genetics, biochemistry
5
Comparisons of homology scores
Pearson WR Protein Sci 1995 Jun;4(6):1145-60
Comparison of methods for searching protein
sequence databases. Methods Enzymol 1996;266:227-58
Effective protein sequence comparison.
Algorithm: FASTA, Blastp, Blitz
Substitution matrix:PAM120, PAM250, BLOSUM50, BLOSUM62
Database: PIR, SWISS-PROT, GenPept
x=
Switch to
protein
searches
when possible
Uxu
uxc
uxa
uxg
Cxu
cxc
cxa
cxg
u
c
F
L
Y
C
H
W
P
TER
R
Q
I
N
S
K
R
T
C-S
M
NH+
gxu
gxc
gxa
gxg
g
S
axu
axc
axa
axg
a
D
V
A
G
O-
E
H:D/A
6
A Multiple Alignment of
Immunoglobulins
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--VTVAWKADS-AALGCLVKDYFPEP--VTVSWNSG--VSLTCLVKGFYPSD--IAVEWESNG--
7
Scoring matrix based on large set of
distantly related blocks: Blosum62
10 1 6 6
A C D E
8 0 -4 -2
18 -6 -8
12 4
10
4
F
-4
-4
-6
-6
12
8
G
0
-6
-2
-4
-6
12
2
H
-4
-6
-2
0
-2
-4
16
6
I
-2
-2
-6
-6
0
-8
-6
8
6
K
-2
-6
-2
2
-6
-4
-2
-6
10
9
L
-2
-2
-8
-6
0
-8
-6
4
-4
8
2
M
-2
-2
-6
-4
0
-6
-4
2
-2
4
10
4
N
-4
-6
2
0
-6
0
2
-6
0
-6
-4
12
4
P
-2
-6
-2
-2
-8
-4
-4
-6
-2
-6
-4
-4
14
4
Q
-2
-6
0
4
-6
-4
0
-6
2
-4
0
0
-2
10
5
R
-2
-6
-4
0
-6
-4
0
-6
4
-4
-2
0
-4
2
10
6
S
2
-2
0
0
-4
0
-2
-4
0
-4
-2
2
-2
0
-2
8
6
T
0
-2
-2
-2
-4
-4
-4
-2
-2
-2
-2
0
-2
-2
-2
2
10
7
V
0
-2
-6
-4
-2
-6
-6
6
-4
2
2
-6
-4
-4
-6
-4
0
8
1
W
-6
-4
-8
-6
2
-4
-4
-6
-6
-4
-2
-8
-8
-4
-6
-6
-4
-6
22
3 %
Y
-4 A
-4 C
-6 D
-4 E
6 F
-6 G
4 H
-2
I
-4 K
-2 L
-2 M
-4 N
-6 P
-2 Q
-4 R
-4 S
-4 T
-2 V
4 W
14 Y
8
Scoring Functions and
Alignments
Scoring function:
(match) = +1;
(mismatch) = -1;
(indel) = -2;
(other) = 0.
}
substitution matrix
Alignment score: sum of columns.
Optimal alignment: maximum score.
9
Calculating Alignment
Scores
(1
 A
T 
G 
 A
(1)Score                  1  1  1  1  0.
 A
C 
T 
 A
 A
 
T 
G 
 A
(2)Score                      1  2  1  2  1  1.
 A
C 
T 
 
 A
if  (indel )  1, Score  1  1  1  1  1  1.
10
DNA2: Aligning ancient diversity
Comparing types of alignments & algorithms
Dynamic programming
Multi-sequence alignment
Space-time-accuracy tradeoffs
Finding genes -- motif profiles
Hidden Markov Model for CpG Islands
11
What is dynamic programming?
A dynamic programming algorithm solves
every subsubproblem just once and then saves
its answer in a table, avoiding the work of
recomputing the answer every time the
subsubproblem is encountered.
-- Cormen et al. "Introduction to Algorithms",
The MIT Press.
12
Recursion of Optimal
Global Alignments
 u
s  : optimal global alignment score of u and v .
v 
  ATG A 
 
 s ACT     A ;

 
 

 ATG A 
 A
  ATG 
s
  max  s
    ;
 ACT A 
 A
  ACT 
  ATG 
 A
 s
    .

 
  ACT A 
13
Recursion of Optimal
Local Alignments
 u
s  : optimal local alignment score of u and v .
v 
  u1u 2...ui 

 s v1v 2...vj  1     vj ;

 
 
  u1u 2...ui  1 
 ui 
    ;
 u1u 2...ui 
 s
s
  max   v1v 2...vj  1 
 vj 
 v1v 2...vj 
 u1u 2...ui  1

 ui 
 s
    ;
  v1v 2...vj 
 

0.
14
Computing Row-by-Row
A
T
G
A
A
C
T
A
0 min min min min
min 1
-1 -3 -5
min
min
min
A
T
G
A
A
C
T
A
0 min min min min
min 1
-1 -3 -5
min -1
0
0
-2
min -3 -2 -1 -1
min
A
T
G
A
A
C
T
A
0 min min min min
min 1
-1 -3 -5
min -1
0
0
-2
min
min
A
T
G
A
A
C
T
A
min min min min
1
-1 -3 -5
-1
0
0
-2
-3 -2 -1 -1
-5 -4 -3
0
0
min
min
min
min
min = -1099 (match) = +1; (mismatch) = -1; (indel) = -2;
15
Traceback Optimal Global
Alignment
A
T
G
A
0
min
min
min
min
A
C
T
A
min min min min
1
-1
-3
-5
-1
0
0
-2
-3
-2
-1
-1
-5
-4
-3
0 A G
: 
A T

T

C
16
A
:
A 
Local and Global
Alignments
A C C A C A C A
A
C
A
C
C
A
T
A
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
1
0
1
0
0
2
1
0
2
0
2
0
0
1
0
1
2
0
3
1
3
0
0
2
1
0
3
1
4
2
0
0
0
3
2
1
2
2
3
0
1
0
1
4
2
2
1
3
0
0
0
0
2
3
1
1
1
0
1
0
0
1
1
4
2
2
A C C A C A C A
A
C
A
C
C
A
T
A
0
m
m
m m m
m
m
m
m
1
-1 -3 -5 -7
-9
-11 -13
m -1
2
0
-2 -4
-6
-8 -10
m -3
0
1
1
-1
-3
-5
-7
m -5
-2
1
0
2
0
-2
-4
m -7
-4 -1
0
1
1
1
-1
m -9
-6 -3
0
-1
2
0
2
m -11 -8 -5 -2 -1
0
1
0
m -13 -10 -7 -4 -1
0
-1
2
17
Time and Space Complexity
of Computing Alignments
For two sequences u=u1u2…un and v=v1v2…vm, finding the
optimal alignment takes O(mn) time and O(mn) space.
An O(1)-time operation: one comparison, three
multiplication steps, computing an entry in the alignment
table…
An O(1)-space memory: one byte, a data structure of two
floating points, an entry in the alignment table…
18
Time and Space Problems
Comparing two one-megabase genomes.
Space:
An entry: 4 bytes;
Table: 4 * 106 * 106 = 4 G bytes memory.
Time:
1000 MHz CPU: 1M entries/second;
10^12 entries: 1M seconds = 10 days.
19
Time & Space Improvement
for w-band Global Alignments
Two sequences
differ by at most w
bps (w<<n).
w-band algorithm:
O(wn) time and
space.
Example: w=3.
A C C A C A C A
A
C
A
C
C
A
T
A
0
m
m
m
m
1
-1 -3 -5
m -1
2
0
-2 -4
m -3
0
1
1
-1
-3
-5
-2
1
0
2
0
-2
-4 -1
0
1
1
1
-1
-3
0
-1
2
0
2
-2 -1
0
1
0
-1
0
-1
2
20
Summary
Dynamic programming
Statistical interpretation of alignments
Computing optimal global alignment
Computing optimal local alignment
Time and space complexity
Improvement of time and space
Scoring functions
21
DNA2: Aligning ancient diversity
Comparing types of alignments & algorithms
Dynamic programming
Multi-sequence alignment
Space-time-accuracy tradeoffs
Finding genes -- motif profiles
Hidden Markov Model for CpG Islands
22
A Multiple Alignment of
Immunoglobulins
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--VTVAWKADS-AALGCLVKDYFPEP--VTVSWNSG--VSLTCLVKGFYPSD--IAVEWESNG--
23
A multiple alignment <=> Dynamic
programming on a hyperlattice
From G. Fullen, 1996.
24
Multiple Alignment vs
Pairwise Alignment
AT
AT
AT
A-T
AT
AT
Optimal Multiple Alignment
A-T
Non-Optimal Pairwise Alignment
25
Computing a Node on
Hyperlattice
k=3
 N -
s NS 
 N -
 
 N -
s NS 
 NA 
 
A
 NV 
s NS 
 NA 
 
S
 N
s N 
 N
 
V
 NV   - 
s N -    S 
 N -   A 
 
 NV 
s N - 
 NA 
 

2k –1=7  s N     S 
A
  N 
 
N
V
  NV 
-
 s N -     S 
A
  N -


 

V
  N - 
 
 s NS     - 
A
  N -

 NV 
 N -
V

s NS   max  s N -     S 
 NA 


-
  NA 


 
  N -
V
 s NS     - 
-
  NA 
 
 NV

-
 

 
 s N -     S 
-
  NA 
-
  NV 
 s NS     - 

A
  N -
26
Challenges of Optimal
Multiple Alignments
Space complexity (hyperlattice size): O(nk) for
k sequences each n long.
Computing a hyperlattice node: O(2k).
Time complexity: O(2knk).
Find the optimal solution is exponential in k
(non-polynomial, NP-hard).
27
Methods and Heuristics for
Optimal Multiple Alignments
Optimal: dynamic programming
Pruning the hyperlattice (MSA)
Heuristics:
tree alignments(ClustalW)
star alignments
sampling (Gibbs)
(discussed in RNA2)
local profiling with iteration (PSI-Blast, ...)
28
ClustalW: Progressive Multiple
Alignment
S1
S2
S3
S4
Multiple Alignment Step:
1. Aligning S1 and S3
2. Aligning S2 and S4
3. Aligning (S1,S3) with (S2,S4).
All Pairwise
Alignments
Dendrogram
S1
S3
Similarity Matrix
S1 S2
S1
S2
S3
4
S3 S4
9
4
4
7
Cluster Analysis
S2
S4
4
S4
From Higgins(1991) and Thompson(1994).
Distance
29
Star Alignments
s1  ATTGCCATT
s2  ATGGCCATT
s3  ATCCAATTTT
Multiple Alignment
ATTGCCATT-ATGGCCATT-ATC-CAATTTT
ATCTTC-TT-ACTGACC---AT*GCCATTTT
Combine into
Multiple Alignment
s4  ATCTTCTT
s5  ACTGACC
Pairwise Alignment
Similarity Matrix
ATTGCCATT
ATGGCCATT
Pairwise Alignment
s1
s2
s3
s4
s2
7
s3 s 4
2 0
2 0
0
s5
3
4
7
3
Find the Central
Sequence s1
ATTGCCATT-ATC-CAATTTT
ATTGCCATT
ATCTTC-TT
ATTGCCATT
ACTGACC
30
Why probabilistic models
in sequence analysis?
 Recognition - Is this sequence a protein start?
 Discrimination - Is this protein more like a
hemoglobin or a myoglobin?
 Database search - What are all of sequences in
SwissProt that look like a serine protease?
31
A Basic idea
Assign a number to every possible sequence
such that
sP(s|M) = 1
P(s|M) is a probability of sequence s given a
model M.
32
Sequence recognition
Recognition question - What is the
probability that the sequence s is from the
start site model M ?
P(M|s) = P(M)* P(s|M) / P(s)
(Bayes' theorem)
P(M) and P(s) are prior probabilities
and P(M|s) is posterior probability.
33
Database search
 N = null model (random bases or AAs)
 Report all sequences with
logP(s|M) - logP(s|N) > logP(N) - logP(M)
 Example, say a/b hydrolase fold is rare in the database,
about 10 in 10,000,000. The threshold is 20 bits. If
considering 0.05 as a significant level, then the threshold
is 20+4.4 = 24.4 bits.
34
Plausible sources of mono, di, tri,
& tetra- nucleotide biases
C rare due to lack of uracil glycosylase (cytidine deamination)
TT rare due to lack of UV repair enzymes.
CG rare due to 5methylCG to TG transitions (cytidine deamination)
AGG rare due to low abundance of the corresponding Arg-tRNA.
CTAG rare in bacteria due to error-prone "repair" of CTAGG to C*CAGG.
AAAA excess due to polyA pseudogenes and/or polymerase slippage.
AmAcid
Arg
Arg
Arg
Arg
Arg
Arg
Codon
AGG
AGA
CGG
CGA
CGT
CGC
Number
3363.00
5345.00
10558.00
6853.00
34601.00
36362.00
/1000
1.93
3.07
6.06
3.94
19.87
20.88
Fraction
0.03
0.06
0.11
0.07
0.36
0.37
ftp://sanger.otago.ac.nz/pub/Transterm/Data/codons/bct/Esccol.cod
35
CpG Island + in a ocean of First order Hidden Markov Model
MM=16, HMM= 64 transition probabilities (adjacent bp)
P(A+|A+)
A+
T+
C+
G+
P(G+|C+) >
A-
T-
C-
G36
Estimate transistion
probabilities -- an example
Training set
P(G|C) = 3/7 =
#(CG) / N #(CN)
S= aaacagcctgacatggttcCGAACAGCCTCGACATGGCGTT
Isle
A+
C+
G+
T+
ACGTA+
.20
.40
.20
.20
.00
.00
.00
.00
1.00
C+
.29
.14
.43
.14
.00
.00
.00
.00
1.00
G+
.33
.33
.17
.17
.00
.00
.00
.00
1.00
T+
.00
.25
.25
.25
.25
.00
.00
.00
1.00
.82 1.13 1.05 .76
.25
.00
.00
.00 Sums
Ocean ACGTA+
C+
G+
T+
A.33
.33
.17
.17
.00
.00
.00
.00
1.00
C.40
.20
.00
.20
.00
.20
.00
.00
1.00
G.25
.25
.25
.25
.00
.00
.00
.00
1.00
T.00
.25
.50
.25
.00
.00
.00
.00
1.00
.98 1.03 .92
.87
.00
.20
.00
.00 Sums
Laplace pseudocount: Add +1 count to each observed.
(p.9,108,321 Dirichlet)
37
Estimated transistion probabilities
from 48 "known" islands
Training set
(+)
A
C
G
T
(-)
P(G|C) =
#(CG) / N #(CN)
A
C
G
T
A
.18
.17
.16
.08
C
.27
.37
.34
.36
G
.43
.27
.38
.38
T
.12
.19
.13
.18
1.00
1.00
1.00
1.00
A
.30
.32
.25
.18
C
.21
.30
.25
.24
G
.29
.08
.30
.29
T
.21
.30
.21
.29
1.00
1.00
1.00
1.00
1.05
.99
.95
1.01
38 Sums
Viterbi: dynamic programming
for HMM
1/8*.27
Most probable path
l,k=2
states
Recursion:
vl(i+1) =
el(xi+1) max(vk(i)akl)
v
begin
A+
C+
G+
T+
ACGT-
si =
1
0
0
0
0
0
0
0
0
C
G
C
G
0
0
0
0
0
0
0
0
0.125
0
0.012
0
0
0.034
0
0.0032
0
0
0
0
0
0
0
0
0.125
0
0.0026
0
0
0.01
0
0.0002
0
0
0
0
a= table in slide 50
e= emit si in state l (Durbin p.56)
39
Download