Probability Theory and Basic Alignment of String Sequences

Weighting training sequences

• Why do we want to weight training sequences?
• Many different proposals
  – Based on trees
  – Based on the position of the sequences in 'sequence space'
  – Interested only in classifying family membership
  – Maximizing entropy
Why do we want to weight training sequences?

• Some sequences in the training set can be closely related to each other and do not deserve the same influence in the estimation process as a sequence that is highly diverged.
  – Phylogenetic trees make this relatedness explicit.
  [Figure: a small phylogenetic tree relating the example sequences AGAA, CCTC and AGTC]
Weighting schemes based on trees

• Thompson, Higgins & Gibson (1994): weights represent electric currents as calculated by Kirchhoff's laws
• Gerstein, Sonnhammer & Chothia (1994)
• Root weights from Gaussian parameters (Altschul-Carroll-Lipman weights for a three-leaf tree, 1989)
Thompson, Higgins & Gibson

• The tree is treated as an electric network of voltages, currents and resistances: edge lengths become resistances, a voltage is applied at the root, and the current reaching each leaf gives its weight.
  [Figure: the example tree drawn as a circuit; current $I_1$ flows through $R_1$ to leaf 1, $I_2$ through $R_2$ to leaf 2, $I_3$ through $R_3$ to leaf 3, and the combined current $I_1 + I_2$ through $R_4$ on the internal edge; $V_4$ and $V_5$ are the voltages at the internal node and the root]
Thompson, Higgins & Gibson

With edge lengths 2, 2 and 4 on the leaf edges and 3 on the internal edge (i.e. $R_1 = 2$, $R_2 = 2$, $R_3 = 4$, $R_4 = 3$), Kirchhoff's laws give

$V_4 = 2I_1 = 2I_2$
$V_5 = 2I_1 + 3(I_1 + I_2) = 4I_3$

so the leaf currents, and hence the weights, are

$I_1 : I_2 : I_3 = 1 : 1 : 2$
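The following is a minimal sketch (my own, not code from the slides) of how such weights can be computed: the tree is modelled as a resistor network, and the current entering each node splits over its branches in inverse proportion to the branch resistance (edge length plus the equivalent resistance of the subtree below). The nested `(child, edge_length)` tree representation and the function names are assumptions made for this sketch.

```python
# A minimal sketch (not code from the slides) of Thompson-Higgins-Gibson weights
# computed as leaf currents in a resistor network.
# A tree is either a leaf name (str) or a list of (child, edge_length) pairs.

def equivalent_resistance(node):
    """Resistance of the subtree below `node`, with the leaves acting as ground."""
    if isinstance(node, str):        # a leaf contributes no resistance below it
        return 0.0
    # parallel combination of the branches; each branch has its edge length in series
    return 1.0 / sum(1.0 / (t + equivalent_resistance(child)) for child, t in node)

def leaf_currents(node, current=1.0, out=None):
    """Split the current entering `node` over its branches, inversely to resistance."""
    if out is None:
        out = {}
    if isinstance(node, str):
        out[node] = current
        return out
    branches = [(child, t + equivalent_resistance(child)) for child, t in node]
    total_conductance = sum(1.0 / r for _, r in branches)
    for child, r in branches:
        leaf_currents(child, current * (1.0 / r) / total_conductance, out)
    return out

# Example tree from the slides: leaves 1 and 2 (edge length 2 each) join at an internal
# node, which sits 3 away from the root; leaf 3 hangs off the root at distance 4.
tree = [([("1", 2.0), ("2", 2.0)], 3.0), ("3", 4.0)]
print(leaf_currents(tree))   # {'1': 0.25, '2': 0.25, '3': 0.5}  ->  1 : 1 : 2
```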
Gerstein, Sonnhammer & Chothia

• Works up the tree, incrementing the weights
  – Initially the weights are set to the leaf edge lengths $t_n$ (the resistances in the previous example)
  – Each internal edge length $t_n$ is then distributed over the leaves below node $n$ in proportion to their current weights:

$\Delta w_i = t_n \, \dfrac{w_i}{\sum_{\text{leaves } k \text{ below } n} w_k}$
Gerstein, Sonnhammer & Chothia

For the same example tree (leaf edges 2, 2, 4; internal edge 3):

• Initial weights: $w_1 = 2,\ w_2 = 2,\ w_3 = 4$
• The internal edge of length 3 is shared equally between leaves 1 and 2: $w_1 = w_2 = 2 + 1.5 = 3.5$
• Result: $w_1 : w_2 : w_3 = 7 : 7 : 8$
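A matching sketch (again an assumption, not the authors' code) of the Gerstein-Sonnhammer-Chothia rule: start from the leaf edge lengths and walk up the tree, spreading each internal edge length over the leaves below it in proportion to their current weights. It reuses the nested tree representation from the previous sketch.

```python
# A minimal sketch (not the authors' code) of Gerstein-Sonnhammer-Chothia weights:
# initial weights are the leaf edge lengths; each internal edge length is then spread
# over the leaves below it in proportion to their current weights, working up the tree.
# Uses the same nested (child, edge_length) tree representation as the sketch above.

def gsc_weights(node, weights=None):
    """Return ({leaf: weight}, set of leaves below `node`)."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        return weights, {node}
    leaves_below = set()
    for child, t in node:
        _, child_leaves = gsc_weights(child, weights)
        if isinstance(child, str):
            weights[child] = t                    # initial weight = leaf edge length
        else:
            total = sum(weights[k] for k in child_leaves)
            for k in child_leaves:                # spread t proportionally to current weights
                weights[k] += t * weights[k] / total
        leaves_below |= child_leaves
    return weights, leaves_below

tree = [([("1", 2.0), ("2", 2.0)], 3.0), ("3", 4.0)]
print(gsc_weights(tree)[0])   # {'1': 3.5, '2': 3.5, '3': 4.0}  ->  7 : 7 : 8
```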
Gerstein, Sonnhammer & Chothia

• Small difference with Thompson, Higgins & Gibson? For two leaves with edge lengths 1 and 2 joined at the root:
  – T, H & G (currents, inversely proportional to resistance): $I_1 : I_2 = 2 : 1$
  – G, S & C (weights, proportional to edge length): $w_1 : w_2 = 1 : 2$
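Using the two hypothetical helpers sketched above on this two-leaf tree reproduces the contrast between the schemes:

```python
two_leaf = [("1", 1.0), ("2", 2.0)]
print(leaf_currents(two_leaf))      # ~ {'1': 0.667, '2': 0.333}  ->  2 : 1  (T, H & G)
print(gsc_weights(two_leaf)[0])     #   {'1': 1.0,   '2': 2.0}    ->  1 : 2  (G, S & C)
```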
Root weights from Gaussian parameters

• Continuous instead of discrete members of an alphabet
• A probability density instead of a substitution matrix
• Example: a Gaussian,

$P(x \to y \mid t) \propto \exp\!\left(-\dfrac{(x - y)^2}{2t}\right)$
Root weights from Gaussian parameters

The density of a value $x$ at internal node 4, given leaves $L_1$ and $L_2$ with values $x_1, x_2$ and branch lengths $t_1, t_2$:

$P(x \text{ at node } 4 \mid L_1, L_2) \propto \exp\!\left(-\dfrac{(x - x_1)^2}{2t_1}\right)\exp\!\left(-\dfrac{(x - x_2)^2}{2t_2}\right) = K_1 \exp\!\left(-\dfrac{(x - v_1 x_1 - v_2 x_2)^2}{2t_{12}}\right)$

with $v_1 = t_2/(t_1 + t_2)$, $v_2 = t_1/(t_1 + t_2)$ and $t_{12} = t_1 t_2/(t_1 + t_2)$.
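For completeness, a short derivation sketch (standard Gaussian algebra; this intermediate step is not on the slides) of why the product of the two leaf Gaussians collapses to a single Gaussian in $x$:

$\dfrac{(x - x_1)^2}{t_1} + \dfrac{(x - x_2)^2}{t_2} = \dfrac{t_1 + t_2}{t_1 t_2}\left(x - \dfrac{t_2 x_1 + t_1 x_2}{t_1 + t_2}\right)^2 + \dfrac{(x_1 - x_2)^2}{t_1 + t_2}$

so the $x$-dependent part is a Gaussian with mean $v_1 x_1 + v_2 x_2$ and variance $t_{12} = t_1 t_2/(t_1 + t_2)$, while the remaining factor $\exp\!\left(-\dfrac{(x_1 - x_2)^2}{2(t_1 + t_2)}\right)$ is absorbed into $K_1$.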
Root weights from Gaussian parameters

• Altschul-Carroll-Lipman weights for a tree with three leaves: repeating the same argument at the root (node 5) gives

$P(x \text{ at node } 5 \mid L_1, L_2, L_3) = K_2 \exp\!\left(-\dfrac{(x - w_1 x_1 - w_2 x_2 - w_3 x_3)^2}{2t_{123}}\right)$
Root weights from Gaussian parameters

With leaf branch lengths $t_1, t_2, t_3$ and internal branch length $t_4$:

$w_1 = t_2 t_3 / N, \qquad w_2 = t_1 t_3 / N, \qquad w_3 = \{t_1 t_2 + t_4 (t_1 + t_2)\} / N$

where $N = t_1 t_2 + (t_3 + t_4)(t_1 + t_2)$.

For the example tree ($t_1 = t_2 = 2$, $t_4 = 3$, $t_3 = 4$): $w_1 : w_2 : w_3 = 1 : 1 : 2$
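Plugging the example edge lengths into the formulas above confirms the stated ratio:

$w_1 \propto t_2 t_3 = 2 \cdot 4 = 8, \qquad w_2 \propto t_1 t_3 = 2 \cdot 4 = 8, \qquad w_3 \propto t_1 t_2 + t_4(t_1 + t_2) = 4 + 3 \cdot 4 = 16$

giving $w_1 : w_2 : w_3 = 8 : 8 : 16 = 1 : 1 : 2$.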
Weighting schemes based on trees

Summary for the example tree:
• Thompson, Higgins & Gibson (electric current): 1 : 1 : 2
• Gerstein, Sonnhammer & Chothia: 7 : 7 : 8
• Altschul-Carroll-Lipman weights for a tree with three leaves: 1 : 1 : 2
Weighting scheme using 'sequence space'

• Voronoi weights: weight each sequence by the (estimated) volume of its Voronoi cell in sequence space, where $n_i$ counts the randomly sampled sequences that lie closest to sequence $i$:

$w_i = \dfrac{n_i}{\sum_k n_k}$
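A minimal sketch of one way to estimate these counts (the sampling strategy is an assumption in the spirit of Sibbald & Argos): draw random sequences position by position from the residues observed in each column, assign each sample to its nearest training sequence, and split ties evenly.

```python
# A minimal sketch of Voronoi weights by sampling (the sampling strategy is an assumption):
# draw random sequences from the observed column residues, assign each sample to its
# nearest training sequence (splitting ties evenly), and set w_i = n_i / sum_k n_k.
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def voronoi_weights(seqs, n_samples=10000, seed=0):
    rng = random.Random(seed)
    counts = [0.0] * len(seqs)
    columns = [[s[i] for s in seqs] for i in range(len(seqs[0]))]
    for _ in range(n_samples):
        sample = "".join(rng.choice(col) for col in columns)
        dists = [hamming(sample, s) for s in seqs]
        best = min(dists)
        winners = [k for k, d in enumerate(dists) if d == best]
        for k in winners:                         # split ties evenly over the winners
            counts[k] += 1.0 / len(winners)
    total = sum(counts)
    return [c / total for c in counts]

print(voronoi_weights(["AGAA", "CCTC", "AGTC"]))
```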
More weighting schemes

• Maximum discrimination weights
• Maximum entropy weights
  – Based on averaging
  – Based on maximum 'uniformity' (entropy)
Maximum discrimination weights

• Does not try to maximize likelihood or posterior probability
• It decides whether a sequence is a member of a family
Maximum discrimination weights

The posterior probability that sequence $x$ belongs to the family model $M$ rather than the random (background) model $R$:

$P(M \mid x) = \dfrac{P(x \mid M)\,P(M)}{P(x \mid M)\,P(M) + P(x \mid R)\,P(R)}$

• Discrimination $D$:

$D = \prod_k P(M \mid x^k)$

• Maximize $D$; the emphasis is on distant or difficult members
Maximum discrimination weights

• Differences with the previous schemes
  – Iterative method
    • Initial weights give rise to a model
    • Newly calculated posterior probabilities P(M|x) give rise to new weights, and hence a new model, until convergence is reached
  – It optimizes performance for exactly what the model is designed for: classifying whether a sequence is a member of a family
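A minimal sketch of the iterative loop described above, with a deliberately toy model (independent per-column residue frequencies with pseudocounts). The weight update $w_k \propto 1 - P(M \mid x^k)$ is an assumption: one simple way to put the emphasis on distant or difficult members; the slides only state the loop structure, not the exact rule.

```python
# A minimal sketch of the iterative reweighting loop. The toy model and the update rule
# w_k ∝ 1 - P(M | x^k) are assumptions, not taken from the slides.

def column_model(seqs, weights, alphabet="ACGT", pseudo=0.5):
    """Weighted per-column residue probabilities; P(x|M) factorises over columns."""
    model = []
    for i in range(len(seqs[0])):
        counts = {a: pseudo for a in alphabet}
        for s, w in zip(seqs, weights):
            counts[s[i]] += w
        total = sum(counts.values())
        model.append({a: c / total for a, c in counts.items()})
    return model

def posterior_membership(seq, model, p_random=0.25, prior_m=0.5):
    """P(M | x) against a uniform background model R."""
    p_m = p_r = 1.0
    for i, a in enumerate(seq):
        p_m *= model[i][a]
        p_r *= p_random
    return p_m * prior_m / (p_m * prior_m + p_r * (1.0 - prior_m))

def max_discrimination_weights(seqs, n_iter=50):
    weights = [1.0 / len(seqs)] * len(seqs)       # start from uniform weights
    for _ in range(n_iter):
        model = column_model(seqs, weights)
        posteriors = [posterior_membership(s, model) for s in seqs]
        raw = [1.0 - p for p in posteriors]       # assumed rule: harder sequences gain weight
        total = sum(raw)
        weights = [r / total for r in raw]
    return weights

print(max_discrimination_weights(["AGAA", "CCTC", "AGTC"]))
```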
More weighting schemes

• Maximum discrimination weights
• Maximum entropy weights
  – Based on averaging
  – Based on maximum 'uniformity' (entropy)
Maximum entropy weights

• Entropy = a measure of the average uncertainty of an outcome (maximum when we are maximally uncertain about the outcome)
• Averaging:

$w_k \propto \sum_i \dfrac{1}{m_i \, k_{i x_i^k}}$

where $w_k$ is the weight of sequence $k$, $m_i$ is the number of different residue types in column $i$, $k_{ia}$ is the number of residues of type $a$ in column $i$, and $x_i^k$ is the residue of sequence $k$ in column $i$; the weights are normalized to sum to one.
Maximum entropy weights

• Example with the sequences AGAA, CCTC and AGTC. Per-column contributions $1/(m_i k_{i x_i^k})$:

         col 1   col 2   col 3   col 4
AGAA      1/4     1/4     1/2     1/2
CCTC      1/2     1/2     1/4     1/4
AGTC      1/4     1/4     1/4     1/4

(e.g. in column 1: $m_1 = 2$ (A and C), $k_{1A} = 2$, $k_{1C} = 1$)

• Summing the rows and normalizing gives $w_1 = \dfrac{3}{8},\ w_2 = \dfrac{3}{8},\ w_3 = \dfrac{2}{8}$
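A minimal sketch that reproduces the table and weights above; the function name and the use of Python's Counter are my own choices.

```python
# A minimal sketch of the 'averaging' weights shown above: each residue contributes
# 1 / (m_i * k_{i,a}) to its sequence, where m_i is the number of residue types in
# column i and k_{i,a} the count of residue a in that column; weights are normalised.
from collections import Counter

def averaging_weights(seqs):
    raw = [0.0] * len(seqs)
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        counts = Counter(column)                  # k_{i,a}
        m_i = len(counts)                         # number of different residue types m_i
        for k, a in enumerate(column):
            raw[k] += 1.0 / (m_i * counts[a])
    total = sum(raw)
    return [r / total for r in raw]

print(averaging_weights(["AGAA", "CCTC", "AGTC"]))   # [0.375, 0.375, 0.25] = 3/8, 3/8, 2/8
```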
Maximum entropy weights

• 'Uniformity':
  – Shannon entropy: $H(X) = -\sum_i P(x_i) \log P(x_i)$
  – Choose the weights that maximize the summed column entropies $H(w) = \sum_i H_i(w)$, with

$H_i(w) = -\sum_a p_{ia} \log(p_{ia})$

where $p_{ia}$ is the total weight of the sequences carrying residue $a$ in column $i$.
Maximum entropy weights

• For the sequences AGAA, CCTC and AGTC:

$H_1(w) = -(w_1 + w_3)\log(w_1 + w_3) - w_2 \log w_2$
$H_2(w) = -(w_1 + w_3)\log(w_1 + w_3) - w_2 \log w_2$
$H_3(w) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$
$H_4(w) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$
Maximum entropy weights

Setting the partial derivatives of $\sum_i H_i(w)$ equal to each other (with the weights constrained to sum to one) gives

$(w_1 + w_3)^2 w_1^2 = w_2^2 (w_2 + w_3)^2 = (w_1 + w_3)^2 (w_2 + w_3)^2$

Solving the equations leads to:

$w_1 = \tfrac{1}{2}, \quad w_2 = \tfrac{1}{2}, \quad w_3 = 0$
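A brute-force numeric check of this solution (a sketch, not part of the slides): search weight vectors on the simplex and keep the one that maximizes the summed column entropies.

```python
# A brute-force numeric check: grid-search weight vectors on the simplex and keep the
# one that maximises H(w) = sum_i H_i(w) for the sequences AGAA, CCTC, AGTC.
from math import log

def column_entropy_sum(seqs, weights):
    total = 0.0
    for i in range(len(seqs[0])):
        p = {}
        for s, w in zip(seqs, weights):
            p[s[i]] = p.get(s[i], 0.0) + w        # p_ia: total weight of residue a in column i
        total += -sum(v * log(v) for v in p.values() if v > 0.0)
    return total

seqs = ["AGAA", "CCTC", "AGTC"]
steps = 100
candidates = ((a / steps, b / steps, (steps - a - b) / steps)
              for a in range(steps + 1) for b in range(steps + 1 - a))
print(max(candidates, key=lambda w: column_entropy_sum(seqs, w)))   # ~ (0.5, 0.5, 0.0)
```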
Summary of the entropy methods

• Maximum entropy weights (averaging):

$w_1 = \tfrac{3}{8}, \quad w_2 = \tfrac{3}{8}, \quad w_3 = \tfrac{2}{8}$

• Maximum entropy weights ('uniformity'):

$w_1 = \tfrac{1}{2}, \quad w_2 = \tfrac{1}{2}, \quad w_3 = 0$
Conclusion

• Many different methods
• Which one to use depends on the problem

Questions??