Weighting training sequences

Why do we want to weight training sequences?
Many different proposals:
– Based on trees
– Based on the position of the sequences in ‘sequence space’
– Interested only in classifying family membership
– Maximizing entropy
Why do we want to weight training sequences?
Some training sequences can be closely related to each other and do not deserve the same influence in the estimation process as a sequence that is highly diverged.
– Phylogenetic trees
– Sequences AGAA, CCTC, AGTC
[Figure: a small phylogenetic tree relating the sequences AGAA, CCTC and AGTC.]
Weighting schemes based on trees
Thompson, Higgins & Gibson (1994)
(represents the tree as an electric network, with currents calculated by Kirchhoff's laws)
Gerstein, Sonnhammer & Chothia (1994)
Root weights from Gaussian parameters
(Altschul-Carroll-Lipman weights for a three-leaf tree, 1989)
Thompson, Higgins & Gibson
Electric network of voltages, currents and resistances
[Figure: the tree drawn as an electrical network, with currents I1, I2, I3 flowing through edge resistances R1, R2, R3, R4 toward leaves 1, 2 and 3, and voltages V4, V5 at the internal nodes.]
Thompson, Higgins & Gibson
$V_4 = 2I_1 = 2I_2$
$V_5 = 2I_1 + 3(I_1 + I_2) = 4I_3$
$\Rightarrow\quad I_1 : I_2 : I_3 = 1 : 1 : 2$
[Figure: the example tree as a network: leaves 1 and 2 connect to node 4 through resistances of 2 each, node 4 connects to the root (node 5) through a resistance of 3, and leaf 3 connects to the root through a resistance of 4.]
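To make the numbers concrete, here is a minimal Python sketch (mine, not from the slides) that reproduces these currents by reducing the network with series/parallel resistances:

    # Leaves 1 and 2 hang off node 4 through resistances of 2 each; node 4 connects
    # to the root (node 5) through 3; leaf 3 connects to the root through 4.
    # Apply a unit voltage at the root, ground the leaves, and read off the currents.
    V5 = 1.0
    R_par = 1.0 / (1.0 / 2.0 + 1.0 / 2.0)     # the two leaf edges below node 4, in parallel
    I_branch = V5 / (R_par + 3.0)             # current from the root into node 4
    V4 = I_branch * R_par                     # voltage at node 4
    I1 = V4 / 2.0                             # current into leaf 1
    I2 = V4 / 2.0                             # current into leaf 2
    I3 = V5 / 4.0                             # current into leaf 3
    total = I1 + I2 + I3
    print([I / total for I in (I1, I2, I3)])  # [0.25, 0.25, 0.5]  ->  1 : 1 : 2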
Gerstein, Sonnhammer & Chothia
Works up the tree, incrementing the weights
– Initially the weights are set to the leaf edge lengths $t_n$ (the resistances in the previous example)
– Moving up the tree, each internal edge's length $t_n$ is shared among the leaves $i$ below node $n$, in proportion to their current weights:
$\Delta w_i = t_n \, \dfrac{w_i}{\sum_{\text{leaves } k \text{ below } n} w_k}$
Gerstein, Sonnhammer & Chothia

Initially $w_1 = 2,\ w_2 = 2,\ w_3 = 4$ (the leaf edge lengths).
The internal edge of length 3 above node 4 is shared between leaves 1 and 2:
$w_1 = w_2 = 2 + 3 \cdot \tfrac{2}{2+2} = 2 + 1.5 = 3.5$
$\Rightarrow\quad w_1 : w_2 : w_3 = 3.5 : 3.5 : 4 = 7 : 7 : 8$
[Figure: the same example tree, with edge lengths 2 and 2 to leaves 1 and 2, 3 on the internal edge, and 4 to leaf 3.]
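A minimal Python sketch of this bottom-up scheme for the same example tree (the data layout and function name are my own):

    # Leaf weights start at the leaf edge lengths; each internal edge's length is
    # then shared among the leaves below it, proportionally to their current weights.
    weights = {1: 2.0, 2: 2.0, 3: 4.0}          # initial weights = leaf edge lengths

    def share_edge(edge_length, leaves_below, weights):
        total = sum(weights[k] for k in leaves_below)
        for k in leaves_below:
            weights[k] += edge_length * weights[k] / total

    share_edge(3.0, [1, 2], weights)            # internal edge of length 3 above node 4
    print(weights)                              # {1: 3.5, 2: 3.5, 3: 4.0}  ->  7 : 7 : 8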
Gerstein, Sonnhammer & Chothia

Small difference with Thompson, Higgins & Gibson?
For a two-leaf tree with edge lengths 1 and 2 below the root:
T, H & G: $I_1 : I_2 = 2 : 1$
G, S & C: $w_1 : w_2 = 1 : 2$
Root weights from Gaussian parameters

Continuous instead of discrete members of an alphabet
Probability density instead of a substitution matrix
Example: Gaussian
$P(x \mid y, t) \propto \exp\!\left(-\dfrac{(x-y)^2}{2t}\right)$
Root weights from Gaussian parameters

$P(x \text{ at node } 4 \mid L_1, L_2) = K_1 \exp\!\left(-\dfrac{(x-x_1)^2}{2t_1}\right) \exp\!\left(-\dfrac{(x-x_2)^2}{2t_2}\right) = K_1 \exp\!\left(-\dfrac{(x - v_1 x_1 - v_2 x_2)^2}{2t_{12}}\right)$
with $v_1 = t_2/(t_1+t_2)$, $v_2 = t_1/(t_1+t_2)$ and $t_{12} = t_1 t_2/(t_1+t_2)$.
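This combination step is just completing the square in $x$; the $x$-independent remainder is absorbed into the constant:

$-\dfrac{(x-x_1)^2}{2t_1} - \dfrac{(x-x_2)^2}{2t_2} = -\dfrac{(x - v_1 x_1 - v_2 x_2)^2}{2\,t_1 t_2/(t_1+t_2)} - \dfrac{(x_1-x_2)^2}{2(t_1+t_2)}$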
Root weights from Gaussian parameters

Altschul-Carroll-Lipman weights for a tree with three leaves:
$P(x \text{ at node } 5 \mid L_1, L_2, L_3) = K_2 \exp\!\left(-\dfrac{(x - w_1 x_1 - w_2 x_2 - w_3 x_3)^2}{2t_{123}}\right)$
Root weights from Gaussian parameters

$w_1 = \dfrac{t_2 t_3}{t_1 t_2 + (t_3+t_4)(t_1+t_2)},\quad w_2 = \dfrac{t_1 t_3}{t_1 t_2 + (t_3+t_4)(t_1+t_2)},\quad w_3 = \dfrac{t_1 t_2 + t_4(t_1+t_2)}{t_1 t_2 + (t_3+t_4)(t_1+t_2)}$
For the example tree: $w_1 : w_2 : w_3 = 1 : 1 : 2$
[Figure: the same example tree, with edges of length 2 to leaves 1 and 2, an internal edge of length 3, and an edge of length 4 to leaf 3.]
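Plugging the example edge lengths into these formulas (assuming, as in the earlier tree, $t_1 = t_2 = 2$ for leaves 1 and 2, $t_4 = 3$ for the internal edge and $t_3 = 4$ for leaf 3):

$t_1 t_2 + (t_3+t_4)(t_1+t_2) = 4 + 7 \cdot 4 = 32$
$w_1 = \dfrac{2 \cdot 4}{32} = \dfrac{8}{32},\quad w_2 = \dfrac{2 \cdot 4}{32} = \dfrac{8}{32},\quad w_3 = \dfrac{4 + 3 \cdot 4}{32} = \dfrac{16}{32}$
so $w_1 : w_2 : w_3 = 1 : 1 : 2$.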
Weighting schemes based on trees
Thompson, Higgins & Gibson (electric current): 1 : 1 : 2
Gerstein, Sonnhammer & Chothia: 7 : 7 : 8
Altschul-Carroll-Lipman weights for a tree with three leaves: 1 : 1 : 2
Weighting scheme using ‘sequence space’

Voronoi weights:
$w_i = \dfrac{n_i}{\sum_k n_k}$
where $n_i$ is the number of randomly sampled sequences that lie closest to sequence $i$.
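A minimal Monte-Carlo sketch of Voronoi weighting in Python (the sampling scheme is an assumption made for illustration: uniform random sequences over the DNA alphabet, Hamming distance, ties split evenly):

    import random

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def voronoi_weights(seqs, n_samples=10000, alphabet="ACGT"):
        counts = [0.0] * len(seqs)                # n_i: samples closest to sequence i
        length = len(seqs[0])
        for _ in range(n_samples):
            r = "".join(random.choice(alphabet) for _ in range(length))
            dists = [hamming(r, s) for s in seqs]
            best = min(dists)
            winners = [i for i, d in enumerate(dists) if d == best]
            for i in winners:                     # split ties evenly
                counts[i] += 1.0 / len(winners)
        total = sum(counts)
        return [n / total for n in counts]        # w_i = n_i / sum_k n_k

    print(voronoi_weights(["AGAA", "CCTC", "AGTC"]))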
More weighting schemes
Maximum discrimination weights
Maximum entropy weights
– Based on averaging
– Based on maximum ‘uniformity’ (entropy)
Maximum discrimination weights
Does not try to maximize likelihood or posterior probability
Instead, it focuses on deciding whether a sequence is a member of a family
Maximum discrimination weights
$P(M \mid x) = \dfrac{P(x \mid M)\,P(M)}{P(x \mid M)\,P(M) + P(x \mid R)\,P(R)}$
Discrimination $D$:
$D = \prod_k P(M \mid x^k)$
Maximize $D$; the emphasis is then on distant or difficult members
Maximum discrimination weights
Differences with the previous schemes
– Iterative method
  Initial weights give rise to a model
  Newly calculated posterior probabilities P(M|x) give rise to new weights, and hence a new model, until convergence is reached
– It optimizes performance for what the model is designed for: classifying whether a sequence is a member of a family
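A schematic Python sketch of that iterative loop. The model-fitting and likelihood functions are placeholders (they are not specified on the slides), and the reweighting rule used here (more weight for sequences with low P(M|x)) is one way to put the emphasis on distant or difficult members:

    def max_discrimination_weights(seqs, fit_model, lik_M, lik_R,
                                   prior_M=0.5, n_iter=20):
        weights = [1.0 / len(seqs)] * len(seqs)      # start from uniform weights
        for _ in range(n_iter):
            model = fit_model(seqs, weights)         # model built from weighted sequences
            posteriors = []
            for x in seqs:
                pM = lik_M(model, x) * prior_M       # P(x|M) P(M)
                pR = lik_R(x) * (1.0 - prior_M)      # P(x|R) P(R)
                posteriors.append(pM / (pM + pR))    # P(M|x)
            # emphasise the sequences the current model classifies poorly
            weights = [1.0 - p + 1e-9 for p in posteriors]
            total = sum(weights)
            weights = [w / total for w in weights]
        return weights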
Maximum entropy weights
Entropy = a measure of the average uncertainty of an outcome (maximal when we are maximally uncertain about the outcome)
Averaging:
$w_k = \sum_i \dfrac{1}{m_i\, k_{i x_i^k}}$
with weight $w_k$ for sequence $k$, $m_i$ the number of different residue types in column $i$, and $k_{ia}$ the number of residues of type $a$ in column $i$.
Maximum entropy weights
Sequences: AGAA, CCTC, AGTC

Per-column contributions $1/(m_i k_{i x_i^k})$ and their sums:

            col 1   col 2   col 3   col 4   sum
    AGAA     1/4     1/4     1/2     1/2    3/2
    CCTC     1/2     1/2     1/4     1/4    3/2
    AGTC     1/4     1/4     1/4     1/4     1

Normalized: $w_1 = \tfrac{3}{8},\ w_2 = \tfrac{3}{8},\ w_3 = \tfrac{2}{8}$

For example, in column 1: $m_1 = 2$ (A and C), $k_{1A} = 2$, $k_{1C} = 1$.
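A minimal Python sketch of this ‘averaging’ weighting that reproduces the numbers above (the function name is mine):

    from collections import Counter

    def averaging_weights(seqs):
        raw = [0.0] * len(seqs)
        for i in range(len(seqs[0])):
            counts = Counter(s[i] for s in seqs)   # k_{ia}: residues of type a in column i
            m_i = len(counts)                      # number of different residue types
            for k, s in enumerate(seqs):
                raw[k] += 1.0 / (m_i * counts[s[i]])
        total = sum(raw)
        return [w / total for w in raw]

    print(averaging_weights(["AGAA", "CCTC", "AGTC"]))   # [0.375, 0.375, 0.25] = 3/8, 3/8, 2/8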
Maximum entropy weights
‘Uniformity’:
Shannon: $H(X) = -\sum_i P(x_i) \log P(x_i)$
Choose the weights $w_k$ to maximize the total column entropy $\sum_i H_i(w)$, where
$H_i(w) = -\sum_a p_{ia} \log(p_{ia})$
and $p_{ia}$ is the weighted frequency of residue $a$ in column $i$ (the sum of the weights $w_k$ of the sequences with residue $a$ at position $i$).
Maximum entropy weights
Sequences: AGAA, CCTC, AGTC
$H_1(w) = -(w_1+w_3)\log(w_1+w_3) - w_2 \log w_2$
$H_2(w) = -(w_1+w_3)\log(w_1+w_3) - w_2 \log w_2$
$H_3(w) = -w_1 \log w_1 - (w_2+w_3)\log(w_2+w_3)$
$H_4(w) = -w_1 \log w_1 - (w_2+w_3)\log(w_2+w_3)$
Maximum entropy weights
Setting the derivatives of $\sum_i H_i(w)$ equal to each other (with $w_1 + w_2 + w_3 = 1$) gives
$(w_1+w_3)^2\, w_1^2 = w_2^2\, (w_2+w_3)^2 = (w_1+w_3)^2\, (w_2+w_3)^2$
Solving the equations leads to:
$w_1 = \tfrac{1}{2},\quad w_2 = \tfrac{1}{2},\quad w_3 = 0$
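A brute-force Python check of this result (a coarse grid search over the weight simplex; purely illustrative, not how one would solve it in practice):

    import math

    seqs = ["AGAA", "CCTC", "AGTC"]

    def total_entropy(w):
        H = 0.0
        for i in range(len(seqs[0])):
            p = {}                                   # weighted residue frequencies p_ia
            for wk, s in zip(w, seqs):
                p[s[i]] = p.get(s[i], 0.0) + wk
            H -= sum(q * math.log(q) for q in p.values() if q > 0)
        return H

    steps = 100
    best_H, best_w = -1.0, None
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            w = (a / steps, b / steps, (steps - a - b) / steps)
            H = total_entropy(w)
            if H > best_H:
                best_H, best_w = H, w

    print(best_w)                                    # (0.5, 0.5, 0.0)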
Summary of the entropy methods
Maximum entropy weights (averaging):
$w_1 = \tfrac{3}{8},\quad w_2 = \tfrac{3}{8},\quad w_3 = \tfrac{2}{8}$
Maximum entropy weights (‘uniformity’):
$w_1 = \tfrac{1}{2},\quad w_2 = \tfrac{1}{2},\quad w_3 = 0$
Conclusion
Many different methods
Which one to use depends on the problem
Questions??