PPT - the Department of Statistics

advertisement
Interpretation of exponentiation + eigenvalue decomposition
The terms in the series expansion of P(t) does not directly have an interpretation. The first, I, is the
trivial transition function and the remaining has negative numbers and 0 row sums. If the CTMC has
identical exit rates for all states (q) then:
Continuous Time Markov Chain
0
ti
ti + 1
t
Poisson Process
=
Discrete Time Markov Chain
+
0
i
i+1
n
Then a little rearrangement gives:
Where Q’ is Q-I/q (the single step transition probabilities). Without identical exit rates, I don’t know a
simple interpretation.
Ie Q is weighted symmetric Qi,j = πj Q j,i/πi and thus is
diagonazable, Q = UDUT
If Q has distinct eigenvalues, then it will also have
simple expressions for Pi,j(t)
Kimura 2-parameter model - K80
TO
A
C
G
T
F A
-2*b-a
b
a
b
R C
b
-2*b-a
b
a
G
a
b
-2*b-a
b
M T
b
a
b
-2*b-a
Q: O
a = a*t
b = b*t
P(t)
start
.25(1  e-4b  2e-2( ab) )
.25(1 - e-4b )
.25(1  e-4b - 2e-2( ab) )
.25(1 - e-4b )
Felsenstein81 & Hasegawa, Kishino & Yano 85
Unequal base composition:
Qi,j = C*πj
(Felsenstein, 1981 F81)
i unequal j
Rates to frequent nucleotides are high - (π =(πA , πC , πG , πT)
Tv/Tr = (πT πC +πA πG )/[(πT+πC )(πA+ πG )]
A
T
C
G
Tv/Tr & compostion bias (Hasegawa, Kishino & Yano, 1985 HKY85)
(a/b)*C*πj
Qi,j =
C*πj
i- >j a transition
i- >j a transversion
Tv/Tr = (a/b) (πT πC +πA πG )/[(πT+πC )(πA+ πG )]
Group 3, Symmetric 6, Reversible 9 and General 12 models
Kimura 3 parameter 1980,
C
Evans and Speed 1993:
æ
b a gö
ç
÷
b
g
a
ç
÷
ça g
b÷
ç
÷
èg a b
ø
a
b
A
a
g
T
b
Symmetric:
G
Can be interpreted as random walk on Z2*Z2
Time reversible:
æ
p A q12 p A q13 p A q14 ö
ç
÷
p
q
p
q
p
q
C 23
C 24 ÷
ç C 12
çp G q13 pG q23
pG q34 ÷
ç
÷
p
q
p
q
p
q
è T 14
ø
T 24
T 34
Can be obtained from differences and equilibrium
distributions
æ
ç
ç q12
ç q13
ç
èq14
q12
q23
q24
q13
q23
q34
q14 ö
÷
q24 ÷
q34 ÷
÷
ø
• Often only differences can be observed leading
to a symmetric matrix
• Symmetric matrices has uniform equilibrium
distributions
General:
æ
q12 q13 q14 ö
ç
÷
q
q
q
23
24 ÷
ç 21
çq31 q32
q34 ÷
ç
÷
q
q
q
è 41 42 43
ø
Non-reversible models allow
rooting with only 2 sequences
Alternative condition for time
reversibility: No net flows
=
From Nucleotide to Sequence
Each nucleotide evolves independent
ATTGCGTCC A ATATTGCGTCCGAT
ATGGCGTCC T ATATTGCGTGCAAT
ACGGAGT
• Di-nucleotide events
ACGTCGT
• Context-dependent models
Dinucleotides
Genome:
• Rate Variation
ATTGCGTCCAATATTGCGTCCAAT
..ACGGA..
00: 10-8 doublet mutation rate , ~10% of singlet rate
03: much less for a large more reliable data set
ACGTCGT
Double events
Assuming JC69 + doublet mutations.
?
=
ACGTCGT
ACGTCGT
Singlet
Doublet
ACGGAGT
ACGGAGT
Single nucleotide events
ACGGAGT
Singlet
Averof et al. (2000) Evidence for High Frequency of Simultaneous Double-Nucleotide Substitutions” Science287.1283- . + Smith et al. (2003) A Low rate of
Simultaneous Double-Nucleotide Mutations in Primates” Mol.Biol.Evol 20.1.47-53
Di-nucleotide events
Context-dependent models
From singlet models to doublet models:
Contagious Dependence:
Independence
Independence with CG avoidance
Strand symmetry
Pedersen and Jensen, 2001
Only single events
Siepel and Haussler, 2003
Single events with simple double events
G
A
C
?
C
T
A
A
Rate variation between sites:iid each site
The rate at each position is drawn independently from a distribution, typically a G (or
lognormal) distribution. G(a,b) has density xb-1*e-ax/G(b) , where a is called scale
parameter and b form parameter.
Let L(pi,,t) be the likelihood for observing the i'th pattern, t all time lengths,  the
parameters describing the process parameters and f (ri) the continuous distribution of
rate(s). Then L 
L( p , , r ) f ( r )dr

i
i
i
i
Measuring Selection
ThrSer
ACGTCA
ThrPro
ACGCCA
Certain events have functional
consequences and will be selected
out. The strength and localization of
this selection is of great interest.
-
ThrSer
ACGCCG
ArgSer
AGGCCG
The selection criteria could in
principle be anything, but the
selection against amino acid changes
is without comparison the most
important
ThrSer
ACTCTG
AlaSer
GCTCTG
AlaSer
GCACTG
The Genetic Code
3 classes of sites:
4
2-2
1-1-1-1
i.
4 (3rd)
Problems:
1-1-1-1 (3rd)
ii. TA (2nd)
i. Not all fit into those categories.
ii. Change in on site can change the status of another.
Possible events if the genetic code
remade from Li,1997
Possible number of substitutions: 61 (codons)*3 (positions)*3 (alternative nucleotides).
Substitutions
Number
Percent
Total in all codons
549
100
Synonymous
134
25
415
75
Missense
392
71
Nonsense
23
4
Nonsynonymous
Kimura’s 2 parameter model & Li’s Model.
Probabilities:
Rates:
start
b
b
a
b
.25(1  e-4b  2e-2( ab) )
.25(1 - e-4b )
.25(1  e-4b - 2e-2( ab) )
.25(1 - e-4b )
a
Selection on the 3 kinds of sites (a,b)(?,?)
1-1-1-1
(f*a,f*b)
2-2
(a,f*b)
4
(a, b)
alpha-globin from rabbit and mouse.
Ser
TCA
*
TCG
Ser
Sites
1-1-1-1
2-2
4
Thr
ACT
*
ACA
Thr
Glu
GAG
*
GGG
Gly
Total
274
77
78
Z(at,bt) = .50[1+exp(-2at) - 2exp(-t(a+b)]
Y(at,bt) = .25[1-exp(-2bt )]
X(at,bt) = .25[1+exp(-2at) + 2exp(-t(ab)]
Met
ATG
*
ATA
Ile
Cys
TGT
*
TAT
Tyr
Leu
TTA
*
CTA
Leu
Met Gly Gly
ATG GGG GGA
* **
ATG GGT ATA
Met Gly Ile
Conserved
246 (.8978)
51 (.6623)
47 (.6026)
Transitions
12(.0438)
21(.2727)
16(.2051)
Transversions
16(.0584)
5(.0649)
15(.1923)
transition
transversion
identity
L(observations,a,b,f)=
C(429,274,77,78)* {X(a*f,b*f)246*Y(a*f,b*f)12*Z(a*f,b*f)16}* {X(a,b*f)51*Y(a,b*f)21*Z(a,b*f)5}*{X(a,b)47*Y(a,b)16*Z(a,b)15}
where a = at and b = bt.
Estimated Parameters:
1-1-1-1
2-2
4
a = 0.3003 b = 0.1871
Transitions
a*f = 0.0500
a
= 0.3004
a
= 0.3004
2*b = 0.3742 (a + 2*b) = 0.6745 f = 0.1663
Transversions
2*b*f = 0.0622
2*b*f = 0.0622
2*b
= 0.3741
Expected number of:
replacement substitutions 35.49
synonymous
Replacement sites : 246 + (0.3742/0.6744)*77 = 314.72
Silent sites
: 429 - 314.72
= 114.28
Ks = .6644 Ka = .1127
75.93
Learning to Count: Robust Estimates for Labeled Distances between Molecular Sequences O’Brien, Minin, and Marc A. Suchard Mol. Biol. Evol.
26(4):801–814. 2009
Vladimir N Minin and Marc A Suchard Fast, accurate and simulation-free stochastic mapping 3995 363 2008 Phil. Trans. R. Soc. B
Counting labeled transitions in continuous-time Markov models of evolution Vladimir N. Minin á Marc A. Suchard J. Math. Biol. (2008) 56:391–
412
Probabilities of different paths
Starting in A ending in B after time t
Rate of going from i to j: qi,j
S1
S3
A
B
Sk
S2
Key questions (conditional/unconditional):
• Number of events
• Kinds of events
• Time spent at different states
• Very liked and dis-liked sets
• Probability of only visiting {S}
• Which edges/nodes carry
most/least probability? Ranked
lists of edges nodes.
• Time to get from A to B
C
t3
Generalize to a phylogeny
• Distribution of ancestor state, X
P(X)= P(A,B,C,X)/P(A,B,C)
S1
S3
A
t1
Sk
S2
t2
B
Summary of Substitution Models
• Assumptions behind substitution models
• Independence of lineages
• Continuous time Markov Chain
• Only substitutions
• From P to Q &
• Independence and identity of positions
• The simplest model: Jukes-Cantor
from Q to P
• Extensions to the basic model
• From nucleotide to sequence
• Independence of nucleotides
• Context Dependent Models
• Codons
• Rate heterogeneity
• Ancestral Analysis – conditioning on start and finish
0
t1
• The above expression can be shown to be of the form
And recursions O(N2) exists to calculate coefficients.
t2
T
• Integrate of all waiting times (t1,..,ti) and state assignments of length i gives
probability of specific trajectory
Õ åe
N
M
• Sum over all path lengths gives probability of N turning into N’
-q i 0T
n=1 n= 0
• Sum over i state assignments gives probability of paths of length i.
åc T
dn
k
n
k= 0
k
Koskinen,J. (2004) Bayesian Inference for Longitudinal Social Networks. Research Report, number 2004:4, Stockholm University, Department of Statistics. Koskinen,J. and Snijders,T. (2007) Bayesian inference for dynamic social network data, Journal of
Statistical Planning and Inference, 137, 3930--3938. R. Sharan, T. Ideker, Modeling cellular machinery through biological network comparison, Nature Biotechnology, 24, 427 (2006). Snijders, T. (2001) “Statistical evaluation of social networks dynamics” in
Sociological Methodology By Michael Sobel Snijders, T. et al. (2008) “Maximum Likelihood Evaluation for Social Network Dynamics” In press I. Miklos, G.A. Lunter and I. Holmes (2004) A "long indel" model for evolutionary sequence alignment.
Mol. Biol. Evol. 21(3):529-540. Appendix A
From Continuous to Discrete Time
Correlated Mutations
Motivation: Models often assume independence between sites or sites and phenotype, however
that might not be warranted and methods detecting correlation is of great use.
• Explicit modelling:
Single binary state
+
Single binary state
--
+
-- æ-2l
æl lö
ç
÷
è m -mø
ç
-+ç m
ç m
+-ç
++è 0
• Ancestral Analysis:
Ideal situation – complete history known.
A
G
-+
+-
++
l
l
0 ö
÷
-l -m -e
0
l + e÷
0
l- -m -e l + e ÷
÷
m
-2m
m ø
Many end points – a phylogeny might give power.
C
*
C
T
Real situation – end points known, little power.
A
C
G
C
G A
C G
T
C
• Multiple Testing
All pairs – n(n-1)/2
Site – phenotypic character - n
G
A
Transition Path Sampling Algorithm/MCMC
p2
Path 1 - probability: p1 p2 p3
Local modification of Path 1 in Path 2:
p4
p1
p3
p5
p6
P1
P2
Path 2 - probability: p1 p4 p5 p6
Discrete Space:
Acceptance ratio
Set of
Likelihood - L(
paths:
Probability of going from
)
to
- q(
, )
L( )q( , )
L( )q( , )
Continuous Space – reversible jump MCMC (Green, 1995)
The acceptance ration will have to be weighted by Jacobian – J.
Typically much slower as continuous case includes stochastic integration
Challenge for large state space, E(steps) large and P a,b(t) small:
S1
Algorithm (forward rejection sampling)
S1
q1B
q0
1
q02
S2
S2
q2B
A
q03
0
S3
S3
Keep paths ending in B at time t
B
Normalize their probability by dividing with PA,B(t)
t
Can be modified to be more efficient if Paa(t) has high probability
q3B
q0k
Sk
Sample paths unconditionally
qkB
Sk
Algorithm
Create Uniformized Process
- Real jumps
- Self jumps
The Poisson Process – tag the red stars !!
G: maxi -qii
Q’: qii:=-G
Interpret increased exit rates as self-jumps
Sample jump points according to Poisson Process
Sample discrete jump transition according to conditional jump process
Unconditional jump probabilities
R:= I + Q’/G
Conditional jump probabilities
P(x i , x i+1 | b) =
Rx-1,i (R n-1 ) xi ,b
(R n-i+11) x i -1 ,b
0
i
i+1
n
Hobolth and Stone (2009) EFFICIENT SIMULATION FROM FINITE-STATE, CONTINUOUS-TIME MARKOV CHAINS WITH INCOMPLETE OBSE
Simulating trajectories that ends in B at time t
Statistical Test of Models
(Goldman,1990)
Data: 3 sequences of length L
ACGTTGCAA ...
AGCTTTTGA ...
TCGTTTCGA ...
A. Likelihood (free multinominal model 63 free parameters)
L1 = pAAA#AAA*...pAAC#AAC*...*pTTT#TTT where pN1N2N3 = #(N1N2N3)/L
B. Jukes-Cantor and unknown branch lengths
ACGTTGCAA ...
l1
l2
l3
TCGTTTCGA ...
L2 = pAAA(l1',l2',l3') #AAA*...*pTTT(l1',l2',l3') #TTT
AGCTTTTGA ...
Test statistics: I. S (expected-observed)2/expected or II: -2 lnQ = 2(lnL1 - lnL2)
JC69 Jukes-Cantor: 3 parameters => c2 60 d.of freedom
Problems: i. To few observations pr. pattern.
Parametric bootstrap:
i. Maximum likelihood to estimate the parameters.
iii. Make simulated distribution of -2 lnQ.
ii. Many competing hypothesis.
ii. Simulate with estimated model.
iv. Where is real -2 lnQ in this distribution?
Extension to Overlapping Regions
Hein & Stoevlbaek, 95
1st
1-1-1-1
2-2
1-1-1-1 sites
(f1f2a, f1f2b)
(f2a, f1f2b)
(f2a, f2b)
2-2
(f1a, f1f2b)
(f2a, f1f2b)
(a, f2b)
4
(f1a, f1b)
(a, f1b)
(a, b)
2nd
4
pol
gag
Example: Gag & Pol from HIV
Pol
Gag
1-1-1-1
2-2
4
1-1-1-1 sites
64
31
34
2-2
40
7
0
4
27
2
0
MLE:
a=.084
b= .024
a+2b=.133
fgag=.403
fpol=.229
Ziheng Yang has an alternative model to this, were sites are lumped into the same category if they have the same configuration of positions and reading frames.
HIV1 Analysis
Hasegawa, Kisino & Yano Subsitution Model Parameters:
a*t
0.350
0.015
β*t
0.105
0.005
pA
0.361
0.004
pC
0.181
0.003
pG
0.236
0.003
Selection Factors
GAG
POL
VIF
VPR
TAT
REV
VPU
ENV
NEF
0.385
0.220
0.407
0.494
1.229
0.596
0.902
0.889
0.928
(s.d.
(s.d.
(s.d.
(s.d.
(s.d.
(s.d.
(s.d.
(s.d.
(s.d.
Estimated Distance per Site: 0.194
0.030)
0.017)
0.035)
0.044)
0.104)
0.052)
0.079)
0.051)
0.073)
pT
0.222
Open Problem II: Example Neural Networks (NN))
Motivation: To combine methods that define patterns from a series of independent instances with
pattern inferred by instances related by a phylogenetic tree.
Independent instances:
Related by a phylogeny:
Basic Equations I
Pi, j (t) = å Pi,k (t1)Pk, j (t 2 )
Chapman-Kolmogorov:
t1
i
k
t2
k
j
t = t 1 + t2
Matrix version:
P(t) = P(t1)P(t2 )
Forward Equation:
t1
i
k
h
j
Initial Condition:
Backward Equation:
i
k
t2
j
P(0) = I
0
t1
• The above expression can be shown to be of the form
And recursions O(N2) exists to calculate coefficients.
t2
T
• Integrate of all waiting times (t1,..,ti) and state assignments of length i gives
probability of specific trajectory
Õ åe
N
M
• Sum over all path lengths gives probability of N turning into N’
-q i 0T
n=1 n= 0
• Sum over i state assignments gives probability of paths of length i.
åc T
dn
k
n
k= 0
k
Koskinen,J. (2004) Bayesian Inference for Longitudinal Social Networks. Research Report, number 2004:4, Stockholm University, Department of Statistics. Koskinen,J. and Snijders,T. (2007) Bayesian inference for dynamic social network data, Journal of
Statistical Planning and Inference, 137, 3930--3938. R. Sharan, T. Ideker, Modeling cellular machinery through biological network comparison, Nature Biotechnology, 24, 427 (2006). Snijders, T. (2001) “Statistical evaluation of social networks dynamics” in
Sociological Methodology By Michael Sobel Snijders, T. et al. (2008) “Maximum Likelihood Evaluation for Social Network Dynamics” In press I. Miklos, G.A. Lunter and I. Holmes (2004) A "long indel" model for evolutionary sequence alignment.
Mol. Biol. Evol. 21(3):529-540. Appendix A
From Continuous to Discrete Time
Fast/Slowly Evolving States
Felsenstein & Churchill, 1996
1
1
position
s
n
k
slow rs
fast
- rf
HMM:
• pr - equilibrium distribution of hidden states (rates) at first
position
•pi,j - transition probabilities between hidden states
•L(j,r) - likelihood for j’th column given rate r.
•L(j,r) - likelihood for first j columns given j’th column has rate r.
Likelihood
Recursions:
L(j,f) = (L(j-1,f) p
f,f
Likelihood
Initialisations:
L(1,f) = p f L(1, f )
+ L(j-1,s) ps, f )L( j, f ) L(j,s) = (L(j-1,f) p f ,s + L(j-1,s) ps,s )L( j,s)
L(1,s) = p sL(1,s)
Basic Equations
Expected time spent in j, T(j), in going from a to b:
b
i
E ab (T(i)) =
ò
t
0
Pai (s)Pib (T - t - s)ds /Pab (t)
a
n
s
0
Expected number of transition from i to j, N(i,j), in going from a to b:
b
E ab (N(i, j)) = qi, j
ò
t
0
Pai (s)Pjb (T - t - s)ds /Pab (t)
qi,j
i
j
a
0
n
s
Higher moments and combinations of N( ) and T( ) can be calculated using the same reasoning
b
Evaluation of Eab(N1(),..Nr,T1(),..Tm()),
would involve the evaluation of at
most n+m dimensional integral
a
0
Hobolth, A. and Jensen, J.L. (2005). Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statistical applications in Genetics and Molecular Biology, 4, 18
n
Codon based Models
Goldman,Yang + Muse,Gaut
i. Codons as the basic unit.
ii. A codon based matrix would have (61*61)-61 (= 3661) off-diagonal entries.
i. Bias in nucleotide usage.
ii. Bias in codon usage.
iii. Bias in amino acid usage.
iv. Synonymous/non-synonymous distinction.
v. Amino acid distance.
vi. Transition/transversion bias.
codon i and codon j differing by one nucleotide, then
apj exp(-di,j/V)
qi,j =
bpj exp(-di,j/V)
differs by transition
differs by transversion.
-di,j is a physico-chemical difference between amino acid i and amino acid j. V
is a factor that reflects the variability of the gene involved.
Dayhoffs empirical approach
Take a set of closely related
proteins, count all differences
and make symmetric difference
matrix, since time direction
cannot be observed.
(1970)
History of Phylogenetic Methods & Stochastic Models
1958 Sokal and Michener publishes UGPMA method for making distrance trees with a clock.
1964 Parsimony principle defined, but not advocated by Edwards and Cavalli-Sforza.
1962-65 Zuckerkandl and Pauling introduces the notion of a Molecular Clock.
1967 First large molecular phylogenies by Fitch and Margoliash.
1969 Heuristic method used by Dayhoff to make trees and reconstruct ancetral sequences.
1969 Jukes-Cantor proposes simple model for amino acid evolution.
1970: Neyman analyzes three sequence stochastic model with Jukes-Cantor substitution.
1971-73 Fitch, Hartigan & Sankoff independently comes up with same algorithm reconstructing
parsimony ancetral sequences.
1973 Sankoff treats alignment and phylogenies as on general problem – phylogenetic alignment.
1979 Cavender and Felsenstein independently comes up with same evolutionary model where
parsimony is inconsistent. Later called the “Felsenstein Zone”.
1979: Kimura introduces transition/transversion bias in nucleotide model in response to
pbulication of mitochondria sequences.
1981: Felsenstein Maximum Likelihood Model & Program DNAML (i programpakken PHYLIP).
Simple nucleotide model with equilibrium bias.
1981 Parsimony tree problem is shown to be NP-Complete.
1985: Felsenstein introduces bootstrapping as confidence interval on phylogenies.
1985: Hasegawa, Kishino and Yano combines transition/transversion bias with unequal equilibrium
frequencies.
1986 Bandelt and Dress introduces split decomposition as a generalization of trees.
1985-: Many authors (Sawyer, Hein, Stephens, M.Smith) tries to address the problem of
recombinations in phylogenies.
1991 Gillespie’s book proposes “lumpy” evolution.
1994 Goldman & Yang + Muse & Gaut introduces codon based models
1997-9 Thorne et al., Sanderson & Huelsenbeck introduces the Almost Clock.
2000 Rambaut (and others) makes methods that can find trees with non-contemporaneous leaves.
2000 Complex Context Dependent Models by Jensen & Pedersen. Dinucleotide and overlapping
reading frames.
2001- Major rise in the interest in phylogenetic statistical alignment
2001- Comparative genomics underlines the functional importance of molecular evolution.
Download