poster

advertisement
Efficient Inference in Phylogenetic InDel Trees
Alexandre Bouchard-Côté Michael I. Jordan Dan Klein
Computer Science Division
University of California at Berkeley
c:
C
-
-
A
A
A
T
-
G
T
C
-
-
-
C
a
a:
:C
C
C
application: homology modeling
• Solution: always condition on the data
G
A
C
A
T
b:
C
c: A
a:
• Approach 1: heuristic
A
T
• Experiments
[8]
C
?
A
b:
C
A
?
G
? ?
c:
?
A
T
C
?
?
?
d:
d
r:
b
C
...ACCGTGCGTGCT...
• Not irreducible
d:
? ?
? ?
c
[2]
...ACCCTGCGGTGCT...
...ACGCTGCGTGAT...
d: ?
a: C
a
?
?
..
..
A
T
A
b: C
A
c: A
c
application: the origins of life
T
A
T
C
T
...ACGCTGTTGTGAT...
...ACGGTGCTGTGCT..
character-valued!
C
A
G
G
Del
...ACCCTGCGGTGCT...
[1]
Sub
Sub Sub
...ACGCTGCGGTGAT...
...ACGTTGCGGTGAT...
• Assuming we have an evolutionary InDel model, both
tasks can be cast as an inference problem
...ACGCTGTTGTGAT...
r:
d:
State augmentation: alignments
...ACCGTGCGATGCT...
C
C
...ACGGTGCTGTGCT..
A
Ins Ins
T
A
}
}
4
4
• What is the state of the sampler?
• Approach 2: naive Gibbs sampling
5
• Contiguous version?
(AR)
G
[2]
G
C
A
C
d
b
A
C
3
a:
1
1
0
0
2
0
5000
!!
derivation
C
10000
• How should "vertical slices" be defined?
C
A
20000
25000
T
C
string-valued r.v.
C
T
A
C
T
C
C
C
A
T
C
r
d
C
A
G
G
r: A
Del
A
A
C
T
C
C
C
A
G
A
T
C
C
C
A
r: C
d: C
A
T
A
G
A
C
r:
d:
a:
a
T
C
C
Ins Ins
T
A
a:
• Next: how to sample?
C
c: A
T
C
T C
C
Obtaining multi-alignments:
d: C
T C C
A
T
a: C
A
T
A
C
b: C
C A T C
T C C
C
A T C C
A
G
Indexed by a span on an observed word (anchor)
A T C
C A G
C A T C
Legend
A T C C
Problem II: each step is expensive
(cubic in the sequence length)
C A G
observed characters
selected characters
anchor
A G C C
A G C C
A G C C
anchor
10000
12000
CS
0.63
0.77
14000
.ktisktkgaprmPDVYTLPPSRD
..tiakvtvntfpPQVHLLPPPSE
.tkltvlgqpkssPSVTLFPPSSE
gtkleikr.tvaaPSVFIFPPSDE
.ttvtvssasttaPSVYPLAPVSG
gc2_cavpo
alc_mouse
1yuh18000
16000
1mim
1nld
49
49
49
20000
49
49
asnrvvsekpeykntppieda...
lhgn....eelspesylveplkep
kvdgtpvtegmettqpskqs....
kvdnalqsgnsqesvteqdsk...
nsg..slssgvhtfpavlqs....
gc2_cavpo
alc_mouse
1yuh
1mim
1nld
92
95
91
92
87
SP
0.77
0.86
YTCSVMHealhnhvtqkaisrsp
YSCMVGHealpmnftqktidrls
YSCQVTHeg..htve.kslsra.
YACEVTHqglsspvt.ksfnrg.
ITCNVAHpasstkvdkkieprg.
The taller the tree, the larger the gap:
CS
0.63
0.77
100
SP
0.77
90
0.86
100
75
75
25
References
0
SSR
100
AR
90
50
80
50Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant
[1] N.
life forms. Science, 283:220–221, 1999.
3
4
5
6
7
70
[2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment.
Bioinformatics, 17:803–820, 2001.
[3] J.
25Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.
8
3
4
5
6
7
8
[4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov
Chain Monte Carlo Method. Mol. Biol. Evol., 1997.
[5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using
markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997.
0
70
3
4
5
6
7
8
[6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte
Carlo.3J. of the American
2000.6
4 Statistical Association,
5
7
8
Figure 4: CS and SP XXXCITE
are recall metrics XXX
Depth
C A G C C
A T C C
C A G C C
A T C C
A G C C
C A G C C
A T C C
[7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based
approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.
Depth
[8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment
on a microcomputer. Gene, 73:237–244, 1988.
[9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood
model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992.
The multi-alignment
in the litterature that most resembles our lineage sampler is Handel. It models in
r : A Tsystem
C
a probabilistic fashion an InDel evolutionary
a treefrom
above
the observed
proteins
produces
sequencederivation
alignment canalong
be extracted
the infered
derivation D
as follow:and
deem
the amino acids x, y ∈ S
iff y difference
∈ A0 (D, x). with our approach is that their inference algorithm
a multi-alignment
asd : described
above. aligned
The key
c: A T C C
C A T C
is based on SSR rather than the lineageThesampling
moves
that wesequence
advocate
in this
paper.
state-of-the-art
for multiple
alignment
systems
based on an evolutionary model is Handel [2].
References
A
T
A C
A G C
C
A
G
[1] N. Galtier, N. Tourasse, and M. Gouy. A nonhyperthermophilic common ancestor to extant
life forms. Science, 283:220–221, 1999.
[10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood
alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991.
[2] I. Holmes and W. J. Bruno. Evolutionary HMM: a bayesian approach to multiple alignment.
Bioinformatics, 17:803–820, 2001.
[3] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2003.
[11] G. A Lunter, I. Miklós, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple
alignment on arbitrary phylogenetic trees. J. Comp. Biol., 10:869–889, 2003.
[4] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov
Chain Monte Carlo Method. Mol. Biol. Evol., 1997.
[5] B. Mau and M. A. Newton. Phylogenetic inference for binary data on dendrograms using
markov chain monte carlo. J. of Comp. and Graphical Statistics, 1997.
[6] S. Li, D. K. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte
Carlo. J. of the American Statistical Association, 2000.
[12] A. Bouchard-Côté, P. Liang, T. L. Griffiths, and D. Klein. A probabilistic approach to language
change. In EMNLP 07, 2007.
[13] P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a non-reversible Markov chain sampler.
Technical report, Cornell University, 1997.
[14] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J. of Mol. Biol., 48:443–453, 1970.
It is based on TKF91 and produces a multiple sequence alignment as described above. The key difference
a: C A T A C b:C A G
The BAliBASE multi-alignment dataset
is a standard
for this
type
of comparison
andmove
wasthat we advocate
with[13]
our approach
is that theirbenchmark
inference algorithm
is based
on SSR
rather than the AR
C A G
A T A C
inC this
paper.
the dataset used to assess the performance
of Handel
in its original paper [8]. While other approaches are
A
a:
8000
Table 1: XXXXXX—clustalw
C
d:
6000
80
C A G
C A T A C
G C C
C
C AA G
C AA TT AC CC
C A G
C A T A C
r:
1
1
1
1
1
Table 1: XXXXXX—clustalw
a:
G
a: A
4000
(b) Fig 1b
System
SSR (Handel)
SSR+AR
r:
300
gc2_cavpo
alc_mouse
1yuh
1mim
1nld
SSR 100 AR
d:
d:
Sub
Sub Sub
C
T
2000
Figure 1: XXXXX
evolution (InDel process)
C
0
- annotated protein data
- source: BAliBASE [16]
- loss: Sum-of-Pairs (SP)
Time
Column-Score
(CS)
System
SSR (Handel)
AR (this paper)
• A correct definition:
r:
C
2. Multi-alignment
2
Time
Deterministic functions of a random graphical model
evolutionary parameters
T
30000
3
3
• Not reversible
Problem I: random walk behavior
T
15000
4
- synthetic
2
- 4 characters types
- 124010 character tokens 1
0
- tree has 7 nodes
200
Max -dev
= 2 nodes
Maxheld-out
dev =0 MAX 100
all internal
Time
- evaluated on root reconstruction
- loss: Levenshtein distance
(a) Fig 1a
(SSR)
(High-level) graphical model:
AR
5
...
C
SSR
Error
• What are guide trees and ancestral reconstructions?
...ACGTTGCGTGAT...
• Non-contiguous lineages
a:
AR
5
1. Reconstruction
resample "thin vertical slices"
a
C
A
T
• Connected component of the anchor?
SSR
CS
C
Error
A
SP
b:
C
Error
• What is a multi-alignment?
T
C
C
C
A
G
C
[7] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: A likelihood-based
approach to distance-based phylogeny reconstruction. Mol. Biol. Evol., 17:189–197, 2000.
[8] D. G. Higgins and P. M Sharp. Clustal: a package for performing multiple sequence alignment
on a microcomputer. Gene, 73:237–244, 1988.
[9] J. L. Thorne, H. Kishino, and J. Felsenstein. Inching toward reality: an improved likelihood
model of sequence evolution. J. of Mol. Evol., 34:3–16, 1992.
Cylindric proposals: use a DP
[10] J. L. Thorne, H. Kishino, and J. Felsenstein. An evolutionary model for maximum likelihood
alignment of DNA sequences. J. of Mol. Evol., 33:114–124, 1991.
[11] G. A Lunter, I. Miklós, Y. S. Song, and J. Hein. An efficient algorithm for statistical multiple
[15] K. St. John, T. Warnow, B. M. E. Moret, and L. Vawter. Performance study of phylogenetic
methods: (unweighted) quartet methods and neighbor-joining. J. of Algorithms, 48:173–193,
2003.
[16] J. Thompson, F. Plewniak, and O. Poch. Balibase: A benchmark alignments database for the
evaluation of multiple sequence alignment programs. Bioinformatics, 15:87–88, 1999.
[17] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic
consistency-based multiple sequence alignment. Genome Research, 15:330–340, 2005.
Download