Learning Nonsingular Phylogenies and Hidden Markov

advertisement
Learning Nonsingular Phylogenies
and Hidden Markov Models
Elchanan Mossel
Sébastien Roch
Department of Statistics
University of California, Berkeley
Berkeley, CA 94720-3860
Department of Statistics
University of California, Berkeley
Berkeley, CA 94720-3860
mossel@stat.berkeley.edu
sroch@stat.berkeley.edu
ABSTRACT
In this paper, we study the problem of learning phylogenies and
hidden Markov models. We call a Markov model nonsingular if all
transition matrices have determinants bounded away from 0 (and
1). We highlight the role of the nonsingularity condition for the
learning problem. Learning hidden Markov models without the
nonsingularity condition is at least as hard as learning parity with
noise. On the other hand, we give a polynomial-time algorithm for
learning nonsingular phylogenies and hidden Markov models.
Categories and Subject Descriptors: F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity; I.2.6 [Artificial Intelligence]: Learning; J.3 [Computer Applications]: Life
and Medical Sciences—Biology and genetics.
General Terms: Algorithms, Theory.
Keywords: Hidden Markov models, evolutionary trees, phylogenetic reconstruction, PAC learning.
1.
INTRODUCTION
In this paper, we consider the problem of learning phylogenies.
Formally, phylogenies are Markov models on 3-regular trees. In
evolutionary biology, these stochastic models are used to study
the evolution of characters in related species. More precisely, the
leaves of the tree correspond to (known) extant species. Internal
nodes represent unknown ancestors, the root of the tree being the
most recent ancestor to all species in the tree. Following paths from
the root to the leaves, each bifurcation indicates a speciation event
whereby two new species are created from a parent one. Each node
is assigned the state of a character—e.g. a nucleotide {A, C, G, T} at
a particular site in the (aligned) genome of the species. This character evolves from parents to descendants in the tree according to
transition matrices on the edges—e.g. transitions can model mutations in the DNA sequence. One usually has a large number of
characters—e.g. an (aligned) sequence of nucleotides. All characters are assumed to evolve according to the same Markov model
independently—e.g. each site in a DNA sequence is assumed to
mutate independently from its neighbours according to the same
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
STOC’05, May 22-24, 2005, Baltimore, Maryland, USA.
Copyright 2005 ACM 1-58113-960-8/05/0005 ...$5.00.
mutation mechanism (this is of course a crude simplification, but
it has nevertheless given reasonable results in practice). The goal
of the biologist is typically (but not only) to infer the topology of
the tree from the characters at the leaves. Often, one also needs
to infer the properties of the evolution process on the edges (a.k.a.
the transition matrices) or the states of various characters at ancestor (unobserved) species. The focus of this paper is the efficient
inference of the transition matrices.
The reconstruction of phylogenies is one of the most important
tasks of molecular biology. In the systematics and statistics literature, three main approaches have been studied in depth: parsimony,
maximum likelihood and distance-based methods (see e.g. [12, 27]
for a detailed review and a thorough bibliography). Parsimony is
known to be inconsistent [15] (it may converge to the wrong tree
even if the number of characters tends to infinity) and NP-hard [17],
although it is used extensively in practice because of its good performance. Maximum likelihood is consistent [5] but a version of it
is known to be hard [2]. In constrast, distance-based methods can
be both consistent and efficient [10] (under assumptions, see below) but their performance in practice is somewhat unsatisfactory
(see e.g. [25]).
Many works have also been devoted to the reconstruction of phylogenies in the learning setting. In particular, in [3] the authors
obtain almost optimal upper and lower bounds for learning a phylogenetic “star” tree on 2-state space given its topology (i.e. the
underlying graph structure), where the mutation matrices are symmetric. In [7], a polynomial-time algorithm is obtained for general
Markov models on 2-state spaces. The authors of [7] also make the
claim that their technique extends to the general k-state model, but
then restrict themselves to k = 2.
The framework of Markov models on trees have several special
cases that are of independent interest. The case of Product distributions is discussed in [3]. The case of mixture of product distributions is discussed in [14]. Arguably, the most interesting case is
that of learning hidden Markov models (HMMs). HMMs play a
crucial role in many areas from speech recognition to biology, see
e.g. [24, 8].
In [1] it is shown that finding the “optimal” HMM for an arbitrary distribution is hard unless RP = N P . See also [21] where
hardness of approximation results are obtained for problems such
as comparing two hidden Markov models. Most relevant to our setting is the conjecture made in [20] that learning parity with noise is
hard. It is easy to see that the problem of learning parity with noise
may be encoded as learning an HMM over 4 states, see subsection
1.3.
There is an interesting discrepancy between the two viewpoints
taken in works concerning learning phylogenies and works con-
cerning learning hidden Markov models. The results in phylogeny
are mostly positive – they give polynomial-time algorithms for learning. On the other hand, the results concerning HMMs are mostly
negative.
This paper tries to resolve the discrepancy between the two points
of view by pointing to the source of hardness in the learning problem. Roughly speaking, we claim that the source of hardness for
learning phylogenies and hidden Markov models are transtion matrices P such that det P is 0 but rankP > 1. Note that in the
case k = 2 there are no such matrices and indeed the problem is
not hard [7]. We note furthermore that, indeed, in the problem of
learning partiy with noise all of the determinants are 0 and all the
ranks are greater than 1.
The main technical contribution of this paper is to show that
the learning problem is feasible once all the matrices have 1 −
1/poly(n) > | det P | > β for some β > 0. We thus present
a proper PAC learning algorithm for this case. We believe that
the the requirement that | det P | < 1 − poly(n) is an artifact
of our proofs. Indeed, in the case of hidden Markov models we
prove that the model can be learned under the weaker condition
that | det P | > 1/poly(n). Assuming that learning parity with
noise is indeed hard, this is an optimal result, see subsection 1.3
The learning algorithms we present are based on a combination
of techniques from phylogeny, statistics, combinatorics and probability. We believe that these algorithms may be also extended to
cases where | det P | is close to 1 and futhermore to cases where if
| det P | is small, then the matrix P is close to a rank 1 matrix, thus
recovering the results of [3, 7].
Interestingly, to prove our result we use and extend several previous results from combinatorial phylogeny and statistics. The
topology of the tree is learned via variants of combinatorial results
proved in phylogeny [10]. Thus, the main technical challenge is to
learn the mutation matrices along the edges. For this we follow and
extend the approach developed in statistics by Chang [5]. Chang’s
results allow the recovery of the mutation matrices from infinite
number of samples. The reconstruction of the mutation matrices
from a polynomial number of samples requires a delicate error
analysis along with various combinatorial and algorithmic ideas.
The algorithm is sketched in Section 2 and the error analysis is
performed in Section 3.
1.1 Definitions and results
We let T3 (n) denote the set of all trees on n labeled leaves where
all internal degrees are exactly 3. Note that if T = (V, E) ∈ T3 (n)
then |V| = 2n − 2 We will sometimes omit n from the notation.
Below we will always assume that the leaf set is labeled by the set
[n]. We also denote the leaf set by L. Two trees T1 , T2 are considered identical if there is a graph isomorphism between them that is
the identity map on the set of leaves [n]. We define a caterpillar
to be a tree on n nodes with the following property: the subtree induced by the internal nodes is a path (and all internal vertices have
degree at least 3). See Figure 1 for an example. We let TC3 (n)
denote the set of all caterpillars on n labeled leaves.
In a Markov model T on a (undirected) tree T = (V, E) rooted
at r, each vertex iteratively chooses its state given the one of its
parent by an application of a Markov transition rule. Consider the
orientation of E where all edges are directed away from the root.
We note this set of directed edges Er . Then the probability distribution on the tree is the probability distribution on C V given by
Y
uv
π T (s) = πrT (s(r))
Ps(u)s(v)
,
(1)
(u,v)∈Er
where s ∈ C V , C is a finite state space, P uv is the transition matrix
for edge (u, v) ∈ Er and πrT is the distribution at the root. We
T
let k = |C|. We write πW
for the marginals of π T on the set W.
T
Since the set of leaves is labeled by [n], the marginal π[n]
is the
marginal on the set of leaves. We will often remove the superscript
T . Furthermore, for two vertices u, v ∈ V we let Pijuv = P[s(v) =
j | s(u) = i]. We will be mostly interested in nonsingular Markov
models.
D EFINITION 1. We say that a Markov model on a tree T =
(V, E) is nonsingular ((β, β 0 , σ)-nonsingular) if
I. For all e ∈ Er it holds that 1 > | det P e | > 0 (1 − β 0 >
| det P e | > β) and
II. For all v ∈ V it holds that πv (i) > 0 (πv (i) > σ) for all i in
C.
We will call the model nonsingular ((β, σ)-nonsingular) at 0 if instead of I. we have the weaker condition | det P e | > 0 (| det P e | >
β).
It is well known that if the model is nonsingular, then for each
w ∈ V, one can write
Y
uv
Ps(u)s(v)
.
(2)
π T (s) = πvT (s(w))
(u,v)∈Ew
(where now all edges (u, v) are oriented away from w). In other
words, the tree may be rooted arbitrarily. Indeed, in the learning
algorithms discussed below, we will root the tree arbitrarily. We
will actually refer to E as the set of directed edges formed by taking
the two orientations of all (undirected) edges in the tree. It is easy
to show that (β, σ)-nonsingularity as stated above also implies that
property I. holds for all (u, v) ∈ E with appropriate values of β, σ.
Note moreover, that if | det P uv | > 0 for all edges (u, v) and for
all v ∈ V the distribution of s(v) is supported on at most |C| − 1
elements then one can redefine the model by allowing only |C| − 1
values of s(v) at each node and deleting the corresponding row and
columns from the transition matrices P e . Thus condition II. is very
natural given condition I.
Given a collection M of mutation matrices P , we let T3 (n)⊗M
denote all phylogenetic trees of the form (1) where T ∈ T3 and
P e ∈ M for all e. We similarly define TC3 ⊗ M.
We use the PAC learning framework introduced by [30], in its
variant proposed by [20]. Let ε > 0 denote an approximation parameter, δ > 0, a confidence parameter, M, a collection of matrices and T, a collection of trees. Then, we say that an algorithm A
PAC-learns T ⊗ M if for all T ∈ T ⊗ M, given access to samT
ples from the measure π[n]
, A outputs a phylogenetic tree T 0 such
0
T
T
that the total variation distance between π[n]
and π[n]
is smaller
than ε with probability at least 1 − δ and the running time of A is
poly(n, 1/δ, 1/ε).
In our main result we prove the following.
T HEOREM 1. For every constant β, κβ , κπ > 0 and every finite set C, the collection of (β, n−κβ , n−κπ )-nonsingular Markov
phylogenetic models is PAC-learnable. More formally, let C be a finite set, β, κβ , κπ > 0 and M denote the collection of all |C| × |C|
transition matrices P where 1 − n−κβ > | det P | > β. Then
there exists a PAC-learning algorithm for T3 ⊗ M whose running
time is poly(n, k, 1/ε, 1/δ), provided the node distributions satisfy
πv (i) > n−κπ for all v in V and i in C. If the topology is known,
we can relax the assumption on determinants to 1 ≥ | det P | > β.
For hidden Markov models we can prove more.
T HEOREM 2. Let φd , φ0d , κπ > 0 be constants. Let C be a finite
set and M denote the collection of |C| × |C| transition matrices P
0
where 1 − n−φd > | det P | > n−φd . Then there exists a PAClearning algorithm for TC3 ⊗ M, provided the node distributions
satisfy πv (i) > n−κπ for all v in V and i in C. If the topology
is known, we can relax the assumption on determinants to 1 ≥
| det P | > n−φd . The running time and sample complexity of the
algortihm is poly(n, k, 1/ε, 1/δ).
Note however that, contrary to the phylogenetic setting, the assumptions in the previous theorem are never met in typical applications of HMMs. The hardness result below is probably more
relevant in that case.
1.2 Inferring the topology
(x1 , 0)
(x2 , S2 )
We will also need a stronger result that applies only for hidden
Markov models. This result essentially follows from [10, 11].
T HEOREM 4. Let ζ, τ > 0 and suppose that M consists of all
matrices P satisfying n−ζ < | det P | < 1 − n−τ . Then for all
θ > 0, and all T ∈ TC3 ⊗ M one can recover from nO(ζ+τ +θ)
samples the topology T with probability 1 − n2−θ .
1.3 Hardness of learning singular models
We now briefly explain why hardness of learning “parity with
noise” implies that learning singular hidden Markov models is hard.
Kearns’ work [19] on the statistical query model leads to the following conjecture.
(xn , Sn ) (0, Sn ⊕ N )
···
(x2 , 0)
(x3 , 0)
(xn , 0)
Figure 1: L
Hidden Markov model for noisy parity. The model computes N ⊕ i∈T xi , where the xi ’s are uniform over {0, 1}, T is a
subset of {1, . . . , n}, and N is a small random noise. The Si ’s are the
partial sums over variables included in T . The observed nodes are in
light gray. In the hardness proof, the passage from learning a distribution (HMM) to learning a function (parity) requires the notion of an
evaluator, i.e. the possibility of computing efficiently the probability of
a particular state. Here this can be done by dynamic programming.
We let the topology of T denote the underlying tree T = (V, E).
The task at hand can be divided into two natural subproblems. First,
the topology of T needs to be recovered with high probability. Second, the transition matrices have to be estimated. Reconstructing
the topology has been a major task in phylogeny. It follows from
[10, 11] that the topology can be recovered with high probability
using a polynomial number of samples. Here is one formulation
from [22].
T HEOREM 3. Let β, κβ > 0 and suppose that M consists of all
matrices P satisfying β < | det P | < 1 − n−κβ . For all κT > 0,
the topology of T ∈ T3 ⊗ M can be learned in polynomial time
using κT nO(1/β)+κβ samples with probability 1 − n2−κT .
(x3 , S3 )
a
P ar P ra
r
P rb
P br
b
P rc
P cr
c
Figure 2: Star tree with three leaves.
2. OVERVIEW OF THE ALGORITHM
2.1 Chang’s spectral technique
One of the main ingredients of the algorithm is a nice result due
to Chang [5] that we rederive here for completeness. Let T be a 4node (star) tree with a root r and 3 leaves a, b, c (see Figure 2). Let
P uv be the transition matrix between vertices u and v, i.e. Pijuv =
P[s(v) = j | s(u) = i] for all i, j ∈ C. Fix γ ∈ C. Then by the
Markov property, for all i, j ∈ C,
X ar rc rb
Pih Phγ Phj ,
P[s(c) = γ, s(b) = j | s(a) = i] =
h∈S
C ONJECTURE 1 (N OISY PARITY A SSUMPTION [19]). There
is a 0 < α < 1/2 such that there is no efficient algorithm for learning parity under uniform distribution in the PAC framework with
classification noise α.
In [20], this is used to show that learning probabilistic finite automata with an evaluator is hard. It is easy to see that the same
construction works with the probabilistic finite automata replaced
by an equivalent hidden Markov model (HMM) with 4 states (this
is a special case of our evolutionary tree model when the tree is a
caterpillar). The proof, which is briefly sketched in Figure 1, is left
to the reader. We remark that all matrices in the construction have
determinant 0 and rank 2. Note moreover that from a standard coupling argument it follows that if for all edges (u, v) we replace the
matrix P uv by the matrix (1 − n−τ )P uv + n−τ I, then the model
given in Figure 1 and its variant induces undistinguishable distributions on k samples if k ≤ o(nτ −1 ). This shows that assuming that
learning parity with noise is hard, Theorem 2 is tight up to the constant in the power of n (and the upper bound on the determinant,
which we think can be avoided). Strictly speaking, this hardness
result only gives evidence that 0 determinants are an obstacle to
learning Markov models on trees. Other factors could contribute to
the hardness of learning parity with noise.
rc
or in matrix form P ab,γ = P ar diag(P·γ
) P rb , where the matrix
P ab,γ is defined by
Pijab,γ = P[s(c) = γ, s(b) = j | s(a) = i],
for all i, j ∈ C. Then, noting that (P ab )−1 = (P rb )−1 (P ar )−1 ,
we have
rc
P ab,γ (P ab )−1 = P ar diag(P·γ
) (P ar )−1 ,
(?)
assuming the transition matrices are invertible. Equation (?) is an
eigenvalue decomposition where the l.h.s. involves only the distribution at the leaves. Therefore, given the distribution at the leaves,
we can recover from (?) the columns of P ar (up to scaling) provided the eigenvalues are distinct. Note that the above reasoning
applies when the edges (r, a), (r, b), (r, c) are replaced by paths.
Therefore, loosely speaking, in order to recover an edge (w, w 0 ),
one can use (?) on star subtrees with w and w 0 as roots to obtain
0
0
0
P aw and P aw , and then compute P ww = (P aw )−1 P aw . In [5],
under further assumptions on the structure of the transition matrices, the above scheme is used to prove the identifiability of the
full model, that is that the output distributions on triples of leaves
uniquely determine the transition matrices. In this paper, we show
that the transition matrices can actually be approximately recovered
using (?) with a polynomial number of samples.
There are many challenges in extending Chang’s identifiability
result to our efficient reconstruction claim. First, as noted above,
equation (?) uncovers only the columns of P ar . The leaves acutally
give no information on the labelings of the internal nodes. To resolve this issue, Chang assumes that the transition matrices come in
a canonical form that allows to reconstruct them once the columns
are known. For instance, if in each row, the largest entry is always
the diagonal one, this can obviously be performed. This assumption is a strong and unnatural restriction on the model we wish to
learn and therefore we seek to avoid it. The point is that relabeling all internal nodes does not affect the output distribution, and
therefore the internal labelings can be made arbitrarily (in the PAC
setting). The issue that arises is that those arbitrary labelings have
to be made consistently over all edges sharing a node. Another major issue is that the leaf distributions are known only approximately
through sampling. This requires a delicate error analysis and many
tricks which are detailed in Section 3. The two previous problems
are actually competing. Indeed, one way to solve the consistency
issue is to fix a reference leaf ω and do all computations with respect to the reference leaf, that is choose a = ω in every spectral
decomposition. However, this will necessitate the use of long paths
on which the error builds up uncontrollably. Our solution is to partition the tree in smaller subtrees, reconstruct consistently the subtrees using one of their leaves as a reference, and patch up the subtrees by fixing the connecting edges properly afterwards. We refer
to the connecting edges as separators. The algorithm F ULL R ECON
thus consists of two phases as follows. It assumes the knowledge of
distributions on triples of leaves. These are known approximately
through sampling. Exactly how this is done is detailed in Section
3.
A detailed version of the algorithm, including two subroutines,
appears in Figure 3. The two subroutines are described next. The
correctness of the algorithm uses the error analysis and is therefore
left for Section 3.
2.2 Subtree reconstruction and patching
We need the following notations to describe the subroutines. If
e = (u, v), let d1 (e) (resp. d2 (e)) be the length of the shortest path
(in number of edges) from u (resp. v) to a leaf in L not using edge
e. Then the depth of T is
∆ = max {max{d1 (e), d2 (e)}} .
e∈T
It is easy to argue that ∆ ≤ log n (see Section 3). Also, for a
set of vertices W and edges S, denote N (W, S) the set of nodes
not in W that share an edge in E\S with a node in W (“outside
neighbors” of W “without using edges in S”). Let Ba∆ be the subset
of nodes in V at distance at most ∆ from leaf a.
Subroutine L EAF R ECON. The subtree reconstruction phase is
performed by the algorithm L EAF R ECON depicted in Figure 4. The
purpose of L EAF R ECON is to partition the tree into subtrees each
of which has the property that all its nodes are at distance at most
∆ from one of its leaves (same leaf for all nodes in the subtree),
that we refer to as the reference leaf of the subtree. The correctness
of the algorithm, proved in Section 3, thus establishes the existence
of such a partition. This partition serves our purposes because it
allows 1) to reconstruct mutation matrices in a consistent way (in
each subtree) using reference leaves, and 2) to control the building
up of the error by using short paths to the reference leaf. The matrix reconstruction is performed simultaneously by L EAF R ECON,
as the partition is built. At the call of L EAF R ECON, we consider
the subgraph T 0 of T where edges previously labeled as separators
a2
a1
Ta 6
Ta 1
a6
Ta 2
Ta0
Ta 5
T a3
Ta 4
a3
a0
a5
a4
Figure 4: Schematic representation of the execution of L EAF R ECON.
The only edges shown are separators.
have been removed. We are given a reference leaf a and restrict
ourselves further to the (connected) subtree Ta of T 0 consisting of
nodes at distance at most ∆ from a. Moving away from a, we
recover edge by edge the mutation matrices in Ta by Chang’s spectral decomposition. At this point, it is crucial 1) that we use the
transition matrices previously computed to ensure consistency in
the labeling of internal nodes, and 2) in order to control error, that
we choose the leaves b and c (in the notations of the previous subsection) to be at distance at most ∆ + 1 from the edge currently
reconstructed (which is always possible by definition of ∆). Note
that the paths from the current node to b and c need not be in Ta .
Once Ta is reconstructed, edges on the “outside boundary” of Ta
(edges in T 0 with exactly one endpoint in Ta ) are added to the list
of separators, each with a new reference leaf taken from the unexplored part of the tree (at distance at most ∆). The algorithm
L EAF R ECON is then run on those new reference leaves, and so on
until the entire tree is recovered (see Figure 4). The algorithm is
given in Figure 3. Some steps are detailed in Section 3. We denote
estimates with hats, e.g. the estimate of P ar is P̂ ar .
Subroutine S EP R ECON. For its part, the algorithm S EP R ECON
consists in taking a separator edge (w, w 0 ) along with the leaf a0
from which it was found and the new reference leaf a it led to, and
computing
0
0
P̂ ww := (P̂ aw )−1 P̂ aa (P̂ w
0 0
a
)−1 ,
0 0
where the matrices P̂ aw , P̂ w a have been computed in the subtree
reconstruction phase. It will be important in the error analysis that
the two leaves a, a0 are at distance at most ∆ from w, w 0 respec0
tively. We then use Bayes’ rule to compute P̂ w w . See Figure 3.
3. ERROR ANALYSIS
As pointed out in the previous section, the distribution on the
leaves is known only approximately through sampling. The purpose of this section is to account for the error introduced by this
approximation. We are also led to make a number of modifications
to the algorithm described in the previous section. Those modifications are described where needed in the course of the analysis.
For W a subset of vertices of T , denote πW the joint distribution
on W and π̂W our estimate of πW obtained by using the estimated
mutation matrices. For a vertex u, we let πu (·) = P[s(u) = ·]. We
denote by the all-one vector (the size is usually clear from the
context). For any vector ρ, we let diag(ρ) be the diagonal
P matrix
with diagonal ρ. Recall that for a P
vector x, kxk1 = i |xi |, and
for a matrix X, kXk1 = maxj i |xij |. Recall that L stands
for the set of leaves. We assume throughout that the model is
Algorithm F ULL R ECON
Input: tree topology; independent samples at leaves;
Output: estimated mutation matrices {P̂ e }e∈E ; estimated node distributions {π̂u }u∈V ;
• Phase 0: Initialization
– Let ω be any leaf in L, and initialze the set of uncovered leaves U := ∅; Set t := 0 and a 0 := ω;
– Initialize the separator set S to ∅;
• Phase 1: Subtree Reconstruction
– While U 6= L,
∗ Perform L EAF R ECON(at ); set t := t + 1;
• Phase 2: Patching Up
– Let T be the last value taken by t above;
– For t := 1 to T , perform S EP R ECON(at , (wt , wt0 ), a0t ) [the variables at , a0t , wt , wt0 are computed by L EAF R ECON ].
Algorithm L EAF R ECON
Input: tree topology; reference leaf a; uncovered leaves U ; separator set S; independent samples at the leaves;
Output: estimated mutation matrices P̂ e and estimated node distributions π̂u on edges and vertices in subtree associated to a;
• Set W := {a}, U := U ∪ {a};
• While N (W, S) ∩ Ba∆ 6= ∅,
– Pick any vertext r ∈ N (W, S) ∩ Ba∆ ;
– Let (r0 , r) be the edge with endpoint r in the path from a to r
– If r0 = a, set P̂ ar0 to be the identity, otherwise P̂ ar0 is known from previous computations;
– If r is not a leaf,
∗
∗
∗
∗
Let (r, r1 ) and (r, r2 ) be the other two edges out of r;
Find the closest leaf b (resp. c) connected to r1 (resp. r2 ) by a path not using (r0 , r);
Draw Gaussian vector U ; Perform the eigenvalue decomposition (?) in its modified version from Section 3;
Build P̃ ar out of an arbitrary ordering of the columns recovered above;
∗ Compute ρ = (P̃ ar )−1 , and set P̂ ar := P̃ ar diag(ρ);
– Otherwise if r is a leaf:
∗ Use estimate from leaves for P̂ ar ;
∗ Set U := U ∪ {r};
– Compute P̂ r0 r := (P̂ ar0 )−1 P̂ ar ;
– Compute π̂r = π̂a P̂ ar ;
– Obtain P̂ rr0 from P̂ r0 r , π̂r and π̂r0 using Bayes rule;
– Set W := W ∪ {r};
– If r is not a leaf and the distance between a and r is exactly ∆, for ι = 1, 2,
∗ If {r, rι } is not in S, set t0 := |S| + 1, at0 := b (if ι = 1, and c o.w.), a0t0 := a, (wt0 , wt0 0 ) := (rι , r), and add
{r, rι } (as an undirected edge) to S.
Algorithm S EP R ECON
Input: leaves a, a0 and separator edge (w, w 0 );
0
0
Output: estimated mutation matrices P̂ ww , P̂ ww ;
0
• Estimate P̂ aa ;
0
0
• Compute P̂ ww := (P̂ aw )−1 P̂ aa (P̂ w
• Obtain
0
P̂ w w
from
0
P̂ ww ,
0 a0 −1
)
;
π̂w and π̂w0 using Bayes rule.
Figure 3: Algorithms F ULL R ECON, L EAF R ECON and S EP R ECON
(β, n−κπ )-nonsingular, for β, κπ > 0 constant, and that the topology is known. We prove the following theorem.
T HEOREM 5. For all ε, δ > 0 and n large enough, our reconstruction algorithm produces a Markov model satisfying
kπ̂L − πL k1 ≤ ε,
with probability at least 1 − δ. The running time of the algorithm
is polynomial in n, 1/ε, 1/δ.
From here on, all κ’s are meant to be exponents of n (not depending themselves on n). We use the expression with high probability
(w.h.p.) to mean with probability at least 1 − 1/poly(n). Likewise
we say negligible to mean at most 1/poly(n). In both definitions
it is implied that poly(n) is O(nK ) for a constant K that can be
0
made as large as one wants if the number of samples is O(nK )
0
with K large enough. Most proofs have been omitted from this abstract and can be found in [23]. In all proofs given here, we assume
that the bounds proved in previous lemmas apply, even though they
actually occur only with high probability. We also assume in all
lemmas that the various κ’s from previous lemmas have been chosen large enough for the results not to be vacuous. This is always
possible if the number of samples is taken (asymptotically) large
enough. Standard linear algebra results used throughout the analysis can be found e.g. in [18].
3.1 Approximate spectral argument
In this subsection, we address several issues arising from the application of Chang’s spectral technique to an approximate distribution on the leaves. Our discussion is summarized in Proposition 1.
We use the notations of Section 2.1.
P ROPOSITION 1. The estimate P̂ ar recovered from (?) using
poly(n) samples is such that kP̂ ar − P ar k1 is negligible w.h.p.
See Lemma 8 for a precise statement.
Determinants on Paths. The estimation error depends on the determinant of the transition matrices involved. Since we use Chang’s
spectral technique where a → r, r → b, and r → c are paths rather
than edges, we need a lower bound on transition matrices over
paths. This is where the use of short paths is important. Multiplicativity of determinants gives immediately that all determinants of
transition matrices on paths of length O(∆) are at least 1/poly(n).
L EMMA 1 (B OUND ON D EPTH ). The depth of any full binary
tree is bounded above by log n.
L EMMA 2 (D ETERMINANTS ON PATHS ). Let a, b be vertices
at distance at most 2∆ + 1 from each other. Then, under the
(β, n−κπ )-nonsingularity assumption, the transition matrix between
a and b satisfies det[P ab ] ≥ 1/nκd , for some constant κd depending only on β.
Error on Leaf Distributions. The algorithm estimates leaf distributions through sampling. We need to bound the error introduced
by sampling. Let a, b, c be leaves at distance at most 2∆ + 1 from
each other and consider the eigenvalue decomposition (?). We estimate P ab by taking poly(n) samples and computing
P̂ijab =
a,b
Ni,j
,
Nia
Nia
for i, j ∈ C where
is the number of occurences of s(a) = i and
a,b
Ni,j
is the number of occurences of s(a) = i, s(b) = j. Likewise
for Pijab,γ , we use poly(n) samples and compute the estimate
P̂ijab,γ =
a,b,c
Ni,j,γ
,
Nia
a,b,c
where Ni,j,γ
is the number of occurences of s(a) = i, s(b) =
j, s(c) = γ. We also bound the error on the 1-leaf distributions;
this will be used in the next subsection. We use poly(n) samples
to estimate πa using empirical frequencies. Standard concentration
inequalities give that kP ab − P̂ ab k1 , kP ab,γ − P̂ ab,γ k1 , and kπa −
π̂a k1 are negligible w.h.p.
L EMMA 3
(1)
(S AMPLING E RROR ). Using nκs = k3 (nκs
+
ab
and
(2)
κs
n
) samples, the estimation errors on the matrices P
P ab,γ satisfy
kP̂ ab − P ab k1 ≤
k
(1)
n κe
kP̂ ab,γ − P ab,γ k1 ≤
,
(2)
for all γ ∈ C with probability at least 1 − 1/nκp
for ι = 1, 2,
(ι)
(2)
k
(2)
n κe
,
provided that
(ι)
(2)
nκs ≥ max{8n2κπ log(8k3 nκp ), 4n2κe nκπ log(8k3 nκp )},
for all n large enough.
L EMMA 4
(S AMPLING E RROR AT L EAVES ). We have
kπ̂a − πa k1
≤
1
0 ,
n κe
(4)
with probability at least 1 − 1/nκp , provided
0
0
(4)
nκs ≥ 2k2 (nκe )2 log(knκp ),
for all n large enough.
Separation of Exact Eigenvalues. In Section 2, it was noted that
the eigenvalues in (?) need to be distinct to guarantee that all eigenspaces have dimension 1. This is clearly necessary to recover the
columns of the transition matrix P ar . When taking into account the
error introduced by sampling, we actually need more. From standard results on eigenvector sensitivity it follows that we want the
eigenvalues to be well separated. A polynomially small separation
will be enough for our purposes. We accomplish this by multiplying the matrix P rc in (?) by a random Gaussian vector. One
can think of this as adding an extra edge (c, d) and using leaves
a, b, d for the reconstruction, except that we do not need the transition matrix P cd to be stochastic and only one line of it suffices.
More precisely, let U be a vector whose k entries are independent
Gaussians with mean 0 and variance 1. We solve the eigenvalue
rc
problem (?) with P·γ
replaced with Y = P rc U . This can be done
by replacing P ab,γ on the l.h.s. of (?) with the matrix P ab,U whose
(i, j)th entry is
X
Uγ P[s(c) = γ, s(b) = j | s(a) = i].
γ∈C
This is justified by the Markov property. A different (independent)
vector U is used for every triple of leaves considered by the algorithm. One can check that w.h.p. the entries of Y are 1/poly(n)separated. Chang [5] actually uses a similar idea.
L EMMA 5 (E IGENVALUE S EPARATION ). Let κλ be a constant
(1)
such that κd < κλ . Then, with probability at least 1 − 1/nκp no
κλ
two entries of Y are at distance less than 1/n provided
1
n
(1)
κp
for all n large enough.
2k5/2 nκd
≥ √
,
2πnκλ
P ROOF. By Lemma 2, | det[P rc ]| > 1/nκd . Take any two lines
i, j of P rc . The matrix, say A, whose entries are the same as P rc
except that row i is replaced by Pi·rc −Pj·rc has the same determinant
as P rc . Moreover,
P Q
Q
| det[A]| ≤ σ kh=1 |Ahσ(h) | ≤ kh=1 kAh· k1 ,
where the sum is over all permutations of {1, . . . , k}. Therefore
kPi·rc − Pj·rc k1 ≥ 1/nκd .
By the Cauchy-Schwarz inequality,
√
kPi·rc − Pj·rc k2 ≥ 1/( knκd ).
Therefore, (Pi·rc − Pj·rc )U is Gaussian with mean 0 and variance at
most 1/(kn2κd ). A simple bound on the normal distribution gives
√ κ
kn d
1
1
P |(Pi·rc − Pj·rc )U | < κ
.
≤ 2 κ √
n λ
n λ
2π
There are O(k 2 ) pairs of lines to which we apply the previous inequality. The union bound gives the result.
Error on Estimated L.H.S. On the l.h.s.Pof (?) (modified version),
we use the following estimate P̂ ab,U = γ∈C Uγ P̂ ab,γ . It follows
from linear algebraic inequalities and the previous bounds that the
error on the l.h.s. of (?) is negligible w.h.p.
L EMMA 6 (E RROR ON L.H.S.). Assume n
Then we have
kP̂ ab,U (P̂ ab )−1 − P ab,U (P ab )−1 k1 ≤
(1)
κe
2
≥ 2k n
κd
which is smaller than 1/(3nκλ ). Given that the separation between the entries of Y is at least 1/nκλ we deduce that there is a
unique Yi at distance at most 1/(3nκλ ) from Ŷj (note that j might
not be equal to i since the ordering might differ in both vectors).
This is true for all j ∈ C. This implies that all Ŷj ’s are distinct
and therefore they are real and P̂ ab,U (P̂ ab )−1 is diagonalizable as
claimed.
Error on Estimated Eigenvectors. From (?), we recover k eigenvectors that are defined up to scaling. Assume that for all i ∈ C, Ŷi
is the estimated eigenvalue corresponding to Yi (see above). Denote by X̂ i , X i their respective eigenvectors. We denote X̂ (resp.
X) the matrix formed with the X̂ i ’s (resp. X i ’s) as columns. Say
we choose the estimated eigevectors such that kX̂ i k1 = 1. This
is not exactly what we are after because we need the rows to sum
to 1 (not the columns). To fix this, we then compute η = X̂ −1 .
This can be done because the columns of X̂ form a basis. Then we
define X̃ i = ηi X̂ i for all i with the corresponding matrix X̃. Our
final estimate P̂ ar = X̃ is a rescaled version of X̂ with row sums
1. There is still the possibility that some entries of X̃ are negative.
But this is not an issue at this point. We will make sure later that
(one-step) mutation matrices are stochastic. It can be proven that
kX̃ − Xk1 is negligible w.h.p.
L EMMA 8
.
1
kX̃ − Xk1 ≤
(3)
n κe
≥ 2k5 n2κd
p
2 log(nκg )
1
(1)
n κe
+
1
(2)
n κe
if
+
1
1
(1)
(2)
n κe n κe
for all n large enough, with probability at least 1 − 1/n
we must have
k
1
p
,
≥
(3)
κg
κp
(n
π log(nκg ))
n
(3)
κp
, where
3 κd
2k n
1
1
1
+
k
+ (5) ,
≥
(5)
(5)
n κe
n κe
n κe
n κe
Separation of Estimated Eigenvalues. We need to make sure that
the estimated l.h.s. of (?) is diagonalizable. By bounding the variation of the eigenvalues and relying on the gap between the exact
eigenvalues, we can ensure that the eigenvalues remain distinct and
therefore P̂ ab,U (P̂ ab )−1 is diagonalizable.
L EMMA 7 (S ENSITIVITY OF E IGENVALUES ). Under the as(3)
sumption that nκe ≥ 3k2 nκd nκλ , P̂ ab,U (P̂ ab )−1 is diagonalizable and all its eigenvalues are real and distinct. In particular, all
eigenspaces have dimension 1.
P ROOF. Using a standard formula for the inverse, we have
=
1
|(adj[P ar ])ij | ≤ nκd ,
| det[P ar ]|
(3)
where we have used the nonsingularity assumption, and the fact
that adj[P ar ]ij is the determinant of a substochastic matrix. Then
k(P ar )−1 k1 ≤ knκd . A standard theorem on eigenvalue sensitivity [18] asserts that if Ŷj is an eigenvalue of P̂ ab,U (P̂ ab )−1 , there
is an eigenvalue Yi of P ab,U (P ab )−1 such that |Ŷj − Yi | is less
than
kP ar k1 k(P ar )−1 k1 kP̂ ab,U (P̂ ab )−1 −P ab,U (P ab )−1 k1 ≤
with
1
n
(5)
κe
≥
k2
n
(4)
κe
− k2
,
1
n
(4)
κe
≥
2k3 nκλ n2κd
(3)
n κe
,
for all n large enough.
for all n large enough.
|(P ar )−1
ij |
1
,
n κe
(3)
n κe
provided
1
(S ENSITIVITY OF E IGENVECTORS ). We have
k 2 n κd
(3)
n κe
P ROOF. We want to bound the norm of X̃ i − X i . We first argue
about the components of X̂ i − X i in the directions X j , j 6= i.
We follow a standard proof that can be found for instance in [16].
We need a more precise result than the one stated in the previous
reference, and so give the complete proof here.
Because the X i ’s form a basis, we can write
X
X̂ i − X i =
ρj X j ,
j∈C
for some values of ρj ’s. Denote A = P ab,U (P ab )−1 , ∆i = Ŷi −Yi
and E = P̂ ab,U (P̂ ab )−1 − P ab,U (P ab )−1 . Then
(A + E)X̂ i
=
Ŷi X̂ i ,
which, using AX i = Yi X i , implies
X
X
Yj ρj X j + E X̂ i = Yi
ρj X j + ∆i X̂ i .
j∈C
j∈C
For all j ∈ C, let Z j be the left eigenvector corresponding to X j .
0
It is proved in [16] that (X j )T Z j = 0 for all j 6= j 0 . Fix h ∈ C.
Multiplying both sides of the previous display by Z h and rearranging gives
ρh
=
(Z h )T (E X̂ i ) + ∆i (Z h )T X̂ i
.
(Yh − Yi )(Z h )T X h
Eigenvectors are determined up to multiple constants. Here we
make the X i be equal to the columns of P ar and the Z i ’s equal
to the columns of ((P ar )T )−1 . Recall that the X̂ i ’s were chosen
such that kX̂ i k1 = 1. Then using standard matrix norm inequalities, Cauchy-Schwarz inequality, and Lemmas 5, 6, and 7, we get
(Z h )T (E X̂ i ) + ∆ (Z h )T X̂ i i
|ρh | = (Yh − Yi )(Z h )T X h
(3)
kZ h k2 kEk2 kX̂ i k2 + (k2 nκd /nκe )kZ h k2 kX̂ i k2
1/nκλ
√
(3)
≤ nκλ kZ h k1 kX̂ i k1 [ kkEk1 + k2 nκd /nκe ]
≤
3
≤
2k n
κλ
n
2κd
≤
(3)
n κe
1
(4)
n κe
,
(5)
(5)
kĒk1 ≤ 1/nκe . Assuming that knκd /nκe
kX̄
−1
j6=i
2
k|1 + ρi | + k /n
(4)
κe
.
(4)
Assuming that nκe is large enough, we get
k1
≤ 2 kX
−1
≤ 1/2, we get
k1 ≤ 2knκd .
Therefore,
kη̄ − k1 ≤
2k3 nκd
(5)
n κe
.
This finally gives the bound
kX̃ − Xk1
≤
≤
≤
where we have used the bound kZ h k1 ≤ knκd which follows from
(3) in Lemma 7 and the choice of Z h . The bound on |∆i | comes
from the proof of Lemma 7 (after renumbering the components of
Ŷ to match those of Y ). Plugging in the expansion of X̂ i gives
X
1 = kX̂ i k1 ≤ |1 + ρi | kX i k1 +
|ρj | kX j k1
≤
As we have seen before, kX −1 k1 ≤ knκd and by the bound above,
≤
≤
kX̃ − X̄k1 + kX̄ − Xk1
1
kX̄diag(η̄) − X̄k1 + (5)
n κe
1
kX̄k1 kη̄ − k1 + (5)
n κe
1
2k3 nκd
+ (5)
[kX̄ − Xk1 + kXk1 ]
(5)
κ
κ
e
n
n e
3 κd
2k n
1
1
1
+k
+ (5) ≤ κe .
(5)
(5)
n
n κe
n κe
n κe
There is one last issue which is that X is the same as P ar up
to permutation on the states of r. But since relabeling internal
nodes does not affect the output distribution, we assume w.l.o.g.
that P ar = X. We make sure in the next subsection that this relabeling is performed only once for each internal node.
(4)
|1 + ρi | ≥
1 − k2 /nκe
k
3.2 Bounding error propagation
> 0.
Defining X̄ i = X̂ i /(1 + ρi ) and plugging again into the expansion
of X̂i , we get
X ρj
1
k2
kX̄ i − X i k1 = Xj ,
≤
≤
(4)
(5)
κ
κ
2
j6=i 1 + ρi
e
n
n e
−k
1
where we have used kX j k1 ≤ k, j 6= i.
Denote ē = X̄ the row sums of X̄, the matrix formed with
the X̄ i ’s as columns. The scaling between X̄ and X̃ is given by
η̄ = X̄ −1 . Indeed, because the columns of X̃ and X̄ are the
same up to scaling, there is a vector η̃ such that X̃ = X̄diag(η̃).
By the normalization of both matrices, we get
X̄ η̄ = X̃ = X̄diag(η̃) = X̄ η̃.
Because X̄ is invertible, η̃ = η̄. We want to argue that η̄ is close to
, that is that X̄ and X̃ are close. Note that
kη̄ − k1 = kX̄ −1 ( − ē)k1 ≤ kX̄ −1 k1 k − ēk1 .
(5)
By the condition, kX̄ i − X i k1 ≤ 1/nκe for all i and the fact that
(5)
the row sums of X are 1, we get k − ēk1 ≤ k2 /nκe . To bound
kX̄ −1 k1 , let Ē = X̄ − X and note that, using a standard theorem
on the sensitivity of the inverse [18],
kX̄ −1 k1
≤
≤
k(X + Ē)−1 − X −1 + X −1 k1
k(X + Ē)−1 − X −1 k1 + k(X + Ē)−1 − X −1 k1
≤
kX −1 k1
≤
kX −1 k1
.
1 − kX −1 k1 kĒk1
kX −1 k1 kĒk1
+ kX −1 k1
1 − kX −1 k1 kĒk1
The correctness of the algorithm F ULL R ECON proceeds from the
following remarks.
Partition. We have to check that the successive application of
L EAF R ECON covers the entire tree, that is that all edges are reconstructed. Figure 4 helps in understanding why this is so. When
uncovering a separator edge, we associate to it a new reference leaf
at distance at most ∆. This can always be done by definition of ∆.
It also guarantees that the subtree associated to this new leaf will
cover the endpoint of the separator outside the subtree from which
it originated. This makes the union of all subtrees explored at any
point in the execution (together with their separators) connected. It
follows easily that the entire tree is eventually covered.
L EMMA 9 (PARTITION ). The successive application of subroutine L EAFRECON covers the entire tree.
P ROOF. We need to check that the algorithm outputs a transition
matrix for each edge in T . Denote Tat the subtree explored by
L EAF R ECON applied to at . The key point is that for all t, the
tree T≤t made of all Tat0 for t0 ≤ t as well as their separators is
connected. We argue by induction. This is clear for t = 0. Assume
this is true for t. Because T is a tree, T≤t is a (connected) subtree
0
of T and (wt+1 , wt+1
) is an edge on the “boundary” of T≤t , the
leaf at+1 lies outside T≤t . Moreover, being chosen as the closest
0
, it is at distance at most ∆. Therefore, applying
leaf from wt+1
L EAF R ECON to at+1 will cover a (connected) subtree including
0
. This proves the claim.
wt+1
Consistency. It is straightforward to check that all choices of labelings are done consistently. This follows from the fact that for
each node, say w, the arbitrary labeling is performed only once.
Afterwards, all computations involving w use only the matrix P̂ aw
where a is the reference leaf for w.
L EMMA 10 (C ONSISTENCY ). The labelings are made consistently by subroutines L EAF R ECON and S EP R ECON.
Subroutines. Standard inequalities show that the (unnormalized)
estimates computed in L EAF R ECON and S EP R ECON have negligible error w.h.p.
if
L EMMA 11 (E RROR A NALYSIS : L EAF R ECON). Let a be a
(1)
leaf and assume that nκe ≤ nκe /k. Then all edges (r0 , r) reconstructed by L EAF R ECON applied to a satisfy
for all n large enough.
1
kP̃ r0 r − P r0 r k1 ≤
n
,
κ0M
where
1
n
κ0M
≥
n κπ n κv
n
(6)
κe
(nκv − nκπ )
+
2n2κπ
,
− n κπ
n κv
provided
1
(6)
n κe
2knκd (2k2 n2κd + 1)
≥
n κe
1
n κM
1 + 2k2
,
000
n κM
≥
Precision and Confidence. Now that all matrices have been approximately reconstructed, it is straightforward to check that the
distributions on the leaves of the estimated and real models are
close. The only issue is that some of the “real” transition probabilities are smaller than the tolerance of F ULL R ECON and will not
be well approximated (relatively) by the 1/poly(n) additive error.
But it is easy to show that those unlikely transitions only contribute
to outputs that have a very low overall probability. We show below
that kπ̂L − πL k1 is negligible w.h.p. thereby proving Theorem 5.
L EMMA 14 (P RECISION AND C ONFIDENCE ). Asume the topology of the tree can be reconstructed properly with probability at
κv ≥ κπ + 1,
(T )
for all n large enough (after permuting the rows and columns of
P r0 r to match the labeling of P̃ r0 r ). As for vertex distributions,
they satisfy
1
kπ̂r − πr k1 ≤ κv ,
n
(T )
least 1 − 1/nκp using at most nκs samples. Then all previous
lemmas hold for all edges and all vertices and the reconstructed
model satisfies
kπ̂L − πL k1 ≤ ε,
for all n large enough with probability at least 1 − 1/nκp where
1
1
1
1
1
1
3
+ (2) + (3) + (4) + (T ) .
≥n
(1)
n κp
n κp
n κp
n κp
n κp
n κp
This can be done using a sample of total size at most
if
1
k
1
≥ κe + κ0 ,
n κv
n
n e
(T )
0
m = n3 (nκs + nκs ) + nκs .
for all n large enough.
L EMMA 12 (E RROR A NALYSIS : S EP R ECON). Every
(w, w0 ) reconstructed by S EP R ECON satisfies
0
1
0
kP̃ ww − P ww k1 ≤
n
κ00
M
edge
,
provided
1
n
κ00
M
≥
12k4 n3κd
,
n κe
for all n large enough (after permuting the rows and columns of
0
0
P ww to match the labeling of P̃ ww ).
Stochasticity. It only remains to make the estimates of mutation
0
matrices into stochastic matrices. Say P̃ ww is the (unnormalized)
0
estimate of P ww . First, some entries might be negative. Define
0
0
P̃+ww to be the positive part of P̃ ww . Then renormalize to get our
final estimate
0
P̂i·ww
0
(P̃+ww )i·
=
,
0
k(P̃+ww )i· k1
0
L EMMA 13 (S TOCHASTICITY ). Define
the
constant
0
00
ww0
κ000
=
min{κ
,
κ
}.
Then
the
estimate
P̂
is
well-defined
M
M
M
and satisfies
0
0
1
,
n κM
sL ∈C L
≤
X
X
sL ∈C L sL̄ ∈C L̄
= kπ̂V − πV k1 .
for all i 0∈ C. We know from Lemmas 11 and 12 that P̃ ww is close
0
to P ww in L1 distance. From this, it is easy to check that P̃+ww is
0
0
also close to P ww and that k(P̃+ww )i· k1 is close to 1. Therefore
0
0
kP̂ ww − P ww k1 is negligible w.h.p.
kP̂ ww − P ww k1 ≤
P ROOF. The bounds in Sections 3.1 and 3.2 must hold for all
triples of leaves used by subroutine L EAF R ECON, of which there
are at most n3 (recall that n is the number of leaves). Therefore,
0
we need a total of m = n3 (nκs + nκs ) samples. The probability
that one of the inequalities fails or that the topology is wrong is
1
1
1
1
1
1
+
+
+
+
≤ κp ,
n3
(1)
(2)
(3)
(4)
(T )
κp
κp
κp
κp
κp
n
n
n
n
n
n
by the union bound.
The L1 distance on the leaves is bounded above by the L1 distance on the entire tree. Indeed, let r be the root of T . Denote
by L̄ = V\L the internal vertices. For partial assignments of the
states sL ∈ C L and sL̄ ∈ C L̄ on leaf and non-leaf vertices resp. we
denote s = (sL , sL̄ ) the state on the entire tree. Then
X
kπ̂L − πL k1 =
|π̂L (s) − πL (s)|
|π̂V (s) − πV (s)|
One issue we have to deal with before bounding kπ̂V − πV k1 is
that the transition probabilities smaller than 1/nκM are not well
approximated (relatively). We bound the probability associated to
states arising from such small transition probabilities. Denote by
nV the number of vertices. Fix a threshold 1/nκth ≥ 2/nκM . In
a realization of the Markov model, consider the event that at least
one transition occurs which has probability less than the threshold
1/nκth . Using the union bound, the probability of that event is
bounded above by (nV −1)(k−1)/nκth because for each transition
there is at most k − 1 possible transitions with probability less than
1/nκth . By the same argument, under π̂V the same probability
is less than (nV − 1)(k − 1)[1/nκth + 1/nκM ]. So as long as
nκth = ω(n), these probabilities are small. Denote by Sth the
states on V that use no transition with probability at most 1/nκth .
Then, for any s ∈ Sth such that π̂V (s) > πV (s), |π̂V (s) − πV (s)|
is equal to
Y
π̂r (s(r))
Y
uv
Ps(u)s(v)
(u,v)∈Er
(u,v)∈Er
≤ πr (s(r))
Y
uv
P̂s(u)s(v)
− πr (s(r))
uv
Ps(u)s(v)
(u,v)∈Er
"
1+
nκπ
nκv
1+
nκth
nκM
nV −1 #
−1 ,
and likewise for π̂V (s) < πV (s), |π̂V (s) − πV (s)| is less than
πr (s(r))
Y
"
uv
1−
Ps(u)s(v)
(u,v)∈Er
1−
nκπ
nκv
1−
nκth
nκM
nV −1 #
.
Provided κM = Ω(n2 nκth ) and nκv = Ω(n2 nκπ ), the two expressions in square brackets are less than ε/4 for n large enough.
Then kπ̂V − πV k1 is less than
X
X
|π̂V (s) − πV (s)|
|π̂V (s) − πV (s)| +
s∈Sth
s∈S
/ th
≤ ε/4 + ε/4 + (nV − 1)(k − 1)
2
nκth
+
1
n κM
≤ ε,
for n large enough.
All matrices involved have constant size so the only contributions to the running time are the O(n) reconstructions and O(n)
breath-first searches. The overall running time is O(n2 ).
In the special case of the HMM (“caterpillar” tree, see Figure 1),
our algorithm F ULL R ECON actually gives a stronger result. Here
we only need to assume that determinants of mutation matrices are
at least 1/poly(n) (instead Ω(1)). The reason is that the HMM tree
can be partitioned into subtrees of constant size in which all nodes
are at distance at most a constant from a reference leaf. The proofs
above apply without change. The only difference is that now all
paths are of constant length and therefore we can allow mutation
matrices to have 1/poly(n) determinants. Therefore, if the topology is known, which is often naturally the case in applications of
HMMs, we can reconstruct the (approximate) mutation matrices in
polynomial time.
4.
CONCLUDING REMARKS
Many extensions of this work deserve further study.
• There remains a gap between our positive result (for general
trees), where we require determinants Ω(1), and the hardness
resut of [20], which uses determinants exactly 0. Is learning
possible in the case where determinants are Ω(n−c ) (as it is
in the case of HMMs)?
• There is another gap arising from the upper bound on the
determinants. Having mutation matrices with determinant 1
does not seem like a major issue. It does not arise in the estimation of the mutation matrices. But it is tricky to analyze
rigorously how the determinant 1 edges affect the reconstruction of the topology. Again, we have shown that this is not a
problem in the special case of HMMs (at least if the topology
is known).
• We have emphasized the difference between k = 2 and k ≥
3. As it stands, our algorithm works only for (β, n−κπ )nonsingular models even when k = 2. It would be interesting to rederive the results of [7] using our technique possibly
in combination with [22].
Acknowledgements
The first author is supported by a Miller Fellowship in Statistics and Computer Science, UC Berkeley. The second author acknowledges the partial financial support of
NSERC and NATEQ. This work was also supported by CIPRES. We gratefully thank
Jon Feldman, Ryan O’Donnell, Rocco Servedio and Mike Steel for interesting discussions.
5. REFERENCES
[1] N. Abe and N. K. Warmuth, On the Computational Complexity of
Approximating Probability Distributions by Probabilistic Automata, Machine
Learning, 9(2–3), 205–260, 1992.
[2] L. Addario-Berry, B. Chor, M. Hallett, J. Lagergren, A. Panconesi and Todd
Wareham, Ancestral Maximum Likelihood of Evolutionary Trees Is Hard,
Journal of Bioinformatics and Computational Biology, 2, 257-271, 2004.
[3] A. Ambainis, R. Desper, M. Farach, and S. Kannan, Nearly tight bounds on the
learnability of evolution, FOCS 1997.
[4] J. Cavender, Taxonomy with confidence, Mathematical Biosciences, 40,
271–280, 1978.
[5] J. T. Chang, Full Reconstruction of Markov Models on Evolutionary Trees:
Identifiability and Consistency, Mathematical Biosciences, 137, 51–73, 1996.
[6] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter,
Probabilistic Networks and Expert Systems, Springer, New York, 1999.
[7] M. Cryan, L. A. Goldberg, and P. W. Goldberg, Evolutionary Trees can be
Learned in Polynomial Time in the Two-State General Markov Model, SIAM
Journal on Computing, 31(2), 375–397, 2002.
[8] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge
University Press, 1998.
[9] R. Durrett, Probability: Theory and Examples, Duxbury, 1996.
[10] P. L. Erdos, M. Steel, L. Szekely, and T. Warnow, A Few Logs Suffice to Build
(Almost) all Trees (I), Random Structures and Algorithms, 14, 153–184, 1997.
[11] P. L. Erdos, M. A. Steel, L. A. Szekely, and T. J. Warnow, A Few Logs Suffice
to Build (Almost) all Trees (II), Theoretical Computer Science, 221(1–2),
77–118, 1999.
[12] J. Felsenstein. Inferring Phylogenies. Sinauer, New York, New York, 2004.
[13] M. Farach and S. Kannan, Efficient algorithms for inverting evolution, STOC
1996.
[14] J. Feldman and R. O’Donnell and R. Servedio. Learning mixtures of product
distributions, preprint, 2004.
[15] J. Felsenstein, Cases in which parsimony or compatibility methods will be
positively misleading, Systematic Zoology, 27, 410–410, 1978.
[16] G. H. Golub and C. H. Van Loan, Matrix Computations, Johns Hopkins
University Press, 1996.
[17] R. L. Graham and L. R. Foulds, Unlikelihood that minimal phylogenies for a
realistic biological study can be constructed in reasonable computational time,
Mathematical Biosciences, 60, 133-142, 1982.
[18] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press,
1985.
[19] M. Kearns, Efficient noise-tolerant learning from statistical queries, STOC
1993.
[20] M. J. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie,
On the learnability of discrete distributions, STOC 1994.
[21] R. B. Lyngs and C. N. S. Pedersen, Complexity of Comparing Hidden Markov
Models, ISAAC 2001.
[22] E. Mossel, Distorted metrics on trees and phylogenetic forests, preprint, 2004
(available at http://arxiv.org/math.CO/0403508).
[23] E. Mossel and S.Roch, Learning Nonsingular Phylogenies and Hidden Markov
Models, preprint, 2005 (available at
http://arxiv.org/cs.LG/0502076).
[24] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in
speech recognition, Proceedings of the IEEE, 77(2), 257–286, 1989.
[25] K. Rice and T. Warnow, Parsimony is Hard to Beat, Proceedings of the Third
Annual International Conference on Computing and Combinatorics, Lecture
Notes In Computer Science, Springer, 1276, 124 – 133, 1997.
[26] D. Ron, Y. Singer, and N. Tishby, On the Learnability and Usage of Acyclic
Probabilistic Finite Automata, J. Comput. Syst. Sci., 56(2), 133–152, 1998.
[27] C. Semple and M. Steel. Phylogenetics, volume 22 of Mathematics and its
Applications series. Oxford University Press, 2003.
[28] M. Steel, Recovering a tree from the leaf colourations it generates under a
Markov model, Appl. Math. Lett. 7(2), 19–24, 1994.
[29] M. A. Steel, L. A. Szekely, and M. D. Hendy, Reconstructing trees when
sequence sites evolve at variable rates, J. Comput. Biol. 1(2), 153–63, 1994.
[30] L. G. Valiant, A theory of the learnable, Communications of the ACM, 27(11),
1134–1142, 1984.
Download