A Large-Deviation Analysis for the Maximum Likelihood Learning of Tree Structures
Vincent Y. F. Tan∗ , Animashree Anandkumar∗† , Lang Tong† and Alan S. Willsky∗
∗ Stochastic Systems Group, LIDS, MIT, Cambridge, MA 02139, Email: {vtan,animakum,willsky}@mit.edu
† ECE Dept., Cornell University, Ithaca, NY 14853, Email: {aa332@,ltong@ece.}cornell.edu
Abstract—The problem of maximum-likelihood learning of the
structure of an unknown discrete distribution from samples is
considered when the distribution is Markov on a tree. Large-deviation analysis of the error in estimation of the set of edges of
the tree is performed. Necessary and sufficient conditions are provided to ensure that this error probability decays exponentially.
These conditions are based on the mutual information between
each pair of variables being distinct from that of other pairs.
The rate of error decay, or error exponent, is derived using the
large-deviation principle. The error exponent is approximated
using Euclidean information theory and is given by a ratio, to
be interpreted as the signal-to-noise ratio (SNR) for learning.
Numerical experiments show the SNR approximation is accurate.
Index Terms—Large-deviations, Tree structure learning, Error
exponents, Euclidean Information Theory.
I. INTRODUCTION
The estimation of a distribution from samples is a classical
problem in machine learning and is challenging for high-dimensional multivariate distributions. In this respect, graphical models [1] provide a significant simplification of the joint
distribution, and incorporate a Markov structure in terms of a
graph defined on the set of nodes. Many specialized algorithms
exist for learning graphical models with sparse graphs.
A special scenario is when the graph is a tree. In this
case, the classical Chow-Liu algorithm [2] finds the maximum-likelihood (ML) estimate of the probability distribution by
exploiting the Markov structure to learn the tree edges in
terms of a maximum-weight spanning tree (MWST). The ML
estimator learns the distribution correctly as we obtain more
learning samples (consistency). These learning samples are
drawn independently from the distribution.
In this paper, we study the performance of the ML estimator
in terms of the nature of convergence with increasing sample
size. Specifically, we are interested in the event that the set of
edges of the ML tree is not the true set of edges. We address
the following questions: Is there exponential decay of error in
structure learning as the number of samples goes to infinity?
If so, what is the rate of decay of the error probability? How
does the rate depend on the parameters of the distribution? Our
analysis and answers to these questions provide us insights into the distributions and the edges where we are more likely to make an error when learning the tree structure.
The first author is supported by A*STAR, Singapore, in part by a MURI funded through ARO Grant W911NF-06-1-0076 and in part under a MURI through AFOSR Grant FA9550-06-1-0324. The second author is supported by the IBM Ph.D Fellowship for the year 2008-09 and is currently a visiting student at MIT, Cambridge, MA 02139. This work is also supported in part through collaborative participation in the Communications and Networks Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-012-0011 and by the Army Research Office under Grant ARO-W911NF-06-1-0346.
Summary of Main Results and Related Work
Learning the structure of graphical models is an extensively
studied problem (e.g., [2]-[5]). Previous works focus on establishing consistency of the estimators, while a few prove that the estimators have an exponential rate of error decay under some technical conditions in the Gaussian case [4]. However,
to the best of our knowledge, none of the works quantifies the
exact rate of decay of error for structure learning.
Following the seminal work of Chow and Liu in [2], a
number of works have looked at learning graphical models.
Using the maximum entropy principle as a learning technique, Dudik et al. [3] provide strong consistency guarantees on the learned distribution in terms of the log-likelihood of the samples. Wainwright et al. [5] also propose a regularization method for learning the graph structure based on $\ell_1$-regularized logistic regression and provide theoretical guarantees for learning the correct structure as the number of samples, the number of variables, and the neighborhood size grow. Meinshausen and Buehlmann [4] consider learning the structure of Gaussian models and show that the error probability of getting the structure wrong decays exponentially; however, the rate is not provided.
The main contributions of this paper are three-fold. First,
using the large-deviation principle (LDP) [6] we prove that
the most-likely error in ML estimation is a tree which differs
from the true tree by a single edge. Second, again using the
LDP, we derive the exact error exponent for ML estimation of
tree structures. Third, using ideas from Euclidean Information
Theory [7], we provide a succinct and intuitively appealing
closed-form approximation for the error exponent which is
tight in the very noisy learning regime, where the individual
samples are not too informative about the tree structure. The
approximate error exponent has a very intuitive explanation as
the signal-to-noise ratio (SNR) for learning. It corroborates the
intuition that if the edges belonging to the true tree model are
strongly distinguishable from the non-edges using the samples,
we can expect the rate of decay of error to be large.
All the results are stated without proof. The reader may
refer to http://web.mit.edu/vtan/www/isit09 for the details.
II. PROBLEM STATEMENT AND CHOW-LIU ALGORITHM
An undirected graphical model [1] is a probability distribution that factorizes according to the structure of an
underlying undirected graph. More explicitly, a vector of
random variables $\mathbf{x} = (x_1, \dots, x_d)$ is said to be Markov on a graph $G = (V, E)$, with $V = \{1, \dots, d\}$ and $E \subset \binom{V}{2}$, if $P(x_i \,|\, x_{V \setminus \{i\}}) = P(x_i \,|\, x_{\mathcal{N}(i)})$, where $\mathcal{N}(i)$ is the set of neighbors of $x_i$. In this paper, we are given a set of $d$-dimensional samples (or observations) $\mathbf{x}^n = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$
where each xk ∈ X d and X is a finite set. Each element of the
set of samples xk is drawn independently from some unknown,
positive everywhere distribution P whose support is X d , that
is P ∈ P(X d ), the set of all distributions supported on X d .
We further assume that the graph of P , denoted TP = (V, EP ),
belongs to the set of spanning trees1 on d nodes T d . The set of
d-dimensional tree distributions is denoted D(X d , T d ). Tree
distributions possess the following factorization property [1]
$$P(\mathbf{x}) = \prod_{i \in V} P_i(x_i) \prod_{(i,j) \in E_P} \frac{P_{i,j}(x_i, x_j)}{P_i(x_i) P_j(x_j)}, \qquad (1)$$
where Pi and Pi,j are the marginals on node i ∈ V and edge
(i, j) ∈ EP respectively. Given $\mathbf{x}^n$, we define the empirical distribution to be $\hat{P}(\mathbf{x}) = N(\mathbf{x} \,|\, \mathbf{x}^n)/n$, where $N(\mathbf{x} \,|\, \mathbf{x}^n)$ is the number of times $\mathbf{x} \in \mathcal{X}^d$ occurs in $\mathbf{x}^n$.
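To make the factorization (1) concrete, the following sketch (our own toy example, not from the paper) evaluates a tree distribution on a hypothetical 3-node chain from its node and edge marginals and checks that the result is a valid pmf.

```python
import itertools
import numpy as np

# A hypothetical 3-node chain tree 1 - 2 - 3 over binary variables,
# used only to illustrate the factorization in Eq. (1).
node_marginals = {
    1: np.array([0.4, 0.6]),
    2: np.array([0.5, 0.5]),
    3: np.array([0.7, 0.3]),
}
# Pairwise marginals on the tree edges; rows index the first node's value.
edge_marginals = {
    (1, 2): np.array([[0.3, 0.1],
                      [0.2, 0.4]]),
    (2, 3): np.array([[0.4, 0.1],
                      [0.3, 0.2]]),
}

def tree_prob(x):
    """Evaluate P(x) via Eq. (1): prod_i P_i(x_i) * prod_(i,j) P_ij/(P_i P_j)."""
    p = 1.0
    for i, Pi in node_marginals.items():
        p *= Pi[x[i]]
    for (i, j), Pij in edge_marginals.items():
        p *= Pij[x[i], x[j]] / (node_marginals[i][x[i]] * node_marginals[j][x[j]])
    return p

# Sanity check: the factorized probabilities sum to one.
total = sum(tree_prob({1: a, 2: b, 3: c})
            for a, b, c in itertools.product([0, 1], repeat=3))
print(f"sum over all configurations = {total:.6f}")  # should print 1.000000
```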
Using the samples $\mathbf{x}^n$, the Chow-Liu algorithm [2], reviewed in Section II-A, computes PML, the ML tree-structured distribution, with edge set EML. It is known [8] that as n → ∞, EML approaches EP, the true tree structure. But at what rate does this happen for a given tree distribution P? Is the error decay exponential? In this paper, we use the Large-Deviation Principle (LDP) [6] to quantify the exponential rate
at which the probability of the error event
$$A_n := \{ E_{\mathrm{ML}} \neq E_P \} \qquad (2)$$
decays to zero. In other words, given a distribution P and
$\mathbb{P} := P^n$, we are interested in studying the rate $K_P$ given by
$$K_P := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(A_n), \qquad (3)$$
whenever the limit exists. KP is the error exponent for the
event An as defined in (2). When KP > 0, there is exponential
decay of error probability in structure learning, and we would
like to provide conditions to ensure this.
In addition, we define the set of disjoint events Un (T ) that
the graph of the ML estimate PML of the tree distribution is a
tree T different from the true tree TP , i.e.,
$$U_n(T) := \begin{cases} \{ T_{\mathrm{ML}} = T \}, & \text{if } T \in \mathcal{T}^d \setminus \{T_P\}, \\ \emptyset, & \text{if } T = T_P. \end{cases} \qquad (4)$$
From (2), $A_n = \bigcup_{T \in \mathcal{T}^d \setminus \{T_P\}} U_n(T)$. We also define the exponent for each error event $U_n(T)$ as
$$\Upsilon(T) := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}\left(U_n(T)\right). \qquad (5)$$
n
Definition The dominant error tree TP∗ = (V, EP∗ ) is the
spanning tree given by
$$T_P^* = \operatorname*{argmin}_{T \in \mathcal{T}^d \setminus \{T_P\}} \Upsilon(T). \qquad (6)$$
¹ In a spanning tree, none of the pairwise joint distributions $P_{i,j}$ are allowed to be product distributions.
In Section IV, we characterize the dominant error tree and its
relation to the error exponent KP in (3).
A. Maximum-Likelihood Learning of Tree Distributions
In this section, we review the Chow-Liu algorithm [2] for
learning the ML tree distribution PML given a set of n samples
$\mathbf{x}^n$ drawn independently from a tree distribution P. The Chow-Liu algorithm solves the following ML problem:
$$P_{\mathrm{ML}} = \operatorname*{argmax}_{Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} \sum_{k=1}^{n} \log Q(\mathbf{x}_k) = \operatorname*{argmin}_{Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} D(\hat{P} \,\|\, Q). \qquad (7)$$
For the above problem, the ML tree model will always be
the projection of the empirical distribution Pb onto some tree
structure [2], i.e., PML (xi ) = Pbi (xi ) for all i ∈ V and
PML (xi , xj ) = Pbi,j (xi , xj ) for all (i, j) ∈ EML . Thus, the
optimization problem in (7) reduces to a search for the tree
structure of PML . By using the fact that the graph of Q is a tree,
the KL divergence in (7) decomposes into a sum of mutual
information quantities. Hence, if EQ is the edge set of the tree
distribution Q, the optimization for the structure of PML is
$$E_{\mathrm{ML}} = \operatorname*{argmax}_{E_Q : Q \in \mathcal{D}(\mathcal{X}^d, \mathcal{T}^d)} \sum_{e \in E_Q} I(\hat{P}_e), \qquad (8)$$
where $I(\hat{P}_e)$ is the empirical mutual information given the data $\mathbf{x}^n$, defined for $e = (i, j)$ as
$$I(\hat{P}_e) := \sum_{(x_i, x_j) \in \mathcal{X}^2} \hat{P}_{i,j}(x_i, x_j) \log \frac{\hat{P}_{i,j}(x_i, x_j)}{\hat{P}_i(x_i) \hat{P}_j(x_j)}. \qquad (9)$$
Note that EP is the structure obtained from the MWST algorithm without any error, i.e., with the true mutual informations as edge weights. To solve (8), we use the samples $\mathbf{x}^n$ to compute $I(\hat{P}_e)$ for each node pair $e \in \binom{V}{2}$ given the empirical distribution $\hat{P}$. Subsequently, we use these as the edge weights for the MWST problem in (8). Note that the search for the MWST is not the same as that for the largest $d-1$ mutual information quantities, since one has to take the tree constraint into consideration. There are well-known algorithms [9] that solve the MWST problem in $O(d^2 \log d)$ time.
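The procedure just described is short to implement: estimate the empirical mutual information (9) for every node pair and feed these weights to a maximum-weight spanning tree routine. The sketch below is our own illustration; the samples are assumed to be an n × d integer array, a simple Kruskal-style MWST is used instead of the optimized routines cited in [9], and the synthetic chain data at the end are made up for demonstration.

```python
import numpy as np
from itertools import combinations

def empirical_mi(samples, i, j):
    """Empirical mutual information I(P̂_{i,j}) of columns i and j, as in Eq. (9)."""
    mi = 0.0
    for a in np.unique(samples[:, i]):
        for b in np.unique(samples[:, j]):
            p_ij = np.mean((samples[:, i] == a) & (samples[:, j] == b))
            p_i = np.mean(samples[:, i] == a)
            p_j = np.mean(samples[:, j] == b)
            if p_ij > 0:
                mi += p_ij * np.log(p_ij / (p_i * p_j))
    return mi

def chow_liu_edges(samples):
    """Return an ML edge set E_ML as in Eq. (8) via a maximum-weight spanning tree."""
    d = samples.shape[1]
    weights = {(i, j): empirical_mi(samples, i, j) for i, j in combinations(range(d), 2)}
    # Kruskal's algorithm on weights sorted in decreasing order = MWST.
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Example usage on synthetic binary data from a hypothetical 4-variable chain.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=5000)
x1 = (x0 + (rng.random(5000) < 0.1)) % 2   # noisy copies along a chain
x2 = (x1 + (rng.random(5000) < 0.1)) % 2
x3 = (x2 + (rng.random(5000) < 0.1)) % 2
print(chow_liu_edges(np.column_stack([x0, x1, x2, x3])))
```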
III. LDP FOR EMPIRICAL MUTUAL INFORMATION
To compute the exponent for the error in structure learning $K_P$, we first consider a simpler event. Let $e, e' \in \binom{V}{2}$ be any two node pairs satisfying $I(P_e) > I(P_{e'})$, where $I(P_e)$ and $I(P_{e'})$ are the true mutual informations on node pairs $e$ and $e'$, respectively. We are interested in the crossover event of the empirical mutual informations, defined as
$$C_{e,e'} := \left\{ I(\hat{P}_e) \le I(\hat{P}_{e'}) \right\}. \qquad (10)$$
The occurrence of this event may potentially lead to an error
in structure learning when EML differs from EP . In the next
section, we will see how these crossover events relate to
KP . Now, we would like to compute the crossover rate for
empirical mutual informations $J_{e,e'}$ as
$$J_{e,e'} := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}\left(C_{e,e'}\right). \qquad (11)$$
[Fig. 1. (a) The symmetric "star" graph with d = 9. (b) The path associated with the non-edge $e' = (u, v) \notin E_P$, denoted $\mathrm{Path}(e'; E_P) \subset E_P$, is the set of edges along the unique path linking the end points of $e'$. $r(e') \in \mathrm{Path}(e'; E_P)$ is the dominant replacement edge associated with $e' \notin E_P$.]
Intuitively, if the difference between the true mutual informations I(Pe ) − I(Pe0 ) is large, we expect the crossover rate to
be large. Consequently, the probability of the crossover event
would be small. Note that $J_{e,e'}$ in (11) is not the same as $K_P$ in (3). We will see that the rate $J_{e,e'}$ depends not only on the mutual informations $I(P_e)$ and $I(P_{e'})$, but also on the distribution $P_{e,e'}$ of the variables on the node pairs $e$ and $e'$.
Theorem 1 (LDP for Empirical MIs): Let the distribution on the two node pairs $e$ and $e'$ be $P_{e,e'} \in \mathcal{P}(\mathcal{X}^4)$.² The crossover rate for the empirical mutual informations is
$$J_{e,e'} = \inf_{Q \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(Q \,\|\, P_{e,e'}) : I(Q_{e'}) = I(Q_e) \right\}, \qquad (12)$$
where $Q_e, Q_{e'} \in \mathcal{P}(\mathcal{X}^2)$ are the marginal distributions of $Q$ on $e$ and $e'$ respectively, i.e., $Q_e = \sum_{x_{e'}} Q$ and similarly for $Q_{e'}$. The infimum in (12) is attained for some $Q^*_{e,e'} \in \mathcal{P}(\mathcal{X}^4)$ satisfying $I(Q^*_{e'}) = I(Q^*_e)$.
The proof of this theorem hinges on Sanov's theorem [6, Sec. 6.2] and the contraction principle [6, Thm. 4.2.1]. We remark that the rate $J_{e,e'}$ is strictly positive since we assumed, a priori, that $e$ and $e'$ satisfy $I(P_e) > I(P_{e'})$.
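To build intuition for the crossover event (10), one can estimate $\mathbb{P}(C_{e,e'})$ by simulation for a simple joint and watch the estimate decay with n. The sketch below is our own illustrative experiment, not the paper's method for computing $J_{e,e'}$: the joint (two independent binary pairs with different flip probabilities), the sample sizes, and the trial count are arbitrary choices, and the printed quantity $-\frac{1}{n}\log(\text{frequency})$ is only a crude finite-sample proxy for the rate in (11).

```python
import numpy as np

def mi_from_counts(xy):
    """Mutual information of a joint pmf given as a 2-D array."""
    px = xy.sum(axis=1, keepdims=True)
    py = xy.sum(axis=0, keepdims=True)
    mask = xy > 0
    return float(np.sum(xy[mask] * np.log(xy[mask] / (px @ py)[mask])))

def empirical_joint(u, v):
    """Empirical 2x2 joint pmf of two binary sample vectors."""
    joint = np.zeros((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            joint[a, b] = np.mean((u == a) & (v == b))
    return joint

rng = np.random.default_rng(1)
n_trials = 1000
for n in (50, 100, 200):
    flips = 0
    for _ in range(n_trials):
        # Pair e = (x_a, x_b): strongly dependent; pair e' = (x_c, x_d): weakly dependent.
        xa = rng.integers(0, 2, n)
        xb = (xa + (rng.random(n) < 0.2)) % 2   # flip prob 0.2 -> larger I(P_e)
        xc = rng.integers(0, 2, n)
        xd = (xc + (rng.random(n) < 0.4)) % 2   # flip prob 0.4 -> smaller I(P_e')
        if mi_from_counts(empirical_joint(xa, xb)) <= mi_from_counts(empirical_joint(xc, xd)):
            flips += 1
    freq = flips / n_trials
    print(n, freq, -np.log(max(freq, 1e-12)) / n)  # crude finite-n proxy for J_{e,e'}
```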
A. Example: The Symmetric "Star" Graph
It is instructive to study a simple example – the “star” graph,
shown in Fig. 1(a) with EP = {(1, i) : i = 2, . . . , d}. We
assign the joint distributions Qa , Qb ∈ P(X 2 ) and Qa,b ∈
P(X 4 ) to the variables in this graph in the following specific
way: (i) P1,i = Qa for all 2 ≤ i ≤ d. (ii) Pi,j = Qb for all
2 ≤ i, j ≤ d. (iii) P1,i,j,k = Qa,b for all 2 ≤ i, j, k ≤ d. Thus,
the joint distribution on the central node and any outer node
is Qa , while the joint distribution of any two outer nodes is
Qb . Note that Qa and Qb are the pairwise marginals of Qa,b .
Furthermore, we assume that I(Qa ) > I(Qb ) > 0.
Proposition 2: For the “star” graph with the distributions
as described, KP , the error exponent for structure learning, is
$$K_P = \inf_{R_{1,2,3,4} \in \mathcal{P}(\mathcal{X}^4)} \left\{ D(R_{1,2,3,4} \,\|\, Q_{a,b}) : I(R_{1,2}) = I(R_{3,4}) \right\},$$
where R1,2 and R3,4 are the pairwise marginals of R1,2,3,4 .
In general, it is not easy to derive the error exponent
KP since crossover events for different node pairs affect the
learned structure in a complex manner. We now provide an
expression for KP by identifying the dominant error tree.
² If $e$ and $e'$ share a node, $P_{e,e'} \in \mathcal{P}(\mathcal{X}^3)$. This does not change the subsequent exposition significantly.
IV. ERROR EXPONENT FOR STRUCTURE LEARNING
The analysis in the previous section characterized the rate
for the crossover event for empirical mutual information Ce,e0 .
In this section, we connect these events {Ce,e0 } to the quantity
of interest KP in (3). Not all the events in the set {Ce,e0 }
contribute to the overall error event An in (2) because of the
global spanning tree constraint for the learned structure EML .
A. Identifying the Dominant Error Tree
Definition Given a node pair $e' \notin E_P$, its dominant replacement edge $r(e') \in E_P$ is the edge along the unique path connecting the nodes in $e'$ (denoted $\mathrm{Path}(e'; E_P)$; see Fig. 1(b)) having the minimum crossover rate, i.e.,
$$r(e') := \operatorname*{argmin}_{e \in \mathrm{Path}(e'; E_P)} J_{e,e'}. \qquad (13)$$
Note that if we replace the true edge $r(e')$ by the non-edge $e'$, the tree constraint is still satisfied. This is important since such replacements lead to an error in structure learning.
Theorem 3 (Error Exponent as a Single Crossover Event): A dominant error tree $T_P^* = (V, E_P^*)$ has edge set $E_P^* = E_P \cup \{e^*\} \setminus \{r(e^*)\}$, where
$$e^* := \operatorname*{argmin}_{e' \notin E_P} J_{r(e'),e'}, \qquad (14)$$
with $r(e')$, defined in (13), being the dominant replacement edge associated with $e' \notin E_P$. Furthermore, the error exponent $K_P$, which is the rate at which the ML tree $E_{\mathrm{ML}}$ differs from the true tree structure $E_P$, is given by
$$K_P = J_{r(e^*),e^*} = \min_{e' \notin E_P} \; \min_{e \in \mathrm{Path}(e'; E_P)} J_{e,e'}, \qquad (15)$$
where $e^* \notin E_P$ is given in (14).
The above theorem relates the set of crossover rates $\{J_{e,e'}\}$, which we characterized in the previous section, to the overall error exponent $K_P$ in (3). We see that the dominant error tree $T_P^*$ differs from the true tree $T_P$ in exactly one edge. Note that the result in (15) is exponentially tight in the sense that $\mathbb{P}(A_n) \doteq \exp(-n K_P)$. From (15), we see that if at least one of the crossover rates $J_{e,e'}$ is zero, then the overall error exponent $K_P$ is zero. This observation is important for the derivation of necessary and sufficient conditions for $K_P$ to be positive, and hence, for the error probability to decay exponentially in the number of samples $n$.
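Theorem 3 reduces the computation of $K_P$ to the double minimization in (15) over non-edges and their path edges. The following sketch is our own illustration, assuming the crossover rates $J_{e,e'}$ have already been computed and are supplied in a dictionary keyed by (path edge, non-edge) pairs; it returns $K_P$ together with the dominant non-edge $e^*$ and its replacement edge $r(e^*)$. The tree and rate values are made up for demonstration.

```python
from itertools import combinations

def tree_path(edges, u, v):
    """Edges along the unique path between u and v in a tree given by its edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    # Depth-first search recording the parent of each node, rooted at u.
    parent, stack = {u: None}, [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    path, x = [], v
    while parent[x] is not None:
        path.append(tuple(sorted((x, parent[x]))))
        x = parent[x]
    return path

def error_exponent(tree_edges, J):
    """K_P and the dominant (non-edge, replacement edge) pair, per Eqs. (13)-(15)."""
    nodes = sorted({u for e in tree_edges for u in e})
    tree = {tuple(sorted(e)) for e in tree_edges}
    best = (float("inf"), None, None)
    for e_prime in combinations(nodes, 2):
        if e_prime in tree:
            continue
        # Dominant replacement edge r(e'): the path edge with the smallest crossover rate.
        r = min(tree_path(tree_edges, *e_prime), key=lambda e: J[(e, e_prime)])
        if J[(r, e_prime)] < best[0]:
            best = (J[(r, e_prime)], e_prime, r)
    return best  # (K_P, dominant non-edge e*, replaced edge r(e*))

# Hypothetical 4-node star with center 0 and made-up crossover rates.
tree = [(0, 1), (0, 2), (0, 3)]
J = {((0, 1), (1, 2)): 0.4, ((0, 2), (1, 2)): 0.5,
     ((0, 1), (1, 3)): 0.3, ((0, 3), (1, 3)): 0.6,
     ((0, 2), (2, 3)): 0.7, ((0, 3), (2, 3)): 0.2}
print(error_exponent(tree, J))   # -> (0.2, (2, 3), (0, 3))
```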
B. Conditions for Exponential Decay
We now provide necessary and sufficient conditions that
ensure that KP is strictly positive. This is obviously of crucial
importance since if KP > 0, we have exponential decay in
the desired probability of error.
Theorem 4 (Conditions for exponential decay): The
following three statements are equivalent.
(a) The error probability decays exponentially, i.e., KP > 0.
(b) The mutual information quantities satisfy $I(P_{e'}) \neq I(P_e)$ for all $e \in \mathrm{Path}(e'; E_P)$ and all $e' \notin E_P$.
(c) $T_P$ is not a proper forest³, as assumed in Section II.
³ A proper forest on d nodes is an undirected, acyclic graph that has (strictly) fewer than d − 1 edges.
Condition (b) states that, for each non-edge $e'$, we need $I(P_{e'})$ to be different from the mutual information of its dominant replacement edge, $I(P_{r(e')})$. Condition (c) is a more intuitive condition for exponential decay of the probability of error $\mathbb{P}(A_n)$. This is an important result since it says that, for any non-degenerate tree distribution in which none of the pairwise joint distributions is a product distribution, the probability of error decays exponentially.
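Condition (c) can be checked directly from the distribution: $K_P > 0$ as long as no pairwise marginal of P factorizes into a product. The sketch below is our own illustration of the condition, not a procedure from the paper; it assumes P is available as a d-way array over $\mathcal{X}^d$ and uses a numerical tolerance.

```python
import numpy as np

def is_not_proper_forest(P, tol=1e-12):
    """True iff no pairwise marginal of the joint pmf P factorizes (condition (c))."""
    d = P.ndim
    for i in range(d):
        for j in range(i + 1, d):
            # Marginalize out all axes except i and j.
            axes = tuple(k for k in range(d) if k not in (i, j))
            Pij = P.sum(axis=axes)
            Pi, Pj = Pij.sum(axis=1), Pij.sum(axis=0)
            if np.max(np.abs(Pij - np.outer(Pi, Pj))) < tol:
                return False   # found a product pairwise marginal
    return True

# Example: a generic 3-variable pmf (all pairs dependent) versus a distribution
# in which x3 is independent of (x1, x2).
rng = np.random.default_rng(0)
generic = rng.random((2, 2, 2))
generic /= generic.sum()
indep = np.einsum('ij,k->ijk', generic.sum(axis=2), generic.sum(axis=(0, 1)))
print(is_not_proper_forest(generic), is_not_proper_forest(indep))  # True False
```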
C. Computational Complexity to Compute Error Exponent
We now provide an upper bound on the complexity to
compute KP . The diameter of the tree TP = (V, EP ) is
$$\zeta(T_P) := \max_{u,v \in V} L(u, v), \qquad (16)$$
where L(u, v) is the length (number of hops) of the unique
path between nodes u and v. For example, L(u, v) = 4 for
the non-edge e0 = (u, v) in the subtree in Fig. 1(b).
Theorem 5 (Computational Complexity): The number of
computations of Je,e0 required to determine KP is upper
bounded by (1/2)ζ(TP )(d − 1)(d − 2).
Thus, if the diameter of the tree is relatively small and independent of the number of nodes d, the complexity is quadratic in d. For instance, for a "star" network, the diameter is $\zeta(T_P) = 2$. For a balanced tree, $\zeta(T_P) = O(\log d)$, hence the number of computations is $O(d^2 \log d)$. The complexity is vastly reduced compared to an exhaustive search, which requires $d^{d-2} - 1$ computations, since there are $d^{d-2}$ trees on d nodes.
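As a quick worked instance of Theorem 5 (our own sketch; the tree is given as an edge list and the diameter is found by breadth-first search), the star on d = 9 nodes from Fig. 1(a) has $\zeta(T_P) = 2$, so at most (1/2)·2·8·7 = 56 crossover rates need to be computed, versus $9^7 - 1 = 4{,}782{,}968$ candidate trees for an exhaustive search.

```python
from collections import deque

def diameter(edges):
    """Diameter (in hops) of a tree given by its edge list, via BFS from every node."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    def eccentricity(s):
        dist, q = {s: 0}, deque([s])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    q.append(y)
        return max(dist.values())
    return max(eccentricity(s) for s in adj)

# A star on d = 9 nodes (center 0), as in Fig. 1(a): diameter 2.
star = [(0, i) for i in range(1, 9)]
d = 9
zeta = diameter(star)
print(zeta, zeta * (d - 1) * (d - 2) // 2)   # 2 and 56 computations of J_{e,e'}
print(d ** (d - 2) - 1)                      # 4782968 candidate trees for exhaustive search
```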
V. EUCLIDEAN APPROXIMATIONS
In order to gain more insight into the error exponent, we make use of Euclidean approximations [7] of information-theoretic quantities to obtain an approximate but closed-form solution to (12), which is non-convex and hard to solve exactly.
To this end, we first approximate the crossover rate Je,e0 . This
will allow us to understand which pairs of edges have a higher
probability of crossing over and hence result in an error as
defined in (2). It turns out that Je,e0 intuitively depends on the
“separation” of the mutual information values. It also depends
on the uncertainty of the mutual information estimates.
Roughly speaking, our strategy is to “convexify” the objective and the constraints in (12). To do so, we recall that if P
and Q are two distributions with the same support, the KL
divergence can be approximated [7] by
$$D(Q \,\|\, P) \approx \frac{1}{2} \| Q - P \|_P^2, \qquad (17)$$
where $\|y\|_w^2$ denotes the weighted squared norm of $y$, i.e., $\|y\|_w^2 = \sum_i y_i^2 / w_i$. This bound is tight whenever $P \approx Q$. By $\approx$, we mean that $P$ is close to $Q$ entry-wise, i.e., $\|P - Q\|_\infty < \epsilon$ for some small $\epsilon > 0$. In fact, if $P \approx Q$, then $D(P \,\|\, Q) \approx D(Q \,\|\, P)$. We will also need the following notion.
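The accuracy of (17) is easy to probe numerically: perturb a pmf P entry-wise by a small amount and compare $D(Q\,\|\,P)$ with $\frac{1}{2}\|Q - P\|_P^2$. The snippet below is a sanity check of our own (the pmf size and perturbation scheme are arbitrary), not part of the paper's development.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random(8)
P /= P.sum()                                  # an arbitrary pmf on 8 outcomes

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

for eps in (1e-1, 1e-2, 1e-3):
    Q = P * (1.0 + eps * (rng.random(8) - 0.5))   # entry-wise perturbation of size O(eps)
    Q /= Q.sum()                                  # renormalize so Q is again a pmf
    approx = 0.5 * np.sum((Q - P) ** 2 / P)       # (1/2) ||Q - P||_P^2 from Eq. (17)
    print(f"eps={eps:.0e}  D(Q||P)={kl(Q, P):.3e}  approx={approx:.3e}")
```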
Definition Given a joint distribution Pe = Pi,j on X 2 with
marginals Pi and Pj , the information density [10] function,
denoted by si,j : X 2 → R, is defined as
$$s_{i,j}(x_i, x_j) := \log \frac{P_{i,j}(x_i, x_j)}{P_i(x_i) P_j(x_j)}, \qquad \forall\, (x_i, x_j) \in \mathcal{X}^2. \qquad (18)$$
If e = (i, j), we will also use the notation se (xi , xj ) =
si,j (xi , xj ). The mutual information is the expectation of the
information density, i.e., I(Pe ) = E[se ].
Recall that we also assumed in Section II that TP is a
spanning tree, which implies that for all node pairs (i, j), Pi,j
is not a product distribution, i.e., $P_{i,j} \neq P_i P_j$, because if
it were, then TP would be disconnected. We now define a
condition for which our approximation holds.
Definition We say that $P_{e,e'} \in \mathcal{P}(\mathcal{X}^4)$, the joint distribution on node pairs $e$ and $e'$, satisfies the $\epsilon$-very noisy condition if
$$\|P_e - P_{e'}\|_\infty := \max_{(x_i, x_j) \in \mathcal{X}^2} |P_e(x_i, x_j) - P_{e'}(x_i, x_j)| < \epsilon. \qquad (19)$$
This condition is needed because if (19) holds, then by
continuity of the mutual information, there exists a δ > 0
such that |I(Pe ) − I(Pe0 )| < δ, which means that the mutual
information quantities are difficult to distinguish and the
approximation in (17) is accurate. Note that proximity of the
mutual informations is not sufficient for the approximation to
hold since we have seen from Theorem 1 that Je,e0 depends
not only on the mutual informations but also on $P_{e,e'}$. We now define the approximate crossover rate for $e$ and $e'$ as
$$\tilde{J}_{e,e'} := \inf_{Q \in \mathcal{P}(\mathcal{X}^4)} \left\{ \frac{1}{2} \| Q - P_{e,e'} \|_{P_{e,e'}}^2 : Q \in \mathcal{Q}(P_{e,e'}) \right\}, \qquad (20)$$
where the (linearized) constraint set is
$$\mathcal{Q}(P_{e,e'}) := \Big\{ Q \in \mathcal{P}(\mathcal{X}^4) : I(P_{e'}) + \langle \nabla_{P_{e'}} I(P_{e'}),\, Q - P_{e,e'} \rangle = I(P_e) + \langle \nabla_{P_e} I(P_e),\, Q - P_{e,e'} \rangle \Big\}, \qquad (21)$$
where ∇Pe I(Pe ) is the gradient vector of the mutual information with respect to the joint distribution Pe . We also define
the approximate error exponent as
$$\tilde{K}_P := \min_{e' \notin E_P} \; \min_{e \in \mathrm{Path}(e'; E_P)} \tilde{J}_{e,e'}. \qquad (22)$$
We now provide the expression for the approximate crossover rate $\tilde{J}_{e,e'}$ and also state the conditions under which the approximation is asymptotically accurate.⁴
Theorem 6 (Euclidean approximation of $J_{e,e'}$): The approximate crossover rate for the empirical mutual information quantities, defined in (20), is given by
$$\tilde{J}_{e,e'} = \frac{\left(\mathbb{E}[s_{e'} - s_e]\right)^2}{2 \operatorname{Var}(s_{e'} - s_e)} = \frac{\left(I(P_{e'}) - I(P_e)\right)^2}{2 \operatorname{Var}(s_{e'} - s_e)}, \qquad (23)$$
where $s_e$ is the information density defined in (18) and the expectations are with respect to $P_{e,e'}$. The approximation $\tilde{J}_{e,e'}$ is asymptotically accurate, i.e., as $\epsilon \to 0$, $\tilde{J}_{e,e'} \to J_{e,e'}$.
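Once the joint $P_{e,e'}$ on the two node pairs is specified, (23) is directly computable from the information densities (18). The sketch below is our own illustration, assuming $e$ and $e'$ do not share a node so that $P_{e,e'}$ is stored as a 4-way array with axes $(x_i, x_j, x_k, x_l)$; the example joint, formed from two independent symmetric binary pairs, is made up for demonstration.

```python
import numpy as np

def info_density(Pab):
    """Information density s(x, y) = log P(x, y) / (P(x) P(y)) for a 2-D joint pmf."""
    Pa = Pab.sum(axis=1, keepdims=True)
    Pb = Pab.sum(axis=0, keepdims=True)
    return np.log(Pab / (Pa * Pb))

def approx_crossover_rate(P4):
    """Approximate rate of Eq. (23); axes (0, 1) carry e and axes (2, 3) carry e'."""
    Pe = P4.sum(axis=(2, 3))
    Pep = P4.sum(axis=(0, 1))
    se = info_density(Pe)[:, :, None, None]        # s_e(x_i, x_j), broadcast over e'
    sep = info_density(Pep)[None, None, :, :]      # s_{e'}(x_k, x_l), broadcast over e
    diff = sep - se
    mean = np.sum(P4 * diff)                       # E[s_{e'} - s_e] = I(P_{e'}) - I(P_e)
    var = np.sum(P4 * (diff - mean) ** 2)          # Var(s_{e'} - s_e) under P_{e,e'}
    return mean ** 2 / (2.0 * var)

# Hypothetical example: e carries a strongly dependent pair, e' a weakly dependent
# pair, and the two pairs are independent of each other.
def sym_pair(flip):
    return np.array([[0.5 * (1 - flip), 0.5 * flip],
                     [0.5 * flip, 0.5 * (1 - flip)]])

P4 = np.einsum('ij,kl->ijkl', sym_pair(0.2), sym_pair(0.4))
print(approx_crossover_rate(P4))
```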
Corollary 7 (Euclidean approximation of $K_P$): The approximation $\tilde{K}_P$ is asymptotically accurate if either of the following two conditions holds.
(a) The joint distribution $P_{r(e'),e'}$ satisfies the $\epsilon$-very noisy condition for every $e' \notin E_P$.
(b) The joint distribution $P_{r(e^*),e^*}$ satisfies the $\epsilon$-very noisy condition, but the joint distributions on all other non-edges $e' \notin E_P \cup \{e^*\}$ and their dominant replacement edges $r(e')$ do not satisfy the $\epsilon$-very noisy condition.
⁴ We say that a family of approximations $\{\tilde{\theta}(\epsilon) : \epsilon > 0\}$ of a true parameter $\theta$ is asymptotically accurate if the approximations converge to $\theta$ as $\epsilon \to 0$.
Hence, the expressions for the crossover rate $J_{e,e'}$ and the error exponent $K_P$ are vastly simplified under the $\epsilon$-very noisy condition on the joint distributions $P_{e,e'}$. The approximate crossover rate $\tilde{J}_{e,e'}$ in (23) has a very intuitive meaning. It is proportional to the square of the difference between the mutual information quantities of $P_e$ and $P_{e'}$. This corresponds exactly to our initial intuition: if $I(P_e)$ and $I(P_{e'})$ are well separated ($I(P_e) \gg I(P_{e'})$), then the crossover rate has to be large. $\tilde{J}_{e,e'}$ is also weighted by the precision (inverse variance) of $(s_{e'} - s_e)$. If this variance is large, then we are uncertain about the estimate $I(\hat{P}_e) - I(\hat{P}_{e'})$, and crossovers are more likely, thereby reducing the crossover rate $\tilde{J}_{e,e'}$. The expression in (23) is, in fact, the SNR for the estimation of the difference between the empirical mutual information quantities. This answers one of the fundamental questions we posed in the introduction. We are now able to distinguish between distributions that are "easy" to learn and those that are "difficult" by computing the set of SNR quantities in (22).
We now comment on our assumption that $P_{e,e'}$ satisfies the $\epsilon$-very noisy condition, under which the approximation is tight. When $P_{e,e'}$ is $\epsilon$-very noisy, we have $|I(P_e) - I(P_{e'})| < \delta$. Thus it is very hard to distinguish the relative magnitudes of $I(\hat{P}_e)$ and $I(\hat{P}_{e'})$; the particular problem of learning the distribution $P_{e,e'}$ from samples is very noisy. Under these conditions, the approximation in (23) is accurate. In fact, the ratio of the approximate crossover rate to the true crossover rate approaches unity as $\epsilon \to 0$.
VI. NUMERICAL EXPERIMENTS
In this section, we perform numerical experiments to study the accuracy of the Euclidean approximations. We do this by analyzing the regimes in which $\tilde{J}_{e,e'}$ in (23) is close to the true crossover rate $J_{e,e'}$ in (12). We parameterize a symmetric "star" graph distribution with d = 4 variables and a single parameter $\gamma > 0$. We let $\mathcal{X} = \{0, 1\}$, i.e., all the variables are binary. The central node is $x_1$ and the outer nodes are $x_2, x_3, x_4$. For the parameters, we set $P_1(x_1 = 0) = 1/3$ and
$$P_{i|1}(x_i = 0 \,|\, x_1 = a) = \frac{1}{2} + (-1)^a \gamma, \qquad a = 0, 1, \qquad (24)$$
for i = 2, 3, 4. With this parameterization, we see that if $\gamma$ is small, the mutual information $I(P_{1,i})$ for i = 2, 3, 4 is also small. In fact, if $\gamma = 0$, $x_1$ is independent of $x_i$ and, as a result, $I(P_{1,i}) = 0$. Conversely, if $\gamma$ is large, the mutual information $I(P_{1,i})$ increases as the dependence of the outer nodes on the central node increases. Thus, we can vary the size of the mutual information along the edges by varying $\gamma$. By symmetry, there is only one crossover rate, and hence it is also the error exponent for the error event $A_n$ in (2).
We vary $\gamma$ from 0 to 0.2 and plot both the true and approximate rates against $I(P_e) - I(P_{e'})$ in Fig. 2, where $e$ denotes any edge and $e'$ denotes any non-edge; a short sketch of this construction is given below.
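A minimal sketch of this experimental setup, under the natural assumption that the star distribution factorizes as $P(x_1, \dots, x_4) = P_1(x_1) \prod_{i=2}^{4} P_{i|1}(x_i \,|\, x_1)$ with $P_{i|1}$ given by (24): it builds the joint for a few values of $\gamma$ and reports the edge and non-edge mutual informations that drive the rates in Fig. 2 (the true rate itself would still require solving (12) numerically, and since the edge and non-edge here share a node, $P_{e,e'}$ lives on $\mathcal{X}^3$ as noted in footnote 2).

```python
import numpy as np

def star_joint(gamma):
    """Joint pmf of (x1, x2, x3, x4) for the symmetric star of Section VI, Eq. (24)."""
    P1 = np.array([1.0 / 3.0, 2.0 / 3.0])
    # P(x_i = 0 | x_1 = a) = 1/2 + (-1)^a * gamma, for i = 2, 3, 4.
    cond = np.array([[0.5 + gamma, 0.5 - gamma],     # row: value of x1, column: value of x_i
                     [0.5 - gamma, 0.5 + gamma]])
    P = np.zeros((2, 2, 2, 2))
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                for e in (0, 1):
                    P[a, b, c, e] = P1[a] * cond[a, b] * cond[a, c] * cond[a, e]
    return P

def mutual_info(Pab):
    """Mutual information of a 2-D joint pmf."""
    Pa = Pab.sum(axis=1, keepdims=True)
    Pb = Pab.sum(axis=0, keepdims=True)
    return float(np.sum(Pab * np.log(Pab / (Pa * Pb))))

for gamma in (0.05, 0.1, 0.2):
    P = star_joint(gamma)
    P_edge = P.sum(axis=(2, 3))      # P_{1,2}: an edge of the star
    P_nonedge = P.sum(axis=(0, 3))   # P_{2,3}: a non-edge
    print(gamma, mutual_info(P_edge), mutual_info(P_nonedge))
```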
[Fig. 2. Left: the true model, a symmetric star in which the mutual informations satisfy $I(P_{1,2}) = I(P_{1,3}) = I(P_{1,4})$ and $I(P_{e'}) < I(P_{1,2})$ for any non-edge $e'$. Right: comparison of the true and approximate crossover rates, plotted against $I(P_e) - I(P_{e'})$.]
We note from Fig. 2 that both rates increase as the difference $I(P_e) - I(P_{e'})$ increases. This is in line with our intuition: if $I(P_e) - I(P_{e'})$ is large, the crossover rate is also large. We also observe that when the difference is small, the true and approximate rates are close. This agrees with the assumptions of Theorem 6: if $P_{e,e'}$ satisfies the $\epsilon$-very noisy condition, then the mutual informations $I(P_e)$ and $I(P_{e'})$ are close, and the true and approximate crossover rates are also close. As the difference between the mutual informations increases, the true and approximate rates separate from each other.
VII. CONCLUSION
In this paper, we presented a solution to the problem of
finding the error exponent for ML tree structure learning by
employing tools from large-deviations theory combined with
facts about tree graphs. We quantified the error exponent for
learning the structure and exploited the structure of the true
tree to identify the dominant tree in the set of erroneous trees.
We also drew insights from the approximate crossover rate,
which can be interpreted as the SNR for learning. These two
main results in Theorems 3 and 6 provide the intuition as to
how errors occur for learning discrete tree distributions.
REFERENCES
[1] S. Lauritzen, Graphical Models. Oxford University Press, USA, 1996.
[2] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, May 1968.
[3] M. Dudik, S. Phillips, and R. Schapire, “Performance guarantees for
regularized maximum entropy density estimation,” in COLT, 2004.
[4] N. Meinshausen and P. Buehlmann, “High dimensional graphs and
variable selection with the Lasso,” Annals of Statistics, vol. 34, no. 3,
pp. 1436–1462, 2006.
[5] M. J. Wainwright, P. Ravikumar, and J. D. Lafferty, “High-Dimensional
Graphical Model Selection Using $\ell_1$-Regularized Logistic Regression,"
in NIPS, 2006, pp. 1465–1472.
[6] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, 2nd ed. Springer, 1998.
[7] S. Borade and L. Zheng, “Euclidean Information Theory,” in Allerton
Conference, 2007.
[8] C. K. Chow and T. Wagner, "Consistency of an estimate of tree-dependent probability distributions," IEEE Transactions on Information Theory, vol. 19, no. 3, pp. 369–371, May 1973.
[9] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to
Algorithms, 2nd ed. McGraw-Hill Science/Engineering/Math, 2003.
[10] J. N. Laneman, “On the Distribution of Mutual Information,” in Information Theory and Applications Workshop, 2006.