A Large-Deviation Analysis for the Maximum Likelihood Learning of Tree Structures

Vincent Y. F. Tan*, Animashree Anandkumar*†, Lang Tong† and Alan S. Willsky*
* Stochastic Systems Group, LIDS, MIT, Cambridge, MA 02139, Email: {vtan,animakum,willsky}@mit.edu
† ECE Dept., Cornell University, Ithaca, NY 14853, Email: {aa332@,ltong@ece.}cornell.edu

Abstract—The problem of maximum-likelihood learning of the structure of an unknown discrete distribution from samples is considered when the distribution is Markov on a tree. A large-deviation analysis of the error in estimating the set of edges of the tree is performed. Necessary and sufficient conditions are provided to ensure that this error probability decays exponentially. These conditions are based on the mutual information of each pair of variables being distinct from that of other pairs. The rate of error decay, or error exponent, is derived using the large-deviation principle. The error exponent is approximated using Euclidean information theory and is given by a ratio, to be interpreted as the signal-to-noise ratio (SNR) for learning. Numerical experiments show that the SNR approximation is accurate.

Index Terms—Large-deviations, Tree structure learning, Error exponents, Euclidean Information Theory.

I. INTRODUCTION

The estimation of a distribution from samples is a classical problem in machine learning and is challenging for high-dimensional multivariate distributions. In this respect, graphical models [1] provide a significant simplification of the joint distribution and incorporate a Markov structure in terms of a graph defined on the set of nodes. Many specialized algorithms exist for learning graphical models with sparse graphs. A special scenario is when the graph is a tree. In this case, the classical Chow-Liu algorithm [2] finds the maximum-likelihood (ML) estimate of the probability distribution by exploiting the Markov structure to learn the tree edges in terms of a maximum-weight spanning tree (MWST). The ML estimator learns the distribution correctly as we obtain more learning samples (consistency). These learning samples are drawn independently from the distribution.

In this paper, we study the performance of the ML estimator in terms of the nature of convergence with increasing sample size. Specifically, we are interested in the event that the set of edges of the ML tree is not the true set of edges. We address the following questions: Is there exponential decay of the error in structure learning as the number of samples goes to infinity? If so, what is the rate of decay of the error probability? How does the rate depend on the parameters of the distribution? Our analysis and answers to these questions provide insights into the distributions and the edges where we are more likely to make an error when learning the tree structure.

The first author is supported by A*STAR, Singapore, in part by a MURI funded through ARO Grant W911NF-06-1-0076 and in part under a MURI through AFOSR Grant FA9550-06-1-0324. The second author is supported by the IBM Ph.D. Fellowship for the year 2008-09 and is currently a visiting student at MIT, Cambridge, MA 02139. This work is also supported in part through collaborative participation in the Communications and Networks Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0011, and by the Army Research Office under Grant ARO-W911NF-06-1-0346.
Summary of Main Results and Related Work

Learning the structure of graphical models is an extensively studied problem (e.g., [2]-[5]). Previous works focus on establishing consistency of the estimators, while a few prove that the estimators have an exponential rate of error decay under some technical conditions in the Gaussian case [4]. However, to the best of our knowledge, none of these works quantifies the exact rate of decay of the error for structure learning.

Following the seminal work of Chow and Liu [2], a number of works have looked at learning graphical models. Using the maximum entropy principle as a learning technique, Dudik et al. [3] provide strong consistency guarantees on the learned distribution in terms of the log-likelihood of the samples. Wainwright et al. [5] propose a regularization method for learning the graph structure based on ℓ1-regularized logistic regression and provide theoretical guarantees for learning the correct structure as the number of samples, the number of variables, and the neighborhood size grow. Meinshausen and Buehlmann [4] consider learning the structure of Gaussian models and show that the error probability of getting the structure wrong decays exponentially; however, the rate is not provided.

The main contributions of this paper are three-fold. First, using the large-deviation principle (LDP) [6], we prove that the most likely error in ML estimation is a tree which differs from the true tree by a single edge. Second, again using the LDP, we derive the exact error exponent for ML estimation of tree structures. Third, using ideas from Euclidean information theory [7], we provide a succinct and intuitively appealing closed-form approximation for the error exponent which is tight in the very noisy learning regime, where the individual samples are not too informative about the tree structure. The approximate error exponent has a very intuitive interpretation as the signal-to-noise ratio (SNR) for learning. It corroborates the intuition that if the edges belonging to the true tree model are strongly distinguishable from the non-edges using the samples, we can expect the rate of decay of the error to be large. All the results are stated without proof. The reader may refer to http://web.mit.edu/vtan/www/isit09 for the details.

II. PROBLEM STATEMENT AND CHOW-LIU ALGORITHM

An undirected graphical model [1] is a probability distribution that factorizes according to the structure of an underlying undirected graph. More explicitly, a vector of random variables x = (x_1, ..., x_d) is said to be Markov on a graph G = (V, E) with V = {1, ..., d} and E ⊂ \binom{V}{2} if P(x_i | x_{V \ {i}}) = P(x_i | x_{N(i)}), where N(i) is the set of neighbors of x_i. In this paper, we are given a set of d-dimensional samples (or observations) x^n = {x_1, ..., x_n}, where each x_k ∈ X^d and X is a finite set. Each element x_k of the set of samples is drawn independently from some unknown, everywhere-positive distribution P whose support is X^d, that is, P ∈ P(X^d), the set of all distributions supported on X^d. We further assume that the graph of P, denoted T_P = (V, E_P), belongs to the set of spanning trees on d nodes, T^d. (In a spanning tree, none of the pairwise joint distributions P_{i,j} are allowed to be product distributions.) The set of d-dimensional tree distributions is denoted D(X^d, T^d). Tree distributions possess the following factorization property [1]:

P(x) = \prod_{i \in V} P_i(x_i) \prod_{(i,j) \in E_P} \frac{P_{i,j}(x_i, x_j)}{P_i(x_i) P_j(x_j)},   (1)

where P_i and P_{i,j} are the marginals on node i ∈ V and edge (i, j) ∈ E_P respectively.
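To make the factorization (1) concrete, here is a minimal sketch (written for this note, not taken from the paper) that evaluates P(x) for a three-node binary chain from hand-picked, consistent node and edge marginals; the helper names tree_joint, node_marg and edge_marg are illustrative.

```python
import numpy as np

def tree_joint(x, node_marg, edge_marg, edges):
    """Evaluate P(x) via the tree factorization (1).

    node_marg[i][a]         -- P_i(x_i = a)
    edge_marg[(i, j)][a, b] -- P_{i,j}(x_i = a, x_j = b)
    edges                   -- list of tree edges (i, j)
    """
    p = 1.0
    for i, a in enumerate(x):
        p *= node_marg[i][a]                     # product of node marginals
    for (i, j) in edges:
        # edge "correction" factors P_{ij} / (P_i P_j)
        p *= edge_marg[(i, j)][x[i], x[j]] / (node_marg[i][x[i]] * node_marg[j][x[j]])
    return p

# Toy binary chain 0 - 1 - 2 with consistent, hand-picked marginals.
pair = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                    # each variable is uniform on {0, 1}
node_marg = {i: np.array([0.5, 0.5]) for i in range(3)}
edge_marg = {(0, 1): pair, (1, 2): pair}
edges = [(0, 1), (1, 2)]

total = sum(tree_joint((a, b, c), node_marg, edge_marg, edges)
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # sums to 1.0 up to floating-point error
```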
Given xn , we define the empirical distribution of P to be Pb(x) = N (x|xn )/n, where N (x|xn ) is the number of times x ∈ X d occurred in xn . Using the samples xn , we can use the Chow-Liu algorithm [2], reviewed in Section II-A, to compute PML , the ML tree-structured distribution with edge set EML . It is known [8] that as n → ∞, EML approaches EP , the true tree structure. But at what rate does this happen for a given tree distribution P ? Is the error decay exponential? In this paper, we use the LargeDeviation Principle (LDP) [6] to quantify the exponential rate at which the probability of the error event An := {EML 6= EP } (2) decays to zero. In other words, given a distribution P and P := P n , we are interested to study the rate KP given by, 1 KP := lim − log P(An ), (3) n→∞ n whenever the limit exists. KP is the error exponent for the event An as defined in (2). When KP > 0, there is exponential decay of error probability in structure learning, and we would like to provide conditions to ensure this. In addition, we define the set of disjoint events Un (T ) that the graph of the ML estimate PML of the tree distribution is a tree T different from the true tree TP , i.e., ½ {TML = T }, if T ∈ T d \ {TP }, (4) Un (T ) := ∅, if T = TP . S From (2), An = T ∈T d \{TP } Un (T ). We also define the exponent for each error event Un (T ) as 1 Υ(T ) := lim − log P (Un (T )) . (5) n→∞ n Definition The dominant error tree TP∗ = (V, EP∗ ) is the spanning tree given by TP∗ = argmin Υ(T ). (6) T ∈T d \{TP } 1 In a spanning tree, none of pairwise joint distributions P i,j are allowed to be product distributions. In section IV, we characterize the dominant error tree and its relation to the error exponent KP in (3). A. Maximum-Likelihood Learning of Tree Distributions In this section, we review the Chow-Liu algorithm [2] for learning the ML tree distribution PML given a set of n samples xn drawn independently from a tree distribution P . The ChowLiu algorithm solves the following ML problem: PML = argmax n X Q∈D(X d ,T d ) k=1 log Q(xk ) = argmin D(Pb||Q). (7) Q∈D(X d ,T d ) For the above problem, the ML tree model will always be the projection of the empirical distribution Pb onto some tree structure [2], i.e., PML (xi ) = Pbi (xi ) for all i ∈ V and PML (xi , xj ) = Pbi,j (xi , xj ) for all (i, j) ∈ EML . Thus, the optimization problem in (7) reduces to a search for the tree structure of PML . By using the fact that the graph of Q is a tree, the KL divergence in (7) decomposes into a sum of mutual information quantities. Hence, if EQ is the edge set of the tree distribution Q, the optimization for the structure of PML is X EML = argmax I(Pbe ), (8) EQ :Q∈D(X d ,T d ) e∈EQ where I(Pbe ) is the empirical mutual information given the data xn , defined for e = (i, j) as: I(Pbe ) := X Pbi,j (xi , xj ) Pbi,j (xi , xj ) log . Pbi (xi )Pbj (xj ) (xi ,xj )∈X 2 (9) Note that EP is the structure obtained from the MWST algorithm without any error, i.e., with the true mutual informations n as edge weights. To solve (8), we use ¡V ¢the samples x to b compute I(Pe ), for each node pair e ∈ 2 given the empirical distribution Pb. Subsequently, we use these as the edge weights for the MWST problem in (8). Note that the search for the MWST is not the same as that for largest d − 1 mutual information quantities as one has to take into consideration the tree constraint. There are well-known algorithms [9] that solve the MWST problem in O(d2 log d) time. III. 
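The Chow-Liu procedure in (7)-(9) is simple to prototype. The sketch below (ours, not the authors' code) computes the empirical mutual informations (9) from an n × d integer array of samples and returns a maximum-weight spanning tree via a Kruskal-style union-find; production use would substitute the O(d^2 log d) MWST routines cited in [9].

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Empirical mutual information I(P-hat_e) in (9) for two discrete sample columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(samples):
    """Return a candidate E_ML in (8): max-weight spanning tree under empirical MI."""
    n, d = samples.shape
    weights = {(i, j): empirical_mi(samples[:, i], samples[:, j])
               for i, j in combinations(range(d), 2)}
    parent = list(range(d))            # Kruskal with union-find, heaviest edges first
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
            if len(edges) == d - 1:
                break
    return edges
```

Running chow_liu_edges on samples drawn from a tree distribution should, for large n, recover E_P, consistent with [8].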
III. LDP FOR EMPIRICAL MUTUAL INFORMATION

To compute the exponent for the error in structure learning K_P, we first consider a simpler event. Let e, e' ∈ \binom{V}{2} be any two node pairs satisfying I(P_e) > I(P_{e'}), where I(P_e) and I(P_{e'}) are the true mutual informations on node pairs e and e' respectively. We are interested in the crossover event of the empirical mutual informations, defined as

C_{e,e'} := \{ I(\hat{P}_e) \le I(\hat{P}_{e'}) \}.   (10)

The occurrence of this event may potentially lead to an error in structure learning, when E_ML differs from E_P. In the next section, we will see how these crossover events relate to K_P. Now, we would like to compute the crossover rate for empirical mutual informations J_{e,e'} as

J_{e,e'} := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(C_{e,e'}).   (11)

[Fig. 1. (a) The symmetric "star" graph with d = 9. (b) The path associated to the non-edge e' = (u, v) ∉ E_P, denoted Path(e'; E_P) ⊂ E_P, is the set of edges along the unique path linking the end points of e'. r(e') ∈ Path(e'; E_P) is the dominant replacement edge associated to e' ∉ E_P.]

Intuitively, if the difference between the true mutual informations I(P_e) − I(P_{e'}) is large, we expect the crossover rate to be large and, consequently, the probability of the crossover event to be small. Note that J_{e,e'} in (11) is not the same as K_P in (3). We will see that the rate J_{e,e'} depends not only on the mutual informations I(P_e) and I(P_{e'}), but also on the distribution P_{e,e'} of the variables on the node pairs e and e'.

Theorem 1 (LDP for Empirical MIs): Let the distribution on the two node pairs e and e' be P_{e,e'} ∈ P(X^4). (If e and e' share a node, P_{e,e'} ∈ P(X^3); this does not change the subsequent exposition significantly.) The crossover rate for empirical mutual information is

J_{e,e'} = \inf_{Q \in P(X^4)} \{ D(Q \,||\, P_{e,e'}) : I(Q_{e'}) = I(Q_e) \},   (12)

where Q_e, Q_{e'} ∈ P(X^2) are the marginals of Q on e and e' respectively, i.e., Q_e = \sum_{x_{e'}} Q and similarly for Q_{e'}. The infimum in (12) is attained for some Q^*_{e,e'} ∈ P(X^4) satisfying I(Q^*_{e'}) = I(Q^*_e).

The proof of this theorem hinges on Sanov's theorem [6, Sec. 6.2] and the contraction principle [6, Thm. 4.2.1]. We remark that the rate J_{e,e'} is strictly positive since we assumed, a priori, that e and e' satisfy I(P_e) > I(P_{e'}).

A. Example: The Symmetric "Star" Graph

It is instructive to study a simple example: the "star" graph, shown in Fig. 1(a), with E_P = {(1, i) : i = 2, ..., d}. We assign the joint distributions Q_a, Q_b ∈ P(X^2) and Q_{a,b} ∈ P(X^4) to the variables in this graph in the following specific way: (i) P_{1,i} = Q_a for all 2 ≤ i ≤ d. (ii) P_{i,j} = Q_b for all 2 ≤ i, j ≤ d. (iii) P_{1,i,j,k} = Q_{a,b} for all 2 ≤ i, j, k ≤ d. Thus, the joint distribution on the central node and any outer node is Q_a, while the joint distribution of any two outer nodes is Q_b. Note that Q_a and Q_b are the pairwise marginals of Q_{a,b}. Furthermore, we assume that I(Q_a) > I(Q_b) > 0.

Proposition 2: For the "star" graph with the distributions as described, K_P, the error exponent for structure learning, is

K_P = \inf_{R_{1,2,3,4} \in P(X^4)} \{ D(R_{1,2,3,4} \,||\, Q_{a,b}) : I(R_{1,2}) = I(R_{3,4}) \},

where R_{1,2} and R_{3,4} are the pairwise marginals of R_{1,2,3,4}.

In general, it is not easy to derive the error exponent K_P since crossover events for different node pairs affect the learned structure in a complex manner. We now provide an expression for K_P by identifying the dominant error tree.
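The variational problem (12) can be approximated numerically. The sketch below assumes binary variables, node pairs that do not share a node, and a strictly positive P_{e,e'} given as a (2,2,2,2) array; it calls scipy's generic SLSQP solver, so it returns a locally optimal value rather than a certified infimum, and the function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def mutual_info(P2):
    """Mutual information of a pairwise distribution given as an |X| x |X| array."""
    Pi, Pj = P2.sum(axis=1), P2.sum(axis=0)
    mask = P2 > 0
    return float(np.sum(P2[mask] * np.log(P2[mask] / np.outer(Pi, Pj)[mask])))

def crossover_rate(P4):
    """Numerically approximate the infimum in (12) for P_{e,e'} of shape (2,2,2,2)."""
    p = P4.ravel()                                  # assumed strictly positive

    def kl(q):                                      # objective D(Q || P_{e,e'})
        q = np.clip(q, 1e-12, None)
        return float(np.sum(q * np.log(q / p)))

    def mi_gap(q):                                  # constraint I(Q_e) = I(Q_{e'})
        Q = q.reshape(P4.shape)
        Qe, Qep = Q.sum(axis=(2, 3)), Q.sum(axis=(0, 1))
        return mutual_info(Qe) - mutual_info(Qep)

    cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0},
            {"type": "eq", "fun": mi_gap}]
    res = minimize(kl, p, method="SLSQP", bounds=[(1e-12, 1.0)] * p.size,
                   constraints=cons, options={"maxiter": 500})
    return res.fun
```

Because the constraint set is non-convex, the returned value is only a local estimate; restarting from perturbed initial points gives some confidence that the infimum has been reached.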
IV. ERROR EXPONENT FOR STRUCTURE LEARNING

The analysis in the previous section characterized the rate for the crossover event C_{e,e'} for empirical mutual information. In this section, we connect these events {C_{e,e'}} to the quantity of interest, K_P in (3). Not all the events in the set {C_{e,e'}} contribute to the overall error event A_n in (2), because of the global spanning tree constraint on the learned structure E_ML.

A. Identifying the Dominant Error Tree

Definition: Given a node pair e' ∉ E_P, its dominant replacement edge r(e') ∈ E_P is the edge along the unique path connecting the nodes in e' (denoted Path(e'; E_P); see Fig. 1(b)) having minimum crossover rate, i.e.,

r(e') := \operatorname*{argmin}_{e \in Path(e'; E_P)} J_{e,e'}.   (13)

Note that if we replace the true edge r(e') by a non-edge e', the tree constraint is still satisfied. This is important since such replacements lead to an error in structure learning.

Theorem 3 (Error Exponent as a Single Crossover Event): A dominant error tree T_P^* = (V, E_P^*) has edge set E_P^* = E_P ∪ {e^*} \ {r(e^*)}, where

e^* := \operatorname*{argmin}_{e' \notin E_P} J_{r(e'),e'},   (14)

with r(e'), defined in (13), being the dominant replacement edge associated with e' ∉ E_P. Furthermore, the error exponent K_P, the rate at which the probability that the ML tree E_ML differs from the true tree structure E_P decays, is given by

K_P = J_{r(e^*),e^*} = \min_{e' \notin E_P} \; \min_{e \in Path(e'; E_P)} J_{e,e'},   (15)

where e^* ∉ E_P is given in (14).

The above theorem relates the set of crossover rates {J_{e,e'}}, which we characterized in the previous section, to the overall error exponent K_P, defined in (3). We see that the dominant error tree T_P^* differs from the true tree T_P in exactly one edge. Note that the result in (15) is exponentially tight in the sense that \mathbb{P}(A_n) \doteq \exp(-n K_P). From (15), we see that if at least one of the crossover rates J_{e,e'} is zero, then the overall error exponent K_P is zero. This observation is important for the derivation of necessary and sufficient conditions for K_P to be positive and, hence, for the error probability to decay exponentially in the number of samples n.

B. Conditions for Exponential Decay

We now provide necessary and sufficient conditions that ensure that K_P is strictly positive. This is obviously of crucial importance since if K_P > 0, we have exponential decay of the probability of error.

Theorem 4 (Conditions for exponential decay): The following three statements are equivalent.
(a) The error probability decays exponentially, i.e., K_P > 0.
(b) The mutual information quantities satisfy I(P_{e'}) ≠ I(P_e) for all e ∈ Path(e'; E_P) and all e' ∉ E_P.
(c) T_P is not a proper forest, as assumed in Section II. (A proper forest on d nodes is an undirected, acyclic graph that has strictly fewer than d − 1 edges.)

Condition (b) states that, for each non-edge e', we need I(P_{e'}) to be different from the mutual information of its dominant replacement edge, I(P_{r(e')}). Condition (c) is a more intuitive condition for exponential decay of the probability of error \mathbb{P}(A_n). This is an important result since it says that for any non-degenerate tree distribution in which none of the pairwise joint distributions is a product distribution, we have exponential decay of the probability of error.

C. Computational Complexity to Compute the Error Exponent

We now provide an upper bound on the complexity of computing K_P. The diameter of the tree T_P = (V, E_P) is

\zeta(T_P) := \max_{u,v \in V} L(u, v),   (16)

where L(u, v) is the length (number of hops) of the unique path between nodes u and v. For example, L(u, v) = 4 for the non-edge e' = (u, v) in the subtree in Fig. 1(b).
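The double minimization in (15) (equivalently, (13)-(14)) translates directly into code. The sketch below assumes the crossover rates J_{e,e'} are supplied as a callable, for instance the numerical routine sketched after Theorem 1; it also makes concrete the count of J_{e,e'} evaluations bounded in Theorem 5, stated next.

```python
from collections import deque
from itertools import combinations

def tree_path(tree_edges, u, v):
    """Edges of Path(e'; E_P): the unique path between u and v in the tree (BFS)."""
    adj = {}
    for a, b in tree_edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev, queue = {u: None}, deque([u])
    while queue:
        w = queue.popleft()
        for nb in adj[w]:
            if nb not in prev:
                prev[nb] = w
                queue.append(nb)
    path, w = [], v
    while prev[w] is not None:
        path.append((prev[w], w))
        w = prev[w]
    return path

def error_exponent(nodes, tree_edges, J):
    """K_P in (15): min over non-edges e' of min over Path(e'; E_P) of J(e, e')."""
    tree_set = {frozenset(e) for e in tree_edges}
    best = float("inf")
    for e_prime in combinations(nodes, 2):
        if frozenset(e_prime) in tree_set:
            continue                               # only non-edges e' are considered
        for e in tree_path(tree_edges, *e_prime):
            best = min(best, J(e, e_prime))        # r(e') attains the inner minimum
    return best
```

Recording the minimizing pair alongside the running minimum yields e^* and r(e^*), and hence the dominant error tree of Theorem 3.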
Theorem 5 (Computational Complexity): The number of computations of J_{e,e'} required to determine K_P is upper bounded by (1/2) \zeta(T_P) (d − 1)(d − 2).

Thus, if the diameter of the tree is relatively small and independent of the number of nodes d, the complexity is quadratic in d. For instance, for a "star" network, the diameter is \zeta(T_P) = 2. For a balanced tree, \zeta(T_P) = O(log d), hence the number of computations is O(d^2 log d). The complexity is vastly reduced compared to an exhaustive search, which requires d^{d-2} − 1 computations, since there are d^{d-2} trees on d nodes.

V. EUCLIDEAN APPROXIMATIONS

In order to gain more insight into the error exponent, we make use of Euclidean approximations [7] of information-theoretic quantities to obtain an approximate but closed-form solution to (12), which is non-convex and hard to solve exactly. To this end, we first approximate the crossover rate J_{e,e'}. This will allow us to understand which pairs of edges have a higher probability of crossing over and hence result in an error as defined in (2). It turns out that J_{e,e'} intuitively depends on the "separation" of the mutual information values, as well as on the uncertainty of the mutual information estimates.

Roughly speaking, our strategy is to "convexify" the objective and the constraints in (12). To do so, we recall that if P and Q are two distributions with the same support, the KL divergence can be approximated [7] by

D(Q \,||\, P) \approx \frac{1}{2} \| Q - P \|_P^2,   (17)

where \|y\|_w^2 denotes the weighted squared norm of y, i.e., \|y\|_w^2 = \sum_i y_i^2 / w_i. This approximation is tight whenever P ≈ Q. By ≈, we mean that P is close to Q entry-wise, i.e., \|P − Q\|_∞ < ε for some small ε > 0. In fact, if P ≈ Q, then D(P || Q) ≈ D(Q || P). We will also need the following notion.

Definition: Given a joint distribution P_e = P_{i,j} on X^2 with marginals P_i and P_j, the information density [10] function, denoted s_{i,j} : X^2 → R, is defined as

s_{i,j}(x_i, x_j) := \log \frac{P_{i,j}(x_i, x_j)}{P_i(x_i) P_j(x_j)}, \quad \forall (x_i, x_j) \in X^2.   (18)

If e = (i, j), we will also use the notation s_e(x_i, x_j) = s_{i,j}(x_i, x_j). The mutual information is the expectation of the information density, i.e., I(P_e) = E[s_e]. Recall that we also assumed in Section II that T_P is a spanning tree, which implies that for all node pairs (i, j), P_{i,j} is not a product distribution, i.e., P_{i,j} ≠ P_i P_j, because if it were, then T_P would be disconnected. We now define a condition under which our approximation holds.

Definition: We say that P_{e,e'} ∈ P(X^4), the joint distribution on node pairs e and e', satisfies the ε-very noisy condition if

\| P_e - P_{e'} \|_\infty := \max_{(x_i, x_j) \in X^2} | P_e(x_i, x_j) - P_{e'}(x_i, x_j) | < \varepsilon.   (19)

This condition is needed because if (19) holds, then by continuity of the mutual information there exists a δ > 0 such that |I(P_e) − I(P_{e'})| < δ, which means that the mutual information quantities are difficult to distinguish and the approximation in (17) is accurate. Note that proximity of the mutual informations is not sufficient for the approximation to hold, since we have seen from Theorem 1 that J_{e,e'} depends not only on the mutual informations but also on P_{e,e'}.
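As a quick sanity check of (17), the following snippet compares D(Q || P) with the weighted squared norm for a hand-picked pmf P and zero-sum perturbations of decreasing entrywise size ε; the setup is illustrative and not from the paper.

```python
import numpy as np

def kl(P, Q):
    """KL divergence D(P || Q) for strictly positive pmfs."""
    return float(np.sum(P * np.log(P / Q)))

def weighted_sq_norm(P, Q):
    """Right-hand side of (17): (1/2) * sum_x (Q(x) - P(x))^2 / P(x)."""
    return 0.5 * float(np.sum((Q - P) ** 2 / P))

rng = np.random.default_rng(0)
P = np.array([0.3, 0.2, 0.4, 0.1])
delta = rng.normal(size=P.size)
delta -= delta.mean()                   # zero-sum direction keeps Q normalized
delta /= np.abs(delta).max()            # entrywise magnitude at most 1

for eps in (0.05, 0.01, 0.001):
    Q = P + eps * delta                 # ||P - Q||_inf <= eps; still a valid pmf here
    print(eps, kl(Q, P), weighted_sq_norm(P, Q))
```

As ε shrinks, the two printed values approach each other, which is exactly the regime in which the convexified problem below is trustworthy.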
We now define the approximate crossover rate on e and e' as

\tilde{J}_{e,e'} := \inf_{Q \in P(X^4)} \left\{ \frac{1}{2} \| Q - P_{e,e'} \|_{P_{e,e'}}^2 : Q \in \mathcal{Q}(P_{e,e'}) \right\},   (20)

where the (linearized) constraint set is

\mathcal{Q}(P_{e,e'}) := \{ Q \in P(X^4) : I(P_{e'}) + \langle \nabla_{P_{e'}} I(P_{e'}), Q - P_{e,e'} \rangle = I(P_e) + \langle \nabla_{P_e} I(P_e), Q - P_{e,e'} \rangle \},   (21)

where \nabla_{P_e} I(P_e) is the gradient vector of the mutual information with respect to the joint distribution P_e. We also define the approximate error exponent as

\tilde{K}_P := \min_{e' \notin E_P} \; \min_{e \in Path(e'; E_P)} \tilde{J}_{e,e'}.   (22)

We now provide the expression for the approximate crossover rate \tilde{J}_{e,e'} and state the conditions under which the approximation is asymptotically accurate. (We say that a family of approximations {\tilde{\theta}(\varepsilon) : \varepsilon > 0} of a true parameter \theta is asymptotically accurate if the approximations converge to \theta as \varepsilon → 0.)

Theorem 6 (Euclidean approximation of J_{e,e'}): The approximate crossover rate for the empirical mutual information quantities, defined in (20), is given by

\tilde{J}_{e,e'} = \frac{(E[s_{e'} - s_e])^2}{2 \operatorname{Var}(s_{e'} - s_e)} = \frac{(I(P_{e'}) - I(P_e))^2}{2 \operatorname{Var}(s_{e'} - s_e)},   (23)

where s_e is the information density defined in (18) and the expectations are with respect to P_{e,e'}. The approximation \tilde{J}_{e,e'} is asymptotically accurate, i.e., as ε → 0, \tilde{J}_{e,e'} → J_{e,e'}.

Corollary 7 (Euclidean approximation of K_P): The approximation \tilde{K}_P is asymptotically accurate if either of the following two conditions holds.
(a) The joint distribution P_{r(e'),e'} satisfies the ε-very noisy condition for every e' ∉ E_P.
(b) The joint distribution P_{r(e^*),e^*} satisfies the ε-very noisy condition, but all the other joint distributions on the non-edges e' ∉ E_P ∪ {e^*} and their dominant replacement edges r(e') do not satisfy the ε-very noisy condition.

Hence, the expressions for the crossover rate J_{e,e'} and the error exponent K_P are vastly simplified under the ε-very noisy condition on the joint distributions P_{e,e'}.
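The closed form (23) is straightforward to evaluate exactly for small alphabets. The sketch below assumes a strictly positive joint P_{e,e'} on two disjoint node pairs, stored as a (2,2,2,2) array, and computes the information densities (18) and the SNR expression (23); the function names are ours.

```python
import numpy as np

def info_density(P2):
    """s_{i,j} in (18) as an |X| x |X| table for a pairwise distribution P2."""
    Pi, Pj = P2.sum(axis=1), P2.sum(axis=0)
    return np.log(P2 / np.outer(Pi, Pj))           # assumes P2 strictly positive

def approx_crossover_rate(P4):
    """The approximate crossover rate (23) for P_{e,e'} stored as a (2,2,2,2) array."""
    Pe, Pep = P4.sum(axis=(2, 3)), P4.sum(axis=(0, 1))
    se = info_density(Pe)[:, :, None, None]        # s_e as a function of all four symbols
    sep = info_density(Pep)[None, None, :, :]      # s_{e'} likewise
    diff = sep - se
    mean = float(np.sum(P4 * diff))                # E[s_{e'} - s_e] = I(P_{e'}) - I(P_e)
    var = float(np.sum(P4 * diff ** 2)) - mean ** 2
    return mean ** 2 / (2.0 * var)
```

Comparing this value with the numerical estimate of J_{e,e'} from the routine sketched after Theorem 1 should reproduce the behavior predicted by Theorem 6: close agreement when P_e and P_{e'} are entrywise close, and growing separation as the mutual information gap widens.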
The approximate crossover rate \tilde{J}_{e,e'} in (23) has a very intuitive meaning. It is proportional to the square of the difference between the mutual information quantities of P_e and P_{e'}. This corresponds exactly to our initial intuition: if I(P_e) and I(P_{e'}) are well separated (I(P_e) ≫ I(P_{e'})), then the crossover rate has to be large. \tilde{J}_{e,e'} is also weighted by the precision (inverse variance) of (s_{e'} − s_e). If this variance is large, then we are uncertain about the estimate I(\hat{P}_e) − I(\hat{P}_{e'}), and crossovers are more likely, thereby reducing the crossover rate \tilde{J}_{e,e'}.

The expression in (23) is, in fact, the SNR for the estimation of the difference between empirical mutual information quantities. This answers one of the fundamental questions we posed in the introduction: we are now able to distinguish between distributions that are "easy" to learn and those that are "difficult" by computing the set of SNR quantities in (22).

We now comment on our assumption that P_{e,e'} satisfies the ε-very noisy condition, under which the approximation is tight. When P_{e,e'} is ε-very noisy, we have |I(P_e) − I(P_{e'})| < δ, so it is very hard to distinguish the relative magnitudes of I(\hat{P}_e) and I(\hat{P}_{e'}); the particular problem of learning the distribution P_{e,e'} from samples is very noisy. Under these conditions, the approximation in (23) is accurate. In fact, the ratio of the approximate crossover rate to the true crossover rate approaches unity as ε → 0.

VI. NUMERICAL EXPERIMENTS

In this section, we perform numerical experiments to study the accuracy of the Euclidean approximations. We do this by analyzing the regimes in which \tilde{J}_{e,e'} in (23) is close to the true crossover rate J_{e,e'} in (12). We parameterize a symmetric "star" graph distribution with d = 4 variables and a single parameter γ > 0. We let X = {0, 1}, i.e., all the variables are binary. The central node is x_1 and the outer nodes are x_2, x_3, x_4. For the parameters, we set P_1(x_1 = 0) = 1/3 and

P_{i|1}(x_i = 0 \,|\, x_1 = a) = \frac{1 + (-1)^a \gamma}{2}, \quad a = 0, 1,   (24)

for i = 2, 3, 4. With this parameterization, we see that if γ is small, the mutual information I(P_{1,i}) for i = 2, 3, 4 is also small. In fact, if γ = 0, x_1 is independent of x_i and, as a result, I(P_{1,i}) = 0. Conversely, if γ is large, the mutual information I(P_{1,i}) increases as the dependence of the outer nodes on the central node increases. Thus, we can vary the size of the mutual information along the edges by varying γ. By symmetry, there is only one crossover rate and hence it is also the error exponent for the error event A_n in (2).

We vary γ from 0 to 0.2 and plot both the true and approximate rates against I(P_e) − I(P_{e'}) in Fig. 2, where e denotes any edge and e' denotes any non-edge.

[Fig. 2. Left: The true model is a symmetric star where the mutual informations satisfy I(P_{1,2}) = I(P_{1,3}) = I(P_{1,4}) and I(P_{e'}) < I(P_{1,2}) for any non-edge e'. Right: Comparison of the true and approximate crossover rates J_{e,e'} plotted against I(P_e) − I(P_{e'}).]

We note from Fig. 2 that both rates increase as the difference I(P_e) − I(P_{e'}) increases. This is in line with our intuition, because if I(P_e) − I(P_{e'}) is large, the crossover rate is also large. We also observe that if the difference is small, the true and approximate rates are close. This is in line with the assumptions of Theorem 6: if P_{e,e'} satisfies the ε-very noisy condition, then the mutual informations I(P_e) and I(P_{e'}) are close, and the true and approximate crossover rates are also close. When the difference between the mutual informations increases, the true and approximate rates separate from each other.

VII. CONCLUSION

In this paper, we presented a solution to the problem of finding the error exponent for ML tree structure learning by employing tools from large-deviations theory combined with facts about tree graphs. We quantified the error exponent for learning the structure and exploited the structure of the true tree to identify the dominant tree in the set of erroneous trees. We also drew insights from the approximate crossover rate, which can be interpreted as the SNR for learning. The two main results, Theorems 3 and 6, provide intuition as to how errors occur when learning discrete tree distributions.

REFERENCES
[1] S. Lauritzen, Graphical Models. Oxford University Press, USA, 1996.
[2] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462-467, May 1968.
[3] M. Dudik, S. Phillips, and R. Schapire, "Performance guarantees for regularized maximum entropy density estimation," in COLT, 2004.
[4] N. Meinshausen and P. Buehlmann, "High dimensional graphs and variable selection with the Lasso," Annals of Statistics, vol. 34, no. 3, pp. 1436-1462, 2006.
[5] M. J. Wainwright, P. Ravikumar, and J. D. Lafferty, "High-Dimensional Graphical Model Selection Using ℓ1-Regularized Logistic Regression," in NIPS, 2006, pp. 1465-1472.
[6] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, 2nd ed. Springer, 1998.
[7] S. Borade and L. Zheng, "Euclidean Information Theory," in Allerton Conference, 2007.
[8] C. K. Chow and T. Wagner, "Consistency of an estimate of tree-dependent probability distributions," IEEE Transactions on Information Theory, vol. 19, no. 3, pp. 369-371, May 1973.
[9] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. McGraw-Hill Science/Engineering/Math, 2003.
[10] J. N. Laneman, "On the Distribution of Mutual Information," in Information Theory and Applications Workshop, 2006.