Necessary and Sufficient Conditions for High-Dimensional Salient Feature Subset Recovery

Vincent Y. F. Tan, Matthew Johnson, and Alan S. Willsky
Stochastic Systems Group, LIDS, MIT, Cambridge, MA 02139. Email: {vtan,mattjj,willsky}@mit.edu

Abstract—We consider recovering the salient feature subset for distinguishing between two probability models from i.i.d. samples. Identifying the salient set improves discrimination performance and reduces complexity. The focus in this work is on the high-dimensional regime where the number of variables d, the number of salient variables k and the number of samples n all grow. The definition of saliency is motivated by error exponents in a binary hypothesis test and is stated in terms of relative entropies. It is shown that if n grows faster than $\max\{c\,k\log((d-k)/k), \exp(c'k)\}$ for constants $c, c'$, then the error probability in selecting the salient set can be made arbitrarily small. Thus, n can be much smaller than d. The exponential rate of decay and converse theorems are also provided. An efficient and consistent algorithm is proposed when the distributions are graphical models which are Markov on trees.

Index Terms—Salient feature subset, High-dimensional, Error exponents, Binary hypothesis testing, Tree distributions.

I. INTRODUCTION

Consider the following scenario: there are 1000 children participating in a longitudinal study of childhood asthma, of which 500 are asthmatic and the other 500 are not. For each child, $10^6$ measurements of possibly relevant features (e.g., genetic, environmental, physiological) are taken, but only a very small subset of these (say 30) is useful in predicting whether the child has asthma. This example is, in fact, modeled after a real-life large-scale experiment, the Manchester Asthma and Allergy Study (http://www.maas.org.uk/). The correct identification and subsequent interpretation of this salient subset is important to clinicians for assessing the susceptibility of other children to asthma. We expect that by focusing only on the 30 salient features, we can improve discrimination and reduce the computational cost of coming up with a decision rule. Indeed, when the salient set is small compared to the overall dimension ($10^6$), we also expect to be able to estimate the salient set with a small number of samples. This general problem is also known as feature subset selection [1].
In this paper, we derive, from an information-theoretic perspective, necessary and sufficient conditions under which the salient set can be recovered with arbitrarily low error probability in the high-dimensional regime, i.e., when the number of samples n, the number of variables d and the number of salient variables k all grow. Intuitively, we expect that if k and d do not grow too quickly with n, then consistent recovery is possible. However, in this paper, we focus on the interesting case where $d \gg n, k$, which is most relevant to problems such as the asthma example above.

Motivated by the Chernoff-Stein lemma [2, Ch. 11] for a binary hypothesis test under the Neyman-Pearson framework, we define the notion of saliency for distinguishing between two probability distributions. We show that this definition of saliency can also be motivated by the same hypothesis testing problem under the Bayesian framework, in which the overall error probability is minimized. For the asthma example, intuitively, a feature is salient if it is useful in predicting whether a child has asthma, and we expect the number of salient features to be very small. Also, conditioned on the salient features, the non-salient ones should not contribute to the distinguishability of the classes. Our mathematical model and definition of saliency in terms of the KL-divergence (or Chernoff information) capture this intuition.

There are three main contributions in this work. Firstly, we provide sufficient conditions on the scaling of the model parameters (n, d, k) so that the salient set is recoverable asymptotically. Secondly, by modeling the salient set as a uniform random variable (over all sets of size k), we derive a necessary condition that any decoder must satisfy in order to recover the salient set. Thirdly, in light of the fact that the exhaustive search decoder is computationally infeasible, we examine the case in which the underlying distributions are Markov on trees and derive efficient tree-based combinatorial optimization algorithms to search for the salient set.

The literature on feature subset selection (or variable extraction) is vast; see [1] (and references therein) for a thorough review of the field. Traditional methods include the so-called wrapper (assessing different subsets for their usefulness in predicting the class) and filter (ranking) methods. Our definition of saliency is related to the minimum-redundancy, maximum-relevancy model in [3] and the notion of Markov blankets in [4], but is expressed using information-theoretic quantities motivated by hypothesis testing. The algorithm suggested in [5] keeps the generalization error small even in the presence of a large number of irrelevant features, whereas this paper focuses on exact recovery of the salient set given scaling laws on (n, d, k). This work is also related to [6] on sparsity pattern recovery (or compressed sensing) but does not assume the linear observation model. Rather, samples are drawn from two arbitrary discrete multivariate probability distributions.
II. NOTATION, SYSTEM MODEL AND DEFINITIONS

Let $\mathcal{X}$ be a finite set and let $\mathcal{P}(\mathcal{X}^d)$ denote the set of discrete distributions supported on $\mathcal{X}^d$. Let $\{P^{(d)}, Q^{(d)}\}_{d\in\mathbb{N}}$ be two sequences of distributions, where $P^{(d)}, Q^{(d)} \in \mathcal{P}(\mathcal{X}^d)$ are the distributions of the $d$-dimensional random vectors $\mathbf{x}, \mathbf{y}$ respectively. Let $V_d := \{1, \ldots, d\}$ be the index set. For a vector $\mathbf{x} \in \mathcal{X}^d$ and a subset $A \subset V_d$, $\mathbf{x}_A$ is the length-$|A|$ subvector consisting of the elements indexed by $A$, and $A^c := V_d \setminus A$. In addition, for a subset $A \subset V_d$, let $P^{(d)}_A$ be the marginal of the subset of random variables in $A$, i.e., of the random vector $\mathbf{x}_A$. Each index $i \in V_d$, associated with the marginals $(P^{(d)}_i, Q^{(d)}_i)$, is generically called a feature.

We assume that for each pair $(P^{(d)}, Q^{(d)})$, there exists a set of $n$ i.i.d. samples $(\mathbf{x}^n, \mathbf{y}^n) := (\{\mathbf{x}^{(l)}\}_{l=1}^n, \{\mathbf{y}^{(l)}\}_{l=1}^n)$ drawn from $P^{(d)} \times Q^{(d)}$. Each sample $\mathbf{x}^{(l)}$ (and also $\mathbf{y}^{(l)}$) belongs to $\mathcal{X}^d$. Our goal is to distinguish between $P^{(d)}$ and $Q^{(d)}$ using the samples. Note that for each $d$, this setup is analogous to binary classification, where one does not have access to the underlying distributions but only to samples from them. We suppress the dependence of $(\mathbf{x}^n, \mathbf{y}^n)$ on the dimensionality $d$ when the lengths of the vectors are clear from the context.

A. Definition of the Salient Set of Features

We now motivate the notion of saliency (and the salient set) by considering the following binary hypothesis testing problem. There are $n$ i.i.d. $d$-dimensional samples $\mathbf{z}^n := \{\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(n)}\}$ drawn from either $P^{(d)}$ or $Q^{(d)}$, i.e.,

$$H_0: \mathbf{z}^n \sim P^{(d)}, \qquad H_1: \mathbf{z}^n \sim Q^{(d)}. \qquad (1)$$

The Chernoff-Stein lemma [2, Theorem 11.8.3] says that the error exponent for (1) under the Neyman-Pearson formulation is the KL-divergence

$$D(P^{(d)} \,\|\, Q^{(d)}) := \sum_{\mathbf{z}} P^{(d)}(\mathbf{z}) \log \frac{P^{(d)}(\mathbf{z})}{Q^{(d)}(\mathbf{z})}. \qquad (2)$$

More precisely, if the probability of false alarm $P_{\mathrm{FA}} = \Pr(\hat{H}_1 \,|\, H_0)$ is kept below $\alpha$, then the probability of mis-detection $P_{\mathrm{M}} = \Pr(\hat{H}_0 \,|\, H_1)$ tends to zero exponentially fast as $n \to \infty$ with exponent given by $D(P^{(d)} \| Q^{(d)})$ in (2). In the Bayesian formulation, we seek to minimize the overall probability of error $\Pr(\mathrm{err}) = \Pr(H_0) P_{\mathrm{FA}} + \Pr(H_1) P_{\mathrm{M}}$, where $\Pr(H_0)$ and $\Pr(H_1)$ are the prior probabilities of hypotheses $H_0$ and $H_1$ respectively. It is known [2, Theorem 11.9.1] that in this case, the error exponent governing the rate of decay of $\Pr(\mathrm{err})$ with the sample size $n$ is the Chernoff information between $P^{(d)}$ and $Q^{(d)}$:

$$D^*(P^{(d)}, Q^{(d)}) := -\min_{t\in[0,1]} \log \sum_{\mathbf{z}} \big(P^{(d)}(\mathbf{z})\big)^t \big(Q^{(d)}(\mathbf{z})\big)^{1-t}. \qquad (3)$$

Similar to the KL-divergence, $D^*(P^{(d)}, Q^{(d)})$ is a measure of the separability of the distributions. It is symmetric in the distributions but is still not a metric.

Given the form of the error exponents in (2) and (3), we would like to identify a size-$k$ subset of features $S_d \subset V_d$ that "maximally distinguishes" between $P^{(d)}$ and $Q^{(d)}$. This motivates the following definitions.

Definition 1 (KL-divergence Salient Set): A subset $S_d \subset V_d$ of size $k$ is KL-divergence salient (or simply salient) if

$$D(P^{(d)} \,\|\, Q^{(d)}) = D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d}). \qquad (4)$$

Thus, conditioned on the variables in the salient set $S_d$ (with $|S_d| = k$ for some $1 \le k \le d$), the variables in the complement $S_d^c$ do not contribute to the distinguishability (in terms of the KL-divergence) of $P^{(d)}$ and $Q^{(d)}$.

Definition 2 (Chernoff information Salient Set): A subset $S_d \subset V_d$ of size $k$ is Chernoff information salient if

$$D^*(P^{(d)}, Q^{(d)}) = D^*(P^{(d)}_{S_d}, Q^{(d)}_{S_d}). \qquad (5)$$

Thus, given the variables in $S_d$, the remaining variables in $S_d^c$ do not contribute to the Chernoff information defined in (3).

A natural question to ask is whether the two definitions above are equivalent. We claim the following lemma.

Lemma 1 (Equivalence of Saliency Definitions): For a subset $S_d \subset V_d$ of size $k$, the following are equivalent:
S1: $S_d$ is KL-divergence salient.
S2: $S_d$ is Chernoff information salient.
S3: $P^{(d)}$ and $Q^{(d)}$ admit the decompositions

$$P^{(d)} = P^{(d)}_{S_d} \cdot W_{S_d^c | S_d}, \qquad Q^{(d)} = Q^{(d)}_{S_d} \cdot W_{S_d^c | S_d}. \qquad (6)$$

Lemma 1 is proved using Hölder's inequality and Jensen's inequality.¹

¹ Due to space constraints, the proofs of the results are not included here but can be found at http://web.mit.edu/vtan/www/isit10.
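To make Lemma 1 concrete, the following sketch (not from the paper; all numbers and names are illustrative choices) builds a toy pair (P, Q) on two binary variables through the decomposition (6) with salient set S = {0}, and checks numerically that the KL-divergence equality (4) and the Chernoff-information equality (5) both hold.

```python
# A toy numerical check of Lemma 1 (illustrative numbers only). We build
# P(x0, x1) = P_S(x0) W(x1|x0) and Q(x0, x1) = Q_S(x0) W(x1|x0), which share
# the conditional W as in (6), and verify (4) and (5) for S = {0}.
import numpy as np

P_S = np.array([0.8, 0.2])            # salient marginal under P
Q_S = np.array([0.3, 0.7])            # salient marginal under Q
W = np.array([[0.9, 0.1],             # common conditional W(x1 | x0)
              [0.4, 0.6]])

P = P_S[:, None] * W                  # joint P(x0, x1)
Q = Q_S[:, None] * W                  # joint Q(x0, x1)

def kl(p, q):
    """KL-divergence D(p || q) of two distributions given as arrays."""
    p, q = p.ravel(), q.ravel()
    return float(np.sum(p * np.log(p / q)))

def chernoff(p, q, grid=10001):
    """Chernoff information -min_t log sum_z p(z)^t q(z)^(1-t), by grid search over t."""
    p, q = p.ravel(), q.ravel()
    ts = np.linspace(0.0, 1.0, grid)
    return float(-min(np.log(np.sum(p**t * q**(1.0 - t))) for t in ts))

print(kl(P, Q), kl(P_S, Q_S))              # equal, so (4) holds with S = {0}
print(chernoff(P, Q), chernoff(P_S, Q_S))  # equal, so (5) holds with S = {0}
```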
Observe from (6) that the conditionals $W_{S_d^c|S_d}$ of both models are identical. Consequently, the likelihood ratio test (LRT) [7, Sec. 3.4] between $P^{(d)}$ and $Q^{(d)}$ depends solely on the marginals of the salient set $S_d$; that is,

$$\frac{1}{n} \sum_{l=1}^{n} \log \frac{P^{(d)}(\mathbf{z}^{(l)})}{Q^{(d)}(\mathbf{z}^{(l)})} \;=\; \frac{1}{n} \sum_{l=1}^{n} \log \frac{P^{(d)}_{S_d}(\mathbf{z}^{(l)}_{S_d})}{Q^{(d)}_{S_d}(\mathbf{z}^{(l)}_{S_d})} \;\overset{\hat{H}=H_0}{\underset{\hat{H}=H_1}{\gtrless}}\; \gamma_n \qquad (7)$$

is the most powerful test [2, Ch. 11] of fixed size $\alpha$ for threshold $\gamma_n$.² Also, the inclusion of any non-salient subset of features $B \subset S_d^c$ leaves the likelihood ratio in (7) unchanged, i.e., $P^{(d)}_{S_d}/Q^{(d)}_{S_d} = P^{(d)}_{S_d \cup B}/Q^{(d)}_{S_d \cup B}$. Moreover, correctly identifying $S_d$ from the set of samples $(\mathbf{x}^n, \mathbf{y}^n)$ results in a reduction in the number of relevant features, which is advantageous for the design of parsimonious and efficient binary classifiers.

² We have implicitly assumed that the distributions $P^{(d)}, Q^{(d)}$ are nowhere zero and consequently the conditional $W_{S_d^c|S_d}$ is also nowhere zero.

Because of this equivalence of the definitions of saliency (in terms of the Chernoff-Stein exponent in (2) and the Chernoff information in (3)), if we have successfully identified the salient set in (4), we have also found the subset that maximizes the error exponent associated with the overall probability of error $\Pr(\mathrm{err})$. In our results, we find that the characterization of saliency in terms of (4) is more convenient than its equivalent characterization in (5). Finally, we emphasize that the number of variables $d$ and the number of salient variables $k = |S_d|$ can grow as functions of $n$, i.e., $d = d(n)$ and $k = k(n)$. In the sequel, we provide necessary and sufficient conditions for the asymptotic recovery of $S_d$ as the model parameters scale, i.e., when $(n, d, k)$ all grow.

B. Definition of Achievability

Let $\mathcal{S}_{k,d} := \{A : A \subset V_d, |A| = k\}$ be the set of cardinality-$k$ subsets of $V_d$. A decoder is a set-valued function $\psi_n : (\mathcal{X}^d)^n \times (\mathcal{X}^d)^n \to \mathcal{S}_{k,d}$. Note that in this paper the decoder is given the cardinality $k$.³ In the following, we use the notation $\hat{P}^{(d)}, \hat{Q}^{(d)}$ to denote the empirical distributions (or types) of $\mathbf{x}^n, \mathbf{y}^n$ respectively.

³ We will discuss the recovery of $S_d$ without knowledge of $k$ in a longer version of this paper.

Definition 3 (Exhaustive Search Decoder): The exhaustive search decoder (ESD) $\psi_n^* : (\mathcal{X}^d)^n \times (\mathcal{X}^d)^n \to \mathcal{S}_{k,d}$ is given by

$$\psi_n^*(\mathbf{x}^n, \mathbf{y}^n) \in \operatorname*{argmax}_{S_d' \in \mathcal{S}_{k,d}} D\big(\hat{P}^{(d)}_{S_d'} \,\big\|\, \hat{Q}^{(d)}_{S_d'}\big). \qquad (8)$$

If the argmax in (8) is not unique, output any set $S_d' \in \mathcal{S}_{k,d}$ that maximizes the objective. The ESD is closely related to the maximum-likelihood (ML) decoder, and can be viewed as an approximation that is tractable for analysis; for a discussion, please see the supplementary material.
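The following is a brute-force sketch of the exhaustive search decoder in (8), workable only for very small d and k; the helper names, the binary alphabet and the tiny pseudo-count used to keep the empirical divergence finite are our own choices, not anything specified in the paper.

```python
# Brute-force sketch of the exhaustive search decoder (8). Samples are integer
# arrays of shape (n, d) over a finite alphabet; the estimate is the k-subset
# whose empirical marginals have the largest KL-divergence.
from itertools import combinations
import numpy as np

def empirical_marginal(samples, subset, a=2, eps=1e-9):
    """Empirical distribution (type) of the columns in `subset`, lightly smoothed."""
    counts = np.full((a,) * len(subset), eps)
    for row in samples[:, list(subset)]:
        counts[tuple(row)] += 1.0
    return counts / counts.sum()

def esd(x_samples, y_samples, k, a=2):
    """Return a k-subset maximizing D(P_hat_S' || Q_hat_S'), as in (8)."""
    d = x_samples.shape[1]
    def emp_kl(subset):
        p = empirical_marginal(x_samples, subset, a)
        q = empirical_marginal(y_samples, subset, a)
        return float(np.sum(p * np.log(p / q)))
    return max(combinations(range(d), k), key=emp_kl)

# Toy usage: only feature 0 is distributed differently across the two classes.
rng = np.random.default_rng(0)
n, d = 500, 4
x = rng.integers(0, 2, size=(n, d))
y = rng.integers(0, 2, size=(n, d))
x[:, 0] = rng.random(n) < 0.8
y[:, 0] = rng.random(n) < 0.2
print(esd(x, y, k=1))    # typically prints (0,)
```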
We remark that, in practice, the ESD is computationally infeasible for large $d$ and $k$, since it has to compute the empirical KL-divergence $D(\hat{P}^{(d)}_{S_d'} \| \hat{Q}^{(d)}_{S_d'})$ for all sets in $\mathcal{S}_{k,d}$. In Section IV, we analyze how to reduce the complexity of (8) for tree distributions. Nonetheless, the ESD is consistent for fixed $d$ and $k$; that is, as $n \to \infty$, the probability that a non-salient set is selected by $\psi_n^*$ tends to zero. We provide the exponential rate of decay in Section III-B.

Let $\mathbb{P}^n := (P^{(d)} \times Q^{(d)})^n$ denote the $n$-fold product probability measure of $P^{(d)} \times Q^{(d)}$.

Definition 4 (Achievability): We say that the sequence of model parameters $\{(n, d, k)\}_{n\in\mathbb{N}}$ is achievable for the sequence of distributions $\{P^{(d)}, Q^{(d)} \in \mathcal{P}(\mathcal{X}^d)\}_{d\in\mathbb{N}}$ if there exists a decoder $\psi_n$ such that for every $\epsilon > 0$, there exists an $N_\epsilon \in \mathbb{N}$ for which the error probability

$$p_n(\psi_n) := \mathbb{P}^n\big(\psi_n(\mathbf{x}^n, \mathbf{y}^n) \neq S_d\big) < \epsilon, \qquad \forall\, n > N_\epsilon. \qquad (9)$$

Thus, if $\{(n, d, k)\}_{n\in\mathbb{N}}$ is achievable, $\lim_n p_n(\psi_n) = 0$. In the sequel, we provide achievability conditions for the ESD.

III. CONDITIONS FOR THE HIGH-DIMENSIONAL RECOVERY OF SALIENT SUBSETS

In this section, we state three assumptions on the sequence of distributions $\{P^{(d)}, Q^{(d)}\}_{d\in\mathbb{N}}$ such that, under some specified scaling laws, the triple of model parameters $(n, d, k)$ is achievable with the ESD, as defined in (9). We provide both positive (achievability) and negative (converse) sample complexity results under these assumptions. That is, we state when (9) holds and also when the sequence $p_n(\psi_n)$ is uniformly bounded away from zero.

A. Assumptions on the Distributions

In order to state our results, we assume that the sequence of probability distributions $\{P^{(d)}, Q^{(d)}\}_{d\in\mathbb{N}}$ satisfies the following three conditions:

A1: (Saliency) For each pair of distributions $P^{(d)}, Q^{(d)}$, there exists a salient set $S_d \subset V_d$ of cardinality $k$ such that (4) (or equivalently (5)) holds.

A2: ($\eta$-Distinguishability) There exists a constant $\eta > 0$, independent of $(n, d, k)$, such that for all $d \in \mathbb{N}$ and for all non-salient subsets $S_d' \in \mathcal{S}_{k,d} \setminus \{S_d\}$, we have

$$D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d}) - D(P^{(d)}_{S_d'} \,\|\, Q^{(d)}_{S_d'}) \ge \eta > 0. \qquad (10)$$

A3: ($L$-Boundedness of the Likelihood Ratio) There exists an $L \in (0, \infty)$, independent of $(n, d, k)$, such that for all $d \in \mathbb{N}$, we have $\log[P^{(d)}_{S_d}(\mathbf{x}_{S_d})/Q^{(d)}_{S_d}(\mathbf{x}_{S_d})] \in [-L, L]$ for all $\mathbf{x}_{S_d} \in \mathcal{X}^k$.

Assumption A1 pertains to the existence of a salient set. Assumption A2 allows us to employ the large deviation principle [7] to quantify error probabilities. This is because all non-salient subsets $S_d' \in \mathcal{S}_{k,d} \setminus \{S_d\}$ are such that their divergences are uniformly smaller than the divergence on $S_d$, the salient set. Thus, for each $d$, the associated salient set $S_d$ is unique and the error probability of selecting any non-salient set $S_d'$ decays exponentially. A2, together with A3, a regularity condition, allows us to prove that the exponents of all the possible error events are uniformly bounded away from zero. In the next subsection, we formally define the notion of an error exponent for the recovery of salient subsets.
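As a quick sanity check on A2 and A3, the sketch below (with made-up numbers, not taken from the paper) builds a three-variable model with salient set S = {0} through the decomposition (6), computes the smallest divergence gap in (10) over the non-salient single-feature subsets, and evaluates the bound L of A3 on the salient marginal.

```python
# Toy check of assumptions A2 and A3 for a three-variable model whose salient
# set is S = {0}; features 1 and 2 are generated by conditionals shared by
# both classes, so every other size-1 subset is non-salient.
import numpy as np

P_S = np.array([0.8, 0.2])
Q_S = np.array([0.3, 0.7])
W = np.array([[0.9, 0.1], [0.4, 0.6]])   # shared conditional for feature 1
V = np.array([[0.5, 0.5], [0.5, 0.5]])   # shared conditional for feature 2

# Joints over (x0, x1, x2): P(x) = P_S(x0) W(x1|x0) V(x2|x0), same W, V for Q.
P = np.einsum('a,ab,ac->abc', P_S, W, V)
Q = np.einsum('a,ab,ac->abc', Q_S, W, V)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def marginal(joint, axes_to_keep):
    drop = tuple(i for i in range(joint.ndim) if i not in axes_to_keep)
    return joint.sum(axis=drop)

d_salient = kl(marginal(P, (0,)), marginal(Q, (0,)))
gaps = {i: d_salient - kl(marginal(P, (i,)), marginal(Q, (i,))) for i in (1, 2)}
print("eta (smallest gap over non-salient subsets):", min(gaps.values()))

# A3: bound on |log P_S(x)/Q_S(x)| over the salient alphabet.
print("L:", float(np.max(np.abs(np.log(P_S / Q_S)))))
```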
B. Fixed Number of Variables d and Salient Variables k

In this section, we consider the situation when $d$ and $k$ are constant. This provides key insights for developing achievability results when $(n, d, k)$ scale. Under this scenario, we have a large deviations principle for the error event in (9). We define the error exponent for the ESD $\psi_n^*$ as

$$C(P^{(d)}, Q^{(d)}) := -\lim_{n\to\infty} \frac{1}{n} \log \mathbb{P}^n\big(\psi_n^*(\mathbf{x}^n, \mathbf{y}^n) \neq S_d\big). \qquad (11)$$

Let $J_{S_d'|S_d}$ be the error rate at which the non-salient set $S_d' \in \mathcal{S}_{k,d} \setminus \{S_d\}$ is selected by the ESD, i.e.,

$$J_{S_d'|S_d} := -\lim_{n\to\infty} \frac{1}{n} \log \mathbb{P}^n\big(\psi_n^*(\mathbf{x}^n, \mathbf{y}^n) = S_d'\big). \qquad (12)$$

For each $S_d' \in \mathcal{S}_{k,d} \setminus \{S_d\}$, also define the set of distributions

$$\Gamma_{S_d'|S_d} := \big\{ (P, Q) \in \mathcal{P}(\mathcal{X}^{2|S_d \cup S_d'|}) : D(P_{S_d'} \,\|\, Q_{S_d'}) = D(P_{S_d} \,\|\, Q_{S_d}) \big\}. \qquad (13)$$

Proposition 2 (Error Exponent as Minimum Error Rate): Assume that the ESD $\psi_n^*$ is used. If $d$ and $k$ are constant, then the error exponent (11) is given by

$$C(P^{(d)}, Q^{(d)}) = \min_{S_d' \in \mathcal{S}_{k,d} \setminus \{S_d\}} J_{S_d'|S_d}, \qquad (14)$$

where the error rate $J_{S_d'|S_d}$, defined in (12), is

$$J_{S_d'|S_d} = \min_{\nu \in \Gamma_{S_d'|S_d}} D\big(\nu \,\big\|\, P^{(d)}_{S_d \cup S_d'} \times Q^{(d)}_{S_d \cup S_d'}\big). \qquad (15)$$

Furthermore, if A2 holds, then $C(P^{(d)}, Q^{(d)}) > 0$ and hence the error probability in (9) decays exponentially fast in $n$. This result is proved using Sanov's theorem and the contraction principle [7, Ch. 4] in large deviations.

C. An Achievability Result for the High-Dimensional Case

We now consider the high-dimensional scenario when $(n, d, k)$ all scale and we have a sequence of salient set recovery problems indexed by $n$ for the probability models $\{P^{(d)}, Q^{(d)}\}_{d\in\mathbb{N}}$. Thus, $d = d(n)$ and $k = k(n)$, and we are searching for how such dependencies must behave (scale) so that we have achievability. This is of interest since this regime (typically $d \gg n, k$) is most applicable to many practical problems and modern datasets, such as the motivating example in the introduction. Before stating our main theorem, we define the greatest lower bound (g.l.b.) of the error exponents as

$$B := B(\{P^{(d)}, Q^{(d)}\}_{d\in\mathbb{N}}) := \inf_{d\in\mathbb{N}} C(P^{(d)}, Q^{(d)}), \qquad (16)$$

where $C(P^{(d)}, Q^{(d)})$ is given in (14). Clearly, $B \ge 0$ by the non-negativity of the KL-divergence. In fact, we prove that $B > 0$ under assumptions A1 – A3, i.e., the exponents in (14) are uniformly bounded away from zero. For $\epsilon > 0$, define

$$g_1(k, \epsilon) := \exp\left( \frac{2k \log|\mathcal{X}|}{1 - \epsilon} \right), \qquad (17)$$

$$g_2(d, k) := \frac{k}{B} \log\left( \frac{d - k}{k} \right). \qquad (18)$$

Theorem 3 (Main Result: Achievability): Assume that A1 – A3 hold for the sequence of distributions $\{P^{(d)}, Q^{(d)}\}_{d\in\mathbb{N}}$. If there exist an $\epsilon > 0$ and an $N \in \mathbb{N}$ such that

$$n > \max\{ g_1(k, \epsilon),\, g_2(d, k) \}, \qquad \forall\, n > N, \qquad (19)$$

then $p_n(\psi_n^*) = O(\exp(-nc))$, where the exponent

$$c := B - \limsup_{n\to\infty} \frac{k}{n} \log\binom{d-k}{k} > 0. \qquad (20)$$

In other words, the sequence $\{(n, d, k)\}_{n\in\mathbb{N}}$ of parameters is achievable if (19) holds. Furthermore, the exhaustive search decoder in (8) achieves the scaling law in (19). The key elements in the proof include applications of large deviations bounds (e.g., Sanov's theorem), the asymptotic behavior of binomial coefficients and, most crucially, demonstrating the positivity of the g.l.b. of the error exponents $B$ defined in (16).

We now discuss the ramifications of Theorem 3. Firstly, $n > g_1(k, \epsilon)$ means that $k$, the number of salient features, is only allowed to grow logarithmically in $n$. Secondly, $n > g_2(d, k)$ means that if $k$ is a constant, the number of redundant features $|S_d^c| = d - k$ can grow exponentially with $n$, and $p_n$ still tends to zero exponentially fast. This means that recovery of $S_d$ is asymptotically possible even if the data dimension is extremely large (compared to $n$), as long as the salient features remain a small fraction of the total number $d$. We state this observation formally as a corollary of Theorem 3.

Corollary 4 (Achievability for constant k): Assume A1 – A3. Let $k = k_0$ be a constant and fix $R < R_1 := B/k_0$. If there exists an $N \in \mathbb{N}$ such that $n > (\log d)/R$ for all $n > N$, then the error probability obeys $p_n(\psi_n^*) = O(\exp(-nc'))$, where the exponent is $c' := B - k_0 R$.

This result means that we can recover the salient set even though the number of variables $d$ is much larger than (exponential in) the number of samples $n$, as in the asthma example.
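To get a feel for the scaling in Theorem 3, the snippet below evaluates $g_1$ and $g_2$ for illustrative values. The exponent bound $B$ and the slack $\epsilon$ are properties of the (unknown) distribution sequence, so the numbers used here are assumptions for illustration only, not quantities derived in the paper.

```python
# Evaluating the sample-size thresholds (17)-(18) of Theorem 3 for assumed
# values of B and epsilon; d and k echo the asthma-style setting of a huge
# dimension with a handful of salient features.
import math

def g1(k, eps, alphabet_size=2):
    return math.exp(2 * k * math.log(alphabet_size) / (1 - eps))

def g2(d, k, B):
    return (k / B) * math.log((d - k) / k)

d, k = 10**6, 5            # huge dimension, few salient features
B, eps = 0.05, 0.1         # assumed exponent bound and slack

print("g1(k, eps) =", round(g1(k, eps)))
print("g2(d, k)   =", round(g2(d, k, B)))
print("n must exceed", round(max(g1(k, eps), g2(d, k, B))))
```

With these assumed numbers, a few thousand samples satisfy (19) even though $d = 10^6$: the $g_2$ term grows only logarithmically in $d$, while the $g_1$ term grows exponentially in $k$, which is exactly why Theorem 3 allows $k$ to grow at most logarithmically in $n$.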
D. A Converse Result for the High-Dimensional Case

In this section, we state a converse theorem (and several useful corollaries) for the high-dimensional case. Specifically, we establish a condition on the scaling of $(n, d, k)$ so that the probability of error is uniformly bounded away from zero for any decoder. In order to apply standard proof techniques (such as Fano's inequality) for converses that apply to all possible decoders $\psi_n$, we consider the following slightly modified problem setup, in which $S_d$ is random rather than fixed as in Theorem 3. More precisely, let $\{\tilde{P}^{(d)}, \tilde{Q}^{(d)}\}_{d\in\mathbb{N}}$ be a fixed sequence of distributions, where $\tilde{P}^{(d)}, \tilde{Q}^{(d)} \in \mathcal{P}(\mathcal{X}^d)$. We assume that this sequence of distributions satisfies A1 – A3; namely, there exists a salient set $S_d \in \mathcal{S}_{k,d}$ such that $\tilde{P}^{(d)}, \tilde{Q}^{(d)}$ satisfy (4) for all $d$.

Let $\Pi$ be a permutation of $V_d$ chosen uniformly at random, i.e., $\Pr(\Pi = \pi) = 1/d!$ for any permutation operator $\pi : V_d \to V_d$. Define the sequence of distributions $\{P^{(d)}, Q^{(d)}\}_{d\in\mathbb{N}}$ as

$$P^{(d)} := \tilde{P}^{(d)}_{\pi}, \qquad Q^{(d)} := \tilde{Q}^{(d)}_{\pi}, \qquad \pi \sim \Pi. \qquad (21)$$

Put simply, we permute the indices in $\tilde{P}^{(d)}, \tilde{Q}^{(d)}$ (according to the realization of $\Pi$) to get $P^{(d)}, Q^{(d)}$, i.e., $P^{(d)}(x_1 \ldots x_d) := \tilde{P}^{(d)}(x_{\pi(1)} \ldots x_{\pi(d)})$. Thus, once $\pi$ has been drawn, the distributions $P^{(d)}$ and $Q^{(d)}$ of the random vectors $\mathbf{x}$ and $\mathbf{y}$ are completely determined. Clearly, the salient sets $S_d$ are drawn uniformly at random (u.a.r.) from $\mathcal{S}_{k,d}$ and we have the Markov chain

$$S_d \;\xrightarrow{\;\varphi_n\;}\; (\mathbf{x}^n, \mathbf{y}^n) \;\xrightarrow{\;\psi_n\;}\; \widehat{S}_d, \qquad (22)$$

where the length-$d$ random vectors $(\mathbf{x}, \mathbf{y}) \sim P^{(d)} \times Q^{(d)}$ and $\widehat{S}_d$ is any estimate of $S_d$. Here, $\varphi_n$ is the encoder given by the random draw of $\pi$ and (21), and $\psi_n$ is the decoder defined in Section II-B. We denote the entropy of a random vector $\mathbf{z}$ with pmf $P$ as $H(\mathbf{z}) = H(P)$ and the conditional entropy of $\mathbf{z}_A$ given $\mathbf{z}_B$ as $H(\mathbf{z}_A | \mathbf{z}_B) = H(P_{A|B})$.

Theorem 5 (Converse): Assume that the salient sets $\{S_d\}_{d\in\mathbb{N}}$ are drawn u.a.r. and encoded as in (21). If

$$n < \frac{\lambda\, k \log(d/k)}{H(P^{(d)}) + H(Q^{(d)})} \quad \text{for some } \lambda \in (0, 1), \qquad (23)$$

then $p_n(\psi_n) \ge 1 - \lambda$ for any decoder $\psi_n$.

The converse is proven using Fano's inequality [2, Ch. 2]. Note from (23) that if the non-salient set $S_d^c$ consists of uniform random variables independent of those in $S_d$, then $H(P^{(d)}) = O(d)$ and the bound is never satisfied. However, the converse is interesting and useful if we consider distributions with additional structure on their entropies. In particular, we assume that most of the non-salient variables are redundant (or processed) versions of the salient ones. Again appealing to the asthma example in the introduction, there could be two features in the dataset, "body mass index" (in $S_d$) and "is obese" (in $S_d^c$). These two features capture the same basic information and are thus redundant, but the former may be more informative for the asthma hypothesis.

Corollary 6 (Converse with Bound on Conditional Entropy): If there exists an $M < \infty$ such that

$$\max\big\{ H(P^{(d)}_{S_d^c | S_d}),\; H(Q^{(d)}_{S_d^c | S_d}) \big\} \le M k \qquad (24)$$

for all $d \in \mathbb{N}$, and

$$n < \frac{\lambda \log(d/k)}{2(M + \log|\mathcal{X}|)} \quad \text{for some } \lambda \in (0, 1), \qquad (25)$$

then $p_n(\psi_n) \ge 1 - \lambda$ for any decoder $\psi_n$.
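For comparison with the achievability thresholds above, the snippet below plugs assumed values of M and λ into the converse bound (25); the values are illustrative only, since M is a property of the distribution sequence.

```python
# Evaluating the converse threshold (25) for assumed constants: below this many
# samples, no decoder can achieve error probability smaller than 1 - lambda.
import math

def converse_threshold(d, k, M, lam, alphabet_size=2):
    return lam * math.log(d / k) / (2 * (M + math.log(alphabet_size)))

d, k = 10**6, 5
M, lam = 0.5, 0.5     # assumed conditional-entropy bound (24) and lambda
print("recovery fails (w.p. >= 1 - lambda) whenever n <",
      converse_threshold(d, k, M, lam))
```

Like $g_2$ in (18), this threshold scales only logarithmically in $d$ for constant $k$; the achievability and converse results of Corollaries 4 and 7 differ only in the rate constants $R_1$ and $R_2$.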
Corollary 7 (Converse for constant k): Assume the setup in Corollary 6 and fix $R > R_2 := 2(M + \log|\mathcal{X}|)$. If $k$ is a constant and there exists an $N \in \mathbb{N}$ such that $n < (\log d)/R$ for all $n > N$, then there exists a $\delta > 0$ such that the error probability $p_n(\psi_n) \ge \delta$ for all decoders $\psi_n$.

We previously showed (cf. Corollary 4) that there is a rate of growth $R_1$ such that achievability holds if $R < R_1$. Corollary 7 says that, under the specified conditions, there is also another rate $R_2$ such that if $R > R_2$, recovery of $S_d$ is no longer possible.

IV. SPECIALIZATION TO TREE DISTRIBUTIONS

As mentioned previously, the ESD in (8) is computationally prohibitive. In this section, we assume Markov structure on the distributions (also called graphical models [8]) and devise an efficient algorithm to reduce the computational complexity of the decoder. To do so, for each $d$ and $k$, assume the following:

A4: (Markov tree) The distributions $P := P^{(d)}$ and $Q := Q^{(d)}$ are undirected graphical models [8]. More specifically, $P, Q$ are Markov on a common tree $T = (V(T), E(T))$, where $V(T) = \{1, \ldots, d\}$ is the vertex set and $E(T) \subset V(T)^2$ is the edge set. That is, $P$ (and similarly $Q$) admits the factorization

$$P(\mathbf{x}) = \prod_{i \in V(T)} P_i(x_i) \prod_{(i,j) \in E(T)} \frac{P_{i,j}(x_i, x_j)}{P_i(x_i)\, P_j(x_j)}. \qquad (26)$$

A5: (Subtree) The salient set $S := S_d$ is such that $P_S, Q_S$ are Markov on a common (connected) subtree $T_S = (V(T_S), E(T_S))$ of $T$.

Note that $T_S \subset T$ has to be connected so that the marginals $P_S$ and $Q_S$ remain Markov on trees; otherwise, additional edges may be introduced when the variables in $S^c$ are marginalized out [8]. Under A4 and A5, the KL-divergence decomposes as

$$D(P \,\|\, Q) = \sum_{i \in V(T)} D_i + \sum_{(i,j) \in E(T)} W_{i,j}, \qquad (27)$$

where $D_i := D(P_i \| Q_i)$ is the KL-divergence of the marginals and the weights $W_{i,j} := D_{i,j} - D_i - D_j$ with $D_{i,j} := D(P_{i,j} \| Q_{i,j})$. A similar decomposition holds for $D(P_S \| Q_S)$, with $V(T_S), E(T_S)$ in place of $V(T), E(T)$ in (27). Let $\mathcal{T}_k(T)$ be the set of subtrees with $k < d$ vertices in $T$, a tree with $d$ vertices.

We now describe an efficient algorithm to learn $S$ when $T$ is unknown. Firstly, using the samples $(\mathbf{x}^n, \mathbf{y}^n)$, learn a single Chow-Liu [9] tree model $T_{\mathrm{ML}}$ using the sums of the empirical mutual information quantities $\{I(\hat{P}_{i,j}) + I(\hat{Q}_{i,j})\}$ as the edge weights. It is known that the Chow-Liu max-weight spanning tree algorithm is consistent, and large deviations rates have also been studied [10]. Secondly, solve the following optimization:

$$T_k^* = \operatorname*{argmax}_{T_k \in \mathcal{T}_k(T_{\mathrm{ML}})} \; \sum_{i \in V(T_k)} \hat{D}_i + \sum_{(i,j) \in E(T_k)} \hat{W}_{i,j}, \qquad (28)$$

where $\hat{D}_i$ and $\hat{W}_{i,j}$ are the empirical versions of $D_i$ and $W_{i,j}$ respectively. In (28), the sum of the node and edge weights over all size-$k$ subtrees of $T_{\mathrm{ML}}$ is maximized. The problem in (28) is known as the k-CARD TREE problem [11] and it can be solved in time $O(dk^2)$ using a dynamic programming procedure on trees. Thirdly, let the estimate of the salient set be the vertex set of $T_k^*$, i.e., $\psi_n(\mathbf{x}^n, \mathbf{y}^n) := V(T_k^*)$.

Proposition 8 (Complexity Reduction for Trees): Assume that A4 and A5 hold. Then, if $k$ and $d$ are constant, the algorithm described above to estimate $S$ is consistent. Moreover, the time complexity is $O(dk^2 + nd^2|\mathcal{X}|^2)$.

Hence, there are significant savings in computational complexity if the probability models $P$ and $Q$ are trees (26).
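The sketch below illustrates the three-step tree-based decoder on a small synthetic problem. Step 1 is the Chow-Liu maximum-weight spanning tree with edge weights $I(\hat{P}_{i,j}) + I(\hat{Q}_{i,j})$; step 2 replaces the $O(dk^2)$ k-CARD TREE dynamic program of [11] with a brute-force search over connected $k$-vertex subsets of the learned tree, which optimizes the same objective (28) but is adequate only for small $d$; step 3 returns the vertex set. All helper names, smoothing constants and the synthetic data are our own choices.

```python
# Small-d sketch of the tree-based salient-set decoder of Section IV.
from itertools import combinations
import numpy as np

def empirical(samples, cols, a=2, eps=1e-9):
    """Smoothed empirical distribution of the given columns."""
    counts = np.full((a,) * len(cols), eps)
    for row in samples[:, list(cols)]:
        counts[tuple(row)] += 1.0
    return counts / counts.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def mutual_info(joint):
    pi = joint.sum(axis=1, keepdims=True)
    pj = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log(joint / (pi * pj))))

def decode_tree(x, y, k, a=2):
    d = x.shape[1]
    D = [kl(empirical(x, (i,), a), empirical(y, (i,), a)) for i in range(d)]
    pairs = list(combinations(range(d), 2))
    P2 = {e: empirical(x, e, a) for e in pairs}
    Q2 = {e: empirical(y, e, a) for e in pairs}
    W = {e: kl(P2[e], Q2[e]) - D[e[0]] - D[e[1]] for e in pairs}

    # Step 1: Chow-Liu tree = max-weight spanning tree on I(P_ij) + I(Q_ij),
    # built with a tiny Kruskal / union-find.
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for e in sorted(pairs, key=lambda e: -(mutual_info(P2[e]) + mutual_info(Q2[e]))):
        ri, rj = find(e[0]), find(e[1])
        if ri != rj:
            parent[ri] = rj
            tree.append(e)

    # Step 2: brute-force surrogate for (28): best connected k-subset of the tree.
    adj = {i: set() for i in range(d)}
    for i, j in tree:
        adj[i].add(j); adj[j].add(i)
    def connected(nodes):
        nodes = set(nodes); stack = [next(iter(nodes))]; seen = set()
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u); stack.extend(adj[u] & nodes)
        return seen == nodes
    def objective(nodes):
        s = set(nodes)
        return sum(D[i] for i in s) + sum(W[e] for e in tree if set(e) <= s)
    best = max((c for c in combinations(range(d), k) if connected(c)), key=objective)
    # Step 3: the salient-set estimate is the vertex set of the best subtree.
    return set(best)

# Toy usage: features 0 and 1 differ across the classes, 2-4 are shared noise.
rng = np.random.default_rng(1)
n, d = 1000, 5
x = rng.integers(0, 2, size=(n, d)); y = rng.integers(0, 2, size=(n, d))
x[:, 0] = rng.random(n) < 0.85; x[:, 1] = x[:, 0] ^ (rng.random(n) < 0.1)
y[:, 0] = rng.random(n) < 0.25; y[:, 1] = y[:, 0] ^ (rng.random(n) < 0.4)
print(decode_tree(x, y, k=2))   # typically {0, 1}
```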
V. CONCLUSION AND FURTHER WORK

In this paper, we defined the notion of saliency and provided necessary and sufficient conditions for the asymptotic recovery of salient subsets in the high-dimensional regime. In future work, we seek to strengthen these results by reducing the gap between the achievability and converse theorems. In addition, we would like to derive similar results for the scenario in which k is unknown to the decoder. We have developed thresholding rules for discovering the number of edges in the context of learning Markov forests [12], and we believe that similar techniques apply here. We also plan to analyze error rates for the algorithm introduced in Section IV.

Acknowledgments: The authors acknowledge Prof. V. Saligrama and B. Nakiboğlu for helpful discussions. This work is supported by an AFOSR grant (FA9559-08-11080), a MURI funded through ARO Grant W911NF-06-1-0076 and a MURI funded through AFOSR Grant FA9550-06-1-0324. V. Tan and M. Johnson are supported by A*STAR, Singapore, and an NSF fellowship respectively.

REFERENCES

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Research, vol. 3, pp. 1157–1182, 2003.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley-Interscience, 2006.
[3] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. on PAMI, vol. 27, no. 8, pp. 1226–1238, 2005.
[4] D. Koller and M. Sahami, "Toward optimal feature selection," Stanford InfoLab, Tech. Rep., 1996.
[5] A. Y. Ng, "On feature selection: learning with exponentially many irrelevant features as training examples," in Proc. 15th ICML. Morgan Kaufmann, 1998, pp. 404–412.
[6] M. J. Wainwright, "Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting," IEEE Trans. on Info. Th., pp. 5728–5741, Dec. 2009.
[7] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, 2nd ed. Springer, 1998.
[8] S. Lauritzen, Graphical Models. Oxford University Press, 1996.
[9] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. on Info. Th., vol. 14, no. 3, pp. 462–467, May 1968.
[10] V. Y. F. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, "A large-deviation analysis for the ML learning of Markov tree structures," submitted to IEEE Trans. on Info. Th., arXiv:0905.0940, May 2009.
[11] C. Blum, "Revisiting dynamic programming for finding optimal subtrees in trees," European J. of Ops. Research, vol. 177, no. 1, 2007.
[12] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky, "Learning high-dimensional Markov forest distributions: Analysis of error rates," submitted to J. Mach. Learn. Research, on arXiv, May 2010.