ISIT 2010, Austin, Texas, U.S.A., June 13 - 18, 2010
Necessary and Sufficient Conditions for
High-Dimensional Salient Feature Subset Recovery
Vincent Y. F. Tan, Matthew Johnson, and Alan S. Willsky
Stochastic Systems Group, LIDS, MIT, Cambridge, MA 02139, Email: {vtan,mattjj,willsky}@mit.edu
Abstract—We consider recovering the salient feature subset for distinguishing between two probability models from i.i.d. samples. Identifying the salient set improves discrimination performance and reduces complexity. The focus in this work is on the high-dimensional regime where the number of variables d, the number of salient variables k and the number of samples n all grow. The definition of saliency is motivated by error exponents in a binary hypothesis test and is stated in terms of relative entropies. It is shown that if n grows faster than max{c k log((d − k)/k), exp(c′ k)} for constants c, c′, then the error probability in selecting the salient set can be made arbitrarily small. Thus, n can be much smaller than d. The exponential rate of decay and converse theorems are also provided. An efficient and consistent algorithm is proposed when the distributions are graphical models which are Markov on trees.
Index Terms—Salient feature subset, High-dimensional, Error exponents, Binary hypothesis testing, Tree distributions.
I. INTRODUCTION
Consider the following scenario: There are 1000 children participating in a longitudinal study in childhood asthma, of which 500 are asthmatic and the other 500 are not. 10^6 measurements of possibly relevant features (e.g., genetic, environmental, physiological) are taken from each child, but only a very small subset of these (say 30) is useful in predicting whether the child has asthma. This example is, in fact, modeled after a real-life large-scale experiment, the Manchester Asthma and Allergy Study (http://www.maas.org.uk/). The correct identification and subsequent interpretation of this salient subset is important to clinicians for assessing the susceptibility of other children to asthma. We expect that by focusing only on the 30 salient features, we can improve discrimination and reduce the computational cost of coming up with a decision rule. Indeed, when the salient set is small compared to the overall dimension (10^6), we also expect to be able to estimate the salient set with a small number of samples.
This general problem is also known as feature subset selection [1]. In this paper, we derive, from an information-theoretic perspective, necessary and sufficient conditions so that the salient set can be recovered with arbitrarily low error probability in the high-dimensional regime, i.e., when the number of samples n, the number of variables d and the number of salient variables k grow. Intuitively, we expect that if k and d do not grow too quickly with n, then consistent recovery is possible. However, in this paper, we focus on the interesting case where d ≫ n, k, which is most relevant to problems such as the asthma example above.
This work is supported by an AFOSR grant (FA9559-08-11080), a MURI funded through ARO Grant W911NF-06-1-0076 and a MURI funded through AFOSR Grant FA9550-06-1-0324. V. Tan and M. Johnson are supported by A*STAR, Singapore, and an NSF fellowship, respectively.
Motivated by the Chernoff-Stein lemma [2, Ch. 11] for a
binary hypothesis test under the Neyman-Pearson framework,
we define the notion of saliency for distinguishing between
two probability distributions. We show that this definition
of saliency can also be motivated by the same hypothesis
testing problem under the Bayesian framework, in which the
overall error probability is minimized. For the asthma example,
intuitively, a feature is salient if it is useful in predicting
whether a child has asthma and we also expect the number
of salient features to be very small. Also, conditioned on the
salient features, the non-salient ones should not contribute to
the distinguishability of the classes. Our mathematical model
and definition of saliency in terms of the KL-divergence (or
Chernoff information) captures this intuition.
There are three main contributions in this work. Firstly,
we provide sufficient conditions on the scaling of the model
parameters (n, d, k) so the salient set is recoverable asymptotically. Secondly, by modeling the salient set as a uniform
random variable (over all sets of size k), we derive a necessary
condition that any decoder must satisfy in order to recover the
salient set. Thirdly, in light of the fact that the exhaustive
search decoder is computationally infeasible, we examine the
case in which the underlying distributions are Markov on
trees and derive efficient tree-based combinatorial optimization
algorithms to search for the salient set.
The literature on feature subset selection (or variable extraction) is vast. See [1] (and references therein) for a thorough review of the field. The traditional methods include the so-called wrapper (assessing different subsets for their usefulness in predicting the class) and filter (ranking) methods. Our
definition of saliency is related to the minimum-redundancy,
maximum-relevancy model in [3] and the notion of Markov
blankets in [4] but is expressed using information-theoretic
quantities motivated by hypothesis testing. The algorithm
suggested in [5] shows that the generalization error remains
small even in the presence of a large number of irrelevant
features, but this paper focuses on exact recovery of the salient
set given scaling laws on (n, d, k). This work is also related
to [6] on sparsity pattern recovery (or compressed sensing) but
does not assume the linear observation model. Rather, samples
are drawn from two arbitrary discrete multivariate probability
distributions.
II. NOTATION, SYSTEM MODEL AND DEFINITIONS
Let X be a finite set and let P(X^d) denote the set of discrete distributions supported on X^d. Let {P^(d), Q^(d)}_{d∈N} be two sequences of distributions, where P^(d), Q^(d) ∈ P(X^d) are the distributions of the d-dimensional random vectors x, y respectively. For a vector x ∈ X^d, x_A is the length-|A| subvector that consists of the elements in A. Let A^c := V_d \ A. In addition, let V_d := {1, . . . , d} be the index set and, for a subset A ⊂ V_d, let P^(d)_A be the marginal of the subset of random variables in A, i.e., of the random vector x_A. Each index i ∈ V_d, associated to the marginals (P^(d)_i, Q^(d)_i), will be generically called a feature.
We assume that for each pair (P^(d), Q^(d)), there exists a set of n i.i.d. samples (x^n, y^n) := ({x^(l)}_{l=1}^n, {y^(l)}_{l=1}^n) drawn from P^(d) × Q^(d). Each sample x^(l) (and also y^(l)) belongs to X^d. Our goal is to distinguish between P^(d) and Q^(d) using the samples. Note that for each d, this setup is analogous to binary classification where one does not have access to the underlying distributions but only samples from the distribution. We suppress the dependence of (x^n, y^n) on the dimensionality d when the lengths of the vectors are clear from the context.
A. Definition of The Salient Set of Features
We now motivate the notion of saliency (and the salient set) by considering the following binary hypothesis testing problem. There are n i.i.d. d-dimensional samples z^n := {z^(1), . . . , z^(n)} drawn from either P^(d) or Q^(d), i.e.,

H_0 : z^n ∼ P^(d),    H_1 : z^n ∼ Q^(d).    (1)

The Chernoff-Stein lemma [2, Theorem 11.8.3] says that the error exponent for (1) under the Neyman-Pearson formulation is

D(P^(d) || Q^(d)) := Σ_z P^(d)(z) log [ P^(d)(z) / Q^(d)(z) ].    (2)

More precisely, if the probability of false alarm P_FA = Pr(Ĥ_1 | H_0) is kept below α, then the probability of misdetection P_M = Pr(Ĥ_0 | H_1) tends to zero exponentially fast as n → ∞ with exponent given by D(P^(d) || Q^(d)) in (2).
In the Bayesian formulation, we seek to minimize the overall probability of error Pr(err) = Pr(H_0) P_FA + Pr(H_1) P_M, where Pr(H_0) and Pr(H_1) are the prior probabilities of hypotheses H_0 and H_1 respectively. It is known [2, Theorem 11.9.1] that in this case, the error exponent governing the rate of decay of Pr(err) with the sample size n is the Chernoff information between P^(d) and Q^(d):

D*(P^(d), Q^(d)) := − min_{t ∈ [0,1]} log Σ_z (P^(d)(z))^t (Q^(d)(z))^(1−t).    (3)

Similar to the KL-divergence, D*(P^(d), Q^(d)) is a measure of the separability of the distributions. It is a symmetric quantity in the distributions but is still not a metric. Given the form of the error exponents in (2) and (3), we would like to identify a size-k subset of features S_d ⊂ V_d that "maximally distinguishes" between P^(d) and Q^(d). This motivates the following definitions:
Definition 1 (KL-divergence Salient Set): A subset S_d ⊂ V_d of size k is KL-divergence salient (or simply salient) if

D(P^(d) || Q^(d)) = D(P^(d)_{S_d} || Q^(d)_{S_d}).    (4)

Thus, conditioned on the variables in the salient set S_d (with |S_d| = k for some 1 ≤ k ≤ d), the variables in the complement S_d^c do not contribute to the distinguishability (in terms of the KL-divergence) of P^(d) and Q^(d).
Definition 2 (Chernoff information Salient Set): A subset S_d ⊂ V_d of size k is Chernoff information salient if

D*(P^(d), Q^(d)) = D*(P^(d)_{S_d}, Q^(d)_{S_d}).    (5)

Thus, given the variables in S_d, the remaining variables in S_d^c do not contribute to the Chernoff information defined in (3). A natural question to ask is whether the two definitions above are equivalent. We claim the following lemma.
Lemma 1 (Equivalence of Saliency Definitions): For a subset S_d ⊂ V_d of size k, the following are equivalent:
S1: S_d is KL-divergence salient.
S2: S_d is Chernoff information salient.
S3: P^(d) and Q^(d) admit the following decompositions:

P^(d) = P^(d)_{S_d} · W_{S_d^c | S_d},    Q^(d) = Q^(d)_{S_d} · W_{S_d^c | S_d}.    (6)

Lemma 1 is proved using Hölder's inequality and Jensen's inequality.¹
¹ Due to space constraints, the proofs of the results are not included here but can be found at http://web.mit.edu/vtan/www/isit10.
Observe from (6) that the conditionals W_{S_d^c | S_d} of both models are identical. Consequently, the likelihood ratio test (LRT) [7, Sec. 3.4] between P^(d) and Q^(d) depends solely on the marginals of the salient set S_d, i.e.,

(1/n) Σ_{l=1}^{n} log [ P^(d)(z^(l)) / Q^(d)(z^(l)) ] = (1/n) Σ_{l=1}^{n} log [ P^(d)_{S_d}(z^(l)_{S_d}) / Q^(d)_{S_d}(z^(l)_{S_d}) ]  ≷^{Ĥ=H_0}_{Ĥ=H_1}  γ_n,    (7)

is the most powerful test [2, Ch. 11] of fixed size α for threshold γ_n.² Also, the inclusion of any non-salient subset of features B ⊂ S_d^c keeps the likelihood ratio in (7) exactly the same, i.e., P^(d)_{S_d} / Q^(d)_{S_d} = P^(d)_{S_d ∪ B} / Q^(d)_{S_d ∪ B}. Moreover, correctly identifying S_d from the set of samples (x^n, y^n) results in a reduction in the number of relevant features, which is advantageous for the design of parsimonious and efficient binary classifiers.
² We have implicitly assumed that the distributions P^(d), Q^(d) are nowhere zero and consequently the conditional W_{S_d^c | S_d} is also nowhere zero.
Because of this equivalence of definitions of saliency (in terms of the Chernoff-Stein exponent in (2) and the Chernoff information in (3)), if we have successfully identified the salient set in (4), we have also found the subset that maximizes the error exponent associated to the overall probability of error Pr(err). In our results, we find that the characterization of saliency in terms of (4) is more convenient than its equivalent characterization in (5). Finally, we emphasize that the number of variables and the number of salient variables k = |S_d| can grow as functions of n, i.e., d = d(n), k = k(n). In the sequel, we provide necessary and sufficient conditions for the asymptotic recovery of S_d as the model parameters scale, i.e., when (n, d, k) all grow.
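To make Lemma 1 concrete, here is a minimal Python sketch (an illustration, not part of the paper; the numerical values of P_S, Q_S and of the shared conditional W are arbitrary assumptions) that builds a pair (P^(3), Q^(3)) through the decomposition (6) with S_d = {1, 2} and checks numerically that (4) and (5) hold.

from math import log

X = (0, 1)                       # alphabet
# Arbitrary illustrative marginals on X^2 for the salient pair S_d = {1, 2}.
P_S = {(0, 0): 0.40, (0, 1): 0.25, (1, 0): 0.20, (1, 1): 0.15}
Q_S = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.35, (1, 1): 0.25}
# A common conditional W(x3 | x1, x2), shared by both models as in (6).
W = {(0, 0): (0.7, 0.3), (0, 1): (0.5, 0.5), (1, 0): (0.2, 0.8), (1, 1): (0.6, 0.4)}

def joint(PS):
    """Full distribution on X^3 via the decomposition (6): P = P_S · W."""
    return {xs + (x3,): PS[xs] * W[xs][x3] for xs in PS for x3 in X}

P, Q = joint(P_S), joint(Q_S)

def kl(p, q):
    """KL-divergence D(p || q) as in (2), for dicts over the same support."""
    return sum(pv * log(pv / q[z]) for z, pv in p.items())

def chernoff(p, q, grid=10000):
    """Chernoff information (3), approximated by a fine grid over t in [0, 1]."""
    best = 0.0
    for s in range(grid + 1):
        t = s / grid
        best = min(best, log(sum(p[z] ** t * q[z] ** (1 - t) for z in p)))
    return -best

print(kl(P, Q), kl(P_S, Q_S))              # equal: (4) holds, so S_d is salient
print(chernoff(P, Q), chernoff(P_S, Q_S))  # equal: (5) holds, illustrating Lemma 1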
B. Definition of Achievability
Let S_{k,d} := {A : A ⊂ V_d, |A| = k} be the set of cardinality-k subsets of V_d. A decoder is a set-valued function ψ_n : (X^d)^n × (X^d)^n → S_{k,d}. Note in this paper that the decoder is given k.³ In the following, we use the notation P̂^(d), Q̂^(d) to denote the empirical distributions (or types) of x^n, y^n respectively.
³ We will discuss the recovery of S_d without knowledge of k in a longer version of this paper.
Definition 3 (Exhaustive Search Decoder): The exhaustive search decoder (ESD) ψ_n^* : (X^d)^n × (X^d)^n → S_{k,d} is given as

ψ_n^*(x^n, y^n) ∈ argmax_{S_d' ∈ S_{k,d}} D(P̂^(d)_{S_d'} || Q̂^(d)_{S_d'}).    (8)

If the argmax in (8) is not unique, output any set in S_{k,d} that maximizes the objective. The ESD is closely related to the maximum-likelihood (ML) decoder, and can be viewed as an approximation that is tractable for analysis. For a discussion, please see the supplementary material.
We remark that, in practice, the ESD is computationally infeasible for large d and k since it has to compute the empirical KL-divergence D(P̂^(d)_{S_d'} || Q̂^(d)_{S_d'}) for all sets S_d' in S_{k,d}. In Section IV, we analyze how to reduce the complexity of (8) for tree distributions. Nonetheless, the ESD is consistent for fixed d and k. That is, as n → ∞, the probability that a non-salient set is selected by ψ_n^* tends to zero. We provide the exponential rate of decay in Section III-B. Let P^n := (P^(d) × Q^(d))^n denote the n-fold product probability measure of P^(d) × Q^(d).
Definition 4 (Achievability): We say that the sequence of model parameters {(n, d, k)}_{n∈N} is achievable for the sequence of distributions {P^(d), Q^(d) ∈ P(X^d)}_{d∈N} if there exists a decoder ψ_n such that for every ε > 0, there exists an N_ε ∈ N for which the error probability

p_n(ψ_n) := P^n(ψ_n(x^n, y^n) ≠ S_d) < ε,   ∀ n > N_ε.    (9)

Thus, if {(n, d, k)}_{n∈N} is achievable, lim_{n→∞} p_n(ψ_n) = 0. In the sequel, we provide achievability conditions for the ESD.
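For reference, a brute-force rendering of the ESD (8) is sketched below in Python. It is illustrative only (the helper names and the small eps smoothing of empty cells in the empirical type of y^n are our assumptions, not the paper's): it enumerates all (d choose k) subsets, which is exactly the cost that motivates the tree-structured decoder of Section IV.

import itertools
from collections import Counter
from math import log

def empirical(samples, subset):
    """Empirical distribution (type) of the coordinates in `subset`."""
    counts = Counter(tuple(x[i] for i in subset) for x in samples)
    n = len(samples)
    return {pattern: c / n for pattern, c in counts.items()}

def kl(p, q, eps=1e-12):
    """Empirical KL-divergence; eps guards against empty cells in the type of y^n."""
    return sum(pv * log(pv / q.get(z, eps)) for z, pv in p.items())

def esd(x_samples, y_samples, k):
    """Exhaustive search decoder (8): maximize D(P-hat_S' || Q-hat_S') over |S'| = k."""
    d = len(x_samples[0])
    best_set, best_val = None, float("-inf")
    for subset in itertools.combinations(range(d), k):
        val = kl(empirical(x_samples, subset), empirical(y_samples, subset))
        if val > best_val:
            best_set, best_val = subset, val
    return set(best_set)

# Example call (x_samples, y_samples are lists of length-d tuples over a finite alphabet):
# S_hat = esd(x_samples, y_samples, k=2)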
III. CONDITIONS FOR THE HIGH-DIMENSIONAL RECOVERY OF SALIENT SUBSETS
In this section, we state three assumptions on the sequence of distributions {P^(d), Q^(d)}_{d∈N} such that under some specified scaling laws, the triple of model parameters (n, d, k) is achievable with the ESD as defined in (9). We provide both positive (achievability) and negative (converse) sample complexity results under these assumptions. That is, we state when (9) holds and also when the sequence p_n(ψ_n) is uniformly bounded away from zero.
A. Assumptions on the Distributions
In order to state our results, we assume that the sequence of probability distributions {P^(d), Q^(d)}_{d∈N} satisfies the following three conditions:
A1: (Saliency) For each pair of distributions P^(d), Q^(d), there exists a salient set S_d ⊂ V_d of cardinality k such that (4) (or equivalently (5)) holds.
A2: (η-Distinguishability) There exists a constant η > 0, independent of (n, d, k), such that for all d ∈ N and for all non-salient subsets S_d' ∈ S_{k,d} \ {S_d}, we have

D(P^(d)_{S_d} || Q^(d)_{S_d}) − D(P^(d)_{S_d'} || Q^(d)_{S_d'}) ≥ η > 0.    (10)

A3: (L-Boundedness of the Likelihood Ratio) There exists a L ∈ (0, ∞), independent of (n, d, k), such that for all d ∈ N, we have log[ P^(d)_{S_d}(x_{S_d}) / Q^(d)_{S_d}(x_{S_d}) ] ∈ [−L, L] for all x_{S_d} ∈ X^k.
Assumption A1 pertains to the existence of a salient set. Assumption A2 allows us to employ the large deviation principle [7] to quantify error probabilities. This is because all non-salient subsets S_d' ∈ S_{k,d} \ {S_d} are such that their divergences are uniformly smaller than the divergences on S_d, the salient set. Thus, for each d, the associated salient set S_d is unique and the error probability of selecting any non-salient set S_d' decays exponentially. A2 together with A3, a regularity condition, allows us to prove that the exponents of all the possible error events are uniformly bounded away from zero. In the next subsection, we formally define the notion of an error exponent for the recovery of salient subsets.
B. Fixed Number of Variables d and Salient Variables k
In this section, we consider the situation when d and k are constant. This provides key insights for developing achievability results when (n, d, k) scale. Under this scenario, we have a large deviations principle for the error event in (9). We define the error exponent for the ESD ψ_n^* as

C(P^(d), Q^(d)) := − lim_{n→∞} n^{−1} log P^n(ψ_n^*(x^n, y^n) ≠ S_d).    (11)

Let J_{S_d' | S_d} be the error rate at which the non-salient set S_d' ∈ S_{k,d} \ {S_d} is selected by the ESD, i.e.,

J_{S_d' | S_d} := − lim_{n→∞} n^{−1} log P^n(ψ_n^*(x^n, y^n) = S_d').    (12)

For each S_d' ∈ S_{k,d} \ {S_d}, also define the set of distributions

Γ_{S_d' | S_d} := { (P, Q) ∈ P(X^{2|S_d ∪ S_d'|}) : D(P_{S_d'} || Q_{S_d'}) = D(P_{S_d} || Q_{S_d}) }.    (13)

Proposition 2 (Error Exponent as Minimum Error Rate): Assume that the ESD ψ_n^* is used. If d and k are constant, then the error exponent (11) is given as

C(P^(d), Q^(d)) = min_{S_d' ∈ S_{k,d} \ {S_d}} J_{S_d' | S_d},    (14)

where the error rate J_{S_d' | S_d}, defined in (12), is

J_{S_d' | S_d} = min_{ν ∈ Γ_{S_d' | S_d}} D(ν || P^(d)_{S_d ∪ S_d'} × Q^(d)_{S_d ∪ S_d'}).    (15)

Furthermore, if A2 holds, C(P^(d), Q^(d)) > 0 and hence the error probability in (9) decays exponentially fast in n. This result is proved using Sanov's Theorem and the contraction principle [7, Ch. 4] in large deviations.
C. An Achievability Result for the High-Dimensional Case
We now consider the high-dimensional scenario when (n, d, k) all scale and we have a sequence of salient set recovery problems indexed by n for the probability models {P^(d), Q^(d)}_{d∈N}. Thus, d = d(n) and k = k(n) and we are searching for how such dependencies must behave (scale) such that we have achievability. This is of interest since this regime (typically d ≫ n, k) is most applicable to many practical problems and modern datasets such as the motivating example in the introduction. Before stating our main theorem, we define the greatest lower bound (g.l.b.) of the error exponents as

B := B({P^(d), Q^(d)}_{d∈N}) := inf_{d∈N} C(P^(d), Q^(d)),    (16)

where C(P^(d), Q^(d)) is given in (14). Clearly, B ≥ 0 by the non-negativity of the KL-divergence. In fact, we prove that B > 0 under assumptions A1 – A3, i.e., the exponents in (14) are uniformly bounded away from zero. For ε > 0, define

g_1(k, ε) := exp( 2k log|X| / (1 − ε) ),    (17)
g_2(d, k) := (k/B) log( (d − k)/k ).    (18)

Theorem 3 (Main Result: Achievability): Assume that A1 – A3 hold for the sequence of distributions {P^(d), Q^(d)}_{d∈N}. If there exists an ε > 0 and an N_ε ∈ N such that

n > max{ g_1(k, ε), g_2(d, k) },   ∀ n > N_ε,    (19)

then p_n(ψ_n^*) = O(exp(−nc)), where the exponent

c := B − lim sup_{n→∞} (k/n) log( (d − k)/k ) > 0.    (20)

In other words, the sequence {(n, d, k)}_{n∈N} of parameters is achievable if (19) holds. Furthermore, the exhaustive search decoder in (8) achieves the scaling law in (19).
The key elements in the proof include applications of large deviations bounds (e.g., Sanov's theorem), the asymptotic behavior of binomial coefficients and, most crucially, demonstrating the positivity of the g.l.b. of the error exponents B defined in (16). We now discuss the ramifications of Theorem 3.
Firstly, n > g_1(k, ε) means that k, the number of salient features, is only allowed to grow logarithmically in n. Secondly, n > g_2(d, k) means that if k is a constant, the number of redundant features |S_d^c| = d − k can grow exponentially with n, and p_n still tends to zero exponentially fast. This means that recovery of S_d is asymptotically possible even if the data dimension is extremely large (compared to n) but the number of salient features remains a small fraction of the total number d. We state this observation formally as a corollary of Theorem 3.
Corollary 4 (Achievability for constant k): Assume A1 – A3. Let k = k_0 be a constant and fix R < R_1 := B/k_0. Then if there exists a N ∈ N such that n > (log d)/R for all n > N, then the error probability obeys p_n(ψ_n^*) = O(exp(−nc′)), where the exponent is c′ := B − k_0 R.
This result means that we can recover the salient set even though the number of variables d is much larger (exponential) in the number of samples n, as in the asthma example.
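To get a feel for the thresholds (17)–(19), the following toy computation plugs in illustrative numbers; the values B = 0.5, ε = 0.1, and the binary alphabet are assumptions made only for this sketch and are not taken from the paper.

from math import exp, log

def g1(k, eps, alphabet_size=2):
    # (17): controls how fast k may grow; exponential in k
    return exp(2 * k * log(alphabet_size) / (1 - eps))

def g2(d, k, B):
    # (18): controls how fast d may grow; only logarithmic in (d - k)/k
    return (k / B) * log((d - k) / k)

# Illustrative values (B and eps are assumed, not taken from the paper):
B, eps, d, k = 0.5, 0.1, 10**6, 5
n_required = max(g1(k, eps), g2(d, k, B))
print(round(g1(k, eps), 1), round(g2(d, k, B), 1), round(n_required, 1))
# For these assumed values g1 dominates and a few thousand samples suffice,
# which is far smaller than d = 10^6: the "n much smaller than d" message.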
D. A Converse Result for the High-Dimensional Case
In this section, we state a converse theorem (and several useful corollaries) for the high-dimensional case. Specifically, we establish a condition on the scaling of (n, d, k) so that the probability of error is uniformly bounded away from zero for any decoder. In order to apply standard proof techniques (such as Fano's inequality) for converses that apply to all possible decoders ψ_n, we consider the following slightly modified problem setup where S_d is random and not fixed as was in Theorem 3. More precisely, let {P̃^(d), Q̃^(d)}_{d∈N} be a fixed sequence of distributions, where P̃^(d), Q̃^(d) ∈ P(X^d). We assume that this sequence of distributions satisfies A1 – A3, namely there exists a salient set S_d ∈ S_{k,d} such that P̃^(d), Q̃^(d) satisfies (4) for all d.
Let Π be a permutation of V_d chosen uniformly at random, i.e., Pr(Π = π) = 1/(d!) for any permutation operator π : V_d → V_d. Define the sequence of distributions {P^(d), Q^(d)}_{d∈N} as

P^(d) := P̃^(d)_π,   Q^(d) := Q̃^(d)_π,   π ∼ Π.    (21)

Put simply, we permute the indices in P̃^(d), Q̃^(d) (according to the realization of Π) to get P^(d), Q^(d), i.e., P^(d)(x_1 . . . x_d) := P̃^(d)(x_{π(1)} . . . x_{π(d)}). Thus, once π has been drawn, the distributions P^(d) and Q^(d) of the random vectors x and y are completely determined. Clearly the salient sets S_d are drawn uniformly at random (u.a.r.) from S_{k,d} and we have the Markov chain:

S_d −(ϕ_n)→ (x^n, y^n) −(ψ_n)→ Ŝ_d,    (22)
where the length-d random vectors (x, y) ∼ P^(d) × Q^(d) and Ŝ_d is any estimate of S_d. Also, ϕ_n is the encoder given by the random draw of π and (21). ψ_n is the decoder defined in Section II-B. We denote the entropy of a random vector z with pmf P as H(z) = H(P) and the conditional entropy of z_A given z_B as H(z_A | z_B) = H(P_{A|B}).
Theorem 5 (Converse): Assume that the salient sets {S_d}_{d∈N} are drawn u.a.r. and encoded as in (21). If

n < λ k log(d/k) / ( H(P^(d)) + H(Q^(d)) ),   for some λ ∈ (0, 1),    (23)

then p_n(ψ_n) ≥ 1 − λ for any decoder ψ_n.
The converse is proven using Fano's inequality [2, Ch. 2]. Note from (23) that if the non-salient set S_d^c consists of uniform random variables independent of those in S_d, then H(P^(d)) = O(d) and the bound is never satisfied. However, the converse is interesting and useful if we consider distributions with additional structure on their entropies. In particular, we assume that most of the non-salient variables are redundant (or processed) versions of the salient ones. Again appealing to the asthma example in the introduction, there could be two features in the dataset: "body mass index" (in S_d) and "is obese" (in S_d^c). These two features capture the same basic information and are thus redundant, but the former may be more informative to the asthma hypothesis.
Corollary 6 (Converse with Bound on Conditional Entropy): If there exists a M < ∞ such that

max{ H(P^(d)_{S_d^c | S_d}), H(Q^(d)_{S_d^c | S_d}) } ≤ M k    (24)

for all d ∈ N, and

n < λ log(d/k) / ( 2(M + log|X|) ),   for some λ ∈ (0, 1),    (25)

then p_n(ψ_n) ≥ 1 − λ for any decoder ψ_n.
Corollary 7 (Converse for constant k): Assume the setup in Corollary 6. Fix R > R_2 := 2(M + log|X|). Then if k is a constant and if there exists an N ∈ N such that n < (log d)/R for all n > N, then there exists a δ > 0 such that the error probability p_n(ψ_n) ≥ δ for all decoders ψ_n.
We previously showed (cf. Corollary 4) that there is a rate of
growth R1 so that achievability holds if R < R1 . Corollary 7
says that, under the specified conditions, there is also another
rate R2 so that if R > R2 , recovery of Sd is no longer possible.
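The gap between the two rates can be made concrete with a toy computation; all constants below (B, k_0, M, |X|) are assumed purely for illustration.

from math import log, exp

# Assumed constants for illustration only (not taken from the paper):
B, k0, M, alphabet_size = 0.5, 3, 1.0, 2

R1 = B / k0                        # Corollary 4: achievable if n > (log d)/R with R < R1
R2 = 2 * (M + log(alphabet_size))  # Corollary 7: recovery fails if n < (log d)/R with R > R2

n = 100                            # sample budget
d_achievable = exp(n * R1)         # dimensions provably recoverable with n samples
d_converse = exp(n * R2)           # roughly beyond this, Corollary 7 gives a converse
print(R1, R2, d_achievable, d_converse)
# R1 < R2, so the sufficient and necessary growth rates of log d differ by a
# constant factor; closing this gap is listed as future work in Section V.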
IV. SPECIALIZATION TO TREE DISTRIBUTIONS
As mentioned previously, the ESD in (8) is computationally
prohibitive. In this section, we assume Markov structure on the
distributions (also called graphical models [8]) and devise an
efficient algorithm to reduce the computational complexity of
the decoder. To do so, for each d and k, assume the following:
A4: (Markov tree) The distributions P := P^(d), Q := Q^(d) are undirected graphical models [8]. More specifically, P, Q are Markov on a common tree T = (V(T), E(T)), where V(T) = {1, . . . , d} is the vertex set and E(T) ⊂ (V choose 2) is the edge set. That is, P, Q admit the factorization:

P(x) = Π_{i ∈ V(T)} P_i(x_i)  Π_{(i,j) ∈ E(T)} [ P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) ].    (26)

A5: (Subtree) The salient set S := S_d is such that P_S, Q_S are Markov on a common (connected) subtree T_S = (V(T_S), E(T_S)) in T.
Note that T_S ⊂ T has to be connected so that the marginals P_S and Q_S remain Markov on trees. Otherwise, additional edges may be introduced when the variables in S^c are marginalized
out [8]. Under A4 and A5, the KL-divergence decomposes as

D(P || Q) = Σ_{i ∈ V(T)} D_i + Σ_{(i,j) ∈ E(T)} W_{i,j},    (27)

where D_i := D(P_i || Q_i) is the KL-divergence of the marginals and the weights W_{i,j} := D_{i,j} − D_i − D_j, with D_{i,j} := D(P_{i,j} || Q_{i,j}). A similar decomposition holds for D(P_S || Q_S) with V(T_S), E(T_S) in (27) in place of V(T), E(T). Let T_k(T) be the set of subtrees with k < d vertices in T, a tree with d vertices.
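As a quick numerical sanity check of (27), the sketch below (illustrative only; the chain and its parameters are assumptions) builds two distributions that are Markov on the same 3-node chain and compares the brute-force KL-divergence with the node-plus-edge-weight sum.

from itertools import product
from math import log

def chain(p1, p21, p32):
    """Joint on {0,1}^3 Markov on the chain 0 - 1 - 2: p(x) = p1(x0) p(x1|x0) p(x2|x1)."""
    return {(a, b, c): p1[a] * p21[a][b] * p32[b][c]
            for a, b, c in product((0, 1), repeat=3)}

# Assumed illustrative parameters for P and Q (same chain structure, different values).
P = chain([0.6, 0.4], [[0.7, 0.3], [0.2, 0.8]], [[0.5, 0.5], [0.9, 0.1]])
Q = chain([0.3, 0.7], [[0.6, 0.4], [0.4, 0.6]], [[0.8, 0.2], [0.3, 0.7]])

def marginal(p, idx):
    """Marginal of the joint p over the coordinates in idx."""
    out = {}
    for x, v in p.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def kl(p, q):
    return sum(v * log(v / q[z]) for z, v in p.items())

D_node = {i: kl(marginal(P, (i,)), marginal(Q, (i,))) for i in range(3)}
D_edge = {e: kl(marginal(P, e), marginal(Q, e)) for e in [(0, 1), (1, 2)]}
W = {e: D_edge[e] - D_node[e[0]] - D_node[e[1]] for e in D_edge}

lhs = kl(P, Q)                               # brute force over all of X^3
rhs = sum(D_node.values()) + sum(W.values())  # decomposition (27)
print(round(lhs, 10), round(rhs, 10))         # the two values agree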
We now describe an efficient algorithm to learn S when T is unknown. Firstly, using the samples (x^n, y^n), learn a single Chow-Liu [9] tree model T_ML using the sum of the empirical mutual information quantities {I(P̂_{i,j}) + I(Q̂_{i,j})} as the edge weights. It is known that the Chow-Liu max-weight spanning tree algorithm is consistent and large deviations rates have also been studied [10]. Secondly, solve the following optimization:

T_k^* = argmax_{T_k ∈ T_k(T_ML)}  Σ_{i ∈ V(T_k)} D̂_i + Σ_{(i,j) ∈ E(T_k)} Ŵ_{i,j},    (28)

where D̂_i and Ŵ_{i,j} are the empirical versions of D_i and W_{i,j} respectively. In (28), the sum of the node and edge weights over all size-k subtrees in T_ML is maximized. The problem in (28) is known as the k-CARD TREE problem [11] and it runs in time O(dk^2) using a dynamic programming procedure on trees. Thirdly, let the estimate of the salient set be the vertex set of T_k^*, i.e., ψ_n(x^n, y^n) := V(T_k^*).
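A compact, deliberately simplified Python sketch of the three steps is given below. It is illustrative only: it uses networkx's maximum spanning tree for the Chow-Liu step, and it finds the size-k subtree in (28) by brute force over connected vertex sets rather than by the O(dk^2) k-CARD TREE dynamic program of [11], so it is practical only for small d. The eps smoothing of empty empirical cells is likewise our assumption.

from itertools import combinations
from collections import Counter
from math import log
import networkx as nx

def emp(samples, idx):
    """Empirical distribution of the coordinates in idx."""
    c = Counter(tuple(s[i] for i in idx) for s in samples)
    n = len(samples)
    return {z: v / n for z, v in c.items()}

def kl(p, q, eps=1e-12):
    return sum(v * log(v / q.get(z, eps)) for z, v in p.items())

def mi(p_ij):
    """Mutual information of an empirical pairwise distribution."""
    p_i, p_j = Counter(), Counter()
    for (a, b), v in p_ij.items():
        p_i[a] += v
        p_j[b] += v
    return sum(v * log(v / (p_i[a] * p_j[b])) for (a, b), v in p_ij.items())

def salient_set_tree(xs, ys, k):
    d = len(xs[0])
    # Step 1: Chow-Liu tree T_ML with edge weights I(P-hat_ij) + I(Q-hat_ij).
    G = nx.Graph()
    for i, j in combinations(range(d), 2):
        G.add_edge(i, j, weight=mi(emp(xs, (i, j))) + mi(emp(ys, (i, j))))
    T = nx.maximum_spanning_tree(G)
    # Node and edge weights D-hat_i and W-hat_ij of the decomposition (27).
    D = {i: kl(emp(xs, (i,)), emp(ys, (i,))) for i in range(d)}
    W = {tuple(sorted((i, j))): kl(emp(xs, (i, j)), emp(ys, (i, j))) - D[i] - D[j]
         for i, j in T.edges()}
    # Step 2: brute-force stand-in for the k-CARD TREE optimization (28).
    best, best_val = None, float("-inf")
    for nodes in combinations(range(d), k):
        sub = T.subgraph(nodes)
        if not nx.is_connected(sub):
            continue                      # (28) ranges over subtrees of T_ML only
        val = sum(D[i] for i in nodes) + sum(W[tuple(sorted(e))] for e in sub.edges())
        if val > best_val:
            best, best_val = set(nodes), val
    # Step 3: the estimate of the salient set is the vertex set of the best subtree.
    return best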
Proposition 8 (Complexity Reduction for Trees): Assume
that A4 and A5 hold. Then if k, d are constant, the algorithm
described above to estimate S is consistent. Moreover, the time complexity is O(dk^2 + n d^2 |X|^2).
Hence, there are significant savings in computational complexity if the probability models P and Q are trees (26).
V. CONCLUSION AND FURTHER WORK
In this paper, we defined the notion of saliency and provided
necessary and sufficient conditions for the asymptotic recovery
of salient subsets in the high-dimensional regime. In future
work, we seek to strengthen these results by reducing the gap
between the achievability and converse theorems. In addition,
we would like to derive similar types of results for the scenario
when k is unknown to the decoder. We have developed
thresholding rules for discovering the number of edges in the
context of learning Markov forests [12] and we believe that
similar techniques apply here. We also plan to analyze error
rates for the algorithm introduced in Section IV.
Acknowledgments: The authors acknowledge Prof. V.
Saligrama and B. Nakiboğlu for helpful discussions.
REFERENCES
[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Research, vol. 3, pp. 1157–1182, 2003.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley-Interscience, 2006.
[3] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. on PAMI, vol. 27, no. 8, pp. 1226–1238, 2005.
[4] D. Koller and M. Sahami, "Toward optimal feature selection," Stanford InfoLab, Tech. Rep., 1996.
[5] A. Y. Ng, "On feature selection: learning with exponentially many irrelevant features as training examples," in Proc. 15th ICML. Morgan Kaufmann, 1998, pp. 404–412.
[6] M. J. Wainwright, "Information-Theoretic Limits on Sparsity Recovery in the High-Dimensional and Noisy Setting," IEEE Trans. on Info. Th., pp. 5728–5741, Dec 2009.
[7] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, 2nd ed. Springer, 1998.
[8] S. Lauritzen, Graphical Models. Oxford University Press, USA, 1996.
[9] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. on Info. Th., vol. 14, no. 3, pp. 462–467, May 1968.
[10] V. Y. F. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, "A Large-Deviation Analysis for the ML Learning of Markov Tree Structures," submitted to IEEE Trans. on Info. Th., arXiv:0905.0940, May 2009.
[11] C. Blum, "Revisiting dynamic programming for finding optimal subtrees in trees," European J. of Ops. Research, vol. 177, no. 1, 2007.
[12] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky, "Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates," submitted to J. Mach. Learn. Research, on arXiv, May 2010.