Learning and Testing Junta Distributions over
Hypercubes
by
Maryam Aliakbarpour
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2015
© Massachusetts Institute of Technology 2015. All rights reserved.
Author ................................................................
Department of Electrical Engineering and Computer Science
August 24, 2015
Certified by ............................................................
Ronitt Rubinfeld
Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by ............................................................
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, EECS Committee for Graduate Students
Learning and Testing Junta Distributions over Hypercubes
by
Maryam Aliakbarpour
Submitted to the Department of Electrical Engineering and Computer Science
on August 24, 2015, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
Many tasks related to the analysis of high-dimensional datasets can be formalized
as problems involving learning or testing properties of distributions over a high-dimensional
domain. In this work, we initiate the study of the following general question: when
many of the dimensions of the distribution correspond to "irrelevant" features in the
associated dataset, can we learn the distribution efficiently? We formalize this question
with the notion of a junta distribution. The distribution D over {0,1}^n is a k-junta
distribution if the probability mass function p of D is a k-junta; i.e., if there is a set
J ⊆ [n] of at most k coordinates such that for every x ∈ {0,1}^n, the value of p(x) is
completely determined by the value of x on the coordinates in J.
We show that it is possible to learn k-junta distributions with a number of samples
that depends only logarithmically on the total number n of dimensions. We give two
proofs of this result; one using the cover method and one by developing a Fourier-based
learning algorithm inspired by the Low-Degree Algorithm of Linial, Mansour,
and Nisan (1993).
We also consider the problem of testing whether an unknown distribution is a
k-junta distribution. We introduce an algorithm for this task with sample complexity
Õ(2^{n/2} k) and show that this bound is nearly optimal for constant values of k. As a
byproduct of the analysis of the algorithm, we obtain an optimal bound on the number
of samples required to test a weighted collection of distributions for uniformity.
Finally, we establish the sample complexity of learning and testing other classes
of distributions related to junta distributions. Notably, we show that the task of
testing whether a distribution on {0,1}^n contains a coordinate i ∈ [n] such that x_i
is drawn independently from the remaining coordinates requires Θ(2^{2n/3}) samples.
This is in contrast to the task of testing whether all of the coordinates are drawn
independently from each other, which was recently shown to have sample complexity
Õ(2^{n/2}) by Acharya, Daskalakis, and Kamath (2015).
Thesis Supervisor: Ronitt Rubinfeld
Title: Professor of Electrical Engineering and Computer Science
Acknowledgments
The results of this thesis are based on a collaboration with Eric Blais and Ronitt
Rubinfeld. I would like to gratefully and sincerely thank Prof. Ronitt Rubinfeld, my
advisor, for her guidance, support, and most importantly, her friendship. Moreover,
I would like to thank my parents for providing me with unending love and support.
Contents

1 Introduction
  1.1 Our Results
      1.1.1 Learning junta distributions
      1.1.2 Testing junta distributions
      1.1.3 Learning and testing dictator distributions
      1.1.4 Learning and testing feature-restricted distributions
      1.1.5 Learning and testing feature-separable distributions
  1.2 Related work
  1.3 Preliminaries

2 Learning junta distributions
  2.1 Learning junta distributions using Fourier analysis
      2.1.1 Step 1: The gap between h(J*) and h(J)
      2.1.2 Step 2: Equality of f(J) and h(J)
      2.1.3 Step 3: Estimating f(J)
  2.2 A lower bound for learning juntas

3 Testing junta distributions
  3.1 A test algorithm for junta distributions
      3.1.1 Uniformity Test of a Collection of Distributions
      3.1.2 Uniformity Test within a Bucket
  3.2 A lower bound for testing junta distributions

4 Learning and testing feature-separable distributions
  4.1 Testing feature-separable distributions
  4.2 Identifying separable features

5 Learning and testing 1-junta distributions
  5.1 Learning 1-juntas
  5.2 Testing 1-juntas

6 Learning and testing dictator distributions
  6.1 Learning dictator distributions
  6.2 Testing dictator distributions
  6.3 Learning and testing feature-restricted distributions
      6.3.1 Testing feature-restricted distributions
      6.3.2 Identifying restricted features

A Learning juntas with the cover method

B Proof of Equation (2.1)
Chapter 1
Introduction
One of the central challenges in data analysis today is the study of high-dimensional
datasets.
Consider for example the general scenario for medical studies: to bet-
ter understand some medical condition, researchers select a set of patients with this
condition and collect some attributes (or features) related to their health and environment (e.g., smoker or not, age, full genome analysis).
Until recently, typical
medical studies collected only a few features from each patient. For example, a classic dataset obtained from a long-term diabetes study [39] included only 8 features.
By contrast, modern genome-wide association studies routinely include over a million
different features [16].
Many problems related to understanding high-dimensional datasets can be formalized in the setting of learning distributions and testing whether those distributions
have some given properties.
For example, in the medical study scenario, we can
view each selected patient as being drawn from the (unknown) underlying distribution over all patients with the medical condition; to understand this condition, we
want to learn this distribution or identify some of its prominent characteristics. In
order to do so, we need learning and testing algorithms whose sample complexity and
time complexity are both reasonable functions (ideally, polynomial in) the dimension
of the distribution's domain. Traditional algorithms developed for low-dimensional
distributions typically have sample and time complexity that is exponential in the
dimension of the distribution, so in most cases new algorithms are required for the
high-dimensional setting. To obtain such algorithms, we must exploit some structural
aspects of the high-dimensional distributions under study. The starting point of the
current research is a basic observation: in datasets with many features, most of the
features are often irrelevant to the phenomenon under study. (In our medical study
example, a full genome analysis includes information about every gene even when
only a small number of them are related to the condition.)
As we discuss in the
related work section below, the same observation has been extremely influential in
the theory of learning functions; can it also lead to efficient algorithms for learning
and testing distributions?
In this work, we initiate the theoretical study of these questions in the setting of
learning distributions [29] and testing properties of distributions [7]. For concreteness,
we focus on the case where each feature collected in the dataset is Boolean, i.e., on
distributions over the Boolean hypercube {0,1}^n. Our first task is to formalize the
notion of relevant and irrelevant features. This notion has already been studied extensively
in the setting of functions over the Boolean hypercube, where f : {0,1}^n → R
is a k-junta if there is a set J ⊆ [n] of size at most k such that for every x ∈ {0,1}^n,
the value f(x) is completely determined by {x_j}_{j∈J}. In this setting, the coordinates
in J correspond to the relevant features, and the remaining coordinates are called
irrelevant.
There is a very natural extension of the definition of juntas to the setting of
distributions: we say that the distribution D over {0,1}^n is a k-junta distribution if
its probability mass function p : {0,1}^n → [0,1] is a k-junta. In other words, D is a
k-junta distribution if there is a set J ⊆ [n] of size at most k such that the probability
that x ∈ {0,1}^n is drawn from D is completely determined by {x_j}_{j∈J}. This notion,
as in the case of Boolean functions, also captures a fundamental notion of relevance
of features. Again using the medical study example, we see that the features that are
relevant to the medical condition will be exactly the ones identified in J when the
features are distributed uniformly (and independently) at random over
the whole population (instead of those with the medical condition). This notion is of
course a somewhat idealized scenario; nonetheless, it does appear to capture much of
the inherent complexity of learning distributions in the presence of irrelevant features
and appears to be a natural starting point for this line of research.
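As a concrete illustration, the following minimal Python sketch samples from a k-junta distribution: the junta coordinates are drawn according to a table of biases over the 2^k patterns, and the remaining n − k coordinates are filled in uniformly at random. The function name, the set J, and the bias table are illustrative choices, not fixed notation from this thesis.

    import random

    def sample_k_junta(n, J, biases):
        """Draw one sample from a k-junta distribution over {0,1}^n.

        J       -- list of k junta coordinates (0-indexed), e.g. [0, 3]
        biases  -- list of 2^k probabilities, one per assignment of the
                   coordinates in J (indexed by the binary encoding of the
                   assignment); they must sum to 1.
        """
        k = len(J)
        # Pick an assignment of the junta coordinates according to the biases.
        pattern = random.choices(range(2 ** k), weights=biases)[0]
        x = [random.randint(0, 1) for _ in range(n)]  # irrelevant coordinates: uniform
        for bit, coord in enumerate(reversed(J)):
            x[coord] = (pattern >> bit) & 1
        return x

    # Example: a 2-junta over {0,1}^5 whose mass depends only on coordinates 0 and 3.
    print(sample_k_junta(5, [0, 3], [0.1, 0.2, 0.3, 0.4]))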
We also examine other classes of distributions that are of particular interest to
our understanding the role of relevant and irrelevant features in high-dimensional distributions. We study dictator distributions, which capture the simpler setting where
one of the features is always set to 1 and the remaining coordinates are irrelevant;
feature-restricted distributions, where we keep the requirement that one of the features
takes the value 1 for all inputs in the support of the distribution but add no other
constraints to the remaining coordinates; and feature-separable distributions, where
one of the features is drawn independently from the rest.
We consider the problems of learning and testing each of these classes of distributions. Our results show that these problems exhibit a rich diversity in sample and
time complexity. We discuss our results in more detail below. Table 1.1 includes a
summary of the sample complexity for each of these problems.
1.1 Our Results

1.1.1 Learning junta distributions
We begin by considering the problem of learning k-juntas. We can obtain a strong
upper bound on the sample complexity of this problem using the cover method [24,
22, 23]. By showing that there is a set C of \binom{n}{k} · 2^{k2^k/ε} distributions such that every
k-junta distribution is ε-close to some distribution in C, we obtain the following result.
Theorem 1.1.1. Fix ε > 0 and 1 ≤ k ≤ n. Define t = \binom{n}{k} · 2^{k2^k/ε}. There is an
algorithm A with sample complexity O(log t/ε²) = O(k2^k/ε³ + k log n/ε²) and running
time O(t log t/ε²) that, given samples from a k-junta distribution D, outputs, with probability at
least 2/3, a distribution D' such that

    d_TV(D, D') := ½ Σ_{x∈{0,1}^n} |p_D(x) − p_{D'}(x)| ≤ ε.
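For intuition, and assuming the cover size t = \binom{n}{k} · 2^{k2^k/ε} stated above, the sample complexity bound expands by routine arithmetic (using log \binom{n}{k} = O(k log n)):

    \log t = \log\binom{n}{k} + \frac{k2^k}{\epsilon}\log 2
           = O\Big(k\log n + \frac{k2^k}{\epsilon}\Big),
    \qquad\text{so}\qquad
    O\Big(\frac{\log t}{\epsilon^2}\Big) = O\Big(\frac{k2^k}{\epsilon^3} + \frac{k\log n}{\epsilon^2}\Big).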
The algorithm's sample complexity is logarithmic in n, an exponential improvement
over the sample complexity of the (folklore) general learning algorithm. The
running time of the algorithm from Theorem 1.1.1, however, is Ω(t/ε²), which is exponential
in k. Our second main learning result introduces a different algorithm for
learning k-juntas with a running time that is only singly-exponential in k while still
maintaining the logarithmic dependence on n in the sample complexity.
Theorem 1.1.2. Fix ε > 0 and 1 ≤ k ≤ n. There is an algorithm A with sample
complexity O(2^{2k} k log n/ε⁴) and running time O(\binom{n}{k} · 2^{2k} k log n/ε⁴) that, given samples
from a k-junta distribution D, outputs, with probability at least 2/3, a distribution D'
such that d_TV(D, D') ≤ ε.

The proof of Theorem 1.1.2 is inspired by the Low-Degree Algorithm of Linial,
Mansour, and Nisan [31] for learning Boolean functions. As in the Low-Degree Algorithm,
we estimate the low-degree Fourier coefficients of the function of interest (in
this case, the probability mass function of the unknown distribution). Unlike in the
low-degree algorithm, however, we do not use those estimated Fourier coefficients to
immediately generate our hypothesis. Instead, we show that these estimates can be
used to identify a set J ⊆ [n] of coordinates such that the target function is close to
being a junta distribution on J.
Our upper bounds on the sample complexity of the problem of learning k-juntas
have a logarithmic dependence on n and an exponential complexity in k. We show
that both of these dependencies are necessary.
Theorem 1.1.3. Fix ε > 0 and n ≥ k ≥ 1. Any algorithm that learns k-junta
distributions over {0,1}^n has sample complexity Ω(2^k/ε² + k log(n)/ε).
The key element of the proof of Theorem 1.1.3 is the construction of a distribution
V over k-junta distributions that outputs the uniform distribution with probability ½
and such that any deterministic algorithm drawing o(k log(n)/ε) samples minimizes
its error by always outputting the uniform distribution as its hypothesis. This result
implies the Ω(k log(n)/ε) portion of the lower bound in the theorem, and the remaining
Ω(2^k/ε²) term is obtained by a simple reduction to the problem of learning general
discrete distributions.
The proofs of Theorems 1.1.2 and 1.1.3 are presented in Section 2, and the proof
of Theorem 1.1.1 is included in Appendix A.
1.1.2 Testing junta distributions
We next turn our attention to the problem of testing whether an unknown distribution
is a k-junta distribution or not. More precisely, a distribution D is ε-far from having
some property P (e.g., being a k-junta) if for every distribution D' that does have
property P, we have d_TV(D, D') > ε. An ε-tester for property P is a randomized
algorithm with bounded error that distinguishes distributions with property
P from those that are ε-far from having the same property. We show that it is possible
to test k-juntas with a number of samples that is sublinear in the size of the domain
of the distribution.
Theorem 1.1.4. Fix ε > 0 and n ≥ k ≥ 1. There is an algorithm that draws
Õ(2^{n/2} k⁴/ε³ + 2^k/ε) samples from D and distinguishes, with probability at least 2/3, between
the case where D is a k-junta and the case where it is ε-far from being a k-junta.
The proof of Theorem 1.1.4 is obtained by reducing the problem of testing juntas
to the problem of testing a weighted collection of distributions for uniformity. We then
show that this problem can be solved with roughly O(√(mN)) samples when there are m distributions
with domain size N each. See Theorem 3.1.2 for the details.
When k ≪ n, the sample complexity of the junta testing algorithm is much larger
than (i.e., doubly-exponential in) the sample complexity of the learning algorithm.
We show that this gap is unavoidable and that the bound in Theorem 1.1.4 is nearly
optimal.
Theorem 1.1.5. Fix any 0 < k < n and any constant 0 < ε < 1. Every algorithm
for ε-testing k-juntas must make Ω(2^{n/2}/ε²) queries.
The lower bound again uses the connection to the problem of testing collections
of distributions. In this case, this is done by constructing a distribution over k-junta
distributions and distributions that are far from k-juntas such that any algorithm
that distinguishes between the two with sample complexity o(2^{n/2}/ε²) would also be
able to test collections of distributions for uniformity with a number of samples that
violates a lower bound of Levi, Ron, and Rubinfeld [30].
See Section 3 for the proofs of Theorems 1.1.4 and 1.1.5.
1.1.3 Learning and testing dictator distributions
Let us examine the special case where k = 1 in more detail. The definition of 1-junta
distributions can be expressed in a particularly simple way: a distribution D is a
1-junta if its probability mass function p : {0,1}^n → R is of the form

    p(x) = α/2^{n−1}  if x_i = 1,   and   p(x) = (1 − α)/2^{n−1}  if x_i = 0,

for some i ∈ [n] and 0 ≤ α ≤ 1. Using this definition, we obtain simpler and more
sample-efficient algorithms for learning and testing 1-juntas. (See Chapter 5 for the
details.) This representation also suggests another natural class of distributions to
study: those that satisfy this definition with α = 1. This corresponds to the class of
distributions with only 1 relevant feature, and where moreover this feature completely
determines whether the element is in the support of the distribution or not. This class
also corresponds to the set of distributions whose probability mass function is a scalar
multiple of a dictator function. We thus call these distributions dictator distributions.
We give exact bounds for the number of samples required for learning and testing
dictator distributions.
Theorem 1.1.6. Fix ε > 0 and n ≥ 1. The minimal sample complexity to learn
dictator distributions up to total variation distance ε is Θ(log n). The minimal sample
complexity to ε-test whether a distribution D on {0,1}^n is a dictator distribution is
Θ(2^{n/2}/ε²).
The sample complexity for the task of learning dictator distributions is obtained
with a simple argument. The upper and lower bounds for testing dictator distributions
are both obtained by exploiting the close connection between dictator distributions
and uniform distributions. The proof of Theorem 1.1.6 is included in Chapter 6.
1.1.4 Learning and testing feature-restricted distributions
The definition of dictator distributions in turn suggests a second variant of interest:
the distribution D is a dictator distribution if there exists an index i ∈ [n] such that
D is the uniform distribution over {x ∈ {0,1}^n : x_i = 1}. What happens when we
consider arbitrary distributions whose support satisfies the same property? We call
such distributions feature-restricted distributions, and a coordinate i that satisfies the
condition that x_i = 1 for every x in the support of D is called a restricted feature.
We give exact bounds for the number of samples required to learn and test feature-restricted
distributions as well as to identify a restricted feature of feature-restricted
distributions.
Theorem 1.1.7. Fix ε > 0 and n ≥ 1. The minimal sample complexity to learn
feature-restricted distributions up to total variation distance ε is Θ(2^n/ε²), but we can
identify a restricted feature of these distributions with only Θ(log(n)/ε) samples. The
minimal sample complexity to ε-test whether a distribution D on {0,1}^n is feature-restricted
is Θ(log n/ε).
The learning result follows from a simple reduction to the problem of learning
general discrete distributions.
The sample complexities of the other two tasks, the
identification of restricted features and testing feature-restricted distributions, are
both established with elementary arguments. For completeness, the details of the
proof of Theorem 1.1.7 are included in Section 6.3.
1.1.5 Learning and testing feature-separable distributions
Finally, the last topic we consider in this work is a property of distributions called
feature separability. The distribution D over {0,1}^n is feature-separable if there exists
an index i ∈ [n] such that the variable x_i is drawn independently from the remaining
coordinates of x, or equivalently where there exist probability mass functions q :
{0,1} → [0,1] and r : {0,1}^{n−1} → [0,1] such that the probability mass function
p of D is defined by p(x) = q(x_i) · r(x_1, …, x_{i−1}, x_{i+1}, …, x_n). This property is very
close to that of dictatorships: a dictator distribution is one that satisfies the feature
separability condition under the extra restriction that r be the uniform distribution.
We show that feature-separability can be tested with a number of samples that is
sublinear in the size of the domain of the distribution.
Theorem 1.1.8. Fix ε > 0 and n ≥ 1. The minimal sample complexity to learn
feature-separable distributions up to total variation distance ε is Θ(2^n/ε²), but we
can identify a separable feature of these distributions with O(poly(1/ε) · 2^{2n/3} n² log n)
samples. There is an algorithm that ε-tests whether a distribution D on {0,1}^n is feature-separable
with O(poly(1/ε) · 2^{2n/3} n² log n) samples. Furthermore, there is a constant
c_0 > 0 such that every c_0-tester for feature separability has sample complexity Ω(2^{2n/3}).
When every index i ∈ [n] satisfies the feature separability condition, the distribution
D is called independent. The sample complexity of the independence testing
problem has been established very recently: Θ(2^{n/2}/ε²) samples are both necessary
and sufficient for this task [1]; our result shows that, interestingly, the feature separability
testing problem requires significantly more samples. The proof of Theorem 1.1.8
is presented in Section 4.
1.2 Related work
Learning and testing Boolean functions. The present work was largely influenced
by the seminal work of Blum [13] and Blum and Langley [15], who first
proposed the study of junta functions to formalize the problem of learning Boolean
functions over domains that contain many irrelevant features. Their work led to a
rich line of research on learning juntas [14, 33, 3, 40]. Starting with the work of Fischer
et al. [27], there has also been a lot of research on the problem of testing junta
functions [21, 10, 11, 4, 38, 2].
Distribution class     Learning                                              Testing
k-juntas               O(k·2^k/ε³ + k log n/ε²), Ω(2^k/ε² + k log n/ε)        O(2^{n/2} k⁴/ε³ + 2^k/ε), Ω(2^{n/2}/ε²)
Dictators              Θ(log n)                                              Θ(2^{n/2}/ε²)
Feature-restricted     Θ(log(n)/ε) †                                         Θ(log n/ε)
Feature-separable      O(poly(1/ε)·2^{2n/3}·n² log n) †                      O(poly(1/ε)·2^{2n/3}·n² log n), Ω(2^{2n/3})

Table 1.1: Summary of our results. The table includes the upper and lower bounds on
the sample complexity for learning and testing the classes of distributions described in
Section 1.1. The results marked with † describe the sample complexity for identifying
the restricted feature and the separable feature, respectively. In both cases, the standard
learning task requires Θ(2^n/ε²) samples.
This work, along with the testing by implicit learning method, has since led to new results for testing many other properties of Boolean functions as well [26, 19, 18, 12]. Our work can be viewed as trying to extend this line
of research from the setting of supervised learning theory to unsupervised learning.
Learning and testing distributions.
The problem of determining the sample
complexity and running time requirements for learning unknown distributions over
large domains has a long and rich history; the recent book chapter of Diakonikolas [24]
and the references therein provide a great introduction to the topic. The model and
notation we used for the problem of learning distributions was introduced in [29].
The problem of testing properties of distributions was introduced more recently but
it has also generated a rich body of results, including tight bounds for the problem
of testing uniformity [28, 8, 35], monotonicity of the probability distribution function
[9], and identity to a given known distribution [6] or another unknown distribution
[8, 42, 25]. In addition, related problems like estimating entropy or support size have
been investigated [5, 41, 42, 37]. See Canonne's survey [17] for an engaging overview
of the results in this area.
Despite all of this previous work on learning and testing properties of distributions,
we are not aware of previous results on the classes of distributions we study in the
current paper.
Nor are we aware of existing approaches that seek to exploit the
notion of relevant or irrelevant features (rather than other assumptions about the
shape or family of the distribution) to obtain efficient algorithms for the analysis of
distributions.
Remark on terminology.
Procaccia and Rosenschein [36] introduced a different
class of distributions that they also called "junta distributions".
They did so in a
completely different context (the study of manipulability of distributions by small
coalitions of voters) and over completely different types of distributions (over instances
of candidate-ranking problems instead of the Boolean hypercube). We use the same
name for our notion of junta distributions because of the strong connection to junta
functions and because we believe there should be no confusion between the two (very
different) settings. We should emphasize, however, that our results do not apply to
Procaccia-Rosenschein junta distributions (and vice-versa).
1.3 Preliminaries
We denote the set {1, 2, …, n} by [n]. We use x to indicate a binary vector of size
n unless otherwise specified. The value of the i-th coordinate of x is denoted by x_i.
In addition, the restriction of x to the coordinates in the set I ⊆ [n] is denoted by
x^{(I)}. For example, 101000^{({1,3})} = 11. Let P : {0,1}^n → [0,1] be a distribution over
the hypercube. We sometimes write P(x) to denote the probability of drawing x from
P. The uniform distribution is denoted by U. The L1 distance between two distributions
is defined as d_{L1}(P_1, P_2) = Σ_x |P_1(x) − P_2(x)|; the total variation distance is
d_{tv}(P_1, P_2) = ½ · d_{L1}(P_1, P_2).

A function f : {0,1}^n → R is a dictator function if there is an index i ∈ [n] such
that f(x) = x_i for every x ∈ {0,1}^n. We say that f is a weighted dictator function if
there is a constant c ≠ 0 such that f(x) = c · x_i for every x ∈ {0,1}^n. The function
f is a k-junta for some value 1 ≤ k ≤ n if there is a set J ⊆ [n] of size |J| ≤ k such
that f(x) = f(y) whenever x^{(J)} = y^{(J)}. The classes of distributions that we consider
are defined as follows.
Definition 1.3.1 (Junta distributions). The distribution P over {0,1}^n with probability
mass function p : {0,1}^n → [0,1] is a k-junta distribution if p is a k-junta.
Equivalently, P is a k-junta distribution if p is of the form

    p(x) = a_i / 2^{n−k}   if x^{(J)} is the binary representation of i, for i = 0, 1, …, 2^k − 1,

for some set J ⊆ [n] of size at most k and some values a_0, …, a_{2^k−1} ∈ [0, 1] that satisfy
Σ_{i=0}^{2^k−1} a_i = 1.
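As a small worked example of this definition (with illustrative numbers):

    n = 3,\; k = 1,\; J = \{1\},\; a_0 = \tfrac14,\; a_1 = \tfrac34:\qquad
    p(x) = \begin{cases}
      a_0/2^{\,n-k} = 1/16 & \text{if } x_1 = 0,\\
      a_1/2^{\,n-k} = 3/16 & \text{if } x_1 = 1,
    \end{cases}
    \qquad
    \sum_{x\in\{0,1\}^3} p(x) = 4\cdot\tfrac1{16} + 4\cdot\tfrac3{16} = 1.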
Definition 1.3.2 (Dictator distributions). The distribution P over {0,1}^n is a dictator
distribution if its probability mass function p : {0,1}^n → [0,1] is a weighted
dictator function. Equivalently, P is a dictator distribution if its pmf p is of the form

    p(x) = 1/2^{n−1}  if x_i = 1,   and   p(x) = 0  if x_i = 0,

for some index i ∈ [n].
Definition 1.3.3 (Feature-restricted distributions). The distribution P over {0,1}^n
is a feature-restricted distribution if there is an index i ∈ [n] such that its probability
mass function p : {0,1}^n → [0,1] satisfies p(x) = 0 for every x with x_i = 0.
Definition 1.3.4 (Feature-separable distributions). The distribution P over {0,1}^n
is a feature-separable distribution if there is an index i ∈ [n] such that x_i and x^{([n]\{i})}
are independent random variables when x is drawn from P. Equivalently, P is a
feature-separable distribution if its probability mass function is of the form

    p(x) = α · q(x^{([n]\{i})})  if x_i = 1,   and   p(x) = (1 − α) · q(x^{([n]\{i})})  if x_i = 0,

for some index i ∈ [n], some parameter α ∈ [0,1], and some probability mass function
q : {0,1}^{n−1} → [0,1].
We use the model of learning distributions introduced in [29]. A concept class C
of discrete distributions is simply a set of distributions (e.g., the set of all dictator
or k-junta distributions). All of the learning algorithms introduced in this paper are
proper learning algorithms: they always output a hypothesis distribution in the target
distribution's concept class.

Definition 1.3.5 ((q, ε, δ)-Learner). A (q, ε, δ)-learner for a class C of distributions
is an algorithm that draws q samples from an arbitrary and unknown distribution P ∈ C
and outputs a hypothesis P' ∈ C such that, with probability at least 1 − δ, d_tv(P, P') ≤ ε.
Our testing results are obtained within the standard framework for testing properties
of distributions introduced in [7]. A property of discrete distributions is again
a set of distributions (it is equivalent to the notion of a concept class). A distribution
P is ε-far from having the property C if the total variation distance between the distribution
and any distribution that has the property is at least ε. A tester for C is an algorithm
that distinguishes distributions in C from those that are ε-far from C.

Definition 1.3.6 ((ε, δ)-test). An (ε, δ)-test for a property C of discrete distributions
is an algorithm that draws samples from an unknown distribution P and, with
probability at least 1 − δ, accepts if P is in C and rejects if P is ε-far from C.
We also consider the problem of testing collections of distributions. A (weighted)
collection of distributions is a set of m distributions P_1, P_2, …, P_m on the domain [N]
and a set of m weights w_1, …, w_m ∈ [0,1] such that Σ_i w_i = 1. We denote such a
collection by {P_i|w_i}_{i=1}^m. We can also view the collection as a single distribution
P where for any i ∈ [m] and x ∈ [N], P((i, x)) = w_i P_i(x). When we draw a sample
from {P_i|w_i}_{i=1}^m, we obtain a pair (i, j) such that i is picked with probability w_i
and then j is a sample drawn from P_i.
Definition 1.3.7 (Weighted distance to uniformity). The weighted distance to uniformity
of a set S ⊆ [m] of distributions in the collection {P_i|w_i}_{i=1}^m is Σ_{i∈S} w_i ·
d_tv(P_i, U). The weighted distance to uniformity of the collection itself is the weighted
distance to uniformity of the set S = [m].
Definition 1.3.8 ((ε, δ)-tester for uniformity). An (ε, δ)-test for uniformity (of collections
of distributions) is an algorithm that draws samples from an unknown collection
of distributions {P_i|w_i}_{i=1}^m and, with probability at least 1 − δ, accepts if all the P_i's
are uniform and rejects if the weighted distance to uniformity of the collection is at least ε.
Chapter 2
Learning junta distributions
2.1 Learning junta distributions using Fourier analysis
In this section, we introduce an algorithm for learning junta distributions. To do so,
we have to determine the set of junta coordinates J* and the biases a_i. We use
Fourier analysis over the Boolean hypercube to solve this problem. For a complete
introduction to Fourier analysis, see [34]. The Fourier basis function associated with a set
S ⊆ [n] is χ_S(x) = (−1)^{Σ_{i∈S} x_i} when S is not empty, and χ_∅(x) = 1. The distribution P,
or, more precisely, its probability mass function, is written in terms of the Fourier
basis as

    P(x) = Σ_{S⊆[n]} \hat{P}(S) · χ_S(x),

where the Fourier coefficient associated with S is defined as

    \hat{P}(S) = (1/2^n) Σ_{x∈{0,1}^n} P(x) · χ_S(x) = E[P(x) · χ_S(x)].

Here and throughout, we write E[f(x)] to refer to the expected value of f(x) when x
is picked uniformly at random and E_{x∼P}[f(x)] to refer to the expected value of f(x)
when x is drawn from P. The proof of Theorem 1.1.2 is obtained via the following
analysis of the learning algorithm in Figure 2-1.
Junta learning algorithm

1. Draw s = 9 · 2^{2k} · ln(6 · 2^k · \binom{n}{k}) / (2ε⁴) samples.
2. For all J ⊆ [n] of size k:
   2.1 \hat{f}(J) ← 0
   2.2 For all S ⊆ J with S ≠ ∅:
       \hat{f}(J) ← \hat{f}(J) + (2 · [# samples with χ_S(x) = 1] / s − 1)²
3. Output the J that maximizes \hat{f}(J) (break ties arbitrarily).

Figure 2-1: A learning algorithm for junta distributions
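The following Python sketch mirrors the estimate computed in Figure 2-1; it is an illustrative re-implementation, the function names are arbitrary, and, for readability, it enumerates all \binom{n}{k} candidate sets explicitly.

    from itertools import combinations, chain

    def estimate_f(samples, J):
        """Estimate f(J): sum over nonempty S subset of J of (2 Pr[chi_S(x)=1] - 1)^2."""
        s = len(samples)
        total = 0.0
        subsets = chain.from_iterable(combinations(J, r) for r in range(1, len(J) + 1))
        for S in subsets:
            # chi_S(x) = (-1)^{sum of x_i for i in S}; it equals 1 iff that parity is even.
            count = sum(1 for x in samples if sum(x[i] for i in S) % 2 == 0)
            total += (2.0 * count / s - 1.0) ** 2
        return total

    def learn_junta_set(samples, n, k):
        """Return the size-k set J maximizing the estimated f(J)."""
        return max(combinations(range(n), k), key=lambda J: estimate_f(samples, J))

By Theorem 2.1.1 below, once s is large enough the maximizing set supports a junta distribution that is ε-close to P.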
Theorem 2.1.1. Let P be a junta distribution over k coordinates in the set J* with biases
a_0, a_1, …, a_{2^k−1}. The Junta learning algorithm, shown in Figure 2-1, outputs a set
J such that P is ε-close to a junta distribution on J, using s = 9 · 2^{2k} · ln(6 · 2^k · \binom{n}{k}) / (2ε⁴)
samples, with probability at least 2/3. In addition, the running time of the algorithm
is O(n^k · 2^{3k} · k² log n / ε⁴).
Proof: To prove the correctness of the algorithm, we need to show that the probability
that the junta learning algorithm outputs a set J such that P is ε-far from every
junta distribution on the set J is very small. Let us call such sets J invalid sets; we
want to show that the algorithm outputs an invalid set only with small probability.
For the distribution P in the theorem statement and any set J ⊆ [n] of size k, we
define a distribution P_J by setting

    P_J(x) = Pr_{y∼P}[y^{(J)} = x^{(J)}] / 2^{n−k}

for every x in the hypercube. The distribution P_J is a junta distribution on the set
J. In particular, for any x and x' that agree on the coordinates in J, P_J(x) is equal to
P_J(x'). Furthermore, P_{J*} is equal to the original distribution P. Now we define two
functions f and h on subsets of [n] of size k by setting
    h(J) = 2^{2n} · E[(P_J(x) − 1/2^n)²]   and   f(J) = 2^{2n} · Σ_{S⊆J, S≠∅} \hat{P}(S)².

We complete the proof of correctness of the algorithm via the following three steps:

• Step 1. If J is not a valid output, then h(J*) − h(J) is at least 4ε².
• Step 2. f(J) is always equal to h(J). Therefore, f(J*) − f(J) is at least 4ε² too.
• Step 3. With probability at least 2/3, for every set J of size k, |\hat{f}(J) − f(J)| < 2ε².

We prove each statement in the corresponding section below. Assuming the correctness
of each step, the correctness of the algorithm follows from the fact that the
estimated value \hat{f}(J) will be less than \hat{f}(J*) for every invalid set J.
Finally, we analyze the running time of the algorithm. Observe that we consider
\binom{n}{k} subsets of size k. Each of them has 2^k − 1 non-empty subsets (the S's), and
for each such subset we need to compute χ_S(x) for every sample x, which takes O(k · s)
time. Thus, the time complexity of our algorithm is O(n^k · 2^{3k} · k² log n / ε⁴).
2.1.1 Step 1: The gap between h(J*) and h(J)
We begin with the observation that for every set J, the distribution P_J is a junta
distribution over the set J ∩ J*. Also, as we show in Appendix B, we have the identity

    P_J(x) = Pr_{y∼P}[y^{(J∩J*)} = x^{(J∩J*)}] / 2^{n−|J∩J*|}.   (2.1)

Note that if P is ε-far from being a junta distribution on the set J, it is also
ε-far from P_J. Therefore, by the following lemma we can infer that

    h(J*) − h(J) ≥ 4ε².   (2.2)
Lemma 2.1.2. Let P and P_J be two k-junta distributions defined as above which
are ε-far from each other. Then we have

    E[(P(x) − 1/2^n)²] − E[(P_J(x) − 1/2^n)²] ≥ 2^{−2n} · (2ε)².   (2.3)
Proof: Before we prove the inequality, we establish the following equality:

    E[P_J(x)(P(x) − P_J(x))] = 0.   (2.4)

Partition all x's into sets X_i such that any two vectors x_1 and x_2 in the same X_i
satisfy x_1^{(J∩J*)} = x_2^{(J∩J*)}. We prove that for each X_i, Σ_{x∈X_i} P_J(x)(P(x) − P_J(x)) is
zero, which yields Equation 2.4. By Equation 2.1, P_J is a junta distribution on the
set J ∩ J*. Therefore, for any two vectors x_1 and x_2 in X_i, we have P_J(x_1) = P_J(x_2).
Thus, we just need to prove Σ_{x∈X_i} P_J(x) = Σ_{x∈X_i} P(x), which follows from Equation
2.1 directly.
Moreover, observe that E[D(x)] = 1/2^n for any distribution D. Therefore, we have

    E[(D(x) − 1/2^n)²] = E[D(x)²] + E[1/2^{2n}] − 2E[D(x)/2^n]
                       = E[D(x)²] + 1/2^{2n} − 2/2^{2n}
                       = E[D(x)²] − 1/2^{2n}
                       = E[D(x)²] − E[D(x)]².   (2.5)

Now, we prove Equation 2.3. Since P and P_J are distributions, E[P(x)] =
E[P_J(x)] = 1/2^n. By this fact and linearity of expectation,

    E[(P(x) − 1/2^n)²] − E[(P_J(x) − 1/2^n)²]
      = E[P(x)²] − E[P_J(x)²]
      = E[(P(x) − P_J(x) + P_J(x))²] − E[P_J(x)²]
      = E[(P(x) − P_J(x))²] + 2E[P_J(x)(P(x) − P_J(x))] + E[P_J(x)²] − E[P_J(x)²]
      = E[(P(x) − P_J(x))²]
      ≥ (E[|P(x) − P_J(x)|])²
      = d_{L1}(P, P_J)² / 2^{2n}
      ≥ 2^{−2n} · (2ε)²,

where the last equality in the first group uses Equation 2.4 and the last
inequality comes from the fact that the L1 distance between P and P_J is at least 2ε. ∎
2.1.2 Step 2: Equality of f(J) and h(J)
Below, we first show that the Fourier coefficient \hat{P}(S) of any set S ⊄ J* is zero. This
lemma allows us to infer that it is enough to compute the low-degree Fourier coefficients,
because the other ones are zero. Intuitively, such a set S contains
a coordinate outside the junta that is zero or one with probability one half each. Therefore, the
Fourier coefficient of S is zero. We prove this formally in Lemma 2.1.3. Leveraging
this lemma, we prove that h(J) and f(J) are equal in Lemma 2.1.4.

Lemma 2.1.3. For any J ⊆ [n], let D be a junta distribution with J being the set of
junta coordinates. For any S ⊄ J, \hat{D}(S) is zero.

Proof: Observe that J might be the empty set, in which case D is the uniform distribution.
Since S is not a subset of J, there is a coordinate i such that i is in S but not in
J. Thus, the i-th coordinate of each sample x is one or zero, each with probability
one half. We pair up all x's based on their agreement on x^{([n]\{i})} and denote a
pair by (x_0, x_1). Observe that since i is not a junta coordinate, D(x_0) = D(x_1).
However, since i is in S, χ_S(x_0) = −χ_S(x_1). Therefore, we have

    \hat{D}(S) = (1/2^n) Σ_{x∈{0,1}^n} D(x) · χ_S(x)
              = (1/2^n) Σ_{(x_0,x_1)} (D(x_0) · χ_S(x_0) + D(x_1) · χ_S(x_1))
              = (1/2^n) Σ_{(x_0,x_1)} (D(x_0) · χ_S(x_0) − D(x_0) · χ_S(x_0))
              = 0. ∎
Now we are ready to prove that f(J) is equal to h(J) for any J ⊆ [n] of size k.

Lemma 2.1.4. Let f and h be the two functions defined above. For any J ⊆ [n] of
size k we have f(J) = h(J).

Proof: Recall that

    h(J) = 2^{2n} · E[(P_J(x) − 1/2^n)²]   and   f(J) = 2^{2n} · Σ_{S⊆J, S≠∅} \hat{P}(S)².

By Equation 2.5, we have

    h(J) = 2^{2n} · E[(P_J(x) − 1/2^n)²]
         = 2^{2n} · (E[P_J(x)²] − E[P_J(x)]²)
         = 2^{2n} · (Σ_S \hat{P}_J(S)² − \hat{P}_J(∅)²),

where the last equality follows from Parseval's Theorem and the fact that

    \hat{P}_J(∅) = (1/2^n) Σ_x P_J(x) · χ_∅(x) = E[P_J(x)].

In addition, note that by Equation 2.1, P_J is a junta distribution over the set J ∩ J*.
By Lemma 2.1.3, for any S ⊄ (J ∩ J*), \hat{P}_J(S) is zero. Thus, we know

    h(J) = 2^{2n} · (Σ_S \hat{P}_J(S)² − \hat{P}_J(∅)²) = 2^{2n} · Σ_{S⊆(J∩J*), S≠∅} \hat{P}_J(S)².

Now, it is clear that h(J*) = f(J*). Assume J ≠ J*. Let S be a non-empty subset of
J ∩ J* and c be a fixed binary vector of size |S|. By the definition of P_J, it is not hard
to see that Pr_{x∼P}[x^{(S)} = c] = Pr_{x∼P_J}[x^{(S)} = c]. Thus, by conditioning over all possible c,
we can prove that Pr_{x∼P}[χ_S(x) = b] = Pr_{x∼P_J}[χ_S(x) = b] for b = +1 or −1. Therefore,
we have

    \hat{P}_J(S) = (1/2^n) Σ_x P_J(x) · χ_S(x)
                = (1/2^n) · (Pr_{x∼P_J}[χ_S(x) = 1] − Pr_{x∼P_J}[χ_S(x) = −1])
                = (1/2^n) · (Pr_{x∼P}[χ_S(x) = 1] − Pr_{x∼P}[χ_S(x) = −1])
                = (1/2^n) Σ_x P(x) · χ_S(x)
                = \hat{P}(S).

In this way, for any non-empty subset S of J ∩ J*, \hat{P}_J(S) is equal to \hat{P}(S). Moreover, by
Lemma 2.1.3, for any S ⊆ J which is not a subset of J*, \hat{P}(S) is zero. Thus,

    h(J) = 2^{2n} · Σ_{S⊆(J∩J*), S≠∅} \hat{P}_J(S)²
         = 2^{2n} · Σ_{S⊆(J∩J*), S≠∅} \hat{P}(S)²
         = 2^{2n} · (Σ_{S⊆(J∩J*), S≠∅} \hat{P}(S)² + Σ_{S⊆J, S⊄J*} \hat{P}(S)²)
         = 2^{2n} · Σ_{S⊆J, S≠∅} \hat{P}(S)²
         = f(J),

and the proof is complete. ∎
2.1.3 Step 3: Estimating f(J)
In the following lemma, we prove that with probability 2/3, for all J ⊆ [n] of size k,
|\hat{f}(J) − f(J)| is less than 2ε². Note that for any invalid output J, f(J*) − f(J)
is at least 4ε². Thus, \hat{f}(J*) > \hat{f}(J) with probability 2/3, so the learning
algorithm does not output an invalid J.

Lemma 2.1.5. Let P be a junta distribution on the set J* of size k. Suppose we
draw s = 9 · 2^{2k} · ln(6 · 2^k · \binom{n}{k}) / (2ε⁴) samples from P. For any set J of size k, we
estimate f(J), as defined before, by

    \hat{f}(J) = Σ_{S⊆J, S≠∅} (2 · [# samples x with χ_S(x) = 1] / s − 1)².

Then, with probability 2/3, for all such J we have |\hat{f}(J) − f(J)| < 2ε².
Proof: By the definition of the Fourier coefficients, we have

    f(J) = 2^{2n} · Σ_{S⊆J, S≠∅} \hat{P}(S)²
         = Σ_{S⊆J, S≠∅} (Σ_x P(x) · χ_S(x))²
         = Σ_{S⊆J, S≠∅} (Pr_{x∼P}[χ_S(x) = 1] − Pr_{x∼P}[χ_S(x) = −1])²
         = Σ_{S⊆J, S≠∅} (2 · Pr_{x∼P}[χ_S(x) = 1] − 1)².

For abbreviation, let P_S = 2 · Pr_{x∼P}[χ_S(x) = 1] − 1 and let \hat{P}_S = 2 · [# samples x with χ_S(x) = 1] / s − 1.
First, notice that \hat{P}_S is an estimator for P_S whose error is unlikely
to be more than ε' = 2ε²/(3 · 2^k), because by the Hoeffding bound we have

    Pr[|\hat{P}_S − P_S| ≥ ε'] = Pr[ |[# samples x with χ_S(x) = 1]/s − Pr_{x∼P}[χ_S(x) = 1]| ≥ ε'/2 ] ≤ 2e^{−sε'²/2}.

Note that in the learning algorithm we estimate this value for the 2^k subsets of each J, and
there are \binom{n}{k} such J's. Thus, for s = 2 ln(6 · 2^k · \binom{n}{k}) / ε'², it is not hard to see that
the probability of estimating at least one P_S inaccurately is at most 1/3, by the union
bound. Now, we can assume a_S := \hat{P}_S − P_S is in the range [−ε', ε'] for every S, with probability
2/3. We then bound the maximum error of \hat{f}(J) as follows:

    |\hat{f}(J) − f(J)| = |Σ_{S⊆J, S≠∅} (\hat{P}_S² − P_S²)|
                       ≤ Σ_{S⊆J, S≠∅} (2|P_S| · |a_S| + a_S²)
                       ≤ Σ_{S⊆J, S≠∅} (2ε' + ε'²)
                       ≤ 2^k · (2ε' + ε'²)
                       ≤ 3 · 2^k · ε' = 2ε²,

where the last inequality follows from ε' = 2ε²/(3 · 2^k) < 1. ∎
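As a sanity check on the sample bound, plugging ε' = 2ε²/(3·2^k) into s = 2 ln(6·2^k·\binom{n}{k})/ε'² recovers exactly the number of samples drawn by the algorithm:

    s = \frac{2\ln\!\big(6\cdot 2^k\binom{n}{k}\big)}{\epsilon'^2}
      = 2\ln\!\big(6\cdot 2^k\tbinom{n}{k}\big)\cdot\frac{9\cdot 2^{2k}}{4\epsilon^4}
      = \frac{9\cdot 2^{2k}\,\ln\!\big(6\cdot 2^k\binom{n}{k}\big)}{2\epsilon^4}.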
2.2 A lower bound for learning juntas
We now complete the proof of the lower bound on the number of samples required to
learn juntas.
Theorem 1.1.3 (Restated). Fix 0 < ε < 1/2 and 1 ≤ k ≤ n. Any (s, ε, 1/4)-learner for
k-junta distributions must have sample complexity s = Ω(max{2^k/ε², log \binom{n}{k} / ε}).

Proof: The first part of the lower bound, s = Ω(2^k/ε²), follows from the (folklore)
lower bound on the number of samples required to learn a general discrete distribution
over a domain of size N: Ω(N/ε²) samples are required for this task. Observing that
the set of juntas on the set J = {1, 2, …, k} contains the set of general discrete distributions on
a domain of size N = 2^k, we conclude that any k-junta learning algorithm must draw
Ω(2^k/ε²) samples, even if it is given the identity of the junta coordinates.
We now want to show that s = Ω(log \binom{n}{k} / ε). By Yao's minimax principle, it
suffices to show that there is a distribution P over k-junta distributions such that any
deterministic algorithm that (ε, 1/4)-learns D ∼ P must draw at least s = Ω(log \binom{n}{k} / ε)
samples from D. For non-empty sets S ⊆ [n] of size at most k, let D_S be the distribution with the
31
probability mass function
+ iE)/2T'~1
ps~) =(I
(I - c)/2n-1
if (Dis xi =-I
if @jes xi = 0.
Let Do be the uniform distribution on {0, 1}". We let P be the distribution defined
by P(DO) = - and P(Ds) =
-
2
for every set of size 1 <
< k.
KSI
Every function
<k)1
in the support of P is a k-junta distribution, and they are all E-far from each other.
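To verify the ε-far claim under the densities (1 ± 2ε)/2^n above: every point's mass differs from the uniform mass by exactly 2ε/2^n, and for two distinct non-empty sets S ≠ T the parities ⊕_{i∈S} x_i and ⊕_{i∈T} x_i disagree on exactly half of the cube, so

    d_{tv}(D_S, D_\emptyset) = \tfrac12\sum_x\Big|p_S(x)-\tfrac1{2^n}\Big|
                             = \tfrac12\cdot 2^n\cdot\frac{2\epsilon}{2^n} = \epsilon,
    \qquad
    d_{tv}(D_S, D_T) = \tfrac12\cdot 2^{n-1}\cdot\frac{4\epsilon}{2^n} = \epsilon.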
Fix any deterministic learning algorithm A that (ε, 1/4)-learns the k-junta distributions
drawn from P. Let X be a sequence of s samples drawn from D. The success
probability of A guarantees that

    3/4 ≤ Pr[A identifies the correct distribution]
        = Σ_S P(D_S) Σ_{X∈{0,1}^{ns}} p_S(X) · 1[A outputs D_S on X]
        ≤ Σ_{X∈{0,1}^{ns}} max_S P(D_S) · p_S(X).

We can partition the set of s-tuples of samples, {0,1}^{ns}, into parts X_S, one for each S,
such that X ∈ X_S iff P(D_S) · p_S(X) = max_T P(D_T) · p_T(X) (breaking
ties arbitrarily). For any set of samples X, we have that P(D_∅) · p_∅(X) = 2^{−ns}/2,
since D_∅ is the uniform distribution. This means that if X ∈ X_S for some S ≠ ∅, then
P(D_S) · p_S(X) ≥ 2^{−ns}/2 and hence p_S(X) ≥ (\binom{n}{≤k} − 1) · 2^{−ns}. Let K_S(X) denote the
number of samples x ∈ X such that ⊕_{i∈S} x_i = 1. Then from the above inequality we
have

    (\binom{n}{≤k} − 1) · 2^{−ns} ≤ p_S(X) = ((1 + 2ε)/2^n)^{K_S(X)} · ((1 − 2ε)/2^n)^{s − K_S(X)}
                                ≤ (1 + 2ε)^{K_S(X)} · 2^{−ns} ≤ e^{2ε·K_S(X)} · 2^{−ns}.

Therefore, s ≥ K_S(X) ≥ ln(\binom{n}{≤k} − 1)/(2ε) = Ω(log \binom{n}{k} / ε), as we wanted to show. ∎
Chapter 3
Testing junta distributions
3.1 A test algorithm for junta distributions
In this section we consider testing junta distributions. Considering Definition
1.3.1, we want to determine whether there exists a subset of coordinates of size k, namely J, such
that conditioned on any setting of x^{(J)}, the remaining coordinates are uniformly distributed. Hence, testing
that a distribution is a junta relies on a test of a collection of 2^k distributions. In
Section 3.1.1, we provide a uniformity test for a collection of distributions, which is a
natural problem in its own right. In Figure 3-1 we show the reduction and prove it
formally in the following theorem.

Theorem 3.1.1. The algorithm shown in Figure 3-1 is an (ε, 1/3)-test for k-junta
distributions using O(S · k log n) = Õ(2^{n/2} k⁴/ε³ + 2^k/ε) samples, where S is the sample
bound from Theorem 3.1.2.
Proof: First, note that we amplify the confidence parameter of the uniformity test of
a collection by repeating the algorithm ⌈2 log₃(3\binom{n}{k})⌉ times and taking the majority
of the answers. By Theorem 3.1.2, the uniformity test of a collection uses at most 2S
samples and returns the correct answer with probability at least 2/3. It is not hard to
see that the majority of the answers is correct with probability at least 1 − 1/(3\binom{n}{k}). Therefore,
by the union bound, we can assume we test all the J's correctly with probability at
least 2/3. In addition, by setting m = 2^k and the domain size of Theorem 3.1.2 to 2^{n−k}, it is not
hard to see that the total number of samples is O(k log n · S) = Õ(2^{n/2} k⁴/ε³ + 2^k/ε).
Testing junta distributions
(Input: ε, n, oracle access to draw samples from a distribution P.)

1. Draw s = 2S · ⌈2 log₃(3\binom{n}{k})⌉ samples.
2. For every subset J of [n] of size k:
   2.1 Convert each sample x into a pair (x^{(J)}, x^{([n]\J)}).
   2.2 Repeat the uniformity test of a collection, with the same ε, ⌈2 log₃(3\binom{n}{k})⌉ times,
       each time using at most 2S samples.
   2.3 If the majority of the answers is "Accept", Accept.
3. Reject.

Figure 3-1: A testing algorithm for junta distributions
Now we prove the correctness of the algorithm. For an arbitrary set J of size
k, let P_i be the conditional distribution over the domain {0,1}^{n−k} defined by P_i(z) =
Pr_{x∼P}[x^{([n]\J)} = z | x^{(J)} = C_i], where C_i is the binary representation of i over k bits.
We need to consider two cases: when P is a junta distribution and when it is ε-far
from being a junta distribution.

First, we show that a junta distribution is accepted with probability at least 2/3.
Assume P is a junta distribution on the set J*. By definition, Pr[x^{([n]\J*)} = z | x^{(J*)}] =
1/2^{n−k} for every z. In other words, the P_i's are uniform. Thus, we can assume that the pair
(x^{(J*)}, x^{([n]\J*)}) is distributed according to a collection of uniform distributions. This
means that in the iteration where J = J*, the distribution should be accepted by the
uniformity test of a collection. Thus, we output the correct answer with probability
at least 2/3.

Second, we show that the algorithm does not accept a distribution which is ε-far
from being a junta distribution, with probability at least 2/3. Let P be such a
distribution. Define P_J to be the junta distribution on J defined earlier,

    P_J(x) = Pr_{y∼P}[y^{(J)} = x^{(J)}] / 2^{n−k}.

Below, we compute the distance between P and P_J. Note that P is ε-far from P_J. Let X_i
be the set of all x's such that x^{(J)} = C_i, where C_i is the binary code of i over k bits. Then,

    d_{L1}(P, P_J) = Σ_x |P(x) − P_J(x)|
                 = Σ_{i=0}^{2^k−1} Σ_{x∈X_i} |P(x) − P_J(x)|
                 = Σ_{i=0}^{2^k−1} Σ_{x∈X_i} Pr_{y∼P}[y ∈ X_i] · | P(x)/Pr_{y∼P}[y ∈ X_i] − P_J(x)/Pr_{y∼P}[y ∈ X_i] |.

Observe that Pr_{y∼P}[y ∈ X_i] = Pr_{y∼P}[y^{(J)} = C_i] = Pr_{y∼P_J}[y^{(J)} = C_i] by the definition of
P_J. Thus,

    d_{L1}(P, P_J) = Σ_{i=0}^{2^k−1} Σ_{x∈X_i} Pr_{y∼P}[y ∈ X_i] · | P_i(x^{([n]\J)}) − 1/2^{n−k} |
                 = Σ_{i=0}^{2^k−1} Pr_{y∼P}[y ∈ X_i] · d_{L1}(P_i, U).

Note that if we view the distribution as a collection of P_i's, then Pr_{y∼P}[y ∈ X_i] is exactly
the weight of P_i, namely w_i; in other words, we see a sample from P_i with probability
w_i. Thus, d_tv(P, P_J) = ½ d_{L1}(P, P_J) = Σ_i w_i · d_tv(P_i, U) is exactly the weighted distance
to uniformity of the collection. Since P is ε-far from any junta distribution, d_{L1}(P, P_J) is at
least 2ε. Thus, the collection is ε-far from being a collection of uniform distributions,
and it is rejected by the uniformity test with high probability. Thus, the
proof is complete. ∎
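The conversion in step 2.1 of Figure 3-1 is straightforward; a minimal illustrative Python sketch (names are arbitrary):

    def split_sample(x, J):
        """Convert a sample x in {0,1}^n into a pair (index, element) for the
        collection-uniformity tester: the bits on J select which distribution
        P_i the sample came from, the remaining bits give the element drawn from P_i.
        """
        J = sorted(J)
        rest = [i for i in range(len(x)) if i not in set(J)]
        idx = int("".join(str(x[i]) for i in J), 2)      # pattern of x on J
        elem = int("".join(str(x[i]) for i in rest), 2)  # pattern of x off J
        return idx, elem

    # Example: n = 5, J = {0, 3}
    print(split_sample([1, 0, 1, 1, 0], [0, 3]))  # -> (3, 2)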
3.1.1 Uniformity Test of a Collection of Distributions
In this section, we propose an approach for the uniformity test of a collection, namely
C = {P_i|w_i}_{i=1}^m. Note that when we sample the collection we get a pair (i, x), which means
the distribution P_i is selected with probability w_i and x is drawn from P_i. Observe
that when the w_i's are uniform, the problem is related to a uniformity test over a single
distribution over the domain [m] × [n]. Based on this observation, we use a bucketing
argument such that each bucket contains distributions with w_i's within a constant
factor of each other. In this section, we formally prove the reduction to the uniformity
test within a bucket, and in Section 3.1.2 we show how to test each bucket.
Uniformity test of a collection
(Input: ε, n, m, oracle access to draw samples from the collection {P_i|w_i}_{i=1}^m.)

1. B ← ⌈log(4m/ε)⌉
2. S ← max{ ⌈80 m log(12(m+1))/ε⌉, ⌈8B/ε⌉ · ⌈2 log₃(6B)⌉ · ⌈2^{13} B² √(6mn)/ε²⌉ }
3. Draw a sample, namely s, from Poi(S).
4. If s > 2S, reject.
5. Draw s samples from the collection.
6. ŵ_i ← s_i/s, where s_i is the number of samples from P_i.
7. For l = 1, …, B:
   7.1 B_l ← {i : ε·2^{l−1}/(4m) ≤ ŵ_i < ε·2^l/(4m)}
   7.2 W_l ← Σ_{i∈B_l} ŵ_i
   7.3 S_l ← Σ_{i∈B_l} s_i
   7.4 If W_l ≥ ε/8B and S_l ≥ ⌈2 log₃(6B)⌉ · ⌈2^{13} B² √(6mn)/ε²⌉:
       i. Run the bucket uniformity test with distance parameter ε/2B and maximum error probability 1/6B.
       ii. If the test rejects, Reject.
8. Accept.

Figure 3-2: A test algorithm for uniformity of a collection of distributions
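To make the bucketing step of Figure 3-2 concrete, here is a small illustrative Python sketch; the thresholds follow the figure, while the function name and example weights are arbitrary.

    import math

    def bucket_indices(w_hat, m, eps):
        """Group distribution indices by empirical weight, as in Figure 3-2.

        Bucket l (1-based) collects the indices i with
            eps * 2^(l-1) / (4m) <= w_hat[i] < eps * 2^l / (4m);
        indices with w_hat[i] < eps/(4m) are ignored.
        """
        B = math.ceil(math.log2(4 * m / eps))
        buckets = {l: [] for l in range(1, B + 1)}
        for i, w in enumerate(w_hat):
            if w < eps / (4 * m):
                continue  # too light: such weights contribute little and are ignored
            l = min(B, int(math.floor(math.log2(4 * m * w / eps))) + 1)
            buckets[l].append(i)
        return buckets

    # Example: m = 4 distributions with some empirical weights.
    print(bucket_indices([0.5, 0.3, 0.15, 0.05], m=4, eps=0.1))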
Theorem 3.1.2. The algorithm shown in Figure 3-2 is an (ε, 1/3)-test for the
uniformity of a collection of distributions {P_i|w_i}_{i=1}^m using s ≤ 2S samples, where
S = O(B³√(mn)/ε³ + m log m/ε) and B = O(log(m/ε)).
Proof: In the algorithm, instead of drawing a fixed number of samples, we use the
"Poissonization method" and draw s samples where s is a random variable drawn
from a Poisson distribution with mean S.¹ Thus, we can assume the number of samples
from each distribution P_i, namely s_i, is distributed as Poi(w_i · S) and is independent
from the rest of the s_j's.

    ¹ Observe that when we draw a fixed number of samples, the number of appearances of each
    element depends on the others, which usually complicates the analysis of algorithms. However, the
    Poisson distribution has a very convenient property that makes the number of appearances of each
    symbol independent of the others: as shown in the literature, e.g. [32], if a single distribution P is
    sampled Poi(n) times, then the number of samples equal to a symbol x is itself a Poisson random
    variable with mean nP(x), and these counts are independent across symbols.

Now, we show in the following concentration lemma that the
s_i's are not far from their means. Equivalently, we prove that ŵ_i = s_i/s is close to w_i.
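A quick numerical illustration of the Poissonization trick (a simulation sketch with arbitrary parameters, not part of the analysis): drawing Poi(S) samples in total makes the per-distribution counts s_i behave like independent Poi(w_i · S) variables, so their empirical variance matches their mean.

    import numpy as np

    rng = np.random.default_rng(0)
    S, w = 1000, np.array([0.5, 0.3, 0.2])

    def poissonized_counts():
        s = rng.poisson(S)                       # total number of samples ~ Poi(S)
        draws = rng.choice(len(w), size=s, p=w)  # each sample picks P_i with prob w_i
        return np.bincount(draws, minlength=len(w))

    counts = np.array([poissonized_counts() for _ in range(5000)])
    print(counts.mean(axis=0))   # close to w * S = [500, 300, 200]
    print(counts.var(axis=0))    # for Poisson counts, the variance is also close to w * S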
Lemma 3.1.3. Suppose we draw s ∼ Poi(S) samples from a collection of distributions
{P_i|w_i}_{i=1}^m such that S ≥ 80 m log(12(m+1))/ε. Let ŵ_i = s_i/s, where s_i is the
number of samples from P_i. Then with probability at least 5/6 all of the following events happen:

• s is in the range [S/2, 2S].
• For any i, if w_i ≥ ε/8m, then ŵ_i is in the range [w_i/2, 2w_i].
• For any i, if w_i < ε/8m, then ŵ_i < ε/4m.

Proof: Here we need concentration inequalities for the Poisson distribution (see
Theorem 5.4 in [32]): for a Poisson random variable X with mean μ and any δ > 0,

    Pr(X ≥ (1 + δ)μ) ≤ (e^δ / (1 + δ)^{1+δ})^μ   and   Pr(X ≤ (1 − δ)μ) ≤ (e^{−δ} / (1 − δ)^{1−δ})^μ.

Thus, it is not hard to see that

    1 − Pr[μ/2 ≤ X ≤ 2μ] ≤ 0.68^μ + 0.86^μ ≤ 2 · 2^{−μ/5} ≤ 1/(6(m+1)),

where the last inequality holds for μ ≥ 5 log(12(m+1)). Thus, s is in the range
[S/2, 2S] with probability at least 1 − 1/(6(m+1)). We assume this fact holds for the rest of
the proof.
Note that by the properties of the Poissonization method [32], the s_i's are distributed
as independent draws from Poi(w_i · S). For w_i ≥ ε/8m, since we assume that s is in the
range [S/2, 2S], we can conclude that s_i is in the range [w_i · s/2, 2w_i · s], or equivalently that ŵ_i
is in the range [w_i/2, 2w_i], with probability at least 1 − 1/(6(m+1)). Now assume w_i is
less than ε/8m. Clearly, the expected value of s_i is less than S · ε/8m. Consider
another random variable X which is drawn from Poi(S · ε/8m). Thus,

    Pr[s_i ≥ s · ε/4m] ≤ Pr[X ≥ s · ε/4m] ≤ 1/(6(m+1)).

Thus, by the union bound over the s_i's and s, with probability at least 5/6 the
lemma is true. ∎
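The two numerical constants in the proof come from setting δ = 1 in the upper-tail bound and δ = 1/2 in the lower-tail bound:

    \Pr[X \ge 2\mu] \le \Big(\frac{e}{4}\Big)^{\mu} \approx 0.68^{\mu},
    \qquad
    \Pr[X \le \tfrac{\mu}{2}] \le \Big(\frac{e^{-1/2}}{(1/2)^{1/2}}\Big)^{\mu}
        = \big(\sqrt{2/e}\,\big)^{\mu} \approx 0.86^{\mu}.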
Partitioning into buckets: Based on the idea that the uniformity test of a collection
of distributions is easier when the w_i's are uniform, we partition the distributions into
buckets such that the w_i's in the same bucket are within a constant factor of each other.
Assume we have B = ⌈log(4m/ε)⌉ buckets, where the l-th bucket contains all the
distributions P_i such that ε·2^{l−1}/4m ≤ ŵ_i < ε·2^l/4m. By Lemma 3.1.3, the corresponding w_i's
are in the range [ε·2^{l−1}/8m, ε·2^l/2m]. Observe that each bucket l can be viewed as a
(sub-)collection of m_l = |B_l| distributions with the new weights w_i/W_l, where W_l is
the total weight of the l-th bucket.
Reduction to the bucket uniformity test: Here, we want to show that there is
a reduction between the uniformity test of a collection of distributions and the uniformity test
of each bucket as a sub-collection of distributions. For the uniformity test of a collection,
we partition the collection into buckets as explained above. Then for each bucket,
we invoke the bucket uniformity test with distance parameter ε/2B and with error
probability at most 1/6B. To prove the correctness of the reduction, we consider
the two following cases:

• {P_i|w_i}_{i=1}^m is a collection of uniform distributions. Since all of the distributions
are uniform, all buckets contain only uniform distributions. Then each of the B
invocations of the bucket uniformity test accepts with probability at least 1 − 1/6B.
Thus, none of them rejects, with probability at least 1 − 1/6, by the union bound.

• {P_i|w_i}_{i=1}^m is ε-far from being a collection of uniform distributions. We
prove that at least one bucket is rejected with high probability. Note
that in our bucketing method we ignore the distributions with ŵ_i < ε/4m: by
Lemma 3.1.3, each of these distributions has weight at most ε/2m, and since
the total variation distance is at most one, they cannot contribute to the
weighted distance by more than ε/2. Thus,

    Σ_{l=1}^{B} Σ_{i∈B_l} w_i · d_tv(P_i, U) ≥ ε/2.

By averaging, there is at least one bucket, namely l, such that Σ_{i∈B_l} w_i · d_tv(P_i, U) ≥
ε/2B. Since the total variation distance is at most one, Σ_{i∈B_l} w_i ≥ ε/2B. In
addition, we would like to consider this bucket as a separate collection. Since
W_l ≤ 1, if we renormalize the weights, we can also see that

    Σ_{i∈B_l} (w_i/W_l) · d_tv(P_i, U) ≥ ε/(2B·W_l) ≥ ε/2B.

Now, if we show that the assumptions of Corollary 3.1.6 are satisfied, then the bucket
uniformity test rejects bucket l with probability at least 1 − 1/6B. It is not
hard to see that our estimate of the new weight of the i-th distribution in bucket
l is ŵ_i/W_l, which is in the range [w_i/4W_l, 4w_i/W_l] by Lemma 3.1.3. Moreover,
since the w_i's are in the range [ε·2^{l−1}/8m, ε·2^l/2m], the weights w_i/W_l are at most 8/m_l. In
addition, the number of samples in this bucket is

    S_l = Σ_{i∈B_l} s_i ≥ (S/4) · Σ_{i∈B_l} w_i ≥ S·ε/(8B) ≥ ⌈2 log₃(6B)⌉ · ⌈2^{13} B² √(6mn)/ε²⌉.

Hence, the proof is complete. ∎
Bucket uniformity test
(Input: ε, δ, c, T, n, m, estimated ŵ_i's, ⌈2 log₃(1/δ)⌉ sets of s = ⌈8c³m√(3Tn)/ε²⌉ samples drawn from C.)

Repeat the following algorithm ⌈2 log₃(1/δ)⌉ times and output the majority answer.

1. Take the next set of s = ⌈8c³m√(3Tn)/ε²⌉ samples from the collection.
2. Y ← the number of unique elements in these s samples.
3. For each sample (i, x), replace x with an x' chosen uniformly from [n].
4. Δ ← s²ε²/(c²·m·n).
5. Y' ← the number of unique elements in these s samples.
6. If |Y − Y'| ≥ Δ/2,
   6.1 Reject.
7. Otherwise,
   7.1 Accept.

Figure 3-3: A uniformity test for a collection of distributions with special constraints.
3.1.2 Uniformity Test within a Bucket
In this section, we provide a uniformity test for a collection of distributions when
the weights are bounded. In other words, the algorithm distinguishes whether the
weighted distance to uniformity is zero or at least ε. Our algorithm is based on counting the number
of unique elements, which is negatively related to the number of coincidences.
This idea was proposed before in [35, 7] for uniformity testing of a single distribution.
The high-level idea is to estimate the expected number of unique elements
when the underlying collection is the unknown collection and to compare that value to
the case when it is a collection of uniform distributions. If these values are close
enough to each other, we can infer that the unknown collection is actually a collection
of uniform distributions; otherwise it is not. The algorithm is shown in Figure 3-3
and its correctness is proved in the following theorem.
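Before the formal statement, the test statistic of Figure 3-3 can be sketched in a few lines of illustrative Python: count the unique (i, x) pairs before and after replacing each element by a uniformly random one (function names and the example below are arbitrary; the domain [n] is 0-indexed for simplicity).

    import random
    from collections import Counter

    def unique_count(samples):
        """Number of (i, x) pairs that appear exactly once in the sample list."""
        counts = Counter(samples)
        return sum(1 for c in counts.values() if c == 1)

    def bucket_statistic(samples, n):
        """Compute Y (unique elements) and Y' (after replacing each x by a uniform x')."""
        Y = unique_count(samples)
        resampled = [(i, random.randrange(n)) for (i, _x) in samples]
        Y_prime = unique_count(resampled)
        return Y, Y_prime

    # samples is a list of pairs (i, x) with i indexing the distribution and x its element
    samples = [(0, 3), (0, 3), (1, 2), (1, 5), (0, 1)]
    print(bucket_statistic(samples, n=8))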
Theorem 3.1.4. Assume we have a collection of distributions C = {P_i|w_i}_{i=1}^m such
that w_i ≤ T for all i. We have s = ⌈2 log₃(1/δ)⌉ · ⌈8c³m√(3Tn)/ε²⌉ samples drawn
from C such that for each distribution P_i we get s_i ∼ Poi(w_i · s) samples from P_i.
Suppose s_i is in [w_i · s/c, c·w_i·s] for a constant c. Then the test shown in Figure 3-3
distinguishes whether C is a collection of uniform distributions or is ε-far from one, with
probability at least 1 − δ.
Proof: In the following, we prove that each repetition of the algorithm outputs the
correct answer with probability at least 2/3. Hence, we can amplify the confidence
parameter by repeating it ⌈2 log₃(1/δ)⌉ times and taking the majority of the answers;
the resulting test outputs the correct answer with probability at least 1 − δ.

Let Y be the random variable that counts the number of unique elements in a set
of samples. Notice that we consider each sample as an ordered pair (i, x), which means
x is drawn from P_i; thus, (i, x) is not equal to (j, x) when i ≠ j. Similarly, let Y_i denote the
number of unique elements coming from distribution P_i. It is not hard to see that Y = Σ_{i=1}^m Y_i.

For abbreviation, E_{{P_i}}[Y] denotes the expected value of Y when the samples are drawn
from the underlying collection {P_i|w_i}_{i=1}^m. In addition, we denote the expected value
of Y by E_{{U}}[Y] when the underlying collection is a set of uniform distributions with
the same weights as the P_i's (i.e., {U|w_i}_{i=1}^m).

Now we need to answer this question: does the number of unique elements indicate
whether or not the collection is a set of uniform distributions? The answer is yes. We
show that E_{{P_i}}[Y] is smaller than E_{{U}}[Y] if the collection is far from being a collection of
uniform distributions. Therefore, if we see a meaningful difference between E_{{P_i}}[Y]
and E_{{U}}[Y], we can conclude that {P_i|w_i}_{i=1}^m is not a collection of uniform distributions.
For a single distribution P, Paninski [35] showed that the difference between E_P[Y] and
E_U[Y] is related to the distance between P and the uniform distribution:

    E_U[Y] − E_P[Y] ≥ s² · (d_{L1}(P, U))² / n.

Since we are looking for E_{{U}}[Y] (not the expectation under a single uniform distribution over [m] × [n]),
we cannot use this inequality directly over the domain [m] × [n]. However, we use this inequality for each P_i
separately. Observe that the way we convert the samples allows us to get
the same number of samples, s_i, from P_i and from U over the domain of size n. Thus,
we can use the above inequality for each distribution separately. Hence, by linearity of expectation and
the Cauchy-Schwarz inequality, we have

    E_{{U}}[Y] − E_{{P_i}}[Y] = Σ_{i=1}^m (E_U[Y_i] − E_{P_i}[Y_i])
                            ≥ Σ_{i=1}^m s_i² · (d_{L1}(P_i, U))² / n
                            ≥ (s²/(c²n)) · Σ_{i=1}^m w_i² · (d_{L1}(P_i, U))²
                            ≥ (s²/(c²mn)) · (Σ_{i=1}^m w_i · d_{L1}(P_i, U))²,

where the first inequality follows from [35] and the second follows from s_i ∈ [w_i·s/c, c·w_i·s].
Set Δ = s²ε²/(c²·m·n). Therefore, if C is ε-far from being a collection
of uniform distributions, then

    E_{{U}}[Y] − E_{{P_i}}[Y] ≥ (s²/(c²mn)) · (2ε)² ≥ s²ε²/(c²mn) = Δ,   (3.1)

because the weighted L1 distance is at least 2ε. However, these two expected values
cannot be calculated directly since the w_i's and P_i's are unknown. Thus, we need to
estimate them. By definition, the number of unique elements in the s samples, Y, is
an unbiased estimator for E_{{P_i}}[Y]. To estimate E_{{U}}[Y], we reuse the samples we
get from the collection and change each sample (i, x) to (i, x'), where x' is chosen
uniformly at random from [n]. Since i and x' are picked with probability w_i and 1/n
respectively, we can assume the sample (i, x') is drawn from the collection {U|w_i}_{i=1}^m.
Therefore, the number of unique elements in the new sample set, namely Y', is an
unbiased estimator for E_{{U}}[Y]. Below, we formally prove that the number of unique
elements Y (and, similarly, Y') cannot be far from its expected value, using Chebyshev's
inequality. To do so, we first need to bound the variance.
Lemma 3.1.5.
We have s samples drawn form, a collection of distributions C
{Pi Iwi} such that for each distribution Pi we get si ~ Poi(wi - s) samples from, Pi.
42
............
a............
-111111.111.:
I
Suppose the si is in the range [w- s/c, cw
s] for a fixed constant c. Also, each 'w is
at most T. Then
Eu[Y| - EPi[Y] + c s T
n
Var[Y]
Proof: Bounding the variance of the number of unique elements has been studied in
[35]. Paninski showed the following inequality
Eu[Y] - Ep[Y]
+
Varp[Y]
Here, since we know the si's are independent, we have
rn
Var[Y]
-
Z Var[Yi]
i=1
(Eu[Yi] - E,[Y]+
C2 S2
<Eu[Y] - E'P[Y]+ c
(
rn
2
On the other hand, it is not hard to see that since wi's are less than T, we have
2 <i&
(
w
T < T.
Combining the two above inequalities we get
Var[Y] < Eu[Y] - EPi[Y1 + c s T
n
Now, we are ready to use Chebyshev's inequality to prove that we are able to estimate Y accurately. Below we consider two cases based on the underlying collection.
* Case 1: C is a collection of uniform distribution: In this case Ejpj is
equal to E{u} [Y], so by Lemma 3.1.5 the variance of Y is at most c2 s 2 T/n. Thus
43
by Chebyshev's inequality we have
Pr[IY
-
Efu}[Y]I
< 16Var[Y]/A 2
A/4]
16 c6 T n12
2 4
S E
It is not hard to see that for s > 4c3 mv/6Tn/
2
the above probability is 1/6.
Similar to Y, we can prove that the probability that Y' is A/4 far away from
its mean is less than 1/6. Therefore, Y' - Y is at most A/2 with probability at
least 1 - 1/3.
* Case 2: C is c-far from being a collection of uniform distributions:.
Therefore by Equation 3.1, E{u} [Y] - Ep, I[Y] is at least A. Similar to the above,
we use Chebyshev's inequality. By Lemma 3.1.5 and Equation 3.1 we have
Pr[IY - E{-pl [Y] >
(E{tji[Y] - E{pl [Y])]
< Var[Y]/(E{u}[Y] - ElpI[Y]/4)2
3 2
Elul[Y] - E-ps}[Y + c S T/n
(Elul [Y] - ElpI[Y]/4)2
16
< E{u}[Y] - E{pI[Y]
16c 2 8 2 T
+ n - (E{u}[Y] - Elp [Y])2
16c 2 s2 T
16
A
2
n
16c nm
16c 6 Tnm 2
S 2E2
82 f4
32 c
6 Tnm 2
K 24
Note that T by definition can not be less than 1/rn that's why the last inequality
is true. It is straightforward that for s > 8c3 mvN3Tn/c 2 the above probability is
at most 1/6. On the other hand, similar to what we had in case one, Y' cannot
go far from its mean too. Thus,
Pr[IY' - E{u}[Y]I > 1(Elul[Y] - E-p,}[Y])] < Pr[Y' - E{u}[Y]I > A/4] < 1/6.
Therefore, Y' - Y is at least (E{u [Y] - Ep, [Y])/2 > A/2 with probability at
44
least 1 - 1/3.
In both cases, the uniformity test outputs the correct answer with probability at least
2/3.
such
Corollary 3.1.6. Assume we have a collection of distributionsC = {RJiwi}
that wi < 8/m for all i's.
We have s = [2 log
1/61 [210 /6mn/e 2] samples drawn
form C such that for each distribution Pi we get si ~ Poi(wi - s) samples from Pi.
Suppose si is in [wi - s/4,4wi - s]. Then, there exists a uniformity test for C such that
outputs the correct answer with probability 1 - 6.
Proof: This corollary follows directly form Theorem 3.1.4 by setting T = 8/m and
c = 4.
3.2
A lower bound for testing junta distributions
We prove a lower bound for testing junta distributions in the following theorem.
Theorem 3.2.1. There is no (e, 1/3)-test for k-junta distributions using less than
o(2n/2) samples where c < 1/4.
Proof: First, we construct two families of distributions, T+ and F-, such that the
distributions in T+ are junta distributions and the distributions in T- are 1/4-far
from being junta distributions. Thus, any (e, 1/3)-test with e < 1/4 must distinguish
these two families with probability 2/3. Below, we prove that the distributions from
these two families are so similar that if we draw only o( 2 n/2) samples, we cannot
distinguish them. Hence, there is no (c, 1/3)-test for k-junta distributions using less
than o(2
2
/
2
) samples when e < 1/4.
Let C be the binary representation of i with k bits for i
1, 2,.. ., 2
-
1. Let
X1 be the set of all x's that x(kI) = Ci. In other words, Xi's forms a partition of the
domain based on the first k bits. Let F+(x) be a family that contains only a single
distribution P+(x) as constructed below.
1. For all the x's in Xi, if the parity of the Ci bits is odd, set P+(x) = 0.
45
2. For all the x's in Xi, if the parity of the C' bits is even, set P+(x) = 1/271
Note that P+(x) only depends on the first k bits of the x.
Thus, it is a junta
distribution on [k].
We construct a distribution P- by the randomized process explained below. Let
F- be the set of all possible P- that this process can generate.
1. For all the x's in Xi, if the parity of the C is odd, set P~(x) = 0.
2. If the parity of the C bits is even, pick half of the x's in Xi randomly and set
P-(x) = 1/2'-2. Set the probabilities of the other half to zero.
Now, assume we pick a distribution P- from F- uniformly at random. We show
that P- is a 1/4-far from being a junta distribution. Let Pi be an arbitrary k-junta
distribution with biases a as defined in Definition 1.3.1. We define X' to be the set
of all x's such that x() = Cs.
(In contrast to Xi's that partition the domain based
on xUlI)). Now, we show the probability of at least half of the elements in X' is zero.
If J = [k], then it follows by the construction of P-. Otherwise, since JJ = k there
exists a coordinate I E [k] such that 1 is not in the set J. Consider an element x E X'.
Let y be x with the i-th bit flipped. Since 1 is not in J,
W) = y(J). Thus, y is also
in X.V. On the other hand, the parity of the first k bits of x and y is not the same,
because I E [k]. Thus one of them has probability zero. Note that we can pair up all
the elements in X4 similar to x and y. Thus, at least half of the elements in X, has
46
probability zero. Hence, we can conclude that P- is 1/4-far from Pj as below
dL1(P-,PJ) = EP-(x) -Pj(x)
X
2 -- 1
E E IP~(x) - PJ(x)
i=1
aEX
2 k_1
D-(x) - ai/2"k1
=1E
i=
2
EX,
k-1
E
> E
Z=1 xEX'
s.t.
a1/2"n-k
P-(x)=Q
2k-1
> i=1
-2
where the first inequality follows from the fact that at least half of the elements in
X, has probability zero. Thus, the total variation distance is at least 1/4. Since we
show that P- is 1/4-far from any arbitrary k-junta distribution, P- is 1/4-far from
being a junta distribution.
Here, we want to show that with high probability P+(x) and P-(x) that is picked
from F- uniformly at random are indistinguishable. Note that we can consider a
sample from these distributions to be a sample from a collection of 2 k-1 distributions:
For each sample r consider the first k bits, X 1 , ..., Xk, as the index of distribution,
and then the last n - k bits xk+1, -- , x,, to be the sample from distribution x 1 , ...
Since odd parity patterns do not show up, there are exactly
the bits x 1 ,. .-
Xk,
2 k-1
possible settings on
and we see each of them with uniform probability for both P- and
P+. Moreover, conditioned on fixing the first k bits, the distribution over
is uniform for
, Xk.
'P+.
x([n]\1k])'s
On the other hand, conditioned on fixing the first k bits, the
distribution over x(QH\[k)'s is uniform over exactly half of the settings of
In 1301 (Lemma 4.3), it has been proved that if we draw o( / 2 k -
Xk+1,
.-
n-
= o(2I/2),
no algorithm can distinguishes P+ and P- (more precisely the collections correspond
to them) with probability more that 1/2 + o(1).
47
48
Chapter 4
Learning and testing
feature-separable distributions
4.1
Testing feature-separable distributions
Testing feature-separable distributions is related to testing closeness of two distributions: even if we are given the separated feature i and the parameter a (as it is
defined in Definition 1.3.4), testing feature-separable distributions requires that we
make sure the distributions over x-0 condition on xi = 1 and xi
-
0 are close. The
problem of testing closeness of two distributions is considered in [8, 42, 201. Here we
show reductions between closeness testing and testing feature-separable distributions
in Theorem 4.1.1 and Theorem 4.1.2.
In (201, an algorithm is given for (e, 6)-testing the closeness of two distributions
p and q using sid( 6 )
O(poly(1/e)N 2 / 3 log N log g) samples where co is a constant.
closeness-test(S,), Sq c, 6) where Sp and Sq denote the
We refer to this algorithm as
sample set of size at least sid(6 ) drawn from p and q respectively.
Theorem 4.1.1. Feature-separable test, shown in Figure 4-1, is an (e,
3)-test
for
feature-separable distributions.
Proof: First, we show that the algorithm does not reject feature-separable distributions with high probability.
Assume P is a feature-separable distribution with the
49
Feature-separable test
(Input: E and oracle access to draw sampIcs from distributionP)}
1. Draw si =[16 In"" samples.
2. For i = 1,...,n
2.1 Let ai = (#samples x with xi = 1)/s
Prx-p[xi = 1].
2.2 If ai <
or ai > 1
Accept.
-c,
3. Draw S2 = 4sid(0.1/n)/e
=
2
be an estimation of
O(poly(1/c)2 2 ,/ 3 n log n) samples.
4. Fori= 1, ... ,I n:
4.1 Split samples into two sets, Y' and Y based on i-th coordinate
and remove this coordinate.
4.2 If there is enough samples, Run the closeness-test(Yo, y1, E,1)
4.3 If the test accepts,
Accept
5. Reject
Figure 4-1: A test algorithm for feature-separable distributions
separated feature i. The rejection of P means the "if condition" in Lines 2b, 4b, and
4c do not hold, which implies the following:
" ai is in [1E, 1 - 3e].
"
For the i-th iteration, either there are not enough samples or the
closeness-
test(Yo, Y,, e, 1) rejects.
Here, we show that the probability of these two events happening together is small.
Observe that by the Hoeffding bound it is unlikely that ai deviates from its mean,
50
Pr[xi = 1], by more than e/4. Since ai is at least 3e/4, we have
Pr[Pr[xi = 1] < c/2]
1] + { < 41
Pr[Pr[xi
1] + i < a]
< Pr[Pr[xi
K4
< -(2C2s1)/16
=0(
2)
Therefore, after drawing S2 samples, we expect to see at least Es 2 /2 samples with
xi = 1 with high probability. However, this is twice of the number of samples we
need:
2
sid(0.1/n).
By the Chernoff bound, the probability of not having enough
samples is o(1). By similar arguments, one can prove that there are also a sufficient
number of samples with xi = 0 with probability 1 - o(1).
Additionally, since i is the separated feature coordinate of P, the probability that
closeness-test(Y, Y1 , c, -)
rejects on the i-th iteration is at most 0.1/n = o(1). By
the union bound the probability of rejection in both cases is o(1).
Now, assume P is E-far from being a feature-separable distribution.
We prove
that the algorithm rejects with probability at least 2/3. First, note that any featurerestricted distribution is also a feature-separable distribution, which implies that P
is c-far from being feature-restricted too. By Lemma 6.3.1, the probability of xi = 0
is at least e for each coordinate i. It is straightforward to similarly prove that the
probability of x=
1 is at least E for each coordinate i. By the Hoeffding bound, the
probability that ai (- [c, 1 - je] is at least 1
-
1/n
2
for each coordinate i. By the
union bound, the probability of accepting P due to the wrong estimation of ai's is at
most 1/n = o(1). In addition, the closeness-test(Y', Y', e, 1) mistakenly accepts P
with probability 1/10n. By the union bound, we will not accepts P with probability
more than 0.1. Therefore, the total probability of accepting an E-far distribution is
at most 0.1 + o(1) < 1/3.
Moreover, the Feature-separabletest uses at most
sample complexity is O(poly(1/)22 -n/3n log n).
Now, we show
Q( 2 2n/3)
sI 1 + s2 samples.
Thus the
I
samples are required to test feature-separable distributions.
51
Theorem 4.1.2. There is no (e, '
)-test for feature-separabledistributionsusing o( 2 2n/3)
samples.
Proof: We prove this theorem by showing a reduction between the feature-separable
testing problem and the closeness testing problem. The main idea is that if a distribution is feature-separable the two distributions over x's with xi = 1 and x's with xi = 0
have to be equal for some i. We provide two distributions on {0, 1}i.-1 namely p and
q such that they seem quite similar if we draw too few samples. These distributions
are used to prove lower bounds for the sample complexity in [421. Suppose we have a
distribution PI on {0, 1}
such that the distribution over x's with xi = 0 is p and the
distribution over x's with xi = 1 is q. Since p and q seem similar to each other, P1 is
indistinguishable from a feature-separable distribution, although it is not one. Thus,
any (f, 3/4)-test has to draw enough number of samples to reveal the dissimilarity of
p and q.
We construct two distributions p and q as explained below. First, we pick two sets
of elements: heavy and small elements. As the term indicates, heavy elements are
much more probable than small elements. The distributions p and q have the same
heavy elements but their small elements are disjoint. If one draw too few samples,
heavy elements may conceal the difference between two distributions.
1. Pick 22(-1)/3 elements randomly and set p(x) = q(x) = 1/22(I-1)/3+1
2. Pick two random disjoint set of 2"-1/4 elements, P and
Q
yet.
3. Let p(x) = 1/2n--2 for x E P and q(x) = 1/2n-2 for x E Q.
Now, we construct two distributions.
_P=1/2
1/2 - p(x('))
if xi = 1
1/2. p(x('))
if xi = 0.
- ~()if
1/ 2 -p(x('))
52
Xi = i
if xi = 0.
which are not picked
Clearly, P, is a feature-separable distribution. In addition,
separable distribution with probability 1 - o(1). Assume
distribution with the separated feature
equal to i. Let t E {o, 1}
j
P2
P2
is a not a feature-
is a feature-separable
and bias parameter a.
Clearly,
j
is not
- 2 and bo, b, E {0, 1}. We define x = vec(bo, bi, t) be a
vector of size n such that xi = bo, xj = b1 and
X(nMfsl
-
t. Fix a vector t of size
n - 2. Let Xbo,bi = vec(bo, bl, t). Since P2 is feature-separable we have
'P2 (X 0 ,1 ) + P2 (Xi,1)
1 - a
a
(P2 (X 0 ,0 ) + P2 (Xi, 0 )).
Or equivalently,
Let ti = x1(i =
(i
and to = x
=('). We rewrite the above equation as below
,o
1- a
1
(p(to) + q(to)).
-(p(ti) + q(ti)) =
2a
2
Note that there are only three different possible values in the distributions p and
q based on our construction. It is not hard to show that the probability of holding
above equality for all t is o(1). Thus, distinguishing between P1 and 'P2 with high
probability is equivalent to the closeness testing problem.
4.2
Identifying separable features
Definition 4.2.1 (Separable feature identifier). The algorithm A is a (s, E, 6) -separable
feature identifier if, given sample access to a feature-separable distribution D, with
probability at least 1 - 6 it identifies an index i E [nl such that P is c-close to a
distribution D' for which i is a separable feature.
Theorem 4.2.2. The Separable feature identifier algorithm in Figure
(O(22n/3log -n
1 3
4-2 is an
), c, 2)-separable feature identifier.
Proof: Assume the underlying distribution P is c-far from any feature-separable
distribution with the separated feature i. We show that the probability of outputting
such i is very small. There are three possible ways to output i. Here we consider each
case and show that the probability of each case is very small.
53
Separable feature identifier
((Input: c and oracle access to draw samples from distribution P))
1. Draw si
I""samples.
[16 h
2. For i=1,..., n
2.1 Let ai = (#samples x with xi = 1)/s 2 be an estimation of
Prxp[xi = 1].
3c
or ai > 1
-
2.2 If ai <
Output i.
3. Draw s2 =4sid(0.1/n)/c
4. For i = 1,
. ..
o(2 2n/ 3n log nc 1 1 / 3 ) samples.
, n:
4.1 Split samples into two sets, Yo and Yi based on i-th coordinate
and remove this coordinate.
4.2 If there is enough samples, Run the closeness-test(Yo, Y, cE).
4.3 If the test accepts,
Output i.
5. Output a random index in [n]
Figure 4-2: A test algorithm for feature-separable distributions
* Case 1: The algorithm outputs i, because ai is not in [4e, 1
-
ic].
Observe that any distribution such that Pr[xj = 0] is zero or one is a featureseparable distribution. However, we know that the underlying distribution P is
E-far from any feature-separable distribution with the separated feature i. This
implies that Prtxi = 0] is at least e and not greater 1 - c (the proof is quite
similar to Lenma 6.3.1).
Therefore, by the Hoeffding bound the probability
that ai is in [4e, 1 - 4c] is 0(1/n 2 )
o(1).
* Case 2: The algorithm outputs i, because closeness test accepts on
i-th iteration..
Let p and q be the two distribution on x's with x
= 0 and x's with x=
1
respectively. Clearly, if we replace one distribution with the other we reach a
54
-
11MMIRTTM."
II
1
11
feature-separable distribution. It is not hard to show that the distance between
p and q is at least twice that of the distance between P and the feature-separable
distribution with the separated feature i. This means the probability of acceptance of closeness test is 0(1/10n) = o(1).
Case 3:
The algorithm outputs i, because i is chosen on last line
randomly.
Note that the feature-separable learneris very similar to feature-separable test
shown in Figure 4-1. The probability of this case is equal to the probability of
rejection of a feature-separable distribution. In the proof of Theorem 4.1.1, we
show the probability of this case is o(1).
By the union bound, the total probability of outputting incorrect i is less than
.
Moreover, the Feature-separablelearner uses at most ns1 +
82
samples. Thus the
sample complexity is O(22,/3n log n01 /3). Therefore, the proof is complete.
55
0
56
Chapter 5
Learning and testing 1-junta
distributions
In this section, we discuss the problems of learning and testing k-junta distributions
in the special case where k
=
1. The algorithms for both testing and learning are
simpler in this setting, and we obtain tighter bounds as well.
5.1
Learning 1-juntas
Here, we consider the problem of learning a 1-junta distribution.
Theorem 5.1.1. Fix e > 0 and n > 1. Let D be a 1-junta distribution over {0, 1}".
There is an algorithm that draws O(log n/E2 ) from D and outputs a distribution D'
such that 'with probabilityat least 2, dTv(D, D') < e. Furthermore, every such learning
algorithm must have sample complexity at least Q(log n/C).
The quadratic dependence on 1/c in this result improves on both Theorems 1.1.1
and 1.1.2 in their special case where k = 1.
Let Di, denote the 1-junta distribution with junta coordinate i and biased parameter a. Recall that a (q, E, 6)-learner for 1-junta distributions is an algorithm that
determines the junta coordinate, namely i, and the biased parameter a by using q
samples with probability 6. The learner algorithm is described in Figure 5-1 and the
57
........
.............
1-Junta Learner
((Input: oracle access to draw samples from distributionp))
1. Draw s = 2ln(n )/c 2 samples from p: x(), X(2) .
(S).
2. For i= 1,..., n
2.1 Let &j be the fraction of samples with xi = 1.
&j
-
3. Find i that maximize
4. Output i as junta coordinate and &i as biased parameter.
Figure 5-1: An learner algorithm for 1-junta distributions
correctness of that explained in Theorem 5.1.2.
Theorem 5.1.2. The algorithm "1-Junta Learner", described in Figure 5-1., is an
(2 In(n
2
,/e,
e, 6)-learner for the set of all 1-junta distribution.
Proof: Assume the underlying distribution is D = Dia and the algorithm outputs
D = D . Here, we want to bound the probability of failure of this algorithm in
order to prove the theorem. Thus, we want to compute the probability of the event
dt,(D, D) > e or equivalently, dL,(D, D) > 2e . We consider two cases:
* Case 1: i = i. In this case dLi(P, $D) = Ia - cj > 2e. Note that i is an unbiased
estimator of a. In other words, the expected value of & is a. If dL, (D, $)
> 2e.
Therefore, by the Hoeffding bound, the probability of this case is at most e* Case 2: i
$
- ai - j,+has to be greater
Z. In this case, the L 1 distance,
than 2c since we failed. Clearly, we have either 1& - ' > e or jai -
}
> 6. Note
that the expected value of &6is a half and the expected value of 6i is a. Thus,
If
1&; - 11 > E, again by the Hoeffding bound we can say Pr[dL,(D, D) > 2e <
e-2se 2 . Otherwise, we can assume Jai -
I1>
K-
say Iai - i I+ IK- 11 > e . Observe that
of the algorithm. Therefore,
Iai - &iI + I
e. By triangle inequality, we can
-
11 < IK- j1 by the third line
> e. This is equivalent that one
of the estimators is at least e/2 away of its mean. Therefore, the probability of
picking j-th coordinate as i mistakenly is at most e-(" 2 )/2. Since we have n - 1
58
"da
6wAOLWA
coordinates other than i, the total probability of failure is at most (n-l)e(
2
)/2
by the union bound.
Hence, in both cases, setting s = 2 n(n})/c 2 make the failure probability less that
By Theorem 1.1.3, any algorithm for learning 1-juntas needs Q(log(n)/c) samples.
So Theorem 5.1.1 is optimal up to a factor of 1 in terms of sample complexity.
5.2
Testing 1-juntas
We now consider the problem of testing 1-junta distributions. We obtain an exact
characterization of the minimal sample complexity for this task.
Theorem 5.2.1. Fix E > 0 and n > 1.
whether a distributionD on {0, 1}
The minimal sample complexity to E-test
is a 1-junta distribution is
8(2(,-1)/2
log n/E2 ).
The algorithm that tests 1-juntas with this sample complexity is described in
Figure 5-2. We establish its correctness in Theorem 5.2.2. The matching lower bound
is established in Theorem 5.2.3.
Theorem 5.2.2. The 1-Junta test, shown in Figure 5-2, is an (E, 6)-test for 1-junta
distributions.
Proof: We want to show that the 1-Junta test is an (E, 2/3)-test for 1-junta distributions. Let b = Pr[xi = 1] =
E
p(x). First, we prove that the probability of
X s.t. Xi=1
not seeing enough samples for Line 5.1 and Line 5.2 is really small.
The probability of having enough samples: Here, we compute the probability
of having enough sample for the uniformity tests in Line 5.1.
and Line 5.2. Note
that computing these probabilities are quite similar. Here, we focus on the first one.
Observe that the maximum total variation distance between two arbitrary distribution
is at most 1. Therefore, if a < e/2, the test will always accept with no sample. Thus,
we can assume a > E/2.
To use Paninstki's method for testing uniformity, we need C1a2 2(r- 1 )/ 2 /
2
samples
where C1 is a constant. We want to show that with probability 1 - o(1), S1 contains
59
1-Junta test
((Input: E, 6, oracle access to draw sampics from distributionp))
1. i, a <- 1-Junta Learner (c, 0.1).
2. Draw C2(nf2)
samples.
3. Split samples in two sets So and S1 based on their value on i-th coordinate.
4. Remove i-th coordinate of all samples.
5. If there are enough samples,
5.1. Run the uniformity (e/2a, 0.1)-test on n - 1 coordinates using samples
in S1.
5.2. Run the uniformity (e/2(1 - a), 0.1)-test on n - 1 coordinates using
samples in So.
5.3. If both of the above tests accept,
Accept.
6. Reject.
Figure 5-2: A test algorithm for 1-junta distributions
this many samples. Let
2OC12(
It is clear that E[ISi1] = bs. Also, E[a]
1/2
b by
our approach in learner algorithm. So
Pr[Sij <
Oi2
0)]
= Pr[S'
< (')b]
= Pr[sl < (a2)bjI
I< (')baL
+ Pr[
" Pr[-
"
> 0.1] Pr[!
> 0.1]
< 0.1] Pr[b
> 0.1] + Pr[H < (()b
0.1]
a2<
0.1]
Pr[a 2 > 2b] + Pr[ I' < 0.1bla 2 < b]
" Pr[a > 2b] + Pr[1I < 0.1bla 2 < b]
=O(
i) =o(1).
Accepting 1-junta distribution: Now, we want to show that the algorithm
accepts 1-junta distributions. Let p = Dg. the probability of i 4
60
j
is at most 0.1. If
we guess j correctly (i =
J). The distance of distributions on non-junta coordinates
1 is zero. Thus, the tests on Line 5.1. and 5.2.
in both case where xi = 0 or xi
have to pass with probability 0.9. Thus, the total probability of rejection of p is at
most 0.3.
Rejecting of non-1-junta distributions: If p is not a 1-junta distribution, we
have
dt (P, i ,)
= dL (p, Di,b)
p(x) - Di, b(x)|
1
x
x S.t. xi=
Ip(x)
+1
-
2-_
bp(x)- 2n-1
Ss.t. xs=o
=lb
|P~c
-
22-I|
x s.t. Xi
+(1
- b)
E
x S .t.
Note that by definition of b, we q
junta coordinates after eliminating xi
distribution.
xz=o
1-
2
n-1
= p(x)/b is a distribution over n - 1 non1.
Thus, all samples in Si are from this
Similarly, all samples in So are from distribution q2 = p(x)/(l - b).
Since dt,(p, Di,b) is at least e one of the test in Line 5.1 or Line 5.2 has to reject these
distribution with probability 0.9. Thus, the proof is complete.
Theorem 5.2.3. The is no (e,
2)-test
E
for 1-junta distributions using o(2*T1 /e2) sam-
ples.
Proof: Note that 1-junta distribution is uniform over non-junta coordinates. Thus,
for any input distribution the test should check the uniformity on those coordinates
as well. It is not hard to show that any (E, 6)-test for 1-junta distribution can be
leveraged as a uniformity (e, 6)-test as well. The proof is quite similar to what we
had for the dictator distributions in 6.2.1. By Paninski's lower bound for uniformity
test in
[35],
we need at least 2-"21 /E2 to test 1-junta distributions.
61
0
62
Chapter 6
Learning and testing dictator
distributions
In this section, we consider dictator distributions.
Based on Definition 1.3.2, in a
dictator distribution there is some coordinate i for which the sample is always one,
and the distribution is uniform over the rest of the coordinates.
In the following
subsections, we describe our tight sample complexity algorithms and lower bounds
for learning and testing dictator distributions.
6.1
Learning dictator distributions
We show that the sample complexity of learning dictator distributions is O(log n).
The upper bound and the lower bound are described in Theorem 6.1.1 and Theorem
6.1.2 respectively.
A dictator distribution, p, has only one parameter to learn: the
index of the dictator coordinate. Let i denote the index of the dictator coordinate. If
we draw a sample x from p, then xi is always one and for any
j
: i, xj is zero or one
each with probability 1/2. After drawing several samples, we expect to see a sample
x such that xj is zero, assuring us that
j
is not the dictator coordinate. Based on this
fact, we give a simple algorithm for learning dictator distributions. The algorithm
is described in Figure 6-1. Note that as described, the algorithm may output more
than one index, however we show that with high probability, with O(log n) samples,
63
Dictator Learner
)
((Input: oracle access to draw samples from distributionp
1. Draw s samples from p: X(,),
O(log n)))
. . , X(s).
((s will be
2. Fori = 1, . .. , n
If all
(1)
...
, X(S),
are one.
Output i.
3. Output _.
Figure 6-1: A learner algorithm for dictator distribution
the algorithm will output only the correct index. In addition, observe that we never
each Line 3 since one of the "if statement" must be satisfied before. However, since
we use this algorithm later without the assumption that the underlying distribution
is dictator, it may output I and we keep this line on purpose.
Theorem 6.1.1. If x(), X(2) ... ,x(s)
are s = 2 log n samples from a dictator distri-
bution p, The Dictator Learner Algorithm will output only the dictator coordinate
-
with probability 1
Proof: Let i* be the actual dictator coordinate of p. Since the i*-th coordinate is
always one, the algorithm will output i*. For i 7 i*, the probability that the algorithm
outputs i is at most 2-'. By the union bound over the n - 1 non-dictator coordinates,
we have Pr[anyi
$ i*isoutput < (n - 1)2- . Thus, by setting s = [2log'n], the
probability of Pr[i
# i*] <1.
Theorem 6.1.2. Let A be a randomized algorithm that learns a dictator distribution
using at most
log(n-1)
samples. The probability of success of A is O( 1).
Proof: By Yao's Lemma, we can assume A is deterministic and that the underlying
distributions are chosen randomly: first, we choose a random dictator distribution, p,
uniformly from n possible dictator distributions, and then we draw s samples from p
to feed A.
64
Let pi denote the dictator distribution such that the i-th coordinate is dictator.
Also, let x E {0, 1}
be one of the samples we draw via above procedure.
Note
that if xi and xj are one, then the underlying distribution p can be either pi or pj
with equal probability. Now, assume we draw s samples, namely x(, X2,..., X(s).
are one and zero
. x
Suppose 1 is an indicator variable that is one if all P
otherwise. Let C be the set of coordinates i such that 1 is one. For any i and j in
C, the underlying distribution p could be equal to pi or pj with the same probability.
Thus, these distributions are not distinguishable and any deterministic algorithm,
1
=
A, outputs the correct distribution with probability at most
i.e., the
Pr[A outputs the correct coordinatel E I =
i= 1
11 <
.
i=O
Observe that if i is the dictator coordinate of p then xi's will be always one.
Otherwise, it can be zero with probability a half by definition. Thus, Pr[1j = 1 Id
1= 2-'. Therefore, we have the following for the success probability of A (i.e. the
probability that A outputs i):
Pr[ success of A]
<
Z Pr[
l=1
= 1] Pr[E 1 = 1]
success of Al IE
i=1
i=1
< E }Pr[Y: I== 1
1
i=1i
-2
2
n
< E Pr[E Ii= 1] +
<
u-] +
-
SPr[i
E Pr[j 1i = 11
2
2
The last inequality holds by the Chernoff bound and the fact that E[E Il]
i=1
n2--
;> /117.
E
65
6.2
Testing dictator distributions
Although learning dictator distributions can be done quickly, with only theta(log n)
samples with respect to the domain size, the testing task is much harder and needs
0 (2
"21) samples. The lower bound and upper bound comes from the natural relation
between the definition of dictator distributions and the uniform distribution.
In
Lemmas 6.2.1 and 6.2.3, the reductions to and from uniformity testing are described.
Also, the formal results of the lower and upper bounds are in Theorem 6.2.4.
Lemma 6.2.1. [Reductions from testing uniformity to testing dictator
I If there exists
an (e, 6)-test, A, for dictator distributions on domain {0, 1}" using q samples, then
there exists an (E, 6)-test, B, for uniformity on domain {0, 1}"1 using q samples.
Proof: We show that the existence of A implies the existence of B.
want to test the uniformity of a distribution p.
Assume we
We define another distribution p'
corresponding to p. Given samples of distribution p, create samples of distribution p'
as follows:
1. Draw sample x = (xi,... , xr1_) from p.
2. Output x' = (x 1 ,
. . , x_ 1 , 1).
Note that p' is a dictator iff p is uniform.
Thus, to test the uniformity of a
distribution p, B can just simulate A and output as A does. The only difference is
that instead of using samples directly from p, we should use samples of p' by above
procedure.
Additionally, the 11-distance between the uniform distribution and p on n - 1 bits
is equal to the 11 distance between dictator distributions and p' on n bits. Thus, the
error guarantees of A also applies to B.
The reduction we proved above immediately implies that any lower bound for
uniformity testing is also a lower bound for testing dictator distributions.
Corollary 6.2.2. If there is no (e, 6)-test for uniformity on domain {0, 1}"
using
q samples, then any (e, 6)-test for testing dictator distribution on domain {0, 1}'" uses
more than q samples.
66
Dictator test
((Input: e, oracle access to draw samples from distributionp))
1. i <- The first output of DictatorLearner()
2. If i # -L,
2.1 Draw q = O(22l /e2) samples: x(1) ...
2.2 If for any j, xi
(q).
1
reject
2.3 Else
return the result of uniformity test with samples
removing i-th
)
.
(q)
coordinate.
3. Else Reject
Figure 6-2: A test algorithm for dictator distributions
There is a weaker reduction in other direction. In the following Lemma, we describe it more formally.
Lemma 6.2.3. [Reductions from testing dictator to testing uniformity] If there exists
an (e, 6)-test, A, for uniformity on domain {0, 1}'-' using q samples, then there exists
an (2c, max{1(1-
)
,
})-test B., for dictator distributionson domain {0, 1}'
using q + 2 log n samples.
Proof: Note that B has to accept the distribution p if it is a dictator distribution
and reject if it is 2e-far from being a dictator distribution both with probability
1 - max{(1
-
f)2logn+q,
6 +
}. The general idea is that first, B learns the index
of dictator coordinate, say i, and then performs a uniformity test on the rest of the
coordinates. The algorithm is described in Figure 6-2.
By Theorem 6.1.1, we can learn i using 2 log n samples with probability 1 -
I. f
the learner returns I, we know that there is no dictator coordinate, and we reject
(Line 3). Otherwise, i is the candidate dictator coordinate. If we see any violation to
this assumption (i.e. any sample xW such that P" = 0), then we reject (Line 2.2).
If not (i.e. the samples are consistent with i being the dictator coordinate), then the
67
result of the uniformity test on the rest of the coordinates should be returned as the
answer.
Assume the input distribution, p, is a dictator distribution. By Theorem 6.1.1, i
is the dictator coordinate with probability 1 p is uniform on the rest of the coordinates.
Hence, none of xU are zero and
.
Thus, with probability at least 1 - 6,
+
probability of rejection is at most 6
.
the uniformity test accepts p and our algorithm accepts it as well. Thus the total
Now, suppose p is 2e-far from being a dictator distribution. Equivalently, the 11
distance is at least e (since dt = 4d1 ). Thus, we have
E
[px
- q, ;> E where q is
XC{0,1}"
any dictator distribution with dictator coordinate i. Thus, we have:
<
<
|px - qI
=
x s.t. Xi=O
px - q.T|I+
E
|Px -q._|
x S.t. Xi=1
=Pr[xi = 0] +
|px - 2-
E
x
2
s.t. x'i=I
Note that the second term is the 11 distance to the uniform distribution (after
removing i-th coordinate). Therefore, at least one the following has to happen:
(i) Pr[xi = 0] is at least '.
(ii) p is at least e far from being uniform on the rest of the coordinates.
However, the only way to accept p mistakenly is having x= 1 for all samples and
passing the uniformity test. The probability that B does not detect the case (i) is
(1
-
L)21ogn+q.
Also in case (ii), the uniformity test fails with probability 6. Thus, B
may fail to reject p with probability max{(I
-
)2Iogn+q, 61.
Thus, the probability of failure in either cases (p is dictator and p is 2e-far from dic-
tator) is at most max{(1 -
c)2log n+q,
6+1}. Hence, B is a (2e, max{(I
_ E)2
og n+q,
6+
S})-test.
Theorem 6.2.4. There is an (e, 6)-test for the dictator distributions with sample
complexity 0(2% /e). Also, there is no (e, 6)-test 'with asyrptoticallysmaller sample
size.
68
Proof: Paninski in [351 shows the testing uniformity needs Q(N-2/1
N is the size of the domain.
Q(2
i2
In our case, N = 2".
2
) samples where
Thus, by Corollary 6.2.2,
/c 2 ) samples are needed for testing dictator distributions.
Additionally, Paninski provides an (E,
)-test for uniformity using O(N/ 2 /c 2 ) sam-
ples. Thus, by Lemma 6.2.3, there exists an (E, -)-test
for dictator distributions (for
sufficiently large n). Hence, the proof is complete.
6.3
Learning and testing feature-restricted distribution
In this section we consider the problem of testing and learning feature-restricted
distributions.
6.3.1
Testing feature-restricted distributions
In this section, we describe our results for testing feature-restricted distributions: By
Definition 1.3.3, a distribution is feature-restricted if for a fixed i, the i-th coordinate
of any sample drawn from this distribution is always one. However, if a distribution is
E-far from being feature-restricted, the probability of xi = 0 is not close to zero. Using
this property, in Theorem 6.3.2 and Theorem 6.3.3, we show 8(log n/c) samples are
required to test feature-restricted distributions.
We first show that for any distribution far from to being feature-restricted, Pr[xi
0] is noticeably far from zero:
Lemma 6.3.1. Let D denote a distribution which is c-far from being a featurerestricted distribution. For any 1 < i
n and any sample, x, drawn randomly
<
from D, the probability that xi equals to zero is at least E.
Proof: Pick any coordinate i. Let Xo denote all the domain elements with xi = 0.
Clearly, the probability of xi = 0 is EE
D(x). Now, we define a feature-restricted
69
Feature-restricted test
(Input: e and oracic access to draw samples from distribution'P))
1. Draw s =
[I2-"]
samples: x(), ...
x(s).
2. for i = 1,1 ...
., n:
2.1 If xj = 1 for all I < j < s:
Accept
3. Reject
Figure 6-3: Algorithm for testing feature-restricted distributions
distribution as follows
'D(x) + (Ex,,V D(x))/2"-
1
if Xi = 1
(6.1)
if xi = 0.
0
It is not hard to see that total variation distance between D and D' is
Zgy0
D(x).
On the other hand, D' is a feature-restricted distribution, and this means the total
variation distance of D' and D, which is E-far from being feature-restricted is at least
c. Therefore, Pr[xi = 0] is at least E, as is EZx-, D(x).
Lemma 6.3.1 gives us the insight for how to test these types of distributions. If
a distribution is not feature-restricted, after drawing enough samples, we will see
samples with
Xj
equal to both zero and one for each coordinate i C [n]. We prove this
formally in Theorem 6.3.2.
Theorem 6.3.2. Feature-restrictedtest (See Figure 6-3) is an (e, 2)-test for featurerestricted distributions using O(log n/c) samples.
Proof: Here we show that the probability than the algorithm makes a mistake is
at most J.
Note that if p is a feature-restricted distribution, we never reject it.
Thus, the only mistake that the algorithm can do is to accept a non-feature-restricted
distribution. This means there is a coordinate i such that P)
is one for all 1 <
< s.
By Lemma 6.3.1, the probability of this event happening for a particular i is at most
(1
-
)s. By the union bound over all coordinates, the probability we have such an i
70
is at most n(1 - r)'. By setting s
we have
[l|gf~]
Pr[Algorithm fails]
< n(1 - 6)1o911/1 < 1
(6.2)
where the last inequality holds for sufficiently large n.
Now, we show that s = Q(log n/c) samples is necessary to test feature-restricted
distributions.
To do so, we introduce a family of feature-restricted distributions
{D 1 , . . . , D,,} and a non-feature-restricted distribution D', and then show no algorithm can distinguish between these two types of distributions. Below the processes
of drawing a sample x from these distribution is described.
Drawing sample from V: For each coordinate xi, set xi
1 with probability
1 - E and xi = 0 with probability E.
Drawing sample from Di: For each non-restricted feature
j
(i
j),
set xi j
1
with probability 1 - c and xj = 0 with probability c. Always set xi = 1.
The formal lower bound on number of samples is proved in the following theorem.
Theorem 6.3.3. There is no (E, 3) -test for feature-restricteddistributionswith o( I"")
samples and c <
Proof: The proof is by contradiction. Assume there is an algorithm, namely A that
is an (e, 2)-test for feature-restricted distributions using s = o('"gn) samples. By Yao's
Lemma, assume A is a deterministic algorithm, and that the input distribution P is
chosen as follows: With probability a half, P is D' and with probability 1/2n P is
Di for each 1 < i < it. Observe that with probability a half P is a feature-restricted
distribution and with probability half it is E-far from being feature-restricted.
the coordinates i such that x)
E(C(X)) = n(1-E)s if P =
x(-).
is one for all 1 <
'. Otherwise E(C(X))
j
Let C(X) be the number of
< s. It is straightforward that
1+(n-1)(1-c)s =(n(1-c) 8
by our assumption about s. By the Chernoff bound and the fact that n(1
I,
)
Let X denote any set of s samples x,.,
-
e)y
>
we can show the probability of C(X) < n(1 - c)3/2 or C(X) > 2n(1 - C)' is
O(e--G/)) = o(1).
Now, we compute the success probability of A and show that it is strictly less
71
than 1. Let X denote set of all X with n(1 - e)'/2 < C(X) < 2n(1 - c)' and X'
denote set of all X with C(X) < n(1 -,E)"/2
Pr(A succeeds)
or C(X) > 2n(1 - c)".
= Pr(A)
=E Pr(A X) Pr(X)
x
=E Pr(AIX) Pr(X) + E Pr(AIX) Pr(X)
XeX
XEX'
o(1) + E Pr(AIX, P is feature-restricted) - Pr(P is feature-restrictediX) - Pr(X)
XEx
+
Z
Pr(AIX, P is not feature-restricted) - Pr(P is not feature-restrictedIX) - Pr(X)
X CX
o(1) + E [A accepts P] - Pr(P is feature-restrictedIX) - Pr(X)
XEX
+
Z
[A rejects P] . Pr(P is not feature-restrictedIX) . Pr(X)
XEX
o(1) + E [A accepts P] -Pr(XIP is feature-restricted) - Pr(P is feature-restricted)
xEx
+
1 [A rejects P] - Pr(XIP is not feature-restricted) - Pr(P is not feature-restricted)
XEX
* o(1) + - E max(Pr(XIP is feature-restricted), Pr(XIP is not feature-restricted))
XEX
(6.3)
For any X let #0(X) denote number of x*
o(
= 0. Define #1(X) similarly with
= 1. Note that we have
Pr(XIP is not feature-restricted) = F#O(x)(i _ E)#1(x)
and
Pr(X
'P is
feature-restricted) =
Pr(X IP = D) Pr(Di'P is feature-restricted
u-~n
Additionally, we have
72
C(X) E#0(x)(i
)#i(X)-s
Restriced feature identifier
((Input: c and oracle access to draw samples from distributionP))
1. Draw s = | *l]
samples: x)
... , x(s).
2. for i =1,...,:
2.1 If x ) = 1 for all 1 < J < s:
Output i.
Figure 6-4: Algorithm for identifying restricted features.
Pr(X)
=
[Pr(XIP is feature-restricted) + Pr(XIP is not fcature-restricted)]
(1
(X)) Pr(XIP is not feature-restricted)
or
= (1+ "
2 (1
j
) Pr(XIP is feature-restricted)
(1) )) max(Pr(X P is feature-restricted), Pr(X P is not feature-restricted))
+ min( "(X)
By Equation 6.3,
Pr(A succeeds)
< o(1) + 1 T max(Pr(XIP is feature-restricted), Pr(XIP is not feature-restricted))
XEX
o(1) + E 1/(1 + min( "
))Pr(X)
XEX
< o(1) + E 1/(1 + 1/2) Pr(X)
XeX
< o(1) + - E Pr(X)
XEX
<3
Thus, by contradiction no such algorithm exists.
73
LI
6.3.2
Identifying restricted features
Definition 6.3.4 (Restricted feature identifier). The algorithm A is a (s, E, 6)-restricted
feature. identifier if, given sample access to a feature-restricted distribution D, with
probability at least 1 - 6 it identifies an index i E [n] such that D is c-close to a
distribution D' for which i is a restricted feature.
Theorem 6.3.5. The Restricted feature identifier algorithm described in Figure 6-4
is a
(W2-1,e,
2)-restricted feature identifier.
Proof: Assume i is the restricted feature of the underling distribution P. The "If
condition" in Line 2.1 in the algorithm holds at least for the coordinate i. Thus,
the Feature-restricted Learner always outputs a coordinate.
Let
j
be the output
coordinate. We show that the probability of P being c-far from a feature-restricted
distribution with restricted feature
j
is at most 1/3.
Assume P is a distribution which is e-far from any feature-restricted distribution
with restricted feature
j.
Similar to what we had in Lemma 6.3.1, we easily can
prove Pr. p[xj = 0] ;> E. Thus, the probability of outputting
j
mistakenly is (1 - f)'.
Since we have n - 1 non-restricted feature, by the union bound, the probability of
outputing any wrong coordinate is n(1
-
c)s, which is less 1/3 for sufficiently large n
(See Equation 6.2 for more detail).
D
Now, we show Q(log n/e) samples are required to learn feature-restricted distributions.
Theorem 6.3.6. There is no (o(log n/c), c, 3) -learnerfor feature-restricteddistribu[ions.
Proof: The proof is by contradiction. Assume there is such a learner A. By Yao's
Lemma, suppose A is deterministic and the underlying distribution is randomly chosen from D1, . . . , DP
as explained for Theorem 6.3.3.
Let X denote a set of s samples x(M),...
,x(s).
Let C(X) be the number of the
coordinates i such that xH is one for all 1 < j < s. Clearly, E(C(X)) = I+(n -1)(1c)" > Vr. By the Chernoff bound, the probability of C(X) < xrz/2 is O(e-O(a))
which is o(1). Now, assume we have a set of s samples ,r with C(X) > #n/2 and
74
A outputs I as the restricted feature. Suppose xV is one for all t E {i ...
and 1 <
j
< s. The probability of X coming from any of Dj, ... .
(X)
,IC(X)I
is equal.
Thus, I is the correct answer with probability C(1). Thus, the success probability of
algorithm in this case is at most
+ o(1) = o(1). Thus, there is no such learner and
the proof is complete.
75
76
Bibliography
[1] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal Testing for Properties of Distributions. arXiv.org, July 2015.
[21 Andris Ambainis, Aleksandrs Belovs, Oded Regev, and Ronald de Wolf. Efficient
quantum algorithms for (gapped) group testing and junta testing. arXiv preprint
arXiv:1507.03126, 2015.
[31 Alp Atici and Rocco A Servedio. Quantum Algorithms for Learning and Testing
Juntas. Quantum Information Processing, 6(5):323-348, October 2007.
[41 Maria-Florina Balcan, Eric Blais, Avrim Blum, and Liu Yang. Active Property
Testing. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer
Science (FOCS), pages 21-30. IEEE, 2012.
[51 Tugkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of approximating the entropy. SIAM Journal on Computing, 35(1):132-
150, 2005.
[6] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld,
and Patrick White. Testing random variables for independence and identity. In
Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium
on, pages 442-451. IEEE, 2001.
[71 Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick
White. Testing that distributions are close. In 41st Annual Symposium on
Foundations of Computer Science, FOCS 2000., 12-14 November 2000, Redondo
Beach, California, USA, pages 259-269, 2000.
[8] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick
White. Testing closeness of discrete distributions. CoRR, abs/1009.5397, 2010.
[91 Arnab Bhattacharyya, Eldar Fischer, Ronitt Rubinfeld, and Paul Valiant. Testing monotonicity of distributions over general partial orders. In ICS, pages 239-
252, 2011.
[101 Eric Blais. Improved Bounds for Testing Juntas. In APPROX '08 / RANDOM
'08: Proceedings of the 11th internationalworkshop, APPROX 2008, and 12th
internationalworkshop, RANDOM 2008 on Approximation, Randomization and
77
CombinatorialOptimization: Algorithms and Techniques, pages 317-330, Berlin,
Heidelberg, August 2008. Springer-Verlag.
[11] Eric Blais. Testing juntas nearly optimally. In STOC '09: Proceedings of the
forty-first annual ACM symposium on Theory of computing, page 151, New York,
New York, USA, May 2009. ACM Request Permissions.
[121 Eric Blais, Amit Weinstein, and Yuichi Yoshida. Partially Symmetric Functions
Are Efficiently Isomorphism Testable. SIAM J. Comput., 44(2):411-432, 2015.
[131 Avrim Blum. Relevant examples and relevant features: Thoughts from computational learning theory. In AAAI Fall Symposium on 'Relevance', volume 5,
1994.
[141 Avrim Blum, Lisa Hellerstein, and Nick Littlestone. Learning in the presence
of finitely or infinitely many irrelevant attributes. J. Comput. Syst. Sci. (),
50(1):32-40, 1995.
[15] Avrim L Blum and Pat Langley. Selection of relevant features and examples in
machine learning. Artificial Intelligence, 97(1-2):245-271, December 1997.
[16] William S. Bush and Jason H. Moore.
Chapter 11: Genome-wide association
studies. PLoS Comput Biol, 8(12):e1002822, 12 2012.
[17] Cl6ment L. Canonne. A survey on distribution testing: Your data is big. but is
it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63,
2015.
[18] Sourav Chakraborty, Eldar Fischer, David Garcfa-Soriano, and Arie Matsliah.
Junto-Symmetric Functions, Hypergraph Isomorphism and Crunching. In CCC
'12: Proceedings of the 2012 IEEE Conference on Computational Complexity
(CCC, pages 148-158. IEEE Computer Society, June 2012.
[19] Sourav Chakraborty, David Garefa-Soriano, and Arie Matsliah. Efficient sample
extractors for juntas with applications. In ICALP'11: Proceedings of the 38th
international colloquim conference on Automata, languages and programming,
pages 545-556. Springer-Verlag, July 2011.
[201 Siu-on Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal
algorithms for testing closeness of discrete distributions. In Proceedings of the.
Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA
2014, Portland, Oregon, USA, January 5-7, 2014, pages 1193-1203, 2014.
1211 Hana Chockler and Dan Gutfreund. A lower bound for testing juntas. Information Processing Letters, 90(6), June 2004.
[22] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning
k-modal distributions via testing. Theory of Computing, 10(20):535-570, 2014.
1231 Luc Dcvroye and Gdbor Lugosi. Combinatorial methods in density estimation.
Springer, 2001.
124] Ilias Diakonikolas. Learning structured distributions, To appear.
1251 Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing identity of
structured distributions. CoRR, abs/1410.2266, 2014.
[261 Ilias Diakonikolas, Homin K Lee, Kevin Matulef, Krzysztof Onak, Ronitt Rubinfeld, Rocco A Servedio, and Andrew Wan. Testing for Concise Representations.
In FOCS '07: Proceedings of the 48th Annual IEEE Symposium on Foundations
of Computer Science, pages 549-558. IEEE Computer Society, October 2007.
127] Eldar Fischer, Guy Kindler, Dana Ron, Shnuel Safra, and Alex Samorodnitsky.
Testing juntas. Journal of Computer and System, Sciences, 68(4):753-787, June
2004.
1281 Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs.
In Studies in Complexity and Cryptography. Miscellanea on the Interplay between
Randomness and Computation, pages 68-75. Springer, 2011.
[291 Michael J. Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E.
Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing,
23-25 May 1994, Montr6al, Quebec, Canada, pages 273-282, 1994.
[301 Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing properties of collections of
distributions. Theory of Computing, 9(8):295-347, 2013.
[311 Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits,
Fourier transform, and learnability. Journalof the A CM (JA CM, 40(3):607-620,
July 1993.
[321 Michael Mitzeninacher and Eli Upfal. Probability and Computing: Randomized
Algorithms and ProbabilisticAnalysis. Cambridge University Press, New York,
NY, USA, 2005.
[331 Elchanan Mossel, Ryan ODonnell, and Rocco A Servedio. Learning functions of
k relevant variables. Journal of Computer and System Sciences, 69(3):421-434,
November 2004.
[341 Ryan ODonnell.
Analysis of Boolean functions. Cambridge University Press,
October 2014.
[35] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Inf. Theor., 54(10):4750-4755, October 2008.
[36] Ariel D Procaccia and Jeffrey S Rosenschein. Junta distributions and the averagecase complexity of manipulating elections. J. Artif. Intell. Res., pages 157-181,
2007.
79
[371 Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower
bounds for approximating distribution support size and the distinct elements
problem. SL4M J. Comput., 39(3):813-842, 2009.
[381 Rocco A Servedio, Li-Yang Tan, and John Wright. Adaptivity Helps for Testing
Juntas. Conference on Computational Complexity, 33:264-279, 2015.
[39] Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes.
Using the adap learning algorithm to forecast the onset of diabetes mellitus. In
Proceedingsof the Annual Symposium on Computer Application in Medical Care,
page 261. American Medical Informatics Association, 1988.
[401 Gregory Valiant. Finding correlations in subquadratic time, with applications to
learning parities and juntas. FOCS, pages 11-20, 2012.
141] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample
estimator for entropy and support size, shown optimal via new clts. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San
Jose, CA., USA, 6-8 June 2011, pages 685-694, 2011.
[421 Paul Valiant. Testing symmetric properties of distributions. In Proceedings of
the FortiethAnnual ACM Symposium on Theory of Computing, STOC '08, pages
383-392, New York, NY, USA, 2008. ACM.
80
Appendix A
Learning juntas with the cover
method
Fix any class C of distributions over {o, 1}". An e-cover of C is a collection C, of
distributions on {0, 1}' such that for every distribution D E C, there is a distribution
D'
C CE such that dTv(D, D') < E. We can obtain a good learning algorithm for C by
designing a small E-cover for it and using the following lemma.
Lemma A.0.7. Let C be an arbitraryfamily of distributions and e > 0. Let C, C C be
an e-cover of C of cardinality N. Then there is an algorithm that draws O(E-2 log N)
samples from an unknown distribution D E C and, with probability 9/10, outputs
a distribution D' G C, that satisfies dTv(D, D') < 6E.
algorithm is 0(N log N/
See
2
The running time of this
).
124, 23, 221 for good introductions to the lemma itself and its application to
distribution learning problems. We are now ready to use it to complete the proof of
Theorem 1.1.1.
Theorem 1.1.1 (Restated). Fixe > 0 and 1 < k < n. Define t =
(") 2k2k/. There is
2
3
an algorithm A with sample complexity O(log t/c 2 ) = 0(k2k/E + k log n/E ) and run-
ning time O(t log t/c 2 ) that, given samples from a k-junta distribution D, with proba-
P-D1QW I K c
81
Grn
P(X)
-
bility at least 2 outputs a distribution D' such that dTv(D, D') := Eo
Proof: By Lcmma A.0.7, it suffices to show that the class of all k-junta distributions
has a cover of size N = (')
2 2-/
. This, in turn, follows directly from the fact that we
can simply let C, be the set of all k-juntas with probability mass function p where p(x)
is a multiple of E/2" for each element x C {0, 1}y. There are
(")
ways to choose the
set J C [n] of junta coordinates and at most (2 k)2k/f ways to allocate the probability
mass in e/2 k increments among the
2k
different restrictions of x on J.
82
n
Appendix B
Proof of Equation (2.1)
We establish some basic properties of Pj as described in Section 2. First, for a fixed
set J, define the biases bi, i E {o, 1, . . . , 2e - 1} for J to be the probability of x07) = C
where x is drawn from P and CO is the binary encoding of i with k bits.
Lemma B.O.8. For each bias bi of the set J, we have
bi =
a.,
(B. 1)
Proof: First, we introduce the notation we use below. For a subset I of size k, we
define a function r : I
-4
[k] such that rj(c) indicates the rank of the coordinate
c E I (rank the smallest first). Basically, when x)
= Cj, the c-th bits of x is equal to
r1 (c)-th bit of Ci for all c E I. In addition, we define ri(S) to be the image of subset
S under the function rj. In particular, if x') = Cj, we have x(S) = Cr(s)) for any
S C I. Let t
J \ J*I and it is at least one. Then by Bayes' Rule, we have
bi = Pr[(j) = Cd]
x(Jnj*) = C[.j(jnj*)]
2k-_jj
Prfx('\') =C
=0
Pr[5i-(\J*)
-C
= Pr[x~j\''*) = CT
1=0
A
n
(JJ)- "(AJKJ
AX$/2)=Cd'''|6*
83
j ),
(*
= CTJ(JnJ*)I*) =
Oi
.
1
C] ,
= Pr[(J\J*) = Cr(J\J*) A
Pr[x(*)
C-1]
Observe the restriction of the P on non Junta coordinates (including J\ J*) is uniform.
Thus, the probability of each setting for the coordinates in J \ J* is 2-
= 2-.
-\'
Consecutively, it is independent of the junta coordinates. Therefore, we conclude
Pr[xG'\*)
i=0
] - Pr[x(* )
C
C'"'
C1]
|
.
2k-1
bi
2k _1
S
2- - Pr[x(jnJ*) =CTj(Jnj*)x(J*) = C1]
i=0
Note that if x(*)
=
a,
C1, then the values of x on all the coordinates in J n J* is
determined. Therefore, both binary encodings Ci and C, should appoint the same
value to these coordinates. In other words, if CrIl'nj') #
bility of x(jnj*)
-
J*(Jj*),
then the proba-
Ca['j(jnj) is zero since we have the condition x(jnj*) =Cj*(jnj).
Otherwise, it is one.
Note that we can partition C's (or l's) based on the values they appoint for
the coordinates in J
{0,
1,
J*. We define
= {L 1 , L22 ,... , L 2
1-t }
to be a partition of
1} such that for each L E L and any two elements mi and M2 in L,
.. . , 2-
Cr*Jl4*)
n
C
namely 7n, C'
(2J*2 ). In addition, L(i) is the set such that for any of its element,
=
CfnJ*(J'n). In
other words, while CO is a binary encoding over
set J it assign the same values to J n J as Cm (or any member of L(i)). As there
is only one sets that its elements agree on C* (JoJ*), L(i) is well-defined. We define
L- 1 to be the set of all i such that L(i) = La. By this new notation, we conclude
bi
=a,
be=Zs)
I&C(i)
Thus, the proof is complete.
In addition, it is not hard to see that Pj is actually a junta distribution over the
set J n J*. Recall that based on Equation B.1, we obtain the desired inequality
Pj(x) = (Pr[y(J"J*) = x(Jnj*)])/
Y-P
84
2 I-k
i.
(B.2)
Download