Testing shape restriction properties of probability
distributions in a unified way
by
Themistoklis Gouleakis
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2015
© Massachusetts Institute of Technology 2015. All rights reserved.
Author: Signature redacted
Department of Electrical Engineering and Computer Science
May 20, 2015

Certified by: Signature redacted
Ronitt Rubinfeld
Professor
Thesis Supervisor

Accepted by: Signature redacted
Professor Leslie A. Kolodziejski
Chair, Department Committee on Graduate Theses
Testing shape restriction properties of probability distributions in
a unified way
by
Themistoklis Gouleakis
Submitted to the Department of Electrical Engineering and Computer Science
on May 20, 2015, in partial fulfillment of the
requirements for the degree of
Master of Science in Computer Science and Engineering
Abstract
We study the question of testing structured properties of discrete distributions. Specifically,
given sample access to an arbitrary distribution D over [n] and a property P, the goal is to
distinguish between D ∈ P and ℓ₁(D, P) > ε. Building on a result of [9], we develop a general
algorithm for this question, which applies to a large range of "shape-constrained" properties,
including monotone, log-concave, t-modal and Poisson Binomial distributions. Our generic
property tester works for properties that exclusively contain distributions which can be well
approximated by L-histograms for a small (usually logarithmic in the domain size) value
of L. The sample complexity of this generic approach is Õ(√(Ln)/ε³). Moreover, for all cases
considered, our algorithm has near-optimal sample complexity. Finally, we also describe a
generic method to prove lower bounds for this problem, and use it to derive strong converses
to our algorithmic results. More specifically, we use the following reduction technique: we
compose the property tester for a class C of distributions with an agnostic learner for that
same class to get a tester for a subclass C_HARD ⊆ C for which a lower bound is known.
Thesis Supervisor: Ronitt Rubinfeld
Title: Professor
Acknowledgments
I would like to primarily thank my advisor Ronitt Rubinfeld for introducing me to the field
of property testing and constantly giving me new ideas for problems and very helpful advice
throughout these years.
I would also like to thank my undergraduate advisor Dimitris Achioptas for the close collaboration and guidance in my first research steps.
Many thanks to all my research collaborators: Maryam Aliakbarpour, Amartya Shankha
Biswas, Clement Cannone, Ilias Diakonikolas, John Peebles, Anak Yodpinyanee, and all the
other members of the MIT theory group for creating a very friendly environment. To mention
a few, I would like to thank my officemates and my friends Arturs Backurs, Cheng Chen,
Gautam Kamath, Madalina Persu, Ilya Razenshteyn, Adam Sealfon, Katerina Sotiraki,
Christos Tzamos, and Manolis Zampetakis for their help and support at various times.
Lastly, I would like to thank Paris Kanellakis' family for supporting me with a fellowship
in the first year of my studies.
Contents
Property testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
1.2
Motivation and related work . . . . . . . . . . . . . . . . . . . . . . . . . .
20
Testing properties of probability distributions in a unified way. . . .
20
23
Preliminaries
23
.
D efinitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Testing and learning
. . . . . . . . . . . . . . . . . . . . . . . . . .
23
2.1.2
Properties of interest . . . . . . . . . . . . . . . . . . . . . . . . . .
26
.
2.1.1
.
2.1
29
Testing distributions in a unifed way
Structural results . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
30
3.2
Computing the distances . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
34
3.3
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
36
3.3.1
The algorithm . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
36
3.3.2
Proof of Theorem 3.3.2 . . . . . . . . . . . . . . . .
. . . . . . . . .
38
3.3.3
Sample complexity. . . . . . . . . . . . . . . . . . .
. . . . . . . . .
38
3.3.4
Correctness. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
38
Considering the effective support size only . . . . . . . . .
. . . . . . . . .
42
3.4.1
Proof of Theorem 3.4.4 . . . . . . . . . . . . . . . .
. . . . . . . . .
44
3.4.2
Application: Testing Poisson Binomial Distributions
. . . . . . . . .
45
Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
48
3.5
.
.
.
.
.
.
.
.
3.4
.
3.1
.
3
.
.
.
1.1
1.2.1
2
13
Introduction
.
1
7
A Summary of results
53
B Ommited proofs
55
B.1
Proof of Lemma 1.1.7 ..............................
B.2 Proof of Theorem 3.1.7 ..............................
8
.
55
58
List of Figures
9
10
List of Tables
. . . . . . . . . . . . . . . . . . . .
20
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
1.1
Summary of distribution testing results.
A.1
Sunimary of our results.
11
12
Chapter 1
Introduction
1.1
Property testing
Property testing is a framework where one has to come up with an algorithm that solves a
relaxed decision problem in sublinear time. Namely, it has to distinguish between a "YES"
instance and a "NO" instance whose distance from any "YES" instance is at least ε, for some
distance measure on the set of possible instances. The hope is that this relaxed notion will in
some cases enable us to solve the problem in time sublinear in the length of the input. This
seems quite surprising, since it implies that the algorithm will only look at an infinitesimal
fraction of the input. As we are going to see later, randomness (i.e., looking at random places
of the input) will always play a significant role.
In particular, for the special case of testing properties of probability distributions, the
problem can be phrased as follows: given a class (property) of distributions P and sample
access to an arbitrary distribution D, one must distinguish between the case that (a) D ∈ P,
versus (b) ‖D − D′‖₁ > ε for all D′ ∈ P (i.e., D is either in the class, or far from it). The
algorithm has access to either samples from the distribution, or is able to query a point in
the probability density function (pdf) or the cumulative distribution function (CDF) of the
input distribution. We call the sample (query) complexity of such an algorithm the number
of samples (resp. queries) needed in order for it to satisfy the aforementioned guarantees.
Note that the size of the input here is considered to be the size of the domain of the input
distribution. We are only interested in sublinear sample or query complexity, since otherwise
the algorithm could just learn the distribution and computationally check whether it belongs
to the class we are interested in (i.e., as an ordinary decision problem).
An example: Testing uniformity
As a motivating example for testing properties of distributions, we will consider the problem
of testing whether a distribution is uniform. That is, we want to find an algorithm that takes
a sublinear number of samples and has the following performance guarantees (such an
algorithm is called a property tester, as defined more generally in Section 2.1.1):

* if p is uniform, it outputs ACCEPT with probability at least 2/3;
* if dTV(p, U) > ε, it outputs REJECT with probability at least 2/3.
At this point we observe that there exists a statistic for which the uniform distribution
is extremal. Namely, given s samples, the uniform distribution minimizes the expected
number of collisions between them (i.e., pairs of identical samples). Indeed, let S = {x₁, ..., x_s}
be the set of independent samples from p. Let E_ij be the event that x_i = x_j. By linearity of
expectation on the indicator random variables X_ij for the events E_ij, we have:

coll(S) := E[# collisions] = Σ_{1≤i<j≤s} Pr[x_i = x_j] = (s(s − 1)/2) · Σ_{k=1}^n p_k².

Since Σ_{k=1}^n p_k = 1, the expected number of collisions is minimized when the distribution is
uniform, in which case we have coll(S) = (s(s − 1)/2)/n.
From the above we conclude that the collision probability (i.e., the expected fraction of
collisions among the sample pairs) is equal to the squared ℓ₂ norm ‖p‖₂² of the distribution p. Another
useful observation is that we can estimate the ℓ₂ distance of p from the uniform distribution
using the ℓ₂ norm of p as follows:

‖p − U‖₂² = Σ_{i∈[n]} (p_i − 1/n)² = Σ_{i∈[n]} p_i² − (2/n)·Σ_{i∈[n]} p_i + 1/n = ‖p‖₂² − 1/n.
So, it is natural to use this statistic in order to decide whether a distribution
is uniform or far from it.
The following lemma shows that the observed collision frequency is also concentrated
around its mean:

Lemma 1.1.1 ([9]). Let I be an interval and q be the conditional distribution of p on I. Then,

(1 − ε²/32)·‖q‖₂² ≤ coll(S_I)/(|S_I|(|S_I| − 1)/2) ≤ (1 + ε²/32)·‖q‖₂²

with probability at least 1 − O(log⁻³ n), provided that |S_I| = Ω(ε⁻⁴ √|I| log log n) (where S_I
denotes the multiset of samples falling in I).

By setting I = [n], we get that coll(S)/(|S|(|S| − 1)/2) is with high probability a
(1 ± ε²/32)-multiplicative estimate of ‖p‖₂², and thus (since ‖p − U‖₂² = ‖p‖₂² − 1/n) an
estimate of ‖p − U‖₂² up to an additive (ε²/32)·‖p‖₂².
We are now ready to check that the algorithm which accepts if and only if this estimate of
‖p − U‖₂² is at most ε²/(4n) has the desired properties:

* If ‖p − U‖₁ = 0, then ‖p − U‖₂ = 0, so with probability at least 2/3 the estimate is at
most ε²/(4n) and the algorithm returns ACCEPT. This property of the algorithm is
called completeness.

* If ‖p − U‖₁ > ε, then ‖p − U‖₂ > ε/√n (since ‖x‖₁ ≤ √n·‖x‖₂), hence ‖p − U‖₂² > ε²/n, so with
probability at least 2/3 the estimate exceeds ε²/(4n) and the algorithm returns REJECT. This property of the
algorithm is called soundness.
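To make the discussion concrete, here is a short Python sketch of this collision-based uniformity test; the acceptance threshold (1 + ε²/4)/n on the ‖p‖₂² estimate is equivalent to the ε²/(4n) cutoff on ‖p − U‖₂² discussed above, and the function and parameter names are ours.

```python
import random
from itertools import combinations

def collision_uniformity_test(sample, n, eps):
    """Accept (True) if the collision statistic of `sample` is consistent with
    the uniform distribution over [n] = {1, ..., n}, reject (False) otherwise.

    The statistic coll(S) / (s(s-1)/2) estimates ||p||_2^2, which equals 1/n
    exactly when p is uniform; the cutoff below mirrors the text's threshold.
    """
    s = len(sample)
    # Count colliding pairs (pairs of identical samples).
    collisions = sum(1 for x, y in combinations(sample, 2) if x == y)
    estimate = collisions / (s * (s - 1) / 2)     # estimates ||p||_2^2
    threshold = (1 + eps ** 2 / 4) / n            # same as ||p-U||_2^2 <= eps^2/(4n)
    return estimate <= threshold

# Usage: draw a sublinear number of samples from the unknown distribution p.
n, eps = 10_000, 0.25
uniform_sample = [random.randrange(1, n + 1) for _ in range(2000)]
print(collision_uniformity_test(uniform_sample, n, eps))
```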
In fact, the algorithm has even stronger guarantees, in the sense that it will also return
ACCEPT when the distribution is sufficiently close to being uniform. This is actually the
case whenever a distance approximation algorithm for a given property exists, as stated below:

Theorem 1.1.2. If there exists an algorithm that estimates the distance to a property P up
to additive error ε, then there exists an algorithm that is able to distinguish between d(p, P) ≤ ε′
and d(p, P) > ε′ + ε.
This algorithm is called a tolerant tester for P and will be defined formally in Section 2.1.1.
Lower bound
We now turn to showing that the algorithm described above is almost
optimal in its dependence on n, by showing an Ω(√n) lower bound. (The logarithmic factor in
Lemma 1.1.1 goes away if we only ask for constant success probability.) We observe that the
property of uniformity that we are interested in falls under the following definition:

Definition 1.1.3. A property P is called symmetric if it is closed under permutations
of the domain. More formally, let π : [n] → [n] be a permutation on n elements. Then
(p₁, p₂, ..., p_n) ∈ P ⟹ (p_{π(1)}, p_{π(2)}, ..., p_{π(n)}) ∈ P.
When we want to test a symmetric property, the specific labels of the domain elements
that we see in the sample are not important. The following definition captures exactly the statistic
of the set of samples that contains all the information we might need from it.

Definition 1.1.4. Given a multiset of samples S of size s from a distribution p, we define
the fingerprint of the sample as follows: let C_i, for 0 ≤ i ≤ s, be the number of elements that
appear exactly i times in S. The multiset of values {C_i}_{0≤i≤s} is called the fingerprint of the
set S.
Lemma 1.1.5. Let P be a symmetric property. If there exists a property tester for P that
uses a multiset of samples S, then there exists a property tester that gets as input only the
fingerprint of S.
Proof. Given that there is a property testing algorithm A that tests a symmetric property
P, we are going to build an algorithm A′ that only takes the fingerprint of the sample S as
input and also tests for P with the same guarantees.
Our algorithm will read the input fingerprint, construct an "artificial" sample from it,
and feed it to the algorithm A. More specifically, the algorithm A′ is defined as follows:

* Read the fingerprint C₀, C₁, ..., C_s.
* cursor := 1.
* For i = 1 to s do:
  - read C_i and put the next C_i elements (starting from the "cursor") in the sample
    S′, i times each (in increasing order);
  - cursor := cursor + C_i.
* Run the algorithm A using the "artificial" sample S′ and give the same output.
Example: if the fingerprint is (C₀, C₁, ..., C_s) = (2, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0) and D = {1, ..., 8}, then
the algorithm creates the artificial sample S′ = (1, 2, 3, 4, 4, 5, 5, 6, 6, 6).
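The conversion from a fingerprint to an "artificial" sample used by A′ is mechanical; the following Python sketch (ours) implements it and reproduces the example above.

```python
def artificial_sample_from_fingerprint(fingerprint):
    """Given a fingerprint (C_0, C_1, ..., C_s), where C_i is the number of
    domain elements appearing exactly i times, build a sample realizing it,
    labelling the elements 1, 2, 3, ... in increasing order."""
    sample = []
    cursor = 1                     # next unused domain element
    for i, c_i in enumerate(fingerprint):
        if i == 0:
            continue               # elements appearing 0 times contribute nothing
        for _ in range(c_i):       # C_i fresh elements, each repeated i times
            sample.extend([cursor] * i)
            cursor += 1
    return sample

# The example from the text: fingerprint (2, 3, 2, 1, 0, ..., 0) over D = {1..8}.
print(artificial_sample_from_fingerprint([2, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0]))
# -> [1, 2, 3, 4, 4, 5, 5, 6, 6, 6]
```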
We now have to argue that whatever the actual sample S might have been, if we feed the
algorithm A with S′ instead, we will get the same output distribution. Indeed, since P is
symmetric, the fact that the actual labels of the elements might have been different does not
affect the output distribution of the algorithm (i.e., if in fact the number '7' had appeared 3 times
in our example instead of the number '6', this would not alter the output of the algorithm). Also,
due to the fact that we take independent samples from the distribution, any permutation
of the order in which the samples are taken has the same probability as all the other
permutations. So, the particular order of the samples does not give the algorithm any extra
information about the distribution. This is why the output distribution of A is not affected
if we feed it with the "artificial" sample S′ instead of the original sample S. □
Theorem 1.1.6. Any property tester for uniformity has sample complexity Ω(√n).

Proof. Since uniformity is a symmetric property, the above lemma applies. We are now ready
to show that for every possible ε, there exists a distribution p with ‖p − U_D‖₁ > ε which is
indistinguishable from U_D with probability 1 − o(1) if we use o(√n) samples. For that, it suffices
to show that both p and U_D produce exactly the same fingerprint with probability 1 − o(1).
If that happens, then the algorithm will have to output PASS on that fingerprint with probability
a and FAIL with probability 1 − a, for some a ∈ [0, 1]. Since this happens with probability
1 − o(1), the overall probability of "passing" the distribution U_D will be (1 − o(1))·a + o(1) = a ± o(1),
and the probability of failing p will be (1 − o(1))·(1 − a) + o(1) = 1 − a ± o(1). The algorithm
is required to PASS U_D with probability at least 2/3 and FAIL p with probability at least
2/3, which means that (a ± o(1)) + (1 − a ± o(1)) ≥ 4/3, i.e., 1 ± o(1) ≥ 4/3. Contradiction!
So, it remains to show that if we use o(√n) samples, then both p and U_D produce the
same fingerprint with probability 1 − o(1), and this will imply the desired lower bound.
Consider the following family of distributions p_c (where c ∈ (0, 1] is a parameter):

* For a uniformly random subset B ⊆ [n] with |B| = c·n: p_c(i) = 1/(c·n) for i ∈ B;
* Otherwise, p_c(i) = 0.

We can now calculate the distance between p_c and U_D:

‖p_c − U_D‖₁ = c·n·(1/(c·n) − 1/n) + (1 − c)·n·(1/n) = 2·(1 − c).

So,

‖p_c − U_D‖₁ = 2·(1 − c) ≥ ε  ⟺  c ≤ 1 − ε/2.

So, for any constant ε, we can choose a constant c which satisfies the above inequality
and makes the distribution p_c ε-far from U_D.
Let s = o(√n) be the number of samples that we take. In order to find the most likely
fingerprint for both distributions, we are going to use a birthday paradox argument.
So, we want to estimate the probability of collisions (i.e., the probability of getting at least
a pair of samples with the same label among the s samples we take from each distribution).
The probability that we see a particular element at least twice when we sample from p_c is
the following:

Pr[element 'i' appears at least 2 times] ≤ (s(s − 1)/2)·(1/(c·n))².

So, by a union bound over the c·n elements of the support, we get that

Pr[∃ i : element 'i' appears at least 2 times] ≤ c·n·(s(s − 1)/2)·(1/(c·n))² = s(s − 1)/(2·c·n) = o(1),

since s = o(√n).
So, for every distribution p_c, c ∈ (0, 1], all the s = o(√n) samples are distinct with
probability at least 1 − o(1). Note that U_D = p₁, so the same holds for the uniform distribution
as well. Also, the fact that the subset B is selected uniformly at random ensures that the
samples have no bias in favour of some element of [n]. Thus, any distribution p_c
in the family we have described has the fingerprint C₀ = n − s, C₁ = s, and C_i = 0 for all i > 1, with
probability 1 − o(1). This makes any p_c indistinguishable from p₁ = U_D, which finishes
the proof. □
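The birthday-paradox phenomenon at the heart of this proof is easy to observe empirically: with s much smaller than √n, both the uniform distribution and a far-from-uniform p_c almost always yield the all-distinct fingerprint. The sketch below (our code, illustrative parameters) checks this.

```python
import random
from collections import Counter

def fingerprint(sample, n):
    """Return (C_0, C_1, ..., C_s) for a sample from a distribution over [n]."""
    counts = Counter(sample)
    occ = Counter(counts.values())
    s = len(sample)
    return tuple([n - len(counts)] + [occ.get(i, 0) for i in range(1, s + 1)])

n, c, s = 1_000_000, 0.5, 100          # s is much smaller than sqrt(n)
support = random.sample(range(1, n + 1), int(c * n))   # random B with |B| = c*n

uniform_sample = [random.randrange(1, n + 1) for _ in range(s)]
pc_sample = [random.choice(support) for _ in range(s)]

# With high probability both fingerprints equal (n - s, s, 0, ..., 0).
print(fingerprint(uniform_sample, n) == fingerprint(pc_sample, n))
```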
Remark: The gap in the dependence on ε for this problem was closed by Paninski in [27],
where the sample complexity of uniformity testing was shown to be Θ(√n/ε²).
In [19], Diakonikolas et al. used a different approach, based on a modified χ²-test, to prove
the following slightly more general result:
Lemma 1.1.7 (Adapted from [19, Theorem 11]). There exists an algorithm CHECK-SMALL-ℓ²
which, given parameters ε, δ ∈ (0, 1) and C·√|I|/ε² · log(1/δ) independent samples from
a distribution D over I (for some absolute constant C > 0), outputs either yes or no, and
satisfies the following.

* If ‖D − U_I‖₂ > ε/√|I|, then the algorithm outputs no with probability at least 1 − δ;
* If ‖D − U_I‖₂ ≤ ε/(2√|I|), then the algorithm outputs yes with probability at least 1 − δ.
Results on other properties
Over the last couple of decades, a number of properties of distributions have been examined,
including identity, closeness, independence, monotonicity and k-modality. We will mention
here some results about the above properties. The interested reader may also look at [10] for
a more detailed survey of the field of distribution property testing.
We will now give the definitions and notable results for those properties:

* Testing Identity: Given an explicit description of a distribution p and samples from
an unknown distribution q, test if q = p.
* Testing Closeness: Given samples from two unknown distributions p, q, test whether
p = q or ‖p − q‖₁ > ε.
* Testing Independence: Given samples from a joint distribution over [n] × [m], test
whether the two random variables are independent or there is no independent joint
distribution that is ε-close to the input distribution.
* Testing Monotonicity: Given samples from a distribution p, test if p₁ ≥ p₂ ≥ ... ≥ p_n.
* Testing k-modality: Given samples from a distribution p, test if p has at most k modes.

Some of the known results for the above properties are summarized in the table below:
Some of the known results for the above properties are summarized in the table below:
Class
Mnotone
O()
[9],
Upperbound
() (Corolhary 3.3.4)
O(Y
Identity
) ([35])
,
)) ([14])
Closeness
O(4nax(
t-modal
6t -) (Corollary 3.3.5)
6 n2 / 3M 1/ 3 ) poly(!) ( 7])
Independence
Table 1.1:
1.2
1.2.1
(-7)
Lowerbound
[9], Q(-n-) (Corollary 3.5.2)
Q(n) (uniformity lower bound: [27])
Q(max(' , V )) ([14])
Q(-n) (Corollary 3.5.2)
Q(n2 / 3 m'/ 3 )
(7])
Summary of distribution testing results.
Motivation and related work
Testing properties of probability distributions in a unified
way.
Inferring information about the probability distribution that underlies a data sample is an
essential question in Statistics, and one that has ramifications in every field of the natural
sciences and quantitative research. In many situations, it is natural to assume that this data
exhibits some simple structure because of known properties of the origin of the data, and in
fact these assumptions are crucial in making the problem tractable.
As a result, the problem of deciding whether a distribution possesses such a structural
property has been widely investigated both in theory and practice, in the context of shape
restricted inference [5, 32]. Here, it is guaranteed or thought that the unknown distribution
satisfies a shape constraint, such as having a monotone or log-concave probability density
function [31, 4, 37]. From a different perspective, a recent line of work in Theoretical Computer
Science, originating from the papers of Batu et al. [8, 7, 21] has also been tackling similar
questions in the setting of property testing (see [28, 29, 30] for surveys on this field). This very
active area has seen a spate of results and breakthroughs over the past decade, culminating
in very efficient (both sample and time-wise) algorithms for a wide range of distribution
testing problems [6, 22, 2, 17, 14, 1, 19]. In many cases, this led to a tight characterization
of the number of samples required for these tasks as well as the development of new tools
and techniques, drawing connections to learning and information theory [33, 34, 35].
In this thesis, we will consider the following property testing problem: given a class
(property) of distributions P and sample access to an arbitrary distribution D, one must
distinguish between the case that (a) D ∈ P, versus (b) ‖D − D′‖₁ > ε for all D′ ∈ P
(i.e., D is either in the class, or far from it). While many of the previous works have
focused on the testing of specific properties of distributions or obtained algorithms and lower
bounds on a case-by-case basis, an emerging trend in distribution testing is to design general
frameworks that can be applied to several distribution property testing problems [36, 19].
This direction, the testing analog of a similar movement in distribution learning [11, 13, 12],
aims at abstracting the minimal assumptions that are shared by a large variety of problems,
and giving algorithms that can be used for any of these problems.
One goal of this work is to provide a unified framework for the question of testing various
properties of probability distributions.
Batu et al. [9] initiated the study of efficient property testers for monotonicity and obtained
(nearly) matching upper and lower bounds for this problem; while [1] later considered testing
the class of Poisson Binomial Distributions, and settled the complexity of this problem
(up to the precise dependence on ε).
Another body of work by [6], [9] and [17] shows
how assumptions on the shape of the distributions can lead to significantly more efficient
algorithms. They describe such improvements in the case of identity and closeness testing as
well as for entropy estimation, under monotonicity or k-modality constraints. Specifically,
Batu et al. show in [9] how to obtain an O(log³ n/ε³)-sample tester for closeness in this setting,
in stark contrast to the Ω(n^{2/3}) general lower bound. Diakonikolas et al. [17] later gave
O(√log n)- and O((log n)^{2/3})-sample testing algorithms for testing respectively identity and
closeness of monotone distributions, and obtained similar results for k-modal distributions.
Except for these two cases of monotonicity and PBDs, nothing is known about testing the
properties we study.
Moreover, for the specific problem of identity testing (i.e., given the explicit description of
a distribution D* and sample access to an unknown distribution D, to decide whether D is
equal to D* or far from it), a recent result of [19] describes a general algorithm which applies
to a large range of shape or structural constraints, and yields optimal identity testers for
classes of distributions that satisfy them. We observe that while the question they answer
can be cast as a specialized instance of membership testing, our results are incomparable to
them, both because of the distinction above (testing with versus testing for structure) and as
the structural assumptions they rely on are fundamentally different from ours.
One idea about how to tackle this problem is inspired by Batu et al. [9], who show
that monotone distributions can be well-approximated by piecewise constant densities on
a suitable partition of the domain, and leverage this fact to reduce monotonicity testing
to uniformity testing on each interval of this partition. While their argument is tailored
specifically to the setting of monotonicity, we will try to generalize this idea to other
classes of probability distributions, such as k-modal, log-concave, monotone hazard rate (MHR),
and Poisson Binomial distributions (PBDs). We will also consider mixtures of the above
distributions.
Chapter 2
Preliminaries
2.1
Definitions
Hereafter, we write [n] for the set {1, ..., n}, and log for the logarithm in base 2. A
probability distribution over a finite domain X is a non-negative function D : X → [0, 1] such
that Σ_{x∈X} D(x) = 1; we denote by U_X the uniform distribution on X. Moreover, given a
distribution D over X and a set S ⊆ X, we write D(S) for the total probability weight
Σ_{x∈S} D(x) assigned to S by D.
2.1.1
Testing and learning
We first define precisely the notions of testing, tolerant testing, learning and proper learning
of distributions over a domain [n].
Definition 2.1.1 (Testing). Fix any property P of distributions, and let ORACLE_D be an
oracle providing some type of access to D. A q-sample testing algorithm for P is a randomized
algorithm T which takes as input n, ε ∈ (0, 1], as well as access to ORACLE_D. After making
at most q(ε, n) calls to the oracle, T outputs either ACCEPT or REJECT, such that the
following holds:

* if D ∈ P, T outputs ACCEPT with probability at least 2/3;
* if dTV(D, P) ≥ ε, T outputs REJECT with probability at least 2/3.
We shall also be interested in tolerant testers, i.e., algorithms robust to a relaxation of
the first item above:

Definition 2.1.2 (Tolerant testing). Fix property P and ORACLE_D as above. A q-sample
tolerant testing algorithm for P is a randomized algorithm T which takes as input n, 0 <
ε₁ < ε₂ ≤ 1, as well as access to ORACLE_D. After making at most q(ε₁, ε₂, n) calls to the
oracle, T outputs either ACCEPT or REJECT, such that the following holds:

* if dTV(D, P) ≤ ε₁, T outputs ACCEPT with probability at least 2/3;
* if dTV(D, P) ≥ ε₂, T outputs REJECT with probability at least 2/3.
Another class of algorithms we consider is that of (proper) learners. We give the precise
definition below:

Definition 2.1.3 (Learning). Let C be a class of probability distributions and D ∈ C be an
unknown distribution. Let also H be a hypothesis class of distributions. A q-sample learning
algorithm for C is a randomized algorithm L which takes as input n, ε, δ ∈ (0, 1), as well
as access to SAMP_D, and outputs the description of a distribution D̂ ∈ H such that with
probability at least 1 − δ one has dTV(D, D̂) ≤ ε.
If in addition H ⊆ C, then we say L is a proper learning algorithm.
The exact formalization of what learning a probability distribution means has been
considered in Kearns et al. [24].
A more general definition of a type of learning algorithm follows:

Definition 2.1.4 (Agnostic learning). Let C and H be two classes of probability distributions
over X. A (semi-)agnostic learner for C (using hypothesis class H) is a learning algorithm
A which, given sample access to an arbitrary distribution D and parameter ε, outputs a
hypothesis D̂ ∈ H such that, with high probability, D̂ does "almost as well as the best
approximation from C":

dTV(D, D̂) ≤ c · OPT_{C,D} + O(ε),

where OPT_{C,D} := inf_{D_C ∈ C} dTV(D_C, D) and c ≥ 1 is some absolute constant (if c = 1, the
learner is said to be agnostic).

Note that a (semi-)agnostic learner for C is in particular a learner for C: run on a distribution
D ∈ C, it achieves error O(ε), since OPT_{C,D} = 0.
2.1.2
Properties of interest
We give here the formal descriptions of the classes of distributions involved in this work.
Recall that a distribution D over [n] is monotone (non-increasing) if its probability mass
function (pmf) satisfies D(1) ≥ D(2) ≥ ... ≥ D(n). A natural generalization of the class M of
monotone distributions is the set of t-modal distributions, i.e., distributions whose pmf can
go "up and down" or "down and up" up to t times:

Definition 2.1.5 (t-modal). Fix any distribution D over [n], and integer t. D is said to have
t modes if there exists a sequence i₀ < ... < i_{t+1} such that either (−1)^j D(i_j) < (−1)^j D(i_{j+1})
for all 0 ≤ j ≤ t, or (−1)^j D(i_j) > (−1)^j D(i_{j+1}) for all 0 ≤ j ≤ t. We call D t-modal if
it has at most t modes, and write M_t for the class of all t-modal distributions (omitting
the dependence on n). The particular case of t = 1 corresponds to the set M₁ of unimodal
distributions.
Definition 2.1.6 (Log-concave). A distribution D over [n] is said to be log-concave if it
satisfies the following conditions: (i) for any 1 ≤ i < j < k ≤ n such that D(i)D(k) > 0,
D(j) > 0; and (ii) for all 1 < k < n, D(k)² ≥ D(k − 1)D(k + 1). We write L for the class of
all log-concave distributions (omitting the dependence on n).
Definition 2.1.7 (Concave and Convex). A distribution D over [n] is said to be concave if
it satisfies the following conditions: (i) for any 1 ≤ i < j < k ≤ n such that D(i)D(k) > 0,
D(j) > 0; and (ii) for all 1 < k < n such that D(k − 1)D(k + 1) > 0, 2D(k) ≥ D(k − 1) +
D(k + 1); it is convex if the reverse inequality holds in (ii). We write K⁻ (resp. K⁺) for the
class of all concave (resp. convex) distributions (omitting the dependence on n).
It is not hard to see that convex and concave distributions are unimodal; moreover, every
concave distribution is also log-concave, i.e., K⁻ ⊆ L. Note that in both Definitions 2.1.6
and 2.1.7, condition (i) is equivalent to enforcing that the distribution be supported on an
interval.
Definition 2.1.8 (Monotone Hazard Rate). A distribution D over [n] is said to have
monotone hazard rate (MHR) if its hazard rate H(i) := D(i) / Σ_{j≥i} D(j) is a non-decreasing function.
We write MHR for the class of all MHR distributions (omitting the dependence on n).
It is known that every log-concave distribution is both unimodal and MHR (see e.g. [3,
Proposition 10]), and that monotone distributions are MHR. Finally, we recall the definition
of the two following classes, which both extend the family of Binomial distributions BIN_n:
the first, by removing the need for each of the independent Bernoulli summands to share the
same bias parameter.

Definition 2.1.9. A random variable X is said to follow a Poisson Binomial Distribution
(with parameter n ∈ N) if it can be written as X = Σ_{k=1}^n X_k, where X₁, ..., X_n are independent,
not necessarily identically distributed Bernoulli random variables. We denote by PBD_n the
class of all such Poisson Binomial Distributions.
One can generalize even further, by allowing each random variable of the summation to be
integer-valued:

Definition 2.1.10. Fix any k > 0. We say a random variable X is a k-Sum of Independent
Integer Random Variables (k-SIIRV) with parameter n ∈ N if it can be written as X =
Σ_{j=1}^n X_j, where X₁, ..., X_n are independent, not necessarily identically distributed random
variables taking values in {0, 1, ..., k − 1}. We denote by k-SIIRV_n the class of all such
k-SIIRVs.
Chapter 3
Testing distributions in a unified way

At the core of our approach is an idea of Batu et al. [9], who show that monotone distributions
can be well-approximated by piecewise constant densities on a suitable partition of the domain,
and leverage this fact to reduce monotonicity testing to uniformity testing on each interval of
this partition. While their argument is tailored specifically to the setting of monotonicity,
we are able to abstract the key ingredients, and generalize this idea to many classes
of probability distributions. In more detail, we provide a generic testing algorithm which
applies indifferently to any class of distributions which admit succinct decompositions, that
is, which are well-approximated (in a strong ℓ₂ sense) by piecewise constant densities on a
small number of intervals (we hereafter refer to this approximation property as (Succinctness),
and extend the notation to apply to any class C of distributions for which all D ∈ C satisfy
this property). Crucially, the algorithm does not care about how these decompositions can
be obtained: for the purpose of testing these structural properties we only need to establish
their existence.
3.1
Structural results
We start by defining formally the "structural criterion" we shall rely on, before describing
the algorithm at the heart of our result in Section 3.3.1. (We note that a modification of this
algorithm will be described in Section 3.4, and will allow us to derive Corollary 3.3.6.)
Definition 3.1.1 (Decompositions). Let γ > 0 and L = L(γ, n) ≥ 1. A class of distributions
C on [n] is said to be (γ, L)-decomposable if for every D ∈ C there exists ℓ ≤ L and a partition
I(γ, D) = (I₁, ..., I_ℓ) of [n] such that, for all j ∈ [ℓ], one of the following holds:

(i) D(I_j) ≤ γ/L; or
(ii) max_{i∈I_j} D(i) ≤ (1 + γ) · min_{i∈I_j} D(i).

Further, if I(γ, D) is dyadic (i.e., each I_k is of the form [j·2^i + 1, (j + 1)·2^i] for some
integers i, j, corresponding to the leaves of a recursive bisection of [n]), then C is said to be
(γ, L)-splittable.
Lemma 3.1.2. If C is (γ, L)-decomposable, then it is (γ, O(L log n))-splittable.

Proof. We will begin by proving the claim that for every partition I = {I₁, I₂, ..., I_L} of the
interval [1, n] into L intervals, there exists a refinement of that partition which consists of
at most O(L log n) dyadic intervals. So, it suffices to prove that every interval [a, b] ⊆ [1, n]
can be partitioned into at most O(log n) dyadic intervals. Indeed, let ℓ be the largest integer
such that 2^ℓ ≤ (b − a + 1)/2, and let m be the smallest integer such that m·2^ℓ ≥ a. It follows that
m·2^ℓ < a + 2^ℓ, and hence (m + 1)·2^ℓ < a + 2^{ℓ+1} ≤ b + 1, i.e., (m + 1)·2^ℓ ≤ b. So, the interval
I = [m·2^ℓ + 1, (m + 1)·2^ℓ] is fully contained in [a, b] and has size at least (b − a + 1)/4.
In the following, we will also use the following fact:

m·2^ℓ = (m·2^{ℓ−ℓ′})·2^{ℓ′}     (3.1)

for every ℓ′ ≤ ℓ, i.e., the endpoints of I are also endpoints of dyadic intervals at every finer scale.
Now consider the following procedure: starting from the right (resp. left) side of the interval
I, we add the largest dyadic interval which is adjacent to it and fully contained in [a, b], and recurse
until we cover the whole interval [(m + 1)·2^ℓ + 1, b] (resp. [a, m·2^ℓ]). Clearly, at the end of
this procedure, the whole interval [a, b] is covered by dyadic intervals. It remains to show that
the procedure takes O(log n) steps. Indeed, using Equation (3.1), we can see that at least half
of the remaining left or right interval is covered in each step (except maybe for the first 2 steps,
where it is at least a quarter). Thus, the procedure will take at most 2 log n + 2 = O(log n)
steps in total. From the above, we can see that each of the L intervals of the partition I can
be covered with O(log n) dyadic intervals, which completes the proof of the claim.
In order to complete the proof of the lemma, notice that the two conditions in Definition 3.1.1
are closed under taking subsets. □
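The covering argument above translates directly into a short procedure; the following sketch (our own greedy implementation, not taken from [9]) covers an arbitrary interval [a, b] ⊆ [1, n] with O(log n) disjoint dyadic intervals.

```python
def dyadic_cover(a, b):
    """Partition the integer interval [a, b] (1-based, inclusive) into dyadic
    intervals of the form [j*2**i + 1, (j+1)*2**i], greedily taking the largest
    dyadic block that starts at the current left endpoint and fits in [a, b]."""
    cover = []
    left = a
    while left <= b:
        size = 1
        # Grow the block while it stays dyadic-aligned ((left-1) divisible by
        # the doubled size) and still fits inside [a, b].
        while (left - 1) % (2 * size) == 0 and left + 2 * size - 1 <= b:
            size *= 2
        cover.append((left, left + size - 1))
        left += size
    return cover

print(dyadic_cover(5, 21))
# -> [(5, 8), (9, 16), (17, 20), (21, 21)]
```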
Extension to mixtures. Finally, it is immediate to see that if D₁, ..., D_k are (γ, L)-
decomposable, then any mixture α₁D₁ + ... + α_kD_k is (γ, kL)-decomposable.
In the following, we show that many "natural" classes of distributions are (succinctly)
decomposable.
Theorem 3.1.3 (Monotonicity). For all γ > 0, the class M of monotone distributions on
[n] is (γ, L)-splittable for L := O(log² n / γ).
Note that this proof can already be found in [9, Theorem 10], interwoven with the analysis
of their algorithm. For the sake of being self-contained, we reproduce the structural part of
their argument, removing its algorithmic aspects:
Proof of Theorem 3.1.3. We define the partition I recursively as follows: I^(0) = ([1, n]), and for j ≥ 0
the partition I^(j+1) is obtained from I^(j) = (I_1^(j), ..., I_{ℓ_j}^(j)) by going over the I_i^(j) = [a_i^(j), b_i^(j)]
in order, and:

(a) if D(I_i^(j)) ≤ γ/L, then I_i^(j) is added as an element of I^(j+1) ("marked as leaf");
(b) else, if D(a_i^(j)) ≤ (1 + γ)·D(b_i^(j)), then I_i^(j) is added as an element of I^(j+1) ("marked as
leaf");
(c) otherwise, bisect I_i^(j) into I_L^(j), I_R^(j) (with |I_L^(j)| = ⌈|I_i^(j)|/2⌉) and add both I_L^(j) and I_R^(j) as
elements of I^(j+1),
and repeat until convergence (that is, whenever the last item is not applied for any of the
intervals). Clearly, this process is well-defined, and will eventually terminate (as (ℓ_j)_j is a
non-decreasing sequence of natural numbers, upper bounded by n). Let I = (I₁, ..., I_ℓ) (with
I_i = [a_i, a_{i+1})) be its outcome, so that the I_i's are consecutive intervals all satisfying either
(a) or (b). As (b) clearly implies (ii), we only need to show that ℓ ≤ L; for this purpose, we
shall leverage as in [9] the fact that D is monotone to bound the number of recursion steps.
The recursion above defines a complete binary tree (with the leaves being the intervals
satisfying (a) or (b), and the internal nodes the other ones). Let t be the number of recursion
steps the process goes through before converging to I (the height of the tree); as mentioned
above, we have t ≤ log n (as we start with an interval of size n, and the length is halved at
each step). Observe further that if at any point an interval I_i^(j) = [a_i^(j), b_i^(j)] has D(a_i^(j)) ≤ γ/(nL),
then it immediately satisfies (a) (as well as all the I_k^(j)'s for k > i, by monotonicity) and is no
longer split ("becomes a leaf"). So at any depth j ≤ t, the number i_j of intervals for which neither
(a) nor (b) holds must satisfy

1 ≥ D(a_1^(j)) > (1 + γ)·D(a_2^(j)) > (1 + γ)²·D(a_3^(j)) > ... > (1 + γ)^{i_j − 1}·D(a_{i_j}^(j)) > (1 + γ)^{i_j − 1} · γ/(nL),

where a_k^(j) denotes the beginning of the k-th such interval (again we use monotonicity to argue that
the extrema are reached at the endpoints of each interval), so that

i_j ≤ 1 + log(nL/γ) / log(1 + γ).

In particular, the total number of internal nodes is at most

Σ_{j≤t} i_j ≤ t · (1 + log(nL/γ) / log(1 + γ)) = O( (log n · log(nL/γ)) / γ ).

This implies the same bound (up to an additive one) on the number of leaves, so that ℓ ≤ L.

Observe that if j satisfies the second condition (ii), then ‖D_{I_j} − U_{I_j}‖₁ ≤ γ. Thus, the ℓ₁
distance between a monotone D and its flattening Φ(D, I(D, γ)) is at most

‖D − Φ(D, I(D, γ))‖₁ = Σ_j D(I_j)·‖D_{I_j} − U_{I_j}‖₁ ≤ Σ_{j: D(I_j) ≤ γ/L} 2·D(I_j) + Σ_{j: D(I_j) > γ/L} γ·D(I_j) ≤ L·(2γ/L) + γ = 3γ. □
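For an explicitly given monotone pmf, the recursive partition from this proof can be computed directly; the sketch below mirrors rules (a)-(c) above (the parameters and implementation details are ours, for illustration only).

```python
def split_monotone(D, gamma, L):
    """Given a monotone non-increasing pmf D over [n] (a list, D[i] = D(i+1)),
    return the partition produced by rules (a)-(c): an interval becomes a leaf
    if it has mass at most gamma/L, or if it is multiplicatively flat (its two
    endpoints are within a (1 + gamma) factor); otherwise it is bisected."""
    leaves = []

    def recurse(lo, hi):                       # interval [lo, hi], 0-based
        mass = sum(D[lo:hi + 1])
        if mass <= gamma / L:                  # rule (a): light interval
            leaves.append((lo, hi))
        elif D[lo] <= (1 + gamma) * D[hi]:     # rule (b): flat interval
            leaves.append((lo, hi))
        else:                                  # rule (c): bisect and recurse
            mid = (lo + hi) // 2
            recurse(lo, mid)
            recurse(mid + 1, hi)

    recurse(0, len(D) - 1)
    return leaves

# A geometric (hence monotone) toy distribution on [64]:
n = 64
w = [0.9 ** i for i in range(n)]
D = [x / sum(w) for x in w]
print(len(split_monotone(D, gamma=0.2, L=50)), "intervals")
```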
Corollary 3.1.4 (Unimodality). For all γ > 0, the class M₁ of unimodal distributions on
[n] is (γ, L)-decomposable for L := O(log² n / γ).

Proof. For any D ∈ M₁, [n] can be partitioned into two intervals I, J such that D_I, D_J are
either monotone non-increasing or non-decreasing. Applying Theorem 3.1.3 to D_I and D_J
and taking the union of both partitions yields a (no longer necessarily dyadic) partition of
[n]. □
The same argument yields an analogous statement for t-modal distributions:

Corollary 3.1.5 (t-modality). For any t ≥ 1 and all γ > 0, the class M_t of t-modal
distributions on [n] is (γ, L)-decomposable for L := O((t + 1)·log² n / γ).
Corollary 3.1.6 (Log-concavity, concavity and convexity). For all γ > 0, the classes L, K⁻
and K⁺ of log-concave, concave and convex distributions on [n] are (γ, L)-decomposable for
L := O(log² n / γ).

Proof. This is directly implied by Corollary 3.1.4, recalling that log-concave, concave and
convex distributions are unimodal. □
Theorem 3.1.7 (Monotone Hazard Rate). For all γ > 0, the class MHR of MHR distributions
on [n] is (γ, L)-decomposable for L := O(log(n/γ)/γ).

Proof. This follows from adapting the proof of [11], which establishes that every MHR distribution
can be approximated in ℓ₁ distance by an O(log(n/ε)/ε)-histogram. For completeness,
we reproduce their argument, suitably modified to our purposes, in Section B.2. □
Theorem 3.1.8 (k-histogram). The class of k-histogram distributions on [n] is (0, k)-decomposable.

Proof. It is immediate from the definition. □
3.2
Computing the distances
This section contains details of the distance estimation procedures for these classes, required
in the last stage of Algorithm 1.
Lemma 3.2.1 (Monotonicity [9, Lemma 8]). There exists a procedure ESTIMATEDISTANCE_M
that, on input n as well as the full (succinct) specification of an ℓ-histogram D on [n], computes
the distance ℓ₁(D, M) in time poly(ℓ).
A straightforward modification of the algorithm above (e.g., by adapting the underlying
linear program to take as input the location m ∈ [ℓ] of the mode of the distribution, then
trying all ℓ possibilities, running the subroutine ℓ times and picking the minimum value)
results in a similar claim for unimodal distributions:
Lemma 3.2.2 (Unimodality). There exists a procedure ESTIMATEDISTANCE_{M₁} that, on
input n as well as the full (succinct) specification of an ℓ-histogram D on [n], computes the
distance ℓ₁(D, M₁) in time poly(ℓ).
Lemma 3.2.3 (k-histogram). There exists a procedure ESTIMATEDISTANCE_{H_k} that, on input
n as well as the full (succinct) specification of an ℓ-histogram D on [n], computes the distance
ℓ₁ from D to the class of k-histograms, in time n^{O(k)}·poly(ℓ).

The latter can be done by running O(n^k) linear programs (i.e., one for each possible partition
of the domain [n] into k segments) and picking the smallest value at the end.
Similar statements hold for the classes of concave and convex distributions K⁺, K⁻, also
based on linear programming (specifically, on running O(n²) different linear programs, one
for each possible support [a, b] ⊆ [n], and taking the minimum over them).
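To illustrate the linear-programming approach behind Lemma 3.2.1, the following sketch computes the ℓ₁ distance from an explicit distribution to the class M of monotone non-increasing distributions using scipy; this is a straightforward formulation of ours, not the optimized routine of [9].

```python
import numpy as np
from scipy.optimize import linprog

def l1_distance_to_monotone(D):
    """Minimize sum_i |D_i - q_i| over monotone non-increasing distributions q.

    Variables: q_1..q_n and slacks t_1..t_n with t_i >= |D_i - q_i|; the
    objective is sum_i t_i, subject to q_i >= q_{i+1}, q_i >= 0, sum_i q_i = 1.
    """
    n = len(D)
    c = np.concatenate([np.zeros(n), np.ones(n)])            # minimize sum of t_i

    A_ub, b_ub = [], []
    for i in range(n):
        row = np.zeros(2 * n); row[i] = 1; row[n + i] = -1    #  q_i - t_i <= D_i
        A_ub.append(row); b_ub.append(D[i])
        row = np.zeros(2 * n); row[i] = -1; row[n + i] = -1   # -q_i - t_i <= -D_i
        A_ub.append(row); b_ub.append(-D[i])
    for i in range(n - 1):
        row = np.zeros(2 * n); row[i] = -1; row[i + 1] = 1    # q_{i+1} <= q_i
        A_ub.append(row); b_ub.append(0.0)

    A_eq = [np.concatenate([np.ones(n), np.zeros(n)])]        # sum_i q_i = 1
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0],
                  bounds=[(0, None)] * (2 * n), method="highs")
    return res.fun

D = [0.1, 0.3, 0.2, 0.25, 0.15]           # clearly not monotone
print(round(l1_distance_to_monotone(D), 4))
```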
Lemma 3.2.4 (MHR). There exists a (non-efficient) procedure ESTIMATEDISTANCE_MHR
that, on input n as well as the full specification of a distribution D on [n], computes (an
ε-accurate approximation of) the distance ℓ₁(D, MHR) in time 2^{Õ(n)}.

Lemma 3.2.5 (Log-concavity). There exists a (non-efficient) procedure ESTIMATEDISTANCE_L
that, on input n as well as the full specification of a distribution D on [n], computes (an
ε-accurate approximation of) the distance ℓ₁(D, L) in time 2^{Õ(n)}.

Proof of Lemma 3.2.4 and Lemma 3.2.5. We here give a naive algorithm for these two problems,
based on an exhaustive search over a (huge) ε-cover S of distributions over [n]. Essentially,
S contains all possible distributions whose probabilities p₁, ..., p_n are of the form jε/n, for
j ∈ {0, ..., n/ε} (so that |S| = O((n/ε)^n)). It is not hard to see that this indeed defines an
ε-cover of the set of all distributions, and moreover that it can be computed in time poly(|S|).
To approximate the distance from an explicit distribution D to the class C (either MHR
or L), it is enough to go over every element S of S, checking (this time, efficiently) whether S
satisfies the constraints for C, and if so computing the value ‖S − D‖₁. At the end of this
enumeration, the procedure outputs the minimum value obtained. □
3.3
Algorithm
In this section, we obtain our main result, restated below:
Theorem 3.3.1. There exists an algorithm TESTSPLITTABLE which, given sampling access
to an unknown distribution D over [n] and parameter ε ∈ (0, 1], can distinguish with probability
2/3 between (a) D ∈ P versus (b) ℓ₁(D, P) > ε, for any property P that satisfies the above
natural structural criterion (Succinctness). Moreover, for many such properties this algorithm
is efficient, and its sample complexity is optimal (up to logarithmic factors and the exact
dependence on ε).
3.3.1
The algorithm
Theorem 3.3.1, and with it Corollary 3.3.4 and Corollary 3.3.5 will follow from the theorem
below, combined with the structural theorems from Section 3.1:
Theorem 3.3.2. Let C be a class of distributions over [n] for which the following holds.

1. C is (γ, L)-splittable;
2. there exists a procedure ESTIMATEDISTANCE_C which, given as input the explicit
description of a distribution D over [n], computes its distance ℓ₁(D, C) to C.

Then, the algorithm TESTSPLITTABLE (Algorithm 1) is an O(max(√(nL)·log L/ε³, L²/ε²))-
sample tester for C, for L = L(ε, n). (Moreover, if ESTIMATEDISTANCE_C is efficient, then
so is TESTSPLITTABLE.)
Algorithm 1 TESTSPLITTABLE
Require: Domain I (interval), sample access to D over I; subroutine ESTIMATEDISTANCE_C
Input: Parameters ε and function L_C(·, ·).
1: SETTING UP
2:   Define γ := ε/20, L := L_C(γ, |I|), and κ := c·L/ε for a suitable absolute constant c.
3:   Set m := C·max(√(|I|·L)·log L/ε³, L²/ε²), where C is an absolute constant.
4:
5: DECOMPOSITION
6:   Obtain a sequence s of m independent samples from D.
7:   Let D̂ be the empirical estimate of D they induce.
8:   while D̂(I) ≥ 1/κ and at most L splits have been performed do     ⊳ Implies m_I is large enough.
9:     Run CHECK-SMALL-ℓ² (from Lemma 1.1.7) with parameters ε/8 and δ := 1/(10L), using
       the samples of s belonging to I.
10:    if CHECK-SMALL-ℓ² outputs no then
11:      Bisect I, and recurse on both halves (using the same samples).
12:    end if
13:  end while
14:  if more than L splits have been performed then
15:    return REJECT
16:  else
17:    Let I := (I₁, ..., I_ℓ) be the partition of [n] given by the leaves of the recursion.     ⊳ ℓ ≤ L.
18:  end if
19:
20: APPROXIMATION
21:  Learn the flattening Φ(D, I) of D to ℓ₁ error ε/4 (with failure probability 1/10), using O(ℓ/ε²)
     new samples. Let D̂ be the resulting hypothesis.     ⊳ D̂ is an ℓ-histogram.
22:
23: OFFLINE CHECK
24:  return ACCEPT if and only if ESTIMATEDISTANCE_C(D̂) ≤ ε/2.     ⊳ No sample needed.
3.3.2
Proof of Theorem 3.3.2
We now give the proof of our main result (Theorem 3.3.2), first analyzing the sample
complexity of Algorithm 1 before arguing its correctness. For the latter, we will need a
classical result from probability, the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, restated
below:
Theorem 3.3.3 ([20, 26]). Let D be a distribution over [n]. Given m independent samples
x₁, ..., x_m from D, define the empirical distribution D̂ as follows:

D̂(i) := |{ j ∈ [m] : x_j = i }| / m,   for all i ∈ [n].

Then, for all ε > 0, Pr[ d_K(D, D̂) > ε ] ≤ 2e^{−2mε²}, where d_K(·, ·) denotes the Kolmogorov
distance (i.e., the ℓ∞ distance between cumulative distribution functions).

In particular, this implies that O(1/ε²) samples suffice to learn a distribution up to ε in
Kolmogorov distance.
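For concreteness, the following sketch (ours) builds the empirical distribution from m samples and evaluates its Kolmogorov distance to the true CDF, illustrating the DKW guarantee.

```python
import random

def empirical_pmf(samples, n):
    """Empirical estimate D_hat(i) = #{j : x_j = i} / m over [n]."""
    m = len(samples)
    pmf = [0.0] * (n + 1)
    for x in samples:
        pmf[x] += 1 / m
    return pmf

def kolmogorov_distance(pmf_p, pmf_q):
    """L-infinity distance between the CDFs of two pmfs over [n]."""
    cdf_p = cdf_q = 0.0
    worst = 0.0
    for p_i, q_i in zip(pmf_p[1:], pmf_q[1:]):
        cdf_p += p_i
        cdf_q += q_i
        worst = max(worst, abs(cdf_p - cdf_q))
    return worst

n, m = 1000, 40_000                       # m = O(1/eps^2) for a small eps
true_pmf = [0.0] + [2 * i / (n * (n + 1)) for i in range(1, n + 1)]
samples = random.choices(range(1, n + 1), weights=true_pmf[1:], k=m)
print(kolmogorov_distance(empirical_pmf(samples, n), true_pmf))
```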
3.3.3
Sample complexity.
The sample complexity is immediate, and comes from Steps 6 and 21. The total number of
samples is

m + O(ℓ/ε²) = O(max(√(nL)·log L/ε³, L²/ε²)) + O(L/ε²) = O(max(√(nL)·log L/ε³, L²/ε²)).

3.3.4
Correctness.
Say an interval I is heavy if D(I) ≥ 3/(2κ), and light if D(I) < 1/(2κ). By the choice of m, the DKW
inequality ensures that with probability at least 9/10, for every interval J we consider,
|D̂(J) − D(J)| ≤ 1/(2κ). In particular, every heavy interval I will have D̂(I) ≥ 1/κ, and every
light interval I will have D̂(I) < 1/κ. We hereafter condition on this.
We first argue that if the algorithm does not reject in Step 14, then with probability at
least 9/10 we have ‖D − Φ(D, I)‖₁ ≤ ε/4. Indeed, we can write

‖D − Φ(D, I)‖₁ = Σ_k D(I_k)·‖D_{I_k} − U_{I_k}‖₁
  ≤ Σ_{k: D̂(I_k) < 1/κ} 2·D(I_k) + Σ_{k: D̂(I_k) ≥ 1/κ} D(I_k)·‖D_{I_k} − U_{I_k}‖₁.
Let us bound the two terms separately.
* If D̂(I_k) ≥ 1/κ, then at least m/κ of the samples fell into I_k, and by our choice of
constants we can apply Lemma 1.1.7 with δ := 1/(10L); conditioning on all of the (at most
L) corresponding events happening, which overall fails with probability at most 1/10 by a union
bound, we get

‖D_{I_k}‖₂² ≤ (1 + ε²/64) / |I_k|,

as CHECK-SMALL-ℓ² returned yes; and this implies ‖D_{I_k} − U_{I_k}‖₁ ≤ ε/8.

* If D̂(I_k) < 1/κ, we know that I_k cannot be heavy, and therefore D(I_k) < 3/(2κ); by our
choice of κ, summing over the at most L such intervals gives Σ_{k: D̂(I_k) < 1/κ} 2·D(I_k) ≤ ε/8.

Putting it together, this yields

‖D − Φ(D, I)‖₁ ≤ Σ_{k: D̂(I_k) < 1/κ} 2·D(I_k) + (ε/8)·Σ_{k: D̂(I_k) ≥ 1/κ} D(I_k) ≤ ε/8 + ε/8 = ε/4.
Soundness. By contrapositive, we argue that if the test returns ACCEPT, then (with
probability at least 2/3) D is ε-close to C. Indeed, conditioning on D̂ being (ε/4)-close to
Φ(D, I), we get by the triangle inequality that

ℓ₁(D, C) ≤ ‖D − Φ(D, I)‖₁ + ‖Φ(D, I) − D̂‖₁ + ℓ₁(D̂, C) ≤ ε/4 + ε/4 + ε/2 = ε.

Overall, this happens except with probability at most 1/10 + 1/10 + 1/10 < 1/3.
Completeness. Assume D ∈ C. Then the choice of γ and L ensures the existence of a
good dyadic partition I(γ, D) in the sense of Definition 3.1.1. For any I in this partition
for which (i) holds (D(I) ≤ γ/L), our choice of constants ensures that D̂(I) < 1/κ, so I is kept
as a "light leaf". For the other ones, (ii) holds: let I be one of these (at most L) intervals.

* If D̂(I) < 1/κ, then I is kept as a "light leaf".
* If D̂(I) ≥ 1/κ, then by our choice of constants we can apply Lemma 1.1.7 with
δ := 1/(10L); conditioning on all of the (at most L) corresponding events happening, which overall
fails with probability at most 1/10 by a union bound, CHECK-SMALL-ℓ² will
output yes, as

‖D_I − U_I‖₂² ≤ γ²/|I| = ε²/(400·|I|) < ε²/(256·|I|),

and I is kept as a "flat leaf".

Therefore, as I(γ, D) is dyadic, the DECOMPOSITION stage is guaranteed to stop within
at most L splits (in the worst case, it goes on until I(γ, D) is considered, at which point
it succeeds).¹ Thus Step 14 passes, and the algorithm reaches the APPROXIMATION
stage. By the foregoing discussion, this implies Φ(D, I) is ε/4-close to D (and hence to
C); D̂ is then (except with probability at most 1/10) (ε/4 + ε/4 = ε/2)-close to C, and
the algorithm returns ACCEPT.
We can now combine the structural results from Section 3.1 and Theorem 3.3.2 to
get the following corollaries:

Corollary 3.3.4. The algorithm TESTSPLITTABLE can test the classes of monotone, unimodal,
log-concave, concave, convex, and monotone hazard rate (MHR) distributions, with
O(√n/ε⁴) samples.
¹In more detail, we want to argue that if D is in the class, then a decomposition with at most L
pieces is found by the algorithm. Since there is a dyadic decomposition with at most L pieces (namely,
I(γ, D) = (I₁, ..., I_ℓ)), it suffices to argue that the algorithm will never split one of the I_j's (as every single
I_j will eventually be considered by the recursive binary splitting, unless the algorithm stopped recursing in
this "path" before even considering I_j, which is even better). But this is the case by the above argument,
which ensures each such I_j will be recognized as satisfying one of the two conditions for a "good decomposition"
(being either close to uniform in ℓ₂, or having very little mass).
Corollary 3.3.5. The algorithm TESTSPLITTABLE can test the class of t-modal distributions,
with O(t√n/ε⁴) samples.

Corollary 3.3.6. The algorithm TESTSPLITTABLE can test the classes of Binomial and
Poisson Binomial Distributions, with Õ(n^{1/4}/ε⁴) samples.
Corollary 3.3.7. The algorithm TESTSPLITTABLE can test the class of k-histogram distributions,
with Õ(k√n/ε³) samples.

Extension to mixtures.
We finally note that, in contrast to the case-specific analogous
results of [9, 23], Corollary 3.3.4, Corollary 3.3.5 and Corollary 3.3.6 easily extend to mixtures
of distributions from these respective classes. More specifically, since the mixture of any k
(γ, L)-decomposable distributions is (γ, k·L)-decomposable, our results imply the following:
Corollary 3.3.8. Let C^k be the set of mixtures of (at most) k distributions from C, where C
is any union of the classes of monotone, unimodal, log-concave and monotone hazard rate
(MHR) distributions. Then, C^k can be tested with Õ(k√n/ε⁴) samples. Similarly, the class
M_{t,k} of mixtures of (at most) k t-modal distributions can be tested with Õ(kt√n/ε⁴) samples.
3.4
Considering the effective support size only
The general approach we have been following so far gives, out of the box, an efficient testing
algorithm with sample complexity Õ(√n/ε³) for a large range of properties. However, this
sample complexity can for some classes P be brought down a lot more, by taking advantage
in a preprocessing step of good concentration guarantees of distributions in P.
As a motivating example, consider the class of Poisson Binomial Distributions (PBDs). It
is well-known (see e.g. [25, Section 2]) that PBDs are unimodal, and more specifically that
PBD_n ⊆ L ⊆ M₁. Therefore, using our generic framework we can test Poisson Binomial
Distributions with Õ(√n) samples. This is, however, far from optimal: as shown in [23],
a sample complexity of Θ(n^{1/4}) is both necessary and sufficient. The reason our general
algorithm ends up making quadratically too many queries can be explained as follows. PBDs
are tightly concentrated around their expectation, so that they "morally" live on a support
of size m = O(√n). Yet, instead of testing them on this very small support, in the above we
still consider the entire range [n], and thus end up paying a dependence √n instead of √m.
If we could use that observation to first reduce the domain to the effective support of the
distribution, then we could call our testing algorithm on this reduced domain of size O(√n).
In the rest of this section, we formalize and develop this idea, and in Section 3.4.2 will obtain
as a direct application an Õ(n^{1/4})-sample testing algorithm for PBD_n.
Definition 3.4.1. Given ε > 0, the ε-effective support of a distribution D is the smallest
interval I such that D(I) ≥ 1 − ε.
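Computing the ε-effective support of an explicitly known (or empirically estimated) distribution is a simple sliding-window scan; the sketch below is one straightforward implementation of ours.

```python
def effective_support(D, eps):
    """Return the smallest interval (lo, hi), 1-based and inclusive, such that
    the distribution D (a list with D[i-1] = D(i)) has D([lo, hi]) >= 1 - eps."""
    n = len(D)
    best = (1, n)
    mass, lo = 0.0, 0
    for hi in range(n):                       # grow the window to the right
        mass += D[hi]
        while mass - D[lo] >= 1 - eps:        # shrink from the left while valid
            mass -= D[lo]
            lo += 1
        if mass >= 1 - eps and hi - lo < best[1] - best[0]:
            best = (lo + 1, hi + 1)
    return best

# A distribution concentrated around the middle of [9]:
D = [0.01, 0.02, 0.10, 0.25, 0.30, 0.20, 0.08, 0.03, 0.01]
print(effective_support(D, eps=0.1))          # -> (3, 7)
```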
The last definition we shall require is that of the conditioned distributions of a class C:

Definition 3.4.2. For any class of distributions C over [n], define the set of conditioned
distributions of C (with respect to ε > 0 and interval I ⊆ [n]) as C^{ε,I} := { D_I : D ∈ C, D(I) ≥ 1 − ε }.
Finally, we will require the following simple result:

Lemma 3.4.3. Let D be a distribution over [n], and I ⊆ [n] an interval such that D(I) ≥ 1 − ε/8.
Then,

* If D ∈ C, then D_I ∈ C^{ε/8,I};
* If ℓ₁(D, C) > ε, then ℓ₁(D_I, C^{ε/8,I}) > ε/2.
Proof. The first item is obvious. As for the second, write α := ε/8 and let P ∈ C be any
distribution with P(I) ≥ 1 − α. By assumption, ‖D − P‖₁ > ε; but we have

‖D_I − P_I‖₁ = Σ_{i∈I} | D(i)/D(I) − P(i)/P(I) |
  ≥ (1/D(I))·Σ_{i∈I} |D(i) − P(i)| − Σ_{i∈I} P(i)·| 1/D(I) − 1/P(I) |
  = (1/D(I))·( Σ_{i∈I} |D(i) − P(i)| − |P(I) − D(I)| )
  ≥ (1/D(I))·( ‖D − P‖₁ − D([n]∖I) − P([n]∖I) − |P(I) − D(I)| )
  ≥ ‖D − P‖₁ − 3α > ε − 3ε/8 ≥ ε/2.

Since every distribution in C^{ε/8,I} is of the form P_I for such a P, the claim follows. □
We now proceed to state and prove our result, namely, efficient testing of structured
classes of distributions with nice concentration properties.
Theorem 3.4.4. Let C be a class of distributions over [n] for which the following holds.

1. there is a function M(·, ·) such that each D ∈ C has ε-effective support of size at most
M(n, ε);
2. for every ε ∈ [0, 1] and interval I ⊆ [n], C^{ε,I} is (γ, L)-splittable;
3. there exists an efficient procedure ESTIMATEDISTANCE_{C^{ε,I}} which, given as input the
explicit description of a distribution D over [n] and interval I ⊆ [n], computes the
distance ℓ₁(D_I, C^{ε,I}).

Then, the algorithm TESTEFFECTIVESPLITTABLE (Algorithm 2) is an O(max(√(mℓ)·log ℓ/ε³, ℓ²/ε²))-
sample tester for C, where m := M(n, ε/40) and ℓ := L(ε/40, m).
Algorithm 2 TESTEFFECTIVESPLITTABLE
Require: Domain X (interval of size n), sample access to D over X; subroutine ESTIMATEDISTANCE_{C^{ε,I}}
Input: Parameters ε ∈ (0, 1], function L(·, ·), and upper bound function M(·, ·) for the
effective support of the class C.
1: Set m := O(1/ε²), τ := M(n, ε/40).
2: EFFECTIVE SUPPORT
3:   Compute D̂, an empirical estimate of D, by drawing m independent samples from D.
4:   Let J be the largest interval of the form {1, ..., j} such that D̂(J) ≤ ε/20.
5:   Let K be the largest interval of the form {k, ..., n} such that D̂(K) ≤ ε/20.
6:   Set I ← [n] \ (J ∪ K).
7:   if |I| > τ then return REJECT
8:   end if
9:
10: TESTING
11:  Call TESTSPLITTABLE with I (providing simulated access to D_I by rejection sampling,
     returning FAIL if the number of samples q from D_I required by the subroutine is not
     obtained after O(q) samples from D), ESTIMATEDISTANCE_{C^{ε/8,I}}, parameters ε′ := ε/2,
     and L(·, ·).
12:  return ACCEPT if TESTSPLITTABLE accepts, REJECT otherwise.
3.4.1
Proof of Theorem 3.4.4
By the choice of m and the DKW inequality, with probability at least 23/24 the estimate D̂
satisfies d_K(D, D̂) ≤ ε/80. Conditioning on that from now on, we get that D(I) ≥ D̂(I) − ε/40 ≥
1 − ε/10 − ε/40 = 1 − ε/8. Furthermore, denoting by j and k the two inner endpoints of J and K in Steps 4 and
5, we have D(J ∪ {j + 1}) ≥ D̂(J ∪ {j + 1}) − ε/80 > ε/20 − ε/80 > ε/40 (and similarly for D(K ∪ {k − 1})), so
that I has size at most σ + 1, where σ is the (ε/40)-effective support size of D.
Finally, note that since D(I) = Q(1) by our conditioning, the simulation of samples by
rejection sampling will succeed with probability at least 23/24 and the algorithm will not
output FAIL.
Sample complexity.
The sample complexity is the sum of the O(1/ε²) in Step 3 and the O(q) in Step 11. From
Theorem 3.3.1 and the choice of I, this latter quantity is O(max(√(M(n, ε/40)·ℓ)·log ℓ/ε³, ℓ²/ε²)),
where ℓ = L(ε/40, M(n, ε/40)).

Correctness.
If D ∈ C, then by the setting of τ (set to be an upper bound on the (ε/40)-
effective support size of any distribution in C) the algorithm will go beyond Step 6. The call
to TESTSPLITTABLE will then end up in the algorithm returning ACCEPT in Step 12, with
probability at least 2/3, by Lemma 3.4.3, Theorem 3.3.1 and our choice of parameters.
Similarly, if D is ε-far from C, then either its effective support is too large (and then the
test on Step 6 fails), or the main tester will detect that its conditional distribution on I is
(ε/2)-far from C^{ε/8,I} and output REJECT in Step 12.
Overall, in either case the algorithm is correct except with probability at most 1/24 +
1/24 + 1/3 = 5/12 (by a union bound). Repeating constantly many times and outputting the
majority vote brings the probability of failure down to 1/3.
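The amplification step is the standard majority-vote trick; a minimal sketch, assuming a black-box tester whose failure probability is at most 5/12:

```python
import random

def amplify(tester, repetitions=51):
    """Run a randomized {True, False}-valued tester several times and return
    the majority vote; for a tester that is correct with probability > 1/2,
    the failure probability decreases as the number of repetitions grows."""
    votes = [tester() for _ in range(repetitions)]
    return sum(votes) > repetitions / 2

# Toy example: a "tester" that is correct (returns True) with probability 7/12.
noisy_tester = lambda: random.random() < 7 / 12
print(amplify(noisy_tester))
```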
3.4.2
Application: Testing Poisson Binomial Distributions
In this section, we illustrate the use of our generic two-stage approach to test the class of
Poisson Binomial Distributions. Specifically, we prove the following result:
Corollary 3.4.5. The class of Poisson Binomial Distributions can be tested with Õ(n^{1/4}/ε⁴)
samples, using Algorithm 2.
This is a direct consequence of Theorem 3.4.4 and the lemmas below. The first one states
that, indeed, PBDs have small effective support:
Fact 3.4.6. For any ε > 0, a PBD has ε-effective support of size O(√(n log(1/ε))).

Proof. By an additive Chernoff bound, any random variable X following a Poisson Binomial
Distribution has Pr[ |X − EX| > γn ] ≤ 2e^{−2γ²n}. Taking γ := √(ln(2/ε)/(2n)), we get that
Pr[ X ∈ I ] ≥ 1 − ε, where I := [ EX − √(n·ln(2/ε)/2), EX + √(n·ln(2/ε)/2) ]. □
E PBDn (and therefore is unimodal), then for any interval I C
[n]
the conditional distribution DI is still unimodal, and thus the class of conditioned PBDs
PBDE' =d { D
: D E PBDn, D(I)
1 - e
} falls under Corollary 3.1.4. The last piece we
need to apply our generic testing framework is the existence of an efficient algorithm to
compute the distance between an (explicit) distribution and the class of conditioned PBDs.
This is provided by our next lemma:
Claim 3.4.7. There exists a procedure ESTIMATEDISTANCE^{PBD,ε,I} that, on input n and ε ∈ (0, 1], an interval I ⊆ [n], as well as the full specification of a distribution D on [n], computes a value τ such that τ ∈ [ ℓ₁(D, PBD_n^{ε,I}) − O(ε), (1 + 2ε)·ℓ₁(D, PBD_n^{ε,I}) + O(ε) ], in time poly(n)·(1/ε)^{O(log²(1/ε))}.
Proof. The goal is to find a γ-approximation of the minimum value of Σ_{i∈I} | P(i)/P(I) − D(i) |, subject to P(I) = Σ_{i∈I} P(i) ≥ 1 − ε and P ∈ PBD_n. We first note that, given the parameters n ∈ ℕ and p₁, ..., p_n ∈ [0, 1] of a PBD P, the vector of (n + 1) probabilities P(0), ..., P(n) can be obtained in time O(n²), e.g. by dynamic programming. Therefore, computing the ℓ₁ distance between D and any PBD with known parameters can be done efficiently. To conclude, we invoke a result of Daskalakis and Papadimitriou that guarantees the existence of a succinct (proper) cover of PBD_n:
Theorem 3.4.8 ([18, Theorem 1]). For all n, γ > 0, there exists a set S_γ ⊆ PBD_n such that:
(i) S_γ is a γ-cover of PBD_n in ℓ₁; that is, for all D ∈ PBD_n, there exists some D' ∈ S_γ such that ||D − D'||₁ ≤ γ;
(ii) |S_γ| ≤ n² + n·(1/γ)^{O(log²(1/γ))};
(iii) S_γ can be computed in time O(n² log n) + O(n log n)·(1/γ)^{O(log²(1/γ))}, and each D ∈ S_γ is explicitly described by its set of parameters.
Set γ to be a suitably small constant multiple of ε. Fix P ∈ PBD_n such that P(I) ≥ 1 − ε, and Q ∈ S_γ such that ||P − Q||₁ ≤ γ. In particular, via the correspondence between ℓ₁ and total variation distance, it is easy to see that |P(I) − Q(I)| ≤ γ/2. By a calculation analogous to the one in Lemma 3.4.3, we have

  Σ_{i∈I} | P(i)/P(I) − Q(i)/Q(I) |  ∈  [ ||P − Q||₁ − 5γ/2, (1 + 2ε)·(||P − Q||₁ + 5γ/2) ],

where we used the fact that Σ_{i∈I} |P(i) − Q(i)| = 2·Σ_{i∈I: P(i)>Q(i)} (P(i) − Q(i)) + Q(I) − P(I), and that Q(I) − P(I) ∈ [−2γ, 2γ]. By the triangle inequality, this implies that the minimum of ||P_I − D||₁ over the distributions P of S_γ with P(I) ≥ 1 − (ε + γ/2) will be within an additive O(ε) of ℓ₁(D, PBD_n^{ε,I}). The fact that the former can be computed in time poly(n)·(1/ε)^{O(log²(1/ε))} concludes the proof.
It is not hard to see that the fact that this only guarantees an approximation of ℓ₁(D, PBD_n^{ε,I}), instead of the exact value, is sufficient for the purpose of Algorithm 1.
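To illustrate the dynamic-programming step used in the proof of Claim 3.4.7, here is a minimal sketch (function names are ours, not the thesis's) of computing a PBD's probability vector in O(n²) time, together with the subsequent ℓ₁ distance computation.

def pbd_probabilities(ps):
    """Given the parameters p_1, ..., p_n of a PBD, return [P(0), ..., P(n)]."""
    probs = [1.0]                        # distribution of the empty sum
    for p in ps:
        new = [0.0] * (len(probs) + 1)
        for k, q in enumerate(probs):    # convolve with an independent Bernoulli(p)
            new[k] += q * (1 - p)
            new[k + 1] += q * p
        probs = new
    return probs

def l1_distance(d1, d2):
    """l1 distance between two explicit probability vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(d1, d2))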
Proof of Corollary 3.4.5. Combining the above, we invoke Theorem 3.4.4 with M(n, ε) = O(√(n log(1/ε))) (Fact 3.4.6) and L(m, γ) = O(log m/γ) (Corollary 3.1.4). This yields the claimed sample complexity; finally, the efficiency is a direct consequence of Claim 3.4.7. □
3.5
Lower bounds
We now turn to proving converses to our positive results - namely, that many of the upper
bounds we obtain cannot be significantly improved upon. As in our algorithmic approach,
we describe for this purpose a generic framework for obtaining lower bounds.
In order to state our results, we will require the usual definition of agnostic learning.
Recall that an algorithm is said to be a semi-agnostic learner for a class C if it satisfies the
following. Given sample access to an arbitrary distribution D and parameter ε, it outputs a hypothesis D̂ which (with high probability) does "almost as well as it gets":

  ||D̂ − D||₁ ≤ c · OPT_{C,D} + O(ε),

where OPT_{C,D} := inf_{D'∈C} ℓ₁(D', D), and c ≥ 1 is some absolute constant (if c = 1, the learner is said to be agnostic).
High-level idea.
The motivation for our result is the observation of [9] that "monotonicity
is at least as hard as uniformity". Unfortunately, their specific argument does not readily generalize to other classes of distributions. The starting point of our approach is to observe that while uniformity testing is hard in general, it becomes very easy under the promise that the distribution is monotone, or even only close to monotone (namely, O(1/ε²) samples suffice). This yields an alternate proof of the lower bound for monotonicity testing, via a different reduction: first, test if the unknown distribution is monotone; if it is, test whether it is uniform, now assuming closeness to monotone.
More generally, this idea applies to any class C which (a) contains the uniform distribution, and (b) for which we have an o(√n)-sample agnostic learner L, as follows. Assuming we have a tester T for C with sample complexity o(√n), define a uniformity tester as below.
• test if D ∈ C using T; if not, reject (as U ∈ C, D cannot be uniform);
• otherwise, agnostically learn D with L (since D is close to C), and obtain hypothesis D̂;
• check offline if D̂ is close to uniform.
By assumption, T and L each use o(√n) samples, so the whole process does as well; but this contradicts the lower bound of [21, 27] on uniformity testing. Hence, T must use Ω(√n) samples.
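Schematically, the reduction can be written as follows, with the class tester, the semi-agnostic learner, and the offline distance computation passed in as black boxes; all names here are illustrative.

def uniformity_tester_from_class_tester(samples, test_C, learn_C, dist_to_uniform, eps):
    # (In a full implementation, fresh samples would be drawn for each step;
    # reusing them here keeps the sketch short.)
    # Step 1: test membership in C. Since the uniform distribution lies in C,
    # a rejection here certifies that D is not uniform.
    if not test_C(samples, eps):
        return "REJECT"
    # Step 2: D is (close to) in C, so the semi-agnostic learner produces a good
    # hypothesis D-hat from few samples.
    d_hat = learn_C(samples, eps)
    # Step 3: check offline (no further samples) whether the hypothesis is close to uniform.
    return "ACCEPT" if dist_to_uniform(d_hat) <= eps else "REJECT"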
This "testing-by-narrowing" reduction argument can be further extended to other reductions than to uniformity, as we show below:
Theorem 3.5.1. Let C be a class of distributions over [n] for which the following holds:
(i) there exists a semi-agnostic learner L for C, with sample complexity q_L(n, ε, δ) and "agnostic constant" c;
(ii) there exists a subclass C_Hard ⊆ C such that testing C_Hard requires q_H(n, ε) samples.
Suppose further that q_L(n, ε, 1/10) = o(q_H(n, ε)). Then, any tester for C must use Ω(q_H(n, ε)) samples.
Proof. The above theorem relies on the reduction outlined above, which we rigorously detail here. Assuming C, C_Hard, L as above (with semi-agnostic constant c ≥ 1), and a tester T for C with sample complexity q_T(n, ε), we define a tester T_Hard for C_Hard. On input ε ∈ (0, 1] and given sample access to a distribution D on [n], T_Hard acts as follows:
• call T with parameters n, ε'/c (where ε' := ε/3) and failure probability 1/6, to (ε'/c)-test if D ∈ C. If not, reject;
• otherwise, agnostically learn a hypothesis D̂ for D, with L called with parameters n, ε' and failure probability 1/6;
• check offline if D̂ is ε'-close to C_Hard; accept if and only if this is the case.
We condition on both calls (to T and L) being successful, which overall happens with probability at least 2/3 by a union bound. The completeness is immediate: if D ∈ C_Hard ⊆ C, T accepts, and the hypothesis D̂ satisfies ||D̂ − D||₁ ≤ ε'. Therefore, ℓ₁(D̂, C_Hard) ≤ ε', and T_Hard accepts.
For the soundness, we proceed by contrapositive. Suppose T_Hard accepts; this means that each step was successful. In particular, ℓ₁(D, C) ≤ ε'/c, so that the hypothesis output by the agnostic learner satisfies ||D̂ − D||₁ ≤ c·OPT_{C,D} + ε' ≤ 2ε'. In turn, since the last step passed, ℓ₁(D̂, C_Hard) ≤ ε', and by a triangle inequality we get, as claimed, ℓ₁(D, C_Hard) ≤ 2ε' + ℓ₁(D̂, C_Hard) ≤ 3ε' = ε. Observing that the overall sample complexity is q_T(n, ε'/c) + q_L(n, ε', 1/6) = q_T(n, ε'/c) + o(q_H(n, ε)) concludes the proof. □
Taking C_Hard to be the singleton consisting of the uniform distribution, and from the semi-agnostic learners of [11, 12] (each with sample complexity either poly(1/ε) or poly(log n, 1/ε)), we obtain the following:²

Corollary 3.5.2. Testing log-concavity, convexity, concavity, MHR, unimodality and t-modality each require Ω(√n/ε²) samples (the latter for t = o(√n)).
Similarly, we can use another result of [16] which shows how to agnostically learn Poisson Binomial Distributions with Õ(1/ε²) samples.³ Taking C_Hard to be the single Bin(n, 1/2) distribution (along with the testing lower bound of [35]), this yields the following:

Corollary 3.5.3. Testing PBD_n and BIN_n each require Ω(n^{1/4}/ε²) samples.
Finally, we derive a lower bound on testing k-SIIRVs from the agnostic learner of [15] (which has sample complexity poly(k, 1/ε), independent of n):

Corollary 3.5.4. There exists an absolute constant c > 0 such that testing k-SIIRVs requires Ω(k^{1/2}·n^{1/4}) samples, for any k = o(n^c).
Proof of Corollary 3.5.4. To prove this result, it is enough by Theorem 3.5.1 to exhibit a particular k-SIIRV S such that testing identity to S requires this many samples. Moreover, from [35] this last part amounts to proving that the (truncated) 2/3-norm ||S^{−max}||_{2/3} of S is Ω(k^{1/2}·n^{1/4}) (for some small truncation parameter ε₀ > 0). Our hard instance S is defined as follows: it is the distribution of X₁ + ··· + X_n, where the X_i's are independent integer random variables uniform on {0, ..., k − 1} (in particular, for k = 2 we get a Bin(n, 1/2) distribution). It is straightforward to verify that ES = n(k − 1)/2 and Var S = n(k² − 1)/12 = Θ(k²n); moreover, S is log-concave (as the convolution of n uniform distributions). From this last point, we get that (i) the maximum probability of S, attained at its mode, is ||S||_∞ = Θ(1/σ), where σ := √(Var S); and (ii) for every j in an interval I of length 2σ centered at this mode, S(j) ≥ Ω(||S||_∞). Putting this together, we get that the 2/3-norm (and similarly the truncated 2/3-norm) of S is lower bounded by

  ( Σ_{j∈I} S(j)^{2/3} )^{3/2}  ≥  ( 2σ·Ω(1/σ)^{2/3} )^{3/2}  =  Ω(σ^{1/2})  =  Ω(k^{1/2}·n^{1/4}),

which concludes the proof. □

²Specifically, these lower bounds hold as long as ε = Ω(1/n^α) for some absolute constant α > 0 (so that the sample complexity of the agnostic learner is indeed negligible compared to √n/ε²).

³Note the quasi-quadratic dependence on ε of the learner, which allows us to get ε into our lower bound for n ≥ poly log(1/ε).
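As a quick numerical sanity check (ours, not part of the argument), one can compute S exactly by convolution for small n and k and evaluate its 2/3-norm against k^{1/2}·n^{1/4}.

def siirv_two_thirds_norm(n, k):
    """Exact 2/3-norm of S = X_1 + ... + X_n, with the X_i uniform on {0, ..., k-1}."""
    dist = [1.0]
    step = [1.0 / k] * k                 # one uniform summand on {0, ..., k-1}
    for _ in range(n):                   # S is the n-fold convolution
        new = [0.0] * (len(dist) + k - 1)
        for i, a in enumerate(dist):
            for j, b in enumerate(step):
                new[i + j] += a * b
        dist = new
    return sum(p ** (2.0 / 3.0) for p in dist) ** 1.5

# e.g. siirv_two_thirds_norm(100, 3) comes out to a small constant multiple of
# 3**0.5 * 100**0.25, consistent with the claimed Omega(k^{1/2} n^{1/4}) behavior.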
Appendix A
Summary of results
Class                                         Upper bound                                        Lower bound
Monotone                                      Õ(√n)·poly(1/ε)  ([9]; Corollary 3.3.4)            Ω(√n/ε²)  ([9]; Corollary 3.5.2)
Unimodal                                      Õ(√n)·poly(1/ε)  (Corollary 3.3.4)                 Ω(√n/ε²)  (Corollary 3.5.2)
k-mixtures of monotone and/or unimodal        Õ(√n)·poly(k, 1/ε)  (Corollary 3.3.4)
t-modal                                       Õ(√n)·poly(t, 1/ε)  (Corollary 3.3.5)              Ω(√n/ε²) for t = o(√n)  (Corollary 3.5.2)
k-mixtures of t-modal                         Õ(√n)·poly(k, t, 1/ε)  (Corollary 3.3.5)
Log-concave, concave, convex                  Õ(√n)·poly(1/ε)  (Corollary 3.3.4)                 Ω(√n/ε²)  (Corollary 3.5.2)
k-mixtures of log-concave, concave
  and/or convex                               Õ(√n)·poly(k, 1/ε)  (Corollary 3.3.4)
Monotone Hazard Rate (MHR)                    Õ(√n)·poly(1/ε)  (Corollary 3.3.4)                 Ω(√n/ε²)  (Corollary 3.5.2)
k-mixtures of MHR                             Õ(√n)·poly(k, 1/ε)  (Corollary 3.3.4)
Binomial, Poisson Binomial (PBD)              Õ(n^{1/4}/ε⁴)  ([23]; Corollary 3.3.6)             Ω(n^{1/4}/ε²)  ([23]; Corollary 3.5.3)
k-mixtures of Binomials and/or PBD            Õ(n^{1/4})·poly(k, 1/ε)  (Corollary 3.3.6)
k-SIIRV                                                                                          Ω(k^{1/2}·n^{1/4})  (Corollary 3.5.4)

Table A.1: Summary of our results.
We stress that although two of these results are not new (the Õ(√n)·poly(1/ε) upper and Ω(√n/ε²) lower bounds on testing monotonicity can be found in [9], while the Θ(n^{1/4}) sample complexity of testing PBDs is pinned down¹ in [23]), the crux here is that we are able to derive them in a unified way, by applying the same generic algorithm to these different tasks. In addition to its generality, we emphasize that our framework yields a much cleaner and simpler proof of the results from [23], for both the upper and lower bounds. Moreover, prior to our work nothing was known on testing membership in the other classes we consider.
¹Specifically, [23] obtain an n^{1/4}·Õ(1/ε²) + Õ(1/ε⁶) sample complexity, to be compared with our Õ(n^{1/4}/ε⁴) upper bound, as well as an Ω(n^{1/4}/ε²) lower bound.
Appendix B
Omitted proofs
B.1
Proof of Lemma 1.1.6
We now give the proof of Lemma 1.1.6, restated below:
Proof. To do so, we first describe an algorithm that distinguishes between ||D − U||₂² ≥ ε²/n and ||D − U||₂² ≤ ε²/(4n) with probability at least 2/3, using C·√n/ε² samples (for some absolute constant C). Boosting the success probability to 1 − δ at the price of a multiplicative log(1/δ) factor can then be achieved by standard techniques.
Similarly to the proof of Theorem 11 (whose algorithm we use, but with the threshold τ defined below instead of theirs), define the quantities

  τ := 3m²ε²/(4n),   Z_k := (X_k − m/n)² − X_k  (k ∈ [n]),   Z := Σ_{k=1}^n Z_k,

where the X_k's (and thus the Z_k's) are independent by Poissonization, X_k ∼ Poisson(m·D(k)). It is not hard to see that EZ_k = m²·A_k², where A_k := D(k) − 1/n, so that EZ = m²·||D − U||₂². Furthermore, we also get

  Var Z_k = 2m²·(1/n + A_k)² + 4m³·A_k²·(1/n + A_k),

so that

  Var Z = 2m²·( Σ_{k=1}^n A_k² + 1/n + 2m·Σ_{k=1}^n A_k³ + (2m/n)·Σ_{k=1}^n A_k² )        (B.1)

(after expanding and using that Σ_{k=1}^n A_k = 0).
Soundness. This is almost straight from [19], only the threshold has changed. Assume A² := Σ_{k=1}^n A_k² = ||D − U||₂² ≥ ε²/n; we will show that Pr[Z < τ] ≤ 1/3. By Chebyshev's inequality, it is sufficient to show that τ ≤ EZ − √(3·Var Z), as Pr[EZ − Z ≥ √(3·Var Z)] ≤ 1/3. As τ ≤ (3/4)·EZ, arguing that √(3·Var Z) ≤ (1/4)·EZ is enough, i.e. that 48·Var Z ≤ (EZ)². From (B.1), this is equivalent to showing

  A² + 1/n + 2m·Σ_{k=1}^n A_k³ + (2m/n)·A² ≤ m²A⁴/96.

We bound the left-hand side term by term, using that m ≥ C·√n/ε² for a sufficiently large absolute constant C (C ≥ 768 suffices for the constants below).
• As A² ≥ ε²/n, we have m²A² ≥ C²/ε², and thus m²A⁴ ≥ (C²/ε²)·A² ≥ 384·A² (as ε ≤ 1).
• Similarly, m²A⁴ = (m·A²)² ≥ C²/n ≥ 384/n.
• Since Σ_{k=1}^n |A_k|³ ≤ (Σ_{k=1}^n A_k²)^{3/2} = A³ (by monotonicity of the ℓ_p norms; see the remark at the end of this proof), we get 2m·Σ_{k=1}^n A_k³ ≤ 2m·A³ ≤ m²A⁴/384, using that m·A ≥ C/ε ≥ 768.
• Finally, (2m/n)·A² ≤ m²A⁴/384, since m·A² ≥ C/√n ≥ 768/n.
Overall, the left-hand side is at most 4·m²A⁴/384 = m²A⁴/96, as claimed.
Completeness. Assume A² = ||D − U||₂² ≤ ε²/(4n). We need to show that Pr[Z ≥ τ] ≤ 1/3. Chebyshev's inequality implies Pr[Z − EZ ≥ √(3·Var Z)] ≤ 1/3, and therefore it is sufficient to show that τ ≥ EZ + √(3·Var Z). Recalling the expressions of EZ and Var Z from (B.1), this is tantamount to showing

  3m²ε²/(4n) ≥ m²A² + m·√( 6·( A² + 1/n + 2m·Σ_{k=1}^n A_k³ + (2m/n)·A² ) ).

Since A² ≤ ε²/(4n), the first term on the right-hand side is at most m²ε²/(4n), so it suffices to show that the second term is at most m²ε²/(2n). Bounding each summand under the square root as in the soundness case (using Σ_k |A_k|³ ≤ A³ and A² ≤ ε²/(4n)), this holds as soon as m ≥ C·√n/ε² for a sufficiently large absolute constant C, which our choice of m ensures.

In the above we used that for any sequence x = (x₁, ..., x_n) ∈ ℝⁿ and 0 < p ≤ q < ∞, ||x||_q ≤ ||x||_p, i.e. p ↦ ||x||_p is non-increasing; in particular, Σ_k |A_k|³ ≤ (Σ_k A_k²)^{3/2}. To see why, one can easily prove that if ||x||_p = 1, then ||x||_q^q ≤ 1 (bounding each term |x_i|^q ≤ |x_i|^p), and therefore ||x||_q ≤ 1 = ||x||_p. For the general case, apply this to y = x/||x||_p, which has unit ℓ_p norm, and conclude by homogeneity of the norm. □
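For illustration, the Poissonized test above can be simulated as follows; this sketch is ours (not the thesis's code), and the constant C = 600 is an arbitrary stand-in for the absolute constant required by the analysis.

import numpy as np

def l2_uniformity_test(D, eps, C=600, seed=0):
    """Distinguish ||D - U||_2^2 >= eps^2/n from <= eps^2/(4n), as in the proof above."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    m = int(C * np.sqrt(n) / eps**2) + 1      # m = C * sqrt(n) / eps^2 samples
    rng = np.random.default_rng(seed)
    X = rng.poisson(m * D)                    # Poissonization: X_k ~ Poisson(m * D(k))
    Z = np.sum((X - m / n) ** 2 - X)          # E[Z] = m^2 * ||D - U||_2^2
    tau = 3 * m**2 * eps**2 / (4 * n)         # threshold from the proof
    return "FAR" if Z >= tau else "CLOSE"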
B.2
Proof of Theorem 3.1.7
In this section, we prove our structural result for MHR distributions, Theorem 3.1.7:
Theorem 3.1.8 (k-histogram). The class of k-histogram distributions on [n] is (0, k)-decomposable.
Proof. We reproduce and adapt the argument of [11, Section 5.1] to meet our definition of decomposability (which, albeit related, is incomparable to theirs). First, we modify the algorithm at the core of their constructive proof, in Algorithm 3: note that the only two changes are in Steps 2 and 3, where we use parameters γ/2 and γ/n², respectively.
Algorithm 3 DECOMPOSE-MHR'(D, γ)
Require: explicit description of MHR distribution D over [n]; accuracy parameter γ > 0
1: Set J ← [n] and Q ← ∅.
2: Let I ← RIGHT-INTERVAL(D, J, γ/2) and I' ← RIGHT-INTERVAL(D, J \ I, γ/2). Set J ← J \ (I ∪ I').
3: Set i ∈ J to be the smallest integer such that D(i) ≥ γ/n². If no such i exists, let I'' ← J and go to Step 9. Otherwise, let I'' ← {1, . . . , i − 1} and J ← J \ I''.
4: while J ≠ ∅ do
5:   Let j ∈ J be the smallest integer such that D(j) ∉ [1, 1 + γ]·D(i). If no such j exists, let I''' ← J; otherwise let I''' ← {i, . . . , j − 1}.
6:   Add I''' to Q and set J ← J \ I'''.
7:   Let i ← j.
8: end while
9: Return Q ∪ {I, I', I''}
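The banding loop of Steps 4-8 can be sketched as follows; the band is written as [D(i), (1 + γ)·D(i)] to match the reconstruction above, and the preprocessing of Steps 1-3 (RIGHT-INTERVAL and the low-mass prefix) is ignored here.

def band_decompose(D, J, gamma):
    """D: list of probabilities indexed from 0; J: sorted list of remaining points.
    Close an interval whenever the density leaves the multiplicative band anchored
    at the interval's left endpoint."""
    Q = []
    pos = 0
    while pos < len(J):
        i = J[pos]
        end = pos + 1
        while end < len(J) and D[i] <= D[J[end]] <= (1 + gamma) * D[i]:
            end += 1
        Q.append(J[pos:end])             # one interval of Q per band
        pos = end
    return Q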
Following the structure of their proof, we write Q = {I₁, ..., I_{|Q|}} with I_i = [a_i, b_i], and define

  Q' := { I_i ∈ Q : D(a_i) > D(a_{i+1}) },   Q'' := { I_i ∈ Q : D(a_i) < D(a_{i+1}) }.
We immediately obtain the analogues of their Lemmas 5.2 and 5.3:
Lemma B.2.1. We have Π_{I_i∈Q'} D(a_i)/D(a_{i+1}) ≤ n/γ.

Lemma B.2.2. Step 4 of Algorithm 3 adds at most O((1/γ)·log(n/γ)) intervals to Q.
Sketch. This derives from observing that now D(I ∪ I') ≥ γ/n, which as in [11, Lemma 5.3] in turn implies 1 ≥ (γ/n)·(1 + γ)^{|Q'|−1}, so that |Q'| = O((1/γ)·log(n/γ)).
Again following their argument, we also get

  Π_{I_i∈Q''} D(a_{i+1})/D(a_i)  =  (D(a_{|Q|+1})/D(a_1)) · Π_{I_i∈Q'} D(a_i)/D(a_{i+1});

by combining Lemma B.2.1 with the fact that D(a_{|Q|+1}) ≤ 1 and that by construction D(a_1) ≥ γ/n², we get

  Π_{I_i∈Q''} D(a_{i+1})/D(a_i)  ≤  (n²/γ)·(n/γ)  =  n³/γ².

But since each term in the product is at least (1 + γ) (by construction of Q and the definition of Q''), this leads to

  (1 + γ)^{|Q''|} ≤ n³/γ²,

and thus |Q''| = O((1/γ)·log(n/γ)) as well.
It remains to show that Q ∪ {I, I', I''} is indeed a good decomposition of [n] for D, as per Theorem 3.1.1. Since by construction every interval in Q satisfies item (ii), we are only left with the case of I, I' and I''. For the first two, as they were returned by RIGHT-INTERVAL, either (a) they are singletons, in which case item (ii) trivially holds; or (b) they have at least two elements, in which case they have probability mass at most γ/2 (by the choice of parameters for RIGHT-INTERVAL) and thus item (i) is satisfied. Finally, it is immediate to see that by construction D(I'') ≤ n·γ/n² = γ/n, and item (i) holds as well. □
Bibliography
[1] Jayadev Acharya and Constantinos Daskalakis. Testing Poisson Binomial Distributions.
In Proceedings of SODA, pages 1829-1840, 2015. 1.2.1
[2] Noga Alon, Alexandr Andoni, Tali Kaufman, Kevin Matulef, Ronitt Rubinfeld, and
Ning Xie. Testing k-wise and almost k-wise independence. In Proceedings of STOC,
pages 496-505, New York, NY, USA, 2007. 1.2.1
[3] Mark Y. An. Log-concave probability distributions: theory and statistical testing.
Technical report, Centre for Labour Market and Social Research, Denmark, 1996. 2.1.2
[4] Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic
Theory, 26(2):445-469, 2005. 1.2.1
[5] Richard E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference
Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley
Series in Probability and Mathematical Statistics. J. Wiley, London, New York, 1972.
1.2.1
[6] Tuğkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of
approximating the entropy. SIAM Journal on Computing, 35(1):132-150, 2005. 1.2.1
[7] Tuğkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick
White. Testing random variables for independence and identity. In Proceedings of FOCS,
pages 442-451, 2001. 1.1, 1.2.1
[8] Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White.
Testing that distributions are close. In Proceedings of FOCS, pages 189-197, 2000. 1.2.1
[9] Tuğkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing
monotone and unimodal distributions. In STOC '04, pages 381-390, New York, NY,
USA, 2004. ACM. (document), 1.1, 1.2.1, 3, 3.1, 3.1, 3.2.1, 3.3.4, 3.5, A, A
[10] Clément L. Canonne. A survey on distribution testing: Your data is big. But is it blue?
Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015. 1.1
[11] Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures
of structured distributions over discrete domains. In Proceedings of SODA, pages
1380-1394, 2013. 1.2.1, 3.1., 3.5, B.2, B.2
[12] Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density
estimation via piecewise polynomial approximation. In Proceedings of STOC, pages
604-613. ACM, 2014. 1.2.1, 3.5
[13] Siu-on Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Near-optimal
density estimation in near-linear time using variable-width histograms. In Advances in
Neural Information Processing Systems 27 (NIPS), pages 1844-1852, 2014. 1.2.1
[14] Siu-On Chan, Ilias Diakonikolas, Gregory Valiant, and Paul Valiant. Optimal algorithms
for testing closeness of discrete distributions. In Proceedings of SODA, pages 1193-1203.
Society for Industrial and Applied Mathematics (SIAM), 2014. 1.1, 1.2.1
[15] Constantinos Daskalakis, Ilias Diakonikolas, Ryan O'Donnell, Rocco A. Servedio, and
Li-Yang Tan. Learning Sums of Independent Integer Random Variables. In Proceedings
of FOCS, pages 217-226. IEEE Computer Society, 2013. 3.5
[16] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning Poisson
Binomial Distributions. In Proceedings of STOC, STOC '12, pages 709-728, New York,
NY, USA, 2012. ACM. 3.5
[17] Constantinos Daskalakis, Ilias Diakonikolas, Rocco A. Servedio, Gregory Valiant, and
Paul Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In
Proceedings of SODA, pages 1833-1852. Society for Industrial and Applied Mathematics
(SIAM), 2013. 1.2.1
[18] Constantinos Daskalakis and Christos Papadimitriou. Sparse covers for sums of indicators.
ArXiv, June 2013. 3.4.8
[19] Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing Identity of Structured Distributions. In Proceedings of SODA, 2015. 1.1.6, 1.2.1, B.1
[20] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of
the sample distribution function and of the classical multinomial estimator. The Annals
of Mathematical Statistics, 27(3):642-669, 1956. 3.3.3
[21] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Technical
Report TROO-020, Electronic Colloquium on Computational Complexity (ECCC), 2000.
1.2.1, 3.5
[22] Sudipto Guha, Andrew McGregor, and Suresh Venkatasubramanian. Streaming and
sublinear approximation of entropy and information distances. In Proceedings of SODA,
pages 733-742, Philadelphia, PA, USA, 2006. Society for Industrial and Applied Mathematics (SIAM). 1.2.1
[23] Piotr Indyk, editor. Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium
on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015. SIAM,
2015. 3.3.4, 3.4, A, A, 1
[24] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and
Linda Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing, STOC '94, pages 273-282, New
York, NY, USA, 1994. ACM. 2.1.1
[25] J. Keilson and H. Gerber. Some results for discrete unimodality. Journal of the American
Statistical Association, 66(334):386-389, 1971. 3.4
[26] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The
Annals of Probability, 18(3):1269-1283, 1990. 3.3.3
[27] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled
discrete data. IEEE Transactions on Information Theory, 54(10):4750-4755, 2008. 1.1,
3.5
[28] Dana Ron. Property Testing: A Learning Theory Perspective. Foundations and Trends
in Machine Learning, 1(3):307-402, 2008. 1.2.1
[29] Dana Ron. Algorithmic and analysis techniques in property testing. Foundations and
Trends in Theoretical Computer Science, 5:73-205, 2010. 1.2.1
[30] Ronitt Rubinfeld. Taming Big Probability Distributions. XRDS, 19(1):24-28, September
2012. 1.2.1
[31] Debasis Sengupta and Asok K. Nanda. Log-concave and concave distributions in
reliability. Naval Research Logistics (NRL), 46(4):419-433, 1999. 1.2.1
[32] Mervyn J. Silvapulle and Pranab K. Sen. ConstrainedStatistical Inference. John Wiley
& Sons, Inc., 2001. 1.2.1
[33] Gregory Valiant and Paul Valiant. A CLT and tight lower bounds for estimating entropy.
Electronic Colloquium on Computational Complexity (ECCC), 17:179, 2010. 1.2.1
[34] Gregory Valiant and Paul Valiant. The power of linear estimators. In Proceedings of
FOCS, pages 403-412, October 2011. 1.2.1
[35] Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal
identity testing. In Proceedings of FOCS, 2014. 1.1, 1.2.1, 3.5, 3.5
[36] Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing,
40(6):1927-1968, 2011. 1.2.1
[37] Guenther Walther. Inference and modeling with log-concave distributions. Statistical
Science, 24(3):319-327, 2009. 1.2.1