The project objectives are to study efficient methods of determining
properties of an underlying probability distribution from samples generated
by it. Such probability distributions may occur in the natural sciences
or in Internet traffic records. The current project aims to study the sample
and time complexity of these problems when the distributions are over very
large domains, and in particular, aims to determine when these tasks can be
performed in sample and time complexity that is sublinear in the size of the
domain. The thrusts suggested in the proposal are to study broader classes
of properties than have previously been studied, to find extremely
efficient algorithms that work for important special classes of distributions, and
to develop algorithms which estimate how close the distribution is to having
the property.
In the direction of broadening the classes of properties that can be tested, in
joint work with my PhD student Ning Xie, entitled “Testing Non-uniform k-wise
Independence”, which appeared in the ICALP 2010 conference, we consider
the problem of testing whether a distribution has at least a limited degree of
independence. More specifically, a distribution $D$ over $n$-tuples is called
(non-uniform) $k$-wise independent if,
for any set of $k$ indices, the restriction of $D$ to those indices is a
product distribution. Such distributions
have played an important role in algorithm design, complexity and
cryptography. For the case when the marginal distributions are uniform, we
show an upper bound on the distance from a distribution $D$ to the set
of $k$-wise independent distributions in terms of the sum of Fourier
coefficients of $D$ at vectors of weight at most $k$. Such a bound was
previously known only for the binary field.
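To make the Fourier-analytic quantity concrete, here is a minimal Python sketch for the binary case only. It assumes the unnormalized convention $\hat{D}(a) = \sum_x D(x)(-1)^{a \cdot x}$; the function names and the example distribution are hypothetical and chosen for illustration. It enumerates the whole domain, so it illustrates the quantity being bounded rather than a sublinear tester.

```python
from itertools import product


def fourier_coefficient(dist, a):
    """Return sum_x D(x) * (-1)^(a . x) for a distribution over {0,1}^n.

    `dist` maps n-tuples of bits to probabilities (assumed convention).
    """
    total = 0.0
    for x, p in dist.items():
        parity = sum(ai * xi for ai, xi in zip(a, x)) % 2
        total += -p if parity else p
    return total


def low_weight_fourier_mass(dist, n, k):
    """Sum of |coefficient| over nonzero vectors a of Hamming weight at most k."""
    mass = 0.0
    for a in product((0, 1), repeat=n):
        if 1 <= sum(a) <= k:
            mass += abs(fourier_coefficient(dist, a))
    return mass


# Uniform distribution over the even-parity strings of length 3.
even_parity = {x: 0.25 for x in product((0, 1), repeat=3) if sum(x) % 2 == 0}

print(low_weight_fourier_mass(even_parity, 3, 2))
print(low_weight_fourier_mass(even_parity, 3, 3))
```

In this example the mass at weight at most 2 vanishes, reflecting that the even-parity distribution is uniform pairwise independent, while the mass at weight at most 3 is nonzero because the three bits are not mutually independent.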
For the non-uniform case, we give a new characterization of $k$-wise
independent distributions and further show that such a characterization is robust.
These results greatly generalize the results of Alon et al. on uniform $k$-wise
independence over the binary field to non-uniform $k$-wise independence
over product spaces. Our results yield natural testing algorithms for $k$-wise
independence with time and sample complexity sublinear in terms of the
support size when $k$ is a constant. The main technical tools employed
include discrete Fourier transforms and the theory of linear systems of
congruences.
Broadening the class of properties that can be tested in another direction,
and extending known lower bound techniques to more general classes of
properties, is the subject of joint work with my PhD student Arnab
Bhattacharyya, Eldar Fischer and Paul Valiant, entitled “Testing
monotonicity of distributions over general partial orders”, which will appear
in the Innovations in Computer Science 2011 conference. In this work, we
investigate the number of samples required for testing the monotonicity of a
distribution with respect to an arbitrary underlying partially ordered set. Our
first result is a nearly linear lower bound for the sample complexity of testing
monotonicity with respect to the poset consisting of a directed perfect
matching. This is the first nearly linear lower bound known for a natural
non-symmetric property of distributions. Testing monotonicity with respect to the
matching reduces to testing monotonicity with respect to various other natural
posets, which yields corresponding
lower bounds for these posets as well. Next, we show that whenever a poset has
a linear-sized matching in the transitive closure of its Hasse digraph, testing
monotonicity with respect to it requires nearly $\sqrt{n}$ samples. Previous
such lower bounds applied only to the total order. We also give upper bounds
on the sample complexity in terms of the chain decomposition of the poset.
Our results simplify the known tester for the two dimensional grid and provide
the first sublinear upper bound for the Boolean cube.
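As an illustration of the property being tested, the sketch below checks monotonicity of an explicitly given distribution, taking monotone to mean that $x \preceq y$ in the poset implies $D(x) \le D(y)$. The function names, the edge-list representation of the Hasse digraph, and the example values are assumptions made for illustration. This is an exact check on a fully known distribution, not one of the sample-based testers discussed above.

```python
def violating_edges(prob, hasse_edges, tol=1e-12):
    """Return the edges (u, v), with u below v, on which prob[u] > prob[v].

    Checking the Hasse (cover) edges suffices, because the inequality
    D(u) <= D(v) extends along chains by transitivity.
    """
    return [
        (u, v)
        for u, v in hasse_edges
        if prob.get(u, 0.0) > prob.get(v, 0.0) + tol
    ]


def is_monotone(prob, hasse_edges, tol=1e-12):
    """True if the distribution is monotone with respect to the poset."""
    return not violating_edges(prob, hasse_edges, tol)


# Directed perfect matching on four elements: a1 below b1, a2 below b2.
matching = [("a1", "b1"), ("a2", "b2")]
monotone_dist = {"a1": 0.1, "b1": 0.4, "a2": 0.2, "b2": 0.3}
non_monotone_dist = {"a1": 0.4, "b1": 0.1, "a2": 0.2, "b2": 0.3}

print(is_monotone(monotone_dist, matching))
print(violating_edges(non_monotone_dist, matching))
```

The matching poset used here is the smallest instance of the directed perfect matching mentioned above; a tester only sees samples, which is what makes the problem hard even though the exact check is trivial.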
Attempting to find ways of coping with very strong lower bounds for estimating
the $L_1$ distance between distributions, we have considered the related
problem of estimating the popular Earth Mover’s Distance (EMD) between
distributions. In addition to the above motivation, we note that the $L_1$
distance is not appropriate for continuous or other infinite domains, and when
dealing with applications on such domains, one popular distance metric is the
EMD. In work with Khanh Do Ba, Huy L. Nguyen and Huy N. Nguyen, which
will appear in the Theory of Computing Systems Journal, we study the
problem of estimating the EMD between probability
distributions when given access only to samples of the distributions. We give
closeness testers and additive-error estimators over domains in $[0,1]^d$,
with sample complexities independent of domain size -- permitting the
testability even of continuous distributions over infinite domains. Instead, the
sample complexities of our algorithms depend on the dimension of the domain space and the quality of
the result required. We also prove lower bounds for closeness testing,
showing the dependencies on these parameters to be essentially optimal.
Additionally, we consider whether natural classes of distributions exist for
which there are algorithms with better dependence on the dimension, and
show that for highly clusterable data, this is indeed the case. Lastly, we
consider a variant of the EMD, defined over tree metrics instead of the usual
$\ell_1$ metric, and give tight upper and lower bounds.
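For intuition about the distance itself, the following sketch computes the EMD between two empirical distributions in the simplest setting: one dimension, equal sample counts, and equal weight per sample. In that setting an optimal transport plan pairs the samples in sorted order. The function name is hypothetical, and this is the plain plug-in computation on two sample sets, not the estimators or testers from the work described above.

```python
def emd_1d_equal_size(xs, ys):
    """EMD between two equal-size, equally weighted samples on the real line.

    In one dimension, matching the i-th smallest point of xs to the i-th
    smallest point of ys is optimal, so the EMD is the mean absolute gap.
    """
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same size")
    if not xs:
        raise ValueError("samples must be non-empty")
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


# Two samples from [0, 1]: one spread to the endpoints, one concentrated.
spread = [0.0, 1.0]
concentrated = [0.5, 0.5]
print(emd_1d_equal_size(spread, concentrated))
```

Each unit of mass in this example moves a distance of 0.5, so the computed value is 0.5, whereas the $L_1$ distance would treat the two samples as maximally far apart regardless of how close the points are.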
In addition to the scientific impact, the project has helped fund the training of
graduate students and outreach programs for elementary school children.