2. PUBLISHABLE SUMMARY

The project's objective is to study efficient methods for determining the properties of an underlying probability distribution from samples generated by it. Such probability distributions may occur in the natural sciences or in Internet traffic records. The project studies the sample and time complexity of these problems when the distributions are over very large domains and, in particular, aims to determine when these tasks can be performed with sample and time complexity that is sublinear in the size of the domain. The thrusts suggested in the proposal are to study broader classes of properties than have previously been studied, to find extremely efficient algorithms that work for important special classes of distributions, and to develop algorithms that estimate how close the distribution is to having the property.

Toward broadening the classes of properties that can be tested, joint work with my PhD student Ning Xie, entitled “Testing Non-uniform k-wise Independence”, which appeared in the ICALP 2010 conference, considers the problem of testing whether a distribution has at least a limited form of independence. More specifically, a distribution $D$ over $n$-tuples is called (non-uniform) $k$-wise independent if, for any set of $k$ indices, the projection of $D$ onto those indices is a product distribution. Such distributions have played an important role in algorithm design, complexity and cryptography. For the case when the marginal distributions are uniform, we show an upper bound on the distance from a distribution $D$ to the set of $k$-wise independent distributions in terms of the sum of the Fourier coefficients of $D$ at vectors of weight at most $k$. Such a bound was previously known only for the binary field. For the non-uniform case, we give a new characterization of $k$-wise independent distributions and further show that this characterization is robust.
These results greatly generalize those of Alon et al. on uniform $k$-wise independence over the binary field to non-uniform $k$-wise independence over product spaces. Our results yield natural testing algorithms for $k$-wise independence with time and sample complexity sublinear in the support size when $k$ is a constant. The main technical tools employed include discrete Fourier transforms and the theory of linear systems of congruences.

Joint work with Arnab Bhattacharyya (my PhD student), Eldar Fischer and Paul Valiant, entitled “Testing monotonicity of distributions over general partial orders”, which will appear in the Innovations in Computer Science 2011 conference, broadens the class of problems that can be tested in another direction and extends known lower-bound techniques to more general classes of properties. In this work, we investigate the number of samples required for testing the monotonicity of a distribution with respect to an arbitrary underlying partially ordered set. Our first result is a nearly linear lower bound on the sample complexity of testing monotonicity with respect to the poset consisting of a directed perfect matching. This is the first nearly linear lower bound known for a natural nonsymmetric property of distributions. Testing monotonicity with respect to the matching reduces to testing monotonicity with respect to various other natural posets, which yields corresponding lower bounds for these posets as well. Next, we show that whenever a poset has a linear-sized matching in the transitive closure of its Hasse digraph, testing monotonicity with respect to it requires nearly $\sqrt{n}$ samples. Previous such lower bounds applied only to the total order. We also give upper bounds on the sample complexity in terms of the chain decomposition of the poset. Our results simplify the known tester for the two-dimensional grid and provide the first sublinear upper bound for the Boolean cube.
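
To make the property being tested concrete, the following minimal Python sketch checks monotonicity exactly when the whole distribution is known. It assumes the usual definition, which is not spelled out in this summary: a distribution $p$ over a poset is monotone if $p(x) \le p(y)$ whenever $x$ precedes $y$. The names `is_monotone`, `cube_leq` and `normalize` are hypothetical, and this brute-force check is not the sample-based tester from the paper; it only illustrates the property on the two-dimensional Boolean cube.

```python
from fractions import Fraction
from itertools import product


def is_monotone(prob, leq):
    """Exact check (assumed definition): prob[x] <= prob[y] whenever leq(x, y)."""
    return all(prob[x] <= prob[y] for x in prob for y in prob if leq(x, y))


def cube_leq(x, y):
    """Coordinate-wise partial order on the Boolean cube."""
    return all(a <= b for a, b in zip(x, y))


def normalize(weights):
    """Turn positive integer weights into exact probabilities."""
    total = sum(weights.values())
    return {x: Fraction(w, total) for x, w in weights.items()}


# Illustrative distributions over {0,1}^2 (made-up weights, not data from the paper).
cube = list(product((0, 1), repeat=2))
rising = normalize({x: 1 + sum(x) for x in cube})   # mass grows with the number of ones
falling = normalize({x: 3 - sum(x) for x in cube})  # mass shrinks with the number of ones

print(is_monotone(rising, cube_leq))   # True
print(is_monotone(falling, cube_leq))  # False
```

The check compares all pairs of domain elements, so it takes time quadratic in the domain size and needs the full distribution. The testing question studied above is how few samples suffice when neither is affordable.
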
In an attempt to cope with the very strong lower bounds for estimating the $L_1$ distance between distributions, we have considered the related problem of estimating the popular Earth Mover’s Distance (EMD) between distributions. In addition to the above motivation, we note that the $L_1$ distance is not appropriate for continuous or other infinite domains, and when dealing with applications on such domains, one popular distance metric is the EMD. In work with Khanh Do Ba, Huy L. Nguyen and Huy N. Nguyen, which will appear in the journal Theory of Computing Systems, we study the problem of estimating the EMD between probability distributions when given access only to samples from the distributions. We give closeness testers and additive-error estimators over domains in $[0,1]^d$, with sample complexities independent of the domain size, which permits testing even continuous distributions over infinite domains. Instead, the sample complexities of our algorithms depend on the dimension of the domain space and the quality of the result required. We also prove lower bounds for closeness testing, showing the dependencies on these parameters to be essentially optimal. Additionally, we consider whether natural classes of distributions exist for which there are algorithms with better dependence on the dimension, and show that for highly clusterable data, this is indeed the case. Lastly, we consider a variant of the EMD, defined over tree metrics instead of the usual $\ell_1$ metric, and give tight upper and lower bounds.

In addition to the scientific impact, the project has helped fund the training of graduate students and outreach programs for elementary school children.
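
As an illustration of the sample-access setting for the EMD described above, the following minimal Python sketch computes the EMD between two empirical distributions on the line, that is, the case $d = 1$ of $[0,1]^d$. It is a plug-in sketch under stated assumptions, not the estimator from the paper: both sample sets are assumed to have the same size, each sample carries equal mass, and the function name `empirical_emd_1d` is hypothetical. On the line, an optimal transport plan between two equal-size point sets pairs the points in sorted order.

```python
def empirical_emd_1d(xs, ys):
    """EMD between two equal-size samples on the line, each point weighted 1/n.

    Assumption: len(xs) == len(ys) > 0. Sorting both samples and pairing them
    in order gives an optimal transport plan in one dimension.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Made-up samples in [0, 1]; the second is the first shifted right by 0.5.
print(empirical_emd_1d([0.0, 0.5], [0.5, 1.0]))  # 0.5
```

The value depends only on the samples and never refers to a domain size, which is why a distance of this kind remains meaningful for continuous distributions.
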