Testing Collections of Properties Reut Levi Dana Ron Ronitt Rubinfeld ICS 2011 Shopping distribution What properties do your distributions have? Testing closeness of two distributions: Transactions in California Transactions in New York trend change? Testing Independence: Shopping patterns: Independent of zip code? This work: Many distributions One distribution: D is arbitrary black-box distribution over [n], generates iid samples. D samples Sample complexity in terms Test Pass/Fail? of n? (can it be sublinear?) Some answers… Uniformity (n1/2) [Goldreich, Ron 00] [Batu, Fortnow, Fischer, Kumar, Rubinfeld, White 01] [Paninski 08] Identity (n1/2) [Batu, Fortnow, Fischer, Kumar, Rubinfeld, White 01] Closeness (n2/3) [Batu, Fortnow, Rubinfeld, Smith, White], [Valiant 08] Independence O(n12/3 n21/3), (n12/3 n21/3) [Batu, Fortnow, Fischer, Kumar, Rubinfeld, White 01] , this work Entropy n1/β^2+o(1) [Batu, Dasgupta, Kumar, Rubinfeld 05], [Valiant 08] Support Size (n/logn) [Raskhodnikova, Ron, Shpilka, Smith 09], [Valiant, Valiant 10] Monotonicity on total order (n1/2) [Batu, Kumar, Rubinfeld 04] Monotonicity on poset n1-o(1) [Bhattacharyya, Fischer, Rubinfeld, Valiant 10] Collection of distributions: Further refinement: Known or unknown distribution on i’s? D1 D2 … Dm samples Two models: Sampling model: Get (i,x) for random i, xDi Query model: Get (i,x) for query i and xDi Test Pass/Fail? Sample complexity in terms of n,m? Properties considered: Equivalence All distributions are equal ``Clusterability’’ Distributions can be clustered into k clusters such that within a cluster, all distributions are close Equivalence vs. independence Process of drawing pairs: Draw i [m], x Di output (i,x) Easy fact: (i,x) independent iff Di‘s are equal Results Also yields “tight” lower bound for independence testing Def: (D1,…Dm) has the Equivalence property if Di = Di' for all 1 ≤ i, i’ ≤ m. Lower Bound Upper Bound n>m (n2/3m1/3) Õ(n2/3m1/3) Unknown Weights m>n (n1/2m1/2) Õ(n1/2m1/2) Known Weights Clusterability Can we cluster distributions s.t. in each cluster, distributions (very) close? Sample complexity of test is O(kn2/3) for n = domain size, k = number of clusters No dependence on number of distributions Closeness requirement is very stringent Open Questions • Clusterability in the sampling model, less stringent notion of close • Other properties of collections? • E.g., all distributions are shifts of each other? Thank you