Testing Closeness of Distributions

advertisement
Testing Collections of
Properties
Reut Levi Dana Ron
Ronitt Rubinfeld
ICS 2011
Shopping distribution
What properties do your
distributions have?
Testing closeness of two
distributions:
Transactions in California
Transactions in New York
trend change?
Testing Independence:
Shopping patterns:
Independent of zip code?
This work: Many distributions
One distribution:
 D is arbitrary black-box
distribution over [n],
generates iid samples.
D
samples
 Sample complexity in terms
Test
Pass/Fail?
of n? (can it be sublinear?)
Some answers…
 Uniformity
(n1/2)
[Goldreich, Ron 00] [Batu, Fortnow, Fischer, Kumar, Rubinfeld, White 01] [Paninski 08]
 Identity
(n1/2)
[Batu, Fortnow, Fischer, Kumar, Rubinfeld, White 01]
 Closeness
(n2/3)
[Batu, Fortnow, Rubinfeld, Smith, White], [Valiant 08]
 Independence
O(n12/3 n21/3), (n12/3 n21/3)
[Batu, Fortnow, Fischer, Kumar, Rubinfeld, White 01] , this work
 Entropy
n1/β^2+o(1)
[Batu, Dasgupta, Kumar, Rubinfeld 05], [Valiant 08]
 Support Size (n/logn)
[Raskhodnikova, Ron, Shpilka, Smith 09], [Valiant, Valiant 10]
 Monotonicity on total order
(n1/2)
[Batu, Kumar, Rubinfeld 04]
 Monotonicity on poset n1-o(1)
[Bhattacharyya, Fischer, Rubinfeld, Valiant 10]
Collection of distributions:
Further refinement: Known or
unknown distribution on i’s?
D1 D2
…
Dm
samples
 Two models:
 Sampling model:
 Get (i,x) for random i, xDi
 Query model:
 Get (i,x) for query i and xDi
Test
Pass/Fail?
 Sample complexity in terms
of n,m?
Properties considered:
 Equivalence
 All distributions are equal
 ``Clusterability’’
 Distributions can be clustered into k
clusters such that within a cluster, all
distributions are close
Equivalence vs. independence
 Process of drawing pairs:
 Draw i [m], x  Di output (i,x)
 Easy fact: (i,x) independent iff Di‘s are
equal
Results
Also yields “tight” lower
bound for independence
testing
Def: (D1,…Dm) has the Equivalence property if
Di = Di' for all 1 ≤ i, i’ ≤ m.
Lower Bound
Upper Bound
n>m
(n2/3m1/3)
Õ(n2/3m1/3) Unknown Weights
m>n
(n1/2m1/2)
Õ(n1/2m1/2) Known Weights
Clusterability
 Can we cluster distributions s.t. in each
cluster, distributions (very) close?
 Sample complexity of test is
 O(kn2/3) for n = domain size, k = number of clusters
 No dependence on number of distributions
 Closeness requirement is very stringent
Open Questions
• Clusterability in the sampling model, less
stringent notion of close
• Other properties of collections?
• E.g., all distributions are shifts of each other?
Thank you
Download