Preserving Validity in Adaptive
Data Analysis
Moritz Hardt
IBM Research Almaden
Joint work with Cynthia Dwork, Vitaly Feldman,
Toni Pitassi, Omer Reingold, Aaron Roth
Statistical Estimation
Data domain X, class labels Y
Unknown distribution D over X × Y
Data set S of size n sampled i.i.d. from D
Problem: Given function q : X × Y ⟶ [0,1], want to
estimate expected value of q over D (“statistic”)
Notation:
𝔼D[q] = expected value of q over D
𝔼S[q] = average of q over S
Example: Empirical Loss Estimation
We trained a classifier f : X ⟶ Y
and want to know how good it is
We can estimate the 0/1-loss of the classifier using
the function q(x,y) = 1{ f(x) ≠ y }
𝔼S[q] = empirical loss with respect to sample S
𝔼D[q] = true loss with respect to unknown D
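As a concrete illustration (not part of the talk), here is a minimal Python sketch of this example; the distribution D, the classifier f, and the sample size are all toy choices, and 𝔼D[q] is approximated by a large Monte Carlo sample:

import numpy as np

rng = np.random.default_rng(0)
n = 1000

def sample_from_D(m):
    """Draw m i.i.d. points from a toy distribution D over X x Y."""
    x = rng.normal(size=m)
    y = (x + rng.normal(scale=0.5, size=m) > 0).astype(int)  # noisy labels
    return x, y

def f(x):
    """A fixed classifier f : X -> Y (here: the sign of x)."""
    return (x > 0).astype(int)

def q(x, y):
    """q(x,y) = 1{ f(x) != y }, the 0/1-loss statistic."""
    return (f(x) != y).astype(float)

x_s, y_s = sample_from_D(n)
empirical_loss = q(x_s, y_s).mean()       # E_S[q]
x_big, y_big = sample_from_D(10**6)       # Monte Carlo proxy for E_D[q]
true_loss = q(x_big, y_big).mean()
print(f"E_S[q] = {empirical_loss:.3f}, E_D[q] ~ {true_loss:.3f}")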
Example 2: Statistical Query Model
Function q specifies a statistical query.
An estimate a of 𝔼D[q] is an answer to the query.
We can implement almost all (reasonable)
learning algorithms using sufficiently accurate
answers to statistical queries.
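A hedged Python sketch of this idea (the oracle, distribution, and all names below are illustrative assumptions, not from the talk): the learner touches the data only through an SQ oracle that returns estimates of 𝔼D[q], and uses the answers to build a nearest-centroid classifier.

import numpy as np

rng = np.random.default_rng(1)
d = 5

def sample_from_D(m):
    """Toy distribution: two class-conditional means in [0,1]^d."""
    y = rng.integers(0, 2, size=m)
    means = np.where(y[:, None] == 1, 0.7, 0.3)
    x = np.clip(means + rng.normal(scale=0.1, size=(m, d)), 0, 1)
    return x, y

def sq_oracle(q, m=5000):
    """Answer the statistical query E_D[q] to tolerance ~ 1/sqrt(m)."""
    x, y = sample_from_D(m)
    return q(x, y).mean()

# Estimate class priors and class-conditional means via statistical queries only.
p1 = sq_oracle(lambda x, y: y)                                   # Pr[y = 1]
mean1 = np.array([sq_oracle(lambda x, y, j=j: x[:, j] * y)
                  for j in range(d)]) / p1
mean0 = np.array([sq_oracle(lambda x, y, j=j: x[:, j] * (1 - y))
                  for j in range(d)]) / (1 - p1)

def classifier(x):
    """Nearest-centroid classifier built from SQ answers alone."""
    d0 = np.linalg.norm(x - mean0, axis=1)
    d1 = np.linalg.norm(x - mean1, axis=1)
    return (d1 < d0).astype(int)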
Statistical Validity:
Sample versus Distribution
Definition. A sample estimate a of 𝔼D[q] is
statistically valid if |a − 𝔼D[q]| < o(1)
In particular 𝔼S[q] is statistically valid if
|𝔼S[q] − 𝔼D[q]| < o(1)
Estimation Problem
Def: Given a data set S of size n,
an algorithm estimates the statistic 𝔼D[q] if
it returns a statistically valid estimate of 𝔼D[q]
Problem: Given a data set S of size n,
how many statistics can we estimate?
Non-adaptive estimation
[Diagram: the data analyst fixes queries q1, q2, …, qk in advance; the algorithm, given a sample of size n, returns answers a1, a2, …, ak.]
Non-adaptive estimation is easy
Theorem [Folklore]:
There is an algorithm that estimates exp(n) non-adaptively chosen statistics.
Proof:
Fix functions q1,...,qk.
Sample data set S of size n.
Use Hoeffding bound + union bound to argue
𝔼S[qi] has error o(1) for all 1 ≤ i ≤ k
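Filling in the standard calculation behind this proof (the constants are the usual Hoeffding constants, not stated on the slides): for each fixed qi,

Pr[ |𝔼S[qi] − 𝔼D[qi]| > ε ] ≤ 2·exp(−2ε²n)

and a union bound over k queries fails with probability at most 2k·exp(−2ε²n). This is o(1) whenever k ≤ exp(ε²n), so k = exp(Ω(n)) statistics can be estimated for any constant accuracy ε.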
Adaptive estimation
[Diagram: the data analyst issues queries one at a time; the algorithm, given a sample of size n, answers ai to query qi before the analyst chooses qi+1.]
Naive bound
Theorem [folklore]. There is an algorithm that
can estimate nearly n adaptively chosen
statistics.
Proof:
Partition data into roughly n chunks. Use fresh
data set of small size for estimating each statistic.
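A hedged Python sketch of this sample-splitting estimator (the class and method names are my own illustration): each adaptively chosen query is answered on a fresh chunk, so adaptivity cannot bias the answer; with chunks of size n/k the per-query error is roughly sqrt(k/n).

import numpy as np

class SplitEstimator:
    """Answer up to k adaptive queries, each on a fresh chunk of the sample."""

    def __init__(self, x, y, k):
        self.chunks = list(zip(np.array_split(x, k), np.array_split(y, k)))
        self.next_chunk = 0

    def answer(self, q):
        if self.next_chunk >= len(self.chunks):
            raise RuntimeError("query budget of k statistics exhausted")
        cx, cy = self.chunks[self.next_chunk]
        self.next_chunk += 1
        return q(cx, cy).mean()   # valid: this chunk was never seen before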
Our main result
Theorem. There is an algorithm that estimates
exp(n) adaptively chosen statistics.
Caveat: Running time poly(n,|X|)
where |X| is the domain size.
A computationally efficient estimator
Theorem. There is a computationally efficient
algorithm that can estimate nearly n²
adaptively chosen statistics.
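One simple mechanism in this spirit, shown below as a hedged Python sketch (not necessarily the exact algorithm behind the theorem): perturb each empirical answer with Laplace noise, and rely on DP composition plus the transfer theorem below to keep adaptively chosen answers valid. The noise scale sigma is an illustrative parameter, not a tuned constant.

import numpy as np

rng = np.random.default_rng(2)

def noisy_answer(q, x, y, sigma):
    """Return E_S[q] perturbed with Laplace noise, clipped back to [0, 1]."""
    return float(np.clip(q(x, y).mean() + rng.laplace(scale=sigma), 0.0, 1.0))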
Can we do better?
No!
Theorem [H-Ullman, Ullman-Steinke]:
There is no computationally efficient
algorithm that can estimate n^{2+o(1)} adaptively
chosen statistics.
[Chart: power of the analyst vs. power of the algorithm]
poly-time algorithms: n² adaptive statistics possible; > n² adaptive statistics impossible [HU, US]
exp-time algorithms: exp(n) adaptive statistics possible
exp(n) non-adaptive statistics: possible even in poly-time [folklore]
False Discovery
“Trouble at the Lab” – The Economist
Preventing false discovery
Decades-old subject in statistics
Powerful results such as the Benjamini-Hochberg
procedure for controlling the False Discovery Rate
– 20000+ citations
Theory focuses on non-adaptive hypothesis testing
Adaptivity
Data dredging, data snooping, fishing,
p-hacking, post-hoc analysis,
garden of the forking paths
Some caution strongly against it:
“Pre-registration” — specify entire experimental setup ahead of time
Humphreys, Sanchez, Windt (2013), Monogan (2013)
Adaptivity
The most valuable statistical analyses often arise
only after an iterative process involving the data
— Gelman, Loken (2013)
Our results: Rich adaptive data analysis is
possible with provable confidence bounds.
Differential Privacy
A notion designed for privacy protection in
statistical data analysis.
Here: We’ll use Differential Privacy as a tool to
achieve statistical validity
Differential Privacy
(Dwork-McSherry-Nissim-Smith 06)
Informal Definition. A randomized algorithm A is
differentially private if for any two data sets S,S’ that
differ in at most one element, the random variables
A(S) and A(S’) are “indistinguishable”.
[Figure: output densities of A(S) and A(S′); the ratio of the two densities is bounded by 1 ± o(1) at every output.]
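A minimal sketch of the definition in action, using the standard Laplace mechanism (my choice of example; the slides do not name a specific mechanism): releasing the mean of a data set with entries in [0,1] plus Laplace noise of scale 1/(nε) is ε-differentially private, because changing one of the n points moves the mean by at most 1/n.

import numpy as np

rng = np.random.default_rng(3)

def laplace_mean(s, eps):
    """eps-DP release of the mean of a data set s with entries in [0, 1]."""
    sensitivity = 1.0 / len(s)         # changing one point shifts the mean by <= 1/n
    return np.mean(s) + rng.laplace(scale=sensitivity / eps)

s = rng.random(1000)
s2 = s.copy(); s2[0] = 1.0 - s2[0]     # neighboring data set: one point changed
print(laplace_mean(s, eps=0.1), laplace_mean(s2, eps=0.1))  # nearly identical distributions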
Why Differential Privacy?
DP is a stability guarantee: Changing one sample
point implies small change in output
General theme: Stability implies generalization
– Known in the non-adaptive setting (Bousquet-Elisseeff 2000, …, SSSS10)
Here: analog in the adaptive setting
– new technical challenges due to adaptivity
Why Differential Privacy?
DP comes with strong composition guarantees
The composition of multiple differentially private algorithms is still differentially private.
Other stability notions don’t compose.
Transfer Theorem
Differential privacy and accuracy on the sample
implies accuracy on the distribution
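One common informal quantitative form (constants suppressed; I am paraphrasing, the exact parameters are in the papers): if the algorithm is ε-differentially private and each of its answers satisfies |ai − 𝔼S[qi]| ≤ α, then with high probability |ai − 𝔼D[qi]| ≤ α + O(ε) for every query qi, so choosing ε on the order of α preserves statistical validity.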
Can instantiate transfer theorem
with existing DP algorithms.
Example:
Private Multiplicative Weights [H-Rothblum]
gives main theorem.
Questions?