Preserving Validity in Adaptive Data Analysis
Moritz Hardt, IBM Research Almaden
Joint work with Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer Reingold, Aaron Roth

Statistical Estimation
Data domain X, class labels Y.
Unknown distribution D over X × Y.
Data set S of size n, sampled i.i.d. from D.
Problem: given a function q : X × Y ⟶ [0,1], estimate the expected value of q over D (a "statistic").
Notation:
𝔼_D[q] = expected value of q over D
𝔼_S[q] = average of q over S

Example: Empirical Loss Estimation
We trained a classifier f : X ⟶ Y and want to know how good it is.
We can estimate the 0/1-loss of the classifier using the function q(x,y) = 1{ f(x) ≠ y }:
𝔼_S[q] = empirical loss with respect to the sample S
𝔼_D[q] = true loss with respect to the unknown distribution D

Example 2: The Statistical Query Model
A function q specifies a statistical query; an estimate a of 𝔼_D[q] is an answer to the query.
Almost all (reasonable) learning algorithms can be implemented using sufficiently accurate answers to statistical queries.

Statistical Validity: Sample versus Distribution
Definition. An estimate a of 𝔼_D[q] is statistically valid if |a − 𝔼_D[q]| < o(1).
In particular, 𝔼_S[q] is statistically valid if |𝔼_S[q] − 𝔼_D[q]| < o(1).

The Estimation Problem
Definition: given a data set S of size n, an algorithm estimates the statistic 𝔼_D[q] if it returns a statistically valid estimate of 𝔼_D[q].
Problem: given a data set S of size n, how many statistics can we estimate?

Non-adaptive Estimation
The data analyst submits all queries q_1, q_2, …, q_k up front; the algorithm, given a sample of size n, returns answers a_1, a_2, …, a_k.

Non-adaptive Estimation Is Easy
Theorem [folklore]: there is an algorithm that estimates exp(n) non-adaptively chosen statistics.
Proof: fix the functions q_1, …, q_k and sample a data set S of size n. The Hoeffding bound plus a union bound show that 𝔼_S[q_i] has error o(1) for all 1 ≤ i ≤ k.

Adaptive Estimation
The data analyst now chooses each query after seeing the previous answers: q_1, a_1, q_2, a_2, …, q_k, a_k.

A Naive Bound
Theorem [folklore]. There is an algorithm that can estimate nearly n adaptively chosen statistics.
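A minimal Python sketch of the data-splitting idea behind this naive bound (the function names, chunk sizes, and toy data below are my own illustration, not from the talk):

```python
import random

def split_estimator(sample, queries):
    """Answer queries by spending a fresh chunk of the sample on each
    one, so earlier answers cannot bias later estimates.

    sample  : list of (x, y) pairs drawn i.i.d. from D
    queries : list of functions q(x, y) -> [0, 1]; in a truly adaptive
              setting each query may depend on the earlier answers
    """
    k = len(queries)
    chunk = len(sample) // k  # fresh data points per query
    answers = []
    for i, q in enumerate(queries):
        part = sample[i * chunk:(i + 1) * chunk]
        answers.append(sum(q(x, y) for x, y in part) / len(part))
    return answers

# Toy use: estimate P(y = 1) and P(x > 0.5) on synthetic data.
random.seed(0)
data = [(random.random(), random.randint(0, 1)) for _ in range(1000)]
ans = split_estimator(data, [lambda x, y: y,
                             lambda x, y: float(x > 0.5)])
```

Each answer uses only n/k fresh points, so its error grows like sqrt(k/n); this is why the splitting approach tops out at roughly n adaptive statistics.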
Proof: partition the data into roughly n chunks and use a fresh, small data set for estimating each statistic.

Our Main Result
Theorem. There is an algorithm that estimates exp(n) adaptively chosen statistics.
Caveat: its running time is poly(n, |X|), where |X| is the domain size.

A Computationally Efficient Estimator
Theorem. There is a computationally efficient algorithm that can estimate nearly n^2 adaptively chosen statistics.

Can We Do Better? No!
Theorem [H-Ullman, Ullman-Steinke]. There is no computationally efficient algorithm that can estimate n^(2+o(1)) adaptively chosen statistics.

The landscape (power of the analyst vs. power of the algorithm):
- exp(n) non-adaptively chosen statistics: possible in polynomial time [folklore]
- nearly n^2 adaptively chosen statistics: possible in polynomial time
- more than n^2 adaptively chosen statistics: impossible in polynomial time [HU, US], but exp(n) adaptive statistics are possible in exponential time

False Discovery
"Trouble at the Lab" – The Economist

Preventing False Discovery
A decades-old subject in statistics.
Powerful results, such as the Benjamini-Hochberg procedure for controlling the False Discovery Rate (20,000+ citations).
The theory focuses on non-adaptive hypothesis testing.

Adaptivity
Data dredging, data snooping, fishing, p-hacking, post-hoc analysis, the garden of forking paths.
Some caution strongly against it: "pre-registration" – specify the entire experimental setup ahead of time.
Humphreys, Sanchez, Windt (2013); Monogan (2013)

Adaptivity
"The most valuable statistical analyses often arise only after an iterative process involving the data." – Gelman, Loken (2013)
Our results: rich adaptive data analysis is possible, with provable confidence bounds.

Differential Privacy
A notion designed for privacy protection in statistical data analysis.
Here, we use differential privacy as a tool to achieve statistical validity.

Differential Privacy (Dwork-McSherry-Nissim-Smith 06)
Informal Definition. A randomized algorithm A is differentially private if, for any two data sets S and S' that differ in at most one element, the random variables A(S) and A(S') are "indistinguishable".
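As a concrete illustration of this definition (my own toy example, not from the talk), consider randomized response, one of the simplest differentially private algorithms: each bit of the data set is reported truthfully with probability e^ε / (1 + e^ε) and flipped otherwise. On neighboring data sets, every possible output has probability ratio at most e^ε.

```python
import math
import random
from itertools import product

def randomized_response(bits, eps):
    """eps-differentially private release of a list of bits: each bit
    is kept with probability e^eps / (1 + e^eps), else flipped."""
    p_true = math.exp(eps) / (1 + math.exp(eps))
    return [b if random.random() < p_true else 1 - b for b in bits]

def output_prob(bits, output, eps):
    """Exact probability that randomized response on `bits`
    produces the given output."""
    p_true = math.exp(eps) / (1 + math.exp(eps))
    prob = 1.0
    for b, o in zip(bits, output):
        prob *= p_true if b == o else 1 - p_true
    return prob

# Neighboring data sets: they differ in exactly one element.
S, S2 = [0, 1, 1, 0], [0, 1, 1, 1]
eps = 0.5
out = randomized_response(S, eps)

# Over every possible output, the probability ratio stays within e^eps.
ratios = [output_prob(S, o, eps) / output_prob(S2, o, eps)
          for o in product([0, 1], repeat=4)]
max_ratio = max(max(r, 1 / r) for r in ratios)
```

Here the worst-case ratio is exactly e^ε, realized on the one coordinate where S and S' disagree; coordinates where they agree contribute a factor of 1.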
(Figure: the output densities of A(S) and A(S') are pointwise close – their ratio is bounded by 1 ± o(1).)

Why Differential Privacy?
DP is a stability guarantee: changing one sample point implies only a small change in the output.
General theme: stability implies generalization – known in the non-adaptive setting (Bousquet-Elisseeff 2000, …, SSSS10).
Here: an analog in the adaptive setting, with new technical challenges due to adaptivity.

Why Differential Privacy?
DP comes with strong composition guarantees: the composition of multiple differentially private algorithms is still differentially private.
Other stability notions don't compose.

Transfer Theorem
Differential privacy and accuracy on the sample together imply accuracy on the distribution.
We can instantiate the transfer theorem with existing DP algorithms.
Example: Private Multiplicative Weights [H-Rothblum] gives the main theorem.

Questions?
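To see why adaptive reuse of a single sample is dangerous in the first place, here is a small simulation of the data-dredging effect discussed earlier (my own illustration, not part of the talk): an analyst correlates many random features with random labels on the sample, keeps the sign of each correlation, and votes. The aggregate rule looks far better than chance on the sample, yet its true accuracy is exactly 50% because the labels are independent of the features.

```python
import random

random.seed(0)
n, d = 100, 500  # few samples, many features: ripe for overfitting
# Features and labels are independent fair coin flips (+1 / -1).
X = [[random.choice([-1, 1]) for _ in range(d)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]

# One statistical query per feature: its empirical correlation with
# the labels. The analyst keeps only the sign of each answer.
signs = []
for j in range(d):
    corr = sum(X[i][j] * y[i] for i in range(n)) / n
    signs.append(1 if corr >= 0 else -1)

def vote(x):
    # Majority vote over the sign-corrected features.
    s = sum(sj * xj for sj, xj in zip(signs, x))
    return 1 if s >= 0 else -1

train_acc = sum(vote(X[i]) == y[i] for i in range(n)) / n

# On fresh data from the same distribution, the rule is no better
# than coin flipping: the labels never depended on the features.
m = 2000
Xf = [[random.choice([-1, 1]) for _ in range(d)] for _ in range(m)]
yf = [random.choice([-1, 1]) for _ in range(m)]
test_acc = sum(vote(Xf[i]) == yf[i] for i in range(m)) / m
```

The gap between train_acc and test_acc is exactly the kind of invalid empirical estimate that the transfer theorem rules out.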
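The talk instantiates the transfer theorem with Private Multiplicative Weights; as a much simpler stand-in, the sketch below answers each adaptive statistical query with the Laplace mechanism (empirical mean plus Laplace noise). This only shows the shape of a DP query-answering interface, not the algorithm from the talk, and all names are mine.

```python
import random

class LaplaceSQ:
    """Answer statistical queries q : (x, y) -> [0, 1] with the
    Laplace mechanism: empirical mean plus Laplace(1/(n*eps)) noise.
    Each answer is eps-differentially private, and DP composition
    bounds the total privacy loss over many adaptive queries."""

    def __init__(self, sample, eps_per_query):
        self.sample = sample
        self.eps = eps_per_query

    def answer(self, q):
        n = len(self.sample)
        emp = sum(q(x, y) for x, y in self.sample) / n
        # The sensitivity of an empirical mean of a [0,1]-valued q
        # is 1/n, so Laplace noise of scale 1/(n*eps) suffices.
        scale = 1.0 / (n * self.eps)
        # A Laplace variate is the difference of two exponentials.
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return min(1.0, max(0.0, emp + noise))  # clip to [0, 1]

random.seed(1)
data = [(random.random(), random.randint(0, 1)) for _ in range(5000)]
mech = LaplaceSQ(data, eps_per_query=0.5)
# The analyst may choose each query after seeing earlier answers:
a1 = mech.answer(lambda x, y: y)              # ~ P(y = 1)
a2 = mech.answer(lambda x, y: float(x < a1))  # depends on a1: adaptive
```

With n = 5000 the noise scale is tiny, so the answers remain accurate on the sample; the transfer theorem is what upgrades such DP answers to accuracy on the distribution.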