Reconciling Confidentiality Risk Measures from Statistics and Computer Science
Jerry Reiter, Department of Statistical Science, Duke University

Background for talk
• I am a proponent of unrestricted data access when possible.
• I advocate procedures that have provably valid inferential properties.
• I am focused on the non-interactive setting.
• I am learning differential privacy.

My questions about differential privacy in the non-interactive setting
• What does ε imply in terms of risks?
• Is it wise to be agnostic to the content of the data and how they were collected?
• Why consider all possible versions of the data when determining protection?

Meaning of ε: A simulation study
• Data generation: y ~ 0.9 N(0, 1) + 0.1 N(μ, 1), but bound values to lie within +/- 5 of the component means.
• Data: 9 values from the first component, 1 value from the second component.
• Query is the mean of the 10 observations.
• Add Laplace noise to the mean as in Dwork (2006), with sensitivity |μ| + 10. (Note: the sensitivity was not divided by the sample size and so is conservative.)

Meaning of ε: Set μ = 10
• Adversary knows the first nine values but not the last one, and knows the marginal distribution of Y.
• Let S be the reported (noisy) value of the mean. Writing y_(-10) for the nine known values,
  p(y_10 | S, y_(-10)) ∝ p(S | y_10, y_(-10)) p(y_10 | y_(-10)).
• Straightforward to simulate the posterior distribution.
• The prior implies Pr(Y_10 > max(y_(-10)) | y_(-10)) = .19.

Results from simulation with μ = 10, global sensitivity (1000 runs)
Compute Pr(Y_10 > max(y_(-10)) | S, y_(-10)):
  ε = .01     p = .19
  ε = .1      p = .19
  ε = 1       p = .19
  ε = 10      p = .21
  ε = 100     p = .82
  ε = 1000    p = 1

Results from simulation with μ = 3, global sensitivity (1000 runs)
  ε = .01     p = .18
  ε = .1      p = .18
  ε = 1       p = .18
  ε = 10      p = .18
  ε = 100     p = .43
  ε = 1000    p = .88

Results from simulation with μ = 10, local sensitivity (1000 runs)
Sensitivity = max(data) - min(data).
  ε = .01     p = .82
  ε = .1      p = .82
  ε = 1       p = .82
  ε = 10      p = .89
  ε = 100     p = 1.0
  ε = 1000    p = 1.0

Should the local sensitivity results be dismissed?
• Suppose the data are a census. The support of Y is finite and known to the agency.
• The sensitivity parameter cannot be "close to" max(Y) - min(Y).
• How do we set the sensitivity while preserving accuracy?

Hybrid approach
• Noise is added with the global approach; the intruder uses the local approach.
• This has no justification, but it can give good predictions in this context.

Results from simulation with μ = 10, global sensitivity, local intruder (1000 runs)
  ε = .01     p = .90
  ε = .1      p = .90
  ε = 1       p = .90
  ε = 10      p = .90
  ε = 100     p = .99
  ε = 1000    p = 1.0
Need more study to see whether this approach is too specific to the simulation design.
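To make the mechanics of a single run concrete, here is a minimal sketch assuming the setup above (μ = 10, ε = 1, the unscaled mean query with sensitivity |μ| + 10, and a simple importance-sampling approximation of the intruder's posterior for the unknown tenth value). The variable names and the importance-sampling step are illustrative choices, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps, n = 10.0, 1.0, 10

def draw_prior(size):
    """Draw from the marginal mixture 0.9*N(0,1) + 0.1*N(mu,1),
    with values bounded to lie within +/- 5 of the component means."""
    comp = rng.random(size) < 0.9
    means = np.where(comp, 0.0, mu)
    y = rng.normal(means, 1.0)
    return np.clip(y, means - 5, means + 5)

# Observed data: nine values from the first component, one from the second.
y_known = np.clip(rng.normal(0.0, 1.0, n - 1), -5.0, 5.0)
y10 = np.clip(rng.normal(mu, 1.0), mu - 5, mu + 5)

# Laplace mechanism on the mean query; sensitivity |mu| + 10, not divided by n.
scale = (abs(mu) + 10) / eps
S = np.mean(np.append(y_known, y10)) + rng.laplace(0.0, scale)

# Intruder's posterior for the unknown tenth value:
# draw candidates from the prior and weight by the Laplace likelihood of S.
cand = draw_prior(100_000)
log_w = -np.abs(S - (y_known.sum() + cand) / n) / scale
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Posterior probability that the unknown value exceeds all nine known values.
p = w[cand > y_known.max()].sum()
print(f"Pr(Y10 > max(y_-10) | S, y_-10) ~ {p:.2f}")
```

Repeating this over many generated data sets and a grid of ε values is what produces the probabilities in the tables above: at small ε the posterior probability stays near the prior value of about .19, and it climbs toward 1 as ε grows.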
What about the data?
• Two independent random samples of 6 people walking the streets of LA. Would you say they represent an equal disclosure risk?

What about the data?
  Sex   Age   Partners
  F      31      4
  M      26      2
  M      48     10
  F      52      3
  M      41      2
  F      28      7

What about the data?
  Sex   Age   Partners
  F      31      4
  M      26      2
  M      48     10
  F      52      3
  M      41      2
  F     108      7

What about the data?
• The act of sampling provides protection. Does this fit in differential privacy?
• Even with sampling, some people are at higher risk than others. Does this fit in differential privacy?

Why all versions of data?
  Sex   Age   Partners
  F      31      4
  M      26      2
  M      48     10
  F      52      3
  M      41      2
  F      28      7
• DP: consider all ages in the support.

Statistical approaches: Main types of disclosure risks
• Identification disclosure: match a record in the released data with a target; learn that someone participated in the study.
• Attribute disclosure: learn the value of a sensitive variable for a target.

Measures of identification disclosure risk
• Number of population uniques:
  - Does not incorporate intruders' knowledge.
  - May not be useful for numerical data.
  - Hard to gauge the effects of SDL procedures.
• Probability-based methods (direct matching using external databases; indirect matching using the existing data set):
  - Require assumptions about intruder behavior.
  - May be costly to obtain external databases.

Identification disclosure risks in microdata
• Context: Survey of Earned Doctorates (SED).
• Intruder knows the target's characteristics, e.g., age, field, gender, race (?), citizenship (?), available from public records.
• Intruder searches for people with those characteristics in the released SED files.
• If a small number of people match, the intruder claims that the target participated in the SED.

Assessing identification disclosure risk: The intruder's actions
• Released data on n records: Z.
• Information about disclosure protection: M.
• Target: t (man, statistics, 1999, white, citizen).
• Let J = j when record j in Z matches t. Let J = n + 1 when the target is not in Z.
• For j = 1, ..., n + 1, the intruder computes Pr(J = j | t, Z, M).
• The intruder selects the j with the highest probability.

Assessing identification disclosure risk
Let Y_S be the true values of the perturbed data. Then
  Pr(J = j | t, Z, M) = ∫ Pr(J = j | t, Z, M, Y_S) Pr(Y_S | t, Z, M) dY_S
                      = ∫ Pr(J = j | t, Z, M, Y_S) [Pr(Z | t, Y_S, M) Pr(Y_S | t, M) / Pr(Z | t, M)] dY_S.

Calculation
CASE 1: Target assumed to be in Z.
• Records that do not match the target's values have zero probability.
• For matches, the probability equals 1/n_t, where n_t is the number of matches.
• If there are no matches, use 1/n_t*, where n_t* is the number of matches in the unperturbed data.
• The probability equals zero for j = n + 1.

Calculation
CASE 2: Target not assumed to be in Z.
• Units that do not match the target's values have zero probability.
• For matches, the probability is 1/N_t, where N_t is the number of matches in the population.
• For j = n + 1, the probability is (N_t - n_t) / N_t.
(A code sketch of these two cases appears at the end of these notes.)

Implications
1. Clear interpretation of risk for given assumptions (encoded in the prior distribution).
2. Specific to the collected data.
3. Incorporates sampling.
4. Incorporates information released about the protection scheme.
5. Could in principle incorporate measurement error and missing data.

Implications
1. Can incorporate strong assumptions like those in CS, e.g., the intruder knows all values but one.
2. Provides risk measures under a variety of assumptions to enable decision making under uncertainty.
3. Provides record-level risk measures useful for targeted protection.
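As a companion to the Calculation slides, here is a minimal sketch of the CASE 1 and CASE 2 match probabilities, assuming exact matching on the released key variables. The function names, the record representation, and the handling of the no-match fallback are illustrative choices, not details from the talk.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, ...]  # e.g., (sex, field, year, race, citizenship)

def match_probs_case1(Z: List[Record], Y_unperturbed: List[Record],
                      t: Record) -> Dict[int, float]:
    """CASE 1: target assumed to be in Z.
    Matching records share probability 1/n_t; everything else, including
    j = n+1 ("target not in Z"), gets zero.  If nothing in the released Z
    matches, fall back to the n_t* matches in the unperturbed data."""
    n = len(Z)
    probs = {j: 0.0 for j in range(1, n + 2)}   # records 1..n, plus j = n+1
    matches = [j for j, z in enumerate(Z, start=1) if z == t]
    if not matches:
        matches = [j for j, y in enumerate(Y_unperturbed, start=1) if y == t]
    for j in matches:
        probs[j] = 1.0 / len(matches)
    return probs

def match_probs_case2(Z: List[Record], t: Record, N_t: int) -> Dict[int, float]:
    """CASE 2: target not assumed to be in Z.
    Matching records each get 1/N_t, where N_t counts matches in the
    population; j = n+1 gets (N_t - n_t) / N_t."""
    n = len(Z)
    probs = {j: 0.0 for j in range(1, n + 2)}
    matches = [j for j, z in enumerate(Z, start=1) if z == t]
    for j in matches:
        probs[j] = 1.0 / N_t
    probs[n + 1] = (N_t - len(matches)) / N_t
    return probs
```

For example, with three released records matching t and N_t = 5 matches in the population, CASE 2 assigns probability 1/5 to each matching record and 2/5 to the outcome that the target is not in Z.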