Reconciling Confidentiality Risk Measures from Statistics and Computer Science

Jerry Reiter
Department of Statistical Science
Duke University
Background for talk

• I am a proponent of unrestricted data access when possible.
• I advocate procedures that have provably valid inferential properties.
• I am focused on the non-interactive setting.
• I am learning differential privacy.
My questions about differential privacy in the non-interactive setting

• What does ε imply in terms of risks?
• Is it wise to be agnostic to the content of the data and how they were collected?
• Why consider all possible versions of the data when determining protection?
Meaning of ε: A simulation study

• Data generation: y ~ .9 N(0, 1) + .1 N(μ, 1), but bound values to lie within +/- 5 of the component means.
• Data: 9 values from the first component, 1 value from the second component.
• Query is the mean of the 10 observations.
• Add Laplace noise to the mean as in Dwork (2006), with sensitivity |μ| + 10; see the sketch below.
  (NOTE: the sensitivity was not divided by the sample size and so is conservative.)
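To make the setup concrete, here is a minimal Python sketch of the data generation and the Laplace mechanism described above; the function names, the random seed, and the example ε are illustrative assumptions, not the talk's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

def generate_data(mu, n1=9, n2=1):
    """Draw 9 values from N(0,1) and 1 from N(mu,1), bounded to +/- 5 of each mean."""
    y1 = np.clip(rng.normal(0.0, 1.0, n1), -5.0, 5.0)
    y2 = np.clip(rng.normal(mu, 1.0, n2), mu - 5.0, mu + 5.0)
    return np.concatenate([y1, y2])

def noisy_mean(y, eps, sensitivity):
    """Laplace mechanism: release mean(y) plus Laplace(0, sensitivity / eps) noise.
    The sensitivity is deliberately NOT divided by the sample size (conservative)."""
    return y.mean() + rng.laplace(0.0, sensitivity / eps)

mu, eps = 10.0, 1.0
y = generate_data(mu)
S = noisy_mean(y, eps, sensitivity=abs(mu) + 10.0)  # global sensitivity |mu| + 10
```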
Meaning of ε: Set μ = 10

• Adversary knows the first nine values y_{-10} but not the last one, y_10. Knows the marginal distribution of Y.
• Let S be the reported (noisy) value of the mean. Then
    p(y_10 | S, y_{-10}) ∝ p(S | y_10, y_{-10}) p(y_10 | y_{-10})
• Straightforward to simulate the posterior distribution (see the sketch below).
• The prior implies
    p(Y_10 > max(y_{-10}) | y_{-10}) ≈ .19
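One way to simulate this posterior is importance sampling: draw candidate values of y_10 from the mixture prior and weight each by the Laplace likelihood of the released value S. The sketch below is my own construction under that assumption; the function name, seed, and number of draws are arbitrary.

```python
import numpy as np
from scipy.stats import laplace

rng = np.random.default_rng(1)

def prob_exceeds_max(S, y_known, mu, eps, sensitivity, n_draws=100_000):
    """Estimate p(Y_10 > max(y_-10) | S, y_-10) by drawing candidate y_10 values
    from the (truncated) mixture prior and weighting each by the Laplace
    likelihood of the released noisy mean S."""
    from_second = rng.random(n_draws) < 0.1   # which mixture component each draw comes from
    y10 = np.where(from_second,
                   np.clip(rng.normal(mu, 1.0, n_draws), mu - 5.0, mu + 5.0),
                   np.clip(rng.normal(0.0, 1.0, n_draws), -5.0, 5.0))
    cand_means = (y_known.sum() + y10) / (len(y_known) + 1)   # candidate true means
    w = laplace.pdf(S, loc=cand_means, scale=sensitivity / eps)  # likelihood weights
    return float(np.sum(w * (y10 > y_known.max())) / np.sum(w))

# Example use with the values from the earlier sketch (assumes y and S defined there):
# p = prob_exceeds_max(S, y[:9], mu=10.0, eps=1.0, sensitivity=20.0)
```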
Results from simulation w/ μ=10
Global sensitivity (1000 runs)

Compute p(Y_10 > max(y_{-10}) | S, y_{-10}):

  ε = .01     p = .19
  ε = .1      p = .19
  ε = 1       p = .19
  ε = 10      p = .21
  ε = 100     p = .82
  ε = 1000    p = 1
Results from simulation w/ μ=3
Global sensitivity (1000 runs)

  ε = .01     p = .18
  ε = .1      p = .18
  ε = 1       p = .18
  ε = 10      p = .18
  ε = 100     p = .43
  ε = 1000    p = .88
Results from simulation w/ μ=10
Local sensitivity (1000 runs)

Sensitivity = max(data) – min(data) (sketched below).

  ε = .01     p = .82
  ε = .1      p = .82
  ε = 1       p = .82
  ε = 10      p = .89
  ε = 100     p = 1.0
  ε = 1000    p = 1.0
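Under this variant only the noise scale changes. A short continuation of the earlier sketch, reusing the hypothetical generate_data and noisy_mean helpers (and mu, eps) defined there:

```python
# Local-sensitivity variant: calibrate the Laplace noise to the observed range
# of the realized data instead of the global bound |mu| + 10
# (reuses generate_data, noisy_mean, mu, eps from the sketch above).
y = generate_data(mu)
local_sens = y.max() - y.min()          # sensitivity = max(data) - min(data)
S_local = noisy_mean(y, eps, sensitivity=local_sens)
```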
Should the local sensitivity results be dismissed?

• Suppose the data are a census. The support of Y is finite and known to the agency.
• The sensitivity parameter cannot be “close to” max(Y) – min(Y).
• How to set the sensitivity while preserving accuracy?
Hybrid approach

• Noise is added with the global approach.
• The intruder uses the local approach.
• This has no justification, but it can give good predictions in this context (sketched below).
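In code, the hybrid simply mixes the two earlier sketches: the release uses the global scale, while the intruder's likelihood assumes the local scale. This reuses the hypothetical helpers and variables defined above.

```python
# Hybrid: release with the global scale, but the intruder's likelihood assumes
# the local scale (reuses y, mu, eps, noisy_mean, prob_exceeds_max from above).
S = noisy_mean(y, eps, sensitivity=abs(mu) + 10.0)              # global noise
p_hybrid = prob_exceeds_max(S, y[:9], mu, eps,
                            sensitivity=y.max() - y.min())      # local intruder
```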
Results from simulation w/ μ=10
Global sensitivity, local intruder (1000 runs)

  ε = .01     p = .90
  ε = .1      p = .90
  ε = 1       p = .90
  ε = 10      p = .90
  ε = 100     p = .99
  ε = 1000    p = 1.0

Need more study to see if this approach is too specific to the simulation design.
What about the data?

• Two independent random samples of 6 people walking the streets of LA.
• Would you say they represent an equal disclosure risk?
What about the data?

  Sex   Age   Partners
  F     31    4
  M     26    2
  M     48    10
  F     52    3
  M     41    2
  F     28    7
What about the data?

  Sex   Age   Partners
  F     31    4
  M     26    2
  M     48    10
  F     52    3
  M     41    2
  F     108   7
What about the data?

• The act of sampling provides protection. Does this fit in differential privacy?
• Even with sampling, some people are at higher risk than others. Does this fit in differential privacy?
Why all versions of data?

  Sex   Age   Partners
  F     31    4
  M     26    2
  M     48    10
  F     52    3
  M     41    2
  F     28    7
DP: consider all ages in the support.
Statistical approaches: condition on the ages actually collected.
Main types of disclosure risks

• Identification disclosure
  Match record in released data with target; learn someone participated in the study.

• Attribute disclosure
  Learn value of sensitive variable for target.
Measures of identification disclosure risk

• Number of population uniques
  Does not incorporate intruders’ knowledge.
  May not be useful for numerical data.
  Hard to gauge effects of SDL procedures.

• Probability-based methods
  (Direct matching using external databases. Indirect matching using the existing data set.)
  Require assumptions about intruder behavior.
  May be costly to obtain external databases.
Identification disclosure risks in microdata
Context: Survey of Earned Doctorates (SED)

• Intruder knows the target’s characteristics, e.g., age, field, gender, race (?), citizenship (?), available from public records.
• Searches for people with those characteristics in the released SED files.
• If a small number of people match, claims that the target participated in the SED (a toy sketch follows).
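A toy illustration of this matching attack in Python/pandas; the column names and records below are hypothetical, not the actual SED layout or data.

```python
import pandas as pd

# Hypothetical released file with quasi-identifiers (not real SED data).
released = pd.DataFrame({
    "gender":      ["M", "F", "M", "M"],
    "field":       ["statistics", "statistics", "biology", "statistics"],
    "phd_year":    [1999, 1999, 1998, 1999],
    "race":        ["white", "white", "asian", "white"],
    "citizenship": ["US", "US", "non-US", "US"],
})

# Target's characteristics, assumed known from public records.
target = {"gender": "M", "field": "statistics", "phd_year": 1999,
          "race": "white", "citizenship": "US"}

# Records whose quasi-identifiers all match the target.
matches = released.loc[(released[list(target)] == pd.Series(target)).all(axis=1)]
print(len(matches))  # a small match count suggests the target participated in the SED
```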
Assessing identification disclosure risk: The intruder’s actions

• Released data on n records: Z.
• Information about disclosure protection: M.
• Target: t (man, statistics, 1999, white, citizen).

Let J = j when record j in Z matches t.
Let J = n + 1 when the target is not in Z.
For j = 1, …, n+1, the intruder computes Pr(J = j | t, Z, M).
The intruder selects the j with the highest probability.
Assessing identification disclosure risk

Let Y_S be the true values of the perturbed data. Then

  Pr(J = j | t, Z, M)
    = ∫ Pr(J = j | t, Z, M, Y_S) Pr(Y_S | t, Z, M) dY_S
    = ∫ Pr(J = j | t, Z, M, Y_S) [ Pr(Z | t, Y_S, M) Pr(Y_S | t, M) / Pr(Z | t, M) ] dY_S
Calculation
CASE 1: Target assumed to be in Z

• Records that do not match the target’s values have zero probability.
• For matches, probability equals 1/n_t, where n_t is the number of matches.
• If no matches, use 1/n_t*, where n_t* is the number of matches in the unperturbed data.
• Probability equals zero for j = n+1.
Calculation
CASE 2: Target not assumed to be in Z

• Units that do not match the target’s values have zero probability.
• For matches, probability is 1/N_t, where N_t is the number of matches in the population.
• For j = n+1, probability is (N_t – n_t) / N_t (both cases are sketched in code below).
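The two cases can be summarized in a small function; the signature and variable names below are my own, with n_t, n_t*, and N_t following the definitions above.

```python
import numpy as np

def match_probabilities(match_idx, n, target_in_Z, N_t=None):
    """Sketch of Pr(J = j | t, Z, M) for j = 1..n+1, following the two cases above.

    match_idx   : indices of released records whose values match the target t
    target_in_Z : CASE 1 if True, CASE 2 if False
    N_t         : number of matches in the population (needed for CASE 2)
    """
    probs = np.zeros(n + 1)                      # last slot holds Pr(J = n+1)
    n_t = len(match_idx)
    if target_in_Z:                              # CASE 1: target assumed to be in Z
        if n_t > 0:
            probs[match_idx] = 1.0 / n_t         # each matching record gets 1 / n_t
        # (if no released record matches, the slides use 1 / n_t*, the match count
        #  in the unperturbed data, which would have to be supplied separately)
        probs[n] = 0.0                           # Pr(J = n+1) = 0
    else:                                        # CASE 2: target may not be in Z
        if n_t > 0:
            probs[match_idx] = 1.0 / N_t         # each matching record gets 1 / N_t
        probs[n] = (N_t - n_t) / N_t             # probability the target is not in Z
    return probs

# Illustrative use: 3 matching records among n = 100, CASE 2 with N_t = 12 population matches.
p = match_probabilities([4, 17, 63], n=100, target_in_Z=False, N_t=12)
```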
Implications

1. Clear interpretation of risk for given assumptions (encoded in the prior distribution).
2. Specific to the collected data.
3. Incorporates sampling.
4. Incorporates information released about the protection scheme.
5. Could in principle incorporate measurement error and missing data.
Implications

1. Can incorporate strong assumptions like those in CS, e.g., the intruder knows all values but one.
2. Provides risk measures under a variety of assumptions to enable decision making under uncertainty.
3. Provides record-level risk measures useful for targeted protection.