IPAM 2010

advertisement
IPAM 2010
Privacy Protection from Sampling
and Perturbation in Surveys
Natalie Shlomo and Chris Skinner
Southampton Statistical Sciences
Research Institute
1
Topics:
1 Social surveys in official statistics
2 Releasing survey microdata
3 Disclosure limitation methods for survey microdata
4 Assessing disclosure risk in survey microdata
5. Misclassification/perturbation
6 Differential privacy definition
7 Differential privacy under:
Sampling
Misclassification / Perturbation
Sampling and Misclassification / Perturbation
8 Conclusions
2
Social Surveys in Official Statistics
• Examples: Labour Force Survey, Family Expenditure
Survey, Social Survey
• Unit of investigation household or individuals
• Sample fractions small (generally from about 0.01%
to 0.05%)
• Multistage sampling
– PSU is an address
– SSU is a household
– USU is an individual (or all individuals) in a
household
• Population characteristics are NOT known
3
Releasing Survey Microdata
Agencies release microdata from surveys arising from
social surveys:
• Safe settings – data is licensed out to users through
data archives, microdata under contract, safe data labs
• Safe data – depending on the mode of release, Agencies
protect data through disclosure limitation methods
4
Disclosure Limitation Methods in Survey
Microdata
• Sampling from the population
• Broad banding, eg. age groups, high geographies
• Recoding by collapsing categories, eg. top or bottom
categories
• Categorizing continuous variables, eg. income or
expenditures
• Deletion of high-risk variables
5
Disclosure Limitation Methods in Survey
Microdata
High risk records with respect to population uniqueness
might add perturbative method:
• Delete-impute method
• Post-randomisation method
• Data swapping
Reweighting and post-editing to preserve logical
consistencies
6
Assessing disclosure risk in sample
microdata
Assumptions:
• Two types of variables:
Key variables - categorical variables that are identifying,
visible, traceable and/or common with publicly available
datasets containing population units
Key – cells of cross-classification of key variables
k=1,…,K
Sensitive variables – all other variables in the dataset
7
Assessing disclosure risk in sample
microdata
Assumptions (cont):
• Main concern is risk of identification
- Notion of population uniqueness, eg. ability to link
unit in microdata to a unit in the population
through key variables which may be known to an
intruder
-
Risk of identification used in statistics acts,
confidentiality pledges to respondents and codes of
practice
- Identification is a pre-requisite for attribute and
inferential disclosure from sensitive variables (what
an intruder may learn)
8
Assessing disclosure risk in sample
microdata
Assumptions (cont):
• Intruder has no direct knowledge of what units fall into the
sample and sampling is part of the SDL mechanism
• Intruder has prior knowledge relating to some aspects of
the population
• First assume no misclassification/perturbation error on the
key variables and SDL method based on non-perturbative
methods (we later drop this assumption)
9
Assessing disclosure risk in
sample microdata
Notation:
U population with elements i=1,…,N
s sample drawn with specified probability sampling scheme
p(s) from U containing elements i=1,..,n
Let xi be 1*K vector indicating which cell population unit i
belongs to
Denote: ej the 1*K vector with a 1 in the jth position and
zeros everywhere else so that for each i=1,…N we have
xi=ej for some j {1,2,..., K }
10
Assessing disclosure risk in
sample microdata
Xu matrix of dimension N*K with rows Xi for elements in U
Xs matrix of dimension n*K with rows for elements of s
Xs is the data known to the Agency
Xu includes values of variables from non-sampled units
which are typically unknown
Agency releases: f  ( f1 ,..., f K )  1Tn X s
F  ( F1 ,..., FK )  1TN X U
unknown
11
Assessing disclosure risk in
Sample Microdata
For identity disclosure interested in evidence that a
particular record belongs to a known individual
Disclosure risk measures:
Probability of a population unique given a sample unique:
p( Fk  1| f k  1)
Expected number of correct matches for a sample unique:
E(
1
| f k  1)
Fk
Aggregate to obtain global risk measures
12
Assessing disclosure risk in
sample microdata
Estimation based on probabilistic model since population
counts are unknown
Natural assumption in contingency table literature: Fk ~ Pois(k )
From sampling:
And
f k | Fk ~ Bin ( Fk , k )
Fk | f k ~ Pois (k (1   k ))
Estimation by:
Fj | f j
so: f k ~ Pois ( k k )
conditionally independent
p ( Fk  1 | f k  1)  e  k (1 k )
E (1 / Fk | f k  1) 
1
(1  e k (1 k ) )
k (1   k )
13
Assessing disclosure risk in
sample microdata
f k ~ Pois ( k   k k )
From
estimate parameters:
use log linear models to
log k  xk 
based on the contingency table of sample counts spanned
by the identifying key variables
Estimated parameters: uˆ k  exp( xk ˆ )
and
ˆk  ˆ k /  k
Plug into estimated risk measures
14
Misclassification / Perturbation
Misclassification arises naturally from survey processes
Perturbation introduced as a SDL technique
Assume
~
Xs
is a misclassified or perturbed version of X s
The SDL techniques leads to an arbitrary ordering of the
records in the microdata, so the released data is
~ ~
~
T ~
f  ( f1 ,..., f K )  1n X s
Misclassification/ perturbation assumed generated through
a probability mechanism:
Pr( ~
xi  e j1 | xi  e j 2 )  M j1 j 2 , i  1,..., n j1 , j2  1,..., K
15
Assessing disclosure risk in perturbed
sample microdata
~
X
~
Let
be denoted X j
for j  {1,..., n}
Let their identities be denoted by B1 ,..., Bn
Let j(i) be the microdata label for population unit i.
We are interested in:
pij  Pr( B j  i ), i  1,..., N , j  1,..., n
16
Assessing disclosure risk in perturbed
sample microdata
Suppose we observe
known population unit
~
X j (with identity B j
D  {1,2,..., N }
) and
X D for
Identification risk  Pr( ED ) / iU Pr( Ei )
Where:
- Ei is the event that population unit i is sampled
~
X
- its value j (i ) matches X D
- no other population unit is both sampled and has a value
~
X
of
which matches X D
17
Assessing disclosure risk in perturbed
sample microdata
~
X j (k )  e j
Let
and
independently
~
X i  ek
and sample selected
Pr( Ei )   j M jk /(1   j M jk )
 j   j  (1   j M jl ) F
We obtain:
where
l
and Fl is the number of units in the population with
Identification risk
Estimated by
l
X  el
 [ M jj /(1   j M jj )] /[ k Fk M jk /(1   j M jk )]
 M jj /( k Fk M jk )
~
 M jj / F j
~ ~
E (1 / F j | f j  1)
as before multiplied by M jj
18
Misclassification / Perturbation Matrix
Examples:
Natural misclassification errors – measured by quality
studies, eg. miscoding in data capture, misreporting by
respondents, etc.
Perturbation errors (categorical variables):
Random Data Swapping:
n
 
Probability of selecting any 2 records for swapping:  2 
Let nj and nk be the number of records taking values j and k
nj 
 
2 
Q
n j nk
M
After Q swaps:
M jk  M kj 
, M jj   
n
n
 
 
2
2
19
Misclassification / Perturbation
Examples (cont):
Pram: Use a probability mechanism
random changes across categories
M  {M jk }
Define the property of invariance on M: vM  v
is the vector of sample proportions:
nK 
 n1
v   ,..., 
n
n
to make
where
v
to ensure that perturbed marginal distribution is similar to
the original marginal distribution
20
Misclassification / Perturbation
Example of a non-perturbative method:
Recoding Categories 1 to a into 1:
1 k  1,..., a and j  1, or j  k  a  1 and j  1
M jk  
0 otherwise
21
Differential privacy definition
Dwork, et al. (2006) refer to a database and attribute
(inferential) disclosure of a target unit where the intruder
has knowledge of the entire database except for the target
unit
No distinction between key variables and sensitive
variables
In a survey setting, two notions of database: X U and X s
- X s defines the data collected
- X U includes non-sampled units which are unknown to
the agency
22
Differential privacy definition
Assume:
Intruder has no knowledge of who is in the sample
Sampling part of SDL mechanism
Intruder’s prior knowledge relates to X U
Let Pr(f | X U ) denote the probability of f under the
sampling mechanism, where X U
is fixed:

- differential privacy holds if:
where maximum is over all pairs
( X U(1) , X U( 2 ) ) which differ in one row and
all possible values of f
 P(f | X (1)

U )
 
max ln 
(2) 
 P(f | X U 
for some   0
23
Differential privacy definition
Let S (f )
denote set of samples s with
T
1
n Xs  f
for which
Then
Pr[f | X U ] 
p( s)  0
 p(s)
sS ( f )
Example: simple random sampling of size n
 Fj 
Pr[f | X U ]    
 
j 1  f j 
k
N
 
n 
24
Does Sampling ensure differential privacy?
Let i0 be a target unit and suppose the intruder wishes to
learn about xi0
( i )
Let Fj 0
be the count in cell j in the population
excluding i0 , so
Fj  Fj( i0 )  I ( xi0  j )
Intruder knows all units in the population except the target
( i0 )
F
unit, so j
is known
f j  Fj( i0 )  1
Suppose intruder knows that
Then since f j  F j
the intruder can infer xi  j and
attribute disclosure arises
0

- differential privacy does not hold
25
Differential privacy under sampling
Is  -differential privacy too strong a condition?
1.Small sample fractions in social surveys
2. Proportions typically released with sufficiently wide
confidence intervals to ensure uncertainty, i.e. sample
count generally not equal to population count
Dwork, et al. (2006) discuss low ‘sensitivity’ for a
sample proportion requiring small perturbation
3. All possible ‘population databases’ are considered
(even ones that are implausible)
26
Differential privacy under sampling
4. Population database not known to the Agency and an
intruder would not know the whole database except
one target unit (the population is a random variable)
5. All possible samples considered when in fact, only
one is drawn (the sample is fixed) to which SDL
methods applied
To obtain  -differential privacy - need to avoid the
case where f j  F j and in particular no population
uniques
27
Differential privacy under sampling
How likely is it to get a population unique (leakage)?
Assume: N=100,000, n=2,000 (2% sample)
11 key variables generated {0,1} with Bernoulli (0.2)
Count number of cases where f j  F j (typically only
relevant for f j  Fj  1 ):
Calculate the proportion of population uniques out of sample
uniques
Average across 1000 populations and samples:
‘leakage’=0.016,
28
Differential privacy under sampling
Population not known and the probability of a population
unique given a sample unique under SRS is estimated
by:
Pr( Fk  1 | f k  1)  e  k (1 )
Pr(PU|SU)
Probability of Population Unique (1:50 Sample)
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Lamda k
29
Differential privacy under perturbation
Assume no sampling and independent misclassification
~
Pr[ X U | X U(1) ]   Pr( ~
xi | xi(1) )
iU
Suppose
X U(1)
( 2)
differs from X U only in row i, so that
xi(1)  xi( 2 )
~
M ~j j (1)
Pr[ X U | X U(1) ]
Pr( ~
xi | xi(1) )


~
Pr( ~
xi | xi( 2 ) )
M ~j j ( 2 )
Pr[ X U | X U( 2 ) ]
where
~ (1) ( 2)
j, j , j
are the entries of
~
xi , xi(1) , xi( 2 )
30
Differential privacy under perturbation
There exists a finite  for which  -differential privacy
holds iff all elements of M are positive
~
M ~j j (1)
Pr[ X U | X U(1) ]
max ln
 ~ max
ln
 max ~j (ln[max j M ~j j ]  ln[min
~
( 2)
(1)
( 2)
j
,
j

j
M
Pr[ X U | X U ]
~ ( 2)
jj
j
M ~j j ])
(1)
Write f before and after change denoted by f
and f ( 2)
| f (1)  f ( 2) | 2
where
(Abowd and Vilhuber, 2008)
~
Pr[ f | f (1) ]
max ln
~ ( 2)  
Pr[ f | f ]
31
Differential privacy under perturbation
Considering categorical key variables, some forms of
random data swapping, PRAM and other synthetic data
generating methods preserve differential privacy
Non-perturbative methods do not preserve differential
privacy
Remark: often to preserve logical consistency of data,
edit rules are used in the form of structural zeros in M
32
Differential Privacy under sampling and
perturbation
Assume two step process
- X s with probability p( s)  0
~
X
- generation of s
~
~
Pr[ f | X U ]   p ( X s  ~
x | X s ) p( s)
Write:
~
x
s
Differential privacy will not hold for some  iff there is
a pair ( X , X ) which differ only in one row and for
which Pr[~f | X ]  0 and Pr[~f | XU(2) ]  0
(1)
U
( 2)
U
(1)
U
~
If all elements of M are positive  p( X
s such that 1TK f  1TK ~f
~
x
s
~
x | Xs)  0
for any
33
Conclusions
Sampling does not preserve differential privacy when
sample counts equal population counts
- generally Agencies avoid these cases by small
sampling fractions and implementing SDL methods
Misclassification/perturbation preserves differential
privacy if the probability mechanism has all positive
elements
34
Download