Definition of Disclosure Risk

WP 10
On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation
Yosi Rinott
Hebrew University
Natalie Shlomo
Hebrew University
Southampton University
Disclosure Risk Measures
Notation:
Sample (size $n$): $f = \{f_k : k = 1, \dots, K\}$
Population (size $N$): $F = \{F_k : k = 1, \dots, K\}$
Tables with $K$ cells: $k = (k_1, \dots, k_m)$, an $m$-way table
Risk Measures:
$\tau_1 = \sum_k I(f_k = 1, F_k = 1)$
$\tau_2 = \sum_k I(f_k = 1) \frac{1}{F_k}$ = expected number of correct matches of sample uniques
Estimates:
$\hat{\tau}_1 = \sum_k I(f_k = 1) \hat{P}(F_k = 1 \mid f_k = 1)$
$\hat{\tau}_2 = \sum_k I(f_k = 1) \hat{E}[1/F_k \mid f_k = 1]$
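As a minimal sketch of the two risk measures, the following computes them directly when both the sample counts and the population counts are known. The function name and the toy counts are illustrative assumptions, not data from the slides.

```python
import numpy as np

def true_risk_measures(f, F):
    """Return (tau1, tau2) for sample counts f and population counts F."""
    f = np.asarray(f)
    F = np.asarray(F)
    sample_unique = f == 1
    # tau1: number of sample uniques that are also population uniques.
    tau1 = int(np.sum(sample_unique & (F == 1)))
    # tau2: sum of 1/F_k over sample uniques (expected number of correct matches).
    tau2 = float(np.sum(1.0 / F[sample_unique]))
    return tau1, tau2

# Toy table with K = 5 cells (made-up counts).
f = [1, 1, 0, 2, 1]
F = [1, 4, 3, 10, 2]
tau1, tau2 = true_risk_measures(f, F)
print(tau1, tau2)
```

In practice the population counts are unknown, which is why the estimates replace the indicator and the reciprocal by model-based conditional quantities.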
On Definitions of Disclosure Risk
• In the statistics literature, we present examples of risk measures, $\tau_1$ and $\tau_2$, but we lack formal definitions of when a file is safe.
• In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5); Adam and Wortman (1989), who write “it may be argued that elimination of disclosure is possible only by elimination of statistics”).
In some of the CS literature any data must be released with noise.
The noise must be small enough so that legitimate information on large subsets of the data will be useful, and large enough so that information on small subsets or individuals will be too noisy and therefore useless (regardless of whether it is obtained by direct queries, differencing, etc.).
On Definitions of Disclosure Risk
The worst-case scenario of the CS approach (for example, that the intruder has all information on anyone in the data set except the individual being snooped) simplifies definitions: there is no need to consider other, more realistic but more complicated scenarios.
But would Statistics Bureaus and statisticians agree to adding
noise to any data?
Other approaches like query restriction or query auditing do not
lead to formal definitions.
Definition of Disclosure Risk
Numerical Data Base $D = \{d_i : i = 1, \dots, n\}$, $d_i \in \{0,1\}^m$.
A Query is a sum over a subset of the $d_i$. The Query is perturbed by adding some noise of magnitude $\varepsilon$.
Proven that almost all $d_i$ can be reconstructed if $\varepsilon \ll \sqrt{n}$, and none of them can be reconstructed if $\varepsilon \gg \sqrt{n}$.
Adding noise of order $\sqrt{n}$ hides information on individuals and small groups, but yields meaningful information about sums of $O(n)$ units, for which noise of order $\sqrt{n}$ is natural.
Work further expanded to lessen the magnitude of the noise by
limiting the number of queries.
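The trade-off between large and small subsets can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: a single binary attribute per record (m = 1), Gaussian noise with standard deviation sqrt(n), and hypothetical function names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, size=n)   # one binary attribute per record (m = 1, an assumption)
noise_sd = np.sqrt(n)            # noise of order sqrt(n)

def noisy_query(subset):
    """Subset sum perturbed by Gaussian noise (the noise law is an assumption)."""
    return d[subset].sum() + rng.normal(0.0, noise_sd)

def mean_abs_error(subset, reps=200):
    """Average absolute error of the perturbed answer over repeated queries."""
    truth = d[subset].sum()
    return float(np.mean([abs(noisy_query(subset) - truth) for _ in range(reps)]))

large = np.arange(n // 2)        # a subset of O(n) records
small = np.arange(5)             # a handful of individuals

rel_err_large = mean_abs_error(large) / d[large].sum()
err_small = mean_abs_error(small)
print(f"large subset: relative error about {rel_err_large:.3f}")
print(f"small subset: absolute error about {err_small:.1f} on a true sum of {d[small].sum()}")
```

For the large subset the error is a small fraction of the true sum, while for the five-record subset the noise swamps the answer, which is the behaviour the slide describes.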
Definition of Disclosure Risk
Collaboration between the CS and Statistical Communities, where:
1. In the statistical community, there is a need for more formal and clear definitions of disclosure risk
2. In the CS community, there is a need for statistical methods to preserve the utility of the data
- allow sufficient statistics to be released without perturbation
- methods for adding correlated noise
- sub-sampling and other methods for data masking
Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set a practical but well-defined standard for disclosure risk?
Probabilistic Models
Focus on sample microdata and not whole population (sampling
provides a priori protection against disclosure)
Standard (natural) Assumptions
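$F_k \mid \gamma_k \sim \mathrm{Poisson}(N \gamma_k)$ ind., with $\sum_k \gamma_k = 1$
$f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$ (Bernoulli or Poisson sampling)
$f_k \mid \gamma_k \sim \mathrm{Poisson}(N \gamma_k \pi_k)$
$F_k \mid f_k \sim f_k + \mathrm{Poisson}(\lambda_k)$, where $\lambda_k = N \gamma_k (1 - \pi_k)$
In particular
$F_k \mid f_k = 1 \sim 1 + \mathrm{Poisson}(\lambda_k)$,
the size-biased Poisson distribution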
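Under the size-biased Poisson distribution, the two conditional quantities needed for the risk estimates have closed forms: if $X \sim \mathrm{Poisson}(\lambda)$ then $P(1 + X = 1) = e^{-\lambda}$ and $E[1/(1+X)] = (1 - e^{-\lambda})/\lambda$. The sketch below evaluates them and checks the second against a direct series sum. The function names and the numerical values of $N$, the cell probability and the sampling fraction are hypothetical.

```python
import math

def poisson_unique_risk(lam):
    """P(F=1 | f=1) and E[1/F | f=1] when F - 1 ~ Poisson(lam)."""
    p_unique = math.exp(-lam)
    exp_inv_F = 1.0 if lam == 0 else (1.0 - math.exp(-lam)) / lam
    return p_unique, exp_inv_F

def exp_inv_F_series(lam, terms=200):
    """Direct summation of E[1/(1+X)], X ~ Poisson(lam), used only as a check."""
    total, pmf = 0.0, math.exp(-lam)   # pmf starts at P(X = 0)
    for j in range(terms):
        total += pmf / (1 + j)
        pmf *= lam / (j + 1)           # move from P(X = j) to P(X = j + 1)
    return total

# Hypothetical cell: N = 1000, cell probability 0.002, sampling fraction 0.02.
lam = 1000 * 0.002 * (1 - 0.02)
p_unique, exp_inv_F = poisson_unique_risk(lam)
print(p_unique, exp_inv_F, exp_inv_F_series(lam))
```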
Probabilistic Models
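Add
$\gamma_k \sim \mathrm{Gamma}(\alpha, \beta)$ iid, with $\alpha\beta = 1/K \Rightarrow E \sum_k \gamma_k = 1$
$F_k \mid f_k \sim f_k + \mathrm{NB}\left(\alpha + f_k, \frac{N \pi_k + 1/\beta}{N + 1/\beta}\right)$
As $\alpha \to 0$ ($\beta \to \infty$) we obtain the Mu-Argus assumption
$F_k \mid f_k \sim f_k + \mathrm{NB}(f_k, \pi_k)$
As $\alpha \to \infty$ ($\alpha\beta^2 \to 0$) we obtain the above Poisson Model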
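The two limits can be checked numerically for a sample unique. The sketch assumes that NB(r, p) counts failures before the r-th success with success probability p, so its mass at zero is p^r. It also assumes that beta is a scale parameter with alpha * beta = 1/K. The values of N, K and the sampling fraction are hypothetical.

```python
import math

def nb_unique_probability(alpha, beta, N, pi):
    """P(F=1 | f=1) when F - 1 ~ NB(alpha + 1, rho): the NB mass at zero."""
    rho = (N * pi + 1.0 / beta) / (N + 1.0 / beta)
    return rho ** (alpha + 1.0)

N, K, pi = 10_000, 5_000, 0.02   # hypothetical population size, cells, sampling fraction

def risk(alpha):
    """Risk of a sample unique under the constraint alpha * beta = 1/K."""
    beta = 1.0 / (K * alpha)
    return nb_unique_probability(alpha, beta, N, pi)

argus_like = risk(1e-8)                        # alpha -> 0: approaches pi
poisson_like = risk(1e6)                       # alpha -> infinity
poisson_limit = math.exp(-N * (1 - pi) / K)    # Poisson model with cell probability 1/K
print(argus_like, poisson_like, poisson_limit)
```

With small alpha the probability approaches the sampling fraction, as in the Mu-Argus model, and with large alpha it approaches the Poisson value with every cell probability equal to its prior mean 1/K.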
Mu-Argus Model (Benedetti, Capobianchi, Franconi (1998))
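$w_i$ is the sampling weight of individual $i$ obtained from design or post-stratification
$\hat{\pi}_k = f_k / \hat{F}_k$, where $\hat{F}_k = \sum_{i \in \text{sample cell } k} w_i$
If $f_k = 0$ then $\hat{F}_k = 0$, but $\sum_k \hat{F}_k = \sum_i w_i = N$, so the non-empty cells absorb the population of the empty ones
$\Rightarrow$ $\hat{\pi}_k$ are underestimated $\Rightarrow$ risk is underestimated
Monotonicity: if we replace $f_k = 0$ by some $\delta$, risk estimates increase in $\delta$ and reach the correct level for some value of $\delta$, but how to estimate $\delta$?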
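A minimal sketch of the weight-based estimates follows. It takes the Mu-Argus assumption that $F_k - f_k$ is negative binomial with parameters $(f_k, \pi_k)$, read here as failures before the $f_k$-th success (an assumed parameterization). For a sample unique this gives $P(F_k = 1 \mid f_k = 1) = \pi_k$ and $E[1/F_k \mid f_k = 1] = \pi_k \log(1/\pi_k) / (1 - \pi_k)$. The function names, cell labels and weights are made up for illustration.

```python
import math
from collections import defaultdict

def argus_cell_estimates(cells, weights):
    """Per-cell f_k, F_hat_k (sum of weights) and pi_hat_k = f_k / F_hat_k."""
    f = defaultdict(int)
    F_hat = defaultdict(float)
    for k, w in zip(cells, weights):
        f[k] += 1
        F_hat[k] += w
    pi_hat = {k: f[k] / F_hat[k] for k in f}
    return dict(f), dict(F_hat), pi_hat

def argus_unique_risk(pi):
    """(P(F=1 | f=1), E[1/F | f=1]) when F - 1 ~ NB(1, pi)."""
    if pi >= 1.0:
        return 1.0, 1.0
    return pi, pi * math.log(1.0 / pi) / (1.0 - pi)

# Four sampled individuals falling in three cells (made-up labels and weights).
cells = ["a", "a", "b", "c"]
weights = [50.0, 50.0, 40.0, 60.0]
f, F_hat, pi_hat = argus_cell_estimates(cells, weights)

uniques = [k for k in f if f[k] == 1]
tau1_hat = sum(argus_unique_risk(pi_hat[k])[0] for k in uniques)
tau2_hat = sum(argus_unique_risk(pi_hat[k])[1] for k in uniques)
print(F_hat, tau1_hat, tau2_hat)
```

Only sampled cells appear in `F_hat`, yet the values sum to the total weight, which is the mechanism behind the underestimation noted above.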
Poisson Log-linear Models (Skinner, Holmes (1998), Elamir, Skinner (2005), Skinner, Shlomo (2005))
$N \gamma_k = \exp(x_k' \beta)$
Monotonicity in the size of the model (number of parameters):
Saturated (“big” model) $\Rightarrow$ data overfitted $\Rightarrow$ risk underestimated
Independent (“small” model) $\Rightarrow$ data underfitted $\Rightarrow$ risk overestimated
Intermediate models with conditional independence involve smaller products of marginal proportions, and therefore we expect monotonicity of the models. So, similar to the choice of $\delta$, there will be a model which gives a good risk estimate
Neighborhood of a Log-linear Model
Log-linear models take into account a neighborhood of cells to infer on $\gamma_k$ for determining the risk.
For example:
Independence Neighborhood, $k = (i, j)$:
The estimate $\hat{\gamma}_k$ is the product of marginal proportions obtained by fixing one attribute at a time. Thus, if one attribute is income group, then inference on the very rich involves information on the very poor, provided there is another attribute in common, such as marital status.
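The independence neighborhood can be made concrete on a small two-way table. In this sketch the cell probability is estimated by the product of the row and column sample proportions, and the risk of each sample unique is then obtained from the size-biased Poisson formulas $e^{-\lambda}$ and $(1 - e^{-\lambda})/\lambda$ with $\lambda = N \hat{\gamma}_k (1 - \pi)$. The table, the sampling fraction and the implied population size are hypothetical.

```python
import numpy as np

# Hypothetical 3 x 4 sample table f[i, j].
f = np.array([[5, 1, 0, 0],
              [3, 2, 1, 0],
              [0, 1, 1, 1]])
n = f.sum()
pi = 0.02          # assumed equal sampling fraction in every cell
N = n / pi         # implied population size

# Independence fit: product of the two marginal proportions.
row = f.sum(axis=1) / n
col = f.sum(axis=0) / n
gamma_hat = np.outer(row, col)

# Poisson mean of the unsampled part of each cell.
lam = N * gamma_hat * (1 - pi)

unique = f == 1
tau1_hat = float(np.exp(-lam[unique]).sum())
tau2_hat = float(((1 - np.exp(-lam[unique])) / lam[unique]).sum())
print(tau1_hat, tau2_hat)
```

Every cell in a row shares that row's marginal, so the estimate for one cell borrows information from all cells that agree with it on at least one attribute.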
Discussion of Neighborhoods
How likely is a sample unique to be a population unique?
If a sample unique has mostly small or empty neighboring cells, it is more likely to be a population unique.
• Argus is based on weights, with no learning from other cells.
• The log-linear Poisson model takes into account neighborhoods, reduces the number of parameters and also reduces their standard deviation, and hence that of the risk measures (provided that the model is valid).
Are there other types of neighborhoods which may be more
natural?
We focus on ORDINAL variables
Proposed Neighborhoods
• Local smoothers for large sparse (ordinal) tables, e.g. Bishop, Fienberg, Holland (1975), Simonoff (1998)
• Use local neighborhoods to fit a simple smooth function to $f_k$ or to estimate $\gamma_k$ smoothly
• Construct a neighborhood of cells $N^{(k)}$ of $k$ by varying the coordinates of ordinal attributes and fixing non-ordinal attributes
$N_c^{(k)}$: neighborhood of cells at distance $c$ from cell $k$
Proposed Neighborhoods
Regressors, for cell $k$:
$x_c^{(k)} = \sum_{l \in N_c^{(k)}} f_l$
$\gamma_k = \exp\{\beta_0 + \sum_{c \le C} \beta_c x_c^{(k)}\}$
$f_k \mid \gamma_k \sim \mathrm{Poisson}(n \gamma_k)$
$F_k \mid f_k \sim f_k + \mathrm{Poisson}(\lambda_k)$, where $\lambda_k = N \gamma_k (1 - \pi_k)$
A cell is defined as a structural zero if all of its neighborhoods used in the regression contain only empty cells
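The regressors and the structural-zero rule can be sketched as follows. The slides do not say how the distance between cells is measured, so this example assumes the L1 distance over the ordinal coordinates, with non-ordinal coordinates held fixed. The function name and the toy table are hypothetical, and the regression fit itself is not shown.

```python
import itertools
import numpy as np

def neighborhood_regressors(f, ordinal_axes, C):
    """x[c-1][k] = sum of f over cells at distance c from k, for c = 1..C.

    Distance is assumed to be the L1 distance over the ordinal coordinates;
    non-ordinal coordinates are kept equal to those of k.
    """
    f = np.asarray(f)
    x = np.zeros((C,) + f.shape)
    ranges = [range(-C, C + 1) if ax in ordinal_axes else (0,)
              for ax in range(f.ndim)]
    for offset in itertools.product(*ranges):
        c = sum(abs(o) for o in offset)
        if not 1 <= c <= C:
            continue
        for k in np.ndindex(f.shape):
            l = tuple(ki + oi for ki, oi in zip(k, offset))
            # Skip neighbors that fall outside the table.
            if all(0 <= li < s for li, s in zip(l, f.shape)):
                x[(c - 1,) + k] += f[l]
    return x

# Toy table: sex (2 levels, not ordinal) by age group (4 levels, ordinal).
f = np.array([[1, 0, 0, 0],
              [0, 0, 2, 1]])
x = neighborhood_regressors(f, ordinal_axes=[1], C=1)

# Structural zeros: every neighborhood used in the regression is empty.
structural_zero = x.sum(axis=0) == 0
print(x[0])
print(structural_zero)
```

Each slice `x[c-1]` would enter the Poisson regression as one covariate, and the cells flagged in `structural_zero` could be set aside before estimating risk.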
Example
Population from 1995 Israeli Census File, Age>15,
N=746,949, n=14,939, and K=337,920
Key: Sex(2), Age groups(16), Groups of years of study(10),
Number of years in Israel(11), Income groups(12), Number of
persons in household (8)
Sex is not ordinal and is fixed
Weights for Argus obtained by post-stratification on
weighting classes: sex, age and geographical location
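The table size and sampling fraction quoted above can be checked with a few lines of arithmetic. The dictionary keys are shorthand labels chosen here, while the category counts, N and n are those stated in the example.

```python
import math

# Number of categories of each key variable, as listed in the example.
key_sizes = {
    "sex": 2,
    "age_group": 16,
    "years_of_study": 10,
    "years_in_israel": 11,
    "income_group": 12,
    "household_size": 8,
}
K = math.prod(key_sizes.values())   # number of cells in the 6-way table

N, n = 746_949, 14_939
sampling_fraction = n / N
mean_sample_count = n / K           # average sample count per cell

print(K, round(sampling_fraction, 4), round(mean_sample_count, 4))
```

The table has far more cells than sampled individuals, so it is very sparse and most non-empty cells can be expected to hold a single sampled person.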
Example
Model                                                   $\tau_1$    $\tau_2$
True Values                                                430.0     1,125.8
Argus                                                      114.5       456.0
Log-linear model: Independence                             773.8     1,774.1
Log-linear model: 2-way Interactions                       470.0     1,178.1
Neighborhood Method $M_a^k$                                786.8     2,146.9
Neighborhood Method $M_a^k$ w/out structural zeros         385.4     1,674.1
Neighborhood Method $N_c^k$                                723.3     2,099.6
Neighborhood Method $N_c^k$ w/out structural zeros         344.8     1,624.2
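To compare the methods on a common scale, each estimate can be divided by the corresponding true value. The sketch below does this for the figures of the example as read from the table. The closeness score (sum of absolute deviations of the two ratios from 1) is an ad hoc choice made here, not a criterion used in the slides.

```python
true_values = (430.0, 1125.8)   # (tau_1, tau_2)

estimates = {
    "Argus": (114.5, 456.0),
    "Log-linear: independence": (773.8, 1774.1),
    "Log-linear: 2-way interactions": (470.0, 1178.1),
    "Neighborhood M_a^k": (786.8, 2146.9),
    "Neighborhood M_a^k w/out structural zeros": (385.4, 1674.1),
    "Neighborhood N_c^k": (723.3, 2099.6),
    "Neighborhood N_c^k w/out structural zeros": (344.8, 1624.2),
}

# Ratio of estimate to truth: below 1 is underestimation, above 1 is overestimation.
ratios = {
    name: (t1 / true_values[0], t2 / true_values[1])
    for name, (t1, t2) in estimates.items()
}

# Ad hoc closeness score: total absolute deviation of the two ratios from 1.
closest = min(ratios, key=lambda m: abs(ratios[m][0] - 1) + abs(ratios[m][1] - 1))

for name, (r1, r2) in ratios.items():
    print(f"{name}: tau_1 ratio {r1:.2f}, tau_2 ratio {r2:.2f}")
print("closest to the true values:", closest)
```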
Results of Example
• The independence log-linear model and the neighborhood methods overestimate the two risk measures
• The Argus model underestimates
• The all 2-way interaction log-linear Poisson model has the best estimates
• Taking into account the structural zeros in the neighborhoods yields more reasonable estimates
Conclusions
• Need to refine the neighborhood approach, define the model better and develop MLE theory
• We expect the new model to work well in multi-way tables when simple log-linear models are not valid
• Incorporate the approach into a more general regression model, the Negative Binomial Regression, which subsumes both the Poisson Risk Model and the Argus Model