Keeping Data Confidential in an Era
of No Privacy
Prof. Jerry Reiter
Department of Statistical Science
Duke University
Disclosure limitation setting



Agency seeks to release data on
individuals
Risk of re-identifications from matching
to external databases
Statistical disclosure limitation applied
to data before release
Standard approaches to
disclosure limitation

Recode variables

Suppress data

Swap data

Add random noise
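As a rough illustration of the four standard approaches listed above, here is a minimal sketch on toy data. The `income` and `age` arrays, the top-coding threshold of 100, and the noise standard deviation of 5 are all made-up assumptions for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy confidential data (hypothetical values).
income = np.array([20.0, 35.0, 50.0, 80.0, 400.0])
age = np.array([25, 40, 33, 58, 61])

# Recode: top-code incomes at an assumed threshold of 100.
topcoded = np.minimum(income, 100.0)

# Suppress: blank out values above the same threshold.
suppressed = np.where(income > 100.0, np.nan, income)

# Swap: exchange the ages of two randomly chosen records.
swapped_age = age.copy()
i, j = rng.choice(len(age), size=2, replace=False)
swapped_age[[i, j]] = swapped_age[[j, i]]

# Add random noise: independent normal noise with an assumed sd of 5.
noisy_income = income + rng.normal(0.0, 5.0, size=income.shape)
```

Each operation alters the released values in a different way, which is why each one distorts analyses differently.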
General issues with standard SDL

Recoding


Loses information in tails, disables fine
spatial analysis, creates ecological fallacies
Suppression


Creates nonignorable missing data
May not be fully protective
General issues with standard SDL

Swapping



Attenuates correlations
Protection based on perception
Adding noise


Inflates variances, distorts distributions,
attenuates correlations
May need large noise variances
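The effect of added noise on variances and correlations can be checked with a small simulation. This is only a sketch under assumed settings: the data-generating model, the sample size, and the noise standard deviation of 1 are illustrative choices, and `add_noise` is a hypothetical helper.

```python
import numpy as np

def add_noise(values, noise_sd, rng):
    """Return values plus independent N(0, noise_sd^2) noise."""
    return values + rng.normal(0.0, noise_sd, size=values.shape)

rng = np.random.default_rng(0)
n = 10_000

# Simulated confidential data with corr(x, y) of about 0.8.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

# Protect x by adding noise with the same variance as x itself.
x_noisy = add_noise(x, noise_sd=1.0, rng=rng)

var_before = x.var(ddof=1)
var_after = x_noisy.var(ddof=1)
corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(x_noisy, y)[0, 1]
```

In this setup the variance of the noisy variable roughly doubles, and the correlation shrinks by a factor of about 1/sqrt(2), since the covariance is unchanged while the variance of x grows.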
Fully synthetic data
Rubin (1993, JOS): create multiple, fully
synthetic datasets for public release so that:

No unit in released data has sensitive data
from actual unit in population

Released data look like actual data

Statistical procedures valid for original data
are valid for released data
Generating fully synthetic data



Randomly sample new units from frame
(can use simple random samples)
Impute survey variables for new units
using models fit from observed data
Repeat multiple times and release m
datasets
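The three steps above can be sketched as follows. This is an illustrative sketch, not the procedure used by any agency: it assumes a single frame variable `x`, a single survey variable `y`, and a normal linear regression as the synthesis model with parameters drawn from their approximate posterior. The function name `synthesize_fully` and all simulated values are hypothetical.

```python
import numpy as np

def synthesize_fully(frame_x, obs_x, obs_y, n_syn, m, rng):
    """Return m synthetic datasets, each an (x, y) pair of length-n_syn arrays."""
    n = len(obs_y)
    design = np.column_stack([np.ones(n), obs_x])
    # Fit the (assumed) linear synthesis model to the observed data.
    beta_hat, *_ = np.linalg.lstsq(design, obs_y, rcond=None)
    resid = obs_y - design @ beta_hat
    dof = n - design.shape[1]
    xtx_inv = np.linalg.inv(design.T @ design)

    datasets = []
    for _ in range(m):
        # Draw model parameters so that parameter uncertainty is propagated.
        sigma2 = resid @ resid / rng.chisquare(dof)
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        # Step 1: simple random sample of new units from the frame.
        idx = rng.choice(len(frame_x), size=n_syn, replace=False)
        x_new = frame_x[idx]
        # Step 2: impute the survey variable for the new units.
        y_new = beta[0] + beta[1] * x_new + rng.normal(0.0, np.sqrt(sigma2), n_syn)
        datasets.append((x_new, y_new))
    # Step 3: the m datasets together form the release.
    return datasets

# Toy usage with simulated data (all values are assumptions).
rng = np.random.default_rng(0)
frame_x = rng.normal(size=5000)
sample_idx = rng.choice(5000, size=500, replace=False)
obs_x = frame_x[sample_idx]
obs_y = 1.0 + 2.0 * obs_x + rng.normal(size=500)
synthetic = synthesize_fully(frame_x, obs_x, obs_y, n_syn=500, m=5, rng=rng)
```

A real application would use a much richer model and many variables; the sketch only shows the sample, impute, repeat structure.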
Inferences from fully synthetic
datasets
Raghunathan, Reiter, Rubin
(2003, Journal of Official Statistics)

Estimand: $Q = Q(X, Y)$

In each synthetic dataset $d_i$, compute
$q_i = Q(d_i)$, $u_i = U(d_i)$
Quantities needed for inferences
$\bar{q}_m = \sum_{i=1}^{m} q_i / m$
$b_m = \sum_{i=1}^{m} (q_i - \bar{q}_m)^2 / (m - 1)$
$\bar{u}_m = \sum_{i=1}^{m} u_i / m$
Inferences from fully synthetic
data
Estimate of Q: $\bar{q}_m$

Estimate of variance is
$T_f = (1 + 1/m) b_m - \bar{u}_m$

For large n, s, m, use normal-based inference
for Q:
$\bar{q}_m \pm 1.96 \sqrt{T_f}$
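The combining rules for fully synthetic data can be written as a short function. This is a minimal sketch: the function name `combine_fully_synthetic` and the example numbers are hypothetical, and returning no interval when the variance estimate is not positive is an assumption of this sketch, since the slides do not say how to handle that case.

```python
import math

def combine_fully_synthetic(q, u):
    """Combine per-dataset point estimates q and variance estimates u.

    Returns (q_bar, t_f, interval); interval is None when t_f <= 0.
    """
    m = len(q)
    q_bar = sum(q) / m
    b_m = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)
    u_bar = sum(u) / m
    # Variance estimate for full synthesis: T_f = (1 + 1/m) b_m - u_bar.
    t_f = (1 + 1 / m) * b_m - u_bar
    if t_f <= 0:
        # The subtraction can yield a non-positive value; not handled here.
        return q_bar, t_f, None
    half_width = 1.96 * math.sqrt(t_f)
    return q_bar, t_f, (q_bar - half_width, q_bar + half_width)

# Made-up estimates from m = 4 synthetic datasets.
q_bar, t_f, ci = combine_fully_synthetic(
    q=[10.0, 12.0, 11.0, 13.0],
    u=[0.5, 0.5, 0.5, 0.5],
)
```

Because the within-dataset variance is subtracted, the estimate can be small or negative when m or the synthetic sample size is small.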
Advantages of full synthesis





No sensitive data released: very high
protection
No need to decide which values to alter or
which variables are quasi-identifiers
Potential to preserve associations, maintain
geographies, release data in tails
Analysts can use standard methods on simple
random samples
Protection does not depend on hiding the nature
of the SDL from the public
Drawbacks of full synthesis

Analysts have to deal with multiple datasets
(not a serious issue)

Quality of data highly dependent on quality of
synthesis models



Relationships omitted in models are not in
released data
Inaccurate distributions are passed on to analysts
Analysts can only rediscover what is in
the synthesis models
A modification of the proposal:
Partially synthetic data
Little (1993, JOS): create multiple, partially
synthetic datasets for public release so that:

Released data comprise mix of observed and
synthetic values

Released data look like actual data

Statistical procedures valid for original data
are valid for released data
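A partially synthetic release can be sketched in the same spirit. The sketch below is illustrative only: it assumes one predictor `x`, one sensitive variable `y`, and a normal linear regression as the synthesis model. The function name `synthesize_partially` and the rule for choosing which units to replace (the top 10% of `y`) are hypothetical.

```python
import numpy as np

def synthesize_partially(x, y, replace_mask, m, rng):
    """Return m copies of y with the masked entries replaced by model draws."""
    n = len(y)
    design = np.column_stack([np.ones(n), x])
    # Fit the (assumed) linear synthesis model to the observed data.
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta_hat
    dof = n - design.shape[1]
    xtx_inv = np.linalg.inv(design.T @ design)
    k = int(replace_mask.sum())

    datasets = []
    for _ in range(m):
        # Draw parameters, then draw replacement values for selected units only.
        sigma2 = resid @ resid / rng.chisquare(dof)
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        y_syn = y.copy()
        y_syn[replace_mask] = (
            design[replace_mask] @ beta + rng.normal(0.0, np.sqrt(sigma2), k)
        )
        datasets.append(y_syn)
    return datasets

# Toy usage with simulated data (all values are assumptions).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
replace_mask = y > np.quantile(y, 0.9)  # hypothetical "sensitive" units
partially_synthetic = synthesize_partially(x, y, replace_mask, m=3, rng=rng)
```

Values outside the mask are released unchanged, so each dataset is a mix of observed and synthetic values.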
[Figure: observed data (x, y) shown alongside three synthetic datasets (x, y)]
Existing applications



Replace sensitive values for selected units:
Survey of Consumer Finances
County-to-county migration flows (current)
Replace values of identifiers for selected units:
American Community Survey group quarters
Tract IDs for NCI SEER cancer registry data
Replace all values of sensitive variables:
Longitudinal Business Database
OnTheMap
Survey of Income and Program Participation
Inference with partially synthetic
datasets (no missing data)
Reiter (2003, Survey Methodology)

Estimand: $Q = Q(X, Y)$

In each synthetic dataset $d_i$, compute
$q_i = Q(d_i)$, $u_i = U(d_i)$
Inference with partially synthetic
data (no missing data)
Estimate of Q: $\bar{q}_m$

Estimate of variance is
$T_p = \bar{u}_m + b_m / m$

For large n and m, use normal-based
inference for Q:
$\bar{q}_m \pm 1.96 \sqrt{T_p}$
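The combining rules for partially synthetic data differ from the fully synthetic case only in the variance estimate. A minimal sketch, with the hypothetical function name `combine_partially_synthetic` and made-up example numbers:

```python
import math

def combine_partially_synthetic(q, u):
    """Combine per-dataset point estimates q and variance estimates u.

    Returns (q_bar, t_p, interval) using T_p = u_bar + b_m / m.
    """
    m = len(q)
    q_bar = sum(q) / m
    b_m = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)
    u_bar = sum(u) / m
    # Variance estimate for partial synthesis; a sum, so never negative.
    t_p = u_bar + b_m / m
    half_width = 1.96 * math.sqrt(t_p)
    return q_bar, t_p, (q_bar - half_width, q_bar + half_width)

# Made-up estimates from m = 4 partially synthetic datasets.
q_bar, t_p, ci = combine_partially_synthetic(
    q=[10.0, 12.0, 11.0, 13.0],
    u=[0.5, 0.5, 0.5, 0.5],
)
```

Unlike the fully synthetic variance estimate, this one adds the two components, so it cannot be negative.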
Fully synthetic                            | Partially synthetic
New units sampled                          | Collected units used
Cannot match: low disclosure risk          | Matches to observed data possible
Full reliance on imputation models         | Partial reliance on imputation models
Released data SRS                          | Original design
May need large synthetic sample sizes or m | Small m can be adequate for replacements
Open research questions

Synthesis models for specific data types:





Data nested within households
Longitudinal data
Social network data
And many more…
Record linkage with synthetic data
Guide to literature:
Overviews of synthetic data




Rubin (1993, Journal of Official Statistics)
Little (1993, Journal of Official Statistics)
Abowd and Woodcock (2001) in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies
Reiter (2004, Chance)
Guide to literature:
Inferences with synthetic data




Full synthesis: Raghunathan, Reiter, Rubin (2003, Journal of Official Statistics)
Partial synthesis (no missing): Reiter (2003, Survey Methodology)
Partial synthesis with missing data: Reiter (2004, Survey Methodology)
Significance tests of multi-component hypotheses



Full synthesis and partial synthesis (no missing): Reiter (2005, Journal of Statistical Planning and Inference)
Partial synthesis with missing: Kinney and Reiter (2010, Journal of Official Statistics)
Model selection in regression: Kinney, Reiter, and Berger (forthcoming, Journal of Privacy and Confidentiality)
Guide to literature:
Generating synthetic data







Sequential regression approaches:
Abowd and Woodcock (2004) in Privacy in Statistical Databases
Classification and regression trees:
Reiter (2005, Journal of Official Statistics)
Survey weights and partial synthesis:
Mitra and Reiter (2006) in Privacy in Statistical Databases
Bayesian networks:
Young, Graham, Penny (2009, Journal of Official Statistics)
Regression with kernel density transformations:
Woodcock and Benedetto (2009, Computational Statistics and Data Analysis)
Random forests:
Caiola and Reiter (2010, Transactions on Data Privacy)
Support vector machines:
Drechsler (2010) in Privacy in Statistical Databases
Guide to literature:
Disclosure risk estimation


Record linkage for partial synthesis:
Abowd, Stinson, Benedetto (2006) technical
report
Identification risks in partial synthesis



Reiter and Mitra (2009, Journal of Privacy and Confidentiality)
Drechsler and Reiter (2008) in Privacy in Statistical Databases
Differential privacy and synthetic data:
Abowd and Vilhuber (2008) in Privacy in
Statistical Databases
Guide to literature:
Utility of synthetic data



Complex designs in full synthesis:
Reiter (2002, Journal of Official Statistics)
Impact of number of datasets on quality:
Drechsler and Reiter (2009, Journal of Official Statistics)
Verification servers:
Reiter, Oganian, and Karr (2009, Computational Statistics and Data Analysis)
Guide to literature:
Genuine applications






Synthesis instead of topcoding:
An and Little (2007, Journal of the Royal Statistical Society, Series A)
Survey of Income and Program Participation linked data:
www.census.gov/sipp/synth_data.html
Longitudinal Business Database:
Kinney and Reiter (2007, Proceedings of the Joint Statistical Meetings)
American Community Survey group quarters:
Hawala (2008, Proceedings of the Joint Statistical Meetings)
OnTheMap: http://lehdmap4.did.census.gov/themap4/
German Establishment Panel:
Drechsler, Bender, and Rässler (2008, Transactions on Data Privacy)
Guide to literature:
Other adaptations

Combining two confidential datasets




Kohnen and Reiter (2009, Journal of the Royal Statistical Society, Series A)
Reiter (2009, International Statistical Review)
Synthesize some variables m times and others r times (Reiter and Drechsler 2010, Statistica Sinica)
Sampling from a census followed by synthesis of confidential data (Drechsler and Reiter 2010, Journal of the American Statistical Association)