STATISTICAL CONFIDENTIALITY IN
LONGITUDINAL LINKED DATA:
OBJECTIVES AND ATTRIBUTES
Mario Trottini
University of Alicante (Spain)
mario.trottini@ua.es
Joint UNECE/Eurostat Work Session on Statistical Confidentiality, Geneva 9-11 November 2005
Problem Definition
Longitudinal Linked Microdata:
“Microdata that contain observations from two or more related
sampling frames, with measurements for multiple time periods for
all units of observation” (Abowd and Woodcock 2004)
Great Research Potential
Two related issues:
• How to create the data set?
• How to disseminate the data?
2
Data Dissemination: Why is It Difficult?
Ideal Data Dissemination Procedure
Three Objectives
1. Should allow legitimate users to perform statistical analyses as if
they were using the original data: “Maximize usefulness”
2. Should “control” the risk of misuse of the data by potential
intruders: “Maximize safety”
3. Should be operational: “Minimize cost”
Two issues:
(i) Objectives are too ambiguous
How to measure achievement?
(ii) Objectives are conflicting
How to find a suitable balance?
4
Data Dissemination as a Decision Problem
A SOLUTION REQUIRES:
Step (1) Identify the alternatives: candidate data dissemination procedures
Step (2) Structuring the objectives: interpretation of
• “cost”
• “usefulness”
• “safety”
Step (3) Define suitable attributes: measures of
• “cost” (C)
• “usefulness” (DU)
• “safety” (DS)
Step (4) Assessing the trade-off between the fundamental objectives
Example: alternative (DS1, DU1, C1) versus alternative (DS1 - 1, DU1 + 2, C1). Which one is preferred?
5
Outline
• Identify the alternatives: review of existing data
dissemination procedures
• Structuring the objectives:
- Theory
- Current practice
• Selecting attributes :
- Theory
- Current practice
• Conclusions
7
Identifying the Alternatives
Let M = { Mk , k ∈ E } denote the class of alternative data
dissemination procedures
CURRENT APPROACH
Mk is one of the following:
1. Data Masking
2. Synthetic Data
3. Licensing
4. Remote Access
5. Research Data Center
MORE REALISTIC APPROACH
Mk should be a combination of 1-5
Two rationales:
• Data users and data users' needs are very diverse
(Mackie and Bradburn 2000)
• Combining different methods can produce greater data utility
for any level of disclosure risk
(Abowd and Lane 2003)
8
Identifying the Alternatives
MORE REALISTIC APPROACH
Mk should be a combination of 1-5: a Portfolio Problem
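The size of the portfolio problem can be made concrete by enumerating the candidate set. The sketch below is illustrative only: it assumes that every non-empty subset of the five methods counts as one candidate Mk, which the slides do not state (in practice some combinations may be infeasible, and each method has its own parameters). The names `METHODS` and `enumerate_portfolios` are hypothetical.

```python
from itertools import combinations

# The five basic procedures listed on the slide.
METHODS = [
    "Data Masking",
    "Synthetic Data",
    "Licensing",
    "Remote Access",
    "Research Data Center",
]


def enumerate_portfolios(methods):
    """Return every non-empty combination of methods as a tuple."""
    portfolios = []
    for size in range(1, len(methods) + 1):
        portfolios.extend(combinations(methods, size))
    return portfolios


portfolios = enumerate_portfolios(METHODS)
print(len(portfolios))  # 2**5 - 1 = 31 candidate portfolios
```

Even under this simplification the class M has 31 members rather than 5, so comparing alternatives one by one becomes harder.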
9
Structuring the Objectives: Theory
Information Organization Overall Objective: “The best” data dissemination
• Minimize Cost
• Maximize Usefulness
• Maximize Safety
Too broad and ambiguous to be of operational use
STRATEGY: Divide an objective into lower-level objectives that
clarify the interpretation of the broader objective
10
An Illustration
Usefulness
“[the data dissemination procedure] should allow
legitimate data users to perform the statistical
analyses of interest as if they were using the
data set originally collected”.
Sources of ambiguity
a) Definition and identification of “legitimate data users”
b) For a given user in (a), identification of the statistical
analyses of interest
c) For a given user in (a) and statistical analysis in (b),
definition of “as if”
11
The Hierarchy
Maximize Usefulness
  Max. usefulness for DATA USER 1
  Max. usefulness for DATA USER 2
  ...
  Max. usefulness for DATA USER k
    statistical analysis SAk1
    ...
    statistical analysis SAkm
      QUALITY: Exploratory Analysis, Model uncertainty, Estimation, Prediction
      FEASIBILITY: People/skills, Technology, Time, Cost
      TRANSPARENCY
      Access to the data, Perform the analysis, Interpret the results
  Max. usefulness for OTHER UNKNOWN DATA USER
12
Structuring the Objectives: Current Practice
• Research literature and current practice in SDC as a whole
have identified relevant aspects of the fundamental objectives:
- Maximize Usefulness
- Maximize Safety
- Minimize Cost
• However, only a few of them are taken into account in applications
Transparency, accessibility, and feasibility are often not considered
• No explicit hierarchy is used
• Implicit hierarchy is often incomplete
13
An Illustration
DATA MASKING
DORIG: ORIGINAL MICRODATA
1) Apply some transformation, T, to the data: DREL = T(DORIG)
2) Release to the user: DMASKED = ( DREL, I(T) )
DREL: output of the transformation
I(T): information about the transformation T
F(Data): output of a statistical analysis of interest using “Data”
Usefulness assessment:
D = F(DORIG) - F(DMASKED)
IGNORING TRANSPARENCY!
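A small simulation can show why ignoring I(T) may misstate usefulness. This is a sketch under assumptions that are not in the slides: T is additive Gaussian noise, I(T) is the noise standard deviation released with the data, and F is the sample variance. The names `NOISE_SD`, `d_orig`, `d_rel` and `f` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

NOISE_SD = 5.0  # hypothetical I(T): the released noise standard deviation

# DORIG: hypothetical original microdata (one numeric variable).
d_orig = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# T: additive noise masking, so DREL = T(DORIG).
d_rel = [x + random.gauss(0.0, NOISE_SD) for x in d_orig]


def f(data):
    """F(Data): the analysis of interest, here the sample variance."""
    return statistics.variance(data)


# Assessment that ignores I(T): compare F on DORIG and on DREL directly.
naive_gap = f(d_orig) - f(d_rel)

# Assessment that uses I(T): the user subtracts the known noise variance.
informed_gap = f(d_orig) - (f(d_rel) - NOISE_SD ** 2)

print(f"gap ignoring I(T): {naive_gap:.2f}")
print(f"gap using I(T):    {informed_gap:.2f}")
```

In this setting the gap computed without I(T) is dominated by the added noise variance, while a user who knows I(T) can remove most of it. A usefulness measure that ignores transparency would therefore understate what such a user can recover.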
14
General Guidelines for Structuring the
Objectives
• Definitions of “safety”, “usefulness” and “cost” are problem dependent.
• However, providing a clear definition of them in any specific Data
Dissemination Problem is crucial for the quality of the final decision.
• The use of hierarchies could be very beneficial in terms of:
1. clarifying the interpretation of the relevant objectives
2. checking that no relevant aspects of the problem have been ignored
3. facilitating communication
15
Selecting Attributes: Theory
Types of Attributes:
1. Natural attributes
2. Constructed Attributes
3. Proxy attributes
16
Selecting Attributes: Theory
Types of Attributes:
1. Natural attributes
Obvious scale that can be used to
measure the extent to which an
objective is achieved.
Example:
Objective: “Minimize Cost”
(Natural) attribute: “Cost in Euros”
• Not very common in SDC
2. Constructed Attributes
3. Proxy attributes
17
Selecting Attributes: Theory
Types of Attributes:
1. Natural attributes
2. Constructed Attributes
"subjective scale" constructed
out of several aspects typically
associated with the objective of
interest.
3. Proxy attributes
19
Table 1. Constructed attribute for public attitudes (Keeney and Gregory 2005)
Attribute level | Description of attribute level
 1 | Support: No groups are opposed to the facility and at least one group has organized support for the facility.
 0 | Neutrality: All groups are indifferent or uninterested.
-1 | Controversy: One or more groups have organized opposition, although no groups have action-oriented opposition. Other groups may either be neutral or support the facility.
-2 | Action-oriented opposition: Exactly one group has action-oriented opposition. The other groups have organized support, indifference, or organized opposition.
-3 | Strong action-oriented opposition: Two or more groups have action-oriented opposition.
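The levels in Table 1 are defined precisely enough to be applied mechanically, which is what makes a constructed attribute interpretable. The sketch below encodes the table as a function. The `Attitude` coding and the function name are hypothetical and only illustrate how the scale could be applied; they are not taken from Keeney and Gregory.

```python
from enum import Enum


class Attitude(Enum):
    """Hypothetical coding of one group's attitude toward the facility."""
    SUPPORT = "organized support"
    NEUTRAL = "indifferent or uninterested"
    OPPOSITION = "organized opposition"
    ACTION_OPPOSITION = "action-oriented opposition"


def public_attitude_level(attitudes):
    """Map a non-empty list of group attitudes to the level in Table 1."""
    if not attitudes:
        raise ValueError("at least one group is required")
    action = sum(a is Attitude.ACTION_OPPOSITION for a in attitudes)
    if action >= 2:
        return -3  # strong action-oriented opposition
    if action == 1:
        return -2  # exactly one group with action-oriented opposition
    if any(a is Attitude.OPPOSITION for a in attitudes):
        return -1  # controversy: organized opposition only
    if any(a is Attitude.SUPPORT for a in attitudes):
        return 1   # support: nobody opposed, at least one supporter
    return 0       # neutrality: every group is indifferent


print(public_attitude_level([Attitude.SUPPORT, Attitude.NEUTRAL]))  # 1
```

Each returned level maps back to a verbal description, so a statement such as "level -1" has a direct reading, unlike a change of 0.1 in a distance measure.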
20
Selecting Attributes: Theory
Types of Attributes:
1. Natural attributes
2. Constructed Attributes
"subjective scale" constructed
out of several aspects typically
associated with the objective of
interest.
• Defining feature: Interpretability
• Not used in SDC

3. Proxy attributes
21
Selecting Attributes: Theory
Types of Attributes:
1. Natural attributes
2. Constructed Attributes
3. Proxy attributes
Reflect the degree to which an
associated objective is met but
do not directly measure the
objective.
22
Proxy Attributes for “Usefulness” in SDC
GENERAL FORMULATION
DORIG: ORIGINAL DATA
DREL: DISSEMINATED DATA
F(Data): some feature of “Data”
PROXY = DISCREPANCY ( F(DORIG), F(DREL) )
INTUITION: Low distortion of the data implies nearly correct
inferences for nearly all statistical analyses
23
Proxy Attributes for “Usefulness” in SDC
PROXY = DISCREPANCY ( F(DORIG), F(DREL) )
F: Summary statistics
DISCREPANCY:
• Absolute (relative) difference
• Percentage variation
• Mean variation, etc.
Proxy as discrepancy between summary statistics:
Domingo-Ferrer and Torra (2001), Yancey et al. (2002), Oganyan (2003),
Grup Crises (2004)
F: Density estimation
DISCREPANCY:
• Hellinger distance
• Kullback-Leibler divergence
• Other “distances”…
Proxy as discrepancy between distributions:
Agrawal and Aggarwal (2001), Gomatam et al. (2004), Karr et al. (2005)
F: Model-based inferences (Estimation, Prediction, Model Selection)
DISCREPANCY:
• Difference in parameter estimates, interval overlaps
• Discrepancy in model ranking
• etc.
Inference-based proxy:
Gomatam et al. (2004), Karr et al. (2005)
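For the distribution-based proxies, a minimal sketch of the two named discrepancies is given below. It assumes that F(DORIG) and F(DREL) have already been reduced to discrete distributions over the same bins. The bin probabilities `p_orig` and `p_rel` are made-up illustrative values, and the function names are hypothetical rather than taken from the cited papers.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q is zero where p is positive."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # terms with p = 0 contribute nothing
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total


# Illustrative bin probabilities for F(DORIG) and F(DREL).
p_orig = [0.10, 0.40, 0.40, 0.10]
p_rel = [0.15, 0.35, 0.35, 0.15]

print(hellinger(p_orig, p_rel))
print(kl_divergence(p_orig, p_rel))
```

Both functions return 0 for identical distributions. The Hellinger distance is symmetric and bounded by 1, whereas the Kullback-Leibler divergence is asymmetric and unbounded, so the two proxies are not on a common scale.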
24
Selecting Attributes: Theory
Types of Attributes:
1. Natural attributes
2. Constructed Attributes
3. Proxy attributes
Defining features:
• Usually easier to handle
• Require some understanding
of the relationship between
the objective of interest and
the associated objective
measured by the proxy.
• (TOO) OFTEN USED IN SDC
25
An Illustration
Goal: Assessing the trade-off between “Maximize usefulness” and
“Maximize safety” for a given level “c” of “Cost”
• Attribute for “usefulness” (Information loss): Hellinger distance (IL)
• Attribute for “safety” (Disclosure risk): % of records correctly re-identified (DR)
Data dissemination 1 (D1): IL(D1) = 0.4, DR(D1) = 1%, cost C
Data dissemination 2 (D2): IL(D2) = 0.5, DR(D2) = 0.5%, cost C
Relative to D1, D2 has DR(D1) - 0.5 and IL(D1) + 0.1 at the same cost C. Which one is preferred?
What does Δ(IL) = 0.1 mean in terms of fitting a regression model?
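A simple dominance check shows why this comparison cannot be settled by the numbers alone. The sketch assumes that lower is better for both attributes and that the two releases have the same cost. The `Release` class and the `dominates` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    name: str
    il: float  # information loss (Hellinger distance), lower is better
    dr: float  # disclosure risk (% of records re-identified), lower is better


def dominates(a, b):
    """True if a is no worse than b on both attributes and better on one."""
    no_worse = a.il <= b.il and a.dr <= b.dr
    strictly_better = a.il < b.il or a.dr < b.dr
    return no_worse and strictly_better


d1 = Release("D1", il=0.4, dr=1.0)
d2 = Release("D2", il=0.5, dr=0.5)

print(dominates(d1, d2), dominates(d2, d1))  # False False
```

Neither release dominates the other, so choosing between them requires a judgment about how much information loss is worth a given reduction in risk. That judgment is hard to make when the unit of IL has no direct interpretation.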
26
Attribute Selection: Theory and Current Practice
THEORY
Prescriptive order in attribute selection:
1. Natural attributes
2. Constructed attributes
3. Proxy attributes
CURRENT PRACTICE
Order in attribute selection:
1. Natural attributes
2. Proxy attributes
Constructed attributes (not used)
Desirable Properties (in the paper)
27
Conclusions
“There is a tendency in all problem solving to move quickly
away from the ill-defined to the well-defined, from constraint-free
thinking to constrained thinking. There is a need to feel,
and perhaps even to measure, progress toward reaching a
‘solution’ to a decision problem.” (Keeney, 1992, page 9)
• In this talk it is argued that too little effort has been made
for a comprehensive definition of the Data Dissemination
problem in terms of:
- alternatives
- objectives
- attributes
28
Conclusions (Cont.)
• Hierarchy and constructed attributes could represent useful
tools to address these problems.
• Although the discussion has not focused on the dissemination of
longitudinal linked data as much as desired, I think it is
particularly relevant for this type of data given:
- The complexity of the modeling
- The multiple decision makers involved
- The different perspectives of disclosure and utility
that must be accommodated in the final decision.
29
Acknowledgements
Preparation of this paper was supported by the U.S.
National Science Foundation under Grant EIA-0131884 to
the National Institute of Statistical Sciences. The contents
of the paper reflect the author's personal opinion. The
National Science Foundation is not responsible for any
views or results presented.
30
THANK YOU !
31