STATISTICAL CONFIDENTIALITY IN LONGITUDINAL LINKED DATA: OBJECTIVES AND ATTRIBUTES
Mario Trottini
University of Alicante (Spain)
mario.trottini@ua.es
Joint UNECE/Eurostat Work Session on Statistical Confidentiality, Geneva, 9-11 November 2005

Problem Definition
Longitudinal Linked Microdata: “Microdata that contain observations from two or more related sampling frames, with measurements for multiple time periods for all units of observation” (Abowd and Woodcock 2004)
Great research potential
Two related issues:
• How to create the data set?
• How to disseminate the data?

Data Dissemination: Why Is It Difficult?
Ideal data dissemination procedure: three objectives
1. Should allow legitimate users to perform statistical analyses as if they were using the original data: “Maximize usefulness”
2. Should “control” the risk of misuse of the data by potential intruders: “Maximize safety”
3. Should be operational: “Minimize cost”
Two issues:
(i) The objectives are too ambiguous: how to measure achievement?
(ii) The objectives are conflicting: how to find a suitable balance?

Data Dissemination as a Decision Problem
A SOLUTION REQUIRES:
Step (1) Identify the alternatives: candidate data dissemination procedures
Step (2) Structure the objectives: interpretation of
• “cost”
• “usefulness”
• “safety”
Step (3) Define suitable attributes: measures of
• “cost” (C)
• “usefulness” (DU)
• “safety” (DS)
Step (4) Assess the trade-off between the fundamental objectives:
(DS_1, DU_1, C_1) versus (DS_1 - 1, DU_1 + 2, C_1): which is preferred?
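A minimal sketch of why Step (4) calls for an explicit trade-off. It assumes, hypothetically, that each alternative is scored on three numeric attributes, with DS and DU to be maximized and C to be minimized. The names `Alternative` and `dominates` and all numeric values are illustrative and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    """Scores of one candidate dissemination procedure (illustrative units)."""
    ds: float  # safety attribute, assumed higher = safer
    du: float  # usefulness attribute, assumed higher = more useful
    c: float   # cost attribute, lower = cheaper


def dominates(a: Alternative, b: Alternative) -> bool:
    """True if a is at least as good as b on every attribute and strictly better on one."""
    at_least_as_good = a.ds >= b.ds and a.du >= b.du and a.c <= b.c
    strictly_better = a.ds > b.ds or a.du > b.du or a.c < b.c
    return at_least_as_good and strictly_better


# Hypothetical values standing in for (DS_1, DU_1, C_1) and (DS_1 - 1, DU_1 + 2, C_1).
first = Alternative(ds=5.0, du=3.0, c=10.0)
second = Alternative(ds=first.ds - 1, du=first.du + 2, c=first.c)

# Neither alternative dominates the other, so ranking them needs a value judgment.
needs_tradeoff = not dominates(first, second) and not dominates(second, first)
print(needs_tradeoff)
```

Under these assumptions neither alternative dominates the other: the second gives up one unit of safety for two units of usefulness at equal cost, so the choice depends on how the decision maker values usefulness relative to safety.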
Outline
• Identify the alternatives: review of existing data dissemination procedures
• Structuring the objectives:
  - Theory
  - Current practice
• Selecting attributes:
  - Theory
  - Current practice
• Conclusions

Identifying the Alternatives
Let M = { M_k, k ∈ E } denote the class of alternative data dissemination procedures.
CURRENT APPROACH: M_k is one of the following
1. Data Masking
2. Synthetic Data
3. Licensing
4. Remote Access
5. Research Data Center
MORE REALISTIC APPROACH: M_k should be a combination of 1-5 (a portfolio problem)
Two rationales:
• Data users and their needs are very diverse (Mackie and Bradburn 2000)
• Combining different methods can produce greater data utility for any level of disclosure risk (Abowd and Lane 2003)

Structuring the Objectives: Theory
Information Organization
Overall objective: “The best” data dissemination
- Minimize cost
- Maximize usefulness
- Maximize safety
Too broad and ambiguous to be of operational use
STRATEGY: Divide each objective into lower-level objectives that clarify the interpretation of the broader objective

An Illustration: Usefulness
“[the data dissemination procedure] should allow legitimate data users to perform the statistical analyses of interest as if they were using the data set originally collected”.
Sources of ambiguity:
a) Definition and identification of “legitimate data users”
b) For a given user in (a), identification of the statistical analyses of interest
c) For a given user in (a) and statistical analysis in (b), definition of “as if”

The Hierarchy
Maximize Usefulness
  - Max. usefulness for DATA USER 1
  - Max. usefulness for DATA USER 2
  - ...
  - Max. usefulness for DATA USER k
      - statistical analysis SA_k1, ..., statistical analysis SA_km
          - QUALITY: exploratory analysis; model uncertainty; estimation; prediction
          - FEASIBILITY: people/skills; technology; time; cost
              - Access to the data
              - Perform the analysis
              - Interpret the results
          - TRANSPARENCY
              - Access to the data
              - Perform the analysis
              - Interpret the results
  - Max. usefulness for OTHER UNKNOWN DATA USERS

Structuring the Objectives: Current Practice
• Research literature and current practice in SDC, taken as a whole, have identified the relevant aspects of the fundamental objectives:
  - Maximize usefulness
  - Maximize safety
  - Minimize cost
• However, only a few of these aspects are taken into account in applications: transparency, accessibility and feasibility are often not considered
• No explicit hierarchy is used
• The implicit hierarchy is often incomplete

Data Masking: An Illustration
D_ORIG: original microdata
1) Apply some transformation, T, to the data: D_REL = T(D_ORIG), the output of the transformation
2) Release to the user: D_MASKED = (D_REL, I(T)), where I(T) is information about the transformation T
F(Data): output of a statistical analysis of interest using “Data”
Usefulness assessment: Δ = F(D_ORIG) - F(D_MASKED)
IGNORING TRANSPARENCY!

General Guidelines for Structuring the Objectives
• Definitions of “safety”, “usefulness” and “cost” are problem dependent.
• However, providing a clear definition of them in any specific data dissemination problem is crucial for the quality of the final decision.
• The use of hierarchies could be very beneficial in terms of:
1. clarifying the interpretation of the relevant objectives
2.
checking that no relevant aspects of the problem have been ignored
3. facilitating communication

Selecting Attributes: Theory
Types of attributes:
1. Natural attributes
   Obvious scale that can be used to measure the extent to which an objective is achieved.
   Example: objective “Minimize cost”; (natural) attribute “Cost in euros”
   • Not very common in SDC
2. Constructed attributes
   “Subjective scale” constructed out of several aspects typically associated with the objective of interest.
3. Proxy attributes

Attribute level   Description of attribute level
 1                Support: No groups are opposed to the facility and at least one group has organized support for the facility.
 0                Neutrality: All groups are indifferent or uninterested.
-1                Controversy: One or more groups have organized opposition, although no groups have action-oriented opposition. Other groups may either be neutral or support the facility.
-2                Action-oriented opposition: Exactly one group has action-oriented opposition. The other groups have organized support, indifference, or organized opposition.
-3                Strong action-oriented opposition: Two or more groups have action-oriented opposition.
Table 1. Constructed attribute for public attitudes.
(Keeney and Gregory 2005)

Selecting Attributes: Theory
Types of attributes:
1. Natural attributes
2. Constructed attributes
   • Defining feature: interpretability
   • Not used in SDC
3. Proxy attributes
   Reflect the degree to which an associated objective is met, but do not directly measure the objective of interest.

Proxy Attributes for “Usefulness” in SDC
GENERAL FORMULATION
D_ORIG: original data
D_REL: disseminated data
F(Data): some feature of “Data”
PROXY = DISCREPANCY( F(D_ORIG), F(D_REL) )
INTUITION: Low distortion of the data implies nearly correct inferences for nearly all statistical analyses

Choices of F and DISCREPANCY:
F = summary statistics
  DISCREPANCY: absolute (relative) difference; percentage variation; mean variation; etc.
  Proxy as discrepancy between summary statistics: Domingo Torra (2001), Yancey et al. (2002), Oganyan (2003), Grup Crises (2004)
F = density estimation
  DISCREPANCY: Hellinger distance; Kullback-Leibler divergence; other “distances”
  Proxy as discrepancy between distributions: Agrawal and Aggarwal (2001), Gomatam et al. (2004), Karr et al. (2005)
F = model-based inferences (estimation, prediction, model selection)
  DISCREPANCY: difference in parameter estimates, interval overlaps; discrepancy in model ranking; etc.
  Inference-based proxy: Gomatam et al. (2004), Karr et al. (2005)

Selecting Attributes: Theory
3. Proxy attributes
Defining features:
• Usually easier to handle
• Require some understanding of the relationship between the objective of interest and the associated objective measured by the proxy
• (TOO) OFTEN USED IN SDC

An Illustration
Goal: assess the trade-off between “Maximize usefulness” and “Maximize safety” for a given level c of “Cost”
• Attribute for “usefulness” (information loss): Hellinger distance (IL)
• Attribute for “safety” (disclosure risk): % of records correctly re-identified (DR)
Data dissemination 1 (D_1): IL(D_1) = 0.4, DR(D_1) = 1%
Data dissemination 2 (D_2): IL(D_2) = 0.5, DR(D_2) = 0.5%
(DR(D_1), IL(D_1), c) versus (DR(D_1) - 0.5, IL(D_1) + 0.1, c): which is preferred?
What does Δ(IL) = 0.1 mean in terms of fitting a regression model?

Attribute Selection: Theory and Current Practice
THEORY (prescriptive): order in attribute selection
1. Natural attributes
2. Constructed attributes
3. Proxy attributes
CURRENT PRACTICE: order in attribute selection
1. Natural attributes
(Constructed attributes are skipped)
2. Proxy attributes
Desirable properties (in the paper)

Conclusions
“There is a tendency in all problem solving to move quickly away from the ill-defined to the well-defined, from constraint-free thinking to constrained thinking. There is a need to feel, and perhaps even to measure, progress toward reaching a ‘solution’ to a decision problem.” (Keeney, 1992, page 9)
• This talk argues that too little effort has been devoted to a comprehensive definition of the data dissemination problem in terms of:
  - alternatives
  - objectives
  - attributes

Conclusions (Cont.)
• Hierarchies and constructed attributes could be useful tools to address these problems.
• Although the discussion has not focused on dissemination of longitudinal linked data as much as desired, I think it is particularly relevant for this type of data given:
  - The complexity of the modeling
  - The multiple decision makers involved
  - The different perspectives of disclosure and utility that must be accommodated in the final decision.

Acknowledgements
Preparation of this paper was supported by the U.S. National Science Foundation under Grant EIA-0131884 to the National Institute of Statistical Sciences.
The contents of the paper reflect the author's personal opinion. The National Science Foundation is not responsible for any views or results presented.

THANK YOU!
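A minimal sketch of the proxy formulation PROXY = DISCREPANCY(F(D_ORIG), F(D_REL)) from the proxy-attribute slides. It assumes F is a histogram on shared bins, DISCREPANCY is the Hellinger distance, and the masking T is additive Gaussian noise. The function names, bin count, noise levels and simulated data are all illustrative assumptions and do not come from the talk.

```python
import math
import random


def histogram(values, lo, hi, bins):
    """Relative frequencies of values over `bins` equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def information_loss(d_orig, d_rel, bins=10):
    """PROXY = DISCREPANCY(F(D_ORIG), F(D_REL)): F = histogram, DISCREPANCY = Hellinger."""
    lo, hi = min(d_orig + d_rel), max(d_orig + d_rel)
    return hellinger(histogram(d_orig, lo, hi, bins), histogram(d_rel, lo, hi, bins))


def mask(d_orig, sigma, rng):
    """Hypothetical masking T: add independent Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in d_orig]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
d_orig = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # simulated "original" variable

loss_light = information_loss(d_orig, mask(d_orig, 0.1, rng))
loss_heavy = information_loss(d_orig, mask(d_orig, 3.0, rng))
print(f"IL with light noise: {loss_light:.3f}")
print(f"IL with heavy noise: {loss_heavy:.3f}")
```

The number is easy to compute, which is the appeal of a proxy. As the illustration in the talk asks, though, a given change in IL does not by itself say what is lost in a specific analysis such as fitting a regression model.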