Statistical Disclosure Control (SDC) at SURS
Andreja Smukavec
General Methodology and Standards Sector

Why is confidentiality protection needed?
• One of the fundamental principles of official statistics is that information provided by data suppliers is strictly confidential and is used only for statistical purposes.
• Legislation places a legal obligation on national statistical institutes (NSIs) to protect data suppliers.
• Data suppliers should have confidence that the NSI will preserve the confidentiality of individual information; this confidence leads to better quality of the collected data.

National legislation
• National Statistics Act
  – Data are published in aggregated form.
  – Data may be published individually if
    • written consent of reporting units is obtained;
    • data are collected from public data collections;
    • data are published in such a way that the reporting units cannot be directly identified.
  – The Office or authorized producers shall transmit individual data to users on the basis of a written application.
• Other legislation
  – Personal Data Protection Act;
  – …

European legislation
• European Regulation (EC) No 223/2009
  – General definitions;
  – Chapter 5 – Statistical Confidentiality
    • Access to confidential data for scientific purposes
• European Statistics Code of Practice – Principle 5: The confidentiality of the information that data providers supply and its use only for statistical purposes are absolutely guaranteed.

What does SDC cover at SURS?
• Tabular data protection
  – Publication
  – Eurostat and other institutions
  – Users' requests
• Microdata protection
  – Preparation of public-use files and scientific-use files
  – Checking rules set up by Eurostat
• Output checking

Tabular data protection
• Tables – aggregated data
  – Magnitude tables: the sum of a quantitative variable over respondents, where respondents are grouped by categorical variables.
  – Frequency tables: the number of respondents, where respondents are grouped by categorical variables.
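
To make the two table types concrete, here is a minimal sketch that builds both from the same microdata. The records, the variable names (region, activity, turnover) and the helper build_tables are all hypothetical and serve only as an illustration; they are not taken from any SURS system.

```python
from collections import defaultdict

# Hypothetical microdata: one record per respondent (enterprise).
# Variable names and values are invented for illustration only.
respondents = [
    {"region": "East", "activity": "Retail", "turnover": 120.0},
    {"region": "East", "activity": "Retail", "turnover": 80.0},
    {"region": "East", "activity": "Manufacturing", "turnover": 500.0},
    {"region": "West", "activity": "Retail", "turnover": 60.0},
    {"region": "West", "activity": "Manufacturing", "turnover": 300.0},
    {"region": "West", "activity": "Manufacturing", "turnover": 200.0},
]


def build_tables(records, group_vars, value_var):
    """Return (frequency_table, magnitude_table) keyed by category tuples."""
    frequency = defaultdict(int)    # number of respondents per cell
    magnitude = defaultdict(float)  # sum of the quantitative variable per cell
    for rec in records:
        cell = tuple(rec[var] for var in group_vars)
        frequency[cell] += 1
        magnitude[cell] += rec[value_var]
    return dict(frequency), dict(magnitude)


frequency_table, magnitude_table = build_tables(
    respondents, group_vars=("region", "activity"), value_var="turnover"
)

# Print both tables side by side, one cell per line.
for cell in sorted(frequency_table):
    print(cell, "count =", frequency_table[cell], "sum =", magnitude_table[cell])
```

Both tables share the same cells (combinations of the categorical variables); they differ only in what is stored per cell: a count of respondents or a sum of their values.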
Tabular data protection at SURS
• Method: Cell Suppression
  – Post-tabular method
  – Non-perturbative method (less information available)
  – Implemented in Tau-Argus software (CASC project)
  – The interval of possible values for each sensitive cell is sufficiently large

Tabular data protection: Cell Suppression
• Sensitivity rules – defining unsafe cells
  – Threshold: the number of respondents in a cell is below a certain threshold value.
  – Concentration rules: one or two respondents are dominant.
  – Group disclosure: all respondents in one cell have the same value for a sensitive variable.

Cell Suppression
• Secondary suppression
  – Needed because of the sums (marginal totals) in the tables: a single suppressed cell could otherwise be recalculated from them.
  – The feasibility interval for each unsafe cell has to be wide enough.
  – Optimisation problem, solved with an LP solver (Xpress, CPLEX).

Cell Suppression: Publication

Microdata protection
• Microdata are deindividualized pieces of information for individual units (enterprises, persons, households).
  – no direct identifiers (ID numbers, tax numbers, name + address…)
• Microdata files are available to our researchers in the secure room and via remote access.

Microdata protection: Scientific-use file (SUF)
• Prepared for researchers
• Signed contract
• Usually sent on CD with a password; has to be destroyed after use
• More information (variables) available
• Protected only against unintentional disclosure

Microdata protection: Public-use file (PUF)
• Publicly available, or available after registration
• Less information (variables) available
• Not all microdata protection methods are usable (some are too complex for normal users)
• Protected against intentional disclosure as well

Microdata protection
• The goal of microdata protection is to make a safe microdata file, where
  – disclosure risk is low;
  – analyses done on the safe file give results that are close or equal to the results of analyses done on the original data.
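
The trade-off in this goal can be illustrated with a small sketch that protects one quantitative variable in two ways and then compares simple statistics with the original. Top coding and microaggregation are named in these slides, but the data, the cap of 50, the group size k = 3 and the fixed-size grouping used here are assumptions made for illustration; this is not a description of how Mu-Argus or the SURS production process works.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical original values of one quantitative variable (e.g. income);
# the numbers are invented for illustration only.
original = [12, 15, 18, 21, 22, 25, 30, 34, 41, 250]


def top_code(values, cap):
    """Top coding: replace every value above `cap` by `cap`."""
    return [min(v, cap) for v in values]


def microaggregate(values, k):
    """Univariate microaggregation with fixed group size k (an assumption).

    Values are sorted and split into consecutive groups of k; a remainder
    smaller than k is merged into the previous group. Each value is replaced
    by its group mean. Results are returned in the original record order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    protected = [0.0] * len(values)
    for group in groups:
        group_mean = mean(values[i] for i in group)
        for i in group:
            protected[i] = group_mean
    return protected


top_coded = top_code(original, cap=50)
aggregated = microaggregate(original, k=3)

# Compare simple statistics on the original and on the protected files.
for name, data in [("original", original),
                   ("top coded", top_coded),
                   ("microaggregated", aggregated)]:
    print(f"{name:16s} mean = {mean(data):8.2f}  median = {median(data):8.2f}")

# After microaggregation every released value is shared by at least k records.
print(Counter(aggregated))
```

In this sketch, top coding hides the extreme value and leaves the median unchanged but lowers the mean, while microaggregation keeps the overall mean but changes the median. Which statistics must stay close to the original depends on the intended analyses.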
Microdata protection methods used at SURS
• Modifying the original microdata file, done by
  – non-perturbative methods:
    • global recoding;
    • top and bottom coding;
    • local suppression (not very usable for PUFs).
  – some perturbative methods:
    • microaggregation;
    • rounding.
• Software packages: Mu-Argus and R.

Labour Force Survey – PUF
• Prepared for Social Data Archives (DwB project).
• We used Eurostat's rules for creating the SUF and then created the PUF by sampling (one third of the original sample).
• We didn't use local suppression.
• The quality of the statistics used as parameters for the sampling method is ensured; other statistics should be used with caution.

Output checking
1. Researchers fill out our form after finishing their work.
2. An e-mail is sent to our common e-mail address zascas.surs@gov.si.
3. One of the SDC methodologists checks the output. In case of disclosive data or an incorrectly filled-in form, the researchers are contacted for additional information or asked to correct the output.
4. After the SDC methodologist agrees with the dissemination, the output is sent to the researcher by e-mail.

Rules for output checking
• Rule-of-thumb model
  – Threshold N – all tabular and similar output should have at least N units.
  – Dominance rule – the analysis should not be done on groups with a dominant unit.
  – Maximum and minimum are usually not released (exception: if they refer to more than one unit).
  – The 100% percentile is usually not released (it is the maximum).

Thank you for your attention!
zascas.surs@gov.si