Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals
Maria Cristina Casciano, Laura Corallo, Daniela Ichim
Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona

Outline
• Multiple releases: MFR and PUF
• Subsampling
  – allocation: reduce the risk of disclosure
  – selection: pre-defined quality standards
• Results
  – Career of Doctorate Holders Survey
• Further work

Multiple surveys, multiple releases, multiple countries
[Diagram: labels "Multiple surveys", "Multiple …", "Multiple countries" and "Multiple releases". For each Member State (MS1, MS2, MS27), the surveys SURVEY1, SURVEY2, SURVEYX each have the releases TABLES, PUF, MFR and OTHER.]

Comparability
• ESSnet on SDC harmonisation and common tools
  – WP1: test the comparability concept
  – Istat, Destatis, Statistics Austria
  – multiple countries
HOW
• 1 Assessment of the effects of different practices on predefined statistics
• 2 Definition of a threshold to define when action is needed
• 3 Setting a process for choosing acceptable practices

Multiple releases
[Diagram: SURVEY1 with its releases TABLES1, PUF1, MFR1, OTHER1.]
• A particular harmonisation dimension
• Hierarchical structure
  – Utility
  – Risk of disclosure

Multiple releases: hierarchical structure
• MFR: more restrictive license, less aggregated information
• PUF: less restrictive license, more aggregated information
[Diagram: MFR and PUF placed at opposite ends of two +/- scales.]
UNIQUE PRODUCTION PROCESS!
PUF-MFR
• MFR
  – definition of a disclosure scenario
  – risk assessment R1
  – risk limitation w.r.t.
    • adopted disclosure scenario
    • some data utility requirements
• PUF
  – harmonized with the MFR (e.g. weighted totals)
  – reduced risk of disclosure
  – random sample
  – internal consistency of records
  – some (other) data utility requirements (CV and weighted totals: precision and accuracy)

Data description
[Diagram: timeline with Doctorate Holders at Year t-5 and Year t-3, and the CDH 2009 Survey at Year t.]
Focus on the characterisation of the occupational status of the PhD holders:
• labour market entry
• job satisfaction
• usefulness of the PhD for obtaining a job
• type of contract
• type of work
• earnings
Estimates by PhD scientific area, by gender and by region

Data description
• 18500 PhD Holders (Census)
  – 72% resp: 12964 respondents
  – 28% No resp
• Adjustment for non-responses via calibration
• Weights obtained by constraining on known marginal distributions: Citizenship, PhD Scientific Area, Gender, Region
[Slide labels next to these variables: "(2 categories)" and "(14 categories)".]

PUF-subsampling
Simple random sampling
• Utility: weighted totals may always be preserved by calibration
• Risk: how many units at risk are sampled?
Example (MFR-CDH): 12964 units, 24.7% of units at risk

Subsampling
[Diagram: "Subsampling" surrounded by the terms key variables, stratification, utility, scenario, allocation, auxiliary, disclosure, dissemination, calibration, domains, totals, sample size, quality, users.]

PUF-subsampling: proposal
1. Optimal allocation of units to be sampled in each domain according to Bethel’s approach (risk minimization)
2.
Selection of a fixed-size balanced sample (CUBE method) (data utility maximization)

1. Bethel’s approach (1989)
● Cost function to minimize:
  $C' = C_0 + \sum_{h=1}^{H_{j_d}} C_h n_h$
  with $n_h$ and $C_h$ related to the risk to be reduced
● Expected Coefficient of Variation (CV) of the estimates of the total of variable $p$ in domain $j_d$ equal to or lower than prefixed thresholds:
  $CV_{p j_d} \le CV^{*}_{p j_d}$
Optimal allocation: $n_h^{*}$

2. Balanced sampling
A sampling design s is said to be balanced on the auxiliary variables $x = (x_1, \ldots, x_j, \ldots, x_p)'$ if and only if the balancing equations given by
  $\hat{X}_{\pi} = X$
are satisfied, where $X$ is the vector of known population totals and $\hat{X}_{\pi}$ is the Horvitz-Thompson (H.-T.) estimator.
• Exact estimates for pre-defined variables

Balanced sampling: the CUBE method
[Figure: cube with vertices (000), (100), (010), (110), (101), (011), (111), intersected by the constraint subspace K.]
Geometrically, each vertex of the hypercube is a sample: $s \in \{0,1\}^N$.
The balancing equations define a subspace of $\mathbb{R}^N$ named K.
The problem is to choose a vertex (sample) of the N-cube that remains in the subspace of constraints K.
Cube method (Deville & Tillé, 2004):
1. Flight phase: a random walk starting from the vector p and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists.
2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C∩K, a sample is selected as close as possible to the constraint space K.

Implementation
1.
Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV level of the estimates below a 5% threshold, for three combinations of the allocation and domain variables
   Allocation variables: Occup, JobS, Contract, Work, Income
   Domain variables: Gender, Region, Scientific Area, Year of Completion
2. Six possible settings, corresponding to different choices of the parameters:
   a. Risk R1 used as the minimization cost of the algorithm
   b. Risk R1 used as a stratification variable
   c. Include all units of the strata containing no units at risk

Allocations (CV* = 5%)
C.S  Risk.cost  Risk.strat  Cens.no.risk  #Strata  #Cens.strata  #Cens.units  Size Bethel  Size Prop.  Size Equal  Max.BethelEqual  Max.BethelProp
1    N          Y           N             925      153           252          4933         5391        5550        459              618
2    N          Y           Y             925      214           704          5105         5547        5550        443              446
3    Y          Y           N             925      204           558          5239         5719        5550        480              311
4    Y          Y           Y             925      235           814          5330         5781        5550        451              220
5    Y          N           N             925      240           687          5555         5953        6475        399              921
6    Y          N           Y             925      269           983          5649         6094        6475        446              827
7    N          Y           N             925      306           1614         8725         9256        9250        530              524
8    N          Y           Y             925      352           1919         8827         9324        9250        498              424
9    Y          Y           N             925      416           3229         8955         9424        9250        468              294
10   Y          Y           Y             925      451           3398         9045         9511        9250        466              205
11   Y          N           N             925      426           3243         9151         9601        9250        451              100
12   Y          N           Y             925      457           3399         9222         9669        9250        446              84
13   N          Y           N             56       0             0            4745         4773        4760        138              132
14   N          Y           Y             56       28            9761         10320        10346       10360       166              630
15   Y          Y           N             56       21            5844         8812         8841        8848        189              389
16   Y          Y           Y             56       28            9761         10323        10349       10360       166              630
17   Y          N           N             28       0             0            4760         4774        4788        176              88
18   Y          N           Y             28       0             0            4759         4774        4788        176              88

Allocations
[Figure: Bethel sample size (axis 0 to 12000) against CV (axis 0 to 0.25), one series for each of the settings 1 to 18.]

Balanced sample
Selection of samples of fixed size from the CDH survey
Utility constraints on:
• the population size N
•
the optimal sample size n
• the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area
18 equations
CUBE algorithm:
I. Input: the vector p is the optimal one determined by Bethel
II. Flight phase ends with no exact solution
III. Landing phase starts: selection of a sample which ensures a low difference to the balance, according to the distance between p* and p

Results
[Figure: Median of absolute relative errors.]

Results
C.S  Risk.cost  Risk.strat  Cens.no.risk  Risk  Occup  JobS  Contract  Work  Income
1    N          Y           N             1366  0.88   0.97  0.97      0.99  0.99
2    N          Y           Y             1333  0.92   0.99  0.94      0.97  0.99
3    Y          Y           N             1335  0.92   0.98  0.95      0.99  0.99
4    Y          Y           Y             1354  0.87   0.99  0.95      0.97  0.99
5    Y          N           N             1490  0.86   0.98  0.97      0.98  0.98
6    Y          N           Y             1525  0.91   0.98  0.95      0.97  0.99
7    N          Y           N             2194  0.83   0.91  0.99      0.97  1.00
8    N          Y           Y             2177  0.56   0.81  0.99      0.94  0.99
9    Y          Y           N             2149  0.78   0.91  0.99      0.91  1.00
10   Y          Y           Y             2163  0.64   0.88  0.97      0.95  0.99
11   Y          N           N             2232  0.63   0.87  0.99      0.86  1.00
12   Y          N           Y             2233  0.55   0.78  0.96      0.94  0.99
13   N          Y           N             1272  0.96   0.99  0.92      0.96  0.98
14   N          Y           Y             559   0.52   0.79  0.41      0.83  0.98
15   Y          Y           N             564   0.77   0.94  0.93      0.97  0.99
16   Y          Y           Y             562   0.84, 0.59, 0.88, 0.99 and 0.56* (column positions not recoverable)
17   Y          N           N             1270  0.95   0.99  0.98      0.99  0.99
18   Y          N           Y             1247  0.91   0.99  0.98      0.99  0.98

Further work
1. The relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design.
2. The introduction of a utility-priority approach into the way the balancing equations are dealt with.
3. The use of other data utility constraints, still to be investigated.
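
Illustration: simple random subsampling with calibrated weights
The PUF-subsampling slide notes that, under simple random sampling, weighted totals may always be preserved by calibration. The sketch below illustrates only that baseline idea, not the Bethel allocation or the CUBE selection proposed in the talk. It assumes the simplest special case of calibration, post-stratification on a single categorical variable. The function name `subsample_and_calibrate`, the field names `gender` and `w`, and the toy records are hypothetical and do not come from the CDH data.

```python
import random
from collections import defaultdict


def subsample_and_calibrate(records, n, cell_key, weight_key="w", seed=0):
    """Draw a simple random subsample of size n and post-stratify its weights.

    Assumption: calibration is reduced to post-stratification on one
    categorical variable (cell_key). Returns new records; inputs are not
    modified.
    """
    rng = random.Random(seed)

    # Weighted totals per cell in the full file: the targets to preserve.
    target = defaultdict(float)
    for r in records:
        target[r[cell_key]] += r[weight_key]

    # Simple random sampling without replacement.
    sample = rng.sample(records, n)

    # Weights after subsampling: original weight times N/n.
    factor = len(records) / n
    reached = defaultdict(float)
    for r in sample:
        reached[r[cell_key]] += r[weight_key] * factor

    # A cell with no sampled unit cannot be calibrated.
    missing = set(target) - set(reached)
    if missing:
        raise ValueError(f"no sampled unit in cells: {sorted(missing)}")

    # Rescale inside each cell so that the weighted total matches the target.
    calibrated = []
    for r in sample:
        g = target[r[cell_key]] / reached[r[cell_key]]
        calibrated.append({**r, weight_key: r[weight_key] * factor * g})
    return calibrated


# Toy data (hypothetical): 200 units, two gender categories, varying weights.
records = [
    {"id": i, "gender": "F" if i % 2 else "M", "w": 1.0 + (i % 5) * 0.1}
    for i in range(200)
]
puf = subsample_and_calibrate(records, n=50, cell_key="gender")

for g in ("F", "M"):
    full = sum(r["w"] for r in records if r["gender"] == g)
    sub = sum(r["w"] for r in puf if r["gender"] == g)
    print(g, round(full, 6), round(sub, 6))
```

The calibrated subsample reproduces the weighted totals by construction, but says nothing about how many units at risk were drawn, which is the question the risk-driven allocation in the proposal addresses.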