Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals
Maria Cristina Casciano, Laura Corallo, Daniela Ichim
Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona
Outline
• Multiple releases: MFR and PUF
• Subsampling
– allocation: reduce the risk of disclosure
– selection: pre-defined quality standards
• Results
– Career of Doctorate Holders Survey
• Further work
Multiple releases
[Diagram: multiple surveys (SURVEY1, SURVEY2, ..., SURVEYX) in multiple countries (MS1, MS2, ..., MS27); each survey has multiple releases: TABLES, PUF, MFR, OTHER]
Comparability
• ESSnet on SDC harmonisation and common tools
– WP1: test the comparability concept
– Istat, Destatis, Statistics Austria
– multiple countries
HOW
1. Assessment of effects of different practices on predefined statistics
2. Definition of a threshold to define when action is needed
3. Setting a process for choosing acceptable practices
Multiple releases
[Diagram: SURVEY1 with its releases TABLES1, PUF1, MFR1, OTHER1]
• A particular harmonisation dimension
• Hierarchical structure
– Utility
– Risk of disclosure
Multiple releases: hierarchical structure
• MFR: more restrictive license, less aggregated information
• PUF: less restrictive license, more aggregated information
UNIQUE PRODUCTION PROCESS!
PUF-MFR
• MFR
– definition of a disclosure scenario
– risk assessment R1
– risk limitation w.r.t.
  • adopted disclosure scenario
  • some data utility requirements
• PUF
– harmonized with the MFR (e.g. weighted totals)
– reduced risk of disclosure
– random sample
– internal consistency of records
– some (other) data utility requirements (CV and weighted totals: precision and accuracy)
Data description
[Timeline: Doctorate Holders, Year t-5 and Year t-3; CDH 2009 Survey, Year t]
Focus on the characterisation of the occupational status of the PhD holders:
• job satisfaction
• labour market entry
• usefulness of the PhD for obtaining a job
• type of contract
• type of work
• earnings
Estimates by PhD scientific area, by gender and by region
Data description
• 18500 PhD Holders (Census)
• 72% respondents, 28% non-respondents
• 12964 respondents
• Adjustment for non-response via calibration
• Weights obtained by constraining on known marginal distributions: Citizenship, PhD Scientific Area, Gender, Region (category counts shown on the slide: 2 categories, 14 categories)
PUF-subsampling
Simple random sampling
Utility: weighted totals may always be preserved by calibration
Risk: how many units at risk are sampled?
Example (MFR-CDH): 12964 units, 24.7% of units at risk
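The two points of this slide can be illustrated with a minimal sketch. Under simple random sampling the expected share of units at risk in the subsample equals the share in the MFR (24.7% of 12964, roughly 3200 units, in the CDH example), so subsampling alone does not target the risky units, while calibration restores the weighted totals. Everything below is hypothetical illustration: the file size, the weights, the single calibration variable `gender` and the function names are assumptions, and the calibration shown is plain post-stratification rather than the calibration actually used for the CDH survey.

```python
import numpy as np

def srs_subsample(n_units, n_sample, rng):
    """Indices of a simple random sample drawn without replacement."""
    return rng.choice(n_units, size=n_sample, replace=False)

def calibrate_to_totals(weights, groups, target_totals):
    """Post-stratification: rescale the weights inside each group so that the
    weighted count of the group equals its target total.
    Assumes every group is present in the sample."""
    weights = np.asarray(weights, dtype=float)
    new_w = weights.copy()
    for g, total in target_totals.items():
        mask = groups == g
        new_w[mask] *= total / weights[mask].sum()
    return new_w

rng = np.random.default_rng(0)
N = 1000                                  # hypothetical MFR size
mfr_w = rng.uniform(1.0, 2.0, size=N)     # hypothetical MFR weights
gender = rng.integers(0, 2, size=N)       # hypothetical calibration variable
at_risk = rng.random(N) < 0.247           # share of units at risk as in the example

# Weighted totals of the MFR that the PUF has to reproduce
targets = {g: mfr_w[gender == g].sum() for g in (0, 1)}

n = 300
idx = srs_subsample(N, n, rng)
puf_w = calibrate_to_totals(mfr_w[idx] * N / n, gender[idx], targets)

print("units at risk in the subsample:", int(at_risk[idx].sum()))
for g in (0, 1):
    print(g, puf_w[gender[idx] == g].sum(), targets[g])
```

The printed pairs agree by construction, whatever sample is drawn; the number of units at risk in the subsample, on the other hand, is left to chance.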
[Word cloud around "Subsampling": key variables, stratification, utility, scenario, allocation, auxiliary, disclosure, dissemination, calibration, domains, totals, sample size, quality, users]
PUF-subsampling: proposal
1. Optimal allocation of units to be sampled in each domain according to Bethel's approach (Risk minimization)
2. Selection of a fixed size balanced sample (CUBE method) (Data utility maximization)
1. Bethel's approach (1989)
● Cost function to minimize: $C' = C_0 + \sum_{h=1}^{H_{j_d}} C_h n_h$
→ $n_h$ and $C_h$ related to the risk to be reduced
● Expected Coefficient of Variation (CV) of the estimates of the total of variable $p$ in domain $j_d$ equal to or lower than prefixed thresholds: $CV^{p}_{j_d} \le CV^{*p}_{j_d}$
→ Optimal allocation: $n_h^*$
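To make the idea of a cost-minimising allocation under a CV constraint concrete, here is a simplified sketch. It is not Bethel's multivariate algorithm: it assumes a single variable, a single domain and stratified simple random sampling, for which the minimum of sum(C_h n_h) under one variance constraint has a closed form. The stratum sizes, standard deviations, costs and the function names are all hypothetical; the costs C_h play the role of a per-stratum risk measure.

```python
import math

def cost_optimal_allocation(N_h, S_h, C_h, total, cv_target):
    """Minimise sum(C_h * n_h) subject to CV(estimated total) <= cv_target,
    for one variable under stratified simple random sampling.
    Returns real-valued n_h (not rounded, not capped at N_h)."""
    v_target = (cv_target * total) ** 2
    k = v_target + sum(N * S ** 2 for N, S in zip(N_h, S_h))
    a = sum(N * S * math.sqrt(C) for N, S, C in zip(N_h, S_h, C_h))
    return [N * S / math.sqrt(C) * a / k for N, S, C in zip(N_h, S_h, C_h)]

def cv_of_total(N_h, S_h, n_h, total):
    """CV of the stratified estimator of the total for a given allocation."""
    var = sum(N ** 2 * S ** 2 * (1 / n - 1 / N) for N, S, n in zip(N_h, S_h, n_h))
    return math.sqrt(var) / total

# Hypothetical strata: sizes, standard deviations, risk-based unit costs
N_h = [1000, 2000, 500]
S_h = [10.0, 20.0, 15.0]
C_h = [1.0, 1.0, 4.0]      # third stratum is treated as the riskiest
total = 240000.0           # hypothetical population total of the variable

n_opt = cost_optimal_allocation(N_h, S_h, C_h, total, cv_target=0.01)
n_int = [math.ceil(n) for n in n_opt]
print(n_int, cv_of_total(N_h, S_h, n_int, total))
```

In this sketch a stratum with a higher cost receives a smaller sample, which is how a risk-related C_h pushes the allocation away from strata with many units at risk while the CV constraint is still met.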
2. Balanced sampling
A sampling design s is said to be balanced on the auxiliary variables $x = (x_1 \dots x_j \dots x_p)'$ if and only if the balancing equations given by:
$\hat{X}_\pi = X$
are satisfied, where $X$ is the vector of known population totals and $\hat{X}_\pi$ is the Horvitz-Thompson (H.-T.) estimator
→ exact estimates for pre-defined variables
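The balancing equations can be checked numerically for any given sample. The sketch below computes the Horvitz-Thompson estimates of the auxiliary totals and compares them with the known totals; the function names and the four-unit toy population are assumptions made for illustration only.

```python
import numpy as np

def ht_totals(X, pi, sample):
    """Horvitz-Thompson estimates of the column totals of X.
    X: (N, p) auxiliary variables, pi: inclusion probabilities,
    sample: 0/1 indicator of the selected units."""
    X = np.asarray(X, dtype=float)
    pi = np.asarray(pi, dtype=float)
    s = np.asarray(sample, dtype=bool)
    return (X[s] / pi[s, None]).sum(axis=0)

def is_balanced(X, pi, sample):
    """True when the H.-T. estimates reproduce the known population totals."""
    return bool(np.allclose(ht_totals(X, pi, sample), np.asarray(X, float).sum(axis=0)))

# Toy population: one auxiliary variable, equal inclusion probabilities
X = np.array([[1.0], [2.0], [3.0], [4.0]])
pi = np.full(4, 0.5)

print(is_balanced(X, pi, [1, 0, 0, 1]))   # units with x = 1 and x = 4
print(is_balanced(X, pi, [1, 1, 0, 0]))   # units with x = 1 and x = 2
```

The first sample reproduces the total of 10 exactly, the second one does not, although both have the same size.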
Balanced sampling: the CUBE method
Geometrically each vertex of the hypercube is a sample: $s \in \{0,1\}^N$
[Figure: unit cube with vertices (000), (100), (010), (110), (101), (011), (111) and the constraint subspace K]
The balancing equations define a subspace of $R^N$ named K.
The problem is to choose a vertex (sample) of the N-cube that remains in the sub-space of constraints K
Cube method (Deville & Tillé, 2004):
1. Flight phase: a random walk starting from the vector p and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists.
2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C∩K, a sample is selected as close as possible to the constraint space K.
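The flight phase can be sketched in a few lines. The version below is a simplified, unoptimised illustration of the random walk (it recomputes a null-space direction with an SVD at every step, which is only practical for small N); it is not the implementation used by the authors, and the function name and tolerances are assumptions.

```python
import numpy as np

def flight_phase(pi, X, rng, eps=1e-9):
    """Random walk on the inclusion probabilities that keeps A @ pi constant,
    where A has columns x_k / pi_k, until no admissible direction is left."""
    pi0 = np.asarray(pi, dtype=float)
    A = (np.asarray(X, dtype=float) / pi0[:, None]).T      # shape (p, N)
    cur = pi0.copy()
    while True:
        free = np.flatnonzero((cur > eps) & (cur < 1 - eps))
        if free.size == 0:
            break                                          # a vertex was reached
        # Direction u in the null space of A restricted to the undecided units
        _, sv, vt = np.linalg.svd(A[:, free])
        rank = int(np.sum(sv > 1e-10 * max(sv[0], 1.0)))
        if rank >= free.size:
            break                                          # no direction left
        u = vt[-1]
        f = cur[free]
        pos, neg = u > eps, u < -eps
        # Largest forward (lam1) and backward (lam2) steps staying inside [0, 1]
        lam1 = min(np.min((1 - f[pos]) / u[pos], initial=np.inf),
                   np.min(f[neg] / -u[neg], initial=np.inf))
        lam2 = min(np.min(f[pos] / u[pos], initial=np.inf),
                   np.min((1 - f[neg]) / -u[neg], initial=np.inf))
        # Step probabilities chosen so that the expected move is zero
        step = lam1 * u if rng.random() < lam2 / (lam1 + lam2) else -lam2 * u
        cur[free] = np.clip(f + step, 0.0, 1.0)
    cur[cur < eps] = 0.0
    cur[cur > 1 - eps] = 1.0
    return cur

rng = np.random.default_rng(1)
pi = np.full(12, 0.5)
x = np.arange(1, 13, dtype=float)
X = np.column_stack([pi, x])          # balance on the sample size and on x
print(flight_phase(pi, X, rng))
```

Each step sends at least one undecided unit to 0 or 1, so the walk ends after at most N steps. Units that are still fractional at the end (at most as many as there are balancing variables) are the ones left to the landing phase.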
Implementation
1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV level of the estimates below a 5% threshold for three combinations of the allocation and domain variables
Allocation variables: Occup, JobS, Contract, Work, Income
Domain variables: Gender, Region, Scientific Area, Year of Completion
2. Six possible settings, corresponding to different choices of the parameters:
a. Risk R1 used as the minimization cost of the algorithm
b. Risk R1 used as a stratification variable
c. include all units of the strata containing no units at risk
Allocations (CV* = 5%)
C.S  Risk.cost  Risk.strat  Cens.no.risk  # Strata  #Cens.strata  #Cens.units  Size Bethel  Size Prop.  Size Equal  Max.BethelEqual  Max.BethelProp
1    N          Y           N             925       153           252          4933         5391        5550        459              618
2    N          Y           Y             925       214           704          5105         5547        5550        443              446
3    Y          Y           N             925       204           558          5239         5719        5550        480              311
4    Y          Y           Y             925       235           814          5330         5781        5550        451              220
5    Y          N           N             925       240           687          5555         5953        6475        399              921
6    Y          N           Y             925       269           983          5649         6094        6475        446              827
7    N          Y           N             925       306           1614         8725         9256        9250        530              524
8    N          Y           Y             925       352           1919         8827         9324        9250        498              424
9    Y          Y           N             925       416           3229         8955         9424        9250        468              294
10   Y          Y           Y             925       451           3398         9045         9511        9250        466              205
11   Y          N           N             925       426           3243         9151         9601        9250        451              100
12   Y          N           Y             925       457           3399         9222         9669        9250        446              84
13   N          Y           N             56        0             0            4745         4773        4760        138              132
14   N          Y           Y             56        28            9761         10320        10346       10360       166              630
15   Y          Y           N             56        21            5844         8812         8841        8848        189              389
16   Y          Y           Y             56        28            9761         10323        10349       10360       166              630
17   Y          N           N             28        0             0            4760         4774        4788        176              88
18   Y          N           Y             28        0             0            4759         4774        4788        176              88
Allocations
[Figure: Bethel sample size (axis from 0 to 12000) against CV (axis from 0 to 0.25); legend entries 1 to 18]
Balanced sample
Selection of samples of fixed size from the CDH survey:
Utility constraints on:
• the population size N
• the optimal sample size n
• the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area
→ 18 equations
CUBE algorithm:
I. Input vector p is the optimal one determined by Bethel
II. Flight phase ends with no exact solution
III. Landing phase starts: selection of a sample which ensures a low difference to the balance, according to the distance between p* and p
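As an illustration of how such utility constraints can be passed to a balanced sampling routine, the sketch below stacks the balancing variables into one matrix: a column of ones (population size N), the inclusion probabilities themselves (fixed sample size n) and one indicator column per category of each classification variable (marginal frequency distributions). The function names and the six-unit toy data are assumptions, and the sketch does not try to reproduce how the 18 equations of the slide are counted.

```python
import numpy as np

def indicator_columns(codes):
    """One 0/1 column per distinct category found in `codes`."""
    codes = np.asarray(codes)
    levels = np.unique(codes)
    return (codes[:, None] == levels[None, :]).astype(float)

def balancing_matrix(pi, *categoricals):
    """Stack the balancing variables column-wise.

    Column 0 (all ones) balances on the population size N,
    column 1 (pi itself) fixes the sample size n,
    the remaining columns are category indicators."""
    pi = np.asarray(pi, dtype=float)
    cols = [np.ones_like(pi)[:, None], pi[:, None]]
    cols += [indicator_columns(c) for c in categoricals]
    return np.hstack(cols)

# Toy data: six units, hypothetical inclusion probabilities and categories
pi = np.array([0.5, 0.5, 0.25, 0.25, 0.75, 0.75])
gender = np.array(["F", "M", "F", "M", "F", "M"])
area = np.array([1, 1, 2, 2, 3, 3])

X = balancing_matrix(pi, gender, area)
print(X.shape)          # (number of units, number of balancing variables)
print(X.sum(axis=0))    # the totals a balanced sample has to reproduce
```

In the proposal of the slides the inclusion probabilities would come from the Bethel allocation; here they are simply made up.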
Results
Median of absolute relative errors
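A median of absolute relative errors can be computed as in the sketch below. What exactly is compared on the slide is not stated in the extracted text; the sketch assumes PUF estimates are compared with the corresponding reference (e.g. MFR) estimates over a set of domains, and the numbers are invented.

```python
import numpy as np

def median_abs_relative_error(estimates, reference):
    """Median over domains of |estimate - reference| / |reference|."""
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(reference, dtype=float)
    return float(np.median(np.abs(est - ref) / np.abs(ref)))

# Hypothetical domain estimates from a subsample and their reference values
puf_estimates = [110.0, 95.0, 200.0]
mfr_estimates = [100.0, 100.0, 200.0]
print(median_abs_relative_error(puf_estimates, mfr_estimates))
```

The median is less sensitive than the mean to a few small domains with very large relative errors.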
Results
C.S  Risk.cost  Risk.strat  Cens.no.risk  Risk  Occup  JobS  Contract  Work  Income
1    N          Y           N             1366  0.88   0.97  0.97      0.99  0.99
2    N          Y           Y             1333  0.92   0.99  0.94      0.97  0.99
3    Y          Y           N             1335  0.92   0.98  0.95      0.99  0.99
4    Y          Y           Y             1354  0.87   0.99  0.95      0.97  0.99
5    Y          N           N             1490  0.86   0.98  0.97      0.98  0.98
6    Y          N           Y             1525  0.91   0.98  0.95      0.97  0.99
7    N          Y           N             2194  0.83   0.91  0.99      0.97  1.00
8    N          Y           Y             2177  0.56   0.81  0.99      0.94  0.99
9    Y          Y           N             2149  0.78   0.91  0.99      0.91  1.00
10   Y          Y           Y             2163  0.64   0.88  0.97      0.95  0.99
11   Y          N           N             2232  0.63   0.87  0.99      0.86  1.00
12   Y          N           Y             2233  0.55   0.78  0.96      0.94  0.99
13   N          Y           N             1272  0.96   0.99  0.92      0.96  0.98
14   N          Y           Y             559   0.52   0.79  0.41      0.83  0.98
15   Y          Y           N             564   0.77   0.94  0.93      0.97  0.99
16   Y          Y           Y             562   0.56*  0.84  0.59      0.88  0.99
17   Y          N           N             1270  0.95   0.99  0.98      0.99  0.99
18   Y          N           Y             1247  0.91   0.99  0.98      0.99  0.98
Further work
1. the relationship between coefficients of variation and disclosure risk, together with different options of including the risk of disclosure in the sampling design;
2. the introduction of a utility-priority approach into the way to deal with the balancing equations;
3. the usage of other data utility constraints to be investigated.