Combined use of data from
registers and sample surveys
Eric Schulte Nordholt
Statistics Netherlands
Division Socio-economic and spatial statistics
e.schultenordholt@cbs.nl
Statistical Training Course on Use of Administrative
Registers in Production of Statistics in Warsaw (October
2014)
Contents General
• Social Statistics
• System of social statistical datasets (SSD)
• Group work on registers and surveys
• The Dutch virtual census
• Time for questions and discussion
Contents Social Statistics
• Requirements for modern Social Statistics
• Driving forces
• Policy implications
• Life cycle model
• Relevant statistical information for policy and society
• Strategy for data collection
• Secondary data
• How to get consistency of different data sources?
• Prototype of a micro database
• Conclusions
Requirements for modern
Social Statistics
Product quality (Eurostat Code of Practice):
1. Relevance
2. Accuracy
3. Timeliness and punctuality
4. Comparability and coherence
5. Accessibility and clarity
Driving Forces
More coherence, more thematic publications,
more detail (small areas, population groups)
and more flexibility in the statistical output (will
lead to a better product)
ICT developments: more registers
High nonresponse rates in social surveys
To cut down processing costs: standardisation
To lower response burden: fewer questions, EDI (or
EDC) and a smaller ‘irritation factor’
Policy implications
• From primary to secondary data collection
– Wherever possible use data available in existing
registers and other administrative sources
– Primary data collection only, if no (timely) data
available (or of bad quality)
– Statistics Netherlands Act
• From traditional to electronic data collection
• Standardisation of statistical processes; multi-data-source statistics; efficient sampling
• Challenges must be faced while the available
budget is constantly being reduced
Life cycle model (1)
[Figure: domains of the life cycle, with example variables for two of them]
• Labour market position
– Working/non-working
– Occupation
– Economic activity
• Demography
– Year of birth
– Nationality
– Household composition
– Etc.
• Education
• Income
• Health
• Consumption
• Social capital
• Housing
• Time use
• Well-being
• …
Life cycle model (2)
[Figure: domains of the life cycle]
• Labour market position
• Education
• Income
• Health
• Consumption
• Demography
• Housing
• Time use
• Social capital
• Well-being
• …
Life cycle model (3)
[Figure: the variables of the life cycle domains observed at successive time points T, T+1 and T+2]
Life cycle model (4)
Life cycle model (5)
Analysis possibilities:
• State
• Transitions between states
• Duration time in a certain state
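The three analysis possibilities can be illustrated with a minimal sketch. It assumes one observed state per person per period; the RIN keys, state labels and function names are made up for illustration and are not taken from the SSD.

```python
from collections import Counter
from itertools import groupby

# Hypothetical yearly labour market states per person (keys are made-up RINs).
histories = {
    "rin_001": ["education", "education", "employed", "employed", "employed"],
    "rin_002": ["employed", "unemployed", "unemployed", "employed", "employed"],
}

def state_counts(histories, t):
    """State: number of persons in each state at time index t."""
    return Counter(states[t] for states in histories.values())

def transition_counts(histories):
    """Transitions: moves between different states in consecutive periods."""
    moves = Counter()
    for states in histories.values():
        for before, after in zip(states, states[1:]):
            if before != after:
                moves[(before, after)] += 1
    return moves

def spell_durations(states):
    """Duration: length (in periods) of each uninterrupted spell, in time order."""
    return [(state, len(list(run))) for state, run in groupby(states)]

print(state_counts(histories, 2))
print(transition_counts(histories))
print(spell_durations(histories["rin_002"]))
```

All three measures need the same person to be followed over time, which is why the longitudinal micro data matter here.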
Life cycle model (6)
Relevant statistical information
for policy and society
• Domain specific
• Transitions and durations within a domain
• Relations between domains
• Relations between transitions and durations
between domains
• Monitor information (long period)
Strategy for data collection (1)
• Start with registers (e.g. population register,
housing register, business register)
• Add data from other administrative sources
• Add data from business and household surveys
• Match all these data at the micro level
• Create a ‘data clearing house’ within the
statistical office
Strategy for data collection (2)
[Figure: data matrix of all inhabitants of the Netherlands (records 1 … n) by variables, filled from registers and surveys]
Strategy for data collection (3)
Matching method for individual data
[Figure: the Longitudinal Population Register is matched to administrative or survey data on the RIN]
Secondary data (1)
Quality
• Quality may be good for some basic registers,
but not for all registers; monitoring quality is
important
• No sampling errors
• No unit nonresponse
• Many sources of non-sampling errors remain:
– Item nonresponse
– Measurement errors
– Coverage errors
Secondary data (2)
Challenges
• Impact on the organisation, coordination,
crossing departmental boundaries, change in
culture
• Influence of a statistical office on contents of
registers is limited
• Communication with register holders, e.g. about
quality and changes
• Quality control system (control surveys?)
• Comprehensive, standardised metadata system
• Version control system for updates
• Changing from surveys to registers without
causing a trend break
How to get consistency of different
data sources?
• Harmonisation! (coverage, definitions,
reference periods, etc.)
• Editing of all records at micro level by
automated procedures
• Only edit what needs to be edited (clear
instructions are necessary!)
• Make use of the technique of repeated
weighting for survey data
Prototype of a micro database (1)
[Figure: micro database with the variable blocks X1…XK, Y1…YM, Z1…ZR and U1…US; two of the blocks are labelled LFS and HS]
Prototype of a micro database (2)
Output inspired harmonisation:
the one figure for one phenomenon idea
StatLine:
all statistical information on the web
(via home page of Statistics Netherlands)
http://www.cbs.nl/en-GB/menu/home/default.htm
Conclusions
Social Statistics develop in the direction of a
permanent virtual census to be able to
produce:
– More crosstables over different domains
– More longitudinal information
– More flexible policy relevant output
Contents System of social
statistical datasets (SSD)
• Introduction to Statistics Netherlands
• Examples of registers
• Definition and driving forces of the SSD
• The scope of the SSD
• Core and satellites
• The process
• Linking the sources
• Micro integration
• Estimation aspects
• Statistical confidentiality
• Conclusions
Introduction to Statistics Netherlands
(1)
The Central Statistical Office (CBS)
• almost all official statistics in the Netherlands
• no regional offices
• two buildings: The Hague (in the West) and Heerlen (in the South);
each has about 1000 employees
Introduction to Statistics Netherlands
(2)
Mission
The mission of Statistics Netherlands is to publish reliable
and coherent statistical information that meets the needs of
society.
Position of the Statistical Office
Since 2004 Statistics Netherlands has been a semi-independent
organisation (still government-funded) with about 2000
employees
Examples of registers
Three kinds of registers
• Population Register (PR)
• Job register
• Self-employed register
• Education register
• Occupation register
• Income register
• Social security register
• Unemployment register
• Pension register
• Other registers on persons, families and households
• Housing register
• Other registers on properties, buildings and dwellings
• General business register
• Other registers on enterprises and establishments
Common identifier: (numerical) address
Definition and driving forces of the
SSD
Definition:
set of integrated microdata files with coherent
and detailed demographic and socio-economic
data on persons, households, jobs and benefits
No remaining internally conflicting information
Driving forces:
• Virtual Census of 2001
• Better products: more coherence and flexibility
The scope of the SSD
All relevant variables in the life cycle
• Demography
• Health
• Education
• Labour market position
• Income
• Consumption
• Housing
• Time use
• Etc.
Core and satellites (1)
[Figure: the SSD core surrounded by satellites]
Core and satellites (2)
Core:
• contains only integral register information
• contains the most important demographic and
socio-economic information
• contains only information that is used in at
least two satellites
Core and satellites (3)
Satellites are produced in two steps:
• Copying and derivation of the relevant
information from the core SSD
• Adding of the unique information on a specific
theme from registers and surveys
Core and satellites (4)
Examples of current SSD satellites:
• Labour market
• Social security
• Income
• Education
• Health care
• Justice and security
• Ethnic minorities
• Social cohesion
More SSD satellites are planned
The process
Already discussed:
– Specify the information needed
– Collection of registers
– Surveys only additional
Still to discuss:
– Linking the sources
– Micro integration
– Estimation aspects
– Statistical confidentiality
Linking the sources (1)
• The Population Register is the
backbone of the system for persons
• All other files are matched exactly to the
Population Register, such that the number of true
matches is maximised (aim: no missed matches) and
the number of false matches (mismatches) is minimised
Linking the sources (2)
Matching variables:
• Social security and fiscal (SOFI) number
(effectiveness close to 100%), since 2007
Citizen Service Number
• Other personal identifiers: sex, date of birth,
and address (effectiveness close to 100%)
• Number of mismatches very low (close to 0%)
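A minimal sketch of exact matching with a primary and a secondary key follows. It assumes the unique number is tried first and the combination of sex, date of birth and address is used only as a fallback; the field names (`csn`, `birth`, `address`) and all records are hypothetical, and the actual linkage procedure at Statistics Netherlands may differ.

```python
# Hypothetical records; all field names and values are made up for illustration.
population_register = [
    {"csn": "100000001", "sex": "F", "birth": "1980-02-14", "address": ("1234AB", 7)},
    {"csn": "100000002", "sex": "M", "birth": "1975-11-30", "address": ("5678CD", 12)},
]
source = [
    {"csn": "100000001", "sex": "F", "birth": "1980-02-14", "address": ("1234AB", 7), "wage": 31000},
    {"csn": None, "sex": "M", "birth": "1975-11-30", "address": ("5678CD", 12), "wage": 28000},
    {"csn": None, "sex": "M", "birth": "1990-01-01", "address": ("9999ZZ", 1), "wage": 15000},
]

def secondary_key(rec):
    """Fallback key: sex, date of birth and address."""
    return (rec["sex"], rec["birth"], rec["address"])

def link(source, register):
    """Return (matched, unmatched); each item is (source record, register index or None)."""
    by_csn = {p["csn"]: i for i, p in enumerate(register)}
    by_key = {}
    for i, p in enumerate(register):
        by_key.setdefault(secondary_key(p), []).append(i)
    matched, unmatched = [], []
    for rec in source:
        idx = by_csn.get(rec["csn"]) if rec["csn"] else None
        if idx is None:
            # Accept the fallback key only when it points to exactly one person,
            # so that ambiguous cases do not become mismatches.
            candidates = by_key.get(secondary_key(rec), [])
            idx = candidates[0] if len(candidates) == 1 else None
        (matched if idx is not None else unmatched).append((rec, idx))
    return matched, unmatched

matched, unmatched = link(source, population_register)
print(len(matched), "matched,", len(unmatched), "unmatched")
```

Refusing ambiguous fallback matches trades a few missed matches for fewer mismatches; where to draw that line is a design choice.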
Micro integration (1)
The aim of micro integration is:
– To check the linked data and modify incorrect
records,
– in such a way that the results that are to be
published are of higher quality than the
original sources
Micro integration (2)
To fulfil this demand an integrated process of:
• data editing,
• derivation of statistical variables,
• and imputation
is executed
Micro integration (3)
Constraints and limitations:
- Only variables that are to be published are
micro integrated
- Identity rules are necessary, e.g. the same
variable in two sources or a relationship
between two or more variables in one or more
sources
- No mass imputation
Estimation aspects
Surveys are samples from the population
If surveys are enriched with register information,
estimates of the register variables from the enriched
survey will be inconsistent with the
counts from the entire register
Statistics Netherlands developed the method of
repeated weighting to solve these inconsistencies
(aim: numerically consistent estimates)
Statistical confidentiality
[Figure: administrative sources and household surveys (each with IDs and variables) are linked through identifiers (PINs, sex, date of birth, address) to the PERSONS BACKBONE, which holds the characteristics of the full range of all persons as from 1995]
IDs in sources are replaced by random
Record Identification Numbers (RINs)
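Replacing IDs by random RINs can be sketched as follows. This is an illustration under assumptions, not the procedure actually used: the field name `pin`, the 16-character hexadecimal RIN format and both helper functions are hypothetical.

```python
import secrets

def assign_rins(identifiers):
    """Map each distinct source identifier to a random, unique RIN."""
    rin_of, used = {}, set()
    for ident in identifiers:
        if ident in rin_of:
            continue
        rin = secrets.token_hex(8)
        while rin in used:  # extremely unlikely, but keeps RINs unique
            rin = secrets.token_hex(8)
        used.add(rin)
        rin_of[ident] = rin
    return rin_of

def pseudonymise(records, rin_of, id_field="pin"):
    """Return copies of the records with the identifier replaced by the RIN."""
    out = []
    for rec in records:
        new = {k: v for k, v in rec.items() if k != id_field}
        new["rin"] = rin_of[rec[id_field]]
        out.append(new)
    return out

# Hypothetical administrative and survey records sharing a personal identifier.
admin = [{"pin": "123", "income": 30000}, {"pin": "456", "income": 45000}]
survey = [{"pin": "123", "education": "high"}]

rin_of = assign_rins(r["pin"] for r in admin + survey)
admin_safe = pseudonymise(admin, rin_of)
survey_safe = pseudonymise(survey, rin_of)
```

Because the same mapping is applied to every source, the files can still be linked on the RIN, while the RIN itself carries no information about the person. The mapping table is the sensitive part and would have to be kept separate from the analysis files.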
Conclusions
The SSD diminishes the administrative burden
and increases:
– The efficiency of statistics production
– The accuracy of statistical outputs
– The possibilities for social policy research
Safeguarding confidentiality is vital for the
process of record linkage
Group work on registers and
surveys (1)
Key question: which census variables are missing
in all the registers? Consider the following
thirteen census variables:
1.Sex
2.Age
3.Country of citizenship
4.Marital status
5.Household position
6.Religious denomination
7.Country of birth
8.Household size
Group work on registers and
surveys (2)
9. Place of residence one year prior to the census
10. Economic status
11. Level of educational attainment
12. Occupation
13. Branch of current economic activity
A. Discuss the situation in the countries
represented in your group or select some
countries for further discussion
Group work on registers and
surveys (3)
B. Are those missing variables available in any
survey? Discuss where those surveys may be
used (legal aspect and agreement with survey
organiser) for producing official statistics
C. Can the surveys and registers be linked? Is this
exact matching or is statistical matching
necessary?
Are there other important issues that affect the
overall situation?
Group work on registers and
surveys (4)
D. Possibilities and limitations for further
development of combining registers and surveys.
What is the policy in the NSIs for further
development? What are the possibilities and
limitations for such a development?
E. Prepare a short presentation (5 minutes per
group)
Contents The Dutch virtual
census (1)
• History of the Dutch Census
• The Dutch Census of 2011
• Data sources
• Combining sources: micro linkage
• Combining sources: micro integration
• Conditions facilitating use of administrative sources
• Miscellaneous aspects
• Census tables
• Micro macro method
• Result on 2011 economic activity
Contents The Dutch virtual
census (2)
• Comparison with other countries
• Comparison with other years
• Harmonisation
• Microdata availability
• Data integration activities between the 2001 Census and
the 2011 Census
• Preparing the 2011 Census
• Conclusions
History of the Dutch Census (1)
TRADITIONAL CENSUS
Ministry of Home Affairs:
1829, 1839, 1849, 1859, 1869, 1879 and 1889
Statistics Netherlands:
1899, 1909, 1920, 1930, 1947, 1960 and 1971
Unwillingness (nonresponse) and cost
reduction → no more traditional censuses
History of the Dutch Census (2)
ALTERNATIVE: VIRTUAL CENSUS
1981 and 1991: limited virtual censuses based on
Population Register and surveys
Development in the 1990s: more registers → integrated set of
registers and surveys, SSD
2001 and 2011: complete virtual censuses based on
the SSD with information at the municipality level
The Dutch Census of 2011
is based on the Social Statistical Database (SSD) which
• is a set of integrated microdata files with coherent and
detailed demographic and socio-economic data on
persons, households, jobs and benefits
• has no remaining internally conflicting information
is part of the European Census
• Eurostat: coordinator of EU, accession and EFTA
countries in the European Census Rounds
• Census Table Programme, every 10 years
Social statistics in the Netherlands develop in the direction
of a permanent Virtual Census to be able to produce:
• More crosstables over different domains
• More longitudinal information
• More flexible policy relevant output
Data sources
Registers:
• Population Register (PR) → illegal people excluded,
homeless counted at last known address
• Jobs file, containing all employees
• Self-employed file, containing all self-employed
• Fiscal administration
• Social Security administrations
• Pensions and life insurance benefits
• Housing registers
Surveys:
• Survey on Employment and Earnings (SEE) stopped
• Labour Force Survey data around Census Day
• Housing surveys no longer necessary for the Census
Combining sources: micro linkage
• Linkage key:
Registers
Citizen Service Number, unique
Surveys
Sex, date of birth,
address (postal code and house number)
• Linkage key replaced by RIN-person
• Linkage strategy
Optimizing number of matches
Minimizing number of mismatches and missed
matches
Combining sources: micro integration
• Collecting data from several sources →
more comprehensive and coherent information
on aspects of a person’s life
• Compare sources
- coverage
- conflicting information (reliability of sources)
• Integration rules
- checks
- adjustments
- imputations
• Optimal use of information → quality improves
• Example: job period vs. benefit period
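The job period versus benefit period example can be sketched as an integration rule with a check and an adjustment. The rule below is an assumption made for illustration (the job source is treated as the more reliable one and an overlapping benefit is ended the day before the job starts); the actual integration rules are not given here.

```python
from datetime import date, timedelta

def overlap_days(a_start, a_end, b_start, b_end):
    """Number of days two closed date intervals have in common."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return max(0, (end - start).days + 1)

def integrate(job, benefit):
    """Check a job period against a benefit period and adjust if needed.

    Assumed rule (hypothetical): the job source wins, so an overlapping
    benefit that started earlier is cut off the day before the job starts.
    Other overlaps are only flagged for review, not changed.
    """
    days = overlap_days(job["start"], job["end"], benefit["start"], benefit["end"])
    if days == 0:
        return benefit, "consistent"
    if benefit["start"] < job["start"]:
        adjusted = dict(benefit, end=job["start"] - timedelta(days=1))
        return adjusted, "adjusted"
    return benefit, "needs review"

# Hypothetical records for one person from two different sources.
job = {"start": date(2011, 3, 1), "end": date(2011, 12, 31)}
benefit = {"start": date(2011, 1, 1), "end": date(2011, 3, 15)}

result, status = integrate(job, benefit)
print(status, result["end"])
```

Only the conflicting part of the record is changed, in line with the principle of editing only what needs to be edited.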
Conditions facilitating use of
administrative sources
• Legal base (Statistics Act)
• Public approval (‘Big Brother is watching you’)
• Cooperation among authorities (mainly
government organisations)
• Comprehensive and reliable register system
(administrative versus statistical quality)
• Unified identification system (preferably unique
ID-numbers)
Miscellaneous aspects (1)
• Stable identifiers
• Stability of registers
• Only edit what needs to be edited (by automated
procedures)
• Dates of real events versus dates of registration
• Derived variables (example: current activity
status)
• Impact on the organisation (change of culture)
• Communication with register holders
Miscellaneous aspects (2)
Output inspired harmonisation (coverage,
definitions, reference periods):
the one figure for one phenomenon idea
StatLine:
all statistical information on the web
(via home page of Statistics Netherlands)
http://www.cbs.nl/en-GB/menu/home/default.htm
Census tables (1)
Preliminary work before tabulating
Census Programme definitions:
not always clear and unambiguous, e.g. economic activity
Priority rules
• (characteristics of) main job (highest wage)
• employee or employer
• job or (partially) unemployed
• job or attending education
• job or retired
• engaged in family duties or retired
• age restrictions
Tabulating register variables:
Simply straightforward counting from SSD register
data
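The first priority rule and the counting step can be sketched as follows, assuming one record per job with a person key and a wage. The field names and values are made up, and only the "main job = highest wage" rule is taken from the list above.

```python
from collections import Counter

# Hypothetical job records; a person (rin) may hold several jobs.
jobs = [
    {"rin": "a", "sector": "retail", "wage": 12000},
    {"rin": "a", "sector": "education", "wage": 30000},
    {"rin": "b", "sector": "retail", "wage": 25000},
]

def main_jobs(jobs):
    """Priority rule: per person, keep the job with the highest wage."""
    best = {}
    for job in jobs:
        current = best.get(job["rin"])
        if current is None or job["wage"] > current["wage"]:
            best[job["rin"]] = job
    return best

# Tabulating a register variable is then a straightforward count.
counts = Counter(job["sector"] for job in main_jobs(jobs).values())
print(counts)
```

The remaining priority rules (for example job versus retired) would be applied in the same way, but their order is not specified here, so they are left out of the sketch.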
Census tables (2)
Tabulating survey (and register) variables
Mass imputation?
• Pros: reproducible results
• Cons: danger of oddities in estimates (e.g. highly
educated baby)
Traditional Weighting?
• Pros: simple, reproducible results (if same microdata and
weights)
• Cons: no overall numerical consistency between survey
and register estimates
Demand for overall numerical consistency
• one figure for one phenomenon idea
• all tables based on different sources (e.g. surveys)
should be mutually consistent
Census tables (3)
Ethnicity: register
Education: survey 1 and survey 2
Employment status: survey 2
Estimate: T1: educ x ethnic and T2: educ x employ
[Figure: data blocks by source: Register (ethnic1...k), Survey 1 (educLo...Hi), Survey 2 (educLo...Hi and employ1...m)]

Register: ethnic
ethnic      Total
notNL          30
NL             70

T1: educ x ethnic
            notNL      NL   Total
educLo         20      29      49
educHi          9      42      51
Total          29      71     100

T2: employ x educ
         employed  nonemployed   Total
educLo         32           20      52
educHi         28           20      48
Total          60           40     100
Census tables (4)
Repeated Weighting (RW) : tool to achieve
numerical consistency (VRD-software)
Basic principles of RW:
• estimate each table from the most reliable source (mostly
the source with the most records, e.g. a register)
• estimate tables by calibrating on the common
margins of the current table and the tables already
estimated (auxiliary information)
• repeated use of the regression estimator:
- initial weights (e.g. survey weights) are adjusted as
little as possible
- lower variances
- no excessive increase of (non-response) bias (as long
as cell size>>0)
• each table has its own set of weights
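The idea of calibrating each new table on margins that were already estimated can be sketched with simple proportional scaling. This is a deliberately simplified stand-in: repeated weighting itself uses the regression estimator on record weights, not the cell-level rescaling shown here. The figures are the illustrative ones from the example tables (register: 30 notNL and 70 NL; T1 = educ x ethnic; T2 = educ x employ), and the function name is hypothetical.

```python
def calibrate(table, targets, axis):
    """Scale rows (axis=0) or columns (axis=1) so that their totals equal `targets`."""
    rows, cols = len(table), len(table[0])
    if axis == 0:
        totals = [sum(table[i]) for i in range(rows)]
        return [[table[i][j] * targets[i] / totals[i] for j in range(cols)]
                for i in range(rows)]
    totals = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    return [[table[i][j] * targets[j] / totals[j] for j in range(cols)]
            for i in range(rows)]

register_ethnic = [30, 70]       # notNL, NL: counted from the register
t1 = [[20, 29], [9, 42]]         # rows educLo, educHi; columns notNL, NL
t2 = [[32, 20], [28, 20]]        # rows educLo, educHi; columns employed, nonemployed

# Step 1: the register margin is taken as given.
# Step 2: T1 is calibrated on the ethnic margin from the register.
t1_cal = calibrate(t1, register_ethnic, axis=1)
# Step 3: T2 is calibrated on the educ margin of the calibrated T1.
educ_margin = [sum(row) for row in t1_cal]
t2_cal = calibrate(t2, educ_margin, axis=0)

for row in t1_cal + t2_cal:
    print([round(x, 1) for x in row])
```

After these steps the ethnic margin of T1 equals the register counts and T1 and T2 share the same educ margin, which is the numerical consistency aimed for. The values will not coincide with rounded illustrative figures, because the estimator differs.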
Census tables (5)
Calibrate on ethnic, then on educ x ethnic
[Figure: the same data blocks by source (Register: ethnic1...k; Survey 1: educLo...Hi; Survey 2: educLo...Hi and employ1...m) over the sampling units, with the estimation steps numbered 1 to 3]

Register: ethnic
ethnic      Total
notNL          30
NL             70

T1: educ x ethnic
            notNL      NL   Total
educLo         20      30      50
educHi         10      40      50
Total          30      70     100

T2: employ x educ
         employed  nonemployed   Total
educLo         31           19      50
educHi         30           20      50
Total          61           39     100
Micro macro method (1)
Repeated Weighting works nicely, but in the 2011
Census a new requirement was introduced:
hypercubes (= high dimensional tables)
Problem:
Very detailed tables contain many sample zeros
that RW cannot handle
Solution 1: estimate subhypercubes
Solution 2: micro macro method (an IPF method)
was introduced to estimate the interior of
subhypercubes containing LFS variables
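As background for "an IPF method", here is a generic sketch of iterative proportional fitting on a two-way table. It is textbook IPF, not the micro macro method itself, whose details are not given here; the seed table and target margins are made-up numbers.

```python
def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Fit the interior of a table to given margins by alternately scaling
    rows and columns. Zero cells in the seed stay zero."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        # Row step: make each row total equal its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [x * target / total for x in table[i]]
        # Column step: make each column total equal its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Stop once the rows still fit after the column step.
        if all(abs(sum(table[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return table

# Hypothetical seed (e.g. sample counts) and margins that were estimated elsewhere.
seed = [[4, 1], [2, 3]]
fitted = ipf(seed, row_targets=[60, 40], col_targets=[55, 45])
for row in fitted:
    print([round(x, 2) for x in row])
```

The fitted interior reproduces the given margins while keeping the association pattern (the odds ratio) of the seed, so only the interior of the table has to come from the sample.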
Micro macro method (2)
Results of the micro macro method are published
if two conditions are fulfilled:
1. table margins estimated with RW are small
enough
2. number of records in estimated cells are large
enough
Criteria:
1. estimated relative inaccuracy of at most 20
percent (i.e. the estimated margins amount to 40
percent at most) which corresponds to a
threshold of 25 persons
2. only table cells based on 5 or more persons are
published
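One possible reading of the two criteria is sketched below. The thresholds 25 and 5 are taken from the criteria; everything else is an assumption for illustration: that the counts are numbers of persons underlying a margin or a cell, and that the link between 25 persons and 20 percent follows the rule of thumb "relative inaccuracy ≈ 1/√n", which is not stated in the source.

```python
import math

MIN_MARGIN_PERSONS = 25  # threshold named in criterion 1
MIN_CELL_PERSONS = 5     # threshold named in criterion 2

def approx_relative_inaccuracy(n_persons):
    """Rule of thumb 1/sqrt(n); an assumption, not taken from the source."""
    return 1 / math.sqrt(n_persons)

def publishable(margin_persons, cell_persons):
    """Hypothetical check: the margin rests on at least 25 persons
    and the cell on at least 5 persons."""
    return margin_persons >= MIN_MARGIN_PERSONS and cell_persons >= MIN_CELL_PERSONS

print(approx_relative_inaccuracy(25))  # 20 percent under the rule of thumb
print(publishable(margin_persons=40, cell_persons=7))
print(publishable(margin_persons=40, cell_persons=3))
```

Under this rule of thumb, doubling the 20 percent gives the 40 percent margins mentioned in criterion 1.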
Result on 2011 economic activity
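[Figure: pie chart of the 2011 population by economic activity. Categories: Employed; Unemployed; Under 15 years; Pension or capital income recipients; Students (not economically active); Homemakers and others. Shares shown in the chart: 8.9%, 4.1%, 16.6%, 49.1%, 17.5% and 3.8%]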
Comparison with other countries
Traditional Census (complete enumeration):
Most countries in the world (including the UK and
the US)
Traditional Census (partial enumeration) and
Registers:
Some countries (e.g. Germany, Poland and
Switzerland)
Rolling Census:
France
Fully or largely register-based (Virtual) Census:
Five Nordic countries (Iceland, Norway, Sweden, Finland
and Denmark), the Netherlands, Belgium, Austria and
Slovenia
Comparison with other years
Inhabitants and household size
[Figure: number of inhabitants (x mln) and mean household size by census year, 1829 to 2011]
Harmonisation (1)
More information about the Dutch traditional
Censuses (including those of 1960 and 1971):
http://www.volkstellingen.nl/en/
For 1960 and 1971 the same variables as for 2001
• if not available: constructed based on existing variables in
Census data
Variables not internationally harmonised (e.g. sex,
age, marital status, household position, country of
birth, economic status, household size and
country of citizenship)
• same classification and priority rules as for 2001
Harmonisation (2)
Household size and country of citizenship:
• missing for 1960
Religious denomination (philosophy of life):
• only for 1960 and 1971
Place of residence one year prior to the census:
• only for 2001
International classifications
• Branch of current economic activity: ISIC / NACE
• Occupation: ISCO
• Level of educational attainment: ISCED
Harmonisation (3)
Variable                                            1960   1971   2001
Sex                                                   X      X      X
Age                                                   X      X      X
Country of citizenship                                –      X      X
Marital status                                        X      X      X
Household position                                    X      X      X
Religious denomination                                X      X      –
Country of birth                                      X      X      X
Household size                                        –      X      X
Place of residence one year prior to the census       –      –      X
Economic status                                       X      X      X
Level of educational attainment                       X      X      X
Occupation                                            X      X      X
Branch of current economic activity                   X      X      X
Microdata availability
One percent samples for three years (1960, 1971
and 2001)
IPUMS (Integrated Public Use Microdata Series):
http://www.ipums.org/international/index.html
Weighting to population totals
Protecting according to rules for public use files
Microdata sets for all three years available for
research!
DANS (Data Archiving and Networked Services):
http://www.dans.knaw.nl/en/
Data integration activities between
the 2001 Census and the 2011 Census
(1)
• Tables
(http://www.cbs.nl/nlNL/menu/themas/dossiers/historischereeksen/publicaties/volkstelling-2001/2003-volkstellingexcel.htm)
• Book and extra chapter
(http://www.cbs.nl/nlNL/menu/themas/dossiers/historischereeksen/publicaties/volkstelling-2001/2001-b57-pub.htm)
Data integration activities between
the 2001 Census and the 2011 Census
(2)
• Integrated Public Use Microdata Series
(https://international.ipums.org/international)
• Lectures (Conferences, Universities, Research
institutes, Statistical offices)
• ESTP-course Registers in Statistics (Oslo)
• International Statistical Seminar Eustat in Bilbao
(http://www.eustat.es/prodserv/seminario_i.html)
• Digitalizing (http://www.volkstellingen.nl/en/)
• Recommendations and register-based statistics
• CENEX on ISAD (http://cenex-isad.istat.it)
• European census regulations
Preparing the 2011 Census
• Sources (the PR as backbone of the census,
changes in contents and quality of registers,
remaining information from LFS)
• Estimation method (repeated weighting, new
version of the software, fall-back option of
weighting to PR, zero cells problem)
• Statistical Disclosure Control of the
hypercubes (Workshop on SDC of Census Data
in April 2012)
• Tabular data in SDMX format and the Census
Hub
Conclusions (1)
• A Dutch Virtual Census: yes, we can!
• Micro integration remains important
• Repeated weighting was a success
Advantages:
• Relatively cheap (small cost per inhabitant)
• Quick (short production time)
Disadvantages:
• Dependent on register holders (statistics is not their
priority) and on the timeliness of registers; concepts and
populations of registers may differ from what is needed
(keep good relations with the register holders!)
• Publication of small subpopulations sometimes difficult or
even impossible because of limited information
Conclusions (2)
Other aspects:
• Less attention for the results of a virtual census
than for a traditional one
• Difficult to keep knowledge and software up to date (the Census runs every ten years)
• Enormous international interest in virtual
censuses
• A lot of interesting census work in the coming
years!
Time for questions and discussion