Linking, selecting cut-offs, and examining quality in the Integrated Data Infrastructure (IDI)
Laura O’Sullivan
Statistics New Zealand
laura.o’sullivan@stats.govt.nz
IAOS Vietnam October 2014
Outline
The Integrated Data Infrastructure (IDI)
Terminology
IDI linking
• Near-exact and non-exact
• Selecting cut-offs
• Quality
• Clerical review
Linking at Statistics New Zealand and at the Australian Bureau of Statistics
Integrated Data Infrastructure (IDI)
• Student loans & allowances
• Migration & movements
• Education
• Benefits
• Business data
• Person-centred data
• Tax
• Justice
• Health & safety
• Families & households
Terminology
Data integration (aka Record linkage)
Deterministic linking
Probabilistic linking (Fellegi-Sunter theory)
Weights
• Represent how likely it is that two records are from the same person (a higher weight means a more likely match)
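As a rough illustration of how Fellegi-Sunter weights are usually built, the sketch below computes a weight for each compared field from two probabilities and sums them for a record pair. The m-probability (the fields agree given a true match) and u-probability (the fields agree given a non-match) values are invented for the example, and the function names are hypothetical; none of this is taken from the IDI.

```python
import math


def field_weight(m, u, agrees):
    """Fellegi-Sunter weight (log base 2) for one field comparison.

    m: probability the field agrees when the records are a true match
    u: probability the field agrees when the records are not a match
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_weight(comparisons, m_probs, u_probs):
    """Sum the field weights over all compared fields of one record pair."""
    return sum(
        field_weight(m_probs[field], u_probs[field], agrees)
        for field, agrees in comparisons.items()
    )


# Illustrative (assumed) probabilities, not IDI values.
M_PROBS = {"first_name": 0.8, "last_name": 0.8, "dob": 0.8}
U_PROBS = {"first_name": 0.2, "last_name": 0.2, "dob": 0.2}

# A pair that agrees on first name and date of birth but not on last name.
example = {"first_name": True, "last_name": False, "dob": True}
total = pair_weight(example, M_PROBS, U_PROBS)
```

Agreement on a field adds a positive weight and disagreement adds a negative one, so pairs that agree on more fields end up with higher total weights.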
Cut-offs
[Figure: Distribution of the weights. Number of record pairs (y-axis, 40 to 1,240) by weight (x-axis, -95 to 50), with regions labelled Non-links and Links. Source: Statistics New Zealand]
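A minimal sketch of how a weight distribution like this could be tabulated and then split at a single cut-off. The bin width, the cut-off value, and the function names are assumptions made for illustration only.

```python
from collections import Counter


def weight_histogram(weights, bin_width=5):
    """Count record pairs per weight bin, keyed by the bin's lower edge."""
    counts = Counter((w // bin_width) * bin_width for w in weights)
    return dict(sorted(counts.items()))


def split_at_cutoff(weights, cutoff):
    """Pairs at or above the cut-off become links; the rest are non-links."""
    links = [w for w in weights if w >= cutoff]
    non_links = [w for w in weights if w < cutoff]
    return links, non_links


# Made-up pair weights for demonstration.
sample_weights = [-12, -11, 3, 4, 27]
histogram = weight_histogram(sample_weights, bin_width=5)
links, non_links = split_at_cutoff(sample_weights, cutoff=0)
```

Raising the cut-off moves pairs from the links group to the non-links group, which trades false positives for false negatives.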
Quality
            True matches      Non-matches
Linked      True positives    False positives
Unlinked    False negatives   True negatives
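The four cells can be turned into summary rates. The sketch below assumes the definitions often used in record linkage, where the false positive rate is the share of links that are not true matches and the false negative rate is the share of true matches left unlinked; the slides do not state which definitions are used, and the counts are invented.

```python
def linkage_quality(true_pos, false_pos, false_neg):
    """Summary rates from linkage outcome counts (assumed definitions)."""
    links = true_pos + false_pos      # everything that was linked
    matches = true_pos + false_neg    # everything that should have been linked
    return {
        "false_positive_rate": false_pos / links,    # wrong links / all links
        "false_negative_rate": false_neg / matches,  # missed matches / all true matches
        "precision": true_pos / links,
        "recall": true_pos / matches,
    }


# Invented counts for demonstration.
rates = linkage_quality(true_pos=950, false_pos=50, false_neg=100)
```

True negatives are not needed for these rates, which is convenient because the number of non-matching pairs is usually very large.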
Near-exact and non-exact
First name and last name agreement
Data  Insert   Delete  Replace  Double   Single   Swap    Append  Truncate
A     Robert   Robert  Robert   Robert   Robbert  Robert  Kat     Katie
B     Robiert  Robrt   Rovert   Roobert  Robert   Robret  Katie   Kat
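The name variations above can be detected with simple string checks. The following is a sketch under assumptions: the function names are hypothetical, the order in which the checks run is a choice made here, and the actual IDI agreement rules are not given in the slides.

```python
def removed_index(longer, shorter):
    """Index i where deleting longer[i] yields shorter, or -1 if none."""
    if len(longer) != len(shorter) + 1:
        return -1
    for i in range(len(longer)):
        if longer[:i] + longer[i + 1:] == shorter:
            return i
    return -1


def is_doubled(text, i):
    """True if the character at i repeats one of its neighbours."""
    return (i > 0 and text[i - 1] == text[i]) or (
        i + 1 < len(text) and text[i + 1] == text[i]
    )


def name_difference(a, b):
    """Label how name b differs from name a, or None if no rule applies."""
    a, b = a.lower(), b.lower()
    if a == b:
        return "exact"
    i = removed_index(b, a)  # b has one extra character
    if i >= 0:
        return "double" if is_doubled(b, i) else "insert"
    i = removed_index(a, b)  # b has one character fewer
    if i >= 0:
        return "single" if is_doubled(a, i) else "delete"
    if len(a) == len(b):
        diffs = [k for k in range(len(a)) if a[k] != b[k]]
        if len(diffs) == 1:
            return "replace"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]] and a[diffs[1]] == b[diffs[0]]):
            return "swap"  # two adjacent letters exchanged
    if b.startswith(a):
        return "append"
    if a.startswith(b):
        return "truncate"
    return None
```

For example, `name_difference("Robert", "Robret")` returns `"swap"` and `name_difference("Kat", "Katie")` returns `"append"`. Because the one-character checks run first, a one-letter extension such as Kat to Kate would be labelled `"insert"` rather than `"append"`.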
Date of birth agreement
Data  Replace     Swap        Transpose
A     04/08/1982  02/08/1982  02/08/1982
B     04/02/1982  20/08/1982  08/02/1982
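The three date-of-birth variations can be checked in a similar way. This sketch assumes dates are dd/mm/yyyy strings and reads the examples as follows: replace means one component has different digits, swap means the digits within one component are reordered, and transpose means day and month are exchanged. The function name is hypothetical.

```python
def dob_difference(a, b):
    """Label how date b differs from date a (both dd/mm/yyyy), or None."""
    parts_a = tuple(a.split("/"))
    parts_b = tuple(b.split("/"))
    if parts_a == parts_b:
        return "exact"
    day_a, month_a, year_a = parts_a
    day_b, month_b, year_b = parts_b
    # Day and month exchanged, year unchanged.
    if year_a == year_b and day_a == month_b and month_a == day_b:
        return "transpose"
    changed = [(x, y) for x, y in zip(parts_a, parts_b) if x != y]
    if len(changed) == 1:
        x, y = changed[0]
        # Same digits in a different order within one component.
        return "swap" if sorted(x) == sorted(y) else "replace"
    return None
```

For example, `dob_difference("02/08/1982", "20/08/1982")` returns `"swap"`, while two dates that differ in more than one component and are not a day-month transposition return `None`.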
Selecting the cut-off
[Figure: Graph of near-exact and non-exact links. Frequency of links (y-axis, 0 to 300,000) for non-exact and near-exact links (x-axis, 0 to 40). Source: Statistics New Zealand]
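One way a cut-off could be read off counts like these is sketched below. The rule is hypothetical and is not described in the slides: starting from the highest weight, move down while the share of non-exact links at each weight stays at or below a chosen threshold. The threshold, the counts, and the function name are all assumptions.

```python
def select_cutoff(counts_by_weight, max_non_exact_share=0.5):
    """Lowest weight such that it, and every higher weight, has a
    non-exact share at or below the threshold.

    counts_by_weight maps weight -> (near_exact_count, non_exact_count).
    Returns None if even the highest weight fails the threshold.
    """
    cutoff = None
    for weight in sorted(counts_by_weight, reverse=True):
        near_exact, non_exact = counts_by_weight[weight]
        total = near_exact + non_exact
        if total and non_exact / total > max_non_exact_share:
            break
        cutoff = weight
    return cutoff


# Invented counts for demonstration.
counts = {
    10: (0, 50),
    12: (5, 40),
    14: (30, 20),
    16: (100, 10),
    18: (200, 2),
}
chosen = select_cutoff(counts, max_non_exact_share=0.5)
```

With these counts the non-exact share first exceeds one half at weight 12, so the sketch settles on 14.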
Quality in the IDI
False positive rates
• Sample from non-exact links
• Assume near-exact links are true matches
• Use proportional sampling
Non-exact rates
• Monitoring
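The false positive approach above can be sketched as follows, assuming the non-exact links are divided into strata (for example weight bands, which is an assumption here), a clerical sample is allocated in proportion to stratum size, and near-exact links are treated as true matches. All names and numbers are illustrative.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split a sample across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    return {
        name: round(total_sample * size / total)
        for name, size in stratum_sizes.items()
    }


def estimate_false_positive_rate(n_near_exact, strata):
    """Estimate false links as a share of all links.

    strata maps name -> (n_links, n_sampled, n_false_in_sample).
    Near-exact links are assumed to contain no false positives.
    """
    estimated_false = sum(
        n_links * n_false / n_sampled
        for n_links, n_sampled, n_false in strata.values()
        if n_sampled
    )
    total_links = n_near_exact + sum(n for n, _, _ in strata.values())
    return estimated_false / total_links


# Invented strata of non-exact links.
allocation = proportional_allocation({"a": 600, "b": 300, "c": 100}, 100)
rate = estimate_false_positive_rate(
    n_near_exact=9000,
    strata={"a": (600, 60, 3), "b": (300, 30, 3), "c": (100, 10, 2)},
)
```

Here the sampled error counts scale up to an estimated 80 false links out of 10,000 links in total, a rate of 0.8 percent.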
Clerical review
A link with two first names matching and different last name
Dataset  First names  Last names  Date of birth  Sex
A        Mary Louise  Brown       04/11/1984     2
B        Mary Lou     Hughes      04/11/1984     2

A link with unique identifiers and missing name information in one dataset
Dataset  Identifier  First names  Last names  Date of birth  Sex
A        12345       Owen         Keyes       06/01/1951     1
B        12345       -            -           06/01/1951     1

A link with missing name information and without unique identifiers
Dataset  First names    Last names  Date of birth  Sex
A        Holly Jessica  Gordon      01/05/1940     2
B        Holly          -           01/05/1940     2
Statistics New Zealand and the Australian Bureau of Statistics
Statistics New Zealand
• Census to the Post-enumeration survey (PES)
• Linking the longitudinal census
Australian Bureau of Statistics
• Linking projects using name and address
• Census data enhancement project
Thank you for listening
Questions
laura.o’sullivan@stats.govt.nz