Linking, selecting cut-offs, and examining quality in the Integrated Data Infrastructure (IDI)

Laura O’Sullivan
Statistics New Zealand
laura.o’sullivan@stats.govt.nz
IAOS Vietnam, October 2014


Outline

The Integrated Data Infrastructure (IDI)
Terminology
IDI linking
  • Near-exact and non-exact
  • Selecting cut-offs
  • Quality
  • Clerical review
Linking at Statistics New Zealand and at the Australian Bureau of Statistics


Integrated Data Infrastructure (IDI)

[Diagram of the data in the IDI, with the labels:]
Person-centred data
Student loans & allowances
Migration & movements
Education
Benefits
Business data
Tax
Justice
Health & safety
Families & households


Terminology

Data integration (aka record linkage)
Deterministic linking
Probabilistic linking (Fellegi-Sunter theory)
Weights
  Represent the probability that two records are from the same person


Cut-offs

[Figure: Distribution of the weights. Vertical axis: number of record pairs. Horizontal axis: weight, from -95 to 50. Two labelled distributions: Non-links and Links. Source: Statistics New Zealand]


Quality

            True matches       Non matches
Linked      True positives     False positives
Unlinked    False negatives    True negatives


Near-exact and non-exact

First name and last name agreement

Data   Insert    Delete   Replace   Double    Single    Swap     Append   Truncate
A      Robert    Robert   Robert    Robert    Robbert   Robert   Kat      Katie
B      Robiert   Robrt    Rovert    Roobert   Robert    Robret   Katie    Kat

Date of birth agreement

Data   Replace      Swap         Transpose
A      04/08/1982   02/08/1982   02/08/1982
B      04/02/1982   20/08/1982   08/02/1982


Selecting the cut-off

[Figure: Graph of near-exact and non-exact links. Vertical axis: frequency of links, from 0 to 300,000. Horizontal axis: 0 to 40. Two series: Non-exact and Near-exact. Source: Statistics New Zealand]


Quality in the IDI

False positive rates
  • Sample from non-exact links
  • Assume near-exact links are true matches
  • Use proportional sampling
Non-exact rates
  • Monitoring


Clerical review

A link with two first names matching and different last name

Dataset   First names   Last names   Date of birth   Sex
A         Mary Louise   Brown        04/11/1984      2
B         Mary Lou      Hughes       04/11/1984      2

A link with unique identifiers and missing name information in one dataset

Dataset   Identifier   First names   Last names   Date of birth   Sex
A         12345        Owen          Keyes        06/01/1951      1
B         12345        -             -            06/01/1951      1

A link with missing name information and without unique identifiers

Dataset   First names     Last names   Date of birth   Sex
A         Holly Jessica   Gordon       01/05/1940      2
B         Holly                        01/05/1940      2


Statistics New Zealand and the Australian Bureau of Statistics

Statistics New Zealand
  Census to the Post-enumeration survey (PES)
  Linking the longitudinal census
Australian Bureau of Statistics
  Linking projects using name and address
  Census data enhancement project


Thank you for listening

Questions
laura.o’sullivan@stats.govt.nz
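
Supplementary sketch: estimating the false positive rate

The three points on the "Quality in the IDI" slide (sample from non-exact links, assume near-exact links are true matches, use proportional sampling) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the method as implemented at Statistics New Zealand: the function names, the strata ("low" and "high" bands of non-exact links), and every count are hypothetical. It assumes a clerical review has already counted the false links found in each stratum's sample.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split a review sample across strata in proportion to stratum size.

    Rounding means the allocations may not sum exactly to total_sample.
    """
    total = sum(stratum_sizes.values())
    return {name: round(total_sample * size / total)
            for name, size in stratum_sizes.items()}


def estimate_false_positive_rate(near_exact_links, stratum_sizes,
                                 sample_sizes, false_in_sample):
    """Estimate the share of all links that are false positives.

    Assumption (from the slide): near-exact links are all true matches,
    so false links can only come from the non-exact strata.
    """
    # Scale each stratum's sample error rate up to the full stratum.
    estimated_false = sum(
        stratum_sizes[name] * false_in_sample[name] / sample_sizes[name]
        for name in stratum_sizes
    )
    total_links = near_exact_links + sum(stratum_sizes.values())
    return estimated_false / total_links


# Hypothetical counts, for illustration only.
non_exact = {"low": 2000, "high": 8000}          # non-exact links per stratum
samples = proportional_allocation(non_exact, 500)
false_found = {"low": 10, "high": 8}             # false links found in review
rate = estimate_false_positive_rate(90000, non_exact, samples, false_found)
print(f"{rate:.2%}")  # prints 0.36%
```

Because the sample is allocated proportionally, each stratum is reviewed at the same sampling fraction, and the estimate is driven by where the false links actually concentrate rather than by which stratum happened to be sampled more heavily.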