Hye-Chung Kum : Record Linkage

Deterministic Record Linking
University of North Carolina,
Chapel Hill
Hye-Chung Kum
Example
EISID : E1
ssn : 085-66-9980
first name : Sally
last name : Hill
MI : L
DOB : 3/4/1999
SISID : S1
ssn : 085-66-9980
first name : Sally
last name : Hill
MI : L
DOB : 3/4/1999
EISID : E2
ssn : 143-25-9304
first name : Emily
last name : Brown
MI : K
DOB : 6/2/2004
SISID : S2
ssn : 143-52-9304
first name : Emily
last name : Brown
MI : K
DOB : 6/2/2004
EISID : E3
ssn : 354-563-2343
first name : Mary
last name : Johnson
MI : G
DOB : 5/13/1983
SISID : S3
ssn : 354-563-2343
first name : Mary
last name : Hawkins
MI : J
DOB : 5/13/1983
EISID : E4
ssn : 532-34-9183
first name : David
last name : Ford
MI : J
DOB : 10/25/1990
SISID : S4
ssn : 532-34-9183
first name : David
last name : Ford
MI : J
DOB : 10/23/1990
Exact Match
EISID : E1
ssn : 085-66-9980
first name : Sally
last name : Hill
MI : L
DOB : 3/4/1999
SISID : S1
ssn : 085-66-9980
first name : Sally
last name : Hill
MI : L
DOB : 3/4/1999
Approximate Matching I : SSN
EISID : E2
ssn : 143-25-9304
first name : Emily
last name : Brown
MI : K
DOB : 6/2/2004
SISID : S2
ssn : 143-52-9304
first name : Emily
last name : Brown
MI : K
DOB : 6/2/2004
Approximate Matching II : DOB
EISID : E4
ssn : 532-34-9183
first name : David
last name : Ford
MI : J
DOB : 10/25/1990
SISID : S4
ssn : 532-34-9183
first name : David
last name : Ford
MI : J
DOB : 10/23/1990
Approximate Matching III : Name
EISID : E3
ssn : 354-563-2343
first name : Mary
last name : Johnson
MI : G
DOB : 5/13/1983
SISID : S3
ssn : 354-563-2343
first name : Mary
last name : Hawkins
MI : J
DOB : 5/13/1983
Deterministic Record Linking
Allow for approximate matching
Use explicit approximate rules
Pros : can control the linkage process
Con : difficult to implement
Alternative : Probabilistic record linking
– Also approximate matching
– However, uses general rules specified by users
– Based on total probability
– Con : cannot control exactly what to consider a match or not
– Pros : can use specialized software
Approximate Matching : DOB
element to element match : date, month, year
Allow for one element difference
Allow for month and day transposed
DOB : one element
dob1 : 10/25/1990
dob2 : 10/23/1990
DOB : transpose
dob1 : 11/7/1995
dob2 : 7/11/1995
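The two DOB rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dob_approx` is hypothetical, and an exact match is treated as passing the "one element" rule.

```python
from datetime import date


def dob_approx(d1: date, d2: date) -> bool:
    """Return True when two dates of birth agree under the relaxed rules."""
    a = (d1.month, d1.day, d1.year)
    b = (d2.month, d2.day, d2.year)
    # Rule 1: element-to-element comparison, at most one element differs
    # (zero differences, i.e. an exact match, also passes).
    if sum(x != y for x, y in zip(a, b)) <= 1:
        return True
    # Rule 2: same year, month and day transposed.
    return d1.year == d2.year and d1.month == d2.day and d1.day == d2.month


# The two slide examples: one element different, then month/day transposed.
print(dob_approx(date(1990, 10, 25), date(1990, 10, 23)))
print(dob_approx(date(1995, 11, 7), date(1995, 7, 11)))
```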
Approximate Matching : Name
First name soundex match
First name is approx
– one letter different (insert or replace)
– and/or substr
lsound equal or lname approx
– MI=FI
– FI equal
Fsound & Lsound swapped

obs  fname      kfname     mi, kmi  lname      klname
1    RUDOLPH    RULDOLPH   A, A     SIMARD     SIMARD
2    ALIJAH     ALIYAH     M        FOSS       FOSS
3    CAROL      CAROLYN    J        YOUNG      YOUNG
4    ANGELIQUE  ANGIE      D        OUELLETTE  OUELLETTE
5    JOHNNY     JOHNNY JR  L        MAYO       MAYO
6    ZACHARY    ZACK       L        ROGERS     ROGERS
7    J          MICHAEL    M        GALLAGHER  GALLAGHER
8    ANTON      COUDRAY    C, A     CYPRESS    CYPRESS
9    ARTHUR     AUTHOR     R, R     DAVIS      DAVIS
10   EDWIN      EDDIE               KAHKONE    KAHKONE
11   GOLDY      OWENS      A, A     OWENS      GOLDY
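The name comparisons used above (soundex equality, one letter different, substring) might be sketched as follows. All function names are hypothetical, and the soundex shown is the common four-character American Soundex, which is an assumption: the original work may rely on a different variant, such as the one built into its statistics package.

```python
_SOUNDEX_DIGITS = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Four-character American Soundex code (assumed variant)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_DIGITS.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_DIGITS.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def one_letter_diff(a: str, b: str) -> bool:
    """True when a and b differ by exactly one insertion, deletion or replacement."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # replacement at position i
    return a[i:] == b[i + 1:]          # b has one extra letter at position i


def is_substring(a: str, b: str) -> bool:
    """True when the shorter name is contained in the longer one (not identical)."""
    short, long_ = sorted((a, b), key=len)
    return short != long_ and short != "" and short in long_


def fname_approx(a: str, b: str) -> bool:
    """Approximate first-name match: one letter different and/or substring."""
    a, b = a.upper(), b.upper()
    return one_letter_diff(a, b) or is_substring(a, b)


print(fname_approx("RUDOLPH", "RULDOLPH"))  # one letter inserted
print(fname_approx("ALIJAH", "ALIYAH"))     # one letter replaced
print(fname_approx("CAROL", "CAROLYN"))     # substring
```

Pairs such as ANGELIQUE / ANGIE are not caught by these two checks alone, which is consistent with the slide also listing initial-based rules (MI=FI, FI equal).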
Match on ssn (ssn equal)
1 : dob, fsound equal
dob approx
– 2 : dob approx, fsound equal
– 3 : dob approx, fname approx
– 4 : dob approx, lsound equal, & fsound diff, but MI=FI
– 5 : dob approx, lsound equal, & fsound diff, but FI equal
– 6 : dob approx, lsound and fsound swapped
– 7 : dob approx, lname approx & fsound diff, but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
– 8 : fname approx, lsound equal, and dob diff
– 9 : fname approx, lsound approx, and dob diff
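One way to read the numbered rules above is as an ordered cascade in which the first rule that fires gives the link type. The sketch below assumes the field-level comparisons have already been computed; the `Comparison` class, its field names, and the evaluation order are illustrative assumptions, not the original implementation. Rule 9's "lsound approx" is represented here by the `lname_approx` flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """Field-level results for one candidate pair whose ssn values are equal."""
    dob_equal: bool = False
    dob_approx: bool = False      # assumed to be True whenever dob_equal is True
    fsound_equal: bool = False
    fname_approx: bool = False
    lsound_equal: bool = False
    lname_approx: bool = False
    mi_equals_fi: bool = False    # middle initial of one record = first initial of the other
    fi_equal: bool = False        # first initials are equal
    sounds_swapped: bool = False  # fsound and lsound are swapped between records


def ssn_equal_link_type(c: Comparison) -> Optional[int]:
    """Return the link type (1-9) of the first rule that fires, or None."""
    if c.dob_equal and c.fsound_equal:
        return 1
    if c.dob_approx:
        if c.fsound_equal:
            return 2
        if c.fname_approx:
            return 3
        if c.lsound_equal and c.mi_equals_fi:
            return 4
        if c.lsound_equal and c.fi_equal:
            return 5
        if c.sounds_swapped:
            return 6
        if c.lname_approx and (c.mi_equals_fi or c.fi_equal):
            return 7
        return None
    # dob mismatch: only the name-based rules remain
    if c.fname_approx and c.lsound_equal:
        return 8
    if c.fname_approx and c.lname_approx:
        return 9
    return None


print(ssn_equal_link_type(Comparison(dob_approx=True, fname_approx=True)))
```

Keeping the rules as an explicit, ordered list is what gives the deterministic approach its control over the linkage process: each accepted pair carries the number of the rule that accepted it.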
Approximate Matching : SSN
Digit to digit match
Allow for one digit difference
Allow for two digit difference if transposed
SSN : one digit
ssn1 : 532-34-9183
ssn2 : 532-34-8183
SSN : transpose
ssn1 : 143-25-9304
ssn2 : 143-52-9304
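A minimal sketch of the SSN rules, assuming that "transposed" means two adjacent digits swapped (as in the slide example) and that non-digit characters such as hyphens are ignored. The function name `ssn_approx` is hypothetical.

```python
def ssn_approx(ssn1: str, ssn2: str) -> bool:
    """Digit-to-digit comparison allowing one differing digit or one adjacent transposition."""
    a = [c for c in ssn1 if c.isdigit()]
    b = [c for c in ssn2 if c.isdigit()]
    if not a or len(a) != len(b):
        return False  # missing or differently sized values are not compared here
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) <= 1:
        return True  # exact match or one digit different
    if len(diffs) == 2:
        i, j = diffs
        # two digits differ: accept only if they are adjacent and swapped
        return j == i + 1 and a[i] == b[j] and a[j] == b[i]
    return False


print(ssn_approx("532-34-9183", "532-34-8183"))  # one digit different
print(ssn_approx("143-25-9304", "143-52-9304"))  # adjacent digits transposed
```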
Match on ndob (dob+fsound)
ssn missing
– 1 : lname equal
– 2 : lname approx
ssn approx
– 3 : lname equal
– 4 : lname approx
– 5 : lname diff but fname equal
ssn different
– 11 : lname equal
– 12 : lname approx
lname different
– 51 : ssn approx
– 52 : ssn missing
obs  SSN        kSSN       fname      kfname     lname            klname
1    244572812  .          APPOLONIA  APPOLONIA  GAVINS           GAVINS
2    .          .          ABEL       ABEL       LOMELIGARCIA     LOMELI
3    248511181  .          JOSH       JOSHUA     PHIPPS           PHIPPS
4    243352045  243352054  LENA       LENA       COOPER           COOPER
5    239565518  239565519  MILES      MILES      KNIGHT JR.       KNIGHT
6    245193584  245493584  MARTHA     MARTHA     LYDA             HOPKINS
7    244779182  244778182  AUSTIN     AUSTYN     TERWILLIGER      OMEARA
8    489987513  489987573  ALISIA     ALICE      GRAVES           WATSON
9    239966568  239966578  ANNA       ANAYA      MONTAGUE         BOLDING
10   227691655  227691633  BRITTNEY   BRITTNEY   REVELS           REVELS
11   242339913  239524402  DANIEL     DANIEL     ROBINSON         ROBINSON
12   221864852  225206017  HELEN      HELEN      HALL             HOLLER
13   240212489  222565604  DEBORAH    DEBRA      LEE              LEACH
14   238995019  .          ABIGAHIL   ABIGAHIL   GARCIATREJO      TREJO
15   .          .          APSLEY     APSLEY     CARLYLE          KARYLE
16   .          .          ABIGAIL    ABIGAIL    GENTRY           KING
17   237999685  .          ABIGAIL    ABIGAIL    RODRIGUEZRINCON  HERNANDEZ
18   237998504  .          ABIGAYLE   ABIGAIL    FITZGERALD       HERNANDEZ
Match on name (fname+lname)
ssn missing & dob approx
– 1 : MI equal
– 7 : MI missing
– 8 : MI not equal
ssn approx
– 3 : dob equal
– dob approx
  4 : one element
  5 : transpose
Match on name (fname+lname)
obs  Type  ssn        kssn       dob       kdob      fname     lname
1    4     362201047  326201047  09/06/09  09/06/08  MARION    MONTAGUE
2    4     313416906  313146906  12/09/75  12/09/76  WILLIAM   JOHNSON
3    4     246381056  216381056  07/15/20  07/07/20  WILLIE    GRANT
4    5     238013803  238013830  11/12/14  12/11/14  GLADYS    SOUTHARD
5    5     241191104  241191103  12/08/94  08/12/94  TAYLOR    FORD
6    52    272318863  272318860  09/11/77  .         NICOLE    PARKER
7    52    578111173  578111113  07/07/88  .         ASAJAH    ROSS
8    100   120688146  120688142  01/31/05  10/31/99  PATRICIA  BANEGAS
9    100   133680780  133680798  01/12/88  02/12/88  DANIEL    ANDRONIC
10   100   132769052  132759052  02/27/89  11/15/89  VICTORIA  HORN
link
Put together all links found
Identify indirect duplicates (type2>10000)
– i.e. both EISID1 & EISID2 link to identical SISID1
– Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
– Keep only the first id in data file link
– Create indirect duplicates files : dupeis2 & dupsis2
TODO : explore indirect duplicates
Create unique list of EIS & SIS
Generate unique full list of each set of ids
– use linkage info
– Link in the duplicates (dupeis & dupsis)
– TODO : link in the indirect duplicates
– eis & sis
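The step of keeping only the first id in the link file and diverting the rest to the indirect duplicate files could look roughly like this. The function name `split_links` and the tuple layout are assumptions for illustration; the output lists are named after the dupeis2 and dupsis2 files mentioned above.

```python
def split_links(links):
    """Split (eis_id, sis_id) pairs into unique links and indirect duplicates.

    Returns (unique_links, dupeis2, dupsis2), where each dupeis2 entry is
    (duplicate_eis_id, first_eis_id) and each dupsis2 entry is
    (duplicate_sis_id, first_sis_id).
    """
    first_eis_for_sis = {}
    first_sis_for_eis = {}
    unique_links, dupeis2, dupsis2 = [], [], []
    for eis_id, sis_id in links:
        if sis_id in first_eis_for_sis:
            # a second EIS id links to an already linked SIS id
            dupeis2.append((eis_id, first_eis_for_sis[sis_id]))
        elif eis_id in first_sis_for_eis:
            # a second SIS id links to an already linked EIS id
            dupsis2.append((sis_id, first_sis_for_eis[eis_id]))
        else:
            first_eis_for_sis[sis_id] = eis_id
            first_sis_for_eis[eis_id] = sis_id
            unique_links.append((eis_id, sis_id))
    return unique_links, dupeis2, dupsis2


links = [("E1", "S1"), ("E2", "S1"), ("E3", "S3"), ("E3", "S4")]
unique_links, dupeis2, dupsis2 = split_links(links)
print(unique_links, dupeis2, dupsis2)
```

Here E1 and E2 both link to S1, so E2 is recorded as an indirect duplicate of E1; S4 is recorded as an indirect duplicate of S3 in the same way.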
Data flow
eisid.sas7bdat : 4,308,863
– unduplicated : ueis.sas7bdat, 4,277,402 unique records (99%)
– duplicates : dupeis.sas7bdat, 31,461
sisid.sas7bdat : 1,888,747
– unduplicated : usis.sas7bdat, 1,638,112 unique records (87%)
– duplicates : dupsis.sas7bdat, 250,635
Link eis to sis : link.sas7bdat, 1,173,404 (eis: 27%, sis: 72%)
– dupeis2.sas7bdat : 1,270
– dupsis2.sas7bdat : 493
eis.sas7bdat : 4,308,863 (28%)
sis.sas7bdat : 1,888,747 (74%)
Type of links
Exact match         Approx match (miss)   Freq     %        cum %
ssn, dob, fsound                          781094   66.57%   66.57%
ssn, fsound         dob                   52173    4.45%    71.01%
ssn                 dob, fsound           10959    0.93%    71.95%
ssn, lsound         fname (dob mismatch)  9320     0.79%    72.74%
ssn                 other                 7095     0.60%    73.35%
dob, fsound, lname  (ssn=.)               251124   21.40%   94.75%
dob, fsound         lname                 16189    1.38%    96.13%
dob, fsound, lname  ssn                   23653    2.02%    98.14%
dob, fsound, lname  (ssn mismatch)        15544    1.32%    99.47%
dob, fsound         other                 4398     0.37%    99.84%
fname, lname        other                 1855     0.16%    100.00%
TOTAL                                     1173404  100.00%
Type of duplicates and links
      EIS                         SIS
Type  freq     %        cum %     freq     %        cum %
DLD   3270     0.08%    0.08%     4345     0.23%    0.23%
DLX   8790     0.20%    0.28%     228039   12.07%   12.30%
DXX   19401    0.45%    0.73%     18251    0.97%    13.27%
PLD   3221     0.07%    0.80%     3221     0.17%    13.44%
PLX   8706     0.20%    1.01%     185066   9.80%    23.24%
PXX   19198    0.45%    1.45%     16929    0.90%    24.14%
XLD   185066   4.30%    5.75%     8706     0.46%    24.60%
XLX   976411   22.66%   28.41%    976411   51.70%   76.29%
XXX   3084800  71.59%   100.00%   447779   23.71%   100.00%
TOT   4308863  100.00%            1888747  100.00%
Number of Duplicates
      EIS                                  SIS
dups  freq     sets     %        cum %     freq     sets     %        cum %
1     4246277  4246277  98.55%   98.55%    1432896  1432896  75.86%   75.86%
2     61600    30800    1.43%    99.98%    338251   169125   17.91%   93.77%
3     942      314      0.02%    100.00%   86379    28793    4.57%    98.35%
4     44       11       0.00%    100.00%   22928    5732     1.21%    99.56%
5                                          6020     1204     0.32%    99.88%
6                                          1662     277      0.09%    99.97%
7                                          497      71       0.03%    99.99%
8                                          96       12       0.01%    100.00%
9                                          18       2        0.00%    100.00%
TOT   4308863  4277402  100.00%            1888747  1638112  100.00%
Implementation details
Ndob & name must be looped
– multiple matches
Too many matches on name
– use half of ssn
– Overlap for transpose
Basic Process
Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
– use linkage info
– Link in the duplicates
– eis & sis
Unduplication
Same as matching between different systems
Except, match the database to itself
– i.e. EIS to EIS, SIS to SIS
Randomly select one as Primary
– TODO : for those not linked using primary ID, try with duplicate ID
TODO : explore indirect duplicate links
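As a rough sketch of this step: once a database has been matched against itself, the matched pairs can be grouped into sets of ids and one id per set chosen at random as the primary. The grouping below uses a union-find structure, which is a technique chosen for this illustration and not something the slides specify; the function name `unduplicate` and the `seed` parameter are likewise hypothetical.

```python
import random


def unduplicate(ids, matched_pairs, seed=0):
    """Group ids connected by self-matches and pick a random primary per group.

    Returns (primaries, duplicates), where duplicates is a list of
    (duplicate_id, primary_id) pairs.
    """
    parent = {i: i for i in ids}

    def find(x):
        # follow parent pointers to the group representative, halving the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    rng = random.Random(seed)  # seeded so that a rerun picks the same primaries
    primaries, duplicates = [], []
    for members in groups.values():
        primary = rng.choice(members)
        primaries.append(primary)
        duplicates.extend((m, primary) for m in members if m != primary)
    return primaries, duplicates


primaries, duplicates = unduplicate(
    ["E1", "E2", "E3", "E4"], [("E1", "E2"), ("E2", "E3")]
)
print(primaries, duplicates)
```

In this example E1, E2 and E3 form one set with a single primary, and E4 remains a record with no duplicates.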
Conclusion
Future work :
– indirect duplicates
– Link using duplicates
SSNs have been changed from the real data
Thank You !
Type of id
first letter : duplicate status
– P : primary id with duplicates
– D : duplicates (primary info given with prefix ‘l’)
– X : no duplicates
second letter : link status
– L : linked
– X : no linked id
third letter : duplicates status of the linked id
– D : duplicates exist for the linked id
– X : no duplicates for the linked id
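For readers working with the three-letter type codes (DLD, PLX, XXX and so on), a small decoder such as the following may help. The function and dictionary names are hypothetical; the meanings are taken directly from the list above.

```python
FIRST_LETTER = {
    "P": "primary id with duplicates",
    "D": "duplicate id",
    "X": "no duplicates",
}
SECOND_LETTER = {
    "L": "linked",
    "X": "no linked id",
}
THIRD_LETTER = {
    "D": "duplicates exist for the linked id",
    "X": "no duplicates for the linked id",
}


def describe_type(code: str):
    """Translate a three-letter id type into its three status descriptions."""
    if len(code) != 3:
        raise ValueError(f"expected a three-letter type, got {code!r}")
    return (
        FIRST_LETTER[code[0]],
        SECOND_LETTER[code[1]],
        THIRD_LETTER[code[2]],
    )


print(describe_type("PLX"))
```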
EIS & SIS Table
Unique full list of EIS (or SIS) ids
Type : type of id (XXX), see the Type of id slide
All eis info has no prefix
All sis info has prefix ‘k’
Prefix ‘l’ is the link id info
freqeis & freqsis : # of duplicate ids
Pindid (eis) & pkindid (sis) are the primary ids
indid1-indid3 & kindid1-kindid8 : duplicate ids
Link type
sdiff : # digits different in ssn
– -1 : one or both ssn is missing
– 2 : two digits are transposed
– 10 : two digits are different but not transposed
ddiff : diff in dob
– -1 : one or both dob is missing
– 2 : date and month is transposed
– 3 : date, month and year are different
– 4 : date and month are different
Fdiff (ldiff) : difference in first (last) name
– -1 : one or both are missing
– 1 : one letter difference (INDEL or REPL)
– 100 : one is a substring of the other
– 101 : one letter diff & substring
Duplicate type
If duplicate id
– Primary id info is given with prefix “l”
– Duplicate type : Lsdiff, lddiff, lfdiff, & lldiff
If primary id
– # of duplicates : freqeis & freqsis
– Duplicate ids : Indid1-indid3 (eis) & kindid1-kindid8 (sis)
Other tables
Link
– Linkage between the primary eis & sis ids
dupeis & dupsis
– List of duplicates with primary id
Data flow
eisid : 4,308,863
– ueis (4,277,402) + dupeis (31,461) : 99%
sisid : 1,888,747
– usis (1,638,112) + dupsis (250,635) : 87%
Link : 1,173,404 (eis: 27%, sis: 72%)
– dupeis2 (1,270) + dupsis2 (493)
EIS : 4,308,863 (28%)
SIS : 1,888,747 (74%)