Deterministic Record Linking
Hye-Chung Kum
University of North Carolina, Chapel Hill

Example

ID   ssn            first name   last name   MI   DOB
E1   085-66-9980    Sally        Hill        L    3/4/1999
S1   085-66-9980    Sally        Hill        L    3/4/1999
E2   143-25-9304    Emily        Brown       K    6/2/2004
S2   143-52-9304    Emily        Brown       K    6/2/2004
E3   354-563-2343   Mary         Johnson     G    5/13/1983
S3   354-563-2343   Mary         Hawkins     J    5/13/1983
E4   532-34-9183    David        Ford        J    10/25/1990
S4   532-34-9183    David        Ford        J    10/23/1990

(E = EISID, S = SISID)

Exact Match
  E1 / S1: ssn, first name, last name, MI and DOB all equal

Approximate Matching I : SSN
  E2 / S2: ssn 143-25-9304 vs 143-52-9304; all other fields equal

Approximate Matching II : DOB
  E4 / S4: DOB 10/25/1990 vs 10/23/1990; all other fields equal

Approximate Matching III : Name
  E3 / S3: last name Johnson vs Hawkins, MI G vs J; ssn, first name and DOB equal

Deterministic Record Linking
Allow for approximate matching
Use explicit approximate rules
Pros: can control the linkage process
Con: difficult to implement
Alternative: Probabilistic record linking
  – Also approximate matching
  – However, uses general rules specified by users
  – Based on total probability
  – Con: cannot control exactly what to consider a match or not
  – Pros: can use specialized software

Approximate Matching : DOB
Element to element match: date, month, year
Allow for one element difference
Allow for month and day transposed

DOB : one element
  dob1 : 10/25/1990
  dob2 : 10/23/1990
DOB : transpose
  dob1 : 11/7/1995
  dob2 : 7/11/1995

Approximate Matching : Name
First name is approx
  – First name soundex match
  – and/or substr
  (one letter different: insert or replace)
lsound equal or lname approx
  – MI=FI
  – FI equal
Fsound & Lsound swapped

obs  fname      kfname     mi, kmi  lname      klname
1    RUDOLPH    RULDOLPH   A A      SIMARD     SIMARD
2    ALIJAH     ALIYAH     M        FOSS       FOSS
3    CAROL      CAROLYN    J        YOUNG      YOUNG
4    ANGELIQUE  ANGIE      D        OUELLETTE  OUELLETTE
5    JOHNNY     JOHNNY JR  L        MAYO       MAYO
6    ZACHARY    ZACK       L        ROGERS     ROGERS
7    J          MICHAEL    M        GALLAGHER  GALLAGHER
8    ANTON      COUDRAY    C A      CYPRESS    CYPRESS
9    ARTHUR     AUTHOR     R R      DAVIS      DAVIS
10   EDWIN      EDDIE               KAHKONE    KAHKONE
11   GOLDY      OWENS      A A      OWENS      GOLDY

Match on ssn (ssn equal)
1 : dob, fsound equal
dob approx
  – 2 : dob approx, fsound equal
  – 3 : dob approx, fname approx
  – 4 : dob approx, lsound equal, & fsound diff, but MI=FI
  – 5 : dob approx, lsound equal, & fsound diff, but FI equal
  – 6 : dob approx, lsound and fsound swapped
  – 7 : dob approx, lname approx & fsound diff but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
  – 8 : fname approx, lsound equal, and dob diff
  – 9 : fname approx, lsound approx, and dob diff

Approximate Matching : SSN
Digit to digit match
Allow for one digit difference
Allow for two digit difference if transposed

SSN : one digit
  ssn1 : 532-34-9183
  ssn2 : 532-34-8183
SSN : transpose
  ssn1 : 143-25-9304
  ssn2 : 143-52-9304

Match on ndob (dob+fsound)
ssn missing
  – 1 : lname equal
  – 2 : lname approx
ssn approx
  – 3 : lname equal
  – 4 : lname approx
  – 5 : lname diff but fname equal
ssn different
  – 11 : lname equal
  – 12 : lname approx
lname different
  – 51 : ssn approx
  – 52 : ssn missing

obs  SSN        kSSN       fname      kfname     lname            klname
1    244572812  .          APPOLONIA  APPOLONIA  GAVINS           GAVINS
2    .          .          ABEL       ABEL       LOMELIGARCIA     LOMELI
3    248511181  .          JOSH       JOSHUA     PHIPPS           PHIPPS
4    243352045  243352054  LENA       LENA       COOPER           COOPER
5    239565518  239565519  MILES      MILES      KNIGHT JR.       KNIGHT
6    245193584  245493584  MARTHA     MARTHA     LYDA             HOPKINS
7    244779182  244778182  AUSTIN     AUSTYN     TERWILLIGER      OMEARA
8    489987513  489987573  ALISIA     ALICE      GRAVES           WATSON
9    239966568  239966578  ANNA       ANAYA      MONTAGUE         BOLDING
10   227691655  227691633  BRITTNEY   BRITTNEY   REVELS           REVELS
11   242339913  239524402  DANIEL     DANIEL     ROBINSON         ROBINSON
12   221864852  225206017  HELEN      HELEN      HALL             HOLLER
13   240212489  222565604  DEBORAH    DEBRA      LEE              LEACH
14   238995019  .          ABIGAHIL   ABIGAHIL   GARCIATREJO      TREJO
15   .          .          APSLEY     APSLEY     CARLYLE          KARYLE
16   .          .          ABIGAIL    ABIGAIL    GENTRY           KING
17   237999685  .          ABIGAIL    ABIGAIL    RODRIGUEZRINCON  HERNANDEZ
18   237998504  .          ABIGAYLE   ABIGAIL    FITZGERALD       HERNANDEZ

Match on name (fname+lname)
ssn missing & dob approx
  – 1 : MI equal
  – 7 : MI missing
  – 8 : MI not equal
ssn approx
  – 3 : dob equal
  – dob approx
      4 : one element
      5 : transpose

obs  Type  ssn        kssn       dob       kdob      fname     lname
1    4     362201047  326201047  09/06/09  09/06/08  MARION    MONTAGUE
2    4     313416906  313146906  12/09/75  12/09/76  WILLIAM   JOHNSON
3    4     246381056  216381056  07/15/20  07/07/20  WILLIE    GRANT
4    5     238013803  238013830  11/12/14  12/11/14  GLADYS    SOUTHARD
5    5     241191104  241191103  12/08/94  08/12/94  TAYLOR    FORD
6    52    272318863  272318860  09/11/77  .         NICOLE    PARKER
7    52    578111173  578111113  07/07/88  .         ASAJAH    ROSS
8    100   120688146  120688142  01/31/05  10/31/99  PATRICIA  BANEGAS
9    100   133680780  133680798  01/12/88  02/12/88  DANIEL    ANDRONIC
10   100   132769052  132759052  02/27/89  11/15/89  VICTORIA  HORN

link
Put together all links found
Identify indirect duplicates (type2>10000)
  – i.e. both EISID1 & EISID2 link to identical SISID1
  – Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
  – Keep only the first id in data file link
  – Create indirect duplicates files dupeis2 & dupsis2
TODO : explore indirect duplicates

Create unique list of EIS & SIS
Generate unique full list of each set of ids
  – use linkage info
  – Link in the duplicates (dupeis & dupsis)
  – TODO : link in the indirect duplicates eis & sis

Data flow
eisid.sas7bdat : 4,308,863
  – unduplicated unique records, ueis.sas7bdat : 4,277,402 (99%)
  – duplicates, dupeis.sas7bdat : 31,461
sisid.sas7bdat : 1,888,747
  – unduplicated unique records, usis.sas7bdat : 1,638,112 (87%)
  – duplicates, dupsis.sas7bdat : 250,635
Link eis to sis, link.sas7bdat : 1,173,404 (eis: 27%, sis: 72%)
  – dupeis2.sas7bdat : 1,270
  – dupsis2.sas7bdat : 493
eis.sas7bdat : 4,308,863 (28%)
sis.sas7bdat : 1,888,747 (74%)

Type of links

Exact match          Approx match (miss)    Freq      %        cum %
ssn, dob, fsound                            781094    66.57%   66.57%
ssn, fsound          dob                    52173     4.45%    71.01%
ssn                  dob, fsound            10959     0.93%    71.95%
ssn, lsound          fname (dob mismatch)   9320      0.79%    72.74%
ssn                  other                  7095      0.60%    73.35%
dob, fsound, lname   (ssn=.)                251124    21.40%   94.75%
dob, fsound          lname                  16189     1.38%    96.13%
dob, fsound, lname   ssn                    23653     2.02%    98.14%
dob, fsound, lname   (ssn mismatch)         15544     1.32%    99.47%
dob, fsound          other                  4398      0.37%    99.84%
fname, lname         other                  1855      0.16%    100.00%
TOTAL                                       1173404   100.00%

Type of duplicates and links

       EIS                           SIS
Type   freq      %        cum %      freq      %        cum %
DLD    3270      0.08%    0.08%      4345      0.23%    0.23%
DLX    8790      0.20%    0.28%      228039    12.07%   12.30%
DXX    19401     0.45%    0.73%      18251     0.97%    13.27%
PLD    3221      0.07%    0.80%      3221      0.17%    13.44%
PLX    8706      0.20%    1.01%      185066    9.80%    23.24%
PXX    19198     0.45%    1.45%      16929     0.90%    24.14%
XLD    185066    4.30%    5.75%      8706      0.46%    24.60%
XLX    976411    22.66%   28.41%     976411    51.70%   76.29%
XXX    3084800   71.59%   100.00%    447779    23.71%   100.00%
TOT    4308863   100.00%             1888747   100.00%

Number of Duplicates

       EIS                                      SIS
dups   freq      sets      %        cum %       freq      sets      %        cum %
1      4246277   4246277   98.55%   98.55%      1432896   1432896   75.86%   75.86%
2      61600     30800     1.43%    99.98%      338251    169125    17.91%   93.77%
3      942       314       0.02%    100.00%     86379     28793     4.57%    98.35%
4      44        11        0.00%    100.00%     22928     5732      1.21%    99.56%
5                                               6020      1204      0.32%    99.88%
6                                               1662      277       0.09%    99.97%
7                                               497       71        0.03%    99.99%
8                                               96        12        0.01%    100.00%
9                                               18        2         0.00%    100.00%
TOT    4308863   4277402   100.00%              1888747   1638112   100.00%

Implementation details
Ndob & name must be looped
  – multiple matches
Too many match on name
  – use half of ssn
  – Overlap for transpose

Basic Process
Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
  – use linkage info
  – Link in the duplicates for eis & sis

Unduplication
Same as matching between different systems
Except, match the database to itself
  – i.e. EIS to EIS, SIS to SIS
  – Randomly select one as Primary
TODO : for those not linked using primary ID, try with duplicate ID
TODO : explore indirect duplicate links

Conclusion
Future work :
  – indirect duplicates
  – Link using duplicates
SSNs shown have been changed from the real data

Thank You!
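
The "link" step described earlier (keep only the first id, move indirect duplicates to dupeis2 and dupsis2) can be illustrated with a minimal Python sketch. This is not the deck's own code, which works on SAS data sets; it assumes links arrive as (eisid, sisid) pairs already sorted so that the preferred link comes first, and the function name split_indirect_duplicates is hypothetical. Longer chains of indirect duplicates are not resolved here, in line with the TODO items above.

```python
def split_indirect_duplicates(links):
    """Split (eisid, sisid) links into unique links and indirect duplicates.

    Assumption: `links` is ordered so that the first link seen for an id
    is the one to keep.
    Returns (unique_links, dupeis2, dupsis2), where the duplicate lists
    hold (duplicate_id, primary_id) pairs.
    """
    primary_for_sis = {}  # sisid -> first eisid that linked to it
    primary_for_eis = {}  # eisid -> first sisid it linked to
    unique_links, dupeis2, dupsis2 = [], [], []

    for eisid, sisid in links:
        if sisid in primary_for_sis:
            # Another EIS id already links to this SIS id, so this EIS id
            # is an indirect duplicate of the first one.
            first_eis = primary_for_sis[sisid]
            if first_eis != eisid:
                dupeis2.append((eisid, first_eis))
        elif eisid in primary_for_eis:
            # This EIS id already links to another SIS id, so this SIS id
            # is an indirect duplicate of the first one.
            dupsis2.append((sisid, primary_for_eis[eisid]))
        else:
            primary_for_sis[sisid] = eisid
            primary_for_eis[eisid] = sisid
            unique_links.append((eisid, sisid))

    return unique_links, dupeis2, dupsis2


links = [("E1", "S1"), ("E2", "S1"), ("E3", "S3"), ("E3", "S4")]
unique_links, dupeis2, dupsis2 = split_indirect_duplicates(links)
```

In this toy input E1 and E2 both link to S1, so E2 lands in dupeis2 with E1 as its primary; E3 links to both S3 and S4, so S4 lands in dupsis2 with S3 as its primary.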
Type of id
first letter:
  – P : primary id with duplicates
  – D : duplicates (primary info given with prefix 'l')
  – X : no duplicates
second letter: link status
  – L : linked
  – X : no linked id
third letter: duplicates status of the linked id
  – D : duplicates exist for the linked id
  – X : no duplicates for the linked id

EIS & SIS Table
Unique full list of EIS (or SIS) ids
Type : type of id (XXX), see the Type of id slide
All eis info have no prefix
All sis info have prefix 'k'
Prefix 'l' is the link id info
freqeis & freqsis : # of duplicate ids
pindid (eis) & pkindid (sis) is the primary id
indid1-indid3 & kindid1-kindid8 : duplicate ids

Link type
sdiff : # digits different in ssn
  – -1 : one or both ssn is missing
  – 2 : two digits are transposed
  – 10 : two digits are different but not transposed
ddiff : diff in dob
  – -1 : one or both dob is missing
  – 2 : date and month are transposed
  – 3 : date, month and year are different
  – 4 : date and month are different
fdiff (ldiff) : difference in first (last) name
  – -1 : one or both are missing
  – 1 : one letter difference (INDEL or REPL)
  – 100 : one is a substring of the other
  – 101 : one letter diff & substring

Duplicate type
If duplicate id
  – Primary id info is given with prefix 'l'
  – Duplicate type : lsdiff, lddiff, lfdiff, & lldiff
If primary id
  – # of duplicates : freqeis & freqsis
  – Duplicate ids : indid1-indid3 (eis) & kindid1-kindid8 (sis)

Other tables
link
  – Linkage between the primary eis & sis ids
dupeis & dupsis
  – List of duplicates with primary id

Data flow
eisid : 4,308,863
  – ueis (4,277,402) + dupeis (31,461) : 99%
sisid : 1,888,747
  – usis (1,638,112) + dupsis (250,635) : 87%
link : 1,173,404 (eis: 27%, sis: 72%)
  – dupeis2 (1,270) + dupsis2 (493)
EIS : 4,308,863 (28%)
SIS : 1,888,747 (74%)
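
The sdiff and ddiff codes listed under "Link type" can be sketched in Python as follows. This is an illustration under stated assumptions, not the deck's implementation: SSNs are compared as equal-length digit strings, a transposition is assumed to mean two adjacent digits swapped (as in every example shown in the deck), dates are passed as (month, day, year) tuples, and the return values 0 (equal) and 1 (one digit or element differs) are assumed from the phrase "# digits different". Date combinations the slide does not list return None rather than a guessed code.

```python
from typing import Optional, Tuple

MISSING = (None, "", ".")  # "." is how the deck's tables show a missing value
Dob = Tuple[int, int, int]  # (month, day, year); a hypothetical representation


def sdiff(ssn1: Optional[str], ssn2: Optional[str]) -> int:
    """Digit-to-digit SSN comparison using the sdiff codes from the slide."""
    if ssn1 in MISSING or ssn2 in MISSING:
        return -1  # one or both ssn missing
    a, b = ssn1.replace("-", ""), ssn2.replace("-", "")
    if len(a) != len(b):
        raise ValueError("SSNs must have the same number of digits")
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 2:
        i, j = diff
        # Assumption: "transposed" means two adjacent digits swapped.
        swapped = j == i + 1 and a[i] == b[j] and a[j] == b[i]
        return 2 if swapped else 10
    return len(diff)  # assumed: 0 = equal, 1 = one digit differs, ...


def ddiff(dob1: Optional[Dob], dob2: Optional[Dob]) -> Optional[int]:
    """Element-to-element DOB comparison using the ddiff codes from the slide."""
    if dob1 is None or dob2 is None:
        return -1  # one or both dob missing
    (m1, d1, y1), (m2, d2, y2) = dob1, dob2
    month, day, year = m1 != m2, d1 != d2, y1 != y2
    if month and day and year:
        return 3  # date, month and year are different
    if month and day:
        # 2 = date and month transposed, 4 = both differ without transposition
        return 2 if (m1, d1) == (d2, m2) else 4
    if year and (month or day):
        return None  # combination not listed on the slide
    return int(month or day or year)  # assumed: 0 = equal, 1 = one element


print(sdiff("143-25-9304", "143-52-9304"))    # transposed digits
print(ddiff((11, 7, 1995), (7, 11, 1995)))    # month and day transposed
```

Both print calls use pairs taken from the deck's own examples; each should report code 2, the transposition code for its field.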