Accounting for Naming Conventions in NCHS Data Linkages

What’s in a Name?
Accounting for Naming Conventions in
NCHS Data Linkages
Eric A. Miller
National Center for Health Statistics (NCHS)
2012 FCSM Statistical Policy Seminar
December 4, 2012
“Two men say they’re Jesus.
One of them must be wrong.”
Mark Knopfler, Dire Straits
What Does This Have to do With Data Quality?
• One reason for data sharing is data linkage
• Assessing the quality of linked data is
different from assessing a standalone dataset
• The quality of variables from a specific source doesn’t
matter if the linkage is poor
• Problems with linkage can produce poor quality data
– Are the data fit for use? + Are the data fit for linkage?
Names
• Names are commonly used in data
linkages
• Important to account for name differences
and naming conventions to produce a high
quality linked data file
Quick Background on Data Linkage
• Deterministic
– Exact match on linkage
variables
• Frank ≠ Francis
• Probabilistic
– Accounts for imperfect data
– Probability of a match
• Frank ≈ Francis
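The contrast between the two approaches can be sketched in Python. This is only an illustration, not the NCHS or NDI method: the probabilistic part uses Fellegi-Sunter style agreement weights, and the field names, the m and u probabilities, and the one-entry name-variant set are assumptions made up for the example.

```python
import math

# Hypothetical name-variant pairs, for illustration only.
EQUIVALENT_FIRST_NAMES = {frozenset({"FRANK", "FRANCIS"})}

def deterministic_match(rec_a, rec_b, fields):
    """Exact match: every linkage variable must be identical."""
    return all(rec_a[f] == rec_b[f] for f in fields)

def field_agrees(field, a, b):
    """Treat known first-name variants as agreeing; otherwise require equality."""
    if a == b:
        return True
    return field == "first" and frozenset({a, b}) in EQUIVALENT_FIRST_NAMES

def match_weight(rec_a, rec_b, m_u):
    """Sum log2 agreement and disagreement weights over the linkage fields.

    m = P(field agrees | the records are a true match)
    u = P(field agrees | the records are not a match)
    """
    total = 0.0
    for field, (m, u) in m_u.items():
        if field_agrees(field, rec_a[field], rec_b[field]):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

# Assumed m and u values; in practice they are estimated from the data.
M_U = {"first": (0.95, 0.01), "last": (0.95, 0.005), "birth_year": (0.98, 0.02)}

survey = {"first": "FRANK", "last": "SMITH", "birth_year": 1940}
death = {"first": "FRANCIS", "last": "SMITH", "birth_year": 1940}

exact = deterministic_match(survey, death, ["first", "last", "birth_year"])
weight = match_weight(survey, death, M_U)
```

Here `exact` is False because Frank ≠ Francis, while `weight` is positive because every field is treated as agreeing. A real linkage would compare the weight against a threshold chosen to balance missed matches against false matches.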
Caveats of Data Linkage
• It’s not perfect
– Example: one person recorded as “Prince Rogers Nelson”, “Prince”, “?”, and “Prince”
Some things are out of our control!
Caveats of Data Linkage
• Varying levels of quality for linkage
variables can substantially increase
workload
– Clean-up, reformatting
– Clerical review
• Analysis of insufficiently linked data can
produce biased estimates
Example - Hispanic Paradox
• Despite having a higher risk profile,
Hispanics have been found to have lower
mortality rates compared to non-Hispanic
whites
Markides and Coreil (1986). Public Health Reports; 101: 253-265
Mortality Rate per 100,000 Among Women in 1986-1990 National Health Interview Survey Linked to
1991 National Death Index
[Bar chart: rate per 100,000 (axis 0 to 4000) for White-NH, Black-NH, and Hispanic women at Age 18-44, Age 45-64, and Age 65+. Bar labels as extracted, not assigned to groups: 3928, 3504, 2438, 969, 642, 480, 182, 97, 80.]
Liao et al. (1998). Mortality Patterns among Adult Hispanics:
Findings from the NHIS, 1986 to 1990. AJPH.
Potential Reasons for Paradox
• Health selective immigration
• Salmon bias (return migration)
• Advantageous health behaviors and social
support
• Data quality / Insufficient linkage
Potential Reasons for Paradox
• Data quality / Insufficient linkage
– Naming conventions for Hispanics differ from
other US populations
• Use of mother’s and father’s surname
• May not have single middle name
– Less likely to have social security number
• Especially among older adults and foreign born
Percent of “True” Matches for Hispanics and Non-Hispanic Whites by Foreign-Born Status

                            Hispanic                  Non-Hispanic White
                            Foreign-born   US-born    Foreign-born   US-born
Class 1 (“True”) Matches    32.5%          50.0%      57.4%          62.5%

Class 1: records agree on at least 8 digits of SSN as well as first
and last name, middle initial, and birth year (+/- 3 years)
Lariscy (2011). Differential record linkage by Hispanic ethnicity and age in linked mortality
studies: Implications for the epidemiologic paradox. Journal of Aging and Health; 23: 1263-1284.
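The Class 1 definition above can be written as a small check. This is a sketch under assumptions: the record layout and field names are hypothetical, "at least 8 digits" is read as position-by-position agreement across the nine SSN digits, and names are assumed to be cleaned and upper-cased already. The sample values are made up.

```python
def ssn_digits_agreeing(ssn_a, ssn_b):
    """Count positions at which two 9-digit SSNs carry the same digit."""
    a = "".join(ch for ch in ssn_a if ch.isdigit())
    b = "".join(ch for ch in ssn_b if ch.isdigit())
    if len(a) != 9 or len(b) != 9:
        return 0  # a missing or malformed SSN cannot support a Class 1 match
    return sum(x == y for x, y in zip(a, b))

def is_class1(rec_a, rec_b):
    """Agreement on 8+ SSN digits, first and last name, middle initial, birth year +/- 3."""
    return (
        ssn_digits_agreeing(rec_a["ssn"], rec_b["ssn"]) >= 8
        and rec_a["first"] == rec_b["first"]
        and rec_a["last"] == rec_b["last"]
        and rec_a["middle_initial"] == rec_b["middle_initial"]
        and abs(rec_a["birth_year"] - rec_b["birth_year"]) <= 3
    )

# Made-up records: one SSN digit differs and the birth years are 2 apart.
survey = {"ssn": "123-45-6789", "first": "ANA", "last": "LOPEZ",
          "middle_initial": "M", "birth_year": 1930}
death = {"ssn": "123456780", "first": "ANA", "last": "LOPEZ",
         "middle_initial": "M", "birth_year": 1932}

class1 = is_class1(survey, death)
no_ssn = is_class1(dict(survey, ssn=""), death)
```

Under this reading, a record without an SSN can never reach Class 1, and neither can a record whose last name is stored differently in the two files, which is consistent with the lower Class 1 percentages shown for groups with fewer SSNs and different naming conventions.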
What does this have to do with NCHS?
• NCHS Record Linkage Program
– Links survey data with data collected from
administrative records
– Designed to maximize the scientific value of the
NCHS population-based surveys
– Examine factors that influence chronic disease,
disability, health care utilization, morbidity, and
mortality
Linked NCHS surveys
• National Health Interview Survey (NHIS)
• 1999-2004 NHANES, NHANES III, and NHANES II
• NHANES I Epidemiologic Follow-up Study (NHEFS)
• The Second Longitudinal Study of Aging (LSOA II)
• National Nursing Home Survey (NNHS)
Linked Administrative Records
• National Death Index
• Medicare and Medicaid enrollment and claims
• Social Security Administration Retirement and
Disability
• Pilot projects
– Florida Cancer Data System
– Texas Supplemental Nutrition Assistance Program (SNAP)
Case Study: NCHS Survey linkage with the NDI
• National Death Index (NDI)
– A national file of identifying death record
information (beginning with 1979 deaths)
– Every four years we send a file of survey
participants to NDI to conduct a linkage
and identify participant deaths
– We take additional steps to try to improve
the linkage
NDI Matching Algorithm
• Social Security Number
• First name
• Middle initial
• Last name
• Month of birth
• Year of birth
• Sex
• Father’s surname
• State of birth
• Race
• State of residence
• Marital Status
Unweighted percent of NHIS sample adults aged 18 or older refusing to provide SSN, 1997-2009
[Chart not included in the extracted text]
NCHS Record Linkage Program
• To make sure we provide research quality
data, we spend a lot of time processing
the data to increase the chance of finding
a true match
– Try to increase the number of matches while
minimizing false matches
• Addressing name clean-up and naming
conventions is a major activity
Methods – Name Clean-up
• Fix invalid characters
• Compress spaces
• Remove titles/descriptors/suffixes
– e.g. Mr., baby, jr.
• Linkage uses NYSIIS phonetic codes
– Accounts for misspellings or unusual spellings
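The first three clean-up steps above might look like the following Python sketch. The list of titles, descriptors, and suffixes is illustrative only (the slides give just "Mr., baby, jr." as examples), and stripping accents and converting to capital letters are assumptions about the target file format. The NYSIIS phonetic coding step is not shown.

```python
import re
import unicodedata

# Illustrative list only; the actual NCHS list is not given in the slides.
TITLES_AND_SUFFIXES = {"MR", "MRS", "MS", "DR", "BABY", "JR", "SR", "II", "III"}

def clean_name(raw):
    """Fix invalid characters, compress spaces, and drop titles and suffixes."""
    # Assumption: linkage files store unaccented capital letters.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    # Anything that is not a letter or a space becomes a space.
    text = re.sub(r"[^A-Z ]", " ", text)
    # split() with no argument also collapses runs of spaces.
    tokens = [t for t in text.split() if t not in TITLES_AND_SUFFIXES]
    return " ".join(tokens)

cleaned = clean_name("Mr.  John   Smith, Jr.")
```

In this sketch `cleaned` is "JOHN SMITH". Note that replacing punctuation with a space splits names such as O'Brien into two tokens; whether to join or split such names is a design choice the slides do not address.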
Methods – Name Clean-up
• Create alternate records
– Sent with original record
• Among women, substitute other surnames (e.g. maiden name) for last name
• Nicknames (using a look-up table)
– Substituting Elizabeth for Beth
Nickname Lookup Table
SEX   NICKNAME   PROPER NAME
M     ABE        ABRAHAM
F     AGGIE      AGNES
M     AL         ALBERT
M     ALEX       ALEXANDER
M     ALF        ALFRED
F     ALLIE      ALBERTA
M     ANDY       ANDREW
Example: If first name=‘Andy’ then alternate record
first name=‘Andrew’
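A lookup of this kind can be sketched as a dictionary keyed on sex and nickname. The entries below are the rows shown on the slide; the record layout and the function name are assumptions for illustration.

```python
# (sex, nickname) -> proper name, copied from the rows shown on the slide.
NICKNAMES = {
    ("M", "ABE"): "ABRAHAM",
    ("F", "AGGIE"): "AGNES",
    ("M", "AL"): "ALBERT",
    ("M", "ALEX"): "ALEXANDER",
    ("M", "ALF"): "ALFRED",
    ("F", "ALLIE"): "ALBERTA",
    ("M", "ANDY"): "ANDREW",
}

def nickname_alternate(record):
    """Return a copy of the record with the proper name, or None if no entry applies."""
    key = (record["sex"], record["first"].upper())
    proper = NICKNAMES.get(key)
    if proper is None:
        return None
    alternate = dict(record)  # leave the original record unchanged
    alternate["first"] = proper
    return alternate

original = {"sex": "M", "first": "ANDY", "last": "SMITH"}
alternate = nickname_alternate(original)

# The alternate record is sent along with the original, not instead of it.
records_to_send = [original] + ([alternate] if alternate else [])
```

Keying on sex as well as nickname keeps a male-only entry such as ANDY from generating an alternate for a female record.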
Methods – Name Clean-up
– Accounting for Hispanic and Asian naming
conventions
• Hispanic
– Hispanic nickname lookup table
– switch middle and last
• Asian
– switch first and last
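The two switches can be sketched with one helper that swaps a pair of name fields. The field names, the helper, and the sample names are hypothetical; the slides do not say how NCHS decides which records receive each alternate.

```python
def swap_fields(record, field_a, field_b):
    """Return a copy with two name fields exchanged, or None if either is empty."""
    if not record.get(field_a) or not record.get(field_b):
        return None  # nothing useful to swap
    alternate = dict(record)
    alternate[field_a], alternate[field_b] = record[field_b], record[field_a]
    return alternate

# Hispanic convention: a surname may have been recorded in the middle-name field.
hispanic = {"first": "MARIA", "middle": "GARCIA", "last": "LOPEZ"}
hispanic_alt = swap_fields(hispanic, "middle", "last")

# Asian convention: the family name may have been recorded first.
asian = {"first": "WANG", "middle": "", "last": "LEI"}
asian_alt = swap_fields(asian, "first", "last")
```

As with nicknames, each alternate would be sent in addition to the original record.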
Hispanic Lookup Table
Sex   Formal Name   Nicknames
F     Adelina       Deli, Lina
F     Adelaida      Ade, Adela
M     Adrián        Adri
F     Adriana       Adri
M     Alberto       Alber, Albertito, Beto, Berto, Tico, Tuco
M     Alejandro     Ale, Álex, Alejo, Jandro, Jano, Sandro
F     Alejandra     Sandra, Ale, Álex, Aleja, Jandra, Jana
M     Alfonso       Alfon, Fon, Fonso, Fonsi
F     Alicia        Ali, Licha
(Also listed on the slide, row not recoverable from the extracted text: Poncho, Tito)
Alternate Records Example
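Number   First   Middle    Last
1        David   Américo   Arias Ortiz
2        David   Américo   Ortiz
3        David   Américo   Arias
4        David   Américo   (one cell empty; column placement not recoverable)
5        Big     Papi      (one cell empty; column placement not recoverable)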
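Records 2 and 3 in the example can be generated by splitting a two-part last name. The sketch below is an assumption about how such alternates might be built, not the NCHS procedure, and it does not attempt records 4 and 5. It splits on spaces only, so surnames with particles (for example "De La Cruz") would need extra handling.

```python
def surname_alternates(record):
    """Return one alternate per part of a multi-part last name."""
    parts = record["last"].split()
    if len(parts) < 2:
        return []  # single surname: no alternates needed
    alternates = []
    # Reversed so the second surname comes first, matching the slide's order.
    for part in reversed(parts):
        alternate = dict(record)
        alternate["last"] = part
        alternates.append(alternate)
    return alternates

original = {"first": "David", "middle": "Américo", "last": "Arias Ortiz"}
records_to_send = [original] + surname_alternates(original)
last_names = [rec["last"] for rec in records_to_send]
```

Every alternate sent is another chance for a false link, so the receiving step has to decide which, if any, of the candidate matches for one person to accept.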
Conclusions
• Care needs to be taken to avoid false links
– Alternate records increase the number of
potential matches
• If two men claim they’re Jesus, they can both be
wrong
– Need a higher level of scrutiny to determine
that a pair of records match
Conclusions
• Accounting for name differences and naming
conventions improves quality of the linked-data
product
• We hope our efforts to account for Hispanic and
Asian naming conventions reduce potential
bias
– Need to evaluate
Important Considerations
• How are names collected?
• How are the names recorded?
• More likely to have formal names versus
nicknames?
– Surveys may differ from official documents
• Are maiden names (surnames) available?
• Are there consistent rules for recording
names?
Acknowledgements
• Dr. Jennifer Parker
• Dr. Dean Judson
Thank you