Identifiers Linkages and Match Error Estimation in CMS Data June 2009

JEN Associates - Cambridge, MA
Building Blocks of Integrated Data Models
• Integrated Data Goals: follow people over time with
interwoven current and historical payment data
from all program sources
• Two ‘Simple’ Steps
– Gather and link all identifiers associated with a single
person within a source over time and assign a
standardized identifier
– Link person level records between sources and
assign a unique identifier
• Generate person level analytic records with data from
different sources over time
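The first step, gathering every identifier that belongs to one person within a source, can be read as a connected-components problem. The sketch below is only an illustration of that idea, not the method used by CMS or JEN Associates. The function name `assign_person_ids`, the sample identifiers, and the assumption that ID transitions arrive as pairs of identifiers are all hypothetical.

```python
def assign_person_ids(id_transitions):
    """Group identifiers connected by observed transitions into persons.

    id_transitions: iterable of (id_a, id_b) pairs believed to belong to the
    same person (hypothetical input format). Returns {identifier: person_id}.
    """
    parent = {}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in id_transitions:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number each connected component in order of first appearance.
    component_number = {}
    result = {}
    for identifier in parent:
        root = find(identifier)
        component_number.setdefault(root, len(component_number) + 1)
        result[identifier] = component_number[root]
    return result


# Made-up identifiers: A1 -> A2 -> A3 is one person, B1 -> B2 another.
links = [("A1", "A2"), ("B1", "B2"), ("A2", "A3")]
standardized = assign_person_ids(links)
```

Chained transitions (A1 to A2, then A2 to A3) end up under one standardized identifier even though A1 and A3 never appear together in a single pair.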
CMS Maintains Data from National and
State Sources
• Medicare claims and enrollment
– Updated record of beneficiary enrollment
– Historical archives of claims
• Medicaid claims and enrollment
– Historical archives of quarterly enrollment and claims
data submitted by 50 states+
• Nursing Home MDS
– Assessment records submitted by 50 states+
• Home Health OASIS
– Assessment records submitted by 50 states+
Concurrent and Transitional Overlapping
Populations in CMS Data
• Intersections are extensive and non-random
• 3 major sources, 8 overlap combinations
– In each source, IDs and supporting information can change over time
– Degree of overlap between sources not known in advance
• Errors come from incorrect, missed and incomplete
linkages within and between sources
• If errors are evenly distributed then analytic applications
could still be accurate
• Non-Random overlap distributions: by demographics,
disease, institutional status and disability type lead to
analytic bias
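The count of 8 follows from each person being either present or absent in each of 3 sources (2 x 2 x 2). A short enumeration makes the overlap patterns explicit. The source labels are placeholders, since the slide does not name the three sources.

```python
from itertools import product

# Placeholder labels: the slide does not name the three sources.
SOURCES = ("source_1", "source_2", "source_3")

# Each person is either present (True) or absent (False) in each source.
patterns = list(product((False, True), repeat=len(SOURCES)))

for pattern in patterns:
    present = [name for name, flag in zip(SOURCES, pattern) if flag]
    print(present or ["none"])
```

One of the 8 patterns is "in none of the sources", which cannot be observed in the data itself.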
Examples of Non-Random Overlaps
• 75% of long term Nursing Home residents in Medicare
also in Medicaid
• 70% of under 65 CMI beneficiaries in Medicare also in
Medicaid
• 93% of under 65 MR beneficiaries in Medicare also in
Medicaid
• 41% of Alzheimer’s/Dementia also in Medicaid
• Missed and incomplete linkages concentrated in these
populations have ramifications in program planning and
evaluation analytics
Underlying Causes in Linkage Failures are
Population Correlated Identifier Instability
• Elderly women with dementia
– Frequent associations with spouse’s SSN, HIC can
change upon widowhood and remarriage
• Under 65 MR
– Frequent association with parent’s SSNs
• 1.5% change per year in BOAN part of HIC number – but
concentrated in specific populations
• Cross state Nursing Home/OASIS transitions
• Cross state Medicaid transitions
Internal Completeness and Uniqueness
• Each source has potential for multiple IDs per person
– Medicare HIC changes over time
• Change in affiliation to primary account holder
– MSIS ID can change
• In state ID transitions with case number changes
• Cross state ID transitions
– MDS/OASIS
• In state ID transitions are observed
• Cross state moves will lead to ID transitions
– State assignments of IDs for MSIS, MDS and OASIS
are not coordinated leading to potential for different
people with same ID
Linkage Elements
| Data Source | Key Identifiers | Secondary Identifiers |
|---|---|---|
| Medicare EDB | Health Insurance Code (HIC) | SSN, Date of Birth, Sex, First/Last Name |
| Medicare Claims | HIC | |
| Nursing Home MDS | MDS Resident ID, state | HIC, SSN, Date of Birth, Sex, First/Last Name |
| OASIS | OASIS ID, state | HIC, SSN, Date of Birth, Sex, First/Last Name |
| MSIS Eligibility | MSIS ID, state | HIC, SSN, Date of Birth, Sex |
| MSIS Claims | MSIS ID, state | |
CMS Linkage Methods
• CMS employs deterministic linkages using primary and
secondary identifiers
– Compatible with linkages between overlapping
populations
– Production efficiency greater in formal score based
approach
– Can be coded for specific entry error types, e.g.
transcription
– Setting threshold scores for linkages requires
justification
– Immediate face validity
Adjustment Parameter for CMS
Deterministic Linkages
• Match on a key identifier
• Adjustment weights for match scoring
– DOB
– Sex
– Last Name
– SSN
– HIC
• Select threshold score to accept a match
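The scoring scheme outlined above can be sketched as follows. The slides do not give the actual CMS weights or threshold, so the numbers in `WEIGHTS` and `THRESHOLD`, the field names, and the choice of `resident_id` as the key are all hypothetical. This is a minimal illustration of key-then-score matching, not the CMS implementation.

```python
# Hypothetical weights; the actual CMS adjustment weights are not given here.
WEIGHTS = {"dob": 30, "sex": 10, "last_name": 25, "ssn": 40, "hic": 40}
THRESHOLD = 100  # hypothetical acceptance score


def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Sum the weights of secondary fields that are present and equal."""
    score = 0
    for field, weight in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is not None and b is not None and a == b:
            score += weight
    return score


def is_match(rec_a, rec_b, key="resident_id", threshold=THRESHOLD):
    """Candidates must agree on the key identifier, then clear the threshold."""
    if rec_a.get(key) is None or rec_a.get(key) != rec_b.get(key):
        return False
    return match_score(rec_a, rec_b) >= threshold


# Made-up records sharing a key identifier.
rec_1 = {"resident_id": "R1", "dob": "1930-01-01", "sex": "F",
         "last_name": "SMITH", "ssn": "000000001", "hic": None}
rec_2 = {"resident_id": "R1", "dob": "1930-01-01", "sex": "F",
         "last_name": "SMITH", "ssn": "000000001", "hic": "000000001A"}
rec_3 = {"resident_id": "R1", "dob": "1955-07-04", "sex": "F",
         "last_name": "SMITH", "ssn": "000000002", "hic": None}

accepted = is_match(rec_1, rec_2)  # DOB, sex, last name and SSN agree
rejected = is_match(rec_1, rec_3)  # only sex and last name agree
```

A missing field contributes nothing to the score in this sketch. Whether missing values should instead be penalized is a design choice the slides do not address.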
Underlying Data Quality
Missing Data in Secondary Identifier Fields by Source
| Field | CY 2003 MDS | CY 2003 OASIS | Historical Medicare | CY 2001 MAX |
|---|---|---|---|---|
| Primary Key | Resident IDs* | Resident IDs* | HICs | MSIS IDs* |
| Identifier Count | 9,805,216 | 11,602,668 | 107,933,817 | 47,971,350 |
| SSN | 58,071 (1%) | 419,035 (4%) | 1,444,013 (1%) | 6,071,860 (13%) |
| HIC9** | 1,963,232 (20%) | 1,604,133 (14%) | 6 (0%) | 40,535,349 (85%) |
| First Name | 1,699 (0%) | - (0%) | - (0%) | N/A |
| Last Name | - (0%) | - (0%) | - (0%) | N/A |
| DOB | 776 (0%) | 22,180 (0%) | 5 (0%) | 1,082,354 (2%) |
| Sex | 3,763 (0%) | 21,674 (0%) | 108 (0%) | 1,052,697 (2%) |
*Unique values after state-specific suffix is added
** First 9-digits of the Medicare HIC
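The percentages in the table are the missing counts divided by the identifier count for each source. As a check, the SSN row can be recomputed from the counts reported on this slide. The variable and function names are illustrative only.

```python
# (records missing SSN, identifier count) for each source, from this slide.
ssn_missing = {
    "CY 2003 MDS": (58_071, 9_805_216),
    "CY 2003 OASIS": (419_035, 11_602_668),
    "Historical Medicare": (1_444_013, 107_933_817),
    "CY 2001 MAX": (6_071_860, 47_971_350),
}


def missing_rate(missing, total):
    """Percentage of identifiers with the field missing."""
    return 100.0 * missing / total


rates = {src: missing_rate(m, t) for src, (m, t) in ssn_missing.items()}
for src, rate in rates.items():
    print(f"{src}: {rate:.1f}% missing SSN")
```

Rounded to whole percentages, these reproduce the 1%, 4%, 1% and 13% shown for SSN.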
How Can We Determine Match Quality?
• Quality a function of weights and acceptance
score
• Linkage Error
– Incorrect matches
– Missed matches
• Face validity: at what rate does an arbiter reject
proposed matches?
• Case driven: looking for the exceptions
• Probabilistic Linkage for error identification
Calibration Challenge
• Propose weights for each match field
• Test threshold scores
• Use error measurements to adjust weights
and threshold scores to identify optimal
scoring for minimal error
• Check for bias in error set composition
• Repeat process until acceptable error rate
with minimum bias is achieved
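The threshold part of this loop can be sketched as a simple search over candidate scores. The sketch assumes a set of candidate pairs whose true match status is already known (for example from arbitration). The sample scores, the function names, and the use of combined error as the objective are assumptions for illustration. Weight adjustment and the bias check are not shown.

```python
def error_rates(scored_pairs, threshold):
    """scored_pairs: list of (score, is_true_match). Returns (fp, fn) rates."""
    n = len(scored_pairs)
    fp = sum(1 for s, truth in scored_pairs if s >= threshold and not truth)
    fn = sum(1 for s, truth in scored_pairs if s < threshold and truth)
    return fp / n, fn / n


def calibrate(scored_pairs, candidate_thresholds):
    """Return (threshold, combined error) with the lowest combined error.

    Ties are resolved in favor of the first (lowest) candidate tried.
    """
    best = None
    for t in candidate_thresholds:
        fp, fn = error_rates(scored_pairs, t)
        if best is None or fp + fn < best[1]:
            best = (t, fp + fn)
    return best


# Made-up scored pairs with known truth.
pairs = [(140, True), (135, True), (128, True),
         (122, False), (110, False), (95, False)]
best_threshold, best_error = calibrate(pairs, range(90, 150, 10))
```

In practice the outer loop would also vary the field weights and recompute the scores before each threshold search.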
Canary Tests for Self-Linkage: Test Cases for
Minimizing False Positives
• Twins: Same date of birth and close sequencing in the last
two digits of the SSN.
– Twin Extreme: First names different but rhyming, SSNs off by a
single digit;
• Parent Child: Same first and last name, same HIC account
number, different year of birth with the later year associated
with a Cx BIC.
– Parent/Child Extreme: Name matches and month and day of birth
matches;
• Remarriage: Same first and last name, same HIC account
number, different SSN, different spousal BIC indicating a
remarriage, i.e., an additional dependent.
– Remarriage Extreme: New spouse with similar date of birth or SSN
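Canary cases like these can be kept as a small regression suite: pairs of distinct people that the scorer must never accept. The sketch below uses made-up records and hypothetical weights with exact field comparison. It only shows the shape of such a test and does not reproduce the actual scoring rules.

```python
# Hypothetical weights and threshold, for illustration only.
WEIGHTS = {"first_name": 20, "last_name": 20, "dob": 40, "ssn": 50, "sex": 10}
THRESHOLD = 130


def score(a, b):
    """Sum the weights of fields on which the two records agree exactly."""
    return sum(w for field, w in WEIGHTS.items() if a[field] == b[field])


# Made-up records for two distinct-person scenarios.
twin_a = {"first_name": "JANE", "last_name": "DOE", "dob": "1940-05-01",
          "ssn": "000000011", "sex": "F"}
twin_b = {"first_name": "JOAN", "last_name": "DOE", "dob": "1940-05-01",
          "ssn": "000000012", "sex": "F"}
parent = {"first_name": "JOHN", "last_name": "DOE", "dob": "1935-03-10",
          "ssn": "000000021", "sex": "M"}
child = {"first_name": "JOHN", "last_name": "DOE", "dob": "1960-03-10",
         "ssn": "000000022", "sex": "M"}

canaries = {"twins": (twin_a, twin_b), "parent_child": (parent, child)}

# Any canary that clears the threshold is a false positive to investigate.
false_positives = [name for name, (a, b) in canaries.items()
                   if score(a, b) >= THRESHOLD]
```

The "extreme" variants matter most when the scorer gives partial credit for near matches, such as an SSN off by one digit, because that is when distinct people can approach the threshold.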
Medicare Example: Self-Linkage Score
Profile for Key Based Candidate Matches
[Figure: Medicare EDB Extract Self-Linkage Score Profiles (Threshold=130). Vertical axis ticks run from 0 to 5,000,000; horizontal axis shows score values from roughly 45 to 178.]
Medicare Example: Loss of Valid Matches
with Arbitration Determined Error
• In order to create a simple conservative rule that falls
outside the range of the special cases we recommended
the use of a threshold score of 130
• The concern is the potential loss of valid linkages
in the 100-129 range.
• The total number of links in the 120-130 range represents only
0.03% of all candidate matches
• Inspection by a human arbiter rejected a majority of
these records upon first review. The estimate is that at
most the 130 threshold leads to a 0.006% loss of valid
linkages.
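The two percentages on this slide imply an upper bound on the share of links in the 120-130 band that the arbiter treated as valid. The arithmetic below assumes both figures are expressed against the same base of all candidate matches, which the slide does not state explicitly.

```python
share_in_band = 0.03 / 100     # links scoring 120-130, share of all candidates
max_valid_loss = 0.006 / 100   # stated upper bound on lost valid linkages

# Fraction of the band that could be valid links under the stated bound.
implied_valid_fraction = max_valid_loss / share_in_band
print(f"{implied_valid_fraction:.0%} of links in the band, at most")
```

Under that assumption, the bound corresponds to at most about one in five links in the band being valid, consistent with the arbiter rejecting a majority of them.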
Validation Set: Using Unique Data
Signatures to Establish Linkage Errors
• Concurrencies and logical relationships between data
sources permit the development of uniqueness and
completeness tests
• Number of dimensions in utilization fingerprint leads to
high potential for pattern uniqueness: dates of service,
procedures, diagnoses, service type, payor, provider
state
• Unique patterns over time in a person-level history can
be identified in complementary data sources based
solely on non-ID based matching
– Medicare SNF – MDS
– Medicare HH – OASIS
– Medicaid NH – MDS
– Medicaid Cross-Over Claims – Medicare Claims
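Non-ID matching on utilization patterns can be sketched as follows: build a fingerprint from each person's service events and pair only those fingerprints that occur exactly once in each source. The event layout (admission date, discharge date, provider state), the identifiers and the function names are hypothetical. A real fingerprint would use more of the dimensions listed above.

```python
from collections import defaultdict


def fingerprint(events):
    """A person's utilization pattern as a hashable, order-independent key."""
    return frozenset(events)


def unique_pattern_matches(source_a, source_b):
    """Pair IDs whose fingerprint occurs exactly once in each source.

    source_a, source_b: {person_id: iterable of event tuples}.
    Returns {id_in_a: id_in_b}; no identifier fields are compared.
    """
    def index(source):
        by_pattern = defaultdict(list)
        for person_id, events in source.items():
            by_pattern[fingerprint(events)].append(person_id)
        return by_pattern

    idx_a, idx_b = index(source_a), index(source_b)
    return {
        idx_a[p][0]: idx_b[p][0]
        for p in idx_a
        if len(idx_a[p]) == 1 and len(idx_b.get(p, [])) == 1
    }


# Made-up stays: (admission date, discharge date, provider state).
medicare_snf = {
    "HIC1": [("2003-01-05", "2003-01-20", "MA")],
    "HIC2": [("2003-02-01", "2003-02-10", "NY")],
    "HIC3": [("2003-03-01", "2003-03-09", "CT")],
}
mds = {
    "MA-R9": [("2003-01-05", "2003-01-20", "MA")],
    "NY-R4": [("2003-02-01", "2003-02-10", "NY")],
    "NY-R5": [("2003-02-01", "2003-02-10", "NY")],
}
validation_pairs = unique_pattern_matches(medicare_snf, mds)
```

Only HIC1 is paired here: HIC2's pattern is ambiguous in the second source and HIC3's pattern has no counterpart, so both are left out of the validation set rather than guessed.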
Validation Set Application
• Validation Set does not yield comprehensive non ID
linkages but, arguably, produces a representative
sample: i.e. error in ID and demographics recording not
related to service utilization patterns
• The disposition of IDs of people in the VS in the
deterministic linkage results tells us something about the
performance of the linkage algorithm
• VS matches as based on unique data patterns permit the
user to identify both False Positive and False Negative
linkages
• Optimization of match weights and acceptance
thresholds leads to lowest achievable level of linkage
error
Linkage Selection Threshold Testing
Example
Table 3b: OASIS - Medicare 2003 Linkage Error Rates
at Different Confidence Threshold Levels (n=45,649)
| Threshold Score | False Negatives | False Positives | Correct |
|---|---|---|---|
| 20 | 0.62% | 1.12% | 98.27% |
| 46 | 0.67% | 0.58% | 98.76% |
| 60 | 0.86% | 0.53% | 98.61% |
| 70 | 1.80% | 0.49% | 97.71% |
| 75 | 2.06% | 0.49% | 97.45% |
| 76 | 2.20% | 0.49% | 97.31% |
| 77 | 23.14% | 0.36% | 76.50% |
| 80 | 24.84% | 0.35% | 74.81% |
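The optimal threshold and the point where the False Negative rate jumps can be read directly from the Table 3b values. The short script below does so. The variable names are illustrative.

```python
# (threshold, false negative %, false positive %, correct %) from Table 3b.
TABLE_3B = [
    (20, 0.62, 1.12, 98.27),
    (46, 0.67, 0.58, 98.76),
    (60, 0.86, 0.53, 98.61),
    (70, 1.80, 0.49, 97.71),
    (75, 2.06, 0.49, 97.45),
    (76, 2.20, 0.49, 97.31),
    (77, 23.14, 0.36, 76.50),
    (80, 24.84, 0.35, 74.81),
]

# Threshold with the highest 'correct' rate.
best = max(TABLE_3B, key=lambda row: row[3])

# Threshold with the lowest combined error (false negatives + false positives).
lowest_total_error = min(TABLE_3B, key=lambda row: row[1] + row[2])

# Largest increase in the false negative rate between adjacent thresholds.
fn_jumps = [(b[0], b[1] - a[1]) for a, b in zip(TABLE_3B, TABLE_3B[1:])]
biggest_jump = max(fn_jumps, key=lambda jump: jump[1])
```

Both criteria select a threshold of 46, and the largest False Negative jump occurs when moving from a threshold of 76 to 77.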
What Does the VS Test Tell Us?
• Deterministic linkages are relatively insensitive to False
Positive linkages. Requiring a high degree of overall
matching in key fields leads to over-identification rates of
0.35% to 1.12% of all linked pairs in OASIS-Medicare
EDB
• Deterministic linkages are very sensitive to False
Negative identifications with large potential for error if
match acceptance score set too high (too stringent)
• Error rates are a discontinuous function of threshold
score. Optimized selection for the CMS weights occurs at
score 46, minimizing combined False Positive and False
Negative error and leading to the highest ‘correct’ rate
VS Use for Reducing Self-Linkage Error
• Strategy 1: use intuitive cases to tune deterministic
algorithm
• Strategy 2: use VS final results to determine
effectiveness of strategy 1
• VS results are a composite of self-linkage anomalies and
cross-source match errors
• Error source is distinguishable and rates can be reduced
by fine-tuning by error type
• Biases in match error can be tested by reviewing
demographics for False Positives and False Negatives
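The bias review in the last bullet amounts to comparing error rates across demographic groups. A minimal sketch, assuming each validation-set record has been labeled with an outcome of `correct`, `fp` or `fn` (a hypothetical encoding) and carries a demographic field:

```python
from collections import Counter


def error_rate_by_group(records, group_field):
    """Share of records in each group whose outcome is not 'correct'.

    records: dicts with a group field and an 'outcome' of
    'correct', 'fp' or 'fn' (hypothetical encoding).
    """
    totals, errors = Counter(), Counter()
    for rec in records:
        group = rec[group_field]
        totals[group] += 1
        if rec["outcome"] != "correct":
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Made-up validation-set outcomes.
outcomes = [
    {"sex": "F", "outcome": "correct"},
    {"sex": "F", "outcome": "fn"},
    {"sex": "F", "outcome": "correct"},
    {"sex": "F", "outcome": "correct"},
    {"sex": "M", "outcome": "correct"},
    {"sex": "M", "outcome": "correct"},
    {"sex": "M", "outcome": "correct"},
    {"sex": "M", "outcome": "correct"},
]
rates_by_sex = error_rate_by_group(outcomes, "sex")
```

Large differences between groups would suggest that linkage error is concentrated in particular populations rather than evenly distributed.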
Final Error Rates After Pre-processing and
Further Internal Linkages Cleaning
(Method Threshold Score=60)
Table 14: Linkage Error Rates (n=13,141) MDS VS to MCR 5%
| Linkage | False Negatives | False Positives | Correct |
|---|---|---|---|
| MDS-Medicare (N=13,141) | 0.64% | 0.00% | 99.36% |

Table 15: Linkage Error Rates (n=45,636) OASIS VS to MCR 5%
| Linkage | False Negatives | False Positives | Correct |
|---|---|---|---|
| OASIS-Medicare (N=45,636) | 0.58% | 0.07% | 99.35% |
Conclusion
• Algorithms that produce matches that have face
validity will also yield errors at an undetermined
rate, both bad links and missing links
• Probabilistic based linkages are difficult to use
operationally but are essential for tuning
deterministic methods and minimizing error
• Utilization based VS testing – Probabilistic
Linkage - is viable for error testing purposes and
leads to optimization of operational matching
algorithms