Identifiers Linkages and Match Error Estimation in CMS Data
June 2009
JEN Associates - Cambridge, MA

Building Blocks of Integrated Data Models
• Integrated Data Goals: follow people over time with interwoven data from current and historical payment from all program sources
• Two ‘Simple’ Steps
  – Gather and link all identifiers associated with a single person within a source over time and assign a standardized identifier
  – Link person level records between sources and assign a unique identifier
• Generate person level analytic records with data from different sources over time

CMS Maintains Data from National and State Sources
• Medicare claims and enrollment
  – Updated record of beneficiary enrollment
  – Historical archives of claims
• Medicaid claims and enrollment
  – Historical archives of quarterly enrollment and claims data submitted by 50 states+
• Nursing Home MDS
  – Assessment records submitted by 50 states+
• Home Health OASIS
  – Assessment records submitted by 50 states+

Concurrent and Transitional Overlapping Populations in CMS Data
• Intersections are extensive and non-random
• 3 major sources, 8 permutations
  – In each source IDs can change over time, as can supporting information
  – Degree of overlap between sources not known in advance
• Errors come from incorrect linkages, missed linkages and incomplete linkages within and between sources
• If errors are evenly distributed then analytic applications could still be accurate
• Non-random overlap distributions (by demographics, disease, institutional status and disability type) lead to analytic bias

Examples of Non-Random Overlaps
• 75% of long term Nursing Home residents in Medicare also in Medicaid
• 70% of under 65 CMI beneficiaries in Medicare also in Medicaid
• 93% of under 65 MR beneficiaries in Medicare also in Medicaid
• 41% of Alzheimer’s/Dementia also in Medicaid
• Missed and incomplete linkages
  concentrated in these populations have ramifications in program planning and evaluation analytics

Underlying Causes in Linkage Failures are Population Correlated Identifier Instability
• Elderly women with dementia
  – Frequent associations with spouse’s SSN; HIC can change upon widowhood and remarriage
• Under 65 MR
  – Frequent association with parent’s SSNs
• 1.5% change per year in BOAN part of HIC number, but concentrated in specific populations
• Cross state Nursing Home/OASIS transitions
• Cross state Medicaid transitions

Internal Completeness and Uniqueness
• Each source has potential for multiple IDs per person
  – Medicare HIC changes over time
    • Change in affiliation to primary account holder
  – MSIS ID can change
    • In state ID transitions with case number changes
    • Cross state ID transitions
  – MDS/OASIS
    • In state ID transitions are observed
    • Cross state moves will lead to ID transitions
  – State assignments of IDs for MSIS, MDS and OASIS are not coordinated, leading to potential for different people with the same ID

Linkage Elements
Data Source        Key Identifiers               Secondary Identifiers
Medicare EDB       Health Insurance Code (HIC)   SSN, Date of Birth, Sex, First/Last Name
Medicare Claims    HIC
Nursing Home MDS   MDS Resident ID, state        HIC, SSN, Date of Birth, Sex, First/Last Name
OASIS              OASIS ID, state               HIC, SSN, Date of Birth, Sex, First/Last Name
MSIS Eligibility   MSIS ID, state                HIC, SSN, Date of Birth, Sex
MSIS Claims        MSIS ID, state

CMS Linkage Methods
• CMS employs deterministic linkages using primary and secondary identifiers
  – Compatible with linkages between overlapping populations
  – Production efficiency greater in formal score based approach
  – Can be coded for specific entry error types, e.g.
    transcription
  – Setting threshold scores for linkages requires justification
  – Immediate face validity

Adjustment Parameter for CMS Deterministic Linkages
• Match on a key identifier
• Adjustment weights for match scoring
  – DOB
  – Sex
  – Last Name
  – SSN
  – HIC
• Select threshold score to accept a match

Underlying Data Quality
Missing Data in Secondary Identifier Fields by Source
Field                         CY 2003 MDS        CY 2003 OASIS      Historical Medicare   CY 2001 MAX
Primary Key Identifier        Resident IDs*      Resident IDs*      HICs                  MSIS IDs*
Primary Key Identifier Count  9,805,216          11,602,668         107,933,817           47,971,350
SSN                           58,071 (1%)        419,035 (4%)       1,444,013 (1%)        6,071,860 (13%)
HIC9**                        1,963,232 (20%)    1,604,133 (14%)    6 (0%)                40,535,349 (85%)
First Name                    1,699 (0%)         - (0%)             - (0%)                N/A
Last Name                     - (0%)             - (0%)             - (0%)                N/A
DOB                           776 (0%)           22,180 (0%)        5 (0%)                1,082,354 (2%)
Sex                           3,763 (0%)         21,674 (0%)        108 (0%)              1,052,697 (2%)
* Unique values after state-specific suffix is added
** First 9 digits of the Medicare HIC

How Can We Determine Match Quality?
• Quality is a function of weights and acceptance score
• Linkage Error
  – Incorrect matches
  – Missed matches
• Face validity: at what rate does an arbiter reject proposed matches?
• Case driven: looking for the exceptions
• Probabilistic Linkage for error identification

Calibration Challenge
• Propose weights for each match field
• Test threshold scores
• Use error measurements to adjust weights and threshold scores to identify optimal scoring for minimal error
• Check for bias in error set composition
• Repeat process until acceptable error rate with minimum bias is achieved

Canary Tests for Self-Linkage: Minimizing False Positives
Test Cases
• Twins: Same date of birth and close sequencing in the last two digits of the SSN.
  – Twin Extreme: First names different but rhyming, SSNs off by a single digit
• Parent Child: Same first and last name, same HIC account number, different year of birth with the later year associated with a Cx BIC.
  – Parent/Child Extreme: Name matches and month and day of birth match
• Remarriage: Same first and last name, same HIC account number, different SSN, different spousal BIC indicating a remarriage, i.e. an additional dependent.
  – Remarriage Extreme: New spouse with similar date of birth or SSN

Medicare Example: Self-Linkage Score Profile for Key Based Candidate Matches
[Figure: Medicare EDB Extract Self-Linkage Score Profiles (Threshold=130). Vertical axis ticks run from 0 to 5,000,000; horizontal axis score ticks run from 0 to 178.]

Medicare Example: Loss of Valid Matches with Arbitration Determined Error
• To create a simple conservative rule that falls outside the range of the special cases, we recommended a threshold score of 130
• The concern is the potential loss of valid linkages in the 100-129 range.
• Links scoring in the 120-130 range represent only 0.03% of all candidate matches
• A human arbiter rejected a majority of these records upon first review. The estimate is that the 130 threshold leads to at most a 0.006% loss of valid linkages.
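
The score-and-threshold rule described above can be sketched as follows. This is a minimal illustration, not the CMS algorithm: the slides name the weighted fields (DOB, Sex, Last Name, SSN, HIC) and the 130 threshold but give no weight values, so the `WEIGHTS` numbers, the `Person` record and the sample identifiers are all hypothetical. Candidate pairs are assumed to have already been selected on a key identifier.

```python
from dataclasses import dataclass

# Hypothetical adjustment weights: the slides list the fields, not the values.
WEIGHTS = {"hic": 50, "ssn": 50, "last_name": 30, "dob": 30, "sex": 10}
THRESHOLD = 130  # acceptance score recommended in the slides


@dataclass(frozen=True)
class Person:
    """Illustrative record layout (an assumption, not a CMS file format)."""
    hic: str
    ssn: str
    last_name: str
    dob: str  # ISO date string
    sex: str


def match_score(a: Person, b: Person) -> int:
    """Sum the weights of the fields on which two records agree.

    Blank values never count as agreement.
    """
    score = 0
    for field, weight in WEIGHTS.items():
        left, right = getattr(a, field), getattr(b, field)
        if left and left == right:
            score += weight
    return score


def accept(a: Person, b: Person, threshold: int = THRESHOLD) -> bool:
    """Accept a candidate pair when its score reaches the threshold."""
    return match_score(a, b) >= threshold


# Same person with a transcription error in the last name.
original = Person("123456789A", "111223333", "SMITH", "1931-04-12", "F")
retyped = Person("123456789A", "111223333", "SMYTH", "1931-04-12", "F")

# Parent/child canary: shared account number and name, different SSN and DOB.
child = Person("123456789A", "444556666", "SMITH", "1958-04-12", "F")

print(match_score(original, retyped), accept(original, retyped))
print(match_score(original, child), accept(original, child))
```

With these sample weights the retyped record clears the threshold and the parent/child pair does not, which is the behaviour the canary tests are meant to check. Different weights would move both scores, so the example shows the mechanism only.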

Validation Set: Using Unique Data Signatures to Establish Linkage Errors
• Concurrencies and logical relationships between data sources permit the development of uniqueness and completeness tests
• The number of dimensions in a utilization fingerprint leads to high potential for pattern uniqueness: dates of service, procedures, diagnoses, service type, payor, provider state
• Unique patterns over time in a person-level history can be identified in complementary data sources based solely on non-ID based matching
  – Medicare SNF – MDS
  – Medicare HH – OASIS
  – Medicaid NH – MDS
  – Medicaid Cross-Over Claims – Medicare Claims

Validation Set Application
• The Validation Set (VS) does not yield comprehensive non-ID linkages but, arguably, produces a representative sample, i.e. error in ID and demographics recording is not related to service utilization patterns
• The disposition of IDs of people in the VS in the deterministic linkage results tells us something about the performance of the linkage algorithm
• VS matches, being based on unique data patterns, permit the user to identify both False Positive and False Negative linkages
• Optimization of match weights and acceptance thresholds leads to the lowest achievable level of linkage error

Linkage Selection Threshold Testing Example
Table 3b: OASIS - Medicare 2003 Linkage Error Rates at Different Confidence Threshold Levels (n=45,649)
Threshold Score    20      46      60      70      75      76      77      80
False Negatives    0.62%   0.67%   0.86%   1.80%   2.06%   2.20%   23.14%  24.84%
False Positives    1.12%   0.58%   0.53%   0.49%   0.49%   0.49%   0.36%   0.35%
Correct            98.27%  98.76%  98.61%  97.71%  97.45%  97.31%  76.50%  74.81%

What Does the VS Test Tell Us
• Deterministic linkages are relatively insensitive to False Positive linkages.
  Requiring a high degree of overall matching in key fields leads to over-identification rates of 0.35% to 1.12% of all linked pairs in OASIS-Medicare EDB, depending on the threshold
• Deterministic linkages are very sensitive to False Negative identifications, with large potential for error if the match acceptance score is set too high (too stringent)
• Error rates are a discontinuous function of threshold score. Optimized selection for the CMS weights occurs at score 46, which minimizes the combined False Positive and False Negative error and gives the highest ‘correct’ rate

VS Use for Reducing Self-Linkage Error
• Strategy 1: use intuitive cases to tune the deterministic algorithm
• Strategy 2: use VS final results to determine the effectiveness of strategy 1
• VS results are a composite of self-linkage anomalies and cross-source match errors
• Error source is distinguishable and rates can be reduced by fine-tuning by error type
• Biases in match error can be tested by reviewing demographics for False Positives and False Negatives

Final Error Rates After Pre-processing and Further Internal Linkages Cleaning (Method Threshold Score=60)
Table 14: Linkage Error Rates (n=13,141), MDS VS to MCR 5%
MDS-Medicare (N=13,141)
False Negatives    0.64%
False Positives    0.00%
Correct            99.36%

Table 15: Linkage Error Rates (n=45,636), OASIS VS to MCR 5%
OASIS-Medicare (N=45,636)
False Negatives    0.58%
False Positives    0.07%
Correct            99.35%

Conclusion
• Algorithms that produce matches that have face validity will also yield errors at an undetermined rate, both bad links and missing links
• Probabilistic based linkages are difficult to use operationally but are essential for tuning deterministic methods and minimizing error
• Utilization based VS testing (Probabilistic Linkage) is viable for error testing purposes and leads to optimization of operational matching algorithms
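
The threshold test behind the error-rate tables can be sketched in a few lines. This is a minimal illustration under stated assumptions: each Validation Set pair is assumed to carry a deterministic match score and a truth flag taken from the utilization-based match, rates are computed over all VS pairs so that the three rates sum to 100% (as they do in the tables), and the function names and sample data are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple

ScoredPair = Tuple[int, bool]  # (deterministic score, VS says same person)


class Rates(NamedTuple):
    false_negative: float
    false_positive: float
    correct: float


def error_rates(pairs: Iterable[ScoredPair], threshold: int) -> Rates:
    """Classify every VS pair at one threshold and return rates over all pairs."""
    items: List[ScoredPair] = list(pairs)
    n = len(items)
    if n == 0:
        raise ValueError("validation set is empty")
    # Missed link: the VS says match, but the score falls below the threshold.
    fn = sum(1 for score, truth in items if truth and score < threshold)
    # Bad link: the score reaches the threshold, but the VS says no match.
    fp = sum(1 for score, truth in items if not truth and score >= threshold)
    return Rates(fn / n, fp / n, (n - fn - fp) / n)


def best_threshold(pairs: Iterable[ScoredPair], candidates: Iterable[int]) -> int:
    """Return the candidate threshold with the highest correct rate."""
    items = list(pairs)
    return max(candidates, key=lambda t: error_rates(items, t).correct)


# Made-up validation pairs, for illustration only.
sample = [
    (150, True), (90, True), (75, True), (50, True),
    (70, False), (30, False), (10, False),
]

for t in (20, 46, 60, 80):
    print(t, error_rates(sample, t))
print("best:", best_threshold(sample, (20, 46, 60, 80)))
```

Selecting on the correct rate alone mirrors the slides' choice of the score with the highest ‘correct’ share. A real calibration would also check the demographic composition of the error sets for bias, as the deck recommends.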