Preparing Measures for High Stakes Use: Beyond Basic Psychometric Testing
___________________________________________________________________________
Dana Gelb Safran, Sc.D.
Senior Vice President, Performance Measurement & Improvement
Blue Cross Blue Shield of Massachusetts

Presented at: Composite Measures in Health Care
Annual Research Meeting of AcademyHealth
Chicago, IL, 28 June 2009

Topics for Today
• Preparing measures for high stakes use: the case of survey-based measures
• Basic psychometric testing of survey-based measures
• Beyond basic psychometric testing:
  – Reliability and sample size requirements
  – Risk of misclassification
• A framework and principles for evaluating measures' readiness for "high stakes" use

Measuring Patient Experiences with Individual Physicians and Their Practice Sites
• Survey-based measurement of patients' experiences with individual physicians is not new
• What is new: efforts to standardize these measures and to use them for public reporting and pay-for-performance
• The IOM report Crossing the Quality Chasm gave "patient-centered care" a front-row seat
• Methods and metrics had been honed over 15 years of research
• But putting these measures to use raised many questions about feasibility, value, and "readiness for prime time"

Principal Questions of Statewide Pilots
• What sample size is needed for a highly reliable estimate of patients' experiences with a physician?
• What is the risk of misclassification under varying reporting frameworks?
• Is there enough performance variability to justify measurement?
• How much of the measurement variance is accounted for by physicians, as opposed to other elements of the system (practice site, network organization, plan)?

Sampling Framework: Massachusetts
Region | Eastern MA | Central MA | Western MA
Plans | Tufts, BCBSMA, HPHC, Medicaid | BCBSMA, Fallon, Medicaid | BCBSMA, HNE, Medicaid
Sites | 34 | 23 | 10
Physicians | 143 | 35 | 37
Six physician network organizations (PNO1–PNO6) spanned the three regions. Both commercially insured and Medicaid patients were sampled in some regions; in the others, only commercially insured patients were sampled.

Basic Psychometric Assessment (Adult PCP; N_GRP = 183; N_PT = 11,615)
Scale | Mean (SD) | Cronbach's alpha | Range of item-scale correlations | Scaling success (%)
Quality of MD-Patient Interactions (k=6) | 89.4 (18.0) | 0.95 | 0.87–0.93 | 100
Health Promotion (k=2) | 67.0 (38.9) | 0.84 | 0.93–0.93 | 100
Access (k=5) | 74.0 (22.4) | 0.85 | 0.73–0.83 | 100
Coordination of Care (k=2) | 78.5 (28.3) | 0.63 | 0.84–0.92 | 83.3
Office Staff (k=2) | 83.4 (21.9) | 0.89 | 0.95–0.96 | 100

Basic Psychometric Assessment: Item-Scale Correlations (N = 11,615; asterisks mark each item's hypothesized scale)
Item | Quality of MD-Pt Interaction | Access | Coordination of Care
Drexpln | 0.857* | 0.493 | 0.518
Drlistn | 0.893* | 0.485 | 0.541
Drinst1 | 0.850* | 0.489 | 0.531
Comphist | 0.801* | 0.504 | 0.571
Interp1 | 0.834* | 0.536 | 0.558
Interp5 | 0.860* | 0.456 | 0.519
Incarsn | 0.447 | 0.686* | 0.418
Rgaptsn | 0.461 | 0.683* | 0.405
Wait15 | 0.378 | 0.485* | 0.354
Calbck1 | 0.500 | 0.652* | 0.505
Aftrhr3 | 0.548 | 0.807* | 0.543
Integ7 | 0.728 | 0.516 | 0.491*

Basic Psychometric Assessment (2) (Adult PCP; N_GRP = 183; N_PT = 11,615)
Scale | % Floor | % Ceiling | % Missing | SD of group effect
Quality of MD-Patient Interactions | 0.3 | 48.7 | 1.9 | 3.2
Health Promotion | 10.7 | 41.4 | 2.4 | 4.0
Access | 0.4 | 12.6 | 7.8 | 5.7
Coordination of Care | 4.1 | 46.3 | 5.9 | 5.9
Office Staff | 0.8 | 46.0 | 2.3 | 4.1

Beyond Basic Psychometric Assessment (Adult PCP; N_GRP = 183; N_PT = 11,615)
Scale | SD of group effect | Observed group-level reliability | Estimated group-level reliability (n=200) | Minimum N for 0.70 group-level reliability
Quality of MD-Patient Interactions | 3.2 | 0.68 | 0.87 | 70
Health Promotion | 4.0 | 0.47 | 0.74 | 163
Access | 5.7 | 0.80 | 0.93 | 35
Coordination of Care | 5.9 | 0.73 | 0.90 | 53
Office Staff | 4.1 | 0.69 | 0.88 | 64
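The reliability projections and minimum-N figures in the table above follow standard Spearman–Brown logic: with between-physician ("group effect") variance τ² and patient-level variance σ², the reliability of a physician's mean score based on n responses is τ²/(τ² + σ²/n). A minimal sketch in Python, using the group-effect SD of 3.2 from the Quality of MD-Patient Interactions row and a hypothetical patient-level SD of 18 points (an assumption for illustration, not a figure from the slides):

```python
import math

def group_reliability(tau2, sigma2, n):
    """Reliability of a group (physician) mean based on n responses:
    between-group variance over total variance of the observed mean."""
    return tau2 / (tau2 + sigma2 / n)

def min_n_for_target(tau2, sigma2, target=0.70):
    """Smallest n per physician that achieves the target reliability."""
    return math.ceil((target / (1.0 - target)) * (sigma2 / tau2))

# Quality of MD-Patient Interactions: group-effect SD 3.2 (from the table);
# the patient-level SD of 18 is a hypothetical value for illustration.
tau2, sigma2 = 3.2 ** 2, 18.0 ** 2

print(group_reliability(tau2, sigma2, 200))   # ~0.86, close to the 0.87 above
print(min_n_for_target(tau2, sigma2, 0.70))   # 74, close to the 70 above
```

Note how steeply the required n grows with the target: moving the target from 0.70 to 0.95 multiplies the required n by (0.95/0.05)/(0.70/0.30), about 8, which is why the 0.95 column of the sample-size table later in this deck runs into the hundreds.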
Beyond Basic Psychometric Assessment (2): Access Composite and Its Items
Measure | Mean (SD) | % Ceiling | % Missing | SD of group effect | Observed group-level reliability | Estimated group-level reliability (n=200) | Minimum N for 0.70 group-level reliability
Access (composite) | 74.0 (22.4) | 12.6 | 7.8 | 5.7 | 0.80 | 0.93 | 35
Scheduling: illness or injury | 78.8 (26.6) | 47.9 | 11.5 | 6.6 | 0.79 | 0.93 | 36
Scheduling: check-up or routine | 82.9 (23.7) | 55.2 | 5.8 | 5.1 | 0.74 | 0.91 | 48
In-office wait | 61.6 (31.5) | 22.7 | 1.9 | 7.7 | 0.80 | 0.93 | 37
Call back: during office hours | 73.7 (29.5) | 41.0 | 22.0 | 6.3 | 0.70 | 0.91 | 79
Call back: after hours | 72.3 (33.7) | 46.5 | 67.1 | 6.6 | 0.45 | 0.89 | 60

Variance Components Models...
The influence of MDs and groups on scores follows a variance components model:

$$Y_{\text{patient}} = \mu + \alpha_{\text{group}} + \gamma_{\text{doctor}} + \beta X_{\text{patient}} + \varepsilon_{\text{patient}}$$

where $\alpha_{\text{group}} \sim (0, \sigma^2_{\text{group}})$, $\gamma_{\text{doctor}} \sim (0, \sigma^2_{\text{doctor}})$, and $\varepsilon_{\text{patient}} \sim (0, \sigma^2_{\text{patient}})$.

...Support the Reliability Calculation
Reliability is the percent of variance explained:

$$\rho_{\text{group}} = \frac{\tau^2_{\text{group}}}{\tau^2_{\text{group}} + \sigma^2_{\text{group}}}, \qquad \sigma^2_{\text{group}} = \frac{\sigma^2_{\text{doctor}}}{N_{\text{doctors in group}}} + \frac{\sigma^2_{\text{patients}}}{N_{\text{patients per group}}}$$

Sample Size Requirements for Varying Physician-Level Reliability Thresholds
Number of responses per physician needed to achieve the desired MD-level measurement reliability:

Measure | Reliability 0.7 | Reliability 0.8 | Reliability 0.95
ORGANIZATIONAL/STRUCTURAL FEATURES OF CARE
Organizational access | 23 | 39 | 185
Visit-based continuity | 13 | 22 | 103
Integration | 39 | 66 | 315
DOCTOR-PATIENT INTERACTIONS
Communication | 43 | 73 | 347
Whole-person orientation | 21 | 37 | 174
Health promotion | 45 | 77 | 366
Interpersonal treatment | 41 | 71 | 337
Patient trust | 36 | 61 | 290
Source: Safran et al. JGIM 2006; 21(1):13–21

What is the Risk of Misclassification?
• It is not simply 1 − αMD
• It depends on:
  – Measurement reliability (αMD)
  – The number of cutpoints in the reporting framework
  – The proximity of the score to a cutpoint

Probability of Misclassification
• Label a cut point c1
• Let Xd be the true score for a doctor, and let X̄ be the average for his/her n patients
• Compute P(X̄ < c1 | Xd), treating X̄ as approximately normal
• Integrate over Xd in the adjacent category (c1 < Xd < c2)
• Sum the results over all cut points

[Figure: Probability of misclassification across the score range for a five-category reporting framework (Substantially Below Average, Below Average, Average, Above Average, Substantially Above Average), with category boundaries marked at the 10th, 25th, 50th, 75th, and 90th percentiles (scores 52.9, 58.5, 64.6, 70.8, 76.3). At every cutpoint the risk is 50% regardless of measure reliability (αMD = 0.5–0.9); between cutpoints it falls, and falls faster at higher reliability.]

[Figure: The same risk profile for a three-category framework (Substantially Below Average, Average, Substantially Above Average), with cutpoints at the 10th and 90th percentiles (scores 52.9 and 76.3; 50th percentile = 64.6). Between cutpoints the risk is near zero (at most about 2.4% at αMD = 0.7, lower at 0.8–0.9), but it is still 50% at the cutpoints themselves.]

Buffers Around the Cutpoint Alter the Risk
• When Xd = c1, the conditional risk is 50%
• Buffer idea: report a provider in the higher group if X̄ > c*, where c* < c1; the difference c1 − c* is the "buffer"
• Buffers reduce providers' risk of downward misclassification, but they are lenient
• Hypothesis tests (confidence intervals) accomplish a similar purpose
• But then the same score can land in different groups, since the interval's width depends on each provider's sample size!
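A minimal Python sketch of the misclassification calculation above. The standard error is backed out from reliability as SE = τ·√((1 − αMD)/αMD), assuming a group-effect SD (τ) of 5 points, a hypothetical value in line with the tables above; the cutpoint, scores, and 2-point buffer are likewise illustrative:

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_misclassified_below(true_score, cut, se):
    """P(observed mean X-bar falls below the cut | true score at or above it),
    using the normal approximation for the mean of n patient ratings."""
    return norm_cdf((cut - true_score) / se)

TAU = 5.0   # hypothetical SD of the true (group-effect) score distribution
CUT = 65.0  # illustrative cutpoint

for alpha_md in (0.7, 0.8, 0.9):
    se = TAU * math.sqrt((1.0 - alpha_md) / alpha_md)
    on_cut   = p_misclassified_below(CUT, CUT, se)        # true score on the cut
    plus5    = p_misclassified_below(CUT + 5.0, CUT, se)  # 5 points above it
    buffered = p_misclassified_below(CUT, CUT - 2.0, se)  # 2-point buffer
    print(f"alpha={alpha_md}: on cut {on_cut:.0%}, "
          f"5 pts above {plus5:.1%}, on cut with buffer {buffered:.1%}")
```

At the cutpoint the risk is 50% no matter how reliable the measure; five points away it drops to roughly 6.3%, 2.3%, and 0.1% at αMD = 0.7, 0.8, and 0.9, which matches the distance table that follows. The buffer trades some of that risk for leniency.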
Risk of Misclassification at Varying Distances from the Benchmark and Varying Measurement Reliability (αMD)
Probability of misclassification (%):

MD mean score distance from benchmark (points) | αMD = .70 | αMD = .80 | αMD = .90
1 | 38.0 | 34.5 | 27.4
2 | 27.1 | 21.2 | 11.5
3 | 18.0 | 11.5 | 3.6
4 | 11.1 | 5.5 | 0.8
5 | 6.3 | 2.3 | 0.1
6 | 3.3 | 0.8 | <0.001
7 | 1.6 | 0.3 | <0.001
8 | 0.7 | <0.001 | <0.001
9 | 0.3 | <0.001 | <0.001
10 | 0.1 | <0.001 | <0.001

Certainty and Uncertainty in Classification: Comparison with a Single Benchmark
[Figure: Score scale from 0 to 100 with a single benchmark at the 50th percentile (score 65). The "area of uncertainty" around the benchmark (the zone within which a score can be called neither significantly below nor significantly above) is 6.3 points at αMD = 0.7, 4.9 points at αMD = 0.8, and 3.26 points at αMD = 0.9.]

Certainty and Uncertainty in Classification: Cutpoints at the 10th and 90th Percentiles
[Figure: Score scale from 0 to 100 divided into Bottom, Middle, and Top tiers by cutpoints at the 10th percentile (score 53) and the 90th percentile (score 76). The same areas of uncertainty, 6.3, 4.9, and 3.26 points at αMD = 0.7, 0.8, and 0.9, surround each cutpoint.]

Risk of Misclassification: Summary
• Even "highly reliable" scores, with everything done correctly, can translate to a high risk of misclassification if reported or used in ways that over-differentiate
• The risk is not simply 1 − αMD
• It depends on measurement reliability (αMD), the proximity of the score to a cutpoint, and the number of cutpoints in the reporting framework
Source: Safran et al. JGIM 2006; 21(1):13–21

Allocation of Explainable Variance: Doctor-Patient Interactions
[Figure: Stacked bar chart (0–100%) for the doctor-patient interaction measures. The individual doctor accounts for 62–84% of the explainable variance (62, 74, 77, 70, and 84 across the five measures), with practice site, network, and plan together accounting for the remaining 16–38%.]
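What "allocation of explainable variance" means operationally: after fitting the variance components model described earlier, the patient-level residual is set aside and each system level's component is expressed as a share of the remaining (system-level) variance. A minimal sketch with hypothetical component estimates, not figures from the study:

```python
# Hypothetical variance component estimates for one measure, e.g. from a
# nested random-effects model (doctor within site within network within plan).
components = {"doctor": 7.8, "site": 2.1, "network": 1.0, "plan": 0.5}

# The patient-level residual is excluded: shares are taken over the
# system-level ("explainable") variance only.
explainable = sum(components.values())
for level, var in components.items():
    print(f"{level:>8}: {100.0 * var / explainable:5.1f}% of explainable variance")
```

With these illustrative numbers the doctor's share is about 68%, in the range the chart above reports for doctor-patient interaction measures. The guiding principles that follow make exactly this quantity, the share attributable to the accountable level, a prerequisite for high stakes use.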
Guiding Principles in Selecting "High Stakes" Measures
• Measures will be focused on safety, effectiveness, patient-centeredness, and affordability
• Wherever possible, measures should be drawn from nationally accepted standard measure sets
• The measure must reflect something that is broadly accepted as clinically important
• There must be empirical evidence that the measure provides stable, reliable information at the level at which it will be reported (individual, site, group, or institution) with the available sample sizes and data sources
• There must be sufficient variability on the measure across providers (at the level at which data will be reported) to merit attention
• There must be empirical evidence that the level of the system that will be held accountable (clinician, site, group, institution) accounts for a large portion of the system-level variance in the measure
• Providers should be exposed to information about the development and validation of the measures, and given the opportunity to view their own performance, ideally for one measurement cycle, before the data are used for "high stakes" purposes

Staged Development & Use of Performance Measures
• Phase I (Time 0): Development and testing
• Phase II (Time 1): Initial large-scale implementation, including initial measure implementation, final measure validation and testing, stakeholder buy-in, and an initial QI cycle
• Phase III: Implementing measures for "high stakes" purposes: public reporting, P4P, tiering

Summary
• Without measurement, we don't know where we are on the path
• But imprecise measurement used in "high stakes" ways undermines improvement efforts
• Getting to "high stakes" measurement with reliable, valid indicators does not have to take long
• Ascertaining the sample sizes required for stable, reliable measurement is a key step
• With a 3-level performance framework, the risk of misclassification is low, except at performance cutpoints, where the risk is high irrespective of measurement reliability
• Disciplined measure development and testing, together with strong guiding principles for selecting high stakes measures, allow performance measures to be used for accountability and improvement

For More Information:
___________________________________________________________________________
dana.safran@bcbsma.com

Pooling Multiple Datasets to Evaluate the Potential for Absolute Performance Targets
[Figure: Multiple physician-level datasets (HVMA, MHQP, PBGH, ABIM, CMS, UMA, and UMISS, spanning 2002–2006, with NMD ranging from 13 to 8,970) pooled into a nationally representative dataset (Appendix B).]

Performance Thresholds (Table 5)
• Percentile scores: 90th, 65th, 25th
• Item median scores are taken from the national dataset
• The mean of the item median scores is used to construct composite scores

Why the Beta Distribution?
The beta distribution has the probability density function

$$f(x; \alpha, \beta) = \frac{1}{B(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$$

[Figure: Density of mean scores for the item drexpln on a 0–100 scale, illustrating the skewed, ceiling-heavy shape that the beta distribution captures.]
(A minimal threshold-fitting sketch appears at the end of this deck.)

Range of Scores Observed Across Percentiles of the Beta Distribution
[Figure: Range of scores across datasets (y-axis, 0–50 points) at each beta-distributed percentile (x-axis, 10th–99th).]

Recommended Performance Thresholds for Core Composites & Items, C/G CAHPS®
Measure | 90th percentile | 65th percentile | 25th percentile | Risk of downward misclassification
Communication | 95 | 91 | 84 | 3.4%
  Explains things in a way that is easy to understand | 96 | 92 | 86 | 6.6%
  Listens carefully | 96 | 92 | 85 | 4.6%
  Provides clear instructions | 95 | 91 | 84 | 5.5%
  Knowledge of medical history | 94 | 88 | 79 | 4.0%
  Shows respect for what patient has to say | 98 | 94 | 90 | 7.4%
  Spends enough time with patients | 94 | 90 | 80 | 4.6%
Organizational Access | 91 | 81 | 69 | 3.7%
  Appointment for urgent care | 94 | 87 | 77 | 4.9%
  Appointment for routine care | 92 | 87 | 78 | 5.6%
  Getting a call back | 91 | 82 | 70 | 5.3%
  Waiting less than 15 minutes | 82 | 70 | 55 | 5.1%
  After-hours medical help/advice | 94 | 80 | 66 | 6.2%
Office Staff | 94 | 88 | 80 | 5.6%
  Staff treat patient with courtesy and respect | 92 | 85 | 76 | 6.7%
  Staff helpful | 95 | 90 | 83 | 1.4%

Consequences of "Missing Data"
What if there are insufficient data to score a particular provider on a particular metric?
• P4P: minimally problematic. Incentives can be tied to the measures for which sufficient data are available for a given provider.
• Public reporting: somewhat problematic. It must be conveyed to the public that missing data for a provider do not signify poor performance.
• Tiering: extremely problematic. Each provider must be "scored" on each metric in order to be placed in a tier.

Allocation of Explainable Variance: Organizational/Structural Features of Care
[Figure: Stacked bar chart (0–100%) for organizational access, visit-based continuity, and integration. The doctor accounts for roughly 39%, 36%, and 23% of the explainable variance respectively, the practice site for roughly 45%, 56%, and 77%, with network and plan accounting for the small remainder.]

Using Measures to Drive Transformation
• Public reporting
• P4P
• Tiering
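To make the beta-distribution thresholding concrete (the sketch promised above): fit beta shape parameters to provider-level scores by method of moments, then read candidate absolute thresholds off the fitted percentiles. The mean and SD below are hypothetical inputs, not the pooled-dataset estimates:

```python
from scipy import stats

# Hypothetical provider-level summary for one item, on a 0-1 scale
# (e.g., a 0-100 score divided by 100).
mean, sd = 0.89, 0.08
var = sd ** 2

# Method-of-moments estimates of the beta shape parameters.
common = mean * (1.0 - mean) / var - 1.0
a, b = mean * common, (1.0 - mean) * common

# Candidate absolute thresholds at the percentiles used in Table 5.
for pct in (25, 65, 90):
    threshold = 100.0 * stats.beta.ppf(pct / 100.0, a, b)
    print(f"{pct}th percentile threshold: {threshold:.1f}")
```

The method-of-moments fit needs only a mean and an SD, which is convenient when pooling datasets that report summary statistics, and the smooth fitted distribution gives stable estimates of the extreme percentiles used as thresholds.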