Preparing Measures for High Stakes Use:
Beyond Basic Psychometric Testing
___________________________________________________________________________
Dana Gelb Safran, Sc.D.
Senior Vice President
Performance Measurement & Improvement
Blue Cross Blue Shield of Massachusetts
Presented at:
Composite Measures in Health Care
AcademyHealth Annual Research Meeting
Chicago, IL
28 June 2009
Topics for Today:
• Preparing measures for high stakes use: the case of survey-based measures
• Basic psychometric testing of survey-based measures
• Beyond basic psychometric testing
  • Reliability and sample size requirements
  • Risk of misclassification
• A framework and principles for evaluating measures’ readiness for “high stakes” use
Measuring Patient Experiences with Individual Physicians and Their Practice Sites
• Survey-based measurement of patients’ experiences with individual physicians is not new
• What’s new: efforts to standardize the measures and use them for public reporting and pay-for-performance
• The IOM report Crossing the Quality Chasm gave “patient-centered care” a front-row seat
• Methods and metrics had been honed over 15 years of research
• But putting these measures to use raised many questions about feasibility, value, and “readiness for prime time”
Principal Questions of Statewide Pilots
• What sample size is needed for a highly reliable estimate of patients’ experiences with a physician?
• What is the risk of misclassification under varying reporting frameworks?
• Is there enough performance variability to justify measurement?
• How much of the measurement variance is accounted for by physicians, as opposed to other elements of the system (practice site, network organization, plan)?
Sampling Framework: Massachusetts
[Diagram: statewide sampling frame. Eastern MA (Tufts, BCBSMA, HPHC, Medicaid), Central MA (BCBSMA, Fallon, Medicaid), and Western MA (BCBSMA, HNE, Medicaid), spanning six physician network organizations (PNO1-PNO6) with 34, 23, and 10 practice sites and 143, 35, and 37 physicians, respectively. Some regions sampled both commercially insured and Medicaid patients; others sampled only commercially insured patients.]
Basic Psychometric Assessment (Adult PCP; N_GRP = 183, N_PT = 11,615)

Scale: Mean (SD) | Cronbach’s Alpha | Range of Item-Scale Correlations | Scaling Success (%)
Quality of MD-Patient Interactions (k=6): 89.4 (18.0) | 0.95 | 0.87-0.93 | 100
Health Promotion (k=2): 67.0 (38.9) | 0.84 | 0.93-0.93 | 100
Access (k=5): 74.0 (22.4) | 0.85 | 0.73-0.83 | 100
Coordination of Care (k=2): 78.5 (28.3) | 0.63 | 0.84-0.92 | 83.3
Office Staff (k=2): 83.4 (21.9) | 0.89 | 0.95-0.96 | 100
Basic Psychometric Assessment: Item-Scale Correlations (N = 11,615)

Item: Quality of MD-Pt Interaction | Access | Coordination of Care
Drexpln: 0.857* | 0.493 | 0.518
Drlistn: 0.893* | 0.485 | 0.541
Drinst1: 0.850* | 0.489 | 0.531
Comphist: 0.801* | 0.504 | 0.571
Interp1: 0.834* | 0.536 | 0.558
Interp5: 0.860* | 0.456 | 0.519
Incarsn: 0.447 | 0.686* | 0.418
Rgaptsn: 0.461 | 0.683* | 0.405
Wait15: 0.378 | 0.485* | 0.354
Calbck1: 0.500 | 0.652* | 0.505
Aftrhr3: 0.548 | 0.807* | 0.543
Integ7: 0.728 | 0.516 | 0.491*
(* marks each item’s correlation with its hypothesized scale. Integ7 correlates more strongly with Quality of MD-Pt Interaction than with its own Coordination scale, consistent with the 83.3% scaling success above.)
Basic Psychometric Assessment (2) (Adult PCP; N_GRP = 183, N_PT = 11,615)

Scale: % Floor | % Ceiling | % Missing | SD of Group Effect
Quality of MD-Patient Interactions: 0.3 | 48.7 | 1.9 | 3.2
Health Promotion: 10.7 | 41.4 | 2.4 | 4.0
Access: 0.4 | 12.6 | 7.8 | 5.7
Coordination of Care: 4.1 | 46.3 | 5.9 | 5.9
Office Staff: 0.8 | 46.0 | 2.3 | 4.1
Beyond Basic Psychometric Assessment (Adult PCP; N_GRP = 183, N_PT = 11,615)

Scale: SD of Group Effect | Observed group-level reliability | Estimated group-level reliability (n=200) | Minimum N required to achieve 0.70 group-level reliability
Quality of MD-Patient Interactions: 3.2 | 0.68 | 0.87 | 70
Health Promotion: 4.0 | 0.47 | 0.74 | 163
Access: 5.7 | 0.80 | 0.93 | 35
Coordination of Care: 5.9 | 0.73 | 0.90 | 53
Office Staff: 4.1 | 0.69 | 0.88 | 64
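The minimum-N column follows from the Spearman-Brown prophecy formula applied to the estimated reliability at n = 200. A minimal sketch in Python (the function name is illustrative; the slides do not show code):

```python
import math

def min_n_for_reliability(rel_at_n, n, target=0.70):
    """Spearman-Brown projection: smallest number of responses at which a
    measure with reliability `rel_at_n` on `n` responses reaches `target`."""
    factor = (target / (1 - target)) / (rel_at_n / (1 - rel_at_n))
    return math.ceil(n * factor)

# Quality of MD-Patient Interactions: estimated reliability 0.87 at n = 200
print(min_n_for_reliability(0.87, 200))  # -> 70, matching the table
```

(Small rounding differences against other rows are expected, since the slides presumably project from unrounded reliability estimates.)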
Beyond Basic Psychometric Assessment (2)

Measure: Mean (SD) | % Ceiling | % Missing | SD of Group Effect | Observed group-level reliability | Estimated group-level reliability (n=200) | Minimum N required to achieve 0.70 group-level reliability
Access (composite): 74.0 (22.4) | 12.6 | 7.8 | 5.7 | 0.80 | 0.93 | 35
Scheduling: illness or injury: 78.8 (26.6) | 47.9 | 11.5 | 6.6 | 0.79 | 0.93 | 36
Scheduling: check-up or routine: 82.9 (23.7) | 55.2 | 5.8 | 5.1 | 0.74 | 0.91 | 48
In-office wait: 61.6 (31.5) | 22.7 | 1.9 | 7.7 | 0.80 | 0.93 | 37
Call back: during office hours: 73.7 (29.5) | 41.0 | 22.0 | 6.3 | 0.70 | 0.91 | 79
Call back: after hours: 72.3 (33.7) | 46.5 | 67.1 | 6.6 | 0.45 | 0.89 | 60
Variance Components Models …
Influence of MDs and groups on scores follows a variance components model:

$$Y_{patient} = \mu + \alpha_{group} + \gamma_{doctor} + \beta X_{patient} + \varepsilon_{patient}$$

$$\alpha_{group} \sim (0, \sigma^2_{group}), \qquad \gamma_{doctor} \sim (0, \sigma^2_{doctor}), \qquad \varepsilon_{patient} \sim (0, \sigma^2_{patient})$$
… Support Reliability Calculation
Reliability is the percent of variance explained:

$$\rho_{group} = \frac{\sigma^2_{group}}{\sigma^2_{group} + \dfrac{\sigma^2_{doctor}}{N_{doctors\;in\;group}} + \dfrac{\sigma^2_{patients}}{N_{patients\;per\;group}}}$$

(σ²_group, the between-group variance, is sometimes written τ²_group.)
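A minimal sketch of this calculation in Python (the variance components below are illustrative placeholders, not values from the pilot):

```python
def group_level_reliability(var_group, var_doctor, var_patient,
                            n_doctors_in_group, n_patients_per_group):
    """rho_group: share of observed-score variance attributable to the group."""
    error_variance = (var_doctor / n_doctors_in_group
                      + var_patient / n_patients_per_group)
    return var_group / (var_group + error_variance)

# Illustrative components: group SD 5, doctor SD 8, patient SD 25,
# 10 doctors per group, 60 patient responses per group
print(round(group_level_reliability(25, 64, 625, 10, 60), 2))  # -> 0.6
```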
Sample Size Requirements for Varying Physician-Level Reliability Thresholds

Number of Responses per Physician Needed to Achieve Desired MD-Level Measurement Reliability

Measure: Reliability 0.7 | Reliability 0.8 | Reliability 0.95
ORGANIZATIONAL/STRUCTURAL FEATURES OF CARE
Organizational access: 23 | 39 | 185
Visit-based continuity: 13 | 22 | 103
Integration: 39 | 66 | 315
DOCTOR-PATIENT INTERACTIONS
Communication: 43 | 73 | 347
Whole-person orientation: 21 | 37 | 174
Health promotion: 45 | 77 | 366
Interpersonal treatment: 41 | 71 | 337
Patient trust: 36 | 61 | 290
Source: Safran et al. JGIM 2006; 21(1):13-21
What is the Risk of Misclassification?
• Not simply 1 − α_MD
• Depends on:
  • Measurement reliability (α_MD)
  • Number of cutpoints in the reporting framework
  • Proximity of score to the cutpoint
Probability of Misclassification
• Label a “cut point” c1
• Let Xd be the true score for a doctor
• Let X̄ be the average for his/her n patients
• Compute P(X̄ < c1 | Xd), treating X̄ as approximately normal with mean Xd
• Integrate over Xd in the interval c2 > Xd > c1
• Sum the results over all cut points
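A minimal sketch of the conditional step in Python. Mapping reliability to the standard error of the observed mean assumes the usual decomposition α_MD = σ²_true / (σ²_true + SE²); the default true-score SD of 5 points is an assumption, chosen because it reproduces the distance-from-benchmark risks tabulated later in the deck:

```python
from math import sqrt
from statistics import NormalDist

def misclassification_risk(true_score, cutpoint, reliability, sd_true=5.0):
    """P(the observed patient mean lands on the wrong side of the cutpoint,
    given the doctor's true score)."""
    se = sd_true * sqrt((1 - reliability) / reliability)
    p_below = NormalDist(mu=true_score, sigma=se).cdf(cutpoint)
    # Misclassification is "observed below" for a truly-above doctor, and
    # "observed at or above" for a doctor at or below the cutpoint.
    return p_below if true_score > cutpoint else 1 - p_below

# 1 point above the benchmark at alpha_MD = 0.70 -> ~38% (cf. the table below)
print(round(misclassification_risk(66, 65, 0.70), 2))  # -> 0.38
```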
[Figure: probability of misclassification under a five-category reporting framework (Substantially Below Average, Below Average, Average, Above Average, Substantially Above Average), with cutpoints at the 10th, 25th, 50th, 75th, and 90th percentiles (scores 52.9, 58.5, 64.6, 70.8, and 76.3 on a scale shown from roughly 38 to 100). At each cutpoint the conditional risk is 50% regardless of measure reliability (α_MD = 0.7, 0.8, 0.9); between cutpoints, risks commonly run from roughly 10% to 40%, falling with distance from the cutpoint and with higher reliability.]
[Figure: probability of misclassification under a three-category framework (Substantially Below Average, Average, Substantially Above Average), with cutpoints at the 10th and 90th percentiles (scores 52.9 and 76.3; 50th percentile = 64.6). Away from the cutpoints, risk is near zero at all reliability levels: on the order of 0.01% at α_MD = 0.9, 0.5-0.6% at 0.8, and 2.4% at 0.7.]
Buffers Around Cutpoints Alter Risk
• When Xd = c1, the conditional risk is 50%
• Buffer idea: report in the higher group if X̄ > c*, where c* < c1; c1 − c* is the “buffer” (see the sketch below)
• Buffers reduce the provider’s risk but are lenient
• Hypothesis tests (confidence intervals) accomplish a similar purpose
• But the same score can then land in different groups!
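As a hedged illustration of the buffer idea, extending the sketch above (the SE and buffer width are assumed values):

```python
from statistics import NormalDist

def risk_at_cutpoint_with_buffer(se, buffer):
    """Conditional risk for a doctor whose true score sits exactly at c1 when
    the report assigns the higher group whenever X-bar > c* = c1 - buffer."""
    # Center at the cutpoint: X-bar - c1 ~ Normal(0, se); risk = P(X-bar < c*).
    return NormalDist(mu=0.0, sigma=se).cdf(-buffer)

print(round(risk_at_cutpoint_with_buffer(se=3.27, buffer=0.0), 2))  # 0.5 (no buffer)
print(round(risk_at_cutpoint_with_buffer(se=3.27, buffer=2.0), 2))  # ~0.27
```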
Risk of Misclassification at Varying Distances from the Benchmark and Varying Measurement Reliability (α_MD)

Probability of misclassification (%) at varying thresholds of MD-level reliability:

MD Mean Score Distance from Benchmark (Points): α_MD = .70 | α_MD = .80 | α_MD = .90
1: 38.0 | 34.5 | 27.4
2: 27.1 | 21.2 | 11.5
3: 18.0 | 11.5 | 3.6
4: 11.1 | 5.5 | 0.8
5: 6.3 | 2.3 | 0.1
6: 3.3 | 0.8 | <0.001
7: 1.6 | 0.3 | <0.001
8: 0.7 | <0.001 | <0.001
9: 0.3 | <0.001 | <0.001
10: 0.1 | <0.001 | <0.001
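Each cell is a normal tail probability. For example, at 1 point from the benchmark with α_MD = 0.70 (assuming, as above, a true-score SD of 5 points):

$$SE = 5\sqrt{\frac{1 - 0.70}{0.70}} \approx 3.27, \qquad \Phi\!\left(\frac{-1}{3.27}\right) \approx 0.380 = 38.0\%$$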
Certainty and Uncertainty in Classification: Comparison with a Single Benchmark
[Figure: classification against a single benchmark (50th percentile = 65 on a 0-100 scale). An area of uncertainty surrounds the benchmark; scores outside it are significantly below or significantly above. The area narrows as reliability rises: width 6.3 points at α_MD = 0.7, 4.9 at 0.8, 3.26 at 0.9.]
Certainty and Uncertainty in Classification: Cutpoints at 10th & 90th Percentile
[Figure: classification into Bottom, Middle, and Top tiers with cutpoints at the 10th and 90th percentiles (scores 53 and 76 on a 0-100 scale). Each cutpoint carries its own area of uncertainty, again 6.3 points wide at α_MD = 0.7, 4.9 at 0.8, and 3.26 at 0.9.]
Risk of Misclassification: Summary
• Even “highly reliable” scores, with everything done correctly, can translate into high risk of misclassification if reported or used in ways that over-differentiate
• The risk is not simply 1 − α_MD
• It depends on:
  • Measurement reliability (α_MD)
  • Proximity of score to the cutpoint
  • Number of cutpoints in the reporting framework
[Figure: with cutpoints at the 15th, 50th, and 85th percentiles, the conditional risk of misclassification is roughly ½ at each cutpoint.]
Source: Safran et al. JGIM 2006; 21(1):13-21
Allocation of Explainable Variance: Doctor-Patient Interactions
[Figure: stacked bars (0-100%) partitioning explainable variance among doctor, site, network, and plan. For the doctor-patient interaction measures, the doctor accounts for the large majority of explainable variance (62, 70, 74, 77, and 84% across the five measures), with the remaining 16-38% attributable to site, network, and plan.]
Guiding Principles in Selecting “High Stakes” Measures
• Measures will be focused on safety, effectiveness, patient-centeredness, and affordability
• Wherever possible, our measures should be drawn from nationally accepted standard measure sets
• The measure must reflect something that is broadly accepted as clinically important
• There must be empirical evidence that the measure provides stable and reliable information at the level at which it will be reported (i.e., individual, site, group, or institution) with available sample sizes and data sources
• There must be sufficient variability on the measure across providers (or at the level at which data will be reported) to merit attention
• There must be empirical evidence that the level of the system that will be held accountable (clinician, site, group, institution) accounts for a large portion of the system-level variance in the measure
• Providers should be exposed to information about the development and validation of the measures and given the opportunity to view their own performance, ideally for one measurement cycle, before the data are used for “high stakes” purposes
Staged Development & Use of Performance Measures
[Timeline]
Phase I (Time 0): Development & Testing
Phase II (Time 1): Initial Large-Scale Implementation: initial measure implementation; final measure validation/testing; stakeholder buy-in; initial QI cycle
Phase III: Implementing Measures for “High Stakes” Purposes: P4P, public reporting, tiering
Summary
• Without measurement, we don’t know where we are on the path
• But imprecise measurement used in “high stakes” ways undermines improvement efforts
• Getting to “high stakes” measurement with reliable, valid indicators does not have to take long
• Ascertaining the sample sizes required for stable, reliable measurement is a key step
• With a 3-level performance framework, the risk of misclassification is low
  • Except at performance cutpoints, where risk is high irrespective of measurement reliability
• Disciplined measure development and testing, together with strong guiding principles for selection of high stakes measures, allow use of performance measures for accountability and improvement
For More Information:
___________________________________________________________________________
dana.safran@bcbsma.com
Pooling Multiple Datasets to Evaluate the Potential for Absolute Performance Targets
[Diagram: multiple survey datasets pooled into a nationally representative dataset (Appendix B), including MHQP 2002 (N_MD=183), PBGH 2003 (N_MD=216), HVMA 2004 (N_MD=323), ABIM 2005, CMS 2005, UMA, and UMISS, with physician counts per dataset ranging from N_MD=13 to N_MD=8,970 across 2002-2006.]
Performance Thresholds (Table 5)
[Diagram: item median scores from the national dataset are ranked; percentile scores (90th, 65th, 25th) serve as performance thresholds, and the mean of item median scores is used to construct composite scores.]
Why the Beta Distribution?
The beta distribution has the probability density function

$$f(x;\, \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$$

[Figure: density of (mean) drexpln scores over the 0-100 range, illustrating the skewed, bounded shape that the beta distribution captures.]
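A minimal sketch in Python of how one might fit a beta distribution to bounded 0-100 scores and read off percentile thresholds (the data are simulated and the fitting routine is an assumption; the slides do not specify one):

```python
import numpy as np
from scipy import stats

# Simulated stand-in for (mean) drexpln scores on a 0-100 scale
rng = np.random.default_rng(0)
scores = rng.beta(8, 2, size=1000) * 100

# Rescale to the open interval (0, 1) and fit with the support held fixed
x = np.clip(scores / 100, 1e-6, 1 - 1e-6)
a, b, loc, scale = stats.beta.fit(x, floc=0, fscale=1)

# Candidate performance thresholds (25th, 65th, 90th) on the original scale
print(stats.beta.ppf([0.25, 0.65, 0.90], a, b) * 100)
```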
Range of Scores Observed Across Percentiles of the Beta Distribution
[Figure: range of scores across datasets (y-axis, 0-50 points) at each beta-distributed percentile (x-axis, from the 99th down to the 10th percentile).]
Recommended Performance Thresholds for Core Composites & Items, C/G CAHPS®

Measure: 90th percentile | 65th percentile | 25th percentile | Risk of downward misclassification
Communication (composite): 95 | 91 | 84 | 3.4%
  Explains things in a way that is easy to understand: 96 | 92 | 86 | 6.6%
  Listens carefully: 96 | 92 | 85 | 4.6%
  Provides clear instructions: 95 | 91 | 84 | 5.5%
  Knowledge of medical history: 94 | 88 | 79 | 4.0%
  Shows respect for what patient has to say: 98 | 94 | 90 | 7.4%
  Spends enough time with patients: 94 | 90 | 80 | 4.6%
Organizational Access (composite): 91 | 81 | 69 | 3.7%
  Appointment for urgent care: 94 | 87 | 77 | 4.9%
  Appointment for routine care: 92 | 87 | 78 | 5.6%
  Getting a call back: 91 | 82 | 70 | 5.3%
  Waiting less than 15 minutes: 82 | 70 | 55 | 5.1%
  After hours medical help/advice: 94 | 80 | 66 | 6.2%
Office Staff (composite): 94 | 88 | 80 | 5.6%
  Staff treat patient with courtesy and respect: 92 | 85 | 76 | 6.7%
  Staff helpful: 95 | 90 | 83 | 1.4%
Consequences of “Missing Data”
What if there are insufficient data to score a particular provider on a particular metric?
• P4P: Minimally problematic. Incentives can be tied to the measures for which sufficient data are available for a given provider.
• Public Reporting: Somewhat problematic. Need to convey to the public that missing data for a provider do not signify poor performance.
• Tiering: Extremely problematic. Each provider must be “scored” on each metric in order to be placed in a tier.
Allocation of Explainable Variance: Organizational/Structural Features of Care
[Figure: stacked bars (0-100%) partitioning explainable variance among doctor, site, network, and plan for Organizational Access, Visit-based Continuity, and Integration. In contrast to doctor-patient interactions, the doctor accounts for a minority of explainable variance (roughly 23-39%), with the practice site accounting for most of the remainder (roughly 45-77%) and network/plan the small balance.]
Using Measures to Drive Transformation
• Public Reporting
• P4P
• Tiering